feat: add TAESD implementation - faster autoencoder (#88)

* add taesd implementation
* taesd gpu offloading
* show seed when generating image with -s -1
* less restrictive with larger images
* cuda: im2col speedup x2
* cuda: group norm speedup x90
* quantized models now works in cuda :)
* fix cal mem size

---------

Co-authored-by: leejet <leejet714@gmail.com>
2023-12-05 09:40:03 -05:00 · 134883aec4
commit 134883aec4
parent f99bcd1f76
14 changed files with 908 additions and 46904 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -8,6 +8,7 @@ test/
 *.bin
 *.exe
 *.gguf
+output*.png
+models*
+!taesd-model.gguf
 *.log
-output.png
-models/
--- a/README.md
+++ b/README.md
@@ -9,22 +9,23 @@ Inference of [Stable Diffusion](https://github.com/CompVis/stable-diffusion) in
 ## Features

 - Plain C/C++ implementation based on [ggml](https://github.com/ggerganov/ggml), working in the same way as [llama.cpp](https://github.com/ggerganov/llama.cpp)
-- Super lightweight and without external dependencies.
+- Super lightweight and without external dependencies
 - SD1.x and SD2.x support
 - 16-bit, 32-bit float support
 - 4-bit, 5-bit and 8-bit integer quantization support
 - Accelerated memory-efficient CPU inference
    - Only requires ~2.3GB when using txt2img with fp16 precision to generate a 512x512 image, enabling Flash Attention just requires ~1.8GB.
 - AVX, AVX2 and AVX512 support for x86 architectures
-- Full CUDA backend for GPU acceleration, for now just for float16 and float32 models. There are some issues with quantized models and CUDA; it will be fixed in the future.
-- Can load ckpt, safetensors and diffusers models/checkpoints. Standalone VAEs models.
+- Full CUDA backend for GPU acceleration.
+- Can load ckpt, safetensors and diffusers models/checkpoints. Standalone VAEs models
    - No need to convert to `.ggml` or `.gguf` anymore!
-- Flash Attention for memory usage optimization (only cpu for now).
+- Flash Attention for memory usage optimization (only cpu for now)
 - Original `txt2img` and `img2img` mode
 - Negative prompt
 - [stable-diffusion-webui](https://github.com/AUTOMATIC1111/stable-diffusion-webui) style tokenizer (not all the features, only token weighting for now)
 - LoRA support, same as [stable-diffusion-webui](https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Features#lora)
 - Latent Consistency Models support (LCM/LCM-LoRA)
+- Faster and memory efficient latent decoding with [TAESD](https://github.com/madebyollin/taesd)
 - Sampling method
    - `Euler A`
    - `Euler`
@@ -47,9 +48,10 @@ Inference of [Stable Diffusion](https://github.com/CompVis/stable-diffusion) in
 - [ ] More sampling methods
 - [ ] Make inference faster
    - The current implementation of ggml_conv_2d is slow and has high memory usage
+    - Implement Winograd Convolution 2D for 3x3 kernel filtering
 - [ ] Continuing to reduce memory usage (quantizing the weights of ggml_conv_2d)
 - [ ] Implement BPE Tokenizer
-- [ ] Add [TAESD](https://github.com/madebyollin/taesd) for faster VAE decoding
+- [ ] Implement [Real-ESRGAN](https://github.com/xinntao/Real-ESRGAN/tree/master) upscaler
 - [ ] k-quants support

 ## Usage
@@ -122,7 +124,7 @@ cmake --build . --config Release
 ### Run

 ```
-usage: ./bin/sd [arguments]
+usage: sd [arguments]

 arguments:
  -h, --help                         show this help message and exit
@@ -131,8 +133,10 @@ arguments:
                                     If threads <= 0, then threads will be set to the number of CPU physical cores
  -m, --model [MODEL]                path to model
  --vae [VAE]                        path to vae
+  --taesd [TAESD_PATH]               path to taesd. Using Tiny AutoEncoder for fast decoding (low quality)
  --type [TYPE]                      weight type (f32, f16, q4_0, q4_1, q5_0, q5_1, q8_0)
-                                     If not specified, the default is the type of the weight file.  --lora-model-dir [DIR]             lora model directory  
+                                     If not specified, the default is the type of the weight file.
+  --lora-model-dir [DIR]             lora model directory
  -i, --init-img [IMAGE]             path to the input image, required by img2img
  -o, --output OUTPUT                path to write result image to (default: ./output.png)
  -p, --prompt [PROMPT]              the prompt to render
@@ -218,6 +222,23 @@ Here's a simple example:
 | ----  |----    |
 | ![](./assets/without_lcm.png) |![](./assets/with_lcm.png)  |

+## Using TAESD for faster decoding
+
+You can use TAESD to accelerate the decoding of latent images by following these steps:
+
+- Download the model [weights](https://huggingface.co/madebyollin/taesd/blob/main/diffusion_pytorch_model.safetensors).
+
+Or download it with curl:
+
+```bash
+curl -L -O https://huggingface.co/madebyollin/taesd/resolve/main/diffusion_pytorch_model.safetensors
+```
+
+- Specify the model path using the `--taesd PATH` parameter. For example:
+
+```bash
+sd -m ../models/v1-5-pruned-emaonly.safetensors -p "a lovely cat" --taesd ../models/diffusion_pytorch_model.safetensors
+```

 ### Docker

--- a/common/json.hpp
+++ b/common/json.hpp
--- a/common/miniz.h
+++ b/common/miniz.h
--- a/common/stb_image.h
+++ b/common/stb_image.h
--- a/common/stb_image_write.h
+++ b/common/stb_image_write.h
--- a/common/zip.c
+++ b/common/zip.c
--- a/common/zip.h
+++ b/common/zip.h
@@ -1,509 +0,0 @@
-/*
- * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
- * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
- * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
- * IN NO EVENT SHALL THE AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR
- * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
- * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
- * OTHER DEALINGS IN THE SOFTWARE.
- */
-
-#pragma once
-#ifndef ZIP_H
-#define ZIP_H
-
-#include <stdint.h>
-#include <string.h>
-#include <sys/types.h>
-
-#ifndef ZIP_SHARED
-#define ZIP_EXPORT
-#else
-#ifdef _WIN32
-#ifdef ZIP_BUILD_SHARED
-#define ZIP_EXPORT __declspec(dllexport)
-#else
-#define ZIP_EXPORT __declspec(dllimport)
-#endif
-#else
-#define ZIP_EXPORT __attribute__((visibility("default")))
-#endif
-#endif
-
-#ifdef __cplusplus
-extern "C" {
-#endif
-
-#if !defined(_POSIX_C_SOURCE) && defined(_MSC_VER)
-// 64-bit Windows is the only mainstream platform
-// where sizeof(long) != sizeof(void*)
-#ifdef _WIN64
-typedef long long ssize_t; /* byte count or error */
-#else
-typedef long ssize_t; /* byte count or error */
-#endif
-#endif
-
-/**
- * @mainpage
- *
- * Documentation for @ref zip.
- */
-
-/**
- * @addtogroup zip
- * @{
- */
-
-/**
- * Default zip compression level.
- */
-#define ZIP_DEFAULT_COMPRESSION_LEVEL 6
-
-/**
- * Error codes
- */
-#define ZIP_ENOINIT -1      // not initialized
-#define ZIP_EINVENTNAME -2  // invalid entry name
-#define ZIP_ENOENT -3       // entry not found
-#define ZIP_EINVMODE -4     // invalid zip mode
-#define ZIP_EINVLVL -5      // invalid compression level
-#define ZIP_ENOSUP64 -6     // no zip 64 support
-#define ZIP_EMEMSET -7      // memset error
-#define ZIP_EWRTENT -8      // cannot write data to entry
-#define ZIP_ETDEFLINIT -9   // cannot initialize tdefl compressor
-#define ZIP_EINVIDX -10     // invalid index
-#define ZIP_ENOHDR -11      // header not found
-#define ZIP_ETDEFLBUF -12   // cannot flush tdefl buffer
-#define ZIP_ECRTHDR -13     // cannot create entry header
-#define ZIP_EWRTHDR -14     // cannot write entry header
-#define ZIP_EWRTDIR -15     // cannot write to central dir
-#define ZIP_EOPNFILE -16    // cannot open file
-#define ZIP_EINVENTTYPE -17 // invalid entry type
-#define ZIP_EMEMNOALLOC -18 // extracting data using no memory allocation
-#define ZIP_ENOFILE -19     // file not found
-#define ZIP_ENOPERM -20     // no permission
-#define ZIP_EOOMEM -21      // out of memory
-#define ZIP_EINVZIPNAME -22 // invalid zip archive name
-#define ZIP_EMKDIR -23      // make dir error
-#define ZIP_ESYMLINK -24    // symlink error
-#define ZIP_ECLSZIP -25     // close archive error
-#define ZIP_ECAPSIZE -26    // capacity size too small
-#define ZIP_EFSEEK -27      // fseek error
-#define ZIP_EFREAD -28      // fread error
-#define ZIP_EFWRITE -29     // fwrite error
-#define ZIP_ERINIT -30      // cannot initialize reader
-#define ZIP_EWINIT -31      // cannot initialize writer
-#define ZIP_EWRINIT -32     // cannot initialize writer from reader
-
-/**
- * Looks up the error message string corresponding to an error number.
- * @param errnum error number
- * @return error message string corresponding to errnum or NULL if error is not
- * found.
- */
-extern ZIP_EXPORT const char *zip_strerror(int errnum);
-
-/**
- * @struct zip_t
- *
- * This data structure is used throughout the library to represent zip archive -
- * forward declaration.
- */
-struct zip_t;
-
-/**
- * Opens zip archive with compression level using the given mode.
- *
- * @param zipname zip archive file name.
- * @param level compression level (0-9 are the standard zlib-style levels).
- * @param mode file access mode.
- *        - 'r': opens a file for reading/extracting (the file must exists).
- *        - 'w': creates an empty file for writing.
- *        - 'a': appends to an existing archive.
- *
- * @return the zip archive handler or NULL on error
- */
-extern ZIP_EXPORT struct zip_t *zip_open(const char *zipname, int level,
-                                         char mode);
-
-/**
- * Opens zip archive with compression level using the given mode.
- * The function additionally returns @param errnum -
- *
- * @param zipname zip archive file name.
- * @param level compression level (0-9 are the standard zlib-style levels).
- * @param mode file access mode.
- *        - 'r': opens a file for reading/extracting (the file must exists).
- *        - 'w': creates an empty file for writing.
- *        - 'a': appends to an existing archive.
- * @param errnum 0 on success, negative number (< 0) on error.
- *
- * @return the zip archive handler or NULL on error
- */
-extern ZIP_EXPORT struct zip_t *
-zip_openwitherror(const char *zipname, int level, char mode, int *errnum);
-
-/**
- * Closes the zip archive, releases resources - always finalize.
- *
- * @param zip zip archive handler.
- */
-extern ZIP_EXPORT void zip_close(struct zip_t *zip);
-
-/**
- * Determines if the archive has a zip64 end of central directory headers.
- *
- * @param zip zip archive handler.
- *
- * @return the return code - 1 (true), 0 (false), negative number (< 0) on
- *         error.
- */
-extern ZIP_EXPORT int zip_is64(struct zip_t *zip);
-
-/**
- * Opens an entry by name in the zip archive.
- *
- * For zip archive opened in 'w' or 'a' mode the function will append
- * a new entry. In readonly mode the function tries to locate the entry
- * in global dictionary.
- *
- * @param zip zip archive handler.
- * @param entryname an entry name in local dictionary.
- *
- * @return the return code - 0 on success, negative number (< 0) on error.
- */
-extern ZIP_EXPORT int zip_entry_open(struct zip_t *zip, const char *entryname);
-
-/**
- * Opens an entry by name in the zip archive.
- *
- * For zip archive opened in 'w' or 'a' mode the function will append
- * a new entry. In readonly mode the function tries to locate the entry
- * in global dictionary (case sensitive).
- *
- * @param zip zip archive handler.
- * @param entryname an entry name in local dictionary (case sensitive).
- *
- * @return the return code - 0 on success, negative number (< 0) on error.
- */
-extern ZIP_EXPORT int zip_entry_opencasesensitive(struct zip_t *zip,
-                                                  const char *entryname);
-
-/**
- * Opens a new entry by index in the zip archive.
- *
- * This function is only valid if zip archive was opened in 'r' (readonly) mode.
- *
- * @param zip zip archive handler.
- * @param index index in local dictionary.
- *
- * @return the return code - 0 on success, negative number (< 0) on error.
- */
-extern ZIP_EXPORT int zip_entry_openbyindex(struct zip_t *zip, size_t index);
-
-/**
- * Closes a zip entry, flushes buffer and releases resources.
- *
- * @param zip zip archive handler.
- *
- * @return the return code - 0 on success, negative number (< 0) on error.
- */
-extern ZIP_EXPORT int zip_entry_close(struct zip_t *zip);
-
-/**
- * Returns a local name of the current zip entry.
- *
- * The main difference between user's entry name and local entry name
- * is optional relative path.
- * Following .ZIP File Format Specification - the path stored MUST not contain
- * a drive or device letter, or a leading slash.
- * All slashes MUST be forward slashes '/' as opposed to backwards slashes '\'
- * for compatibility with Amiga and UNIX file systems etc.
- *
- * @param zip: zip archive handler.
- *
- * @return the pointer to the current zip entry name, or NULL on error.
- */
-extern ZIP_EXPORT const char *zip_entry_name(struct zip_t *zip);
-
-/**
- * Returns an index of the current zip entry.
- *
- * @param zip zip archive handler.
- *
- * @return the index on success, negative number (< 0) on error.
- */
-extern ZIP_EXPORT ssize_t zip_entry_index(struct zip_t *zip);
-
-/**
- * Determines if the current zip entry is a directory entry.
- *
- * @param zip zip archive handler.
- *
- * @return the return code - 1 (true), 0 (false), negative number (< 0) on
- *         error.
- */
-extern ZIP_EXPORT int zip_entry_isdir(struct zip_t *zip);
-
-/**
- * Returns the uncompressed size of the current zip entry.
- * Alias for zip_entry_uncomp_size (for backward compatibility).
- *
- * @param zip zip archive handler.
- *
- * @return the uncompressed size in bytes.
- */
-extern ZIP_EXPORT unsigned long long zip_entry_size(struct zip_t *zip);
-
-/**
- * Returns the uncompressed size of the current zip entry.
- *
- * @param zip zip archive handler.
- *
- * @return the uncompressed size in bytes.
- */
-extern ZIP_EXPORT unsigned long long zip_entry_uncomp_size(struct zip_t *zip);
-
-/**
- * Returns the compressed size of the current zip entry.
- *
- * @param zip zip archive handler.
- *
- * @return the compressed size in bytes.
- */
-extern ZIP_EXPORT unsigned long long zip_entry_comp_size(struct zip_t *zip);
-
-/**
- * Returns CRC-32 checksum of the current zip entry.
- *
- * @param zip zip archive handler.
- *
- * @return the CRC-32 checksum.
- */
-extern ZIP_EXPORT unsigned int zip_entry_crc32(struct zip_t *zip);
-
-/**
- * Compresses an input buffer for the current zip entry.
- *
- * @param zip zip archive handler.
- * @param buf input buffer.
- * @param bufsize input buffer size (in bytes).
- *
- * @return the return code - 0 on success, negative number (< 0) on error.
- */
-extern ZIP_EXPORT int zip_entry_write(struct zip_t *zip, const void *buf,
-                                      size_t bufsize);
-
-/**
- * Compresses a file for the current zip entry.
- *
- * @param zip zip archive handler.
- * @param filename input file.
- *
- * @return the return code - 0 on success, negative number (< 0) on error.
- */
-extern ZIP_EXPORT int zip_entry_fwrite(struct zip_t *zip, const char *filename);
-
-/**
- * Extracts the current zip entry into output buffer.
- *
- * The function allocates sufficient memory for a output buffer.
- *
- * @param zip zip archive handler.
- * @param buf output buffer.
- * @param bufsize output buffer size (in bytes).
- *
- * @note remember to release memory allocated for a output buffer.
- *       for large entries, please take a look at zip_entry_extract function.
- *
- * @return the return code - the number of bytes actually read on success.
- *         Otherwise a negative number (< 0) on error.
- */
-extern ZIP_EXPORT ssize_t zip_entry_read(struct zip_t *zip, void **buf,
-                                         size_t *bufsize);
-
-/**
- * Extracts the current zip entry into a memory buffer using no memory
- * allocation.
- *
- * @param zip zip archive handler.
- * @param buf preallocated output buffer.
- * @param bufsize output buffer size (in bytes).
- *
- * @note ensure supplied output buffer is large enough.
- *       zip_entry_size function (returns uncompressed size for the current
- *       entry) can be handy to estimate how big buffer is needed.
- *       For large entries, please take a look at zip_entry_extract function.
- *
- * @return the return code - the number of bytes actually read on success.
- *         Otherwise a negative number (< 0) on error (e.g. bufsize is not large
- * enough).
- */
-extern ZIP_EXPORT ssize_t zip_entry_noallocread(struct zip_t *zip, void *buf,
-                                                size_t bufsize);
-
-/**
- * Extracts the current zip entry into output file.
- *
- * @param zip zip archive handler.
- * @param filename output file.
- *
- * @return the return code - 0 on success, negative number (< 0) on error.
- */
-extern ZIP_EXPORT int zip_entry_fread(struct zip_t *zip, const char *filename);
-
-/**
- * Extracts the current zip entry using a callback function (on_extract).
- *
- * @param zip zip archive handler.
- * @param on_extract callback function.
- * @param arg opaque pointer (optional argument, which you can pass to the
- *        on_extract callback)
- *
- * @return the return code - 0 on success, negative number (< 0) on error.
- */
-extern ZIP_EXPORT int
-zip_entry_extract(struct zip_t *zip,
-                  size_t (*on_extract)(void *arg, uint64_t offset,
-                                       const void *data, size_t size),
-                  void *arg);
-
-/**
- * Returns the number of all entries (files and directories) in the zip archive.
- *
- * @param zip zip archive handler.
- *
- * @return the return code - the number of entries on success, negative number
- *         (< 0) on error.
- */
-extern ZIP_EXPORT ssize_t zip_entries_total(struct zip_t *zip);
-
-/**
- * Deletes zip archive entries.
- *
- * @param zip zip archive handler.
- * @param entries array of zip archive entries to be deleted.
- * @param len the number of entries to be deleted.
- * @return the number of deleted entries, or negative number (< 0) on error.
- */
-extern ZIP_EXPORT ssize_t zip_entries_delete(struct zip_t *zip,
-                                             char *const entries[], size_t len);
-
-/**
- * Extracts a zip archive stream into directory.
- *
- * If on_extract is not NULL, the callback will be called after
- * successfully extracted each zip entry.
- * Returning a negative value from the callback will cause abort and return an
- * error. The last argument (void *arg) is optional, which you can use to pass
- * data to the on_extract callback.
- *
- * @param stream zip archive stream.
- * @param size stream size.
- * @param dir output directory.
- * @param on_extract on extract callback.
- * @param arg opaque pointer.
- *
- * @return the return code - 0 on success, negative number (< 0) on error.
- */
-extern ZIP_EXPORT int
-zip_stream_extract(const char *stream, size_t size, const char *dir,
-                   int (*on_extract)(const char *filename, void *arg),
-                   void *arg);
-
-/**
- * Opens zip archive stream into memory.
- *
- * @param stream zip archive stream.
- * @param size stream size.
- * @param level compression level (0-9 are the standard zlib-style levels).
- * @param mode file access mode.
- *        - 'r': opens a file for reading/extracting (the file must exists).
- *        - 'w': creates an empty file for writing.
- *        - 'a': appends to an existing archive.
- *
- * @return the zip archive handler or NULL on error
- */
-extern ZIP_EXPORT struct zip_t *zip_stream_open(const char *stream, size_t size,
-                                                int level, char mode);
-
-/**
- * Opens zip archive stream into memory.
- * The function additionally returns @param errnum -
- *
- * @param stream zip archive stream.
- * @param size stream size.*
- * @param level compression level (0-9 are the standard zlib-style levels).
- * @param mode file access mode.
- *        - 'r': opens a file for reading/extracting (the file must exists).
- *        - 'w': creates an empty file for writing.
- *        - 'a': appends to an existing archive.
- * @param errnum 0 on success, negative number (< 0) on error.
- *
- * @return the zip archive handler or NULL on error
- */
-extern ZIP_EXPORT struct zip_t *zip_stream_openwitherror(const char *stream,
-                                                         size_t size, int level,
-                                                         char mode,
-                                                         int *errnum);
-
-/**
- * Copy zip archive stream output buffer.
- *
- * @param zip zip archive handler.
- * @param buf output buffer. User should free buf.
- * @param bufsize output buffer size (in bytes).
- *
- * @return copy size
- */
-extern ZIP_EXPORT ssize_t zip_stream_copy(struct zip_t *zip, void **buf,
-                                          size_t *bufsize);
-
-/**
- * Close zip archive releases resources.
- *
- * @param zip zip archive handler.
- *
- * @return
- */
-extern ZIP_EXPORT void zip_stream_close(struct zip_t *zip);
-
-/**
- * Creates a new archive and puts files into a single zip archive.
- *
- * @param zipname zip archive file.
- * @param filenames input files.
- * @param len: number of input files.
- *
- * @return the return code - 0 on success, negative number (< 0) on error.
- */
-extern ZIP_EXPORT int zip_create(const char *zipname, const char *filenames[],
-                                 size_t len);
-
-/**
- * Extracts a zip archive file into directory.
- *
- * If on_extract_entry is not NULL, the callback will be called after
- * successfully extracted each zip entry.
- * Returning a negative value from the callback will cause abort and return an
- * error. The last argument (void *arg) is optional, which you can use to pass
- * data to the on_extract_entry callback.
- *
- * @param zipname zip archive file.
- * @param dir output directory.
- * @param on_extract_entry on extract callback.
- * @param arg opaque pointer.
- *
- * @return the return code - 0 on success, negative number (< 0) on error.
- */
-extern ZIP_EXPORT int zip_extract(const char *zipname, const char *dir,
-                                  int (*on_extract_entry)(const char *filename,
-                                                          void *arg),
-                                  void *arg);
-/** @} */
-#ifdef __cplusplus
-}
-#endif
-
-#endif
--- a/examples/cli/main.cpp
+++ b/examples/cli/main.cpp
@@ -58,6 +58,7 @@ struct SDParams {

    std::string model_path;
    std::string vae_path;
+    std::string taesd_path;
    ggml_type wtype = GGML_TYPE_COUNT;
    std::string lora_model_dir;
    std::string output_path = "output.png";
@@ -86,6 +87,7 @@ void print_params(SDParams params) {
    printf("    model_path:        %s\n", params.model_path.c_str());
    printf("    wtype:             %s\n", params.wtype < GGML_TYPE_COUNT ? ggml_type_name(params.wtype) : "unspecified");
    printf("    vae_path:          %s\n", params.vae_path.c_str());
+    printf("    taesd_path:        %s\n", params.taesd_path.c_str());
    printf("    output_path:       %s\n", params.output_path.c_str());
    printf("    init_img:          %s\n", params.input_path.c_str());
    printf("    prompt:            %s\n", params.prompt.c_str());
@@ -112,8 +114,9 @@ void print_usage(int argc, const char* argv[]) {
    printf("                                     If threads <= 0, then threads will be set to the number of CPU physical cores\n");
    printf("  -m, --model [MODEL]                path to model\n");
    printf("  --vae [VAE]                        path to vae\n");
+    printf("  --taesd [TAESD_PATH]               path to taesd. Using Tiny AutoEncoder for fast decoding (low quality)\n");
    printf("  --type [TYPE]                      weight type (f32, f16, q4_0, q4_1, q5_0, q5_1, q8_0)\n");
-    printf("                                     If not specified, the default is the type of the weight file.");
+    printf("                                     If not specified, the default is the type of the weight file.\n");
    printf("  --lora-model-dir [DIR]             lora model directory\n");
    printf("  -i, --init-img [IMAGE]             path to the input image, required by img2img\n");
    printf("  -o, --output OUTPUT                path to write result image to (default: ./output.png)\n");
@@ -176,6 +179,12 @@ void parse_args(int argc, const char** argv, SDParams& params) {
                break;
            }
            params.vae_path = argv[i];
+        } else if (arg == "--taesd") {
+            if (++i >= argc) {
+                invalid_arg = true;
+                break;
+            }
+            params.taesd_path = argv[i];
        } else if (arg == "--type") {
            if (++i >= argc) {
                invalid_arg = true;
@@ -449,7 +458,8 @@ int main(int argc, const char* argv[]) {
        }
    }

-    StableDiffusion sd(params.n_threads, vae_decode_only, true, params.lora_model_dir, params.rng_type);
+    StableDiffusion sd(params.n_threads, vae_decode_only, params.taesd_path, true, params.lora_model_dir, params.rng_type);
+
    if (!sd.load_from_file(params.model_path, params.vae_path, params.wtype, params.schedule)) {
        return 1;
    }
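
Note on the `--taesd` branch in `parse_args`: it follows the same "option with one value" pattern as `--vae`. The sketch below isolates that pattern so the bounds check is easy to see. `CliParams` and `parse_cli` are hypothetical names used only for illustration; the real code sets `invalid_arg` and breaks out of the loop rather than returning directly.

```cpp
#include <string>

// Hypothetical, trimmed-down stand-in for SDParams (only the fields needed here).
struct CliParams {
    std::string vae_path;
    std::string taesd_path;
};

// Returns false when an option that requires a value is the last argument,
// or when an unknown option is encountered.
bool parse_cli(int argc, const char** argv, CliParams& params) {
    for (int i = 1; i < argc; i++) {
        std::string arg = argv[i];
        if (arg == "--vae") {
            if (++i >= argc) {  // advance to the value; fail if it is missing
                return false;
            }
            params.vae_path = argv[i];
        } else if (arg == "--taesd") {
            if (++i >= argc) {
                return false;
            }
            params.taesd_path = argv[i];
        } else {
            return false;
        }
    }
    return true;
}
```

Calling `parse_cli` with `{"sd", "--taesd", "taesd.safetensors"}` fills `taesd_path`, while `{"sd", "--taesd"}` is rejected because the value is missing.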
--- a/ggml
+++ b/ggml
@@ -1 +1 @@
-Subproject commit 03669ba9fdc5e0520e919e5c7e1b3a3359d28e59
+Subproject commit 70474c6890c015b53dc10a2300ae35246cc73589
--- a/model.cpp
+++ b/model.cpp
@@ -1296,7 +1296,7 @@ bool ModelLoader::load_tensors(on_new_tensor_cb_t on_new_tensor_cb) {
            if (backend == NULL || ggml_backend_is_cpu(backend)) {
                // for the CPU and Metal backend, we can copy directly into the tensor
                if (tensor_storage.type == dst_tensor->type) {
-                    GGML_ASSERT(ggml_nbytes(dst_tensor) == nbytes_to_read);
+                    GGML_ASSERT(ggml_nbytes(dst_tensor) == tensor_storage.nbytes());
                    read_data(tensor_storage, (char*)dst_tensor->data, nbytes_to_read);

                    if (tensor_storage.is_bf16) {
@@ -1349,16 +1349,23 @@ bool ModelLoader::load_tensors(on_new_tensor_cb_t on_new_tensor_cb) {
    return success;
 }

-int64_t ModelLoader::cal_mem_size() {
+int64_t ModelLoader::cal_mem_size(ggml_backend_t backend) {
+    size_t alignment = 128;
+    if (backend != NULL) {
+        alignment = ggml_backend_get_alignment(backend);
+    }
    int64_t mem_size = 0;
+    std::vector<TensorStorage> processed_tensor_storages;
    for (auto& tensor_storage : tensor_storages) {
        if (is_unused_tensor(tensor_storage.name)) {
            continue;
        }
-
-        mem_size += tensor_storage.nbytes();
-        mem_size += GGML_MEM_ALIGN * 2;  // for lora alphas
+        preprocess_tensor(tensor_storage, processed_tensor_storages);
    }

-    return mem_size + 10 * 1024 * 1024;
+    for (auto& tensor_storage : processed_tensor_storages) {
+        mem_size += tensor_storage.nbytes() + alignment;
+    }
+
+    return mem_size;
 }
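
Note on the new `cal_mem_size`: the estimate is now the sum of each processed tensor's byte size plus one alignment unit per tensor, replacing the old fixed 10 MiB margin. The sketch below reproduces only that arithmetic. `TensorInfo` and `estimate_mem_size` are hypothetical names, and the 128-byte default mirrors the fallback used in the diff when no backend is given. It assumes that the padding needed to align one tensor never exceeds one alignment unit, so this sum is a safe upper bound.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for TensorStorage: only the byte size matters here.
struct TensorInfo {
    int64_t nbytes;
};

// Upper bound on the buffer size: reserve `alignment` bytes of slack per
// tensor on top of the tensor data itself.
int64_t estimate_mem_size(const std::vector<TensorInfo>& tensors, size_t alignment = 128) {
    int64_t mem_size = 0;
    for (const auto& t : tensors) {
        mem_size += t.nbytes + static_cast<int64_t>(alignment);
    }
    return mem_size;
}
```

For two tensors of 100 and 300 bytes with the default alignment, the estimate is 100 + 128 + 300 + 128 = 656 bytes.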
--- a/model.h
+++ b/model.h
@@ -8,6 +8,7 @@
 #include <vector>

 #include "ggml/ggml.h"
+#include "ggml/ggml-backend.h"
 #include "json.hpp"
 #include "zip.h"

@@ -116,7 +117,7 @@ public:
    ggml_type get_sd_wtype();
    bool load_vocab(on_new_token_cb_t on_new_token_cb);
    bool load_tensors(on_new_tensor_cb_t on_new_tensor_cb);
-    int64_t cal_mem_size();
+    int64_t cal_mem_size(ggml_backend_t backend);
    ~ModelLoader() = default;
 };
 #endif  // __MODEL_H__
--- a/stable-diffusion.cpp
+++ b/stable-diffusion.cpp
--- a/stable-diffusion.h
+++ b/stable-diffusion.h
@@ -38,6 +38,7 @@ private:
 public:
    StableDiffusion(int n_threads                = -1,
                    bool vae_decode_only         = false,
+                    std::string taesd_path       = "",
                    bool free_params_immediately = false,
                    std::string lora_model_dir   = "",
                    RNGType rng_type             = STD_DEFAULT_RNG);
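
Note on the constructor change: `taesd_path` is inserted as the third positional parameter, ahead of `free_params_immediately`, so existing positional call sites have to be updated, as `examples/cli/main.cpp` is in this commit. The sketch below shows the shape of such a call. `Pipeline` is a hypothetical stand-in for `StableDiffusion`, and the rule that an empty path selects the regular VAE decoder is an assumption, since the `stable-diffusion.cpp` changes are not shown here.

```cpp
#include <string>
#include <utility>

// Hypothetical stand-in for StableDiffusion; only the leading constructor
// parameters mirror the declaration in the diff.
struct Pipeline {
    int n_threads;
    bool vae_decode_only;
    std::string taesd_path;
    bool free_params_immediately;

    Pipeline(int n_threads_                = -1,
             bool vae_decode_only_         = false,
             std::string taesd_path_       = "",
             bool free_params_immediately_ = false)
        : n_threads(n_threads_),
          vae_decode_only(vae_decode_only_),
          taesd_path(std::move(taesd_path_)),
          free_params_immediately(free_params_immediately_) {}

    // Assumption: an empty path means "no TAESD, use the regular VAE decoder".
    bool use_tiny_autoencoder() const { return !taesd_path.empty(); }
};
```

A caller that does not want TAESD passes an empty string in the third position, for example `Pipeline(4, true, "", true)`, so that `true` still binds to `free_params_immediately`.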