feat: add TAESD implementation - faster autoencoder (#88)

* add taesd implementation
* taesd gpu offloading
* show seed when generating image with -s -1
* less restrictive with larger images
* cuda: im2col speedup x2
* cuda: group norm speedup x90
* quantized models now works in cuda :)
* fix cal mem size

---------

Co-authored-by: leejet <leejet714@gmail.com>
2023-12-05 09:40:03 -05:00 · 134883aec4
commit 134883aec4
parent f99bcd1f76
14 changed files with 908 additions and 46904 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -8,6 +8,7 @@ test/
 *.bin
 *.exe
 *.gguf
 output*.png
 models*
 !taesd-model.gguf
 *.log
 output.png
 models/
--- a/README.md
+++ b/README.md
@@ -9,22 +9,23 @@ Inference of [Stable Diffusion](https://github.com/CompVis/stable-diffusion) in
 ## Features
 - Plain C/C++ implementation based on [ggml](https://github.com/ggerganov/ggml), working in the same way as [llama.cpp](https://github.com/ggerganov/llama.cpp)
- Super lightweight and without external dependencies.
+- Super lightweight and without external dependencies
 - SD1.x and SD2.x support
 - 16-bit, 32-bit float support
 - 4-bit, 5-bit and 8-bit integer quantization support
 - Accelerated memory-efficient CPU inference
    - Only requires ~2.3GB when using txt2img with fp16 precision to generate a 512x512 image, enabling Flash Attention just requires ~1.8GB.
 - AVX, AVX2 and AVX512 support for x86 architectures
- Full CUDA backend for GPU acceleration, for now just for float16 and float32 models. There are some issues with quantized models and CUDA; it will be fixed in the future.
+- Full CUDA backend for GPU acceleration.
- Can load ckpt, safetensors and diffusers models/checkpoints. Standalone VAEs models.
+- Can load ckpt, safetensors and diffusers models/checkpoints. Standalone VAEs models
    - No need to convert to `.ggml` or `.gguf` anymore!
- Flash Attention for memory usage optimization (only cpu for now).
+- Flash Attention for memory usage optimization (only cpu for now)
 - Original `txt2img` and `img2img` mode
 - Negative prompt
 - [stable-diffusion-webui](https://github.com/AUTOMATIC1111/stable-diffusion-webui) style tokenizer (not all the features, only token weighting for now)
 - LoRA support, same as [stable-diffusion-webui](https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Features#lora)
 - Latent Consistency Models support (LCM/LCM-LoRA)
 - Faster and memory efficient latent decoding with [TAESD](https://github.com/madebyollin/taesd)
 - Sampling method
    - `Euler A`
    - `Euler`
@@ -47,9 +48,10 @@ Inference of [Stable Diffusion](https://github.com/CompVis/stable-diffusion) in
 - [ ] More sampling methods
 - [ ] Make inference faster
    - The current implementation of ggml_conv_2d is slow and has high memory usage
    - Implement Winograd Convolution 2D for 3x3 kernel filtering
 - [ ] Continuing to reduce memory usage (quantizing the weights of ggml_conv_2d)
 - [ ] Implement BPE Tokenizer
- [ ] Add [TAESD](https://github.com/madebyollin/taesd) for faster VAE decoding
+- [ ] Implement [Real-ESRGAN](https://github.com/xinntao/Real-ESRGAN/tree/master) upscaler
 - [ ] k-quants support
 ## Usage
@@ -122,7 +124,7 @@ cmake --build . --config Release
 ### Run
 ```
-usage: ./bin/sd [arguments]
+usage: sd [arguments]
 arguments:
  -h, --help                         show this help message and exit
@@ -131,8 +133,10 @@ arguments:
                                     If threads <= 0, then threads will be set to the number of CPU physical cores
  -m, --model [MODEL]                path to model
  --vae [VAE]                        path to vae
  --taesd [TAESD_PATH]               path to taesd. Using Tiny AutoEncoder for fast decoding (low quality)
  --type [TYPE]                      weight type (f32, f16, q4_0, q4_1, q5_0, q5_1, q8_0)
-                                     If not specified, the default is the type of the weight file.  --lora-model-dir [DIR]             lora model directory  
+                                     If not specified, the default is the type of the weight file.
  --lora-model-dir [DIR]             lora model directory
  -i, --init-img [IMAGE]             path to the input image, required by img2img
  -o, --output OUTPUT                path to write result image to (default: ./output.png)
  -p, --prompt [PROMPT]              the prompt to render
@@ -218,6 +222,23 @@ Here's a simple example:
 | ----  |----    |
 | ![](./assets/without_lcm.png) |![](./assets/with_lcm.png)  |
 ## Using TAESD for faster decoding
 You can use TAESD to accelerate the decoding of latent images by following these steps:
 - Download the model [weights](https://huggingface.co/madebyollin/taesd/blob/main/diffusion_pytorch_model.safetensors).
 Or download them with curl:
 ```bash
 curl -L -O https://huggingface.co/madebyollin/taesd/resolve/main/diffusion_pytorch_model.safetensors
 ```
 - Specify the model path using the `--taesd PATH` parameter. For example:
 ```bash
 sd -m ../models/v1-5-pruned-emaonly.safetensors -p "a lovely cat" --taesd ../models/diffusion_pytorch_model.safetensors
 ```
 ### Docker
--- a/common/json.hpp
+++ b/common/json.hpp
--- a/common/miniz.h
+++ b/common/miniz.h
--- a/common/stb_image.h
+++ b/common/stb_image.h
--- a/common/stb_image_write.h
+++ b/common/stb_image_write.h
--- a/common/zip.c
+++ b/common/zip.c
--- a/common/zip.h
+++ b/common/zip.h
@@ -1,509 +0,0 @@
 /*
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
 * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
 * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
 * IN NO EVENT SHALL THE AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR
 * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
 * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
 * OTHER DEALINGS IN THE SOFTWARE.
 */
 #pragma once
 #ifndef ZIP_H
 #define ZIP_H
 #include <stdint.h>
 #include <string.h>
 #include <sys/types.h>
 #ifndef ZIP_SHARED
 #define ZIP_EXPORT
 #else
 #ifdef _WIN32
 #ifdef ZIP_BUILD_SHARED
 #define ZIP_EXPORT __declspec(dllexport)
 #else
 #define ZIP_EXPORT __declspec(dllimport)
 #endif
 #else
 #define ZIP_EXPORT __attribute__((visibility("default")))
 #endif
 #endif
 #ifdef __cplusplus
 extern "C" {
 #endif
 #if !defined(_POSIX_C_SOURCE) && defined(_MSC_VER)
 // 64-bit Windows is the only mainstream platform
 // where sizeof(long) != sizeof(void*)
 #ifdef _WIN64
 typedef long long ssize_t; /* byte count or error */
 #else
 typedef long ssize_t; /* byte count or error */
 #endif
 #endif
 /**
 * @mainpage
 *
 * Documentation for @ref zip.
 */
 /**
 * @addtogroup zip
 * @{
 */
 /**
 * Default zip compression level.
 */
 #define ZIP_DEFAULT_COMPRESSION_LEVEL 6
 /**
 * Error codes
 */
 #define ZIP_ENOINIT -1      // not initialized
 #define ZIP_EINVENTNAME -2  // invalid entry name
 #define ZIP_ENOENT -3       // entry not found
 #define ZIP_EINVMODE -4     // invalid zip mode
 #define ZIP_EINVLVL -5      // invalid compression level
 #define ZIP_ENOSUP64 -6     // no zip 64 support
 #define ZIP_EMEMSET -7      // memset error
 #define ZIP_EWRTENT -8      // cannot write data to entry
 #define ZIP_ETDEFLINIT -9   // cannot initialize tdefl compressor
 #define ZIP_EINVIDX -10     // invalid index
 #define ZIP_ENOHDR -11      // header not found
 #define ZIP_ETDEFLBUF -12   // cannot flush tdefl buffer
 #define ZIP_ECRTHDR -13     // cannot create entry header
 #define ZIP_EWRTHDR -14     // cannot write entry header
 #define ZIP_EWRTDIR -15     // cannot write to central dir
 #define ZIP_EOPNFILE -16    // cannot open file
 #define ZIP_EINVENTTYPE -17 // invalid entry type
 #define ZIP_EMEMNOALLOC -18 // extracting data using no memory allocation
 #define ZIP_ENOFILE -19     // file not found
 #define ZIP_ENOPERM -20     // no permission
 #define ZIP_EOOMEM -21      // out of memory
 #define ZIP_EINVZIPNAME -22 // invalid zip archive name
 #define ZIP_EMKDIR -23      // make dir error
 #define ZIP_ESYMLINK -24    // symlink error
 #define ZIP_ECLSZIP -25     // close archive error
 #define ZIP_ECAPSIZE -26    // capacity size too small
 #define ZIP_EFSEEK -27      // fseek error
 #define ZIP_EFREAD -28      // fread error
 #define ZIP_EFWRITE -29     // fwrite error
 #define ZIP_ERINIT -30      // cannot initialize reader
 #define ZIP_EWINIT -31      // cannot initialize writer
 #define ZIP_EWRINIT -32     // cannot initialize writer from reader
 /**
 * Looks up the error message string corresponding to an error number.
 * @param errnum error number
 * @return error message string corresponding to errnum or NULL if error is not
 * found.
 */
 extern ZIP_EXPORT const char *zip_strerror(int errnum);
 /**
 * @struct zip_t
 *
 * This data structure is used throughout the library to represent zip archive -
 * forward declaration.
 */
 struct zip_t;
 /**
 * Opens zip archive with compression level using the given mode.
 *
 * @param zipname zip archive file name.
 * @param level compression level (0-9 are the standard zlib-style levels).
 * @param mode file access mode.
 *        - 'r': opens a file for reading/extracting (the file must exists).
 *        - 'w': creates an empty file for writing.
 *        - 'a': appends to an existing archive.
 *
 * @return the zip archive handler or NULL on error
 */
 extern ZIP_EXPORT struct zip_t *zip_open(const char *zipname, int level,
                                         char mode);
 /**
 * Opens zip archive with compression level using the given mode.
 * The function additionally returns @param errnum -
 *
 * @param zipname zip archive file name.
 * @param level compression level (0-9 are the standard zlib-style levels).
 * @param mode file access mode.
 *        - 'r': opens a file for reading/extracting (the file must exists).
 *        - 'w': creates an empty file for writing.
 *        - 'a': appends to an existing archive.
 * @param errnum 0 on success, negative number (< 0) on error.
 *
 * @return the zip archive handler or NULL on error
 */
 extern ZIP_EXPORT struct zip_t *
 zip_openwitherror(const char *zipname, int level, char mode, int *errnum);
 /**
 * Closes the zip archive, releases resources - always finalize.
 *
 * @param zip zip archive handler.
 */
 extern ZIP_EXPORT void zip_close(struct zip_t *zip);
 /**
 * Determines if the archive has a zip64 end of central directory headers.
 *
 * @param zip zip archive handler.
 *
 * @return the return code - 1 (true), 0 (false), negative number (< 0) on
 *         error.
 */
 extern ZIP_EXPORT int zip_is64(struct zip_t *zip);
 /**
 * Opens an entry by name in the zip archive.
 *
 * For zip archive opened in 'w' or 'a' mode the function will append
 * a new entry. In readonly mode the function tries to locate the entry
 * in global dictionary.
 *
 * @param zip zip archive handler.
 * @param entryname an entry name in local dictionary.
 *
 * @return the return code - 0 on success, negative number (< 0) on error.
 */
 extern ZIP_EXPORT int zip_entry_open(struct zip_t *zip, const char *entryname);
 /**
 * Opens an entry by name in the zip archive.
 *
 * For zip archive opened in 'w' or 'a' mode the function will append
 * a new entry. In readonly mode the function tries to locate the entry
 * in global dictionary (case sensitive).
 *
 * @param zip zip archive handler.
 * @param entryname an entry name in local dictionary (case sensitive).
 *
 * @return the return code - 0 on success, negative number (< 0) on error.
 */
 extern ZIP_EXPORT int zip_entry_opencasesensitive(struct zip_t *zip,
                                                  const char *entryname);
 /**
 * Opens a new entry by index in the zip archive.
 *
 * This function is only valid if zip archive was opened in 'r' (readonly) mode.
 *
 * @param zip zip archive handler.
 * @param index index in local dictionary.
 *
 * @return the return code - 0 on success, negative number (< 0) on error.
 */
 extern ZIP_EXPORT int zip_entry_openbyindex(struct zip_t *zip, size_t index);
 /**
 * Closes a zip entry, flushes buffer and releases resources.
 *
 * @param zip zip archive handler.
 *
 * @return the return code - 0 on success, negative number (< 0) on error.
 */
 extern ZIP_EXPORT int zip_entry_close(struct zip_t *zip);
 /**
 * Returns a local name of the current zip entry.
 *
 * The main difference between user's entry name and local entry name
 * is optional relative path.
 * Following .ZIP File Format Specification - the path stored MUST not contain
 * a drive or device letter, or a leading slash.
 * All slashes MUST be forward slashes '/' as opposed to backwards slashes '\'
 * for compatibility with Amiga and UNIX file systems etc.
 *
 * @param zip: zip archive handler.
 *
 * @return the pointer to the current zip entry name, or NULL on error.
 */
 extern ZIP_EXPORT const char *zip_entry_name(struct zip_t *zip);
 /**
 * Returns an index of the current zip entry.
 *
 * @param zip zip archive handler.
 *
 * @return the index on success, negative number (< 0) on error.
 */
 extern ZIP_EXPORT ssize_t zip_entry_index(struct zip_t *zip);
 /**
 * Determines if the current zip entry is a directory entry.
 *
 * @param zip zip archive handler.
 *
 * @return the return code - 1 (true), 0 (false), negative number (< 0) on
 *         error.
 */
 extern ZIP_EXPORT int zip_entry_isdir(struct zip_t *zip);
 /**
 * Returns the uncompressed size of the current zip entry.
 * Alias for zip_entry_uncomp_size (for backward compatibility).
 *
 * @param zip zip archive handler.
 *
 * @return the uncompressed size in bytes.
 */
 extern ZIP_EXPORT unsigned long long zip_entry_size(struct zip_t *zip);
 /**
 * Returns the uncompressed size of the current zip entry.
 *
 * @param zip zip archive handler.
 *
 * @return the uncompressed size in bytes.
 */
 extern ZIP_EXPORT unsigned long long zip_entry_uncomp_size(struct zip_t *zip);
 /**
 * Returns the compressed size of the current zip entry.
 *
 * @param zip zip archive handler.
 *
 * @return the compressed size in bytes.
 */
 extern ZIP_EXPORT unsigned long long zip_entry_comp_size(struct zip_t *zip);
 /**
 * Returns CRC-32 checksum of the current zip entry.
 *
 * @param zip zip archive handler.
 *
 * @return the CRC-32 checksum.
 */
 extern ZIP_EXPORT unsigned int zip_entry_crc32(struct zip_t *zip);
 /**
 * Compresses an input buffer for the current zip entry.
 *
 * @param zip zip archive handler.
 * @param buf input buffer.
 * @param bufsize input buffer size (in bytes).
 *
 * @return the return code - 0 on success, negative number (< 0) on error.
 */
 extern ZIP_EXPORT int zip_entry_write(struct zip_t *zip, const void *buf,
                                      size_t bufsize);
 /**
 * Compresses a file for the current zip entry.
 *
 * @param zip zip archive handler.
 * @param filename input file.
 *
 * @return the return code - 0 on success, negative number (< 0) on error.
 */
 extern ZIP_EXPORT int zip_entry_fwrite(struct zip_t *zip, const char *filename);
 /**
 * Extracts the current zip entry into output buffer.
 *
 * The function allocates sufficient memory for a output buffer.
 *
 * @param zip zip archive handler.
 * @param buf output buffer.
 * @param bufsize output buffer size (in bytes).
 *
 * @note remember to release memory allocated for a output buffer.
 *       for large entries, please take a look at zip_entry_extract function.
 *
 * @return the return code - the number of bytes actually read on success.
 *         Otherwise a negative number (< 0) on error.
 */
 extern ZIP_EXPORT ssize_t zip_entry_read(struct zip_t *zip, void **buf,
                                         size_t *bufsize);
 /**
 * Extracts the current zip entry into a memory buffer using no memory
 * allocation.
 *
 * @param zip zip archive handler.
 * @param buf preallocated output buffer.
 * @param bufsize output buffer size (in bytes).
 *
 * @note ensure supplied output buffer is large enough.
 *       zip_entry_size function (returns uncompressed size for the current
 *       entry) can be handy to estimate how big buffer is needed.
 *       For large entries, please take a look at zip_entry_extract function.
 *
 * @return the return code - the number of bytes actually read on success.
 *         Otherwise a negative number (< 0) on error (e.g. bufsize is not large
 * enough).
 */
 extern ZIP_EXPORT ssize_t zip_entry_noallocread(struct zip_t *zip, void *buf,
                                                size_t bufsize);
 /**
 * Extracts the current zip entry into output file.
 *
 * @param zip zip archive handler.
 * @param filename output file.
 *
 * @return the return code - 0 on success, negative number (< 0) on error.
 */
 extern ZIP_EXPORT int zip_entry_fread(struct zip_t *zip, const char *filename);
 /**
 * Extracts the current zip entry using a callback function (on_extract).
 *
 * @param zip zip archive handler.
 * @param on_extract callback function.
 * @param arg opaque pointer (optional argument, which you can pass to the
 *        on_extract callback)
 *
 * @return the return code - 0 on success, negative number (< 0) on error.
 */
 extern ZIP_EXPORT int
 zip_entry_extract(struct zip_t *zip,
                  size_t (*on_extract)(void *arg, uint64_t offset,
                                       const void *data, size_t size),
                  void *arg);
 /**
 * Returns the number of all entries (files and directories) in the zip archive.
 *
 * @param zip zip archive handler.
 *
 * @return the return code - the number of entries on success, negative number
 *         (< 0) on error.
 */
 extern ZIP_EXPORT ssize_t zip_entries_total(struct zip_t *zip);
 /**
 * Deletes zip archive entries.
 *
 * @param zip zip archive handler.
 * @param entries array of zip archive entries to be deleted.
 * @param len the number of entries to be deleted.
 * @return the number of deleted entries, or negative number (< 0) on error.
 */
 extern ZIP_EXPORT ssize_t zip_entries_delete(struct zip_t *zip,
                                             char *const entries[], size_t len);
 /**
 * Extracts a zip archive stream into directory.
 *
 * If on_extract is not NULL, the callback will be called after
 * successfully extracted each zip entry.
 * Returning a negative value from the callback will cause abort and return an
 * error. The last argument (void *arg) is optional, which you can use to pass
 * data to the on_extract callback.
 *
 * @param stream zip archive stream.
 * @param size stream size.
 * @param dir output directory.
 * @param on_extract on extract callback.
 * @param arg opaque pointer.
 *
 * @return the return code - 0 on success, negative number (< 0) on error.
 */
 extern ZIP_EXPORT int
 zip_stream_extract(const char *stream, size_t size, const char *dir,
                   int (*on_extract)(const char *filename, void *arg),
                   void *arg);
 /**
 * Opens zip archive stream into memory.
 *
 * @param stream zip archive stream.
 * @param size stream size.
 * @param level compression level (0-9 are the standard zlib-style levels).
 * @param mode file access mode.
 *        - 'r': opens a file for reading/extracting (the file must exists).
 *        - 'w': creates an empty file for writing.
 *        - 'a': appends to an existing archive.
 *
 * @return the zip archive handler or NULL on error
 */
 extern ZIP_EXPORT struct zip_t *zip_stream_open(const char *stream, size_t size,
                                                int level, char mode);
 /**
 * Opens zip archive stream into memory.
 * The function additionally returns @param errnum -
 *
 * @param stream zip archive stream.
 * @param size stream size.*
 * @param level compression level (0-9 are the standard zlib-style levels).
 * @param mode file access mode.
 *        - 'r': opens a file for reading/extracting (the file must exists).
 *        - 'w': creates an empty file for writing.
 *        - 'a': appends to an existing archive.
 * @param errnum 0 on success, negative number (< 0) on error.
 *
 * @return the zip archive handler or NULL on error
 */
 extern ZIP_EXPORT struct zip_t *zip_stream_openwitherror(const char *stream,
                                                         size_t size, int level,
                                                         char mode,
                                                         int *errnum);
 /**
 * Copy zip archive stream output buffer.
 *
 * @param zip zip archive handler.
 * @param buf output buffer. User should free buf.
 * @param bufsize output buffer size (in bytes).
 *
 * @return copy size
 */
 extern ZIP_EXPORT ssize_t zip_stream_copy(struct zip_t *zip, void **buf,
                                          size_t *bufsize);
 /**
 * Close zip archive releases resources.
 *
 * @param zip zip archive handler.
 *
 * @return
 */
 extern ZIP_EXPORT void zip_stream_close(struct zip_t *zip);
 /**
 * Creates a new archive and puts files into a single zip archive.
 *
 * @param zipname zip archive file.
 * @param filenames input files.
 * @param len: number of input files.
 *
 * @return the return code - 0 on success, negative number (< 0) on error.
 */
 extern ZIP_EXPORT int zip_create(const char *zipname, const char *filenames[],
                                 size_t len);
 /**
 * Extracts a zip archive file into directory.
 *
 * If on_extract_entry is not NULL, the callback will be called after
 * successfully extracted each zip entry.
 * Returning a negative value from the callback will cause abort and return an
 * error. The last argument (void *arg) is optional, which you can use to pass
 * data to the on_extract_entry callback.
 *
 * @param zipname zip archive file.
 * @param dir output directory.
 * @param on_extract_entry on extract callback.
 * @param arg opaque pointer.
 *
 * @return the return code - 0 on success, negative number (< 0) on error.
 */
 extern ZIP_EXPORT int zip_extract(const char *zipname, const char *dir,
                                  int (*on_extract_entry)(const char *filename,
                                                          void *arg),
                                  void *arg);
 /** @} */
 #ifdef __cplusplus
 }
 #endif
 #endif
--- a/examples/cli/main.cpp
+++ b/examples/cli/main.cpp
@@ -58,6 +58,7 @@ struct SDParams {
    std::string model_path;
    std::string vae_path;
    std::string taesd_path;
    ggml_type wtype = GGML_TYPE_COUNT;
    std::string lora_model_dir;
    std::string output_path = "output.png";
@@ -86,6 +87,7 @@ void print_params(SDParams params) {
    printf("    model_path:        %s\n", params.model_path.c_str());
    printf("    wtype:             %s\n", params.wtype < GGML_TYPE_COUNT ? ggml_type_name(params.wtype) : "unspecified");
    printf("    vae_path:          %s\n", params.vae_path.c_str());
    printf("    taesd_path:        %s\n", params.taesd_path.c_str());
    printf("    output_path:       %s\n", params.output_path.c_str());
    printf("    init_img:          %s\n", params.input_path.c_str());
    printf("    prompt:            %s\n", params.prompt.c_str());
@@ -112,8 +114,9 @@ void print_usage(int argc, const char* argv[]) {
    printf("                                     If threads <= 0, then threads will be set to the number of CPU physical cores\n");
    printf("  -m, --model [MODEL]                path to model\n");
    printf("  --vae [VAE]                        path to vae\n");
    printf("  --taesd [TAESD_PATH]               path to taesd. Using Tiny AutoEncoder for fast decoding (low quality)\n");
    printf("  --type [TYPE]                      weight type (f32, f16, q4_0, q4_1, q5_0, q5_1, q8_0)\n");
-    printf("                                     If not specified, the default is the type of the weight file.");
+    printf("                                     If not specified, the default is the type of the weight file.\n");
    printf("  --lora-model-dir [DIR]             lora model directory\n");
    printf("  -i, --init-img [IMAGE]             path to the input image, required by img2img\n");
    printf("  -o, --output OUTPUT                path to write result image to (default: ./output.png)\n");
@@ -176,6 +179,12 @@ void parse_args(int argc, const char** argv, SDParams& params) {
                break;
            }
            params.vae_path = argv[i];
        } else if (arg == "--taesd") {
            if (++i >= argc) {
                invalid_arg = true;
                break;
            }
            params.taesd_path = argv[i];
        } else if (arg == "--type") {
            if (++i >= argc) {
                invalid_arg = true;
@@ -449,7 +458,8 @@ int main(int argc, const char* argv[]) {
        }
    }
-    StableDiffusion sd(params.n_threads, vae_decode_only, true, params.lora_model_dir, params.rng_type);
+    StableDiffusion sd(params.n_threads, vae_decode_only, params.taesd_path, true, params.lora_model_dir, params.rng_type);
    if (!sd.load_from_file(params.model_path, params.vae_path, params.wtype, params.schedule)) {
        return 1;
    }
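
The `--taesd` branch above follows the same "flag, then value" pattern as `--vae`. A minimal, self-contained sketch of that pattern is shown below; `MiniParams`, `parse_paths`, and `check_taesd_flag` are hypothetical names used only for illustration and are not part of the repository.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, trimmed-down stand-in for the SDParams struct in the diff.
struct MiniParams {
    std::string vae_path;
    std::string taesd_path;
};

// Returns false when a flag is missing its value or is not recognised,
// which plays the role of the `invalid_arg` flag in parse_args.
bool parse_paths(const std::vector<std::string>& args, MiniParams& params) {
    for (std::size_t i = 0; i < args.size(); i++) {
        const std::string& arg = args[i];
        if (arg == "--vae") {
            if (++i >= args.size()) {
                return false;  // flag given without a value
            }
            params.vae_path = args[i];
        } else if (arg == "--taesd") {
            if (++i >= args.size()) {
                return false;  // flag given without a value
            }
            params.taesd_path = args[i];
        } else {
            return false;  // unknown argument
        }
    }
    return true;
}

// Usage example: a complete flag is stored, a dangling flag is rejected.
void check_taesd_flag() {
    MiniParams ok_params;
    std::vector<std::string> ok_args = {"--taesd", "taesd.safetensors"};
    assert(parse_paths(ok_args, ok_params));
    assert(ok_params.taesd_path == "taesd.safetensors");

    MiniParams bad_params;
    std::vector<std::string> bad_args = {"--taesd"};
    assert(!parse_paths(bad_args, bad_params));
}
```

Incrementing `i` before the bounds check is what lets a trailing `--taesd` with no path be reported as invalid instead of reading past the end of the argument list.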
--- a/2
+++ b/2
@@ -1 +1 @@
-Subproject commit 03669ba9fdc5e0520e919e5c7e1b3a3359d28e59
+Subproject commit 70474c6890c015b53dc10a2300ae35246cc73589
--- a/model.cpp
+++ b/model.cpp
@@ -1296,7 +1296,7 @@ bool ModelLoader::load_tensors(on_new_tensor_cb_t on_new_tensor_cb) {
            if (backend == NULL || ggml_backend_is_cpu(backend)) {
                // for the CPU and Metal backend, we can copy directly into the tensor
                if (tensor_storage.type == dst_tensor->type) {
-                    GGML_ASSERT(ggml_nbytes(dst_tensor) == nbytes_to_read);
+                    GGML_ASSERT(ggml_nbytes(dst_tensor) == tensor_storage.nbytes());
                    read_data(tensor_storage, (char*)dst_tensor->data, nbytes_to_read);
                    if (tensor_storage.is_bf16) {
@@ -1349,16 +1349,23 @@ bool ModelLoader::load_tensors(on_new_tensor_cb_t on_new_tensor_cb) {
    return success;
 }
-int64_t ModelLoader::cal_mem_size() {
+int64_t ModelLoader::cal_mem_size(ggml_backend_t backend) {
    size_t alignment = 128;
    if (backend != NULL) {
        alignment = ggml_backend_get_alignment(backend);
    }
    int64_t mem_size = 0;
    std::vector<TensorStorage> processed_tensor_storages;
    for (auto& tensor_storage : tensor_storages) {
        if (is_unused_tensor(tensor_storage.name)) {
            continue;
        }
-
+        preprocess_tensor(tensor_storage, processed_tensor_storages);
        mem_size += tensor_storage.nbytes();
        mem_size += GGML_MEM_ALIGN * 2;  // for lora alphas
    }
-    return mem_size + 10 * 1024 * 1024;
+    for (auto& tensor_storage : processed_tensor_storages) {
        mem_size += tensor_storage.nbytes() + alignment;
    }
    return mem_size;
 }
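
As this hunk reads, the new `cal_mem_size` drops the fixed 10 MB of headroom and instead reserves `alignment` extra bytes for each processed tensor. A rough sketch of that accounting follows, assuming only the byte size of each tensor matters; `FakeTensor` and `estimate_mem_size` are hypothetical names, and the real function also runs `preprocess_tensor` and queries the backend for its alignment.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for TensorStorage: only the byte size is modelled.
struct FakeTensor {
    std::int64_t nbytes;
};

// Upper bound on the buffer size: each tensor may need up to `alignment`
// bytes of padding, so that much is reserved per tensor.
std::int64_t estimate_mem_size(const std::vector<FakeTensor>& tensors,
                               std::size_t alignment) {
    assert(alignment > 0);
    std::int64_t mem_size = 0;
    for (const FakeTensor& t : tensors) {
        mem_size += t.nbytes + static_cast<std::int64_t>(alignment);
    }
    return mem_size;
}
```

With the 128-byte default from the hunk, two tensors of 100 and 28 bytes give (100 + 128) + (28 + 128) = 384 bytes, so the padding term scales with the number of tensors and not with a fixed constant.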
--- a/model.h
+++ b/model.h
@@ -8,6 +8,7 @@
 #include <vector>
 #include "ggml/ggml.h"
 #include "ggml/ggml-backend.h"
 #include "json.hpp"
 #include "zip.h"
@@ -116,7 +117,7 @@ public:
    ggml_type get_sd_wtype();
    bool load_vocab(on_new_token_cb_t on_new_token_cb);
    bool load_tensors(on_new_tensor_cb_t on_new_tensor_cb);
-    int64_t cal_mem_size();
+    int64_t cal_mem_size(ggml_backend_t backend);
    ~ModelLoader() = default;
 };
 #endif  // __MODEL_H__
--- a/stable-diffusion.cpp
+++ b/stable-diffusion.cpp
--- a/stable-diffusion.h
+++ b/stable-diffusion.h
@@ -38,6 +38,7 @@ private:
 public:
    StableDiffusion(int n_threads                = -1,
                    bool vae_decode_only         = false,
                    std::string taesd_path       = "",
                    bool free_params_immediately = false,
                    std::string lora_model_dir   = "",
                    RNGType rng_type             = STD_DEFAULT_RNG);
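
Because `taesd_path` is inserted before `free_params_immediately`, positional callers have to pass the new argument, as the `main.cpp` change in this commit does. The sketch below mirrors the parameter order with a hypothetical `StableDiffusionSketch` type; the `use_taesd` field and the rule "an empty path means the regular VAE is used" are assumptions made for illustration, not confirmed by this diff.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical stand-in that copies only the first four constructor
// parameters from stable-diffusion.h.
struct StableDiffusionSketch {
    int n_threads;
    bool vae_decode_only;
    std::string taesd_path;
    bool free_params_immediately;
    bool use_taesd;  // assumption: derived from whether a path was given

    StableDiffusionSketch(int n_threads                = -1,
                          bool vae_decode_only         = false,
                          std::string taesd_path       = "",
                          bool free_params_immediately = false)
        : n_threads(n_threads),
          vae_decode_only(vae_decode_only),
          taesd_path(std::move(taesd_path)),
          free_params_immediately(free_params_immediately),
          use_taesd(!this->taesd_path.empty()) {}
};

// Usage example: the updated call site passes the path in third position.
void check_call_sites() {
    StableDiffusionSketch with_taesd(4, true, "taesd.safetensors", true);
    assert(with_taesd.use_taesd);
    assert(with_taesd.free_params_immediately);

    StableDiffusionSketch defaults;
    assert(!defaults.use_taesd);
}
```

An old-style call that still passes `true` in the third position would fail to compile, since `bool` does not convert to `std::string`, so stale call sites are caught at build time.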