feat: add TencentARC PhotoMaker support (#179)

* first efforts at implementing photomaker; lots more to do

* added PhotoMakerIDEncoder model in SD

* fixed some bugs; now photomaker model weights can be loaded into their tensor buffers

* added input id image loading

* added preprocessing of input id images

* finished get_num_tensors

* fixed a bug in remove_duplicates

* add a get_learned_condition_with_trigger function to do photomaker stuff

* add a convert_token_to_id function for photomaker to extract trigger word's token id

* making progress; need to implement tokenizer decoder

* making more progress; finishing vision model forward

* debugging vision_model outputs

* corrected clip vision model output

* continue making progress in id fusion process

* finished stacked id embedding; to be tested

* remove garbage file

* debugging graph compute

* more progress; now alloc buffer failed

* fixed wtype issue; input images can only be 1 because issue with transformer when batch size > 1 (to be investigated)

* added delayed subject conditioning; now photomaker runs and generates images

* fixed start_merge_step

* added photomaker lora model (to be tested)

* reworked pmid lora

* finished applying pmid lora; to be tested

* finalized pmid lora

* add a few print tensor; tweak in sample again

* small tweak; still not getting ID faces

* fixed a bug in FuseBlock forward; also removed the diag_mask op for the vision transformer; getting better results

* disable pmid lora apply for now; 1 input image seems working; > 1 not working

* turn pmid lora apply back on

* fixed a decode bug

* fixed a bug in ggml's conv_2d, and now > 1 input images working

* add style_ratio as a cli param; reworked encode with trigger for attention weights

* merge commit fixing lora free param buffer error

* change default style ratio to 10%

* added an option to offload vae decoder to CPU for mem-limited gpus

* removing the image normalization step seems to make ID fidelity much higher

* revert default style ratio back to 20%

* added an option for normalizing input ID images; cleaned up debugging code

* more clean up

* fixed bugs; now failed with cuda error; likely out-of-mem on GPU

* free pmid model params when required

* photomaker working properly now after merging and adapting to GGMLBlock API

* remove tensor renaming; fixing names in the photomaker model file

* updated README.md to include instructions and notes for running PhotoMaker

* a bit clean up

* remove -DGGML_CUDA_FORCE_MMQ; more clean up and README update

* add input image requirement in README

* bring back freeing pmid lora params buffer; simply pooled output of CLIPvision

* remove MultiheadAttention2; customized MultiheadAttention

* added a WIN32 get_files_from_dir; turn off PhotoMaker if receiving no input images

* update docs

* fix ci error

* make stable-diffusion.h a pure c header file

This reverts commit 27887b630db6a92f269f0aef8de9bc9832ab50a9.

* fix ci error

* format code

* reuse get_learned_condition

* reuse pad_tokens

* reuse CLIPVisionModel

* reuse LoraModel

* add --clip-on-cpu

* fix lora name conversion for SDXL

---------

Co-authored-by: bssrdf <bssrdf@gmail.com>
Co-authored-by: leejet <leejet714@gmail.com>
bssrdf 2024-03-12 11:15:17 -04:00 committed by GitHub
parent 61980171a1
commit a469688e30
28 changed files with 3935 additions and 186 deletions

README.md

@@ -14,6 +14,7 @@ Inference of [Stable Diffusion](https://github.com/CompVis/stable-diffusion) in
 - !!!The VAE in SDXL encounters NaN issues under FP16, but unfortunately, the ggml_conv_2d only operates under FP16. Hence, a parameter is needed to specify the VAE that has fixed the FP16 NaN issue. You can find it here: [SDXL VAE FP16 Fix](https://huggingface.co/madebyollin/sdxl-vae-fp16-fix/blob/main/sdxl_vae.safetensors).
 - [SD-Turbo](https://huggingface.co/stabilityai/sd-turbo) and [SDXL-Turbo](https://huggingface.co/stabilityai/sdxl-turbo) support
+- [PhotoMaker](https://github.com/TencentARC/PhotoMaker) support.
 - 16-bit, 32-bit float support
 - 4-bit, 5-bit and 8-bit integer quantization support
 - Accelerated memory-efficient CPU inference
@@ -151,7 +152,7 @@ cmake --build . --config Release
 ### Run

 ```
-usage: ./build/bin/sd [arguments]
+usage: ./bin/sd [arguments]

 arguments:
   -h, --help                         show this help message and exit
@@ -163,6 +164,9 @@ arguments:
   --taesd [TAESD_PATH]               path to taesd. Using Tiny AutoEncoder for fast decoding (low quality)
   --control-net [CONTROL_PATH]       path to control net model
   --embd-dir [EMBEDDING_PATH]        path to embeddings.
+  --stacked-id-embd-dir [DIR]        path to PHOTOMAKER stacked id embeddings.
+  --input-id-images-dir [DIR]        path to PHOTOMAKER input id images dir.
+  --normalize-input                  normalize PHOTOMAKER input id images
   --upscale-model [ESRGAN_PATH]      path to esrgan model. Upscale images after generate, just RealESRGAN_x4plus_anime_6B supported by now.
   --upscale-repeats                  Run the ESRGAN upscaler this many times (default 1)
   --type [TYPE]                      weight type (f32, f16, q4_0, q4_1, q5_0, q5_1, q8_0)
@@ -175,6 +179,7 @@ arguments:
   -n, --negative-prompt PROMPT       the negative prompt (default: "")
   --cfg-scale SCALE                  unconditional guidance scale: (default: 7.0)
   --strength STRENGTH                strength for noising/unnoising (default: 0.75)
+  --style-ratio STYLE-RATIO          strength for keeping input identity (default: 20%)
   --control-strength STRENGTH        strength to apply Control Net (default: 0.9)
                                      1.0 corresponds to full destruction of information in init image
   -H, --height H                     image height, in pixel space (default: 512)
@@ -299,6 +304,39 @@ You can use ESRGAN to upscale the generated images. At the moment, only the [Rea
 sd -m ../models/v1-5-pruned-emaonly.safetensors -p "a lovely cat" --upscale-model ../models/RealESRGAN_x4plus_anime_6B.pth
 ```

+#### Using PhotoMaker to personalize image generation
+
+You can use [PhotoMaker](https://github.com/TencentARC/PhotoMaker) to personalize generated images with your own ID.
+
+**NOTE**: currently PhotoMaker **ONLY** works with **SDXL** (any SDXL model file will work).
+
+Download the PhotoMaker model file (in safetensors format) [here](https://huggingface.co/bssrdf/PhotoMaker). The official release of the model file (in .bin format) does not work with ```stable-diffusion.cpp```.
+
+- Specify the PhotoMaker model path using the `--stacked-id-embd-dir PATH` parameter.
+- Specify the input images path using the `--input-id-images-dir PATH` parameter.
+  - Input images **must** have the same width and height for preprocessing (to be improved).
+
+In the prompt, make sure you have a class word followed by the trigger word ```"img"``` (hard-coded for now). The class word can be one of ```"man, woman, girl, boy"```. If the input ID images contain Asian faces, add ```Asian``` before the class
+word.
+
+Another PhotoMaker-specific parameter:
+
+- ```--style-ratio (0-100)%```: the default is 20, and 10-20 typically gives good results. A lower ratio follows the input ID more faithfully (not necessarily with better quality).
+
+Other parameters recommended for running PhotoMaker:
+
+- ```--cfg-scale 5.0```
+- ```-H 1024```
+- ```-W 1024```
+
+On low-memory GPUs (<= 8GB), running with the ```--vae-on-cpu``` option is recommended to get artifact-free images.
+
+Example:
+
+```bash
+bin/sd -m ../models/sdxlUnstableDiffusers_v11.safetensors --vae ../models/sdxl_vae.safetensors --stacked-id-embd-dir ../models/photomaker-v1.safetensors --input-id-images-dir ../assets/examples/scarletthead_woman -p "a girl img, retro futurism, retro game art style but extremely beautiful, intricate details, masterpiece, best quality, space-themed, cosmic, celestial, stars, galaxies, nebulas, planets, science fiction, highly detailed" -n "realistic, photo-realistic, worst quality, greyscale, bad anatomy, bad hands, error, text" --cfg-scale 5.0 --sampling-method euler -H 1024 -W 1024 --style-ratio 10 --vae-on-cpu -o output.png
+```
+
 ### Docker

 #### Building using Docker
@@ -345,3 +383,4 @@ Thank you to all the people who have already contributed to stable-diffusion.cpp
 - [k-diffusion](https://github.com/crowsonkb/k-diffusion)
 - [latent-consistency-model](https://github.com/luosiallen/latent-consistency-model)
 - [generative-models](https://github.com/Stability-AI/generative-models/)
+- [PhotoMaker](https://github.com/TencentARC/PhotoMaker)
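
The `--style-ratio` option in the README changes above is given as a percentage, and the commit message mentions "delayed subject conditioning" and a merge step. One plausible reading is that the ratio picks the sampling step from which the ID-conditioned prompt embedding is used. The following is a minimal sketch under that assumption; the helper name `start_merge_step_for` and the truncating conversion are illustrative and not taken from the diff shown here.

```cpp
#include <cassert>

// Hypothetical helper (not from the commit): map a style ratio given in
// percent (0-100) to the index of the first sampling step that would use
// the ID-conditioned embedding. Earlier steps would use the plain text
// conditioning.
int start_merge_step_for(float style_ratio_percent, int sample_steps) {
    assert(sample_steps >= 0);
    // Clamp out-of-range input so the result always lies in [0, sample_steps].
    if (style_ratio_percent < 0.0f) {
        style_ratio_percent = 0.0f;
    }
    if (style_ratio_percent > 100.0f) {
        style_ratio_percent = 100.0f;
    }
    return static_cast<int>(style_ratio_percent / 100.0f * static_cast<float>(sample_steps));
}
```

Under this reading, a ratio of 50 with 20 sampling steps would switch to the ID-conditioned embedding at step 10, which is consistent with the README's remark that a lower ratio follows the input ID more closely.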

Binary files not shown: 15 image files added (39 KiB, 311 KiB, 53 KiB, 1.4 MiB, 26 KiB, 88 KiB, 26 KiB, 72 KiB, 107 KiB, 54 KiB, 68 KiB, 54 KiB, 48 KiB, 28 KiB, 30 KiB).

398
clip.hpp

@@ -75,9 +75,13 @@ class CLIPTokenizer {
 private:
     SDVersion version = VERSION_1_x;
     std::map<int, std::u32string> byte_encoder;
+    std::map<std::u32string, int> byte_decoder;
     std::map<std::u32string, int> encoder;
+    std::map<int, std::u32string> decoder;
     std::map<std::pair<std::u32string, std::u32string>, int> bpe_ranks;
     std::regex pat;
+    int encoder_len;
+    int bpe_len;

     static std::string strip(const std::string& str) {
         std::string::size_type start = str.find_first_not_of(" \t\n\r\v\f");
@@ -118,7 +122,11 @@ public:
     void load_from_merges(const std::string& merges_utf8_str) {
         auto byte_unicode_pairs = bytes_to_unicode();
+        // printf("byte_unicode_pairs have %lu pairs \n", byte_unicode_pairs.size());
         byte_encoder = std::map<int, std::u32string>(byte_unicode_pairs.begin(), byte_unicode_pairs.end());
+        for (auto& pair : byte_unicode_pairs) {
+            byte_decoder[pair.second] = pair.first;
+        }
         // for (auto & pair: byte_unicode_pairs) {
         //     std::cout << pair.first << ": " << pair.second << std::endl;
         // }
@@ -138,6 +146,8 @@ public:
             size_t space_pos = merge.find(' ');
             merge_pairs.emplace_back(merge.substr(0, space_pos), merge.substr(space_pos + 1));
             // LOG_DEBUG("%s", utf32_to_utf8(merge.substr(space_pos + 1)).c_str());
+            // printf("%s :: %s | %s \n", utf32_to_utf8(merge).c_str(), utf32_to_utf8(merge.substr(0, space_pos)).c_str(),
+            //                     utf32_to_utf8(merge.substr(space_pos + 1)).c_str());
         }
         std::vector<std::u32string> vocab;
         for (const auto& pair : byte_unicode_pairs) {
@@ -154,15 +164,36 @@ public:
         LOG_DEBUG("vocab size: %llu", vocab.size());
         int i = 0;
         for (const auto& token : vocab) {
-            encoder[token] = i++;
+            encoder[token] = i;
+            decoder[i] = token;
+            i++;
+        }
+        encoder_len = i;
+
+        auto it = encoder.find(utf8_to_utf32("img</w>"));
+        if (it != encoder.end()) {
+            LOG_DEBUG(" trigger word img already in vocab");
+        } else {
+            LOG_DEBUG(" trigger word img not in vocab yet");
         }

         int rank = 0;
         for (const auto& merge : merge_pairs) {
             bpe_ranks[merge] = rank++;
         }
+        bpe_len = rank;
     };

+    void add_token(const std::string& text) {
+        std::u32string token = utf8_to_utf32(text);
+        auto it = encoder.find(token);
+        if (it != encoder.end()) {
+            encoder[token] = encoder_len;
+            decoder[encoder_len] = token;
+            encoder_len++;
+        }
+    }
+
     std::u32string bpe(const std::u32string& token) {
         std::vector<std::u32string> word;
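
The hunk above pairs the existing `encoder` map (token text to id) with a new `decoder` map (id to token text) and adds an `add_token` method. As a reading aid, here is a small self-contained sketch of such a two-way vocabulary. The `Vocab` struct and its members are illustrative names, and `std::string` stands in for `std::u32string`. Note that `add_token` in the diff performs the insertion when `encoder.find` succeeds; the sketch assumes the intent is to register a token only when it is missing.

```cpp
#include <cassert>
#include <map>
#include <string>

// Illustrative two-way vocabulary (not the tokenizer from the commit):
// `encoder` maps token text to id, `decoder` maps id back to token text.
struct Vocab {
    std::map<std::string, int> encoder;
    std::map<int, std::string> decoder;
    int next_id = 0;

    // Returns the id of `token`, registering it first if it is unknown.
    // Assumption: existing tokens keep their id and are never overwritten.
    int add_token(const std::string& token) {
        auto it = encoder.find(token);
        if (it != encoder.end()) {
            return it->second;
        }
        int id = next_id++;
        encoder[token] = id;
        decoder[id] = token;
        // Both maps must stay in sync after every insertion.
        assert(encoder.size() == decoder.size());
        return id;
    }
};
```

Keeping both directions in one place makes it harder for the two maps to drift apart, which is the property a `decode` function relies on.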
@@ -243,6 +274,7 @@ public:
                               size_t max_length = 0,
                               bool padding = false) {
         std::vector<int32_t> tokens = encode(text, on_new_token_cb);
+
         tokens.insert(tokens.begin(), BOS_TOKEN_ID);
         if (max_length > 0) {
             if (tokens.size() > max_length - 1) {
@@ -259,9 +291,34 @@ public:
                 }
             }
         }
+
         return tokens;
     }

+    std::string decode(const std::vector<int>& tokens) {
+        std::string text = "";
+        for (int t : tokens) {
+            if (t == 49406 || t == 49407)
+                continue;
+            std::u32string ts = decoder[t];
+            // printf("%d, %s \n", t, utf32_to_utf8(ts).c_str());
+            std::string s = utf32_to_utf8(ts);
+            if (s.length() >= 4 && ends_with(s, "</w>")) {
+                text += " " + s.replace(s.length() - 4, s.length() - 1, "");
+            } else {
+                text += " " + s;
+            }
+        }
+        // std::vector<unsigned char> bytes;
+        // for (auto c : text){
+        //     bytes.push_back(byte_decoder[c]);
+        // }
+        // std::string s((char *)bytes.data());
+        // std::string s = "";
+        return trim(text);
+    }
+
     std::vector<int> encode(std::string text, on_new_token_cb_t on_new_token_cb) {
         std::string original_text = text;
         std::vector<int32_t> bpe_tokens;
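
The new `decode` method above skips the BOS/EOS ids (49406 and 49407), looks each id up in `decoder`, and strips the `</w>` end-of-word marker. The sketch below shows the same idea in a self-contained form. The function name `decode_tokens`, the `std::string` vocabulary, and the explicit `bos_id`/`eos_id` parameters are assumptions for illustration. Unlike the diff, which puts a space before every piece, the sketch only adds a space after pieces that end a word, so sub-word pieces are joined.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Illustrative decoder (not the method from the commit): turns token ids
// back into text using an id -> piece map where word-final pieces carry a
// "</w>" suffix.
std::string decode_tokens(const std::vector<int>& ids,
                          const std::map<int, std::string>& decoder,
                          int bos_id,
                          int eos_id) {
    assert(bos_id != eos_id);
    const std::string suffix = "</w>";
    std::string text;
    for (int id : ids) {
        if (id == bos_id || id == eos_id) {
            continue;  // special tokens carry no text
        }
        auto it = decoder.find(id);
        if (it == decoder.end()) {
            continue;  // unknown ids are skipped in this sketch
        }
        std::string piece = it->second;
        bool ends_word = piece.size() >= suffix.size() &&
                         piece.compare(piece.size() - suffix.size(), suffix.size(), suffix) == 0;
        if (ends_word) {
            piece.erase(piece.size() - suffix.size());  // drop the marker
            text += piece;
            text += ' ';
        } else {
            text += piece;  // sub-word piece: no separator
        }
    }
    if (!text.empty() && text.back() == ' ') {
        text.pop_back();  // trim the trailing separator
    }
    return text;
}
```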
@@ -308,7 +365,8 @@ public:
             ss << "\"" << token << "\", ";
         }
         ss << "]";
-        LOG_DEBUG("split prompt \"%s\" to tokens %s", original_text.c_str(), ss.str().c_str());
+        // LOG_DEBUG("split prompt \"%s\" to tokens %s", original_text.c_str(), ss.str().c_str());
+        // printf("split prompt \"%s\" to tokens %s \n", original_text.c_str(), ss.str().c_str());
         return bpe_tokens;
     }
 };
@@ -469,7 +527,8 @@ public:
         : d_model(d_model),
           n_head(n_head),
           intermediate_size(intermediate_size) {
-        blocks["self_attn"] = std::shared_ptr<GGMLBlock>(new MultiheadAttention(d_model, n_head));
+        blocks["self_attn"] = std::shared_ptr<GGMLBlock>(new MultiheadAttention(d_model, n_head, true));
         blocks["layer_norm1"] = std::shared_ptr<GGMLBlock>(new LayerNorm(d_model));
         blocks["layer_norm2"] = std::shared_ptr<GGMLBlock>(new LayerNorm(d_model));
@@ -508,7 +567,7 @@ public:
     struct ggml_tensor* forward(struct ggml_context* ctx, struct ggml_tensor* x, int clip_skip = -1, bool mask = true) {
         // x: [N, n_token, d_model]
         int layer_idx = n_layer - 1;
-        LOG_DEBUG("clip_skip %d", clip_skip);
+        // LOG_DEBUG("clip_skip %d", clip_skip);
         if (clip_skip > 0) {
             layer_idx = n_layer - clip_skip;
         }
@@ -520,7 +579,7 @@ public:
             }
             std::string name = "layers." + std::to_string(i);
             auto layer = std::dynamic_pointer_cast<CLIPLayer>(blocks[name]);
-            x = layer->forward(ctx, x);  // [N, n_token, d_model]
+            x = layer->forward(ctx, x, mask);  // [N, n_token, d_model]
             // LOG_DEBUG("layer %d", i);
         }
         return x;
@@ -703,7 +762,7 @@ public:
         auto final_layer_norm = std::dynamic_pointer_cast<LayerNorm>(blocks["final_layer_norm"]);
         auto x = embeddings->forward(ctx, input_ids, tkn_embeddings);  // [N, n_token, hidden_size]
         x = encoder->forward(ctx, x, return_pooled ? -1 : clip_skip, true);
         if (return_pooled || with_final_ln) {
             x = final_layer_norm->forward(ctx, x);
         }
@@ -720,11 +779,6 @@ public:
 };

 class CLIPVisionModel : public GGMLBlock {
-protected:
-    void init_params(struct ggml_context* ctx, ggml_type wtype) {
-        params["visual_projection"] = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, projection_dim, hidden_size);
-    }
-
 public:
     // network hparams
     int32_t num_channels = 3;
@@ -735,16 +789,14 @@ public:
     int32_t intermediate_size = 4096;
     int32_t n_head = 16;
     int32_t n_layer = 24;
-    int32_t projection_dim = 768;

 public:
-    CLIPVisionModel(CLIPVersion version = OPEN_CLIP_VIT_H_14) {
+    CLIPVisionModel(CLIPVersion version = OPENAI_CLIP_VIT_L_14) {
         if (version == OPEN_CLIP_VIT_H_14) {
             hidden_size = 1280;
             intermediate_size = 5120;
             n_head = 16;
             n_layer = 32;
-            projection_dim = 1024;
         } else if (version == OPEN_CLIP_VIT_BIGG_14) {
             hidden_size = 1664;
             intermediate_size = 8192;
@@ -758,9 +810,8 @@ public:
         blocks["post_layernorm"] = std::shared_ptr<GGMLBlock>(new LayerNorm(hidden_size));
     }

-    struct ggml_tensor* forward(struct ggml_context* ctx, struct ggml_tensor* pixel_values) {
+    struct ggml_tensor* forward(struct ggml_context* ctx, struct ggml_tensor* pixel_values, bool return_pooled = true) {
         // pixel_values: [N, num_channels, image_size, image_size]
-        // return: // [N, projection_dim]
         auto embeddings = std::dynamic_pointer_cast<CLIPVisionEmbeddings>(blocks["embeddings"]);
         auto pre_layernorm = std::dynamic_pointer_cast<LayerNorm>(blocks["pre_layernorm"]);
         auto encoder = std::dynamic_pointer_cast<CLIPEncoder>(blocks["encoder"]);
@@ -768,26 +819,60 @@ public:
         auto x = embeddings->forward(ctx, pixel_values);  // [N, num_positions, embed_dim]
         x = pre_layernorm->forward(ctx, x);
-        x = encoder->forward(ctx, x, -1, true);
+        x = encoder->forward(ctx, x, -1, false);
         x = post_layernorm->forward(ctx, x);  // [N, n_token, hidden_size]

-        GGML_ASSERT(x->ne[2] == 1);
-        int64_t max_token_idx = 0;
-        ggml_tensor* pooled = ggml_view_1d(ctx, x, x->ne[0], x->nb[1] * max_token_idx);  // assert N == 1
-        auto visual_projection = params["visual_projection"];
-        pooled = ggml_mul_mat(ctx, ggml_cont(ctx, ggml_transpose(ctx, visual_projection)), pooled);
-        return pooled;  // [N, projection_dim]
+        GGML_ASSERT(x->ne[3] == 1);
+        if (return_pooled) {
+            ggml_tensor* pooled = ggml_cont(ctx, ggml_view_2d(ctx, x, x->ne[0], x->ne[2], x->nb[2], 0));
+            return pooled;  // [N, hidden_size]
+        } else {
+            return x;  // [N, n_token, hidden_size]
+        }
+    }
+};
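
In the rewritten `forward`, the pooled output is a 2-D view over `x` with one row per batch element, a row stride of `x->nb[2]`, and offset 0. Read that way, each row holds the first `hidden_size` values of its batch element, i.e. the embedding of token 0. The following plain C++ sketch shows that selection on a flat buffer. The function name `pool_first_token` and the row-major `[N, n_token, hidden]` layout are assumptions for illustration and do not use ggml.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative pooling (no ggml involved): `x` stores
// n_batch * n_token * hidden floats in row-major [N, n_token, hidden] order.
// Returns the hidden vector of token 0 for every batch element, laid out
// as [N, hidden].
std::vector<float> pool_first_token(const std::vector<float>& x,
                                    std::size_t n_batch,
                                    std::size_t n_token,
                                    std::size_t hidden) {
    assert(n_token > 0);
    assert(x.size() == n_batch * n_token * hidden);
    std::vector<float> pooled(n_batch * hidden);
    for (std::size_t n = 0; n < n_batch; ++n) {
        const std::size_t batch_offset = n * n_token * hidden;  // start of batch n
        for (std::size_t h = 0; h < hidden; ++h) {
            pooled[n * hidden + h] = x[batch_offset + h];  // token 0 of batch n
        }
    }
    return pooled;
}
```

Compared with the removed `ggml_view_1d` path, which was annotated `assert N == 1`, a per-batch selection like this is what allows more than one input image to be pooled in a single pass.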
+
+class CLIPProjection : public UnaryBlock {
+protected:
+    int64_t in_features;
+    int64_t out_features;
+    bool transpose_weight;
+
+    void init_params(struct ggml_context* ctx, ggml_type wtype) {
+        if (transpose_weight) {
+            LOG_ERROR("transpose_weight");
+            params["weight"] = ggml_new_tensor_2d(ctx, wtype, out_features, in_features);
+        } else {
+            params["weight"] = ggml_new_tensor_2d(ctx, wtype, in_features, out_features);
+        }
+    }
+
+public:
+    CLIPProjection(int64_t in_features,
+                   int64_t out_features,
+                   bool transpose_weight = false)
+        : in_features(in_features),
+          out_features(out_features),
+          transpose_weight(transpose_weight) {}
+
+    struct ggml_tensor* forward(struct ggml_context* ctx, struct ggml_tensor* x) {
+        struct ggml_tensor* w = params["weight"];
+        if (transpose_weight) {
+            w = ggml_cont(ctx, ggml_transpose(ctx, w));
+        }
+        return ggml_nn_linear(ctx, x, w, NULL);
     }
 };
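
`CLIPProjection` is a bias-free linear layer whose weight may be stored in either orientation, with `transpose_weight` selecting which one. The sketch below illustrates the idea with plain vectors: the same projection is computed from a weight stored as `[out, in]` or as `[in, out]`. The function name `project`, the flag name, and the row-major layouts are assumptions for illustration; ggml's own dimension ordering differs and is not modelled here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative bias-free projection y = W x for a single input vector.
// If `stored_in_by_out` is false, `w` is row-major [out, in];
// if true, `w` is row-major [in, out] and is read transposed.
std::vector<float> project(const std::vector<float>& x,
                           const std::vector<float>& w,
                           std::size_t in_features,
                           std::size_t out_features,
                           bool stored_in_by_out) {
    assert(x.size() == in_features);
    assert(w.size() == in_features * out_features);
    std::vector<float> y(out_features, 0.0f);
    for (std::size_t o = 0; o < out_features; ++o) {
        for (std::size_t i = 0; i < in_features; ++i) {
            const float wij = stored_in_by_out ? w[i * out_features + o]
                                               : w[o * in_features + i];
            y[o] += wij * x[i];
        }
    }
    return y;
}
```

Reading the weight through an index swap, as here, avoids materialising a transposed copy; the diff instead makes a contiguous transposed tensor before the matrix multiply.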

 class CLIPVisionModelProjection : public GGMLBlock {
 public:
     int32_t hidden_size = 1024;
-    int32_t projection_dim = 1024;
+    int32_t projection_dim = 768;
     int32_t image_size = 224;

 public:
-    CLIPVisionModelProjection(CLIPVersion version = OPEN_CLIP_VIT_H_14) {
+    CLIPVisionModelProjection(CLIPVersion version = OPENAI_CLIP_VIT_L_14,
+                              bool transpose_proj_w = false) {
         if (version == OPEN_CLIP_VIT_H_14) {
             hidden_size = 1280;
             projection_dim = 1024;
@@ -795,17 +880,17 @@ public:
             hidden_size = 1664;
         }

-        blocks["visual_model"] = std::shared_ptr<GGMLBlock>(new CLIPVisionModel(version));
-        blocks["visual_projection"] = std::shared_ptr<GGMLBlock>(new Linear(hidden_size, projection_dim, false));
+        blocks["vision_model"] = std::shared_ptr<GGMLBlock>(new CLIPVisionModel(version));
+        blocks["visual_projection"] = std::shared_ptr<GGMLBlock>(new CLIPProjection(hidden_size, projection_dim, transpose_proj_w));
     }

     struct ggml_tensor* forward(struct ggml_context* ctx, struct ggml_tensor* pixel_values) {
         // pixel_values: [N, num_channels, image_size, image_size]
-        // return: [N, num_positions, projection_dim]
-        auto visual_model = std::dynamic_pointer_cast<CLIPVisionModel>(blocks["visual_model"]);
-        auto visual_projection = std::dynamic_pointer_cast<Linear>(blocks["visual_projection"]);
+        // return: [N, projection_dim]
+        auto vision_model = std::dynamic_pointer_cast<CLIPVisionModel>(blocks["vision_model"]);
+        auto visual_projection = std::dynamic_pointer_cast<CLIPProjection>(blocks["visual_projection"]);

-        auto x = visual_model->forward(ctx, pixel_values);  // [N, embed_dim]
+        auto x = vision_model->forward(ctx, pixel_values);  // [N, hidden_size]
         x = visual_projection->forward(ctx, x);  // [N, projection_dim]

         return x;  // [N, projection_dim]
@@ -1029,6 +1114,205 @@ struct FrozenCLIPEmbedderWithCustomWords : public GGMLModule {
         return tokenize(text, text_model.n_token, padding);
     }
std::tuple<std::vector<int>, std::vector<float>, std::vector<bool>>
tokenize_with_trigger_token(std::string text,
int num_input_imgs,
int32_t image_token,
bool padding = false) {
return tokenize_with_trigger_token(text, num_input_imgs, image_token,
text_model.n_token, padding);
}
std::vector<int> convert_token_to_id(std::string text) {
auto on_new_token_cb = [&](std::string& str, std::vector<int32_t>& bpe_tokens) -> bool {
size_t word_end = str.find(",");
std::string embd_name = word_end == std::string::npos ? str : str.substr(0, word_end);
embd_name = trim(embd_name);
std::string embd_path = get_full_path(embd_dir, embd_name + ".pt");
if (embd_path.size() == 0) {
embd_path = get_full_path(embd_dir, embd_name + ".ckpt");
}
if (embd_path.size() == 0) {
embd_path = get_full_path(embd_dir, embd_name + ".safetensors");
}
if (embd_path.size() > 0) {
if (load_embedding(embd_name, embd_path, bpe_tokens)) {
if (word_end != std::string::npos) {
str = str.substr(word_end);
} else {
str = "";
}
return true;
}
}
return false;
};
std::vector<int> curr_tokens = tokenizer.encode(text, on_new_token_cb);
return curr_tokens;
}
std::string decode(const std::vector<int>& tokens) {
return tokenizer.decode(tokens);
}
void pad_tokens(std::vector<int>& tokens,
std::vector<float>& weights,
size_t max_length = 0,
bool padding = false) {
if (max_length > 0 && padding) {
size_t n = std::ceil(tokens.size() * 1.0 / (max_length - 2));
if (n == 0) {
n = 1;
}
size_t length = max_length * n;
LOG_DEBUG("token length: %llu", length);
std::vector<int> new_tokens;
std::vector<float> new_weights;
new_tokens.push_back(BOS_TOKEN_ID);
new_weights.push_back(1.0);
int token_idx = 0;
for (int i = 1; i < length; i++) {
if (token_idx >= tokens.size()) {
break;
}
if (i % max_length == 0) {
new_tokens.push_back(BOS_TOKEN_ID);
new_weights.push_back(1.0);
} else if (i % max_length == max_length - 1) {
new_tokens.push_back(EOS_TOKEN_ID);
new_weights.push_back(1.0);
} else {
new_tokens.push_back(tokens[token_idx]);
new_weights.push_back(weights[token_idx]);
token_idx++;
}
}
new_tokens.push_back(EOS_TOKEN_ID);
new_weights.push_back(1.0);
tokens = new_tokens;
weights = new_weights;
if (padding) {
int pad_token_id = PAD_TOKEN_ID;
if (version == VERSION_2_x) {
pad_token_id = 0;
}
tokens.insert(tokens.end(), length - tokens.size(), pad_token_id);
weights.insert(weights.end(), length - weights.size(), 1.0);
}
}
}
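
The `pad_tokens` helper above splits a token sequence into windows of `max_length`, framing each window with BOS and EOS and filling the remainder of the last window with a pad token. The following self-contained sketch shows that chunking scheme for token ids only (the per-token weights handled in the diff are left out). The function name `pad_chunks` and the explicit `bos`/`eos`/`pad` parameters are illustrative, and the input is assumed to contain no special tokens yet.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative chunked padding (weights omitted): every window has exactly
// `max_length` entries: BOS, up to max_length - 2 content tokens, EOS, and
// pad tokens for whatever space is left.
std::vector<int> pad_chunks(const std::vector<int>& tokens,
                            std::size_t max_length,
                            int bos,
                            int eos,
                            int pad) {
    assert(max_length > 2);
    const std::size_t body = max_length - 2;  // content slots per window
    std::size_t n_chunks = (tokens.size() + body - 1) / body;
    if (n_chunks == 0) {
        n_chunks = 1;  // an empty prompt still yields one window
    }
    std::vector<int> out;
    out.reserve(n_chunks * max_length);
    std::size_t idx = 0;
    for (std::size_t c = 0; c < n_chunks; ++c) {
        out.push_back(bos);
        for (std::size_t k = 0; k < body && idx < tokens.size(); ++k) {
            out.push_back(tokens[idx++]);
        }
        out.push_back(eos);
        while (out.size() < (c + 1) * max_length) {
            out.push_back(pad);  // only the last window can need padding
        }
    }
    return out;
}
```

For example, with `max_length = 5` a four-token prompt occupies two windows: the first is full, and the second holds one content token followed by EOS and two pad tokens.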
std::tuple<std::vector<int>, std::vector<float>, std::vector<bool>>
tokenize_with_trigger_token(std::string text,
int num_input_imgs,
int32_t image_token,
size_t max_length = 0,
bool padding = false) {
auto parsed_attention = parse_prompt_attention(text);
{
std::stringstream ss;
ss << "[";
for (const auto& item : parsed_attention) {
ss << "['" << item.first << "', " << item.second << "], ";
}
ss << "]";
LOG_DEBUG("parse '%s' to %s", text.c_str(), ss.str().c_str());
}
auto on_new_token_cb = [&](std::string& str, std::vector<int32_t>& bpe_tokens) -> bool {
size_t word_end = str.find(",");
std::string embd_name = word_end == std::string::npos ? str : str.substr(0, word_end);
embd_name = trim(embd_name);
std::string embd_path = get_full_path(embd_dir, embd_name + ".pt");
if (embd_path.size() == 0) {
embd_path = get_full_path(embd_dir, embd_name + ".ckpt");
}
if (embd_path.size() == 0) {
embd_path = get_full_path(embd_dir, embd_name + ".safetensors");
}
if (embd_path.size() > 0) {
if (load_embedding(embd_name, embd_path, bpe_tokens)) {
if (word_end != std::string::npos) {
str = str.substr(word_end);
} else {
str = "";
}
return true;
}
}
return false;
};
std::vector<int> tokens;
std::vector<float> weights;
std::vector<bool> class_token_mask;
int32_t class_idx = -1, tokens_acc = 0;
for (const auto& item : parsed_attention) {
std::vector<int> class_token_index;
std::vector<int> clean_input_ids;
const std::string& curr_text = item.first;
float curr_weight = item.second;
// printf(" %s: %f \n", curr_text.c_str(), curr_weight);
std::vector<int> curr_tokens = tokenizer.encode(curr_text, on_new_token_cb);
int32_t clean_index = 0;
for (uint32_t i = 0; i < curr_tokens.size(); i++) {
int token_id = curr_tokens[i];
if (token_id == image_token)
class_token_index.push_back(clean_index - 1);
else {
clean_input_ids.push_back(token_id);
clean_index++;
}
}
// GGML_ASSERT(class_token_index.size() == 1); // PhotoMaker currently does not support multiple
// trigger words in a single prompt.
if (class_token_index.size() == 1) {
// Expand the class word token and corresponding mask
int class_token = clean_input_ids[class_token_index[0]];
class_idx = tokens_acc + class_token_index[0];
std::vector<int> clean_input_ids_tmp;
for (uint32_t i = 0; i < class_token_index[0]; i++)
clean_input_ids_tmp.push_back(clean_input_ids[i]);
for (uint32_t i = 0; i < num_input_imgs; i++)
clean_input_ids_tmp.push_back(class_token);
for (uint32_t i = class_token_index[0] + 1; i < clean_input_ids.size(); i++)
clean_input_ids_tmp.push_back(clean_input_ids[i]);
clean_input_ids.clear();
clean_input_ids = clean_input_ids_tmp;
}
tokens_acc += clean_index;
tokens.insert(tokens.end(), clean_input_ids.begin(), clean_input_ids.end());
weights.insert(weights.end(), clean_input_ids.size(), curr_weight);
}
tokens.insert(tokens.begin(), BOS_TOKEN_ID);
weights.insert(weights.begin(), 1.0);
pad_tokens(tokens, weights, max_length, padding);
for (uint32_t i = 0; i < tokens.size(); i++) {
if (class_idx + 1 <= i && i < class_idx + 1 + num_input_imgs)
class_token_mask.push_back(true);
else
class_token_mask.push_back(false);
}
// printf("[");
// for (int i = 0; i < tokens.size(); i++) {
// printf("%d, ", class_token_mask[i] ? 1 : 0);
// }
// printf("]\n");
// for (int i = 0; i < tokens.size(); i++) {
// std::cout << tokens[i] << ":" << weights[i] << ", ";
// }
// std::cout << std::endl;
return std::make_tuple(tokens, weights, class_token_mask);
}
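
`tokenize_with_trigger_token` above removes the trigger token from the prompt, repeats the class-word token that precedes it once per input image, and returns a mask marking those repeated positions. The sketch below isolates that step on a plain id list. The names `TriggerExpansion` and `expand_trigger` are illustrative, no BOS token or padding is added (the diff shifts the mask by one to account for BOS), and, as an assumption of this sketch, only the first trigger occurrence is honoured while any further ones are simply dropped.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Result of expanding the class word in front of a trigger token.
struct TriggerExpansion {
    std::vector<int> tokens;       // ids with the trigger removed and the class word repeated
    std::vector<bool> class_mask;  // true at the repeated class-word positions
    bool found = false;            // whether a usable trigger was present
};

// Illustrative expansion (not the method from the commit): the token right
// before `trigger_id` is treated as the class word and repeated
// `num_input_imgs` times.
TriggerExpansion expand_trigger(const std::vector<int>& ids, int trigger_id, int num_input_imgs) {
    assert(num_input_imgs > 0);
    std::vector<int> clean;
    long class_pos = -1;  // index of the class word within `clean`
    for (int id : ids) {
        if (id == trigger_id) {
            if (class_pos < 0 && !clean.empty()) {
                class_pos = static_cast<long>(clean.size()) - 1;
            }
        } else {
            clean.push_back(id);
        }
    }
    TriggerExpansion result;
    result.found = class_pos >= 0;
    for (std::size_t i = 0; i < clean.size(); ++i) {
        const bool is_class = result.found && static_cast<long>(i) == class_pos;
        const int repeats = is_class ? num_input_imgs : 1;
        for (int k = 0; k < repeats; ++k) {
            result.tokens.push_back(clean[i]);
            result.class_mask.push_back(is_class);
        }
    }
    return result;
}
```

The masked positions are the slots that an ID encoder could later overwrite with image-derived embeddings, one per input image.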

     std::pair<std::vector<int>, std::vector<float>> tokenize(std::string text,
                                                              size_t max_length = 0,
                                                              bool padding = false) {
@@ -1078,49 +1362,7 @@ struct FrozenCLIPEmbedderWithCustomWords : public GGMLModule {
             weights.insert(weights.end(), curr_tokens.size(), curr_weight);
         }

-        if (max_length > 0 && padding) {
-            size_t n = std::ceil(tokens.size() * 1.0 / (max_length - 2));
-            if (n == 0) {
-                n = 1;
-            }
-            size_t length = max_length * n;
-            LOG_DEBUG("token length: %llu", length);
-            std::vector<int> new_tokens;
-            std::vector<float> new_weights;
-            new_tokens.push_back(BOS_TOKEN_ID);
-            new_weights.push_back(1.0);
-            int token_idx = 0;
-            for (int i = 1; i < length; i++) {
-                if (token_idx >= tokens.size()) {
-                    break;
-                }
-                if (i % max_length == 0) {
-                    new_tokens.push_back(BOS_TOKEN_ID);
-                    new_weights.push_back(1.0);
-                } else if (i % max_length == max_length - 1) {
-                    new_tokens.push_back(EOS_TOKEN_ID);
-                    new_weights.push_back(1.0);
-                } else {
-                    new_tokens.push_back(tokens[token_idx]);
-                    new_weights.push_back(weights[token_idx]);
-                    token_idx++;
-                }
-            }
-            new_tokens.push_back(EOS_TOKEN_ID);
-            new_weights.push_back(1.0);
-            tokens = new_tokens;
-            weights = new_weights;
-            if (padding) {
-                int pad_token_id = PAD_TOKEN_ID;
-                if (version == VERSION_2_x) {
-                    pad_token_id = 0;
-                }
-                tokens.insert(tokens.end(), length - tokens.size(), pad_token_id);
-                weights.insert(weights.end(), length - weights.size(), 1.0);
-            }
-        }
+        pad_tokens(tokens, weights, max_length, padding);

         // for (int i = 0; i < tokens.size(); i++) {
         //     std::cout << tokens[i] << ":" << weights[i] << ", ";
@ -1132,10 +1374,10 @@ struct FrozenCLIPEmbedderWithCustomWords : public GGMLModule {
}; };
struct FrozenCLIPVisionEmbedder : public GGMLModule { struct FrozenCLIPVisionEmbedder : public GGMLModule {
CLIPVisionModel vision_model; CLIPVisionModelProjection vision_model;
FrozenCLIPVisionEmbedder(ggml_backend_t backend, ggml_type wtype) FrozenCLIPVisionEmbedder(ggml_backend_t backend, ggml_type wtype)
: GGMLModule(backend, wtype) { : vision_model(OPEN_CLIP_VIT_H_14, true), GGMLModule(backend, wtype) {
vision_model.init(params_ctx, wtype); vision_model.init(params_ctx, wtype);
} }
@ -1152,7 +1394,7 @@ struct FrozenCLIPVisionEmbedder : public GGMLModule {
} }
void get_param_tensors(std::map<std::string, struct ggml_tensor*>& tensors, const std::string prefix) { void get_param_tensors(std::map<std::string, struct ggml_tensor*>& tensors, const std::string prefix) {
vision_model.get_param_tensors(tensors, prefix + "transformer.visual_model"); vision_model.get_param_tensors(tensors, prefix + "transformer");
} }
struct ggml_cgraph* build_graph(struct ggml_tensor* pixel_values) { struct ggml_cgraph* build_graph(struct ggml_tensor* pixel_values) {
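The inline padding block that the `tokenize` hunk above replaces with a `pad_tokens` call can be read as a standalone routine. The sketch below mirrors only the removed lines shown in the diff; `pad_to_chunks` is a hypothetical name, weights are omitted, the BOS/EOS/PAD ids are passed in rather than taken from the tokenizer, and `max_length > 2` is assumed.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Split tokens into windows of max_length, each framed by BOS and EOS,
// then pad the final window up to a multiple of max_length.
std::vector<int> pad_to_chunks(const std::vector<int>& tokens,
                               size_t max_length,
                               int bos,
                               int eos,
                               int pad) {
    // Each window carries max_length - 2 prompt tokens.
    size_t n = static_cast<size_t>(std::ceil(tokens.size() * 1.0 / (max_length - 2)));
    if (n == 0) {
        n = 1;
    }
    size_t length = max_length * n;

    std::vector<int> out;
    out.push_back(bos);
    size_t token_idx = 0;
    for (size_t i = 1; i < length; i++) {
        if (token_idx >= tokens.size()) {
            break;
        }
        if (i % max_length == 0) {
            out.push_back(bos);  // start of a new window
        } else if (i % max_length == max_length - 1) {
            out.push_back(eos);  // end of the current window
        } else {
            out.push_back(tokens[token_idx++]);
        }
    }
    out.push_back(eos);
    out.insert(out.end(), length - out.size(), pad);
    return out;
}
```

For a window size of 4 and three prompt tokens, this yields two windows: the first holds two tokens, the second holds one token followed by EOS and one pad id.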


@ -10,6 +10,7 @@
#include "stable-diffusion.h" #include "stable-diffusion.h"
#define STB_IMAGE_IMPLEMENTATION #define STB_IMAGE_IMPLEMENTATION
#define STB_IMAGE_STATIC
#include "stb_image.h" #include "stb_image.h"
#define STB_IMAGE_WRITE_IMPLEMENTATION #define STB_IMAGE_WRITE_IMPLEMENTATION
@ -65,6 +66,8 @@ struct SDParams {
std::string esrgan_path; std::string esrgan_path;
std::string controlnet_path; std::string controlnet_path;
std::string embeddings_path; std::string embeddings_path;
std::string stacked_id_embeddings_path;
std::string input_id_images_path;
sd_type_t wtype = SD_TYPE_COUNT; sd_type_t wtype = SD_TYPE_COUNT;
std::string lora_model_dir; std::string lora_model_dir;
std::string output_path = "output.png"; std::string output_path = "output.png";
@ -73,12 +76,13 @@ struct SDParams {
std::string prompt; std::string prompt;
std::string negative_prompt; std::string negative_prompt;
float min_cfg = 1.0f; float min_cfg = 1.0f;
float cfg_scale = 7.0f; float cfg_scale = 7.0f;
int clip_skip = -1; // <= 0 represents unspecified float style_ratio = 20.f;
int width = 512; int clip_skip = -1; // <= 0 represents unspecified
int height = 512; int width = 512;
int batch_count = 1; int height = 512;
int batch_count = 1;
int video_frames = 6; int video_frames = 6;
int motion_bucket_id = 127; int motion_bucket_id = 127;
@ -95,6 +99,9 @@ struct SDParams {
bool verbose = false; bool verbose = false;
bool vae_tiling = false; bool vae_tiling = false;
bool control_net_cpu = false; bool control_net_cpu = false;
bool normalize_input = false;
bool clip_on_cpu = false;
bool vae_on_cpu = false;
bool canny_preprocess = false; bool canny_preprocess = false;
int upscale_repeats = 1; int upscale_repeats = 1;
}; };
@ -110,10 +117,16 @@ void print_params(SDParams params) {
printf(" esrgan_path: %s\n", params.esrgan_path.c_str()); printf(" esrgan_path: %s\n", params.esrgan_path.c_str());
printf(" controlnet_path: %s\n", params.controlnet_path.c_str()); printf(" controlnet_path: %s\n", params.controlnet_path.c_str());
printf(" embeddings_path: %s\n", params.embeddings_path.c_str()); printf(" embeddings_path: %s\n", params.embeddings_path.c_str());
printf(" stacked_id_embeddings_path: %s\n", params.stacked_id_embeddings_path.c_str());
printf(" input_id_images_path: %s\n", params.input_id_images_path.c_str());
printf(" style ratio: %.2f\n", params.style_ratio);
printf(" normalize input image : %s\n", params.normalize_input ? "true" : "false");
printf(" output_path: %s\n", params.output_path.c_str()); printf(" output_path: %s\n", params.output_path.c_str());
printf(" init_img: %s\n", params.input_path.c_str()); printf(" init_img: %s\n", params.input_path.c_str());
printf(" control_image: %s\n", params.control_image_path.c_str()); printf(" control_image: %s\n", params.control_image_path.c_str());
printf(" clip on cpu: %s\n", params.clip_on_cpu ? "true" : "false");
printf(" controlnet cpu: %s\n", params.control_net_cpu ? "true" : "false"); printf(" controlnet cpu: %s\n", params.control_net_cpu ? "true" : "false");
printf(" vae decoder on cpu:%s\n", params.vae_on_cpu ? "true" : "false");
printf(" strength(control): %.2f\n", params.control_strength); printf(" strength(control): %.2f\n", params.control_strength);
printf(" prompt: %s\n", params.prompt.c_str()); printf(" prompt: %s\n", params.prompt.c_str());
printf(" negative_prompt: %s\n", params.negative_prompt.c_str()); printf(" negative_prompt: %s\n", params.negative_prompt.c_str());
@ -146,6 +159,9 @@ void print_usage(int argc, const char* argv[]) {
printf(" --taesd [TAESD_PATH] path to taesd. Using Tiny AutoEncoder for fast decoding (low quality)\n"); printf(" --taesd [TAESD_PATH] path to taesd. Using Tiny AutoEncoder for fast decoding (low quality)\n");
printf(" --control-net [CONTROL_PATH] path to control net model\n"); printf(" --control-net [CONTROL_PATH] path to control net model\n");
printf(" --embd-dir [EMBEDDING_PATH] path to embeddings.\n"); printf(" --embd-dir [EMBEDDING_PATH] path to embeddings.\n");
printf(" --stacked-id-embd-dir [DIR] path to PHOTOMAKER stacked id embeddings.\n");
printf(" --input-id-images-dir [DIR] path to PHOTOMAKER input id images dir.\n");
printf(" --normalize-input normalize PHOTOMAKER input id images\n");
printf(" --upscale-model [ESRGAN_PATH] path to esrgan model. Upscale images after generate, just RealESRGAN_x4plus_anime_6B supported by now.\n"); printf(" --upscale-model [ESRGAN_PATH] path to esrgan model. Upscale images after generate, just RealESRGAN_x4plus_anime_6B supported by now.\n");
printf(" --upscale-repeats Run the ESRGAN upscaler this many times (default 1)\n"); printf(" --upscale-repeats Run the ESRGAN upscaler this many times (default 1)\n");
printf(" --type [TYPE] weight type (f32, f16, q4_0, q4_1, q5_0, q5_1, q8_0)\n"); printf(" --type [TYPE] weight type (f32, f16, q4_0, q4_1, q5_0, q5_1, q8_0)\n");
@ -158,6 +174,7 @@ void print_usage(int argc, const char* argv[]) {
printf(" -n, --negative-prompt PROMPT the negative prompt (default: \"\")\n"); printf(" -n, --negative-prompt PROMPT the negative prompt (default: \"\")\n");
printf(" --cfg-scale SCALE unconditional guidance scale: (default: 7.0)\n"); printf(" --cfg-scale SCALE unconditional guidance scale: (default: 7.0)\n");
printf(" --strength STRENGTH strength for noising/unnoising (default: 0.75)\n"); printf(" --strength STRENGTH strength for noising/unnoising (default: 0.75)\n");
printf(" --style-ratio STYLE-RATIO strength for keeping input identity (default: 20%%)\n");
printf(" --control-strength STRENGTH strength to apply Control Net (default: 0.9)\n"); printf(" --control-strength STRENGTH strength to apply Control Net (default: 0.9)\n");
printf(" 1.0 corresponds to full destruction of information in init image\n"); printf(" 1.0 corresponds to full destruction of information in init image\n");
printf(" -H, --height H image height, in pixel space (default: 512)\n"); printf(" -H, --height H image height, in pixel space (default: 512)\n");
@ -244,6 +261,18 @@ void parse_args(int argc, const char** argv, SDParams& params) {
break; break;
} }
params.embeddings_path = argv[i]; params.embeddings_path = argv[i];
} else if (arg == "--stacked-id-embd-dir") {
if (++i >= argc) {
invalid_arg = true;
break;
}
params.stacked_id_embeddings_path = argv[i];
} else if (arg == "--input-id-images-dir") {
if (++i >= argc) {
invalid_arg = true;
break;
}
params.input_id_images_path = argv[i];
} else if (arg == "--type") { } else if (arg == "--type") {
if (++i >= argc) { if (++i >= argc) {
invalid_arg = true; invalid_arg = true;
@ -327,6 +356,12 @@ void parse_args(int argc, const char** argv, SDParams& params) {
break; break;
} }
params.strength = std::stof(argv[i]); params.strength = std::stof(argv[i]);
} else if (arg == "--style-ratio") {
if (++i >= argc) {
invalid_arg = true;
break;
}
params.style_ratio = std::stof(argv[i]);
} else if (arg == "--control-strength") { } else if (arg == "--control-strength") {
if (++i >= argc) { if (++i >= argc) {
invalid_arg = true; invalid_arg = true;
@ -361,6 +396,12 @@ void parse_args(int argc, const char** argv, SDParams& params) {
params.vae_tiling = true; params.vae_tiling = true;
} else if (arg == "--control-net-cpu") { } else if (arg == "--control-net-cpu") {
params.control_net_cpu = true; params.control_net_cpu = true;
} else if (arg == "--normalize-input") {
params.normalize_input = true;
} else if (arg == "--clip-on-cpu") {
params.clip_on_cpu = true; // will slow down get_learned_condition but necessary for low MEM GPUs
} else if (arg == "--vae-on-cpu") {
params.vae_on_cpu = true; // will slow down latent decoding but necessary for low MEM GPUs
} else if (arg == "--canny") { } else if (arg == "--canny") {
params.canny_preprocess = true; params.canny_preprocess = true;
} else if (arg == "-b" || arg == "--batch-count") { } else if (arg == "-b" || arg == "--batch-count") {
@ -613,6 +654,7 @@ int main(int argc, const char* argv[]) {
params.controlnet_path.c_str(), params.controlnet_path.c_str(),
params.lora_model_dir.c_str(), params.lora_model_dir.c_str(),
params.embeddings_path.c_str(), params.embeddings_path.c_str(),
params.stacked_id_embeddings_path.c_str(),
vae_decode_only, vae_decode_only,
params.vae_tiling, params.vae_tiling,
true, true,
@ -620,7 +662,9 @@ int main(int argc, const char* argv[]) {
params.wtype, params.wtype,
params.rng_type, params.rng_type,
params.schedule, params.schedule,
params.control_net_cpu); params.clip_on_cpu,
params.control_net_cpu,
params.vae_on_cpu);
if (sd_ctx == NULL) { if (sd_ctx == NULL) {
printf("new_sd_ctx_t failed\n"); printf("new_sd_ctx_t failed\n");
@ -664,7 +708,10 @@ int main(int argc, const char* argv[]) {
params.seed, params.seed,
params.batch_count, params.batch_count,
control_image, control_image,
params.control_strength); params.control_strength,
params.style_ratio,
params.normalize_input,
params.input_id_images_path.c_str());
} else { } else {
sd_image_t input_image = {(uint32_t)params.width, sd_image_t input_image = {(uint32_t)params.width,
(uint32_t)params.height, (uint32_t)params.height,
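How `style_ratio` is consumed is not visible in this part of the diff. Assuming it is a percentage of the sampling steps after which the ID-conditioned prompt takes over (the "delayed subject conditioning" named in the commit log), the mapping could look like the sketch below. `start_merge_step` is a hypothetical helper, and the clamping is an added safeguard rather than something the patch shows.

```cpp
#include <algorithm>
#include <cassert>

// Map a style ratio given in percent (the CLI default is 20) to the index of
// the first sampling step that uses the identity-conditioned embedding.
int start_merge_step(float style_ratio_percent, int sample_steps) {
    // Clamp to [0, 100] so that out-of-range CLI input cannot exceed the step count.
    float ratio = std::min(std::max(style_ratio_percent, 0.0f), 100.0f) / 100.0f;
    return static_cast<int>(ratio * sample_steps);
}
```

Under this reading, a ratio of 20 with 50 sampling steps leaves the first 10 steps to the plain text prompt.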


@ -80,8 +80,27 @@ __STATIC_INLINE__ ggml_fp16_t ggml_tensor_get_f16(const ggml_tensor* tensor, int
return *(ggml_fp16_t*)((char*)(tensor->data) + i * tensor->nb[3] + j * tensor->nb[2] + k * tensor->nb[1] + l * tensor->nb[0]); return *(ggml_fp16_t*)((char*)(tensor->data) + i * tensor->nb[3] + j * tensor->nb[2] + k * tensor->nb[1] + l * tensor->nb[0]);
} }
__STATIC_INLINE__ void print_ggml_tensor(struct ggml_tensor* tensor, bool shape_only = false) { static struct ggml_tensor* get_tensor_from_graph(struct ggml_cgraph* gf, const char* name) {
printf("shape(%zu, %zu, %zu, %zu)\n", tensor->ne[0], tensor->ne[1], tensor->ne[2], tensor->ne[3]); struct ggml_tensor* res = NULL;
for (int i = 0; i < gf->n_nodes; i++) {
// printf("%d, %s \n", i, gf->nodes[i]->name);
if (strcmp(ggml_get_name(gf->nodes[i]), name) == 0) {
res = gf->nodes[i];
break;
}
}
for (int i = 0; i < gf->n_leafs; i++) {
// printf("%d, %s \n", i, gf->leafs[i]->name);
if (strcmp(ggml_get_name(gf->leafs[i]), name) == 0) {
res = gf->leafs[i];
break;
}
}
return res;
}
__STATIC_INLINE__ void print_ggml_tensor(struct ggml_tensor* tensor, bool shape_only = false, const char* mark = "") {
printf("%s (%s): shape(%zu, %zu, %zu, %zu)\n", mark, ggml_type_name(tensor->type), tensor->ne[0], tensor->ne[1], tensor->ne[2], tensor->ne[3]);
fflush(stdout); fflush(stdout);
if (shape_only) { if (shape_only) {
return; return;
@ -217,6 +236,23 @@ __STATIC_INLINE__ uint8_t* sd_tensor_to_image(struct ggml_tensor* input) {
return image_data; return image_data;
} }
__STATIC_INLINE__ uint8_t* sd_tensor_to_mul_image(struct ggml_tensor* input, int idx) {
int64_t width = input->ne[0];
int64_t height = input->ne[1];
int64_t channels = input->ne[2];
GGML_ASSERT(channels == 3 && input->type == GGML_TYPE_F32);
uint8_t* image_data = (uint8_t*)malloc(width * height * channels);
for (int iy = 0; iy < height; iy++) {
for (int ix = 0; ix < width; ix++) {
for (int k = 0; k < channels; k++) {
float value = ggml_tensor_get_f32(input, ix, iy, k, idx);
*(image_data + iy * width * channels + ix * channels + k) = (uint8_t)(value * 255.0f);
}
}
}
return image_data;
}
__STATIC_INLINE__ void sd_image_to_tensor(const uint8_t* image_data, __STATIC_INLINE__ void sd_image_to_tensor(const uint8_t* image_data,
struct ggml_tensor* output, struct ggml_tensor* output,
bool scale = true) { bool scale = true) {
@ -237,6 +273,28 @@ __STATIC_INLINE__ void sd_image_to_tensor(const uint8_t* image_data,
} }
} }
__STATIC_INLINE__ void sd_mul_images_to_tensor(const uint8_t* image_data,
struct ggml_tensor* output,
int idx,
float* mean = NULL,
float* std = NULL) {
int64_t width = output->ne[0];
int64_t height = output->ne[1];
int64_t channels = output->ne[2];
GGML_ASSERT(channels == 3 && output->type == GGML_TYPE_F32);
for (int iy = 0; iy < height; iy++) {
for (int ix = 0; ix < width; ix++) {
for (int k = 0; k < channels; k++) {
int value = *(image_data + iy * width * channels + ix * channels + k);
float pixel_val = value / 255.0f;
if (mean != NULL && std != NULL)
pixel_val = (pixel_val - mean[k]) / std[k];
ggml_tensor_set_f32(output, pixel_val, ix, iy, k, idx);
}
}
}
}
__STATIC_INLINE__ void sd_image_f32_to_tensor(const float* image_data, __STATIC_INLINE__ void sd_image_f32_to_tensor(const float* image_data,
struct ggml_tensor* output, struct ggml_tensor* output,
bool scale = true) { bool scale = true) {
@ -247,7 +305,7 @@ __STATIC_INLINE__ void sd_image_f32_to_tensor(const float* image_data,
for (int iy = 0; iy < height; iy++) { for (int iy = 0; iy < height; iy++) {
for (int ix = 0; ix < width; ix++) { for (int ix = 0; ix < width; ix++) {
for (int k = 0; k < channels; k++) { for (int k = 0; k < channels; k++) {
float value = *(image_data + iy * width * channels + ix * channels + k); int value = *(image_data + iy * width * channels + ix * channels + k);
if (scale) { if (scale) {
value /= 255.f; value /= 255.f;
} }
@ -771,7 +829,10 @@ protected:
// compute the required memory // compute the required memory
size_t compute_buffer_size = ggml_gallocr_get_buffer_size(compute_allocr, 0); size_t compute_buffer_size = ggml_gallocr_get_buffer_size(compute_allocr, 0);
LOG_DEBUG("%s compute buffer size: %.2f MB", get_desc().c_str(), compute_buffer_size / 1024.0 / 1024.0); LOG_DEBUG("%s compute buffer size: %.2f MB(%s)",
get_desc().c_str(),
compute_buffer_size / 1024.0 / 1024.0,
ggml_backend_is_cpu(backend) ? "RAM" : "VRAM");
return true; return true;
} }
@ -816,8 +877,11 @@ public:
return false; return false;
} }
size_t params_buffer_size = ggml_backend_buffer_get_size(params_buffer); size_t params_buffer_size = ggml_backend_buffer_get_size(params_buffer);
LOG_DEBUG("%s params backend buffer size = % 6.2f MB (%i tensors)", LOG_DEBUG("%s params backend buffer size = % 6.2f MB(%s) (%i tensors)",
get_desc().c_str(), params_buffer_size / (1024.0 * 1024.0), num_tensors); get_desc().c_str(),
params_buffer_size / (1024.0 * 1024.0),
ggml_backend_is_cpu(backend) ? "RAM" : "VRAM",
num_tensors);
return true; return true;
} }
@ -865,11 +929,8 @@ public:
alloc_compute_buffer(get_graph); alloc_compute_buffer(get_graph);
reset_compute_ctx(); reset_compute_ctx();
struct ggml_cgraph* gf = get_graph(); struct ggml_cgraph* gf = get_graph();
GGML_ASSERT(ggml_gallocr_alloc_graph(compute_allocr, gf)); GGML_ASSERT(ggml_gallocr_alloc_graph(compute_allocr, gf));
cpy_data_to_backend_tensor(); cpy_data_to_backend_tensor();
if (ggml_backend_is_cpu(backend)) { if (ggml_backend_is_cpu(backend)) {
ggml_backend_cpu_set_n_threads(backend, n_threads); ggml_backend_cpu_set_n_threads(backend, n_threads);
} }
@ -879,13 +940,11 @@ public:
ggml_backend_metal_set_n_cb(backend, n_threads); ggml_backend_metal_set_n_cb(backend, n_threads);
} }
#endif #endif
ggml_backend_graph_compute(backend, gf); ggml_backend_graph_compute(backend, gf);
#ifdef GGML_PERF #ifdef GGML_PERF
ggml_graph_print(gf); ggml_graph_print(gf);
#endif #endif
if (output != NULL) { if (output != NULL) {
auto result = gf->nodes[gf->n_nodes - 1]; auto result = gf->nodes[gf->n_nodes - 1];
if (*output == NULL && output_ctx != NULL) { if (*output == NULL && output_ctx != NULL) {
@ -977,13 +1036,11 @@ public:
} }
for (auto& pair : blocks) { for (auto& pair : blocks) {
auto& block = pair.second; auto& block = pair.second;
block->get_param_tensors(tensors, prefix + pair.first); block->get_param_tensors(tensors, prefix + pair.first);
} }
for (auto& pair : params) { for (auto& pair : params) {
struct ggml_tensor* param = pair.second; struct ggml_tensor* param = pair.second;
tensors[prefix + pair.first] = pair.second; tensors[prefix + pair.first] = pair.second;
} }
} }
@ -1243,11 +1300,10 @@ public:
struct ggml_tensor* kqv = ggml_nn_attention(ctx, q, k, v, mask); // [N * n_head, n_token, d_head] struct ggml_tensor* kqv = ggml_nn_attention(ctx, q, k, v, mask); // [N * n_head, n_token, d_head]
kqv = ggml_reshape_4d(ctx, kqv, d_head, n_token, n_head, N); kqv = ggml_reshape_4d(ctx, kqv, d_head, n_token, n_head, N);
kqv = ggml_cont(ctx, ggml_permute(ctx, kqv, 0, 2, 1, 3)); // [N, n_token, n_head, d_head] kqv = ggml_cont(ctx, ggml_permute(ctx, kqv, 0, 2, 1, 3)); // [N, n_token, n_head, d_head]
x = ggml_reshape_3d(ctx, kqv, d_head * n_head, n_token, N); // [N, n_token, d_head * n_head]
x = ggml_reshape_3d(ctx, kqv, d_head * n_head, n_token, N); // [N * n_token, d_head * n_head] x = out_proj->forward(ctx, x); // [N, n_token, embed_dim]
x = out_proj->forward(ctx, x);
return x; return x;
} }
}; };
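The per-image packing done by `sd_mul_images_to_tensor` above can be tried without ggml. In this sketch `ImageBatch` is a hypothetical stand-in for a contiguous F32 tensor with ggml-style dimensions `[W, H, C, N]`, and `pack_image` is an illustrative name; only the indexing and the optional `(x - mean) / std` step follow the diff.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a contiguous F32 tensor with dims [W, H, C, N].
struct ImageBatch {
    size_t width, height, channels, count;
    std::vector<float> data;

    ImageBatch(size_t w, size_t h, size_t c, size_t n)
        : width(w), height(h), channels(c), count(n), data(w * h * c * n, 0.0f) {}

    // Element (ix, iy, k, idx); the width index varies fastest.
    float& at(size_t ix, size_t iy, size_t k, size_t idx) {
        return data[ix + width * (iy + height * (k + channels * idx))];
    }
};

// Copy one interleaved HWC uint8 image into slot idx of the batch,
// scaling to [0, 1] and optionally normalizing each channel.
void pack_image(const uint8_t* hwc,
                ImageBatch& out,
                size_t idx,
                const float* mean   = nullptr,
                const float* stddev = nullptr) {
    for (size_t iy = 0; iy < out.height; iy++) {
        for (size_t ix = 0; ix < out.width; ix++) {
            for (size_t k = 0; k < out.channels; k++) {
                float v = hwc[(iy * out.width + ix) * out.channels + k] / 255.0f;
                if (mean != nullptr && stddev != nullptr) {
                    v = (v - mean[k]) / stddev[k];
                }
                out.at(ix, iy, k, idx) = v;
            }
        }
    }
}
```

Passing null pointers for `mean` and `stddev` skips normalization, which matches the case where `--normalize-input` is not given.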


@ -14,9 +14,10 @@ struct LoraModel : public GGMLModule {
LoraModel(ggml_backend_t backend, LoraModel(ggml_backend_t backend,
ggml_type wtype, ggml_type wtype,
const std::string file_path = "") const std::string& file_path = "",
const std::string& prefix = "")
: file_path(file_path), GGMLModule(backend, wtype) { : file_path(file_path), GGMLModule(backend, wtype) {
if (!model_loader.init_from_file(file_path)) { if (!model_loader.init_from_file(file_path, prefix)) {
load_failed = true; load_failed = true;
} }
} }
@ -33,8 +34,7 @@ struct LoraModel : public GGMLModule {
return model_loader.get_params_mem_size(NULL); return model_loader.get_params_mem_size(NULL);
} }
bool load_from_file(bool filter_tensor = false) {
bool load_from_file() {
LOG_INFO("loading LoRA from '%s'", file_path.c_str()); LOG_INFO("loading LoRA from '%s'", file_path.c_str());
if (load_failed) { if (load_failed) {
@ -46,6 +46,11 @@ struct LoraModel : public GGMLModule {
auto on_new_tensor_cb = [&](const TensorStorage& tensor_storage, ggml_tensor** dst_tensor) -> bool { auto on_new_tensor_cb = [&](const TensorStorage& tensor_storage, ggml_tensor** dst_tensor) -> bool {
const std::string& name = tensor_storage.name; const std::string& name = tensor_storage.name;
if (filter_tensor && !contains(name, "lora")) {
// LOG_INFO("skipping LoRA tensor '%s'", name.c_str());
return true;
}
if (dry_run) { if (dry_run) {
struct ggml_tensor* real = ggml_new_tensor(params_ctx, struct ggml_tensor* real = ggml_new_tensor(params_ctx,
tensor_storage.type, tensor_storage.type,
@ -66,7 +71,6 @@ struct LoraModel : public GGMLModule {
dry_run = false; dry_run = false;
model_loader.load_tensors(on_new_tensor_cb, backend); model_loader.load_tensors(on_new_tensor_cb, backend);
LOG_DEBUG("finished loaded lora"); LOG_DEBUG("finished loaded lora");
return true; return true;
} }
@ -85,6 +89,10 @@ struct LoraModel : public GGMLModule {
} }
k_tensor = k_tensor.substr(0, k_pos); k_tensor = k_tensor.substr(0, k_pos);
replace_all_chars(k_tensor, '.', '_'); replace_all_chars(k_tensor, '.', '_');
// LOG_DEBUG("k_tensor %s", k_tensor.c_str());
if (k_tensor == "model_diffusion_model_output_blocks_2_2_conv") { // fix for SDXL
k_tensor = "model_diffusion_model_output_blocks_2_1_conv";
}
std::string lora_up_name = "lora." + k_tensor + ".lora_up.weight"; std::string lora_up_name = "lora." + k_tensor + ".lora_up.weight";
std::string lora_down_name = "lora." + k_tensor + ".lora_down.weight"; std::string lora_down_name = "lora." + k_tensor + ".lora_down.weight";
std::string alpha_name = "lora." + k_tensor + ".alpha"; std::string alpha_name = "lora." + k_tensor + ".alpha";
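The name construction in this hunk can be illustrated on its own. The sketch assumes the part cut off by `k_pos` is a trailing `.weight` (the diff does not show how `k_pos` is computed); `LoraNames` and `lora_names_for` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Hypothetical bundle of the three tensor names looked up for one model weight.
struct LoraNames {
    std::string up;
    std::string down;
    std::string alpha;
};

LoraNames lora_names_for(std::string k_tensor) {
    // Assumption: strip a trailing ".weight" before building the key.
    const std::string suffix = ".weight";
    if (k_tensor.size() >= suffix.size() &&
        k_tensor.compare(k_tensor.size() - suffix.size(), suffix.size(), suffix) == 0) {
        k_tensor.erase(k_tensor.size() - suffix.size());
    }
    std::replace(k_tensor.begin(), k_tensor.end(), '.', '_');
    // Remap taken from the diff ("fix for SDXL").
    if (k_tensor == "model_diffusion_model_output_blocks_2_2_conv") {
        k_tensor = "model_diffusion_model_output_blocks_2_1_conv";
    }
    return {"lora." + k_tensor + ".lora_up.weight",
            "lora." + k_tensor + ".lora_down.weight",
            "lora." + k_tensor + ".alpha"};
}
```

The remap means that a model tensor under `output_blocks.2.2.conv` looks up its LoRA weights under the `2_1` key.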


@ -108,14 +108,14 @@ std::unordered_map<std::string, std::string> open_clip_to_hf_clip_model = {
{"model.positional_embedding", "transformer.text_model.embeddings.position_embedding.weight"}, {"model.positional_embedding", "transformer.text_model.embeddings.position_embedding.weight"},
{"model.token_embedding.weight", "transformer.text_model.embeddings.token_embedding.weight"}, {"model.token_embedding.weight", "transformer.text_model.embeddings.token_embedding.weight"},
{"model.text_projection", "transformer.text_model.text_projection"}, {"model.text_projection", "transformer.text_model.text_projection"},
{"model.visual.class_embedding", "transformer.visual_model.embeddings.class_embedding"}, {"model.visual.class_embedding", "transformer.vision_model.embeddings.class_embedding"},
{"model.visual.conv1.weight", "transformer.visual_model.embeddings.patch_embedding.weight"}, {"model.visual.conv1.weight", "transformer.vision_model.embeddings.patch_embedding.weight"},
{"model.visual.ln_post.bias", "transformer.visual_model.post_layernorm.bias"}, {"model.visual.ln_post.bias", "transformer.vision_model.post_layernorm.bias"},
{"model.visual.ln_post.weight", "transformer.visual_model.post_layernorm.weight"}, {"model.visual.ln_post.weight", "transformer.vision_model.post_layernorm.weight"},
{"model.visual.ln_pre.bias", "transformer.visual_model.pre_layernorm.bias"}, {"model.visual.ln_pre.bias", "transformer.vision_model.pre_layernorm.bias"},
{"model.visual.ln_pre.weight", "transformer.visual_model.pre_layernorm.weight"}, {"model.visual.ln_pre.weight", "transformer.vision_model.pre_layernorm.weight"},
{"model.visual.positional_embedding", "transformer.visual_model.embeddings.position_embedding.weight"}, {"model.visual.positional_embedding", "transformer.vision_model.embeddings.position_embedding.weight"},
{"model.visual.proj", "transformer.visual_model.visual_projection"}, {"model.visual.proj", "transformer.visual_projection.weight"},
}; };
std::unordered_map<std::string, std::string> open_clip_to_hk_clip_resblock = { std::unordered_map<std::string, std::string> open_clip_to_hk_clip_resblock = {
@ -157,6 +157,10 @@ std::string convert_open_clip_to_hf_clip(const std::string& name) {
} else if (starts_with(new_name, "cond_stage_model.")) { } else if (starts_with(new_name, "cond_stage_model.")) {
prefix = "cond_stage_model."; prefix = "cond_stage_model.";
new_name = new_name.substr(strlen("cond_stage_model.")); new_name = new_name.substr(strlen("cond_stage_model."));
} else if (ends_with(new_name, "vision_model.visual_projection.weight")) {
prefix = new_name.substr(0, new_name.size() - strlen("vision_model.visual_projection.weight"));
new_name = prefix + "visual_projection.weight";
return new_name;
} else { } else {
return new_name; return new_name;
} }
@ -186,7 +190,7 @@ std::string convert_open_clip_to_hf_clip(const std::string& name) {
replace_suffix(); replace_suffix();
open_clip_resblock_prefix = "model.visual.transformer.resblocks."; open_clip_resblock_prefix = "model.visual.transformer.resblocks.";
hf_clip_resblock_prefix = "transformer.visual_model.encoder.layers."; hf_clip_resblock_prefix = "transformer.vision_model.encoder.layers.";
replace_suffix(); replace_suffix();
@ -248,7 +252,7 @@ std::unordered_map<std::string, std::unordered_map<std::string, std::string>> su
}, },
}; };
std::string convert_diffusers_name_to_compvis(const std::string& key, char seq) { std::string convert_diffusers_name_to_compvis(std::string key, char seq) {
std::vector<std::string> m; std::vector<std::string> m;
auto match = [](std::vector<std::string>& match_list, const std::regex& regex, const std::string& key) { auto match = [](std::vector<std::string>& match_list, const std::regex& regex, const std::string& key) {
@ -282,6 +286,11 @@ std::string convert_diffusers_name_to_compvis(const std::string& key, char seq)
return inner_key; return inner_key;
}; };
// convert attn to out
if (ends_with(key, "to_out")) {
key += format("%c0", seq);
}
// unet // unet
if (match(m, std::regex(format("unet%cconv_in(.*)", seq)), key)) { if (match(m, std::regex(format("unet%cconv_in(.*)", seq)), key)) {
return format("model%cdiffusion_model%cinput_blocks%c0%c0", seq, seq, seq, seq) + m[0]; return format("model%cdiffusion_model%cinput_blocks%c0%c0", seq, seq, seq, seq) + m[0];
@ -391,8 +400,8 @@ std::string convert_diffusers_name_to_compvis(const std::string& key, char seq)
} }
std::string convert_tensor_name(const std::string& name) { std::string convert_tensor_name(const std::string& name) {
std::string new_name; std::string new_name = name;
if (starts_with(name, "cond_stage_model.") || starts_with(name, "conditioner.embedders.")) { if (starts_with(name, "cond_stage_model.") || starts_with(name, "conditioner.embedders.") || ends_with(name, ".vision_model.visual_projection.weight")) {
new_name = convert_open_clip_to_hf_clip(name); new_name = convert_open_clip_to_hf_clip(name);
} else if (starts_with(name, "first_stage_model.decoder")) { } else if (starts_with(name, "first_stage_model.decoder")) {
new_name = convert_vae_decoder_name(name); new_name = convert_vae_decoder_name(name);
@ -416,6 +425,26 @@ std::string convert_tensor_name(const std::string& name) {
} else { } else {
new_name = name; new_name = name;
} }
} else if (contains(name, "lora_up") || contains(name, "lora_down") || contains(name, "lora.up") || contains(name, "lora.down")) {
size_t pos = new_name.find(".processor");
if (pos != std::string::npos) {
new_name.replace(pos, strlen(".processor"), "");
}
pos = new_name.find_last_of('_');
if (pos != std::string::npos) {
std::string name_without_network_parts = new_name.substr(0, pos);
std::string network_part = new_name.substr(pos + 1);
// LOG_DEBUG("%s %s", name_without_network_parts.c_str(), network_part.c_str());
std::string new_key = convert_diffusers_name_to_compvis(name_without_network_parts, '.');
replace_all_chars(new_key, '.', '_');
if (starts_with(network_part, "lora.")) {
network_part = "lora_" + network_part.substr(5);
}
if (new_key.size() > 0) {
new_name = "lora." + new_key + "." + network_part;
}
// LOG_DEBUG("new name: %s", new_name.c_str());
}
} else if (starts_with(name, "unet") || starts_with(name, "vae") || starts_with(name, "te")) { // for diffuser } else if (starts_with(name, "unet") || starts_with(name, "vae") || starts_with(name, "te")) { // for diffuser
size_t pos = name.find_last_of('.'); size_t pos = name.find_last_of('.');
if (pos != std::string::npos) { if (pos != std::string::npos) {
@ -830,7 +859,6 @@ bool ModelLoader::init_from_safetensors_file(const std::string& file_path, const
} }
TensorStorage tensor_storage(prefix + name, type, ne, n_dims, file_index, ST_HEADER_SIZE_LEN + header_size_ + begin); TensorStorage tensor_storage(prefix + name, type, ne, n_dims, file_index, ST_HEADER_SIZE_LEN + header_size_ + begin);
tensor_storage.reverse_ne(); tensor_storage.reverse_ne();
size_t tensor_data_size = end - begin; size_t tensor_data_size = end - begin;
@ -1169,7 +1197,9 @@ bool ModelLoader::parse_data_pkl(uint8_t* buffer,
if (reader.phase == PickleTensorReader::READ_DIMENS) { if (reader.phase == PickleTensorReader::READ_DIMENS) {
reader.tensor_storage.reverse_ne(); reader.tensor_storage.reverse_ne();
reader.tensor_storage.file_index = file_index; reader.tensor_storage.file_index = file_index;
reader.tensor_storage.name = prefix + reader.tensor_storage.name; // if(strcmp(prefix.c_str(), "scarlett") == 0)
// printf(" got tensor %s \n ", reader.tensor_storage.name.c_str());
reader.tensor_storage.name = prefix + reader.tensor_storage.name;
tensor_storages.push_back(reader.tensor_storage); tensor_storages.push_back(reader.tensor_storage);
// LOG_DEBUG("%s", reader.tensor_storage.name.c_str()); // LOG_DEBUG("%s", reader.tensor_storage.name.c_str());
// reset // reset
@ -1272,7 +1302,8 @@ std::string ModelLoader::load_merges() {
return merges_utf8_str; return merges_utf8_str;
} }
void remove_duplicates(std::vector<TensorStorage>& vec) { std::vector<TensorStorage> remove_duplicates(const std::vector<TensorStorage>& vec) {
std::vector<TensorStorage> res;
std::unordered_map<std::string, size_t> name_to_index_map; std::unordered_map<std::string, size_t> name_to_index_map;
for (size_t i = 0; i < vec.size(); ++i) { for (size_t i = 0; i < vec.size(); ++i) {
@ -1280,13 +1311,16 @@ void remove_duplicates(std::vector<TensorStorage>& vec) {
auto it = name_to_index_map.find(current_name); auto it = name_to_index_map.find(current_name);
if (it != name_to_index_map.end()) { if (it != name_to_index_map.end()) {
vec[it->second] = vec[i]; res[it->second] = vec[i];
} else { } else {
name_to_index_map[current_name] = i; name_to_index_map[current_name] = i;
res.push_back(vec[i]);
} }
} }
vec.resize(name_to_index_map.size()); // vec.resize(name_to_index_map.size());
return res;
} }
bool ModelLoader::load_tensors(on_new_tensor_cb_t on_new_tensor_cb, ggml_backend_t backend) { bool ModelLoader::load_tensors(on_new_tensor_cb_t on_new_tensor_cb, ggml_backend_t backend) {
@ -1300,7 +1334,9 @@ bool ModelLoader::load_tensors(on_new_tensor_cb_t on_new_tensor_cb, ggml_backend
preprocess_tensor(tensor_storage, processed_tensor_storages); preprocess_tensor(tensor_storage, processed_tensor_storages);
} }
remove_duplicates(processed_tensor_storages); std::vector<TensorStorage> dedup = remove_duplicates(processed_tensor_storages);
processed_tensor_storages = dedup;
bool success = true; bool success = true;
for (size_t file_index = 0; file_index < file_paths_.size(); file_index++) { for (size_t file_index = 0; file_index < file_paths_.size(); file_index++) {
std::string file_path = file_paths_[file_index]; std::string file_path = file_paths_[file_index];
@ -1362,7 +1398,6 @@ bool ModelLoader::load_tensors(on_new_tensor_cb_t on_new_tensor_cb, ggml_backend
if (tensor_storage.file_index != file_index) { if (tensor_storage.file_index != file_index) {
continue; continue;
} }
ggml_tensor* dst_tensor = NULL; ggml_tensor* dst_tensor = NULL;
success = on_new_tensor_cb(tensor_storage, &dst_tensor); success = on_new_tensor_cb(tensor_storage, &dst_tensor);
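The copy-returning `remove_duplicates` above keeps the first position of each name and the last value seen for it. A minimal sketch of that behaviour follows, with `NamedTensor` as a hypothetical stand-in for `TensorStorage`. Note one deliberate difference from the diff: the map here stores the position in the output vector (`res.size()`) rather than the input index `i`, so the overwrite stays in bounds once earlier duplicates have made the two indices diverge.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for TensorStorage: a name plus some payload.
struct NamedTensor {
    std::string name;
    int payload;
};

// Order-preserving de-duplication: first position wins, last value wins.
std::vector<NamedTensor> dedup_keep_last(const std::vector<NamedTensor>& vec) {
    std::vector<NamedTensor> res;
    std::unordered_map<std::string, size_t> name_to_res_index;
    for (const NamedTensor& t : vec) {
        auto it = name_to_res_index.find(t.name);
        if (it != name_to_res_index.end()) {
            res[it->second] = t;  // later entry replaces the earlier one
        } else {
            name_to_res_index[t.name] = res.size();  // index into res, not vec
            res.push_back(t);
        }
    }
    return res;
}
```

An input such as `a, a, b, b` shows why the index choice matters: the second `b` must overwrite slot 1 of the result, whereas its first occurrence sat at input index 2.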


@ -7,6 +7,7 @@
#include <set> #include <set>
#include <sstream> #include <sstream>
#include <string> #include <string>
#include <tuple>
#include <vector> #include <vector>
#include "ggml/ggml-backend.h" #include "ggml/ggml-backend.h"

pmid.hpp (new file, 305 lines)

@ -0,0 +1,305 @@
#ifndef __PMI_HPP__
#define __PMI_HPP__
#include "ggml_extend.hpp"
#include "clip.hpp"
#include "lora.hpp"
struct FuseBlock : public GGMLBlock {
// network hparams
int in_dim;
int out_dim;
int hidden_dim;
bool use_residue;
public:
FuseBlock(int i_d, int o_d, int h_d, bool use_residue = true)
: in_dim(i_d), out_dim(o_d), hidden_dim(h_d), use_residue(use_residue) {
blocks["fc1"] = std::shared_ptr<GGMLBlock>(new Linear(in_dim, hidden_dim, true));
blocks["fc2"] = std::shared_ptr<GGMLBlock>(new Linear(hidden_dim, out_dim, true));
blocks["layernorm"] = std::shared_ptr<GGMLBlock>(new LayerNorm(in_dim));
}
struct ggml_tensor* forward(struct ggml_context* ctx, struct ggml_tensor* x) {
// x: [N, channels, h, w]
auto fc1 = std::dynamic_pointer_cast<Linear>(blocks["fc1"]);
auto fc2 = std::dynamic_pointer_cast<Linear>(blocks["fc2"]);
auto layer_norm = std::dynamic_pointer_cast<LayerNorm>(blocks["layernorm"]);
struct ggml_tensor* r = x;
// x = ggml_nn_layer_norm(ctx, x, ln_w, ln_b);
x = layer_norm->forward(ctx, x);
// x = ggml_add(ctx, ggml_mul_mat(ctx, fc1_w, x), fc1_b);
x = fc1->forward(ctx, x);
x = ggml_gelu_inplace(ctx, x);
x = fc2->forward(ctx, x);
// x = ggml_add(ctx, ggml_mul_mat(ctx, fc2_w, x), fc2_b);
if (use_residue)
x = ggml_add(ctx, x, r);
return x;
}
};
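Note on `FuseBlock::forward` above: the order of operations is layer norm, `fc1`, GELU, `fc2`, then an optional residual add. The sketch below replays that order on plain `std::vector<float>` data so the data flow can be checked without ggml. It is an illustration under assumptions: the layer norm omits the learned scale and shift, the GELU is a common tanh approximation, and all names (`fuse_block`, `linear`, `Mat`, `Vec`) are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<float>;
using Mat = std::vector<Vec>;  // row-major weights, shape [out][in]

// Normalise to zero mean and unit variance (no learned scale or shift).
Vec layer_norm(const Vec& x, float eps = 1e-5f) {
    float mean = 0.f;
    for (float v : x) mean += v;
    mean /= static_cast<float>(x.size());
    float var = 0.f;
    for (float v : x) var += (v - mean) * (v - mean);
    var /= static_cast<float>(x.size());
    Vec y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        y[i] = (x[i] - mean) / std::sqrt(var + eps);
    }
    return y;
}

// y = W x + b
Vec linear(const Mat& w, const Vec& b, const Vec& x) {
    Vec y(b);
    for (std::size_t o = 0; o < w.size(); ++o) {
        for (std::size_t i = 0; i < x.size(); ++i) {
            y[o] += w[o][i] * x[i];
        }
    }
    return y;
}

// Tanh approximation of GELU.
float gelu(float v) {
    const float k = 0.7978845608f;  // sqrt(2 / pi)
    return 0.5f * v * (1.f + std::tanh(k * (v + 0.044715f * v * v * v)));
}

// Same step order as the block above: norm, fc1, GELU, fc2, optional residual.
Vec fuse_block(const Vec& x, const Mat& w1, const Vec& b1,
               const Mat& w2, const Vec& b2, bool use_residue) {
    Vec h = linear(w1, b1, layer_norm(x));
    for (float& v : h) v = gelu(v);
    Vec y = linear(w2, b2, h);
    if (use_residue) {
        assert(y.size() == x.size());  // residual needs in_dim == out_dim
        for (std::size_t i = 0; i < y.size(); ++i) y[i] += x[i];
    }
    return y;
}
```

The assertion makes explicit why `mlp1` in `FuseModule` is built with `use_residue = false`: its input width is twice its output width, so the residual add would not line up.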
struct FuseModule : public GGMLBlock {
// network hparams
int embed_dim;
public:
FuseModule(int imb_d)
: embed_dim(imb_d) {
blocks["mlp1"] = std::shared_ptr<GGMLBlock>(new FuseBlock(imb_d * 2, imb_d, imb_d, false));
blocks["mlp2"] = std::shared_ptr<GGMLBlock>(new FuseBlock(imb_d, imb_d, imb_d, true));
blocks["layer_norm"] = std::shared_ptr<GGMLBlock>(new LayerNorm(embed_dim));
}
struct ggml_tensor* fuse_fn(struct ggml_context* ctx,
struct ggml_tensor* prompt_embeds,
struct ggml_tensor* id_embeds) {
auto mlp1 = std::dynamic_pointer_cast<FuseBlock>(blocks["mlp1"]);
auto mlp2 = std::dynamic_pointer_cast<FuseBlock>(blocks["mlp2"]);
auto layer_norm = std::dynamic_pointer_cast<LayerNorm>(blocks["layer_norm"]);
auto prompt_embeds0 = ggml_cont(ctx, ggml_permute(ctx, prompt_embeds, 2, 0, 1, 3));
auto id_embeds0 = ggml_cont(ctx, ggml_permute(ctx, id_embeds, 2, 0, 1, 3));
// concat is along dim 2
auto stacked_id_embeds = ggml_concat(ctx, prompt_embeds0, id_embeds0);
stacked_id_embeds = ggml_cont(ctx, ggml_permute(ctx, stacked_id_embeds, 1, 2, 0, 3));
// stacked_id_embeds = mlp1.forward(ctx, stacked_id_embeds);
// stacked_id_embeds = ggml_add(ctx, stacked_id_embeds, prompt_embeds);
// stacked_id_embeds = mlp2.forward(ctx, stacked_id_embeds);
// stacked_id_embeds = ggml_nn_layer_norm(ctx, stacked_id_embeds, ln_w, ln_b);
stacked_id_embeds = mlp1->forward(ctx, stacked_id_embeds);
stacked_id_embeds = ggml_add(ctx, stacked_id_embeds, prompt_embeds);
stacked_id_embeds = mlp2->forward(ctx, stacked_id_embeds);
stacked_id_embeds = layer_norm->forward(ctx, stacked_id_embeds);
return stacked_id_embeds;
}
struct ggml_tensor* forward(struct ggml_context* ctx,
struct ggml_tensor* prompt_embeds,
struct ggml_tensor* id_embeds,
struct ggml_tensor* class_tokens_mask,
struct ggml_tensor* class_tokens_mask_pos,
struct ggml_tensor* left,
struct ggml_tensor* right) {
// x: [N, channels, h, w]
struct ggml_tensor* valid_id_embeds = id_embeds;
// # slice out the image token embeddings
// print_ggml_tensor(class_tokens_mask_pos, false);
ggml_set_name(class_tokens_mask_pos, "class_tokens_mask_pos");
ggml_set_name(prompt_embeds, "prompt_embeds");
// print_ggml_tensor(valid_id_embeds, true, "valid_id_embeds");
// print_ggml_tensor(class_tokens_mask_pos, true, "class_tokens_mask_pos");
struct ggml_tensor* image_token_embeds = ggml_get_rows(ctx, prompt_embeds, class_tokens_mask_pos);
ggml_set_name(image_token_embeds, "image_token_embeds");
struct ggml_tensor* stacked_id_embeds = fuse_fn(ctx, image_token_embeds, valid_id_embeds);
stacked_id_embeds = ggml_cont(ctx, ggml_permute(ctx, stacked_id_embeds, 0, 2, 1, 3));
if (left && right) {
stacked_id_embeds = ggml_concat(ctx, left, stacked_id_embeds);
stacked_id_embeds = ggml_concat(ctx, stacked_id_embeds, right);
} else if (left) {
stacked_id_embeds = ggml_concat(ctx, left, stacked_id_embeds);
} else if (right) {
stacked_id_embeds = ggml_concat(ctx, stacked_id_embeds, right);
}
stacked_id_embeds = ggml_cont(ctx, ggml_permute(ctx, stacked_id_embeds, 0, 2, 1, 3));
class_tokens_mask = ggml_cont(ctx, ggml_transpose(ctx, class_tokens_mask));
class_tokens_mask = ggml_repeat(ctx, class_tokens_mask, prompt_embeds);
prompt_embeds = ggml_mul(ctx, prompt_embeds, class_tokens_mask);
struct ggml_tensor* updated_prompt_embeds = ggml_add(ctx, prompt_embeds, stacked_id_embeds);
ggml_set_name(updated_prompt_embeds, "updated_prompt_embeds");
return updated_prompt_embeds;
}
};
struct PhotoMakerIDEncoderBlock : public CLIPVisionModelProjection {
PhotoMakerIDEncoderBlock()
: CLIPVisionModelProjection(OPENAI_CLIP_VIT_L_14) {
blocks["visual_projection_2"] = std::shared_ptr<GGMLBlock>(new Linear(1024, 1280, false));
blocks["fuse_module"] = std::shared_ptr<GGMLBlock>(new FuseModule(2048));
}
struct ggml_tensor* forward(struct ggml_context* ctx,
struct ggml_tensor* id_pixel_values,
struct ggml_tensor* prompt_embeds,
struct ggml_tensor* class_tokens_mask,
struct ggml_tensor* class_tokens_mask_pos,
struct ggml_tensor* left,
struct ggml_tensor* right) {
// x: [N, channels, h, w]
auto vision_model = std::dynamic_pointer_cast<CLIPVisionModel>(blocks["vision_model"]);
auto visual_projection = std::dynamic_pointer_cast<CLIPProjection>(blocks["visual_projection"]);
auto visual_projection_2 = std::dynamic_pointer_cast<Linear>(blocks["visual_projection_2"]);
auto fuse_module = std::dynamic_pointer_cast<FuseModule>(blocks["fuse_module"]);
struct ggml_tensor* shared_id_embeds = vision_model->forward(ctx, id_pixel_values); // [N, hidden_size]
struct ggml_tensor* id_embeds = visual_projection->forward(ctx, shared_id_embeds); // [N, proj_dim(768)]
struct ggml_tensor* id_embeds_2 = visual_projection_2->forward(ctx, shared_id_embeds); // [N, 1280]
id_embeds = ggml_cont(ctx, ggml_permute(ctx, id_embeds, 2, 0, 1, 3));
id_embeds_2 = ggml_cont(ctx, ggml_permute(ctx, id_embeds_2, 2, 0, 1, 3));
id_embeds = ggml_concat(ctx, id_embeds, id_embeds_2); // [batch_size, seq_length, 1, 2048] check whether concat at dim 2 is right
id_embeds = ggml_cont(ctx, ggml_permute(ctx, id_embeds, 1, 2, 0, 3));
struct ggml_tensor* updated_prompt_embeds = fuse_module->forward(ctx,
prompt_embeds,
id_embeds,
class_tokens_mask,
class_tokens_mask_pos,
left, right);
return updated_prompt_embeds;
}
};
struct PhotoMakerIDEncoder : public GGMLModule {
public:
SDVersion version = VERSION_XL;
PhotoMakerIDEncoderBlock id_encoder;
float style_strength;
std::vector<float> ctm;
std::vector<ggml_fp16_t> ctmf16;
std::vector<int> ctmpos;
std::vector<ggml_fp16_t> zeros_left_16;
std::vector<float> zeros_left;
std::vector<ggml_fp16_t> zeros_right_16;
std::vector<float> zeros_right;
public:
PhotoMakerIDEncoder(ggml_backend_t backend, ggml_type wtype, SDVersion version = VERSION_XL, float sty = 20.f)
: GGMLModule(backend, wtype),
version(version),
style_strength(sty) {
id_encoder.init(params_ctx, wtype);
}
std::string get_desc() {
return "pmid";
}
size_t get_params_mem_size() {
size_t params_mem_size = id_encoder.get_params_mem_size();
return params_mem_size;
}
size_t get_params_num() {
size_t params_num = id_encoder.get_params_num();
return params_num;
}
void get_param_tensors(std::map<std::string, struct ggml_tensor*>& tensors, const std::string prefix) {
id_encoder.get_param_tensors(tensors, prefix);
}
struct ggml_cgraph* build_graph( // struct ggml_allocr* allocr,
struct ggml_tensor* id_pixel_values,
struct ggml_tensor* prompt_embeds,
std::vector<bool>& class_tokens_mask) {
ctm.clear();
ctmf16.clear();
ctmpos.clear();
zeros_left.clear();
zeros_left_16.clear();
zeros_right.clear();
zeros_right_16.clear();
ggml_context* ctx0 = compute_ctx;
struct ggml_cgraph* gf = ggml_new_graph(compute_ctx);
int64_t hidden_size = prompt_embeds->ne[0];
int64_t seq_length = prompt_embeds->ne[1];
ggml_type type = GGML_TYPE_F32;
struct ggml_tensor* class_tokens_mask_d = ggml_new_tensor_1d(ctx0, type, class_tokens_mask.size());
struct ggml_tensor* id_pixel_values_d = to_backend(id_pixel_values);
struct ggml_tensor* prompt_embeds_d = to_backend(prompt_embeds);
struct ggml_tensor* left = NULL;
struct ggml_tensor* right = NULL;
for (int i = 0; i < class_tokens_mask.size(); i++) {
if (class_tokens_mask[i]) {
ctm.push_back(0.f); // here use 0.f instead of 1.f to make a scale mask
ctmf16.push_back(ggml_fp32_to_fp16(0.f)); // here use 0.f instead of 1.f to make a scale mask
ctmpos.push_back(i);
} else {
ctm.push_back(1.f); // here use 1.f instead of 0.f to make a scale mask
ctmf16.push_back(ggml_fp32_to_fp16(1.f)); // here use 1.f instead of 0.f to make a scale mask
}
}
if (ctmpos[0] > 0) {
left = ggml_new_tensor_3d(ctx0, type, hidden_size, 1, ctmpos[0]);
}
if (ctmpos[ctmpos.size() - 1] < seq_length - 1) {
right = ggml_new_tensor_3d(ctx0, type,
hidden_size, 1, seq_length - ctmpos[ctmpos.size() - 1] - 1);
}
struct ggml_tensor* class_tokens_mask_pos = ggml_new_tensor_1d(ctx0, GGML_TYPE_I32, ctmpos.size());
{
if (type == GGML_TYPE_F16)
set_backend_tensor_data(class_tokens_mask_d, ctmf16.data());
else
set_backend_tensor_data(class_tokens_mask_d, ctm.data());
set_backend_tensor_data(class_tokens_mask_pos, ctmpos.data());
if (left) {
if (type == GGML_TYPE_F16) {
for (int i = 0; i < ggml_nelements(left); ++i)
zeros_left_16.push_back(ggml_fp32_to_fp16(0.f));
set_backend_tensor_data(left, zeros_left_16.data());
} else {
for (int i = 0; i < ggml_nelements(left); ++i)
zeros_left.push_back(0.f);
set_backend_tensor_data(left, zeros_left.data());
}
}
if (right) {
if (type == GGML_TYPE_F16) {
for (int i = 0; i < ggml_nelements(right); ++i)
zeros_right_16.push_back(ggml_fp32_to_fp16(0.f));
set_backend_tensor_data(right, zeros_right_16.data());
} else {
for (int i = 0; i < ggml_nelements(right); ++i)
zeros_right.push_back(0.f);
set_backend_tensor_data(right, zeros_right.data());
}
}
}
struct ggml_tensor* updated_prompt_embeds = id_encoder.forward(ctx0,
id_pixel_values_d,
prompt_embeds_d,
class_tokens_mask_d,
class_tokens_mask_pos,
left, right);
ggml_build_forward_expand(gf, updated_prompt_embeds);
return gf;
}
void compute(const int n_threads,
struct ggml_tensor* id_pixel_values,
struct ggml_tensor* prompt_embeds,
std::vector<bool>& class_tokens_mask,
struct ggml_tensor** updated_prompt_embeds,
ggml_context* output_ctx) {
auto get_graph = [&]() -> struct ggml_cgraph* {
// return build_graph(compute_allocr, id_pixel_values, prompt_embeds, class_tokens_mask);
return build_graph(id_pixel_values, prompt_embeds, class_tokens_mask);
};
// GGMLModule::compute(get_graph, n_threads, updated_prompt_embeds);
GGMLModule::compute(get_graph, n_threads, true, updated_prompt_embeds, output_ctx);
}
};
#endif // __PMI_HPP__
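Note on `build_graph` above: the boolean `class_tokens_mask` is turned into a scale mask (0 at trigger positions, 1 elsewhere), a list of trigger positions, and zero padding to the left and right of the trigger span. The committed code reads `ctmpos[0]` without checking that any position was found. The sketch below shows the same bookkeeping with an explicit empty check. It is a standalone illustration: `TriggerMask` and `make_trigger_mask` are hypothetical names, and taking the sequence length from the mask size (rather than from `prompt_embeds->ne[1]`) is an assumption.

```cpp
#include <cstddef>
#include <vector>

struct TriggerMask {
    std::vector<float> scale;    // 0 at trigger positions, 1 elsewhere
    std::vector<int> positions;  // indices of trigger positions
    std::size_t left_pad;        // tokens before the first trigger position
    std::size_t right_pad;       // tokens after the last trigger position
    bool valid;                  // false when no trigger position exists
};

TriggerMask make_trigger_mask(const std::vector<bool>& class_tokens_mask) {
    TriggerMask m{};  // value-initialised: empty vectors, zero pads, valid == false
    for (std::size_t i = 0; i < class_tokens_mask.size(); ++i) {
        if (class_tokens_mask[i]) {
            m.scale.push_back(0.f);
            m.positions.push_back(static_cast<int>(i));
        } else {
            m.scale.push_back(1.f);
        }
    }
    if (m.positions.empty()) {
        return m;  // nothing to pad around; the caller should skip ID fusion
    }
    m.valid = true;
    m.left_pad = static_cast<std::size_t>(m.positions.front());
    m.right_pad = class_tokens_mask.size() - 1 - static_cast<std::size_t>(m.positions.back());
    return m;
}
```

For a five-token mask with the trigger at positions 2 and 3, this yields two padding tokens on the left and one on the right, matching the `left` and `right` tensor sizes computed above.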


@ -11,10 +11,19 @@
#include "denoiser.hpp" #include "denoiser.hpp"
#include "esrgan.hpp" #include "esrgan.hpp"
#include "lora.hpp" #include "lora.hpp"
#include "pmid.hpp"
#include "tae.hpp" #include "tae.hpp"
#include "unet.hpp" #include "unet.hpp"
#include "vae.hpp" #include "vae.hpp"
#define STB_IMAGE_IMPLEMENTATION
#define STB_IMAGE_STATIC
#include "stb_image.h"
// #define STB_IMAGE_WRITE_IMPLEMENTATION
// #define STB_IMAGE_WRITE_STATIC
// #include "stb_image_write.h"
const char* model_version_to_str[] = { const char* model_version_to_str[] = {
"1.x", "1.x",
"2.x", "2.x",
@ -56,8 +65,11 @@ void calculate_alphas_cumprod(float* alphas_cumprod,
class StableDiffusionGGML { class StableDiffusionGGML {
public: public:
ggml_backend_t backend = NULL; // general backend ggml_backend_t backend = NULL; // general backend
ggml_type model_data_type = GGML_TYPE_COUNT; ggml_backend_t clip_backend = NULL;
ggml_backend_t control_net_backend = NULL;
ggml_backend_t vae_backend = NULL;
ggml_type model_data_type = GGML_TYPE_COUNT;
SDVersion version; SDVersion version;
bool vae_decode_only = false; bool vae_decode_only = false;
@ -73,10 +85,13 @@ public:
std::shared_ptr<AutoEncoderKL> first_stage_model; std::shared_ptr<AutoEncoderKL> first_stage_model;
std::shared_ptr<TinyAutoEncoder> tae_first_stage; std::shared_ptr<TinyAutoEncoder> tae_first_stage;
std::shared_ptr<ControlNet> control_net; std::shared_ptr<ControlNet> control_net;
std::shared_ptr<PhotoMakerIDEncoder> pmid_model;
std::shared_ptr<LoraModel> pmid_lora;
std::string taesd_path; std::string taesd_path;
bool use_tiny_autoencoder = false; bool use_tiny_autoencoder = false;
bool vae_tiling = false; bool vae_tiling = false;
bool stacked_id = false;
std::map<std::string, struct ggml_tensor*> tensors; std::map<std::string, struct ggml_tensor*> tensors;
@ -86,6 +101,8 @@ public:
std::shared_ptr<Denoiser> denoiser = std::make_shared<CompVisDenoiser>(); std::shared_ptr<Denoiser> denoiser = std::make_shared<CompVisDenoiser>();
std::string trigger_word = "img"; // should be user settable
StableDiffusionGGML() = default; StableDiffusionGGML() = default;
StableDiffusionGGML(int n_threads, StableDiffusionGGML(int n_threads,
@ -106,17 +123,23 @@ public:
~StableDiffusionGGML() { ~StableDiffusionGGML() {
ggml_backend_free(backend); ggml_backend_free(backend);
ggml_backend_free(clip_backend);
ggml_backend_free(control_net_backend);
ggml_backend_free(vae_backend);
} }
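Note on the destructor above: later in this diff `clip_backend` and `vae_backend` are assigned `backend` itself when no CPU offload is requested, so the four `ggml_backend_free` calls may receive the same handle more than once. Whether that is harmful depends on ggml's implementation, which this diff does not show. One way to sidestep the question is to free each distinct handle once. The sketch below uses a hypothetical `FakeBackend` type in place of the real backend handle and plain `delete` in place of `ggml_backend_free`.

```cpp
#include <set>
#include <vector>

// Hypothetical stand-in for a backend object; not part of ggml.
struct FakeBackend {
    int id;
};

// Free every distinct, non-null handle exactly once and report how many
// objects were released. Aliased handles are skipped after the first hit.
int free_unique_backends(const std::vector<FakeBackend*>& handles) {
    std::set<FakeBackend*> seen;
    int freed = 0;
    for (FakeBackend* h : handles) {
        if (h != nullptr && seen.insert(h).second) {
            delete h;
            ++freed;
        }
    }
    return freed;
}
```

With handles `{gpu, gpu, nullptr, cpu}` the function releases two objects: the aliased `gpu` handle once and `cpu` once.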
bool load_from_file(const std::string& model_path, bool load_from_file(const std::string& model_path,
const std::string& vae_path, const std::string& vae_path,
const std::string control_net_path, const std::string control_net_path,
const std::string embeddings_path, const std::string embeddings_path,
const std::string id_embeddings_path,
const std::string& taesd_path, const std::string& taesd_path,
bool vae_tiling_, bool vae_tiling_,
ggml_type wtype, ggml_type wtype,
schedule_t schedule, schedule_t schedule,
bool control_net_cpu) { bool clip_on_cpu,
bool control_net_cpu,
bool vae_on_cpu) {
use_tiny_autoencoder = taesd_path.size() > 0; use_tiny_autoencoder = taesd_path.size() > 0;
#ifdef SD_USE_CUBLAS #ifdef SD_USE_CUBLAS
LOG_DEBUG("Using CUDA backend"); LOG_DEBUG("Using CUDA backend");
@ -161,6 +184,7 @@ public:
LOG_ERROR("get sd version from file failed: '%s'", model_path.c_str()); LOG_ERROR("get sd version from file failed: '%s'", model_path.c_str());
return false; return false;
} }
LOG_INFO("Stable Diffusion %s ", model_version_to_str[version]); LOG_INFO("Stable Diffusion %s ", model_version_to_str[version]);
if (wtype == GGML_TYPE_COUNT) { if (wtype == GGML_TYPE_COUNT) {
model_data_type = model_loader.get_sd_wtype(); model_data_type = model_loader.get_sd_wtype();
@ -195,7 +219,12 @@ public:
first_stage_model->alloc_params_buffer(); first_stage_model->alloc_params_buffer();
first_stage_model->get_param_tensors(tensors, "first_stage_model"); first_stage_model->get_param_tensors(tensors, "first_stage_model");
} else { } else {
cond_stage_model = std::make_shared<FrozenCLIPEmbedderWithCustomWords>(backend, model_data_type, version); clip_backend = backend;
if (clip_on_cpu && !ggml_backend_is_cpu(backend)) {
LOG_INFO("CLIP: Using CPU backend");
clip_backend = ggml_backend_cpu_init();
}
cond_stage_model = std::make_shared<FrozenCLIPEmbedderWithCustomWords>(clip_backend, model_data_type, version);
cond_stage_model->alloc_params_buffer(); cond_stage_model->alloc_params_buffer();
cond_stage_model->get_param_tensors(tensors, "cond_stage_model."); cond_stage_model->get_param_tensors(tensors, "cond_stage_model.");
@ -211,24 +240,59 @@ public:
} }
if (!use_tiny_autoencoder) { if (!use_tiny_autoencoder) {
first_stage_model = std::make_shared<AutoEncoderKL>(backend, vae_type, vae_decode_only); if (vae_on_cpu && !ggml_backend_is_cpu(backend)) {
LOG_INFO("VAE Autoencoder: Using CPU backend");
vae_backend = ggml_backend_cpu_init();
} else {
vae_backend = backend;
}
first_stage_model = std::make_shared<AutoEncoderKL>(vae_backend, vae_type, vae_decode_only);
first_stage_model->alloc_params_buffer(); first_stage_model->alloc_params_buffer();
first_stage_model->get_param_tensors(tensors, "first_stage_model"); first_stage_model->get_param_tensors(tensors, "first_stage_model");
} else { } else {
tae_first_stage = std::make_shared<TinyAutoEncoder>(backend, model_data_type, vae_decode_only); tae_first_stage = std::make_shared<TinyAutoEncoder>(backend, model_data_type, vae_decode_only);
} }
// first_stage_model->get_param_tensors(tensors, "first_stage_model.");
if (control_net_path.size() > 0) { if (control_net_path.size() > 0) {
ggml_backend_t cn_backend = NULL; ggml_backend_t controlnet_backend = NULL;
if (control_net_cpu && !ggml_backend_is_cpu(backend)) { if (control_net_cpu && !ggml_backend_is_cpu(backend)) {
LOG_DEBUG("ControlNet: Using CPU backend"); LOG_DEBUG("ControlNet: Using CPU backend");
cn_backend = ggml_backend_cpu_init(); controlnet_backend = ggml_backend_cpu_init();
} else { } else {
cn_backend = backend; controlnet_backend = backend;
} }
control_net = std::make_shared<ControlNet>(cn_backend, model_data_type, version); control_net = std::make_shared<ControlNet>(controlnet_backend, model_data_type, version);
} }
pmid_model = std::make_shared<PhotoMakerIDEncoder>(clip_backend, model_data_type, version);
if (id_embeddings_path.size() > 0) {
pmid_lora = std::make_shared<LoraModel>(backend, model_data_type, id_embeddings_path, "");
if (!pmid_lora->load_from_file(true)) {
LOG_WARN("load photomaker lora tensors from %s failed", id_embeddings_path.c_str());
return false;
}
LOG_INFO("loading stacked ID embedding (PHOTOMAKER) model file from '%s'", id_embeddings_path.c_str());
if (!model_loader.init_from_file(id_embeddings_path, "pmid.")) {
LOG_WARN("loading stacked ID embedding from '%s' failed", id_embeddings_path.c_str());
} else {
stacked_id = true;
}
}
if (stacked_id) {
if (!pmid_model->alloc_params_buffer()) {
LOG_ERROR(" pmid model params buffer allocation failed");
return false;
}
// LOG_INFO("pmid param memory buffer size = %.2fMB ",
// pmid_model->params_buffer_size / 1024.0 / 1024.0);
pmid_model->get_param_tensors(tensors, "pmid");
}
// if(stacked_id){
// pmid_model.init_params(GGML_TYPE_F32);
// pmid_model.map_by_name(tensors, "pmid.");
// }
LOG_DEBUG("loading vocab"); LOG_DEBUG("loading vocab");
std::string merges_utf8_str = model_loader.load_merges(); std::string merges_utf8_str = model_loader.load_merges();
if (merges_utf8_str.size() == 0) { if (merges_utf8_str.size() == 0) {
@ -250,6 +314,7 @@ public:
// load weights // load weights
LOG_DEBUG("loading weights"); LOG_DEBUG("loading weights");
int64_t t0 = ggml_time_ms(); int64_t t0 = ggml_time_ms();
std::set<std::string> ignore_tensors; std::set<std::string> ignore_tensors;
@ -257,6 +322,10 @@ public:
if (use_tiny_autoencoder) { if (use_tiny_autoencoder) {
ignore_tensors.insert("first_stage_model."); ignore_tensors.insert("first_stage_model.");
} }
if (stacked_id) {
ignore_tensors.insert("lora.");
}
if (vae_decode_only) { if (vae_decode_only) {
ignore_tensors.insert("first_stage_model.encoder"); ignore_tensors.insert("first_stage_model.encoder");
ignore_tensors.insert("first_stage_model.quant"); ignore_tensors.insert("first_stage_model.quant");
@ -296,14 +365,54 @@ public:
} }
control_net_params_mem_size = control_net->get_params_mem_size(); control_net_params_mem_size = control_net->get_params_mem_size();
} }
size_t pmid_params_mem_size = 0;
if (stacked_id) {
pmid_params_mem_size = pmid_model->get_params_mem_size();
}
size_t total_params_size = clip_params_mem_size + clip_params_mem_size + clip_params_mem_size + control_net_params_mem_size; size_t total_params_ram_size = 0;
LOG_INFO("total params memory size = %.2fMB (clip %.2fMB, unet %.2fMB, vae %.2fMB, controlnet %.2fMB)", size_t total_params_vram_size = 0;
total_params_size / 1024.0 / 1024.0, if (ggml_backend_is_cpu(clip_backend)) {
clip_params_mem_size / 1024.0 / 1024.0, total_params_ram_size += clip_params_mem_size + pmid_params_mem_size;
unet_params_mem_size / 1024.0 / 1024.0, } else {
vae_params_mem_size / 1024.0 / 1024.0, total_params_vram_size += clip_params_mem_size + pmid_params_mem_size;
control_net_params_mem_size / 1024.0 / 1024.0); }
if (ggml_backend_is_cpu(backend)) {
total_params_ram_size += unet_params_mem_size;
} else {
total_params_vram_size += unet_params_mem_size;
}
if (ggml_backend_is_cpu(vae_backend)) {
total_params_ram_size += vae_params_mem_size;
} else {
total_params_vram_size += vae_params_mem_size;
}
if (ggml_backend_is_cpu(control_net_backend)) {
total_params_ram_size += control_net_params_mem_size;
} else {
total_params_vram_size += control_net_params_mem_size;
}
size_t total_params_size = total_params_ram_size + total_params_vram_size;
LOG_INFO(
"total params memory size = %.2fMB (VRAM %.2fMB, RAM %.2fMB): "
"clip %.2fMB(%s), unet %.2fMB(%s), vae %.2fMB(%s), controlnet %.2fMB(%s), pmid %.2fMB(%s)",
total_params_size / 1024.0 / 1024.0,
total_params_vram_size / 1024.0 / 1024.0,
total_params_ram_size / 1024.0 / 1024.0,
clip_params_mem_size / 1024.0 / 1024.0,
ggml_backend_is_cpu(clip_backend) ? "RAM" : "VRAM",
unet_params_mem_size / 1024.0 / 1024.0,
ggml_backend_is_cpu(backend) ? "RAM" : "VRAM",
vae_params_mem_size / 1024.0 / 1024.0,
ggml_backend_is_cpu(vae_backend) ? "RAM" : "VRAM",
control_net_params_mem_size / 1024.0 / 1024.0,
ggml_backend_is_cpu(control_net_backend) ? "RAM" : "VRAM",
pmid_params_mem_size / 1024.0 / 1024.0,
ggml_backend_is_cpu(clip_backend) ? "RAM" : "VRAM");
} }
int64_t t1 = ggml_time_ms(); int64_t t1 = ggml_time_ms();
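Note on the memory report above: each component's parameter size is added to a RAM or a VRAM total depending on which backend holds it, and the log line prints both totals. In the lines shown, the member `control_net_backend` is never assigned (a local `controlnet_backend` is used instead), so the sketch below takes an explicit `on_cpu` flag per component instead of querying a handle. `Component`, `MemTotals`, `tally_params` and `to_mb` are hypothetical names for illustration.

```cpp
#include <cstddef>
#include <vector>

struct Component {
    const char* name;
    std::size_t bytes;
    bool on_cpu;  // true: counted as RAM, false: counted as VRAM
};

struct MemTotals {
    std::size_t ram = 0;
    std::size_t vram = 0;
    std::size_t total() const { return ram + vram; }
};

// Sum parameter sizes into RAM and VRAM buckets.
MemTotals tally_params(const std::vector<Component>& parts) {
    MemTotals t;
    for (const Component& c : parts) {
        if (c.on_cpu) {
            t.ram += c.bytes;
        } else {
            t.vram += c.bytes;
        }
    }
    return t;
}

// Same conversion as the log statement: bytes / 1024 / 1024.
double to_mb(std::size_t bytes) {
    return static_cast<double>(bytes) / 1024.0 / 1024.0;
}
```

Keeping the placement flag next to the size avoids repeating the backend test once for the sum and once for the log label.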
@ -444,16 +553,80 @@ public:
curr_lora_state = lora_state; curr_lora_state = lora_state;
} }
std::string remove_trigger_from_prompt(ggml_context* work_ctx,
const std::string& prompt) {
auto image_tokens = cond_stage_model->convert_token_to_id(trigger_word);
GGML_ASSERT(image_tokens.size() == 1);
auto tokens_and_weights = cond_stage_model->tokenize(prompt, false);
std::vector<int>& tokens = tokens_and_weights.first;
auto it = std::find(tokens.begin(), tokens.end(), image_tokens[0]);
GGML_ASSERT(it != tokens.end()); // prompt must have trigger word
tokens.erase(it);
return cond_stage_model->decode(tokens);
}
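Note on `remove_trigger_from_prompt` above: it finds the first occurrence of the trigger token id in the tokenized prompt, erases it, and decodes the remainder. A missing trigger aborts through `GGML_ASSERT`. The core step can be sketched on a plain vector of token ids; this variant reports failure through its return value instead of asserting, which is a design choice of the sketch and not of the commit. The name `erase_first_token` and the ids in the usage note are made up.

```cpp
#include <algorithm>
#include <vector>

// Remove the first occurrence of `trigger_id` from `tokens`.
// Returns false and leaves `tokens` untouched when the id is absent.
bool erase_first_token(std::vector<int>& tokens, int trigger_id) {
    auto it = std::find(tokens.begin(), tokens.end(), trigger_id);
    if (it == tokens.end()) {
        return false;
    }
    tokens.erase(it);
    return true;
}
```

For tokens `{10, 20, 30, 20}` and trigger id `20`, only the first `20` is removed, leaving `{10, 30, 20}`. This matches the single `std::find` and `erase` in the function above.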
std::tuple<ggml_tensor*, ggml_tensor*, std::vector<bool>>
get_learned_condition_with_trigger(ggml_context* work_ctx,
const std::string& text,
int clip_skip,
int width,
int height,
int num_input_imgs,
bool force_zero_embeddings = false) {
auto image_tokens = cond_stage_model->convert_token_to_id(trigger_word);
// if(image_tokens.size() == 1){
// printf(" image token id is: %d \n", image_tokens[0]);
// }
GGML_ASSERT(image_tokens.size() == 1);
auto tokens_and_weights = cond_stage_model->tokenize_with_trigger_token(text,
num_input_imgs,
image_tokens[0],
true);
std::vector<int>& tokens = std::get<0>(tokens_and_weights);
std::vector<float>& weights = std::get<1>(tokens_and_weights);
std::vector<bool>& clsm = std::get<2>(tokens_and_weights);
// printf("tokens: \n");
// for(int i = 0; i < tokens.size(); ++i)
// printf("%d ", tokens[i]);
// printf("\n");
// printf("clsm: \n");
// for(int i = 0; i < clsm.size(); ++i)
// printf("%d ", clsm[i]?1:0);
// printf("\n");
auto cond = get_learned_condition_common(work_ctx, tokens, weights, clip_skip, width, height, force_zero_embeddings);
return std::make_tuple(cond.first, cond.second, clsm);
}
ggml_tensor* id_encoder(ggml_context* work_ctx,
ggml_tensor* init_img,
ggml_tensor* prompts_embeds,
std::vector<bool>& class_tokens_mask) {
ggml_tensor* res = NULL;
pmid_model->compute(n_threads, init_img, prompts_embeds, class_tokens_mask, &res, work_ctx);
return res;
}
std::pair<ggml_tensor*, ggml_tensor*> get_learned_condition(ggml_context* work_ctx, std::pair<ggml_tensor*, ggml_tensor*> get_learned_condition(ggml_context* work_ctx,
const std::string& text, const std::string& text,
int clip_skip, int clip_skip,
int width, int width,
int height, int height,
bool force_zero_embeddings = false) { bool force_zero_embeddings = false) {
auto tokens_and_weights = cond_stage_model->tokenize(text, true);
std::vector<int>& tokens = tokens_and_weights.first;
std::vector<float>& weights = tokens_and_weights.second;
return get_learned_condition_common(work_ctx, tokens, weights, clip_skip, width, height, force_zero_embeddings);
}
std::pair<ggml_tensor*, ggml_tensor*> get_learned_condition_common(ggml_context* work_ctx,
std::vector<int>& tokens,
std::vector<float>& weights,
int clip_skip,
int width,
int height,
bool force_zero_embeddings = false) {
cond_stage_model->set_clip_skip(clip_skip); cond_stage_model->set_clip_skip(clip_skip);
auto tokens_and_weights = cond_stage_model->tokenize(text, true);
std::vector<int>& tokens = tokens_and_weights.first;
std::vector<float>& weights = tokens_and_weights.second;
int64_t t0 = ggml_time_ms(); int64_t t0 = ggml_time_ms();
struct ggml_tensor* hidden_states = NULL; // [N, n_token, hidden_size] struct ggml_tensor* hidden_states = NULL; // [N, n_token, hidden_size]
struct ggml_tensor* chunk_hidden_states = NULL; // [n_token, hidden_size] struct ggml_tensor* chunk_hidden_states = NULL; // [n_token, hidden_size]
@ -466,7 +639,7 @@ public:
std::vector<int> chunk_tokens(tokens.begin() + chunk_idx * chunk_len, std::vector<int> chunk_tokens(tokens.begin() + chunk_idx * chunk_len,
tokens.begin() + (chunk_idx + 1) * chunk_len); tokens.begin() + (chunk_idx + 1) * chunk_len);
std::vector<float> chunk_weights(weights.begin() + chunk_idx * chunk_len, std::vector<float> chunk_weights(weights.begin() + chunk_idx * chunk_len,
weights.begin() + (chunk_idx + 1) * chunk_len); weights.begin() + (chunk_idx + 1) * chunk_len);
auto input_ids = vector_to_ggml_tensor_i32(work_ctx, chunk_tokens); auto input_ids = vector_to_ggml_tensor_i32(work_ctx, chunk_tokens);
struct ggml_tensor* input_ids2 = NULL; struct ggml_tensor* input_ids2 = NULL;
@ -664,7 +837,10 @@ public:
float min_cfg, float min_cfg,
float cfg_scale, float cfg_scale,
sample_method_t method, sample_method_t method,
const std::vector<float>& sigmas) { const std::vector<float>& sigmas,
int start_merge_step,
ggml_tensor* c_id,
ggml_tensor* c_vec_id) {
size_t steps = sigmas.size() - 1; size_t steps = sigmas.size() - 1;
// x_t = load_tensor_from_file(work_ctx, "./rand0.bin"); // x_t = load_tensor_from_file(work_ctx, "./rand0.bin");
// print_ggml_tensor(x_t); // print_ggml_tensor(x_t);
@ -730,17 +906,30 @@ public:
// GGML_ASSERT(0); // GGML_ASSERT(0);
} }
// cond if (start_merge_step == -1 || step <= start_merge_step) {
diffusion_model->compute(n_threads, // cond
noised_input, diffusion_model->compute(n_threads,
timesteps, noised_input,
c, timesteps,
c_concat, c,
c_vector, c_concat,
-1, c_vector,
controls, -1,
control_strength, controls,
&out_cond); control_strength,
&out_cond);
} else {
diffusion_model->compute(n_threads,
noised_input,
timesteps,
c_id,
c_concat,
c_vec_id,
-1,
controls,
control_strength,
&out_cond);
}
float* negative_data = NULL; float* negative_data = NULL;
if (has_unconditioned) { if (has_unconditioned) {
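Note on the sampling hunk above: the conditioned pass uses `c` and `c_vector` while `start_merge_step` is `-1` or the current step is at most `start_merge_step`, and switches to `c_id` and `c_vec_id` afterwards. The selection rule can be written as a small pure function. `Cond` and `pick_conditioning` are hypothetical names. `merge_step_from_ratio` is an assumption about how a percentage style ratio might map to a step index; the lines shown here do not include that computation.

```cpp
enum class Cond {
    Base,  // c / c_vector
    Id     // c_id / c_vec_id
};

// Mirrors the branch in the hunk above: -1 disables the switch, otherwise
// steps up to and including start_merge_step use the base conditioning.
Cond pick_conditioning(int step, int start_merge_step) {
    if (start_merge_step == -1 || step <= start_merge_step) {
        return Cond::Base;
    }
    return Cond::Id;
}

// Assumption, not taken from this diff: treat the style ratio as a
// percentage of the total number of sampling steps.
int merge_step_from_ratio(float style_ratio_percent, int sample_steps) {
    return static_cast<int>(style_ratio_percent / 100.f * static_cast<float>(sample_steps));
}
```

Under that assumption, a larger style ratio keeps the base conditioning for more steps before the ID-conditioned embeddings take over.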
@ -1283,6 +1472,7 @@ sd_ctx_t* new_sd_ctx(const char* model_path_c_str,
const char* control_net_path_c_str, const char* control_net_path_c_str,
const char* lora_model_dir_c_str, const char* lora_model_dir_c_str,
const char* embed_dir_c_str, const char* embed_dir_c_str,
const char* id_embed_dir_c_str,
bool vae_decode_only, bool vae_decode_only,
bool vae_tiling, bool vae_tiling,
bool free_params_immediately, bool free_params_immediately,
@ -1290,7 +1480,9 @@ sd_ctx_t* new_sd_ctx(const char* model_path_c_str,
enum sd_type_t wtype, enum sd_type_t wtype,
enum rng_type_t rng_type, enum rng_type_t rng_type,
enum schedule_t s, enum schedule_t s,
bool keep_control_net_cpu) { bool keep_clip_on_cpu,
bool keep_control_net_cpu,
bool keep_vae_on_cpu) {
sd_ctx_t* sd_ctx = (sd_ctx_t*)malloc(sizeof(sd_ctx_t)); sd_ctx_t* sd_ctx = (sd_ctx_t*)malloc(sizeof(sd_ctx_t));
if (sd_ctx == NULL) { if (sd_ctx == NULL) {
return NULL; return NULL;
@ -1300,6 +1492,7 @@ sd_ctx_t* new_sd_ctx(const char* model_path_c_str,
std::string taesd_path(taesd_path_c_str); std::string taesd_path(taesd_path_c_str);
std::string control_net_path(control_net_path_c_str); std::string control_net_path(control_net_path_c_str);
std::string embd_path(embed_dir_c_str); std::string embd_path(embed_dir_c_str);
std::string id_embd_path(id_embed_dir_c_str);
std::string lora_model_dir(lora_model_dir_c_str); std::string lora_model_dir(lora_model_dir_c_str);
sd_ctx->sd = new StableDiffusionGGML(n_threads, sd_ctx->sd = new StableDiffusionGGML(n_threads,
@ -1315,11 +1508,14 @@ sd_ctx_t* new_sd_ctx(const char* model_path_c_str,
vae_path, vae_path,
control_net_path, control_net_path,
embd_path, embd_path,
id_embd_path,
taesd_path, taesd_path,
vae_tiling, vae_tiling,
(ggml_type)wtype, (ggml_type)wtype,
s, s,
keep_control_net_cpu)) { keep_clip_on_cpu,
keep_control_net_cpu,
keep_vae_on_cpu)) {
delete sd_ctx->sd; delete sd_ctx->sd;
sd_ctx->sd = NULL; sd_ctx->sd = NULL;
free(sd_ctx); free(sd_ctx);
@ -1348,7 +1544,10 @@ sd_image_t* txt2img(sd_ctx_t* sd_ctx,
int64_t seed, int64_t seed,
int batch_count, int batch_count,
const sd_image_t* control_cond, const sd_image_t* control_cond,
float control_strength) { float control_strength,
float style_ratio,
bool normalize_input,
const char* input_id_images_path_c_str) {
LOG_DEBUG("txt2img %dx%d", width, height); LOG_DEBUG("txt2img %dx%d", width, height);
if (sd_ctx == NULL) { if (sd_ctx == NULL) {
return NULL; return NULL;
@ -1356,6 +1555,35 @@ sd_image_t* txt2img(sd_ctx_t* sd_ctx,
// LOG_DEBUG("%s %s %f %d %d %d", prompt_c_str, negative_prompt_c_str, cfg_scale, sample_steps, seed, batch_count); // LOG_DEBUG("%s %s %f %d %d %d", prompt_c_str, negative_prompt_c_str, cfg_scale, sample_steps, seed, batch_count);
std::string prompt(prompt_c_str); std::string prompt(prompt_c_str);
std::string negative_prompt(negative_prompt_c_str); std::string negative_prompt(negative_prompt_c_str);
std::string input_id_images_path(input_id_images_path_c_str);
// preprocess input id images
std::vector<sd_image_t*> input_id_images;
if (sd_ctx->sd->pmid_model && input_id_images_path.size() > 0) {
std::vector<std::string> img_files = get_files_from_dir(input_id_images_path);
for (std::string img_file : img_files) {
int c = 0;
int width, height;
uint8_t* input_image_buffer = stbi_load(img_file.c_str(), &width, &height, &c, 3);
if (input_image_buffer == NULL) {
LOG_ERROR("PhotoMaker load image from '%s' failed", img_file.c_str());
continue;
} else {
LOG_INFO("PhotoMaker loaded image from '%s'", img_file.c_str());
}
sd_image_t* input_image = NULL;
input_image = new sd_image_t{(uint32_t)width,
(uint32_t)height,
3,
input_image_buffer};
input_image = preprocess_id_image(input_image);
if (input_image == NULL) {
LOG_ERROR("preprocess input id image from '%s' failed", img_file.c_str());
continue;
}
input_id_images.push_back(input_image);
}
}
// extract and remove lora // extract and remove lora
auto result_pair = extract_and_remove_lora(prompt); auto result_pair = extract_and_remove_lora(prompt);
@ -1372,8 +1600,22 @@ sd_image_t* txt2img(sd_ctx_t* sd_ctx,
sd_ctx->sd->apply_loras(lora_f2m); sd_ctx->sd->apply_loras(lora_f2m);
int64_t t1 = ggml_time_ms(); int64_t t1 = ggml_time_ms();
LOG_INFO("apply_loras completed, taking %.2fs", (t1 - t0) * 1.0f / 1000); LOG_INFO("apply_loras completed, taking %.2fs", (t1 - t0) * 1.0f / 1000);
if (sd_ctx->sd->stacked_id) {
t0 = ggml_time_ms();
sd_ctx->sd->pmid_lora->apply(sd_ctx->sd->tensors, sd_ctx->sd->n_threads);
t1 = ggml_time_ms();
LOG_INFO("pmid_lora apply completed, taking %.2fs", (t1 - t0) * 1.0f / 1000);
if (sd_ctx->sd->free_params_immediately) {
sd_ctx->sd->pmid_lora->free_params_buffer();
}
}
struct ggml_init_params params; struct ggml_init_params params;
params.mem_size = static_cast<size_t>(10 * 1024 * 1024); // 10 MB params.mem_size = static_cast<size_t>(10 * 1024 * 1024); // 10 MB
if (sd_ctx->sd->stacked_id) {
params.mem_size += static_cast<size_t>(10 * 1024 * 1024); // 10 MB
}
params.mem_size += width * height * 3 * sizeof(float); params.mem_size += width * height * 3 * sizeof(float);
params.mem_size *= batch_count; params.mem_size *= batch_count;
params.mem_buffer = NULL; params.mem_buffer = NULL;
@ -1394,10 +1636,67 @@ sd_image_t* txt2img(sd_ctx_t* sd_ctx,
seed = rand(); seed = rand();
} }
t0 = ggml_time_ms(); std::string prompt_text_only;
auto cond_pair = sd_ctx->sd->get_learned_condition(work_ctx, prompt, clip_skip, width, height); ggml_tensor* init_img = NULL;
ggml_tensor* c = cond_pair.first; ggml_tensor* prompts_embeds = NULL;
ggml_tensor* c_vector = cond_pair.second; // [adm_in_channels, ] ggml_tensor* pooled_prompts_embeds = NULL;
// ggml_tensor* class_tokens_mask = NULL;
std::vector<bool> class_tokens_mask;
if (sd_ctx->sd->stacked_id) {
if (input_id_images.size() > 0) {
sd_ctx->sd->pmid_model->style_strength = style_ratio;
int32_t w = input_id_images[0]->width;
int32_t h = input_id_images[0]->height;
int32_t channels = input_id_images[0]->channel;
int32_t num_input_images = (int32_t)input_id_images.size();
init_img = ggml_new_tensor_4d(work_ctx, GGML_TYPE_F32, w, h, channels, num_input_images);
// TODO: move these to somewhere else and be user settable
float mean[] = {0.48145466f, 0.4578275f, 0.40821073f};
float std[] = {0.26862954f, 0.26130258f, 0.27577711f};
for (int i = 0; i < num_input_images; i++) {
sd_image_t* init_image = input_id_images[i];
if (normalize_input)
sd_mul_images_to_tensor(init_image->data, init_img, i, mean, std);
else
sd_mul_images_to_tensor(init_image->data, init_img, i, NULL, NULL);
}
t0 = ggml_time_ms();
auto cond_tup = sd_ctx->sd->get_learned_condition_with_trigger(work_ctx, prompt,
clip_skip, width, height, num_input_images);
prompts_embeds = std::get<0>(cond_tup);
pooled_prompts_embeds = std::get<1>(cond_tup); // [adm_in_channels, ]
class_tokens_mask = std::get<2>(cond_tup); //
prompts_embeds = sd_ctx->sd->id_encoder(work_ctx, init_img, prompts_embeds, class_tokens_mask);
t1 = ggml_time_ms();
LOG_INFO("Photomaker ID Stacking, taking %" PRId64 " ms", t1 - t0);
if (sd_ctx->sd->free_params_immediately) {
sd_ctx->sd->pmid_model->free_params_buffer();
}
// Encode input prompt without the trigger word for delayed conditioning
prompt_text_only = sd_ctx->sd->remove_trigger_from_prompt(work_ctx, prompt);
// printf("%s || %s \n", prompt.c_str(), prompt_text_only.c_str());
prompt = prompt_text_only; //
if (sample_steps < 50) {
LOG_INFO("sampling steps increased from %d to 50 for PhotoMaker", sample_steps);
sample_steps = 50;
}
} else {
LOG_WARN("Provided PhotoMaker model file, but NO input ID images");
LOG_WARN("Turn off PhotoMaker");
sd_ctx->sd->stacked_id = false;
}
}
for (sd_image_t* img : input_id_images) {
free(img->data);
delete img;  // each struct was allocated with new in preprocess_id_image
}
input_id_images.clear();
t0 = ggml_time_ms();
auto cond_pair = sd_ctx->sd->get_learned_condition(work_ctx, prompt, clip_skip, width, height);
ggml_tensor* c = cond_pair.first;
ggml_tensor* c_vector = cond_pair.second; // [adm_in_channels, ]
struct ggml_tensor* uc = NULL; struct ggml_tensor* uc = NULL;
struct ggml_tensor* uc_vector = NULL; struct ggml_tensor* uc_vector = NULL;
if (cfg_scale != 1.0) { if (cfg_scale != 1.0) {
@ -1438,6 +1737,14 @@ sd_image_t* txt2img(sd_ctx_t* sd_ctx,
std::vector<float> sigmas = sd_ctx->sd->denoiser->schedule->get_sigmas(sample_steps); std::vector<float> sigmas = sd_ctx->sd->denoiser->schedule->get_sigmas(sample_steps);
int start_merge_step = -1;
if (sd_ctx->sd->stacked_id) {
start_merge_step = int(sd_ctx->sd->pmid_model->style_strength / 100.f * sample_steps);
if (start_merge_step > 30)
start_merge_step = 30;
LOG_INFO("PHOTOMAKER: start_merge_step: %d", start_merge_step);
}
struct ggml_tensor* x_0 = sd_ctx->sd->sample(work_ctx, struct ggml_tensor* x_0 = sd_ctx->sd->sample(work_ctx,
x_t, x_t,
NULL, NULL,
@ -1452,7 +1759,10 @@ sd_image_t* txt2img(sd_ctx_t* sd_ctx,
cfg_scale, cfg_scale,
cfg_scale, cfg_scale,
sample_method, sample_method,
sigmas); sigmas,
start_merge_step,
prompts_embeds,
pooled_prompts_embeds);
// struct ggml_tensor* x_0 = load_tensor_from_file(ctx, "samples_ddim.bin"); // struct ggml_tensor* x_0 = load_tensor_from_file(ctx, "samples_ddim.bin");
// print_ggml_tensor(x_0); // print_ggml_tensor(x_0);
int64_t sampling_end = ggml_time_ms(); int64_t sampling_end = ggml_time_ms();
@ -1619,7 +1929,10 @@ sd_image_t* img2img(sd_ctx_t* sd_ctx,
cfg_scale, cfg_scale,
cfg_scale, cfg_scale,
sample_method, sample_method,
sigma_sched); sigma_sched,
-1,
NULL,
NULL);
// struct ggml_tensor *x_0 = load_tensor_from_file(ctx, "samples_ddim.bin"); // struct ggml_tensor *x_0 = load_tensor_from_file(ctx, "samples_ddim.bin");
// print_ggml_tensor(x_0); // print_ggml_tensor(x_0);
int64_t t3 = ggml_time_ms(); int64_t t3 = ggml_time_ms();
@ -1755,7 +2068,10 @@ SD_API sd_image_t* img2vid(sd_ctx_t* sd_ctx,
min_cfg, min_cfg,
cfg_scale, cfg_scale,
sample_method, sample_method,
sigmas); sigmas,
-1,
NULL,
NULL);
int64_t t2 = ggml_time_ms(); int64_t t2 = ggml_time_ms();
LOG_INFO("sampling completed, taking %.2fs", (t2 - t1) * 1.0f / 1000); LOG_INFO("sampling completed, taking %.2fs", (t2 - t1) * 1.0f / 1000);

@ -65,12 +65,12 @@ enum sd_type_t {
SD_TYPE_Q8_0 = 8, SD_TYPE_Q8_0 = 8,
SD_TYPE_Q8_1 = 9, SD_TYPE_Q8_1 = 9,
// k-quantizations // k-quantizations
SD_TYPE_Q2_K = 10, SD_TYPE_Q2_K = 10,
SD_TYPE_Q3_K = 11, SD_TYPE_Q3_K = 11,
SD_TYPE_Q4_K = 12, SD_TYPE_Q4_K = 12,
SD_TYPE_Q5_K = 13, SD_TYPE_Q5_K = 13,
SD_TYPE_Q6_K = 14, SD_TYPE_Q6_K = 14,
SD_TYPE_Q8_K = 15, SD_TYPE_Q8_K = 15,
SD_TYPE_IQ2_XXS = 16, SD_TYPE_IQ2_XXS = 16,
SD_TYPE_IQ2_XS = 17, SD_TYPE_IQ2_XS = 17,
SD_TYPE_IQ3_XXS = 18, SD_TYPE_IQ3_XXS = 18,
@ -95,7 +95,7 @@ enum sd_log_level_t {
}; };
typedef void (*sd_log_cb_t)(enum sd_log_level_t level, const char* text, void* data); typedef void (*sd_log_cb_t)(enum sd_log_level_t level, const char* text, void* data);
typedef void (*sd_progress_cb_t)(int step,int steps,float time, void* data); typedef void (*sd_progress_cb_t)(int step, int steps, float time, void* data);
SD_API void sd_set_log_callback(sd_log_cb_t sd_log_cb, void* data); SD_API void sd_set_log_callback(sd_log_cb_t sd_log_cb, void* data);
SD_API void sd_set_progress_callback(sd_progress_cb_t cb, void* data); SD_API void sd_set_progress_callback(sd_progress_cb_t cb, void* data);
@ -117,6 +117,7 @@ SD_API sd_ctx_t* new_sd_ctx(const char* model_path,
const char* control_net_path_c_str, const char* control_net_path_c_str,
const char* lora_model_dir, const char* lora_model_dir,
const char* embed_dir_c_str, const char* embed_dir_c_str,
const char* stacked_id_embed_dir_c_str,
bool vae_decode_only, bool vae_decode_only,
bool vae_tiling, bool vae_tiling,
bool free_params_immediately, bool free_params_immediately,
@ -124,7 +125,9 @@ SD_API sd_ctx_t* new_sd_ctx(const char* model_path,
enum sd_type_t wtype, enum sd_type_t wtype,
enum rng_type_t rng_type, enum rng_type_t rng_type,
enum schedule_t s, enum schedule_t s,
bool keep_control_net_cpu); bool keep_clip_on_cpu,
bool keep_control_net_cpu,
bool keep_vae_on_cpu);
SD_API void free_sd_ctx(sd_ctx_t* sd_ctx); SD_API void free_sd_ctx(sd_ctx_t* sd_ctx);
@ -140,7 +143,10 @@ SD_API sd_image_t* txt2img(sd_ctx_t* sd_ctx,
int64_t seed, int64_t seed,
int batch_count, int batch_count,
const sd_image_t* control_cond, const sd_image_t* control_cond,
float control_strength); float control_strength,
float style_strength,
bool normalize_input,
const char* input_id_images_path);
SD_API sd_image_t* img2img(sd_ctx_t* sd_ctx, SD_API sd_image_t* img2img(sd_ctx_t* sd_ctx,
sd_image_t init_image, sd_image_t init_image,

2585
thirdparty/stb_image_resize.h vendored Normal file

File diff suppressed because it is too large

113
util.cpp

@ -25,6 +25,9 @@
#include "ggml/ggml.h" #include "ggml/ggml.h"
#include "stable-diffusion.h" #include "stable-diffusion.h"
#define STB_IMAGE_RESIZE_IMPLEMENTATION
#include "stb_image_resize.h"
bool ends_with(const std::string& str, const std::string& ending) { bool ends_with(const std::string& str, const std::string& ending) {
if (str.length() >= ending.length()) { if (str.length() >= ending.length()) {
return (str.compare(str.length() - ending.length(), ending.length(), ending) == 0); return (str.compare(str.length() - ending.length(), ending.length(), ending) == 0);
@ -40,6 +43,13 @@ bool starts_with(const std::string& str, const std::string& start) {
return false; return false;
} }
bool contains(const std::string& str, const std::string& substr) {
if (str.find(substr) != std::string::npos) {
return true;
}
return false;
}
void replace_all_chars(std::string& str, char target, char replacement) { void replace_all_chars(std::string& str, char target, char replacement) {
for (size_t i = 0; i < str.length(); ++i) { for (size_t i = 0; i < str.length(); ++i) {
if (str[i] == target) { if (str[i] == target) {
@ -88,6 +98,43 @@ std::string get_full_path(const std::string& dir, const std::string& filename) {
} }
} }
std::vector<std::string> get_files_from_dir(const std::string& dir) {
std::vector<std::string> files;
WIN32_FIND_DATA findFileData;
HANDLE hFind;
char currentDirectory[MAX_PATH];
GetCurrentDirectory(MAX_PATH, currentDirectory);
char directoryPath[MAX_PATH]; // this is absolute path
sprintf(directoryPath, "%s\\%s\\*", currentDirectory, dir.c_str());
// Find the first file in the directory
hFind = FindFirstFile(directoryPath, &findFileData);
// Check if the directory was found
if (hFind == INVALID_HANDLE_VALUE) {
printf("Unable to find directory.\n");
return files;
}
// Loop through all files in the directory
do {
// Check if the found file is a regular file (not a directory)
if (!(findFileData.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY)) {
files.push_back(std::string(currentDirectory) + "\\" + dir + "\\" + std::string(findFileData.cFileName));
}
} while (FindNextFile(hFind, &findFileData) != 0);
// Close the handle
FindClose(hFind);
sort(files.begin(), files.end());
return files;
}
#else // Unix #else // Unix
#include <dirent.h> #include <dirent.h>
#include <sys/stat.h> #include <sys/stat.h>
@ -102,6 +149,7 @@ bool is_directory(const std::string& path) {
return (stat(path.c_str(), &buffer) == 0 && S_ISDIR(buffer.st_mode)); return (stat(path.c_str(), &buffer) == 0 && S_ISDIR(buffer.st_mode));
} }
// TODO: add windows version
std::string get_full_path(const std::string& dir, const std::string& filename) { std::string get_full_path(const std::string& dir, const std::string& filename) {
DIR* dp = opendir(dir.c_str()); DIR* dp = opendir(dir.c_str());
@ -121,6 +169,27 @@ std::string get_full_path(const std::string& dir, const std::string& filename) {
return ""; return "";
} }
std::vector<std::string> get_files_from_dir(const std::string& dir) {
std::vector<std::string> files;
DIR* dp = opendir(dir.c_str());
if (dp != nullptr) {
struct dirent* entry;
while ((entry = readdir(dp)) != nullptr) {
std::string fname = dir + "/" + entry->d_name;
if (!is_directory(fname))
files.push_back(fname);
}
closedir(dp);
}
sort(files.begin(), files.end());
return files;
}
#endif #endif
// get_num_physical_cores is copy from // get_num_physical_cores is copy from
@ -161,8 +230,8 @@ int32_t get_num_physical_cores() {
return n_threads > 0 ? (n_threads <= 4 ? n_threads : n_threads / 2) : 4; return n_threads > 0 ? (n_threads <= 4 ? n_threads : n_threads / 2) : 4;
} }
static sd_progress_cb_t sd_progress_cb = NULL; static sd_progress_cb_t sd_progress_cb = NULL;
void* sd_progress_cb_data = NULL; void* sd_progress_cb_data = NULL;
std::u32string utf8_to_utf32(const std::string& utf8_str) { std::u32string utf8_to_utf32(const std::string& utf8_str) {
std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> converter; std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> converter;
@ -207,9 +276,42 @@ std::string path_join(const std::string& p1, const std::string& p2) {
return p1 + "/" + p2; return p1 + "/" + p2;
} }
sd_image_t* preprocess_id_image(sd_image_t* img) {
int shortest_edge = 224;
int size = shortest_edge;
sd_image_t* resized = NULL;
uint32_t w = img->width;
uint32_t h = img->height;
uint32_t c = img->channel;
// 1. do resize using stb_resize functions
unsigned char* buf = (unsigned char*)malloc(sizeof(unsigned char) * 3 * size * size);
if (!stbir_resize_uint8(img->data, w, h, 0,
buf, size, size, 0,
c)) {
fprintf(stderr, "%s: resize operation failed\n", __func__);
free(buf);  // release the output buffer so a failed resize does not leak it
return resized;
}
// 2. do center crop (likely unnecessary due to step 1)
// 3. do rescale
// 4. do normalize
// 3 and 4 will need to be done in float format.
resized = new sd_image_t{(uint32_t)shortest_edge,
(uint32_t)shortest_edge,
3,
buf};
return resized;
}
void pretty_progress(int step, int steps, float time) { void pretty_progress(int step, int steps, float time) {
if (sd_progress_cb) { if (sd_progress_cb) {
sd_progress_cb(step,steps,time, sd_progress_cb_data); sd_progress_cb(step, steps, time, sd_progress_cb_data);
return; return;
} }
if (step == 0) { if (step == 0) {
@ -255,9 +357,8 @@ std::string trim(const std::string& s) {
return rtrim(ltrim(s)); return rtrim(ltrim(s));
} }
static sd_log_cb_t sd_log_cb = NULL; static sd_log_cb_t sd_log_cb = NULL;
void* sd_log_cb_data = NULL; void* sd_log_cb_data = NULL;
#define LOG_BUFFER_SIZE 1024 #define LOG_BUFFER_SIZE 1024

8
util.h

@ -3,11 +3,13 @@
#include <cstdint> #include <cstdint>
#include <string> #include <string>
#include <vector>
#include "stable-diffusion.h" #include "stable-diffusion.h"
bool ends_with(const std::string& str, const std::string& ending); bool ends_with(const std::string& str, const std::string& ending);
bool starts_with(const std::string& str, const std::string& start); bool starts_with(const std::string& str, const std::string& start);
bool contains(const std::string& str, const std::string& substr);
std::string format(const char* fmt, ...); std::string format(const char* fmt, ...);
@ -17,10 +19,16 @@ bool file_exists(const std::string& filename);
bool is_directory(const std::string& path); bool is_directory(const std::string& path);
std::string get_full_path(const std::string& dir, const std::string& filename); std::string get_full_path(const std::string& dir, const std::string& filename);
std::vector<std::string> get_files_from_dir(const std::string& dir);
std::u32string utf8_to_utf32(const std::string& utf8_str); std::u32string utf8_to_utf32(const std::string& utf8_str);
std::string utf32_to_utf8(const std::u32string& utf32_str); std::string utf32_to_utf8(const std::u32string& utf32_str);
std::u32string unicode_value_to_utf32(int unicode_value); std::u32string unicode_value_to_utf32(int unicode_value);
sd_image_t* preprocess_id_image(sd_image_t* img);
// std::string sd_basename(const std::string& path);
typedef struct { typedef struct {
uint32_t width; uint32_t width;
uint32_t height; uint32_t height;