
Block-removed Knowledge-distilled Stable Diffusion

Official codebase for BK-SDM: Architecturally Compressed Stable Diffusion for Efficient Text-to-Image Generation [ArXiv] [ICCV 2023 Demo Track] [ICML 2023 Workshop on ES-FoMo].

BK-SDMs are lightweight text-to-image (T2I) synthesis models:

  • Certain residual & attention blocks are eliminated from the U-Net of SD.
  • Distillation pretraining is conducted with very limited data, but it (surprisingly) remains effective.

⚡Quick Links: KD Pretraining | Evaluation on MS-COCO | DreamBooth Finetuning | Demo

Notice

Model Description

Installation

conda create -n bk-sdm python=3.8
conda activate bk-sdm
git clone https://github.com/Nota-NetsPresso/BK-SDM.git
cd BK-SDM
pip install -r requirements.txt

Note on the torch versions we've used:

  • torch 1.13.1 for MS-COCO evaluation & DreamBooth finetuning on a single 24GB RTX3090
  • torch 2.0.1 for KD pretraining on a single 80GB A100
    • If pretraining with a total batch size of 256 on an A100 runs out of GPU memory, check your torch version and consider upgrading to torch>2.0.0.

Minimal Example with 🤗Diffusers

With the default PNDM scheduler and 50 denoising steps:

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("nota-ai/bk-sdm-small", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

prompt = "a golden vase with different flowers"
image = pipe(prompt).images[0]
image.save("example.png")

Equivalent code (replacing only the U-Net of SD-v1.4 while keeping its Text Encoder and Image Decoder):
import torch
from diffusers import StableDiffusionPipeline, UNet2DConditionModel

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16)
pipe.unet = UNet2DConditionModel.from_pretrained("nota-ai/bk-sdm-small", subfolder="unet", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

prompt = "a golden vase with different flowers"
image = pipe(prompt).images[0]
image.save("example.png")

Distillation Pretraining

Our code was based on train_text_to_image.py of Diffusers 0.15.0. To access the latest version, use this link.

[Optional] Toy run to check runnability

bash scripts/get_laion_data.sh preprocessed_11k
bash scripts/kd_train_toy.sh
Note
  • A toy dataset (11K img-txt pairs) is downloaded at ./data/laion_aes/preprocessed_11k (1.7GB in tar.gz; 1.8GB data folder).
  • The toy script can be used to verify that the code runs and to find a batch size that fits your GPU. With a batch size of 8 (=4×2), training BK-SDM-Base for 20 iterations takes about 5 minutes and 22GB of GPU memory.

Single-gpu training for BK-SDM-{Base, Small, Tiny}

bash scripts/get_laion_data.sh preprocessed_212k
bash scripts/kd_train.sh
Note
  • The dataset with 212K (=0.22M) pairs is downloaded at ./data/laion_aes/preprocessed_212k (18GB tar.gz; 20GB data folder).
  • With a batch size of 256 (=4×64), training BK-SDM-Base for 50K iterations takes about 300 hours and 53GB GPU memory. With a batch size of 64 (=4×16), it takes 60 hours and 28GB GPU memory.
  • Training BK-SDM-{Small, Tiny} results in a 5∼10% decrease in GPU memory usage.

Single-gpu training for BK-SDM-{Base-2M, Small-2M, Tiny-2M}

bash scripts/get_laion_data.sh preprocessed_2256k
bash scripts/kd_train_2m.sh
Note
  • The dataset with 2256K (=2.3M) pairs is downloaded at ./data/laion_aes/preprocessed_2256k (182GB tar.gz; 204GB data folder).
  • Except for the dataset, kd_train_2m.sh is identical to kd_train.sh; given the same number of iterations, the training computation is the same.

Multi-gpu training

bash scripts/kd_train_toy_ddp.sh
Note
  • Multi-GPU training is supported (sample results: link), although all experiments for our paper were conducted using a single GPU. Thanks @youngwanLEE for sharing the script :)

Compression of SD-v2 with BK-SDM

bash scripts/kd_train_v2-base-im512.sh
bash scripts/kd_train_v2-im768.sh

# For inference, see: 'scripts/generate_with_trained_unet.sh'  

Note on training code

Key segments for KD training
  • Define Student U-Net by adjusting config.json [link]
  • Initialize Student U-Net by copying Teacher U-Net's weights [link]
  • Define hook locations for feature KD [link]
  • Define losses for feature-and-output KD [link]
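
To make the last two items concrete, below is a minimal, self-contained sketch of output- and feature-level KD with forward hooks. It is only an illustration: the module names, shapes, and toy convolutional nets are hypothetical, not the repository's actual hook locations or loss code.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy teacher/student pair; in BK-SDM these would be the teacher and student U-Nets.
teacher = nn.Sequential(nn.Conv2d(4, 8, 3, padding=1), nn.Conv2d(8, 8, 3, padding=1))
student = nn.Sequential(nn.Conv2d(4, 8, 3, padding=1), nn.Conv2d(8, 8, 3, padding=1))

acts_tea, acts_stu = {}, {}

def save_to(store, name):
    def hook(module, inputs, output):
        store[name] = output  # keep the feature map produced at this location
    return hook

# Register hooks at matching locations (illustrative: the second conv in both nets).
teacher[1].register_forward_hook(save_to(acts_tea, "block1"))
student[1].register_forward_hook(save_to(acts_stu, "block1"))

x = torch.randn(2, 4, 64, 64)
with torch.no_grad():
    out_tea = teacher(x)   # teacher runs without gradients
out_stu = student(x)

# Output-level KD + feature-level KD, weighted as in the hyperparameters below.
lambda_kd_output, lambda_kd_feat = 1.0, 1.0
loss = lambda_kd_output * F.mse_loss(out_stu, out_tea) \
     + lambda_kd_feat * F.mse_loss(acts_stu["block1"], acts_tea["block1"])
loss.backward()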
Key learning hyperparams
--unet_config_name "bk_small" # option: ["bk_base", "bk_small", "bk_tiny"]
--use_copy_weight_from_teacher # initialize student unet with teacher weights
--learning_rate 5e-05
--train_batch_size 64
--gradient_accumulation_steps 4
--lambda_sd 1.0
--lambda_kd_output 1.0
--lambda_kd_feat 1.0

Evaluation on MS-COCO Benchmark

We used the following code to obtain the results on MS-COCO. After generating 512×512 images with the PNDM scheduler and 25 denoising steps, we downsampled them to 256×256 for computing the scores.
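
A minimal sketch of the per-image step this protocol implies (generate at 512×512 with 25 steps, then downsample to 256×256); the actual benchmark loop over the 30K prompts lives in src/generate.py:

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("nota-ai/bk-sdm-small", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

prompt = "a golden vase with different flowers"           # in the benchmark, one of the 30K MS-COCO prompts
image = pipe(prompt, num_inference_steps=25).images[0]    # 512x512 generation
image = image.resize((256, 256))                           # downsample before scoring
image.save("example_256.png")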

Generation with released models (using BK-SDM-Small as default)

On a single 3090 GPU, '(2)' takes ~10 hours per model, and '(3)' takes a few minutes.

  • (1) Download metadata.csv and real_im256.npz:

    bash scripts/get_mscoco_files.sh
    
    # ./data/mscoco_val2014_30k/metadata.csv: 30K prompts from the MS-COCO validation set (used in '(2)')  
    # ./data/mscoco_val2014_41k_full/real_im256.npz: FID statistics of 41K real images (used in '(3)')
    Note on 'real_im256.npz'
    • Following the evaluation protocol [DALL·E, Imagen], the FID stat for real images was computed over the full validation set (41K images) of MS-COCO. A precomputed stat file is downloaded via '(1)' at ./data/mscoco_val2014_41k_full/real_im256.npz.
    • Alternatively, real_im256.npz can be computed with python3 src/get_stat_mscoco_val2014.py, which downloads all the images, resizes them to 256×256, and computes the FID statistics.
  • (2) Generate 512×512 images over 30K prompts from the MS-COCO validation set → Resize them to 256×256:

    python3 src/generate.py 
    
    # python3 src/generate.py --model_id nota-ai/bk-sdm-base --save_dir ./results/bk-sdm-base
    # python3 src/generate.py --model_id nota-ai/bk-sdm-tiny --save_dir ./results/bk-sdm-tiny  

    [Batched generation] Increase --batch_sz (default: 1) for faster inference at the cost of higher VRAM usage (see the sketch after this list). Thanks @Godofnothing for providing this feature :)

    Inference cost details:
    • Setup: BK-SDM-Small on MS-COCO 30K image generation

    • We used an eval batch size of 1 for our paper results. Different batch sizes affect the sampling of random latent codes, resulting in slightly different generation scores.

      Eval Batch Size    1       2       4       8
      GPU Memory         4.9GB   6.3GB   11.3GB  19.6GB
      Generation Time    9.4h    7.9h    7.6h    7.3h
      FID                16.98   17.01   17.16   16.97
      IS                 31.68   31.20   31.62   31.22
      CLIP Score         0.2677  0.2679  0.2677  0.2675
  • (3) Compute FID, IS, and CLIP score:

    bash scripts/eval_scores.sh
    
    # For the other models, modify the `./results/bk-sdm-*` path in the scripts to specify different models.
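
Relatedly, for the batched generation mentioned in '(2)', one simple way to batch the denoising is to pass a list of prompts to the pipeline in a single call. This is only an illustration of the idea, not the exact mechanism behind --batch_sz in src/generate.py:

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("nota-ai/bk-sdm-small", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# A diffusers pipeline accepts a list of prompts and denoises them as a single batch.
prompts = ["a golden vase with different flowers",
           "a small dog curled up on top of a pair of shoes"]
images = pipe(prompts, num_inference_steps=25).images   # list of PIL images
for i, img in enumerate(images):
    img.resize((256, 256)).save(f"batch_{i}.png")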

[After training] Generation with a trained U-Net

bash scripts/get_mscoco_files.sh
bash scripts/generate_with_trained_unet.sh

Results on Zero-shot MS-COCO 256×256 30K

See Results in MODEL_CARD.md

DreamBooth Finetuning with 🤗PEFT

Our lightweight SD backbones can be used for efficient personalized generation. DreamBooth refines text-to-image diffusion models given a small number of images. DreamBooth+LoRA can drastically reduce finetuning cost.

DreamBooth dataset

The dataset is downloaded at ./data/dreambooth/dataset [folder tree]: 30 subjects × 25 prompts × 4∼6 images.

git clone https://github.com/google/dreambooth ./data/dreambooth

DreamBooth finetuning (using BK-SDM-Base as default)

Our code was based on train_dreambooth.py of PEFT 0.1.0. To access the latest version, use this link.

  • (1) without LoRA — full finetuning & used in our paper
    bash scripts/finetune_full.sh # learning rate 1e-6
    bash scripts/generate_after_full_ft.sh
  • (2) with LoRA — parameter-efficient finetuning
    bash scripts/finetune_lora.sh # learning rate 1e-4
    bash scripts/generate_after_lora_ft.sh  
  • On a single 3090 GPU, finetuning takes 10~20 minutes per subject.
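
After finetuning, the personalized model can be loaded for inference like any other diffusers checkpoint. A minimal sketch, assuming the full-finetuned weights were saved as a diffusers pipeline (the output path and the "sks dog" subject identifier below are illustrative; the generate_after_*.sh scripts are the reference):

import torch
from diffusers import StableDiffusionPipeline

finetuned_dir = "./results/dreambooth_full/dog"   # hypothetical output directory of the finetuning script

pipe = StableDiffusionPipeline.from_pretrained(finetuned_dir, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# "sks" is the rare identifier token conventionally bound to the subject during DreamBooth.
image = pipe("a photo of sks dog in a bucket", num_inference_steps=25).images[0]
image.save("dreambooth_example.png")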

Results of Personalized Generation

See DreamBooth Results in MODEL_CARD.md

Gradio Demo

Check out our Gradio demo and the codes (main: app.py)!

[Aug/01/2023] Featured in Hugging Face Spaces of the week 🔥

Core ML Weights

For iOS or macOS applications, we have converted our models to Core ML format. They are available at 🤗Hugging Face Models (nota-ai/coreml-bk-sdm) and can be used with Apple's Core ML Stable Diffusion library.

  • 4-sec inference on iPhone 14 (with 10 denoising steps): results

License

This project, along with its weights, is subject to the CreativeML Open RAIL-M license, which aims to mitigate any potential negative effects arising from the use of highly advanced machine learning systems. A summary of this license is as follows.

1. You can't use the model to deliberately produce nor share illegal or harmful outputs or content,
2. We claim no rights on the outputs you generate, you are free to use them and are accountable for their use which should not go against the provisions set in the license, and
3. You may re-distribute the weights and use the model commercially and/or as a service. If you do, please be aware you have to include the same use restrictions as the ones in the license and share a copy of the CreativeML OpenRAIL-M to all your users.

Acknowledgments

Citation

@article{kim2023architectural,
  title={BK-SDM: A Lightweight, Fast, and Cheap Version of Stable Diffusion},
  author={Kim, Bo-Kyeong and Song, Hyoung-Kyu and Castells, Thibault and Choi, Shinkook},
  journal={arXiv preprint arXiv:2305.15798},
  year={2023},
  url={https://arxiv.org/abs/2305.15798}
}
@article{kim2023bksdm,
  title={BK-SDM: Architecturally Compressed Stable Diffusion for Efficient Text-to-Image Generation},
  author={Kim, Bo-Kyeong and Song, Hyoung-Kyu and Castells, Thibault and Choi, Shinkook},
  journal={ICML Workshop on Efficient Systems for Foundation Models (ES-FoMo)},
  year={2023},
  url={https://openreview.net/forum?id=bOVydU0XKC}
}

bk-sdm's People

Contributors

bokyeong1015, godofnothing, shinkookchoi, thibaultcastells


bk-sdm's Issues

How to replicate this work offline

Hi, thanks for your great work!
I currently have an A100 GPU server that is not connected to the internet. I can configure the environment offline. **Can I replicate your work offline?** Could you please provide me with some guidance? Thank you.

OSError: Error no file named scheduler_config.json found in directory CompVis/stable-diffusion-v1-4

I downloaded the stable-diffusion-v1-4 checkpoint from CompVis, but I still have this problem. I have tried installing transformers==4.25, 4.27, and so on, but it didn't work. These are the error details:

bash scripts/kd_train_toy.sh
The following values were not passed to accelerate launch and had defaults used instead:
--num_processes was set to a value of 1
--num_machines was set to a value of 1
--mixed_precision was set to a value of 'no'
--dynamo_backend was set to a value of 'no'
To avoid this warning pass in values for each of the problematic parameters or run accelerate config.
/home/lzj/miniconda3/envs/bk-sdm/lib/python3.8/site-packages/accelerate/accelerator.py:249: FutureWarning: logging_dir is deprecated and will be removed in version 0.18.0 of 🤗 Accelerate. Use project_dir instead.
warnings.warn(
./results/toy_bk_small/log_loss.csv
03/11/2024 21:34:33 - INFO - main - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda

Mixed precision type: fp16

Traceback (most recent call last):
File "src/kd_train_text_to_image.py", line 914, in
main()
File "src/kd_train_text_to_image.py", line 429, in main
noise_scheduler = DDPMScheduler.from_pretrained(args.pretrained_model_name_or_path, subfolder="scheduler")
File "/home/lzj/miniconda3/envs/bk-sdm/lib/python3.8/site-packages/diffusers/schedulers/scheduling_utils.py", line 139, in from_pretrained
config, kwargs, commit_hash = cls.load_config(
File "/home/lzj/miniconda3/envs/bk-sdm/lib/python3.8/site-packages/diffusers/configuration_utils.py", line 331, in load_config
raise EnvironmentError(
OSError: Error no file named scheduler_config.json found in directory CompVis/stable-diffusion-v1-4.
Traceback (most recent call last):
File "/home/lzj/miniconda3/envs/bk-sdm/bin/accelerate", line 8, in
sys.exit(main())
File "/home/lzj/miniconda3/envs/bk-sdm/lib/python3.8/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
args.func(args)
File "/home/lzj/miniconda3/envs/bk-sdm/lib/python3.8/site-packages/accelerate/commands/launch.py", line 923, in launch_command
simple_launcher(args)
File "/home/lzj/miniconda3/envs/bk-sdm/lib/python3.8/site-packages/accelerate/commands/launch.py", line 579, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/lzj/miniconda3/envs/bk-sdm/bin/python', 'src/kd_train_text_to_image.py', '--pretrained_model_name_or_path', 'CompVis/stable-diffusion-v1-4', '--train_data_dir', '/home/lzj/work/data/preprocessed_11k', '--use_ema', '--resolution', '512', '--center_crop', '--random_flip', '--train_batch_size', '2', '--gradient_checkpointing', '--mixed_precision=fp16', '--learning_rate', '5e-05', '--max_grad_norm', '1', '--lr_scheduler=constant', '--lr_warmup_steps=0', '--report_to=all', '--max_train_steps=20', '--seed', '1234', '--gradient_accumulation_steps', '4', '--checkpointing_steps', '5', '--valid_steps', '5', '--lambda_sd', '1.0', '--lambda_kd_output', '1.0', '--lambda_kd_feat', '1.0', '--use_copy_weight_from_teacher', '--unet_config_path', './src/unet_config', '--unet_config_name', 'bk_small', '--output_dir', './results/toy_bk_small']' returned non-zero exit status 1.

any plans for more models?

Greetings!

These tiny models are amazing! I love the fp16 versions.
Could you please, in the future, make models based on SD 1.5 and mixed with uncensored models such as Lyriel or Deliberate, for better faces and anatomy?

kind regards

Question about the lambda

Hi there,
It's me again. I am curious whether you tried different combinations of the lambdas for feat_loss and out_loss, or maybe adding a lambda for the task_loss?

From my training runs, it seems that feat_loss contributes most of the total loss.

Scale of KD-feature loss for SD inpainting 1.5

Hi there,

I am trying to distill the U-Net in SD-inpainting 1.5 into a smaller U-Net using your code (I swapped in the inpainting pipeline and the corresponding input data).
I have trained for 130K steps with batch size 64.
Right now the kd_feat_loss is around 20.

I am wondering what kd_feat_loss you had when you finished distilling the U-Net in your experiments?

Thank you.

Loading preprocessed_212k laion dataset without any response in terminal

Hi @bokyeong1015 , thanks for your great work!

I modified diffusers/train_text_to_image.py and used your fine-tuning strategy on the 212K subset of LAION. But when I run the training code, loading the dataset takes too much time, and there is no response in the terminal even after 40 minutes. Is this caused by the large number of images or by a bug in my code?

    # In distributed training, the load_dataset function guarantees that only one local process can concurrently
    if args.dataset_name is not None:
        # Downloading and loading a dataset from the hub.
        dataset = load_dataset(
            args.dataset_name,
            args.dataset_config_name,
            cache_dir=args.cache_dir,
            data_dir=args.train_data_dir,
        )
    else:
        data_files = {}
        if args.train_data_dir is not None:
            data_files["train"] = os.path.join(args.train_data_dir, "**")
        print("*** load dataset: start")
        t0 = time.time()
        dataset = load_dataset(
            "imagefolder",
            # data_files=data_files,
            cache_dir=args.cache_dir,
            split="train",
            data_dir=args.train_data_dir,
        )
        print(f"*** load dataset: end --- {time.time()-t0} sec")

        # See more about loading custom images at
        # https://huggingface.co/docs/datasets/v2.4.0/en/image_load#imagefolder

    # Preprocessing the datasets.
    # We need to tokenize inputs and targets.
        
    # column_names = dataset["train"].column_names
    
    ##############################################################################################
    column_names = dataset.column_names
    image_column = column_names[0]
    caption_column = column_names[1]
    ###################################################################################################

This is the dataset-loading code. How long should the load_dataset call take?

Thanks for your great work, looking forward to your reply!

Best wishes,
Qianli

About the training speed

I found that the total number of training iterations is 400,000. May I ask how many days it takes you to train a distilled model? I use 8×V100 GPUs, and I found that I can only complete around 3,800 iterations in one night (from 19:55 to 10:00 the next day).

Refine generation code

  • remove use_auth_token=True in StableDiffusionPipeline.from_pretrained [ref]
  • disable NSFW filter in recent diffusers versions [ref] [ref] for MS-COCO benchmark
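
For the second item, one way to disable the safety checker at load time is sketched below. This is only a hedged illustration (passing safety_checker=None is a standard StableDiffusionPipeline option; recent diffusers versions print a warning), intended solely for benchmark-style evaluation such as MS-COCO:

import torch
from diffusers import StableDiffusionPipeline

# Load without the NSFW safety checker so that flagged samples are not blacked out.
pipe = StableDiffusionPipeline.from_pretrained(
    "nota-ai/bk-sdm-base",
    torch_dtype=torch.float16,
    safety_checker=None,
)
pipe = pipe.to("cuda")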

Question of Dreambooth evaluation

Hi, thank you for sharing your awesome work ☺️
How can we reproduce your DreamBooth quantitative performance in Table 5?
Would you provide the evaluation code?

issue about training iterations

We note that the README says training BK-SDM-Base needs 50K iterations, while we find that "kd_train.py" sets --max_train_steps=400K. Can we assume that 50K is good enough?

Generation with trained unet

response to #10 (comment)

I want to conduct zero-shot MS-COCO evaluation for my intermediate checkpoint trained in a multi-GPU setting, but I'm not sure how to specify the checkpoint.

Could you give me some hints for this?

In your instruction(2), you enter model_id.

Could I change the model_id to my checkpoint path?

However, I don't know which file should be specified.

I guess unet_ema/diffusion_pytorch_model.bin. Am I right?

Thanks in advance.

multi-gpu training error

Hi, I'm really impressed by your work and nice code.

When I ran the training code in a multi-GPU setting, I encountered this error.

Traceback (most recent call last):
File "/home/user01/BK-SDM/src/kd_train_text_to_image.py", line 891, in
main()
File "/home/user01/BK-SDM/src/kd_train_text_to_image.py", line 766, in main
a_stu = acts_stu[m_stu]
KeyError: 'up_blocks.0'

Could you check this?

Thanks in advance :)

Discussion on experimental settings

[Inquiry]

Hi, I tried this method but found that the performance was very poor. My experimental configuration was to train on the laion_11k data for 10K steps with the bk_tiny U-Net; I also swapped in the inpainting pipeline and the corresponding input data. I would like to ask you for any suggestions, thanks.

Queries

@bokyeong1015 Hi, thanks for sharing this wonderful work. I have a few queries and a request:

  1. Can you please share your checkpoint-45000 on OneDrive or Google Drive? I would like to test it, as I do not have the resources to train it on a GPU system.
  2. In your paper you mention deploying on NVIDIA Orin. Did you test on any other platforms, such as NVIDIA AGX / NX / Nano? If so, what was the inference time on them?
  3. When deploying on NVIDIA Orin, did you use Docker or the Hugging Face models directly?
  4. Can the techniques used in this paper and in SnapFusion be combined in this code, and can we expect further improvements?

Thanks in advance.

SDXL support?

Hi there!

I'd like to ask: do you have, or plan to add, support for the SDXL model? It's quite heavy, and making it faster and more lightweight would bring huge benefits to the community.

Thanks for your work!

improved wandb logger

To incorporate the below feature

In addition, the base training script src/kd_train_text_to_image.py logs only the total loss to W&B, while one may be interested in each individual contribution. I added image logging to W&B as well.

Repo update

  • Code for SD-V2 applicability
  • Readme & model card for SD-V2 applicability
    • Updated description & results
    • Updated package info
  • Credit BK-SDXL from KOALA

batched image generation

To incorporate the below feature

The original src/generate.py generates images one by one, which under-utilizes the GPU; as a consequence, generating 30K images takes a while. I've added batched image generation to speed it up.

Snapfusion seems to get better results?

Thanks for generously open-sourcing your work. There was a previous work similar to yours, called SnapFusion, aimed at speeding up Stable Diffusion.

According to their paper, they achieved better results through an efficient U-Net and step distillation, but unfortunately that work is not open source.

Do you have any opinion on this work? https://snap-research.github.io/SnapFusion/

Add DreamBooth finetuning

  • Goal: Efficient personalized generation with lightweight SD backbones
  • Method: DreamBooth finetuning without and with LoRA

Wonderful work and hi from 🧨 diffusers

Hi folks!

Simply amazing work here 🔥

I am Sayak, one of the maintainers of 🧨 diffusers at HF. I see all the weights of BK-SDM are already diffusers-compatible. This is really amazing!

I wanted to know if there is any plan to also open-source the distillation pre-training code. I think that will be beneficial to the community.

Additionally, any plans on doing for SDXL as well?

Cc: @patrickvonplaten

Unhandled exception while generating images that are considered NSFW

Hi! I ran this line of code to generate samples to compute FID:

!python3 src/generate.py --model_id nota-ai/bk-sdm-base --save_dir ./results/bk-sdm-base

Then I got this error:

0/30000 | COCO_val2014_000000000042.jpg **A small dog is curled up on top of the shoes** | 25 steps
Total 751.9M (U-Net 579.4M; TextEnc 123.1M; ImageDec 49.5M)
100% 25/25 [00:03<00:00,  8.14it/s]
Traceback (most recent call last):
  File "/content/BK-SDM/src/generate.py", line 53, in <module>
    img = pipeline.generate(prompt = val_prompt,
  File "/content/BK-SDM/src/utils/inference_pipeline.py", line 34, in generate
    out = self.pipe(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py", line 706, in __call__
    do_denormalize = [not has_nsfw for has_nsfw in has_nsfw_concept]
TypeError: 'bool' object is not iterable

data loading problem with 89M pairs

Hi, thanks to your excellent work, I have conducted many experiments.

When I trained on a subset of LAION-aesthetic-5+ (about 89M pairs), my training process was killed without a specific error message :(

Maybe it occurred in load_dataset.

I guess that the training set is too big, but I'm not sure.

I think this problem may be caused by Hugging Face's datasets library.

Have you ever faced this problem? And have you tried to train your model on a much bigger training set?

Thanks in advance :)


About gpu memory

Thanks for your great work. May I ask a question about GPU memory? You wrote:

A toy script can be used to verify the code executability and find the batch size that matches your GPU. With a batch size of 8 (=4×2), training BK-SDM-Base for 20 iterations takes about 5 minutes and 22GB GPU memory.

With a batch size of 256 (=4×64), training BK-SDM-Base for 50K iterations takes about 300 hours and 53GB GPU memory. With a batch size of 64 (=4×16), it takes 60 hours and 28GB GPU memory.

So the batch size increases about 32× (from 2 to 64), but GPU memory increases less than 3× (from 22GB to 53GB). Why is the GPU memory usage so economical? Is diffusers more GPU-efficient than PyTorch Lightning (which SD v1.5 used)?
Thanks very much

How about KD training without EMA?

Thanks for your paper and code. My question is: how does the model perform when the EMA option is not used, i.e., when I don't pass the --use_ema option?

May I ask whether the training time is accurate

With a batch size of 256 (=4×64), training BK-SDM-Base on a single A100 for 50K iterations takes about 300 hours.
With a batch size of 64 (=4×16), training BK-SDM-Base on a single A100 for 50K iterations takes about 60 hours???
Is it in fact 600 hours?

Discussion on preprocessing of LAION data

[Question]

I have another question.

I split the LAION-aesthetic V2 5+ dataset into several subsets, e.g., 5M, 10M, 89M, etc, and I made metadata.csv for each subset.

Then, when I tried to train with multiple GPUs using a subset, I faced the error below.

I guess that the problem was caused by the data itself.

FYI, I didn't pre-process the data except for the resolution (512×512) when I downloaded it.

Did you also face this problem?

Or did you conduct any pre-processing of the LAION data??

Steps: 0%| | 283/400000 [35:52<813:24:06, 7.33s/it, kd_feat_loss=58.6, kd_output_loss=0.0447, lr=5e-5, sd_loss=0.185, step_loss=58.9]
Traceback (most recent call last):
File "/home/user01/bk-sdm/src/kd_train_text_to_image.py, line 1171, in
main()
File "/home/user01/bk-sdm/src/kd_train_text_to_image.py", line 961, in main
for step, batch in enumerate(train_dataloader):
File "/home/user01/anaconda3/envs/kd-sdm/lib/python3.9/site-packages/accelerate/data_loader.py", line 388, in iter
next_batch = next(dataloader_iter)
File "/home/user01/anaconda3/envs/kd-sdm/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 628, in next
data = self._next_data()
File "/home/user01/anaconda3/envs/kd-sdm/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 671, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "/home/user01/anaconda3/envs/kd-sdm/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 56, in fetch
data = self.dataset.getitems(possibly_batched_index)
File "/home/user01/anaconda3/envs/kd-sdm/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 2715, in getitems
return [{col: array[i] for col, array in batch.items()} for i in range(n_examples)]
File "/home/user01/anaconda3/envs/kd-sdm/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 2715, in
return [{col: array[i] for col, array in batch.items()} for i in range(n_examples)]
File "/home/user01/anaconda3/envs/kd-sdm/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 2715, in
return [{col: array[i] for col, array in batch.items()} for i in range(n_examples)]
IndexError: index 63 is out of bounds for dimension 0 with size 63
