
stable-diffusion's Introduction

Optimized Stable Diffusion

This repo is a modified version of the Stable Diffusion repo, optimized to use less VRAM than the original by sacrificing inference speed.

To reduce VRAM usage, the following optimizations are used:

  • The Stable Diffusion model is split into four parts, each of which is sent to the GPU only when needed; once its calculation is done, it is moved back to the CPU (see the sketch after this list).
  • The attention calculation is done in parts.
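
A minimal sketch of the offloading idea described above (illustrative only, not the repo's actual code; the helper name is a placeholder):

import torch

def run_fragment_on_gpu(module, *inputs):
    # Move one model fragment to the GPU just for its forward pass,
    # then send it back to the CPU so the next fragment can use the VRAM.
    module.to("cuda")
    with torch.no_grad():
        out = module(*(x.to("cuda") for x in inputs))
    module.to("cpu")
    torch.cuda.empty_cache()  # release cached blocks held for this fragment
    return out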

Installation

All the modified files are in the optimizedSD folder, so if you have already cloned the original repository, you can simply download this folder and copy it into the original repo instead of cloning everything again. You can also clone this repo and follow the same installation steps as the original (mainly creating the conda environment and placing the weights at the specified location).

Alternatively, if you prefer to use Docker, you can do the following:

  1. Install Docker, Docker Compose plugin, and NVIDIA Container Toolkit
  2. Clone this repo to, e.g., ~/stable-diffusion
  3. Put your downloaded model.ckpt file into ~/sd-data (the path is set in docker-compose.yml and can be changed there)
  4. cd into ~/stable-diffusion and execute docker compose up --build

This will launch Gradio on port 7860 with txt2img. You can also use docker compose run to execute other Python scripts, as shown below.
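
For example (assuming the Compose service is named stable-diffusion; check docker-compose.yml for the actual service name):

docker compose run --rm stable-diffusion python optimizedSD/optimized_txt2img.py --prompt "Austrian alps" --H 512 --W 512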

Usage

img2img

  • img2img can generate 512x512 images from a prior image and prompt using under 2.4GB VRAM in under 20 seconds per image on an RTX 2060.

  • The maximum size that can fit on 6GB GPU (RTX 2060) is around 1152x1088.

  • For example, the following command will generate 10 512x512 images:

python optimizedSD/optimized_img2img.py --prompt "Austrian alps" --init-img ~/sketch-mountains-input.jpg --strength 0.8 --n_iter 2 --n_samples 5 --H 512 --W 512

txt2img

  • txt2img can generate 512x512 images from a prompt using under 2.4GB GPU VRAM in under 24 seconds per image on an RTX 2060.

  • For example, the following command will generate 10 512x512 images:

python optimizedSD/optimized_txt2img.py --prompt "Cyberpunk style image of a Tesla car reflection in rain" --H 512 --W 512 --seed 27 --n_iter 2 --n_samples 5 --ddim_steps 50

inpainting

  • inpaint_gradio.py can fill masked parts of an image based on a given prompt. It can inpaint 512x512 images while using under 2.5GB of VRAM.

  • To launch the gradio interface for inpainting, run python optimizedSD/inpaint_gradio.py. The mask for the image can be drawn on the selected image using the brush tool.

  • The results are not yet perfect but can be improved by using a combination of prompt weighting, prompt engineering and testing out multiple values of the --strength argument.

  • Suggestions to improve the inpainting algorithm are most welcome.

Using the Gradio GUI

  • You can also use the built-in Gradio interface for img2img, txt2img & inpainting instead of the command-line interface. Activate the conda environment and install the latest version of gradio using pip install gradio.

  • Run img2img using python optimizedSD/img2img_gradio.py, txt2img using python optimizedSD/txt2img_gradio.py and inpainting using python optimizedSD/inpaint_gradio.py.

  • img2img_gradio.py has a feature to crop input images. Look for the pen symbol in the image box after selecting the image.

Arguments

--seed

Seed for image generation, can be used to reproduce previously generated images. Defaults to a random seed if unspecified.

  • The code prints the seed number along with each generated image. To generate the same image again, just specify the seed using the --seed argument. By default, images are saved with their seed number in the filename.

  • For example, if the seed number for an image is 1234 and it is the 55th image in the folder, the image will be named seed_1234_00055.png (see the sketch below).
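
A tiny illustration of that naming scheme (not the repo's exact code):

seed = 1234
base_count = 55  # running count of images already in the folder
filename = f"seed_{seed}_{base_count:05}.png"  # -> seed_1234_00055.png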

--n_samples

Batch size, i.e. the number of images to generate at once.

  • To get the lowest inference time per image, use the largest batch size --n_samples that fits on the GPU. Inference time per image decreases as the batch size increases, but the required VRAM increases as well.

  • If you get a CUDA out-of-memory error, try reducing the batch size --n_samples. If that doesn't help, reduce the image width --W or height --H, or both.

--n_iter

Number of times to repeat the sampling run.

  • Equivalent to running the script n_iter times, except that the model is loaded only once across all iterations. Unlike n_samples, changing it does not affect the required VRAM or the inference time per image.

--H & --W

Height & width of the generated image.

  • Both height and width must be multiples of 64 (see the sketch below for snapping arbitrary sizes).
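
A small helper for snapping arbitrary sizes down to the nearest multiple of 64 (an illustrative sketch, not part of the repo):

def snap_to_64(x):
    # Round down to the nearest multiple of 64, as required by --H and --W.
    return (x // 64) * 64

snap_to_64(1080)  # -> 1024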

--turbo

Increases inference speed at the cost of extra VRAM usage.

  • Using this argument increases inference speed at the cost of around 700MB of extra GPU VRAM. It is especially effective when generating a small batch of images (around 1 to 4). With it, txt2img takes under 20 seconds and img2img under 15 seconds per image (on an RTX 2060, excluding the time to load the model). Use it with larger batch sizes too if enough GPU VRAM is available.

--precision autocast or --precision full

Whether to use full or mixed precision

  • Mixed precision is enabled by default. If your GPU lacks tensor cores (e.g., any GTX 10-series card), you may not be able to use mixed precision. Use the --precision full argument to disable it (see the sketch below for what this toggles).
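
Roughly speaking, --precision autocast wraps the forward passes in PyTorch's autocast context, while --precision full skips it. A minimal sketch of the idea (not the repo's exact code):

import torch
from contextlib import nullcontext

precision = "autocast"  # or "full" when --precision full is passed
precision_scope = torch.autocast("cuda") if precision == "autocast" else nullcontext()

with torch.no_grad(), precision_scope:
    pass  # run the sampling / decoding steps here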

--format png or --format jpg

Output image format

  • The default output format is png. While png is lossless, it takes up a lot of space (unless large portions of the image happen to be a single colour). Use lossy jpg to get smaller image file sizes.

--unet_bs

Batch size for the unet model

  • Takes up a lot of extra RAM for very little improvement in inference time. unet_bs > 1 is not recommended!

  • Should generally be a multiple of 2x(n_samples)

Weighted Prompts

  • Prompts can also be weighted to put relative emphasis on certain words, e.g. --prompt "tabby cat:0.25 white duck:0.75 hybrid".

  • The number after the colon is the weight given to the words before it. The weights can be either fractions or integers (see the parsing sketch below).
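
A rough sketch of how such a weighted prompt could be split into (text, weight) pairs (illustrative only; the repo's actual parser may differ):

import re

def parse_weighted_prompt(prompt):
    # "tabby cat:0.25 white duck:0.75 hybrid" ->
    # [("tabby cat", 0.25), ("white duck", 0.75), ("hybrid", 1.0)]
    pairs, pos = [], 0
    for m in re.finditer(r":(\d+(?:\.\d+)?)", prompt):
        text = prompt[pos:m.start()].strip()
        if text:
            pairs.append((text, float(m.group(1))))
        pos = m.end()
    tail = prompt[pos:].strip()
    if tail:
        pairs.append((tail, 1.0))  # trailing text without a weight defaults to 1.0
    return pairs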

Troubleshooting

Green colored output images

  • If you have an Nvidia GTX-series GPU, the output images may be entirely green. This happens because GTX-series cards do not support half-precision calculation, which is the default mode of calculation in this repository. To work around the issue, use the --precision full argument (example below). The downside is higher GPU VRAM usage.
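
For example:

python optimizedSD/optimized_txt2img.py --prompt "Cyberpunk style image of a Tesla car reflection in rain" --H 512 --W 512 --seed 27 --n_iter 2 --n_samples 5 --ddim_steps 50 --precision full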

Changelog

  • v1.0: Added support for multiple samplers for txt2img (based on crowsonkb's work).
  • v0.9: Added support for calculating attention in parts. (Thanks to @neonsecret, @Doggettx, @ryudrigo)
  • v0.8: Added gradio interface for inpainting.
  • v0.7: Added support for logging and the jpg file format.
  • v0.6: Added support for using weighted prompts. (based on @lstein's repo)
  • v0.5: Added support for using gradio interface.
  • v0.4: Added support for specifying image seed.
  • v0.3: Added support for using mixed precision.
  • v0.2: Added support for generating images in batches.
  • v0.1: Split the model into multiple parts to run it on lower VRAM.

stable-diffusion's People

Contributors

arktronic, awthwathje, basujindal, cpacker, easonoob, gruyade, homedirectory, jn-jairo, mokewe, neonsecret, pesser, rromb, ryudrigo, yawnoc

stable-diffusion's Issues

Fixed Code, Latent Channels and Downsampling Factor

I see these options in the Python file, but they're not in the GUI. Are they useful in any way for generation? ddim_eta (which is in the GUI now) sometimes makes a huge difference, and I'm wondering if these do too; if so, can they be added?

Invalid syntax

Hello!

I was having this issue repeatedly:

EDIT: I forgot --prompt on this example; sorry about that!

(ldm) C:\StableDiffusion\stable-diffusion-main>python "optimizedSD\optimized_txt2img.py" "bubble sheep"

usage: optimized_txt2img.py [-h] [--prompt [PROMPT]] [--outdir [OUTDIR]] [--skip_grid] [--skip_save] [--ddim_steps DDIM_STEPS] [--fixed_code] [--ddim_eta DDIM_ETA] [--n_iter N_ITER] [--H H] [--W W] [--C C] [--f F] [--n_samples N_SAMPLES] [--n_rows N_ROWS] [--scale SCALE] [--from-file FROM_FILE] [--seed SEED] [--small_batch] [--precision {full,autocast}]
optimized_txt2img.py: error: unrecognized arguments: bubble sheep

(This was happening regardless of height and width. I literally typed something about a banana chasing an ice cream monster and got it.)

So I updated, ran things like I was before, which were working, and I got:

(ldm) C:\StableDiffusion\stable-diffusion-main>python "optimizedSD\optimized_txt2img.py" --prompt "bubble sheep"

File "optimizedSD\optimized_txt2img.py", line 8
    <!DOCTYPE html>
    ^
SyntaxError: invalid syntax

So I just thought I would let you know. :)

Thank you for all your hard work, and have a lovely day!!

EDIT2: I also want to mention I originally followed Tingting's tutorial, to a degree.

Optimizing speed

Hi,

As an RTX 2060 user (MSI RTX 2060 Gaming card), I really appreciate the work you put into this repo. Amazing job!

Since I want to generate 512x704 pictures, your repo is the only one that works with 6 GB of VRAM. However, when I generate a 512x512 image, for example, it takes around 55 seconds (for 50 steps), which seems to be twice the time announced in the readme. If I increase the number of samples, the time per image improves (about 2:30 minutes for 8 images). But when I enter an iterative phase where I only fine-tune one image, it can get pretty slow.

When I monitor the used VRAM, I see that for 512x704 images, only about 4.5 GB are used. In the readme, it is stated that the model is split into 4 pieces, that are uploaded when required. So I was wondering: for people having 6 GB of VRAM, do you think that only splitting the model in 2 parts would be faster? Also, do you think that having the script running and keeping the model in CPU RAM between prompts would make initialization faster?

Thank you so much!

Edit: I'm realizing that the WebUI is already doing my second suggestion of keeping the script running without loading the model off the disk each time, right?

Add ETA Scale in Gradio Interface

Hello! :)

It would be nice to implement the ETA scale in the Gradio interface, because it has been shown to have a strong effect on the pictures.

How to specify n_start_index when getting hi-res sample?

So... using a higher value of n_samples increases GPU memory usage but lets me see multiple variants.
Say I like one variant out of 10 and want a higher-resolution version of it, but it has index = e.g. 5.
My GPU RAM is limited and I can't run n_samples > 2 with W & H = 700.
Is there any way out of this situation? Perhaps an option to specify n_start_index, which would create only the file that was at that index?

Tensor size error

I get the error: RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 23 for tensor number 1 in the list.

I have a GTX 1080. The weird thing is that it does work sometimes to run the program, but I mostly get this error.

Sorry if I missed something obvious.

All Generated Images Green


I was pretty excited that I could finally get everything running until this. I've tried several different setting configurations, and nothing seems to help. I have no errors to report since the program thinks it's working fine. I'm on Windows 10 with a 1650 with 4GB of VRAM, if that is of any help. I'm using miniconda with the default configuration.yaml file, and I used the optimized txt2img script in this repo.

Gradio interface "full_precision" returns a RuntimeError on txt2img

Whenever I try to run with full precision (due to a bug with green output/fp16 with the GTX 1660) I get a RuntimeError, output is as follows:

Traceback (most recent call last):
File "X:\miniconda3\envs\ldm\lib\site-packages\gradio\routes.py", line 248, in run_predict
output = await app.blocks.process_api(
File "X:\miniconda3\envs\ldm\lib\site-packages\gradio\blocks.py", line 643, in process_api
predictions, duration = await self.call_function(fn_index, processed_input)
File "X:\miniconda3\envs\ldm\lib\site-packages\gradio\blocks.py", line 556, in call_function
prediction = await block_fn.fn(*processed_input)
File "X:\miniconda3\envs\ldm\lib\site-packages\gradio\interface.py", line 655, in submit_func
prediction = await self.run_prediction(args)
File "X:\miniconda3\envs\ldm\lib\site-packages\gradio\interface.py", line 684, in run_prediction
prediction = await anyio.to_thread.run_sync(
File "X:\miniconda3\envs\ldm\lib\site-packages\anyio\to_thread.py", line 31, in run_sync
return await get_asynclib().run_sync_in_worker_thread(
File "X:\miniconda3\envs\ldm\lib\site-packages\anyio_backends_asyncio.py", line 937, in run_sync_in_worker_thread
return await future
File "X:\miniconda3\envs\ldm\lib\site-packages\anyio_backends_asyncio.py", line 867, in run
result = context.run(func, *args)
File ".\optimizedSD\txt2img_gradio.py", line 118, in generate
uc = modelCS.get_learned_conditioning(batch_size * [""])
File "c:\stable-diffusion\optimizedSD\ddpm.py", line 297, in get_learned_conditioning
c = self.cond_stage_model.encode(c)
File "c:\stable-diffusion\ldm\modules\encoders\modules.py", line 162, in encode
return self(text)
File "X:\miniconda3\envs\ldm\lib\site-packages\torch\nn\modules\module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "c:\stable-diffusion\ldm\modules\encoders\modules.py", line 156, in forward
outputs = self.transformer(input_ids=tokens)
File "X:\miniconda3\envs\ldm\lib\site-packages\torch\nn\modules\module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "X:\miniconda3\envs\ldm\lib\site-packages\transformers\models\clip\modeling_clip.py", line 722, in forward
return self.text_model(
File "X:\miniconda3\envs\ldm\lib\site-packages\torch\nn\modules\module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "X:\miniconda3\envs\ldm\lib\site-packages\transformers\models\clip\modeling_clip.py", line 643, in forward
encoder_outputs = self.encoder(
File "X:\miniconda3\envs\ldm\lib\site-packages\torch\nn\modules\module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "X:\miniconda3\envs\ldm\lib\site-packages\transformers\models\clip\modeling_clip.py", line 574, in forward
layer_outputs = encoder_layer(
File "X:\miniconda3\envs\ldm\lib\site-packages\torch\nn\modules\module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "X:\miniconda3\envs\ldm\lib\site-packages\transformers\models\clip\modeling_clip.py", line 317, in forward
hidden_states, attn_weights = self.self_attn(
File "X:\miniconda3\envs\ldm\lib\site-packages\torch\nn\modules\module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "X:\miniconda3\envs\ldm\lib\site-packages\transformers\models\clip\modeling_clip.py", line 257, in forward
attn_output = torch.bmm(attn_probs, value_states)
RuntimeError: expected scalar type Half but found Float

Allow --small_batch to specify arbitrary number of images to be generated at the same time

I am using a GTX 1070 Ti. When generating images, 4GB of VRAM is used if I don't pass --small_batch, and 5GB if I do.
I have 8GB of VRAM and would like to use more of it to speed up inference.
Currently --small_batch is a flag that causes two images to be generated at the same time.
Would there be any problem with changing this to accept a numerical parameter, so that three or more images can be generated at the same time?

random seed

The seed is not random by default:

parser.add_argument(
    "--seed",
    type=int,
    default=42,
    help="the seed (for reproducible sampling)",
)

If you're tired of it, just comment out this line:

default=42,

It's not beautiful, but it works.

No module named 'taming'

I'm getting this error when using Optimized Stable Diffusion but NOT Stable Diffusion:

Traceback (most recent call last):
File "D:\Programs\Optimized Stable Diffusion\optimizedSD\optimized_txt2img.py", line 198, in
model = instantiate_from_config(config.modelUNet)
File "d:\programs\stable diffusion\ldm\util.py", line 85, in instantiate_from_config
return get_obj_from_str(config["target"])(**config.get("params", dict()))
File "d:\programs\stable diffusion\ldm\util.py", line 93, in get_obj_from_str
return getattr(importlib.import_module(module, package=None), cls)
File "C:\My Files\Programs\Python\Python39\lib\importlib_init_.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "", line 1030, in _gcd_import
File "", line 1007, in _find_and_load
File "", line 986, in _find_and_load_unlocked
File "", line 680, in _load_unlocked
File "", line 850, in exec_module
File "", line 228, in _call_with_frames_removed
File "d:\programs\stable diffusion\optimizedSD\ddpm.py", line 14, in
from ldm.models.autoencoder import VQModelInterface
File "d:\programs\stable diffusion\ldm\models\autoencoder.py", line 6, in
from taming.modules.vqvae.quantize import VectorQuantizer2 as VectorQuantizer
ModuleNotFoundError: No module named 'taming'

AssertionError when running img2img

I've been running some img2img prompts all day with no hitch, but a few hours ago I started picking up an AssertionError

Traceback (most recent call last):
File "scripts/optimized_img2img.py", line 220, in
assert os.path.isfile(opt.init_img)
AssertionError

After a while of trying to fix it it began to say that ldm was having syntax errors around some print statements, saying the issue was that parentheses were missing. I tried reinstalling ldm and then it began saying that ldm couldn't be found as a module. I reinstalled the entire env and now I'm back to square one with the AssertionError above.

Anaconda blocked while installing pip dependencies

Hi,

I spent the whole day trying to figure out why the procedure to install the SD environment didn't work on my Windows 10 machine. Everything was downloading properly, but the process was stuck at "Installing pip dependencies", which lasted forever. After lots of tests, I finally discovered that it was due to a hidden prompt that was not appearing, blocking the whole installation. This prompt appears when the installer tries to download a repo from GitHub: if the folder already exists, pip asks whether you want to wipe it or back it up, and since the question is not visible, everything blocks.

Solution: just delete the "src\clip" and "src\taming-transformers" folders before launching the environment creation. I strongly suggest removing them from the repo, as they are not in the original one. It will save people many headaches, I guess. Cheers!

Trouble when trying to run the example command.

I ran this through conda powershell prompt
(ldm) PS D:\stable-diffusion> python optimizedSD/optimized_img2img.py --prompt "Austrian alps" --init-img ~/sketch-mountains-input.jpg --strength 0.8 --n_iter 2 --n_samples 5 --H 576 --W 768
Traceback (most recent call last):
File "optimizedSD/optimized_img2img.py", line 16, in
from ldm.util import instantiate_from_config
File "C:\Users\ebarr\miniconda3\envs\ldm\lib\site-packages\ldm.py", line 77
print os.path.exists("%s.bz2"%(predictor_path))
^
SyntaxError: invalid syntax

try half precision ftw

I figured out that if you call model.half() right after model = instantiate_from_config(config.modelUNet) and change precision = "full" to precision = "autocast", it reliably reduces generation time for me (1.82 min to ~1.1 min) on a 512x512 image.

It might save some memory too and allow larger dimensions.

I compared full precision and half precision in Beyond Compare; it detects some minor differences, but nothing I can notice side by side.
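
A hedged sketch of the change being described, in generic PyTorch terms (the module below is a stand-in for the UNet that instantiate_from_config(config.modelUNet) returns; the script's actual variable names may differ):

import torch

model = torch.nn.Conv2d(4, 4, 3)  # stand-in for the instantiated UNet
model.half()                      # convert the weights to fp16 right after instantiation

if torch.cuda.is_available():
    model.cuda()
    x = torch.randn(1, 4, 64, 64, device="cuda").half()
    with torch.no_grad():
        y = model(x)  # fp16 forward pass; pair this with precision = "autocast" in the script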

Prompt weighting feature

Code is already written, it's here:
invoke-ai/InvokeAI#18

Prompt weighting is a very powerful feature that allows users to make specific changes to an already generated image (by reusing its seed).
For example, I have a picture of a woman and I want this woman to wear a crown; I just alter the original prompt to "beautiful woman sitting on a chair:1 wearing a crown:0.2" and a crown appears almost without any modification to the original image.

Save/Print seed per sample?

I've modified the code to set a random seed in the n_iter loop, and for now I use that as the filename. This works, but ideally it would be triggered for each generated sample instead of sledgehammered in.
What I can't figure out is: if I want to use n_samples, how do I fetch the seed that generated each sample? I'd like to get not only the image out, but also the seed associated with it, so that I can run variations on that seed.

Thanks!

Inference Speed

I wanted to compare the inference speeds to https://github.com/harubaru/waifu-diffusion/

I noticed that while the first inference is around the same speed, the waifu-diffusion repo infers the iterations that follow at a much faster speed than this one does.

I'm already using --skip_grid and --small_batch.

Any idea why that is the case and what we could change to get the same output?

Additionally what are the chances of being able to add weighted prompts like this repo does: https://github.com/lstein/stable-diffusion

Prompts only generating blank images

Running the repo with the leaked weights, the process takes its time and appears to complete without any errors, but only outputs entirely blank green images. I've tried playing around with the width, height, steps, iterations, and samples parameters to no avail. Anything that could be the matter?

Is k-diffusion included with this optimized version of SD?

Apologies if this is the wrong place to ask. I was just curious whether K-Diffusion/k_lms was added for the 0.5 release. I've been using a different setup (a fork or companion of this version with Gradio) that uses it, and I quite like the results.

If K-Diffusion isn't available yet, any chance it could be added as a sampling option? :)

If the number of prompts entered from the argument "--from-file" does not match the number of "--n_samples", an error occurs.

Not optimized, but I personally add the code between 211ish and 212ish:

else:
    print(f"reading prompts from {opt.from_file}")
    with open(opt.from_file, "r") as f:
        data = f.read().splitlines()
        # repeat the whole prompt list batch_size times so it divides evenly into batches
        data = batch_size * list(data)
        data = list(chunk(data, batch_size))

This way, output is generated for every prompt read from the "--from-file" argument.

I am Japanese.
I use automatic translation.
Please forgive my poor English.

Traceback Error

Traceback (most recent call last):
  File "optimizedSD/optimized_txt2img.py", line 16, in <module>
    from ldm.util import instantiate_from_config
  File "C:\Users\nicho\miniconda3\envs\ldm\lib\site-packages\ldm.py", line 20
    print self.face_rec_model_path

SyntaxError: Missing parentheses in call to 'print'. Did you mean print(self.face_rec_model_path)?

I just tried to install the latest version, and got this error whenever I tried to use the optimized_txt2img.py. It worked before, so I'm not sure what the exact issue is.

Out of memory.... how big is maximum resolution?

I tried 1024x1024 on an RTX 3090 and it always crashes (while 768x768 works fine). The same thing happens with the original stable-diffusion repo. How much better is this one? What's the maximum resolution on 24GB? Thanks!

Need to include an option to not append the prompt in the output path

As mentioned in the discussion, a long prompt causes Windows to truncate the output directory name, which leads to a FileNotFound exception.
It would be great to have an option to use the exact output path specified.

Discussed in #26

Originally posted by lessirey August 23, 2022
The folder and file names try to include the entire prompt, which is good, but on lengthy prompts the system breaks and gives an error that the file cannot be found, since Windows can't create such a lengthy name. It's fine with shorter prompts.

Ideally it's good to include the prompt, but truncated to a specific length and ending with the seed value, e.g.:
[beginning part of the prompt, but truncated]+[additional parameters]+[seed value].png

"File name too long" error if prompt exceeds valid filename length

If your prompt is too long, this happens:

OSError: [Errno 36] File name too long

Pretty easy fix. Here, we can just change:

sample_path = os.path.join(outpath, "samples", "_".join(opt.prompt.split()) )

to

sample_path = os.path.join(outpath, "samples", "_".join(opt.prompt.split())[0:50] )

... or something similar.

Issue with file names and lengthy prompts.

The folder and file names try to include the entire prompt, which is good, but on lengthy prompts the system breaks and gives an error that the file cannot be found, since Windows can't create such a lengthy name. It's fine with shorter prompts.

Ideally it's good to include the prompt, but truncated to a specific length and ending with the seed value, e.g.:
[beginning part of the prompt, but truncated]+[additional parameters]+[seed value].png

Figured out how to maintain aspect ratio in img2img

This will keep the number of pixels (area) constant, while also maintaining aspect ratio of the original picture.

from math import sqrt

def resize_wh(w, h):
    # Keep the total pixel area at 512*512 while preserving the original aspect ratio.
    area = 512 * 512
    aspect_ratio = w / h
    resized_w = round(sqrt(aspect_ratio) * sqrt(area))
    resized_h = round(sqrt(area) / sqrt(aspect_ratio))

    return resized_w, resized_h

>>> resize_wh(1920, 1080)
(683, 384)

RuntimeError: Error(s) in loading state_dict for UNet:

How can I use the Stable Diffusion weights (or any others) with this fork? I get this error:

Traceback (most recent call last):
File "optimizedSD/optimized_txt2img.py", line 193, in
_, _ = model.load_state_dict(sd, strict=False)
File "C:\Users\cotton\miniconda3\envs\ldm\lib\site-packages\torch\nn\modules\module.py", line 1604, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for UNet:
size mismatch for model1.diffusion_model.input_blocks.1.1.transformer_blocks.0.attn2.to_k.weight: copying a param with shape torch.Size([320, 1280]) from checkpoint, the shape in current model is torch.Size([320, 768]).
size mismatch for model1.diffusion_model.input_blocks.1.1.transformer_blocks.0.attn2.to_v.weight: copying a param with shape torch.Size([320, 1280]) from checkpoint, the shape in current model is torch.Size([320, 768]).
size mismatch for model1.diffusion_model.input_blocks.2.1.transformer_blocks.0.attn2.to_k.weight: copying a param with shape torch.Size([320, 1280]) from checkpoint, the shape in current model is torch.Size([320, 768]).
size mismatch for model1.diffusion_model.input_blocks.2.1.transformer_blocks.0.attn2.to_v.weight: copying a param with shape torch.Size([320, 1280]) from checkpoint, the shape in current model is torch.Size([320, 768]).
size mismatch for model1.diffusion_model.input_blocks.4.1.transformer_blocks.0.attn2.to_k.weight: copying a param with shape torch.Size([640, 1280]) from checkpoint, the shape in current model is torch.Size([640, 768]).
size mismatch for model1.diffusion_model.input_blocks.4.1.transformer_blocks.0.attn2.to_v.weight: copying a param with shape torch.Size([640, 1280]) from checkpoint, the shape in current model is torch.Size([640, 768]).
size mismatch for model1.diffusion_model.input_blocks.5.1.transformer_blocks.0.attn2.to_k.weight: copying a param with shape torch.Size([640, 1280]) from checkpoint, the shape in current model is torch.Size([640, 768]).
size mismatch for model1.diffusion_model.input_blocks.5.1.transformer_blocks.0.attn2.to_v.weight: copying a param with shape torch.Size([640, 1280]) from checkpoint, the shape in current model is torch.Size([640, 768]).
size mismatch for model1.diffusion_model.input_blocks.7.1.transformer_blocks.0.attn2.to_k.weight: copying a param with shape torch.Size([1280, 1280]) from checkpoint, the shape in current model is torch.Size([1280, 768]).
size mismatch for model1.diffusion_model.input_blocks.7.1.transformer_blocks.0.attn2.to_v.weight: copying a param with shape torch.Size([1280, 1280]) from checkpoint, the shape in current model is torch.Size([1280, 768]).
size mismatch for model1.diffusion_model.input_blocks.8.1.transformer_blocks.0.attn2.to_k.weight: copying a param with shape torch.Size([1280, 1280]) from checkpoint, the shape in current model is torch.Size([1280, 768]).
size mismatch for model1.diffusion_model.input_blocks.8.1.transformer_blocks.0.attn2.to_v.weight: copying a param with shape torch.Size([1280, 1280]) from checkpoint, the shape in current model is torch.Size([1280, 768]).
size mismatch for model1.diffusion_model.middle_block.1.transformer_blocks.0.attn2.to_k.weight: copying a param with shape torch.Size([1280, 1280]) from checkpoint, the shape in current model is torch.Size([1280, 768]).
size mismatch for model1.diffusion_model.middle_block.1.transformer_blocks.0.attn2.to_v.weight: copying a param with shape torch.Size([1280, 1280]) from checkpoint, the shape in current model is torch.Size([1280, 768]).
size mismatch for model2.diffusion_model.output_blocks.3.1.transformer_blocks.0.attn2.to_k.weight: copying a param with shape torch.Size([1280, 1280]) from checkpoint, the shape in current model is torch.Size([1280, 768]).
size mismatch for model2.diffusion_model.output_blocks.3.1.transformer_blocks.0.attn2.to_v.weight: copying a param with shape torch.Size([1280, 1280]) from checkpoint, the shape in current model is torch.Size([1280, 768]).
size mismatch for model2.diffusion_model.output_blocks.4.1.transformer_blocks.0.attn2.to_k.weight: copying a param with shape torch.Size([1280, 1280]) from checkpoint, the shape in current model is torch.Size([1280, 768]).
size mismatch for model2.diffusion_model.output_blocks.4.1.transformer_blocks.0.attn2.to_v.weight: copying a param with shape torch.Size([1280, 1280]) from checkpoint, the shape in current model is torch.Size([1280, 768]).
size mismatch for model2.diffusion_model.output_blocks.5.1.transformer_blocks.0.attn2.to_k.weight: copying a param with shape torch.Size([1280, 1280]) from checkpoint, the shape in current model is torch.Size([1280, 768]).
size mismatch for model2.diffusion_model.output_blocks.5.1.transformer_blocks.0.attn2.to_v.weight: copying a param with shape torch.Size([1280, 1280]) from checkpoint, the shape in current model is torch.Size([1280, 768]).
size mismatch for model2.diffusion_model.output_blocks.6.1.transformer_blocks.0.attn2.to_k.weight: copying a param with shape torch.Size([640, 1280]) from checkpoint, the shape in current model is torch.Size([640, 768]).
size mismatch for model2.diffusion_model.output_blocks.6.1.transformer_blocks.0.attn2.to_v.weight: copying a param with shape torch.Size([640, 1280]) from checkpoint, the shape in current model is torch.Size([640, 768]).
size mismatch for model2.diffusion_model.output_blocks.7.1.transformer_blocks.0.attn2.to_k.weight: copying a param with shape torch.Size([640, 1280]) from checkpoint, the shape in current model is torch.Size([640, 768]).
size mismatch for model2.diffusion_model.output_blocks.7.1.transformer_blocks.0.attn2.to_v.weight: copying a param with shape torch.Size([640, 1280]) from checkpoint, the shape in current model is torch.Size([640, 768]).
size mismatch for model2.diffusion_model.output_blocks.8.1.transformer_blocks.0.attn2.to_k.weight: copying a param with shape torch.Size([640, 1280]) from checkpoint, the shape in current model is torch.Size([640, 768]).
size mismatch for model2.diffusion_model.output_blocks.8.1.transformer_blocks.0.attn2.to_v.weight: copying a param with shape torch.Size([640, 1280]) from checkpoint, the shape in current model is torch.Size([640, 768]).
size mismatch for model2.diffusion_model.output_blocks.9.1.transformer_blocks.0.attn2.to_k.weight: copying a param with shape torch.Size([320, 1280]) from checkpoint, the shape in current model is torch.Size([320, 768]).
size mismatch for model2.diffusion_model.output_blocks.9.1.transformer_blocks.0.attn2.to_v.weight: copying a param with shape torch.Size([320, 1280]) from checkpoint, the shape in current model is torch.Size([320, 768]).
size mismatch for model2.diffusion_model.output_blocks.10.1.transformer_blocks.0.attn2.to_k.weight: copying a param with shape torch.Size([320, 1280]) from checkpoint, the shape in current model is torch.Size([320, 768]).
size mismatch for model2.diffusion_model.output_blocks.10.1.transformer_blocks.0.attn2.to_v.weight: copying a param with shape torch.Size([320, 1280]) from checkpoint, the shape in current model is torch.Size([320, 768]).
size mismatch for model2.diffusion_model.output_blocks.11.1.transformer_blocks.0.attn2.to_k.weight: copying a param with shape torch.Size([320, 1280]) from checkpoint, the shape in current model is torch.Size([320, 768]).
size mismatch for model2.diffusion_model.output_blocks.11.1.transformer_blocks.0.attn2.to_v.weight: copying a param with shape torch.Size([320, 1280]) from checkpoint, the shape in current model is torch.Size([320, 768]).

SSL Error related to HuggingFace

So the whole thing was working for a while but now I'm facing an SSL issue related to HuggingFace. It originates at

self.tokenizer = CLIPTokenizer.from_pretrained(version)

raise SSLError(e, request=request)
requests.exceptions.SSLError: HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /openai/clip-vit-large-patch14/resolve/main/vocab.json (Caused by SSLError("Can't connect to HTTPS URL because the SSL module is not available."))

Is there any way to fix it?

Missing support of PLMSSampler from txt2img.py

The latest version of txt2img.py (from August 22nd) supports PLMSSampler, but this fork lacks it.
PLMSSampler can be activated in the non-forked version by passing the "--plms" command-line argument.

Img2Img 6GB VRAM optimization

I tried porting the couple of instances of

 mem = torch.cuda.memory_allocated()/1e6
 while(torch.cuda.memory_allocated()/1e6 >= mem):
     time.sleep(1)  # block until allocated GPU memory drops below the snapshot

to img2img.py, but my understanding of the code is limited and it just stays stuck at 0% Sampling and data.
Any Ideas?

img2img problem

loaded input image of size (640, 640) from G:\ejem.jpg
Traceback (most recent call last):
File "optimizedSD\optimized_img2img.py", line 219, in
init_image = load_img(opt.init_img, opt.H, opt.W).to(device)
File "optimizedSD\optimized_img2img.py", line 41, in load_img
w, h = map(lambda x: x - x % 32, (w0, h0)) # resize to integer multiple of 32
File "optimizedSD\optimized_img2img.py", line 41, in
w, h = map(lambda x: x - x % 32, (w0, h0)) # resize to integer multiple of 32
TypeError: unsupported operand type(s) for %: 'NoneType' and 'int'

any suggestions? thanks

Inference speed very slow

I ran the example prompt from the Readme, which states that it finishes in less than 30 seconds on an RTX 2060. I have an RTX 2060 Super, which has 2 GB more VRAM, but somehow it takes over two minutes for the same prompt:

python optimizedSD/optimized_txt2img.py --prompt "Cyberpunk style image of a Telsa car reflection in rain" --H 512 --W 512 --seed 27 --n_samples 15 --ddim_steps 50 --skip_grid --small_batch

Am I doing something incorrectly? It's using around 6 GB of VRAM during inference, so I could set n_samples slightly higher, but even so it should be way faster, right?

Edit: Okay, I understand now. The n_samples in the Readme refers to the number of images inferred simultaneously, I think. That's why it's a lot faster when you set the number lower: it generates fewer images.

length of prompt

There is an error on Windows: with a long prompt, the result can't be saved because the path becomes too long. Maybe it would be better to truncate the prompt to keep the path short and avoid the issue?

Traceback (most recent call last):
File "optimizedSD/optimized_txt2img.py", line 158, in
os.makedirs(sample_path, exist_ok=True)
File "C:\Users\cotton\miniconda3\envs\ldm\lib\os.py", line 223, in makedirs
mkdir(name, mode)
FileNotFoundError: [WinError 3] System can't find the path ...

use more ram for more speed?

Is it possible to use more RAM for more speed, while still maintaining lower VRAM usage than the main SD repo, for people with a bit more VRAM (3070 users, for example)?

Gradio interface displaying nothing

Hi,

I'm trying to display the Gradio interface (I installed it using the "pip install gradio" command), but when I run the python script and go to http://127.0.0.1:7860/, it only displays a blank page. If I use the GUI with the waifu repo, it works fine.

Here is the code of this (seemingly) blank page:

<!DOCTYPE html>
<html lang="en" style="margin: 0; padding: 0; min-height: 100%">
	<head>
		<meta charset="utf-8" />
		<meta
			name="viewport"
			content="width=device-width, initial-scale=1, shrink-to-fit=no, maximum-scale=1"
		/>

		
		<meta property="og:url" content="https://gradio.app/" />
		<meta property="og:type" content="website" />
		<meta property="og:image" content="" />
		<meta property="og:title" content="Gradio" />
		<meta
			property="og:description"
			content=""
		/>
		<meta name="twitter:card" content="summary_large_image" />
		<meta name="twitter:creator" content="@teamGradio" />
		<meta name="twitter:title" content="Gradio" />
		<meta
			name="twitter:description"
			content=""
		/>
		<meta name="twitter:image" content="" />

		<script>
			window.dataLayer = window.dataLayer || [];
			function gtag() {
				dataLayer.push(arguments);
			}
			gtag("js", new Date());
			gtag("config", "UA-156449732-1");
			window.__gradio_mode__ = "app";
		</script>

		<script>window.gradio_config = {"components": [{"id": 13, "props": {"style": {"equal_height": false, "mobile_collapse": true}, "type": "row", "visible": true}, "type": "row"}, {"id": 14, "props": {"style": {}, "type": "column", "variant": "panel", "visible": true}, "type": "column"}, {"id": 15, "props": {"style": {}, "type": "column", "variant": "default", "visible": true}, "type": "column"}, {"id": 7, "props": {"label": "prompt", "lines": 1, "max_lines": 20, "name": "textbox", "show_label": true, "style": {}, "value": "", "visible": true}, "type": "textbox"}, {"id": 0, "props": {"label": "ddim_steps", "maximum": 1000, "minimum": 1, "name": "slider", "show_label": true, "step": 1, "style": {}, "value": 50, "visible": true}, "type": "slider"}, {"id": 1, "props": {"label": "n_iter", "maximum": 100, "minimum": 1, "name": "slider", "show_label": true, "step": 1, "style": {}, "value": 1, "visible": true}, "type": "slider"}, {"id": 2, "props": {"label": "batch_size", "maximum": 100, "minimum": 1, "name": "slider", "show_label": true, "step": 1, "style": {}, "value": 1, "visible": true}, "type": "slider"}, {"id": 3, "props": {"label": "Height", "maximum": 4096, "minimum": 512, "name": "slider", "show_label": true, "step": 64, "style": {}, "value": 512, "visible": true}, "type": "slider"}, {"id": 4, "props": {"label": "Width", "maximum": 4096, "minimum": 512, "name": "slider", "show_label": true, "step": 64, "style": {}, "value": 512, "visible": true}, "type": "slider"}, {"id": 8, "props": {"label": "seed", "lines": 1, "max_lines": 20, "name": "textbox", "show_label": true, "style": {}, "value": "", "visible": true}, "type": "textbox"}, {"id": 9, "props": {"label": "small_batch", "name": "checkbox", "show_label": true, "style": {}, "value": false, "visible": true}, "type": "checkbox"}, {"id": 10, "props": {"label": "full_precision", "name": "checkbox", "show_label": true, "style": {}, "value": false, "visible": true}, "type": "checkbox"}, {"id": 5, "props": {"label": "outdir", "lines": 1, "max_lines": 20, "name": "textbox", "show_label": true, "style": {}, "value": "outputs/txt2img-samples", "visible": true}, "type": "textbox"}, {"id": 16, "props": {"style": {"mobile_collapse": false}, "type": "row", "visible": true}, "type": "row"}, {"id": 17, "props": {"name": "button", "style": {}, "value": "Clear", "variant": "secondary", "visible": true}, "type": "button"}, {"id": 18, "props": {"name": "button", "style": {}, "value": "Submit", "variant": "primary", "visible": true}, "type": "button"}, {"id": 19, "props": {"style": {}, "type": "column", "variant": "panel", "visible": true}, "type": "column"}, {"id": 20, "props": {"cover_container": true, "name": "statustracker", "style": {}, "visible": true}, "type": "statustracker"}, {"id": 11, "props": {"image_mode": "RGB", "interactive": false, "label": "output 0", "mirror_webcam": true, "name": "image", "show_label": true, "source": "upload", "streaming": false, "style": {}, "tool": "editor", "visible": true}, "type": "image"}, {"id": 12, "props": {"interactive": false, "label": "output 1", "lines": 1, "max_lines": 20, "name": "textbox", "show_label": true, "style": {}, "value": "", "visible": true}, "type": "textbox"}, {"id": 21, "props": {"style": {"mobile_collapse": false}, "type": "row", "visible": true}, "type": "row"}, {"id": 22, "props": {"name": "button", "style": {}, "value": "Flag", "variant": "secondary", "visible": true}, "type": "button"}], "css": null, "dependencies": [{"api_name": "predict", "backend_fn": true, "documentation": [[["text", 
"str | None"], ["numeric input", "float"], ["numeric input", "float"], ["numeric input", "float"], ["numeric input", "float"], ["numeric input", "float"], ["text", "str | None"], ["boolean input", "bool"], ["boolean input", "bool"], ["text", "str | None"]], [["base64 url data", "str"], ["text", "str | None"]]], "inputs": [7, 0, 1, 2, 3, 4, 8, 9, 10, 5], "js": null, "outputs": [11, 12], "queue": null, "scroll_to_output": true, "show_progress": true, "status_tracker": 20, "targets": [18], "trigger": "click"}, {"api_name": null, "backend_fn": false, "inputs": [], "js": "() =\u003e [\"\", 50, 1, 1, 512, 512, \"\", null, null, \"\", null, \"\", {\"variant\": null, \"visible\": true, \"__type__\": \"update\"}]\n                ", "outputs": [7, 0, 1, 2, 3, 4, 8, 9, 10, 5, 11, 12, 15], "queue": false, "scroll_to_output": false, "show_progress": true, "status_tracker": null, "targets": [17], "trigger": "click"}, {"api_name": null, "backend_fn": true, "inputs": [7, 0, 1, 2, 3, 4, 8, 9, 10, 5, 11, 12], "js": null, "outputs": [], "queue": false, "scroll_to_output": false, "show_progress": true, "status_tracker": null, "targets": [22], "trigger": "click"}], "dev_mode": false, "enable_queue": false, "is_space": false, "layout": {"children": [{"children": [{"children": [{"children": [{"id": 7}, {"id": 0}, {"id": 1}, {"id": 2}, {"id": 3}, {"id": 4}, {"id": 8}, {"id": 9}, {"id": 10}, {"id": 5}], "id": 15}, {"children": [{"id": 17}, {"id": 18}], "id": 16}], "id": 14}, {"children": [{"id": 20}, {"id": 11}, {"id": 12}, {"children": [{"id": 22}], "id": 21}], "id": 19}], "id": 13}], "id": 6}, "mode": "interface", "show_error": false, "theme": "default", "title": "Gradio", "version": "3.1.7\n"};</script>

		<link rel="preconnect" href="[https://fonts.googleapis.com](view-source:https://fonts.googleapis.com/)" />
		<link
			rel="preconnect"
			href="[https://fonts.gstatic.com](view-source:https://fonts.gstatic.com/)"
			crossorigin="anonymous"
		/>
		<link
			href="[https://fonts.googleapis.com/css?family=Source Sans Pro](view-source:https://fonts.googleapis.com/css?family=Source%20Sans%20Pro)"
			rel="stylesheet"
		/>
		<link
			href="[https://fonts.googleapis.com/css?family=IBM Plex Mono](view-source:https://fonts.googleapis.com/css?family=IBM%20Plex%20Mono)"
			rel="stylesheet"
		/>
		<script src="[https://cdnjs.cloudflare.com/ajax/libs/iframe-resizer/4.3.1/iframeResizer.contentWindow.min.js](view-source:https://cdnjs.cloudflare.com/ajax/libs/iframe-resizer/4.3.1/iframeResizer.contentWindow.min.js)"></script>
		<script type="module" crossorigin src="[./assets/index.09173af6.js](view-source:http://127.0.0.1:7860/assets/index.09173af6.js)"></script>
		
	</head>

	<body style="width: 100%; margin: 0; padding: 0; height: 100%">
		<gradio-app>
			<div
				id="root"
				style="display: flex; flex-direction: column; flex-grow: 1"
			></div>
		</gradio-app>
		<script>
			const ce = document.getElementsByTagName("gradio-app");
			if (ce[0]) {
				ce[0].addEventListener("domchange", () => {
					document.body.style.padding = "0";
				});
				document.body.style.padding = "0";
			}
		</script>
	</body>
</html>

Do you have any idea why it works everywhere except with this repo? I get no error at all. Thank you!

Memory is staying reserved by PyTorch and not accessible for next generation

When using img2img through this repo I'm getting a cuda memory error:

RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB (GPU 0; 8.00 GiB total capacity; 5.09 GiB already allocated; 888.62 MiB free; 5.25 GiB reserved in total by PyTorch)

This doesn't seem to be a case of not having enough VRAM, since it shows that 5.25 GiB are reserved by PyTorch and only 888 MiB are free. It seems like each time I've used it, some amount was moved to reserved.

FirstStage object has no attribute 'get_first_stage_encoding'

Traceback (most recent call last):
File "optimizedSD/optimized_img2img.py", line 259, in
init_latent = modelFS.get_first_stage_encoding(modelFS.encode_first_stage(init_image)) # move to latent space
File "C:\Users<me>.conda\envs\ldm\lib\site-packages\torch\nn\modules\module.py", line 1185, in getattr
raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'FirstStage' object has no attribute 'get_first_stage_encoding'

I passed both a jpg and a png image to see if it had anything to do with the file type, and it does not work for either.

Old GPUs

How can I run Stable Diffusion on old GPUs with shared PC memory?
A GTX 650 with 1 GB VRAM, 6 GB shared RAM, and CUDA 3.0 in my case.

Out-of-bounds index errors for select img2img inputs owing to steps_out = ddim_timesteps + 1

It took a bit of tracing through the callstack to find out what was the matter when I ran certain img2img parameters (such as ddim_steps = 27 and strength = 1), but I believe it is this line in util.py for the make_ddim_timesteps method: steps_out = ddim_timesteps + 1. Consequently the last index may be 1000, triggering an OOB error later when it tries to access the 1000th element of alphacums.

There is this comment in the code to explain the +1:

# add one to get the final alpha values right (the ones from first scale to data during sampling)

but I don't know what to make of it. Is this necessary?

No module named 'optimizedSD'

When trying to run
python optimizedSD/optimized_img2img.py
I get the error below. The non-optimized img2img works fine.

It's something with optimizedSD.ddpm.UNet in v1-inference.yaml.

Traceback (most recent call last):
File "optimizedSD/optimized_img2img.py", line 224, in
model = instantiate_from_config(config.modelUNet)
File "c:\git\stable-diffusion\ldm\util.py", line 85, in instantiate_from_config
return get_obj_from_str(config["target"])(**config.get("params", dict()))
File "c:\git\stable-diffusion\ldm\util.py", line 93, in get_obj_from_str
return getattr(importlib.import_module(module, package=None), cls)
File "C:\Users\Christoph\miniconda3\envs\ldm\lib\importlib_init_.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "", line 1014, in _gcd_import
File "", line 991, in _find_and_load
File "", line 961, in _find_and_load_unlocked
File "", line 219, in _call_with_frames_removed
File "", line 1014, in _gcd_import
File "", line 991, in _find_and_load
File "", line 973, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'optimizedSD'
