
textual_inversion's Introduction

An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

arXiv

[Project Website]

An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion
Rinon Gal1,2, Yuval Alaluf1, Yuval Atzmon2, Or Patashnik1, Amit H. Bermano1, Gal Chechik2, Daniel Cohen-Or1
1Tel Aviv University, 2NVIDIA

Abstract:
Text-to-image models offer unprecedented freedom to guide creation through natural language. Yet, it is unclear how such freedom can be exercised to generate images of specific unique concepts, modify their appearance, or compose them in new roles and novel scenes. In other words, we ask: how can we use language-guided models to turn our cat into a painting, or imagine a new product based on our favorite toy? Here we present a simple approach that allows such creative freedom. Using only 3-5 images of a user-provided concept, like an object or a style, we learn to represent it through new "words" in the embedding space of a frozen text-to-image model. These "words" can be composed into natural language sentences, guiding personalized creation in an intuitive way. Notably, we find evidence that a single word embedding is sufficient for capturing unique and varied concepts. We compare our approach to a wide range of baselines, and demonstrate that it can more faithfully portray the concepts across a range of applications and tasks.

Description

This repo contains the official code, data and sample inversions for our Textual Inversion paper.

Updates

29/08/2022 Merge embeddings now supports SD embeddings. Added SD pivotal tuning code (WIP); fixed training duration and checkpoint save iterations.

21/08/2022 Code released!

TODO:

  • Release code!
  • Optimize gradient storing / checkpointing. Memory requirements, training times reduced by ~55%
  • Release data sets
  • Release pre-trained embeddings
  • Add Stable Diffusion support

Setup

Our code builds on, and shares requirements with, Latent Diffusion Models (LDM). To set up their environment, please run:

conda env create -f environment.yaml
conda activate ldm

You will also need the official LDM text-to-image checkpoint, available through the LDM project page.

Currently, the model can be downloaded by running:

mkdir -p models/ldm/text2img-large/
wget -O models/ldm/text2img-large/model.ckpt https://ommer-lab.com/files/latent-diffusion/nitro/txt2img-f8-large/model.ckpt

Usage

Inversion

To invert an image set, run:

python main.py --base configs/latent-diffusion/txt2img-1p4B-finetune.yaml \
               -t \
               --actual_resume /path/to/pretrained/model.ckpt \
               -n <run_name> \
               --gpus 0, \
               --data_root /path/to/directory/with/images \
               --init_word <initialization_word>

where the initialization word should be a single-token rough description of the object (e.g., 'toy', 'painting', 'sculpture'). If the input consists of more than a single token, you will be prompted to replace it.

Please note that init_word is not the placeholder string that will later represent the concept. It is only used as a starting point for the optimization scheme.
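For intuition, the relationship between the two can be sketched as follows. This is a conceptual illustration rather than the repository's actual code; the embedding size, optimizer, and learning rate are assumptions (the rate simply echoes the base_learning_rate mentioned further down this page):

import torch
import torch.nn as nn

# Conceptual sketch only -- not the repo's implementation.
# The init word ("toy", "sculpture", ...) merely supplies the starting vector;
# the placeholder string ("*") is the token whose embedding actually gets optimized.
embedding_dim = 768                                # text-encoder embedding size (assumption)
init_word_embedding = torch.randn(embedding_dim)   # stand-in for the frozen embedding of the init word

# The placeholder gets a fresh, trainable vector initialized from the init word.
placeholder_embedding = nn.Parameter(init_word_embedding.clone())

# Training then optimizes only this single vector; the text-to-image model stays frozen.
optimizer = torch.optim.AdamW([placeholder_embedding], lr=5.0e-3)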

In the paper, we use 5k training iterations. However, some concepts (particularly styles) can converge much faster.

To run on multiple GPUs, provide a comma-delimited list of GPU indices to the --gpus argument (e.g., --gpus 0,3,7,8)

Embeddings and output images will be saved in the log directory.

See configs/latent-diffusion/txt2img-1p4B-finetune.yaml for more options, such as: changing the placeholder string which denotes the concept (defaults to "*"), changing the maximal number of training iterations, changing how often checkpoints are saved and more.

Important: All training set images should be upright. If you are using phone-captured images, check the inputs_gs*.jpg files in the output image directory and make sure they are oriented correctly. Many phones capture images with a 90-degree rotation and record this in the image metadata. Windows parses this correctly, but PIL does not, so you will need to correct the images manually (e.g., by pasting them into Paint and re-saving) or wait until we add metadata parsing.
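If you prefer to fix the orientation programmatically, here is a minimal sketch using Pillow's EXIF handling (the directory path is a placeholder; back up your images first, since this overwrites them in place):

from pathlib import Path
from PIL import Image, ImageOps

image_dir = Path("/path/to/directory/with/images")   # the same folder passed to --data_root
for path in sorted(image_dir.glob("*.jpg")):          # adjust the pattern to your file types
    with Image.open(path) as img:
        upright = ImageOps.exif_transpose(img)        # applies the EXIF rotation flag, if any
        upright.save(path)                            # overwrites in place (re-encodes JPEGs)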

Generation

To generate new images of the learned concept, run:

python scripts/txt2img.py --ddim_eta 0.0 \
                          --n_samples 8 \
                          --n_iter 2 \
                          --scale 10.0 \
                          --ddim_steps 50 \
                          --embedding_path /path/to/logs/trained_model/checkpoints/embeddings_gs-5049.pt \
                          --ckpt_path /path/to/pretrained/model.ckpt \
                          --prompt "a photo of *"

where * is the placeholder string used during inversion.

Merging Checkpoints

LDM embedding checkpoints can be merged into a single file by running:

python merge_embeddings.py \
    --manager_ckpts /path/to/first/embedding.pt /path/to/second/embedding.pt [...] \
    --output_path /path/to/output/embedding.pt

For SD embeddings, simply add the flag: -sd or --stable_diffusion.

If the checkpoints contain conflicting placeholder strings, you will be prompted to select new placeholders. The merged checkpoint can later be used to prompt multiple concepts at once ("A photo of * in the style of @").

Pretrained Models / Data

Datasets which appear in the paper are being uploaded here. Some sets are unavailable due to image ownership. We will upload more as we receive permissions to do so.

Pretrained models coming soon.

Stable Diffusion

Stable Diffusion support is a work in progress and will be completed soon™.

Tips and Tricks

  • Adding "a photo of" to the prompt usually results in better target consistency.
  • Results can be seed sensitive. If you're unsatisfied with the model, try re-inverting with a new seed (by adding --seed <#> to the training command).

Citation

If you make use of our work, please cite our paper:

@misc{gal2022textual,
      doi = {10.48550/ARXIV.2208.01618},
      url = {https://arxiv.org/abs/2208.01618},
      eprint = {2208.01618},
      archivePrefix = {arXiv},
      primaryClass = {cs.CV},
      author = {Gal, Rinon and Alaluf, Yuval and Atzmon, Yuval and Patashnik, Or and Bermano, Amit H. and Chechik, Gal and Cohen-Or, Daniel},
      title = {An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion},
      publisher = {arXiv},
      year = {2022}
}

Results

Here are some sample results. Please visit our project page or read our paper for more!


textual_inversion's Issues

'pytorch_lightning.loggers' has no attribute 'TestTubeLogger'

Continuing from my previous attempts, I now get this error; apparently pytorch_lightning has removed the 'TestTubeLogger' module.

I also tried installing the setup.py requirements with these versions:
pytorch-lightning==1.5.9
pytorch=1.10.2

I tried both:

python3 main.py --base configs/stable-diffusion/v1-finetune.yaml -t --actual_resume models\ldm\stable-diffusion-v1\model.ckpt -n FireballRun --data_root object\artstyle\scene --gpus 1 --init-word TrainedFireball

python3 main.py --base configs/latent-diffusion/txt2img-1p4B-finetune.yaml -t --actual_resume models/ldm/text2img-large/model.ckpt -n FireballRun --data_root object\artstyle\scene --gpus 1 --init-word TrainedFireball

Restored from models\ldm\stable-diffusion-v1\model.ckpt with 0 missing and 2 unexpected keys
Unexpected Keys: ['model_ema.decay', 'model_ema.num_updates']
Traceback (most recent call last):
  File "D:\Workspace\textual_inversion-main\main.py", line 646, in <module>
    trainer_kwargs["logger"] = instantiate_from_config(logger_cfg)
  File "D:\Workspace\textual_inversion-main\ldm\util.py", line 85, in instantiate_from_config
    return get_obj_from_str(config["target"])(**config.get("params", dict()), **kwargs)
  File "D:\Workspace\textual_inversion-main\ldm\util.py", line 93, in get_obj_from_str
    return getattr(importlib.import_module(module, package=None), cls)
AttributeError: module 'pytorch_lightning.loggers' has no attribute 'TestTubeLogger'. Did you mean: 'NeptuneLogger'?

Disable flip augmentation?

I understand that in the vast majority of cases there is a tremendous advantage to augmenting data by flipping it. However, in my particular case I have a large and well-varied dataset of very strictly asymmetrical data, and I worry that the flipping might be keeping the training process from resolving as well as it ought to.

I'll stress that I'm not expecting a flag to be implemented for my admittedly very much edge-case use, but a quick pointer to which files I need to look at, and roughly where to disable the flip process, would be very much appreciated.
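A hedged pointer: the personalization dataset (under ldm/data/) appears to apply a torchvision horizontal-flip transform to each sample. A minimal sketch of the change, assuming a flip_p-style probability parameter exists there, is to drop the probability to zero so the flip becomes a no-op:

from torchvision import transforms

# Hedged sketch: if the dataset builds its augmentation like this (the flip_p
# parameter name is an assumption), setting the probability to 0.0 disables
# flipping without touching the rest of the pipeline.
flip_p = 0.0                                   # presumed default is 0.5
flip = transforms.RandomHorizontalFlip(p=flip_p)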

Inpainting only accepts image and mask

I've seen in other instances of inpainting tasks that a prompt is taken to guide the inpainting. Looking at scripts/inpaint.py, no prompt is taken; there are only the image and mask (per image in indir). Since the inverted embeddings are guided via the text prompt and their custom token, how do we pass that information to the inpainting?

I also see there is nowhere in scripts/inpaint.py to load any custom fine-tuned embeddings.

NameError: name 'trainer' is not defined

Hi

When I try to launch main.py, I get:

if trainer.global_rank == 0:

NameError: name 'trainer' is not defined

I use the same env as stable-diffusion (which works well).

TypeError: __init__() got an unexpected keyword argument 'embedding_reg_weight'

I trained with this command successfully:

python main.py --base configs/stable-diffusion/v1-finetune.yaml -t --actual_resume ../stable-diffusion-main/models/ldm/stable.ckpt -n test1 --gpus 0, --data_root img/loran --init_word l

but when I launch this command

python scripts/txt2img.py --ddim_eta 0.0 --n_samples 1 --n_iter 2 --scale 10.0 --ddim_steps 50 --embedding_path logs/loran2022-08-29T16-27-54_test1/checkpoints/embeddings_gs-6099.pt --ckpt_path ../stable-diffusion-main/models/ldm/stable.ckpt --prompt "a photo of *"

I got this error:

Loading model from /home/.../stable-diffusion-main/models/ldm/stable.ckpt
Traceback (most recent call last):
  File "/home/.../textual_inversion/scripts/txt2img.py", line 120, in <module>
    model = load_model_from_config(config, opt.ckpt_path)  # TODO: check path
  File "/home/.../textual_inversion/scripts/txt2img.py", line 18, in load_model_from_config
    model = instantiate_from_config(config.model)
  File "/home/.../stable-diffusion-main/ldm/util.py", line 79, in instantiate_from_config
    return get_obj_from_str(config["target"])(**config.get("params", dict()))
  File "/home/.../stable-diffusion-main/ldm/models/diffusion/ddpm.py", line 510, in __init__
    super().__init__(conditioning_key=conditioning_key, *args, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'embedding_reg_weight'

license?

Hi! Awesome stuff - just wondering what the license for the code is (MIT?) Mostly so I know if I'm able to do a video on it :)

Thanks!

Instantiating model for inference fails

I have been having an issue with starting either version of txt2img, where the script crashes at the instantiate_from_config call.

  File "scripts/stable_txt2img.py", line 287, in <module>
    main()
  File "scripts/stable_txt2img.py", line 195, in main
    model = load_model_from_config(config, f"{opt.ckpt}")
  File "scripts/stable_txt2img.py", line 31, in load_model_from_config
    model = instantiate_from_config(config.model)
  File "f:\stablediffusion\stable-diffusion-main\ldm\util.py", line 85, in instantiate_from_config
    return get_obj_from_str(config["target"])(**config.get("params", dict()))
  File "f:\stablediffusion\stable-diffusion-main\ldm\models\diffusion\ddpm.py", line 448, in __init__
    super().__init__(conditioning_key=conditioning_key, *args, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'personalization_config'

Training works perfectly fine and appears to make an identical call.
The trace appears to be using the ldm module in the stable-diffusion directory instead of the local textual_inversion one, which would explain the failure (since the extra argument is not present there). Is there some environment switch needed to point the script to the correct module? Or is there some other way to work around this, since the same call is made in main.py without issue?

Edit:
To clarify, the txt2img script is being run in my textual_inversion directory f:\text_inversion\textual_inversion-main

Lightning Error

Strange error when training on SD: pytorch_lightning.utilities.exceptions.MisconfigurationException: No 'test_dataloader()' method defined to run 'Trainer.test'.

I got this after some steps (6,000, if I'm not wrong).

Unable to start training

When I attempt to start the trainer, I get the following:

Global seed set to 23
Running on GPUs 1
Traceback (most recent call last):
  File "main.py", line 535, in <module>
    model = instantiate_from_config(config.model)
  File "C:\StableDiffusion\ldm\util.py", line 85, in instantiate_from_config
    return get_obj_from_str(config["target"])(**config.get("params", dict()), **kwargs)
  File "C:\StableDiffusion\ldm\models\diffusion\ddpm.py", line 454, in __init__
    super().__init__(conditioning_key=conditioning_key, *args, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'unfreeze_model'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "main.py", line 740, in <module>
    if trainer.global_rank == 0:
NameError: name 'trainer' is not defined

I am not sure how to solve this issue. It appears to be something wrong with the actual execution of main.py. I am running Windows 10 Pro.

training time

Cool project! I was wondering: what is the expected runtime for 5K iterations?

error when trying to merge embeddings

When I try to merge two embeddings, I get this error (on Ubuntu WSL on Windows 11):

Traceback (most recent call last):
  File "merge_embeddings.py", line 56, in <module>
    manager = EmbeddingManager()
  File "/home/user/textual_inversion/ldm/modules/embedding_manager.py", line 59, in __init__
    get_embedding_for_tkn = partial(get_embedding_for_clip_token, embedder.transformer.text_model.embeddings)
  File "/home/user/anaconda3/envs/ldm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1177, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'BERTTokenizer' object has no attribute 'transformer'

Resuming from log checkpoint

Is resuming from a log checkpoint supported?

I did try using the --actual_resume and --embedding_manager_ckpt arguments in a few combinations, but they all led to crashes, so I must be missing something. If this is supported, what should the argument structure be?

Higher vector number

Is it worth using a higher value for the config option num_vectors_per_token, like 2048 or even higher?

RAM needs - SOLVED

Hi,

running

python main.py --base configs/latent-diffusion/txt2img-1p4B-finetune.yaml \
               -t \
               --actual_resume models/ldm/text2img-large/model.ckpt \
               -n oldphone \
               --gpus 0 \
               --data_root inputimg/ \
               --init_word oldphonesynth

Global seed set to 23
Running on GPUs 0
Loading model from models/ldm/text2img-large/model.ckpt
LatentDiffusion: Running in eps-prediction mode
DiffusionWrapper has 872.30 M params.
making attention of type 'vanilla' with 512 in_channels
Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
making attention of type 'vanilla' with 512 in_channels

It suddenly halts after a minute or so saying "Killed"

Killed

Apparently I had to free up loads of RAM to get the training started; 32 GB was barely enough, and I had to kill all open applications. Just posting this here so you know what to do when you encounter "Killed".

Support for Image Prompts

Do any of the current revisions of models support image+text prompt to guide the sampling?

I'm looking through the code and see support for masks, which makes me think I should be able to pass an image in w/o a mask.

Is there more to do to enable this?

"String maps to more than a single token" after running main.py a few times

I was able to run Textual Inversion's main.py successfully a few times today, training on two new concepts. (I cancelled one run between the two successful runs because I had a power outage at home, got disconnected from my instance, and wanted a fresh start.) The two successful runs didn't seem to be improving after a while, so I stopped them with Ctrl-C in the terminal.

However, when I tried running Textual Inversion after that, any --init_word that I entered as an argument and any initializer_words that I tried setting in v1-finetune.yaml resulted in the error "string maps to more than a single token."

I was running Textual Inversion on an AWS EC2 instance running Amazon Linux 2. I tried exiting conda and even rebooted the instance, but continued getting the same error.

Code explanation?

Can you explain why the defaults in the yaml file are as follows?
target: ldm.modules.embedding_manager.EmbeddingManager
params:
  placeholder_strings: ["*"]
  initializer_words: ["sculpture"]
  per_image_tokens: false
  num_vectors_per_token: 1
  progressive_words: False

So does that mean I am training a sculpture even if my --init_word is "mydog sketch"? I don't quite get it; should I remove the "sculpture" from there? Where are the instructions about all this? Was it skipped because it's not important? I'd prefer to know before I spend days on training and realize I trained a sculpture rather than a sketch drawing.

Or are these just defaults that are ignored once you use --init_word?

So, next question: I want to train the Stable Diffusion model on images of He-Man from the Filmation cartoon (because right now He-Man doesn't look much like himself in Stable Diffusion). So I have --init_word "he-man filmation" and an image folder with frames from the cartoon, and now the questions:
Will this improve other versions of He-Man in any way (when I prompt "he-man filmation photorealistic, film still")? Or will only this Filmation cartoon version be improved, and only when I hit both --init_word words rather than just one? What if I prompt only "he-man" without "filmation"; will the embedding checkpoint still affect my results? I eventually want to train on more realistic images of He-Man; should I do that separately, or should I just gather all the He-Man images, realistic and cartoony, and train on them at once? I can't find info on this anywhere. I basically want to force out the current bad-looking He-Man images and bring in good-looking ones via the embedding; how can I do that, so cartoon and realistic He-Man images aren't generated from the original Stable Diffusion model but rather from the fine-tuned checkpoints where he looks as he should?
So when I train with --init_word "he-man filmation", should I adjust something in the yaml file, like initializer_words or something else? Or is the command line with --init_word enough?

Next question: I can see some blank white images in the preview during training, with the text "a photo of my *" on them. Why are they white? I don't get this part at all; am I training the wrong way? Should they show He-Man images? I should add that I can also see He-Man images in this folder (the reconstruction_gs and inputs_gs images), but I'm concerned about the white ones named "conditioning"; what are they?
What are the images called "reconstruction_gs"?
What are the images called "inputs_gs"?
What are the images called "samples"? Are those actual results from the model, with the fine-tuned data being used during generation?
I know that samples are made with makesamples.py, but it does not contain any info on what the samples are and how they might guide you.
Sorry for all the questions, but I want to get this right, and the paper does not cover any of this. Where can I read what each image means for fine-tuning?
How can I resume training? Just by running it again with the same command?

while running main.py I get: "RuntimeError: Expected all tensors to be on the same device, but found at least two devices"

Hi, I am launching main.py with this line:

python.exe .\main.py --base configs/stable-diffusion/v1-finetune.yaml -t --actual_resume .\sd-v1-4.ckpt -n my-model --gpus=0, --data_root my-folder --init_word mytestword

and I get up to this point:

  | Name              | Type               | Params
---------------------------------------------------------
0 | model             | DiffusionWrapper   | 859 M
1 | first_stage_model | AutoencoderKL      | 83.7 M
2 | cond_stage_model  | FrozenCLIPEmbedder | 123 M
3 | embedding_manager | EmbeddingManager   | 1.5 K
---------------------------------------------------------
768       Trainable params
1.1 B     Non-trainable params
1.1 B     Total params
4,264.947 Total estimated model params size (MB)
Validation sanity check:   0%|                                                        | 0/2 [00:00<?, ?it/s]
Summoning checkpoint.

but after a while it stops with this error:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument index in method wrapper__index_select)

Do you know what may cause it?

Dtype mismatch when using embeddings with SD

I've trained my model using the v1-finetune Stable Diffusion config. Everything went smoothly until I actually tried to use the model with Stable Diffusion txt2img. When I run txt2img with my embeddings loaded, I get this error:
Index put requires the source and destination dtypes match, got Half for the destination and Float for the source.

Unique placeholder string

Can the placeholder string be something unique like 0xCAFEBABE for training, and then be used like "a photo of 0xCAFEBABE" for inference?

error during training setup: 'int' object has no attribute 'strip'

Hi there,
First of all: thank you for your work and this repo!

I was trying to run a training session using the SD model as follows:

(ldm_2) ➜  textual_inversion git:(main) ✗ which python
~/.conda/envs/ldm_2/bin/python

python main.py --base configs/stable-diffusion/v1-finetune.yaml -t --actual_resume ../sd-v1-4.ckpt -n sheep --gpus=1 --data_root /home/tom/Projects/StableDiffusion/training_images/sheep --init_word "toy"

I created a new environment with conda and activated it. I have, however, renamed ldm to ldm_2 in environment.yaml (that shouldn't affect anything, right?).

So far so good, but after a few seconds I get the following:

Traceback (most recent call last):
  File "main.py", line 762, in <module>
    ngpu = len(lightning_config.trainer.gpus.strip(",").split(','))
AttributeError: 'int' object has no attribute 'strip'

The full log output is attached. Any ideas what this could be?

errorlog.txt

Interval to save embeddings during training

Hi, I've trained a few textual inversions with stable diffusion with mixed success, and I'm wondering if it's possible to change how often embeddings are saved during training.
I've looked through the configs/stable-diffusion/v1-finetune.yaml, but I can't find the option to change how often embeddings are saved to disk, or if it's even possible to adjust this.
Having more granular checkpoints to pick from after the training is done would be very helpful to get more useful results. Currently, my custom v1-finetune.yaml is saving every 600 steps, but it would be great to have it save on every epoch.
Is this possible, and what would I need to change to make this happen?

size mismatch for model.diffusion_model.input_blocks.1.1.transformer

I can't get past this error. I have squared the 5 images to 512 px PNGs and run:

python3 main.py --base configs/stable-diffusion/v1-finetune.yaml -t --actual_resume models/ldm/text2img-large/model.ckpt -n FireballRun --data_root object\artstyle\scene --gpus 1 --init-word TrainedFireball

Traceback (most recent call last):
  File "\textual_inversion-main\main.py", line 614, in <module>
    model = load_model_from_config(config, opt.actual_resume)
  File "\textual_inversion-main\main.py", line 29, in load_model_from_config
    model = instantiate_from_config(config.model)
  File "\textual_inversion-main\ldm\util.py", line 85, in instantiate_from_config
    return get_obj_from_str(config["target"])(**config.get("params", dict()), **kwargs)
  File "\textual_inversion-main\ldm\models\diffusion\ddpm.py", line 481, in __init__
    self.init_from_ckpt(ckpt_path, ignore_keys)
  File "\textual_inversion-main\ldm\models\diffusion\ddpm.py", line 205, in init_from_ckpt
    missing, unexpected = self.load_state_dict(sd, strict=False) if not only_model else self.model.load_state_dict(
  File "****\lib\site-packages\torch\nn\modules\module.py", line 1497, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for LatentDiffusion:
size mismatch for model.diffusion_model.input_blocks.1.1.transformer_blocks.0.attn2.to_k.weight: copying a param with shape torch.Size([320, 1280]) from checkpoint, the shape in current model is torch.Size([320, 768]).
size mismatch for model.diffusion_model.input_blocks.1.1.transformer_blocks.0.attn2.to_v.weight: copying a param with shape torch.Size([320, 1280]) from checkpoint, the shape in current model is torch.Size([320, 768]).
size mismatch for model.diffusion_model.input_blocks.2.1.transformer_blocks.0.attn2.to_k.weight: copying a param with shape torch.Size([320, 1280]) from checkpoint, the shape in current model is torch.Size([320, 768]).
size mismatch for model.diffusion_model.input_blocks.2.1.transformer_blocks.0.attn2.to_v.weight: copying a param with shape torch.Size([320, 1280]) from checkpoint, the shape in current model is torch.Size([320, 768]).
size mismatch for model.diffusion_model.input_blocks.4.1.transformer_blocks.0.attn2.to_k.weight: copying a param with shape torch.Size([640, 1280]) from checkpoint, the shape in current model is torch.Size([640, 768]).
size mismatch for model.diffusion_model.input_blocks.4.1.transformer_blocks.0.attn2.to_v.weight: copying a param with shape torch.Size([640, 1280]) from checkpoint, the shape in current model is torch.Size([640, 768]).
size mismatch for model.diffusion_model.input_blocks.5.1.transformer_blocks.0.attn2.to_k.weight: copying a param with shape torch.Size([640, 1280]) from checkpoint, the shape in current model is torch.Size([640, 768]).
size mismatch for model.diffusion_model.input_blocks.5.1.transformer_blocks.0.attn2.to_v.weight: copying a param with shape torch.Size([640, 1280]) from checkpoint, the shape in current model is torch.Size([640, 768]).
size mismatch for model.diffusion_model.input_blocks.7.1.transformer_blocks.0.attn2.to_k.weight: copying a param with shape torch.Size([1280, 1280]) from checkpoint, the shape in current model is torch.Size([1280, 768]).
size mismatch for model.diffusion_model.input_blocks.7.1.transformer_blocks.0.attn2.to_v.weight: copying a param with shape torch.Size([1280, 1280]) from checkpoint, the shape in current model is torch.Size([1280, 768]).
size mismatch for model.diffusion_model.input_blocks.8.1.transformer_blocks.0.attn2.to_k.weight: copying a param with shape torch.Size([1280, 1280]) from checkpoint, the shape in current model is torch.Size([1280, 768]).
size mismatch for model.diffusion_model.input_blocks.8.1.transformer_blocks.0.attn2.to_v.weight: copying a param with shape torch.Size([1280, 1280]) from checkpoint, the shape in current model is torch.Size([1280, 768]).
size mismatch for model.diffusion_model.middle_block.1.transformer_blocks.0.attn2.to_k.weight: copying a param with shape torch.Size([1280, 1280]) from checkpoint, the shape in current model is torch.Size([1280, 768]).
size mismatch for model.diffusion_model.middle_block.1.transformer_blocks.0.attn2.to_v.weight: copying a param with shape torch.Size([1280, 1280]) from checkpoint, the shape in current model is torch.Size([1280, 768]).
size mismatch for model.diffusion_model.output_blocks.3.1.transformer_blocks.0.attn2.to_k.weight: copying a param with shape torch.Size([1280, 1280]) from checkpoint, the shape in current model is torch.Size([1280, 768]).

Conda env create fails on M1 Mac

Running conda env create -f environment.yaml on M1 Mac fails with:

Collecting package metadata (repodata.json): done
Solving environment: failed

ResolvePackageNotFound: 
  - torchvision=0.11.3
  - cudatoolkit=11.3
  - pip=20.3
  - python=3.8.10

From a Google search, the problem seems to have something to do with exported environments not working on other OSes, and the solution might be to include the --no-builds option in conda export: conda/conda#7311.

run in colab not pro

Can you please create a notebook, or does someone have a notebook, for training in Colab (not Pro)?

Problems running on Windows

I tried running the trainer on Windows 10 (cmd, not WSL) and ran into a few issues. I was able to solve them (perhaps in a less-than-optimal way), but I figured I should share the solutions.

  1. Windows doesn't have the same SIGUSR signals. After a quick search, the suggestion from other sources was to revert to using SIGTERM when running on Windows.
  2. I get an NCCL runtime error. The solution I found was to add os.environ["PL_TORCH_DISTRIBUTED_BACKEND"] = "gloo" to main in main.py.
  3. The DejaVu Sans font is missing. It needed to be explicitly downloaded and added to <repo_root>/data/.

Other than that, everything seems to be running.
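A minimal sketch of fixes 1 and 2 (the environment variable and signal names are exactly as described above; where they belong inside main.py is an assumption):

import os
import signal
import sys

# Fix 2: NCCL is unavailable on Windows, so force PyTorch Lightning to use gloo.
os.environ["PL_TORCH_DISTRIBUTED_BACKEND"] = "gloo"

# Fix 1: Windows has no SIGUSR1/SIGUSR2, so fall back to SIGTERM there.
checkpoint_signal = signal.SIGTERM if sys.platform == "win32" else signal.SIGUSR1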

getting string maps to more than single token

What does that mean? Should I change it? No matter what I change it to, I get this error over and over. Should I use two words? How should they be structured: word1,word2, (word,word), or word1 word2?
This is very important, but it's nowhere to be found in the instructions.
Can you include an actual command line to train, say, a cartoon T-rex image by John Smith? I'm having a lot of trouble with this under SD.
Do I need to name my images somehow? My image names are all over the place.
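A hedged way to check a candidate word before training is to count how many tokens it maps to. The sketch below assumes the Stable Diffusion branch, whose text encoder uses the CLIP tokenizer available through transformers (the LDM branch uses a BERT tokenizer instead, so the counts can differ there):

from transformers import CLIPTokenizer

# The tokenizer name is the standard CLIP ViT-L/14 one used by Stable Diffusion v1 (an assumption here).
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

for word in ["toy", "sculpture", "he-man", "TrainedFireball"]:
    ids = tokenizer(word, add_special_tokens=False)["input_ids"]
    status = "OK (single token)" if len(ids) == 1 else f"maps to {len(ids)} tokens -- pick another word"
    print(f"{word!r}: {status}")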

OSError: cannot open resource

I get this error after 500 iterations when it attempts to print out the sample/test images of the checkpoint. I think it has to do with the image that's dumped which contains nothing but the prompt text. (1) Is there an easy fix? (2) Is there a better way to log the prompt instead of dumping a mostly white image?
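A hedged check, assuming (as the Windows issue above suggests) that the logger renders the prompt with Pillow and expects a DejaVu Sans font under <repo_root>/data/ (the exact filename is an assumption):

from pathlib import Path
from PIL import ImageFont

font_path = Path("data/DejaVuSans.ttf")   # assumed location/name of the font the logger loads
if not font_path.is_file():
    raise FileNotFoundError(f"missing {font_path}; download DejaVuSans.ttf and place it there")

ImageFont.truetype(str(font_path), size=10)   # raises "OSError: cannot open resource" if unusable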

Question about resuming from checkpoint

Hey there,

so I have this one colab of mine with a command of:

command = "python main.py" + \
          " --base " + model_config + \
          " --actual_resume " + model_checkpoint + \
          " -t " + \
          " -n " + name + \
          " --gpus 0, " + \
          " --data_root " + train_images + \
          " -l logs " + \
          " --embedding_manager_ckpt " + train_checkpoint_file

where I load up a present model, checkpoint file and try to resume from that point of time.

train_checkpoint_file = logs/<custom_logs>/checkpoints/embeddings_gs-2499.pt
model_checkpoint = ldm/stable-diffusion/model.ckpt

Now my question is: is it normal, when trying to resume from a checkpoint, for the epoch to start at 0 with 0 steps?
How do I know whether the previous run is being continued?
Is there an example command that can be used?

Thanks in advance and for this awesome project. :)

What should be done with the epoch=000XXX.ckpt files?

Hi, awesome paper, and thanks for making the code available. I wanted to ask about the epoch=...ckpt files that appear in the same folder as the embeddings. They appear whenever a new minimal val loss is reached, but they can't be used as the ckpt for the Stable Diffusion model; they're too small to be the full weights.

I'd presume I should use the embeddings that correspond to the best rather than the latest. Is there a way to correlate them?
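A hedged way to see what each file actually contains is to load it and print its top-level structure; the contents are undocumented, so treat this purely as exploration (the paths are illustrative):

import torch

def describe(path):
    # Load on CPU and print top-level keys, with tensor shapes where applicable.
    obj = torch.load(path, map_location="cpu")
    print(path, "->", type(obj).__name__)
    if isinstance(obj, dict):
        for key, value in obj.items():
            info = tuple(value.shape) if torch.is_tensor(value) else type(value).__name__
            print("  ", key, info)

describe("logs/<run>/checkpoints/embeddings_gs-5049.pt")   # learned embedding file
describe("logs/<run>/checkpoints/epoch=000123.ckpt")       # Lightning checkpoint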

Training stays at maximum loss at or above 5000 global steps, nothing usable

Running on Windows 10, RTX 3090, 128GB DDR4 and AMD Ryzen 9 5950X 16-core CPU.

Using "gloo" as the backend. I changed SIGUSR1 (and 2) to SIGTERM, or else I get a signal error.

I'm using the stable-diffusion v1-finetune.yaml in a repo that includes Stable Diffusion; however, I also tested on your repo and got the same result.

running with:

python.exe main.py --base configs/stable-diffusion/v1-finetune.yaml -t --actual-resume models/ldm/stable-diffusion-v1/model.ckpt -n TestTrain --gpus 1, --data_root imagesets/Jn3

The init word is "person", defined in the yaml; the placeholder string is still *.

Base_learning_rate: 5.0e-03

The input images in the logs look fine, but the reconstruction is just extreme noise, as are the others. The conditioning image is just white with text (is that normal?).

At the start of the run I'm getting "RuntimeWarning: You are using LearningRateMonitor callback with models that have no learning rate schedulers. Please see documentation for configure_optimizers method.
rank_zero_warn(" from pytorch lightning lr_monitor.py

I've tried training with 56 images, 17, 5, and 3; it didn't make a difference.
Here's a samples_scaled at 6500 global steps (same result at 60000 global steps as well)
[image: samples_scaled_gs-006500_e-000043_b-000050]

ETA?

Hello, just curious when the source code for this project might become available? I think a lot of us are very excited to dig in and see how it works, especially given that the Stable Diffusion model is near release. Thanks!!

missing checkpoint files

I'm currently training the model, but nothing is saving to the checkpoints/ directory. I have files under configs/ and images/, but no checkpoint file. I'm currently 7k iterations into the run.

Got weird results, not sure if I missed a step?

Hey @rinongal thank you so much for this amazing repo.

I trained with over 10K steps, I believe, on around 7 images (trained on my face), using this colab.

I then used those .pt files when running the SD version right in the colab, and a weird thing happens: when I mention * in my prompts, I get results that look identical to the photos in style, but it does try to ... draw the objects.

For example :
[screenshot of the generated results]

Prompt was portrait of joe biden with long hair and glasses eating a burger, detailed painting by da vinci
and
portrait of * with long hair and glasses eating a burger, detailed painting by da vinci

So SD added the glasses and the eating pose, but completely disregarded the "detailed painting", the da Vinci, and the style.

What could be causing this? Any idea? 🙏

Per image tokens on a larger dataset

I'm curious about trying per-image token training with my dataset; however, the per_img_token_list in personalized.py is only 22 entries long, which poses a problem: there aren't enough of those tokens to assign to the samples in my set.

I have considered supplementing the list by hand, which would be an incredibly tedious but perhaps doable task given that the set I want to train on only has 912 samples. But other datasets I was interested in experimenting with have upwards of half a million samples, for which a manual approach simply wouldn't suffice. Is there any possibility of generating unique multi-character tokens on the fly when starting training, to an extent depending on dataset size?
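As a hedged sketch of the "generate tokens on the fly" idea: candidate placeholder tokens have to map to a single token in the text encoder's vocabulary, so one option is to scan a range of characters and keep those that do. This assumes the Stable Diffusion branch and its CLIP tokenizer from transformers; how the existing per_img_token_list is consumed is left untouched here:

from transformers import CLIPTokenizer

# Assumed tokenizer: the CLIP ViT-L/14 tokenizer used by Stable Diffusion v1.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

candidates = [chr(c) for c in range(0x0391, 0x2000)]   # arbitrary Unicode range to scan
single_token_chars = [
    ch for ch in candidates
    if len(tokenizer(ch, add_special_tokens=False)["input_ids"]) == 1
]
print(f"found {len(single_token_chars)} usable single-token placeholders")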

Clarity For Stable Diffusion Training / Inference

Hello! Thanks for your work. I was reading through your paper, and had some questions.

Initially, when I tried this, I was having trouble with the default parameters listed here. It turned out I was getting a sculpture representation of the images I was trying to invert! 🙃

I then realized that v1-finetune.yaml has a string list of initializer_words, not an init_word.
My understanding (correct me if I'm wrong) is that the words are converted into a numerical representation of the images you are trying to invert, and the pseudo-word then becomes the representation of the inversion.

In the paper, it says that one token is sufficient with roughly 5K iterations. My confusion stems from placeholder_strings and initializer_words.

The placeholder_strings entry is a pseudo-word that describes the subject and can be anything. Am I right to assume that the initializer_words guide the image you're trying to invert? Is there any limit to this during training? Would over-describing lead to some sort of overfitting edge case?

I know that you're still working on support for SD, but I'm eager to try during development. Any clarity would be greatly appreciated!

black reconstruction_gs images during logging

Thanks for providing the code for your awesome paper!
When logging images during training, the reconstruction_gs and samples_scaled_gs images remain black for the entire run, and I am not able to reproduce the results. Also, after finishing training, I cannot reproduce the results from the paper. Can you please help me figure out how to fix this?

Thank you in advance for your help!!

can't reproduce the results

Hi! I trained LDM with three images and the token "container":



Training lasted a few hours and the loss jumps around, but I got exactly the same result as without training:

The config is loaded correctly. Are there any logs besides the loss?

Analyzing Training

First of all, this is incredible work and is a wonder to try out. So thank you! Double-thanks for making such an easy-to-use codebase as well!

Now my question: what do you look for while training? Especially if the given style or subject you're attempting to re-embed has particular details, for instance, a face.

So far, even going into tens of thousands of global steps, looking at the scaled samples in the log dir, the faces (simple, drawn ones) are still masses of semi-legible squiggly lines that only offer a face via pareidolia.

A follow-on question: how do you (or how should we) use the several different output images to decide when to stop training?
