bloc97 / crossattentioncontrol

Unofficial implementation of "Prompt-to-Prompt Image Editing with Cross Attention Control" with Stable Diffusion

License: MIT License

Jupyter Notebook 100.00%

cross-attention deep-learning diffusion-models stable-diffusion

crossattentioncontrol's Issues

Implement with Stable Diffusion repositories

Hello, thanks for the implementation! It works very well.

As a suggestion, it may be helpful to provide code that works with the broader GitHub community. While the Diffusers library is easier to use and more streamlined, the way it is structured can make it harder to reuse this code with similar implementations.

[question] I'd like to help contribute, but my knowledge of how diffusers work is lacking...

Where did you learn this stuff? There are a lot of "for dummies" level explanations about how diffusers work, and there's a fair amount of documentation out there that assumes you have an understanding of their inner workings, but there doesn't seem to be much in between. I would like to, for instance, add k-diffusion to this notebook, but I'm really confused as to how to get started. Is there any kind of reading material out there that I could use to familiarize myself with this stuff a bit more?

Add direct target editing to the notebook

It would be great to have a direct target editing example in the notebook. Here is code for one:

#https://lexica.art/prompt/2127efd3-e23b-44dc-baac-494993bc9688
image = stablediffusion("A photo of a Corgi dog riding a bike in Times Square wearing sunglasses and beach hat, cinestill, 800t, 35mm, full-HD",
                        seed=2401809524,
                        guidance_scale=7,
                        steps=150)
image

prompt = "A photo of a teddy bear riding a bike in Times Square wearing sunglasses and beach hat, cinestill, 800t, 35mm, full-HD"

print(prompt_token(prompt, 5), prompt_token(prompt, 6))

stablediffusion(prompt,
                seed=2401809524,
                guidance_scale=7,
                prompt_edit_token_weights=[(5, 5), (6, 5)],
                init_image=image,
                steps=150)

How to prevent prompt 1 from being distorted

Here is a black box and I want to change its color to red, but changing the color also changes the size and shape of the box. How can I change only the color?

Edit to README.md

Image prompt: A fantasy landscape with a pine forest without fog and without rocks
However, we still see fog and rocks.

That's not how SD works.
You can't type a prompt: "not a woman without a crown not wearing a red dress" and expect SD to follow it.
You will 100% get "a woman, a crown, wearing red dress"

If you want to have a forest without fog and rocks just don't make these words a part of the prompt or specify alternative words.
Like "A fantasy landscape with a pine forest with a shiny morning sun and with asphalt road".

An observation

Hi, thanks for the code.
I have observed that in the examples you have provided, even if I directly use the cross attention from the edited prompt by commenting out the line "attn_slice = attn_slice * (1 - self.last_attn_slice_mask) + new_attn_slice * self.last_attn_slice_mask", I get the same result in most cases. I checked cases where words are replaced and cases where new phrases like ' in winter' are added. So it seems that the cross attention editing is not having any effect. Please comment on this. Thanks.
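For readers following along, the quoted line is a per-token blend between two attention maps. Below is a minimal NumPy sketch of that blend with made-up numbers. The function name `blend_attention` and the toy arrays are hypothetical, and it assumes that a mask value of 1 selects the saved map and 0 keeps the edited prompt's own map.

```python
import numpy as np

def blend_attention(attn_edit, attn_saved, mask):
    """Per-token blend: mask == 1 takes the saved map, mask == 0 keeps the edited one."""
    return attn_edit * (1.0 - mask) + attn_saved * mask

# Toy maps: 2 pixels x 3 tokens (values are made up for illustration).
attn_edit = np.array([[0.2, 0.5, 0.3],
                      [0.1, 0.6, 0.3]])
attn_saved = np.array([[0.3, 0.4, 0.3],
                       [0.2, 0.5, 0.3]])
mask = np.array([1.0, 0.0, 1.0])  # broadcast over the pixel axis

blended = blend_attention(attn_edit, attn_saved, mask)
print(blended)
```

If the two maps are already nearly equal at the masked positions, the blend changes very little. That is one possible reading of the observation above, not something confirmed in this thread.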

Implementing Dreambooth weights

Is it possible to use a trained Dreambooth model with cross attention control?
I trained a model in Dreambooth-Stable-Diffusion on a new car, and I have an image where I want to change the car to the one I trained in Dreambooth.
Changing 'model_path_diffusion' to the downloaded Dreambooth weights does not seem to work: it does not generate the new car but something totally different.

[feature request] Automatically print token being modified (and by how much) when generating an image

I was getting some nonsensical results for a little while until it occurred to me that some words are multiple tokens (and punctuation, etc, are tokens as well).

Here's some code I'm using to do it:

        #Process prompt editing
        if prompt_edit is not None:
            tokens_conditional_edit = clip_tokenizer(prompt_edit, padding="max_length", max_length=clip_tokenizer.model_max_length, truncation=True, return_tensors="pt", return_overflowing_tokens=True)
            embedding_conditional_edit = clip(tokens_conditional_edit.input_ids.to(device)).last_hidden_state
            
            init_attention_edit(tokens_conditional, tokens_conditional_edit)
            
            #My code starts here
            for t in prompt_edit_token_weights:
                token_word = prompt_token(prompt_edit, t[0])
                print(f"{token_word}: {t[1]}")
        else:
            for t in prompt_edit_token_weights:
                token_word = prompt_token(prompt, t[0])
                print(f"{token_word}: {t[1]}")

Probably ought to check and make sure I'm not misunderstanding anything, but it appears to work.

Add some notes on running on Windows to readme

Got this running on Windows. I had to do the following after setting up the Python environment:

Install Jupyter: pip install jupyterlab ipywidgets (see https://jupyter.org/install and https://ipywidgets.readthedocs.io/en/stable/user_install.html)
Go into Windows Developer Settings and enable Developer Mode (the notebook uses symlinks, and Windows only allows them if Developer Mode is turned on or if you run Jupyter as an administrator)

For the record, I did this with my existing environment from hlky's Stable Diffusion Webui, which can be found here: https://github.com/sd-webui/stable-diffusion-webui, so I didn't need to install the other packages because I already had them.

This isn't quite good enough to go into the readme yet because I didn't install from a blank environment, but maybe other Windows users can use this info and some instructions can be assembled.

AttributeError: 'dict' object has no attribute 'sample'

Dear @bloc97 ,

Thank you for your great implementation. I really like it.

When I run the code, it reports an error:

  File "/cs/labs/danix/wuzongze/diffusion_manipulation/CrossAttentionControl/test1", line 280, in <module>
    img=stablediffusion("A fantasy landscape with a pine forest, trending on artstation", prompt_edit_token_weights=[(2, -3)], seed=2483964025, width=768)

  File "/cs/labs/danix/wuzongze/anaconda3/envs/ldm/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)

  File "/cs/labs/danix/wuzongze/diffusion_manipulation/CrossAttentionControl/test1", line 213, in stablediffusion
    noise_pred_uncond = unet(latent_model_input, t, encoder_hidden_states=embedding_unconditional).sample

AttributeError: 'dict' object has no attribute 'sample'

It seems the output of unet is a dict rather than a class instance. Am I the only one who has met this problem?

Best Wishes,

Zongze
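The traceback suggests that the installed diffusers version returns a plain dict from the UNet, while the notebook expects an object with a `.sample` attribute. That usually points to a version mismatch between the notebook and the installed library, although the exact versions involved are not stated in this thread. A minimal sketch of a helper that tolerates both return styles follows. The name `get_sample` is hypothetical, and the stand-in objects only mimic the two shapes of output.

```python
from types import SimpleNamespace

def get_sample(unet_output):
    """Return the prediction whether the UNet returned a dict or an output object."""
    if isinstance(unet_output, dict):
        return unet_output["sample"]
    return unet_output.sample

# Stand-ins for the two return styles (no model is needed to try this).
old_style = {"sample": [0.1, 0.2]}
new_style = SimpleNamespace(sample=[0.1, 0.2])

print(get_sample(old_style), get_sample(new_style))
```

In the notebook this would wrap the call, for example `get_sample(unet(latent_model_input, t, encoder_hidden_states=embedding_unconditional))`. Installing the diffusers version the notebook was written for is the other option.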

Why LMSDiscreteScheduler?

Why should we use LMSDiscreteScheduler rather than DDPMScheduler or DDIMScheduler?

How to make image inversion more precise?

Fantastic work on this project @bloc97!

I'm able to get super impressive results with prompt editing. However, when doing img2img I find that the results degrade greatly. For example, here I'm editing the prompt to change to a charcoal drawing, which works well. However, if I pass in the initial image generated from the original prompt, I can't find any parameter values that get anywhere close to the quality of the prompt edit without an initial image. I'm observing similar issues to stock SD, where either the macro structure of the initial image is lost or the prompt edit has little to no effect.

The reason I want this is to edit real images and to build edits on top of each other. I realize this may be unsolved, and depend on how well the network understands the scene content, but I'm very interested in your thoughts and suggestions here as I think it would be incredibly powerful.

img_original = stablediffusion(
    prompt="a fantasy landscape with a maple forest",
    steps=50,
    seed=42,
)

img_prompt_edit = stablediffusion(
    prompt="a fantasy landscape with a maple forest",
    prompt_edit="a charcoal sketch of a fantasy landscape with a maple forest",
    steps=50,
    seed=42,
)

img_init_image = stablediffusion(
    prompt="a fantasy landscape with a maple forest",
    prompt_edit="a charcoal sketch of a fantasy landscape with a maple forest",
    steps=50,
    seed=42,
    init_image=img_original,
    init_image_strength=0.6,
)

Optimised for 6 GB?

Can this be optimised to run in 6 GB similarly to Hlky's https://github.com/sd-webui/stable-diffusion-webui/tree/master/optimizedSD ?

I'm getting CUDA out of memory when using prompt and prompt_edit in 6 GB @ 512x512. Generation of a single 512x512 image is working.

Please add InverseCrossAttention to Colab

Relating to the recent paper about 'Self-guidance' method

Hello @bloc97,

Your work has been instrumental in my understanding of the topic, especially since I encountered some difficulties when trying to run the official prompt to prompt code.

Recently, I've been engrossed in a paper titled "Diffusion Self-Guidance for Controllable Image Generation" (https://dave.ml/selfguidance/), where the authors introduce a novel 'Self Guidance' method. This technique edits an image by manipulating the attention maps, and I notice its resemblance to the 'Prompt to Prompt' method.

As an undergraduate student eager to delve deeper into the realm of Computer Vision, I'm interested in implementing this 'Self Guidance' method for my project. However, as of now, the authors have not released their official code. Hence I'm considering implementing that self guidance method upon the foundation of your code.

Given your expertise in this area, I was wondering if you think it's feasible to implement the 'Self Guidance' method based on your code? Any insights or suggestions you could provide would be immensely appreciated.

Can't install dependencies

I'm in an Anaconda prompt (on Windows). Installing required packages doesn't work:

(base) C:\Users\andre>pip install torch transformers diffusers numpy PIL tqdm difflib
Collecting torch
  Using cached torch-1.12.1-cp39-cp39-win_amd64.whl (161.8 MB)
Collecting transformers
  Using cached transformers-4.21.3-py3-none-any.whl (4.7 MB)
Collecting diffusers
  Using cached diffusers-0.3.0-py3-none-any.whl (153 kB)
Collecting numpy
  Downloading numpy-1.23.3-cp39-cp39-win_amd64.whl (14.7 MB)
     |████████████████████████████████| 14.7 MB 6.4 MB/s
ERROR: Could not find a version that satisfies the requirement PIL (from versions: none)
ERROR: No matching distribution found for PIL
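Two names in that command are import names rather than pip package names: PIL is installed through the Pillow package, and difflib is part of the Python standard library, so it needs no install. A small sketch that turns the import list into an installable list follows. The helper `pip_requirements` and the two lookup tables are hypothetical and cover only the names in this command.

```python
# Import name -> pip distribution name, for the cases where they differ.
PIP_NAME = {"PIL": "Pillow"}
# Standard-library modules from the command above; they ship with Python.
STDLIB = {"difflib"}

def pip_requirements(import_names):
    """Map import names to pip package names and drop standard-library modules."""
    return [PIP_NAME.get(name, name) for name in import_names if name not in STDLIB]

reqs = pip_requirements(
    ["torch", "transformers", "diffusers", "numpy", "PIL", "tqdm", "difflib"]
)
print("pip install " + " ".join(reqs))
```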

Notebook error

Everything runs but I am getting an error when running
stablediffusion("A fantasy landscape with a pine forest, trending on artstation", seed=2483964025, width=768)

The error says

AttributeError Traceback (most recent call last)
in
1 prompt_token("A fantasy landscape with a pine forest, trending on artstation", 7)
----> 2 stablediffusion("A fantasy landscape with a pine forest, trending on artstation", seed=2483964025, width=768)
3 stablediffusion("A fantasy landscape with a pine forest, trending on artstation", prompt_edit_token_weights=[(2, -3)], seed=2483964025, width=768)
4 stablediffusion("A fantasy landscape with a pine forest, trending on artstation", prompt_edit_token_weights=[(2, -8)], seed=2483964025, width=768)
5 stablediffusion("A fantasy landscape with a pine forest, trending on artstation", "A fantasy landscape with a pine forest, trending on artstation", prompt_edit_token_weights=[(2, 2), (7, 5)], seed=2483964025, width=768)

2 frames
/usr/local/lib/python3.7/dist-packages/diffusers/schedulers/scheduling_lms_discrete.py in add_noise(self, original_samples, noise, timesteps)
260 sigmas = self.sigmas.to(original_samples.device)
261 schedule_timesteps = self.timesteps.to(original_samples.device)
--> 262 timesteps = timesteps.to(original_samples.device)
263 if isinstance(timesteps, torch.IntTensor) or isinstance(timesteps, torch.LongTensor):
264 deprecate(

AttributeError: 'int' object has no attribute 'to'

Better support for prompt_edit_token_weights parsing

Instead of counting token indices to pass into prompt_edit_token_weights, it would be easier to reference tokens by word.
parse_edit_weights accepts weights keyed by a word, a list of words, or an int index, and converts them all to weights keyed by int indices:

prompt = 'the quick brown fox jumps over the lazy dog'
parse_edit_weights(prompt, None, [('brown', -1), (2, 0.5), (['lazy', 'dog'], -1.5)])

The returned result is [(3, -1), (2, 0.5), (8, -1.5), (9, -1.5)].

Here's the code:

def sep_token(prompt):
    tokens = clip_tokenizer(prompt, padding="max_length", max_length=clip_tokenizer.model_max_length, truncation=True, return_tensors="pt", return_overflowing_tokens=True).input_ids[0]
    words = []
    index = 1
    while True:
        word = clip_tokenizer.decode(tokens[index:index+1])
        if not word: break
        if word == '<|endoftext|>': break
        words.append(word)
        index += 1
        if index > 500: break
    return words

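def parse_edit_weights(prompt, prompt_edit, edit_weights):
    # Indices refer to the edited prompt when one is given, else to the original.
    tokens = sep_token(prompt_edit if prompt_edit else prompt)

    prompt_edit_token_weights = []
    for tl, w in edit_weights:
        # Accept a single word or index as well as a list or tuple of them.
        if not isinstance(tl, (list, tuple)):
            tl = [tl]
        for t in tl:
            if isinstance(t, str):
                try:
                    # +1 skips the start-of-text token; only the first match is used.
                    idx = tokens.index(t) + 1
                except ValueError as e:
                    print(f'error {e}')
                    continue
            elif isinstance(t, int):
                idx = t
            else:
                # Skip anything that is neither a word nor an index.
                print(f'error unsupported token reference {t!r}')
                continue
            prompt_edit_token_weights.append((idx, w))

    return prompt_edit_token_weights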

About the finite difference gradient descent method

Hi @bloc97 ,

Thanks for your great work.

Do you know any other papers/implementations using the finite difference gradient descent to do inversion?
I want more references for this solution.

Also, could you please give more hints about the magic number tless?

Question about original google implementation with stable diffusion

Hi bloc, firstly thank you for your great work!
I've been spending a lot of time trying to implement Google's original release in a custom pipeline with diffusers. I figured it wouldn't be too difficult, as they have an example there running with SD that looks pretty good. However, I'm getting very strange results even though everything seems to be in working order. I considered that it may be because I had been using SD 1.5 whereas they had been using 1.4, but I don't think there were any changes in architecture that would cause that?

Could you elaborate a bit more on the changes you made to get it to work with Stable Diffusion?

The differences from the official implementation?

Hi developers, thank you for completing this wonderful re-implementation.
As I am checking the differences between this repo and the original one, I noticed that the original repo also implemented stable diffusion.

I am wondering if you would like to list the additional features, those exclusive to this repo, on the README page. I would appreciate your clarification very much!

About terms["nll"]

Thanks for your great work. In line 633 of gaussian_diffusion.py, terms["nll"] is calculated but not used. Is this a mistake, or does it not work?
terms["nll"] = self._token_discrete_loss(model_out_x_start, get_logits, input_ids_x, mask=input_ids_mask, truncate=True, t=t)
terms["loss"] = terms["mse"] + decoder_nll + tT_loss

negative weighting

does negative weighting work properly?
most repos that have weighting don't have the option for negative weights,
does it work in this one?
thanks
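As a rough illustration of what a negative weight does numerically, here is a NumPy sketch. It assumes the weights are applied by multiplying each token's attention column by its weight after the softmax, which is an assumption about the notebook and is not confirmed in this thread. The function `apply_token_weights` and the toy attention map are hypothetical.

```python
import numpy as np

def apply_token_weights(attn, token_weights):
    """Scale the attention column of each listed token index by its weight."""
    weights = np.ones(attn.shape[-1])
    for index, weight in token_weights:
        weights[index] = weight
    return attn * weights

# Toy attention map: 2 pixels x 3 tokens, each row sums to 1.
attn = np.array([[0.2, 0.5, 0.3],
                 [0.1, 0.6, 0.3]])
weighted = apply_token_weights(attn, [(1, -3.0)])
print(weighted)
print(weighted.sum(axis=-1))  # rows no longer sum to 1
```

Under this assumption a negative weight flips the sign of that token's contribution instead of merely shrinking it, and the rows stop being a probability distribution. That may be why other implementations restrict weights to positive values.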

Can't run the notebook in Google Colab, some issues with versions.

There are some issues with LMSDiscreteScheduler and new_attention. The notebook's new_attention now requires sequence_length and dim:
def new_attention(self, query, key, value, sequence_length, dim):
but diffusers/models/attention.py calls it as:
hidden_states = self._attention(query, key, value)

/usr/local/lib/python3.7/dist-packages/diffusers/models/attention.py in forward(self, hidden_states, context)
    196     def forward(self, hidden_states, context=None):
    197         hidden_states = hidden_states.contiguous() if hidden_states.device.type == "mps" else hidden_states
--> 198         hidden_states = self.attn1(self.norm1(hidden_states)) + hidden_states
    199         hidden_states = self.attn2(self.norm2(hidden_states), context=context) + hidden_states
    200         hidden_states = self.ff(self.norm3(hidden_states)) + hidden_states

/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1128         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1129                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1130             return forward_call(*input, **kwargs)
   1131         # Do not call functions when jit is used
   1132         full_backward_hooks, non_full_backward_hooks = [], []

/usr/local/lib/python3.7/dist-packages/diffusers/models/attention.py in forward(self, hidden_states, context, mask)
    268 
    269         if self._slice_size is None or query.shape[0] // self._slice_size == 1:
--> 270             hidden_states = self._attention(query, key, value)
    271         else:
    272             hidden_states = self._sliced_attention(query, key, value, sequence_length, dim)

TypeError: new_attention() missing 2 required positional arguments: 'sequence_length' and 'dim'
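Judging only from the traceback, the library calls `self._attention(query, key, value)` with three arguments, while the notebook's replacement requires five. One possible workaround, sketched below, is to give the two extra parameters defaults so that both call styles are accepted. The class `FakeCrossAttention` is a hypothetical stand-in that copies only the call pattern, and the function body is a placeholder, not the notebook's attention code.

```python
class FakeCrossAttention:
    """Stand-in for the attention module; only the call pattern matters here."""

    def forward(self, query, key, value):
        # Mirrors the three-argument call shown in the traceback.
        return self._attention(query, key, value)

def new_attention(self, query, key, value, sequence_length=None, dim=None):
    # Placeholder body: the real replacement would compute attention here.
    return (query, key, value, sequence_length, dim)

# Bind the replacement as a method, as a monkey-patch would.
FakeCrossAttention._attention = new_attention
result = FakeCrossAttention().forward("q", "k", "v")
print(result)
```

Whether the real function body still works when `sequence_length` and `dim` are None depends on how it uses them. Pinning diffusers to the version the notebook targets avoids the question entirely.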

Did you get the same result?

Hello.

I have another question.

Using your code, I tried to reproduce the results in the Prompt-to-Prompt paper.

However, I got the result below:

Did you get an identical result?

How do I set the parameters to get the results in the Prompt-to-Prompt paper?

Also, please let me know why you divided the latent by 0.18215 before passing it to the VAE.

Thanks :) !!

Question about the code in CrossAttention_Release.ipynb

Hello,
Thank you for sharing your awesome code! :)

I have a question about this line:
latent_model_input = (latent_model_input / ((sigma**2 + 1) ** 0.5)).to(unet.dtype)

Could you explain why "(sigma**2 + 1) ** 0.5" is needed?

Thanks.
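One common reading, offered as a sketch and not as the author's stated reason: in the sigma parameterisation used by k-diffusion style samplers such as LMS, the noisy latent is the clean latent plus sigma times unit noise. If the clean latent has roughly unit variance (an assumption), the variance of the noisy latent grows to about 1 + sigma**2. Dividing by (sigma**2 + 1) ** 0.5 brings the input back to roughly unit scale before it goes into the UNet. The NumPy check below uses synthetic data and an arbitrary sigma.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 7.0  # example noise level, chosen arbitrarily

x0 = rng.standard_normal(200_000)     # clean latent, roughly unit variance (assumption)
noise = rng.standard_normal(200_000)
noisy = x0 + sigma * noise            # variance is about 1 + sigma**2

scaled = noisy / ((sigma**2 + 1) ** 0.5)  # the scaling from the quoted line
std_noisy, std_scaled = noisy.std(), scaled.std()
print(std_noisy, std_scaled)
```

The first number lands near (1 + sigma**2) ** 0.5 and the second near 1, so the division keeps the input scale roughly constant across noise levels.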