bloc97 / crossattentioncontrol
Unofficial implementation of "Prompt-to-Prompt Image Editing with Cross Attention Control" with Stable Diffusion
License: MIT License
Hello, thanks for the implementation! It works very well.
As a suggestion, it may help to provide code that works for the broader GitHub community. While the Diffusers library is easier to use and offers a more streamlined experience, the way the library is structured can make it harder to reuse the code across similar implementations.
Where did you learn this stuff? There are a lot of "for dummies" level explanations of how diffusion models work, and there's a fair amount of documentation out there that assumes you already understand their inner workings, but there doesn't seem to be much in between. I would like to, for instance, add k-diffusion to this notebook, but I'm really confused about how to get started. Is there any reading material out there that I could use to familiarize myself with this stuff a bit more?
It would be great to have a direct target-editing example in the notebook. Something like:
Code for this:
# https://lexica.art/prompt/2127efd3-e23b-44dc-baac-494993bc9688
image = stablediffusion("A photo of a Corgi dog riding a bike in Times Square wearing sunglasses and beach hat, cinestill, 800t, 35mm, full-HD",
                        seed=2401809524,
                        guidance_scale=7,
                        steps=150)
image
prompt = "A photo of a teddy bear riding a bike in Times Square wearing sunglasses and beach hat, cinestill, 800t, 35mm, full-HD"
print(prompt_token(prompt, 5), prompt_token(prompt, 6))
stablediffusion(prompt,
                seed=2401809524,
                guidance_scale=7,
                prompt_edit_token_weights=[(5, 5), (6, 5)],
                init_image=image,
                steps=150)
Image prompt: A fantasy landscape with a pine forest without fog and without rocks
However, we still see fog and rocks.
That's not how SD works.
You can't type a prompt: "not a woman without a crown not wearing a red dress" and expect SD to follow it.
You will 100% get "a woman, a crown, wearing red dress"
If you want to have a forest without fog and rocks just don't make these words a part of the prompt or specify alternative words.
Like "A fantasy landscape with a pine forest with a shiny morning sun and with asphalt road".
Hi, thanks for the code.
I have observed that in the examples you have provided, even if I just directly use the cross attention from the edited prompt by commenting out the line "attn_slice = attn_slice * (1 - self.last_attn_slice_mask) + new_attn_slice * self.last_attn_slice_mask", I get the same result in most cases. I checked cases where words are replaced and cases where new phrases such as ' in winter' are added. So it seems that the cross attention editing is not having any effect. Could you comment on this? Thanks.
Is it possible to use a trained dreambooth model into cross attention control?
I trained a model in Dreambooth-Stable-Diffusion on a new car and I have an image where I want to change the car to the one I trained in Dreambooth.
Changing 'model_path_diffusion' to the downloaded Dreambooth weights does not seem to work: it does not generate the new car but something totally different.
I was getting some nonsensical results for a little while until it occurred to me that some words are multiple tokens (and punctuation, etc, are tokens as well).
Here's some code I'm using to do it:
#Process prompt editing
if prompt_edit is not None:
    tokens_conditional_edit = clip_tokenizer(prompt_edit, padding="max_length", max_length=clip_tokenizer.model_max_length, truncation=True, return_tensors="pt", return_overflowing_tokens=True)
    embedding_conditional_edit = clip(tokens_conditional_edit.input_ids.to(device)).last_hidden_state
    init_attention_edit(tokens_conditional, tokens_conditional_edit)
    #My code starts here
    for t in prompt_edit_token_weights:
        token_word = prompt_token(prompt_edit, t[0])
        print(f"{token_word}: {t[1]}")
else:
    for t in prompt_edit_token_weights:
        token_word = prompt_token(prompt, t[0])
        print(f"{token_word}: {t[1]}")
Probably ought to check and make sure I'm not misunderstanding anything, but it appears to work.
Got this running on Windows. I had to do the following after setting up the Python environment:
pip install jupyterlab ipywidgets
(see https://jupyter.org/install and https://ipywidgets.readthedocs.io/en/stable/user_install.html)
For the record, I did this with my existing environment from hlky's Stable Diffusion Webui (https://github.com/sd-webui/stable-diffusion-webui), so I didn't need to install the other packages because I already had them.
This isn't quite good enough to go into the README yet because I didn't install from a blank environment, but maybe other Windows users can use this info and some instructions can be assembled.
Dear @bloc97 ,
Thank you for your great implementation. I really like it.
When I run the code, it reports this error:
File "/cs/labs/danix/wuzongze/diffusion_manipulation/CrossAttentionControl/test1", line 280, in <module>
img=stablediffusion("A fantasy landscape with a pine forest, trending on artstation", prompt_edit_token_weights=[(2, -3)], seed=2483964025, width=768)
File "/cs/labs/danix/wuzongze/anaconda3/envs/ldm/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/cs/labs/danix/wuzongze/diffusion_manipulation/CrossAttentionControl/test1", line 213, in stablediffusion
noise_pred_uncond = unet(latent_model_input, t, encoder_hidden_states=embedding_unconditional).sample
AttributeError: 'dict' object has no attribute 'sample'
It seems the output of unet is a dict rather than an object with a .sample attribute. Am I the only one who has met this problem?
Best Wishes,
Zongze
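A possible explanation, offered as an assumption rather than a confirmed diagnosis: older diffusers releases returned a plain dict with a "sample" key from the UNet call, while later releases return an output object that exposes .sample. A version-tolerant accessor could look like the sketch below. get_sample is a hypothetical helper, not part of this repo or of diffusers, and the stand-in values replace real tensors.

```python
from types import SimpleNamespace


def get_sample(unet_output):
    """Return the prediction from a UNet output, dict-style or attribute-style."""
    if isinstance(unet_output, dict):
        return unet_output["sample"]
    return unet_output.sample


# Stand-ins for the two output styles (real outputs would hold tensors)
old_style = {"sample": [0.1, 0.2]}
new_style = SimpleNamespace(sample=[0.1, 0.2])

print(get_sample(old_style), get_sample(new_style))
```

In the failing line this would read get_sample(unet(latent_model_input, t, encoder_hidden_states=embedding_unconditional)). Installing the diffusers version the notebook was written for is probably the simpler fix.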
Why should we use LMSDiscreteScheduler rather than DDPMScheduler or DDIMScheduler?
Fantastic work on this project @bloc97!
I'm able to get super impressive results with prompt editing. However, when doing img2img I find that the results degrade greatly. For example, here I'm editing the prompt to change to a charcoal drawing, which works well. However, if I pass in the initial image generated from the original prompt, I can't find any parameter values that get anywhere close to the quality of the prompt edit without an initial image. I'm observing issues similar to stock SD, where either the macro structure of the initial image is lost or the prompt edit has little to no effect.
The reason I want this is to edit real images and to build edits on top of each other. I realize this may be unsolved, and depend on how well the network understands the scene content, but I'm very interested in your thoughts and suggestions here as I think it would be incredibly powerful.
img_original = stablediffusion(
    prompt="a fantasy landscape with a maple forest",
    steps=50,
    seed=42,
)
img_prompt_edit = stablediffusion(
    prompt="a fantasy landscape with a maple forest",
    prompt_edit="a charcoal sketch of a fantasy landscape with a maple forest",
    steps=50,
    seed=42,
)
img_init_image = stablediffusion(
    prompt="a fantasy landscape with a maple forest",
    prompt_edit="a charcoal sketch of a fantasy landscape with a maple forest",
    steps=50,
    seed=42,
    init_image=img_original,
    init_image_strength=0.6,
)
Can this be optimised to run in 6 GB similarly to Hlky's https://github.com/sd-webui/stable-diffusion-webui/tree/master/optimizedSD ?
I'm getting CUDA out of memory when using prompt and edit_prompt in 6 GB @ 512x512. Generation of a single 512x512 image is working.
Please add InverseCrossAttention to the Colab notebook.
Hello @bloc97,
Your work has been instrumental in my understanding of the topic, especially since I encountered some difficulties when trying to run the official prompt to prompt code.
Recently, I've been engrossed in a paper titled "Diffusion Self-Guidance for Controllable Image Generation" (https://dave.ml/selfguidance/), where the authors introduce a novel 'Self Guidance' method. This technique edits an image by manipulating the attention maps, and I notice its resemblance to the 'Prompt to Prompt' method.
As an undergraduate student eager to delve deeper into the realm of Computer Vision, I'm interested in implementing this 'Self Guidance' method for my project. However, as of now, the authors have not released their official code. Hence I'm considering implementing that self guidance method upon the foundation of your code.
Given your expertise in this area, I was wondering if you think it's feasible to implement the 'Self Guidance' method based on your code? Any insights or suggestions you could provide would be immensely appreciated.
I'm in an Anaconda prompt (on Windows). Installing required packages doesn't work:
(base) C:\Users\andre>pip install torch transformers diffusers numpy PIL tqdm difflib
Collecting torch
Using cached torch-1.12.1-cp39-cp39-win_amd64.whl (161.8 MB)
Collecting transformers
Using cached transformers-4.21.3-py3-none-any.whl (4.7 MB)
Collecting diffusers
Using cached diffusers-0.3.0-py3-none-any.whl (153 kB)
Collecting numpy
Downloading numpy-1.23.3-cp39-cp39-win_amd64.whl (14.7 MB)
|████████████████████████████████| 14.7 MB 6.4 MB/s
ERROR: Could not find a version that satisfies the requirement PIL (from versions: none)
ERROR: No matching distribution found for PIL
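The failure comes from the package names rather than from Windows or Anaconda: on PyPI the PIL import is provided by the Pillow distribution, and difflib is part of Python's standard library, so neither can be installed under those names. Below is a small sketch that builds a corrected install command. The REQUIREMENTS mapping and the pip_install_command helper are illustrative and not part of the repo.

```python
import importlib.util

# import name -> PyPI distribution name (None means standard library, nothing to install)
REQUIREMENTS = {
    "torch": "torch",
    "transformers": "transformers",
    "diffusers": "diffusers",
    "numpy": "numpy",
    "PIL": "Pillow",
    "tqdm": "tqdm",
    "difflib": None,
}


def pip_install_command(requirements):
    """Return a pip command for the modules that cannot be imported yet, or None."""
    missing = [
        dist
        for module, dist in requirements.items()
        if dist is not None and importlib.util.find_spec(module) is None
    ]
    return "pip install " + " ".join(missing) if missing else None


print(pip_install_command(REQUIREMENTS))
```

Run in a fresh environment, this prints a command that names Pillow instead of PIL and leaves difflib out.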
Everything runs but I am getting an error when running
stablediffusion("A fantasy landscape with a pine forest, trending on artstation", seed=2483964025, width=768)
AttributeError Traceback (most recent call last)
in
1 prompt_token("A fantasy landscape with a pine forest, trending on artstation", 7)
----> 2 stablediffusion("A fantasy landscape with a pine forest, trending on artstation", seed=2483964025, width=768)
3 stablediffusion("A fantasy landscape with a pine forest, trending on artstation", prompt_edit_token_weights=[(2, -3)], seed=2483964025, width=768)
4 stablediffusion("A fantasy landscape with a pine forest, trending on artstation", prompt_edit_token_weights=[(2, -8)], seed=2483964025, width=768)
5 stablediffusion("A fantasy landscape with a pine forest, trending on artstation", "A fantasy landscape with a pine forest, trending on artstation", prompt_edit_token_weights=[(2, 2), (7, 5)], seed=2483964025, width=768)
2 frames
/usr/local/lib/python3.7/dist-packages/diffusers/schedulers/scheduling_lms_discrete.py in add_noise(self, original_samples, noise, timesteps)
260 sigmas = self.sigmas.to(original_samples.device)
261 schedule_timesteps = self.timesteps.to(original_samples.device)
--> 262 timesteps = timesteps.to(original_samples.device)
263 if isinstance(timesteps, torch.IntTensor) or isinstance(timesteps, torch.LongTensor):
264 deprecate(
AttributeError: 'int' object has no attribute 'to'
Instead of counting token indices to pass into prompt_edit_token_weights, it would be easier to reference tokens by word.
parse_edit_weights accepts weights keyed by a word, a list of words, or an int index, and converts them all to weights keyed by int index:
prompt = 'the quick brown fox jumps over the lazy dog'
parse_edit_weights(prompt, None, [('brown', -1), (2, 0.5), (['lazy', 'dog'], -1.5)])
The returned result is [(3, -1), (2, 0.5), (8, -1.5), (9, -1.5)].
Here's the code:
def sep_token(prompt):
    # Tokenize the prompt, then decode each token id back to text, one entry per token
    tokens = clip_tokenizer(prompt, padding="max_length", max_length=clip_tokenizer.model_max_length, truncation=True, return_tensors="pt", return_overflowing_tokens=True).input_ids[0]
    words = []
    index = 1  # position 0 holds the start-of-text token
    while True:
        word = clip_tokenizer.decode(tokens[index:index+1])
        if not word: break
        if word == '<|endoftext|>': break
        words.append(word)
        index += 1
        if index > 500: break
    return words

def parse_edit_weights(prompt, prompt_edit, edit_weights):
    # Indices refer to the edited prompt when one is given, otherwise to the original prompt
    if prompt_edit:
        tokens = sep_token(prompt_edit)
    else:
        tokens = sep_token(prompt)
    prompt_edit_token_weights = []
    for tl, w in edit_weights:
        if not isinstance(tl, (list, tuple)):
            tl = [tl]
        for t in tl:
            try:
                if isinstance(t, str):
                    idx = tokens.index(t) + 1  # +1 skips the start-of-text token
                elif isinstance(t, int):
                    idx = t
                else:
                    # Without this branch idx would be unbound or stale
                    raise ValueError(f'unsupported token reference {t!r}')
                prompt_edit_token_weights.append((idx, w))
            except ValueError as e:
                print(f'error {e}')
    return prompt_edit_token_weights
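For trying out the mapping without loading CLIP, here is a tokenizer-independent sketch of the same idea. It is a variant, not the code above: resolve_edit_weights is a hypothetical name, a whitespace split stands in for the CLIP tokenizer, a word that occurs several times gets the weight at every occurrence, and unknown words raise instead of printing.

```python
def resolve_edit_weights(tokens, edit_weights, offset=1):
    """Map word, word-list, or int references to (token_index, weight) pairs.

    tokens: decoded prompt tokens, without the start-of-text token.
    offset: index of the first prompt token (1 when a start token sits at 0).
    """
    resolved = []
    for ref, weight in edit_weights:
        refs = ref if isinstance(ref, (list, tuple)) else [ref]
        for r in refs:
            if isinstance(r, int):
                resolved.append((r, weight))
            elif isinstance(r, str):
                hits = [i + offset for i, tok in enumerate(tokens) if tok == r]
                if not hits:
                    raise ValueError(f"{r!r} is not a single token of the prompt")
                resolved.extend((i, weight) for i in hits)
            else:
                raise TypeError(f"unsupported token reference: {r!r}")
    return resolved


# Assumption: a whitespace split stands in for the CLIP tokenizer here
tokens = "the quick brown fox jumps over the lazy dog".split()
result = resolve_edit_weights(tokens, [("brown", -1), (2, 0.5), (["lazy", "dog"], -1.5)])
print(result)
```

With the real tokenizer, a word that splits into several tokens would not match any single entry and would raise here, which makes the multi-token case visible instead of silently skipping it.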
Hi @bloc97 ,
Thanks for your great work.
Do you know any other papers/implementations using the finite difference gradient descent to do inversion?
I want more references for this solution.
Also, could you please give more hints about the magic number tless?
Hi bloc, firstly thank you for your great work!
I've been spending a lot of time trying to implement Google's original release in a custom pipeline with diffusers. I figured it wouldn't be too difficult, as they have an example there running with SD that looks pretty good. However, I'm getting very strange results even though everything seems to be in working order. I wondered whether it might be because I had been using SD 1.5 whereas they used 1.4, but I don't think there were any architecture changes that would cause that?
Could you elaborate a bit more on the changes you made to get it to work with Stable Diffusion?
Hi developers, thank you for completing this wonderful re-implementation.
As I am checking the differences between this repo and the original one, I noticed that the original repo also implemented stable diffusion.
I am wondering if you would like to list the additional features that are exclusive to this repo on the README page. I would appreciate the clarification very much!
Thanks for your great work. In line 633 of gaussian_diffusion.py, terms["nll"] is calculated but not used. Is this a mistake, or is it left out because it doesn't work?
terms["nll"] = self._token_discrete_loss(model_out_x_start, get_logits, input_ids_x, mask=input_ids_mask, truncate=True, t=t)
terms["loss"] = terms["mse"] + decoder_nll + tT_loss
Does negative weighting work properly?
Most repos that support weighting don't have an option for negative weights.
Does it work in this one?
Thanks
There are some issues with LMSDiscreteScheduler and new_attention: new_attention now requires sequence_length and dim
def new_attention(self, query, key, value, sequence_length, dim):
but diffusers/models/attention.py calls it with only three arguments:
hidden_states = self._attention(query, key, value)
/usr/local/lib/python3.7/dist-packages/diffusers/models/attention.py in forward(self, hidden_states, context)
196 def forward(self, hidden_states, context=None):
197 hidden_states = hidden_states.contiguous() if hidden_states.device.type == "mps" else hidden_states
--> 198 hidden_states = self.attn1(self.norm1(hidden_states)) + hidden_states
199 hidden_states = self.attn2(self.norm2(hidden_states), context=context) + hidden_states
200 hidden_states = self.ff(self.norm3(hidden_states)) + hidden_states
/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
1128 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1129 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1130 return forward_call(*input, **kwargs)
1131 # Do not call functions when jit is used
1132 full_backward_hooks, non_full_backward_hooks = [], []
/usr/local/lib/python3.7/dist-packages/diffusers/models/attention.py in forward(self, hidden_states, context, mask)
268
269 if self._slice_size is None or query.shape[0] // self._slice_size == 1:
--> 270 hidden_states = self._attention(query, key, value)
271 else:
272 hidden_states = self._sliced_attention(query, key, value, sequence_length, dim)
TypeError: new_attention() missing 2 required positional arguments: 'sequence_length' and 'dim'
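One way to make a replacement method tolerate both call styles is to give the two extra parameters defaults. The sketch below only demonstrates that signature change: DummyAttention is a stand-in written for this example, not the diffusers class, and the body of new_attention is a placeholder rather than the repo's attention code.

```python
class DummyAttention:
    """Stand-in for an attention module; only the call styles matter here."""

    def forward_three_args(self, query, key, value):
        # Call style from the traceback: self._attention(query, key, value)
        return self._attention(query, key, value)

    def forward_five_args(self, query, key, value, sequence_length, dim):
        # Call style new_attention was written for, with sequence_length and dim
        return self._attention(query, key, value, sequence_length, dim)


def new_attention(self, query, key, value, sequence_length=None, dim=None):
    # Placeholder body: a real replacement would compute attention here and
    # derive any sizes it needs from the tensor shapes when they are not passed.
    return {"sequence_length": sequence_length, "dim": dim}


module = DummyAttention()
# Bind the replacement function to this instance
module._attention = new_attention.__get__(module, DummyAttention)

print(module.forward_three_args("q", "k", "v"))
print(module.forward_five_args("q", "k", "v", 4096, 320))
```

Whether the real new_attention can run without sequence_length and dim depends on how its body uses them, so this is only the first half of a fix. Installing the diffusers version the notebook targets avoids the mismatch altogether.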
Hello.
I have another question.
Using your code, I tried to reproduce the results from the Prompt-to-Prompt paper.
However, I got the result below:
Did you get an identical result?
How do I set the parameters to get the results in the Prompt-to-Prompt paper?
And,
Please let me know why you divided the latent by 0.18215 before passing it to the VAE.
Thanks :) !!
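On the division, a hedged note: 0.18215 is the latent scaling factor used by Stable Diffusion v1. Latents coming out of the VAE encoder are multiplied by it before diffusion, so they have to be divided by it again before the VAE decoder sees them. The sketch below uses plain lists in place of tensors, and both function names are made up for illustration.

```python
import math

LATENT_SCALE = 0.18215  # Stable Diffusion v1 latent scaling factor


def to_diffusion_space(vae_latent):
    """Scale raw VAE encoder latents into the range the UNet works in."""
    return [LATENT_SCALE * x for x in vae_latent]


def to_vae_space(diffusion_latent):
    """Undo the scaling before handing latents to the VAE decoder."""
    return [x / LATENT_SCALE for x in diffusion_latent]


raw = [5.0, -2.5, 0.0]
roundtrip = to_vae_space(to_diffusion_space(raw))
roundtrip_ok = all(math.isclose(a, b, abs_tol=1e-12) for a, b in zip(raw, roundtrip))
print(roundtrip_ok)
```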
Hello,
Thank you for sharing your awesome code! :)
I have a question about this line:
latent_model_input = (latent_model_input / ((sigma**2 + 1) ** 0.5)).to(unet.dtype)
Could you explain why the "(sigma**2 + 1) ** 0.5" factor is needed?
Thanks.
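A sketch of the usual reasoning, given as my understanding rather than the author's explanation: sigma-based schedulers such as LMSDiscreteScheduler represent the noisy latent as x0 + sigma * eps, whose variance grows as sigma**2 + 1 for unit-variance x0, while the UNet was trained on inputs of the form sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps. With sigma = sqrt((1 - alpha_bar) / alpha_bar), dividing by (sigma**2 + 1) ** 0.5 turns the first form into the second. The value of alpha_bar and the sample count below are arbitrary, and scale_model_input here is a local helper defined for the example.

```python
import math
import random


def scale_model_input(x, sigma):
    """Rescale a sigma-parameterised noisy sample to the variance-preserving form."""
    return x / math.sqrt(sigma**2 + 1)


alpha_bar = 0.3                                  # arbitrary cumulative alpha
sigma = math.sqrt((1 - alpha_bar) / alpha_bar)   # matching noise level

rng = random.Random(0)
x0 = [rng.gauss(0, 1) for _ in range(50_000)]    # unit-variance stand-in for clean latents
eps = [rng.gauss(0, 1) for _ in range(50_000)]   # Gaussian noise

# Sigma form x0 + sigma * eps, then rescaled
scaled = [scale_model_input(a + sigma * e, sigma) for a, e in zip(x0, eps)]
# Variance-preserving form the UNet was trained on
vp = [math.sqrt(alpha_bar) * a + math.sqrt(1 - alpha_bar) * e for a, e in zip(x0, eps)]

max_diff = max(abs(s - v) for s, v in zip(scaled, vp))
rms = math.sqrt(sum(s * s for s in scaled) / len(scaled))
print(max_diff, rms)
```

The two forms agree up to floating-point error, and the rescaled samples have a root mean square close to 1, which is the input scale the UNet expects at every noise level.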