
LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models

Long Lian, Boyi Li, Adam Yala, Trevor Darrell at UC Berkeley/UCSF.

Transactions on Machine Learning Research (TMLR), with Featured Certification

Paper | Project Page | 5-minute Blog Post | HuggingFace Demo (updated!) | Citation | LLM-grounded Video Diffusion Models

TL;DR: Text Prompt -> LLM as a Request Parser -> Intermediate Representation (such as an image layout) -> Stable Diffusion -> Image.

Main Image Visualizations: Enhanced Prompt Understanding

Updates

[2024.1] Added a result with self-hosted Mixtral-8x7B-Instruct-v0.1 (see our reference benchmark results section). Surprisingly, the Mixtral model's performance is comparable to GPT-3.5's, which shows that it's possible to self-host LMD/LMD+ and achieve good results without external LLM API calls.

[2023.11] Our LLM-grounded Diffusion (LMD+) has been officially integrated into upstream diffusers v0.24.0! This is an example Colab that shows how to use our pipeline with official diffusers. The implementation in upstream diffusers is a simplified LMD+, so we recommend using this full repo to reproduce our results.

Using our pipeline with official diffusers in only a few lines of code
# Requires diffusers >= 0.24.0

import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "longlian/lmd_plus",
    custom_pipeline="llm_grounded_diffusion",
    custom_revision="main",
    variant="fp16", torch_dtype=torch.float16
)
pipe.enable_model_cpu_offload()

# An example prompt with LLM response
prompt = "a waterfall and a modern high speed train in a beautiful forest with fall foliage"
llm_response = """
[('a waterfall', [71, 105, 148, 258]), ('a modern high speed train', [255, 223, 181, 149])]
Background prompt: A beautiful forest with fall foliage
Negative prompt:
"""

phrases, boxes, bg_prompt, neg_prompt = pipe.parse_llm_response(llm_response)

# Use `LLMGroundedDiffusionPipeline` to generate an image
images = pipe(
    prompt=prompt,
    negative_prompt=neg_prompt,
    phrases=phrases,
    boxes=boxes,
    gligen_scheduled_sampling_beta=0.4,
    output_type="pil",
    num_inference_steps=50,
    lmd_guidance_kwargs={}
).images

# PIL Image:
images[0]
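
The returned images[0] is a standard PIL image, so it can be displayed or saved directly; for example (the filename here is just a placeholder):

images[0].save("lmd_plus_example.png")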

[2023.10] Our repo now supports using SDXL for high-quality generation with the SDXL Refiner! Simply add --sdxl to the generation command to use it. You can also use --sdxl-step-ratio to control the strength of the refinement (use 0.5 for stronger refinement and 0.1 for weaker refinement). See the examples above.
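
For example, a hypothetical invocation that adds SDXL refinement to the stage 2 demo command shown later in this README (the --save-suffix value and the step ratio are placeholders):

python generate.py --prompt-type demo --model gpt-4 --save-suffix "gpt-4-sdxl" --repeats 5 --frozen_step_ratio 0.5 --regenerate 1 --force_run_ind 0 --run-model lmd_plus --no-scale-boxes-default --template_version v0.1 --sdxl --sdxl-step-ratio 0.3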

[2023.10] Please also check out our new work LLM-grounded Video Diffusion Models (LVD), which shows that LLMs have knowledge in their weights that can ground video diffusion models 🔥🔥🔥!

[2023.8] Our repo has been largely improved: now we have a repo with many methods implemented, including our training-free LMD and LMD+ (LMD with GLIGEN adapters).

[2023.6] Our Hugging Face WebUI demo for stages 1 and 2 is updated: you can now enable each guidance component individually to see its contribution! Check it out here.

Our WebUI is also available to run locally. The instructions to run our WebUI locally to get faster generation without queues are here.

Our repo implements several layout-to-image methods (stage 2), including LMD, LMD+, MultiDiffusion, Backward Guidance, BoxDiff, and GLIGEN.

These methods can be freely combined with our proposed LLM-based box-to-layout method (stage 1) also implemented in this repo.

Feel free to contact me / submit a pull request to add your methods!

Our repo's features

  • (New) Supports SDXL refiner for high-resolution high-quality generation
  • Both web-based ChatGPT and OpenAI API on GPT-3.5/4 supported: Allows generating bounding boxes by either asking ChatGPT yourself (free) or in batch with OpenAI API (fully automated).
  • LLM queries are cached to save $$$ on LLM APIs: we cache each LLM query for layout generation so it does not re-generate the layouts from the same prompt.
  • Open-source LLMs supported!: Host LLMs yourself for more freedom and lower costs! We support Vicuna, LLaMA 2, StableBeluga2, etc. More in FAQ.
  • Supports both LMD (which uses SD weights without training and performs attention guidance) and LMD+ (which adds GLIGEN adapters to SD in addition to attention guidance)
  • Supports SD v1 and SD v2 in the same codebase: if you implement a new feature or a new loss, it's likely that it will work on both SD v1 and v2.
  • Several baseline stage 2 methods implemented in the same codebase: handy if you want to benchmark and compare
  • Hackable: we provide a minimal copy of diffusion UNet architecture in our repo that exports the attention maps according to your need. This allows you to change things without maintaining your own diffusers package.
  • Parallel and resumable image generation supported! You can generate in parallel to make use of multiple GPUs/servers. If a generation fails on some images (e.g., CUDA OOM), you can simply rerun generation to regenerate those. More in FAQ.
  • Modular: we implement different methods in different files. Copy from a file in generation and start creating your method without impacting existing methods.
  • Web UI supported: don't want to code or run anything? Try our public WebUI demo or instructions to run WebUI locally.
And more exciting features:
  • FlashAttention and PyTorch v2 supported.
  • Unified benchmark: same evaluation protocol on layouts (stage 1) and generated images (stage 1+2) for all methods implemented.
  • Provides different presets to balance better control and fast generation in Web UI.

LLM-grounded Diffusion (LMD)

We provide instructions to run our code in this section.

Installation

pip install -r requirements.txt

Stage 1: Text-to-Layout Generation

Note that we have uploaded the layout caches into this repo so that you can skip this step if you don't need layouts for new prompts.

Since we have cached the layout generation (the cache is downloaded when you clone the repo), you need to remove the corresponding entries in the cache directory if you want to re-generate the layouts for the same prompts.

Our layout generation format: The LLM takes in a text prompt describing the image and outputs three elements: 1. captioned boxes, 2. a background prompt, 3. a negative prompt (useful if the LLM wants to express negation). The template and examples are in prompt.py. You can edit the template and the parsing function to ask the LLM to generate additional things or even perform chain-of-thought for better generation.
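
For illustration, a response in this format (the same structure parsed in the diffusers example above; the objects and box coordinates here are made up) looks like:

[('a red apple', [60, 300, 120, 120]), ('a yellow banana', [300, 310, 150, 90])]
Background prompt: A wooden kitchen table
Negative prompt: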

Option 1 (automated): Use an OpenAI API key

If you have an OpenAI API key, you can put the API key in utils/api_key.py or set the OPENAI_API_KEY environment variable. Then you can use the OpenAI API for batch text-to-layout generation by querying an LLM, with GPT-4 as an example:

python prompt_batch.py --prompt-type demo --model gpt-4 --auto-query --always-save --template_version v0.1

--prompt-type demo includes a few prompts for demonstrations. The layout generation will be cached so it does not query the LLM again with the same prompt (lowers the cost).

You can visualize the bounding boxes in img_generations/imgs_demo_templatev0.1.

Option 2 (free): Manually copy and paste to ChatGPT

python prompt_batch.py --prompt-type demo --model gpt-4 --always-save --template_version v0.1

Then copy and paste the template to ChatGPT. Note that you want to use GPT-4 or change the --model to gpt-3.5 in order to match the cache file name. Then copy the response back. The generation will be cached.

If you want to visualize before deciding to save or not, you don't need to pass in --always-save.

Run our benchmark on text-to-layout generation evaluation

We provide a benchmark that applies to both stage 1 and stage 2. This benchmark includes a set of prompts covering four tasks (negation, numeracy, attribute binding, and spatial relationships), as well as unified benchmarking code for all implemented methods and both stages.

This will generate layouts from the prompts in the benchmark (with --prompt-type lmd) and evaluate the results:

python prompt_batch.py --prompt-type lmd --model gpt-3.5 --auto-query --always-save --template_version v0.1
python scripts/eval_stage_one.py --prompt-type lmd --model gpt-3.5 --template_version v0.1
Our reference benchmark results (stage 1, evaluating the generated layouts only)
Method     Negation   Numeracy   Attribution   Spatial   Overall
GPT-3.5    100        97         100           99        99.0%
GPT-4      100        100        100           100       100.0%

Stage 2: Layout-to-Image Generation

Note that since we provide caches for stage 1, you don't need to run stage 1 on your own for cached prompts that we provide (i.e., you don't need an OpenAI API key or to query an LLM).

Run layout-to-image generation using the gpt-4 cache and LMD+:

python generate.py --prompt-type demo --model gpt-4 --save-suffix "gpt-4" --repeats 5 --frozen_step_ratio 0.5 --regenerate 1 --force_run_ind 0 --run-model lmd_plus --no-scale-boxes-default --template_version v0.1

--save-suffix is the suffix added to the name of the run; change it to mark the setting when you change the args across runs. --run-model specifies the method to run: set it to LMD, LMD+, or one of the implemented baselines (examples below). Use --use-sdv2 to enable SD v2.
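
For example, a hypothetical variant of the command above that runs the training-free LMD (instead of LMD+) on SD v2 (the --save-suffix value is a placeholder):

python generate.py --prompt-type demo --model gpt-4 --save-suffix "gpt-4-lmd-sdv2" --repeats 5 --frozen_step_ratio 0.5 --regenerate 1 --force_run_ind 0 --run-model lmd --use-sdv2 --no-scale-boxes-default --template_version v0.1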

Run our benchmark on layout-to-image generation evaluation

We use the same unified evaluation protocol in stage 2 as in stage 1 (--prompt-type lmd). Since stage 1 produces layout boxes but stage 2 produces only images, we use OWL-ViT to detect the objects and check that they are generated (or, for negation, not generated) in the right number, with the right attributes, and in the right place.
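
As a minimal sketch of this idea (this is not the repo's scripts/owl_vit_eval.py; the checkpoint name, image path, and query phrases below are assumptions for illustration, and the 0.15 threshold mirrors the commands below), open-vocabulary detection with OWL-ViT from Hugging Face transformers can be used to check whether a queried phrase is detected in a generated image:

import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

image = Image.open("img_generations/example.png").convert("RGB")  # hypothetical image path
queries = [["a waterfall", "a modern high speed train"]]  # one list of phrases per image

inputs = processor(text=queries, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw outputs to (score, label, box) detections above a confidence threshold
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
detections = processor.post_process_object_detection(
    outputs, threshold=0.15, target_sizes=target_sizes
)[0]
for score, label, box in zip(detections["scores"], detections["labels"], detections["boxes"]):
    print(f"{queries[0][label]}: score={score:.2f}, box={[round(v) for v in box.tolist()]}")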

This runs generation with LMD+ and evaluates the generation:

# Use GPT-3.5 layouts
python generate.py --prompt-type lmd --model gpt-3.5 --save-suffix "gpt-3.5" --repeats 1 --frozen_step_ratio 0.5 --regenerate 1 --force_run_ind 0 --run-model lmd_plus --no-scale-boxes-default --template_version v0.1
python scripts/owl_vit_eval.py --model gpt-3.5 --run_base_path img_generations/img_generations_templatev0.1_lmd_plus_lmd_gpt-3.5/run0 --skip_first_prompts 0 --prompt_start_ind 0 --verbose --detection_score_threshold 0.15 --nms_threshold 0.15 --class-aware-nms
# Use GPT-4 layouts
python generate.py --prompt-type lmd --model gpt-4 --save-suffix "gpt-4" --repeats 1 --frozen_step_ratio 0.5 --regenerate 1 --force_run_ind 0 --run-model lmd_plus --no-scale-boxes-default --template_version v0.1
python scripts/owl_vit_eval.py --model gpt-4 --run_base_path img_generations/img_generations_templatev0.1_lmd_plus_lmd_gpt-4/run0 --skip_first_prompts 0 --prompt_start_ind 0 --verbose --detection_score_threshold 0.15 --nms_threshold 0.15 --class-aware-nms
Our reference benchmark results
Method                                Negation   Numeracy   Attribution   Spatial   Overall
SD v1.5                               28         39         52            28        36.8%
LMD+ (GPT-3.5)                        100        86         69            67        80.5%
LMD+ (GPT-4)                          100        84         79            82        86.3%
LMD+ (StableBeluga2*)                 88         60         56            64        67.0%
LMD+ (Mixtral-8x7B-Instruct-v0.1*)    98         72         62            78        77.5%

* StableBeluga2 is an open-source model based on Llama 2. Mixtral-8x7B-Instruct-v0.1 is an open-source MoE model that can be served on a single A100 if quantized. We find that LLMs' spatial reasoning ability also carries over to open-source models. Surprisingly, the Mixtral model's performance is close to that of GPT-3.5, which shows that it's possible to self-host LMD/LMD+ without external LLM API calls and still achieve good results. However, there is still room for improvement compared to the proprietary GPT-4. We leave LLM fine-tuning for better stage 1 layout generation to future research.

To run generation with LMD with original SD weights and evaluate the generation:

Generate and evaluate samples with LMD
# Use GPT-3.5 layouts
python generate.py --prompt-type lmd --model gpt-3.5 --save-suffix "gpt-3.5" --repeats 1 --frozen_step_ratio 0.5 --regenerate 1 --force_run_ind 0 --run-model lmd --no-scale-boxes-default --template_version v0.1
python scripts/owl_vit_eval.py --model gpt-3.5 --run_base_path img_generations/img_generations_templatev0.1_lmd_lmd_gpt-3.5/run0 --skip_first_prompts 0 --prompt_start_ind 0 --verbose --detection_score_threshold 0.15 --nms_threshold 0.15 --class-aware-nms

Note: You can enable autocast (mixed precision) with --use_autocast 1 to reduce the memory used in generation, at the cost of potentially slightly lower generation quality.

Generate samples with other stage 2 baseline methods
# SD v1.5
python generate.py --prompt-type lmd --model gpt-3.5 --save-suffix "gpt-3.5" --repeats 1 --regenerate 1 --force_run_ind 0 --run-model sd --no-scale-boxes-default --template_version v0.1 --ignore-negative-prompt
# MultiDiffusion (training-free)
python generate.py --prompt-type lmd --model gpt-3.5 --save-suffix "gpt-3.5" --repeats 1 --regenerate 1 --force_run_ind 0 --run-model multidiffusion --no-scale-boxes-default --template_version v0.1 --multidiffusion_bootstrapping 10
# Backward Guidance (training-free)
python generate.py --prompt-type lmd --model gpt-3.5 --save-suffix "gpt-3.5" --repeats 1 --regenerate 1 --force_run_ind 0 --run-model backward_guidance --no-scale-boxes-default --template_version v0.1
# Boxdiff (training-free, our reimplementation)
python generate.py --prompt-type lmd --model gpt-3.5 --save-suffix "gpt-3.5" --repeats 1 --regenerate 1 --force_run_ind 0 --run-model boxdiff --no-scale-boxes-default --template_version v0.1
# GLIGEN (training-based)
python generate.py --prompt-type lmd --model gpt-3.5 --save-suffix "gpt-3.5" --repeats 1 --regenerate 1 --force_run_ind 0 --run-model gligen --no-scale-boxes-default --template_version v0.1

Note: we set --ignore-negative-prompt for SD v1.5 so that SD generation does not depend on the LLM and serves as a plain text-to-image baseline (otherwise we would put the LLM-generated negative prompts into the negative prompt). Feel free to generate with the other baselines as well. Evaluation is similar to LMD+, except that you need to change the image path in the evaluation command.

Our reference benchmark results (stage 2, LMD, without autocast)
Method           Negation   Numeracy   Attribution   Spatial   Overall
SD v1.5          28         39         52            28        36.8%
LMD (GPT-3.5)    100        62         65            79        76.5%
Ablation: Our reference benchmark results by combining LMD stage 1 with various layout-to-image baselines as stage 2

Stage 1 in this table is LMD (GPT-3.5) unless stated otherwise. We keep stage 1 the same and replace stage 2 with other layout-to-image methods.

Stage 1 / Stage 2 Method    Negation*   Numeracy   Attribution   Spatial   Overall
None / SD v1.5              28          39         52            28        36.8%
Training-free (uses SD weights out-of-the-box):
LMD / MultiDiffusion        100         30         42            36        52.0%
LMD / Backward Guidance     100         42         36            61        59.8%
LMD / BoxDiff               100         32         55            62        62.3%
LMD / LMD                   100         62         65            79        76.5%
Training-based:
LMD / GLIGEN                100         57         57            45        64.8%
LMD / LMD+**                100         86         69            67        80.5%
LMD / LMD+ (GPT-4)          100         84         79            82        86.3%

* All methods equipped with LMD stage 1 understand negation well because LMD stage 1 generates the negative prompts, which is applicable to all methods that use classifier-free guidance on SD.

** Note that LMD+ uses attention control that we proposed in addition to GLIGEN, which has much better generation compared to using only GLIGEN, showing that our proposed training-free control is orthogonal to training-based methods such as GLIGEN.

FAQs

How do I use open-source LLMs (e.g., Mixtral, LLaMA-2, StableBeluga2, Vicuna)?

You can install FastChat and start an LLM server (note that the server does not have to be on the same machine as this repo). This requires running three terminals (e.g., three tmux windows). We use Mixtral-8x7B-Instruct-v0.1 as an example, since it performs the best among all open-source LLMs in our experience:

pip install fschat

export FASTCHAT_WORKER_API_TIMEOUT=600
# Run this in window 1
python3 -m fastchat.serve.controller

# Run this in window 2
CUDA_VISIBLE_DEVICES=0,1 python3 -m fastchat.serve.model_worker --model-path mistralai/Mixtral-8x7B-Instruct-v0.1 --num-gpus 2 --max-gpu-memory 48GiB
# Command for StableBeluga2:
# CUDA_VISIBLE_DEVICES=0,1 python3 -m fastchat.serve.model_worker --model-path stabilityai/StableBeluga2 --num-gpus 2

# Run this in window 3
python3 -m fastchat.serve.openai_api_server --host localhost --port 8000

StableBeluga2 is a 70B model, so you need at least 2 GPUs; Mixtral-8x7B-Instruct-v0.1 also requires 2 GPUs (with 80GB memory), but you can run smaller models with only 1 GPU. Simply replace the model path with the Hugging Face model key (e.g., meta-llama/Llama-2-70b-hf, lmsys/vicuna-33b-v1.3). Note that you probably want models without RLHF (e.g., not Llama-2-70b-chat-hf), as we use text completion endpoints for layout generation, although Mixtral with instruction tuning seems to perform slightly better than Mixtral without it. Then change the --model argument to the intended model.

If your LLM server is not on localhost:8000, you can change the API endpoint URL in utils/llm.py. If your model name is not in the list in utils/llm.py, you can add it to the model_names list. We created this list to prevent typos in the command.
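
For reference, here is a minimal sketch of querying such a FastChat server through its OpenAI-compatible completion endpoint (this is not the repo's utils/llm.py; it assumes the openai>=1.0 Python client, the server started with the commands above on localhost:8000, and a placeholder prompt):

from openai import OpenAI

# FastChat's OpenAI-compatible server does not check the API key
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="Mixtral-8x7B-Instruct-v0.1",  # the name the FastChat worker registered
    prompt="Provide a box layout for: a waterfall and a deer in a forest",  # illustrative only
    max_tokens=256,
    temperature=0.0,
)
print(completion.choices[0].text)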

My LLM queries finished very quickly, why?

Check whether you have a lot of cache hits in the output. If so, you can either use the cache (you are all set) or remove the cache in the cache directory to regenerate.

Note that we allow different versions of templates so that you can manage several templates easily without cache overwrites.

Contact us

Please contact Long (Tony) Lian if you have any questions: [email protected].

Acknowledgements

This repo uses code from diffusers, GLIGEN, and layout-guidance. This code also has an implementation of boxdiff and MultiDiffusion (region control). Using their code means adhering to their license.

Citation

If you use our work or our implementation in this repo, or find them helpful, please consider giving a citation.

@article{lian2023llmgrounded,
    title={LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models}, 
    author={Lian, Long and Li, Boyi and Yala, Adam and Darrell, Trevor},
    journal={arXiv preprint arXiv:2305.13655},
    year={2023}
}


llm-groundeddiffusion's Issues

How to get the image a man rides a horse?

I tried this project. It's amazing and interesting.
But now I have a question: it's hard for me to get a good image from the text "a man rides a horse".
Can you give me some advice?
Thank you!

Can you load LoRAs with lmd, or lmd_plus?

I've been loving this tool but have been wanting to use it with some LoRAs I've created.

Is it possible to:

  • load LoRAs based on runwayml/stable-diffusion-v1-5 when generating with lmd?
  • load LoRAs based on CompVis/stable-diffusion-v1-4 when generating with lmd_plus?
  • finetune or create LoRAs based on longlian/lmd_plus?
    Unexpected key(s) in state_dict: "position_net.null_positive_feature", "position_net.null_position_feature", 
    "position_net.linears.0.weight", "position_net.linears.0.bias", "position_net.linears.2.weight", "position_net.linears.2.bias", 
    "position_net.linears.4.weight", "position_net.linears.4.bias", "down_blocks.0.attentions.0.transformer_blocks.0.fuser.alpha_attn", 
    "down_blocks.0.attentions.0.transformer_blocks.0.fuser.alpha_dense", 
    "down_blocks.0.attentions.0.transformer_blocks.0.fuser.linear.weight", 
    "down_blocks.0.attentions.0.transformer_blocks.0.fuser.linear.bias", 
    "down_blocks.0.attentions.0.transformer_blocks.0.fuser.attn.to_q.weight", 
    "down_blocks.0.attentions.0.transformer_blocks.0.fuser.attn.to_k.weight", 
    "down_blocks.0.attentions.0.transformer_blocks.0.fuser.attn.to_v.weight", 
    "down_blocks.0.attentions.0.transformer_blocks.0.fuser.attn.to_out.0.weight", 
    "down_blocks.0.attentions.0.transformer_blocks.0.fuser.attn.to_out.0.bias", 
    "down_blocks.0.attentions.0.transformer_blocks.0.fuser.ff.net.0.proj.weight", 
    "down_blocks.0.attentions.0.transformer_blocks.0.fuser.ff.net.0.proj.bias", 
    "down_blocks.0.attentions.0.transformer_blocks.0.fuser.ff.net.2.weight", 
    "down_blocks.0.attentions.0.transformer_blocks.0.fuser.ff.net.2.bias", 
    "down_blocks.0.attentions.0.transformer_blocks.0.fuser.norm1.weight", 
    "down_blocks.0.attentions.0.transformer_blocks.0.fuser.norm1.bias"
    ...
    
    It seems the attentions from GLIGEN are messing up the training
  • So far, my attempts at loading a LoRA for lmd_plus with the pipeline on Hugging Face (https://github.com/huggingface/diffusers/blob/main/examples/community/README.md#llm-grounded-diffusion) have been unsuccessful. I get 'UNet2DConditionModel' object has no attribute 'attn_processors' after I load LoRAs on the pipeline. It seems that loading a LoRA clears out the attention processors from GLIGEN, which causes issues downstream. Is there a way to preserve them? Does that even make sense? I'm not super familiar with how the GLIGEN attention processors work and whether updating the UNet layers with LoRAs would mess up the attentions.

So I guess: is it possible? Are there any examples I can reference? Any additional help would be appreciated.

How to Swap Object Positions While Maintaining Consistent Background in Image Synthesis?

I'm working on image synthesis with a focus on vision-language fine-grained understanding. I'm facing a challenge in generating two images that maintain a consistent background but swap the positions of two objects (e.g., a dog on the left and a cat on the right in the first image, and vice versa in the second image).

I've tried fixing the seed and the bounding box locations and only swapping the object names, but it doesn't seem to work. Any guidance would be greatly appreciated.

How to Visualize the masked latents or the attention maps

Hi, I am working with LLM-grounded Diffusion and facing some issues. I cannot visualize the masked latents or attention maps even if I enable the following code in lmd_plus.py or lmd.py:

if visualize:
    vis.visualize(mask_selected, "Mask (selected) after resize")
    # This is only for visualizations
    masked_latents = latents_all * mask_selected_tensor
    vis.visualize_masked_latents(
        latents_all, masked_latents, timestep_T=False, timestep_0=True
    )

Can you please help me with that?
Thanks

License

Thanks a lot for sharing the source code for this project.
What is the license of the code?

Questions about per-box generation process

I have a small question about the per-box generation process.

I'm curious why it isn't possible to generate multiple objects simultaneously.
For example, let's say we have two boxes, A and B, in an image. Couldn't we let box A attend to object a and box B attend to object b?
Is there a significant difference in quality compared to the suggested "per-box generation" approach?
If we can generate all objects simultaneously, we don't even have to perform the DDIM inversion process.

Can we use a small model like LLaMA to get the layout?

Sometimes there is no way to use ChatGPT or GPT-4. Can we use a model smaller than GPT, like LLaMA, to get the layout?
If we could use a small model to get the layout, this method would become much more widely usable.
Looking forward to your reply! Thanks!

Some failure cases about attribute assignment

Hi, thanks for your nice work. I have tried the demo, and the method does have strong reasoning abilities, but there are some failure cases in attribute assignment.

Given the following response from ChatGPT:
Caption: A cartoon painting of a man in red standing next to another woman in blue
Objects: [('a man in red', [80, 150, 100, 200]), ('a woman in blue', [200, 150, 100, 200])]
Background prompt: A cartoon painting

I obtained:

  1. seed=4354
    image (7)
  2. seed=3628
    image (8)

What do you think the problem might be?

Thanks

Invoke AI

Can this work with Invoke AI, so that I can enter text to generate or improve a prompt?

"TypeError: Linear.forward() takes 2 positional arguments but 3 were given" when trying to replicate the pipeline with diffusers

Here's my code. I pretty much copied it across from your README.

from diffusers import DiffusionPipeline
import torch

print("Torch version:", torch.__version__)
print("Is CUDA enabled?", torch.cuda.is_available())

pipe = DiffusionPipeline.from_pretrained(
        "longlian/lmd_plus",
        custom_pipeline="llm_grounded_diffusion",
        custom_revision="main",
        torch_dtype=torch.float16,
        variant="fp16")
pipe.to("cuda")

prompt = "a waterfall and a modern high speed train in a beautiful forest with fall foliage."
response = """
[('a waterfall', [100, 50, 200, 450]), ('a beautiful deer', [350, 250, 150, 200])] Background prompt: A dense forest surrounded by mountains Negative prompt:
"""

phrases, boxes, bg_prompt, neg_prompt = pipe.parse_llm_response(response)

image = pipe(
    prompt=prompt,
    negative_prompt=neg_prompt,
    phrases=phrases,
    boxes=boxes,
    gligen_scheduled_sampling_beta=0.4,
    output_type="pil",
    num_inference_steps=50,
    lmd_guidance_kwargs={}
).images

image[0].show()

Here's the stacktrace:

Loading pipeline components...: 57%|█████▋ | 4/7 [00:00<00:00, 5.76it/s]text_config_dict is provided which will be used to initialize CLIPTextConfig. The value text_config["id2label"] will be overriden.
text_config_dict is provided which will be used to initialize CLIPTextConfig. The value text_config["bos_token_id"] will be overriden.
text_config_dict is provided which will be used to initialize CLIPTextConfig. The value text_config["eos_token_id"] will be overriden.
Loading pipeline components...: 100%|██████████| 7/7 [00:01<00:00, 6.19it/s]
0%| | 0/50 [00:00<?, ?it/s]
Traceback (most recent call last):
File "...\anaconda3\envs\conda310diss\lib\site-packages\torch\utils_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "....cache\huggingface\modules\diffusers_modules\git\llm_grounded_diffusion.py", line 1019, in call
latents, loss_attn = self.latent_lmd_guidance(
File "...\anaconda3\envs\conda310diss\lib\site-packages\torch\utils_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "....cache\huggingface\modules\diffusers_modules\git\llm_grounded_diffusion.py", line 1129, in latent_lmd_guidance
unet(
File "...\anaconda3\envs\conda310diss\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "...\anaconda3\envs\conda310diss\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "...\anaconda3\envs\conda310diss\lib\site-packages\diffusers\models\unets\unet_2d_condition.py", line 1216, in forward
sample, res_samples = downsample_block(
File "...\anaconda3\envs\conda310diss\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "...\anaconda3\envs\conda310diss\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "...\anaconda3\envs\conda310diss\lib\site-packages\diffusers\models\unets\unet_2d_blocks.py", line 1279, in forward
hidden_states = attn(
File "...\anaconda3\envs\conda310diss\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "...\anaconda3\envs\conda310diss\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "...\anaconda3\envs\conda310diss\lib\site-packages\diffusers\models\transformers\transformer_2d.py", line 397, in forward
hidden_states = block(
File "...\anaconda3\envs\conda310diss\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "...\anaconda3\envs\conda310diss\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "...\anaconda3\envs\conda310diss\lib\site-packages\diffusers\models\attention.py", line 366, in forward
attn_output = self.attn2(
File "...\anaconda3\envs\conda310diss\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "...\anaconda3\envs\conda310diss\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "...\anaconda3\envs\conda310diss\lib\site-packages\diffusers\models\attention_processor.py", line 522, in forward
return self.processor(
File "....cache\huggingface\modules\diffusers_modules\git\llm_grounded_diffusion.py", line 198, in call
query = attn.to_q(hidden_states, *args)
File "...\anaconda3\envs\conda310diss\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "...\anaconda3\envs\conda310diss\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
TypeError: Linear.forward() takes 2 positional arguments but 3 were given

Torch: 2.1.2+cu118
accelerate: 0.21.0
transformers: 4.31.0
diffusers: 0.27.2

I've been trying to troubleshoot, but it seems like the error is coming from inside the package. I will try to fork your repo and work with it directly, but I thought it would be useful to flag this here.
