
controlvideo's People

Contributors

chenxwh, hackeranonymousdeepweb, mayuelala, ybybzhang


controlvideo's Issues

no file named pytorch_model.bin, tf_model.h5, model.ckpt.index or flax_model.msgpack found in directory checkpoints/stable-diffusion-v1-5

I cloned https://huggingface.co/runwayml/stable-diffusion-v1-5 with git clone, and the checkpoints directory looks like this:
ls
flownet.pkl  sd-controlnet-canny  sd-controlnet-depth  sd-controlnet-openpose  stable-diffusion-v1-5
But when I run inference.py, I get this error:
no file named pytorch_model.bin, tf_model.h5, model.ckpt.index or flax_model.msgpack found in directory checkpoints/stable-diffusion-v1-5
I checked the files in my directory and they match what is shown at https://huggingface.co/runwayml/stable-diffusion-v1-5/tree/main.
What am I doing wrong?
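One thing worth checking (an assumption based on the error message, not a confirmed diagnosis): a plain git clone only fetches the full weight files if git-lfs is installed; otherwise the directory contains small pointer files that the diffusers loader cannot use. A minimal sketch that downloads the complete snapshot with huggingface_hub instead, where the local_dir value is just the path the error message mentions:

from huggingface_hub import snapshot_download

# Hedged sketch: re-download the full Stable Diffusion 1.5 snapshot into the
# checkpoints folder referenced in the error above.
snapshot_download(
    repo_id="runwayml/stable-diffusion-v1-5",
    local_dir="checkpoints/stable-diffusion-v1-5",
)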

Pure text generation

Thank you for your remarkable work! I wonder whether this model can generate video from pure text input alone?

CUDA out of memory

Very good job!
When I run inference with --video_length 49 --frame_rate 2, I get a "CUDA out of memory" error. I am running on a 3090 GPU with 24 GB of VRAM. Is there any way to solve this issue? Can I use multi-GPU inference?

Looking forward to your reply.
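A few generic memory-saving knobs may help here. This is a hedged sketch that assumes pipe is the ControlVideoPipeline built in inference.py and that it exposes the usual diffusers-style helpers, which depends on the repo and diffusers versions installed:

# Hedged sketch: standard memory savers; availability of each helper depends
# on the installed diffusers version and the pipeline implementation.
pipe.enable_vae_slicing()                          # decode latents in smaller chunks
pipe.enable_xformers_memory_efficient_attention()  # only if xformers is installed
# Generating fewer frames per call (e.g. --video_length 15) and stitching the
# clips afterwards is another way to stay within 24 GB.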

RuntimeError: The size of tensor a (24) must match the size of tensor b (12) at non-singleton dimension 2

When we run python inference.py --condition "depth" --video_length 24 --smoother_steps 19 20 --width 512 --height 512, we get this error:
Traceback (most recent call last): File "/root/paddlejob/workspace/lxz/ControlVideo/inference.py", line 129, in <module> sample = pipe(args.prompt + POS_PROMPT, video_length=args.video_length, frames=pil_annotation, File "/root/paddlejob/workspace/lxz/miniconda3/envs/controlvideo_py310/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context return func(*args, **kwargs) File "/root/paddlejob/workspace/lxz/ControlVideo/models/pipeline_controlvideo.py", line 902, in __call__ down_block_res_samples, mid_block_res_sample = self.controlnet( File "/root/paddlejob/workspace/lxz/miniconda3/envs/controlvideo_py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl return forward_call(*input, **kwargs) File "/root/paddlejob/workspace/lxz/ControlVideo/models/controlnet.py", line 519, in forward sample += controlnet_cond RuntimeError: The size of tensor a (24) must match the size of tensor b (12) at non-singleton dimension 2
It seems the depth detector produced a conditioning tensor with 12 frames (tensor b), while video_length is 24 (tensor a). How can we solve this problem?
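A hedged first check, assuming (as the traceback suggests) that pil_annotation is the list of control images passed to the pipeline in inference.py: the number of extracted depth maps must equal video_length, otherwise the broadcast in sample += controlnet_cond fails with exactly this mismatch.

# Hedged sanity check before calling the pipeline; `pil_annotation` and
# `args.video_length` are the names visible in the traceback above.
assert len(pil_annotation) == args.video_length, (
    f"{len(pil_annotation)} control frames vs video_length={args.video_length}; "
    "the annotator may be skipping frames or the input video may be too short"
)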

Is the same frame noise important?

Thanks for your great work!
I noticed that the prepare_latents function in pipeline_controlvideo.py uses the same noise for every frame:
[screenshot of prepare_latents]

But in Vid2vid-zero and Tune-A-Video the frame noise is generated individually (the noise differs across frames):
[screenshot for comparison]

So I would like to know whether using the same noise for every frame is important for your great results?
Thank you!
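For readers of this thread, here is a minimal sketch of the two sampling schemes being compared; the shapes are hypothetical and only illustrate the difference, not the repo's exact code:

import torch

# Hedged sketch with toy latent shapes (batch, channels, frames, height, width).
b, c, f, h, w = 1, 4, 8, 64, 64

# Shared noise: sample once and repeat along the frame axis,
# as described for prepare_latents in pipeline_controlvideo.py.
shared_noise = torch.randn(b, c, 1, h, w).repeat(1, 1, f, 1, 1)

# Per-frame noise: sample independently for every frame,
# as in Tune-A-Video / Vid2vid-zero.
per_frame_noise = torch.randn(b, c, f, h, w)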

cfa-vram

Hi, thanks for your excellent work. I'm confused about one point: in my understanding, the VRAM occupied by a Text2Video-Zero-style model increases with the number of frames referenced by cross-frame attention. Why can you use full cross-frame attention and still run on a 2080 Ti with only 11 GB?
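To make the scaling concrete, a toy sketch of full cross-frame attention shapes (hypothetical sizes, not the repo's modules); it only shows why the attention map grows with the number of frames attended to, which is the cost that memory-efficient attention implementations avoid materializing:

import torch

# Hedged toy example: each frame's queries attend to keys gathered from all
# frames, so the attention map has shape (frames, tokens, frames * tokens).
frames, tokens, dim = 8, 32 * 32, 320
q = torch.randn(frames, tokens, dim)
k = torch.randn(frames, tokens, dim).reshape(1, frames * tokens, dim).expand(frames, -1, -1)
attn = torch.softmax(q @ k.transpose(1, 2) / dim ** 0.5, dim=-1)
print(attn.shape)  # torch.Size([8, 1024, 8192])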

Single Character

Hi,
Is it possible to generate a single character from the pose for more than 5 seconds?

I have a pose video (OpenPose + hands + face) and I was wondering whether it is possible to generate an output video about 5 seconds long with a consistent character/avatar that dances, etc., following the controlled (pose) input?

Thanks
Best regards

First frame conditioning possible?

Great work and paper!
Is it possible with the current model to initialize the video with some initial first image and generate the following frames based on it? If not, what modifications would be needed to achieve that?

Excellent work

Excellent work! When will you release the code, please? We would like to follow your work!

ValueError: mutable default <class 'timm.models.maxxvit.MaxxVitConvCfg'> for field

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ubuntu/.local/lib/python3.11/site-packages/timm/__init__.py", line 2, in <module>
    from .models import create_model, list_models, is_model, list_modules, model_entrypoint,
  File "/home/ubuntu/.local/lib/python3.11/site-packages/timm/models/__init__.py", line 28, in <module>
    from .maxxvit import *
  File "/home/ubuntu/.local/lib/python3.11/site-packages/timm/models/maxxvit.py", line 216, in <module>
    @dataclass
     ^^^^^^^^^
  File "/usr/lib/python3.11/dataclasses.py", line 1221, in dataclass
    return wrap(cls)
           ^^^^^^^^^
  File "/usr/lib/python3.11/dataclasses.py", line 1211, in wrap
    return _process_class(cls, init, repr, eq, order, unsafe_hash,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/dataclasses.py", line 959, in _process_class
    cls_fields.append(_get_field(cls, name, type, kw_only))
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/dataclasses.py", line 816, in _get_field
    raise ValueError(f'mutable default {type(f.default)} for field '
ValueError: mutable default <class 'timm.models.maxxvit.MaxxVitConvCfg'> for field
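For reference: this traceback matches the known incompatibility between older timm releases and the stricter mutable-default check in Python 3.11's dataclasses; newer timm releases declare these config defaults differently. That diagnosis is an assumption based on the traceback, not something confirmed in this thread. A minimal way to check the environment without importing the failing package:

import sys
from importlib.metadata import version

# The failure happens inside `import timm` itself, so read the installed
# version from package metadata rather than importing it.
print("python", sys.version.split()[0], "timm", version("timm"))
# If this shows Python 3.11 with an old timm, upgrading timm or using a
# Python 3.10 environment (as other tracebacks in this tracker use) is the
# usual way out.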

Combination with ip adpater

Great project. I used your method combined with IP-Adapter to generate the action/pose of a specific character, but the final result has almost nothing to do with the character identity provided by IP-Adapter. I don't know why this happens and would like to explore it.

controlnet-aux==0.0.6 ValueError: depth is not a valid processor id

Thanks for your great work. Inference works fine when condition == 'canny'.

But when I run inference with controlnet-aux==0.0.6 and condition == "depth", I get this error:
ValueError: depth is not a valid processor id. Please make sure to choose one of scribble_hed, softedge_hed, scribble_hedsafe, softedge_hedsafe, depth_midas, mlsd, openpose, openpose_face, openpose_faceonly, openpose_full, openpose_hand, scribble_pidinet, softedge_pidinet, scribble_pidsafe, softedge_pidsafe, normal_bae, lineart_coarse, lineart_realistic, lineart_anime, depth_zoe, depth_leres, depth_leres++, shuffle, mediapipe_face, canny.

Do you have some suggestions?

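A hedged workaround sketch, assuming the error comes from controlnet_aux's Processor class rejecting the bare "depth" id in 0.0.6: map "depth" to one of the ids the error message lists (for example depth_midas) before constructing the processor.

from controlnet_aux.processor import Processor

# Hedged sketch: translate the repo's "depth" condition name into a processor
# id accepted by controlnet-aux 0.0.6. `condition` mirrors the CLI argument.
condition = "depth"
processor_id = "depth_midas" if condition == "depth" else condition
processor = Processor(processor_id)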

ValueError: Calling CLIPTokenizer.from_pretrained() with the path to a single file or url is not supported for this tokenizer. Use a model identifier or the path to a directory instead.

Hello! I am running the Inference: inference.sh
which results in the following error:

Traceback (most recent call last):
  File "inference.py", line 96, in <module>
    tokenizer = CLIPTokenizer.from_pretrained(sd_path, subfolder="tokenizer")
  File "/root/miniconda3/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1702, in from_pretrained
    raise ValueError(
ValueError: Calling CLIPTokenizer.from_pretrained() with the path to a single file or url is not supported for this tokenizer. Use a model identifier or the path to a directory instead.
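This error usually means sd_path points at a single file (or a URL) rather than at the cloned model directory. A hedged sketch of the expected call, where the local path is an assumption based on the checkpoint layout discussed above:

from transformers import CLIPTokenizer

# Hedged sketch: sd_path must be the Stable Diffusion 1.5 *directory*; the
# tokenizer is then loaded from its "tokenizer" subfolder.
sd_path = "checkpoints/stable-diffusion-v1-5"  # assumed local clone location
tokenizer = CLIPTokenizer.from_pretrained(sd_path, subfolder="tokenizer")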

About the evaluation data

After reading the paper, I am very curious about the evaluation data. I understand the videos were selected by your team and their descriptions annotated by you. Could you share the evaluation data with me? Looking forward to your reply, thanks!

pretrained file

Hello, which pre-trained model does this line of code need, and which folder should I put it in after downloading? I have downloaded all the weights listed in your README, but this line still reports an error because the server has no internet access:
processor = processor.from_pretrained("lllyasviel/Annotators")
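A hedged workaround sketch: mirror the annotator weights on a machine that does have internet access, copy the folder to the offline server, and point from_pretrained at that local path. The destination path below is a placeholder, not a path the repo prescribes.

from huggingface_hub import snapshot_download

# On a machine with internet access: download the annotator weights once.
local_dir = snapshot_download("lllyasviel/Annotators")
print(local_dir)  # copy this folder to the offline server

# On the offline server (placeholder path):
# processor = processor.from_pretrained("/data/checkpoints/Annotators")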

problem with triton

(venv) E:\ControlVideo>python inference.py
A matching Triton is not available, some optimizations will not be enabled.
Error caught was: No module named 'triton.language'
Traceback (most recent call last):
  File "E:\ControlVideo\inference.py", line 20, in <module>
    from models.pipeline_controlvideo import ControlVideoPipeline
  File "E:\ControlVideo\models\pipeline_controlvideo.py", line 28, in <module>
    from .controlnet import ControlNetOutput
  File "E:\ControlVideo\models\controlnet.py", line 27, in <module>
    from .controlnet_unet_blocks import (
  File "E:\ControlVideo\models\controlnet_unet_blocks.py", line 6, in <module>
    from .controlnet_attention import Transformer3DModel
  File "E:\ControlVideo\models\controlnet_attention.py", line 15, in <module>
    from diffusers.models.attention import CrossAttention, FeedForward, AdaLayerNorm
ImportError: cannot import name 'CrossAttention' from 'diffusers.models.attention' (E:\ControlVideo\venv\lib\site-packages\diffusers\models\attention.py)
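The Triton warning is harmless; the actual failure is the CrossAttention import, which was renamed and moved in newer diffusers releases. Pinning diffusers to the version listed in the repo's requirements is the straightforward fix. Alternatively, a hedged compatibility sketch (an assumption about where the class lives in newer diffusers, not the repo's own code):

# Hedged sketch: guard the import so controlnet_attention.py loads with both
# older diffusers (CrossAttention) and newer releases (renamed to Attention).
try:
    from diffusers.models.attention import CrossAttention
except ImportError:
    from diffusers.models.attention_processor import Attention as CrossAttention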

Evaluation Question: 125 Video-prompt Pairs

Dear authors,

First of all, very nice work and impressive results, congratulations! In the paper, you mention that you have 125 video-prompt pairs in total that you use for quantitative evaluations, could you please specify the exact prompts and which sequences from the DAVIS dataset you used so that I can replicate the results and evaluate in the exact same setting as you?

Thank you!

smoother_step

Hi, I'm opening a new issue because I'm also confused about the timesteps of the interleaved-frame smoother. The paper says the interleaved-frame smoother is performed on predicted RGB frames at timesteps {30, 31} by default, but your code uses 19, 20 as the default. How is this parameter determined, and why not apply the smoother at all DDIM steps? Have you done an ablation for this part of the experiment? I don't see one in the paper.
