ybybzhang / controlvideo
[ICLR 2024] Official PyTorch implementation of "ControlVideo: Training-free Controllable Text-to-Video Generation"
License: MIT License
(venv) E:\ControlVideo>python inference.py
A matching Triton is not available, some optimizations will not be enabled.
Error caught was: No module named 'triton.language'
Traceback (most recent call last):
  File "E:\ControlVideo\inference.py", line 20, in <module>
    from models.pipeline_controlvideo import ControlVideoPipeline
  File "E:\ControlVideo\models\pipeline_controlvideo.py", line 28, in <module>
    from .controlnet import ControlNetOutput
  File "E:\ControlVideo\models\controlnet.py", line 27, in <module>
    from .controlnet_unet_blocks import (
  File "E:\ControlVideo\models\controlnet_unet_blocks.py", line 6, in <module>
    from .controlnet_attention import Transformer3DModel
  File "E:\ControlVideo\models\controlnet_attention.py", line 15, in <module>
    from diffusers.models.attention import CrossAttention, FeedForward, AdaLayerNorm
ImportError: cannot import name 'CrossAttention' from 'diffusers.models.attention' (E:\ControlVideo\venv\lib\site-packages\diffusers\models\attention.py)
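This import error usually means the installed diffusers release is newer than the API the repo was written against: CrossAttention was later moved out of diffusers.models.attention. One common workaround, assuming the repo targets the older API (the exact version pin below is a guess; prefer the version in the repo's requirements.txt if one is specified):

```shell
# Pin an older diffusers release that still exposes
# diffusers.models.attention.CrossAttention (version is an assumption)
pip install "diffusers==0.14.0"
```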
When we run python inference.py --condition "depth" --video_length 24 --smoother_steps 19 20 --width 512 --height 512, we get this error:
Traceback (most recent call last):
  File "/root/paddlejob/workspace/lxz/ControlVideo/inference.py", line 129, in <module>
    sample = pipe(args.prompt + POS_PROMPT, video_length=args.video_length, frames=pil_annotation,
  File "/root/paddlejob/workspace/lxz/miniconda3/envs/controlvideo_py310/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/root/paddlejob/workspace/lxz/ControlVideo/models/pipeline_controlvideo.py", line 902, in __call__
    down_block_res_samples, mid_block_res_sample = self.controlnet(
  File "/root/paddlejob/workspace/lxz/miniconda3/envs/controlvideo_py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/paddlejob/workspace/lxz/ControlVideo/models/controlnet.py", line 519, in forward
    sample += controlnet_cond
RuntimeError: The size of tensor a (24) must match the size of tensor b (12) at non-singleton dimension 2
It seems that the depth detector only extracts depth maps for 12 frames (tensor b), while video_length is 24 (tensor a). How can this mismatch be solved?
Hello! I am running the inference script inference.sh,
which results in the following error:
Traceback (most recent call last):
  File "inference.py", line 96, in <module>
    tokenizer = CLIPTokenizer.from_pretrained(sd_path, subfolder="tokenizer")
  File "/root/miniconda3/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1702, in from_pretrained
    raise ValueError(
ValueError: Calling CLIPTokenizer.from_pretrained() with the path to a single file or url is not supported for this tokenizer. Use a model identifier or the path to a directory instead.
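The error message itself points at the cause: from_pretrained for this tokenizer needs a model directory (or hub id), not a single checkpoint file. A minimal sketch of the distinction (the helper name and example paths are mine, not from the repo):

```python
from pathlib import Path

def looks_like_single_file(sd_path: str) -> bool:
    # A single checkpoint file (.ckpt/.safetensors/.bin) cannot be passed to
    # CLIPTokenizer.from_pretrained; a directory containing a "tokenizer"
    # subfolder (as in the converted diffusers layout) is required.
    return Path(sd_path).suffix in {".ckpt", ".safetensors", ".bin"}

print(looks_like_single_file("checkpoints/v1-5-pruned-emaonly.ckpt"))  # -> True (will fail)
print(looks_like_single_file("checkpoints/stable-diffusion-v1-5"))     # -> False (a directory)
```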
Great work and paper!
Is it possible with the current model to initialize the video with some initial first image and generate the following frames based on that image? If not, what modifications would be needed to achieve that?
Great project. I used your method combined with IP-Adapter to generate the action poses of a specific character, but the final result has almost nothing to do with the character identity provided by IP-Adapter. I don't know why this is, and I want to explore it.
Hi,
Is it possible to generate a consistent character from a pose sequence for more than 5 seconds?
I have a pose video (OpenPose + hands + face), and I was wondering whether it is possible to generate an output video of around 5 seconds with a consistent character/avatar that dances, etc., following the controlled (pose) input?
Thanks
Best regards
Correct me if I'm wrong, but the fully cross-frame attention seems similar to https://github.com/baaivision/vid2vid-zero.
Thank you for the interesting work! Does the input have to include a source video?
Thanks for your great work. I can run inference when condition == 'canny'.
But when I run inference with controlnet-aux==0.0.6 and condition == "depth", this error is reported:
ValueError: depth is not a valid processor id. Please make sure to choose one of scribble_hed, softedge_hed, scribble_hedsafe, softedge_hedsafe, depth_midas, mlsd, openpose, openpose_face, openpose_faceonly, openpose_full, openpose_hand, scribble_pidinet, softedge_pidinet, scribble_pidsafe, softedge_pidsafe, normal_bae, lineart_coarse, lineart_realistic, lineart_anime, depth_zoe, depth_leres, depth_leres++, shuffle, mediapipe_face, canny.
Do you have some suggestions?
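The error message lists the ids that controlnet-aux 0.0.6 actually accepts; the depth detector there is spelled "depth_midas" rather than "depth". A hedged sketch of mapping the script's shorthand to valid processor ids (the dictionary and function are my own, not from the repo; the id strings come straight from the error message):

```python
# Shorthand condition -> controlnet-aux 0.0.6 processor id
# (only these three entries are assumed; extend as needed)
CONDITION_TO_PROCESSOR = {
    "depth": "depth_midas",
    "canny": "canny",
    "pose": "openpose_full",
}

def resolve_processor_id(condition: str) -> str:
    try:
        return CONDITION_TO_PROCESSOR[condition]
    except KeyError:
        raise ValueError(f"{condition} is not a supported condition")

print(resolve_processor_id("depth"))  # -> depth_midas
```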
I ran git clone https://huggingface.co/runwayml/stable-diffusion-v1-5,
and the directory looks like this:
ls
flownet.pkl sd-controlnet-canny sd-controlnet-depth sd-controlnet-openpose stable-diffusion-v1-5
but when I run inference.py, there is an error:
no file named pytorch_model.bin, tf_model.h5, model.ckpt.index or flax_model.msgpack found in directory checkpoints/stable-diffusion-v1-5
I checked the files in my directory; they are the same as the files on https://huggingface.co/runwayml/stable-diffusion-v1-5/tree/main.
What am I doing wrong?
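A common cause of this particular error is cloning a Hugging Face repo without Git LFS: the file names match the web listing, but each weight file is only a ~130-byte pointer stub. One way to check and fix, assuming git-lfs is installed on the system (paths follow the directory layout shown above):

```shell
# Pointer stubs are ~130 bytes; real weights are hundreds of MB to GBs.
ls -la checkpoints/stable-diffusion-v1-5/unet

# Fetch the actual LFS objects into the existing clone:
cd checkpoints/stable-diffusion-v1-5
git lfs install
git lfs pull
```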
After reading the paper, I am very curious about the evaluation data. I understand it was selected by your team, with the descriptions annotated by you. Could you share the evaluation data with me? Looking forward to your reply. Thanks!
Thanks for your great work!
I notice the prepare_latents function in pipeline_controlvideo.py uses the same noise for every frame.
But the frame noise in vid2vid-zero and Tune-A-Video is generated independently (the noise differs among frames).
So I want to know whether the shared frame noise is important for your great results?
Thank you!
Thank you for your remarkable work! I wonder if this model is able to generate video from pure text input alone?
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ubuntu/.local/lib/python3.11/site-packages/timm/__init__.py", line 2, in <module>
    from .models import create_model, list_models, is_model, list_modules, model_entrypoint,
  File "/home/ubuntu/.local/lib/python3.11/site-packages/timm/models/__init__.py", line 28, in <module>
    from .maxxvit import *
  File "/home/ubuntu/.local/lib/python3.11/site-packages/timm/models/maxxvit.py", line 216, in <module>
    @dataclass
     ^^^^^^^^^
  File "/usr/lib/python3.11/dataclasses.py", line 1221, in dataclass
    return wrap(cls)
           ^^^^^^^^^
  File "/usr/lib/python3.11/dataclasses.py", line 1211, in wrap
    return _process_class(cls, init, repr, eq, order, unsafe_hash,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/dataclasses.py", line 959, in _process_class
    cls_fields.append(_get_field(cls, name, type, kw_only))
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/dataclasses.py", line 816, in _get_field
    raise ValueError(f'mutable default {type(f.default)} for field '
ValueError: mutable default <class 'timm.models.maxxvit.MaxxVitConvCfg'> for field ...
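This is an incompatibility between an older timm release and Python 3.11's stricter dataclass checks; upgrading timm typically resolves it. For context, a minimal stdlib reproduction of this class of error (unrelated to timm itself):

```python
from dataclasses import dataclass, field

# Dataclasses reject mutable defaults, because the default would be
# shared across every instance of the class.
try:
    @dataclass
    class Bad:
        items: list = []          # mutable default -> ValueError at class creation
except ValueError as err:
    print("rejected:", err)

@dataclass
class Good:
    items: list = field(default_factory=list)  # the accepted pattern

print(Good().items)  # -> []
```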
ERROR: Could not find a version that satisfies the requirement clip==1.0 (from versions: none)
ERROR: No matching distribution found for clip==1.0
I got a few errors while running in Colab or Kaggle; are there any specific commands to run there?
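The clip==1.0 failure happens because OpenAI's CLIP is not published on PyPI under the name clip. Assuming the requirement refers to OpenAI CLIP, the usual workaround (which also works in Colab/Kaggle notebooks) is to install it from GitHub:

```shell
pip install git+https://github.com/openai/CLIP.git
```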
Hi, thanks for your excellent work. I'm confused about one point: in my understanding, the VRAM that a text2video-zero-style model occupies increases with the number of frames referenced by cross-frame attention. Why can you use fully cross-frame attention but still fit on a 2080 Ti with only 11 GB?
The prompt "An astronaut dancing in the outer space" in the README seems wrong.
Dear authors,
First of all, very nice work and impressive results, congratulations! In the paper you mention that you use 125 video-prompt pairs in total for quantitative evaluation. Could you please specify the exact prompts and which sequences from the DAVIS dataset you used, so that I can replicate the results and evaluate in exactly the same setting as you?
Thank you!
Hello, which pre-trained model does this line of code need, and which specific folder should I store it in after downloading it? I have downloaded all the weights listed in your README, but this line still reports an error because the server has no internet access.
processor = processor.from_pretrained("lllyasviel/Annotators")
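One possible approach for an offline server, assuming huggingface_hub's CLI is available on a machine that does have internet access (the local directory name below is my own choice): download the repo there, copy it over, then pass the local path to from_pretrained instead of the hub id.

```shell
# On a machine with internet access:
huggingface-cli download lllyasviel/Annotators --local-dir checkpoints/Annotators
# Copy checkpoints/Annotators to the server, then in the code use:
#   processor = processor.from_pretrained("checkpoints/Annotators")
```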
Hi, thanks for your code so much!
I met a problem when loading the checkpoint of Stable Diffusion v1.5, and I hope you can help me with it.
I downloaded the ckpt from the URL in README.md (https://huggingface.co/runwayml/stable-diffusion-v1-5), named "v1-5-pruned-emaonly.ckpt", but it seems to contain no weights for the "tokenizer", "text encoder", etc. I was wondering which ckpt file specifically is the correct one?
Thanks for your kind help!
Hi, I'm restarting this as a new issue because I'm also confused about the timesteps of the interleaved-frame smoother. The paper says that the "interleaved-frame smoother is performed on predicted RGB frames at timesteps {30, 31} by default", but your code uses 19, 20 as the default. How is this parameter determined, and why not apply the smoother at all DDIM steps? Have you done an ablation for this part of the experiment? I don't seem to see it in the paper.
Excellent work! When will you release the code, please? We would like to follow your work!