
controlvideo's People

Contributors

chenxwh, hackeranonymousdeepweb, mayuelala, ybybzhang


controlvideo's Issues

no file named pytorch_model.bin, tf_model.h5, model.ckpt.index or flax_model.msgpack found in directory checkpoints/stable-diffusion-v1-5

I cloned https://huggingface.co/runwayml/stable-diffusion-v1-5 with git clone, and the checkpoints directory looks like this:
ls
flownet.pkl  sd-controlnet-canny  sd-controlnet-depth  sd-controlnet-openpose  stable-diffusion-v1-5
But when I run inference.py, I get this error:
no file named pytorch_model.bin, tf_model.h5, model.ckpt.index or flax_model.msgpack found in directory checkpoints/stable-diffusion-v1-5
I checked the files in my directory and they match what is shown at https://huggingface.co/runwayml/stable-diffusion-v1-5/tree/main.
What am I doing wrong?
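One thing worth checking (an assumption based on the error message, not a confirmed diagnosis): a plain git clone only fetches the full weight files if git-lfs is installed; otherwise the directory contains small pointer files that the diffusers loader cannot use. A minimal sketch that downloads the complete snapshot with huggingface_hub instead, where the local_dir value is just the path the error message mentions:

from huggingface_hub import snapshot_download

# Hedged sketch: re-download the full Stable Diffusion 1.5 snapshot into the
# checkpoints folder referenced in the error above.
snapshot_download(
    repo_id="runwayml/stable-diffusion-v1-5",
    local_dir="checkpoints/stable-diffusion-v1-5",
)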

Pure text generation

Thank you for your remarkable work! I wonder whether this model can generate video from pure text input alone?

CUDA out of memory

Very good job!
When I run inference with --video_length 49 --frame_rate 2, I get a "CUDA out of memory" error. I am running on a 3090 GPU with 24 GB of VRAM. Is there any way to solve this issue? Can I use multi-GPU inference?

Looking forward to your reply.
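A few generic memory-saving knobs may help here. This is a hedged sketch that assumes pipe is the ControlVideoPipeline built in inference.py and that it exposes the usual diffusers-style helpers, which depends on the repo and diffusers versions installed:

# Hedged sketch: standard memory savers; availability of each helper depends
# on the installed diffusers version and the pipeline implementation.
pipe.enable_vae_slicing()                          # decode latents in smaller chunks
pipe.enable_xformers_memory_efficient_attention()  # only if xformers is installed
# Generating fewer frames per call (e.g. --video_length 15) and stitching the
# clips afterwards is another way to stay within 24 GB.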

RuntimeError: The size of tensor a (24) must match the size of tensor b (12) at non-singleton dimension 2

When we run python inference.py --condition "depth" --video_length 24 --smoother_steps 19 20 --width 512 --height 512, we get this error:
Traceback (most recent call last): File "/root/paddlejob/workspace/lxz/ControlVideo/inference.py", line 129, in <module> sample = pipe(args.prompt + POS_PROMPT, video_length=args.video_length, frames=pil_annotation, File "/root/paddlejob/workspace/lxz/miniconda3/envs/controlvideo_py310/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context return func(*args, **kwargs) File "/root/paddlejob/workspace/lxz/ControlVideo/models/pipeline_controlvideo.py", line 902, in __call__ down_block_res_samples, mid_block_res_sample = self.controlnet( File "/root/paddlejob/workspace/lxz/miniconda3/envs/controlvideo_py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl return forward_call(*input, **kwargs) File "/root/paddlejob/workspace/lxz/ControlVideo/models/controlnet.py", line 519, in forward sample += controlnet_cond RuntimeError: The size of tensor a (24) must match the size of tensor b (12) at non-singleton dimension 2
It seems the depth detector produced a conditioning tensor with 12 frames (tensor b), while video_length is 24 (tensor a). How can we solve this problem?
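A hedged first check, assuming (as the traceback suggests) that pil_annotation is the list of control images passed to the pipeline in inference.py: the number of extracted depth maps must equal video_length, otherwise the broadcast in sample += controlnet_cond fails with exactly this mismatch.

# Hedged sanity check before calling the pipeline; `pil_annotation` and
# `args.video_length` are the names visible in the traceback above.
assert len(pil_annotation) == args.video_length, (
    f"{len(pil_annotation)} control frames vs video_length={args.video_length}; "
    "the annotator may be skipping frames or the input video may be too short"
)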

Is the same frame noise important?

Thanks for your great work!
I noticed that the prepare_latents function in pipeline_controlvideo.py uses the same noise for every frame:
[screenshot of prepare_latents]

But in Vid2vid-zero and Tune-A-Video the frame noise is generated individually (the noise differs across frames):
[screenshot for comparison]

So I would like to know whether using the same noise for every frame is important for your great results?
Thank you!
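For readers of this thread, here is a minimal sketch of the two sampling schemes being compared; the shapes are hypothetical and only illustrate the difference, not the repo's exact code:

import torch

# Hedged sketch with toy latent shapes (batch, channels, frames, height, width).
b, c, f, h, w = 1, 4, 8, 64, 64

# Shared noise: sample once and repeat along the frame axis,
# as described for prepare_latents in pipeline_controlvideo.py.
shared_noise = torch.randn(b, c, 1, h, w).repeat(1, 1, f, 1, 1)

# Per-frame noise: sample independently for every frame,
# as in Tune-A-Video / Vid2vid-zero.
per_frame_noise = torch.randn(b, c, f, h, w)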

cfa-vram

Hi, thanks for your excellent work. I'm confused about one point: in my understanding, the VRAM occupied by a Text2Video-Zero-style model increases with the number of frames referenced by cross-frame attention. Why can you use full cross-frame attention and still run on a 2080 Ti with only 11 GB?
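To make the scaling concrete, a toy sketch of full cross-frame attention shapes (hypothetical sizes, not the repo's modules); it only shows why the attention map grows with the number of frames attended to, which is the cost that memory-efficient attention implementations avoid materializing:

import torch

# Hedged toy example: each frame's queries attend to keys gathered from all
# frames, so the attention map has shape (frames, tokens, frames * tokens).
frames, tokens, dim = 8, 32 * 32, 320
q = torch.randn(frames, tokens, dim)
k = torch.randn(frames, tokens, dim).reshape(1, frames * tokens, dim).expand(frames, -1, -1)
attn = torch.softmax(q @ k.transpose(1, 2) / dim ** 0.5, dim=-1)
print(attn.shape)  # torch.Size([8, 1024, 8192])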

Single Character

Hi,
Is it possible to generate a single character from the pose for more than 5 seconds?

I have a pose video (OpenPose + hands + face) and I was wondering whether it is possible to generate an output video about 5 seconds long with a consistent character/avatar that dances, etc., following the controlled (pose) input?

Thanks
Best regards

First frame conditioning possible?

Great work and paper!
Is it possible with the current model to initialize the video with some initial first image and generate the following frames based on it? If not, what modifications would be needed to achieve that?

Excellent work

Excellent work! When will you release the code, please? We would like to follow your work!

ValueError: mutable default <class 'timm.models.maxxvit.MaxxVitConvCfg'> for field

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ubuntu/.local/lib/python3.11/site-packages/timm/__init__.py", line 2, in <module>
    from .models import create_model, list_models, is_model, list_modules, model_entrypoint,
  File "/home/ubuntu/.local/lib/python3.11/site-packages/timm/models/__init__.py", line 28, in <module>
    from .maxxvit import *
  File "/home/ubuntu/.local/lib/python3.11/site-packages/timm/models/maxxvit.py", line 216, in <module>
    @dataclass
     ^^^^^^^^^
  File "/usr/lib/python3.11/dataclasses.py", line 1221, in dataclass
    return wrap(cls)
           ^^^^^^^^^
  File "/usr/lib/python3.11/dataclasses.py", line 1211, in wrap
    return _process_class(cls, init, repr, eq, order, unsafe_hash,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/dataclasses.py", line 959, in _process_class
    cls_fields.append(_get_field(cls, name, type, kw_only))
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/dataclasses.py", line 816, in _get_field
    raise ValueError(f'mutable default {type(f.default)} for field '
ValueError: mutable default <class 'timm.models.maxxvit.MaxxVitConvCfg'> for field
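For reference: this traceback matches the known incompatibility between older timm releases and the stricter mutable-default check in Python 3.11's dataclasses; newer timm releases declare these config defaults differently. That diagnosis is an assumption based on the traceback, not something confirmed in this thread. A minimal way to check the environment without importing the failing package:

import sys
from importlib.metadata import version

# The failure happens inside `import timm` itself, so read the installed
# version from package metadata rather than importing it.
print("python", sys.version.split()[0], "timm", version("timm"))
# If this shows Python 3.11 with an old timm, upgrading timm or using a
# Python 3.10 environment (as other tracebacks in this tracker use) is the
# usual way out.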

Combination with ip adpater

Great project. I used your method combined with IP-Adapter to generate the action/pose of a specific character, but the final result has almost nothing to do with the character identity provided by IP-Adapter. I don't know why this happens and would like to explore it.

controlnet-aux==0.0.6 ValueError: depth is not a valid processor id

Thanks for your great work. Inference works fine when condition == 'canny'.

But when I run inference with controlnet-aux==0.0.6 and condition == "depth", I get this error:
ValueError: depth is not a valid processor id. Please make sure to choose one of scribble_hed, softedge_hed, scribble_hedsafe, softedge_hedsafe, depth_midas, mlsd, openpose, openpose_face, openpose_faceonly, openpose_full, openpose_hand, scribble_pidinet, softedge_pidinet, scribble_pidsafe, softedge_pidsafe, normal_bae, lineart_coarse, lineart_realistic, lineart_anime, depth_zoe, depth_leres, depth_leres++, shuffle, mediapipe_face, canny.

Do you have some suggestions?

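A hedged workaround sketch, assuming the error comes from controlnet_aux's Processor class rejecting the bare "depth" id in 0.0.6: map "depth" to one of the ids the error message lists (for example depth_midas) before constructing the processor.

from controlnet_aux.processor import Processor

# Hedged sketch: translate the repo's "depth" condition name into a processor
# id accepted by controlnet-aux 0.0.6. `condition` mirrors the CLI argument.
condition = "depth"
processor_id = "depth_midas" if condition == "depth" else condition
processor = Processor(processor_id)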

ValueError: Calling CLIPTokenizer.from_pretrained() with the path to a single file or url is not supported for this tokenizer. Use a model identifier or the path to a directory instead.

Hello! I am running the Inference: inference.sh
which results in the following error:

Traceback (most recent call last):
  File "inference.py", line 96, in <module>
    tokenizer = CLIPTokenizer.from_pretrained(sd_path, subfolder="tokenizer")
  File "/root/miniconda3/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1702, in from_pretrained
    raise ValueError(
ValueError: Calling CLIPTokenizer.from_pretrained() with the path to a single file or url is not supported for this tokenizer. Use a model identifier or the path to a directory instead.
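This error usually means sd_path points at a single file (or a URL) rather than at the cloned model directory. A hedged sketch of the expected call, where the local path is an assumption based on the checkpoint layout discussed above:

from transformers import CLIPTokenizer

# Hedged sketch: sd_path must be the Stable Diffusion 1.5 *directory*; the
# tokenizer is then loaded from its "tokenizer" subfolder.
sd_path = "checkpoints/stable-diffusion-v1-5"  # assumed local clone location
tokenizer = CLIPTokenizer.from_pretrained(sd_path, subfolder="tokenizer")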

About the evaluation data

After reading the paper, I am very curious about the evaluation data. I understand the videos were selected by your team and their descriptions annotated by you. Could you share the evaluation data with me? Looking forward to your reply, thanks!

pretrained file

Hello, which pre-trained model does this line of code need, and which folder should I put it in after downloading? I have downloaded all the weights listed in your README, but this line still reports an error because the server has no internet access:
processor = processor.from_pretrained("lllyasviel/Annotators")
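A hedged workaround sketch: mirror the annotator weights on a machine that does have internet access, copy the folder to the offline server, and point from_pretrained at that local path. The destination path below is a placeholder, not a path the repo prescribes.

from huggingface_hub import snapshot_download

# On a machine with internet access: download the annotator weights once.
local_dir = snapshot_download("lllyasviel/Annotators")
print(local_dir)  # copy this folder to the offline server

# On the offline server (placeholder path):
# processor = processor.from_pretrained("/data/checkpoints/Annotators")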

problem with triton

(venv) E:\ControlVideo>python inference.py
A matching Triton is not available, some optimizations will not be enabled.
Error caught was: No module named 'triton.language'
Traceback (most recent call last):
  File "E:\ControlVideo\inference.py", line 20, in <module>
    from models.pipeline_controlvideo import ControlVideoPipeline
  File "E:\ControlVideo\models\pipeline_controlvideo.py", line 28, in <module>
    from .controlnet import ControlNetOutput
  File "E:\ControlVideo\models\controlnet.py", line 27, in <module>
    from .controlnet_unet_blocks import (
  File "E:\ControlVideo\models\controlnet_unet_blocks.py", line 6, in <module>
    from .controlnet_attention import Transformer3DModel
  File "E:\ControlVideo\models\controlnet_attention.py", line 15, in <module>
    from diffusers.models.attention import CrossAttention, FeedForward, AdaLayerNorm
ImportError: cannot import name 'CrossAttention' from 'diffusers.models.attention' (E:\ControlVideo\venv\lib\site-packages\diffusers\models\attention.py)
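The Triton warning is harmless; the actual failure is the CrossAttention import, which was renamed and moved in newer diffusers releases. Pinning diffusers to the version listed in the repo's requirements is the straightforward fix. Alternatively, a hedged compatibility sketch (an assumption about where the class lives in newer diffusers, not the repo's own code):

# Hedged sketch: guard the import so controlnet_attention.py loads with both
# older diffusers (CrossAttention) and newer releases (renamed to Attention).
try:
    from diffusers.models.attention import CrossAttention
except ImportError:
    from diffusers.models.attention_processor import Attention as CrossAttention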

Evaluation Question: 125 Video-prompt Pairs

Dear authors,

First of all, very nice work and impressive results, congratulations! In the paper, you mention that you have 125 video-prompt pairs in total that you use for quantitative evaluations, could you please specify the exact prompts and which sequences from the DAVIS dataset you used so that I can replicate the results and evaluate in the exact same setting as you?

Thank you!

smoother_step

Hi, I'm opening a new issue because I'm also confused about the timesteps of the interleaved-frame smoother. The paper says the interleaved-frame smoother is performed on predicted RGB frames at timesteps {30, 31} by default, but your code uses 19, 20 as the default. How is this parameter determined, and why not apply the smoother at all DDIM steps? Have you done an ablation for this part of the experiment? I don't see one in the paper.
