
musepose's Introduction

MusePose

MusePose: a Pose-Driven Image-to-Video Framework for Virtual Human Generation.

Zhengyan Tong, Chao Li, Zhaokang Chen, Bin Wu, Wenjiang Zhou (Corresponding Author, [email protected])

Lyra Lab, Tencent Music Entertainment

GitHub | Hugging Face Space (coming soon) | Project (coming soon) | Technical report (coming soon)

MusePose is an image-to-video generation framework for virtual humans driven by control signals such as pose. The currently released model is an implementation of AnimateAnyone, built by optimizing Moore-AnimateAnyone.

MusePose is the last building block of the Muse open-source series. Together with MuseV and MuseTalk, we hope the community will join us and march towards the vision where a virtual human can be generated end-to-end, with native full-body movement and interaction abilities. Please stay tuned for our next milestone!

We really appreciate AnimateAnyone for their academic paper and Moore-AnimateAnyone for their code base, which have significantly expedited the development of the AIGC community and MusePose.

Update:

  1. We now support ComfyUI-MusePose!

Recruitment

Join Lyra Lab, Tencent Music Entertainment!

We are currently seeking AIGC researchers, including interns, new graduates, and senior hires.

Please find details in the following two links or contact [email protected]

Overview

MusePose is a diffusion-based and pose-guided virtual human video generation framework.
Our main contributions could be summarized as follows:

  1. The released model can generate dance videos of the human character in a reference image under a given pose sequence. The result quality exceeds that of almost all current open-source models on the same task.
  2. We release the pose align algorithm so that users can align arbitrary dance videos to arbitrary reference images, which significantly improves inference performance and enhances model usability.
  3. We have fixed several important bugs and made some improvements based on the code of Moore-AnimateAnyone.

Demos

demo.0.mp4
demo.1.mp4
demo.2.mp4
demo.3.mp4
demo.4.mp4
demo.5.mp4
demo.6.mp4
demo.7.mp4

News

  • [05/27/2024] Released MusePose and pretrained models.
  • [05/31/2024] Supported ComfyUI-MusePose.
  • [06/14/2024] Fixed a bug in inference_v2.yaml.

Todo:

  • Release our trained models and inference code of MusePose.
  • Release the pose align algorithm.
  • ComfyUI-MusePose.
  • Training guidelines.
  • Hugging Face Gradio demo.
  • An improved architecture and model (may take longer).

Getting Started

We provide a detailed tutorial about the installation and the basic usage of MusePose for new users:

Installation

To prepare the Python environment and install additional packages such as opencv, diffusers, mmcv, etc., please follow the steps below:

Build environment

We recommend Python >= 3.10 and CUDA 11.7. Then build the environment as follows:

pip install -r requirements.txt

mmlab packages

pip install --no-cache-dir -U openmim 
mim install mmengine 
mim install "mmcv>=2.0.1" 
mim install "mmdet>=3.1.0" 
mim install "mmpose>=1.1.0" 

Download weights

You can download weights manually as follows:

  1. Download our trained weights.

  2. Download the weights of other components:

Finally, these weights should be organized in pretrained_weights as follows:

./pretrained_weights/
|-- MusePose
|   |-- denoising_unet.pth
|   |-- motion_module.pth
|   |-- pose_guider.pth
|   └── reference_unet.pth
|-- dwpose
|   |-- dw-ll_ucoco_384.pth
|   └── yolox_l_8x8_300e_coco.pth
|-- sd-image-variations-diffusers
|   └── unet
|       |-- config.json
|       └── diffusion_pytorch_model.bin
|-- image_encoder
|   |-- config.json
|   └── pytorch_model.bin
└── sd-vae-ft-mse
    |-- config.json
    └── diffusion_pytorch_model.bin
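
If you prefer scripting the download, below is a minimal sketch using huggingface_hub. The repo ids are assumptions (the MusePose weights are assumed to be published under TMElyralab/MusePose, and the VAE under stabilityai/sd-vae-ft-mse); verify them, and organize the remaining components (dwpose, sd-image-variations-diffusers, image_encoder) into the layout above manually.

# a minimal sketch; repo ids are assumptions -- verify them before use
from huggingface_hub import snapshot_download

# MusePose checkpoints (denoising_unet, motion_module, pose_guider, reference_unet)
snapshot_download(repo_id="TMElyralab/MusePose", local_dir="./pretrained_weights/MusePose")

# VAE used at inference time
snapshot_download(
    repo_id="stabilityai/sd-vae-ft-mse",
    local_dir="./pretrained_weights/sd-vae-ft-mse",
    allow_patterns=["config.json", "diffusion_pytorch_model.bin"],
)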

Quickstart

Inference

Preparation

Prepare your reference images and dance videos in the folder ./assets, organized as in the example:

./assets/
|-- images
|   └── ref.png
└── videos
    └── dance.mp4

Pose Alignment

Get the aligned dwpose of the reference image:

python pose_align.py --imgfn_refer ./assets/images/ref.png --vidfn ./assets/videos/dance.mp4

After this, you can see the pose alignment results in ./assets/poses, where ./assets/poses/align/img_ref_video_dance.mp4 is the aligned dwpose video and ./assets/poses/align_demo/img_ref_video_dance.mp4 is for debugging.
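
If you have several dance videos, the alignment step can be scripted around the same CLI. A minimal sketch (a hypothetical wrapper, not part of the repo):

# hypothetical batch wrapper around pose_align.py; adjust paths to your setup
import subprocess
from pathlib import Path

ref_image = "./assets/images/ref.png"
for video in sorted(Path("./assets/videos").glob("*.mp4")):
    subprocess.run(
        ["python", "pose_align.py", "--imgfn_refer", ref_image, "--vidfn", str(video)],
        check=True,
    )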

Inferring MusePose

Add the path of the reference image and the aligned dwpose video to the test config file ./configs/test_stage_2.yaml as in the example:

test_cases:
  "./assets/images/ref.png":
    - "./assets/poses/align/img_ref_video_dance.mp4"

Then, simply run

python test_stage_2.py --config ./configs/test_stage_2.yaml

./configs/test_stage_2.yaml is the path to the inference configuration file.

Finally, you can find the output results in ./output/.
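
If you generate many reference/pose pairs, the test config can also be edited programmatically instead of by hand. A minimal sketch, assuming PyYAML is installed (note that rewriting the file this way drops any comments in the YAML):

# hypothetical helper to register a reference image / aligned pose pair in the config
import yaml

cfg_path = "./configs/test_stage_2.yaml"
with open(cfg_path) as f:
    cfg = yaml.safe_load(f)

cfg.setdefault("test_cases", {})["./assets/images/ref.png"] = [
    "./assets/poses/align/img_ref_video_dance.mp4"
]

with open(cfg_path, "w") as f:
    yaml.safe_dump(cfg, f, sort_keys=False)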

Reducing VRAM cost

If you want to reduce the VRAM cost, you can set a smaller width and height for inference. For example,

python test_stage_2.py --config ./configs/test_stage_2.yaml -W 512 -H 512

It will generate the video at 512 x 512 first, and then resize it back to the original size of the pose video.

Currently, inference at 512 x 512 x 48 takes about 16 GB of VRAM and inference at 768 x 768 x 48 takes about 28 GB. However, note that the inference resolution affects the final results (especially the face region).
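
If you still hit out-of-memory errors at these resolutions, the fragmentation hint that PyTorch prints (max_split_size_mb) can be tried by setting PYTORCH_CUDA_ALLOC_CONF before launching inference. A minimal sketch of a hypothetical launcher; the value is illustrative, not tuned:

# hypothetical launcher: set the allocator hint, then run inference at a reduced resolution
import os
import subprocess

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
subprocess.run(
    ["python", "test_stage_2.py", "--config", "./configs/test_stage_2.yaml", "-W", "512", "-H", "512"],
    check=True,
)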

Face Enhancement

If you want better consistency of the face region, you can use FaceFusion: its face-swap function can swap the face from the reference image into the generated video.

Training

Acknowledgement

  1. We thank AnimateAnyone for their technical report, and we referred extensively to Moore-AnimateAnyone and diffusers.
  2. We thank open-source components such as AnimateDiff, dwpose, Stable Diffusion, etc.

Thanks for open-sourcing!

Limitations

  • Detail consistency: some details of the original character are not well preserved (e.g. the face region and complex clothing).
  • Noise and flickering: we observe noise and flickering in complex backgrounds.

Citation

@article{musepose,
  title={MusePose: a Pose-Driven Image-to-Video Framework for Virtual Human Generation},
  author={Tong, Zhengyan and Li, Chao and Chen, Zhaokang and Wu, Bin and Zhou, Wenjiang},
  journal={arxiv},
  year={2024}
}

Disclaimer/License

  1. code: The code of MusePose is released under the MIT License. There are no limitations on academic or commercial usage.
  2. model: The trained models are available for non-commercial research purposes only.
  3. other open-source models: Other open-source models used must comply with their own licenses, such as ft-mse-vae, dwpose, etc.
  4. The test data are collected from the internet and are available for non-commercial research purposes only.
  5. AIGC: This project strives to impact the domain of AI-driven video generation positively. Users are granted the freedom to create videos using this tool, but they are expected to comply with local laws and utilize it responsibly. The developers do not assume any responsibility for potential misuse by users.

musepose's People

Contributors

ansonkao, czk32611, jhj0517, phighting, sartq333, tzysjtu


musepose's Issues

yolox link points to new filename

The YOLOX link points to the file name yolox_l_8x8_300e_coco_20211126_140236-d3bd2b23.pth.

It needs to be renamed to yolox_l_8x8_300e_coco.pth in order to work.

Training GPU requirement

Hi! Thanks for the amazing code!

I would like to ask about the training requirements. Currently, I am using a single A100 with 40 GB of memory. My training code follows Moore-AnimateAnyone. The problem is that no matter what video size I use (I even tried 64x64), it runs out of memory.

Could you please kindly share some information about training?

Thanks!

torch.cuda.OutOfMemoryError

CUDA out of memory. Tried to allocate 1.05 GiB (GPU 0; 23.64 GiB total capacity; 14.04 GiB already allocated; 74.44 MiB free; 14.88 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

How much VRAM is generally recommended to run this normally?

Error on stage 2

I'm getting this error on stage 2, any idea what could be going on?

PS G:\musepose> python test_stage_2.py --config ./configs/test_stage_2.yaml
Traceback (most recent call last):
File "G:\musepose\test_stage_2.py", line 21, in
from musepose.models.pose_guider import PoseGuider
ModuleNotFoundError: No module named 'musepose'

The graphics card does not have enough memory

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.11 GiB (GPU 0; 8.00 GiB total capacity; 19.84 GiB already allocated; 0 bytes free; 22.26 GiB
reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Mana
gement and PYTORCH_CUDA_ALLOC_CONF

ValueError: cross_attention_dim must be specified for CrossAttnDownBlock2D

When going through the quickstart instructions I'm getting the following error:

python test_stage_2.py --config ./configs/test_stage_2.yaml
/home/zzz/software/MusePose/venv/lib/python3.11/site-packages/diffusers/models/dual_transformer_2d.py:20: FutureWarning: `DualTransformer2DModel` is deprecated and will be removed in version 0.29. Importing `DualTransformer2DModel` from `diffusers.models.dual_transformer_2d` is deprecated and this will be removed in a future version. Please use `from diffusers.models.transformers.dual_transformer_2d import DualTransformer2DModel`, instead.
  deprecate("DualTransformer2DModel", "0.29", deprecation_message)
Width: 768
Height: 768
Length: 300
Slice: 48
Overlap: 4
Classifier free guidance: 3.5
DDIM sampling steps : 20
skip 1
Traceback (most recent call last):
  File "/home/zzz/software/MusePose/test_stage_2.py", line 237, in <module>
    main()
  File "/home/zzz/software/MusePose/test_stage_2.py", line 76, in main
    vae = AutoencoderKL.from_pretrained(
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zzz/software/MusePose/venv/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/zzz/software/MusePose/venv/lib/python3.11/site-packages/diffusers/models/modeling_utils.py", line 650, in from_pretrained
    model = cls.from_config(config, **unused_kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zzz/software/MusePose/venv/lib/python3.11/site-packages/diffusers/configuration_utils.py", line 259, in from_config
    model = cls(**init_dict)
            ^^^^^^^^^^^^^^^^
  File "/home/zzz/software/MusePose/venv/lib/python3.11/site-packages/diffusers/configuration_utils.py", line 653, in inner_init
    init(self, *args, **init_kwargs)
  File "/home/zzz/software/MusePose/venv/lib/python3.11/site-packages/diffusers/models/autoencoders/autoencoder_kl.py", line 90, in __init__
    self.encoder = Encoder(
                   ^^^^^^^^
  File "/home/zzz/software/MusePose/venv/lib/python3.11/site-packages/diffusers/models/autoencoders/vae.py", line 103, in __init__
    down_block = get_down_block(
                 ^^^^^^^^^^^^^^^
  File "/home/zzz/software/MusePose/venv/lib/python3.11/site-packages/diffusers/models/unets/unet_2d_blocks.py", line 128, in get_down_block
    raise ValueError("cross_attention_dim must be specified for CrossAttnDownBlock2D")
ValueError: cross_attention_dim must be specified for CrossAttnDownBlock2D

Why was my program killed during the last step?

The output is as follows:
Width: 768
Height: 768
Length: 300
Slice: 48
Overlap: 4
Classifier free guidance: 3.5
DDIM sampling steps : 20
skip 1
Some weights of the model checkpoint were not used when initializing UNet2DConditionModel:
['conv_norm_out.weight, conv_norm_out.bias, conv_out.weight, conv_out.bias']
/usr/local/lib/python3.10/dist-packages/torch/_utils.py:776: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
return self.fget.get(instance, owner)()
Killed

The same goes for changing w and h.

Missing body parts

Love playing around with the model so far! Awesome work!

The legs are missing in the pose. Is there a way to use the body positioning of the video instead or get all limbs from the picture?

img_ref_video_dance2.mp4

CUDA out of memory on 8 GB VRAM + 8 GB shared RAM

I installed it on Windows 11 but am getting the CUDA out of memory error. I reduced the dance video dimensions by half, to 540 x 960, but it is still the same.

Is there any way to make it work by tweaking settings like batch size, etc.?

Does not recognize more than one person

I tried using a two-person dance video, and dwpose only recognizes one person in each frame.

Is there a way to modify the alignment to recognize multiple people?

Thanks.

About arxiv paper

Dear MusePose authors,

First of all, thank you very much for sharing your incredible work. I really appreciate it.

I am interested in the technical details of MusePose. In the GitHub readme, there is citation information, but I cannot seem to find the referenced paper anywhere.

@Article{musepose,
title={MusePose: a Pose-Driven Image-to-Video Framework for Virtual Human Generation},
author={Tong, Zhengyan and Li, Chao and Chen, Zhaokang and Wu, Bin and Zhou, Wenjiang},
journal={arxiv},
year={2024}
}

Could you please share the direct link to the paper? Will you release the paper too?
Thank you very much. Looking forward to hearing from you.

How to keep fps the same as video source

First of all, this is an amazing open-source tool, so thank you, developers!

The default example video is 24fps.
But the final video is 12fps.

How can I adjust the "python test_stage_2.py --config ./configs/test_stage_2.yaml" command to make it use the same fps?

Currently, if I type "python test_stage_2.py --config ./configs/test_stage_2.yaml --fps 24", it outputs 24 fps, but at 2x the speed, which is not what I want.

test gone wrong

My results didn't turn out well :(

cat_img_cat_video_dance_3.5_20_1__.mp4

some good results

Thanks to the authors for sharing (you are my hero). I fine-tuned a version with additional data I collected myself; here are some good results:

1123-ff.mp4
MusePose2VideoPipeline-img_1-002.mp4-img_1-1.1sec-1080p-2-12345-1bdf5613-ff2.mp4

More AIGC results can be found on my video channel (温少的AIGC). Discussion and exchange are welcome!

Training code

Hello, thanks for the awesome work! I am eagerly waiting for the training code release. Please let me know your planned timeline.

Thank you,
Ish

About training dataset

Thanks for your great work! Would you please share some information about your training dataset, and some instructions on how to collect my own dataset from the web? Thank you!

VRAM usage

Hello!

Thanks again for this great project.

I'm wondering if there are ways to split VRAM usage across multiple GPUs, mostly in order to reduce compute cost (I have only been able to run it on an A100 so far).

Question about test stage 1

configs/test_stage_1.yaml specifies inference_config: "./configs/inference_v2.yaml",
and ./configs/inference_v2.yaml sets use_motion_module: true.
Does this mean the motion module was already loaded and trained during stage 1 training?

Latest inference bug fix?

Hello, thanks for being on top of things.

What was the bug that was fixed for the latest update on the inference v2 yaml file?

question about pose output

Hi,

I found some problems when generating the pose; please see the image below, in which I compare with DWPose/OpenPose (I believe MusePose uses the same).

You can see that MusePose generates longer arms... I am not sure whether this issue happens in the ComfyUI part or here...

Image 1

segmentation fault

Hello!
When I run the code
python pose_align.py --imgfn_refer ./assets/images/ref.png --vidfn ./assets/videos/dance.mp4, I encounter this error:

height: 1920.0
width: 1080.0
fps: 24.0
Loads checkpoint by local backend from path: ./pretrained_weights/dwpose/yolox_l_8x8_300e_coco.pth
Loads checkpoint by local backend from path: ./pretrained_weights/dwpose/dw-ll_ucoco_384.pth
[1]    36412 segmentation fault  python3 pose_align.py --imgfn_refer ./assets/images/ref.png --vidfn 

Mac M1Max

Please help me

I have finished the first step: python pose_align.py --imgfn_refer ./assets/images/ref.png --vidfn ./assets/videos/dance.mp4
and something went wrong at the next step:

Moviepy - Done !
Moviepy - video ready ./assets/poses/align/img_ref_video_dance.mp4
pose align done
(musePose) [root@localhost MusePose]# python test_stage_2.py --config ./configs/test_stage_2.yaml
Width: 768
Height: 768
Length: 300
Slice: 48
Overlap: 4
Classifier free guidance: 3.5
DDIM sampling steps : 20
skip 1
Traceback (most recent call last):
File "/home/wangxin/MusePose/test_stage_2.py", line 238, in
main()
File "/home/wangxin/MusePose/test_stage_2.py", line 76, in main
vae = AutoencoderKL.from_pretrained(
File "/data/glm3/anaconda3/envs/musePose/lib/python3.10/site-packages/diffusers/models/modeling_utils.py", line 805, in from_pretrained
raise ValueError(
ValueError: Cannot load <class 'diffusers.models.autoencoder_kl.AutoencoderKL'> from ./pretrained_weights/sd-vae-ft-mse because the following keys are missing:
decoder.mid_block.attentions.0.to_q.weight, decoder.up_blocks.3.resnets.0.conv2.bias, encoder.down_blocks.2.resnets.1.conv2.bias, decoder.up_blocks.0.resnets.2.conv2.weight, encoder.down_blocks.0.resnets.0.conv2.bias, decoder.up_blocks.1.resnets.0.conv2.bias, encoder.conv_in.weight, decoder.up_blocks.1.resnets.2.norm1.weight, decoder.up_blocks.3.resnets.1.norm1.weight, decoder.up_blocks.3.resnets.1.conv2.bias, decoder.up_blocks.2.resnets.0.conv1.weight, decoder.up_blocks.1.resnets.2.conv1.weight, encoder.down_blocks.1.downsamplers.0.conv.weight, encoder.down_blocks.3.resnets.0.norm2.weight, encoder.mid_block.resnets.0.norm1.bias, decoder.up_blocks.2.resnets.0.norm1.bias, decoder.mid_block.resnets.0.norm2.bias, decoder.up_blocks.3.resnets.0.conv_shortcut.bias, encoder.down_blocks.3.resnets.0.conv1.bias, decoder.up_blocks.2.resnets.1.norm1.weight, decoder.up_blocks.0.resnets.0.norm2.weight, encoder.down_blocks.1.resnets.0.norm1.bias, decoder.up_blocks.2.resnets.2.norm2.weight, quant_conv.weight, decoder.up_blocks.3.resnets.0.conv_shortcut.weight, decoder.up_blocks.0.resnets.2.norm2.weight, decoder.up_blocks.0.resnets.0.norm1.bias, encoder.down_blocks.1.resnets.1.conv1.weight, encoder.down_blocks.1.resnets.0.conv_shortcut.weight, encoder.down_blocks.0.resnets.1.norm1.weight, encoder.down_blocks.3.resnets.1.conv1.bias, encoder.down_blocks.1.resnets.0.conv2.bias, encoder.mid_block.resnets.0.conv2.bias, decoder.mid_block.attentions.0.group_norm.bias, encoder.down_blocks.3.resnets.1.conv2.bias, encoder.down_blocks.2.downsamplers.0.conv.weight, encoder.mid_block.resnets.1.conv2.weight, encoder.down_blocks.0.resnets.1.norm1.bias, encoder.mid_block.attentions.0.to_q.weight, decoder.up_blocks.1.resnets.1.norm2.weight, decoder.up_blocks.1.upsamplers.0.conv.bias, encoder.down_blocks.3.resnets.0.norm2.bias, decoder.up_blocks.3.resnets.1.norm2.bias, encoder.mid_block.attentions.0.to_out.0.weight, decoder.up_blocks.1.resnets.2.norm1.bias, encoder.down_blocks.2.resnets.1.norm2.weight, decoder.up_blocks.0.resnets.2.conv2.bias, decoder.mid_block.resnets.0.conv2.weight, encoder.down_blocks.1.resnets.0.conv_shortcut.bias, decoder.conv_norm_out.weight, decoder.mid_block.resnets.1.norm1.bias, encoder.mid_block.resnets.0.norm1.weight, encoder.down_blocks.2.resnets.0.norm2.bias, decoder.mid_block.attentions.0.to_k.bias, encoder.down_blocks.1.resnets.1.norm1.bias, encoder.mid_block.attentions.0.to_k.bias, encoder.conv_norm_out.weight, decoder.up_blocks.0.resnets.1.conv1.bias, decoder.up_blocks.0.resnets.2.conv1.weight, decoder.up_blocks.3.resnets.0.conv1.bias, encoder.down_blocks.2.resnets.0.conv1.weight, encoder.down_blocks.1.resnets.0.conv1.bias, encoder.conv_in.bias, decoder.up_blocks.3.resnets.2.norm1.weight, encoder.down_blocks.3.resnets.1.norm2.weight, decoder.mid_block.resnets.1.conv1.bias, decoder.up_blocks.2.resnets.0.conv_shortcut.bias, decoder.conv_in.weight, decoder.up_blocks.2.resnets.2.conv1.weight, decoder.up_blocks.3.resnets.1.norm1.bias, decoder.up_blocks.1.resnets.2.norm2.bias, decoder.mid_block.attentions.0.to_out.0.weight, encoder.down_blocks.0.resnets.1.norm2.bias, decoder.up_blocks.3.resnets.1.conv1.weight, encoder.down_blocks.3.resnets.0.norm1.weight, encoder.conv_norm_out.bias, encoder.down_blocks.0.resnets.0.norm2.weight, encoder.mid_block.resnets.0.conv1.weight, encoder.mid_block.resnets.0.conv2.weight, decoder.conv_in.bias, decoder.up_blocks.0.resnets.2.norm1.bias, encoder.down_blocks.0.resnets.0.norm1.bias, decoder.up_blocks.1.resnets.1.conv1.weight, 
decoder.mid_block.attentions.0.to_out.0.bias, encoder.down_blocks.0.resnets.0.conv1.bias, decoder.up_blocks.2.resnets.2.norm1.weight, decoder.mid_block.resnets.0.norm1.bias, encoder.down_blocks.0.resnets.1.conv1.bias, decoder.up_blocks.0.resnets.0.conv2.bias, decoder.up_blocks.3.resnets.2.conv1.bias, decoder.up_blocks.1.resnets.0.conv1.bias, decoder.up_blocks.2.resnets.0.norm2.weight, decoder.up_blocks.2.resnets.1.conv2.weight, decoder.up_blocks.1.resnets.1.conv1.bias, encoder.down_blocks.2.resnets.1.conv1.weight, encoder.down_blocks.0.downsamplers.0.conv.weight, encoder.down_blocks.1.resnets.0.conv2.weight, decoder.up_blocks.1.resnets.2.conv1.bias, decoder.mid_block.resnets.1.norm2.bias, encoder.mid_block.attentions.0.to_q.bias, decoder.mid_block.resnets.1.conv2.bias, encoder.down_blocks.2.resnets.1.norm1.bias, encoder.mid_block.attentions.0.group_norm.weight, encoder.down_blocks.2.resnets.0.norm1.weight, encoder.mid_block.resnets.1.norm2.bias, decoder.conv_out.bias, encoder.down_blocks.0.resnets.1.conv2.weight, encoder.down_blocks.1.resnets.1.conv2.weight, decoder.up_blocks.2.resnets.0.conv1.bias, decoder.up_blocks.3.resnets.2.conv2.weight, decoder.up_blocks.0.upsamplers.0.conv.bias, decoder.up_blocks.0.upsamplers.0.conv.weight, decoder.up_blocks.3.resnets.0.conv2.weight, decoder.up_blocks.2.resnets.0.conv_shortcut.weight, decoder.up_blocks.0.resnets.1.conv1.weight, decoder.up_blocks.0.resnets.2.norm1.weight, decoder.up_blocks.1.resnets.0.norm2.bias, encoder.down_blocks.3.resnets.1.norm2.bias, encoder.down_blocks.1.resnets.1.conv2.bias, decoder.up_blocks.2.resnets.2.norm1.bias, decoder.up_blocks.1.resnets.2.conv2.weight, decoder.up_blocks.2.resnets.0.norm2.bias, encoder.down_blocks.2.resnets.1.conv2.weight, encoder.down_blocks.0.resnets.1.conv1.weight, encoder.down_blocks.1.resnets.0.norm2.bias, encoder.mid_block.resnets.0.norm2.weight, decoder.up_blocks.1.resnets.0.norm1.bias, decoder.up_blocks.2.resnets.1.norm1.bias, decoder.mid_block.attentions.0.to_k.weight, decoder.up_blocks.0.resnets.1.conv2.bias, decoder.up_blocks.2.upsamplers.0.conv.bias, quant_conv.bias, decoder.up_blocks.3.resnets.0.norm2.weight, decoder.up_blocks.1.resnets.1.conv2.weight, encoder.mid_block.resnets.1.norm1.weight, encoder.down_blocks.2.resnets.1.norm2.bias, decoder.mid_block.resnets.0.conv1.bias, decoder.up_blocks.3.resnets.2.norm2.bias, encoder.down_blocks.0.downsamplers.0.conv.bias, decoder.up_blocks.0.resnets.2.conv1.bias, decoder.up_blocks.3.resnets.0.norm1.weight, encoder.down_blocks.2.resnets.0.norm1.bias, encoder.down_blocks.2.resnets.0.conv1.bias, decoder.up_blocks.2.resnets.1.norm2.weight, decoder.up_blocks.1.resnets.2.norm2.weight, decoder.mid_block.resnets.0.conv1.weight, decoder.mid_block.attentions.0.to_v.bias, decoder.up_blocks.2.resnets.2.conv1.bias, encoder.down_blocks.3.resnets.1.conv2.weight, encoder.down_blocks.3.resnets.1.norm1.bias, encoder.mid_block.attentions.0.to_k.weight, decoder.mid_block.resnets.0.conv2.bias, decoder.up_blocks.1.resnets.2.conv2.bias, decoder.mid_block.resnets.0.norm1.weight, decoder.mid_block.attentions.0.to_v.weight, encoder.mid_block.resnets.1.norm1.bias, decoder.conv_out.weight, encoder.down_blocks.1.resnets.1.conv1.bias, decoder.up_blocks.3.resnets.2.norm1.bias, decoder.mid_block.resnets.1.conv2.weight, encoder.down_blocks.0.resnets.0.conv2.weight, decoder.up_blocks.3.resnets.0.norm1.bias, encoder.down_blocks.2.resnets.0.conv_shortcut.weight, decoder.up_blocks.2.resnets.0.conv2.bias, decoder.up_blocks.2.resnets.1.conv2.bias, 
encoder.mid_block.resnets.1.conv1.weight, encoder.down_blocks.0.resnets.1.conv2.bias, encoder.down_blocks.3.resnets.0.norm1.bias, encoder.mid_block.attentions.0.group_norm.bias, encoder.mid_block.attentions.0.to_v.weight, encoder.down_blocks.1.resnets.1.norm2.bias, decoder.up_blocks.1.resnets.1.conv2.bias, encoder.mid_block.resnets.1.norm2.weight, encoder.mid_block.resnets.0.conv1.bias, decoder.up_blocks.2.resnets.1.norm2.bias, decoder.mid_block.resnets.1.norm2.weight, decoder.mid_block.attentions.0.group_norm.weight, decoder.up_blocks.2.resnets.1.conv1.weight, post_quant_conv.weight, decoder.up_blocks.2.resnets.0.norm1.weight, encoder.down_blocks.1.resnets.1.norm1.weight, encoder.mid_block.resnets.1.conv2.bias, decoder.up_blocks.0.resnets.1.conv2.weight, encoder.mid_block.attentions.0.to_out.0.bias, decoder.up_blocks.3.resnets.0.conv1.weight, decoder.up_blocks.0.resnets.1.norm1.bias, decoder.up_blocks.1.resnets.1.norm1.weight, decoder.up_blocks.3.resnets.1.conv1.bias, decoder.mid_block.resnets.1.norm1.weight, encoder.mid_block.resnets.1.conv1.bias, decoder.up_blocks.0.resnets.1.norm1.weight, encoder.down_blocks.2.downsamplers.0.conv.bias, decoder.up_blocks.2.resnets.2.conv2.weight, encoder.down_blocks.2.resnets.1.norm1.weight, decoder.up_blocks.1.resnets.0.norm2.weight, decoder.up_blocks.0.resnets.0.conv2.weight, encoder.down_blocks.1.resnets.0.conv1.weight, decoder.up_blocks.0.resnets.0.conv1.bias, encoder.down_blocks.1.downsamplers.0.conv.bias, decoder.up_blocks.0.resnets.1.norm2.weight, encoder.down_blocks.0.resnets.0.conv1.weight, decoder.up_blocks.2.resnets.0.conv2.weight, decoder.mid_block.resnets.1.conv1.weight, encoder.down_blocks.2.resnets.1.conv1.bias, encoder.down_blocks.0.resnets.1.norm2.weight, decoder.up_blocks.3.resnets.2.conv1.weight, encoder.down_blocks.2.resnets.0.norm2.weight, encoder.down_blocks.1.resnets.0.norm2.weight, encoder.down_blocks.3.resnets.1.conv1.weight, encoder.mid_block.resnets.0.norm2.bias, decoder.up_blocks.1.resnets.0.conv1.weight, encoder.down_blocks.2.resnets.0.conv_shortcut.bias, decoder.up_blocks.3.resnets.2.conv2.bias, encoder.down_blocks.3.resnets.0.conv2.weight, post_quant_conv.bias, encoder.down_blocks.2.resnets.0.conv2.bias, encoder.down_blocks.3.resnets.0.conv1.weight, encoder.conv_out.bias, decoder.up_blocks.0.resnets.0.conv1.weight, decoder.up_blocks.1.resnets.0.conv2.weight, decoder.up_blocks.2.resnets.2.conv2.bias, encoder.down_blocks.0.resnets.0.norm2.bias, decoder.conv_norm_out.bias, decoder.up_blocks.1.resnets.1.norm1.bias, encoder.down_blocks.2.resnets.0.conv2.weight, encoder.conv_out.weight, decoder.up_blocks.1.upsamplers.0.conv.weight, decoder.up_blocks.0.resnets.1.norm2.bias, decoder.up_blocks.1.resnets.1.norm2.bias, decoder.up_blocks.3.resnets.0.norm2.bias, encoder.down_blocks.1.resnets.1.norm2.weight, decoder.up_blocks.1.resnets.0.norm1.weight, decoder.up_blocks.2.resnets.2.norm2.bias, decoder.up_blocks.3.resnets.2.norm2.weight, decoder.up_blocks.0.resnets.0.norm2.bias, encoder.mid_block.attentions.0.to_v.bias, encoder.down_blocks.3.resnets.1.norm1.weight, decoder.up_blocks.2.upsamplers.0.conv.weight, decoder.up_blocks.2.resnets.1.conv1.bias, decoder.up_blocks.3.resnets.1.conv2.weight, encoder.down_blocks.0.resnets.0.norm1.weight, encoder.down_blocks.1.resnets.0.norm1.weight, decoder.mid_block.resnets.0.norm2.weight, decoder.up_blocks.0.resnets.2.norm2.bias, encoder.down_blocks.3.resnets.0.conv2.bias, decoder.mid_block.attentions.0.to_q.bias, decoder.up_blocks.3.resnets.1.norm2.weight, decoder.up_blocks.0.resnets.0.norm1.weight.
Please make sure to pass low_cpu_mem_usage=False and device_map=None if you want to randomly initialize those weights or else make sure your checkpoint file is correct.

some questions about training

positional_encoding_max_len is 128 in the inference config, but the default slice_number at inference is 48. I am confused about that; is slice_number equal to 128 during training?

CUDA out of memory

After starting test_stage_2, the loading bar appears but nothing happens for a long time; after that I just get this:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 243.00 GiB (GPU 0; 15.99 GiB total capacity; 13.51 GiB already allocated; 0 bytes free; 15.91 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Why is it trying to allocate such an absurd amount of memory? I can't seem to find where to set max_split_size_mb.
Is it something to do with the training parameters?

Thanks in advance!

Error with pose_align.py on MacBook M3 Max Using Anaconda torch-metal Profile

Hello @TZYSJTU

I am experiencing issues running pose_align.py from the MusePose repository on my MacBook M3 Max. Despite following the setup instructions in the README and installing all the required libraries and dependencies, I encounter errors when executing the script.

Environment:

  • OS: macOS
  • Device: MacBook M3 Max
  • Python Environment: Anaconda torch-metal profile
  • Python Version: 3.10
  • MusePose Version: 1.3.1
  • Platform: macOS-14.5-arm64-arm-64bit

I executed the following command:

python pose_align.py --imgfn_refer ./assets/images/ref.png --vidfn ./assets/videos/dance.mp4

Observation

  • Multiple warnings about missing modules despite them being installed correctly:
warnings.warn('Fail to import `MultiScaleDeformableAttention` from `mmcv.ops.multi_scale_deform_attn` ')
warnings.warn('The module `mmpose` is not installed. The package will have limited functionality.')
warnings.warn('The module `mmdet` is not installed. The package will have limited functionality.')
  • The script terminates with a NameError indicating that init_detector is not defined:
Traceback (most recent call last):
  File "/Users/jaydip/Documents/BCS/ML/MusePose/pose_align.py", line 556, in <module>
    main()
  File "/Users/jaydip/Documents/BCS/ML/MusePose/pose_align.py", line 551, in main
    run_align_video_with_filterPose.translate_smooth(args)
  File "/Users/jaydip/Documents/BCS/ML/MusePose/pose_align.py", line 270, in run_align_video_with_filterPose.translate_smooth
    detector = DWPoseDetector(
  File "/Users/jaydip/Documents/BCS/ML/MusePose/pose/script/dwpose.py", line 74, in __init__
    self.pose_estimation = Wholebody(det_config, det_ckpt, pose_config, pose_ckpt, device)
  File "/Users/jaydip/Documents/BCS/ML/MusePose/pose/script/wholebody.py", line 51, in __init__
    self.detector = init_detector(det_config, det_ckpt, device=device)
NameError: name 'init_detector' is not defined

Screenshot

Screenshot 2024-06-19 at 2 48 33 PM

I would appreciate any guidance on resolving these issues. Thank you!

Error extracting pose when deploying on Docker

I am trying to deploy the model on a GPU machine using Docker containers. The inference works well, but the pose extraction fails because of a limitation on resource usage.

Error:

ValueError: not allowed to raise maximum limit

Description

When running the pose_align.py script, a ValueError is raised indicating that the maximum limit for file descriptors cannot be increased. This error occurs during the initialization of the DWposeDetector in the run_align_video_with_filterPose_translate_smooth function.

Error Log

2024-06-10T18:02:00.973592162Z Traceback (most recent call last):
2024-06-10T18:02:00.973623702Z File "/muse_pose/pose_align.py", line 556, in
2024-06-10T18:02:00.973628712Z main()
2024-06-10T18:02:00.973633572Z File "/muse_pose/pose_align.py", line 551, in main
2024-06-10T18:02:00.974056225Z run_align_video_with_filterPose_translate_smooth(args)
2024-06-10T18:02:00.974070052Z File "/muse_pose/pose_align.py", line 270, in run_align_video_with_filterPose_translate_smooth
2024-06-10T18:02:00.974073629Z detector = DWposeDetector(
2024-06-10T18:02:00.974077346Z File "/muse_pose/pose/script/dwpose.py", line 72, in init
2024-06-10T18:02:00.974079570Z from pose.script.wholebody import Wholebody
2024-06-10T18:02:00.974082726Z File "/muse_pose/pose/script/wholebody.py", line 14, in
2024-06-10T18:02:00.974085692Z from mmpose.apis import inference_topdown
2024-06-10T18:02:00.974087976Z File "/usr/local/lib/python3.10/site-packages/mmpose/apis/init.py", line 2, in
2024-06-10T18:02:00.974102573Z from .inference import (collect_multi_frames, inference_bottomup,
2024-06-10T18:02:00.974105359Z File "/usr/local/lib/python3.10/site-packages/mmpose/apis/inference.py", line 16, in
2024-06-10T18:02:00.974107544Z from mmpose.datasets.datasets.utils import parse_pose_metainfo
2024-06-10T18:02:00.974109838Z File "/usr/local/lib/python3.10/site-packages/mmpose/datasets/init.py", line 2, in
2024-06-10T18:02:00.974112643Z from .builder import build_dataset
2024-06-10T18:02:00.974114777Z File "/usr/local/lib/python3.10/site-packages/mmpose/datasets/builder.py", line 20, in
2024-06-10T18:02:00.974116961Z resource.setrlimit(resource.RLIMIT_NOFILE, (soft_limit, hard_limit))
2024-06-10T18:02:00.974119205Z ValueError: not allowed to raise maximum limit
Environment: python:3.10-slim running on Linux

Low inference speed: the second-step inference hangs with no progress

The inference prompted that config.json and the model should be placed under MusePose/pretrained_weights/sd-image-variations-diffusers/unet.
After moving them under unet, it gets stuck for a long time with no progress.

root@153a7e76ceb5:~/MusePose# python test_stage_2.py --config ./configs/test_stage_2.yaml
Width: 768
Height: 768
Length: 300
Slice: 48
Overlap: 4
Classifier free guidance: 3.5
DDIM sampling steps : 20
skip 1
Some weights of the model checkpoint were not used when initializing UNet2DConditionModel:
['conv_norm_out.weight, conv_norm_out.bias, conv_out.weight, conv_out.bias']
/usr/local/lib/python3.10/dist-packages/torch/_utils.py:776: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
return self.fget.get(instance, owner)()
handle=== ./assets/images/ref.png ./assets/poses/align/img_ref_video_dance.mp4
pose video has 288 frames, with 24 fps
processing length: 144
fps 12
/root/MusePose/musepose/pipelines/pipeline_pose2vid_long.py:406: FutureWarning: Accessing config attribute in_channels directly via 'UNet3DConditionModel' object attribute is deprecated. Please access 'in_channels' over 'UNet3DConditionModel's config object instead, e.g. 'unet.config.in_channels'.
num_channels_latents = self.denoising_unet.in_channels
0%| | 0/20 [00:00<?, ?it/s]

Why does the last step fail? python test_stage_2.py --config ./configs/test_stage_2.yaml

WARNING[XFORMERS]: xFormers can't load C++/CUDA extensions. xFormers was built for:
PyTorch 2.0.1+cu118 with CUDA 1108 (you have 2.0.1+cpu)
Python 3.11.5 (you have 3.11.7)
Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
Memory-efficient attention, SwiGLU, sparse and more won't be available.
Set XFORMERS_MORE_DETAILS=1 for more details
Width: 512
Height: 512
Length: 300
Slice: 48
Overlap: 4
Classifier free guidance: 3.5
DDIM sampling steps : 20
skip 1
Traceback (most recent call last):
File "F:\kuaisufangwen\Desktop\MusePose\test_stage_2.py", line 238, in
main()
File "F:\kuaisufangwen\Desktop\MusePose\test_stage_2.py", line 78, in main
).to("cuda", dtype=weight_dtype)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "F:\kuaisufangwen\Desktop\ylad\Lib\site-packages\torch\nn\modules\module.py", line 1145, in to
return self._apply(convert)
^^^^^^^^^^^^^^^^^^^^
File "F:\kuaisufangwen\Desktop\ylad\Lib\site-packages\torch\nn\modules\module.py", line 797, in _apply
module._apply(fn)
File "F:\kuaisufangwen\Desktop\ylad\Lib\site-packages\torch\nn\modules\module.py", line 797, in _apply
module._apply(fn)
File "F:\kuaisufangwen\Desktop\ylad\Lib\site-packages\torch\nn\modules\module.py", line 820, in apply
param_applied = fn(param)
^^^^^^^^^
File "F:\kuaisufangwen\Desktop\ylad\Lib\site-packages\torch\nn\modules\module.py", line 1143, in convert
return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "F:\kuaisufangwen\Desktop\ylad\Lib\site-packages\torch\cuda_init
.py", line 239, in _lazy_init
raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled
