
musepose's Introduction

MusePose

MusePose: a Pose-Driven Image-to-Video Framework for Virtual Human Generation.

Zhengyan Tong, Chao Li, Zhaokang Chen, Bin Wu, Wenjiang Zhou (Corresponding Author, [email protected])

Lyra Lab, Tencent Music Entertainment

GitHub | Hugging Face Space (coming soon) | Project (coming soon) | Technical report (coming soon)

MusePose is an image-to-video generation framework for virtual humans driven by control signals such as pose. The currently released model is an implementation of AnimateAnyone, built by optimizing Moore-AnimateAnyone.

MusePose is the last building block of the Muse open-source series. Together with MuseV and MuseTalk, we hope the community will join us and march towards the vision where a virtual human can be generated end-to-end, with native full-body movement and interaction abilities. Please stay tuned for our next milestone!

We really appreciate AnimateAnyone for their academic paper and Moore-AnimateAnyone for their code base, which have significantly expedited the development of the AIGC community and MusePose.

Update:

  1. We now support ComfyUI-MusePose!

Recruitment

Join Lyra Lab, Tencent Music Entertainment!

We are currently seeking AIGC researchers, including interns, new graduates, and senior hires.

Please find details in the following two links or contact [email protected]

Overview

MusePose is a diffusion-based and pose-guided virtual human video generation framework.
Our main contributions could be summarized as follows:

  1. The released model can generate dance videos of the human character in a reference image under a given pose sequence. The result quality exceeds that of almost all current open-source models on the same task.
  2. We release the pose align algorithm so that users can align arbitrary dance videos to arbitrary reference images, which significantly improves inference performance and enhances model usability.
  3. We have fixed several important bugs and made some improvements based on the code of Moore-AnimateAnyone.

Demos

demo.0.mp4
demo.1.mp4
demo.2.mp4
demo.3.mp4
demo.4.mp4
demo.5.mp4
demo.6.mp4
demo.7.mp4

News

  • [05/27/2024] Released MusePose and pretrained models.
  • [05/31/2024] Supported ComfyUI-MusePose.
  • [06/14/2024] Fixed a bug in inference_v2.yaml.

Todo:

  • Release our trained models and inference code of MusePose.
  • Release the pose align algorithm.
  • ComfyUI-MusePose.
  • Training guidelines.
  • Hugging Face Gradio demo.
  • An improved architecture and model (may take longer).

Getting Started

We provide a detailed tutorial about the installation and the basic usage of MusePose for new users:

Installation

To prepare the Python environment and install additional packages such as opencv, diffusers, mmcv, etc., please follow the steps below:

Build environment

We recommend Python >= 3.10 and CUDA 11.7. Then build the environment as follows:

pip install -r requirements.txt

mmlab packages

pip install --no-cache-dir -U openmim 
mim install mmengine 
mim install "mmcv>=2.0.1" 
mim install "mmdet>=3.1.0" 
mim install "mmpose>=1.1.0" 

Download weights

You can download weights manually as follows:

  1. Download our trained weights.

  2. Download the weights of other components:

Finally, these weights should be organized in pretrained_weights as follows:

./pretrained_weights/
|-- MusePose
|   |-- denoising_unet.pth
|   |-- motion_module.pth
|   |-- pose_guider.pth
|   └── reference_unet.pth
|-- dwpose
|   |-- dw-ll_ucoco_384.pth
|   └── yolox_l_8x8_300e_coco.pth
|-- sd-image-variations-diffusers
|   └── unet
|       |-- config.json
|       └── diffusion_pytorch_model.bin
|-- image_encoder
|   |-- config.json
|   └── pytorch_model.bin
└── sd-vae-ft-mse
    |-- config.json
    └── diffusion_pytorch_model.bin
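
If you prefer scripting the download, below is a minimal sketch using huggingface_hub. The repo ids are assumptions (the MusePose weights are assumed to be published under TMElyralab/MusePose, and the VAE under stabilityai/sd-vae-ft-mse); verify them, and organize the remaining components (dwpose, sd-image-variations-diffusers, image_encoder) into the layout above manually.

# a minimal sketch; repo ids are assumptions -- verify them before use
from huggingface_hub import snapshot_download

# MusePose checkpoints (denoising_unet, motion_module, pose_guider, reference_unet)
snapshot_download(repo_id="TMElyralab/MusePose", local_dir="./pretrained_weights/MusePose")

# VAE used at inference time
snapshot_download(
    repo_id="stabilityai/sd-vae-ft-mse",
    local_dir="./pretrained_weights/sd-vae-ft-mse",
    allow_patterns=["config.json", "diffusion_pytorch_model.bin"],
)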

Quickstart

Inference

Preparation

Prepare your reference images and dance videos in the folder ./assets, organized as in the example:

./assets/
|-- images
|   └── ref.png
└── videos
    └── dance.mp4

Pose Alignment

Get the aligned dwpose of the reference image:

python pose_align.py --imgfn_refer ./assets/images/ref.png --vidfn ./assets/videos/dance.mp4

After this, you can see the pose alignment results in ./assets/poses, where ./assets/poses/align/img_ref_video_dance.mp4 is the aligned dwpose video and ./assets/poses/align_demo/img_ref_video_dance.mp4 is for debugging.
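
If you have several dance videos, the alignment step can be scripted around the same CLI. A minimal sketch (a hypothetical wrapper, not part of the repo):

# hypothetical batch wrapper around pose_align.py; adjust paths to your setup
import subprocess
from pathlib import Path

ref_image = "./assets/images/ref.png"
for video in sorted(Path("./assets/videos").glob("*.mp4")):
    subprocess.run(
        ["python", "pose_align.py", "--imgfn_refer", ref_image, "--vidfn", str(video)],
        check=True,
    )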

Inferring MusePose

Add the path of the reference image and the aligned dwpose video to the test config file ./configs/test_stage_2.yaml as in the example:

test_cases:
  "./assets/images/ref.png":
    - "./assets/poses/align/img_ref_video_dance.mp4"

Then, simply run

python test_stage_2.py --config ./configs/test_stage_2.yaml

./configs/test_stage_2.yaml is the path to the inference configuration file.

Finally, you can find the output results in ./output/.
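
If you generate many reference/pose pairs, the test config can also be edited programmatically instead of by hand. A minimal sketch, assuming PyYAML is installed (note that rewriting the file this way drops any comments in the YAML):

# hypothetical helper to register a reference image / aligned pose pair in the config
import yaml

cfg_path = "./configs/test_stage_2.yaml"
with open(cfg_path) as f:
    cfg = yaml.safe_load(f)

cfg.setdefault("test_cases", {})["./assets/images/ref.png"] = [
    "./assets/poses/align/img_ref_video_dance.mp4"
]

with open(cfg_path, "w") as f:
    yaml.safe_dump(cfg, f, sort_keys=False)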

Reducing VRAM cost

If you want to reduce the VRAM cost, you can set a smaller width and height for inference. For example,

python test_stage_2.py --config ./configs/test_stage_2.yaml -W 512 -H 512

It will generate the video at 512 x 512 first, and then resize it back to the original size of the pose video.

Currently, inference at 512 x 512 x 48 takes about 16 GB of VRAM and inference at 768 x 768 x 48 takes about 28 GB. However, note that the inference resolution affects the final results (especially the face region).
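
If you still hit out-of-memory errors at these resolutions, the fragmentation hint that PyTorch prints (max_split_size_mb) can be tried by setting PYTORCH_CUDA_ALLOC_CONF before launching inference. A minimal sketch of a hypothetical launcher; the value is illustrative, not tuned:

# hypothetical launcher: set the allocator hint, then run inference at a reduced resolution
import os
import subprocess

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
subprocess.run(
    ["python", "test_stage_2.py", "--config", "./configs/test_stage_2.yaml", "-W", "512", "-H", "512"],
    check=True,
)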

Face Enhancement

If you want better consistency of the face region, you can use FaceFusion: its face-swap function can swap the face from the reference image into the generated video.

Training

Acknowledgement

  1. We thank AnimateAnyone for their technical report, and we referred extensively to Moore-AnimateAnyone and diffusers.
  2. We thank open-source components such as AnimateDiff, dwpose, Stable Diffusion, etc.

Thanks for open-sourcing!

Limitations

  • Detail consistency: some details of the original character are not well preserved (e.g. the face region and complex clothing).
  • Noise and flickering: we observe noise and flickering in complex backgrounds.

Citation

@article{musepose,
  title={MusePose: a Pose-Driven Image-to-Video Framework for Virtual Human Generation},
  author={Tong, Zhengyan and Li, Chao and Chen, Zhaokang and Wu, Bin and Zhou, Wenjiang},
  journal={arxiv},
  year={2024}
}

Disclaimer/License

  1. code: The code of MusePose is released under the MIT License. There are no limitations on academic or commercial usage.
  2. model: The trained models are available for non-commercial research purposes only.
  3. other open-source models: Other open-source models used must comply with their own licenses, such as ft-mse-vae, dwpose, etc.
  4. The test data are collected from the internet and are available for non-commercial research purposes only.
  5. AIGC: This project strives to impact the domain of AI-driven video generation positively. Users are granted the freedom to create videos using this tool, but they are expected to comply with local laws and utilize it responsibly. The developers do not assume any responsibility for potential misuse by users.

musepose's People

Contributors

ansonkao, czk32611, jhj0517, phighting, sartq333, tzysjtu


musepose's Issues

yolox link points to new filename

The YOLOX link points to the file name yolox_l_8x8_300e_coco_20211126_140236-d3bd2b23.pth.

It needs to be renamed to yolox_l_8x8_300e_coco.pth in order to work.

Training GPU requirement

Hi! Thanks for the amazing code!

I would like to ask about the training requirements. Currently, I am using a single A100 with 40 GB of memory. My training code follows Moore-AnimateAnyone. The problem is that no matter what video size I use (I even tried 64x64), it runs out of memory.

Could you please kindly share some information about training?

Thanks!

torch.cuda.OutOfMemoryError

CUDA out of memory. Tried to allocate 1.05 GiB (GPU 0; 23.64 GiB total capacity; 14.04 GiB already allocated; 74.44 MiB free; 14.88 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

How much VRAM is generally recommended to run this normally?

Error on stage 2

I'm getting this error on stage 2, any idea what could be going on?

PS G:\musepose> python test_stage_2.py --config ./configs/test_stage_2.yaml
Traceback (most recent call last):
File "G:\musepose\test_stage_2.py", line 21, in
from musepose.models.pose_guider import PoseGuider
ModuleNotFoundError: No module named 'musepose'

The graphics card does not have enough memory

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.11 GiB (GPU 0; 8.00 GiB total capacity; 19.84 GiB already allocated; 0 bytes free; 22.26 GiB
reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Mana
gement and PYTORCH_CUDA_ALLOC_CONF

ValueError: cross_attention_dim must be specified for CrossAttnDownBlock2D

When going through the quickstart instructions I'm getting the following error:

python test_stage_2.py --config ./configs/test_stage_2.yaml
/home/zzz/software/MusePose/venv/lib/python3.11/site-packages/diffusers/models/dual_transformer_2d.py:20: FutureWarning: `DualTransformer2DModel` is deprecated and will be removed in version 0.29. Importing `DualTransformer2DModel` from `diffusers.models.dual_transformer_2d` is deprecated and this will be removed in a future version. Please use `from diffusers.models.transformers.dual_transformer_2d import DualTransformer2DModel`, instead.
  deprecate("DualTransformer2DModel", "0.29", deprecation_message)
Width: 768
Height: 768
Length: 300
Slice: 48
Overlap: 4
Classifier free guidance: 3.5
DDIM sampling steps : 20
skip 1
Traceback (most recent call last):
  File "/home/zzz/software/MusePose/test_stage_2.py", line 237, in <module>
    main()
  File "/home/zzz/software/MusePose/test_stage_2.py", line 76, in main
    vae = AutoencoderKL.from_pretrained(
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zzz/software/MusePose/venv/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/zzz/software/MusePose/venv/lib/python3.11/site-packages/diffusers/models/modeling_utils.py", line 650, in from_pretrained
    model = cls.from_config(config, **unused_kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zzz/software/MusePose/venv/lib/python3.11/site-packages/diffusers/configuration_utils.py", line 259, in from_config
    model = cls(**init_dict)
            ^^^^^^^^^^^^^^^^
  File "/home/zzz/software/MusePose/venv/lib/python3.11/site-packages/diffusers/configuration_utils.py", line 653, in inner_init
    init(self, *args, **init_kwargs)
  File "/home/zzz/software/MusePose/venv/lib/python3.11/site-packages/diffusers/models/autoencoders/autoencoder_kl.py", line 90, in __init__
    self.encoder = Encoder(
                   ^^^^^^^^
  File "/home/zzz/software/MusePose/venv/lib/python3.11/site-packages/diffusers/models/autoencoders/vae.py", line 103, in __init__
    down_block = get_down_block(
                 ^^^^^^^^^^^^^^^
  File "/home/zzz/software/MusePose/venv/lib/python3.11/site-packages/diffusers/models/unets/unet_2d_blocks.py", line 128, in get_down_block
    raise ValueError("cross_attention_dim must be specified for CrossAttnDownBlock2D")
ValueError: cross_attention_dim must be specified for CrossAttnDownBlock2D

Why was my program killed during the last step?

The output is as follows:
Width: 768
Height: 768
Length: 300
Slice: 48
Overlap: 4
Classifier free guidance: 3.5
DDIM sampling steps : 20
skip 1
Some weights of the model checkpoint were not used when initializing UNet2DConditionModel:
['conv_norm_out.weight, conv_norm_out.bias, conv_out.weight, conv_out.bias']
/usr/local/lib/python3.10/dist-packages/torch/_utils.py:776: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
return self.fget.get(instance, owner)()
Killed

The same goes for changing w and h.

Missing body parts

Love playing around with the model so far! Awesome work!

The legs are missing in the pose. Is there a way to use the body positioning of the video instead or get all limbs from the picture?

img_ref_video_dance2.mp4

CUDA out of memory on 8 GB VRAM + 8 GB shared RAM

I installed it on Windows 11 but am getting the CUDA out of memory error. I reduced the dance video dimensions by half, to 540 x 960, but it is still the same.

Is there any way to make it work by tweaking settings like batch size, etc.?

Does not recognize more than one person

I tried using a two-person dance video, and dwpose only recognizes one person in each frame.

Is there a way to modify the alignment to recognize multiple people?

Thanks.

About arxiv paper

Dear MusePose authors,

First of all, thank you very much for sharing your incredible work. I really appreciate it.

I am interested in the technical details of MusePose. In the GitHub readme, there is citation information, but I cannot seem to find the referenced paper anywhere.

@Article{musepose,
title={MusePose: a Pose-Driven Image-to-Video Framework for Virtual Human Generation},
author={Tong, Zhengyan and Li, Chao and Chen, Zhaokang and Wu, Bin and Zhou, Wenjiang},
journal={arxiv},
year={2024}
}

Could you please share the direct link to the paper? Will you release the paper too?
Thank you very much. Looking forward to hearing from you.

How to keep fps the same as video source

First of all, this is an amazing open-source tool, so thank you, developers!

The default example video is 24fps.
But the final video is 12fps.

How can I adjust the "python test_stage_2.py --config ./configs/test_stage_2.yaml" command to make it use the same fps?

Currently, if I type "python test_stage_2.py --config ./configs/test_stage_2.yaml --fps 24", it outputs 24 fps, but at 2x the speed, which is not what I want.

test gone wrong

My results didn't turn out well :(

cat_img_cat_video_dance_3.5_20_1__.mp4

some good results

Thanks to the authors for sharing (you are my hero). I fine-tuned a version with additional data I collected myself; here are some good results:

1123-ff.mp4
MusePose2VideoPipeline-img_1-002.mp4-img_1-1.1sec-1080p-2-12345-1bdf5613-ff2.mp4

More AIGC results can be found on my video channel (温少的AIGC). Discussion and exchange are welcome!

Training code

Hello, thanks for the awesome work! I am eagerly waiting for the training code release. Please let me know your planned timeline.

Thank you,
Ish

About training dataset

Thanks for your great work! Would you please share some information about your training dataset, and some instructions on how to collect my own dataset from the web? Thank you!

VRAM usage

Hello!

Thanks again for this great project.

I'm wondering if there are ways to split VRAM usage across multiple GPUs, mostly in order to reduce compute cost (I have only been able to run it on an A100 so far).

Question about test stage 1

configs/test_stage_1.yaml specifies inference_config: "./configs/inference_v2.yaml",
and ./configs/inference_v2.yaml sets use_motion_module: true.
Does this mean the motion module was already loaded and trained during stage 1 training?

Latest inference bug fix?

Hello, thanks for being on top of things.

What was the bug that was fixed for the latest update on the inference v2 yaml file?

question about pose output

Hi,

I found some problems when generating the pose; please see the image below, in which I compare with DWPose/OpenPose (I believe MusePose uses the same).

You can see that MusePose generates longer arms... I am not sure whether this issue happens in the ComfyUI part or here...

Image 1

segmentation fault

Hello!
When I run the code
python pose_align.py --imgfn_refer ./assets/images/ref.png --vidfn ./assets/videos/dance.mp4, I encounter this error:

height: 1920.0
width: 1080.0
fps: 24.0
Loads checkpoint by local backend from path: ./pretrained_weights/dwpose/yolox_l_8x8_300e_coco.pth
Loads checkpoint by local backend from path: ./pretrained_weights/dwpose/dw-ll_ucoco_384.pth
[1]    36412 segmentation fault  python3 pose_align.py --imgfn_refer ./assets/images/ref.png --vidfn 

Mac M1Max

Please help me

I have finished the first step: python pose_align.py --imgfn_refer ./assets/images/ref.png --vidfn ./assets/videos/dance.mp4
and something went wrong at the next step:

Moviepy - Done !
Moviepy - video ready ./assets/poses/align/img_ref_video_dance.mp4
pose align done
(musePose) [root@localhost MusePose]# python test_stage_2.py --config ./configs/test_stage_2.yaml
Width: 768
Height: 768
Length: 300
Slice: 48
Overlap: 4
Classifier free guidance: 3.5
DDIM sampling steps : 20
skip 1
Traceback (most recent call last):
File "/home/wangxin/MusePose/test_stage_2.py", line 238, in
main()
File "/home/wangxin/MusePose/test_stage_2.py", line 76, in main
vae = AutoencoderKL.from_pretrained(
File "/data/glm3/anaconda3/envs/musePose/lib/python3.10/site-packages/diffusers/models/modeling_utils.py", line 805, in from_pretrained
raise ValueError(
ValueError: Cannot load <class 'diffusers.models.autoencoder_kl.AutoencoderKL'> from ./pretrained_weights/sd-vae-ft-mse because the following keys are missing:
decoder.mid_block.attentions.0.to_q.weight, decoder.up_blocks.3.resnets.0.conv2.bias, encoder.down_blocks.2.resnets.1.conv2.bias, decoder.up_blocks.0.resnets.2.conv2.weight, encoder.down_blocks.0.resnets.0.conv2.bias, decoder.up_blocks.1.resnets.0.conv2.bias, encoder.conv_in.weight, decoder.up_blocks.1.resnets.2.norm1.weight, decoder.up_blocks.3.resnets.1.norm1.weight, decoder.up_blocks.3.resnets.1.conv2.bias, decoder.up_blocks.2.resnets.0.conv1.weight, decoder.up_blocks.1.resnets.2.conv1.weight, encoder.down_blocks.1.downsamplers.0.conv.weight, encoder.down_blocks.3.resnets.0.norm2.weight, encoder.mid_block.resnets.0.norm1.bias, decoder.up_blocks.2.resnets.0.norm1.bias, decoder.mid_block.resnets.0.norm2.bias, decoder.up_blocks.3.resnets.0.conv_shortcut.bias, encoder.down_blocks.3.resnets.0.conv1.bias, decoder.up_blocks.2.resnets.1.norm1.weight, decoder.up_blocks.0.resnets.0.norm2.weight, encoder.down_blocks.1.resnets.0.norm1.bias, decoder.up_blocks.2.resnets.2.norm2.weight, quant_conv.weight, decoder.up_blocks.3.resnets.0.conv_shortcut.weight, decoder.up_blocks.0.resnets.2.norm2.weight, decoder.up_blocks.0.resnets.0.norm1.bias, encoder.down_blocks.1.resnets.1.conv1.weight, encoder.down_blocks.1.resnets.0.conv_shortcut.weight, encoder.down_blocks.0.resnets.1.norm1.weight, encoder.down_blocks.3.resnets.1.conv1.bias, encoder.down_blocks.1.resnets.0.conv2.bias, encoder.mid_block.resnets.0.conv2.bias, decoder.mid_block.attentions.0.group_norm.bias, encoder.down_blocks.3.resnets.1.conv2.bias, encoder.down_blocks.2.downsamplers.0.conv.weight, encoder.mid_block.resnets.1.conv2.weight, encoder.down_blocks.0.resnets.1.norm1.bias, encoder.mid_block.attentions.0.to_q.weight, decoder.up_blocks.1.resnets.1.norm2.weight, decoder.up_blocks.1.upsamplers.0.conv.bias, encoder.down_blocks.3.resnets.0.norm2.bias, decoder.up_blocks.3.resnets.1.norm2.bias, encoder.mid_block.attentions.0.to_out.0.weight, decoder.up_blocks.1.resnets.2.norm1.bias, encoder.down_blocks.2.resnets.1.norm2.weight, decoder.up_blocks.0.resnets.2.conv2.bias, decoder.mid_block.resnets.0.conv2.weight, encoder.down_blocks.1.resnets.0.conv_shortcut.bias, decoder.conv_norm_out.weight, decoder.mid_block.resnets.1.norm1.bias, encoder.mid_block.resnets.0.norm1.weight, encoder.down_blocks.2.resnets.0.norm2.bias, decoder.mid_block.attentions.0.to_k.bias, encoder.down_blocks.1.resnets.1.norm1.bias, encoder.mid_block.attentions.0.to_k.bias, encoder.conv_norm_out.weight, decoder.up_blocks.0.resnets.1.conv1.bias, decoder.up_blocks.0.resnets.2.conv1.weight, decoder.up_blocks.3.resnets.0.conv1.bias, encoder.down_blocks.2.resnets.0.conv1.weight, encoder.down_blocks.1.resnets.0.conv1.bias, encoder.conv_in.bias, decoder.up_blocks.3.resnets.2.norm1.weight, encoder.down_blocks.3.resnets.1.norm2.weight, decoder.mid_block.resnets.1.conv1.bias, decoder.up_blocks.2.resnets.0.conv_shortcut.bias, decoder.conv_in.weight, decoder.up_blocks.2.resnets.2.conv1.weight, decoder.up_blocks.3.resnets.1.norm1.bias, decoder.up_blocks.1.resnets.2.norm2.bias, decoder.mid_block.attentions.0.to_out.0.weight, encoder.down_blocks.0.resnets.1.norm2.bias, decoder.up_blocks.3.resnets.1.conv1.weight, encoder.down_blocks.3.resnets.0.norm1.weight, encoder.conv_norm_out.bias, encoder.down_blocks.0.resnets.0.norm2.weight, encoder.mid_block.resnets.0.conv1.weight, encoder.mid_block.resnets.0.conv2.weight, decoder.conv_in.bias, decoder.up_blocks.0.resnets.2.norm1.bias, encoder.down_blocks.0.resnets.0.norm1.bias, decoder.up_blocks.1.resnets.1.conv1.weight, 
decoder.mid_block.attentions.0.to_out.0.bias, encoder.down_blocks.0.resnets.0.conv1.bias, decoder.up_blocks.2.resnets.2.norm1.weight, decoder.mid_block.resnets.0.norm1.bias, encoder.down_blocks.0.resnets.1.conv1.bias, decoder.up_blocks.0.resnets.0.conv2.bias, decoder.up_blocks.3.resnets.2.conv1.bias, decoder.up_blocks.1.resnets.0.conv1.bias, decoder.up_blocks.2.resnets.0.norm2.weight, decoder.up_blocks.2.resnets.1.conv2.weight, decoder.up_blocks.1.resnets.1.conv1.bias, encoder.down_blocks.2.resnets.1.conv1.weight, encoder.down_blocks.0.downsamplers.0.conv.weight, encoder.down_blocks.1.resnets.0.conv2.weight, decoder.up_blocks.1.resnets.2.conv1.bias, decoder.mid_block.resnets.1.norm2.bias, encoder.mid_block.attentions.0.to_q.bias, decoder.mid_block.resnets.1.conv2.bias, encoder.down_blocks.2.resnets.1.norm1.bias, encoder.mid_block.attentions.0.group_norm.weight, encoder.down_blocks.2.resnets.0.norm1.weight, encoder.mid_block.resnets.1.norm2.bias, decoder.conv_out.bias, encoder.down_blocks.0.resnets.1.conv2.weight, encoder.down_blocks.1.resnets.1.conv2.weight, decoder.up_blocks.2.resnets.0.conv1.bias, decoder.up_blocks.3.resnets.2.conv2.weight, decoder.up_blocks.0.upsamplers.0.conv.bias, decoder.up_blocks.0.upsamplers.0.conv.weight, decoder.up_blocks.3.resnets.0.conv2.weight, decoder.up_blocks.2.resnets.0.conv_shortcut.weight, decoder.up_blocks.0.resnets.1.conv1.weight, decoder.up_blocks.0.resnets.2.norm1.weight, decoder.up_blocks.1.resnets.0.norm2.bias, encoder.down_blocks.3.resnets.1.norm2.bias, encoder.down_blocks.1.resnets.1.conv2.bias, decoder.up_blocks.2.resnets.2.norm1.bias, decoder.up_blocks.1.resnets.2.conv2.weight, decoder.up_blocks.2.resnets.0.norm2.bias, encoder.down_blocks.2.resnets.1.conv2.weight, encoder.down_blocks.0.resnets.1.conv1.weight, encoder.down_blocks.1.resnets.0.norm2.bias, encoder.mid_block.resnets.0.norm2.weight, decoder.up_blocks.1.resnets.0.norm1.bias, decoder.up_blocks.2.resnets.1.norm1.bias, decoder.mid_block.attentions.0.to_k.weight, decoder.up_blocks.0.resnets.1.conv2.bias, decoder.up_blocks.2.upsamplers.0.conv.bias, quant_conv.bias, decoder.up_blocks.3.resnets.0.norm2.weight, decoder.up_blocks.1.resnets.1.conv2.weight, encoder.mid_block.resnets.1.norm1.weight, encoder.down_blocks.2.resnets.1.norm2.bias, decoder.mid_block.resnets.0.conv1.bias, decoder.up_blocks.3.resnets.2.norm2.bias, encoder.down_blocks.0.downsamplers.0.conv.bias, decoder.up_blocks.0.resnets.2.conv1.bias, decoder.up_blocks.3.resnets.0.norm1.weight, encoder.down_blocks.2.resnets.0.norm1.bias, encoder.down_blocks.2.resnets.0.conv1.bias, decoder.up_blocks.2.resnets.1.norm2.weight, decoder.up_blocks.1.resnets.2.norm2.weight, decoder.mid_block.resnets.0.conv1.weight, decoder.mid_block.attentions.0.to_v.bias, decoder.up_blocks.2.resnets.2.conv1.bias, encoder.down_blocks.3.resnets.1.conv2.weight, encoder.down_blocks.3.resnets.1.norm1.bias, encoder.mid_block.attentions.0.to_k.weight, decoder.mid_block.resnets.0.conv2.bias, decoder.up_blocks.1.resnets.2.conv2.bias, decoder.mid_block.resnets.0.norm1.weight, decoder.mid_block.attentions.0.to_v.weight, encoder.mid_block.resnets.1.norm1.bias, decoder.conv_out.weight, encoder.down_blocks.1.resnets.1.conv1.bias, decoder.up_blocks.3.resnets.2.norm1.bias, decoder.mid_block.resnets.1.conv2.weight, encoder.down_blocks.0.resnets.0.conv2.weight, decoder.up_blocks.3.resnets.0.norm1.bias, encoder.down_blocks.2.resnets.0.conv_shortcut.weight, decoder.up_blocks.2.resnets.0.conv2.bias, decoder.up_blocks.2.resnets.1.conv2.bias, 
encoder.mid_block.resnets.1.conv1.weight, encoder.down_blocks.0.resnets.1.conv2.bias, encoder.down_blocks.3.resnets.0.norm1.bias, encoder.mid_block.attentions.0.group_norm.bias, encoder.mid_block.attentions.0.to_v.weight, encoder.down_blocks.1.resnets.1.norm2.bias, decoder.up_blocks.1.resnets.1.conv2.bias, encoder.mid_block.resnets.1.norm2.weight, encoder.mid_block.resnets.0.conv1.bias, decoder.up_blocks.2.resnets.1.norm2.bias, decoder.mid_block.resnets.1.norm2.weight, decoder.mid_block.attentions.0.group_norm.weight, decoder.up_blocks.2.resnets.1.conv1.weight, post_quant_conv.weight, decoder.up_blocks.2.resnets.0.norm1.weight, encoder.down_blocks.1.resnets.1.norm1.weight, encoder.mid_block.resnets.1.conv2.bias, decoder.up_blocks.0.resnets.1.conv2.weight, encoder.mid_block.attentions.0.to_out.0.bias, decoder.up_blocks.3.resnets.0.conv1.weight, decoder.up_blocks.0.resnets.1.norm1.bias, decoder.up_blocks.1.resnets.1.norm1.weight, decoder.up_blocks.3.resnets.1.conv1.bias, decoder.mid_block.resnets.1.norm1.weight, encoder.mid_block.resnets.1.conv1.bias, decoder.up_blocks.0.resnets.1.norm1.weight, encoder.down_blocks.2.downsamplers.0.conv.bias, decoder.up_blocks.2.resnets.2.conv2.weight, encoder.down_blocks.2.resnets.1.norm1.weight, decoder.up_blocks.1.resnets.0.norm2.weight, decoder.up_blocks.0.resnets.0.conv2.weight, encoder.down_blocks.1.resnets.0.conv1.weight, decoder.up_blocks.0.resnets.0.conv1.bias, encoder.down_blocks.1.downsamplers.0.conv.bias, decoder.up_blocks.0.resnets.1.norm2.weight, encoder.down_blocks.0.resnets.0.conv1.weight, decoder.up_blocks.2.resnets.0.conv2.weight, decoder.mid_block.resnets.1.conv1.weight, encoder.down_blocks.2.resnets.1.conv1.bias, encoder.down_blocks.0.resnets.1.norm2.weight, decoder.up_blocks.3.resnets.2.conv1.weight, encoder.down_blocks.2.resnets.0.norm2.weight, encoder.down_blocks.1.resnets.0.norm2.weight, encoder.down_blocks.3.resnets.1.conv1.weight, encoder.mid_block.resnets.0.norm2.bias, decoder.up_blocks.1.resnets.0.conv1.weight, encoder.down_blocks.2.resnets.0.conv_shortcut.bias, decoder.up_blocks.3.resnets.2.conv2.bias, encoder.down_blocks.3.resnets.0.conv2.weight, post_quant_conv.bias, encoder.down_blocks.2.resnets.0.conv2.bias, encoder.down_blocks.3.resnets.0.conv1.weight, encoder.conv_out.bias, decoder.up_blocks.0.resnets.0.conv1.weight, decoder.up_blocks.1.resnets.0.conv2.weight, decoder.up_blocks.2.resnets.2.conv2.bias, encoder.down_blocks.0.resnets.0.norm2.bias, decoder.conv_norm_out.bias, decoder.up_blocks.1.resnets.1.norm1.bias, encoder.down_blocks.2.resnets.0.conv2.weight, encoder.conv_out.weight, decoder.up_blocks.1.upsamplers.0.conv.weight, decoder.up_blocks.0.resnets.1.norm2.bias, decoder.up_blocks.1.resnets.1.norm2.bias, decoder.up_blocks.3.resnets.0.norm2.bias, encoder.down_blocks.1.resnets.1.norm2.weight, decoder.up_blocks.1.resnets.0.norm1.weight, decoder.up_blocks.2.resnets.2.norm2.bias, decoder.up_blocks.3.resnets.2.norm2.weight, decoder.up_blocks.0.resnets.0.norm2.bias, encoder.mid_block.attentions.0.to_v.bias, encoder.down_blocks.3.resnets.1.norm1.weight, decoder.up_blocks.2.upsamplers.0.conv.weight, decoder.up_blocks.2.resnets.1.conv1.bias, decoder.up_blocks.3.resnets.1.conv2.weight, encoder.down_blocks.0.resnets.0.norm1.weight, encoder.down_blocks.1.resnets.0.norm1.weight, decoder.mid_block.resnets.0.norm2.weight, decoder.up_blocks.0.resnets.2.norm2.bias, encoder.down_blocks.3.resnets.0.conv2.bias, decoder.mid_block.attentions.0.to_q.bias, decoder.up_blocks.3.resnets.1.norm2.weight, decoder.up_blocks.0.resnets.0.norm1.weight.
Please make sure to pass low_cpu_mem_usage=False and device_map=None if you want to randomly initialize those weights or else make sure your checkpoint file is correct.

some questions about training

positional_encoding_max_len is 128 in the inference config, but the default slice_number at inference is 48. I am confused about that; is slice_number equal to 128 during training?

CUDA out of memory

After starting test_stage_2, the loading bar appears but nothing happens for a long time; after that I just get this:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 243.00 GiB (GPU 0; 15.99 GiB total capacity; 13.51 GiB already allocated; 0 bytes free; 15.91 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Why is it trying to allocate such an absurd amount of memory? I can't seem to find where to set max_split_size_mb.
Is it something to do with the training parameters?

Thanks in advance!

Error with pose_align.py on MacBook M3 Max Using Anaconda torch-metal Profile

Hello @TZYSJTU

I am experiencing issues running pose_align.py from the MusePose repository on my MacBook M3 Max. Despite following the setup instructions in the README and installing all the required libraries and dependencies, I encounter errors when executing the script.

Environment:

  • OS: macOS
  • Device: MacBook M3 Max
  • Python Environment: Anaconda torch-metal profile
  • Python Version: 3.10
  • MusePose Version: 1.3.1
  • Platform: macOS-14.5-arm64-arm-64bit

I executed the following command:

python pose_align.py --imgfn_refer ./assets/images/ref.png --vidfn ./assets/videos/dance.mp4

Observation

  • Multiple warnings about missing modules despite them being installed correctly:
warnings.warn('Fail to import `MultiScaleDeformableAttention` from `mmcv.ops.multi_scale_deform_attn` ')
warnings.warn('The module `mmpose` is not installed. The package will have limited functionality.')
warnings.warn('The module `mmdet` is not installed. The package will have limited functionality.')
  • The script terminates with a NameError indicating that init_detector is not defined:
Traceback (most recent call last):
  File "/Users/jaydip/Documents/BCS/ML/MusePose/pose_align.py", line 556, in <module>
    main()
  File "/Users/jaydip/Documents/BCS/ML/MusePose/pose_align.py", line 551, in main
    run_align_video_with_filterPose.translate_smooth(args)
  File "/Users/jaydip/Documents/BCS/ML/MusePose/pose_align.py", line 270, in run_align_video_with_filterPose.translate_smooth
    detector = DWPoseDetector(
  File "/Users/jaydip/Documents/BCS/ML/MusePose/pose/script/dwpose.py", line 74, in __init__
    self.pose_estimation = Wholebody(det_config, det_ckpt, pose_config, pose_ckpt, device)
  File "/Users/jaydip/Documents/BCS/ML/MusePose/pose/script/wholebody.py", line 51, in __init__
    self.detector = init_detector(det_config, det_ckpt, device=device)
NameError: name 'init_detector' is not defined

Screenshot

Screenshot 2024-06-19 at 2 48 33 PM

I would appreciate any guidance on resolving these issues. Thank you!

Error extracting pose when deploying on Docker

I am trying to deploy the model on a GPU machine using Docker containers. The inference works well, but the pose extraction fails because of a limitation on resource usage.

Error:

ValueError: not allowed to raise maximum limit

Description

When running the pose_align.py script, a ValueError is raised indicating that the maximum limit for file descriptors cannot be increased. This error occurs during the initialization of the DWposeDetector in the run_align_video_with_filterPose_translate_smooth function.

Error Log

2024-06-10T18:02:00.973592162Z Traceback (most recent call last):
2024-06-10T18:02:00.973623702Z File "/muse_pose/pose_align.py", line 556, in
2024-06-10T18:02:00.973628712Z main()
2024-06-10T18:02:00.973633572Z File "/muse_pose/pose_align.py", line 551, in main
2024-06-10T18:02:00.974056225Z run_align_video_with_filterPose_translate_smooth(args)
2024-06-10T18:02:00.974070052Z File "/muse_pose/pose_align.py", line 270, in run_align_video_with_filterPose_translate_smooth
2024-06-10T18:02:00.974073629Z detector = DWposeDetector(
2024-06-10T18:02:00.974077346Z File "/muse_pose/pose/script/dwpose.py", line 72, in init
2024-06-10T18:02:00.974079570Z from pose.script.wholebody import Wholebody
2024-06-10T18:02:00.974082726Z File "/muse_pose/pose/script/wholebody.py", line 14, in
2024-06-10T18:02:00.974085692Z from mmpose.apis import inference_topdown
2024-06-10T18:02:00.974087976Z File "/usr/local/lib/python3.10/site-packages/mmpose/apis/init.py", line 2, in
2024-06-10T18:02:00.974102573Z from .inference import (collect_multi_frames, inference_bottomup,
2024-06-10T18:02:00.974105359Z File "/usr/local/lib/python3.10/site-packages/mmpose/apis/inference.py", line 16, in
2024-06-10T18:02:00.974107544Z from mmpose.datasets.datasets.utils import parse_pose_metainfo
2024-06-10T18:02:00.974109838Z File "/usr/local/lib/python3.10/site-packages/mmpose/datasets/init.py", line 2, in
2024-06-10T18:02:00.974112643Z from .builder import build_dataset
2024-06-10T18:02:00.974114777Z File "/usr/local/lib/python3.10/site-packages/mmpose/datasets/builder.py", line 20, in
2024-06-10T18:02:00.974116961Z resource.setrlimit(resource.RLIMIT_NOFILE, (soft_limit, hard_limit))
2024-06-10T18:02:00.974119205Z ValueError: not allowed to raise maximum limit
Environment: python:3.10-slim running on Linux

Low inference speed: the second-step inference hangs with no progress

The inference prompted that config.json and the model should be placed under MusePose/pretrained_weights/sd-image-variations-diffusers/unet.
After moving them under unet, it gets stuck for a long time with no progress.

root@153a7e76ceb5:~/MusePose# python test_stage_2.py --config ./configs/test_stage_2.yaml
Width: 768
Height: 768
Length: 300
Slice: 48
Overlap: 4
Classifier free guidance: 3.5
DDIM sampling steps : 20
skip 1
Some weights of the model checkpoint were not used when initializing UNet2DConditionModel:
['conv_norm_out.weight, conv_norm_out.bias, conv_out.weight, conv_out.bias']
/usr/local/lib/python3.10/dist-packages/torch/_utils.py:776: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
return self.fget.get(instance, owner)()
handle=== ./assets/images/ref.png ./assets/poses/align/img_ref_video_dance.mp4
pose video has 288 frames, with 24 fps
processing length: 144
fps 12
/root/MusePose/musepose/pipelines/pipeline_pose2vid_long.py:406: FutureWarning: Accessing config attribute in_channels directly via 'UNet3DConditionModel' object attribute is deprecated. Please access 'in_channels' over 'UNet3DConditionModel's config object instead, e.g. 'unet.config.in_channels'.
num_channels_latents = self.denoising_unet.in_channels
0%| | 0/20 [00:00<?, ?it/s]

Why does the last step fail? python test_stage_2.py --config ./configs/test_stage_2.yaml

WARNING[XFORMERS]: xFormers can't load C++/CUDA extensions. xFormers was built for:
PyTorch 2.0.1+cu118 with CUDA 1108 (you have 2.0.1+cpu)
Python 3.11.5 (you have 3.11.7)
Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
Memory-efficient attention, SwiGLU, sparse and more won't be available.
Set XFORMERS_MORE_DETAILS=1 for more details
Width: 512
Height: 512
Length: 300
Slice: 48
Overlap: 4
Classifier free guidance: 3.5
DDIM sampling steps : 20
skip 1
Traceback (most recent call last):
File "F:\kuaisufangwen\Desktop\MusePose\test_stage_2.py", line 238, in
main()
File "F:\kuaisufangwen\Desktop\MusePose\test_stage_2.py", line 78, in main
).to("cuda", dtype=weight_dtype)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "F:\kuaisufangwen\Desktop\ylad\Lib\site-packages\torch\nn\modules\module.py", line 1145, in to
return self._apply(convert)
^^^^^^^^^^^^^^^^^^^^
File "F:\kuaisufangwen\Desktop\ylad\Lib\site-packages\torch\nn\modules\module.py", line 797, in _apply
module._apply(fn)
File "F:\kuaisufangwen\Desktop\ylad\Lib\site-packages\torch\nn\modules\module.py", line 797, in _apply
module._apply(fn)
File "F:\kuaisufangwen\Desktop\ylad\Lib\site-packages\torch\nn\modules\module.py", line 820, in apply
param_applied = fn(param)
^^^^^^^^^
File "F:\kuaisufangwen\Desktop\ylad\Lib\site-packages\torch\nn\modules\module.py", line 1143, in convert
return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "F:\kuaisufangwen\Desktop\ylad\Lib\site-packages\torch\cuda_init
.py", line 239, in _lazy_init
raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled
