tmelyralab / musetalk
MuseTalk: Real-Time High Quality Lip Synchronization with Latent Space Inpainting
License: Other
The lower teeth are too long, and after synthesis they are fused together; it would be better if they could be made shorter. At 256 resolution the mouth and teeth still look fairly blurry, but with face enhancement enabled the teeth look terrible.
Thanks for your great work!
I ran some experiments and found that the fidelity of the generated faces is poor; the generated person does not closely resemble the one in the original video.
Could you give me some suggestions?
python -m scripts.inference --inference_config configs/inference/demo.yaml
add ffmpeg to path
Loads checkpoint by local backend from path: ./models/dwpose/dw-ll_ucoco_384.pth
{'task_1': {'video_path': 'data/video/demo.mov', 'audio_path': 'data/audio/demo.wav', 'bbox_shift': -7}}
/Users/sukai/Documents/ai/MuseTalk/musetalk/whisper/whisper/transcribe.py:79: UserWarning: FP16 is not supported on CPU; using FP32 instead
warnings.warn("FP16 is not supported on CPU; using FP32 instead")
video in 25 FPS, audio idx in 50FPS
extracting landmarks...time consuming
reading images...
0it [00:00, ?it/s]
get key_landmark and face bounding boxes with the bbox_shift: -7
0it [00:00, ?it/s]
********************************************bbox_shift parameter adjustment**********************************************************
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/Users/hotea/Documents/ai/MuseTalk/scripts/inference.py", line 145, in <module>
main(args)
File "/Users/hotea/Documents/ai/MuseTalk/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/Users/hotea/Documents/ai/MuseTalk/scripts/inference.py", line 115, in main
combine_frame = get_image(ori_frame,res_frame,bbox)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/hotea/Documents/ai/MuseTalk/musetalk/utils/blending.py", line 41, in get_image
mask_image = face_seg(face_large)
^^^^^^^^^^^^^^^^^^^^
File "/Users/hotea/Documents/ai/MuseTalk/musetalk/utils/blending.py", line 17, in face_seg
seg_image = fp(image)
^^^^^^^^^
File "/Users/hotea/Documents/ai/MuseTalk/musetalk/utils/face_parsing/__init__.py", line 38, in __call__
img = torch.unsqueeze(img, 0).cuda()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/hotea/Documents/ai/MuseTalk/venv/lib/python3.12/site-packages/torch/cuda/__init__.py", line 293, in _lazy_init
raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled
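A note for anyone hitting this on a Mac or a CPU-only box: the traceback shows that musetalk/utils/face_parsing/__init__.py hard-codes .cuda(). A minimal sketch of a device-aware fallback (the selection logic below is an assumption, not the repo's code):

import torch

# Pick the best available backend instead of assuming CUDA.
device = torch.device(
    "cuda" if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available()
    else "cpu"
)

# In FaceParsing.__call__, the hard-coded transfer
#     img = torch.unsqueeze(img, 0).cuda()
# would then become
#     img = torch.unsqueeze(img, 0).to(device)
# (the model itself must be moved to the same device).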
Could you update your paper link?
python -m scripts.realtime_inference --inference_config configs/inference/realtime.yaml
Loads checkpoint by local backend from path: ./models/dwpose/dw-ll_ucoco_384.pth
cuda start
D:\model\MuseTalk.glut\lib\site-packages\torch\utils\_contextlib.py:125: UserWarning: Decorating classes is deprecated and will be disabled in future versions. You should only decorate functions or methods. To preserve the current behavior of class decoration, you can directly decorate the __init__ method and nothing else.
warnings.warn("Decorating classes is deprecated and will be disabled in "
{'avator_1': {'preparation': False, 'bbox_shift': 5, 'video_path': 'data/video/sun.mp4', 'audio_clips': {'audio_0': 'data/audio/yongen.wav', 'audio_1': 'data/audio/sun.wav'}}}
reading images...
100%|█████████████████████████████████████████████████████████████████████████████| 1100/1100 [00:05<00:00, 195.67it/s]
reading images...
100%|████████████████████████████████████████████████████████████████████████████| 1100/1100 [00:00<00:00, 6295.40it/s]
Inferring using: data/audio/yongen.wav
video in 25 FPS, audio idx in 50FPS
start inference
202
processing audio:data/audio/yongen.wav costs 0.0ms
2%|█▋                                                                       | 1/51 [00:01<01:18, 1.57s/it]
Generating the 0-th frame with FPS: 1.76
Exception in thread Thread-4 (process_frames):
Traceback (most recent call last):
File "D:\model\MuseTalk.glut\lib\threading.py", line 1016, in _bootstrap_inner
self.run()
File "D:\model\MuseTalk.glut\lib\threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "D:\model\MuseTalk\scripts\realtime_inference.py", line 208, in process_frames
fps = 1/(time.time()-start)
ZeroDivisionError: float division by zero
100%|██████████████████████████████████████████████████████████████████████████████████| 51/51 [00:07<00:00, 7.19it/s]
ffmpeg -y -v fatal -r 25 -f image2 -i ./results/avatars/avator_1/tmp/%08d.png -vcodec libx264 -vf format=rgb24,scale=out_color_matrix=bt709,format=yuv420p -crf 18 ./results/avatars/avator_1/temp.mp4
ffmpeg -y -v fatal -i data/audio/yongen.wav -i ./results/avatars/avator_1/temp.mp4 ./results/avatars/avator_1/vid_output/audio_0.mp4
result is save to ./results/avatars/avator_1/vid_output/audio_0.mp4
Inferring using: data/audio/sun.wav
video in 25 FPS, audio idx in 50FPS
start inference
564
processing audio:data/audio/sun.wav costs 0.0ms
0%|                                                                         | 0/141 [00:00<?, ?it/s]
Generating the 0-th frame with FPS: 10.45
Exception in thread Thread-7 (process_frames):
Traceback (most recent call last):
File "D:\model\MuseTalk.glut\lib\threading.py", line 1016, in _bootstrap_inner
self.run()
File "D:\model\MuseTalk.glut\lib\threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "D:\model\MuseTalk\scripts\realtime_inference.py", line 208, in process_frames
fps = 1/(time.time()-start)
ZeroDivisionError: float division by zero
100%|████████████████████████████████████████████████████████████████████████████████| 141/141 [00:13<00:00, 10.42it/s]
ffmpeg -y -v fatal -r 25 -f image2 -i ./results/avatars/avator_1/tmp/%08d.png -vcodec libx264 -vf format=rgb24,scale=out_color_matrix=bt709,format=yuv420p -crf 18 ./results/avatars/avator_1/temp.mp4
ffmpeg -y -v fatal -i data/audio/sun.wav -i ./results/avatars/avator_1/temp.mp4 ./results/avatars/avator_1/vid_output/audio_1.mp4
result is save to ./results/avatars/avator_1/vid_output/audio_1.mp4
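For reference, the ZeroDivisionError in both runs comes from realtime_inference.py computing fps = 1/(time.time()-start); on Windows, time.time() is coarse enough that two calls within the same frame can return identical values. A hedged sketch of a guard (not the repo's code):

import time

start = time.perf_counter()  # finer resolution than time.time() on Windows
# ... produce one frame ...
elapsed = time.perf_counter() - start
fps = 1.0 / elapsed if elapsed > 0 else float("inf")  # avoid dividing by zero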
I am trying with the attached files, and even after waiting for two hours there is no progress. Am I doing something wrong?
test.yaml
task_0:
  video_path: "data/image/face.jpeg"
  audio_path: "data/audio/audio.wav"
C:\sd\MuseTalk>venv\scripts\activate
(venv) C:\sd\MuseTalk>python -m scripts.inference --inference_config configs/inference/test.yaml
add ffmpeg to path
Loads checkpoint by local backend from path: ./models/dwpose/dw-ll_ucoco_384.pth
cuda start
{'task_0': {'video_path': 'data/image/face.jpeg', 'audio_path': 'data/audio/audio.wav'}}
video in 25 FPS, audio idx in 50FPS
extracting landmarks...time consuming
reading images...
100%|████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<?, ?it/s]
get key_landmark and face bounding boxes with the default value
100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:02<00:00, 2.33s/it]
********************************************bbox_shift parameter adjustment**********************************************************
Total frame:「1」 Manually adjust range : [ -23~25 ] , the current value: 0
*************************************************************************************************************************************
start inference
0%| | 0/3 [00:00<?, ?it/s]
FFmpeg is in path
C:\Users\nitin>ffmpeg
ffmpeg version 6.1-full_build-www.gyan.dev Copyright (c) 2000-2023 the FFmpeg developers
built with gcc 12.2.0 (Rev10, Built by MSYS2 project)
configuration: --enable-gpl --enable-version3 --enable-static --pkg-config=pkgconf --disable-w32threads --disable-autodetect --enable-fontconfig --enable-iconv --enable-gnutls --enable-libxml2 --enable-gmp --enable-bzlib --enable-lzma --enable-libsnappy --enable-zlib --enable-librist --enable-libsrt --enable-libssh --enable-libzmq --enable-avisynth --enable-libbluray --enable-libcaca --enable-sdl2 --enable-libaribb24 --enable-libaribcaption --enable-libdav1d --enable-libdavs2 --enable-libuavs3d --enable-libzvbi --enable-librav1e --enable-libsvtav1 --enable-libwebp --enable-libx264 --enable-libx265 --enable-libxavs2 --enable-libxvid --enable-libaom --enable-libjxl --enable-libopenjpeg --enable-libvpx --enable-mediafoundation --enable-libass --enable-frei0r --enable-libfreetype --enable-libfribidi --enable-libharfbuzz --enable-liblensfun --enable-libvidstab --enable-libvmaf --enable-libzimg --enable-amf --enable-cuda-llvm --enable-cuvid --enable-ffnvcodec --enable-nvdec --enable-nvenc --enable-dxva2 --enable-d3d11va --enable-libvpl --enable-libshaderc --enable-vulkan --enable-libplacebo --enable-opencl --enable-libcdio --enable-libgme --enable-libmodplug --enable-libopenmpt --enable-libopencore-amrwb --enable-libmp3lame --enable-libshine --enable-libtheora --enable-libtwolame --enable-libvo-amrwbenc --enable-libcodec2 --enable-libilbc --enable-libgsm --enable-libopencore-amrnb --enable-libopus --enable-libspeex --enable-libvorbis --enable-ladspa --enable-libbs2b --enable-libflite --enable-libmysofa --enable-librubberband --enable-libsoxr --enable-chromaprint
libavutil 58. 29.100 / 58. 29.100
libavcodec 60. 31.102 / 60. 31.102
libavformat 60. 16.100 / 60. 16.100
libavdevice 60. 3.100 / 60. 3.100
libavfilter 9. 12.100 / 9. 12.100
libswscale 7. 5.100 / 7. 5.100
libswresample 4. 12.100 / 4. 12.100
libpostproc 57. 3.100 / 57. 3.100
Hyper fast Audio and Video encoder
usage: ffmpeg [options] [[infile options] -i infile]... {[outfile options] outfile}...
Use -h to get full help or, even better, run 'man ffmpeg'
Suggestion: add real-time interaction via live streaming (push-stream output).
In the audio I supplied for inference there are silent stretches in the middle and at the end, but the mouth keeps moving through them. How can this be improved?
When the process dies, the previously generated PNG frames are preserved, but generation cannot resume from the breakpoint. Is there a way to continue from the frames already produced?
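One possible workaround, assuming the frames are written with the %08d.png pattern visible in the ffmpeg commands elsewhere in this thread (a sketch, not a feature of the repo): find the first missing frame index and have the generation loop skip everything before it.

import os

def first_missing_frame(tmp_dir):
    # Frames are assumed to be named 00000000.png, 00000001.png, ...
    i = 0
    while os.path.exists(os.path.join(tmp_dir, f"{i:08d}.png")):
        i += 1
    return i

start_idx = first_missing_frame("./results/avatars/avator_1/tmp")
# ...then skip generation for frame indices < start_idx.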
ImportError: /home/ubuntu/miniconda3/envs/myenv/lib/python3.10/site-packages/mmcv/_ext.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN2at4_ops10zeros_like4callERKNS_6TensorEN3c108optionalINS5_10ScalarTypeEEENS6_INS5_6LayoutEEENS6_INS5_6DeviceEEENS6_IbEENS6_INS5_12MemoryFormatEEE
It is a nice project, and it seems to support video dubbing, so please add dubbing example code.
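Dubbing is effectively just inference over the original footage with the replacement audio track, so a config in the style of the repo's existing YAML files should work. A hedged sketch (the file name and paths are placeholders):

# configs/inference/dub.yaml (hypothetical name)
dub_task:
  video_path: "data/video/original.mp4"      # source footage to re-lip-sync
  audio_path: "data/audio/dubbed_track.wav"  # audio in the target language

# then:
# python -m scripts.inference --inference_config configs/inference/dub.yaml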
On a 4090, driving with a 15-second audio clip currently takes about 2 minutes per conversion.
Are there plans to support Huawei's Ascend platform?
ImportError: cannot import name 'ForkProcess' from 'multiprocessing.context'
Are there plans to release the training code or the paper? If so, roughly when?
Traceback (most recent call last):
File "/root/anaconda3/envs/musetalk/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/root/anaconda3/envs/musetalk/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/heqing/test/MuseTalk/scripts/inference.py", line 141, in
main(args)
File "/root/anaconda3/envs/musetalk/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/heqing/test/MuseTalk/scripts/inference.py", line 55, in main
whisper_feature = audio_processor.audio2feat(audio_path)
File "/home/heqing/test/MuseTalk/musetalk/whisper/audio2feature.py", line 99, in audio2feat
encoder_embeddings = emb['encoder_embeddings']
KeyError: 'encoder_embeddings'
(venv1) root@3bdaf96b00b0:/workspace/MuseTalk# python -m scripts.inference --inference_config configs/inference/test.yaml
add ffmpeg to path
Loads checkpoint by local backend from path: ./models/dwpose/dw-ll_ucoco_384.pth
cuda start
Traceback (most recent call last):
File "/opt/conda/envs/musev/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/envs/musev/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/workspace/MuseTalk/scripts/inference.py", line 13, in
from musetalk.utils.preprocessing import get_landmark_and_bbox,read_imgs,coord_placeholder
File "/workspace/MuseTalk/musetalk/utils/preprocessing.py", line 23, in
fa = FaceAlignment(LandmarksType._2D, flip_input=False,device=device)
File "/workspace/MuseTalk/musetalk/utils/face_detection/api.py", line 69, in init
self.face_detector = face_detector_module.FaceDetector(device=device, verbose=verbose)
File "/workspace/MuseTalk/musetalk/utils/face_detection/detection/sfd/sfd_detector.py", line 22, in init
model_weights = load_url(models_urls['s3fd'])
File "/workspace/MuseTalk/venv1/lib/python3.10/site-packages/torch/hub.py", line 750, in load_state_dict_from_url
return torch.load(cached_file, map_location=map_location)
File "/workspace/MuseTalk/venv1/lib/python3.10/site-packages/torch/serialization.py", line 815, in load
return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
File "/workspace/MuseTalk/venv1/lib/python3.10/site-packages/torch/serialization.py", line 1051, in _legacy_load
typed_storage._untyped_storage._set_from_file(
RuntimeError: unexpected EOF, expected 1202694 more bytes. The file might be corrupted.
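The "unexpected EOF" almost always means an interrupted download left a truncated s3fd checkpoint in the torch hub cache; deleting the partial file forces a clean re-download on the next run. A sketch (the exact file name in your cache may differ):

import glob
import os

import torch

cache_dir = os.path.join(torch.hub.get_dir(), "checkpoints")
for path in glob.glob(os.path.join(cache_dir, "*s3fd*")):
    print("removing truncated checkpoint:", path)
    os.remove(path)
# Re-running scripts.inference will download the weights again.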
I'm not sure whether this is caused by the face being near the edge of the frame, making the landmark detection insufficiently accurate.
for i, (whisper_batch, latent_batch) in enumerate(
        tqdm(gen, total=int(np.ceil(float(video_num) / batch_size)))):
    audio_feature_batch = torch.from_numpy(whisper_batch)
    audio_feature_batch = audio_feature_batch.to(
        device=unet.device, dtype=unet.model.dtype
    )  # torch, B, 5*N, 384
    audio_feature_batch = pe(audio_feature_batch)
    latent_batch = latent_batch.to(dtype=unet.model.dtype)
    pred_latents = unet.model(latent_batch, timesteps,
                              encoder_hidden_states=audio_feature_batch).sample
    recon = vae.decode_latents(pred_latents)
    for res_frame in recon:
        res_frame_list.append(res_frame)
From the code it looks like the UNet only does a single forward pass for each generated frame? Is my understanding correct? If so, does this still count as a diffusion model?
python -m scripts.inference --inference_config results/musetalk_test_result.yaml
video in 25.0 FPS, audio idx in 50FPS
extracting landmarks...time consuming
reading images...
100%|████████████████████████████████████████████████████████████████████| 250/250 [00:01<00:00, 207.18it/s]
get key_landmark and face bounding boxes with the default value
100%|█████████████████████████████████████████████████████████████████████| 250/250 [00:13<00:00, 19.16it/s]
********************************************bbox_shift parameter adjustment**********************************************************
Total frame:「250」 Manually adjust range : [ -16~16 ] , the current value: 0
(61, 0, 277, 308)
(61, -1, 277, 309)
Traceback (most recent call last):
File "/envs/musetalk/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/envs/musetalk/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/MuseTalk/scripts/inference.py", line 153, in
main(args)
File "/envs/musetalk/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(args, kwargs)
File "/MuseTalk/scripts/inference.py", line 86, in main
crop_frame = cv2.resize(crop_frame,(256,256),interpolation = cv2.INTER_LANCZOS4)
cv2.error: OpenCV(4.9.0) /io/opencv/modules/imgproc/src/resize.cpp:4152: error: (-215:Assertion failed) !ssize.empty() in function 'resize'
I printed the bbox during inference and found that the negative bbox index causes this bug.
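One way to harden this, sketched under the assumption that bbox is (x1, y1, x2, y2) as printed above: clamp the box to the frame bounds before cropping, so cv2.resize never receives an empty image.

import cv2

def safe_crop_resize(frame, bbox, size=(256, 256)):
    h, w = frame.shape[:2]
    x1, y1, x2, y2 = bbox
    x1, y1 = max(0, x1), max(0, y1)   # clamp the negative indices
    x2, y2 = min(w, x2), min(h, y2)
    if x2 <= x1 or y2 <= y1:
        raise ValueError(f"degenerate bbox after clamping: {bbox}")
    crop = frame[y1:y2, x1:x2]
    return cv2.resize(crop, size, interpolation=cv2.INTER_LANCZOS4)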
Would training the UNet with added noise work better? It seems to be used only as a feature extractor here.
The docs only describe generating video from video. What I want is to generate lip-sync animation directly from audio or text and feed it into Unity to drive a 3D character.
In close-up shots the mouth becomes odd-looking and very blurry, while medium shots look fine. Hope this can be fixed; thanks to the authors.
I used a two-step process to generate the final video:
Step 1: generate a video with the bbox parameter at its minimum;
Step 2: feed the result of step 1 back in and generate again with bbox set to 5.
The result seems much better this way. Going by other people's experience, lip-fitting on a video whose mouth stays closed throughout also seems to work better.
I'm not a professional, so I don't know whether this is of any help to the project~
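For anyone who wants to try this, the two passes can be expressed as two configs in the repo's existing YAML format (the file names and values here are hypothetical):

# pass1.yaml -- bbox_shift at the minimum of the range the tool prints
task_0:
  video_path: "data/video/input.mp4"
  audio_path: "data/audio/input.wav"
  bbox_shift: -9   # use the minimum printed for your own video

# pass2.yaml -- feed the pass-1 output back in
task_0:
  video_path: "results/input_input.mp4"   # pass-1 result (actual path is printed by the script)
  audio_path: "data/audio/input.wav"
  bbox_shift: 5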
On Windows I downloaded ffmpeg-static and set the environment variable, but it still doesn't work. Why?
Hi, can anyone help me resolve this issue? Thank you!
I encountered the error below when running app.py, specifically at imageio.mimwrite(output_video, images, 'FFMPEG', fps=fps, codec='libx264', pixelformat='yuv420p').
Error message:
File "/usr/local/lib/python3.10/dist-packages/imageio_ffmpeg/_io.py", line 627, in write_frames
p.stdin.write(bb)
BrokenPipeError: [Errno 32] Broken pipe
When no one is speaking, the mouth is still open.
I tested in the following environment:
https://huggingface.co/spaces/TMElyralab/MuseTalk
I uploaded an 8-second video generated with MuseV, but in the Space above it came out as 2 seconds. Could you explain why?
The environment was created from requirements.txt; at runtime it raises ValueError: transformers.__spec__ is None.
Hello, we found a problem in our deployment: in the VAE stage, copying the data from the GPU to the CPU takes an enormous amount of time. Simply put, ignoring that step we can reach 60+ fps in real time, but because of it our throughput is only around 30 fps. Is there any way to optimize this? Is it caused by the GPU's memory-bus width? Our test environment is a 4090.
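It is probably less the bus width than the synchronous float32 copy. Two things usually help, sketched below as an assumption about the bottleneck rather than the repo's code: convert the decoded frames to uint8 while they are still on the GPU, which cuts the transferred bytes by 4x, and copy into pinned host memory with non_blocking=True so the transfer can overlap the next batch's compute.

import torch

def frames_to_cpu(frames_gpu):
    # Quantize on the GPU first: float32 -> uint8 is a 4x smaller transfer.
    u8 = frames_gpu.clamp(0, 255).to(torch.uint8)
    # A pinned (page-locked) destination lets the copy run asynchronously.
    out = torch.empty(u8.shape, dtype=torch.uint8, pin_memory=True)
    out.copy_(u8, non_blocking=True)
    return out  # call torch.cuda.synchronize() just before consuming the data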
Could you add an example and code for online (live) driving? With that, this would instantly outclass every other digital-human project.
Every run slows down badly at the "pad talking image to original video" stage: even on a 4090 it only manages about 10 it/s, while CPU, GPU, and memory utilization all stay very low. I wonder whether there is room for optimization here.
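If the bottleneck really is the pasting stage, it is pure OpenCV/NumPy work that releases the GIL, so a thread pool may help. A rough sketch using the get_image blending helper visible in the tracebacks above (the per-frame input names are assumptions):

from concurrent.futures import ThreadPoolExecutor

from musetalk.utils.blending import get_image

# ori_frames, res_frames, bboxes are the per-frame inputs (assumed names).
with ThreadPoolExecutor(max_workers=8) as pool:
    combined = list(pool.map(lambda t: get_image(*t),
                             zip(ori_frames, res_frames, bboxes)))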
The real-time figures in the README were measured on a V100, and after trying it the live experience is indeed better than expected! A question, though: if I want to run real-time inference on a lower-end GPU and can tolerate some degradation of the output frames, which parameters or approaches could I look into?
Which package does the "import spaces" in the project's app.py come from, and what version?
How can a "closed mouth during silence" feature be supported?
During online chatting, only UNet and the VAE decoder are involved, which makes MuseTalk real-time.
How should this be done?
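One possible approach, not a feature of the repo: compute a per-video-frame energy mask over the audio and, wherever it falls below a threshold, composite the original frame instead of the generated one, so the mouth stays as in the source video during silence. A sketch:

import librosa

def silent_frame_mask(audio_path, fps=25, thresh=0.01):
    # True for video frames whose audio slice is (near-)silent.
    y, sr = librosa.load(audio_path, sr=None)
    hop = int(sr / fps)  # one RMS value per video frame
    rms = librosa.feature.rms(y=y, frame_length=hop * 2, hop_length=hop)[0]
    return rms < thresh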
Whenever the input is an image I keep getting this error. How has everyone solved it? It fails with both a single image and multiple images.
[ComfyUI-Manager] default cache updated: https://gitcode.net/ranting8323/ComfyUI-Manager/-/raw/main/alter-list.json
[ComfyUI-Manager] default cache updated: https://gitcode.net/ranting8323/ComfyUI-Manager/-/raw/main/custom-node-list.json
[ComfyUI-Manager] default cache updated: https://gitcode.net/ranting8323/ComfyUI-Manager/-/raw/main/model-list.json
[ComfyUI-Manager] default cache updated: https://gitcode.net/ranting8323/ComfyUI-Manager/-/raw/main/extension-node-map.json
Loads checkpoint by local backend from path: C:\ComfyUI-aki-v1.3\models\diffusers\TMElyralab/MuseTalk/dwpose/dw-ll_ucoco_384.pth
cuda start