musetalk's People

Contributors

czk32611, gluttony-10, hotea, itechmusic, phighting, tobycroft, zhanchao019


musetalk's Issues

The lower teeth are too long; shortening them would help

The lower teeth are too long, and in the synthesized result they are fused together; it would look better if they were shorter. At 256 resolution the mouth and teeth already look fairly blurry, and with face enhancement enabled the teeth look quite bad.

The fidelity of the generated faces is poor

Thank you for your great work!
I ran some experiments and found that the fidelity of the generated faces is poor: the generated person does not closely resemble the one in the original video.
Could you give me some suggestions?

CUDA error when running on a Mac with CPU

python -m scripts.inference --inference_config configs/inference/demo.yaml
add ffmpeg to path
Loads checkpoint by local backend from path: ./models/dwpose/dw-ll_ucoco_384.pth
{'task_1': {'video_path': 'data/video/demo.mov', 'audio_path': 'data/audio/demo.wav', 'bbox_shift': -7}}
/Users/sukai/Documents/ai/MuseTalk/musetalk/whisper/whisper/transcribe.py:79: UserWarning: FP16 is not supported on CPU; using FP32 instead
  warnings.warn("FP16 is not supported on CPU; using FP32 instead")
video in 25 FPS, audio idx in 50FPS
extracting landmarks...time consuming
reading images...
0it [00:00, ?it/s]
get key_landmark and face bounding boxes with the bbox_shift: -7
0it [00:00, ?it/s]
********************************************bbox_shift parameter adjustment**********************************************************
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/Users/hotea/Documents/ai/MuseTalk/scripts/inference.py", line 145, in <module>
    main(args)
  File "/Users/hotea/Documents/ai/MuseTalk/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/hotea/Documents/ai/MuseTalk/scripts/inference.py", line 115, in main
    combine_frame = get_image(ori_frame,res_frame,bbox)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/hotea/Documents/ai/MuseTalk/musetalk/utils/blending.py", line 41, in get_image
    mask_image = face_seg(face_large)
                 ^^^^^^^^^^^^^^^^^^^^
  File "/Users/hotea/Documents/ai/MuseTalk/musetalk/utils/blending.py", line 17, in face_seg
    seg_image = fp(image)
                ^^^^^^^^^
  File "/Users/hotea/Documents/ai/MuseTalk/musetalk/utils/face_parsing/__init__.py", line 38, in __call__
    img = torch.unsqueeze(img, 0).cuda()
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/hotea/Documents/ai/MuseTalk/venv/lib/python3.12/site-packages/torch/cuda/__init__.py", line 293, in _lazy_init
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled
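
The crash comes from an unconditional .cuda() call in musetalk/utils/face_parsing/__init__.py. A minimal device-agnostic patch (a sketch, assuming the module keeps no other CUDA-only state) is to resolve the device once and move tensors with .to(device):

import torch

# Pick CUDA when available; fall back to CPU (e.g. on macOS).
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

img = torch.rand(3, 512, 512)  # stand-in for the preprocessed face tensor

# Original line (crashes on CPU-only builds):
#     img = torch.unsqueeze(img, 0).cuda()
# Device-agnostic replacement:
img = torch.unsqueeze(img, 0).to(device)

The face-parsing model itself, and the other hard-coded .cuda() calls in the repo, would need the same treatment for a full CPU run.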

Running the realtime inference test example raises ZeroDivisionError: float division by zero

python -m scripts.realtime_inference --inference_config configs/inference/realtime.yaml

Loads checkpoint by local backend from path: ./models/dwpose/dw-ll_ucoco_384.pth
cuda start
D:\model\MuseTalk\.glut\lib\site-packages\torch\utils\_contextlib.py:125: UserWarning: Decorating classes is deprecated and will be disabled in future versions. You should only decorate functions or methods. To preserve the current behavior of class decoration, you can directly decorate the __init__ method and nothing else.
  warnings.warn("Decorating classes is deprecated and will be disabled in "
{'avator_1': {'preparation': False, 'bbox_shift': 5, 'video_path': 'data/video/sun.mp4', 'audio_clips': {'audio_0': 'data/audio/yongen.wav', 'audio_1': 'data/audio/sun.wav'}}}
reading images...
100%|█████████████████████████████████████████████████████████████████████████████| 1100/1100 [00:05<00:00, 195.67it/s]
reading images...
100%|████████████████████████████████████████████████████████████████████████████| 1100/1100 [00:00<00:00, 6295.40it/s]
Inferring using: data/audio/yongen.wav
video in 25 FPS, audio idx in 50FPS
start inference
202
processing audio:data/audio/yongen.wav costs 0.0ms
 2%|█▋        | 1/51 [00:01<01:18, 1.57s/it]
Generating the 0-th frame with FPS: 1.76
Exception in thread Thread-4 (process_frames):
Traceback (most recent call last):
  File "D:\model\MuseTalk\.glut\lib\threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "D:\model\MuseTalk\.glut\lib\threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "D:\model\MuseTalk\scripts\realtime_inference.py", line 208, in process_frames
    fps = 1/(time.time()-start)
ZeroDivisionError: float division by zero
100%|██████████████████████████████████████████████████████████████████████████████████| 51/51 [00:07<00:00, 7.19it/s]
ffmpeg -y -v fatal -r 25 -f image2 -i ./results/avatars/avator_1/tmp/%08d.png -vcodec libx264 -vf format=rgb24,scale=out_color_matrix=bt709,format=yuv420p -crf 18 ./results/avatars/avator_1/temp.mp4
ffmpeg -y -v fatal -i data/audio/yongen.wav -i ./results/avatars/avator_1/temp.mp4 ./results/avatars/avator_1/vid_output/audio_0.mp4
result is save to ./results/avatars/avator_1/vid_output/audio_0.mp4
Inferring using: data/audio/sun.wav
video in 25 FPS, audio idx in 50FPS
start inference
564
processing audio:data/audio/sun.wav costs 0.0ms
  0%|          | 0/141 [00:00<?, ?it/s]
Generating the 0-th frame with FPS: 10.45
Exception in thread Thread-7 (process_frames):
Traceback (most recent call last):
  File "D:\model\MuseTalk\.glut\lib\threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "D:\model\MuseTalk\.glut\lib\threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "D:\model\MuseTalk\scripts\realtime_inference.py", line 208, in process_frames
    fps = 1/(time.time()-start)
ZeroDivisionError: float division by zero
100%|████████████████████████████████████████████████████████████████████████████████| 141/141 [00:13<00:00, 10.42it/s]
ffmpeg -y -v fatal -r 25 -f image2 -i ./results/avatars/avator_1/tmp/%08d.png -vcodec libx264 -vf format=rgb24,scale=out_color_matrix=bt709,format=yuv420p -crf 18 ./results/avatars/avator_1/temp.mp4
ffmpeg -y -v fatal -i data/audio/sun.wav -i ./results/avatars/avator_1/temp.mp4 ./results/avatars/avator_1/vid_output/audio_1.mp4
result is save to ./results/avatars/avator_1/vid_output/audio_1.mp4
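
The traceback points at fps = 1/(time.time()-start) in process_frames. On Windows, time.time() can tick in roughly 15 ms steps, so a fast iteration measures zero elapsed time. A hedged sketch of a fix, using the high-resolution clock plus a guard (variable names follow the traceback):

import time

start = time.perf_counter()  # high-resolution clock; time.time() may return identical values on Windows
# ... process one frame here ...
elapsed = time.perf_counter() - start
fps = 1.0 / elapsed if elapsed > 0 else float("inf")

Note that the run still completes (both output videos are written above); only the logging thread dies.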

It's just not working on Windows 11

I am trying with the attached files, and even after waiting two hours there is no progress. Am I doing something wrong?

image.zip

test.yaml

task_0:
 video_path: "data/image/face.jpeg"
 audio_path: "data/audio/audio.wav"

C:\sd\MuseTalk>venv\scripts\activate

(venv) C:\sd\MuseTalk>python -m scripts.inference --inference_config configs/inference/test.yaml
add ffmpeg to path
Loads checkpoint by local backend from path: ./models/dwpose/dw-ll_ucoco_384.pth
cuda start
{'task_0': {'video_path': 'data/image/face.jpeg', 'audio_path': 'data/audio/audio.wav'}}
video in 25 FPS, audio idx in 50FPS
extracting landmarks...time consuming
reading images...
100%|████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<?, ?it/s]
get key_landmark and face bounding boxes with the default value
100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:02<00:00,  2.33s/it]
********************************************bbox_shift parameter adjustment**********************************************************
Total frame:「1」 Manually adjust range : [ -23~25 ] , the current value: 0
*************************************************************************************************************************************
start inference
  0%|                                                                                            | 0/3 [00:00<?, ?it/s]


FFmpeg is in path

C:\Users\nitin>ffmpeg
ffmpeg version 6.1-full_build-www.gyan.dev Copyright (c) 2000-2023 the FFmpeg developers
  built with gcc 12.2.0 (Rev10, Built by MSYS2 project)
  configuration: --enable-gpl --enable-version3 --enable-static --pkg-config=pkgconf --disable-w32threads --disable-autodetect --enable-fontconfig --enable-iconv --enable-gnutls --enable-libxml2 --enable-gmp --enable-bzlib --enable-lzma --enable-libsnappy --enable-zlib --enable-librist --enable-libsrt --enable-libssh --enable-libzmq --enable-avisynth --enable-libbluray --enable-libcaca --enable-sdl2 --enable-libaribb24 --enable-libaribcaption --enable-libdav1d --enable-libdavs2 --enable-libuavs3d --enable-libzvbi --enable-librav1e --enable-libsvtav1 --enable-libwebp --enable-libx264 --enable-libx265 --enable-libxavs2 --enable-libxvid --enable-libaom --enable-libjxl --enable-libopenjpeg --enable-libvpx --enable-mediafoundation --enable-libass --enable-frei0r --enable-libfreetype --enable-libfribidi --enable-libharfbuzz --enable-liblensfun --enable-libvidstab --enable-libvmaf --enable-libzimg --enable-amf --enable-cuda-llvm --enable-cuvid --enable-ffnvcodec --enable-nvdec --enable-nvenc --enable-dxva2 --enable-d3d11va --enable-libvpl --enable-libshaderc --enable-vulkan --enable-libplacebo --enable-opencl --enable-libcdio --enable-libgme --enable-libmodplug --enable-libopenmpt --enable-libopencore-amrwb --enable-libmp3lame --enable-libshine --enable-libtheora --enable-libtwolame --enable-libvo-amrwbenc --enable-libcodec2 --enable-libilbc --enable-libgsm --enable-libopencore-amrnb --enable-libopus --enable-libspeex --enable-libvorbis --enable-ladspa --enable-libbs2b --enable-libflite --enable-libmysofa --enable-librubberband --enable-libsoxr --enable-chromaprint
  libavutil      58. 29.100 / 58. 29.100
  libavcodec     60. 31.102 / 60. 31.102
  libavformat    60. 16.100 / 60. 16.100
  libavdevice    60.  3.100 / 60.  3.100
  libavfilter     9. 12.100 /  9. 12.100
  libswscale      7.  5.100 /  7.  5.100
  libswresample   4. 12.100 /  4. 12.100
  libpostproc    57.  3.100 / 57.  3.100
Hyper fast Audio and Video encoder
usage: ffmpeg [options] [[infile options] -i infile]... {[outfile options] outfile}...

Use -h to get full help or, even better, run 'man ffmpeg'

Getting an error with CUDA 11.8, PyTorch 2.2.1, conda

ImportError: /home/ubuntu/miniconda3/envs/myenv/lib/python3.10/site-packages/mmcv/_ext.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN2at4_ops10zeros_like4callERKNS_6TensorEN3c108optionalINS5_10ScalarTypeEEENS6_INS5_6LayoutEEENS6_INS5_6DeviceEEENS6_IbEENS6_INS5_12MemoryFormatEEE
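
The undefined at::_ops::zeros_like symbol is the classic signature of an mmcv wheel built against a different PyTorch ABI than the installed torch 2.2.1. A quick diagnostic sketch to confirm the version pair before reinstalling:

# Compatibility check: prebuilt mmcv wheels are tied to a specific
# torch + CUDA build; a mismatch surfaces as "undefined symbol" at import.
import torch

print("torch:", torch.__version__, "| cuda:", torch.version.cuda)
try:
    import mmcv
    print("mmcv:", mmcv.__version__)
except ImportError as exc:
    print("mmcv failed to import:", exc)

Reinstalling mmcv with a wheel that matches the exact torch/CUDA pair (for example via openmim) usually resolves this.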

Why is the realtime script's actual generation time much slower than the FPS reported in the log?

  • Test data: the provided samples (video: sun.mp4 | audio: yongen.wav).
  • Script: realtime_inference, with the avatar data extracted in advance, so only the audio-clip generation step is measured.
  • Question 1: for a 202-frame clip, the log reports an average FPS of 77.5.
    • Why is processing done four frames at a time, and why is the first frame far slower than the later ones (roughly 10 vs. 100+)? (screenshot of the log)
  • Question 2: extrapolating from that FPS, the 202 frames should finish in a few seconds, but the measured wall time is 23 s (not counting the later frame padding and audio/video muxing). (screenshot of the timing)

KeyError: 'encoder_embeddings'

Traceback (most recent call last):
  File "/root/anaconda3/envs/musetalk/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/anaconda3/envs/musetalk/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/heqing/test/MuseTalk/scripts/inference.py", line 141, in <module>
    main(args)
  File "/root/anaconda3/envs/musetalk/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/heqing/test/MuseTalk/scripts/inference.py", line 55, in main
    whisper_feature = audio_processor.audio2feat(audio_path)
  File "/home/heqing/test/MuseTalk/musetalk/whisper/audio2feature.py", line 99, in audio2feat
    encoder_embeddings = emb['encoder_embeddings']
KeyError: 'encoder_embeddings'
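
The audio2feat call expects the dict returned by the whisper model bundled under musetalk/whisper, which is patched to expose encoder embeddings. A hedged sketch of a defensive check (the helper name and the shadowing hypothesis are assumptions, not from the repo):

def get_encoder_embeddings(emb: dict):
    """Fail with an actionable message instead of a bare KeyError."""
    if "encoder_embeddings" not in emb:
        raise RuntimeError(
            "whisper result has no 'encoder_embeddings' key "
            f"(got keys: {sorted(emb)}). MuseTalk relies on its bundled, "
            "patched whisper under musetalk/whisper; a stock whisper "
            "install on sys.path may be shadowing it."
        )
    return emb["encoder_embeddings"]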

Error loading face-parse-bisent when running the demo

(venv1) root@3bdaf96b00b0:/workspace/MuseTalk# python -m scripts.inference --inference_config configs/inference/test.yaml
add ffmpeg to path
Loads checkpoint by local backend from path: ./models/dwpose/dw-ll_ucoco_384.pth
cuda start
Traceback (most recent call last):
  File "/opt/conda/envs/musev/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/envs/musev/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/workspace/MuseTalk/scripts/inference.py", line 13, in <module>
    from musetalk.utils.preprocessing import get_landmark_and_bbox,read_imgs,coord_placeholder
  File "/workspace/MuseTalk/musetalk/utils/preprocessing.py", line 23, in <module>
    fa = FaceAlignment(LandmarksType._2D, flip_input=False,device=device)
  File "/workspace/MuseTalk/musetalk/utils/face_detection/api.py", line 69, in __init__
    self.face_detector = face_detector_module.FaceDetector(device=device, verbose=verbose)
  File "/workspace/MuseTalk/musetalk/utils/face_detection/detection/sfd/sfd_detector.py", line 22, in __init__
    model_weights = load_url(models_urls['s3fd'])
  File "/workspace/MuseTalk/venv1/lib/python3.10/site-packages/torch/hub.py", line 750, in load_state_dict_from_url
    return torch.load(cached_file, map_location=map_location)
  File "/workspace/MuseTalk/venv1/lib/python3.10/site-packages/torch/serialization.py", line 815, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/workspace/MuseTalk/venv1/lib/python3.10/site-packages/torch/serialization.py", line 1051, in _legacy_load
    typed_storage._untyped_storage._set_from_file(
RuntimeError: unexpected EOF, expected 1202694 more bytes. The file might be corrupted.
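
An "unexpected EOF ... might be corrupted" during load_url almost always means an interrupted download left a truncated checkpoint in the torch.hub cache. Clearing the cached file forces a fresh download on the next run; a sketch (the "s3fd" substring match is an assumption, check the actual filename in the cache directory):

import os
import torch

# Default cache location used by torch.hub.load_state_dict_from_url.
cache_dir = os.path.join(torch.hub.get_dir(), "checkpoints")
for name in os.listdir(cache_dir):
    if "s3fd" in name:
        path = os.path.join(cache_dir, name)
        print("removing truncated cache file:", path)
        os.remove(path)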

Poor results for tilted faces

I am not sure whether this is because the face sits near the edge of the frame, making the keypoint detection less accurate.

input-1.mp4
output-1.mp4

Single-step generation

for i, (whisper_batch, latent_batch) in enumerate(
    tqdm(gen, total=int(np.ceil(float(video_num) / batch_size)))
):
    # Audio features for this batch, moved to the UNet's device/dtype
    audio_feature_batch = torch.from_numpy(whisper_batch)
    audio_feature_batch = audio_feature_batch.to(
        device=unet.device, dtype=unet.model.dtype
    )  # torch, B, 5*N, 384
    audio_feature_batch = pe(audio_feature_batch)  # positional encoding
    latent_batch = latent_batch.to(dtype=unet.model.dtype)

    # One UNet forward conditioned on audio, then a VAE decode per batch
    pred_latents = unet.model(latent_batch, timesteps, encoder_hidden_states=audio_feature_batch).sample
    recon = vae.decode_latents(pred_latents)
    for res_frame in recon:
        res_frame_list.append(res_frame)

Looking at the code, it seems the UNet runs only a single forward pass for each generated frame. Is my understanding correct? If so, does this still count as a diffusion model?

bbox without range check

python -m scripts.inference --inference_config results/musetalk_test_result.yaml

video in 25.0 FPS, audio idx in 50FPS
extracting landmarks...time consuming
reading images...
100%|████████████████████████████████████████████████████████████████████| 250/250 [00:01<00:00, 207.18it/s]
get key_landmark and face bounding boxes with the default value
100%|█████████████████████████████████████████████████████████████████████| 250/250 [00:13<00:00, 19.16it/s]
********************************************bbox_shift parameter adjustment**********************************************************
Total frame:「250」 Manually adjust range : [ -16~16 ] , the current value: 0


(61, 0, 277, 308)
(61, -1, 277, 309)
Traceback (most recent call last):
  File "/envs/musetalk/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/envs/musetalk/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/MuseTalk/scripts/inference.py", line 153, in <module>
    main(args)
  File "/envs/musetalk/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/MuseTalk/scripts/inference.py", line 86, in main
    crop_frame = cv2.resize(crop_frame,(256,256),interpolation = cv2.INTER_LANCZOS4)
cv2.error: OpenCV(4.9.0) /io/opencv/modules/imgproc/src/resize.cpp:4152: error: (-215:Assertion failed) !ssize.empty() in function 'resize'

I printed the bbox during inference and found that a negative bbox index causes this bug.
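
A hedged sketch of the missing range check: clamping the detected box to the frame before cropping guarantees cv2.resize never sees an empty crop (variable names are illustrative, not the repo's):

def clamp_bbox(bbox, frame_w, frame_h):
    """Clip bbox corners into [0, w] x [0, h] so the crop is never empty."""
    x1, y1, x2, y2 = bbox
    x1, y1 = max(0, x1), max(0, y1)
    x2, y2 = min(frame_w, x2), min(frame_h, y2)
    return x1, y1, x2, y2

# Usage before the crop in inference.py:
# h, w = frame.shape[:2]
# x1, y1, x2, y2 = clamp_bbox(bbox, w, h)
# crop_frame = frame[y1:y2, x1:x2]
# crop_frame = cv2.resize(crop_frame, (256, 256), interpolation=cv2.INTER_LANCZOS4)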

An optimization tip from a beginner

I generate the final video in two steps:
Step 1: generate a video with the bbox parameter at its minimum.
Step 2: feed that output back in and generate again with bbox set to 5.

The result seems noticeably better. This matches others' experience that lip-sync quality is best when the source video keeps the mouth closed throughout.
I'm not a professional, so I don't know whether this is useful to the project.

imageio_ffmpeg saving video error

Hi, can anyone help me resolve this issue? Thank you!

I encountered the error below when running app.py, specifically imageio.mimwrite(output_video, images, 'FFMPEG', fps=fps, codec='libx264', pixelformat='yuv420p')

Error message:

File "/usr/local/lib/python3.10/dist-packages/imageio_ffmpeg/_io.py", line 627, in write_frames p.stdin.write(bb) BrokenPipeError: [Errno 32] Broken pipe

Some questions about real-time performance

Hello, in our deployment we found that, inside the VAE step, copying data from GPU to CPU takes an enormous amount of time. Put simply, without this step we can reach 60+ FPS in real time, but with it our throughput drops to about 30 FPS. Is there any way to optimize this? Is it caused by the GPU's memory bus width? Our test environment is an RTX 4090.
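
One hedged direction (a sketch with illustrative names, not the repo's code): stage the decoded frames in pinned host memory and copy with non_blocking=True on a side stream, so the device-to-host transfer overlaps the next batch's compute. Transferring fp16 or uint8 instead of fp32 also shrinks the bytes moved over PCIe.

import torch

copy_stream = torch.cuda.Stream()

def async_to_cpu(frames_gpu: torch.Tensor) -> torch.Tensor:
    # Pinned (page-locked) memory enables truly asynchronous DMA transfers.
    host_buf = torch.empty(frames_gpu.shape, dtype=frames_gpu.dtype,
                           device="cpu", pin_memory=True)
    with torch.cuda.stream(copy_stream):
        host_buf.copy_(frames_gpu, non_blocking=True)
    copy_stream.synchronize()  # defer this until the frames are actually needed
    return host_buf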

The final compositing step is too slow

Every run becomes very slow at the "pad talking image to original video" stage: even on a 4090 it only manages about 10 it/s, while CPU, GPU, and memory utilization all stay very low. I wonder whether there is room for optimization here.
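
Since that stage is per-frame CPU blending while the GPU sits idle, a hedged sketch is to parallelize it (get_image is the repo's blending helper; the surrounding variable names are assumptions):

from concurrent.futures import ThreadPoolExecutor

def blend_one(args):
    ori_frame, res_frame, bbox = args
    return get_image(ori_frame, res_frame, bbox)  # repo's blending helper

# Threads help because OpenCV/numpy release the GIL for heavy image ops;
# switch to ProcessPoolExecutor if profiling shows the GIL is the bottleneck.
with ThreadPoolExecutor(max_workers=8) as pool:
    combined = list(pool.map(blend_one, zip(frames, res_frame_list, coord_list)))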

How can I maximize efficiency or trade quality for speed?

The README's real-time figures were measured on a V100, and after testing, the real-time experience is indeed better than expected! But if I want to run real-time inference on a lower-end GPU and can accept some degradation in the output frames, are there parameters or methods I could look into?
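
A hedged sketch of two knobs that commonly help on smaller GPUs (the attribute paths follow the names used in the quoted inference snippet above but should be treated as assumptions): half-precision weights and a smaller inference batch.

import torch

# fp16 roughly halves memory traffic and VRAM use at a small quality cost.
unet.model = unet.model.half()
vae.vae = vae.vae.half()

# Smaller batches fit lower-VRAM cards, at the cost of throughput.
batch_size = 4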

How is real-time video generation achieved?

"During online chatting, only UNet and the VAE decoder are involved, which makes MuseTalk real-time."

How is this actually done?

Extremely slow at "cuda start"

Loading: ComfyUI-Manager (V2.9)

ComfyUI Revision: 2052 [55f37baa] | Released on '2024-03-07'

[ComfyUI-Manager] default cache updated: https://gitcode.net/ranting8323/ComfyUI-Manager/-/raw/main/alter-list.json
[ComfyUI-Manager] default cache updated: https://gitcode.net/ranting8323/ComfyUI-Manager/-/raw/main/custom-node-list.json
[ComfyUI-Manager] default cache updated: https://gitcode.net/ranting8323/ComfyUI-Manager/-/raw/main/model-list.json
[ComfyUI-Manager] default cache updated: https://gitcode.net/ranting8323/ComfyUI-Manager/-/raw/main/extension-node-map.json
Loads checkpoint by local backend from path: C:\ComfyUI-aki-v1.3\models\diffusers\TMElyralab/MuseTalk/dwpose/dw-ll_ucoco_384.pth
cuda start
