tmelyralab / musetalk
MuseTalk: Real-Time High Quality Lip Synchronization with Latent Space Inpainting
License: Other
The lower teeth are too long, and after synthesis they are fused together; it would be better if they could be made shorter. At 256 resolution the mouth and teeth still look fairly blurry, but with face enhancement enabled the teeth look terrible.
Thanks for your great work!
I ran some experiments and found that the fidelity of the generated faces is poor; the generated person does not closely resemble the one in the original video.
Could you give me some suggestions?
python -m scripts.inference --inference_config configs/inference/demo.yaml
add ffmpeg to path
Loads checkpoint by local backend from path: ./models/dwpose/dw-ll_ucoco_384.pth
{'task_1': {'video_path': 'data/video/demo.mov', 'audio_path': 'data/audio/demo.wav', 'bbox_shift': -7}}
/Users/sukai/Documents/ai/MuseTalk/musetalk/whisper/whisper/transcribe.py:79: UserWarning: FP16 is not supported on CPU; using FP32 instead
warnings.warn("FP16 is not supported on CPU; using FP32 instead")
video in 25 FPS, audio idx in 50FPS
extracting landmarks...time consuming
reading images...
0it [00:00, ?it/s]
get key_landmark and face bounding boxes with the bbox_shift: -7
0it [00:00, ?it/s]
********************************************bbox_shift parameter adjustment**********************************************************
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/Users/hotea/Documents/ai/MuseTalk/scripts/inference.py", line 145, in <module>
main(args)
File "/Users/hotea/Documents/ai/MuseTalk/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/Users/hotea/Documents/ai/MuseTalk/scripts/inference.py", line 115, in main
combine_frame = get_image(ori_frame,res_frame,bbox)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/hotea/Documents/ai/MuseTalk/musetalk/utils/blending.py", line 41, in get_image
mask_image = face_seg(face_large)
^^^^^^^^^^^^^^^^^^^^
File "/Users/hotea/Documents/ai/MuseTalk/musetalk/utils/blending.py", line 17, in face_seg
seg_image = fp(image)
^^^^^^^^^
File "/Users/hotea/Documents/ai/MuseTalk/musetalk/utils/face_parsing/__init__.py", line 38, in __call__
img = torch.unsqueeze(img, 0).cuda()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/hotea/Documents/ai/MuseTalk/venv/lib/python3.12/site-packages/torch/cuda/__init__.py", line 293, in _lazy_init
raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled
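A note for anyone hitting this on a Mac or a CPU-only box: the traceback shows that musetalk/utils/face_parsing/__init__.py hard-codes .cuda(). A minimal sketch of a device-aware fallback (the selection logic below is an assumption, not the repo's code):

import torch

# Pick the best available backend instead of assuming CUDA.
device = torch.device(
    "cuda" if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available()
    else "cpu"
)

# In FaceParsing.__call__, the hard-coded transfer
#     img = torch.unsqueeze(img, 0).cuda()
# would then become
#     img = torch.unsqueeze(img, 0).to(device)
# (the model itself must be moved to the same device).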
Could you update your paper link?
python -m scripts.realtime_inference --inference_config configs/inference/realtime.yaml
Loads checkpoint by local backend from path: ./models/dwpose/dw-ll_ucoco_384.pth
cuda start
D:\model\MuseTalk.glut\lib\site-packages\torch\utils\_contextlib.py:125: UserWarning: Decorating classes is deprecated and will be disabled in future versions. You should only decorate functions or methods. To preserve the current behavior of class decoration, you can directly decorate the __init__ method and nothing else.
warnings.warn("Decorating classes is deprecated and will be disabled in "
{'avator_1': {'preparation': False, 'bbox_shift': 5, 'video_path': 'data/video/sun.mp4', 'audio_clips': {'audio_0': 'data/audio/yongen.wav', 'audio_1': 'data/audio/sun.wav'}}}
reading images...
100%|█████████████████████████████████████████████████████████████████████████████| 1100/1100 [00:05<00:00, 195.67it/s]
reading images...
100%|████████████████████████████████████████████████████████████████████████████| 1100/1100 [00:00<00:00, 6295.40it/s]
Inferring using: data/audio/yongen.wav
video in 25 FPS, audio idx in 50FPS
start inference
202
processing audio:data/audio/yongen.wav costs 0.0ms
2%|█▋                                                                       | 1/51 [00:01<01:18, 1.57s/it]
Generating the 0-th frame with FPS: 1.76
Exception in thread Thread-4 (process_frames):
Traceback (most recent call last):
File "D:\model\MuseTalk.glut\lib\threading.py", line 1016, in _bootstrap_inner
self.run()
File "D:\model\MuseTalk.glut\lib\threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "D:\model\MuseTalk\scripts\realtime_inference.py", line 208, in process_frames
fps = 1/(time.time()-start)
ZeroDivisionError: float division by zero
100%|██████████████████████████████████████████████████████████████████████████████████| 51/51 [00:07<00:00, 7.19it/s]
ffmpeg -y -v fatal -r 25 -f image2 -i ./results/avatars/avator_1/tmp/%08d.png -vcodec libx264 -vf format=rgb24,scale=out_color_matrix=bt709,format=yuv420p -crf 18 ./results/avatars/avator_1/temp.mp4
ffmpeg -y -v fatal -i data/audio/yongen.wav -i ./results/avatars/avator_1/temp.mp4 ./results/avatars/avator_1/vid_output/audio_0.mp4
result is save to ./results/avatars/avator_1/vid_output/audio_0.mp4
Inferring using: data/audio/sun.wav
video in 25 FPS, audio idx in 50FPS
start inference
564
processing audio:data/audio/sun.wav costs 0.0ms
0%|                                                                         | 0/141 [00:00<?, ?it/s]
Generating the 0-th frame with FPS: 10.45
Exception in thread Thread-7 (process_frames):
Traceback (most recent call last):
File "D:\model\MuseTalk.glut\lib\threading.py", line 1016, in _bootstrap_inner
self.run()
File "D:\model\MuseTalk.glut\lib\threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "D:\model\MuseTalk\scripts\realtime_inference.py", line 208, in process_frames
fps = 1/(time.time()-start)
ZeroDivisionError: float division by zero
100%|████████████████████████████████████████████████████████████████████████████████| 141/141 [00:13<00:00, 10.42it/s]
ffmpeg -y -v fatal -r 25 -f image2 -i ./results/avatars/avator_1/tmp/%08d.png -vcodec libx264 -vf format=rgb24,scale=out_color_matrix=bt709,format=yuv420p -crf 18 ./results/avatars/avator_1/temp.mp4
ffmpeg -y -v fatal -i data/audio/sun.wav -i ./results/avatars/avator_1/temp.mp4 ./results/avatars/avator_1/vid_output/audio_1.mp4
result is save to ./results/avatars/avator_1/vid_output/audio_1.mp4
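For reference, the ZeroDivisionError in both runs comes from realtime_inference.py computing fps = 1/(time.time()-start); on Windows, time.time() is coarse enough that two calls within the same frame can return identical values. A hedged sketch of a guard (not the repo's code):

import time

start = time.perf_counter()  # finer resolution than time.time() on Windows
# ... produce one frame ...
elapsed = time.perf_counter() - start
fps = 1.0 / elapsed if elapsed > 0 else float("inf")  # avoid dividing by zero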
I am trying with the attached files, and even after waiting for two hours there is no progress. Am I doing something wrong?
test.yaml
task_0:
  video_path: "data/image/face.jpeg"
  audio_path: "data/audio/audio.wav"
C:\sd\MuseTalk>venv\scripts\activate
(venv) C:\sd\MuseTalk>python -m scripts.inference --inference_config configs/inference/test.yaml
add ffmpeg to path
Loads checkpoint by local backend from path: ./models/dwpose/dw-ll_ucoco_384.pth
cuda start
{'task_0': {'video_path': 'data/image/face.jpeg', 'audio_path': 'data/audio/audio.wav'}}
video in 25 FPS, audio idx in 50FPS
extracting landmarks...time consuming
reading images...
100%|████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<?, ?it/s]
get key_landmark and face bounding boxes with the default value
100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:02<00:00, 2.33s/it]
********************************************bbox_shift parameter adjustment**********************************************************
Total frame:「1」 Manually adjust range : [ -23~25 ] , the current value: 0
*************************************************************************************************************************************
start inference
0%| | 0/3 [00:00<?, ?it/s]
FFmpeg is in path
C:\Users\nitin>ffmpeg
ffmpeg version 6.1-full_build-www.gyan.dev Copyright (c) 2000-2023 the FFmpeg developers
built with gcc 12.2.0 (Rev10, Built by MSYS2 project)
configuration: --enable-gpl --enable-version3 --enable-static --pkg-config=pkgconf --disable-w32threads --disable-autodetect --enable-fontconfig --enable-iconv --enable-gnutls --enable-libxml2 --enable-gmp --enable-bzlib --enable-lzma --enable-libsnappy --enable-zlib --enable-librist --enable-libsrt --enable-libssh --enable-libzmq --enable-avisynth --enable-libbluray --enable-libcaca --enable-sdl2 --enable-libaribb24 --enable-libaribcaption --enable-libdav1d --enable-libdavs2 --enable-libuavs3d --enable-libzvbi --enable-librav1e --enable-libsvtav1 --enable-libwebp --enable-libx264 --enable-libx265 --enable-libxavs2 --enable-libxvid --enable-libaom --enable-libjxl --enable-libopenjpeg --enable-libvpx --enable-mediafoundation --enable-libass --enable-frei0r --enable-libfreetype --enable-libfribidi --enable-libharfbuzz --enable-liblensfun --enable-libvidstab --enable-libvmaf --enable-libzimg --enable-amf --enable-cuda-llvm --enable-cuvid --enable-ffnvcodec --enable-nvdec --enable-nvenc --enable-dxva2 --enable-d3d11va --enable-libvpl --enable-libshaderc --enable-vulkan --enable-libplacebo --enable-opencl --enable-libcdio --enable-libgme --enable-libmodplug --enable-libopenmpt --enable-libopencore-amrwb --enable-libmp3lame --enable-libshine --enable-libtheora --enable-libtwolame --enable-libvo-amrwbenc --enable-libcodec2 --enable-libilbc --enable-libgsm --enable-libopencore-amrnb --enable-libopus --enable-libspeex --enable-libvorbis --enable-ladspa --enable-libbs2b --enable-libflite --enable-libmysofa --enable-librubberband --enable-libsoxr --enable-chromaprint
libavutil 58. 29.100 / 58. 29.100
libavcodec 60. 31.102 / 60. 31.102
libavformat 60. 16.100 / 60. 16.100
libavdevice 60. 3.100 / 60. 3.100
libavfilter 9. 12.100 / 9. 12.100
libswscale 7. 5.100 / 7. 5.100
libswresample 4. 12.100 / 4. 12.100
libpostproc 57. 3.100 / 57. 3.100
Hyper fast Audio and Video encoder
usage: ffmpeg [options] [[infile options] -i infile]... {[outfile options] outfile}...
Use -h to get full help or, even better, run 'man ffmpeg'
Suggestion: add real-time interaction via live streaming (push-stream output).
In the audio I supplied for inference there are silent stretches in the middle and at the end, but the mouth keeps moving through them. How can this be improved?
When the process dies, the previously generated PNG frames are preserved, but generation cannot resume from the breakpoint. Is there a way to continue from the frames already produced?
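One possible workaround, assuming the frames are written with the %08d.png pattern visible in the ffmpeg commands elsewhere in this thread (a sketch, not a feature of the repo): find the first missing frame index and have the generation loop skip everything before it.

import os

def first_missing_frame(tmp_dir):
    # Frames are assumed to be named 00000000.png, 00000001.png, ...
    i = 0
    while os.path.exists(os.path.join(tmp_dir, f"{i:08d}.png")):
        i += 1
    return i

start_idx = first_missing_frame("./results/avatars/avator_1/tmp")
# ...then skip generation for frame indices < start_idx.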
ImportError: /home/ubuntu/miniconda3/envs/myenv/lib/python3.10/site-packages/mmcv/_ext.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN2at4_ops10zeros_like4callERKNS_6TensorEN3c108optionalINS5_10ScalarTypeEEENS6_INS5_6LayoutEEENS6_INS5_6DeviceEEENS6_IbEENS6_INS5_12MemoryFormatEEE
It is a nice project, and it seems to support video dubbing, so please add dubbing example code.
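Dubbing is effectively just inference over the original footage with the replacement audio track, so a config in the style of the repo's existing YAML files should work. A hedged sketch (the file name and paths are placeholders):

# configs/inference/dub.yaml (hypothetical name)
dub_task:
  video_path: "data/video/original.mp4"      # source footage to re-lip-sync
  audio_path: "data/audio/dubbed_track.wav"  # audio in the target language

# then:
# python -m scripts.inference --inference_config configs/inference/dub.yaml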
On a 4090, driving with a 15-second audio clip currently takes about 2 minutes per conversion.
Are there plans to support Huawei's Ascend platform?
ImportError: cannot import name 'ForkProcess' from 'multiprocessing.context'
Are there plans to release the training code or the paper? If so, roughly when?
Traceback (most recent call last):
File "/root/anaconda3/envs/musetalk/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/root/anaconda3/envs/musetalk/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/heqing/test/MuseTalk/scripts/inference.py", line 141, in
main(args)
File "/root/anaconda3/envs/musetalk/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/heqing/test/MuseTalk/scripts/inference.py", line 55, in main
whisper_feature = audio_processor.audio2feat(audio_path)
File "/home/heqing/test/MuseTalk/musetalk/whisper/audio2feature.py", line 99, in audio2feat
encoder_embeddings = emb['encoder_embeddings']
KeyError: 'encoder_embeddings'
(venv1) root@3bdaf96b00b0:/workspace/MuseTalk# python -m scripts.inference --inference_config configs/inference/test.yaml
add ffmpeg to path
Loads checkpoint by local backend from path: ./models/dwpose/dw-ll_ucoco_384.pth
cuda start
Traceback (most recent call last):
File "/opt/conda/envs/musev/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/envs/musev/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/workspace/MuseTalk/scripts/inference.py", line 13, in
from musetalk.utils.preprocessing import get_landmark_and_bbox,read_imgs,coord_placeholder
File "/workspace/MuseTalk/musetalk/utils/preprocessing.py", line 23, in
fa = FaceAlignment(LandmarksType._2D, flip_input=False,device=device)
File "/workspace/MuseTalk/musetalk/utils/face_detection/api.py", line 69, in init
self.face_detector = face_detector_module.FaceDetector(device=device, verbose=verbose)
File "/workspace/MuseTalk/musetalk/utils/face_detection/detection/sfd/sfd_detector.py", line 22, in init
model_weights = load_url(models_urls['s3fd'])
File "/workspace/MuseTalk/venv1/lib/python3.10/site-packages/torch/hub.py", line 750, in load_state_dict_from_url
return torch.load(cached_file, map_location=map_location)
File "/workspace/MuseTalk/venv1/lib/python3.10/site-packages/torch/serialization.py", line 815, in load
return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
File "/workspace/MuseTalk/venv1/lib/python3.10/site-packages/torch/serialization.py", line 1051, in _legacy_load
typed_storage._untyped_storage._set_from_file(
RuntimeError: unexpected EOF, expected 1202694 more bytes. The file might be corrupted.
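The "unexpected EOF" almost always means an interrupted download left a truncated s3fd checkpoint in the torch hub cache; deleting the partial file forces a clean re-download on the next run. A sketch (the exact file name in your cache may differ):

import glob
import os

import torch

cache_dir = os.path.join(torch.hub.get_dir(), "checkpoints")
for path in glob.glob(os.path.join(cache_dir, "*s3fd*")):
    print("removing truncated checkpoint:", path)
    os.remove(path)
# Re-running scripts.inference will download the weights again.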
I'm not sure whether this is caused by the face being near the edge of the frame, making the landmark detection insufficiently accurate.
for i, (whisper_batch, latent_batch) in enumerate(
        tqdm(gen, total=int(np.ceil(float(video_num) / batch_size)))):
    audio_feature_batch = torch.from_numpy(whisper_batch)
    audio_feature_batch = audio_feature_batch.to(
        device=unet.device, dtype=unet.model.dtype
    )  # torch, B, 5*N, 384
    audio_feature_batch = pe(audio_feature_batch)
    latent_batch = latent_batch.to(dtype=unet.model.dtype)
    pred_latents = unet.model(latent_batch, timesteps,
                              encoder_hidden_states=audio_feature_batch).sample
    recon = vae.decode_latents(pred_latents)
    for res_frame in recon:
        res_frame_list.append(res_frame)
From the code it looks like the UNet only does a single forward pass for each generated frame? Is my understanding correct? If so, does this still count as a diffusion model?
python -m scripts.inference --inference_config results/musetalk_test_result.yaml
video in 25.0 FPS, audio idx in 50FPS
extracting landmarks...time consuming
reading images...
100%|████████████████████████████████████████████████████████████████████| 250/250 [00:01<00:00, 207.18it/s]
get key_landmark and face bounding boxes with the default value
100%|█████████████████████████████████████████████████████████████████████| 250/250 [00:13<00:00, 19.16it/s]
********************************************bbox_shift parameter adjustment**********************************************************
Total frame:「250」 Manually adjust range : [ -16~16 ] , the current value: 0
(61, 0, 277, 308)
(61, -1, 277, 309)
Traceback (most recent call last):
File "/envs/musetalk/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/envs/musetalk/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/MuseTalk/scripts/inference.py", line 153, in
main(args)
File "/envs/musetalk/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(args, kwargs)
File "/MuseTalk/scripts/inference.py", line 86, in main
crop_frame = cv2.resize(crop_frame,(256,256),interpolation = cv2.INTER_LANCZOS4)
cv2.error: OpenCV(4.9.0) /io/opencv/modules/imgproc/src/resize.cpp:4152: error: (-215:Assertion failed) !ssize.empty() in function 'resize'
I printed the bbox during inference and found that the negative bbox index causes this bug.
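One way to harden this, sketched under the assumption that bbox is (x1, y1, x2, y2) as printed above: clamp the box to the frame bounds before cropping, so cv2.resize never receives an empty image.

import cv2

def safe_crop_resize(frame, bbox, size=(256, 256)):
    h, w = frame.shape[:2]
    x1, y1, x2, y2 = bbox
    x1, y1 = max(0, x1), max(0, y1)   # clamp the negative indices
    x2, y2 = min(w, x2), min(h, y2)
    if x2 <= x1 or y2 <= y1:
        raise ValueError(f"degenerate bbox after clamping: {bbox}")
    crop = frame[y1:y2, x1:x2]
    return cv2.resize(crop, size, interpolation=cv2.INTER_LANCZOS4)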
Would training the UNet with added noise work better? It seems to be used only as a feature extractor here.
The docs only describe generating video from video. What I want is to generate lip-sync animation directly from audio or text and feed it into Unity to drive a 3D character.
In close-up shots the mouth becomes odd-looking and very blurry, while medium shots look fine. Hope this can be fixed; thanks to the authors.
I used a two-step process to generate the final video:
Step 1: generate a video with the bbox parameter at its minimum;
Step 2: feed the result of step 1 back in and generate again with bbox set to 5.
The result seems much better this way. Going by other people's experience, lip-fitting on a video whose mouth stays closed throughout also seems to work better.
I'm not a professional, so I don't know whether this is of any help to the project~
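For anyone who wants to try this, the two passes can be expressed as two configs in the repo's existing YAML format (the file names and values here are hypothetical):

# pass1.yaml -- bbox_shift at the minimum of the range the tool prints
task_0:
  video_path: "data/video/input.mp4"
  audio_path: "data/audio/input.wav"
  bbox_shift: -9   # use the minimum printed for your own video

# pass2.yaml -- feed the pass-1 output back in
task_0:
  video_path: "results/input_input.mp4"   # pass-1 result (actual path is printed by the script)
  audio_path: "data/audio/input.wav"
  bbox_shift: 5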
On Windows I downloaded ffmpeg-static and set the environment variable, but it still doesn't work. Why?
Hi, can anyone help me resolve this issue? Thank you!
I encountered the error below when running app.py, specifically at imageio.mimwrite(output_video, images, 'FFMPEG', fps=fps, codec='libx264', pixelformat='yuv420p').
Error message:
File "/usr/local/lib/python3.10/dist-packages/imageio_ffmpeg/_io.py", line 627, in write_frames
p.stdin.write(bb)
BrokenPipeError: [Errno 32] Broken pipe
When no one is speaking, the mouth is still open.
I tested in the following environment:
https://huggingface.co/spaces/TMElyralab/MuseTalk
I uploaded an 8-second video generated with MuseV, but in the Space above it came out as 2 seconds. Could you explain why?
The environment was created from requirements.txt; at runtime it raises ValueError: transformers.__spec__ is None.
Hello, we found a problem in our deployment: in the VAE stage, copying the data from the GPU to the CPU takes an enormous amount of time. Simply put, ignoring that step we can reach 60+ fps in real time, but because of it our throughput is only around 30 fps. Is there any way to optimize this? Is it caused by the GPU's memory-bus width? Our test environment is a 4090.
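It is probably less the bus width than the synchronous float32 copy. Two things usually help, sketched below as an assumption about the bottleneck rather than the repo's code: convert the decoded frames to uint8 while they are still on the GPU, which cuts the transferred bytes by 4x, and copy into pinned host memory with non_blocking=True so the transfer can overlap the next batch's compute.

import torch

def frames_to_cpu(frames_gpu):
    # Quantize on the GPU first: float32 -> uint8 is a 4x smaller transfer.
    u8 = frames_gpu.clamp(0, 255).to(torch.uint8)
    # A pinned (page-locked) destination lets the copy run asynchronously.
    out = torch.empty(u8.shape, dtype=torch.uint8, pin_memory=True)
    out.copy_(u8, non_blocking=True)
    return out  # call torch.cuda.synchronize() just before consuming the data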
Could you add an example and code for online (live) driving? With that, this would instantly outclass every other digital-human project.
Every run slows down badly at the "pad talking image to original video" stage: even on a 4090 it only manages about 10 it/s, while CPU, GPU, and memory utilization all stay very low. I wonder whether there is room for optimization here.
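If the bottleneck really is the pasting stage, it is pure OpenCV/NumPy work that releases the GIL, so a thread pool may help. A rough sketch using the get_image blending helper visible in the tracebacks above (the per-frame input names are assumptions):

from concurrent.futures import ThreadPoolExecutor

from musetalk.utils.blending import get_image

# ori_frames, res_frames, bboxes are the per-frame inputs (assumed names).
with ThreadPoolExecutor(max_workers=8) as pool:
    combined = list(pool.map(lambda t: get_image(*t),
                             zip(ori_frames, res_frames, bboxes)))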
The real-time figures in the README were measured on a V100, and after trying it the live experience is indeed better than expected! A question, though: if I want to run real-time inference on a lower-end GPU and can tolerate some degradation of the output frames, which parameters or approaches could I look into?
Which package does the "import spaces" in the project's app.py come from, and what version?
How can a "closed mouth during silence" feature be supported?
During online chatting, only UNet and the VAE decoder are involved, which makes MuseTalk real-time.
How should this be done?
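One possible approach, not a feature of the repo: compute a per-video-frame energy mask over the audio and, wherever it falls below a threshold, composite the original frame instead of the generated one, so the mouth stays as in the source video during silence. A sketch:

import librosa

def silent_frame_mask(audio_path, fps=25, thresh=0.01):
    # True for video frames whose audio slice is (near-)silent.
    y, sr = librosa.load(audio_path, sr=None)
    hop = int(sr / fps)  # one RMS value per video frame
    rms = librosa.feature.rms(y=y, frame_length=hop * 2, hop_length=hop)[0]
    return rms < thresh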
Whenever the input is an image I keep getting this error. How has everyone solved it? It fails with both a single image and multiple images.
[ComfyUI-Manager] default cache updated: https://gitcode.net/ranting8323/ComfyUI-Manager/-/raw/main/alter-list.json
[ComfyUI-Manager] default cache updated: https://gitcode.net/ranting8323/ComfyUI-Manager/-/raw/main/custom-node-list.json
[ComfyUI-Manager] default cache updated: https://gitcode.net/ranting8323/ComfyUI-Manager/-/raw/main/model-list.json
[ComfyUI-Manager] default cache updated: https://gitcode.net/ranting8323/ComfyUI-Manager/-/raw/main/extension-node-map.json
Loads checkpoint by local backend from path: C:\ComfyUI-aki-v1.3\models\diffusers\TMElyralab/MuseTalk/dwpose/dw-ll_ucoco_384.pth
cuda start