opentalker / video-retalking

[SIGGRAPH Asia 2022] VideoReTalking: Audio-based Lip Synchronization for Talking Head Video Editing In the Wild

Home Page: https://opentalker.github.io/video-retalking/

License: Apache License 2.0

Languages: Python 96.52%, C++ 0.21%, Cuda 1.39%, Shell 0.54%, Jupyter Notebook 1.35%
Topics: lip-synchronization, talking-head-videos, video-editing, siggraph-asia-2022

video-retalking's Issues

Colab notebook throwing error while running inference.py

Traceback (most recent call last):
  File "/content/video-retalking/inference.py", line 342, in <module>
    main()
  File "/content/video-retalking/inference.py", line 78, in main
    kp_extractor = KeypointExtractor()
  File "/content/video-retalking/third_part/face3d/extract_kp_videos.py", line 16, in __init__
    self.detector = face_alignment.FaceAlignment(face_alignment.LandmarksType._2D)
  File "/usr/lib/python3.10/enum.py", line 437, in __getattr__
    raise AttributeError(name) from None
AttributeError: _2D
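This usually means a newer face-alignment release is installed: recent versions renamed the LandmarksType._2D enum member to LandmarksType.TWO_D, so the old attribute lookup fails. A minimal sketch of two workarounds, assuming the renamed enum really is the cause: either pin the library (pip install face-alignment==1.3.5), or patch third_part/face3d/extract_kp_videos.py to handle both spellings:

import face_alignment

# _2D exists in face-alignment <= 1.3.x; TWO_D replaced it in newer releases
landmarks_type = getattr(face_alignment.LandmarksType, '_2D', None) \
    or getattr(face_alignment.LandmarksType, 'TWO_D')
detector = face_alignment.FaceAlignment(landmarks_type)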

Traceback error

I am running into an error when I try:
python inference.py --face examples/face/1.mp4 --audio examples/audio/1.wav --outfile results/1_1.mp4

Traceback (most recent call last):
  File "inference.py", line 4, in <module>
    from PIL import Image
  File "C:\Users\username4\anaconda3\envs\video_retalking\lib\site-packages\PIL\Image.py", line 103, in <module>
    from . import _imaging as core
ImportError: DLL load failed while importing _imaging: The specified module could not be found.
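This DLL failure points at a broken or mismatched Pillow binary in the conda environment rather than at the repo itself. A commonly suggested workaround (an assumption from the import path, not an official step) is to force-reinstall Pillow inside the activated env:

pip uninstall pillow
pip install --force-reinstall pillow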

ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (5,) + inhomogeneous part.

Hi, could you please help me figure out how to resolve this error? I've spent 8 hours on it and am completely stuck. Thanks a lot for taking a look.
My environment: a local Windows 10 64-bit PC with a GTX 960 GPU (4 GB VRAM). Anaconda3, Python 3.8, and CUDA 11.1 are installed, and the environment variables are configured.
Command entered in PowerShell:
python inference.py --face examples/face/1.mp4 --audio examples/audio/1.wav --outfile results/1_1.mp4

Output:
C:\Python\Python38\lib\site-packages\setuptools\distutils_patch.py:25: UserWarning: Distutils was imported before Setuptools. This usage is discouraged and may exhibit undesirable behaviors or errors. Please use Setuptools' objects directly or at least import Setuptools first.
warnings.warn(
[Info] Using cuda for inference.
[Step 0] Number of frames available for inference: 135
[Step 1] Landmarks Extraction in Video.
landmark Det:: 100%|█████████████████████████████████████████████████████████████████| 135/135 [00:25<00:00, 5.29it/s]
[Step 2] 3DMM Extraction In Video:: 0%| | 0/135 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "inference.py", line 342, in <module>
    main()
  File "inference.py", line 100, in main
    trans_params, im_idx, lm_idx, _ = align_img(frame, lm_idx, lm3d_std)
  File "C:\ai\video-retalking\third_part\face3d\util\preprocess.py", line 196, in align_img
    trans_params = np.array([w0, h0, s, t[0], t[1]])
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (5,) + inhomogeneous part.

setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (5,) + inhomogeneous part.

[Step 2] 3DMM Extraction In Video:: 0%| | 0/135 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "inference.py", line 342, in <module>
    main()
  File "inference.py", line 100, in main
    trans_params, im_idx, lm_idx, _ = align_img(frame, lm_idx, lm3d_std)
  File "G:\pythonProject\third_part\face3d\util\preprocess.py", line 196, in align_img
    trans_params = np.array([w0, h0, s, t[0], t[1]])
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (5,) + inhomogeneous part.
PS G:\pythonProject>
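Both tracebacks share a root cause: NumPy 1.24 turned implicit ragged-array creation into a hard error, and in align_img the translation vector t holds array elements, so np.array([w0, h0, s, t[0], t[1]]) now raises. Two commonly reported workarounds, assuming NumPy >= 1.24 is installed (a sketch, not an official patch):

# Workaround 1: pin NumPy below 1.24 in the env
#   pip install numpy==1.23.5
# Workaround 2: edit third_part/face3d/util/preprocess.py, line 196,
# to unwrap the translation entries (assuming t arrives as a 2x1 array)
trans_params = np.array([w0, h0, s, t[0].item(), t[1].item()])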

I have a question about Identity aware enhancement network

I am also interested in talking-head generation, and I read your SIGGRAPH Asia 2022 paper with great interest.
I have a question about the identity-aware enhancement network.
(screenshot of the relevant figure from the paper)
I can't understand this part of the paper. Does it mean that the high-resolution LRS2 images produced by the restoration network are put into L-Net again? If so, why produce a low-resolution input for E-Net?

Image too big (error)

I'll leave the error log below, but from what I understand my image size is too big for face detection on my GPU; I only have 4 GB of VRAM.
I'm using the following command:
python inference.py --face examples/face/12.mp4 --audio examples/audio/3.wav --outfile results/1_6.mp4 --face_det_batch_size 2 --LNet_batch_size 2 --one_shot
I don't really want to resize my video because I want to preserve quality. Can I do something else, like changing batch sizes or cropping the video somehow? (If you reply with code, please build on the command above if possible.)

FaceDet::   0%|                                                                                | 0/204 [00:01<?, ?it/s]
Recovering from OOM error; New batch size: 1                                                   | 0/204 [00:00<?, ?it/s]
FaceDet::   0%|                                                                                | 0/407 [00:00<?, ?it/s]
[Step 6] Lip Synthesis::   0%|                                                                 | 0/204 [01:36<?, ?it/s]
Traceback (most recent call last):
  File "C:\Users\leolo\video-retalking\utils\inference_utils.py", line 118, in face_detect
    predictions.extend(detector.get_detections_for_batch(np.array(images[i:i + batch_size])))
  File "C:\Users\leolo\video-retalking\third_part\face_detection\api.py", line 66, in get_detections_for_batch
    detected_faces = self.face_detector.detect_from_batch(images.copy())
  File "C:\Users\leolo\video-retalking\third_part\face_detection\detection\sfd\sfd_detector.py", line 42, in detect_from_batch
    bboxlists = batch_detect(self.face_detector, images, device=self.device)
  File "C:\Users\leolo\video-retalking\third_part\face_detection\detection\sfd\detect.py", line 69, in batch_detect
    olist = net(imgs)
  File "C:\Users\leolo\.conda\envs\video_retalking\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\leolo\video-retalking\third_part\face_detection\detection\sfd\net_s3fd.py", line 71, in forward
    h = F.relu(self.conv1_1(x))
  File "C:\Users\leolo\.conda\envs\video_retalking\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\leolo\.conda\envs\video_retalking\lib\site-packages\torch\nn\modules\conv.py", line 443, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "C:\Users\leolo\.conda\envs\video_retalking\lib\site-packages\torch\nn\modules\conv.py", line 439, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: CUDA out of memory. Tried to allocate 960.00 MiB (GPU 0; 4.00 GiB total capacity; 1.96 GiB already allocated; 0 bytes free; 2.67 GiB reserved in total by PyTorch)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "inference.py", line 342, in <module>
    main()
  File "inference.py", line 211, in main
    for i, (img_batch, mel_batch, frames, coords, img_original, f_frames) in enumerate(tqdm(gen, desc='[Step 6] Lip Synthesis:', total=int(np.ceil(float(len(mel_chunks)) / args.LNet_batch_size)))):
  File "C:\Users\leolo\.conda\envs\video_retalking\lib\site-packages\tqdm\std.py", line 1178, in __iter__
    for obj in iterable:
  File "inference.py", line 292, in datagen
    face_det_results = face_detect(full_frames, args, jaw_correction=True)
  File "C:\Users\leolo\video-retalking\utils\inference_utils.py", line 121, in face_detect
    raise RuntimeError('Image too big to run face detection on GPU. Please use the --resize_factor argument')
RuntimeError: Image too big to run face detection on GPU. Please use the --resize_factor argument
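With 4 GB of VRAM, the S3FD face detector runs out of memory on full-resolution frames, and smaller batch sizes don't help (the recovery above already dropped the batch to 1) because each frame is still processed at its original size. The error message itself points at --resize_factor, which downscales the working resolution; pre-cropping the video to the face region is the other way to shrink the input (a suggestion, not a built-in repo feature). For example (the factor of 2 is illustrative, not a verified minimum):

python inference.py --face examples/face/12.mp4 --audio examples/audio/3.wav --outfile results/1_6.mp4 --face_det_batch_size 1 --LNet_batch_size 1 --resize_factor 2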

Odd GPEN Error - 'mask_sharp' referenced before assignment

When running inference.py, the following error occurs as Step 5 begins:

  File "/inference.py", line 342, in <module>
    main()
  File "/inference.py", line 198, in main
    pred, _, _ = enhancer.process(img, img, face_enhance=True, possion_blending=False)
  File "/third_part/GPEN/gpen_face_enhancer.py", line 116, in process
    mask_sharp = cv2.GaussianBlur(mask_sharp, (0,0), sigmaX=1, sigmaY=1, borderType = cv2.BORDER_DEFAULT)
UnboundLocalError: local variable 'mask_sharp' referenced before assignment

What's likely the cause?
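A plausible cause (an inference from the traceback, not a confirmed diagnosis): in gpen_face_enhancer.py, mask_sharp is only assigned inside the per-detected-face branch, so a frame in which no face is found reaches the cv2.GaussianBlur call with the variable unbound. Checking whether every frame of the input contains a detectable face is a good first step. A minimal defensive sketch of the guard pattern (the helper name is hypothetical):

import cv2

def blur_mask_safely(mask_sharp):
    # If no face was detected upstream, mask_sharp was never assigned;
    # treating None as "skip enhancement for this frame" avoids the crash.
    if mask_sharp is None:
        return None
    return cv2.GaussianBlur(mask_sharp, (0, 0), sigmaX=1, sigmaY=1,
                            borderType=cv2.BORDER_DEFAULT)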

About the train code

Hi contributor,

This is fantastic work, and it is very exciting that the code has been released.

I would really like to train on my own dataset based on your excellent work. May I ask for help with the training process, or will the training code be released later?

Looking forward to your response.

A question about E-Net's output quality

Hi, I ran the demo and noticed problems in the generated face, so I saved the outputs of both L-Net and E-Net. L-Net's result looks normal, but E-Net's super-resolution result is rather poor and falls short of your example videos. Could you explain why? Thanks.

Stuck at step 5, no progress

[Step 5] Reference Enhancement: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 109/109 [18:44<00:00, 10.32s/it]
landmark Det:: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 109/109 [00:28<00:00,  3.80it/s]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 109/109 [00:00<00:00, 16176.46it/s] 
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 109/109 [00:00<00:00, 500.96it/s] 
 94%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋            | 102/109 [00:00<00:00, 503.14it/s] 
FaceDet::   0%|                                                                                                                                                                                                 | 0/28 [00:00<?, ?it/s] 

Performance concern

I tried to produce a 3-minute video using example/3.mp4 plus a 3-minute WAV audio; step 6 alone took 20 minutes on an RTX 4090. Is this performance normal? Note: CUDA is already enabled.

Video-retalking program errors out on step 6

I am running this program on my Windows machine using Anaconda. I run the following command and get all the way to step 6 before erroring out:

(video_retalking) E:\video-retalking>python inference.py --face examples/face/1.mp4 --audio examples/audio/1.wav --outfile results/1_1.mp4

See the attached screenshot. The error points at inference.py, line 271. Please advise on what I can do to resolve this issue. Thanks.

video_retalking error screen shot.pdf

It used to run fine, but suddenly errors at step 1

[Step 1] Landmarks Extraction in Video.
Traceback (most recent call last):
  File "inference.py", line 342, in <module>
    main()
  File "inference.py", line 78, in main
    kp_extractor = KeypointExtractor()
  File "C:\Users\Qiyun\video-retalking\third_part\face3d\extract_kp_videos.py", line 16, in __init__
    self.detector = face_alignment.FaceAlignment(face_alignment.LandmarksType._2D)
  File "C:\Users\Qiyun\miniconda3\envs\video_retalking\lib\enum.py", line 384, in __getattr__
    raise AttributeError(name) from None
AttributeError: _2D

Using cuda for inference. ^C

While running google colab, I get:

/usr/local/lib/python3.9/dist-packages/torchvision/transforms/functional_tensor.py:5: UserWarning: The torchvision.transforms.functional_tensor module is deprecated in 0.15 and will be **removed in 0.17**. Please don't rely on it. You probably just need to use APIs in torchvision.transforms.functional or in torchvision.transforms.v2.functional.
  warnings.warn(
[Info] Using cuda for inference.
^C

blurry inference result even if input video is 1080p

Hi, this is an awesome project and the lipsync is really good.

I’ve encountered some problems: if I use a high resolution video (1920*1080) as input, the output video is blurry on the whole (not just the face area) though the output resolution is also 1080p. It seems that the output video is scaled up from a low-res one.

Based on my understanding of the paper, the generated talking face is pasted back onto the original video. So I wonder if this global blurriness is normal…

I used the command from the readme for inference. Not sure if there are other options I missed.

python3 inference.py \
  --face examples/face/1.mp4 \
  --audio examples/audio/1.wav \
  --outfile results/1_1.mp4

Thank you for your great repo.

Install error w/ requirements.txt

I followed the instructions and got this error:

Installing collected packages: yapf, tensorboard-plugin-wit, pyasn1, ninja, lmdb, einops, dlib, addict, tensorboard-data-server, six, rsa, pyyaml, pyparsing, pyasn1-modules, protobuf, oauthlib, numpy, networkx, MarkupSafe, llvmlite, kiwisolver, importlib-resources, grpcio, future, fonttools, cycler, colorama, charset-normalizer, cachetools, absl-py, werkzeug, tqdm, tifffile, scipy, PyWavelets, python-dateutil, opencv-python, numba, markdown, kornia, imageio, google-auth, contourpy, scikit-image, resampy, requests-oauthlib, matplotlib, librosa, google-auth-oauthlib, filterpy, face-alignment, tb-nightly, facexlib, basicsr
Running setup.py install for dlib ... error
error: subprocess-exited-with-error

× Running setup.py install for dlib did not run successfully.
│ exit code: 1
╰─> [9 lines of output]
running install
C:\Users\chlyw\anaconda3\envs\video_retalking\lib\site-packages\setuptools\command\install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
warnings.warn(
running build
running build_py
running build_ext

  ERROR: CMake must be installed to build dlib

  [end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: legacy-install-failure

× Encountered error while trying to install package.
╰─> dlib
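The key line is "CMake must be installed to build dlib": pip is falling back to compiling dlib from source and can't find CMake. Two commonly used fixes (suggestions, not steps from this repo's README) are to install CMake first so the source build succeeds:

pip install cmake
pip install dlib

or, inside a conda environment, to install a prebuilt binary:

conda install -c conda-forge dlib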

Error in the fifth step of the program: (GPEN)

Traceback (most recent call last):
  File "inference.py", line 342, in <module>
    main()
  File "inference.py", line 198, in main
    pred, _, _ = enhancer.process(img, img, face_enhance=True, possion_blending=False)
  File "C:\AI-3\third_part\GPEN\gpen_face_enhancer.py", line 116, in process
    mask_sharp = cv2.GaussianBlur(mask_sharp, (0,0), sigmaX=1, sigmaY=1, borderType = cv2.BORDER_DEFAULT)
UnboundLocalError: local variable 'mask_sharp' referenced before assignment

Any help or answers would be appreciated. Thank you.

A question about passing in an image

Hi, a question about passing in an image: the head in the generated video has no animation, and no errors are reported. Is there something wrong on my end?

Optimise inference time

Thanks for the great work; you are all making a great contribution to the community.
However, is there any way to reduce inference time? Thanks.

Regards

ValueError: cannot reshape array of size

Error message:
[Step 1] Using saved landmarks.
Traceback (most recent call last):
  File "inference.py", line 364, in <module>
    main()
  File "inference.py", line 88, in main
    lm = lm.reshape([len(full_frames), -1, 2])
ValueError: cannot reshape array of size 206040 into shape (6577,newaxis,2)
Original input video dimensions: (screenshot attached)

I don't know the exact cause. Are there specific requirements for the video dimensions, or is it caused by the file size?
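Note the first log line: "[Step 1] Using saved landmarks." The script is reusing a cached landmarks file, and the sizes suggest the cache belongs to a different video, so deleting the saved landmarks so step 1 re-extracts them is worth trying (this is an inference from the array shapes, not a confirmed diagnosis). The arithmetic behind that guess:

# the cached array only reshapes to (n_frames, 68, 2) if the counts match
saved_values = 206040
per_frame = 68 * 2               # 68 landmarks with (x, y) each
print(saved_values / per_frame)  # 1515.0 -> the cache covers 1515 frames
print(6577 * per_frame)          # 894472 -> values the current video needs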

Best practices for avatar emotion and expression

I'd like the avatar to change emotion while speaking, e.g. speaking sadly, happily, or excitedly. Is it better to use different source videos that already carry the emotion, or is there another way to achieve this?

FileNotFoundError: [WinError 2]

When I got to the sixth step, it broke.
Traceback (most recent call last):
  File "inference.py", line 342, in <module>
    main()
  File "inference.py", line 271, in main
    subprocess.call(command, shell=platform.system() != 'Windows')
  File "E:\anaconda\envs\video_retalking\lib\subprocess.py", line 340, in call
    with Popen(*popenargs, **kwargs) as p:
  File "E:\anaconda\envs\video_retalking\lib\subprocess.py", line 858, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "E:\anaconda\envs\video_retalking\lib\subprocess.py", line 1311, in _execute_child
    hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
FileNotFoundError: [WinError 2] The system cannot find the file specified.
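On Windows, [WinError 2] from subprocess means the executable being launched was not found, not that an input file is missing. The subprocess.call at inference.py line 271 shells out to an external tool; the usual culprit here (a guess from the call site, not a verified diagnosis) is ffmpeg missing from PATH. Installing it into the active env and reopening the shell typically resolves it:

conda install ffmpeg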

About model training

Thank you for your contribution of this video lip-sync model!
Could I get any info about the training code for L-Net and E-Net so I can apply it to my own dataset?
Looking forward to your response!

output resolution problem

This is a great open-source project, but there is a gap between the output resolution and the original video: the output resolution is low and shows aliasing. Is there a way to deal with this? Thank you very much!

The following error occurred when executing step six!

Traceback (most recent call last):
  File "D:\AIGC\videotalk\video-retalking\utils\inference_utils.py", line 118, in face_detect
    predictions.extend(detector.get_detections_for_batch(np.array(images[i:i + batch_size])))
  File "D:\AIGC\videotalk\video-retalking\third_part\face_detection\api.py", line 66, in get_detections_for_batch
    detected_faces = self.face_detector.detect_from_batch(images.copy())
  File "D:\AIGC\videotalk\video-retalking\third_part\face_detection\detection\sfd\sfd_detector.py", line 42, in detect_from_batch
    bboxlists = batch_detect(self.face_detector, images, device=self.device)
  File "D:\AIGC\videotalk\video-retalking\third_part\face_detection\detection\sfd\detect.py", line 69, in batch_detect
    olist = net(imgs)
  File "C:\Users\menka\.conda\envs\video_retalking\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "D:\AIGC\videotalk\video-retalking\third_part\face_detection\detection\sfd\net_s3fd.py", line 71, in forward
    h = F.relu(self.conv1_1(x))
  File "C:\Users\menka\.conda\envs\video_retalking\lib\site-packages\torch\nn\functional.py", line 1298, in relu
    result = torch.relu(input)
RuntimeError: CUDA out of memory. Tried to allocate 508.00 MiB (GPU 0; 4.00 GiB total capacity; 2.69 GiB already allocated; 0 bytes free; 3.11 GiB reserved in total by PyTorch)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "inference.py", line 342, in <module>
    main()
  File "inference.py", line 211, in main
    for i, (img_batch, mel_batch, frames, coords, img_original, f_frames) in enumerate(tqdm(gen, desc='[Step 6] Lip Synthesis:', total=int(np.ceil(float(len(mel_chunks)) / args.LNet_batch_size)))):
  File "C:\Users\menka\.conda\envs\video_retalking\lib\site-packages\tqdm\std.py", line 1178, in __iter__
    for obj in iterable:
  File "inference.py", line 292, in datagen
    face_det_results = face_detect(full_frames, args, jaw_correction=True)
  File "D:\AIGC\videotalk\video-retalking\utils\inference_utils.py", line 121, in face_detect
    raise RuntimeError('Image too big to run face detection on GPU. Please use the --resize_factor argument')
RuntimeError: Image too big to run face detection on GPU. Please use the --resize_factor argument

Different Errors while trying to install

I followed the instructions on the main page.
To keep my first test simple, I put the audio and video files in the root directory for a short path:

python3 inference.py --face 1.mp4 --audio 1.wav --outfile 1_1.mp4

After installing CUDA and the requirements as explained,
I had a lot of errors about missing modules and tried to install those packages myself.
But I can't get rid of the last error I'm stuck on:

D:\Video_Retalking>python3 inference.py --face 1.mp4 --audio 1.wav --outfile 1_1.mp4
No CUDA runtime is found, using CUDA_HOME='C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.3'
C:\Users\Alon\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\torchvision\transforms\functional_tensor.py:5: UserWarning: The torchvision.transforms.functional_tensor module is deprecated in 0.15 and will be **removed in 0.17**. Please don't rely on it. You probably just need to use APIs in torchvision.transforms.functional or in torchvision.transforms.v2.functional.
  warnings.warn(
Traceback (most recent call last):
  File "D:\Video_Retalking\inference.py", line 19, in <module>
    from utils.ffhq_preprocess import Croper
  File "D:\Video_Retalking\utils\ffhq_preprocess.py", line 31, in <module>
    import dlib
  File "C:\Users\Alon\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\dlib\__init__.py", line 19, in <module>
    from _dlib_pybind11 import *
ImportError: DLL load failed while importing _dlib_pybind11: The specified module could not be found.


I tried pip install, pip3 install, and even conda install, but nothing solved the problem.
I also tried re-installing CUDA, but I still get the same error as above.

Any idea how to fix this and make video-retalking run locally under Anaconda?
Thanks ahead 🙏
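Two details in the log stand out: "No CUDA runtime is found", and the interpreter is the Microsoft Store Python build (the PythonSoftwareFoundation.Python.3.9_... path), in which dlib's native extension often fails to locate its dependent DLLs. A commonly suggested path (a suggestion based on the log, not an official fix) is to use the conda environment from the README instead of the Store interpreter, and install a prebuilt dlib there:

conda create -n video_retalking python=3.8
conda activate video_retalking
conda install -c conda-forge dlib
pip install -r requirements.txt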
