opentalker / video-retalking

[SIGGRAPH Asia 2022] VideoReTalking: Audio-based Lip Synchronization for Talking Head Video Editing In the Wild

Home Page: https://opentalker.github.io/video-retalking/

License: Apache License 2.0

Languages: Python 96.52%, C++ 0.21%, Cuda 1.39%, Shell 0.54%, Jupyter Notebook 1.35%
Topics: lip-synchronization, talking-head-videos, video-editing, siggraph-asia-2022

video-retalking's Issues

Colab notebook throwing error while running inference.py

Traceback (most recent call last):
  File "/content/video-retalking/inference.py", line 342, in <module>
    main()
  File "/content/video-retalking/inference.py", line 78, in main
    kp_extractor = KeypointExtractor()
  File "/content/video-retalking/third_part/face3d/extract_kp_videos.py", line 16, in __init__
    self.detector = face_alignment.FaceAlignment(face_alignment.LandmarksType._2D)
  File "/usr/lib/python3.10/enum.py", line 437, in __getattr__
    raise AttributeError(name) from None
AttributeError: _2D
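This usually means a newer face-alignment release is installed: recent versions renamed the LandmarksType._2D enum member to LandmarksType.TWO_D, so the old attribute lookup fails. A minimal sketch of two workarounds, assuming the renamed enum really is the cause: either pin the library (pip install face-alignment==1.3.5), or patch third_part/face3d/extract_kp_videos.py to handle both spellings:

import face_alignment

# _2D exists in face-alignment <= 1.3.x; TWO_D replaced it in newer releases
landmarks_type = getattr(face_alignment.LandmarksType, '_2D', None) \
    or getattr(face_alignment.LandmarksType, 'TWO_D')
detector = face_alignment.FaceAlignment(landmarks_type)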

Traceback error

I am running into an error when I try:
python inference.py --face examples/face/1.mp4 --audio examples/audio/1.wav --outfile results/1_1.mp4

Traceback (most recent call last):
  File "inference.py", line 4, in <module>
    from PIL import Image
  File "C:\Users\username4\anaconda3\envs\video_retalking\lib\site-packages\PIL\Image.py", line 103, in <module>
    from . import _imaging as core
ImportError: DLL load failed while importing _imaging: The specified module could not be found.
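This DLL failure points at a broken or mismatched Pillow binary in the conda environment rather than at the repo itself. A commonly suggested workaround (an assumption from the import path, not an official step) is to force-reinstall Pillow inside the activated env:

pip uninstall pillow
pip install --force-reinstall pillow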

ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (5,) + inhomogeneous part.

Hi, could you please help me figure out how to resolve this error? I've spent 8 hours on it and am completely stuck. Thanks a lot for taking a look.
My environment: a local Windows 10 64-bit PC with a GTX 960 GPU (4 GB VRAM). Anaconda3, Python 3.8, and CUDA 11.1 are installed, and the environment variables are configured.
Command entered in PowerShell:
python inference.py --face examples/face/1.mp4 --audio examples/audio/1.wav --outfile results/1_1.mp4

Output:
C:\Python\Python38\lib\site-packages\setuptools\distutils_patch.py:25: UserWarning: Distutils was imported before Setuptools. This usage is discouraged and may exhibit undesirable behaviors or errors. Please use Setuptools' objects directly or at least import Setuptools first.
warnings.warn(
[Info] Using cuda for inference.
[Step 0] Number of frames available for inference: 135
[Step 1] Landmarks Extraction in Video.
landmark Det:: 100%|█████████████████████████████████████████████████████████████████| 135/135 [00:25<00:00, 5.29it/s]
[Step 2] 3DMM Extraction In Video:: 0%| | 0/135 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "inference.py", line 342, in <module>
    main()
  File "inference.py", line 100, in main
    trans_params, im_idx, lm_idx, _ = align_img(frame, lm_idx, lm3d_std)
  File "C:\ai\video-retalking\third_part\face3d\util\preprocess.py", line 196, in align_img
    trans_params = np.array([w0, h0, s, t[0], t[1]])
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (5,) + inhomogeneous part.

setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (5,) + inhomogeneous part.

[Step 2] 3DMM Extraction In Video:: 0%| | 0/135 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "inference.py", line 342, in <module>
    main()
  File "inference.py", line 100, in main
    trans_params, im_idx, lm_idx, _ = align_img(frame, lm_idx, lm3d_std)
  File "G:\pythonProject\third_part\face3d\util\preprocess.py", line 196, in align_img
    trans_params = np.array([w0, h0, s, t[0], t[1]])
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (5,) + inhomogeneous part.
PS G:\pythonProject>
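Both tracebacks share a root cause: NumPy 1.24 turned implicit ragged-array creation into a hard error, and in align_img the translation vector t holds array elements, so np.array([w0, h0, s, t[0], t[1]]) now raises. Two commonly reported workarounds, assuming NumPy >= 1.24 is installed (a sketch, not an official patch):

# Workaround 1: pin NumPy below 1.24 in the env
#   pip install numpy==1.23.5
# Workaround 2: edit third_part/face3d/util/preprocess.py, line 196,
# to unwrap the translation entries (assuming t arrives as a 2x1 array)
trans_params = np.array([w0, h0, s, t[0].item(), t[1].item()])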

I have a question about Identity aware enhancement network

I am also interested in talking-head generation, and I read your SIGGRAPH Asia 2022 paper with great interest.
I have a question about the identity-aware enhancement network.
(screenshot of the relevant figure from the paper)
I can't understand this part of the paper. Does it mean that the high-resolution LRS2 images produced by the restoration network are put into L-Net again? If so, why produce a low-resolution input for E-Net?

Image too big (error)

I'll leave the error log below, but from what I understand my image size is too big for face detection on my GPU; I only have 4 GB of VRAM.
I'm using the following command:
python inference.py --face examples/face/12.mp4 --audio examples/audio/3.wav --outfile results/1_6.mp4 --face_det_batch_size 2 --LNet_batch_size 2 --one_shot
I don't really want to resize my video because I want to preserve quality. Can I do something else, like changing batch sizes or cropping the video somehow? (If you reply with code, please build on the command above if possible.)

FaceDet::   0%|                                                                                | 0/204 [00:01<?, ?it/s]
Recovering from OOM error; New batch size: 1                                                   | 0/204 [00:00<?, ?it/s]
FaceDet::   0%|                                                                                | 0/407 [00:00<?, ?it/s]
[Step 6] Lip Synthesis::   0%|                                                                 | 0/204 [01:36<?, ?it/s]
Traceback (most recent call last):
  File "C:\Users\leolo\video-retalking\utils\inference_utils.py", line 118, in face_detect
    predictions.extend(detector.get_detections_for_batch(np.array(images[i:i + batch_size])))
  File "C:\Users\leolo\video-retalking\third_part\face_detection\api.py", line 66, in get_detections_for_batch
    detected_faces = self.face_detector.detect_from_batch(images.copy())
  File "C:\Users\leolo\video-retalking\third_part\face_detection\detection\sfd\sfd_detector.py", line 42, in detect_from_batch
    bboxlists = batch_detect(self.face_detector, images, device=self.device)
  File "C:\Users\leolo\video-retalking\third_part\face_detection\detection\sfd\detect.py", line 69, in batch_detect
    olist = net(imgs)
  File "C:\Users\leolo\.conda\envs\video_retalking\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\leolo\video-retalking\third_part\face_detection\detection\sfd\net_s3fd.py", line 71, in forward
    h = F.relu(self.conv1_1(x))
  File "C:\Users\leolo\.conda\envs\video_retalking\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\leolo\.conda\envs\video_retalking\lib\site-packages\torch\nn\modules\conv.py", line 443, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "C:\Users\leolo\.conda\envs\video_retalking\lib\site-packages\torch\nn\modules\conv.py", line 439, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: CUDA out of memory. Tried to allocate 960.00 MiB (GPU 0; 4.00 GiB total capacity; 1.96 GiB already allocated; 0 bytes free; 2.67 GiB reserved in total by PyTorch)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "inference.py", line 342, in <module>
    main()
  File "inference.py", line 211, in main
    for i, (img_batch, mel_batch, frames, coords, img_original, f_frames) in enumerate(tqdm(gen, desc='[Step 6] Lip Synthesis:', total=int(np.ceil(float(len(mel_chunks)) / args.LNet_batch_size)))):
  File "C:\Users\leolo\.conda\envs\video_retalking\lib\site-packages\tqdm\std.py", line 1178, in __iter__
    for obj in iterable:
  File "inference.py", line 292, in datagen
    face_det_results = face_detect(full_frames, args, jaw_correction=True)
  File "C:\Users\leolo\video-retalking\utils\inference_utils.py", line 121, in face_detect
    raise RuntimeError('Image too big to run face detection on GPU. Please use the --resize_factor argument')
RuntimeError: Image too big to run face detection on GPU. Please use the --resize_factor argument
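With 4 GB of VRAM, the S3FD face detector runs out of memory on full-resolution frames, and smaller batch sizes don't help (the recovery above already dropped the batch to 1) because each frame is still processed at its original size. The error message itself points at --resize_factor, which downscales the working resolution; pre-cropping the video to the face region is the other way to shrink the input (a suggestion, not a built-in repo feature). For example (the factor of 2 is illustrative, not a verified minimum):

python inference.py --face examples/face/12.mp4 --audio examples/audio/3.wav --outfile results/1_6.mp4 --face_det_batch_size 1 --LNet_batch_size 1 --resize_factor 2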

Odd GPEN Error - 'mask_sharp' referenced before assignment

When running inference.py, the following error occurs as Step 5 begins:

  File "/inference.py", line 342, in <module>
    main()
  File "/inference.py", line 198, in main
    pred, _, _ = enhancer.process(img, img, face_enhance=True, possion_blending=False)
  File "/third_part/GPEN/gpen_face_enhancer.py", line 116, in process
    mask_sharp = cv2.GaussianBlur(mask_sharp, (0,0), sigmaX=1, sigmaY=1, borderType = cv2.BORDER_DEFAULT)
UnboundLocalError: local variable 'mask_sharp' referenced before assignment

What's likely the cause?
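A plausible cause (an inference from the traceback, not a confirmed diagnosis): in gpen_face_enhancer.py, mask_sharp is only assigned inside the per-detected-face branch, so a frame in which no face is found reaches the cv2.GaussianBlur call with the variable unbound. Checking whether every frame of the input contains a detectable face is a good first step. A minimal defensive sketch of the guard pattern (the helper name is hypothetical):

import cv2

def blur_mask_safely(mask_sharp):
    # If no face was detected upstream, mask_sharp was never assigned;
    # treating None as "skip enhancement for this frame" avoids the crash.
    if mask_sharp is None:
        return None
    return cv2.GaussianBlur(mask_sharp, (0, 0), sigmaX=1, sigmaY=1,
                            borderType=cv2.BORDER_DEFAULT)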

About the train code

Hi contributor,

This is fantastic work, and it is very exciting that the code has been released.

I would really like to train on my own dataset based on your excellent work. May I ask for help with the training process, or will the training code be released later?

Looking forward to your response.

A question about E-Net's output quality

Hi, I ran the demo and noticed problems in the generated face, so I saved the outputs of both L-Net and E-Net. L-Net's result looks normal, but E-Net's super-resolution result is rather poor and falls short of your example videos. Could you explain why? Thanks.

Stuck at step 5, no progress

[Step 5] Reference Enhancement: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 109/109 [18:44<00:00, 10.32s/it]
landmark Det:: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 109/109 [00:28<00:00,  3.80it/s]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 109/109 [00:00<00:00, 16176.46it/s] 
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 109/109 [00:00<00:00, 500.96it/s] 
 94%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋            | 102/109 [00:00<00:00, 503.14it/s] 
FaceDet::   0%|                                                                                                                                                                                                 | 0/28 [00:00<?, ?it/s] 

Performance concern

I tried to produce a 3-minute video using example/3.mp4 plus a 3-minute WAV audio; step 6 alone took 20 minutes on an RTX 4090. Is this performance normal? Note: CUDA is already enabled.

Video-retalking program errors out on step 6

I am running this program on my Windows machine using Anaconda. I run the following command and get all the way to step 6 before erroring out:

(video_retalking) E:\video-retalking>python inference.py --face examples/face/1.mp4 --audio examples/audio/1.wav --outfile results/1_1.mp4

See the attached screenshot. The error points at inference.py, line 271. Please advise on what I can do to resolve this issue. Thanks.

video_retalking error screen shot.pdf

It used to run fine, but suddenly errors at step 1

[Step 1] Landmarks Extraction in Video.
Traceback (most recent call last):
  File "inference.py", line 342, in <module>
    main()
  File "inference.py", line 78, in main
    kp_extractor = KeypointExtractor()
  File "C:\Users\Qiyun\video-retalking\third_part\face3d\extract_kp_videos.py", line 16, in __init__
    self.detector = face_alignment.FaceAlignment(face_alignment.LandmarksType._2D)
  File "C:\Users\Qiyun\miniconda3\envs\video_retalking\lib\enum.py", line 384, in __getattr__
    raise AttributeError(name) from None
AttributeError: _2D

Using cuda for inference. ^C

While running google colab, I get:

/usr/local/lib/python3.9/dist-packages/torchvision/transforms/functional_tensor.py:5: UserWarning: The torchvision.transforms.functional_tensor module is deprecated in 0.15 and will be **removed in 0.17**. Please don't rely on it. You probably just need to use APIs in torchvision.transforms.functional or in torchvision.transforms.v2.functional.
  warnings.warn(
[Info] Using cuda for inference.
^C

blurry inference result even if input video is 1080p

Hi, this is an awesome project and the lipsync is really good.

I’ve encountered some problems: if I use a high resolution video (1920*1080) as input, the output video is blurry on the whole (not just the face area) though the output resolution is also 1080p. It seems that the output video is scaled up from a low-res one.

Based on my understanding of the paper, the generated talking face is pasted back onto the original video. So I wonder if this global blurriness is normal…

I used the command from the readme for inference. Not sure if there are other options I missed.

python3 inference.py \
  --face examples/face/1.mp4 \
  --audio examples/audio/1.wav \
  --outfile results/1_1.mp4

Thank you for your great repo.

Install error w/ requirements.txt

I followed the instructions and got this error:

Installing collected packages: yapf, tensorboard-plugin-wit, pyasn1, ninja, lmdb, einops, dlib, addict, tensorboard-data-server, six, rsa, pyyaml, pyparsing, pyasn1-modules, protobuf, oauthlib, numpy, networkx, MarkupSafe, llvmlite, kiwisolver, importlib-resources, grpcio, future, fonttools, cycler, colorama, charset-normalizer, cachetools, absl-py, werkzeug, tqdm, tifffile, scipy, PyWavelets, python-dateutil, opencv-python, numba, markdown, kornia, imageio, google-auth, contourpy, scikit-image, resampy, requests-oauthlib, matplotlib, librosa, google-auth-oauthlib, filterpy, face-alignment, tb-nightly, facexlib, basicsr
Running setup.py install for dlib ... error
error: subprocess-exited-with-error

× Running setup.py install for dlib did not run successfully.
│ exit code: 1
╰─> [9 lines of output]
running install
C:\Users\chlyw\anaconda3\envs\video_retalking\lib\site-packages\setuptools\command\install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
warnings.warn(
running build
running build_py
running build_ext

  ERROR: CMake must be installed to build dlib

  [end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: legacy-install-failure

× Encountered error while trying to install package.
╰─> dlib
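The key line is "CMake must be installed to build dlib": pip is falling back to compiling dlib from source and can't find CMake. Two commonly used fixes (suggestions, not steps from this repo's README) are to install CMake first so the source build succeeds:

pip install cmake
pip install dlib

or, inside a conda environment, to install a prebuilt binary:

conda install -c conda-forge dlib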

Error in the fifth step of the program: (GPEN)

Traceback (most recent call last):
  File "inference.py", line 342, in <module>
    main()
  File "inference.py", line 198, in main
    pred, _, _ = enhancer.process(img, img, face_enhance=True, possion_blending=False)
  File "C:\AI-3\third_part\GPEN\gpen_face_enhancer.py", line 116, in process
    mask_sharp = cv2.GaussianBlur(mask_sharp, (0,0), sigmaX=1, sigmaY=1, borderType = cv2.BORDER_DEFAULT)
UnboundLocalError: local variable 'mask_sharp' referenced before assignment

Any help or answers would be appreciated. Thank you.

A question about passing in an image

Hi, a question about passing in an image: the head in the generated video has no animation, and no errors are reported. Is there something wrong on my end?

Optimise inference time

Thanks for the great work; you are all making a great contribution to the community.
However, is there any way to reduce inference time? Thanks.

Regards

ValueError: cannot reshape array of size

Error message:
[Step 1] Using saved landmarks.
Traceback (most recent call last):
  File "inference.py", line 364, in <module>
    main()
  File "inference.py", line 88, in main
    lm = lm.reshape([len(full_frames), -1, 2])
ValueError: cannot reshape array of size 206040 into shape (6577,newaxis,2)
Original input video dimensions: (screenshot attached)

I don't know the exact cause. Are there specific requirements for the video dimensions, or is it caused by the file size?
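Note the first log line: "[Step 1] Using saved landmarks." The script is reusing a cached landmarks file, and the sizes suggest the cache belongs to a different video, so deleting the saved landmarks so step 1 re-extracts them is worth trying (this is an inference from the array shapes, not a confirmed diagnosis). The arithmetic behind that guess:

# the cached array only reshapes to (n_frames, 68, 2) if the counts match
saved_values = 206040
per_frame = 68 * 2               # 68 landmarks with (x, y) each
print(saved_values / per_frame)  # 1515.0 -> the cache covers 1515 frames
print(6577 * per_frame)          # 894472 -> values the current video needs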

Best practices for avatar emotion and expression

I'd like the avatar to change emotion while speaking, e.g. speaking sadly, happily, or excitedly. Is it better to use different source videos that already carry the emotion, or is there another way to achieve this?

FileNotFoundError: [WinError 2]

When I got to the sixth step, it broke.
Traceback (most recent call last):
  File "inference.py", line 342, in <module>
    main()
  File "inference.py", line 271, in main
    subprocess.call(command, shell=platform.system() != 'Windows')
  File "E:\anaconda\envs\video_retalking\lib\subprocess.py", line 340, in call
    with Popen(*popenargs, **kwargs) as p:
  File "E:\anaconda\envs\video_retalking\lib\subprocess.py", line 858, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "E:\anaconda\envs\video_retalking\lib\subprocess.py", line 1311, in _execute_child
    hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
FileNotFoundError: [WinError 2] The system cannot find the file specified.
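On Windows, [WinError 2] from subprocess means the executable being launched was not found, not that an input file is missing. The subprocess.call at inference.py line 271 shells out to an external tool; the usual culprit here (a guess from the call site, not a verified diagnosis) is ffmpeg missing from PATH. Installing it into the active env and reopening the shell typically resolves it:

conda install ffmpeg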

About model training

Thank you for your contribution of this video lip-sync model!
Could I get any info about the training code for L-Net and E-Net so I can apply it to my own dataset?
Looking forward to your response!

output resolution problem

This is a great open-source project, but there is a gap between the output resolution and the original video: the output resolution is low and shows aliasing. Is there a way to deal with this? Thank you very much!

The following error occurred when executing step six!

Traceback (most recent call last):
  File "D:\AIGC\videotalk\video-retalking\utils\inference_utils.py", line 118, in face_detect
    predictions.extend(detector.get_detections_for_batch(np.array(images[i:i + batch_size])))
  File "D:\AIGC\videotalk\video-retalking\third_part\face_detection\api.py", line 66, in get_detections_for_batch
    detected_faces = self.face_detector.detect_from_batch(images.copy())
  File "D:\AIGC\videotalk\video-retalking\third_part\face_detection\detection\sfd\sfd_detector.py", line 42, in detect_from_batch
    bboxlists = batch_detect(self.face_detector, images, device=self.device)
  File "D:\AIGC\videotalk\video-retalking\third_part\face_detection\detection\sfd\detect.py", line 69, in batch_detect
    olist = net(imgs)
  File "C:\Users\menka\.conda\envs\video_retalking\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "D:\AIGC\videotalk\video-retalking\third_part\face_detection\detection\sfd\net_s3fd.py", line 71, in forward
    h = F.relu(self.conv1_1(x))
  File "C:\Users\menka\.conda\envs\video_retalking\lib\site-packages\torch\nn\functional.py", line 1298, in relu
    result = torch.relu(input)
RuntimeError: CUDA out of memory. Tried to allocate 508.00 MiB (GPU 0; 4.00 GiB total capacity; 2.69 GiB already allocated; 0 bytes free; 3.11 GiB reserved in total by PyTorch)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "inference.py", line 342, in <module>
    main()
  File "inference.py", line 211, in main
    for i, (img_batch, mel_batch, frames, coords, img_original, f_frames) in enumerate(tqdm(gen, desc='[Step 6] Lip Synthesis:', total=int(np.ceil(float(len(mel_chunks)) / args.LNet_batch_size)))):
  File "C:\Users\menka\.conda\envs\video_retalking\lib\site-packages\tqdm\std.py", line 1178, in __iter__
    for obj in iterable:
  File "inference.py", line 292, in datagen
    face_det_results = face_detect(full_frames, args, jaw_correction=True)
  File "D:\AIGC\videotalk\video-retalking\utils\inference_utils.py", line 121, in face_detect
    raise RuntimeError('Image too big to run face detection on GPU. Please use the --resize_factor argument')
RuntimeError: Image too big to run face detection on GPU. Please use the --resize_factor argument

Different Errors while trying to install

I followed the instructions on the main page.
To keep my first test simple, I put the audio and video files in the root directory for a short path:

python3 inference.py --face 1.mp4 --audio 1.wav --outfile 1_1.mp4

After installing CUDA and the requirements as explained,
I had a lot of errors about missing modules and tried to install those packages myself.
But I can't get rid of the last error I'm stuck on:

D:\Video_Retalking>python3 inference.py --face 1.mp4 --audio 1.wav --outfile 1_1.mp4
No CUDA runtime is found, using CUDA_HOME='C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.3'
C:\Users\Alon\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\torchvision\transforms\functional_tensor.py:5: UserWarning: The torchvision.transforms.functional_tensor module is deprecated in 0.15 and will be **removed in 0.17**. Please don't rely on it. You probably just need to use APIs in torchvision.transforms.functional or in torchvision.transforms.v2.functional.
  warnings.warn(
Traceback (most recent call last):
  File "D:\Video_Retalking\inference.py", line 19, in <module>
    from utils.ffhq_preprocess import Croper
  File "D:\Video_Retalking\utils\ffhq_preprocess.py", line 31, in <module>
    import dlib
  File "C:\Users\Alon\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\dlib\__init__.py", line 19, in <module>
    from _dlib_pybind11 import *
ImportError: DLL load failed while importing _dlib_pybind11: The specified module could not be found.


I tried pip install, pip3 install, and even conda install, but nothing solved the problem.
I also tried re-installing CUDA, but I still get the same error as above.

Any idea how to fix this and make video-retalking run locally under Anaconda?
Thanks ahead 🙏
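Two details in the log stand out: "No CUDA runtime is found", and the interpreter is the Microsoft Store Python build (the PythonSoftwareFoundation.Python.3.9_... path), in which dlib's native extension often fails to locate its dependent DLLs. A commonly suggested path (a suggestion based on the log, not an official fix) is to use the conda environment from the README instead of the Store interpreter, and install a prebuilt dlib there:

conda create -n video_retalking python=3.8
conda activate video_retalking
conda install -c conda-forge dlib
pip install -r requirements.txt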
