wyhsirius / LIA
[ICLR 22] Latent Image Animator: Learning to Animate Images via Latent Space Navigation
Home Page: https://wyhsirius.github.io/LIA-project/
License: Other
Great work, thank you for publishing this. I was wondering how to generate the high-resolution (512x512) video outputs that you include on the project page. Do I just need to set the size parameter to 512 instead of the default of 256 here?
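For reference, a hedged example of the invocation I have in mind, assuming run_demo.py exposes a --size flag and a 512-trained checkpoint is loaded (the flag name and paths are my guesses; please check the argparse definitions in run_demo.py):

python run_demo.py --model vox --size 512 --source_path ./source.jpg --driving_path ./driving.mp4 --save_folder ./res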
Thanks for your excellent work. I have a question.
Why do we need to repeat the latent variable 14 times (latent = latent.reshape((latent.shape[0], -1)).unsqueeze(1).repeat(1, inject_index, 1))? Why not feed the (1, 512) latent directly into all subsequent networks? Repeating it adds no new information. Could you explain the reasoning here?
Thanks.
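To make the shapes concrete, a minimal sketch of what that line does (inject_index = 14 matches the number of style-injection points in a 256x256 StyleGAN2-style decoder; this is an illustration, not the repo's code):

import torch

latent = torch.randn(1, 512)  # a single latent code
inject_index = 14             # one style input per decoder layer
# broadcast the same code to every injection layer (the W+ interface):
w_plus = latent.reshape(latent.shape[0], -1).unsqueeze(1).repeat(1, inject_index, 1)
print(w_plus.shape)           # torch.Size([1, 14, 512])

The repeat indeed adds no information; it only matches the W+ interface, where each decoder layer could in principle receive a different 512-d code.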
Hey @wyhsirius,
thanks for the nice work. May I ask whether you have tried training the proposed LIA on TED at size 384 * 384?
Everything seems fine except the eye movement: my model doesn't correctly capture eye movement from the driving videos. It is now at 300k steps; should I wait for more steps? Are there any other parameter-setting tricks?
Hello,
I really appreciate your great work.
I have a question about the implementation of the encoder's motion network blocks:
https://github.com/wyhsirius/LIA/blob/main/networks/encoder.py#L252
Here, the EqualLinear blocks of the motion network have no activation function.
In that case, the motion network has no nonlinearity, so mathematically it is equivalent to a single (larger) linear layer.
Is this intended?
If so, have you observed any significant difference between adding an activation to each linear block and simply using a single linear layer?
For comparison, the stacked MLP blocks in StyleGAN2 do contain activation functions:
https://github.com/rosinality/stylegan2-pytorch/blob/master/model.py#L412
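To make the observation concrete, a quick self-contained check (generic PyTorch, not the repo's EqualLinear, which only adds a constant weight scaling and so is still linear) that stacked activation-free linear layers collapse into one:

import torch
import torch.nn as nn

torch.manual_seed(0)
f = nn.Sequential(nn.Linear(512, 512), nn.Linear(512, 512))  # no activation between layers
# compose the two affine maps into a single weight and bias
W = f[1].weight @ f[0].weight
b = f[1].weight @ f[0].bias + f[1].bias
x = torch.randn(4, 512)
print(torch.allclose(f(x), x @ W.T + b, atol=1e-5))  # True: equivalent to one linear layer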
I tried to run inference with the LIA 512 model (found on Google Drive), but the output is not that sharp; it looks only slightly sharper than the 256 model. Is that normal?
Do I have to change any parameter other than size?
I am stuck on this error: "ImportError: PyAV is not installed, and is necessary for the video operations in torchvision." I have installed PyAV with "pip install av", but the error persists. I uninstalled and reinstalled it, and I replaced all the files inside the LIA folder with the ones from the git repo, but I don't know what else to do. Can anyone help?
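One thing worth checking (a hedged suggestion, since the cause here isn't confirmed): make sure av was installed into the same interpreter that runs the demo; in notebooks it is easy to pip-install into one environment while executing in another, and a runtime restart is sometimes needed after installing:

import sys
print(sys.executable)   # which Python is actually running
import av               # should import cleanly if the install landed in this environment
print(av.__version__)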
Recently I have been trying to train LIA at a higher resolution (512*512). I have uploaded a checkpoint at https://huggingface.co/taocode/LIA_512, but the results seem to be very poor. Has anyone succeeded in training at a higher resolution based on this repo?
How can I control the intensity of the animation?
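In case it helps, a speculative sketch: the paper represents motion as a linear combination of learned directions with predicted magnitudes, so one plausible knob is to scale those magnitudes before decoding. The names below are illustrative, not the repo's actual API:

# hypothetical helper; `alpha` stands for the per-direction motion
# magnitudes predicted by the motion encoder (not a confirmed variable name)
def scale_motion(alpha, intensity=0.5):
    # intensity < 1 damps the animation, > 1 exaggerates it
    return alpha * intensity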
Hi, this is great work on face animation, but I found some artifacts along the contour of the head; it seems to lack a harmonization step around the edges and contours.
Have you noticed this as well?
Is it a problem caused by the network?
If so, are there any methods to fix it?
I evaluated your model with video FID, following the same implementation you mention in the paper. The FID for same-identity reconstruction comes out at 6.8586, versus the 0.161 you report for cross-video generation. Since cross-video generation is harder than same-identity reconstruction, its FID should be the higher of the two.
I'm confused by these results and would like to know the details of your FID evaluation process.
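For reference, this is the standard Fréchet distance I computed between Gaussians fitted to feature activations (a generic sketch of FID itself, not necessarily the authors' exact pipeline or feature extractor):

import numpy as np
from scipy import linalg

def fid(mu1, sigma1, mu2, sigma2):
    # Frechet distance between N(mu1, sigma1) and N(mu2, sigma2):
    # ||mu1 - mu2||^2 + Tr(sigma1 + sigma2 - 2 * sqrtm(sigma1 @ sigma2))
    diff = mu1 - mu2
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    return diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean)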
First error:
!python run_demo.py --model vox --source_path /content/LIA/data/vox/241.jpg --driving_path /content/LIA/data/vox/faceexp2.mp4 --save_folder /content/LIA/res # using vox model
==> loading model
==> loading data
Traceback (most recent call last):
  File "run_demo.py", line 110, in <module>
    demo = Demo(args)
  File "run_demo.py", line 72, in __init__
    self.vid_target, self.fps = vid_preprocessing(args.driving_path)
  File "run_demo.py", line 31, in vid_preprocessing
    vid_dict = torchvision.io.read_video(vid_path, pts_unit='sec')
  File "/usr/local/lib/python3.7/dist-packages/torchvision/io/video.py", line 273, in read_video
    _check_av_available()
  File "/usr/local/lib/python3.7/dist-packages/torchvision/io/video.py", line 42, in _check_av_available
    raise av
ImportError: PyAV is not installed, and is necessary for the video operations in torchvision.
See https://github.com/mikeboers/PyAV#installation for instructions on how to
install PyAV on your system.
!pip install av
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting av
Downloading av-9.2.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (28.2 MB)
|████████████████████████████████| 28.2 MB 1.2 MB/s
Installing collected packages: av
Successfully installed av-9.2.0
Second error:
!python run_demo.py --model vox --source_path /content/LIA/data/vox/241.jpg --driving_path /content/LIA/data/vox/faceexp2.mp4 --save_folder /content/LIA/res # using vox model
==> loading model
==> loading data
==> running
0% 0/1273 [00:00<?, ?it/s]/content/LIA/networks/styledecoder.py:439: UserWarning: torch.qr is deprecated in favor of torch.linalg.qr and will be removed in a future PyTorch release.
The boolean parameter 'some' has been replaced with a string parameter 'mode'.
Q, R = torch.qr(A, some)
should be replaced with
Q, R = torch.linalg.qr(A, 'reduced' if some else 'complete') (Triggered internally at ../aten/src/ATen/native/BatchLinearAlgebra.cpp:1980.)
Q, R = torch.qr(weight) # get eignvector, orthogonal [n1, n2, n3, n4]
/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py:4194: UserWarning: Default grid_sample and affine_grid behavior has changed to align_corners=False since 1.3.0. Please specify align_corners=True if the old behavior is desired. See the documentation of grid_sample for details.
"Default grid_sample and affine_grid behavior has changed "
100% 1273/1273 [01:12<00:00, 17.61it/s]
Traceback (most recent call last):
  File "run_demo.py", line 111, in <module>
    demo.run()
  File "run_demo.py", line 93, in run
    save_video(vid_target_recon, self.save_path, self.fps)
  File "run_demo.py", line 44, in save_video
    torchvision.io.write_video(save_path, vid[0], fps=fps)
  File "/usr/local/lib/python3.7/dist-packages/torchvision/io/video.py", line 135, in write_video
    container.mux(packet)
  File "av/container/output.pyx", line 211, in av.container.output.OutputContainer.mux
  File "av/container/output.pyx", line 217, in av.container.output.OutputContainer.mux_one
  File "av/container/output.pyx", line 172, in av.container.output.OutputContainer.start_encoding
  File "av/error.pyx", line 336, in av.error.err_check
av.error.FileNotFoundError: [Errno 2] No such file or directory
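The final av.error.FileNotFoundError most likely means the output directory passed as --save_folder does not exist, since torchvision.io.write_video will not create missing parent directories. A hedged workaround (using the save path from the command above) is to create it before running the demo:

import os
os.makedirs('/content/LIA/res', exist_ok=True)  # write_video needs the folder to exist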
Hey @wyhsirius,
Very impressive work. May I ask how long training the proposed method took on each dataset?
I want to do some related research, so knowing this would help me greatly.
Best regards
Awesome, but where is the code? FOMM is still the leader despite being 3 years old!
Regards
Hi all!
LIA is a really cool project and currently one of the best that gives quality animation. Great work!
My suggestion is the following.
In the paper, the authors say they trained for approximately 6 days on 8 V100s. The V100 being a predecessor of the 3090, and a 3090 on vast.ai costing around $2.5 per hour, I'd assume the training cost is around $1k.
Original post: #5 (comment)
Hi, I found that the activation in this sub-network is set to None. If it is None, self.fc amounts to a cascade of linear layers without activations, and is therefore equivalent to a single linear layer. Is that right?
Hey @wyhsirius,
I was training the model on 4 GPUs. Have you met the following problem?
When I train from scratch (step 0), I can use batch_size=32 without any problem. However, when I resume training with --resume_ckpt, I get the output below, and I have to use a very small batch size to avoid the out-of-memory problem.
I would appreciate it if you could share some suggestions for solving this.
Bests,
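A hedged guess at the cause, in case it helps: torch.load by default maps saved tensors back onto the GPU they were serialized from, which can leave a duplicate copy of the whole checkpoint in GPU memory on resume. Loading to CPU first is a common fix (a sketch only; the checkpoint key below is an assumption, not the repo's confirmed layout):

import torch

ckpt = torch.load(resume_path, map_location='cpu')  # avoid the GPU-side copy
model.load_state_dict(ckpt['model'])                # 'model' key is an assumption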
Hello,
When I try the demo here: https://replicate.com/wyhsirius/lia
It outputs this error:
mat1 and mat2 shapes cannot be multiplied (6656x6 and 512x512)
What should I do to fix this?
Line 90 in d120cb2
Thanks for the great work! Could you please explain why we need to use h_start, and why you set h_start=None for the TED data?
Hi, thanks for the nice work. I have some questions about the architecture.
If I missed something, please feel free to tell me. Anyway, I am enjoying LIA. Thanks.
Thanks for your great work.
Would you please share the 512x512 checkpoints? They would give better results.