
Comments (42)

vincent-thevenin avatar vincent-thevenin commented on July 19, 2024 3

@Bip3 @shiyingZhang90 Hello again, good news, I managed to produce great results with no collapse. Thank you for your patience. Here are some sample images:
image
image
image

from realistic-neural-talking-head-models.

Jarvisss avatar Jarvisss commented on July 19, 2024 3

@kaahan
The vgg19 and vggface losses mentioned in the paper use the caffe-trained models, whose input should be in BGR order, in the range [0-255].

However, in this repo, vgg19 and vggface take images in RGB order, normalized to [0-1], while keeping the loss weights the same as in the paper, i.e. vgg19_weight=1.5e-1, vggface_weight=2.5e-2.

So you should either change the weight of the content loss, or switch to the caffe-pretrained models, to keep the final loss balanced.

For me, I downloaded the caffe version of vgg19 from https://github.com/jcjohnson/pytorch-vgg and made the input to vgg be in the range [0-255], in BGR order.

Main code (x and x_hat are the [0-1] RGB image tensors of shape B x 3 x H x W):

self.vgg19_caffe_RGB_mean = torch.FloatTensor([123.68, 116.779, 103.939]).view(1, 3, 1, 1).to(device)    # RGB order
self.vggface_caffe_RGB_mean = torch.FloatTensor([129.1863, 104.7624, 93.5940]).view(1, 3, 1, 1).to(device)  # RGB order

x_vgg19 = x * 255 - self.vgg19_caffe_RGB_mean
x_vgg19 = x_vgg19[:, [2, 1, 0], :, :]                  # B RGB H W -> B BGR H W
x_hat_vgg19 = x_hat * 255 - self.vgg19_caffe_RGB_mean
x_hat_vgg19 = x_hat_vgg19[:, [2, 1, 0], :, :]          # B RGB H W -> B BGR H W
x_vggface = x * 255 - self.vggface_caffe_RGB_mean
x_vggface = x_vggface[:, [2, 1, 0], :, :]              # B RGB H W -> B BGR H W
x_hat_vggface = x_hat * 255 - self.vggface_caffe_RGB_mean
x_hat_vggface = x_hat_vggface[:, [2, 1, 0], :, :]      # B RGB H W -> B BGR H W

Edit: I ran the meta-training for ~8 epochs on the voxceleb2 dev dataset

Edit 2: I've created PR #56 to update the vgg loss calculation


rexxar-liang avatar rexxar-liang commented on July 19, 2024 1

Got the same issue as @OndrejTexler. I trained the model (using the VOX2 training dataset), and after 5000 iterations I used the checkpoint for inference, then got "red" results.


vincent-thevenin avatar vincent-thevenin commented on July 19, 2024 1

@OndrejTexler @mikirui @brucechou1983 @rexxar-liang @Selimonder @yushizhiyao , Thank you for your feedback!
I finally got my hands on better hardware and reproduced your error after 2.4 epochs on the full dataset.

After working on it, I successfully reversed the red outputs; as of now it looks like the model starts over as it recovers from the collapse.

The problem seems to come from how I updated the weights in train.py. I updated the generator and the discriminator weights at the same time by calculating the gradient of the sum of lossG and lossD.
However, a component of lossG, lossAdv, cancels out lossD when they are summed. So the gradient is the same for both networks, when it should point in different directions for the generator and the discriminator.

I am training the model some more at the moment to check that the problem really comes from there and that I'm not mistaken.
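The fix described above amounts to giving the generator and discriminator their own optimizers and alternating their updates, detaching the fake sample in the discriminator step so lossAdv and lossD can no longer cancel. A minimal sketch with stand-in modules and a hinge-style loss (hypothetical names, not the repo's actual train.py):

```python
import torch
import torch.nn as nn

# Stand-in networks, purely illustrative -- not the repo's actual G and D.
G = nn.Linear(8, 8)
D = nn.Linear(8, 1)

# Separate optimizers, so lossG and lossD each drive their own parameters.
optG = torch.optim.Adam(G.parameters(), lr=5e-5)
optD = torch.optim.Adam(D.parameters(), lr=2e-4)

x_real = torch.randn(4, 8)
z = torch.randn(4, 8)

# Generator step: push D(G(z)) up; only G's weights move.
optG.zero_grad()
x_fake = G(z)
lossG = -D(x_fake).mean()
lossG.backward()
optG.step()

# Discriminator step: detach the fake so no gradient reaches G,
# and the discriminator gets its own hinge objective.
optD.zero_grad()
lossD = torch.relu(1.0 - D(x_real)).mean() + torch.relu(1.0 + D(x_fake.detach())).mean()
lossD.backward()
optD.step()
```

Because x_fake is detached in the discriminator step, lossD cannot push gradients into G, which is exactly what summing lossG + lossD into one backward pass failed to guarantee.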


amil-rp-work avatar amil-rp-work commented on July 19, 2024 1

Hey @Jarvisss
Wow, you got some amazing results! Congratulations on such an amazing job.
A few questions:

  • Were you able to replicate the results from the paper?
  • If possible can you please share your model_weights?


vincent-thevenin avatar vincent-thevenin commented on July 19, 2024

Hello @bowen-xiao96 !
Thanks for pointing that out; I'm puzzled as to how I got my results with the content loss dropping its gradient. That's a major mistake on my part.

I'd love to know more about the issues you encountered that make my code unreproducible for you.

There are no hidden tricks per se. I followed what Egor Zakharov wrote in his paper and on social media, so Adam should be used without momentum. The generator should be pretty deep for good results; I used 17 residual blocks in total. What made the difference for me was the AdaIN parameters: there should be 2 parameters for each channel width in each normalization layer.
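That "2 parameters for each width" point can be pictured as an adaptive instance normalization layer whose per-channel scale and bias are predicted from the embedding. An illustrative toy version (all shapes and variable names below are invented for the example, not taken from the repo):

```python
import torch
import torch.nn as nn

# Toy AdaIN layer: normalizes each channel, then applies a predicted
# per-channel scale (gamma) and bias (beta) -- two parameters per channel.
class AdaIN(nn.Module):
    def forward(self, x, gamma, beta):
        # x: (B, C, H, W); gamma, beta: (B, C)
        mean = x.mean(dim=(2, 3), keepdim=True)
        std = ((x - mean) ** 2).mean(dim=(2, 3), keepdim=True).sqrt() + 1e-5
        x = (x - mean) / std
        return gamma[:, :, None, None] * x + beta[:, :, None, None]

# Invented shapes: a 32-dim embedding predicting params for 16 channels.
adain = AdaIN()
x = torch.randn(2, 16, 8, 8)
e_hat = torch.randn(2, 32)            # embedding from the embedder network
proj = nn.Linear(32, 2 * 16)          # 2 parameters per channel width
gamma, beta = proj(e_hat).chunk(2, dim=1)
y = adain(x, gamma, beta)
```

In the full generator, every adaptive normalization layer at every resolution consumes its own gamma/beta slice of the projected embedding.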


mikirui avatar mikirui commented on July 19, 2024

Hi @bowen-xiao96 ,

Have you tried the code after the modification to the vgg loss you mentioned here? Can you reproduce @vincent-thevenin's results after applying this modification?
Thank you.


mikirui avatar mikirui commented on July 19, 2024

Hi @vincent-thevenin ,

I tried your code for training, but I can only get blurry results (examples shown below) after training ~7 epochs on the VoxCeleb2 test set.
output_11000

Besides, when I use your provided pre-trained model as initialization, the generated results get worse as I train for more iterations with your code. Could there be some differences between your posted code and the training code used for that pre-trained model?
Many thanks if you can figure this out!


vincent-thevenin avatar vincent-thevenin commented on July 19, 2024

Hi @mikirui,
Thank you for trying the code out, and sorry for the late answer; I somehow did not receive any notifications for the mentions.

I just made some further changes to how the gradient flow is handled. After fixing @bowen-xiao96's error I did not test train.py immediately, so your blurriness might come from there. The latest commit should work better. Can you please pull the latest changes and tell me if you see any difference?


mikirui avatar mikirui commented on July 19, 2024

Hi @vincent-thevenin ,
Thanks for your reply and the code updates.

I tried your latest code, and the results (after ~2 epochs) are shown below:
y_8100
output_8100

I also used the dataloader from https://github.com/grey-eye/talking-heads (which is faster) and ran for about 5 epochs. The results are shown:
y_34900
output_34900

The generated image quality is better than with the last version of the code, but I think the results are still considerably blurrier than those shown in your README?


OndrejTexler avatar OndrejTexler commented on July 19, 2024

First of all, @vincent-thevenin, great work - I really appreciate it!

@mikirui, it seems I did exactly what you did.

Results generated using the latest repo code (train dataset, after ~1 epoch; it took ~22 hours on the HW I have available). Maybe there is a chance that in the next 4 epochs it would converge to results similar to those you present in the README, but I do not think so.
res_1

Results when the grey-eye/talking-heads data preprocessing and data loader are used, after 5 epochs (there are still a lot of artifacts and it certainly does not look like the README's results):
res_5_2200_1

And... in the 10th epoch something goes really wrong: results start to be "red",
and it never recovers. Result after 25 epochs:
res_5_2200_1

The loss for 25 epochs (in the 10th epoch a small increase of the loss is visible, but nothing major):
res_5_2200_1

Bottom line:
The dataloader of grey-eye/talking-heads has significantly less randomness, but it works in their scenario, so I guess it is not the issue here.

Also, I tried your pre-trained model and I am getting really good results with it. I would love to be able to train a model achieving results similar to the pre-trained one.

Do you have any suggestions or ideas? Thank you!


brucechou1983 avatar brucechou1983 commented on July 19, 2024

@vincent-thevenin Thanks for sharing this great repo.

I'm trying this model with a larger training set (still a subset of the entire VoxCeleb2 dev set). During the first few steps it seems I can see a human face, like this:

1399

But the model collapses to all black very soon, at around 3k~5k steps.

3499

Does anyone run into similar problems?

*Update: tried to train from scratch using the test set of VoxCeleb2 and got similar results.

image

More information:
I had difficulty building caffe, so I used the Pytorch_VGGFACE.pth and Pytorch_VGGFACE_IR.py shared in issue #10. The training data is the VoxCeleb2 test set, which contains 36273 videos. Hyper-parameters remain untouched. I also tried to increase the feature matching loss weight from 1e1 to 5e1, but the model collapsed even faster.

Module versions:
numpy==1.16.1
torch==1.2.0
torchvision==0.4.0

Environment:
ubuntu 16.04
1080Ti

Any help would be appreciated.

*Update 2:
The mode collapse issue goes away after I built these vggface files myself.

Now I run into the same situation as @OndrejTexler: everything looks red after ~22k steps.

image

The content loss and feature matching loss increase dramatically.


brucechou1983 avatar brucechou1983 commented on July 19, 2024

Has anyone successfully reproduced results like those of @vincent-thevenin's checkpoint?


Selimonder avatar Selimonder commented on July 19, 2024

20k steps into the vox-dev set here. Although the results around 10k still have color, around the 15k step everything appears to be red. I used a batch size of 4 along with K=8.
Some kind of weird gradient explosion?


yushizhiyao avatar yushizhiyao commented on July 19, 2024

I got the same error as you, @brucechou1983; have you found the reason?


vincent-thevenin avatar vincent-thevenin commented on July 19, 2024

I have added a commit to the master branch, but you should check out the other branch: I am actively maintaining that one and will merge it soon. The second branch preprocesses the dataset and uses custom folder paths. I managed to decrease training time by 20x compared to the main branch.


dimaischenko avatar dimaischenko commented on July 19, 2024

by 20x compared to the main branch.

Hi, @vincent-thevenin . Do you mean 2x?


vincent-thevenin avatar vincent-thevenin commented on July 19, 2024

@dimaischenko I went from some unsightly 240 hours/epoch on the full dataset on my setup to less than 12h. So indeed 20x :) Also the compressed preprocessed dataset is 17GB compared to 270GB for the full one.
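The speedup comes from paying the video-decoding cost once up front and storing only the few frames actually needed per video. A hypothetical illustration of that preprocessing idea (the real preprocess.py differs; file paths, shapes, and helper names here are invented):

```python
import numpy as np

# Hypothetical sketch: decode each video once, keep only K frames, and
# store them compressed, so training never touches the raw ~270 GB again.
def sample_frame_indices(num_frames: int, k: int) -> np.ndarray:
    """Pick k evenly spaced frame indices from a num_frames-long clip."""
    return np.linspace(0, num_frames - 1, k).astype(int)

idx = sample_frame_indices(250, 8)                    # e.g. a 250-frame clip, K=8
frames = np.zeros((8, 224, 224, 3), dtype=np.uint8)   # stand-in decoded frames
np.savez_compressed("/tmp/clip_0001.npz", frames=frames, indices=idx)

data = np.load("/tmp/clip_0001.npz")                  # what a dataloader would read
```

With only K frames kept per clip (plus compression), the stored dataset shrinks by orders of magnitude, which is consistent with the 270 GB to 17 GB figure above.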


dimaischenko avatar dimaischenko commented on July 19, 2024

The problem seems to comes from how I updated the weights in train.py. I update the generator and the discriminator weights at the same time by calculating the gradient of the sum of lossG and lossD.
However, a component of lossG: lossAdv cancels out lossD when summed together. So the gradient is the same when it should point to different directions for the generator and the discriminator.

@vincent-thevenin have you tried training with the new loss and gotten good results? I tried for 1 epoch on the full dev dataset, but lossG is around 100 and the results are much worse than with the previous loss (lossG + lossD).


brucechou1983 avatar brucechou1983 commented on July 19, 2024

@dimaischenko I went from some unsightly 240 hours/epoch on the full dataset on my setup to less than 12h. So indeed 20x :) Also the compressed preprocessed dataset is 17GB compared to 270GB for the full one.

@vincent-thevenin Thanks for sharing your findings. May I know the hardware requirement (GPU memory budget) to run your latest updates for the whole dataset?


dimaischenko avatar dimaischenko commented on July 19, 2024

Hello @vincent-thevenin! Did you achieve good results with the new loss?


vincent-thevenin avatar vincent-thevenin commented on July 19, 2024

@dimaischenko I got bad results as well. I experimented a bit, and disabling the adversarial loss and the matching loss to keep just the content loss creates outputs similar to what I would get on the main branch. I'm still looking into it and will report back once I reach good results.


vincent-thevenin avatar vincent-thevenin commented on July 19, 2024

@brucechou1983 The model uses 8 GB of VRAM with a batch size of 2. I haven't tested with a batch size of 1, but if you're having memory problems, try reducing the batch size to 1 first.


Bip3 avatar Bip3 commented on July 19, 2024

Hey @vincent-thevenin, I'm trying to use your latest branch, but I ran into some problems, mostly just understanding what some of your parameters are. What are path_to_Wi and path_to_preprocess? I assumed path_to_preprocess was the path to the voxceleb dataset, but when I plug in the correct path it just errors out.


shiyingZhang90 avatar shiyingZhang90 commented on July 19, 2024

Hi @vincent-thevenin, for the branch with the 20x decreased training time, do you mean the save_disc branch? Have you already gotten good results from that branch? Thanks.


vincent-thevenin avatar vincent-thevenin commented on July 19, 2024

Hi @Bip3, path_to_Wi is the path to a folder that contains the discriminator vectors for each video. I started using the full dataset, and loading everything onto the GPU just consumes memory uselessly, so I save and load the vectors when necessary; the folder is filled when you initialize the Discriminator.

path_to_preprocess is the folder where you save the preprocessed images after running preprocess.py.

I will update the readme to make the changes clearer.
Hope that helps.
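The save-and-load scheme described above can be pictured as keeping one W_i file per video and touching disk only when that video is sampled. A hypothetical sketch (the file layout, names, and dimensions are invented for illustration, not the repo's actual format):

```python
import os
import torch

# One discriminator vector W_i per video, kept on disk and loaded only
# when needed, instead of holding all of them in GPU memory.
path_to_Wi = "/tmp/Wi_folder"   # stand-in for the real path_to_Wi setting
os.makedirs(path_to_Wi, exist_ok=True)

def save_wi(video_id: int, w: torch.Tensor) -> None:
    torch.save({"W_i": w.cpu()}, os.path.join(path_to_Wi, f"W_{video_id}.tar"))

def load_wi(video_id: int, dim: int = 512) -> torch.Tensor:
    f = os.path.join(path_to_Wi, f"W_{video_id}.tar")
    if os.path.exists(f):
        return torch.load(f)["W_i"]
    return torch.randn(dim, 1)  # fresh vector the first time a video is seen

w = load_wi(3)   # first access: random initialization
save_wi(3, w)    # persist after the optimizer step
```

Only the vectors for the videos in the current batch ever need to be resident, which is what keeps memory flat as the dataset grows.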


vincent-thevenin avatar vincent-thevenin commented on July 19, 2024

@shiyingZhang90 this is my result after 15 epochs; I'm still training it further at the moment.

image


shiyingZhang90 avatar shiyingZhang90 commented on July 19, 2024

@vincent-thevenin thanks so much for the update! Actually, I don't think calculating the gradient of the sum of lossG and lossD is wrong according to the paper, hence I'm looking forward to great results from the updated code. BTW, I found that the way you calculate the content loss is different from another repo. Would that reference code help?


Bip3 avatar Bip3 commented on July 19, 2024

Hey @vincent-thevenin, thanks for the response. Could you give instructions on how you got your save_disc branch training?

Edit: Figured it out. For anyone wondering, you must run preprocessing.py and save the output in the folder specified under path_to_preprocess in params. Everything else is pretty much the same as master.


shiyingZhang90 avatar shiyingZhang90 commented on July 19, 2024

Hi @vincent-thevenin, did you get better results after training for more epochs on the save_disc branch? I'm still wondering how to achieve the training results in your demo.


prateek-manocha avatar prateek-manocha commented on July 19, 2024

Hey @vincent-thevenin.
Great work! Can you please share the weights with which you got the above-shown results?
These are from the save_disc branch, right?
I am planning to train the code on the full dataset, so it would be helpful if you could clear this up.


brucechou1983 avatar brucechou1983 commented on July 19, 2024

@vincent-thevenin did you achieve this result with the current save_disc branch (commit: e461da8)?


Jarvisss avatar Jarvisss commented on July 19, 2024

@vincent-thevenin @brucechou1983

Hi, I used the current save_disc branch (commit: e43ca9f) and ran on the full dataset with K=8 and batch size 6.
The results seem different from yours: much blurrier, and with artifacts.

the results after 3 epochs:

epoch_3_batch_4799

epoch_3_batch_18399

result after 4 epochs:
epoch_4_batch_799

epoch_4_batch_11699

Have you seen this kind of result during training? Should I train for more epochs?

losses G :
lossG

and losses D:
lossD


Jarvisss avatar Jarvisss commented on July 19, 2024

Hi, I have trained for another 10 epochs and got results like yours, @vincent-thevenin:

epoch_10_batch_36499

epoch_10_batch_39499

epoch_10_batch_40499

The loss_content kept going down, but loss_adv and loss_matching ended up going up as training progressed.
image

But it seems the model cannot keep the source identity; the generated image looks like a different person from the input image.

I also have a question about Loss_adv.
In your implementation, loss_FM is implemented by summing up the L1 losses of the discriminator's feature maps, without normalization.

I wonder if this is correct, or should we multiply Loss_FM by 1/num_layers or something? The Pix2PixHD pytorch implementation also uses the FM loss, and it does the normalization.
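The two conventions being compared can be written side by side. A toy example with random stand-in feature maps (three layers here, purely illustrative; the real discriminator has more):

```python
import torch

# Random stand-in feature maps from three discriminator layers.
feats_real = [torch.randn(2, 64, 32, 32),
              torch.randn(2, 128, 16, 16),
              torch.randn(2, 256, 8, 8)]
feats_fake = [f + 0.1 * torch.randn_like(f) for f in feats_real]

# Unnormalized sum of per-layer L1 losses (as described for this repo).
loss_fm_sum = sum(torch.nn.functional.l1_loss(a, b)
                  for a, b in zip(feats_fake, feats_real))

# Averaged over layers (the Pix2PixHD-style normalization being asked about).
loss_fm_mean = loss_fm_sum / len(feats_real)
```

The two differ only by a constant factor equal to the layer count, so switching between them is equivalent to rescaling the FM loss weight by the same factor.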


Jarvisss avatar Jarvisss commented on July 19, 2024

Hi @vincent-thevenin , Good News!
I've got some good results with a few changes to your code.

The following results are generated from the same person (id_08696) with different driving videos.

Click the images to view video results on Youtube

1. Feed forward without finetuning

2. Fine tuning for 100 epochs

More results:

As we can see, an identity gap exists in the feed-forward results, which can be bridged by finetuning.


Jarvisss avatar Jarvisss commented on July 19, 2024

Hi @kaahan,

  1. I use the meta-embedding vector e_hat for inference. Fine-tuning only affects G and D, and does nothing to the Embedder.
  2. If the driving landmarks (face shape, etc.) are too different from the source landmarks, the resulting image will look like the driving one.
    image

It would be helpful if you could share your results.
Best,
Jarvisss
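Point 1 above (fine-tuning touches only G and D, never the Embedder) can be illustrated with stand-in modules: the embedder's parameters are frozen and e_hat is computed once, outside the autograd graph. A hypothetical minimal sketch, not the repo's actual fine-tuning code:

```python
import torch
import torch.nn as nn

# Stand-in modules; the real E, G, D are much deeper networks.
E = nn.Linear(16, 8)   # Embedder: frozen during fine-tuning
G = nn.Linear(8, 16)   # Generator: fine-tuned

for p in E.parameters():
    p.requires_grad = False          # the Embedder never moves

opt = torch.optim.Adam(G.parameters(), lr=5e-5)  # note: no E.parameters() here

with torch.no_grad():
    e_hat = E(torch.randn(1, 16))    # meta-embedding, computed once

target = torch.randn(1, 16)
out = G(e_hat)
loss = (out - target).abs().mean()   # stand-in reconstruction loss
opt.zero_grad()
loss.backward()
opt.step()
```

Since e_hat is produced under no_grad and E is excluded from the optimizer, fine-tuning can only reshape G (and D), which is why the embedding itself stays fixed.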


phquanta avatar phquanta commented on July 19, 2024

(quoting @Jarvisss's VGG loss comment above in full)

I'm curious: is it possible to share weights, or at least a lossG-vs-training curve?

I'm seeing a big mismatch between Vincent's losses and yours: his lossG is around 10, yours around 100. I'm also seeing losses of 0 for the discriminator with your suggested changes, vs. normal losses in Vincent's code.


phquanta avatar phquanta commented on July 19, 2024

@kaahan
The vgg19 and vggface loss mentioned in the paper are caffe trained version, the input should be in BGR order, [0-255],

However, in this repo, vgg19 and vggface takes images in RGB order, and [0-1] normalized, while keep the weights the same with paper, i.e. vgg19_weight=1.5e-1, vggface_weight=2.5e-2

So either should you change the weight of content loss, or change the pretrained model to caffe pretrained version, to make the final loss balanced.

For me, I download the caffe version of vgg19 from https://github.com/jcjohnson/pytorch-vgg,
and make the input to vgg in range of [0-255], BGR order.

Main code:

self.vgg19_caffe_RGB_mean = torch.FloatTensor([123.68, 116.779, 103.939]).view(1, 3, 1, 1).to(device) # RGB order
self.vggface_caffe_RGB_mean = torch.FloatTensor([129.1863,104.7624,93.5940]).view(1, 3, 1, 1).to(device) # RGB order

x_vgg19 = x * 255  - self.vgg19_caffe_RGB_mean
x_vgg19 = x_vgg19[:,[2,1,0],:,:]
x_hat_vgg19 = x_hat * 255 - self.vgg19_caffe_RGB_mean
x_hat_vgg19 = x_hat_vgg19[:,[2,1,0],:,:]
x_vggface = x * 255 - self.vggface_caffe_RGB_mean
x_vggface = x_vggface[:,[2,1,0],:,:] # B RGB H W -> B BGR H W
x_hat_vggface = x_hat * 255 - self.vggface_caffe_RGB_mean
x_hat_vggface = x_hat_vggface[:,[2,1,0],:,:] # B RGB H W -> B BGR H W

Edit: I ran the meta-training for ~8 epochs on the voxceleb2 dev dataset

Edit 2: I've create a PR #56 for the update of vgg loss calculation

After 8 epochs?


HAN-oQo avatar HAN-oQo commented on July 19, 2024

Hi, @Jarvisss @vincent-thevenin
Now I'm having a hard time reproducing the results.
I'm working with 2 RTX 2080 Ti GPUs and batch size 12 (total 6).
And I'm using half of the VoxCeleb2 data.
I'm not sure my model is training well. The following are my sample results after 8 epochs, and the loss log.

Screenshot 2021-09-12 8:38:41 PM

Screenshot 2021-09-12 8:39:40 PM

Screenshot 2021-09-12 8:42:44 PM

Do you think the model is training well? How many epochs did you train the model for?
Did you also experience this kind of loss log while training?
In my opinion, the discriminator is too strong, so it seems to have converged already.

I would appreciate your reply!
(This is my repo: https://github.com/hanq0212/Few_Shot-Neural_Talking_Head)


lvZic avatar lvZic commented on July 19, 2024

(quoting @HAN-oQo's comment above in full)

Have you reproduced the results successfully?


HAN-oQo avatar HAN-oQo commented on July 19, 2024

Hi @lvZic,
After longer training (more than 15 days), I could get a much better result.
But I still can't reproduce the paper's results perfectly.


vuthede avatar vuthede commented on July 19, 2024

Hi @Jarvisss,
Have you resolved the

cannot keep the source identity, the image generated seems to be a different person from the input image

or

If the driving landmark(face shape, etc) is too different from the source landmark, the result image would be like the driving one

issues? I got the same problem; I trained on only 5000 videos and am on the 4th epoch so far.

