
Comments (6)

chiixy commented on July 17, 2024

I have the same NaN pixel loss and perceptual loss when training my 1024-resolution model, which needs to connect the 512-size features in stage 3.

I found the first NaN appears after optimizer.step(), not in the forward function and not in loss.backward(), so maybe the learning rate is too large and you just need to reduce it (e.g. setting 5e-6 solved this problem for me).

You can check where the first NaN appears in the codeformer_joint_model.py file:
(screenshot from 2024-04-11 18-10-36)

As for the training loss and output tensors being normal in stage 2: in my case it is because Fuse_sft_block produces NaN gradients, while the stage-2 network architecture has no such block. You might check whether yours behaves the same.
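The check described above can be sketched in PyTorch. This is a minimal, hypothetical version: the model, optimizer, and `find_nan_params` helper are illustrative, not code from codeformer_joint_model.py.

```python
# Hypothetical sketch: locate the first NaN right after optimizer.step(),
# mirroring the debugging approach described above.
import torch


def find_nan_params(model: torch.nn.Module):
    """Return names of parameters that contain NaN after an update."""
    return [name for name, p in model.named_parameters()
            if torch.isnan(p).any()]


model = torch.nn.Linear(4, 4)
# Reduced learning rate, as suggested in the comment above.
optimizer = torch.optim.Adam(model.parameters(), lr=5e-6)

x = torch.randn(8, 4)
loss = model(x).pow(2).mean()

optimizer.zero_grad()
loss.backward()
optimizer.step()

# Place this check right after optimizer.step() to catch the first bad update.
bad = find_nan_params(model)
if bad:
    raise RuntimeError(f"first NaN appeared after optimizer.step() in: {bad}")
```

Running this check every iteration is cheap for small models; for large ones, run it every N steps or only while debugging.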

from codeformer.

SherryXieYuchen commented on July 17, 2024


Hi, may I ask your GPU info when training at 1024 resolution in stage 3? I have an NVIDIA 4090 with 24 GB but got CUDA out-of-memory when I tried to train the 1024-resolution model in stage 3. I kept the number of GPUs at 8 and set the batch size to 1, and it still doesn't work. The connect_list is ['64', '128', '256', '512']. I'd appreciate any suggestions.


chiixy commented on July 17, 2024


Yeah, I used a 1080 Ti to train the model. The official network architecture needs a lot of memory with 1024x1024 input at training time, so I compressed the architecture to train my model, though it loses some detail in the restored faces.
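Besides compressing the architecture, activation (gradient) checkpointing is a common way to fit large inputs into memory: it recomputes intermediate activations during backward instead of storing them, trading compute for memory. A minimal sketch, where `Stage` and `TinyEncoder` are toy stand-ins, not the official CodeFormer blocks:

```python
# Hypothetical sketch: cutting activation memory with gradient checkpointing.
import torch
from torch.utils.checkpoint import checkpoint


class Stage(torch.nn.Module):
    """Toy stand-in for a heavy encoder block."""
    def __init__(self, ch):
        super().__init__()
        self.conv = torch.nn.Conv2d(ch, ch, 3, padding=1)

    def forward(self, x):
        return torch.relu(self.conv(x))


class TinyEncoder(torch.nn.Module):
    def __init__(self, ch=8, depth=3):
        super().__init__()
        self.stages = torch.nn.ModuleList(Stage(ch) for _ in range(depth))

    def forward(self, x):
        for stage in self.stages:
            # Recompute this stage's activations in backward
            # instead of keeping them resident in GPU memory.
            x = checkpoint(stage, x, use_reentrant=False)
        return x


model = TinyEncoder()
x = torch.randn(1, 8, 64, 64, requires_grad=True)
out = model(x)
out.mean().backward()
```

Mixed-precision training (torch.autocast plus a GradScaler) is another standard memory reducer, though it can interact with the NaN issues discussed earlier in this thread.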


SherryXieYuchen commented on July 17, 2024


Yes, I find the 1024-resolution model loses some detail too, after modifying the architecture to complete training. It's hard to balance the memory problem against restoration fidelity. Anyway, thanks for your reply :)


create-li commented on July 17, 2024

#367 (comment)
@SherryXieYuchen
I tried setting the learning rate to 5e-7, but l_g_gan, l_d_real, l_d_fake, etc. still came out as 0.
When I set the learning rate to 5e-8, the loss is neither 0 nor NaN, but the network barely updates, and the output image is brown when w=1, as shown in the figure below.
(image: brown output at w=1)

Yes, it is indeed a problem with Fuse_sft_block, but I don't know how to modify the fusion network.
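To confirm which module first emits a NaN gradient (e.g. the fusion block), full backward hooks can help. A hypothetical sketch, where the two Linear layers stand in for real blocks such as Fuse_sft_block:

```python
# Hypothetical sketch: register backward hooks to pinpoint which module
# first produces a NaN gradient during loss.backward().
import torch

nan_modules = []


def make_hook(name):
    def hook(module, grad_input, grad_output):
        # Record the module name if any gradient w.r.t. its output is NaN.
        if any(g is not None and torch.isnan(g).any() for g in grad_output):
            nan_modules.append(name)
    return hook


model = torch.nn.Sequential(
    torch.nn.Linear(4, 4),   # stand-in for an ordinary block
    torch.nn.Linear(4, 4),   # stand-in for the suspect fusion block
)
for name, module in model.named_modules():
    if name:  # skip the Sequential container itself
        module.register_full_backward_hook(make_hook(name))

x = torch.randn(2, 4)
model(x).mean().backward()
# nan_modules now lists any modules whose output grad contained NaN.
```

torch.autograd.set_detect_anomaly(True) is a heavier but built-in alternative that raises on the first NaN-producing backward op.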


SherryXieYuchen commented on July 17, 2024


I haven't seen this output-image problem before, though setting the learning rate too small could indeed keep the network from updating.

