
Comments (6)

chiixy commented on July 17, 2024

I have the same NaN pixel loss and perceptual loss when training my 1024-resolution model, which needs to connect the 512-size features in stage 3.

I found the first NaN appears after optimizer.step(), not in the forward function and not in loss.backward(), so maybe the learning rate is too large and you just need to reduce it (e.g. setting 5e-6 solved this problem for me).

You can check where the first NaN appears in the codeformer_joint_model.py file:
(screenshot from 2024-04-11 18-10-36)

As for the training loss and output tensors being normal in stage 2: in my case it is because Fuse_sft_block produces NaN gradients, while the stage-2 network architecture has no such block. You might check whether yours behaves the same.
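The check described above can be sketched in PyTorch. This is a minimal, hypothetical version: the model, optimizer, and `find_nan_params` helper are illustrative, not code from codeformer_joint_model.py.

```python
# Hypothetical sketch: locate the first NaN right after optimizer.step(),
# mirroring the debugging approach described above.
import torch


def find_nan_params(model: torch.nn.Module):
    """Return names of parameters that contain NaN after an update."""
    return [name for name, p in model.named_parameters()
            if torch.isnan(p).any()]


model = torch.nn.Linear(4, 4)
# Reduced learning rate, as suggested in the comment above.
optimizer = torch.optim.Adam(model.parameters(), lr=5e-6)

x = torch.randn(8, 4)
loss = model(x).pow(2).mean()

optimizer.zero_grad()
loss.backward()
optimizer.step()

# Place this check right after optimizer.step() to catch the first bad update.
bad = find_nan_params(model)
if bad:
    raise RuntimeError(f"first NaN appeared after optimizer.step() in: {bad}")
```

Running this check every iteration is cheap for small models; for large ones, run it every N steps or only while debugging.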

from codeformer.

SherryXieYuchen commented on July 17, 2024


Hi, may I ask your GPU info when training at 1024 resolution in stage 3? I have an NVIDIA 4090 with 24 GB but got CUDA out-of-memory when I tried to train the 1024-resolution model in stage 3. I kept the number of GPUs at 8 and set the batch size to 1, and it still doesn't work. The connect_list is ['64', '128', '256', '512']. I'd appreciate any suggestions.


chiixy commented on July 17, 2024


Yeah, I used a 1080 Ti to train the model. The official network architecture needs a lot of memory with 1024x1024 input at training time, so I compressed the architecture to train my model, though it loses some detail in the restored faces.
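Besides compressing the architecture, activation (gradient) checkpointing is a common way to fit large inputs into memory: it recomputes intermediate activations during backward instead of storing them, trading compute for memory. A minimal sketch, where `Stage` and `TinyEncoder` are toy stand-ins, not the official CodeFormer blocks:

```python
# Hypothetical sketch: cutting activation memory with gradient checkpointing.
import torch
from torch.utils.checkpoint import checkpoint


class Stage(torch.nn.Module):
    """Toy stand-in for a heavy encoder block."""
    def __init__(self, ch):
        super().__init__()
        self.conv = torch.nn.Conv2d(ch, ch, 3, padding=1)

    def forward(self, x):
        return torch.relu(self.conv(x))


class TinyEncoder(torch.nn.Module):
    def __init__(self, ch=8, depth=3):
        super().__init__()
        self.stages = torch.nn.ModuleList(Stage(ch) for _ in range(depth))

    def forward(self, x):
        for stage in self.stages:
            # Recompute this stage's activations in backward
            # instead of keeping them resident in GPU memory.
            x = checkpoint(stage, x, use_reentrant=False)
        return x


model = TinyEncoder()
x = torch.randn(1, 8, 64, 64, requires_grad=True)
out = model(x)
out.mean().backward()
```

Mixed-precision training (torch.autocast plus a GradScaler) is another standard memory reducer, though it can interact with the NaN issues discussed earlier in this thread.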


SherryXieYuchen commented on July 17, 2024


Yes, I find the 1024-resolution model loses some detail too, after modifying the architecture to complete training. It's hard to balance the memory problem against restoration fidelity. Anyway, thanks for your reply :)


create-li commented on July 17, 2024

#367 (comment)
@SherryXieYuchen
I tried setting the learning rate to 5e-7, but l_g_gan, l_d_real, l_d_fake, etc. still came out as 0.
When I set the learning rate to 5e-8, the loss is neither 0 nor NaN, but the network barely updates, and the output image is brown when w=1, as shown in the figure below.
(image: brown output at w=1)

Yes, it is indeed a problem with Fuse_sft_block, but I don't know how to modify the fusion network.
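To confirm which module first emits a NaN gradient (e.g. the fusion block), full backward hooks can help. A hypothetical sketch, where the two Linear layers stand in for real blocks such as Fuse_sft_block:

```python
# Hypothetical sketch: register backward hooks to pinpoint which module
# first produces a NaN gradient during loss.backward().
import torch

nan_modules = []


def make_hook(name):
    def hook(module, grad_input, grad_output):
        # Record the module name if any gradient w.r.t. its output is NaN.
        if any(g is not None and torch.isnan(g).any() for g in grad_output):
            nan_modules.append(name)
    return hook


model = torch.nn.Sequential(
    torch.nn.Linear(4, 4),   # stand-in for an ordinary block
    torch.nn.Linear(4, 4),   # stand-in for the suspect fusion block
)
for name, module in model.named_modules():
    if name:  # skip the Sequential container itself
        module.register_full_backward_hook(make_hook(name))

x = torch.randn(2, 4)
model(x).mean().backward()
# nan_modules now lists any modules whose output grad contained NaN.
```

torch.autograd.set_detect_anomaly(True) is a heavier but built-in alternative that raises on the first NaN-producing backward op.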


SherryXieYuchen commented on July 17, 2024


I haven't seen this output-image problem before, though setting the learning rate too small could indeed keep the network from updating.

