
Comments (9)

geosada commented on June 28, 2024

Though I'm not sure how useful this will be, let me share what I encountered while playing with the code for CelebA 64x64.
Increasing the number of flow steps from 22 (Krzysztof's original setting) to 32 caused the loss to become NaN, which I suspect was a gradient explosion somewhere.

Actually, at that time I had also carelessly changed the learning dynamics from the original: I ran the training with num_steps=1000 and lr_ph: 0.0005 from the beginning, and it turned out that this change was what caused the NaN error after all.

In short, the warm-up strategy (for example, In [33] and In [34] in Celeba64x64_22steps.ipynb) seems important.
So, Christabella, I also think it is worth starting the training with a much smaller lr_ph, as Krzysztof suggested in his last comment.
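
For concreteness, here is a minimal sketch of such a warm-up schedule in plain Python. The values are taken from the In [33]/In [34] cells (a small LR for a short first run, then the target LR); how the value is actually fed to the graph, e.g. via lr_ph in a feed_dict, depends on the notebook's own training loop.

def warmup_lr(step, warmup_steps=100, low_lr=0.0001, target_lr=0.0005):
    # Small learning rate during the first warmup_steps, target rate afterwards.
    return low_lr if step < warmup_steps else target_lr

# The value returned for the current step is what would be fed as lr_ph.
for step in (0, 50, 100, 5000):
    print(step, warmup_lr(step))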


kmkolasinski commented on June 28, 2024

This is interesting, but unfortunately I have no idea what is causing it. Do you have a minimal example of code to run or test, like a small model with fake input data? Are you sure you are plotting the right thing?
It would probably be worth checking whether the optimizer applies these gradients to the L, U, and scale weights. Maybe we could start from a single layer with a 1D invertible convolution and check whether it can learn some expected distribution, like a multivariate Gaussian with a non-trivial covariance matrix.
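
As a rough illustration of that sanity check (not code from this repository): on a single "pixel" a 1x1 invertible convolution reduces to a plain invertible linear map z = W x + b, so one can test whether maximum likelihood with a standard-normal base density recovers a 2-D Gaussian with a non-trivial covariance. A NumPy sketch with hand-derived gradients:

import numpy as np

rng = np.random.default_rng(0)
d, n = 2, 10000
true_mean = np.array([1.0, -2.0])
true_cov = np.array([[2.0, 1.2], [1.2, 1.0]])   # non-trivial covariance
x = rng.multivariate_normal(true_mean, true_cov, size=n)

W = np.eye(d)      # invertible weight, the 1x1-conv analogue
b = np.zeros(d)
lr = 0.05

for step in range(3000):
    z = x @ W.T + b                          # forward pass of the flow
    # Mean NLL (up to a constant): 0.5*||z||^2 - log|det W|
    grad_W = (z.T @ x) / n - np.linalg.inv(W).T
    grad_b = z.mean(axis=0)
    W -= lr * grad_W
    b -= lr * grad_b

# The learned density is N(-W^{-1} b, W^{-1} W^{-T}); it should match the data.
W_inv = np.linalg.inv(W)
print("learned mean:", -W_inv @ b)
print("learned cov:\n", W_inv @ W_inv.T)

If a single layer already fails to recover the covariance, the problem is in the layer or its gradients; if it succeeds, the instability more likely comes from depth or the learning rate.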


kmkolasinski commented on June 28, 2024

Yes, good idea, maybe it would be better to start from a much smaller learning rate and a simpler architecture. MNIST is a good idea as well. Can you check this?


kmkolasinski commented on June 28, 2024

but in Celeba64x64_22steps.ipynb the learning rate schedule goes up, down, down, up. Just wondering, why that schedule?

Hi, I believe I was adjusting the LR manually, so there is no specific reason for this schedule. When experimenting in Jupyter notebooks, I usually start from a small LR, e.g. 0.0001, and test the model for a few epochs to check whether it trains or not. If so, I increase the LR and then systematically decrease it, e.g. [0.001, 0.0005, 0.0001, 0.00005]. This time I increased the LR at the end, probably because I noticed the model was not learning fast enough and wanted to help it along, or maybe it is just a typo. Sorry for the inconvenience.


kmkolasinski commented on June 28, 2024

Wow, you're doing great detective work :)

And they encountered NaN at steps 8K and 4K respectively. The NaN first pops up in gradient/ChainLayer/Scale3/ChainLayer/Step5/ChainLayer//AffineCouplingLayer/Step5/Conv/biases_0, which is really strange because it doesn't look abnormal:

I believe you should work with a smaller LR. NFs are very complicated networks with many fragile parts, and they don't train fast (at least in my experience). If you have bad luck, you can always sample some hard minibatch that generates a large loss and breaks the whole training. You can try to overcome this with gradient clipping techniques, or maybe one could write a custom optimizer that rejects updates for which the loss is far from the current running mean.
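
For what it's worth, here is a small TF2-style sketch of those two ideas together (the notebooks here use TF1 graphs, and the model, data, and thresholds below are toy placeholders for illustration only): global-norm gradient clipping plus skipping an update whose loss jumps far above a running mean.

import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(16, activation="relu"),
                             tf.keras.layers.Dense(1)])
optimizer = tf.keras.optimizers.Adam(1e-4)

running_mean = None
for step in range(100):
    x = tf.random.normal([32, 8])            # toy minibatch
    y = tf.random.normal([32, 1])
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean(tf.square(model(x) - y))

    # Reject "unlucky" minibatches whose loss is far above the running mean.
    if running_mean is not None and loss > 3.0 * running_mean:
        continue
    running_mean = loss if running_mean is None else 0.99 * running_mean + 0.01 * loss

    grads = tape.gradient(loss, model.trainable_variables)
    grads, _ = tf.clip_by_global_norm(grads, 5.0)   # global-norm gradient clipping
    optimizer.apply_gradients(zip(grads, model.trainable_variables))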


christabella commented on June 28, 2024

Thank you very much for the reply! I just tried visualizing the weights in the provided notebook Celeba48x48_22steps.ipynb without changing anything else, and some gradients are already exploding at step 2:

[image]

It would probably be worth checking whether the optimizer applies these gradients to the L, U, and scale weights. Maybe we could start from a single layer with a 1D invertible convolution and check whether it can learn some expected distribution, like a multivariate Gaussian with a non-trivial covariance matrix.

If I understand your suggestion correctly, I will try to reduce the flow's complexity (number of steps, etc.) and plot the weights again. Maybe on MNIST instead of CelebA...


kmkolasinski commented on June 28, 2024

@geosada thanks for your comment 👍


christabella commented on June 28, 2024

Thanks a lot @geosada for pointing out the importance of the warm-up strategy! I have heard of starting with a small learning rate that goes up and then comes down again, but in Celeba64x64_22steps.ipynb the learning rate schedule goes up, down, down, up. Just wondering, why that schedule?

/\    instead of  /\ 
  \/             /  \

[33]: lr=0.0001 for 1 x 100 steps
[34]: lr=0.0005 for 5 x 1000 steps
[35]: lr=0.0001 for 5 x 1000 steps
[36]: lr=0.00005 for 5 x 1000 steps
[37]: lr=0.0001 for 5 x 1000 steps


christabella commented on June 28, 2024

I also noticed that the CelebA notebook uses a per-pixel loss:

loss_per_pixel = loss / image_size / image_size  
total_loss = l2_loss + loss_per_pixel

while for MNIST, it's the sum over all pixels:

total_loss = l2_loss + loss

Maybe that's why the l2 loss goes up in the MNIST notebook: the loss summed over all pixels overpowers the l2 loss:
[image]

Furthermore, in the official GLOW implementation, log p(x) (bits_x) is also divided by the number of subpixels.
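
To make the comparison concrete, here is a small sketch of the two normalizations. The per-pixel version mirrors the notebook snippet above; the bits-per-subpixel version follows the usual GLOW-style convention of dividing the NLL by H*W*C and converting nats to bits (ignoring the dequantization offset for simplicity).

import math

def per_pixel_loss(nll, image_size):
    # CelebA-notebook style: NLL divided by the number of pixels.
    return nll / image_size / image_size

def bits_per_subpixel(nll_nats, image_size, num_channels=3):
    # GLOW-style bits_x (simplified): NLL per subpixel, in bits.
    num_subpixels = image_size * image_size * num_channels
    return nll_nats / (math.log(2.0) * num_subpixels)

# e.g. for a 28x28 grayscale MNIST image:
# bits_per_subpixel(nll_nats, image_size=28, num_channels=1)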

But when I tried to divide the MNIST loss by image_size=28, there was instability at around 10-15K steps:
[images]
And they encountered NaN at steps 8K and 4K respectively. The NaN first pops up in gradient/ChainLayer/Scale3/ChainLayer/Step5/ChainLayer//AffineCouplingLayer/Step5/Conv/biases_0, which is really strange because it doesn't look abnormal:
[image]

This was with lr=0.005, which is maybe too high as @geosada mentioned, and the MNIST run did not have a warm-up strategy. I will try using the learning rate warm-up.
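
In case it helps with the debugging above, here is a TF2-style sketch (again, the notebooks are TF1, and the model and data here are toy placeholders) of locating which gradient tensors contain NaN/Inf, by variable name:

import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(8, activation="relu"),
                             tf.keras.layers.Dense(1)])
x = tf.random.normal([16, 4])
y = tf.random.normal([16, 1])

with tf.GradientTape() as tape:
    loss = tf.reduce_mean(tf.square(model(x) - y))
grads = tape.gradient(loss, model.trainable_variables)

# Print every variable whose gradient contains a non-finite value.
for var, grad in zip(model.trainable_variables, grads):
    if grad is not None and not bool(tf.reduce_all(tf.math.is_finite(grad))):
        print("non-finite gradient in", var.name)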


Also, another unrelated thought: I wonder whether act norm, if it behaves like batch norm, will interact with l2 regularization to produce an adaptive learning rate.

