Comments (9)
Though I'm not sure how useful this will be, let me share what I encountered while playing with the code for CelebA 64x64.
Increasing the number of flow steps from 22 (Krzysztof's original setting) to 32 caused the loss to become NaN; I suspect a gradient explosion happened somewhere.
Actually, at that time I had also carelessly changed the learning dynamics from the original: I ran the training with num_steps=1000 and lr_ph: 0.0005 from the beginning, and it turned out that this change was what caused the NaN error after all.
In short, the warm-up strategy (for example, cells In [33] and In [34] in Celeba64x64_22steps.ipynb) seems important.
So, Christabella, I also think it is worth trying to start the training with a much smaller lr_ph, as Krzysztof suggested in his last comment.
from deep-learning-notes.
This is interesting, but unfortunately I have no idea what causes it. Do you have a minimal example to run or test, like a small model with fake input data? Are you sure you are plotting the right thing?
It would probably be worth checking whether the optimizer actually applies these gradients to the L, U, and scale weights. Maybe we could start from a single layer with a 1D invertible conv and check whether it can learn an expected distribution, such as a multivariate Gaussian with a nontrivial covariance matrix.
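That sanity check can be sketched outside the repo's code. Below is a minimal NumPy sketch (the names and setup are illustrative, not taken from the notebooks): fit a single invertible linear map z = Wx, the 1D analogue of an invertible 1x1 conv, to samples from a correlated Gaussian by gradient ascent on the change-of-variables log-likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)

# Target: a zero-mean Gaussian with a nontrivial covariance matrix.
A = np.array([[1.0, 0.8],
              [0.0, 0.6]])
cov = A @ A.T
x = rng.multivariate_normal(np.zeros(2), cov, size=5000)   # (N, 2)

# One invertible linear layer z = W x with a standard-normal base density:
#   log p(x) = log N(W x; 0, I) + log|det W|
W = np.eye(2)
lr = 0.05
for step in range(2000):
    z = x @ W.T
    # Gradient of the mean log-likelihood w.r.t. W: -mean(z x^T) + W^{-T}
    grad = -(z.T @ x) / len(x) + np.linalg.inv(W).T
    W += lr * grad                                          # gradient ascent

# At the optimum W Sigma W^T = I, so W^T W should approximate Sigma^{-1}.
print(np.round(W.T @ W, 2))
print(np.round(np.linalg.inv(cov), 2))
```

If the optimizer were silently not updating some of the parameters (e.g. the repo's L/U/scale decomposition in place of the dense W here), a check like this would fail to whiten the data.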
Yes, good idea. Maybe it would be better to start from a much smaller learning rate and a simpler architecture. MNIST is a good idea too. Can you check this?
> but in Celeba64x64_22steps.ipynb the learning rate schedule goes up, down, down, up. Just wondering, why that schedule?
Hi, I believe I was adjusting the LR manually, so there is no specific reason for this schedule. When experimenting in Jupyter notebooks, I usually start from a small LR, e.g. 0.0001, and test the model for a few epochs to check whether it trains or not. If so, I increase the LR and then systematically decrease it, e.g. [0.001, 0.0005, 0.0001, 0.00005]. This time I increased the LR at the end, probably because I noticed that the model was not learning fast enough and wanted to help it along, or maybe it is just a typo. Sorry for the inconvenience.
Wow, you're doing great detective work :)
> And they encountered NaN at step 8K and 4K respectively. The NaN first pops up in gradient/ChainLayer/Scale3/ChainLayer/Step5/ChainLayer//AffineCouplingLayer/Step5/Conv/biases_0 which is really strange because it doesn't look abnormal:
I believe you should work with a smaller LR. NFs are very complicated networks with many fragile parts, and they don't train fast (at least in my experience). If you have bad luck, you can always sample some hard minibatch that generates a large loss and breaks the whole training. You can try to overcome this with gradient clipping techniques. Or maybe one could write a custom optimizer that rejects updates for which the loss is far from the current running mean.
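That "reject outlier updates" idea could look something like the following NumPy sketch (the class and parameter names are hypothetical, not from the repo): clip gradients by their global norm and skip any minibatch whose loss is far above a running mean.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their joint L2 norm is <= max_norm."""
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-12))
    return [g * scale for g in grads]

class OutlierSkippingSGD:
    """SGD that rejects updates whose minibatch loss is far above a running mean."""

    def __init__(self, lr=0.01, mean_decay=0.99, tolerance=3.0, max_norm=5.0):
        self.lr = lr
        self.mean_decay = mean_decay
        self.tolerance = tolerance
        self.max_norm = max_norm
        self.running_loss = None

    def step(self, params, grads, loss):
        if self.running_loss is None:
            self.running_loss = loss
        if loss > self.tolerance * self.running_loss:
            return params, False            # reject the outlier update
        self.running_loss = (self.mean_decay * self.running_loss
                             + (1.0 - self.mean_decay) * loss)
        grads = clip_by_global_norm(grads, self.max_norm)
        return [p - self.lr * g for p, g in zip(params, grads)], True

# Toy usage: a normal step is applied, an outlier step is skipped.
opt = OutlierSkippingSGD(lr=0.1)
params, applied = opt.step([np.array([1.0])], [np.array([10.0])], loss=1.0)
params, skipped = opt.step(params, [np.array([1.0])], loss=100.0)
print(applied, skipped)   # True False
```

In TF1 one would implement the same gating around the train op; this sketch only illustrates the control flow.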
Thank you very much for the reply! I just tried visualizing the weights in the provided notebook Celeba48x48_22steps.ipynb without changing anything else, and some gradients are already exploding even at step 2:

> It would probably be worth checking whether the optimizer applies these gradients to the L, U, and scale weights. Maybe we could start from a single layer with a 1D invertible conv and check whether it can learn an expected distribution, such as a multivariate Gaussian with a nontrivial covariance matrix.

If I understand your suggestion correctly, I will try reducing the flow's complexity (number of steps, etc.) and plot the weights again. Maybe on MNIST instead of CelebA...
@geosada thanks for your comment!
Thanks a lot @geosada for pointing out the importance of the warm-up strategy! I have heard of starting with a small learning rate that goes up and then comes down again, but in Celeba64x64_22steps.ipynb the learning rate schedule goes up, down, down, up. Just wondering, why that schedule?
(i.e., rise, decay, decay, then rise again at the end, rather than a single rise-then-decay shape)
[33]: lr=0.0001 for 1 x 100 steps
[34]: lr=0.0005 for 5 x 1000 steps
[35]: lr=0.0001 for 5 x 1000 steps
[36]: lr=0.00005 for 5 x 1000 steps
[37]: lr=0.0001 for 5 x 1000 steps
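Written out, those cells amount to a piecewise-constant schedule. A sketch of how one might encode it (the boundaries are cumulative steps reconstructed from the cell list above; `lr_for_step` is a made-up helper, not code from the notebook):

```python
def lr_for_step(step):
    """Piecewise-constant LR: a short warm-up, a high phase, then decay."""
    schedule = [
        (100,   0.0001),    # In [33]: 1 x 100 steps of warm-up
        (5100,  0.0005),    # In [34]: 5 x 1000 steps
        (10100, 0.0001),    # In [35]
        (15100, 0.00005),   # In [36]
        (20100, 0.0001),    # In [37]
    ]
    for boundary, lr in schedule:
        if step < boundary:
            return lr
    return schedule[-1][1]
```

Each training step could then feed the value into the lr_ph placeholder, e.g. feed_dict={lr_ph: lr_for_step(step)}.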
I also noticed that the CelebA notebook uses a per-pixel loss
loss_per_pixel = loss / image_size / image_size
total_loss = l2_loss + loss_per_pixel
while for MNIST, it's the sum over all pixels
total_loss = l2_loss + loss
Maybe that's why the l2 loss goes up in the MNIST notebook, because the loss summed over all pixels overpowers the l2 loss:
Furthermore, in the official GLOW implementation, log p(x) (bits_x) is also divided by the number of subpixels.
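For reference, that normalization is just a division. A sketch (the helper name is made up; this mirrors the idea behind GLOW's bits_x but ignores its discretization-correction terms):

```python
import numpy as np

def bits_per_subpixel(nll_nats, image_size, channels):
    """Convert a per-image negative log-likelihood (in nats) into bits per
    subpixel, i.e. divide by the number of subpixels and by ln(2)."""
    num_subpixels = image_size * image_size * channels
    return nll_nats / (np.log(2.0) * num_subpixels)

# E.g. an MNIST image (28x28x1) with a summed NLL of 1500 nats:
print(bits_per_subpixel(1500.0, 28, 1))   # roughly 2.76 bits/subpixel
```

Normalizing this way keeps the likelihood term on a comparable scale across image sizes, so the balance against the l2 term does not change when the resolution does.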
But when I tried to divide the MNIST loss by image_size=28, there was instability at around 10-15K steps:
And they encountered NaN at step 8K and 4K respectively. The NaN first pops up in gradient/ChainLayer/Scale3/ChainLayer/Step5/ChainLayer//AffineCouplingLayer/Step5/Conv/biases_0 which is really strange because it doesn't look abnormal:
This was with lr=0.005, which is maybe too high as @geosada mentioned, and the MNIST run did not have a warm-up strategy. I will try using the learning-rate warm-up.
Also, another unrelated thought: I wonder if actnorm, if it behaves like batchnorm, will interact with l2 regularization to produce an adaptive learning rate.
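The intuition behind that interaction can be checked numerically: if a normalization layer makes the loss invariant to rescaling the preceding weights, then the gradient shrinks as 1/||w||, so l2 decay (which shrinks ||w||) effectively raises the step size. A toy NumPy check (everything here is illustrative, not code from the repo):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(64, 5))
target = rng.normal(size=64)

def loss(w):
    h = x @ w
    h = (h - h.mean()) / (h.std() + 1e-8)   # actnorm/batchnorm-like normalization
    return np.mean((h - target) ** 2)

def num_grad(w, eps=1e-6):
    """Central-difference numerical gradient of loss at w."""
    g = np.zeros_like(w)
    for i in range(w.size):
        d = np.zeros_like(w)
        d[i] = eps
        g[i] = (loss(w + d) - loss(w - d)) / (2 * eps)
    return g

w = rng.normal(size=5)
ratio = np.linalg.norm(num_grad(2 * w)) / np.linalg.norm(num_grad(w))
print(ratio)   # close to 0.5: doubling ||w|| halves the gradient norm
```

Because the normalization makes loss(a*w) = loss(w), the chain rule gives grad(a*w) = grad(w)/a, which is exactly the 1/||w|| scaling that turns weight decay into an effective learning-rate knob.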