Comments (9)
Though I'm not sure how useful this will be, let me share what I encountered while playing with the code for CelebA 64x64.
Increasing the number of flow steps from 22 (Krzysztof's original setting) to 32 caused the loss to become NaN; I suspect a gradient explosion happened somewhere.
Actually, at that time I had also carelessly changed the learning dynamics from the original: I ran the training with num_steps=1000 and lr_ph: 0.0005 from the beginning, and it turned out that this change was what caused the NaN error after all.
In short, the warm-up strategy (for example, cells In [33] and In [34] in Celeba64x64_22steps.ipynb) seems important.
So, Christabella, I also think it is worth trying to start the training with a much smaller lr_ph, as Krzysztof suggested in his last comment.
from deep-learning-notes.
This is interesting, but unfortunately I have no idea what causes it. Do you have a minimal example to run or test, like a small model with fake input data? Are you sure you are plotting the right thing?
It would probably be worth checking whether the optimizer actually applies these gradients to the L, U, and scale weights. Maybe we could start from a single layer with a 1D invertible conv and check whether it can learn an expected distribution, such as a multivariate Gaussian with a nontrivial covariance matrix.
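That sanity check can be sketched outside the repo's code. Below is a minimal NumPy sketch (the names and setup are illustrative, not taken from the notebooks): fit a single invertible linear map z = Wx, the 1D analogue of an invertible 1x1 conv, to samples from a correlated Gaussian by gradient ascent on the change-of-variables log-likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)

# Target: a zero-mean Gaussian with a nontrivial covariance matrix.
A = np.array([[1.0, 0.8],
              [0.0, 0.6]])
cov = A @ A.T
x = rng.multivariate_normal(np.zeros(2), cov, size=5000)   # (N, 2)

# One invertible linear layer z = W x with a standard-normal base density:
#   log p(x) = log N(W x; 0, I) + log|det W|
W = np.eye(2)
lr = 0.05
for step in range(2000):
    z = x @ W.T
    # Gradient of the mean log-likelihood w.r.t. W: -mean(z x^T) + W^{-T}
    grad = -(z.T @ x) / len(x) + np.linalg.inv(W).T
    W += lr * grad                                          # gradient ascent

# At the optimum W Sigma W^T = I, so W^T W should approximate Sigma^{-1}.
print(np.round(W.T @ W, 2))
print(np.round(np.linalg.inv(cov), 2))
```

If the optimizer were silently not updating some of the parameters (e.g. the repo's L/U/scale decomposition in place of the dense W here), a check like this would fail to whiten the data.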
Yes, good idea. Maybe it would be better to start from a much smaller learning rate and a simpler architecture. MNIST is a good idea too. Can you check this?
> but in Celeba64x64_22steps.ipynb the learning rate schedule goes up, down, down, up. Just wondering, why that schedule?
Hi, I believe I was adjusting the LR manually, so there is no specific reason for this schedule. When experimenting in Jupyter notebooks, I usually start from a small LR, e.g. 0.0001, and test the model for a few epochs to check whether it trains or not. If so, I increase the LR and then systematically decrease it, e.g. [0.001, 0.0005, 0.0001, 0.00005]. This time I increased the LR at the end, probably because I noticed that the model was not learning fast enough and wanted to help it along, or maybe it is just a typo. Sorry for the inconvenience.
Wow, you're doing great detective work :)
> And they encountered NaN at step 8K and 4K respectively. The NaN first pops up in gradient/ChainLayer/Scale3/ChainLayer/Step5/ChainLayer//AffineCouplingLayer/Step5/Conv/biases_0 which is really strange because it doesn't look abnormal:
I believe you should work with a smaller LR. NFs are very complicated networks with many fragile parts, and they don't train fast (at least in my experience). If you have bad luck, you can always sample some hard minibatch that generates a large loss and breaks the whole training. You can try to overcome this with gradient clipping techniques. Or maybe one could write a custom optimizer that rejects updates for which the loss is far from the current running mean.
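That "reject outlier updates" idea could look something like the following NumPy sketch (the class and parameter names are hypothetical, not from the repo): clip gradients by their global norm and skip any minibatch whose loss is far above a running mean.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their joint L2 norm is <= max_norm."""
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-12))
    return [g * scale for g in grads]

class OutlierSkippingSGD:
    """SGD that rejects updates whose minibatch loss is far above a running mean."""

    def __init__(self, lr=0.01, mean_decay=0.99, tolerance=3.0, max_norm=5.0):
        self.lr = lr
        self.mean_decay = mean_decay
        self.tolerance = tolerance
        self.max_norm = max_norm
        self.running_loss = None

    def step(self, params, grads, loss):
        if self.running_loss is None:
            self.running_loss = loss
        if loss > self.tolerance * self.running_loss:
            return params, False            # reject the outlier update
        self.running_loss = (self.mean_decay * self.running_loss
                             + (1.0 - self.mean_decay) * loss)
        grads = clip_by_global_norm(grads, self.max_norm)
        return [p - self.lr * g for p, g in zip(params, grads)], True

# Toy usage: a normal step is applied, an outlier step is skipped.
opt = OutlierSkippingSGD(lr=0.1)
params, applied = opt.step([np.array([1.0])], [np.array([10.0])], loss=1.0)
params, skipped = opt.step(params, [np.array([1.0])], loss=100.0)
print(applied, skipped)   # True False
```

In TF1 one would implement the same gating around the train op; this sketch only illustrates the control flow.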
Thank you very much for the reply! I just tried visualizing the weights in the provided notebook Celeba48x48_22steps.ipynb without changing anything else, and some gradients are already exploding even at step 2:

> It would probably be worth checking whether the optimizer applies these gradients to the L, U, and scale weights. Maybe we could start from a single layer with a 1D invertible conv and check whether it can learn an expected distribution, such as a multivariate Gaussian with a nontrivial covariance matrix.

If I understand your suggestion correctly, I will try reducing the flow's complexity (number of steps, etc.) and plot the weights again. Maybe on MNIST instead of CelebA...
@geosada thanks for your comment!
Thanks a lot @geosada for pointing out the importance of the warm-up strategy! I have heard of starting with a small learning rate that goes up and then comes down again, but in Celeba64x64_22steps.ipynb the learning rate schedule goes up, down, down, up. Just wondering, why that schedule?
(i.e., rise, decay, decay, then rise again at the end, rather than a single rise-then-decay shape)
[33]: lr=0.0001 for 1 x 100 steps
[34]: lr=0.0005 for 5 x 1000 steps
[35]: lr=0.0001 for 5 x 1000 steps
[36]: lr=0.00005 for 5 x 1000 steps
[37]: lr=0.0001 for 5 x 1000 steps
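Written out, those cells amount to a piecewise-constant schedule. A sketch of how one might encode it (the boundaries are cumulative steps reconstructed from the cell list above; `lr_for_step` is a made-up helper, not code from the notebook):

```python
def lr_for_step(step):
    """Piecewise-constant LR: a short warm-up, a high phase, then decay."""
    schedule = [
        (100,   0.0001),    # In [33]: 1 x 100 steps of warm-up
        (5100,  0.0005),    # In [34]: 5 x 1000 steps
        (10100, 0.0001),    # In [35]
        (15100, 0.00005),   # In [36]
        (20100, 0.0001),    # In [37]
    ]
    for boundary, lr in schedule:
        if step < boundary:
            return lr
    return schedule[-1][1]
```

Each training step could then feed the value into the lr_ph placeholder, e.g. feed_dict={lr_ph: lr_for_step(step)}.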
I also noticed that the CelebA notebook uses a per-pixel loss
loss_per_pixel = loss / image_size / image_size
total_loss = l2_loss + loss_per_pixel
while for MNIST, it's the sum over all pixels
total_loss = l2_loss + loss
Maybe that's why the l2 loss goes up in the MNIST notebook, because the loss summed over all pixels overpowers the l2 loss:
Furthermore, in the official GLOW implementation, log p(x) (bits_x) is also divided by the number of subpixels.
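For reference, that normalization is just a division. A sketch (the helper name is made up; this mirrors the idea behind GLOW's bits_x but ignores its discretization-correction terms):

```python
import numpy as np

def bits_per_subpixel(nll_nats, image_size, channels):
    """Convert a per-image negative log-likelihood (in nats) into bits per
    subpixel, i.e. divide by the number of subpixels and by ln(2)."""
    num_subpixels = image_size * image_size * channels
    return nll_nats / (np.log(2.0) * num_subpixels)

# E.g. an MNIST image (28x28x1) with a summed NLL of 1500 nats:
print(bits_per_subpixel(1500.0, 28, 1))   # roughly 2.76 bits/subpixel
```

Normalizing this way keeps the likelihood term on a comparable scale across image sizes, so the balance against the l2 term does not change when the resolution does.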
But when I tried to divide the MNIST loss by image_size=28, there was instability at around 10-15K steps:
And they encountered NaN at step 8K and 4K respectively. The NaN first pops up in gradient/ChainLayer/Scale3/ChainLayer/Step5/ChainLayer//AffineCouplingLayer/Step5/Conv/biases_0 which is really strange because it doesn't look abnormal:
This was with lr=0.005, which is maybe too high as @geosada mentioned, and the MNIST run did not have a warm-up strategy. I will try using the learning-rate warm-up.
Also, another unrelated thought: I wonder if actnorm, if it behaves like batchnorm, will interact with l2 regularization to produce an adaptive learning rate.
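The intuition behind that interaction can be checked numerically: if a normalization layer makes the loss invariant to rescaling the preceding weights, then the gradient shrinks as 1/||w||, so l2 decay (which shrinks ||w||) effectively raises the step size. A toy NumPy check (everything here is illustrative, not code from the repo):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(64, 5))
target = rng.normal(size=64)

def loss(w):
    h = x @ w
    h = (h - h.mean()) / (h.std() + 1e-8)   # actnorm/batchnorm-like normalization
    return np.mean((h - target) ** 2)

def num_grad(w, eps=1e-6):
    """Central-difference numerical gradient of loss at w."""
    g = np.zeros_like(w)
    for i in range(w.size):
        d = np.zeros_like(w)
        d[i] = eps
        g[i] = (loss(w + d) - loss(w - d)) / (2 * eps)
    return g

w = rng.normal(size=5)
ratio = np.linalg.norm(num_grad(2 * w)) / np.linalg.norm(num_grad(w))
print(ratio)   # close to 0.5: doubling ||w|| halves the gradient norm
```

Because the normalization makes loss(a*w) = loss(w), the chain rule gives grad(a*w) = grad(w)/a, which is exactly the 1/||w|| scaling that turns weight decay into an effective learning-rate knob.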