

RAdam


We are in an early-release beta. Expect some adventures and rough edges.

What is the problem that we are interested in?

We want to uncover the underlying principles of warmup for Adam. In some applications, warmup is a must-have for stabilizing training, yet its underlying mechanism is largely unknown. In our study, we suggest that one root cause is the large variance of the adaptive learning rate, and we provide both theoretical and empirical evidence to support this hypothesis.

TL;DR: "If warmup is the answer, what is the question?", "Why should we use warmup, and how?"

Note: we do not claim that RAdam is the best optimizer; there are too many variants / datasets / tasks to make such a claim.

Rectified Adam

As in Figure 8, we assume the gradients follow a normal distribution (mean: \mu, variance: 1). The variance of the adaptive learning rate is simulated and plotted in Figure 8 (blue curve). We can see that the adaptive learning rate has a large variance in the early stage of training.
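As a rough illustration of this kind of simulation (a minimal sketch, assuming gradients are drawn i.i.d. from N(\mu, 1) and using Adam's exponential moving average for the second moment; the step and trial counts are arbitrary, not the settings behind Figure 8):

```python
import numpy as np

def adaptive_lr_variance(mu=0.1, beta2=0.999, eps=1e-8, steps=100, trials=50000):
    """Estimate Var[1 / (sqrt(v_hat_t) + eps)] when gradients are i.i.d. N(mu, 1)."""
    rng = np.random.default_rng(0)
    v = np.zeros(trials)                      # second-moment EMA, one value per trial
    variances = []
    for t in range(1, steps + 1):
        g = rng.normal(mu, 1.0, size=trials)  # one gradient sample per trial
        v = beta2 * v + (1 - beta2) * g ** 2
        v_hat = v / (1 - beta2 ** t)          # bias-corrected second moment
        psi = 1.0 / (np.sqrt(v_hat) + eps)    # adaptive learning rate
        variances.append(psi.var())           # empirical variance across trials
    return variances

var_by_step = adaptive_lr_variance()
print(var_by_step[0], var_by_step[-1])        # much larger at step 1 than at step 100
```

In such a toy run the variance is largest during the first handful of updates and shrinks as more gradients are averaged into the second moment, which is the behavior the blue curve illustrates.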

At the same time, when using the Transformer for NMT (see the more detailed discussion in the next section), a warmup stage is usually required to avoid convergence problems (Adam-vanilla converges to around 500 PPL in Figure 1, while Adam-warmup successfully converges to under 10 PPL). In further explorations, we notice that if we use an additional 2000 samples to update only the adaptive learning rate, the convergence problems are avoided (Adam-2k); or, if we increase the value of eps, the convergence problems are alleviated (Adam-eps).
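For concreteness, Adam-eps amounts to nothing more than running the standard optimizer with a larger eps; a minimal sketch (the placeholder model and the value 1e-4 are illustrative, not the exact experimental setup):

```python
import torch

model = torch.nn.Linear(10, 1)  # placeholder model for illustration

# A larger eps caps the adaptive learning rate 1 / (sqrt(v_t) + eps),
# which limits its variance during the first updates.
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4,
                             betas=(0.9, 0.999), eps=1e-4)
```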

Therefore, we conjecture that the large variance in the early stage causes the convergence problem, and we further propose Rectified Adam, which rectifies the update by estimating the variance of the adaptive learning rate. More details can be found in our paper.
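The rectification itself is compact. Below is a minimal sketch of one RAdam-style update on a NumPy vector, following the update rule described in the paper (the variable names and the fallback step are ours; see radam.py in this repository for the actual implementation):

```python
import numpy as np

def radam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One rectified-Adam update on a NumPy parameter vector (illustrative)."""
    m = beta1 * m + (1 - beta1) * grad                 # first-moment EMA
    v = beta2 * v + (1 - beta2) * grad ** 2            # second-moment EMA
    m_hat = m / (1 - beta1 ** t)                       # bias-corrected momentum

    rho_inf = 2.0 / (1.0 - beta2) - 1.0                # max length of the approximated SMA
    rho_t = rho_inf - 2.0 * t * beta2 ** t / (1.0 - beta2 ** t)

    if rho_t > 4.0:                                    # variance is tractable: rectify the adaptive step
        v_hat = np.sqrt(v / (1 - beta2 ** t))
        r_t = np.sqrt(((rho_t - 4) * (rho_t - 2) * rho_inf) /
                      ((rho_inf - 4) * (rho_inf - 2) * rho_t))
        theta = theta - lr * r_t * m_hat / (v_hat + eps)
    else:                                              # early steps: un-adapted (SGD-with-momentum) step
        theta = theta - lr * m_hat
    return theta, m, v

# toy usage
theta, m, v = np.zeros(3), np.zeros(3), np.zeros(3)
for t in range(1, 6):
    grad = np.random.default_rng(t).normal(size=3)
    theta, m, v = radam_step(theta, grad, m, v, t)
```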

Questions and Discussions

Why does SGD need warmup?

To the best of our knowledge, the warmup heuristic was originally designed for large-minibatch SGD [0], where the intuition is to handle the large gradient variance in the early stage. In our study, we focus on the optimizer rather than the neural model; the variance of the gradients deserves more in-depth analysis and is beyond the scope of our study.

[0] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. arXiv preprint arXiv:1706.02677, 2017.
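For readers unfamiliar with the heuristic: a linear warmup simply scales the learning rate up from near zero over the first few thousand updates. A minimal PyTorch sketch (the 2000-step horizon, base learning rate, and placeholder model are illustrative):

```python
import torch

model = torch.nn.Linear(10, 1)                   # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)

warmup_steps = 2000                              # illustrative horizon
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps))

# Call scheduler.step() once per optimizer update; after warmup_steps
# updates the multiplier stays at 1.0, i.e. the full base learning rate.
```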

Choice of Transformer

We choose the original Transformer as our study object because it suffers from more serious convergence problems in our experiments. We show that, even in this setting, Adam-2k / Adam-eps can avoid the convergence problems with minimal changes / controlled experiments, which verifies our hypothesis.

Why does warmup have a bigger impact on some models than on others?

Although the adaptive learning rate has a larger variance in the early stage, its exact magnitude depends on the model design. Thus, the convergence problem could be more serious for some models/tasks than for others. In our experiments, we observe that RAdam achieves consistent improvements over vanilla Adam, which verifies that the variance issue exists widely (since we can get better performance by fixing it).

Notes on Transformer

Despite its efficiency and effectiveness, we observe that the Transformer model is sensitive. For example, depending on the position of the layer norm, the model may or may not require warmup to achieve good performance. Intuitively, since the gradients of the attention layers can be sparser and the adaptive learning rates for small gradients have a larger variance, these models are more sensitive. Again, we believe this problem deserves more in-depth analysis and is beyond the scope of our study.
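As an illustration of the layer-norm placement mentioned above (a hedged sketch with a feed-forward sublayer only, not the exact modules from our experiments): the original post-LN Transformer normalizes after the residual addition, while the pre-LN variant normalizes the sublayer input and is commonly reported to be less dependent on warmup.

```python
import torch
import torch.nn as nn

d_model = 512
norm = nn.LayerNorm(d_model)
ffn = nn.Sequential(nn.Linear(d_model, 2048), nn.ReLU(), nn.Linear(2048, d_model))

def post_ln(x):
    # original (post-LN) Transformer: normalize after the residual connection
    return norm(x + ffn(x))

def pre_ln(x):
    # pre-LN variant: normalize the sublayer input before the residual branch
    return x + ffn(norm(x))

x = torch.randn(8, d_model)
print(post_ln(x).shape, pre_ln(x).shape)
```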

Quick Start Guide

  1. Directly replace the vanilla Adam with RAdam without changing any settings (see the usage sketch below).
  2. Further tune hyper-parameters for better performance.

Note that the major contribution of our paper is to identify why we need warmup for Adam. Although some users have successfully improved their model performance (see user comments), directly plugging in RAdam may not result in an immediate performance boost. In our experience, replacing vanilla Adam with RAdam usually yields better performance; however, if warmup has already been employed and tuned in the baseline method, it is necessary to tune the hyper-parameters.
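As a concrete example of step 1, assuming radam.py from this repository is on your path (the placeholder model, learning rate, and other hyper-parameters below are illustrative defaults, not tuned settings):

```python
import torch
from radam import RAdam            # radam.py from this repository

model = torch.nn.Linear(10, 1)     # placeholder model
# Drop-in replacement for torch.optim.Adam; keep the baseline settings first,
# then tune lr / weight_decay if the baseline already used a tuned warmup.
optimizer = RAdam(model.parameters(), lr=1e-3, betas=(0.9, 0.999),
                  eps=1e-8, weight_decay=0)

criterion = torch.nn.MSELoss()
x, y = torch.randn(32, 10), torch.randn(32, 1)

optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
```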

Related Posts and Repos

Unofficial Re-Implementations

RAdam is very easy to implement. We provide PyTorch implementations here, and third-party re-implementations can be found at:

Keras Implementation

Keras Implementation

Unofficial Introduction & Mentions

We provide a simple introduction in the Rectified Adam section above, and more details can be found in our paper. Some unofficial introductions are also available (written by native English speakers); they are listed here for reference only (the contents and claims in our paper are more accurate):

Medium Post

Twitter Post

User Comments

We are happy to see that our algorithm has been found useful by some users :-)

"...I tested it on ImageNette and quickly got new high accuracy scores for the 5 and 20 epoch 128px leaderboard scores, so I know it works... https://t.co/1MZmTbmFjn

— Less Wright, August 15, 2019

Thought "sounds interesting, I'll give it a try" - top 5 are vanilla Adam, bottom 4 (I only have access to 4 GPUs) are RAdam... so far looking pretty promising! pic.twitter.com/irvJSeoVfx

— Hamish Dickson (@_mishy), August 16, 2019
