

RAdam


We are in an early-release beta. Expect some adventures and rough edges.

What is the problem that we are interested in?

We want to uncover the underlying principles of warmup for Adam. In some applications, warmup is a must-have for stabilizing training, yet its underlying mechanism is largely unknown. In our study, we suggest that one root cause is the large variance of the adaptive learning rate, and we provide both theoretical and empirical evidence to support this hypothesis.

TL;DR: "If warmup is the answer, what is the question?", "Why should we use warmup, and how?"

Note: we do not claim that RAdam is the best optimizer; there are too many variants / datasets / tasks to make such a claim.

Rectified Adam

As in Figure 8, we assume the gradients follow a normal distribution (mean: \mu, variance: 1). The variance of the adaptive learning rate is simulated and plotted in Figure 8 (blue curve). We can see that the adaptive learning rate has a large variance in the early stage of training.
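As a rough illustration of this kind of simulation (a minimal sketch, assuming gradients are drawn i.i.d. from N(\mu, 1) and using Adam's exponential moving average for the second moment; the step and trial counts are arbitrary, not the settings behind Figure 8):

```python
import numpy as np

def adaptive_lr_variance(mu=0.1, beta2=0.999, eps=1e-8, steps=100, trials=50000):
    """Estimate Var[1 / (sqrt(v_hat_t) + eps)] when gradients are i.i.d. N(mu, 1)."""
    rng = np.random.default_rng(0)
    v = np.zeros(trials)                      # second-moment EMA, one value per trial
    variances = []
    for t in range(1, steps + 1):
        g = rng.normal(mu, 1.0, size=trials)  # one gradient sample per trial
        v = beta2 * v + (1 - beta2) * g ** 2
        v_hat = v / (1 - beta2 ** t)          # bias-corrected second moment
        psi = 1.0 / (np.sqrt(v_hat) + eps)    # adaptive learning rate
        variances.append(psi.var())           # empirical variance across trials
    return variances

var_by_step = adaptive_lr_variance()
print(var_by_step[0], var_by_step[-1])        # much larger at step 1 than at step 100
```

In such a toy run the variance is largest during the first handful of updates and shrinks as more gradients are averaged into the second moment, which is the behavior the blue curve illustrates.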

At the same time, when using the Transformer for NMT (see the more detailed discussion in the next section), a warmup stage is usually required to avoid convergence problems (Adam-vanilla converges to around 500 PPL in Figure 1, while Adam-warmup successfully converges to under 10 PPL). In further explorations, we notice that if we use an additional 2000 samples to update only the adaptive learning rate, the convergence problems are avoided (Adam-2k); or, if we increase the value of eps, the convergence problems are alleviated (Adam-eps).
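For concreteness, Adam-eps amounts to nothing more than running the standard optimizer with a larger eps; a minimal sketch (the placeholder model and the value 1e-4 are illustrative, not the exact experimental setup):

```python
import torch

model = torch.nn.Linear(10, 1)  # placeholder model for illustration

# A larger eps caps the adaptive learning rate 1 / (sqrt(v_t) + eps),
# which limits its variance during the first updates.
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4,
                             betas=(0.9, 0.999), eps=1e-4)
```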

Therefore, we conjecture that the large variance in the early stage causes the convergence problem, and we further propose Rectified Adam, which rectifies the update by estimating the variance of the adaptive learning rate. More details can be found in our paper.
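The rectification itself is compact. Below is a minimal sketch of one RAdam-style update on a NumPy vector, following the update rule described in the paper (the variable names and the fallback step are ours; see radam.py in this repository for the actual implementation):

```python
import numpy as np

def radam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One rectified-Adam update on a NumPy parameter vector (illustrative)."""
    m = beta1 * m + (1 - beta1) * grad                 # first-moment EMA
    v = beta2 * v + (1 - beta2) * grad ** 2            # second-moment EMA
    m_hat = m / (1 - beta1 ** t)                       # bias-corrected momentum

    rho_inf = 2.0 / (1.0 - beta2) - 1.0                # max length of the approximated SMA
    rho_t = rho_inf - 2.0 * t * beta2 ** t / (1.0 - beta2 ** t)

    if rho_t > 4.0:                                    # variance is tractable: rectify the adaptive step
        v_hat = np.sqrt(v / (1 - beta2 ** t))
        r_t = np.sqrt(((rho_t - 4) * (rho_t - 2) * rho_inf) /
                      ((rho_inf - 4) * (rho_inf - 2) * rho_t))
        theta = theta - lr * r_t * m_hat / (v_hat + eps)
    else:                                              # early steps: un-adapted (SGD-with-momentum) step
        theta = theta - lr * m_hat
    return theta, m, v

# toy usage
theta, m, v = np.zeros(3), np.zeros(3), np.zeros(3)
for t in range(1, 6):
    grad = np.random.default_rng(t).normal(size=3)
    theta, m, v = radam_step(theta, grad, m, v, t)
```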

Questions and Discussions

Why does SGD need warmup?

To the best of our knowledge, the warmup heuristic was originally designed for large-minibatch SGD [0], where the intuition is to handle the large gradient variance in the early stage. In our study, we focus on the optimizer rather than the neural model; the variance of the gradients deserves more in-depth analysis and is beyond the scope of our study.

[0] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. arXiv preprint arXiv:1706.02677, 2017.
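For readers unfamiliar with the heuristic: a linear warmup simply scales the learning rate up from near zero over the first few thousand updates. A minimal PyTorch sketch (the 2000-step horizon, base learning rate, and placeholder model are illustrative):

```python
import torch

model = torch.nn.Linear(10, 1)                   # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)

warmup_steps = 2000                              # illustrative horizon
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps))

# Call scheduler.step() once per optimizer update; after warmup_steps
# updates the multiplier stays at 1.0, i.e. the full base learning rate.
```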

Choice of Transformer

We choose the original Transformer as our study object because it suffers from more serious convergence problems in our experiments. We show that, even in this setting, Adam-2k / Adam-eps can avoid the convergence problems with minimal changes / controlled experiments, which verifies our hypothesis.

Why does warmup have a bigger impact on some models than on others?

Although the adaptive learning rate has a larger variance in the early stage, its exact magnitude depends on the model design. Thus, the convergence problem could be more serious for some models/tasks than for others. In our experiments, we observe that RAdam achieves consistent improvements over vanilla Adam, which verifies that the variance issue exists widely (since we can get better performance by fixing it).

Notes on Transformer

Despite its efficiency and effectiveness, we observe that the Transformer model is sensitive. For example, depending on the position of the layer norm, the model may or may not require warmup to achieve good performance. Intuitively, since the gradients of the attention layers can be sparser and the adaptive learning rates for small gradients have a larger variance, these models are more sensitive. Again, we believe this problem deserves more in-depth analysis and is beyond the scope of our study.
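As an illustration of the layer-norm placement mentioned above (a hedged sketch with a feed-forward sublayer only, not the exact modules from our experiments): the original post-LN Transformer normalizes after the residual addition, while the pre-LN variant normalizes the sublayer input and is commonly reported to be less dependent on warmup.

```python
import torch
import torch.nn as nn

d_model = 512
norm = nn.LayerNorm(d_model)
ffn = nn.Sequential(nn.Linear(d_model, 2048), nn.ReLU(), nn.Linear(2048, d_model))

def post_ln(x):
    # original (post-LN) Transformer: normalize after the residual connection
    return norm(x + ffn(x))

def pre_ln(x):
    # pre-LN variant: normalize the sublayer input before the residual branch
    return x + ffn(norm(x))

x = torch.randn(8, d_model)
print(post_ln(x).shape, pre_ln(x).shape)
```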

Quick Start Guide

  1. Directly replace the vanilla Adam with RAdam without changing any settings (see the usage sketch below).
  2. Further tune hyper-parameters for better performance.

Note that the major contribution of our paper is to identify why we need warmup for Adam. Although some users have successfully improved their model performance (see user comments), directly plugging in RAdam may not result in an immediate performance boost. In our experience, replacing vanilla Adam with RAdam usually yields better performance; however, if warmup has already been employed and tuned in the baseline method, it is necessary to tune the hyper-parameters.
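As a concrete example of step 1, assuming radam.py from this repository is on your path (the placeholder model, learning rate, and other hyper-parameters below are illustrative defaults, not tuned settings):

```python
import torch
from radam import RAdam            # radam.py from this repository

model = torch.nn.Linear(10, 1)     # placeholder model
# Drop-in replacement for torch.optim.Adam; keep the baseline settings first,
# then tune lr / weight_decay if the baseline already used a tuned warmup.
optimizer = RAdam(model.parameters(), lr=1e-3, betas=(0.9, 0.999),
                  eps=1e-8, weight_decay=0)

criterion = torch.nn.MSELoss()
x, y = torch.randn(32, 10), torch.randn(32, 1)

optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
```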

Related Posts and Repos

Unofficial Re-Implementations

RAdam is very easy to implement. We provide PyTorch implementations here, and third-party re-implementations can be found at:

Keras Implementation

Keras Implementation

Unofficial Introduction & Mentions

We provide a simple introduction in the Rectified Adam section above, and more details can be found in our paper. Some unofficial introductions are also available (written by native English speakers); they are listed here for reference only (the contents and claims in our paper are more accurate):

Medium Post

Twitter Post

User Comments

We are happy to see that our algorithm has been found useful by some users :-)

"...I tested it on ImageNette and quickly got new high accuracy scores for the 5 and 20 epoch 128px leaderboard scores, so I know it works... https://t.co/1MZmTbmFjn

— Less Wright, August 15, 2019

Thought "sounds interesting, I'll give it a try" - top 5 are vanilla Adam, bottom 4 (I only have access to 4 GPUs) are RAdam... so far looking pretty promising! pic.twitter.com/irvJSeoVfx

— Hamish Dickson (@_mishy), August 16, 2019
