
pytorch-optimizer's Introduction

torch-optimizer


torch-optimizer is a collection of optimizers for PyTorch, compatible with the torch.optim module.

Simple example

import torch_optimizer as optim

# model = ...
optimizer = optim.DiffGrad(model.parameters(), lr=0.001)
optimizer.step()
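
In a real training step the gradients must be computed before calling step(); a minimal sketch with a placeholder model, data, and loss (these are illustrative only):

import torch
import torch_optimizer as optim

# Placeholder model and data, just for illustration.
model = torch.nn.Linear(10, 1)
data, target = torch.randn(16, 10), torch.randn(16, 1)

optimizer = optim.DiffGrad(model.parameters(), lr=0.001)

optimizer.zero_grad()
loss = torch.nn.functional.mse_loss(model(data), target)
loss.backward()   # compute gradients
optimizer.step()  # update parameters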

Installation

Installation process is simple, just:

$ pip install torch_optimizer

Documentation

https://pytorch-optimizer.rtfd.io

Citation

Please cite the original authors of the optimization algorithms. If you find this package useful, you can cite it as:

@software{Novik_torchoptimizers,
    title        = {{torch-optimizer -- collection of optimization algorithms for PyTorch.}},
    author       = {Novik, Mykola},
    year         = 2020,
    month        = 1,
    version      = {1.0.1}
}

Or use GitHub's "Cite this repository" button.

Supported Optimizers

A2GradExp https://arxiv.org/abs/1810.00553
A2GradInc https://arxiv.org/abs/1810.00553
A2GradUni https://arxiv.org/abs/1810.00553
AccSGD https://arxiv.org/abs/1803.05591
AdaBelief https://arxiv.org/abs/2010.07468
AdaBound https://arxiv.org/abs/1902.09843
AdaMod https://arxiv.org/abs/1910.12249
Adafactor https://arxiv.org/abs/1804.04235
Adahessian https://arxiv.org/abs/2006.00719
AdamP https://arxiv.org/abs/2006.08217
AggMo https://arxiv.org/abs/1804.00325
Apollo https://arxiv.org/abs/2009.13586
DiffGrad https://arxiv.org/abs/1909.11015
Lamb https://arxiv.org/abs/1904.00962
Lookahead https://arxiv.org/abs/1907.08610
MADGRAD https://arxiv.org/abs/2101.11075
NovoGrad https://arxiv.org/abs/1905.11286
PID https://www4.comp.polyu.edu.hk/~cslzhang/paper/CVPR18_PID.pdf
QHAdam https://arxiv.org/abs/1810.06801
QHM https://arxiv.org/abs/1810.06801
RAdam https://arxiv.org/abs/1908.03265
Ranger https://medium.com/@lessw/new-deep-learning-optimizer-ranger-synergistic-combination-of-radam-lookahead-for-the-best-of-2dc83f79a48d
RangerQH https://arxiv.org/abs/1810.06801
RangerVA https://arxiv.org/abs/1908.00700v2
SGDP https://arxiv.org/abs/2006.08217
SGDW https://arxiv.org/abs/1608.03983
SWATS https://arxiv.org/abs/1712.07628
Shampoo https://arxiv.org/abs/1802.09568
Yogi https://papers.nips.cc/paper/8186-adaptive-methods-for-nonconvex-optimization

Visualizations

Visualizations help us see how different algorithms deal with simple situations such as saddle points, local minima, and valleys, and they may provide interesting insights into the inner workings of an algorithm. The Rosenbrock and Rastrigin benchmark functions were selected because:

  • Rosenbrock (also known as the banana function) is a non-convex function with one global minimum at (1.0, 1.0). The global minimum lies inside a long, narrow, parabolic-shaped flat valley. Finding the valley is trivial; converging to the global minimum, however, is difficult. Optimization algorithms may pay a lot of attention to one coordinate and struggle to follow the valley, which is relatively flat.


  • Rastrigin is a non-convex function with one global minimum at (0.0, 0.0). Finding the minimum is fairly difficult due to the large search space and the large number of local minima.


Each optimizer performs 501 optimization steps. The learning rate is the best one found by a hyperparameter search algorithm; the remaining parameters are left at their defaults. It is easy to extend the script and tune other optimizer parameters.

python examples/viz_optimizers.py
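
For reference, both benchmark functions are easy to write down; a sketch is below (the exact constants and setup used in examples/viz_optimizers.py may differ):

import math
import torch

def rosenbrock(xy):
    # xy: tensor of shape (2,). Global minimum f(1, 1) = 0 lies at the end
    # of a long, narrow, curved valley.
    x, y = xy[0], xy[1]
    return (1 - x) ** 2 + 100 * (y - x ** 2) ** 2

def rastrigin(xy, a=10.0):
    # xy: tensor of shape (2,). Global minimum f(0, 0) = 0, surrounded by
    # many regularly spaced local minima.
    x, y = xy[0], xy[1]
    return (
        2 * a
        + (x ** 2 - a * torch.cos(2 * math.pi * x))
        + (y ** 2 - a * torch.cos(2 * math.pi * y))
    )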

Warning

Do not pick an optimizer based on these visualizations; optimization approaches have unique properties and may be tailored for different purposes, or may require an explicit learning rate schedule, etc. The best way to find out is to try one on your particular problem and see if it improves your scores.

If you do not know which optimizer to use, start with the built-in SGD/Adam. Once the training logic is ready and baseline scores are established, swap the optimizer and see if there is any improvement.
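
Swapping a built-in baseline for a torch_optimizer optimizer is a one-line change, since both share the same interface; a sketch with a placeholder model:

import torch
import torch_optimizer as optim

model = torch.nn.Linear(10, 1)

# Establish a baseline with the built-in optimizer first...
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# ...then swap in a candidate from torch_optimizer; the training loop is unchanged.
optimizer = optim.Yogi(model.parameters(), lr=1e-2)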

A2GradExp


import torch_optimizer as optim

# model = ...
optimizer = optim.A2GradExp(
    model.parameters(),
    kappa=1000.0,
    beta=10.0,
    lips=10.0,
    rho=0.5,
)
optimizer.step()

Paper: Optimal Adaptive and Accelerated Stochastic Gradient Descent (2018) [https://arxiv.org/abs/1810.00553]

Reference Code: https://github.com/severilov/A2Grad_optimizer

A2GradInc


import torch_optimizer as optim

# model = ...
optimizer = optim.A2GradInc(
    model.parameters(),
    kappa=1000.0,
    beta=10.0,
    lips=10.0,
)
optimizer.step()

Paper: Optimal Adaptive and Accelerated Stochastic Gradient Descent (2018) [https://arxiv.org/abs/1810.00553]

Reference Code: https://github.com/severilov/A2Grad_optimizer

A2GradUni


import torch_optimizer as optim

# model = ...
optimizer = optim.A2GradUni(
    model.parameters(),
    kappa=1000.0,
    beta=10.0,
    lips=10.0,
)
optimizer.step()

Paper: Optimal Adaptive and Accelerated Stochastic Gradient Descent (2018) [https://arxiv.org/abs/1810.00553]

Reference Code: https://github.com/severilov/A2Grad_optimizer

AccSGD


import torch_optimizer as optim

# model = ...
optimizer = optim.AccSGD(
    model.parameters(),
    lr=1e-3,
    kappa=1000.0,
    xi=10.0,
    small_const=0.7,
    weight_decay=0
)
optimizer.step()

Paper: On the insufficiency of existing momentum schemes for Stochastic Optimization (2019) [https://arxiv.org/abs/1803.05591]

Reference Code: https://github.com/rahulkidambi/AccSGD

AdaBelief


import torch_optimizer as optim

# model = ...
optimizer = optim.AdaBelief(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.999),
    eps=1e-3,
    weight_decay=0,
    amsgrad=False,
    weight_decouple=False,
    fixed_decay=False,
    rectify=False,
)
optimizer.step()

Paper: AdaBelief Optimizer, adapting stepsizes by the belief in observed gradients (2020) [https://arxiv.org/abs/2010.07468]

Reference Code: https://github.com/juntang-zhuang/Adabelief-Optimizer

AdaBound


import torch_optimizer as optim

# model = ...
optimizer = optim.AdaBound(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.999),
    final_lr=0.1,
    gamma=1e-3,
    eps=1e-8,
    weight_decay=0,
    amsbound=False,
)
optimizer.step()

Paper: Adaptive Gradient Methods with Dynamic Bound of Learning Rate (2019) [https://arxiv.org/abs/1902.09843]

Reference Code: https://github.com/Luolc/AdaBound

AdaMod

The AdaMod method restricts the adaptive learning rates with adaptive and momental upper bounds. The dynamic learning rate bounds are based on the exponential moving averages of the adaptive learning rates themselves, which smooth out unexpected large learning rates and stabilize the training of deep neural networks.
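
Conceptually, the bound is an exponential moving average of the adaptive step sizes that the current step size is clipped against; a rough sketch of that idea (a hypothetical helper, not the package's actual code):

import torch

def momental_bound(step_size, ema_step_size, beta3=0.999):
    # Smooth the per-parameter step sizes over time and never let the
    # current step size exceed the smoothed value.
    ema_step_size.mul_(beta3).add_(step_size, alpha=1 - beta3)
    return torch.minimum(step_size, ema_step_size)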


import torch_optimizer as optim

# model = ...
optimizer = optim.AdaMod(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.999),
    beta3=0.999,
    eps=1e-8,
    weight_decay=0,
)
optimizer.step()

Paper: An Adaptive and Momental Bound Method for Stochastic Learning. (2019) [https://arxiv.org/abs/1910.12249]

Reference Code: https://github.com/lancopku/AdaMod

Adafactor


import torch_optimizer as optim

# model = ...
optimizer = optim.Adafactor(
    model.parameters(),
    lr=1e-3,
    eps2=(1e-30, 1e-3),
    clip_threshold=1.0,
    decay_rate=-0.8,
    beta1=None,
    weight_decay=0.0,
    scale_parameter=True,
    relative_step=True,
    warmup_init=False,
)
optimizer.step()

Paper: Adafactor: Adaptive Learning Rates with Sublinear Memory Cost. (2018) [https://arxiv.org/abs/1804.04235]

Reference Code: https://github.com/pytorch/fairseq/blob/master/fairseq/optim/adafactor.py

Adahessian


import torch_optimizer as optim

# model = ...
optimizer = optim.Adahessian(
    model.parameters(),
    lr=1.0,
    betas=(0.9, 0.999),
    eps=1e-4,
    weight_decay=0.0,
    hessian_power=1.0,
)
loss_fn(model(input), target).backward(create_graph=True)  # create_graph=True is necessary for Hessian calculation
optimizer.step()

Paper: ADAHESSIAN: An Adaptive Second Order Optimizer for Machine Learning (2020) [https://arxiv.org/abs/2006.00719]

Reference Code: https://github.com/amirgholami/adahessian

AdamP

AdamP proposes a simple and effective solution: at each iteration of the Adam optimizer applied to scale-invariant weights (e.g., Conv weights preceding a BN layer), AdamP removes the radial component (i.e., the component parallel to the weight vector) from the update vector. Intuitively, this operation prevents unnecessary updates along the radial direction that only increase the weight norm without contributing to loss minimization.
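
The projection itself is simple; a rough sketch of removing the radial (parallel-to-weight) component from an update (a hypothetical helper, not the package's actual code):

import torch

def remove_radial_component(update, weight, eps=1e-8):
    # Drop the part of the update parallel to the weight vector, keeping
    # only the component orthogonal to it.
    w = weight.reshape(-1)
    u = update.reshape(-1)
    radial = (torch.dot(u, w) / (torch.dot(w, w) + eps)) * w
    return (u - radial).reshape(update.shape)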


import torch_optimizer as optim

# model = ...
optimizer = optim.AdamP(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.999),
    eps=1e-8,
    weight_decay=0,
    delta=0.1,
    wd_ratio=0.1,
)
optimizer.step()

Paper: Slowing Down the Weight Norm Increase in Momentum-based Optimizers. (2020) [https://arxiv.org/abs/2006.08217]

Reference Code: https://github.com/clovaai/AdamP

AggMo


import torch_optimizer as optim

# model = ...
optimizer = optim.AggMo(
    model.parameters(),
    lr=1e-3,
    betas=(0.0, 0.9, 0.99),
    weight_decay=0,
)
optimizer.step()

Paper: Aggregated Momentum: Stability Through Passive Damping. (2019) [https://arxiv.org/abs/1804.00325]

Reference Code: https://github.com/AtheMathmo/AggMo

Apollo


import torch_optimizer as optim

# model = ...
optimizer = optim.Apollo(
    model.parameters(),
    lr=1e-2,
    beta=0.9,
    eps=1e-4,
    warmup=0,
    init_lr=0.01,
    weight_decay=0,
)
optimizer.step()

Paper: Apollo: An Adaptive Parameter-wise Diagonal Quasi-Newton Method for Nonconvex Stochastic Optimization. (2020) [https://arxiv.org/abs/2009.13586]

Reference Code: https://github.com/XuezheMax/apollo

DiffGrad

DiffGrad is based on the difference between the current and the immediately preceding gradient: the step size is adjusted for each parameter so that parameters with quickly changing gradients get a larger step size and parameters with slowly changing gradients get a smaller one.
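
The scaling comes from a "friction" coefficient driven by the absolute change in the gradient; a rough sketch of the idea (a hypothetical helper, not the package's actual code):

import torch

def diffgrad_friction(grad, prev_grad):
    # Sigmoid of the absolute gradient change: near 1.0 when the gradient
    # changes quickly (larger steps), near 0.5 when it is almost constant
    # (smaller steps). The Adam-style step is multiplied by this factor.
    return torch.sigmoid(torch.abs(prev_grad - grad))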


import torch_optimizer as optim

# model = ...
optimizer = optim.DiffGrad(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.999),
    eps=1e-8,
    weight_decay=0,
)
optimizer.step()

Paper: diffGrad: An Optimization Method for Convolutional Neural Networks. (2019) [https://arxiv.org/abs/1909.11015]

Reference Code: https://github.com/shivram1987/diffGrad

Lamb


import torch_optimizer as optim

# model = ...
optimizer = optim.Lamb(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.999),
    eps=1e-8,
    weight_decay=0,
)
optimizer.step()

Paper: Large Batch Optimization for Deep Learning: Training BERT in 76 minutes (2019) [https://arxiv.org/abs/1904.00962]

Reference Code: https://github.com/cybertronai/pytorch-lamb

Lookahead


import torch_optimizer as optim

# model = ...
# base optimizer, any other optimizer can be used like Adam or DiffGrad
yogi = optim.Yogi(
    model.parameters(),
    lr=1e-2,
    betas=(0.9, 0.999),
    eps=1e-3,
    initial_accumulator=1e-6,
    weight_decay=0,
)

optimizer = optim.Lookahead(yogi, k=5, alpha=0.5)
optimizer.step()

Paper: Lookahead Optimizer: k steps forward, 1 step back (2019) [https://arxiv.org/abs/1907.08610]

Reference Code: https://github.com/alphadl/lookahead.pytorch

MADGRAD


import torch_optimizer as optim

# model = ...
optimizer = optim.MADGRAD(
    model.parameters(),
    lr=1e-2,
    momentum=0.9,
    weight_decay=0,
    eps=1e-6,
)
optimizer.step()

Paper: Adaptivity without Compromise: A Momentumized, Adaptive, Dual Averaged Gradient Method for Stochastic Optimization (2021) [https://arxiv.org/abs/2101.11075]

Reference Code: https://github.com/facebookresearch/madgrad

NovoGrad


import torch_optimizer as optim

# model = ...
optimizer = optim.NovoGrad(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.999),
    eps=1e-8,
    weight_decay=0,
    grad_averaging=False,
    amsgrad=False,
)
optimizer.step()

Paper: Stochastic Gradient Methods with Layer-wise Adaptive Moments for Training of Deep Networks (2019) [https://arxiv.org/abs/1905.11286]

Reference Code: https://github.com/NVIDIA/DeepLearningExamples/

PID


import torch_optimizer as optim

# model = ...
optimizer = optim.PID(
    model.parameters(),
    lr=1e-3,
    momentum=0,
    dampening=0,
    weight_decay=1e-2,
    integral=5.0,
    derivative=10.0,
)
optimizer.step()

Paper: A PID Controller Approach for Stochastic Optimization of Deep Networks (2018) [http://www4.comp.polyu.edu.hk/~cslzhang/paper/CVPR18_PID.pdf]

Reference Code: https://github.com/tensorboy/PIDOptimizer

QHAdam


import torch_optimizer as optim

# model = ...
optimizer = optim.QHAdam(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.999),
    nus=(1.0, 1.0),
    weight_decay=0,
    decouple_weight_decay=False,
    eps=1e-8,
)
optimizer.step()

Paper: Quasi-hyperbolic momentum and Adam for deep learning (2019) [https://arxiv.org/abs/1810.06801]

Reference Code: https://github.com/facebookresearch/qhoptim

QHM


import torch_optimizer as optim

# model = ...
optimizer = optim.QHM(
    model.parameters(),
    lr=1e-3,
    momentum=0,
    nu=0.7,
    weight_decay=1e-2,
    weight_decay_type='grad',
)
optimizer.step()

Paper: Quasi-hyperbolic momentum and Adam for deep learning (2019) [https://arxiv.org/abs/1810.06801]

Reference Code: https://github.com/facebookresearch/qhoptim

RAdam


Deprecated; please use the RAdam implementation provided by PyTorch (torch.optim.RAdam).

import torch_optimizer as optim

# model = ...
optimizer = optim.RAdam(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.999),
    eps=1e-8,
    weight_decay=0,
)
optimizer.step()

Paper: On the Variance of the Adaptive Learning Rate and Beyond (2019) [https://arxiv.org/abs/1908.03265]

Reference Code: https://github.com/LiyuanLucasLiu/RAdam

Ranger


import torch_optimizer as optim

# model = ...
optimizer = optim.Ranger(
    model.parameters(),
    lr=1e-3,
    alpha=0.5,
    k=6,
    N_sma_threshhold=5,
    betas=(.95, 0.999),
    eps=1e-5,
    weight_decay=0
)
optimizer.step()

Paper: New Deep Learning Optimizer, Ranger: Synergistic combination of RAdam + LookAhead for the best of both (2019) [https://medium.com/@lessw/new-deep-learning-optimizer-ranger-synergistic-combination-of-radam-lookahead-for-the-best-of-2dc83f79a48d]

Reference Code: https://github.com/lessw2020/Ranger-Deep-Learning-Optimizer

RangerQH


import torch_optimizer as optim

# model = ...
optimizer = optim.RangerQH(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.999),
    nus=(.7, 1.0),
    weight_decay=0.0,
    k=6,
    alpha=.5,
    decouple_weight_decay=False,
    eps=1e-8,
)
optimizer.step()

Paper: Quasi-hyperbolic momentum and Adam for deep learning (2018) [https://arxiv.org/abs/1810.06801]

Reference Code: https://github.com/lessw2020/Ranger-Deep-Learning-Optimizer

RangerVA


import torch_optimizer as optim

# model = ...
optimizer = optim.RangerVA(
    model.parameters(),
    lr=1e-3,
    alpha=0.5,
    k=6,
    n_sma_threshhold=5,
    betas=(.95, 0.999),
    eps=1e-5,
    weight_decay=0,
    amsgrad=True,
    transformer='softplus',
    smooth=50,
    grad_transformer='square'
)
optimizer.step()

Paper: Calibrating the Adaptive Learning Rate to Improve Convergence of ADAM (2019) [https://arxiv.org/abs/1908.00700v2]

Reference Code: https://github.com/lessw2020/Ranger-Deep-Learning-Optimizer

SGDP


import torch_optimizer as optim

# model = ...
optimizer = optim.SGDP(
    model.parameters(),
    lr=1e-3,
    momentum=0,
    dampening=0,
    weight_decay=1e-2,
    nesterov=False,
    delta=0.1,
    wd_ratio=0.1,
)
optimizer.step()

Paper: Slowing Down the Weight Norm Increase in Momentum-based Optimizers. (2020) [https://arxiv.org/abs/2006.08217]

Reference Code: https://github.com/clovaai/AdamP

SGDW


import torch_optimizer as optim

# model = ...
optimizer = optim.SGDW(
    model.parameters(),
    lr=1e-3,
    momentum=0,
    dampening=0,
    weight_decay=1e-2,
    nesterov=False,
)
optimizer.step()

Paper: SGDR: Stochastic Gradient Descent with Warm Restarts (2017) [https://arxiv.org/abs/1608.03983]

Reference Code: pytorch/pytorch#22466

SWATS


import torch_optimizer as optim

# model = ...
optimizer = optim.SWATS(
    model.parameters(),
    lr=1e-1,
    betas=(0.9, 0.999),
    eps=1e-3,
    weight_decay=0.0,
    amsgrad=False,
    nesterov=False,
)
optimizer.step()

Paper: Improving Generalization Performance by Switching from Adam to SGD (2017) [https://arxiv.org/abs/1712.07628]

Reference Code: https://github.com/Mrpatekful/swats

Shampoo


import torch_optimizer as optim

# model = ...
optimizer = optim.Shampoo(
    model.parameters(),
    lr=1e-1,
    momentum=0.0,
    weight_decay=0.0,
    epsilon=1e-4,
    update_freq=1,
)
optimizer.step()

Paper: Shampoo: Preconditioned Stochastic Tensor Optimization (2018) [https://arxiv.org/abs/1802.09568]

Reference Code: https://github.com/moskomule/shampoo.pytorch

Yogi

Yogi is an optimization algorithm based on Adam with more fine-grained control of the effective learning rate, and it has convergence guarantees similar to Adam's.
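
The main difference from Adam is the sign-based, additive update of the second moment; a rough sketch (a hypothetical helper, not the package's actual code):

import torch

def yogi_second_moment(v, grad, beta2=0.999):
    # Adam: v <- beta2 * v + (1 - beta2) * grad^2
    # Yogi: v <- v - (1 - beta2) * sign(v - grad^2) * grad^2
    # The Yogi update changes v by at most (1 - beta2) * grad^2 per step,
    # which keeps the effective learning rate from swinging too quickly.
    grad_sq = grad * grad
    return v - (1 - beta2) * torch.sign(v - grad_sq) * grad_sq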


import torch_optimizer as optim

# model = ...
optimizer = optim.Yogi(
    model.parameters(),
    lr=1e-2,
    betas=(0.9, 0.999),
    eps=1e-3,
    initial_accumulator=1e-6,
    weight_decay=0,
)
optimizer.step()

Paper: Adaptive Methods for Nonconvex Optimization (2018) [https://papers.nips.cc/paper/8186-adaptive-methods-for-nonconvex-optimization]

Reference Code: https://github.com/4rtemi5/Yogi-Optimizer_Keras

Adam (PyTorch built-in)

(Rosenbrock and Rastrigin visualizations only.)

SGD (PyTorch built-in)

(Rosenbrock and Rastrigin visualizations only.)

pytorch-optimizer's People

Contributors

amirgholami, avinashsai, bgnkim, carefree0910, chenkins, crawlingcub, deepsourcebot, dependabot-preview[bot], dependabot[bot], goliney, iiseymour, jettify, jona-sassenhagen, jwuphysics, liyuanlucasliu, lucidrains, matech96, mpariente, muupan, ryancinsight, sidml, slckl, thisiseshan, tkon3, vinnik-dmitry07, yohann84l


pytorch-optimizer's Issues

Example of how to use scheduler

I'd like to use the Adafactor scheduler as the Hugging Face code has (but their code does not work for CNNs).

Questions as follows:
a) How do I use schedulers with pytorch-optimizer? (a minimal sketch is shown below)
b) Can we add an example in the readme?
c) Does Adafactor have its own scheduler here?
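
Regarding (a): optimizers in torch_optimizer subclass torch.optim.Optimizer, so the standard PyTorch LR schedulers can drive them directly. A minimal sketch (the model, data, and schedule values here are placeholders):

import torch
import torch_optimizer as optim

model = torch.nn.Linear(10, 1)
optimizer = optim.DiffGrad(model.parameters(), lr=1e-3)
# Any torch.optim.lr_scheduler works, since the optimizer is a torch.optim.Optimizer.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

for epoch in range(30):
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(torch.randn(8, 10)), torch.randn(8, 1))
    loss.backward()
    optimizer.step()
    scheduler.step()  # advance the learning rate schedule once per epoch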

related:

Thanks and Question

Thanks for putting this together! I'm giving DiffGrad a try now.

I know that you haven't run benchmarks, but you have obviously read a lot of papers and implemented a lot of optimizers, so it would be great to get your opinion on the most promising optimizer(s) to try for CNNs.

README.rst Typo

The Adahessian optimizer example code is missing a comma after the betas.

DeepOBS Benchmarks

It would be interesting to see how the optimizers perform on real-world problems. DeepOBS just added PyTorch support, so it should be relatively simple to evaluate the optimizers on a range of different problems.

RAdam: self._buffer should not be shared across param_groups

The current source code of RAdam forces self._buffer to be shared across different param_groups.
However, this scrambles different learning rate settings, since self._buffer only stores the latest step size of the FIRST parameter group. (This can be checked by using two parameter groups, one of which has ZERO learning rate.)

As mentioned in RAdam#24, this was already fixed in the original RAdam repo by assigning _buffer into param_group settings. Could you update pytorch-optimizer to follow the original repository?

Best Optimizer for Training GANs?

May I ask which optimizer available here is best for training Generative Adversarial Networks?

Is such a benchmark available anywhere? Only AdaBelief explicitly mentions GANs in its paper.

LAMB: Differences from the paper author's official implementation

The LAMB implementation in this PyTorch package differs from the official TensorFlow version released by the paper's authors. In the official implementation, the authors' code skips some parameters by name when applying weight decay, but in your implementation all parameters appear to be directly involved in the calculation.
For example, exclude_from_weight_decay=["batch_normalization", "LayerNorm", "layer_norm"]
Their implementation:
https://github.com/tensorflow/addons/blob/master/tensorflow_addons/optimizers/lamb.py

Complex numbers

Hi Jettify,

AdaBound and Adahessian don't work for complex numbers; they produce the messages below.

For AdaBound:

File "optimizer.py", line 701, in step
    step_size.div_(denom).clamp_(lower_bound, upper_bound).mul_(
RuntimeError: "clamp_scalar_cpu" not implemented for 'ComplexFloat'

For Adahessian:

File "optimizer.py", line 433, in <listcomp>
    * torch.randint_like(
RuntimeError: check_random_bounds handles only integral, floating-point and boolean types

Do you have any ideas to solve these issues? Thanks.

Bests,
Ni

Apollo optimizer eats all the GPU memory

My network is:
a few dense layers (conv with padding + concatenating output to input),
2-layer LSTM and
2 Linear layers in the end.
Even after I made the network laughably small, all GPU memory (8 GB) was consumed within a few epochs.

I understand that the Apollo optimizer is quasi-Newton and attempts to approximate the second derivative, but still: why does memory consumption grow with every epoch?
I tried calling torch.cuda.empty_cache(), torch.clear_autocast_cache() (I didn't fully understand this one, but who knows), and gc.collect(); after each call consumption dropped a bit, but not as fast as Apollo consumed it :)

Unfair comparison in Visualizations

Hi,

Thanks a lot for this great repo.
For the comparison in the Visualizations example, I found that for each config, you run 100 updates.
I am concerned that 100 is too small, so it would favor optimizers that converge quickly in the first few updates.

For optimizers whose convergence is relatively slow at the beginning, the search would select a large lr. This could lead to unstable convergence for these optimizers.

Moreover, for hyper-parameter search, the objective is the distance between the last step point and the minimum. I think the function value of the last step point may be a better objective.

Lastly, some optimizers implicitly implement learning rate decay (such as AdaBound and RAdam), but some do not.
Yet in your comparison, no explicit learning rate schedule is used.

Adafactor fails to run on a custom (rfs) resnet12 (with MAML)

I was trying Adafactor but I get the following error:

args.scheduler=None
--------------------- META-TRAIN ------------------------
Starting training!
Traceback (most recent call last):
  File "/home/miranda9/automl-meta-learning/automl-proj-src/experiments/meta_learning/main_metalearning.py", line 441, in <module>
    main_resume_from_checkpoint(args)
  File "/home/miranda9/automl-meta-learning/automl-proj-src/experiments/meta_learning/main_metalearning.py", line 403, in main_resume_from_checkpoint
    run_training(args)
  File "/home/miranda9/automl-meta-learning/automl-proj-src/experiments/meta_learning/main_metalearning.py", line 413, in run_training
    meta_train_fixed_iterations(args)
  File "/home/miranda9/automl-meta-learning/automl-proj-src/meta_learning/training/meta_training.py", line 233, in meta_train_fixed_iterations
    args.outer_opt.step()
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/torch/optim/optimizer.py", line 88, in wrapper
    return func(*args, **kwargs)
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/torch_optimizer/adafactor.py", line 191, in step
    self._approx_sq_grad(
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/torch_optimizer/adafactor.py", line 116, in _approx_sq_grad
    (exp_avg_sq_row / exp_avg_sq_row.mean(dim=-1))
RuntimeError: The size of tensor a (3) must match the size of tensor b (64) at non-singleton dimension 1

With the PyTorch default Adam, training runs fine, so why does this one fail?

related:

Add baseline visualizations

I love the illustrations, but I find the absence of any kind of baselines a shame. It'd be nice to see how Adam or SGD do on the example functions and compare them with some of the more fancy optimizers.

Would this be possible?

I can probably run the required experiments myself, if there are no problems.

Wrong paper references for Ranger optimizer variants

The README lists Calibrating the Adaptive Learning Rate to Improve Convergence of ADAM by Tong, Liang, and Bi (2019) as the source paper accompanying the Ranger, RangerQH, and RangerVA codes. However, this paper seems to describe only the addition of softplus to Adam (SAdam) and AMSGrad (SAMSGrad), implemented here: https://github.com/neilliang90/Sadam; it makes no mention of the LookAhead or RAdam techniques. Therefore it makes sense to credit RangerVA to this paper, but Ranger and RangerQH should not use this reference.

The original Ranger optimizer is a combination of the LookAhead, Rectified Adam, and Gradient Centralization papers, and is described in a blog post.

RangerQH uses quasi-hyperbolic momentum introduced by Ma and Yarats (2018) on top of the regular Ranger optimizer, so I believe this should be the reference.

I would propose the following references:

  • Ranger - New Deep Learning Optimizer, Ranger: Synergistic combination of RAdam + LookAhead for the best of both (2019) [https://medium.com/@lessw/new-deep-learning-optimizer-ranger-synergistic-combination-of-radam-lookahead-for-the-best-of-2dc83f79a48d]
  • RangerQH - Quasi-hyperbolic momentum and Adam for deep learning (2018) [https://arxiv.org/abs/1810.06801]
  • RangerVA - Calibrating the Adaptive Learning Rate to Improve Convergence of ADAM (2019) [https://arxiv.org/abs/1908.00700v2]

YOGI Initialization

exp_avg_sq Initialization

"Thus, for YOGI, we propose to initialize the vt based on gradient square evaluated at the initial point averaged over a (reasonably large) mini-batch."

The initial exp_avg_sq should be initialized to the gradient square.

exp_avg Initialization


Based on m0 in the paper, the Yogi optimizer's exp_avg should be initialized to zero instead of initial_accumulator.

lamb optimizer mistake

Hi, I was checking your Lamb implementation and I think there is a mistake in it.
According to the paper, exp_avg and exp_avg_sq (m and v) must be bias-corrected this way:

m_hat = m / (1 - beta1^t)
v_hat = v / (1 - beta2^t)

In your implementation they are not corrected, so even if self.debias == True, this correction is still missing from adam_norm.
Please correct me if I'm wrong.
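
For reference, the bias correction the issue refers to would look roughly like this (a sketch of the standard Adam-style debiasing with hypothetical names, not the repository's code):

def debias(exp_avg, exp_avg_sq, beta1, beta2, step):
    # Standard Adam-style bias correction; step is the 1-based step count.
    exp_avg_hat = exp_avg / (1 - beta1 ** step)
    exp_avg_sq_hat = exp_avg_sq / (1 - beta2 ** step)
    return exp_avg_hat, exp_avg_sq_hat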

add swats

Adam_SGD:

"SWATS from Keskar & Socher (2017) a high-scoring paper by ICLR in 2018, a method proposed to automatically switch from Adam to SGD for better generalization performance. The idea of the algorithm itself is very simple. It uses Adam, which works well despite minimal tuning, but after learning until a certain stage, it is taken over by SGD."

https://github.com/Mrpatekful/swats

AdamD implementation (or option to skip bias-correction to adam-derived optimizers)?

I recently put out a proposal to add an argument to adam-derived optimizers to skip the bias-correction term on w, only applying it to v. See the figure attached in the issue pytorch/pytorch#67105 and the write-up I put together for theoretical justification AdamD: Improved bias-correction in Adam. Since it's still too early in the idea's existence to add this to the pytorch repo (according to them), your repo seems like a reasonable home for it. I am happy to send you a PR, but I would like to hear what you would prefer:

  1. New optimizers, AdamD and AdamDW (mirroring Adam/AdamW but with the bias-correction on the w term excluded).
  2. An otherwise vanilla fork of Adam/AdamW, with a boolean flag allowing the user to turn the bias-correction on/off, as well as adding this option to the relevant optimizers already included in this repo. I have not read through it carefully but this would likely include Lamb (it would be an option to enable bias-correction on v only, since it is already excluded otherwise), AdamP, and maybe others.

Let me know how you would like to proceed, or if you want any further clarification!

Type issues with release 0.0.1a11 on Py3.5

Hey @jettify

With Python 3.5 on Linux, importing the package leads to a TypeError -

$ python3 -m venv test && source test/bin/activate
(test) $
(test) $
(test) $ pip install torch_optimizer
Collecting torch_optimizer
  Downloading https://files.pythonhosted.org/packages/38/8f/8c9fdfb199a8f7e06a16c2315d076aa1505057f2496b7a7f2c73ece66215/torch_optimizer-0.0.1a11-py3-none-any.whl
Collecting pytorch-ranger>=0.1.1 (from torch_optimizer)
  Downloading https://files.pythonhosted.org/packages/0d/70/12256257d861bbc3e176130d25be1de085ce7a9e60594064888a950f2154/pytorch_ranger-0.1.1-py3-none-any.whl
Collecting torch>=1.1.0 (from torch_optimizer)
  Using cached https://files.pythonhosted.org/packages/47/69/7a1291b74a3af0043db9048606daeb8b57cd9dea90b9df740485f3843878/torch-1.4.0-cp35-cp35m-manylinux1_x86_64.whl
Collecting numpy (from torch>=1.1.0->torch_optimizer)
  Downloading https://files.pythonhosted.org/packages/ff/18/c0b937e2f84095ae230196899e56d1d7d76c8e8424fb235ed7e5bb6d68af/numpy-1.18.2-cp35-cp35m-manylinux1_x86_64.whl (20.0MB)
    100% |████████████████████████████████| 20.0MB 79kB/s 
Installing collected packages: numpy, torch, pytorch-ranger, torch-optimizer
Successfully installed numpy-1.18.2 pytorch-ranger-0.1.1 torch-1.4.0 torch-optimizer-0.0.1a11
(test) $
(test) $
(test) $ python3 -c 'import torch_optimizer'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/tmp/o/test/lib/python3.5/site-packages/torch_optimizer/__init__.py", line 69, in <module>
    def get(name: str) -> Optional[Type[Optimizer]]:
  File "/usr/lib/python3.5/typing.py", line 649, in __getitem__
    return Union[arg, type(None)]
  File "/usr/lib/python3.5/typing.py", line 552, in __getitem__
    dict(self.__dict__), parameters, _root=True)
  File "/usr/lib/python3.5/typing.py", line 512, in __new__
    for t2 in all_params - {t1} if not isinstance(t2, TypeVar)):
  File "/usr/lib/python3.5/typing.py", line 512, in <genexpr>
    for t2 in all_params - {t1} if not isinstance(t2, TypeVar)):
  File "/usr/lib/python3.5/typing.py", line 1077, in __subclasscheck__
    if super().__subclasscheck__(cls):
  File "/usr/lib/python3.5/abc.py", line 225, in __subclasscheck__
    for scls in cls.__subclasses__():
TypeError: descriptor '__subclasses__' of 'type' object needs an argument

'Yogi' object has no attribute 'Yogi'

Hi, calling Yogi from pytorch-optimizer has a bug (runtime error: 'Yogi' object has no attribute 'Yogi'),
so at the moment I am calling yogi.py directly.

# import torch_optimizer as optim    # in the second iteration of the for loop, this raises the error above
from yogi import Yogi                # import directly from the yogi.py file (includes the types.py definitions)

for fold, (train_idx, val_idx) in enumerate(...):
    model = Net(...)
    # optim = optim.Yogi(model.parameters(), lr=1e-2, betas=(0.9, 0.999), eps=1e-3, initial_accumulator=1e-6, weight_decay=0)
    optim = Yogi(model.parameters(), lr=1e-2, betas=(0.9, 0.999), eps=1e-3, initial_accumulator=1e-6, weight_decay=0)
    ...

SGDP default settings are "incorrect" (and the visualization is uncharacteristic for SGDP)

One of SGDP's main points is the use of momentum, but the default settings shown alongside the visualization have it disabled. This has a pretty large impact on its speed. I ran a little test measuring how long it takes to go from random noise to a goal image: with a learning rate of 0.1, momentum set to 0.9888544, and Nesterov disabled, it reached a squared-sum difference of 7.0072e-10 in 3 steps. With momentum 0, the poor thing takes 27 steps to reach 0.7700. Also, the project page for AdamP/SGDP shows the same test as the visualizations used here, but they use momentum with various settings and it definitely reaches the goal. The GitHub page for AdamP also gives default parameters of "SGDP(params, lr=0.1, weight_decay=1e-5, momentum=0.9, nesterov=True)", which seems closer to the demonstrations on their project page, even though their own code defaults momentum to 0. Thus, I believe the values on their page showing an example of importing and using SGDP should be used as the default values on this page.
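
For reference, the settings quoted above from the AdamP/SGDP project page would be set up like this; a sketch with a placeholder model, using the issue's proposed values rather than the package's current defaults:

import torch
import torch_optimizer as optim

model = torch.nn.Linear(10, 1)
# Values quoted from the AdamP/SGDP README, per the issue above.
optimizer = optim.SGDP(
    model.parameters(),
    lr=0.1,
    weight_decay=1e-5,
    momentum=0.9,
    nesterov=True,
)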

eps value in Adabelief optimizer

Hi.

The eps hyper-parameter is set to 1e-3 by default in the current implementation.

The author recommends setting eps to 1e-8 for usual cases.
In my experiments, I found the optimizer to be very sensitive to eps value. The results with eps=1e-8 were significantly better than with eps=1e-3.

Don’t include `tests` in your distribution

You need to fix the following line so tests isn’t installed globally for users

packages=find_packages(),

$ pip show -f torch-optimizer
Name: torch-optimizer
Version: 0.1.0
[…]
Files:
  tests/__init__.py
  tests/__pycache__/[…].pyc
  tests/conftest.py
  tests/test_basic.py
  tests/test_optimizer.py
  tests/test_optimizer_with_nn.py
  tests/test_param_validation.py
  tests/test_utils.py
  tests/utils.py
  torch_optimizer-0.1.0.dist-info/[…]
  torch_optimizer/__init__.py
  torch_optimizer/__pycache__/[…].pyc
  torch_optimizer/a2grad.py
  torch_optimizer/[…].py
  torch_optimizer/yogi.py
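
A common fix is to exclude the tests package explicitly; a sketch, assuming a standard setuptools-based setup.py:

from setuptools import find_packages, setup

setup(
    name="torch-optimizer",
    # ... other metadata unchanged ...
    packages=find_packages(exclude=("tests", "tests.*")),
)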

spot bug in SGDW implementation (weight decay part)

Hi,

I was using the SGDW implementation in this repo, and I wonder if anything is wrong with this line:

p.data.add_(weight_decay, alpha=-group['lr'])

Let the weight decay be $\lambda$ and the learning rate be $\mu_t$. If I understand it correctly, this line of code applies weight decay as
$$\theta_t \leftarrow \tilde{\theta}_t - \lambda \mu_t$$
where (follow the notation in the paper)

$$\tilde{\theta}_t \leftarrow \theta_{t-1} - m_t$$

But it should be

$$ \begin{aligned} \theta_{t-1} &\leftarrow \theta_{t-1} \cdot (1 - \lambda \mu_t) \\ \theta_t &\leftarrow \theta_{t-1} - m_t \end{aligned} $$

as in the paper:
image

This results in poor training performance compared to SGD with the same set of optimization hyper-parameters.
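
Following the formulas above, the decoupled update would shrink the weights before applying the momentum step; a rough sketch with hypothetical variable names (following this issue's notation, not the repository's code):

import torch

def sgdw_update(param, momentum_buffer, lr, weight_decay):
    # theta <- theta * (1 - lr * weight_decay)  (decoupled weight decay)
    # theta <- theta - m_t                      (momentum step, as written above)
    param.data.mul_(1 - lr * weight_decay)
    param.data.add_(momentum_buffer, alpha=-1.0)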

Thanks!

Regards, Liu

Using GPU to train the model

Hello, I really appreciate your work, but I wonder how to use a GPU to train the model. There are always errors when I use the CUDA device. Thanks a lot.
device = torch.device('cuda' if use_cuda else 'cpu')

GPU memory leak in adahessian optimizer?

Hi
I am using your library and appreciate all the work you have put into this capability. I started using the Adahessian optimizer and found that GPU memory usage kept growing as the optimizer ran, until all of my GPU memory was used and the run crashed. The leak seems to be within the get_trace routine, and I believe it can be fixed by changing

hvs = torch.autograd.grad(
    grads, params, grad_outputs=v, only_inputs=True, retain_graph=True
)

to

hvs = torch.autograd.grad(
    grads, params, grad_outputs=v, only_inputs=True, retain_graph=False
)

If you get a chance to check this out, please comment to let me know.
Thanks!

Lamb optimizer warning in pytorch 1.6

Hi, I'm getting this deprecation warning in PyTorch 1.6 for Lamb:

2020-06-25T00:58:41 - WARNING - /opt/conda/envs/py36/lib/python3.6/site-packages/torch_optimizer/lamb.py:120: UserWarning: This overload of add_ is deprecated:
    add_(Number alpha, Tensor other)
Consider using one of the following signatures instead:
    add_(Tensor other, *, Number alpha) (Triggered internally at /opt/conda/conda-bld/pytorch_1592982553767/work/torch/csrc/utils/python_arg_parser.cpp:766.)
  exp_avg.mul_(beta1).add_(1 - beta1, grad)

