A collection of optimizer-related papers, data, and repositories.
Title | Year | Optimizer | Published | Code | Keywords |
---|---|---|---|---|---|
A Stochastic Approximation Method | 1951 | SGD | projecteuclid | code | gradient descent |
Some methods of speeding up the convergence of iteration methods | 1964 | Polyak | sciencedirect | | gradient descent |
Large-scale linearly constrained optimization | 1978 | MINOS | springerlink | | quasi-newton |
On the limited memory BFGS method for large scale optimization | 1989 | L-BFGS | springerlink | | quasi-newton |
Particle swarm optimization | 1995 | PSO | ieee | | evolutionary |
Trust region methods | 2000 | Sub-sampled TR | siam | | inexact hessian |
Evolving Neural Networks through Augmenting Topologies | 2002 | NEAT | ieee | code | evolutionary |
A Limited Memory Algorithm for Bound Constrained Optimization | 2003 | L-BFGS-B | researchgate | code | quasi-newton |
Online convex programming and generalized infinitesimal gradient ascent | 2003 | OGD | acm | | gradient descent |
A Stochastic Quasi-Newton Method for Online Convex Optimization | 2007 | O-LBFGS | researchgate | | quasi-newton |
Scalable training of L1-regularized log-linear models | 2007 | OWL-QN | acm | code | quasi-newton |
A Hypercube-Based Encoding for Evolving Large-Scale Neural Networks | 2009 | HyperNEAT | ieee | | evolutionary |
AdaDiff: Adaptive Gradient Descent with the Differential of Gradient | 2010 | AdaDiff | iopscience | | gradient descent |
Adaptive Subgradient Methods for Online Learning and Stochastic Optimization | 2011 | AdaGrad | jmlr | code | gradient descent |
CMA-ES: evolution strategies and covariance matrix adaptation | 2011 | CMA-ES | acm | code | evolutionary |
ADADELTA: An Adaptive Learning Rate Method | 2012 | ADADELTA | arxiv | code | gradient descent |
A Stochastic Gradient Method with an Exponential Convergence Rate for Finite Training Sets | 2012 | SAG | arxiv | | variance reduced |
An Enhanced Hypercube-Based Encoding for Evolving the Placement, Density, and Connectivity of Neurons | 2012 | ES-HyperNEAT | ieee | code | evolutionary |
CMA-TWEANN: efficient optimization of neural networks via self-adaptation and seamless augmentation | 2012 | CMA-TWEANN | acm | | evolutionary |
Neural Networks for Machine Learning | 2012 | RMSProp | coursera | code | gradient descent |
No More Pesky Learning Rates | 2012 | vSGD-b | arxiv | code | variance reduced |
No More Pesky Learning Rates | 2012 | vSGD-g | arxiv | code | variance reduced |
No More Pesky Learning Rates | 2012 | vSGD-l | arxiv | code | variance reduced |
Accelerating stochastic gradient descent using predictive variance reduction | 2013 | SVRG | neurips | code | variance reduced |
Adaptive learning rates and parallelization for stochastic, sparse, non-smooth gradients | 2013 | vSGD-fd | arxiv | | gradient descent |
Stochastic First- and Zeroth-order Methods for Nonconvex Stochastic Programming | 2013 | ZO-SGD | arxiv | | gradient free |
Mini-batch Stochastic Approximation Methods for Nonconvex Stochastic Composite Optimization | 2013 | ZO-ProxSGD | arxiv | | gradient free |
Mini-batch Stochastic Approximation Methods for Nonconvex Stochastic Composite Optimization | 2013 | ZO-PSGD | arxiv | | gradient free |
Semi-Stochastic Gradient Descent Methods | 2013 | S2GD | arxiv | | variance reduced |
Adam: A Method for Stochastic Optimization | 2014 | Adam | arxiv | code | gradient descent |
SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives | 2014 | SAGA | arxiv | code | variance reduced |
A Stochastic Quasi-Newton Method for Large-Scale Optimization | 2014 | SQN | arxiv | code | quasi-newton |
RES: Regularized Stochastic BFGS Algorithm | 2014 | Reg-oBFGS-Inf | arxiv | | quasi-newton |
A Proximal Stochastic Gradient Method with Progressive Variance Reduction | 2014 | Prox-SVRG | arxiv | code | variance reduced |
A Computationally Efficient Limited Memory CMA-ES for Large Scale Optimization | 2014 | LM-CMA-ES | arxiv | | evolutionary |
Random feedback weights support learning in deep neural networks | 2014 | FA | arxiv | code | gradient descent |
Adam: A Method for Stochastic Optimization | 2015 | AdaMax | arxiv | code | gradient descent |
Scale-Free Algorithms for Online Linear Optimization | 2015 | AdaFTRL | arxiv | | gradient descent |
A Linearly-Convergent Stochastic L-BFGS Algorithm | 2015 | SVRG-SQN | arxiv | code | quasi-newton |
Accelerating SVRG via second-order information | 2015 | SVRG+II: LBFGS | opt | | quasi-newton |
Accelerating SVRG via second-order information | 2015 | SVRG+I: Subsampled Hessian followed by SVT | opt | | quasi-newton |
Probabilistic Line Searches for Stochastic Optimization | 2015 | ProbLS | arxiv | | gradient descent |
Optimizing Neural Networks with Kronecker-factored Approximate Curvature | 2015 | K-FAC | arxiv | code | gradient descent |
adaQN: An Adaptive Quasi-Newton Algorithm for Training RNNs | 2015 | adaQN | arxiv | code | quasi-newton |
Stochastic Quasi-Newton Methods for Nonconvex Stochastic Optimization | 2016 | Damp-oBFGS-Inf | arxiv | code | quasi-newton |
Eve: A Gradient Based Optimization Method with Locally and Globally Adaptive Learning Rates | 2016 | Eve | arxiv | code | gradient descent |
Incorporating Nesterov Momentum into Adam | 2016 | Nadam | openreview | code | gradient descent |
The Whale Optimization Algorithm | 2016 | WOA | sciencedirect | code | evolutionary |
Adaptive Learning Rate via Covariance Matrix Based Preconditioning for Deep Neural Networks | 2016 | SDProp | arxiv | | gradient descent |
Barzilai-Borwein Step Size for Stochastic Gradient Descent | 2016 | SGD-BB | arxiv | code | gradient descent |
Barzilai-Borwein Step Size for Stochastic Gradient Descent | 2016 | SVRG-BB | arxiv | code | variance reduced |
SGDR: Stochastic Gradient Descent with Warm Restarts | 2016 | SGDR | arxiv | code | gradient descent |
Katyusha: The First Direct Acceleration of Stochastic Gradient Methods | 2016 | Katyusha | arxiv | | variance reduced |
A Comprehensive Linear Speedup Analysis for Asynchronous Stochastic Parallel Optimization from Zeroth-Order to First-Order | 2016 | ZO-SCD | arxiv | | gradient free |
Direct Feedback Alignment Provides Learning in Deep Neural Networks | 2016 | DFA | arxiv | code | gradient descent |
AdaBatch: Adaptive Batch Sizes for Training Deep Neural Networks | 2017 | AdaBatch | arxiv | code | gradient descent |
AdaComp : Adaptive Residual Gradient Compression for Data-Parallel Distributed Training | 2017 | AdaComp | arxiv | | gradient descent |
SARAH: A Novel Method for Machine Learning Problems Using Stochastic Recursive Gradient | 2017 | SARAH | arxiv | | variance reduced |
Sub-sampled Cubic Regularization for Non-convex Optimization | 2017 | SCR | arxiv | code | inexact hessian |
IQN: An Incremental Quasi-Newton Method with Local Superlinear Convergence Rate | 2017 | IQN | arxiv | code | quasi-newton |
Decoupled Weight Decay Regularization | 2017 | AdamW | arxiv | code | gradient descent |
Decoupled Weight Decay Regularization | 2017 | SGDW | arxiv | code | gradient descent |
BPGrad: Towards Global Optimality in Deep Learning via Branch and Pruning | 2017 | BPGrad | arxiv | code | gradient descent |
Training Deep Networks without Learning Rates Through Coin Betting | 2017 | COCOB | arxiv | code | gradient descent |
Practical Gauss-Newton Optimisation for Deep Learning | 2017 | KFLR | arxiv | | gradient descent |
Practical Gauss-Newton Optimisation for Deep Learning | 2017 | KFRA | arxiv | | gradient descent |
Large Batch Training of Convolutional Networks | 2017 | LARS | arxiv | code | gradient descent |
Dissecting Adam: The Sign, Magnitude and Variance of Stochastic Gradients | 2017 | M-SVAG | arxiv | code | gradient descent |
Normalized Direction-preserving Adam | 2017 | ND-Adam | arxiv | code | gradient descent |
Noisy Natural Gradient as Variational Inference | 2017 | Noisy Adam | arxiv | code | gradient descent |
Noisy Natural Gradient as Variational Inference | 2017 | Noisy K-FAC | arxiv | code | gradient descent |
Evolving Deep Neural Networks | 2017 | CoDeepNEAT | arxiv | code | evolutionary |
Evolving Deep Convolutional Neural Networks for Image Classification | 2017 | EvoCNN | arxiv | code | evolutionary |
NMODE --- Neuro-MODule Evolution | 2017 | NMODE | arxiv | code | evolutionary |
Online Convex Optimization with Unconstrained Domains and Losses | 2017 | RescaledExp | arxiv | | gradient descent |
Variants of RMSProp and Adagrad with Logarithmic Regret Bounds | 2017 | SC-Adagrad | arxiv | code | gradient descent |
Variants of RMSProp and Adagrad with Logarithmic Regret Bounds | 2017 | SC-RMSProp | arxiv | code | gradient descent |
Improving Generalization Performance by Switching from Adam to SGD | 2017 | SWATS | arxiv | code | gradient descent |
YellowFin and the Art of Momentum Tuning | 2017 | YellowFin | arxiv | code | gradient descent |
Natasha 2: Faster Non-Convex Optimization Than SGD | 2017 | Natasha2 | arxiv | | gradient descent |
Natasha 2: Faster Non-Convex Optimization Than SGD | 2017 | Natasha1.5 | arxiv | | gradient descent |
Regularizing and Optimizing LSTM Language Models | 2017 | NT-ASGD | arxiv | code | gradient descent |
SW-SGD: The Sliding Window Stochastic Gradient Descent Algorithm | 2017 | SW-SGD | sciencedirect | | gradient descent |
Adafactor: Adaptive Learning Rates with Sublinear Memory Cost | 2018 | Adafactor | arxiv | code | gradient descent |
Quasi-hyperbolic momentum and Adam for deep learning | 2018 | QHAdam | arxiv | code | gradient descent |
Online Adaptive Methods, Universality and Acceleration | 2018 | AcceleGrad | arxiv | | gradient descent |
Bayesian filtering unifies adaptive and non-adaptive neural network optimization methods | 2018 | AdaBayes | arxiv | code | gradient descent |
On the Convergence of A Class of Adam-Type Algorithms for Non-Convex Optimization | 2018 | AdaFom | arxiv | | gradient descent |
Fast Approximate Natural Gradient Descent in a Kronecker-factored Eigenbasis | 2018 | EKFAC | arxiv | code | gradient descent |
AdaShift: Decorrelation and Convergence of Adaptive Learning Rate Methods | 2018 | AdaShift | arxiv | code | gradient descent |
Practical Bayesian Learning of Neural Networks via Adaptive Optimisation Methods | 2018 | BADAM | arxiv | code | gradient descent |
Small steps and giant leaps: Minimal Newton solvers for Deep Learning | 2018 | Curveball | arxiv | code | gradient descent |
GADAM: Genetic-Evolutionary ADAM for Deep Neural Network Optimization | 2018 | GADAM | arxiv | | gradient descent |
HyperAdam: A Learnable Task-Adaptive Adam for Network Training | 2018 | HyperAdam | arxiv | code | gradient descent |
L4: Practical loss-based stepsize adaptation for deep learning | 2018 | L4Adam | arxiv | code | gradient descent |
L4: Practical loss-based stepsize adaptation for deep learning | 2018 | L4Momentum | arxiv | code | gradient descent |
Nostalgic Adam: Weighting more of the past gradients when designing the adaptive learning rate | 2018 | NosAdam | arxiv | code | gradient descent |
Closing the Generalization Gap of Adaptive Gradient Methods in Training Deep Neural Networks | 2018 | Padam | arxiv | code | gradient descent |
Quasi-hyperbolic momentum and Adam for deep learning | 2018 | QHM | arxiv | code | gradient descent |
Optimal Adaptive and Accelerated Stochastic Gradient Descent | 2018 | A2GradExp | arxiv | code | gradient descent |
Optimal Adaptive and Accelerated Stochastic Gradient Descent | 2018 | A2GradInc | arxiv | code | gradient descent |
Optimal Adaptive and Accelerated Stochastic Gradient Descent | 2018 | A2GradUni | arxiv | code | gradient descent |
Shampoo: Preconditioned Stochastic Tensor Optimization | 2018 | Shampoo | arxiv | code | gradient descent |
signSGD: Compressed Optimisation for Non-Convex Problems | 2018 | signSGD | arxiv | code | gradient descent |
Fast and Scalable Bayesian Deep Learning by Weight-Perturbation in Adam | 2018 | VAdam | arxiv | code | gradient descent |
VR-SGD: A Simple Stochastic Variance Reduction Method for Machine Learning | 2018 | VR-SGD | arxiv | code | gradient descent |
WNGrad: Learn the Learning Rate in Gradient Descent | 2018 | WNGrad | arxiv | code | gradient descent |
Adaptive Methods for Nonconvex Optimization | 2018 | Yogi | neurips | code | gradient descent |
First-order Stochastic Algorithms for Escaping From Saddle Points in Almost Linear Time | 2018 | NEON | arxiv | | gradient descent |
Katyusha X: Practical Momentum Method for Stochastic Sum-of-Nonconvex Optimization | 2018 | Katyusha X | arxiv | | variance reduced |
PSA-CMA-ES: CMA-ES with population size adaptation | 2018 | PSA-CMA-ES | acm | | evolutionary |
AdaGrad Stepsizes: Sharp Convergence Over Nonconvex Landscapes | 2018 | AdaGrad-Norm | arxiv | code | gradient descent |
Aggregated Momentum: Stability Through Passive Damping | 2018 | AggMo | arxiv | code | gradient descent |
Accelerating SGD with momentum for over-parameterized learning | 2018 | MaSS | arxiv | code | gradient descent |
SADAGRAD: Strongly Adaptive Stochastic Gradient Methods | 2018 | SADAGRAD | mlr | | gradient descent |
Deep Frank-Wolfe For Neural Network Optimization | 2018 | DFW | arxiv | code | gradient descent |
On the Convergence of AdaGrad with Momentum for Training Deep Neural Networks | 2018 | AdaHB | deepai | | gradient descent |
On the Convergence of AdaGrad with Momentum for Training Deep Neural Networks | 2018 | AdaNAG | deepai | | gradient descent |
Kalman Gradient Descent: Adaptive Variance Reduction in Stochastic Optimization | 2018 | KGD | arxiv | code | gradient descent |
On the Convergence of Adam and Beyond | 2019 | AMSGrad | arxiv | code | gradient descent |
Local AdaAlter: Communication-Efficient Stochastic Gradient Descent with Adaptive Learning Rates | 2019 | AdaAlter | arxiv | code | gradient descent |
Adaptive Gradient Methods with Dynamic Bound of Learning Rate | 2019 | AdaBound | arxiv | code | gradient descent |
Does Adam optimizer keep close to the optimal point? | 2019 | AdaFix | arxiv | | gradient descent |
Adaloss: Adaptive Loss Function for Landmark Localization | 2019 | Adaloss | arxiv | | gradient descent |
A new perspective in understanding of Adam-Type algorithms and beyond | 2019 | AdamAL | openreview | code | gradient descent |
On the Convergence of Adam and Beyond | 2019 | AdamNC | arxiv | | gradient descent |
Lookahead Optimizer: k steps forward, 1 step back | 2019 | Lookahead | arxiv | code | gradient descent |
On Higher-order Moments in Adam | 2019 | HAdam | arxiv | | gradient descent |
An Adaptive and Momental Bound Method for Stochastic Learning | 2019 | AdaMod | arxiv | code | gradient descent |
On the Convergence Proof of AMSGrad and a New Version | 2019 | AdamX | arxiv | | gradient descent |
Second-order Information in First-order Optimization Methods | 2019 | AdaSqrt | arxiv | code | gradient descent |
Adathm: Adaptive Gradient Method Based on Estimates of Third-Order Moments | 2019 | Adathm | ieee | | gradient descent |
Domain-independent Dominance of Adaptive Methods | 2019 | Delayed Adam | arxiv | code | gradient descent |
Domain-independent Dominance of Adaptive Methods | 2019 | AvaGrad | arxiv | code | gradient descent |
Painless Stochastic Gradient: Interpolation, Line-Search, and Convergence Rates | 2019 | ArmijoLS | arxiv | code | gradient descent |
An Adaptive Remote Stochastic Gradient Method for Training Neural Networks | 2019 | ARSG | arxiv | code | gradient descent |
BGADAM: Boosting based Genetic-Evolutionary ADAM for Neural Network Optimization | 2019 | BGADAM | arxiv | | gradient descent |
CProp: Adaptive Learning Rate Scaling from Past Gradient Conformity | 2019 | CProp | arxiv | code | gradient descent |
DADAM: A Consensus-based Distributed Adaptive Gradient Method for Online Optimization | 2019 | DADAM | arxiv | code | gradient descent |
diffGrad: An Optimization Method for Convolutional Neural Networks | 2019 | diffGrad | arxiv | code | gradient descent |
Gradient-only line searches: An Alternative to Probabilistic Line Searches | 2019 | GOLS-I | arxiv | | gradient descent |
Large Batch Optimization for Deep Learning: Training BERT in 76 minutes | 2019 | LAMB | arxiv | code | gradient descent |
An Adaptive Remote Stochastic Gradient Method for Training Neural Networks | 2019 | NAMSB | arxiv | code | gradient descent |
An Adaptive Remote Stochastic Gradient Method for Training Neural Networks | 2019 | NAMSG | arxiv | code | gradient descent |
Stochastic Gradient Methods with Layer-wise Adaptive Moments for Training of Deep Networks | 2019 | Novograd | arxiv | code | gradient descent |
Fast-DENSER++: Evolving Fully-Trained Deep Artificial Neural Networks | 2019 | F-DENSER++ | arxiv | code | evolutionary |
Fast DENSER: Efficient Deep NeuroEvolution | 2019 | F-DENSER | researchgate | code | evolutionary |
Parabolic Approximation Line Search for DNNs | 2019 | PAL | arxiv | code | gradient descent |
The Role of Memory in Stochastic Optimization | 2019 | PolyAdam | arxiv | | gradient descent |
PowerSGD: Practical Low-Rank Gradient Compression for Distributed Optimization | 2019 | PowerSGD | arxiv | code | gradient descent |
PowerSGD: Practical Low-Rank Gradient Compression for Distributed Optimization | 2019 | PowerSGDM | arxiv | code | gradient descent |
On the Variance of the Adaptive Learning Rate and Beyond | 2019 | RAdam | arxiv | code | gradient descent |
Matrix-Free Preconditioning in Online Learning | 2019 | RecursiveOptimizer | arxiv | code | gradient descent |
On Empirical Comparisons of Optimizers for Deep Learning | 2019 | RMSterov | arxiv | | gradient descent |
SAdam: A Variant of Adam for Strongly Convex Functions | 2019 | SAdam | arxiv | code | gradient descent |
Calibrating the Adaptive Learning Rate to Improve Convergence of ADAM | 2019 | Sadam | arxiv | code | gradient descent |
Calibrating the Adaptive Learning Rate to Improve Convergence of ADAM | 2019 | SAMSGrad | arxiv | code | gradient descent |
signADAM: Learning Confidences for Deep Neural Networks | 2019 | signADAM | arxiv | code | gradient descent |
signADAM: Learning Confidences for Deep Neural Networks | 2019 | signADAM++ | arxiv | code | gradient descent |
Memory-Efficient Adaptive Optimization | 2019 | SM3 | arxiv | code | gradient descent |
Momentum-Based Variance Reduction in Non-Convex SGD | 2019 | STORM | arxiv | code | gradient descent |
ZO-AdaMM: Zeroth-Order Adaptive Momentum Method for Black-Box Optimization | 2019 | ZO-AdaMM | arxiv | code | gradient free |
signSGD via Zeroth-Order Oracle | 2019 | ZO-signSGD | openreview | | gradient free |
Demon: Improved Neural Network Training with Momentum Decay | 2019 | Demon SGDM | arxiv | code | gradient descent |
Demon: Improved Neural Network Training with Momentum Decay | 2019 | Demon Adam | arxiv | code | gradient descent |
An Optimistic Acceleration of AMSGrad for Nonconvex Optimization | 2019 | OPT-AMSGrad | arxiv | | gradient descent |
UniXGrad: A Universal, Adaptive Algorithm with Optimal Guarantees for Constrained Optimization | 2019 | UniXGrad | arxiv | | gradient descent |
An Adaptive Optimization Algorithm Based on Hybrid Power and Multidimensional Update Strategy | 2019 | AdaHMG | ieee | | gradient descent |
ProxSGD: Training Structured Neural Networks under Regularization and Constraints | 2019 | ProxSGD | openreview | code | gradient descent |
Efficient Learning Rate Adaptation for Convolutional Neural Network Training | 2019 | e-AdLR | ieee | | gradient descent |
AdaBelief Optimizer: Adapting Stepsizes by the Belief in Observed Gradients | 2020 | AdaBelief | arxiv | code | gradient descent |
ADAHESSIAN: An Adaptive Second Order Optimizer for Machine Learning | 2020 | ADAHESSIAN | arxiv | code | gradient descent |
Adai: Separating the Effects of Adaptive Learning Rate and Momentum Inertia | 2020 | Adai | arxiv | code | gradient descent |
Adam+: A Stochastic Method with Adaptive Variance Reduction | 2020 | Adam+ | arxiv | | gradient descent |
Adam with Bandit Sampling for Deep Learning | 2020 | Adambs | arxiv | code | gradient descent |
Why are Adaptive Methods Good for Attention Models? | 2020 | ACClip | arxiv | | gradient descent |
AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights | 2020 | AdamP | arxiv | code | gradient descent |
On the Trend-corrected Variant of Adaptive Stochastic Optimization Methods | 2020 | AdamT | arxiv | code | gradient descent |
AdaS: Adaptive Scheduling of Stochastic Gradients | 2020 | AdaS | arxiv | code | gradient descent |
AdaScale SGD: A User-Friendly Algorithm for Distributed Training | 2020 | AdaScale | arxiv | | gradient descent |
AdaSGD: Bridging the gap between SGD and Adam | 2020 | AdaSGD | arxiv | | gradient descent |
AdaX: Adaptive Gradient Descent with Exponential Long Term Memory | 2020 | AdaX | arxiv | code | gradient descent |
AdaX: Adaptive Gradient Descent with Exponential Long Term Memory | 2020 | AdaX-W | arxiv | code | gradient descent |
AEGD: Adaptive Gradient Descent with Energy | 2020 | AEGD | arxiv | code | gradient descent |
Biased Stochastic Gradient Descent for Conditional Stochastic Optimization | 2020 | BSGD | arxiv | | gradient descent |
Compositional ADAM: An Adaptive Compositional Solver | 2020 | C-ADAM | arxiv | | gradient descent |
CADA: Communication-Adaptive Distributed Adam | 2020 | CADA | arxiv | code | gradient descent |
CoolMomentum: A Method for Stochastic Optimization by Langevin Dynamics with Simulated Annealing | 2020 | CoolMomentum | arxiv | code | gradient descent |
EAdam Optimizer: How ε Impact Adam | 2020 | EAdam | arxiv | code | gradient descent |
Expectigrad: Fast Stochastic Optimization with Robust Convergence Properties | 2020 | Expectigrad | arxiv | code | gradient descent |
Stochastic Gradient Descent with Nonlinear Conjugate Gradient-Style Adaptive Momentum | 2020 | FRSGD | arxiv | | gradient descent |
Iterative Averaging in the Quest for Best Test Error | 2020 | Gadam | arxiv | | gradient descent |
A Variant of Gradient Descent Algorithm Based on Gradient Averaging | 2020 | Grad-Avg | arxiv | | gradient descent |
Gravilon: Applications of a New Gradient Descent Method to Machine Learning | 2020 | Gravilon | arxiv | | gradient descent |
Practical Quasi-Newton Methods for Training Deep Neural Networks | 2020 | K-BFGS | arxiv | code | gradient descent |
Practical Quasi-Newton Methods for Training Deep Neural Networks | 2020 | K-BFGS(L) | arxiv | code | gradient descent |
LaProp: Separating Momentum and Adaptivity in Adam | 2020 | LaProp | arxiv | code | gradient descent |
Mixing ADAM and SGD: a Combined Optimization Method | 2020 | MAS | arxiv | code | gradient descent |
Self-Tuning Stochastic Optimization with Curvature-Aware Gradient Filtering | 2020 | MEKA | arxiv | | gradient descent |
MTAdam: Automatic Balancing of Multiple Training Loss Terms | 2020 | MTAdam | arxiv | code | gradient descent |
Momentum with Variance Reduction for Nonconvex Composition Optimization | 2020 | MVRC-1 | arxiv | | gradient descent |
Momentum with Variance Reduction for Nonconvex Composition Optimization | 2020 | MVRC-2 | arxiv | | gradient descent |
PAGE: A Simple and Optimal Probabilistic Gradient Estimator for Nonconvex Optimization | 2020 | PAGE | arxiv | | gradient descent |
Momentum-based variance-reduced proximal stochastic gradient method for composite nonconvex stochastic optimization | 2020 | PSTorm | arxiv | | gradient descent |
Ranger-Deep-Learning-Optimizer | 2020 | Ranger | github | code | gradient descent |
Gradient Centralization: A New Optimization Technique for Deep Neural Networks | 2020 | GC | arxiv | code | gradient descent |
S-SGD: Symmetrical Stochastic Gradient Descent with Weight Noise Injection for Reaching Flat Minima | 2020 | S-SGD | arxiv | | gradient descent |
SALR: Sharpness-aware Learning Rate Scheduler for Improved Generalization | 2020 | SALR | arxiv | | gradient descent |
Sharpness-aware Minimization for Efficiently Improving Generalization | 2020 | SAM | arxiv | code | gradient descent |
Stochastic Runge-Kutta methods and adaptive SGD-G2 stochastic gradient descent | 2020 | SGD-G2 | arxiv | | gradient descent |
A New Accelerated Stochastic Gradient Method with Momentum | 2020 | SGDM | arxiv | | gradient descent |
Scheduled Restart Momentum for Accelerated Stochastic Gradient Descent | 2020 | SRSGD | arxiv | code | gradient descent |
Adaptive Gradient Methods Can Be Provably Faster than SGD after Finite Epochs | 2020 | SHAdaGrad | arxiv | | gradient descent |
Enhance Curvature Information by Structured Stochastic Quasi-Newton Methods | 2020 | SKQN | arxiv | | gradient descent |
Enhance Curvature Information by Structured Stochastic Quasi-Newton Methods | 2020 | S4QN | arxiv | | gradient descent |
SMG: A Shuffling Gradient-Based Method with Momentum | 2020 | SMG | arxiv | | gradient descent |
Stochastic Normalized Gradient Descent with Momentum for Large Batch Training | 2020 | SNGM | arxiv | | gradient descent |
TAdam: A Robust Stochastic Gradient Optimizer | 2020 | TAdam | arxiv | code | gradient descent |
Eigenvalue-corrected Natural Gradient Based on a New Approximation | 2020 | TEKFAC | arxiv | | gradient descent |
pbSGD: Powered Stochastic Gradient Descent Methods for Accelerated Non-Convex Optimization | 2020 | pbSGD | ijcai | code | gradient descent |
Apollo: An Adaptive Parameter-wise Diagonal Quasi-Newton Method for Nonconvex Stochastic Optimization | 2020 | Apollo | arxiv | code | quasi-newton |
Apollo: An Adaptive Parameter-wise Diagonal Quasi-Newton Method for Nonconvex Stochastic Optimization | 2020 | ApolloW | arxiv | code | quasi-newton |
Slime mould algorithm: A new method for stochastic optimization | 2020 | SMA | sciencedirect | code | evolutionary |
AdaSwarm: Augmenting Gradient-Based optimizers in Deep Learning with Swarm Intelligence | 2020 | AdaSwarm | arxiv | code | evolutionary |
Adaptive Gradient Methods for Constrained Convex Optimization and Variational Inequalities | 2020 | AdaACSA | arxiv | | gradient descent |
Adaptive Gradient Methods for Constrained Convex Optimization and Variational Inequalities | 2020 | AdaAGD+ | arxiv | | gradient descent |
SCW-SGD: Stochastically Confidence-Weighted SGD | 2020 | SCWSGD | ieee | | gradient descent |
An Improved Adaptive Optimization Technique for Image Classification | 2020 | Mean-ADAM | ieee | | gradient descent |
Accelerated Large Batch Optimization of BERT Pretraining in 54 minutes | 2020 | LANS | arxiv | code | gradient descent |
Weak and Strong Gradient Directions: Explaining Memorization, Generalization, and Hardness of Examples at Scale | 2020 | RM3 | arxiv | code | gradient descent |
On the distance between two neural networks and the stability of learning | 2020 | Fromage | arxiv | code | gradient descent |
Smooth momentum: improving lipschitzness in gradient descent | 2022 | Smooth Momentum | springerlink | | gradient descent |
Towards Better Generalization of Adaptive Gradient Methods | 2020 | SAGD | neurips | | gradient descent |
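
Most entries tagged `gradient descent` above are variations on the same first-order update loop. The snippet below is a minimal illustrative sketch (not code from any paper in this table) of the two best-known members of that family: plain SGD (Robbins & Monro, 1951) and Adam (Kingma & Ba, 2014). The hyperparameter names `lr`, `beta1`, `beta2`, and `eps` follow common convention rather than any single reference listed here.

```python
# Illustrative sketch only; not taken from any repository linked in this list.
import numpy as np

def sgd_step(w, grad, lr=0.01):
    """Plain stochastic gradient descent: w <- w - lr * g."""
    return w - lr * grad

def adam_step(w, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; `state` carries the running moments m, v and step count t."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad       # first moment (momentum)
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2  # second moment (scaling)
    m_hat = state["m"] / (1 - beta1 ** state["t"])             # bias correction
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return w - lr * m_hat / (np.sqrt(v_hat) + eps)

# Toy usage on f(w) = ||w||^2 / 2, whose gradient is simply w.
w = np.array([1.0, -2.0])
for _ in range(100):
    w = sgd_step(w, w, lr=0.1)
print(w)  # close to the minimizer at the origin

state = {"m": np.zeros(2), "v": np.zeros(2), "t": 0}
w_adam = adam_step(np.array([1.0, -2.0]), np.array([1.0, -2.0]), state)
```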