An optimizer in machine learning is an algorithm that iteratively adjusts a model's learnable parameters to minimize (or maximize) an objective function, commonly called a loss function. Optimizers sit at the core of every training pipeline: after backpropagation computes how much each parameter contributed to the prediction error, the optimizer decides how to update those parameters so the model improves over time.
The choice of optimizer affects training speed, final model quality, memory consumption, and hyperparameter sensitivity. Decades of research have produced a broad family of algorithms, from simple gradient descent to adaptive methods like Adam and recent discoveries such as Lion and Muon.
Imagine you are blindfolded in a hilly field and you want to find the lowest valley. You can feel the slope under your feet, and that tells you which direction goes downhill. An optimizer is your strategy for walking downhill.
The optimizer occupies a specific position in the standard supervised learning training loop:

1. Forward pass: the model produces predictions for a mini-batch of inputs.
2. Loss computation: the loss function measures the error between the predictions and the targets.
3. Backward pass: backpropagation computes the gradient of the loss with respect to each learnable parameter.
4. Optimizer step: the optimizer uses those gradients (and any internal state it maintains) to update the parameters.

Steps 1 through 4 repeat for each mini-batch of training data. One full pass through the training set is called an epoch.
Given a parameter vector theta, a loss function L(theta), and a learning rate eta, the simplest optimizer performs the update:
theta_{t+1} = theta_t - eta * grad L(theta_t)
This is vanilla gradient descent. All other optimizers modify this rule in one or more of the following ways: how the gradient is estimated (full batch, a single example, or a mini-batch), whether a running history of past gradients is kept (momentum and its variants), whether each parameter receives its own adaptive learning rate (AdaGrad, RMSProp, Adam, and relatives), and whether curvature information is used to precondition the step (second-order and structure-aware methods).
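To make the update rule concrete, here is a minimal NumPy sketch that fits a one-parameter linear model with vanilla gradient descent; the data, learning rate, and iteration count are purely illustrative:

```python
import numpy as np

# Toy problem: fit y = theta * x with a mean-squared-error loss.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x + rng.normal(scale=0.1, size=100)   # underlying weight is 3.0

theta = 0.0   # learnable parameter
eta = 0.1     # learning rate

for step in range(200):
    grad = np.mean(2 * (theta * x - y) * x)   # gradient of the mean squared error
    theta = theta - eta * grad                # theta_{t+1} = theta_t - eta * grad L(theta_t)

print(theta)  # converges close to 3.0
```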
Batch gradient descent computes the gradient of the loss over the entire training dataset before making a single update:
theta = theta - eta * (1/N) * sum_{i=1}^{N} grad L_i(theta)
This produces a stable, low-variance gradient estimate, but it is impractical for large datasets because the full dataset must fit in memory and every update requires a complete pass through the data.
Stochastic gradient descent computes the gradient from a single example or a small mini-batch instead of the full dataset. This idea traces back to the Robbins-Monro stochastic approximation method published in 1951. SGD introduces noise into the gradient estimates, which can actually help the optimizer escape shallow local minima and saddle points. Mini-batch SGD (using, say, 32 to 512 examples per gradient estimate) balances the variance reduction of larger batches with the computational savings of smaller ones.
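The following sketch shows mini-batch SGD on a toy linear-regression problem: the data are shuffled each epoch and each update uses only a small slice of examples, so the gradient estimate is noisy but cheap (all values here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 3.0 * x + rng.normal(scale=0.1, size=1000)

theta, eta, batch_size = 0.0, 0.1, 64
for epoch in range(5):
    perm = rng.permutation(len(x))                  # reshuffle the data each epoch
    for start in range(0, len(x), batch_size):
        idx = perm[start:start + batch_size]
        xb, yb = x[idx], y[idx]
        grad = np.mean(2 * (theta * xb - yb) * xb)  # noisy mini-batch gradient estimate
        theta -= eta * grad
```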
Polyak introduced the momentum method in 1964, and Rumelhart, Hinton, and Williams popularized it for neural networks in 1986. Instead of using only the current gradient, the optimizer maintains a velocity vector that accumulates past gradients:
v_t = gamma * v_{t-1} + eta * grad L(theta_t)
theta_{t+1} = theta_t - v_t
The momentum coefficient gamma (typically 0.9) controls how much history is retained. Momentum accelerates convergence along consistent gradient directions and dampens oscillations in directions where the gradient frequently changes sign. It remains the standard choice for many computer vision tasks, including training ResNets and other convolutional neural networks.
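A minimal sketch of the momentum update for one parameter array (the function name and default values are illustrative):

```python
import numpy as np

def momentum_step(theta, grad, velocity, eta=0.01, gamma=0.9):
    """One heavy-ball momentum update: v_t = gamma * v_{t-1} + eta * grad, then theta - v_t."""
    velocity = gamma * velocity + eta * grad
    theta = theta - velocity
    return theta, velocity
```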
Proposed by Yurii Nesterov in 1983 for convex optimization, NAG modifies momentum by computing the gradient at a "look-ahead" position rather than the current position:
v_t = gamma * v_{t-1} + eta * grad L(theta_t - gamma * v_{t-1})
theta_{t+1} = theta_t - v_t
By evaluating the gradient at the projected future position, NAG produces more responsive updates and achieves faster convergence rates on convex problems. Nesterov proved an optimal O(1/t^2) convergence rate for smooth convex functions, compared to the O(1/t) rate of standard gradient descent.
Adaptive methods maintain per-parameter learning rates that adjust automatically based on the history of gradients. This eliminates or reduces the need to manually tune the global learning rate.
Introduced by Duchi, Hazan, and Singer in 2011, AdaGrad accumulates the sum of squared gradients for each parameter and uses this sum to scale the learning rate:
G_t = G_{t-1} + (grad L(theta_t))^2
theta_{t+1} = theta_t - (eta / sqrt(G_t + epsilon)) * grad L(theta_t)
Parameters that receive large gradients get smaller effective learning rates, while parameters with small or infrequent gradients retain larger learning rates. This makes AdaGrad well suited for problems with sparse features, such as natural language processing tasks where rare words have infrequent but informative gradients. The main drawback is that the accumulated squared gradient sum grows monotonically, causing the effective learning rate to shrink to near zero over long training runs.
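A minimal per-parameter sketch of the AdaGrad update (the helper and defaults are illustrative):

```python
import numpy as np

def adagrad_step(theta, grad, g_accum, eta=0.01, eps=1e-8):
    """One AdaGrad update: accumulate squared gradients, then scale each step by 1/sqrt(G_t)."""
    g_accum = g_accum + grad ** 2                        # monotonically growing accumulator
    theta = theta - eta * grad / np.sqrt(g_accum + eps)
    return theta, g_accum
```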
Geoffrey Hinton proposed RMSProp in his 2012 Coursera lecture on neural networks. RMSProp fixes AdaGrad's diminishing learning rate problem by replacing the cumulative sum with an exponentially decaying average of squared gradients:
E[g^2]_t = rho * E[g^2]_{t-1} + (1 - rho) * (grad L(theta_t))^2
theta_{t+1} = theta_t - (eta / sqrt(E[g^2]_t + epsilon)) * grad L(theta_t)
The decay rate rho (typically 0.9) ensures that only recent gradient magnitudes influence the per-parameter learning rate. RMSProp was never formally published in a peer-reviewed paper, but it became one of the most widely used optimizers in practice, particularly for recurrent neural networks and reinforcement learning.
Matthew Zeiler introduced Adadelta in 2012 as an extension of AdaGrad. Like RMSProp, it uses an exponentially decaying average of squared gradients. It also maintains a running average of squared parameter updates, which replaces the global learning rate entirely:
E[g^2]_t = rho * E[g^2]_{t-1} + (1 - rho) * (grad L(theta_t))^2
delta_theta_t = -(sqrt(E[delta_theta^2]_{t-1} + epsilon) / sqrt(E[g^2]_t + epsilon)) * grad L(theta_t)
E[delta_theta^2]_t = rho * E[delta_theta^2]_{t-1} + (1 - rho) * (delta_theta_t)^2
By computing the ratio of update RMS to gradient RMS, Adadelta achieves correct units for the parameter update without requiring a manually specified learning rate.
Kingma and Ba introduced Adam (Adaptive Moment Estimation) in a 2014 paper published at ICLR 2015. Adam combines the first moment estimate (momentum) with the second moment estimate (RMSProp-style adaptive learning rate), plus bias correction to account for the zero initialization of the moment estimates:
m_t = beta_1 * m_{t-1} + (1 - beta_1) * grad L(theta_t) (first moment)
v_t = beta_2 * v_{t-1} + (1 - beta_2) * (grad L(theta_t))^2 (second moment)
m_hat_t = m_t / (1 - beta_1^t) (bias correction)
v_hat_t = v_t / (1 - beta_2^t) (bias correction)
theta_{t+1} = theta_t - eta * m_hat_t / (sqrt(v_hat_t) + epsilon)
The default hyperparameters (beta_1 = 0.9, beta_2 = 0.999, epsilon = 1e-8) work well across a wide range of problems, which is a major reason for Adam's popularity. Adam is computationally efficient, has modest memory requirements (two buffers per parameter), and is relatively insensitive to the choice of learning rate compared to SGD.
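Putting the five equations together, a compact sketch of one Adam step (an illustrative helper, not a library API):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count used for bias correction."""
    m = beta1 * m + (1 - beta1) * grad            # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2       # second moment (adaptive scaling)
    m_hat = m / (1 - beta1 ** t)                  # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```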
Loshchilov and Hutter published "Decoupled Weight Decay Regularization" in 2017 (ICLR 2019), showing that the standard way of implementing L2 regularization in Adam is not equivalent to true weight decay. In Adam with L2 regularization, the regularization gradient gets scaled by the adaptive learning rate, which weakens the regularization effect for parameters with large gradient histories. AdamW fixes this by decoupling weight decay from the gradient-based update:
m_t, v_t = (same as Adam)
theta_{t+1} = theta_t - eta * (m_hat_t / (sqrt(v_hat_t) + epsilon) + lambda * theta_t)
Here, lambda is the weight decay coefficient applied directly to the parameters, independent of the adaptive scaling. AdamW has become the default optimizer for training transformers, large language models, and many other modern architectures.
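The difference from Adam is a single term: weight decay is added to the update directly rather than folded into the gradient. A sketch under the same illustrative conventions as the Adam step above:

```python
import numpy as np

def adamw_step(theta, grad, m, v, t, eta=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """AdamW: identical moment estimates to Adam, plus decay applied directly to the parameters."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - eta * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * theta)
    return theta, m, v
```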
Timothy Dozat proposed NAdam (Nesterov-accelerated Adaptive Moment Estimation) in 2016 by incorporating Nesterov momentum into Adam. Instead of using the current first-moment estimate for the update, NAdam uses the look-ahead first moment, similar to how NAG looks ahead relative to standard momentum. NAdam generally converges faster than Adam on tasks where Nesterov momentum provides a benefit, including language modeling and certain computer vision workloads.
Liu et al. introduced RAdam (Rectified Adam) at ICLR 2020. They identified that Adam's adaptive learning rate has problematically high variance in the early steps of training because the second moment estimate is computed from very few samples. This variance is the underlying reason why learning rate warmup helps Adam. RAdam estimates the variance of the adaptive learning rate and applies a rectification term that automatically suppresses the variance when it is too high. The result is an optimizer that adapts between SGD-like behavior early in training and full Adam behavior later, removing the need to manually tune a warmup schedule.
As models have grown to billions of parameters, optimizer memory consumption has become a practical bottleneck. Adam and AdamW store two state buffers (first and second moment) per parameter, doubling the memory required beyond the model parameters and gradients themselves.
Shazeer and Stern introduced Adafactor in 2018 (ICML 2018) to reduce the memory cost of adaptive optimizers. For a weight matrix of size m x n, Adam stores an m x n second-moment buffer. Adafactor factorizes this into per-row and per-column statistics, reducing memory from O(m * n) to O(m + n). Combined with update clipping and the option to drop momentum, Adafactor achieves comparable results to Adam on Transformer training while using significantly less memory. Google used Adafactor for training the T5 model family.
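The core memory trick can be sketched in a few lines: keep exponential averages of row and column statistics of the squared gradients and reconstruct a rank-1 approximation of the full second moment. This is a simplified illustration, not the complete Adafactor algorithm (which also includes update clipping and relative step sizes):

```python
import numpy as np

def factored_second_moment(grad, row_avg, col_avg, beta2=0.999, eps=1e-30):
    """Maintain O(m + n) statistics for an m x n gradient instead of an m x n buffer."""
    sq = grad ** 2 + eps
    row_avg = beta2 * row_avg + (1 - beta2) * sq.mean(axis=1)   # shape (m,)
    col_avg = beta2 * col_avg + (1 - beta2) * sq.mean(axis=0)   # shape (n,)
    # Rank-1 reconstruction of the per-element second-moment estimate.
    v_hat = np.outer(row_avg, col_avg) / row_avg.mean()
    return row_avg, col_avg, v_hat
```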
Chen et al. at Google Brain discovered Lion (EvoLved Sign Momentum) through automated program search, described in their 2023 NeurIPS paper "Symbolic Discovery of Optimization Algorithms." Rather than being designed by hand, Lion was found by searching over a space of possible optimizer programs using evolutionary methods. The resulting algorithm is remarkably simple: it uses only the sign of a momentum-based interpolation to determine the update direction:
update = sign(beta_1 * m_{t-1} + (1 - beta_1) * grad L(theta_t))
theta_{t+1} = theta_t - eta * update
m_t = beta_2 * m_{t-1} + (1 - beta_2) * grad L(theta_t)
Because Lion stores only one momentum buffer (compared to Adam's two) and uses a uniform update magnitude, it roughly halves the optimizer memory overhead. Lion requires a learning rate 3 to 10 times smaller than Adam's due to its sign-based updates. On image classification with ViT, Lion improved ImageNet accuracy by up to 2%. On diffusion models, it reduced training compute by up to 2.3 times. Lion has also been deployed in production at Google for search ads click-through-rate models.
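A direct transcription of the three update equations as an illustrative helper (the defaults shown here are not authoritative; the paper pairs the sign-based update with decoupled weight decay, included as an optional term):

```python
import numpy as np

def lion_step(theta, grad, m, eta=1e-4, beta1=0.9, beta2=0.99, weight_decay=0.0):
    """One Lion update: sign of an interpolated momentum, then the momentum buffer is refreshed."""
    update = np.sign(beta1 * m + (1 - beta1) * grad)
    theta = theta - eta * (update + weight_decay * theta)
    m = beta2 * m + (1 - beta2) * grad
    return theta, m
```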
First-order optimizers use only gradient information. Second-order optimizers also incorporate curvature information (the Hessian matrix or approximations of it) to take more informed steps. While full Newton's method computes the inverse Hessian (which is prohibitively expensive for neural networks with millions of parameters), several practical approximations exist.
Liu et al. introduced Sophia (Second-order Clipped Stochastic Optimization) in 2023 (ICLR 2024) for language model pre-training. Sophia uses a lightweight diagonal Hessian estimate as a pre-conditioner, dividing the gradient by the estimated curvature and then applying element-wise clipping. The clipping controls worst-case update sizes, which tames instability from non-convexity and rapid Hessian changes. Sophia estimates the diagonal Hessian only every few iterations, keeping the average per-step overhead negligible. On GPT models from 125M to 1.5B parameters, Sophia achieved a 2x speedup over Adam in steps, total compute, and wall-clock time to reach the same perplexity.
Gupta et al. introduced Shampoo in 2018 as a structure-aware preconditioning algorithm for stochastic optimization over tensor spaces. For a weight matrix W of size m x n, instead of maintaining a single mn by mn preconditioner (which would be infeasible), Shampoo maintains separate preconditioners of size m x m and n x n, one for each tensor dimension. Anil et al. (2021) developed a distributed implementation of Shampoo that demonstrated strong performance on large-scale training tasks. More recently, Vyas et al. (2024) introduced SOAP, which combines Shampoo's preconditioning with Adam's per-element adaptivity for improved stability.
You et al. introduced LAMB (Layer-wise Adaptive Moments optimizer for Batch training) in 2019 for scaling up batch sizes during training. LAMB combines Adam's adaptive per-parameter scaling with a layer-wise trust ratio inspired by LARS (Layer-wise Adaptive Rate Scaling). The trust ratio normalizes the update magnitude relative to the parameter magnitude for each layer, preventing any single layer from receiving disproportionately large updates. The headline result was reducing BERT pre-training time from 3 days to 76 minutes by scaling to batch size 32,868 on TPUv3 Pods without degrading performance. However, subsequent work by Nado et al. (2021) showed that standard Adam with careful tuning can match LAMB at large batch sizes.
Keller Jordan et al. introduced Muon (MomentUm Orthogonalized by Newton-Schulz) in late 2024. Muon treats neural network weight updates as matrices rather than collections of independent scalars. It runs standard SGD with Nesterov momentum and then replaces each 2D parameter's update with its nearest orthogonal matrix, computed efficiently using Newton-Schulz iteration. While Adam treats each parameter independently, Muon exploits the geometric structure of weight matrices. Scaling law experiments showed that Muon achieves comparable performance to AdamW while requiring roughly 52% of the training FLOPs, translating to nearly 2x cost savings for large training runs. Muon currently holds training speed records for both NanoGPT and CIFAR-10 speedrunning benchmarks.
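The orthogonalization step can be illustrated with a basic Newton-Schulz iteration that pushes all singular values of the update toward 1. This is a simplified cubic variant for illustration only; the actual Muon implementation uses a tuned higher-order polynomial iteration for faster convergence:

```python
import numpy as np

def orthogonalize(update, steps=5):
    """Approximate the nearest (semi-)orthogonal matrix to a 2D update via Newton-Schulz."""
    x = update / (np.linalg.norm(update) + 1e-12)  # normalize so singular values lie in (0, 1]
    for _ in range(steps):
        x = 1.5 * x - 0.5 * x @ x.T @ x            # pushes each singular value toward 1
    return x
```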
| Optimizer | Year | Per-parameter LR | State memory per parameter | Typical use cases |
|---|---|---|---|---|
| SGD | 1951 | No | None | Simple convex problems, baseline |
| SGD + Momentum | 1964/1986 | No | 1 buffer (velocity) | Computer vision, CNNs |
| NAG | 1983 | No | 1 buffer (velocity) | Convex optimization, some vision tasks |
| AdaGrad | 2011 | Yes | 1 buffer (sum of squared gradients) | Sparse features, NLP |
| RMSProp | 2012 | Yes | 1 buffer (EMA of squared gradients) | RNNs, reinforcement learning |
| Adadelta | 2012 | Yes (no global LR needed) | 2 buffers | General-purpose |
| Adam | 2014 | Yes | 2 buffers (first and second moments) | General-purpose default |
| NAdam | 2016 | Yes | 2 buffers | Language modeling, vision |
| AdamW | 2017 | Yes | 2 buffers | Transformers, LLMs |
| Adafactor | 2018 | Yes (factorized) | O(m+n) instead of O(m*n) | Large transformers (T5) |
| Shampoo | 2018 | Yes (matrix preconditioner) | 2 preconditioner matrices | Large-scale distributed training |
| LAMB | 2019 | Yes (layer-wise trust ratio) | 2 buffers + trust ratio | Large-batch distributed training |
| RAdam | 2020 | Yes (rectified) | 2 buffers | General-purpose (no warmup needed) |
| Lion | 2023 | Sign-based | 1 buffer (momentum) | Vision, language, diffusion models |
| Sophia | 2023 | Curvature-based | 2 buffers + Hessian estimate | LLM pre-training |
| Muon | 2024 | Orthogonalized | 1 buffer (momentum) | LLM pre-training, speed records |
Optimizer state is often the largest memory consumer during training, especially for billion-parameter models. The following table shows approximate optimizer state memory for a 1.5 billion parameter model stored in FP32:
| Optimizer | Buffers per parameter | FP32 state memory (1.5B params) | Notes |
|---|---|---|---|
| SGD (no momentum) | 0 | 0 GB | No extra state |
| SGD + Momentum | 1 | ~6 GB | One velocity buffer |
| Adam / AdamW | 2 | ~12 GB | First and second moments |
| Adafactor | ~0.01 (factorized) | ~0.1 GB (approx.) | Row and column statistics |
| Lion | 1 | ~6 GB | Single momentum buffer |
| Sophia | 2 + Hessian | ~12+ GB | Moments plus periodic Hessian |
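These figures follow from a simple calculation: parameter count times buffers per parameter times four bytes per FP32 value. A quick sanity check (the helper below is illustrative):

```python
def optimizer_state_gb(num_params, num_buffers, bytes_per_value=4):
    """Rough optimizer-state memory in GB (FP32 = 4 bytes per value)."""
    return num_params * num_buffers * bytes_per_value / 1e9

print(optimizer_state_gb(1.5e9, 2))  # Adam/AdamW: ~12 GB
print(optimizer_state_gb(1.5e9, 1))  # SGD with momentum or Lion: ~6 GB
```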
When combined with mixed-precision training, model parameters and gradients can be stored in FP16 or BF16, but optimizer states are typically kept in FP32 to preserve numerical precision for small gradient updates. Techniques like ZeRO (from DeepSpeed) shard optimizer states across multiple GPUs to reduce per-device memory.
The optimizer and the learning rate schedule work together. The schedule controls how the learning rate changes over the course of training, while the optimizer determines how the learning rate is applied to each parameter.
| Schedule | Description | Typical pairing |
|---|---|---|
| Constant | Learning rate stays fixed | SGD, debugging |
| Step decay | Multiply LR by a factor (e.g., 0.1) at fixed epochs | SGD + Momentum for vision |
| Exponential decay | LR decays by a fixed ratio each epoch | General-purpose |
| Cosine annealing | LR follows a cosine curve from max to min | AdamW for transformers |
| Linear warmup + cosine decay | LR ramps up linearly then decays via cosine | AdamW for LLM pre-training |
| Cyclic LR | LR oscillates between bounds | SGD for exploring loss landscape |
| One-cycle | LR increases then decreases over one cycle | SGD + Momentum (fast training) |
Learning rate warmup starts training with a very small learning rate and linearly increases it to the target value over a set number of steps. Warmup is particularly important for adaptive optimizers like Adam because the second-moment estimates have high variance in the early steps when they are computed from very few gradient samples. Warmup allows the optimizer to collect accurate gradient statistics before making large updates. An additional benefit is that warmup helps the model move away from sharp, poorly conditioned regions of the loss surface toward flatter regions that tolerate larger learning rates.
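The combined shape of linear warmup followed by cosine decay can be written as a small function of the step count (all constants here are illustrative; frameworks provide equivalent schedulers, as shown in the code examples later in this section):

```python
import math

def lr_at_step(step, peak_lr=1e-3, warmup_steps=1000, total_steps=50000, min_lr=0.0):
    """Linear warmup from 0 to peak_lr, then cosine decay from peak_lr down to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```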
Gradient clipping is a stability technique used alongside the optimizer to prevent exploding gradients. It is applied after backpropagation but before the optimizer step. There are two common forms: clipping by value, which clamps each gradient component to a fixed range such as [-c, c], and clipping by norm, which rescales the entire gradient vector whenever its global norm exceeds a threshold, preserving the update direction.
Gradient clipping is especially important for recurrent neural networks and transformers, where long sequences can cause gradient magnitudes to grow exponentially through many layers of computation.
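Both forms are available as utilities in PyTorch; a brief, self-contained sketch (the tiny model, dummy loss, and thresholds are purely illustrative):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                          # tiny illustrative model
loss = model(torch.randn(8, 10)).pow(2).mean()    # dummy forward pass and loss
loss.backward()

# Norm clipping: rescale all gradients together if their global L2 norm exceeds 1.0.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# Value clipping: clamp every gradient element to the range [-0.5, 0.5].
torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=0.5)
```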
Wilson et al. (2017) showed that adaptive optimizers like Adam sometimes find solutions that generalize worse than SGD with momentum on certain tasks, particularly image classification. One hypothesis is that SGD's noisy gradient estimates make it more likely to converge to flat minima in the loss landscape, which tend to generalize better to unseen data. Adaptive methods, by contrast, may converge to sharper minima because their per-parameter learning rates allow them to navigate narrow valleys that SGD would skip over.
More recent theoretical work (Zhou et al., 2020) formalized this observation, showing that SGD is more locally unstable at sharp minima and can escape them to reach flatter regions. AdamW partially addresses the generalization gap by properly decoupling weight decay from the adaptive update, which provides more consistent regularization.
In modern practice, the generalization gap has narrowed considerably. AdamW with appropriate weight decay and learning rate scheduling matches or exceeds SGD on most benchmarks, which is why it has become the dominant optimizer for transformer-based architectures.
The following guidelines reflect common practice as of 2025:
| Domain | Common optimizer choices | Notes |
|---|---|---|
| Image classification | SGD + Momentum, AdamW | SGD traditional for CNNs; AdamW increasingly popular with ViTs |
| Object detection | SGD + Momentum, AdamW | Often inherits the backbone's optimizer choice |
| Natural language processing | AdamW | Near-universal default for transformers |
| Large language models | AdamW, Lion, Muon, Sophia | AdamW standard; alternatives offer efficiency gains |
| Diffusion models | AdamW, Lion | Lion showed 2.3x training efficiency gains |
| Reinforcement learning | Adam, RMSProp | Adam common in policy gradient methods; RMSProp in value-based methods |
| GANs | Adam | Adaptive rates help stabilize adversarial training |
| Speech recognition | Adam, AdamW | Adaptive methods work well for sequence-to-sequence models |
| Recommendation systems | AdaGrad, Adam | AdaGrad's sparse feature handling is beneficial |
import torch
import torch.optim as optim
# The model and dataloader are assumed to be defined elsewhere.
# AdamW with decoupled weight decay
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
# SGD with Nesterov momentum
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9, nesterov=True)
# Learning rate scheduler: linear warmup + cosine decay
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=1000)
cosine = CosineAnnealingLR(optimizer, T_max=50000)
scheduler = SequentialLR(optimizer, schedulers=[warmup, cosine], milestones=[1000])
# Training loop with gradient clipping (assumes the model's forward pass returns the loss)
for batch in dataloader:
    optimizer.zero_grad()
    loss = model(batch)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()
import tensorflow as tf
# AdamW
optimizer = tf.keras.optimizers.AdamW(learning_rate=1e-3, weight_decay=0.01)
# SGD with Nesterov momentum
optimizer = tf.keras.optimizers.SGD(learning_rate=0.1, momentum=0.9, nesterov=True)
# Cosine decay schedule with linear warmup (ramps from 0 up to the warmup target)
lr_schedule = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=0.0,
    decay_steps=50000,
    warmup_target=1e-3,
    warmup_steps=1000
)
optimizer = tf.keras.optimizers.AdamW(learning_rate=lr_schedule, weight_decay=0.01)
import optax
# AdamW with linear warmup and cosine decay
schedule = optax.warmup_cosine_decay_schedule(
init_value=0.0, peak_value=1e-3,
warmup_steps=1000, decay_steps=50000
)
optimizer = optax.adamw(learning_rate=schedule, weight_decay=0.01)
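# Applying the optimizer: optax optimizers are pure functions over an explicit optimizer state.
# The params and grads below are illustrative placeholders; in practice, grads come from jax.grad.
import jax.numpy as jnp
params = {"w": jnp.zeros((10, 10))}
opt_state = optimizer.init(params)
grads = {"w": jnp.ones((10, 10))}
updates, opt_state = optimizer.update(grads, opt_state, params)
params = optax.apply_updates(params, updates)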
| Year | Event |
|---|---|
| 1847 | Augustin-Louis Cauchy describes the gradient descent method |
| 1951 | Robbins and Monro publish the stochastic approximation framework |
| 1964 | Polyak introduces the heavy ball method (momentum) |
| 1983 | Nesterov proposes accelerated gradient for convex optimization |
| 1986 | Rumelhart, Hinton, and Williams apply momentum to neural network training via backpropagation |
| 2011 | Duchi, Hazan, and Singer publish AdaGrad |
| 2012 | Zeiler introduces Adadelta; Hinton proposes RMSProp |
| 2014 | Kingma and Ba introduce Adam |
| 2016 | Dozat proposes NAdam |
| 2017 | Loshchilov and Hutter propose AdamW (decoupled weight decay) |
| 2018 | Shazeer and Stern introduce Adafactor; Gupta et al. introduce Shampoo |
| 2019 | You et al. introduce LAMB for large-batch training |
| 2020 | Liu et al. introduce RAdam (rectified Adam) |
| 2023 | Chen et al. discover Lion via program search; Liu et al. introduce Sophia |
| 2024 | Jordan et al. introduce Muon with Newton-Schulz orthogonalization |
Optimizer convergence guarantees depend on the properties of the objective function. For smooth convex functions, gradient descent with an appropriate step size converges at a rate of O(1/t), and Nesterov acceleration improves this to O(1/t^2). For smooth, strongly convex functions, gradient descent converges linearly, shrinking the error by a constant factor each iteration. For non-convex smooth functions, which include neural network losses, gradient methods are only guaranteed to approach stationary points (where the gradient is near zero) rather than global minima, with SGD typically achieving an O(1/sqrt(t)) rate on the squared gradient norm.
In practice, convergence theory provides useful intuition but does not fully explain optimizer behavior on deep neural network training, where the loss surface is highly non-convex and high-dimensional.