Optimizer
Last reviewed: Jun 2, 2026
Sources: 24 citations
Review status: Source-backed
Revision: v6 · 6,146 words
An optimizer in machine learning is an algorithm that iteratively adjusts a model's learnable parameters to minimize (or maximize) an objective function, commonly called a loss function. Optimizers sit at the core of every training pipeline: after backpropagation computes how much each parameter contributed to the prediction error, the optimizer decides how to update those parameters so the model improves over time.
The choice of optimizer affects training speed, final model quality, memory consumption, and hyperparameter sensitivity. Decades of research have produced a broad family of algorithms, from simple gradient descent to adaptive methods like Adam and recent discoveries such as Lion and Muon. By 2025, alternatives to the long-dominant AdamW had begun to see production use at frontier scale: Muon, for example, was used to train Moonshot AI's trillion-parameter Kimi K2 model.[16]
Imagine you are blindfolded in a hilly field and you want to find the lowest valley. You can feel the slope under your feet, and that tells you which direction goes downhill. An optimizer is your strategy for walking downhill.
The optimizer occupies a specific position in the standard supervised learning training loop:
Steps 1 through 4 repeat for each mini-batch of training data. One full pass through the training set is called an epoch.
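A minimal sketch of such a loop follows, assuming the four steps are the usual forward pass, loss computation, backward pass, and optimizer update. The one-parameter model, the toy data, and the hand-derived gradient are illustrative assumptions, not part of any particular framework.

```python
# Toy data generated from y = 3x, so the loss is minimized at w = 3.
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0), (4.0, 12.0)]
w = 0.0          # the single learnable parameter (assumed model: y_hat = w * x)
lr = 0.01        # learning rate
batch_size = 2

for epoch in range(200):                      # one epoch = one full pass over the data
    for i in range(0, len(data), batch_size):
        batch = data[i:i + batch_size]
        # 1. Forward pass: compute predictions.
        preds = [w * x for x, _ in batch]
        # 2. Loss: mean squared error over the mini-batch.
        loss = sum((p - y) ** 2 for p, (_, y) in zip(preds, batch)) / len(batch)
        # 3. Backward pass: d(loss)/dw, derived by hand for this model.
        grad = sum(2 * (p - y) * x for p, (x, y) in zip(preds, batch)) / len(batch)
        # 4. Optimizer step: plain gradient descent.
        w -= lr * grad
```

In a real framework, step 3 is handled by automatic differentiation and step 4 by an optimizer object.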
Given a parameter vector theta, a loss function L(theta), and a learning rate eta, the simplest optimizer performs the update:
theta_{t+1} = theta_t - eta * grad L(theta_t)
This is vanilla gradient descent. All other optimizers modify this rule in one or more of the following ways:
Batch gradient descent computes the gradient of the loss over the entire training dataset before making a single update:
theta = theta - eta * (1/N) * sum_{i=1}^{N} grad L_i(theta)
This produces a stable, low-variance gradient estimate, but it is impractical for large datasets because the full dataset must fit in memory and every update requires a complete pass through the data.
Stochastic gradient descent computes the gradient from a single example or a small mini-batch instead of the full dataset. This idea traces back to the Robbins-Monro stochastic approximation method published in 1951.[1] SGD introduces noise into the gradient estimates, which can actually help the optimizer escape shallow local minima and saddle points. Mini-batch SGD (using, say, 32 to 512 examples per gradient estimate) balances the variance reduction of larger batches with the computational savings of smaller ones.
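The difference between the two variants can be sketched on a toy least-squares problem. The data, the model `y_hat = theta * x`, and the function names below are assumptions made for illustration only.

```python
import random

random.seed(0)
# Toy regression data generated from y = 2x (noise-free for simplicity).
data = [(i / 10, 2.0 * i / 10) for i in range(1, 101)]

def grad(theta, examples):
    """Gradient of mean squared error for the model y_hat = theta * x."""
    return sum(2 * (theta * x - y) * x for x, y in examples) / len(examples)

def batch_gd(theta, eta, steps):
    for _ in range(steps):
        theta -= eta * grad(theta, data)           # all N examples per update
    return theta

def minibatch_sgd(theta, eta, steps, batch_size=32):
    for _ in range(steps):
        batch = random.sample(data, batch_size)    # noisy estimate from a subset
        theta -= eta * grad(theta, batch)
    return theta

theta_full = batch_gd(0.0, eta=0.01, steps=200)
theta_mini = minibatch_sgd(0.0, eta=0.01, steps=200)
```

Both runs approach `theta = 2`, but each mini-batch update touches only 32 of the 100 examples.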
Polyak introduced the momentum method in 1964, and Rumelhart, Hinton, and Williams popularized it for neural networks in 1986. Instead of using only the current gradient, the optimizer maintains a velocity vector that accumulates past gradients:
v_t = gamma * v_{t-1} + eta * grad L(theta_t)
theta_{t+1} = theta_t - v_t
The momentum coefficient gamma (typically 0.9) controls how much history is retained. Momentum accelerates convergence along consistent gradient directions and dampens oscillations in directions where the gradient frequently changes sign. It remains the standard choice for many computer vision tasks, including training ResNets and other convolutional neural networks.
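The two update equations translate almost line for line into code. This is a scalar sketch; `grad_fn` is a hypothetical callable returning the gradient at a point.

```python
def sgd_momentum(grad_fn, theta, eta=0.1, gamma=0.9, steps=300):
    """Scalar SGD with momentum, following the velocity form given above."""
    v = 0.0                                   # velocity starts at zero
    for _ in range(steps):
        v = gamma * v + eta * grad_fn(theta)  # accumulate past gradients
        theta = theta - v                     # step along the velocity
    return theta

# Example: minimize f(theta) = (theta - 4)^2, whose gradient is 2 * (theta - 4).
result = sgd_momentum(lambda th: 2 * (th - 4), theta=0.0)
```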
Proposed by Yurii Nesterov in 1983 for convex optimization, NAG modifies momentum by computing the gradient at a "look-ahead" position rather than the current position:[2]
v_t = gamma * v_{t-1} + eta * grad L(theta_t - gamma * v_{t-1})
theta_{t+1} = theta_t - v_t
By evaluating the gradient at the projected future position, NAG produces more responsive updates and achieves faster convergence rates on convex problems. Nesterov proved an optimal O(1/t^2) convergence rate for smooth convex functions, compared to the O(1/t) rate of standard gradient descent.[2] Sebastian Ruder's widely cited 2016 survey gives a unified overview of these gradient descent variants and the adaptive methods that followed.[6]
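Compared with plain momentum, the only change in code is where the gradient is evaluated. The scalar sketch below assumes a hypothetical `grad_fn` callable.

```python
def nag(grad_fn, theta, eta=0.1, gamma=0.9, steps=300):
    """Scalar Nesterov accelerated gradient in the look-ahead form given above."""
    v = 0.0
    for _ in range(steps):
        lookahead = theta - gamma * v                 # projected future position
        v = gamma * v + eta * grad_fn(lookahead)      # gradient at the look-ahead point
        theta = theta - v
    return theta

# Example: minimize f(theta) = (theta - 4)^2.
result = nag(lambda th: 2 * (th - 4), theta=0.0)
```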
Adaptive methods maintain per-parameter learning rates that adjust automatically based on the history of gradients. This eliminates or reduces the need to manually tune the global learning rate.
Introduced by Duchi, Hazan, and Singer in 2011, AdaGrad accumulates the sum of squared gradients for each parameter and uses this sum to scale the learning rate:[3]
G_t = G_{t-1} + (grad L(theta_t))^2
theta_{t+1} = theta_t - (eta / sqrt(G_t + epsilon)) * grad L(theta_t)
Parameters that receive large gradients get smaller effective learning rates, while parameters with small or infrequent gradients retain larger learning rates. This makes AdaGrad well suited for problems with sparse features, such as natural language processing tasks where rare words have infrequent but informative gradients. The main drawback is that the accumulated squared gradient sum grows monotonically, causing the effective learning rate to shrink to near zero over long training runs.
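A short sketch of the per-parameter scaling, using a two-parameter quadratic whose gradients differ by a factor of 100. The objective, the value of `eta`, and `grad_fn` are illustrative assumptions.

```python
import math

def adagrad(grad_fn, theta, eta=0.5, eps=1e-8, steps=500):
    """AdaGrad on a list of parameters, following the update rule above."""
    theta = list(theta)
    G = [0.0] * len(theta)                    # running sum of squared gradients
    for _ in range(steps):
        g = grad_fn(theta)
        for i in range(len(theta)):
            G[i] += g[i] ** 2
            theta[i] -= eta / math.sqrt(G[i] + eps) * g[i]
    return theta

# f(theta) = 0.5 * (100 * theta_0^2 + theta_1^2): gradients differ by 100x,
# yet both coordinates take nearly identical steps after rescaling.
result = adagrad(lambda th: [100 * th[0], th[1]], [1.0, 1.0])
```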
Geoffrey Hinton proposed RMSProp in his 2012 Coursera lecture on neural networks. RMSProp fixes AdaGrad's diminishing learning rate problem by replacing the cumulative sum with an exponentially decaying average of squared gradients:
E[g^2]_t = rho * E[g^2]_{t-1} + (1 - rho) * (grad L(theta_t))^2
theta_{t+1} = theta_t - (eta / sqrt(E[g^2]_t + epsilon)) * grad L(theta_t)
The decay rate rho (typically 0.9) ensures that only recent gradient magnitudes influence the per-parameter learning rate. RMSProp was never formally published in a peer-reviewed paper, but it became one of the most widely used optimizers in practice, particularly for recurrent neural networks and reinforcement learning.
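The same toy problem can illustrate RMSProp. This is a sketch with assumed values for `eta` and the objective; with a constant learning rate the iterates settle into a small neighborhood of the minimum rather than converging exactly.

```python
import math

def rmsprop(grad_fn, theta, eta=0.01, rho=0.9, eps=1e-8, steps=2000):
    """RMSProp on a list of parameters, following the update rule above."""
    theta = list(theta)
    avg_sq = [0.0] * len(theta)               # decaying average of squared gradients
    for _ in range(steps):
        g = grad_fn(theta)
        for i in range(len(theta)):
            avg_sq[i] = rho * avg_sq[i] + (1 - rho) * g[i] ** 2
            theta[i] -= eta / math.sqrt(avg_sq[i] + eps) * g[i]
    return theta

result = rmsprop(lambda th: [100 * th[0], th[1]], [1.0, 1.0])
```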
Matthew Zeiler introduced Adadelta in 2012 as an extension of AdaGrad.[4] Like RMSProp, it uses an exponentially decaying average of squared gradients. It also maintains a running average of squared parameter updates, which replaces the global learning rate entirely:
E[g^2]_t = rho * E[g^2]_{t-1} + (1 - rho) * (grad L(theta_t))^2
delta_theta_t = -(sqrt(E[delta_theta^2]_{t-1} + epsilon) / sqrt(E[g^2]_t + epsilon)) * grad L(theta_t)
E[delta_theta^2]_t = rho * E[delta_theta^2]_{t-1} + (1 - rho) * (delta_theta_t)^2
By computing the ratio of update RMS to gradient RMS, Adadelta achieves correct units for the parameter update without requiring a manually specified learning rate.
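A scalar sketch of the three Adadelta equations follows. The values `rho=0.95` and `eps=1e-6` and the test objective are assumptions chosen for illustration; note that no learning rate appears anywhere.

```python
import math

def adadelta(grad_fn, theta, rho=0.95, eps=1e-6, steps=2000):
    """Scalar Adadelta: the step size comes from the ratio of two running RMS values."""
    avg_sq_grad = 0.0     # E[g^2]
    avg_sq_delta = 0.0    # E[delta_theta^2]
    for _ in range(steps):
        g = grad_fn(theta)
        avg_sq_grad = rho * avg_sq_grad + (1 - rho) * g ** 2
        delta = -math.sqrt(avg_sq_delta + eps) / math.sqrt(avg_sq_grad + eps) * g
        avg_sq_delta = rho * avg_sq_delta + (1 - rho) * delta ** 2
        theta += delta
    return theta

# Example: f(theta) = 0.5 * theta^2, gradient = theta.
result = adadelta(lambda th: th, theta=1.0)
```

Early steps are tiny because `E[delta_theta^2]` starts at zero and only `eps` drives the numerator.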
Kingma and Ba introduced Adam (Adaptive Moment Estimation) in a 2014 paper published at ICLR 2015.[5] Adam combines the first moment estimate (momentum) with the second moment estimate (RMSProp-style adaptive learning rate), plus bias correction to account for the zero initialization of the moment estimates:
m_t = beta_1 * m_{t-1} + (1 - beta_1) * grad L(theta_t) (first moment)
v_t = beta_2 * v_{t-1} + (1 - beta_2) * (grad L(theta_t))^2 (second moment)
m_hat_t = m_t / (1 - beta_1^t) (bias correction)
v_hat_t = v_t / (1 - beta_2^t) (bias correction)
theta_{t+1} = theta_t - eta * m_hat_t / (sqrt(v_hat_t) + epsilon)
The default hyperparameters (beta_1 = 0.9, beta_2 = 0.999, epsilon = 1e-8) work well across a wide range of problems, which is a major reason for Adam's popularity. Adam is computationally efficient, has modest memory requirements (two buffers per parameter), and is relatively insensitive to the choice of learning rate compared to SGD.
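The five equations map onto a single step function. This is a scalar sketch; the function name and its calling convention are assumptions, not any library's API.

```python
import math

def adam_step(theta, grad, m, v, t, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One scalar Adam update; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad            # first moment
    v = beta2 * v + (1 - beta2) * grad ** 2       # second moment
    m_hat = m / (1 - beta1 ** t)                  # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - eta * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# Thanks to bias correction, the first step has magnitude close to eta
# regardless of the gradient's scale.
first_theta, _, _ = adam_step(0.0, 123.0, 0.0, 0.0, t=1, eta=0.1)
```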
Loshchilov and Hutter published "Decoupled Weight Decay Regularization" in 2017 (ICLR 2019), showing that the standard way of implementing L2 regularization in Adam is not equivalent to true weight decay.[8] In Adam with L2 regularization, the regularization gradient gets scaled by the adaptive learning rate, which weakens the regularization effect for parameters with large gradient histories. AdamW fixes this by decoupling weight decay from the gradient-based update:
m_t, v_t = (same as Adam)
theta_{t+1} = theta_t - eta * (m_hat_t / (sqrt(v_hat_t) + epsilon) + lambda * theta_t)
Here, lambda is the weight decay coefficient applied directly to the parameters, independent of the adaptive scaling. AdamW has become the default optimizer for training transformers, large language models, and many other modern architectures.
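The difference is easiest to see when the loss gradient is zero and only the regularizer acts. The sketch below is a scalar illustration; the function name and the `decoupled` flag are hypothetical.

```python
import math

def adam_like_step(theta, grad, m, v, t, eta=0.1, beta1=0.9, beta2=0.999,
                   eps=1e-8, weight_decay=0.1, decoupled=True):
    """One scalar step of AdamW (decoupled=True) or Adam with L2 (decoupled=False)."""
    if not decoupled:
        grad = grad + weight_decay * theta        # L2 term folded into the gradient
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    update = m_hat / (math.sqrt(v_hat) + eps)
    if decoupled:
        update += weight_decay * theta            # decay applied outside the adaptive scaling
    return theta - eta * update, m, v

# Zero loss gradient: AdamW shrinks the weight by exactly eta * lambda * theta,
# while Adam + L2 rescales the regularization gradient adaptively.
adamw_theta, _, _ = adam_like_step(1.0, 0.0, 0.0, 0.0, t=1, decoupled=True)
adam_l2_theta, _, _ = adam_like_step(1.0, 0.0, 0.0, 0.0, t=1, decoupled=False)
```

Here AdamW moves the weight from 1.0 to 0.99, while Adam with L2 moves it to roughly 0.9 because the adaptive normalization turns the small regularization gradient into a full-size step.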
Timothy Dozat proposed NAdam (Nesterov-accelerated Adaptive Moment Estimation) in 2016 by incorporating Nesterov momentum into Adam.[7] Instead of using the current first-moment estimate for the update, NAdam uses the look-ahead first moment, similar to how NAG looks ahead relative to standard momentum. NAdam generally converges faster than Adam on tasks where Nesterov momentum provides a benefit, including language modeling and certain computer vision workloads.
Liu et al. introduced RAdam (Rectified Adam) at ICLR 2020.[9] They identified that Adam's adaptive learning rate has problematically high variance in the early steps of training because the second moment estimate is computed from very few samples. This variance is the underlying reason why learning rate warmup helps Adam. RAdam estimates the variance of the adaptive learning rate and applies a rectification term that automatically suppresses the variance when it is too high. The result is an optimizer that adapts between SGD-like behavior early in training and full Adam behavior later, removing the need to manually tune a warmup schedule.
As models have grown to billions of parameters, optimizer memory consumption has become a practical bottleneck. Adam and AdamW store two state buffers (first and second moment) per parameter, so the optimizer state alone is twice the size of the model parameters, on top of the memory already needed for the parameters and gradients.
Shazeer and Stern introduced Adafactor in 2018 (ICML 2018) to reduce the memory cost of adaptive optimizers.[10] For a weight matrix of size m x n, Adam stores an m x n second-moment buffer. Adafactor factorizes this into per-row and per-column statistics (maintaining only the per-row and per-column sums of the moving average of squared gradients, then reconstructing per-parameter estimates from these sums), reducing memory from O(m * n) to O(m + n).[10] Adafactor also replaces bias correction with a slowly increasing second-moment decay rate and adds update clipping to keep step sizes stable when momentum is dropped. The combination achieves comparable results to Adam on Transformer training while using significantly less memory. Google used Adafactor for training the T5 model family.
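The factorization step alone can be sketched as follows. It omits the moving average, the decay schedule, and update clipping, and the function names are assumptions. The reconstruction used is the outer product of row and column sums divided by the grand total, which is exact when the matrix of squared gradients has rank 1.

```python
def factor(sq_grads):
    """Reduce an m x n matrix of squared gradients to m row sums and n column sums."""
    row_sums = [sum(row) for row in sq_grads]
    col_sums = [sum(col) for col in zip(*sq_grads)]
    return row_sums, col_sums

def reconstruct(row_sums, col_sums):
    """Approximate entry (i, j) as row_sums[i] * col_sums[j] / total."""
    total = sum(row_sums)
    return [[r * c / total for c in col_sums] for r in row_sums]

V = [[3.0, 4.0, 5.0],
     [6.0, 8.0, 10.0]]              # rank-1 example: outer product of [1, 2] and [3, 4, 5]
rows, cols = factor(V)              # 2 + 3 numbers stored instead of 2 * 3
V_hat = reconstruct(rows, cols)
```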
Chen et al. at Google Brain discovered Lion (EvoLved Sign Momentum) through automated program search, described in their 2023 NeurIPS paper "Symbolic Discovery of Optimization Algorithms."[12] Rather than being designed by hand, Lion was found by searching over a space of possible optimizer programs using evolutionary methods. The resulting algorithm is remarkably simple: it uses only the sign of a momentum-based interpolation to determine the update direction:
update = sign(beta_1 * m_{t-1} + (1 - beta_1) * grad L(theta_t))
theta_{t+1} = theta_t - eta * update
m_t = beta_2 * m_{t-1} + (1 - beta_2) * grad L(theta_t)
Because Lion stores only one momentum buffer (compared to Adam's two) and uses a uniform update magnitude, it roughly halves the optimizer memory overhead. The two coefficients play distinct roles: the update uses an interpolation weighted by beta_1, while the momentum buffer that is carried forward is tracked with beta_2, which lets the running momentum retain a longer history than the value used for the current step. Because the update magnitude is decoupled from the gradient scale, Lion requires a learning rate 3 to 10 times smaller than Adam's, with weight decay correspondingly larger.[12] On image classification with ViT, Lion improved ImageNet accuracy by up to 2%. On diffusion models, it reduced training compute by up to 2.3 times. Lion has also been deployed in production at Google for search ads click-through-rate models.
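A scalar sketch of the three Lion equations is shown below. The default values for `eta`, `beta1`, and `beta2` are assumptions chosen for illustration, and weight decay is omitted as in the equations above.

```python
def sign(x):
    """Return -1, 0, or 1."""
    return (x > 0) - (x < 0)

def lion_step(theta, grad, m, eta=1e-4, beta1=0.9, beta2=0.99):
    """One scalar Lion update: sign of an interpolation, then momentum tracking."""
    update = sign(beta1 * m + (1 - beta1) * grad)   # direction only, magnitude is always 1
    theta = theta - eta * update
    m = beta2 * m + (1 - beta2) * grad              # the only state carried forward
    return theta, m

# A huge gradient and a tiny one produce steps of the same size.
theta_big, m_big = lion_step(1.0, 250.0, 0.0)
theta_small, _ = lion_step(1.0, -0.003, 0.0)
```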
First-order optimizers use only gradient information. Second-order optimizers also incorporate curvature information (the Hessian matrix or approximations of it) to take more informed steps. While full Newton's method computes the inverse Hessian (which is prohibitively expensive for neural networks with millions of parameters), several practical approximations exist. These methods are sometimes called natural gradient or preconditioned methods because they reshape the gradient by a curvature estimate before stepping.
James Martens and Roger Grosse introduced K-FAC (Kronecker-Factored Approximate Curvature) at ICML 2015.[17] K-FAC builds an efficiently invertible approximation of a neural network's Fisher information matrix, which is used in place of the Hessian to perform approximate natural gradient descent. The Fisher is treated as block-diagonal across layers, and each layer's block is approximated as the Kronecker product of two much smaller matrices, one built from the layer's input activations and one from the backpropagated gradients. Inverting a Kronecker product reduces to inverting its two small factors, which is far cheaper than inverting the full block and, unlike a purely diagonal approximation, still captures correlations between parameters within a layer. Martens and Grosse reported that while K-FAC updates cost only several times more than a plain stochastic gradient step, each update makes much more optimization progress, so the method can be substantially faster than SGD with momentum in practice. Notably, the cost of storing and inverting the approximation does not grow with the amount of data used to estimate it, which lets K-FAC work well in highly stochastic regimes. K-FAC has since been extended to convolutional networks, recurrent networks, and distributed large-batch training, and it influenced later structure-aware preconditioners such as Shampoo.
Liu et al. introduced Sophia (Second-order Clipped Stochastic Optimization) in 2023 (ICLR 2024) for language model pre-training.[13] Sophia uses a lightweight diagonal Hessian estimate as a pre-conditioner. The update is the moving average of the gradients divided by the moving average of the estimated diagonal Hessian, followed by element-wise clipping. The clipping controls worst-case update sizes, which tames instability from non-convexity and rapid Hessian changes. Sophia estimates the diagonal Hessian only every few iterations (for example every ten steps), keeping the average per-step overhead negligible. On GPT models from 125M to 1.5B parameters, Sophia achieved a 2x speedup over Adam in steps, total compute, and wall-clock time to reach the same perplexity.[13]
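The clipped, preconditioned update can be sketched roughly as follows. This is not the authors' implementation: `hess_diag_fn` is a hypothetical callable standing in for the stochastic diagonal Hessian estimator, and `gamma`, `eps`, and all default values are illustrative assumptions.

```python
def sophia_sketch(grad_fn, hess_diag_fn, theta, eta=0.05, beta1=0.9, beta2=0.99,
                  gamma=1.0, eps=1e-12, k=10, steps=1000):
    """Moving-average gradient divided by moving-average diagonal Hessian, then clipped."""
    theta = list(theta)
    m = [0.0] * len(theta)    # EMA of gradients
    h = [0.0] * len(theta)    # EMA of diagonal Hessian estimates
    for t in range(steps):
        g = grad_fn(theta)
        m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
        if t % k == 0:        # refresh the curvature estimate only every k steps
            h_hat = hess_diag_fn(theta)
            h = [beta2 * hi + (1 - beta2) * hh for hi, hh in zip(h, h_hat)]
        for i in range(len(theta)):
            ratio = m[i] / max(gamma * h[i], eps)
            theta[i] -= eta * max(-1.0, min(1.0, ratio))   # element-wise clipping
    return theta

# Quadratic with diagonal Hessian [100, 1]; the exact diagonal is used as the estimate.
result = sophia_sketch(lambda th: [100 * th[0], th[1]],
                       lambda th: [100.0, 1.0], [1.0, 1.0])
```

While the ratio is clipped, every coordinate moves by exactly `eta`, which is what bounds the worst-case update size.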
Gupta et al. introduced Shampoo in 2018 as a structure-aware preconditioning algorithm for stochastic optimization over tensor spaces.[14] For a weight matrix W of size m x n, instead of maintaining a single mn by mn preconditioner (which would be infeasible), Shampoo maintains separate preconditioners of size m x m and n x n, one for each tensor dimension. Anil et al. (2021) developed a distributed implementation of Shampoo that demonstrated strong performance on large-scale training tasks. More recently, Vyas et al. (2024) introduced SOAP, which combines Shampoo's preconditioning with Adam's per-element adaptivity for improved stability. A distributed Shampoo implementation won the external-tuning track of the inaugural AlgoPerf training-algorithms benchmark in 2024, providing evidence that preconditioned methods can beat well-tuned AdamW and NAdam baselines on wall-clock time-to-result across a suite of workloads.[20]
You et al. introduced LAMB (Layer-wise Adaptive Moments optimizer for Batch training) in 2019 for scaling up batch sizes during training.[11] LAMB combines Adam's adaptive per-parameter scaling with a layer-wise trust ratio inspired by LARS (Layer-wise Adaptive Rate Scaling). The trust ratio normalizes the update magnitude relative to the parameter magnitude for each layer, preventing any single layer from receiving disproportionately large updates. The headline result was reducing BERT pre-training time from 3 days to 76 minutes by scaling the second training phase to a batch size of 32,768 on TPUv3 Pods without degrading performance.[11] However, subsequent work by Nado et al. (2021) showed that standard optimizers such as Adam, with careful tuning, can match LAMB at large batch sizes.
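The trust ratio itself is a small computation. The sketch below applies it to one layer, taking the Adam-style update direction as given; the function name and the fallback to a ratio of 1.0 when either norm is zero are assumptions.

```python
import math

def lamb_layer_update(weights, adam_update, eta=1e-3):
    """Scale one layer's update so its norm is eta times the layer's weight norm."""
    w_norm = math.sqrt(sum(w * w for w in weights))
    u_norm = math.sqrt(sum(u * u for u in adam_update))
    trust = w_norm / u_norm if w_norm > 0 and u_norm > 0 else 1.0
    new_weights = [w - eta * trust * u for w, u in zip(weights, adam_update)]
    return new_weights, trust

# Weight norm 5, update norm 10: the trust ratio halves the step.
new_w, trust = lamb_layer_update([3.0, 4.0], [6.0, 8.0])
```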
Keller Jordan introduced Muon (MomentUm Orthogonalized by Newton-Schulz) in December 2024.[15] Muon treats neural network weight updates as matrices rather than collections of independent scalars. It runs standard SGD with Nesterov momentum and then replaces each 2D parameter's update with its nearest semi-orthogonal matrix, the solution to the problem of finding the orthogonal matrix closest to the momentum matrix in Frobenius norm. While Adam treats each parameter independently, Muon exploits the geometric structure of weight matrices.
The orthogonalization is computed with a Newton-Schulz iteration rather than an exact (and expensive) singular value decomposition. Jordan's implementation uses a quintic polynomial iteration with non-convergent coefficients (3.4445, -4.7750, 2.0315) run for five steps, deliberately tuned to maximize the slope near zero so that very few iterations are needed and the iteration remains stable in bfloat16.[15] Because the procedure only makes sense for 2D weights, Muon is applied to hidden-layer weight matrices, while scalar and vector parameters together with the input embeddings and output (classifier) head are left to a standard method such as AdamW. For typical transformer training the extra Newton-Schulz work adds well under 1% to the total FLOPs.[15]
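A pure-Python sketch of the quintic iteration with the coefficients quoted above follows. It is written for readability rather than speed; the helper names, the Frobenius-norm pre-scaling with a small `eps`, and the transpose for tall matrices are assumptions about a reasonable implementation, not a copy of Jordan's code.

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def newton_schulz(G, steps=5, eps=1e-7):
    """Approximately orthogonalize G via X <- a*X + (b*A + c*A^2) X with A = X X^T."""
    a, b, c = 3.4445, -4.7750, 2.0315
    tall = len(G) > len(G[0])
    X = transpose(G) if tall else [row[:] for row in G]
    norm = math.sqrt(sum(x * x for row in X for x in row)) + eps
    X = [[x / norm for x in row] for row in X]       # singular values now at most 1
    for _ in range(steps):
        A = matmul(X, transpose(X))
        AA = matmul(A, A)
        B = [[b * A[i][j] + c * AA[i][j] for j in range(len(A))] for i in range(len(A))]
        BX = matmul(B, X)
        X = [[a * X[i][j] + BX[i][j] for j in range(len(X[0]))] for i in range(len(X))]
    return transpose(X) if tall else X

# Singular values 3 and 1 are both pushed into a band around 1.
O = newton_schulz([[3.0, 0.0], [0.0, 1.0]])
```

Because the coefficients are non-convergent, the singular values land near 1 rather than exactly on it.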
Muon first attracted attention through speedrunning: it improved the NanoGPT speed record by about 1.35x (roughly a 26% reduction in training time versus AdamW) and the CIFAR-10 record from 3.3 to 2.6 A100-seconds.[15] The modded-nanogpt benchmark, which races to reach a FineWeb validation loss of 3.28 (a target Andrej Karpathy's llm.c GPT-2 reproduction reached in about 45 minutes on 8 H100s in May 2024), was driven down to roughly 3 minutes on the same hardware over the following months as the community layered Muon with architecture and data-loading improvements.[15][18] At larger scale, Jordan reported training a 1.5B parameter transformer to GPT-2 XL quality in 10 hours on an 8xH100 node, versus about 13.3 hours for an AdamW baseline.[15]
A Moonshot AI team led by Jingyuan Liu and Jianlin Su published "Muon is Scalable for LLM Training" in February 2025, showing that two changes let Muon work out of the box at scale without per-parameter tuning: adding decoupled weight decay, and rescaling each parameter's update so that the per-parameter update root-mean-square is consistent across the model (Muon's raw orthogonalized updates otherwise have a different scale than Adam's, which complicates transferring hyperparameters).[16] Their scaling-law experiments found that Muon reaches the same loss as a compute-optimal AdamW run with roughly half the training FLOPs, about 2x compute efficiency. They validated this by training Moonlight, a Mixture-of-Experts model with 3B activated and 16B total parameters, on 5.7 trillion tokens, and open-sourced a distributed Muon implementation designed to be memory and communication efficient.[16]
Muon then moved to frontier scale. Moonshot AI's Kimi K2, a Mixture-of-Experts model with 1 trillion total parameters and 32 billion activated, was pre-trained on 15.5 trillion tokens using an optimizer the team calls MuonClip: Muon plus a technique named QK-Clip that rescales the query and key projection weights whenever attention logits grow too large.[19] Exploding attention logits are a common source of loss spikes when training large Muon models, and Moonshot reported that QK-Clip let them pre-train Kimi K2 with zero loss spikes and no training instability.[19] This made Kimi K2 one of the first publicly documented trillion-parameter models trained without Adam-family optimizers as the primary update rule.
Muon holds training speed records on the public NanoGPT and CIFAR-10 speedrunning leaderboards, and a growing line of follow-up work (for example adaptive and distributed variants) has continued to refine it.
Schedule-Free methods, introduced by Defazio et al. in "The Road Less Scheduled" (NeurIPS 2024), remove the learning rate schedule entirely.[21] Instead of decaying the learning rate over a pre-set horizon, the method maintains a running average of the iterates and evaluates gradients at an interpolation between the average and the most recent point, which unifies the roles of momentum, iterate averaging, and scheduling. Because there is no schedule, training does not need to commit in advance to a total number of steps, which is useful when the stopping time is unknown. Schedule-Free wrappers exist for SGD and AdamW, and Schedule-Free AdamW was the algorithm behind the winning entry in the self-tuning track of the 2024 AlgoPerf competition, where it trained roughly 8% faster than the baseline while introducing no extra hyperparameters.[20][21]
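The interpolation and averaging can be sketched for the SGD variant as follows. This is a scalar illustration that assumes a uniform running average (weight `1/t` at step `t`) and illustrative values for `gamma` and `beta`; it omits the warmup and weighting details of a practical implementation.

```python
def schedule_free_sgd(grad_fn, x0, gamma=0.1, beta=0.9, steps=1000):
    """Scalar Schedule-Free SGD sketch with a constant step size gamma."""
    z = x0          # base SGD iterate
    x = x0          # running average of the z iterates (the point that is evaluated)
    for t in range(1, steps + 1):
        y = (1 - beta) * z + beta * x       # gradient is taken at an interpolation
        z = z - gamma * grad_fn(y)          # plain SGD step, no learning rate decay
        c = 1.0 / t
        x = (1 - c) * x + c * z             # update the running average
    return x

# Example: f(x) = 0.5 * x^2, gradient = x.
result = schedule_free_sgd(lambda y: y, x0=5.0)
```

Nothing in the loop depends on the total number of steps, so training can stop at any point and return `x`.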
| Optimizer | Year | Per-parameter LR | State memory per parameter | Typical use cases |
|---|---|---|---|---|
| SGD | 1951 | No | None | Simple convex problems, baseline |
| SGD + Momentum | 1964/1986 | No | 1 buffer (velocity) | Computer vision, CNNs |
| NAG | 1983 | No | 1 buffer (velocity) | Convex optimization, some vision tasks |
| AdaGrad | 2011 | Yes | 1 buffer (sum of squared gradients) | Sparse features, NLP |
| RMSProp | 2012 | Yes | 1 buffer (EMA of squared gradients) | RNNs, reinforcement learning |
| Adadelta | 2012 | Yes (no global LR needed) | 2 buffers | General-purpose |
| Adam | 2014 | Yes | 2 buffers (first and second moments) | General-purpose default |
| K-FAC | 2015 | Natural gradient | Kronecker factors per layer | Natural-gradient training, large-batch |
| NAdam | 2016 | Yes | 2 buffers | Language modeling, vision |
| AdamW | 2017 | Yes | 2 buffers | Transformers, LLMs |
| Adafactor | 2018 | Yes (factorized) | O(m+n) instead of O(m*n) | Large transformers (T5) |
| Shampoo | 2018 | Yes (matrix preconditioner) | 2 preconditioner matrices | Large-scale distributed training |
| LAMB | 2019 | Yes (layer-wise trust ratio) | 2 buffers + trust ratio | Large-batch distributed training |
| RAdam | 2020 | Yes (rectified) | 2 buffers | General-purpose (no warmup needed) |
| Lion | 2023 | Sign-based | 1 buffer (momentum) | Vision, language, diffusion models |
| Sophia | 2023 | Curvature-based | 2 buffers (gradient EMA and diagonal Hessian EMA) | LLM pre-training |
| Schedule-Free (AdamW) | 2024 | Yes (no schedule) | 2 buffers + iterate average | General-purpose, self-tuning |
| Muon | 2024 | Orthogonalized | 1 buffer (momentum) | LLM pre-training (Moonlight, Kimi K2), speed records |
Optimizer state is often the largest memory consumer during training, especially for billion-parameter models. The following table shows approximate optimizer state memory for a 1.5 billion parameter model stored in FP32:
| Optimizer | Buffers per parameter | FP32 state memory (1.5B params) | Notes |
|---|---|---|---|
| SGD (no momentum) | 0 | 0 GB | No extra state |
| SGD + Momentum | 1 | ~6 GB | One velocity buffer |
| Adam / AdamW | 2 | ~12 GB | First and second moments |
| Adafactor | ~0.01 (factorized) | ~0.1 GB (approx.) | Row and column statistics |
| Lion | 1 | ~6 GB | Single momentum buffer |
| Sophia | 2 | ~12 GB | Gradient EMA plus EMA of the periodically refreshed diagonal Hessian estimate |
When combined with mixed-precision training, model parameters and gradients can be stored in FP16 or BF16, but optimizer states are typically kept in FP32 to preserve numerical precision for small gradient updates. Techniques like ZeRO (from DeepSpeed) shard optimizer states across multiple GPUs to reduce per-device memory.
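The figures in the state-memory table follow from simple arithmetic: parameters times buffers times bytes per value. The helper below is a hypothetical illustration and uses decimal gigabytes (1 GB = 10^9 bytes).

```python
def optimizer_state_gb(num_params, buffers_per_param, bytes_per_value=4):
    """Approximate optimizer state size in GB (FP32 = 4 bytes per value)."""
    return num_params * buffers_per_param * bytes_per_value / 1e9

adam_gb = optimizer_state_gb(1.5e9, 2)       # Adam / AdamW: two FP32 buffers
momentum_gb = optimizer_state_gb(1.5e9, 1)   # SGD + Momentum or Lion: one buffer
```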
The optimizer and the learning rate schedule work together. The schedule controls how the learning rate changes over the course of training, while the optimizer determines how the learning rate is applied to each parameter.
| Schedule | Description | Typical pairing |
|---|---|---|
| Constant | Learning rate stays fixed | SGD, debugging |
| Step decay | Multiply LR by a factor (e.g., 0.1) at fixed epochs | SGD + Momentum for vision |
| Exponential decay | LR decays by a fixed ratio each epoch | General-purpose |
| Cosine annealing | LR follows a cosine curve from max to min | AdamW for transformers |
| Linear warmup + cosine decay | LR ramps up linearly then decays via cosine | AdamW for LLM pre-training |
| Cyclic LR | LR oscillates between bounds | SGD for exploring loss landscape |
| One-cycle | LR increases then decreases over one cycle | SGD + Momentum (fast training) |
Learning rate warmup starts training with a very small learning rate and linearly increases it to the target value over a set number of steps. Warmup is particularly important for adaptive optimizers like Adam because the second-moment estimates have high variance in the early steps when they are computed from very few gradient samples. Warmup allows the optimizer to collect accurate gradient statistics before making large updates. An additional benefit is that warmup helps the model move away from sharp, poorly conditioned regions of the loss surface toward flatter regions that tolerate larger learning rates.
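A linear warmup followed by cosine decay can be written as a single function of the step index. The function name, the defaults, and the choice to reach the peak on the last warmup step are assumptions for illustration.

```python
import math

def lr_at_step(step, peak_lr=1e-3, warmup_steps=1000, total_steps=50000, min_lr=0.0):
    """Learning rate for a 0-based step: linear warmup, then cosine decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps           # linear ramp up to peak_lr
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)                            # hold at min_lr after the end
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

For example, `lr_at_step(0)` is one thousandth of the peak, `lr_at_step(999)` equals the peak, and the rate falls to `min_lr` at `total_steps`.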
Gradient clipping is a stability technique used alongside the optimizer to prevent exploding gradients. It is applied after backpropagation but before the optimizer step. There are two common forms:
Gradient clipping is especially important for recurrent neural networks and transformers, where long sequences can cause gradient magnitudes to grow exponentially through many layers of computation.
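Assuming the two forms meant here are clipping by value and clipping by global norm, they can be sketched on a flat list of gradient entries as follows. The function names are hypothetical.

```python
import math

def clip_by_value(grads, clip):
    """Clamp each gradient entry to the range [-clip, clip]."""
    return [max(-clip, min(clip, g)) for g in grads]

def clip_by_norm(grads, max_norm):
    """Rescale the whole gradient vector if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]

by_value = clip_by_value([3.0, -4.0, 0.5], 1.0)
by_norm = clip_by_norm([3.0, 4.0], 1.0)
```

Clipping by norm preserves the gradient's direction, while clipping by value can change it because each entry is clamped independently.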
Wilson et al. (2017) showed that adaptive optimizers like Adam sometimes find solutions that generalize worse than SGD with momentum on certain tasks, particularly image classification.[22] One hypothesis is that SGD's noisy gradient estimates make it more likely to converge to flat minima in the loss landscape, which tend to generalize better to unseen data. Adaptive methods, by contrast, may converge to sharper minima because their per-parameter learning rates allow them to navigate narrow valleys that SGD would skip over.
More recent theoretical work (Zhou et al., 2020) formalized this observation, showing that SGD is more locally unstable at sharp minima and can escape them to reach flatter regions.[23] AdamW partially addresses the generalization gap by properly decoupling weight decay from the adaptive update, which provides more consistent regularization.
In modern practice, the generalization gap has narrowed considerably. AdamW with appropriate weight decay and learning rate scheduling matches or exceeds SGD on most benchmarks, which is why it has become the dominant optimizer for transformer-based architectures.
The following guidelines reflect common practice as of 2026:
| Domain | Common optimizer choices | Notes |
|---|---|---|
| Image classification | SGD + Momentum, AdamW | SGD traditional for CNNs; AdamW increasingly popular with ViTs |
| Object detection | SGD + Momentum, AdamW | Often inherits the backbone's optimizer choice |
| Natural language processing | AdamW | Near-universal default for transformers |
| Large language models | AdamW, Lion, Muon, Sophia | AdamW standard; alternatives offer efficiency gains |
| Diffusion models | AdamW, Lion | Lion reported up to a 2.3x reduction in training compute |
| Reinforcement learning | Adam, RMSProp | Adam common in policy gradient methods; RMSProp in value-based methods |
| GANs | Adam | Adaptive rates help stabilize adversarial training |
| Speech recognition | Adam, AdamW | Adaptive methods work well for sequence-to-sequence models |
| Recommendation systems | AdaGrad, Adam | AdaGrad's sparse feature handling is beneficial |
```python
# PyTorch example. `model` and `dataloader` are assumed to be defined elsewhere.
import torch
import torch.optim as optim
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR

# Option A: AdamW with decoupled weight decay
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)

# Option B: SGD with Nesterov momentum (use instead of AdamW)
# optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9, nesterov=True)

# Learning rate scheduler: 1,000 steps of linear warmup, then cosine decay
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=1000)
cosine = CosineAnnealingLR(optimizer, T_max=50000)
scheduler = SequentialLR(optimizer, schedulers=[warmup, cosine], milestones=[1000])

# Training loop with gradient clipping
for batch in dataloader:
    optimizer.zero_grad()
    loss = model(batch)  # assumes the model returns a scalar loss
    loss.backward()
    # Clip after backpropagation and before the optimizer step
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()     # the schedule advances once per optimizer step
```
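```python
# TensorFlow / Keras example.
import tensorflow as tf

# AdamW with a fixed learning rate
optimizer = tf.keras.optimizers.AdamW(learning_rate=1e-3, weight_decay=0.01)

# SGD with Nesterov momentum
optimizer = tf.keras.optimizers.SGD(learning_rate=0.1, momentum=0.9, nesterov=True)

# Cosine decay schedule with warmup: the rate ramps linearly from
# initial_learning_rate to warmup_target, then decays over decay_steps.
lr_schedule = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=0.0,  # start of the warmup ramp
    decay_steps=50000,
    warmup_target=1e-3,         # peak learning rate reached after warmup
    warmup_steps=1000,
)
optimizer = tf.keras.optimizers.AdamW(learning_rate=lr_schedule, weight_decay=0.01)
```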
```python
# JAX / Optax example.
import optax

# AdamW with linear warmup and cosine decay.
# In Optax, decay_steps is the total schedule length, warmup included.
schedule = optax.warmup_cosine_decay_schedule(
    init_value=0.0,
    peak_value=1e-3,
    warmup_steps=1000,
    decay_steps=50000,
)
optimizer = optax.adamw(learning_rate=schedule, weight_decay=0.01)
```
| Year | Event |
|---|---|
| 1847 | Augustin-Louis Cauchy describes the gradient descent method |
| 1951 | Robbins and Monro publish the stochastic approximation framework |
| 1964 | Polyak introduces the heavy ball method (momentum) |
| 1983 | Nesterov proposes accelerated gradient for convex optimization |
| 1986 | Rumelhart, Hinton, and Williams apply momentum to neural network training via backpropagation |
| 2011 | Duchi, Hazan, and Singer publish AdaGrad |
| 2012 | Zeiler introduces Adadelta; Hinton proposes RMSProp |
| 2014 | Kingma and Ba introduce Adam |
| 2015 | Martens and Grosse introduce K-FAC (Kronecker-factored approximate curvature) |
| 2016 | Dozat proposes NAdam |
| 2017 | Loshchilov and Hutter propose AdamW (decoupled weight decay) |
| 2018 | Shazeer and Stern introduce Adafactor; Gupta et al. introduce Shampoo |
| 2019 | You et al. introduce LAMB for large-batch training |
| 2020 | Liu et al. introduce RAdam (rectified Adam) |
| 2023 | Chen et al. discover Lion via program search; Liu et al. introduce Sophia |
| 2024 | Jordan introduces Muon with Newton-Schulz orthogonalization; Defazio et al. introduce Schedule-Free; Distributed Shampoo wins the external-tuning track of the inaugural AlgoPerf competition, and Schedule-Free AdamW wins the self-tuning track |
| 2025 | Moonshot AI shows Muon scales to LLMs (Moonlight) and trains the trillion-parameter Kimi K2 with MuonClip |
Optimizer convergence guarantees depend on the properties of the objective function:
In practice, convergence theory provides useful intuition but does not fully explain optimizer behavior on deep neural network training, where the loss surface is highly non-convex and high-dimensional.