An optimizer in machine learning is an algorithm that iteratively adjusts a model's learnable parameters to minimize (or maximize) an objective function, commonly called a loss function. Optimizers sit at the core of every training pipeline: after backpropagation computes how much each parameter contributed to the prediction error, the optimizer decides how to update those parameters so the model improves over time.
The choice of optimizer affects training speed, final model quality, memory consumption, and hyperparameter sensitivity. Decades of research have produced a broad family of algorithms, from simple gradient descent to adaptive methods like Adam and recent discoveries such as Lion and Muon.
Imagine you are blindfolded in a hilly field and you want to find the lowest valley. You can feel the slope under your feet, and that tells you which direction goes downhill. An optimizer is your strategy for walking downhill.
The optimizer occupies a specific position in the standard supervised learning training loop:

1. Forward pass: the model produces predictions for a mini-batch of inputs.
2. Loss computation: the loss function measures the error between the predictions and the targets.
3. Backward pass: backpropagation computes the gradient of the loss with respect to each learnable parameter.
4. Optimizer step: the optimizer uses those gradients (and any internal state it maintains) to update the parameters.

Steps 1 through 4 repeat for each mini-batch of training data. One full pass through the training set is called an epoch.
Given a parameter vector theta, a loss function L(theta), and a learning rate eta, the simplest optimizer performs the update:
theta_{t+1} = theta_t - eta * grad L(theta_t)
This is vanilla gradient descent. All other optimizers modify this rule in one or more of the following ways: how the gradient is estimated (full batch, a single example, or a mini-batch), whether a running history of past gradients is kept (momentum and its variants), whether each parameter receives its own adaptive learning rate (AdaGrad, RMSProp, Adam, and relatives), and whether curvature information is used to precondition the step (second-order and structure-aware methods).
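To make the update rule concrete, here is a minimal NumPy sketch that fits a one-parameter linear model with vanilla gradient descent; the data, learning rate, and iteration count are purely illustrative:

```python
import numpy as np

# Toy problem: fit y = theta * x with a mean-squared-error loss.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x + rng.normal(scale=0.1, size=100)   # underlying weight is 3.0

theta = 0.0   # learnable parameter
eta = 0.1     # learning rate

for step in range(200):
    grad = np.mean(2 * (theta * x - y) * x)   # gradient of the mean squared error
    theta = theta - eta * grad                # theta_{t+1} = theta_t - eta * grad L(theta_t)

print(theta)  # converges close to 3.0
```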
Batch gradient descent computes the gradient of the loss over the entire training dataset before making a single update:
theta = theta - eta * (1/N) * sum_{i=1}^{N} grad L_i(theta)
This produces a stable, low-variance gradient estimate, but it is impractical for large datasets because the full dataset must fit in memory and every update requires a complete pass through the data.
Stochastic gradient descent computes the gradient from a single example or a small mini-batch instead of the full dataset. This idea traces back to the Robbins-Monro stochastic approximation method published in 1951. SGD introduces noise into the gradient estimates, which can actually help the optimizer escape shallow local minima and saddle points. Mini-batch SGD (using, say, 32 to 512 examples per gradient estimate) balances the variance reduction of larger batches with the computational savings of smaller ones.
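The following sketch shows mini-batch SGD on a toy linear-regression problem: the data are shuffled each epoch and each update uses only a small slice of examples, so the gradient estimate is noisy but cheap (all values here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 3.0 * x + rng.normal(scale=0.1, size=1000)

theta, eta, batch_size = 0.0, 0.1, 64
for epoch in range(5):
    perm = rng.permutation(len(x))                  # reshuffle the data each epoch
    for start in range(0, len(x), batch_size):
        idx = perm[start:start + batch_size]
        xb, yb = x[idx], y[idx]
        grad = np.mean(2 * (theta * xb - yb) * xb)  # noisy mini-batch gradient estimate
        theta -= eta * grad
```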
Polyak introduced the momentum method in 1964, and Rumelhart, Hinton, and Williams popularized it for neural networks in 1986. Instead of using only the current gradient, the optimizer maintains a velocity vector that accumulates past gradients:
v_t = gamma * v_{t-1} + eta * grad L(theta_t)
theta_{t+1} = theta_t - v_t
The momentum coefficient gamma (typically 0.9) controls how much history is retained. Momentum accelerates convergence along consistent gradient directions and dampens oscillations in directions where the gradient frequently changes sign. It remains the standard choice for many computer vision tasks, including training ResNets and other convolutional neural networks.
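A minimal sketch of the momentum update for one parameter array (the function name and default values are illustrative):

```python
import numpy as np

def momentum_step(theta, grad, velocity, eta=0.01, gamma=0.9):
    """One heavy-ball momentum update: v_t = gamma * v_{t-1} + eta * grad, then theta - v_t."""
    velocity = gamma * velocity + eta * grad
    theta = theta - velocity
    return theta, velocity
```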
Proposed by Yurii Nesterov in 1983 for convex optimization, NAG modifies momentum by computing the gradient at a "look-ahead" position rather than the current position:
v_t = gamma * v_{t-1} + eta * grad L(theta_t - gamma * v_{t-1})
theta_{t+1} = theta_t - v_t
By evaluating the gradient at the projected future position, NAG produces more responsive updates and achieves faster convergence rates on convex problems. Nesterov proved an optimal O(1/t^2) convergence rate for smooth convex functions, compared to the O(1/t) rate of standard gradient descent.
Adaptive methods maintain per-parameter learning rates that adjust automatically based on the history of gradients. This eliminates or reduces the need to manually tune the global learning rate.
Introduced by Duchi, Hazan, and Singer in 2011, AdaGrad accumulates the sum of squared gradients for each parameter and uses this sum to scale the learning rate:
G_t = G_{t-1} + (grad L(theta_t))^2
theta_{t+1} = theta_t - (eta / sqrt(G_t + epsilon)) * grad L(theta_t)
Parameters that receive large gradients get smaller effective learning rates, while parameters with small or infrequent gradients retain larger learning rates. This makes AdaGrad well suited for problems with sparse features, such as natural language processing tasks where rare words have infrequent but informative gradients. The main drawback is that the accumulated squared gradient sum grows monotonically, causing the effective learning rate to shrink to near zero over long training runs.
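A minimal per-parameter sketch of the AdaGrad update (the helper and defaults are illustrative):

```python
import numpy as np

def adagrad_step(theta, grad, g_accum, eta=0.01, eps=1e-8):
    """One AdaGrad update: accumulate squared gradients, then scale each step by 1/sqrt(G_t)."""
    g_accum = g_accum + grad ** 2                        # monotonically growing accumulator
    theta = theta - eta * grad / np.sqrt(g_accum + eps)
    return theta, g_accum
```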
Geoffrey Hinton proposed RMSProp in his 2012 Coursera lecture on neural networks. RMSProp fixes AdaGrad's diminishing learning rate problem by replacing the cumulative sum with an exponentially decaying average of squared gradients:
E[g^2]_t = rho * E[g^2]_{t-1} + (1 - rho) * (grad L(theta_t))^2
theta_{t+1} = theta_t - (eta / sqrt(E[g^2]_t + epsilon)) * grad L(theta_t)
The decay rate rho (typically 0.9) ensures that only recent gradient magnitudes influence the per-parameter learning rate. RMSProp was never formally published in a peer-reviewed paper, but it became one of the most widely used optimizers in practice, particularly for recurrent neural networks and reinforcement learning.
Matthew Zeiler introduced Adadelta in 2012 as an extension of AdaGrad. Like RMSProp, it uses an exponentially decaying average of squared gradients. It also maintains a running average of squared parameter updates, which replaces the global learning rate entirely:
E[g^2]_t = rho * E[g^2]_{t-1} + (1 - rho) * (grad L(theta_t))^2
delta_theta_t = -(sqrt(E[delta_theta^2]_{t-1} + epsilon) / sqrt(E[g^2]_t + epsilon)) * grad L(theta_t)
E[delta_theta^2]_t = rho * E[delta_theta^2]_{t-1} + (1 - rho) * (delta_theta_t)^2
By computing the ratio of update RMS to gradient RMS, Adadelta achieves correct units for the parameter update without requiring a manually specified learning rate.
Kingma and Ba introduced Adam (Adaptive Moment Estimation) in a 2014 paper published at ICLR 2015. Adam combines the first moment estimate (momentum) with the second moment estimate (RMSProp-style adaptive learning rate), plus bias correction to account for the zero initialization of the moment estimates:
m_t = beta_1 * m_{t-1} + (1 - beta_1) * grad L(theta_t) (first moment)
v_t = beta_2 * v_{t-1} + (1 - beta_2) * (grad L(theta_t))^2 (second moment)
m_hat_t = m_t / (1 - beta_1^t) (bias correction)
v_hat_t = v_t / (1 - beta_2^t) (bias correction)
theta_{t+1} = theta_t - eta * m_hat_t / (sqrt(v_hat_t) + epsilon)
The default hyperparameters (beta_1 = 0.9, beta_2 = 0.999, epsilon = 1e-8) work well across a wide range of problems, which is a major reason for Adam's popularity. Adam is computationally efficient, has modest memory requirements (two buffers per parameter), and is relatively insensitive to the choice of learning rate compared to SGD.
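Putting the five equations together, a compact sketch of one Adam step (an illustrative helper, not a library API):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count used for bias correction."""
    m = beta1 * m + (1 - beta1) * grad            # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2       # second moment (adaptive scaling)
    m_hat = m / (1 - beta1 ** t)                  # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```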
Loshchilov and Hutter published "Decoupled Weight Decay Regularization" in 2017 (ICLR 2019), showing that the standard way of implementing L2 regularization in Adam is not equivalent to true weight decay. In Adam with L2 regularization, the regularization gradient gets scaled by the adaptive learning rate, which weakens the regularization effect for parameters with large gradient histories. AdamW fixes this by decoupling weight decay from the gradient-based update:
m_t, v_t = (same as Adam)
theta_{t+1} = theta_t - eta * (m_hat_t / (sqrt(v_hat_t) + epsilon) + lambda * theta_t)
Here, lambda is the weight decay coefficient applied directly to the parameters, independent of the adaptive scaling. AdamW has become the default optimizer for training transformers, large language models, and many other modern architectures.
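The difference from Adam is a single term: weight decay is added to the update directly rather than folded into the gradient. A sketch under the same illustrative conventions as the Adam step above:

```python
import numpy as np

def adamw_step(theta, grad, m, v, t, eta=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """AdamW: identical moment estimates to Adam, plus decay applied directly to the parameters."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - eta * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * theta)
    return theta, m, v
```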
Timothy Dozat proposed NAdam (Nesterov-accelerated Adaptive Moment Estimation) in 2016 by incorporating Nesterov momentum into Adam. Instead of using the current first-moment estimate for the update, NAdam uses the look-ahead first moment, similar to how NAG looks ahead relative to standard momentum. NAdam generally converges faster than Adam on tasks where Nesterov momentum provides a benefit, including language modeling and certain computer vision workloads.
Liu et al. introduced RAdam (Rectified Adam) at ICLR 2020. They identified that Adam's adaptive learning rate has problematically high variance in the early steps of training because the second moment estimate is computed from very few samples. This variance is the underlying reason why learning rate warmup helps Adam. RAdam estimates the variance of the adaptive learning rate and applies a rectification term that automatically suppresses the variance when it is too high. The result is an optimizer that adapts between SGD-like behavior early in training and full Adam behavior later, removing the need to manually tune a warmup schedule.
As models have grown to billions of parameters, optimizer memory consumption has become a practical bottleneck. Adam and AdamW store two state buffers (first and second moment) per parameter, doubling the memory required beyond the model parameters and gradients themselves.
Shazeer and Stern introduced Adafactor in 2018 (ICML 2018) to reduce the memory cost of adaptive optimizers. For a weight matrix of size m x n, Adam stores an m x n second-moment buffer. Adafactor factorizes this into per-row and per-column statistics, reducing memory from O(m * n) to O(m + n). Combined with update clipping and the option to drop momentum, Adafactor achieves comparable results to Adam on Transformer training while using significantly less memory. Google used Adafactor for training the T5 model family.
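The core memory trick can be sketched in a few lines: keep exponential averages of row and column statistics of the squared gradients and reconstruct a rank-1 approximation of the full second moment. This is a simplified illustration, not the complete Adafactor algorithm (which also includes update clipping and relative step sizes):

```python
import numpy as np

def factored_second_moment(grad, row_avg, col_avg, beta2=0.999, eps=1e-30):
    """Maintain O(m + n) statistics for an m x n gradient instead of an m x n buffer."""
    sq = grad ** 2 + eps
    row_avg = beta2 * row_avg + (1 - beta2) * sq.mean(axis=1)   # shape (m,)
    col_avg = beta2 * col_avg + (1 - beta2) * sq.mean(axis=0)   # shape (n,)
    # Rank-1 reconstruction of the per-element second-moment estimate.
    v_hat = np.outer(row_avg, col_avg) / row_avg.mean()
    return row_avg, col_avg, v_hat
```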
Chen et al. at Google Brain discovered Lion (EvoLved Sign Momentum) through automated program search, described in their 2023 NeurIPS paper "Symbolic Discovery of Optimization Algorithms." Rather than being designed by hand, Lion was found by searching over a space of possible optimizer programs using evolutionary methods. The resulting algorithm is remarkably simple: it uses only the sign of a momentum-based interpolation to determine the update direction:
update = sign(beta_1 * m_{t-1} + (1 - beta_1) * grad L(theta_t))
theta_{t+1} = theta_t - eta * update
m_t = beta_2 * m_{t-1} + (1 - beta_2) * grad L(theta_t)
Because Lion stores only one momentum buffer (compared to Adam's two) and uses a uniform update magnitude, it roughly halves the optimizer memory overhead. Lion requires a learning rate 3 to 10 times smaller than Adam's due to its sign-based updates. On image classification with ViT, Lion improved ImageNet accuracy by up to 2%. On diffusion models, it reduced training compute by up to 2.3 times. Lion has also been deployed in production at Google for search ads click-through-rate models.
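A direct transcription of the three update equations as an illustrative helper (the defaults shown here are not authoritative; the paper pairs the sign-based update with decoupled weight decay, included as an optional term):

```python
import numpy as np

def lion_step(theta, grad, m, eta=1e-4, beta1=0.9, beta2=0.99, weight_decay=0.0):
    """One Lion update: sign of an interpolated momentum, then the momentum buffer is refreshed."""
    update = np.sign(beta1 * m + (1 - beta1) * grad)
    theta = theta - eta * (update + weight_decay * theta)
    m = beta2 * m + (1 - beta2) * grad
    return theta, m
```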
First-order optimizers use only gradient information. Second-order optimizers also incorporate curvature information (the Hessian matrix or approximations of it) to take more informed steps. While full Newton's method computes the inverse Hessian (which is prohibitively expensive for neural networks with millions of parameters), several practical approximations exist.
Liu et al. introduced Sophia (Second-order Clipped Stochastic Optimization) in 2023 (ICLR 2024) for language model pre-training. Sophia uses a lightweight diagonal Hessian estimate as a pre-conditioner, dividing the gradient by the estimated curvature and then applying element-wise clipping. The clipping controls worst-case update sizes, which tames instability from non-convexity and rapid Hessian changes. Sophia estimates the diagonal Hessian only every few iterations, keeping the average per-step overhead negligible. On GPT models from 125M to 1.5B parameters, Sophia achieved a 2x speedup over Adam in steps, total compute, and wall-clock time to reach the same perplexity.
Gupta et al. introduced Shampoo in 2018 as a structure-aware preconditioning algorithm for stochastic optimization over tensor spaces. For a weight matrix W of size m x n, instead of maintaining a single mn by mn preconditioner (which would be infeasible), Shampoo maintains separate preconditioners of size m x m and n x n, one for each tensor dimension. Anil et al. (2021) developed a distributed implementation of Shampoo that demonstrated strong performance on large-scale training tasks. More recently, Vyas et al. (2024) introduced SOAP, which combines Shampoo's preconditioning with Adam's per-element adaptivity for improved stability.
You et al. introduced LAMB (Layer-wise Adaptive Moments optimizer for Batch training) in 2019 for scaling up batch sizes during training. LAMB combines Adam's adaptive per-parameter scaling with a layer-wise trust ratio inspired by LARS (Layer-wise Adaptive Rate Scaling). The trust ratio normalizes the update magnitude relative to the parameter magnitude for each layer, preventing any single layer from receiving disproportionately large updates. The headline result was reducing BERT pre-training time from 3 days to 76 minutes by scaling to batch size 32,868 on TPUv3 Pods without degrading performance. However, subsequent work by Nado et al. (2021) showed that standard Adam with careful tuning can match LAMB at large batch sizes.
Keller Jordan et al. introduced Muon (MomentUm Orthogonalized by Newton-Schulz) in late 2024. Muon treats neural network weight updates as matrices rather than collections of independent scalars. It runs standard SGD with Nesterov momentum and then replaces each 2D parameter's update with its nearest orthogonal matrix, computed efficiently using Newton-Schulz iteration. While Adam treats each parameter independently, Muon exploits the geometric structure of weight matrices. Scaling law experiments showed that Muon achieves comparable performance to AdamW while requiring roughly 52% of the training FLOPs, translating to nearly 2x cost savings for large training runs. Muon currently holds training speed records for both NanoGPT and CIFAR-10 speedrunning benchmarks.
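The orthogonalization step can be illustrated with a basic Newton-Schulz iteration that pushes all singular values of the update toward 1. This is a simplified cubic variant for illustration only; the actual Muon implementation uses a tuned higher-order polynomial iteration for faster convergence:

```python
import numpy as np

def orthogonalize(update, steps=5):
    """Approximate the nearest (semi-)orthogonal matrix to a 2D update via Newton-Schulz."""
    x = update / (np.linalg.norm(update) + 1e-12)  # normalize so singular values lie in (0, 1]
    for _ in range(steps):
        x = 1.5 * x - 0.5 * x @ x.T @ x            # pushes each singular value toward 1
    return x
```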
| Optimizer | Year | Per-parameter LR | State memory per parameter | Typical use cases |
|---|---|---|---|---|
| SGD | 1951 | No | None | Simple convex problems, baseline |
| SGD + Momentum | 1964/1986 | No | 1 buffer (velocity) | Computer vision, CNNs |
| NAG | 1983 | No | 1 buffer (velocity) | Convex optimization, some vision tasks |
| AdaGrad | 2011 | Yes | 1 buffer (sum of squared gradients) | Sparse features, NLP |
| RMSProp | 2012 | Yes | 1 buffer (EMA of squared gradients) | RNNs, reinforcement learning |
| Adadelta | 2012 | Yes (no global LR needed) | 2 buffers | General-purpose |
| Adam | 2014 | Yes | 2 buffers (first and second moments) | General-purpose default |
| NAdam | 2016 | Yes | 2 buffers | Language modeling, vision |
| AdamW | 2017 | Yes | 2 buffers | Transformers, LLMs |
| Adafactor | 2018 | Yes (factorized) | O(m+n) instead of O(m*n) | Large transformers (T5) |
| Shampoo | 2018 | Yes (matrix preconditioner) | 2 preconditioner matrices | Large-scale distributed training |
| LAMB | 2019 | Yes (layer-wise trust ratio) | 2 buffers + trust ratio | Large-batch distributed training |
| RAdam | 2020 | Yes (rectified) | 2 buffers | General-purpose (no warmup needed) |
| Lion | 2023 | Sign-based | 1 buffer (momentum) | Vision, language, diffusion models |
| Sophia | 2023 | Curvature-based | 2 buffers + Hessian estimate | LLM pre-training |
| Muon | 2024 | Orthogonalized | 1 buffer (momentum) | LLM pre-training, speed records |
Optimizer state is often the largest memory consumer during training, especially for billion-parameter models. The following table shows approximate optimizer state memory for a 1.5 billion parameter model stored in FP32:
| Optimizer | Buffers per parameter | FP32 state memory (1.5B params) | Notes |
|---|---|---|---|
| SGD (no momentum) | 0 | 0 GB | No extra state |
| SGD + Momentum | 1 | ~6 GB | One velocity buffer |
| Adam / AdamW | 2 | ~12 GB | First and second moments |
| Adafactor | ~0.01 (factorized) | ~0.1 GB (approx.) | Row and column statistics |
| Lion | 1 | ~6 GB | Single momentum buffer |
| Sophia | 2 + Hessian | ~12+ GB | Moments plus periodic Hessian |
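These figures follow from a simple calculation: parameter count times buffers per parameter times four bytes per FP32 value. A quick sanity check (the helper below is illustrative):

```python
def optimizer_state_gb(num_params, num_buffers, bytes_per_value=4):
    """Rough optimizer-state memory in GB (FP32 = 4 bytes per value)."""
    return num_params * num_buffers * bytes_per_value / 1e9

print(optimizer_state_gb(1.5e9, 2))  # Adam/AdamW: ~12 GB
print(optimizer_state_gb(1.5e9, 1))  # SGD with momentum or Lion: ~6 GB
```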
When combined with mixed-precision training, model parameters and gradients can be stored in FP16 or BF16, but optimizer states are typically kept in FP32 to preserve numerical precision for small gradient updates. Techniques like ZeRO (from DeepSpeed) shard optimizer states across multiple GPUs to reduce per-device memory.
The optimizer and the learning rate schedule work together. The schedule controls how the learning rate changes over the course of training, while the optimizer determines how the learning rate is applied to each parameter.
| Schedule | Description | Typical pairing |
|---|---|---|
| Constant | Learning rate stays fixed | SGD, debugging |
| Step decay | Multiply LR by a factor (e.g., 0.1) at fixed epochs | SGD + Momentum for vision |
| Exponential decay | LR decays by a fixed ratio each epoch | General-purpose |
| Cosine annealing | LR follows a cosine curve from max to min | AdamW for transformers |
| Linear warmup + cosine decay | LR ramps up linearly then decays via cosine | AdamW for LLM pre-training |
| Cyclic LR | LR oscillates between bounds | SGD for exploring loss landscape |
| One-cycle | LR increases then decreases over one cycle | SGD + Momentum (fast training) |
Learning rate warmup starts training with a very small learning rate and linearly increases it to the target value over a set number of steps. Warmup is particularly important for adaptive optimizers like Adam because the second-moment estimates have high variance in the early steps when they are computed from very few gradient samples. Warmup allows the optimizer to collect accurate gradient statistics before making large updates. An additional benefit is that warmup helps the model move away from sharp, poorly conditioned regions of the loss surface toward flatter regions that tolerate larger learning rates.
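The combined shape of linear warmup followed by cosine decay can be written as a small function of the step count (all constants here are illustrative; frameworks provide equivalent schedulers, as shown in the code examples later in this section):

```python
import math

def lr_at_step(step, peak_lr=1e-3, warmup_steps=1000, total_steps=50000, min_lr=0.0):
    """Linear warmup from 0 to peak_lr, then cosine decay from peak_lr down to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```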
Gradient clipping is a stability technique used alongside the optimizer to prevent exploding gradients. It is applied after backpropagation but before the optimizer step. There are two common forms: clipping by value, which clamps each gradient component to a fixed range such as [-c, c], and clipping by norm, which rescales the entire gradient vector whenever its global norm exceeds a threshold, preserving the update direction.
Gradient clipping is especially important for recurrent neural networks and transformers, where long sequences can cause gradient magnitudes to grow exponentially through many layers of computation.
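Both forms are available as utilities in PyTorch; a brief, self-contained sketch (the tiny model, dummy loss, and thresholds are purely illustrative):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                          # tiny illustrative model
loss = model(torch.randn(8, 10)).pow(2).mean()    # dummy forward pass and loss
loss.backward()

# Norm clipping: rescale all gradients together if their global L2 norm exceeds 1.0.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# Value clipping: clamp every gradient element to the range [-0.5, 0.5].
torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=0.5)
```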
Wilson et al. (2017) showed that adaptive optimizers like Adam sometimes find solutions that generalize worse than SGD with momentum on certain tasks, particularly image classification. One hypothesis is that SGD's noisy gradient estimates make it more likely to converge to flat minima in the loss landscape, which tend to generalize better to unseen data. Adaptive methods, by contrast, may converge to sharper minima because their per-parameter learning rates allow them to navigate narrow valleys that SGD would skip over.
More recent theoretical work (Zhou et al., 2020) formalized this observation, showing that SGD is more locally unstable at sharp minima and can escape them to reach flatter regions. AdamW partially addresses the generalization gap by properly decoupling weight decay from the adaptive update, which provides more consistent regularization.
In modern practice, the generalization gap has narrowed considerably. AdamW with appropriate weight decay and learning rate scheduling matches or exceeds SGD on most benchmarks, which is why it has become the dominant optimizer for transformer-based architectures.
The following guidelines reflect common practice as of 2025:
| Domain | Common optimizer choices | Notes |
|---|---|---|
| Image classification | SGD + Momentum, AdamW | SGD traditional for CNNs; AdamW increasingly popular with ViTs |
| Object detection | SGD + Momentum, AdamW | Often inherits the backbone's optimizer choice |
| Natural language processing | AdamW | Near-universal default for transformers |
| Large language models | AdamW, Lion, Muon, Sophia | AdamW standard; alternatives offer efficiency gains |
| Diffusion models | AdamW, Lion | Lion showed 2.3x training efficiency gains |
| Reinforcement learning | Adam, RMSProp | Adam common in policy gradient methods; RMSProp in value-based methods |
| GANs | Adam | Adaptive rates help stabilize adversarial training |
| Speech recognition | Adam, AdamW | Adaptive methods work well for sequence-to-sequence models |
| Recommendation systems | AdaGrad, Adam | AdaGrad's sparse feature handling is beneficial |
import torch
import torch.optim as optim
# The model and dataloader are assumed to be defined elsewhere.
# AdamW with decoupled weight decay
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
# SGD with Nesterov momentum
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9, nesterov=True)
# Learning rate scheduler: linear warmup + cosine decay
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=1000)
cosine = CosineAnnealingLR(optimizer, T_max=50000)
scheduler = SequentialLR(optimizer, schedulers=[warmup, cosine], milestones=[1000])
# Training loop with gradient clipping (assumes the model's forward pass returns the loss)
for batch in dataloader:
    optimizer.zero_grad()
    loss = model(batch)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()
import tensorflow as tf
# AdamW
optimizer = tf.keras.optimizers.AdamW(learning_rate=1e-3, weight_decay=0.01)
# SGD with Nesterov momentum
optimizer = tf.keras.optimizers.SGD(learning_rate=0.1, momentum=0.9, nesterov=True)
# Cosine decay schedule with linear warmup (ramps from 0 up to the warmup target)
lr_schedule = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=0.0,
    decay_steps=50000,
    warmup_target=1e-3,
    warmup_steps=1000
)
optimizer = tf.keras.optimizers.AdamW(learning_rate=lr_schedule, weight_decay=0.01)
import optax
# AdamW with linear warmup and cosine decay
schedule = optax.warmup_cosine_decay_schedule(
init_value=0.0, peak_value=1e-3,
warmup_steps=1000, decay_steps=50000
)
optimizer = optax.adamw(learning_rate=schedule, weight_decay=0.01)
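# Applying the optimizer: optax optimizers are pure functions over an explicit optimizer state.
# The params and grads below are illustrative placeholders; in practice, grads come from jax.grad.
import jax.numpy as jnp
params = {"w": jnp.zeros((10, 10))}
opt_state = optimizer.init(params)
grads = {"w": jnp.ones((10, 10))}
updates, opt_state = optimizer.update(grads, opt_state, params)
params = optax.apply_updates(params, updates)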
| Year | Event |
|---|---|
| 1847 | Augustin-Louis Cauchy describes the gradient descent method |
| 1951 | Robbins and Monro publish the stochastic approximation framework |
| 1964 | Polyak introduces the heavy ball method (momentum) |
| 1983 | Nesterov proposes accelerated gradient for convex optimization |
| 1986 | Rumelhart, Hinton, and Williams apply momentum to neural network training via backpropagation |
| 2011 | Duchi, Hazan, and Singer publish AdaGrad |
| 2012 | Zeiler introduces Adadelta; Hinton proposes RMSProp |
| 2014 | Kingma and Ba introduce Adam |
| 2016 | Dozat proposes NAdam |
| 2017 | Loshchilov and Hutter propose AdamW (decoupled weight decay) |
| 2018 | Shazeer and Stern introduce Adafactor; Gupta et al. introduce Shampoo |
| 2019 | You et al. introduce LAMB for large-batch training |
| 2020 | Liu et al. introduce RAdam (rectified Adam) |
| 2023 | Chen et al. discover Lion via program search; Liu et al. introduce Sophia |
| 2024 | Jordan et al. introduce Muon with Newton-Schulz orthogonalization |
Optimizer convergence guarantees depend on the properties of the objective function. For smooth convex functions, gradient descent with an appropriate step size converges at a rate of O(1/t), and Nesterov acceleration improves this to O(1/t^2). For smooth, strongly convex functions, gradient descent converges linearly, shrinking the error by a constant factor each iteration. For non-convex smooth functions, which include neural network losses, gradient methods are only guaranteed to approach stationary points (where the gradient is near zero) rather than global minima, with SGD typically achieving an O(1/sqrt(t)) rate on the squared gradient norm.
In practice, convergence theory provides useful intuition but does not fully explain optimizer behavior on deep neural network training, where the loss surface is highly non-convex and high-dimensional.