# Optimizer

> Source: https://aiwiki.ai/wiki/optimizer
> Updated: 2026-06-21
> Categories: Deep Learning, Machine Learning, Training & Optimization
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

An **optimizer** in [machine learning](/wiki/machine_learning) is an algorithm that iteratively adjusts a model's learnable parameters to minimize (or maximize) an [objective function](/wiki/objective_function), commonly called a [loss function](/wiki/loss_function), using the gradients produced by [backpropagation](/wiki/backpropagation). The optimizer is the component that decides the direction and size of each parameter update, so it sits at the core of every training pipeline: backpropagation computes how much each parameter contributed to the prediction error, and the optimizer turns those gradients into the next set of weights.

The choice of optimizer affects training speed, final model quality, memory consumption, and hyperparameter sensitivity. Decades of research have produced a broad family of algorithms, from simple [gradient descent](/wiki/gradient_descent) to adaptive methods like [Adam](/wiki/adam_optimizer) and recent discoveries such as Lion and Muon. [Adam](/wiki/adam_optimizer), introduced by Kingma and Ba at ICLR 2015, and its successor [AdamW](/wiki/adamw) (Loshchilov and Hutter, ICLR 2019) are the de facto default optimizers for training [transformers](/wiki/transformer) and [large language models](/wiki/large_language_model).[5][8] By 2025, alternatives to the long-dominant AdamW had begun to see production use at frontier scale: Muon, for example, was used to train Moonshot AI's trillion-parameter Kimi K2 model.[16]

## ELI5 (Explain like I'm 5)

Imagine you are blindfolded in a hilly field and you want to find the lowest valley. You can feel the slope under your feet, and that tells you which direction goes downhill. An optimizer is your strategy for walking downhill.

- **SGD** means you take one small step in the downhill direction every time you check the slope.
- **Momentum** is like rolling a ball: it picks up speed when the slope keeps pointing the same way, so you move faster through gentle slopes and slow down when the direction changes.
- **Adam** is like having a smart hiking guide who remembers how steep the ground has been recently and adjusts your step size for every direction independently, so you take big steps across flat ground and careful steps on steep, rocky terrain.
- **AdamW** adds a gentle pull toward the center of the field so you do not wander too far in any one direction (this is weight decay).
- Newer optimizers like **Lion** and **Muon** try to find the valley faster or with less effort by using clever shortcuts.

## What does an optimizer do in the training loop?

The optimizer occupies a specific position in the standard supervised learning training loop:

1. **Forward pass.** Input data flows through the [neural network](/wiki/neural_network) to produce predictions.
2. **Loss computation.** A loss function (for example, [cross-entropy](/wiki/cross-entropy) or [mean squared error](/wiki/mean_squared_error_mse)) quantifies the difference between predictions and ground-truth labels.
3. **Backward pass.** [Backpropagation](/wiki/backpropagation) computes the gradient of the loss with respect to every learnable parameter.
4. **Parameter update.** The optimizer uses the gradients (and possibly its own internal state) to compute a step direction and step size, then updates every parameter.

Steps 1 through 4 repeat for each mini-batch of training data. One full pass through the training set is called an [epoch](/wiki/epoch).
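
The four steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production loop: the one-weight model, the toy data drawn from y = 3x, the learning rate of 0.01, and the hand-derived gradient are all assumptions chosen to keep the example self-contained.

```python
# Toy data generated from y = 3 * x; the "model" is a single weight w.
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0), (4.0, 12.0)]
w = 0.0      # learnable parameter
eta = 0.01   # learning rate (illustrative value)

for epoch in range(200):                # one epoch = one pass over the data
    for x, y in data:                   # each example acts as a mini-batch of size 1
        pred = w * x                    # 1. forward pass
        loss = (pred - y) ** 2          # 2. loss computation (squared error)
        grad = 2.0 * (pred - y) * x     # 3. backward pass: d(loss)/dw, derived by hand
        w = w - eta * grad              # 4. parameter update (plain SGD)

print(round(w, 4))
```

In a real framework, step 3 is handled by automatic differentiation and step 4 by an optimizer object, but the division of labor is the same.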

## Mathematical foundation

Given a parameter vector **theta**, a loss function L(theta), and a [learning rate](/wiki/learning_rate) eta, the simplest optimizer performs the update:

```
theta_{t+1} = theta_t - eta * grad L(theta_t)
```

This is vanilla gradient descent. All other optimizers modify this rule in one or more of the following ways:

- **Stochastic sampling.** Compute the gradient on a random subset (mini-batch) rather than the full dataset.
- **Momentum.** Accumulate an exponentially weighted moving average of past gradients to smooth the trajectory.
- **Adaptive learning rates.** Scale the learning rate per parameter based on the history of gradients for that parameter.
- **Second-order information.** Incorporate curvature information (the [Hessian](/wiki/hessian_matrix) or an approximation of it) to take better-informed steps.
- **Weight decay.** Add a penalty proportional to the magnitude of the parameters to encourage smaller weights.

## Classical optimizers

### Batch gradient descent

Batch gradient descent computes the gradient of the loss over the entire training dataset before making a single update:

```
theta = theta - eta * (1/N) * sum_{i=1}^{N} grad L_i(theta)
```

This produces a deterministic gradient with no sampling noise, but it is impractical for large datasets: every update requires a complete pass through the data, and a naive implementation must hold the full dataset in memory.

### Stochastic gradient descent (SGD)

Stochastic gradient descent computes the gradient from a single example or a small mini-batch instead of the full dataset. This idea traces back to the Robbins-Monro stochastic approximation method published in 1951.[1] SGD introduces noise into the gradient estimates, which can help the optimizer escape shallow local minima and saddle points. Mini-batch SGD (using, say, 32 to 512 examples per gradient estimate) balances the variance reduction of larger batches with the computational savings of smaller ones.
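
As a rough sketch of mini-batch SGD, the snippet below fits a single weight using a random batch of 32 examples per step. The synthetic dataset (noiseless targets from y = 2x), the seed, the batch size, and the learning rate are illustrative assumptions.

```python
import random

rng = random.Random(0)                      # fixed seed so the run is repeatable
xs = [rng.uniform(-1.0, 1.0) for _ in range(1000)]
data = [(x, 2.0 * x) for x in xs]           # noiseless targets from y = 2 * x

def grad(w, batch):
    """Mean gradient of the squared error (w*x - y)^2 over a batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

w, eta, batch_size = 0.0, 0.1, 32
for step in range(500):
    batch = rng.sample(data, batch_size)    # random subset instead of all 1000 examples
    w -= eta * grad(w, batch)               # same update rule, noisier gradient estimate

print(w)
```

Each step touches 32 examples rather than 1000, so the cost per update is much lower than a full-batch pass, at the price of a gradient that varies from batch to batch.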

### SGD with momentum

Polyak introduced the momentum method in 1964, and Rumelhart, Hinton, and Williams popularized it for neural networks in 1986. Instead of using only the current gradient, the optimizer maintains a velocity vector that accumulates past gradients:

```
v_t = gamma * v_{t-1} + eta * grad L(theta_t)
theta_{t+1} = theta_t - v_t
```

The momentum coefficient gamma (typically 0.9) controls how much history is retained. Momentum accelerates convergence along consistent gradient directions and dampens oscillations in directions where the gradient frequently changes sign. It remains the standard choice for many [computer vision](/wiki/computer_vision) tasks, including training [ResNets](/wiki/resnet) and other [convolutional neural networks](/wiki/convolutional_neural_network).
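
The accelerating and damping effects can be seen in a small sketch of the update rule above. The two synthetic gradient sequences (constant, and alternating in sign) are assumptions made purely for illustration.

```python
def momentum_step(theta, v, grad, eta=0.1, gamma=0.9):
    v = gamma * v + eta * grad      # accumulate past gradients into the velocity
    theta = theta - v               # step along the velocity
    return theta, v

# Case 1: the gradient always points the same way, so the velocity builds up.
theta, v = 0.0, 0.0
for _ in range(200):
    theta, v = momentum_step(theta, v, grad=1.0)
steady_velocity = v                 # approaches eta / (1 - gamma) = 1.0

# Case 2: the gradient flips sign every step, so contributions largely cancel.
theta, v = 0.0, 0.0
for t in range(200):
    theta, v = momentum_step(theta, v, grad=1.0 if t % 2 == 0 else -1.0)
oscillating_velocity = abs(v)       # approaches eta / (1 + gamma), about 0.053

print(steady_velocity, oscillating_velocity)
```

With gamma = 0.9, a consistent gradient yields steps about ten times larger than plain SGD's `eta * grad`, while an oscillating gradient yields steps roughly half as large.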

### Nesterov accelerated gradient (NAG)

Proposed by Yurii Nesterov in 1983 for convex optimization, NAG modifies momentum by computing the gradient at a "look-ahead" position rather than the current position:[2]

```
v_t = gamma * v_{t-1} + eta * grad L(theta_t - gamma * v_{t-1})
theta_{t+1} = theta_t - v_t
```

By evaluating the gradient at the projected future position, NAG produces more responsive updates and achieves faster convergence rates on convex problems. Nesterov proved an optimal O(1/t^2) convergence rate for smooth convex functions, compared to the O(1/t) rate of standard gradient descent.[2] Sebastian Ruder's widely cited 2016 survey gives a unified overview of these gradient descent variants and the adaptive methods that followed.[6]
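
A minimal sketch of the NAG update written above, applied to the one-dimensional quadratic L(theta) = theta^2 / 2 (so the gradient is simply theta). The test function, starting point, and hyperparameters are illustrative assumptions.

```python
def nag(grad_fn, theta, eta=0.1, gamma=0.9, steps=200):
    v = 0.0
    for _ in range(steps):
        lookahead = theta - gamma * v              # where momentum alone would carry us
        v = gamma * v + eta * grad_fn(lookahead)   # gradient evaluated at the look-ahead point
        theta = theta - v
    return theta

# Quadratic bowl: grad L(theta) = theta, minimum at theta = 0.
theta_final = nag(lambda th: th, theta=5.0)
print(theta_final)
```

The only difference from standard momentum is the argument passed to `grad_fn`: the look-ahead position instead of the current one.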

## Adaptive learning rate optimizers

Adaptive methods maintain per-parameter learning rates that adjust automatically based on the history of gradients. This reduces the need to manually tune the global learning rate, although most adaptive methods still expose one as a hyperparameter.

### AdaGrad

Introduced by Duchi, Hazan, and Singer in 2011, [AdaGrad](/wiki/adagrad) accumulates the sum of squared gradients for each parameter and uses this sum to scale the learning rate:[3]

```
G_t = G_{t-1} + (grad L(theta_t))^2
theta_{t+1} = theta_t - (eta / sqrt(G_t + epsilon)) * grad L(theta_t)
```

Parameters that receive large gradients get smaller effective learning rates, while parameters with small or infrequent gradients retain larger learning rates. This makes AdaGrad well suited for problems with sparse features, such as [natural language processing](/wiki/natural_language_processing) tasks where rare words have infrequent but informative gradients. The main drawback is that the accumulated squared gradient sum grows monotonically, causing the effective learning rate to shrink to near zero over long training runs.
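
The shrinking effective learning rate is easy to see in a small sketch. The constant gradient of 1.0 and the base learning rate of 0.5 are assumptions chosen so the numbers are easy to read.

```python
import math

def adagrad_step(theta, G, grad, eta=0.5, eps=1e-8):
    G = G + grad ** 2                                 # running sum of squared gradients
    theta = theta - eta / math.sqrt(G + eps) * grad   # per-parameter scaled step
    return theta, G

# Feed a constant gradient and record the effective learning rate eta / sqrt(G).
theta, G, effective_lrs = 0.0, 0.0, []
for t in range(100):
    theta, G = adagrad_step(theta, G, grad=1.0)
    effective_lrs.append(0.5 / math.sqrt(G + 1e-8))

print(effective_lrs[0], effective_lrs[-1])
```

After 100 identical gradients the effective learning rate has dropped by a factor of ten (from about 0.5 to about 0.05), and it keeps falling as long as training continues.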

### RMSProp

Geoffrey Hinton proposed RMSProp in his 2012 Coursera lecture on neural networks. RMSProp fixes AdaGrad's diminishing learning rate problem by replacing the cumulative sum with an exponentially decaying average of squared gradients:

```
E[g^2]_t = rho * E[g^2]_{t-1} + (1 - rho) * (grad L(theta_t))^2
theta_{t+1} = theta_t - (eta / sqrt(E[g^2]_t + epsilon)) * grad L(theta_t)
```

The decay rate rho (typically 0.9) ensures that only recent gradient magnitudes influence the per-parameter learning rate. RMSProp was never formally published in a peer-reviewed paper, but it became one of the most widely used optimizers in practice, particularly for [recurrent neural networks](/wiki/recurrent_neural_network) and [reinforcement learning](/wiki/reinforcement_learning).
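
A minimal sketch of the RMSProp rule above, run under a constant gradient of 1.0 (an illustrative assumption). Unlike a cumulative sum, the moving average levels off, so the effective learning rate settles near eta instead of decaying toward zero.

```python
import math

def rmsprop_step(theta, avg_sq, grad, eta=0.01, rho=0.9, eps=1e-8):
    avg_sq = rho * avg_sq + (1 - rho) * grad ** 2          # decaying average of squared gradients
    theta = theta - eta / math.sqrt(avg_sq + eps) * grad
    return theta, avg_sq

theta, avg_sq = 0.0, 0.0
for _ in range(100):
    theta, avg_sq = rmsprop_step(theta, avg_sq, grad=1.0)

effective_lr = 0.01 / math.sqrt(avg_sq + 1e-8)   # stays close to eta = 0.01
print(effective_lr)
```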

### Adadelta

Matthew Zeiler introduced Adadelta in 2012 as an extension of AdaGrad.[4] Like RMSProp, it uses an exponentially decaying average of squared gradients. It also maintains a running average of squared parameter updates, which replaces the global learning rate entirely:

```
E[g^2]_t = rho * E[g^2]_{t-1} + (1 - rho) * (grad L(theta_t))^2
delta_theta_t = -(sqrt(E[delta_theta^2]_{t-1} + epsilon) / sqrt(E[g^2]_t + epsilon)) * grad L(theta_t)
E[delta_theta^2]_t = rho * E[delta_theta^2]_{t-1} + (1 - rho) * (delta_theta_t)^2
```

By computing the ratio of update RMS to gradient RMS, Adadelta achieves correct units for the parameter update without requiring a manually specified learning rate.

### Adam

Kingma and Ba introduced [Adam](/wiki/adam_optimizer) (Adaptive Moment Estimation) in a 2014 paper published at ICLR 2015.[5] As the authors put it, "We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments."[5] Adam combines the first moment estimate (momentum) with the second moment estimate (RMSProp-style adaptive learning rate), plus bias correction to account for the zero initialization of the moment estimates:

```
m_t = beta_1 * m_{t-1} + (1 - beta_1) * grad L(theta_t)       (first moment)
v_t = beta_2 * v_{t-1} + (1 - beta_2) * (grad L(theta_t))^2   (second moment)
m_hat_t = m_t / (1 - beta_1^t)                                 (bias correction)
v_hat_t = v_t / (1 - beta_2^t)                                 (bias correction)
theta_{t+1} = theta_t - eta * m_hat_t / (sqrt(v_hat_t) + epsilon)
```

The default hyperparameters (beta_1 = 0.9, beta_2 = 0.999, epsilon = 1e-8) work well across a wide range of problems, which is a major reason for Adam's popularity. Adam is computationally efficient, has modest memory requirements (two buffers per parameter), and is relatively insensitive to the choice of learning rate compared to SGD. The Adam paper is among the most cited works in modern machine learning and helped make adaptive optimization the default for [deep learning](/wiki/deep_learning).
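
The five update equations translate directly into code. The sketch below handles a single scalar parameter and uses a learning rate of 1e-3, which is an assumption for illustration; it then checks one consequence of bias correction, namely that the very first step has magnitude close to eta whatever the scale of the gradient.

```python
import math

def adam_step(theta, m, v, grad, t, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad          # first moment
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment
    m_hat = m / (1 - beta1 ** t)                # bias correction (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - eta * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# First step from theta = 1.0 with a very large and a very small gradient.
first_steps = {}
for g in (250.0, 0.004):
    theta_new, _, _ = adam_step(1.0, 0.0, 0.0, grad=g, t=1)
    first_steps[g] = 1.0 - theta_new

print(first_steps)
```

Both gradients, which differ by more than four orders of magnitude, produce a first step of roughly 1e-3. This scale invariance is part of why one learning rate tends to transfer across problems.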

### AdamW

Loshchilov and Hutter published "Decoupled Weight Decay Regularization" in 2017 (ICLR 2019), showing that the standard way of implementing L2 regularization in Adam is not equivalent to true [weight decay](/wiki/weight_decay).[8] Their central observation: "L2 regularization and weight decay regularization are equivalent for standard stochastic gradient descent (when rescaled by the learning rate), but as we demonstrate this is not the case for adaptive gradient algorithms, such as Adam."[8] In Adam with L2 regularization, the regularization gradient gets scaled by the adaptive learning rate, which weakens the regularization effect for parameters with large gradient histories. AdamW fixes this by decoupling weight decay from the gradient-based update:

```
m_t, v_t = (same as Adam)
theta_{t+1} = theta_t - eta * (m_hat_t / (sqrt(v_hat_t) + epsilon) + lambda * theta_t)
```

Here, lambda is the weight decay coefficient applied directly to the parameters, independent of the adaptive scaling. AdamW has become the default optimizer for training [transformers](/wiki/transformer), [large language models](/wiki/large_language_model), and many other modern architectures.
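
The difference between coupled L2 regularization and decoupled weight decay can be sketched with one function and a flag. The function name `adam_family_step`, the `decoupled` flag, and the values eta = 0.1 and lambda = 0.1 are hypothetical choices for this illustration. The loss gradient is set to zero so that only the regularization term moves the parameter.

```python
import math

def adam_family_step(theta, m, v, loss_grad, t, eta, lam, decoupled,
                     beta1=0.9, beta2=0.999, eps=1e-8):
    # Coupled L2 folds lam * theta into the gradient before the moment updates.
    g = loss_grad if decoupled else loss_grad + lam * theta
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    update = m_hat / (math.sqrt(v_hat) + eps)
    if decoupled:
        update += lam * theta       # AdamW: decay bypasses the adaptive scaling
    return theta - eta * update, m, v

# One step from theta = 1.0 with zero loss gradient.
theta_l2, _, _ = adam_family_step(1.0, 0.0, 0.0, 0.0, t=1, eta=0.1, lam=0.1, decoupled=False)
theta_w, _, _ = adam_family_step(1.0, 0.0, 0.0, 0.0, t=1, eta=0.1, lam=0.1, decoupled=True)
print(theta_l2, theta_w)
```

In the coupled case the decay term is normalized by its own second moment, so the parameter moves by about eta (to roughly 0.9) no matter what lambda is. In the decoupled case it shrinks by exactly eta * lambda * theta (to 0.99), which is the behavior weight decay is meant to have.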

### NAdam

Timothy Dozat proposed NAdam (Nesterov-accelerated Adaptive Moment Estimation) in 2016 by incorporating Nesterov momentum into Adam.[7] Instead of using the current first-moment estimate for the update, NAdam uses the look-ahead first moment, similar to how NAG looks ahead relative to standard momentum. NAdam generally converges faster than Adam on tasks where Nesterov momentum provides a benefit, including language modeling and certain computer vision workloads.

### RAdam

Liu et al. introduced RAdam (Rectified Adam) at ICLR 2020.[9] They identified that Adam's adaptive learning rate has problematically high variance in the early steps of training because the second moment estimate is computed from very few samples. This variance is the underlying reason why learning rate warmup helps Adam. RAdam estimates the variance of the adaptive learning rate and applies a rectification term that automatically suppresses the variance when it is too high. The result is an optimizer that adapts between SGD-like behavior early in training and full Adam behavior later, removing the need to manually tune a warmup schedule.

## Memory-efficient optimizers

As models have grown to billions of parameters, optimizer memory consumption has become a practical bottleneck. Adam and AdamW store two state buffers (first and second moment) per parameter, so the optimizer state alone occupies twice as much memory as the parameters, on top of the memory already needed for the parameters and gradients themselves.

### Adafactor

Shazeer and Stern introduced Adafactor in 2018 (ICML 2018) to reduce the memory cost of adaptive optimizers.[10] For a weight matrix of size m x n, Adam stores an m x n second-moment buffer. Adafactor factorizes this into per-row and per-column statistics (maintaining only the per-row and per-column sums of the moving average of squared gradients, then reconstructing per-parameter estimates from these sums), reducing memory from O(m * n) to O(m + n).[10] Adafactor also replaces bias correction with a slowly increasing second-moment decay rate and adds update clipping to keep step sizes stable when momentum is dropped. The combination achieves comparable results to Adam on [Transformer](/wiki/transformer) training while using significantly less memory. Google used Adafactor for training the T5 model family.
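
The row-and-column reconstruction can be sketched on a tiny matrix. This is only the factorization idea: the moving averages, update clipping, and decay schedule of the full algorithm are omitted, and the 2 x 3 example matrix is an assumption. It is chosen to be rank 1, the case in which the reconstruction is exact.

```python
# A 2 x 3 matrix of squared-gradient statistics; row 2 is twice row 1 (rank 1).
V = [[1.0, 2.0, 3.0],
     [2.0, 4.0, 6.0]]

row_sums = [sum(row) for row in V]          # m numbers
col_sums = [sum(col) for col in zip(*V)]    # n numbers
total = sum(row_sums)

# Per-entry estimate rebuilt from the m + n stored sums.
V_hat = [[r * c / total for c in col_sums] for r in row_sums]

stored_values = len(row_sums) + len(col_sums)   # 5 instead of 6 entries
print(V_hat, stored_values)
```

For a 2 x 3 matrix the saving is trivial, but for a 4096 x 4096 weight matrix the same scheme stores about 8 thousand values instead of about 16.8 million. For matrices that are not rank 1 the reconstruction is an approximation.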

### Lion

Chen et al. at Google Brain discovered Lion (EvoLved Sign Momentum) through automated program search, described in their 2023 NeurIPS paper "Symbolic Discovery of Optimization Algorithms."[12] Rather than being designed by hand, Lion was found by searching over a space of possible optimizer programs using evolutionary methods. The resulting algorithm is remarkably simple: it uses only the sign of a momentum-based interpolation to determine the update direction:

```
update = sign(beta_1 * m_{t-1} + (1 - beta_1) * grad L(theta_t))
theta_{t+1} = theta_t - eta * update
m_t = beta_2 * m_{t-1} + (1 - beta_2) * grad L(theta_t)
```

Because Lion stores only one momentum buffer (compared to Adam's two) and uses a uniform update magnitude, it roughly halves the optimizer memory overhead. The two coefficients play distinct roles: the update uses an interpolation weighted by beta_1, while the momentum buffer that is carried forward is tracked with beta_2, which lets the running momentum retain a longer history than the value used for the current step. Because the update magnitude is decoupled from the gradient scale, Lion requires a learning rate 3 to 10 times smaller than Adam's, with weight decay correspondingly larger.[12] On image classification with [ViT](/wiki/vision_transformer), Lion improved ImageNet accuracy by up to 2% and saved up to 5x the pre-training compute on JFT.[12] On [diffusion models](/wiki/diffusion_model), it achieved a better FID score than Adam while reducing training compute by up to 2.3x.[12] Lion has also been deployed in production at Google for search ads click-through-rate models.
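
A scalar sketch of the three-line Lion update shown above. Weight decay is omitted, as in the pseudocode, and the hyperparameter values (eta = 1e-4, beta_1 = 0.9, beta_2 = 0.99) are illustrative assumptions rather than values taken from the text.

```python
def sign(x):
    return (x > 0) - (x < 0)        # returns -1, 0, or 1

def lion_step(theta, m, grad, eta=1e-4, beta1=0.9, beta2=0.99):
    update = sign(beta1 * m + (1 - beta1) * grad)   # direction only, magnitude discarded
    theta = theta - eta * update
    m = beta2 * m + (1 - beta2) * grad              # the single state buffer
    return theta, m

# Gradients that differ by six orders of magnitude give the same step size.
theta_big, _ = lion_step(1.0, 0.0, grad=1000.0)
theta_small, _ = lion_step(1.0, 0.0, grad=0.001)
print(theta_big, theta_small)
```

Every coordinate moves by exactly eta per step, which is why the learning rate has to be set smaller than a typical Adam learning rate.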

## Second-order and curvature-aware optimizers

First-order optimizers use only gradient information. Second-order optimizers also incorporate curvature information (the Hessian matrix or approximations of it) to take more informed steps. While full Newton's method computes the inverse Hessian (which is prohibitively expensive for neural networks with millions of parameters), several practical approximations exist. These methods are sometimes called natural gradient or preconditioned methods because they reshape the gradient by a curvature estimate before stepping.

### K-FAC

James Martens and Roger Grosse introduced K-FAC (Kronecker-Factored Approximate Curvature) at ICML 2015.[17] K-FAC builds an efficiently invertible approximation of a neural network's [Fisher information matrix](/wiki/fisher_information), which is used in place of the Hessian to perform approximate [natural gradient descent](/wiki/natural_gradient). The Fisher is treated as block-diagonal across layers, and each layer's block is approximated as the Kronecker product of two much smaller matrices, one built from the layer's input activations and one from the backpropagated gradients. Inverting a Kronecker product reduces to inverting its two small factors, which is far cheaper than inverting the full block and, unlike a purely diagonal approximation, still captures correlations between parameters within a layer. Martens and Grosse reported that while K-FAC updates cost only several times more than a plain stochastic gradient step, each update makes much more optimization progress, so the method can be substantially faster than SGD with momentum in practice. Notably, the cost of storing and inverting the approximation does not grow with the amount of data used to estimate it, which lets K-FAC work well in highly stochastic regimes. K-FAC has since been extended to convolutional networks, recurrent networks, and distributed large-batch training, and it influenced later structure-aware preconditioners such as Shampoo.

### Sophia

Liu et al. introduced Sophia (Second-order Clipped Stochastic Optimization) in 2023 (ICLR 2024) for [language model](/wiki/language_model) pre-training.[13] Sophia uses a lightweight diagonal Hessian estimate as a pre-conditioner. The update is the moving average of the gradients divided by the moving average of the estimated diagonal Hessian, followed by element-wise clipping. The clipping controls worst-case update sizes, which tames instability from non-convexity and rapid Hessian changes. Sophia estimates the diagonal Hessian only every few iterations (for example every ten steps), keeping the average per-step overhead negligible. On GPT models from 125M to 1.5B parameters, Sophia achieved a 2x speedup over Adam in steps, total compute, and wall-clock time to reach the same perplexity.[13]

### Shampoo

Gupta et al. introduced Shampoo in 2018 as a structure-aware preconditioning algorithm for stochastic optimization over tensor spaces.[14] For a weight matrix W of size m x n, instead of maintaining a single m*n by m*n preconditioner (which would be infeasible), Shampoo maintains separate preconditioners of size m x m and n x n, one for each tensor dimension. Anil et al. (2021) developed a distributed implementation of Shampoo that demonstrated strong performance on large-scale training tasks. More recently, Vyas et al. (2024) introduced SOAP, which combines Shampoo's preconditioning with Adam's per-element adaptivity for improved stability. A distributed Shampoo implementation won the external-tuning track of the inaugural AlgoPerf training-algorithms benchmark in 2024, providing evidence that preconditioned methods can beat well-tuned AdamW and NAdam baselines on wall-clock time-to-result across a suite of workloads.[20]

## Large-batch optimizers

### LAMB

You et al. introduced LAMB (Layer-wise Adaptive Moments optimizer for Batch training) in 2019 for scaling up batch sizes during training.[11] LAMB combines Adam's adaptive per-parameter scaling with a layer-wise trust ratio inspired by LARS (Layer-wise Adaptive Rate Scaling). The trust ratio normalizes the update magnitude relative to the parameter magnitude for each layer, preventing any single layer from receiving disproportionately large updates. The headline result was reducing BERT pre-training time from 3 days to 76 minutes by scaling the second training phase to a batch size of 32,768 on TPUv3 Pods without degrading performance.[11] However, subsequent work by Nado et al. (2021) showed that standard Adam with careful tuning can match LAMB at large batch sizes.
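
The trust ratio on its own can be sketched as follows. In LAMB the `update` vector would be the Adam direction plus weight decay; here it is simply passed in, and the helper names `l2_norm` and `apply_trust_ratio` are hypothetical.

```python
import math

def l2_norm(xs):
    return math.sqrt(sum(x * x for x in xs))

def apply_trust_ratio(weights, update, eta):
    """Scale one layer's update so its norm is eta times the weight norm."""
    w_norm, u_norm = l2_norm(weights), l2_norm(update)
    ratio = w_norm / u_norm if w_norm > 0 and u_norm > 0 else 1.0
    return [w - eta * ratio * u for w, u in zip(weights, update)], ratio

layer = [3.0, 4.0]                                               # weight norm 5
small, ratio_small = apply_trust_ratio(layer, [0.6, 0.8], eta=0.01)    # update norm 1
large, ratio_large = apply_trust_ratio(layer, [60.0, 80.0], eta=0.01)  # update norm 100
print(small, large, ratio_small, ratio_large)
```

Both calls move the layer by the same amount (a step of norm 0.05, or 1% of the weight norm), even though the raw updates differ by a factor of 100.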

## Emerging optimizers

### What is the Muon optimizer?

Keller Jordan introduced Muon (MomentUm Orthogonalized by Newton-Schulz) in December 2024, defining it succinctly as "an optimizer for 2D parameters of neural network hidden layers."[15] Muon treats each weight update as a matrix rather than a collection of independent scalars. It runs standard SGD with Nesterov momentum and then replaces each 2D parameter's update with the nearest semi-orthogonal matrix, that is, the semi-orthogonal matrix closest to the momentum matrix in Frobenius norm. Whereas Adam scales each parameter independently, Muon exploits the geometric structure of weight matrices.

The orthogonalization is computed with a Newton-Schulz iteration rather than an exact (and expensive) singular value decomposition. Jordan's implementation uses a quintic polynomial iteration with non-convergent coefficients (3.4445, -4.7750, 2.0315) run for five steps, deliberately tuned to maximize the slope near zero so that very few iterations are needed and the iteration remains stable in bfloat16.[15] Because the procedure only makes sense for 2D weights, Muon is applied to hidden-layer weight matrices, while scalar and vector parameters together with the input embeddings and output (classifier) head are left to a standard method such as AdamW. For typical transformer training the extra Newton-Schulz work adds well under 1% to the total FLOPs.[15]
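
The quintic iteration can be sketched in plain Python using the coefficients quoted above. This is an illustrative reconstruction, not Jordan's code: it assumes the iteration has the form X <- aX + (bA + cA^2)X with A = XX^T after normalizing by the Frobenius norm, it runs in ordinary floating point rather than bfloat16, and the helper names are hypothetical. A diagonal input is used so that the singular values can be read directly off the diagonal.

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def newton_schulz5(G, steps=5, eps=1e-7):
    a, b, c = 3.4445, -4.7750, 2.0315
    fro = math.sqrt(sum(x * x for row in G for x in row))
    X = [[x / (fro + eps) for x in row] for row in G]     # singular values now at most 1
    for _ in range(steps):
        A = matmul(X, transpose(X))                        # A = X X^T
        AA = matmul(A, A)
        B = [[b * A[i][j] + c * AA[i][j] for j in range(len(A))] for i in range(len(A))]
        BX = matmul(B, X)
        X = [[a * X[i][j] + BX[i][j] for j in range(len(X[0]))] for i in range(len(X))]
    return X

# Input singular values are 3 and 1; the output values both land near 1.
O = newton_schulz5([[3.0, 0.0], [0.0, 1.0]])
print(O)
```

The output singular values do not converge to exactly 1, which is what "non-convergent coefficients" refers to: they end up in a band around 1, which the method treats as close enough to orthogonal.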

Muon first attracted attention through speedrunning: it improved the NanoGPT speed record by a factor of about 1.35 (roughly a 26% reduction in training time versus AdamW, since 1/1.35 is about 0.74) and lowered the CIFAR-10 record for reaching 94% accuracy from 3.3 to 2.6 A100-seconds.[15] The modded-nanogpt benchmark, which races to reach a FineWeb validation loss of 3.28 (a target Andrej Karpathy's llm.c GPT-2 reproduction reached in about 45 minutes on 8 H100s in May 2024), was driven down to roughly 3 minutes on the same hardware over the following months as the community layered Muon with architecture and data-loading improvements.[15][18] At larger scale, Jordan reported training a 1.5B parameter transformer to GPT-2 XL quality in 10 hours on an 8xH100 node, versus about 13.3 hours for an AdamW baseline.[15]

#### How does Muon scale to large language models?

A Moonshot AI team led by Jingyuan Liu and Jianlin Su published "Muon is Scalable for LLM Training" in February 2025, showing that two changes let Muon work out of the box at scale without per-parameter tuning: adding decoupled weight decay, and rescaling each parameter's update so that the per-parameter update root-mean-square is consistent across the model (Muon's raw orthogonalized updates otherwise have a different scale than Adam's, which complicates transferring hyperparameters).[16] Their scaling-law experiments found that Muon reaches the same loss as a compute-optimal AdamW run with roughly half the training FLOPs, about 2x compute efficiency. They validated this by training Moonlight, a Mixture-of-Experts model with 3B activated and 16B total parameters, on 5.7 trillion tokens, and open-sourced a distributed Muon implementation designed to be memory and communication efficient.[16]

Muon then moved to frontier scale. Moonshot AI's Kimi K2, a Mixture-of-Experts model with about 1.04 trillion total parameters and 32 billion activated, was pre-trained on 15.5 trillion tokens using an optimizer the team calls MuonClip: Muon plus a technique named QK-Clip that rescales the query and key projection weights whenever attention logits grow too large.[19] Exploding attention logits are a common source of loss spikes when training large Muon models, and Moonshot reported that QK-Clip let them pre-train Kimi K2 with zero loss spikes across the entire run and no training instability.[19] This made Kimi K2 one of the first publicly documented trillion-parameter models trained without Adam-family optimizers as the primary update rule.

Muon holds training speed records on the public NanoGPT and CIFAR-10 speedrunning leaderboards, and a growing line of follow-up work (for example adaptive and distributed variants) has continued to refine it.

### Schedule-Free optimization

Schedule-Free methods, introduced by Defazio et al. in "The Road Less Scheduled" (NeurIPS 2024), remove the learning rate schedule entirely.[21] Instead of decaying the learning rate over a pre-set horizon, the method maintains a running average of the iterates and evaluates gradients at an interpolation between the average and the most recent point, which unifies the roles of momentum, iterate averaging, and scheduling. Because there is no schedule, training does not need to commit in advance to a total number of steps, which is useful when the stopping time is unknown. Schedule-Free wrappers exist for SGD and AdamW, and Schedule-Free AdamW was the algorithm behind the winning entry in the self-tuning track of the 2024 AlgoPerf competition, where it trained roughly 8% faster than the baseline while introducing no extra hyperparameters.[20][21]

## Optimizer comparison

| Optimizer | Year | Per-parameter LR | State memory per parameter | Typical use cases |
|---|---|---|---|---|
| SGD | 1951 | No | None | Simple convex problems, baseline |
| SGD + Momentum | 1964/1986 | No | 1 buffer (velocity) | [Computer vision](/wiki/computer_vision), [CNNs](/wiki/convolutional_neural_network) |
| NAG | 1983 | No | 1 buffer (velocity) | Convex optimization, some vision tasks |
| [AdaGrad](/wiki/adagrad) | 2011 | Yes | 1 buffer (sum of squared gradients) | Sparse features, [NLP](/wiki/natural_language_processing) |
| RMSProp | 2012 | Yes | 1 buffer (EMA of squared gradients) | [RNNs](/wiki/recurrent_neural_network), [reinforcement learning](/wiki/reinforcement_learning) |
| Adadelta | 2012 | Yes (no global LR needed) | 2 buffers | General-purpose |
| [Adam](/wiki/adam_optimizer) | 2014 | Yes | 2 buffers (first and second moments) | General-purpose default |
| K-FAC | 2015 | Natural gradient | Kronecker factors per layer | Natural-gradient training, large-batch |
| NAdam | 2016 | Yes | 2 buffers | Language modeling, vision |
| [AdamW](/wiki/adamw) | 2017 | Yes | 2 buffers | [Transformers](/wiki/transformer), [LLMs](/wiki/large_language_model) |
| Adafactor | 2018 | Yes (factorized) | O(m+n) instead of O(m*n) | Large transformers (T5) |
| Shampoo | 2018 | Yes (matrix preconditioner) | 2 preconditioner matrices | Large-scale distributed training |
| LAMB | 2019 | Yes (layer-wise trust ratio) | 2 buffers + trust ratio | Large-batch distributed training |
| RAdam | 2020 | Yes (rectified) | 2 buffers | General-purpose (no warmup needed) |
| Lion | 2023 | Sign-based | 1 buffer (momentum) | Vision, language, diffusion models |
| Sophia | 2023 | Curvature-based | 2 buffers + Hessian estimate | LLM pre-training |
| Schedule-Free (AdamW) | 2024 | Yes (no schedule) | 2 buffers + iterate average | General-purpose, self-tuning |
| Muon | 2024 | Orthogonalized | 1 buffer (momentum) | LLM pre-training (Moonlight, Kimi K2), speed records |

## How much memory does each optimizer use at scale?

Optimizer state is often the largest memory consumer during training, especially for billion-parameter models. The following table shows approximate optimizer state memory for a 1.5 billion parameter model stored in FP32:

| Optimizer | Buffers per parameter | FP32 state memory (1.5B params) | Notes |
|---|---|---|---|
| SGD (no momentum) | 0 | 0 GB | No extra state |
| SGD + Momentum | 1 | ~6 GB | One velocity buffer |
| Adam / AdamW | 2 | ~12 GB | First and second moments |
| Adafactor | ~0.01 (factorized) | ~0.1 GB (approx.) | Row and column statistics |
| Lion | 1 | ~6 GB | Single momentum buffer |
| Sophia | 2 + Hessian | ~12+ GB | Moments plus periodic Hessian |

When combined with [mixed-precision training](/wiki/mixed_precision_training), model parameters and gradients can be stored in FP16 or BF16, but optimizer states are typically kept in FP32 to preserve numerical precision for small gradient updates. Techniques like [ZeRO](/wiki/zero_optimization) (from [DeepSpeed](/wiki/deepspeed)) shard optimizer states across multiple GPUs to reduce per-device memory.
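
The FP32 figures in the table above follow from simple arithmetic: parameters times buffers times 4 bytes. The helper name `optimizer_state_gb` is hypothetical, and the sketch uses decimal gigabytes (1e9 bytes).

```python
def optimizer_state_gb(num_params, buffers_per_param, bytes_per_value=4):
    """Optimizer state size in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * buffers_per_param * bytes_per_value / 1e9

num_params = 1_500_000_000
momentum_gb = optimizer_state_gb(num_params, buffers_per_param=1)   # SGD + momentum, Lion
adam_gb = optimizer_state_gb(num_params, buffers_per_param=2)       # Adam, AdamW
print(momentum_gb, adam_gb)
```

Passing `bytes_per_value=2` gives the corresponding figure if a state buffer were kept in a 16-bit format instead.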

## Learning rate schedules

The optimizer and the learning rate schedule work together. The schedule controls how the learning rate changes over the course of training, while the optimizer determines how the learning rate is applied to each parameter.

### Common schedules

| Schedule | Description | Typical pairing |
|---|---|---|
| Constant | Learning rate stays fixed | SGD, debugging |
| Step decay | Multiply LR by a factor (e.g., 0.1) at fixed epochs | SGD + Momentum for vision |
| Exponential decay | LR decays by a fixed ratio each epoch | General-purpose |
| Cosine annealing | LR follows a cosine curve from max to min | AdamW for transformers |
| Linear warmup + cosine decay | LR ramps up linearly then decays via cosine | AdamW for LLM pre-training |
| Cyclic LR | LR oscillates between bounds | SGD for exploring loss landscape |
| One-cycle | LR increases then decreases over one cycle | SGD + Momentum (fast training) |

### Warmup

Learning rate warmup starts training with a very small learning rate and linearly increases it to the target value over a set number of steps. Warmup is particularly important for adaptive optimizers like Adam because the second-moment estimates have high variance in the early steps when they are computed from very few gradient samples. Warmup allows the optimizer to collect accurate gradient statistics before making large updates. An additional benefit is that warmup helps the model move away from sharp, poorly conditioned regions of the loss surface toward flatter regions that tolerate larger learning rates.
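
A linear warmup followed by cosine decay can be written as a single function of the step index. The peak and floor learning rates, the 100 warmup steps, and the 1000 total steps below are illustrative assumptions, and `lr_at` is a hypothetical name.

```python
import math

def lr_at(step, max_lr=3e-4, min_lr=3e-5, warmup_steps=100, total_steps=1000):
    if step < warmup_steps:
        # Linear ramp that reaches max_lr on the last warmup step.
        return max_lr * (step + 1) / warmup_steps
    # Cosine decay from max_lr down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

for s in (0, 50, 99, 100, 550, 1000):
    print(s, lr_at(s))
```

In a training loop the returned value would be assigned to the optimizer's learning rate before each step.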

## Gradient clipping

[Gradient clipping](/wiki/gradient_clipping) is a stability technique used alongside the optimizer to prevent exploding gradients. It is applied after backpropagation but before the optimizer step. There are two common forms:

- **Clipping by norm.** If the global norm of the gradient vector exceeds a threshold (e.g., 1.0), the entire gradient is scaled down proportionally. This is the most common approach and is used in virtually every large-scale transformer training pipeline.
- **Clipping by value.** Each gradient component is independently clamped to a fixed range (e.g., [-1, 1]).

Gradient clipping is especially important for recurrent neural networks and transformers, where long sequences can cause gradient magnitudes to grow exponentially through many layers of computation.
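
The two forms of clipping can be sketched on a flat list of gradient values. The function names and the small constant added to the norm are assumptions for illustration; deep learning frameworks provide their own utilities for this.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0):
    total = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total + 1e-12))   # shrink only if the norm exceeds the threshold
    return [g * scale for g in grads]

def clip_by_value(grads, limit=1.0):
    return [max(-limit, min(limit, g)) for g in grads]

grads = [3.0, 4.0]                       # global norm 5
by_norm = clip_by_global_norm(grads)     # rescaled to norm 1, direction preserved
by_value = clip_by_value(grads)          # each component clamped, direction changes
print(by_norm, by_value)
```

Clipping by norm keeps the ratio between components (3 to 4 here), whereas clipping by value flattens both components to the same limit and so changes the direction of the step.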

## SGD vs. adaptive methods: which generalizes better?

Wilson et al. (2017) showed that adaptive optimizers like Adam sometimes find solutions that generalize worse than SGD with momentum on certain tasks, particularly image classification.[22] One hypothesis is that SGD's noisy gradient estimates make it more likely to converge to flat minima in the loss landscape, which tend to generalize better to unseen data. Adaptive methods, by contrast, may converge to sharper minima because their per-parameter learning rates allow them to navigate narrow valleys that SGD would skip over.

More recent theoretical work (Zhou et al., 2020) formalized this observation, showing that SGD is more locally unstable at sharp minima and can escape them to reach flatter regions.[23] AdamW partially addresses the generalization gap by properly decoupling weight decay from the adaptive update, which provides more consistent regularization.

In modern practice, the generalization gap has narrowed considerably. AdamW with appropriate weight decay and learning rate scheduling matches or exceeds SGD on most benchmarks, which is why it has become the dominant optimizer for transformer-based architectures.
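
To make the decoupling of weight decay concrete, here is a minimal single-parameter sketch comparing Adam with an L2 penalty folded into the gradient against an AdamW-style decoupled decay. The function `adam_step` and its defaults are assumptions for illustration; real implementations operate on tensors and differ in details.

```python
import math

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.01, decoupled=True):
    """One Adam update on a scalar weight; t is the 1-indexed step count."""
    if not decoupled:
        # Adam + L2: the penalty joins the gradient, so it is also
        # divided by sqrt(v_hat) like any other gradient term.
        grad = grad + weight_decay * w
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)  # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)  # bias-corrected second moment
    adaptive = lr * m_hat / (math.sqrt(v_hat) + eps)
    # AdamW: decay is applied directly to the weight, outside the adaptive scaling.
    decay = lr * weight_decay * w if decoupled else 0.0
    return w - adaptive - decay, m, v

# A zero loss gradient isolates the effect of weight decay on the first step.
w_l2, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1, decoupled=False)
w_adamw, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1, decoupled=True)
print(w_l2, w_adamw)
```

In this toy case the coupled variant moves the weight by roughly `lr` whatever the value of `weight_decay`, because the adaptive normalization cancels the penalty's scale. The decoupled variant shrinks the weight by exactly `lr * weight_decay * w`, so the regularization strength stays under direct control.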

## Which optimizer should you use?

The following guidelines reflect common practice as of 2026:

1. **Start with AdamW** for most tasks. Its default hyperparameters are robust, and it works well across architectures and domains.
2. **Use SGD with momentum** when training convolutional neural networks for image classification, especially if training from scratch with large datasets. Many state-of-the-art vision results still use SGD.
3. **Use AdamW with linear warmup and cosine decay** for training transformers and language models. This combination is the near-universal default for LLM pre-training.
4. **Consider Adafactor** when GPU memory is limited and you are training large transformer models. It provides Adam-like behavior with substantially lower memory.
5. **Try Lion** for efficiency-sensitive workloads. It halves optimizer memory and has shown strong results on vision and language tasks.
6. **Evaluate Muon** for LLM pre-training if you want to reduce training FLOPs. Scaling-law experiments and frontier deployments (Moonlight and the trillion-parameter Kimi K2) suggest it can roughly halve the compute needed to reach a target loss, though it must be paired with AdamW for embeddings, heads, and other non-matrix parameters, and stabilized (for example with QK-Clip) at very large scale.
7. **Use LAMB** when scaling to very large batch sizes in distributed training.
8. **Apply AdaGrad** for problems with sparse features, such as recommendation systems or NLP tasks with large vocabularies.
9. **Always implement learning rate scheduling.** Even adaptive optimizers benefit from warmup and decay schedules.
10. **Use gradient clipping** (typically max norm 1.0) for any transformer or RNN training to prevent training instability.
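
The pairing of Muon with AdamW in guideline 6 comes down to partitioning the parameters. The following is a rough sketch of one way to do that split from parameter names and shapes. The helper `split_for_muon`, the name hints, and the example parameter names are all hypothetical, and real projects may use different rules.

```python
def split_for_muon(named_shapes, adamw_name_hints=("embed", "lm_head")):
    """Partition parameter names: 2-D hidden matrices for Muon, everything else for AdamW.

    named_shapes maps a parameter name to its shape tuple. The name hints
    marking embeddings and output heads are assumptions for illustration.
    """
    muon, adamw = [], []
    for name, shape in named_shapes.items():
        is_matrix = len(shape) == 2
        is_excluded = any(hint in name for hint in adamw_name_hints)
        if is_matrix and not is_excluded:
            muon.append(name)
        else:
            adamw.append(name)
    return muon, adamw

# Hypothetical parameter names and shapes for a small transformer.
shapes = {
    "embed.weight": (50000, 512),
    "layers.0.attn.q_proj.weight": (512, 512),
    "layers.0.norm.weight": (512,),
    "layers.0.mlp.bias": (2048,),
    "lm_head.weight": (50000, 512),
}
muon_params, adamw_params = split_for_muon(shapes)
print(muon_params)
print(adamw_params)
```

Each resulting group would then be handed to its own optimizer instance, with the two optimizers stepped together in the training loop.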

## Optimizers by domain

| Domain | Common optimizer choices | Notes |
|---|---|---|
| Image classification | SGD + Momentum, AdamW | SGD traditional for CNNs; AdamW increasingly popular with ViTs |
| Object detection | SGD + Momentum, AdamW | Often inherits the backbone's optimizer choice |
| [Natural language processing](/wiki/natural_language_processing) | AdamW | Near-universal default for transformers |
| [Large language models](/wiki/large_language_model) | AdamW, Lion, Muon, Sophia | AdamW standard; alternatives offer efficiency gains |
| [Diffusion models](/wiki/diffusion_model) | AdamW, Lion | The Lion paper reports up to 2.3x less training compute than Adam on diffusion models[12] |
| [Reinforcement learning](/wiki/reinforcement_learning) | Adam, RMSProp | Adam common in policy gradient methods; RMSProp in value-based methods |
| [GANs](/wiki/generative_adversarial_network) | Adam | Adaptive rates help stabilize adversarial training |
| [Speech recognition](/wiki/speech_recognition) | Adam, AdamW | Adaptive methods work well for sequence-to-sequence models |
| Recommendation systems | AdaGrad, Adam | AdaGrad's sparse feature handling is beneficial |

## Implementation examples

### PyTorch

```python
import torch
import torch.optim as optim
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR

# `model` and `dataloader` are assumed to be defined elsewhere;
# `model(batch)` is assumed to return a scalar loss.

# AdamW with decoupled weight decay
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)

# Alternative: SGD with Nesterov momentum (use instead of AdamW, not in addition)
# optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9, nesterov=True)

# Learning rate scheduler: 1,000 steps of linear warmup, then 50,000 steps of cosine decay
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=1000)
cosine = CosineAnnealingLR(optimizer, T_max=50000)
scheduler = SequentialLR(optimizer, schedulers=[warmup, cosine], milestones=[1000])

# Training loop with gradient clipping
for batch in dataloader:
    optimizer.zero_grad()
    loss = model(batch)
    loss.backward()
    # Clip after backward() and before step()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()  # stepped once per batch, so the counts above are in steps
```

### TensorFlow / Keras

```python
import tensorflow as tf

# AdamW with a constant learning rate
optimizer = tf.keras.optimizers.AdamW(learning_rate=1e-3, weight_decay=0.01)

# Alternative: SGD with Nesterov momentum
optimizer = tf.keras.optimizers.SGD(learning_rate=0.1, momentum=0.9, nesterov=True)

# Cosine decay schedule with warmup: the rate ramps linearly from
# initial_learning_rate to warmup_target over warmup_steps, then decays.
# Starting at 0.0 is what makes this an actual warmup.
lr_schedule = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=0.0,
    decay_steps=50000,
    warmup_target=1e-3,
    warmup_steps=1000
)
optimizer = tf.keras.optimizers.AdamW(learning_rate=lr_schedule, weight_decay=0.01)
```

### JAX / Optax

```python
import optax

# AdamW with linear warmup and cosine decay.
# Note: decay_steps is the total schedule length and includes the warmup steps.
schedule = optax.warmup_cosine_decay_schedule(
    init_value=0.0, peak_value=1e-3,
    warmup_steps=1000, decay_steps=50000
)
optimizer = optax.adamw(learning_rate=schedule, weight_decay=0.01)
```

## Historical timeline

| Year | Event |
|---|---|
| 1847 | Augustin-Louis Cauchy describes the gradient descent method |
| 1951 | Robbins and Monro publish the stochastic approximation framework |
| 1964 | Polyak introduces the heavy ball method (momentum) |
| 1983 | Nesterov proposes accelerated gradient for convex optimization |
| 1986 | Rumelhart, Hinton, and Williams apply momentum to neural network training via backpropagation |
| 2011 | Duchi, Hazan, and Singer publish AdaGrad |
| 2012 | Zeiler introduces Adadelta; Hinton proposes RMSProp |
| 2014 | Kingma and Ba introduce Adam |
| 2015 | Martens and Grosse introduce K-FAC (Kronecker-factored approximate curvature) |
| 2016 | Dozat proposes NAdam |
| 2017 | Loshchilov and Hutter propose AdamW (decoupled weight decay) |
| 2018 | Shazeer and Stern introduce Adafactor; Gupta et al. introduce Shampoo |
| 2019 | You et al. introduce LAMB for large-batch training; Liu et al. introduce RAdam (rectified Adam) |
| 2023 | Chen et al. discover Lion via program search; Liu et al. introduce Sophia |
| 2024 | Jordan introduces Muon with Newton-Schulz orthogonalization; Defazio et al. introduce Schedule-Free; Distributed Shampoo (external-tuning track) and Schedule-Free AdamW (self-tuning track) win the inaugural AlgoPerf competition |
| 2025 | Moonshot AI shows Muon scales to LLMs (Moonlight) and trains the trillion-parameter Kimi K2 with MuonClip |

## Convergence theory

Optimizer convergence guarantees depend on the properties of the objective function:

- **Convex functions.** With a suitably decaying step size, SGD converges to the global minimum at a rate of O(1/sqrt(T)) for general convex functions and O(1/T) for strongly convex functions, where T is the number of iterations. In the deterministic (full-gradient) setting, Nesterov acceleration improves the rate for smooth convex functions from O(1/T) to O(1/T^2); with noisy stochastic gradients, acceleration does not improve on the O(1/sqrt(T)) rate in general.
- **Non-convex functions.** For general non-convex objectives (which include most neural network loss surfaces), convergence to a global minimum is not guaranteed. Instead, for smooth objectives with bounded gradient variance, SGD is guaranteed to find an epsilon-stationary point (where the expected gradient norm is at most epsilon) in O(1/epsilon^4) stochastic gradient evaluations.
- **Adam convergence.** The original Adam paper proved convergence for online convex optimization.[5] However, Reddi et al. (2018) showed that Adam can diverge on certain convex problems due to the exponential moving average of squared gradients causing "short-term memory."[24] AMSGrad, proposed in the same paper, fixed this issue by maintaining the maximum of past squared gradient averages, though the fix is rarely needed in practice.
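
The difference between Adam's second-moment estimate and the AMSGrad fix can be illustrated with a small sketch that tracks the denominator term on a gradient sequence containing one large spike. The function name, the gradient sequence, and the `beta2` value are assumptions for illustration, and bias correction is omitted for brevity.

```python
def second_moment_trace(grads, beta2=0.999, amsgrad=False):
    """Return the second-moment value used in the update denominator at each step.

    With amsgrad=True the running maximum of the moving average is used,
    so the value never decreases. Bias correction is omitted.
    """
    v, v_max, trace = 0.0, 0.0, []
    for g in grads:
        v = beta2 * v + (1 - beta2) * g * g  # exponential moving average of g^2
        v_max = max(v_max, v)                # AMSGrad keeps the largest average seen
        trace.append(v_max if amsgrad else v)
    return trace

# One large, informative gradient followed by many small ones.
grads = [10.0] + [0.1] * 50
adam_trace = second_moment_trace(grads, beta2=0.9)
ams_trace = second_moment_trace(grads, beta2=0.9, amsgrad=True)
print(adam_trace[0], adam_trace[-1])
print(ams_trace[0], ams_trace[-1])
```

In the Adam trace the spike is forgotten as the moving average decays, so the effective step size grows again. The AMSGrad trace holds at its maximum, which keeps the effective step size from increasing.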

In practice, convergence theory provides useful intuition but does not fully explain optimizer behavior on deep neural network training, where the loss surface is highly non-convex and high-dimensional.

## See also

- [Gradient descent](/wiki/gradient_descent)
- [Learning rate](/wiki/learning_rate)
- [Backpropagation](/wiki/backpropagation)
- [Loss function](/wiki/loss_function)
- [Weight decay](/wiki/weight_decay)
- [Adam optimizer](/wiki/adam_optimizer)
- [AdaGrad](/wiki/adagrad)
- [Stochastic gradient descent](/wiki/stochastic_gradient_descent_sgd)
- [Mixed-precision training](/wiki/mixed_precision_training)

## References

1. Robbins, H. and Monro, S. (1951). "A Stochastic Approximation Method." *The Annals of Mathematical Statistics*, 22(3), 400-407.
2. Nesterov, Y. (1983). "A Method for Solving the Convex Programming Problem with Convergence Rate O(1/k^2)." *Doklady Akademii Nauk SSSR*, 269(3), 543-547.
3. Duchi, J., Hazan, E., and Singer, Y. (2011). "Adaptive Subgradient Methods for Online Learning and Stochastic Optimization." *Journal of Machine Learning Research*, 12, 2121-2159.
4. Zeiler, M. D. (2012). "ADADELTA: An Adaptive Learning Rate Method." *arXiv preprint arXiv:1212.5701*.
5. Kingma, D. P. and Ba, J. (2015). "Adam: A Method for Stochastic Optimization." *Proceedings of the 3rd International Conference on Learning Representations (ICLR 2015)*. https://arxiv.org/abs/1412.6980
6. Ruder, S. (2016). "An overview of gradient descent optimization algorithms." *arXiv preprint arXiv:1609.04747*.
7. Dozat, T. (2016). "Incorporating Nesterov Momentum into Adam." *ICLR 2016 Workshop*.
8. Loshchilov, I. and Hutter, F. (2019). "Decoupled Weight Decay Regularization." *Proceedings of the 7th International Conference on Learning Representations (ICLR 2019)*. https://arxiv.org/abs/1711.05101
9. Liu, L. et al. (2020). "On the Variance of the Adaptive Learning Rate and Beyond." *Proceedings of the 8th International Conference on Learning Representations (ICLR 2020)*.
10. Shazeer, N. and Stern, M. (2018). "Adafactor: Adaptive Learning Rates with Sublinear Memory Cost." *Proceedings of the 35th International Conference on Machine Learning (ICML 2018)*.
11. You, Y. et al. (2019). "Large Batch Optimization for Deep Learning: Training BERT in 76 Minutes." *arXiv preprint arXiv:1904.00962*.
12. Chen, X. et al. (2023). "Symbolic Discovery of Optimization Algorithms." *Proceedings of the 37th Conference on Neural Information Processing Systems (NeurIPS 2023)*. https://arxiv.org/abs/2302.06675
13. Liu, H. et al. (2024). "Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training." *Proceedings of the 12th International Conference on Learning Representations (ICLR 2024)*.
14. Gupta, V. et al. (2018). "Shampoo: Preconditioned Stochastic Tensor Optimization." *Proceedings of the 35th International Conference on Machine Learning (ICML 2018)*.
15. Jordan, K. (2024). "Muon: An Optimizer for Hidden Layers in Neural Networks." Blog post and open-source implementation. https://kellerjordan.github.io/posts/muon/ ; code at https://github.com/KellerJordan/modded-nanogpt
16. Liu, J., Su, J., Yao, X. et al. (2025). "Muon is Scalable for LLM Training." *arXiv preprint arXiv:2502.16982*. https://arxiv.org/abs/2502.16982
17. Martens, J. and Grosse, R. (2015). "Optimizing Neural Networks with Kronecker-factored Approximate Curvature." *Proceedings of the 32nd International Conference on Machine Learning (ICML 2015)*. https://arxiv.org/abs/1503.05671
18. Jordan, K. et al. (2024-2025). "modded-nanogpt: NanoGPT speedrun." Open-source repository and worklog. https://github.com/KellerJordan/modded-nanogpt
19. Kimi Team, Moonshot AI (2025). "Kimi K2: Open Agentic Intelligence." Technical report and model release. https://arxiv.org/abs/2507.20534 ; https://moonshotai.github.io/Kimi-K2/
20. MLCommons (2024). "Announcing the results of the inaugural AlgoPerf: Training Algorithms benchmark competition." https://mlcommons.org/2024/08/mlc-algoperf-benchmark-competition/
21. Defazio, A., Yang, X., Mehta, H., Mishchenko, K., Khaled, A., and Cutkosky, A. (2024). "The Road Less Scheduled." *Proceedings of the 38th Conference on Neural Information Processing Systems (NeurIPS 2024)*. https://arxiv.org/abs/2405.15682
22. Wilson, A. C., Roelofs, R., Stern, M., Srebro, N., and Recht, B. (2017). "The Marginal Value of Adaptive Gradient Methods in Machine Learning." *Advances in Neural Information Processing Systems 30 (NeurIPS 2017)*. https://arxiv.org/abs/1705.08292
23. Zhou, P., Feng, J., Ma, C., Xiong, C., Hoi, S., and E, W. (2020). "Towards Theoretically Understanding Why SGD Generalizes Better Than ADAM in Deep Learning." *Advances in Neural Information Processing Systems 33 (NeurIPS 2020)*. https://arxiv.org/abs/2010.05627
24. Reddi, S. J., Kale, S., and Kumar, S. (2018). "On the Convergence of Adam and Beyond." *Proceedings of the 6th International Conference on Learning Representations (ICLR 2018)*. https://openreview.net/forum?id=ryQu7f-RZ