The Adam optimizer (short for Adaptive Moment Estimation) is an algorithm for first-order, gradient-based optimization of stochastic objective functions. Introduced by Diederik P. Kingma and Jimmy Ba in a 2014 paper and presented at ICLR 2015, Adam has become the default optimizer for training deep learning models across nearly every domain [1]. With over 200,000 citations on Google Scholar, it ranks among the most cited papers in all of machine learning. In 2025, the paper received the ICLR Test of Time Award, recognizing that "Adam revolutionized neural network training, enabling significantly faster convergence and more stable training across a wide variety of architectures and tasks" [2].
Adam works by combining two older ideas: momentum (tracking an exponential moving average of past gradients) and RMSProp (tracking an exponential moving average of past squared gradients). By maintaining both a first moment estimate and a second moment estimate of the gradient, and applying bias correction to both, Adam adapts the learning rate for each parameter individually. This per-parameter adaptivity is what makes Adam so robust across a wide range of tasks and architectures without extensive hyperparameter tuning.
Before Adam, practitioners training neural networks faced a difficult choice among several optimization algorithms, each with its own strengths and weaknesses.
Stochastic gradient descent (SGD) is the simplest approach: compute the gradient of the loss function on a mini-batch of data and take a step proportional to the negative gradient. SGD is mathematically well-understood and, with proper tuning, can achieve excellent generalization. However, SGD with a fixed learning rate converges slowly on problems with ill-conditioned loss surfaces (surfaces where the curvature varies dramatically across different parameter directions).
SGD with momentum addresses some of SGD's slowness by accumulating a velocity vector that smooths out oscillations and accelerates movement along consistent gradient directions. The classical momentum update maintains an exponentially decaying average of past gradients. Nesterov accelerated gradient (NAG) is a variant that evaluates the gradient at a look-ahead position, often yielding slightly faster convergence.
Adagrad (Duchi et al., 2011) introduced the idea of adapting the learning rate for each parameter based on historical gradient information [3]. Parameters with large past gradients get smaller learning rates, and parameters with small past gradients get larger learning rates. This is useful for sparse data (like natural language processing tasks), but Adagrad's accumulation of squared gradients causes the learning rate to shrink monotonically, eventually becoming too small to make meaningful progress.
RMSProp (Hinton, unpublished lecture notes, 2012) fixed Adagrad's shrinking learning rate by replacing the sum of squared gradients with an exponential moving average of squared gradients [4]. This gives the algorithm a "window" of recent gradient history rather than the entire history, allowing the learning rate to increase again if gradients grow.
Adam brings momentum and RMSProp together into a single algorithm, adds bias correction to handle initialization artifacts, and provides a principled default configuration that works well in most settings.
Adam maintains two state variables for each parameter in the model: a first moment estimate (the mean of recent gradients, analogous to momentum) and a second moment estimate (the mean of recent squared gradients, analogous to RMSProp). These estimates are updated at each training step, bias-corrected to account for their initialization at zero, and then used to compute an adaptive learning rate for each parameter.
Given a parameter vector theta, a loss function L(theta), and the gradient g_t = nabla L(theta_t) at time step t:
Update the first moment estimate (mean of gradients): m_t = beta_1 * m_(t-1) + (1 - beta_1) * g_t
Update the second moment estimate (mean of squared gradients): v_t = beta_2 * v_(t-1) + (1 - beta_2) * g_t^2
Apply bias correction: m_hat_t = m_t / (1 - beta_1^t) and v_hat_t = v_t / (1 - beta_2^t)
Update the parameters: theta_t = theta_(t-1) - alpha * m_hat_t / (sqrt(v_hat_t) + epsilon)
Here, alpha is the step size (learning rate), beta_1 and beta_2 are the exponential decay rates for the first and second moment estimates, and epsilon is a small constant added for numerical stability.
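Translated into code, one training step looks like the following. This is a minimal NumPy sketch of the four equations above; the function name adam_step and the toy quadratic objective are chosen here for illustration, not taken from the paper.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update, following the four equations above.

    theta: parameter vector, grad: gradient at theta, m/v: moment estimates,
    t: 1-based step counter. Returns the updated (theta, m, v).
    """
    m = beta1 * m + (1 - beta1) * grad           # first moment: mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2      # second moment: mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                 # bias correction for m
    v_hat = v / (1 - beta2 ** t)                 # bias correction for v
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy problem: minimize f(theta) = ||theta||^2 / 2, whose gradient is theta itself.
theta = np.array([1.0, -2.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 5001):
    theta, m, v = adam_step(theta, theta.copy(), m, v, t)
# theta is now close to the minimum at the origin.
```

Note that the normalized update m_hat / sqrt(v_hat) has magnitude on the order of 1 for each parameter, so the step size is roughly alpha per step regardless of the raw gradient scale.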
Both m_t and v_t are initialized to zero vectors. Because of this, the estimates are biased toward zero during the early steps of training, especially when beta_1 and beta_2 are close to 1 (which they typically are). The bias correction step divides by (1 - beta^t), which is small for small t but approaches 1 as t grows. This ensures that the moment estimates are unbiased from the very first step, preventing the optimizer from taking overly small or poorly directed steps at the beginning of training.
Without bias correction, the effective learning rate in the first few steps would be much smaller than intended, and the ratio between the first and second moment estimates could be distorted. Kingma and Ba showed that the bias-corrected estimates have the correct expected value under the assumption that the gradients are drawn from a stationary distribution [1].
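The size of this correction is easy to check numerically. With the default beta_2 = 0.999, the raw second moment estimate must be scaled up by a factor of 1000 at the first step, but the factor decays toward 1 as training proceeds:

```python
# Bias-correction factor 1 / (1 - beta^t) for the default beta_2 = 0.999.
beta2 = 0.999
factors = {t: 1.0 / (1.0 - beta2 ** t) for t in (1, 10, 100, 1000)}
for t, f in factors.items():
    print(f"t = {t:>4}: correction factor = {f:8.2f}")
# t = 1 gives a factor of 1000; by t = 1000 the factor has fallen to about 1.58.
```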
The first moment estimate m_t acts like momentum: it smooths out noise in the gradient signal and accelerates the optimizer along directions of consistent gradient. The second moment estimate v_t captures the scale of the gradient for each parameter. Dividing m_hat_t by sqrt(v_hat_t) normalizes the update, so parameters with large gradients get scaled down and parameters with small gradients get scaled up. This per-parameter adaptivity is the core of what makes Adam effective across a wide range of problems.
Adam has four hyperparameters. One of its biggest practical advantages is that the recommended defaults work remarkably well across a wide variety of tasks.
| Hyperparameter | Symbol | Default value | Role |
|---|---|---|---|
| Learning rate | alpha | 0.001 | Controls the step size of each parameter update |
| First moment decay | beta_1 | 0.9 | Exponential decay rate for the gradient moving average (momentum) |
| Second moment decay | beta_2 | 0.999 | Exponential decay rate for the squared gradient moving average |
| Epsilon | epsilon | 1e-8 | Small constant for numerical stability in the denominator |
The default learning rate of 0.001 is a good starting point for many tasks. In practice, the learning rate is the hyperparameter most often tuned. For large language models, learning rates in the range of 1e-4 to 3e-4 are common during pre-training, often combined with a warmup schedule followed by cosine decay. For fine-tuning, even smaller learning rates (1e-5 to 5e-5) are typical.
Beta_1 = 0.9 means the first moment estimate is an exponential moving average with an effective window of roughly 10 steps (1 / (1 - 0.9) = 10). Beta_2 = 0.999 gives the second moment estimate a much longer window of roughly 1,000 steps. The longer window for the second moment helps stabilize the per-parameter scaling, since the variance of the gradient can itself be noisy.
Some practitioners use beta_2 = 0.95 or beta_2 = 0.98 for training large transformers, finding that a shorter second-moment window can be beneficial when gradients change significantly throughout training.
Epsilon prevents division by zero when v_hat_t is very small. The default value of 1e-8 is usually fine, but some implementations (including TensorFlow's default) use 1e-7, and some practitioners have found that larger values (like 1e-6 or even 1e-4) can improve stability, particularly in mixed-precision training.
One of the most important modifications to Adam came from Ilya Loshchilov and Frank Hutter, who identified a subtle but significant problem with how Adam interacts with weight decay [5].
Weight decay and L2 regularization are equivalent for SGD: adding a penalty of (lambda/2) * ||theta||^2 to the loss function produces the same parameter update as directly decaying the weights by a factor of (1 - lambda * alpha) at each step. However, this equivalence breaks down for adaptive optimizers like Adam.
In Adam, the gradient of the L2 penalty term (lambda * theta) gets divided by the second moment estimate, just like every other gradient component. This means the regularization effect is scaled differently for each parameter, and parameters with large gradient variance receive weaker regularization. The intended regularization strength is distorted by the adaptive scaling.
AdamW decouples weight decay from the gradient-based update. Instead of adding the L2 penalty to the loss and letting Adam process the resulting gradient, AdamW applies weight decay directly to the parameters after the Adam update step:
theta_t = theta_(t-1) - alpha * m_hat_t / (sqrt(v_hat_t) + epsilon) - alpha * lambda * theta_(t-1)
This ensures that weight decay acts uniformly on all parameters regardless of their gradient statistics. Loshchilov and Hutter showed that AdamW substantially improves generalization compared to Adam with L2 regularization, particularly on image classification benchmarks where Adam had previously been outperformed by SGD with momentum [5].
AdamW was published in 2017 (arXiv preprint) and presented at ICLR 2019. It has since become the standard optimizer for training large language models and vision transformers. PyTorch provides it as torch.optim.AdamW, and it is the optimizer of choice for training models like GPT, BERT, and their successors.
The success of Adam has inspired a large family of adaptive optimizers. Some address specific limitations of Adam, while others introduce entirely new ideas.
| Optimizer | Year | Key idea | Primary use case |
|---|---|---|---|
| Adam [1] | 2014 | Combines momentum + RMSProp with bias correction | General-purpose default |
| AdamW [5] | 2017 | Decouples weight decay from adaptive gradient scaling | Standard for LLM and ViT training |
| LARS [6] | 2017 | Layer-wise adaptive rate scaling for SGD; normalizes gradient by layer | Large-batch training of CNNs |
| LAMB [7] | 2019 | Layer-wise adaptive rate scaling for Adam | Large-batch training of BERT and transformers |
| Adafactor [8] | 2018 | Factorizes second moment into row and column vectors; sublinear memory | Memory-efficient training of large models |
| RAdam [9] | 2019 | Rectifies variance of adaptive learning rate in early training | Stabilized Adam warmup |
| Lion [10] | 2023 | Discovered via program search; uses sign of momentum; lower memory | Efficient alternative to AdamW |
| Sophia [11] | 2023 | Uses diagonal Hessian estimate for second-order-like adaptivity | Faster LLM pre-training |
| Schedule-Free Adam [12] | 2024 | Eliminates need for learning rate schedules via iterate averaging | Schedule-free training |
| AdEMAMix [13] | 2024 | Maintains two momentum terms with different timescales | Better use of old gradients |
LARS (Layer-wise Adaptive Rate Scaling) was developed by Yang You et al. to enable large-batch training of convolutional neural networks [6]. The key insight is that the ratio of the weight norm to the gradient norm varies dramatically across layers, and a single global learning rate cannot accommodate all layers well at very large batch sizes. LARS computes a per-layer trust ratio and scales the learning rate accordingly.
LAMB (Layer-wise Adaptive Moments optimizer for Batch training) extends this idea to Adam [7]. It was developed to enable training BERT with batch sizes up to 64K in just 76 minutes. LAMB normalizes the Adam update by the ratio of the parameter norm to the update norm, layer by layer.
Adafactor, proposed by Shazeer and Stern (2018), addresses Adam's memory overhead [8]. Adam stores two state variables (first and second moment) for each parameter, doubling the memory required compared to SGD. For a model with billions of parameters, this is a serious constraint. Adafactor reduces memory by factorizing the second moment of each n x m weight matrix into row and column factors, cutting the optimizer state for that matrix from O(nm) to O(n + m). It also supports automatic learning rate scaling, reducing the number of hyperparameters to tune.
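The factorization itself is small enough to show inline. Following the row-sum/column-sum scheme described in the paper, the full second-moment matrix is approximated by a rank-1 outer product, so only the row and column statistics need to be stored (a sketch of the approximation, not the full optimizer):

```python
import numpy as np

rng = np.random.default_rng(0)
# Second-moment statistics for a 4x6 weight matrix: one nonnegative entry
# per parameter (full Adam would store this whole matrix).
V = rng.random((4, 6))

r = V.sum(axis=1)                      # per-row sums: n numbers
c = V.sum(axis=0)                      # per-column sums: m numbers
V_approx = np.outer(r, c) / r.sum()    # rank-1 reconstruction of V

# Only r and c (n + m = 10 numbers) are stored instead of all n * m = 24.
```

The reconstruction preserves the row and column sums of V, which is why it tracks the per-parameter scale well in practice.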
Lion (EvoLved Sign Momentum) was discovered by Chen et al. (2023) through an automated program search over the space of possible optimizer update rules [10]. The resulting algorithm is surprisingly simple: it only tracks momentum (no second moment) and uses the sign of the momentum for parameter updates. This makes Lion more memory-efficient than Adam (one state variable instead of two) and sometimes faster in practice. Lion has shown strong results on both vision and language tasks.
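The published update rule is compact enough to sketch directly (the defaults beta_1 = 0.9 and beta_2 = 0.99 follow the paper; this is an illustrative single-tensor version, not a drop-in optimizer):

```python
import numpy as np

def lion_step(theta, grad, m, lr=1e-4, beta1=0.9, beta2=0.99, weight_decay=0.0):
    """One Lion update: sign of an interpolated momentum plus decoupled decay.

    Only one state variable (m) is kept, versus Adam's two.
    """
    c = beta1 * m + (1 - beta1) * grad            # interpolation used for the step
    theta = theta - lr * (np.sign(c) + weight_decay * theta)
    m = beta2 * m + (1 - beta2) * grad            # momentum is updated with beta2
    return theta, m

# Every coordinate moves by exactly lr (when weight_decay is zero), regardless
# of gradient magnitude; only the sign of the interpolated momentum matters.
theta = np.array([0.5, -0.5])
theta_new, m_new = lion_step(theta, np.array([1.0, -2.0]), np.zeros(2))
```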
Sophia (Second-order Clipped Stochastic Optimization) was proposed by Liu et al. (2023) as a lightweight second-order optimizer for language model pre-training [11]. Sophia estimates the diagonal of the Hessian (a measure of curvature) and uses it to adapt per-parameter learning rates, similar to how Adam uses the second moment but with a more direct connection to the loss surface geometry. On GPT-2 pre-training benchmarks, Sophia was reported to reach the same validation loss as AdamW using roughly 50% fewer steps.
Schedule-Free Adam, introduced by Defazio et al. (2024) in the paper "The Road Less Scheduled," eliminates the need to specify a learning rate schedule [12]. Traditional training with Adam or AdamW typically involves a warmup phase followed by a cosine or linear decay schedule, which requires knowing the total training duration in advance. Schedule-Free Adam uses a theory unifying scheduling and iterate averaging to achieve state-of-the-art performance without any schedule. It won the MLCommons 2024 AlgoPerf Algorithmic Efficiency Challenge in the self-tuning track.
Several factors explain Adam's widespread adoption.
Robust defaults. Adam's recommended hyperparameters (alpha = 0.001, beta_1 = 0.9, beta_2 = 0.999, epsilon = 1e-8) work well out of the box for a wide range of architectures and datasets. A researcher or engineer can start training with these defaults and often achieve competitive results without any hyperparameter search. This is not true of SGD, which is highly sensitive to the learning rate and momentum settings and often requires a carefully designed learning rate schedule.
Fast initial convergence. Adam typically makes rapid progress in the first few epochs of training. The adaptive per-parameter learning rates help it navigate loss surfaces with very different scales across parameter groups, which is common in deep networks with different layer types (embeddings, attention layers, feedforward layers, normalization layers).
Works across architectures. Adam performs well on convolutional networks, recurrent networks, transformers, generative adversarial networks, variational autoencoders, and more. This generality has made it the default choice in codebases, tutorials, and research papers alike.
Low sensitivity to scale. Because Adam normalizes gradients by the square root of the second moment, it is relatively insensitive to the absolute scale of the gradients. This makes it easier to use with different loss functions, batch sizes, and model architectures without rescaling the learning rate.
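This insensitivity is straightforward to verify: rescaling an entire gradient sequence by a large constant leaves the resulting Adam step essentially unchanged, since the epsilon term is negligible at typical gradient scales (the helper name below is chosen here for illustration):

```python
import numpy as np

def final_adam_step(grads, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Run Adam's moment updates over a gradient sequence from zero state
    and return the final parameter update."""
    m = v = 0.0
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return alpha * m_hat / (np.sqrt(v_hat) + eps)

grads = np.array([0.5, -0.2, 0.8, 0.3])
step_small = final_adam_step(grads)            # original gradient scale
step_large = final_adam_step(grads * 1000.0)   # gradients rescaled by 1000x
# The two steps agree to many decimal places.
```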
Despite Adam's convenience, SGD with momentum remains the preferred optimizer in certain settings. The choice between the two is a long-standing practical question in deep learning.
| Consideration | SGD with momentum | Adam / AdamW |
|---|---|---|
| Generalization on image classification | Often better with proper tuning | Comparable with AdamW |
| Ease of tuning | Requires careful learning rate and schedule selection | Works well with defaults |
| Training speed (wall-clock time to good loss) | Slower convergence, especially early | Faster convergence, especially early |
| Memory usage | One state variable per parameter | Two state variables per parameter |
| Large language models | Rarely used | Standard choice (AdamW) |
| Computer vision (CNNs) | Strong tradition, especially ResNet-era | Increasingly common with ViTs |
| Fine-tuning pre-trained models | Sometimes used | More common, especially for NLP |
Historically, the vision community favored SGD with momentum because it was observed to generalize better than Adam on benchmarks like ImageNet. This generalization gap was largely explained by Loshchilov and Hutter's work on decoupled weight decay: when weight decay is properly decoupled (as in AdamW), the gap narrows substantially or disappears [5].
For large language model pre-training, AdamW is effectively the only optimizer in wide use. The scale of these models (billions of parameters, trillions of tokens) makes robust default behavior critical, and AdamW delivers this. Papers training GPT-3, PaLM, LLaMA, and similar models all report using AdamW.
SGD still has an edge in memory efficiency: it stores only one state variable per parameter (the momentum buffer), compared to Adam's two. For extremely large models where GPU memory is the binding constraint, this difference matters, and it has motivated memory-efficient alternatives like Adafactor and Lion.
Adam is not without its problems. Researchers have identified several theoretical and practical concerns over the years.
Reddi et al. (2018) demonstrated that Adam can fail to converge to the optimal solution in certain convex optimization settings. The issue stems from the exponential moving average of squared gradients, which can cause the effective learning rate to increase at the wrong time. This motivated the development of AMSGrad, a variant that maintains the maximum of all past second moment estimates, though AMSGrad has not seen wide practical adoption because the convergence issues rarely manifest on real deep learning problems.
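The AMSGrad fix amounts to one extra line in the update: keep a running elementwise maximum of the second moment estimate and use it in the denominator, so the per-parameter effective learning rate can never increase (a sketch following the variant described by Reddi et al.; the exact placement of bias correction varies between implementations):

```python
import numpy as np

def amsgrad_step(theta, grad, m, v, v_max, t, alpha=0.001, beta1=0.9,
                 beta2=0.999, eps=1e-8):
    """One AMSGrad update: Adam with a non-decreasing second-moment estimate."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    v_max = np.maximum(v_max, v)                 # the one-line change versus Adam
    m_hat = m / (1 - beta1 ** t)
    v_hat = v_max / (1 - beta2 ** t)             # denominator can only grow
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v, v_max

theta = np.zeros(2)
m, v, v_max = np.zeros(2), np.zeros(2), np.zeros(2)
theta, m, v, v_max = amsgrad_step(theta, np.array([1.0, 1.0]), m, v, v_max, t=1)
theta, m, v, v_max = amsgrad_step(theta, np.array([0.1, 0.1]), m, v, v_max, t=2)
```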
Several studies have reported that Adam finds solutions that generalize worse than those found by SGD, particularly on image classification tasks. Wilson et al. (2017) argued that adaptive gradient methods can converge to qualitatively different solutions that generalize poorly. However, this concern has been partially addressed by AdamW and by learning rate warmup strategies.
Adam requires storing two additional state variables (m_t and v_t) for every parameter. For a model with N parameters in float32, this means 8N bytes of additional optimizer state beyond the parameters themselves. For billion-parameter models, this translates to gigabytes of extra GPU memory, which can be the difference between fitting a model on available hardware or not.
The default beta_2 = 0.999 creates a very long window for the second moment estimate (roughly 1,000 steps). If the gradient distribution changes significantly during training (which it does), the second moment estimate can lag behind, causing the learning rate to be poorly calibrated. Some practitioners find that reducing beta_2 to 0.95 or 0.98 improves performance for language model training.
The training of large language models has standardized heavily around AdamW. The typical recipe involves:
Warmup phase: The learning rate is linearly increased from zero (or a very small value) to the peak learning rate over the first 1-5% of training steps. This prevents large, poorly directed updates at the beginning when the moment estimates are not yet reliable.
Peak learning rate: For models in the 1B-100B parameter range, peak learning rates typically fall between 1e-4 and 6e-4, with larger models often using smaller learning rates.
Decay schedule: After warmup, the learning rate is decayed following a cosine schedule down to roughly 10% of the peak value.
Weight decay: A weight decay coefficient of 0.1 is common for large language models.
Beta values: beta_1 = 0.9 and beta_2 = 0.95 have become standard for LLM training, deviating from the original defaults to use a shorter second-moment window.
Gradient clipping: The global gradient norm is typically clipped to 1.0 to prevent training instabilities from occasional gradient spikes.
This recipe, with minor variations, appears in the training descriptions of GPT-3, PaLM, LLaMA, Chinchilla, and many other large language models.
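The warmup-plus-cosine-decay portion of this recipe can be written as a small schedule function (the peak rate, warmup fraction, and 10% floor below are illustrative values drawn from the ranges mentioned above):

```python
import math

def lr_at_step(step, total_steps, peak_lr=3e-4, warmup_frac=0.01, min_frac=0.1):
    """Linear warmup to peak_lr, then cosine decay to min_frac * peak_lr."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps               # linear ramp from zero
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))    # goes from 1 to 0
    return peak_lr * (min_frac + (1.0 - min_frac) * cosine)

total = 100_000
# The rate ramps over the first 1% of steps, peaks at 3e-4, and decays to 3e-5.
```

Knowing total_steps in advance is exactly the requirement that schedule-free methods aim to remove.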
Adam is available in every major deep learning framework.
In PyTorch:
import torch
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999), eps=1e-8)
# Or use AdamW for decoupled weight decay:
optimizer = torch.optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.01)
In TensorFlow / Keras:
import tensorflow as tf
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-7)
# Or AdamW:
optimizer = tf.keras.optimizers.AdamW(learning_rate=0.001, weight_decay=0.01)
PyTorch's implementation includes several options beyond the basic algorithm, such as amsgrad (the AMSGrad variant), fused (a faster CUDA implementation that fuses the optimizer step into a single kernel), and foreach (which processes parameter groups in batches for better GPU utilization).
As of 2025-2026, AdamW remains the dominant optimizer for training deep learning models, particularly large language models and vision transformers. No alternative has yet displaced it as the default choice.
However, several of the newer optimizers surveyed above, including Lion, Sophia, and Schedule-Free Adam, show promise as partial replacements in specific settings.
The field of optimizer research continues to be active, but Adam's combination of simplicity, robustness, and strong defaults makes it likely to remain the go-to choice for the foreseeable future. Its impact on the field cannot be overstated: by providing an optimizer that "just works," Kingma and Ba enabled researchers to focus on model architecture and data rather than optimization tuning, accelerating progress across all of deep learning.