The Adam optimizer (short for Adaptive Moment Estimation) is an algorithm for first-order, gradient-based optimization of stochastic objective functions. Introduced by Diederik P. Kingma and Jimmy Ba in a 2014 preprint and presented at ICLR 2015, Adam has become the default optimizer for training deep learning models across nearly every domain [1]. With over 200,000 citations on Google Scholar, it ranks among the most cited papers in all of machine learning and is one of the most cited scientific papers of the 21st century in any field. In 2025, the paper received the ICLR Test of Time Award, recognizing that "Adam revolutionized neural network training, enabling significantly faster convergence and more stable training across a wide variety of architectures and tasks" [2].
Adam works by combining two older ideas: momentum (tracking an exponential moving average of past gradients) and RMSProp (tracking an exponential moving average of past squared gradients). By maintaining both a first moment estimate and a second moment estimate of the gradient, and applying bias correction to both, Adam adapts the learning rate for each parameter individually. This per-parameter adaptivity is what makes Adam so robust across a wide range of tasks and architectures without extensive hyperparameter tuning. The algorithm is invariant to diagonal rescaling of the gradients, requires only first-order gradient information, and has very modest memory requirements relative to second-order methods.
Before Adam, practitioners training neural networks faced a difficult choice among several optimization algorithms, each with its own strengths and weaknesses. Each algorithm represented a different trade-off between simplicity, memory cost, convergence speed, and the amount of hyperparameter tuning required.
Stochastic gradient descent (SGD) is the simplest approach: compute the gradient of the loss function on a mini-batch of data and take a step proportional to the negative gradient. SGD is mathematically well understood and, with proper tuning, can achieve excellent generalization. However, SGD with a fixed learning rate converges slowly on problems with ill-conditioned loss surfaces, that is, surfaces where the curvature varies dramatically across different parameter directions.
SGD with momentum addresses some of SGD's slowness by accumulating a velocity vector that smooths out oscillations and accelerates movement along consistent gradient directions. The classical momentum update maintains an exponentially decaying average of past gradients. Nesterov accelerated gradient (NAG) is a variant that evaluates the gradient at a look-ahead position, often yielding slightly faster convergence on convex problems. Both methods, however, still use a single global learning rate.
Adagrad (Duchi et al., 2011) introduced the idea of adapting the learning rate for each parameter based on historical gradient information [3]. Parameters with large past gradients get smaller learning rates, and parameters with small past gradients get larger learning rates. This is useful for sparse data such as natural language processing tasks, where some features are seen rarely. Adagrad's accumulation of squared gradients, however, causes the learning rate to shrink monotonically, eventually becoming too small to make meaningful progress in long training runs.
RMSProp (Hinton, unpublished lecture notes, 2012) fixed Adagrad's shrinking learning rate by replacing the sum of squared gradients with an exponential moving average of squared gradients [4]. This gives the algorithm a sliding window of recent gradient history rather than the entire history, allowing the learning rate to recover if gradients grow again. Tieleman and Hinton's slides from the Coursera course on neural networks made RMSProp influential in practice, even though no formal paper was ever published.
AdaDelta (Zeiler, 2012) was developed independently around the same time and arrived at a similar fix, additionally trying to remove the need for a manually tuned learning rate by tracking a running average of squared parameter updates as well.
Adam brings momentum and RMSProp together into a single algorithm, adds bias correction to handle initialization artifacts, and provides a principled default configuration that works well in most settings. Kingma and Ba framed the contribution as combining the best aspects of two adaptive methods: AdaGrad, which works well with sparse gradients, and RMSProp, which works well in online and non-stationary settings.
Adam maintains two state variables for each parameter in the model: a first moment estimate (the mean of recent gradients, analogous to momentum) and a second moment estimate (the mean of recent squared gradients, analogous to RMSProp). These estimates are updated at each training step, bias corrected to account for their initialization at zero, and then used to compute an adaptive parameter update for each parameter.
Given a parameter vector theta and a loss function L(theta), Adam performs the following steps at each time step t:

1. Compute the gradient on the current mini-batch: g_t = nabla L(theta_(t-1))
2. Update the first moment estimate (mean of gradients): m_t = beta_1 * m_(t-1) + (1 - beta_1) * g_t
3. Update the second moment estimate (mean of squared gradients): v_t = beta_2 * v_(t-1) + (1 - beta_2) * g_t^2, where the square is element-wise
4. Apply bias correction: m_hat_t = m_t / (1 - beta_1^t) and v_hat_t = v_t / (1 - beta_2^t)
5. Update the parameters: theta_t = theta_(t-1) - alpha * m_hat_t / (sqrt(v_hat_t) + epsilon)
Here, alpha is the step size (learning rate), beta_1 and beta_2 are the exponential decay rates for the first and second moment estimates, and epsilon is a small constant added for numerical stability. The square root and division in step 5 are also element-wise.
The algorithm is initialized with m_0 = 0 and v_0 = 0. The original paper uses a slightly different but mathematically equivalent formulation that folds the bias correction into a single effective learning rate, alpha_t = alpha * sqrt(1 - beta_2^t) / (1 - beta_1^t), to save a small amount of computation [1].
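For concreteness, the five steps can be written as a short function. The following is a minimal NumPy sketch of the update exactly as stated above, not a production implementation; the toy quadratic loss at the bottom is only there to show how the state is threaded through the training loop:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; theta, grad, m, v share a shape, t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad           # step 2: first moment (EMA of gradients)
    v = beta2 * v + (1 - beta2) * grad ** 2      # step 3: second moment (EMA of squared gradients)
    m_hat = m / (1 - beta1 ** t)                 # step 4: bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)   # step 5: parameter update
    return theta, m, v

# m and v start at zero; t starts at 1 so the bias correction is well defined.
theta = np.zeros(10)
m, v = np.zeros_like(theta), np.zeros_like(theta)
for t in range(1, 101):
    grad = 2 * theta - 1.0                       # gradient of the toy loss sum((theta - 0.5)**2)
    theta, m, v = adam_step(theta, grad, m, v, t)
```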
Both m_t and v_t are initialized to zero vectors. Because of this, the estimates are biased toward zero during the early steps of training, especially when beta_1 and beta_2 are close to 1 (which they typically are). With beta_2 = 0.999, the uncorrected v_1 after the first step equals only 0.001 * g_1^2, which is roughly a thousand times smaller than the true second moment of the gradient. Plugging this into the denominator would produce an enormous, badly directed first step.
The bias correction step divides by (1 - beta^t), which is small for small t but approaches 1 as t grows. With beta_1 = 0.9, the first-moment correction factor 1 - 0.9^t exceeds 0.99 after roughly 45 steps; with beta_2 = 0.999, the second-moment factor takes a few thousand steps to get there. This ensures that the moment estimates are effectively unbiased from the very first step, preventing the optimizer from taking poorly scaled or poorly directed steps at the beginning of training.
Kingma and Ba showed that the bias-corrected estimates have the correct expected value under the assumption that the gradients are drawn from a stationary distribution [1]. In practice, the gradient distribution is far from stationary, but the correction still does what it is supposed to: it removes the systematic bias caused by the zero initialization.
The first moment estimate m_t acts like momentum: it smooths out noise in the gradient signal and accelerates the optimizer along directions of consistent gradient. The second moment estimate v_t captures the scale of the gradient for each parameter. Dividing m_hat_t by sqrt(v_hat_t) normalizes the update, so parameters with large gradients get scaled down and parameters with small gradients get scaled up. This per-parameter adaptivity is the core of what makes Adam effective across a wide range of problems.
A useful way to see this is that the effective per-parameter learning rate at step t is roughly alpha / sqrt(E[g^2]), which is the global learning rate divided by an estimate of the root mean square gradient. The signal-to-noise ratio of the update is m_hat_t / sqrt(v_hat_t), and Adam steps in proportion to that ratio rather than to the raw gradient.
Adam has four hyperparameters. One of its biggest practical advantages is that the recommended defaults work remarkably well across a wide variety of tasks. The original paper proposed alpha = 0.001, beta_1 = 0.9, beta_2 = 0.999, and epsilon = 1e-8, and these have remained the defaults in nearly every implementation since.
| Hyperparameter | Symbol | Default value | Role |
|---|---|---|---|
| Learning rate | alpha | 0.001 | Controls the step size of each parameter update |
| First moment decay | beta_1 | 0.9 | Exponential decay rate for the gradient moving average (momentum) |
| Second moment decay | beta_2 | 0.999 | Exponential decay rate for the squared gradient moving average |
| Epsilon | epsilon | 1e-8 | Small constant for numerical stability in the denominator |
The default learning rate of 0.001 is a good starting point for many tasks. In practice, the learning rate is the hyperparameter most often tuned. A common rule of thumb is that, for very large batch sizes, the learning rate should be scaled up roughly proportionally; for very small models or noisy data, it should be scaled down. Adam's adaptive scaling makes this less critical than for SGD, but it does not remove the need for tuning entirely.
Beta_1 = 0.9 means the first moment estimate is an exponential moving average with an effective window of roughly 10 steps (1 / (1 - 0.9) = 10). Beta_2 = 0.999 gives the second moment estimate a much longer window of roughly 1,000 steps. The longer window for the second moment helps stabilize the per-parameter scaling, since the variance of the gradient can itself be noisy.
For large transformer training, beta_2 = 0.95 has become the de facto standard. Llama 2 pretraining used (beta_1, beta_2) = (0.9, 0.95), and the same setting appears in PaLM, GPT-3, Llama 3, DeepSeek V3, and the Masked Autoencoder vision recipe [14][15]. The intuition is that the gradient distribution shifts substantially during pretraining of a large model, so a shorter second-moment window allows Adam to react faster to those shifts.
Epsilon prevents division by zero when v_hat_t is very small. The default value of 1e-8 is usually fine, but some implementations (including TensorFlow's default in older versions) use 1e-7, and some practitioners have found that larger values such as 1e-6 or even 1e-4 can improve stability, particularly in mixed-precision training where very small denominators interact badly with float16 representations.
Across modern deep learning, several conventions have emerged for setting Adam (and AdamW) hyperparameters in different domains. The table below summarizes typical settings, drawn from the published training recipes for major models [14][15][16].
| Task | Optimizer | Peak learning rate | beta_1 | beta_2 | Weight decay | Schedule |
|---|---|---|---|---|---|---|
| LLM pretraining (GPT-3, PaLM, Llama, DeepSeek) | AdamW | 1e-4 to 6e-4 | 0.9 | 0.95 | 0.1 | warmup + cosine |
| LLM fine-tuning (instruction, chat) | AdamW | 1e-5 to 5e-5 | 0.9 | 0.999 | 0.0 to 0.1 | warmup + linear or cosine |
| LoRA / adapter fine-tuning | AdamW | 1e-4 to 3e-4 | 0.9 | 0.999 | 0.0 | warmup + cosine |
| BERT-style masked language model pretraining | AdamW | 1e-4 | 0.9 | 0.999 | 0.01 | warmup + linear decay |
| Vision Transformer (ViT) pretraining | AdamW | 1e-3 to 3e-3 | 0.9 | 0.999 | 0.05 to 0.3 | warmup + cosine |
| Diffusion model training | AdamW | 1e-4 | 0.9 | 0.999 | 1e-2 | constant or cosine |
| ResNet / CNN image classification | SGD-momentum | 0.1 | n/a | n/a | 1e-4 | step or cosine |
| GAN training | Adam | 1e-4 to 2e-4 | 0.5 | 0.9 or 0.999 | 0 | constant |
| Reinforcement learning (PPO, DQN) | Adam | 3e-4 | 0.9 | 0.999 | 0 | constant |
For generative adversarial networks, Radford, Metz and Chintala recommended beta_1 = 0.5 in their DCGAN paper, finding that the default 0.9 caused training instability. This setting has stuck across most GAN literature.
One of the most important modifications to Adam came from Ilya Loshchilov and Frank Hutter, who identified a subtle but significant problem with how Adam interacts with weight decay [5]. Their paper, originally titled "Fixing Weight Decay Regularization in Adam" and later renamed "Decoupled Weight Decay Regularization," was published as an arXiv preprint in 2017 and accepted at ICLR 2019.
Weight decay and L2 regularization are equivalent for SGD: adding a penalty of (lambda / 2) * ||theta||^2 to the loss function produces the same parameter update as directly decaying the weights by a factor of (1 - lambda * alpha) at each step. This equivalence is why the two terms are used interchangeably in much of the older literature.
The equivalence breaks down for adaptive optimizers like Adam. In Adam, the gradient of the L2 penalty term (lambda * theta) gets divided by the second moment estimate, just like every other gradient component. This means the regularization effect is scaled differently for each parameter: parameters with large gradient variance receive weaker regularization, and parameters with small gradient variance receive stronger regularization. The intended uniform shrinkage is distorted by the adaptive scaling, and the choice of weight decay becomes entangled with the choice of learning rate.
AdamW decouples weight decay from the gradient-based update. Instead of adding the L2 penalty to the loss and letting Adam process the resulting gradient, AdamW applies weight decay directly to the parameters after the Adam update step:
theta_t = theta_(t-1) - alpha * (m_hat_t / (sqrt(v_hat_t) + epsilon) + lambda * theta_(t-1))
The critical change is that the lambda * theta_(t-1) term sits outside the adaptive normalization. Every parameter gets shrunk by the same fraction at each step, regardless of its gradient history.
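The difference is easiest to see side by side. Below is a minimal NumPy sketch contrasting Adam with an L2 penalty folded into the gradient against AdamW's decoupled decay; function and variable names are illustrative:

```python
import numpy as np

def adam_l2_step(theta, grad, m, v, t, alpha, beta1, beta2, eps, lam):
    # L2 regularization: the decay term is added to the gradient, so it is
    # rescaled per parameter by the adaptive denominator like everything else.
    grad = grad + lam * theta
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat, v_hat = m / (1 - beta1 ** t), v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

def adamw_step(theta, grad, m, v, t, alpha, beta1, beta2, eps, lam):
    # Decoupled weight decay: the moments see only the raw gradient, and every
    # parameter is shrunk by the same fraction alpha * lam at each step.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat, v_hat = m / (1 - beta1 ** t), v / (1 - beta2 ** t)
    theta = theta - alpha * (m_hat / (np.sqrt(v_hat) + eps) + lam * theta)
    return theta, m, v
```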
Loshchilov and Hutter showed that AdamW substantially improves generalization compared to Adam with L2 regularization, particularly on image classification benchmarks where Adam had previously been outperformed by SGD with momentum. The decoupling also disentangles the weight decay coefficient from the learning rate, making hyperparameter search much easier: in plain Adam, changing the learning rate also implicitly changes the effective regularization strength, while in AdamW the two are independent [5].
AdamW is now the standard optimizer for training large language models and vision transformers. PyTorch provides it as torch.optim.AdamW, and it is the optimizer of choice for training models like GPT, BERT, Llama, Claude, Gemini, and their successors. As a rule of thumb, when someone says they used "Adam" to train a transformer in the post-2019 era, they almost certainly mean AdamW.
The success of Adam has inspired a large family of adaptive optimizers. Some address specific limitations of Adam, while others introduce entirely new ideas. The table below summarizes the most influential variants.
| Optimizer | Year | Key idea | Memory vs Adam | Primary use case |
|---|---|---|---|---|
| Adam [1] | 2014 | Combines momentum and RMSProp with bias correction | 1.0x | General-purpose default |
| AdaMax [1] | 2014 | Replaces L2 norm of past gradients with L_infinity norm | 1.0x | More stable when gradients have heavy tails |
| NAdam [17] | 2016 | Adam with Nesterov momentum lookahead in the first moment | 1.0x | Slightly faster convergence than vanilla Adam |
| AMSGrad [9] | 2018 | Maintains running maximum of v_t to fix Adam's non-convergence issue | 1.5x | Theoretical convergence guarantee |
| AdamW [5] | 2017 / 2019 | Decouples weight decay from adaptive gradient scaling | 1.0x | Standard for LLM and ViT training |
| Adafactor [8] | 2018 | Factorizes second moment into row and column vectors; sublinear memory | ~0.5x | Memory-efficient training of large models |
| LARS [6] | 2017 | Layer-wise adaptive rate scaling for SGD; normalizes gradient by layer | < 1.0x | Large-batch training of CNNs |
| LAMB [7] | 2019 | Layer-wise adaptive rate scaling applied to Adam | 1.0x | Large-batch training of BERT |
| RAdam [10] | 2019 | Rectifies variance of adaptive learning rate, removing need for warmup | 1.0x | Stabilized Adam without manual warmup |
| AdaBelief [18] | 2020 | Adapts step size to the "belief" in the current gradient direction | 1.0x | Sometimes better generalization and GAN stability |
| 8-bit Adam [13] | 2022 | Block-wise quantizes m and v to 8 bits | 0.25x | Memory-efficient LLM training and fine-tuning |
| Lion [11] | 2023 | Discovered via program search; uses sign of momentum; only one state | 0.5x | Efficient alternative to AdamW |
| Sophia [12] | 2023 | Uses diagonal Hessian estimate for second-order-like adaptivity | ~1.0x | Faster LLM pre-training |
| Schedule-Free Adam [19] | 2024 | Eliminates need for learning rate schedules via iterate averaging | 1.0x | Schedule-free training |
| AdEMAMix [20] | 2024 | Maintains two momentum terms with different timescales | 1.5x | Better use of older gradients |
AdaMax was introduced in the same paper as Adam itself [1]. It generalizes Adam by replacing the L2 norm of past gradients with the infinity norm. Concretely, the second-moment update is replaced by u_t = max(beta_2 * u_(t-1), |g_t|), and the parameter update becomes theta_t = theta_(t-1) - (alpha / (1 - beta_1^t)) * m_t / u_t. The infinity norm version is more numerically stable when gradients have heavy-tailed distributions, but in practice AdaMax is rarely used today.
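A minimal sketch of the AdaMax update described above (illustrative names; the small epsilon in the denominator is a practical safeguard, not part of the published update):

```python
import numpy as np

def adamax_step(theta, grad, m, u, t, alpha, beta1=0.9, beta2=0.999):
    m = beta1 * m + (1 - beta1) * grad          # first moment, exactly as in Adam
    u = np.maximum(beta2 * u, np.abs(grad))     # exponentially weighted infinity norm
    # The 1e-8 only guards against division by zero for components whose
    # gradients have been exactly zero so far; it is not part of the update above.
    theta = theta - (alpha / (1 - beta1 ** t)) * m / (u + 1e-8)
    return theta, m, u
```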
NAdam (Nesterov-accelerated Adaptive Moment Estimation) was introduced by Timothy Dozat in a workshop paper at ICLR 2016 [17]. It modifies Adam to incorporate Nesterov momentum, applying a lookahead step inside the first-moment update. The change is small but consistent: NAdam often produces slightly faster convergence than vanilla Adam at the same hyperparameters, with no extra memory cost. Most major frameworks ship a NAdam implementation, but in modern deep learning AdamW has overshadowed it.
Reddi et al. (2018) demonstrated that the original convergence proof for Adam contains an error and constructed a simple convex problem on which Adam fails to converge to the optimum [9]. The issue is that the running average v_t can decrease over time, allowing the effective learning rate to grow when it should shrink, which can push the optimizer away from the optimum. Their fix, called AMSGrad, replaces v_t in the denominator with v_max_t = max(v_max_(t-1), v_t), the running maximum of all past second-moment estimates. This restores monotonicity of the effective learning rate and gives a clean convergence proof. The paper won the ICLR 2018 Best Paper Award. Despite its theoretical importance, AMSGrad has not been widely adopted because the failure modes it prevents rarely show up on real deep learning problems, and the fix slightly hurts empirical performance in many cases.
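The core change is a one-line addition to the Adam step. A minimal sketch, omitting bias correction as in the original AMSGrad presentation:

```python
import numpy as np

def amsgrad_step(theta, grad, m, v, v_max, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    v_max = np.maximum(v_max, v)    # the extra state AMSGrad maintains
    # Using v_max in the denominator means the effective learning rate can only shrink.
    theta = theta - alpha * m / (np.sqrt(v_max) + eps)
    return theta, m, v, v_max
```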
LARS (Layer-wise Adaptive Rate Scaling) was developed by Yang You et al. to enable large-batch training of convolutional neural networks, computing a per-layer trust ratio so that a single global learning rate does not have to fit every layer at very large batch sizes [6]. LAMB (Layer-wise Adaptive Moments optimizer for Batch training) extends this idea to Adam, normalizing the update by the ratio of the parameter norm to the update norm layer by layer [7]. LAMB was used to train BERT with batch sizes up to 64K in 76 minutes on a 1024-TPU pod.
Adafactor, proposed by Shazeer and Stern (2018), addresses Adam's memory overhead [8]. It factorizes the second moment matrix for each weight matrix into the outer product of a row vector and a column vector, cutting memory from O(N) to O(sqrt(N)) for that matrix. It also supports automatic per-parameter learning rate scaling based on the parameter's RMS, reducing hyperparameter tuning. Adafactor was used to train T5 and many other Google models.
Liyuan Liu and colleagues (2019) showed that the variance of Adam's adaptive learning rate is very high in the early steps of training because the second-moment estimate is computed from very few samples [10]. This high variance is the underlying reason why a learning-rate warmup phase is empirically necessary for stable Adam training. RAdam (Rectified Adam) introduces a closed-form term that explicitly rectifies this variance, automatically suppressing the adaptive component when it would be unreliable. The result is that RAdam often trains stably without an explicit warmup schedule, although in practice modern transformer recipes still use warmup for AdamW.
AdaBelief, introduced by Juntang Zhuang and colleagues (2020), modifies Adam by replacing the second moment v_t (the EMA of g_t^2) with s_t, an EMA of (g_t - m_t)^2, which can be interpreted as the variance of the gradient around its current EMA [18]. Loosely, this scales the step size by how much the current gradient agrees with the trend captured by m_t, taking a larger step when the optimizer "believes" the direction is reliable and a smaller step when the gradient is bouncing around. AdaBelief has shown competitive results on classification, GAN training, and reinforcement learning, although it has not displaced AdamW as the default.
Tim Dettmers and colleagues introduced 8-bit Adam in 2022 as part of the bitsandbytes library [13]. The optimizer state, m and v, normally stored in 32-bit floating point, is instead stored in 8-bit using block-wise dynamic quantization. Each block of values is quantized independently with its own dynamic range, which avoids the precision loss that would come from a single global quantization range. Dequantization happens on the fly during each step.
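The idea of block-wise quantization can be sketched with a simplified absmax scheme; the actual bitsandbytes implementation uses a non-uniform dynamic 8-bit data type and its own block size, so treat this only as an illustration:

```python
import numpy as np

BLOCK = 256  # illustrative block size

def quantize_blockwise(x):
    """Quantize a flat float32 array to int8 with one scale per block (simplified absmax scheme)."""
    pad = (-len(x)) % BLOCK
    blocks = np.pad(x, (0, pad)).reshape(-1, BLOCK)
    scales = np.abs(blocks).max(axis=1, keepdims=True) + 1e-12  # per-block dynamic range
    q = np.round(blocks / scales * 127).astype(np.int8)
    return q, scales, len(x)

def dequantize_blockwise(q, scales, n):
    """Recover an approximate float32 array from the int8 codes and per-block scales."""
    return (q.astype(np.float32) / 127 * scales).reshape(-1)[:n]
```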
The paper showed that the 8-bit version matches the convergence of full-precision Adam on a wide range of benchmarks, including GPT-2, GLUE fine-tuning, image classification, and machine translation, while reducing optimizer memory by a factor of four. For very large models this can be the difference between a training run fitting on a given GPU and not fitting at all. The technique has become a workhorse for fine-tuning LLMs under tight memory budgets and is the default optimizer in many parameter-efficient fine-tuning libraries.
Lion (EvoLved Sign Momentum) was discovered by Chen et al. (2023) through an automated program search over the space of possible optimizer update rules [11]. The resulting algorithm is surprisingly simple: it tracks only momentum (no second moment) and uses the sign of an interpolation between the current gradient and the momentum buffer for the parameter update. Because Lion stores only one state variable per parameter instead of two, it cuts optimizer memory in half compared to Adam, a significant savings at large model scale.
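The update rule fits in a few lines. A minimal sketch, with the beta values 0.9 and 0.99 used as defaults in the paper; lr and the weight decay lam would be tuned as discussed below:

```python
import numpy as np

def lion_step(theta, grad, m, lr, beta1=0.9, beta2=0.99, lam=0.0):
    c = beta1 * m + (1 - beta1) * grad                # interpolate momentum and current gradient
    theta = theta - lr * (np.sign(c) + lam * theta)   # sign of that interpolation, plus decoupled decay
    m = beta2 * m + (1 - beta2) * grad                # momentum is tracked with a second, longer beta
    return theta, m                                   # only one state tensor per parameter
```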
Lion has shown strong results on both vision and language tasks. On diffusion models, the original paper reported that Lion outperforms Adam, achieving better FID scores while reducing training compute by up to 2.3 times. On masked language model pretraining and fine-tuning, Lion reaches comparable or slightly better performance than Adam. The main practical caveat is that the optimal Lion learning rate is typically 3 to 10 times smaller than the optimal AdamW learning rate, and weight decay typically needs to be 3 to 10 times larger; recipes built for AdamW do not transfer directly.
Sophia (Second-order Clipped Stochastic Optimization), proposed by Hong Liu et al. (2023), is a lightweight second-order optimizer for language model pretraining [12]. It estimates the diagonal of the Hessian and uses it to adapt per-parameter learning rates, with a more direct connection to the loss surface geometry than Adam's second-moment EMA. On GPT-2 pretraining benchmarks, Sophia reached the same validation loss as AdamW using roughly 50 percent fewer steps. As of 2026, Sophia and similar second-order methods are an active research area but have not displaced AdamW for production-scale LLM training.
Schedule-Free Adam, introduced by Aaron Defazio and colleagues (2024) in "The Road Less Scheduled," removes the need to specify a learning rate schedule [19]. Traditional Adam training requires a warmup phase followed by a cosine or linear decay schedule, which assumes the total training duration is known in advance. Schedule-Free Adam uses a theoretical unification of scheduling and iterate averaging to achieve competitive performance without any schedule, and won the MLCommons 2024 AlgoPerf Algorithmic Efficiency Challenge in the self-tuning track.
AdEMAMix, by Pagliardini, Ablin, and Grangier (2024), augments Adam's first moment with a second momentum buffer that uses a much higher beta (such as 0.9999) to retain information from older gradients [20]. The two momentums are mixed in the update with the long-horizon term getting a small weight. In experiments, AdEMAMix matches AdamW with substantially fewer training tokens.
Several factors explain Adam's widespread adoption. The recommended hyperparameters work well out of the box for a wide range of architectures and datasets, so a researcher can start training with the defaults and often achieve competitive results without hyperparameter search. Adam typically also makes rapid progress in the first few epochs of training, since the adaptive per-parameter learning rates help it navigate loss surfaces with very different scales across parameter groups, which is common in deep networks with embedding layers, attention layers, feedforward layers, and normalization layers all coexisting in one model.
Adam works across architectures: convolutional networks, recurrent networks, transformers, generative adversarial networks, variational autoencoders, diffusion models, graph neural networks, and more. This generality has made it the default choice in codebases, tutorials, and research papers alike. Because it normalizes gradients by the square root of the second moment, it is also relatively insensitive to the absolute scale of the gradients, making it easier to use with different loss functions, batch sizes, and model architectures without rescaling the learning rate.
Despite Adam's convenience, SGD with momentum remains the preferred optimizer in certain settings. The choice between the two is a long-standing practical question in deep learning.
| Consideration | SGD with momentum | Adam / AdamW |
|---|---|---|
| Generalization on image classification | Often strong with proper tuning | Comparable with AdamW |
| Ease of tuning | Requires careful learning rate and schedule selection | Works well with defaults |
| Convergence speed (wall-clock) | Slower, especially early | Faster, especially early |
| Memory per parameter (fp32) | 4 bytes (1 momentum buffer) | 8 bytes (m and v) |
| Large language models | Rarely used | Standard choice (AdamW) |
| Computer vision (CNNs) | Strong tradition, especially ResNet-era | Increasingly common with ViTs |
| Fine-tuning pre-trained models | Sometimes used | More common, especially for NLP |
| Hyperparameter independence | Learning rate, momentum, weight decay all interact | AdamW decouples weight decay from rest |
| Gradient noise tolerance | Sensitive | More tolerant |
Historically, the vision community favored SGD with momentum because it was observed to generalize better than Adam on benchmarks like ImageNet. Wilson et al. (2017), in "The Marginal Value of Adaptive Gradient Methods in Machine Learning," argued that adaptive methods including Adam find solutions that generalize worse than SGD on a range of supervised learning tasks. This generalization gap was largely explained later by Loshchilov and Hutter's work on decoupled weight decay: when weight decay is properly decoupled (as in AdamW), the gap narrows substantially or disappears [5]. Ablations comparing AdamW to SGD-momentum on ImageNet ViT training generally find AdamW to be comparable or better.
For large language model pre-training, AdamW is effectively the only optimizer in widespread production use. The scale of these models (hundreds of billions of parameters, trillions of tokens) makes robust default behavior critical, and AdamW delivers this. The papers and technical reports for GPT-3, PaLM, Llama, Llama 2, Llama 3, Mistral, and DeepSeek all report using AdamW (or a close relative) with closely related recipes.
SGD still has an edge in raw memory efficiency: it stores only one state variable per parameter (the momentum buffer), compared to Adam's two. For extremely large models where GPU memory is the binding constraint, this difference matters, and it has motivated memory-efficient alternatives like Adafactor, 8-bit Adam, and Lion.
The original Adam paper provided a regret bound suggesting O(sqrt(T)) regret for online convex optimization. Reddi, Kale, and Kumar (2018) found a flaw in that proof and constructed an explicit convex problem on which Adam fails to converge to the optimum [9]. Their AMSGrad variant restored a clean convergence guarantee by enforcing monotonicity of the effective learning rate.
In the non-convex setting that actually describes deep learning, no optimizer has tight convergence guarantees on real loss surfaces. A series of papers since 2018 (Defossez et al. 2020; Zou et al. 2019) has provided convergence bounds for Adam under various smoothness and gradient-noise assumptions. Under mild conditions, Adam provably converges to a stationary point of a non-convex objective, though the rates depend on assumptions that may not hold in practice. The more useful perspective is empirical: across thousands of published experiments, Adam and especially AdamW consistently produce good results across a wide range of architectures and datasets, with much less hyperparameter tuning than competing methods.
Adam requires storing two additional state variables (m_t and v_t) for every parameter. For a model with N parameters in float32, this means 8N bytes of additional optimizer state, on top of the 4N bytes of parameters and 4N bytes of gradients. Mixed-precision training adds master copies of the parameters in fp32 as well.
The table below shows the optimizer memory cost for a few model scales, assuming float32 m and v, in addition to the model parameters themselves.
| Model size | Parameters | Adam state (m + v, fp32) | Notes |
|---|---|---|---|
| 125M (GPT-2 small) | 125 million | 1.0 GB | Trivial on a single GPU |
| 1.3B (GPT-Neo 1.3B) | 1.3 billion | 10.4 GB | Fits on a single 24 GB GPU |
| 7B (Llama 2 7B) | 7 billion | 56 GB | Needs sharding or 8-bit optimizer |
| 70B (Llama 2 70B) | 70 billion | 560 GB | Requires multi-GPU sharding |
| 405B (Llama 3 405B) | 405 billion | 3.2 TB | Heavily sharded with ZeRO or FSDP |
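The figures in the table follow from simple arithmetic: two fp32 state tensors per parameter at 4 bytes each. A quick sketch of the calculation:

```python
def adam_state_gb(num_params):
    """Adam keeps two fp32 tensors (m and v) per parameter: 2 states * 4 bytes each."""
    return num_params * 2 * 4 / 1e9

for name, n in [("125M", 125e6), ("1.3B", 1.3e9), ("7B", 7e9), ("70B", 70e9), ("405B", 405e9)]:
    print(f"{name}: {adam_state_gb(n):,.1f} GB")   # 1.0, 10.4, 56.0, 560.0, 3,240.0
```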
For a 70B-parameter model, the 560 GB optimizer state alone exceeds the memory of any single GPU available in 2026. Three main techniques are used to shrink it:
8-bit Adam (Dettmers et al., 2022) quantizes m and v to 8 bits using block-wise quantization, reducing optimizer memory by a factor of four. The 70B model's 560 GB drops to roughly 140 GB, while convergence remains essentially unchanged in the published benchmarks [13].
Adafactor (Shazeer and Stern, 2018) factorizes the second moment matrix of each weight matrix as an outer product of a row vector and a column vector. For a weight matrix of shape (m, n), this turns O(m * n) state into O(m + n) state, sometimes reducing optimizer state by an order of magnitude. The first moment can also be dropped, with some loss of stability [8].
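The factorization itself is easy to sketch. Ignoring Adafactor's decay schedule, update clipping, and relative step sizes, the rank-1 second-moment estimate for a single weight matrix looks roughly like this (names are illustrative):

```python
import numpy as np

def factored_second_moment(R, C, G, beta2=0.999, eps=1e-30):
    """Update row/column statistics for one weight matrix and return the rank-1 estimate.

    R: running row sums of squared gradients, shape (m,)
    C: running column sums of squared gradients, shape (n,)
    G: current gradient matrix, shape (m, n)
    """
    sq = G ** 2 + eps
    R = beta2 * R + (1 - beta2) * sq.sum(axis=1)   # one value per row
    C = beta2 * C + (1 - beta2) * sq.sum(axis=0)   # one value per column
    V_hat = np.outer(R, C) / R.sum()               # rank-1 reconstruction; persistent state is O(m + n)
    return R, C, V_hat
```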
Lion (Chen et al., 2023) stores only the momentum buffer, halving the optimizer memory relative to Adam. Combined with bf16 storage, the savings stack [11].
At the scale of modern LLMs, the optimizer state alone does not fit on one GPU and must be partitioned across the devices in a training cluster.
DeepSpeed ZeRO (Zero Redundancy Optimizer), introduced by Rajbhandari et al. (2020), shards the optimizer state, gradients, and parameters across data-parallel workers in three increasingly aggressive stages [21]. ZeRO Stage 1 shards the optimizer state, ZeRO Stage 2 also shards the gradients, and ZeRO Stage 3 shards the parameters as well. With Stage 3, no single GPU ever holds the full state of any tensor, which is what makes training models with hundreds of billions of parameters tractable on commercially available hardware.
PyTorch FSDP (Fully Sharded Data Parallel), built on the same idea, is the native PyTorch implementation of optimizer-state and parameter sharding and is now the dominant choice in PyTorch-based training stacks. Combined with bf16 mixed precision and 8-bit Adam, FSDP enables training of 70B-parameter models on relatively modest GPU clusters.
Tensor parallelism and pipeline parallelism complicate the optimizer state further: when a single weight matrix is split across multiple GPUs, the corresponding chunks of m and v are also split, and the optimizer step is run independently on each shard. Modern training frameworks such as Megatron-LM, NeMo, Mosaic Composer, and Colossal-AI handle these details so that researchers can focus on the model rather than on optimizer plumbing.
LLM training has standardized heavily around AdamW. The typical recipe linearly warms the learning rate from zero to the peak value over the first 0.5 to 5 percent of training steps (commonly 2,000 to 8,000 steps), then decays it on a cosine schedule down to roughly 10 percent of peak. Peak learning rates fall between 1e-4 and 6e-4 for the 1B to 100B parameter range, with larger models using smaller values: Llama 2 7B used 3e-4, Llama 2 70B used 1.5e-4, and Llama 3 405B used 8e-5. Weight decay is set to 0.1 (BERT-style models historically used 0.01). The beta values are beta_1 = 0.9 and beta_2 = 0.95, a shorter second-moment window than Adam's original 0.999. The global gradient norm is clipped to 1.0 to suppress occasional gradient spikes, and forward and backward passes are run in bf16 with the master parameters and optimizer state kept in fp32 for numerical stability. This recipe, with minor variations, appears in the training descriptions of GPT-3, PaLM, Llama 1 / 2 / 3, Chinchilla, Mistral, DeepSeek V2 / V3, Qwen, and other large language models published since 2020 [14][15][16]. Some recent runs replace the cosine schedule with a Warmup-Stable-Decay schedule that holds the learning rate constant for most of training and decays only at the end.
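In PyTorch, the skeleton of such a recipe might look like the sketch below; the model, step counts, and peak learning rate are placeholders rather than a recommendation:

```python
import math
import torch

model = torch.nn.Linear(1024, 1024)               # placeholder model
peak_lr, warmup_steps, total_steps = 3e-4, 2000, 100_000

optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr,
                              betas=(0.9, 0.95), weight_decay=0.1)

def lr_lambda(step):
    # Linear warmup from zero to the peak, then cosine decay to 10% of the peak.
    if step < warmup_steps:
        return step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.1 + 0.9 * 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for step in range(total_steps):
    loss = model(torch.randn(8, 1024)).pow(2).mean()          # placeholder loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)   # global-norm clipping at 1.0
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```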
Adam is available in every major deep learning framework.
In PyTorch:
```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999), eps=1e-8)

# Or use AdamW for decoupled weight decay:
optimizer = torch.optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.01)
```
In TensorFlow / Keras:
```python
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-7)

# Or AdamW:
optimizer = tf.keras.optimizers.AdamW(learning_rate=0.001, weight_decay=0.01)
```
In JAX using Optax:
```python
import optax

optimizer = optax.adam(learning_rate=0.001, b1=0.9, b2=0.999, eps=1e-8)

# AdamW with decoupled weight decay:
optimizer = optax.adamw(learning_rate=0.001, weight_decay=0.01)
```
For 8-bit Adam on PyTorch:
```python
import bitsandbytes as bnb

optimizer = bnb.optim.Adam8bit(model.parameters(), lr=0.001, weight_decay=0.01)

# Or AdamW8bit for decoupled weight decay:
optimizer = bnb.optim.AdamW8bit(model.parameters(), lr=0.001, weight_decay=0.01)
```
PyTorch's implementation includes several options beyond the basic algorithm: amsgrad toggles the AMSGrad variant, fused enables a faster CUDA implementation that fuses the optimizer step into a single kernel, and foreach batches the per-parameter update operations across tensors for better GPU utilization. For most production training workloads, fused=True and AdamW with decoupled weight decay are sensible defaults.
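One common convention, not required by any of the recipes above, is to exclude biases and normalization parameters from weight decay using parameter groups. A sketch, assuming model is the module being trained as in the earlier snippets:

```python
import torch

decay, no_decay = [], []
for p in model.parameters():
    if not p.requires_grad:
        continue
    # Heuristic: 1-D tensors (biases, LayerNorm/RMSNorm weights) go in the no-decay group.
    (no_decay if p.ndim < 2 else decay).append(p)

optimizer = torch.optim.AdamW(
    [{"params": decay, "weight_decay": 0.1},
     {"params": no_decay, "weight_decay": 0.0}],
    lr=3e-4, betas=(0.9, 0.95),
    fused=torch.cuda.is_available(),   # the fused CUDA kernel requires parameters on a GPU
)
```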
The convergence gap was the focus of Reddi et al. (2018), who showed that Adam can fail to converge in convex settings due to non-monotonic effective learning rates [9]. AMSGrad fixed the proof but is rarely used in practice because the failure modes do not appear on real deep learning problems.
The generalization gap, raised by Wilson et al. (2017) in "The Marginal Value of Adaptive Gradient Methods in Machine Learning," was a major argument against Adam in the late 2010s [22]. Wilson and colleagues found that on several supervised learning tasks, Adam reached a lower training loss than SGD but a higher test loss, suggesting it found sharper minima that generalize less well. Subsequent work attributed most of this gap to the L2-versus-weight-decay confusion that AdamW resolves; once weight decay is decoupled, the gap largely disappears for transformer training, although it can still persist for some CNN tasks.
Memory overhead in vanilla Adam is unavoidable: 8 bytes of optimizer state per parameter in float32, on top of 4 bytes of parameter and 4 bytes of gradient. The 8-bit Adam, Adafactor, and Lion variants exist to mitigate this.
Sensitivity to beta_2 has been observed in large model training. The default beta_2 = 0.999 creates a very long window for the second moment estimate. If the gradient distribution changes significantly during training, the second moment estimate can lag behind, and the shift to beta_2 = 0.95 in modern LLM training is a direct response.
Numerical issues with epsilon arise in mixed-precision training. With fp16, very small values of v_hat_t can underflow and the denominator becomes dominated by epsilon. Training in bf16 (which has the same exponent range as fp32 but less mantissa precision) avoids most of this; fp16 training often needs a larger epsilon or loss scaling to be stable.
Finally, Adam can be unstable on transformers without warmup. The first few thousand steps often produce wildly varying gradients before the second-moment estimate has stabilized, which can cause loss spikes. Both warmup and RAdam-style variance rectification address this.
As of 2026, AdamW remains the dominant optimizer for training deep learning models, particularly large language models and vision transformers. Memory-efficient optimizers like Lion, Adafactor, and 8-bit Adam are gaining traction where optimizer memory is the bottleneck, including QLoRA-style fine-tuning of large LLMs on consumer GPUs. Schedule-free methods could simplify training pipelines by removing the need to specify learning rate schedules. Second-order methods like Sophia offer potential for faster convergence through curvature information. Distributed methods like LAMB and zero-redundancy optimizers (ZeRO, FSDP) address the challenges of sharding optimizer state across many GPUs.
The field of optimizer research continues to be active, but Adam's combination of simplicity, robustness, and strong defaults makes it likely to remain the go-to choice for the foreseeable future. By providing an optimizer that just works, Kingma and Ba enabled researchers to focus on model architecture and data rather than optimization tuning, accelerating progress across all of deep learning.