AdamW is a variant of the Adam optimizer that decouples weight decay from the gradient-based update rule. Introduced by Ilya Loshchilov and Frank Hutter in a 2017 preprint and formally published at the International Conference on Learning Representations (ICLR) in 2019 under the title Decoupled Weight Decay Regularization, AdamW fixes a subtle but consequential flaw in how standard Adam combines L2 regularization with adaptive per-parameter learning rates [1]. The fix is short enough to fit on a single line of pseudocode, yet it has reshaped how nearly every modern deep neural network is trained. AdamW is the default optimizer for the vast majority of transformer models trained since 2018, including GPT-4, Claude, Gemini, and the Llama 4 family of open-weight models. When practitioners refer to "the optimizer" used to pretrain a large language model, they almost always mean AdamW with cosine learning rate decay and linear warmup.
The core insight behind AdamW is that L2 regularization and weight decay are mathematically equivalent only for plain stochastic gradient descent, not for adaptive methods like Adam, RMSProp, or Adagrad [1]. In adaptive optimizers, adding an L2 penalty to the loss function causes the regularization strength to be scaled by each parameter's running second-moment estimate. Parameters with large historical gradients receive weaker regularization than rarely-updated parameters, which undermines the uniform shrinkage that weight decay is supposed to provide. AdamW restores the intended behavior by applying weight decay directly to the parameters as a separate term in the update, outside the adaptive scaling. The result is more consistent regularization, better generalization, and a significant practical benefit: the learning rate and weight decay hyperparameters become much more independent, which makes hyperparameter tuning dramatically easier.
The original Adam algorithm was introduced by Diederik Kingma and Jimmy Ba in a 2014 preprint that became one of the most cited papers in machine learning history [2]. Adam combines two ideas from earlier optimizers: momentum, which accumulates an exponential moving average of past gradients, and per-parameter adaptive learning rates inspired by RMSProp and Adagrad, which scale updates inversely to the square root of an exponential moving average of squared gradients. Adam adds bias correction to both moment estimates so that the running averages are accurate even at the start of training when the buffers are still warming up from their zero initialization. The algorithm requires only first-order gradient information, has modest memory requirements compared to second-order methods, and is invariant to diagonal rescaling of the gradients. These properties made Adam an immediate hit in the deep learning community.
Despite Adam's popularity, by 2017 it had developed a reputation for generalizing slightly worse than well-tuned stochastic gradient descent with momentum, particularly on image classification benchmarks. Practitioners who wanted state-of-the-art accuracy on tasks like CIFAR-10 and ImageNet often reverted to SGD, accepting the longer tuning effort in exchange for a few tenths of a percentage point of test accuracy. Several papers tried to explain the gap, attributing it variously to noise in the second-moment estimate, sharp minima found by adaptive optimizers, and the difficulty of correctly setting weight decay. Loshchilov and Hutter's contribution was to identify a specific implementation error in how virtually every deep learning framework was applying weight decay to Adam, and to show that fixing it largely closed the generalization gap.
For stochastic gradient descent, the two common ways to penalize large weights are mathematically interchangeable. The first approach, L2 regularization, modifies the loss function by adding a quadratic penalty on the weights. When the gradient of this combined loss is computed, it produces a term equal to the original gradient plus a constant times the weights themselves. The second approach, weight decay, modifies the parameter update directly by multiplying the weights by a factor slightly less than one at each step before subtracting the gradient step. Loshchilov and Hutter point out that for plain SGD with learning rate alpha and decay coefficient lambda, both approaches produce identical updates if the L2 coefficient is chosen to be lambda divided by alpha [1]. Most deep learning frameworks therefore use the L2 implementation and call it weight decay, treating the two as synonymous.
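To make the equivalence concrete, the following sketch (plain NumPy, with illustrative values for the learning rate, decay coefficient, weights, and gradient) checks that folding an L2 penalty of lambda/alpha into the gradient produces the same SGD update as decoupled weight decay with coefficient lambda:

```python
import numpy as np

alpha, lam = 0.1, 0.01                  # learning rate and decay coefficient (illustrative)
theta = np.array([1.0, -2.0, 3.0])      # current weights
grad = np.array([0.5, 0.25, -0.1])      # gradient of the unregularized loss

# L2 regularization: fold (lambda / alpha) * theta into the gradient, then take an SGD step.
theta_l2 = theta - alpha * (grad + (lam / alpha) * theta)

# Decoupled weight decay: shrink the weights, then take the plain gradient step.
theta_wd = (1.0 - lam) * theta - alpha * grad

assert np.allclose(theta_l2, theta_wd)  # identical updates for plain SGD
```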
For Adam, the equivalence breaks down. The Adam update divides each gradient component by the square root of the running second moment plus a small epsilon. If weight decay is implemented as L2 regularization (folded into the gradient), then the weight-decay term is also divided by the same per-parameter denominator. A weight whose recent gradients have been large will have a large second-moment estimate, so its effective decay coefficient is small. A weight whose recent gradients have been small or sparse will have a small second-moment estimate, so its effective decay coefficient is large. The net effect is that Adam with L2 regularization regularizes infrequently-active parameters far more aggressively than frequently-active ones, the opposite of what one usually wants and the opposite of how SGD with weight decay behaves.
Decoupled weight decay fixes this by leaving the gradient untouched and instead subtracting a fraction of the parameter from itself, after the adaptive update is computed. The pseudocode change is to replace the line that adds lambda times theta to the gradient with a line that subtracts alpha times lambda times theta from the parameter at the end of the step. With this single modification, weight decay applies uniformly to every parameter regardless of its gradient history, mirroring the behavior of SGD with weight decay and restoring the original meaning of the lambda hyperparameter.
The AdamW update at step t for a parameter theta uses the following quantities: the gradient g_t of the loss with respect to theta, the running first moment m_{t-1} and second moment v_{t-1} from the previous step, the exponential decay rates beta_1 and beta_2, the small numerical stabilizer epsilon, the learning rate alpha, and the weight decay coefficient lambda. The update proceeds in five steps. First, update the first moment estimate as m_t = beta_1 * m_{t-1} + (1 - beta_1) * g_t. Second, update the second moment estimate as v_t = beta_2 * v_{t-1} + (1 - beta_2) * g_t squared, where the square is element-wise. Third, compute the bias-corrected estimates m_hat = m_t / (1 - beta_1^t) and v_hat = v_t / (1 - beta_2^t). Fourth, compute the adaptive step delta = m_hat / (sqrt(v_hat) + epsilon). Fifth, update the parameter as theta_t = theta_{t-1} - alpha * (delta + lambda * theta_{t-1}).
The critical contrast with vanilla Adam plus L2 regularization is the placement of the lambda * theta term. In Adam with L2, lambda * theta is added to g_t before the second-moment update, so it gets squared, accumulated into v_t, and divided by sqrt(v_hat). In AdamW, lambda * theta_{t-1} is added to delta after the adaptive scaling, so it produces a clean shrinkage independent of the gradient history. Some implementations equivalently write the final step as theta_t = (1 - alpha * lambda) * theta_{t-1} - alpha * delta, which makes the multiplicative shrinkage even more obvious. Both formulations produce identical updates and the choice between them is purely cosmetic.
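The following NumPy sketch spells out one AdamW step exactly as described above; the commented line marks the only place where Adam with coupled L2 regularization would differ. The function name and default values are illustrative, not taken from any particular library.

```python
import numpy as np

def adamw_step(theta, g, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, lam=0.01):
    # Adam with coupled L2 regularization would instead begin with: g = g + lam * theta
    m = beta1 * m + (1 - beta1) * g                  # first moment update
    v = beta2 * v + (1 - beta2) * g * g              # second moment update (element-wise square)
    m_hat = m / (1 - beta1 ** t)                     # bias correction
    v_hat = v / (1 - beta2 ** t)
    delta = m_hat / (np.sqrt(v_hat) + eps)           # adaptive step
    theta = theta - alpha * (delta + lam * theta)    # decoupled decay, outside the adaptive scaling
    return theta, m, v
```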
A further refinement that appears in the original paper but is often omitted in practice is to schedule the weight decay coefficient with the same multiplier used for the learning rate. If the learning rate is multiplied by a schedule factor eta_t (for example a cosine decay), then the weight decay should also be multiplied by eta_t. This keeps the ratio of decay to gradient step constant throughout training and preserves the property that weight decay corresponds to a fixed implicit prior over the weights. Most modern implementations follow this convention through their learning rate scheduler, although some libraries treat the two coefficients as fully independent.
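As a minimal sketch of this convention, assuming delta is the adaptive step computed as in the update above and eta_t is the multiplier produced by the learning rate scheduler, the final step can be written so that one schedule factor scales both terms:

```python
def final_update(theta, delta, eta_t, alpha=1e-3, lam=0.01):
    # The schedule factor eta_t shrinks the gradient step and the decay together,
    # keeping their ratio constant throughout training.
    return theta - eta_t * alpha * (delta + lam * theta)
```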
The default AdamW hyperparameters in PyTorch are beta_1 = 0.9, beta_2 = 0.999, epsilon = 1e-8, learning rate alpha = 1e-3, and weight decay lambda = 0.01 [3]. These defaults work reasonably well for small to medium models but are almost never used directly for training large transformers. The dominant convention for large language model pretraining sets beta_2 to 0.95 rather than 0.999, which gives the second-moment estimate a much shorter effective averaging window. The shorter window allows the estimate to track gradient magnitude more responsively as training progresses through different phases. This setting was popularized by GPT-3 and has been carried forward by virtually every major LLM release since.
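A sketch of constructing the optimizer with this LLM-pretraining convention is shown below; the tiny stand-in model and the peak learning rate are placeholders rather than values from any specific published recipe.

```python
import torch

model = torch.nn.Linear(768, 768)   # stand-in for a real network

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,                  # peak learning rate; a scheduler modulates it during training
    betas=(0.9, 0.95),        # shorter second-moment window than the 0.999 default
    eps=1e-8,
    weight_decay=0.1,
)
```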
The table below summarizes optimizer settings reported for several well-known models and benchmarks.
| Model | beta_1 | beta_2 | Weight decay | Peak learning rate | Schedule | Source |
|---|---|---|---|---|---|---|
| BERT-Large | 0.9 | 0.999 | 0.01 | 1e-4 | Linear warmup, linear decay | Devlin et al. 2018 |
| GPT-2 1.5B | 0.9 | 0.95 | 0.01 | 2.5e-4 | Cosine decay with warmup | Radford et al. 2019 |
| GPT-3 175B | 0.9 | 0.95 | 0.1 | 6e-5 | Cosine decay with 375M-token linear warmup | Brown et al. 2020 |
| LLaMA 65B | 0.9 | 0.95 | 0.1 | 1.5e-4 | Cosine decay with 2k step warmup | Touvron et al. 2023 |
| LLaMA 2 70B | 0.9 | 0.95 | 0.1 | 1.5e-4 | Cosine decay to 10% of peak | Touvron et al. 2023 |
| ViT-Large | 0.9 | 0.999 | 0.3 | 1e-3 | Cosine decay with warmup | Dosovitskiy et al. 2020 |
| Stable Diffusion | 0.9 | 0.999 | 0.01 | 1e-4 | Constant with warmup | Rombach et al. 2022 |
A near-universal convention is to exclude bias terms and the scale and shift parameters of layer normalization from weight decay [3]. Bias parameters shift the activation function input; shrinking them toward zero forces the model to use zero-centered activations even when the data does not warrant it. LayerNorm scale parameters control the post-normalization activation magnitude, and decaying them to zero would distort the normalization. Embedding matrices are sometimes also excluded, particularly in vision transformers, although the practice is less consistent. The standard idiom in PyTorch is to construct two parameter groups when initializing the optimizer: one with weight_decay set to lambda containing the weight matrices (parameters with two or more dimensions), and one with weight_decay set to zero containing the one-dimensional parameters such as biases and normalization scales.
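A minimal version of that idiom might look like the sketch below; the ndim test used to split the parameters is a common heuristic (assumed here), and real training code often also filters by parameter name to handle embeddings and other special cases.

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(768, 768),
    torch.nn.LayerNorm(768),
)

decay, no_decay = [], []
for param in model.parameters():
    if not param.requires_grad:
        continue
    # Weight matrices (2D and higher) get decay; biases and norm scales (1D) do not.
    (decay if param.ndim >= 2 else no_decay).append(param)

optimizer = torch.optim.AdamW(
    [
        {"params": decay, "weight_decay": 0.1},
        {"params": no_decay, "weight_decay": 0.0},
    ],
    lr=3e-4,
    betas=(0.9, 0.95),
)
```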
A common rule of thumb is that the optimal weight decay grows roughly with the square root of the dataset size and inversely with model size, although recent scaling-law work suggests the dependence is more complicated than any simple closed form. Most published recipes for transformer pretraining use values between 0.01 and 0.1, with 0.1 being the default for autoregressive language models and 0.01 the default for masked language models like BERT. Fine-tuning recipes use lower values, often between 0 and 0.01, since the pretrained weights are already well regularized.
The epsilon parameter is set to 1e-8 by default in PyTorch and to 1e-6 or 1e-4 in some other libraries. Larger epsilon values reduce the magnitude of the adaptive step for parameters with very small second-moment estimates, which can stabilize training in low-precision arithmetic. Some recipes for bfloat16 training instead go the other way and set epsilon as small as 1e-15, so that the adaptive scaling is preserved rather than flattened when v_hat becomes extremely small.
AdamW is rarely used with a constant learning rate. The dominant schedule for transformer pretraining combines a short linear warmup with a long cosine decay. During the warmup phase, the learning rate increases linearly from zero (or a very small value) to its peak over the first few hundred to several thousand steps. The warmup serves several purposes: it allows the second-moment estimate to accumulate enough samples to be reliable before any large updates are applied, it prevents the bias-correction term from producing very large initial steps when t is small, and it gives the model time to escape any pathological initial configuration without diverging.
After warmup, cosine decay smoothly reduces the learning rate from its peak to a final value (often 10 percent of peak) following one half-cycle of a cosine curve. The cosine schedule was popularized by Loshchilov and Hutter's earlier SGDR paper and has empirically been shown to produce strong final loss values across many architectures and dataset sizes. An alternative is linear decay to zero or to a small final value, which is the default in the Hugging Face Transformers library and which performs comparably to cosine for many fine-tuning tasks.
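A sketch of such a multiplier is shown below, with the warmup length, total step count, and 10-percent floor as assumed example values; in PyTorch a function like this is typically handed to torch.optim.lr_scheduler.LambdaLR.

```python
import math

def warmup_cosine_multiplier(step, warmup_steps=2000, total_steps=100_000, min_ratio=0.1):
    if step < warmup_steps:
        return step / max(1, warmup_steps)                           # linear warmup from zero
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))    # one half-cycle of cosine
    return min_ratio + (1.0 - min_ratio) * cosine                    # decay to min_ratio of peak
```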
For very long training runs, a more recent practice is to use a constant learning rate after warmup with a brief cooldown phase at the end. This approach, sometimes called the warmup-stable-decay schedule, allows the practitioner to extend or stop training without committing in advance to a final step count. It also makes it easier to compare loss curves at intermediate checkpoints since they all see the same learning rate.
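A comparable sketch of a warmup-stable-decay multiplier, assuming a linear cooldown over the final decay_steps (the exact cooldown shape varies between recipes):

```python
def wsd_multiplier(step, warmup_steps=2000, total_steps=100_000, decay_steps=10_000):
    if step < warmup_steps:
        return step / max(1, warmup_steps)      # linear warmup
    if step < total_steps - decay_steps:
        return 1.0                              # long constant plateau
    remaining = total_steps - step
    return max(0.0, remaining / decay_steps)    # brief cooldown at the end
```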
AdamW's memory overhead is one of its main practical drawbacks. The optimizer maintains a first moment buffer and a second moment buffer for every trainable parameter, each typically stored in float32 even when the model parameters and gradients use lower precision. The table below summarizes per-parameter memory in a typical mixed-precision training setup with bfloat16 parameters and gradients but float32 optimizer state.
| Component | Bytes per parameter | Notes |
|---|---|---|
| Model weights (bfloat16) | 2 | Required for forward pass |
| Master weights (float32) | 4 | Used for accurate parameter update |
| Gradients (bfloat16) | 2 | Reduced across data-parallel ranks |
| First moment m (float32) | 4 | Adam state |
| Second moment v (float32) | 4 | Adam state |
| Total | 16 | Per parameter, excluding activations |
For a 70 billion parameter model, the buffers in the table add up to roughly 1.1 terabytes, of which the two Adam moment buffers alone account for about 560 GB, four times the size of the bfloat16 parameters. A common optimization is to use fully sharded data parallelism (FSDP) or ZeRO stage 3, which partitions the optimizer state, gradients, and parameters across data-parallel ranks so that each rank holds only its share. Another approach is to quantize the optimizer state to 8-bit or even 4-bit precision; the 8-bit AdamW implementation in the bitsandbytes library reduces optimizer memory from 8 bytes per parameter to 2 bytes per parameter with negligible loss of accuracy. More aggressive techniques like Adafactor, GaLore, and LoRA reduce the optimizer state further by exploiting low-rank structure or factored representations of the second moment.
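The arithmetic behind those figures is simple enough to sketch directly from the table's byte counts; the numbers below are approximations that exclude activations and any communication buffers.

```python
def training_state_gb(n_params, bytes_per_param):
    # Rough memory footprint in gigabytes for a given per-parameter byte count.
    return n_params * bytes_per_param / 1e9

n = 70e9
print(training_state_gb(n, 16))   # ~1120 GB: all buffers in the table
print(training_state_gb(n, 8))    # ~560 GB: the two Adam moment buffers alone
print(training_state_gb(n, 2))    # ~140 GB: the bfloat16 parameters themselves
```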
AdamW remains the dominant optimizer for transformer training despite a steady stream of proposed alternatives. The table below summarizes how AdamW compares to several widely discussed alternatives on the dimensions that matter most for large-scale training.
| Optimizer | State per parameter | Update mechanism | Typical learning rate vs AdamW | Notes |
|---|---|---|---|---|
| Adam | 8 bytes | Adaptive first and second moment | Same | Couples L2 with adaptive scaling, hurts generalization |
| AdamW | 8 bytes | Adaptive with decoupled weight decay | Reference | Industry default for transformer pretraining |
| LAMB | 8 bytes | Adam plus per-layer trust ratio | Larger | Designed for very large batch sizes [4] |
| Adafactor | ~4 bytes | Factored second moment, sign update option | Comparable | Memory-efficient, used in T5 and PaLM [5] |
| Lion | 4 bytes | Sign of momentum | 3x to 10x smaller | Discovered by program search, simpler update [6] |
| Sophia | 8 bytes plus periodic Hessian | Diagonal Hessian preconditioner | Comparable | Aims for 2x speedup on LLM pretraining [7] |
| Distributed Shampoo | Larger preconditioner blocks | Kronecker-factored second order | Comparable | Won AlgoPerf external tuning track [8] |
| Muon | 2 bytes plus per-step NS iterations | Orthogonalized momentum via Newton-Schulz | Comparable | Reports 2x compute efficiency on transformers [9] |
LAMB (Layerwise Adaptive Moments optimizer for Batch training) was introduced by Yang You and colleagues in 2019 and adds a per-layer trust ratio to the Adam update [4]. The trust ratio rescales each layer's update so that its norm is proportional to the norm of the layer's weights, which prevents any single layer from dominating the update at very large batch sizes. LAMB allowed BERT-Large to be trained with a batch size of 32,768 in 76 minutes on a TPUv3 Pod, a record at the time. LAMB is mostly used for large-batch pretraining and has not displaced AdamW for typical batch sizes.
Adafactor, introduced by Noam Shazeer and Mitchell Stern in 2018, factorizes the second-moment matrix into the outer product of two smaller vectors, reducing the memory cost from O(n*m) to O(n+m) for an n by m weight matrix. Adafactor was used to train T5 and PaLM and remains popular for very large models where optimizer state would otherwise dominate memory. It often converges slightly slower than AdamW on small to medium models but the gap closes for very large models, and the memory savings can enable larger batch sizes that more than compensate for any per-step inefficiency.
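The factored approximation can be sketched roughly as follows, keeping only per-row and per-column statistics and reconstructing the full second moment on the fly; this illustration omits Adafactor's update clipping, relative step sizes, and other details from the paper.

```python
import numpy as np

def factored_second_moment(R, C, g, beta2=0.999, eps=1e-30):
    """R has shape (n,), C has shape (m,), g has shape (n, m)."""
    g2 = g * g + eps
    R = beta2 * R + (1 - beta2) * g2.sum(axis=1)    # per-row statistic, O(n) memory
    C = beta2 * C + (1 - beta2) * g2.sum(axis=0)    # per-column statistic, O(m) memory
    v_hat = np.outer(R, C) / R.sum()                # rank-1 reconstruction of the full moment
    return R, C, v_hat
```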
Lion (EvoLved Sign Momentum) was introduced in 2023 by Xiangning Chen and colleagues at Google through a symbolic program-search procedure that automatically discovered new optimizer variants [6]. Lion uses only momentum (no second-moment estimate) and applies the sign function to the momentum buffer, so every parameter receives an update of identical magnitude scaled by the learning rate. Lion typically requires a learning rate three to ten times smaller than AdamW. It uses half the optimizer state of AdamW and has been shown to match or exceed AdamW on vision-language contrastive learning, diffusion models, and autoregressive language modeling, although the improvements are not universal across all settings and model sizes.
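The update is compact enough to sketch in a few lines; the default betas below are the ones reported in the Lion paper, while the remaining values are illustrative.

```python
import numpy as np

def lion_step(theta, g, m, lr=1e-4, beta1=0.9, beta2=0.99, lam=0.01):
    update = np.sign(beta1 * m + (1 - beta1) * g)   # sign of the interpolated momentum
    theta = theta - lr * (update + lam * theta)     # decoupled weight decay, as in AdamW
    m = beta2 * m + (1 - beta2) * g                 # momentum tracks the raw gradient
    return theta, m
```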
Sophia (Second-order Clipped Stochastic Optimization) was introduced by Hong Liu and colleagues in 2023 and applies a diagonal Hessian preconditioner with element-wise clipping [7]. Sophia estimates the diagonal Hessian only every few iterations to keep per-step cost low, and uses a clipping mechanism to bound the maximum update magnitude. The authors report that Sophia achieves the same validation pretraining loss as Adam in roughly half the number of steps on GPT-2 models from 125M to 1.5B parameters. Sophia has not seen wide adoption in production LLM training despite the favorable benchmark results, partly because the Hessian estimation adds engineering complexity and partly because many of its gains come from very long training runs that few labs replicate.
Distributed Shampoo, originally introduced by Vineet Gupta and colleagues in 2018 and scaled up by Rohan Anil and colleagues in 2020, applies a Kronecker-factored approximation to the second-order preconditioner [8]. Each weight matrix is preconditioned with the product of two smaller matrices, one for each axis. A distributed implementation of Shampoo won the external tuning track of the AlgoPerf neural network training algorithm competition in 2024, narrowly beating well-tuned AdamW baselines. Shampoo's main drawback is the cost of computing matrix inverses for the preconditioners, which has historically limited its use to settings with large pools of accelerators and complex distributed implementations.
Muon (Momentum Orthogonalized by Newton-Schulz), introduced by Keller Jordan in October 2024, applies a Newton-Schulz iteration to orthogonalize the momentum buffer before each step [9]. The Newton-Schulz iteration approximates the matrix square root inverse needed to whiten the gradient, providing a second-order-like update at modest computational cost. Muon only applies to 2D matrix parameters; biases, embeddings, and the final output projection are still trained with AdamW. Reported scaling-law experiments suggest Muon achieves roughly 2x computational efficiency over AdamW for compute-optimal transformer training, and Muon has been adopted by several labs for production LLM training, including the Kimi team's K1.5 and K2 models.
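The orthogonalization idea can be illustrated with the classical cubic Newton-Schulz iteration shown below; Muon's published implementation uses a tuned quintic polynomial run in bfloat16, so treat this only as a sketch of the principle, not the production recipe.

```python
import numpy as np

def orthogonalize(M, steps=5):
    # Scale so all singular values are at most 1, inside the iteration's convergence region.
    X = M / (np.linalg.norm(M) + 1e-7)
    for _ in range(steps):
        # Classical Newton-Schulz step: pushes every singular value toward 1
        # while leaving the singular vectors (the update's "direction") unchanged.
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X
```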
Despite these alternatives, AdamW remains the workhorse for the overwhelming majority of transformer training in 2026. Its combination of robust convergence, well-understood hyperparameter behavior, and broad library support keeps it the safe default choice. Most newer optimizers that report large speedups over AdamW are evaluated on relatively short training runs and do not always maintain their advantage over the multi-trillion-token training runs that characterize frontier models.
The AdamW paper has been cited tens of thousands of times and the algorithm appears in essentially every modern deep learning library. PyTorch provides AdamW as a separate class (torch.optim.AdamW), distinct from the older Adam class, which retains the L2-coupled implementation for backward compatibility. TensorFlow added AdamW first through the TensorFlow Addons package and later as a built-in Keras optimizer. JAX exposes AdamW through Optax. The Hugging Face Transformers library uses AdamW as the default optimizer for nearly all of its training scripts and example notebooks.
The practical importance of AdamW is hard to overstate. The major foundation models trained since 2018 have almost all used AdamW or a close variant, including the GPT family from OpenAI, the Claude family from Anthropic, the Llama family from Meta, the Gemini and PaLM families from Google, the Mistral models, the Qwen series from Alibaba, the DeepSeek models, and most of the open-source community models that fill the Hugging Face Hub. Vision transformers and multimodal models like CLIP and Flamingo also use AdamW. Diffusion models including Stable Diffusion and DALL-E use AdamW. Even recent reinforcement learning methods that train policies with PPO or GRPO on top of language models use AdamW for the underlying gradient updates.
For researchers and engineers reproducing published recipes, the first thing to verify when reading a paper that says it used "Adam" is whether the authors actually used Adam with L2 regularization (the older convention) or AdamW (the modern convention). Many papers from before about 2019 used L2-coupled Adam and reported their weight decay coefficient as if it were the AdamW lambda, which can lead to wildly different effective regularization when reproduced in a modern framework. Modern reproductions usually convert the coefficient by multiplying it by the learning rate to get an equivalent decoupled lambda, although this is only an approximation since the two updates are not exactly equivalent for adaptive optimizers.
AdamW is not a finished story. Its memory overhead remains a significant constraint for the largest models, motivating ongoing work on memory-efficient adaptive optimizers including Adafactor, 8-bit AdamW, GaLore, and APOLLO. Its hyperparameters, while easier to tune than those of Adam with L2, still require some care; recent work on scaling laws for AdamW weight decay shows that the optimal lambda depends nontrivially on dataset size, batch size, and model size, and several labs have published recipes for setting these hyperparameters as a function of model scale.
A second active research direction is whether AdamW is actually the right inductive bias for transformer training. Lion, Sophia, Muon, and Shampoo all challenge the assumption that diagonal preconditioning by the second-moment estimate is the best per-parameter scaling. Each of these methods has demonstrated meaningful gains in particular regimes, and the question of whether one of them will displace AdamW as the default for frontier training is open. So far, the inertia of the deep learning ecosystem (with mature implementations, well-understood failure modes, and decades of accumulated practitioner intuition) has kept AdamW dominant even when newer methods report better numbers on benchmarks.
A third question is whether the bias correction terms in AdamW are actually useful, harmful, or neutral for very long training runs. The bias correction was originally motivated by the observation that the moment estimates start at zero and slowly warm up, but for training runs that span hundreds of thousands or millions of steps, the bias-correction multiplier is essentially one for almost all of training and only matters in the first few hundred steps. Variants such as NAdam and Yogi alter the momentum and bias-correction scheme or the second-moment update, with mixed empirical results.
Finally, there is interest in the theoretical foundations of AdamW. The convergence proofs available for vanilla Adam do not transfer cleanly to AdamW, and the role of weight decay in nonconvex stochastic optimization is still poorly understood. Recent work has connected weight decay to implicit regularization toward flat minima, to a form of equivariance to network parameterization, and to the spectral properties of the weight matrices. None of these analyses fully explain why AdamW works as well as it does on transformers, and a satisfying theoretical account of decoupled weight decay in the modern training regime remains an open problem.