# AdamW

> Source: https://aiwiki.ai/wiki/adamw
> Updated: 2026-07-13
> Categories: Machine Learning, Training & Optimization
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**AdamW** is a variant of the [Adam optimizer](/wiki/adam_optimizer) that decouples weight decay from the gradient-based update rule, applying the decay directly to the weights instead of folding it into the loss as an L2 penalty. It was introduced by Ilya Loshchilov and Frank Hutter in a preprint submitted on 14 November 2017 and formally published at the International Conference on Learning Representations (ICLR) in 2019 under the title *Decoupled Weight Decay Regularization* [1]. The paper shows that L2 regularization and weight decay are equivalent for plain stochastic gradient descent but not for adaptive optimizers like Adam, and that decoupling the two "substantially improves Adam's generalization performance, allowing it to compete with SGD with momentum on image classification datasets" [1]. The fix is short enough to fit on a single line of pseudocode, yet it has reshaped how nearly every modern deep neural network is trained. AdamW is the default [optimizer](/wiki/optimizer) for the vast majority of [transformer](/wiki/transformer) models trained since 2018, including [GPT-4](/wiki/gpt-4), [Claude](/wiki/claude_ai), Gemini, and the [Llama 4](/wiki/llama_4) family of open-weight models. When practitioners refer to "the optimizer" used to pretrain a [large language model](/wiki/large_language_model), they almost always mean AdamW with cosine learning rate decay and linear warmup.

The core insight behind AdamW is that L2 regularization and weight decay are mathematically equivalent only for plain stochastic gradient descent, not for adaptive methods like Adam, RMSProp, or Adagrad [1]. In adaptive optimizers, adding an L2 penalty to the loss function causes the regularization strength to be divided by the square root of each parameter's running second-moment estimate. Parameters with large historical gradients therefore receive weaker regularization than rarely updated parameters, which undermines the uniform shrinkage that weight decay is supposed to provide. AdamW restores the intended behavior by applying weight decay directly to the parameters, outside the adaptive scaling, so the decay term never passes through the moment estimates. The result is more consistent regularization, better generalization, and a significant practical benefit: the learning rate and weight decay hyperparameters become much more independent, which makes hyperparameter tuning considerably easier.

## Background

The original Adam algorithm was introduced by Diederik Kingma and Jimmy Ba in a 2014 preprint that became one of the most cited papers in [machine learning](/wiki/machine_learning) history, with more than 160,000 citations recorded on Semantic Scholar [2]. Adam combines two ideas from earlier optimizers: momentum, which accumulates an exponential moving average of past gradients, and per-parameter adaptive learning rates inspired by RMSProp and Adagrad, which scale updates inversely to the square root of an exponential moving average of squared gradients. Adam adds bias correction to both moment estimates so that the running averages are accurate even at the start of training when the buffers are still warming up from their zero initialization. The algorithm requires only first-order gradient information, has modest memory requirements compared to second-order methods, and is invariant to diagonal rescaling of the gradients. These properties made Adam an immediate hit in the deep learning community.

Despite Adam's popularity, by 2017 it had developed a reputation for generalizing slightly worse than well-tuned stochastic gradient descent with momentum, particularly on image classification benchmarks. Practitioners who wanted state-of-the-art accuracy on tasks like CIFAR-10 and ImageNet often reverted to SGD, accepting the longer tuning effort in exchange for a few tenths of a percentage point of test accuracy. Several papers tried to explain the gap, attributing it variously to noise in the second-moment estimate, sharp minima found by adaptive optimizers, and the difficulty of correctly setting weight decay. Loshchilov and Hutter's contribution was to identify a specific implementation error in how virtually every deep learning framework was applying weight decay to Adam, and to show that fixing it largely closed the generalization gap.

## What is the difference between L2 regularization and weight decay?

For stochastic gradient descent, the two common ways to penalize large weights are mathematically interchangeable. The first approach, L2 [regularization](/wiki/regularization), modifies the loss function by adding a quadratic penalty on the weights. When the gradient of this combined loss is computed, it produces a term equal to the original gradient plus a constant times the weights themselves. The second approach, weight decay, modifies the parameter update directly by multiplying the weights by a factor slightly less than one at each step before subtracting the gradient step. Loshchilov and Hutter point out that for plain SGD with learning rate alpha, where weight decay multiplies the weights by (1 - lambda) at each step, both approaches produce identical updates if the L2 coefficient is chosen to be lambda divided by alpha [1]. Most deep learning frameworks therefore use the L2 implementation and call it weight decay, treating the two as synonymous. The paper notes that common implementations "often call it 'weight decay' in what may be misleading due to the inequivalence we expose" [1].
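The equivalence for plain SGD can be checked in one line. The sketch below assumes the convention of [1], in which weight decay multiplies the weights by $$(1 - \lambda)$$ per step, and writes $$\lambda'$$ (a symbol introduced here for illustration) for the L2 coefficient in the penalized loss $$f(\theta) + \tfrac{\lambda'}{2}\lVert\theta\rVert^2$$:

$$
\theta_t = \theta_{t-1} - \alpha\left(\nabla f(\theta_{t-1}) + \lambda'\,\theta_{t-1}\right) = (1 - \alpha\lambda')\,\theta_{t-1} - \alpha\,\nabla f(\theta_{t-1})
$$

Matching the shrinkage factor $$(1 - \alpha\lambda')$$ to $$(1 - \lambda)$$ gives $$\lambda' = \lambda/\alpha$$. The identity relies on the gradient step being a plain multiple of the gradient, which is exactly what adaptive methods change.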

For Adam, the equivalence breaks down. The Adam update divides each gradient component by the square root of the running second moment plus a small epsilon. If weight decay is implemented as L2 regularization (folded into the gradient), then the weight-decay term is also divided by the same per-parameter denominator. A weight whose recent gradients have been large will have a large second-moment estimate, so its effective decay coefficient is small. A weight whose recent gradients have been small or sparse will have a small second-moment estimate, so its effective decay coefficient is large. The net effect is that Adam with L2 regularization regularizes infrequently-active parameters far more aggressively than frequently-active ones, the opposite of what one usually wants and the opposite of how SGD with weight decay behaves.
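The size of this effect can be estimated with a first-order sketch. The functions below are illustrative helpers, not part of any library. They assume the second-moment estimate is dominated by the loss gradient and ignore momentum lag, so the numbers are approximations, not exact per-step values.

```python
import math

def coupled_shrink_rate(alpha, lam, v_hat, eps=1e-8):
    """Approximate per-step shrink rate under Adam with an L2 penalty.

    The penalty gradient lam * theta is divided by sqrt(v_hat) + eps like
    every other gradient component, so the weight shrinks by roughly this
    fraction of its own value per step.
    """
    return alpha * lam / (math.sqrt(v_hat) + eps)

def decoupled_shrink_rate(alpha, lam):
    """Per-step shrink rate under decoupled decay: the same for every weight."""
    return alpha * lam

alpha, lam = 1e-3, 0.01
for rms_grad in (10.0, 0.01):  # hypothetical gradient scales for two weights
    v_hat = rms_grad ** 2
    print(rms_grad,
          coupled_shrink_rate(alpha, lam, v_hat),
          decoupled_shrink_rate(alpha, lam))
```

Under these assumptions, the two hypothetical weights differ by a factor of about 1000 in effective decay when the penalty is coupled, and not at all when it is decoupled.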

Decoupled weight decay fixes this by leaving the gradient untouched and instead subtracting a fraction of the parameter from itself, after the adaptive update is computed. The pseudocode change is to replace the line that adds lambda times theta to the gradient with a line that subtracts alpha times lambda times theta from the parameter at the end of the step. With this single modification, weight decay applies uniformly to every parameter regardless of its gradient history, mirroring the behavior of SGD with weight decay and restoring the original meaning of the lambda hyperparameter.

## Algorithm

The AdamW update at step $$t$$ for a parameter $$\theta$$ uses the following quantities: the gradient $$g_t$$ of the loss with respect to $$\theta$$, the running first moment $$m_{t-1}$$ and second moment $$v_{t-1}$$ from the previous step, the exponential decay rates $$\beta_1$$ and $$\beta_2$$, the small numerical stabilizer $$\epsilon$$, the learning rate $$\alpha$$, and the weight decay coefficient $$\lambda$$. The update proceeds in five steps:

1. Update the first moment estimate: $$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$$.
2. Update the second moment estimate: $$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$, where the square is element-wise.
3. Compute the bias-corrected estimates $$\hat{m} = m_t / (1 - \beta_1^t)$$ and $$\hat{v} = v_t / (1 - \beta_2^t)$$.
4. Compute the adaptive step $$\delta = \hat{m} / (\sqrt{\hat{v}} + \epsilon)$$.
5. Update the parameter: $$\theta_t = \theta_{t-1} - \alpha (\delta + \lambda \theta_{t-1})$$.
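The five steps translate almost directly into code. The following is a minimal pure-Python sketch for lists of scalar parameters. The function name `adamw_step` and its argument names are chosen here for illustration, and the default values mirror the PyTorch defaults discussed later in this article. It is meant to make the update concrete, not to replace a library optimizer.

```python
import math

def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """Apply one AdamW update to lists of scalars.

    t is the 1-based step count. Returns the new (theta, m, v) lists.
    """
    new_theta, new_m, new_v = [], [], []
    for p, g, m_prev, v_prev in zip(theta, grad, m, v):
        m_t = beta1 * m_prev + (1 - beta1) * g        # step 1: first moment
        v_t = beta2 * v_prev + (1 - beta2) * g * g    # step 2: second moment
        m_hat = m_t / (1 - beta1 ** t)                # step 3: bias correction
        v_hat = v_t / (1 - beta2 ** t)
        delta = m_hat / (math.sqrt(v_hat) + eps)      # step 4: adaptive step
        # Step 5: the decay term is added after the adaptive scaling.
        new_theta.append(p - lr * (delta + weight_decay * p))
        new_m.append(m_t)
        new_v.append(v_t)
    return new_theta, new_m, new_v

theta, m, v = adamw_step([1.0], [0.5], [0.0], [0.0], t=1, lr=0.1)
```

At the first step the bias-corrected ratio reduces to the sign of the gradient, so the adaptive part of the update has magnitude close to `lr` regardless of the gradient scale.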

The critical contrast with vanilla Adam plus L2 regularization is the placement of the $$\lambda \theta$$ term. In Adam with L2, $$\lambda \theta$$ is added to $$g_t$$ before the second-moment update, so it gets squared, accumulated into $$v_t$$, and divided by $$\sqrt{\hat{v}}$$. In AdamW, $$\lambda \theta_{t-1}$$ is added to $$\delta$$ after the adaptive scaling, so it produces a clean shrinkage independent of the gradient history. Some implementations equivalently write the final step as $$\theta_t = (1 - \alpha \lambda) \theta_{t-1} - \alpha \delta$$, which makes the multiplicative shrinkage even more obvious. Both formulations produce identical updates and the choice between them is purely cosmetic.

A further refinement that appears in the original paper but is often omitted in practice is to schedule the weight decay coefficient with the same multiplier used for the learning rate. If the learning rate is multiplied by a schedule factor $$\eta_t$$ (for example a cosine decay), then the weight decay should also be multiplied by $$\eta_t$$. This keeps the ratio of decay to gradient step constant throughout training and preserves the property that weight decay corresponds to a fixed implicit prior over the weights. Implementations that multiply the decay term by the current learning rate, as PyTorch's AdamW does [3], inherit this behavior automatically from the learning rate scheduler, although some libraries treat the two coefficients as fully independent.
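With a schedule multiplier, the final step of the update can be written as follows. This is a sketch in this article's notation, where the decay per step is $$\alpha\lambda$$ and $$\eta_t$$ denotes the schedule factor at step $$t$$:

$$
\theta_t = \theta_{t-1} - \eta_t\,\alpha\left(\delta + \lambda\,\theta_{t-1}\right)
$$

Because $$\eta_t$$ multiplies both terms, the relative size of the decay and the adaptive step is unchanged as the schedule anneals.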

## What hyperparameters does AdamW use in practice?

The default AdamW hyperparameters in PyTorch are $$\beta_1 = 0.9$$, $$\beta_2 = 0.999$$, $$\epsilon = 10^{-8}$$, learning rate $$\alpha = 10^{-3}$$, and weight decay $$\lambda = 0.01$$ [3]. These defaults work reasonably well for small to medium models but are almost never used directly for training large transformers. The dominant convention for [large language model](/wiki/large_language_model) pretraining sets $$\beta_2$$ to 0.95 rather than 0.999, which gives the second-moment estimate a much shorter effective averaging window. The shorter window allows the estimate to track gradient magnitude more responsively as training progresses through different phases. This setting was popularized by GPT-3, whose published recipe used $$\beta_1 = 0.9$$, $$\beta_2 = 0.95$$, a weight decay of 0.1, gradient clipping at global norm 1.0, a linear warmup over the first 375 million tokens, and cosine decay to 10 percent of the peak learning rate over 260 billion tokens [10]. It has been carried forward by virtually every major LLM release since.

The table below summarizes optimizer settings reported for several well-known models and benchmarks.

| Model | $$\beta_1$$ | $$\beta_2$$ | Weight decay | Peak [learning rate](/wiki/learning_rate) | Schedule | Source |
|---|---|---|---|---|---|---|
| BERT-Large | 0.9 | 0.999 | 0.01 | 1e-4 | Linear warmup, linear decay | Devlin et al. 2018 |
| GPT-2 1.5B | 0.9 | 0.95 | 0.01 | 2.5e-4 | Cosine decay with warmup | Radford et al. 2019 |
| GPT-3 175B | 0.9 | 0.95 | 0.1 | 6e-5 | Cosine decay to 10% of peak, 375M-token linear warmup | Brown et al. 2020 [10] |
| LLaMA 65B | 0.9 | 0.95 | 0.1 | 1.5e-4 | Cosine decay with 2k step warmup | Touvron et al. 2023 [11] |
| LLaMA 2 70B | 0.9 | 0.95 | 0.1 | 1.5e-4 | Cosine decay to 10% of peak | Touvron et al. 2023 [12] |
| ViT-Large | 0.9 | 0.999 | 0.3 | 1e-3 | Cosine decay with warmup | Dosovitskiy et al. 2020 |
| Stable Diffusion | 0.9 | 0.999 | 0.01 | 1e-4 | Constant with warmup | Rombach et al. 2022 |

A near-universal convention is to exclude bias terms and the scale and shift parameters of [layer normalization](/wiki/layer_normalization) from weight decay [3]. Bias parameters shift the activation function input; shrinking them toward zero forces the model to use zero-centered activations even when the data does not warrant it. LayerNorm scale parameters control the post-normalization activation magnitude, and decaying them to zero would distort the normalization. Embedding matrices are sometimes also excluded, particularly in vision transformers, although the practice is less consistent. The standard idiom in PyTorch is to construct two parameter groups when initializing the optimizer, one with weight_decay set to lambda containing all 2D matrices, and one with weight_decay set to zero containing all 1D vectors plus biases.
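The two-group idiom can be sketched as a small helper. The function name `split_decay_groups` is hypothetical, and the split by tensor dimensionality is one common heuristic, not an official API. Note that it treats embedding matrices as 2D weights and decays them. The helper only relies on each parameter exposing `ndim` and `requires_grad`, as PyTorch tensors do.

```python
def split_decay_groups(named_params, weight_decay=0.1):
    """Split (name, tensor) pairs into decayed and non-decayed groups.

    Tensors with two or more dimensions (weight matrices, embeddings) are
    decayed; one-dimensional tensors (biases, normalization scale and
    shift) are not. Frozen parameters are skipped.
    """
    decay, no_decay = [], []
    for _name, p in named_params:
        if not getattr(p, "requires_grad", True):
            continue
        (decay if p.ndim >= 2 else no_decay).append(p)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]

# Intended use with PyTorch (assumes `model` is a torch.nn.Module; the
# learning rate below is a placeholder, not a recommendation):
#   import torch
#   groups = split_decay_groups(model.named_parameters(), weight_decay=0.1)
#   optimizer = torch.optim.AdamW(groups, lr=3e-4, betas=(0.9, 0.95))
```

Recipes that also exclude embeddings would need a name-based filter in addition to the dimensionality check.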

A common rule of thumb is that the optimal weight decay grows roughly with the square root of the dataset size and inversely with model size, although recent scaling-law work suggests the dependence is more complicated than any simple closed form. Most published recipes for transformer pretraining use values between 0.01 and 0.1, with 0.1 being the default for autoregressive language models and 0.01 the default for masked language models like BERT. Fine-tuning recipes use lower values, often between 0 and 0.01, since the pretrained weights are already well regularized.

The epsilon parameter is set to 1e-8 by default in PyTorch and to 1e-6 or 1e-4 in some other libraries. Larger epsilon values reduce the magnitude of the adaptive step for parameters with very small second-moment estimates, which can stabilize training in low-precision arithmetic. Some recipes for bfloat16 training set epsilon to 1e-5 or even larger to avoid numerical underflow when v_hat becomes extremely small.

## Learning rate schedules

AdamW is rarely used with a constant learning rate. The dominant schedule for transformer pretraining combines a short linear warmup with a long cosine decay. During the warmup phase, the learning rate increases linearly from zero (or a very small value) to its peak over the first few thousand to few hundred thousand steps. The warmup serves several purposes: it allows the second-moment estimate to accumulate enough samples to be reliable before any large updates are applied, it prevents the bias-correction term from producing very large initial steps when t is small, and it gives the model time to escape any pathological initial configuration without diverging.

After warmup, cosine decay smoothly reduces the learning rate from its peak to a final value (often 10 percent of peak) following one half-cycle of a cosine curve. The cosine schedule was popularized by Loshchilov and Hutter's earlier SGDR paper, published at ICLR 2017, which set new state-of-the-art results of 3.14 percent error on CIFAR-10 and 16.21 percent on CIFAR-100 using cosine annealing with warm restarts [13]. It has empirically been shown to produce strong final loss values across many architectures and dataset sizes. An alternative is linear decay to zero or to a small final value, which is the default in the Hugging Face Transformers library and which performs comparably to cosine for many fine-tuning tasks.
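The warmup-plus-cosine schedule described above can be written as a single function of the step index. This is a minimal sketch. The function and parameter names are chosen here for illustration, and the 10 percent floor is the example value mentioned in the text, not a universal setting.

```python
import math

def warmup_cosine_lr(step, peak_lr, warmup_steps, total_steps, final_frac=0.1):
    """Learning rate at `step`: linear warmup, then half-cosine decay.

    The rate rises linearly from 0 to peak_lr over warmup_steps, then
    follows half a cosine cycle down to final_frac * peak_lr at total_steps.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    span = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / span)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak_lr * (final_frac + (1.0 - final_frac) * cosine)
```

In a training loop, the returned value would be assigned to the optimizer's learning rate before each step. With an implementation that scales weight decay by the learning rate, the decay follows the same curve.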

For very long training runs, a more recent practice is to use a constant learning rate after warmup with a brief cooldown phase at the end. This approach, sometimes called the warmup-stable-decay schedule, allows the practitioner to extend or stop training without committing in advance to a final step count. It also makes it easier to compare loss curves at intermediate checkpoints since they all see the same learning rate.
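A warmup-stable-decay schedule can be sketched in the same style. The text does not specify the cooldown shape, so the linear cooldown to zero below is an assumption, as are the function and parameter names.

```python
def wsd_lr(step, peak_lr, warmup_steps, total_steps, cooldown_steps):
    """Warmup-stable-decay: linear warmup, constant plateau, linear cooldown.

    The linear cooldown to zero is an illustrative choice; other cooldown
    shapes fit the same three-phase structure.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    cooldown_start = total_steps - cooldown_steps
    if step < cooldown_start:
        return peak_lr
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / cooldown_steps
```

Extending a run only moves `total_steps`. Every checkpoint taken on the plateau has seen the same learning rate, which is what makes intermediate loss curves comparable.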

## How much memory does AdamW use?

AdamW's memory overhead is one of its main practical drawbacks. The optimizer maintains a first moment buffer and a second moment buffer for every trainable parameter, each typically stored in float32 even when the model parameters and gradients use lower precision. The table below summarizes per-parameter memory in a typical mixed-precision training setup with bfloat16 parameters and gradients but float32 optimizer state.

| Component | Bytes per parameter | Notes |
|---|---|---|
| Model weights (bfloat16) | 2 | Required for forward pass |
| Master weights (float32) | 4 | Used for accurate parameter update |
| Gradients (bfloat16) | 2 | Reduced across data-parallel ranks |
| First moment m (float32) | 4 | Adam state |
| Second moment v (float32) | 4 | Adam state |
| Total | 16 | Per parameter, excluding activations |

For a 70 billion parameter model, this works out to roughly 1.12 TB of training state in total. The two moment buffers alone take 560 GB, and the figure rises to 840 GB once the float32 master weights are counted, several times the 140 GB occupied by the bfloat16 weights themselves. A common optimization is to use fully sharded data parallelism (FSDP) or ZeRO stage 3, which partitions the optimizer state, gradients, and parameters across data-parallel ranks so that each rank holds only its share. Another approach is to quantize the optimizer state to 8-bit or even 4-bit precision. The 8-bit AdamW implementation in the bitsandbytes library, based on the block-wise quantization method of Dettmers and colleagues published at ICLR 2022, reduces optimizer state from 32 bits to 8 bits per state, roughly a 4x reduction, while maintaining 32-bit performance on language modeling, GLUE fine-tuning, ImageNet classification, and machine translation without changes to the original hyperparameters [14]. Other techniques reduce the optimizer state further: Adafactor stores a factored representation of the second moment, GaLore projects gradients into a low-rank subspace, and LoRA shrinks the set of trainable parameters to low-rank adapters.
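The per-parameter table can be turned into totals for any model size with a few lines of arithmetic. The helper below is an illustrative sketch. It uses the byte counts from the table, takes 1 GB as 1e9 bytes, and models full sharding as an ideal even split across ranks, ignoring activations and communication buffers.

```python
BYTES_PER_PARAM = {
    "weights_bf16": 2,
    "master_weights_fp32": 4,
    "grads_bf16": 2,
    "adam_m_fp32": 4,
    "adam_v_fp32": 4,
}

def training_state_gb(n_params, num_shards=1):
    """Memory per component in GB, optionally divided evenly across shards."""
    return {
        name: n_params * nbytes / 1e9 / num_shards
        for name, nbytes in BYTES_PER_PARAM.items()
    }

state = training_state_gb(70e9)
total_gb = sum(state.values())
moments_gb = state["adam_m_fp32"] + state["adam_v_fp32"]
per_rank_gb = sum(training_state_gb(70e9, num_shards=64).values())
print(total_gb, moments_gb, per_rank_gb)
```

The 64-way split is a hypothetical configuration chosen only to show how sharding brings the per-rank footprint into the range of a single accelerator.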

## How does AdamW compare to other optimizers?

AdamW remains the dominant optimizer for transformer training despite a steady stream of proposed alternatives. The table below summarizes how AdamW compares to several widely discussed alternatives on the dimensions that matter most for large-scale training.

| Optimizer | State per parameter | Update mechanism | Typical learning rate vs AdamW | Notes |
|---|---|---|---|---|
| Adam | 8 bytes | Adaptive first and second moment | Same | Couples L2 with adaptive scaling, hurts generalization |
| AdamW | 8 bytes | Adaptive with decoupled weight decay | Reference | Industry default for transformer pretraining |
| LAMB | 8 bytes | Adam plus per-layer trust ratio | Larger | Designed for very large batch sizes [4] |
| Adafactor | ~4 bytes | Factored second moment, sign update option | Comparable | Memory-efficient, used in T5 and PaLM [5] |
| Lion | 4 bytes | Sign of momentum | 3x to 10x smaller | Discovered by program search, simpler update [6] |
| Sophia | 8 bytes plus periodic Hessian | Diagonal Hessian preconditioner | Comparable | Aims for 2x speedup on LLM pretraining [7] |
| Distributed Shampoo | Larger preconditioner blocks | Kronecker-factored second order | Comparable | Won AlgoPerf external tuning track [8] |
| Muon | 4 bytes plus per-step NS iteration | Orthogonalized momentum via Newton-Schulz | Comparable | Reports 2x compute efficiency on transformers [9] |

LAMB (Layerwise Adaptive Moments optimizer for Batch training) was introduced by Yang You and colleagues in 2019 and adds a per-layer trust ratio to the Adam update [4]. The trust ratio rescales each layer's update so that its norm is proportional to the norm of the layer's weights, which prevents any single layer from dominating the update at very large batch sizes. The authors report that LAMB enabled BERT to be trained with a batch size of 32,868 without degradation, cutting wall-clock training time from 3 days to 76 minutes on a TPUv3 Pod, a record at the time [4]. LAMB is mostly used for large-batch pretraining and has not displaced AdamW for typical batch sizes.
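The trust ratio can be illustrated for a single layer. The sketch below assumes the Adam step for the layer has already been computed and uses the plain weight norm in the numerator. The function name and the fallback to a ratio of 1 when a norm is zero are illustrative choices, not a transcription of the reference implementation.

```python
import math

def lamb_layer_update(theta, adam_step, lr, weight_decay=0.01):
    """Apply a LAMB-style update to one layer given its Adam step.

    theta and adam_step are flat lists for a single layer; adam_step holds
    m_hat / (sqrt(v_hat) + eps) for each element.
    """
    update = [d + weight_decay * p for d, p in zip(adam_step, theta)]
    w_norm = math.sqrt(sum(p * p for p in theta))
    u_norm = math.sqrt(sum(u * u for u in update))
    # Guard against division by zero; a ratio of 1 falls back to plain AdamW.
    trust = w_norm / u_norm if w_norm > 0 and u_norm > 0 else 1.0
    return [p - lr * trust * u for p, u in zip(theta, update)]
```

Because the update is rescaled by its own norm, multiplying `adam_step` by a constant leaves the result unchanged when `weight_decay` is zero. The step length is set by `lr` and the layer's weight norm alone.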

Adafactor, introduced by Noam Shazeer and Mitchell Stern in 2018, factorizes the second-moment matrix into the outer product of two smaller vectors, reducing the memory cost from O(n*m) to O(n+m) for an n by m weight matrix [5]. Adafactor was used to train T5 and PaLM and remains popular for very large models where optimizer state would otherwise dominate memory. It often converges slightly slower than AdamW on small to medium models but the gap closes for very large models, and the memory savings can enable larger batch sizes that more than compensate for any per-step inefficiency.
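The factorization itself is short. The sketch below reconstructs a rank-1 estimate of a matrix of squared gradients from its row and column sums. It omits the exponential moving averages and update clipping that the full optimizer applies, and the function name is chosen here for illustration.

```python
def factored_second_moment(sq_grads):
    """Rank-1 estimate of a matrix of squared gradients.

    Only the row sums (n values) and column sums (m values) need to be
    stored; entry (i, j) is estimated as row_i * col_j / total.
    """
    row_sums = [sum(row) for row in sq_grads]
    col_sums = [sum(col) for col in zip(*sq_grads)]
    total = sum(row_sums)
    return [[r * c / total for c in col_sums] for r in row_sums]

estimate = factored_second_moment([[1.0, 2.0], [2.0, 4.0]])
```

The reconstruction is exact when the squared-gradient matrix is itself rank 1, as in the example, and an approximation otherwise.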

Lion (EvoLved Sign Momentum) was introduced in 2023 by Xiangning Chen and colleagues at Google through a symbolic program-search procedure that automatically discovered new optimizer variants [6]. Lion uses only momentum (no second-moment estimate) and applies the sign function to the momentum buffer, so every parameter receives an update of identical magnitude scaled by the learning rate. Lion typically requires a learning rate three to ten times smaller than AdamW. It uses half the optimizer state of AdamW and has been shown to match or exceed AdamW on vision-language contrastive learning, diffusion models, and autoregressive language modeling, although the improvements are not universal across all settings and model sizes.
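A single Lion step can be sketched as follows, based on the update rule described in [6]. The function name is chosen here for illustration, and the default values are assumptions that should be checked against whichever implementation is actually used.

```python
def lion_step(theta, grad, m, lr=1e-4, beta1=0.9, beta2=0.99, weight_decay=0.0):
    """One Lion update for lists of scalars. Returns the new (theta, m)."""
    def sign(x):
        return (x > 0) - (x < 0)

    new_theta, new_m = [], []
    for p, g, m_prev in zip(theta, grad, m):
        # Interpolate momentum and gradient, then keep only the sign.
        c = beta1 * m_prev + (1 - beta1) * g
        new_theta.append(p - lr * (sign(c) + weight_decay * p))
        # The stored momentum is updated with a separate decay rate.
        new_m.append(beta2 * m_prev + (1 - beta2) * g)
    return new_theta, new_m

theta, m = lion_step([1.0, -2.0], [0.5, -3.0], [0.0, 0.0], lr=0.1)
```

Both parameters in the example move by exactly `lr` even though their gradients differ by a factor of six. This is why Lion's learning rate is set lower than AdamW's.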

Sophia (Second-order Clipped Stochastic Optimization) was introduced by Hong Liu and colleagues in 2023 and applies a diagonal Hessian preconditioner with element-wise clipping [7]. Sophia estimates the diagonal Hessian only every few iterations to keep per-step cost low, and uses a clipping mechanism to bound the maximum update magnitude. The authors report that Sophia achieves the same validation pretraining loss as Adam in roughly half the number of steps on GPT-2 models from 125M to 1.5B parameters [7]. Sophia has not seen wide adoption in production LLM training despite the favorable benchmark results, partly because the Hessian estimation adds engineering complexity and partly because many of its gains come from very long training runs that few labs replicate.

Distributed Shampoo, originally introduced by Vineet Gupta and colleagues in 2018 and scaled up by Rohan Anil and colleagues in 2020, applies a Kronecker-factored approximation to the second-order preconditioner [8]. Each weight matrix is preconditioned with the product of two smaller matrices, one for each axis. A distributed implementation of Shampoo won the external tuning track of the inaugural AlgoPerf: Training Algorithms competition, whose results were announced by MLCommons in August 2024, training models about 28 percent faster than the tuned baseline and narrowly beating well-tuned AdamW [15]. Shampoo's main drawback is the cost of computing matrix inverses for the preconditioners, which has historically limited its use to settings with large pools of accelerators and complex distributed implementations.

Muon (Momentum Orthogonalized by Newton-Schulz), introduced by Keller Jordan in October 2024, applies a Newton-Schulz iteration to orthogonalize the momentum buffer before each step [9]. The iteration approximates the inverse matrix square root needed to whiten the momentum, providing a second-order-like update at modest computational cost. Muon applies only to 2D matrix parameters; biases, embeddings, and the final output projection are still trained with AdamW. Reported scaling-law experiments suggest Muon achieves roughly 2x computational efficiency over AdamW for compute-optimal transformer training. Muon has been adopted by several labs for production LLM training: Moonshot AI's MuonClip variant, which adds a query-key clipping mechanism for stability, pretrained the Kimi K2 mixture-of-experts model (1 trillion total parameters, 32 billion active) on 15.5 trillion tokens with, the team reports, zero loss spikes [16].
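The orthogonalization idea can be illustrated with the classical cubic Newton-Schulz iteration. This is a deliberately simplified stand-in: Muon as described in [9] uses a tuned higher-order polynomial and runs on accelerator tensors, whereas the sketch below is pure Python, uses hypothetical helper names, and assumes a nonzero, full-rank input matrix.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def newton_schulz_orthogonalize(g, steps=10):
    """Approximate the orthogonal polar factor of g.

    Iterates X <- 1.5 X - 0.5 X X^T X after scaling g to unit Frobenius
    norm, which keeps every singular value inside the convergence range.
    Each iteration pushes the singular values of X toward 1.
    """
    norm = math.sqrt(sum(val * val for row in g for val in row))
    x = [[val / norm for val in row] for row in g]
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(row_x, row_y)]
             for row_x, row_y in zip(x, xxtx)]
    return x

q = newton_schulz_orthogonalize([[0.0, 2.0], [1.0, 0.0]])
```

The result keeps the singular vectors of the input but replaces every singular value with approximately 1, so `q` multiplied by its transpose is close to the identity.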

Despite these alternatives, AdamW remains the workhorse for the overwhelming majority of transformer training in 2026. Its combination of robust convergence, well-understood hyperparameter behavior, and broad library support keeps it the safe default choice. Most newer optimizers that report large speedups over AdamW are evaluated on relatively short training runs and do not always maintain their advantage over the multi-trillion-token training runs that characterize frontier models.

## Adoption in modern AI systems

The AdamW paper has been cited tens of thousands of times and the algorithm appears in essentially every modern deep learning library. PyTorch added AdamW as a separate class (torch.optim.AdamW), distinct from the older Adam class which retained the L2-coupled implementation for backward compatibility. TensorFlow added AdamW through the TensorFlow Addons package and later as a first-class optimizer in tf.keras.optimizers.experimental. JAX exposes AdamW through Optax. The Hugging Face Transformers library uses AdamW as the default optimizer for nearly all of its training scripts and example notebooks.

The practical importance of AdamW is hard to overstate. The major foundation models trained since 2018 have almost all used AdamW or a close variant, including the GPT family from OpenAI, the Claude family from Anthropic, the Llama family from Meta, the Gemini and PaLM families from Google, the Mistral models, the Qwen series from Alibaba, the DeepSeek models, and most of the open-source community models that fill the Hugging Face Hub. Vision transformers and multimodal models like CLIP and Flamingo also use AdamW. Diffusion models including Stable Diffusion and DALL-E use AdamW. Even recent reinforcement learning methods that train policies with PPO or GRPO on top of language models use AdamW for the underlying gradient updates.

For researchers and engineers reproducing published recipes, the first thing to verify when reading a paper that says it used "Adam" is whether the authors actually used Adam with L2 regularization (the older convention) or AdamW (the modern convention). Many papers from before about 2019 used L2-coupled Adam and reported their weight decay coefficient as if it were the AdamW lambda, which can lead to wildly different effective regularization when reproduced in a modern framework. No single rescaling makes the two match: under L2-coupled Adam the decay felt by each parameter is divided by the square root of that parameter's second-moment estimate, so a reproduction with AdamW can only approximate the original regularization and usually needs to re-tune lambda.

## Limitations and ongoing research

AdamW is not a finished story. Its memory overhead remains a significant constraint for the largest models, motivating ongoing work on memory-efficient adaptive optimizers including Adafactor, 8-bit AdamW, GaLore, and APOLLO. Its hyperparameters, while easier to tune than those of Adam with L2, still require some care; recent work on scaling laws for AdamW weight decay shows that the optimal lambda depends nontrivially on dataset size, batch size, and model size, and several labs have published recipes for setting these hyperparameters as a function of model scale.

A second active research direction is whether AdamW is actually the right inductive bias for transformer training. Lion, Sophia, Muon, and Shampoo all challenge the assumption that diagonal preconditioning by the second-moment estimate is the best per-parameter scaling. Each of these methods has demonstrated meaningful gains in particular regimes, and the question of whether one of them will displace AdamW as the default for frontier training is open. So far, the inertia of the deep learning ecosystem (with mature implementations, well-understood failure modes, and roughly a decade of accumulated practitioner intuition about Adam-family optimizers) has kept AdamW dominant even when newer methods report better numbers on benchmarks.

A third question is whether the bias correction terms in AdamW are actually useful, harmful, or neutral for very long training runs. The bias correction was originally motivated by the observation that the moment estimates start at zero and slowly warm up, but for training runs that span hundreds of thousands or millions of steps, the bias-correction multiplier is essentially one for almost all of training. It matters only early on: for a few dozen steps for the first moment at $$\beta_1 = 0.9$$, and for a few thousand steps for the second moment at $$\beta_2 = 0.999$$. Variants such as NAdam, which adds Nesterov momentum, and Yogi, which replaces the second-moment update, alter these estimates in other ways, with mixed empirical results.
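How quickly the correction fades can be checked numerically. The helpers below are an illustrative sketch. The 1 percent tolerance is an arbitrary threshold chosen for this example.

```python
def bias_correction(beta, t):
    """Multiplier 1 / (1 - beta**t) applied to a moment estimate at step t."""
    return 1.0 / (1.0 - beta ** t)

def steps_until_negligible(beta, tol=0.01):
    """Smallest step t at which the multiplier is within tol of 1."""
    t = 1
    while bias_correction(beta, t) - 1.0 > tol:
        t += 1
    return t

for beta in (0.9, 0.95, 0.999):
    print(beta, steps_until_negligible(beta))
```

Lowering $$\beta_2$$ from 0.999 to 0.95, as most LLM recipes do, shortens the second-moment warm-up window by a factor of about 50.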

Finally, there is interest in the theoretical foundations of AdamW. The convergence proofs available for vanilla Adam do not transfer cleanly to AdamW, and the role of weight decay in nonconvex stochastic optimization is still poorly understood. Recent work has connected weight decay to implicit regularization toward flat minima, to a form of equivariance to network parameterization, and to the spectral properties of the weight matrices. None of these analyses fully explain why AdamW works as well as it does on transformers, and a satisfying theoretical account of decoupled weight decay in the modern training regime remains an open problem.

## See also

- [Adam optimizer](/wiki/adam_optimizer)
- [Optimizer](/wiki/optimizer)
- [Learning rate](/wiki/learning_rate)
- [Regularization](/wiki/regularization)
- [Transformer](/wiki/transformer)
- [Large language model](/wiki/large_language_model)
- [GPT-4](/wiki/gpt-4)
- [Claude](/wiki/claude_ai)
- [Llama 4](/wiki/llama_4)

## References

1. Loshchilov, I. and Hutter, F. (2019). *Decoupled Weight Decay Regularization*. International Conference on Learning Representations (ICLR). arXiv:1711.05101. https://arxiv.org/abs/1711.05101
2. Kingma, D. P. and Ba, J. (2015). *Adam: A Method for Stochastic Optimization*. International Conference on Learning Representations (ICLR). arXiv:1412.6980. https://arxiv.org/abs/1412.6980
3. PyTorch documentation. *torch.optim.AdamW*. https://docs.pytorch.org/docs/stable/generated/torch.optim.AdamW.html
4. You, Y. et al. (2020). *Large Batch Optimization for Deep Learning: Training BERT in 76 minutes*. International Conference on Learning Representations (ICLR). arXiv:1904.00962. https://arxiv.org/abs/1904.00962
5. Shazeer, N. and Stern, M. (2018). *Adafactor: Adaptive Learning Rates with Sublinear Memory Cost*. International Conference on Machine Learning (ICML). arXiv:1804.04235. https://arxiv.org/abs/1804.04235
6. Chen, X. et al. (2023). *Symbolic Discovery of Optimization Algorithms*. NeurIPS. arXiv:2302.06675. https://arxiv.org/abs/2302.06675
7. Liu, H., Li, Z., Hall, D., Liang, P., and Ma, T. (2024). *Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training*. International Conference on Learning Representations (ICLR). arXiv:2305.14342. https://arxiv.org/abs/2305.14342
8. Anil, R., Gupta, V., Koren, T., Regan, K., and Singer, Y. (2020). *Scalable Second Order Optimization for Deep Learning*. arXiv:2002.09018. https://arxiv.org/abs/2002.09018
9. Jordan, K. (2024). *Muon: An optimizer for hidden layers in neural networks*. https://kellerjordan.github.io/posts/muon/
10. Brown, T. et al. (2020). *Language Models are Few-Shot Learners*. NeurIPS. arXiv:2005.14165. https://arxiv.org/abs/2005.14165
11. Touvron, H. et al. (2023). *LLaMA: Open and Efficient Foundation Language Models*. arXiv:2302.13971. https://arxiv.org/abs/2302.13971
12. Touvron, H. et al. (2023). *Llama 2: Open Foundation and Fine-Tuned Chat Models*. arXiv:2307.09288. https://arxiv.org/abs/2307.09288
13. Loshchilov, I. and Hutter, F. (2017). *SGDR: Stochastic Gradient Descent with Warm Restarts*. International Conference on Learning Representations (ICLR). arXiv:1608.03983. https://arxiv.org/abs/1608.03983
14. Dettmers, T., Lewis, M., Shleifer, S., and Zettlemoyer, L. (2022). *8-bit Optimizers via Block-wise Quantization*. International Conference on Learning Representations (ICLR). arXiv:2110.02861. https://arxiv.org/abs/2110.02861
15. MLCommons (2024). *Announcing the results of the inaugural AlgoPerf: Training Algorithms benchmark competition*. https://mlcommons.org/2024/08/mlc-algoperf-benchmark-competition/
16. Kimi Team, Moonshot AI (2025). *Kimi K2: Open Agentic Intelligence*. arXiv:2507.20534. https://arxiv.org/abs/2507.20534