Weight decay is a regularization technique used in training neural networks and other machine learning models that penalizes large parameter values by shrinking weights toward zero on every update step. In its original formulation, weight decay multiplies the parameter vector by a value slightly less than one before each gradient step, gently pulling every weight back toward the origin. The strength of this pull is controlled by a single hyperparameter, conventionally written as the Greek letter lambda. Weight decay has become one of the most widely used regularizers in modern deep learning and is a default ingredient in the recipe used to train most major frontier large language models, including LLaMA, GPT, Claude, and Gemini, where the value lambda equals 0.1 has emerged as a near-universal default.
Weight decay is often confused with L2 regularization, and for plain stochastic gradient descent the two are mathematically equivalent up to a rescaling of the coefficient. The two methods diverge as soon as adaptive optimizers such as Adam enter the picture, because adaptive methods rescale every coordinate of the gradient by the square root of an estimate of the second moment of past gradients. The L2 penalty gets caught up in this rescaling and ends up applied unevenly across parameters, while true weight decay continues to apply a clean uniform shrinkage. The 2019 paper by Ilya Loshchilov and Frank Hutter, Decoupled Weight Decay Regularization, made this distinction precise and proposed the AdamW variant that decouples weight decay from the gradient-based update. AdamW has since become the default training algorithm for almost every transformer model in production.
Weight decay was introduced by Stephen Hanson and Lorien Pratt at the 1988 NeurIPS conference in their paper Comparing Biases for Minimal Network Construction with Back-Propagation. Hanson and Pratt were searching for a way to grow small networks that solved a task while resisting the natural tendency of backpropagation to produce solutions with many large, redundant weights. They proposed augmenting the gradient update for each weight with a term proportional to the weight itself, which had the effect of decaying the weight toward zero in proportion to its current magnitude. Their experiments showed that networks trained this way had fewer functional connections and generalized better to unseen data, foreshadowing the modern understanding of weight decay as both a regularizer and a tool for implicit pruning.
Three years later, Anders Krogh and John Hertz published A Simple Weight Decay Can Improve Generalization at NeurIPS 1991. The paper provided one of the first careful theoretical and empirical analyses of why the technique works. Krogh and Hertz analyzed weight decay in the context of single-layer linear networks where the bias variance tradeoff could be solved in closed form. They showed that adding the decay term changes the solution from the ordinary least squares estimate to the ridge regression estimate, biasing the network toward smoother input-output mappings that resist overfitting on small datasets. They also demonstrated empirically that even very small decay coefficients produced measurable improvements in test set performance on speech recognition and time series tasks. This pair of papers established weight decay as a respectable regularization technique within the connectionist research community.
For most of the 1990s and 2000s, weight decay was treated as essentially synonymous with L2 regularization, since both produce identical updates under plain gradient descent. The convention bled into popular machine learning frameworks. PyTorch exposes a hyperparameter named weight_decay on its optimizer classes, and similar knobs appear in other major frameworks and libraries, yet in many implementations that knob actually triggers L2 regularization internally rather than the original Hanson and Pratt formulation. The slippage went largely unnoticed for years because the equivalence holds for SGD and the differences are small at the modest model sizes typical of pre-2015 deep learning.
The distinction became important again in 2017 when Loshchilov and Hutter circulated a preprint titled Fixing Weight Decay Regularization in Adam. They showed that the L2 implementation embedded inside popular Adam codebases produced systematically worse generalization than a clean decoupled implementation, and that the gap widened as models grew. The paper went through several revisions and appeared in its final form at ICLR 2019 under the new title Decoupled Weight Decay Regularization. The paper also introduced the AdamW name that practitioners now use for the corrected algorithm. Within two years AdamW had displaced Adam in the training pipelines of nearly every major large language model.
The original Hanson and Pratt formulation defines weight decay as a direct modification of the gradient update rule rather than as a change to the loss function. For a parameter vector w at training step t, the update is
w_{t+1} = w_t - lr * grad(L(w_t)) - lr * lambda * w_t
where lr is the learning rate, L is the loss, and lambda is the decay coefficient. The third term shrinks every coordinate of w toward zero by a small fraction lr times lambda on each step. Equivalently, the update can be rearranged as
w_{t+1} = (1 - lr * lambda) * w_t - lr * grad(L(w_t))
which shows that weight decay can be implemented as a multiplicative shrinkage of the current weights combined with the usual gradient step. Written this way, the two forms are algebraically identical. If the shrinkage is instead applied after the gradient step, or the gradient is evaluated at the already-shrunk weights, the results differ only by a term of order lr^2 times lambda, which is negligible when lr times lambda is small.
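A minimal numpy sketch, using a toy quadratic loss and illustrative values for lr and lambda, confirming that the additive and multiplicative forms above produce the same update:

```python
import numpy as np

# Toy quadratic loss L(w) = 0.5 * w^T A w - b^T w, so grad(L) = A @ w - b.
rng = np.random.default_rng(0)
A = np.diag(rng.uniform(0.5, 2.0, size=4))
b = rng.normal(size=4)
w = rng.normal(size=4)
lr, lam = 0.1, 0.01

grad = A @ w - b

w_additive = w - lr * grad - lr * lam * w          # Hanson-Pratt additive form
w_multiplicative = (1 - lr * lam) * w - lr * grad  # rearranged multiplicative form

# The two forms are algebraically identical, not merely equal to first order.
assert np.allclose(w_additive, w_multiplicative)
```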
L2 regularization, by contrast, modifies the loss function rather than the update rule. The regularized loss is
L_reg(w) = L(w) + (lambda / 2) * ||w||^2
and the gradient is
grad(L_reg(w)) = grad(L(w)) + lambda * w
so the standard SGD update becomes
w_{t+1} = w_t - lr * (grad(L(w_t)) + lambda * w_t)
Distributing the learning rate gives exactly the original Hanson and Pratt rule. The two views are therefore identical for plain SGD, and the lambda symbol carries the same meaning in both formulations once the rescaling by lr is accounted for.
The equivalence breaks for any optimizer that rescales the raw gradient before applying the update. Adaptive optimizers such as Adam, AdaGrad, RMSProp, and AdaDelta all maintain a running estimate of the second moment of past gradients and divide each coordinate of the update by the square root of that estimate. This produces a per-parameter learning rate. If L2 regularization is added to the loss, the term lambda times w is folded into the gradient before the adaptive rescaling, so its effective magnitude on each weight depends on the historical gradient noise of that weight. Weights with large historical gradients see their effective L2 penalty divided by a large number, weakening the regularization on exactly those parameters that the network is most actively learning. Weights with small historical gradients see their penalty amplified, regularizing parameters that the optimization is essentially ignoring.
The Loshchilov and Hutter analysis showed that this coupling produces systematically worse test accuracy than a decoupled formulation in which the weight decay step is applied as a separate, optimizer-independent shrinkage. For Adam with an L2 penalty added to the loss, the update is
g_t = grad(L(w_t)) + lambda * w_t
m_t = beta_1 * m_{t-1} + (1 - beta_1) * g_t
v_t = beta_2 * v_{t-1} + (1 - beta_2) * g_t^2
m_hat = m_t / (1 - beta_1^t)
v_hat = v_t / (1 - beta_2^t)
w_{t+1} = w_t - lr * m_hat / (sqrt(v_hat) + epsilon)
for the coupled version that L2 regularization induces, in which the penalty term lambda * w_t passes through the adaptive rescaling along with the rest of the gradient, versus
w_{t+1} = w_t * (1 - lr * lambda) - lr * (m_hat / (sqrt(v_hat) + epsilon))
for AdamW, where m_hat and v_hat are computed from the bare gradient grad(L(w_t)) and the multiplicative shrinkage by (1 minus lr times lambda) happens outside the adaptive update. The decoupled form recovers the spirit of the original Hanson and Pratt rule and applies a uniform shrinkage to every parameter regardless of its gradient history.
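A single optimizer step in numpy makes the contrast concrete; the gradient, state, and hyperparameter values below are illustrative, chosen to give one high-gradient and one low-gradient coordinate:

```python
import numpy as np

# One step of coupled Adam + L2 versus decoupled AdamW from the same state.
lr, lam, b1, b2, eps = 1e-3, 0.1, 0.9, 0.999, 1e-8
w = np.array([1.0, 1.0])
grad = np.array([10.0, 0.01])   # high- and low-gradient coordinates
m = np.zeros(2); v = np.zeros(2); t = 1

# Coupled: the penalty lambda * w is folded into the gradient, so it is
# divided by sqrt(v_hat) along with everything else.
g = grad + lam * w
m_c = b1 * m + (1 - b1) * g
v_c = b2 * v + (1 - b2) * g ** 2
w_coupled = w - lr * (m_c / (1 - b1 ** t)) / (np.sqrt(v_c / (1 - b2 ** t)) + eps)

# Decoupled (AdamW): moments use the bare gradient; shrinkage applied outside.
m_d = b1 * m + (1 - b1) * grad
v_d = b2 * v + (1 - b2) * grad ** 2
adam_step = lr * (m_d / (1 - b1 ** t)) / (np.sqrt(v_d / (1 - b2 ** t)) + eps)
w_adamw = (1 - lr * lam) * w - adam_step

# The AdamW decay contribution is exactly lr * lambda * w on every coordinate,
# whereas the coupled penalty is normalized away by the adaptive rescaling.
assert np.allclose((w - adam_step) - w_adamw, lr * lam * w)
```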
The table below summarizes how weight decay and L2 regularization relate to one another across the major optimizer families used in modern deep learning.
| Optimizer | L2 regularization update | True weight decay update | Equivalent? | Practical recommendation |
|---|---|---|---|---|
| Vanilla SGD | w_{t+1} = w_t - lr * (grad + lambda * w) | w_{t+1} = (1 - lr * lambda) * w_t - lr * grad | Yes (identical update) | Either form works |
| SGD with momentum | L2 penalty enters momentum buffer | Decoupled shrinkage applied after momentum step | Approximately, with small differences | Decoupled SGDW slightly more robust |
| Adam | L2 term divided by sqrt(v_hat) | Uniform shrinkage independent of v_hat | No | Always use AdamW instead of Adam plus L2 |
| AdamW | Not used; decoupled by construction | w_{t+1} = (1 - lr * lambda) * w_t - lr * (m_hat / (sqrt(v_hat) + eps)) | N/A | Default for transformers and LLMs |
| AdaGrad | L2 term divided by sqrt(sum of squared gradients) | Uniform decay independent of accumulator | No | Decoupled version preferred |
| RMSProp | L2 term divided by sqrt(running gradient square) | Uniform decay independent of running square | No | Decoupled RMSProp preferred |
| Lion | Sign-based update; L2 enters sign computation | Decoupled shrinkage outside sign step | No | Decoupled by construction in original Lion |
| Adafactor | Factored second moment; L2 disrupts factorization | Decoupled shrinkage applied independently | No | Decoupled version standard |
The pattern is consistent. Whenever the optimizer applies a non-uniform rescaling of the gradient, mixing L2 regularization into the loss produces a per-parameter penalty that no longer matches the intended uniform weight decay, and the decoupled formulation recovers the correct behavior. This is why modern training pipelines for large models almost always use AdamW or a similar decoupled variant.
The failure mode of L2 regularization in adaptive optimizers can be understood by tracing what happens to a single weight during training. Suppose a weight w_i has a large historical gradient magnitude, meaning the second moment estimate v_i is large. When L2 regularization is added to the loss, the gradient becomes grad_i plus lambda times w_i, and Adam divides this entire quantity by the square root of v_i. The contribution of the regularization term to the update is therefore lambda times w_i divided by sqrt(v_i), which can be much smaller than the bare lambda times w_i that would be applied in plain SGD. The effective regularization on this weight is weakened.
Now suppose a different weight w_j has a small historical gradient magnitude. The square root of v_j is small, so the contribution of the L2 term becomes lambda times w_j divided by a small number, amplifying the effective regularization. Parameters that the network is barely learning end up shrunk aggressively, while parameters with strong gradient signals are barely regularized at all. This is the opposite of what one would want from a regularizer designed to prevent overfitting, since overfitting tends to manifest precisely in the parameters that are aggressively learning idiosyncratic features of the training set.
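A two-number illustration of the effective penalty lambda * w / sqrt(v) described above; the second-moment values are illustrative:

```python
import numpy as np

# Effective per-coordinate penalty after Adam's rescaling: lambda * w / sqrt(v).
lam = 0.1
w = np.array([1.0, 1.0])
v = np.array([100.0, 1e-4])   # large vs small second-moment estimate

effective_penalty = lam * w / np.sqrt(v)
# -> [0.01, 10.0]: weakened 10x on the high-gradient coordinate and
#    amplified 100x on the low-gradient one, relative to the bare lambda * w = 0.1.
```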
A second issue is hyperparameter coupling. With L2 regularization in Adam, the optimal value of lambda is sensitive to the choice of learning rate, the choice of beta_2, and the noise scale of the gradients. Practitioners reported that they had to retune the L2 coefficient every time they changed the learning rate schedule, which made it hard to transfer hyperparameters across model sizes or training lengths. The Loshchilov and Hutter paper showed that the decoupled formulation breaks this coupling. With AdamW, the optimal lambda becomes essentially independent of the learning rate over a wide range, which dramatically simplifies hyperparameter tuning at scale.
A third reason to prefer decoupling is conceptual cleanliness. The decoupled rule has the same form as the original Hanson and Pratt update, applied as a multiplicative shrinkage independent of the gradient computation. This makes the regularization easy to reason about in isolation and decouples it from any future change in the optimizer. If a research team later decides to swap Adam for Lion or Sophia or Adafactor, the weight decay schedule and coefficient transfer cleanly, whereas an L2 penalty buried in the loss would interact with the new optimizer in unpredictable ways.
Weight decay has a clean Bayesian interpretation as Maximum A Posteriori estimation under a Gaussian prior. Suppose the prior over weights is a multivariate normal distribution with zero mean and isotropic covariance sigma squared times the identity. The negative log prior is then proportional to ||w||^2 divided by 2 sigma squared, plus a constant. If the data likelihood corresponds to the unregularized loss L(w), then the negative log posterior is L(w) plus the negative log prior, which equals L(w) plus the L2 penalty with lambda equal to one over sigma squared. Minimizing this combined objective recovers the MAP estimate, and the gradient of the MAP loss is exactly the gradient of L plus lambda times w, the L2 regularized gradient.
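The derivation can be written compactly in one display, using the same symbols as the surrounding text:

```latex
\begin{aligned}
p(w) &= \mathcal{N}(w \mid 0, \sigma^2 I)
  \;\Longrightarrow\; -\log p(w) = \frac{1}{2\sigma^2}\lVert w \rVert^2 + \text{const},\\
w_{\mathrm{MAP}} &= \arg\min_w \Big[\, L(w) + \frac{\lambda}{2}\lVert w \rVert^2 \,\Big]
  \quad\text{with}\quad \lambda = \frac{1}{\sigma^2}.
\end{aligned}
```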
This interpretation gives a principled meaning to the weight decay coefficient. Larger lambda corresponds to a tighter Gaussian prior centered on zero, expressing a stronger belief that the true weights are small. Smaller lambda corresponds to a looser prior that lets the data speak for itself. The Bayesian framework also makes clear that weight decay is not a hack but rather the natural consequence of combining a Gaussian prior on parameters with maximum likelihood inference. The same framework explains why a Laplace prior on weights produces L1 regularization instead, since the Laplace negative log density is proportional to ||w||_1, the L1 norm.
The Bayesian view extends to several useful generalizations. Block-structured priors lead to group L2 regularization on parameter blocks, which is useful for structured pruning. Heavy-tailed priors such as the Student-t distribution produce non-convex regularizers that more aggressively shrink small weights to zero while being gentler on large weights. Gaussian process priors over functions lead to function-space regularizers that are more sophisticated than weight decay but recover weight decay as a special case under specific basis function expansions.
Weight decay is the neural network counterpart of ridge regression, which adds the same L2 penalty to the loss of a linear regression model. In ridge regression the closed-form solution is (X^T X + lambda I)^(-1) X^T y, which biases the least squares estimate toward zero by adding lambda times the identity matrix to the Gram matrix before inverting. The penalty has the welcome side effect of making the matrix invertible even when X is rank deficient, which is why ridge regression was originally introduced by Hoerl and Kennard in 1970 to handle multicollinearity in regression problems.
The same shrinkage mechanism, applied to the parameters of a neural network rather than a linear regressor, is what we call weight decay. The fact that ridge regression's bias variance tradeoff can be analyzed in closed form has made it a standard pedagogical setting for introducing weight decay, and Krogh and Hertz used exactly this connection in their 1991 paper to derive theoretical predictions about how the optimal lambda should scale with the noise level of the training data and the number of parameters. The Bayesian interpretation as a Gaussian prior also goes through unchanged in both settings.
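A small numpy check, under illustrative random data and step size, that full-batch gradient descent with weight decay on the least squares loss converges to the ridge closed form:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = rng.normal(size=50)
lam = 0.5

# Closed-form ridge solution: (X^T X + lambda I)^{-1} X^T y
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)

# Gradient descent with weight decay on L(w) = 0.5 * ||Xw - y||^2
w = np.zeros(5)
lr = 1e-3
for _ in range(20_000):
    grad = X.T @ (X @ w - y)
    w = (1 - lr * lam) * w - lr * grad

assert np.allclose(w, w_ridge)
```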
Lasso regression, introduced by Robert Tibshirani in 1996, replaces the L2 penalty with an L1 penalty of the form lambda times ||w||_1. The L1 penalty applies a constant-magnitude pull on every nonzero weight regardless of its size, rather than a pull proportional to the weight, which produces sparse solutions in which many weights are exactly zero. In neural networks, L1 regularization is occasionally used to encourage interpretable feature selection, but plain weight decay (L2) remains far more popular because the resulting dense weight matrices play nicely with batched matrix multiplications on GPUs. Elastic net regularization combines both penalties and is used in some specialized settings.
A classical theoretical result connects weight decay to early stopping in the special case of quadratic loss surfaces. For a linear model trained with gradient descent on a least squares loss, weight decay and early stopping produce closely matching implicit regularization in the limit of small step size. The intuition is that gradient descent moves the weights from their initialization at zero toward the unregularized solution along directions ordered by the eigenvalues of the Hessian, with high-eigenvalue directions reaching their unregularized values quickly and low-eigenvalue directions taking many steps. Stopping training early leaves the low-eigenvalue components small, which is what L2 regularization accomplishes by shrinking the component along each eigendirection by the factor eigenvalue / (eigenvalue + lambda), suppressing low-eigenvalue directions most strongly. The number of training steps before stopping plays a role analogous to one over (lr times lambda).
This equivalence does not hold exactly for non-convex losses or adaptive optimizers, but the conceptual connection remains useful. Early stopping is often described as an implicit form of regularization that bounds the maximum distance the weights can travel from their initialization, and weight decay is an explicit form that pulls them back toward the initialization on every step. In modern practice the two are often combined, since they regularize through somewhat different mechanisms and their effects are not entirely redundant. Many large model training pipelines use weight decay throughout training and stop training when validation loss plateaus, getting both forms of regularization simultaneously.
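The per-eigendirection comparison can be sketched numerically with a toy eigenvalue spectrum, illustrative lr and lambda, and stopping time t = 1 / (lr * lambda):

```python
import numpy as np

# Fraction of the unregularized solution reached in each eigendirection:
# gradient descent from zero stopped after t steps, versus ridge shrinkage.
lr, lam = 0.01, 1.0
t = int(1 / (lr * lam))                       # stopping time ~ 1 / (lr * lambda)
eigs = np.array([0.01, 0.1, 1.0, 10.0, 100.0])

early_stop_factor = 1 - (1 - lr * eigs) ** t  # after t gradient steps
ridge_factor = eigs / (eigs + lam)            # ridge: eig / (eig + lambda)

# Both suppress low-eigenvalue directions and pass high-eigenvalue directions
# nearly unchanged; the match is close but not exact.
```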
Weight decay coefficients have converged across the field to a small set of values that work well for different model classes. The table below summarizes the lambda values used in published recipes for several major model families.
| Model | Year | Optimizer | Weight decay (lambda) | Notes |
|---|---|---|---|---|
| ResNet | 2015 | SGD with momentum | 0.0001 | Standard ImageNet recipe |
| AlexNet | 2012 | SGD with momentum | 0.0005 | Original ImageNet competition entry |
| VGGNet | 2014 | SGD with momentum | 0.0005 | Same as AlexNet |
| BERT | 2018 | AdamW | 0.01 | Early large transformer recipe; smaller lambda |
| RoBERTa | 2019 | AdamW | 0.01 | Followed BERT recipe |
| GPT-2 | 2019 | AdamW | 0.01 | Followed BERT-era convention |
| GPT-3 | 2020 | AdamW | 0.1 | Established the modern LLM default |
| T5 | 2020 | Adafactor | 0.0 | T5 used no weight decay; relied on dropout |
| Chinchilla | 2022 | AdamW | 0.1 | Followed GPT-3 recipe |
| LLaMA | 2023 | AdamW | 0.1 | Reused GPT-3 hyperparameters |
| LLaMA 2 | 2023 | AdamW | 0.1 | Same as LLaMA |
| LLaMA 3 | 2024 | AdamW | 0.1 | Same as LLaMA |
| Mistral 7B | 2023 | AdamW | 0.1 | Standard LLM recipe |
| Falcon | 2023 | AdamW | 0.1 | Followed Chinchilla-style recipe |
| PaLM | 2022 | Adafactor | 0.0 | Followed T5; relied on other regularizers |
| Gemini 1 | 2023 | AdamW or AdaFactor variant | Approximately 0.1 | Details not fully public |
| Claude (original) | 2022 | AdamW | Approximately 0.1 | Details not public; assumed standard LLM recipe |
| Stable Diffusion | 2022 | AdamW | 0.01 | Image diffusion default |
| ViT | 2020 | AdamW | 0.1 | Vision transformer default |
| DeiT | 2020 | AdamW | 0.05 | Slightly stronger than ViT |
| Swin Transformer | 2021 | AdamW | 0.05 | Same as DeiT |
The pattern that emerges is that convolutional architectures from the SGD era used very small weight decay values around 1e-4 to 5e-4, while transformer architectures trained with AdamW have settled on values one to three orders of magnitude larger. The shift reflects both the change in optimizer (AdamW's decoupled weight decay scales differently than coupled L2 in Adam) and the change in scale (larger models seem to tolerate and benefit from stronger weight decay). The value lambda equals 0.1 with AdamW has become so standard for large transformers that most papers do not even bother to report it explicitly anymore.
In modern training frameworks, weight decay is typically implemented as a single line inside the optimizer step. PyTorch exposes a weight_decay argument on every optimizer in torch.optim, but for SGD and similar non-adaptive optimizers it is implemented as L2 regularization (added to the gradient), while AdamW implements true decoupled weight decay. The PyTorch documentation explicitly recommends torch.optim.AdamW over torch.optim.Adam with weight_decay for any setting where regularization matters. Other frameworks have made similar moves; Hugging Face Transformers wraps AdamW as the default for trainer recipes, JAX optax provides optax.adamw with decoupled weight decay built in, and TensorFlow's tf.keras.optimizers.experimental.AdamW matches the PyTorch behavior.
A common practical issue is which parameters should have weight decay applied to them. The standard recipe followed by every major LLM training run is to exclude bias terms and the gain and bias parameters of normalization layers (LayerNorm, RMSNorm, BatchNorm) from weight decay, applying lambda only to the weights of linear and convolutional layers. The exclusion is justified by both theory and practice. Bias terms and normalization gains have a different inductive role than weight matrices, since they shift and scale activations rather than mix them, and shrinking them toward zero produces no useful regularization while degrading the network's expressive capacity. For batch normalization, shrinking the gamma scale parameter toward zero can cause the variance estimate in the denominator to collapse, producing numerical instability.
The exclusion is implemented by partitioning the parameters into two groups before passing them to the optimizer. In PyTorch the pattern looks like

```python
import torch

# Parameters with dim >= 2 (weight matrices, conv kernels) get weight decay;
# one-dimensional parameters (biases, normalization gains) are excluded.
decay_params = [p for n, p in model.named_parameters()
                if p.dim() >= 2 and 'bias' not in n]
no_decay_params = [p for n, p in model.named_parameters()
                   if p.dim() < 2 or 'bias' in n]
optimizer = torch.optim.AdamW([
    {'params': decay_params, 'weight_decay': 0.1},
    {'params': no_decay_params, 'weight_decay': 0.0},
], lr=1e-4)
```
where the dimensionality check excludes one-dimensional parameters such as biases and normalization gains. Hugging Face Transformers' Trainer class implements this partitioning by default and is the source of the convention for many downstream training scripts.
A second implementation detail is the interaction between weight decay and learning rate schedules. Because the per-step shrinkage in AdamW is lr times lambda, decreasing the learning rate effectively decreases the strength of weight decay over the course of training. Some practitioners prefer to keep the effective decay constant by scaling lambda inversely with lr as the schedule decays, but the standard recipe in LLM training does not do this; the lambda value stays fixed at 0.1 for the entire run. This is partly a matter of empirical convention and partly because the cosine schedule at the end of LLM training spends only a small fraction of total compute at low learning rates, so the change in effective decay is modest.
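A quick numeric illustration, using a toy 10,000-step run and an illustrative peak learning rate of 1e-4, of how a cosine schedule reduces the cumulative shrinkage relative to a constant schedule at the same lambda:

```python
import numpy as np

# Cumulative multiplicative shrinkage prod(1 - lr_t * lambda) over a run,
# comparing a constant learning rate to a cosine decay to zero.
steps, lam, peak_lr = 10_000, 0.1, 1e-4
lr_const = np.full(steps, peak_lr)
lr_cosine = 0.5 * peak_lr * (1 + np.cos(np.pi * np.arange(steps) / steps))

shrink_const = np.prod(1 - lr_const * lam)
shrink_cosine = np.prod(1 - lr_cosine * lam)
# The cosine run applies roughly half the total decay exposure (sum of lr_t),
# so its cumulative shrinkage factor sits closer to 1 than the constant run's.
```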
Weight decay has several well-documented effects on the dynamics of training. The most obvious is that it bounds the norm of the weight vector. Without weight decay, gradient descent on an over-parameterized network can produce arbitrarily large weights, especially when the loss surface has shallow minima with large parameter norms. Weight decay introduces a force that pulls weights back toward zero, and at convergence the gradient of the loss must balance this pull, producing finite weight norms. This makes the trained network easier to analyze, easier to compress, and less prone to numerical instability.
A more subtle effect arises in networks with normalization layers that make the loss invariant to the scale of certain weights. In a batch normalization network, scaling the weights of the layer before normalization by a constant has no effect on the output, so the loss is constant along radial directions in weight space. Without weight decay the gradient is always perpendicular to the weight vector, so the weight norm grows monotonically during training, and the effective learning rate (which scales like lr divided by ||w||) decreases over time. Weight decay counteracts this growth, keeping the weight norm bounded and the effective learning rate stable. This effect was identified by Sanjeev Arora and colleagues in 2019 and explains why weight decay remains beneficial in batch normalized networks even though its classical regularization interpretation no longer applies.
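A numerical sketch of this perpendicularity, using a toy scale-invariant function standing in for a layer followed by normalization (the function and input values are illustrative assumptions):

```python
import numpy as np

# Toy scale-invariant loss f(w) = g(w / ||w||): normalizing w inside the
# function makes f unchanged by any rescaling of w.
def f(w, x=np.array([0.3, -1.2, 0.7])):
    u = w / np.linalg.norm(w)
    return np.tanh(u @ x)

rng = np.random.default_rng(0)
w = rng.normal(size=3)

# Central-difference numerical gradient of f at w
eps = 1e-6
grad = np.array([(f(w + eps * e) - f(w - eps * e)) / (2 * eps)
                 for e in np.eye(3)])

assert np.isclose(f(w), f(2.0 * w))  # loss is invariant to the scale of w
assert abs(grad @ w) < 1e-6          # gradient is perpendicular to w
```

Because the gradient is perpendicular to w, each gradient step can only increase ||w||, which is the norm growth that weight decay counteracts.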
Recent theoretical work has also argued that weight decay induces a low-rank bias in the trained weight matrices. Lechao Xiao and collaborators showed in 2024 that ReLU networks trained with SGD and weight decay tend to converge to weight matrices that are well approximated by low-rank factorizations, and that this low-rank structure correlates with improved generalization bounds. The mechanism is that weight decay penalizes the sum of squared singular values of each weight matrix (the Frobenius norm), and combining this with the specific structure of gradient updates in deep networks tends to push spectra toward a few dominant singular values. This low-rank bias is one explanation for why weight decay continues to help generalization even when its classical role as a regularizer is unclear.
The weight decay coefficient is one of the easier hyperparameters to tune because its effects are smooth and monotonic. Larger lambda produces stronger regularization, smaller weight norms, and (up to a point) better generalization, while excessively large lambda eventually starts to hurt training accuracy and slow convergence. The sensible search range for AdamW spans roughly two orders of magnitude, from about 0.001 for small models or short training runs up to 0.3 for very large models or runs prone to overfitting.
For convolutional networks trained with SGD plus momentum, the default search range is much smaller, typically 1e-5 to 1e-3, with 1e-4 being the most common starting point. The scale difference between SGD and AdamW reflects the different role of the lr times lambda product in the two optimizers and is a frequent source of confusion when porting recipes between architectures.
A practical rule of thumb popularized by the LLM training literature is that the optimal lambda scales weakly with model size. Doubling the parameter count generally calls for a slight increase in lambda, but the increase is much smaller than the increase in parameter count. This is consistent with the empirical observation that lambda equals 0.1 works well across LLM scales from a few hundred million to several hundred billion parameters with no further tuning. Recent work by Wang and colleagues in 2024 has provided more careful scaling laws showing that the optimal lambda actually grows roughly with the square root of the ratio of dataset size to batch size, which is consistent with the small empirical increases seen in practice.
A common mistake is to set weight decay too low for adaptive optimizers because the user is mentally calibrated to SGD-era values. A user who sets lambda equals 1e-4 in AdamW because they remember it as the ResNet default is applying essentially no regularization, since the decoupled shrinkage per step is lr times 1e-4, which for typical learning rates of 1e-4 produces a per-step multiplicative factor of (1 minus 1e-8). The standard LLM value of lambda equals 0.1 produces a per-step factor of about (1 minus 1e-5), which is one thousand times stronger and represents a meaningful regularizer.
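The size of this gap can be checked in two lines; the 100,000-step horizon is an illustrative stand-in for a full training run:

```python
# Cumulative shrinkage over a 100,000-step run at lr = 1e-4, comparing an
# SGD-era coefficient carried over by mistake with the standard AdamW value.
steps, lr = 100_000, 1e-4

factor_small = (1 - lr * 1e-4) ** steps  # lambda = 1e-4: about 0.999, no real decay
factor_llm = (1 - lr * 0.1) ** steps     # lambda = 0.1: about 0.37, meaningful decay
```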
Weight decay is not a panacea and has several known failure modes. The most important is that it can interfere with training when applied to parameters that should not be shrunk. Bias terms, normalization layer parameters, and embedding tables are all common cases where naively applying weight decay degrades performance. The standard exclusion list described above handles the most common cases, but more exotic architectures may need additional exclusions. For example, mixture of experts models often exclude the router weights from weight decay, and some retrieval augmented models exclude the cross-attention parameters that bridge the retrieval and generation stages.
A second limitation is that weight decay alone is rarely sufficient to prevent overfitting in modern overparameterized models. Frontier LLMs apply weight decay alongside dropout, label smoothing, learning rate warmup, gradient clipping, and various data augmentation strategies. The relative contribution of weight decay to final test performance is hard to isolate because it interacts with all of these other techniques. Ablation studies that remove weight decay typically find a noticeable but not catastrophic increase in validation loss, suggesting that the technique is one important ingredient among several rather than a dominant contributor.
A third issue is that the precise effect of weight decay in deep networks is not as well understood theoretically as the classical analysis would suggest. The role of weight decay as a regularizer that prefers small-norm solutions is clear in linear models and shallow networks but becomes murky in very deep networks with normalization layers, where the loss landscape is complex and the relationship between weight norm and generalization is non-monotonic. Recent papers including Kobayashi and colleagues' 2023 work have argued that the role of weight decay in modern deep learning is closer to a learning rate adjustment than a true regularizer, and that classical bias variance arguments do not directly apply. The empirical evidence that weight decay helps remains strong, but the theoretical story is still being written.
Several variants of weight decay have been proposed for specific settings. AdamW itself, described above, is the most important variant and is now the default rather than the exception. SGDW is the analogous decoupled version of SGD with momentum, proposed in the same 2017 Loshchilov and Hutter paper. Stable Weight Decay, proposed by Zhuang and colleagues in 2022, modifies AdamW to make the effective decay invariant to the second moment estimates, providing a cleaner separation between adaptive learning rates and regularization. AdEMAMix and other 2024 era optimizers also include carefully designed weight decay schemes.
Per-parameter weight decay allows different lambda values for different parameter groups, enabling more aggressive regularization of layers that overfit and gentler regularization of layers that underfit. The Hugging Face Trainer and many recent training recipes support per-parameter decay through parameter group dictionaries.
Scheduled weight decay increases or decreases lambda over the course of training. The most common pattern is to keep lambda constant during the main training run and reduce it at the end, but some recipes also warm up lambda from zero alongside the learning rate warmup. The empirical evidence for scheduled weight decay is mixed; most production training runs use a constant value.
p-norm weight decay, proposed by Vlaar and colleagues in 2024, generalizes the L2 penalty to other p-norms, with L1 producing sparsity-encouraging weight decay and higher p values producing milder shrinkage. The decoupled formulation generalizes cleanly to any differentiable norm.