# L2 Regularization

> Source: https://aiwiki.ai/wiki/l2_regularization
> Updated: 2026-06-21
> Categories: Machine Learning, Training & Optimization
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

*See also: [machine learning terms](/wiki/machine_learning_terms), [regularization](/wiki/regularization), [L1 regularization](/wiki/l1_regularization), [elastic net](/wiki/elastic_net), [overfitting](/wiki/overfitting)*

## What is L2 regularization?

L2 regularization is a technique in [machine learning](/wiki/machine_learning) and statistics that penalizes large [weight](/wiki/weight) values by adding the sum of squared parameters, scaled by a strength factor lambda, to the [loss function](/wiki/loss_function). This penalty shrinks all weights smoothly toward zero without forcing any to become exactly zero, which reduces [overfitting](/wiki/overfitting) and improves a model's ability to generalize to unseen data. It is also known as Ridge regression in the context of [linear regression](/wiki/linear_regression) and as Tikhonov regularization in applied mathematics. In deep learning it is usually implemented as weight decay, and for the [Adam](/wiki/adam_optimizer) family of optimizers the two are not equivalent, a result that motivated the AdamW optimizer in 2019 [3].

The core idea behind L2 regularization is straightforward: among all models that fit the training data reasonably well, prefer the one with smaller weights. Large weights tend to amplify noise in the training data, leading to predictions that fluctuate wildly with small changes in input. By penalizing the squared magnitude of each weight, L2 regularization steers the optimization process toward smoother, more stable solutions.

L2 regularization has a long history spanning multiple fields. In statistics, Arthur Hoerl and Robert Kennard introduced Ridge regression in 1970 in two papers in *Technometrics*, as a method for handling multicollinearity in [linear regression](/wiki/linear_regression) problems [1]. Independently, the Soviet mathematician Andrey Tikhonov developed the same mathematical framework for solving ill-posed inverse problems, proving in 1943 that restricting solutions to a compact set restores stability, and formalizing the regularization method in 1963 [2]. Today, L2 regularization is a standard component in virtually every area of machine learning, from classical regression models to deep [neural networks](/wiki/neural_network).

## Mathematical formulation

The standard supervised learning objective minimizes a [loss function](/wiki/loss_function) that measures the discrepancy between predictions and true labels. L2 regularization augments this objective by adding a penalty term proportional to the squared L2 norm of the weight vector.

### General form

Given a loss function L(w) that depends on parameters w, the L2-regularized objective is:

**J(w) = L(w) + (lambda / 2) * ||w||_2^2**

where:

| Symbol | Meaning |
|--------|---------|
| J(w) | Total regularized objective |
| L(w) | Original loss (e.g., mean squared error, cross-entropy) |
| lambda | Regularization strength (hyperparameter, lambda >= 0) |
| \|\|w\|\|_2^2 | Squared L2 norm: the sum of squares of all weights, w_1^2 + w_2^2 + ... + w_n^2 |

The factor of 1/2 is a convention that simplifies the gradient computation, since the derivative of (lambda / 2) * w_i^2 with respect to w_i is simply lambda * w_i.
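As a minimal sketch of the general form, the snippet below pairs the penalty with a mean squared error loss (an assumption chosen for illustration; any differentiable loss would do) and checks numerically that the penalty contributes exactly lambda * w to the gradient. The function names `l2_objective` and `l2_gradient` are hypothetical.

```python
import numpy as np

def l2_objective(w, X, y, lam):
    """Half mean squared error plus (lam / 2) * ||w||_2^2."""
    residual = X @ w - y
    data_loss = 0.5 * np.mean(residual ** 2)
    penalty = 0.5 * lam * np.sum(w ** 2)
    return data_loss + penalty

def l2_gradient(w, X, y, lam):
    """Analytic gradient: data-loss gradient plus lam * w from the penalty."""
    residual = X @ w - y
    return X.T @ residual / len(y) + lam * w

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)
w = rng.normal(size=3)
lam = 0.1

# Central finite differences as an independent check of the gradient.
eps = 1e-6
numeric_grad = np.array([
    (l2_objective(w + eps * e, X, y, lam) - l2_objective(w - eps * e, X, y, lam)) / (2 * eps)
    for e in np.eye(3)
])
analytic_grad = l2_gradient(w, X, y, lam)
print(np.allclose(numeric_grad, analytic_grad, atol=1e-6))
```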

### Ridge regression (linear models)

For [linear regression](/wiki/linear_regression) with the ordinary least squares loss, the Ridge regression objective takes the form:

**J(beta) = ||y - X * beta||^2 + lambda * ||beta||^2**

where X is the design matrix, y is the target vector, and beta is the coefficient vector. This problem has a closed-form solution [5]:

**beta_hat = (X^T * X + lambda * I)^(-1) * X^T * y**

For any lambda > 0, adding lambda * I to X^T * X makes the matrix positive definite and therefore invertible, even when the features are highly correlated (multicollinear) or when the number of features exceeds the number of observations. This property was the original motivation for Ridge regression in the 1970 paper by Hoerl and Kennard [1].
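The closed form can be sketched directly in NumPy. The example below deliberately uses more features than observations, so X^T * X is singular on its own; the data are synthetic and the variable names are illustrative, not taken from any library.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5, 8                       # more features than observations
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
lam = 0.5

gram = X.T @ X                    # p x p with rank at most n < p, so singular
ridge_matrix = gram + lam * np.eye(p)

# Solve the linear system rather than forming the inverse explicitly.
beta_hat = np.linalg.solve(ridge_matrix, X.T @ y)

print("rank of X^T X:", np.linalg.matrix_rank(gram))
print("smallest eigenvalue after adding lam * I:", np.linalg.eigvalsh(ridge_matrix).min())
```

Because X^T * X is positive semidefinite, every eigenvalue of the shifted matrix is at least lambda, which is why the solve succeeds even though the unregularized problem has no unique solution.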

### For neural networks

In deep [neural networks](/wiki/neural_network), the L2 penalty is typically applied to the weight matrices of each layer while excluding bias terms [6]. For a network with L layers, the regularized loss becomes:

**J = L(w) + (lambda / 2m) * sum over l from 1 to L of ||W^[l]||_F^2**

where ||W^[l]||_F^2 denotes the squared Frobenius norm of the weight matrix at layer l (computed by squaring every element and summing the results), and m is the number of training examples. Bias terms are conventionally excluded from regularization because they do not contribute to the model's sensitivity to input features.
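A small sketch of this penalty, assuming a hypothetical two-layer network whose parameters are stored in a plain dictionary (the names `W1`, `b1`, `W2`, `b2` and the helper `l2_penalty` are made up for illustration). Only the weight matrices enter the sum; the biases are skipped.

```python
import numpy as np

def l2_penalty(weights, lam, m):
    """(lam / (2 * m)) times the sum of squared Frobenius norms of the weight matrices."""
    return lam / (2 * m) * sum(np.sum(W ** 2) for W in weights)

# Hypothetical two-layer network: weights are penalized, biases are left out.
params = {
    "W1": np.array([[1.0, -2.0], [0.5, 0.0]]),
    "b1": np.array([10.0, 10.0]),
    "W2": np.array([[3.0, -1.0]]),
    "b2": np.array([10.0]),
}
weights = [value for name, value in params.items() if name.startswith("W")]
penalty = l2_penalty(weights, lam=0.1, m=100)
print(penalty)
```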

## How does L2 regularization affect gradient descent?

The practical effect of L2 regularization on [gradient descent](/wiki/gradient_descent) optimization is direct and intuitive. When computing the gradient of the regularized objective with respect to a weight w_i, the penalty term contributes an additional gradient of lambda * w_i:

**dJ/dw_i = dL/dw_i + lambda * w_i**

The parameter update rule for standard [stochastic gradient descent](/wiki/stochastic_gradient_descent_sgd) (SGD) then becomes:

**w_i <- w_i - eta * (dL/dw_i + lambda * w_i)**

which can be rewritten as:

**w_i <- (1 - eta * lambda) * w_i - eta * dL/dw_i**

For any positive learning rate eta and strength lambda, the factor (1 - eta * lambda) is strictly less than 1 (and stays positive whenever eta * lambda < 1), so at every update step the weight is first multiplied by a number slightly below 1, shrinking it toward zero, before the gradient correction from the data is applied. This multiplicative shrinkage is the origin of the term **weight decay**, and it explains why L2 regularization prevents weights from growing too large [9]. Krogh and Hertz, who analyzed this mechanism in 1991, proved that in a linear network weight decay "suppresses any irrelevant components of the weight vector by choosing the smallest vector that solves the learning problem" and can also suppress some of the effect of static noise on the targets [9].
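The two forms of the update can be checked against each other numerically. The sketch below uses made-up values for the weights and the data gradient, then shows the geometric decay that results when the data gradient is zero.

```python
import numpy as np

eta, lam = 0.1, 0.01
w = np.array([2.0, -3.0, 0.5])
grad = np.array([0.4, -0.1, 0.0])    # made-up data gradient dL/dw

# Form 1: add the penalty gradient lam * w to the data gradient.
w_penalty = w - eta * (grad + lam * w)

# Form 2: shrink the weights first, then apply the data gradient.
w_decay = (1 - eta * lam) * w - eta * grad

print(np.allclose(w_penalty, w_decay))

# With no data gradient, the weights decay geometrically toward zero.
w_free = w.copy()
for _ in range(1000):
    w_free = (1 - eta * lam) * w_free
ratio = np.linalg.norm(w_free) / np.linalg.norm(w)
print(ratio)   # equals (1 - eta * lam) ** 1000 up to rounding
```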

## Why are L2 regularization and weight decay not equivalent for Adam?

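For standard SGD, L2 regularization and weight decay produce identical parameter updates once the decay factor is rescaled by the learning rate. The equivalence holds because SGD applies the penalty gradient lambda * w without any rescaling, so subtracting eta * lambda * w from a weight is the same as multiplying it by (1 - eta * lambda).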

However, Ilya Loshchilov and Frank Hutter demonstrated in their ICLR 2019 paper "Decoupled Weight Decay Regularization" that this equivalence **breaks down for adaptive gradient optimizers** such as [Adam](/wiki/adam_optimizer) [3]. As the paper states in its opening line, "L2 regularization and weight decay regularization are equivalent for standard stochastic gradient descent (when rescaled by the learning rate), but as we demonstrate this is not the case for adaptive gradient algorithms, such as Adam" [3]. In Adam, the gradient is divided by the running average of squared gradients before the update is applied. When L2 regularization is used with Adam, the regularization gradient lambda * w gets divided by this same second moment estimate, meaning that weights with historically large gradients receive less regularization than intended, and weights with small gradients receive more regularization than intended.

To restore proper weight decay behavior with Adam, Loshchilov and Hutter proposed **AdamW**, which decouples the weight decay from the gradient-based update [3]. The authors report that decoupling "decouples the optimal choice of weight decay factor from the setting of the learning rate" and "substantially improves Adam's generalization performance," closing much of the historical generalization gap between Adam and SGD with momentum [3]:

| Optimizer | Update rule (simplified) | L2 regularization equivalent to weight decay? |
|-----------|-------------------------|-----------------------------------------------|
| SGD | w <- (1 - eta * lambda) * w - eta * grad | Yes |
| Adam (with L2 penalty) | w <- w - eta * (grad + lambda * w) / sqrt(v) | No (regularization is scaled by adaptive term) |
| AdamW (decoupled) | w <- (1 - lambda) * w - eta * grad / sqrt(v) | Yes (by design) |

This distinction has significant practical consequences. Many deep learning practitioners use Adam-based optimizers, and using L2 regularization instead of proper decoupled weight decay can lead to suboptimal generalization [3]. The [PyTorch](/wiki/pytorch) optimizer `torch.optim.AdamW` implements decoupled weight decay and is generally preferred over `torch.optim.Adam` with a `weight_decay` parameter, which by default adds the decay term to the gradient as an L2 penalty. Note that PyTorch scales the decay by the learning rate, multiplying each weight by (1 - lr * weight_decay) at every step. AdamW has since become a standard optimizer for training large [transformer](/wiki/transformer) models, including models in the GPT and Llama families.
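The mismatch can be illustrated with a simplified calculation rather than a full optimizer. The sketch below assumes constant gradients, so that Adam's bias-corrected moment estimates reduce to m_hat = g and v_hat = g^2, and it assumes the decoupled decay is scaled by the learning rate as in PyTorch. It compares how much of one step is attributable to regularization for two equal weights with very different gradient scales.

```python
import numpy as np

lr, lam, eps = 1e-3, 0.01, 1e-8
w = np.array([1.0, 1.0])             # two weights of equal size
grad = np.array([10.0, 0.01])        # but very different data-gradient scales

# Adam with an L2 penalty: lam * w is folded into the gradient, so it is
# divided by the same second-moment estimate as the data gradient.
# Assumption: constant gradients, hence v_hat = g ** 2.
g = grad + lam * w
v_hat = g ** 2
shrink_l2 = lr * lam * w / (np.sqrt(v_hat) + eps)    # part of the step due to the penalty

# Decoupled weight decay: applied to the weights directly, no adaptive scaling.
shrink_decoupled = lr * lam * w

print("Adam + L2 :", shrink_l2)          # unequal shrinkage
print("decoupled :", shrink_decoupled)   # equal shrinkage for equal weights
```

Under these assumptions the weight with the large gradient is regularized far less than the decoupled rule would apply, and the weight with the small gradient far more, which is the behavior described above.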

## Geometric interpretation

The geometric interpretation of L2 regularization provides clear intuition for why it shrinks weights without setting them to zero.

The constrained optimization form of Ridge regression is:

**minimize ||y - X * beta||^2 subject to ||beta||^2 <= t**

for some budget t. In a two-dimensional parameter space (two weights), the constraint ||beta||^2 <= t defines a **circular region** (a disk, or a ball in higher dimensions) centered at the origin. The ordinary least squares solution lies at the center of the elliptical contours of the loss function. When that solution falls outside the constraint region, the regularized solution is the point where the smallest loss contour that reaches the region touches the circular boundary.

Because the circle has a smooth boundary with no corners, the point of tangency almost always lies at a location where both weight coordinates are nonzero. This is the fundamental geometric reason why L2 regularization shrinks weights but does not produce exact zeros.

This geometry contrasts sharply with [L1 regularization](/wiki/l1_regularization) (Lasso), which uses a **diamond-shaped** (L1 ball) constraint region. The diamond has sharp corners aligned with the coordinate axes. The loss contours are more likely to first touch the constraint boundary at one of these corners, which corresponds to one or more weights being exactly zero. This is why L1 regularization performs feature selection (by zeroing out irrelevant features), while L2 regularization does not.

## How does L2 differ from L1 regularization?

L1 and L2 regularization are the two most common forms of [regularization](/wiki/regularization) in machine learning. They differ in their penalty terms, their effects on model parameters, and their practical use cases.

| Property | L2 regularization (Ridge) | [L1 regularization](/wiki/l1_regularization) (Lasso) |
|----------|--------------------------|------------------------|
| Penalty term | lambda * sum(w_i^2) | lambda * sum(\|w_i\|) |
| Constraint shape | Circle (hypersphere) | Diamond (cross-polytope) |
| Sparsity | Does not produce exact zeros | Drives many weights to exactly zero |
| Feature selection | No (all features retained) | Yes (irrelevant features removed) |
| Multicollinearity handling | Distributes weight among correlated features | Tends to pick one feature from a correlated group |
| Closed-form solution | Yes (for linear models) | No (requires iterative algorithms) |
| Differentiability | Smooth everywhere | Not differentiable at zero |
| Bayesian interpretation | Gaussian prior on weights | Laplace prior on weights |
| Stability with correlated features | Stable; spreads weight evenly | Unstable; solution can vary with small data changes |

When correlated features are present, Ridge regression distributes the weight evenly among them, while Lasso tends to select one and zero out the rest. This behavior makes Ridge regression more stable and predictable in high-multicollinearity settings. On the other hand, when true sparsity exists (only a few features are genuinely relevant), Lasso can produce more interpretable models.
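The sparsity difference is easy to observe on synthetic data. The following sketch assumes a problem in which only three of ten features carry signal; the `alpha` values are arbitrary illustrative choices, not tuned recommendations.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
true_coef = np.zeros(10)
true_coef[:3] = [4.0, -3.0, 2.0]           # only three features matter
y = X @ true_coef + rng.normal(scale=0.5, size=200)

ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

ridge_zeros = int(np.sum(ridge.coef_ == 0.0))
lasso_zeros = int(np.sum(lasso.coef_ == 0.0))
print("Ridge exact zeros:", ridge_zeros)
print("Lasso exact zeros:", lasso_zeros)
```

Ridge leaves small but nonzero coefficients on the irrelevant features, while Lasso sets most or all of them to exactly zero.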

## Elastic net: combining L1 and L2

The **[elastic net](/wiki/elastic_net)**, introduced by Hui Zou and Trevor Hastie in 2005 in the *Journal of the Royal Statistical Society: Series B*, combines L1 and L2 penalties to capture the benefits of both [4]:

**J(w) = L(w) + lambda_1 * sum(|w_i|) + lambda_2 * sum(w_i^2)**

An equivalent parameterization uses a mixing ratio alpha in [0, 1], where alpha = 1 gives pure Lasso and alpha = 0 gives pure Ridge:

**J(w) = L(w) + lambda * [alpha * sum(|w_i|) + (1 - alpha) * sum(w_i^2)]**

The L1 component provides sparsity (feature selection), while the L2 component stabilizes the solution when features are correlated. Elastic net tends to select groups of correlated features together rather than picking just one, a behavior the authors named the **grouping effect**, defined as the tendency of strongly correlated predictors to be in or out of the model together [4]. Zou and Hastie report that the elastic net "often outperforms the lasso, while enjoying a similar sparsity of representation," and is especially useful when the number of predictors greatly exceeds the number of observations (the p >> n case), where the Lasso is not a satisfactory variable-selection method [4].
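A rough illustration of the grouping effect with scikit-learn, assuming an extreme case in which two feature columns are exact copies of each other. Note the naming clash: scikit-learn's `alpha` is the overall strength (lambda in the text) and `l1_ratio` is the mixing ratio (alpha in the text); scikit-learn also scales the data loss by 1/(2n) and the L2 term by 1/2. The tight `tol` is only there so that the two coefficients can be compared precisely.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(0)
z = rng.normal(size=200)
unrelated = rng.normal(size=200)
X = np.column_stack([z, z, unrelated])     # columns 0 and 1 are identical
y = 3.0 * z + rng.normal(scale=0.5, size=200)

enet = ElasticNet(alpha=0.5, l1_ratio=0.5, tol=1e-10, max_iter=10_000).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

print("Elastic net:", enet.coef_)
print("Lasso:      ", lasso.coef_)
```

The L2 term makes the elastic net objective strictly convex, so the duplicated columns receive equal coefficients. The Lasso solution is not unique in this setting, and how it splits the weight between the two copies depends on the solver.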

## Bayesian interpretation

From a [Bayesian](/wiki/bayesian_inference) perspective, L2 regularization corresponds to placing a **Gaussian (normal) prior** on the model weights. Specifically, if the weights are assumed to follow a zero-mean Gaussian distribution with variance proportional to 1/lambda:

**P(w) proportional to exp(-lambda / 2 * ||w||^2)**

then the maximum a posteriori (MAP) estimate under a Gaussian likelihood is equivalent to the L2-regularized solution [7]. The prior encodes the belief that weights should be small and centered around zero, but not necessarily exactly zero.

This Bayesian interpretation offers several insights:

- **Larger lambda** corresponds to a narrower Gaussian prior (smaller variance), reflecting a stronger belief that weights should be near zero. This leads to more aggressive shrinkage.
- **Smaller lambda** corresponds to a wider Gaussian prior, allowing weights more freedom to take on large values. In the limit as lambda approaches 0, the prior becomes non-informative and the solution converges to ordinary least squares.
- The Gaussian log-prior is quadratic, so its pull on a weight is proportional to the weight itself: very large weights are penalized heavily, while the pull fades as a weight approaches zero. This is why Ridge regression shrinks weights smoothly rather than truncating them to zero.

By contrast, [L1 regularization](/wiki/l1_regularization) corresponds to a **Laplace prior** on the weights, which has heavier tails and a sharp peak at zero, explaining why Lasso drives weights to exact zeros.
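The MAP equivalence for the Gaussian case can be written out in a few lines. The derivation below is a sketch for linear regression; the noise variance sigma^2 and prior variance tau^2 are symbols introduced here for illustration and do not appear elsewhere in this article.

```latex
\begin{align*}
y \mid X, w &\sim \mathcal{N}(Xw,\ \sigma^2 I), \qquad w \sim \mathcal{N}(0,\ \tau^2 I) \\
-\log p(w \mid X, y) &= \frac{1}{2\sigma^2}\,\lVert y - Xw \rVert^2
    + \frac{1}{2\tau^2}\,\lVert w \rVert^2 + \text{const} \\
\hat{w}_{\mathrm{MAP}} &= \arg\min_w \ \lVert y - Xw \rVert^2 + \lambda\,\lVert w \rVert^2,
    \qquad \lambda = \frac{\sigma^2}{\tau^2}
\end{align*}
```

Multiplying the negative log-posterior by 2 sigma^2 gives the Ridge objective, so lambda is the ratio of noise variance to prior variance. With sigma^2 = 1 the prior precision 1/tau^2 equals lambda, matching the prior written above.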

## When was Ridge regression invented?

The statistical formulation of L2 regularization originated with Arthur E. Hoerl and Robert W. Kennard, who published two seminal papers in *Technometrics* in 1970. The first paper, "Ridge Regression: Biased Estimation for Nonorthogonal Problems," appeared in volume 12, issue 1, pages 55-67, and showed that adding a small positive constant to the diagonal of the X^T * X matrix could produce estimates with lower mean squared error than ordinary least squares, despite introducing bias [1].

This result was counterintuitive at the time. The prevailing statistical theory favored unbiased estimators, and the Gauss-Markov theorem established that OLS was the best linear unbiased estimator. Hoerl and Kennard demonstrated that a biased estimator (Ridge) could achieve better overall accuracy by trading a small amount of bias for a large reduction in variance [1]. This bias-variance tradeoff is now a foundational concept in [machine learning](/wiki/machine_learning) [5].

The name "Ridge" comes from ridge analysis, a technique Hoerl used in industrial chemistry for exploring response surfaces. Independently, the mathematician Andrey Tikhonov had developed the same regularization approach for solving ill-posed integral equations [2], and the technique is widely known in applied mathematics and numerical analysis as Tikhonov regularization or Tikhonov-Phillips regularization (David L. Phillips contributed independently as well).

## L2 regularization in neural networks

In deep learning, L2 regularization (usually implemented as weight decay) is one of several tools for controlling the capacity of [neural networks](/wiki/neural_network). It is often combined with other regularization strategies:

| Technique | How it works | Relationship to L2 |
|-----------|-------------|--------------------|
| [Dropout](/wiki/dropout) | Randomly zeroes out activations during training | Complementary; often used together |
| [Batch normalization](/wiki/batch_normalization) | Normalizes layer inputs; has an implicit regularization effect | Can reduce the need for explicit L2 |
| [Early stopping](/wiki/early_stopping) | Stops training when validation error begins to rise | Has a similar shrinkage effect to L2 on the weight trajectory |
| [Data augmentation](/wiki/data_augmentation) | Expands the effective training set | Complementary; addresses overfitting from a different angle |
| [Label smoothing](/wiki/label_smoothing) | Softens hard labels to prevent overconfident predictions | Orthogonal to L2; targets output confidence |

A noteworthy theoretical connection exists between L2 regularization and [early stopping](/wiki/early_stopping). For linear models with a quadratic loss trained by gradient descent from a zero initialization, stopping the optimization early has an effect approximately equivalent to L2 regularization: the effective regularization strength is inversely proportional to the product of the learning rate and the number of training iterations [6]. This connection extends approximately to neural networks, where early stopping acts as an implicit form of weight decay.
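The correspondence can be sketched for least squares by looking at one eigen-direction of X^T * X at a time. The snippet assumes gradient descent on (1/2) * ||y - X * w||^2 started from w = 0, and matches the L2 strength as lambda = 1 / (eta * steps); the learning rate, step count, and eigenvalues are made-up values.

```python
import numpy as np

eta, steps = 0.01, 100
lam = 1.0 / (eta * steps)                        # matched L2 strength
curvature = np.array([0.01, 0.1, 1.0, 10.0])     # eigenvalues of X^T X

# Fraction of the least-squares solution recovered along each eigen-direction.
early_stopping = 1.0 - (1.0 - eta * curvature) ** steps   # gradient descent from w = 0
ridge = curvature / (curvature + lam)                     # closed-form L2 shrinkage

for s, e, r in zip(curvature, early_stopping, ridge):
    print(f"eigenvalue {s:>5}: early stopping {e:.3f}, ridge {r:.3f}")
```

The two shrinkage profiles agree closely in low-curvature directions and only roughly elsewhere, which is why the equivalence is described as approximate.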

In modern deep learning practice, the weight decay coefficient is typically set to a small value such as 1e-4 or 1e-2. The optimal value depends on the model architecture, dataset size, and other regularization techniques in use. [Transformers](/wiki/transformer) and other large models often use weight decay values of 0.01 or 0.1 as part of their training recipe. As a concrete example, the original GPT-3 training used a weight decay of 0.1, and the recipe is applied through AdamW rather than a raw L2 penalty.

## How do you choose the regularization parameter (lambda)?

Selecting the right value for the regularization hyperparameter lambda is critical. Too large a value leads to underfitting (excessive bias, weights shrunk too aggressively), while too small a value provides insufficient regularization and allows overfitting.

### Cross-validation

The standard approach for choosing lambda is **k-fold [cross-validation](/wiki/cross-validation)**. The training data is split into k folds, and for each candidate lambda value the model is trained on k-1 folds and evaluated on the remaining fold. The lambda that yields the best average validation performance is selected. A typical workflow involves:

1. Define a grid of candidate lambda values on a logarithmic scale (e.g., 10^-6, 10^-5, ..., 10^2, 10^3).
2. For each lambda, run k-fold cross-validation (k = 5 or k = 10 are common choices).
3. Select the lambda with the lowest mean validation error.
4. Optionally, apply the "one standard error rule": choose the largest lambda whose validation error is within one standard error of the minimum, favoring a simpler model.
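
The workflow above can be sketched with scikit-learn's `cross_val_score`. The synthetic dataset, the 5-fold split, and the use of mean squared error as the validation metric are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)

lambdas = np.logspace(-6, 3, 10)          # 10^-6, 10^-5, ..., 10^3
mean_mse, se_mse = [], []
for lam in lambdas:
    # scikit-learn reports negated MSE so that larger is always better.
    scores = -cross_val_score(Ridge(alpha=lam), X, y, cv=5,
                              scoring="neg_mean_squared_error")
    mean_mse.append(scores.mean())
    se_mse.append(scores.std(ddof=1) / np.sqrt(len(scores)))
mean_mse, se_mse = np.array(mean_mse), np.array(se_mse)

best = int(np.argmin(mean_mse))
# One-standard-error rule: largest lambda whose error is within one SE of the minimum.
threshold = mean_mse[best] + se_mse[best]
one_se = int(np.max(np.nonzero(mean_mse <= threshold)[0]))

print("lowest-error lambda:", lambdas[best])
print("one-SE lambda:      ", lambdas[one_se])
```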

[Scikit-learn](/wiki/scikit_learn) provides `RidgeCV`, which performs efficient leave-one-out cross-validation for Ridge regression using a computational shortcut that avoids retraining the model for each held-out sample [8].

### Generalized cross-validation

For Ridge regression specifically, generalized cross-validation (GCV) provides an efficient approximation to leave-one-out cross-validation. GCV minimizes the function:

**GCV(lambda) = RSS(lambda) / (1 - tr(H(lambda)) / n)^2**

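where RSS(lambda) is the residual sum of squares of the Ridge fit, H(lambda) = X * (X^T * X + lambda * I)^(-1) * X^T is the hat matrix that maps y to the fitted values, and tr(H(lambda)) is the effective degrees of freedom. Some references include an extra constant factor of 1/n, which does not change the minimizing lambda. This approach requires only a single matrix decomposition of X, which can be reused for every candidate lambda, rather than n separate model fits.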
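A minimal NumPy sketch of the GCV criterion as written above, assuming no intercept term. It uses a single singular value decomposition of X, from which both the fitted values and tr(H) follow for any lambda; the helper name `gcv_score` and the synthetic data are illustrative.

```python
import numpy as np

def gcv_score(X, y, lam):
    """GCV(lam) = RSS / (1 - tr(H) / n)^2 with H = X (X^T X + lam I)^{-1} X^T."""
    n = X.shape[0]
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    shrink = s ** 2 / (s ** 2 + lam)          # nonzero eigenvalues of the hat matrix
    y_hat = U @ (shrink * (U.T @ y))          # H @ y without forming H
    rss = np.sum((y - y_hat) ** 2)
    return rss / (1.0 - shrink.sum() / n) ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = X @ rng.normal(size=10) + rng.normal(size=100)

lambdas = np.logspace(-3, 3, 13)
scores = np.array([gcv_score(X, y, lam) for lam in lambdas])
print("GCV-selected lambda:", lambdas[np.argmin(scores)])
```

In practice the decomposition would be computed once outside the loop; it is recomputed inside `gcv_score` here only to keep the function self-contained.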

## Implementation

### Scikit-learn

[Scikit-learn](/wiki/scikit_learn) provides several classes for Ridge regression in `sklearn.linear_model`:

```python
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=50, noise=10)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Ridge with a fixed alpha (lambda)
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)
print(f"R^2 score: {ridge.score(X_test, y_test):.4f}")

# RidgeCV with built-in cross-validation
ridge_cv = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0, 100.0])
ridge_cv.fit(X_train, y_train)
print(f"Best alpha: {ridge_cv.alpha_}")
print(f"R^2 score: {ridge_cv.score(X_test, y_test):.4f}")
```

Note that scikit-learn uses the parameter name `alpha` instead of `lambda` (since `lambda` is a reserved keyword in Python).

### PyTorch

In [PyTorch](/wiki/pytorch), L2 regularization is most commonly applied through the `weight_decay` parameter in optimizers:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(50, 128),
    nn.ReLU(),
    nn.Linear(128, 1)
)

# Option 1: SGD with weight decay (equivalent to L2 regularization)
sgd_optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

# Option 2: AdamW with decoupled weight decay (recommended over Adam + weight_decay)
adamw_optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
```

For fine-grained control, bias parameters can be excluded from weight decay:

```python
param_groups = [
    {"params": [p for n, p in model.named_parameters() if "bias" not in n],
     "weight_decay": 0.01},
    {"params": [p for n, p in model.named_parameters() if "bias" in n],
     "weight_decay": 0.0}
]
optimizer = torch.optim.AdamW(param_groups, lr=1e-3)
```

### TensorFlow/Keras

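In [TensorFlow](/wiki/tensorflow) and [Keras](/wiki/keras), L2 regularization can be applied directly to individual layers using kernel regularizers. The `l2` regularizer adds its coefficient times the sum of squared weights to the loss, without the factor of 1/2 used in the general form. Because this is a loss penalty, pairing it with the Adam optimizer as in the example gives coupled L2 regularization rather than decoupled weight decay: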

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu",
                          kernel_regularizer=tf.keras.regularizers.l2(1e-4)),
    tf.keras.layers.Dense(1)
])

model.compile(optimizer="adam", loss="mse")
```

## Advantages and disadvantages

### Advantages

1. **Prevents overfitting.** By penalizing large weights, L2 regularization constrains the model's complexity and encourages solutions that generalize better to unseen data.
2. **Handles multicollinearity.** Ridge regression remains stable and well-defined even when features are highly correlated or the design matrix is ill-conditioned, situations where ordinary least squares can produce wildly unstable estimates.
3. **Closed-form solution.** For linear models, the Ridge estimator has an analytical solution that is computationally efficient and numerically stable [10].
4. **Smooth optimization landscape.** The L2 penalty is differentiable everywhere, which makes gradient-based optimization straightforward and avoids the numerical challenges associated with the non-differentiable L1 penalty.
5. **Improves conditioning.** For lambda > 0, adding lambda * I to X^T * X shifts every eigenvalue up by lambda, which lowers the condition number of the matrix (unless it is already perfectly conditioned) and makes the numerical solution more reliable.
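
The conditioning claim can be checked numerically. The sketch below builds two nearly collinear columns (synthetic data, arbitrary lambda values) and reports the condition number of X^T * X + lambda * I as lambda grows.

```python
import numpy as np

rng = np.random.default_rng(0)
base = rng.normal(size=(100, 1))
# Two nearly collinear columns make X^T X badly conditioned.
X = np.hstack([base, base + 1e-4 * rng.normal(size=(100, 1))])
gram = X.T @ X

lambdas = [0.0, 1e-3, 1.0]
conds = [np.linalg.cond(gram + lam * np.eye(2)) for lam in lambdas]
for lam, cond in zip(lambdas, conds):
    print(f"lambda={lam:g}: condition number {cond:.3g}")
```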

### Disadvantages

1. **No feature selection.** Unlike [L1 regularization](/wiki/l1_regularization), L2 regularization does not drive weights to exactly zero, so it retains all features in the model. This can be a drawback when true sparsity exists.
2. **Introduces bias.** The Ridge estimator is biased: it systematically shrinks coefficient estimates toward zero, which can lead to underfitting if lambda is too large.
3. **Hyperparameter tuning required.** The regularization strength lambda must be carefully tuned, typically through cross-validation, adding computational cost to the model selection process.
4. **Not equivalent to weight decay for all optimizers.** As discussed above, L2 regularization and weight decay are not interchangeable when using adaptive optimizers like Adam.

## Explain like I'm 5 (ELI5)

Imagine you are building a tower out of blocks. You want your tower to be tall, but you also want it to be steady so it does not fall over. If you use really big, heavy blocks near the top, the tower might get tall quickly, but it will wobble and tip over easily.

L2 regularization is like a rule that says: "You can use whatever blocks you want, but bigger blocks cost more points." Since you want to save your points, you end up using medium-sized blocks instead of giant ones. Your tower might not be as tall, but it stands up much better. That is what L2 regularization does for a machine learning model: it keeps the weights (the building blocks of the model) from getting too big, so the model works well not just on the data it was trained on, but also on new data it has never seen before.

## References

1. Hoerl, A. E., & Kennard, R. W. (1970). "Ridge Regression: Biased Estimation for Nonorthogonal Problems." *Technometrics*, 12(1), 55-67.
2. Tikhonov, A. N. (1963). "Solution of incorrectly formulated problems and the regularization method." *Soviet Mathematics Doklady*, 4, 1035-1038.
3. Loshchilov, I., & Hutter, F. (2019). "Decoupled Weight Decay Regularization." *Proceedings of the 7th International Conference on Learning Representations (ICLR 2019)*. arXiv:1711.05101. https://arxiv.org/abs/1711.05101
4. Zou, H., & Hastie, T. (2005). "Regularization and variable selection via the elastic net." *Journal of the Royal Statistical Society: Series B*, 67(2), 301-320.
5. Hastie, T., Tibshirani, R., & Friedman, J. (2009). *The Elements of Statistical Learning* (2nd ed.). Springer. Chapter 3: Linear Methods for Regression.
6. Goodfellow, I., Bengio, Y., & Courville, A. (2016). *Deep Learning*. MIT Press. Chapter 7: Regularization for Deep Learning.
7. Bishop, C. M. (2006). *Pattern Recognition and Machine Learning*. Springer. Section 3.1.4: Regularized Least Squares.
8. Pedregosa, F., et al. (2011). "Scikit-learn: Machine Learning in Python." *Journal of Machine Learning Research*, 12, 2825-2830.
9. Krogh, A., & Hertz, J. A. (1991). "A Simple Weight Decay Can Improve Generalization." *Advances in Neural Information Processing Systems (NeurIPS)*, 4, 950-957.
10. Murphy, K. P. (2022). *Probabilistic Machine Learning: An Introduction*. MIT Press. Chapter 11: Linear Regression.

