See also: machine learning terms, regularization, L1 regularization, elastic net, overfitting
L2 regularization is a widely used technique in machine learning and statistics that penalizes large weight values by adding the sum of squared parameters to the loss function. Also known as Ridge regression in the context of linear regression and Tikhonov regularization in applied mathematics, L2 regularization discourages complex models by shrinking parameter values toward zero without forcing them to become exactly zero. This shrinkage effect reduces overfitting and improves a model's ability to generalize to unseen data.
The core idea behind L2 regularization is straightforward: among all models that fit the training data reasonably well, prefer the one with smaller weights. Large weights tend to amplify noise in the training data, leading to predictions that fluctuate wildly with small changes in input. By penalizing the squared magnitude of each weight, L2 regularization steers the optimization process toward smoother, more stable solutions.
L2 regularization has a long history spanning multiple fields. In statistics, Arthur Hoerl and Robert Kennard introduced Ridge regression in 1970 as a method for handling multicollinearity in linear regression problems. Independently, the Soviet mathematician Andrey Tikhonov developed the same mathematical framework for solving ill-posed inverse problems. Today, L2 regularization is a standard component in virtually every area of machine learning, from classical regression models to deep neural networks.
The standard supervised learning objective minimizes a loss function that measures the discrepancy between predictions and true labels. L2 regularization augments this objective by adding a penalty term proportional to the squared L2 norm of the weight vector.
Given a loss function L(w) that depends on parameters w, the L2-regularized objective is:
J(w) = L(w) + (lambda / 2) * ||w||_2^2
where:
| Symbol | Meaning |
|---|---|
| J(w) | Total regularized objective |
| L(w) | Original loss (e.g., mean squared error, cross-entropy) |
| lambda | Regularization strength (hyperparameter, lambda >= 0) |
| \|\|w\|\|_2^2 | Squared L2 norm: the sum of squares of all weights, w_1^2 + w_2^2 + ... + w_n^2 |
The factor of 1/2 is a convention that simplifies the gradient computation, since the derivative of (lambda / 2) * w_i^2 with respect to w_i is simply lambda * w_i.
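This convention is easy to verify numerically. The sketch below (toy data and variable names are illustrative) builds the regularized objective and checks that a finite-difference gradient of the penalty term equals lambda * w:

```python
import numpy as np

# Toy setup: a small linear model with mean squared error plus the L2 penalty.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=20)
w = np.array([0.5, -1.0, 2.0])
lam = 0.1

# J(w) = L(w) + (lambda / 2) * ||w||_2^2
J = np.mean((X @ w - y) ** 2) + (lam / 2) * np.sum(w ** 2)

def penalty(v):
    return (lam / 2) * np.sum(v ** 2)

# Central-difference gradient of the penalty term alone.
eps = 1e-6
num_grad = np.array([(penalty(w + eps * e) - penalty(w - eps * e)) / (2 * eps)
                     for e in np.eye(3)])
assert np.allclose(num_grad, lam * w, atol=1e-6)
```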
For linear regression with the ordinary least squares loss, the Ridge regression objective takes the form:
J(beta) = ||y - X * beta||^2 + lambda * ||beta||^2
where X is the design matrix, y is the target vector, and beta is the coefficient vector. This problem has a closed-form solution:
beta_hat = (X^T * X + lambda * I)^(-1) * X^T * y
The addition of lambda * I to the matrix X^T * X guarantees that the matrix is invertible, even when the features are highly correlated (multicollinear) or when the number of features exceeds the number of observations. This property was the original motivation for Ridge regression in the 1970 paper by Hoerl and Kennard.
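The closed-form solution can be checked directly in NumPy (the synthetic data below is illustrative): at beta_hat, the gradient of the Ridge objective vanishes.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
beta_true = np.array([2.0, -1.0, 0.0, 3.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.5, size=100)
lam = 1.0

# beta_hat = (X^T X + lambda I)^(-1) X^T y, via a linear solve.
beta_hat = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)

# Gradient of ||y - X beta||^2 + lambda ||beta||^2 at beta_hat is zero.
grad = -2 * X.T @ (y - X @ beta_hat) + 2 * lam * beta_hat
assert np.allclose(grad, 0.0, atol=1e-8)
```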
In deep neural networks, the L2 penalty is typically applied to the weight matrices of each layer while excluding bias terms. For a network with L layers, the regularized loss becomes:
J = L(w) + (lambda / 2m) * sum over l from 1 to L of ||W^[l]||_F^2
where ||W^[l]||_F denotes the Frobenius norm of the weight matrix at layer l (computed by squaring every element and summing them), and m is the number of training examples. Bias terms are conventionally excluded from regularization because they do not contribute to the model's sensitivity to input features.
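A minimal numeric sketch of this penalty (the tiny weight matrices and constants below are made up):

```python
import numpy as np

# Two layers' weight matrices; biases are deliberately left out of the penalty.
W1 = np.array([[1.0, 2.0], [3.0, 4.0]])
W2 = np.array([[0.5], [-0.5]])
lam, m = 0.01, 100  # regularization strength and number of training examples

# Squared Frobenius norm = square every element and sum; then sum over layers.
penalty = (lam / (2 * m)) * (np.sum(W1 ** 2) + np.sum(W2 ** 2))
assert np.isclose(penalty, (0.01 / 200) * 30.5)
```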
The practical effect of L2 regularization on gradient descent optimization is direct and intuitive. When computing the gradient of the regularized objective with respect to a weight w_i, the penalty term contributes an additional gradient of lambda * w_i:
dJ/dw_i = dL/dw_i + lambda * w_i
The parameter update rule for standard stochastic gradient descent (SGD) then becomes:
w_i <- w_i - eta * (dL/dw_i + lambda * w_i)
which can be rewritten as:
w_i <- (1 - eta * lambda) * w_i - eta * dL/dw_i
The factor (1 - eta * lambda) is strictly less than 1, so at every update step the weight is first multiplied by a number slightly below 1, shrinking it toward zero, before the gradient correction from the data is applied. This multiplicative shrinkage is the origin of the term weight decay, and it explains why L2 regularization prevents weights from growing too large.
For standard SGD, L2 regularization and weight decay produce identical parameter updates. This equivalence holds because the gradient of the L2 penalty integrates cleanly into the SGD update rule.
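The identity is easy to confirm for a single SGD step (the numbers below are arbitrary):

```python
import numpy as np

w = np.array([0.8, -0.3])
grad_L = np.array([0.1, 0.05])   # dL/dw from the data
eta, lam = 0.1, 0.01

step_a = w - eta * (grad_L + lam * w)          # explicit L2 gradient
step_b = (1 - eta * lam) * w - eta * grad_L    # multiplicative weight decay
assert np.allclose(step_a, step_b)
```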
However, Loshchilov and Hutter demonstrated in their 2019 ICLR paper "Decoupled Weight Decay Regularization" that this equivalence breaks down for adaptive gradient optimizers such as Adam. In Adam, the gradient is divided by the running average of squared gradients before the update is applied. When L2 regularization is used with Adam, the regularization gradient lambda * w gets divided by this same second moment estimate, meaning that weights with historically large gradients receive less regularization than intended, and weights with small gradients receive more regularization than intended.
To restore proper weight decay behavior with Adam, Loshchilov and Hutter proposed AdamW, which decouples the weight decay from the gradient-based update:
| Optimizer | Update rule (simplified) | L2 regularization equivalent to weight decay? |
|---|---|---|
| SGD | w <- (1 - eta * lambda) * w - eta * grad | Yes |
| Adam (with L2 penalty) | w <- w - eta * (grad + lambda * w) / sqrt(v) | No (regularization is scaled by adaptive term) |
| AdamW (decoupled) | w <- (1 - eta * lambda) * w - eta * grad / sqrt(v) | Yes (by design) |
This distinction has significant practical consequences. Many deep learning practitioners use Adam-based optimizers, and using L2 regularization instead of proper decoupled weight decay can lead to suboptimal generalization. The PyTorch optimizer torch.optim.AdamW implements decoupled weight decay and is now the recommended choice over torch.optim.Adam with a weight_decay parameter.
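The mismatch can be seen in a single simplified Adam step, sketched here in NumPy under the assumption of a first step from zero moment estimates (so the bias-corrected moments reduce to the raw gradient and its square); this is a toy sketch, not a full optimizer:

```python
import numpy as np

w0 = np.array([1.0, -2.0])
grad = np.array([0.3, 0.3])
eta, lam, eps = 0.1, 0.1, 1e-8

def first_adam_step(w, g):
    # On the first step from m = v = 0, bias correction gives m_hat = g and
    # v_hat = g**2, so the update is approximately eta * sign(g).
    return w - eta * g / (np.sqrt(g ** 2) + eps)

# (a) Adam with an L2 penalty: lam * w is folded into the gradient and then
#     divided by the adaptive term, diluting the intended regularization.
w_l2 = first_adam_step(w0, grad + lam * w0)

# (b) AdamW-style decoupled decay: shrink the weights, then take the step.
w_adamw = first_adam_step((1 - eta * lam) * w0, grad)

# The two rules produce different parameters after just one step.
assert not np.allclose(w_l2, w_adamw)
```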
The geometric interpretation of L2 regularization provides clear intuition for why it shrinks weights without setting them to zero.
The constrained optimization form of Ridge regression is:
minimize ||y - X * beta||^2 subject to ||beta||^2 <= t
for some budget t. In a two-dimensional parameter space (two weights), the constraint ||beta||^2 <= t defines a circular region (or a hypersphere in higher dimensions) centered at the origin. The ordinary least squares solution lies at the center of the elliptical contours of the loss function. The regularized solution is the point where the smallest loss contour touches the circular constraint boundary.
Because the circle has a smooth boundary with no corners, the point of tangency almost always lies at a location where both weight coordinates are nonzero. This is the fundamental geometric reason why L2 regularization shrinks weights but does not produce exact zeros.
This geometry contrasts sharply with L1 regularization (Lasso), which uses a diamond-shaped (L1 ball) constraint region. The diamond has sharp corners aligned with the coordinate axes. The loss contours are more likely to first touch the constraint boundary at one of these corners, which corresponds to one or more weights being exactly zero. This is why L1 regularization performs feature selection (by zeroing out irrelevant features), while L2 regularization does not.
L1 and L2 regularization are the two most common forms of regularization in machine learning. They differ in their penalty terms, their effects on model parameters, and their practical use cases.
| Property | L2 regularization (Ridge) | L1 regularization (Lasso) |
|---|---|---|
| Penalty term | lambda * sum(w_i^2) | lambda * sum(\|w_i\|) |
| Constraint shape | Circle (hypersphere) | Diamond (cross-polytope) |
| Sparsity | Does not produce exact zeros | Drives many weights to exactly zero |
| Feature selection | No (all features retained) | Yes (irrelevant features removed) |
| Multicollinearity handling | Distributes weight among correlated features | Tends to pick one feature from a correlated group |
| Closed-form solution | Yes (for linear models) | No (requires iterative algorithms) |
| Differentiability | Smooth everywhere | Not differentiable at zero |
| Bayesian interpretation | Gaussian prior on weights | Laplace prior on weights |
| Stability with correlated features | Stable; spreads weight evenly | Unstable; solution can vary with small data changes |
When correlated features are present, Ridge regression distributes the weight evenly among them, while Lasso tends to select one and zero out the rest. This behavior makes Ridge regression more stable and predictable in high-multicollinearity settings. On the other hand, when true sparsity exists (only a few features are genuinely relevant), Lasso can produce more interpretable models.
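The sparsity contrast is easy to observe with scikit-learn (the synthetic problem below, with 3 relevant features out of 20, is illustrative, and the alpha values are arbitrary):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Sparse ground truth: only the first 3 of 20 features matter.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 20))
beta = np.zeros(20)
beta[:3] = [3.0, -2.0, 1.5]
y = X @ beta + rng.normal(scale=0.5, size=200)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

# Lasso zeroes out many irrelevant coefficients; Ridge only shrinks them.
n_zero_ridge = int(np.sum(np.abs(ridge.coef_) < 1e-8))
n_zero_lasso = int(np.sum(np.abs(lasso.coef_) < 1e-8))
assert n_zero_lasso > n_zero_ridge
```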
The elastic net, introduced by Zou and Hastie in 2005, combines L1 and L2 penalties to capture the benefits of both:
J(w) = L(w) + lambda_1 * sum(|w_i|) + lambda_2 * sum(w_i^2)
An equivalent parameterization uses a mixing ratio alpha in [0, 1], where alpha = 1 gives pure Lasso and alpha = 0 gives pure Ridge:
J(w) = L(w) + lambda * [alpha * sum(|w_i|) + (1 - alpha) * sum(w_i^2)]
The L1 component provides sparsity (feature selection), while the L2 component stabilizes the solution when features are correlated. Elastic net tends to select groups of correlated features together rather than picking just one, a behavior known as the grouping effect. This makes elastic net particularly useful for datasets with many correlated predictors or when the number of features exceeds the number of observations.
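A brief sketch with scikit-learn's ElasticNet, which parameterizes the mix as alpha for overall strength and l1_ratio for the L1 share (the near-duplicate feature below is constructed specifically to exhibit the grouping effect):

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 10))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=200)   # near-duplicate feature
y = X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=200)

# l1_ratio=1 would be pure Lasso, l1_ratio=0 pure Ridge.
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)

# Grouping effect: both correlated features keep substantial nonzero weight,
# where pure Lasso would tend to keep one and discard the other.
assert abs(enet.coef_[0]) > 0.1 and abs(enet.coef_[1]) > 0.1
```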
From a Bayesian perspective, L2 regularization corresponds to placing a Gaussian (normal) prior on the model weights. Specifically, if the weights are assumed to follow a zero-mean Gaussian distribution with variance proportional to 1/lambda:
P(w) proportional to exp(-lambda / 2 * ||w||^2)
then the maximum a posteriori (MAP) estimate under a Gaussian likelihood is equivalent to the L2-regularized solution. The prior encodes the belief that weights should be small and centered around zero, but not necessarily exactly zero.
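The equivalence can be checked numerically: with Gaussian noise variance sigma^2 and a zero-mean Gaussian prior with variance tau^2 on each weight, the posterior mean (which is also the MAP estimate for a Gaussian posterior) coincides with the Ridge solution at lambda = sigma^2 / tau^2. The data below is synthetic.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 4))
y = X @ np.array([1.0, 0.5, -1.0, 2.0]) + rng.normal(scale=0.3, size=50)
sigma2, tau2 = 0.3 ** 2, 1.0

# Ridge solution with lambda = sigma^2 / tau^2.
lam = sigma2 / tau2
ridge = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)

# Bayesian posterior mean under the Gaussian likelihood and Gaussian prior.
map_est = np.linalg.solve(X.T @ X / sigma2 + np.eye(4) / tau2, X.T @ y / sigma2)

assert np.allclose(ridge, map_est)
```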
This Bayesian interpretation offers several insights: the regularization strength lambda is inversely related to the prior variance, so a stronger belief that weights are small corresponds to heavier regularization; tuning lambda amounts to choosing how much to trust the data relative to the prior; and the regularized solution is the mode of a posterior distribution rather than an ad hoc adjustment to the loss.
By contrast, L1 regularization corresponds to a Laplace prior on the weights, which has heavier tails and a sharp peak at zero, explaining why Lasso drives weights to exact zeros.
The statistical formulation of L2 regularization originated with Arthur E. Hoerl and Robert W. Kennard, who published two seminal papers in Technometrics in 1970. The first paper, "Ridge Regression: Biased Estimation for Nonorthogonal Problems," showed that adding a small positive constant to the diagonal of the X^T * X matrix could produce estimates with lower mean squared error than ordinary least squares, despite introducing bias.
This result was counterintuitive at the time. The prevailing statistical theory favored unbiased estimators, and the Gauss-Markov theorem established that OLS was the best linear unbiased estimator. Hoerl and Kennard demonstrated that a biased estimator (Ridge) could achieve better overall accuracy by trading a small amount of bias for a large reduction in variance. This bias-variance tradeoff is now a foundational concept in machine learning.
The name "Ridge" comes from ridge analysis, a technique Hoerl used in industrial chemistry for exploring response surfaces. Independently, the mathematician Andrey Tikhonov had developed the same regularization approach for solving ill-posed integral equations, and the technique is widely known in applied mathematics and numerical analysis as Tikhonov regularization or Tikhonov-Phillips regularization (David L. Phillips contributed independently as well).
In deep learning, L2 regularization (usually implemented as weight decay) is one of several tools for controlling the capacity of neural networks. It is often combined with other regularization strategies:
| Technique | How it works | Relationship to L2 |
|---|---|---|
| Dropout | Randomly zeroes out activations during training | Complementary; often used together |
| Batch normalization | Normalizes layer inputs; has an implicit regularization effect | Can reduce the need for explicit L2 |
| Early stopping | Stops training when validation error begins to rise | Has a similar shrinkage effect to L2 on the weight trajectory |
| Data augmentation | Expands the effective training set | Complementary; addresses overfitting from a different angle |
| Label smoothing | Softens hard labels to prevent overconfident predictions | Orthogonal to L2; targets output confidence |
A noteworthy theoretical connection exists between L2 regularization and early stopping. For linear models trained with gradient descent, stopping the optimization early has an effect equivalent to L2 regularization: the effective regularization strength is inversely proportional to the number of training iterations. This connection extends approximately to neural networks, where early stopping acts as an implicit form of weight decay.
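A small NumPy experiment illustrates the connection (the step size and iteration count below are arbitrary): gradient descent on least squares, started from zero and stopped after a few iterations, leaves the weights with a smaller norm than the fully converged OLS solution.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.2, size=100)

# Step size below 2 / lambda_max(X^T X) guarantees stable descent.
eta = 1.0 / np.linalg.norm(X.T @ X, 2)
w = np.zeros(5)
for _ in range(10):                       # deliberately stop early
    w -= eta * (X.T @ (X @ w - y))

w_ols = np.linalg.lstsq(X, y, rcond=None)[0]

# Early stopping leaves the iterate shrunk toward the origin, like L2.
assert np.linalg.norm(w) < np.linalg.norm(w_ols)
```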
In modern deep learning practice, the weight decay coefficient is typically set to a small value such as 1e-4 or 1e-2. The optimal value depends on the model architecture, dataset size, and other regularization techniques in use. Transformers and other large models often use weight decay values of 0.01 or 0.1 as part of their training recipe.
Selecting the right value for the regularization hyperparameter lambda is critical. Too large a value leads to underfitting (excessive bias, weights shrunk too aggressively), while too small a value provides insufficient regularization and allows overfitting.
The standard approach for choosing lambda is k-fold cross-validation. The training data is split into k folds, and for each candidate lambda value the model is trained on k-1 folds and evaluated on the remaining fold. The lambda that yields the best average validation performance is selected. A typical workflow involves defining a grid of candidate values spaced on a logarithmic scale (for example, from 1e-4 to 1e2), evaluating each candidate with cross-validation, and retraining on the full training set with the selected value.
Scikit-learn provides RidgeCV, which performs efficient leave-one-out cross-validation for Ridge regression using a computational shortcut that avoids retraining the model for each held-out sample.
For Ridge regression specifically, generalized cross-validation (GCV) provides an efficient approximation to leave-one-out cross-validation. GCV minimizes the function:
GCV(lambda) = (RSS(lambda) / n) / (1 - tr(H(lambda)) / n)^2
where H(lambda) is the hat matrix and tr(H) represents the effective degrees of freedom. This approach requires only a single matrix computation rather than n separate model fits.
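A direct, if naive, implementation of this criterion for Ridge regression (forming the full hat matrix, which is fine at small scale but not the efficient shortcut a library would use; the data and candidate grid below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(80, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.5, size=80)
n, p = X.shape

def gcv(lam):
    # Hat matrix H = X (X^T X + lam I)^{-1} X^T; tr(H) = effective dof.
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
    rss = np.sum((y - H @ y) ** 2)
    # Constant factors do not change the argmin over lambda.
    return (rss / n) / (1 - np.trace(H) / n) ** 2

lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]
best_lam = min(lambdas, key=gcv)
```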
Scikit-learn provides several classes for Ridge regression in sklearn.linear_model:
```python
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=50, noise=10)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Ridge with a fixed alpha (lambda)
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)
print(f"R^2 score: {ridge.score(X_test, y_test):.4f}")

# RidgeCV with built-in cross-validation
ridge_cv = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0, 100.0])
ridge_cv.fit(X_train, y_train)
print(f"Best alpha: {ridge_cv.alpha_}")
print(f"R^2 score: {ridge_cv.score(X_test, y_test):.4f}")
```
Note that scikit-learn uses the parameter name alpha instead of lambda (since lambda is a reserved keyword in Python).
In PyTorch, L2 regularization is most commonly applied through the weight_decay parameter in optimizers:
```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(50, 128),
    nn.ReLU(),
    nn.Linear(128, 1)
)

# SGD with weight decay (equivalent to L2 regularization)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

# AdamW with decoupled weight decay (recommended over Adam + weight_decay)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
```
For fine-grained control, bias parameters can be excluded from weight decay:
```python
param_groups = [
    {"params": [p for n, p in model.named_parameters() if "bias" not in n],
     "weight_decay": 0.01},
    {"params": [p for n, p in model.named_parameters() if "bias" in n],
     "weight_decay": 0.0},
]
optimizer = torch.optim.AdamW(param_groups, lr=1e-3)
```
In TensorFlow and Keras, L2 regularization can be applied directly to individual layers using kernel regularizers:
```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu",
                          kernel_regularizer=tf.keras.regularizers.l2(1e-4)),
    tf.keras.layers.Dense(1)
])
model.compile(optimizer="adam", loss="mse")
```
Imagine you are building a tower out of blocks. You want your tower to be tall, but you also want it to be steady so it does not fall over. If you use really big, heavy blocks near the top, the tower might get tall quickly, but it will wobble and tip over easily.
L2 regularization is like a rule that says: "You can use whatever blocks you want, but bigger blocks cost more points." Since you want to save your points, you end up using medium-sized blocks instead of giant ones. Your tower might not be as tall, but it stands up much better. That is what L2 regularization does for a machine learning model: it keeps the weights (the building blocks of the model) from getting too big, so the model works well not just on the data it was trained on, but also on new data it has never seen before.