See also: machine learning terms, regularization, L1 regularization, elastic net, overfitting
L2 regularization is a widely used technique in machine learning and statistics that penalizes large weight values by adding the sum of squared parameters to the loss function. Also known as Ridge regression in the context of linear regression and Tikhonov regularization in applied mathematics, L2 regularization discourages complex models by shrinking parameter values toward zero without forcing them to become exactly zero. This shrinkage effect reduces overfitting and improves a model's ability to generalize to unseen data.
The core idea behind L2 regularization is straightforward: among all models that fit the training data reasonably well, prefer the one with smaller weights. Large weights tend to amplify noise in the training data, leading to predictions that fluctuate wildly with small changes in input. By penalizing the squared magnitude of each weight, L2 regularization steers the optimization process toward smoother, more stable solutions.
L2 regularization has a long history spanning multiple fields. In statistics, Arthur Hoerl and Robert Kennard introduced Ridge regression in 1970 as a method for handling multicollinearity in linear regression problems. Independently, the Soviet mathematician Andrey Tikhonov developed the same mathematical framework for solving ill-posed inverse problems. Today, L2 regularization is a standard component in virtually every area of machine learning, from classical regression models to deep neural networks.
The standard supervised learning objective minimizes a loss function that measures the discrepancy between predictions and true labels. L2 regularization augments this objective by adding a penalty term proportional to the squared L2 norm of the weight vector.
Given a loss function L(w) that depends on parameters w, the L2-regularized objective is:
J(w) = L(w) + (lambda / 2) * ||w||_2^2
where:
| Symbol | Meaning |
|---|---|
| J(w) | Total regularized objective |
| L(w) | Original loss (e.g., mean squared error, cross-entropy) |
| lambda | Regularization strength (hyperparameter, lambda >= 0) |
| \|\|w\|\|_2^2 | Squared L2 norm: the sum of squares of all weights, w_1^2 + w_2^2 + ... + w_n^2 |
The factor of 1/2 is a convention that simplifies the gradient computation, since the derivative of (lambda / 2) * w_i^2 with respect to w_i is simply lambda * w_i.
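This convention is easy to verify numerically. The sketch below (toy data and variable names are illustrative) builds the regularized objective and checks that a finite-difference gradient of the penalty term equals lambda * w:

```python
import numpy as np

# Toy setup: a small linear model with mean squared error plus the L2 penalty.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=20)
w = np.array([0.5, -1.0, 2.0])
lam = 0.1

# J(w) = L(w) + (lambda / 2) * ||w||_2^2
J = np.mean((X @ w - y) ** 2) + (lam / 2) * np.sum(w ** 2)

def penalty(v):
    return (lam / 2) * np.sum(v ** 2)

# Central-difference gradient of the penalty term alone.
eps = 1e-6
num_grad = np.array([(penalty(w + eps * e) - penalty(w - eps * e)) / (2 * eps)
                     for e in np.eye(3)])
assert np.allclose(num_grad, lam * w, atol=1e-6)
```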
For linear regression with the ordinary least squares loss, the Ridge regression objective takes the form:
J(beta) = ||y - X * beta||^2 + lambda * ||beta||^2
where X is the design matrix, y is the target vector, and beta is the coefficient vector. This problem has a closed-form solution:
beta_hat = (X^T * X + lambda * I)^(-1) * X^T * y
The addition of lambda * I to the matrix X^T * X guarantees that the matrix is invertible, even when the features are highly correlated (multicollinear) or when the number of features exceeds the number of observations. This property was the original motivation for Ridge regression in the 1970 paper by Hoerl and Kennard.
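The closed-form solution can be checked directly in NumPy (the synthetic data below is illustrative): at beta_hat, the gradient of the Ridge objective vanishes.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
beta_true = np.array([2.0, -1.0, 0.0, 3.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.5, size=100)
lam = 1.0

# beta_hat = (X^T X + lambda I)^(-1) X^T y, via a linear solve.
beta_hat = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)

# Gradient of ||y - X beta||^2 + lambda ||beta||^2 at beta_hat is zero.
grad = -2 * X.T @ (y - X @ beta_hat) + 2 * lam * beta_hat
assert np.allclose(grad, 0.0, atol=1e-8)
```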
In deep neural networks, the L2 penalty is typically applied to the weight matrices of each layer while excluding bias terms. For a network with L layers, the regularized loss becomes:
J = L(w) + (lambda / 2m) * sum over l from 1 to L of ||W^[l]||_F^2
where ||W^[l]||_F denotes the Frobenius norm of the weight matrix at layer l (computed by squaring every element and summing them), and m is the number of training examples. Bias terms are conventionally excluded from regularization because they do not contribute to the model's sensitivity to input features.
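A minimal numeric sketch of this penalty (the tiny weight matrices and constants below are made up):

```python
import numpy as np

# Two layers' weight matrices; biases are deliberately left out of the penalty.
W1 = np.array([[1.0, 2.0], [3.0, 4.0]])
W2 = np.array([[0.5], [-0.5]])
lam, m = 0.01, 100  # regularization strength and number of training examples

# Squared Frobenius norm = square every element and sum; then sum over layers.
penalty = (lam / (2 * m)) * (np.sum(W1 ** 2) + np.sum(W2 ** 2))
assert np.isclose(penalty, (0.01 / 200) * 30.5)
```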
The practical effect of L2 regularization on gradient descent optimization is direct and intuitive. When computing the gradient of the regularized objective with respect to a weight w_i, the penalty term contributes an additional gradient of lambda * w_i:
dJ/dw_i = dL/dw_i + lambda * w_i
The parameter update rule for standard stochastic gradient descent (SGD) then becomes:
w_i <- w_i - eta * (dL/dw_i + lambda * w_i)
which can be rewritten as:
w_i <- (1 - eta * lambda) * w_i - eta * dL/dw_i
The factor (1 - eta * lambda) is strictly less than 1, so at every update step the weight is first multiplied by a number slightly below 1, shrinking it toward zero, before the gradient correction from the data is applied. This multiplicative shrinkage is the origin of the term weight decay, and it explains why L2 regularization prevents weights from growing too large.
For standard SGD, L2 regularization and weight decay produce identical parameter updates. This equivalence holds because the gradient of the L2 penalty integrates cleanly into the SGD update rule.
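The identity is easy to confirm for a single SGD step (the numbers below are arbitrary):

```python
import numpy as np

w = np.array([0.8, -0.3])
grad_L = np.array([0.1, 0.05])   # dL/dw from the data
eta, lam = 0.1, 0.01

step_a = w - eta * (grad_L + lam * w)          # explicit L2 gradient
step_b = (1 - eta * lam) * w - eta * grad_L    # multiplicative weight decay
assert np.allclose(step_a, step_b)
```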
However, Loshchilov and Hutter demonstrated in their 2019 ICLR paper "Decoupled Weight Decay Regularization" that this equivalence breaks down for adaptive gradient optimizers such as Adam. In Adam, the gradient is divided by the running average of squared gradients before the update is applied. When L2 regularization is used with Adam, the regularization gradient lambda * w gets divided by this same second moment estimate, meaning that weights with historically large gradients receive less regularization than intended, and weights with small gradients receive more regularization than intended.
To restore proper weight decay behavior with Adam, Loshchilov and Hutter proposed AdamW, which decouples the weight decay from the gradient-based update:
| Optimizer | Update rule (simplified) | L2 regularization equivalent to weight decay? |
|---|---|---|
| SGD | w <- (1 - eta * lambda) * w - eta * grad | Yes |
| Adam (with L2 penalty) | w <- w - eta * (grad + lambda * w) / sqrt(v) | No (regularization is scaled by adaptive term) |
| AdamW (decoupled) | w <- (1 - eta * lambda) * w - eta * grad / sqrt(v) | Yes (by design) |
This distinction has significant practical consequences. Many deep learning practitioners use Adam-based optimizers, and using L2 regularization instead of proper decoupled weight decay can lead to suboptimal generalization. The PyTorch optimizer torch.optim.AdamW implements decoupled weight decay and is now the recommended choice over torch.optim.Adam with a weight_decay parameter.
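The mismatch can be seen in a single simplified Adam step, sketched here in NumPy under the assumption of a first step from zero moment estimates (so the bias-corrected moments reduce to the raw gradient and its square); this is a toy sketch, not a full optimizer:

```python
import numpy as np

w0 = np.array([1.0, -2.0])
grad = np.array([0.3, 0.3])
eta, lam, eps = 0.1, 0.1, 1e-8

def first_adam_step(w, g):
    # On the first step from m = v = 0, bias correction gives m_hat = g and
    # v_hat = g**2, so the update is approximately eta * sign(g).
    return w - eta * g / (np.sqrt(g ** 2) + eps)

# (a) Adam with an L2 penalty: lam * w is folded into the gradient and then
#     divided by the adaptive term, diluting the intended regularization.
w_l2 = first_adam_step(w0, grad + lam * w0)

# (b) AdamW-style decoupled decay: shrink the weights, then take the step.
w_adamw = first_adam_step((1 - eta * lam) * w0, grad)

# The two rules produce different parameters after just one step.
assert not np.allclose(w_l2, w_adamw)
```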
The geometric interpretation of L2 regularization provides clear intuition for why it shrinks weights without setting them to zero.
The constrained optimization form of Ridge regression is:
minimize ||y - X * beta||^2 subject to ||beta||^2 <= t
for some budget t. In a two-dimensional parameter space (two weights), the constraint ||beta||^2 <= t defines a circular region (or a hypersphere in higher dimensions) centered at the origin. The ordinary least squares solution lies at the center of the elliptical contours of the loss function. The regularized solution is the point where the smallest loss contour touches the circular constraint boundary.
Because the circle has a smooth boundary with no corners, the point of tangency almost always lies at a location where both weight coordinates are nonzero. This is the fundamental geometric reason why L2 regularization shrinks weights but does not produce exact zeros.
This geometry contrasts sharply with L1 regularization (Lasso), which uses a diamond-shaped (L1 ball) constraint region. The diamond has sharp corners aligned with the coordinate axes. The loss contours are more likely to first touch the constraint boundary at one of these corners, which corresponds to one or more weights being exactly zero. This is why L1 regularization performs feature selection (by zeroing out irrelevant features), while L2 regularization does not.
L1 and L2 regularization are the two most common forms of regularization in machine learning. They differ in their penalty terms, their effects on model parameters, and their practical use cases.
| Property | L2 regularization (Ridge) | L1 regularization (Lasso) |
|---|---|---|
| Penalty term | lambda * sum(w_i^2) | lambda * sum(\|w_i\|) |
| Constraint shape | Circle (hypersphere) | Diamond (cross-polytope) |
| Sparsity | Does not produce exact zeros | Drives many weights to exactly zero |
| Feature selection | No (all features retained) | Yes (irrelevant features removed) |
| Multicollinearity handling | Distributes weight among correlated features | Tends to pick one feature from a correlated group |
| Closed-form solution | Yes (for linear models) | No (requires iterative algorithms) |
| Differentiability | Smooth everywhere | Not differentiable at zero |
| Bayesian interpretation | Gaussian prior on weights | Laplace prior on weights |
| Stability with correlated features | Stable; spreads weight evenly | Unstable; solution can vary with small data changes |
When correlated features are present, Ridge regression distributes the weight evenly among them, while Lasso tends to select one and zero out the rest. This behavior makes Ridge regression more stable and predictable in high-multicollinearity settings. On the other hand, when true sparsity exists (only a few features are genuinely relevant), Lasso can produce more interpretable models.
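The sparsity contrast is easy to observe with scikit-learn (the synthetic problem below, with 3 relevant features out of 20, is illustrative, and the alpha values are arbitrary):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Sparse ground truth: only the first 3 of 20 features matter.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 20))
beta = np.zeros(20)
beta[:3] = [3.0, -2.0, 1.5]
y = X @ beta + rng.normal(scale=0.5, size=200)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

# Lasso zeroes out many irrelevant coefficients; Ridge only shrinks them.
n_zero_ridge = int(np.sum(np.abs(ridge.coef_) < 1e-8))
n_zero_lasso = int(np.sum(np.abs(lasso.coef_) < 1e-8))
assert n_zero_lasso > n_zero_ridge
```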
The elastic net, introduced by Zou and Hastie in 2005, combines L1 and L2 penalties to capture the benefits of both:
J(w) = L(w) + lambda_1 * sum(|w_i|) + lambda_2 * sum(w_i^2)
An equivalent parameterization uses a mixing ratio alpha in [0, 1], where alpha = 1 gives pure Lasso and alpha = 0 gives pure Ridge:
J(w) = L(w) + lambda * [alpha * sum(|w_i|) + (1 - alpha) * sum(w_i^2)]
The L1 component provides sparsity (feature selection), while the L2 component stabilizes the solution when features are correlated. Elastic net tends to select groups of correlated features together rather than picking just one, a behavior known as the grouping effect. This makes elastic net particularly useful for datasets with many correlated predictors or when the number of features exceeds the number of observations.
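A brief sketch with scikit-learn's ElasticNet, which parameterizes the mix as alpha for overall strength and l1_ratio for the L1 share (the near-duplicate feature below is constructed specifically to exhibit the grouping effect):

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 10))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=200)   # near-duplicate feature
y = X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=200)

# l1_ratio=1 would be pure Lasso, l1_ratio=0 pure Ridge.
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)

# Grouping effect: both correlated features keep substantial nonzero weight,
# where pure Lasso would tend to keep one and discard the other.
assert abs(enet.coef_[0]) > 0.1 and abs(enet.coef_[1]) > 0.1
```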
From a Bayesian perspective, L2 regularization corresponds to placing a Gaussian (normal) prior on the model weights. Specifically, if the weights are assumed to follow a zero-mean Gaussian distribution with variance proportional to 1/lambda:
P(w) proportional to exp(-lambda / 2 * ||w||^2)
then the maximum a posteriori (MAP) estimate under a Gaussian likelihood is equivalent to the L2-regularized solution. The prior encodes the belief that weights should be small and centered around zero, but not necessarily exactly zero.
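The equivalence can be checked numerically: with Gaussian noise variance sigma^2 and a zero-mean Gaussian prior with variance tau^2 on each weight, the posterior mean (which is also the MAP estimate for a Gaussian posterior) coincides with the Ridge solution at lambda = sigma^2 / tau^2. The data below is synthetic.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 4))
y = X @ np.array([1.0, 0.5, -1.0, 2.0]) + rng.normal(scale=0.3, size=50)
sigma2, tau2 = 0.3 ** 2, 1.0

# Ridge solution with lambda = sigma^2 / tau^2.
lam = sigma2 / tau2
ridge = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)

# Bayesian posterior mean under the Gaussian likelihood and Gaussian prior.
map_est = np.linalg.solve(X.T @ X / sigma2 + np.eye(4) / tau2, X.T @ y / sigma2)

assert np.allclose(ridge, map_est)
```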
This Bayesian interpretation offers several insights: the regularization strength lambda is inversely related to the prior variance, so a stronger belief that weights are small corresponds to heavier regularization; tuning lambda amounts to choosing how much to trust the data relative to the prior; and the regularized solution is the mode of a posterior distribution rather than an ad hoc adjustment to the loss.
By contrast, L1 regularization corresponds to a Laplace prior on the weights, which has heavier tails and a sharp peak at zero, explaining why Lasso drives weights to exact zeros.
The statistical formulation of L2 regularization originated with Arthur E. Hoerl and Robert W. Kennard, who published two seminal papers in Technometrics in 1970. The first paper, "Ridge Regression: Biased Estimation for Nonorthogonal Problems," showed that adding a small positive constant to the diagonal of the X^T * X matrix could produce estimates with lower mean squared error than ordinary least squares, despite introducing bias.
This result was counterintuitive at the time. The prevailing statistical theory favored unbiased estimators, and the Gauss-Markov theorem established that OLS was the best linear unbiased estimator. Hoerl and Kennard demonstrated that a biased estimator (Ridge) could achieve better overall accuracy by trading a small amount of bias for a large reduction in variance. This bias-variance tradeoff is now a foundational concept in machine learning.
The name "Ridge" comes from ridge analysis, a technique Hoerl used in industrial chemistry for exploring response surfaces. Independently, the mathematician Andrey Tikhonov had developed the same regularization approach for solving ill-posed integral equations, and the technique is widely known in applied mathematics and numerical analysis as Tikhonov regularization or Tikhonov-Phillips regularization (David L. Phillips contributed independently as well).
In deep learning, L2 regularization (usually implemented as weight decay) is one of several tools for controlling the capacity of neural networks. It is often combined with other regularization strategies:
| Technique | How it works | Relationship to L2 |
|---|---|---|
| Dropout | Randomly zeroes out activations during training | Complementary; often used together |
| Batch normalization | Normalizes layer inputs; has an implicit regularization effect | Can reduce the need for explicit L2 |
| Early stopping | Stops training when validation error begins to rise | Has a similar shrinkage effect to L2 on the weight trajectory |
| Data augmentation | Expands the effective training set | Complementary; addresses overfitting from a different angle |
| Label smoothing | Softens hard labels to prevent overconfident predictions | Orthogonal to L2; targets output confidence |
A noteworthy theoretical connection exists between L2 regularization and early stopping. For linear models trained with gradient descent, stopping the optimization early has an effect equivalent to L2 regularization: the effective regularization strength is inversely proportional to the number of training iterations. This connection extends approximately to neural networks, where early stopping acts as an implicit form of weight decay.
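A small NumPy experiment illustrates the connection (the step size and iteration count below are arbitrary): gradient descent on least squares, started from zero and stopped after a few iterations, leaves the weights with a smaller norm than the fully converged OLS solution.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.2, size=100)

# Step size below 2 / lambda_max(X^T X) guarantees stable descent.
eta = 1.0 / np.linalg.norm(X.T @ X, 2)
w = np.zeros(5)
for _ in range(10):                       # deliberately stop early
    w -= eta * (X.T @ (X @ w - y))

w_ols = np.linalg.lstsq(X, y, rcond=None)[0]

# Early stopping leaves the iterate shrunk toward the origin, like L2.
assert np.linalg.norm(w) < np.linalg.norm(w_ols)
```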
In modern deep learning practice, the weight decay coefficient is typically set to a small value such as 1e-4 or 1e-2. The optimal value depends on the model architecture, dataset size, and other regularization techniques in use. Transformers and other large models often use weight decay values of 0.01 or 0.1 as part of their training recipe.
Selecting the right value for the regularization hyperparameter lambda is critical. Too large a value leads to underfitting (excessive bias, weights shrunk too aggressively), while too small a value provides insufficient regularization and allows overfitting.
The standard approach for choosing lambda is k-fold cross-validation. The training data is split into k folds, and for each candidate lambda value the model is trained on k-1 folds and evaluated on the remaining fold. The lambda that yields the best average validation performance is selected. A typical workflow involves defining a grid of candidate values spaced on a logarithmic scale (for example, from 1e-4 to 1e2), evaluating each candidate with cross-validation, and retraining on the full training set with the selected value.
Scikit-learn provides RidgeCV, which performs efficient leave-one-out cross-validation for Ridge regression using a computational shortcut that avoids retraining the model for each held-out sample.
For Ridge regression specifically, generalized cross-validation (GCV) provides an efficient approximation to leave-one-out cross-validation. GCV minimizes the function:
GCV(lambda) = (RSS(lambda) / n) / (1 - tr(H(lambda)) / n)^2
where H(lambda) is the hat matrix and tr(H) represents the effective degrees of freedom. This approach requires only a single matrix computation rather than n separate model fits.
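A direct, if naive, implementation of this criterion for Ridge regression (forming the full hat matrix, which is fine at small scale but not the efficient shortcut a library would use; the data and candidate grid below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(80, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.5, size=80)
n, p = X.shape

def gcv(lam):
    # Hat matrix H = X (X^T X + lam I)^{-1} X^T; tr(H) = effective dof.
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
    rss = np.sum((y - H @ y) ** 2)
    # Constant factors do not change the argmin over lambda.
    return (rss / n) / (1 - np.trace(H) / n) ** 2

lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]
best_lam = min(lambdas, key=gcv)
```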
Scikit-learn provides several classes for Ridge regression in sklearn.linear_model:
```python
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=50, noise=10)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Ridge with a fixed alpha (lambda)
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)
print(f"R^2 score: {ridge.score(X_test, y_test):.4f}")

# RidgeCV with built-in cross-validation
ridge_cv = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0, 100.0])
ridge_cv.fit(X_train, y_train)
print(f"Best alpha: {ridge_cv.alpha_}")
print(f"R^2 score: {ridge_cv.score(X_test, y_test):.4f}")
```
Note that scikit-learn uses the parameter name alpha instead of lambda (since lambda is a reserved keyword in Python).
In PyTorch, L2 regularization is most commonly applied through the weight_decay parameter in optimizers:
```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(50, 128),
    nn.ReLU(),
    nn.Linear(128, 1)
)

# SGD with weight decay (equivalent to L2 regularization)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

# AdamW with decoupled weight decay (recommended over Adam + weight_decay)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
```
For fine-grained control, bias parameters can be excluded from weight decay:
```python
param_groups = [
    {"params": [p for n, p in model.named_parameters() if "bias" not in n],
     "weight_decay": 0.01},
    {"params": [p for n, p in model.named_parameters() if "bias" in n],
     "weight_decay": 0.0},
]
optimizer = torch.optim.AdamW(param_groups, lr=1e-3)
```
In TensorFlow and Keras, L2 regularization can be applied directly to individual layers using kernel regularizers:
```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu",
                          kernel_regularizer=tf.keras.regularizers.l2(1e-4)),
    tf.keras.layers.Dense(1)
])
model.compile(optimizer="adam", loss="mse")
```
Imagine you are building a tower out of blocks. You want your tower to be tall, but you also want it to be steady so it does not fall over. If you use really big, heavy blocks near the top, the tower might get tall quickly, but it will wobble and tip over easily.
L2 regularization is like a rule that says: "You can use whatever blocks you want, but bigger blocks cost more points." Since you want to save your points, you end up using medium-sized blocks instead of giant ones. Your tower might not be as tall, but it stands up much better. That is what L2 regularization does for a machine learning model: it keeps the weights (the building blocks of the model) from getting too big, so the model works well not just on the data it was trained on, but also on new data it has never seen before.