See also: Machine learning terms
Regularization is a set of techniques used in machine learning to prevent overfitting, which occurs when a model learns to perform well on the training data but does not generalize well to unseen data. Regularization works by adding constraints or penalties during training that discourage the model from becoming overly complex. The core idea is that simpler models, or models with smaller parameter values, tend to generalize better to new data.
The principle behind regularization traces back to Occam's razor: given two explanations that fit the observed data equally well, the simpler one is more likely to be correct. In statistical learning, a model that fits the training data perfectly may have captured noise and idiosyncratic patterns rather than the true underlying relationship. Regularization operationalizes Occam's razor by adding a cost for complexity, biasing the learning process toward simpler hypotheses that are more likely to hold on unseen examples.
In mathematical terms, most regularization methods modify the loss function by adding a penalty term:
Total Loss = Original Loss + lambda * Regularization Penalty
The hyperparameter lambda (sometimes written as alpha) controls the strength of regularization. A larger value imposes a stronger penalty, pushing the model toward simpler solutions. A value of zero recovers the unregularized objective.
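To make the formula concrete, here is a minimal Python sketch using an L2 penalty as the illustrative choice; the function and variable names are ours, not from any particular library:

```python
import numpy as np

def total_loss(y_true, y_pred, weights, lam):
    """Original loss (mean squared error here) plus a regularization penalty scaled by lambda."""
    original = np.mean((y_true - y_pred) ** 2)   # original loss
    penalty = np.sum(weights ** 2)               # L2 penalty: sum of squared weights
    return original + lam * penalty              # lam = 0 recovers the unregularized objective

# Example: the same predictions cost more as lambda grows.
w = np.array([0.5, -2.0, 3.0])
print(total_loss(np.array([1.0, 2.0]), np.array([1.1, 1.8]), w, lam=0.0))
print(total_loss(np.array([1.0, 2.0]), np.array([1.1, 1.8]), w, lam=0.1))
```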
Regularization techniques span a wide range of approaches. Some directly penalize model weights, while others impose implicit constraints through training procedures or data manipulation.
| Technique | Category | How it works | Typical use case |
|---|---|---|---|
| L1 (Lasso) | Weight penalty | Adds sum of absolute weight values to loss | Feature selection; sparse models |
| L2 (Ridge) | Weight penalty | Adds sum of squared weight values to loss | General-purpose weight shrinkage |
| Elastic Net | Weight penalty | Combines L1 and L2 penalties | Correlated features with desired sparsity |
| Dropout | Training procedure | Randomly deactivates neurons during training | Fully connected and recurrent layers in deep networks |
| Batch normalization | Training procedure | Normalizes layer inputs across the mini-batch | Deep networks; stabilizes and mildly regularizes |
| Early stopping | Training procedure | Halts training when validation loss stops improving | Any iterative training process |
| Data augmentation | Data-based | Applies transformations to create additional training examples | Computer vision, NLP, audio |
| Weight decay | Weight penalty | Directly shrinks weights at each optimizer step | Standard in AdamW and SGD |
| Label smoothing | Output-based | Replaces hard 0/1 targets with soft targets (e.g., 0.1/0.9) | Classification with overconfident predictions |
| Spectral normalization | Weight constraint | Constrains the spectral norm of weight matrices to be at most 1 | GANs, stability-sensitive architectures |
| Stochastic depth | Training procedure | Randomly skips entire residual blocks during training | Deep ResNets, vision transformers |
| Mixup | Data-based | Linearly interpolates pairs of training examples and their labels | Image classification, semi-supervised learning |
| CutMix | Data-based | Cuts and pastes patches between training images, mixing labels proportionally | Image classification, object detection |
| Noise injection | Training procedure | Adds random noise to inputs, weights, or gradients | Small datasets; recurrent networks |
L1 regularization, also known as Lasso regularization (Least Absolute Shrinkage and Selection Operator), adds the sum of the absolute values of the model weights to the objective function.
The L1-regularized loss is:
L_total = L_original + lambda * sum(|w_i|)
where w_i represents each weight in the model and lambda is the regularization strength.
The key property of L1 regularization is that it promotes sparsity. Because the absolute value function has a sharp corner at zero, the optimization process tends to drive many weights to exactly zero. This effectively performs automatic feature selection: features whose weights become zero are removed from the model entirely. L1 regularization is particularly useful when dealing with high-dimensional data where only a subset of features is expected to be relevant.
In practice, L1 regularization produces models that are easier to interpret because only a small number of features have nonzero weights. However, when features are highly correlated, L1 tends to arbitrarily select one from a group of correlated features and set the rest to zero, which can be unstable.
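A small scikit-learn sketch illustrates the sparsity effect; the synthetic data and the alpha value are illustrative choices, not prescriptions:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))                        # 50 features, most of them irrelevant
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

model = Lasso(alpha=0.1).fit(X, y)                    # alpha plays the role of lambda
print(int(np.sum(model.coef_ != 0)))                  # typically only a few nonzero weights survive
```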
From a Bayesian perspective, L1 regularization is equivalent to placing a Laplace prior on the model parameters. The Laplace distribution has a sharp peak at zero, so the corresponding objective is non-differentiable at zero and tends to pull small weights exactly to zero, which explains why L1 optimization produces sparse solutions.
Tibshirani (1996) introduced Lasso in the context of linear regression, and it has since been widely adopted across many model types.
L2 regularization, also known as Ridge regularization, adds the sum of the squared values of the model weights to the objective function.
The L2-regularized loss is:
L_total = L_original + lambda * sum(w_i^2)
Unlike L1, L2 regularization does not drive weights to exactly zero. Instead, it shrinks all weights toward zero proportionally. Weights that are already small get pushed closer to zero, while large weights get penalized more heavily. The result is a smoother, less complex model where no single feature dominates the prediction.
L2 regularization is particularly effective at handling multicollinearity, a situation where predictor variables are highly correlated. When features are correlated, ordinary least squares regression produces unstable weight estimates with high variance. L2 regularization stabilizes these estimates by constraining the weight magnitudes.
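The stabilizing effect on correlated features can be seen in a short scikit-learn sketch (the data and the alpha value are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.01, size=200)            # nearly identical to x1 (multicollinearity)
X = np.column_stack([x1, x2])
y = x1 + rng.normal(scale=0.1, size=200)

print(LinearRegression().fit(X, y).coef_)             # unregularized: often large, offsetting weights
print(Ridge(alpha=1.0).fit(X, y).coef_)               # ridge: small, stable weights that share the signal
```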
From a Bayesian standpoint, L2 regularization corresponds to placing a Gaussian (normal) prior centered at zero on the model parameters. A Gaussian prior whose variance is inversely proportional to lambda assigns higher probability to smaller weight values, favoring models with weights clustered near zero without enforcing exact sparsity.
Ridge regression was introduced by Hoerl and Kennard (1970) and remains one of the most widely used regularization methods in both classical statistics and modern deep learning.
Elastic Net regularization combines L1 and L2 regularization. It includes both penalty terms, weighted by a mixing parameter:
L_total = L_original + lambda_1 * sum(|w_i|) + lambda_2 * sum(w_i^2)
Alternatively, this can be expressed with a single regularization strength lambda and a mixing ratio alpha where alpha = 0 gives pure L2 and alpha = 1 gives pure L1:
L_total = L_original + lambda * [alpha * sum(|w_i|) + (1 - alpha) * sum(w_i^2)]
Elastic Net retains the sparsity-inducing property of L1 while also inheriting L2's ability to handle correlated features gracefully. When multiple features are correlated, Elastic Net tends to include or exclude them as a group rather than arbitrarily selecting one, which makes it more stable than pure L1.
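In scikit-learn the mixing ratio is exposed as l1_ratio; the following sketch uses illustrative values for both hyperparameters:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
X[:, 1] = X[:, 0] + rng.normal(scale=0.01, size=200)   # a pair of highly correlated features
y = X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=200)

# scikit-learn's alpha is the overall strength (lambda above); l1_ratio is the mixing ratio (alpha above)
model = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(model.coef_[:2])                                  # the correlated pair tends to be kept together
```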
Zou and Hastie (2005) proposed Elastic Net and showed that it outperforms Lasso in situations with correlated predictors or when the number of predictors exceeds the number of observations.
Dropout is a regularization technique designed specifically for neural networks. During each training step, dropout randomly sets a fraction of neuron activations to zero. The fraction is controlled by a dropout rate (commonly between 0.2 and 0.5). At test time, all neurons are active; in the original formulation their outputs are scaled by the keep probability (1 minus the dropout rate) to compensate, while most modern implementations use "inverted dropout," which instead rescales the retained activations by 1 / (1 - rate) during training so that no adjustment is needed at test time.
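A minimal NumPy sketch of the inverted-dropout variant described above (the function is a standalone illustration, not code from any framework):

```python
import numpy as np

def dropout(activations, rate, training=True, seed=None):
    """Inverted dropout: zero each unit with probability `rate` and rescale the survivors."""
    if not training or rate == 0.0:
        return activations                             # identity at test time
    rng = np.random.default_rng(seed)
    mask = rng.random(activations.shape) >= rate       # keep each unit with probability 1 - rate
    return activations * mask / (1.0 - rate)           # rescale so the expected activation is unchanged

h = np.ones((2, 6))
print(dropout(h, rate=0.5, seed=0))                    # roughly half the entries are 0, the rest are 2.0
```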
Dropout prevents co-adaptation, a situation where neurons learn to depend on specific other neurons. When any neuron might be absent on a given training step, each neuron must learn features that are useful on their own, not just in the context of specific partner neurons. This produces a more robust internal representation.
Another interpretation of dropout is that it approximates training an ensemble of many different networks. Each training step uses a different "thinned" network (a random subset of the full network), and the final prediction at test time averages over all these implicit sub-networks.
Srivastava et al. (2014) published the foundational paper on dropout, showing that it significantly reduces overfitting across a wide range of tasks including image classification, speech recognition, and text classification.
Batch normalization, introduced by Ioffe and Szegedy (2015), normalizes the inputs to each layer by subtracting the mini-batch mean and dividing by the mini-batch standard deviation. Learnable scale and shift parameters are then applied.
While batch normalization was originally designed to address internal covariate shift and speed up training, it also has a regularizing effect. The normalization uses statistics computed from the current mini-batch, which introduces noise into the computation (since each mini-batch is a random sample). This noise acts as a mild regularizer, similar in spirit to dropout. In practice, batch normalization often reduces the need for dropout, and some architectures use one or the other but not both.
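A simplified forward pass for the training-time computation (mini-batch statistics only; the running averages used at test time are omitted):

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Normalize each feature across the mini-batch, then apply learnable scale and shift."""
    mean = x.mean(axis=0)                              # statistics depend on the sampled mini-batch,
    var = x.var(axis=0)                                # which is the source of the regularizing noise
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.random.default_rng(0).normal(size=(32, 16))     # batch of 32 examples, 16 features
out = batch_norm_train(x, gamma=np.ones(16), beta=np.zeros(16))
print(out.mean(axis=0).round(6), out.std(axis=0).round(3))   # ~0 means, ~1 standard deviations
```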
Early stopping monitors the model's performance on a validation set during training and halts the process when validation performance stops improving. A "patience" parameter specifies how many consecutive epochs of no improvement to tolerate before stopping.
Early stopping is an implicit form of regularization. As training progresses, the model's weights move further from their initial values. Stopping early constrains how far the weights can move, which limits the effective complexity of the model. Bishop (1995) and others have shown that early stopping is mathematically related to L2 regularization in certain settings; both constrain the weight space, just through different mechanisms.
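The logic reduces to a counter over validation results; the sketch below drives it with a made-up sequence of validation losses standing in for a real training loop:

```python
# Hypothetical per-epoch validation losses standing in for real evaluation calls.
val_losses = [0.90, 0.72, 0.61, 0.58, 0.59, 0.60, 0.58, 0.61, 0.62, 0.63]

best, patience, bad_epochs = float("inf"), 3, 0
for epoch, val_loss in enumerate(val_losses):
    if val_loss < best:                 # improvement: record it and reset the counter
        best, bad_epochs = val_loss, 0  # (a real loop would also checkpoint the weights here)
    else:
        bad_epochs += 1                 # no improvement this epoch
        if bad_epochs >= patience:
            print(f"stopping at epoch {epoch} with best validation loss {best}")
            break
```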
Data augmentation expands the effective size of the training set by creating modified copies of existing training examples. In computer vision, common augmentations include random rotations, flips, crops, color jitter, and scaling. In NLP, augmentations include synonym replacement, back-translation, and random insertion or deletion of words.
By presenting the model with more varied versions of the same data, augmentation reduces the model's ability to memorize specific training examples. The model must learn features that are invariant to the applied transformations, which improves generalization.
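A typical image augmentation pipeline, sketched with torchvision (assuming it is installed; the specific transforms and parameter values are illustrative):

```python
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),          # random crop, rescaled to 224x224
    transforms.RandomHorizontalFlip(),          # 50% chance of a left-right flip
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    transforms.ToTensor(),
])
# Each epoch sees a different randomized variant of every training image.
```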
Weight decay directly shrinks the model weights at each optimization step by multiplying them by a factor slightly less than 1 (e.g., 0.999). In the standard SGD optimizer, weight decay is mathematically equivalent to L2 regularization. However, for adaptive optimizers like Adam, weight decay and L2 regularization are not the same.
The distinction arises because Adam maintains per-parameter adaptive learning rates based on the first and second moments of past gradients. When L2 regularization is used with Adam, the gradient of the L2 penalty gets scaled by these adaptive factors, which means different parameters receive different effective regularization strengths depending on their gradient history. Parameters with large historical gradients receive weaker effective regularization, while rarely updated parameters receive stronger regularization. This coupling was not intentional and often hurts performance.
Loshchilov and Hutter (2019) identified this problem and proposed AdamW, which decouples weight decay from the gradient update. In AdamW, weight decay is applied directly to the weights after the Adam update step, rather than being added to the gradient before it. This ensures that all parameters are regularized uniformly regardless of their gradient history. Their experiments showed that decoupled weight decay substantially improves generalization and makes the optimal weight decay factor more independent of the learning rate, simplifying hyperparameter tuning.
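In PyTorch, the weight_decay argument of torch.optim.AdamW applies this decoupled decay; the model and values below are placeholders:

```python
import torch

model = torch.nn.Linear(10, 1)    # stand-in for a real network

# Decoupled weight decay (AdamW): decay is applied to the weights themselves,
# not folded into the gradient that Adam rescales per parameter.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)

# By contrast, torch.optim.Adam with weight_decay adds an L2 term to the gradient,
# so the effective regularization is coupled to each parameter's gradient history.
```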
Label smoothing replaces the hard target labels (0 or 1 in classification) with softened versions. For example, instead of a target of 1.0 for the correct class and 0.0 for all others, label smoothing might use 0.9 for the correct class and distribute the remaining 0.1 uniformly across the other classes.
This prevents the model from becoming overconfident in its predictions. Without label smoothing, the model is incentivized to push its output probabilities toward 0 and 1, which requires very large weight magnitudes and makes the model brittle. Szegedy et al. (2016) introduced label smoothing in their work on the Inception architecture and showed that it improved both calibration and generalization.
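A small sketch of constructing smoothed targets; this variant spreads the smoothing mass over the incorrect classes only (other variants spread it over all classes):

```python
import numpy as np

def smooth_labels(true_class, num_classes, eps=0.1):
    """Soft target: 1 - eps on the true class, eps split evenly over the other classes."""
    target = np.full(num_classes, eps / (num_classes - 1))
    target[true_class] = 1.0 - eps
    return target

print(smooth_labels(true_class=2, num_classes=5))   # [0.025 0.025 0.9 0.025 0.025]
```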
Spectral normalization constrains the spectral norm (the largest singular value) of each weight matrix to be at most 1. This bounds the Lipschitz constant of each layer, limiting how much the output can change in response to small input perturbations.
Miyato et al. (2018) proposed spectral normalization for training generative adversarial networks (GANs), where training stability is a major concern. By controlling the discriminator's Lipschitz constant, spectral normalization prevents the discriminator from producing overly sharp gradients that destabilize training. It has since been applied to other architectures where controlling the model's sensitivity to input perturbations is desirable.
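Conceptually, the operation is a division by the largest singular value; the sketch below uses a full SVD for clarity, whereas practical implementations (as in Miyato et al.) estimate that value with a few power-iteration steps:

```python
import numpy as np

def spectrally_normalize(W):
    """Divide a weight matrix by its largest singular value so its spectral norm is at most 1."""
    sigma_max = np.linalg.svd(W, compute_uv=False)[0]   # largest singular value
    return W / max(sigma_max, 1e-12)

W = np.random.default_rng(0).normal(size=(64, 32))
print(np.linalg.svd(spectrally_normalize(W), compute_uv=False)[0])   # ~1.0
```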
Noise injection adds random perturbations to inputs, weights, or intermediate activations during training. The most common variant adds Gaussian noise with zero mean and a fixed or decaying variance. This forces the model to learn representations that are robust to small perturbations rather than relying on precise input values.
Adding noise to the inputs is one of the oldest regularization techniques and can be shown to be approximately equivalent to a form of Tikhonov (L2) regularization under certain conditions. Specifically, injecting Gaussian noise with variance sigma^2 into the inputs of a linear model is equivalent to adding a penalty proportional to sigma^2 times the squared norm of the weights (Bishop, 1995).
Noise can also be injected directly into the weights during training. Weight noise encourages the network to find broad minima in the loss landscape rather than sharp ones, because solutions that are sensitive to small weight perturbations will perform poorly when noise is added. Graves et al. (2013) demonstrated the effectiveness of weight noise for training long short-term memory (LSTM) networks.
Annealing the noise variance over the course of training (starting with a larger variance and gradually reducing it) often works better than maintaining a fixed variance, as it allows the model to explore broadly early in training and then refine its parameters later.
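A sketch of input-noise injection with a linearly decaying standard deviation; the schedule and sigma values are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def add_input_noise(x, epoch, total_epochs, sigma_start=0.5, sigma_end=0.05):
    """Gaussian input noise whose standard deviation anneals linearly over training."""
    frac = epoch / max(total_epochs - 1, 1)
    sigma = sigma_start + frac * (sigma_end - sigma_start)   # large early, small late
    return x + rng.normal(scale=sigma, size=x.shape)

x = rng.normal(size=(32, 16))                                # a mini-batch of inputs
noisy_early = add_input_noise(x, epoch=0, total_epochs=100)  # heavily perturbed
noisy_late = add_input_noise(x, epoch=99, total_epochs=100)  # only lightly perturbed
```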
The following table summarizes the mathematical formulations of the primary weight-based regularization methods:
| Method | Penalty term | Effect on weights | Sparsity |
|---|---|---|---|
| L1 (Lasso) | lambda * sum(abs(w_i)) | Drives many weights to exactly zero | Yes |
| L2 (Ridge) | lambda * sum(w_i^2) | Shrinks all weights toward zero | No |
| Elastic Net | lambda * [alpha * sum(abs(w_i)) + (1-alpha) * sum(w_i^2)] | Combines sparsity with shrinkage | Partial |
| Weight decay | Multiply weights by (1 - lambda) each step | Shrinks all weights toward zero | No |
| Spectral norm | Constrain largest singular value of W to 1 | Bounds layer's Lipschitz constant | No |
Regularization has a natural interpretation in Bayesian statistics. In the Bayesian framework, model parameters are treated as random variables with a prior distribution that encodes beliefs about the parameters before observing any data. Learning is then framed as computing the posterior distribution over parameters given the observed data, using Bayes' theorem.
Maximum a posteriori (MAP) estimation finds the parameter values that maximize the posterior probability. Taking the negative log of the posterior and minimizing it yields an objective that consists of two terms: the negative log-likelihood (which corresponds to the original loss function) and the negative log-prior (which acts as a regularization penalty). The specific form of the prior determines the type of regularization:
| Prior distribution | Corresponding regularization | Effect |
|---|---|---|
| Gaussian (mean 0, variance proportional to 1/lambda) | L2 regularization | Shrinks weights toward zero |
| Laplace (location 0, scale proportional to 1/lambda) | L1 regularization | Promotes exact sparsity |
| Spike-and-slab | L0-type regularization | Selects a subset of nonzero weights |
| Horseshoe | Adaptive shrinkage | Heavy shrinkage on small weights, mild on large |
This connection means that choosing a regularization method is, from a Bayesian standpoint, equivalent to choosing a prior over model parameters. A practitioner who uses L2 regularization is implicitly assuming that the model's weights are drawn from a Gaussian distribution centered at zero. One who uses L1 regularization is assuming a Laplace distribution, which has heavier tails and a sharper peak at zero.
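To make the correspondence explicit (up to additive constants that do not affect the optimum), MAP estimation with a Gaussian prior p(w_i) proportional to exp(-lambda * w_i^2) minimizes:

-log P(w | data) = -log P(data | w) - log P(w) + constant = L_original + lambda * sum(w_i^2) + constant

which is exactly the L2-regularized objective. Substituting a Laplace prior proportional to exp(-lambda * |w_i|) yields the L1 penalty in the same way.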
The Bayesian perspective also provides a principled way to set the regularization strength. Rather than treating lambda as a hyperparameter to be tuned by cross-validation, Bayesian methods can place a hyperprior on lambda and infer its value from the data. This approach, known as empirical Bayes or type-II maximum likelihood, automatically determines how much regularization is appropriate.
The bias-variance tradeoff provides the theoretical foundation for understanding why regularization works. A model's expected prediction error on unseen data can be decomposed into three components: bias (squared), variance, and irreducible noise.
Bias measures how far the model's average prediction is from the true value. Models that are too simple (high regularization) tend to have high bias because they cannot capture the true complexity of the data. Variance measures how much the model's predictions fluctuate across different training sets. Complex models (low regularization) tend to have high variance because they fit the noise in each training set differently.
Regularization increases bias by constraining the model's flexibility, but it reduces variance by preventing the model from fitting idiosyncratic patterns in the training data. The net effect is often a decrease in total prediction error, because the reduction in variance more than compensates for the increase in bias.
This tradeoff is directly controlled by the regularization strength. As lambda increases from zero, bias rises steadily while variance falls, so the error on unseen data typically decreases at first and then climbs again once the model becomes too constrained to capture the underlying signal.
The optimal lambda sits at the point where total error is minimized, balancing bias and variance. This is why regularization strength must be carefully tuned rather than simply set to a large value.
The regularization hyperparameter lambda must be selected carefully. Too small a value provides insufficient regularization and allows overfitting. Too large a value over-constrains the model, causing underfitting. Several strategies exist for finding a good value.
K-fold cross-validation is the most common approach. The training data is divided into K folds (typically 5 or 10). For each candidate lambda value, the model is trained on K-1 folds and evaluated on the held-out fold, rotating through all folds. The lambda that produces the best average validation performance is selected. This approach is computationally expensive because it requires training K models for each lambda value, but it provides a reliable estimate of out-of-sample performance.
Grid search evaluates a predefined set of lambda values, typically spaced on a logarithmic scale (e.g., 1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1). This works well when the rough range of good values is known in advance.
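A compact sketch combining K-fold cross-validation with a logarithmic grid, using scikit-learn (the dataset and grid are illustrative):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=200, n_features=30, noise=5.0, random_state=0)

grid = {"alpha": np.logspace(-5, 0, 6)}        # 1e-5, 1e-4, ..., 1; alpha is lambda in scikit-learn
search = GridSearchCV(Ridge(), grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X, y)                               # 5-fold cross-validation for every candidate value
print(search.best_params_)
```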
Random search samples lambda values randomly from a specified distribution. Bergstra and Bengio (2012) showed that random search is more efficient than grid search when some hyperparameters matter more than others, because it explores a wider range of values for each dimension.
Bayesian optimization uses a probabilistic model of the objective function to select the next lambda value to evaluate. Tools like Optuna, Hyperopt, and Weights & Biases Sweeps implement this approach, which is particularly useful when each training run is expensive.
For L1 and L2 regularization in linear models, efficient algorithms exist that compute the entire solution path across all lambda values at roughly the cost of a single fit. The LARS algorithm (Efron et al., 2004) does this for Lasso, and the solution path for Ridge regression can be computed in closed form.
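scikit-learn exposes such path computations directly; the sketch below uses lasso_path, which computes coefficients for a whole sequence of regularization strengths in one call (via coordinate descent rather than LARS, but with the same practical benefit):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import lasso_path

X, y = make_regression(n_samples=100, n_features=20, noise=1.0, random_state=0)

alphas, coefs, _ = lasso_path(X, y, n_alphas=50)   # one coefficient vector per lambda value
print(coefs.shape)                                  # (n_features, n_alphas)
```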
For neural networks, the regularization strength is typically tuned alongside other hyperparameters such as learning rate, dropout rate, and batch size. Modern training recipes often specify default values: for example, a weight decay of 0.01 to 0.1 with AdamW is standard for many architectures.
Regularization plays different roles and takes different forms in traditional machine learning models compared to deep neural networks.
In traditional ML (linear regression, logistic regression, support vector machines), regularization is typically applied through explicit penalty terms on the model weights. The penalty directly modifies the objective function, and for many models, the resulting optimization problem remains convex. This means there is a single global optimum that can be found reliably. The choice between L1, L2, and Elastic Net is the primary decision, and the regularization strength is the main hyperparameter to tune.
In deep neural networks, the situation is more complex. The loss landscape is non-convex with many local minima and saddle points, so the optimizer's trajectory through the loss landscape itself acts as an implicit regularizer, and a wider range of explicit techniques is used. The table below contrasts regularization in the two settings:
| Aspect | Traditional ML | Neural networks |
|---|---|---|
| Primary methods | L1, L2, Elastic Net | Weight decay, dropout, data augmentation, batch normalization |
| Regularization target | Model weights | Weights, activations, gradients, outputs |
| Loss landscape | Convex (for many models) | Non-convex |
| Implicit regularization | Minimal | SGD noise, architecture choices, initialization |
| Tuning complexity | One or two hyperparameters | Many interacting hyperparameters |
| Theoretical guarantees | Well-established bounds | Largely empirical understanding |
Neural networks also benefit from forms of implicit regularization that have no counterpart in traditional ML. The stochasticity of mini-batch gradient descent itself acts as a regularizer: the noise from sampling different mini-batches prevents the optimizer from converging precisely to a sharp minimum. Smaller batch sizes introduce more noise and thus more implicit regularization, which partially explains why small-batch training sometimes generalizes better than large-batch training. The choice of optimizer, learning rate schedule, and network architecture also implicitly regularize the model by influencing which solutions the optimizer finds.
Transformer-based models use a distinct combination of regularization techniques that differs from the classical deep learning toolkit. Understanding this modern landscape is essential for practitioners working with large language models.
Stochastic depth, introduced by Huang et al. (2016), randomly skips entire residual blocks during training. For each training step, each residual block has a probability of being bypassed entirely (its output is replaced by its input through the skip connection). The drop probability typically increases linearly from 0 at the first layer to a maximum value (often 0.1-0.3) at the last layer.
Stochastic depth has become a standard regularization technique for vision transformers (ViTs). The DeiT training recipe (Touvron et al., 2021) and the timm library's "ResNet Strikes Back" recipe (Wightman et al., 2021) both include stochastic depth as a key component. It serves a dual purpose: regularizing the model and implicitly training an ensemble of networks with different effective depths.
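A minimal sketch of the per-block decision; block_fn is a hypothetical stand-in for the residual branch, and the rescaling of surviving branches follows the inverted scaling used in common implementations such as timm's DropPath:

```python
import numpy as np

def residual_block_with_stochastic_depth(x, block_fn, drop_prob, training=True, seed=None):
    """With probability drop_prob, bypass the residual branch entirely during training."""
    if training and drop_prob > 0.0:
        rng = np.random.default_rng(seed)
        if rng.random() < drop_prob:
            return x                                    # block skipped: output equals the input
        return x + block_fn(x) / (1.0 - drop_prob)      # surviving branches are rescaled
    return x + block_fn(x)                              # every block is active at test time

out = residual_block_with_stochastic_depth(np.ones(8), lambda h: 0.1 * h, drop_prob=0.2, seed=0)
```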
| Regularization technique | Used in vision transformers | Used in LLMs |
|---|---|---|
| Stochastic depth (DropPath) | Yes (standard) | Rarely |
| Dropout on attention weights | Yes | Yes (but often low, 0.0-0.1) |
| Dropout on FFN layers | Yes | Yes (but often low or zero) |
| Weight decay | Yes (0.05-0.3) | Yes (0.1 standard) |
| Label smoothing | Yes (0.1 typical) | Occasionally |
| Mixup / CutMix | Yes (standard in ViT training) | No |
| Data augmentation | Yes (RandAugment, etc.) | Limited (masking, token dropout) |
Mixup (Zhang et al., 2018) creates new training examples by linearly interpolating pairs of existing examples and their labels. Given two examples (x_a, y_a) and (x_b, y_b), mixup creates a new example:
x_mix = lambda * x_a + (1 - lambda) * x_b
y_mix = lambda * y_a + (1 - lambda) * y_b
where lambda is sampled from a Beta distribution. This encourages the model to behave linearly between training examples, which produces smoother decision boundaries and reduces overfitting.
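A minimal NumPy sketch of mixup for a single pair of examples with one-hot labels (alpha=0.8 mirrors the DeiT setting mentioned below; the helper itself is illustrative):

```python
import numpy as np

def mixup(x_a, y_a, x_b, y_b, alpha=0.8, seed=None):
    """Convex combination of two examples and their one-hot labels."""
    rng = np.random.default_rng(seed)
    lam = rng.beta(alpha, alpha)                 # mixing coefficient from Beta(alpha, alpha)
    return lam * x_a + (1.0 - lam) * x_b, lam * y_a + (1.0 - lam) * y_b

x_mix, y_mix = mixup(np.ones((3, 3)), np.array([1.0, 0.0]),
                     np.zeros((3, 3)), np.array([0.0, 1.0]), seed=0)
print(y_mix)                                     # a soft label, e.g. something like [0.7, 0.3]
```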
CutMix (Yun et al., 2019) takes a different approach: instead of blending entire images, it cuts a rectangular patch from one image and pastes it onto another. The labels are mixed proportionally to the area of the patch. CutMix preserves local image structure better than Mixup, which can produce blurred or unnatural images.
Both techniques are now standard in vision transformer training pipelines. The DeiT recipe, for example, uses both Mixup (alpha=0.8) and CutMix (alpha=1.0) with a 50% probability of switching between them. These augmentations are particularly important for training ViTs from scratch, as transformers lack the inductive biases (translation equivariance, locality) that CNNs have, making them more prone to overfitting on small to medium datasets.
Modern LLMs employ a surprisingly minimal set of explicit regularization techniques compared to vision models. The primary regularization tools in LLM pre-training are weight decay (a value of 0.1 with AdamW is standard), dropout applied at low rates or disabled entirely, and limited data-level techniques such as masking and token dropout.
The trend toward reduced explicit regularization in LLMs is driven by the observation that very large models trained on very large datasets are in a regime where overfitting is not the primary concern. Instead, underfitting (not training long enough or on enough data) is the bigger risk. This stands in contrast to smaller-scale training, where regularization is essential.
Selecting the right regularization technique depends on the model architecture, data characteristics, and the specific problem.
For linear models (linear regression, logistic regression), L1, L2, or Elastic Net regularization are the standard choices. Use L1 when you expect many features to be irrelevant and want automatic feature selection. Use L2 when all features may be relevant but you want to prevent large weights. Use Elastic Net when features are correlated and you want both sparsity and stability.
For deep neural networks, dropout, batch normalization, weight decay, and data augmentation are the primary tools. Modern practice often combines several of these. For example, a typical image classification pipeline might use data augmentation, batch normalization in convolutional layers, dropout in fully connected layers, and weight decay in the optimizer. The combination matters: adding too many regularizers at once can lead to underfitting, so each should be tuned on a validation set.
For GANs and other generative models, spectral normalization and gradient penalties are preferred because they directly control the smoothness of the discriminator or critic function.
For large language models and transformer-based architectures, weight decay is the primary explicit regularizer, with dropout applied sparingly. Data augmentation in NLP (e.g., back-translation, token masking) is common during pre-training but less critical than in vision.
The regularization strength (lambda or dropout rate) is a hyperparameter that should be tuned via cross-validation or a held-out validation set. Too little regularization allows overfitting; too much forces the model into underfitting.
Imagine you are building a sandcastle using a limited amount of sand. You want your sandcastle to look great and be as sturdy as possible. In machine learning, the sand represents the information we have, and the sandcastle is the model we build.
Sometimes, when we build our sandcastle (or model), we focus too much on making it look perfect using the sand we have, and we forget that it needs to be sturdy enough to withstand waves or wind (unseen data). Regularization is like adding some water or using a different technique while building our sandcastle to make it stronger and more resilient. This way, it will look good and be sturdy, even when faced with new challenges.
Here is another way to think about it: if you are studying for a test, you could memorize every single practice question word for word. You would get a perfect score on those practice questions, but if the test has slightly different questions, you might fail. Regularization is like your teacher saying "don't just memorize the answers; learn the main ideas." It stops you from memorizing the practice questions too closely, so you actually understand the material and can answer new questions you have never seen before.