See also: Machine learning terms
Regularization is a set of techniques used in machine learning to prevent overfitting, which occurs when a model learns to perform well on the training data but does not generalize well to unseen data. Regularization works by adding constraints or penalties during training that discourage the model from becoming overly complex. The core idea is that simpler models, or models with smaller parameter values, tend to generalize better to new data.
The principle behind regularization traces back to Occam's razor: given two explanations that fit the observed data equally well, the simpler one is more likely to be correct. In statistical learning, a model that fits the training data perfectly may have captured noise and idiosyncratic patterns rather than the true underlying relationship. Regularization operationalizes Occam's razor by adding a cost for complexity, biasing the learning process toward simpler hypotheses that are more likely to hold on unseen examples.
In mathematical terms, most regularization methods modify the loss function by adding a penalty term:
Total Loss = Original Loss + lambda * Regularization Penalty
The hyperparameter lambda (sometimes written as alpha) controls the strength of regularization. A larger value imposes a stronger penalty, pushing the model toward simpler solutions. A value of zero recovers the unregularized objective.
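To make the formula concrete, here is a minimal Python sketch using an L2 penalty as the illustrative choice; the function and variable names are ours, not from any particular library:

```python
import numpy as np

def total_loss(y_true, y_pred, weights, lam):
    """Original loss (mean squared error here) plus a regularization penalty scaled by lambda."""
    original = np.mean((y_true - y_pred) ** 2)   # original loss
    penalty = np.sum(weights ** 2)               # L2 penalty: sum of squared weights
    return original + lam * penalty              # lam = 0 recovers the unregularized objective

# Example: the same predictions cost more as lambda grows.
w = np.array([0.5, -2.0, 3.0])
print(total_loss(np.array([1.0, 2.0]), np.array([1.1, 1.8]), w, lam=0.0))
print(total_loss(np.array([1.0, 2.0]), np.array([1.1, 1.8]), w, lam=0.1))
```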
Regularization techniques span a wide range of approaches. Some directly penalize model weights, while others impose implicit constraints through training procedures or data manipulation.
| Technique | Category | How it works | Typical use case |
|---|---|---|---|
| L1 (Lasso) | Weight penalty | Adds sum of absolute weight values to loss | Feature selection; sparse models |
| L2 (Ridge) | Weight penalty | Adds sum of squared weight values to loss | General-purpose weight shrinkage |
| Elastic Net | Weight penalty | Combines L1 and L2 penalties | Correlated features with desired sparsity |
| Dropout | Training procedure | Randomly deactivates neurons during training | Fully connected and recurrent layers in deep networks |
| Batch normalization | Training procedure | Normalizes layer inputs across the mini-batch | Deep networks; stabilizes and mildly regularizes |
| Early stopping | Training procedure | Halts training when validation loss stops improving | Any iterative training process |
| Data augmentation | Data-based | Applies transformations to create additional training examples | Computer vision, NLP, audio |
| Weight decay | Weight penalty | Directly shrinks weights at each optimizer step | Standard in AdamW and SGD |
| Label smoothing | Output-based | Replaces hard 0/1 targets with soft targets (e.g., 0.1/0.9) | Classification with overconfident predictions |
| Spectral normalization | Weight constraint | Constrains the spectral norm of weight matrices to be at most 1 | GANs, stability-sensitive architectures |
| Stochastic depth | Training procedure | Randomly skips entire residual blocks during training | Deep ResNets, vision transformers |
| Mixup | Data-based | Linearly interpolates pairs of training examples and their labels | Image classification, semi-supervised learning |
| CutMix | Data-based | Cuts and pastes patches between training images, mixing labels proportionally | Image classification, object detection |
| Noise injection | Training procedure | Adds random noise to inputs, weights, or gradients | Small datasets; recurrent networks |
L1 regularization, also known as Lasso regularization (Least Absolute Shrinkage and Selection Operator), adds the sum of the absolute values of the model weights to the objective function.
The L1-regularized loss is:
L_total = L_original + lambda * sum(|w_i|)
where w_i represents each weight in the model and lambda is the regularization strength.
The key property of L1 regularization is that it promotes sparsity. Because the absolute value function has a sharp corner at zero, the optimization process tends to drive many weights to exactly zero. This effectively performs automatic feature selection: features whose weights become zero are removed from the model entirely. L1 regularization is particularly useful when dealing with high-dimensional data where only a subset of features is expected to be relevant.
In practice, L1 regularization produces models that are easier to interpret because only a small number of features have nonzero weights. However, when features are highly correlated, L1 tends to arbitrarily select one from a group of correlated features and set the rest to zero, which can be unstable.
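A small scikit-learn sketch illustrates the sparsity effect; the synthetic data and the alpha value are illustrative choices, not prescriptions:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))                        # 50 features, most of them irrelevant
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

model = Lasso(alpha=0.1).fit(X, y)                    # alpha plays the role of lambda
print(int(np.sum(model.coef_ != 0)))                  # typically only a few nonzero weights survive
```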
From a Bayesian perspective, L1 regularization is equivalent to placing a Laplace prior on the model parameters. The Laplace distribution has a sharp peak at zero, so the corresponding objective is non-differentiable at zero and tends to pull small weights exactly to zero, which explains why L1 optimization produces sparse solutions.
Tibshirani (1996) introduced Lasso in the context of linear regression, and it has since been widely adopted across many model types.
L2 regularization, also known as Ridge regularization, adds the sum of the squared values of the model weights to the objective function.
The L2-regularized loss is:
L_total = L_original + lambda * sum(w_i^2)
Unlike L1, L2 regularization does not drive weights to exactly zero. Instead, it shrinks all weights toward zero proportionally. Weights that are already small get pushed closer to zero, while large weights get penalized more heavily. The result is a smoother, less complex model where no single feature dominates the prediction.
L2 regularization is particularly effective at handling multicollinearity, a situation where predictor variables are highly correlated. When features are correlated, ordinary least squares regression produces unstable weight estimates with high variance. L2 regularization stabilizes these estimates by constraining the weight magnitudes.
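The stabilizing effect on correlated features can be seen in a short scikit-learn sketch (the data and the alpha value are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.01, size=200)            # nearly identical to x1 (multicollinearity)
X = np.column_stack([x1, x2])
y = x1 + rng.normal(scale=0.1, size=200)

print(LinearRegression().fit(X, y).coef_)             # unregularized: often large, offsetting weights
print(Ridge(alpha=1.0).fit(X, y).coef_)               # ridge: small, stable weights that share the signal
```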
From a Bayesian standpoint, L2 regularization corresponds to placing a Gaussian (normal) prior centered at zero on the model parameters. A Gaussian prior whose variance is inversely proportional to lambda assigns higher probability to smaller weight values, favoring models with weights clustered near zero without enforcing exact sparsity.
Ridge regression was introduced by Hoerl and Kennard (1970) and remains one of the most widely used regularization methods in both classical statistics and modern deep learning.
Elastic Net regularization combines L1 and L2 regularization. It includes both penalty terms, weighted by a mixing parameter:
L_total = L_original + lambda_1 * sum(|w_i|) + lambda_2 * sum(w_i^2)
Alternatively, this can be expressed with a single regularization strength lambda and a mixing ratio alpha where alpha = 0 gives pure L2 and alpha = 1 gives pure L1:
L_total = L_original + lambda * [alpha * sum(|w_i|) + (1 - alpha) * sum(w_i^2)]
Elastic Net retains the sparsity-inducing property of L1 while also inheriting L2's ability to handle correlated features gracefully. When multiple features are correlated, Elastic Net tends to include or exclude them as a group rather than arbitrarily selecting one, which makes it more stable than pure L1.
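In scikit-learn the mixing ratio is exposed as l1_ratio; the following sketch uses illustrative values for both hyperparameters:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
X[:, 1] = X[:, 0] + rng.normal(scale=0.01, size=200)   # a pair of highly correlated features
y = X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=200)

# scikit-learn's alpha is the overall strength (lambda above); l1_ratio is the mixing ratio (alpha above)
model = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(model.coef_[:2])                                  # the correlated pair tends to be kept together
```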
Zou and Hastie (2005) proposed Elastic Net and showed that it outperforms Lasso in situations with correlated predictors or when the number of predictors exceeds the number of observations.
Dropout is a regularization technique designed specifically for neural networks. During each training step, dropout randomly sets a fraction of neuron activations to zero. The fraction is controlled by a dropout rate (commonly between 0.2 and 0.5). At test time, all neurons are active; in the original formulation their outputs are scaled by the keep probability (1 minus the dropout rate) to compensate, while most modern implementations use "inverted dropout," which instead rescales the retained activations by 1 / (1 - rate) during training so that no adjustment is needed at test time.
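A minimal NumPy sketch of the inverted-dropout variant described above (the function is a standalone illustration, not code from any framework):

```python
import numpy as np

def dropout(activations, rate, training=True, seed=None):
    """Inverted dropout: zero each unit with probability `rate` and rescale the survivors."""
    if not training or rate == 0.0:
        return activations                             # identity at test time
    rng = np.random.default_rng(seed)
    mask = rng.random(activations.shape) >= rate       # keep each unit with probability 1 - rate
    return activations * mask / (1.0 - rate)           # rescale so the expected activation is unchanged

h = np.ones((2, 6))
print(dropout(h, rate=0.5, seed=0))                    # roughly half the entries are 0, the rest are 2.0
```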
Dropout prevents co-adaptation, a situation where neurons learn to depend on specific other neurons. When any neuron might be absent on a given training step, each neuron must learn features that are useful on their own, not just in the context of specific partner neurons. This produces a more robust internal representation.
Another interpretation of dropout is that it approximates training an ensemble of many different networks. Each training step uses a different "thinned" network (a random subset of the full network), and the final prediction at test time averages over all these implicit sub-networks.
Srivastava et al. (2014) published the foundational paper on dropout, showing that it significantly reduces overfitting across a wide range of tasks including image classification, speech recognition, and text classification.
Batch normalization, introduced by Ioffe and Szegedy (2015), normalizes the inputs to each layer by subtracting the mini-batch mean and dividing by the mini-batch standard deviation. Learnable scale and shift parameters are then applied.
While batch normalization was originally designed to address internal covariate shift and speed up training, it also has a regularizing effect. The normalization uses statistics computed from the current mini-batch, which introduces noise into the computation (since each mini-batch is a random sample). This noise acts as a mild regularizer, similar in spirit to dropout. In practice, batch normalization often reduces the need for dropout, and some architectures use one or the other but not both.
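A simplified forward pass for the training-time computation (mini-batch statistics only; the running averages used at test time are omitted):

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Normalize each feature across the mini-batch, then apply learnable scale and shift."""
    mean = x.mean(axis=0)                              # statistics depend on the sampled mini-batch,
    var = x.var(axis=0)                                # which is the source of the regularizing noise
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.random.default_rng(0).normal(size=(32, 16))     # batch of 32 examples, 16 features
out = batch_norm_train(x, gamma=np.ones(16), beta=np.zeros(16))
print(out.mean(axis=0).round(6), out.std(axis=0).round(3))   # ~0 means, ~1 standard deviations
```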
Early stopping monitors the model's performance on a validation set during training and halts the process when validation performance stops improving. A "patience" parameter specifies how many consecutive epochs of no improvement to tolerate before stopping.
Early stopping is an implicit form of regularization. As training progresses, the model's weights move further from their initial values. Stopping early constrains how far the weights can move, which limits the effective complexity of the model. Bishop (1995) and others have shown that early stopping is mathematically related to L2 regularization in certain settings; both constrain the weight space, just through different mechanisms.
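The logic reduces to a counter over validation results; the sketch below drives it with a made-up sequence of validation losses standing in for a real training loop:

```python
# Hypothetical per-epoch validation losses standing in for real evaluation calls.
val_losses = [0.90, 0.72, 0.61, 0.58, 0.59, 0.60, 0.58, 0.61, 0.62, 0.63]

best, patience, bad_epochs = float("inf"), 3, 0
for epoch, val_loss in enumerate(val_losses):
    if val_loss < best:                 # improvement: record it and reset the counter
        best, bad_epochs = val_loss, 0  # (a real loop would also checkpoint the weights here)
    else:
        bad_epochs += 1                 # no improvement this epoch
        if bad_epochs >= patience:
            print(f"stopping at epoch {epoch} with best validation loss {best}")
            break
```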
Data augmentation expands the effective size of the training set by creating modified copies of existing training examples. In computer vision, common augmentations include random rotations, flips, crops, color jitter, and scaling. In NLP, augmentations include synonym replacement, back-translation, and random insertion or deletion of words.
By presenting the model with more varied versions of the same data, augmentation reduces the model's ability to memorize specific training examples. The model must learn features that are invariant to the applied transformations, which improves generalization.
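A typical image augmentation pipeline, sketched with torchvision (assuming it is installed; the specific transforms and parameter values are illustrative):

```python
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),          # random crop, rescaled to 224x224
    transforms.RandomHorizontalFlip(),          # 50% chance of a left-right flip
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    transforms.ToTensor(),
])
# Each epoch sees a different randomized variant of every training image.
```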
Weight decay directly shrinks the model weights at each optimization step by multiplying them by a factor slightly less than 1 (e.g., 0.999). In the standard SGD optimizer, weight decay is mathematically equivalent to L2 regularization. However, for adaptive optimizers like Adam, weight decay and L2 regularization are not the same.
The distinction arises because Adam maintains per-parameter adaptive learning rates based on the first and second moments of past gradients. When L2 regularization is used with Adam, the gradient of the L2 penalty gets scaled by these adaptive factors, which means different parameters receive different effective regularization strengths depending on their gradient history. Parameters with large historical gradients receive weaker effective regularization, while rarely updated parameters receive stronger regularization. This coupling was not intentional and often hurts performance.
Loshchilov and Hutter (2019) identified this problem and proposed AdamW, which decouples weight decay from the gradient update. In AdamW, weight decay is applied directly to the weights after the Adam update step, rather than being added to the gradient before it. This ensures that all parameters are regularized uniformly regardless of their gradient history. Their experiments showed that decoupled weight decay substantially improves generalization and makes the optimal weight decay factor more independent of the learning rate, simplifying hyperparameter tuning.
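In PyTorch, the weight_decay argument of torch.optim.AdamW applies this decoupled decay; the model and values below are placeholders:

```python
import torch

model = torch.nn.Linear(10, 1)    # stand-in for a real network

# Decoupled weight decay (AdamW): decay is applied to the weights themselves,
# not folded into the gradient that Adam rescales per parameter.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)

# By contrast, torch.optim.Adam with weight_decay adds an L2 term to the gradient,
# so the effective regularization is coupled to each parameter's gradient history.
```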
Label smoothing replaces the hard target labels (0 or 1 in classification) with softened versions. For example, instead of a target of 1.0 for the correct class and 0.0 for all others, label smoothing might use 0.9 for the correct class and distribute the remaining 0.1 uniformly across the other classes.
This prevents the model from becoming overconfident in its predictions. Without label smoothing, the model is incentivized to push its output probabilities toward 0 and 1, which requires very large weight magnitudes and makes the model brittle. Szegedy et al. (2016) introduced label smoothing in their work on the Inception architecture and showed that it improved both calibration and generalization.
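A small sketch of constructing smoothed targets; this variant spreads the smoothing mass over the incorrect classes only (other variants spread it over all classes):

```python
import numpy as np

def smooth_labels(true_class, num_classes, eps=0.1):
    """Soft target: 1 - eps on the true class, eps split evenly over the other classes."""
    target = np.full(num_classes, eps / (num_classes - 1))
    target[true_class] = 1.0 - eps
    return target

print(smooth_labels(true_class=2, num_classes=5))   # [0.025 0.025 0.9 0.025 0.025]
```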
Spectral normalization constrains the spectral norm (the largest singular value) of each weight matrix to be at most 1. This bounds the Lipschitz constant of each layer, limiting how much the output can change in response to small input perturbations.
Miyato et al. (2018) proposed spectral normalization for training generative adversarial networks (GANs), where training stability is a major concern. By controlling the discriminator's Lipschitz constant, spectral normalization prevents the discriminator from producing overly sharp gradients that destabilize training. It has since been applied to other architectures where controlling the model's sensitivity to input perturbations is desirable.
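Conceptually, the operation is a division by the largest singular value; the sketch below uses a full SVD for clarity, whereas practical implementations (as in Miyato et al.) estimate that value with a few power-iteration steps:

```python
import numpy as np

def spectrally_normalize(W):
    """Divide a weight matrix by its largest singular value so its spectral norm is at most 1."""
    sigma_max = np.linalg.svd(W, compute_uv=False)[0]   # largest singular value
    return W / max(sigma_max, 1e-12)

W = np.random.default_rng(0).normal(size=(64, 32))
print(np.linalg.svd(spectrally_normalize(W), compute_uv=False)[0])   # ~1.0
```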
Noise injection adds random perturbations to inputs, weights, or intermediate activations during training. The most common variant adds Gaussian noise with zero mean and a fixed or decaying variance. This forces the model to learn representations that are robust to small perturbations rather than relying on precise input values.
Adding noise to the inputs is one of the oldest regularization techniques and can be shown to be approximately equivalent to a form of Tikhonov (L2) regularization under certain conditions. Specifically, injecting Gaussian noise with variance sigma^2 into the inputs of a linear model is equivalent to adding a penalty proportional to sigma^2 times the squared norm of the weights (Bishop, 1995).
Noise can also be injected directly into the weights during training. Weight noise encourages the network to find broad minima in the loss landscape rather than sharp ones, because solutions that are sensitive to small weight perturbations will perform poorly when noise is added. Graves et al. (2013) demonstrated the effectiveness of weight noise for training long short-term memory (LSTM) networks.
Annealing the noise variance over the course of training (starting with a larger variance and gradually reducing it) often works better than maintaining a fixed variance, as it allows the model to explore broadly early in training and then refine its parameters later.
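A sketch of input-noise injection with a linearly decaying standard deviation; the schedule and sigma values are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def add_input_noise(x, epoch, total_epochs, sigma_start=0.5, sigma_end=0.05):
    """Gaussian input noise whose standard deviation anneals linearly over training."""
    frac = epoch / max(total_epochs - 1, 1)
    sigma = sigma_start + frac * (sigma_end - sigma_start)   # large early, small late
    return x + rng.normal(scale=sigma, size=x.shape)

x = rng.normal(size=(32, 16))                                # a mini-batch of inputs
noisy_early = add_input_noise(x, epoch=0, total_epochs=100)  # heavily perturbed
noisy_late = add_input_noise(x, epoch=99, total_epochs=100)  # only lightly perturbed
```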
The following table summarizes the mathematical formulations of the primary weight-based regularization methods:
| Method | Penalty term | Effect on weights | Sparsity |
|---|---|---|---|
| L1 (Lasso) | lambda * sum(abs(w_i)) | Drives many weights to exactly zero | Yes |
| L2 (Ridge) | lambda * sum(w_i^2) | Shrinks all weights toward zero | No |
| Elastic Net | lambda * [alpha * sum(abs(w_i)) + (1-alpha) * sum(w_i^2)] | Combines sparsity with shrinkage | Partial |
| Weight decay | Multiply weights by (1 - lambda) each step | Shrinks all weights toward zero | No |
| Spectral norm | Constrain largest singular value of W to 1 | Bounds layer's Lipschitz constant | No |
Regularization has a natural interpretation in Bayesian statistics. In the Bayesian framework, model parameters are treated as random variables with a prior distribution that encodes beliefs about the parameters before observing any data. Learning is then framed as computing the posterior distribution over parameters given the observed data, using Bayes' theorem.
Maximum a posteriori (MAP) estimation finds the parameter values that maximize the posterior probability. Taking the negative log of the posterior and minimizing it yields an objective that consists of two terms: the negative log-likelihood (which corresponds to the original loss function) and the negative log-prior (which acts as a regularization penalty). The specific form of the prior determines the type of regularization:
| Prior distribution | Corresponding regularization | Effect |
|---|---|---|
| Gaussian (mean 0, variance proportional to 1/lambda) | L2 regularization | Shrinks weights toward zero |
| Laplace (location 0, scale proportional to 1/lambda) | L1 regularization | Promotes exact sparsity |
| Spike-and-slab | L0-type regularization | Selects a subset of nonzero weights |
| Horseshoe | Adaptive shrinkage | Heavy shrinkage on small weights, mild on large |
This connection means that choosing a regularization method is, from a Bayesian standpoint, equivalent to choosing a prior over model parameters. A practitioner who uses L2 regularization is implicitly assuming that the model's weights are drawn from a Gaussian distribution centered at zero. One who uses L1 regularization is assuming a Laplace distribution, which has heavier tails and a sharper peak at zero.
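To make the correspondence explicit (up to additive constants that do not affect the optimum), MAP estimation with a Gaussian prior p(w_i) proportional to exp(-lambda * w_i^2) minimizes:

-log P(w | data) = -log P(data | w) - log P(w) + constant = L_original + lambda * sum(w_i^2) + constant

which is exactly the L2-regularized objective. Substituting a Laplace prior proportional to exp(-lambda * |w_i|) yields the L1 penalty in the same way.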
The Bayesian perspective also provides a principled way to set the regularization strength. Rather than treating lambda as a hyperparameter to be tuned by cross-validation, Bayesian methods can place a hyperprior on lambda and infer its value from the data. This approach, known as empirical Bayes or type-II maximum likelihood, automatically determines how much regularization is appropriate.
The bias-variance tradeoff provides the theoretical foundation for understanding why regularization works. A model's expected prediction error on unseen data can be decomposed into three components: bias (squared), variance, and irreducible noise.
Bias measures how far the model's average prediction is from the true value. Models that are too simple (high regularization) tend to have high bias because they cannot capture the true complexity of the data. Variance measures how much the model's predictions fluctuate across different training sets. Complex models (low regularization) tend to have high variance because they fit the noise in each training set differently.
Regularization increases bias by constraining the model's flexibility, but it reduces variance by preventing the model from fitting idiosyncratic patterns in the training data. The net effect is often a decrease in total prediction error, because the reduction in variance more than compensates for the increase in bias.
This tradeoff is directly controlled by the regularization strength. As lambda increases from zero, bias rises steadily while variance falls, so the error on unseen data typically decreases at first and then climbs again once the model becomes too constrained to capture the underlying signal.
The optimal lambda sits at the point where total error is minimized, balancing bias and variance. This is why regularization strength must be carefully tuned rather than simply set to a large value.
The regularization hyperparameter lambda must be selected carefully. Too small a value provides insufficient regularization and allows overfitting. Too large a value over-constrains the model, causing underfitting. Several strategies exist for finding a good value.
K-fold cross-validation is the most common approach. The training data is divided into K folds (typically 5 or 10). For each candidate lambda value, the model is trained on K-1 folds and evaluated on the held-out fold, rotating through all folds. The lambda that produces the best average validation performance is selected. This approach is computationally expensive because it requires training K models for each lambda value, but it provides a reliable estimate of out-of-sample performance.
Grid search evaluates a predefined set of lambda values, typically spaced on a logarithmic scale (e.g., 1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1). This works well when the rough range of good values is known in advance.
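A compact sketch combining K-fold cross-validation with a logarithmic grid, using scikit-learn (the dataset and grid are illustrative):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=200, n_features=30, noise=5.0, random_state=0)

grid = {"alpha": np.logspace(-5, 0, 6)}        # 1e-5, 1e-4, ..., 1; alpha is lambda in scikit-learn
search = GridSearchCV(Ridge(), grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X, y)                               # 5-fold cross-validation for every candidate value
print(search.best_params_)
```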
Random search samples lambda values randomly from a specified distribution. Bergstra and Bengio (2012) showed that random search is more efficient than grid search when some hyperparameters matter more than others, because it explores a wider range of values for each dimension.
Bayesian optimization uses a probabilistic model of the objective function to select the next lambda value to evaluate. Tools like Optuna, Hyperopt, and Weights & Biases Sweeps implement this approach, which is particularly useful when each training run is expensive.
For L1 and L2 regularization in linear models, efficient algorithms exist that compute the entire solution path across all lambda values at roughly the cost of a single fit. The LARS algorithm (Efron et al., 2004) does this for Lasso, and the solution path for Ridge regression can be computed in closed form.
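scikit-learn exposes such path computations directly; the sketch below uses lasso_path, which computes coefficients for a whole sequence of regularization strengths in one call (via coordinate descent rather than LARS, but with the same practical benefit):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import lasso_path

X, y = make_regression(n_samples=100, n_features=20, noise=1.0, random_state=0)

alphas, coefs, _ = lasso_path(X, y, n_alphas=50)   # one coefficient vector per lambda value
print(coefs.shape)                                  # (n_features, n_alphas)
```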
For neural networks, the regularization strength is typically tuned alongside other hyperparameters such as learning rate, dropout rate, and batch size. Modern training recipes often specify default values: for example, a weight decay of 0.01 to 0.1 with AdamW is standard for many architectures.
Regularization plays different roles and takes different forms in traditional machine learning models compared to deep neural networks.
In traditional ML (linear regression, logistic regression, support vector machines), regularization is typically applied through explicit penalty terms on the model weights. The penalty directly modifies the objective function, and for many models, the resulting optimization problem remains convex. This means there is a single global optimum that can be found reliably. The choice between L1, L2, and Elastic Net is the primary decision, and the regularization strength is the main hyperparameter to tune.
In deep neural networks, the situation is more complex. The loss landscape is non-convex with many local minima and saddle points, so the optimizer's trajectory through the loss landscape itself acts as an implicit regularizer, and a wider range of explicit techniques is used. The table below contrasts regularization in the two settings:
| Aspect | Traditional ML | Neural networks |
|---|---|---|
| Primary methods | L1, L2, Elastic Net | Weight decay, dropout, data augmentation, batch normalization |
| Regularization target | Model weights | Weights, activations, gradients, outputs |
| Loss landscape | Convex (for many models) | Non-convex |
| Implicit regularization | Minimal | SGD noise, architecture choices, initialization |
| Tuning complexity | One or two hyperparameters | Many interacting hyperparameters |
| Theoretical guarantees | Well-established bounds | Largely empirical understanding |
Neural networks also benefit from forms of implicit regularization that have no counterpart in traditional ML. The stochasticity of mini-batch gradient descent itself acts as a regularizer: the noise from sampling different mini-batches prevents the optimizer from converging precisely to a sharp minimum. Smaller batch sizes introduce more noise and thus more implicit regularization, which partially explains why small-batch training sometimes generalizes better than large-batch training. The choice of optimizer, learning rate schedule, and network architecture also implicitly regularize the model by influencing which solutions the optimizer finds.
Transformer-based models use a distinct combination of regularization techniques that differs from the classical deep learning toolkit. Understanding this modern landscape is essential for practitioners working with large language models.
Stochastic depth, introduced by Huang et al. (2016), randomly skips entire residual blocks during training. For each training step, each residual block has a probability of being bypassed entirely (its output is replaced by its input through the skip connection). The drop probability typically increases linearly from 0 at the first layer to a maximum value (often 0.1-0.3) at the last layer.
Stochastic depth has become a standard regularization technique for vision transformers (ViTs). The DeiT training recipe (Touvron et al., 2021) and the timm library's "ResNet Strikes Back" recipe (Wightman et al., 2021) both include stochastic depth as a key component. It serves a dual purpose: regularizing the model and implicitly training an ensemble of networks with different effective depths.
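A minimal sketch of the per-block decision; block_fn is a hypothetical stand-in for the residual branch, and the rescaling of surviving branches follows the inverted scaling used in common implementations such as timm's DropPath:

```python
import numpy as np

def residual_block_with_stochastic_depth(x, block_fn, drop_prob, training=True, seed=None):
    """With probability drop_prob, bypass the residual branch entirely during training."""
    if training and drop_prob > 0.0:
        rng = np.random.default_rng(seed)
        if rng.random() < drop_prob:
            return x                                    # block skipped: output equals the input
        return x + block_fn(x) / (1.0 - drop_prob)      # surviving branches are rescaled
    return x + block_fn(x)                              # every block is active at test time

out = residual_block_with_stochastic_depth(np.ones(8), lambda h: 0.1 * h, drop_prob=0.2, seed=0)
```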
| Regularization technique | Used in vision transformers | Used in LLMs |
|---|---|---|
| Stochastic depth (DropPath) | Yes (standard) | Rarely |
| Dropout on attention weights | Yes | Yes (but often low, 0.0-0.1) |
| Dropout on FFN layers | Yes | Yes (but often low or zero) |
| Weight decay | Yes (0.05-0.3) | Yes (0.1 standard) |
| Label smoothing | Yes (0.1 typical) | Occasionally |
| Mixup / CutMix | Yes (standard in ViT training) | No |
| Data augmentation | Yes (RandAugment, etc.) | Limited (masking, token dropout) |
Mixup (Zhang et al., 2018) creates new training examples by linearly interpolating pairs of existing examples and their labels. Given two examples (x_a, y_a) and (x_b, y_b), mixup creates a new example:
x_mix = lambda * x_a + (1 - lambda) * x_b
y_mix = lambda * y_a + (1 - lambda) * y_b
where lambda is sampled from a Beta distribution. This encourages the model to behave linearly between training examples, which produces smoother decision boundaries and reduces overfitting.
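A minimal NumPy sketch of mixup for a single pair of examples with one-hot labels (alpha=0.8 mirrors the DeiT setting mentioned below; the helper itself is illustrative):

```python
import numpy as np

def mixup(x_a, y_a, x_b, y_b, alpha=0.8, seed=None):
    """Convex combination of two examples and their one-hot labels."""
    rng = np.random.default_rng(seed)
    lam = rng.beta(alpha, alpha)                 # mixing coefficient from Beta(alpha, alpha)
    return lam * x_a + (1.0 - lam) * x_b, lam * y_a + (1.0 - lam) * y_b

x_mix, y_mix = mixup(np.ones((3, 3)), np.array([1.0, 0.0]),
                     np.zeros((3, 3)), np.array([0.0, 1.0]), seed=0)
print(y_mix)                                     # a soft label, e.g. something like [0.7, 0.3]
```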
CutMix (Yun et al., 2019) takes a different approach: instead of blending entire images, it cuts a rectangular patch from one image and pastes it onto another. The labels are mixed proportionally to the area of the patch. CutMix preserves local image structure better than Mixup, which can produce blurred or unnatural images.
Both techniques are now standard in vision transformer training pipelines. The DeiT recipe, for example, uses both Mixup (alpha=0.8) and CutMix (alpha=1.0) with a 50% probability of switching between them. These augmentations are particularly important for training ViTs from scratch, as transformers lack the inductive biases (translation equivariance, locality) that CNNs have, making them more prone to overfitting on small to medium datasets.
Modern LLMs employ a surprisingly minimal set of explicit regularization techniques compared to vision models. The primary regularization tools in LLM pre-training are weight decay (a value of 0.1 with AdamW is standard), dropout applied at low rates or disabled entirely, and limited data-level techniques such as masking and token dropout.
The trend toward reduced explicit regularization in LLMs is driven by the observation that very large models trained on very large datasets are in a regime where overfitting is not the primary concern. Instead, underfitting (not training long enough or on enough data) is the bigger risk. This stands in contrast to smaller-scale training, where regularization is essential.
Selecting the right regularization technique depends on the model architecture, data characteristics, and the specific problem.
For linear models (linear regression, logistic regression), L1, L2, or Elastic Net regularization are the standard choices. Use L1 when you expect many features to be irrelevant and want automatic feature selection. Use L2 when all features may be relevant but you want to prevent large weights. Use Elastic Net when features are correlated and you want both sparsity and stability.
For deep neural networks, dropout, batch normalization, weight decay, and data augmentation are the primary tools. Modern practice often combines several of these. For example, a typical image classification pipeline might use data augmentation, batch normalization in convolutional layers, dropout in fully connected layers, and weight decay in the optimizer. The combination matters: adding too many regularizers at once can lead to underfitting, so each should be tuned on a validation set.
For GANs and other generative models, spectral normalization and gradient penalties are preferred because they directly control the smoothness of the discriminator or critic function.
For large language models and transformer-based architectures, weight decay is the primary explicit regularizer, with dropout applied sparingly. Data augmentation in NLP (e.g., back-translation, token masking) is common during pre-training but less critical than in vision.
The regularization strength (lambda or dropout rate) is a hyperparameter that should be tuned via cross-validation or a held-out validation set. Too little regularization allows overfitting; too much forces the model into underfitting.
Imagine you are building a sandcastle using a limited amount of sand. You want your sandcastle to look great and be as sturdy as possible. In machine learning, the sand represents the information we have, and the sandcastle is the model we build.
Sometimes, when we build our sandcastle (or model), we focus too much on making it look perfect using the sand we have, and we forget that it needs to be sturdy enough to withstand waves or wind (unseen data). Regularization is like adding some water or using a different technique while building our sandcastle to make it stronger and more resilient. This way, it will look good and be sturdy, even when faced with new challenges.
Here is another way to think about it: if you are studying for a test, you could memorize every single practice question word for word. You would get a perfect score on those practice questions, but if the test has slightly different questions, you might fail. Regularization is like your teacher saying "don't just memorize the answers; learn the main ideas." It stops you from memorizing the practice questions too closely, so you actually understand the material and can answer new questions you have never seen before.