See also: Machine learning terms
Shrinkage in machine learning and statistics is a regularization technique that pulls model coefficient estimates toward zero, or toward some other fixed shrinkage target, to reduce variance at the cost of introducing a small amount of bias. The idea is older than modern machine learning. It comes out of mid-twentieth century mathematical statistics and now sits at the heart of nearly every supervised learning workflow, from penalized linear regression to weight decay in deep neural networks.
Shrinkage methods reduce the effective complexity of a model by discouraging large coefficients. The motivation is the bias-variance tradeoff: unconstrained estimators such as ordinary least squares are unbiased but can have huge variance, especially when the number of features is comparable to the number of observations or when predictors are strongly correlated. A modest amount of shrinkage trades a little bias for a much larger reduction in variance, lowering total mean squared error and helping avoid overfitting. Common shrinkage methods include ridge regression, lasso, and elastic net.
Shrinkage as a formal idea begins with Charles Stein's 1956 paper showing that the maximum likelihood estimator for the mean of a multivariate Gaussian is inadmissible when the dimension is at least three. Willard James and Charles Stein made the result concrete in 1961 by exhibiting a specific estimator that strictly dominates the MLE in mean squared error.
Suppose we observe a single vector $X \sim \mathcal{N}(\theta, \sigma^2 I_d)$ in $d \geq 3$ dimensions and wish to estimate the unknown mean vector $\theta$. The MLE is simply $\hat\theta_{MLE} = X$. The James-Stein estimator instead returns
$$\hat\theta_{JS} = \left(1 - \frac{(d-2)\sigma^2}{\lVert X \rVert^2}\right) X.$$
The leading factor pulls $X$ toward the origin by an amount that depends on how far the observation lies from zero. James and Stein proved that for $d \geq 3$ the risk of $\hat\theta_{JS}$ is uniformly smaller than the risk of $\hat\theta_{MLE}$ for every value of $\theta$. The result is often called the Stein paradox because it shows that pooling unrelated estimation problems can be strictly better than handling each one on its own, even when the parameters share no apparent relationship.
| Estimator | Bias | Variance | MSE risk |
|---|---|---|---|
| MLE $\hat\theta = X$ | 0 | $d \sigma^2$ | $d\sigma^2$ |
| James-Stein $\hat\theta_{JS}$ | non-zero | reduced | strictly less than $d\sigma^2$ for $d \geq 3$ |
| Positive-part JS | non-zero | reduced further | dominates plain JS |
The positive-part variant clips the shrinkage factor at zero so the estimator never flips sign, and it dominates the original James-Stein estimator. Bradley Efron and Carl Morris later popularized the result with examples involving baseball batting averages and small-area estimation, showing that even pragmatic data analysts gain by shrinking. The James-Stein estimator is the historical bridge between classical statistics and the regularization techniques that machine learning relies on today.
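A short simulation makes the domination concrete. The sketch below is a minimal NumPy illustration, not taken from the original papers; the dimension, noise level, true mean, and number of replications are arbitrary choices. It draws repeated observations around a fixed $\theta$ and compares the empirical risk of the MLE, the James-Stein estimator, and the positive-part variant.

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma, n_rep = 10, 1.0, 20_000            # dimension, noise sd, Monte Carlo replications
theta = rng.normal(size=d)                   # an arbitrary true mean vector

X = theta + sigma * rng.normal(size=(n_rep, d))   # one observation per replication

norm_sq = np.sum(X**2, axis=1, keepdims=True)
factor = 1 - (d - 2) * sigma**2 / norm_sq         # James-Stein shrinkage factor

mle = X
js = factor * X
js_plus = np.maximum(factor, 0.0) * X             # positive-part variant clips at zero

def risk(est):
    """Empirical mean squared error over replications."""
    return np.mean(np.sum((est - theta) ** 2, axis=1))

print(f"MLE risk:            {risk(mle):.3f}   (theory: d*sigma^2 = {d * sigma**2:.1f})")
print(f"James-Stein risk:    {risk(js):.3f}")
print(f"Positive-part risk:  {risk(js_plus):.3f}")
```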
The Stein phenomenon set the stage, but shrinkage entered everyday data analysis through penalized linear regression. The basic idea is to add a penalty term to the squared error loss that discourages large coefficients. Different penalties correspond to different shrinkage methods.
Ridge regression, also known as Tikhonov regularization, was introduced by Arthur Hoerl and Robert Kennard in their 1970 Technometrics paper Ridge Regression: Biased Estimation for Nonorthogonal Problems. They were motivated by the practical problem of multicollinearity: when columns of the design matrix are nearly linearly dependent, $X^\top X$ is close to singular and ordinary least squares estimates blow up.
Ridge regression replaces the OLS objective with
$$\hat\beta_{ridge} = \arg\min_\beta \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_2^2,$$
where $\lambda \geq 0$ controls the strength of the L2 penalty. The closed form solution is
$$\hat\beta_{ridge} = (X^\top X + \lambda I)^{-1} X^\top y.$$
Adding $\lambda I$ to the Gram matrix guarantees invertibility for any $\lambda > 0$, even when $X$ has more columns than rows.
The singular value decomposition $X = U D V^\top$ gives a clean geometric picture. Writing $d_j$ for the singular values, the ridge fit can be expressed as
$$X \hat\beta_{ridge} = \sum_j u_j \frac{d_j^2}{d_j^2 + \lambda} u_j^\top y.$$
Each principal component direction is multiplied by a shrinkage factor $d_j^2 / (d_j^2 + \lambda)$. Directions with large singular values are barely touched; directions with small singular values, where the data carries little information, are shrunk aggressively. This is why ridge tends to stabilize estimates without zeroing out features. The effective degrees of freedom $\sum_j d_j^2/(d_j^2+\lambda)$ replace the simple count of parameters and can be used for model selection.
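The shrinkage factors are easy to inspect numerically. The following is a small NumPy sketch on synthetic data (the dimensions and the value of $\lambda$ are arbitrary): it computes the ridge fit through the SVD, checks it against the closed-form solution above, and prints the per-component factors and effective degrees of freedom.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, lam = 100, 8, 10.0                     # rows, columns, ridge penalty (arbitrary)
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

# Ridge fit via the SVD X = U diag(d) V^T
U, d, Vt = np.linalg.svd(X, full_matrices=False)
shrink = d**2 / (d**2 + lam)                 # per-component shrinkage factors
beta_ridge = Vt.T @ ((shrink / d) * (U.T @ y))

# Sanity check against the closed-form solution (X^T X + lam I)^{-1} X^T y
beta_direct = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
assert np.allclose(beta_ridge, beta_direct)

print("shrinkage factors:", np.round(shrink, 3))
print("effective degrees of freedom:", shrink.sum())
```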
Robert Tibshirani's 1996 paper Regression Shrinkage and Selection via the Lasso swapped the L2 penalty for an L1 penalty:
$$\hat\beta_{lasso} = \arg\min_\beta \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_1.$$
The acronym stands for Least Absolute Shrinkage and Selection Operator. The L1 penalty has a sharp corner at zero, which causes some coefficients to be set exactly to zero rather than merely small. Lasso therefore performs simultaneous shrinkage and feature selection, producing sparse models that are easier to interpret.
Unlike ridge, lasso has no closed form. Solutions are computed with coordinate descent, the LARS algorithm of Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani (Annals of Statistics, 2004), or proximal gradient methods. The full lasso path, showing how coefficients enter and leave the model as $\lambda$ varies, can be computed in roughly the same time as a single OLS fit.
Lasso has known weaknesses. When the number of predictors $p$ exceeds the number of observations $n$, lasso can select at most $n$ variables. When predictors are highly correlated, lasso tends to pick one and ignore the others somewhat arbitrarily, which makes the chosen variables unstable across resampled datasets.
Hui Zou and Trevor Hastie introduced the elastic net in 2005 to address those weaknesses. It blends both penalties:
$$\hat\beta_{en} = \arg\min_\beta \lVert y - X\beta \rVert_2^2 + \lambda \left( \alpha \lVert \beta \rVert_1 + (1-\alpha) \lVert \beta \rVert_2^2 \right),$$
where $\alpha \in [0,1]$ mixes lasso ($\alpha = 1$) and ridge ($\alpha = 0$). The elastic net keeps lasso's ability to zero out coefficients while inheriting ridge's grouping behavior: strongly correlated predictors tend to enter or leave the model together rather than being chosen at random.
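As a quick illustration of the three penalties, the sketch below fits ridge, lasso, and elastic net with scikit-learn on synthetic data with a sparse true coefficient vector and counts the exact zeros each method produces. Note that scikit-learn calls the overall penalty strength `alpha` (the $\lambda$ above) and the mixing weight `l1_ratio` (the $\alpha$ above); the values used here are illustrative, not tuned.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.default_rng(2)
n, p = 200, 50
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:5] = [3.0, -2.0, 1.5, 1.0, -1.0]        # sparse truth: 5 active features
y = X @ beta_true + rng.normal(size=n)

models = {
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.1),
    "elastic net": ElasticNet(alpha=0.1, l1_ratio=0.5),
}
for name, model in models.items():
    model.fit(X, y)
    n_zero = int(np.sum(model.coef_ == 0))          # ridge keeps everything; L1 zeros out
    print(f"{name:12s} exact zeros: {n_zero:2d} / {p}")
```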
| Method | Penalty | Closed form | Selects features | Handles correlated predictors | Typical use |
|---|---|---|---|---|---|
| OLS / best subset | none / L0 (combinatorial) | yes / no | best subset only | poorly | small clean problems |
| Ridge | L2 | yes | no | shrinks them together | many correlated weak signals |
| Lasso | L1 | no | yes (sparse) | picks one arbitrarily | sparse signal, $p > n$ |
| Elastic net | mix of L1 and L2 | no | yes | groups them together | grouped predictors, genomics |
Penalized regression has a natural Bayesian reading. Ridge regression is the maximum a posteriori estimator under a zero-mean Gaussian prior on the coefficients, since the log of a Gaussian density gives a quadratic penalty. Lasso corresponds to a zero-mean Laplace (double-exponential) prior, whose log density is proportional to the absolute value. The Laplace density is sharply peaked at zero and its log has a kink there, so the MAP estimate can land exactly at zero; the Gaussian log density is smooth at zero, which is why ridge shrinks coefficients without ever zeroing them. Elastic net corresponds to a hybrid prior that combines both shapes.
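To make the ridge case explicit, suppose $y \mid \beta \sim \mathcal{N}(X\beta, \sigma^2 I)$ with prior $\beta \sim \mathcal{N}(0, \tau^2 I)$. Up to an additive constant, the negative log posterior is

$$-\log p(\beta \mid y) = \frac{1}{2\sigma^2} \lVert y - X\beta \rVert_2^2 + \frac{1}{2\tau^2} \lVert \beta \rVert_2^2 + \text{const},$$

so the MAP estimate is exactly the ridge solution with $\lambda = \sigma^2 / \tau^2$: a tighter prior (smaller $\tau^2$) means more shrinkage.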
This perspective is more than a curiosity. It connects shrinkage to hierarchical Bayesian models, where the prior variance can itself be estimated from data, leading to empirical Bayes shrinkage estimators that adapt the amount of shrinkage automatically.
Every shrinkage method has a tuning parameter that controls how much the coefficients are pulled toward zero. Too little shrinkage and the estimator behaves like OLS, with high variance. Too much and the estimator collapses to the shrinkage target, with high bias. The standard recipe for choosing $\lambda$ is k-fold cross-validation: fit the model at a grid of $\lambda$ values, average the held-out error across folds, and pick the $\lambda$ that minimizes prediction error. A common refinement is the one-standard-error rule, which selects the largest $\lambda$ whose cross-validated error is within one standard error of the minimum, biasing the choice toward simpler models.
For penalized regression specifically, packages such as glmnet and scikit-learn compute the entire regularization path efficiently and let users tune $\lambda$ with cross-validation in a single call.
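A minimal sketch of that recipe, using scikit-learn's `LassoCV` on synthetic data (the data-generating process here is made up for illustration), with the one-standard-error rule applied by hand on top of the cross-validation path:

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(3)
n, p = 200, 50
X = rng.normal(size=(n, p))
y = X[:, :5] @ np.array([3.0, -2.0, 1.5, 1.0, -1.0]) + rng.normal(size=n)

cv = LassoCV(cv=5, random_state=0).fit(X, y)

# mse_path_ has one row per alpha and one column per fold
mean_mse = cv.mse_path_.mean(axis=1)
se_mse = cv.mse_path_.std(axis=1) / np.sqrt(cv.mse_path_.shape[1])

best = np.argmin(mean_mse)
threshold = mean_mse[best] + se_mse[best]
# one-standard-error rule: largest alpha whose CV error is within one SE of the minimum
alpha_1se = cv.alphas_[mean_mse <= threshold].max()

print(f"alpha minimizing CV error: {cv.alpha_:.4f}")
print(f"alpha from one-SE rule:    {alpha_1se:.4f}")
```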
The word shrinkage turns up in several other corners of statistical learning, each with the same underlying idea: pull an unstable estimate toward something more conservative.
Jerome Friedman's 2001 paper Greedy Function Approximation: A Gradient Boosting Machine introduced shrinkage as a regularization knob for gradient boosting. After fitting each base learner, the model adds it to the ensemble multiplied by a small constant $\nu \in (0, 1]$, called the learning rate or shrinkage parameter:
$$F_m(x) = F_{m-1}(x) + \nu \cdot h_m(x).$$
Smaller values of $\nu$ generalize better but require more boosting iterations. In practice, learning rates of $0.01$ to $0.1$ combined with thousands of trees are standard in implementations such as XGBoost, LightGBM, and CatBoost. The $\nu$ parameter plays a role almost identical to $\lambda$ in penalized regression.
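A minimal hand-rolled illustration of the update, assuming least-squares loss and depth-2 regression trees as base learners on a synthetic sine-wave problem (nothing here beyond the update rule itself comes from Friedman's paper): on this toy problem the smaller learning rate, given enough rounds, typically achieves the better held-out error.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(4)

def make_data(n):
    X = rng.uniform(-3, 3, size=(n, 1))
    return X, np.sin(X[:, 0]) + 0.3 * rng.normal(size=n)

X_tr, y_tr = make_data(300)
X_te, y_te = make_data(300)

def boost(nu, n_rounds):
    """Least-squares gradient boosting with shrinkage: F_m = F_{m-1} + nu * h_m."""
    F_tr = np.full_like(y_tr, y_tr.mean())          # F_0 is a constant fit
    F_te = np.full_like(y_te, y_tr.mean())
    for _ in range(n_rounds):
        residual = y_tr - F_tr                      # negative gradient of squared error
        h = DecisionTreeRegressor(max_depth=2).fit(X_tr, residual)
        F_tr += nu * h.predict(X_tr)                # shrunken update
        F_te += nu * h.predict(X_te)
    return np.mean((y_te - F_te) ** 2)              # held-out error

for nu, rounds in [(1.0, 50), (0.1, 500)]:
    print(f"nu = {nu:>4}, {rounds:3d} rounds: test MSE = {boost(nu, rounds):.3f}")
```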
The sample covariance matrix is unbiased but extremely noisy when the number of variables is large relative to the sample size. Olivier Ledoit and Michael Wolf's 2004 paper Honey, I Shrunk the Sample Covariance Matrix proposed estimating covariance by a convex combination
$$\hat\Sigma = \delta F + (1 - \delta) S,$$
where $S$ is the sample covariance, $F$ is a structured target such as a scalar multiple of the identity, and $\delta \in [0, 1]$ is an optimally chosen shrinkage intensity. The Ledoit-Wolf estimator is a workhorse in finance, where covariance matrices feed mean-variance portfolio optimization, and it is implemented in scikit-learn's covariance module.
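A small illustration with scikit-learn's `LedoitWolf` estimator, using a synthetic dataset whose true covariance is the identity so the estimation error can be measured directly (the dimensions and the identity truth are arbitrary choices for this sketch):

```python
import numpy as np
from sklearn.covariance import LedoitWolf, empirical_covariance

rng = np.random.default_rng(5)
p, n = 40, 60                                  # many variables relative to the sample size
true_cov = np.eye(p)                           # assumed truth, for illustration only
X = rng.multivariate_normal(np.zeros(p), true_cov, size=n)

S = empirical_covariance(X)                    # noisy sample covariance
lw = LedoitWolf().fit(X)                       # shrunk estimate delta*F + (1-delta)*S

def err(est):
    return np.linalg.norm(est - true_cov, "fro")

print(f"estimated shrinkage intensity delta: {lw.shrinkage_:.2f}")
print(f"Frobenius error, sample covariance:  {err(S):.2f}")
print(f"Frobenius error, Ledoit-Wolf:        {err(lw.covariance_):.2f}")
```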
Deep learning relies on shrinkage in two places that are usually not labeled as such. The first is weight decay, which adds an L2 penalty on the network parameters during training. The second is implicit shrinkage from early stopping, which halts gradient descent before the parameters reach their unregularized optimum. Both pull the learned weights toward zero, and both improve generalization on overparameterized models.
Ilya Loshchilov and Frank Hutter's Decoupled Weight Decay Regularization paper, presented at ICLR 2019, pointed out that L2 regularization and weight decay are equivalent for plain SGD but not for adaptive optimizers like Adam, where the per-parameter learning rates distort the L2 gradient. Their fix, AdamW, applies weight decay as a separate step that is independent of the adaptive learning rate. AdamW has since become the standard optimizer for training large language models and vision transformers.
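The difference is easiest to see in the update rules themselves. Below is a minimal single-weight sketch (illustrative hyperparameter defaults, no learning-rate schedule): the first function folds the decay term into the gradient before the adaptive scaling, the second applies it as a separate shrinkage step, following the decoupled recipe.

```python
import numpy as np

def adam_l2_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, wd=0.01):
    """Adam with L2 regularization: the decay term enters the gradient,
    so it is rescaled by the adaptive denominator sqrt(v_hat)."""
    g = g + wd * w                              # L2 penalty folded into the gradient
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

def adamw_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, wd=0.01):
    """AdamW: weight decay is applied as a separate multiplicative shrinkage step,
    untouched by the adaptive scaling."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps) - lr * wd * w
    return w, m, v

# One step on a weight of 1.0 with zero data gradient: Adam+L2's adaptive scaling
# normalizes the decay gradient, so the weight drops by roughly the full learning
# rate; AdamW applies the intended small decay lr * wd.
print(adam_l2_step(1.0, 0.0, m=0.0, v=0.0, t=1)[0])   # ~0.999
print(adamw_step(1.0, 0.0, m=0.0, v=0.0, t=1)[0])     # ~0.99999
```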
| Setting | Method | Shrinks toward | Tuning parameter |
|---|---|---|---|
| Multivariate mean estimation | James-Stein | origin or grand mean | data-driven |
| Linear regression | Ridge | zero vector | $\lambda$ |
| Linear regression | Lasso | zero vector (sparse) | $\lambda$ |
| Linear regression | Elastic net | zero vector | $\lambda$, $\alpha$ |
| Tree ensembles | Gradient boosting shrinkage | the previous model $F_{m-1}$ (no update) | learning rate $\nu$ |
| Covariance estimation | Ledoit-Wolf | structured target $F$ | $\delta$ |
| Deep learning | Weight decay / AdamW | zero weights | decay coefficient |
| Iterative training | Early stopping | initialization | stopping epoch |
Shrinkage is not free. The added bias is real, and the choice of penalty matters. Ridge regression keeps every variable in the model and can be a poor choice when interpretability and sparsity are the goal. Lasso achieves sparsity but can be unstable in the presence of correlated features and is limited to selecting at most $n$ variables in the $p \gg n$ regime. Elastic net partially fixes both problems but introduces a second tuning parameter that has to be chosen.
Most shrinkage methods implicitly assume that the true coefficients are small, sparse, or otherwise compressible. When the underlying signal is genuinely large in many directions, shrinkage can hurt more than it helps. The choice of $\lambda$ is critical: too aggressive a penalty discards real signal, while too weak a penalty leaves variance high. Cross-validation handles this in most cases, but with very small datasets the cross-validated estimate of the optimal $\lambda$ is itself noisy, so the selected model can change with the particular fold assignment.
A more subtle pitfall is the pre-processing step. Penalty terms like $\lambda \lVert \beta \rVert_2^2$ depend on the scale of each predictor, so features should typically be standardized before fitting penalized regression. Forgetting this step is a common source of mysterious results.
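A common way to avoid the mistake, shown here as a small scikit-learn sketch on made-up data with wildly different feature scales, is to put the standardization inside a pipeline so it is learned from the training data alongside the penalized fit:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 3)) * np.array([1.0, 100.0, 0.01])   # very different scales
y = X @ np.array([1.0, 0.01, 100.0]) + rng.normal(size=100)

# Standardizing inside the pipeline puts every feature on the same scale before the
# penalty is applied, and keeps the scaling step inside any cross-validation folds.
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X, y)
print(model.named_steps["ridge"].coef_)
```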
Imagine you're trying to make a paper airplane. You have a lot of different folds you can make, and you're not sure which combination will make the best airplane. If you copy every tiny crease and bend from one lucky test flight, you end up with a complicated, over-folded plane that flies badly the next time you throw it.
Shrinkage in machine learning is like gently flattening your folds back toward a plain, simple plane, keeping only as much of each fold as has really earned its place; some folds get smoothed away completely. In machine learning, pulling the model's settings back toward simple values in the same way stops the model from copying random quirks in the data, so it makes better predictions on new data and is easier to understand.