See also: Machine learning terms
Shrinkage in machine learning and statistics is a regularization technique that pulls model coefficient estimates toward zero, or toward some other fixed shrinkage target, to reduce variance at the cost of introducing a small amount of bias. The idea is older than modern machine learning. It comes out of mid-twentieth century mathematical statistics and now sits at the heart of nearly every supervised learning workflow, from penalized linear regression to weight decay in deep neural networks.
Shrinkage methods reduce the effective complexity of a model by discouraging large coefficients. The motivation is the bias-variance tradeoff: unconstrained estimators such as ordinary least squares are unbiased but can have huge variance, especially when the number of features is comparable to the number of observations or when predictors are strongly correlated. A modest amount of shrinkage trades a little bias for a much larger reduction in variance, lowering total mean squared error and helping avoid overfitting. Common shrinkage methods include ridge regression, lasso, and elastic net.
Shrinkage as a formal idea begins with Charles Stein's 1956 paper showing that the maximum likelihood estimator for the mean of a multivariate Gaussian is inadmissible when the dimension is at least three. Willard James and Charles Stein made the result concrete in 1961 by exhibiting a specific estimator that strictly dominates the MLE in mean squared error.
Suppose we observe a single vector $X \sim \mathcal{N}(\theta, \sigma^2 I_d)$ in $d \geq 3$ dimensions and wish to estimate the unknown mean vector $\theta$. The MLE is simply $\hat\theta_{MLE} = X$. The James-Stein estimator instead returns
$$\hat\theta_{JS} = \left(1 - \frac{(d-2)\sigma^2}{\lVert X \rVert^2}\right) X.$$
The leading factor pulls $X$ toward the origin by an amount that depends on how far the observation lies from zero. James and Stein proved that for $d \geq 3$ the risk of $\hat\theta_{JS}$ is uniformly smaller than the risk of $\hat\theta_{MLE}$ for every value of $\theta$. The result is often called the Stein paradox because it shows that pooling unrelated estimation problems can be strictly better than handling each one on its own, even when the parameters share no apparent relationship.
| Estimator | Bias | Variance | MSE risk |
|---|---|---|---|
| MLE $\hat\theta = X$ | 0 | $d \sigma^2$ | $d\sigma^2$ |
| James-Stein $\hat\theta_{JS}$ | non-zero | reduced | strictly less than $d\sigma^2$ for $d \geq 3$ |
| Positive-part JS | non-zero | reduced further | dominates plain JS |
The positive-part variant clips the shrinkage factor at zero so the estimator never flips sign, and it dominates the original James-Stein estimator. Bradley Efron and Carl Morris later popularized the result with examples involving baseball batting averages and small-area estimation, showing that even pragmatic data analysts gain by shrinking. The James-Stein estimator is the historical bridge between classical statistics and the regularization techniques that machine learning relies on today.
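A short simulation makes the domination concrete. The sketch below is a minimal NumPy illustration, not taken from the original papers; the dimension, noise level, true mean, and number of replications are arbitrary choices. It draws repeated observations around a fixed $\theta$ and compares the empirical risk of the MLE, the James-Stein estimator, and the positive-part variant.

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma, n_rep = 10, 1.0, 20_000            # dimension, noise sd, Monte Carlo replications
theta = rng.normal(size=d)                   # an arbitrary true mean vector

X = theta + sigma * rng.normal(size=(n_rep, d))   # one observation per replication

norm_sq = np.sum(X**2, axis=1, keepdims=True)
factor = 1 - (d - 2) * sigma**2 / norm_sq         # James-Stein shrinkage factor

mle = X
js = factor * X
js_plus = np.maximum(factor, 0.0) * X             # positive-part variant clips at zero

def risk(est):
    """Empirical mean squared error over replications."""
    return np.mean(np.sum((est - theta) ** 2, axis=1))

print(f"MLE risk:            {risk(mle):.3f}   (theory: d*sigma^2 = {d * sigma**2:.1f})")
print(f"James-Stein risk:    {risk(js):.3f}")
print(f"Positive-part risk:  {risk(js_plus):.3f}")
```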
The Stein phenomenon set the stage, but shrinkage entered everyday data analysis through penalized linear regression. The basic idea is to add a penalty term to the squared error loss that discourages large coefficients. Different penalties correspond to different shrinkage methods.
Ridge regression, also known as Tikhonov regularization, was introduced by Arthur Hoerl and Robert Kennard in their 1970 Technometrics paper Ridge Regression: Biased Estimation for Nonorthogonal Problems. They were motivated by the practical problem of multicollinearity: when columns of the design matrix are nearly linearly dependent, $X^\top X$ is close to singular and ordinary least squares estimates blow up.
Ridge regression replaces the OLS objective with
$$\hat\beta_{ridge} = \arg\min_\beta \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_2^2,$$
where $\lambda \geq 0$ controls the strength of the L2 penalty. The closed form solution is
$$\hat\beta_{ridge} = (X^\top X + \lambda I)^{-1} X^\top y.$$
Adding $\lambda I$ to the Gram matrix guarantees invertibility for any $\lambda > 0$, even when $X$ has more columns than rows.
The singular value decomposition $X = U D V^\top$ gives a clean geometric picture. Writing $d_j$ for the singular values, the ridge fit can be expressed as
$$X \hat\beta_{ridge} = \sum_j u_j \frac{d_j^2}{d_j^2 + \lambda} u_j^\top y.$$
Each principal component direction is multiplied by a shrinkage factor $d_j^2 / (d_j^2 + \lambda)$. Directions with large singular values are barely touched; directions with small singular values, where the data carries little information, are shrunk aggressively. This is why ridge tends to stabilize estimates without zeroing out features. The effective degrees of freedom $\sum_j d_j^2/(d_j^2+\lambda)$ replace the simple count of parameters and can be used for model selection.
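The shrinkage factors are easy to inspect numerically. The following is a small NumPy sketch on synthetic data (the dimensions and the value of $\lambda$ are arbitrary): it computes the ridge fit through the SVD, checks it against the closed-form solution above, and prints the per-component factors and effective degrees of freedom.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, lam = 100, 8, 10.0                     # rows, columns, ridge penalty (arbitrary)
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

# Ridge fit via the SVD X = U diag(d) V^T
U, d, Vt = np.linalg.svd(X, full_matrices=False)
shrink = d**2 / (d**2 + lam)                 # per-component shrinkage factors
beta_ridge = Vt.T @ ((shrink / d) * (U.T @ y))

# Sanity check against the closed-form solution (X^T X + lam I)^{-1} X^T y
beta_direct = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
assert np.allclose(beta_ridge, beta_direct)

print("shrinkage factors:", np.round(shrink, 3))
print("effective degrees of freedom:", shrink.sum())
```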
Robert Tibshirani's 1996 paper Regression Shrinkage and Selection via the Lasso swapped the L2 penalty for an L1 penalty:
$$\hat\beta_{lasso} = \arg\min_\beta \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_1.$$
The acronym stands for Least Absolute Shrinkage and Selection Operator. The L1 penalty has a sharp corner at zero, which causes some coefficients to be set exactly to zero rather than merely small. Lasso therefore performs simultaneous shrinkage and feature selection, producing sparse models that are easier to interpret.
Unlike ridge, lasso has no closed form. Solutions are computed with coordinate descent, the LARS algorithm of Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani (Annals of Statistics, 2004), or proximal gradient methods. The full lasso path, showing how coefficients enter and leave the model as $\lambda$ varies, can be computed in roughly the same time as a single OLS fit.
Lasso has known weaknesses. When the number of predictors $p$ exceeds the number of observations $n$, lasso can select at most $n$ variables. When predictors are highly correlated, lasso tends to pick one and ignore the others somewhat arbitrarily, which makes the chosen variables unstable across resampled datasets.
Hui Zou and Trevor Hastie introduced the elastic net in 2005 to address those weaknesses. It blends both penalties:
$$\hat\beta_{en} = \arg\min_\beta \lVert y - X\beta \rVert_2^2 + \lambda \left( \alpha \lVert \beta \rVert_1 + (1-\alpha) \lVert \beta \rVert_2^2 \right),$$
where $\alpha \in [0,1]$ mixes lasso ($\alpha = 1$) and ridge ($\alpha = 0$). The elastic net keeps lasso's ability to zero out coefficients while inheriting ridge's grouping behavior: strongly correlated predictors tend to enter or leave the model together rather than being chosen at random.
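As a quick illustration of the three penalties, the sketch below fits ridge, lasso, and elastic net with scikit-learn on synthetic data with a sparse true coefficient vector and counts the exact zeros each method produces. Note that scikit-learn calls the overall penalty strength `alpha` (the $\lambda$ above) and the mixing weight `l1_ratio` (the $\alpha$ above); the values used here are illustrative, not tuned.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.default_rng(2)
n, p = 200, 50
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:5] = [3.0, -2.0, 1.5, 1.0, -1.0]        # sparse truth: 5 active features
y = X @ beta_true + rng.normal(size=n)

models = {
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.1),
    "elastic net": ElasticNet(alpha=0.1, l1_ratio=0.5),
}
for name, model in models.items():
    model.fit(X, y)
    n_zero = int(np.sum(model.coef_ == 0))          # ridge keeps everything; L1 zeros out
    print(f"{name:12s} exact zeros: {n_zero:2d} / {p}")
```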
| Method | Penalty | Closed form | Selects features | Handles correlated predictors | Typical use |
|---|---|---|---|---|---|
| OLS / best subset | none / L0 (combinatorial) | yes / no | best subset only | poorly | small clean problems |
| Ridge | L2 | yes | no | shrinks them together | many correlated weak signals |
| Lasso | L1 | no | yes (sparse) | picks one arbitrarily | sparse signal, $p > n$ |
| Elastic net | mix of L1 and L2 | no | yes | groups them together | grouped predictors, genomics |
Penalized regression has a natural Bayesian reading. Ridge regression is the maximum a posteriori estimator under a zero-mean Gaussian prior on the coefficients, since the log of a Gaussian density gives a quadratic penalty. Lasso corresponds to a zero-mean Laplace (double-exponential) prior, whose log density is proportional to the absolute value. The Laplace density is sharply peaked at zero and its log has a kink there, so the MAP estimate can land exactly at zero; the Gaussian log density is smooth at zero, which is why ridge shrinks coefficients without ever zeroing them. Elastic net corresponds to a hybrid prior that combines both shapes.
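To make the ridge case explicit, suppose $y \mid \beta \sim \mathcal{N}(X\beta, \sigma^2 I)$ with prior $\beta \sim \mathcal{N}(0, \tau^2 I)$. Up to an additive constant, the negative log posterior is

$$-\log p(\beta \mid y) = \frac{1}{2\sigma^2} \lVert y - X\beta \rVert_2^2 + \frac{1}{2\tau^2} \lVert \beta \rVert_2^2 + \text{const},$$

so the MAP estimate is exactly the ridge solution with $\lambda = \sigma^2 / \tau^2$: a tighter prior (smaller $\tau^2$) means more shrinkage.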
This perspective is more than a curiosity. It connects shrinkage to hierarchical Bayesian models, where the prior variance can itself be estimated from data, leading to empirical Bayes shrinkage estimators that adapt the amount of shrinkage automatically.
Every shrinkage method has a tuning parameter that controls how much the coefficients are pulled toward zero. Too little shrinkage and the estimator behaves like OLS, with high variance. Too much and the estimator collapses to the shrinkage target, with high bias. The standard recipe for choosing $\lambda$ is k-fold cross-validation: fit the model at a grid of $\lambda$ values, average the held-out error across folds, and pick the $\lambda$ that minimizes prediction error. A common refinement is the one-standard-error rule, which selects the largest $\lambda$ whose cross-validated error is within one standard error of the minimum, biasing the choice toward simpler models.
For penalized regression specifically, packages such as glmnet and scikit-learn compute the entire regularization path efficiently and let users tune $\lambda$ with cross-validation in a single call.
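A minimal sketch of that recipe, using scikit-learn's `LassoCV` on synthetic data (the data-generating process here is made up for illustration), with the one-standard-error rule applied by hand on top of the cross-validation path:

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(3)
n, p = 200, 50
X = rng.normal(size=(n, p))
y = X[:, :5] @ np.array([3.0, -2.0, 1.5, 1.0, -1.0]) + rng.normal(size=n)

cv = LassoCV(cv=5, random_state=0).fit(X, y)

# mse_path_ has one row per alpha and one column per fold
mean_mse = cv.mse_path_.mean(axis=1)
se_mse = cv.mse_path_.std(axis=1) / np.sqrt(cv.mse_path_.shape[1])

best = np.argmin(mean_mse)
threshold = mean_mse[best] + se_mse[best]
# one-standard-error rule: largest alpha whose CV error is within one SE of the minimum
alpha_1se = cv.alphas_[mean_mse <= threshold].max()

print(f"alpha minimizing CV error: {cv.alpha_:.4f}")
print(f"alpha from one-SE rule:    {alpha_1se:.4f}")
```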
The word shrinkage turns up in several other corners of statistical learning, each with the same underlying idea: pull an unstable estimate toward something more conservative.
Jerome Friedman's 2001 paper Greedy Function Approximation: A Gradient Boosting Machine introduced shrinkage as a regularization knob for gradient boosting. After fitting each base learner, the model adds it to the ensemble multiplied by a small constant $\nu \in (0, 1]$, called the learning rate or shrinkage parameter:
$$F_m(x) = F_{m-1}(x) + \nu \cdot h_m(x).$$
Smaller values of $\nu$ generalize better but require more boosting iterations. In practice, learning rates of $0.01$ to $0.1$ combined with thousands of trees are standard in implementations such as XGBoost, LightGBM, and CatBoost. The $\nu$ parameter plays a role almost identical to $\lambda$ in penalized regression.
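A minimal hand-rolled illustration of the update, assuming least-squares loss and depth-2 regression trees as base learners on a synthetic sine-wave problem (nothing here beyond the update rule itself comes from Friedman's paper): on this toy problem the smaller learning rate, given enough rounds, typically achieves the better held-out error.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(4)

def make_data(n):
    X = rng.uniform(-3, 3, size=(n, 1))
    return X, np.sin(X[:, 0]) + 0.3 * rng.normal(size=n)

X_tr, y_tr = make_data(300)
X_te, y_te = make_data(300)

def boost(nu, n_rounds):
    """Least-squares gradient boosting with shrinkage: F_m = F_{m-1} + nu * h_m."""
    F_tr = np.full_like(y_tr, y_tr.mean())          # F_0 is a constant fit
    F_te = np.full_like(y_te, y_tr.mean())
    for _ in range(n_rounds):
        residual = y_tr - F_tr                      # negative gradient of squared error
        h = DecisionTreeRegressor(max_depth=2).fit(X_tr, residual)
        F_tr += nu * h.predict(X_tr)                # shrunken update
        F_te += nu * h.predict(X_te)
    return np.mean((y_te - F_te) ** 2)              # held-out error

for nu, rounds in [(1.0, 50), (0.1, 500)]:
    print(f"nu = {nu:>4}, {rounds:3d} rounds: test MSE = {boost(nu, rounds):.3f}")
```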
The sample covariance matrix is unbiased but extremely noisy when the number of variables is large relative to the sample size. Olivier Ledoit and Michael Wolf's 2004 paper Honey, I Shrunk the Sample Covariance Matrix proposed estimating covariance by a convex combination
$$\hat\Sigma = \delta F + (1 - \delta) S,$$
where $S$ is the sample covariance, $F$ is a structured target such as a scalar multiple of the identity, and $\delta \in [0, 1]$ is an optimally chosen shrinkage intensity. The Ledoit-Wolf estimator is a workhorse in finance, where covariance matrices feed mean-variance portfolio optimization, and it is implemented in scikit-learn's covariance module.
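A small illustration with scikit-learn's `LedoitWolf` estimator, using a synthetic dataset whose true covariance is the identity so the estimation error can be measured directly (the dimensions and the identity truth are arbitrary choices for this sketch):

```python
import numpy as np
from sklearn.covariance import LedoitWolf, empirical_covariance

rng = np.random.default_rng(5)
p, n = 40, 60                                  # many variables relative to the sample size
true_cov = np.eye(p)                           # assumed truth, for illustration only
X = rng.multivariate_normal(np.zeros(p), true_cov, size=n)

S = empirical_covariance(X)                    # noisy sample covariance
lw = LedoitWolf().fit(X)                       # shrunk estimate delta*F + (1-delta)*S

def err(est):
    return np.linalg.norm(est - true_cov, "fro")

print(f"estimated shrinkage intensity delta: {lw.shrinkage_:.2f}")
print(f"Frobenius error, sample covariance:  {err(S):.2f}")
print(f"Frobenius error, Ledoit-Wolf:        {err(lw.covariance_):.2f}")
```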
Deep learning relies on shrinkage in two places that are usually not labeled as such. The first is weight decay, which adds an L2 penalty on the network parameters during training. The second is implicit shrinkage from early stopping, which halts gradient descent before the parameters reach their unregularized optimum. Both pull the learned weights toward zero, and both improve generalization on overparameterized models.
Ilya Loshchilov and Frank Hutter's Decoupled Weight Decay Regularization paper, presented at ICLR 2019, pointed out that L2 regularization and weight decay are equivalent for plain SGD but not for adaptive optimizers like Adam, where the per-parameter learning rates distort the L2 gradient. Their fix, AdamW, applies weight decay as a separate step that is independent of the adaptive learning rate. AdamW has since become the standard optimizer for training large language models and vision transformers.
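The difference is easiest to see in the update rules themselves. Below is a minimal single-weight sketch (illustrative hyperparameter defaults, no learning-rate schedule): the first function folds the decay term into the gradient before the adaptive scaling, the second applies it as a separate shrinkage step, following the decoupled recipe.

```python
import numpy as np

def adam_l2_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, wd=0.01):
    """Adam with L2 regularization: the decay term enters the gradient,
    so it is rescaled by the adaptive denominator sqrt(v_hat)."""
    g = g + wd * w                              # L2 penalty folded into the gradient
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

def adamw_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, wd=0.01):
    """AdamW: weight decay is applied as a separate multiplicative shrinkage step,
    untouched by the adaptive scaling."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps) - lr * wd * w
    return w, m, v

# One step on a weight of 1.0 with zero data gradient: Adam+L2's adaptive scaling
# normalizes the decay gradient, so the weight drops by roughly the full learning
# rate; AdamW applies the intended small decay lr * wd.
print(adam_l2_step(1.0, 0.0, m=0.0, v=0.0, t=1)[0])   # ~0.999
print(adamw_step(1.0, 0.0, m=0.0, v=0.0, t=1)[0])     # ~0.99999
```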
| Setting | Method | Shrinks toward | Tuning parameter |
|---|---|---|---|
| Multivariate mean estimation | James-Stein | origin or grand mean | data-driven |
| Linear regression | Ridge | zero vector | $\lambda$ |
| Linear regression | Lasso | zero vector (sparse) | $\lambda$ |
| Linear regression | Elastic net | zero vector | $\lambda$, $\alpha$ |
| Tree ensembles | Gradient boosting shrinkage | the previous model $F_{m-1}$ (no update) | learning rate $\nu$ |
| Covariance estimation | Ledoit-Wolf | structured target $F$ | $\delta$ |
| Deep learning | Weight decay / AdamW | zero weights | decay coefficient |
| Iterative training | Early stopping | initialization | stopping epoch |
Shrinkage is not free. The added bias is real, and the choice of penalty matters. Ridge regression keeps every variable in the model and can be a poor choice when interpretability and sparsity are the goal. Lasso achieves sparsity but can be unstable in the presence of correlated features and is limited to selecting at most $n$ variables in the $p \gg n$ regime. Elastic net partially fixes both problems but introduces a second tuning parameter that has to be chosen.
Most shrinkage methods implicitly assume that the true coefficients are small, sparse, or otherwise compressible. When the underlying signal is genuinely large in many directions, shrinkage can hurt more than it helps. The choice of $\lambda$ is critical: too aggressive a penalty discards real signal, while too weak a penalty leaves variance high. Cross-validation handles this in most cases, but with very small datasets the cross-validated estimate of the optimal $\lambda$ is itself noisy, so the selected model can change with the particular fold assignment.
A more subtle pitfall is the pre-processing step. Penalty terms like $\lambda \lVert \beta \rVert_2^2$ depend on the scale of each predictor, so features should typically be standardized before fitting penalized regression. Forgetting this step is a common source of mysterious results.
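A common way to avoid the mistake, shown here as a small scikit-learn sketch on made-up data with wildly different feature scales, is to put the standardization inside a pipeline so it is learned from the training data alongside the penalized fit:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 3)) * np.array([1.0, 100.0, 0.01])   # very different scales
y = X @ np.array([1.0, 0.01, 100.0]) + rng.normal(size=100)

# Standardizing inside the pipeline puts every feature on the same scale before the
# penalty is applied, and keeps the scaling step inside any cross-validation folds.
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X, y)
print(model.named_steps["ridge"].coef_)
```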
Imagine you're trying to make a paper airplane. You have a lot of different folds you can make, and you're not sure which combination will make the best airplane. If you copy every tiny crease and bend from one lucky test flight, you end up with a complicated, over-folded plane that flies badly the next time you throw it.
Shrinkage in machine learning is like gently flattening your folds back toward a plain, simple plane, keeping only as much of each fold as has really earned its place; some folds get smoothed away completely. In machine learning, pulling the model's settings back toward simple values in the same way stops the model from copying random quirks in the data, so it makes better predictions on new data and is easier to understand.