Ridge regression is a method of estimating the coefficients of a linear regression model that adds a penalty proportional to the sum of squared coefficients to the ordinary least squares objective. The penalty deliberately introduces bias into the parameter estimates so as to reduce their variance, particularly when the predictor variables are highly correlated or when the number of predictors approaches or exceeds the sample size. Ridge regression was introduced by Arthur E. Hoerl and Robert W. Kennard in their 1970 paper Ridge Regression: Biased Estimation for Nonorthogonal Problems, published in Technometrics, although the underlying mathematics had appeared earlier in Andrey Tikhonov's work on the regularization of ill-posed inverse problems [1][2]. In the applied mathematics literature, the same procedure is usually called Tikhonov regularization or L2 regularization, and the three names refer to identical estimators when the regularization matrix is a scalar multiple of the identity.
The core idea is simple. When the design matrix has columns that are nearly linearly dependent, the OLS estimator becomes numerically unstable: small changes in the data produce large changes in the estimated coefficients, and individual coefficients can take on extreme values that have little physical meaning. Ridge regression replaces the unconstrained minimization with a constrained one, requiring the squared L2 norm of the coefficient vector to remain below some bound. Equivalently, it adds a quadratic penalty term to the objective. The result is a unique, stable solution that exists for any design matrix, including the rank-deficient case where the OLS solution is not unique. The price paid for this stability is that every coefficient is shrunk toward zero, so the ridge estimator is biased even when the linear model is correctly specified. Ridge regression is one of the foundational techniques in modern machine learning and applied statistics. It serves as the conceptual ancestor of lasso regression, elastic net, and the broad family of penalized likelihood methods, and the same penalty term reappears in neural network training as L2 weight decay, in Gaussian process regression as the observation noise variance, and in support vector machines as the margin term.
The motivation for ridge regression begins with a familiar weakness of ordinary least squares. Given n observations of a response variable y and p predictor variables stored in a design matrix X, the OLS estimator yields beta_hat = (X^T X)^{-1} X^T y, provided X^T X is invertible [3]. Under the Gauss-Markov assumptions, this estimator is unbiased and has the smallest variance among all linear unbiased estimators. The trouble starts when the columns of X are highly correlated, a situation called multicollinearity. The matrix X^T X then has very small eigenvalues, and inverting it amplifies any noise in the data by the inverse of those eigenvalues. Individual coefficients may flip sign with small data perturbations, take values that contradict known scientific relationships, or produce predictions that vary wildly outside the training range. The situation is even more extreme when p exceeds n: X^T X is then rank-deficient, so the OLS solution is not unique.
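The instability is easy to reproduce. The sketch below, a minimal NumPy illustration on a synthetic two-predictor dataset (the data and noise levels are arbitrary choices, not drawn from any reference), refits OLS after tiny perturbations of the response and shows the individual coefficients swinging while their sum stays stable:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100

# Two nearly collinear predictors: x2 is x1 plus a tiny amount of noise.
x1 = rng.normal(size=n)
x2 = x1 + 0.001 * rng.normal(size=n)
X = np.column_stack([x1, x2])
y = x1 + x2 + 0.1 * rng.normal(size=n)  # true coefficients are (1, 1)

# Refit OLS after small perturbations of the response.
for trial in range(3):
    y_pert = y + 0.01 * rng.normal(size=n)
    beta, *_ = np.linalg.lstsq(X, y_pert, rcond=None)
    # Individual coefficients swing wildly; their sum stays near 2.
    print(f"beta = {beta}, sum = {beta.sum():.3f}")
```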
Hoerl and Kennard's 1970 paper proposed adding a small positive constant lambda to each diagonal element of X^T X before inverting it [1]. The resulting estimator is well-defined for any X, has uniformly smaller variance than OLS for any positive lambda, and admits a clean geometric interpretation as the constrained minimizer of the residual sum of squares subject to a bound on the norm of the coefficient vector. The name ridge comes from the visual interpretation of the diagonal addition as building a ridge along the principal diagonal of the matrix being inverted, lifting it away from singularity.
The ridge regression estimator is the coefficient vector that minimizes the penalized residual sum of squares:
beta_hat_ridge = argmin_beta { ||y - X beta||^2 + lambda ||beta||^2 }
where ||y - X beta||^2 is the usual sum of squared residuals, ||beta||^2 is the squared L2 norm of the coefficient vector, and lambda is a non-negative scalar known as the regularization parameter, penalty parameter, ridge parameter, or shrinkage parameter [4]. When lambda is zero, the estimator reduces to ordinary least squares. As lambda grows, the penalty term becomes more dominant and pulls the estimated coefficients toward zero.
The optimization problem has a unique closed-form solution obtained by setting the gradient of the objective to zero and solving for beta:
beta_hat_ridge = (X^T X + lambda I)^{-1} X^T y
where I is the p by p identity matrix [4]. The added lambda I term shifts every eigenvalue of X^T X upward by lambda, which guarantees that the matrix is invertible for any positive lambda regardless of whether X^T X itself is invertible. Ridge regression therefore extends naturally to the case where p exceeds n.
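A direct NumPy translation of the closed form takes a few lines. This is a sketch rather than a production implementation; solving the linear system is numerically preferable to forming the explicit inverse:

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Ridge coefficients: solve (X^T X + lambda I) beta = X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
```

At lam = 0 this recovers the OLS solution whenever X^T X is invertible.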
An equivalent formulation expresses ridge regression as a constrained optimization problem: minimize ||y - X beta||^2 subject to ||beta||^2 <= t for some constant t. By Lagrangian duality, this constrained problem is equivalent to the penalized one for an appropriate value of lambda [4]. The constrained form makes the geometric content of ridge regression more transparent: the OLS objective forms elliptical level sets in coefficient space, while the constraint ||beta||^2 <= t carves out a sphere centered at the origin.
The behavior of ridge regression becomes especially clear through the singular value decomposition of the design matrix. Write X = U D V^T, where U and V are orthogonal matrices and D is a diagonal matrix containing the singular values d_1 >= d_2 >= ... >= d_p >= 0. The ridge fitted values can be written as a sum over the columns of U with each weight scaled by the shrinkage factor d_j^2 / (d_j^2 + lambda) [4]. Components with large singular values (well-determined directions in the predictor space) are shrunk only slightly, while components with small singular values (directions that the data barely constrains) are shrunk heavily. Ridge regression effectively damps the contribution of the directions in which the data is most uninformative, which is exactly where OLS pays the highest variance penalty. This SVD view also leads to the notion of effective degrees of freedom for ridge regression, given by df(lambda) = sum_j d_j^2 / (d_j^2 + lambda), which equals p when lambda is zero and decreases monotonically toward zero as lambda grows.
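The shrinkage factors and effective degrees of freedom can be read directly off the singular values, as in this small sketch (the helper name is illustrative):

```python
import numpy as np

def ridge_df(X, lam):
    """Effective degrees of freedom df(lambda) = sum_j d_j^2 / (d_j^2 + lambda)."""
    d = np.linalg.svd(X, compute_uv=False)  # singular values d_1 >= ... >= d_p
    shrinkage = d**2 / (d**2 + lam)         # per-direction shrinkage factors
    return shrinkage.sum()
```

Evaluating this over a grid of penalties traces the monotone decrease of df(lambda) from p toward zero described above.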
Ridge regression is not invariant to the scale of the predictor variables [4]. A predictor measured in millimeters will receive a different ridge coefficient than the same predictor measured in meters, because the L2 penalty treats all coefficients on equal footing. Standard practice is to center every predictor at zero and scale it to unit variance before fitting the ridge estimator. The intercept term is conventionally excluded from the penalty, since penalizing it would force the predicted response toward zero whenever the predictors take their mean values.
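In practice the standardization is best done inside a modeling pipeline, so that test data is scaled with statistics learned from the training data only. A minimal scikit-learn sketch (the dataset and penalty value are arbitrary):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

# StandardScaler centers and scales each predictor; scikit-learn's Ridge
# fits an unpenalized intercept by default, matching the convention above.
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
model.fit(X, y)
```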
The theoretical justification for ridge regression rests on the bias-variance decomposition of an estimator's mean squared error. OLS is unbiased, so its mean squared error equals its variance, which is large when the design matrix is poorly conditioned. Ridge regression is biased (with bias growing with lambda), but its variance is uniformly smaller than that of OLS for any positive lambda [5]. The total mean squared error therefore traces out a U-shaped curve as lambda increases, and the optimal lambda balances bias against variance. Hoerl and Kennard's central theoretical result is that there always exists a positive lambda for which the ridge estimator achieves strictly smaller mean squared error than OLS [1]. The optimal lambda depends on the unknown true coefficient vector and on the noise variance, so it cannot be computed exactly from data, but its existence is guaranteed by the algebra. This is one of the earliest demonstrations of the now-familiar machine learning principle that intentionally biased estimators can outperform unbiased ones in overall predictive accuracy.
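The U-shaped mean squared error curve is easy to observe in simulation. The Monte Carlo sketch below (a synthetic, deliberately ill-conditioned design with arbitrary parameter values, intended only to illustrate the Hoerl-Kennard result) estimates the coefficient MSE of the ridge estimator across a grid of penalties; some positive lambda beats lambda = 0:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma = 50, 10, 1.0
beta_true = rng.normal(size=p)
# Scale the columns so that X^T X is poorly conditioned.
X = rng.normal(size=(n, p)) * np.linspace(1.0, 0.05, p)

lambdas = [0.0, 0.01, 0.1, 1.0, 10.0]
mse = {lam: 0.0 for lam in lambdas}
reps = 500
for _ in range(reps):  # fresh noise draws, fixed design
    y = X @ beta_true + sigma * rng.normal(size=n)
    for lam in lambdas:
        b = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
        mse[lam] += np.sum((b - beta_true) ** 2) / reps

for lam, err in mse.items():
    print(f"lambda = {lam:>5}: estimated MSE = {err:.3f}")
```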
The geometric picture of ridge regression in two dimensions is one of the most pedagogically valuable diagrams in regularization theory [4]. The OLS objective has elliptical level sets in the (beta_1, beta_2) plane, with the OLS solution at the center of the ellipses. The ridge constraint ||beta||^2 <= t draws a circle of radius sqrt(t) centered at the origin, and the constrained ridge solution lies where the smallest OLS contour first touches the constraint circle. Because the constraint region is a circle (or sphere in higher dimensions), every direction is treated equally and there is no reason for the optimal point on the boundary to lie on a coordinate axis. Ridge regression therefore shrinks all coefficients but rarely sets any of them exactly to zero, which is why it does not perform variable selection: the constraint region has no corners. By contrast, lasso regression uses an L1 constraint that draws a diamond whose vertices lie on the coordinate axes. The optimal point frequently lies at a vertex, setting one or more coefficients to exactly zero and producing a sparse solution.
Ridge regression has a clean Bayesian interpretation [6]. Suppose y_i = x_i^T beta + epsilon_i with errors drawn independently from N(0, sigma^2). Place a multivariate normal prior on the coefficient vector with mean zero and covariance tau^2 I, so each coefficient is independently distributed as N(0, tau^2). The posterior distribution of beta given the data is then also multivariate normal, and its mean is exactly the ridge regression estimator with lambda equal to sigma^2 divided by tau^2. Because a normal posterior is symmetric and unimodal, the posterior mean coincides with the posterior mode, so the ridge estimate is also the maximum a posteriori (MAP) estimate. The penalty parameter lambda therefore has a precise probabilistic meaning as the ratio of the noise variance to the prior variance of the coefficients. The Bayesian view also explains why ridge regression and Gaussian process regression with a linear kernel produce identical posterior means.
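The correspondence can be checked numerically. The sketch below (arbitrary synthetic data; sigma and tau denote the noise and prior standard deviations) computes the Gaussian posterior mean directly from Bayes' rule and confirms that it equals the ridge estimator at lambda = sigma^2 / tau^2:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 40, 5
sigma, tau = 0.5, 2.0  # noise sd, prior sd
X = rng.normal(size=(n, p))
y = X @ rng.normal(scale=tau, size=p) + sigma * rng.normal(size=n)

# Ridge estimate at lambda = sigma^2 / tau^2.
lam = sigma**2 / tau**2
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Gaussian posterior: precision = X^T X / sigma^2 + I / tau^2,
# mean = Cov @ X^T y / sigma^2.
cov_post = np.linalg.inv(X.T @ X / sigma**2 + np.eye(p) / tau**2)
beta_post = cov_post @ X.T @ y / sigma**2

assert np.allclose(beta_ridge, beta_post)
```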
The penalty parameter lambda must be chosen before the ridge estimator can be computed. Most modern practice selects lambda by resampling or by an analytical proxy for predictive error, rather than by the original ridge trace heuristic of Hoerl and Kennard. The most general approach is K-fold cross-validation: the data are partitioned into K subsets, and for each candidate value of lambda and each fold the model is fit on the remaining K-1 folds and evaluated on the held-out fold. The K validation errors are averaged, and the lambda minimizing the cross-validated error is selected. K is commonly 5 or 10.
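A standard way to carry this out in Python is a grid search over a logarithmic range of penalties; the scikit-learn sketch below uses an arbitrary dataset, grid, and fold count for illustration:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=200, n_features=30, noise=10.0, random_state=0)

# 5-fold cross-validation over a logarithmic grid of penalty values.
search = GridSearchCV(Ridge(), {"alpha": np.logspace(-3, 3, 13)},
                      cv=5, scoring="neg_mean_squared_error")
search.fit(X, y)
print(search.best_params_["alpha"])
```

Note that scikit-learn names the penalty parameter alpha rather than lambda.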
In the special case where K equals n, cross-validation becomes leave-one-out cross-validation (LOOCV). For ridge regression, LOOCV admits a remarkable computational shortcut: the leave-one-out residuals can be computed exactly from a single fit on the full data using the diagonal entries of the hat matrix [4]. This makes LOOCV essentially as cheap as a single ridge fit, and it is the default strategy in libraries such as scikit-learn's RidgeCV [7].
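The shortcut itself is short enough to state in code: with hat matrix H_lambda = X (X^T X + lambda I)^{-1} X^T, the leave-one-out residual for observation i equals the ordinary residual divided by 1 - H_ii. A NumPy sketch of this identity (an illustrative helper with no intercept handling):

```python
import numpy as np

def ridge_loocv_mse(X, y, lam):
    """Exact leave-one-out MSE from a single ridge fit.

    Uses y_i - f_{-i}(x_i) = (y_i - f(x_i)) / (1 - H_ii),
    with H = X (X^T X + lambda I)^{-1} X^T.
    """
    p = X.shape[1]
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
    residuals = y - H @ y
    return np.mean((residuals / (1.0 - np.diag(H))) ** 2)
```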
Generalized cross-validation (GCV), introduced by Golub, Heath, and Wahba in 1979, replaces the data-dependent diagonal entries of the hat matrix in the LOOCV formula with their average [8]. The resulting statistic GCV(lambda) = (1/n) ||y - y_hat||^2 / (1 - tr(H_lambda)/n)^2 is rotationally invariant and easier to compute than LOOCV. GCV is asymptotically optimal under broad conditions and is the standard choice in many applied settings. Other criteria include Mallows' Cp, AIC, and BIC, all adapted to ridge regression by substituting the effective degrees of freedom df(lambda) for the count of parameters.
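GCV is cheapest to evaluate through the SVD, since both the fitted values and tr(H_lambda) come directly from the singular values. A sketch (the helper name is illustrative):

```python
import numpy as np

def ridge_gcv(X, y, lam):
    """GCV(lambda) = (1/n) ||y - y_hat||^2 / (1 - tr(H_lambda)/n)^2 via the SVD."""
    n = X.shape[0]
    U, d, _ = np.linalg.svd(X, full_matrices=False)
    shrink = d**2 / (d**2 + lam)      # shrinkage factors d_j^2 / (d_j^2 + lambda)
    y_hat = U @ (shrink * (U.T @ y))  # H_lambda y computed without any inverse
    trace_H = shrink.sum()            # tr(H_lambda) = df(lambda)
    return np.mean((y - y_hat) ** 2) / (1.0 - trace_H / n) ** 2
```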
Ridge regression is one of several penalized regression methods that have become standard tools. The table below compares the four most widely used members of the family.
| Method | Penalty | Solution | Sparsity | Strengths | Weaknesses |
|---|---|---|---|---|---|
| Ordinary Least Squares | None | Closed form when X^T X invertible | Dense | Unbiased, simple | Unstable under multicollinearity, undefined when p > n |
| Ridge Regression | L2 (sum of squared coefficients) | Closed form for any X | Dense, all coefficients shrunk | Stable, handles p > n, simple to tune | No variable selection, all features retained |
| Lasso Regression | L1 (sum of absolute coefficients) | Iterative (LARS, coordinate descent) | Sparse, many coefficients exactly zero | Performs feature selection, interpretable | Unstable choice among correlated features, biased estimates of selected coefficients |
| Elastic Net | Convex combination of L1 and L2 | Iterative coordinate descent | Sparse, but groups correlated features together | Combines variable selection with multicollinearity handling | Two hyperparameters to tune |
Ridge and lasso occupy opposite corners of the regularization design space. Ridge keeps every variable but shrinks all of them; lasso eliminates many variables but leaves the survivors close to their unpenalized values. The choice often depends on whether the underlying effects are spread evenly over many predictors (favoring ridge) or concentrated in a small subset (favoring lasso). When predictors come in correlated groups, lasso tends to arbitrarily pick one representative, while ridge spreads the coefficient mass evenly across the group. Elastic net was introduced by Zou and Hastie to combine both methods, retaining variable selection while behaving more gracefully on correlated predictors [9].
The most influential extension of ridge regression is its kernelization. Standard ridge regression fits a hyperplane in the original feature space. Kernel ridge regression replaces the linear model with a function in a possibly infinite-dimensional reproducing kernel Hilbert space (RKHS), allowing it to fit smooth nonlinear relationships [10]. As in support vector machines, the algorithm relies only on inner products between feature vectors, evaluated through a kernel function k(x, x_prime). Given a positive-definite kernel k, the representer theorem guarantees that the minimizer can be written as f_hat(x) = sum_i alpha_i k(x_i, x), where the coefficients alpha satisfy the linear system (K + lambda I) alpha = y and K is the n by n Gram matrix with entries K_ij = k(x_i, x_j).
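The fit-and-predict cycle is short enough to write out in full. The sketch below implements kernel ridge with a Gaussian RBF kernel (the helper name is illustrative; scikit-learn's KernelRidge performs the same computation):

```python
import numpy as np
from scipy.spatial.distance import cdist

def kernel_ridge(X_train, y_train, X_test, lam, gamma):
    """Kernel ridge regression with k(x, x') = exp(-gamma ||x - x'||^2)."""
    # Solve (K + lambda I) alpha = y on the training Gram matrix.
    K = np.exp(-gamma * cdist(X_train, X_train, "sqeuclidean"))
    alpha = np.linalg.solve(K + lam * np.eye(len(y_train)), y_train)
    # Predict with f(x) = sum_i alpha_i k(x_i, x).
    K_test = np.exp(-gamma * cdist(X_test, X_train, "sqeuclidean"))
    return K_test @ alpha
```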
The table below summarizes the most commonly used kernels in kernel ridge regression and the kinds of functions they can represent.
| Kernel | Form | Hyperparameters | Typical use cases |
|---|---|---|---|
| Linear | k(x, x_prime) = x^T x_prime | None | Reduces to standard ridge regression |
| Polynomial | k(x, x_prime) = (gamma x^T x_prime + r)^d | Degree d, gamma, r | Interactions, low-order nonlinearities |
| Gaussian RBF | k(x, x_prime) = exp(-gamma \|\|x - x_prime\|\|^2) | gamma | General-purpose default for smooth nonlinear functions |
| Laplacian | k(x, x_prime) = exp(-gamma \|\|x - x_prime\|\|_1) | gamma | Less smooth functions, robust to sharp local variation |
| Matern | Family parameterized by smoothness nu | nu, length scale | Geostatistics, spatial models |
| Sigmoid | k(x, x_prime) = tanh(gamma x^T x_prime + r) | gamma, r | Mostly historical interest |
Kernel ridge regression has the same closed-form structure as standard ridge regression, only the matrix to be inverted is n by n rather than p by p. This is a major advantage when p is much larger than n, but the cost grows cubically with n, so it scales poorly to very large datasets. Approximate methods such as the Nystrom approximation and random Fourier features reduce this cost. Kernel ridge regression is mathematically equivalent to the predictive mean of a Gaussian process with the same kernel and observation noise variance equal to lambda.
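Random Fourier features give a concrete sense of how the cubic cost is avoided: the kernel is approximated by an explicit D-dimensional feature map, after which an ordinary ridge fit costs O(n D^2) rather than O(n^3). A sketch of the standard construction for the Gaussian RBF kernel (the helper name and defaults are illustrative):

```python
import numpy as np

def rff_ridge(X_train, y_train, X_test, lam, gamma, D=500, seed=0):
    """Approximate RBF kernel ridge with D random Fourier features.

    z(x) = sqrt(2/D) cos(W x + b) with W ~ N(0, 2*gamma I), b ~ U(0, 2*pi)
    gives E[z(x) . z(x')] = exp(-gamma ||x - x'||^2).
    """
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(D, X_train.shape[1]))
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    featurize = lambda X: np.sqrt(2.0 / D) * np.cos(X @ W.T + b)

    Z = featurize(X_train)  # n x D feature matrix
    beta = np.linalg.solve(Z.T @ Z + lam * np.eye(D), Z.T @ y_train)
    return featurize(X_test) @ beta
```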
Ridge regression is the workhorse for several families of applied problems. The first consists of classical regression problems where multicollinearity makes the OLS solution unstable. Economic indicators, anthropometric measurements, and sensor readings often move together, and ridge regression provides predictions robust to small changes in the data while retaining the interpretability of a linear model. The second consists of high-dimensional problems where p exceeds n. Genomic association studies routinely involve hundreds of thousands or millions of single nucleotide polymorphisms (SNPs) measured on a few thousand individuals, and ridge regression has been a workhorse for genomic prediction since the early 2000s [11]. The same situation arises in neuroimaging, where the number of voxels in a brain scan vastly exceeds the number of subjects, and in chemometrics, where spectral measurements at thousands of wavelengths predict a small number of chemical concentrations. A third family requires interpretable shrinkage: when all features are believed to contribute meaningfully to the response, lasso's tendency to discard most of them is unappealing. Ridge regression also appears as a building block inside many other algorithms: smoothing splines and generalized additive models are equivalent to ridge regression in basis expansions of the predictors, and most neural network training procedures rely on L2 weight decay, mathematically the same as adding a ridge penalty for every weight in the network.
In Python, scikit-learn offers the Ridge class with several solver choices including singular value decomposition, Cholesky factorization, conjugate gradient, and stochastic average gradient descent [7]. The companion RidgeCV class wraps Ridge with built-in efficient leave-one-out cross-validation, and RidgeClassifier and RidgeClassifierCV provide ridge-regularized linear classifiers. In R, the standard implementation is the glmnet package, written by Jerome Friedman, Trevor Hastie, and Robert Tibshirani [12]. The glmnet function fits the entire elastic net regularization path using cyclical coordinate descent. Setting the alpha parameter to zero produces pure ridge regression, while alpha equal to one produces lasso and intermediate values produce elastic net. The cv.glmnet function provides cross-validated selection of the penalty parameter. Other R packages include MASS (with lm.ridge, a direct implementation of Hoerl and Kennard's original method). MATLAB, Stata, and SAS each ship native ridge regression routines as well.
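A minimal usage sketch of the scikit-learn classes mentioned above (the dataset and penalty grid are arbitrary):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, RidgeCV

X, y = make_regression(n_samples=100, n_features=20, noise=5.0, random_state=0)

# RidgeCV uses the efficient leave-one-out shortcut by default.
cv_model = RidgeCV(alphas=np.logspace(-3, 3, 13)).fit(X, y)
print(cv_model.alpha_)  # selected penalty

# Refit at the chosen penalty with an explicit solver.
model = Ridge(alpha=cv_model.alpha_, solver="svd").fit(X, y)
```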
Ridge regression sits at the intersection of many active research areas. The Bayesian connection makes it the simplest example of a regularized maximum likelihood estimator, the kernel connection turns it into the predictive mean of a Gaussian process, and the L2 weight decay in neural networks (mathematically identical to ridge regression on each layer's weights) provides one of the most useful intuitions for understanding why deep networks generalize despite having vastly more parameters than training examples. Ridge regression is also the limiting case of several iterative algorithms: early stopping of gradient descent on the unregularized least squares objective produces estimates that closely resemble ridge estimates, with the iteration count playing the role of an inverse penalty parameter [4]. The classical shrinkage estimators (including the James-Stein estimator) share the same structure of dominating the unbiased estimator by introducing controlled bias.
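The early-stopping connection can be seen in a few lines of NumPy. This is a heuristic illustration on synthetic data; the correspondence lambda ≈ 1/(step × iterations) used below is approximate, not exact:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 100, 20
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

# Gradient descent on the *unpenalized* least squares objective, stopped early.
step = 1.0 / np.linalg.norm(X, 2) ** 2  # stable step size
iters = 20                              # few iterations = strong regularization
beta_gd = np.zeros(p)
for _ in range(iters):
    beta_gd -= step * (X.T @ (X @ beta_gd - y))

# Compare with an explicit ridge fit at the heuristic penalty 1/(step * iters).
lam = 1.0 / (step * iters)
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
print(np.corrcoef(beta_gd, beta_ridge)[0, 1])  # typically close to 1
```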
Ridge regression, despite its virtues, has limitations practitioners should keep in mind. It does not perform variable selection: every predictor receives a nonzero coefficient, so when the goal is to identify a sparse set of important variables, lasso or elastic net is preferable. It assumes a linear relationship between predictors and response in the original or transformed feature space, and cannot capture nonlinear effects without explicit feature engineering or kernelization. The penalty parameter must be tuned, and an inappropriate value can either overshrink the coefficients (producing high bias) or undershrink them (recovering most of the OLS instability). Results also depend on predictor scaling, so a careless choice of standardization can produce misleading coefficient estimates. The Bayesian interpretation imposes a Gaussian prior on the coefficients, which is a strong assumption: if the true coefficients are heavy-tailed or sparse, alternative priors give better performance.