Least squares regression is a statistical method for estimating the parameters of a model by minimizing the sum of squared differences between observed values and the values predicted by the model. It is one of the oldest and most widely used techniques in statistics and machine learning, forming the mathematical backbone of linear regression and many other predictive modeling approaches. The core idea is straightforward: find the set of parameters that makes the model's predictions as close as possible to the actual data, where "closeness" is measured by the sum of squared residuals.
The method was independently developed by Adrien-Marie Legendre and Carl Friedrich Gauss in the early 19th century. Since then, it has expanded into a family of techniques, including ordinary least squares (OLS), weighted least squares (WLS), generalized least squares (GLS), nonlinear least squares, and total least squares, each suited to different data conditions and modeling assumptions.
Imagine you have a bunch of dots on a piece of paper, and you want to draw the single straight line that goes closest to all the dots at the same time. Some dots will be a little above the line, and some will be a little below it. Least squares regression finds the best line by looking at how far each dot is from the line, squaring those distances (so bigger misses count a lot more), and then picking the line that makes the total of all those squared distances as small as possible. It is like finding the fairest middle path through a scatter of points.
The method of least squares has a rich history intertwined with the development of modern statistics and astronomy.
The first published account of the method appeared in Adrien-Marie Legendre's 1805 work, Nouvelles methodes pour la determination des orbites des cometes (New Methods for the Determination of Comet Orbits). An appendix titled "Sur la Methode des moindres quarres" (On the Method of Least Squares) laid out the procedure clearly. Legendre proposed minimizing the sum of squared residuals as a principled way to fit a curve to observational data, and within a decade the technique was adopted across France, Italy, and Prussia as the standard method in astronomy and geodesy.
Carl Friedrich Gauss later claimed he had been using the method since 1795, when he was eighteen years old. He published his approach in 1809 in Theoria Motus Corporum Coelestium (Theory of the Motion of Celestial Bodies), where he used least squares to predict the orbit of the asteroid Ceres after it was discovered by Giuseppe Piazzi in 1801. Gauss's prediction of Ceres's position was remarkably accurate, which brought wide attention to the method. The resulting priority dispute between Legendre and Gauss became one of the most famous in the history of science.
Gauss made a contribution beyond Legendre by connecting the method to probability theory. He showed that if observational errors follow a normal distribution, then the least squares estimator coincides with the maximum likelihood estimator. This connection gave the method a deep theoretical foundation rather than being merely a convenient computational recipe.
Pierre-Simon Laplace provided additional justification around 1810 through the central limit theorem, showing that least squares estimators have desirable large-sample properties even when errors are not exactly normal. By 1822, Gauss had proved that the least squares estimator is optimal among all linear unbiased estimators whenever the errors have zero mean, constant variance, and are uncorrelated; normality is not required. Andrey Markov later rediscovered and restated this result, which is now known as the Gauss-Markov theorem.
The standard linear regression model relates a dependent variable y to a set of independent variables (predictors) through the equation:
y = Xβ + ε
where:
| Symbol | Meaning |
|---|---|
| y | Column vector of n observed response values (n x 1) |
| X | Design matrix of predictor values (n x p), where each row is an observation and each column is a predictor |
| β | Column vector of p unknown parameters to be estimated (p x 1) |
| ε | Column vector of n random error terms (n x 1) |
The first column of X is typically a column of ones to accommodate an intercept term.
Ordinary least squares seeks the parameter vector β that minimizes the residual sum of squares (RSS):
S(β) = Σᵢ (yᵢ - xᵢᵀβ)² = (y - Xβ)ᵀ(y - Xβ) = ||y - Xβ||²
This expression is the squared Euclidean norm of the residual vector. Because the function is quadratic and convex in β, it has a unique global minimum (provided X has full column rank).
To find the minimum, take the gradient of S(β) with respect to β and set it to zero:
∇β S(β) = -2Xᵀ(y - Xβ) = 0
Rearranging gives the normal equation:
XᵀXβ = Xᵀy
If the matrix XᵀX is invertible (which requires X to have full column rank, meaning no perfect multicollinearity among the predictors), then the unique solution is:
β̂ = (XᵀX)⁻¹Xᵀy
This is the ordinary least squares estimator. The term "normal equation" comes from the fact that the residual vector (y - Xβ̂) is orthogonal ("normal") to the column space of X.
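As an illustration, the normal-equation solution can be computed directly with NumPy. This is a minimal sketch on made-up synthetic data (the variable names and simulated values are invented for the example); as discussed later, production code should prefer QR- or SVD-based solvers over forming XᵀX.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: n observations, two predictors plus an intercept column of ones
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])          # design matrix (n x p)
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)  # y = Xβ + ε

# Normal equations: (XᵀX) β̂ = Xᵀy
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)                                    # close to beta_true
```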
For the special case of one predictor variable, y = α + βx + ε, the OLS estimators reduce to closed-form expressions:
β̂ = Σᵢ (xᵢ - x̄)(yᵢ - ȳ) / Σᵢ (xᵢ - x̄)²
α̂ = ȳ - β̂x̄
where x̄ and ȳ are the sample means of x and y respectively. The slope β̂ equals the sample covariance of x and y divided by the sample variance of x.
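A minimal sketch of these two closed-form expressions with NumPy (the small data arrays are made up for illustration):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Slope: sample covariance of x and y divided by sample variance of x
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
# Intercept: α̂ = ȳ - β̂ x̄
intercept = y.mean() - slope * x.mean()
print(slope, intercept)
```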
The OLS estimator has an elegant geometric meaning. The predicted values ŷ = Xβ̂ lie in the column space of X, which is a p-dimensional subspace of n-dimensional space. The OLS solution corresponds to the orthogonal projection of the observed vector y onto this subspace.
The projection matrix (also called the hat matrix) is:
H = X(XᵀX)⁻¹Xᵀ
so that ŷ = Hy. The residual vector ε̂ = y - ŷ = (I - H)y is perpendicular to every column of X. This means that OLS finds the point in the column space of X that is closest to y in the Euclidean sense. Geometrically, it drops a perpendicular from y onto the hyperplane spanned by the columns of X.
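The projection interpretation can be checked numerically: form the hat matrix, project y, and verify that the residuals are orthogonal to every column of X. A sketch on synthetic data (forming H explicitly is only sensible for small n):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 0.5, -2.0]) + rng.normal(scale=0.2, size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T       # hat matrix (n x n)
y_hat = H @ y                              # orthogonal projection of y onto col(X)
residuals = y - y_hat

print(np.allclose(X.T @ residuals, 0.0))   # True: residuals ⟂ column space of X
print(np.allclose(H @ H, H))               # True: H is idempotent, as projections are
```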
Under the assumption that E[ε | X] = 0 (strict exogeneity), the OLS estimator is unbiased:
E[β̂ | X] = β
This means that on average, over repeated sampling, the estimated coefficients equal the true parameter values.
The variance-covariance matrix of the OLS estimator is:
Var(β̂ | X) = σ²(XᵀX)⁻¹
where σ² is the variance of the error terms. An unbiased estimator of σ² is:
s² = ε̂ᵀε̂ / (n - p)
The denominator uses n - p (the degrees of freedom) rather than n to correct for the bias that would otherwise arise from using estimated residuals instead of true errors.
As the sample size n grows, the OLS estimator converges in probability to the true parameter values, provided the regressors satisfy certain regularity conditions (such as the sample moment matrix XᵀX/n converging to a positive definite matrix).
When the errors are normally distributed, ε ~ N(0, σ²I), the OLS estimator is not only BLUE (see below) but also the maximum likelihood estimator. In this case it achieves the Cramer-Rao lower bound, meaning no unbiased estimator (linear or nonlinear) can have lower variance.
The Gauss-Markov theorem is one of the central results justifying the use of OLS. It states that, under certain assumptions, the OLS estimator is the Best Linear Unbiased Estimator (BLUE).
Among all estimators that are (1) linear functions of the observed data y and (2) unbiased for the true parameter β, the OLS estimator has the smallest variance. More precisely, for any linear unbiased estimator β̃ = Cy, the difference Var(β̃) - Var(β̂) is a positive semidefinite matrix.
The theorem holds under the following conditions:
| Assumption | Mathematical statement | Meaning |
|---|---|---|
| Linearity | y = Xβ + ε | The true model is linear in parameters |
| Strict exogeneity | E[ε \| X] = 0 | Errors have zero conditional mean given regressors |
| Homoscedasticity | Var(εᵢ) = σ² for all i | All error terms have the same constant variance |
| No autocorrelation | Cov(εᵢ, εⱼ) = 0 for i ≠ j | Error terms are uncorrelated with each other |
| Full rank | rank(X) = p | No perfect multicollinearity; XᵀX is invertible |
Notably, the Gauss-Markov theorem does not require the errors to be normally distributed. It only requires them to be uncorrelated with zero mean and constant variance.
Consider any alternative linear unbiased estimator β̃ = Cy where C = (XᵀX)⁻¹Xᵀ + D for some matrix D. Unbiasedness requires DX = 0. The variance of β̃ is:
Var(β̃) = σ²CCᵀ = σ²[(XᵀX)⁻¹ + DDᵀ]
The cross terms (XᵀX)⁻¹XᵀDᵀ and DX(XᵀX)⁻¹ vanish because DX = 0.
Since DDᵀ is positive semidefinite, Var(β̃) - Var(β̂) = σ²DDᵀ is also positive semidefinite. Therefore OLS has the smallest variance among all linear unbiased estimators.
The Gauss-Markov theorem guarantees that OLS is optimal within the class of linear unbiased estimators. However, it does not say OLS is the best estimator overall. Biased estimators such as ridge regression can achieve lower mean squared error through the bias-variance tradeoff, accepting a small amount of bias in exchange for a large reduction in variance.
OLS relies on several assumptions for its estimators to have desirable properties. Violations of these assumptions affect the reliability of inferences drawn from the model.
The relationship between the dependent variable and the parameters must be linear. This does not require the relationship between the predictors and the response to be linear in the raw variables; transformations such as polynomial terms, logarithms, or interaction terms are permitted because the model remains linear in the parameters.
The observations are drawn randomly from the population. This ensures that the sample is representative and supports the statistical properties of the estimators.
No predictor variable can be written as an exact linear combination of other predictors. Perfect multicollinearity makes XᵀX singular and prevents a unique OLS solution. Near-multicollinearity (high but imperfect correlation among predictors) does not prevent estimation but inflates the standard errors of the affected coefficients, reducing the precision of estimates. The Variance Inflation Factor (VIF) is commonly used to detect multicollinearity; VIF values above 10 are generally considered problematic.
The expected value of the error term, conditional on the predictors, must be zero: E[ε | X] = 0. This means the predictors must not be correlated with the error term. Violations (endogeneity) can arise from omitted variable bias, simultaneous causation, or measurement error in the predictors. Endogeneity causes the OLS estimator to be biased and inconsistent.
The variance of the error term must be constant across all observations: Var(εᵢ | X) = σ² for all i. When this assumption is violated (heteroscedasticity), the OLS estimator remains unbiased but is no longer efficient, and the usual standard errors are biased. This can lead to incorrect hypothesis tests and confidence intervals.
The error terms must be uncorrelated with each other: Cov(εᵢ, εⱼ) = 0 for i ≠ j. This assumption is especially relevant in time series data, where consecutive observations may exhibit serial correlation. Violations of this assumption cause OLS to be inefficient and produce biased standard errors.
| Violation | Effect on OLS estimates | Effect on inference | Common diagnostic |
|---|---|---|---|
| Omitted variables / endogeneity | Biased and inconsistent | Invalid | Hausman test, instrumental variables |
| Heteroscedasticity | Unbiased but inefficient | Biased standard errors; invalid tests | Breusch-Pagan test, White test, residual plots |
| Autocorrelation | Unbiased but inefficient | Biased standard errors | Durbin-Watson test, Breusch-Godfrey test |
| Multicollinearity | Unbiased but imprecise | Inflated standard errors | VIF, condition number |
| Non-normality of errors | Unbiased (normality not needed for unbiasedness) | Small-sample tests invalid; large-sample inference remains valid | Shapiro-Wilk test, Q-Q plots |
The coefficient of determination measures the proportion of total variance in the dependent variable that is explained by the model:
R² = 1 - RSS / TSS
where RSS = Σᵢ (yᵢ - ŷᵢ)² is the residual sum of squares and TSS = Σᵢ (yᵢ - ȳ)² is the total sum of squares. R² ranges from 0 to 1 for models with an intercept, with values closer to 1 indicating a better fit.
However, R² never decreases when additional predictors are added to the model, even if those predictors are irrelevant. This makes R² unsuitable for comparing models with different numbers of predictors.
The adjusted coefficient of determination corrects for the number of predictors in the model:
R²_adj = 1 - [(1 - R²)(n - 1)] / (n - k - 1)
where k is the number of predictors excluding the intercept, so that n - k - 1 equals the residual degrees of freedom n - p used above.
Unlike R², the adjusted R² can decrease when irrelevant predictors are added, making it a better metric for model comparison. It can also take negative values when the model fits worse than a simple mean.
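A short sketch of computing both statistics from a fitted model's predictions (the function names are illustrative; y and y_hat are assumed to come from an OLS fit with k predictors plus an intercept):

```python
import numpy as np

def r_squared(y, y_hat):
    rss = np.sum((y - y_hat) ** 2)        # residual sum of squares
    tss = np.sum((y - y.mean()) ** 2)     # total sum of squares
    return 1.0 - rss / tss

def adjusted_r_squared(y, y_hat, k):
    """k = number of predictors, excluding the intercept."""
    n = len(y)
    r2 = r_squared(y, y_hat)
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)
```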
Beyond summary statistics, residual plots provide diagnostic information about model adequacy. Common patterns to check for include curvature in residuals versus fitted values (suggesting a misspecified functional form), a funnel shape (suggesting heteroscedasticity), systematic runs when residuals are plotted in time or collection order (suggesting autocorrelation), and isolated extreme points (suggesting outliers or high-leverage observations).
When the errors are normally distributed (or the sample size is large enough for asymptotic normality to apply), OLS enables statistical inference about the model parameters.
The standard error of the j-th coefficient is:
se(β̂ⱼ) = s * √[(XᵀX)⁻¹ⱼⱼ]
The test statistic for H₀: βⱼ = 0 is:
tⱼ = β̂ⱼ / se(β̂ⱼ)
which follows a t-distribution with n - p degrees of freedom under the null hypothesis.
The F-test evaluates whether the regression model as a whole explains a statistically significant portion of the variance:
F = [(TSS - RSS) / (p - 1)] / [RSS / (n - p)]
This statistic follows an F-distribution with (p - 1, n - p) degrees of freedom under the null hypothesis that all slope coefficients are zero.
A 100(1 - α)% confidence interval for the j-th coefficient is:
β̂ⱼ ± t_{α/2, n-p} * se(β̂ⱼ)
where t_{α/2, n-p} is the critical value from the t-distribution.
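A sketch that implements these inference formulas directly with NumPy and SciPy on synthetic data; in practice a package such as statsmodels reports the same quantities.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, p = 100, 3                                 # p counts the intercept column
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, 0.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
residuals = y - X @ beta_hat
s2 = residuals @ residuals / (n - p)          # unbiased estimate of σ²

se = np.sqrt(s2 * np.diag(XtX_inv))           # standard errors of the coefficients
t_stats = beta_hat / se                       # test statistics for H₀: βⱼ = 0
p_values = 2 * stats.t.sf(np.abs(t_stats), df=n - p)

t_crit = stats.t.ppf(0.975, df=n - p)         # 95% confidence intervals
ci = np.column_stack([beta_hat - t_crit * se, beta_hat + t_crit * se])
print(np.column_stack([beta_hat, se, t_stats, p_values]))
print(ci)
```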
The most straightforward approach computes β̂ = (XᵀX)⁻¹Xᵀy directly. However, explicitly inverting XᵀX is numerically unstable when the matrix is ill-conditioned (i.e., when predictors are nearly collinear).
In practice, most statistical software solves OLS using orthogonal factorizations rather than the normal equations. With QR decomposition, the design matrix is factored as X = QR, where Q has orthonormal columns and R is upper triangular; the solution β̂ = R⁻¹Qᵀy is obtained by back-substitution. QR decomposition is numerically more stable than direct inversion and is the default method in R's lm(); NumPy's lstsq and scikit-learn's LinearRegression instead rely on SVD-based LAPACK solvers.
SVD factors X = UΣVᵀ and handles rank-deficient or near-singular matrices gracefully. The pseudoinverse of X is computed as X⁺ = VΣ⁺Uᵀ, where Σ⁺ inverts only the nonzero singular values, giving β̂ = X⁺y. SVD is more computationally expensive than QR but provides the most robust solution, especially when the data exhibit severe multicollinearity.
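A sketch of both factorization-based solves with NumPy on synthetic data (np.linalg.lstsq wraps an SVD-based LAPACK routine under the hood):

```python
import numpy as np

rng = np.random.default_rng(3)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
y = X @ np.array([1.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

# QR: X = QR, then solve the triangular system R β = Qᵀy
Q, R = np.linalg.qr(X)                        # reduced QR by default
beta_qr = np.linalg.solve(R, Q.T @ y)

# SVD: β̂ = X⁺ y via the Moore-Penrose pseudoinverse
beta_svd = np.linalg.pinv(X) @ y

# lstsq also uses an SVD-based solver and reports rank and singular values
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(beta_qr, beta_svd), np.allclose(beta_qr, beta_lstsq))
```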
For very large datasets where forming and decomposing XᵀX is prohibitively expensive, iterative methods such as gradient descent can be used. The parameter update rule is:
β_{t+1} = βₜ - α * ∇S(βₜ) = βₜ + 2α * Xᵀ(y - Xβₜ)
where α is the learning rate. Stochastic gradient descent and mini-batch variants scale efficiently to millions of observations and are the standard approach in deep learning frameworks.
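A sketch of batch gradient descent on the least squares objective; the learning rate and iteration count are arbitrary choices for the example, and the gradient is scaled by 1/n (i.e., the mean rather than the sum of squared errors), which only rescales the learning rate relative to the update rule above.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.5, size=n)

beta = np.zeros(X.shape[1])
alpha = 0.1                                       # learning rate for the scaled gradient
for _ in range(2000):
    gradient = -2.0 / n * X.T @ (y - X @ beta)    # gradient of the mean squared error
    beta = beta - alpha * gradient                # β ← β - α ∇S(β)

print(beta)                                       # close to the OLS solution
print(np.linalg.lstsq(X, y, rcond=None)[0])       # reference solution
```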
| Method | Time complexity | Numerical stability | Best suited for |
|---|---|---|---|
| Normal equation | O(np² + p³) | Low (sensitive to conditioning) | Small to medium datasets |
| QR decomposition | O(np²) | High | General-purpose use |
| SVD | O(np²) (higher constant) | Very high | Rank-deficient or ill-conditioned problems |
| Gradient descent | O(npT), T = iterations | Moderate | Very large datasets |
When the assumption of homoscedasticity is violated but the error variances are known (or can be estimated), weighted least squares provides a more efficient estimator. WLS minimizes a weighted sum of squared residuals:
S_w(β) = Σᵢ wᵢ(yᵢ - xᵢᵀβ)²
where wᵢ = 1/σᵢ² is the weight assigned to the i-th observation, inversely proportional to its error variance. In matrix form:
β̂_WLS = (XᵀWX)⁻¹XᵀWy
where W = diag(w₁, w₂, ..., wₙ). Observations with smaller error variance receive more weight because they carry more information about the true relationship. WLS is the appropriate method when errors are independent but have unequal variances.
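A sketch of the WLS estimator with NumPy, assuming the per-observation error standard deviations are known; here they are simulated so the example is self-contained.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 300
X = np.column_stack([np.ones(n), rng.uniform(0, 10, size=n)])
sigma_i = 0.5 + 0.3 * X[:, 1]                      # error sd grows with the predictor
y = X @ np.array([2.0, 1.5]) + rng.normal(scale=sigma_i)

w = 1.0 / sigma_i**2                               # weights wᵢ = 1/σᵢ²
W = np.diag(w)
beta_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

# Equivalent "whitened OLS": scale each row of X and y by √wᵢ, then fit by OLS
Xw = X * np.sqrt(w)[:, None]
yw = y * np.sqrt(w)
beta_whitened = np.linalg.lstsq(Xw, yw, rcond=None)[0]
print(np.allclose(beta_wls, beta_whitened))        # True
```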
GLS extends WLS to handle cases where the errors are both heteroscedastic and correlated, with a general covariance structure Var(ε) = σ²Ω, where Ω is a known positive definite matrix. The GLS estimator, introduced by Alexander Aitken in 1935, is:
β̂_GLS = (XᵀΩ⁻¹X)⁻¹XᵀΩ⁻¹y
GLS is equivalent to applying OLS to a transformed version of the data. Using the Cholesky decomposition Ω = CCᵀ, the transformation C⁻¹ decorrelates and standardizes the errors, producing a model suitable for standard OLS. Under the extended Gauss-Markov theorem (the Aitken theorem), the GLS estimator is BLUE when the covariance structure is correctly specified.
WLS is the special case of GLS where Ω is diagonal (errors are uncorrelated but have unequal variances).
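A sketch of GLS as whitened OLS, assuming Ω is known; the AR(1)-style correlation matrix below is an invented example simply to make the covariance structure concrete.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([1.0, 2.0])

# Assumed known covariance structure: AR(1) correlation with ρ = 0.7
rho = 0.7
Omega = rho ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
L = np.linalg.cholesky(Omega)                      # Ω = L Lᵀ
y = X @ beta_true + L @ rng.normal(size=n)         # correlated errors

# GLS estimator: (Xᵀ Ω⁻¹ X)⁻¹ Xᵀ Ω⁻¹ y
Omega_inv = np.linalg.inv(Omega)
beta_gls = np.linalg.solve(X.T @ Omega_inv @ X, X.T @ Omega_inv @ y)

# Equivalent whitening: apply L⁻¹ to X and y, then run ordinary least squares
X_star = np.linalg.solve(L, X)
y_star = np.linalg.solve(L, y)
beta_white = np.linalg.lstsq(X_star, y_star, rcond=None)[0]
print(np.allclose(beta_gls, beta_white))           # True
```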
In practice, the true covariance matrix Ω is rarely known. Feasible GLS (FGLS) estimates it from the data in a two-step procedure: first fit the model by OLS and use the residuals to estimate the covariance structure (for example, by modeling the error variance as a function of the predictors or estimating an autocorrelation parameter), then apply the GLS formula with the estimate Ω̂ in place of Ω.
FGLS is consistent and asymptotically efficient under regularity conditions. However, in small to medium-sized samples it can be less efficient than OLS with robust standard errors. Practical guidance from the econometrics literature therefore recommends OLS with heteroscedasticity-consistent standard errors (the Eicker-White estimator) or heteroscedasticity and autocorrelation consistent (HAC) standard errors (the Newey-West estimator) rather than FGLS when the sample size is modest.
| Method | Error structure assumed | Estimator formula | When to use |
|---|---|---|---|
| OLS | Homoscedastic, uncorrelated (σ²I) | (XᵀX)⁻¹Xᵀy | Baseline method; valid when all classical assumptions hold |
| WLS | Heteroscedastic, uncorrelated (diagonal Ω) | (XᵀWX)⁻¹XᵀWy | Known or estimable unequal variances; no correlation |
| GLS | General covariance (full Ω) | (XᵀΩ⁻¹X)⁻¹XᵀΩ⁻¹y | Correlated and/or heteroscedastic errors with known Ω |
| FGLS | General covariance (estimated Ω̂) | (XᵀΩ̂⁻¹X)⁻¹XᵀΩ̂⁻¹y | Correlated/heteroscedastic errors; Ω unknown but estimable |
When the model is nonlinear in its parameters (for example, y = β₁ · e^(β₂x) + ε), the residual sum of squares no longer has a closed-form minimizer. Instead, iterative optimization algorithms such as Gauss-Newton and Levenberg-Marquardt are used, which repeatedly linearize the model around the current parameter estimate and solve a linear least squares subproblem.
Nonlinear least squares does not guarantee convergence to the global minimum, and the solution may depend on the choice of starting values.
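A sketch using SciPy's curve_fit (which applies Levenberg-Marquardt for unconstrained problems) on the exponential model above; the data and starting values p0 are invented for the example, and poor starting values can lead to a different local minimum.

```python
import numpy as np
from scipy.optimize import curve_fit

def model(x, b1, b2):
    return b1 * np.exp(b2 * x)                # y = β₁ e^(β₂ x)

rng = np.random.default_rng(7)
x = np.linspace(0, 2, 50)
y = model(x, 2.5, 1.3) + rng.normal(scale=0.2, size=x.size)

# p0 supplies starting values for the iterative optimizer
params, covariance = curve_fit(model, x, y, p0=[1.0, 1.0])
print(params)                                  # estimates of β₁ and β₂
```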
Ordinary least squares assumes that only the dependent variable contains measurement error. Total least squares (also called orthogonal distance regression or errors-in-variables regression) accounts for errors in both the dependent and independent variables. Instead of minimizing vertical distances from data points to the fitted line, it minimizes the perpendicular (orthogonal) distances. Total least squares is computed using singular value decomposition (SVD) and is appropriate when the predictor variables are measured with noise.
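For the single-predictor case, total least squares reduces to orthogonal regression of a line, which can be computed from the SVD of the centered data. A minimal sketch under that assumption (the slope formula uses the right singular vector associated with the smallest singular value as the line's normal vector):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 100
x_true = rng.uniform(0, 10, size=n)
y_true = 1.0 + 0.8 * x_true
x = x_true + rng.normal(scale=0.5, size=n)     # noise in the predictor...
y = y_true + rng.normal(scale=0.5, size=n)     # ...and in the response

# Center the data and take the SVD of the (n x 2) matrix of centered columns
A = np.column_stack([x - x.mean(), y - y.mean()])
_, _, Vt = np.linalg.svd(A, full_matrices=False)
v = Vt[-1]                                     # normal direction of the fitted line

slope_tls = -v[0] / v[1]                       # from v₀(x - x̄) + v₁(y - ȳ) = 0
intercept_tls = y.mean() - slope_tls * x.mean()
print(slope_tls, intercept_tls)
```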
When the number of predictors is large relative to the sample size, or when predictors are highly correlated, the OLS estimator can have high variance and overfit the training data. Regularized versions of least squares add a penalty term to the objective function to control model complexity.
Ridge regression adds the squared L2 norm of the coefficient vector to the loss function:
S_ridge(β) = ||y - Xβ||² + λ||β||²₂
The solution is:
β̂_ridge = (XᵀX + λI)⁻¹Xᵀy
The regularization parameter λ > 0 shrinks the coefficients toward zero but never sets them exactly to zero. Ridge regression is effective when many predictors contribute to the outcome and multicollinearity is present. Adding λI to XᵀX ensures the matrix is always invertible, even when XᵀX is singular.
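A sketch of the ridge closed form; note that in practice the predictors are usually standardized first and the intercept is left unpenalized, details this minimal version glosses over.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate β̂ = (XᵀX + λI)⁻¹ Xᵀ y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(9)
X = rng.normal(size=(50, 5))
X[:, 4] = X[:, 3] + rng.normal(scale=1e-3, size=50)   # nearly collinear columns
y = X @ np.array([1.0, 0.0, 0.0, 2.0, 2.0]) + rng.normal(scale=0.1, size=50)

print(ridge_fit(X, y, lam=0.0))    # OLS: unstable for the collinear pair
print(ridge_fit(X, y, lam=1.0))    # ridge: coefficients shrunk and stabilized
```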
Lasso regression adds the L1 norm of the coefficient vector:
S_lasso(β) = ||y - Xβ||² + λ||β||₁
Unlike ridge regression, the L1 penalty can drive some coefficients to exactly zero, performing automatic variable selection. This makes the Lasso useful when only a subset of predictors are relevant. The Lasso does not have a closed-form solution and is typically solved using coordinate descent or the LARS algorithm.
Elastic net combines both L1 and L2 penalties:
S_enet(β) = ||y - Xβ||² + λ₁||β||₁ + λ₂||β||²₂
It inherits the variable selection property of the Lasso while handling groups of correlated predictors better than the Lasso alone, which tends to arbitrarily select one predictor from a correlated group and ignore the rest.
| Method | Penalty | Variable selection | Handles multicollinearity | Solution form |
|---|---|---|---|---|
| OLS | None | No | Poorly | Closed-form |
| Ridge | λ||β||²₂ | No (shrinks but keeps all) | Yes | Closed-form |
| Lasso | λ||β||₁ | Yes (sets some to zero) | Partially | Iterative |
| Elastic net | λ₁||β||₁ + λ₂||β||²₂ | Yes | Yes | Iterative |
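All three penalized variants are available in scikit-learn; a minimal sketch on synthetic data (the alpha values are arbitrary and would normally be chosen by cross-validation, e.g. with RidgeCV, LassoCV, or ElasticNetCV):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.default_rng(10)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=200)  # 2 informative predictors

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)

print(ridge.coef_)   # all coefficients shrunk, none exactly zero
print(lasso.coef_)   # irrelevant coefficients driven exactly to zero
print(enet.coef_)    # a compromise between the two penalties
```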
OLS has a direct connection to maximum likelihood estimation under the assumption of normally distributed errors. If ε ~ N(0, σ²I), the likelihood function for the observed data is:
L(β, σ²) = (2πσ²)^(-n/2) * exp[-1/(2σ²) * (y - Xβ)ᵀ(y - Xβ)]
Maximizing this likelihood with respect to β is equivalent to minimizing (y - Xβ)ᵀ(y - Xβ), which is exactly the OLS objective. This means the OLS estimator is also the maximum likelihood estimator when errors are normal, giving it additional desirable properties such as asymptotic efficiency and the achievement of the Cramer-Rao lower bound.
OLS is the default estimation method for linear models in econometrics, where it is used to estimate demand functions, production functions, wage equations, and countless other relationships. Extensions such as two-stage least squares (2SLS) and three-stage least squares (3SLS) handle endogeneity through instrumental variables.
In supervised learning, least squares regression serves as the foundation for understanding more complex models. Many algorithms, including neural networks, minimize mean squared error as their loss function, which is equivalent to least squares when the model is linear. The gradient descent algorithms used to train deep learning models directly inherit from the optimization framework of least squares.
Least squares methods are used extensively in signal processing for tasks such as filter design, system identification, spectrum estimation, and adaptive filtering. The Wiener filter, for instance, is derived from the principle of least squares applied to the problem of estimating a signal from noisy observations.
Least squares fitting is standard practice in physics, chemistry, and engineering for calibrating instruments, fitting theoretical models to experimental data, and determining physical constants. The method's statistical properties make it the natural choice whenever measurement noise is approximately Gaussian.
Historically, the method's first applications were in determining planetary orbits and making geodetic measurements. These remain active areas of application, with modern extensions handling the complex correlation structures that arise from satellite observations and GPS measurements.
| Method | Relationship to least squares |
|---|---|
| Linear regression | OLS is the standard estimation method for linear regression |
| Logistic regression | Uses maximum likelihood instead of least squares because the response is binary |
| Generalized linear models | Extend the least squares framework to non-normal response distributions |
| Principal component analysis | PCA finds directions of maximum variance in the predictors, while OLS minimizes prediction error for a designated response; orthogonal (total) least squares connects the two via the SVD |
| Support vector machines | Replace squared loss with hinge loss for classification |
| Neural networks | Can use MSE loss (equivalent to least squares) but typically with nonlinear architectures |
| Random forests | Ensemble of decision trees; does not use least squares directly but can minimize MSE |
| Bayesian regression | Places a prior distribution on β and computes the posterior; the posterior mode under a Gaussian prior is equivalent to ridge regression |
Least squares regression is available in virtually every statistical and machine learning software package:
| Language/Tool | Function or class |
|---|---|
| Python (scikit-learn) | sklearn.linear_model.LinearRegression |
| Python (statsmodels) | statsmodels.api.OLS |
| Python (NumPy) | numpy.linalg.lstsq |
| R | lm() |
| MATLAB | fitlm(), backslash operator \ |
| Julia | GLM.jl, backslash operator |
| TensorFlow / PyTorch | Custom models with MSE loss |
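For reference, a minimal sketch of fitting the same simple regression with three of the Python entries from the table (the synthetic data are invented for the example):

```python
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(11)
x = rng.normal(size=100)
y = 1.0 + 2.0 * x + rng.normal(scale=0.3, size=100)

# NumPy: solve the least squares problem directly on the design matrix
X = np.column_stack([np.ones_like(x), x])
print(np.linalg.lstsq(X, y, rcond=None)[0])

# statsmodels: full inference output (coefficients, standard errors, t-tests, R²)
print(sm.OLS(y, sm.add_constant(x)).fit().params)

# scikit-learn: estimator API; the intercept is fitted separately from coef_
model = LinearRegression().fit(x.reshape(-1, 1), y)
print(model.intercept_, model.coef_)
```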