# Least Squares Regression

> Source: https://aiwiki.ai/wiki/least_squares_regression
> Updated: 2026-07-11
> Categories: Machine Learning, Statistics
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**Least squares regression** is a statistical method that fits a model to data by choosing the parameters that minimize the sum of the squared residuals, the squared differences between the observed values and the values the model predicts [1][7]. For a [linear regression](/wiki/linear_regression) model written $$y = X\beta + \varepsilon$$, the ordinary least squares (OLS) solution has the closed-form expression $$\hat{\beta} = (X^\top X)^{-1} X^\top y$$, obtained by solving the normal equations $$X^\top X \beta = X^\top y$$ [3][7]. It is one of the oldest and most widely used techniques in statistics and [machine learning](/wiki/machine_learning), forming the mathematical backbone of linear regression and many other predictive modeling approaches.

The core idea is straightforward: find the set of parameters that makes the model's predictions as close as possible to the actual data, where "closeness" is measured by the sum of squared residuals. As one standard reference puts it, least squares is "a method to determine the best-fit model by minimizing the sum of the squared residuals, the differences between observed values and the values predicted by the model" [16].

The method was first published by Adrien-Marie Legendre in 1805 and independently developed by Carl Friedrich Gauss, who published it in 1809 and claimed he had been using it since 1795 [1][2][3]. Since then, it has expanded into a family of techniques, including ordinary least squares (OLS), weighted least squares (WLS), generalized least squares (GLS), nonlinear least squares, and total least squares, each suited to different data conditions and modeling assumptions.

## ELI5: Explain like I'm 5

Imagine you have a bunch of dots on a piece of paper, and you want to draw the single straight line that goes closest to all the dots at the same time. Some dots will be a little above the line, and some will be a little below it. Least squares regression finds the best line by looking at how far each dot is from the line, squaring those distances (so bigger misses count a lot more), and then picking the line that makes the total of all those squared distances as small as possible. It is like finding the fairest middle path through a scatter of points.

## What is least squares regression?

Least squares regression estimates the unknown parameters of a model by minimizing the residual sum of squares (RSS), the total of the squared vertical gaps between each data point and the fitted curve [1][7]. When the model is linear in its parameters, this objective is a convex quadratic; provided no predictor is an exact linear combination of the others, it has a unique global minimum, and the minimizing parameters can be written down in closed form rather than searched for iteratively [3][7]. This combination of a clear optimization principle (minimize squared error) and an exact, fast solution is why least squares became, and remains, the default starting point for [regression](/wiki/regression) analysis across statistics, econometrics, engineering, and machine learning.

## Who invented least squares, and when?

The method of least squares has a rich history intertwined with the development of modern statistics and astronomy.

### Legendre's publication (1805)

The first published account of the method appeared in Adrien-Marie Legendre's 1805 work, *Nouvelles méthodes pour la détermination des orbites des comètes* (New Methods for the Determination of the Orbits of Comets) [1]. An appendix titled "Sur la Méthode des moindres quarrés" (On the Method of Least Squares) laid out the procedure clearly. Legendre proposed minimizing the sum of squared residuals as a principled way to fit a curve to observational data, and within a decade the technique was adopted across France, Italy, and Prussia as the standard method in astronomy and geodesy [3].

### Gauss's earlier claim (1795)

Carl Friedrich Gauss later claimed he had been using the method since 1795, when he was eighteen years old [2][3]. He published his approach in 1809 in *Theoria Motus Corporum Coelestium* (Theory of the Motion of Celestial Bodies), where he used least squares to predict the orbit of the asteroid Ceres after it was discovered by Giuseppe Piazzi in 1801 [2]. Gauss's prediction of Ceres's position was remarkably accurate, which brought wide attention to the method. The resulting priority dispute between Legendre and Gauss became one of the most famous in the history of science: Legendre, in an 1809 letter to Gauss, objected that priority is established only by publication, and in 1820 he published a supplement to his 1805 memoir publicly attacking Gauss's claim [3].

### Gauss's statistical justification

Gauss made a contribution beyond Legendre by connecting the method to probability theory. He showed that if observational errors follow a normal distribution, then the least squares estimator coincides with the maximum likelihood estimator [3]. This connection gave the method a deep theoretical foundation rather than being merely a convenient computational recipe.

### Later developments

Pierre-Simon Laplace provided additional justification around 1810 through the central limit theorem, showing that least squares estimators have desirable large-sample properties even when errors are not exactly normal [3]. By 1822, Gauss had proved the optimality of the least squares estimator among all linear unbiased estimators for models with normally distributed errors. Andrey Markov later generalized this result, relaxing the normality requirement, leading to what is now known as the Gauss-Markov theorem [15].

## Mathematical formulation

### The linear regression model

The standard [linear regression](/wiki/linear_regression) model relates a dependent variable *y* to a set of independent variables (predictors) through the equation:

$$
y = X\beta + \varepsilon
$$

where:

| Symbol | Meaning |
|--------|----------|
| *y* | Column vector of *n* observed response values (*n* x 1) |
| *X* | Design matrix of predictor values (*n* x *p*), where each row is an observation and each column is a predictor |
| β | Column vector of *p* unknown parameters to be estimated (*p* x 1) |
| ε | Column vector of *n* random error terms (*n* x 1) |

The first column of *X* is typically a column of ones to accommodate an intercept term [7].

### What is the objective function minimized?

Ordinary least squares seeks the parameter vector β that minimizes the residual sum of squares (RSS):

$$
S(\beta) = \sum_i (y_i - x_i^\top \beta)^2 = (y - X\beta)^\top (y - X\beta) = \lVert y - X\beta \rVert^2
$$

This expression is the squared Euclidean norm of the residual vector. Because the function is quadratic and convex in β, it has a unique global minimum (provided *X* has full column rank) [3][7].

### How are the coefficients computed (the normal equation)?

To find the minimum, take the gradient of S(β) with respect to β and set it to zero:

$$
\nabla_\beta S(\beta) = -2 X^\top (y - X\beta) = 0
$$

Rearranging gives the **normal equation**:

$$
X^\top X \beta = X^\top y
$$

If the matrix XᵀX is invertible (which requires *X* to have full column rank, meaning no perfect multicollinearity among the predictors), then the unique solution is [3][7]:

$$
\hat{\beta} = (X^\top X)^{-1} X^\top y
$$

This is the ordinary least squares estimator. The term "normal equation" comes from the fact that the residual vector (y - Xβ̂) is orthogonal ("normal") to the column space of *X* [7].
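As a minimal sketch of the derivation above, the following NumPy snippet solves the normal equation on synthetic data and checks that the gradient vanishes at the solution. The data, the seed, and the variable names (`beta_true`, `beta_hat`) are assumptions made for illustration only.

```python
import numpy as np

# Synthetic data (assumed for illustration): y = 1 + 2*x1 - 3*x2 + noise
rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # first column = intercept
beta_true = np.array([1.0, 2.0, -3.0])
y = X @ beta_true + 0.1 * rng.normal(size=n)

# Solve the normal equation X^T X beta = X^T y as a linear system
# (np.linalg.solve avoids forming the explicit inverse).
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# The gradient -2 X^T (y - X beta) should vanish at the solution,
# i.e. the residual is orthogonal to every column of X.
residual = y - X @ beta_hat
gradient = -2.0 * X.T @ residual
```

Up to floating-point rounding, `gradient` is the zero vector, which is the orthogonality property that gives the normal equation its name.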

### Simple linear regression

For the special case of one predictor variable, $$y = \alpha + \beta x + \varepsilon$$, the OLS estimators reduce to closed-form expressions:

$$
\hat{\beta} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}
$$

$$
\hat{\alpha} = \bar{y} - \hat{\beta} \bar{x}
$$

where $$\bar{x}$$ and $$\bar{y}$$ are the sample means of *x* and *y* respectively. The slope $$\hat{\beta}$$ equals the sample covariance of *x* and *y* divided by the sample variance of *x* [8].
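The two formulas can be evaluated by hand or in a few lines of plain Python. The sketch below uses a small made-up data set; the function name `simple_ols` is hypothetical.

```python
def simple_ols(x, y):
    """Return (alpha_hat, beta_hat) for the model y = alpha + beta * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Numerator and denominator of the slope formula
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    beta_hat = sxy / sxx
    alpha_hat = y_bar - beta_hat * x_bar
    return alpha_hat, beta_hat

# Made-up data for illustration
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
alpha_hat, beta_hat = simple_ols(x, y)
```

Here the means are 3 and 4, the numerator is 6 and the denominator is 10, so the slope is 0.6 and the intercept is 4 - 0.6 × 3 = 2.2.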

## Geometric interpretation

The OLS estimator has an elegant geometric meaning. The predicted values $$\hat{y} = X\hat{\beta}$$ lie in the column space of *X*, which is a *p*-dimensional subspace of *n*-dimensional space. The OLS solution corresponds to the orthogonal projection of the observed vector *y* onto this subspace [7].

The projection matrix (also called the hat matrix) is:

$$
H = X (X^\top X)^{-1} X^\top
$$

so that $$\hat{y} = Hy$$. The residual vector $$\hat{\varepsilon} = y - \hat{y} = (I - H)y$$ is perpendicular to every column of *X*. This means that OLS finds the point in the column space of *X* that is closest to *y* in the Euclidean sense. Geometrically, it drops a perpendicular from *y* onto the hyperplane spanned by the columns of *X* [11].
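The projection properties can be checked numerically. The following is a small sketch on random data (seed and dimensions are arbitrary assumptions); it builds the hat matrix and verifies that it is symmetric and idempotent and that the residual is orthogonal to the columns of *X*.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = rng.normal(size=n)

# Hat matrix H = X (X^T X)^{-1} X^T, built with solve instead of an explicit inverse
H = X @ np.linalg.solve(X.T @ X, X.T)

y_hat = H @ y                 # projection of y onto the column space of X
resid = y - y_hat             # (I - H) y

is_idempotent = np.allclose(H @ H, H)          # projecting twice changes nothing
is_symmetric = np.allclose(H, H.T)             # orthogonal projections are symmetric
is_orthogonal = np.allclose(X.T @ resid, 0.0, atol=1e-10)
trace_H = np.trace(H)         # equals p, the dimension of the column space
```

Forming the full *n* x *n* matrix is only sensible for small *n*; it is done here purely to make the geometry visible.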

## Properties of the OLS estimator

### Unbiasedness

Under the assumption that E[ε | X] = 0 (strict exogeneity), the OLS estimator is unbiased [5][7]:

$$
\mathbb{E}[\hat{\beta} \mid X] = \beta
$$

This means that on average, over repeated sampling, the estimated coefficients equal the true parameter values.

### Variance

The variance-covariance matrix of the OLS estimator is [7]:

$$
\mathrm{Var}(\hat{\beta} \mid X) = \sigma^2 (X^\top X)^{-1}
$$

where $$\sigma^2$$ is the variance of the error terms. An unbiased estimator of $$\sigma^2$$ is:

$$
s^2 = \frac{\hat{\varepsilon}^\top \hat{\varepsilon}}{n - p}
$$

The denominator uses *n* - *p* (the degrees of freedom) rather than *n* to correct for the bias that would otherwise arise from using estimated residuals instead of true errors.
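A short sketch of these two formulas on simulated data follows. The true error standard deviation `sigma` and the coefficients are assumptions chosen for the simulation; only `s2`, `cov_beta`, and `std_err` correspond to quantities defined above.

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma = 500, 0.5
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
p = X.shape[1]                                # counts the intercept column too
y = X @ np.array([1.0, -2.0, 0.5]) + sigma * rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat

s2 = resid @ resid / (n - p)                  # unbiased estimate of sigma^2
cov_beta = s2 * np.linalg.inv(X.T @ X)        # estimated Var(beta_hat | X)
std_err = np.sqrt(np.diag(cov_beta))          # standard errors of the coefficients
```

With this sample size `s2` should land close to the simulated value `sigma**2 = 0.25`, though the exact figure depends on the random draw.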

### Consistency

As the sample size *n* grows, the OLS estimator converges in probability to the true parameter values, provided the regressors satisfy certain regularity conditions (such as the sample moment matrix XᵀX/n converging to a positive definite matrix) [5].

### Efficiency under normality

When the errors are normally distributed, $$\varepsilon \sim N(0, \sigma^2 I)$$, the OLS estimator is not only BLUE (see below) but also the maximum likelihood estimator. In this case it achieves the Cramér-Rao lower bound, meaning no unbiased estimator (linear or nonlinear) can have lower variance [10].

## What is the Gauss-Markov theorem?

The Gauss-Markov theorem is one of the central results justifying the use of OLS. It states that, under certain assumptions, the OLS estimator is the **Best Linear Unbiased Estimator (BLUE)** [15].

### Statement

The theorem states that "the ordinary least squares (OLS) estimator has the lowest sampling variance within the class of linear unbiased estimators, if the errors in the linear regression model are uncorrelated, have equal variances and expectation value of zero" [15]. More precisely, among all estimators that are (1) linear functions of the observed data *y* and (2) unbiased for the true parameter $$\beta$$, the OLS estimator has the smallest variance: for any linear unbiased estimator $$\tilde{\beta} = Cy$$, the difference $$\mathrm{Var}(\tilde{\beta}) - \mathrm{Var}(\hat{\beta})$$ is a positive semidefinite matrix.

Crucially, normality is not required. As the standard statement notes, "the errors do not need to be normal, nor do they need to be independent and identically distributed (only uncorrelated with mean zero and homoscedastic with finite variance)" [15].

### Required assumptions

The theorem holds under the following conditions [15]:

| Assumption | Mathematical statement | Meaning |
|---|---|---|
| Linearity | $$y = X\beta + \varepsilon$$ | The true model is linear in parameters |
| Strict exogeneity | $$\mathbb{E}[\varepsilon \mid X] = 0$$ | Errors have zero conditional mean given regressors |
| Homoscedasticity | $$\mathrm{Var}(\varepsilon_i) = \sigma^2$$ for all *i* | All error terms have the same constant variance |
| No autocorrelation | $$\mathrm{Cov}(\varepsilon_i, \varepsilon_j) = 0$$ for $$i \ne j$$ | Error terms are uncorrelated with each other |
| Full rank | $$\mathrm{rank}(X) = p$$ | No perfect multicollinearity; $$X^\top X$$ is invertible |

The three conditions on the errors (zero conditional mean, constant variance, no correlation) constrain only their first two moments; none of them specifies a distributional form such as normality [15].

### Proof sketch

Consider any alternative linear unbiased estimator $$\tilde{\beta} = Cy$$ where $$C = (X^\top X)^{-1} X^\top + D$$ for some matrix D. Because $$\mathbb{E}[\tilde{\beta} \mid X] = CX\beta = (I + DX)\beta$$, unbiasedness for every $$\beta$$ requires $$DX = 0$$. The variance of $$\tilde{\beta}$$ is:

$$
\mathrm{Var}(\tilde{\beta}) = \sigma^2 C C^\top = \sigma^2 [(X^\top X)^{-1} + D D^\top]
$$

The cross terms $$(X^\top X)^{-1} X^\top D^\top$$ and $$D X (X^\top X)^{-1}$$ drop out of $$C C^\top$$ because $$DX = 0$$. Since $$D D^\top$$ is positive semidefinite, $$\mathrm{Var}(\tilde{\beta}) - \mathrm{Var}(\hat{\beta}) = \sigma^2 D D^\top$$ is also positive semidefinite. Therefore OLS has the smallest variance among all linear unbiased estimators [15].
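The argument can be illustrated numerically. The sketch below constructs one particular competitor, choosing D = M(I - H) for a random matrix M so that DX = 0 holds by construction (this choice, and setting σ² = 1, are assumptions of the example), and checks that the excess variance equals DDᵀ and is positive semidefinite.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])

XtX_inv = np.linalg.inv(X.T @ X)
A = XtX_inv @ X.T                      # OLS weights: beta_hat = A y
H = X @ A                              # hat matrix

# Any D of the form M (I - H) satisfies D X = 0, so C = A + D stays unbiased.
M = rng.normal(size=(p, n))
D = M @ (np.eye(n) - H)
C = A + D

# With Var(eps) = sigma^2 I, Var(C y) = sigma^2 C C^T; take sigma^2 = 1 here.
var_alt = C @ C.T
var_ols = XtX_inv
excess = var_alt - var_ols             # should equal D D^T

# Smallest eigenvalue of the (symmetrized) excess; nonnegative means PSD
min_eig = np.linalg.eigvalsh((excess + excess.T) / 2).min()
```

This checks a single alternative estimator, not the theorem itself, but it shows where the extra variance comes from.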

### Implications

The Gauss-Markov theorem guarantees that OLS is optimal within the class of linear unbiased estimators. However, it does not say OLS is the best estimator overall. Biased estimators such as [ridge regression](/wiki/l2_regularization) can achieve lower mean squared error through the [bias-variance tradeoff](/wiki/bias_variance_tradeoff), accepting a small amount of bias in exchange for a large reduction in variance [7][8].

## What assumptions does OLS make?

OLS relies on several assumptions for its estimators to have desirable properties. Violations of these assumptions affect the reliability of inferences drawn from the model.

### 1. Linearity in parameters

The relationship between the dependent variable and the parameters must be linear. This does not require the relationship between the predictors and the response to be linear in the raw variables; transformations such as polynomial terms, logarithms, or interaction terms are permitted because the model remains linear in the parameters [8].

### 2. Random sampling

The observations are drawn randomly from the population. This ensures that the sample is representative and supports the statistical properties of the estimators [5].

### 3. No perfect multicollinearity

No predictor variable can be written as an exact linear combination of other predictors. Perfect multicollinearity makes XᵀX singular and prevents a unique OLS solution. Near-multicollinearity (high but imperfect correlation among predictors) does not prevent estimation but inflates the standard errors of the affected coefficients, reducing the precision of estimates. The Variance Inflation Factor (VIF) is commonly used to detect multicollinearity; VIF values above 10 are generally considered problematic [6].

### 4. Exogeneity (zero conditional mean)

The expected value of the error term, conditional on the predictors, must be zero: $$\mathbb{E}[\varepsilon \mid X] = 0$$. This means the predictors must not be correlated with the error term. Violations (endogeneity) can arise from omitted variable bias, simultaneous causation, or measurement error in the predictors. Endogeneity causes the OLS estimator to be biased and inconsistent [5][9].

### 5. Homoscedasticity

The variance of the error term must be constant across all observations: $$\mathrm{Var}(\varepsilon_i \mid X) = \sigma^2$$ for all *i*. When this assumption is violated (heteroscedasticity), the OLS estimator remains unbiased but is no longer efficient, and the usual standard errors are biased. This can lead to incorrect hypothesis tests and confidence intervals [6][15].

### 6. No autocorrelation

The error terms must be uncorrelated with each other: $$\mathrm{Cov}(\varepsilon_i, \varepsilon_j) = 0$$ for $$i \ne j$$. This assumption is especially relevant in [time series](/wiki/time_series_analysis) data, where consecutive observations may exhibit serial correlation. Violations of this assumption cause OLS to be inefficient and produce biased standard errors [6].

### Summary of assumption violations

| Violation | Effect on OLS estimates | Effect on inference | Common diagnostic |
|---|---|---|---|
| Omitted variables / endogeneity | Biased and inconsistent | Invalid | Hausman test, instrumental variables |
| Heteroscedasticity | Unbiased but inefficient | Biased standard errors; invalid tests | Breusch-Pagan test, White test, residual plots |
| Autocorrelation | Unbiased but inefficient | Biased standard errors | Durbin-Watson test, Breusch-Godfrey test |
| Multicollinearity | Unbiased but imprecise | Inflated standard errors | VIF, condition number |
| Non-normality of errors | Unbiased; no longer the maximum likelihood estimator | Invalid small-sample tests | Shapiro-Wilk test, Q-Q plots |

## Goodness of fit

### Coefficient of determination (R²)

The coefficient of determination measures the proportion of total variance in the dependent variable that is explained by the model:

$$
R^2 = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}
$$

where $$\mathrm{RSS} = \sum_i (y_i - \hat{y}_i)^2$$ is the residual sum of squares and $$\mathrm{TSS} = \sum_i (y_i - \bar{y})^2$$ is the total sum of squares. $$R^2$$ ranges from 0 to 1 for models with an intercept, with values closer to 1 indicating a better fit [8].

However, R² never decreases when additional predictors are added to the model, even if those predictors are irrelevant. This makes R² unsuitable for comparing models with different numbers of predictors [8].

### Adjusted R²

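The adjusted coefficient of determination corrects for the number of predictors in the model. With *p* counting every column of *X*, including the intercept column, as in the model definition above:

$$
R^2_{\text{adj}} = 1 - \frac{(1 - R^2)(n - 1)}{n - p}
$$

Many textbooks write the denominator as *n* - *k* - 1, where *k* = *p* - 1 counts only the non-intercept predictors; the two forms are identical. Unlike R², the adjusted R² can decrease when irrelevant predictors are added, making it a better metric for model comparison. It can also take negative values when the model explains very little variance relative to the number of predictors [8].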
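The behaviour of the two statistics can be sketched on simulated data. In the hypothetical helper `r_squared` below, `n_params` counts every column of the design matrix including the intercept, so the adjusted denominator is `n - n_params` (equivalently *n* - *k* - 1 when *k* counts only the non-intercept predictors). The five noise predictors are an assumption of the example.

```python
import numpy as np

def r_squared(y, y_hat, n_params):
    """Return (R^2, adjusted R^2); n_params counts all columns of X, intercept included."""
    n = len(y)
    rss = np.sum((y - y_hat) ** 2)
    tss = np.sum((y - np.mean(y)) ** 2)
    r2 = 1.0 - rss / tss
    r2_adj = 1.0 - (1.0 - r2) * (n - 1) / (n - n_params)
    return r2, r2_adj

def fit_predict(X, y):
    """Fitted values from an ordinary least squares fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ beta

rng = np.random.default_rng(4)
n = 100
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)

X_small = np.column_stack([np.ones(n), x])
X_big = np.column_stack([X_small, rng.normal(size=(n, 5))])   # 5 irrelevant predictors

r2_small, adj_small = r_squared(y, fit_predict(X_small, y), X_small.shape[1])
r2_big, adj_big = r_squared(y, fit_predict(X_big, y), X_big.shape[1])
```

`r2_big` can never be smaller than `r2_small`, because the larger model nests the smaller one; whether `adj_big` falls below `adj_small` depends on the particular draw.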

### Residual analysis

Beyond summary statistics, residual plots provide diagnostic information about model adequacy. Common patterns to check for include:

- **Non-random patterns** in residuals versus fitted values, which may indicate nonlinearity or missing predictors
- **Fan-shaped patterns** that suggest heteroscedasticity
- **Deviations from normality** in Q-Q plots, which affect the validity of hypothesis tests in small samples
- **Serial correlation** in time-ordered residual plots

## Hypothesis testing and inference

When the errors are normally distributed (or the sample size is large enough for asymptotic normality to apply), OLS enables statistical inference about the model parameters [8].

### t-tests for individual coefficients

The standard error of the *j*-th coefficient is:

$$
\mathrm{se}(\hat{\beta}_j) = s \sqrt{(X^\top X)^{-1}_{jj}}
$$

The test statistic for H₀: βⱼ = 0 is:

$$
t_j = \frac{\hat{\beta}_j}{\mathrm{se}(\hat{\beta}_j)}
$$

which follows a t-distribution with *n* - *p* degrees of freedom under the null hypothesis.

### F-test for overall significance

The F-test evaluates whether the regression model as a whole explains a statistically significant portion of the variance:

$$
F = \frac{(\mathrm{TSS} - \mathrm{RSS}) / (p - 1)}{\mathrm{RSS} / (n - p)}
$$

This statistic follows an F-distribution with (*p* - 1, *n* - *p*) degrees of freedom under the null hypothesis that all slope coefficients are zero.
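The standard errors, t statistics, and F statistic defined above can be computed directly with NumPy. The sketch below uses a simulated one-predictor model (an assumption for illustration) and stops at the test statistics; turning them into p-values requires the t and F distribution functions, which NumPy itself does not provide.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 120
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])       # p = 2: intercept and one slope
p = X.shape[1]
y = 0.5 + 1.5 * x + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat

rss = resid @ resid
tss = np.sum((y - y.mean()) ** 2)
s = np.sqrt(rss / (n - p))                 # residual standard error

se = s * np.sqrt(np.diag(XtX_inv))         # standard errors of the coefficients
t_stats = beta_hat / se                    # t statistics for H0: beta_j = 0
f_stat = ((tss - rss) / (p - 1)) / (rss / (n - p))
```

With a single slope coefficient, the overall F statistic equals the square of that slope's t statistic, which makes a convenient consistency check.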

### Confidence intervals

A 100(1 - α)% confidence interval for the *j*-th coefficient is:

$$
\hat{\beta}_j \pm t_{\alpha/2,\, n-p} \, \mathrm{se}(\hat{\beta}_j)
$$

where t_{α/2, n-p} is the critical value from the t-distribution.

## How is least squares computed in practice?

### Direct solution via the normal equation

The most straightforward approach computes β̂ = (XᵀX)⁻¹Xᵀy directly. Forming XᵀX costs about O(np²) and inverting it (or solving the linear system) costs about O(p³), for a combined cost of roughly O(np² + p³) [12]. However, explicitly inverting XᵀX is numerically unstable when the matrix is ill-conditioned (i.e., when predictors are nearly collinear) [12].

### QR decomposition

In practice, much statistical software solves OLS using QR decomposition. The design matrix is factored as X = QR, where Q is an *n* x *p* matrix with orthonormal columns and R is a *p* x *p* upper triangular matrix. The solution satisfies Rβ̂ = Qᵀy, which is solved by back-substitution rather than by inverting R. QR decomposition is numerically more stable than forming and inverting XᵀX, and it is the method behind R's `lm` function [12]. NumPy's `numpy.linalg.lstsq` and scikit-learn's `LinearRegression` (for dense inputs) instead rely on SVD-based LAPACK routines by default.
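A minimal sketch of the QR route in NumPy follows, using synthetic data and the reduced factorization from `np.linalg.qr`. For brevity the triangular system is passed to the general solver `np.linalg.solve`; a dedicated triangular solver would exploit the structure of R.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = X @ np.array([2.0, -1.0, 0.5, 3.0]) + 0.1 * rng.normal(size=n)

# Reduced QR: Q is n x p with orthonormal columns, R is p x p upper triangular
Q, R = np.linalg.qr(X, mode="reduced")

# Solve R beta = Q^T y (general solver used here for simplicity)
beta_qr = np.linalg.solve(R, Q.T @ y)

# Reference solution from NumPy's least squares routine
beta_ref, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Note that XᵀX is never formed, which is the source of the improved numerical behaviour.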

### Singular value decomposition (SVD)

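SVD factors X = UΣVᵀ and handles rank-deficient or near-singular matrices gracefully. The pseudoinverse of *X* is computed as X⁺ = VΣ⁺Uᵀ, where Σ⁺ inverts the nonzero singular values and leaves the zero (or negligibly small) ones at zero, giving β̂ = X⁺y. When *X* has full column rank this reduces to VΣ⁻¹Uᵀ. SVD is more computationally expensive than QR but provides the most robust solution, especially when the data exhibit severe multicollinearity [12].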
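To illustrate the rank-deficient case, the sketch below duplicates a predictor column so that XᵀX is singular, then builds the pseudoinverse solution from the SVD by inverting only the singular values above a tolerance. The tolerance rule (machine epsilon times the larger matrix dimension times the largest singular value) is a common convention adopted here as an assumption.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 40
x1 = rng.normal(size=n)
# Third column duplicates the second, so X has rank 2 and X^T X is singular.
X = np.column_stack([np.ones(n), x1, x1])
y = 1.0 + 2.0 * x1 + 0.1 * rng.normal(size=n)

U, sing, Vt = np.linalg.svd(X, full_matrices=False)

# Invert only singular values above a tolerance; treat the rest as zero.
tol = max(X.shape) * np.finfo(float).eps * sing.max()
sing_inv = np.array([1.0 / s if s > tol else 0.0 for s in sing])

beta_svd = Vt.T @ (sing_inv * (U.T @ y))   # minimum-norm least squares solution
rank = int(np.sum(sing > tol))
```

Among the infinitely many coefficient vectors that fit equally well, the pseudoinverse picks the one with the smallest Euclidean norm, which here splits the slope evenly between the two identical columns.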

### How does least squares differ from gradient descent?

For very large datasets where forming and decomposing XᵀX is prohibitively expensive, iterative methods such as [gradient descent](/wiki/gradient_descent) can be used. The parameter update rule is:

$$
\beta_{t+1} = \beta_t - \alpha \nabla S(\beta_t) = \beta_t + 2\alpha X^\top (y - X\beta_t)
$$

where α is the [learning rate](/wiki/learning_rate). The fundamental difference is that the normal equation solves for β̂ analytically in a single step (no learning rate to tune, at a cost of about O(np² + p³)), whereas gradient descent approaches the minimum iteratively over T steps at a per-step cost of about O(np), giving O(npT) overall [12]. The analytical solution is usually preferred when the number of predictors is modest (a common rule of thumb puts the limit at roughly 10,000 features), while gradient descent scales better when p is very large because it avoids inverting or even forming the p x p matrix XᵀX. Stochastic gradient descent and mini-batch variants, which use only a subset of the observations at each step, scale efficiently to millions of observations and are the standard approach in [deep learning](/wiki/deep_neural_network) frameworks.
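The update rule can be sketched in a few lines of NumPy. The function name `ols_gradient_descent`, the fixed iteration count, and the step-size rule are assumptions of this example: because the Hessian of S is 2XᵀX, a fixed step is stable when the learning rate is below 1/λ_max(XᵀX), and half of that bound is used here.

```python
import numpy as np

def ols_gradient_descent(X, y, lr, n_iters):
    """Minimize S(beta) = ||y - X beta||^2 with fixed-step gradient descent."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = -2.0 * X.T @ (y - X @ beta)   # gradient of S at the current beta
        beta = beta - lr * grad
    return beta

rng = np.random.default_rng(8)
n = 300
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + 0.1 * rng.normal(size=n)

# Step size: half of the stability bound 1 / lambda_max(X^T X)
lam_max = np.linalg.eigvalsh(X.T @ X).max()
beta_gd = ols_gradient_descent(X, y, lr=0.5 / lam_max, n_iters=2000)

# Closed-form solution for comparison
beta_exact = np.linalg.solve(X.T @ X, X.T @ y)
```

On a well-conditioned problem like this one the iterates agree with the closed-form solution to high precision; on ill-conditioned data, convergence slows in proportion to the ratio of the largest to the smallest eigenvalue of XᵀX.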

| Method | Time complexity | Numerical stability | Best suited for |
|---|---|---|---|
| Normal equation | O(np² + p³) | Low (sensitive to conditioning) | Small to medium datasets |
| QR decomposition | O(np²) | High | General-purpose use |
| SVD | O(np²) (higher constant) | Very high | Rank-deficient or ill-conditioned problems |
| Gradient descent | O(npT), T = iterations | Moderate | Very large datasets |

## Variants of least squares

### Weighted least squares (WLS)

When the assumption of homoscedasticity is violated but the error variances are known (or can be estimated), weighted least squares provides a more efficient estimator. WLS minimizes a weighted sum of squared residuals:

$$
S_w(\beta) = \sum_i w_i (y_i - x_i^\top \beta)^2
$$

where $$w_i = 1/\sigma_i^2$$ is the weight assigned to the *i*-th observation, inversely proportional to its error variance. In matrix form:

$$
\hat{\beta}_{\text{WLS}} = (X^\top W X)^{-1} X^\top W y
$$

where $$W = \mathrm{diag}(w_1, w_2, \ldots, w_n)$$. Observations with smaller error variance receive more weight because they carry more information about the true relationship. WLS is the appropriate method when errors are independent but have unequal variances [7].
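A small sketch of WLS follows, on simulated data whose error standard deviation grows with the predictor and is treated as known (an assumption that rarely holds exactly in practice). It computes the matrix formula without materializing the *n* x *n* matrix W and compares it with the equivalent approach of rescaling each row by the square root of its weight before running OLS.

```python
import numpy as np

rng = np.random.default_rng(9)
n = 400
x = rng.uniform(1.0, 5.0, size=n)
X = np.column_stack([np.ones(n), x])
sigma_i = 0.2 * x                            # error s.d. grows with x (assumed known)
y = 1.0 + 2.0 * x + sigma_i * rng.normal(size=n)

w = 1.0 / sigma_i**2                         # weights w_i = 1 / sigma_i^2

# Matrix formula (X^T W X)^{-1} X^T W y, without building the n x n matrix W
XtW = X.T * w                                # scales column i of X^T by w_i
beta_wls = np.linalg.solve(XtW @ X, XtW @ y)

# Equivalent route: rescale each row by sqrt(w_i), then run ordinary least squares
sw = np.sqrt(w)
beta_rescaled, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
```

The two routes give the same coefficients because rescaling row *i* by √wᵢ turns the weighted sum of squares into an ordinary one.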

### Generalized least squares (GLS)

GLS extends WLS to handle cases where the errors are both heteroscedastic and correlated, with a general covariance structure $$\mathrm{Var}(\varepsilon) = \sigma^2 \Omega$$, where $$\Omega$$ is a known positive definite matrix. The GLS estimator, introduced by Alexander Aitken in 1935, is [4]:

$$
\hat{\beta}_{\text{GLS}} = (X^\top \Omega^{-1} X)^{-1} X^\top \Omega^{-1} y
$$

GLS is equivalent to applying OLS to a transformed version of the data. Using the Cholesky decomposition $$\Omega = C C^\top$$, the transformation $$C^{-1}$$ decorrelates and standardizes the errors, producing a model suitable for standard OLS. Under the extended Gauss-Markov theorem (the Aitken theorem), the GLS estimator is BLUE when the covariance structure is correctly specified [4].

WLS is the special case of GLS where Ω is diagonal (errors are uncorrelated but have unequal variances).
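The equivalence between the GLS formula and OLS on transformed data can be checked numerically. The sketch below assumes an AR(1)-style correlation matrix with entries ρ^|i-j| purely as an example of a known Ω; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(10)
n, rho = 200, 0.7
X = np.column_stack([np.ones(n), rng.normal(size=n)])

# AR(1)-style correlation, Omega_ij = rho^|i-j| (assumed structure for this sketch)
idx = np.arange(n)
Omega = rho ** np.abs(idx[:, None] - idx[None, :])

C = np.linalg.cholesky(Omega)                # Omega = C C^T, C lower triangular
eps = C @ rng.normal(size=n)                 # errors with covariance Omega
y = X @ np.array([1.0, 2.0]) + eps

# Direct GLS formula, using solves instead of an explicit Omega^{-1}
Oinv_X = np.linalg.solve(Omega, X)           # Omega^{-1} X
Oinv_y = np.linalg.solve(Omega, y)           # Omega^{-1} y
beta_gls = np.linalg.solve(X.T @ Oinv_X, X.T @ Oinv_y)

# Whitening route: premultiply by C^{-1}, then apply OLS to the transformed data
X_star = np.linalg.solve(C, X)
y_star = np.linalg.solve(C, y)
beta_whitened, *_ = np.linalg.lstsq(X_star, y_star, rcond=None)
```

Both routes return the same estimate, since (C⁻¹X)ᵀ(C⁻¹X) = XᵀΩ⁻¹X.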

### Feasible generalized least squares (FGLS)

In practice, the true covariance matrix Ω is rarely known. Feasible GLS estimates it from the data and then applies GLS, in the following steps:

1. Fit the model using OLS and compute the residuals
2. Use the residuals to estimate the covariance structure Ω̂
3. Apply GLS with the estimated covariance matrix

FGLS is consistent and asymptotically efficient under regularity conditions. However, for small to medium-sized samples, it can actually be less efficient than OLS. Practical guidance from the econometrics literature recommends using OLS with robust standard errors rather than FGLS when the sample size is modest: heteroscedasticity-consistent estimators (such as the Eicker-White estimator) when only the variances differ, or heteroscedasticity and autocorrelation consistent (HAC) estimators (such as the Newey-West estimator) when the errors may also be serially correlated [6][9].
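The FGLS steps can be sketched for the simplest case, a diagonal Ω (pure heteroscedasticity). The variance model used in the second step, log σᵢ² = g₀ + g₁ log xᵢ fitted by regressing the log squared residuals on log x, is one common choice and is an assumption of this example rather than part of the FGLS definition.

```python
import numpy as np

rng = np.random.default_rng(11)
n = 500
x = rng.uniform(1.0, 5.0, size=n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + (0.3 * x) * rng.normal(size=n)   # error s.d. proportional to x

# Step 1: OLS fit and residuals
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_ols

# Step 2: estimate a diagonal Omega from the residuals.
# Assumed variance model: log(sigma_i^2) = g0 + g1 * log(x_i)
Z = np.column_stack([np.ones(n), np.log(x)])
gamma, *_ = np.linalg.lstsq(Z, np.log(resid**2), rcond=None)
var_hat = np.exp(Z @ gamma)

# Step 3: GLS with the estimated covariance (for diagonal Omega this is WLS)
w = 1.0 / var_hat
XtW = X.T * w
beta_fgls = np.linalg.solve(XtW @ X, XtW @ y)
```

Because the simulated variance is proportional to x², the fitted exponent `gamma[1]` should come out near 2, although it is estimated quite noisily. Only the relative sizes of the estimated variances matter for `beta_fgls`, since a common scale factor cancels in the formula.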

### Comparison of OLS, WLS, and GLS

| Method | Error structure assumed | Estimator formula | When to use |
|---|---|---|---|
| OLS | Homoscedastic, uncorrelated ($$\sigma^2 I$$) | $$(X^\top X)^{-1} X^\top y$$ | Baseline method; valid when all classical assumptions hold |
| WLS | Heteroscedastic, uncorrelated (diagonal $$\Omega$$) | $$(X^\top W X)^{-1} X^\top W y$$ | Known or estimable unequal variances; no correlation |
| GLS | General covariance (full $$\Omega$$) | $$(X^\top \Omega^{-1} X)^{-1} X^\top \Omega^{-1} y$$ | Correlated and/or heteroscedastic errors with known $$\Omega$$ |
| FGLS | General covariance (estimated $$\hat{\Omega}$$) | $$(X^\top \hat{\Omega}^{-1} X)^{-1} X^\top \hat{\Omega}^{-1} y$$ | Correlated/heteroscedastic errors; $$\Omega$$ unknown but estimable |

### Nonlinear least squares

When the model is nonlinear in its parameters (for example, $$y = \beta_1 e^{\beta_2 x} + \varepsilon$$), the residual sum of squares no longer has a closed-form solution. Instead, iterative optimization algorithms are used:

- **Gauss-Newton algorithm**: Linearizes the model using a first-order Taylor expansion at each iteration, then applies OLS to the linearized model. It converges quickly near the solution but can fail to converge from poor starting values.
- **Levenberg-Marquardt algorithm**: Interpolates between Gauss-Newton and [gradient descent](/wiki/gradient_descent). When parameters are far from optimal, it behaves like gradient descent; when close, it behaves like Gauss-Newton. It is more robust than pure Gauss-Newton but somewhat slower when starting values are already good [13][14].

Nonlinear least squares does not guarantee convergence to the global minimum, and the solution may depend on the choice of starting values.
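As a rough illustration of the Gauss-Newton idea for the exponential model mentioned above, the sketch below repeatedly linearizes the model and solves a linear least squares problem for the step. It has no damping or step-size control, so it is not a substitute for a Levenberg-Marquardt implementation; the function name, starting values, and noise-free data are assumptions chosen so that the iteration converges.

```python
import numpy as np

def gauss_newton_exp(x, y, beta0, n_iters=50, tol=1e-12):
    """Fit y = b1 * exp(b2 * x) by plain Gauss-Newton (no damping)."""
    beta = np.asarray(beta0, dtype=float)
    for _ in range(n_iters):
        pred = beta[0] * np.exp(beta[1] * x)
        resid = y - pred
        # Jacobian of the model with respect to (b1, b2)
        J = np.column_stack([np.exp(beta[1] * x),
                             beta[0] * x * np.exp(beta[1] * x)])
        # Linearized problem: find the step minimizing ||resid - J step||^2
        step, *_ = np.linalg.lstsq(J, resid, rcond=None)
        beta = beta + step
        if np.linalg.norm(step) < tol:
            break
    return beta

# Noise-free synthetic data with assumed true parameters (2.0, 0.5)
x = np.linspace(0.0, 2.0, 25)
y = 2.0 * np.exp(0.5 * x)
beta_fit = gauss_newton_exp(x, y, beta0=[1.5, 0.4])
```

Each iteration is itself an ordinary least squares problem in the step, which is why the linear theory remains the computational core of the nonlinear method.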

### Total least squares

Ordinary least squares assumes that only the dependent variable contains measurement error. Total least squares (also called orthogonal distance regression or errors-in-variables regression) accounts for errors in both the dependent and independent variables. Instead of minimizing vertical distances from data points to the fitted line, it minimizes the perpendicular (orthogonal) distances. Total least squares is computed using [singular value decomposition](/wiki/singular_value_decomposition) (SVD) and is appropriate when the predictor variables are measured with noise [12].
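For a straight line in two dimensions, the SVD computation can be sketched as follows. The data are noise-free points on an assumed line y = 1 + 2x so that the result is easy to verify; the approach of taking the right singular vector with the smallest singular value of the centered data is the standard construction for this simple case.

```python
import numpy as np

# Points lying exactly on y = 1 + 2x (noise-free, for an easy check)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x

# Center the data, then take the right singular vector with the smallest
# singular value: it is the unit normal of the best-fitting line.
Zc = np.column_stack([x - x.mean(), y - y.mean()])
_, _, Vt = np.linalg.svd(Zc, full_matrices=False)
normal = Vt[-1]                       # (a, b) with a*(x - x_bar) + b*(y - y_bar) = 0

slope = -normal[0] / normal[1]        # assumes the line is not vertical (b != 0)
intercept = y.mean() - slope * x.mean()
```

With noisy data the same code minimizes the sum of squared perpendicular distances, and its slope will generally differ from the OLS slope.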

## Regularized least squares

When the number of predictors is large relative to the sample size, or when predictors are highly correlated, the OLS estimator can have high variance and [overfit](/wiki/overfitting) the training data. Regularized versions of least squares add a penalty term to the objective function to control model complexity [7][8].

### Ridge regression (L2 regularization)

[Ridge regression](/wiki/l2_regularization) adds the squared L2 norm of the coefficient vector to the [loss function](/wiki/loss_function):

$$
S_{\text{ridge}}(\beta) = \lVert y - X\beta \rVert^2 + \lambda \lVert \beta \rVert_2^2
$$

The solution is:

$$
\hat{\beta}_{\text{ridge}} = (X^\top X + \lambda I)^{-1} X^\top y
$$

The regularization parameter λ > 0 shrinks the coefficients toward zero but never sets them exactly to zero. Ridge regression is effective when many predictors contribute to the outcome and multicollinearity is present. Adding λI to XᵀX ensures the matrix is always invertible, even when XᵀX is singular [7][8].
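The closed-form ridge solution is easy to sketch in NumPy. The example below deliberately uses two identical predictor columns, so XᵀX is singular and OLS has no unique solution, while the ridge system remains solvable. The helper name `ridge`, the data, and the two λ values are assumptions for illustration; no intercept is included, which sidesteps the usual question of whether to penalize it.

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge solution (X^T X + lam I)^{-1} X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(12)
n = 60
x1 = rng.normal(size=n)
# Two identical predictor columns: X^T X is singular, so OLS has no unique solution
X = np.column_stack([x1, x1])
y = 3.0 * x1 + 0.1 * rng.normal(size=n)

beta_small = ridge(X, y, lam=1.0)      # light shrinkage
beta_large = ridge(X, y, lam=100.0)    # heavy shrinkage
```

Ridge splits the effect equally between the duplicated columns, and the coefficient norm shrinks as λ grows.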

### Lasso (L1 regularization)

[Lasso regression](/wiki/l1_regularization) adds the L1 norm of the coefficient vector:

$$
S_{\text{lasso}}(\beta) = \lVert y - X\beta \rVert^2 + \lambda \lVert \beta \rVert_1
$$

Unlike ridge regression, the L1 penalty can drive some coefficients to exactly zero, performing automatic variable selection. This makes the Lasso useful when only a subset of predictors are relevant. The Lasso does not have a closed-form solution and is typically solved using coordinate descent or the LARS algorithm [8].
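A bare-bones coordinate descent for the Lasso objective as written above can be sketched as follows. Each coordinate update is a soft-thresholding step; because the squared loss here carries no factor of 1/2, the threshold is λ/2. The function names, the fixed number of sweeps, the absence of an intercept, and the simulated data are all assumptions of this example.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator S(z, t) = sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_sweeps=200):
    """Cyclic coordinate descent for ||y - X b||^2 + lam * ||b||_1 (no intercept)."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = np.sum(X**2, axis=0)            # x_j^T x_j for each column
    for _ in range(n_sweeps):
        for j in range(p):
            # Partial residual: remove every predictor's contribution except j's
            r_j = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r_j
            # Threshold is lam / 2 because the squared loss has no 1/2 factor
            beta[j] = soft_threshold(rho, lam / 2.0) / col_sq[j]
    return beta

rng = np.random.default_rng(13)
n = 200
X = rng.normal(size=(n, 5))
y = X @ np.array([3.0, 0.0, 0.0, -2.0, 0.0]) + 0.1 * rng.normal(size=n)

beta_lasso = lasso_cd(X, y, lam=40.0)
```

In this setup the three irrelevant coefficients are set exactly to zero, while the two relevant ones are retained but shrunk slightly toward zero.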

### Elastic net

Elastic net combines both L1 and L2 penalties:

$$
S_{\text{enet}}(\beta) = \lVert y - X\beta \rVert^2 + \lambda_1 \lVert \beta \rVert_1 + \lambda_2 \lVert \beta \rVert_2^2
$$

It inherits the variable selection property of the Lasso while handling groups of correlated predictors better than the Lasso alone, which tends to arbitrarily select one predictor from a correlated group and ignore the rest [8].

### Comparison of regularization methods

| Method | Penalty | Variable selection | Handles multicollinearity | Solution form |
|---|---|---|---|---|
| OLS | None | No | Poorly | Closed-form |
| Ridge | $$\lambda \lVert \beta \rVert_2^2$$ | No (shrinks but keeps all) | Yes | Closed-form |
| Lasso | $$\lambda \lVert \beta \rVert_1$$ | Yes (sets some to zero) | Partially | Iterative |
| Elastic net | $$\lambda_1 \lVert \beta \rVert_1 + \lambda_2 \lVert \beta \rVert_2^2$$ | Yes | Yes | Iterative |

## Connection to maximum likelihood estimation

OLS has a direct connection to [maximum likelihood](/wiki/generalized_linear_model) estimation under the assumption of normally distributed errors. If ε ~ N(0, σ²I), the likelihood function for the observed data is:

$$
L(\beta, \sigma^2) = (2\pi\sigma^2)^{-n/2} \exp\left[-\frac{1}{2\sigma^2} (y - X\beta)^\top (y - X\beta)\right]
$$

Maximizing this likelihood with respect to β is equivalent to minimizing (y - Xβ)ᵀ(y - Xβ), which is exactly the OLS objective. This means the OLS estimator is also the maximum likelihood estimator when errors are normal, giving it additional desirable properties such as asymptotic efficiency and the achievement of the Cramér-Rao lower bound [3][10].
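The equivalence is easiest to see after taking logarithms. Writing $$S(\beta) = (y - X\beta)^\top (y - X\beta)$$ for the residual sum of squares, the log-likelihood is:

$$
\log L(\beta, \sigma^2) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} S(\beta)
$$

For any fixed $$\sigma^2 > 0$$, the first term does not depend on β and the second is a negative multiple of $$S(\beta)$$, so the β that maximizes the likelihood is the one that minimizes $$S(\beta)$$. Setting the derivative with respect to $$\sigma^2$$ to zero then gives the maximum likelihood variance estimate:

$$
\hat{\sigma}^2_{\text{ML}} = \frac{S(\hat{\beta})}{n}
$$

This divides by *n* rather than *n* - *p*, so it is biased downward in finite samples, unlike the estimator $$s^2$$ given earlier.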

## What is least squares regression used for?

### Statistics and econometrics

OLS is the default estimation method for linear models in econometrics, where it is used to estimate demand functions, production functions, wage equations, and countless other relationships. Extensions such as two-stage least squares (2SLS) and three-stage least squares (3SLS) handle endogeneity through instrumental variables [9].

### Machine learning

In [supervised learning](/wiki/supervised_learning), least squares regression serves as the foundation for understanding more complex models. Many algorithms, including [neural networks](/wiki/neural_network), minimize [mean squared error](/wiki/mean_squared_error_mse) as their loss function; MSE is the residual sum of squares divided by *n*, so it has the same minimizer as the least squares objective, and for a linear model it reduces to OLS. The [gradient descent](/wiki/gradient_descent) algorithms used to train deep learning models apply the same iterative minimization to losses, squared error among them, for which no closed-form solution exists [7].

### Signal processing

Least squares methods are used extensively in signal processing for tasks such as filter design, system identification, spectrum estimation, and adaptive filtering. The Wiener filter, for instance, is derived from the principle of least squares applied to the problem of estimating a signal from noisy observations.

### Engineering and physical sciences

Least squares fitting is standard practice in physics, chemistry, and engineering for calibrating instruments, fitting theoretical models to experimental data, and determining physical constants. The method's statistical properties make it the natural choice whenever measurement noise is approximately Gaussian.

### Geodesy and astronomy

Historically, the method's first applications were in determining planetary orbits and making geodetic measurements [2]. These remain active areas of application, with modern extensions handling the complex correlation structures that arise from satellite observations and GPS measurements.

## Relationship to other methods

| Method | Relationship to least squares |
|---|---|
| [Linear regression](/wiki/linear_regression) | OLS is the standard estimation method for linear regression |
| [Logistic regression](/wiki/logistic_regression) | Uses maximum likelihood instead of least squares because the response is binary |
| [Generalized linear models](/wiki/generalized_linear_model) | Extend the least squares framework to non-normal response distributions |
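| [Principal component analysis](/wiki/principal_component_analysis) | Both can be computed from a singular value decomposition of the data matrix; PCA finds directions of maximum variance (minimizing orthogonal distances to the fitted subspace), while OLS minimizes squared prediction error in the response |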
| [Support vector machines](/wiki/machine_learning) | Replace squared loss with hinge loss for classification |
| [Neural networks](/wiki/neural_network) | Can use MSE loss (equivalent to least squares) but typically with nonlinear architectures |
| [Random forests](/wiki/random_forest) | Ensemble of [decision trees](/wiki/decision_tree); does not use least squares directly but can minimize MSE |
| [Bayesian regression](/wiki/bayesian_optimization) | Places a prior distribution on β and computes the posterior; the posterior mode under a Gaussian prior is equivalent to ridge regression |
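
The ridge equivalence in the last row can be sketched as follows, assuming a Gaussian likelihood $$y \mid \beta \sim \mathcal{N}(X\beta, \sigma^2 I)$$ and a zero-mean isotropic Gaussian prior $$\beta \sim \mathcal{N}(0, \tau^2 I)$$. The symbols $$\sigma^2$$, $$\tau^2$$, and $$\lambda$$ are notation introduced here for illustration.

$$
-\log p(\beta \mid y) = \frac{1}{2\sigma^2} \lVert y - X\beta \rVert^2 + \frac{1}{2\tau^2} \lVert \beta \rVert^2 + \text{const}
$$

$$
\hat{\beta}_{\text{MAP}} = \arg\min_{\beta} \; \lVert y - X\beta \rVert^2 + \lambda \lVert \beta \rVert^2 = (X^\top X + \lambda I)^{-1} X^\top y, \qquad \lambda = \frac{\sigma^2}{\tau^2}
$$

A diffuse prior (large $$\tau^2$$) gives a small $$\lambda$$, and the posterior mode approaches the OLS solution.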

## What are the advantages and limitations of least squares?

### Advantages

- **Mathematical tractability**: A closed-form solution exists and is easy to compute
- **Well-understood theory**: More than two centuries of statistical theory, dating back to Legendre and Gauss, support its properties
- **Interpretability**: Coefficients have direct interpretations as partial effects
- **Efficiency**: BLUE under the Gauss-Markov assumptions [15]
- **Computational speed**: Solving the normal equations is fast for moderate-sized problems
- **Foundation for extensions**: Many advanced methods build directly on OLS

### When does least squares fail?

- **Sensitivity to outliers**: Squared residuals heavily penalize large errors, making OLS vulnerable to outliers. Robust regression methods (such as least absolute deviations or M-estimators) are more resistant.
- **Assumes linear relationship**: When the true relationship is nonlinear, OLS may produce a poor fit unless appropriate transformations are applied.
- **Multicollinearity**: Near-collinear predictors inflate variances and make individual coefficient estimates unreliable [6].
- **Not suitable for all response types**: For binary, count, or categorical outcomes, methods such as [logistic regression](/wiki/logistic_regression) or Poisson regression are more appropriate.
- **No automatic feature selection**: Standard OLS includes all predictors; regularization or stepwise methods are needed to select relevant variables.
- **Breaks when XᵀX is singular**: Perfect multicollinearity (or more predictors than observations) leaves the normal equations without a unique solution, so regularization or a pseudoinverse is needed to select one [7][12].
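
The last failure mode in the list can be reproduced with a tiny example. This is a minimal sketch, assuming NumPy; the data values and the explicit tolerance `1e-10` are illustrative choices. One column is an exact multiple of another, so many coefficient vectors give identical fitted values, and the pseudoinverse picks the one with the smallest norm.

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X = np.column_stack([np.ones(5), x1, 2.0 * x1])   # third column is exactly twice the second
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# The design matrix has three columns but only two linearly independent ones
rank = np.linalg.matrix_rank(X, tol=1e-10)

# Minimum-norm least squares solutions via the pseudoinverse and via lstsq
beta_pinv = np.linalg.pinv(X, rcond=1e-10) @ y
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=1e-10)

# Adding any multiple of (0, 2, -1) leaves the fitted values unchanged,
# because 2 * (second column) - (third column) is the zero vector
beta_alt = beta_pinv + np.array([0.0, 2.0, -1.0])

print(rank)
print(X @ beta_pinv)
print(X @ beta_alt)
```

Because the fit cannot distinguish `beta_pinv` from `beta_alt`, the individual coefficients on the collinear columns are not identified, even though the predictions are.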

## Implementation in software

Least squares regression is available in virtually every statistical and machine learning software package:

| Language/Tool | Function or class |
|---|---|
| Python ([scikit-learn](/wiki/scikit-learn)) | `sklearn.linear_model.LinearRegression` |
| Python (statsmodels) | `statsmodels.api.OLS` |
| Python ([NumPy](/wiki/numpy)) | `numpy.linalg.lstsq` |
| R | `lm()` |
| MATLAB | `fitlm()`, backslash operator `\` |
| Julia | `lm()` from `GLM.jl`, backslash operator `\` |
| [TensorFlow](/wiki/tensorflow) / [PyTorch](/wiki/pytorch) | Custom models with MSE loss |
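
As a brief usage illustration for one entry in the table, the sketch below calls `numpy.linalg.lstsq` on a four-point data set. The data values are made up for the example; the design matrix includes an explicit column of ones because `lstsq` does not add an intercept on its own.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])
X = np.column_stack([np.ones_like(x), x])    # intercept column plus the predictor

# lstsq returns the coefficients, the residual sum of squares,
# the rank of X, and the singular values of X
coef, rss, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = coef

print(intercept, slope)   # approximately 1.09 and 1.94
print(rss)                # one-element array, residual sum of squares of about 0.082
```

Higher-level interfaces such as `statsmodels.api.OLS` and R's `lm()` additionally report standard errors and test statistics, which `lstsq` does not compute.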

## See also

- [Linear regression](/wiki/linear_regression)
- [Gradient descent](/wiki/gradient_descent)
- [Mean squared error](/wiki/mean_squared_error_mse)
- [Overfitting](/wiki/overfitting)
- [L1 regularization (Lasso)](/wiki/l1_regularization)
- [L2 regularization (Ridge)](/wiki/l2_regularization)
- [Loss function](/wiki/loss_function)
- [Bias-variance tradeoff](/wiki/bias_variance_tradeoff)
- [Cross-validation](/wiki/cross-validation)
- [Logistic regression](/wiki/logistic_regression)

## References

1. Legendre, A.-M. (1805). *Nouvelles méthodes pour la détermination des orbites des comètes*. Paris: Firmin Didot.
2. Gauss, C. F. (1809). *Theoria Motus Corporum Coelestium*. Hamburg: Perthes et Besser.
3. Stigler, S. M. (1981). "Gauss and the Invention of Least Squares." *Annals of Statistics*, 9(3), 465-474.
4. Aitken, A. C. (1935). "On Least Squares and Linear Combination of Observations." *Proceedings of the Royal Society of Edinburgh*, 55, 42-48.
5. Hayashi, F. (2000). *Econometrics*. Princeton University Press. Chapter 1: Finite-sample properties of OLS; Chapter 2: Large-sample theory.
6. Greene, W. H. (2018). *Econometric Analysis* (8th ed.). Pearson. Chapters 3-5.
7. Hastie, T., Tibshirani, R., & Friedman, J. (2009). *The Elements of Statistical Learning* (2nd ed.). Springer. Chapter 3: Linear methods for regression.
8. James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). *An Introduction to Statistical Learning* (2nd ed.). Springer. Chapter 3: Linear regression; Chapter 6: Linear model selection and regularization.
9. Angrist, J. D., & Pischke, J.-S. (2009). *Mostly Harmless Econometrics*. Princeton University Press.
10. Wasserman, L. (2004). *All of Statistics*. Springer. Chapter 13: Linear and logistic regression.
11. Seber, G. A. F., & Lee, A. J. (2003). *Linear Regression Analysis* (2nd ed.). Wiley.
12. Björck, Å. (1996). *Numerical Methods for Least Squares Problems*. SIAM.
13. Levenberg, K. (1944). "A Method for the Solution of Certain Non-Linear Problems in Least Squares." *Quarterly of Applied Mathematics*, 2(2), 164-168.
14. Marquardt, D. W. (1963). "An Algorithm for Least-Squares Estimation of Nonlinear Parameters." *Journal of the Society for Industrial and Applied Mathematics*, 11(2), 431-441.
15. "Gauss-Markov theorem." Wikipedia. Statement that the OLS estimator has the lowest sampling variance within the class of linear unbiased estimators when errors are uncorrelated, have equal variances, and have expectation zero; errors need not be normal or i.i.d.
16. "Least squares." Wikipedia. Definition of least squares as a method to determine the best-fit model by minimizing the sum of the squared residuals.