Least squares regression is a statistical method for estimating the parameters of a model by minimizing the sum of squared differences between observed values and the values predicted by the model. It is one of the oldest and most widely used techniques in statistics and machine learning, forming the mathematical backbone of linear regression and many other predictive modeling approaches. The core idea is straightforward: find the set of parameters that makes the model's predictions as close as possible to the actual data, where "closeness" is measured by the sum of squared residuals.
The method was independently developed by Adrien-Marie Legendre and Carl Friedrich Gauss in the early 19th century. Since then, it has expanded into a family of techniques, including ordinary least squares (OLS), weighted least squares (WLS), generalized least squares (GLS), nonlinear least squares, and total least squares, each suited to different data conditions and modeling assumptions.
Imagine you have a bunch of dots on a piece of paper, and you want to draw the single straight line that goes closest to all the dots at the same time. Some dots will be a little above the line, and some will be a little below it. Least squares regression finds the best line by looking at how far each dot is from the line, squaring those distances (so bigger misses count a lot more), and then picking the line that makes the total of all those squared distances as small as possible. It is like finding the fairest middle path through a scatter of points.
The method of least squares has a rich history intertwined with the development of modern statistics and astronomy.
The first published account of the method appeared in Adrien-Marie Legendre's 1805 work, Nouvelles methodes pour la determination des orbites des cometes (New Methods for the Determination of Comet Orbits). An appendix titled "Sur la Methode des moindres quarres" (On the Method of Least Squares) laid out the procedure clearly. Legendre proposed minimizing the sum of squared residuals as a principled way to fit a curve to observational data, and within a decade the technique was adopted across France, Italy, and Prussia as the standard method in astronomy and geodesy.
Carl Friedrich Gauss later claimed he had been using the method since 1795, when he was eighteen years old. He published his approach in 1809 in Theoria Motus Corporum Coelestium (Theory of the Motion of Celestial Bodies), where he used least squares to predict the orbit of the asteroid Ceres after it was discovered by Giuseppe Piazzi in 1801. Gauss's prediction of Ceres's position was remarkably accurate, which brought wide attention to the method. The resulting priority dispute between Legendre and Gauss became one of the most famous in the history of science.
Gauss made a contribution beyond Legendre by connecting the method to probability theory. He showed that if observational errors follow a normal distribution, then the least squares estimator coincides with the maximum likelihood estimator. This connection gave the method a deep theoretical foundation rather than being merely a convenient computational recipe.
Pierre-Simon Laplace provided additional justification around 1810 through the central limit theorem, showing that least squares estimators have desirable large-sample properties even when errors are not exactly normal. By 1822, Gauss had proved that the least squares estimator is optimal among all linear unbiased estimators whenever the errors have zero mean, constant variance, and are uncorrelated; normality is not required. Andrey Markov later rediscovered and restated this result, which is now known as the Gauss-Markov theorem.
The standard linear regression model relates a dependent variable y to a set of independent variables (predictors) through the equation:
y = Xβ + ε
where:
| Symbol | Meaning |
|---|---|
| y | Column vector of n observed response values (n x 1) |
| X | Design matrix of predictor values (n x p), where each row is an observation and each column is a predictor |
| β | Column vector of p unknown parameters to be estimated (p x 1) |
| ε | Column vector of n random error terms (n x 1) |
The first column of X is typically a column of ones to accommodate an intercept term.
Ordinary least squares seeks the parameter vector β that minimizes the residual sum of squares (RSS):
S(β) = Σᵢ (yᵢ - xᵢᵀβ)² = (y - Xβ)ᵀ(y - Xβ) = ||y - Xβ||²
This expression is the squared Euclidean norm of the residual vector. Because the function is quadratic and convex in β, it has a unique global minimum (provided X has full column rank).
To find the minimum, take the gradient of S(β) with respect to β and set it to zero:
∇β S(β) = -2Xᵀ(y - Xβ) = 0
Rearranging gives the normal equation:
XᵀXβ = Xᵀy
If the matrix XᵀX is invertible (which requires X to have full column rank, meaning no perfect multicollinearity among the predictors), then the unique solution is:
β̂ = (XᵀX)⁻¹Xᵀy
This is the ordinary least squares estimator. The term "normal equation" comes from the fact that the residual vector (y - Xβ̂) is orthogonal ("normal") to the column space of X.
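As an illustration, the normal-equation solution can be computed directly with NumPy. This is a minimal sketch on made-up synthetic data (the variable names and simulated values are invented for the example); as discussed later, production code should prefer QR- or SVD-based solvers over forming XᵀX.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: n observations, two predictors plus an intercept column of ones
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])          # design matrix (n x p)
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)  # y = Xβ + ε

# Normal equations: (XᵀX) β̂ = Xᵀy
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)                                    # close to beta_true
```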
For the special case of one predictor variable, y = α + βx + ε, the OLS estimators reduce to closed-form expressions:
β̂ = Σᵢ (xᵢ - x̄)(yᵢ - ȳ) / Σᵢ (xᵢ - x̄)²
α̂ = ȳ - β̂x̄
where x̄ and ȳ are the sample means of x and y respectively. The slope β̂ equals the sample covariance of x and y divided by the sample variance of x.
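A minimal sketch of these two closed-form expressions with NumPy (the small data arrays are made up for illustration):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Slope: sample covariance of x and y divided by sample variance of x
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
# Intercept: α̂ = ȳ - β̂ x̄
intercept = y.mean() - slope * x.mean()
print(slope, intercept)
```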
The OLS estimator has an elegant geometric meaning. The predicted values ŷ = Xβ̂ lie in the column space of X, which is a p-dimensional subspace of n-dimensional space. The OLS solution corresponds to the orthogonal projection of the observed vector y onto this subspace.
The projection matrix (also called the hat matrix) is:
H = X(XᵀX)⁻¹Xᵀ
so that ŷ = Hy. The residual vector ε̂ = y - ŷ = (I - H)y is perpendicular to every column of X. This means that OLS finds the point in the column space of X that is closest to y in the Euclidean sense. Geometrically, it drops a perpendicular from y onto the hyperplane spanned by the columns of X.
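The projection interpretation can be checked numerically: form the hat matrix, project y, and verify that the residuals are orthogonal to every column of X. A sketch on synthetic data (forming H explicitly is only sensible for small n):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 0.5, -2.0]) + rng.normal(scale=0.2, size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T       # hat matrix (n x n)
y_hat = H @ y                              # orthogonal projection of y onto col(X)
residuals = y - y_hat

print(np.allclose(X.T @ residuals, 0.0))   # True: residuals ⟂ column space of X
print(np.allclose(H @ H, H))               # True: H is idempotent, as projections are
```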
Under the assumption that E[ε | X] = 0 (strict exogeneity), the OLS estimator is unbiased:
E[β̂ | X] = β
This means that on average, over repeated sampling, the estimated coefficients equal the true parameter values.
The variance-covariance matrix of the OLS estimator is:
Var(β̂ | X) = σ²(XᵀX)⁻¹
where σ² is the variance of the error terms. An unbiased estimator of σ² is:
s² = ε̂ᵀε̂ / (n - p)
The denominator uses n - p (the degrees of freedom) rather than n to correct for the bias that would otherwise arise from using estimated residuals instead of true errors.
As the sample size n grows, the OLS estimator converges in probability to the true parameter values, provided the regressors satisfy certain regularity conditions (such as the sample moment matrix XᵀX/n converging to a positive definite matrix).
When the errors are normally distributed, ε ~ N(0, σ²I), the OLS estimator is not only BLUE (see below) but also the maximum likelihood estimator. In this case it achieves the Cramer-Rao lower bound, meaning no unbiased estimator (linear or nonlinear) can have lower variance.
The Gauss-Markov theorem is one of the central results justifying the use of OLS. It states that, under certain assumptions, the OLS estimator is the Best Linear Unbiased Estimator (BLUE).
Among all estimators that are (1) linear functions of the observed data y and (2) unbiased for the true parameter β, the OLS estimator has the smallest variance. More precisely, for any linear unbiased estimator β̃ = Cy, the difference Var(β̃) - Var(β̂) is a positive semidefinite matrix.
The theorem holds under the following conditions:
| Assumption | Mathematical statement | Meaning |
|---|---|---|
| Linearity | y = Xβ + ε | The true model is linear in parameters |
| Strict exogeneity | E[ε \| X] = 0 | Errors have zero conditional mean given regressors |
| Homoscedasticity | Var(εᵢ) = σ² for all i | All error terms have the same constant variance |
| No autocorrelation | Cov(εᵢ, εⱼ) = 0 for i ≠ j | Error terms are uncorrelated with each other |
| Full rank | rank(X) = p | No perfect multicollinearity; XᵀX is invertible |
Notably, the Gauss-Markov theorem does not require the errors to be normally distributed. It only requires them to be uncorrelated with zero mean and constant variance.
Consider any alternative linear unbiased estimator β̃ = Cy where C = (XᵀX)⁻¹Xᵀ + D for some matrix D. Unbiasedness requires DX = 0. The variance of β̃ is:
Var(β̃) = σ²CCᵀ = σ²[(XᵀX)⁻¹ + DDᵀ]
The cross terms (XᵀX)⁻¹XᵀDᵀ and DX(XᵀX)⁻¹ vanish because DX = 0.
Since DDᵀ is positive semidefinite, Var(β̃) - Var(β̂) = σ²DDᵀ is also positive semidefinite. Therefore OLS has the smallest variance among all linear unbiased estimators.
The Gauss-Markov theorem guarantees that OLS is optimal within the class of linear unbiased estimators. However, it does not say OLS is the best estimator overall. Biased estimators such as ridge regression can achieve lower mean squared error through the bias-variance tradeoff, accepting a small amount of bias in exchange for a large reduction in variance.
OLS relies on several assumptions for its estimators to have desirable properties. Violations of these assumptions affect the reliability of inferences drawn from the model.
The relationship between the dependent variable and the parameters must be linear. This does not require the relationship between the predictors and the response to be linear in the raw variables; transformations such as polynomial terms, logarithms, or interaction terms are permitted because the model remains linear in the parameters.
The observations are drawn randomly from the population. This ensures that the sample is representative and supports the statistical properties of the estimators.
No predictor variable can be written as an exact linear combination of other predictors. Perfect multicollinearity makes XᵀX singular and prevents a unique OLS solution. Near-multicollinearity (high but imperfect correlation among predictors) does not prevent estimation but inflates the standard errors of the affected coefficients, reducing the precision of estimates. The Variance Inflation Factor (VIF) is commonly used to detect multicollinearity; VIF values above 10 are generally considered problematic.
The expected value of the error term, conditional on the predictors, must be zero: E[ε | X] = 0. This means the predictors must not be correlated with the error term. Violations (endogeneity) can arise from omitted variable bias, simultaneous causation, or measurement error in the predictors. Endogeneity causes the OLS estimator to be biased and inconsistent.
The variance of the error term must be constant across all observations: Var(εᵢ | X) = σ² for all i. When this assumption is violated (heteroscedasticity), the OLS estimator remains unbiased but is no longer efficient, and the usual standard errors are biased. This can lead to incorrect hypothesis tests and confidence intervals.
The error terms must be uncorrelated with each other: Cov(εᵢ, εⱼ) = 0 for i ≠ j. This assumption is especially relevant in time series data, where consecutive observations may exhibit serial correlation. Violations of this assumption cause OLS to be inefficient and produce biased standard errors.
| Violation | Effect on OLS estimates | Effect on inference | Common diagnostic |
|---|---|---|---|
| Omitted variables / endogeneity | Biased and inconsistent | Invalid | Hausman test, instrumental variables |
| Heteroscedasticity | Unbiased but inefficient | Biased standard errors; invalid tests | Breusch-Pagan test, White test, residual plots |
| Autocorrelation | Unbiased but inefficient | Biased standard errors | Durbin-Watson test, Breusch-Godfrey test |
| Multicollinearity | Unbiased but imprecise | Inflated standard errors | VIF, condition number |
| Non-normality of errors | Unbiased (normality not needed for unbiasedness) | Small-sample tests invalid; large-sample inference remains valid | Shapiro-Wilk test, Q-Q plots |
The coefficient of determination measures the proportion of total variance in the dependent variable that is explained by the model:
R² = 1 - RSS / TSS
where RSS = Σᵢ (yᵢ - ŷᵢ)² is the residual sum of squares and TSS = Σᵢ (yᵢ - ȳ)² is the total sum of squares. R² ranges from 0 to 1 for models with an intercept, with values closer to 1 indicating a better fit.
However, R² never decreases when additional predictors are added to the model, even if those predictors are irrelevant. This makes R² unsuitable for comparing models with different numbers of predictors.
The adjusted coefficient of determination corrects for the number of predictors in the model:
R²_adj = 1 - [(1 - R²)(n - 1)] / (n - k - 1)
where k is the number of predictors excluding the intercept, so that n - k - 1 equals the residual degrees of freedom n - p used above.
Unlike R², the adjusted R² can decrease when irrelevant predictors are added, making it a better metric for model comparison. It can also take negative values when the model fits worse than a simple mean.
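A short sketch of computing both statistics from a fitted model's predictions (the function names are illustrative; y and y_hat are assumed to come from an OLS fit with k predictors plus an intercept):

```python
import numpy as np

def r_squared(y, y_hat):
    rss = np.sum((y - y_hat) ** 2)        # residual sum of squares
    tss = np.sum((y - y.mean()) ** 2)     # total sum of squares
    return 1.0 - rss / tss

def adjusted_r_squared(y, y_hat, k):
    """k = number of predictors, excluding the intercept."""
    n = len(y)
    r2 = r_squared(y, y_hat)
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)
```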
Beyond summary statistics, residual plots provide diagnostic information about model adequacy. Common patterns to check for include curvature in residuals versus fitted values (suggesting a misspecified functional form), a funnel shape (suggesting heteroscedasticity), systematic runs when residuals are plotted in time or collection order (suggesting autocorrelation), and isolated extreme points (suggesting outliers or high-leverage observations).
When the errors are normally distributed (or the sample size is large enough for asymptotic normality to apply), OLS enables statistical inference about the model parameters.
The standard error of the j-th coefficient is:
se(β̂ⱼ) = s * √[(XᵀX)⁻¹ⱼⱼ]
The test statistic for H₀: βⱼ = 0 is:
tⱼ = β̂ⱼ / se(β̂ⱼ)
which follows a t-distribution with n - p degrees of freedom under the null hypothesis.
The F-test evaluates whether the regression model as a whole explains a statistically significant portion of the variance:
F = [(TSS - RSS) / (p - 1)] / [RSS / (n - p)]
This statistic follows an F-distribution with (p - 1, n - p) degrees of freedom under the null hypothesis that all slope coefficients are zero.
A 100(1 - α)% confidence interval for the j-th coefficient is:
β̂ⱼ ± t_{α/2, n-p} * se(β̂ⱼ)
where t_{α/2, n-p} is the critical value from the t-distribution.
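A sketch that implements these inference formulas directly with NumPy and SciPy on synthetic data; in practice a package such as statsmodels reports the same quantities.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, p = 100, 3                                 # p counts the intercept column
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, 0.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
residuals = y - X @ beta_hat
s2 = residuals @ residuals / (n - p)          # unbiased estimate of σ²

se = np.sqrt(s2 * np.diag(XtX_inv))           # standard errors of the coefficients
t_stats = beta_hat / se                       # test statistics for H₀: βⱼ = 0
p_values = 2 * stats.t.sf(np.abs(t_stats), df=n - p)

t_crit = stats.t.ppf(0.975, df=n - p)         # 95% confidence intervals
ci = np.column_stack([beta_hat - t_crit * se, beta_hat + t_crit * se])
print(np.column_stack([beta_hat, se, t_stats, p_values]))
print(ci)
```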
The most straightforward approach computes β̂ = (XᵀX)⁻¹Xᵀy directly. However, explicitly inverting XᵀX is numerically unstable when the matrix is ill-conditioned (i.e., when predictors are nearly collinear).
In practice, most statistical software solves OLS using orthogonal factorizations rather than the normal equations. With QR decomposition, the design matrix is factored as X = QR, where Q has orthonormal columns and R is upper triangular; the solution β̂ = R⁻¹Qᵀy is obtained by back-substitution. QR decomposition is numerically more stable than direct inversion and is the default method in R's lm(); NumPy's lstsq and scikit-learn's LinearRegression instead rely on SVD-based LAPACK solvers.
SVD factors X = UΣVᵀ and handles rank-deficient or near-singular matrices gracefully. The pseudoinverse of X is computed as X⁺ = VΣ⁺Uᵀ, where Σ⁺ inverts only the nonzero singular values, giving β̂ = X⁺y. SVD is more computationally expensive than QR but provides the most robust solution, especially when the data exhibit severe multicollinearity.
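A sketch of both factorization-based solves with NumPy on synthetic data (np.linalg.lstsq wraps an SVD-based LAPACK routine under the hood):

```python
import numpy as np

rng = np.random.default_rng(3)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
y = X @ np.array([1.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

# QR: X = QR, then solve the triangular system R β = Qᵀy
Q, R = np.linalg.qr(X)                        # reduced QR by default
beta_qr = np.linalg.solve(R, Q.T @ y)

# SVD: β̂ = X⁺ y via the Moore-Penrose pseudoinverse
beta_svd = np.linalg.pinv(X) @ y

# lstsq also uses an SVD-based solver and reports rank and singular values
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(beta_qr, beta_svd), np.allclose(beta_qr, beta_lstsq))
```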
For very large datasets where forming and decomposing XᵀX is prohibitively expensive, iterative methods such as gradient descent can be used. The parameter update rule is:
β_{t+1} = βₜ - α * ∇S(βₜ) = βₜ + 2α * Xᵀ(y - Xβₜ)
where α is the learning rate. Stochastic gradient descent and mini-batch variants scale efficiently to millions of observations and are the standard approach in deep learning frameworks.
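A sketch of batch gradient descent on the least squares objective; the learning rate and iteration count are arbitrary choices for the example, and the gradient is scaled by 1/n (i.e., the mean rather than the sum of squared errors), which only rescales the learning rate relative to the update rule above.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.5, size=n)

beta = np.zeros(X.shape[1])
alpha = 0.1                                       # learning rate for the scaled gradient
for _ in range(2000):
    gradient = -2.0 / n * X.T @ (y - X @ beta)    # gradient of the mean squared error
    beta = beta - alpha * gradient                # β ← β - α ∇S(β)

print(beta)                                       # close to the OLS solution
print(np.linalg.lstsq(X, y, rcond=None)[0])       # reference solution
```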
| Method | Time complexity | Numerical stability | Best suited for |
|---|---|---|---|
| Normal equation | O(np² + p³) | Low (sensitive to conditioning) | Small to medium datasets |
| QR decomposition | O(np²) | High | General-purpose use |
| SVD | O(np²) (higher constant) | Very high | Rank-deficient or ill-conditioned problems |
| Gradient descent | O(npT), T = iterations | Moderate | Very large datasets |
When the assumption of homoscedasticity is violated but the error variances are known (or can be estimated), weighted least squares provides a more efficient estimator. WLS minimizes a weighted sum of squared residuals:
S_w(β) = Σᵢ wᵢ(yᵢ - xᵢᵀβ)²
where wᵢ = 1/σᵢ² is the weight assigned to the i-th observation, inversely proportional to its error variance. In matrix form:
β̂_WLS = (XᵀWX)⁻¹XᵀWy
where W = diag(w₁, w₂, ..., wₙ). Observations with smaller error variance receive more weight because they carry more information about the true relationship. WLS is the appropriate method when errors are independent but have unequal variances.
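A sketch of the WLS estimator with NumPy, assuming the per-observation error standard deviations are known; here they are simulated so the example is self-contained.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 300
X = np.column_stack([np.ones(n), rng.uniform(0, 10, size=n)])
sigma_i = 0.5 + 0.3 * X[:, 1]                      # error sd grows with the predictor
y = X @ np.array([2.0, 1.5]) + rng.normal(scale=sigma_i)

w = 1.0 / sigma_i**2                               # weights wᵢ = 1/σᵢ²
W = np.diag(w)
beta_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

# Equivalent "whitened OLS": scale each row of X and y by √wᵢ, then fit by OLS
Xw = X * np.sqrt(w)[:, None]
yw = y * np.sqrt(w)
beta_whitened = np.linalg.lstsq(Xw, yw, rcond=None)[0]
print(np.allclose(beta_wls, beta_whitened))        # True
```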
GLS extends WLS to handle cases where the errors are both heteroscedastic and correlated, with a general covariance structure Var(ε) = σ²Ω, where Ω is a known positive definite matrix. The GLS estimator, introduced by Alexander Aitken in 1935, is:
β̂_GLS = (XᵀΩ⁻¹X)⁻¹XᵀΩ⁻¹y
GLS is equivalent to applying OLS to a transformed version of the data. Using the Cholesky decomposition Ω = CCᵀ, the transformation C⁻¹ decorrelates and standardizes the errors, producing a model suitable for standard OLS. Under the extended Gauss-Markov theorem (the Aitken theorem), the GLS estimator is BLUE when the covariance structure is correctly specified.
WLS is the special case of GLS where Ω is diagonal (errors are uncorrelated but have unequal variances).
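A sketch of GLS as whitened OLS, assuming Ω is known; the AR(1)-style correlation matrix below is an invented example simply to make the covariance structure concrete.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([1.0, 2.0])

# Assumed known covariance structure: AR(1) correlation with ρ = 0.7
rho = 0.7
Omega = rho ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
L = np.linalg.cholesky(Omega)                      # Ω = L Lᵀ
y = X @ beta_true + L @ rng.normal(size=n)         # correlated errors

# GLS estimator: (Xᵀ Ω⁻¹ X)⁻¹ Xᵀ Ω⁻¹ y
Omega_inv = np.linalg.inv(Omega)
beta_gls = np.linalg.solve(X.T @ Omega_inv @ X, X.T @ Omega_inv @ y)

# Equivalent whitening: apply L⁻¹ to X and y, then run ordinary least squares
X_star = np.linalg.solve(L, X)
y_star = np.linalg.solve(L, y)
beta_white = np.linalg.lstsq(X_star, y_star, rcond=None)[0]
print(np.allclose(beta_gls, beta_white))           # True
```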
In practice, the true covariance matrix Ω is rarely known. Feasible GLS (FGLS) estimates it from the data in a two-step procedure: first fit the model by OLS and use the residuals to estimate the covariance structure (for example, by modeling the error variance as a function of the predictors or estimating an autocorrelation parameter), then apply the GLS formula with the estimate Ω̂ in place of Ω.
FGLS is consistent and asymptotically efficient under regularity conditions. However, in small to medium-sized samples it can be less efficient than OLS with robust standard errors. Practical guidance from the econometrics literature therefore recommends OLS with heteroscedasticity-consistent standard errors (the Eicker-White estimator) or heteroscedasticity and autocorrelation consistent (HAC) standard errors (the Newey-West estimator) rather than FGLS when the sample size is modest.
| Method | Error structure assumed | Estimator formula | When to use |
|---|---|---|---|
| OLS | Homoscedastic, uncorrelated (σ²I) | (XᵀX)⁻¹Xᵀy | Baseline method; valid when all classical assumptions hold |
| WLS | Heteroscedastic, uncorrelated (diagonal Ω) | (XᵀWX)⁻¹XᵀWy | Known or estimable unequal variances; no correlation |
| GLS | General covariance (full Ω) | (XᵀΩ⁻¹X)⁻¹XᵀΩ⁻¹y | Correlated and/or heteroscedastic errors with known Ω |
| FGLS | General covariance (estimated Ω̂) | (XᵀΩ̂⁻¹X)⁻¹XᵀΩ̂⁻¹y | Correlated/heteroscedastic errors; Ω unknown but estimable |
When the model is nonlinear in its parameters (for example, y = β₁ · e^(β₂x) + ε), the residual sum of squares no longer has a closed-form minimizer. Instead, iterative optimization algorithms such as Gauss-Newton and Levenberg-Marquardt are used, which repeatedly linearize the model around the current parameter estimate and solve a linear least squares subproblem.
Nonlinear least squares does not guarantee convergence to the global minimum, and the solution may depend on the choice of starting values.
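A sketch using SciPy's curve_fit (which applies Levenberg-Marquardt for unconstrained problems) on the exponential model above; the data and starting values p0 are invented for the example, and poor starting values can lead to a different local minimum.

```python
import numpy as np
from scipy.optimize import curve_fit

def model(x, b1, b2):
    return b1 * np.exp(b2 * x)                # y = β₁ e^(β₂ x)

rng = np.random.default_rng(7)
x = np.linspace(0, 2, 50)
y = model(x, 2.5, 1.3) + rng.normal(scale=0.2, size=x.size)

# p0 supplies starting values for the iterative optimizer
params, covariance = curve_fit(model, x, y, p0=[1.0, 1.0])
print(params)                                  # estimates of β₁ and β₂
```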
Ordinary least squares assumes that only the dependent variable contains measurement error. Total least squares (also called orthogonal distance regression or errors-in-variables regression) accounts for errors in both the dependent and independent variables. Instead of minimizing vertical distances from data points to the fitted line, it minimizes the perpendicular (orthogonal) distances. Total least squares is computed using singular value decomposition (SVD) and is appropriate when the predictor variables are measured with noise.
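For the single-predictor case, total least squares reduces to orthogonal regression of a line, which can be computed from the SVD of the centered data. A minimal sketch under that assumption (the slope formula uses the right singular vector associated with the smallest singular value as the line's normal vector):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 100
x_true = rng.uniform(0, 10, size=n)
y_true = 1.0 + 0.8 * x_true
x = x_true + rng.normal(scale=0.5, size=n)     # noise in the predictor...
y = y_true + rng.normal(scale=0.5, size=n)     # ...and in the response

# Center the data and take the SVD of the (n x 2) matrix of centered columns
A = np.column_stack([x - x.mean(), y - y.mean()])
_, _, Vt = np.linalg.svd(A, full_matrices=False)
v = Vt[-1]                                     # normal direction of the fitted line

slope_tls = -v[0] / v[1]                       # from v₀(x - x̄) + v₁(y - ȳ) = 0
intercept_tls = y.mean() - slope_tls * x.mean()
print(slope_tls, intercept_tls)
```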
When the number of predictors is large relative to the sample size, or when predictors are highly correlated, the OLS estimator can have high variance and overfit the training data. Regularized versions of least squares add a penalty term to the objective function to control model complexity.
Ridge regression adds the squared L2 norm of the coefficient vector to the loss function:
S_ridge(β) = ||y - Xβ||² + λ||β||²₂
The solution is:
β̂_ridge = (XᵀX + λI)⁻¹Xᵀy
The regularization parameter λ > 0 shrinks the coefficients toward zero but never sets them exactly to zero. Ridge regression is effective when many predictors contribute to the outcome and multicollinearity is present. Adding λI to XᵀX ensures the matrix is always invertible, even when XᵀX is singular.
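A sketch of the ridge closed form; note that in practice the predictors are usually standardized first and the intercept is left unpenalized, details this minimal version glosses over.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate β̂ = (XᵀX + λI)⁻¹ Xᵀ y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(9)
X = rng.normal(size=(50, 5))
X[:, 4] = X[:, 3] + rng.normal(scale=1e-3, size=50)   # nearly collinear columns
y = X @ np.array([1.0, 0.0, 0.0, 2.0, 2.0]) + rng.normal(scale=0.1, size=50)

print(ridge_fit(X, y, lam=0.0))    # OLS: unstable for the collinear pair
print(ridge_fit(X, y, lam=1.0))    # ridge: coefficients shrunk and stabilized
```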
Lasso regression adds the L1 norm of the coefficient vector:
S_lasso(β) = ||y - Xβ||² + λ||β||₁
Unlike ridge regression, the L1 penalty can drive some coefficients to exactly zero, performing automatic variable selection. This makes the Lasso useful when only a subset of predictors are relevant. The Lasso does not have a closed-form solution and is typically solved using coordinate descent or the LARS algorithm.
Elastic net combines both L1 and L2 penalties:
S_enet(β) = ||y - Xβ||² + λ₁||β||₁ + λ₂||β||²₂
It inherits the variable selection property of the Lasso while handling groups of correlated predictors better than the Lasso alone, which tends to arbitrarily select one predictor from a correlated group and ignore the rest.
| Method | Penalty | Variable selection | Handles multicollinearity | Solution form |
|---|---|---|---|---|
| OLS | None | No | Poorly | Closed-form |
| Ridge | λ||β||²₂ | No (shrinks but keeps all) | Yes | Closed-form |
| Lasso | λ||β||₁ | Yes (sets some to zero) | Partially | Iterative |
| Elastic net | λ₁||β||₁ + λ₂||β||²₂ | Yes | Yes | Iterative |
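All three penalized variants are available in scikit-learn; a minimal sketch on synthetic data (the alpha values are arbitrary and would normally be chosen by cross-validation, e.g. with RidgeCV, LassoCV, or ElasticNetCV):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.default_rng(10)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=200)  # 2 informative predictors

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)

print(ridge.coef_)   # all coefficients shrunk, none exactly zero
print(lasso.coef_)   # irrelevant coefficients driven exactly to zero
print(enet.coef_)    # a compromise between the two penalties
```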
OLS has a direct connection to maximum likelihood estimation under the assumption of normally distributed errors. If ε ~ N(0, σ²I), the likelihood function for the observed data is:
L(β, σ²) = (2πσ²)^(-n/2) * exp[-1/(2σ²) * (y - Xβ)ᵀ(y - Xβ)]
Maximizing this likelihood with respect to β is equivalent to minimizing (y - Xβ)ᵀ(y - Xβ), which is exactly the OLS objective. This means the OLS estimator is also the maximum likelihood estimator when errors are normal, giving it additional desirable properties such as asymptotic efficiency and the achievement of the Cramer-Rao lower bound.
OLS is the default estimation method for linear models in econometrics, where it is used to estimate demand functions, production functions, wage equations, and countless other relationships. Extensions such as two-stage least squares (2SLS) and three-stage least squares (3SLS) handle endogeneity through instrumental variables.
In supervised learning, least squares regression serves as the foundation for understanding more complex models. Many algorithms, including neural networks, minimize mean squared error as their loss function, which is equivalent to least squares when the model is linear. The gradient descent algorithms used to train deep learning models directly inherit from the optimization framework of least squares.
Least squares methods are used extensively in signal processing for tasks such as filter design, system identification, spectrum estimation, and adaptive filtering. The Wiener filter, for instance, is derived from the principle of least squares applied to the problem of estimating a signal from noisy observations.
Least squares fitting is standard practice in physics, chemistry, and engineering for calibrating instruments, fitting theoretical models to experimental data, and determining physical constants. The method's statistical properties make it the natural choice whenever measurement noise is approximately Gaussian.
Historically, the method's first applications were in determining planetary orbits and making geodetic measurements. These remain active areas of application, with modern extensions handling the complex correlation structures that arise from satellite observations and GPS measurements.
| Method | Relationship to least squares |
|---|---|
| Linear regression | OLS is the standard estimation method for linear regression |
| Logistic regression | Uses maximum likelihood instead of least squares because the response is binary |
| Generalized linear models | Extend the least squares framework to non-normal response distributions |
| Principal component analysis | PCA finds directions of maximum variance in the predictors, while OLS minimizes prediction error for a designated response; orthogonal (total) least squares connects the two via the SVD |
| Support vector machines | Replace squared loss with hinge loss for classification |
| Neural networks | Can use MSE loss (equivalent to least squares) but typically with nonlinear architectures |
| Random forests | Ensemble of decision trees; does not use least squares directly but can minimize MSE |
| Bayesian regression | Places a prior distribution on β and computes the posterior; the posterior mode under a Gaussian prior is equivalent to ridge regression |
Least squares regression is available in virtually every statistical and machine learning software package:
| Language/Tool | Function or class |
|---|---|
| Python (scikit-learn) | sklearn.linear_model.LinearRegression |
| Python (statsmodels) | statsmodels.api.OLS |
| Python (NumPy) | numpy.linalg.lstsq |
| R | lm() |
| MATLAB | fitlm(), backslash operator \ |
| Julia | GLM.jl, backslash operator |
| TensorFlow / PyTorch | Custom models with MSE loss |
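For reference, a minimal sketch of fitting the same simple regression with three of the Python entries from the table (the synthetic data are invented for the example):

```python
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(11)
x = rng.normal(size=100)
y = 1.0 + 2.0 * x + rng.normal(scale=0.3, size=100)

# NumPy: solve the least squares problem directly on the design matrix
X = np.column_stack([np.ones_like(x), x])
print(np.linalg.lstsq(X, y, rcond=None)[0])

# statsmodels: full inference output (coefficients, standard errors, t-tests, R²)
print(sm.OLS(y, sm.add_constant(x)).fit().params)

# scikit-learn: estimator API; the intercept is fitted separately from coef_
model = LinearRegression().fit(x.reshape(-1, 1), y)
print(model.intercept_, model.coef_)
```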