Linear regression is one of the most foundational techniques in statistics and machine learning. It models the relationship between one or more independent variables (also called predictors or features) and a continuous dependent variable (the response) by fitting a linear equation to the observed data. Because of its simplicity, interpretability, and strong theoretical grounding, linear regression remains one of the most widely used methods in science, engineering, economics, and data analysis.
Imagine you have a lemonade stand and you want to figure out how the weather affects your sales. On hot days you sell more cups, and on cold days you sell fewer. Linear regression is like drawing the best straight line through a scatter of dots on a chart, where each dot represents a day. One axis shows the temperature that day, and the other axis shows how many cups you sold. Once you draw that line, you can slide your finger along it to predict how many cups you will sell tomorrow based on the forecast temperature. The line is not perfect (some dots are above it and some are below), but it gives you the best overall guess.
The mathematical foundations of linear regression trace back to the early 19th century and the development of the method of least squares.
Adrien-Marie Legendre published the first clear description of the least squares method in 1805, in his work "Nouvelles méthodes pour la détermination des orbites des comètes" (New Methods for the Determination of the Orbits of Comets). He used the technique to fit astronomical observations of comet trajectories to predicted orbital paths.
Carl Friedrich Gauss later claimed he had been using the method since 1795, and in 1809 he published "Theoria Motus Corporum Coelestium" (Theory of the Motion of Heavenly Bodies), in which he connected least squares estimation with the normal distribution and the theory of probability. This connection laid the probabilistic groundwork for modern regression analysis. In 1822, Gauss proved that under certain conditions (errors with zero mean, equal variance, and no correlation), the least squares estimator is the best linear unbiased estimator. This result later became known as the Gauss-Markov theorem.
The term "regression" itself was introduced by Francis Galton in the 1880s during his studies of hereditary stature. Galton observed that children of exceptionally tall or short parents tended to be closer to the average height, a phenomenon he called "regression toward mediocrity." Karl Pearson and Udny Yule subsequently formalized regression as a general statistical tool in the 1890s and early 1900s, extending it beyond its original biological context.
Within a decade of Legendre's 1805 publication, the method of least squares had been adopted as a standard tool in astronomy and geodesy across France, Italy, and Prussia, an unusually rapid adoption for a new mathematical technique.
In simple linear regression, the model involves a single predictor variable x and takes the form:
y = Beta_0 + Beta_1 * x + epsilon
where y is the response variable, x is the predictor, Beta_0 is the intercept (the expected value of y when x is zero), Beta_1 is the slope (the expected change in y for a one-unit increase in x), and epsilon is a random error term capturing variation not explained by the linear relationship.
The goal is to estimate the parameters Beta_0 and Beta_1 so that the resulting line best fits the observed data.
When there are p predictor variables, the model generalizes to:
y = Beta_0 + Beta_1 * x_1 + Beta_2 * x_2 + ... + Beta_p * x_p + epsilon
In matrix notation, the model is written as:
y = X * beta + epsilon
where y is an n-by-1 vector of observations, X is an n-by-(p+1) design matrix (with a column of ones for the intercept), beta is a (p+1)-by-1 vector of coefficients, and epsilon is an n-by-1 vector of errors.
Each coefficient Beta_j represents the expected change in the response variable for a one-unit increase in the corresponding predictor x_j, holding all other predictors constant. This "ceteris paribus" interpretation is central to regression analysis in fields like economics and epidemiology.
Ordinary least squares (OLS) is the most common method for estimating the coefficients in a linear regression model. OLS minimizes the sum of squared residuals, which is the sum of the squared differences between observed values and predicted values:
Minimize: sum of (y_i - y_hat_i)^2
where y_hat_i = Beta_0 + Beta_1 * x_1i + ... + Beta_p * x_pi.
Setting the gradient of the sum of squared residuals to zero yields the normal equation. In matrix form, the OLS estimator is:
beta_hat = (X^T X)^(-1) X^T y
This closed-form solution provides exact coefficients in a single computation. It works well when the number of features p is moderate and the matrix X^T X is invertible. However, computing the matrix inverse becomes expensive for very large feature sets, with time complexity of approximately O(p^3).
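As an illustration, here is a minimal NumPy sketch of the normal-equation estimator on synthetic data; the array names and the generated values are purely illustrative, and the linear system is solved directly rather than forming the explicit inverse, which is the numerically preferable route.

```python
import numpy as np

# Illustrative synthetic data: n = 100 observations, p = 3 predictors
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 2.0 + X @ np.array([1.5, -0.7, 0.3]) + rng.normal(scale=0.5, size=100)

# Add a column of ones so the first coefficient plays the role of the intercept Beta_0
X_design = np.column_stack([np.ones(len(X)), X])

# Normal equation: beta_hat = (X^T X)^(-1) X^T y.
# Solving the linear system avoids explicitly inverting X^T X.
beta_hat = np.linalg.solve(X_design.T @ X_design, X_design.T @ y)
print(beta_hat)  # approximately [2.0, 1.5, -0.7, 0.3]
```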
For large datasets where computing the normal equation is impractical, gradient descent offers an iterative alternative. The algorithm starts with an initial guess for the coefficients and repeatedly updates them by moving in the direction that reduces the loss function (the sum of squared errors).
The update rule at each step is:
beta := beta - alpha * gradient of loss with respect to beta
where alpha is the learning rate.
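The sketch below implements batch gradient descent for this loss on synthetic data; the learning rate and iteration count are illustrative choices rather than recommendations, and with well-scaled predictors the result matches the normal-equation solution.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.05, n_iters=10000):
    """Batch gradient descent on the mean squared error loss for linear regression."""
    n, d = X.shape
    beta = np.zeros(d)
    for _ in range(n_iters):
        grad = (2.0 / n) * X.T @ (X @ beta - y)  # gradient of the average squared residual
        beta -= alpha * grad
    return beta

# Illustrative data with an intercept column, as in the normal-equation sketch above
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.1, size=200)
print(gradient_descent(X, y))  # approximately [1.0, 2.0, -0.5]
```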
Three common variants exist:
| Variant | Description | Typical use case |
|---|---|---|
| Batch gradient descent | Uses all training examples per update | Small to medium datasets |
| Stochastic gradient descent (SGD) | Uses one randomly chosen example per update | Very large datasets, online learning |
| Mini-batch gradient descent | Uses a small random subset per update | Deep learning, large-scale ML pipelines |
Stochastic and mini-batch gradient descent introduce noise into the updates, but each update is far cheaper to compute, so they typically make much faster progress on large datasets and are the workhorses of modern deep learning optimization.
The validity of OLS estimation and the statistical inferences drawn from it depend on several key assumptions. Violating these assumptions can lead to biased estimates, incorrect standard errors, or misleading hypothesis tests.
| Assumption | Description | Consequence of violation |
|---|---|---|
| Linearity | The relationship between predictors and response is linear in the parameters | Biased coefficient estimates; model underfits the true pattern |
| Independence | Observations are independent of one another | Underestimated standard errors; inflated significance |
| Homoscedasticity | The variance of the error term is constant across all levels of the predictors | Inefficient estimates; invalid confidence intervals and hypothesis tests |
| Normality of residuals | The error terms follow a normal distribution | Unreliable p-values and confidence intervals in small samples |
| No multicollinearity | Predictor variables are not highly correlated with each other | Inflated standard errors; unstable coefficient estimates |
| No endogeneity | Error terms are uncorrelated with the predictors (exogeneity) | Biased and inconsistent coefficient estimates |
The Gauss-Markov theorem states that when the linearity, independence, homoscedasticity, and exogeneity assumptions hold, OLS produces the best linear unbiased estimator (BLUE) of the coefficients. Normality is not required for BLUE status but is needed for exact finite-sample inference (t-tests and F-tests). For large samples, the central limit theorem ensures that inference is approximately valid even without strict normality.
Several metrics are used to assess the quality of a linear regression model. The choice of metric depends on the application and what aspect of model performance matters most.
| Metric | Formula | Interpretation |
|---|---|---|
| Mean squared error (MSE) | (1/n) * sum of (y_i - y_hat_i)^2 | Average squared prediction error; penalizes large errors heavily |
| Root mean squared error (RMSE) | sqrt(MSE) | Same units as y; easier to interpret than MSE |
| Mean absolute error (MAE) | (1/n) * sum of abs(y_i - y_hat_i) | Average absolute prediction error; more robust to outliers than MSE |
| R-squared (R^2) | 1 - (SS_res / SS_tot) | Proportion of variance in y explained by the model; ranges from 0 to 1 |
| Adjusted R-squared | 1 - [(1 - R^2)(n - 1) / (n - p - 1)] | Adjusts R^2 for the number of predictors; penalizes unnecessary variables |
R-squared is one of the most commonly reported statistics. A value of 0.85, for example, indicates that the model explains 85% of the variance in the response variable. However, R^2 always increases (or stays the same) as more predictors are added, even if they are irrelevant. Adjusted R-squared addresses this by penalizing model complexity, making it more suitable for comparing models with different numbers of predictors.
MAE is preferred when all errors should be treated equally regardless of magnitude, while RMSE is preferred when large errors are especially undesirable (such as in safety-critical applications).
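The following NumPy sketch computes these metrics directly from their formulas; the observed and predicted values are small illustrative arrays rather than output from a real model.

```python
import numpy as np

y_true = np.array([3.1, 4.5, 5.0, 6.2, 7.8])   # observed values (illustrative)
y_pred = np.array([3.0, 4.8, 5.2, 6.0, 7.5])   # model predictions (illustrative)

mse = np.mean((y_true - y_pred) ** 2)           # mean squared error
rmse = np.sqrt(mse)                             # root mean squared error
mae = np.mean(np.abs(y_true - y_pred))          # mean absolute error

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2 = 1.0 - ss_res / ss_tot

print(f"MSE={mse:.4f}  RMSE={rmse:.4f}  MAE={mae:.4f}  R^2={r2:.4f}")
```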
Linear regression provides a built-in framework for statistical inference on the estimated coefficients.
To test whether a specific predictor has a statistically significant relationship with the response, a t-test is used. The null hypothesis is H_0: Beta_j = 0 (the predictor has no effect). The test statistic is:
t = Beta_hat_j / SE(Beta_hat_j)
where SE(Beta_hat_j) is the standard error of the estimated coefficient. If the resulting p-value is below the chosen significance level (commonly 0.05), the null hypothesis is rejected, indicating that the predictor contributes meaningfully to the model.
The F-test evaluates whether the model as a whole explains a significant portion of the variance in the response. The null hypothesis is that all slope coefficients are simultaneously zero (H_0: Beta_1 = Beta_2 = ... = Beta_p = 0). A large F-statistic and a small p-value indicate that at least one predictor is significantly related to the response.
A 95% confidence interval for a coefficient Beta_j is computed as:
Beta_hat_j +/- t_(alpha/2, n-p-1) * SE(Beta_hat_j)
If the interval does not contain zero, the coefficient is statistically significant at the 5% level. Confidence intervals convey more information than p-values alone because they indicate both the direction and the precision of the estimated effect.
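These quantities can be computed by hand with NumPy and SciPy, as in the sketch below for a simple regression with synthetic data; in practice, libraries such as statsmodels report them automatically, as shown later in this section.

```python
import numpy as np
from scipy import stats

# Illustrative data for a simple regression (one predictor plus intercept)
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=50)
y = 1.0 + 0.8 * x + rng.normal(scale=1.5, size=50)

X = np.column_stack([np.ones_like(x), x])       # design matrix with intercept column
n, k = X.shape                                  # k = p + 1 parameters
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

residuals = y - X @ beta_hat
sigma2 = residuals @ residuals / (n - k)        # unbiased estimate of the error variance
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))

t_stats = beta_hat / se                         # t-statistics for H_0: Beta_j = 0
p_values = 2 * stats.t.sf(np.abs(t_stats), df=n - k)

t_crit = stats.t.ppf(0.975, df=n - k)           # critical value for a two-sided 95% interval
ci = list(zip(beta_hat - t_crit * se, beta_hat + t_crit * se))
print(p_values, ci)
```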
After fitting a linear regression model, diagnostic plots help assess whether the model assumptions are met and identify potential problems.
A scatter plot of residuals (y - y_hat) against fitted values (y_hat) is the most important diagnostic. In a well-specified model, residuals should appear as a random cloud centered around zero with no discernible pattern. A funnel shape indicates heteroscedasticity, while a curved pattern suggests nonlinearity.
A quantile-quantile plot compares the distribution of the standardized residuals against a theoretical normal distribution. If the residuals are approximately normal, the points fall close to a 45-degree reference line. Systematic departures from the line (especially in the tails) indicate non-normality.
Cook's distance measures the influence of each data point on the fitted model. It combines information about both the leverage (how far an observation's predictor values are from the mean) and the residual (how poorly the observation is predicted). A common rule of thumb flags observations with Cook's distance greater than 4/(n - p - 1) as potentially influential. Removing or investigating such points can reveal whether a few observations are driving the model's results.
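One way to obtain these diagnostics in Python is through statsmodels' influence machinery; the sketch below uses synthetic data and applies the rule-of-thumb cutoff described above.

```python
import numpy as np
import statsmodels.api as sm

# Illustrative synthetic data
rng = np.random.default_rng(2)
X = sm.add_constant(rng.normal(size=(100, 2)))
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(scale=0.4, size=100)

results = sm.OLS(y, X).fit()
influence = results.get_influence()

cooks_d = influence.cooks_distance[0]           # Cook's distance for each observation
p = X.shape[1] - 1                              # number of predictors (excluding the intercept)
threshold = 4 / (len(y) - p - 1)                # rule-of-thumb cutoff from the text
flagged = np.where(cooks_d > threshold)[0]
print("Potentially influential observations:", flagged)

# Residual and Q-Q plots can be drawn from the same fitted object, e.g.
# plt.scatter(results.fittedvalues, results.resid) and sm.qqplot(results.resid, line="45")
```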
The variance inflation factor quantifies how much the variance of a coefficient is inflated due to multicollinearity among predictors. VIF is calculated for each predictor by regressing it on all other predictors:
VIF_j = 1 / (1 - R^2_j)
where R^2_j is the R-squared from regressing x_j on all other predictors. General guidelines suggest that a VIF above 5 warrants investigation and a VIF above 10 indicates serious multicollinearity that should be addressed (for example, by removing or combining correlated predictors).
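A short sketch of the VIF calculation using statsmodels' variance_inflation_factor on synthetic predictors, where x3 is deliberately constructed to be almost a copy of x1; the column names and data are illustrative.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Synthetic predictors; x3 is nearly collinear with x1
rng = np.random.default_rng(3)
df = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
df["x3"] = df["x1"] + rng.normal(scale=0.1, size=200)

X_vif = sm.add_constant(df)  # include the intercept when computing VIFs
vif = pd.Series(
    [variance_inflation_factor(X_vif.values, i) for i in range(1, X_vif.shape[1])],
    index=df.columns,
)
print(vif)  # x1 and x3 should show sharply inflated values; x2 stays near 1
```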
When building a multiple regression model with many candidate predictors, selecting the right subset of features is important for both interpretability and predictive performance.
Multicollinearity occurs when two or more predictors are highly correlated. While the overall model predictions may remain accurate, multicollinearity inflates the standard errors of individual coefficients, making it difficult to determine which predictors are truly important. Severe multicollinearity can also make the coefficient estimates numerically unstable.
Common remedies include removing one of a pair of highly correlated predictors, combining correlated predictors into a single composite variable, applying principal component regression, or using regularization methods such as ridge regression.
| Method | Approach | Notes |
|---|---|---|
| Forward selection | Start with no predictors; add the most significant one at each step | Simple but can miss interactions |
| Backward elimination | Start with all predictors; remove the least significant one at each step | Requires n > p |
| Stepwise selection | Combination of forward and backward steps | Most common automated approach; risk of overfitting |
| Lasso (L1) | Regularization-based; drives some coefficients to exactly zero | Performs feature selection automatically |
| Information criteria (AIC, BIC) | Compare models using penalized likelihood | BIC favors simpler models than AIC |
Standard OLS can overfit when the number of predictors is large relative to the number of observations, or when predictors are highly correlated. Regularized regression methods add a penalty term to the loss function to constrain the coefficient magnitudes, improving generalization to new data.
Ridge regression adds the sum of squared coefficients (the L2 regularization penalty) to the OLS objective:
Minimize: sum of (y_i - y_hat_i)^2 + lambda * sum of Beta_j^2
The hyperparameter lambda controls the strength of the penalty. As lambda increases, the coefficients shrink toward zero but never reach exactly zero. Ridge regression is especially useful when predictors are correlated because it stabilizes the coefficient estimates. The solution has a closed-form expression:
beta_hat_ridge = (X^T X + lambda * I)^(-1) X^T y
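A compact NumPy sketch of this closed form follows; as a simplification it penalizes every coefficient, whereas libraries such as scikit-learn's Ridge leave the intercept unpenalized. The data are synthetic and only illustrate how the coefficients shrink as lambda grows.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Ridge estimator (X^T X + lambda * I)^(-1) X^T y.
    For brevity every column is penalized; in practice the intercept is usually excluded."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Illustrative synthetic data
rng = np.random.default_rng(4)
X = rng.normal(size=(50, 5))
y = X @ np.array([1.0, 0.5, 0.0, -0.5, 2.0]) + rng.normal(scale=0.3, size=50)

for lam in (0.0, 1.0, 100.0):
    print(lam, ridge_closed_form(X, y, lam))  # coefficients shrink toward zero as lambda grows
```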
Lasso (Least Absolute Shrinkage and Selection Operator) uses the sum of absolute coefficient values as the penalty, corresponding to L1 regularization:
Minimize: sum of (y_i - y_hat_i)^2 + lambda * sum of abs(Beta_j)
Unlike ridge, the L1 penalty can drive some coefficients to exactly zero, effectively performing automatic feature selection. Lasso tends to select one variable from a group of correlated predictors and set the others to zero, which can be desirable or undesirable depending on the context.
Elastic net combines both L1 and L2 penalties:
Minimize: sum of (y_i - y_hat_i)^2 + lambda_1 * sum of abs(Beta_j) + lambda_2 * sum of Beta_j^2
The penalty is often parameterized instead by an overall strength lambda and a mixing parameter alpha between 0 and 1, with lambda_1 = alpha * lambda and lambda_2 = (1 - alpha) * lambda. Elastic net inherits the feature selection capability of lasso while also handling groups of correlated features more gracefully, like ridge. It was proposed by Hui Zou and Trevor Hastie in 2005 to address cases where lasso's variable selection can be unstable.
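The sketch below fits lasso and elastic net with scikit-learn on synthetic data. Note that scikit-learn names the overall penalty strength alpha and the L1/L2 mixing parameter l1_ratio, which differs from the lambda/alpha naming used above; the dataset sizes and penalty values are illustrative.

```python
from sklearn.linear_model import Lasso, ElasticNet
from sklearn.datasets import make_regression

# Synthetic data with 20 features, only 5 of which carry signal
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)  # l1_ratio mixes the L1 and L2 penalties

# Lasso typically zeros out most of the uninformative coefficients
print("Lasso nonzero coefficients:", (lasso.coef_ != 0).sum())
print("Elastic net nonzero coefficients:", (enet.coef_ != 0).sum())
```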
| Method | Penalty | Feature selection | Handles multicollinearity | Solution |
|---|---|---|---|---|
| OLS | None | No | Poorly | Closed-form |
| Ridge (L2) | sum of Beta_j^2 | No (shrinks, never zeros) | Yes | Closed-form |
| Lasso (L1) | sum of abs(Beta_j) | Yes (zeros out coefficients) | Partially | Iterative (coordinate descent) |
| Elastic net | L1 + L2 combined | Yes | Yes | Iterative (coordinate descent) |
Polynomial regression extends linear regression by including polynomial terms of the predictor variables, such as x^2, x^3, or interaction terms like x_1 * x_2. Despite the name, the model is still a linear regression because it remains linear in its parameters: the coefficients (the Beta values) enter only to the first power, and only the predictors are transformed.
For a single predictor, a degree-d polynomial regression takes the form:
y = Beta_0 + Beta_1 * x + Beta_2 * x^2 + ... + Beta_d * x^d + epsilon
This allows the model to capture curved relationships while remaining within the linear regression framework. However, higher-degree polynomials risk overfitting the training data, especially with limited observations. In practice, degrees beyond 3 or 4 are rarely used without strong theoretical justification, and regularization or cross-validation is typically employed to select the appropriate polynomial degree.
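A brief scikit-learn sketch of degree-2 polynomial regression on synthetic data follows; PolynomialFeatures expands the predictor into x and x^2 before an ordinary linear model is fitted, and the generated relationship is purely illustrative.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Illustrative curved relationship: y = 1 + 2x - 0.5x^2 plus noise
rng = np.random.default_rng(5)
x = rng.uniform(-3, 3, size=100).reshape(-1, 1)
y = 1 + 2 * x[:, 0] - 0.5 * x[:, 0] ** 2 + rng.normal(scale=0.5, size=100)

# Expand x into [x, x^2], then fit an ordinary linear regression on the expanded features
model = make_pipeline(PolynomialFeatures(degree=2, include_bias=False), LinearRegression())
model.fit(x, y)

lin = model.named_steps["linearregression"]
print(lin.intercept_, lin.coef_)  # approximately 1.0 and [2.0, -0.5]
```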
Linear regression is a special case of generalized linear models (GLMs), a framework introduced by John Nelder and Robert Wedderburn in 1972. GLMs unify several regression techniques by specifying three components: a probability distribution for the response variable, a linear predictor (the familiar Beta_0 + Beta_1 * x_1 + ...), and a link function that connects the linear predictor to the expected value of the response.
For standard linear regression, the response distribution is normal (Gaussian), and the link function is the identity function (the linear predictor directly equals the expected response).
Logistic regression is another member of the GLM family. It uses a Bernoulli (or binomial) distribution for binary outcomes and the logit link function, which maps probabilities to the real line via the log-odds transformation. Poisson regression (for count data) and gamma regression (for positive continuous data with non-constant variance) are other common GLMs.
Understanding linear regression as a special case of GLMs provides a natural pathway to more complex models and highlights the shared estimation principles (maximum likelihood) that underlie these techniques.
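As a small illustration of this connection, fitting a Gaussian-family GLM with the identity link in statsmodels recovers the same coefficients as OLS; the data below are synthetic, and the commented lines only indicate how other family choices would look.

```python
import numpy as np
import statsmodels.api as sm

# Illustrative synthetic data
rng = np.random.default_rng(6)
X = sm.add_constant(rng.normal(size=(200, 2)))
y = X @ np.array([1.0, 0.5, -0.8]) + rng.normal(scale=0.3, size=200)

# A Gaussian-family GLM with the identity link reproduces ordinary linear regression
glm_fit = sm.GLM(y, X, family=sm.families.Gaussian()).fit()
ols_fit = sm.OLS(y, X).fit()
print(np.allclose(glm_fit.params, ols_fit.params))  # True

# Other members of the family swap in a different distribution and link, e.g.
# sm.GLM(counts, X, family=sm.families.Poisson())    # Poisson regression for count data
# sm.GLM(binary, X, family=sm.families.Binomial())   # logistic regression for binary outcomes
```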
Linear regression is one of the most widely applied statistical methods across virtually every quantitative discipline.
| Domain | Example application |
|---|---|
| Economics | Estimating the effect of education on wages, forecasting GDP growth, modeling consumer spending |
| Finance | Computing a stock's beta (sensitivity to market returns), pricing assets, modeling interest rate risk |
| Epidemiology | Quantifying the relationship between risk factors (smoking, BMI) and health outcomes |
| Engineering | Predicting material strength based on composition, calibrating sensor measurements |
| Marketing | Estimating the effect of advertising spend on sales revenue |
| Environmental science | Modeling the relationship between CO2 concentrations and temperature anomalies |
| Social science | Studying the relationship between income inequality and social outcomes |
| Real estate | Predicting property prices from square footage, location, and amenities |
In economics, linear regression is the predominant empirical tool, used to estimate demand functions, labor supply curves, and the impact of policy interventions. The Capital Asset Pricing Model (CAPM) in finance is fundamentally a linear regression of an asset's return on the market return, where the slope (beta) measures the asset's systematic risk.
In medical research, regression analysis was central to early studies linking tobacco smoking to mortality. Researchers used smoking as the independent variable and lifespan as the dependent variable while including additional predictors (education, income) to control for confounding socioeconomic factors.
Linear regression is available in virtually every statistical software package and programming language. The two most popular Python libraries for this purpose are scikit-learn and statsmodels.
scikit-learn's LinearRegression class is oriented toward prediction. It follows the familiar fit/predict API and integrates seamlessly with the library's pipelines, cross-validation utilities, and preprocessing transformers.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Fit model
model = LinearRegression()
model.fit(X_train, y_train)
# Evaluate
y_pred = model.predict(X_test)
print(f"R-squared: {r2_score(y_test, y_pred):.4f}")
print(f"RMSE: {mean_squared_error(y_test, y_pred, squared=False):.4f}")
Regularized variants are available through Ridge, Lasso, and ElasticNet classes in sklearn.linear_model.
statsmodels is geared toward statistical inference, providing detailed summary tables with coefficient estimates, standard errors, t-statistics, p-values, confidence intervals, and diagnostic statistics.
import statsmodels.api as sm
# Add constant for intercept
X_with_const = sm.add_constant(X)
# Fit OLS model
model = sm.OLS(y, X_with_const).fit()
# Print full summary
print(model.summary())
statsmodels also supports a formula-based API similar to R:
import statsmodels.formula.api as smf
model = smf.ols('price ~ sqft + bedrooms + bathrooms', data=df).fit()
print(model.summary())
In R, linear regression is built into the base language via the lm() function:
model <- lm(price ~ sqft + bedrooms + bathrooms, data = housing)
summary(model)
confint(model)
R's formula syntax and extensive diagnostic plotting capabilities (plot(model)) make it a popular choice for statistical analysis and academic research.
| Library/Tool | Language | Strengths |
|---|---|---|
| scikit-learn | Python | Prediction-focused; integrates with ML pipelines; regularized variants built in |
| statsmodels | Python | Inference-focused; detailed statistical summaries; formula API |
| R lm() | R | Native formula syntax; rich diagnostic plots; extensive ecosystem of statistical packages |
| MATLAB | MATLAB | Built-in fitlm; strong visualization; common in engineering |
| Spark MLlib | Scala/Python | Distributed computing for very large datasets |