Linear regression is one of the most foundational techniques in statistics and machine learning. It models the relationship between one or more independent variables (also called predictors or features) and a continuous dependent variable (the response) by fitting a linear equation to the observed data. Because of its simplicity, interpretability, and strong theoretical grounding, linear regression remains one of the most widely used methods in science, engineering, economics, and data analysis.
Imagine you have a lemonade stand and you want to figure out how the weather affects your sales. On hot days you sell more cups, and on cold days you sell fewer. Linear regression is like drawing the best straight line through a scatter of dots on a chart, where each dot represents a day. One axis shows the temperature that day, and the other axis shows how many cups you sold. Once you draw that line, you can slide your finger along it to predict how many cups you will sell tomorrow based on the forecast temperature. The line is not perfect (some dots are above it and some are below), but it gives you the best overall guess.
The mathematical foundations of linear regression trace back to the early 19th century and the development of the method of least squares.
Adrien-Marie Legendre published the first clear description of the least squares method in 1805, in his work "Nouvelles méthodes pour la détermination des orbites des comètes" (New Methods for the Determination of the Orbits of Comets). He used the technique to fit astronomical observations of comet trajectories to predicted orbital paths.
Carl Friedrich Gauss later claimed he had been using the method since 1795, and in 1809 he published "Theoria Motus Corporum Coelestium" (Theory of the Motion of Heavenly Bodies), in which he connected least squares estimation with the normal distribution and the theory of probability. This connection laid the probabilistic groundwork for modern regression analysis. In 1822, Gauss proved that under certain conditions (errors with zero mean, equal variance, and no correlation), the least squares estimator is the best linear unbiased estimator. This result later became known as the Gauss-Markov theorem.
The term "regression" itself was introduced by Francis Galton in the 1880s during his studies of hereditary stature. Galton observed that children of exceptionally tall or short parents tended to be closer to the average height, a phenomenon he called "regression toward mediocrity." Karl Pearson and Udny Yule subsequently formalized regression as a general statistical tool in the 1890s and early 1900s, extending it beyond its original biological context.
Within a decade of Legendre's 1805 publication, the method of least squares had been adopted as a standard tool in astronomy and geodesy across France, Italy, and Prussia, an unusually rapid adoption for a new mathematical technique.
In simple linear regression, the model involves a single predictor variable x and takes the form:
y = Beta_0 + Beta_1 * x + epsilon
where y is the response variable, x is the predictor, Beta_0 is the intercept (the expected value of y when x is zero), Beta_1 is the slope (the expected change in y for a one-unit increase in x), and epsilon is a random error term capturing variation not explained by the linear relationship.
The goal is to estimate the parameters Beta_0 and Beta_1 so that the resulting line best fits the observed data.
When there are p predictor variables, the model generalizes to:
y = Beta_0 + Beta_1 * x_1 + Beta_2 * x_2 + ... + Beta_p * x_p + epsilon
In matrix notation, the model is written as:
y = X * beta + epsilon
where y is an n-by-1 vector of observations, X is an n-by-(p+1) design matrix (with a column of ones for the intercept), beta is a (p+1)-by-1 vector of coefficients, and epsilon is an n-by-1 vector of errors.
Each coefficient Beta_j represents the expected change in the response variable for a one-unit increase in the corresponding predictor x_j, holding all other predictors constant. This "ceteris paribus" interpretation is central to regression analysis in fields like economics and epidemiology.
Ordinary least squares (OLS) is the most common method for estimating the coefficients in a linear regression model. OLS minimizes the sum of squared residuals, which is the sum of the squared differences between observed values and predicted values:
Minimize: sum of (y_i - y_hat_i)^2
where y_hat_i = Beta_0 + Beta_1 * x_1i + ... + Beta_p * x_pi.
Setting the gradient of the sum of squared residuals to zero yields the normal equation. In matrix form, the OLS estimator is:
beta_hat = (X^T X)^(-1) X^T y
This closed-form solution provides exact coefficients in a single computation. It works well when the number of features p is moderate and the matrix X^T X is invertible. However, computing the matrix inverse becomes expensive for very large feature sets, with time complexity of approximately O(p^3).
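As an illustration, here is a minimal NumPy sketch of the normal-equation estimator on synthetic data; the array names and the generated values are purely illustrative, and the linear system is solved directly rather than forming the explicit inverse, which is the numerically preferable route.

```python
import numpy as np

# Illustrative synthetic data: n = 100 observations, p = 3 predictors
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 2.0 + X @ np.array([1.5, -0.7, 0.3]) + rng.normal(scale=0.5, size=100)

# Add a column of ones so the first coefficient plays the role of the intercept Beta_0
X_design = np.column_stack([np.ones(len(X)), X])

# Normal equation: beta_hat = (X^T X)^(-1) X^T y.
# Solving the linear system avoids explicitly inverting X^T X.
beta_hat = np.linalg.solve(X_design.T @ X_design, X_design.T @ y)
print(beta_hat)  # approximately [2.0, 1.5, -0.7, 0.3]
```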
For large datasets where computing the normal equation is impractical, gradient descent offers an iterative alternative. The algorithm starts with an initial guess for the coefficients and repeatedly updates them by moving in the direction that reduces the loss function (the sum of squared errors).
The update rule at each step is:
beta := beta - alpha * gradient of loss with respect to beta
where alpha is the learning rate.
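The sketch below implements batch gradient descent for this loss on synthetic data; the learning rate and iteration count are illustrative choices rather than recommendations, and with well-scaled predictors the result matches the normal-equation solution.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.05, n_iters=10000):
    """Batch gradient descent on the mean squared error loss for linear regression."""
    n, d = X.shape
    beta = np.zeros(d)
    for _ in range(n_iters):
        grad = (2.0 / n) * X.T @ (X @ beta - y)  # gradient of the average squared residual
        beta -= alpha * grad
    return beta

# Illustrative data with an intercept column, as in the normal-equation sketch above
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.1, size=200)
print(gradient_descent(X, y))  # approximately [1.0, 2.0, -0.5]
```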
Three common variants exist:
| Variant | Description | Typical use case |
|---|---|---|
| Batch gradient descent | Uses all training examples per update | Small to medium datasets |
| Stochastic gradient descent (SGD) | Uses one randomly chosen example per update | Very large datasets, online learning |
| Mini-batch gradient descent | Uses a small random subset per update | Deep learning, large-scale ML pipelines |
Stochastic and mini-batch gradient descent introduce noise into the updates, but each update is far cheaper to compute, so they typically make much faster progress on large datasets and are the workhorses of modern deep learning optimization.
The validity of OLS estimation and the statistical inferences drawn from it depend on several key assumptions. Violating these assumptions can lead to biased estimates, incorrect standard errors, or misleading hypothesis tests.
| Assumption | Description | Consequence of violation |
|---|---|---|
| Linearity | The relationship between predictors and response is linear in the parameters | Biased coefficient estimates; model underfits the true pattern |
| Independence | Observations are independent of one another | Underestimated standard errors; inflated significance |
| Homoscedasticity | The variance of the error term is constant across all levels of the predictors | Inefficient estimates; invalid confidence intervals and hypothesis tests |
| Normality of residuals | The error terms follow a normal distribution | Unreliable p-values and confidence intervals in small samples |
| No multicollinearity | Predictor variables are not highly correlated with each other | Inflated standard errors; unstable coefficient estimates |
| No endogeneity | Error terms are uncorrelated with the predictors (exogeneity) | Biased and inconsistent coefficient estimates |
The Gauss-Markov theorem states that when the linearity, independence, homoscedasticity, and exogeneity assumptions hold, OLS produces the best linear unbiased estimator (BLUE) of the coefficients. Normality is not required for BLUE status but is needed for exact finite-sample inference (t-tests and F-tests). For large samples, the central limit theorem ensures that inference is approximately valid even without strict normality.
Several metrics are used to assess the quality of a linear regression model. The choice of metric depends on the application and what aspect of model performance matters most.
| Metric | Formula | Interpretation |
|---|---|---|
| Mean squared error (MSE) | (1/n) * sum of (y_i - y_hat_i)^2 | Average squared prediction error; penalizes large errors heavily |
| Root mean squared error (RMSE) | sqrt(MSE) | Same units as y; easier to interpret than MSE |
| Mean absolute error (MAE) | (1/n) * sum of abs(y_i - y_hat_i) | Average absolute prediction error; more robust to outliers than MSE |
| R-squared (R^2) | 1 - (SS_res / SS_tot) | Proportion of variance in y explained by the model; ranges from 0 to 1 |
| Adjusted R-squared | 1 - [(1 - R^2)(n - 1) / (n - p - 1)] | Adjusts R^2 for the number of predictors; penalizes unnecessary variables |
R-squared is one of the most commonly reported statistics. A value of 0.85, for example, indicates that the model explains 85% of the variance in the response variable. However, R^2 always increases (or stays the same) as more predictors are added, even if they are irrelevant. Adjusted R-squared addresses this by penalizing model complexity, making it more suitable for comparing models with different numbers of predictors.
MAE is preferred when all errors should be treated equally regardless of magnitude, while RMSE is preferred when large errors are especially undesirable (such as in safety-critical applications).
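The following NumPy sketch computes these metrics directly from their formulas; the observed and predicted values are small illustrative arrays rather than output from a real model.

```python
import numpy as np

y_true = np.array([3.1, 4.5, 5.0, 6.2, 7.8])   # observed values (illustrative)
y_pred = np.array([3.0, 4.8, 5.2, 6.0, 7.5])   # model predictions (illustrative)

mse = np.mean((y_true - y_pred) ** 2)           # mean squared error
rmse = np.sqrt(mse)                             # root mean squared error
mae = np.mean(np.abs(y_true - y_pred))          # mean absolute error

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2 = 1.0 - ss_res / ss_tot

print(f"MSE={mse:.4f}  RMSE={rmse:.4f}  MAE={mae:.4f}  R^2={r2:.4f}")
```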
Linear regression provides a built-in framework for statistical inference on the estimated coefficients.
To test whether a specific predictor has a statistically significant relationship with the response, a t-test is used. The null hypothesis is H_0: Beta_j = 0 (the predictor has no effect). The test statistic is:
t = Beta_hat_j / SE(Beta_hat_j)
where SE(Beta_hat_j) is the standard error of the estimated coefficient. If the resulting p-value is below the chosen significance level (commonly 0.05), the null hypothesis is rejected, indicating that the predictor contributes meaningfully to the model.
The F-test evaluates whether the model as a whole explains a significant portion of the variance in the response. The null hypothesis is that all slope coefficients are simultaneously zero (H_0: Beta_1 = Beta_2 = ... = Beta_p = 0). A large F-statistic and a small p-value indicate that at least one predictor is significantly related to the response.
A 95% confidence interval for a coefficient Beta_j is computed as:
Beta_hat_j +/- t_(alpha/2, n-p-1) * SE(Beta_hat_j)
If the interval does not contain zero, the coefficient is statistically significant at the 5% level. Confidence intervals convey more information than p-values alone because they indicate both the direction and the precision of the estimated effect.
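These quantities can be computed by hand with NumPy and SciPy, as in the sketch below for a simple regression with synthetic data; in practice, libraries such as statsmodels report them automatically, as shown later in this section.

```python
import numpy as np
from scipy import stats

# Illustrative data for a simple regression (one predictor plus intercept)
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=50)
y = 1.0 + 0.8 * x + rng.normal(scale=1.5, size=50)

X = np.column_stack([np.ones_like(x), x])       # design matrix with intercept column
n, k = X.shape                                  # k = p + 1 parameters
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

residuals = y - X @ beta_hat
sigma2 = residuals @ residuals / (n - k)        # unbiased estimate of the error variance
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))

t_stats = beta_hat / se                         # t-statistics for H_0: Beta_j = 0
p_values = 2 * stats.t.sf(np.abs(t_stats), df=n - k)

t_crit = stats.t.ppf(0.975, df=n - k)           # critical value for a two-sided 95% interval
ci = list(zip(beta_hat - t_crit * se, beta_hat + t_crit * se))
print(p_values, ci)
```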
After fitting a linear regression model, diagnostic plots help assess whether the model assumptions are met and identify potential problems.
A scatter plot of residuals (y - y_hat) against fitted values (y_hat) is the most important diagnostic. In a well-specified model, residuals should appear as a random cloud centered around zero with no discernible pattern. A funnel shape indicates heteroscedasticity, while a curved pattern suggests nonlinearity.
A quantile-quantile plot compares the distribution of the standardized residuals against a theoretical normal distribution. If the residuals are approximately normal, the points fall close to a 45-degree reference line. Systematic departures from the line (especially in the tails) indicate non-normality.
Cook's distance measures the influence of each data point on the fitted model. It combines information about both the leverage (how far an observation's predictor values are from the mean) and the residual (how poorly the observation is predicted). A common rule of thumb flags observations with Cook's distance greater than 4/(n - p - 1) as potentially influential. Removing or investigating such points can reveal whether a few observations are driving the model's results.
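One way to obtain these diagnostics in Python is through statsmodels' influence machinery; the sketch below uses synthetic data and applies the rule-of-thumb cutoff described above.

```python
import numpy as np
import statsmodels.api as sm

# Illustrative synthetic data
rng = np.random.default_rng(2)
X = sm.add_constant(rng.normal(size=(100, 2)))
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(scale=0.4, size=100)

results = sm.OLS(y, X).fit()
influence = results.get_influence()

cooks_d = influence.cooks_distance[0]           # Cook's distance for each observation
p = X.shape[1] - 1                              # number of predictors (excluding the intercept)
threshold = 4 / (len(y) - p - 1)                # rule-of-thumb cutoff from the text
flagged = np.where(cooks_d > threshold)[0]
print("Potentially influential observations:", flagged)

# Residual and Q-Q plots can be drawn from the same fitted object, e.g.
# plt.scatter(results.fittedvalues, results.resid) and sm.qqplot(results.resid, line="45")
```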
The variance inflation factor quantifies how much the variance of a coefficient is inflated due to multicollinearity among predictors. VIF is calculated for each predictor by regressing it on all other predictors:
VIF_j = 1 / (1 - R^2_j)
where R^2_j is the R-squared from regressing x_j on all other predictors. General guidelines suggest that a VIF above 5 warrants investigation and a VIF above 10 indicates serious multicollinearity that should be addressed (for example, by removing or combining correlated predictors).
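A short sketch of the VIF calculation using statsmodels' variance_inflation_factor on synthetic predictors, where x3 is deliberately constructed to be almost a copy of x1; the column names and data are illustrative.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Synthetic predictors; x3 is nearly collinear with x1
rng = np.random.default_rng(3)
df = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
df["x3"] = df["x1"] + rng.normal(scale=0.1, size=200)

X_vif = sm.add_constant(df)  # include the intercept when computing VIFs
vif = pd.Series(
    [variance_inflation_factor(X_vif.values, i) for i in range(1, X_vif.shape[1])],
    index=df.columns,
)
print(vif)  # x1 and x3 should show sharply inflated values; x2 stays near 1
```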
When building a multiple regression model with many candidate predictors, selecting the right subset of features is important for both interpretability and predictive performance.
Multicollinearity occurs when two or more predictors are highly correlated. While the overall model predictions may remain accurate, multicollinearity inflates the standard errors of individual coefficients, making it difficult to determine which predictors are truly important. Severe multicollinearity can also make the coefficient estimates numerically unstable.
Common remedies include removing one of a pair of highly correlated predictors, combining correlated predictors into a single composite variable, applying principal component regression, or using regularization methods such as ridge regression.
| Method | Approach | Notes |
|---|---|---|
| Forward selection | Start with no predictors; add the most significant one at each step | Simple but can miss interactions |
| Backward elimination | Start with all predictors; remove the least significant one at each step | Requires n > p |
| Stepwise selection | Combination of forward and backward steps | Most common automated approach; risk of overfitting |
| Lasso (L1) | Regularization-based; drives some coefficients to exactly zero | Performs feature selection automatically |
| Information criteria (AIC, BIC) | Compare models using penalized likelihood | BIC favors simpler models than AIC |
Standard OLS can overfit when the number of predictors is large relative to the number of observations, or when predictors are highly correlated. Regularized regression methods add a penalty term to the loss function to constrain the coefficient magnitudes, improving generalization to new data.
Ridge regression adds the sum of squared coefficients (the L2 regularization penalty) to the OLS objective:
Minimize: sum of (y_i - y_hat_i)^2 + lambda * sum of Beta_j^2
The hyperparameter lambda controls the strength of the penalty. As lambda increases, the coefficients shrink toward zero but never reach exactly zero. Ridge regression is especially useful when predictors are correlated because it stabilizes the coefficient estimates. The solution has a closed-form expression:
beta_hat_ridge = (X^T X + lambda * I)^(-1) X^T y
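A compact NumPy sketch of this closed form follows; as a simplification it penalizes every coefficient, whereas libraries such as scikit-learn's Ridge leave the intercept unpenalized. The data are synthetic and only illustrate how the coefficients shrink as lambda grows.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Ridge estimator (X^T X + lambda * I)^(-1) X^T y.
    For brevity every column is penalized; in practice the intercept is usually excluded."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Illustrative synthetic data
rng = np.random.default_rng(4)
X = rng.normal(size=(50, 5))
y = X @ np.array([1.0, 0.5, 0.0, -0.5, 2.0]) + rng.normal(scale=0.3, size=50)

for lam in (0.0, 1.0, 100.0):
    print(lam, ridge_closed_form(X, y, lam))  # coefficients shrink toward zero as lambda grows
```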
Lasso (Least Absolute Shrinkage and Selection Operator) uses the sum of absolute coefficient values as the penalty, corresponding to L1 regularization:
Minimize: sum of (y_i - y_hat_i)^2 + lambda * sum of abs(Beta_j)
Unlike ridge, the L1 penalty can drive some coefficients to exactly zero, effectively performing automatic feature selection. Lasso tends to select one variable from a group of correlated predictors and set the others to zero, which can be desirable or undesirable depending on the context.
Elastic net combines both L1 and L2 penalties:
Minimize: sum of (y_i - y_hat_i)^2 + lambda_1 * sum of abs(Beta_j) + lambda_2 * sum of Beta_j^2
The penalty is often parameterized instead by an overall strength lambda and a mixing parameter alpha between 0 and 1, with lambda_1 = alpha * lambda and lambda_2 = (1 - alpha) * lambda. Elastic net inherits the feature selection capability of lasso while also handling groups of correlated features more gracefully, like ridge. It was proposed by Hui Zou and Trevor Hastie in 2005 to address cases where lasso's variable selection can be unstable.
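The sketch below fits lasso and elastic net with scikit-learn on synthetic data. Note that scikit-learn names the overall penalty strength alpha and the L1/L2 mixing parameter l1_ratio, which differs from the lambda/alpha naming used above; the dataset sizes and penalty values are illustrative.

```python
from sklearn.linear_model import Lasso, ElasticNet
from sklearn.datasets import make_regression

# Synthetic data with 20 features, only 5 of which carry signal
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)  # l1_ratio mixes the L1 and L2 penalties

# Lasso typically zeros out most of the uninformative coefficients
print("Lasso nonzero coefficients:", (lasso.coef_ != 0).sum())
print("Elastic net nonzero coefficients:", (enet.coef_ != 0).sum())
```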
| Method | Penalty | Feature selection | Handles multicollinearity | Solution |
|---|---|---|---|---|
| OLS | None | No | Poorly | Closed-form |
| Ridge (L2) | sum of Beta_j^2 | No (shrinks, never zeros) | Yes | Closed-form |
| Lasso (L1) | sum of abs(Beta_j) | Yes (zeros out coefficients) | Partially | Iterative (coordinate descent) |
| Elastic net | L1 + L2 combined | Yes | Yes | Iterative (coordinate descent) |
Polynomial regression extends linear regression by including polynomial terms of the predictor variables, such as x^2, x^3, or interaction terms like x_1 * x_2. Despite the name, the model is still a linear regression because it remains linear in its parameters: the coefficients (the Beta values) enter only to the first power, and only the predictors are transformed.
For a single predictor, a degree-d polynomial regression takes the form:
y = Beta_0 + Beta_1 * x + Beta_2 * x^2 + ... + Beta_d * x^d + epsilon
This allows the model to capture curved relationships while remaining within the linear regression framework. However, higher-degree polynomials risk overfitting the training data, especially with limited observations. In practice, degrees beyond 3 or 4 are rarely used without strong theoretical justification, and regularization or cross-validation is typically employed to select the appropriate polynomial degree.
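A brief scikit-learn sketch of degree-2 polynomial regression on synthetic data follows; PolynomialFeatures expands the predictor into x and x^2 before an ordinary linear model is fitted, and the generated relationship is purely illustrative.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Illustrative curved relationship: y = 1 + 2x - 0.5x^2 plus noise
rng = np.random.default_rng(5)
x = rng.uniform(-3, 3, size=100).reshape(-1, 1)
y = 1 + 2 * x[:, 0] - 0.5 * x[:, 0] ** 2 + rng.normal(scale=0.5, size=100)

# Expand x into [x, x^2], then fit an ordinary linear regression on the expanded features
model = make_pipeline(PolynomialFeatures(degree=2, include_bias=False), LinearRegression())
model.fit(x, y)

lin = model.named_steps["linearregression"]
print(lin.intercept_, lin.coef_)  # approximately 1.0 and [2.0, -0.5]
```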
Linear regression is a special case of generalized linear models (GLMs), a framework introduced by John Nelder and Robert Wedderburn in 1972. GLMs unify several regression techniques by specifying three components: a probability distribution for the response variable, a linear predictor (the familiar Beta_0 + Beta_1 * x_1 + ...), and a link function that connects the linear predictor to the expected value of the response.
For standard linear regression, the response distribution is normal (Gaussian), and the link function is the identity function (the linear predictor directly equals the expected response).
Logistic regression is another member of the GLM family. It uses a Bernoulli (or binomial) distribution for binary outcomes and the logit link function, which maps probabilities to the real line via the log-odds transformation. Poisson regression (for count data) and gamma regression (for positive continuous data with non-constant variance) are other common GLMs.
Understanding linear regression as a special case of GLMs provides a natural pathway to more complex models and highlights the shared estimation principles (maximum likelihood) that underlie these techniques.
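As a small illustration of this connection, fitting a Gaussian-family GLM with the identity link in statsmodels recovers the same coefficients as OLS; the data below are synthetic, and the commented lines only indicate how other family choices would look.

```python
import numpy as np
import statsmodels.api as sm

# Illustrative synthetic data
rng = np.random.default_rng(6)
X = sm.add_constant(rng.normal(size=(200, 2)))
y = X @ np.array([1.0, 0.5, -0.8]) + rng.normal(scale=0.3, size=200)

# A Gaussian-family GLM with the identity link reproduces ordinary linear regression
glm_fit = sm.GLM(y, X, family=sm.families.Gaussian()).fit()
ols_fit = sm.OLS(y, X).fit()
print(np.allclose(glm_fit.params, ols_fit.params))  # True

# Other members of the family swap in a different distribution and link, e.g.
# sm.GLM(counts, X, family=sm.families.Poisson())    # Poisson regression for count data
# sm.GLM(binary, X, family=sm.families.Binomial())   # logistic regression for binary outcomes
```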
Linear regression is one of the most widely applied statistical methods across virtually every quantitative discipline.
| Domain | Example application |
|---|---|
| Economics | Estimating the effect of education on wages, forecasting GDP growth, modeling consumer spending |
| Finance | Computing a stock's beta (sensitivity to market returns), pricing assets, modeling interest rate risk |
| Epidemiology | Quantifying the relationship between risk factors (smoking, BMI) and health outcomes |
| Engineering | Predicting material strength based on composition, calibrating sensor measurements |
| Marketing | Estimating the effect of advertising spend on sales revenue |
| Environmental science | Modeling the relationship between CO2 concentrations and temperature anomalies |
| Social science | Studying the relationship between income inequality and social outcomes |
| Real estate | Predicting property prices from square footage, location, and amenities |
In economics, linear regression is the predominant empirical tool, used to estimate demand functions, labor supply curves, and the impact of policy interventions. The Capital Asset Pricing Model (CAPM) in finance is fundamentally a linear regression of an asset's return on the market return, where the slope (beta) measures the asset's systematic risk.
In medical research, regression analysis was central to early studies linking tobacco smoking to mortality. Researchers used smoking as the independent variable and lifespan as the dependent variable while including additional predictors (education, income) to control for confounding socioeconomic factors.
Linear regression is available in virtually every statistical software package and programming language. The two most popular Python libraries for this purpose are scikit-learn and statsmodels.
scikit-learn's LinearRegression class is oriented toward prediction. It follows the familiar fit/predict API and integrates seamlessly with the library's pipelines, cross-validation utilities, and preprocessing transformers.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Fit model
model = LinearRegression()
model.fit(X_train, y_train)
# Evaluate
y_pred = model.predict(X_test)
print(f"R-squared: {r2_score(y_test, y_pred):.4f}")
print(f"RMSE: {mean_squared_error(y_test, y_pred, squared=False):.4f}")
Regularized variants are available through Ridge, Lasso, and ElasticNet classes in sklearn.linear_model.
statsmodels is geared toward statistical inference, providing detailed summary tables with coefficient estimates, standard errors, t-statistics, p-values, confidence intervals, and diagnostic statistics.
import statsmodels.api as sm
# Add constant for intercept
X_with_const = sm.add_constant(X)
# Fit OLS model
model = sm.OLS(y, X_with_const).fit()
# Print full summary
print(model.summary())
statsmodels also supports a formula-based API similar to R:
import statsmodels.formula.api as smf
model = smf.ols('price ~ sqft + bedrooms + bathrooms', data=df).fit()
print(model.summary())
In R, linear regression is built into the base language via the lm() function:
model <- lm(price ~ sqft + bedrooms + bathrooms, data = housing)
summary(model)
confint(model)
R's formula syntax and extensive diagnostic plotting capabilities (plot(model)) make it a popular choice for statistical analysis and academic research.
| Library/Tool | Language | Strengths |
|---|---|---|
| scikit-learn | Python | Prediction-focused; integrates with ML pipelines; regularized variants built in |
| statsmodels | Python | Inference-focused; detailed statistical summaries; formula API |
| R lm() | R | Native formula syntax; rich diagnostic plots; extensive ecosystem of statistical packages |
| MATLAB | MATLAB | Built-in fitlm; strong visualization; common in engineering |
| Spark MLlib | Scala/Python | Distributed computing for very large datasets |