# Linear Regression

> Source: https://aiwiki.ai/wiki/linear_regression
> Updated: 2026-07-11
> Categories: Machine Learning, Statistics
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**Linear regression** is a statistical method that models the relationship between one or more independent variables (the predictors or features) and a continuous dependent variable (the response) by fitting a linear equation of the form $$y = \beta_0 + \beta_1 x + \cdots$$ to observed data, with the coefficients usually chosen to minimize the sum of squared errors. The underlying method of least squares was first published by Adrien-Marie Legendre in 1805 and given its probabilistic foundation by Carl Friedrich Gauss in 1809, making linear regression one of the oldest and most widely used techniques in statistics and [machine learning](/wiki/machine_learning).[1][2][9][10] Because of its simplicity, interpretability, and strong theoretical grounding, it remains a default tool across science, engineering, economics, and data analysis, and it is the special case of the [generalized linear model](/wiki/generalization) with a normal response and an identity link.[4][9][10]

## ELI5: Explain like I'm five

Imagine you have a lemonade stand and you want to figure out how the weather affects your sales. On hot days you sell more cups, and on cold days you sell fewer. Linear regression is like drawing the best straight line through a scatter of dots on a chart, where each dot represents a day. One axis shows the temperature that day, and the other axis shows how many cups you sold. Once you draw that line, you can slide your finger along it to predict how many cups you will sell tomorrow based on the forecast temperature. The line is not perfect (some dots are above it and some are below), but it gives you the best overall guess.

## When was linear regression invented?

The mathematical foundations of linear regression trace back to the early 19th century and the development of the method of least squares.

Adrien-Marie Legendre published the first clear description of the least squares method in 1805, in his work "Nouvelles méthodes pour la détermination des orbites des comètes" (New Methods for the Determination of the Orbits of Comets).[1] The method appeared in a nine-page appendix titled "Sur la méthode des moindres quarrés" (On the Method of Least Squares), in which Legendre fit astronomical observations of comet trajectories to predicted orbital paths and also applied the technique to estimating the ellipticity of the Earth.[1][8]

Carl Friedrich Gauss later claimed he had been using the method since 1795, and in 1809 he published "Theoria Motus Corporum Coelestium" (Theory of the Motion of Heavenly Bodies), in which he connected least squares estimation with the normal distribution and the theory of probability.[2][8] This connection laid the probabilistic groundwork for modern regression analysis and triggered a famous priority dispute with Legendre; after examining Gauss's notebooks, the historian Stephen Stigler concluded that Gauss "probably told the truth" about his earlier use of the method.[8] In 1822, Gauss proved that under certain conditions (errors with zero mean, equal variance, and no correlation), the least squares estimator is the best linear unbiased estimator.[8] This result later became known as the Gauss-Markov theorem, named after Gauss and Andrey Markov; Alexander Aitken generalized it to arbitrary error covariance matrices in 1935.[11][13]

The term "regression" itself was introduced by Francis Galton in the 1880s during his studies of hereditary stature.[3] Galton observed that children of exceptionally tall or short parents tended to be closer to the average height, a phenomenon he called "regression towards mediocrity."[3] In his 1886 paper "Regression towards Mediocrity in Hereditary Stature," published in the Journal of the Anthropological Institute, Galton quantified the effect, reporting that the deviation of children from the population mean was on average about two-thirds (a ratio of 2 to 3) of the deviation of their parents.[3] Karl Pearson and Udny Yule subsequently formalized regression as a general statistical tool in the 1890s and early 1900s, extending it beyond its original biological context.

Within a decade of Legendre's 1805 publication, the method of least squares had been adopted as a standard tool in astronomy and geodesy across France, Italy, and Prussia, representing one of the most rapid acceptances of a scientific technique in history.[8]

## Mathematical formulation

### Simple linear regression

In simple linear regression, the model involves a single predictor variable x and takes the form:

$$
y = \beta_0 + \beta_1 x + \epsilon
$$

where:

- $$y$$ is the dependent variable (response)
- $$x$$ is the independent variable (predictor or [feature](/wiki/feature))
- $$\beta_0$$ is the intercept (the expected value of y when x = 0)
- $$\beta_1$$ is the slope (the expected change in y for a one-unit change in x)
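- $$\epsilon$$ is the error term (the deviation of the observed value from the true regression line; its sample counterpart, the residual, is the difference between the observed and fitted values)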

The goal is to estimate the parameters $$\beta_0$$ and $$\beta_1$$ so that the resulting line best fits the observed data.
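For a single predictor, the least squares estimates have a well-known closed form: the slope is the sum of cross-deviations divided by the sum of squared deviations of x, and the intercept makes the line pass through the point of means. The following is a minimal sketch with NumPy; the temperature and sales figures are made-up illustrative values, and the helper `predict` is a hypothetical name.

```python
import numpy as np

# Hypothetical data: temperature (deg C) and cups sold; illustrative only.
x = np.array([18.0, 21.0, 24.0, 27.0, 30.0, 33.0])
y = np.array([22.0, 29.0, 33.0, 41.0, 46.0, 52.0])

x_mean, y_mean = x.mean(), y.mean()

# Slope: sum of cross-deviations over sum of squared x-deviations.
beta1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
# Intercept: forces the fitted line through the point of means.
beta0 = y_mean - beta1 * x_mean

def predict(x_new):
    """Predicted response for new predictor value(s)."""
    return beta0 + beta1 * np.asarray(x_new)

print(f"intercept={beta0:.3f}, slope={beta1:.3f}")
print(f"prediction at 25 degrees: {predict(25.0):.1f}")
```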

### Multiple linear regression

When there are p predictor variables, the model generalizes to:

$$
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon
$$

In matrix notation, the model is written as:

$$
y = X\beta + \epsilon
$$

where $$y$$ is an n-by-1 vector of observations, $$X$$ is an n-by-(p+1) design matrix (with a column of ones for the intercept), $$\beta$$ is a (p+1)-by-1 vector of coefficients, and $$\epsilon$$ is an n-by-1 vector of errors.[11]

Each coefficient $$\beta_j$$ represents the expected change in the response variable for a one-unit increase in the corresponding predictor $$x_j$$, holding all other predictors constant.[11] This "ceteris paribus" interpretation is central to regression analysis in fields like economics and epidemiology.[12]
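The matrix form can be made concrete by building a small design matrix. This sketch assumes simulated data and arbitrary, hypothetical coefficient values; it only illustrates the shapes involved and how the column of ones carries the intercept.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5, 2
features = rng.normal(size=(n, p))           # n observations, p predictors

# Prepend a column of ones so that beta[0] acts as the intercept.
X = np.column_stack([np.ones(n), features])  # shape (n, p + 1)

beta = np.array([1.0, 2.0, -0.5])            # hypothetical coefficients
eps = rng.normal(scale=0.1, size=n)          # simulated error vector
y = X @ beta + eps                           # y = X beta + eps

print(X.shape, beta.shape, y.shape)
```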

## How are the coefficients estimated? Ordinary least squares

Ordinary least squares (OLS) is the most common method for estimating the coefficients in a linear regression model. OLS minimizes the sum of squared residuals, which is the sum of the squared differences between observed values and predicted values:[11]

$$
\text{Minimize:} \quad \sum_i (y_i - \hat{y}_i)^2
$$

where $$\hat{y}_i = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_p x_{pi}$$.

### Closed-form solution (normal equation)

Setting the gradient of the sum of squared residuals to zero yields the normal equation. In matrix form, the OLS estimator is:[9][11]

$$
\hat{\beta} = (X^\top X)^{-1} X^\top y
$$

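This closed-form solution provides exact coefficients in a single computation. It works well when the number of features p is moderate and the matrix $$X^\top X$$ is invertible. However, forming $$X^\top X$$ costs roughly $$O(np^2)$$ and solving the resulting system roughly $$O(p^3)$$, so the approach becomes expensive for very large feature sets. In practice, numerical libraries usually solve the least squares problem through a QR or singular value decomposition rather than an explicit matrix inverse, which is more stable when $$X^\top X$$ is nearly singular.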
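As a rough illustration, the normal equation can be solved directly with NumPy and compared against a general least squares solver. The data are simulated and `true_beta` is a hypothetical set of coefficients chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(42)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
true_beta = np.array([2.0, 1.5, -3.0, 0.5])   # hypothetical coefficients
y = X @ true_beta + rng.normal(scale=0.5, size=n)

# Normal equation: solve (X^T X) beta = X^T y as a linear system.
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# General least squares solver applied to X directly.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.round(beta_normal, 3))
print(np.allclose(beta_normal, beta_lstsq))
```

Both routes should agree on well-conditioned data like this; they can diverge when predictors are nearly collinear.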

### Gradient descent approach

For large datasets where solving the normal equation is impractical, [gradient descent](/wiki/gradient_descent) offers an iterative alternative.[9] The algorithm starts with an initial guess for the coefficients and repeatedly updates them by moving in the direction that reduces the [loss function](/wiki/loss_function) (the sum of squared errors).

The update rule at each step is:

$$
\beta := \beta - \alpha \nabla_\beta L(\beta)
$$

where $$\alpha$$ is the learning rate.

Three common variants exist:

| Variant | Description | Typical use case |
|---|---|---|
| Batch gradient descent | Uses all training examples per update | Small to medium datasets |
| Stochastic gradient descent (SGD) | Uses one randomly chosen example per update | Very large datasets, online learning |
| Mini-batch gradient descent | Uses a small random subset per update | Deep learning, large-scale ML pipelines |

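Stochastic and mini-batch gradient descent introduce noise into the updates, but each update is far cheaper than a full pass over the data, so they typically reach a good solution with much less computation on large datasets. They are the workhorses of modern [deep learning](/wiki/deep_learning) optimization.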
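The update rule can be sketched for the batch variant as follows. This is a minimal illustration, not a tuned implementation: it assumes the loss is the mean (rather than the sum) of squared errors so that a fixed learning rate behaves the same regardless of sample size, and the function name, learning rate, and iteration count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.3, size=n)

def batch_gradient_descent(X, y, alpha=0.1, n_iters=1000):
    """Minimize the mean squared error with full-batch gradient steps."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        # Gradient of (1/n) * ||y - X beta||^2 with respect to beta.
        grad = (2.0 / len(y)) * X.T @ (X @ beta - y)
        beta -= alpha * grad
    return beta

beta_gd = batch_gradient_descent(X, y)
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(beta_gd, 4), np.round(beta_ols, 4))
```

Because the loss is convex, the iterates approach the same coefficients as the closed-form solution, provided the learning rate is small enough for the updates to remain stable.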

## What are the assumptions of linear regression?

The validity of OLS estimation and the statistical inferences drawn from it depend on several key assumptions.[12] Violating these assumptions can lead to biased estimates, incorrect standard errors, or misleading hypothesis tests.[11]

| Assumption | Description | Consequence of violation |
|---|---|---|
| Linearity | The relationship between predictors and response is linear in the parameters | Biased coefficient estimates; model underfits the true pattern |
| Independence | Observations are independent of one another | Underestimated standard errors; inflated significance |
| Homoscedasticity | The variance of the error term is constant across all levels of the predictors | Inefficient estimates; invalid confidence intervals and hypothesis tests |
| Normality of residuals | The error terms follow a normal distribution | Unreliable p-values and confidence intervals in small samples |
| No multicollinearity | Predictor variables are not highly correlated with each other | Inflated standard errors; unstable coefficient estimates |
| No endogeneity | Error terms are uncorrelated with the predictors (exogeneity) | Biased and inconsistent coefficient estimates |

The Gauss-Markov theorem states that when linearity, independence, homoscedasticity, and exogeneity hold, and the predictors are not perfectly collinear, OLS produces the best linear unbiased estimator (BLUE) of the coefficients.[11][12] Normality is not required for BLUE status but is needed for exact finite-sample inference (t-tests and F-tests).[12] For large samples, the central limit theorem ensures that inference is approximately valid even without strict normality.

## How is a linear regression model evaluated?

Several metrics are used to assess the quality of a linear regression model. The choice of metric depends on the application and what aspect of model performance matters most.

| Metric | Formula | Interpretation |
|---|---|---|
| [Mean squared error](/wiki/mean_squared_error_mse) (MSE) | $$\frac{1}{n} \sum_i (y_i - \hat{y}_i)^2$$ | Average squared prediction error; penalizes large errors heavily |
| Root mean squared error (RMSE) | $$\sqrt{\mathrm{MSE}}$$ | Same units as y; easier to interpret than MSE |
| Mean absolute error (MAE) | $$\frac{1}{n} \sum_i \lvert y_i - \hat{y}_i \rvert$$ | Average absolute prediction error; more robust to outliers than MSE |
| R-squared ($$R^2$$) | $$1 - \frac{\text{SS}_{\text{res}}}{\text{SS}_{\text{tot}}}$$ | Proportion of variance in y explained by the model; between 0 and 1 for an OLS fit with an intercept evaluated on its training data, but can be negative on held-out data |
| Adjusted R-squared | $$1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}$$ | Adjusts $$R^2$$ for the number of predictors; penalizes unnecessary variables |

**R-squared** is one of the most commonly reported statistics. A value of 0.85, for example, indicates that the model explains 85% of the variance in the response variable. However, $$R^2$$ computed on the training data never decreases as more predictors are added, even if they are irrelevant.[10] **Adjusted R-squared** addresses this by penalizing model complexity, making it more suitable for comparing models with different numbers of predictors.[11]

**MAE** is preferred when all errors should be treated equally regardless of magnitude, while **RMSE** is preferred when large errors are especially undesirable (such as in safety-critical applications).
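The formulas in the table translate almost line for line into code. The sketch below uses a hypothetical helper, `regression_metrics`, and five made-up observations; `n_predictors` corresponds to p in the adjusted R-squared formula.

```python
import numpy as np

def regression_metrics(y_true, y_pred, n_predictors):
    """Compute MSE, RMSE, MAE, R-squared and adjusted R-squared."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = y_true.size
    residuals = y_true - y_pred
    ss_res = np.sum(residuals ** 2)                   # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
    mse = ss_res / n
    r2 = 1.0 - ss_res / ss_tot
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)
    return {
        "mse": mse,
        "rmse": np.sqrt(mse),
        "mae": np.mean(np.abs(residuals)),
        "r2": r2,
        "adj_r2": adj_r2,
    }

# Made-up observations and predictions, for illustration only.
y_true = [3.0, 5.0, 7.0, 9.0, 11.0]
y_pred = [2.5, 5.5, 7.0, 8.0, 12.0]
metrics = regression_metrics(y_true, y_pred, n_predictors=1)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```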

## Hypothesis testing and confidence intervals

Linear regression provides a built-in framework for statistical inference on the estimated coefficients.

### t-test for individual coefficients

To test whether a specific predictor has a statistically significant relationship with the response, a t-test is used. The null hypothesis is $$H_0: \beta_j = 0$$ (the predictor has no effect). The test statistic is:[11]

$$
t = \frac{\hat{\beta}_j}{\mathrm{SE}(\hat{\beta}_j)}
$$

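where $$\mathrm{SE}(\hat{\beta}_j)$$ is the standard error of the estimated coefficient. If the resulting p-value is below the chosen significance level (commonly 0.05), the null hypothesis is rejected, indicating evidence of a nonzero association between the predictor and the response after accounting for the other predictors. Statistical significance does not by itself imply that the effect is large or practically important.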
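A minimal sketch of the test on simulated data follows. It relies on the standard OLS result that the standard errors are the square roots of the diagonal of $$\hat{\sigma}^2 (X^\top X)^{-1}$$, with $$\hat{\sigma}^2$$ the residual sum of squares divided by $$n - p - 1$$; that formula is not spelled out in the text above, and the coefficient values are hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, p = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
# Hypothetical truth: the second predictor has no effect on y.
y = X @ np.array([1.0, 2.0, 0.0]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
residuals = y - X @ beta_hat
df = n - p - 1                                   # residual degrees of freedom
sigma2_hat = residuals @ residuals / df          # estimated error variance
cov_beta = sigma2_hat * np.linalg.inv(X.T @ X)   # covariance of beta_hat
se = np.sqrt(np.diag(cov_beta))

t_stats = beta_hat / se
p_values = 2.0 * stats.t.sf(np.abs(t_stats), df)  # two-sided p-values

for j in range(p + 1):
    print(f"beta_{j}: t={t_stats[j]:.2f}, p={p_values[j]:.4f}")
```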

### F-test for overall model significance

The F-test evaluates whether the model as a whole explains a significant portion of the variance in the response. The null hypothesis is that all slope coefficients are simultaneously zero ($$H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0$$).[11] A large F-statistic and a small p-value indicate that at least one predictor is significantly related to the response.
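One common way to compute the overall F-statistic is from the total and residual sums of squares, as sketched below on simulated data. The specific formula, $$F = \frac{(\text{SS}_{\text{tot}} - \text{SS}_{\text{res}})/p}{\text{SS}_{\text{res}}/(n - p - 1)}$$, is the standard textbook form rather than something stated above, and the coefficients are hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, p = 80, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([0.5, 1.0, 0.0, -2.0]) + rng.normal(size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
ss_res = np.sum((y - X @ beta_hat) ** 2)   # unexplained variation
ss_tot = np.sum((y - y.mean()) ** 2)       # variation around the mean

# Explained variation per predictor over unexplained variation per
# residual degree of freedom.
f_stat = ((ss_tot - ss_res) / p) / (ss_res / (n - p - 1))
p_value = stats.f.sf(f_stat, p, n - p - 1)

print(f"F={f_stat:.2f}, p-value={p_value:.3g}")
```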

### Confidence intervals

A $$100(1 - \alpha)\%$$ confidence interval for a coefficient $$\beta_j$$ (95% when $$\alpha = 0.05$$) is computed as:

$$
\hat{\beta}_j \pm t_{\alpha/2,\, n-p-1} \, \mathrm{SE}(\hat{\beta}_j)
$$

If the interval does not contain zero, the coefficient is statistically significant at the 5% level.[11] Confidence intervals convey more information than p-values alone because they indicate both the direction and the precision of the estimated effect.[12]
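The interval formula can be sketched as follows on simulated data with one predictor. The standard errors are computed from the usual OLS covariance estimate, which is an assumption of this example rather than something derived in the text, and the true intercept and slope are hypothetical values.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, p = 60, 1
x = rng.uniform(0, 10, size=n)
X = np.column_stack([np.ones(n), x])
y = 4.0 + 1.5 * x + rng.normal(scale=2.0, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
df = n - p - 1
resid = y - X @ beta_hat
se = np.sqrt(np.diag((resid @ resid / df) * np.linalg.inv(X.T @ X)))

alpha = 0.05                                  # 95% interval
t_crit = stats.t.ppf(1.0 - alpha / 2.0, df)   # critical value t_{alpha/2, n-p-1}
ci_lower = beta_hat - t_crit * se
ci_upper = beta_hat + t_crit * se

for j, name in enumerate(["intercept", "slope"]):
    print(f"{name}: {beta_hat[j]:.3f} [{ci_lower[j]:.3f}, {ci_upper[j]:.3f}]")
```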

## Diagnostic plots and model checking

After fitting a linear regression model, diagnostic plots help assess whether the model assumptions are met and identify potential problems.

### Residual plot

A scatter plot of residuals ($$y - \hat{y}$$) against fitted values ($$\hat{y}$$) is the most important diagnostic.[11] In a well-specified model, residuals should appear as a random cloud centered around zero with no discernible pattern. A funnel shape indicates heteroscedasticity, while a curved pattern suggests nonlinearity.[11]

### Q-Q plot

A quantile-quantile plot compares the distribution of the standardized residuals against a theoretical normal distribution. If the residuals are approximately normal, the points fall close to a 45-degree reference line. Systematic departures from the line (especially in the tails) indicate non-normality.

### Cook's distance

Cook's distance measures the influence of each data point on the fitted model.[11] It combines information about both the leverage (how far an observation's predictor values are from the mean) and the residual (how poorly the observation is predicted). A common rule of thumb flags observations with Cook's distance greater than $$\frac{4}{n - p - 1}$$ as potentially influential.[11] Removing or investigating such points can reveal whether a few observations are driving the model's results.
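A sketch of the computation, assuming the standard textbook formula $$D_i = \frac{e_i^2}{k\, s^2} \cdot \frac{h_{ii}}{(1 - h_{ii})^2}$$, where $$e_i$$ is the residual, $$h_{ii}$$ the leverage, $$k = p + 1$$ the number of fitted parameters, and $$s^2$$ the residual variance estimate. The data are simulated, with one deliberately planted unusual point.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 30
x = rng.uniform(0, 10, size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)
# Plant one hypothetical influential point: extreme x, far off the trend.
x[-1], y[-1] = 25.0, 10.0

X = np.column_stack([np.ones(n), x])
k = X.shape[1]                          # number of parameters, p + 1
H = X @ np.linalg.inv(X.T @ X) @ X.T    # hat matrix; fitted values are H @ y
leverage = np.diag(H)
resid = y - H @ y
s2 = resid @ resid / (n - k)            # residual variance estimate

cooks_d = (resid ** 2 / (k * s2)) * leverage / (1.0 - leverage) ** 2
threshold = 4.0 / (n - k)               # 4 / (n - p - 1) rule of thumb
flagged = np.flatnonzero(cooks_d > threshold)

print("flagged observations:", flagged)
```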

### Variance inflation factor (VIF)

The variance inflation factor quantifies how much the variance of a coefficient is inflated due to multicollinearity among predictors. VIF is calculated for each predictor by regressing it on all other predictors:

$$
\mathrm{VIF}_j = \frac{1}{1 - R_j^2}
$$

where $$R_j^2$$ is the R-squared from regressing $$x_j$$ on all other predictors. General guidelines suggest that a VIF above 5 warrants investigation and a VIF above 10 indicates serious multicollinearity that should be addressed (for example, by removing or combining correlated predictors).[11]
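The definition can be implemented directly by running one auxiliary regression per predictor. The function name `variance_inflation_factors` is a hypothetical helper, and the data are simulated so that two predictors are nearly identical while a third is unrelated.

```python
import numpy as np

def variance_inflation_factors(features):
    """VIF for each column of an (n, p) feature matrix (no intercept column)."""
    features = np.asarray(features, dtype=float)
    n, p = features.shape
    vifs = np.empty(p)
    for j in range(p):
        target = features[:, j]
        # Regress predictor j on all other predictors plus an intercept.
        others = np.column_stack([np.ones(n), np.delete(features, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        ss_res = np.sum((target - others @ coef) ** 2)
        ss_tot = np.sum((target - target.mean()) ** 2)
        r2_j = 1.0 - ss_res / ss_tot
        vifs[j] = 1.0 / (1.0 - r2_j)
    return vifs

rng = np.random.default_rng(5)
x1 = rng.normal(size=300)
x2 = x1 + rng.normal(scale=0.1, size=300)   # nearly a copy of x1
x3 = rng.normal(size=300)                   # unrelated predictor
vifs = variance_inflation_factors(np.column_stack([x1, x2, x3]))
print(np.round(vifs, 2))
```

In this setup the first two predictors should show very large VIFs, while the third stays close to 1.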

## Feature selection and multicollinearity

When building a multiple [regression model](/wiki/regression_model) with many candidate predictors, selecting the right subset of features is important for both interpretability and predictive performance.

### Multicollinearity

Multicollinearity occurs when two or more predictors are highly correlated. While the overall model predictions may remain accurate, multicollinearity inflates the standard errors of individual coefficients, making it difficult to determine which predictors are truly important.[11] Severe multicollinearity can also make the coefficient estimates numerically unstable.[7]

Common remedies include removing one of a pair of highly correlated predictors, combining correlated predictors into a single composite variable, applying principal component regression, or using regularization methods such as ridge regression.[7]

### Feature selection methods

| Method | Approach | Notes |
|---|---|---|
| Forward selection | Start with no predictors; add the most significant one at each step | Simple but can miss interactions |
| Backward elimination | Start with all predictors; remove the least significant one at each step | Requires n > p |
| Stepwise selection | Combination of forward and backward steps | Most common automated approach; risk of [overfitting](/wiki/overfitting) |
| Lasso (L1) | Regularization-based; drives some coefficients to exactly zero | Performs feature selection automatically |
| Information criteria (AIC, BIC) | Compare models using penalized likelihood | BIC favors simpler models than AIC |

## Regularized variants

Standard OLS can overfit when the number of predictors is large relative to the number of observations, or when predictors are highly correlated. Regularized regression methods add a penalty term to the loss function to constrain the coefficient magnitudes, improving [generalization](/wiki/generalization) to new data.[9]

### Ridge regression (L2 regularization)

Ridge regression adds the sum of squared coefficients (the [L2 regularization](/wiki/l2_regularization) penalty) to the OLS objective:[7]

$$
\text{Minimize:} \quad \sum_i (y_i - \hat{y}_i)^2 + \lambda \sum_j \beta_j^2
$$

The hyperparameter $$\lambda$$ controls the strength of the penalty. As $$\lambda$$ increases, the coefficients shrink toward zero but never reach exactly zero. Ridge regression is especially useful when predictors are correlated because it stabilizes the coefficient estimates.[7] The method was introduced by Arthur Hoerl and Robert Kennard in their 1970 paper "Ridge Regression: Biased Estimation for Nonorthogonal Problems."[7] The solution has a closed-form expression:

$$
\hat{\beta}_{\text{ridge}} = (X^\top X + \lambda I)^{-1} X^\top y
$$
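The closed-form expression can be sketched in a few lines. This example assumes the predictors and response have been centered so that no intercept needs to be fit (in practice the intercept is usually left unpenalized); `ridge_closed_form` is a hypothetical helper name and the data are simulated with two strongly correlated predictors.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve (X^T X + lam * I) beta = X^T y for centered X and y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(6)
n = 100
z = rng.normal(size=n)
# Two predictors that are noisy copies of the same underlying signal.
X = np.column_stack([z + rng.normal(scale=0.05, size=n),
                     z + rng.normal(scale=0.05, size=n)])
X -= X.mean(axis=0)
y = X @ np.array([1.0, 1.0]) + rng.normal(scale=0.5, size=n)
y -= y.mean()

lambdas = (0.0, 1.0, 10.0, 100.0)
norms = []
for lam in lambdas:
    beta = ridge_closed_form(X, y, lam)
    norms.append(np.linalg.norm(beta))
    print(lam, np.round(beta, 3))
```

With `lam = 0.0` the result coincides with OLS; as `lam` grows, the coefficient vector shrinks in norm.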

### Lasso regression (L1 regularization)

Lasso (Least Absolute Shrinkage and Selection Operator) uses the sum of absolute coefficient values as the penalty, corresponding to [L1 regularization](/wiki/l1_regularization):[6]

$$
\text{Minimize:} \quad \sum_i (y_i - \hat{y}_i)^2 + \lambda \sum_j \lvert \beta_j \rvert
$$

Robert Tibshirani introduced the method in 1996, defining it as a procedure that "minimizes the residual sum of squares subject to the sum of the absolute value of the coefficients being less than a constant."[6] Unlike ridge, the L1 penalty can drive some coefficients to exactly zero, effectively performing automatic feature selection.[6] Lasso tends to select one variable from a group of correlated predictors and set the others to zero, which can be desirable or undesirable depending on the context.[5]
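The way the L1 penalty produces exact zeros can be seen in a small coordinate descent sketch. It minimizes the objective exactly as written above (sum of squared residuals plus $$\lambda \sum_j \lvert \beta_j \rvert$$), for which each coordinate update is a soft-thresholding step with threshold $$\lambda / 2$$. The function names, the value of `lam`, and the simulated data are illustrative assumptions; library implementations often scale the objective differently, so their penalty values are not directly comparable.

```python
import numpy as np

def soft_threshold(rho, threshold):
    """Shrink rho toward zero by threshold; values inside the band become 0."""
    return np.sign(rho) * np.maximum(np.abs(rho) - threshold, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iters=200):
    """Minimize sum((y - X b)^2) + lam * sum(|b|); assumes centered X and y."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = np.sum(X ** 2, axis=0)
    for _ in range(n_iters):
        for j in range(p):
            # Residual with feature j's current contribution added back.
            partial_resid = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ partial_resid
            beta[j] = soft_threshold(rho, lam / 2.0) / col_sq[j]
    return beta

rng = np.random.default_rng(7)
n, p = 200, 5
X = rng.normal(size=(n, p))
X -= X.mean(axis=0)
true_beta = np.array([3.0, 0.0, 0.0, -2.0, 0.0])   # hypothetical sparse truth
y = X @ true_beta + rng.normal(scale=0.5, size=n)
y -= y.mean()

beta_lasso = lasso_coordinate_descent(X, y, lam=100.0)
print(np.round(beta_lasso, 3))
```

The coefficients of the three irrelevant predictors are set exactly to zero, while the two relevant ones are kept but shrunk toward zero.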

### Elastic net

Elastic net combines both L1 and L2 penalties:

$$
\text{Minimize:} \quad \sum_i (y_i - \hat{y}_i)^2 + \lambda_1 \sum_j \lvert \beta_j \rvert + \lambda_2 \sum_j \beta_j^2
$$

In practice, the two penalty weights $$\lambda_1$$ and $$\lambda_2$$ are often reparametrized as a single overall strength together with a mixing parameter $$\alpha$$ (between 0 and 1) that sets the balance between the L1 and L2 penalties. Elastic net inherits the feature selection capability of lasso while also handling groups of correlated features more gracefully, like ridge. It was proposed by Hui Zou and Trevor Hastie in 2005 to address cases where lasso's variable selection can be unstable.[5]

| Method | Penalty | Feature selection | Handles multicollinearity | Solution |
|---|---|---|---|---|
| OLS | None | No | Poorly | Closed-form |
| Ridge (L2) | $$\sum_j \beta_j^2$$ | No (shrinks, never zeros) | Yes | Closed-form |
| Lasso (L1) | $$\sum_j \lvert \beta_j \rvert$$ | Yes (zeros out coefficients) | Partially | Iterative (coordinate descent) |
| Elastic net | L1 + L2 combined | Yes | Yes | Iterative (coordinate descent) |

## Polynomial regression

Polynomial regression extends linear regression by including polynomial terms of the predictor variables, such as $$x^2$$, $$x^3$$, or interaction terms like $$x_1 x_2$$. Despite being called "polynomial," the model is still linear in its parameters because the coefficients (the $$\beta$$ values) enter the equation only linearly; the nonlinearity lies entirely in the transformed predictors.

For a single predictor, a degree-d polynomial regression takes the form:

$$
y = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_d x^d + \epsilon
$$

This allows the model to capture curved relationships while remaining within the linear regression framework. However, higher-degree polynomials risk overfitting the training data, especially with limited observations.[10] In practice, degrees beyond 3 or 4 are rarely used without strong theoretical justification, and regularization or cross-validation is typically employed to select the appropriate polynomial degree.[10]
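Because the model is linear in its coefficients, a polynomial fit is just OLS on an expanded design matrix. The sketch below assumes simulated data from a hypothetical quadratic and uses `np.vander` to build the columns $$1, x, x^2$$.

```python
import numpy as np

rng = np.random.default_rng(8)
x = np.linspace(-3, 3, 50)
# Hypothetical quadratic relationship plus noise.
y = 1.0 - 2.0 * x + 0.5 * x ** 2 + rng.normal(scale=0.3, size=x.size)

degree = 2
# Columns 1, x, x^2: nonlinear in x, but linear in the coefficients.
X_poly = np.vander(x, N=degree + 1, increasing=True)
beta_poly, *_ = np.linalg.lstsq(X_poly, y, rcond=None)

print(np.round(beta_poly, 3))   # estimates of beta_0, beta_1, beta_2
```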

## How does linear regression relate to logistic regression and generalized linear models?

Linear regression is a special case of generalized linear models (GLMs), a framework introduced by John Nelder and Robert Wedderburn in 1972 while both were working at the Rothamsted Experimental Station in England.[4] Their paper "Generalized Linear Models" showed that a single technique, iterative weighted least squares (now usually called iteratively reweighted least squares), can compute maximum likelihood estimates for any response distribution in the exponential family.[4] GLMs unify several regression techniques by specifying three components: a probability distribution for the response variable, a linear predictor (the familiar $$\beta_0 + \beta_1 x_1 + \cdots$$), and a link function that connects the linear predictor to the expected value of the response.[4]

For standard linear regression, the response distribution is normal (Gaussian), and the link function is the identity function (the linear predictor directly equals the expected response).

[Logistic regression](/wiki/logistic_regression) is another member of the GLM family. It uses a Bernoulli (or binomial) distribution for binary outcomes and the logit link function, which maps probabilities to the real line via the log-odds transformation.[4] Poisson regression (for count data) and gamma regression (for positive continuous data with non-constant variance) are other common GLMs.

Understanding linear regression as a special case of GLMs provides a natural pathway to more complex models and highlights the shared estimation principles (maximum likelihood) that underlie these techniques.

## What is linear regression used for?

Linear regression is one of the most widely applied statistical methods across virtually every quantitative discipline.

| Domain | Example application |
|---|---|
| Economics | Estimating the effect of education on wages, forecasting GDP growth, modeling consumer spending |
| Finance | Computing a stock's beta (sensitivity to market returns), pricing assets, modeling interest rate risk |
| Epidemiology | Quantifying the relationship between risk factors (smoking, BMI) and health outcomes |
| Engineering | Predicting material strength based on composition, calibrating sensor measurements |
| Marketing | Estimating the effect of advertising spend on sales revenue |
| Environmental science | Modeling the relationship between CO2 concentrations and temperature anomalies |
| Social science | Studying the relationship between income inequality and social outcomes |
| Real estate | Predicting property prices from square footage, location, and amenities |

In economics, linear regression is the predominant empirical tool, used to estimate demand functions, labor supply curves, and the impact of policy interventions. In finance, the Capital Asset Pricing Model (CAPM) is typically estimated as a linear regression of an asset's excess return on the market's excess return, where the slope (beta) measures the asset's systematic risk.

In medical research, regression analysis was central to early studies linking tobacco smoking to mortality. Researchers used smoking as the independent variable and lifespan as the dependent variable while including additional predictors (education, income) to control for confounding socioeconomic factors.[12]

## How do you fit a linear regression in Python and R?

Linear regression is available in virtually every statistical software package and programming language. The two most popular Python libraries for this purpose are scikit-learn and statsmodels.

### scikit-learn

scikit-learn's `LinearRegression` class is oriented toward prediction. It follows the familiar fit/predict API and integrates seamlessly with the library's pipelines, cross-validation utilities, and preprocessing transformers.

```python
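import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# X (n_samples, n_features) and y (n_samples,) are assumed to be loaded already

# Split data; a fixed random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Fit model
model = LinearRegression()
model.fit(X_train, y_train)

# Evaluate on the held-out set
y_pred = model.predict(X_test)
# Take the square root explicitly: the squared=False flag was deprecated in scikit-learn 1.4
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"R-squared: {r2_score(y_test, y_pred):.4f}")
print(f"RMSE: {rmse:.4f}")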
```

Regularized variants are available through `Ridge`, `Lasso`, and `ElasticNet` classes in `sklearn.linear_model`.

### statsmodels

statsmodels is geared toward statistical inference, providing detailed summary tables with coefficient estimates, standard errors, t-statistics, p-values, confidence intervals, and diagnostic statistics.

```python
import statsmodels.api as sm

# Add constant for intercept
X_with_const = sm.add_constant(X)

# Fit OLS model
model = sm.OLS(y, X_with_const).fit()

# Print full summary
print(model.summary())
```

statsmodels also supports a formula-based API similar to R:

```python
import statsmodels.formula.api as smf

model = smf.ols('price ~ sqft + bedrooms + bathrooms', data=df).fit()
print(model.summary())
```

### R

In R, linear regression is built into the base language via the `lm()` function:

```r
model <- lm(price ~ sqft + bedrooms + bathrooms, data = housing)
summary(model)
confint(model)
```

R's formula syntax and extensive diagnostic plotting capabilities (`plot(model)`) make it a popular choice for statistical analysis and academic research.

| Library/Tool | Language | Strengths |
|---|---|---|
| scikit-learn | Python | Prediction-focused; integrates with ML pipelines; regularized variants built in |
| statsmodels | Python | Inference-focused; detailed statistical summaries; formula API |
| R `lm()` | R | Native formula syntax; rich diagnostic plots; extensive ecosystem of statistical packages |
| MATLAB | MATLAB | Built-in `fitlm`; strong visualization; common in engineering |
| Spark MLlib | Scala/Python | Distributed computing for very large datasets |

## Strengths and limitations

### Strengths

- **Interpretability:** Coefficients have direct, intuitive interpretations as marginal effects.
- **Computational efficiency:** The closed-form OLS solution is fast for moderate-sized datasets.
- **Statistical inference:** Built-in hypothesis tests and confidence intervals require no additional machinery.
- **Theoretical foundation:** Well-understood properties (BLUE under Gauss-Markov conditions) and extensive diagnostic tools.[12]
- **Baseline model:** Linear regression often serves as a strong first model and a benchmark against which more complex methods are compared.[10]

### Limitations

- **Linearity assumption:** The model cannot capture nonlinear relationships without manual feature engineering (e.g., polynomial terms).
- **Sensitivity to outliers:** OLS minimizes squared errors, so extreme values can disproportionately affect the fit.[9]
- **Multicollinearity issues:** Correlated predictors inflate standard errors and make individual coefficient estimates unreliable.
- **Not suitable for classification:** Linear regression predicts continuous values and is inappropriate for categorical outcomes (logistic regression or other classifiers should be used instead).
- **Extrapolation risk:** Predictions outside the range of the training data can be unreliable.

## See also

- [Logistic regression](/wiki/logistic_regression)
- [Gradient descent](/wiki/gradient_descent)
- [Loss function](/wiki/loss_function)
- [Mean squared error](/wiki/mean_squared_error_mse)
- [Overfitting](/wiki/overfitting)
- [Feature](/wiki/feature)
- [L1 regularization](/wiki/l1_regularization)
- [L2 regularization](/wiki/l2_regularization)
- [Regression model](/wiki/regression_model)
- [Generalization](/wiki/generalization)

## References

1. Legendre, A. M. (1805). *Nouvelles méthodes pour la détermination des orbites des comètes*. Paris: Firmin Didot.
2. Gauss, C. F. (1809). *Theoria Motus Corporum Coelestium*. Hamburg: Perthes et Besser.
3. Galton, F. (1886). "Regression towards mediocrity in hereditary stature." *Journal of the Anthropological Institute of Great Britain and Ireland*, 15, 246-263.
4. Nelder, J. A., & Wedderburn, R. W. M. (1972). "Generalized linear models." *Journal of the Royal Statistical Society: Series A*, 135(3), 370-384.
5. Zou, H., & Hastie, T. (2005). "Regularization and variable selection via the elastic net." *Journal of the Royal Statistical Society: Series B*, 67(2), 301-320.
6. Tibshirani, R. (1996). "Regression shrinkage and selection via the lasso." *Journal of the Royal Statistical Society: Series B*, 58(1), 267-288.
7. Hoerl, A. E., & Kennard, R. W. (1970). "Ridge regression: biased estimation for nonorthogonal problems." *Technometrics*, 12(1), 55-67.
8. Stigler, S. M. (1981). "Gauss and the invention of least squares." *Annals of Statistics*, 9(3), 465-474.
9. Hastie, T., Tibshirani, R., & Friedman, J. (2009). *The Elements of Statistical Learning* (2nd ed.). Springer.
10. James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). *An Introduction to Statistical Learning*. Springer.
11. Kutner, M. H., Nachtsheim, C. J., Neter, J., & Li, W. (2004). *Applied Linear Statistical Models* (5th ed.). McGraw-Hill.
12. Freedman, D. A. (2009). *Statistical Models: Theory and Practice* (2nd ed.). Cambridge University Press.
13. Aitken, A. C. (1935). "On least squares and linear combinations of observations." *Proceedings of the Royal Society of Edinburgh*, 55, 42-48.