# Linear model

> Source: https://aiwiki.ai/wiki/linear_model
> Updated: 2026-06-23
> Categories: Machine Learning, Statistics
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**A linear model is any statistics or [machine learning](/wiki/machine_learning) model whose prediction is a linear function of its input features, of the form `f(x) = h(w_1 x_1 + w_2 x_2 + ... + w_p x_p + b)`, where the model is linear in the learned weights `w` (a fixed function `h`, the inverse of the link function used in generalized linear models, may wrap the output).** The family includes [linear regression](/wiki/linear_regression), [logistic regression](/wiki/logistic_regression), [ridge regression](/wiki/ridge_regression), [lasso regression](/wiki/lasso_regression), [elastic net](/wiki/elastic_net), the wider class of [generalized linear models](/wiki/generalized_linear_model), [linear discriminant analysis](/wiki/linear_discriminant_analysis), the linear [support vector machine](/wiki/support_vector_machine_svm), the [perceptron](/wiki/perceptron), and Bayesian linear models.

Linear models are the workhorse of applied statistics. They were the first parametric models studied formally, they remain the default tool in econometrics and biostatistics, and they still serve as the strongest available baseline in many machine learning problems where data is scarce, features are well engineered, or interpretation is part of the deliverable [10][11]. Even in deep learning, the final classification head of almost every modern network is a linear layer, and linear classifier probes are a standard tool for inspecting representations inside large neural networks [7]. The key reason fitting one is so reliable is that the optimisation objective is convex, so any local optimum is also the global optimum and the choice of solver affects only speed, not the final answer [10].

*See also: [Machine learning terms](/wiki/machine_learning_terms)*

## What makes a model linear?

A model is called linear because it is linear in the parameters `w`, not necessarily in the raw inputs. A regression on `[x, x^2, x^3]` is still a linear model: the design matrix has been expanded with polynomial features, but the prediction remains a linear combination of those features. This distinction matters because all the estimation theory and software for linear models continues to apply once you fix the feature transformation.
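To make the distinction concrete, the following minimal sketch fits a cubic curve with ordinary least squares on the expanded features `[1, x, x^2, x^3]`. The data, the coefficient values, and the variable names are illustrative assumptions; only NumPy's `lstsq` is used.

```python
import numpy as np

# Noise-free cubic data with illustrative coefficients for [1, x, x^2, x^3].
x = np.linspace(-2.0, 2.0, 25)
true_w = np.array([1.0, 2.0, -3.0, 0.5])
y = true_w[0] + true_w[1] * x + true_w[2] * x**2 + true_w[3] * x**3

# Expanded design matrix: the model is nonlinear in x but linear in the weights.
X = np.column_stack([np.ones_like(x), x, x**2, x**3])

# Ordinary least squares on the expanded features.
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(w_hat, 6))
```

Because the targets are generated without noise, the fitted weights should reproduce `true_w` up to floating-point error.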

## ELI5: explain like I'm five

Imagine you have a recipe for pancakes, and you want to predict how fluffy they will be. You might guess that the fluffiness depends on three things: how much baking powder you add, how much milk you pour in, and how long you whisk the batter. A linear model says, very simply, that the fluffiness is the baking powder times one number, plus the milk times another number, plus the whisking time times a third number, plus a fixed starting amount. If you change the milk by one cup, the fluffiness changes by exactly the same amount no matter how much baking powder you used. That straight-line behavior is what makes the model linear, and it is the reason linear models are easy to read: each ingredient gets its own weight, and bigger weights mean more important ingredients.

## General form

A linear model in its most basic regression form is written as

```
y = w · x + b + ε
```

where `x` is a vector of input features, `w` is a vector of learned weights (also called coefficients), `b` is a scalar intercept (sometimes called the bias term), `y` is the predicted target, and `ε` is a residual error term. The expression `w · x + b` is called the linear predictor.

For classification, the same linear predictor is fed through a fixed nonlinear function. In binary [logistic regression](/wiki/logistic_regression), the linear predictor is passed through the sigmoid function `σ(z) = 1 / (1 + e^(-z))` to produce a class probability. In multiclass logistic regression (sometimes called softmax regression), several linear predictors are combined through the softmax function. In count regression problems, the exponential function (the inverse of the log link) converts the linear predictor into a positive rate.
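The three output functions mentioned above can be sketched in plain Python. The helper names (`linear_predictor`, `sigmoid`, `softmax`) and the example weights are hypothetical and chosen only for illustration.

```python
import math

def linear_predictor(w, x, b):
    """Compute w . x + b for plain Python lists."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def sigmoid(z):
    """Numerically stable logistic function for a scalar z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def softmax(zs):
    """Softmax over a list of linear predictors, shifted by the max for stability."""
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

x = [1.0, 2.0]

# Binary classification: one linear predictor through the sigmoid.
p_binary = sigmoid(linear_predictor([0.5, -0.25], x, 0.0))

# Multiclass: one linear predictor per class, combined by the softmax.
class_scores = [linear_predictor(w, x, b)
                for w, b in [([1.0, 0.0], 0.0), ([0.0, 1.0], 0.0), ([-1.0, -1.0], 0.5)]]
p_multiclass = softmax(class_scores)

# Count regression: the exponential maps the predictor to a positive rate.
rate = math.exp(linear_predictor([0.2, 0.1], x, -1.0))

print(p_binary, p_multiclass, rate)
```

In each case the learned part is still the linear predictor; only the fixed output function changes.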

## The linear-model family at a glance

The table below summarises the main members of the family, the type of target each one handles, the loss or likelihood that defines the fit, and the closest scikit-learn implementation.

| Model | Target | Link / loss | Closed form | Typical implementation |
| --- | --- | --- | --- | --- |
| Ordinary least squares (OLS) | Continuous | Squared loss, identity link | Yes | [scikit-learn](/wiki/scikit-learn) `LinearRegression`, `statsmodels` `OLS`, R `lm()` |
| [Ridge regression](/wiki/ridge_regression) | Continuous | Squared loss + L2 penalty | Yes | scikit-learn `Ridge`, `glmnet` |
| [Lasso regression](/wiki/lasso_regression) | Continuous | Squared loss + L1 penalty | No (coordinate descent) | scikit-learn `Lasso`, `glmnet` |
| [Elastic net](/wiki/elastic_net) | Continuous | Squared loss + L1 + L2 | No | scikit-learn `ElasticNet`, `glmnet` |
| [Logistic regression](/wiki/logistic_regression) | Binary or multiclass | Cross-entropy, logit link | No (IRLS) | scikit-learn `LogisticRegression`, `statsmodels` `Logit` |
| Poisson regression | Counts | Negative log-likelihood, log link | No (IRLS) | scikit-learn `PoissonRegressor`, `statsmodels` `GLM(family=Poisson)` |
| Gamma regression | Positive continuous | Negative log-likelihood, log or inverse link | No (IRLS) | scikit-learn `GammaRegressor`, `statsmodels` `GLM(family=Gamma)` |
| Linear [SVM](/wiki/support_vector_machine_svm) | Binary or multiclass | Hinge loss + L2 | No (subgradient or LIBLINEAR) | scikit-learn `LinearSVC` |
| [Linear discriminant analysis](/wiki/linear_discriminant_analysis) | Multiclass | Gaussian class-conditional | Yes | scikit-learn `LinearDiscriminantAnalysis` |
| Perceptron | Binary | Perceptron update rule | No (online) | scikit-learn `Perceptron` |
| Bayesian linear regression | Continuous | Posterior over weights | Closed for Gaussian prior | scikit-learn `BayesianRidge`, PyMC, Stan |

All of these models share the same prediction formula `f(x) = g⁻¹(w · x + b)` for some link `g`. They differ in the loss they minimise, the prior or penalty they impose on `w`, and how the optimisation is solved.

## Linear regression and ordinary least squares

The canonical linear model is [linear regression](/wiki/linear_regression) fit by ordinary least squares (OLS). Given a design matrix `X` of shape `n × p` and a response vector `y`, OLS chooses the coefficient vector `β` that minimises the residual sum of squares

```
RSS(β) = ||y - Xβ||²
```

When `X'X` is invertible, this minimisation has the famous closed-form solution

```
β̂ = (X'X)⁻¹ X'y
```
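As a minimal sketch of the closed-form solution, the snippet below solves the normal equations on synthetic data and compares the result with NumPy's `lstsq`. The sample size, seed, noise level, and coefficient values are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Design matrix with an intercept column and two synthetic features.
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + 0.1 * rng.normal(size=n)

# Solve the normal equations X'X beta = X'y with a linear solve
# (no explicit matrix inverse is formed).
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Reference solution from NumPy's least-squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_normal)
print(beta_lstsq)
```

On a well-conditioned problem like this one the two routes should agree to floating-point precision.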

The normal equations `X'X β = X'y` were already known to Legendre and Gauss in the early 1800s. Adrien-Marie Legendre published the first clear statement of the method of least squares in his 1805 treatise *Nouvelles méthodes pour la détermination des orbites des comètes*, while Carl Friedrich Gauss claimed in his 1809 *Theoria Motus* to have been using it since 1795 and tied it to the Gaussian error distribution. The resulting priority dispute is one of the most famous in the history of statistics; the historian Stephen Stigler argued that Gauss probably did possess the method first but that Legendre retains priority of publication [17].

Under the Gauss-Markov assumptions (linearity in parameters, zero conditional mean of errors, homoscedasticity, no perfect multicollinearity, and uncorrelated errors), the OLS estimator is the best linear unbiased estimator (BLUE), meaning it has the smallest variance among all linear unbiased estimators of `β` [12]. The errors do not need to be Gaussian for the BLUE property to hold; normality is only needed for exact small-sample inference (t-tests, F-tests, confidence intervals).

**Geometric interpretation.** The OLS fit has a clean geometric reading. The fitted values `ŷ = Xβ̂ = X(X'X)⁻¹X'y` are the orthogonal projection of the response `y` onto the column space spanned by the predictors, and the residual vector `y - ŷ` is perpendicular to that subspace. The projection operator `H = X(X'X)⁻¹X'` is called the hat matrix because it puts the hat on `y`; it is symmetric and idempotent (`H² = H`), and its diagonal entries are the leverages that measure how much each observation pulls on its own fitted value. This view explains why the normal equations are exactly the condition that the residuals are orthogonal to every predictor.
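The projection properties can be checked numerically. This is a small sketch on random data (sizes and seed are arbitrary assumptions); it also illustrates the standard fact that the leverages sum to the number of columns of `X`.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3

X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = rng.normal(size=n)

# Hat matrix H = X (X'X)^{-1} X', built with a linear solve.
H = X @ np.linalg.solve(X.T @ X, X.T)

y_hat = H @ y            # orthogonal projection of y onto the column space of X
residual = y - y_hat     # component of y orthogonal to that space
leverage = np.diag(H)    # diagonal entries of H

print("symmetric: ", np.allclose(H, H.T))
print("idempotent:", np.allclose(H @ H, H))
print("X'residual:", np.round(X.T @ residual, 10))
print("sum of leverages:", leverage.sum())
```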

**Inference.** Under the classical assumptions the sampling covariance of the estimator is `Var(β̂) = σ²(X'X)⁻¹`, where `σ²` is the error variance, estimated by the residual sum of squares divided by `n - p`. The square roots of the diagonal entries are the coefficient standard errors that drive the familiar t-statistics, confidence intervals, and the overall F-test reported by `lm()` in R and by `statsmodels` in Python [14].
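The covariance formula can be sketched directly in NumPy. The synthetic data, seed, and true noise level below are assumptions for illustration; a package such as `statsmodels` would normally report these quantities for you.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 100, 3

X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([0.5, 1.5, -2.0])
y = X @ beta_true + 0.5 * rng.normal(size=n)   # true error variance is 0.25

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat

# Unbiased estimate of the error variance: RSS / (n - p).
sigma2_hat = resid @ resid / (n - p)

# Var(beta_hat) = sigma^2 (X'X)^{-1}; the inverse itself is needed here.
cov_beta = sigma2_hat * np.linalg.inv(X.T @ X)
std_err = np.sqrt(np.diag(cov_beta))
t_stats = beta_hat / std_err

print("coefficients:   ", beta_hat)
print("standard errors:", std_err)
print("t-statistics:   ", t_stats)
```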

## What is a generalized linear model?

For responses that are not continuous and Gaussian, the natural extension is the generalized linear model (GLM), introduced by John Nelder and Robert Wedderburn in their 1972 paper "Generalized Linear Models" in the *Journal of the Royal Statistical Society, Series A*, pages 370 to 384 [1]. Their abstract states that "the technique of iterative weighted linear regression can be used to obtain maximum likelihood estimates of the parameters with observations distributed according to some exponential family" [1]. The definitive textbook treatment is McCullagh and Nelder's *Generalized Linear Models* [9]. A GLM combines three ingredients:

1. A distribution for the response from the exponential family (Gaussian, Bernoulli, binomial, Poisson, gamma, inverse Gaussian).
2. A linear predictor `η = Xβ`.
3. A link function `g` such that `g(E[y]) = η`, or equivalently `E[y] = g⁻¹(Xβ)`.

Nelder and Wedderburn showed that maximum likelihood estimates for this whole class can be obtained by iteratively reweighted least squares (IRLS), a procedure that fits a sequence of weighted OLS problems with weights and working responses updated at each step until convergence [1]. IRLS remains the standard estimation method in `glm()` in R and in `statsmodels` GLM in Python [14]. Every exponential-family distribution has a natural or canonical link, the function that makes the linear predictor equal the natural parameter of the distribution (the logit for the Bernoulli, the log for the Poisson, the identity for the Gaussian). IRLS is equivalent to Fisher scoring for any link; with the canonical link the observed and expected information coincide, so Fisher scoring is also identical to Newton-Raphson, and `X'y` becomes a sufficient statistic for `β`. Non-canonical links such as the probit or the complementary log-log are also commonly used.
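The IRLS loop can be sketched for logistic regression (Bernoulli response, logit link) in a few lines of NumPy. This is an illustrative sketch rather than the exact routine used by `glm()` or `statsmodels`; the function name `fit_logistic_irls`, the stopping rule, and the synthetic data are assumptions, and the code does not guard against perfectly separable data, where the weights can collapse to zero.

```python
import numpy as np

def fit_logistic_irls(X, y, max_iter=25, tol=1e-10):
    """Fit logistic regression by iteratively reweighted least squares."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        eta = X @ beta                      # linear predictor
        mu = 1.0 / (1.0 + np.exp(-eta))     # mean via the inverse logit link
        w = mu * (1.0 - mu)                 # IRLS weights (Bernoulli variance)
        z = eta + (y - mu) / w              # working response
        XtW = X.T * w                       # X'W without forming diag(w)
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)   # weighted least squares step
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta

rng = np.random.default_rng(3)
n = 1000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([-0.5, 1.0, -1.5])
prob = 1.0 / (1.0 + np.exp(-(X @ beta_true)))
y = (rng.random(n) < prob).astype(float)

beta_hat = fit_logistic_irls(X, y)

# At the maximum likelihood estimate the score X'(y - mu) vanishes.
score = X.T @ (y - 1.0 / (1.0 + np.exp(-(X @ beta_hat))))
print(beta_hat, score)
```

Each pass is just a weighted OLS solve, which is why the procedure reuses the same linear algebra as ordinary regression.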

The table below summarises the most common GLMs and the response distribution and link they use.

| GLM | Response | Distribution | Usual link |
| --- | --- | --- | --- |
| Linear regression | Continuous | Gaussian | Identity (canonical) |
| Logistic regression | Binary | Bernoulli | Logit (canonical) |
| Probit regression | Binary | Bernoulli | Probit (inverse of the standard normal CDF; non-canonical) |
| Multinomial logistic | Categorical | Multinomial | Multinomial logit (canonical) |
| [Poisson regression](/wiki/poisson_regression) | Counts | Poisson | Log (canonical) |
| Negative binomial regression | Overdispersed counts | Negative binomial | Log (non-canonical) |
| Gamma regression | Positive continuous | Gamma | Inverse (canonical) or log |
| Beta regression | Proportions in (0,1) | Beta | Logit |

Logistic regression is by far the most widely used GLM in machine learning. It models the log-odds of the positive class as a linear function of the inputs and is the default classifier when interpretability or calibrated probability scores matter.

## How are linear models estimated?

Linear models can be fit by several different procedures depending on the loss function, the size of the dataset, and the regularization in use.

| Method | When used | Typical solver |
| --- | --- | --- |
| Closed-form normal equations | OLS, ridge, small-to-medium `p` | Cholesky, QR, SVD decomposition |
| Maximum likelihood estimation | All GLMs, logistic regression | IRLS, Newton-Raphson |
| Coordinate descent | Lasso, elastic net | `glmnet` algorithm |
| Least angle regression | Lasso solution path | LARS algorithm |
| Stochastic gradient descent | Very large `n` or streaming data | scikit-learn `SGDRegressor`, `SGDClassifier` |
| Quasi-Newton (L-BFGS) | Logistic regression with smooth penalty | scikit-learn `LogisticRegression(solver='lbfgs')` |
| Markov chain Monte Carlo | Bayesian linear models | Stan, PyMC, NumPyro |

In practice few implementations form `(X'X)⁻¹` directly because explicitly inverting the matrix is numerically fragile when predictors are nearly collinear. The standard approach is a matrix factorization: scikit-learn's `LinearRegression` solves the problem through the singular value decomposition of `X` (via `scipy.linalg.lstsq`), while ridge defaults to a Cholesky or conjugate-gradient solver of comparable cost [13]. The lasso and elastic net are most commonly fit by the cyclic coordinate descent of the `glmnet` algorithm, which sweeps over one coordinate at a time and exploits warm starts to compute an entire regularization path cheaply [15]. An alternative, least angle regression (LARS), introduced by Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani in their 2004 *Annals of Statistics* paper, traces the exact piecewise-linear lasso path in roughly the cost of a single OLS fit and clarified the geometry of L1 regularization [18].
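The numerical argument can be illustrated with two nearly collinear predictors: the condition number of `X'X` is the square of that of `X`, which is why factorizing `X` directly is preferred. The sketch below uses NumPy's `qr` and its SVD-based `lstsq`; the data and the degree of collinearity are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200

x1 = rng.normal(size=n)
x2 = x1 + 1e-3 * rng.normal(size=n)      # nearly collinear with x1
X = np.column_stack([np.ones(n), x1, x2])
y = 1.0 + x1 + x2 + 0.01 * rng.normal(size=n)

# Forming X'X squares the condition number of the problem.
cond_X = np.linalg.cond(X)
cond_XtX = np.linalg.cond(X.T @ X)

# QR route: X = QR, then solve the triangular system R beta = Q'y.
Q, R = np.linalg.qr(X)
beta_qr = np.linalg.solve(R, Q.T @ y)

# SVD route via NumPy's least-squares routine.
beta_svd, *_ = np.linalg.lstsq(X, y, rcond=None)

print("cond(X)   =", cond_X)
print("cond(X'X) =", cond_XtX)
print(beta_qr, beta_svd)
```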

For most modern problems with `n` in the tens of thousands and `p` in the hundreds, the closed-form OLS solution or a single L-BFGS run on the regularized log-likelihood completes in fractions of a second. The convexity of the underlying objective is the key reason fitting a linear model is so reliable: any local optimum is a global optimum, so the choice of solver only affects speed, not the final answer [10].

## Regularized variants

When the number of features is comparable to or larger than the number of observations, OLS becomes unstable and overfits. [Regularization](/wiki/regularization) addresses this by adding a penalty on the size of the coefficients.

Ridge regression, introduced by Arthur Hoerl and Robert Kennard in their 1970 paper "Ridge Regression: Biased Estimation for Nonorthogonal Problems" in *Technometrics* (volume 12, pages 55 to 67), adds an L2 penalty `λ||β||²` to the residual sum of squares [2]. The estimator becomes `β̂ = (X'X + λI)⁻¹ X'y`, which is always well defined because `X'X + λI` is invertible for any `λ > 0`. Ridge shrinks all coefficients toward zero but never sets them exactly to zero, and it is particularly useful when predictors are highly correlated. The same estimator is known in numerical analysis as Tikhonov regularization, named for Andrey Tikhonov, who studied it as a way to stabilise ill-posed inverse problems [19]. It also has a clean Bayesian reading: placing an independent Gaussian prior on each coefficient and computing the maximum-a-posteriori estimate yields exactly the ridge solution, with the penalty strength `λ` set by the ratio of the noise variance to the prior variance [19].
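A short NumPy sketch of the ridge estimator follows, in a setting with more features than observations where plain OLS has no unique solution. It also checks the standard identity that ridge equals OLS on data augmented with `sqrt(λ)·I` rows. The sizes, seed, and `λ` values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 20, 50                      # more features than observations
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

def ridge(X, y, lam):
    """Closed-form ridge estimate (X'X + lam I)^{-1} X'y via a linear solve."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

rank_X = np.linalg.matrix_rank(X)  # below p, so X'X is singular
beta_small = ridge(X, y, lam=1.0)
beta_large = ridge(X, y, lam=100.0)

# Equivalent formulation: OLS on rows [X; sqrt(lam) I] with responses [y; 0].
lam = 1.0
X_aug = np.vstack([X, np.sqrt(lam) * np.eye(p)])
y_aug = np.concatenate([y, np.zeros(p)])
beta_aug, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)

print("rank of X:", rank_X, "of", p, "columns")
print("norm at lam=1:  ", np.linalg.norm(beta_small))
print("norm at lam=100:", np.linalg.norm(beta_large))
```

Increasing `λ` shrinks the coefficient norm, but no coefficient is forced to be exactly zero.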

The lasso, introduced by Robert Tibshirani in his 1996 paper "Regression Shrinkage and Selection via the Lasso" in the *Journal of the Royal Statistical Society, Series B* (volume 58, pages 267 to 288), replaces the L2 penalty with an L1 penalty `λ∑|β_j|` [3]. As Tibshirani's abstract puts it, the method "tends to produce some coefficients that are exactly 0 and hence gives interpretable models" [3]. Because the L1 ball has corners at the axes, the solution often sets many coefficients exactly to zero. This makes the lasso a feature-selection method as well as a regularizer, which is invaluable in high-dimensional problems such as gene expression analysis where most predictors are expected to be irrelevant.
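The way the L1 penalty produces exact zeros can be seen in a bare-bones cyclic coordinate descent with soft-thresholding. This is a simplified sketch, not the `glmnet` implementation: it has no intercept, no standardization, no warm starts, and a fixed number of sweeps. The function names, the objective scaling `(1/(2n))||y - Xb||² + λ||b||₁`, and the synthetic data are assumptions for illustration.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward zero by t, returning exactly zero when |z| <= t."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_sweeps=200):
    """Minimise (1/(2n))||y - Xb||^2 + lam * ||b||_1 by cyclic coordinate descent."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    r = y.astype(float).copy()               # running residual y - Xb
    for _ in range(n_sweeps):
        for j in range(p):
            r += X[:, j] * b[j]              # remove coordinate j from the fit
            rho = X[:, j] @ r / n            # correlation with the partial residual
            b[j] = soft_threshold(rho, lam) / col_sq[j]
            r -= X[:, j] * b[j]              # add the updated coordinate back
    return b

rng = np.random.default_rng(6)
n = 200
X = rng.normal(size=(n, 5))
beta_true = np.array([3.0, 0.0, 0.0, -2.0, 0.0])   # sparse ground truth
y = X @ beta_true + 0.1 * rng.normal(size=n)

lam = 0.1
b_hat = lasso_cd(X, y, lam)
print(b_hat)
```

The irrelevant coefficients land exactly on zero, while the two active ones are shrunk toward zero by roughly `λ`.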

The elastic net, proposed by Hui Zou and Trevor Hastie in their 2005 *Journal of the Royal Statistical Society, Series B* paper (volume 67, pages 301 to 320), blends the L1 and L2 penalties: `λ_1∑|β_j| + λ_2||β||²` [4]. The L1 part still produces sparsity, but the L2 part stabilises the solution when groups of predictors are highly correlated. The lasso alone tends to pick one predictor at random from a correlated group; the elastic net tends to keep all members of the group, with smaller weights. In every case the penalty strength is a hyperparameter, usually chosen by cross-validation; scikit-learn's `RidgeCV`, `LassoCV`, and `ElasticNetCV` and the `cv.glmnet` routine in `glmnet` automate this search over a path of `λ` values.

| Penalty | Formula | Effect on coefficients | Selects features |
| --- | --- | --- | --- |
| L2 (ridge) | `λ∑β_j²` | Shrinks toward zero, rarely exactly zero | No |
| L1 (lasso) | `λ∑|β_j|` | Sets many exactly to zero | Yes |
| L1 + L2 (elastic net) | `λ_1∑|β_j| + λ_2∑β_j²` | Sparse and stable for correlated features | Yes |
| L0 | `λ·card(β ≠ 0)` | Best subset selection (NP-hard) | Yes |

## Other linear models

Several important models live outside the GLM framework but are still linear in their feature representation.

**Linear support vector machines.** A linear [SVM](/wiki/support_vector_machine_svm) replaces the cross-entropy loss with the hinge loss `max(0, 1 - y·(w·x + b))` and adds an L2 penalty. The resulting decision boundary is still a hyperplane, but the loss is designed to maximise the margin between classes rather than to estimate class probabilities. LIBLINEAR is the standard implementation for large sparse problems and is what scikit-learn uses behind `LinearSVC`.

**Perceptron.** Frank Rosenblatt's perceptron, introduced in his 1958 *Psychological Review* paper "The perceptron: a probabilistic model for information storage and organization in the brain," was the first explicitly online linear classifier [6]. Its update rule simply adds each misclassified input, signed by its true label, to the weight vector. The perceptron converges in a finite number of steps if and only if the data is linearly separable, a result later proved by Block and by Novikoff.
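The update rule is short enough to sketch in plain Python. The example below trains on the logical AND function with labels in `{-1, +1}`, which is linearly separable; the function names and the epoch cap are assumptions for illustration.

```python
def train_perceptron(samples, labels, max_epochs=100):
    """Classic perceptron: on each mistake, add label * input to the weights."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(samples, labels):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:                       # misclassified or on the boundary
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:                            # a clean pass means convergence
            break
    return w, b

def predict(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

# Logical AND: only (1, 1) belongs to the positive class.
samples = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
labels = [-1, -1, -1, 1]

w, b = train_perceptron(samples, labels)
print(w, b, [predict(w, b, x) for x in samples])
```

On data that is not linearly separable (XOR, for instance) the same loop would keep making mistakes until it hits the epoch cap.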

**Linear discriminant analysis.** Ronald Fisher's [LDA](/wiki/linear_discriminant_analysis), introduced in his 1936 paper "The use of multiple measurements in taxonomic problems" in the *Annals of Eugenics*, finds the linear projection that maximises the ratio of between-class variance to within-class variance [5]. Fisher illustrated the method on the iris dataset, which has remained a benchmark for classification ever since. LDA assumes the class-conditional distributions are Gaussian with a shared covariance matrix; under that assumption, the Bayes-optimal classifier is itself a linear function of the features.

**Linear mixed models.** When observations are grouped (students within schools, repeated measurements on the same patient), classical OLS underestimates standard errors because residuals are correlated within groups. Linear mixed models add random effects for groups in addition to the fixed-effect coefficients, giving valid inference for hierarchical and longitudinal data. They are central to biostatistics and educational measurement.

**Bayesian linear regression.** Putting a Gaussian prior on `β` and updating with Gaussian-distributed observations yields a Gaussian posterior over `β` in closed form. The posterior mean coincides with the ridge estimate, while the posterior covariance gives full uncertainty quantification. Bayesian linear regression is the workhorse of probabilistic numerics and provides the foundation for Gaussian-process regression.
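The link between the Gaussian posterior and ridge regression can be verified numerically. The sketch assumes a zero-mean isotropic prior `β ~ N(0, τ²I)` and a known noise variance `σ²`; the specific values and the synthetic data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 40, 4
sigma, tau = 0.5, 2.0              # noise and prior standard deviations (assumed known)

X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -1.0, 0.5, 0.0])
y = X @ beta_true + sigma * rng.normal(size=n)

# Gaussian posterior: precision = X'X / sigma^2 + I / tau^2.
post_precision = X.T @ X / sigma**2 + np.eye(p) / tau**2
post_cov = np.linalg.inv(post_precision)
post_mean = post_cov @ X.T @ y / sigma**2

# Ridge estimate with lam = sigma^2 / tau^2.
lam = sigma**2 / tau**2
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

print("posterior mean:", post_mean)
print("ridge estimate:", beta_ridge)
print("posterior std: ", np.sqrt(np.diag(post_cov)))
```

The posterior mean matches the ridge point estimate, and the posterior covariance supplies the uncertainty that ridge alone does not report.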

**Robust linear regression.** Squared-error loss gives outliers enormous influence because a single large residual is squared. Robust variants keep the linear predictor but swap in a loss that grows more slowly. The Huber loss is quadratic for small residuals and linear beyond a threshold, the pinball loss used in quantile regression estimates a chosen conditional quantile rather than the mean, and least-trimmed-squares or consensus methods such as RANSAC fit only an inlier subset. scikit-learn exposes these as `HuberRegressor`, `QuantileRegressor`, `RANSACRegressor`, and `TheilSenRegressor`; the documentation notes that `HuberRegressor` is usually faster while `RANSACRegressor` copes best with large outliers in the response [13].

**Generalized additive models.** Generalized additive models (GAMs), introduced by Trevor Hastie and Robert Tibshirani in 1986, replace the linear predictor `∑β_j x_j` with a sum of smooth functions `∑f_j(x_j)` estimated from the data, usually with penalized splines [20]. A GAM keeps the additive, one-term-per-feature structure that makes linear models interpretable while allowing each feature to enter nonlinearly, which is why GAMs are often the first step beyond a straight linear model when the linearity assumption is too rigid.

**Weighted and generalized least squares.** When the errors are heteroscedastic or correlated, plain OLS is no longer efficient. Weighted least squares (WLS) rescales each observation by the inverse of its error standard deviation, and the more general formulation, generalized least squares (GLS), uses the full error covariance matrix. A. C. Aitken showed in 1935 that the GLS estimator is the best linear unbiased estimator when that covariance is known, which is why GLS is sometimes called the Aitken estimator. When the covariance must itself be estimated, the procedure is feasible GLS. These are available as `WLS` and `GLS` in `statsmodels` [14].
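For the diagonal (WLS) case, weighting by the inverse variance is the same as running OLS after dividing each row by its error standard deviation. The sketch below checks this on synthetic heteroscedastic data; the standard deviations are assumed known, and all values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 300

x = rng.uniform(1.0, 5.0, size=n)
X = np.column_stack([np.ones(n), x])
sd = 0.2 * x                                   # error spread grows with x (assumed known)
y = 2.0 + 3.0 * x + sd * rng.normal(size=n)

# WLS normal equations: (X' W X) beta = X' W y with W = diag(1 / sd^2).
w = 1.0 / sd**2
beta_wls = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))

# Equivalent: plain OLS on rows rescaled by 1 / sd.
beta_rescaled, *_ = np.linalg.lstsq(X / sd[:, None], y / sd, rcond=None)

print(beta_wls, beta_rescaled)
```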

## Why use a linear model? Strengths

Linear models occupy a privileged position in machine learning for several practical reasons.

| Strength | Why it matters |
| --- | --- |
| Interpretability | Each coefficient gives the effect of a one-unit change in its feature, holding others fixed |
| Speed | Closed-form OLS or a single L-BFGS run finishes in milliseconds for typical problems |
| Convex objective | Any local optimum is the global optimum, so optimisation is reliable |
| Statistical inference | Standard errors, p-values, confidence intervals all available with mature theory |
| Strong baseline | Often matches or beats neural networks when training data is small |
| Few hyperparameters | Usually only the regularization strength needs tuning |
| Calibration | Probabilities from logistic regression are well calibrated by default |
| Regulatory acceptance | Required or strongly preferred in credit scoring, insurance pricing, and clinical risk models |

## What are the limitations of linear models? Weaknesses

The restrictions that make linear models simple also limit what they can express. The central limitation is linear separability: a single linear classifier can only separate classes with a flat hyperplane, so a problem like the XOR function cannot be solved by one linear model on the raw inputs, a point made famous by Minsky and Papert's 1969 critique of the perceptron.

| Weakness | Mitigation |
| --- | --- |
| Cannot capture nonlinear relationships from raw features | Add polynomial or interaction features, use splines, switch to a kernel method or tree model |
| Cannot solve problems that are not linearly separable (e.g. XOR) on the raw inputs | Lift features into a higher-dimensional space, use a kernel, or stack a nonlinear model |
| Sensitive to outliers under squared loss | Use robust losses (Huber, quantile) or trim observations |
| Assumes uncorrelated residuals | Use mixed models or generalized estimating equations for grouped data |
| Multicollinearity inflates variance | Use ridge or remove redundant features |
| Cannot model feature interactions automatically | Engineer cross-features or use a model that handles them |
| Assumes constant error variance | Use weighted least squares or transform the response |
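The XOR limitation and its feature-engineering fix can be shown with four data points. In this sketch a least-squares fit on the raw inputs predicts 0.5 everywhere, while adding the single interaction feature `x1·x2` reproduces XOR exactly; the variable names are illustrative.

```python
import numpy as np

# The four XOR input pairs and their targets.
inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
targets = np.array([0, 1, 1, 0], dtype=float)

# Linear model on the raw inputs: columns [1, x1, x2].
X_raw = np.column_stack([np.ones(4), inputs])
w_raw, *_ = np.linalg.lstsq(X_raw, targets, rcond=None)
pred_raw = X_raw @ w_raw

# Same model after lifting with the interaction feature x1 * x2.
X_lift = np.column_stack([X_raw, inputs[:, 0] * inputs[:, 1]])
w_lift, *_ = np.linalg.lstsq(X_lift, targets, rcond=None)
pred_lift = X_lift @ w_lift

print("raw predictions:   ", np.round(pred_raw, 6))
print("lifted weights:    ", np.round(w_lift, 6))
print("lifted predictions:", np.round(pred_lift, 6))
```

The lifted model is still linear in its weights; only the feature representation changed.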

## Classical OLS assumptions

The following assumptions underpin the BLUE property of OLS and the validity of the textbook standard errors. Diagnostics for each one form the backbone of any applied regression analysis.

| Assumption | Description | Diagnostic |
| --- | --- | --- |
| Linearity | The conditional mean of `y` is a linear function of `X` | Residual-versus-fitted plot |
| Independence of errors | Residuals are uncorrelated across observations | Durbin-Watson test, residual autocorrelation plot |
| Homoscedasticity | Errors have constant variance | Residual-versus-fitted plot, Breusch-Pagan test |
| Normality of errors | Errors are normally distributed (needed for inference, not for BLUE) | Q-Q plot, Shapiro-Wilk test |
| No perfect multicollinearity | Columns of `X` are linearly independent | Variance inflation factor (VIF) |
| Exogenous regressors | Errors uncorrelated with regressors | Hausman test, instrumental variable methods |
| No influential outliers | No single observation dominates the fit | Leverage, Cook's distance, DFBETAS |

When the constant-variance or independence assumptions are violated, OLS generally remains unbiased but its standard errors and p-values become unreliable; a violation of exogeneity, by contrast, biases the coefficient estimates themselves. The standard remedies are robust standard errors (Huber-White sandwich), generalized least squares for known correlation structure, instrumental-variable methods for endogenous regressors, or moving to a GLM with the appropriate distributional assumption [12]. A common rule of thumb flags a variance inflation factor above 5, and certainly above 10, as a sign of problematic multicollinearity, although the exact cutoff is a matter of convention rather than a hard rule.
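The variance inflation factor for feature `j` is `1 / (1 - R²_j)`, where `R²_j` comes from regressing that feature on all the others. A minimal NumPy sketch follows; the helper name `vif` and the synthetic data (two nearly duplicate features plus one independent feature) are assumptions for illustration.

```python
import numpy as np

def vif(features):
    """Variance inflation factor per column of an n x p matrix (no intercept column)."""
    n, p = features.shape
    out = np.empty(p)
    for j in range(p):
        target = features[:, j]
        others = np.column_stack([np.ones(n), np.delete(features, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - resid @ resid / np.sum((target - target.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(9)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)   # almost a copy of x1
x3 = rng.normal(size=n)              # unrelated to the others

vifs = vif(np.column_stack([x1, x2, x3]))
print(np.round(vifs, 2))
```

The two correlated features should show very large factors, while the independent feature stays close to 1.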

## Model selection and goodness of fit

Because a linear model is so cheap to fit, the practical work is usually choosing which features to include and judging how well the model fits. Several standard quantities support this.

| Metric | What it measures | Penalizes complexity | Notes |
| --- | --- | --- | --- |
| R-squared (R²) | Fraction of variance in `y` explained by the model | No | Never decreases as predictors are added, so it should not be used on its own to compare models of different size |
| Adjusted R-squared | R² adjusted for the number of predictors | Yes | Applies to linear regression; can decrease when a useless predictor is added |
| Akaike information criterion (AIC) | Likelihood penalized by `2k` for `k` parameters | Yes (mild) | Aims at predictive accuracy; tends to favour slightly larger models |
| Bayesian information criterion (BIC) | Likelihood penalized by `k·log(n)` | Yes (stronger) | Heavier penalty for large `n`; tends to favour simpler models and is consistent for the true model when it is in the candidate set |
| Mallows's Cp | Estimate of prediction error for nested OLS models | Yes | Closely related to AIC for Gaussian errors |
| Cross-validated error | Out-of-sample loss on held-out folds | Implicitly | The most model-agnostic criterion and the default for choosing the regularization strength |

R-squared and adjusted R-squared are printed by `summary()` of an `lm()` fit in R, where AIC and BIC are available through the `AIC()` and `BIC()` functions; the OLS summary in `statsmodels` reports all four [14]. R-squared by itself is a poor model-selection tool because it never decreases when predictors are added; AIC and BIC, being based on the likelihood, can be compared across any models fit by maximum likelihood to the same data, not just linear regressions [10]. In high-dimensional or predictive settings the regularization strength is almost always chosen by cross-validation rather than by an information criterion.
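These quantities are simple functions of the residual sum of squares. The sketch below computes them for a Gaussian linear model and compares a small model with one that adds a pure-noise predictor. The helper `fit_metrics` is hypothetical, and it counts only the regression coefficients in `k`; some software also counts the error variance as a parameter, which shifts AIC and BIC by a constant for every model.

```python
import numpy as np

def fit_metrics(X, y):
    """R^2, adjusted R^2, AIC, and BIC for an OLS fit; X includes the intercept column."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - rss / tss
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - k)
    # Gaussian log-likelihood evaluated at the MLE sigma^2 = RSS / n.
    loglik = -0.5 * n * (np.log(2.0 * np.pi * rss / n) + 1.0)
    return {"r2": r2, "adj_r2": adj_r2,
            "aic": 2 * k - 2 * loglik, "bic": k * np.log(n) - 2 * loglik}

rng = np.random.default_rng(10)
n = 100
x1 = rng.normal(size=n)
noise_feature = rng.normal(size=n)            # unrelated to y by construction
y = 1.0 + 2.0 * x1 + rng.normal(size=n)

small = fit_metrics(np.column_stack([np.ones(n), x1]), y)
big = fit_metrics(np.column_stack([np.ones(n), x1, noise_feature]), y)

print("small:", small)
print("big:  ", big)
```

The larger model can never have a lower R-squared, which is exactly why the penalized criteria are needed to judge whether the extra predictor earns its place.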

## When should you use a linear model?

The practical decision of whether to reach for a linear model can be summarised as follows.

| Situation | Linear model is a good fit | Better alternative |
| --- | --- | --- |
| Small dataset, few features | Yes, almost always | None usually needed |
| High-dimensional sparse data (text, genomics) | Yes, with lasso or elastic net | Usually none; L1-regularized linear models are a standard choice |
| Need interpretable coefficients for stakeholders | Yes, especially logistic regression | Generalized additive models |
| Strict regulatory environment (credit, insurance) | Yes, often required | None permitted |
| Probabilistic forecasting with calibration | Logistic or Poisson regression | Gradient-boosted models with calibration layer |
| Strong nonlinear interactions in raw features | No, unless you engineer them | Random forests, gradient boosting, neural networks |
| Image, audio, or sequence input | Only after deep feature extraction | Convolutional or transformer models |
| Causal inference from observational data | Yes, with care over confounders | Doubly robust or instrumental-variable estimators |

A reasonable workflow in any new project is to fit a linear model first as a baseline, then a tree-based model, and only move to deep learning if both fail and the data and budget justify it.

## Modern context

Linear models continue to dominate large parts of applied statistics and applied machine learning even as deep learning grabs the headlines.

**Econometrics.** OLS, two-stage least squares, and generalized method of moments remain the default tools for causal inference in economics [12]. Almost every paper in top economics journals contains at least one linear regression table, and the entire treatment-effects literature builds on linear projections.

**Healthcare and biostatistics.** Linear models underpin much of clinical risk scoring because regulators, clinicians, and journals all expect a model whose coefficients can be read from a table on paper. Logistic regression is the standard tool for predicting a binary outcome (such as in-hospital mortality scores like qSOFA and APACHE), while time-to-event scores are usually built from Cox proportional hazards regression: the Framingham cardiovascular risk equation of D'Agostino and colleagues and the original Model for End-Stage Liver Disease (MELD) score were both derived from Cox models [21]. Some bedside scores, such as CHADS₂ for stroke risk in atrial fibrillation, are simplified integer point systems distilled from such analyses rather than the raw regression coefficients. Cox proportional hazards, introduced by David Cox in 1972 and estimated through his partial likelihood, is the GLM-style workhorse of survival analysis [22].

**Credit scoring.** Logistic regression remains the benchmark in the credit risk industry because the lack of interpretability of ensemble methods is incompatible with the requirements of financial regulators [16]. The coefficients of a credit-scoring logistic regression must be defensible to a model risk officer and to a banking supervisor, and a linear model is the easiest object to defend. Recent work explores hybrids that graft nonlinear decision-tree effects onto a logistic backbone to recover some accuracy while preserving interpretability [16].

**Recommendation systems.** Logistic regression is still the standard baseline for click-through-rate prediction in advertising and recommendation. Google's [Wide and Deep](/wiki/wide_and_deep) architecture, introduced by Cheng and colleagues in 2016, jointly trains a wide linear model on cross-product features with a deep neural network on dense embeddings, capturing memorisation and generalisation in a single model [8]. The wide component is exactly a logistic regression. The system was deployed on the Google Play app store, serving over a billion active users, where the joint model significantly increased app acquisitions over wide-only and deep-only variants [8].

**Linear probes for representation learning.** Guillaume Alain and Yoshua Bengio's 2016 paper "Understanding intermediate layers using linear classifier probes" introduced the use of linear classifiers trained on frozen intermediate features as a way to measure how much task-relevant information each layer has accumulated [7]. Their probes are "trained entirely independently of the model itself" and were demonstrated on Inception v3 and ResNet-50 [7]. Linear probing is now a standard evaluation protocol for self-supervised learning, including SimCLR, MoCo, MAE, and the major large-language-model evaluation suites.

**Double descent and the theory of overparameterization.** Linear regression has also become a key testbed for understanding why heavily overparameterized models generalize. Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal showed in their 2019 *PNAS* paper (volume 116, pages 15849 to 15854) that the test error of many models, including linear regression, follows a "double descent" curve: it rises as the number of parameters approaches the interpolation threshold where the model first fits the training data exactly, then falls again as the model becomes even larger [23]. The minimum-norm least-squares solution in the overparameterized regime is the simplest setting in which this benign-overfitting behaviour can be analysed exactly, which is one reason linear models remain central to learning theory even in the deep-learning era.
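The minimum-norm least-squares solution referred to above can be computed with the pseudoinverse. This sketch, on arbitrary random data with more parameters than observations, shows that it interpolates the training targets and has a smaller norm than another interpolating solution built by adding a null-space component; sizes and seed are assumptions.

```python
import numpy as np

rng = np.random.default_rng(11)
n, p = 10, 40                       # overparameterized: p > n
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# Minimum-norm solution among all beta with X beta = y.
X_pinv = np.linalg.pinv(X)
beta_min = X_pinv @ y

# Build a different interpolating solution by adding a null-space vector of X.
v = rng.normal(size=p)
null_part = v - X_pinv @ (X @ v)    # projection of v onto the null space of X
beta_other = beta_min + null_part

print("max train error (min-norm):", np.max(np.abs(X @ beta_min - y)))
print("max train error (other):   ", np.max(np.abs(X @ beta_other - y)))
print("norms:", np.linalg.norm(beta_min), np.linalg.norm(beta_other))
```

Both solutions fit the training data exactly; the pseudoinverse simply picks the one with the smallest Euclidean norm.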

**Connection to neural networks.** Every fully connected layer in a neural network is an affine transformation (a linear map plus a bias), usually followed by a nonlinearity. The final classification layer is almost always a softmax linear classifier, so the very last decision in a billion-parameter model is made by a linear model on top of learned features. The kernel trick exploited by kernel SVMs and Gaussian processes applies the same idea with a fixed rather than learned feature map: the model stays linear in a lifted feature space, and the kernel supplies the nonlinearity.
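
A softmax classification head is small enough to write out in NumPy. In this sketch the `features` array is a random placeholder for whatever the earlier layers of a network would produce, and the weights are random rather than trained; both are assumptions for illustration.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)
features = rng.normal(size=(4, 8))  # placeholder learned features, 4 inputs
W = rng.normal(size=(8, 3))         # head weights for 3 classes (untrained)
b = np.zeros(3)                     # head bias

logits = features @ W + b           # affine map: the linear model
probs = softmax(logits)             # fixed link turning scores into probabilities
predicted = probs.argmax(axis=1)

print(probs.round(3))
print(predicted)
```

Since softmax preserves the ordering of the logits, the predicted class is determined entirely by the affine scores, which is why the head counts as a linear classifier.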

## Implementations

Linear models are available in essentially every numerical computing environment. The most widely used ecosystems are listed below.

| Library | Language | Notable estimators |
| --- | --- | --- |
| [scikit-learn](/wiki/scikit-learn) | Python | `LinearRegression`, `LogisticRegression`, `Ridge`, `Lasso`, `ElasticNet`, `BayesianRidge`, `SGDRegressor`, `SGDClassifier`, `PoissonRegressor`, `GammaRegressor`, `LinearDiscriminantAnalysis`, `LinearSVC`, `Perceptron` |
| [statsmodels](/wiki/statsmodels) | Python | `OLS`, `WLS`, `GLS`, `Logit`, `Probit`, `Poisson`, full GLM family with statistical inference, mixed models, robust regression |
| [R](/wiki/r_software) base | R | `lm()`, `glm()` |
| `glmnet` | R, Python | Lasso, ridge, elastic net for OLS, logistic, Poisson, Cox, multinomial |
| MASS | R | `lda()`, `qda()`, robust regression |
| Stan, PyMC, NumPyro | Stan (with R and Python interfaces), Python | Bayesian linear and generalized linear models |
| Apache Spark MLlib | Scala, Java, Python, R | Distributed linear and logistic regression for big data |
| LIBLINEAR | C++ | Large-scale linear classification and regression |
| Vowpal Wabbit | C++ | Online linear models for very large datasets |

For classical statistical inference (confidence intervals, p-values, model summaries), `statsmodels` in Python and base R remain the default choices [14]. For predictive modelling, [scikit-learn](/wiki/scikit-learn) and `glmnet` are the standards [13][15].

## Worked example: linear regression in scikit-learn

The following snippet fits an OLS regression with two features on a synthetic dataset and inspects the resulting coefficients.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# y = 2 + 3*x1 - 1*x2 + noise
rng = np.random.default_rng(seed=0)
X = rng.normal(size=(200, 2))
y = 2 + 3 * X[:, 0] - 1 * X[:, 1] + rng.normal(scale=0.5, size=200)

model = LinearRegression().fit(X, y)
print("intercept:", model.intercept_)
print("coefficients:", model.coef_)
print("R^2:", model.score(X, y))
```

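The estimated intercept will be close to 2.0, the coefficients close to (3.0, -1.0), and the in-sample R-squared close to 0.98: the signal variance is 3² + 1² = 10 and the noise variance is 0.5² = 0.25, so the population value is 10 / 10.25 ≈ 0.976. Switching to ridge, lasso, or elastic net requires only changing the estimator class and choosing a penalty strength: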

```python
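from sklearn.linear_model import Ridge, Lasso, ElasticNet

# Reuses X and y from the snippet above; alpha sets the penalty strength.
ridge = Ridge(alpha=1.0).fit(X, y)                    # L2 penalty
lasso = Lasso(alpha=0.1).fit(X, y)                    # L1 penalty
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)  # mix of L1 and L2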
```

For a binary classification problem, the equivalent line is `LogisticRegression().fit(X, y)`, with `y` holding class labels instead of real-valued targets. The uniform interface across all of these estimators is one of the reasons linear models are so productive in practice: changing the loss, the link, or the penalty rarely requires touching more than one line of code.
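
A classification counterpart to the regression example might look like the sketch below. The synthetic data and the true log-odds coefficients (0.5, 2, -1) are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# log-odds of y=1 are 0.5 + 2*x1 - 1*x2
rng = np.random.default_rng(seed=0)
X = rng.normal(size=(500, 2))
log_odds = 0.5 + 2 * X[:, 0] - 1 * X[:, 1]
y = rng.binomial(1, 1 / (1 + np.exp(-log_odds)))  # binary labels 0/1

clf = LogisticRegression().fit(X, y)
print("intercept:", clf.intercept_)
print("coefficients:", clf.coef_)
print("accuracy:", clf.score(X, y))
print("P(y=1) for first row:", clf.predict_proba(X[:1])[0, 1])
```

Note that `LogisticRegression` applies an L2 penalty by default (`C=1.0`), so the fitted coefficients are shrunk slightly toward zero relative to the unpenalized maximum-likelihood estimates.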

## See also

- [Linear regression](/wiki/linear_regression)
- [Logistic regression](/wiki/logistic_regression)
- [Generalized linear model](/wiki/generalized_linear_model)
- [Ridge regression](/wiki/ridge_regression)
- [Lasso regression](/wiki/lasso_regression)

## References

1. Nelder, J. A. and Wedderburn, R. W. M. (1972). "Generalized Linear Models." *Journal of the Royal Statistical Society, Series A (General)*, 135(3), 370-384. https://www.jstor.org/stable/2344614
2. Hoerl, A. E. and Kennard, R. W. (1970). "Ridge Regression: Biased Estimation for Nonorthogonal Problems." *Technometrics*, 12(1), 55-67. https://www.jstor.org/stable/1267351
3. Tibshirani, R. (1996). "Regression Shrinkage and Selection via the Lasso." *Journal of the Royal Statistical Society, Series B (Methodological)*, 58(1), 267-288. https://rss.onlinelibrary.wiley.com/doi/10.1111/j.2517-6161.1996.tb02080.x
4. Zou, H. and Hastie, T. (2005). "Regularization and Variable Selection via the Elastic Net." *Journal of the Royal Statistical Society, Series B (Statistical Methodology)*, 67(2), 301-320. https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/j.1467-9868.2005.00503.x
5. Fisher, R. A. (1936). "The Use of Multiple Measurements in Taxonomic Problems." *Annals of Eugenics*, 7(2), 179-188. https://onlinelibrary.wiley.com/doi/10.1111/j.1469-1809.1936.tb02137.x
6. Rosenblatt, F. (1958). "The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain." *Psychological Review*, 65(6), 386-408. https://doi.org/10.1037/h0042519
7. Alain, G. and Bengio, Y. (2016). "Understanding Intermediate Layers Using Linear Classifier Probes." arXiv:1610.01644. https://arxiv.org/abs/1610.01644
8. Cheng, H.-T. et al. (2016). "Wide & Deep Learning for Recommender Systems." *Proceedings of the 1st Workshop on Deep Learning for Recommender Systems (DLRS)*. arXiv:1606.07792. https://arxiv.org/abs/1606.07792
9. McCullagh, P. and Nelder, J. A. (1989). *Generalized Linear Models*, 2nd edition. Chapman and Hall/CRC. https://doi.org/10.1201/9780203753736
10. Hastie, T., Tibshirani, R., and Friedman, J. (2009). *The Elements of Statistical Learning: Data Mining, Inference, and Prediction*, 2nd edition. Springer. https://hastie.su.domains/ElemStatLearn/
11. James, G., Witten, D., Hastie, T., and Tibshirani, R. (2021). *An Introduction to Statistical Learning*, 2nd edition. Springer. https://www.statlearning.com/
12. Wooldridge, J. M. (2019). *Introductory Econometrics: A Modern Approach*, 7th edition. Cengage. https://www.cengage.com/c/introductory-econometrics-a-modern-approach-7e-wooldridge/
13. scikit-learn developers. "1.1. Linear Models." scikit-learn User Guide. https://scikit-learn.org/stable/modules/linear_model.html
14. statsmodels developers. "Linear Regression" and "Generalized Linear Models." statsmodels documentation. https://www.statsmodels.org/stable/regression.html
15. Friedman, J., Hastie, T., and Tibshirani, R. (2010). "Regularization Paths for Generalized Linear Models via Coordinate Descent." *Journal of Statistical Software*, 33(1), 1-22. https://www.jstatsoft.org/article/view/v033i01
16. Dumitrescu, E., Hue, S., Hurlin, C., and Tokpavi, S. (2022). "Machine Learning for Credit Scoring: Improving Logistic Regression with Non-Linear Decision-Tree Effects." *European Journal of Operational Research*, 297(3), 1178-1192. https://doi.org/10.1016/j.ejor.2021.06.053
17. Stigler, S. M. (1981). "Gauss and the Invention of Least Squares." *The Annals of Statistics*, 9(3), 465-474. https://doi.org/10.1214/aos/1176345451
18. Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004). "Least Angle Regression." *The Annals of Statistics*, 32(2), 407-499. https://projecteuclid.org/journals/annals-of-statistics/volume-32/issue-2/Least-angle-regression/10.1214/009053604000000067.full
19. Tikhonov, A. N. (1963). "Solution of Incorrectly Formulated Problems and the Regularization Method." *Soviet Mathematics Doklady*, 4, 1035-1038. (Ridge as Tikhonov regularization; see also Bishop, C. M. (2006). *Pattern Recognition and Machine Learning*, Springer.) https://link.springer.com/book/9780387310732
20. Hastie, T. and Tibshirani, R. (1986). "Generalized Additive Models." *Statistical Science*, 1(3), 297-310. https://doi.org/10.1214/ss/1177013604
21. D'Agostino, R. B. et al. (2008). "General Cardiovascular Risk Profile for Use in Primary Care: The Framingham Heart Study." *Circulation*, 117(6), 743-753. https://doi.org/10.1161/CIRCULATIONAHA.107.699579
22. Cox, D. R. (1972). "Regression Models and Life-Tables." *Journal of the Royal Statistical Society, Series B (Methodological)*, 34(2), 187-220. https://rss.onlinelibrary.wiley.com/doi/10.1111/j.2517-6161.1972.tb00899.x
23. Belkin, M., Hsu, D., Ma, S., and Mandal, S. (2019). "Reconciling Modern Machine-Learning Practice and the Classical Bias-Variance Trade-Off." *Proceedings of the National Academy of Sciences (PNAS)*, 116(32), 15849-15854. https://doi.org/10.1073/pnas.1903070116

