A linear model is any of a broad family of statistical and machine learning models in which the prediction is a linear function of the input features (possibly composed with a fixed nonlinear link function). The family includes linear regression, logistic regression, ridge regression, lasso regression, elastic net, the wider class of generalized linear models, linear discriminant analysis, the linear support vector machine, the perceptron, and Bayesian linear models.
Linear models are the workhorse of applied statistics. They were the first parametric models studied formally, they remain the default tool in econometrics and biostatistics, and they still serve as the strongest available baseline in many machine learning problems where data is scarce, features are well engineered, or interpretation is part of the deliverable. Even in deep learning, the final classification head of almost every modern network is a linear layer, and linear classifier probes are a standard tool for inspecting representations inside large neural networks.
See also: Machine learning terms
Imagine you have a recipe for pancakes, and you want to predict how fluffy they will be. You might guess that the fluffiness depends on three things: how much baking powder you add, how much milk you pour in, and how long you whisk the batter. A linear model says, very simply, that the fluffiness is the baking powder times one number, plus the milk times another number, plus the whisking time times a third number, plus a fixed starting amount. If you change the milk by one cup, the fluffiness changes by exactly the same amount no matter how much baking powder you used. That straight-line behavior is what makes the model linear, and it is the reason linear models are easy to read: each ingredient gets its own weight, and bigger weights mean more important ingredients.
A linear model in its most basic regression form is written as
y = w · x + b + ε
where x is a vector of input features, w is a vector of learned weights (also called coefficients), b is a scalar intercept (sometimes called the bias term), y is the predicted target, and ε is a residual error term. The expression w · x + b is called the linear predictor.
For classification, the same linear predictor is fed through a fixed nonlinear function. In binary logistic regression, the linear predictor is passed through the sigmoid function σ(z) = 1 / (1 + e^(-z)) to produce a class probability. In multiclass logistic regression (sometimes called softmax regression), several linear predictors are combined through the softmax function. In count regression problems, an exponential link converts the linear predictor into a positive rate.
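As a concrete illustration, the following sketch (plain NumPy, with made-up weights rather than fitted ones) shows how the same linear predictor becomes a probability through the sigmoid in the binary case and through the softmax in the multiclass case:

```python
import numpy as np

def sigmoid(z):
    # Logistic function: maps the linear predictor to a probability in (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Converts a vector of linear predictors (one per class) into probabilities.
    e = np.exp(z - z.max())            # subtract the max for numerical stability
    return e / e.sum()

x = np.array([1.5, 0.4])               # one example with two features
w, b = np.array([0.8, -1.2]), 0.3      # illustrative binary-classifier weights
print(sigmoid(w @ x + b))              # P(y = 1 | x)

W = np.array([[0.8, -1.2],             # illustrative multiclass weights,
              [0.1,  0.5],             # one row of weights per class
              [-0.3, 0.9]])
b_vec = np.array([0.3, 0.0, -0.1])
print(softmax(W @ x + b_vec))          # class probabilities summing to 1
```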
A model is called linear because it is linear in the parameters w, not necessarily in the raw inputs. A regression on [x, x^2, x^3] is still a linear model: the design matrix has been expanded with polynomial features, but the prediction remains a linear combination of those features. This distinction matters because all the estimation theory and software for linear models continues to apply once you fix the feature transformation.
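A small scikit-learn sketch of this idea (synthetic cubic data; the constants are illustrative): the pipeline expands the single raw input into polynomial features and then fits an ordinary linear regression on the expanded design matrix.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(seed=1)
x = rng.uniform(-2, 2, size=(300, 1))
y = 1.0 - 2.0 * x[:, 0] + 0.5 * x[:, 0] ** 3 + rng.normal(scale=0.3, size=300)

# Expand the design matrix to [x, x^2, x^3]; the model is still linear in its weights.
cubic = make_pipeline(PolynomialFeatures(degree=3, include_bias=False), LinearRegression())
cubic.fit(x, y)
print(cubic.named_steps["linearregression"].coef_)   # roughly [-2.0, 0.0, 0.5]
```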
The table below summarises the main members of the family, the type of target each one handles, the loss or likelihood that defines the fit, and the closest scikit-learn implementation.
| Model | Target | Link / loss | Closed form | Typical implementation |
|---|---|---|---|---|
| Ordinary least squares (OLS) | Continuous | Squared loss, identity link | Yes | scikit-learn LinearRegression, statsmodels OLS, R lm() |
| Ridge regression | Continuous | Squared loss + L2 penalty | Yes | scikit-learn Ridge, glmnet |
| Lasso regression | Continuous | Squared loss + L1 penalty | No (coordinate descent) | scikit-learn Lasso, glmnet |
| Elastic net | Continuous | Squared loss + L1 + L2 | No | scikit-learn ElasticNet, glmnet |
| Logistic regression | Binary or multiclass | Cross-entropy, logit link | No (IRLS) | scikit-learn LogisticRegression, statsmodels Logit |
| Poisson regression | Counts | Negative log-likelihood, log link | No (IRLS) | scikit-learn PoissonRegressor, statsmodels GLM(family=Poisson) |
| Gamma regression | Positive continuous | Negative log-likelihood, log or inverse link | No (IRLS) | scikit-learn GammaRegressor, statsmodels GLM(family=Gamma) |
| Linear SVM | Binary or multiclass | Hinge loss + L2 | No (subgradient or LIBLINEAR) | scikit-learn LinearSVC |
| Linear discriminant analysis | Multiclass | Gaussian class-conditional | Yes | scikit-learn LinearDiscriminantAnalysis |
| Perceptron | Binary | Perceptron update rule | No (online) | scikit-learn Perceptron |
| Bayesian linear regression | Continuous | Posterior over weights | Closed for Gaussian prior | scikit-learn BayesianRidge, PyMC, Stan |
All of these models share the same prediction formula f(x) = g⁻¹(w · x + b) for some link g. They differ in the loss they minimise, the prior or penalty they impose on w, and how the optimisation is solved.
The canonical linear model is linear regression fit by ordinary least squares (OLS). Given a design matrix X of shape n × p and a response vector y, OLS chooses the coefficient vector β that minimises the residual sum of squares
RSS(β) = ||y - Xβ||²
When X'X is invertible, this minimisation has the famous closed-form solution
β̂ = (X'X)⁻¹ X'y
The normal equations X'X β = X'y were already known to Legendre and Gauss in the early 1800s. Under the Gauss-Markov assumptions (linearity in parameters, zero conditional mean of errors, homoscedasticity, no perfect multicollinearity, and uncorrelated errors), the OLS estimator is the best linear unbiased estimator (BLUE), meaning it has the smallest variance among all linear unbiased estimators of β. The errors do not need to be Gaussian for the BLUE property to hold; normality is only needed for exact small-sample inference (t-tests, F-tests, confidence intervals).
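The closed form can be verified directly in NumPy. The sketch below (synthetic data, illustrative constants) solves the normal equations and cross-checks the result against a QR/SVD-based least-squares routine, which is the numerically preferred route in practice:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n, p = 200, 3
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Add a column of ones so the intercept is estimated as one more coefficient.
Xd = np.column_stack([np.ones(n), X])

# Solve the normal equations X'X beta = X'y directly ...
beta_normal = np.linalg.solve(Xd.T @ Xd, Xd.T @ y)

# ... or, more stably, via a least-squares routine that avoids forming X'X.
beta_lstsq, *_ = np.linalg.lstsq(Xd, y, rcond=None)

print(beta_normal)    # roughly [0.0, 2.0, -1.0, 0.5]
print(beta_lstsq)     # agrees with the normal-equations solution
```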
For responses that are not continuous and Gaussian, the natural extension is the generalized linear model (GLM), introduced by John Nelder and Robert Wedderburn in their 1972 paper "Generalized Linear Models" in the Journal of the Royal Statistical Society, Series A. A GLM combines three ingredients:
- a response distribution for y taken from the exponential family (Gaussian, Bernoulli, Poisson, Gamma, and so on);
- a linear predictor η = Xβ;
- a link function g such that g(E[y]) = η, or equivalently E[y] = g⁻¹(Xβ).

Nelder and Wedderburn showed that maximum likelihood estimates for this whole class can be obtained by iteratively reweighted least squares (IRLS), a procedure that fits a sequence of weighted OLS problems with weights and working responses updated at each step until convergence. IRLS remains the standard estimation method in glm() in R and in statsmodels GLM in Python.
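A bare-bones IRLS sketch for the logistic-regression case is shown below. It is only an illustration of the update, not a production implementation: it omits the safeguards that glm() and statsmodels apply for non-convergence and perfect separation.

```python
import numpy as np

def irls_logistic(X, y, n_iter=25, tol=1e-8):
    """Minimal IRLS (equivalently Newton-Raphson) for unpenalized logistic regression."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta                        # linear predictor
        p = 1.0 / (1.0 + np.exp(-eta))        # mean through the inverse logit link
        w = p * (1.0 - p)                     # IRLS weights
        z = eta + (y - p) / w                 # working response
        beta_new = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

rng = np.random.default_rng(seed=0)
X = np.column_stack([np.ones(500), rng.normal(size=(500, 2))])   # intercept + 2 features
p_true = 1.0 / (1.0 + np.exp(-(X @ np.array([0.5, 1.0, -2.0]))))
y = rng.binomial(1, p_true)
print(irls_logistic(X, y))    # roughly [0.5, 1.0, -2.0]
```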
The table below summarises the most common GLMs and the response distribution and link they use.
| GLM | Response | Distribution | Canonical or typical link |
|---|---|---|---|
| Linear regression | Continuous | Gaussian | Identity |
| Logistic regression | Binary | Bernoulli | Logit |
| Probit regression | Binary | Bernoulli | Probit (inverse of the standard normal CDF) |
| Multinomial logistic | Categorical | Multinomial | Multinomial logit |
| Poisson regression | Counts | Poisson | Log |
| Negative binomial regression | Overdispersed counts | Negative binomial | Log |
| Gamma regression | Positive continuous | Gamma | Inverse or log |
| Beta regression | Proportions in (0,1) | Beta | Logit |
Logistic regression is by far the most widely used GLM in machine learning. It models the log-odds of the positive class as a linear function of the inputs and is the default classifier when interpretability or calibrated probability scores matter.
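A minimal scikit-learn example on synthetic data is shown below; note that LogisticRegression applies an L2 penalty by default, so the recovered coefficients are slightly shrunk toward zero.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(seed=0)
X = rng.normal(size=(500, 2))
# True log-odds are linear in the features: 0.5 + 2*x1 - x2.
p = 1.0 / (1.0 + np.exp(-(0.5 + 2 * X[:, 0] - X[:, 1])))
y = rng.binomial(1, p)

clf = LogisticRegression().fit(X, y)
print(clf.intercept_, clf.coef_)      # roughly 0.5 and (2.0, -1.0)
print(clf.predict_proba(X[:3]))       # class probabilities for the first three rows
```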
Linear models can be fit by several different procedures depending on the loss function, the size of the dataset, and the regularization in use.
| Method | When used | Typical solver |
|---|---|---|
| Closed-form normal equations | OLS, ridge, small-to-medium p | Cholesky, QR, SVD decomposition |
| Maximum likelihood estimation | All GLMs, logistic regression | IRLS, Newton-Raphson |
| Coordinate descent | Lasso, elastic net | glmnet algorithm |
| Stochastic gradient descent | Very large n or streaming data | scikit-learn SGDRegressor, SGDClassifier |
| Quasi-Newton (L-BFGS) | Logistic regression with smooth penalty | scikit-learn LogisticRegression(solver='lbfgs') |
| Markov chain Monte Carlo | Bayesian linear models | Stan, PyMC, NumPyro |
For most modern problems with n in the tens of thousands and p in the hundreds, the closed-form OLS solution or a single L-BFGS run on the regularized log-likelihood completes in fractions of a second. The convexity of the underlying objective is the key reason fitting a linear model is so reliable: any local optimum is a global optimum, so the choice of solver only affects speed, not the final answer.
When the number of features is comparable to or larger than the number of observations, OLS becomes unstable and overfits. Regularization addresses this by adding a penalty on the size of the coefficients.
Ridge regression, introduced by Arthur Hoerl and Robert Kennard in their 1970 paper "Ridge Regression: Biased Estimation for Nonorthogonal Problems" in Technometrics, adds an L2 penalty λ||β||² to the residual sum of squares. The estimator becomes β̂ = (X'X + λI)⁻¹ X'y, which is always well defined because X'X + λI is invertible for any λ > 0. Ridge shrinks all coefficients toward zero but never sets them exactly to zero, and it is particularly useful when predictors are highly correlated.
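The ridge estimator is short enough to write out directly. The sketch below assumes the features and response have been centred so the intercept can be ignored, mirroring the way scikit-learn's Ridge leaves the intercept unpenalized:

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    # (X'X + lam*I)^-1 X'y; assumes X and y are centred so no intercept is needed.
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(seed=0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, -1.0, 0.0, 0.0, 0.5]) + rng.normal(scale=0.5, size=100)
Xc, yc = X - X.mean(axis=0), y - y.mean()
for lam in (0.0, 1.0, 100.0):
    print(lam, ridge_closed_form(Xc, yc, lam).round(2))   # coefficients shrink as lam grows
```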
The lasso, introduced by Robert Tibshirani in his 1996 paper "Regression Shrinkage and Selection via the Lasso" in the Journal of the Royal Statistical Society, Series B, replaces the L2 penalty with an L1 penalty λ∑|β_j|. Because the L1 ball has corners at the axes, the solution often sets many coefficients exactly to zero. This makes the lasso a feature-selection method as well as a regularizer, which is invaluable in high-dimensional problems such as gene expression analysis where most predictors are expected to be irrelevant.
The elastic net, proposed by Hui Zou and Trevor Hastie in 2005, blends the L1 and L2 penalties: λ_1∑|β_j| + λ_2||β||². The L1 part still produces sparsity, but the L2 part stabilises the solution when groups of predictors are highly correlated. The lasso alone tends to pick one predictor at random from a correlated group; the elastic net tends to keep all members of the group, with smaller weights.
| Penalty | Formula | Effect on coefficients | Selects features |
|---|---|---|---|
| L2 (ridge) | λ∑β_j² | Shrinks toward zero, rarely exactly zero | No |
| L1 (lasso) | λ∑\|β_j\| | Shrinks and sets many coefficients exactly to zero | Yes |
| L1 + L2 (elastic net) | λ_1∑\|β_j\| + λ_2∑β_j² | Sparse like the lasso, but keeps groups of correlated features | Yes |
| L0 | λ·card(β ≠ 0) | Best subset selection (NP-hard) | Yes |
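The sparsity behaviour summarised in the table can be seen directly in a small experiment (synthetic data; the alpha values are illustrative, not tuned):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(seed=0)
n, p = 100, 50
X = rng.normal(size=(n, p))
# Only the first three features matter; the remaining 47 are pure noise.
y = 3 * X[:, 0] - 2 * X[:, 1] + X[:, 2] + rng.normal(scale=0.5, size=n)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)
print("nonzero ridge coefficients:", np.sum(ridge.coef_ != 0))   # typically all 50
print("nonzero lasso coefficients:", np.sum(lasso.coef_ != 0))   # typically only a handful
```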
Several important models live outside the GLM framework but are still linear in their feature representation.
Linear support vector machines. A linear SVM replaces the cross-entropy loss with the hinge loss max(0, 1 - y·(w·x + b)) and adds an L2 penalty. The resulting decision boundary is still a hyperplane, but the loss is designed to maximise the margin between classes rather than to estimate class probabilities. LIBLINEAR is the standard implementation for large sparse problems and is what scikit-learn uses behind LinearSVC.
Perceptron. Frank Rosenblatt's perceptron, introduced in his 1958 Psychological Review paper "The perceptron: a probabilistic model for information storage and organization in the brain," was the first explicitly online linear classifier. Its update rule simply adds the misclassified input to the weight vector. The perceptron converges in a finite number of steps if and only if the data is linearly separable, a result later proved by Block and by Novikoff.
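The update rule fits in a few lines. A minimal sketch (labels assumed to be coded as -1 and +1; the mistake-driven loop terminates only when the data is linearly separable or the epoch budget runs out):

```python
import numpy as np

def perceptron_train(X, y, n_epochs=100):
    """Rosenblatt-style perceptron: on each mistake, add y_i * x_i to the weights."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:       # misclassified (or on the boundary)
                w += yi * xi
                b += yi
                mistakes += 1
        if mistakes == 0:                    # converged: (w, b) separates the data
            break
    return w, b

rng = np.random.default_rng(seed=0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)   # linearly separable labels in {-1, +1}
w, b = perceptron_train(X, y)
```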
Linear discriminant analysis. Ronald Fisher's LDA, introduced in his 1936 paper "The use of multiple measurements in taxonomic problems" in the Annals of Eugenics, finds the linear projection that maximises the ratio of between-class variance to within-class variance. Fisher illustrated the method on the iris dataset, which has remained a benchmark for classification ever since. LDA assumes the class-conditional distributions are Gaussian with a shared covariance matrix; under that assumption, the Bayes-optimal classifier is itself a linear function of the features.
Linear mixed models. When observations are grouped (students within schools, repeated measurements on the same patient), classical OLS underestimates standard errors because residuals are correlated within groups. Linear mixed models add random effects for groups in addition to the fixed-effect coefficients, giving valid inference for hierarchical and longitudinal data. They are central to biostatistics and educational measurement.
Bayesian linear regression. Putting a Gaussian prior on β and updating with Gaussian-distributed observations yields a Gaussian posterior over β in closed form. The posterior mean coincides with the ridge estimate, while the posterior covariance gives full uncertainty quantification. Bayesian linear regression is the workhorse of probabilistic numerics and provides the foundation for Gaussian-process regression.
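The conjugate update is short enough to write out. The sketch below assumes a zero-mean isotropic Gaussian prior with variance tau2 and a known noise variance sigma2, in which case the posterior mean reproduces the ridge estimate with λ = sigma2 / tau2:

```python
import numpy as np

def bayes_linear_posterior(X, y, sigma2=0.25, tau2=1.0):
    # Prior beta ~ N(0, tau2*I), likelihood y | X, beta ~ N(X beta, sigma2*I).
    p = X.shape[1]
    precision = X.T @ X / sigma2 + np.eye(p) / tau2   # posterior precision
    cov = np.linalg.inv(precision)                    # posterior covariance
    mean = cov @ (X.T @ y) / sigma2                   # posterior mean = ridge estimate
    return mean, cov

rng = np.random.default_rng(seed=0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=200)
mean, cov = bayes_linear_posterior(X, y)
print(mean)                       # close to [2.0, -1.0, 0.5]
print(np.sqrt(np.diag(cov)))      # posterior standard deviation of each weight
```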
Linear models occupy a privileged position in machine learning for several practical reasons.
| Strength | Why it matters |
|---|---|
| Interpretability | Each coefficient gives the effect of a one-unit change in its feature, holding others fixed |
| Speed | Closed-form OLS or a single L-BFGS run finishes in milliseconds for typical problems |
| Convex objective | Any local optimum is the global optimum, so optimisation is reliable |
| Statistical inference | Standard errors, p-values, confidence intervals all available with mature theory |
| Strong baseline | Often matches or beats neural networks when training data is small |
| Few hyperparameters | Usually only the regularization strength needs tuning |
| Calibration | Probabilities from logistic regression are well calibrated by default |
| Regulatory acceptance | Required or strongly preferred in credit scoring, insurance pricing, and clinical risk models |
The restrictions that make linear models simple also limit what they can express.
| Weakness | Mitigation |
|---|---|
| Cannot capture nonlinear relationships from raw features | Add polynomial or interaction features, use splines, switch to a kernel method or tree model |
| Sensitive to outliers under squared loss | Use robust losses (Huber, quantile) or trim observations |
| Assumes uncorrelated residuals | Use mixed models or generalized estimating equations for grouped data |
| Multicollinearity inflates variance | Use ridge or remove redundant features |
| Cannot model feature interactions automatically | Engineer cross-features or use a model that handles them |
| Assumes constant error variance | Use weighted least squares or transform the response |
The following assumptions underpin the BLUE property of OLS and the validity of the textbook standard errors. Diagnostics for each one form the backbone of any applied regression analysis.
| Assumption | Description | Diagnostic |
|---|---|---|
| Linearity | The conditional mean of y is a linear function of X | Residual-versus-fitted plot |
| Independence of errors | Residuals are uncorrelated across observations | Durbin-Watson test, residual autocorrelation plot |
| Homoscedasticity | Errors have constant variance | Residual-versus-fitted plot, Breusch-Pagan test |
| Normality of errors | Errors are normally distributed (needed for inference, not for BLUE) | Q-Q plot, Shapiro-Wilk test |
| No perfect multicollinearity | Columns of X are linearly independent | Variance inflation factor (VIF) |
| Exogenous regressors | Errors uncorrelated with regressors | Hausman test, instrumental variable methods |
| No influential outliers | No single observation dominates the fit | Leverage, Cook's distance, DFBETAS |
When these assumptions are violated, OLS may still be unbiased but its standard errors and p-values become unreliable. The standard remedies are robust standard errors (Huber-White sandwich), generalized least squares for known correlation structure, or moving to a GLM with the appropriate distributional assumption.
The practical decision of whether to reach for a linear model can be summarised as follows.
| Situation | Linear model is a good fit | Better alternative |
|---|---|---|
| Small dataset, few features | Yes, almost always | None usually needed |
| High-dimensional sparse data (text, genomics) | Yes, with L1 regularization (lasso or elastic net) | None usually needed |
| Need interpretable coefficients for stakeholders | Yes, especially logistic regression | Generalized additive models |
| Strict regulatory environment (credit, insurance) | Yes, often required | None permitted |
| Probabilistic forecasting with calibration | Logistic or Poisson regression | Gradient-boosted models with calibration layer |
| Strong nonlinear interactions in raw features | No, unless you engineer them | Random forests, gradient boosting, neural networks |
| Image, audio, or sequence input | Only after deep feature extraction | Convolutional or transformer models |
| Causal inference from observational data | Yes, with care over confounders | Doubly robust or instrumental-variable estimators |
A reasonable workflow in any new project is to fit a linear model first as a baseline, then a tree-based model, and only move to deep learning if both fail and the data and budget justify it.
Linear models continue to dominate large parts of applied statistics and applied machine learning even as deep learning grabs the headlines.
Econometrics. OLS, two-stage least squares, and generalized method of moments remain the default tools for causal inference in economics. Almost every paper in top economics journals contains at least one linear regression table, and the entire treatment-effects literature builds on linear projections.
Healthcare and biostatistics. Logistic regression is the standard for clinical risk scoring (Framingham, MELD, CHADS₂, qSOFA, APACHE) because regulators, clinicians, and journals all expect a model whose coefficients can be read from a table on paper. Cox proportional hazards is the GLM-style workhorse of survival analysis.
Credit scoring. Logistic regression remains the benchmark in the credit risk industry because the lack of interpretability of ensemble methods is incompatible with the requirements of financial regulators. The coefficients of a credit-scoring logistic regression must be defensible to a model risk officer and to a banking supervisor, and a linear model is the easiest object to defend.
Recommendation systems. Logistic regression is still the standard baseline for click-through-rate prediction in advertising and recommendation. Google's Wide and Deep architecture, introduced by Cheng and colleagues in 2016, jointly trains a wide linear model on cross-product features with a deep neural network on dense embeddings, capturing memorisation and generalisation in a single model. The wide component is exactly a logistic regression.
Linear probes for representation learning. Guillaume Alain and Yoshua Bengio's 2017 paper "Understanding intermediate layers using linear classifier probes" introduced the use of linear classifiers trained on frozen intermediate features as a way to measure how much task-relevant information each layer has accumulated. Linear probing is now a standard evaluation protocol for self-supervised learning, including SimCLR, MoCo, MAE, and the major large-language-model evaluation suites.
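A linear probe is nothing more than a logistic regression fit on frozen features. The sketch below uses randomly generated placeholder features and labels purely to show the protocol; in practice the feature matrix would be cached activations from one layer of a pretrained network.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholder frozen features standing in for intermediate-layer activations.
rng = np.random.default_rng(seed=0)
features = rng.normal(size=(1000, 128))
labels = (features[:, :4].sum(axis=1) > 0).astype(int)   # placeholder task labels

X_tr, X_te, y_tr, y_te = train_test_split(features, labels, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("probe accuracy:", probe.score(X_te, y_te))   # higher = more linearly decodable information
```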
Connection to neural networks. Every fully connected layer in a neural network is a linear transformation followed by a nonlinearity. The final classification layer is almost always a softmax linear classifier, so the very last decision in a billion-parameter model is made by a linear model on top of learned features. The kernel trick exploited by kernel SVMs and Gaussian processes is the same idea in reverse: stay linear in a lifted feature space and let the kernel handle the nonlinearity.
Linear models are available in essentially every numerical computing environment. The most widely used ecosystems are listed below.
| Library | Language | Notable estimators |
|---|---|---|
| scikit-learn | Python | LinearRegression, LogisticRegression, Ridge, Lasso, ElasticNet, BayesianRidge, SGDRegressor, SGDClassifier, PoissonRegressor, GammaRegressor, LinearDiscriminantAnalysis, LinearSVC, Perceptron |
| statsmodels | Python | OLS, WLS, GLS, Logit, Probit, Poisson, full GLM family with statistical inference, mixed models, robust regression |
| R base | R | lm(), glm() |
| glmnet | R, Python | Lasso, ridge, elastic net for OLS, logistic, Poisson, Cox, multinomial |
| MASS | R | lda(), qda(), robust regression |
| Stan, PyMC, NumPyro | Python | Bayesian linear and generalized linear models |
| MLlib | Spark, Scala | Distributed linear and logistic regression for big data |
| LIBLINEAR | C++ | Large-scale linear classification and regression |
| Vowpal Wabbit | C++ | Online linear models for very large datasets |
For classical statistical inference (confidence intervals, p-values, model summaries), statsmodels in Python and base R remain the default choices. For predictive modelling, scikit-learn and glmnet are the standards.
The following snippet fits a simple OLS regression on a synthetic dataset and inspects the resulting coefficients.
```python
import numpy as np
from sklearn.linear_model import LinearRegression

# y = 2 + 3*x1 - 1*x2 + noise
rng = np.random.default_rng(seed=0)
X = rng.normal(size=(200, 2))
y = 2 + 3 * X[:, 0] - 1 * X[:, 1] + rng.normal(scale=0.5, size=200)

model = LinearRegression().fit(X, y)
print("intercept:", model.intercept_)
print("coefficients:", model.coef_)
print("R^2:", model.score(X, y))
```
The estimated intercept will be close to 2.0, the coefficients close to (3.0, -1.0), and the in-sample R-squared close to 0.97. Switching to ridge or lasso requires only changing the imported class:
```python
from sklearn.linear_model import Ridge, Lasso, ElasticNet

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
```
For a binary classification problem, the equivalent line is LogisticRegression().fit(X, y). The uniform interface across all of these estimators is one of the reasons linear models are so productive in practice: changing the loss, the link, or the penalty rarely requires touching more than one line of code.