A linear model is any of a broad family of statistical and machine learning models in which the prediction is a linear function of the input features (possibly composed with a fixed nonlinear link function). The family includes linear regression, logistic regression, ridge regression, lasso regression, elastic net, the wider class of generalized linear models, linear discriminant analysis, the linear support vector machine, the perceptron, and Bayesian linear models.
Linear models are the workhorse of applied statistics. They were the first parametric models studied formally, they remain the default tool in econometrics and biostatistics, and they still serve as the strongest available baseline in many machine learning problems where data is scarce, features are well engineered, or interpretation is part of the deliverable. Even in deep learning, the final classification head of almost every modern network is a linear layer, and linear classifier probes are a standard tool for inspecting representations inside large neural networks.
See also: Machine learning terms
Imagine you have a recipe for pancakes, and you want to predict how fluffy they will be. You might guess that the fluffiness depends on three things: how much baking powder you add, how much milk you pour in, and how long you whisk the batter. A linear model says, very simply, that the fluffiness is the baking powder times one number, plus the milk times another number, plus the whisking time times a third number, plus a fixed starting amount. If you change the milk by one cup, the fluffiness changes by exactly the same amount no matter how much baking powder you used. That straight-line behavior is what makes the model linear, and it is the reason linear models are easy to read: each ingredient gets its own weight, and bigger weights mean more important ingredients.
A linear model in its most basic regression form is written as
y = w · x + b + ε
where x is a vector of input features, w is a vector of learned weights (also called coefficients), b is a scalar intercept (sometimes called the bias term), y is the predicted target, and ε is a residual error term. The expression w · x + b is called the linear predictor.
For classification, the same linear predictor is fed through a fixed nonlinear function. In binary logistic regression, the linear predictor is passed through the sigmoid function σ(z) = 1 / (1 + e^(-z)) to produce a class probability. In multiclass logistic regression (sometimes called softmax regression), several linear predictors are combined through the softmax function. In count regression problems, an exponential link converts the linear predictor into a positive rate.
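As a concrete illustration, the following sketch (plain NumPy, with made-up weights rather than fitted ones) shows how the same linear predictor becomes a probability through the sigmoid in the binary case and through the softmax in the multiclass case:

```python
import numpy as np

def sigmoid(z):
    # Logistic function: maps the linear predictor to a probability in (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Converts a vector of linear predictors (one per class) into probabilities.
    e = np.exp(z - z.max())            # subtract the max for numerical stability
    return e / e.sum()

x = np.array([1.5, 0.4])               # one example with two features
w, b = np.array([0.8, -1.2]), 0.3      # illustrative binary-classifier weights
print(sigmoid(w @ x + b))              # P(y = 1 | x)

W = np.array([[0.8, -1.2],             # illustrative multiclass weights,
              [0.1,  0.5],             # one row of weights per class
              [-0.3, 0.9]])
b_vec = np.array([0.3, 0.0, -0.1])
print(softmax(W @ x + b_vec))          # class probabilities summing to 1
```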
A model is called linear because it is linear in the parameters w, not necessarily in the raw inputs. A regression on [x, x^2, x^3] is still a linear model: the design matrix has been expanded with polynomial features, but the prediction remains a linear combination of those features. This distinction matters because all the estimation theory and software for linear models continues to apply once you fix the feature transformation.
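A small scikit-learn sketch of this idea (synthetic cubic data; the constants are illustrative): the pipeline expands the single raw input into polynomial features and then fits an ordinary linear regression on the expanded design matrix.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(seed=1)
x = rng.uniform(-2, 2, size=(300, 1))
y = 1.0 - 2.0 * x[:, 0] + 0.5 * x[:, 0] ** 3 + rng.normal(scale=0.3, size=300)

# Expand the design matrix to [x, x^2, x^3]; the model is still linear in its weights.
cubic = make_pipeline(PolynomialFeatures(degree=3, include_bias=False), LinearRegression())
cubic.fit(x, y)
print(cubic.named_steps["linearregression"].coef_)   # roughly [-2.0, 0.0, 0.5]
```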
The table below summarises the main members of the family, the type of target each one handles, the loss or likelihood that defines the fit, and the closest scikit-learn implementation.
| Model | Target | Link / loss | Closed form | Typical implementation |
|---|---|---|---|---|
| Ordinary least squares (OLS) | Continuous | Squared loss, identity link | Yes | scikit-learn LinearRegression, statsmodels OLS, R lm() |
| Ridge regression | Continuous | Squared loss + L2 penalty | Yes | scikit-learn Ridge, glmnet |
| Lasso regression | Continuous | Squared loss + L1 penalty | No (coordinate descent) | scikit-learn Lasso, glmnet |
| Elastic net | Continuous | Squared loss + L1 + L2 | No | scikit-learn ElasticNet, glmnet |
| Logistic regression | Binary or multiclass | Cross-entropy, logit link | No (IRLS) | scikit-learn LogisticRegression, statsmodels Logit |
| Poisson regression | Counts | Negative log-likelihood, log link | No (IRLS) | scikit-learn PoissonRegressor, statsmodels GLM(family=Poisson) |
| Gamma regression | Positive continuous | Negative log-likelihood, log or inverse link | No (IRLS) | scikit-learn GammaRegressor, statsmodels GLM(family=Gamma) |
| Linear SVM | Binary or multiclass | Hinge loss + L2 | No (subgradient or LIBLINEAR) | scikit-learn LinearSVC |
| Linear discriminant analysis | Multiclass | Gaussian class-conditional | Yes | scikit-learn LinearDiscriminantAnalysis |
| Perceptron | Binary | Perceptron update rule | No (online) | scikit-learn Perceptron |
| Bayesian linear regression | Continuous | Posterior over weights | Closed for Gaussian prior | scikit-learn BayesianRidge, PyMC, Stan |
All of these models share the same prediction formula f(x) = g⁻¹(w · x + b) for some link g. They differ in the loss they minimise, the prior or penalty they impose on w, and how the optimisation is solved.
The canonical linear model is linear regression fit by ordinary least squares (OLS). Given a design matrix X of shape n × p and a response vector y, OLS chooses the coefficient vector β that minimises the residual sum of squares
RSS(β) = ||y - Xβ||²
When X'X is invertible, this minimisation has the famous closed-form solution
β̂ = (X'X)⁻¹ X'y
The normal equations X'X β = X'y were already known to Legendre and Gauss in the early 1800s. Under the Gauss-Markov assumptions (linearity in parameters, zero conditional mean of errors, homoscedasticity, no perfect multicollinearity, and uncorrelated errors), the OLS estimator is the best linear unbiased estimator (BLUE), meaning it has the smallest variance among all linear unbiased estimators of β. The errors do not need to be Gaussian for the BLUE property to hold; normality is only needed for exact small-sample inference (t-tests, F-tests, confidence intervals).
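The closed form can be verified directly in NumPy. The sketch below (synthetic data, illustrative constants) solves the normal equations and cross-checks the result against a QR/SVD-based least-squares routine, which is the numerically preferred route in practice:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n, p = 200, 3
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Add a column of ones so the intercept is estimated as one more coefficient.
Xd = np.column_stack([np.ones(n), X])

# Solve the normal equations X'X beta = X'y directly ...
beta_normal = np.linalg.solve(Xd.T @ Xd, Xd.T @ y)

# ... or, more stably, via a least-squares routine that avoids forming X'X.
beta_lstsq, *_ = np.linalg.lstsq(Xd, y, rcond=None)

print(beta_normal)    # roughly [0.0, 2.0, -1.0, 0.5]
print(beta_lstsq)     # agrees with the normal-equations solution
```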
For responses that are not continuous and Gaussian, the natural extension is the generalized linear model (GLM), introduced by John Nelder and Robert Wedderburn in their 1972 paper "Generalized Linear Models" in the Journal of the Royal Statistical Society, Series A. A GLM combines three ingredients:
- a response distribution for y taken from the exponential family (Gaussian, Bernoulli, Poisson, Gamma, and so on);
- a linear predictor η = Xβ;
- a link function g such that g(E[y]) = η, or equivalently E[y] = g⁻¹(Xβ).

Nelder and Wedderburn showed that maximum likelihood estimates for this whole class can be obtained by iteratively reweighted least squares (IRLS), a procedure that fits a sequence of weighted OLS problems with weights and working responses updated at each step until convergence. IRLS remains the standard estimation method in glm() in R and in statsmodels GLM in Python.
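A bare-bones IRLS sketch for the logistic-regression case is shown below. It is only an illustration of the update, not a production implementation: it omits the safeguards that glm() and statsmodels apply for non-convergence and perfect separation.

```python
import numpy as np

def irls_logistic(X, y, n_iter=25, tol=1e-8):
    """Minimal IRLS (equivalently Newton-Raphson) for unpenalized logistic regression."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta                        # linear predictor
        p = 1.0 / (1.0 + np.exp(-eta))        # mean through the inverse logit link
        w = p * (1.0 - p)                     # IRLS weights
        z = eta + (y - p) / w                 # working response
        beta_new = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

rng = np.random.default_rng(seed=0)
X = np.column_stack([np.ones(500), rng.normal(size=(500, 2))])   # intercept + 2 features
p_true = 1.0 / (1.0 + np.exp(-(X @ np.array([0.5, 1.0, -2.0]))))
y = rng.binomial(1, p_true)
print(irls_logistic(X, y))    # roughly [0.5, 1.0, -2.0]
```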
The table below summarises the most common GLMs and the response distribution and link they use.
| GLM | Response | Distribution | Canonical or typical link |
|---|---|---|---|
| Linear regression | Continuous | Gaussian | Identity |
| Logistic regression | Binary | Bernoulli | Logit |
| Probit regression | Binary | Bernoulli | Probit (inverse of the standard normal CDF) |
| Multinomial logistic | Categorical | Multinomial | Multinomial logit |
| Poisson regression | Counts | Poisson | Log |
| Negative binomial regression | Overdispersed counts | Negative binomial | Log |
| Gamma regression | Positive continuous | Gamma | Inverse or log |
| Beta regression | Proportions in (0,1) | Beta | Logit |
Logistic regression is by far the most widely used GLM in machine learning. It models the log-odds of the positive class as a linear function of the inputs and is the default classifier when interpretability or calibrated probability scores matter.
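A minimal scikit-learn example on synthetic data is shown below; note that LogisticRegression applies an L2 penalty by default, so the recovered coefficients are slightly shrunk toward zero.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(seed=0)
X = rng.normal(size=(500, 2))
# True log-odds are linear in the features: 0.5 + 2*x1 - x2.
p = 1.0 / (1.0 + np.exp(-(0.5 + 2 * X[:, 0] - X[:, 1])))
y = rng.binomial(1, p)

clf = LogisticRegression().fit(X, y)
print(clf.intercept_, clf.coef_)      # roughly 0.5 and (2.0, -1.0)
print(clf.predict_proba(X[:3]))       # class probabilities for the first three rows
```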
Linear models can be fit by several different procedures depending on the loss function, the size of the dataset, and the regularization in use.
| Method | When used | Typical solver |
|---|---|---|
| Closed-form normal equations | OLS, ridge, small-to-medium p | Cholesky, QR, SVD decomposition |
| Maximum likelihood estimation | All GLMs, logistic regression | IRLS, Newton-Raphson |
| Coordinate descent | Lasso, elastic net | glmnet algorithm |
| Stochastic gradient descent | Very large n or streaming data | scikit-learn SGDRegressor, SGDClassifier |
| Quasi-Newton (L-BFGS) | Logistic regression with smooth penalty | scikit-learn LogisticRegression(solver='lbfgs') |
| Markov chain Monte Carlo | Bayesian linear models | Stan, PyMC, NumPyro |
For most modern problems with n in the tens of thousands and p in the hundreds, the closed-form OLS solution or a single L-BFGS run on the regularized log-likelihood completes in fractions of a second. The convexity of the underlying objective is the key reason fitting a linear model is so reliable: any local optimum is a global optimum, so the choice of solver only affects speed, not the final answer.
When the number of features is comparable to or larger than the number of observations, OLS becomes unstable and overfits. Regularization addresses this by adding a penalty on the size of the coefficients.
Ridge regression, introduced by Arthur Hoerl and Robert Kennard in their 1970 paper "Ridge Regression: Biased Estimation for Nonorthogonal Problems" in Technometrics, adds an L2 penalty λ||β||² to the residual sum of squares. The estimator becomes β̂ = (X'X + λI)⁻¹ X'y, which is always well defined because X'X + λI is invertible for any λ > 0. Ridge shrinks all coefficients toward zero but never sets them exactly to zero, and it is particularly useful when predictors are highly correlated.
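The ridge estimator is short enough to write out directly. The sketch below assumes the features and response have been centred so the intercept can be ignored, mirroring the way scikit-learn's Ridge leaves the intercept unpenalized:

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    # (X'X + lam*I)^-1 X'y; assumes X and y are centred so no intercept is needed.
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(seed=0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, -1.0, 0.0, 0.0, 0.5]) + rng.normal(scale=0.5, size=100)
Xc, yc = X - X.mean(axis=0), y - y.mean()
for lam in (0.0, 1.0, 100.0):
    print(lam, ridge_closed_form(Xc, yc, lam).round(2))   # coefficients shrink as lam grows
```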
The lasso, introduced by Robert Tibshirani in his 1996 paper "Regression Shrinkage and Selection via the Lasso" in the Journal of the Royal Statistical Society, Series B, replaces the L2 penalty with an L1 penalty λ∑|β_j|. Because the L1 ball has corners at the axes, the solution often sets many coefficients exactly to zero. This makes the lasso a feature-selection method as well as a regularizer, which is invaluable in high-dimensional problems such as gene expression analysis where most predictors are expected to be irrelevant.
The elastic net, proposed by Hui Zou and Trevor Hastie in 2005, blends the L1 and L2 penalties: λ_1∑|β_j| + λ_2||β||². The L1 part still produces sparsity, but the L2 part stabilises the solution when groups of predictors are highly correlated. The lasso alone tends to pick one predictor at random from a correlated group; the elastic net tends to keep all members of the group, with smaller weights.
| Penalty | Formula | Effect on coefficients | Selects features |
|---|---|---|---|
| L2 (ridge) | λ∑β_j² | Shrinks toward zero, rarely exactly zero | No |
| L1 (lasso) | λ∑\|β_j\| | Shrinks and sets many coefficients exactly to zero | Yes |
| L1 + L2 (elastic net) | λ_1∑\|β_j\| + λ_2∑β_j² | Sparse like the lasso, but keeps groups of correlated features | Yes |
| L0 | λ·card(β ≠ 0) | Best subset selection (NP-hard) | Yes |
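The sparsity behaviour summarised in the table can be seen directly in a small experiment (synthetic data; the alpha values are illustrative, not tuned):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(seed=0)
n, p = 100, 50
X = rng.normal(size=(n, p))
# Only the first three features matter; the remaining 47 are pure noise.
y = 3 * X[:, 0] - 2 * X[:, 1] + X[:, 2] + rng.normal(scale=0.5, size=n)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)
print("nonzero ridge coefficients:", np.sum(ridge.coef_ != 0))   # typically all 50
print("nonzero lasso coefficients:", np.sum(lasso.coef_ != 0))   # typically only a handful
```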
Several important models live outside the GLM framework but are still linear in their feature representation.
Linear support vector machines. A linear SVM replaces the cross-entropy loss with the hinge loss max(0, 1 - y·(w·x + b)) and adds an L2 penalty. The resulting decision boundary is still a hyperplane, but the loss is designed to maximise the margin between classes rather than to estimate class probabilities. LIBLINEAR is the standard implementation for large sparse problems and is what scikit-learn uses behind LinearSVC.
Perceptron. Frank Rosenblatt's perceptron, introduced in his 1958 Psychological Review paper "The perceptron: a probabilistic model for information storage and organization in the brain," was the first explicitly online linear classifier. Its update rule simply adds the misclassified input to the weight vector. The perceptron converges in a finite number of steps if and only if the data is linearly separable, a result later proved by Block and by Novikoff.
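The update rule fits in a few lines. A minimal sketch (labels assumed to be coded as -1 and +1; the mistake-driven loop terminates only when the data is linearly separable or the epoch budget runs out):

```python
import numpy as np

def perceptron_train(X, y, n_epochs=100):
    """Rosenblatt-style perceptron: on each mistake, add y_i * x_i to the weights."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:       # misclassified (or on the boundary)
                w += yi * xi
                b += yi
                mistakes += 1
        if mistakes == 0:                    # converged: (w, b) separates the data
            break
    return w, b

rng = np.random.default_rng(seed=0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)   # linearly separable labels in {-1, +1}
w, b = perceptron_train(X, y)
```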
Linear discriminant analysis. Ronald Fisher's LDA, introduced in his 1936 paper "The use of multiple measurements in taxonomic problems" in the Annals of Eugenics, finds the linear projection that maximises the ratio of between-class variance to within-class variance. Fisher illustrated the method on the iris dataset, which has remained a benchmark for classification ever since. LDA assumes the class-conditional distributions are Gaussian with a shared covariance matrix; under that assumption, the Bayes-optimal classifier is itself a linear function of the features.
Linear mixed models. When observations are grouped (students within schools, repeated measurements on the same patient), classical OLS underestimates standard errors because residuals are correlated within groups. Linear mixed models add random effects for groups in addition to the fixed-effect coefficients, giving valid inference for hierarchical and longitudinal data. They are central to biostatistics and educational measurement.
Bayesian linear regression. Putting a Gaussian prior on β and updating with Gaussian-distributed observations yields a Gaussian posterior over β in closed form. The posterior mean coincides with the ridge estimate, while the posterior covariance gives full uncertainty quantification. Bayesian linear regression is the workhorse of probabilistic numerics and provides the foundation for Gaussian-process regression.
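The conjugate update is short enough to write out. The sketch below assumes a zero-mean isotropic Gaussian prior with variance tau2 and a known noise variance sigma2, in which case the posterior mean reproduces the ridge estimate with λ = sigma2 / tau2:

```python
import numpy as np

def bayes_linear_posterior(X, y, sigma2=0.25, tau2=1.0):
    # Prior beta ~ N(0, tau2*I), likelihood y | X, beta ~ N(X beta, sigma2*I).
    p = X.shape[1]
    precision = X.T @ X / sigma2 + np.eye(p) / tau2   # posterior precision
    cov = np.linalg.inv(precision)                    # posterior covariance
    mean = cov @ (X.T @ y) / sigma2                   # posterior mean = ridge estimate
    return mean, cov

rng = np.random.default_rng(seed=0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=200)
mean, cov = bayes_linear_posterior(X, y)
print(mean)                       # close to [2.0, -1.0, 0.5]
print(np.sqrt(np.diag(cov)))      # posterior standard deviation of each weight
```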
Linear models occupy a privileged position in machine learning for several practical reasons.
| Strength | Why it matters |
|---|---|
| Interpretability | Each coefficient gives the effect of a one-unit change in its feature, holding others fixed |
| Speed | Closed-form OLS or a single L-BFGS run finishes in milliseconds for typical problems |
| Convex objective | Any local optimum is the global optimum, so optimisation is reliable |
| Statistical inference | Standard errors, p-values, confidence intervals all available with mature theory |
| Strong baseline | Often matches or beats neural networks when training data is small |
| Few hyperparameters | Usually only the regularization strength needs tuning |
| Calibration | Probabilities from logistic regression are well calibrated by default |
| Regulatory acceptance | Required or strongly preferred in credit scoring, insurance pricing, and clinical risk models |
The restrictions that make linear models simple also limit what they can express.
| Weakness | Mitigation |
|---|---|
| Cannot capture nonlinear relationships from raw features | Add polynomial or interaction features, use splines, switch to a kernel method or tree model |
| Sensitive to outliers under squared loss | Use robust losses (Huber, quantile) or trim observations |
| Assumes uncorrelated residuals | Use mixed models or generalized estimating equations for grouped data |
| Multicollinearity inflates variance | Use ridge or remove redundant features |
| Cannot model feature interactions automatically | Engineer cross-features or use a model that handles them |
| Assumes constant error variance | Use weighted least squares or transform the response |
The following assumptions underpin the BLUE property of OLS and the validity of the textbook standard errors. Diagnostics for each one form the backbone of any applied regression analysis.
| Assumption | Description | Diagnostic |
|---|---|---|
| Linearity | The conditional mean of y is a linear function of X | Residual-versus-fitted plot |
| Independence of errors | Residuals are uncorrelated across observations | Durbin-Watson test, residual autocorrelation plot |
| Homoscedasticity | Errors have constant variance | Residual-versus-fitted plot, Breusch-Pagan test |
| Normality of errors | Errors are normally distributed (needed for inference, not for BLUE) | Q-Q plot, Shapiro-Wilk test |
| No perfect multicollinearity | Columns of X are linearly independent | Variance inflation factor (VIF) |
| Exogenous regressors | Errors uncorrelated with regressors | Hausman test, instrumental variable methods |
| No influential outliers | No single observation dominates the fit | Leverage, Cook's distance, DFBETAS |
When these assumptions are violated, OLS may still be unbiased but its standard errors and p-values become unreliable. The standard remedies are robust standard errors (Huber-White sandwich), generalized least squares for known correlation structure, or moving to a GLM with the appropriate distributional assumption.
The practical decision of whether to reach for a linear model can be summarised as follows.
| Situation | Linear model is a good fit | Better alternative |
|---|---|---|
| Small dataset, few features | Yes, almost always | None usually needed |
| High-dimensional sparse data (text, genomics) | Yes, with L1 regularization (lasso or elastic net) | None usually needed |
| Need interpretable coefficients for stakeholders | Yes, especially logistic regression | Generalized additive models |
| Strict regulatory environment (credit, insurance) | Yes, often required | None permitted |
| Probabilistic forecasting with calibration | Logistic or Poisson regression | Gradient-boosted models with calibration layer |
| Strong nonlinear interactions in raw features | No, unless you engineer them | Random forests, gradient boosting, neural networks |
| Image, audio, or sequence input | Only after deep feature extraction | Convolutional or transformer models |
| Causal inference from observational data | Yes, with care over confounders | Doubly robust or instrumental-variable estimators |
A reasonable workflow in any new project is to fit a linear model first as a baseline, then a tree-based model, and only move to deep learning if both fail and the data and budget justify it.
Linear models continue to dominate large parts of applied statistics and applied machine learning even as deep learning grabs the headlines.
Econometrics. OLS, two-stage least squares, and generalized method of moments remain the default tools for causal inference in economics. Almost every paper in top economics journals contains at least one linear regression table, and the entire treatment-effects literature builds on linear projections.
Healthcare and biostatistics. Logistic regression is the standard for clinical risk scoring (Framingham, MELD, CHADS₂, qSOFA, APACHE) because regulators, clinicians, and journals all expect a model whose coefficients can be read from a table on paper. Cox proportional hazards is the GLM-style workhorse of survival analysis.
Credit scoring. Logistic regression remains the benchmark in the credit risk industry because the lack of interpretability of ensemble methods is incompatible with the requirements of financial regulators. The coefficients of a credit-scoring logistic regression must be defensible to a model risk officer and to a banking supervisor, and a linear model is the easiest object to defend.
Recommendation systems. Logistic regression is still the standard baseline for click-through-rate prediction in advertising and recommendation. Google's Wide and Deep architecture, introduced by Cheng and colleagues in 2016, jointly trains a wide linear model on cross-product features with a deep neural network on dense embeddings, capturing memorisation and generalisation in a single model. The wide component is exactly a logistic regression.
Linear probes for representation learning. Guillaume Alain and Yoshua Bengio's 2017 paper "Understanding intermediate layers using linear classifier probes" introduced the use of linear classifiers trained on frozen intermediate features as a way to measure how much task-relevant information each layer has accumulated. Linear probing is now a standard evaluation protocol for self-supervised learning, including SimCLR, MoCo, MAE, and the major large-language-model evaluation suites.
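A linear probe is nothing more than a logistic regression fit on frozen features. The sketch below uses randomly generated placeholder features and labels purely to show the protocol; in practice the feature matrix would be cached activations from one layer of a pretrained network.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholder frozen features standing in for intermediate-layer activations.
rng = np.random.default_rng(seed=0)
features = rng.normal(size=(1000, 128))
labels = (features[:, :4].sum(axis=1) > 0).astype(int)   # placeholder task labels

X_tr, X_te, y_tr, y_te = train_test_split(features, labels, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("probe accuracy:", probe.score(X_te, y_te))   # higher = more linearly decodable information
```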
Connection to neural networks. Every fully connected layer in a neural network is a linear transformation followed by a nonlinearity. The final classification layer is almost always a softmax linear classifier, so the very last decision in a billion-parameter model is made by a linear model on top of learned features. The kernel trick exploited by kernel SVMs and Gaussian processes is the same idea in reverse: stay linear in a lifted feature space and let the kernel handle the nonlinearity.
Linear models are available in essentially every numerical computing environment. The most widely used ecosystems are listed below.
| Library | Language | Notable estimators |
|---|---|---|
| scikit-learn | Python | LinearRegression, LogisticRegression, Ridge, Lasso, ElasticNet, BayesianRidge, SGDRegressor, SGDClassifier, PoissonRegressor, GammaRegressor, LinearDiscriminantAnalysis, LinearSVC, Perceptron |
| statsmodels | Python | OLS, WLS, GLS, Logit, Probit, Poisson, full GLM family with statistical inference, mixed models, robust regression |
| R base | R | lm(), glm() |
| glmnet | R, Python | Lasso, ridge, elastic net for OLS, logistic, Poisson, Cox, multinomial |
| MASS | R | lda(), qda(), robust regression |
| Stan, PyMC, NumPyro | Python | Bayesian linear and generalized linear models |
| MLlib | Spark, Scala | Distributed linear and logistic regression for big data |
| LIBLINEAR | C++ | Large-scale linear classification and regression |
| Vowpal Wabbit | C++ | Online linear models for very large datasets |
For classical statistical inference (confidence intervals, p-values, model summaries), statsmodels in Python and base R remain the default choices. For predictive modelling, scikit-learn and glmnet are the standards.
The following snippet fits a simple OLS regression on a synthetic dataset and inspects the resulting coefficients.
```python
import numpy as np
from sklearn.linear_model import LinearRegression

# y = 2 + 3*x1 - 1*x2 + noise
rng = np.random.default_rng(seed=0)
X = rng.normal(size=(200, 2))
y = 2 + 3 * X[:, 0] - 1 * X[:, 1] + rng.normal(scale=0.5, size=200)

model = LinearRegression().fit(X, y)
print("intercept:", model.intercept_)
print("coefficients:", model.coef_)
print("R^2:", model.score(X, y))
```
The estimated intercept will be close to 2.0, the coefficients close to (3.0, -1.0), and the in-sample R-squared close to 0.97. Switching to ridge or lasso requires only changing the imported class:
```python
from sklearn.linear_model import Ridge, Lasso, ElasticNet

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
```
For a binary classification problem, the equivalent line is LogisticRegression().fit(X, y). The uniform interface across all of these estimators is one of the reasons linear models are so productive in practice: changing the loss, the link, or the penalty rarely requires touching more than one line of code.