# Linear

> Source: https://aiwiki.ai/wiki/linear
> Updated: 2026-07-11
> Categories: Machine Learning, Mathematics
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

*See also: [Machine learning terms](/wiki/machine_learning_terms)*

In machine learning and mathematics, **linear** describes a function or relationship in which the output is built from the inputs only through addition and multiplication by constants. Google's Machine Learning Glossary defines a [linear](/wiki/linear_model) relationship as "a relationship between two or more variables that can be represented solely through addition and multiplication," and notes that the plot of a linear relationship is a line. [1] Linearity is the simplest and best understood class of relationships in modeling, and it underlies a large fraction of classical statistics, signal processing, and modern machine learning.

A model is called linear when its prediction can be written as a weighted sum of features, optionally plus a constant. The same word also describes operations such as expectation, differentiation, integration, and matrix multiplication, all of which respect the same two algebraic rules. [2] Each fully connected or convolutional layer of a deep network is itself linear (strictly, affine) before its [activation function](/wiki/activation_function) is applied, and because linear maps compose to other linear maps, a stack of such layers with no activations would collapse into a single linear transform. The deliberate insertion of [nonlinear](/wiki/nonlinear) activations is what prevents that collapse, and a classic theorem makes that necessity precise: under mild regularity conditions on the activation, a feedforward network is a universal approximator if and only if its activation function is not a polynomial. [12]

## What is a linear function? (formal definition)

A function f from one vector space to another is linear if it satisfies two properties for all vectors x, y and all scalars α, β: [2]

1. **Additivity:** $$f(x + y) = f(x) + f(y)$$.
2. **Homogeneity of degree 1:** $$f(\alpha x) = \alpha f(x)$$.

These two conditions can be combined into a single statement, sometimes called the superposition principle: [9][10]

$$f(\alpha x + \beta y) = \alpha f(x) + \beta f(y).$$

In coordinates, every scalar valued linear function on a finite dimensional space can be written as a dot product with a fixed weight vector w:

$$f(x) = w \cdot x = \sum_{i=1}^{d} w_i x_i.$$

The matrix form generalizes this to vector valued outputs, $$f(x) = W x$$, where W is a fixed matrix. Geometrically the level set $$\{x : w \cdot x = c\}$$ is a hyperplane, a flat (d minus 1) dimensional subset of d dimensional feature space. [7]
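The superposition identity can be checked numerically for the matrix form $$f(x) = W x$$. The following is a minimal sketch, assuming NumPy; the matrix shape, the random seed, and the scalars are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))  # fixed matrix: maps R^4 to R^3 (illustrative shape)

def f(x):
    """Linear map f(x) = W x."""
    return W @ x

x, y = rng.normal(size=4), rng.normal(size=4)
alpha, beta = 2.5, -1.0  # arbitrary scalars

# Superposition: f(alpha x + beta y) == alpha f(x) + beta f(y)
lhs = f(alpha * x + beta * y)
rhs = alpha * f(x) + beta * f(y)
superposition_holds = np.allclose(lhs, rhs)

# A linear map always sends the zero vector to the zero vector.
maps_zero_to_zero = np.allclose(f(np.zeros(4)), 0.0)

print(superposition_holds, maps_zero_to_zero)
```

The comparison uses `np.allclose` rather than exact equality because the two sides are computed with different floating point operations.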

## What is the difference between linear and affine?

In everyday machine learning usage the word "linear" is loose. A function of the form $$g(x) = w \cdot x + b$$, with a non zero bias b, is technically affine rather than linear, because $$g(0) = b$$ is not 0 and homogeneity fails. [8] An affine function is a linear function composed with a translation. Most practical models, including [linear regression](/wiki/linear_regression), [logistic regression](/wiki/logistic_regression), and a [perceptron](/wiki/perceptron), are affine in this strict sense, since they include a bias or intercept term. The community calls them linear anyway, partly out of convenience and partly because the bias can always be folded into the weight vector by appending a constant 1 to the feature vector.

The distinction matters in two places. First, in linear algebra theorems (kernels, image, dimension counting) the bias has to be handled explicitly. Second, when a hyperplane is the decision boundary of a linear classifier, the bias is what lets that hyperplane miss the origin. [7]
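The failure of homogeneity and the bias folding trick can both be seen in a few lines. This is a small sketch assuming NumPy; the weights, bias, and input are made-up values.

```python
import numpy as np

w = np.array([2.0, -1.0])  # illustrative weights
b = 0.5                    # nonzero bias

def g(x):
    """Affine function g(x) = w . x + b."""
    return w @ x + b

x = np.array([1.0, 3.0])

# Homogeneity fails when b != 0: g(2x) differs from 2 g(x).
homogeneous = np.isclose(g(2 * x), 2 * g(x))

# Folding the bias: append b to the weights and a constant 1 to the features.
w_aug = np.append(w, b)
x_aug = np.append(x, 1.0)
folded_matches = np.isclose(w_aug @ x_aug, g(x))

print(homogeneous, folded_matches)
```

After folding, the augmented map is a pure dot product, so it is linear in the augmented feature vector even though g is only affine in x.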

## What properties do linear functions have?

| Property | Statement | Why it matters in ML |
|---|---|---|
| Additivity | $$f(x + y) = f(x) + f(y)$$ | Effects of features add up; no interactions by default |
| Homogeneity | $$f(\alpha x) = \alpha f(x)$$ | Doubling an input doubles the output |
| Superposition | $$f(\alpha x + \beta y) = \alpha f(x) + \beta f(y)$$ | Lets us decompose problems into per feature contributions |
| Closure under composition | $$f \circ g$$ is linear if f and g are | Stacked linear layers collapse into one linear layer |
| Convexity of the squared loss | $$(y - w \cdot x)^2$$ is convex in w | Every local optimum of linear regression is global, and it is unique when X has full column rank |
| Closed form solution | $$w = (X^\top X)^{-1} X^\top y$$ | No iterative training needed for ordinary least squares when $$X^\top X$$ is invertible |

The closed form solution above, ordinary least squares, is one of the oldest tools in statistics. The method of least squares was first published by Adrien-Marie Legendre in 1805, as an appendix to his work on the orbits of comets, and Carl Friedrich Gauss published his own treatment in 1809 while claiming to have used the method since 1795. [3][19] More than two centuries later, fitting a linear model to data still reduces to the same normal equations.
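The normal equations from the table can be sketched as follows, assuming NumPy and synthetic data; the true weights and noise level are invented for the example. The sketch solves the linear system $$X^\top X w = X^\top y$$ instead of forming the inverse explicitly, which is usually the numerically safer choice.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))                 # synthetic design matrix
w_true = np.array([1.5, -2.0, 0.5])         # made-up ground truth
y = X @ w_true + 0.01 * rng.normal(size=n)  # small Gaussian noise

# Normal equations: solve (X^T X) w = X^T y.
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's least squares routine.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w_normal)
```

Both routes should agree to floating point precision here because the synthetic X has full column rank.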

## Does "linear" mean linear in the parameters or linear in the features?

A persistent source of confusion is what the "linear" in linear regression refers to. The model $$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 \sin(x)$$ is still a linear model. It is linear in the parameters $$\beta$$, even though it is not linear in the original feature x. [3] The defining test is whether the model can be written as $$y = \sum_i \beta_i \phi_i(x)$$ for some fixed feature transforms $$\phi_i$$, with the $$\beta_i$$ appearing only to the first power and not multiplied or divided by other parameters. [18]

This loophole is what makes classical linear models surprisingly flexible. By engineering nonlinear features (polynomials, splines, interaction terms, radial basis functions) and then fitting them with a linear estimator, a practitioner gets nonlinear behavior with linear training. The same trick is the basis of the [kernel trick](/wiki/kernel_trick): a kernel implicitly maps inputs into a high dimensional feature space where a linear model fits, without ever computing the features explicitly. By contrast, a model such as $$y = \beta_1 x^{\beta_2}$$ is genuinely nonlinear because β2 appears as an exponent, and it requires nonlinear optimization to fit. [18]
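As an illustration of "nonlinear features, linear training", the model $$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 \sin(x)$$ can be fitted with an ordinary least squares solver once the feature transforms are stacked into a design matrix. This is a sketch assuming NumPy, with made-up coefficients and noiseless data so the fit can recover them exactly.

```python
import numpy as np

x = np.linspace(-3, 3, 50)
beta_true = np.array([1.0, 0.5, -0.3, 2.0])  # made-up coefficients

def design(x):
    """Fixed feature transforms phi_i(x): 1, x, x^2, sin(x)."""
    return np.column_stack([np.ones_like(x), x, x**2, np.sin(x)])

y = design(x) @ beta_true  # noiseless targets, nonlinear in x

# The fit is a linear least squares problem in beta.
beta_hat, *_ = np.linalg.lstsq(design(x), y, rcond=None)

print(beta_hat)
```

The curve is not a straight line in x, yet no nonlinear optimizer is involved, because the unknowns enter only through the product of the design matrix with the coefficient vector.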

## What are the common linear methods in machine learning?

| Method | Task | What is linear |
|---|---|---|
| [Linear regression](/wiki/linear_regression) | Regression | Mean of y is a linear function of x |
| Ridge regression | Regression | Linear in parameters with L2 penalty |
| Lasso regression | Regression | Linear in parameters with L1 penalty |
| Elastic net | Regression | Linear in parameters with mixed L1 and L2 penalty |
| [Logistic regression](/wiki/logistic_regression) | Binary classification | Log odds are linear in x |
| Softmax regression | Multiclass classification | Logits are linear in x |
| Linear [SVM](/wiki/support_vector_machine_svm) | Classification | Decision boundary is a hyperplane |
| [Perceptron](/wiki/perceptron) | Binary classification | Decision rule is $$\mathrm{sign}(w \cdot x + b)$$ |
| Linear discriminant analysis | Classification, dimensionality reduction | Projects onto a linear subspace that separates classes [5] |
| [Principal component analysis](/wiki/principal_component_analysis) | Dimensionality reduction | Orthogonal linear projection that maximizes variance [6] |
| Generalized linear models | Regression with non Gaussian targets | Linear predictor passed through a link function |

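Most of these fit only one weight per feature (per class, for multiclass models), and the regression and classification objectives above, including least squares, the penalized regressions, logistic and softmax regression, and the linear SVM, are convex optimization problems. [14][17] Convexity guarantees that every local optimum is a global optimum, which is one reason linear methods are the workhorse of statistics. PCA and linear discriminant analysis are instead solved through eigenvalue decompositions, which also yield a global solution.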

## What are generalized linear models (GLMs)?

Generalized linear models, or GLMs, extend ordinary linear regression to response variables that are not Gaussian. A GLM has three pieces: a linear predictor $$\eta = w \cdot x + b$$, a probability distribution from the exponential family for the response, and a link function g that ties them together by $$g(\mathbb{E}[y]) = \eta$$. [4] Linear regression itself is a GLM with a normal response and an identity link. [Logistic regression](/wiki/logistic_regression) is a GLM with a Bernoulli response and a logit link. Poisson regression, used for count data, takes a Poisson response and a log link. [4] The linearity in the linear predictor is what gives GLMs interpretable coefficients and well behaved likelihoods.
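The three GLMs named above share one linear predictor and differ only in how it is mapped to the mean of the response. The following sketch assumes NumPy; the weights, bias, and input are hypothetical values chosen for illustration, and no fitting is performed.

```python
import numpy as np

def logit(mu):
    """Logit link: g(mu) = log(mu / (1 - mu))."""
    return np.log(mu / (1.0 - mu))

def sigmoid(eta):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + np.exp(-eta))

w, b = np.array([0.8, -0.4]), 0.1  # hypothetical parameters
x = np.array([1.0, 2.0])
eta = w @ x + b                    # linear predictor, shared by all three models

mean_gaussian = eta            # identity link (linear regression)
mean_bernoulli = sigmoid(eta)  # logit link (logistic regression)
mean_poisson = np.exp(eta)     # log link (Poisson regression)

print(mean_gaussian, mean_bernoulli, mean_poisson)
```

Applying each link to its mean returns the same value of the linear predictor, which is the relation $$g(\mathbb{E}[y]) = \eta$$ in concrete form.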

## How does linearity appear in calculus and probability?

Linearity is not just a property of models; it is a property of many of the operations that machine learning relies on.

- **Linearity of expectation.** For any random variables X and Y, $$\mathbb{E}[X + Y] = \mathbb{E}[X] + \mathbb{E}[Y]$$, and $$\mathbb{E}[\alpha X] = \alpha \mathbb{E}[X]$$. This holds whether or not X and Y are independent, which makes it one of the most useful tools in probability. [11] It justifies tricks like decomposing a complicated estimator into a sum of simpler terms whose expectations are easy to compute.
- **Linearity of differentiation.** The derivative of a sum is the sum of the derivatives, and constants pull out of the derivative. Together with the chain rule, this is the algebraic backbone of backpropagation: the gradient of a loss summed over examples is the sum of the per example gradients, and gradients that reach a unit along several paths are added.
- **Linearity of integration.** Integration distributes over sums and pulls constants out. Most analytical expectations and continuous loss functions exploit this.
- **Linearity of matrix multiplication.** Matrix multiplication is linear in each of its arguments, which is why a fully connected layer W x can be analyzed with eigenvalues, singular values, and rank. [2]

These facts are routine in textbooks, but they are also what allow large parts of machine learning theory (PAC bounds, bias variance decompositions, convergence proofs) to work at all.
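Linearity of expectation for dependent variables can be verified exactly on a small example. The sketch below uses a fair six sided die with X the roll and Y its square, an illustrative choice in which Y is completely determined by X, and uses Python's `fractions` module so the comparison is exact rather than approximate.

```python
from fractions import Fraction

outcomes = range(1, 7)  # faces of a fair die
p = Fraction(1, 6)      # probability of each face

def expect(h):
    """Exact expectation of h(roll) under the uniform distribution."""
    return sum(p * h(k) for k in outcomes)

E_X = expect(lambda k: k)            # X = roll
E_Y = expect(lambda k: k * k)        # Y = roll squared, fully dependent on X
E_sum = expect(lambda k: k + k * k)  # E[X + Y]
E_prod = expect(lambda k: k * k * k) # E[X Y]

print(E_sum == E_X + E_Y)   # additivity holds despite dependence
print(E_prod == E_X * E_Y)  # the product rule does not
```

The contrast between the two printed comparisons is the point: sums of expectations need no independence assumption, while products generally do.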

## Why do neural networks need nonlinearity?

A modern [neural network](/wiki/neural_network) is not linear. Each layer, however, contains a linear piece. A fully connected layer computes a pre activation z = W x + b, which is affine in x. If the activation function applied to z is also linear, then the entire network reduces to one big affine map, because the composition of affine maps is itself affine. The presence of a nonlinear [activation function](/wiki/activation_function) such as ReLU, tanh, or the GELU used in transformers is what stops layers from collapsing and gives the network its representational power.
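The collapse argument can be made concrete with two layers. This is a sketch assuming NumPy and random weights of arbitrary size. With an identity activation the two layers equal one affine map with $$W = W_2 W_1$$ and $$b = W_2 b_1 + b_2$$; with ReLU the network violates the midpoint property that every affine map satisfies.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(5, 3)), rng.normal(size=5)  # illustrative sizes
W2, b2 = rng.normal(size=(2, 5)), rng.normal(size=2)

def two_layer(x, act):
    """Two fully connected layers with activation `act` in between."""
    return W2 @ act(W1 @ x + b1) + b2

identity = lambda z: z
relu = lambda z: np.maximum(z, 0.0)

# With a linear activation, the stack collapses to a single affine map.
W, b = W2 @ W1, W2 @ b1 + b2
x = 10.0 * rng.normal(size=3)
collapses = np.allclose(two_layer(x, identity), W @ x + b)

def midpoint_gap(act):
    """Affine maps satisfy h(0) = (h(x) + h(-x)) / 2; measure the violation."""
    mid = two_layer(np.zeros(3), act)
    avg = 0.5 * (two_layer(x, act) + two_layer(-x, act))
    return np.max(np.abs(mid - avg))

gap_identity = midpoint_gap(identity)  # zero up to rounding
gap_relu = midpoint_gap(relu)          # clearly nonzero

print(collapses, gap_identity, gap_relu)
```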

This is not just folklore. The universal approximation theorem makes the requirement exact: Leshno, Lin, Pinkus, and Schocken proved in 1993 that a multilayer feedforward network with locally bounded piecewise continuous activations can approximate any continuous function "if and only if the network's activation function is not a polynomial." [12] A purely linear (degree 1 polynomial) activation therefore cannot give a network any power beyond a single linear map, no matter how many layers it has. Cybenko had earlier established universal approximation for sigmoidal activations in 1989. [13] Convolutional layers, attention's projection matrices, and the input and output embeddings of large language models are all linear maps with learned parameters; the nonlinearity lives in activations, softmaxes, and normalization layers.

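This structure is why people sometimes describe a deep network as "alternating linear and nonlinear pieces." It is also why removing every nonlinearity from a stack of fully connected layers turns it into a single affine map of the input. Attention needs more care: with the softmax removed, its output is still a product of queries, keys, and values that all depend on the input, so it is polynomial rather than linear in the input. Such softmax free variants have nonetheless been used to design simplified theoretical models of attention.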

## What is a linear probe?

A related modern usage is the **linear probe**, a diagnostic tool for evaluating learned representations. Given a frozen pretrained model, a linear probe trains a single linear classifier on top of its hidden activations to predict some downstream label. If the probe achieves high accuracy, the representation is said to linearly encode the property. Linear probes are popular because they are fast, have few hyperparameters, and isolate what the representation already contains from what an additional learner could do.

They were used heavily in the [CLIP](/wiki/clip) paper to evaluate transfer of vision features: Radford et al. fit a logistic regression classifier with scikit-learn's L-BFGS solver on frozen image features, and their best model (a ViT-L/14 trained at 336-by-336 pixel resolution) reached state of the art on 21 of 27 datasets in their linear probe suite. [15] Linear probes remain a standard evaluation in self supervised learning. A complementary approach, LP-FT, trains a linear probe first and then fine tunes the full model. Kumar et al. (2022) showed that on 10 distribution shift datasets, plain fine tuning averaged 2 percent higher in distribution accuracy but 7 percent lower out of distribution accuracy than linear probing, because fine tuning distorts good pretrained features; the LP-FT two step recipe beat full fine tuning by roughly 1 percent in distribution and 10 percent out of distribution. [16]

## When should you use a linear model?

Linear models are not a relic. They remain a strong default for structured tabular data and for tasks where interpretability or speed dominates accuracy. [14] Reasonable indications for reaching for a linear baseline include:

- Small datasets where complex models would overfit.
- High dimensional data with many features and few examples, where regularized linear methods such as Lasso are a standard and often highly competitive choice.
- Settings that demand interpretability, since each coefficient measures the marginal effect of one feature. [14]
- Real time inference where prediction time must be milliseconds or less.
- Settings where convex optimization is desired, so that training has a guaranteed global optimum.
- A baseline against which more complex models will be compared. If a linear model already does well, the marginal value of a deep model is small.

Linear regression is also the standard tool in scientific applications because its assumptions (linearity of effects, independence of errors, homoscedasticity) line up with how scientists like to reason about cause and effect. [3]

## When does linearity fail?

Linearity is a strong assumption, and it breaks in any setting where the response depends on interactions between features, on thresholds, or on the geometry of complex objects. Linear models cannot represent XOR, cannot learn most image features without [feature engineering](/wiki/feature_engineering), and cannot capture the long range dependencies that make sense of text. Image classification, speech recognition, machine translation, and language modeling all moved decisively away from linear methods during the 2010s for this reason. Even on tabular data, gradient boosted trees and modern deep tabular models often beat linear baselines when interactions matter and there is enough training data. [20]
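The XOR limitation is easy to reproduce. The sketch below assumes NumPy and fits a least squares model with a bias to the four XOR points, first on the raw inputs and then with one added interaction feature, the product of the two inputs.

```python
import numpy as np

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0.0, 1.0, 1.0, 0.0])  # XOR truth table

def fit_predict(features):
    """Least squares fit with a bias column; returns fitted values."""
    A = np.column_stack([features, np.ones(len(features))])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return A @ w

# Raw inputs: the best linear fit cannot separate the classes.
pred_linear = fit_predict(X)

# Adding the interaction x1 * x2 makes XOR exactly representable.
pred_interact = fit_predict(np.column_stack([X, X[:, 0] * X[:, 1]]))

print(pred_linear, pred_interact)
```

On the raw inputs the fitted values are the same for all four points, so no threshold can separate the classes; with the interaction feature the model is still linear in its parameters and reproduces the table.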

Diagnosing a linearity failure usually involves residual analysis: if the residuals from a linear regression show curvature when plotted against a feature, the linearity assumption is suspect, and either feature engineering or a nonlinear model is appropriate. [20]
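A numerical version of that residual check is sketched below, assuming NumPy and made-up noiseless data with a quadratic term. Instead of a plot, it compares how strongly the residuals of a straight line fit correlate with x and with x squared.

```python
import numpy as np

x = np.linspace(-2, 2, 101)
y = 1.0 + 0.5 * x + 0.8 * x**2  # made-up data with curvature

# Fit a straight line y ~ w0 * x + w1.
A = np.column_stack([x, np.ones_like(x)])
w, *_ = np.linalg.lstsq(A, y, rcond=None)
residuals = y - A @ w

# Residuals are uncorrelated with x by construction of least squares,
# but a strong correlation with x**2 reveals the missed curvature.
corr_x = np.corrcoef(residuals, x)[0, 1]
corr_x2 = np.corrcoef(residuals, x**2)[0, 1]

print(corr_x, corr_x2)
```

With real, noisy data the second correlation would be weaker, and a residual plot remains the more informative diagnostic.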

## Is linearity still relevant for large language models?

Large language models and diffusion models are not linear; their core computations include attention softmaxes, ReLU or GELU activations, layer normalization, and gating, all of which are nonlinear. But the linear pieces are still everywhere. Token and position embeddings, the query, key, and value projections inside attention, the down projection back to the model dimension, and the final unembedding to the vocabulary are all linear maps. A great deal of mechanistic interpretability research treats these matrices as the primary object of study and asks what linear directions in the residual stream encode. The interaction between linear projections and a small number of nonlinear operations is also why phenomena such as superposition in the interpretability sense (where a layer encodes more features than it has dimensions, so that one neuron responds to many features at once) are even possible.

In the broader machine learning landscape, regularized linear models such as Lasso remain dominant in genomics, econometrics, and other fields where a sparse, interpretable model is more valuable than a marginal accuracy gain. "Try a linear model first" is still a good rule of thumb, and a linear baseline that closely matches a deep model is often a sign that the deep model has learned little beyond what the linear model already captures.

## Explain like I'm 5

A linear model is like predicting the price of a sandwich by adding up fixed prices for the bread, the meat, and the cheese. Two slices of cheese cost twice as much as one slice. The bread does not change in price depending on what meat you choose. If the real price worked that simply, a linear model would get it right every time. Real food trucks have specials, combos, and weird discounts, which is why we sometimes need more complicated models that can notice those interactions.

## References

1. Google Developers. "Machine Learning Glossary: ML Fundamentals." https://developers.google.com/machine-learning/glossary/fundamentals
2. Wikipedia. "Linearity." https://en.wikipedia.org/wiki/Linearity
3. Wikipedia. "Linear regression." https://en.wikipedia.org/wiki/Linear_regression
4. Wikipedia. "Generalized linear model." https://en.wikipedia.org/wiki/Generalized_linear_model
5. Wikipedia. "Linear discriminant analysis." https://en.wikipedia.org/wiki/Linear_discriminant_analysis
6. Wikipedia. "Principal component analysis." https://en.wikipedia.org/wiki/Principal_component_analysis
7. Wikipedia. "Hyperplane." https://en.wikipedia.org/wiki/Hyperplane
8. Wikipedia. "Affine transformation." https://en.wikipedia.org/wiki/Affine_transformation
9. Wikipedia. "Superposition principle." https://en.wikipedia.org/wiki/Superposition_principle
10. Smith, S. W. "Superposition: The Foundation of DSP." The Scientist and Engineer's Guide to Digital Signal Processing. https://www.dspguide.com/ch5/6.htm
11. Stanford CS109. "Linearity of Expectation." https://web.stanford.edu/class/archive/cs/cs109/cs109.1218/files/student_drive/3.2.pdf
12. Leshno, M., Lin, V. Y., Pinkus, A., and Schocken, S. "Multilayer Feedforward Networks With a Nonpolynomial Activation Function Can Approximate Any Function." Neural Networks 6(6), 1993. https://www.sciencedirect.com/science/article/abs/pii/S0893608005801315
13. Cybenko, G. "Approximation by Superpositions of a Sigmoidal Function." Mathematics of Control, Signals and Systems 2, 1989. https://link.springer.com/article/10.1007/BF02551274
14. Molnar, C. "Interpretable Machine Learning, Chapter on Linear Regression." https://christophm.github.io/interpretable-ml-book/limo.html
15. Radford, A. et al. "Learning Transferable Visual Models From Natural Language Supervision." ICML 2021 (CLIP), supplementary material on linear probe evaluation. http://proceedings.mlr.press/v139/radford21a/radford21a-supp.pdf
16. Kumar, A., Raghunathan, A., Jones, R., Ma, T., and Liang, P. "Fine-Tuning Can Distort Pretrained Features and Underperform Out-of-Distribution." ICLR 2022. https://arxiv.org/abs/2202.10054
17. scikit-learn. "Linear and Quadratic Discriminant Analysis." https://scikit-learn.org/stable/modules/lda_qda.html
18. Statistics By Jim. "The Difference Between Linear and Nonlinear Regression Models." https://statisticsbyjim.com/regression/difference-between-linear-nonlinear-regression-models/
19. History of Information. "Carl Friedrich Gauss and Adrien-Marie Legendre Discover the Method of Least Squares." https://www.historyofinformation.com/detail.php?entryid=2707
20. IBM. "What Is Linear Regression?" https://www.ibm.com/think/topics/linear-regression