See also: Machine learning terms
In machine learning and mathematics, linear describes a function or relationship in which the output is built from the inputs only through addition and multiplication by constants. Google's Machine Learning Glossary defines a linear function as "a relationship between two or more variables that can be represented solely through addition and multiplication." Linearity is the simplest and best understood class of relationships in modeling, and it underlies a large fraction of classical statistics, signal processing, and modern machine learning.
A model is called linear when its prediction can be written as a weighted sum of features, optionally plus a constant. The same word also describes operations such as expectation, differentiation, integration, and matrix multiplication, all of which respect the same two algebraic rules. Because the composition of linear maps is again linear, every layer of a deep network is itself linear (more precisely, affine) before its activation function is applied. The deliberate insertion of nonlinear activations is what stops a stack of linear layers from collapsing into a single linear transform.
A function f from one vector space to another is linear if it satisfies two properties for all vectors x, y and all scalars α, β:

- Additivity: f(x + y) = f(x) + f(y)
- Homogeneity: f(αx) = α f(x)

These two conditions can be combined into a single statement, sometimes called the superposition principle:
$$f(\alpha x + \beta y) = \alpha f(x) + \beta f(y).$$
In coordinates, every linear function on a finite dimensional space can be written as a dot product with a fixed weight vector w:
$$f(x) = w \cdot x = \sum_{i=1}^{d} w_i x_i.$$
The matrix form generalizes this to vector valued outputs, f(x) = W x, where W is a fixed matrix. Geometrically the level set {x : w · x = c} is a hyperplane, a flat (d − 1) dimensional subset of d dimensional feature space.
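The superposition property can be checked numerically for a dot product model. This is a minimal sketch using NumPy; the weight vector and inputs are arbitrary illustrative values:

```python
import numpy as np

# Arbitrary fixed weight vector defining f(x) = w . x
w = np.array([2.0, -1.0, 0.5])

def f(x):
    return w @ x  # dot product: sum_i w_i * x_i

x = np.array([1.0, 2.0, 3.0])
y = np.array([-1.0, 0.0, 4.0])
alpha, beta = 3.0, -2.0

# Superposition: f(a*x + b*y) == a*f(x) + b*f(y)
lhs = f(alpha * x + beta * y)
rhs = alpha * f(x) + beta * f(y)
assert np.isclose(lhs, rhs)
```

The same check fails the moment f contains anything other than additions and multiplications by constants, for example a squaring or a sigmoid.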
In everyday machine learning usage the word "linear" is loose. A function of the form g(x) = w · x + b, with a non zero bias b, is technically affine rather than linear, because g(0) = b is not 0 and homogeneity fails. An affine function is a linear function composed with a translation. Most practical models, including linear regression, logistic regression, and a perceptron, are affine in this strict sense, since they include a bias or intercept term. The community calls them linear anyway, partly out of convenience and partly because the bias can always be folded into the weight vector by appending a constant 1 to the feature vector.
The distinction matters in two places. First, in linear algebra theorems (kernels, image, dimension counting) the bias has to be handled explicitly. Second, when a hyperplane is the decision boundary of a linear classifier, the bias is what lets that hyperplane miss the origin.
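The bias folding trick mentioned above can be made concrete: appending a constant 1 to each feature vector turns the affine map w · x + b into a purely linear map in an augmented space. A sketch with made-up numbers:

```python
import numpy as np

w = np.array([0.5, -2.0])
b = 3.0
x = np.array([4.0, 1.0])

# Affine form: g(x) = w . x + b
affine = w @ x + b

# Fold the bias into the weights by appending a constant 1 feature
w_aug = np.append(w, b)    # [0.5, -2.0, 3.0]
x_aug = np.append(x, 1.0)  # [4.0, 1.0, 1.0]
linear = w_aug @ x_aug     # purely linear in the augmented space

assert np.isclose(affine, linear)
```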
| Property | Statement | Why it matters in ML |
|---|---|---|
| Additivity | f(x + y) = f(x) + f(y) | Effects of features add up; no interactions by default |
| Homogeneity | f(αx) = α f(x) | Doubling an input doubles the output |
| Superposition | f(αx + βy) = α f(x) + β f(y) | Lets us decompose problems into per feature contributions |
| Closure under composition | f ∘ g is linear if f and g are | Stacked linear layers collapse into one linear layer |
| Convexity of the squared loss | (y − w · x)² is convex in w | Linear regression has a unique global optimum |
| Closed form solution | w = (X^T X)^{-1} X^T y | No iterative training needed for ordinary least squares |
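The closed form in the table can be implemented directly. This is a sketch on noiseless synthetic data; in practice `np.linalg.lstsq` or a pivoted solver is preferred over forming X^T X for numerical stability:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w  # noiseless targets, so recovery is exact

# Ordinary least squares: w = (X^T X)^{-1} X^T y
w = np.linalg.solve(X.T @ X, X.T @ y)

assert np.allclose(w, true_w)
```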
A persistent source of confusion is what the "linear" in linear regression refers to. The model y = β0 + β1 x + β2 x^2 + β3 sin(x) is still a linear model. It is linear in the parameters β, even though it is not linear in the original feature x. The defining test is whether the model can be written as y = Σ βi φi(x) for some fixed feature transforms φi, with the βi appearing only to the first power and not multiplied or divided by other parameters.
This loophole is what makes classical linear models surprisingly flexible. By engineering nonlinear features (polynomials, splines, interaction terms, radial basis functions) and then fitting them with a linear estimator, a practitioner gets nonlinear behavior with linear training. The same trick is the basis of the kernel trick: a kernel implicitly maps inputs into a high dimensional feature space where a linear model fits, without ever computing the features explicitly. By contrast, a model such as y = β1 x^β2 is genuinely nonlinear because β2 appears as an exponent, and it requires nonlinear optimization to fit.
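Fitting a model like y = β0 + β1 x + β2 x² is still linear least squares once the feature map φ(x) = (1, x, x²) is fixed. A sketch with synthetic noiseless data:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, size=200)
y = 1.0 + 2.0 * x - 3.0 * x**2  # quadratic ground truth

# Nonlinear feature map phi(x) = (1, x, x^2); the model is linear in beta
Phi = np.column_stack([np.ones_like(x), x, x**2])
beta, *_ = np.linalg.lstsq(Phi, y, rcond=None)

assert np.allclose(beta, [1.0, 2.0, -3.0])
```

The fit is nonlinear in x but was obtained with a purely linear solver, which is the whole point of the loophole.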
| Method | Task | What is linear |
|---|---|---|
| Linear regression | Regression | Mean of y is a linear function of x |
| Ridge regression | Regression | Linear in parameters with L2 penalty |
| Lasso regression | Regression | Linear in parameters with L1 penalty |
| Elastic net | Regression | Linear in parameters with mixed L1 and L2 penalty |
| Logistic regression | Binary classification | Log odds are linear in x |
| Softmax regression | Multiclass classification | Logits are linear in x |
| Linear SVM | Classification | Decision boundary is a hyperplane |
| Perceptron | Binary classification | Decision rule is sign(w · x + b) |
| Linear discriminant analysis | Classification, dimensionality reduction | Projects onto a linear subspace that separates classes |
| Principal component analysis | Dimensionality reduction | Orthogonal linear projection that maximizes variance |
| Generalized linear models | Regression with non Gaussian targets | Linear predictor passed through a link function |
All of these share two traits: they fit a small number of weights, and the optimization problems they solve are convex. Convexity guarantees a single global optimum, which is one reason linear methods are the workhorse of statistics.
Generalized linear models, or GLMs, extend ordinary linear regression to response variables that are not Gaussian. A GLM has three pieces: a linear predictor η = w · x + b, a probability distribution from the exponential family for the response, and a link function g that ties them together by g(E[y]) = η. Linear regression itself is a GLM with a normal response and an identity link. Logistic regression is a GLM with a Bernoulli response and a logit link. Poisson regression, used for count data, takes a Poisson response and a log link. The linearity in the linear predictor is what gives GLMs interpretable coefficients and well behaved likelihoods.
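The GLM structure of logistic regression can be sketched directly: the linear predictor η = w · x + b is passed through the inverse of the logit link (the sigmoid) to produce E[y]. This is a minimal gradient descent fit on made-up separable data, not production code:

```python
import numpy as np

rng = np.random.default_rng(2)
# Made-up 1-D data: positive class around x = 2, negative around x = -2
x = np.concatenate([rng.normal(2, 0.5, 50), rng.normal(-2, 0.5, 50)])
y = np.concatenate([np.ones(50), np.zeros(50)])

def sigmoid(eta):
    # Inverse logit link: maps the linear predictor to E[y] in (0, 1)
    return 1.0 / (1.0 + np.exp(-eta))

w, b = 0.0, 0.0
for _ in range(2000):
    eta = w * x + b                  # linear predictor
    mu = sigmoid(eta)                # mean response via the inverse link
    grad_w = np.mean((mu - y) * x)   # gradient of the Bernoulli NLL
    grad_b = np.mean(mu - y)
    w -= 0.5 * grad_w
    b -= 0.5 * grad_b

pred = sigmoid(w * x + b) > 0.5
accuracy = np.mean(pred == y)
```

Note that the only learned object is the linear predictor; the link function and the Bernoulli likelihood are fixed choices.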
Linearity is not just a property of models; it is a property of many of the operations that machine learning relies on:

- Expectation is linear: E[aX + bY] = a E[X] + b E[Y], even when X and Y are dependent.
- Differentiation and integration are linear, which is why the gradient of a sum of per example losses is the sum of the per example gradients.
- Matrix multiplication is linear in each argument, and convolution and the Fourier transform are linear operators.

These facts are routine in textbooks, but they are also what allow large parts of machine learning theory (PAC bounds, bias variance decompositions, convergence proofs) to work at all.
A modern neural network is not linear. Each layer, however, contains a linear piece. A fully connected layer computes a pre activation z = W x + b, which is affine in x. If the activation function applied to z is also linear, then the entire network reduces to one big affine map. As one tutorial puts it, without nonlinearity "Linear ∘ Linear ∘ Linear = Linear." The presence of a nonlinear activation function such as ReLU, tanh, or the GELU used in transformers is what stops layers from collapsing and gives the network its representational power. Convolutional layers, attention's projection matrices, and the input and output embeddings of large language models are all linear maps with learned parameters; the nonlinearity lives in activations, softmaxes, and normalization layers.
This structure is why people sometimes describe a deep network as "alternating linear and nonlinear pieces." It is also why removing all activations from a transformer turns it into a single matrix product over the input, a fact that has been used to design simplified theoretical models of attention.
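The collapse of stacked affine layers can be verified numerically: composing two affine maps gives another affine map with W = W2 W1 and b = W2 b1 + b2. A sketch with random matrices:

```python
import numpy as np

rng = np.random.default_rng(3)
W1, b1 = rng.normal(size=(5, 4)), rng.normal(size=5)
W2, b2 = rng.normal(size=(3, 5)), rng.normal(size=3)
x = rng.normal(size=4)

# Two stacked affine layers with no activation in between
two_layers = W2 @ (W1 @ x + b1) + b2

# They collapse into a single affine layer
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b

assert np.allclose(two_layers, one_layer)
```

Inserting any nonlinearity, for example `np.maximum(0, ...)` between the two layers, breaks this identity.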
A related modern usage is the linear probe, a diagnostic tool for evaluating learned representations. Given a frozen pretrained model, a linear probe trains a single linear classifier on top of its hidden activations to predict some downstream label. If the probe achieves high accuracy, the representation is said to linearly encode the property. Linear probes are popular because they are fast, have few hyperparameters, and isolate what the representation already contains from what an additional learner could do. They were used heavily in the CLIP paper to evaluate transfer of vision features, and they remain a standard evaluation in self supervised learning. A complementary approach, LP FT, trains a linear probe first and then fine tunes the full model, which preserves pretrained features and tends to improve out of distribution accuracy.
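A linear probe can be sketched in a few lines: freeze a representation, then fit only a linear readout on top of its activations. Here the frozen encoder is a stand-in random tanh map, the labels are constructed to be linearly decodable, and the readout is fit by least squares; all of these are illustrative assumptions, not the procedure of any particular paper:

```python
import numpy as np

rng = np.random.default_rng(4)
n, d_in, d_hidden = 300, 10, 32

# Stand-in for a frozen pretrained encoder: a fixed random tanh map
W_frozen = rng.normal(size=(d_in, d_hidden))
X = rng.normal(size=(n, d_in))
H = np.tanh(X @ W_frozen)  # frozen hidden activations

# Labels that are, by construction, linearly decodable from H
v = rng.normal(size=d_hidden)
y = np.sign(H @ v)

# The probe: a single linear readout fit by least squares on +/-1 targets
w_probe, *_ = np.linalg.lstsq(H, y, rcond=None)
accuracy = np.mean(np.sign(H @ w_probe) == y)
```

High probe accuracy here reflects the fact that the property is linearly encoded in H; if the labels depended nonlinearly on H, the same probe would do poorly no matter how it was trained.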
Linear models are not a relic. They remain a strong default for structured tabular data and for tasks where interpretability or speed dominates accuracy. Reasonable indications for reaching for a linear baseline include:

- few training examples relative to the number of features, where a low variance model is safer;
- a need for interpretable, auditable coefficients;
- strict latency or memory budgets at inference time;
- high dimensional sparse features, such as bag of words text, where regularized linear models are hard to beat;
- the need for a cheap baseline before investing in a more complex model.
Linear regression is also the standard tool in scientific applications because its assumptions (linearity of effects, independence of errors, homoscedasticity) line up with how scientists like to reason about cause and effect.
Linearity is a strong assumption, and it breaks in any setting where the response depends on interactions between features, on thresholds, or on the geometry of complex objects. Linear models cannot represent XOR, cannot learn most image features without feature engineering, and cannot capture the long range dependencies that make sense of text. Image classification, speech recognition, machine translation, and language modeling all moved decisively away from linear methods during the 2010s for this reason. Even on tabular data, gradient boosted trees and modern deep tabular models often beat linear baselines when interactions matter and there is enough training data.
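The XOR claim has a two line proof. Suppose a linear classifier predicts class 1 exactly when w₁x₁ + w₂x₂ + b > 0. The four XOR points (0,0)↦0, (0,1)↦1, (1,0)↦1, (1,1)↦0 impose:

$$b \le 0, \qquad w_2 + b > 0, \qquad w_1 + b > 0, \qquad w_1 + w_2 + b \le 0.$$

Adding the two middle inequalities gives $w_1 + w_2 + 2b > 0$, hence $w_1 + w_2 + b > -b \ge 0$, contradicting the last inequality. No hyperplane separates XOR, which is why a hidden layer with a nonlinearity is required.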
Diagnosing a linearity failure usually involves residual analysis: if the residuals from a linear regression show curvature when plotted against a feature, the linearity assumption is suspect, and either feature engineering or a nonlinear model is appropriate.
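This residual check can be sketched numerically: fit a straight line to data generated by a quadratic, and the residuals correlate almost perfectly with x². Synthetic data for illustration:

```python
import numpy as np

x = np.linspace(-1, 1, 101)
y = x**2  # truly quadratic relationship

# Fit a straight line y ~ a + b*x by least squares
A = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
residuals = y - A @ coef

# Residuals show pure curvature: their correlation with x^2 is ~1
corr = np.corrcoef(residuals, x**2)[0, 1]
assert corr > 0.99
```

In a real analysis one would plot residuals against each feature and look for such systematic structure rather than compute a single correlation.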
Large language models and diffusion models are not linear; their core computations include attention softmaxes, ReLU or GELU activations, layer normalization, and gating, all of which are nonlinear. But the linear pieces are still everywhere. Token and position embeddings, the query, key, and value projections inside attention, the down projection back to the model dimension, and the final unembedding to the vocabulary are all linear maps. A great deal of mechanistic interpretability research treats these matrices as the primary object of study and asks what linear directions in the residual stream encode. The interaction between linear projections and a small number of nonlinear operations is also why phenomena such as superposition (where one neuron represents many features at once) are even possible.
In the broader machine learning landscape, regularized linear models such as Lasso remain dominant in genomics, econometrics, and other fields where a sparse, interpretable model is more valuable than a marginal accuracy gain. "Try a linear model first" is still a good rule of thumb, and a deep model that barely beats a linear baseline is often a sign that it has not learned anything beyond what the linear model already captures.
A linear model is like predicting the price of a sandwich by adding up fixed prices for the bread, the meat, and the cheese. Two slices of cheese cost twice as much as one slice. The bread does not change in price depending on what meat you choose. If the real price worked that simply, a linear model would get it right every time. Real food trucks have specials, combos, and weird discounts, which is why we sometimes need more complicated models that can notice those interactions.