See also: Machine learning terms
In machine learning and mathematics, linear describes a function or relationship in which the output is built from the inputs only through addition and multiplication by constants. Google's Machine Learning Glossary defines a linear function as "a relationship between two or more variables that can be represented solely through addition and multiplication." Linearity is the simplest and best understood class of relationships in modeling, and it underlies a large fraction of classical statistics, signal processing, and modern machine learning.
A model is called linear when its prediction can be written as a weighted sum of features, optionally plus a constant. The same word also describes operations such as expectation, differentiation, integration, and matrix multiplication, all of which respect the same two algebraic rules. Because the composition of linear maps is again linear, every layer of a deep network is itself linear (more precisely, affine) before its activation function is applied. The deliberate insertion of nonlinear activations is what stops a stack of linear layers from collapsing into a single linear transform.
A function f from one vector space to another is linear if it satisfies two properties for all vectors x, y and all scalars α, β:

- Additivity: f(x + y) = f(x) + f(y)
- Homogeneity: f(αx) = α f(x)

These two conditions can be combined into a single statement, sometimes called the superposition principle:
$$f(\alpha x + \beta y) = \alpha f(x) + \beta f(y).$$
In coordinates, every linear function on a finite dimensional space can be written as a dot product with a fixed weight vector w:
$$f(x) = w \cdot x = \sum_{i=1}^{d} w_i x_i.$$
The matrix form generalizes this to vector valued outputs, f(x) = W x, where W is a fixed matrix. Geometrically the level set {x : w · x = c} is a hyperplane, a flat (d − 1) dimensional subset of d dimensional feature space.
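The superposition property can be checked numerically for a dot product model. This is a minimal sketch using NumPy; the weight vector and inputs are arbitrary illustrative values:

```python
import numpy as np

# Arbitrary fixed weight vector defining f(x) = w . x
w = np.array([2.0, -1.0, 0.5])

def f(x):
    return w @ x  # dot product: sum_i w_i * x_i

x = np.array([1.0, 2.0, 3.0])
y = np.array([-1.0, 0.0, 4.0])
alpha, beta = 3.0, -2.0

# Superposition: f(a*x + b*y) == a*f(x) + b*f(y)
lhs = f(alpha * x + beta * y)
rhs = alpha * f(x) + beta * f(y)
assert np.isclose(lhs, rhs)
```

The same check fails the moment f contains anything other than additions and multiplications by constants, for example a squaring or a sigmoid.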
In everyday machine learning usage the word "linear" is loose. A function of the form g(x) = w · x + b, with a non zero bias b, is technically affine rather than linear, because g(0) = b is not 0 and homogeneity fails. An affine function is a linear function composed with a translation. Most practical models, including linear regression, logistic regression, and a perceptron, are affine in this strict sense, since they include a bias or intercept term. The community calls them linear anyway, partly out of convenience and partly because the bias can always be folded into the weight vector by appending a constant 1 to the feature vector.
The distinction matters in two places. First, in linear algebra theorems (kernels, image, dimension counting) the bias has to be handled explicitly. Second, when a hyperplane is the decision boundary of a linear classifier, the bias is what lets that hyperplane miss the origin.
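The bias folding trick mentioned above can be made concrete: appending a constant 1 to each feature vector turns the affine map w · x + b into a purely linear map in an augmented space. A sketch with made-up numbers:

```python
import numpy as np

w = np.array([0.5, -2.0])
b = 3.0
x = np.array([4.0, 1.0])

# Affine form: g(x) = w . x + b
affine = w @ x + b

# Fold the bias into the weights by appending a constant 1 feature
w_aug = np.append(w, b)    # [0.5, -2.0, 3.0]
x_aug = np.append(x, 1.0)  # [4.0, 1.0, 1.0]
linear = w_aug @ x_aug     # purely linear in the augmented space

assert np.isclose(affine, linear)
```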
| Property | Statement | Why it matters in ML |
|---|---|---|
| Additivity | f(x + y) = f(x) + f(y) | Effects of features add up; no interactions by default |
| Homogeneity | f(αx) = α f(x) | Doubling an input doubles the output |
| Superposition | f(αx + βy) = α f(x) + β f(y) | Lets us decompose problems into per feature contributions |
| Closure under composition | f ∘ g is linear if f and g are | Stacked linear layers collapse into one linear layer |
| Convexity of the squared loss | (y − w · x)² is convex in w | Linear regression has a unique global optimum |
| Closed form solution | w = (X^T X)^{-1} X^T y | No iterative training needed for ordinary least squares |
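The closed form in the table can be implemented directly. This is a sketch on noiseless synthetic data; in practice `np.linalg.lstsq` or a pivoted solver is preferred over forming X^T X for numerical stability:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w  # noiseless targets, so recovery is exact

# Ordinary least squares: w = (X^T X)^{-1} X^T y
w = np.linalg.solve(X.T @ X, X.T @ y)

assert np.allclose(w, true_w)
```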
A persistent source of confusion is what the "linear" in linear regression refers to. The model y = β0 + β1 x + β2 x^2 + β3 sin(x) is still a linear model. It is linear in the parameters β, even though it is not linear in the original feature x. The defining test is whether the model can be written as y = Σ βi φi(x) for some fixed feature transforms φi, with the βi appearing only to the first power and not multiplied or divided by other parameters.
This loophole is what makes classical linear models surprisingly flexible. By engineering nonlinear features (polynomials, splines, interaction terms, radial basis functions) and then fitting them with a linear estimator, a practitioner gets nonlinear behavior with linear training. The same trick is the basis of the kernel trick: a kernel implicitly maps inputs into a high dimensional feature space where a linear model fits, without ever computing the features explicitly. By contrast, a model such as y = β1 x^β2 is genuinely nonlinear because β2 appears as an exponent, and it requires nonlinear optimization to fit.
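Fitting a model like y = β0 + β1 x + β2 x² is still linear least squares once the feature map φ(x) = (1, x, x²) is fixed. A sketch with synthetic noiseless data:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, size=200)
y = 1.0 + 2.0 * x - 3.0 * x**2  # quadratic ground truth

# Nonlinear feature map phi(x) = (1, x, x^2); the model is linear in beta
Phi = np.column_stack([np.ones_like(x), x, x**2])
beta, *_ = np.linalg.lstsq(Phi, y, rcond=None)

assert np.allclose(beta, [1.0, 2.0, -3.0])
```

The fit is nonlinear in x but was obtained with a purely linear solver, which is the whole point of the loophole.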
| Method | Task | What is linear |
|---|---|---|
| Linear regression | Regression | Mean of y is a linear function of x |
| Ridge regression | Regression | Linear in parameters with L2 penalty |
| Lasso regression | Regression | Linear in parameters with L1 penalty |
| Elastic net | Regression | Linear in parameters with mixed L1 and L2 penalty |
| Logistic regression | Binary classification | Log odds are linear in x |
| Softmax regression | Multiclass classification | Logits are linear in x |
| Linear SVM | Classification | Decision boundary is a hyperplane |
| Perceptron | Binary classification | Decision rule is sign(w · x + b) |
| Linear discriminant analysis | Classification, dimensionality reduction | Projects onto a linear subspace that separates classes |
| Principal component analysis | Dimensionality reduction | Orthogonal linear projection that maximizes variance |
| Generalized linear models | Regression with non Gaussian targets | Linear predictor passed through a link function |
All of these share two traits: they fit a small number of weights, and the optimization problems they solve are convex. Convexity guarantees a single global optimum, which is one reason linear methods are the workhorse of statistics.
Generalized linear models, or GLMs, extend ordinary linear regression to response variables that are not Gaussian. A GLM has three pieces: a linear predictor η = w · x + b, a probability distribution from the exponential family for the response, and a link function g that ties them together by g(E[y]) = η. Linear regression itself is a GLM with a normal response and an identity link. Logistic regression is a GLM with a Bernoulli response and a logit link. Poisson regression, used for count data, takes a Poisson response and a log link. The linearity in the linear predictor is what gives GLMs interpretable coefficients and well behaved likelihoods.
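The GLM structure of logistic regression can be sketched directly: the linear predictor η = w · x + b is passed through the inverse of the logit link (the sigmoid) to produce E[y]. This is a minimal gradient descent fit on made-up separable data, not production code:

```python
import numpy as np

rng = np.random.default_rng(2)
# Made-up 1-D data: positive class around x = 2, negative around x = -2
x = np.concatenate([rng.normal(2, 0.5, 50), rng.normal(-2, 0.5, 50)])
y = np.concatenate([np.ones(50), np.zeros(50)])

def sigmoid(eta):
    # Inverse logit link: maps the linear predictor to E[y] in (0, 1)
    return 1.0 / (1.0 + np.exp(-eta))

w, b = 0.0, 0.0
for _ in range(2000):
    eta = w * x + b                  # linear predictor
    mu = sigmoid(eta)                # mean response via the inverse link
    grad_w = np.mean((mu - y) * x)   # gradient of the Bernoulli NLL
    grad_b = np.mean(mu - y)
    w -= 0.5 * grad_w
    b -= 0.5 * grad_b

pred = sigmoid(w * x + b) > 0.5
accuracy = np.mean(pred == y)
```

Note that the only learned object is the linear predictor; the link function and the Bernoulli likelihood are fixed choices.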
Linearity is not just a property of models; it is a property of many of the operations that machine learning relies on:

- Expectation is linear: E[aX + bY] = a E[X] + b E[Y], even when X and Y are dependent.
- Differentiation and integration are linear, which is why the gradient of a sum of per example losses is the sum of the per example gradients.
- Matrix multiplication is linear in each argument, and convolution and the Fourier transform are linear operators.

These facts are routine in textbooks, but they are also what allow large parts of machine learning theory (PAC bounds, bias variance decompositions, convergence proofs) to work at all.
A modern neural network is not linear. Each layer, however, contains a linear piece. A fully connected layer computes a pre activation z = W x + b, which is affine in x. If the activation function applied to z is also linear, then the entire network reduces to one big affine map. As one tutorial puts it, without nonlinearity "Linear ∘ Linear ∘ Linear = Linear." The presence of a nonlinear activation function such as ReLU, tanh, or the GELU used in transformers is what stops layers from collapsing and gives the network its representational power. Convolutional layers, attention's projection matrices, and the input and output embeddings of large language models are all linear maps with learned parameters; the nonlinearity lives in activations, softmaxes, and normalization layers.
This structure is why people sometimes describe a deep network as "alternating linear and nonlinear pieces." It is also why removing all activations from a transformer turns it into a single matrix product over the input, a fact that has been used to design simplified theoretical models of attention.
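The collapse of stacked affine layers can be verified numerically: composing two affine maps gives another affine map with W = W2 W1 and b = W2 b1 + b2. A sketch with random matrices:

```python
import numpy as np

rng = np.random.default_rng(3)
W1, b1 = rng.normal(size=(5, 4)), rng.normal(size=5)
W2, b2 = rng.normal(size=(3, 5)), rng.normal(size=3)
x = rng.normal(size=4)

# Two stacked affine layers with no activation in between
two_layers = W2 @ (W1 @ x + b1) + b2

# They collapse into a single affine layer
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b

assert np.allclose(two_layers, one_layer)
```

Inserting any nonlinearity, for example `np.maximum(0, ...)` between the two layers, breaks this identity.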
A related modern usage is the linear probe, a diagnostic tool for evaluating learned representations. Given a frozen pretrained model, a linear probe trains a single linear classifier on top of its hidden activations to predict some downstream label. If the probe achieves high accuracy, the representation is said to linearly encode the property. Linear probes are popular because they are fast, have few hyperparameters, and isolate what the representation already contains from what an additional learner could do. They were used heavily in the CLIP paper to evaluate transfer of vision features, and they remain a standard evaluation in self supervised learning. A complementary approach, LP FT, trains a linear probe first and then fine tunes the full model, which preserves pretrained features and tends to improve out of distribution accuracy.
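A linear probe can be sketched in a few lines: freeze a representation, then fit only a linear readout on top of its activations. Here the frozen encoder is a stand-in random tanh map, the labels are constructed to be linearly decodable, and the readout is fit by least squares; all of these are illustrative assumptions, not the procedure of any particular paper:

```python
import numpy as np

rng = np.random.default_rng(4)
n, d_in, d_hidden = 300, 10, 32

# Stand-in for a frozen pretrained encoder: a fixed random tanh map
W_frozen = rng.normal(size=(d_in, d_hidden))
X = rng.normal(size=(n, d_in))
H = np.tanh(X @ W_frozen)  # frozen hidden activations

# Labels that are, by construction, linearly decodable from H
v = rng.normal(size=d_hidden)
y = np.sign(H @ v)

# The probe: a single linear readout fit by least squares on +/-1 targets
w_probe, *_ = np.linalg.lstsq(H, y, rcond=None)
accuracy = np.mean(np.sign(H @ w_probe) == y)
```

High probe accuracy here reflects the fact that the property is linearly encoded in H; if the labels depended nonlinearly on H, the same probe would do poorly no matter how it was trained.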
Linear models are not a relic. They remain a strong default for structured tabular data and for tasks where interpretability or speed dominates accuracy. Reasonable indications for reaching for a linear baseline include:

- few training examples relative to the number of features, where a low variance model is safer;
- a need for interpretable, auditable coefficients;
- strict latency or memory budgets at inference time;
- high dimensional sparse features, such as bag of words text, where regularized linear models are hard to beat;
- the need for a cheap baseline before investing in a more complex model.
Linear regression is also the standard tool in scientific applications because its assumptions (linearity of effects, independence of errors, homoscedasticity) line up with how scientists like to reason about cause and effect.
Linearity is a strong assumption, and it breaks in any setting where the response depends on interactions between features, on thresholds, or on the geometry of complex objects. Linear models cannot represent XOR, cannot learn most image features without feature engineering, and cannot capture the long range dependencies that make sense of text. Image classification, speech recognition, machine translation, and language modeling all moved decisively away from linear methods during the 2010s for this reason. Even on tabular data, gradient boosted trees and modern deep tabular models often beat linear baselines when interactions matter and there is enough training data.
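The XOR claim has a two line proof. Suppose a linear classifier predicts class 1 exactly when w₁x₁ + w₂x₂ + b > 0. The four XOR points (0,0)↦0, (0,1)↦1, (1,0)↦1, (1,1)↦0 impose:

$$b \le 0, \qquad w_2 + b > 0, \qquad w_1 + b > 0, \qquad w_1 + w_2 + b \le 0.$$

Adding the two middle inequalities gives $w_1 + w_2 + 2b > 0$, hence $w_1 + w_2 + b > -b \ge 0$, contradicting the last inequality. No hyperplane separates XOR, which is why a hidden layer with a nonlinearity is required.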
Diagnosing a linearity failure usually involves residual analysis: if the residuals from a linear regression show curvature when plotted against a feature, the linearity assumption is suspect, and either feature engineering or a nonlinear model is appropriate.
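This residual check can be sketched numerically: fit a straight line to data generated by a quadratic, and the residuals correlate almost perfectly with x². Synthetic data for illustration:

```python
import numpy as np

x = np.linspace(-1, 1, 101)
y = x**2  # truly quadratic relationship

# Fit a straight line y ~ a + b*x by least squares
A = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
residuals = y - A @ coef

# Residuals show pure curvature: their correlation with x^2 is ~1
corr = np.corrcoef(residuals, x**2)[0, 1]
assert corr > 0.99
```

In a real analysis one would plot residuals against each feature and look for such systematic structure rather than compute a single correlation.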
Large language models and diffusion models are not linear; their core computations include attention softmaxes, ReLU or GELU activations, layer normalization, and gating, all of which are nonlinear. But the linear pieces are still everywhere. Token and position embeddings, the query, key, and value projections inside attention, the down projection back to the model dimension, and the final unembedding to the vocabulary are all linear maps. A great deal of mechanistic interpretability research treats these matrices as the primary object of study and asks what linear directions in the residual stream encode. The interaction between linear projections and a small number of nonlinear operations is also why phenomena such as superposition (where one neuron represents many features at once) are even possible.
In the broader machine learning landscape, regularized linear models such as Lasso remain dominant in genomics, econometrics, and other fields where a sparse, interpretable model is more valuable than a marginal accuracy gain. "Try a linear model first" is still a good rule of thumb, and a deep model that barely beats a linear baseline is often a sign that it has not learned anything beyond what the linear model already captures.
A linear model is like predicting the price of a sandwich by adding up fixed prices for the bread, the meat, and the cheese. Two slices of cheese cost twice as much as one slice. The bread does not change in price depending on what meat you choose. If the real price worked that simply, a linear model would get it right every time. Real food trucks have specials, combos, and weird discounts, which is why we sometimes need more complicated models that can notice those interactions.