See also: Machine learning terms
Nonlinear describes any function, model, or relationship that does not satisfy the property of linearity. In machine learning the word shows up in two closely related senses. First, it labels a class of mathematical functions: a function f is nonlinear when it fails the test f(αx + βy) = αf(x) + βf(y) for some inputs x, y and scalars α, β. Second, it labels a class of models that can fit input-output relationships of essentially arbitrary shape, including curves, thresholds, and high-order interactions. Almost every modern statistical learning method that performs well on real data is nonlinear in one or both of these senses, and the practical history of the field is in large part the history of figuring out how to fit nonlinear models cheaply and reliably.
A real-valued function f is linear if it satisfies two conditions: additivity, f(x + y) = f(x) + f(y), and homogeneity, f(αx) = αf(x). Together these are called superposition. A function that breaks either condition is nonlinear. Strictly speaking a function such as f(x) = wx + b is affine rather than linear because of the constant offset, but in the ML literature the distinction is usually dropped and "linear" is used to mean "linear in the parameters and additive in the inputs."
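A quick numerical check makes the definition concrete. The sketch below (plain numpy, arbitrary example values) verifies superposition for f(x) = 3x and shows it failing for an affine and a quadratic function. One failing pair of inputs is enough to disprove linearity, though no finite set of passing checks proves it.

```python
import numpy as np

x, y, a, b = 2.0, -1.5, 0.7, 1.3

def check_superposition(f):
    # Linear iff f(a*x + b*y) == a*f(x) + b*f(y) for all x, y, a, b.
    return np.isclose(f(a * x + b * y), a * f(x) + b * f(y))

print(check_superposition(lambda t: 3 * t))      # True: linear
print(check_superposition(lambda t: 3 * t + 1))  # False: affine, the offset breaks additivity
print(check_superposition(lambda t: t ** 2))     # False: nonlinear
```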
A model is called nonlinear when its predicted output is a nonlinear function of the inputs. Note that a model can be linear in its parameters while remaining nonlinear in the inputs: polynomial regression of the form y = w0 + w1 x + w2 x^2 + w3 x^3 is fit by ordinary least squares, but the relationship between x and y is a cubic curve. Conversely, a single-layer neural network with a sigmoid activation is nonlinear in both inputs and parameters because the parameters multiply the inputs before passing through the nonlinearity.
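To see that linearity in the parameters is what makes ordinary least squares applicable, the sketch below solves the cubic regression from this paragraph as a plain least-squares problem over the expanded feature matrix [1, x, x^2, x^3] (numpy only; the coefficients and noise level are made up for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 100)
y = 1.0 - 2.0 * x + 0.5 * x**3 + rng.normal(scale=0.2, size=x.size)

# Design matrix of basis functions: the model is nonlinear in x
# but linear in the weights w, so ordinary least squares applies.
X = np.column_stack([np.ones_like(x), x, x**2, x**3])
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w)  # approximately [1.0, -2.0, 0.0, 0.5]
```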
Real-world data rarely lies on a hyperplane. House prices respond to square footage in a curved, eventually saturating way. Loan default risk depends on income, credit utilization, and length of history through interactions, not as a simple weighted sum. The pixel intensities of a cat photo bear no linear relationship to the label "cat." If a model is forced to assume a linear input-output mapping, it will systematically underfit problems with curvature, threshold effects, or feature interactions.
Three practical reasons nonlinearity matters in ML:

- Curvature: many real relationships bend or saturate, as with price versus square footage, and a straight line systematically misses them.
- Threshold effects: outputs that change regime once an input crosses a critical value cannot be expressed as a weighted sum.
- Feature interactions: when the effect of one feature depends on the value of another, as in the loan-default example above, an additive linear model cannot represent the dependence.
Most classical ML algorithms come in matched pairs: a fast, interpretable linear version and a more expensive, more flexible nonlinear cousin. The table below summarizes the main families.
| Task | Linear method | Nonlinear method |
|---|---|---|
| Regression | Linear regression, Ridge, Lasso | Polynomial regression, kernel ridge, random forest regression, gradient boosting regression, neural network regression, Gaussian process regression |
| Binary classification | Logistic regression, linear SVM | Kernel SVM, decision tree, random forest, gradient boosting (XGBoost, LightGBM), kNN, MLP |
| Dimensionality reduction | PCA, LDA | Kernel PCA, t-SNE, UMAP, autoencoders |
| Sequence modeling | Linear autoregression, ARIMA | RNN, LSTM, Transformer |
| Density estimation | Gaussian, mixtures of linear models | Kernel density estimation, normalizing flows, diffusion models |
Linear methods have three persistent advantages: they train fast, they are easy to regularize, and their coefficients are interpretable as marginal effects. Nonlinear methods give up some or all of those properties in exchange for the ability to fit data that the linear model would miss.
The theoretical justification for using nonlinearity in neural networks is the universal approximation theorem. George Cybenko proved in 1989 that a feed-forward network with a single hidden layer and a continuous sigmoidal activation can approximate any continuous function on a compact subset of R^n to arbitrary accuracy, given enough hidden units (Cybenko, 1989). Kurt Hornik extended the result in 1991, showing that the specific choice of activation matters little: any bounded, non-constant activation function will do (Hornik, 1991). Leshno and colleagues later relaxed the boundedness condition as well, showing that any non-polynomial activation suffices (Leshno et al., 1993).
The theorem is an existence result, not a constructive one. It does not say the required width is small, nor that gradient descent will find the right weights. What it does say is that nonlinearity is the ingredient that gives neural networks their representational power. Strip the nonlinearity out and the network collapses. A stack of N linear layers W_N W_{N-1} ... W_1 x is itself just a linear map W x with W = W_N W_{N-1} ... W_1, so depth alone buys nothing. The reason deep networks can represent things shallow networks cannot is that the nonlinearity between layers prevents the collapse and lets each new layer compose new features on top of the previous ones.
In neural networks the nonlinearity is usually introduced by an activation function applied element-wise to the output of each layer. The choice of activation is one of the most studied design decisions in deep learning, and the field has cycled through several dominant choices.
| Name | Formula | Range | Notes |
|---|---|---|---|
| Sigmoid | 1 / (1 + e^-x) | (0, 1) | Smooth, historical default, suffers vanishing gradients for large \|x\| |
| Tanh | (e^x - e^-x) / (e^x + e^-x) | (-1, 1) | Zero-centered version of sigmoid, still saturates |
| ReLU | max(0, x) | [0, ∞) | Cheap, sparse activations, can produce "dead" neurons that output zero forever |
| Leaky ReLU | max(αx, x), small α | (-∞, ∞) | Fixes dead-neuron problem with a small negative slope |
| PReLU | max(αx, x), α learned | (-∞, ∞) | Learnable leak, He et al., 2015 |
| ELU | x if x > 0, α(e^x - 1) otherwise | (-α, ∞) | Smooth negative side, Clevert et al., 2015 |
| GELU | x · Φ(x) | (-∞, ∞) | Φ is the standard normal CDF, used in BERT and GPT, Hendrycks & Gimpel, 2016 |
| Swish / SiLU | x · sigmoid(βx) | (-∞, ∞) | Smooth, non-monotonic, found by automated search, Ramachandran et al., 2017 |
| SwiGLU | (xW + b) ⊗ Swish(xV + c) | varies | Gated variant used in PaLM, LLaMA, Mistral, Shazeer, 2020 |
| Softmax | e^(x_i) / Σ_j e^(x_j) | (0, 1), sums to 1 | Vector-valued, used as the output layer for multi-class classification |
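As a reference, the scalar activations in the table are one-liners in code. The sketch below transcribes the formulas directly (numpy, plus scipy for the Gaussian CDF in GELU), including the max-subtraction trick conventionally used to keep softmax numerically stable.

```python
import numpy as np
from scipy.special import erf

def sigmoid(x):            return 1.0 / (1.0 + np.exp(-x))
def relu(x):               return np.maximum(0.0, x)
def leaky_relu(x, a=0.01): return np.maximum(a * x, x)
def elu(x, a=1.0):         return np.where(x > 0, x, a * (np.exp(x) - 1.0))
def gelu(x):               return x * 0.5 * (1.0 + erf(x / np.sqrt(2.0)))  # x * Phi(x)
def swish(x, beta=1.0):    return x * sigmoid(beta * x)

def softmax(x):
    # Subtracting the max does not change the output (softmax is
    # shift-invariant) but prevents overflow in the exponentials.
    z = np.exp(x - np.max(x))
    return z / z.sum()

print(softmax(np.array([1.0, 2.0, 3.0])))  # positive entries summing to 1
```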
The modern story starts in 2011, when Glorot, Bordes, and Bengio showed that ReLU trains deep networks faster and to better accuracy than sigmoid or tanh, in part by avoiding the vanishing gradient problem and producing genuinely sparse activations (Glorot, Bordes, & Bengio, 2011). ReLU became the default after the 2012 ImageNet result with AlexNet, and held that position for most of the next decade.
GELU, introduced by Dan Hendrycks and Kevin Gimpel in 2016, weights inputs by their value through the Gaussian CDF rather than gating them by sign. It became the standard activation in transformer language models including BERT and the GPT series (Hendrycks & Gimpel, 2016). Swish, also written x · sigmoid(βx), was discovered in 2017 by Ramachandran, Zoph, and Le using a reinforcement-learning-based search over candidate activations. It is smooth, non-monotonic, and unbounded above, and it tends to help on deeper networks. The same function is also called SiLU, and it appears in EfficientNet and many vision models (Ramachandran, Zoph, & Le, 2017).
In 2020 Noam Shazeer showed that gated variants of activations, where a Swish or GELU output is multiplied element-wise by a second linear projection of the same input, give consistent perplexity improvements in transformer feed-forward layers (Shazeer, 2020). The SwiGLU variant has since been adopted by PaLM, LLaMA, Mistral, and most current open-weight large language models. The cost is one extra matrix multiplication per feed-forward block, traded for measurable quality gains.
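A minimal sketch of a SwiGLU feed-forward block, following the gated structure described above. The toy dimensions, random weights, and omission of biases are assumptions for illustration, not a reproduction of any particular model's code.

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))  # Swish with beta = 1

def swiglu_ffn(x, W_gate, W_up, W_down):
    # Two parallel projections of the same input: one passes through
    # Swish and gates the other element-wise; a third projection maps
    # back to the model dimension. The gate is the extra matmul
    # mentioned in the text.
    return (silu(x @ W_gate) * (x @ W_up)) @ W_down

d_model, d_ff = 8, 32  # toy sizes
rng = np.random.default_rng(0)
x = rng.normal(size=(4, d_model))
W_gate = rng.normal(size=(d_model, d_ff))
W_up = rng.normal(size=(d_model, d_ff))
W_down = rng.normal(size=(d_ff, d_model))
print(swiglu_ffn(x, W_gate, W_up, W_down).shape)  # (4, 8)
```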
Softmax is a slightly different beast. It is not a pointwise nonlinearity but a vector normalization that turns logits into a probability distribution. Most multi-class classifiers and the attention mechanism in transformers use it.
A different way to inject nonlinearity, popular before deep learning, is the kernel trick. The idea, first used by Aizerman, Braverman, and Rozonoer in 1964 in the context of the potential function method, was popularized for SVMs by Boser, Guyon, and Vapnik at the 1992 Computational Learning Theory workshop (Boser, Guyon, & Vapnik, 1992). Many learning algorithms only ever access their training data through inner products x·y. If you replace every inner product with a kernel function k(x, y) = φ(x) · φ(y), where φ is some implicit nonlinear feature map, you fit a linear model in the φ-space without ever computing φ explicitly. The model is linear in the lifted features and nonlinear in the original inputs.
| Kernel | Formula | Implicit feature map |
|---|---|---|
| Linear | x · y | identity |
| Polynomial | (γ x · y + r)^d | all monomials of degree ≤ d |
| RBF / Gaussian | exp(-γ ‖x - y‖^2) | infinite-dimensional |
| Sigmoid | tanh(γ x · y + r) | related to single-layer neural net |
| Laplacian | exp(-γ ‖x - y‖) | similar to RBF, heavier tails |
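The identity k(x, y) = φ(x) · φ(y) can be verified by hand for the degree-2 polynomial kernel. The sketch below (numpy, two-dimensional inputs, γ = r = 1) writes out the implicit feature map explicitly and confirms that both sides agree.

```python
import numpy as np

def poly_kernel(x, y):
    return (x @ y + 1.0) ** 2  # (x . y + 1)^2: degree 2, gamma = r = 1

def phi(x):
    # Explicit feature map for the same kernel in 2 dimensions:
    # all monomials of degree <= 2, with sqrt(2) weights on cross terms.
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, x2**2,
                     np.sqrt(2) * x1 * x2])

x = np.array([0.5, -1.0])
y = np.array([2.0, 0.25])
print(poly_kernel(x, y), phi(x) @ phi(y))  # identical values
```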
The kernel trick made nonlinear SVMs practical and dominated tabular ML through the late 1990s and 2000s. It still shows up in Gaussian process regression, kernel ridge regression, and kernel PCA. Kernel methods scale poorly to huge datasets (training is O(n^2) or O(n^3) in the number of samples), which is one of the reasons neural networks displaced them once datasets and compute grew.
Tree-based models are nonlinear by construction. A single decision tree recursively partitions the input space along axis-aligned splits, then predicts a constant in each leaf. The resulting hypothesis is piecewise constant. Single trees overfit easily, but ensembles of trees are among the most reliable models for tabular data.
Random forests, introduced by Leo Breiman in 2001, average the predictions of many trees grown on bootstrap samples with random feature subsets. The averaging smooths the staircase shape of individual trees and dramatically lowers variance. Gradient boosting, formalized by Jerome Friedman in 2001, fits trees sequentially, each one trained to predict the residual error of the current ensemble. Modern implementations such as XGBoost, LightGBM, and CatBoost are the workhorses of Kaggle competitions and many production tabular pipelines.
Tree ensembles capture nonlinearities and feature interactions automatically without requiring the user to specify the form. They are robust to monotonic feature transformations, handle mixed categorical and numerical inputs, and need much less feature engineering than linear models. The tradeoff is that they do not extrapolate beyond the range of training data and they produce models that are harder to introspect than a regression coefficient table.
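A small comparison illustrates the point. The models, synthetic sine-shaped data, and seed below are arbitrary choices for illustration, not a benchmark.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(400, 1))
y = np.sin(2 * X[:, 0]) + rng.normal(scale=0.1, size=400)

linear = LinearRegression().fit(X, y)
boosted = GradientBoostingRegressor().fit(X, y)

# R^2 on the training signal: the linear score is near zero because
# the best single line through a sine wave is roughly flat, while the
# tree ensemble absorbs the curvature without being told its form.
print(linear.score(X, y), boosted.score(X, y))
```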
A classical way to add nonlinearity to a regression is basis expansion: replace the original feature x with a vector of basis functions [b1(x), b2(x), ...] and then fit a linear model on top. Polynomial features b_k(x) = x^k are the simplest case; piecewise-polynomial splines, especially natural cubic splines and B-splines, give a much better fit-to-flexibility tradeoff because each basis function acts only locally, so extra flexibility in one region does not induce wiggle everywhere else.
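In scikit-learn terms, basis expansion is a transformer followed by a linear model. The sketch below uses SplineTransformer (available from scikit-learn 1.0) with illustrative knot and degree settings on synthetic data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import SplineTransformer

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = np.log1p(X[:, 0]) + rng.normal(scale=0.05, size=300)

# Linear in the spline-basis coefficients, nonlinear in the raw feature.
model = make_pipeline(SplineTransformer(degree=3, n_knots=8),
                      LinearRegression())
model.fit(X, y)
print(model.score(X, y))
```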
Generalized additive models (GAMs), introduced by Trevor Hastie and Robert Tibshirani in 1986, generalize generalized linear models by replacing each linear term β_j x_j with a smooth function f_j(x_j), so that g(E[Y]) = α + Σ f_j(x_j) for some link function g (Hastie & Tibshirani, 1986). Each f_j is typically a smoothing spline or other nonparametric smoother. GAMs capture nonlinearity per feature while staying additive across features, which keeps them interpretable in a way that random forests and neural networks are not.
Gaussian processes offer yet another route: the model places a prior over functions and combines it with the data through the kernel function. The result is a nonparametric, nonlinear regression that also gives well-calibrated uncertainty estimates, at the cost of cubic-time matrix inversions.
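A minimal sketch with scikit-learn's GaussianProcessRegressor and an RBF kernel (default hyperparameters and toy data; real use would tune the kernel and its length scale):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(40, 1))
y = np.sin(3 * X[:, 0]) + rng.normal(scale=0.1, size=40)

# alpha adds observation noise to the kernel diagonal.
gp = GaussianProcessRegressor(kernel=RBF(), alpha=0.01).fit(X, y)
mean, std = gp.predict(np.array([[0.5]]), return_std=True)
print(mean, std)  # prediction plus an uncertainty estimate
```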
It is worth restating the point that depth without nonlinearity is useless. Suppose we stack two linear layers without an activation between them: h = W_1 x, y = W_2 h = W_2 W_1 x. The composition is just another linear map with weight matrix W = W_2 W_1, and the model has no more representational power than a single-layer linear model. Any two stacked linear transformations always collapse to one. The same logic applies to N layers.
Insert a nonlinearity σ between every pair of layers and the collapse no longer happens: y = W_2 σ(W_1 x) cannot in general be rewritten as a single matrix multiply. Each new layer composes a new nonlinear feature map on top of the previous one, and the representational capacity grows with depth. Empirically, deep ReLU networks can express piecewise linear functions whose number of linear regions grows exponentially in depth and only polynomially in width, which gives some intuition for why depth pays off (Montufar et al., 2014).
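Both the collapse and its prevention can be checked numerically in a few lines (random toy matrices):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(5, 3))
W2 = rng.normal(size=(2, 5))
x = rng.normal(size=3)

# Without an activation, two layers equal one layer with W = W2 @ W1.
print(np.allclose(W2 @ (W1 @ x), (W2 @ W1) @ x))       # True

# With a ReLU between them, no single matrix reproduces the map.
relu = lambda z: np.maximum(0.0, z)
print(np.allclose(W2 @ relu(W1 @ x), (W2 @ W1) @ x))   # False in general
```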
A common starting point for any regression or classification problem is a linear baseline. If the baseline underfits, the question is whether the residual structure is genuinely nonlinear or merely noisy. Several diagnostics help.
For regression, plot the residuals against each feature and against the predicted value. Random scatter around zero suggests the linear model has captured the systematic part of the signal. A visible curve, fan, or trend in the residuals signals unmodeled nonlinearity. For classification, fit a flexible model (e.g., a small gradient-boosted ensemble) and compare its held-out performance against the linear baseline. A large gap argues for nonlinearity in the data. Partial dependence plots and SHAP values then show which features carry the curvature.
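As a concrete version of the regression diagnostic, the sketch below fits a linear model to deliberately curved synthetic data and plots the residuals; the U shape that appears is the signature described above (scikit-learn and matplotlib, illustrative data).

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.3, size=300)  # curved ground truth

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

# A U-shaped residual cloud signals unmodeled curvature;
# random scatter around zero would clear the linear model.
plt.scatter(X[:, 0], residuals, s=8)
plt.axhline(0, color="k", lw=1)
plt.xlabel("feature"); plt.ylabel("residual")
plt.show()
```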
When the diagnosis points to nonlinearity, the standard escalation is:

1. Keep the linear model but expand the features: polynomial terms, splines, or hand-crafted interactions.
2. Switch to a tree ensemble such as a random forest or gradient boosting, which learns the nonlinear structure automatically.
3. Move to a neural network when the data is large or unstructured.
The progression is not strict. On well-structured tabular data, gradient boosting often beats both a heavily engineered linear model and a deep network. On image and language tasks, the linear baseline is rarely competitive at all, and the question is which neural architecture to use rather than whether to use one.
Imagine you have a bunch of dots on a piece of paper and you want to draw a single line that follows them as closely as possible. If the dots really do form a straight line, your job is easy. But if the dots form a curve, or a wave, or a cloud that swerves around, no straight line is going to do the job. You need a curve. Nonlinear methods in machine learning are the tools that draw the curves. Some of them stack many small curves together (neural networks). Some of them split the paper into rectangles and draw a flat patch in each one (decision trees). Some of them quietly lift the dots into a higher-dimensional space where a straight line works after all (kernel methods). The reason there are so many of them is that the world is shaped much more like a curve than like a straight line.