See also: Machine learning terms
Nonlinear describes any function, model, or relationship that does not satisfy the property of linearity. In machine learning the word shows up in two closely related senses. First, it labels a class of mathematical functions: a function f is nonlinear when it fails the test f(αx + βy) = αf(x) + βf(y) for some inputs x, y and scalars α, β. Second, it labels a class of models that can fit input-output relationships of essentially arbitrary shape, including curves, thresholds, and high-order interactions. Almost every modern statistical learning method that performs well on real data is nonlinear in one or both of these senses, and the practical history of the field is in large part the history of figuring out how to fit nonlinear models cheaply and reliably.
A real-valued function f is linear if it satisfies two conditions: additivity, f(x + y) = f(x) + f(y), and homogeneity, f(αx) = αf(x). Together these are called superposition. A function that breaks either condition is nonlinear. Strictly speaking a function such as f(x) = wx + b is affine rather than linear because of the constant offset, but in the ML literature the distinction is usually dropped and "linear" is used to mean "linear in the parameters and additive in the inputs."
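A quick numerical check makes the definition concrete. The sketch below (plain numpy, arbitrary example values) verifies superposition for f(x) = 3x and shows it failing for an affine and a quadratic function. One failing pair of inputs is enough to disprove linearity, though no finite set of passing checks proves it.

```python
import numpy as np

x, y, a, b = 2.0, -1.5, 0.7, 1.3

def check_superposition(f):
    # Linear iff f(a*x + b*y) == a*f(x) + b*f(y) for all x, y, a, b.
    return np.isclose(f(a * x + b * y), a * f(x) + b * f(y))

print(check_superposition(lambda t: 3 * t))      # True: linear
print(check_superposition(lambda t: 3 * t + 1))  # False: affine, the offset breaks additivity
print(check_superposition(lambda t: t ** 2))     # False: nonlinear
```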
A model is called nonlinear when its predicted output is a nonlinear function of the inputs. Note that a model can be linear in its parameters while remaining nonlinear in the inputs: polynomial regression of the form y = w0 + w1 x + w2 x^2 + w3 x^3 is fit by ordinary least squares, but the relationship between x and y is a cubic curve. Conversely, a single-layer neural network with a sigmoid activation is nonlinear in both inputs and parameters because the parameters multiply the inputs before passing through the nonlinearity.
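To see that linearity in the parameters is what makes ordinary least squares applicable, the sketch below solves the cubic regression from this paragraph as a plain least-squares problem over the expanded feature matrix [1, x, x^2, x^3] (numpy only; the coefficients and noise level are made up for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 100)
y = 1.0 - 2.0 * x + 0.5 * x**3 + rng.normal(scale=0.2, size=x.size)

# Design matrix of basis functions: the model is nonlinear in x
# but linear in the weights w, so ordinary least squares applies.
X = np.column_stack([np.ones_like(x), x, x**2, x**3])
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w)  # approximately [1.0, -2.0, 0.0, 0.5]
```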
Real-world data rarely lies on a hyperplane. House prices respond to square footage in a curved, eventually saturating way. Loan default risk depends on income, credit utilization, and length of history through interactions, not as a simple weighted sum. The pixel intensities of a cat photo bear no linear relationship to the label "cat." If a model is forced to assume a linear input-output mapping, it will systematically underfit problems with curvature, threshold effects, or feature interactions.
Three practical reasons nonlinearity matters in ML:

- Curvature: many real relationships bend or saturate, as with price versus square footage, and a straight line systematically misses them.
- Threshold effects: outputs that change regime once an input crosses a critical value cannot be expressed as a weighted sum.
- Feature interactions: when the effect of one feature depends on the value of another, as in the loan-default example above, an additive linear model cannot represent the dependence.
Most classical ML algorithms come in matched pairs: a fast, interpretable linear version and a more expensive, more flexible nonlinear cousin. The table below summarizes the main families.
| Task | Linear method | Nonlinear method |
|---|---|---|
| Regression | Linear regression, Ridge, Lasso | Polynomial regression, kernel ridge, random forest regression, gradient boosting regression, neural network regression, Gaussian process regression |
| Binary classification | Logistic regression, linear SVM | Kernel SVM, decision tree, random forest, gradient boosting (XGBoost, LightGBM), kNN, MLP |
| Dimensionality reduction | PCA, LDA | Kernel PCA, t-SNE, UMAP, autoencoders |
| Sequence modeling | Linear autoregression, ARIMA | RNN, LSTM, Transformer |
| Density estimation | Gaussian, mixtures of linear models | Kernel density estimation, normalizing flows, diffusion models |
Linear methods have three persistent advantages: they train fast, they are easy to regularize, and their coefficients are interpretable as marginal effects. Nonlinear methods give up some or all of those properties in exchange for the ability to fit data that the linear model would miss.
The theoretical justification for using nonlinearity in neural networks is the universal approximation theorem. George Cybenko proved in 1989 that a feed-forward network with a single hidden layer and a continuous sigmoidal activation can approximate any continuous function on a compact subset of R^n to arbitrary accuracy, given enough hidden units (Cybenko, 1989). Kurt Hornik extended the result in 1991, showing that the specific choice of activation matters little: any bounded, non-constant activation function will do (Hornik, 1991). Leshno and colleagues later relaxed the boundedness condition as well, showing that any non-polynomial activation suffices (Leshno et al., 1993).
The theorem is an existence result, not a constructive one. It does not say the required width is small, nor that gradient descent will find the right weights. What it does say is that nonlinearity is the ingredient that gives neural networks their representational power. Strip the nonlinearity out and the network collapses. A stack of N linear layers W_N W_{N-1} ... W_1 x is itself just a linear map W x with W = W_N W_{N-1} ... W_1, so depth alone buys nothing. The reason deep networks can represent things shallow networks cannot is that the nonlinearity between layers prevents the collapse and lets each new layer compose new features on top of the previous ones.
In neural networks the nonlinearity is usually introduced by an activation function applied element-wise to the output of each layer. The choice of activation is one of the most studied design decisions in deep learning, and the field has cycled through several dominant choices.
| Name | Formula | Range | Notes |
|---|---|---|---|
| Sigmoid | 1 / (1 + e^-x) | (0, 1) | Smooth, historical default, suffers vanishing gradients for large \|x\| |
| Tanh | (e^x - e^-x) / (e^x + e^-x) | (-1, 1) | Zero-centered version of sigmoid, still saturates |
| ReLU | max(0, x) | [0, ∞) | Cheap, sparse activations, can produce "dead" neurons that output zero forever |
| Leaky ReLU | max(αx, x), small α | (-∞, ∞) | Fixes dead-neuron problem with a small negative slope |
| PReLU | max(αx, x), α learned | (-∞, ∞) | Learnable leak, He et al., 2015 |
| ELU | x if x > 0, α(e^x - 1) otherwise | (-α, ∞) | Smooth negative side, Clevert et al., 2015 |
| GELU | x · Φ(x) | (-∞, ∞) | Φ is the standard normal CDF, used in BERT and GPT, Hendrycks & Gimpel, 2016 |
| Swish / SiLU | x · sigmoid(βx) | (-∞, ∞) | Smooth, non-monotonic, found by automated search, Ramachandran et al., 2017 |
| SwiGLU | (xW + b) ⊗ Swish(xV + c) | varies | Gated variant used in PaLM, LLaMA, Mistral, Shazeer, 2020 |
| Softmax | e^(x_i) / Σ_j e^(x_j) | (0, 1), sums to 1 | Vector-valued, used as the output layer for multi-class classification |
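As a reference, the scalar activations in the table are one-liners in code. The sketch below transcribes the formulas directly (numpy, plus scipy for the Gaussian CDF in GELU), including the max-subtraction trick conventionally used to keep softmax numerically stable.

```python
import numpy as np
from scipy.special import erf

def sigmoid(x):            return 1.0 / (1.0 + np.exp(-x))
def relu(x):               return np.maximum(0.0, x)
def leaky_relu(x, a=0.01): return np.maximum(a * x, x)
def elu(x, a=1.0):         return np.where(x > 0, x, a * (np.exp(x) - 1.0))
def gelu(x):               return x * 0.5 * (1.0 + erf(x / np.sqrt(2.0)))  # x * Phi(x)
def swish(x, beta=1.0):    return x * sigmoid(beta * x)

def softmax(x):
    # Subtracting the max does not change the output (softmax is
    # shift-invariant) but prevents overflow in the exponentials.
    z = np.exp(x - np.max(x))
    return z / z.sum()

print(softmax(np.array([1.0, 2.0, 3.0])))  # positive entries summing to 1
```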
The modern story starts in 2011, when Glorot, Bordes, and Bengio showed that ReLU trains deep networks faster and to better accuracy than sigmoid or tanh, in part by avoiding the vanishing gradient problem and producing genuinely sparse activations (Glorot, Bordes, & Bengio, 2011). ReLU became the default after the 2012 ImageNet result with AlexNet, and held that position for most of the next decade.
GELU, introduced by Dan Hendrycks and Kevin Gimpel in 2016, weights inputs by their value through the Gaussian CDF rather than gating them by sign. It became the standard activation in transformer language models including BERT and the GPT series (Hendrycks & Gimpel, 2016). Swish, also written x · sigmoid(βx), was discovered in 2017 by Ramachandran, Zoph, and Le using a reinforcement-learning-based search over candidate activations. It is smooth, non-monotonic, and unbounded above, and it tends to help on deeper networks. The same function is also called SiLU, and it appears in EfficientNet and many vision models (Ramachandran, Zoph, & Le, 2017).
In 2020 Noam Shazeer showed that gated variants of activations, where a Swish or GELU output is multiplied element-wise by a second linear projection of the same input, give consistent perplexity improvements in transformer feed-forward layers (Shazeer, 2020). The SwiGLU variant has since been adopted by PaLM, LLaMA, Mistral, and most current open-weight large language models. The cost is one extra matrix multiplication per feed-forward block, traded for measurable quality gains.
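A minimal sketch of a SwiGLU feed-forward block, following the gated structure described above. The toy dimensions, random weights, and omission of biases are assumptions for illustration, not a reproduction of any particular model's code.

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))  # Swish with beta = 1

def swiglu_ffn(x, W_gate, W_up, W_down):
    # Two parallel projections of the same input: one passes through
    # Swish and gates the other element-wise; a third projection maps
    # back to the model dimension. The gate is the extra matmul
    # mentioned in the text.
    return (silu(x @ W_gate) * (x @ W_up)) @ W_down

d_model, d_ff = 8, 32  # toy sizes
rng = np.random.default_rng(0)
x = rng.normal(size=(4, d_model))
W_gate = rng.normal(size=(d_model, d_ff))
W_up = rng.normal(size=(d_model, d_ff))
W_down = rng.normal(size=(d_ff, d_model))
print(swiglu_ffn(x, W_gate, W_up, W_down).shape)  # (4, 8)
```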
Softmax is a slightly different beast. It is not a pointwise nonlinearity but a vector normalization that turns logits into a probability distribution. Most multi-class classifiers and the attention mechanism in transformers use it.
A different way to inject nonlinearity, popular before deep learning, is the kernel trick. The idea, first used by Aizerman, Braverman, and Rozonoer in 1964 in the context of the potential function method, was popularized for SVMs by Boser, Guyon, and Vapnik at the 1992 Computational Learning Theory workshop (Boser, Guyon, & Vapnik, 1992). Many learning algorithms only ever access their training data through inner products x·y. If you replace every inner product with a kernel function k(x, y) = φ(x) · φ(y), where φ is some implicit nonlinear feature map, you fit a linear model in the φ-space without ever computing φ explicitly. The model is linear in the lifted features and nonlinear in the original inputs.
| Kernel | Formula | Implicit feature map |
|---|---|---|
| Linear | x · y | identity |
| Polynomial | (γ x · y + r)^d | all monomials of degree ≤ d |
| RBF / Gaussian | exp(-γ ‖x - y‖^2) | infinite-dimensional |
| Sigmoid | tanh(γ x · y + r) | related to single-layer neural net |
| Laplacian | exp(-γ ‖x - y‖) | similar to RBF, heavier tails |
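The identity k(x, y) = φ(x) · φ(y) can be verified by hand for the degree-2 polynomial kernel. The sketch below (numpy, two-dimensional inputs, γ = r = 1) writes out the implicit feature map explicitly and confirms that both sides agree.

```python
import numpy as np

def poly_kernel(x, y):
    return (x @ y + 1.0) ** 2  # (x . y + 1)^2: degree 2, gamma = r = 1

def phi(x):
    # Explicit feature map for the same kernel in 2 dimensions:
    # all monomials of degree <= 2, with sqrt(2) weights on cross terms.
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, x2**2,
                     np.sqrt(2) * x1 * x2])

x = np.array([0.5, -1.0])
y = np.array([2.0, 0.25])
print(poly_kernel(x, y), phi(x) @ phi(y))  # identical values
```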
The kernel trick made nonlinear SVMs practical and dominated tabular ML through the late 1990s and 2000s. It still shows up in Gaussian process regression, kernel ridge regression, and kernel PCA. Kernel methods scale poorly to huge datasets (training is O(n^2) or O(n^3) in the number of samples), which is one of the reasons neural networks displaced them once datasets and compute grew.
Tree-based models are nonlinear by construction. A single decision tree recursively partitions the input space along axis-aligned splits, then predicts a constant in each leaf. The resulting hypothesis is piecewise constant. Single trees overfit easily, but ensembles of trees are among the most reliable models for tabular data.
Random forests, introduced by Leo Breiman in 2001, average the predictions of many trees grown on bootstrap samples with random feature subsets. The averaging smooths the staircase shape of individual trees and dramatically lowers variance. Gradient boosting, formalized by Jerome Friedman in 2001, fits trees sequentially, each one trained to predict the residual error of the current ensemble. Modern implementations such as XGBoost, LightGBM, and CatBoost are the workhorses of Kaggle competitions and many production tabular pipelines.
Tree ensembles capture nonlinearities and feature interactions automatically without requiring the user to specify the form. They are robust to monotonic feature transformations, handle mixed categorical and numerical inputs, and need much less feature engineering than linear models. The tradeoff is that they do not extrapolate beyond the range of training data and they produce models that are harder to introspect than a regression coefficient table.
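A small comparison illustrates the point. The models, synthetic sine-shaped data, and seed below are arbitrary choices for illustration, not a benchmark.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(400, 1))
y = np.sin(2 * X[:, 0]) + rng.normal(scale=0.1, size=400)

linear = LinearRegression().fit(X, y)
boosted = GradientBoostingRegressor().fit(X, y)

# R^2 on the training signal: the linear score is near zero because
# the best single line through a sine wave is roughly flat, while the
# tree ensemble absorbs the curvature without being told its form.
print(linear.score(X, y), boosted.score(X, y))
```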
A classical way to add nonlinearity to a regression is basis expansion: replace the original feature x with a vector of basis functions [b1(x), b2(x), ...] and then fit a linear model on top. Polynomial features b_k(x) = x^k are the simplest case; piecewise-polynomial splines, especially natural cubic splines and B-splines, give a much better fit-to-flexibility tradeoff because each basis function acts only locally, so extra flexibility in one region does not induce wiggle everywhere else.
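In scikit-learn terms, basis expansion is a transformer followed by a linear model. The sketch below uses SplineTransformer (available from scikit-learn 1.0) with illustrative knot and degree settings on synthetic data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import SplineTransformer

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = np.log1p(X[:, 0]) + rng.normal(scale=0.05, size=300)

# Linear in the spline-basis coefficients, nonlinear in the raw feature.
model = make_pipeline(SplineTransformer(degree=3, n_knots=8),
                      LinearRegression())
model.fit(X, y)
print(model.score(X, y))
```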
Generalized additive models (GAMs), introduced by Trevor Hastie and Robert Tibshirani in 1986, generalize generalized linear models by replacing each linear term β_j x_j with a smooth function f_j(x_j), so that g(E[Y]) = α + Σ f_j(x_j) for some link function g (Hastie & Tibshirani, 1986). Each f_j is typically a smoothing spline or other nonparametric smoother. GAMs capture nonlinearity per feature while staying additive across features, which keeps them interpretable in a way that random forests and neural networks are not.
Gaussian processes offer yet another route: the model places a prior over functions and combines it with the data through the kernel function. The result is a nonparametric, nonlinear regression that also gives well-calibrated uncertainty estimates, at the cost of cubic-time matrix inversions.
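A minimal sketch with scikit-learn's GaussianProcessRegressor and an RBF kernel (default hyperparameters and toy data; real use would tune the kernel and its length scale):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(40, 1))
y = np.sin(3 * X[:, 0]) + rng.normal(scale=0.1, size=40)

# alpha adds observation noise to the kernel diagonal.
gp = GaussianProcessRegressor(kernel=RBF(), alpha=0.01).fit(X, y)
mean, std = gp.predict(np.array([[0.5]]), return_std=True)
print(mean, std)  # prediction plus an uncertainty estimate
```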
It is worth restating the point that depth without nonlinearity is useless. Suppose we stack two linear layers without an activation between them: h = W_1 x, y = W_2 h = W_2 W_1 x. The composition is just another linear map with weight matrix W = W_2 W_1, and the model has no more representational power than a single-layer linear model. Any two stacked linear transformations always collapse to one. The same logic applies to N layers.
Insert a nonlinearity σ between every pair of layers and the collapse no longer happens: y = W_2 σ(W_1 x) cannot in general be rewritten as a single matrix multiply. Each new layer composes a new nonlinear feature map on top of the previous one, and the representational capacity grows with depth. Empirically, deep ReLU networks can express piecewise linear functions whose number of linear regions grows exponentially in depth and only polynomially in width, which gives some intuition for why depth pays off (Montufar et al., 2014).
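Both the collapse and its prevention can be checked numerically in a few lines (random toy matrices):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(5, 3))
W2 = rng.normal(size=(2, 5))
x = rng.normal(size=3)

# Without an activation, two layers equal one layer with W = W2 @ W1.
print(np.allclose(W2 @ (W1 @ x), (W2 @ W1) @ x))       # True

# With a ReLU between them, no single matrix reproduces the map.
relu = lambda z: np.maximum(0.0, z)
print(np.allclose(W2 @ relu(W1 @ x), (W2 @ W1) @ x))   # False in general
```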
A common starting point for any regression or classification problem is a linear baseline. If the baseline underfits, the question is whether the residual structure is genuinely nonlinear or merely noisy. Several diagnostics help.
For regression, plot the residuals against each feature and against the predicted value. Random scatter around zero suggests the linear model has captured the systematic part of the signal. A visible curve, fan, or trend in the residuals signals unmodeled nonlinearity. For classification, fit a flexible model (e.g., a small gradient-boosted ensemble) and compare its held-out performance against the linear baseline. A large gap argues for nonlinearity in the data. Partial dependence plots and SHAP values then show which features carry the curvature.
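As a concrete version of the regression diagnostic, the sketch below fits a linear model to deliberately curved synthetic data and plots the residuals; the U shape that appears is the signature described above (scikit-learn and matplotlib, illustrative data).

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.3, size=300)  # curved ground truth

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

# A U-shaped residual cloud signals unmodeled curvature;
# random scatter around zero would clear the linear model.
plt.scatter(X[:, 0], residuals, s=8)
plt.axhline(0, color="k", lw=1)
plt.xlabel("feature"); plt.ylabel("residual")
plt.show()
```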
When the diagnosis points to nonlinearity, the standard escalation is:

1. Keep the linear model but expand the features: polynomial terms, splines, or hand-crafted interactions.
2. Switch to a tree ensemble such as a random forest or gradient boosting, which learns the nonlinear structure automatically.
3. Move to a neural network when the data is large or unstructured.
The progression is not strict. On well-structured tabular data, gradient boosting often beats both a heavily engineered linear model and a deep network. On image and language tasks, the linear baseline is rarely competitive at all, and the question is which neural architecture to use rather than whether to use one.
Imagine you have a bunch of dots on a piece of paper and you want to draw a single line that follows them as closely as possible. If the dots really do form a straight line, your job is easy. But if the dots form a curve, or a wave, or a cloud that swerves around, no straight line is going to do the job. You need a curve. Nonlinear methods in machine learning are the tools that draw the curves. Some of them stack many small curves together (neural networks). Some of them split the paper into rectangles and draw a flat patch in each one (decision trees). Some of them quietly lift the dots into a higher-dimensional space where a straight line works after all (kernel methods). The reason there are so many of them is that the world is shaped much more like a curve than like a straight line.