# Universal Approximation Theorem

> Source: https://aiwiki.ai/wiki/universal_approximation_theorem
> Updated: 2026-06-24
> Categories: Machine Learning
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

The universal approximation theorem states that a [feedforward neural network](/wiki/feedforward_neural_network_ffn) with a single hidden layer of finite width and a suitable nonlinear [activation function](/wiki/activation_function) can approximate any continuous function on a compact set to any desired accuracy. Formally, for a broad class of activations the set of functions such networks compute is dense in C(K), the space of continuous functions on a compact subset K of R^d under the supremum norm, so for every continuous target g, every compact domain, and every tolerance epsilon greater than zero there exists a finite network whose output stays within epsilon of g everywhere on the domain [1][2]. The result is a statement about expressive capacity: it guarantees that approximating weights exist, but says nothing about how many neurons are needed, whether training finds those weights, or how the network behaves outside the domain.

## What does the universal approximation theorem state?

The best-known classical form concerns a [neural network](/wiki/neural_network) with one hidden layer, also called a single-hidden-layer [multilayer perceptron](/wiki/multilayer_perceptron). For a broad class of nonlinear activations, the functions such networks compute are dense in C(K) [1][2]. Informally, for every continuous target g, every compact domain, and every tolerance epsilon greater than zero, there is some finite network of the form a_1 sigma(w_1 x + b_1) + ... + a_N sigma(w_N x + b_N), where each w_i x is an inner product between a weight vector and the input, whose output stays within epsilon of g everywhere on the domain. Nothing bounds the width N, which may grow without limit as the tolerance shrinks.

The theorem is among the most frequently invoked results in the theory of [deep learning](/wiki/deep_learning), and among the most frequently overread. It is a statement about expressive capacity: weights achieving a good approximation exist. It says nothing about whether training can find them, how large the network must be, how many samples learning requires, or how the trained network behaves outside its domain.

## Who proved the universal approximation theorem?

George Cybenko proved the canonical version in "Approximation by Superpositions of a Sigmoidal Function," published in Mathematics of Control, Signals and Systems in 1989 [1]. The paper's abstract states that "finite linear combinations of compositions of a fixed, univariate function and a set of affine functionals can uniformly approximate any continuous function of n real variables with support in the unit hypercube; only mild conditions are imposed on the univariate function" [1]. He showed the result holds whenever sigma is any continuous [sigmoidal](/wiki/sigmoid_function) function, meaning sigma(t) tends to 1 as t goes to positive infinity and to 0 as t goes to negative infinity. The proof argues by contradiction using the Hahn-Banach and Riesz representation theorems: if the networks were not dense, some nonzero signed measure would annihilate every network output, which the sigmoidal condition rules out. The argument establishes existence and constructs nothing.

In the same year Kurt Hornik, Maxwell Stinchcombe and Halbert White proved in Neural Networks that single-hidden-layer networks with arbitrary "squashing" activations can approximate any Borel measurable function between finite-dimensional Euclidean spaces to any desired accuracy, the paper whose title, "Multilayer feedforward networks are universal approximators," fixed the phrase "universal approximators" in the literature [2]. Ken-Ichi Funahashi published an independent density theorem for sigmoidal networks the same year [3]. In 1991 Hornik showed that the power lies in the architecture rather than in any particular activation: every continuous, bounded and nonconstant activation yields universal approximation on compact sets, with companion results in L^p norms [4]. Cybenko's and Hornik's papers are among the most cited results in neural network theory, each gathering several thousand citations.

The sharpest single-hidden-layer characterization came in 1993 from Moshe Leshno, Vladimir Lin, Allan Pinkus and Shimon Schocken: for locally bounded, piecewise continuous activations, a one-hidden-layer network is a universal approximator if and only if the activation is not almost everywhere equal to a polynomial [5]. Polynomial activations generate only polynomials of bounded degree; every other choice works, including the now-dominant [ReLU](/wiki/relu). Pinkus's 1999 Acta Numerica survey remains the standard account of the classical theory [6]. The results also have older mathematical ancestry: the [Stone-Weierstrass theorem](/wiki/stone_weierstrass_theorem) on the density of polynomials, and the 1957 [Kolmogorov-Arnold representation theorem](/wiki/kolmogorov_arnold_representation_theorem), an exact superposition result often cited as a precursor, which later lent its name to [Kolmogorov-Arnold networks](/wiki/kolmogorov_arnold_network).

## What are the modern width and depth variants?

The classical theorems fix depth at one hidden layer and let width grow without bound. A modern line of work asks the dual question: can networks of bounded width and unbounded depth approximate universally? Zhou Lu and coauthors answered yes in 2017, proving that "width-(n+4) ReLU networks, where n is the input dimension, are universal approximators" for any Lebesgue-integrable function on R^n in the L^1 sense, while width-n ReLU networks fail for almost every such target, "which exhibits a phase transition" at the input dimension [7]. Boris Hanin and Mark Sellke showed that for uniform approximation of continuous scalar-valued functions on the d-dimensional unit cube, the minimal ReLU width is exactly d + 1 [8]. Patrick Kidger and Terry Lyons generalized the activation in 2020: networks of width n + m + 2, with n inputs and m outputs, are dense in the continuous functions from a compact set to R^m for any nonaffine continuous activation that is continuously differentiable at at least one point with nonzero derivative there [9]. Sejun Park, Chulhee Yun, Jaeho Lee and Jinwoo Shin then pinned down the exact threshold for ReLU in 2021: L^p universal approximation of functions from R^n to R^m is possible precisely when the width is at least max(n + 1, m), while uniform approximation with ReLU alone obeys a different threshold [10].

The width and depth thresholds can be summarized compactly:

| Result | Year | Setting | Minimal width / size |
|---|---|---|---|
| Lu et al. [7] | 2017 | L^1 approximation, ReLU, input dim n | width n + 4 universal; width n fails (phase transition) |
| Hanin and Sellke [8] | 2017 | Uniform approximation, ReLU, scalar output on [0,1]^d | width exactly d + 1 |
| Kidger and Lyons [9] | 2020 | Continuous functions to R^m, n inputs, nonaffine activation | width n + m + 2 |
| Park et al. [10] | 2021 | L^p approximation, ReLU, R^n to R^m | width max(n + 1, m) |

A complementary line quantifies cost rather than possibility. Density says nothing about size, and the worst case is bleak: approximating a generic Lipschitz function on the d-dimensional cube to accuracy epsilon requires on the order of epsilon^(-d) parameters, an instance of the [curse of dimensionality](/wiki/curse_of_dimensionality) [6]. Andrew Barron identified an important exception in 1993: for functions whose Fourier transform has a finite first moment, a single-hidden-layer sigmoidal network with N units achieves L^2 error on the order of 1/sqrt(N), a rate independent of dimension [11]. Depth separation theorems sharpen the contrast between shallow and deep universality. Matus Telgarsky constructed, for every k, functions computed by ReLU networks with roughly k^3 layers and constant width that no network with on the order of k layers can approximate unless it has roughly 2^k units [12], and Ronen Eldan and Ohad Shamir exhibited a simple function expressible by a small three-layer network that no two-layer network can approximate to constant accuracy unless its width grows exponentially with the input dimension [13].

## Does universal approximation extend beyond feedforward networks?

Universality results now exist for most major architecture families. Tianping Chen and Hong Chen proved in 1995 that neural networks can universally approximate nonlinear continuous operators, maps whose inputs are themselves functions [14]. That theorem is the explicit foundation of [DeepONet](/wiki/deeponet), the operator-learning architecture published by Lu Lu and colleagues in Nature Machine Intelligence in 2021 [15], and corresponding universality and error bounds were proved for the [Fourier neural operator](/wiki/fourier_neural_operator) [16]. For sequence models, Chulhee Yun and coauthors showed in 2020 that [transformers](/wiki/transformer) are universal approximators of continuous permutation-equivariant sequence-to-sequence functions on compact domains, and of arbitrary continuous sequence-to-sequence functions once positional encodings are added [17]. Analogous theorems, each with architecture-specific conditions, cover [recurrent neural networks](/wiki/recurrent_neural_network) and [convolutional neural networks](/wiki/convolutional_neural_network).

## What does the universal approximation theorem NOT say?

The theorem is often cited as if it explained the practical success of deep learning. It does not, for several distinct reasons.

It is existential, not constructive. The classical proofs establish that approximating weights exist without providing any procedure for finding them or any bound on the number of units a given target requires [1][6].

It says nothing about optimization. Whether [backpropagation](/wiki/backpropagation) with [stochastic gradient descent](/wiki/stochastic_gradient_descent) will reach approximating weights is a separate question, and worst-case answers are negative: Avrim Blum and Ronald Rivest proved that even training a three-node network to fit a dataset exactly is NP-complete [18].

It says nothing about generalization. Approximation theory assumes the target function is fully known on its whole domain, while learning must infer it from finite samples. A class rich enough to approximate everything is also rich enough to fit noise, so statistical guarantees require separate tools.

The guaranteed network can be astronomically large. For broad function classes the required width grows exponentially in the input dimension, and the depth separation results show that choosing too shallow an architecture can incur exactly this exponential penalty [6][13].

Its guarantees are confined to compact domains. The theorem licenses accuracy inside a bounded region and implies nothing about extrapolation beyond it.

Universality is not distinctive. Polynomials, Fourier series, splines, kernel methods with universal kernels, and boosted decision trees are universal approximators in comparable senses, so universality alone cannot explain why deep networks outperform these alternatives in practice; explanations are sought instead in inductive bias, optimization dynamics, data, and scale.

## Why does the universal approximation theorem matter?

The classical theorems arrived at a pivotal moment. Marvin Minsky and Seymour Papert's 1969 analysis of the [perceptron](/wiki/perceptron) had shown that networks without hidden layers cannot represent simple predicates such as XOR, a critique often linked to the decline of neural network research in the 1970s. Cybenko, Hornik, Stinchcombe, White and Funahashi established that a single hidden layer removes the representational barrier entirely, just as backpropagation had made training multilayer networks practical. The results gave the field a durable framing: when a network underperforms, the explanation lies in data, optimization or scale rather than in any in-principle limit of the hypothesis class.

In contemporary research the theorem functions mostly as a baseline. A universality proof is a standard first checkbox for a new architecture family, as the DeepONet and transformer results illustrate, while the live questions concern efficiency and learnability: which functions a given depth and width can approximate at useful rates, and whether gradient-based training can actually reach them. The common summary is that universal approximation is close to necessary for a credible architecture but far from sufficient as an explanation of deep learning [6].

## References

1. Cybenko, G. (1989). "Approximation by Superpositions of a Sigmoidal Function." Mathematics of Control, Signals and Systems 2(4): 303-314. https://link.springer.com/article/10.1007/BF02551274
2. Hornik, K., Stinchcombe, M., White, H. (1989). "Multilayer feedforward networks are universal approximators." Neural Networks 2(5): 359-366. https://www.sciencedirect.com/science/article/abs/pii/0893608089900208
3. Funahashi, K. (1989). "On the approximate realization of continuous mappings by neural networks." Neural Networks 2(3): 183-192.
4. Hornik, K. (1991). "Approximation capabilities of multilayer feedforward networks." Neural Networks 4(2): 251-257. https://www.sciencedirect.com/science/article/abs/pii/089360809190009T
5. Leshno, M., Lin, V. Ya., Pinkus, A., Schocken, S. (1993). "Multilayer feedforward networks with a nonpolynomial activation function can approximate any function." Neural Networks 6(6): 861-867.
6. Pinkus, A. (1999). "Approximation theory of the MLP model in neural networks." Acta Numerica 8: 143-195.
7. Lu, Z., Pu, H., Wang, F., Hu, Z., Wang, L. (2017). "The Expressive Power of Neural Networks: A View from the Width." Advances in Neural Information Processing Systems 30 (NIPS 2017). https://arxiv.org/abs/1709.02540
8. Hanin, B., Sellke, M. (2017). "Approximating Continuous Functions by ReLU Nets of Minimal Width." arXiv preprint. https://arxiv.org/abs/1710.11278
9. Kidger, P., Lyons, T. (2020). "Universal Approximation with Deep Narrow Networks." Proceedings of the 33rd Conference on Learning Theory (COLT 2020), PMLR 125: 2306-2327. https://proceedings.mlr.press/v125/kidger20a.html
10. Park, S., Yun, C., Lee, J., Shin, J. (2021). "Minimum Width for Universal Approximation." International Conference on Learning Representations (ICLR 2021). https://arxiv.org/abs/2006.08859
11. Barron, A. R. (1993). "Universal approximation bounds for superpositions of a sigmoidal function." IEEE Transactions on Information Theory 39(3): 930-945.
12. Telgarsky, M. (2016). "Benefits of depth in neural networks." Proceedings of the 29th Conference on Learning Theory (COLT 2016), PMLR 49. https://proceedings.mlr.press/v49/telgarsky16.html
13. Eldan, R., Shamir, O. (2016). "The Power of Depth for Feedforward Neural Networks." Proceedings of the 29th Conference on Learning Theory (COLT 2016), PMLR 49. https://arxiv.org/abs/1512.03965
14. Chen, T., Chen, H. (1995). "Universal approximation to nonlinear operators by neural networks with arbitrary activation functions and its application to dynamical systems." IEEE Transactions on Neural Networks 6(4): 911-917.
15. Lu, L., Jin, P., Pang, G., Zhang, Z., Karniadakis, G. E. (2021). "Learning nonlinear operators via DeepONet based on the universal approximation theorem of operators." Nature Machine Intelligence 3: 218-229. https://www.nature.com/articles/s42256-021-00302-5
16. Kovachki, N., Lanthaler, S., Mishra, S. (2021). "On Universal Approximation and Error Bounds for Fourier Neural Operators." Journal of Machine Learning Research 22. https://www.jmlr.org/papers/volume22/21-0806/21-0806.pdf
17. Yun, C., Bhojanapalli, S., Rawat, A. S., Reddi, S. J., Kumar, S. (2020). "Are Transformers universal approximators of sequence-to-sequence functions?" International Conference on Learning Representations (ICLR 2020). https://arxiv.org/abs/1912.10077
18. Blum, A., Rivest, R. L. (1992). "Training a 3-node neural network is NP-complete." Neural Networks 5(1): 117-127. https://people.csail.mit.edu/rivest/pubs/BR93.pdf
