A Kolmogorov-Arnold Network (KAN) is a type of neural network architecture proposed as an alternative to the traditional Multi-Layer Perceptron (MLP). Introduced by Ziming Liu and collaborators from MIT, Caltech, Northeastern University, and the NSF Institute for Artificial Intelligence and Fundamental Interactions (IAIFI) in April 2024, KANs place learnable activation functions on the edges (connections) of the network rather than fixed activation functions on the nodes (neurons). This design is inspired by the Kolmogorov-Arnold representation theorem from 1957, a foundational result in approximation theory that guarantees any continuous multivariate function can be decomposed into a finite composition of continuous univariate functions and addition. The original KAN paper was later accepted as an oral presentation at ICLR 2025.
The mathematical foundation of KANs lies in the Kolmogorov-Arnold representation theorem, which emerged from work on Hilbert's 13th problem. In 1900, David Hilbert conjectured that there exist continuous functions of three variables that cannot be expressed as finite compositions of continuous functions of two variables. This conjecture was disproved through a series of results by Andrey Kolmogorov and his student Vladimir Arnold in 1956 and 1957.
Kolmogorov first proved in 1956 that any continuous function of several variables can be written as compositions of continuous functions of three variables. Arnold extended this result in 1957, reducing the requirement to functions of two variables. Kolmogorov then proved the definitive version later that year: any continuous function of n variables can be decomposed using only functions of a single variable and addition.
The formal statement of the theorem is as follows. For any continuous function f: [0,1]^n to R, there exist continuous one-dimensional outer functions Phi_q: R to R and continuous one-dimensional inner functions psi_{q,p}: [0,1] to R such that:
f(x_1, ..., x_n) = sum_{q=0}^{2n} Phi_q( sum_{p=1}^{n} psi_{q,p}(x_p) )
The theorem uses 2n+1 outer functions and n(2n+1) inner functions. The inner functions psi_{q,p} are independent of the target function f, while the outer functions Phi_q depend on f. This decomposition shows that multivariate function representation reduces entirely to learning univariate functions, which is the core insight that KANs exploit.
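The shape of this decomposition can be made concrete in code. The sketch below instantiates the two-layer form for n = 2 with arbitrary placeholder univariate functions (the true inner and outer functions for a given f are generally far less smooth); it only illustrates the structure of the computation and the 2n+1 and n(2n+1) function counts, not a decomposition of any particular target.

```python
import numpy as np

# Structural sketch of the Kolmogorov-Arnold decomposition for n = 2 inputs.
# The particular inner/outer functions below are illustrative stand-ins,
# not the true decomposition of any specific target f.

n = 2
num_outer = 2 * n + 1          # 2n+1 = 5 outer functions Phi_q
num_inner = n * (2 * n + 1)    # n(2n+1) = 10 inner functions psi_{q,p}

def psi(q, p, x):
    return np.sin((q + 1) * x + p)   # hypothetical inner function psi_{q,p}

def Phi(q, s):
    return np.tanh(s) / (q + 1)      # hypothetical outer function Phi_q

def ka_form(x):
    # f(x_1, ..., x_n) = sum_{q=0}^{2n} Phi_q( sum_{p=1}^{n} psi_{q,p}(x_p) )
    return sum(Phi(q, sum(psi(q, p, x[p]) for p in range(n)))
               for q in range(num_outer))

print(num_outer, num_inner)          # 5 10
print(ka_form(np.array([0.3, 0.7])))
```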
Historically, the Kolmogorov-Arnold theorem received limited attention in the machine learning community because the inner functions can be highly non-smooth (even fractal-like), which made practical implementation seem infeasible. The KAN paper overcomes this by generalizing the theorem to arbitrary widths and depths rather than strictly adhering to the original two-layer, (2n+1)-width structure.
In a standard MLP, each neuron (node) applies a fixed activation function (such as ReLU, sigmoid, or tanh) to a weighted sum of its inputs. The learnable parameters are the linear weights and biases on each connection.
In a KAN, the architecture is inverted: the learnable parameters are univariate functions placed on the edges, while each node simply sums its incoming values without applying any activation of its own.
A KAN layer with n_in input nodes and n_out output nodes is defined by a matrix of univariate functions {phi_{i,j}}, where i ranges over inputs and j ranges over outputs. The output of node j in the next layer is computed as:
x_{l+1, j} = sum_{i=1}^{n_l} phi_{l,i,j}(x_{l,i})
Each activation function phi is parameterized as:
phi(x) = w_b * b(x) + w_s * spline(x)
where b(x) is a fixed basis function (typically SiLU, defined as x/(1+e^{-x})), and spline(x) is a linear combination of B-spline basis functions with trainable coefficients c_i:
spline(x) = sum_i c_i * B_i(x)
The residual connection through b(x) helps stabilize training, similar to how residual connections work in other deep learning architectures. The full KAN is a composition of L such layers:
KAN(x) = (Phi_{L-1} compose Phi_{L-2} compose ... compose Phi_0)(x)
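A single KAN layer can be sketched as follows. For brevity this assumes degree-1 (piecewise-linear) B-splines on a uniform grid rather than the cubic splines used by default, and all names are illustrative rather than pykan's API.

```python
import numpy as np

# Minimal sketch of one KAN layer (n_in -> n_out). Each edge (i, j) carries
# phi(x) = w_b * silu(x) + w_s * spline(x); each output node sums its edges.

def silu(x):
    return x / (1.0 + np.exp(-x))          # b(x): the fixed basis function

class KANLayer:
    def __init__(self, n_in, n_out, G=5, lo=-1.0, hi=1.0, seed=0):
        rng = np.random.default_rng(seed)
        self.knots = np.linspace(lo, hi, G + 1)        # spline grid points
        self.h = self.knots[1] - self.knots[0]
        # One spline (G+1 hat-basis coefficients) per edge (i, j),
        # plus scalar weights w_b and w_s per edge.
        self.c = rng.normal(0.0, 0.1, (n_in, n_out, G + 1))
        self.w_b = np.ones((n_in, n_out))
        self.w_s = np.ones((n_in, n_out))

    def basis(self, x):
        # Degree-1 "hat" functions: B_i(x) = max(0, 1 - |x - t_i| / h)
        return np.maximum(0.0, 1.0 - np.abs(x[:, None] - self.knots) / self.h)

    def forward(self, x):                  # x: shape (n_in,)
        B = self.basis(x)                  # (n_in, G+1)
        spline = np.einsum('ig,ijg->ij', B, self.c)    # spline(x_i) per edge
        phi = self.w_b * silu(x)[:, None] + self.w_s * spline
        return phi.sum(axis=0)             # node j sums its incoming edges

layer = KANLayer(n_in=2, n_out=3)
out = layer.forward(np.array([0.2, -0.4]))
print(out.shape)   # (3,)
```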
B-splines (basis splines) serve as the default parameterization for the learnable activation functions in KANs. Each B-spline is defined over a grid of knot points. Key properties that make B-splines suitable for KANs include local support (changing one coefficient alters the curve only near the corresponding knots), smoothness (an order-k spline is C^{k-1} continuous), numerical stability, and the nestedness of spline spaces, which enables grid refinement.
The spline order k (typically k=3 for cubic splines) and the number of grid intervals G are hyperparameters that control the expressiveness of each activation function. The total number of parameters in a KAN with L layers and widths [n_0, n_1, ..., n_L] is approximately O(N^2 * L * G), where N is the typical layer width.
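The B-spline basis functions B_i(x) behind these hyperparameters can be evaluated with the standard Cox-de Boor recursion. The sketch below is illustrative; the uniform padded knot vector is one common choice.

```python
import numpy as np

# Cox-de Boor recursion: evaluate every degree-k B-spline basis function
# at a point x, given a knot vector. This is how spline(x) = sum_i c_i B_i(x)
# is computed in principle.

def bspline_basis(x, knots, k):
    # Degree 0: indicator of each knot interval.
    B = np.array([1.0 if knots[i] <= x < knots[i + 1] else 0.0
                  for i in range(len(knots) - 1)])
    for d in range(1, k + 1):
        nxt = np.zeros(len(knots) - 1 - d)
        for i in range(len(nxt)):
            left = right = 0.0
            if knots[i + d] != knots[i]:
                left = (x - knots[i]) / (knots[i + d] - knots[i]) * B[i]
            if knots[i + d + 1] != knots[i + 1]:
                right = ((knots[i + d + 1] - x)
                         / (knots[i + d + 1] - knots[i + 1]) * B[i + 1])
            nxt[i] = left + right
        B = nxt
    return B

# Cubic splines (k=3) over G=5 intervals on [0, 1], with k extra knots
# padded on each side; this yields G + k = 8 basis functions.
G, k = 5, 3
knots = np.linspace(-k / G, 1 + k / G, G + 2 * k + 1)
vals = bspline_basis(0.37, knots, k)
print(vals.sum())   # ~1.0: B-splines form a partition of unity on [0, 1]
```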
KAN architectures are described using a bracket notation listing the width of each layer. For example, a [2,5,1] KAN has 2 input nodes, a hidden layer of 5 nodes, and 1 output node, with 2x5 + 5x1 = 15 learnable activation functions.
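This arithmetic extends directly to parameter counts. The helper below assumes the common convention of G + k spline coefficients per edge plus the two scalar weights w_b and w_s; exact counts vary by implementation.

```python
# Count edges (learnable activation functions) and parameters for a KAN
# given its width list, grid size G, and spline order k. Assumes G + k
# spline coefficients plus w_b and w_s per edge (a common convention).

def kan_param_count(widths, G=5, k=3):
    edges = sum(widths[i] * widths[i + 1] for i in range(len(widths) - 1))
    return edges, edges * (G + k + 2)

edges, params = kan_param_count([2, 5, 1])
print(edges, params)   # 15 150: a [2,5,1] KAN has 2x5 + 5x1 = 15 edges
```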
One distinctive feature of KAN training is grid extension (also called grid refinement). As training progresses, the number of grid points defining each B-spline can be gradually increased. When the grid is refined from G_1 to G_2 intervals, new B-spline coefficients are computed by fitting the finer spline to match the coarser one. This allows a KAN to start training with a coarse representation and progressively increase its capacity, achieving higher accuracy without retraining from scratch. Because B-spline spaces are nested (a coarse spline can be exactly represented in a finer grid), no information is lost during grid extension.
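Grid extension amounts to a least-squares refit of the finer spline against the coarser one. The sketch below demonstrates this with piecewise-linear hat bases for brevity (the same refit applies to cubic B-splines) and confirms that nothing is lost when the coarse grid nests inside the fine one.

```python
import numpy as np

# Grid extension sketch: sample a coarse piecewise-linear spline densely,
# then solve for coefficients on a finer grid by least squares.

def hat_basis(x, knots):
    h = knots[1] - knots[0]
    return np.maximum(0.0, 1.0 - np.abs(x[:, None] - knots) / h)

rng = np.random.default_rng(0)
coarse_knots = np.linspace(0, 1, 6)       # G1 = 5 intervals
fine_knots = np.linspace(0, 1, 11)        # G2 = 10 intervals (nested)
c_coarse = rng.normal(size=6)

xs = np.linspace(0, 1, 200)
y = hat_basis(xs, coarse_knots) @ c_coarse
c_fine, *_ = np.linalg.lstsq(hat_basis(xs, fine_knots), y, rcond=None)

# Because the coarse knots are a subset of the fine knots, the fine
# spline reproduces the coarse one exactly (up to floating point).
err = np.max(np.abs(hat_basis(xs, fine_knots) @ c_fine - y))
print(err)   # ~0
```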
To improve interpretability, KANs use a combined sparsification and pruning strategy: during training, an L1 penalty on the average magnitude of each activation function, together with an entropy penalty, drives most activations toward zero; after training, nodes and edges whose activation scores fall below a threshold are pruned away, leaving a much smaller subnetwork.
The total training objective combines the prediction loss with the regularization terms:
loss_total = loss_pred + lambda * (mu_1 * sum|Phi_l|_1 + mu_2 * sum S(Phi_l))
where S denotes the entropy penalty.
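A minimal sketch of the two regularization terms, assuming `act` holds the batch-averaged magnitudes |phi| of one layer's activation functions (hypothetical values):

```python
import numpy as np

# Sparsification penalty sketch: L1 norm of a layer's activation magnitudes
# plus the entropy of their normalized distribution, mirroring the form
# lambda * (mu_1 * |Phi|_1 + mu_2 * S(Phi)) added to the prediction loss.

def kan_regularization(act, mu1=1.0, mu2=1.0):
    l1 = act.sum()                              # |Phi|_1: sum of magnitudes
    p = act / (l1 + 1e-12)                      # normalize to a distribution
    entropy = -(p * np.log(p + 1e-12)).sum()    # S(Phi): entropy penalty
    return mu1 * l1 + mu2 * entropy

act = np.array([[0.9, 0.01], [0.02, 0.8]])      # a mostly-sparse layer
print(kan_regularization(act))
```

A concentrated (sparse) set of activations has lower entropy than a uniform one, so both terms jointly favor networks in which only a few edges carry significant functions.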
After pruning, KANs offer a "symbolification" step where individual activation functions can be matched to known symbolic functions (such as sin, cos, exp, log, or polynomial functions). If an activation function closely matches a known symbolic form, it can be locked to that symbolic expression, converting the learned function into an exact formula. This is central to KANs' use in symbolic regression.
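A toy version of this matching step can be sketched as follows: fit each candidate g from a small library through an affine wrapper y ≈ a·g(b·x + c) + d and keep the one with the best coefficient of determination. The candidate set, grid search, and scoring below are illustrative, not pykan's actual symbolic search.

```python
import numpy as np

# Symbolification sketch: match a learned 1D activation against candidate
# symbolic forms and report the best-fitting one by R^2.

candidates = {'sin': np.sin, 'exp': np.exp, 'square': np.square}

def r2(y, yhat):
    return 1 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)

def best_symbolic(x, y, n_grid=11):
    best = (None, -np.inf)
    for name, g in candidates.items():
        # Crude grid search over the inner scale b and shift c; the outer
        # scale a and offset d are solved exactly by least squares.
        for b in np.linspace(0.5, 3.0, n_grid):
            for c in np.linspace(-1.0, 1.0, n_grid):
                A = np.stack([g(b * x + c), np.ones_like(x)], axis=1)
                coef, *_ = np.linalg.lstsq(A, y, rcond=None)
                score = r2(y, A @ coef)
                if score > best[1]:
                    best = (name, score)
    return best

x = np.linspace(-2, 2, 100)
y = 1.5 * np.sin(2.0 * x + 0.3) - 0.2       # stand-in for a learned phi
name, score = best_symbolic(x, y)
print(name, round(score, 3))                # identifies the sine form
```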
KANs are designed to be inherently interpretable. Because each activation function is a visualizable 1D curve, users can inspect every learned function in the network. The combination of pruning and symbolification allows trained KANs to be simplified into compact, human-readable mathematical expressions. In the original paper, the authors demonstrate that KANs can act as "collaborators" for scientists, helping to (re)discover mathematical and physical laws.
The original paper reports that KANs can achieve comparable or better accuracy than MLPs while using significantly fewer parameters. For example, on PDE-solving tasks, the authors reported that a 2-layer, width-10 KAN achieved 100 times better accuracy than a 4-layer, width-100 MLP (10^{-7} vs 10^{-5} MSE), while using 100 times fewer parameters (roughly 10^2 vs 10^4).
The paper claims that KANs exhibit faster neural scaling laws than MLPs. For functions with a compositional structure matching the KAN architecture, the theoretical scaling exponent is alpha = k+1, where k is the spline order. With the default cubic splines (k=3), this gives alpha = 4, meaning that the test loss decreases as a power law with exponent 4 as the number of parameters increases. The authors argue that this scaling advantage stems from KANs' ability to exploit the compositional structure of target functions.
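The practical implication of the claimed exponent is easy to compute:

```python
# Claimed scaling law: test loss ~ C * N^(-alpha) with alpha = k + 1.
# For cubic splines (k = 3), alpha = 4, so a 10x increase in parameter
# count N predicts a 10^4x reduction in test loss, all else being equal.

k = 3
alpha = k + 1
ratio = 10 ** alpha      # predicted loss reduction for 10x more parameters
print(alpha, ratio)      # 4 10000
```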
In August 2024, Ziming Liu, Pingchuan Ma, Yixuan Wang, Wojciech Matusik, and Max Tegmark released a follow-up paper titled "KAN 2.0: Kolmogorov-Arnold Networks Meet Science." This work was later published in Physical Review X in December 2025 (volume 15, article 041051). KAN 2.0 focuses on bridging KANs and scientific discovery, proposing a bidirectional framework: science-to-KAN (incorporating domain knowledge into KANs) and KAN-to-science (extracting scientific insights from trained KANs).
KAN 2.0 introduced three major functionalities to the pykan software library:
| Feature | Description |
|---|---|
| MultKAN | KANs augmented with multiplication nodes. In a MultKAN, some nodes perform addition (as in standard KANs) while others multiply k incoming sub-node values, enabling the network to represent multiplicative relationships more naturally. |
| Kanpiler | A KAN compiler that converts symbolic formulas into KAN architectures. The symbolic formula is parsed into an expression tree, which is then mapped onto a KAN structure. This compiled KAN can be fine-tuned with data. |
| Tree Converter | Converts a trained KAN (or any neural network) into a tree graph representation, which can then be translated into a symbolic formula. |
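The distinction between MultKAN's two node types reduces to a one-line difference, sketched below with illustrative helper names:

```python
import numpy as np

# MultKAN node types: an addition node sums its incoming sub-node values
# (as in a standard KAN), while a multiplication node multiplies them,
# letting products such as x * y be represented directly.

def addition_node(subnodes):
    return np.sum(subnodes)

def multiplication_node(subnodes):
    return np.prod(subnodes)

vals = np.array([2.0, 3.0])
print(addition_node(vals), multiplication_node(vals))   # 5.0 6.0
```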
The KAN 2.0 paper demonstrates applications across three pillars of scientific discovery: identifying relevant features in data, revealing modular structures, and discovering symbolic formulas.
Specific scientific applications demonstrated include discovering conserved quantities, Lagrangians, symmetries, and constitutive laws in physics.
KANs have found their most natural application in symbolic regression, where the goal is to discover closed-form mathematical expressions that fit observed data. The KAN architecture, with its decomposition into visualizable 1D functions and its symbolification capability, provides a built-in pathway from a trained neural network to an interpretable equation. The KAN-SR (KAN-guided Symbolic Regression) framework has been used to recover ground-truth equations from benchmark datasets such as the Feynman Symbolic Regression for Scientific Discovery dataset.
In the original paper, KANs were applied to two scientific discovery tasks: rediscovering relations between invariants in knot theory, and studying mobility edges in Anderson localization, a phase-transition phenomenon in condensed-matter physics.
Subsequent work has applied KANs to fluid mechanics, material science, power systems, network dynamics, transistor modeling, and thermoelectric materials design.
KANs have been applied to solving partial differential equations (PDEs), where their parameter efficiency provides advantages over MLPs. Physics-informed KANs combine the KAN architecture with physics-informed constraints, similar to Physics-Informed Neural Networks (PINNs) but potentially achieving higher accuracy with fewer parameters.
Researchers have explored KANs for genomic tasks, though results in this domain have been mixed. The high-dimensional nature of genomic data can be challenging for the B-spline-based architecture.
The following table summarizes the key architectural and practical differences between KANs and MLPs:
| Feature | KAN | MLP |
|---|---|---|
| Activation functions | Learnable (B-splines) on edges | Fixed (ReLU, sigmoid, etc.) on nodes |
| Node operation | Summation only | Weighted sum followed by activation |
| Learnable parameters | B-spline coefficients per edge | Weights and biases per connection |
| Theoretical basis | Kolmogorov-Arnold representation theorem | Universal approximation theorem |
| Interpretability | High (visualizable 1D functions, symbolification) | Low (opaque weight matrices) |
| Parameter efficiency (symbolic tasks) | Higher | Lower |
| Parameter efficiency (general tasks) | Comparable or lower | Comparable or higher |
| Training speed | Slower (roughly 10x for equal parameters) | Faster |
| GPU parallelization | Limited (diverse activation functions) | Efficient (uniform operations) |
| Parameter count | O(N^2 * L * G) | O(N^2 * L) |
| Grid refinement | Supported (unique to KANs) | Not applicable |
| Scalability to large models | Challenging | Well-established |
| Maturity of ecosystem | Early stage | Decades of optimization |
The most prominent critique of the original KAN paper came from Runpeng Yu, Weihao Yu, and Xinchao Wang in their July 2024 paper "KAN or MLP: A Fairer Comparison." Under controlled parameter and FLOP budgets, they found that MLPs outperformed KANs on machine learning, computer vision, NLP, and audio tasks, with KANs retaining an advantage only in symbolic formula representation. They further attributed that advantage largely to the B-spline activation function itself, showing that an MLP equipped with B-spline activations could close the gap on symbolic tasks, and reported that KANs suffered more severe catastrophic forgetting than MLPs in a standard class-incremental continual learning setting, contrary to the original paper's claim.
Several practical limitations have been identified by the research community: training is roughly an order of magnitude slower than for MLPs of equal parameter count, the diverse per-edge activation functions parallelize poorly on GPUs, B-spline-based architectures struggle with high-dimensional inputs, performance can be sensitive to the choice of grid size and spline order, and the surrounding software ecosystem is still at an early stage.
A comprehensive survey by Hou et al. (arXiv:2407.11075) critically assessed the claims, performance, and practical viability of KANs. The survey noted that while KANs represent an interesting theoretical contribution, their practical advantages over MLPs remain limited to specific domains, particularly low-dimensional scientific computing and symbolic regression tasks. For mainstream deep learning applications involving large-scale data and high-dimensional inputs, MLPs and their modern variants continue to dominate.
The modular nature of the KAN architecture, where the B-spline basis can be swapped for alternative function families, has led to a proliferation of KAN variants since the original paper:
| Variant | Basis Function | Key Advantage |
|---|---|---|
| Original KAN (Spl-KAN) | B-splines | Accuracy, local control, grid refinement |
| FourierKAN | Fourier series coefficients | Avoids grid boundary issues, faster computation |
| Wav-KAN | Wavelet functions | Better accuracy and speed than B-spline KANs |
| ChebyKAN | Chebyshev polynomials | Global orthogonality, strong spectral approximation |
| FastKAN | Gaussian radial basis functions (RBFs) | 1.25x faster forward+backward pass |
| FasterKAN | Reflection Switch Activation Function (RSWAF) | 3.33x faster than efficient-KAN |
| ReLU-KAN | ReLU combinations | 5-20x GPU speedup over spline-based KANs |
| BSRBF-KAN | B-splines + radial basis functions | Combines local and global approximation |
Many of these variants address the training speed bottleneck of the original B-spline-based KAN by using basis functions that are more amenable to GPU parallelization.
The Kolmogorov-Arnold Transformer (KAT), published at ICLR 2025, integrates KAN-style learnable activation functions into the Transformer architecture. KAT replaces the MLP layers in a standard Vision Transformer (ViT) with Group-Rational KAN (GR-KAN) layers, which use rational activation functions with shared coefficients among groups of edges. KAT-B achieved 82.3% top-1 accuracy on ImageNet-1K, surpassing ViT-B by 3.1 percentage points. When initialized with pre-trained ViT weights, KAT-B reached 82.7% accuracy. This work demonstrated that KAN principles can scale to large-scale vision tasks when combined with appropriate engineering.
The primary software library for KANs is pykan, developed by Ziming Liu and hosted on GitHub (github.com/KindXiaoming/pykan). As of 2025, the repository has over 16,000 GitHub stars and 1,600 forks. The library provides tools for constructing, training, visualizing, pruning, and symbolifying KAN models.
Several community-developed implementations address specific needs: efficient-kan provides a memory-efficient PyTorch reimplementation of the original architecture, while variants such as FastKAN, FasterKAN, and ReLU-KAN (listed in the table of variants above) replace the B-spline basis with cheaper alternatives to accelerate training.
The KAN paper generated significant excitement upon its release in April 2024, quickly becoming one of the most discussed machine learning papers of the year. The paper trended widely on social media and attracted attention from both the machine learning and physics communities.
The original paper was accepted as an oral presentation at ICLR 2025, one of the top machine learning conferences. As of early 2025, the paper had accumulated over 900 citations on Semantic Scholar, reflecting substantial follow-up research.
However, the reception has been mixed. Supporters highlight KANs' potential for interpretable scientific computing and their elegant connection to classical mathematics. Critics point to the limited practical advantages over MLPs on mainstream tasks, the training speed overhead, and concerns about the fairness of comparisons in the original paper. The community broadly agrees that KANs are most promising for scientific discovery and symbolic regression rather than as general-purpose replacements for MLPs.
The rapid development of KAN variants, the extension to Transformer architectures (KAT), and the publication of KAN 2.0 in Physical Review X suggest that KANs have established themselves as an active research direction, even if their ultimate impact on the broader field of deep learning remains an open question.