A Kolmogorov-Arnold Network (KAN) is a type of neural network architecture proposed as an alternative to the traditional Multi-Layer Perceptron (MLP). Introduced by Ziming Liu and collaborators from MIT, Caltech, Northeastern University, and the NSF Institute for Artificial Intelligence and Fundamental Interactions (IAIFI) in April 2024, KANs place learnable activation functions on the edges (connections) of the network rather than fixed activation functions on the nodes (neurons). This design is inspired by the Kolmogorov-Arnold representation theorem from 1957, a foundational result in approximation theory that guarantees any continuous multivariate function can be decomposed into a finite composition of continuous univariate functions and addition. The original KAN paper was later accepted as an oral presentation at ICLR 2025.
The mathematical foundation of KANs lies in the Kolmogorov-Arnold representation theorem, which emerged from work on Hilbert's 13th problem. In 1900, David Hilbert conjectured that there exist continuous functions of three variables that cannot be expressed as finite compositions of continuous functions of two variables. This conjecture was disproved through a series of results by Andrey Kolmogorov and his student Vladimir Arnold in 1956 and 1957.
Kolmogorov first proved in 1956 that any continuous function of several variables can be written as compositions of continuous functions of three variables. Arnold extended this result in 1957, reducing the requirement to functions of two variables. Kolmogorov then proved the definitive version later that year: any continuous function of n variables can be decomposed using only functions of a single variable and addition.
The formal statement of the theorem is as follows. For any continuous function f: [0,1]^n to R, there exist continuous one-dimensional outer functions Phi_q: R to R and continuous one-dimensional inner functions psi_{q,p}: [0,1] to R such that:
f(x_1, ..., x_n) = sum_{q=0}^{2n} Phi_q( sum_{p=1}^{n} psi_{q,p}(x_p) )
The theorem uses 2n+1 outer functions and n(2n+1) inner functions. The inner functions psi_{q,p} are independent of the target function f, while the outer functions Phi_q depend on f. This decomposition shows that multivariate function representation reduces entirely to learning univariate functions, which is the core insight that KANs exploit.
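The shape of this decomposition can be made concrete in code. The sketch below instantiates the two-layer form for n = 2 with arbitrary placeholder univariate functions (the true inner and outer functions for a given f are generally far less smooth); it only illustrates the structure of the computation and the 2n+1 and n(2n+1) function counts, not a decomposition of any particular target.

```python
import numpy as np

# Structural sketch of the Kolmogorov-Arnold decomposition for n = 2 inputs.
# The particular inner/outer functions below are illustrative stand-ins,
# not the true decomposition of any specific target f.

n = 2
num_outer = 2 * n + 1          # 2n+1 = 5 outer functions Phi_q
num_inner = n * (2 * n + 1)    # n(2n+1) = 10 inner functions psi_{q,p}

def psi(q, p, x):
    return np.sin((q + 1) * x + p)   # hypothetical inner function psi_{q,p}

def Phi(q, s):
    return np.tanh(s) / (q + 1)      # hypothetical outer function Phi_q

def ka_form(x):
    # f(x_1, ..., x_n) = sum_{q=0}^{2n} Phi_q( sum_{p=1}^{n} psi_{q,p}(x_p) )
    return sum(Phi(q, sum(psi(q, p, x[p]) for p in range(n)))
               for q in range(num_outer))

print(num_outer, num_inner)          # 5 10
print(ka_form(np.array([0.3, 0.7])))
```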
Historically, the Kolmogorov-Arnold theorem received limited attention in the machine learning community because the inner functions can be highly non-smooth (even fractal-like), which made practical implementation seem infeasible. The KAN paper overcomes this by generalizing the theorem to arbitrary widths and depths rather than strictly adhering to the original two-layer, (2n+1)-width structure.
In a standard MLP, each neuron (node) applies a fixed activation function (such as ReLU, sigmoid, or tanh) to a weighted sum of its inputs. The learnable parameters are the linear weights and biases on each connection.
In a KAN, the architecture is inverted: the learnable parameters are univariate functions placed on the edges, while each node simply sums its incoming values without applying any activation of its own.
A KAN layer with n_in input nodes and n_out output nodes is defined by a matrix of univariate functions {phi_{i,j}}, where i ranges over inputs and j ranges over outputs. The output of node j in the next layer is computed as:
x_{l+1, j} = sum_{i=1}^{n_l} phi_{l,i,j}(x_{l,i})
Each activation function phi is parameterized as:
phi(x) = w_b * b(x) + w_s * spline(x)
where b(x) is a fixed basis function (typically SiLU, defined as x/(1+e^{-x})), and spline(x) is a linear combination of B-spline basis functions with trainable coefficients c_i:
spline(x) = sum_i c_i * B_i(x)
The residual connection through b(x) helps stabilize training, similar to how residual connections work in other deep learning architectures. The full KAN is a composition of L such layers:
KAN(x) = (Phi_{L-1} compose Phi_{L-2} compose ... compose Phi_0)(x)
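A single KAN layer can be sketched as follows. For brevity this assumes degree-1 (piecewise-linear) B-splines on a uniform grid rather than the cubic splines used by default, and all names are illustrative rather than pykan's API.

```python
import numpy as np

# Minimal sketch of one KAN layer (n_in -> n_out). Each edge (i, j) carries
# phi(x) = w_b * silu(x) + w_s * spline(x); each output node sums its edges.

def silu(x):
    return x / (1.0 + np.exp(-x))          # b(x): the fixed basis function

class KANLayer:
    def __init__(self, n_in, n_out, G=5, lo=-1.0, hi=1.0, seed=0):
        rng = np.random.default_rng(seed)
        self.knots = np.linspace(lo, hi, G + 1)        # spline grid points
        self.h = self.knots[1] - self.knots[0]
        # One spline (G+1 hat-basis coefficients) per edge (i, j),
        # plus scalar weights w_b and w_s per edge.
        self.c = rng.normal(0.0, 0.1, (n_in, n_out, G + 1))
        self.w_b = np.ones((n_in, n_out))
        self.w_s = np.ones((n_in, n_out))

    def basis(self, x):
        # Degree-1 "hat" functions: B_i(x) = max(0, 1 - |x - t_i| / h)
        return np.maximum(0.0, 1.0 - np.abs(x[:, None] - self.knots) / self.h)

    def forward(self, x):                  # x: shape (n_in,)
        B = self.basis(x)                  # (n_in, G+1)
        spline = np.einsum('ig,ijg->ij', B, self.c)    # spline(x_i) per edge
        phi = self.w_b * silu(x)[:, None] + self.w_s * spline
        return phi.sum(axis=0)             # node j sums its incoming edges

layer = KANLayer(n_in=2, n_out=3)
out = layer.forward(np.array([0.2, -0.4]))
print(out.shape)   # (3,)
```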
B-splines (basis splines) serve as the default parameterization for the learnable activation functions in KANs. Each B-spline is defined over a grid of knot points. Key properties that make B-splines suitable for KANs include local support (changing one coefficient alters the curve only near the corresponding knots), smoothness (an order-k spline is C^{k-1} continuous), numerical stability, and the nestedness of spline spaces, which enables grid refinement.
The spline order k (typically k=3 for cubic splines) and the number of grid intervals G are hyperparameters that control the expressiveness of each activation function. The total number of parameters in a KAN with L layers and widths [n_0, n_1, ..., n_L] is approximately O(N^2 * L * G), where N is the typical layer width.
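The B-spline basis functions B_i(x) behind these hyperparameters can be evaluated with the standard Cox-de Boor recursion. The sketch below is illustrative; the uniform padded knot vector is one common choice.

```python
import numpy as np

# Cox-de Boor recursion: evaluate every degree-k B-spline basis function
# at a point x, given a knot vector. This is how spline(x) = sum_i c_i B_i(x)
# is computed in principle.

def bspline_basis(x, knots, k):
    # Degree 0: indicator of each knot interval.
    B = np.array([1.0 if knots[i] <= x < knots[i + 1] else 0.0
                  for i in range(len(knots) - 1)])
    for d in range(1, k + 1):
        nxt = np.zeros(len(knots) - 1 - d)
        for i in range(len(nxt)):
            left = right = 0.0
            if knots[i + d] != knots[i]:
                left = (x - knots[i]) / (knots[i + d] - knots[i]) * B[i]
            if knots[i + d + 1] != knots[i + 1]:
                right = ((knots[i + d + 1] - x)
                         / (knots[i + d + 1] - knots[i + 1]) * B[i + 1])
            nxt[i] = left + right
        B = nxt
    return B

# Cubic splines (k=3) over G=5 intervals on [0, 1], with k extra knots
# padded on each side; this yields G + k = 8 basis functions.
G, k = 5, 3
knots = np.linspace(-k / G, 1 + k / G, G + 2 * k + 1)
vals = bspline_basis(0.37, knots, k)
print(vals.sum())   # ~1.0: B-splines form a partition of unity on [0, 1]
```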
KAN architectures are described using a bracket notation listing the width of each layer. For example, a [2,5,1] KAN has 2 input nodes, a hidden layer of 5 nodes, and 1 output node, with 2x5 + 5x1 = 15 learnable activation functions.
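This arithmetic extends directly to parameter counts. The helper below assumes the common convention of G + k spline coefficients per edge plus the two scalar weights w_b and w_s; exact counts vary by implementation.

```python
# Count edges (learnable activation functions) and parameters for a KAN
# given its width list, grid size G, and spline order k. Assumes G + k
# spline coefficients plus w_b and w_s per edge (a common convention).

def kan_param_count(widths, G=5, k=3):
    edges = sum(widths[i] * widths[i + 1] for i in range(len(widths) - 1))
    return edges, edges * (G + k + 2)

edges, params = kan_param_count([2, 5, 1])
print(edges, params)   # 15 150: a [2,5,1] KAN has 2x5 + 5x1 = 15 edges
```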
One distinctive feature of KAN training is grid extension (also called grid refinement). As training progresses, the number of grid points defining each B-spline can be gradually increased. When the grid is refined from G_1 to G_2 intervals, new B-spline coefficients are computed by fitting the finer spline to match the coarser one. This allows a KAN to start training with a coarse representation and progressively increase its capacity, achieving higher accuracy without retraining from scratch. Because B-spline spaces are nested (a coarse spline can be exactly represented in a finer grid), no information is lost during grid extension.
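Grid extension amounts to a least-squares refit of the finer spline against the coarser one. The sketch below demonstrates this with piecewise-linear hat bases for brevity (the same refit applies to cubic B-splines) and confirms that nothing is lost when the coarse grid nests inside the fine one.

```python
import numpy as np

# Grid extension sketch: sample a coarse piecewise-linear spline densely,
# then solve for coefficients on a finer grid by least squares.

def hat_basis(x, knots):
    h = knots[1] - knots[0]
    return np.maximum(0.0, 1.0 - np.abs(x[:, None] - knots) / h)

rng = np.random.default_rng(0)
coarse_knots = np.linspace(0, 1, 6)       # G1 = 5 intervals
fine_knots = np.linspace(0, 1, 11)        # G2 = 10 intervals (nested)
c_coarse = rng.normal(size=6)

xs = np.linspace(0, 1, 200)
y = hat_basis(xs, coarse_knots) @ c_coarse
c_fine, *_ = np.linalg.lstsq(hat_basis(xs, fine_knots), y, rcond=None)

# Because the coarse knots are a subset of the fine knots, the fine
# spline reproduces the coarse one exactly (up to floating point).
err = np.max(np.abs(hat_basis(xs, fine_knots) @ c_fine - y))
print(err)   # ~0
```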
To improve interpretability, KANs use a combined sparsification and pruning strategy: during training, an L1 penalty on the average magnitude of each activation function, together with an entropy penalty, drives most activations toward zero; after training, nodes and edges whose activation scores fall below a threshold are pruned away, leaving a much smaller subnetwork.
The total training objective combines the prediction loss with the regularization terms:
loss_total = loss_pred + lambda * (mu_1 * sum|Phi_l|_1 + mu_2 * sum S(Phi_l))
where S denotes the entropy penalty.
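A minimal sketch of the two regularization terms, assuming `act` holds the batch-averaged magnitudes |phi| of one layer's activation functions (hypothetical values):

```python
import numpy as np

# Sparsification penalty sketch: L1 norm of a layer's activation magnitudes
# plus the entropy of their normalized distribution, mirroring the form
# lambda * (mu_1 * |Phi|_1 + mu_2 * S(Phi)) added to the prediction loss.

def kan_regularization(act, mu1=1.0, mu2=1.0):
    l1 = act.sum()                              # |Phi|_1: sum of magnitudes
    p = act / (l1 + 1e-12)                      # normalize to a distribution
    entropy = -(p * np.log(p + 1e-12)).sum()    # S(Phi): entropy penalty
    return mu1 * l1 + mu2 * entropy

act = np.array([[0.9, 0.01], [0.02, 0.8]])      # a mostly-sparse layer
print(kan_regularization(act))
```

A concentrated (sparse) set of activations has lower entropy than a uniform one, so both terms jointly favor networks in which only a few edges carry significant functions.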
After pruning, KANs offer a "symbolification" step where individual activation functions can be matched to known symbolic functions (such as sin, cos, exp, log, or polynomial functions). If an activation function closely matches a known symbolic form, it can be locked to that symbolic expression, converting the learned function into an exact formula. This is central to KANs' use in symbolic regression.
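A toy version of this matching step can be sketched as follows: fit each candidate g from a small library through an affine wrapper y ≈ a·g(b·x + c) + d and keep the one with the best coefficient of determination. The candidate set, grid search, and scoring below are illustrative, not pykan's actual symbolic search.

```python
import numpy as np

# Symbolification sketch: match a learned 1D activation against candidate
# symbolic forms and report the best-fitting one by R^2.

candidates = {'sin': np.sin, 'exp': np.exp, 'square': np.square}

def r2(y, yhat):
    return 1 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)

def best_symbolic(x, y, n_grid=11):
    best = (None, -np.inf)
    for name, g in candidates.items():
        # Crude grid search over the inner scale b and shift c; the outer
        # scale a and offset d are solved exactly by least squares.
        for b in np.linspace(0.5, 3.0, n_grid):
            for c in np.linspace(-1.0, 1.0, n_grid):
                A = np.stack([g(b * x + c), np.ones_like(x)], axis=1)
                coef, *_ = np.linalg.lstsq(A, y, rcond=None)
                score = r2(y, A @ coef)
                if score > best[1]:
                    best = (name, score)
    return best

x = np.linspace(-2, 2, 100)
y = 1.5 * np.sin(2.0 * x + 0.3) - 0.2       # stand-in for a learned phi
name, score = best_symbolic(x, y)
print(name, round(score, 3))                # identifies the sine form
```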
KANs are designed to be inherently interpretable. Because each activation function is a visualizable 1D curve, users can inspect every learned function in the network. The combination of pruning and symbolification allows trained KANs to be simplified into compact, human-readable mathematical expressions. In the original paper, the authors demonstrate that KANs can act as "collaborators" for scientists, helping to (re)discover mathematical and physical laws.
The original paper reports that KANs can achieve comparable or better accuracy than MLPs while using significantly fewer parameters. For example, on PDE-solving tasks, the authors reported that a 2-layer, width-10 KAN achieved 100 times better accuracy than a 4-layer, width-100 MLP (10^{-7} vs 10^{-5} MSE), while using 100 times fewer parameters (roughly 10^2 vs 10^4).
The paper claims that KANs exhibit faster neural scaling laws than MLPs. For functions with a compositional structure matching the KAN architecture, the theoretical scaling exponent is alpha = k+1, where k is the spline order. With the default cubic splines (k=3), this gives alpha = 4, meaning that the test loss decreases as a power law with exponent 4 as the number of parameters increases. The authors argue that this scaling advantage stems from KANs' ability to exploit the compositional structure of target functions.
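The practical implication of the claimed exponent is easy to compute:

```python
# Claimed scaling law: test loss ~ C * N^(-alpha) with alpha = k + 1.
# For cubic splines (k = 3), alpha = 4, so a 10x increase in parameter
# count N predicts a 10^4x reduction in test loss, all else being equal.

k = 3
alpha = k + 1
ratio = 10 ** alpha      # predicted loss reduction for 10x more parameters
print(alpha, ratio)      # 4 10000
```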
In August 2024, Ziming Liu, Pingchuan Ma, Yixuan Wang, Wojciech Matusik, and Max Tegmark released a follow-up paper titled "KAN 2.0: Kolmogorov-Arnold Networks Meet Science." This work was later published in Physical Review X in December 2025 (volume 15, article 041051). KAN 2.0 focuses on bridging KANs and scientific discovery, proposing a bidirectional framework: science-to-KAN (incorporating domain knowledge into KANs) and KAN-to-science (extracting scientific insights from trained KANs).
KAN 2.0 introduced three major functionalities to the pykan software library:
| Feature | Description |
|---|---|
| MultKAN | KANs augmented with multiplication nodes. In a MultKAN, some nodes perform addition (as in standard KANs) while others multiply k incoming sub-node values, enabling the network to represent multiplicative relationships more naturally. |
| Kanpiler | A KAN compiler that converts symbolic formulas into KAN architectures. The symbolic formula is parsed into an expression tree, which is then mapped onto a KAN structure. This compiled KAN can be fine-tuned with data. |
| Tree Converter | Converts a trained KAN (or any neural network) into a tree graph representation, which can then be translated into a symbolic formula. |
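The distinction between MultKAN's two node types reduces to a one-line difference, sketched below with illustrative helper names:

```python
import numpy as np

# MultKAN node types: an addition node sums its incoming sub-node values
# (as in a standard KAN), while a multiplication node multiplies them,
# letting products such as x * y be represented directly.

def addition_node(subnodes):
    return np.sum(subnodes)

def multiplication_node(subnodes):
    return np.prod(subnodes)

vals = np.array([2.0, 3.0])
print(addition_node(vals), multiplication_node(vals))   # 5.0 6.0
```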
The KAN 2.0 paper demonstrates applications across three pillars of scientific discovery: identifying relevant features in data, revealing modular structures, and discovering symbolic formulas.
Specific scientific applications demonstrated include discovering conserved quantities, Lagrangians, symmetries, and constitutive laws in physics.
KANs have found their most natural application in symbolic regression, where the goal is to discover closed-form mathematical expressions that fit observed data. The KAN architecture, with its decomposition into visualizable 1D functions and its symbolification capability, provides a built-in pathway from a trained neural network to an interpretable equation. The KAN-SR (KAN-guided Symbolic Regression) framework has been used to recover ground-truth equations from benchmark datasets such as the Feynman Symbolic Regression for Scientific Discovery dataset.
In the original paper, KANs were applied to two scientific discovery tasks: rediscovering relations between invariants in knot theory, and studying mobility edges in Anderson localization, a phase-transition phenomenon in condensed-matter physics.
Subsequent work has applied KANs to fluid mechanics, material science, power systems, network dynamics, transistor modeling, and thermoelectric materials design.
KANs have been applied to solving partial differential equations (PDEs), where their parameter efficiency provides advantages over MLPs. Physics-informed KANs combine the KAN architecture with physics-informed constraints, similar to Physics-Informed Neural Networks (PINNs) but potentially achieving higher accuracy with fewer parameters.
Researchers have explored KANs for genomic tasks, though results in this domain have been mixed. The high-dimensional nature of genomic data can be challenging for the B-spline-based architecture.
The following table summarizes the key architectural and practical differences between KANs and MLPs:
| Feature | KAN | MLP |
|---|---|---|
| Activation functions | Learnable (B-splines) on edges | Fixed (ReLU, sigmoid, etc.) on nodes |
| Node operation | Summation only | Weighted sum followed by activation |
| Learnable parameters | B-spline coefficients per edge | Weights and biases per connection |
| Theoretical basis | Kolmogorov-Arnold representation theorem | Universal approximation theorem |
| Interpretability | High (visualizable 1D functions, symbolification) | Low (opaque weight matrices) |
| Parameter efficiency (symbolic tasks) | Higher | Lower |
| Parameter efficiency (general tasks) | Comparable or lower | Comparable or higher |
| Training speed | Slower (roughly 10x for equal parameters) | Faster |
| GPU parallelization | Limited (diverse activation functions) | Efficient (uniform operations) |
| Parameter count | O(N^2 * L * G) | O(N^2 * L) |
| Grid refinement | Supported (unique to KANs) | Not applicable |
| Scalability to large models | Challenging | Well-established |
| Maturity of ecosystem | Early stage | Decades of optimization |
The most prominent critique of the original KAN paper came from Runpeng Yu, Weihao Yu, and Xinchao Wang in their July 2024 paper "KAN or MLP: A Fairer Comparison." Under controlled parameter and FLOP budgets, they found that MLPs outperformed KANs on machine learning, computer vision, NLP, and audio tasks, with KANs retaining an advantage only in symbolic formula representation. They further attributed that advantage largely to the B-spline activation function itself, showing that an MLP equipped with B-spline activations could close the gap on symbolic tasks, and reported that KANs suffered more severe catastrophic forgetting than MLPs in a standard class-incremental continual learning setting, contrary to the original paper's claim.
Several practical limitations have been identified by the research community: training is roughly an order of magnitude slower than for MLPs of equal parameter count, the diverse per-edge activation functions parallelize poorly on GPUs, B-spline-based architectures struggle with high-dimensional inputs, performance can be sensitive to the choice of grid size and spline order, and the surrounding software ecosystem is still at an early stage.
A comprehensive survey by Hou et al. (arXiv:2407.11075) critically assessed the claims, performance, and practical viability of KANs. The survey noted that while KANs represent an interesting theoretical contribution, their practical advantages over MLPs remain limited to specific domains, particularly low-dimensional scientific computing and symbolic regression tasks. For mainstream deep learning applications involving large-scale data and high-dimensional inputs, MLPs and their modern variants continue to dominate.
The modular nature of the KAN architecture, where the B-spline basis can be swapped for alternative function families, has led to a proliferation of KAN variants since the original paper:
| Variant | Basis Function | Key Advantage |
|---|---|---|
| Original KAN (Spl-KAN) | B-splines | Accuracy, local control, grid refinement |
| FourierKAN | Fourier series coefficients | Avoids grid boundary issues, faster computation |
| Wav-KAN | Wavelet functions | Better accuracy and speed than B-spline KANs |
| ChebyKAN | Chebyshev polynomials | Global orthogonality, strong spectral approximation |
| FastKAN | Gaussian radial basis functions (RBFs) | 1.25x faster forward+backward pass |
| FasterKAN | Reflection Switch Activation Function (RSWAF) | 3.33x faster than efficient-KAN |
| ReLU-KAN | ReLU combinations | 5-20x GPU speedup over spline-based KANs |
| BSRBF-KAN | B-splines + radial basis functions | Combines local and global approximation |
Many of these variants address the training speed bottleneck of the original B-spline-based KAN by using basis functions that are more amenable to GPU parallelization.
The Kolmogorov-Arnold Transformer (KAT), published at ICLR 2025, integrates KAN-style learnable activation functions into the Transformer architecture. KAT replaces the MLP layers in a standard Vision Transformer (ViT) with Group-Rational KAN (GR-KAN) layers, which use rational activation functions with shared coefficients among groups of edges. KAT-B achieved 82.3% top-1 accuracy on ImageNet-1K, surpassing ViT-B by 3.1 percentage points. When initialized with pre-trained ViT weights, KAT-B reached 82.7% accuracy. This work demonstrated that KAN principles can scale to large-scale vision tasks when combined with appropriate engineering.
The primary software library for KANs is pykan, developed by Ziming Liu and hosted on GitHub (github.com/KindXiaoming/pykan). As of 2025, the repository has over 16,000 GitHub stars and 1,600 forks. The library provides tools for constructing, training, visualizing, pruning, and symbolifying KAN models.
Several community-developed implementations address specific needs: efficient-kan provides a memory-efficient PyTorch reimplementation of the original architecture, while variants such as FastKAN, FasterKAN, and ReLU-KAN (listed in the table of variants above) replace the B-spline basis with cheaper alternatives to accelerate training.
The KAN paper generated significant excitement upon its release in April 2024, quickly becoming one of the most discussed machine learning papers of the year. The paper trended widely on social media and attracted attention from both the machine learning and physics communities.
The original paper was accepted as an oral presentation at ICLR 2025, one of the top machine learning conferences. As of early 2025, the paper had accumulated over 900 citations on Semantic Scholar, reflecting substantial follow-up research.
However, the reception has been mixed. Supporters highlight KANs' potential for interpretable scientific computing and their elegant connection to classical mathematics. Critics point to the limited practical advantages over MLPs on mainstream tasks, the training speed overhead, and concerns about the fairness of comparisons in the original paper. The community broadly agrees that KANs are most promising for scientific discovery and symbolic regression rather than as general-purpose replacements for MLPs.
The rapid development of KAN variants, the extension to Transformer architectures (KAT), and the publication of KAN 2.0 in Physical Review X suggest that KANs have established themselves as an active research direction, even if their ultimate impact on the broader field of deep learning remains an open question.