# Kernel Support Vector Machines (KSVMs)

> Source: https://aiwiki.ai/wiki/kernel_support_vector_machines_ksvms
> Updated: 2026-07-12
> Categories: Machine Learning
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

*See also: [Machine learning terms](/wiki/machine_learning_terms)*

## What is a kernel support vector machine?

A kernel support vector machine (kernel SVM, or KSVM) is a [supervised learning](/wiki/supervised_learning) algorithm that finds the maximum-margin boundary between two classes after implicitly mapping the data into a high-dimensional feature space through a [kernel function](/wiki/kernel_function), so that a linear separator in that space corresponds to a non-linear decision boundary in the original input space. Kernel SVMs combine the maximum-margin principle of [support vector machines](/wiki/support_vector_machine_svm) with the kernel trick, and were introduced by Bernhard Boser, Isabelle Guyon, and Vladimir Vapnik at the 1992 Computational Learning Theory (COLT) conference [3] and given their soft-margin form by Corinna Cortes and Vladimir Vapnik in 1995 [4]. They were the dominant approach to non-linear classification and regression in [machine learning](/wiki/machine_learning) throughout the 1990s and 2000s, before [deep neural networks](/wiki/deep_learning) overtook them on perceptual tasks. The 1995 paper that introduced the soft margin, Support-vector networks, has been cited more than 45,000 times (Semantic Scholar, 2026), making it one of the most cited works in machine learning [4]. Kernel SVMs remain a workhorse for high-dimensional, small-sample problems in bioinformatics, chemoinformatics, text classification, and any setting where the training set is too small to fit a deep network reliably.

The central idea is simple. A linear SVM searches for a hyperplane that separates two classes with the largest possible margin. Many real datasets are not linearly separable in their original input space, so a kernel SVM implicitly maps the data into a high-dimensional, possibly infinite-dimensional feature space, and constructs the maximum-margin hyperplane in that space. Crucially, the algorithm never needs to compute the mapped vectors explicitly. It only needs to evaluate the inner product between pairs of mapped vectors, and that inner product is given by a kernel function defined directly on the original inputs. This substitution is known as the [kernel trick](/wiki/kernel_trick).

## Mathematical foundation

### Maximum-margin classifier

Consider a binary classification problem with training data (x₁, y₁), ..., (xₙ, yₙ), where each input xᵢ lies in ℝᵈ and each label yᵢ is either -1 or +1. A linear classifier predicts the sign of f(x) = wᵀx + b. If the data is linearly separable, infinitely many hyperplanes correctly separate the classes. The SVM picks the unique hyperplane that maximizes the margin, defined as the distance from the hyperplane to the nearest training points. This is the optimal margin classifier of Boser, Guyon, and Vapnik [3].

For a hyperplane scaled so that the closest correctly classified points satisfy yᵢ(wᵀxᵢ + b) = 1, the margin equals 2 / ‖w‖. Maximizing the margin is therefore equivalent to minimizing (1/2)‖w‖² subject to yᵢ(wᵀxᵢ + b) ≥ 1 for all i. This is a [convex optimization](/wiki/convex_optimization) problem, specifically a quadratic program, and it has a unique global minimum.

### Soft margin

Real data are noisy and rarely separable. Cortes and Vapnik introduced the soft-margin SVM in their 1995 Machine Learning paper, which adds non-negative slack variables ξᵢ that allow individual points to violate the margin at a cost [4]. The optimization becomes

$$\min_{w, b, \xi} \; \tfrac{1}{2} \|w\|^2 + C \sum_{i=1}^n \xi_i \quad \text{s.t.} \quad y_i (w^\top x_i + b) \ge 1 - \xi_i, \; \xi_i \ge 0.$$

The regularization constant C controls the trade-off between margin width and training-error tolerance. Small C yields a wide margin with more violations and stronger regularization. Large C pushes the optimizer toward fitting the training data exactly, with a narrower margin and greater risk of [overfitting](/wiki/overfitting).

### Lagrangian dual and the kernel trick

Introducing Lagrange multipliers αᵢ ≥ 0 for each constraint and eliminating the primal variables yields the dual problem:

$$\max_{\alpha} \; \sum_{i=1}^n \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i^\top x_j \quad \text{s.t.} \quad 0 \le \alpha_i \le C, \; \sum_i \alpha_i y_i = 0.$$

The dual depends on the data only through the inner products xᵢᵀxⱼ, gathered in the Gram matrix. The decision function is f(x) = Σᵢ αᵢ yᵢ (xᵢᵀx) + b, which again uses only inner products. Replacing every xᵢᵀxⱼ with a kernel evaluation K(xᵢ, xⱼ) = φ(xᵢ)ᵀφ(xⱼ) for some feature map φ produces a maximum-margin classifier in the feature space induced by φ, without ever computing φ explicitly. This is the [kernel trick](/wiki/kernel_trick), and it turns a linear algorithm into a non-linear one with essentially no change to the optimization machinery [3][4].

### Mercer's theorem

Not every symmetric function of two arguments can serve as a kernel. A function K(x, x') is a valid kernel if and only if it is symmetric and positive semi-definite, meaning that for any finite collection of inputs the resulting Gram matrix has only non-negative eigenvalues. This characterization is a consequence of [Mercer's theorem](/wiki/mercer_theorem), proved by James Mercer in 1909, which guarantees that any such kernel can be written as an inner product in some Hilbert space [2]. The connection between positive-definite kernels and feature maps was first applied to learning by Mark Aizerman, Emmanuil Braverman, and Lev Rozonoer in their 1964 paper on the potential function method in pattern recognition, published in Automation and Remote Control [1]. The 1992 paper by Boser, Guyon, and Vapnik adapted this old idea to the maximum-margin setting and started the modern kernel-methods era [3].

### Support vectors

At the optimum, most αᵢ are zero. The training points with αᵢ > 0 are the [support vectors](/wiki/support_vector). They are the only points that matter for the decision function: removing any non-support vector and retraining leaves the solution unchanged. The number of support vectors usually grows roughly linearly with the size of the training set, which is the source of both the model's robustness and its scaling problems.

## When were kernel SVMs invented?

The modern story of kernel SVMs has three turning points.

* The 1964 paper by Mark Aizerman, Emmanuil Braverman, and Lev Rozonoer on the potential function method introduced the idea of using positive-definite kernels to perform pattern recognition in an implicit feature space [1].
* In 1992, Bernhard Boser, Isabelle Guyon, and Vladimir Vapnik combined kernels with the optimal-margin classifier in a paper titled A training algorithm for optimal margin classifiers, presented at the Fifth Annual Workshop on Computational Learning Theory (COLT) in Pittsburgh [3]. This is the first paper that calls the resulting algorithm a kernel SVM in the modern sense.
* In 1995, Corinna Cortes and Vladimir Vapnik published Support-vector networks in Machine Learning, introducing the soft margin and reporting strong empirical results on handwritten digit recognition [4].

The 1995 results were concrete and, at the time, striking. On the US Postal Service database of 16 by 16 pixel digits, polynomial support-vector machines reached raw test error rates of 4.2% to 4.7% for polynomial degrees 2 through 7, against 12.0% for a plain linear classifier and 5.1% for the LeNet1 neural network of the day [4]. On the larger NIST digit benchmark (the database from which [MNIST](/wiki/mnist) was later built), a 4th-degree polynomial SVM with no preprocessing matched the best neural network, LeNet4, at 1.1% test error [4]. A benchmark study quoted in the paper described the outcome as "most remarkable, because unlike the other high performance classifiers, it does not include knowledge about the geometry of the problem" [4]. Support-vector networks has since been cited more than 45,000 times (Semantic Scholar, 2026), and is what most practitioners mean when they say "the SVM paper."

Throughout the late 1990s, kernel SVMs spread rapidly. Thorsten Joachims showed in 1998 that linear SVMs were the strongest classifier then known for text categorization [8]. Bernhard Schölkopf, Alex Smola, and others extended kernels to regression ([support vector regression](/wiki/support_vector_regression)), unsupervised learning ([kernel principal component analysis](/wiki/kernel_pca)), and a wider family of kernel methods including [Gaussian processes](/wiki/gaussian_process) and [kernel ridge regression](/wiki/kernel_ridge_regression) [10]. The book Learning with Kernels by Schölkopf and Smola (MIT Press, 2002) became the standard reference [15].

The practical takeoff came with John Platt's 1998 Sequential Minimal Optimization (SMO) algorithm at Microsoft Research [7], and the release of [LIBSVM](/wiki/libsvm) by Chih-Chung Chang and Chih-Jen Lin at National Taiwan University, in development since 2000 [12]. LIBSVM made high-quality kernel SVM training accessible from C, Java, Python, MATLAB, and R, and became the engine inside [scikit-learn](/wiki/scikit_learn)'s SVC and NuSVC classes [20].

After roughly 2012, deep convolutional networks began to dominate large-scale perceptual benchmarks such as [ImageNet](/wiki/imagenet), and the research center of gravity shifted toward [neural networks](/wiki/neural_network). Kernel methods continued in the background, however, and the analysis of infinite-width networks via the [neural tangent kernel](/wiki/neural_tangent_kernel) (Jacot, Gabriel, and Hongler, NeurIPS 2018) revived theoretical interest in kernels as a way to understand deep learning [19].

## Which kernels do support vector machines use?

Many kernels are in everyday use. The table below summarizes the most common families.

| Kernel | Formula | Hyperparameters | Typical use |
| --- | --- | --- | --- |
| [Linear](/wiki/linear_kernel) | K(x, x') = xᵀx' | none | High-dimensional sparse data such as text or genomics; very large datasets where speed matters. |
| [Polynomial](/wiki/polynomial_kernel) | K(x, x') = (γ xᵀx' + r)ᵈ | degree d, scale γ, offset r | Tasks with explicit feature interactions, image patches, NLP with character n-grams. |
| [RBF / Gaussian](/wiki/rbf_kernel) | K(x, x') = exp(-γ ‖x - x'‖²) | bandwidth γ | Default choice for moderate-sized continuous data; smooth decision boundaries. |
| [Sigmoid](/wiki/sigmoid_kernel) | K(x, x') = tanh(γ xᵀx' + r) | γ, r | Resembles a two-layer neural network; not always positive semi-definite, used historically. |
| [Laplacian](/wiki/laplacian_kernel) | K(x, x') = exp(-γ ‖x - x'‖₁) | γ | Like RBF but with ℓ₁ distance, more robust to outliers in some settings. |
| Chi-squared | K(x, x') = exp(-γ Σᵢ (xᵢ - x'ᵢ)² / (xᵢ + x'ᵢ)) | γ | Histogram features in computer vision, e.g. bag-of-visual-words. |
| Histogram intersection | K(x, x') = Σᵢ min(xᵢ, x'ᵢ) | none | Image categorization with histogram descriptors. |
| [String kernel](/wiki/string_kernel) | counts shared substrings or k-mers | substring length, weighting | Text classification, protein sequence classification. |
| [Graph kernel](/wiki/graph_kernel) | counts shared substructures (walks, subtrees, shortest paths) | structure-specific | Cheminformatics, social network analysis, molecule property prediction. |
| ANOVA | K(x, x') = Σₖ exp(-γ (xₖ - x'ₖ)²)ᵈ | γ, d | Multivariate regression problems with feature interactions. |

The Gaussian RBF kernel is the most common default. Its bandwidth parameter γ controls the locality of influence: large γ makes the kernel concentrate near each training point and risks overfitting, small γ makes it nearly constant and risks underfitting. RBF kernels are universal approximators in the sense that they can fit any continuous decision boundary given enough support vectors and the right γ [15].

Kernels can be combined to make new kernels. The sum, product, and non-negative scalar multiple of valid kernels are valid kernels [15]. Multiple kernel learning (MKL) treats the choice of kernel weights as another optimization problem, learning a convex combination of base kernels jointly with the SVM.

## How are kernel SVMs trained?

### Sequential Minimal Optimization (SMO)

The quadratic program for kernel SVMs has n variables and a dense n × n kernel matrix. Naive solvers run out of memory for n much larger than ten thousand. John Platt's 1998 [Sequential Minimal Optimization](/wiki/sequential_minimal_optimization) algorithm at Microsoft Research solved this problem by decomposing the quadratic program into a sequence of two-variable subproblems [7]. Each subproblem has an analytical closed-form solution because the equality constraint Σᵢ αᵢ yᵢ = 0 reduces a two-variable quadratic program to a one-dimensional optimization on a line segment. SMO is the basis of LIBSVM and most production kernel SVM trainers. With heuristics for choosing the working pair (the most-violating pair under the Karush-Kuhn-Tucker conditions), SMO converges in roughly O(n^2) to O(n^3) kernel evaluations, depending on the problem [7].

### Working-set methods

SVM-light by Thorsten Joachims (1999) generalized SMO to working sets of size larger than two [9]. Modern solvers in [LIBLINEAR](/wiki/liblinear) use coordinate descent on the dual for the linear case, and a related decomposition for the kernelized case in LIBSVM [13].

### Stochastic and gradient methods

For large datasets and the linear case, stochastic methods such as Pegasos (Shalev-Shwartz, Singer, Srebro, ICML 2007) treat the primal SVM problem as a regularized [hinge loss](/wiki/hinge_loss) optimization and run [stochastic gradient descent](/wiki/stochastic_gradient_descent_sgd) [16]. Pegasos converges in time independent of the dataset size for fixed accuracy. For non-linear kernels, kernelized SGD and budgeted online algorithms maintain a bounded set of support vectors during training.

### Cutting-plane and bundle methods

For structural SVMs and very large linear problems, cutting-plane methods (SVMperf, BMRM) iteratively build a piecewise-linear lower bound to the objective and solve a small quadratic program at each step. These methods converge in O(1/ε) iterations to ε-accuracy.

## How do SVMs handle more than two classes?

SVMs are inherently binary classifiers. Three strategies extend them to K > 2 classes.

* **One-vs-rest (OvR)** trains K binary classifiers, each separating one class from all the others. Prediction picks the class with the highest decision-function value. Simple, fast to train, but the binary problems are usually unbalanced.
* **One-vs-one (OvO)** trains K(K-1)/2 pairwise classifiers and aggregates their votes. Each subproblem uses only data from the two classes involved, so individual training is fast. LIBSVM and scikit-learn's SVC use this scheme by default. Prediction time grows quadratically in K, which can be a problem for very large K.
* **Crammer-Singer formulation** (Koby Crammer and Yoram Singer, JMLR 2001) expresses multi-class SVMs as a single joint optimization that learns one weight vector per class with a multi-class margin constraint [14]. Theoretically appealing but computationally heavier and rarely better in practice than OvO.

For probability estimates rather than discrete labels, Platt scaling (a sigmoid fit on top of the decision function) and isotonic regression are the standard post-processing techniques.

## What software implements kernel SVMs?

| Library | Language | Solver | Strengths |
| --- | --- | --- | --- |
| [LIBSVM](/wiki/libsvm) | C++ with bindings | SMO | The reference implementation; powers sklearn.svm.SVC, NuSVC, SVR. |
| [LIBLINEAR](/wiki/liblinear) | C++ | Coordinate descent | Linear SVMs only, fast on millions of samples and features. Powers sklearn.svm.LinearSVC. |
| [scikit-learn](/wiki/scikit_learn) svm module | Python | LIBSVM / LIBLINEAR | Standard Python interface, supports SVC, NuSVC, LinearSVC, SVR, OneClassSVM. |
| SVM-light | C | Working set | Mature, supports structural and ranking SVMs (SVMstruct, SVMrank). |
| SVMTorch | C++ | Decomposition | Early scalable solver from IDIAP. |
| Pegasos | several | Primal SGD | Very large linear problems. |
| ThunderSVM | C++/CUDA | Parallel SMO | GPU and multi-core CPU acceleration; generally about an order of magnitude faster than LIBSVM while producing identical models. |
| cuML SVC | Python/CUDA | GPU SMO | Drop-in replacement for sklearn SVC on NVIDIA GPUs as part of [RAPIDS](/wiki/rapids). |
| Vowpal Wabbit | C++ | Online | Streaming learning, supports kernels via random features. |
| MATLAB Statistics Toolbox | MATLAB | SMO | Used widely in academic and engineering settings. |
| R e1071, kernlab | R | LIBSVM, custom | Standard R interfaces. |

sklearn.svm.SVC is the most widely used Python interface. It exposes parameters C (regularization), kernel ('linear', 'poly', 'rbf', 'sigmoid', 'precomputed', or a callable), gamma, degree, and coef0. NuSVC reparameterizes the soft margin in terms of ν ∈ (0, 1], an upper bound on the fraction of margin errors and a lower bound on the fraction of support vectors, which is sometimes more interpretable than C. LinearSVC uses LIBLINEAR and scales to much larger datasets but only supports linear kernels [20]. On the GPU side, ThunderSVM (Wen et al. 2018) is, in its authors' words, "generally an order of magnitude faster than LibSVM while producing identical SVMs" [21].

## How well do kernel SVMs scale?

Kernel SVMs scale poorly with the training set size. Building the full kernel matrix takes O(n^2) time and memory. The dual quadratic program is solved in roughly O(n^2) to O(n^3) time depending on the data and the chosen tolerance. Empirical scaling for LIBSVM on RBF kernels is between O(n^2) and O(n^2.3) [12]. Prediction with m support vectors takes O(md) for the linear case and O(m times the cost of one kernel evaluation) in general. For datasets larger than roughly 10⁵ to 10⁶ examples, exact kernel SVMs become impractical and approximation methods or linear models are preferred.

The table below sketches the cost of training and predicting with the main approaches.

| Method | Training time | Prediction time per example | Memory |
| --- | --- | --- | --- |
| Linear SVM (LIBLINEAR) | O(nd) per pass, often a few passes | O(d) | O(nd) for sparse data |
| Kernel SVM (LIBSVM, SMO) | O(n^2) to O(n^3) | O(md) with m support vectors | O(n^2) kernel cache |
| Random features + linear SVM | O(nD) for D features | O(D) | O(nD) |
| Nyström + linear SVM | O(n m^2 + m^3) for m landmarks | O(md) | O(nm + m^2) |
| Pegasos | O(d / (λε)), independent of n | O(d) | O(d) |

## Approximate kernels: random features and the Nyström method

Two families of approximations let kernel SVMs scale to large datasets by replacing the implicit infinite-dimensional feature map with an explicit finite one.

### Random Fourier features

Ali Rahimi and Benjamin Recht's Random Features for Large-Scale Kernel Machines (NeurIPS 2007) is the most influential approximation paper [17]. For shift-invariant kernels such as the Gaussian RBF, Bochner's theorem gives the kernel as the Fourier transform of a probability distribution p:

$$K(x, x') = \int p(\omega) e^{i \omega^\top (x - x')} \, d\omega.$$

Sampling D frequency vectors by Monte Carlo from p and defining the explicit feature map

$$z(x) = \sqrt{2/D} \, [\cos(\omega_1^\top x + b_1), \dots, \cos(\omega_D^\top x + b_D)]$$

produces a mapping such that z(x)ᵀz(x') ≈ K(x, x'). After mapping, a linear SVM on z(x) approximates a kernel SVM with kernel K at a fraction of the cost. Random Fourier features and their variants (Random Kitchen Sinks, Fastfood, Quasi-Monte Carlo features, Orthogonal Random Features) brought kernel methods to dataset sizes that previously required deep nets.

The paper won the NeurIPS Test of Time Award in 2017, ten years after publication [17]. In his acceptance speech, provocatively titled around the question of whether machine learning had become "alchemy," Rahimi argued for more theoretical rigor, saying he "would like to live in a society where systems are built on top of verifiable, rigorous thorough knowledge and not alchemy." The talk sparked a long-running debate in the field about empirical rigor.

### Nyström method

The [Nyström method](/wiki/nystrom_method) (Williams and Seeger, NIPS 2001) approximates the kernel matrix by sampling m ≪ n training points as landmarks and using the eigendecomposition of the small m × m submatrix to extend approximate eigenvectors to the full dataset [11]. The resulting low-rank approximation has the form C W⁻¹ Cᵀ, where C is the column submatrix and W the corresponding m × m block, and can be plugged into linear solvers. Nyström approximations are data-adaptive and often more accurate than random features for the same target rank, at the cost of needing access to a representative sample of the training data.

### When to use which

For very large n and stationary kernels (RBF, Laplacian, Matérn), random Fourier features are usually the simplest. For more general kernels, including non-stationary or non-vectorial ones (string, graph, intersection), Nyström is more flexible because it only requires the ability to evaluate the kernel.

## What other kernel methods are there?

Kernel SVMs are the best-known kernel method, but the same toolbox extends to many other algorithms by replacing inner products with kernel evaluations [22].

* [Kernel ridge regression](/wiki/kernel_ridge_regression) is the kernelized form of [ridge regression](/wiki/ridge_regression). It has a closed-form solution involving the inverse of K + λI, and is a building block for [Gaussian process](/wiki/gaussian_process) regression.
* [Kernel principal component analysis](/wiki/kernel_pca) (Schölkopf, Smola, Müller, 1998) performs PCA in feature space, recovering non-linear principal directions [10].
* [Support vector regression](/wiki/support_vector_regression) extends SVM to real-valued outputs using the ε-insensitive loss.
* [One-class SVM](/wiki/one_class_svm) (Schölkopf et al. 2001) estimates the support of a distribution for novelty detection and outlier scoring [18].
* [Gaussian processes](/wiki/gaussian_process) are a Bayesian counterpart that uses the same kernel as a covariance function.
* Kernelized Fisher discriminant, kernel canonical correlation analysis, kernel independent component analysis, and kernel mean embedding all follow the same recipe.

Together these methods form the field of kernel methods, in which the kernel is the unit of algorithm design [22]. Picking the right kernel is the modeling step; the rest is convex optimization.

## How do kernel SVMs compare to deep learning?

Kernel SVMs and deep neural networks solve the same broad problem (learn non-linear functions from data) but trade different things off.

| Property | Kernel SVM | Deep neural network |
| --- | --- | --- |
| Optimization | Convex; unique global minimum | Non-convex; many local minima |
| Hyperparameters | Few (C, γ, kernel choice) | Many (architecture, optimizer, learning rate, regularization) |
| Sample efficiency | Strong on small datasets | Usually needs large data |
| Computational cost | O(n^2) to O(n^3) training | Roughly O(n) per epoch on GPU |
| Feature engineering | The kernel is the engineered representation | Features learned end-to-end |
| Handling raw perception (images, audio) | Weak without hand-designed kernels | State of the art |
| Theoretical guarantees | Generalization bounds via VC dimension and margin theory | Mostly empirical, with neural tangent kernel and double descent providing partial theory |
| Interpretability | Decision function is a sum over support vectors | Generally opaque |
| Online updates | Hard for non-linear kernels | Natural via SGD |

A useful mental model is that deep networks learn the feature map φ from data, while kernel SVMs require the user to specify φ implicitly through K. When the right K is known (text n-grams, molecule fingerprints, RBF over normalized features), kernel SVMs are remarkably hard to beat, especially with limited training data. When the data is large and perceptual, deep networks dominate because they can learn a better φ than any hand-designed kernel. The generalization guarantees behind SVMs come from Vapnik's statistical learning theory, in which margin and VC dimension bound the expected error [5][6].

The two approaches connect through the [neural tangent kernel](/wiki/neural_tangent_kernel) (NTK). In the infinite-width limit and small learning rate, gradient descent on a deep network is equivalent to kernel ridge regression with a specific kernel determined by the architecture (Jacot, Gabriel, and Hongler, NeurIPS 2018) [19]. The NTK has become a major tool for analyzing deep learning theoretically, and shows that even deep models live inside the kernel-methods picture in a precise sense.

## Are kernel SVMs still used today?

Kernel SVMs are no longer the default for large-scale image or language tasks, but they remain widely used and are often the strongest baseline in several settings. A 2017 Kaggle survey of data mining and machine learning practitioners found that 26% use SVMs, as reported in the ThunderSVM paper [21].

* **Bioinformatics and chemoinformatics.** Kernel SVMs over string kernels for protein sequences and graph kernels for molecules remain competitive with graph neural networks, particularly when datasets are small (hundreds to thousands of compounds).
* **Text classification with linear SVMs.** For document categorization, sentiment analysis, and spam detection on TF-IDF features, linear SVMs (LIBLINEAR) routinely match or beat deep models on small to medium corpora and run orders of magnitude faster at inference time [13].
* **Tabular data with hundreds to tens of thousands of rows.** RBF SVMs are a strong baseline alongside [gradient boosted trees](/wiki/gradient_boosting). Both often beat deep nets on this regime.
* **Anomaly and novelty detection.** One-class SVMs are a default tool for fraud detection, intrusion detection, and quality control [18].
* **Brain-computer interfaces and medical signals.** EEG, ECG, and similar low-data, high-noise signals continue to favor SVMs because of their sample efficiency and the convexity of training.
* **Robotics and control.** Kernel methods underpin many [Gaussian process](/wiki/gaussian_process) regression models used for system identification and Bayesian optimization of controllers.
* **Theoretical analysis of deep learning.** The NTK and related kernels provide one of the few frameworks in which deep networks admit closed-form analysis [19].

Recent research has pushed kernel SVMs further with structured kernels for graphs and sequences, scalable solvers on GPUs (ThunderSVM, cuML), and hybrid systems where deep network embeddings are used as inputs to a kernel SVM trained on top [21]. The latter pattern is common when transferring large pretrained models to small downstream datasets where retraining the network would overfit.

## Explain like I'm 5 (ELI5)

Imagine red and blue marbles scattered on a table, and you want to draw a line that separates them. If you can use a straight line, that is a regular SVM, and the SVM finds the line with the biggest empty band on either side, like a road as wide as possible without running over any marbles.

Now suppose the marbles are arranged in a ring: blue marbles in the middle, red marbles around the outside. No straight line works. The kernel SVM cheats. It pretends to lift each marble up off the table by an amount that depends on how far it is from the center, so the blue marbles in the middle stay low and the red marbles outside go higher. From the side, you can now slide a flat sheet of paper between the two layers and call that the boundary. Looking back down at the table, the boundary becomes a circle.

The trick is that the SVM never actually lifts anything. It only computes how far apart any two marbles would be after lifting, using a kernel function. The math says that is enough to find the boundary. By choosing different kernel functions, you get different ways of "lifting" and therefore different shapes for the final boundary on the table: circles, wiggles, rings within rings, almost anything.

## References

1. Aizerman, M., Braverman, E., and Rozonoer, L. (1964). *Theoretical foundations of the potential function method in pattern recognition learning*. Automation and Remote Control, 25, 821-837.
2. Mercer, J. (1909). *Functions of positive and negative type, and their connection with the theory of integral equations*. Philosophical Transactions of the Royal Society A, 209, 415-446.
3. Boser, B. E., Guyon, I. M., and Vapnik, V. N. (1992). *A training algorithm for optimal margin classifiers*. Proceedings of the Fifth Annual Workshop on Computational Learning Theory (COLT 1992), 144-152.
4. Cortes, C., and Vapnik, V. (1995). *Support-vector networks*. Machine Learning, 20(3), 273-297.
5. Vapnik, V. N. (1995). *The Nature of Statistical Learning Theory*. Springer-Verlag.
6. Vapnik, V. N. (1998). *Statistical Learning Theory*. Wiley-Interscience.
7. Platt, J. C. (1998). *Sequential minimal optimization: A fast algorithm for training support vector machines*. Microsoft Research Technical Report MSR-TR-98-14.
8. Joachims, T. (1998). *Text categorization with support vector machines: Learning with many relevant features*. Proceedings of the European Conference on Machine Learning (ECML), 137-142.
9. Joachims, T. (1999). *Making large-scale SVM learning practical*. In Schölkopf, Burges, and Smola (eds.), Advances in Kernel Methods, MIT Press.
10. Schölkopf, B., Smola, A. J., and Müller, K.-R. (1998). *Nonlinear component analysis as a kernel eigenvalue problem*. Neural Computation, 10(5), 1299-1319.
11. Williams, C. K. I., and Seeger, M. (2001). *Using the Nyström method to speed up kernel machines*. Advances in Neural Information Processing Systems 13, 682-688.
12. Chang, C.-C., and Lin, C.-J. (2011). *LIBSVM: A library for support vector machines*. ACM Transactions on Intelligent Systems and Technology, 2(3), 27:1-27:27. (Software in development since 2000.)
13. Fan, R.-E., Chang, K.-W., Hsieh, C.-J., Wang, X.-R., and Lin, C.-J. (2008). *LIBLINEAR: A library for large linear classification*. Journal of Machine Learning Research, 9, 1871-1874.
14. Crammer, K., and Singer, Y. (2001). *On the algorithmic implementation of multiclass kernel-based vector machines*. Journal of Machine Learning Research, 2, 265-292.
15. Schölkopf, B., and Smola, A. J. (2002). *Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond*. MIT Press.
16. Shalev-Shwartz, S., Singer, Y., and Srebro, N. (2007). *Pegasos: Primal estimated sub-gradient solver for SVM*. Proceedings of the 24th International Conference on Machine Learning (ICML), 807-814.
17. Rahimi, A., and Recht, B. (2007). *Random features for large-scale kernel machines*. Advances in Neural Information Processing Systems 20, 1177-1184. (NeurIPS Test of Time Award, 2017.)
18. Schölkopf, B., Platt, J. C., Shawe-Taylor, J., Smola, A. J., and Williamson, R. C. (2001). *Estimating the support of a high-dimensional distribution*. Neural Computation, 13(7), 1443-1471.
19. Jacot, A., Gabriel, F., and Hongler, C. (2018). *Neural tangent kernel: Convergence and generalization in neural networks*. Advances in Neural Information Processing Systems 31.
20. Pedregosa, F. et al. (2011). *Scikit-learn: Machine learning in Python*. Journal of Machine Learning Research, 12, 2825-2830.
21. Wen, Z., Shi, J., Li, Q., He, B., and Chen, J. (2018). *ThunderSVM: A fast SVM library on GPUs and CPUs*. Journal of Machine Learning Research, 19(21), 1-5.
22. Hofmann, T., Schölkopf, B., and Smola, A. J. (2008). *Kernel methods in machine learning*. Annals of Statistics, 36(3), 1171-1220.