Kernel Support Vector Machines (KSVMs)

Introduction

Kernel Support Vector Machines (KSVMs) are a family of supervised learning algorithms that combine the maximum-margin principle of support vector machines with the kernel trick to learn non-linear decision boundaries. Introduced by Bernhard Boser, Isabelle Guyon, and Vladimir Vapnik at the 1992 COLT conference, kernel SVMs were the dominant approach to non-linear classification and regression in machine learning throughout the 1990s and 2000s, before deep neural networks overtook them on perceptual tasks. They remain a workhorse for high-dimensional, small-sample problems in bioinformatics, chemoinformatics, text classification, and any setting where the training set is too small to fit a deep network reliably.

The central idea is simple. A linear SVM searches for a hyperplane that separates two classes with the largest possible margin. Many real datasets are not linearly separable in their original input space, so a kernel SVM implicitly maps the data into a high-dimensional, possibly infinite-dimensional feature space, and constructs the maximum-margin hyperplane in that space. Crucially, the algorithm never needs to compute the mapped vectors explicitly. It only needs to evaluate the inner product between pairs of mapped vectors, and that inner product is given by a kernel function defined directly on the original inputs. This substitution is known as the kernel trick.

Mathematical foundation

Maximum-margin classifier

Consider a binary classification problem with training data $(x_1, y_1), \dots, (x_n, y_n)$ where $x_i \in \mathbb{R}^d$ and $y_i \in {-1, +1}$. A linear classifier predicts the sign of $f(x) = w^\top x + b$. If the data is linearly separable, infinitely many hyperplanes correctly separate the classes. The SVM picks the unique hyperplane that maximizes the margin, defined as the distance from the hyperplane to the nearest training points.

For a hyperplane scaled so that the closest correctly classified points satisfy $y_i (w^\top x_i + b) = 1$, the margin equals $2 / |w|$. Maximizing the margin is therefore equivalent to minimizing $\tfrac{1}{2}|w|^2$ subject to $y_i (w^\top x_i + b) \ge 1$ for all $i$. This convex quadratic program has a unique global minimum.

Soft margin

Real data are noisy and rarely separable. Cortes and Vapnik introduced the soft-margin SVM in their 1995 Machine Learning paper, which adds non-negative slack variables $\xi_i$ that allow individual points to violate the margin at a cost. The optimization becomes

$$\min_{w, b, \xi} ; \tfrac{1}{2} |w|^2 + C \sum_{i=1}^n \xi_i \quad \text{s.t.} \quad y_i (w^\top x_i + b) \ge 1 - \xi_i, ; \xi_i \ge 0.$$

The regularization constant $C$ controls the trade-off between margin width and training-error tolerance. Small $C$ yields a wide margin with more violations and stronger regularization. Large $C$ pushes the optimizer toward fitting the training data exactly, with a narrower margin.

Lagrangian dual and the kernel trick

Introducing Lagrange multipliers $\alpha_i \ge 0$ for each constraint and eliminating the primal variables yields the dual problem:

$$\max_{\alpha} ; \sum_{i=1}^n \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j , x_i^\top x_j \quad \text{s.t.} \quad 0 \le \alpha_i \le C, ; \sum_i \alpha_i y_i = 0.$$

The dual depends on the data only through the inner products $x_i^\top x_j$, gathered in the Gram matrix. The decision function is $f(x) = \sum_{i=1}^n \alpha_i y_i (x_i^\top x) + b$, which again uses only inner products. Replacing every $x_i^\top x_j$ with a kernel evaluation $K(x_i, x_j) = \phi(x_i)^\top \phi(x_j)$ for some feature map $\phi$ produces a maximum-margin classifier in the feature space induced by $\phi$, without ever computing $\phi$ explicitly. This is the kernel trick, and it turns a linear algorithm into a non-linear one with essentially no change to the optimization machinery.

Mercer's theorem

Not every symmetric function of two arguments can serve as a kernel. A function $K(x, x')$ is a valid kernel if and only if it is symmetric and positive semi-definite, meaning that for any finite collection of inputs the resulting Gram matrix has only non-negative eigenvalues. This characterization is a consequence of Mercer's theorem, proved by James Mercer in 1909, which guarantees that any such kernel can be written as an inner product in some Hilbert space. The connection between positive-definite kernels and feature maps was first applied to learning by Aizerman, Braverman, and Rozonoer in their 1964 paper on the potential function method in pattern recognition, published in Automation and Remote Control. The 1992 paper by Boser, Guyon, and Vapnik adapted this old idea to the maximum-margin setting and started the modern kernel-methods era.

Support vectors

At the optimum, most $\alpha_i$ are zero. The training points with $\alpha_i > 0$ are the support vectors. They are the only points that matter for the decision function: removing any non-support vector and retraining leaves the solution unchanged. The number of support vectors usually grows roughly linearly with the size of the training set, which is the source of both the model's robustness and its scaling problems.

History

The modern story of kernel SVMs has three turning points.

The 1964 paper by Mark Aizerman, Emmanuil Braverman, and Lev Rozonoer on the potential function method introduced the idea of using positive-definite kernels to perform pattern recognition in an implicit feature space.
In 1992, Bernhard Boser, Isabelle Guyon, and Vladimir Vapnik combined kernels with the optimal-margin classifier in a paper titled A training algorithm for optimal margin classifiers presented at COLT. This is the first paper that calls the resulting algorithm a kernel SVM in the modern sense.
In 1995, Corinna Cortes and Vladimir Vapnik published Support-vector networks in Machine Learning, introducing the soft margin and demonstrating strong empirical results on handwritten digit recognition. This paper is the most-cited SVM reference and is what most practitioners mean when they say "the SVM paper."

Throughout the late 1990s, kernel SVMs spread rapidly. Thorsten Joachims showed in 1998 that linear SVMs were the strongest classifier known for text categorization. Bernhard Schölkopf, Alex Smola, and others extended kernels to regression (support vector regression), unsupervised learning (kernel principal component analysis), and a wider family of kernel methods including Gaussian processes and kernel ridge regression. The book Learning with Kernels by Schölkopf and Smola (MIT Press, 2002) became the standard reference.

The practical takeoff came with John Platt's 1998 Sequential Minimal Optimization (SMO) algorithm at Microsoft Research, and the 2001 release of LIBSVM by Chih-Chung Chang and Chih-Jen Lin at National Taiwan University. LIBSVM made high-quality kernel SVM training accessible from C, Java, Python, MATLAB, and R, and became the engine inside scikit-learn's SVC and NuSVC classes.

After roughly 2012, deep convolutional networks began to dominate large-scale perceptual benchmarks such as ImageNet, and the research center of gravity shifted toward neural networks. Kernel methods continued in the background, however, and the analysis of infinite-width networks via the neural tangent kernel (Jacot, Gabriel, and Hongler, NeurIPS 2018) revived theoretical interest in kernels as a way to understand deep learning.

Common kernels

Many kernels are in everyday use. The table below summarizes the most common families.

Kernel	Formula	Hyperparameters	Typical use
Linear	$K(x, x') = x^\top x'$	none	High-dimensional sparse data such as text or genomics; very large datasets where speed matters.
Polynomial	$K(x, x') = (\gamma , x^\top x' + r)^d$	degree $d$, scale $\gamma$, offset $r$	Tasks with explicit feature interactions, image patches, NLP with character $n$-grams.
RBF / Gaussian	$K(x, x') = \exp(-\gamma \|x - x'\|^2)$	bandwidth $\gamma$	Default choice for moderate-sized continuous data; smooth decision boundaries.
Sigmoid	$K(x, x') = \tanh(\gamma , x^\top x' + r)$	$\gamma$, $r$	Resembles a two-layer neural network; not always positive semi-definite, used historically.
Laplacian	$K(x, x') = \exp(-\gamma \|x - x'\|_1)$	$\gamma$	Like RBF but with $\ell_1$ distance, more robust to outliers in some settings.
Chi-squared	$K(x, x') = \exp(-\gamma \sum_i (x_i - x'_i)^2 / (x_i + x'_i))$	$\gamma$	Histogram features in computer vision, e.g. bag-of-visual-words.
Histogram intersection	$K(x, x') = \sum_i \min(x_i, x'_i)$	none	Image categorization with histogram descriptors.
String kernel	counts shared substrings or $k$-mers	substring length, weighting	Text classification, protein sequence classification.
Graph kernel	counts shared substructures (walks, subtrees, shortest paths)	structure-specific	Cheminformatics, social network analysis, molecule property prediction.
ANOVA	$K(x, x') = \sum_k \exp(-\gamma (x_k - x'_k)^2)^d$	$\gamma$, $d$	Multivariate regression problems with feature interactions.

The Gaussian RBF kernel is the most common default. Its bandwidth parameter $\gamma$ controls the locality of influence: large $\gamma$ makes the kernel concentrate near each training point and risks overfitting, small $\gamma$ makes it nearly constant and risks underfitting. RBF kernels are universal approximators in the sense that they can fit any continuous decision boundary given enough support vectors and the right $\gamma$.

Kernels can be combined to make new kernels. The sum, product, and non-negative scalar multiple of valid kernels are valid kernels. Multiple kernel learning (MKL) treats the choice of kernel weights as another optimization problem, learning a convex combination of base kernels jointly with the SVM.

Algorithms

Sequential Minimal Optimization (SMO)

The quadratic program for kernel SVMs has $n$ variables and a dense $n \times n$ kernel matrix. Naive solvers run out of memory for $n$ much larger than ten thousand. John Platt's 1998 Sequential Minimal Optimization algorithm at Microsoft Research solved this problem by decomposing the QP into a sequence of two-variable subproblems. Each subproblem has an analytical closed-form solution because the equality constraint $\sum_i \alpha_i y_i = 0$ reduces a two-variable QP to a one-dimensional optimization on a line segment. SMO is the basis of LIBSVM and most production kernel SVM trainers. With heuristics for choosing the working pair (the most-violating pair under the KKT conditions), SMO converges in roughly $O(n^2)$ to $O(n^3)$ kernel evaluations, depending on the problem.

Working-set methods

SV$^{\text{light}}$ by Thorsten Joachims (1999) generalized SMO to working sets of size $q > 2$. Modern solvers in LIBLINEAR use coordinate descent on the dual for the linear case, and a related decomposition for the kernelized case in LIBSVM.

Stochastic and gradient methods

For large datasets and the linear case, stochastic methods such as Pegasos (Shalev-Shwartz, Singer, Srebro, ICML 2007) treat the primal SVM problem as a regularized hinge-loss optimization and run stochastic gradient descent. Pegasos converges in time independent of the dataset size for fixed accuracy. For non-linear kernels, kernelized SGD and budgeted online algorithms maintain a bounded set of support vectors during training.

Cutting-plane and bundle methods

For structural SVMs and very large linear problems, cutting-plane methods (SVMperf, BMRM) iteratively build a piecewise-linear lower bound to the objective and solve a small QP at each step. These methods converge in $O(1/\epsilon)$ iterations to $\epsilon$-accuracy.

Multi-class strategies

SVMs are inherently binary classifiers. Three strategies extend them to $K > 2$ classes.

One-vs-rest (OvR) trains $K$ binary classifiers, each separating one class from all the others. Prediction picks the class with the highest decision-function value. Simple, fast to train, but the binary problems are usually unbalanced.
One-vs-one (OvO) trains $K(K-1)/2$ pairwise classifiers and aggregates their votes. Each subproblem uses only data from the two classes involved, so individual training is fast. LIBSVM and scikit-learn's SVC use this scheme by default. Prediction time grows quadratically in $K$, which can be a problem for very large $K$.
Crammer-Singer formulation (Koby Crammer and Yoram Singer, JMLR 2001) expresses multi-class SVMs as a single joint optimization that learns one weight vector per class with a multi-class margin constraint. Theoretically appealing but computationally heavier and rarely better in practice than OvO.

For probability estimates rather than discrete labels, Platt scaling (sigmoid fit on top of the decision function) and isotonic regression are the standard post-processing techniques.

Implementations

Library	Language	Solver	Strengths
LIBSVM	C++ with bindings	SMO	The reference implementation; powers `sklearn.svm.SVC`, `NuSVC`, `SVR`.
LIBLINEAR	C++	Coordinate descent	Linear SVMs only, fast on millions of samples and features. Powers `sklearn.svm.LinearSVC`.
scikit-learn `svm` module	Python	LIBSVM / LIBLINEAR	Standard Python interface, supports `SVC`, `NuSVC`, `LinearSVC`, `SVR`, `OneClassSVM`.
SVM$^{\text{light}}$	C	Working set	Mature, supports structural and ranking SVMs (SVMstruct, SVMrank).
SVMTorch	C++	Decomposition	Early scalable solver from IDIAP.
Pegasos	several	Primal SGD	Very large linear problems.
ThunderSVM	C++/CUDA	Parallel SMO	GPU acceleration; speedups of 5x to 100x over LIBSVM on RBF kernels.
cuML SVC	Python/CUDA	GPU SMO	Drop-in replacement for `sklearn` `SVC` on NVIDIA GPUs as part of RAPIDS.
Vowpal Wabbit	C++	Online	Streaming learning, supports kernels via random features.
MATLAB Statistics Toolbox	MATLAB	SMO	Used widely in academic and engineering settings.
R `e1071`, `kernlab`	R	LIBSVM, custom	Standard R interfaces.

sklearn.svm.SVC is the most widely used Python interface. It exposes parameters C (regularization), kernel ('linear', 'poly', 'rbf', 'sigmoid', 'precomputed' or callable), gamma, degree, and coef0. NuSVC reparameterizes the soft margin in terms of $\nu \in (0, 1]$, an upper bound on the fraction of margin errors and a lower bound on the fraction of support vectors, which is sometimes more interpretable than $C$. LinearSVC uses LIBLINEAR and scales to much larger datasets but only supports linear kernels.

Computational complexity

Kernel SVMs scale poorly with the training set size. Building the full kernel matrix takes $O(n^2)$ time and memory. The dual QP is solved in roughly $O(n^2)$ to $O(n^3)$ time depending on the data and the chosen tolerance. Empirical scaling for LIBSVM on RBF kernels is between $O(n^2)$ and $O(n^{2.3})$. Prediction with $m$ support vectors takes $O(md)$ for the linear case and $O(m \cdot \text{cost of one kernel evaluation})$ in general. For datasets larger than roughly $10^5$ to $10^6$ examples, exact kernel SVMs become impractical and approximation methods or linear models are preferred.

The table below sketches the cost of training and predicting with the main approaches.

Method	Training time	Prediction time per example	Memory
Linear SVM (LIBLINEAR)	$O(nd)$ per pass, often a few passes	$O(d)$	$O(nd)$ for sparse data
Kernel SVM (LIBSVM, SMO)	$O(n^2)$ to $O(n^3)$	$O(m \cdot d)$ with $m$ support vectors	$O(n^2)$ kernel cache
Random features + linear SVM	$O(n D)$ for $D$ features	$O(D)$	$O(n D)$
Nyström + linear SVM	$O(n m^2 + m^3)$ for $m$ landmarks	$O(m d)$	$O(n m + m^2)$
Pegasos	$O(d / (\lambda \epsilon))$, independent of $n$	$O(d)$	$O(d)$

Approximate kernels

Two families of approximations let kernel SVMs scale to large datasets by replacing the implicit infinite-dimensional feature map with an explicit finite one.

Random Fourier features

Ali Rahimi and Benjamin Recht's Random Features for Large-Scale Kernel Machines (NeurIPS 2007) is the most influential approximation paper. For shift-invariant kernels such as the Gaussian RBF, Bochner's theorem gives the kernel as the Fourier transform of a probability distribution $p$:

$$K(x, x') = \int p(\omega) e^{i \omega^\top (x - x')} , d\omega.$$

Monte Carlo sampling $\omega_1, \dots, \omega_D$ from $p$ and defining $z(x) = \sqrt{2/D} , [\cos(\omega_1^\top x + b_1), \dots, \cos(\omega_D^\top x + b_D)]$ produces an explicit feature map such that $z(x)^\top z(x') \approx K(x, x')$. After mapping, a linear SVM on $z(x)$ approximates a kernel SVM with kernel $K$ at a fraction of the cost. Random Fourier features and their variants (Random Kitchen Sinks, Fastfood, Quasi-Monte Carlo features, Orthogonal Random Features) brought kernel methods to dataset sizes that previously required deep nets.

The paper won the NeurIPS Test of Time Award in 2017, ten years after publication. Rahimi's acceptance speech, in which he argued that machine learning had become too much "alchemy," sparked a long-running debate in the field about empirical rigor.

Nyström method

The Nyström method (Williams and Seeger, NIPS 2001) approximates the kernel matrix by sampling $m \ll n$ training points as landmarks and using the eigendecomposition of the small $m \times m$ submatrix to extend approximate eigenvectors to the full dataset. The resulting low-rank approximation $\tilde{K} \approx C W^{-1} C^\top$, where $C$ is the column submatrix and $W$ the corresponding $m \times m$ block, can be plugged into linear solvers. Nyström approximations are data-adaptive and often more accurate than random features for the same target rank, at the cost of needing access to a representative sample of the training data.

When to use which

For very large $n$ and stationary kernels (RBF, Laplacian, Matérn), random Fourier features are usually the simplest. For more general kernels, including non-stationary or non-vectorial ones (string, graph, intersection), Nyström is more flexible because it only requires the ability to evaluate the kernel.

Kernel methods more broadly

Kernel SVMs are the best-known kernel method, but the same toolbox extends to many other algorithms by replacing inner products with kernel evaluations.

Kernel ridge regression is the kernelized form of ridge regression. It has a closed-form solution involving the inverse of $K + \lambda I$, and is a building block for Gaussian process regression.
Kernel principal component analysis (Schölkopf, Smola, Müller, 1998) performs PCA in feature space, recovering non-linear principal directions.
Support vector regression extends SVM to real-valued outputs using the $\epsilon$-insensitive loss.
One-class SVM (Schölkopf et al. 2001) estimates the support of a distribution for novelty detection and outlier scoring.
Gaussian processes are a Bayesian counterpart that uses the same kernel as a covariance function.
Kernelized Fisher discriminant, kernel canonical correlation analysis, kernel independent component analysis, and kernel mean embedding all follow the same recipe.

Together these methods form the field of kernel methods, in which the kernel is the unit of algorithm design. Picking the right kernel is the modeling step; the rest is convex optimization.

Comparison to deep learning

Kernel SVMs and deep neural networks solve the same broad problem (learn non-linear functions from data) but trade different things off.

Property	Kernel SVM	Deep neural network
Optimization	Convex; unique global minimum	Non-convex; many local minima
Hyperparameters	Few ($C$, $\gamma$, kernel choice)	Many (architecture, optimizer, learning rate, regularization)
Sample efficiency	Strong on small datasets	Usually needs large data
Computational cost	$O(n^2)$ to $O(n^3)$ training	Roughly $O(n)$ per epoch on GPU
Feature engineering	The kernel is the engineered representation	Features learned end-to-end
Handling raw perception (images, audio)	Weak without hand-designed kernels	State of the art
Theoretical guarantees	Generalization bounds via VC dimension and margin theory	Mostly empirical, with neural tangent kernel and double descent providing partial theory
Interpretability	Decision function is a sum over support vectors	Generally opaque
Online updates	Hard for non-linear kernels	Natural via SGD

A useful mental model is that deep networks learn the feature map $\phi$ from data, while kernel SVMs require the user to specify $\phi$ implicitly through $K$. When the right $K$ is known (text $n$-grams, molecule fingerprints, RBF over normalized features), kernel SVMs are remarkably hard to beat, especially with limited training data. When the data is large and perceptual, deep networks dominate because they can learn a better $\phi$ than any hand-designed kernel.

The two approaches connect through the neural tangent kernel (NTK). In the infinite-width limit and small learning rate, gradient descent on a deep network is equivalent to kernel ridge regression with a specific kernel determined by the architecture (Jacot, Gabriel, and Hongler, NeurIPS 2018). The NTK has become a major tool for analyzing deep learning theoretically, and shows that even deep models live inside the kernel-methods picture in a precise sense.

Modern relevance

Kernel SVMs are no longer the default for large-scale image or language tasks, but they remain widely used and are often the strongest baseline in several settings.

Bioinformatics and chemoinformatics. Kernel SVMs over string kernels for protein sequences and graph kernels for molecules remain competitive with graph neural networks, particularly when datasets are small (hundreds to thousands of compounds).
Text classification with linear SVMs. For document categorization, sentiment analysis, and spam detection on TF-IDF features, linear SVMs (LIBLINEAR) routinely match or beat deep models on small to medium corpora and run orders of magnitude faster at inference time.
Tabular data with hundreds to tens of thousands of rows. RBF SVMs are a strong baseline alongside gradient boosted trees. Both often beat deep nets on this regime.
Anomaly and novelty detection. One-class SVMs are a default tool for fraud detection, intrusion detection, and quality control.
Brain-computer interfaces and medical signals. EEG, ECG, and similar low-data, high-noise signals continue to favor SVMs because of their sample efficiency and the convexity of training.
Robotics and control. Kernel methods underpin many Gaussian process regression models used for system identification and Bayesian optimization of controllers.
Theoretical analysis of deep learning. The NTK and related kernels provide one of the few frameworks in which deep networks admit closed-form analysis.

Recent research has pushed kernel SVMs further with structured kernels for graphs and sequences, scalable solvers on GPUs (ThunderSVM, cuML), and hybrid systems where deep network embeddings are used as inputs to a kernel SVM trained on top. The latter pattern is common when transferring large pretrained models to small downstream datasets where retraining the network would overfit.

Explain like I'm 5 (ELI5)

Imagine red and blue marbles scattered on a table, and you want to draw a line that separates them. If you can use a straight line, that is a regular SVM, and the SVM finds the line with the biggest empty band on either side, like a road as wide as possible without running over any marbles.

Now suppose the marbles are arranged in a ring: blue marbles in the middle, red marbles around the outside. No straight line works. The kernel SVM cheats. It pretends to lift each marble up off the table by an amount that depends on how far it is from the center, so the blue marbles in the middle stay low and the red marbles outside go higher. From the side, you can now slide a flat sheet of paper between the two layers and call that the boundary. Looking back down at the table, the boundary becomes a circle.

The trick is that the SVM never actually lifts anything. It only computes how far apart any two marbles would be after lifting, using a kernel function. The math says that is enough to find the boundary. By choosing different kernel functions, you get different ways of "lifting" and therefore different shapes for the final boundary on the table: circles, wiggles, rings within rings, almost anything.

References

Aizerman, M., Braverman, E., and Rozonoer, L. (1964). *Theoretical foundations of the potential function method in pattern recognition learning*. Automation and Remote Control, 25, 821-837.
Mercer, J. (1909). *Functions of positive and negative type, and their connection with the theory of integral equations*. Philosophical Transactions of the Royal Society A, 209, 415-446.
Boser, B. E., Guyon, I. M., and Vapnik, V. N. (1992). *A training algorithm for optimal margin classifiers*. Proceedings of the Fifth Annual Workshop on Computational Learning Theory (COLT 1992), 144-152.
Cortes, C., and Vapnik, V. (1995). *Support-vector networks*. Machine Learning, 20(3), 273-297.
Vapnik, V. N. (1995). *The Nature of Statistical Learning Theory*. Springer-Verlag.
Vapnik, V. N. (1998). *Statistical Learning Theory*. Wiley-Interscience.
Platt, J. C. (1998). *Sequential minimal optimization: A fast algorithm for training support vector machines*. Microsoft Research Technical Report MSR-TR-98-14.
Joachims, T. (1998). *Text categorization with support vector machines: Learning with many relevant features*. Proceedings of the European Conference on Machine Learning (ECML), 137-142.
Joachims, T. (1999). *Making large-scale SVM learning practical*. In Schölkopf, Burges, and Smola (eds.), Advances in Kernel Methods, MIT Press.
Schölkopf, B., Smola, A. J., and Müller, K.-R. (1998). *Nonlinear component analysis as a kernel eigenvalue problem*. Neural Computation, 10(5), 1299-1319.
Williams, C. K. I., and Seeger, M. (2001). *Using the Nyström method to speed up kernel machines*. Advances in Neural Information Processing Systems 13, 682-688.
Chang, C.-C., and Lin, C.-J. (2011). *LIBSVM: A library for support vector machines*. ACM Transactions on Intelligent Systems and Technology, 2(3), 27:1-27:27. (Software released 2001.)
Fan, R.-E., Chang, K.-W., Hsieh, C.-J., Wang, X.-R., and Lin, C.-J. (2008). *LIBLINEAR: A library for large linear classification*. Journal of Machine Learning Research, 9, 1871-1874.
Crammer, K., and Singer, Y. (2001). *On the algorithmic implementation of multiclass kernel-based vector machines*. Journal of Machine Learning Research, 2, 265-292.
Schölkopf, B., and Smola, A. J. (2002). *Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond*. MIT Press.
Shalev-Shwartz, S., Singer, Y., and Srebro, N. (2007). *Pegasos: Primal estimated sub-gradient solver for SVM*. Proceedings of the 24th International Conference on Machine Learning (ICML), 807-814.
Rahimi, A., and Recht, B. (2007). *Random features for large-scale kernel machines*. Advances in Neural Information Processing Systems 20, 1177-1184.
Schölkopf, B., Platt, J. C., Shawe-Taylor, J., Smola, A. J., and Williamson, R. C. (2001). *Estimating the support of a high-dimensional distribution*. Neural Computation, 13(7), 1443-1471.
Jacot, A., Gabriel, F., and Hongler, C. (2018). *Neural tangent kernel: Convergence and generalization in neural networks*. Advances in Neural Information Processing Systems 31.
Pedregosa, F. et al. (2011). *Scikit-learn: Machine learning in Python*. Journal of Machine Learning Research, 12, 2825-2830.
Wen, Z., Shi, J., Li, Q., He, B., and Chen, J. (2018). *ThunderSVM: A fast SVM library on GPUs and CPUs*. Journal of Machine Learning Research, 19(21), 1-5.
Hofmann, T., Schölkopf, B., and Smola, A. J. (2008). *Kernel methods in machine learning*. Annals of Statistics, 36(3), 1171-1220.

Introduction

Mathematical foundation

Maximum-margin classifier

Soft margin

Lagrangian dual and the kernel trick

Mercer's theorem

Support vectors

History

Common kernels

Algorithms

Sequential Minimal Optimization (SMO)

Working-set methods

Stochastic and gradient methods

Cutting-plane and bundle methods

Multi-class strategies

Implementations

Computational complexity

Approximate kernels

Random Fourier features

Nyström method

When to use which

Kernel methods more broadly

Comparison to deep learning

Modern relevance

Explain like I'm 5 (ELI5)

References

Improve this article

Related Articles

ARC-AGI 2

Squared Hinge Loss

AUC-ROC

Machine learning terms/Clustering

Machine learning terms/Decision Forests

Machine learning terms/Fairness

Introduction

Mathematical foundation

Maximum-margin classifier

Soft margin

Lagrangian dual and the kernel trick

Mercer's theorem

Support vectors

History

Common kernels

Algorithms

Sequential Minimal Optimization (SMO)

Working-set methods

Stochastic and gradient methods

Cutting-plane and bundle methods

Multi-class strategies

Implementations

Computational complexity

Approximate kernels

Random Fourier features

Nyström method

When to use which

Kernel methods more broadly

Comparison to deep learning

Modern relevance

Explain like I'm 5 (ELI5)

References

Related Articles

ARC-AGI 2

Squared Hinge Loss

AUC-ROC

Machine learning terms/Clustering

Machine learning terms/Decision Forests

Machine learning terms/Fairness