A support vector machine (SVM) is a supervised learning algorithm that finds the optimal hyperplane separating data points of different classes in a high-dimensional feature space. The hyperplane is chosen to maximize the margin, which is the distance between the hyperplane and the nearest data points from each class. These nearest points, called support vectors, are the critical elements that define the decision boundary. SVMs were among the most influential machine learning algorithms of the late 1990s and 2000s, dominating applications in text categorization, image classification, and bioinformatics before the rise of deep learning [1].
The theoretical foundations of SVMs trace back to Vladimir Vapnik's work on statistical learning theory in the 1960s and 1970s. Vapnik and Alexei Chervonenkis developed the concept of VC (Vapnik-Chervonenkis) dimension, which provides a measure of the capacity of a class of functions and forms the basis for understanding generalization in learning algorithms [2]. This theoretical framework, sometimes called VC theory, gives formal guarantees about how well a model trained on finite data will perform on unseen examples.
The original maximum-margin algorithm for linearly separable data, known as the generalized portrait method, was developed by Vapnik and Alexey Lerner in 1963 at the Institute of Control Sciences in Moscow. However, the approach remained relatively obscure outside the Soviet Union for decades.
The modern formulation of SVMs appeared in 1992 when Bernhard Boser, Isabelle Guyon, and Vladimir Vapnik proposed applying the kernel trick to maximum-margin classifiers, allowing SVMs to handle nonlinear decision boundaries [3]. This was a pivotal development because it transformed SVMs from a method limited to linear problems into one capable of modeling complex, nonlinear relationships.
In 1995, Corinna Cortes and Vladimir Vapnik published "Support-Vector Networks" in the journal Machine Learning, introducing the soft-margin SVM that could handle noisy, non-separable data by allowing some misclassifications at a controlled penalty [1]. This paper became one of the most cited in the history of machine learning and established SVMs as a practical tool for real-world problems.
Throughout the late 1990s and 2000s, SVMs achieved state-of-the-art results on a wide range of benchmarks. They were particularly dominant in text classification (spam filtering, topic categorization), handwriting recognition (the MNIST dataset), and bioinformatics (protein classification, gene expression analysis). SVMs won multiple machine learning competitions and were the go-to algorithm for many practitioners before neural networks resurged with the deep learning revolution starting around 2012.
Consider a binary classification problem with two classes of data points in two dimensions. Many lines (or hyperplanes, in higher dimensions) could separate the classes, but they are not equally good. A line that passes very close to data points from one class is more likely to misclassify new points than a line that maintains a healthy buffer zone on both sides.
An SVM finds the hyperplane that maximizes this buffer zone, called the margin. The margin is defined as twice the perpendicular distance from the hyperplane to the nearest data point from either class. Maximizing the margin has a solid theoretical justification from statistical learning theory: classifiers with larger margins have lower VC dimension and thus better generalization bounds [2].
The data points that lie exactly on the margin boundaries are the support vectors. They "support" the hyperplane in the sense that removing any of them would change the position or orientation of the optimal hyperplane. All other data points are irrelevant to the solution, which makes SVMs memory efficient: the model depends only on the support vectors, not the entire training set.
Given a training set of n labeled examples {(x_i, y_i)} where x_i is a feature vector and y_i is the class label (+1 or -1), the hard-margin SVM finds the hyperplane w . x + b = 0 that correctly classifies all training points while maximizing the margin 2/||w||.
This is formulated as a constrained optimization problem:
Minimize: (1/2) ||w||^2
Subject to: y_i (w . x_i + b) >= 1 for all i = 1, ..., n
The constraint ensures that every point is on the correct side of the margin boundary. Minimizing ||w||^2 (equivalently maximizing 2/||w||) maximizes the margin.
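The hard-margin constraints can be checked numerically. The following sketch (an illustration assuming scikit-learn is available; the dataset is made up) approximates the hard-margin SVM with a very large C and verifies that every functional margin satisfies the constraint, with equality at the support vectors:

```python
import numpy as np
from sklearn.svm import SVC

# Tiny separable dataset; a very large C approximates the hard-margin SVM.
X = np.array([[0., 0.], [1., 1.], [3., 3.], [4., 4.]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)
w, b = clf.coef_.ravel(), clf.intercept_[0]

# Functional margins y_i (w . x_i + b): all >= 1, equal to 1 at the
# support vectors (1,1) and (3,3).
margins = y * (X @ w + b)
print("functional margins:", np.round(margins, 3))

# Geometric margin width is 2/||w||, here the distance between the
# two support vectors along w.
print(f"margin width 2/||w||: {2 / np.linalg.norm(w):.3f}")
```

For this dataset the optimal separator is the perpendicular bisector between (1,1) and (3,3), giving a margin width of 2*sqrt(2) ≈ 2.83.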
In practice, data is rarely linearly separable. The soft-margin SVM, introduced by Cortes and Vapnik [1], relaxes the constraints by introducing slack variables xi_i >= 0 that allow some points to violate the margin or even be misclassified:
Minimize: (1/2) ||w||^2 + C * sum(xi_i)
Subject to: y_i (w . x_i + b) >= 1 - xi_i and xi_i >= 0 for all i
The hyperparameter C controls the trade-off between maximizing the margin (favoring a simpler model) and minimizing classification errors on the training data (fitting it more closely). A large C penalizes misclassifications heavily, resulting in a narrow margin that fits the training data tightly. A small C tolerates more misclassifications in exchange for a wider margin, which often generalizes better to new data.
| C value | Effect on margin | Effect on training errors | Tendency |
|---|---|---|---|
| Very large | Narrow | Few or none | May overfit |
| Moderate | Balanced | Some allowed | Good generalization |
| Very small | Wide | Many allowed | May underfit |
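The table's pattern can be observed directly: a wider margin (small C) leaves more points on or inside the margin, so more of them become support vectors. A minimal sketch assuming scikit-learn, with a synthetic overlapping dataset chosen for illustration:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping Gaussian blobs (cluster_std=2.0 forces some overlap).
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=0)

counts = []
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    counts.append(clf.support_.size)
    # Small C -> wide margin -> more points on or inside it -> more SVs.
    print(f"C={C:>6}: {clf.support_.size} support vectors")
```

The support-vector count shrinks as C grows, mirroring the narrowing margin described above.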
Using Lagrange multipliers, the primal problem can be reformulated as its Lagrangian dual:
Maximize: sum(alpha_i) - (1/2) sum_i sum_j (alpha_i * alpha_j * y_i * y_j * x_i . x_j)
Subject to: 0 <= alpha_i <= C for all i, and sum(alpha_i * y_i) = 0
Here, alpha_i are the Lagrange multipliers (dual variables). Points with alpha_i > 0 are the support vectors. The dual formulation is important for two reasons. First, it depends on the data only through dot products x_i . x_j, which enables the kernel trick. Second, for many problems the number of support vectors is much smaller than the total number of training points, making the dual problem efficient to solve. The dual is a quadratic programming (QP) problem, and specialized solvers such as Sequential Minimal Optimization (SMO), developed by John Platt in 1998, handle it efficiently [4].
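The dual variables can be inspected in practice. In scikit-learn's SVC (a libsvm-based solver using SMO), `dual_coef_` stores y_i * alpha_i for the support vectors only, so every stored coefficient is nonzero and bounded by C in magnitude. A short sketch on a synthetic dataset (illustrative only):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=60, n_features=2, n_informative=2,
                           n_redundant=0, random_state=1)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

# dual_coef_ holds y_i * alpha_i for support vectors only, so each
# |entry| is an alpha_i with 0 < alpha_i <= C.
alphas = np.abs(clf.dual_coef_).ravel()
print(f"{clf.support_.size} of {len(X)} points are support vectors")
print(f"alpha range: ({alphas.min():.4f}, {alphas.max():.4f}]")
```

Typically only a minority of training points receive alpha_i > 0, which is exactly the sparsity the dual formulation exploits.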
The kernel trick is the key insight that extends SVMs from linear to nonlinear classification [3]. The idea is that data which is not linearly separable in the original feature space may become separable after being mapped to a higher-dimensional space through a nonlinear transformation phi(x).
Directly computing the transformation phi(x) and then taking dot products in the new space would be prohibitively expensive if the new space has very high (or infinite) dimensionality. The kernel trick sidesteps this by replacing every dot product x_i . x_j in the dual formulation with a kernel function K(x_i, x_j) = phi(x_i) . phi(x_j). The kernel function computes the dot product in the high-dimensional space without ever explicitly computing the transformation.
| Kernel | Formula | Parameters | Typical use |
|---|---|---|---|
| Linear | K(x, y) = x . y | None | Linearly separable data, text classification |
| Polynomial | K(x, y) = (gamma * x . y + r)^d | Degree d, coefficient r, gamma | Feature interactions of known order |
| Radial basis function (RBF) | K(x, y) = exp(-gamma * ||x - y||^2) | gamma (inverse bandwidth) | General-purpose nonlinear classification |
| Sigmoid | K(x, y) = tanh(gamma * x . y + r) | gamma, r | Neural network analogy |
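The claim that the solver sees the data only through dot products can be demonstrated: computing the RBF Gram matrix K(x_i, x_j) = exp(-gamma ||x_i - x_j||^2) explicitly and passing it to a precomputed-kernel SVM should behave like the built-in RBF kernel. A sketch assuming scikit-learn, with gamma and the dataset chosen arbitrarily:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=100, noise=0.1, random_state=0)
gamma = 0.5

# Explicit RBF Gram matrix: K[i, j] = exp(-gamma * ||x_i - x_j||^2).
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-gamma * sq_dists)

# Training on the precomputed Gram matrix: the solver never sees X,
# only kernel values, i.e. dot products in the implicit feature space.
clf_pre = SVC(kernel="precomputed", C=1.0).fit(K, y)
clf_rbf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)

agreement = (clf_pre.predict(K) == clf_rbf.predict(X)).mean()
print(f"prediction agreement: {agreement:.2f}")
```

Both models solve the same dual QP, so their predictions coincide (up to numerical tolerance) even though one never touches the raw features.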
The RBF kernel is the most commonly used default because it can model a wide variety of decision boundaries and has only one hyperparameter (gamma). A small gamma produces a smooth, wide-reaching influence for each support vector, while a large gamma produces a more localized, jagged decision boundary.
The kernel trick is not unique to SVMs. It applies to any algorithm that can be expressed entirely in terms of dot products between data points, including kernel PCA, kernel ridge regression, and other kernel methods.
The support vectors are the subset of training points that lie on or within the margin boundaries. They carry all the information needed to define the decision boundary: typically they are a small fraction of the training set, and the prediction function is a weighted sum over the support vectors alone.
Support vector regression (SVR) adapts the SVM framework to regression problems. Instead of finding a hyperplane that separates classes, SVR finds a function that deviates from the actual targets by at most epsilon for each training point, while being as flat as possible [5].
SVR uses an epsilon-insensitive loss function: predictions within an epsilon-tube around the true values incur zero loss, while predictions outside the tube incur a linear penalty proportional to the distance from the tube boundary. The optimization problem becomes:
Minimize: (1/2) ||w||^2 + C * sum(xi_i + xi_i*)
Subject to: y_i - (w . x_i + b) <= epsilon + xi_i, (w . x_i + b) - y_i <= epsilon + xi_i*, and xi_i, xi_i* >= 0
The epsilon parameter controls the width of the tube (the tolerance for errors), and C controls the penalty for points outside the tube. Like classification SVMs, SVR can use kernels to model nonlinear relationships.
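The epsilon-insensitive loss also produces sparsity: points strictly inside the tube get zero loss and zero dual weight, so only points on or outside the tube become support vectors. A sketch assuming scikit-learn's SVR, on a made-up noisy sine curve:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, 80)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.1, 80)

# epsilon sets the tube half-width; residuals inside the tube incur
# no loss, so those points do not become support vectors.
svr = SVR(kernel="rbf", C=10.0, epsilon=0.2).fit(X, y)
pred = svr.predict(X)

print(f"{svr.support_.size} support vectors out of {len(X)} points")
print(f"mean absolute error: {np.mean(np.abs(y - pred)):.3f}")
```

Widening epsilon shrinks the support set (a cheaper, coarser model); narrowing it does the opposite.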
SVMs have several notable strengths that contributed to their popularity:
| Advantage | Explanation |
|---|---|
| Effective in high-dimensional spaces | SVMs work well even when the number of features exceeds the number of samples, common in text classification and genomics |
| Memory efficient | Only support vectors are stored and used for prediction, not the entire training set |
| Versatile through kernels | Different kernel functions allow modeling of diverse decision boundaries |
| Robust to overfitting in high dimensions | The margin maximization principle provides regularization; theoretical generalization bounds from VC theory |
| Strong theoretical foundation | Grounded in statistical learning theory with formal guarantees on generalization error |
| Works well with limited data | Can achieve good performance with relatively small training sets compared to neural networks |
SVMs also have significant drawbacks, especially in the context of modern datasets:
| Limitation | Explanation |
|---|---|
| Slow on large datasets | Training time scales roughly O(n^2) to O(n^3), making SVMs impractical for datasets with millions of samples |
| Kernel selection is non-trivial | Choosing the right kernel and tuning its hyperparameters (C, gamma, degree) requires extensive cross-validation |
| Not natively probabilistic | Standard SVMs output class labels, not probabilities. Platt scaling can convert SVM outputs to probabilities, but it is an add-on step |
| Sensitive to feature scaling | Features must be normalized or standardized; without this, features with large magnitudes dominate the kernel computation |
| Poor interpretability with nonlinear kernels | The decision boundary in the original feature space can be extremely complex and difficult to explain |
| Binary by nature | SVMs are inherently binary classifiers. Multi-class problems require strategies like one-vs-one or one-vs-rest, which multiply the computational cost |
The relationship between SVMs and neural networks has shifted over the decades. In the 1990s and 2000s, SVMs were often preferred over neural networks for several reasons: they had stronger theoretical guarantees, fewer hyperparameters to tune, no local minima issues (the SVM optimization problem is convex), and they performed better on the small-to-medium datasets common at the time.
| Aspect | SVM | Neural network |
|---|---|---|
| Training data requirements | Works well with small to medium datasets | Typically needs large datasets to excel |
| Computational cost | Expensive for large n; fast at prediction | Expensive to train; GPU-accelerated |
| Convexity | Convex optimization (global optimum guaranteed) | Non-convex optimization (local minima, saddle points) |
| Feature engineering | Often requires manual feature engineering | Can learn features automatically (especially with deep architectures) |
| Scalability | Struggles beyond ~100K samples | Scales to billions of samples with modern hardware |
| Interpretability | Moderate (support vectors can be inspected) | Low for deep models |
| Performance on images, text, speech | Good with kernels, but plateaued | State-of-the-art with deep learning |
The deep learning revolution, marked by AlexNet's victory in the 2012 ImageNet competition, shifted the balance decisively. Deep neural networks could learn features automatically from raw data (pixels, characters, waveforms), eliminating the need for the hand-crafted features that SVMs relied on. For large-scale perception tasks (image recognition, speech recognition, natural language processing), deep learning now clearly outperforms SVMs.
However, SVMs retain advantages for smaller datasets, high-dimensional sparse data (such as text with bag-of-words features), and situations where training data is limited or expensive to obtain.
SVMs occupy a pivotal place in the history of machine learning. Several factors contributed to their influence:
Bridging theory and practice. SVMs were one of the first algorithms where strong theoretical results (generalization bounds from VC theory, structural risk minimization) translated into excellent empirical performance. This validated the importance of statistical learning theory and influenced how the field thought about model selection and generalization.
The kernel trick as a paradigm. The kernel trick, while predating SVMs (Aizerman, Braverman, and Rozonoer discussed it in 1964), was popularized by SVMs and spawned an entire subfield of kernel methods. Kernel PCA, kernel regression, Gaussian processes, and many other algorithms benefited from this framework.
Setting benchmarks. SVMs set the performance bar on numerous benchmark datasets in the 2000s, from the UCI Machine Learning Repository to Reuters text classification. Any new algorithm was expected to demonstrate competitive performance against SVMs to be taken seriously.
Competition dominance. Before gradient boosted trees and deep learning took over, SVMs were frequent winners in machine learning competitions, particularly for problems involving structured or high-dimensional data.
SVMs are fundamentally designed for binary classification, but several strategies extend them to multi-class problems:
One-vs-rest (OvR). For K classes, train K binary SVMs, each distinguishing one class from all others. A new point is assigned to the class whose SVM gives the highest decision value. This approach requires K classifiers.
One-vs-one (OvO). Train K(K-1)/2 binary SVMs, one for every pair of classes. A new point is classified by a majority vote among all pairwise classifiers. Despite the larger number of classifiers, each is trained on a smaller subset of the data, so OvO can be faster in practice. scikit-learn uses OvO as its default multi-class strategy for SVMs.
Direct multi-class formulations. Some SVM formulations handle multiple classes in a single optimization problem (Crammer and Singer, 2001), but these are less commonly used in practice.
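The classifier counts for the two main strategies are easy to verify with scikit-learn's generic wrappers (shown here on a synthetic 4-class problem, chosen so that the counts differ):

```python
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

# A made-up 4-class dataset, so K = 4.
X, y = make_classification(n_samples=200, n_features=6, n_informative=4,
                           n_classes=4, random_state=0)

# One-vs-rest trains K binary SVMs.
ovr = OneVsRestClassifier(SVC(kernel="rbf")).fit(X, y)
# One-vs-one trains K(K-1)/2 pairwise SVMs.
ovo = OneVsOneClassifier(SVC(kernel="rbf")).fit(X, y)

print("OvR estimators:", len(ovr.estimators_))  # 4
print("OvO estimators:", len(ovo.estimators_))  # 4*3/2 = 6
```

Note that each OvO classifier trains on only the two classes it compares, which is why OvO can be faster per classifier despite needing more of them.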
Several practical tips are essential for using SVMs effectively:
Feature scaling. SVMs are sensitive to the scale of input features. Standardizing features (subtracting the mean and dividing by the standard deviation) or normalizing them to a fixed range (such as [0, 1]) is important for good performance, especially with the RBF kernel.
Hyperparameter tuning. The key hyperparameters are C (regularization strength), the kernel type, and kernel-specific parameters like gamma for RBF. Grid search or randomized search with cross-validation is the standard approach. A common starting point is to search C over powers of 10 (0.001, 0.01, 0.1, 1, 10, 100, 1000) and gamma over a similar range.
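The scaling and tuning advice above combines naturally in a pipeline, so the scaler is refit on each cross-validation fold and never leaks test-fold statistics. A sketch assuming scikit-learn, using the built-in breast cancer dataset and a deliberately coarse grid:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Scaling inside the pipeline keeps cross-validation honest.
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# Powers-of-10 grids for C and gamma, as suggested in the text.
param_grid = {
    "svc__C": [0.01, 0.1, 1, 10, 100],
    "svc__gamma": [0.001, 0.01, 0.1, 1],
}
search = GridSearchCV(pipe, param_grid, cv=5).fit(X, y)

print("best params:", search.best_params_)
print(f"best CV accuracy: {search.best_score_:.3f}")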
Handling imbalanced classes. When classes are imbalanced, the SVM may favor the majority class. Setting class weights inversely proportional to class frequencies (available as class_weight='balanced' in scikit-learn) can help.
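The class-weight remedy can be shown on a deliberately imbalanced synthetic problem (an illustrative sketch assuming scikit-learn; the 9:1 ratio is made up). `class_weight='balanced'` scales C inversely to class frequency, raising the penalty for minority-class errors:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# 9:1 imbalanced two-class problem; class 1 is the minority.
X, y = make_classification(n_samples=500, weights=[0.9, 0.1],
                           n_features=5, random_state=0)
minority = y == 1

plain = SVC(kernel="rbf").fit(X, y)
balanced = SVC(kernel="rbf", class_weight="balanced").fit(X, y)

# Fraction of true minority points predicted as class 1 (recall).
r_plain = plain.predict(X[minority]).mean()
r_bal = balanced.predict(X[minority]).mean()
print(f"minority recall, plain:    {r_plain:.2f}")
print(f"minority recall, balanced: {r_bal:.2f}")
```

The balanced model typically recovers minority recall at the cost of some extra false positives on the majority class.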
Computational strategies for large data. For datasets too large for standard SVM solvers, several options exist: linear SVMs (which can be trained in O(n) time using stochastic gradient descent), approximations to kernel SVMs using random Fourier features (Rahimi and Recht, 2007), or subsampling strategies.
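The random-Fourier-features idea maps inputs to an explicit low-dimensional feature space whose ordinary dot product approximates the RBF kernel, after which a fast linear SVM (hinge loss via SGD) stands in for the kernel machine. A sketch assuming scikit-learn's `RBFSampler` and `SGDClassifier`, with all dataset and hyperparameter choices made up for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Random Fourier features approximate the RBF kernel map explicitly
# (Rahimi and Recht, 2007); hinge loss makes the SGD model a linear SVM.
approx = make_pipeline(
    RBFSampler(gamma=0.1, n_components=300, random_state=0),
    SGDClassifier(loss="hinge", random_state=0),
).fit(X, y)

# Exact kernel SVM for comparison (feasible here, but O(n^2)+ in general).
exact = SVC(kernel="rbf", gamma=0.1).fit(X, y)

print(f"approx train accuracy: {approx.score(X, y):.3f}")
print(f"exact  train accuracy: {exact.score(X, y):.3f}")
```

The approximation trains in time linear in n, which is what makes it viable at scales where the exact QP is not.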
In the era of large language models and deep learning, SVMs are no longer the default choice for most machine learning problems. Yet they remain relevant in several niches: small and medium-sized datasets, high-dimensional sparse data such as text with bag-of-words features, and settings where labeled data is scarce or expensive to obtain.
The legacy of SVMs is also visible in the methods that followed. The idea of maximizing margins influenced the design of ensemble learning methods, and the kernel framework provided a template for many nonparametric approaches. While SVMs may no longer headline benchmark competitions, their contributions to the conceptual and mathematical foundations of machine learning are enduring.