# Support Vector Machine (SVM)

> Source: https://aiwiki.ai/wiki/support_vector_machine_svm
> Updated: 2026-06-20
> Categories: Machine Learning
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

A **support vector machine** (SVM) is a [supervised learning](/wiki/supervised_machine_learning) algorithm that classifies data by finding the [hyperplane](/wiki/hyperplane) that separates points into distinct classes with the widest possible margin. The margin is the gap between the decision boundary and the nearest training points of each class, and the points that sit on that margin are called support vectors. The modern soft margin form was introduced by Corinna Cortes and [Vladimir Vapnik](/wiki/vladimir_vapnik) at AT&T Bell Labs in 1995. [4] Originally developed for binary [classification](/wiki/classification_model), SVMs have been extended to handle multi-class problems, [regression](/wiki/regression_model), and even [anomaly detection](/wiki/anomaly_detection).

In their founding paper, Cortes and Vapnik described the method as follows: "The support-vector network is a new learning machine for two-group classification problems. The machine conceptually implements the following idea: input vectors are non-linearly mapped to a very high-dimension feature space. In this feature space a linear decision surface is constructed." [4] That 1995 paper, "Support-vector networks," has since been cited more than 40,000 times, making it one of the most influential papers in [machine learning](/wiki/machine_learning). [4]

SVMs are rooted in [statistical learning theory](/wiki/statistical_learning_theory), specifically the Vapnik-Chervonenkis (VC) theory, and they remain one of the most theoretically well-founded [machine learning](/wiki/machine_learning) algorithms. [5][11] Their ability to handle high-dimensional data and non-linear decision boundaries through the kernel trick has made them widely used in fields ranging from [natural language processing](/wiki/natural_language_understanding) to bioinformatics. [3]

## ELI5 (Explain like I'm 5)

Imagine you have a pile of red balls and blue balls mixed together on a table. You want to draw a line (or put down a stick) so all the red balls are on one side and all the blue balls are on the other side. But you don't just want any line. You want the line that gives the most space between the closest red ball and the closest blue ball. That way, if someone tosses a new ball onto the table, you have the best chance of guessing its color correctly based on which side of the line it lands on.

Sometimes the balls are mixed up so much that no straight line can separate them. In that case, you can imagine lifting the balls up into the air (adding a new dimension), and suddenly you can slide a flat sheet between them. That "lifting" trick is called the kernel trick, and it's one of the things that makes SVMs so powerful.

## When was the SVM invented?

The development of support vector machines spans several decades and involves contributions from multiple researchers.

| Year | Event | Contributors |
|------|-------|--------------|
| 1963-1964 | Development of the Generalized Portrait algorithm, the foundation of linear SVMs | [Vladimir Vapnik](/wiki/vladimir_vapnik) and Alexey Chervonenkis |
| 1971 | Introduction of VC dimension theory, providing the statistical learning framework underlying SVMs | Vapnik and Chervonenkis |
| 1974 | Publication of "Theory of Pattern Recognition" formalizing VC theory | Vapnik and Chervonenkis |
| 1979 | The concept of kernel functions in pattern recognition first explored | Aizerman, Braverman, and Rozonoer (building on earlier 1964 work) |
| 1982 | Publication of "Estimation of Dependences Based on Empirical Data" laying out statistical learning theory | Vapnik |
| 1992 | Introduction of the kernel trick for nonlinear classification in SVMs | Bernhard Boser, Isabelle Guyon, and Vapnik |
| 1995 | Publication of the soft margin SVM formulation | Corinna Cortes and Vapnik |
| 1998 | Development of the Sequential Minimal Optimization (SMO) algorithm for efficient SVM training | John Platt (Microsoft Research) |
| 2001 | Release of LIBSVM, which became the standard SVM implementation library | Chih-Chung Chang and Chih-Jen Lin |

The original SVM algorithm, developed by Vapnik and Chervonenkis in the early 1960s at the Institute of Control Sciences in Moscow, was a linear classifier. It was not until 1992 that Boser, Guyon, and Vapnik proposed applying the kernel trick to maximum-margin hyperplanes, enabling SVMs to handle non-linearly separable data. The 1992 work was presented at the Fifth Annual Workshop on Computational Learning Theory (COLT), held in Pittsburgh on July 27-29, 1992, and appears on pages 144-152 of the proceedings. [3] The soft margin formulation by Cortes and Vapnik in 1995 further extended SVMs to handle noisy, real-world data where perfect separation is not possible. [4]

## Mathematical formulation

### Linear SVM (hard margin)

Given a training dataset of n points of the form (x_1, y_1), (x_2, y_2), ..., (x_n, y_n), where each x_i is a d-dimensional real vector and y_i is either +1 or -1 (indicating the class), the goal is to find the maximum-margin hyperplane that separates the two classes.

A hyperplane can be written as the set of points x satisfying:

**w** · **x** - b = 0

where **w** is the normal vector to the hyperplane and b is the bias (offset) term. The [decision boundary](/wiki/decision_boundary) divides the space into two half-spaces.

For the data to be correctly classified, we require:

y_i(**w** · **x_i** - b) >= 1, for all i = 1, ..., n

The margin is the distance between the two parallel hyperplanes **w** · **x** - b = 1 and **w** · **x** - b = -1, which equals 2 / ||**w**||. Maximizing this margin is equivalent to minimizing ||**w**||, or more conveniently, minimizing (1/2)||**w**||^2.

The **primal optimization problem** for hard margin SVM is:

Minimize: (1/2)||**w**||^2

Subject to: y_i(**w** · **x_i** - b) >= 1, for all i = 1, ..., n

This is a convex quadratic programming problem with linear constraints.
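The constraint and the margin formula above can be checked numerically. The following is a minimal sketch assuming NumPy; the four data points and the candidate values of **w** and b are illustrative choices, not values from any reference.

```python
import numpy as np

# Toy 2-D data (illustrative values only).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# A hand-picked candidate hyperplane w . x - b = 0.
w = np.array([0.25, 0.25])
b = 0.0

# Hard margin constraint: y_i (w . x_i - b) >= 1 for every point.
functional_margins = y * (X @ w - b)
feasible = bool(np.all(functional_margins >= 1.0 - 1e-12))

# Distance between the hyperplanes w . x - b = +1 and w . x - b = -1.
margin_width = 2.0 / np.linalg.norm(w)

# Value of the primal objective (1/2)||w||^2 for this candidate.
objective = 0.5 * float(w @ w)

print(feasible, margin_width, objective)
```

Any feasible **w** with a smaller norm would give a wider margin, which is why the primal problem minimizes (1/2)||**w**||^2 subject to the constraints.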

### Soft margin SVM

In practice, data is rarely perfectly linearly separable. The soft margin formulation, introduced by Cortes and Vapnik (1995), allows some data points to violate the margin constraint by introducing slack variables (xi_i >= 0): [4]

Minimize: (1/2)||**w**||^2 + C * sum(xi_i)

Subject to: y_i(**w** · **x_i** - b) >= 1 - xi_i, and xi_i >= 0, for all i

The parameter C > 0 controls the trade-off between maximizing the margin and minimizing the classification error. A larger C penalizes misclassifications more heavily, leading to a narrower margin. A smaller C allows more margin violations, resulting in a wider margin but potentially more misclassifications.

The [hinge loss](/wiki/hinge_loss) function is closely related to the soft margin SVM objective. The hinge loss for a single data point is:

L(y, f(x)) = max(0, 1 - y * f(x))

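Using hinge loss with f(**x**) = **w** · **x** - b, the SVM objective can be written as:

Minimize: (1/n) * sum(max(0, 1 - y_i * f(x_i))) + lambda * ||**w**||^2

where lambda = 1 / (2nC) is the [regularization](/wiki/regularization) parameter. This is the soft margin objective divided by nC, so both forms have the same minimizer.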
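The relationship between the two objectives can be verified on a small example. This sketch assumes NumPy and uses made-up data and a made-up (**w**, b); for a fixed (**w**, b), the smallest feasible slack xi_i is exactly the hinge loss of point i.

```python
import numpy as np

def hinge_loss(y, scores):
    """Element-wise hinge loss max(0, 1 - y * f(x))."""
    return np.maximum(0.0, 1.0 - y * scores)

# Illustrative data and parameters (assumptions for this sketch).
X = np.array([[2.0, 1.0], [0.5, 0.2], [-1.0, -1.5], [0.3, -0.1]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = np.array([1.0, 0.5])
b = 0.0
C = 2.0
n = len(y)

scores = X @ w - b              # f(x_i) = w . x_i - b
xi = hinge_loss(y, scores)      # smallest slack satisfying the constraints

# Soft margin objective: (1/2)||w||^2 + C * sum(xi_i)
soft_margin = 0.5 * float(w @ w) + C * float(xi.sum())

# Hinge-loss form: (1/n) * sum(hinge) + lambda * ||w||^2, lambda = 1 / (2nC)
lam = 1.0 / (2.0 * n * C)
regularized = float(xi.mean()) + lam * float(w @ w)

print(soft_margin, n * C * regularized)
```

The two printed numbers agree, since multiplying the hinge-loss form by nC recovers the soft margin objective.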

### Dual formulation

Using Lagrange multipliers (alpha_i >= 0), the primal problem can be transformed into its dual form. The Lagrangian is:

L(**w**, b, **alpha**) = (1/2)||**w**||^2 - sum(alpha_i * [y_i(**w** · **x_i** - b) - 1])

Taking partial derivatives with respect to **w** and b and setting them to zero gives:

**w** = sum(alpha_i * y_i * **x_i**)

sum(alpha_i * y_i) = 0

Substituting these back into the Lagrangian yields the **dual optimization problem**:

Maximize: sum(alpha_i) - (1/2) * sum_i(sum_j(alpha_i * alpha_j * y_i * y_j * (**x_i** · **x_j**)))

Subject to: alpha_i >= 0 for all i, and sum(alpha_i * y_i) = 0

For the soft margin case, the constraint becomes 0 <= alpha_i <= C.

The dual formulation has several advantages. It depends on the data only through dot products **x_i** · **x_j**, which enables the kernel trick. It also typically has fewer effective variables since many alpha_i values will be zero. [5]
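A two-point problem is small enough to solve the dual by inspection. The sketch below assumes NumPy and an illustrative dataset with one point per class; because sum(alpha_i * y_i) = 0 forces both multipliers to be equal, a simple one-dimensional grid search over that shared value stands in for a real QP solver.

```python
import numpy as np

# One point per class (illustrative values).
X = np.array([[1.0, 1.0], [-1.0, -1.0]])
y = np.array([1.0, -1.0])

# Q_ij = y_i * y_j * (x_i . x_j), the matrix in the dual's quadratic term.
Q = (y[:, None] * y[None, :]) * (X @ X.T)

def dual_objective(alpha):
    """sum(alpha_i) - (1/2) * sum_ij alpha_i alpha_j y_i y_j (x_i . x_j)."""
    return float(alpha.sum() - 0.5 * alpha @ Q @ alpha)

# With y = (+1, -1), the equality constraint gives alpha_1 = alpha_2 = a.
grid = np.linspace(0.0, 1.0, 1001)
values = np.array([dual_objective(np.array([a, a])) for a in grid])
a_star = float(grid[np.argmax(values)])
alpha = np.array([a_star, a_star])

# Recover the primal solution from the stationarity conditions.
w = (alpha * y) @ X
# For a support vector k, y_k (w . x_k - b) = 1, so b = w . x_k - y_k.
b = float(w @ X[0] - y[0])

print(a_star, w, b)
```

Here the dual objective reduces to 2a - 4a^2, which peaks at a = 0.25 and yields **w** = (0.5, 0.5) and b = 0, so the margin 2 / ||**w**|| equals the distance between the two points.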

### KKT conditions

The Karush-Kuhn-Tucker (KKT) conditions are necessary and sufficient conditions for optimality in the SVM problem (since it is convex). The KKT conditions for SVM are:

1. **Stationarity**: **w** = sum(alpha_i * y_i * **x_i**) and sum(alpha_i * y_i) = 0
2. **Primal feasibility**: y_i(**w** · **x_i** - b) >= 1 - xi_i, and xi_i >= 0
3. **Dual feasibility**: 0 <= alpha_i <= C (only alpha_i >= 0 in the hard margin case)
4. **Complementary slackness**: alpha_i * [y_i(**w** · **x_i** - b) - 1 + xi_i] = 0 and (C - alpha_i) * xi_i = 0

The complementary slackness condition is particularly informative: for each data point, either alpha_i = 0 (the point is not a support vector and lies on or outside the margin) or y_i(**w** · **x_i** - b) = 1 - xi_i (the point lies on the margin, inside it, or on the wrong side of the decision boundary). Only the points with alpha_i > 0 are support vectors, and these are the only points that influence the position of the decision boundary. [5]
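The conditions can be turned into a small checker. This is a sketch assuming NumPy; the function name `check_kkt`, the data, and the candidate solution are all hypothetical, and the check includes the standard soft margin companion condition (C - alpha_i) * xi_i = 0.

```python
import numpy as np

def check_kkt(X, y, w, b, alpha, C, tol=1e-8):
    """Report which KKT conditions hold for a candidate soft margin solution."""
    margins = y * (X @ w - b)
    # Smallest feasible slack, so primal feasibility holds by construction.
    xi = np.maximum(0.0, 1.0 - margins)
    return {
        "stationarity": bool(
            np.allclose(w, (alpha * y) @ X, atol=tol) and abs(alpha @ y) <= tol
        ),
        "primal_feasibility": bool(np.all(margins >= 1.0 - xi - tol)),
        "dual_feasibility": bool(np.all(alpha >= -tol) and np.all(alpha <= C + tol)),
        "complementary_slackness": bool(
            np.all(np.abs(alpha * (margins - 1.0 + xi)) <= tol)
            and np.all(np.abs((C - alpha) * xi) <= tol)
        ),
    }

# Illustrative data: two points on the margin and one far outside it.
X = np.array([[1.0, 1.0], [-1.0, -1.0], [3.0, 3.0]])
y = np.array([1.0, -1.0, 1.0])
w = np.array([0.5, 0.5])
b = 0.0
alpha = np.array([0.25, 0.25, 0.0])

report = check_kkt(X, y, w, b, alpha, C=10.0)
support_mask = alpha > 1e-8   # points with alpha_i > 0 are support vectors

print(report, support_mask)
```

The third point has alpha_i = 0 and a functional margin of 3, so it could be removed without changing **w** or b.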

## What are support vectors?

Support vectors are the data points that lie closest to the decision boundary. They are the critical elements of the training set because:

- They are the only data points with non-zero Lagrange multipliers (alpha_i > 0).
- The decision boundary is entirely determined by the support vectors. Removing any non-support-vector data point from the training set would not change the resulting hyperplane.
- In the soft margin formulation, support vectors include points on the margin boundary (0 < alpha_i < C, xi_i = 0), points inside the margin but still on the correct side of the decision boundary (alpha_i = C, 0 < xi_i < 1), and misclassified points (alpha_i = C, xi_i > 1).

The number of support vectors relative to the total number of training points gives an indication of the model's complexity. A model with very few support vectors is more likely to generalize well. [5]

## How does the kernel trick work?

The kernel trick is a method that allows SVMs to operate in a high-dimensional (or even infinite-dimensional) feature space without explicitly computing the coordinates of the data in that space. Instead of mapping data points to a higher-dimensional space using a transformation function phi(**x**), the kernel trick computes the inner product of the mapped points directly: [3]

K(**x_i**, **x_j**) = phi(**x_i**) · phi(**x_j**)

This is possible because the dual formulation of the SVM depends on the data only through dot products. By replacing every dot product **x_i** · **x_j** with a kernel function K(**x_i**, **x_j**), the SVM can learn non-linear decision boundaries in the original input space. [3]
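A concrete case makes the identity K(**x_i**, **x_j**) = phi(**x_i**) · phi(**x_j**) easier to see. The sketch below assumes NumPy and uses the homogeneous degree-2 polynomial kernel (**x** · **z**)^2 on 2-D inputs, for which one explicit feature map is phi(**x**) = (x_1^2, sqrt(2) * x_1 * x_2, x_2^2); the function names are hypothetical.

```python
import numpy as np

def phi(x):
    """Explicit feature map for K(x, z) = (x . z)^2 on 2-D inputs."""
    x1, x2 = x
    return np.array([x1 * x1, np.sqrt(2.0) * x1 * x2, x2 * x2])

def poly2_kernel(x, z):
    """Same inner product, computed without ever building phi."""
    return float(np.dot(x, z)) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

explicit = float(np.dot(phi(x), phi(z)))  # map first, then take the dot product
implicit = poly2_kernel(x, z)             # kernel trick: stay in input space

print(explicit, implicit)
```

Both values agree. For higher degrees or the RBF kernel the explicit map grows very large or becomes infinite-dimensional, while the kernel evaluation stays a cheap function of the original vectors.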

### Mercer's theorem

For a function K to be a valid kernel, it must satisfy Mercer's condition: the function must be symmetric and positive semi-definite. Formally, for any finite set of points, the kernel matrix (also called the Gram matrix) with entries K_ij = K(**x_i**, **x_j**) must be positive semi-definite. This guarantees that the kernel corresponds to an inner product in some (possibly infinite-dimensional) feature space and ensures that the SVM dual problem remains a convex optimization problem. [8]
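On a finite sample, positive semi-definiteness can be checked through the eigenvalues of the Gram matrix. This is a minimal sketch assuming NumPy and random illustrative data; passing the check on one sample does not prove a function is a valid kernel, but failing it on any sample shows it is not. The helper names are hypothetical.

```python
import numpy as np

def squared_distances(X):
    """Pairwise squared Euclidean distances between rows of X."""
    sq = np.sum(X ** 2, axis=1)
    return np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)

def is_psd(K, tol=1e-10):
    """True if K is symmetric with no eigenvalue below -tol."""
    return bool(np.allclose(K, K.T) and np.linalg.eigvalsh(K).min() >= -tol)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
D2 = squared_distances(X)

rbf_gram = np.exp(-0.5 * D2)   # RBF kernel with gamma = 0.5
not_a_kernel = -D2             # symmetric, but zero trace and non-zero entries

print(is_psd(rbf_gram), is_psd(not_a_kernel))
```

The second matrix has a zero diagonal, so its eigenvalues sum to zero and at least one must be negative; the function K(**x**, **y**) = -||**x** - **y**||^2 therefore fails the condition.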

### Common kernel functions

| Kernel | Formula | Key parameters | Typical use cases |
|--------|---------|---------------|------------------|
| Linear | K(**x**, **y**) = **x** · **y** | None | Linearly separable data, text classification, high-dimensional sparse data |
| [Polynomial](/wiki/polynomial_kernel) | K(**x**, **y**) = (gamma * **x** · **y** + r)^d | d (degree), gamma (scale), r (constant) | Image processing, [natural language processing](/wiki/natural_language_understanding) |
| Radial Basis Function (RBF) / Gaussian | K(**x**, **y**) = exp(-gamma * \|\|**x** - **y**\|\|^2) | gamma (width parameter, gamma > 0) | General-purpose non-linear classification, default kernel in many libraries |
| Sigmoid | K(**x**, **y**) = tanh(gamma * **x** · **y** + r) | gamma (scale), r (constant) | [Neural network](/wiki/neural_network)-like behavior (less commonly used; not positive semi-definite for every choice of gamma and r) |
| Laplacian | K(**x**, **y**) = exp(-gamma * \|\|**x** - **y**\|\|_1) | gamma (width parameter) | Data with sharp edges or discontinuities |

The RBF kernel is the most widely used kernel in practice and serves as the default in many SVM implementations, including [scikit-learn](/wiki/scikit-learn). It can model complex decision boundaries and is equivalent to mapping data into an infinite-dimensional feature space. [8] The gamma parameter controls the width of the Gaussian; a small gamma means a wide Gaussian (smoother decision boundary), while a large gamma means a narrow Gaussian (more complex decision boundary).
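The effect of gamma can be seen by evaluating the RBF kernel for one fixed pair of points. This sketch assumes NumPy; the two points and the three gamma values are arbitrary illustrations.

```python
import numpy as np

def rbf_kernel(x, z, gamma):
    """K(x, z) = exp(-gamma * ||x - z||^2)."""
    return float(np.exp(-gamma * np.sum((x - z) ** 2)))

x = np.array([0.0, 0.0])
z = np.array([1.0, 1.0])   # squared distance to x is 2

similarities = {gamma: rbf_kernel(x, z, gamma) for gamma in (0.01, 1.0, 100.0)}
for gamma, value in similarities.items():
    print(f"gamma={gamma:>6}: K(x, z) = {value:.3e}")
```

With a small gamma the two points still look similar, so each training point influences a wide region. With a large gamma the similarity collapses to nearly zero, so each point only affects its immediate neighborhood and the boundary can become very irregular.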

## The C parameter

The regularization parameter C is one of the most important [hyperparameters](/wiki/hyperparameter) in SVMs. It controls the trade-off between two competing objectives: [4]

1. **Maximizing the margin**: A wider margin improves generalization but may allow more misclassifications.
2. **Minimizing classification error**: Correctly classifying training points reduces training error but may result in a narrower margin and potential [overfitting](/wiki/overfitting).

| C value | Effect on margin | Effect on bias | Effect on variance | Risk |
|---------|-----------------|----------------|-------------------|------|
| Very small (e.g., 0.001) | Very wide margin | High [bias](/wiki/bias_math_or_bias_term) | Low variance | [Underfitting](/wiki/underfitting) |
| Small (e.g., 0.1) | Wide margin | Moderate bias | Low-moderate variance | Slight underfitting |
| Moderate (e.g., 1.0) | Balanced margin | Balanced | Balanced | Good generalization |
| Large (e.g., 10) | Narrow margin | Low bias | High variance | Slight overfitting |
| Very large (e.g., 1000) | Very narrow margin | Very low bias | Very high variance | Overfitting |

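The optimal value of C is typically found through [cross-validation](/wiki/cross-validation). A common strategy is to search over a logarithmic grid (e.g., C = 0.001, 0.01, 0.1, 1, 10, 100, 1000). [12] The values in the table are illustrative: where underfitting or overfitting sets in depends on the dataset, the kernel, and how the features are scaled.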
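A logarithmic grid search over C might look like the following in scikit-learn. This is a sketch under assumptions: the synthetic dataset from `make_classification`, the RBF kernel, the 5-fold split, and the standardization step are choices made for illustration, not recommendations from the cited sources.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic binary classification data (illustrative only).
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=5, random_state=0
)

# Scale inside the pipeline so each fold is standardized on its own training part.
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma="scale"))

# Logarithmic grid over C, as described above.
param_grid = {"svc__C": [0.001, 0.01, 0.1, 1, 10, 100, 1000]}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)

best_C = search.best_params_["svc__C"]
print(best_C, round(search.best_score_, 3))
```

In practice gamma is usually tuned jointly with C for the RBF kernel, since the two parameters interact.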

## How are SVMs trained?

### Sequential Minimal Optimization (SMO)

The SMO algorithm, developed by John Platt at Microsoft Research in 1998, is the most widely used algorithm for training SVMs. It breaks the large quadratic programming (QP) problem into a series of the smallest possible QP problems, each involving only two Lagrange multipliers. These sub-problems can be solved analytically, avoiding the need for a general-purpose numerical QP solver. [6]

Platt summarized the approach in the paper's abstract: "SMO breaks this large QP problem into a series of smallest possible QP problems. These small QP problems are solved analytically, which avoids using a time-consuming numerical QP optimization as an inner loop." [6]

Key properties of SMO:

- Memory requirement is linear in the training set size, which allows SMO to handle very large training sets. [6]
- Training time scales somewhere between linearly and quadratically with the number of training examples, compared with the standard chunking algorithm, which scales between linearly and cubically. [6]
- It avoids the need for external QP optimization libraries.
- An SMO-type decomposition method is used internally by [LIBSVM](/wiki/libsvm) and, by extension, by [scikit-learn](/wiki/scikit-learn)'s SVC and SVR implementations. [9]
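The two-multiplier update at the heart of SMO can be sketched in a few dozen lines. The version below is a simplified teaching variant, not Platt's full algorithm: it assumes NumPy, a linear kernel, and a randomly chosen second multiplier instead of Platt's selection heuristics. The function name, data, and default parameters are hypothetical.

```python
import numpy as np

def simplified_smo(X, y, C=1.0, tol=1e-3, max_passes=10, max_sweeps=1000, seed=0):
    """Simplified SMO for a linear kernel, with f(x) = w . x - b."""
    rng = np.random.default_rng(seed)
    n = len(y)
    K = X @ X.T                       # linear kernel matrix
    alpha = np.zeros(n)
    b = 0.0
    quiet_passes = 0                  # consecutive sweeps without an update

    def error(k):
        # E_k = f(x_k) - y_k, where f(x_k) = sum_j alpha_j y_j K_jk - b
        return (alpha * y) @ K[:, k] - b - y[k]

    for _ in range(max_sweeps):
        changed = 0
        for i in range(n):
            E_i = error(i)
            violates = (y[i] * E_i < -tol and alpha[i] < C) or (
                y[i] * E_i > tol and alpha[i] > 0
            )
            if not violates:
                continue
            j = int(rng.integers(n - 1))
            if j >= i:
                j += 1                # random partner with j != i
            E_j = error(j)
            a_i, a_j = alpha[i], alpha[j]
            # [L, H] keeps both multipliers in [0, C] along the equality constraint.
            if y[i] != y[j]:
                L, H = max(0.0, a_j - a_i), min(C, C + a_j - a_i)
            else:
                L, H = max(0.0, a_i + a_j - C), min(C, a_i + a_j)
            eta = K[i, i] + K[j, j] - 2.0 * K[i, j]
            if L >= H or eta <= 0.0:
                continue
            # Analytic optimum of the two-variable sub-problem, clipped to the box.
            new_j = float(np.clip(a_j + y[j] * (E_i - E_j) / eta, L, H))
            if abs(new_j - a_j) < 1e-8:
                continue
            new_i = a_i + y[i] * y[j] * (a_j - new_j)
            b1 = E_i + y[i] * (new_i - a_i) * K[i, i] + y[j] * (new_j - a_j) * K[i, j] + b
            b2 = E_j + y[i] * (new_i - a_i) * K[i, j] + y[j] * (new_j - a_j) * K[j, j] + b
            if 0.0 < new_i < C:
                b = b1
            elif 0.0 < new_j < C:
                b = b2
            else:
                b = 0.5 * (b1 + b2)
            alpha[i], alpha[j] = new_i, new_j
            changed += 1
        quiet_passes = quiet_passes + 1 if changed == 0 else 0
        if quiet_passes >= max_passes:
            break

    return (alpha * y) @ X, b, alpha

# Linearly separable toy data (illustrative values).
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.0, 3.0],
              [-2.0, -2.0], [-3.0, -3.0], [-2.0, -3.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])

w, b, alpha = simplified_smo(X, y)
predictions = np.sign(X @ w - b)
print(w, b, predictions)
```

Each update changes exactly two multipliers so that sum(alpha_i * y_i) = 0 stays satisfied, which is why two is the smallest possible working set.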

### Other training methods

Several other approaches have been developed for training SVMs:

- **Chunking**: Solves the QP problem by working on subsets (chunks) of the training data at a time. An ancestor of SMO.
- **Decomposition methods**: Generalize chunking by selecting working sets of variables to optimize.
- **Gradient descent**: [Stochastic gradient descent](/wiki/stochastic_gradient_descent_sgd) can be applied to the primal SVM objective, which is useful for very large datasets.
- **Cutting plane methods**: Iteratively add constraints to approximate the full optimization problem.
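The gradient-based option in the list above can be sketched as stochastic subgradient descent on the hinge-loss form of the primal, (1/n) * sum(max(0, 1 - y_i * f(x_i))) + lambda * ||**w**||^2 with f(**x**) = **w** · **x** - b. This sketch assumes NumPy; the function name, learning rate, lambda, and epoch count are arbitrary illustrative values, and no learning-rate schedule is used.

```python
import numpy as np

def sgd_linear_svm(X, y, lam=0.01, lr=0.01, epochs=100, seed=0):
    """Stochastic subgradient descent on hinge loss plus L2 penalty."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        for i in rng.permutation(n):        # visit samples in random order
            margin = y[i] * (X[i] @ w - b)
            grad_w = 2.0 * lam * w          # gradient of lambda * ||w||^2
            grad_b = 0.0
            if margin < 1.0:                # hinge loss is active for this sample
                grad_w = grad_w - y[i] * X[i]
                grad_b = y[i]               # d/db of -y_i (w . x_i - b)
            w = w - lr * grad_w
            b = b - lr * grad_b
    return w, b

# Linearly separable toy data (illustrative values).
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.0, 3.0],
              [-2.0, -2.0], [-3.0, -3.0], [-2.0, -3.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])

w, b = sgd_linear_svm(X, y)
predictions = np.sign(X @ w - b)
print(w, b, predictions)
```

Each step touches a single sample, so the cost per update does not depend on the training set size, which is what makes this family of methods attractive for very large datasets.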

## Support vector regression (SVR)

Support Vector Regression extends the SVM framework to [regression](/wiki/regression_model) tasks. Instead of finding a hyperplane that separates classes, SVR finds a function that deviates from the actual observed targets by a value no greater than epsilon for each training point. SVR was introduced by Drucker, Burges, Kaufman, Smola, and Vapnik in 1997. [10]

### Epsilon-insensitive loss

SVR uses the epsilon-insensitive loss function:

L_epsilon(y, f(x)) = max(0, |y - f(x)| - epsilon)

This loss function defines an "epsilon tube" around the predicted function. Points inside the tube (with prediction error less than epsilon) incur zero loss. Only points outside the tube contribute to the loss, and the penalty grows linearly with the distance from the tube boundary. [10]
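The tube behavior is easy to see numerically. The following sketch assumes NumPy, and the targets, predictions, and epsilon value are made up for illustration.

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """max(0, |y - f(x)| - epsilon), applied element-wise."""
    return np.maximum(0.0, np.abs(y_true - y_pred) - epsilon)

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.05, 2.5, 2.95, 2.0])
epsilon = 0.1

losses = epsilon_insensitive_loss(y_true, y_pred, epsilon)
inside_tube = np.abs(y_true - y_pred) <= epsilon

print(losses)       # zero for points inside the tube
print(inside_tube)
```

The first and third predictions are within 0.1 of their targets and cost nothing, while the other two are penalized only by the amount they exceed the tube.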

The SVR optimization problem is:

Minimize: (1/2)||**w**||^2 + C * sum(xi_i + xi_i*)

Subject to: y_i - (**w** · **x_i** - b) <= epsilon + xi_i, (**w** · **x_i** - b) - y_i <= epsilon + xi_i*, and xi_i, xi_i* >= 0

Here, xi_i and xi_i* are slack variables for deviations above and below the epsilon tube, respectively.

### Key differences from SVM classification

| Aspect | SVM (classification) | SVR (regression) |
|--------|---------------------|------------------|
| Objective | Find a separating hyperplane | Find a function that approximates targets |
| Loss function | [Hinge loss](/wiki/hinge_loss) | Epsilon-insensitive loss |
| Margin/tube | Margin between classes | Epsilon tube around the function |
| Output | Class label (+1 or -1) | Continuous value |
| Support vectors | Points on or within the margin | Points on or outside the epsilon tube |

## Multi-class classification

SVMs are inherently binary classifiers. To handle problems with more than two classes, several strategies have been developed. [12]

### One-vs-rest (one-vs-all)

For a problem with K classes, this strategy trains K binary classifiers. Each classifier is trained to distinguish one class from all the remaining classes combined. During prediction, the class whose classifier produces the highest confidence score is selected.

- Number of classifiers: K
- Training data per classifier: all training data
- Potential issue: class imbalance, since each classifier sees one small positive class versus a large negative class

### One-vs-one

This strategy trains a binary classifier for every pair of classes. During prediction, each classifier votes for one of its two classes, and the class with the most votes wins.

- Number of classifiers: K(K-1)/2
- Training data per classifier: only data from the two classes involved
- Advantage: each classifier is trained on a smaller, balanced subset
- Disadvantage: the number of classifiers grows quadratically with the number of classes

### Comparison of multi-class strategies

| Strategy | Number of classifiers | Training data per classifier | Advantages | Disadvantages |
|----------|-----------------------|-----------------------------|------------|---------------|
| One-vs-rest | K | All data | Fewer classifiers, simpler | Class imbalance, less precise boundaries |
| One-vs-one | K(K-1)/2 | Subset of data (two classes) | Balanced training, often more accurate | Many classifiers for large K, higher memory |

The default in [scikit-learn](/wiki/scikit-learn)'s SVC implementation is one-vs-one, while LinearSVC uses one-vs-rest. [9][12]
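The difference in classifier counts shows up in the shape of the decision scores. This sketch assumes scikit-learn and a synthetic four-class dataset from `make_blobs`; `decision_function_shape="ovo"` is set explicitly so that SVC exposes its pairwise scores rather than an aggregated one-vs-rest view.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC, LinearSVC

K = 4  # number of classes in this illustrative dataset
X, y = make_blobs(n_samples=200, centers=K, random_state=0)

# SVC trains one binary classifier per pair of classes.
ovo = SVC(kernel="linear", decision_function_shape="ovo").fit(X, y)
# LinearSVC trains one binary classifier per class (one-vs-rest).
ovr = LinearSVC(max_iter=10000).fit(X, y)

ovo_scores = ovo.decision_function(X[:5])
ovr_scores = ovr.decision_function(X[:5])
n_pairs = K * (K - 1) // 2

print(ovo_scores.shape)  # one column per class pair
print(ovr_scores.shape)  # one column per class
```

With K = 4 the one-vs-one model produces 6 pairwise scores per sample, while the one-vs-rest model produces 4.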

## How does an SVM differ from a neural network?

SVMs have distinctive strengths and weaknesses compared to other commonly used [machine learning](/wiki/machine_learning) algorithms.

| Feature | SVM | [Logistic regression](/wiki/logistic_regression) | [Random forest](/wiki/random_forest) | [Neural network](/wiki/neural_network) |
|---------|-----|---------------------|---------------|----------------|
| Decision boundary | Maximum margin hyperplane | Probabilistic (log-odds) | Ensemble of [decision trees](/wiki/decision_tree) | Learned through [backpropagation](/wiki/backpropagation) |
| Handling non-linearity | Kernel trick | Feature engineering or polynomial features | Inherently non-linear | Inherently non-linear |
| Scalability | Poor for large datasets (O(n^2) to O(n^3)) | Good (O(nd)) | Good (parallelizable) | Good with GPU support |
| Interpretability | Low (except linear kernel) | High (coefficients are interpretable) | Moderate ([feature importance](/wiki/feature_importances)) | Low |
| Probability output | Not native (requires calibration) | Native (sigmoid function) | Native (class proportions) | Native ([softmax](/wiki/softmax)) |
| Handling of high dimensions | Excellent | Good | Can overfit | Good with sufficient data |
| Small datasets | Strong performance | Strong performance | Moderate | Weak (needs lots of data) |
| [Hyperparameter](/wiki/hyperparameter) sensitivity | High (C, kernel, gamma) | Low | Low | High |
| Theoretical guarantees | Strong (VC theory, margin bounds) | Strong (convex optimization) | Limited | Limited |

The key practical difference is that an SVM solves a convex optimization problem with a single global optimum, whereas a [neural network](/wiki/neural_network) minimizes a non-convex loss through [backpropagation](/wiki/backpropagation) and can converge to different local minima. SVMs also tend to perform well on small to medium datasets with many features, while [deep learning](/wiki/deep_learning) models typically need large amounts of data to outperform them.

### When to use SVMs

SVMs tend to work best when:

- The dataset is small to medium-sized (up to tens of thousands of samples).
- The number of features is large relative to the number of samples.
- There is a clear margin of separation between classes.
- High-dimensional data needs to be classified (e.g., text classification with many features).

SVMs are less suitable when:

- The dataset is very large (hundreds of thousands or millions of samples), because training time scales poorly.
- The problem requires probability estimates (SVMs do not natively output probabilities).
- Interpretability is a priority.
- The data is very noisy with many overlapping classes.

## What is a support vector machine used for?

SVMs have been applied successfully across a wide range of domains.

### Text classification and NLP

SVMs became one of the most popular algorithms for text classification tasks in the late 1990s and 2000s. In high-dimensional [bag-of-words](/wiki/bag_of_words) or TF-IDF feature spaces, linear SVMs perform particularly well. Applications include spam detection, [sentiment analysis](/wiki/sentiment_analysis), document categorization, and topic classification. Joachims (1998) demonstrated that SVMs outperformed other classifiers on text categorization benchmarks, reporting that SVMs "achieve substantial improvements over the currently best performing methods" while being "fully automatic" and eliminating the need for manual parameter tuning. [7]

### Bioinformatics

In computational biology, SVMs have been used for protein classification (achieving up to 90% accuracy in some studies), gene expression classification, disease diagnosis from microarray data, and protein structure prediction. The ability of SVMs to handle high-dimensional feature spaces with relatively few samples makes them well-suited for genomics applications, where the number of genes (features) far exceeds the number of patients (samples). [8]

### Image recognition and computer vision

[Image recognition](/wiki/image_recognition) was one of the early success stories for SVMs. They have been applied to handwriting recognition (including postal code reading and digit classification on the [MNIST](/wiki/mnist) dataset), face detection and recognition, [object detection](/wiki/object_detection), and medical image analysis. Before the rise of [deep learning](/wiki/deep_learning), SVMs combined with hand-crafted features (such as HOG or SIFT descriptors) were the dominant approach for many [computer vision](/wiki/computer_vision) tasks.

### Other applications

- **Remote sensing**: Classification of satellite imagery and land-use mapping.
- **Financial forecasting**: Stock market prediction and credit scoring.
- **Drug discovery**: Predicting biological activity of chemical compounds.
- **Intrusion detection**: Network security and anomaly detection.
- **Speech recognition**: Phoneme classification and speaker identification.

## Software implementations

Several mature software libraries implement SVMs:

| Library | Language | Notes |
|---------|----------|-------|
| [LIBSVM](/wiki/libsvm) | C/C++ (with bindings for Python, R, MATLAB, Java, and others) | The most widely used SVM library, developed by Chang and Lin and in active development since 2000. Implements SMO-type decomposition. The 2011 reference paper has been cited more than 40,000 times. |
| [scikit-learn](/wiki/scikit-learn) | Python | Uses LIBSVM internally for SVC and SVR. Also provides LinearSVC based on LIBLINEAR for faster linear SVMs. |
| SVMlight | C | Efficient implementation by Thorsten Joachims. Popular for text classification. |
| [TensorFlow](/wiki/tensorflow) / [PyTorch](/wiki/pytorch) | Python | Can implement SVMs using hinge loss, though these frameworks are primarily designed for [neural networks](/wiki/neural_network). |
| MATLAB | MATLAB | Built-in SVM functions in the Statistics and Machine Learning Toolbox. |

## Is SVM still used?

Although [deep learning](/wiki/deep_learning) has overtaken SVMs for large-scale perception tasks such as [image recognition](/wiki/image_recognition) and [natural language processing](/wiki/natural_language_understanding), support vector machines remain a standard tool for tabular and small-to-medium datasets. They are still included as standard estimators in libraries such as [scikit-learn](/wiki/scikit-learn), are widely taught as a foundational [machine learning](/wiki/machine_learning) method, and continue to be used in bioinformatics, finance, and other domains where datasets are modest in size, features outnumber samples, and strong theoretical guarantees matter. The LIBSVM library that powers many of these uses has remained in active development since 2000. [9]

## Limitations

- **Scalability**: Training time for kernel SVMs typically grows between O(n^2) and O(n^3) in the number of samples n, making them impractical for datasets with millions of samples. Approximate or online methods exist but may sacrifice some accuracy.
- **Kernel selection**: Choosing the right kernel and its parameters (gamma, degree) requires experimentation and domain knowledge. A poor kernel choice can result in bad performance.
- **No native probability estimates**: SVMs output decision values, not probabilities. Platt scaling or isotonic regression can calibrate outputs to probabilities, but this adds an extra step.
- **Sensitivity to feature scaling**: SVMs are sensitive to the scale of input features. Features should typically be standardized (zero mean, unit variance) or normalized before training.
- **Difficulty with very noisy data**: When classes overlap significantly, SVMs may struggle to find a meaningful margin.
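- **Memory usage**: The full kernel matrix has n^2 entries, which can be prohibitive for large datasets. Solvers such as SMO avoid storing it in full, but must then recompute or cache kernel values.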
- **Black-box nature**: Non-linear SVMs (with RBF or polynomial kernels) are difficult to interpret. Unlike [decision trees](/wiki/decision_tree) or [logistic regression](/wiki/logistic_regression), it is not straightforward to understand why an SVM made a particular prediction.
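The feature scaling point in the list above can be illustrated with distances, since kernels such as RBF depend on ||**x** - **y**||^2. This sketch assumes NumPy and four hypothetical samples whose second feature is recorded in much larger units than the first; the helper names are made up.

```python
import numpy as np

# Hypothetical samples: column 0 in small units, column 1 in large units.
X = np.array([[1.0, 1000.0], [5.0, 1010.0], [1.2, 3000.0], [4.8, 2990.0]])

def standardize(X):
    """Rescale each column to zero mean and unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def column0_share(a, b):
    """Fraction of the squared distance between a and b due to column 0."""
    diff2 = (a - b) ** 2
    return float(diff2[0] / diff2.sum())

X_scaled = standardize(X)

# Samples 0 and 1 span almost the full range of column 0 but are close in column 1.
share_raw = column0_share(X[0], X[1])
share_scaled = column0_share(X_scaled[0], X_scaled[1])

print(round(share_raw, 3), round(share_scaled, 5))
```

Before scaling, the large-unit feature contributes most of the distance even though the two samples barely differ in it relative to its range. After standardization the distance reflects the feature in which they actually differ.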

## References

1. Vapnik, V. N., and Chervonenkis, A. Ya. (1964). "A note on one class of perceptrons." Automation and Remote Control, 25(1).
2. Vapnik, V. N. (1982). *Estimation of Dependences Based on Empirical Data*. Springer-Verlag.
3. Boser, B. E., Guyon, I. M., and Vapnik, V. N. (1992). "A training algorithm for optimal margin classifiers." Proceedings of the 5th Annual Workshop on Computational Learning Theory (COLT), Pittsburgh, July 27-29, 1992, pp. 144-152. doi:10.1145/130385.130401.
4. Cortes, C., and Vapnik, V. (1995). "Support-vector networks." Machine Learning, 20(3), 273-297. doi:10.1007/BF00994018. https://link.springer.com/article/10.1007/BF00994018
5. Vapnik, V. N. (1995). *The Nature of Statistical Learning Theory*. Springer.
6. Platt, J. C. (1998). "Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines." Microsoft Research Technical Report MSR-TR-98-14. https://www.microsoft.com/en-us/research/publication/sequential-minimal-optimization-a-fast-algorithm-for-training-support-vector-machines/
7. Joachims, T. (1998). "Text categorization with support vector machines: Learning with many relevant features." Proceedings of the 10th European Conference on Machine Learning (ECML), Lecture Notes in Computer Science vol. 1398, Springer, pp. 137-142. doi:10.1007/BFb0026683.
8. Schölkopf, B., and Smola, A. J. (2002). *Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond*. MIT Press.
9. Chang, C.-C., and Lin, C.-J. (2011). "LIBSVM: A library for support vector machines." ACM Transactions on Intelligent Systems and Technology, 2(3), 27:1-27:27. doi:10.1145/1961189.1961199. https://www.csie.ntu.edu.tw/~cjlin/libsvm/
10. Drucker, H., Burges, C. J. C., Kaufman, L., Smola, A., and Vapnik, V. (1997). "Support Vector Regression Machines." Advances in Neural Information Processing Systems (NIPS), 9, pp. 155-161.
11. Vapnik, V. N., and Chervonenkis, A. Ya. (1971). "On the uniform convergence of relative frequencies of events to their probabilities." Theory of Probability and its Applications, 16(2), 264-280.
12. Hsu, C.-W., and Lin, C.-J. (2002). "A comparison of methods for multiclass support vector machines." IEEE Transactions on Neural Networks, 13(2), 415-425.