One-vs.-all (OvA), also known as one-vs.-rest (OvR) or one-against-all, is a strategy for turning a multi-class classification problem into several binary classification problems. Given a dataset with K classes, OvA trains K separate binary classifiers. Each classifier learns to distinguish a single target class from every other class combined. At prediction time, all K classifiers are applied to the input, and the class whose classifier reports the highest confidence score is returned as the predicted label.
OvA is one of the oldest and most widely deployed multiclass strategies. It dates back to early work on neural networks and support vector machines in the 1990s, and remains the default multiclass approach for several popular libraries, including the LIBLINEAR backend used by scikit-learn. A 2004 study by Ryan Rifkin and Aldebaro Klautau argued that, with well-tuned binary classifiers, OvA matches the accuracy of more elaborate multiclass schemes such as one-vs.-one and error-correcting output codes, despite its simplicity (Rifkin and Klautau, 2004).
Let the training set contain examples drawn from K distinct classes labeled 1 through K. The OvA reduction proceeds in two phases.
For each class k in 1 to K, construct a binary training set by relabeling examples of class k as positive (+1) and every other example as negative (-1). Train a binary classifier f_k on this relabeled data. After K rounds of training, the model consists of K independent binary scorers, one per class.
Given a new input x, each binary classifier produces a real-valued score (a margin, a log-odds, or a probability). The predicted class is the argmax over the K scores:
y_pred = argmax_k f_k(x)
When the binary classifiers output well-calibrated probabilities, the rule reduces to picking the class with the highest probability of being the positive class in its own binary problem. When they output uncalibrated margins (for example, raw SVM decision values), the argmax is still defined but the comparison across K classifiers assumes the scores are on roughly the same scale. This calibration assumption is the most common source of OvA failure in practice.
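A minimal from-scratch sketch of the two phases, assuming scikit-learn's LogisticRegression as the binary learner; the function names train_ova and predict_ova are illustrative, not part of any library API:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_ova(X, y, classes):
    """Phase 1: train one binary classifier per class (class k vs. the rest)."""
    models = {}
    for k in classes:
        y_binary = np.where(y == k, 1, -1)  # relabel: class k -> +1, everything else -> -1
        models[k] = LogisticRegression(max_iter=1000).fit(X, y_binary)
    return models

def predict_ova(models, X):
    """Phase 2: score every input with all K classifiers and return the argmax class."""
    classes = list(models.keys())
    # decision_function returns a signed margin; larger means more confident "positive"
    scores = np.column_stack([models[k].decision_function(X) for k in classes])
    return np.array(classes)[scores.argmax(axis=1)]
```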
The technique is referred to under several names, often interchangeably:
| Name | Common usage |
|---|---|
| One-vs.-all (OvA) | Statistical and ML literature |
| One-vs.-rest (OvR) | scikit-learn API, multilabel literature |
| One-against-all | Early SVM literature, especially Vapnik's group |
| Binary relevance | Multi-label classification literature |
The "one-vs.-rest" name became dominant in scikit-learn and has spread through Python ML tutorials. "One-against-all" appears in the SVM literature of the late 1990s. "Binary relevance" is the standard term in multi-label classification, where each binary classifier predicts the presence of a separate label rather than competing for a single class label.
Three main strategies turn a binary learner into a multiclass classifier: one-vs.-all, one-vs.-one (OvO), and error-correcting output codes (ECOC). The table below summarizes the trade-offs.
| Property | One-vs.-all | One-vs.-one | ECOC |
|---|---|---|---|
| Number of classifiers | K | K(K-1)/2 | L (codeword length, typically O(K) or O(log K)) |
| Training set per classifier | Full dataset, relabeled | Only examples from two classes | Full dataset, relabeled per bit |
| Inference | Argmax of K scores | Vote across K(K-1)/2 classifiers | Decode predicted bitstring against codebook |
| Class imbalance per binary problem | High (1 vs. K-1) | Balanced | Roughly balanced if codebook is balanced |
| Memory at inference | K models | K(K-1)/2 models | L models |
| Calibration sensitivity | High | Low (votes, not scores) | Medium |
| Default in | LIBLINEAR; scikit-learn LogisticRegression with the liblinear solver | LIBSVM, scikit-learn SVC | scikit-learn OutputCodeClassifier |
OvO trains a separate binary classifier for every unordered pair of classes, giving K(K-1)/2 classifiers in total. At inference, each pair classifier casts a vote for one of its two classes, and the class with the most votes wins. OvO is the default multiclass scheme in LIBSVM, the kernel SVM library that backs scikit-learn's SVC class. OvO scales quadratically in the number of classes, but each classifier sees only a small subset of the data, which can be useful when the underlying learner does not scale well with the number of training samples.
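For comparison, a minimal sketch of the pairwise scheme using scikit-learn's OneVsOneClassifier on the Iris data:

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)

# With 3 classes, OvO trains 3*(3-1)/2 = 3 pairwise classifiers and predicts by voting.
ovo = OneVsOneClassifier(LinearSVC(max_iter=5000)).fit(X, y)
print(len(ovo.estimators_))  # 3
print(ovo.predict(X[:5]))
```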
Introduced by Thomas Dietterich and Ghulum Bakiri in 1995, ECOC assigns each class a unique binary codeword of length L. One binary classifier is trained per bit. At inference, the predicted bitstring is compared against every class codeword, and the class with the smallest Hamming distance (or another decoding metric) is chosen. The redundancy in the code lets the system correct individual bit errors, much like an error-correcting code in communications. OvA can be viewed as a special case of ECOC where the codebook is the K-by-K identity matrix (each class has a one-hot codeword) and the decoding rule is argmax instead of nearest-codeword (Dietterich and Bakiri, 1995).
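A small sketch of the ECOC decoding step, assuming the per-bit classifiers have already produced a predicted bitstring; the codebook here is invented for illustration:

```python
import numpy as np

# Hypothetical codebook: one row per class, one column per bit (binary classifier).
codebook = np.array([
    [1, 1, 0, 0, 1],  # class 0
    [0, 1, 1, 0, 0],  # class 1
    [1, 0, 1, 1, 0],  # class 2
])

# Outputs of the 5 bit classifiers on some input.
predicted_bits = np.array([1, 1, 1, 0, 1])

# Hamming decoding: choose the class whose codeword is closest to the predicted bitstring.
hamming = (codebook != predicted_bits).sum(axis=1)
print(hamming.argmin())  # class 0 here (Hamming distance 1)
```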
Many classifiers handle multiclass problems directly, without any binary decomposition. For these methods, OvA is unnecessary and often worse than the native multiclass version.
| Algorithm | Native multiclass mechanism |
|---|---|
| Softmax regression (multinomial logistic regression) | Joint optimization of K weight vectors with cross-entropy loss |
| Decision trees (CART, ID3, C4.5) | Each leaf stores a class label or class distribution |
| Random forests | Majority vote across trees |
| Naive Bayes | Posterior probability per class via Bayes' rule |
| Multiclass perceptron (Kessler construction) | One weight vector per class, joint update rule |
| Deep neural networks | Softmax output layer trained with cross-entropy |
| K-nearest neighbors | Majority vote among nearest neighbors |
| Gradient boosting (multiclass) | Per-class boosting rounds with softmax loss |
For neural networks in particular, the softmax + cross-entropy combination has effectively replaced OvA. A softmax layer produces a probability distribution that sums to 1 across all K classes, and the cross-entropy loss directly maximizes the log-probability of the correct class. This joint formulation means the K class scores are coupled during training, so calibration across classes is built in.
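A small NumPy sketch of this coupling: the softmax turns the K raw scores into one distribution, and the cross-entropy loss for the correct class therefore depends on all K scores at once (the numbers are arbitrary):

```python
import numpy as np

def softmax(z):
    z = z - z.max()       # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

scores = np.array([2.0, 1.0, -1.0])  # raw scores for K = 3 classes
probs = softmax(scores)              # non-negative, sums to 1 across the classes
true_class = 0
loss = -np.log(probs[true_class])    # cross-entropy: maximize log-probability of the true class

print(probs.round(3), round(loss, 3))
```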
A practical concern with OvA is that each binary subproblem is heavily skewed. If the original dataset is balanced with 1/K of the examples per class, then the binary problem for class k has 1/K positive examples and (K-1)/K negative examples. With 10 classes, every binary classifier sees 10% positives and 90% negatives. With 100 classes, the imbalance is 1% versus 99%. Many binary learners default to a decision threshold of 0.5, which can collapse to predicting the majority class on every input when imbalance is severe.
Common mitigations include:
- Reweighting the binary loss so that the rare positive class and the abundant negative class contribute comparably, for example via class_weight='balanced' in scikit-learn estimators (a sketch follows this list).
- Resampling each binary subproblem, either by oversampling the positives or undersampling the negatives.
- Tuning the decision threshold of each binary classifier instead of relying on a fixed 0.5 cutoff.
- Using the argmax over scores rather than per-classifier thresholds, which avoids the thresholding problem but makes cross-classifier calibration more important (discussed below).
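A minimal sketch of the class-weighting mitigation, wrapping a reweighted logistic regression in the OvR reduction; the dataset is synthetic and the parameters are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Synthetic 10-class problem: each OvA subproblem sees roughly 10% positives.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=15,
                           n_classes=10, random_state=0)

# class_weight='balanced' reweights each binary subproblem so the rare positives
# and the abundant negatives contribute equally to the loss.
clf = OneVsRestClassifier(
    LogisticRegression(class_weight='balanced', max_iter=1000)
).fit(X, y)
print(clf.score(X, y))
```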
For the OvA argmax rule to be well-founded, the K binary scores must be comparable. Different binary problems may produce scores on different scales because they were trained on different positive-to-negative ratios and may have different characteristic margins.
Two standard calibration techniques are used:
- Platt scaling (sigmoid calibration), which fits a logistic function on held-out data to map raw scores to probabilities.
- Isotonic regression, which fits a nonparametric, monotonically increasing mapping from scores to probabilities and typically needs more calibration data to avoid overfitting.
Both methods extend to multiclass problems by calibrating each OvA classifier independently and then renormalizing the K calibrated probabilities so they sum to 1. Scikit-learn provides CalibratedClassifierCV to wrap this procedure around any base estimator.
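A minimal sketch, wrapping an uncalibrated linear SVM in sigmoid (Platt) calibration inside the OvR reduction; the cv and method choices are illustrative:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)

# LinearSVC outputs uncalibrated margins; CalibratedClassifierCV fits a sigmoid
# (Platt scaling) on cross-validated predictions to map them to probabilities.
base = CalibratedClassifierCV(LinearSVC(max_iter=5000), method='sigmoid', cv=5)
clf = OneVsRestClassifier(base).fit(X, y)

# Per-class probabilities; the OvR wrapper renormalizes them to sum to 1.
print(clf.predict_proba(X[:3]).round(3))
```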
A related issue is the abstention region: regions of input space where every binary classifier outputs a low confidence (or a confidence below 0.5). The argmax rule still picks the highest of these low scores, so the model never explicitly abstains. Some applications add an explicit "reject" option when no f_k(x) clears a chosen threshold.
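A small sketch of such a reject option on top of OvR probabilities; the threshold value and the -1 abstain label are arbitrary choices for illustration:

```python
import numpy as np

def predict_with_reject(clf, X, threshold=0.6):
    """Return the argmax class, or -1 (abstain) when no class clears the threshold."""
    proba = clf.predict_proba(X)   # shape (n_samples, K)
    best = proba.argmax(axis=1)
    labels = clf.classes_[best]
    # Abstain whenever even the best class falls below the confidence threshold.
    return np.where(proba.max(axis=1) >= threshold, labels, -1)
```

This can be applied to any fitted OvR classifier that exposes predict_proba, such as the calibrated one sketched above.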
| Library | Class or function | Notes |
|---|---|---|
| scikit-learn | sklearn.multiclass.OneVsRestClassifier | Generic wrapper around any binary estimator; also handles multilabel problems |
| scikit-learn | sklearn.linear_model.LogisticRegression(multi_class='ovr') | Built-in OvR mode (default for some solvers in older versions) |
| LIBLINEAR | Default multiclass mode | Trains K linear SVM or logistic regression classifiers |
| LIBSVM | (no built-in OvA mode) | Multiclass is handled with OvO by default; OvA requires training the K binary problems manually |
| Spark MLlib | OneVsRest | Wrapper for binary classifiers in Spark |
| H2O | n/a | Multinomial distribution by default; OvR available for some learners |
A minimal scikit-learn example using the Iris dataset:
```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Load the 3-class Iris dataset and wrap a binary logistic regression in the OvR reduction.
X, y = load_iris(return_X_y=True)
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(X, y)

# Predict the class whose one-vs.-rest classifier scores each input highest.
print(clf.predict(X[:5]))
```
With 3 classes (setosa, versicolor, virginica), the wrapper trains three logistic regression models: setosa-vs-rest, versicolor-vs-rest, and virginica-vs-rest. Inference picks the class whose binary model scores the input highest.
Let N be the number of training examples, D the feature dimension, and K the number of classes. Assume the binary learner has training cost T(N, D) and prediction cost P(D).
| Strategy | Training cost | Inference cost | Models stored |
|---|---|---|---|
| One-vs.-all | K * T(N, D) | K * P(D) | K |
| One-vs.-one | K(K-1)/2 * T(2N/K, D) | K(K-1)/2 * P(D) | K(K-1)/2 |
| ECOC (length L) | L * T(N, D) | L * P(D) + O(LK) decoding | L |
| Native softmax | T(N, D, K), typically an O(K) factor over the binary cost | P(D, K) | 1 joint model |
For learners with linear training cost, OvA training scales as O(K * N), and OvO training scales as O(K * N) as well (since each OvO classifier sees only 2N/K examples on average and there are K(K-1)/2 of them, the total is K(K-1)/2 * 2N/K = (K-1) * N). For learners with super-linear training cost in N (such as kernel SVMs with O(N^2) or O(N^3) training), OvO can actually be cheaper than OvA because each subproblem is much smaller.
At inference, OvA scales as O(K), which is unavoidable for any decomposition method that uses K class scores.
Use OvA when:
- The base learner is inherently binary and trains efficiently on the full dataset (for example, linear models trained with LIBLINEAR).
- The number of classes K is small to moderate, so training, storing, and evaluating K models is cheap.
- Per-class scores or probabilities are needed downstream, possibly after calibration.
- Classes need to be inspected, retrained, or added independently of one another.
Prefer OvO when the binary learner scales poorly with N and the per-class data is small enough that K(K-1)/2 small problems are cheaper than K large ones. This is the standard recommendation for kernel SVMs.
Prefer a native multiclass method (softmax, decision trees, random forests) when one is available and well-supported. For deep learning in particular, softmax + cross-entropy is the default and OvA is rarely a good idea for the final classification head.
Despite the dominance of softmax in deep learning, OvA remains practically important in several settings.
Multi-label classification. When an input can belong to several classes simultaneously (for example, an image tagged with both "beach" and "sunset"), the OvR reduction is the standard baseline under the name binary relevance. Each label gets its own independent binary classifier, and the model can output any subset of labels. Modern multi-label methods such as classifier chains build on top of binary relevance by passing earlier predictions as features to later classifiers.
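A minimal multi-label sketch with the same OvR wrapper; the inputs and tag sets below are made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Toy feature vectors with made-up tag sets; an example can carry several labels at once.
X = np.array([[0.9, 0.1], [0.8, 0.7], [0.1, 0.9], [0.7, 0.8]])
tags = [["beach"], ["beach", "sunset"], ["sunset"], ["beach", "sunset"]]

mlb = MultiLabelBinarizer().fit(tags)
y = mlb.transform(tags)  # binary indicator matrix, one column per label

# Binary relevance: one independent binary classifier per label column.
clf = OneVsRestClassifier(LogisticRegression()).fit(X, y)
print(mlb.inverse_transform(clf.predict(X[:2])))
```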
Calibrated probability estimates. Some downstream applications need a probability score for each individual class, not just a top-1 prediction. OvR with per-class calibration produces K independent probabilities that can be inspected, thresholded, or combined with prior knowledge.
Model interpretability and debugging. Because each OvA classifier corresponds to exactly one class, practitioners can inspect feature weights or attribution scores per class, retrain a single class without touching the others, and add new classes incrementally. A native softmax model couples all K classes in a single weight matrix, which is harder to modify after training.
One-class and anomaly detection. A degenerate version of OvA with a single classifier (one class versus everything else) underlies one-class SVMs and many novelty detection systems. Adding a second class label converts a one-class detector into a binary classifier with the same architecture.
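A minimal novelty-detection sketch with scikit-learn's OneClassSVM, illustrating the single-classifier degenerate case; the data and parameters are illustrative:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(0, 1, size=(200, 2))     # only "normal" data, no negative class

# nu bounds the fraction of training points treated as outliers.
detector = OneClassSVM(nu=0.05, gamma='scale').fit(X_train)

X_test = np.array([[0.1, -0.2], [6.0, 6.0]])  # an inlier-like point and a far outlier
print(detector.predict(X_test))               # +1 for inliers, -1 for outliers
```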
The Iris dataset has 150 examples, 4 features (sepal length, sepal width, petal length, petal width), and 3 classes (setosa, versicolor, virginica). With OvA using logistic regression, three binary classifiers are trained:
| Classifier | Positive class | Negative class | Approximate decision summary |
|---|---|---|---|
| f_setosa | setosa (50 examples) | versicolor + virginica (100 examples) | Petal length and width separate setosa cleanly; the binary problem is linearly separable |
| f_versicolor | versicolor (50 examples) | setosa + virginica (100 examples) | Versicolor sits between setosa and virginica in feature space, so this binary problem is harder; some virginica are close to versicolor |
| f_virginica | virginica (50 examples) | setosa + versicolor (100 examples) | Larger petals than versicolor; mostly separable but with some overlap on the boundary |
At prediction, all three classifiers score the input. For a setosa-like input, f_setosa scores high, f_versicolor and f_virginica score low, and the argmax returns setosa. For an input near the versicolor/virginica boundary, both f_versicolor and f_virginica may score moderately, and the higher of the two wins.
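The per-class decision values can be inspected directly; a short sketch repeating the earlier Iris setup (the exact numbers depend on the fitted model):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

# One decision value per class for a setosa example; the setosa classifier should score highest.
scores = clf.decision_function(X[:1])
print(np.round(scores, 2))
print(clf.classes_[scores.argmax()])  # 0, i.e. setosa in the Iris label encoding
```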
Known drawbacks of OvA include:
- Each binary subproblem is heavily imbalanced (1 vs. K-1), which can degrade the individual classifiers.
- The K scores come from independently trained models, so the argmax comparison assumes they share a common scale; without calibration this assumption often fails.
- There is no joint training objective across classes, so the classifiers cannot trade off errors against one another the way a softmax model does.
- Training and inference cost grow linearly with K, and K separate models must be stored.
- The argmax rule never abstains, even when every classifier reports low confidence.
For neural networks, all of these issues are addressed by the joint softmax + cross-entropy formulation, which is why OvA is rarely used for deep learning classification heads.
Imagine you have a basket of fruits: apples, bananas, and oranges, and you want to teach a robot to tell them apart. The one-vs.-all approach is to train three little robots, each one good at a single yes-or-no question. The first robot only knows "is this an apple, or is it not an apple?" The second robot only knows "is this a banana, or is it not a banana?" The third only knows "orange, or not?" When you show the robots a new fruit, each one says how sure it is that the fruit is its specialty. Whichever robot is most confident wins, and that fruit name is the answer.