One-vs.-all (OvA), also known as one-vs.-rest (OvR) or one-against-all, is a strategy for turning a multi-class classification problem into several binary classification problems. Given a dataset with K classes, OvA trains K separate binary classifiers. Each classifier learns to distinguish a single target class from every other class combined. At prediction time, all K classifiers are applied to the input, and the class whose classifier reports the highest confidence score is returned as the predicted label.
OvA is one of the oldest and most widely deployed multiclass strategies. It dates back to early work on neural networks and support vector machines in the 1990s, and remains the default multiclass approach for several popular libraries, including the LIBLINEAR backend used by scikit-learn. A 2004 study by Ryan Rifkin and Aldebaro Klautau argued that, with well-tuned binary classifiers, OvA matches the accuracy of more elaborate multiclass schemes such as one-vs.-one and error-correcting output codes, despite its simplicity (Rifkin and Klautau, 2004).
Let the training set contain examples drawn from K distinct classes labeled 1 through K. The OvA reduction proceeds in two phases.
For each class k in 1 to K, construct a binary training set by relabeling examples of class k as positive (+1) and every other example as negative (-1). Train a binary classifier f_k on this relabeled data. After K rounds of training, the model consists of K independent binary scorers, one per class.
Given a new input x, each binary classifier produces a real-valued score (a margin, a log-odds, or a probability). The predicted class is the argmax over the K scores:
y_pred = argmax_k f_k(x)
When the binary classifiers output well-calibrated probabilities, the rule reduces to picking the class with the highest probability of being the positive class in its own binary problem. When they output uncalibrated margins (for example, raw SVM decision values), the argmax is still defined but the comparison across K classifiers assumes the scores are on roughly the same scale. This calibration assumption is the most common source of OvA failure in practice.
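A minimal from-scratch sketch of the two phases, assuming scikit-learn's LogisticRegression as the binary learner; the function names train_ova and predict_ova are illustrative, not part of any library API:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_ova(X, y, classes):
    """Phase 1: train one binary classifier per class (class k vs. the rest)."""
    models = {}
    for k in classes:
        y_binary = np.where(y == k, 1, -1)  # relabel: class k -> +1, everything else -> -1
        models[k] = LogisticRegression(max_iter=1000).fit(X, y_binary)
    return models

def predict_ova(models, X):
    """Phase 2: score every input with all K classifiers and return the argmax class."""
    classes = list(models.keys())
    # decision_function returns a signed margin; larger means more confident "positive"
    scores = np.column_stack([models[k].decision_function(X) for k in classes])
    return np.array(classes)[scores.argmax(axis=1)]
```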
The technique is referred to under several names, often interchangeably:
| Name | Common usage |
|---|---|
| One-vs.-all (OvA) | Statistical and ML literature |
| One-vs.-rest (OvR) | scikit-learn API, multilabel literature |
| One-against-all | Early SVM literature, especially Vapnik's group |
| Binary relevance | Multi-label classification literature |
The "one-vs.-rest" name became dominant in scikit-learn and has spread through Python ML tutorials. "One-against-all" appears in the SVM literature of the late 1990s. "Binary relevance" is the standard term in multi-label classification, where each binary classifier predicts the presence of a separate label rather than competing for a single class label.
Three main strategies turn a binary learner into a multiclass classifier: one-vs.-all, one-vs.-one (OvO), and error-correcting output codes (ECOC). The table below summarizes the trade-offs.
| Property | One-vs.-all | One-vs.-one | ECOC |
|---|---|---|---|
| Number of classifiers | K | K(K-1)/2 | L (codeword length, typically O(K) or O(log K)) |
| Training set per classifier | Full dataset, relabeled | Only examples from two classes | Full dataset, relabeled per bit |
| Inference | Argmax of K scores | Vote across K(K-1)/2 classifiers | Decode predicted bitstring against codebook |
| Class imbalance per binary problem | High (1 vs. K-1) | Balanced | Roughly balanced if codebook is balanced |
| Memory at inference | K models | K(K-1)/2 models | L models |
| Calibration sensitivity | High | Low (votes, not scores) | Medium |
| Default in | LIBLINEAR; scikit-learn LogisticRegression with the liblinear solver | LIBSVM, scikit-learn SVC | scikit-learn OutputCodeClassifier |
OvO trains a separate binary classifier for every unordered pair of classes, giving K(K-1)/2 classifiers in total. At inference, each pair classifier casts a vote for one of its two classes, and the class with the most votes wins. OvO is the default multiclass scheme in LIBSVM, the kernel SVM library that backs scikit-learn's SVC class. OvO scales quadratically in the number of classes, but each classifier sees only a small subset of the data, which can be useful when the underlying learner does not scale well with the number of training samples.
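For comparison, a minimal sketch of the pairwise scheme using scikit-learn's OneVsOneClassifier on the Iris data:

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)

# With 3 classes, OvO trains 3*(3-1)/2 = 3 pairwise classifiers and predicts by voting.
ovo = OneVsOneClassifier(LinearSVC(max_iter=5000)).fit(X, y)
print(len(ovo.estimators_))  # 3
print(ovo.predict(X[:5]))
```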
Introduced by Thomas Dietterich and Ghulum Bakiri in 1995, ECOC assigns each class a unique binary codeword of length L. One binary classifier is trained per bit. At inference, the predicted bitstring is compared against every class codeword, and the class with the smallest Hamming distance (or another decoding metric) is chosen. The redundancy in the code lets the system correct individual bit errors, much like an error-correcting code in communications. OvA can be viewed as a special case of ECOC where the codebook is the K-by-K identity matrix (each class has a one-hot codeword) and the decoding rule is argmax instead of nearest-codeword (Dietterich and Bakiri, 1995).
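A small sketch of the ECOC decoding step, assuming the per-bit classifiers have already produced a predicted bitstring; the codebook here is invented for illustration:

```python
import numpy as np

# Hypothetical codebook: one row per class, one column per bit (binary classifier).
codebook = np.array([
    [1, 1, 0, 0, 1],  # class 0
    [0, 1, 1, 0, 0],  # class 1
    [1, 0, 1, 1, 0],  # class 2
])

# Outputs of the 5 bit classifiers on some input.
predicted_bits = np.array([1, 1, 1, 0, 1])

# Hamming decoding: choose the class whose codeword is closest to the predicted bitstring.
hamming = (codebook != predicted_bits).sum(axis=1)
print(hamming.argmin())  # class 0 here (Hamming distance 1)
```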
Many classifiers handle multiclass problems directly, without any binary decomposition. For these methods, OvA is unnecessary and often worse than the native multiclass version.
| Algorithm | Native multiclass mechanism |
|---|---|
| Softmax regression (multinomial logistic regression) | Joint optimization of K weight vectors with cross-entropy loss |
| Decision trees (CART, ID3, C4.5) | Each leaf stores a class label or class distribution |
| Random forests | Majority vote across trees |
| Naive Bayes | Posterior probability per class via Bayes' rule |
| Multiclass perceptron (Kessler construction) | One weight vector per class, joint update rule |
| Deep neural networks | Softmax output layer trained with cross-entropy |
| K-nearest neighbors | Majority vote among nearest neighbors |
| Gradient boosting (multiclass) | Per-class boosting rounds with softmax loss |
For neural networks in particular, the softmax + cross-entropy combination has effectively replaced OvA. A softmax layer produces a probability distribution that sums to 1 across all K classes, and the cross-entropy loss directly maximizes the log-probability of the correct class. This joint formulation means the K class scores are coupled during training, so calibration across classes is built in.
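A small NumPy sketch of this coupling: the softmax turns the K raw scores into one distribution, and the cross-entropy loss for the correct class therefore depends on all K scores at once (the numbers are arbitrary):

```python
import numpy as np

def softmax(z):
    z = z - z.max()       # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

scores = np.array([2.0, 1.0, -1.0])  # raw scores for K = 3 classes
probs = softmax(scores)              # non-negative, sums to 1 across the classes
true_class = 0
loss = -np.log(probs[true_class])    # cross-entropy: maximize log-probability of the true class

print(probs.round(3), round(loss, 3))
```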
A practical concern with OvA is that each binary subproblem is heavily skewed. If the original dataset is balanced with 1/K of the examples per class, then the binary problem for class k has 1/K positive examples and (K-1)/K negative examples. With 10 classes, every binary classifier sees 10% positives and 90% negatives. With 100 classes, the imbalance is 1% versus 99%. Many binary learners default to a decision threshold of 0.5, which can collapse to predicting the majority class on every input when imbalance is severe.
Common mitigations include:
- Reweighting the binary loss so that the rare positive class and the abundant negative class contribute comparably, for example via class_weight='balanced' in scikit-learn estimators (a sketch follows this list).
- Resampling each binary subproblem, either by oversampling the positives or undersampling the negatives.
- Tuning the decision threshold of each binary classifier instead of relying on a fixed 0.5 cutoff.
- Using the argmax over scores rather than per-classifier thresholds, which avoids the thresholding problem but makes cross-classifier calibration more important (discussed below).
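A minimal sketch of the class-weighting mitigation, wrapping a reweighted logistic regression in the OvR reduction; the dataset is synthetic and the parameters are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Synthetic 10-class problem: each OvA subproblem sees roughly 10% positives.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=15,
                           n_classes=10, random_state=0)

# class_weight='balanced' reweights each binary subproblem so the rare positives
# and the abundant negatives contribute equally to the loss.
clf = OneVsRestClassifier(
    LogisticRegression(class_weight='balanced', max_iter=1000)
).fit(X, y)
print(clf.score(X, y))
```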
For the OvA argmax rule to be well-founded, the K binary scores must be comparable. Different binary problems may produce scores on different scales because they were trained on different positive-to-negative ratios and may have different characteristic margins.
Two standard calibration techniques are used:
- Platt scaling (sigmoid calibration), which fits a logistic function on held-out data to map raw scores to probabilities.
- Isotonic regression, which fits a nonparametric, monotonically increasing mapping from scores to probabilities and typically needs more calibration data to avoid overfitting.
Both methods extend to multiclass problems by calibrating each OvA classifier independently and then renormalizing the K calibrated probabilities so they sum to 1. Scikit-learn provides CalibratedClassifierCV to wrap this procedure around any base estimator.
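A minimal sketch, wrapping an uncalibrated linear SVM in sigmoid (Platt) calibration inside the OvR reduction; the cv and method choices are illustrative:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)

# LinearSVC outputs uncalibrated margins; CalibratedClassifierCV fits a sigmoid
# (Platt scaling) on cross-validated predictions to map them to probabilities.
base = CalibratedClassifierCV(LinearSVC(max_iter=5000), method='sigmoid', cv=5)
clf = OneVsRestClassifier(base).fit(X, y)

# Per-class probabilities; the OvR wrapper renormalizes them to sum to 1.
print(clf.predict_proba(X[:3]).round(3))
```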
A related issue is the abstention region: regions of input space where every binary classifier outputs a low confidence (or a confidence below 0.5). The argmax rule still picks the highest of these low scores, so the model never explicitly abstains. Some applications add an explicit "reject" option when no f_k(x) clears a chosen threshold.
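A small sketch of such a reject option on top of OvR probabilities; the threshold value and the -1 abstain label are arbitrary choices for illustration:

```python
import numpy as np

def predict_with_reject(clf, X, threshold=0.6):
    """Return the argmax class, or -1 (abstain) when no class clears the threshold."""
    proba = clf.predict_proba(X)   # shape (n_samples, K)
    best = proba.argmax(axis=1)
    labels = clf.classes_[best]
    # Abstain whenever even the best class falls below the confidence threshold.
    return np.where(proba.max(axis=1) >= threshold, labels, -1)
```

This can be applied to any fitted OvR classifier that exposes predict_proba, such as the calibrated one sketched above.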
| Library | Class or function | Notes |
|---|---|---|
| scikit-learn | sklearn.multiclass.OneVsRestClassifier | Generic wrapper around any binary estimator; also handles multilabel problems |
| scikit-learn | sklearn.linear_model.LogisticRegression(multi_class='ovr') | Built-in OvR mode (default for some solvers in older versions) |
| LIBLINEAR | Default multiclass mode | Trains K linear SVM or logistic regression classifiers |
| LIBSVM | (no built-in OvA mode) | Multiclass is handled with OvO by default; OvA requires training the K binary problems manually |
| Spark MLlib | OneVsRest | Wrapper for binary classifiers in Spark |
| H2O | n/a | Multinomial distribution by default; OvR available for some learners |
A minimal scikit-learn example using the Iris dataset:
```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Load the 3-class Iris dataset and wrap a binary logistic regression in the OvR reduction.
X, y = load_iris(return_X_y=True)
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(X, y)

# Predict the class whose one-vs.-rest classifier scores each input highest.
print(clf.predict(X[:5]))
```
With 3 classes (setosa, versicolor, virginica), the wrapper trains three logistic regression models: setosa-vs-rest, versicolor-vs-rest, and virginica-vs-rest. Inference picks the class whose binary model scores the input highest.
Let N be the number of training examples, D the feature dimension, and K the number of classes. Assume the binary learner has training cost T(N, D) and prediction cost P(D).
| Strategy | Training cost | Inference cost | Models stored |
|---|---|---|---|
| One-vs.-all | K * T(N, D) | K * P(D) | K |
| One-vs.-one | K(K-1)/2 * T(2N/K, D) | K(K-1)/2 * P(D) | K(K-1)/2 |
| ECOC (length L) | L * T(N, D) | L * P(D) + O(LK) decoding | L |
| Native softmax | T(N, D, K), typically an O(K) factor over the binary cost | P(D, K) | 1 joint model |
For learners with linear training cost, OvA training scales as O(K * N), and OvO training scales as O(K * N) as well (since each OvO classifier sees only 2N/K examples on average and there are K(K-1)/2 of them, the total is K(K-1)/2 * 2N/K = (K-1) * N). For learners with super-linear training cost in N (such as kernel SVMs with O(N^2) or O(N^3) training), OvO can actually be cheaper than OvA because each subproblem is much smaller.
At inference, OvA scales as O(K), which is unavoidable for any decomposition method that uses K class scores.
Use OvA when:
- The base learner is inherently binary and trains efficiently on the full dataset (for example, linear models trained with LIBLINEAR).
- The number of classes K is small to moderate, so training, storing, and evaluating K models is cheap.
- Per-class scores or probabilities are needed downstream, possibly after calibration.
- Classes need to be inspected, retrained, or added independently of one another.
Prefer OvO when the binary learner scales poorly with N and the per-class data is small enough that K(K-1)/2 small problems are cheaper than K large ones. This is the standard recommendation for kernel SVMs.
Prefer a native multiclass method (softmax, decision trees, random forests) when one is available and well-supported. For deep learning in particular, softmax + cross-entropy is the default and OvA is rarely a good idea for the final classification head.
Despite the dominance of softmax in deep learning, OvA remains practically important in several settings.
Multi-label classification. When an input can belong to several classes simultaneously (for example, an image tagged with both "beach" and "sunset"), the OvR reduction is the standard baseline under the name binary relevance. Each label gets its own independent binary classifier, and the model can output any subset of labels. Modern multi-label methods such as classifier chains build on top of binary relevance by passing earlier predictions as features to later classifiers.
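A minimal multi-label sketch with the same OvR wrapper; the inputs and tag sets below are made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Toy feature vectors with made-up tag sets; an example can carry several labels at once.
X = np.array([[0.9, 0.1], [0.8, 0.7], [0.1, 0.9], [0.7, 0.8]])
tags = [["beach"], ["beach", "sunset"], ["sunset"], ["beach", "sunset"]]

mlb = MultiLabelBinarizer().fit(tags)
y = mlb.transform(tags)  # binary indicator matrix, one column per label

# Binary relevance: one independent binary classifier per label column.
clf = OneVsRestClassifier(LogisticRegression()).fit(X, y)
print(mlb.inverse_transform(clf.predict(X[:2])))
```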
Calibrated probability estimates. Some downstream applications need a probability score for each individual class, not just a top-1 prediction. OvR with per-class calibration produces K independent probabilities that can be inspected, thresholded, or combined with prior knowledge.
Model interpretability and debugging. Because each OvA classifier corresponds to exactly one class, practitioners can inspect feature weights or attribution scores per class, retrain a single class without touching the others, and add new classes incrementally. A native softmax model couples all K classes in a single weight matrix, which is harder to modify after training.
One-class and anomaly detection. A degenerate version of OvA with a single classifier (one class versus everything else) underlies one-class SVMs and many novelty detection systems. Adding a second class label converts a one-class detector into a binary classifier with the same architecture.
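A minimal novelty-detection sketch with scikit-learn's OneClassSVM, illustrating the single-classifier degenerate case; the data and parameters are illustrative:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(0, 1, size=(200, 2))     # only "normal" data, no negative class

# nu bounds the fraction of training points treated as outliers.
detector = OneClassSVM(nu=0.05, gamma='scale').fit(X_train)

X_test = np.array([[0.1, -0.2], [6.0, 6.0]])  # an inlier-like point and a far outlier
print(detector.predict(X_test))               # +1 for inliers, -1 for outliers
```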
The Iris dataset has 150 examples, 4 features (sepal length, sepal width, petal length, petal width), and 3 classes (setosa, versicolor, virginica). With OvA using logistic regression, three binary classifiers are trained:
| Classifier | Positive class | Negative class | Approximate decision summary |
|---|---|---|---|
| f_setosa | setosa (50 examples) | versicolor + virginica (100 examples) | Petal length and width separate setosa cleanly; the binary problem is linearly separable |
| f_versicolor | versicolor (50 examples) | setosa + virginica (100 examples) | Versicolor sits between setosa and virginica in feature space, so this binary problem is harder; some virginica are close to versicolor |
| f_virginica | virginica (50 examples) | setosa + versicolor (100 examples) | Larger petals than versicolor; mostly separable but with some overlap on the boundary |
At prediction, all three classifiers score the input. For a setosa-like input, f_setosa scores high, f_versicolor and f_virginica score low, and the argmax returns setosa. For an input near the versicolor/virginica boundary, both f_versicolor and f_virginica may score moderately, and the higher of the two wins.
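The per-class decision values can be inspected directly; a short sketch repeating the earlier Iris setup (the exact numbers depend on the fitted model):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

# One decision value per class for a setosa example; the setosa classifier should score highest.
scores = clf.decision_function(X[:1])
print(np.round(scores, 2))
print(clf.classes_[scores.argmax()])  # 0, i.e. setosa in the Iris label encoding
```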
Known drawbacks of OvA include:
- Each binary subproblem is heavily imbalanced (1 vs. K-1), which can degrade the individual classifiers.
- The K scores come from independently trained models, so the argmax comparison assumes they share a common scale; without calibration this assumption often fails.
- There is no joint training objective across classes, so the classifiers cannot trade off errors against one another the way a softmax model does.
- Training and inference cost grow linearly with K, and K separate models must be stored.
- The argmax rule never abstains, even when every classifier reports low confidence.
For neural networks, all of these issues are addressed by the joint softmax + cross-entropy formulation, which is why OvA is rarely used for deep learning classification heads.
Imagine you have a basket of fruits: apples, bananas, and oranges, and you want to teach a robot to tell them apart. The one-vs.-all approach is to train three little robots, each one good at a single yes-or-no question. The first robot only knows "is this an apple, or is it not an apple?" The second robot only knows "is this a banana, or is it not a banana?" The third only knows "orange, or not?" When you show the robots a new fruit, each one says how sure it is that the fruit is its specialty. Whichever robot is most confident wins, and that fruit name is the answer.