# Multi-Class Classification

> Source: https://aiwiki.ai/wiki/multi-class_classification
> Updated: 2026-06-21
> Categories: Machine Learning
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

*See also: [Machine learning terms](/wiki/machine_learning_terms)*

Multi-class classification is a [supervised learning](/wiki/supervised_learning) task in [machine learning](/wiki/machine_learning) that assigns each input to exactly one of three or more mutually exclusive classes. It generalizes [binary classification](/wiki/binary_classification) (two categories) to a label set of K classes where K is 3 or greater, and it differs from [multi-label classification](/wiki/multi_label_classification), in which an instance can carry several labels at once. A handwritten-digit recognizer that identifies digits 0 through 9 is a 10-class problem, and tagging a news article as politics, sports, technology, or entertainment is a 4-class problem.

Multi-class classification is one of the most widely encountered problem types in applied machine learning. It underpins applications ranging from image recognition and natural language processing to medical diagnosis and autonomous driving. Because many real-world tasks involve more than two possible outcomes, understanding the algorithms, loss functions, and evaluation strategies specific to multi-class settings is essential for any practitioner.

## What is the formal definition of multi-class classification?

Given a training set of labeled examples {(x₁, y₁), (x₂, y₂), ..., (xₙ, yₙ)}, where each input xᵢ belongs to a feature space X and each label yᵢ belongs to a finite label set Y = {1, 2, ..., K} with K ≥ 3, the goal of multi-class classification is to learn a function f: X → Y that maps unseen inputs to their correct class labels with high accuracy.[1]

The key constraint that distinguishes multi-class classification from [multi-label classification](/wiki/multi_label_classification) is mutual exclusivity: each instance belongs to exactly one class. The classifier must output a single predicted label per input. As Bishop's Pattern Recognition and Machine Learning frames it, the task is to assign each input vector "to one of K discrete classes" that "are taken to be disjoint, so that each input is assigned to one and only one class."[2]

## Which algorithms are inherently multi-class?

Several machine learning algorithms can handle multiple classes natively, without requiring any special decomposition strategy.

| Algorithm | How It Handles Multiple Classes | Strengths |
|---|---|---|
| [Decision tree](/wiki/decision_tree) | Splits the feature space recursively using information gain or Gini impurity; leaf nodes can represent any number of classes | Interpretable, no special encoding needed |
| [Random forest](/wiki/random_forest) | Ensemble of decision trees; each tree votes, and the majority vote determines the class | Robust to overfitting, handles high-dimensional data |
| [k-Nearest Neighbors](/wiki/k_nearest_neighbors) (k-NN) | Assigns the most common class among the k closest training examples | Simple, no training phase, naturally multi-class |
| [Naive Bayes](/wiki/naive_bayes) | Computes posterior probability for each class using Bayes' theorem and selects the class with the highest probability | Fast, works well with high-dimensional text data |
| [Neural network](/wiki/neural_network) with [softmax](/wiki/softmax) output | Final layer has K neurons (one per class) activated by the softmax function, producing a probability distribution over all classes | Highly flexible, state-of-the-art on many benchmarks |
| Multinomial [logistic regression](/wiki/logistic_regression) | Extends binary logistic regression to K classes by learning K sets of weights and applying softmax | Probabilistic outputs, well-calibrated |
| [Gradient boosting](/wiki/gradient_boosting) | Builds an ensemble of weak learners sequentially; uses multi-class loss functions such as multi-class log loss | High accuracy, handles structured/tabular data well |

These algorithms are sometimes called "direct methods" or "all-at-once" methods because they solve the full K-class problem in a single model.[2]
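As an illustration of a natively multi-class method, the following is a minimal sketch of k-nearest neighbors using only the Python standard library. The toy points, the labels, and the function name `knn_predict` are made up for this example; a production implementation would add feature scaling and explicit tie handling.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k=3):
    """Return the most common label among the k training points nearest to query."""
    # Rank training indices by Euclidean distance to the query point.
    order = sorted(
        range(len(train_points)),
        key=lambda i: math.dist(train_points[i], query),
    )
    # Majority vote over the k nearest labels; nothing here depends on the
    # number of distinct classes, which is why k-NN is naturally multi-class.
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Toy three-class data (hypothetical, for illustration only).
points = [(0, 0), (1, 0), (0, 1), (10, 0), (9, 0), (10, 1), (0, 10), (1, 10), (0, 9)]
labels = ["a", "a", "a", "b", "b", "b", "c", "c", "c"]

print(knn_predict(points, labels, (9.5, 0.5)))  # nearest neighbors are all "b"
print(knn_predict(points, labels, (0.5, 9.5)))  # nearest neighbors are all "c"
```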

## How are binary classifiers extended to multiple classes?

Algorithms originally designed for [binary classification](/wiki/binary_classification), such as standard [logistic regression](/wiki/logistic_regression) or [support vector machines](/wiki/support_vector_machine_svm), can be extended to multi-class settings through decomposition (also called reduction) strategies. The two most common approaches are One-vs-Rest and One-vs-One.

### One-vs-Rest (OvR)

[One-vs.-All](/wiki/one-vs_-all) (also called One-vs-Rest or OvR) trains K separate binary classifiers, one for each class. The classifier for class k treats all examples of class k as positives and all other examples as negatives. At prediction time, all K classifiers score the input, and the class whose classifier produces the highest confidence wins.[3] For a 10-class problem this means training 10 binary models.

| Aspect | Detail |
|---|---|
| Number of classifiers | K (one per class) |
| Training complexity | K times the cost of training one binary classifier |
| Prediction rule | argmax of the K classifier scores |
| Common issue | Class imbalance, since each binary problem groups K-1 classes into the negative set |

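OvR is the default multi-class strategy for several library implementations, including scikit-learn's `LinearSVC`. scikit-learn's `LogisticRegression` used OvR by default in older releases, but recent versions fit a multinomial (softmax) model for multi-class targets with most solvers.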
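The OvR procedure can be sketched in a few lines. The example below is a simplified illustration, not a library implementation: the binary "classifier" for each class is a hypothetical centroid-based scorer (distance to the negative centroid minus distance to the positive centroid), chosen only because it is deterministic and needs no dependencies. Any binary model that returns a confidence score could take its place.

```python
import math

def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def train_ovr(points, labels):
    """Train one binary scorer per class: class k (positives) vs. all other classes."""
    models = {}
    for k in sorted(set(labels)):
        positives = [p for p, y in zip(points, labels) if y == k]
        negatives = [p for p, y in zip(points, labels) if y != k]
        models[k] = (centroid(positives), centroid(negatives))
    return models

def ovr_predict(models, x):
    """Score x with all K binary models and return the class with the highest score."""
    def score(model):
        pos, neg = model
        # Larger when x is closer to the positive centroid than to the negative one.
        return math.dist(x, neg) - math.dist(x, pos)
    return max(models, key=lambda k: score(models[k]))

# Toy three-class data (hypothetical).
points = [(0, 0), (1, 0), (0, 1), (10, 0), (9, 0), (10, 1), (0, 10), (1, 10), (0, 9)]
labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]

models = train_ovr(points, labels)
print(len(models))                      # one binary model per class
print(ovr_predict(models, (9.5, 0.5)))  # lands in the class-1 cluster
```

Note how each binary problem here has three positives and six negatives, which is the class-imbalance effect listed in the table above.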

### One-vs-One (OvO)

One-vs-One trains a binary classifier for every pair of classes, resulting in K(K-1)/2 classifiers. Each classifier is trained only on examples from two classes. At prediction time, each classifier votes for one of its two classes, and the class with the most votes is selected.[3] For a 10-class problem this yields 10 * 9 / 2 = 45 pairwise classifiers.

| Aspect | Detail |
|---|---|
| Number of classifiers | K(K-1)/2 |
| Training complexity | More classifiers, but each is trained on a smaller subset of data |
| Prediction rule | Majority vote across all pairwise classifiers |
| Common issue | Computationally expensive for large K |

OvO is the default strategy for algorithms like `sklearn.svm.SVC` because SVMs scale super-linearly with dataset size, and training on smaller subsets is often faster overall.[11]
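A minimal sketch of OvO training and voting follows. As a simplifying assumption, each pairwise "classifier" is a nearest-centroid rule over the two classes it was trained on; the function names and toy data are illustrative only, and ties in the vote are broken arbitrarily.

```python
from collections import Counter
from itertools import combinations
import math

def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def train_ovo(points, labels):
    """Train one nearest-centroid model for every unordered pair of classes."""
    by_class = {}
    for p, y in zip(points, labels):
        by_class.setdefault(y, []).append(p)
    # Each pairwise model only sees the examples of its own two classes.
    return {
        (a, b): (centroid(by_class[a]), centroid(by_class[b]))
        for a, b in combinations(sorted(by_class), 2)
    }

def ovo_predict(models, x):
    """Let every pairwise model vote for one of its two classes; most votes wins."""
    votes = Counter()
    for (a, b), (centroid_a, centroid_b) in models.items():
        winner = a if math.dist(x, centroid_a) <= math.dist(x, centroid_b) else b
        votes[winner] += 1
    return votes.most_common(1)[0][0]

# Toy three-class data (hypothetical).
points = [(0, 0), (1, 0), (0, 1), (10, 0), (9, 0), (10, 1), (0, 10), (1, 10), (0, 9)]
labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]

models = train_ovo(points, labels)
print(sorted(models))                   # the K(K-1)/2 = 3 class pairs
print(ovo_predict(models, (9.5, 0.5)))  # class 1 wins two of the three votes

# The pair count grows quadratically: 10 classes need 45 pairwise models.
print(len(list(combinations(range(10), 2))))
```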

### Error-Correcting Output Codes (ECOC)

A more general reduction framework assigns each class a unique binary codeword. Multiple binary classifiers are trained, one per bit position in the code. At prediction time, the outputs of all classifiers form a binary string, and the class whose codeword is closest (in Hamming distance) to this string is selected. ECOC provides a degree of error tolerance because misclassification by a single binary classifier does not necessarily cause an incorrect final prediction.[4]
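The decoding step can be illustrated with a small sketch. The 7-bit codebook below is a made-up example (not taken from the cited paper) in which every pair of codewords differs in at least four positions, so a single wrong bit still decodes to the intended class. The binary classifiers that would produce the bits are assumed to exist and are not shown.

```python
def hamming(a, b):
    """Number of positions at which two equal-length bit tuples differ."""
    return sum(x != y for x, y in zip(a, b))

# Hypothetical codebook: one 7-bit codeword per class, minimum pairwise distance 4.
CODEBOOK = {
    "class_0": (0, 0, 0, 0, 0, 0, 0),
    "class_1": (1, 1, 1, 1, 0, 0, 0),
    "class_2": (0, 0, 1, 1, 1, 1, 0),
    "class_3": (1, 1, 0, 0, 1, 1, 1),
}

def ecoc_decode(bits, codebook=CODEBOOK):
    """Return the class whose codeword is nearest to bits in Hamming distance."""
    return min(codebook, key=lambda c: hamming(codebook[c], bits))

# Outputs of the seven binary classifiers for an input whose true class is class_2,
# with the first classifier making a mistake (bit 0 flipped from 0 to 1).
received = (1, 0, 1, 1, 1, 1, 0)
print(ecoc_decode(received))  # still decodes to class_2
```

Longer codewords with larger minimum distance tolerate more simultaneous classifier errors, at the cost of training more binary models.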

## How does the softmax output layer work?

In [neural network](/wiki/neural_network) architectures used for multi-class classification, the final layer typically contains K output neurons, one for each class. The raw outputs of these neurons, called logits, are passed through the [softmax](/wiki/softmax) function, which normalizes them into a valid probability distribution:

softmax(zᵢ) = exp(zᵢ) / Σⱼ exp(zⱼ)

where zᵢ is the logit for class i and the sum runs over all K classes. The softmax function ensures that all output values are between 0 and 1 and that they sum to exactly 1. The predicted class is the one with the highest probability.[5]

The softmax layer is a natural fit for multi-class classification because it directly models the assumption of mutual exclusivity: increasing the probability assigned to one class necessarily decreases the probability assigned to others. This contrasts with the sigmoid activation used in [multi-label classification](/wiki/multi_label_classification), where each output is independent.
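The formula above can be written directly in Python. This sketch subtracts the largest logit before exponentiating, a common numerical-stability trick that leaves the result unchanged because the shift cancels in the ratio; the example logits are arbitrary.

```python
import math

def softmax(logits):
    """Map a list of K logits to K probabilities that sum to 1."""
    # Shifting by the maximum avoids overflow in exp() for large logits.
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])   # roughly [0.659, 0.242, 0.099]
print(probs.index(max(probs)))        # predicted class: index 0

# Raising one logit lowers every other class's probability.
print(softmax([3.0, 1.0, 0.1])[1] < probs[1])  # True
```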

## What loss functions are used for multi-class classification?

The standard loss function for multi-class classification is [cross-entropy](/wiki/cross-entropy) loss, also called categorical cross-entropy or log loss.

### Categorical Cross-Entropy

For a single example with true one-hot label vector y and predicted probability vector p, the categorical cross-entropy loss is:

L = -Σᵢ yᵢ log(pᵢ)

Because y is one-hot (only one element equals 1), this simplifies to L = -log(p_c), where c is the index of the correct class. The loss is zero when p_c equals 1 and grows without bound as p_c approaches 0, so the penalty is logarithmic rather than linear in the error. A confident correct prediction (p_c close to 1) produces a low loss, while a confident wrong prediction (p_c close to 0) produces a very high loss.[5]
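A short numeric sketch makes the shape of the penalty concrete. The probability vectors are invented for illustration; the function simply evaluates L = -Σᵢ yᵢ log(pᵢ) for one example.

```python
import math

def cross_entropy(one_hot, probs):
    """Categorical cross-entropy for one example: -sum_i y_i * log(p_i)."""
    # Terms with y_i == 0 contribute nothing, so they are skipped; this also
    # avoids evaluating log(0) for classes the model assigns zero probability.
    return -sum(y * math.log(p) for y, p in zip(one_hot, probs) if y > 0)

y = [0, 0, 1]  # the true class is index 2

confident_right = cross_entropy(y, [0.05, 0.05, 0.90])
uniform_guess = cross_entropy(y, [1 / 3, 1 / 3, 1 / 3])
confident_wrong = cross_entropy(y, [0.90, 0.05, 0.05])

print(round(confident_right, 3))  # about 0.105
print(round(uniform_guess, 3))    # about 1.099, i.e. log(3)
print(round(confident_wrong, 3))  # about 2.996
```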

### Sparse Categorical Cross-Entropy

Sparse categorical cross-entropy is mathematically identical to categorical cross-entropy, but it accepts the true label as an integer index rather than a one-hot vector. This avoids the overhead of constructing and storing one-hot vectors, which becomes significant when the number of classes K is large (for example, in language modeling with vocabulary sizes exceeding 50,000). Most deep learning frameworks, including TensorFlow and PyTorch, offer both variants.[6]

| Loss Function | Label Format | Memory Efficiency | Use Case |
|---|---|---|---|
| Categorical cross-entropy | One-hot vector [0, 0, 1, 0] | Higher memory for large K | Small to moderate number of classes |
| Sparse categorical cross-entropy | Integer index (e.g., 2) | Lower memory | Large number of classes |
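The equivalence of the two label formats can be checked with a small sketch. The function names mirror the loss names above but are plain Python written for this example, not framework APIs.

```python
import math

def categorical_ce(one_hot, probs):
    """Loss from a one-hot label vector: -sum_i y_i * log(p_i)."""
    return -sum(y * math.log(p) for y, p in zip(one_hot, probs) if y > 0)

def sparse_categorical_ce(class_index, probs):
    """Same loss from an integer label: just -log of the true class's probability."""
    return -math.log(probs[class_index])

def to_one_hot(class_index, num_classes):
    """Build the length-K one-hot vector that the sparse variant avoids storing."""
    return [1 if i == class_index else 0 for i in range(num_classes)]

probs = [0.1, 0.2, 0.6, 0.1]
label = 2

dense = categorical_ce(to_one_hot(label, len(probs)), probs)
sparse = sparse_categorical_ce(label, probs)
print(math.isclose(dense, sparse))  # True: both equal -log(0.6)
```

The sparse form stores one integer per example instead of K numbers, which is where the memory saving for large K comes from.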

## How is a multi-class classifier evaluated?

Evaluating multi-class classifiers requires metrics that account for the performance across all K classes. Single-number accuracy can be misleading, especially when classes are imbalanced.

### Confusion Matrix

A [confusion matrix](/wiki/confusion_matrix) for a K-class problem is a K x K table where the entry in row i, column j indicates how many instances of true class i were predicted as class j. The diagonal entries represent correct predictions. Off-diagonal entries reveal systematic misclassification patterns, such as a model frequently confusing cats with dogs.[7]
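Building the matrix is a matter of counting (true, predicted) pairs. The sketch below assumes labels are integers in the range 0 to K-1; the label lists are invented.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """Return a K x K list of lists; entry [i][j] counts true class i predicted as j."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0, 2]

cm = confusion_matrix(y_true, y_pred, num_classes=3)
for row in cm:
    print(row)
# [1, 1, 0]
# [0, 2, 0]
# [1, 0, 2]

# Correct predictions sit on the diagonal, so accuracy is trace / total.
accuracy = sum(cm[i][i] for i in range(3)) / len(y_true)
print(accuracy)  # 5 correct out of 7
```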

### Precision, Recall, and F1 Score

Precision, recall, and the [F1 score](/wiki/f1_score) are computed per class and then aggregated. The aggregation method significantly affects the reported score:

| Averaging Method | Definition | When to Use |
|---|---|---|
| Macro average | Compute the metric for each class independently, then take the unweighted mean | When all classes are equally important, regardless of size |
| Micro average | Aggregate true positives, false positives, and false negatives across all classes before computing the metric | When larger classes should contribute more; equals accuracy for single-label classification |
| Weighted average | Compute per-class metric, then average weighted by each class's support (number of true instances) | When class imbalance exists and you want proportional representation |

The distinction is consequential: macro-averaging "treats all classes equally" while micro-averaging "favors bigger classes."[8] Macro F1 is therefore often preferred for imbalanced datasets because it gives equal weight to every class, including rare ones, whereas micro F1 is dominated by the performance on majority classes.[8] The widely cited Sokolova and Lapalme analysis of classification performance measures, which formalized these averaging conventions, has accumulated roughly 2,000 citations.[7]
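The effect of the averaging method is easiest to see on an imbalanced toy example. In the sketch below (invented data, hypothetical helper names), class 0 has eight examples and classes 1 and 2 have one each, and the model never predicts class 1. F1 is computed as 2TP / (2TP + FP + FN) and is taken to be 0 when that denominator is 0, which is one common convention.

```python
def per_class_counts(y_true, y_pred, classes):
    """Return {class: (tp, fp, fn)}, treating each class as 'positive' in turn."""
    counts = {}
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        counts[c] = (tp, fp, fn)
    return counts

def f1(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def averaged_f1(y_true, y_pred):
    classes = sorted(set(y_true) | set(y_pred))
    counts = per_class_counts(y_true, y_pred, classes)
    scores = {c: f1(*counts[c]) for c in classes}
    support = {c: y_true.count(c) for c in classes}
    # Macro: unweighted mean of per-class scores.
    macro = sum(scores.values()) / len(classes)
    # Weighted: per-class scores weighted by the number of true instances.
    weighted = sum(scores[c] * support[c] for c in classes) / len(y_true)
    # Micro: pool TP, FP, FN over all classes, then compute F1 once.
    micro = f1(*(sum(column) for column in zip(*counts.values())))
    return {"macro": macro, "micro": micro, "weighted": weighted}

y_true = [0] * 8 + [1, 2]
y_pred = [0] * 9 + [2]   # the single class-1 example is predicted as class 0

result = averaged_f1(y_true, y_pred)
print({name: round(value, 3) for name, value in result.items()})
# macro is about 0.647, micro is 0.9 (equal to accuracy), weighted is about 0.853
```

The missed minority class pulls macro F1 down sharply while micro F1 barely moves, which is the behavior described above.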

### Top-k Accuracy

Top-k accuracy considers a prediction correct if the true class is among the k classes with the highest predicted probabilities. Top-5 accuracy, for instance, was the primary metric in the [ImageNet](/wiki/imagenet) Large Scale Visual Recognition Challenge (ILSVRC), where models had to classify images into one of 1,000 categories. In ILSVRC 2012, the AlexNet [convolutional neural network](/wiki/convolutional_neural_network) achieved a top-5 error of 15.3%, more than 10.8 percentage points lower than the runner-up, a result widely credited with igniting the modern [deep learning](/wiki/deep_learning) era.[9][12] Top-k accuracy is useful when multiple classes are visually or semantically similar and a strict top-1 requirement is overly harsh.[9]
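A minimal sketch of the metric, using invented probability rows for a 4-class problem:

```python
def top_k_accuracy(y_true, prob_rows, k):
    """Fraction of examples whose true class is among the k highest-probability classes."""
    hits = 0
    for label, probs in zip(y_true, prob_rows):
        # Class indices sorted from most to least probable, truncated to k.
        top_k = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
        hits += label in top_k
    return hits / len(y_true)

y_true = [1, 1, 0]
prob_rows = [
    [0.10, 0.60, 0.20, 0.10],  # true class 1 is ranked first
    [0.40, 0.30, 0.20, 0.10],  # true class 1 is ranked second
    [0.05, 0.15, 0.30, 0.50],  # true class 0 is ranked last
]

print(top_k_accuracy(y_true, prob_rows, k=1))  # 1 of 3 correct
print(top_k_accuracy(y_true, prob_rows, k=2))  # 2 of 3 correct
```

With k equal to the number of classes the metric is always 1, so k is chosen well below K in practice.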

### Cohen's Kappa

Cohen's Kappa measures the agreement between predicted and true labels while accounting for agreement that would occur by chance. A kappa of 1 indicates perfect agreement, 0 indicates agreement no better than random, and negative values indicate worse-than-random performance. It is particularly useful when class distributions are highly skewed.
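Kappa is computed as (p_o - p_e) / (1 - p_e), where p_o is the observed agreement (accuracy) and p_e is the agreement expected by chance from the marginal label frequencies. The sketch below applies that formula to invented, heavily skewed labels; it does not handle the degenerate case p_e = 1.

```python
from collections import Counter

def cohens_kappa(y_true, y_pred):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two equal-length label lists."""
    n = len(y_true)
    # Observed agreement: fraction of positions where the labels match.
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Chance agreement: sum over classes of P(true = c) * P(pred = c).
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (p_o - p_e) / (1 - p_e)

y_true = [0] * 8 + [1, 2]
y_pred = [0] * 9 + [2]   # 90% accuracy, but class 1 is never predicted

print(round(cohens_kappa(y_true, y_pred), 3))  # about 0.63, well below the 0.9 accuracy
```

Here chance agreement is already 0.73 because class 0 dominates both label lists, so the 90% accuracy translates into a much more modest kappa.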

## How does multi-class differ from multi-label classification?

Multi-class and [multi-label classification](/wiki/multi_label_classification) are frequently confused, but they address fundamentally different problems.

| Aspect | Multi-Class Classification | Multi-Label Classification |
|---|---|---|
| Labels per instance | Exactly one | Zero, one, or many |
| Class relationship | Mutually exclusive | Independent |
| Output activation | [Softmax](/wiki/softmax) (probabilities sum to 1) | Sigmoid (each output independent) |
| Loss function | Categorical [cross-entropy](/wiki/cross-entropy) | Binary cross-entropy per label |
| Example | An image is a cat, dog, or bird | A movie is tagged as action, comedy, and romance simultaneously |

The choice between multi-class and multi-label framing depends on the problem. If an instance can legitimately belong to more than one category, multi-label is appropriate. If categories are mutually exclusive by definition, multi-class is the correct formulation.
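The difference in output activation can be shown on a single set of logits. In this sketch the logits are arbitrary, and the 0.5 threshold used for the multi-label decision is an assumption for illustration; real systems often tune a threshold per label.

```python
import math

def softmax(logits):
    """Multi-class activation: probabilities compete and sum to 1."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    """Multi-label activation: each output is an independent probability."""
    return 1.0 / (1.0 + math.exp(-z))

logits = [2.0, 1.5, -1.0]

# Multi-class reading: pick exactly one class.
class_probs = softmax(logits)
predicted_class = class_probs.index(max(class_probs))

# Multi-label reading: keep every label whose independent probability passes 0.5.
label_probs = [sigmoid(z) for z in logits]
predicted_labels = [i for i, p in enumerate(label_probs) if p >= 0.5]

print(predicted_class)                # 0
print(predicted_labels)               # [0, 1]
print(round(sum(class_probs), 6))     # 1.0
print(round(sum(label_probs), 3))     # about 1.967; no sum-to-one constraint
```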

## How do you handle class imbalance?

In many real-world multi-class problems, the number of examples per class varies dramatically. For instance, a medical diagnosis dataset may have thousands of "healthy" examples but only dozens of examples for a rare disease. Class imbalance causes standard classifiers to be biased toward majority classes, underperforming on minority ones.

Common strategies for addressing class imbalance in multi-class settings include:

**Data-level techniques:**
- Oversampling minority classes using methods like SMOTE (Synthetic Minority Over-sampling Technique) or ADASYN (Adaptive Synthetic Sampling), which generate synthetic examples by interpolating between existing minority samples.[10]
- Undersampling majority classes by removing excess examples, though this risks discarding informative data.
- Hybrid approaches that combine oversampling of minority classes with undersampling of majority classes (e.g., SMOTEENN, SMOTETomek).

**Algorithm-level techniques:**
- Cost-sensitive learning, where different misclassification costs are assigned to different classes. Misclassifying a rare class is penalized more heavily than misclassifying a common class.
- Class-weighted loss functions, where the loss contribution of each example is scaled inversely to its class frequency. Most frameworks support a `class_weight` parameter.
- Focal loss, originally proposed for object detection, which down-weights well-classified examples and focuses training on hard, misclassified ones.
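As a sketch of the class-weighting idea, the example below computes inverse-frequency weights with the heuristic n_samples / (n_classes * class_count), which is one widely used choice, and applies them to a per-example cross-entropy term. The class names and counts are invented, and the helper names are not a framework API.

```python
from collections import Counter
import math

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * count) for c, count in counts.items()}

def weighted_loss(true_class, probs, weights):
    """Cross-entropy for one example, scaled by the weight of its true class."""
    return weights[true_class] * -math.log(probs[true_class])

# Hypothetical imbalanced label set: 90 / 8 / 2 examples.
labels = ["healthy"] * 90 + ["flu"] * 8 + ["rare"] * 2
weights = balanced_class_weights(labels)
print({c: round(w, 2) for c, w in weights.items()})
# healthy gets about 0.37, flu about 4.17, rare about 16.67

# The same predicted probability costs far more on a rare-class example.
probs = {"healthy": 0.5, "flu": 0.3, "rare": 0.2}
print(weighted_loss("rare", probs, weights) > weighted_loss("healthy", probs, weights))
```

With this heuristic the weights sum to the dataset size when added up over all examples, so the overall scale of the loss stays comparable to the unweighted case.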

**Ensemble techniques:**
- Balanced [random forests](/wiki/random_forest) that undersample the majority class within each bootstrap sample.
- EasyEnsemble and BalanceCascade, which build ensembles of classifiers trained on balanced subsets of the data.

## What is hierarchical classification?

When the set of classes has a natural hierarchical structure, hierarchical classification can improve both accuracy and interpretability. Instead of treating all K classes as a flat set, classes are organized into a tree or directed acyclic graph (DAG), where parent nodes represent broader categories and child nodes represent finer distinctions.

For example, in biological taxonomy, an image might first be classified as "animal" vs. "plant," then as "mammal" vs. "bird" vs. "reptile," and finally as a specific species. This top-down approach allows the model to make broad distinctions first and refine its prediction at each level.

Common strategies for hierarchical classification include:

- **Local classifier per node (LCN):** A binary classifier is trained at each node in the hierarchy.
- **Local classifier per parent node (LCPN):** A multi-class classifier is trained at each non-leaf node to distinguish among its children.
- **Global classifier:** A single model learns the entire hierarchy and predicts at all levels simultaneously.

Hierarchical classification is especially useful when the number of classes is very large (hundreds or thousands) and the classes have a meaningful taxonomy, such as product categorization in e-commerce or species identification in ecology.
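The local-classifier-per-parent-node strategy can be sketched as a top-down walk over the class tree. Everything in this example is hypothetical: the tiny taxonomy, the feature names, and the rule-based stand-ins for the trained classifier at each parent node. It only illustrates the control flow of routing an input from the root to a leaf.

```python
# Hypothetical class tree: each parent node maps to its child classes.
HIERARCHY = {
    "root": ["animal", "plant"],
    "animal": ["mammal", "bird"],
    "plant": ["tree", "flower"],
}

# Stand-ins for trained models: one multi-class classifier per parent node,
# each choosing among that node's children.
CLASSIFIERS = {
    "root": lambda x: "animal" if x["moves"] else "plant",
    "animal": lambda x: "bird" if x["has_feathers"] else "mammal",
    "plant": lambda x: "tree" if x["woody"] else "flower",
}

def predict_top_down(x, classifiers=CLASSIFIERS, hierarchy=HIERARCHY, root="root"):
    """Follow classifier decisions from the root until a leaf class is reached."""
    node, path = root, []
    while node in hierarchy:
        child = classifiers[node](x)
        if child not in hierarchy[node]:
            raise ValueError(f"{child!r} is not a child of {node!r}")
        path.append(child)
        node = child
    return path

sample = {"moves": True, "has_feathers": True, "woody": False}
print(predict_top_down(sample))  # ['animal', 'bird']
```

A known trade-off of this design is error propagation: a wrong decision at an upper level cannot be corrected further down the tree.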

## What is multi-class classification used for?

Multi-class classification is used across nearly every domain of artificial intelligence.

| Application | Classes | Description |
|---|---|---|
| Handwritten digit recognition ([MNIST](/wiki/mnist)) | 10 digits (0-9) | A foundational benchmark of 70,000 28x28 grayscale images (60,000 training, 10,000 test); models classify handwritten digits[13] |
| [ImageNet](/wiki/imagenet) classification | 1,000 object categories | The ILSVRC challenge drove breakthroughs in [deep learning](/wiki/deep_learning), including AlexNet, VGG, and [ResNet](/wiki/resnet)[9] |
| Document categorization | Varies (e.g., topics, genres) | Assigning text documents to predefined categories such as politics, sports, science, or business |
| [Sentiment analysis](/wiki/sentiment_analysis) | Typically 3-5 (e.g., very negative to very positive) | Classifying the emotional tone of text beyond simple positive/negative |
| Medical diagnosis | Multiple disease types | Classifying patient data, medical images, or lab results into specific diagnoses |
| Speech command recognition | Predefined spoken words | Classifying audio clips into categories like "yes," "no," "stop," "go" |
| Plant or animal species identification | Hundreds to thousands of species | [Computer vision](/wiki/computer_vision) models that identify species from photographs |
| Intent classification in chatbots | Predefined user intents | Routing user queries to the appropriate handler based on detected intent |

## Software and Libraries

Most major machine learning libraries provide robust support for multi-class classification.

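- **scikit-learn:** Offers `OneVsRestClassifier`, `OneVsOneClassifier`, and native multi-class support in algorithms like `RandomForestClassifier`, `GradientBoostingClassifier`, and `LogisticRegression`. Recent releases of `LogisticRegression` fit a multinomial model for multi-class targets with most solvers, and the older `multi_class='multinomial'` argument is deprecated.[11]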
- **[TensorFlow](/wiki/tensorflow) / Keras:** Provides `SparseCategoricalCrossentropy` and `CategoricalCrossentropy` loss functions, along with softmax activation layers.[6]
- **[PyTorch](/wiki/pytorch):** Implements `torch.nn.CrossEntropyLoss` (which combines log-softmax and negative log-likelihood, and therefore expects raw logits rather than softmax outputs) for multi-class training.
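- **XGBoost / LightGBM:** XGBoost supports multi-class classification through the `multi:softmax` and `multi:softprob` objectives; LightGBM provides the `multiclass` (softmax) and `multiclassova` (one-vs-all) objectives.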

## Explain Like I'm 5 (ELI5)

Imagine you have a big basket of mixed fruits: apples, bananas, oranges, and grapes. Your job is to sort each piece of fruit into the right pile. That is multi-class classification. The machine looks at a piece of fruit and decides: "This one is an apple, so it goes in the apple pile."

With only two types of fruit (say, apples and bananas), the task is simple: "Is it an apple, or is it a banana?" That is called binary classification. But when you have three or more types, the machine needs to pick from more options, and that makes it a multi-class problem.

There are different ways the machine can learn to sort. Some methods look at the fruit all at once and decide which pile it belongs to. Other methods break the problem into smaller questions, like "Is it an apple or a banana?" and "Is it an apple or an orange?" and then combine the answers.

To check if the machine is doing a good job, you count how many fruits it put in the right pile and how many it got wrong. If it keeps confusing oranges with grapes, you know it needs more practice telling those two apart.

## References

1. Mohri, M., Rostamizadeh, A., and Talwalkar, A. "Foundations of Machine Learning: Multi-Class Classification." MIT Press.
2. Bishop, Christopher M. "Pattern Recognition and Machine Learning." Springer, 2006.
3. Aly, Mohamed. "Survey on Multiclass Classification Methods." Caltech, 2005.
4. Dietterich, T. G. and Bakiri, G. "Solving Multiclass Learning Problems via Error-Correcting Output Codes." Journal of Artificial Intelligence Research, 1995.
5. Goodfellow, I., Bengio, Y., and Courville, A. "Deep Learning." MIT Press, 2016.
6. TensorFlow Documentation. "Sparse Categorical Crossentropy." https://www.tensorflow.org/api_docs/python/tf/keras/losses/SparseCategoricalCrossentropy
7. Sokolova, M. and Lapalme, G. "A Systematic Analysis of Performance Measures for Classification Tasks." Information Processing and Management, 2009.
8. scikit-learn Documentation. "Precision, Recall and F1-score for Multi-Class Data." https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html
9. Russakovsky, O. et al. "ImageNet Large Scale Visual Recognition Challenge." International Journal of Computer Vision, 2015.
10. Chawla, N. V. et al. "SMOTE: Synthetic Minority Over-sampling Technique." Journal of Artificial Intelligence Research, 2002.
11. scikit-learn Documentation. "Multiclass and multioutput algorithms." https://scikit-learn.org/stable/modules/multiclass.html
12. Krizhevsky, A., Sutskever, I., and Hinton, G. E. "ImageNet Classification with Deep Convolutional Neural Networks." Advances in Neural Information Processing Systems (NeurIPS), 2012.
13. LeCun, Y., Cortes, C., and Burges, C. J. C. "The MNIST Database of Handwritten Digits." http://yann.lecun.com/exdb/mnist/

