See also: Machine learning terms
Multi-class classification is a type of supervised learning problem in machine learning where an algorithm assigns each input instance to exactly one of three or more discrete classes. In contrast to binary classification, which distinguishes between only two categories, multi-class classification handles scenarios where the label space contains three or more mutually exclusive outcomes. A handwritten digit recognizer that identifies digits 0 through 9, for example, is a 10-class classification problem. Similarly, categorizing a news article as politics, sports, technology, or entertainment is a four-class problem.
Multi-class classification is one of the most widely encountered problem types in applied machine learning. It underpins applications ranging from image recognition and natural language processing to medical diagnosis and autonomous driving. Because many real-world tasks involve more than two possible outcomes, understanding the algorithms, loss functions, and evaluation strategies specific to multi-class settings is essential for any practitioner.
Given a training set of labeled examples {(x₁, y₁), (x₂, y₂), ..., (xₙ, yₙ)}, where each input xᵢ belongs to a feature space X and each label yᵢ belongs to a finite label set Y = {1, 2, ..., K} with K ≥ 3, the goal of multi-class classification is to learn a function f: X → Y that maps unseen inputs to their correct class labels with high accuracy.[1]
The key constraint that distinguishes multi-class classification from multi-label classification is mutual exclusivity: each instance belongs to exactly one class. The classifier must output a single predicted label per input.
Several machine learning algorithms can handle multiple classes natively, without requiring any special decomposition strategy.
| Algorithm | How It Handles Multiple Classes | Strengths |
|---|---|---|
| Decision tree | Splits the feature space recursively using information gain or Gini impurity; leaf nodes can represent any number of classes | Interpretable, no special encoding needed |
| Random forest | Ensemble of decision trees; each tree votes, and the majority vote determines the class | Robust to overfitting, handles high-dimensional data |
| k-Nearest Neighbors (k-NN) | Assigns the most common class among the k closest training examples | Simple, no training phase, naturally multi-class |
| Naive Bayes | Computes posterior probability for each class using Bayes' theorem and selects the class with the highest probability | Fast, works well with high-dimensional text data |
| Neural network with softmax output | Final layer has K neurons (one per class) activated by the softmax function, producing a probability distribution over all classes | Highly flexible, state-of-the-art on many benchmarks |
| Multinomial logistic regression | Extends binary logistic regression to K classes by learning K sets of weights and applying softmax | Probabilistic outputs, well-calibrated |
| Gradient boosting | Builds an ensemble of weak learners sequentially; uses multi-class loss functions such as multi-class log loss | High accuracy, handles structured/tabular data well |
These algorithms are sometimes called "direct methods" or "all-at-once" methods because they solve the full K-class problem in a single model.[2]
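As a concrete illustration of a natively multi-class learner, the sketch below trains a random forest with scikit-learn. It is a minimal example only; the iris dataset (3 classes) and the hyperparameters are illustrative choices, not part of any benchmark discussed here.

```python
# A minimal, runnable sketch assuming scikit-learn is installed; the iris
# dataset (3 classes) and hyperparameters are illustrative choices.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)                           # 3 mutually exclusive classes
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)                                    # no decomposition strategy needed

print(clf.predict(X_test[:5]))                               # one label per instance
print(clf.predict_proba(X_test[:5]).shape)                   # (5, 3): one probability per class
```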
Algorithms originally designed for binary classification, such as standard logistic regression or support vector machines, can be extended to multi-class settings through decomposition (also called reduction) strategies. The two most common approaches are One-vs-Rest and One-vs-One.
One-vs-Rest (also called One-vs-All or OvR) trains K separate binary classifiers, one for each class. The classifier for class k treats all examples of class k as positives and all other examples as negatives. At prediction time, all K classifiers score the input, and the class whose classifier produces the highest confidence wins.[3]
| Aspect | Detail |
|---|---|
| Number of classifiers | K (one per class) |
| Training complexity | K times the cost of training one binary classifier |
| Prediction rule | argmax of the K classifier scores |
| Common issue | Class imbalance, since each binary problem groups K-1 classes into the negative set |
OvR is the default multi-class strategy in many libraries; in scikit-learn, for example, LinearSVC uses it by default, and LogisticRegression uses it when fitted with the liblinear solver.
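The following is a hedged sketch of OvR using scikit-learn's explicit OneVsRestClassifier wrapper; the LinearSVC base estimator and the iris dataset are illustrative assumptions.

```python
# Hedged sketch of One-vs-Rest with scikit-learn's explicit wrapper; the
# LinearSVC base estimator and the iris dataset are illustrative assumptions.
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)                  # K = 3 classes

ovr = OneVsRestClassifier(LinearSVC(max_iter=10000))
ovr.fit(X, y)                                      # trains K binary classifiers

print(len(ovr.estimators_))                        # 3: one "class k vs. rest" model each
print(ovr.predict(X[:5]))                          # argmax over the K classifier scores
```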
One-vs-One trains a binary classifier for every pair of classes, resulting in K(K-1)/2 classifiers. Each classifier is trained only on examples from two classes. At prediction time, each classifier votes for one of its two classes, and the class with the most votes is selected.[3]
| Aspect | Detail |
|---|---|
| Number of classifiers | K(K-1)/2 |
| Training complexity | More classifiers, but each is trained on a smaller subset of data |
| Prediction rule | Majority vote across all pairwise classifiers |
| Common issue | Computationally expensive for large K |
OvO is the default strategy for algorithms like sklearn.svm.SVC because SVMs scale super-linearly with dataset size, and training on smaller subsets is often faster overall.
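A minimal OvO sketch with scikit-learn's OneVsOneClassifier wrapper; the base estimator and dataset are again illustrative. With K = 3 classes, K(K-1)/2 = 3 pairwise classifiers are trained.

```python
# Hedged sketch of One-vs-One with scikit-learn; base estimator and dataset are
# illustrative choices.
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)   # K = 3 classes

ovo = OneVsOneClassifier(LinearSVC(max_iter=10000))
ovo.fit(X, y)

print(len(ovo.estimators_))         # 3 = K(K-1)/2 pairwise classifiers
print(ovo.predict(X[:5]))           # class with the most pairwise votes
```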
A more general reduction framework, error-correcting output codes (ECOC), assigns each class a unique binary codeword. Multiple binary classifiers are trained, one per bit position in the code. At prediction time, the outputs of all classifiers form a binary string, and the class whose codeword is closest (in Hamming distance) to this string is selected. ECOC provides a degree of error tolerance because misclassification by a single binary classifier does not necessarily cause an incorrect final prediction.[4]
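scikit-learn exposes this framework as OutputCodeClassifier. The sketch below is a minimal illustration; the LogisticRegression base estimator, the code_size value, and the iris dataset are arbitrary choices.

```python
# Hedged ECOC sketch using scikit-learn's OutputCodeClassifier; the base
# estimator, code_size, and dataset are illustrative choices.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OutputCodeClassifier

X, y = load_iris(return_X_y=True)                  # K = 3 classes

# code_size=2.0 gives codewords of length 2*K bits; one binary classifier per bit.
ecoc = OutputCodeClassifier(LogisticRegression(max_iter=1000),
                            code_size=2.0, random_state=0)
ecoc.fit(X, y)

print(len(ecoc.estimators_))    # 6 bit-position classifiers
print(ecoc.predict(X[:5]))      # class whose codeword is closest to the predicted bits
```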
In neural network architectures used for multi-class classification, the final layer typically contains K output neurons, one for each class. The raw outputs of these neurons, called logits, are passed through the softmax function, which normalizes them into a valid probability distribution:
softmax(zᵢ) = exp(zᵢ) / Σⱼ exp(zⱼ)
where zᵢ is the logit for class i and the sum runs over all K classes. The softmax function ensures that all output values are between 0 and 1 and that they sum to exactly 1. The predicted class is the one with the highest probability.[5]
The softmax layer is a natural fit for multi-class classification because it directly models the assumption of mutual exclusivity: increasing the probability assigned to one class necessarily decreases the probability assigned to others. This contrasts with the sigmoid activation used in multi-label classification, where each output is independent.
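A minimal NumPy sketch of the softmax computation follows; the logits are made-up numbers, and subtracting the maximum logit is a standard numerical-stability trick not discussed above.

```python
# Minimal NumPy sketch of the softmax computation; the logits are made-up
# numbers, and the max subtraction is a standard numerical-stability trick.
import numpy as np

def softmax(z):
    z = z - z.max()              # avoids overflow without changing the result
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])    # raw scores for K = 3 classes
probs = softmax(logits)

print(probs)            # approx. [0.659 0.242 0.099]
print(probs.sum())      # 1.0: a valid probability distribution
print(probs.argmax())   # 0: the predicted class
```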
The standard loss function for multi-class classification is cross-entropy loss, also called categorical cross-entropy or log loss.
For a single example with true one-hot label vector y and predicted probability vector p, the categorical cross-entropy loss is:
L = -Σᵢ yᵢ log(pᵢ)
Because y is one-hot (only one element equals 1), this simplifies to L = -log(p_c), where c is the index of the correct class. The loss penalizes the model proportionally to how far the predicted probability for the correct class is from 1. A confident correct prediction (p_c close to 1) produces a low loss, while a confident wrong prediction produces a very high loss.[5]
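A small worked example with made-up probabilities shows that the full sum and the -log(p_c) shortcut agree, and how sharply confident mistakes are penalized.

```python
# Worked example with made-up numbers: the full sum and the -log(p_c)
# shortcut give the same loss, and confident mistakes are punished heavily.
import numpy as np

p = np.array([0.7, 0.2, 0.1])    # predicted probabilities over K = 3 classes
y = np.array([1, 0, 0])          # one-hot label: the correct class is index 0

print(-np.sum(y * np.log(p)))    # 0.357, the full cross-entropy sum
print(-np.log(p[0]))             # 0.357, the -log(p_c) shortcut

print(-np.log(0.99))             # 0.010: confident and correct -> tiny loss
print(-np.log(0.01))             # 4.605: confident and wrong -> large loss
```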
Sparse categorical cross-entropy is mathematically identical to categorical cross-entropy, but it accepts the true label as an integer index rather than a one-hot vector. This avoids the overhead of constructing and storing one-hot vectors, which becomes significant when the number of classes K is large (for example, in language modeling with vocabulary sizes exceeding 50,000). Most deep learning frameworks, including TensorFlow and PyTorch, offer both variants.[6]
| Loss Function | Label Format | Memory Efficiency | Use Case |
|---|---|---|---|
| Categorical cross-entropy | One-hot vector [0, 0, 1, 0] | Higher memory for large K | Small to moderate number of classes |
| Sparse categorical cross-entropy | Integer index (e.g., 2) | Lower memory | Large number of classes |
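As a hedged sketch, the two Keras variants can be compared on a single toy prediction; TensorFlow is assumed to be installed, and all label and probability values below are illustrative.

```python
# Hedged sketch comparing the two Keras loss variants on a toy prediction;
# TensorFlow is assumed installed, and all values are illustrative.
import numpy as np
import tensorflow as tf

probs = np.array([[0.1, 0.8, 0.1]], dtype="float32")    # model output for one example

# Categorical variant: the label is a one-hot vector.
cce = tf.keras.losses.CategoricalCrossentropy()
print(float(cce(np.array([[0.0, 1.0, 0.0]], dtype="float32"), probs)))   # -log(0.8) ~ 0.223

# Sparse variant: the same loss, but the label is an integer index.
scce = tf.keras.losses.SparseCategoricalCrossentropy()
print(float(scce(np.array([1]), probs)))                                  # ~ 0.223
```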
Evaluating multi-class classifiers requires metrics that account for the performance across all K classes. Single-number accuracy can be misleading, especially when classes are imbalanced.
A confusion matrix for a K-class problem is a K x K table where the entry in row i, column j indicates how many instances of true class i were predicted as class j. The diagonal entries represent correct predictions. Off-diagonal entries reveal systematic misclassification patterns, such as a model frequently confusing cats with dogs.[7]
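A minimal scikit-learn sketch with toy cat/dog/bird labels illustrates the layout; the labels and counts are illustrative only.

```python
# Minimal confusion-matrix sketch with scikit-learn; the labels are toy data.
from sklearn.metrics import confusion_matrix

y_true = ["cat", "cat", "dog", "dog", "bird", "bird"]
y_pred = ["cat", "dog", "dog", "dog", "bird", "cat"]

cm = confusion_matrix(y_true, y_pred, labels=["cat", "dog", "bird"])
print(cm)
# Rows are true classes, columns are predicted classes:
# [[1 1 0]   one cat correct, one cat mistaken for a dog
#  [0 2 0]   both dogs correct
#  [1 0 1]]  one bird correct, one bird mistaken for a cat
```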
Precision, recall, and the F1 score are computed per class and then aggregated. The aggregation method significantly affects the reported score:
| Averaging Method | Definition | When to Use |
|---|---|---|
| Macro average | Compute the metric for each class independently, then take the unweighted mean | When all classes are equally important, regardless of size |
| Micro average | Aggregate true positives, false positives, and false negatives across all classes before computing the metric | When larger classes should contribute more; equals accuracy for single-label classification |
| Weighted average | Compute per-class metric, then average weighted by each class's support (number of true instances) | When class imbalance exists and you want proportional representation |
Macro F1 is often preferred for imbalanced datasets because it gives equal weight to every class, including rare ones. Micro F1, on the other hand, is dominated by the performance on majority classes.[8]
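A toy imbalanced example, assuming scikit-learn, shows how the three averaging modes report different scores for the same predictions.

```python
# Toy imbalanced example, assuming scikit-learn; class 0 dominates and class 2
# is rare, so the three averaging modes report different scores.
from sklearn.metrics import f1_score

y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 2]

print(f1_score(y_true, y_pred, average="macro"))     # ~0.90: every class counts equally
print(f1_score(y_true, y_pred, average="micro"))     # ~0.89: equals accuracy (8 of 9 correct)
print(f1_score(y_true, y_pred, average="weighted"))  # ~0.89: weighted by class support
```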
Top-k accuracy considers a prediction correct if the true class is among the k classes with the highest predicted probabilities. Top-5 accuracy, for instance, was the primary metric in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), where models had to classify images into one of 1,000 categories. Top-k accuracy is useful when multiple classes are visually or semantically similar and a strict top-1 requirement is overly harsh.[9]
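A minimal sketch using scikit-learn's top_k_accuracy_score (available in recent versions); the score matrix is made up for illustration.

```python
# Minimal top-k accuracy sketch; top_k_accuracy_score is available in recent
# scikit-learn versions, and the score matrix here is made up.
import numpy as np
from sklearn.metrics import top_k_accuracy_score

y_true = np.array([0, 1, 2, 2])
y_score = np.array([[0.50, 0.30, 0.20],    # true class 0 ranked 1st
                    [0.40, 0.35, 0.25],    # true class 1 ranked 2nd
                    [0.20, 0.50, 0.30],    # true class 2 ranked 2nd
                    [0.70, 0.20, 0.10]])   # true class 2 ranked 3rd

print(top_k_accuracy_score(y_true, y_score, k=1))   # 0.25: strict top-1
print(top_k_accuracy_score(y_true, y_score, k=2))   # 0.75: credit for near misses
```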
Cohen's Kappa measures the agreement between predicted and true labels while accounting for agreement that would occur by chance. A kappa of 1 indicates perfect agreement, 0 indicates agreement no better than random, and negative values indicate worse-than-random performance. It is particularly useful when class distributions are highly skewed.
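A short sketch with scikit-learn's cohen_kappa_score on toy labels:

```python
# Cohen's kappa sketch with scikit-learn; the label vectors are toy data.
from sklearn.metrics import cohen_kappa_score

y_true = [0, 0, 0, 1, 1, 2, 2, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2, 2, 2, 2, 0]

print(cohen_kappa_score(y_true, y_pred))   # chance-corrected agreement in [-1, 1]
```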
Multi-class and multi-label classification are frequently confused, but they address fundamentally different problems.
| Aspect | Multi-Class Classification | Multi-Label Classification |
|---|---|---|
| Labels per instance | Exactly one | Zero, one, or many |
| Class relationship | Mutually exclusive | Independent |
| Output activation | Softmax (probabilities sum to 1) | Sigmoid (each output independent) |
| Loss function | Categorical cross-entropy | Binary cross-entropy per label |
| Example | An image is a cat, dog, or bird | A movie is tagged as action, comedy, and romance simultaneously |
The choice between multi-class and multi-label framing depends on the problem. If an instance can legitimately belong to more than one category, multi-label is appropriate. If categories are mutually exclusive by definition, multi-class is the correct formulation.
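To make the contrast concrete, here is a hedged Keras sketch of the two output heads; TensorFlow is assumed installed, and the input dimension, hidden width, and K = 4 are arbitrary illustrative choices rather than a recommended architecture.

```python
# Hedged Keras sketch contrasting the two output heads; TensorFlow is assumed
# installed, and the input size, hidden width, and K = 4 are arbitrary.
import tensorflow as tf

K = 4  # number of classes (multi-class) or labels (multi-label)

# Multi-class head: softmax + categorical cross-entropy, exactly one label per input.
multi_class_model = tf.keras.Sequential([
    tf.keras.Input(shape=(16,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(K, activation="softmax"),
])
multi_class_model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Multi-label head: sigmoid + binary cross-entropy, each label decided independently.
multi_label_model = tf.keras.Sequential([
    tf.keras.Input(shape=(16,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(K, activation="sigmoid"),
])
multi_label_model.compile(optimizer="adam", loss="binary_crossentropy")
```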
In many real-world multi-class problems, the number of examples per class varies dramatically. For instance, a medical diagnosis dataset may have thousands of "healthy" examples but only dozens of examples for a rare disease. Class imbalance causes standard classifiers to be biased toward majority classes, underperforming on minority ones.
Common strategies for addressing class imbalance in multi-class settings include:
- Data-level techniques: resample the training data, for example by oversampling minority classes (e.g., with SMOTE) or undersampling majority classes.
- Algorithm-level techniques: use cost-sensitive learning that weights errors on rare classes more heavily, for example via scikit-learn's class_weight parameter (a minimal sketch follows this list).
- Ensemble techniques: combine resampling with ensemble methods, such as balanced bagging or boosting variants that focus on misclassified minority-class examples.
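The following is a minimal sketch of the algorithm-level approach via scikit-learn's class_weight parameter; the synthetic 3-class dataset with a 90/7/3 split is purely illustrative.

```python
# Minimal cost-sensitive sketch via scikit-learn's class_weight parameter; the
# synthetic 3-class dataset with a 90/7/3 split is purely illustrative.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=20, n_informative=10,
                           n_classes=3, weights=[0.90, 0.07, 0.03],
                           random_state=0)

# 'balanced' reweights each class inversely to its frequency, so mistakes on
# the rare classes cost more during training.
clf = LogisticRegression(max_iter=2000, class_weight="balanced")
clf.fit(X, y)
```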
When the set of classes has a natural hierarchical structure, hierarchical classification can improve both accuracy and interpretability. Instead of treating all K classes as a flat set, classes are organized into a tree or directed acyclic graph (DAG), where parent nodes represent broader categories and child nodes represent finer distinctions.
For example, in biological taxonomy, an image might first be classified as "animal" vs. "plant," then as "mammal" vs. "bird" vs. "reptile," and finally as a specific species. This top-down approach allows the model to make broad distinctions first and refine its prediction at each level.
Common strategies for hierarchical classification include:
- Local classifier per parent node: each parent category trains a classifier that chooses among its own children (a minimal sketch of this approach appears below).
- Local classifier per node: each node has a binary classifier that decides whether an instance belongs to its subtree.
- Local classifier per level: one multi-class classifier per level of the hierarchy.
- Global ("big-bang") classifier: a single model that predicts a complete path through the hierarchy at once.
Hierarchical classification is especially useful when the number of classes is very large (hundreds or thousands) and the classes have a meaningful taxonomy, such as product categorization in e-commerce or species identification in ecology.
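Below is a hedged sketch of the local-classifier-per-parent-node strategy; the two-level animal taxonomy, the class names, and the logistic-regression models are all illustrative assumptions, not a prescribed design.

```python
# Hedged sketch of a local-classifier-per-parent-node hierarchy; the two-level
# animal taxonomy, class names, and logistic-regression models are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

PARENT_OF = {"cat": "mammal", "dog": "mammal", "sparrow": "bird", "eagle": "bird"}

def fit_hierarchy(X, y_fine):
    y_fine = np.asarray(y_fine)
    y_parent = np.array([PARENT_OF[c] for c in y_fine])
    root = LogisticRegression(max_iter=1000).fit(X, y_parent)     # coarse level
    children = {}
    for parent in np.unique(y_parent):                            # one refiner per parent
        mask = y_parent == parent
        children[parent] = LogisticRegression(max_iter=1000).fit(X[mask], y_fine[mask])
    return root, children

def predict_hierarchy(root, children, X):
    parents = root.predict(X)                                     # broad category first
    return np.array([children[p].predict(x.reshape(1, -1))[0]     # then refine per parent
                     for p, x in zip(parents, X)])

# Toy usage on random features (the labels are arbitrary, for illustration only).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(40, 5))
y_toy = ["cat", "dog", "sparrow", "eagle"] * 10
root, children = fit_hierarchy(X_toy, y_toy)
print(predict_hierarchy(root, children, X_toy[:4]))
```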
Multi-class classification is used across nearly every domain of artificial intelligence.
| Application | Classes | Description |
|---|---|---|
| Handwritten digit recognition (MNIST) | 10 digits (0-9) | One of the foundational benchmarks in machine learning; models classify 28x28 grayscale images of handwritten digits |
| ImageNet classification | 1,000 object categories | The ILSVRC challenge drove breakthroughs in deep learning, including AlexNet, VGG, and ResNet |
| Document categorization | Varies (e.g., topics, genres) | Assigning text documents to predefined categories such as politics, sports, science, or business |
| Sentiment analysis | Typically 3-5 (e.g., very negative to very positive) | Classifying the emotional tone of text beyond simple positive/negative |
| Medical diagnosis | Multiple disease types | Classifying patient data, medical images, or lab results into specific diagnoses |
| Speech command recognition | Predefined spoken words | Classifying audio clips into categories like "yes," "no," "stop," "go" |
| Plant or animal species identification | Hundreds to thousands of species | Computer vision models that identify species from photographs |
| Intent classification in chatbots | Predefined user intents | Routing user queries to the appropriate handler based on detected intent |
Most major machine learning libraries provide robust support for multi-class classification.
- scikit-learn: OneVsRestClassifier, OneVsOneClassifier, and native multi-class support in algorithms like RandomForestClassifier, GradientBoostingClassifier, and LogisticRegression(multi_class='multinomial').
- TensorFlow/Keras: SparseCategoricalCrossentropy and CategoricalCrossentropy loss functions, along with softmax activation layers.
- PyTorch: torch.nn.CrossEntropyLoss (which combines log-softmax and negative log-likelihood) for multi-class training.
- XGBoost: multi:softmax and multi:softprob objectives.

Imagine you have a big basket of mixed fruits: apples, bananas, oranges, and grapes. Your job is to sort each piece of fruit into the right pile. That is multi-class classification. The machine looks at a piece of fruit and decides: "This one is an apple, so it goes in the apple pile."
With only two types of fruit (say, apples and bananas), the task is simple: "Is it an apple, or is it a banana?" That is called binary classification. But when you have three or more types, the machine needs to pick from more options, and that makes it a multi-class problem.
There are different ways the machine can learn to sort. Some methods look at the fruit all at once and decide which pile it belongs to. Other methods break the problem into smaller questions, like "Is it an apple or a banana?" and "Is it an apple or an orange?" and then combine the answers.
To check if the machine is doing a good job, you count how many fruits it put in the right pile and how many it got wrong. If it keeps confusing oranges with grapes, you know it needs more practice telling those two apart.