See also: Machine learning terms
Multinomial classification, also known as multi-class or multiclass classification, is a supervised learning task where each input belongs to exactly one of K possible classes, with K greater than two. The classifier learns a function f(x) = y that maps a feature vector x to a single label y in {1, 2, ..., K}. Examples include classifying a handwritten digit as one of 0 through 9, predicting which species of iris a flower belongs to, or labelling an ImageNet photograph with one of 1,000 object categories.
The task sits at the heart of machine learning. Many real-world problems have more than two outcomes, so binary methods alone are not enough. Standard approaches either build a single model that outputs K probabilities directly, or wrap a binary classification algorithm with a meta-strategy such as one-vs-rest or one-vs-one. The choice depends on the algorithm, the number of classes, the size of the dataset, and the computational budget.
Given a training set {(x_i, y_i)} for i = 1 to n, where x_i is a feature vector in R^d and y_i is an integer label in {1, 2, ..., K}, the goal is to learn a hypothesis h that predicts a label for any new input x. Most classifiers learn a scoring function s_k(x) for each class k, then predict the argmax:
y_hat = argmax_k s_k(x)
When the scores can be interpreted as probabilities, the model also returns a posterior P(y = k | x). Probabilistic outputs matter for downstream tasks such as ranking, calibration, abstention, and decision theory under asymmetric costs.
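As a concrete illustration, consider a 3-class problem with an asymmetric cost matrix. The sketch below (the posteriors and costs are made up for illustration) contrasts the plain argmax rule with the minimum-expected-cost decision:

```python
import numpy as np

# Hypothetical posteriors from a classifier for one input x
probs = np.array([0.5, 0.3, 0.2])

# Made-up cost matrix: C[i, j] is the cost of predicting class i
# when the true class is j. Wrongly predicting class 0 is expensive.
C = np.array([
    [0.0, 10.0, 10.0],
    [1.0,  0.0,  1.0],
    [1.0,  1.0,  0.0],
])

expected_cost = C @ probs                # expected cost of each prediction
print(int(np.argmax(probs)))             # 0: the most probable class
print(int(np.argmin(expected_cost)))     # 1: the cheapest class in expectation
```

The two rules disagree here: class 0 is the most likely, but its high misclassification cost makes class 1 the safer bet.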
Multinomial classification is often confused with adjacent problems. The differences matter because the choice of loss, evaluation metric, and output layer all change.
| Task | Number of labels per instance | Class structure | Typical example |
|---|---|---|---|
| Binary classification | 1 of 2 | Flat | Spam vs not spam |
| Multinomial classification | 1 of K (K > 2) | Flat | Digit recognition (0 to 9) |
| Multi-label classification | Subset of K | Flat | Topic tagging an article with several tags |
| Hierarchical classification | 1 of K | Tree or DAG | Biological taxonomy classification |
| Ordinal classification | 1 of K | Ordered | Rating prediction (1 to 5 stars) |
A classifier can either treat K classes natively or reduce the problem to a series of binary subproblems. The four main strategies are summarised below.
| Strategy | How it works | Number of submodels | Notes |
|---|---|---|---|
| Native multiclass | A single model produces K outputs (often via softmax) | 1 | Used by softmax regression, decision trees, neural networks |
| One-vs-rest (OvR or OvA) | Train one binary classifier per class, that class against all others | K | Simple, parallelisable, works with any binary learner |
| One-vs-one (OvO) | Train one binary classifier for every pair of classes | K(K - 1) / 2 | Each model trains on a subset of data; voting picks the winner |
| Error-correcting output codes (ECOC) | Assign each class a unique binary codeword; train one binary classifier per codeword bit | L (code length) | Adds redundancy so the system can recover from individual classifier mistakes |
The one-vs-rest approach (also called one-vs-all) trains K classifiers, where the k-th classifier learns to separate class k from the union of all other classes. Prediction picks the class whose classifier produces the highest score. It is the default meta-strategy in many libraries because it scales linearly with K and produces a per-class confidence score.
One-vs-one trains K(K - 1) / 2 classifiers, each on the subset of training data belonging to two specific classes. At test time, every classifier votes for one of its two classes, and the class with the most votes wins. This strategy is preferred for kernel methods such as support vector machines because each individual problem uses less data, and the cost of training kernel matrices grows superlinearly with sample size.
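Both reductions are available as generic wrappers in scikit-learn. A minimal sketch on the iris data (the LinearSVC base learner and its settings are arbitrary choices for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)

# One-vs-rest: K = 3 binary models, each separating one species from the rest
ovr = OneVsRestClassifier(LinearSVC(max_iter=10000)).fit(X, y)
print(len(ovr.estimators_))   # 3

# One-vs-one: K(K - 1) / 2 = 3 pairwise models, each trained on two classes' data
ovo = OneVsOneClassifier(LinearSVC(max_iter=10000)).fit(X, y)
print(len(ovo.estimators_))   # 3
```

With K = 3 the two strategies happen to train the same number of submodels; the gap widens quickly, since OvO needs 45 models at K = 10 and 4,950 at K = 100.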
Error-correcting output codes were introduced by Thomas Dietterich and Ghulum Bakiri in 1995. Each of the K classes is assigned a binary codeword of length L, and L binary classifiers are trained to predict each bit. At test time, the predicted bits form a codeword that is decoded to the nearest class codeword by Hamming distance. The redundancy in the code lets the system recover when individual binary classifiers are wrong, which often improves generalisation.
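scikit-learn ships this strategy as OutputCodeClassifier, which draws a random code book rather than a hand-designed code; a short sketch (the base learner and code_size value are arbitrary here):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OutputCodeClassifier

X, y = load_iris(return_X_y=True)

# code_size is the codeword length as a multiple of K:
# here L = 2.0 * 3 = 6 bits, so 6 binary classifiers are trained
ecoc = OutputCodeClassifier(
    LogisticRegression(max_iter=1000), code_size=2.0, random_state=0
).fit(X, y)
print(len(ecoc.estimators_))   # 6
```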
Most modern algorithms support multinomial classification directly without reduction to binary subproblems.
| Method | Output mechanism | Typical loss | Notes |
|---|---|---|---|
| Softmax regression | Linear scores passed through softmax | Categorical cross-entropy | Generalises logistic regression to K classes |
| Decision tree (CART) | Leaf node majority vote or class probabilities | Gini impurity or entropy | Handles K classes with no extra machinery |
| Random forest | Average of per-tree class probabilities | Gini impurity per tree | Strong baseline on tabular data |
| Gradient boosting | One booster per class, softmax over scores | multi:softmax or multi:softprob in XGBoost | Used in LightGBM, XGBoost, CatBoost |
| Naive Bayes | Class posteriors via Bayes rule | Maximum likelihood | Multinomial NB is widely used for text |
| k-Nearest Neighbours | Class vote among k nearest training points | None (lazy learner) | Trivially supports any K |
| Neural network | K logits passed through softmax | Categorical cross-entropy | Used in nearly all deep learning classifiers |
| Crammer-Singer SVM | Joint margin across all K classes | Multiclass hinge loss | Direct multiclass formulation, single optimisation problem |
Koby Crammer and Yoram Singer published their multiclass kernel-based vector machine formulation in the Journal of Machine Learning Research in 2001. Unlike earlier multiclass support vector machines, which decomposed the problem into independent binary tasks, their algorithm casts the entire K-class problem as a single quadratic program. Solving it through the dual yields a fixed-point iteration that handles many classes efficiently.
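Concretely, the Crammer-Singer loss for a single example penalises the gap between the true class score and the strongest rival. A minimal numpy sketch:

```python
import numpy as np

def crammer_singer_hinge(scores, y):
    # scores: length-K array of class scores s_k(x); y: true class index.
    # Loss is zero only when the true class beats every rival by a margin of 1.
    rivals = np.delete(scores, y)              # scores of the K - 1 other classes
    return max(0.0, 1.0 + rivals.max() - scores[y])

print(crammer_singer_hinge(np.array([2.0, 0.5, 1.8]), y=0))   # margin 0.2 -> loss 0.8
```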
Softmax regression, also called multinomial regression or multinomial logistic regression, is the canonical native multiclass model. For each class k it learns a weight vector w_k and bias b_k, then defines:
P(y = k | x) = exp(w_k . x + b_k) / sum_j exp(w_j . x + b_j)
The denominator normalises the scores into a valid probability distribution. The model is trained by maximising the log-likelihood of the observed labels, which is equivalent to minimising the categorical cross-entropy loss:
L = -sum_i sum_k 1{y_i = k} log P(y = k | x_i)
When K = 2, softmax regression reduces to ordinary logistic regression. The same softmax + cross-entropy combination is the default output layer for almost every modern deep classifier, from MNIST digit recognisers to large language models.
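The following self-contained numpy sketch implements the forward pass and loss above; the tiny random dataset is just for illustration:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)      # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(W, b, X, y):
    # W: (d, K) weights, b: (K,) biases, X: (n, d), y: (n,) labels in {0..K-1}
    probs = softmax(X @ W + b)                 # (n, K) class posteriors
    return -np.log(probs[np.arange(len(y)), y]).mean()

rng = np.random.default_rng(0)
X, y = rng.normal(size=(5, 4)), np.array([0, 2, 1, 2, 0])
W, b = np.zeros((4, 3)), np.zeros(3)
print(cross_entropy(W, b, X, y))               # ln(3) ≈ 1.0986 for a uniform model
```

Training is then gradient descent on this loss; conveniently, the gradient with respect to the logits is simply the predicted probabilities minus the one-hot targets.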
The choice of loss controls what the model optimises and how it handles imbalance, noise, and overconfidence.
| Loss | Description | When to use |
|---|---|---|
| Categorical cross-entropy | Negative log-likelihood with one-hot targets | Default for softmax classifiers |
| Sparse categorical cross-entropy | Same loss, integer labels instead of one-hot | Memory-efficient when K is large |
| Multiclass hinge (Crammer-Singer) | Margin-based loss for multiclass SVMs | Linear or kernel SVM training |
| Label-smoothed cross-entropy | Targets become (1 - epsilon) for the true class and epsilon / (K - 1) elsewhere | Reduces overconfidence, improves calibration |
| Focal loss | Down-weights easy examples by a (1 - p)^gamma factor | Highly imbalanced datasets |
| KL divergence | Matches a soft target distribution | Knowledge distillation, ensembling |
Label smoothing was introduced by Christian Szegedy and colleagues in the 2016 Inception-v3 paper "Rethinking the Inception Architecture for Computer Vision." The technique replaces the one-hot target with a smoothed distribution, which discourages the network from producing extremely confident logits. It tends to improve generalisation and calibration on large vision benchmarks.
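A sketch of the smoothed-target construction from the table above (epsilon = 0.1 is an illustrative value; note that conventions differ, and some implementations spread epsilon over all K classes including the true one):

```python
import numpy as np

def smooth_targets(y, K, epsilon=0.1):
    # 1 - epsilon on the true class, epsilon / (K - 1) everywhere else
    targets = np.full((len(y), K), epsilon / (K - 1))
    targets[np.arange(len(y)), y] = 1.0 - epsilon
    return targets

print(smooth_targets(np.array([2]), K=4))
# ≈ [[0.033, 0.033, 0.9, 0.033]] -- the row still sums to 1
```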
Focal loss was introduced by Tsung-Yi Lin and colleagues in 2017 in "Focal Loss for Dense Object Detection." Although the original target was object detection with a long-tail of background examples, the modulating factor (1 - p_t)^gamma works for any imbalanced multiclass problem by reducing the loss contribution of well-classified examples and concentrating gradient on hard ones.
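A minimal numpy sketch of the multiclass form (the probabilities are made up; gamma = 2 is a commonly used setting):

```python
import numpy as np

def focal_loss(probs, y, gamma=2.0):
    # probs: (n, K) predicted probabilities; y: (n,) integer labels
    p_t = probs[np.arange(len(y)), y]          # probability of the true class
    return (-(1.0 - p_t) ** gamma * np.log(p_t)).mean()

probs = np.array([[0.95, 0.03, 0.02],    # easy example: loss is nearly zero
                  [0.40, 0.35, 0.25]])   # hard example: dominates the batch
print(focal_loss(probs, np.array([0, 0])))
```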
No single number captures multiclass performance, especially when classes are imbalanced. Practitioners typically report several complementary metrics.
| Metric | Definition | Range | Notes |
|---|---|---|---|
| Accuracy | Fraction of correctly predicted labels | [0, 1] | Misleading when classes are imbalanced |
| Top-k accuracy | Fraction of cases where the true label is among the k highest-scoring predictions | [0, 1] | Top-5 accuracy is the standard ImageNet metric |
| Confusion matrix | K x K table of true vs predicted labels | Counts | Reveals which classes are confused with each other |
| Per-class precision | TP / (TP + FP) for one class | [0, 1] | Reported alongside recall and F1 |
| Per-class recall | TP / (TP + FN) for one class | [0, 1] | Sensitivity for that class |
| Per-class F1 score | Harmonic mean of precision and recall | [0, 1] | Useful when both errors matter |
| Macro-averaged F1 | Unweighted mean of per-class F1 | [0, 1] | Treats every class equally regardless of size |
| Micro-averaged F1 | Computed from total TP, FP, FN across classes | [0, 1] | Equals accuracy when each instance has one label |
| Weighted-averaged F1 | Mean of per-class F1 weighted by class support | [0, 1] | Useful for imbalanced data |
| Cohen's kappa | (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is chance agreement | [-1, 1] | Adjusts accuracy for the agreement expected by chance |
| Multiclass AUC (OvR or OvO) | Average of pairwise or one-vs-rest binary AUCs | [0, 1] | scikit-learn supports both averaging schemes |
| Matthews correlation coefficient | Generalisation of the two-class MCC (the R_K statistic) to K classes | [-1, 1] | Robust under imbalance |
Macro-averaging treats every class as equally important, which matters when the rare classes are the ones you care about. Micro-averaging weights by instance count, so the dominant class drives the score. Weighted averaging is a compromise that uses per-class support as weights. The convention is documented in the scikit-learn classification report and is widely reused in research papers.
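The difference is easy to see on a small imbalanced example (toy labels invented for illustration):

```python
from sklearn.metrics import f1_score

# Class 0 dominates; rare class 2 is never predicted correctly
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 0, 0, 0, 0, 1, 1, 0]

print(f1_score(y_true, y_pred, average='micro'))     # ≈ 0.889, equals accuracy
print(f1_score(y_true, y_pred, average='macro'))     # ≈ 0.641, hurt by class 2
print(f1_score(y_true, y_pred, average='weighted'))  # ≈ 0.838, support-weighted
```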
Multiclass datasets often have skewed label distributions. ImageNet-21K's tail contains classes with only a few hundred images, and many real text classification problems have one dominant category and a long list of rare ones. Common remedies include oversampling the rare classes or undersampling the dominant ones, weighting the loss by inverse class frequency, switching to focal loss, and augmenting or synthesising extra examples for the tail.
For more on these techniques and their trade-offs, see class imbalance.
Progress in multinomial classification has been driven by a handful of canonical datasets.
| Benchmark | Classes (K) | Domain | Typical state-of-the-art metric |
|---|---|---|---|
| MNIST | 10 | Handwritten digits | Test error below 0.2 percent |
| Fashion-MNIST | 10 | Clothing item images | Around 96 percent accuracy |
| CIFAR-10 | 10 | Tiny natural images | Above 99 percent on best models |
| CIFAR-100 | 100 | Tiny natural images, fine-grained | Around 96 percent on best models |
| ImageNet (ILSVRC-1K) | 1,000 | Natural images | Top-5 error below 2 percent |
| ImageNet-21K | about 21,000 | Natural images | Used mainly for pretraining |
| LSHTC | thousands to millions | Web text taxonomies | Macro-F1 evaluation |
| iNaturalist | over 8,000 | Plant and animal species | Top-1 accuracy |
| GLUE / SuperGLUE | mixed binary and multiclass | NLP understanding | Per-task accuracy or F1 |
The ImageNet Large Scale Visual Recognition Challenge popularised top-1 and top-5 accuracy as the headline scores. Top-5 accuracy counts a prediction as correct if the true label appears in the model's five highest-scoring guesses, which makes sense for a 1,000-class problem where neighbouring categories (such as different dog breeds) are visually almost identical. AlexNet hit a top-5 error of 15.3 percent in 2012, ResNet drove it to 3.57 percent in 2015, and Squeeze-and-Excitation Networks reached 2.25 percent in 2017. The challenge ended after 2017, with organisers noting that the headline benchmark had been effectively solved.
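Top-k accuracy takes only a few lines to compute by hand (scikit-learn also provides top_k_accuracy_score with the same semantics); the scores below are random stand-ins:

```python
import numpy as np

def top_k_accuracy(scores, y, k=5):
    # scores: (n, K) class scores; y: (n,) integer labels
    top_k = np.argsort(scores, axis=1)[:, -k:]   # indices of the k best scores
    return float(np.mean([y[i] in top_k[i] for i in range(len(y))]))

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 10))                # 4 examples, K = 10 classes
y = np.array([3, 7, 0, 9])
print(top_k_accuracy(scores, y, k=5))
```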
Most machine learning libraries handle K-class problems with little extra code.
| Library | Default behaviour | Multiclass notes |
|---|---|---|
| scikit-learn | All classifiers handle multiclass natively | LogisticRegression(multi_class='multinomial') for softmax, OneVsRestClassifier and OneVsOneClassifier for reductions |
| PyTorch | nn.CrossEntropyLoss combines log-softmax and NLL | Targets are integer class indices |
| TensorFlow / Keras | SparseCategoricalCrossentropy for integer labels, CategoricalCrossentropy for one-hot | Output layer uses softmax activation |
| XGBoost | objective='multi:softmax' returns labels, multi:softprob returns class probabilities | Set num_class parameter |
| LightGBM | objective='multiclass' | num_class parameter required |
| CatBoost | MultiClass loss function | Native support, no one-hot needed |
A minimal scikit-learn example using the iris dataset, which has three classes, looks like this:
```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

X, y = load_iris(return_X_y=True)

# stratify=y keeps the three class proportions equal across both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

clf = LogisticRegression(multi_class='multinomial', max_iter=1000)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(confusion_matrix(y_test, y_pred))                  # 3 x 3 table
print(classification_report(y_test, y_pred, digits=3))   # per-class metrics
```
The classification_report prints per-class precision, recall, F1, and support, plus macro and weighted averages, which together give a much fuller picture than accuracy alone.
Multinomial classification underpins much of modern deep learning, often in places that look at first like very different tasks.
Next-token prediction in large language models is a multinomial classification problem over the model's vocabulary, which typically has tens or hundreds of thousands of tokens (50,257 for GPT-2, 100,256 for the cl100k_base tokeniser, around 200,000 for OpenAI's o200k_base, 128,000-plus for some Llama variants). At every position, the model produces a vector of logits, applies softmax, and is trained with cross-entropy against the next true token. The same loss that powers iris classification scales up to train trillion-parameter models on web-scale text.
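A shape-level PyTorch sketch of that training step, with random logits and tokens standing in for a real transformer's output:

```python
import torch
import torch.nn.functional as F

batch, seq_len, vocab = 2, 8, 50257              # GPT-2-sized vocabulary
logits = torch.randn(batch, seq_len, vocab)      # stand-in for model output
next_tokens = torch.randint(0, vocab, (batch, seq_len))

# cross_entropy expects (N, K) scores and (N,) integer class indices, so
# flatten positions: each position is one 50,257-way classification problem
loss = F.cross_entropy(logits.view(-1, vocab), next_tokens.view(-1))
print(loss.item())                               # near ln(50257) ≈ 10.8 here
```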
Zero-shot image classification with CLIP is another reformulation. CLIP encodes the image and a list of candidate class names (often phrased as "a photo of a {label}") into a shared embedding space, then picks the class with the highest cosine similarity. This effectively turns any vocabulary into a multiclass classifier without retraining, which is one reason CLIP-style models opened up flexible recognition systems.
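A schematic numpy sketch of the similarity-then-softmax step, with random vectors standing in for the CLIP image and text encoders (the temperature value is illustrative):

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, temperature=0.01):
    # image_emb: (d,) image embedding; text_embs: (K, d) embeddings of the
    # prompts "a photo of a {label}". Real embeddings would come from CLIP.
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = (text_embs @ image_emb) / temperature   # scaled cosine similarities
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

rng = np.random.default_rng(0)
print(zero_shot_classify(rng.normal(size=512), rng.normal(size=(3, 512))))
```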
Retrieval-augmented systems and recommender systems often dress up as ranking or retrieval problems, but at the prediction layer they are usually multiclass softmaxes over a discrete catalogue of items.
The softmax + cross-entropy recipe is robust, but it has known weaknesses: it tends to produce overconfident, poorly calibrated probabilities (one motivation for label smoothing and for post-hoc calibration), it is sensitive to label noise because a confidently wrong target is penalised without bound, and the normalising sum over K classes becomes expensive when K reaches the hundreds of thousands, which pushes extreme-classification and language-model systems towards approximations such as sampled or hierarchical softmax.
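Temperature scaling, one of the simpler post-hoc calibration fixes, divides the logits by a scalar T fitted on held-out data; a minimal sketch with made-up logits:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# T > 1 softens overconfident predictions without changing the argmax,
# so accuracy is untouched while calibration improves
logits = np.array([8.0, 2.0, 1.0])
print(softmax(logits))          # ≈ [0.997, 0.002, 0.001]: overconfident
print(softmax(logits / 2.5))    # ≈ [0.868, 0.079, 0.053]: same top class
```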
Imagine a basket of fruits with apples, bananas, and oranges. Sorting each piece into the right pile is multinomial classification. The computer looks at lots of labelled examples until it learns what each fruit looks like, then it can pick the right pile for a new piece it has never seen before. The only difference from a yes-or-no question is that there are more than two piles to choose from.