See also: Machine learning terms
Multinomial classification, also known as multi-class or multiclass classification, is a supervised learning task where each input belongs to exactly one of K possible classes, with K greater than two. The classifier learns a function f(x) = y that maps a feature vector x to a single label y in {1, 2, ..., K}. Examples include classifying a handwritten digit as one of 0 through 9, predicting which species of iris a flower belongs to, or labelling an ImageNet photograph with one of 1,000 object categories.
The task sits at the heart of machine learning. Many real-world problems have more than two outcomes, so binary methods alone are not enough. Standard approaches either build a single model that outputs K probabilities directly, or wrap a binary classification algorithm with a meta-strategy such as one-vs-rest or one-vs-one. The choice depends on the algorithm, the number of classes, the size of the dataset, and the computational budget.
Given a training set {(x_i, y_i)} for i = 1 to n, where x_i is a feature vector in R^d and y_i is an integer label in {1, 2, ..., K}, the goal is to learn a hypothesis h that predicts a label for any new input x. Most classifiers learn a scoring function s_k(x) for each class k, then predict the argmax:
y_hat = argmax_k s_k(x)
When the scores can be interpreted as probabilities, the model also returns a posterior P(y = k | x). Probabilistic outputs matter for downstream tasks such as ranking, calibration, abstention, and decision theory under asymmetric costs.
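As a concrete illustration, consider a 3-class problem with an asymmetric cost matrix. The sketch below (the posteriors and costs are made up for illustration) contrasts the plain argmax rule with the minimum-expected-cost decision:

```python
import numpy as np

# Hypothetical posteriors from a classifier for one input x
probs = np.array([0.5, 0.3, 0.2])

# Made-up cost matrix: C[i, j] is the cost of predicting class i
# when the true class is j. Wrongly predicting class 0 is expensive.
C = np.array([
    [0.0, 10.0, 10.0],
    [1.0,  0.0,  1.0],
    [1.0,  1.0,  0.0],
])

expected_cost = C @ probs                # expected cost of each prediction
print(int(np.argmax(probs)))             # 0: the most probable class
print(int(np.argmin(expected_cost)))     # 1: the cheapest class in expectation
```

The two rules disagree here: class 0 is the most likely, but its high misclassification cost makes class 1 the safer bet.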
Multinomial classification is often confused with adjacent problems. The differences matter because the choice of loss, evaluation metric, and output layer all change.
| Task | Number of labels per instance | Class structure | Typical example |
|---|---|---|---|
| Binary classification | 1 of 2 | Flat | Spam vs not spam |
| Multinomial classification | 1 of K (K > 2) | Flat | Digit recognition (0 to 9) |
| Multi-label classification | Subset of K | Flat | Topic tagging an article with several tags |
| Hierarchical classification | 1 of K | Tree or DAG | Biological taxonomy classification |
| Ordinal classification | 1 of K | Ordered | Rating prediction (1 to 5 stars) |
A classifier can either treat K classes natively or reduce the problem to a series of binary subproblems. The four main strategies are summarised below.
| Strategy | How it works | Number of submodels | Notes |
|---|---|---|---|
| Native multiclass | A single model produces K outputs (often via softmax) | 1 | Used by softmax regression, decision trees, neural networks |
| One-vs-rest (OvR or OvA) | Train one binary classifier per class, that class against all others | K | Simple, parallelisable, works with any binary learner |
| One-vs-one (OvO) | Train one binary classifier for every pair of classes | K(K - 1) / 2 | Each model trains on a subset of data; voting picks the winner |
| Error-correcting output codes (ECOC) | Assign each class a unique binary codeword; train one binary classifier per codeword bit | L (code length) | Adds redundancy so the system can recover from individual classifier mistakes |
The one-vs-rest approach (also called one-vs-all) trains K classifiers, where the k-th classifier learns to separate class k from the union of all other classes. Prediction picks the class whose classifier produces the highest score. It is the default meta-strategy in many libraries because it scales linearly with K and produces a per-class confidence score.
One-vs-one trains K(K - 1) / 2 classifiers, each on the subset of training data belonging to two specific classes. At test time, every classifier votes for one of its two classes, and the class with the most votes wins. This strategy is preferred for kernel methods such as support vector machines because each individual problem uses less data, and the cost of training kernel matrices grows superlinearly with sample size.
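Both reductions are available as generic wrappers in scikit-learn. A minimal sketch on the iris data (the LinearSVC base learner and its settings are arbitrary choices for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)

# One-vs-rest: K = 3 binary models, each separating one species from the rest
ovr = OneVsRestClassifier(LinearSVC(max_iter=10000)).fit(X, y)
print(len(ovr.estimators_))   # 3

# One-vs-one: K(K - 1) / 2 = 3 pairwise models, each trained on two classes' data
ovo = OneVsOneClassifier(LinearSVC(max_iter=10000)).fit(X, y)
print(len(ovo.estimators_))   # 3
```

With K = 3 the two strategies happen to train the same number of submodels; the gap widens quickly, since OvO needs 45 models at K = 10 and 4,950 at K = 100.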
Error-correcting output codes were introduced by Thomas Dietterich and Ghulum Bakiri in 1995. Each of the K classes is assigned a binary codeword of length L, and L binary classifiers are trained to predict each bit. At test time, the predicted bits form a codeword that is decoded to the nearest class codeword by Hamming distance. The redundancy in the code lets the system recover when individual binary classifiers are wrong, which often improves generalisation.
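scikit-learn ships this strategy as OutputCodeClassifier, which draws a random code book rather than a hand-designed code; a short sketch (the base learner and code_size value are arbitrary here):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OutputCodeClassifier

X, y = load_iris(return_X_y=True)

# code_size is the codeword length as a multiple of K:
# here L = 2.0 * 3 = 6 bits, so 6 binary classifiers are trained
ecoc = OutputCodeClassifier(
    LogisticRegression(max_iter=1000), code_size=2.0, random_state=0
).fit(X, y)
print(len(ecoc.estimators_))   # 6
```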
Most modern algorithms support multinomial classification directly without reduction to binary subproblems.
| Method | Output mechanism | Typical loss | Notes |
|---|---|---|---|
| Softmax regression | Linear scores passed through softmax | Categorical cross-entropy | Generalises logistic regression to K classes |
| Decision tree (CART) | Leaf node majority vote or class probabilities | Gini impurity or entropy | Handles K classes with no extra machinery |
| Random forest | Average of per-tree class probabilities | Gini impurity per tree | Strong baseline on tabular data |
| Gradient boosting | One booster per class, softmax over scores | multi:softmax or multi:softprob in XGBoost | Used in LightGBM, XGBoost, CatBoost |
| Naive Bayes | Class posteriors via Bayes rule | Maximum likelihood | Multinomial NB is widely used for text |
| k-Nearest Neighbours | Class vote among k nearest training points | None (lazy learner) | Trivially supports any K |
| Neural network | K logits passed through softmax | Categorical cross-entropy | Used in nearly all deep learning classifiers |
| Crammer-Singer SVM | Joint margin across all K classes | Multiclass hinge loss | Direct multiclass formulation, single optimisation problem |
Koby Crammer and Yoram Singer published their multiclass kernel-based vector machine formulation in the Journal of Machine Learning Research in 2001. Unlike earlier multiclass support vector machines, which decomposed the problem into independent binary tasks, their algorithm casts the entire K-class problem as a single quadratic program. Solving it through the dual yields a fixed-point iteration that handles many classes efficiently.
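Concretely, the Crammer-Singer loss for a single example penalises the gap between the true class score and the strongest rival. A minimal numpy sketch:

```python
import numpy as np

def crammer_singer_hinge(scores, y):
    # scores: length-K array of class scores s_k(x); y: true class index.
    # Loss is zero only when the true class beats every rival by a margin of 1.
    rivals = np.delete(scores, y)              # scores of the K - 1 other classes
    return max(0.0, 1.0 + rivals.max() - scores[y])

print(crammer_singer_hinge(np.array([2.0, 0.5, 1.8]), y=0))   # margin 0.2 -> loss 0.8
```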
Softmax regression, also called multinomial regression or multinomial logistic regression, is the canonical native multiclass model. For each class k it learns a weight vector w_k and bias b_k, then defines:
P(y = k | x) = exp(w_k . x + b_k) / sum_j exp(w_j . x + b_j)
The denominator normalises the scores into a valid probability distribution. The model is trained by maximising the log-likelihood of the observed labels, which is equivalent to minimising the categorical cross-entropy loss:
L = -sum_i sum_k 1{y_i = k} log P(y = k | x_i)
When K = 2, softmax regression reduces to ordinary logistic regression. The same softmax + cross-entropy combination is the default output layer for almost every modern deep classifier, from MNIST digit recognisers to large language models.
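The following self-contained numpy sketch implements the forward pass and loss above; the tiny random dataset is just for illustration:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)      # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(W, b, X, y):
    # W: (d, K) weights, b: (K,) biases, X: (n, d), y: (n,) labels in {0..K-1}
    probs = softmax(X @ W + b)                 # (n, K) class posteriors
    return -np.log(probs[np.arange(len(y)), y]).mean()

rng = np.random.default_rng(0)
X, y = rng.normal(size=(5, 4)), np.array([0, 2, 1, 2, 0])
W, b = np.zeros((4, 3)), np.zeros(3)
print(cross_entropy(W, b, X, y))               # ln(3) ≈ 1.0986 for a uniform model
```

Training is then gradient descent on this loss; conveniently, the gradient with respect to the logits is simply the predicted probabilities minus the one-hot targets.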
The choice of loss controls what the model optimises and how it handles imbalance, noise, and overconfidence.
| Loss | Description | When to use |
|---|---|---|
| Categorical cross-entropy | Negative log-likelihood with one-hot targets | Default for softmax classifiers |
| Sparse categorical cross-entropy | Same loss, integer labels instead of one-hot | Memory-efficient when K is large |
| Multiclass hinge (Crammer-Singer) | Margin-based loss for multiclass SVMs | Linear or kernel SVM training |
| Label-smoothed cross-entropy | Targets become (1 - epsilon) for the true class and epsilon / (K - 1) elsewhere | Reduces overconfidence, improves calibration |
| Focal loss | Down-weights easy examples by a (1 - p)^gamma factor | Highly imbalanced datasets |
| KL divergence | Matches a soft target distribution | Knowledge distillation, ensembling |
Label smoothing was introduced by Christian Szegedy and colleagues in the 2016 Inception-v3 paper "Rethinking the Inception Architecture for Computer Vision." The technique replaces the one-hot target with a smoothed distribution, which discourages the network from producing extremely confident logits. It tends to improve generalisation and calibration on large vision benchmarks.
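A sketch of the smoothed-target construction from the table above (epsilon = 0.1 is an illustrative value; note that conventions differ, and some implementations spread epsilon over all K classes including the true one):

```python
import numpy as np

def smooth_targets(y, K, epsilon=0.1):
    # 1 - epsilon on the true class, epsilon / (K - 1) everywhere else
    targets = np.full((len(y), K), epsilon / (K - 1))
    targets[np.arange(len(y)), y] = 1.0 - epsilon
    return targets

print(smooth_targets(np.array([2]), K=4))
# ≈ [[0.033, 0.033, 0.9, 0.033]] -- the row still sums to 1
```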
Focal loss was introduced by Tsung-Yi Lin and colleagues in 2017 in "Focal Loss for Dense Object Detection." Although the original target was object detection with a long-tail of background examples, the modulating factor (1 - p_t)^gamma works for any imbalanced multiclass problem by reducing the loss contribution of well-classified examples and concentrating gradient on hard ones.
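A minimal numpy sketch of the multiclass form (the probabilities are made up; gamma = 2 is a commonly used setting):

```python
import numpy as np

def focal_loss(probs, y, gamma=2.0):
    # probs: (n, K) predicted probabilities; y: (n,) integer labels
    p_t = probs[np.arange(len(y)), y]          # probability of the true class
    return (-(1.0 - p_t) ** gamma * np.log(p_t)).mean()

probs = np.array([[0.95, 0.03, 0.02],    # easy example: loss is nearly zero
                  [0.40, 0.35, 0.25]])   # hard example: dominates the batch
print(focal_loss(probs, np.array([0, 0])))
```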
No single number captures multiclass performance, especially when classes are imbalanced. Practitioners typically report several complementary metrics.
| Metric | Definition | Range | Notes |
|---|---|---|---|
| Accuracy | Fraction of correctly predicted labels | [0, 1] | Misleading when classes are imbalanced |
| Top-k accuracy | Fraction of cases where the true label is among the k highest-scoring predictions | [0, 1] | Top-5 accuracy is the standard ImageNet metric |
| Confusion matrix | K x K table of true vs predicted labels | Counts | Reveals which classes are confused with each other |
| Per-class precision | TP / (TP + FP) for one class | [0, 1] | Reported alongside recall and F1 |
| Per-class recall | TP / (TP + FN) for one class | [0, 1] | Sensitivity for that class |
| Per-class F1 score | Harmonic mean of precision and recall | [0, 1] | Useful when both errors matter |
| Macro-averaged F1 | Unweighted mean of per-class F1 | [0, 1] | Treats every class equally regardless of size |
| Micro-averaged F1 | Computed from total TP, FP, FN across classes | [0, 1] | Equals accuracy when each instance has one label |
| Weighted-averaged F1 | Mean of per-class F1 weighted by class support | [0, 1] | Useful for imbalanced data |
| Cohen's kappa | (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is chance agreement | [-1, 1] | Adjusts accuracy for the agreement expected by chance |
| Multiclass AUC (OvR or OvO) | Average of pairwise or one-vs-rest binary AUCs | [0, 1] | scikit-learn supports both averaging schemes |
| Matthews correlation coefficient | Generalisation of the two-class MCC (the R_K statistic) to K classes | [-1, 1] | Robust under imbalance |
Macro-averaging treats every class as equally important, which matters when the rare classes are the ones you care about. Micro-averaging weights by instance count, so the dominant class drives the score. Weighted averaging is a compromise that uses per-class support as weights. The convention is documented in the scikit-learn classification report and is widely reused in research papers.
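The difference is easy to see on a small imbalanced example (toy labels invented for illustration):

```python
from sklearn.metrics import f1_score

# Class 0 dominates; rare class 2 is never predicted correctly
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 0, 0, 0, 0, 1, 1, 0]

print(f1_score(y_true, y_pred, average='micro'))     # ≈ 0.889, equals accuracy
print(f1_score(y_true, y_pred, average='macro'))     # ≈ 0.641, hurt by class 2
print(f1_score(y_true, y_pred, average='weighted'))  # ≈ 0.838, support-weighted
```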
Multiclass datasets often have skewed label distributions. ImageNet-21K's tail contains classes with only a few hundred images, and many real text classification problems have one dominant category and a long list of rare ones. Common remedies include oversampling the rare classes or undersampling the dominant ones, weighting the loss by inverse class frequency, switching to focal loss, and augmenting or synthesising extra examples for the tail.
For more on these techniques and their trade-offs, see class imbalance.
Progress in multinomial classification has been driven by a handful of canonical datasets.
| Benchmark | Classes (K) | Domain | Typical state-of-the-art metric |
|---|---|---|---|
| MNIST | 10 | Handwritten digits | Test error below 0.2 percent |
| Fashion-MNIST | 10 | Clothing item images | Around 96 percent accuracy |
| CIFAR-10 | 10 | Tiny natural images | Above 99 percent on best models |
| CIFAR-100 | 100 | Tiny natural images, fine-grained | Around 96 percent on best models |
| ImageNet (ILSVRC-1K) | 1,000 | Natural images | Top-5 error below 2 percent |
| ImageNet-21K | about 21,000 | Natural images | Used mainly for pretraining |
| LSHTC | thousands to millions | Web text taxonomies | Macro-F1 evaluation |
| iNaturalist | over 8,000 | Plant and animal species | Top-1 accuracy |
| GLUE / SuperGLUE | mixed binary and multiclass | NLP understanding | Per-task accuracy or F1 |
The ImageNet Large Scale Visual Recognition Challenge popularised top-1 and top-5 accuracy as the headline scores. Top-5 accuracy counts a prediction as correct if the true label appears in the model's five highest-scoring guesses, which makes sense for a 1,000-class problem where neighbouring categories (such as different dog breeds) are visually almost identical. AlexNet hit a top-5 error of 15.3 percent in 2012, ResNet drove it to 3.57 percent in 2015, and Squeeze-and-Excitation Networks reached 2.25 percent in 2017. The challenge ended after 2017, with organisers noting that the headline benchmark had been effectively solved.
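Top-k accuracy takes only a few lines to compute by hand (scikit-learn also provides top_k_accuracy_score with the same semantics); the scores below are random stand-ins:

```python
import numpy as np

def top_k_accuracy(scores, y, k=5):
    # scores: (n, K) class scores; y: (n,) integer labels
    top_k = np.argsort(scores, axis=1)[:, -k:]   # indices of the k best scores
    return float(np.mean([y[i] in top_k[i] for i in range(len(y))]))

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 10))                # 4 examples, K = 10 classes
y = np.array([3, 7, 0, 9])
print(top_k_accuracy(scores, y, k=5))
```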
Most machine learning libraries handle K-class problems with little extra code.
| Library | Default behaviour | Multiclass notes |
|---|---|---|
| scikit-learn | All classifiers handle multiclass natively | LogisticRegression(multi_class='multinomial') for softmax, OneVsRestClassifier and OneVsOneClassifier for reductions |
| PyTorch | nn.CrossEntropyLoss combines log-softmax and NLL | Targets are integer class indices |
| TensorFlow / Keras | SparseCategoricalCrossentropy for integer labels, CategoricalCrossentropy for one-hot | Output layer uses softmax activation |
| XGBoost | objective='multi:softmax' returns labels, multi:softprob returns class probabilities | Set num_class parameter |
| LightGBM | objective='multiclass' | num_class parameter required |
| CatBoost | MultiClass loss function | Native support, no one-hot needed |
A minimal scikit-learn example using the iris dataset, which has three classes, looks like this:
```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

X, y = load_iris(return_X_y=True)

# stratify=y keeps the three class proportions equal across both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

clf = LogisticRegression(multi_class='multinomial', max_iter=1000)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(confusion_matrix(y_test, y_pred))                  # 3 x 3 table
print(classification_report(y_test, y_pred, digits=3))   # per-class metrics
```
The classification_report prints per-class precision, recall, F1, and support, plus macro and weighted averages, which together give a much fuller picture than accuracy alone.
Multinomial classification underpins much of modern deep learning, often in places that look at first like very different tasks.
Next-token prediction in large language models is a multinomial classification problem over the model's vocabulary, which typically has tens or hundreds of thousands of tokens (50,257 for GPT-2, 100,256 for the cl100k_base tokeniser, around 200,000 for OpenAI's o200k_base, 128,000-plus for some Llama variants). At every position, the model produces a vector of logits, applies softmax, and is trained with cross-entropy against the next true token. The same loss that powers iris classification scales up to train trillion-parameter models on web-scale text.
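A shape-level PyTorch sketch of that training step, with random logits and tokens standing in for a real transformer's output:

```python
import torch
import torch.nn.functional as F

batch, seq_len, vocab = 2, 8, 50257              # GPT-2-sized vocabulary
logits = torch.randn(batch, seq_len, vocab)      # stand-in for model output
next_tokens = torch.randint(0, vocab, (batch, seq_len))

# cross_entropy expects (N, K) scores and (N,) integer class indices, so
# flatten positions: each position is one 50,257-way classification problem
loss = F.cross_entropy(logits.view(-1, vocab), next_tokens.view(-1))
print(loss.item())                               # near ln(50257) ≈ 10.8 here
```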
Zero-shot image classification with CLIP is another reformulation. CLIP encodes the image and a list of candidate class names (often phrased as "a photo of a {label}") into a shared embedding space, then picks the class with the highest cosine similarity. This effectively turns any vocabulary into a multiclass classifier without retraining, which is one reason CLIP-style models opened up flexible recognition systems.
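A schematic numpy sketch of the similarity-then-softmax step, with random vectors standing in for the CLIP image and text encoders (the temperature value is illustrative):

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, temperature=0.01):
    # image_emb: (d,) image embedding; text_embs: (K, d) embeddings of the
    # prompts "a photo of a {label}". Real embeddings would come from CLIP.
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = (text_embs @ image_emb) / temperature   # scaled cosine similarities
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

rng = np.random.default_rng(0)
print(zero_shot_classify(rng.normal(size=512), rng.normal(size=(3, 512))))
```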
Retrieval-augmented systems and recommender systems often dress up as ranking or retrieval problems, but at the prediction layer they are usually multiclass softmaxes over a discrete catalogue of items.
The softmax + cross-entropy recipe is robust, but it has known weaknesses: it tends to produce overconfident, poorly calibrated probabilities (one motivation for label smoothing and for post-hoc calibration), it is sensitive to label noise because a confidently wrong target is penalised without bound, and the normalising sum over K classes becomes expensive when K reaches the hundreds of thousands, which pushes extreme-classification and language-model systems towards approximations such as sampled or hierarchical softmax.
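Temperature scaling, one of the simpler post-hoc calibration fixes, divides the logits by a scalar T fitted on held-out data; a minimal sketch with made-up logits:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# T > 1 softens overconfident predictions without changing the argmax,
# so accuracy is untouched while calibration improves
logits = np.array([8.0, 2.0, 1.0])
print(softmax(logits))          # ≈ [0.997, 0.002, 0.001]: overconfident
print(softmax(logits / 2.5))    # ≈ [0.868, 0.079, 0.053]: same top class
```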
Imagine a basket of fruits with apples, bananas, and oranges. Sorting each piece into the right pile is multinomial classification. The computer looks at lots of labelled examples until it learns what each fruit looks like, then it can pick the right pile for a new piece it has never seen before. The only difference from a yes-or-no question is that there are more than two piles to choose from.