See also: Machine learning terms, Logistic regression, Classification
Multi-class logistic regression, also known as multinomial logistic regression, softmax regression, or the maximum entropy (MaxEnt) classifier, is a supervised learning algorithm that generalizes logistic regression to handle classification problems with more than two possible output categories. While binary logistic regression models a single outcome probability using the sigmoid function, multi-class logistic regression uses the softmax function to produce a probability distribution over K mutually exclusive classes.
The algorithm takes an input feature vector, computes a linear score for each class, and then normalizes these scores through the softmax function so that all output probabilities sum to one. The predicted class is the one with the highest probability. Training is performed by minimizing the cross-entropy loss function, typically through iterative optimization methods such as gradient descent.
Multi-class logistic regression is widely used in natural language processing, image classification, medical diagnosis, and many other domains. It serves as the final classification layer in most modern neural networks and remains one of the most commonly applied multi-class classifiers in both statistics and machine learning.
The development of multi-class logistic regression grew out of foundational work on binary logistic regression. Joseph Berkson introduced the term "logit" in 1944 and developed the logistic regression model as an alternative to probit analysis for modeling binary outcomes. In 1958, David Cox substantially developed and popularized the model, generalizing it to handle multiple explanatory variables.
Cox extended logistic regression to multinomial outcomes in 1966, allowing the model to handle dependent variables with more than two categories. This broadened the scope and popularity of logistic models considerably. In 1973, Daniel McFadden connected the multinomial logit model to discrete choice theory in economics, showing that the model arises naturally when individuals choose among alternatives to maximize random utility. McFadden received the Nobel Prize in Economics in 2000 for this contribution.
Separately, in the field of statistical mechanics and information theory, the principle of maximum entropy was developed by Edwin Jaynes in 1957. Researchers later recognized that multinomial logistic regression is mathematically equivalent to the maximum entropy classifier, a connection that became especially important in natural language processing during the 1990s and 2000s. The equivalence between these two perspectives, one rooted in statistics and the other in information theory, provided a strong theoretical foundation for the model.
Imagine you have a bag of colored candies: red, blue, green, and yellow. You want to build a machine that can guess the color of a candy just by weighing it and measuring its size.
The machine looks at the weight and size, and for each color, it calculates a "score" that represents how much the candy looks like that color. A heavier candy might get a high score for "red" and a low score for "yellow."
But scores by themselves are hard to compare. So the machine uses a special trick called softmax to turn all the scores into percentages that add up to 100%. If red gets 70%, blue gets 15%, green gets 10%, and yellow gets 5%, the machine guesses "red" because it has the highest percentage.
To get better at guessing, the machine practices on lots of candies where you already know the color. Each time it guesses wrong, it adjusts its internal settings a little bit so the correct color gets a higher percentage next time. After practicing on hundreds of candies, the machine learns what weight-and-size combinations go with each color.
That is multi-class logistic regression: calculate scores for each option, convert them to percentages, and pick the highest one.
Consider a classification problem with K classes (indexed k = 1, 2, ..., K) and a training set of m examples. Each training example consists of a feature vector x (of dimension n) and a true class label y, where y takes values in {1, 2, ..., K}. The model maintains a weight matrix W of dimensions n x K and a bias vector b of dimension K.
For each class k, the model computes a raw score (also called a logit) as a linear function of the input features:
z_k = w_k^T * x + b_k
where w_k is the k-th column of the weight matrix W and b_k is the k-th element of the bias vector. The vector z = (z_1, z_2, ..., z_K) contains the raw scores for all K classes.
The softmax function transforms the raw score vector z into a valid probability distribution over the K classes:
P(y = k | x) = exp(z_k) / sum_{j=1}^{K} exp(z_j)
This function has several properties:

- Each output lies strictly between 0 and 1, and the outputs sum to exactly 1, so they form a valid probability distribution.
- It preserves ordering: the class with the largest raw score receives the largest probability.
- It is invariant to adding the same constant to every score, since the constant cancels in the ratio.
- It is smooth and differentiable everywhere, which makes gradient-based training possible.
In practice, a numerical stability trick is applied by subtracting the maximum score from all scores before exponentiation:
P(y = k | x) = exp(z_k - max(z)) / sum_{j=1}^{K} exp(z_j - max(z))
This prevents overflow errors without changing the result, since the subtraction cancels out in the ratio.
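As a concrete illustration, here is a minimal NumPy sketch of the numerically stable softmax; the function name and test values are illustrative only:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis of z."""
    # Subtracting the row-wise maximum prevents overflow in exp();
    # the shift cancels in the ratio, so the probabilities are unchanged.
    z_shifted = z - np.max(z, axis=-1, keepdims=True)
    exp_z = np.exp(z_shifted)
    return exp_z / np.sum(exp_z, axis=-1, keepdims=True)

scores = np.array([1000.0, 1001.0, 1002.0])  # naive exp() would overflow here
print(softmax(scores))                       # [0.09003057 0.24472847 0.66524096]
```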
When K = 2, the softmax function reduces to the sigmoid function used in binary classification. Setting one class as the reference (with its weight vector fixed to zero), the probability of the other class becomes:
P(y = 1 | x) = exp(w_1^T * x) / (exp(w_1^T * x) + exp(0)) = 1 / (1 + exp(-w_1^T * x))
This is exactly the sigmoid function, confirming that multi-class logistic regression is a strict generalization of binary logistic regression.
The softmax model has a redundancy in its parameters: adding any constant vector to all weight vectors produces the same probabilities. This means only K - 1 weight vectors are independently identifiable. A common convention in statistics is to fix one class (typically the last) as a reference category by setting its weight vector to zero. The remaining K - 1 weight vectors then encode log-odds relative to the reference class:
log(P(y = k | x) / P(y = K | x)) = w_k^T * x + b_k
In machine learning practice, all K weight vectors are typically kept and the redundancy is resolved implicitly through regularization.
The model is trained by minimizing the cross-entropy loss (also called log loss or negative log-likelihood). For a single training example with true class label y, the loss is:
L = -log(P(y | x)) = -log(exp(z_y) / sum_{j=1}^{K} exp(z_j))
Using one-hot encoding to represent the true label as a vector t (where t_k = 1 if k = y and t_k = 0 otherwise), this can be written as:
L = -sum_{k=1}^{K} t_k * log(P(y = k | x))
Since only the term where t_k = 1 is nonzero, only the predicted probability of the true class contributes to the loss. The total cost over a dataset of m training examples is:
J(W, b) = -(1/m) * sum_{i=1}^{m} sum_{k=1}^{K} t_k^(i) * log(P(y = k | x^(i)))
This cost function is convex, which guarantees that any local minimum is also a global minimum.
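For illustration, the cost can be computed in a few lines of NumPy; this is a sketch assuming probabilities produced by a softmax such as the one above and integer labels in {0, ..., K-1}:

```python
import numpy as np

def cross_entropy(probs, y):
    """Average negative log-probability assigned to the true classes.

    probs: (m, K) predicted probabilities, rows summing to 1
    y:     (m,) integer class labels
    """
    m = probs.shape[0]
    # Only the probability of the true class contributes to the loss,
    # so fancy indexing replaces the explicit one-hot sum.
    return -np.mean(np.log(probs[np.arange(m), y]))
```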
Minimizing the cross-entropy loss is equivalent to maximizing the likelihood of the observed data under the model. The likelihood function for m independent and identically distributed training examples is:
L(W, b) = product_{i=1}^{m} P(y^(i) | x^(i); W, b)
Taking the negative logarithm of the likelihood yields the cross-entropy cost function. This connection to maximum likelihood estimation provides a principled statistical justification for the training procedure.
The gradient of the cross-entropy loss with respect to the weight vector w_k for class k is:
dJ/dw_k = (1/m) * sum_{i=1}^{m} (P(y = k | x^(i)) - t_k^(i)) * x^(i)
This gradient has an elegant form: for each training example, it is proportional to the difference between the predicted probability and the true label (0 or 1), multiplied by the input feature vector. The gradient for the bias term b_k is:
dJ/db_k = (1/m) * sum_{i=1}^{m} (P(y = k | x^(i)) - t_k^(i))
These gradients are used to update the parameters iteratively during training.
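The vectorized form of these gradients is compact. The following sketch (the helper name is illustrative) computes both gradients for a batch of examples at once:

```python
import numpy as np

def gradients(X, y, probs, K):
    """Gradients of the cross-entropy cost for softmax regression.

    X:     (m, n) feature matrix
    y:     (m,) integer class labels
    probs: (m, K) predicted probabilities
    Returns dW with shape (n, K) and db with shape (K,).
    """
    m = X.shape[0]
    T = np.eye(K)[y]          # one-hot encoding of the labels, shape (m, K)
    error = probs - T         # predicted probability minus true label
    dW = X.T @ error / m      # matches dJ/dw_k above, for all k at once
    db = error.mean(axis=0)
    return dW, db
```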
The partial derivative of the softmax output with respect to its input scores forms a Jacobian matrix. For outputs S_i = softmax(z)_i, the derivative is S_i * (1 - S_i) when i = j and -S_i * S_j when i != j. This can be written compactly as dS_i / dz_j = S_i * (delta_ij - S_j), where delta_ij is the Kronecker delta. When combined with the cross-entropy loss, the overall gradient simplifies to the difference between predicted probabilities and true labels, which is the same simple form used in binary logistic regression.
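The compact formula is easy to verify numerically. This small check (assuming the softmax function sketched earlier) compares the analytic Jacobian against central finite differences:

```python
import numpy as np

z = np.array([0.5, -1.0, 2.0])
S = softmax(z)

# Analytic Jacobian: dS_i/dz_j = S_i * (delta_ij - S_j)
J_analytic = np.diag(S) - np.outer(S, S)

# Central finite-difference approximation for comparison
eps = 1e-6
J_numeric = np.zeros((3, 3))
for j in range(3):
    dz = np.zeros(3)
    dz[j] = eps
    J_numeric[:, j] = (softmax(z + dz) - softmax(z - dz)) / (2 * eps)

print(np.allclose(J_analytic, J_numeric, atol=1e-8))  # True
```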
Several optimization algorithms can be used to minimize the cross-entropy cost function and train the model.
Gradient descent computes the gradient over the entire training set and updates all parameters simultaneously:
w_k := w_k - alpha * dJ/dw_k
where alpha is the learning rate. This method produces stable updates but can be slow for large datasets because every update requires a full pass through the data.
Stochastic gradient descent updates parameters using a single randomly selected training example at each step. This introduces noise into the gradient estimate but allows for much faster progress, particularly on large datasets. The noisy updates can also help escape shallow local minima in non-convex variants of the problem (such as when softmax regression is used as part of a deep learning model).
Mini-batch gradient descent is a compromise between batch and stochastic methods. It computes the gradient on a small random subset (mini-batch) of the training data at each step, typically containing 32 to 256 examples. This approach benefits from vectorized computation on modern hardware while still providing the convergence benefits of stochastic updates.
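The pieces above combine into a complete mini-batch training loop. This is a sketch, assuming the softmax and gradients helpers from the earlier sections; the hyperparameter values are arbitrary:

```python
import numpy as np

def train(X, y, K, alpha=0.1, batch_size=64, epochs=50, seed=0):
    """Mini-batch SGD for softmax regression (sketch, helpers defined above)."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = np.zeros((n, K))
    b = np.zeros(K)
    for _ in range(epochs):
        order = rng.permutation(m)                 # reshuffle every epoch
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]
            probs = softmax(X[idx] @ W + b)
            dW, db = gradients(X[idx], y[idx], probs, K)
            W -= alpha * dW                        # gradient descent update
            b -= alpha * db
    return W, b
```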
Modern implementations often use adaptive learning rate methods such as the Adam optimizer, AdaGrad, or RMSProp. These algorithms adjust the learning rate for each parameter individually based on the history of past gradients. For classical statistical applications, quasi-Newton methods such as L-BFGS are commonly used because they converge in fewer iterations by approximating second-order curvature information.
| Optimizer | Update rule | Advantages | Disadvantages |
|---|---|---|---|
| Batch gradient descent | Uses full dataset gradient | Stable convergence, deterministic | Slow on large datasets |
| SGD | Uses single-example gradient | Fast updates, good generalization | Noisy, sensitive to learning rate |
| Mini-batch SGD | Uses small-batch gradient | Balances speed and stability | Requires batch size tuning |
| L-BFGS | Quasi-Newton with curvature approximation | Fast convergence, no learning rate needed | Needs full-dataset gradients; history vectors add memory cost |
| Adam | Adaptive per-parameter learning rates | Works well out of the box | May not generalize as well as SGD |
Overfitting occurs when the model fits the training data too closely and fails to generalize to unseen examples. Regularization techniques address this by adding a penalty term to the loss function.
L2 regularization adds the squared magnitude of the weight vectors to the cost function:
J_reg(W, b) = J(W, b) + (lambda / 2) * sum_{k=1}^{K} ||w_k||^2
where lambda is the regularization strength hyperparameter. L2 regularization shrinks all weights toward zero but rarely sets any weight to exactly zero. It produces smoother decision boundaries and is equivalent to placing a zero-mean Gaussian prior on the weights in the Bayesian interpretation.
L1 regularization adds the sum of absolute values of the weights:
J_reg(W, b) = J(W, b) + lambda * sum_{k=1}^{K} ||w_k||_1
L1 regularization tends to drive some weights to exactly zero, effectively performing feature selection. This produces sparse models that use only a subset of the input features. L1 regularization is especially useful when the input has many features and only a few are expected to be relevant.
Elastic net combines L1 and L2 regularization:
J_reg(W, b) = J(W, b) + lambda_1 * sum ||w_k||_1 + (lambda_2 / 2) * sum ||w_k||^2
This method can produce sparse solutions (from the L1 term) while still maintaining the grouping effect of L2 regularization when features are correlated.
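In gradient-based training, a penalty simply adds its own term to the cost and its own gradient to dW. The sketch below (illustrative helper names; biases are conventionally left unpenalized) shows the L2 and elastic net penalties:

```python
import numpy as np

def l2_penalty(W, lam):
    """L2 penalty and its gradient with respect to the weights."""
    return 0.5 * lam * np.sum(W ** 2), lam * W

def elastic_net_penalty(W, lam1, lam2):
    """Elastic net penalty; the L1 part uses the subgradient sign(W)."""
    penalty = lam1 * np.sum(np.abs(W)) + 0.5 * lam2 * np.sum(W ** 2)
    grad = lam1 * np.sign(W) + lam2 * W
    return penalty, grad
```

In practice, the non-differentiability of the L1 term at zero is usually handled with proximal or coordinate descent methods rather than the plain subgradient shown here.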
| Regularization method | Penalty term | Effect on weights | Sparsity | Use case |
|---|---|---|---|---|
| L2 (Ridge) | lambda/2 * sum of squared weights | Shrinks all weights toward zero | No | General-purpose, correlated features |
| L1 (Lasso) | lambda * sum of absolute weights | Drives some weights to zero | Yes | Feature selection, high-dimensional data |
| Elastic net | Combination of L1 and L2 | Balanced shrinkage and zeroing | Partial | Correlated features with sparsity |
There are multiple strategies for extending binary classifiers to multi-class problems. Multi-class logistic regression via the softmax function is one approach, but two other commonly used strategies are one-vs-rest and one-vs-one.
The softmax approach, described in the sections above, trains a single model with K output nodes. All classes are handled jointly in a single optimization problem, and the softmax function ensures that the predicted probabilities form a valid distribution. This approach is sometimes called the "native" or "direct" multi-class method.
Advantages of the softmax approach include producing well-calibrated probabilities that sum to one, requiring only a single model to train and maintain, and jointly optimizing all class boundaries. A limitation is that it assumes classes are mutually exclusive.
One-vs-rest (also called one-vs-all, OvA) trains K separate binary classifiers. Each classifier learns to distinguish one class from all other classes combined. At prediction time, the input is passed through all K classifiers, and the class whose classifier produces the highest confidence score is selected.
For a problem with K classes:

- K binary classifiers are trained, one per class.
- Classifier k treats examples of class k as positive and all other examples as negative.
- At prediction time, the class whose classifier outputs the highest confidence score is selected.
OvR is simple to implement and can use any binary classifier as a building block, including support vector machines and decision trees. However, it can suffer from class imbalance (each binary problem has one small positive class against a large negative class), and the scores from different classifiers are not directly comparable because they are trained independently.
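In scikit-learn, the strategy is available as a wrapper around any binary estimator. A brief sketch (the data variable names are assumed):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Train K binary classifiers, one per class, each against all the rest.
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X_train, y_train)
predictions = ovr.predict(X_test)  # class with the highest confidence wins
```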
One-vs-one trains a binary classifier for every pair of classes. For K classes, this requires K * (K - 1) / 2 classifiers. At prediction time, each classifier votes for one of its two classes, and the class with the most votes is selected.
For a problem with K classes:

- K * (K - 1) / 2 binary classifiers are trained, one for each pair of classes.
- Each classifier is trained only on the examples belonging to its two classes.
- At prediction time, each classifier casts a vote, and the class with the most votes is selected.
OvO is often used with kernel methods like SVMs because training time for kernel methods scales super-linearly with dataset size, and each OvO classifier trains on a smaller subset of the data. The disadvantage is that the number of classifiers grows quadratically with the number of classes.
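scikit-learn provides an analogous wrapper for this strategy; the example below pairs it with a kernel SVM, the setting where OvO is most common (data variable names assumed):

```python
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC

# Train K*(K-1)/2 pairwise classifiers, each on only two classes' data.
ovo = OneVsOneClassifier(SVC(kernel='rbf'))
ovo.fit(X_train, y_train)
predictions = ovo.predict(X_test)  # majority vote across pairwise classifiers
```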
| Strategy | Number of models | Probability output | Handles non-exclusive classes | Training complexity |
|---|---|---|---|---|
| Softmax (multinomial) | 1 | Yes, well-calibrated | No | Single optimization |
| One-vs-rest (OvR) | K | No (requires calibration) | Yes | K independent problems |
| One-vs-one (OvO) | K(K-1)/2 | No (requires calibration) | Yes | K(K-1)/2 independent problems |
The softmax approach is generally preferred when classes are mutually exclusive and when calibrated probability estimates are needed. OvR is more flexible and works with any binary classifier. OvO is mainly used with SVMs or other classifiers that benefit from training on smaller subsets.
Multi-class logistic regression is mathematically equivalent to the maximum entropy classifier. This equivalence connects two different theoretical traditions: one from statistics (logistic regression) and one from information theory (maximum entropy).
The principle of maximum entropy, introduced by Edwin Jaynes in 1957, states that when choosing a probability distribution subject to known constraints, one should select the distribution with the greatest entropy (i.e., the distribution that makes the fewest additional assumptions beyond the constraints). Entropy measures the uncertainty in a distribution and is defined as:
H(p) = -sum_k p(k) * log(p(k))
In the maximum entropy framework, we want to find a conditional distribution P(y | x) that:

- is a valid probability distribution (non-negative and summing to one over the classes),
- matches the training data, in the sense that the expected value of each feature under the model equals its empirical average in the training set, and
- among all distributions satisfying these constraints, has the maximum entropy.
Solving this constrained optimization problem using Lagrange multipliers yields the log-linear form:
P(y = k | x) = (1 / Z(x)) * exp(w_k^T * x)
where Z(x) = sum_{j=1}^{K} exp(w_j^T * x) is the partition function (normalization constant). This is exactly the softmax regression model. The Lagrange multipliers become the weight parameters of the model.
The maximum entropy perspective provides an information-theoretic justification for the model: among all models consistent with the observed data, softmax regression is the one that introduces the least amount of unwarranted bias. Additionally, maximizing conditional log-likelihood (the standard training objective for logistic regression) is equivalent to minimizing the Kullback-Leibler divergence between the empirical distribution and the model distribution, which is itself a dual formulation of entropy maximization.
This equivalence has been particularly influential in NLP, where maximum entropy classifiers were widely used for tasks such as part-of-speech tagging, named entity recognition, and text classification before the rise of deep learning methods.
Multi-class logistic regression relies on several assumptions:

- The log-odds of each class (relative to a reference class) are a linear function of the input features.
- The observations are independent of one another.
- The classes are mutually exclusive and collectively exhaustive.
- Independence of irrelevant alternatives (IIA): the relative odds between any two classes do not change when other alternatives are added or removed.
- The input features are not perfectly multicollinear.
The performance of a multi-class logistic regression model is typically assessed using the following metrics.
A confusion matrix is a K x K table that compares predicted class labels against true class labels. Each row represents the true class, each column represents the predicted class, and each cell contains the count of examples with that combination. Diagonal cells represent correct predictions.
| Metric | Formula | Description |
|---|---|---|
| Accuracy | Correct predictions / Total predictions | Overall fraction of correct predictions |
| Precision (per class k) | TP_k / (TP_k + FP_k) | Fraction of predictions for class k that are correct |
| Recall (per class k) | TP_k / (TP_k + FN_k) | Fraction of true class k examples that are correctly identified |
| F1 score (per class k) | 2 * Precision_k * Recall_k / (Precision_k + Recall_k) | Harmonic mean of precision and recall |
| Macro-average | Mean of per-class metric | Treats all classes equally |
| Weighted average | Weighted mean by class frequency | Accounts for class imbalance |
| Log loss | -(1/m) * sum of log(P(true class)) | Directly evaluates predicted probabilities |
Log loss (cross-entropy loss) is often preferred for evaluating multi-class logistic regression because it assesses the quality of the predicted probability distribution, not just the predicted label.
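All of these metrics are available in scikit-learn. A brief sketch, assuming true labels, predicted labels, and predicted probabilities from a fitted model such as the one in the implementation section below:

```python
from sklearn.metrics import confusion_matrix, classification_report, log_loss

print(confusion_matrix(y_test, predictions))        # K x K count table
print(classification_report(y_test, predictions))   # per-class precision/recall/F1
print(log_loss(y_test, probabilities))              # scores the full distribution
```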
Multi-class logistic regression is implemented in all major machine learning frameworks.
In scikit-learn, the LogisticRegression class supports multinomial logistic regression. Setting multi_class='multinomial' uses the softmax formulation, while multi_class='ovr' uses the one-vs-rest strategy. Solvers that support the multinomial loss include lbfgs, newton-cg, sag, and saga; the liblinear solver only supports one-vs-rest. In recent versions, the multinomial formulation is selected by default when the problem has more than two classes and the solver supports it.
```python
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(multi_class='multinomial', solver='lbfgs', max_iter=1000)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
probabilities = model.predict_proba(X_test)
```
In PyTorch, multinomial logistic regression is typically implemented as a single linear layer followed by the cross-entropy loss, which internally applies the softmax function:
```python
import torch
import torch.nn as nn

# A single linear layer maps n_features inputs to n_classes raw scores (logits).
model = nn.Linear(n_features, n_classes)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Training loop (n_features, n_classes, X_train, y_train, num_epochs assumed defined)
for epoch in range(num_epochs):
    outputs = model(X_train)            # raw logits, shape (m, n_classes)
    loss = criterion(outputs, y_train)  # y_train holds integer class labels
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```
Note that nn.CrossEntropyLoss in PyTorch expects raw logits (pre-softmax scores), not probabilities. It applies log-softmax internally for numerical stability.
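At inference time, the softmax therefore has to be applied explicitly if probabilities are needed; a short sketch (X_test assumed):

```python
# The model outputs raw logits, so apply softmax manually for probabilities.
with torch.no_grad():
    probabilities = torch.softmax(model(X_test), dim=1)
    predictions = probabilities.argmax(dim=1)
```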
In TensorFlow, a simple multinomial logistic regression model can be built with Keras:
```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(n_classes, activation='softmax', input_shape=(n_features,))
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train_onehot, epochs=100, batch_size=32)
```
Alternatively, using sparse_categorical_crossentropy avoids the need to one-hot encode the labels.
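For example (same model as above, but with integer labels rather than one-hot vectors):

```python
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(X_train, y_train, epochs=100, batch_size=32)
```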
In R, the multinom() function from the nnet package fits multinomial logistic regression models:
```r
library(nnet)

model <- multinom(y ~ x1 + x2 + x3, data = training_data)
summary(model)
predicted <- predict(model, newdata = test_data, type = "class")
```
The glmnet package provides regularized multinomial logistic regression with L1, L2, or elastic net penalties.
Multi-class logistic regression is applied across many fields.
In NLP, the model is used for text classification, sentiment analysis, part-of-speech tagging, and named entity recognition. The maximum entropy classifier (equivalent to multinomial logistic regression) was a standard tool for these tasks before deep learning became dominant. Unlike Naive Bayes, multinomial logistic regression does not assume that features are conditionally independent, which often leads to better performance on text data with correlated features.
Softmax regression is the standard output layer for image classification in neural networks. Even in complex deep learning architectures, the final layer typically applies a linear transformation followed by softmax to produce class probabilities. The MNIST handwritten digit dataset (10 classes) is a classic benchmark where softmax regression serves as a baseline classifier.
In healthcare, multinomial logistic regression is used to predict diagnostic categories based on patient features such as lab results, symptoms, and demographic variables. Its ability to produce calibrated probabilities is valuable in medical decision-making, where clinicians need to assess the relative likelihood of different diagnoses.
The multinomial logit model is widely used in economics for discrete choice analysis, such as predicting consumer preferences among products, transportation mode choice, or occupational selection. The model's connection to random utility theory provides a structural interpretation of the coefficients.
| Domain | Example application |
|---|---|
| Biology | Species classification from morphological measurements |
| Finance | Credit rating prediction (AAA, AA, A, BBB, etc.) |
| Marketing | Customer segmentation into behavioral categories |
| Geology | Rock type classification from mineral composition |
| Ecology | Habitat type prediction from environmental variables |
Multi-class logistic regression is connected to several other models in machine learning and statistics.
| Model | Relationship |
|---|---|
| Binary logistic regression | Multi-class logistic regression reduces to binary logistic regression when K = 2 |
| Neural network with softmax output | A neural network with no hidden layers and a softmax output is exactly multi-class logistic regression |
| Naive Bayes classifier | Both are linear classifiers; Naive Bayes makes stronger independence assumptions but is faster to train |
| Support vector machine | SVMs find maximum-margin decision boundaries; multi-class SVMs typically use OvR or OvO strategies |
| Multinomial probit | Similar purpose but uses a different link function (probit) and does not assume IIA |
| Ordinal logistic regression | Used when categories have a natural ordering; multi-class logistic regression treats categories as unordered |
| Conditional random field (CRF) | CRFs generalize logistic regression to structured prediction over sequences |
| Perceptron | A single-layer network with softmax output trained on cross-entropy loss is equivalent to multi-class logistic regression; the classic perceptron differs in its update rule |