Multinomial regression is a statistical model that predicts which one of K possible categories an observation belongs to, given a vector of input features. It generalises logistic regression, which handles two classes, to the case where K is greater than two. The model assigns each class its own linear score, then turns those scores into a probability distribution using the softmax function. The class with the highest probability is the prediction.
The model goes by several names, all of which describe the same algorithm: multinomial logistic regression, softmax regression, polytomous logistic regression, and the maximum entropy classifier (often shortened to MaxEnt) in the natural language processing literature. Discrete-choice economists call the same model the multinomial logit, and a closely related variant is McFadden's conditional logit. All of these names refer to a generalized linear model with a multinomial response distribution and a softmax link function.
Multinomial regression sits at the boundary between classical statistics and modern machine learning. Statisticians use it for inference: estimating odds ratios, computing confidence intervals, and testing whether a predictor matters. Machine learning practitioners use it as a strong baseline for multiclass classification and as the final layer of nearly every deep neural network that produces a categorical output, including the next-token head of large language models.
Let x in R^d be a feature vector and let y be a categorical label taking values in {1, 2, ..., K}. Multinomial regression models the conditional probability of each class as
P(y = k | x) = exp(w_k . x + b_k) / sum_{j=1}^{K} exp(w_j . x + b_j)
for k = 1, ..., K. Each class k has its own weight vector w_k in R^d and its own bias b_k in R. The denominator (the partition function) ensures the K probabilities sum to one. Equivalently, if z_k = w_k . x + b_k is the linear score (or logit) for class k, then the predicted probability is the softmax of the score vector z = (z_1, ..., z_K).
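As a concrete illustration of the formula above, here is a minimal NumPy sketch of the score-then-softmax computation; the feature values, weights, and biases are made up for the example.

```python
import numpy as np

def softmax(z):
    """Turn a vector of scores (logits) into probabilities that sum to 1."""
    z = z - z.max()               # shifting by a constant leaves the result unchanged; improves stability
    e = np.exp(z)
    return e / e.sum()

# Made-up example with d = 2 features and K = 3 classes
x = np.array([1.0, -0.5])                          # feature vector
W = np.array([[0.4, 1.2],                          # w_1
              [-0.3, 0.8],                         # w_2
              [0.1, -0.5]])                        # w_3
b = np.array([0.0, 0.2, -0.1])                     # biases b_1, b_2, b_3

z = W @ x + b                                      # linear scores z_k = w_k . x + b_k
p = softmax(z)
print(p, p.sum(), p.argmax())                      # class probabilities, their sum (1.0), predicted class
```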
The decision boundary between any two classes k and l is the set of inputs where w_k . x + b_k = w_l . x + b_l, which is a hyperplane. So multinomial regression carves the input space into K convex polytopes, one per class. The boundaries are linear in x, even though the probability surface is smooth.
The parameterisation above is overcomplete. Adding the same constant vector c to every weight vector (replacing each w_k with w_k + c) leaves the probabilities unchanged, because the c terms cancel inside the softmax. The same applies to the biases. With no further constraint there are infinitely many parameter settings that give identical predictions, which makes maximum-likelihood estimates undefined.
Two conventions resolve this:
| Convention | Description | Used by |
|---|---|---|
| Reference category | Fix w_K = 0 and b_K = 0 for one chosen class. Estimate K - 1 sets of coefficients. | Statsmodels, R nnet, classical statistics |
| Symmetric (full softmax) | Keep all K weight vectors but add a regulariser (typically L2) so the optimum is unique. | scikit-learn, PyTorch, TensorFlow |
The reference-category form has a clean interpretation: each w_k tells you how feature x shifts the log-odds of class k versus the reference class. The symmetric form is more convenient for software, especially when the model is a layer inside a neural network and weight decay is already in the loss.
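A quick numerical check of the identifiability issue described above, again with made-up numbers: adding the same vector to every weight vector (and the same constant to every bias) leaves the predicted probabilities unchanged.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))              # 3 classes, 4 features (arbitrary values)
b = rng.normal(size=3)
x = rng.normal(size=4)
c = np.array([5.0, -2.0, 0.7, 1.3])      # the same shift added to every weight vector

p_original = softmax(W @ x + b)
p_shifted = softmax((W + c) @ x + (b + 4.2))   # shift all weights by c and all biases by 4.2

print(p_original)
print(p_shifted)                          # identical: the shifts cancel inside the softmax
```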
Multinomial regression is trained by maximum likelihood estimation. Given a dataset {(x_i, y_i)} for i = 1, ..., n, the log-likelihood is
L(theta) = sum_{i=1}^{n} sum_{k=1}^{K} 1{y_i = k} . log P(y_i = k | x_i)
where theta = {w_k, b_k} collects all weights and biases. Maximising this likelihood is exactly the same as minimising the average categorical cross-entropy loss. The cross-entropy view is dominant in deep learning; the maximum-likelihood view is dominant in statistics. They are the same objective up to a sign and a constant.
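A two-observation toy example (with made-up probabilities) makes the equivalence concrete: the negative mean log-likelihood is exactly the average cross-entropy.

```python
import numpy as np

# Made-up predicted probabilities for n = 2 observations and K = 3 classes
P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.3, 0.6]])
y = np.array([0, 2])                                # true class indices

# Negative mean log-likelihood ...
nll = -np.mean(np.log(P[np.arange(len(y)), y]))     # -(log 0.7 + log 0.6) / 2

# ... equals the average categorical cross-entropy against one-hot labels
Y = np.eye(3)[y]
cross_entropy = -np.mean(np.sum(Y * np.log(P), axis=1))

print(nll, cross_entropy)                           # the same number
```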
The gradient with respect to w_k is
d L / d w_k = sum_i (1{y_i = k} - P(y_i = k | x_i)) . x_i
which has the elegant form (label minus prediction) times input. This is the same gradient that backpropagates through a softmax-plus-cross-entropy layer in a neural network, and it is one reason this combination is so popular: the gradient is numerically stable and easy to compute.
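The following sketch, with made-up data, computes the analytic gradient in the (label minus prediction) form and spot-checks one entry against a finite difference.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, K = 50, 4, 3                               # made-up problem size
X = rng.normal(size=(n, d))
y = rng.integers(K, size=n)
Y = np.eye(K)[y]                                 # one-hot labels, shape (n, K)

def probs(W):
    Z = X @ W.T
    Z = Z - Z.max(axis=1, keepdims=True)         # numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def loglik(W):
    return np.sum(Y * np.log(probs(W)))          # log-likelihood (biases omitted for brevity)

W = rng.normal(size=(K, d))
grad = (Y - probs(W)).T @ X                      # analytic gradient: (label - prediction) times input

# Finite-difference spot check of a single entry
eps = 1e-6
W_eps = W.copy()
W_eps[1, 2] += eps
print(grad[1, 2], (loglik(W_eps) - loglik(W)) / eps)   # the two values should agree closely
```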
Unlike ordinary least squares, multinomial regression has no closed-form solution. The log-likelihood is, however, concave in the parameters (and strictly concave once an L2 regulariser is added), so there are no spurious local optima: any optimiser that converges reaches a global maximum, which is unique in the regularised case. Common algorithms include:
| Algorithm | Type | Notes |
|---|---|---|
| Newton-Raphson | Second order | Uses the exact Hessian; fast on small problems but expensive at high d or K |
| Iteratively reweighted least squares (IRLS) | Second order | Equivalent to Newton-Raphson for GLMs; classic in statistics |
| L-BFGS | Quasi-Newton | Approximates the Hessian; default solver in scikit-learn for most cases |
| Newton-CG | Truncated Newton | Solves the Newton step with conjugate gradient; good for large d |
| Stochastic gradient descent (SGD) | First order | Scales to enormous datasets; standard inside deep learning frameworks |
| Coordinate descent | First order | Used by glmnet for elastic-net penalties |
Thomas Minka's 2003 comparison of optimisers for logistic regression found that L-BFGS and conjugate-gradient methods consistently dominated naive iterative scaling, which had been the standard in the maximum-entropy NLP community. Modern libraries reflect that finding: scikit-learn's default solver is L-BFGS, and even when training enormous neural networks the final softmax layer is fit with the same kind of gradient steps.
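For illustration only, here is a full-batch gradient-descent fit in plain NumPy; it is a sketch of the simplest first-order approach from the table, not how production solvers are implemented (real libraries standardise features and use L-BFGS or similar).

```python
import numpy as np
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
n, d = X.shape
K = int(y.max()) + 1
Y = np.eye(K)[y]                                 # one-hot labels

def softmax_rows(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

W = np.zeros((K, d))                             # one weight vector per class
b = np.zeros(K)
lr = 0.01                                        # made-up learning rate for this sketch

for _ in range(5000):                            # full-batch gradient descent on the mean cross-entropy
    P = softmax_rows(X @ W.T + b)                # predicted probabilities, shape (n, K)
    G = P - Y                                    # gradient of the loss with respect to the logits
    W -= lr * (G.T @ X) / n
    b -= lr * G.mean(axis=0)

pred = softmax_rows(X @ W.T + b).argmax(axis=1)
print("training accuracy:", (pred == y).mean())
```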
Without a penalty term, the maximum-likelihood weights diverge to infinity whenever the training data are linearly separable, which is one symptom of overfitting. Regularisation fixes this by adding a penalty on the size of the weights. The three standard choices map directly to Bayesian priors on the weights:
| Penalty | Formula added to loss | Bayesian interpretation | Effect |
|---|---|---|---|
| L2 (ridge) | (1 / 2) lambda . sum_k ‖w_k‖^2 | Gaussian prior on each weight | Shrinks all weights toward zero; keeps every feature |
| L1 (lasso) | lambda . sum_k ‖w_k‖_1 | Laplace prior on each weight | Drives many weights exactly to zero, giving sparse models |
| Elastic net | alpha . L1 + (1 - alpha) . L2 | Mixture of Gaussian and Laplace | Combines shrinkage with sparsity |
L2 is the workhorse and the default in scikit-learn, statsmodels, and almost every deep-learning framework (where it is usually called weight decay). L1 is useful when most features are irrelevant and you want feature selection as a by-product of fitting. Elastic net, introduced by Zou and Hastie in 2005, is preferred when correlated features make pure L1 unstable.
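As a sketch of how the three penalties are selected in scikit-learn (the saga solver supports all of them; the regularisation strengths below are arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# L2 (the default), L1, and elastic-net penalties; the saga solver handles all three
for kwargs in ({"penalty": "l2"},
               {"penalty": "l1"},
               {"penalty": "elasticnet", "l1_ratio": 0.5}):
    clf = make_pipeline(StandardScaler(),
                        LogisticRegression(solver="saga", C=1.0, max_iter=5000, **kwargs))
    clf.fit(X, y)
    n_zero = (clf[-1].coef_ == 0).sum()          # L1 and elastic net drive some coefficients to exactly zero
    print(kwargs, "coefficients set to zero:", n_zero)
```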
Multinomial regression keeps surfacing under different names in different fields, because the same mathematical object is genuinely useful in many places.
A common alternative to fitting a single multinomial model is to fit K independent binary logistic regressions, each one separating its class from all the rest. Both approaches give a per-class score, but they have different statistical properties.
| Property | Multinomial (joint softmax) | One-vs-rest (K binary fits) |
|---|---|---|
| Number of optimisations | 1 joint problem | K independent problems |
| Probability normalisation | Built in: the K outputs sum to 1 | Not built in: each binary model scores its class against the rest, so the scores must be renormalised |
| Calibration | Naturally calibrated under MLE | Often needs post-hoc calibration (e.g. Platt scaling) |
| Treatment of class imbalance | Handles all classes simultaneously | Each binary problem may be very imbalanced |
| Cost when K is large | Loss includes a sum over K classes | Embarrassingly parallel across K |
| Default in scikit-learn | Yes (since version 0.22) | Was the default before 0.22 |
For most applications with moderate K, the joint multinomial fit produces better calibrated probabilities and slightly higher accuracy. One-vs-rest is more attractive when K is enormous (millions of categories) and you need to parallelise across many machines, or when you want to plug a binary-only library into a multiclass problem.
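A small comparison of the two strategies in scikit-learn, using the iris data as a stand-in for a real problem:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

joint = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)                      # one joint softmax fit
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_tr, y_tr)   # three independent binary fits

print("joint accuracy:      ", joint.score(X_te, y_te))
print("one-vs-rest accuracy:", ovr.score(X_te, y_te))
print("joint probabilities sum to:", joint.predict_proba(X_te[:1]).sum())
print("one-vs-rest probabilities (renormalised by scikit-learn) sum to:",
      ovr.predict_proba(X_te[:1]).sum())
```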
When multinomial regression is used as a statistical model rather than a black-box predictor, several inference tools are standard: Wald tests and z-statistics for individual coefficients, likelihood-ratio tests for comparing nested models, confidence intervals for the estimated log-odds ratios, and McFadden's pseudo R-squared as an overall goodness-of-fit summary.
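A sketch of that workflow using statsmodels' MNLogit; iris is used only as a convenient example, and its nearly separable classes can inflate the coefficients and standard errors.

```python
import statsmodels.api as sm
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
X = sm.add_constant(X)                           # add an intercept column

# Reference-category parameterisation; the lowest label (class 0) is the reference.
result = sm.MNLogit(y, X).fit(method="lbfgs", maxiter=1000, disp=False)

print(result.summary())                          # coefficients, standard errors, z-statistics, p-values
print("McFadden's pseudo R-squared:", result.prsquared)
```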
Multinomial regression is implemented in essentially every statistics and machine-learning library. The most widely used:
| Library | Function or class | Notes |
|---|---|---|
| scikit-learn | sklearn.linear_model.LogisticRegression | Default solver L-BFGS; supports L2, L1 (with saga), and elastic-net (with saga). The deprecated multi_class='multinomial' argument was removed; current versions choose multinomial automatically when K > 2. |
| statsmodels | statsmodels.discrete.discrete_model.MNLogit | Reference-category form with full inference output (Wald tests, p-values, confidence intervals, McFadden's pseudo R-squared). |
| R nnet | nnet::multinom | Fits multinomial logit by minimising negative log-likelihood. The companion summary() reports standard errors and z-statistics. |
| R mlogit | mlogit::mlogit | Implements McFadden-style conditional logit and several extensions (nested logit, mixed logit) for econometric work. |
| TensorFlow | tf.keras.layers.Dense(K) plus tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True) | The standard idiom for the final layer of a classifier. |
| PyTorch | torch.nn.Linear(d, K) plus torch.nn.CrossEntropyLoss | CrossEntropyLoss combines log-softmax and negative log-likelihood for numerical stability. |
| Stan | categorical_logit_glm | Bayesian multinomial regression with arbitrary priors. |
```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(solver="lbfgs", C=1.0, max_iter=1000)
clf.fit(X_train, y_train)

print("accuracy:", clf.score(X_test, y_test))
print("per-class probabilities for first test point:", clf.predict_proba(X_test[:1]))
```
With three iris species this is a textbook three-class softmax fit. predict_proba returns the softmax probabilities directly, and clf.coef_ has shape (3, 4): one weight vector per class, one entry per input feature.
In econometrics, multinomial logit is the workhorse for modelling choice among finite alternatives, including which mode of transport someone takes to work, which brand of cereal they buy, or which job offer they accept. McFadden's 1974 conditional logit allowed the features to depend on the alternatives themselves (price of each option, travel time of each route), which was a major step beyond the basic multinomial logit where only the chooser's attributes mattered.
The basic multinomial logit imposes the independence of irrelevant alternatives (IIA): the ratio of probabilities between any two alternatives does not depend on what other alternatives are available. McFadden's famous red-bus blue-bus example shows why this is restrictive. If a commuter is split 50/50 between car and bus, and you add an identical bus painted a different colour, IIA forces the model to predict 1/3, 1/3, 1/3 instead of the more realistic 1/2, 1/4, 1/4. Nested logit and mixed logit relax IIA by grouping similar alternatives or letting coefficients vary across the population.
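A short numerical illustration of the IIA constraint, with every alternative given the same made-up utility:

```python
import numpy as np

def choice_probs(utilities):
    e = np.exp(np.asarray(utilities, dtype=float))
    return e / e.sum()

print(choice_probs([1.0, 1.0]))        # car vs bus: 0.5 each
print(choice_probs([1.0, 1.0, 1.0]))   # car vs red bus vs blue bus: the logit insists on 1/3 each,
                                       # not the more realistic 1/2, 1/4, 1/4
```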
Multinomial regression has several real limitations that are worth being honest about.
The decision boundary between any two classes is linear in the input features. If the true boundary is curved (a parabola, a circle, anything nonlinear) the model cannot fit it without explicit feature engineering, kernel methods, or a nonlinear model on top.
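A small illustration of the linear-boundary limitation, using a made-up two-class dataset with a circular true boundary; adding squared features by hand makes the boundary representable again.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)    # circular true boundary: not linear in x

raw = LogisticRegression(max_iter=1000).fit(X, y)
print("raw features:", raw.score(X, y))                 # little better than predicting the majority class

X_aug = np.hstack([X, X ** 2])                           # hand-engineered squared features
aug = LogisticRegression(max_iter=1000).fit(X_aug, y)
print("with squared features:", aug.score(X_aug, y))     # close to perfect: the boundary is now linear
```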
The IIA assumption is baked into the basic model. When alternatives are close substitutes, IIA produces predictions that are noticeably wrong. The practical fix is either to use a richer model (nested logit, mixed logit, neural network) or to be careful about how the choice set is defined.
Like all maximum-likelihood methods, multinomial regression can overfit when d is large relative to n, especially when classes are nearly separable. Without regularisation the optimiser will happily push weights to infinity to drive the softmax probabilities to one and zero. L2 regularisation prevents this but introduces a hyperparameter that must be tuned by cross-validation.
Finally, multinomial regression is symmetric in the way it treats classes, so it does not exploit ordinal structure. If your labels have a natural order (1 to 5 stars, low/medium/high), an ordinal model usually fits better and uses fewer parameters.
For classification problems where the input is raw text, image pixels, or audio waveforms, multinomial regression on hand-crafted features has been displaced by deep neural networks that learn their own representations. But the displacement is partial. Almost every classifier in production today, including the largest neural networks, ends with multinomial regression on top of learned features.
The final layer of a classifier built with TensorFlow or PyTorch is a Dense(K) (or Linear(d, K)) followed by a softmax-cross-entropy loss. That is multinomial regression. The features it operates on are the activations of the previous layer, learned end-to-end with the rest of the network. The same arithmetic that Berger and colleagues used for maximum-entropy text classification in 1996 is what trains the final layer of GPT-style language models in 2026.
Language modelling is the most striking example. A transformer language model treats next-token prediction as a multinomial regression problem over the entire vocabulary, which is typically tens of thousands of tokens. The model produces a logit vector of size V (vocabulary size), the softmax converts it into a probability distribution, and training minimises the cross-entropy between that distribution and the one-hot vector for the actual next token. Sampling from a language model is sampling from a multinomial regression's predicted probabilities. Greedy decoding is taking the argmax. Temperature scaling is dividing the logits by a constant before the softmax. All of these tricks are operations on a multinomial regression that happens to have a giant feature extractor in front of it.
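A sketch of those operations on a made-up logit vector (temperature scaling, greedy decoding, and sampling):

```python
import numpy as np

rng = np.random.default_rng(0)
logits = rng.normal(size=8)                       # made-up next-token logits for a tiny 8-token vocabulary
temperature = 0.7                                 # < 1 sharpens the distribution, > 1 flattens it

z = logits / temperature                          # temperature scaling: divide the logits before the softmax
z = z - z.max()                                   # numerical stability
probs = np.exp(z) / np.exp(z).sum()

greedy_token = int(np.argmax(probs))              # greedy decoding: take the most probable token
sampled_token = int(rng.choice(len(probs), p=probs))   # sampling: draw from the predicted distribution

print(probs, greedy_token, sampled_token)
```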
Multinomial regression is also still useful in its own right. It is fast to train, easy to interpret, naturally calibrated, and gives a strong baseline that more complex models have to beat. For tabular data with modest dimensionality it often loses very little accuracy to gradient boosting, and it remains the default first model for many practical classification problems.
Imagine you have a bag of candies in different colours, and you want to guess each candy's colour from its shape and size. For two colours, you would draw one line that separates them. For ten colours, you draw ten lines, each one giving a score for one colour. To turn the scores into probabilities (so they all add up to one and the highest score wins), you use a function called softmax. That whole recipe is multinomial regression. You teach the lines by showing many candies whose colours you know, and the computer adjusts the lines until the probabilities for the right colours are as high as possible.