Multinomial regression is a statistical model that predicts which one of K possible categories an observation belongs to, given a vector of input features. It generalises logistic regression, which handles two classes, to the case where K is greater than two. The model assigns each class its own linear score, then turns those scores into a probability distribution using the softmax function. The class with the highest probability is the prediction.
The model goes by several names, all of which describe the same algorithm: multinomial logistic regression, softmax regression, polytomous logistic regression, and the maximum entropy classifier (often shortened to MaxEnt) in the natural language processing literature. Discrete-choice economists call the same model the multinomial logit, and a closely related variant is McFadden's conditional logit. All of these names refer to a generalized linear model with a multinomial response distribution and a softmax link function.
Multinomial regression sits at the boundary between classical statistics and modern machine learning. Statisticians use it for inference: estimating odds ratios, computing confidence intervals, and testing whether a predictor matters. Machine learning practitioners use it as a strong baseline for multiclass classification and as the final layer of nearly every deep neural network that produces a categorical output, including the next-token head of large language models.
Let x in R^d be a feature vector and let y be a categorical label taking values in {1, 2, ..., K}. Multinomial regression models the conditional probability of each class as
P(y = k | x) = exp(w_k . x + b_k) / sum_{j=1}^{K} exp(w_j . x + b_j)
for k = 1, ..., K. Each class k has its own weight vector w_k in R^d and its own bias b_k in R. The denominator (the partition function) ensures the K probabilities sum to one. Equivalently, if z_k = w_k . x + b_k is the linear score (or logit) for class k, then the predicted probability is the softmax of the score vector z = (z_1, ..., z_K).
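As a concrete illustration of the formula above, here is a minimal NumPy sketch of the score-then-softmax computation; the feature values, weights, and biases are made up for the example.

```python
import numpy as np

def softmax(z):
    """Turn a vector of scores (logits) into probabilities that sum to 1."""
    z = z - z.max()               # shifting by a constant leaves the result unchanged; improves stability
    e = np.exp(z)
    return e / e.sum()

# Made-up example with d = 2 features and K = 3 classes
x = np.array([1.0, -0.5])                          # feature vector
W = np.array([[0.4, 1.2],                          # w_1
              [-0.3, 0.8],                         # w_2
              [0.1, -0.5]])                        # w_3
b = np.array([0.0, 0.2, -0.1])                     # biases b_1, b_2, b_3

z = W @ x + b                                      # linear scores z_k = w_k . x + b_k
p = softmax(z)
print(p, p.sum(), p.argmax())                      # class probabilities, their sum (1.0), predicted class
```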
The decision boundary between any two classes k and l is the set of inputs where w_k . x + b_k = w_l . x + b_l, which is a hyperplane. So multinomial regression carves the input space into K convex polytopes, one per class. The boundaries are linear in x, even though the probability surface is smooth.
The parameterisation above is overcomplete. Adding the same constant vector c to every weight vector (replacing each w_k with w_k + c) leaves the probabilities unchanged, because the c terms cancel inside the softmax. The same applies to the biases. With no further constraint there are infinitely many parameter settings that give identical predictions, which makes maximum-likelihood estimates undefined.
Two conventions resolve this:
| Convention | Description | Used by |
|---|---|---|
| Reference category | Fix w_K = 0 and b_K = 0 for one chosen class. Estimate K - 1 sets of coefficients. | Statsmodels, R nnet, classical statistics |
| Symmetric (full softmax) | Keep all K weight vectors but add a regulariser (typically L2) so the optimum is unique. | scikit-learn, PyTorch, TensorFlow |
The reference-category form has a clean interpretation: each w_k tells you how feature x shifts the log-odds of class k versus the reference class. The symmetric form is more convenient for software, especially when the model is a layer inside a neural network and weight decay is already in the loss.
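A quick numerical check of the identifiability issue described above, again with made-up numbers: adding the same vector to every weight vector (and the same constant to every bias) leaves the predicted probabilities unchanged.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))              # 3 classes, 4 features (arbitrary values)
b = rng.normal(size=3)
x = rng.normal(size=4)
c = np.array([5.0, -2.0, 0.7, 1.3])      # the same shift added to every weight vector

p_original = softmax(W @ x + b)
p_shifted = softmax((W + c) @ x + (b + 4.2))   # shift all weights by c and all biases by 4.2

print(p_original)
print(p_shifted)                          # identical: the shifts cancel inside the softmax
```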
Multinomial regression is trained by maximum likelihood estimation. Given a dataset {(x_i, y_i)} for i = 1, ..., n, the log-likelihood is
L(theta) = sum_{i=1}^{n} sum_{k=1}^{K} 1{y_i = k} . log P(y_i = k | x_i)
where theta = {w_k, b_k} collects all weights and biases. Maximising this likelihood is exactly the same as minimising the average categorical cross-entropy loss. The cross-entropy view is dominant in deep learning; the maximum-likelihood view is dominant in statistics. They are the same objective up to a sign and a constant.
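A two-observation toy example (with made-up probabilities) makes the equivalence concrete: the negative mean log-likelihood is exactly the average cross-entropy.

```python
import numpy as np

# Made-up predicted probabilities for n = 2 observations and K = 3 classes
P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.3, 0.6]])
y = np.array([0, 2])                                # true class indices

# Negative mean log-likelihood ...
nll = -np.mean(np.log(P[np.arange(len(y)), y]))     # -(log 0.7 + log 0.6) / 2

# ... equals the average categorical cross-entropy against one-hot labels
Y = np.eye(3)[y]
cross_entropy = -np.mean(np.sum(Y * np.log(P), axis=1))

print(nll, cross_entropy)                           # the same number
```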
The gradient with respect to w_k is
d L / d w_k = sum_i (1{y_i = k} - P(y_i = k | x_i)) . x_i
which has the elegant form (label minus prediction) times input. This is the same gradient that backpropagates through a softmax-plus-cross-entropy layer in a neural network, and it is one reason this combination is so popular: the gradient is numerically stable and easy to compute.
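The following sketch, with made-up data, computes the analytic gradient in the (label minus prediction) form and spot-checks one entry against a finite difference.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, K = 50, 4, 3                               # made-up problem size
X = rng.normal(size=(n, d))
y = rng.integers(K, size=n)
Y = np.eye(K)[y]                                 # one-hot labels, shape (n, K)

def probs(W):
    Z = X @ W.T
    Z = Z - Z.max(axis=1, keepdims=True)         # numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def loglik(W):
    return np.sum(Y * np.log(probs(W)))          # log-likelihood (biases omitted for brevity)

W = rng.normal(size=(K, d))
grad = (Y - probs(W)).T @ X                      # analytic gradient: (label - prediction) times input

# Finite-difference spot check of a single entry
eps = 1e-6
W_eps = W.copy()
W_eps[1, 2] += eps
print(grad[1, 2], (loglik(W_eps) - loglik(W)) / eps)   # the two values should agree closely
```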
Unlike ordinary least squares, multinomial regression has no closed-form solution. The log-likelihood is, however, concave in the parameters (and strictly concave once an L2 regulariser is added), so there are no spurious local optima: any optimiser that converges reaches a global maximum, which is unique in the regularised case. Common algorithms include:
| Algorithm | Type | Notes |
|---|---|---|
| Newton-Raphson | Second order | Uses the exact Hessian; fast on small problems but expensive at high d or K |
| Iteratively reweighted least squares (IRLS) | Second order | Equivalent to Newton-Raphson for GLMs; classic in statistics |
| L-BFGS | Quasi-Newton | Approximates the Hessian; default solver in scikit-learn for most cases |
| Newton-CG | Truncated Newton | Solves the Newton step with conjugate gradient; good for large d |
| Stochastic gradient descent (SGD) | First order | Scales to enormous datasets; standard inside deep learning frameworks |
| Coordinate descent | First order | Used by glmnet for elastic-net penalties |
Thomas Minka's 2003 comparison of optimisers for logistic regression found that L-BFGS and conjugate-gradient methods consistently dominated naive iterative scaling, which had been the standard in the maximum-entropy NLP community. Modern libraries reflect that finding: scikit-learn's default solver is L-BFGS, and even when training enormous neural networks the final softmax layer is fit with the same kind of gradient steps.
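For illustration only, here is a full-batch gradient-descent fit in plain NumPy; it is a sketch of the simplest first-order approach from the table, not how production solvers are implemented (real libraries standardise features and use L-BFGS or similar).

```python
import numpy as np
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
n, d = X.shape
K = int(y.max()) + 1
Y = np.eye(K)[y]                                 # one-hot labels

def softmax_rows(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

W = np.zeros((K, d))                             # one weight vector per class
b = np.zeros(K)
lr = 0.01                                        # made-up learning rate for this sketch

for _ in range(5000):                            # full-batch gradient descent on the mean cross-entropy
    P = softmax_rows(X @ W.T + b)                # predicted probabilities, shape (n, K)
    G = P - Y                                    # gradient of the loss with respect to the logits
    W -= lr * (G.T @ X) / n
    b -= lr * G.mean(axis=0)

pred = softmax_rows(X @ W.T + b).argmax(axis=1)
print("training accuracy:", (pred == y).mean())
```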
Without a penalty term, the maximum-likelihood weights diverge to infinity whenever the training data are linearly separable, which is one symptom of overfitting. Regularisation fixes this by adding a penalty on the size of the weights. The three standard choices map directly to Bayesian priors on the weights:
| Penalty | Formula added to loss | Bayesian interpretation | Effect |
|---|---|---|---|
| L2 (ridge) | (1 / 2) lambda . sum_k ‖w_k‖^2 | Gaussian prior on each weight | Shrinks all weights toward zero; keeps every feature |
| L1 (lasso) | lambda . sum_k ‖w_k‖_1 | Laplace prior on each weight | Drives many weights exactly to zero, giving sparse models |
| Elastic net | alpha . L1 + (1 - alpha) . L2 | Mixture of Gaussian and Laplace | Combines shrinkage with sparsity |
L2 is the workhorse and the default in scikit-learn, statsmodels, and almost every deep-learning framework (where it is usually called weight decay). L1 is useful when most features are irrelevant and you want feature selection as a by-product of fitting. Elastic net, introduced by Zou and Hastie in 2005, is preferred when correlated features make pure L1 unstable.
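As a sketch of how the three penalties are selected in scikit-learn (the saga solver supports all of them; the regularisation strengths below are arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# L2 (the default), L1, and elastic-net penalties; the saga solver handles all three
for kwargs in ({"penalty": "l2"},
               {"penalty": "l1"},
               {"penalty": "elasticnet", "l1_ratio": 0.5}):
    clf = make_pipeline(StandardScaler(),
                        LogisticRegression(solver="saga", C=1.0, max_iter=5000, **kwargs))
    clf.fit(X, y)
    n_zero = (clf[-1].coef_ == 0).sum()          # L1 and elastic net drive some coefficients to exactly zero
    print(kwargs, "coefficients set to zero:", n_zero)
```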
Multinomial regression keeps surfacing under different names in different fields, because the same mathematical object is genuinely useful in many places.
A common alternative to fitting a single multinomial model is to fit K independent binary logistic regressions, each one separating its class from all the rest. Both approaches give a per-class score, but they have different statistical properties.
| Property | Multinomial (joint softmax) | One-vs-rest (K binary fits) |
|---|---|---|
| Number of optimisations | 1 joint problem | K independent problems |
| Probability normalisation | Built in: the K outputs sum to 1 | Not built in: each binary model scores its class against the rest, so the scores must be renormalised |
| Calibration | Naturally calibrated under MLE | Often needs post-hoc calibration (e.g. Platt scaling) |
| Treatment of class imbalance | Handles all classes simultaneously | Each binary problem may be very imbalanced |
| Cost when K is large | Loss includes a sum over K classes | Embarrassingly parallel across K |
| Default in scikit-learn | Yes (since version 0.22) | Was the default before 0.22 |
For most applications with moderate K, the joint multinomial fit produces better calibrated probabilities and slightly higher accuracy. One-vs-rest is more attractive when K is enormous (millions of categories) and you need to parallelise across many machines, or when you want to plug a binary-only library into a multiclass problem.
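A small comparison of the two strategies in scikit-learn, using the iris data as a stand-in for a real problem:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

joint = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)                      # one joint softmax fit
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_tr, y_tr)   # three independent binary fits

print("joint accuracy:      ", joint.score(X_te, y_te))
print("one-vs-rest accuracy:", ovr.score(X_te, y_te))
print("joint probabilities sum to:", joint.predict_proba(X_te[:1]).sum())
print("one-vs-rest probabilities (renormalised by scikit-learn) sum to:",
      ovr.predict_proba(X_te[:1]).sum())
```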
When multinomial regression is used as a statistical model rather than a black-box predictor, several inference tools are standard: Wald tests and z-statistics for individual coefficients, likelihood-ratio tests for comparing nested models, confidence intervals for the estimated log-odds ratios, and McFadden's pseudo R-squared as an overall goodness-of-fit summary.
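A sketch of that workflow using statsmodels' MNLogit; iris is used only as a convenient example, and its nearly separable classes can inflate the coefficients and standard errors.

```python
import statsmodels.api as sm
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
X = sm.add_constant(X)                           # add an intercept column

# Reference-category parameterisation; the lowest label (class 0) is the reference.
result = sm.MNLogit(y, X).fit(method="lbfgs", maxiter=1000, disp=False)

print(result.summary())                          # coefficients, standard errors, z-statistics, p-values
print("McFadden's pseudo R-squared:", result.prsquared)
```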
Multinomial regression is implemented in essentially every statistics and machine-learning library. The most widely used:
| Library | Function or class | Notes |
|---|---|---|
| scikit-learn | sklearn.linear_model.LogisticRegression | Default solver L-BFGS; supports L2, L1 (with saga), and elastic-net (with saga). The deprecated multi_class='multinomial' argument was removed; current versions choose multinomial automatically when K > 2. |
| statsmodels | statsmodels.discrete.discrete_model.MNLogit | Reference-category form with full inference output (Wald tests, p-values, confidence intervals, McFadden's pseudo R-squared). |
| R nnet | nnet::multinom | Fits multinomial logit by minimising negative log-likelihood. The companion summary() reports standard errors and z-statistics. |
| R mlogit | mlogit::mlogit | Implements McFadden-style conditional logit and several extensions (nested logit, mixed logit) for econometric work. |
| TensorFlow | tf.keras.layers.Dense(K) plus tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True) | The standard idiom for the final layer of a classifier. |
| PyTorch | torch.nn.Linear(d, K) plus torch.nn.CrossEntropyLoss | CrossEntropyLoss combines log-softmax and negative log-likelihood for numerical stability. |
| Stan | categorical_logit_glm | Bayesian multinomial regression with arbitrary priors. |
```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(solver="lbfgs", C=1.0, max_iter=1000)
clf.fit(X_train, y_train)

print("accuracy:", clf.score(X_test, y_test))
print("per-class probabilities for first test point:", clf.predict_proba(X_test[:1]))
```
With three iris species this is a textbook three-class softmax fit. predict_proba returns the softmax probabilities directly, and clf.coef_ has shape (3, 4): one weight vector per class, one entry per input feature.
In econometrics, multinomial logit is the workhorse for modelling choice among finite alternatives, including which mode of transport someone takes to work, which brand of cereal they buy, or which job offer they accept. McFadden's 1974 conditional logit allowed the features to depend on the alternatives themselves (price of each option, travel time of each route), which was a major step beyond the basic multinomial logit where only the chooser's attributes mattered.
The basic multinomial logit imposes the independence of irrelevant alternatives (IIA): the ratio of probabilities between any two alternatives does not depend on what other alternatives are available. McFadden's famous red-bus blue-bus example shows why this is restrictive. If a commuter is split 50/50 between car and bus, and you add an identical bus painted a different colour, IIA forces the model to predict 1/3, 1/3, 1/3 instead of the more realistic 1/2, 1/4, 1/4. Nested logit and mixed logit relax IIA by grouping similar alternatives or letting coefficients vary across the population.
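A short numerical illustration of the IIA constraint, with every alternative given the same made-up utility:

```python
import numpy as np

def choice_probs(utilities):
    e = np.exp(np.asarray(utilities, dtype=float))
    return e / e.sum()

print(choice_probs([1.0, 1.0]))        # car vs bus: 0.5 each
print(choice_probs([1.0, 1.0, 1.0]))   # car vs red bus vs blue bus: the logit insists on 1/3 each,
                                       # not the more realistic 1/2, 1/4, 1/4
```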
Multinomial regression has several real limitations that are worth being honest about.
The decision boundary between any two classes is linear in the input features. If the true boundary is curved (a parabola, a circle, anything nonlinear) the model cannot fit it without explicit feature engineering, kernel methods, or a nonlinear model on top.
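A small illustration of the linear-boundary limitation, using a made-up two-class dataset with a circular true boundary; adding squared features by hand makes the boundary representable again.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)    # circular true boundary: not linear in x

raw = LogisticRegression(max_iter=1000).fit(X, y)
print("raw features:", raw.score(X, y))                 # little better than predicting the majority class

X_aug = np.hstack([X, X ** 2])                           # hand-engineered squared features
aug = LogisticRegression(max_iter=1000).fit(X_aug, y)
print("with squared features:", aug.score(X_aug, y))     # close to perfect: the boundary is now linear
```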
The IIA assumption is baked into the basic model. When alternatives are close substitutes, IIA produces predictions that are noticeably wrong. The practical fix is either to use a richer model (nested logit, mixed logit, neural network) or to be careful about how the choice set is defined.
Like all maximum-likelihood methods, multinomial regression can overfit when d is large relative to n, especially when classes are nearly separable. Without regularisation the optimiser will happily push weights to infinity to drive the softmax probabilities to one and zero. L2 regularisation prevents this but introduces a hyperparameter that must be tuned by cross-validation.
Finally, multinomial regression is symmetric in the way it treats classes, so it does not exploit ordinal structure. If your labels have a natural order (1 to 5 stars, low/medium/high), an ordinal model usually fits better and uses fewer parameters.
For classification problems where the input is raw text, image pixels, or audio waveforms, multinomial regression on hand-crafted features has been displaced by deep neural networks that learn their own representations. But the displacement is partial. Almost every classifier in production today, including the largest neural networks, ends with multinomial regression on top of learned features.
The final layer of a classifier built with TensorFlow or PyTorch is a Dense(K) (or Linear(d, K)) followed by a softmax-cross-entropy loss. That is multinomial regression. The features it operates on are the activations of the previous layer, learned end-to-end with the rest of the network. The same arithmetic that Berger and colleagues used for maximum-entropy text classification in 1996 is what trains the final layer of GPT-style language models in 2026.
Language modelling is the most striking example. A transformer language model treats next-token prediction as a multinomial regression problem over the entire vocabulary, which is typically tens of thousands of tokens. The model produces a logit vector of size V (vocabulary size), the softmax converts it into a probability distribution, and training minimises the cross-entropy between that distribution and the one-hot vector for the actual next token. Sampling from a language model is sampling from a multinomial regression's predicted probabilities. Greedy decoding is taking the argmax. Temperature scaling is dividing the logits by a constant before the softmax. All of these tricks are operations on a multinomial regression that happens to have a giant feature extractor in front of it.
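A sketch of those operations on a made-up logit vector (temperature scaling, greedy decoding, and sampling):

```python
import numpy as np

rng = np.random.default_rng(0)
logits = rng.normal(size=8)                       # made-up next-token logits for a tiny 8-token vocabulary
temperature = 0.7                                 # < 1 sharpens the distribution, > 1 flattens it

z = logits / temperature                          # temperature scaling: divide the logits before the softmax
z = z - z.max()                                   # numerical stability
probs = np.exp(z) / np.exp(z).sum()

greedy_token = int(np.argmax(probs))              # greedy decoding: take the most probable token
sampled_token = int(rng.choice(len(probs), p=probs))   # sampling: draw from the predicted distribution

print(probs, greedy_token, sampled_token)
```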
Multinomial regression is also still useful in its own right. It is fast to train, easy to interpret, naturally calibrated, and gives a strong baseline that more complex models have to beat. For tabular data with modest dimensionality it often loses very little accuracy to gradient boosting, and it remains the default first model for many practical classification problems.
Imagine you have a bag of candies in different colours, and you want to guess each candy's colour from its shape and size. For two colours, you would draw one line that separates them. For ten colours, you draw ten lines, each one giving a score for one colour. To turn the scores into probabilities (so they all add up to one and the highest score wins), you use a function called softmax. That whole recipe is multinomial regression. You teach the lines by showing many candies whose colours you know, and the computer adjusts the lines until the probabilities for the right colours are as high as possible.