See also: Machine learning terms, Logistic regression, Classification
Multi-class logistic regression, also known as multinomial logistic regression, softmax regression, or the maximum entropy (MaxEnt) classifier, is a supervised learning algorithm that generalizes logistic regression to handle classification problems with more than two possible output categories. While binary logistic regression models a single outcome probability using the sigmoid function, multi-class logistic regression uses the softmax function to produce a probability distribution over K mutually exclusive classes.
The algorithm takes an input feature vector, computes a linear score for each class, and then normalizes these scores through the softmax function so that all output probabilities sum to one. The predicted class is the one with the highest probability. Training is performed by minimizing the cross-entropy loss function, typically through iterative optimization methods such as gradient descent.
Multi-class logistic regression is widely used in natural language processing, image classification, medical diagnosis, and many other domains. It serves as the final classification layer in most modern neural networks and remains one of the most commonly applied multi-class classifiers in both statistics and machine learning.
The development of multi-class logistic regression grew out of foundational work on binary logistic regression. Joseph Berkson introduced the term "logit" in 1944 and developed the logistic regression model as an alternative to probit analysis for modeling binary outcomes. In 1958, David Cox substantially developed and popularized the model, generalizing it to handle multiple explanatory variables.
Cox extended logistic regression to multinomial outcomes in 1966, allowing the model to handle dependent variables with more than two categories. This broadened the scope and popularity of logistic models considerably. In 1973, Daniel McFadden connected the multinomial logit model to discrete choice theory in economics, showing that the model arises naturally when individuals choose among alternatives to maximize random utility. McFadden received the Nobel Prize in Economics in 2000 for this contribution.
Separately, in the field of statistical mechanics and information theory, the principle of maximum entropy was developed by Edwin Jaynes in 1957. Researchers later recognized that multinomial logistic regression is mathematically equivalent to the maximum entropy classifier, a connection that became especially important in natural language processing during the 1990s and 2000s. The equivalence between these two perspectives, one rooted in statistics and the other in information theory, provided a strong theoretical foundation for the model.
Imagine you have a bag of colored candies: red, blue, green, and yellow. You want to build a machine that can guess the color of a candy just by weighing it and measuring its size.
The machine looks at the weight and size, and for each color, it calculates a "score" that represents how much the candy looks like that color. A heavier candy might get a high score for "red" and a low score for "yellow."
But scores by themselves are hard to compare. So the machine uses a special trick called softmax to turn all the scores into percentages that add up to 100%. If red gets 70%, blue gets 15%, green gets 10%, and yellow gets 5%, the machine guesses "red" because it has the highest percentage.
To get better at guessing, the machine practices on lots of candies where you already know the color. Each time it guesses wrong, it adjusts its internal settings a little bit so the correct color gets a higher percentage next time. After practicing on hundreds of candies, the machine learns what weight-and-size combinations go with each color.
That is multi-class logistic regression: calculate scores for each option, convert them to percentages, and pick the highest one.
Consider a classification problem with K classes (indexed k = 1, 2, ..., K) and a training set of m examples. Each training example consists of a feature vector x (of dimension n) and a true class label y, where y takes values in {1, 2, ..., K}. The model maintains a weight matrix W of dimensions n x K and a bias vector b of dimension K.
For each class k, the model computes a raw score (also called a logit) as a linear function of the input features:
z_k = w_k^T * x + b_k
where w_k is the k-th column of the weight matrix W and b_k is the k-th element of the bias vector. The vector z = (z_1, z_2, ..., z_K) contains the raw scores for all K classes.
The softmax function transforms the raw score vector z into a valid probability distribution over the K classes:
P(y = k | x) = exp(z_k) / sum_{j=1}^{K} exp(z_j)
This function has several properties:

- Each output lies strictly between 0 and 1, and the outputs sum to exactly 1, so they form a valid probability distribution.
- It preserves ordering: the class with the largest raw score receives the largest probability.
- It is invariant to adding the same constant to every score, since the constant cancels in the ratio.
- It is smooth and differentiable everywhere, which makes gradient-based training possible.
In practice, a numerical stability trick is applied by subtracting the maximum score from all scores before exponentiation:
P(y = k | x) = exp(z_k - max(z)) / sum_{j=1}^{K} exp(z_j - max(z))
This prevents overflow errors without changing the result, since the subtraction cancels out in the ratio.
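As a concrete illustration, here is a minimal NumPy sketch of the numerically stable softmax; the function name and test values are illustrative only:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis of z."""
    # Subtracting the row-wise maximum prevents overflow in exp();
    # the shift cancels in the ratio, so the probabilities are unchanged.
    z_shifted = z - np.max(z, axis=-1, keepdims=True)
    exp_z = np.exp(z_shifted)
    return exp_z / np.sum(exp_z, axis=-1, keepdims=True)

scores = np.array([1000.0, 1001.0, 1002.0])  # naive exp() would overflow here
print(softmax(scores))                       # [0.09003057 0.24472847 0.66524096]
```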
When K = 2, the softmax function reduces to the sigmoid function used in binary classification. Setting one class as the reference (with its weight vector fixed to zero), the probability of the other class becomes:
P(y = 1 | x) = exp(w_1^T * x) / (exp(w_1^T * x) + exp(0)) = 1 / (1 + exp(-w_1^T * x))
This is exactly the sigmoid function, confirming that multi-class logistic regression is a strict generalization of binary logistic regression.
The softmax model has a redundancy in its parameters: adding any constant vector to all weight vectors produces the same probabilities. This means only K - 1 weight vectors are independently identifiable. A common convention in statistics is to fix one class (typically the last) as a reference category by setting its weight vector to zero. The remaining K - 1 weight vectors then encode log-odds relative to the reference class:
log(P(y = k | x) / P(y = K | x)) = w_k^T * x + b_k
In machine learning practice, all K weight vectors are typically kept and the redundancy is resolved implicitly through regularization.
The model is trained by minimizing the cross-entropy loss (also called log loss or negative log-likelihood). For a single training example with true class label y, the loss is:
L = -log(P(y | x)) = -log(exp(z_y) / sum_{j=1}^{K} exp(z_j))
Using one-hot encoding to represent the true label as a vector t (where t_k = 1 if k = y and t_k = 0 otherwise), this can be written as:
L = -sum_{k=1}^{K} t_k * log(P(y = k | x))
Since only the term where t_k = 1 is nonzero, only the predicted probability of the true class contributes to the loss. The total cost over a dataset of m training examples is:
J(W, b) = -(1/m) * sum_{i=1}^{m} sum_{k=1}^{K} t_k^(i) * log(P(y = k | x^(i)))
This cost function is convex, which guarantees that any local minimum is also a global minimum.
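For illustration, the cost can be computed in a few lines of NumPy; this is a sketch assuming probabilities produced by a softmax such as the one above and integer labels in {0, ..., K-1}:

```python
import numpy as np

def cross_entropy(probs, y):
    """Average negative log-probability assigned to the true classes.

    probs: (m, K) predicted probabilities, rows summing to 1
    y:     (m,) integer class labels
    """
    m = probs.shape[0]
    # Only the probability of the true class contributes to the loss,
    # so fancy indexing replaces the explicit one-hot sum.
    return -np.mean(np.log(probs[np.arange(m), y]))
```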
Minimizing the cross-entropy loss is equivalent to maximizing the likelihood of the observed data under the model. The likelihood function for m independent and identically distributed training examples is:
L(W, b) = product_{i=1}^{m} P(y^(i) | x^(i); W, b)
Taking the negative logarithm of the likelihood yields the cross-entropy cost function. This connection to maximum likelihood estimation provides a principled statistical justification for the training procedure.
The gradient of the cross-entropy loss with respect to the weight vector w_k for class k is:
dJ/dw_k = (1/m) * sum_{i=1}^{m} (P(y = k | x^(i)) - t_k^(i)) * x^(i)
This gradient has an elegant form: for each training example, it is proportional to the difference between the predicted probability and the true label (0 or 1), multiplied by the input feature vector. The gradient for the bias term b_k is:
dJ/db_k = (1/m) * sum_{i=1}^{m} (P(y = k | x^(i)) - t_k^(i))
These gradients are used to update the parameters iteratively during training.
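The vectorized form of these gradients is compact. The following sketch (the helper name is illustrative) computes both gradients for a batch of examples at once:

```python
import numpy as np

def gradients(X, y, probs, K):
    """Gradients of the cross-entropy cost for softmax regression.

    X:     (m, n) feature matrix
    y:     (m,) integer class labels
    probs: (m, K) predicted probabilities
    Returns dW with shape (n, K) and db with shape (K,).
    """
    m = X.shape[0]
    T = np.eye(K)[y]          # one-hot encoding of the labels, shape (m, K)
    error = probs - T         # predicted probability minus true label
    dW = X.T @ error / m      # matches dJ/dw_k above, for all k at once
    db = error.mean(axis=0)
    return dW, db
```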
The partial derivative of the softmax output with respect to its input scores forms a Jacobian matrix. For outputs S_i = softmax(z)_i, the derivative is S_i * (1 - S_i) when i = j and -S_i * S_j when i != j. This can be written compactly as dS_i / dz_j = S_i * (delta_ij - S_j), where delta_ij is the Kronecker delta. When combined with the cross-entropy loss, the overall gradient simplifies to the difference between predicted probabilities and true labels, which is the same simple form used in binary logistic regression.
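The compact formula is easy to verify numerically. This small check (assuming the softmax function sketched earlier) compares the analytic Jacobian against central finite differences:

```python
import numpy as np

z = np.array([0.5, -1.0, 2.0])
S = softmax(z)

# Analytic Jacobian: dS_i/dz_j = S_i * (delta_ij - S_j)
J_analytic = np.diag(S) - np.outer(S, S)

# Central finite-difference approximation for comparison
eps = 1e-6
J_numeric = np.zeros((3, 3))
for j in range(3):
    dz = np.zeros(3)
    dz[j] = eps
    J_numeric[:, j] = (softmax(z + dz) - softmax(z - dz)) / (2 * eps)

print(np.allclose(J_analytic, J_numeric, atol=1e-8))  # True
```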
Several optimization algorithms can be used to minimize the cross-entropy cost function and train the model.
Gradient descent computes the gradient over the entire training set and updates all parameters simultaneously:
w_k := w_k - alpha * dJ/dw_k
where alpha is the learning rate. This method produces stable updates but can be slow for large datasets because every update requires a full pass through the data.
Stochastic gradient descent updates parameters using a single randomly selected training example at each step. This introduces noise into the gradient estimate but allows for much faster progress, particularly on large datasets. The noisy updates can also help escape shallow local minima in non-convex variants of the problem (such as when softmax regression is used as part of a deep learning model).
Mini-batch gradient descent is a compromise between batch and stochastic methods. It computes the gradient on a small random subset (mini-batch) of the training data at each step, typically containing 32 to 256 examples. This approach benefits from vectorized computation on modern hardware while still providing the convergence benefits of stochastic updates.
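The pieces above combine into a complete mini-batch training loop. This is a sketch, assuming the softmax and gradients helpers from the earlier sections; the hyperparameter values are arbitrary:

```python
import numpy as np

def train(X, y, K, alpha=0.1, batch_size=64, epochs=50, seed=0):
    """Mini-batch SGD for softmax regression (sketch, helpers defined above)."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = np.zeros((n, K))
    b = np.zeros(K)
    for _ in range(epochs):
        order = rng.permutation(m)                 # reshuffle every epoch
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]
            probs = softmax(X[idx] @ W + b)
            dW, db = gradients(X[idx], y[idx], probs, K)
            W -= alpha * dW                        # gradient descent update
            b -= alpha * db
    return W, b
```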
Modern implementations often use adaptive learning rate methods such as the Adam optimizer, AdaGrad, or RMSProp. These algorithms adjust the learning rate for each parameter individually based on the history of past gradients. For classical statistical applications, quasi-Newton methods such as L-BFGS are commonly used because they converge in fewer iterations by approximating second-order curvature information.
| Optimizer | Update rule | Advantages | Disadvantages |
|---|---|---|---|
| Batch gradient descent | Uses full dataset gradient | Stable convergence, deterministic | Slow on large datasets |
| SGD | Uses single-example gradient | Fast updates, good generalization | Noisy, sensitive to learning rate |
| Mini-batch SGD | Uses small-batch gradient | Balances speed and stability | Requires batch size tuning |
| L-BFGS | Quasi-Newton with curvature approximation | Fast convergence, no learning rate needed | Needs full-dataset gradients; history vectors add memory cost |
| Adam | Adaptive per-parameter learning rates | Works well out of the box | May not generalize as well as SGD |
Overfitting occurs when the model fits the training data too closely and fails to generalize to unseen examples. Regularization techniques address this by adding a penalty term to the loss function.
L2 regularization adds the squared magnitude of the weight vectors to the cost function:
J_reg(W, b) = J(W, b) + (lambda / 2) * sum_{k=1}^{K} ||w_k||^2
where lambda is the regularization strength hyperparameter. L2 regularization shrinks all weights toward zero but rarely sets any weight to exactly zero. It produces smoother decision boundaries and is equivalent to placing a zero-mean Gaussian prior on the weights in the Bayesian interpretation.
L1 regularization adds the sum of absolute values of the weights:
J_reg(W, b) = J(W, b) + lambda * sum_{k=1}^{K} ||w_k||_1
L1 regularization tends to drive some weights to exactly zero, effectively performing feature selection. This produces sparse models that use only a subset of the input features. L1 regularization is especially useful when the input has many features and only a few are expected to be relevant.
Elastic net combines L1 and L2 regularization:
J_reg(W, b) = J(W, b) + lambda_1 * sum ||w_k||_1 + (lambda_2 / 2) * sum ||w_k||^2
This method can produce sparse solutions (from the L1 term) while still maintaining the grouping effect of L2 regularization when features are correlated.
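In gradient-based training, a penalty simply adds its own term to the cost and its own gradient to dW. The sketch below (illustrative helper names; biases are conventionally left unpenalized) shows the L2 and elastic net penalties:

```python
import numpy as np

def l2_penalty(W, lam):
    """L2 penalty and its gradient with respect to the weights."""
    return 0.5 * lam * np.sum(W ** 2), lam * W

def elastic_net_penalty(W, lam1, lam2):
    """Elastic net penalty; the L1 part uses the subgradient sign(W)."""
    penalty = lam1 * np.sum(np.abs(W)) + 0.5 * lam2 * np.sum(W ** 2)
    grad = lam1 * np.sign(W) + lam2 * W
    return penalty, grad
```

In practice, the non-differentiability of the L1 term at zero is usually handled with proximal or coordinate descent methods rather than the plain subgradient shown here.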
| Regularization method | Penalty term | Effect on weights | Sparsity | Use case |
|---|---|---|---|---|
| L2 (Ridge) | lambda/2 * sum of squared weights | Shrinks all weights toward zero | No | General-purpose, correlated features |
| L1 (Lasso) | lambda * sum of absolute weights | Drives some weights to zero | Yes | Feature selection, high-dimensional data |
| Elastic net | Combination of L1 and L2 | Balanced shrinkage and zeroing | Partial | Correlated features with sparsity |
There are multiple strategies for extending binary classifiers to multi-class problems. Multi-class logistic regression via the softmax function is one approach, but two other commonly used strategies are one-vs-rest and one-vs-one.
The softmax approach, described in the sections above, trains a single model with K output nodes. All classes are handled jointly in a single optimization problem, and the softmax function ensures that the predicted probabilities form a valid distribution. This approach is sometimes called the "native" or "direct" multi-class method.
Advantages of the softmax approach include producing well-calibrated probabilities that sum to one, requiring only a single model to train and maintain, and jointly optimizing all class boundaries. A limitation is that it assumes classes are mutually exclusive.
One-vs-rest (also called one-vs-all, OvA) trains K separate binary classifiers. Each classifier learns to distinguish one class from all other classes combined. At prediction time, the input is passed through all K classifiers, and the class whose classifier produces the highest confidence score is selected.
For a problem with K classes:

- K binary classifiers are trained, one per class.
- Classifier k treats examples of class k as positive and all other examples as negative.
- At prediction time, the class whose classifier outputs the highest confidence score is selected.
OvR is simple to implement and can use any binary classifier as a building block, including support vector machines and decision trees. However, it can suffer from class imbalance (each binary problem has one small positive class against a large negative class), and the scores from different classifiers are not directly comparable because they are trained independently.
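In scikit-learn, the strategy is available as a wrapper around any binary estimator. A brief sketch (the data variable names are assumed):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Train K binary classifiers, one per class, each against all the rest.
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X_train, y_train)
predictions = ovr.predict(X_test)  # class with the highest confidence wins
```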
One-vs-one trains a binary classifier for every pair of classes. For K classes, this requires K * (K - 1) / 2 classifiers. At prediction time, each classifier votes for one of its two classes, and the class with the most votes is selected.
For a problem with K classes:

- K * (K - 1) / 2 binary classifiers are trained, one for each pair of classes.
- Each classifier is trained only on the examples belonging to its two classes.
- At prediction time, each classifier casts a vote, and the class with the most votes is selected.
OvO is often used with kernel methods like SVMs because training time for kernel methods scales super-linearly with dataset size, and each OvO classifier trains on a smaller subset of the data. The disadvantage is that the number of classifiers grows quadratically with the number of classes.
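scikit-learn provides an analogous wrapper for this strategy; the example below pairs it with a kernel SVM, the setting where OvO is most common (data variable names assumed):

```python
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC

# Train K*(K-1)/2 pairwise classifiers, each on only two classes' data.
ovo = OneVsOneClassifier(SVC(kernel='rbf'))
ovo.fit(X_train, y_train)
predictions = ovo.predict(X_test)  # majority vote across pairwise classifiers
```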
| Strategy | Number of models | Probability output | Handles non-exclusive classes | Training complexity |
|---|---|---|---|---|
| Softmax (multinomial) | 1 | Yes, well-calibrated | No | Single optimization |
| One-vs-rest (OvR) | K | No (requires calibration) | Yes | K independent problems |
| One-vs-one (OvO) | K(K-1)/2 | No (requires calibration) | Yes | K(K-1)/2 independent problems |
The softmax approach is generally preferred when classes are mutually exclusive and when calibrated probability estimates are needed. OvR is more flexible and works with any binary classifier. OvO is mainly used with SVMs or other classifiers that benefit from training on smaller subsets.
Multi-class logistic regression is mathematically equivalent to the maximum entropy classifier. This equivalence connects two different theoretical traditions: one from statistics (logistic regression) and one from information theory (maximum entropy).
The principle of maximum entropy, introduced by Edwin Jaynes in 1957, states that when choosing a probability distribution subject to known constraints, one should select the distribution with the greatest entropy (i.e., the distribution that makes the fewest additional assumptions beyond the constraints). Entropy measures the uncertainty in a distribution and is defined as:
H(p) = -sum_k p(k) * log(p(k))
In the maximum entropy framework, we want to find a conditional distribution P(y | x) that:

- is a valid probability distribution (non-negative and summing to one over the classes),
- matches the training data, in the sense that the expected value of each feature under the model equals its empirical average in the training set, and
- among all distributions satisfying these constraints, has the maximum entropy.
Solving this constrained optimization problem using Lagrange multipliers yields the log-linear form:
P(y = k | x) = (1 / Z(x)) * exp(w_k^T * x)
where Z(x) = sum_{j=1}^{K} exp(w_j^T * x) is the partition function (normalization constant). This is exactly the softmax regression model. The Lagrange multipliers become the weight parameters of the model.
The maximum entropy perspective provides an information-theoretic justification for the model: among all models consistent with the observed data, softmax regression is the one that introduces the least amount of unwarranted bias. Additionally, maximizing conditional log-likelihood (the standard training objective for logistic regression) is equivalent to minimizing the Kullback-Leibler divergence between the empirical distribution and the model distribution, which is itself a dual formulation of entropy maximization.
This equivalence has been particularly influential in NLP, where maximum entropy classifiers were widely used for tasks such as part-of-speech tagging, named entity recognition, and text classification before the rise of deep learning methods.
Multi-class logistic regression relies on several assumptions:

- The log-odds of each class (relative to a reference class) are a linear function of the input features.
- The observations are independent of one another.
- The classes are mutually exclusive and collectively exhaustive.
- Independence of irrelevant alternatives (IIA): the relative odds between any two classes do not change when other alternatives are added or removed.
- The input features are not perfectly multicollinear.
The performance of a multi-class logistic regression model is typically assessed using the following metrics.
A confusion matrix is a K x K table that compares predicted class labels against true class labels. Each row represents the true class, each column represents the predicted class, and each cell contains the count of examples with that combination. Diagonal cells represent correct predictions.
| Metric | Formula | Description |
|---|---|---|
| Accuracy | Correct predictions / Total predictions | Overall fraction of correct predictions |
| Precision (per class k) | TP_k / (TP_k + FP_k) | Fraction of predictions for class k that are correct |
| Recall (per class k) | TP_k / (TP_k + FN_k) | Fraction of true class k examples that are correctly identified |
| F1 score (per class k) | 2 * Precision_k * Recall_k / (Precision_k + Recall_k) | Harmonic mean of precision and recall |
| Macro-average | Mean of per-class metric | Treats all classes equally |
| Weighted average | Weighted mean by class frequency | Accounts for class imbalance |
| Log loss | -(1/m) * sum of log(P(true class)) | Directly evaluates predicted probabilities |
Log loss (cross-entropy loss) is often preferred for evaluating multi-class logistic regression because it assesses the quality of the predicted probability distribution, not just the predicted label.
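All of these metrics are available in scikit-learn. A brief sketch, assuming true labels, predicted labels, and predicted probabilities from a fitted model such as the one in the implementation section below:

```python
from sklearn.metrics import confusion_matrix, classification_report, log_loss

print(confusion_matrix(y_test, predictions))        # K x K count table
print(classification_report(y_test, predictions))   # per-class precision/recall/F1
print(log_loss(y_test, probabilities))              # scores the full distribution
```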
Multi-class logistic regression is implemented in all major machine learning frameworks.
In scikit-learn, the LogisticRegression class supports multinomial logistic regression. Setting multi_class='multinomial' uses the softmax formulation, while multi_class='ovr' uses the one-vs-rest strategy. Solvers that support the multinomial loss include lbfgs, newton-cg, sag, and saga; the liblinear solver only supports one-vs-rest. In recent versions, the multinomial formulation is selected by default when the problem has more than two classes and the solver supports it.
```python
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(multi_class='multinomial', solver='lbfgs', max_iter=1000)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
probabilities = model.predict_proba(X_test)
```
In PyTorch, multinomial logistic regression is typically implemented as a single linear layer followed by the cross-entropy loss, which internally applies the softmax function:
```python
import torch
import torch.nn as nn

# A single linear layer maps n_features inputs to n_classes raw scores (logits).
model = nn.Linear(n_features, n_classes)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Training loop (n_features, n_classes, X_train, y_train, num_epochs assumed defined)
for epoch in range(num_epochs):
    outputs = model(X_train)            # raw logits, shape (m, n_classes)
    loss = criterion(outputs, y_train)  # y_train holds integer class labels
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```
Note that nn.CrossEntropyLoss in PyTorch expects raw logits (pre-softmax scores), not probabilities. It applies log-softmax internally for numerical stability.
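At inference time, the softmax therefore has to be applied explicitly if probabilities are needed; a short sketch (X_test assumed):

```python
# The model outputs raw logits, so apply softmax manually for probabilities.
with torch.no_grad():
    probabilities = torch.softmax(model(X_test), dim=1)
    predictions = probabilities.argmax(dim=1)
```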
In TensorFlow, a simple multinomial logistic regression model can be built with Keras:
```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(n_classes, activation='softmax', input_shape=(n_features,))
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train_onehot, epochs=100, batch_size=32)
```
Alternatively, using sparse_categorical_crossentropy avoids the need to one-hot encode the labels.
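For example (same model as above, but with integer labels rather than one-hot vectors):

```python
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(X_train, y_train, epochs=100, batch_size=32)
```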
In R, the multinom() function from the nnet package fits multinomial logistic regression models:
```r
library(nnet)

model <- multinom(y ~ x1 + x2 + x3, data = training_data)
summary(model)
predicted <- predict(model, newdata = test_data, type = "class")
```
The glmnet package provides regularized multinomial logistic regression with L1, L2, or elastic net penalties.
Multi-class logistic regression is applied across many fields.
In NLP, the model is used for text classification, sentiment analysis, part-of-speech tagging, and named entity recognition. The maximum entropy classifier (equivalent to multinomial logistic regression) was a standard tool for these tasks before deep learning became dominant. Unlike Naive Bayes, multinomial logistic regression does not assume that features are conditionally independent, which often leads to better performance on text data with correlated features.
Softmax regression is the standard output layer for image classification in neural networks. Even in complex deep learning architectures, the final layer typically applies a linear transformation followed by softmax to produce class probabilities. The MNIST handwritten digit dataset (10 classes) is a classic benchmark where softmax regression serves as a baseline classifier.
In healthcare, multinomial logistic regression is used to predict diagnostic categories based on patient features such as lab results, symptoms, and demographic variables. Its ability to produce calibrated probabilities is valuable in medical decision-making, where clinicians need to assess the relative likelihood of different diagnoses.
The multinomial logit model is widely used in economics for discrete choice analysis, such as predicting consumer preferences among products, transportation mode choice, or occupational selection. The model's connection to random utility theory provides a structural interpretation of the coefficients.
| Domain | Example application |
|---|---|
| Biology | Species classification from morphological measurements |
| Finance | Credit rating prediction (AAA, AA, A, BBB, etc.) |
| Marketing | Customer segmentation into behavioral categories |
| Geology | Rock type classification from mineral composition |
| Ecology | Habitat type prediction from environmental variables |
Multi-class logistic regression is connected to several other models in machine learning and statistics.
| Model | Relationship |
|---|---|
| Binary logistic regression | Multi-class logistic regression reduces to binary logistic regression when K = 2 |
| Neural network with softmax output | A neural network with no hidden layers and a softmax output is exactly multi-class logistic regression |
| Naive Bayes classifier | Both are linear classifiers; Naive Bayes makes stronger independence assumptions but is faster to train |
| Support vector machine | SVMs find maximum-margin decision boundaries; multi-class SVMs typically use OvR or OvO strategies |
| Multinomial probit | Similar purpose but uses a different link function (probit) and does not assume IIA |
| Ordinal logistic regression | Used when categories have a natural ordering; multi-class logistic regression treats categories as unordered |
| Conditional random field (CRF) | CRFs generalize logistic regression to structured prediction over sequences |
| Perceptron | A single-layer network with softmax output trained on cross-entropy loss is equivalent to multi-class logistic regression; the classic perceptron differs in its update rule |