See also: Machine learning terms
In machine learning, the softmax function (also called the softargmax or normalized exponential function) is a mathematical function that converts a vector of real numbers (often called logits) into a probability distribution. Each output value lies between 0 and 1, and all outputs sum to exactly 1. Softmax is the standard final activation function in neural networks for multi-class classification tasks and also plays a central role in the attention mechanism of transformer architectures.
The function originated in statistical mechanics as the Boltzmann distribution and was later adopted in decision theory and machine learning. The name "softmax" reflects that it is a smooth (differentiable) approximation of the argmax function, which returns a one-hot vector with a 1 at the position of the largest input and 0 elsewhere. Where argmax makes a hard selection, softmax makes a soft selection, assigning nonzero probability to every input.
The softmax function has roots spanning statistical physics, decision theory, and computer science.
Statistical mechanics (1868 to 1902). Ludwig Boltzmann first introduced the underlying exponential distribution in his 1868 paper on the equilibrium of kinetic energy among material points. Josiah Willard Gibbs later formalized and popularized the distribution in his influential 1902 textbook Elementary Principles in Statistical Mechanics, where it became known as the Boltzmann distribution (or Gibbs distribution). In this physical context, the function describes the probability that a system occupies a particular microstate given its energy, with the temperature controlling how sharply the distribution concentrates on lower-energy states.
Decision theory (1959). R. Duncan Luce adopted the same mathematical form in his Individual Choice Behavior (1959). Luce's choice axiom states that the relative odds of choosing one option over another are not affected by the presence or absence of other alternatives. This "independence of irrelevant alternatives" property leads directly to the softmax form for modeling choice probabilities, and it became a standard tool in economics and psychology.
Machine learning (1989 to 1990). John S. Bridle is credited with introducing the term "softmax" in two 1989 conference papers (published in proceedings in 1990). Bridle described the function as "a normalised exponential multi-input generalisation of the logistic non-linearity" and argued that it should replace the argmax in feedforward classification networks because it "preserves the rank order of its input values, and is a differentiable generalisation of the 'winner-take-all' operation of picking the maximum value."
Given an input vector z = (z_1, z_2, ..., z_n), the softmax function computes the i-th output as:
softmax(z)_i = exp(z_i) / sum_{j=1}^{n} exp(z_j)
for i in {1, 2, ..., n}.
Here, exp(.) denotes the exponential function, z_i is the i-th element of the input vector, and the denominator sums the exponentials of all elements, serving as a normalization constant.
For example, given the input vector z = [2.0, 1.0, 0.1], the softmax output is approximately [0.659, 0.242, 0.099]. The largest input (2.0) receives the highest probability, but the other elements still receive nonzero probability.
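As a quick illustration (a minimal NumPy sketch, not tied to any particular library API), the example above can be reproduced directly:

```python
import numpy as np

def softmax(z):
    """Softmax as defined above: exponentiate each element, then normalize."""
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

z = np.array([2.0, 1.0, 0.1])
print(softmax(z))  # approximately [0.659 0.242 0.099]
```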
The softmax function produces a valid probability distribution. The sum of all output values equals 1:
sum_{i=1}^{n} softmax(z)_i = 1
This property allows the outputs to be interpreted directly as class probabilities in classification tasks.
Because the exponential function is always positive, every softmax output is strictly greater than zero. No class ever receives exactly zero probability under standard softmax.
Softmax is monotonic: if z_i > z_j, then softmax(z)_i > softmax(z)_j. A higher input value always maps to a higher output probability. This means softmax never reverses the ranking of inputs; it simply rescales them into a probability distribution.
The softmax function is differentiable everywhere, which makes it compatible with gradient-based optimization. The Jacobian of the softmax function has a known closed form:
For i = j: d softmax(z)_i / d z_j = softmax(z)_i * (1 - softmax(z)_i)
For i != j: d softmax(z)_i / d z_j = -softmax(z)_i * softmax(z)_j
This can be written compactly as: d softmax(z) / d z = diag(softmax(z)) - softmax(z) * softmax(z)^T.
These derivatives are used during backpropagation to compute gradients through the softmax layer.
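The closed form is easy to check numerically; here is a minimal NumPy sketch of the Jacobian diag(p) - p p^T, with arbitrary example logits:

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

def softmax_jacobian(z):
    """Jacobian of softmax: diag(p) - p p^T, where p = softmax(z)."""
    p = softmax(z)
    return np.diag(p) - np.outer(p, p)

J = softmax_jacobian(np.array([2.0, 1.0, 0.1]))
print(J)
print(J.sum(axis=1))  # each row sums to ~0, because the outputs always sum to 1
```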
Softmax is invariant to adding a constant to all inputs:
softmax(z + c) = softmax(z) for any scalar c
This property is the mathematical basis of the numerical stability trick discussed below. Note that softmax is not invariant to scaling: multiplying all inputs by a constant changes the output distribution.
A temperature parameter T (or tau) can be introduced to control how "peaked" or "flat" the output distribution is:
softmax(z / T)_i = exp(z_i / T) / sum_{j=1}^{n} exp(z_j / T)
The temperature parameter has a direct effect on the entropy of the output distribution:
| Temperature | Effect | Distribution shape | Typical use |
|---|---|---|---|
| T approaching 0 | Outputs converge to a one-hot vector (argmax) | Very peaked | Hard decisions, greedy decoding |
| T = 1 | Standard softmax | Moderate | Default training and inference |
| T > 1 | Outputs become more uniform | Flattened | Exploration in reinforcement learning, knowledge distillation |
| T approaching infinity | All outputs approach 1/n (uniform) | Flat | Maximum exploration |
Temperature scaling is used extensively in knowledge distillation (Hinton et al., 2015), where a high temperature (T = 2 to 20) softens the teacher model's output distribution so that the student model can learn from the relative probabilities of incorrect classes, not just the top prediction. It is also commonly used during text generation with large language models to control the randomness of the output: lower temperatures produce more deterministic text, while higher temperatures produce more diverse and creative outputs.
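A small sketch of temperature scaling (NumPy, arbitrary example logits) makes the sharpening and flattening behavior concrete:

```python
import numpy as np

def softmax_with_temperature(z, T=1.0):
    """Divide the logits by the temperature T before applying softmax."""
    exp_z = np.exp(np.asarray(z) / T)
    return exp_z / exp_z.sum()

logits = [2.0, 1.0, 0.1]
print(softmax_with_temperature(logits, T=0.5))   # sharper than standard softmax
print(softmax_with_temperature(logits, T=1.0))   # standard softmax: [0.659 0.242 0.099]
print(softmax_with_temperature(logits, T=10.0))  # close to uniform: each value near 1/3
```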
A naive implementation of softmax can cause numerical overflow or underflow. When the input values are large and positive, exp(z_i) can overflow to infinity. When the input values are large and negative, exp(z_i) can underflow to zero, and the denominator can become zero, causing a division-by-zero error.
The standard solution exploits softmax's invariance to constant shifts. By subtracting the maximum input value before computing the exponentials, the largest exponent becomes exp(0) = 1, preventing overflow:
softmax(z)_i = exp(z_i - max(z)) / sum_{j=1}^{n} exp(z_j - max(z))
This is mathematically equivalent to the original definition but numerically stable. All major deep learning frameworks (PyTorch, TensorFlow, JAX) implement this trick internally.
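The shift-by-the-maximum trick is simple to verify; a minimal NumPy sketch with illustrative inputs where a naive implementation would overflow:

```python
import numpy as np

def softmax_stable(z):
    """Numerically stable softmax: subtract max(z) before exponentiating."""
    z = np.asarray(z, dtype=np.float64)
    exp_z = np.exp(z - z.max())  # largest exponent is exp(0) = 1, so no overflow
    return exp_z / exp_z.sum()

z = [1000.0, 999.0, 998.0]
# np.exp(1000.0) overflows to inf, so the naive formula would return nan;
# the shifted version is mathematically identical and stays finite.
print(softmax_stable(z))  # approximately [0.665 0.245 0.090]
```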
For the related log-softmax computation (log of the softmax output), the log-sum-exp (LSE) trick is used:
log softmax(z)_i = z_i - log(sum_{j=1}^{n} exp(z_j))
The log-sum-exp is computed stably as:
log(sum_{j=1}^{n} exp(z_j)) = max(z) + log(sum_{j=1}^{n} exp(z_j - max(z)))
Computing log-softmax directly (rather than computing softmax and then taking the log) avoids taking the logarithm of very small numbers, which would result in large negative values with poor floating-point precision. This is why PyTorch provides torch.nn.LogSoftmax and torch.nn.functional.log_softmax as separate operations, and why nn.CrossEntropyLoss internally combines log-softmax with negative log-likelihood loss for numerical stability.
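A minimal sketch of a stable log-softmax built on the log-sum-exp trick (NumPy, illustrative inputs):

```python
import numpy as np

def log_softmax(z):
    """log softmax(z)_i = z_i - logsumexp(z), with logsumexp computed stably."""
    z = np.asarray(z, dtype=np.float64)
    m = z.max()
    lse = m + np.log(np.exp(z - m).sum())  # log(sum_j exp(z_j)) without overflow
    return z - lse

print(log_softmax([1000.0, 999.0, 998.0]))  # approximately [-0.408 -1.408 -2.408]
```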
The softmax function is mathematically identical to the Boltzmann distribution (or Gibbs distribution) from statistical mechanics. The correspondence is direct:
| Softmax concept | Statistical mechanics equivalent |
|---|---|
| Input values z_i | Negative energy of microstate i (-E_i) |
| Temperature T | Thermodynamic temperature (kT, where k is Boltzmann's constant) |
| Denominator (sum of exponentials) | Partition function Z |
| Softmax output (probability of class i) | Probability of the system being in microstate i |
In the physics formulation, the probability of a system being in microstate i with energy E_i at temperature T is:
P(i) = exp(-E_i / kT) / Z, where Z = sum_j exp(-E_j / kT)
The sign convention differs (physics uses negative energy, machine learning uses positive logits), but the mathematical structure is identical. The temperature parameter in softmax directly corresponds to the physical temperature: at low temperature, the distribution concentrates on the lowest-energy (highest-logit) state, while at high temperature, all states become equally likely.
This connection is more than a historical curiosity. Methods from statistical physics, such as simulated annealing and energy-based models, exploit the same exponential-distribution machinery. The partition function Z is central to both domains and is generally intractable to compute exactly for large systems, motivating approximation techniques in both fields.
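As a small illustration of the correspondence (NumPy, with made-up energy values and kT set to 1), the Boltzmann probabilities are simply a softmax of the negated, temperature-scaled energies:

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - z.max())
    return exp_z / exp_z.sum()

energies = np.array([0.5, 1.0, 2.0])  # hypothetical microstate energies E_i
kT = 1.0                              # temperature times Boltzmann's constant
print(softmax(-energies / kT))        # P(i) = exp(-E_i / kT) / Z; lowest energy is most probable
```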
Softmax is the standard activation function in the output layer of neural networks for multi-class classification. Given a network that produces a vector of raw scores (logits) for each class, softmax converts these logits into a probability distribution over classes.
For a classification problem with K classes, the final layer typically has K output neurons, one per class. The softmax function is applied to these K outputs, and the predicted class is the one with the highest probability (argmax of the softmax output).
This pattern is used across many architectures, from convolutional image classifiers to transformer-based language models.
In large language models, the softmax layer converts the final hidden state into a probability distribution over the entire vocabulary (which can contain 32,000 to 256,000 tokens). This is one of the most computationally expensive operations in language model inference because it scales linearly with vocabulary size.
The sigmoid function is the special case of softmax for binary (two-class) classification. For two classes with logits z_1 and z_2, the softmax probability for class 1 is:
softmax(z)_1 = exp(z_1) / (exp(z_1) + exp(z_2)) = 1 / (1 + exp(-(z_1 - z_2))) = sigmoid(z_1 - z_2)
This means that for binary classification, using a single output neuron with a sigmoid activation is mathematically equivalent to using two output neurons with softmax. In practice, binary classification typically uses sigmoid because it is simpler (one output instead of two).
More generally, sigmoid can be viewed as the softmax over two categories where one of the logits is fixed at zero.
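The equivalence can be checked numerically; a small sketch (NumPy, arbitrary logits):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(z):
    exp_z = np.exp(z - z.max())
    return exp_z / exp_z.sum()

z1, z2 = 1.3, -0.4
print(softmax(np.array([z1, z2]))[0])  # class-1 probability from a two-logit softmax
print(sigmoid(z1 - z2))                # same value from a sigmoid of the difference
```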
| Property | Sigmoid | Softmax |
|---|---|---|
| Number of classes | 2 (binary) | Any K >= 2 |
| Output range | (0, 1) scalar | (0, 1) vector summing to 1 |
| Number of output neurons | 1 | K |
| Mathematical relationship | Special case of softmax | Generalization of sigmoid |
| Multi-label support | Yes (independent per class) | No (outputs sum to 1, so classes compete) |
For multi-label classification, where each input can belong to multiple classes simultaneously, sigmoid is applied independently to each output neuron (since the classes are not mutually exclusive). Softmax is inappropriate for multi-label tasks because its outputs are constrained to sum to 1.
Softmax plays a different but equally important role in the attention mechanism of transformer models. In scaled dot-product attention (Vaswani et al., 2017), softmax converts raw attention scores into attention weights:
Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V
where Q (queries), K (keys), and V (values) are matrices derived from the input, and d_k is the dimension of the key vectors.
The scaling factor 1/sqrt(d_k) is applied before the softmax to prevent the dot products from growing too large as the dimension increases. Without scaling, large dot products push softmax into regions of extremely small gradients (because the output becomes nearly one-hot), slowing down training. This scaling issue was explicitly noted in the original "Attention Is All You Need" paper by Vaswani et al.
In multi-head attention, softmax is applied independently within each attention head, producing a separate set of attention weights per head.
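A minimal single-head sketch of scaled dot-product attention (NumPy, toy shapes, no batching or masking) shows where the softmax sits:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # row-wise stability shift
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V for a single head."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # raw attention scores, shape (seq_len, seq_len)
    weights = softmax(scores, axis=-1)    # each row is a probability distribution over keys
    return weights @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
```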
Softmax was chosen for attention for several reasons: it converts arbitrary real-valued scores into normalized weights that sum to 1 and can be read as a distribution over the keys, it is differentiable everywhere so the whole mechanism can be trained end to end by gradient descent, and it preserves the ranking of the raw scores.
However, softmax also introduces challenges. Because it always assigns nonzero weight to every key, every token attends to every other token (at least slightly). This "dense" attention pattern means the full seq_len x seq_len attention matrix must be computed, leading to O(n^2) time and memory complexity in the sequence length n.
In autoregressive models (like GPT and other decoder-only transformers), causal masking ensures that each token can only attend to previous tokens. Before applying softmax, the attention scores for future positions are set to negative infinity:
S_masked[i][j] = S[i][j] if j <= i, else -infinity
When softmax is applied, exp(-infinity) = 0, so future tokens receive exactly zero attention weight. This ensures the autoregressive property is maintained while still using standard softmax computation.
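A short sketch of the masking step (NumPy, toy scores) shows how the negative-infinity entries become exact zeros after softmax:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

seq_len = 4
scores = np.random.default_rng(0).normal(size=(seq_len, seq_len))

# Causal mask: position i may only attend to positions j <= i.
future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
masked_scores = np.where(future, -np.inf, scores)

weights = softmax(masked_scores, axis=-1)
print(np.round(weights, 3))  # upper triangle is exactly 0: future tokens get no attention
```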
For a model with L layers and H heads per layer, each forward pass requires L * H independent softmax operations over (seq_len x seq_len) matrices. For a model like GPT-3 (96 layers, 96 heads) processing a 2048-token sequence, this amounts to 9,216 softmax operations per forward pass, each over a matrix with over 4 million entries.
The quadratic memory cost of storing the full attention matrix is a major bottleneck for processing long sequences. FlashAttention (Dao et al., 2022) addresses this by computing attention in blocks using an online softmax algorithm (Milakov and Gimelshein, 2018), keeping intermediate results in fast on-chip SRAM rather than materializing the full attention matrix. FlashAttention produces numerically identical results to standard attention while using O(n) memory instead of O(n^2) and running 2 to 4 times faster due to reduced memory I/O.
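The online softmax idea can be sketched in a few lines: keep a running maximum and a running, rescaled sum of exponentials so that the normalizer is built up in a single pass. This is only the core recurrence, not FlashAttention itself (which additionally tiles the attention matrix and rescales partial outputs):

```python
import numpy as np

def online_softmax(z):
    """Compute the softmax normalizer in one pass using a running max and rescaled sum."""
    m = -np.inf   # running maximum seen so far
    d = 0.0       # running sum of exp(z_j - m)
    for x in z:
        m_new = max(m, x)
        d = d * np.exp(m - m_new) + np.exp(x - m_new)  # rescale old sum to the new max
        m = m_new
    return np.exp(np.asarray(z) - m) / d

z = np.random.default_rng(0).normal(size=8)
reference = np.exp(z - z.max()) / np.exp(z - z.max()).sum()
print(np.allclose(online_softmax(z), reference))  # True
```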
Softmax is almost always paired with cross-entropy loss in classification tasks. The cross-entropy loss for a single example with true class y is:
L = -log(softmax(z)_y) = -z_y + log(sum_{j=1}^{n} exp(z_j))
The gradient of this combined loss with respect to the logits has a particularly clean form:
dL/dz_i = softmax(z)_i - 1_{i=y}
where 1_{i=y} is 1 if i equals the true class y and 0 otherwise. This means the gradient is simply the softmax output minus the one-hot target vector. This simplicity (and its numerical stability when computed as a single fused operation) is why deep learning frameworks provide combined softmax-cross-entropy functions like PyTorch's nn.CrossEntropyLoss and TensorFlow's tf.nn.softmax_cross_entropy_with_logits.
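A small sketch of the combined gradient (NumPy, arbitrary logits): it is just the softmax output with 1 subtracted at the true class.

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - z.max())
    return exp_z / exp_z.sum()

def softmax_cross_entropy_grad(z, y):
    """Gradient of -log softmax(z)_y with respect to the logits: softmax(z) - one_hot(y)."""
    grad = softmax(z)
    grad[y] -= 1.0
    return grad

z = np.array([2.0, 1.0, 0.1])
print(softmax_cross_entropy_grad(z, y=0))  # approximately [-0.341  0.242  0.099]
```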
The log-softmax function computes the logarithm of the softmax output directly:
log_softmax(z)_i = z_i - log(sum_{j=1}^{n} exp(z_j))
Computing log-softmax as a single fused operation (rather than computing softmax and then taking the log) is numerically superior. Taking the logarithm of very small softmax outputs produces large negative numbers with poor floating-point precision. The fused computation avoids this by working in log-space throughout.
Log-softmax is used whenever the logarithm of probabilities is needed, which is common in several contexts:
- Cross-entropy loss (the negative log-probability of the true class): nn.CrossEntropyLoss computes this as a single fused operation for numerical stability.
- Negative log-likelihood loss: nn.NLLLoss expects log-probabilities as input. The standard pattern is to apply nn.LogSoftmax to the network output and then pass the result to nn.NLLLoss. This two-step approach is mathematically equivalent to nn.CrossEntropyLoss but gives the user access to the intermediate log-probabilities.

| Computation | Numerical stability | Speed | When to use |
|---|---|---|---|
| softmax(z) then log() | Poor (log of small numbers) | Slower (two operations) | Avoid |
| log_softmax(z) directly | Good (fused log-sum-exp) | Faster (single operation) | Whenever log-probabilities are needed |
| CrossEntropyLoss (logits input) | Best (fully fused) | Fastest | Standard classification training |
In reinforcement learning, the softmax function is used as an action selection policy known as Boltzmann exploration (or softmax exploration). Given estimated action values Q(s, a) for each action a in state s, the probability of selecting action a is:
P(a | s) = exp(Q(s, a) / T) / sum_{a'} exp(Q(s, a') / T)
This approach offers a more nuanced alternative to epsilon-greedy exploration. While epsilon-greedy selects a random action with fixed probability epsilon (treating all non-greedy actions equally), Boltzmann exploration assigns higher selection probabilities to actions with higher estimated values. Actions that the agent believes are nearly as good as the best action are selected much more often than actions the agent believes are poor.
The temperature parameter T controls the exploration-exploitation tradeoff: low temperatures make the policy nearly greedy (mostly exploiting the highest-valued action), while high temperatures make action selection nearly uniform (maximum exploration).
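A hedged sketch of Boltzmann action selection (NumPy, made-up action values):

```python
import numpy as np

def boltzmann_policy(q_values, T=1.0):
    """P(a|s) = exp(Q(s,a)/T) / sum_a' exp(Q(s,a')/T)."""
    z = np.asarray(q_values) / T
    exp_z = np.exp(z - z.max())
    return exp_z / exp_z.sum()

rng = np.random.default_rng(0)
q = np.array([1.0, 0.9, -2.0])          # estimated action values; the last action is clearly bad
probs = boltzmann_policy(q, T=0.5)
action = rng.choice(len(q), p=probs)    # sample an action from the softmax policy
print(probs)  # near-best actions are selected often; the bad action almost never
```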
Sutton and Barto discuss softmax action selection in Reinforcement Learning: An Introduction (Section 2.3), noting that it is especially useful when the worst actions are very bad and should be avoided even during exploration, unlike epsilon-greedy which gives them equal chance.
Yang et al. (2017) identified an expressiveness limitation they called the "softmax bottleneck" in neural language models. When a language model uses a softmax output layer with a low-rank weight matrix (which is common because the hidden dimension is much smaller than the vocabulary size), the model cannot represent certain high-rank distributions over the next token.
Formally, if the hidden dimension d is less than the rank of the true log-probability matrix minus 1, then no setting of the output weight matrix can produce the correct distribution for all contexts. This limits the model's ability to capture the full complexity of natural language distributions.
To address this, Yang et al. proposed Mixture of Softmaxes (MoS), which computes multiple softmax distributions and combines them as a weighted mixture. MoS can represent higher-rank distributions and achieved substantial perplexity improvements on Penn Treebank (47.69) and WikiText-2 (40.68) benchmarks. The paper was published at ICLR 2018.
Modern large language models mitigate this bottleneck through much larger hidden dimensions (4,096 to 16,384 or more), which make the rank constraint less binding in practice.
For models with very large output vocabularies, computing the full softmax is expensive because the normalization constant requires summing over all vocabulary entries. Hierarchical softmax (Morin and Bengio, 2005) reduces this cost from O(V) to O(log V), where V is the vocabulary size.
The idea is to organize the vocabulary as a binary tree, with each word at a leaf node. Instead of computing a single softmax over V classes, the model makes a sequence of binary decisions along the path from the root to the target leaf. Each internal node has a learned binary classifier. The probability of a word is the product of the probabilities along its path from the root.
Mikolov et al. (2013) used hierarchical softmax with a Huffman tree in Word2Vec, assigning shorter paths to more frequent words. This made training on large corpora (billions of words) practical; for a vocabulary of 100,000 words, it reduced the per-example cost from 100,000 operations to roughly 17 (log_2 of 100,000). Hierarchical softmax is also implemented in Facebook's fastText library.
Hierarchical softmax has largely fallen out of use in modern large language models, which rely on GPU parallelism to compute the full softmax efficiently. However, it remains relevant for resource-constrained settings or extremely large output spaces.
Sparsemax (Martins and Astudillo, 2016) projects the input onto the probability simplex using an Euclidean projection rather than the exponential mapping used by softmax. The result is a sparse probability distribution: many outputs are exactly zero, and only the most relevant inputs receive nonzero probability. Sparsemax is differentiable almost everywhere (with a well-defined subgradient at non-differentiable points) and can be computed efficiently.
Sparsemax is useful in attention mechanisms where only a few tokens should receive attention, or in classification problems where the model should commit to a small subset of classes.
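For concreteness, a minimal sketch of the sparsemax projection (NumPy, following the sort-and-threshold procedure; the example logits are arbitrary):

```python
import numpy as np

def sparsemax(z):
    """Sparsemax: Euclidean projection of z onto the probability simplex."""
    z = np.asarray(z, dtype=np.float64)
    z_sorted = np.sort(z)[::-1]                 # sort in decreasing order
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    in_support = 1 + k * z_sorted > cumsum      # entries that remain nonzero
    k_z = k[in_support][-1]                     # support size
    tau = (cumsum[in_support][-1] - 1) / k_z    # threshold
    return np.maximum(z - tau, 0.0)

print(sparsemax([1.0, 0.8, 0.1]))  # [0.6 0.4 0. ]: the smallest logit is zeroed out exactly
```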
Entmax (Peters et al., 2019) generalizes both softmax and sparsemax into a parametric family indexed by alpha. When alpha = 1, entmax reduces to softmax; when alpha = 2, it reduces to sparsemax. Intermediate values of alpha produce distributions with intermediate levels of sparsity. The parameter alpha can even be learned jointly with the model parameters.
The entmax family is based on Tsallis entropy, a generalization of Shannon entropy. Higher alpha values encourage sparser outputs.
| Function | Sparsity | Key property | When to use |
|---|---|---|---|
| Softmax (alpha = 1) | None (all outputs > 0) | Smooth, well-understood | Default for classification and attention |
| 1.5-entmax | Moderate | Tunable sparsity | When moderate sparsity is desired |
| Sparsemax (alpha = 2) | High (many exact zeros) | Euclidean projection | Sparse attention, interpretable models |
The Gumbel-softmax (Jang et al., 2016; Maddison et al., 2016) is a continuous relaxation of discrete categorical sampling. Standard sampling from a categorical distribution is not differentiable, which prevents gradients from flowing through discrete choices during training. Gumbel-softmax solves this by adding Gumbel noise to the log-probabilities and applying softmax with a low temperature:
y_i = exp((log(pi_i) + g_i) / T) / sum_j exp((log(pi_j) + g_j) / T)
where g_i are independent samples from the standard Gumbel distribution (g = -log(-log(u)), u ~ Uniform(0,1)) and T is a temperature parameter. As T approaches 0, the samples approach one-hot vectors (true discrete samples). At higher T, the output is a soft approximation that allows gradient computation.
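A minimal sketch of Gumbel-softmax sampling (NumPy; the class probabilities pi are made up for illustration):

```python
import numpy as np

def gumbel_softmax_sample(log_probs, T=1.0, rng=None):
    """Add Gumbel noise to log-probabilities, then apply a temperature-scaled softmax."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.uniform(size=len(log_probs))
    g = -np.log(-np.log(u))                    # standard Gumbel(0, 1) noise
    z = (np.asarray(log_probs) + g) / T
    exp_z = np.exp(z - z.max())
    return exp_z / exp_z.sum()

log_pi = np.log([0.5, 0.3, 0.2])               # log-probabilities of a categorical distribution
print(gumbel_softmax_sample(log_pi, T=0.1))    # nearly one-hot: close to a true discrete sample
print(gumbel_softmax_sample(log_pi, T=5.0))    # soft and nearly uniform, but differentiable
```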
Gumbel-softmax is widely used in variational autoencoders with discrete latent variables, text generation with GANs, and any architecture that needs to make differentiable discrete selections during training.
All major deep learning frameworks provide optimized, numerically stable implementations of softmax and related operations.
PyTorch:
- torch.nn.Softmax(dim): Module form, applies softmax along the specified dimension.
- torch.nn.functional.softmax(input, dim): Functional form.
- torch.nn.LogSoftmax(dim): Computes log-softmax in a numerically stable way.
- torch.nn.functional.log_softmax(input, dim): Functional form of log-softmax.
- torch.nn.CrossEntropyLoss: Combines log-softmax and NLLLoss into a single operation. Accepts raw logits as input (not softmax outputs).
- torch.nn.functional.gumbel_softmax(logits, tau, hard): Gumbel-softmax sampling with optional straight-through estimator.

TensorFlow / Keras:
- tf.nn.softmax(logits, axis): Standard softmax.
- tf.nn.log_softmax(logits, axis): Numerically stable log-softmax.
- tf.nn.softmax_cross_entropy_with_logits(labels, logits): Fused softmax + cross-entropy.
- tf.keras.layers.Softmax(axis): Keras layer form.

A common mistake is to apply softmax to the network output and then pass the result to a loss function that internally applies softmax again (double softmax). This leads to degraded training performance. Both PyTorch's nn.CrossEntropyLoss and TensorFlow's tf.nn.softmax_cross_entropy_with_logits expect raw logits as input, not softmax outputs.
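A short PyTorch sketch of the correct pattern (random logits and labels, purely illustrative):

```python
import torch
import torch.nn as nn

logits = torch.randn(4, 10)              # raw network outputs: batch of 4, 10 classes
targets = torch.tensor([1, 0, 9, 3])     # ground-truth class indices

loss_fn = nn.CrossEntropyLoss()
loss = loss_fn(logits, targets)          # correct: pass raw logits, not softmax outputs
print(loss.item())

# Incorrect (double softmax): loss_fn(torch.softmax(logits, dim=1), targets)
# still runs, but the loss is computed on softmax-of-softmax and training degrades.
```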
Beyond its use as a final layer in classifiers and in attention mechanisms, softmax appears in a number of other contexts across machine learning, such as the gating networks of mixture-of-experts layers.
Imagine you and your friends each got different scores on a test. Softmax is like a machine that looks at everyone's scores and turns them into "chances of winning a prize." The person with the highest score gets the biggest chance, but everyone still gets at least a tiny chance. And when you add up all the chances, they equal exactly 100%. So if your scores were 10, 5, and 2, softmax would say something like: "Player 1 has about a 99% chance, Player 2 has less than a 1% chance, and Player 3 has only a tiny fraction of a percent." The bigger your score, the bigger your slice of the pie. But nobody's slice is ever exactly zero.