The full softmax (also called the standard softmax or exact softmax) is the complete computation of the softmax function over all possible output classes in a neural network. In contrast to approximate methods such as hierarchical softmax, negative sampling, or adaptive softmax, the full softmax evaluates every class in the output vocabulary to produce an exact probability distribution. While this guarantees mathematically precise probabilities, it becomes a computational bottleneck when the number of output classes is very large, as is common in language models with vocabularies of hundreds of thousands of tokens.
Imagine you have a huge bag with thousands of different colored marbles. You want to figure out the chance of picking each color. Full softmax is like carefully weighing every single marble in the bag, one by one, and then calculating the exact chance for each color. This gives you a perfect answer, but if the bag has a million marbles, it takes a really long time. Some faster methods skip most of the marbles and only look at a few of them to get a "good enough" guess. Full softmax never takes shortcuts; it always checks every marble.
The softmax function was first used in statistical mechanics as the Boltzmann distribution, appearing in foundational work by Ludwig Boltzmann in 1868 and formalized by Josiah Willard Gibbs in 1902. In decision theory, R. Duncan Luce applied the axiom of independence of irrelevant alternatives to formulate what is now known as Luce's choice axiom (1959). The modern term "softmax" was coined by John S. Bridle in two 1989 conference papers, where he described it as a "differentiable generalisation of the winner-take-all operation" that "preserves the rank order of its input values."
Given an input vector z = (z_1, z_2, ..., z_K) of raw scores (called logits) from the final layer of a neural network, the full softmax function computes the probability for each class i as:
$$\sigma(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$
where K is the total number of classes, e is Euler's number (approximately 2.71828), and the denominator sums the exponentials of all logit values. The output is a vector of K probabilities that are all positive and sum to exactly 1.
For example, given an input vector (1, 2, 3, 4, 1, 2, 3), the full softmax yields approximately (0.024, 0.064, 0.175, 0.475, 0.024, 0.064, 0.175). The largest input value (4) receives the highest probability (0.475), while the relative ordering of all inputs is preserved.
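As a minimal illustration, the following NumPy sketch reproduces this example using the naive form of the computation (a numerically stable variant is discussed later in this section):

```python
import numpy as np

def full_softmax(z):
    """Exact softmax over every entry of the logit vector z (naive form)."""
    exp_z = np.exp(z)              # exponentiate each logit
    return exp_z / exp_z.sum()     # normalize so the outputs sum to 1

z = np.array([1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0])
probs = full_softmax(z)
print(np.round(probs, 3))   # [0.024 0.064 0.175 0.475 0.024 0.064 0.175]
print(probs.sum())          # 1.0
```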
The softmax function can be parameterized with a temperature value T (or equivalently, an inverse temperature parameter beta = 1/T):
$$\sigma(\mathbf{z})_i = \frac{e^{z_i / T}}{\sum_{j=1}^{K} e^{z_j / T}}$$
The temperature controls the "sharpness" of the output distribution:
| Temperature | Effect on distribution | Behavior |
|---|---|---|
| T = 1 | Standard softmax | Default behavior; no scaling applied |
| T < 1 (low) | Sharper, more peaked | The highest-probability class dominates; the model appears more "confident" |
| T > 1 (high) | Flatter, more uniform | Probabilities spread more evenly across classes; the model appears less "confident" |
| T approaches 0 | Approaches argmax | Output converges to a one-hot vector selecting the maximum logit |
| T approaches infinity | Approaches uniform | All classes receive equal probability of 1/K |
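A minimal NumPy sketch of the temperature-scaled softmax defined above (the function and variable names are illustrative):

```python
import numpy as np

def softmax_with_temperature(z, T=1.0):
    """Temperature-scaled softmax; T < 1 sharpens the distribution, T > 1 flattens it."""
    z = np.asarray(z, dtype=float) / T
    z = z - z.max()                 # shift for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

logits = [1.0, 2.0, 3.0, 4.0]
for T in (0.5, 1.0, 2.0):
    print(T, np.round(softmax_with_temperature(logits, T), 3))
# T = 0.5 concentrates mass on the largest logit; T = 2.0 spreads it more evenly
```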
Temperature scaling is widely used in large language models such as GPT and LLaMA to control the randomness of text generation. Lower temperatures produce more deterministic and repetitive text, while higher temperatures increase diversity and creativity at the cost of coherence. Temperature scaling is also used for model calibration, where a single scalar T is learned on a validation set to make predicted probabilities better reflect true likelihoods.
The full softmax function has several mathematical properties that make it useful in machine learning:
| Property | Description |
|---|---|
| Output range | Each output value lies strictly in the open interval (0, 1), meaning no output is ever exactly 0 or exactly 1 |
| Normalization | All output values sum to exactly 1, forming a valid probability distribution |
| Translation invariance | Adding a constant c to every element of the input vector does not change the output: sigma(z + c) = sigma(z). This is because the constant cancels in the ratio of exponentials |
| Monotonicity | The function preserves the rank ordering of inputs. If z_i > z_j, then sigma(z)_i > sigma(z)_j |
| Differentiability | The function is continuously differentiable everywhere, making it compatible with gradient descent and backpropagation |
| Geometric interpretation | The softmax maps R^K to the interior of the standard (K-1)-simplex, reducing the effective dimensionality by one due to the unit-sum constraint |
| LogSumExp relationship | The softmax is the gradient of the LogSumExp function. That is, the partial derivative of LSE with respect to z_i equals sigma(z)_i |
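Two of these properties, translation invariance and the LogSumExp relationship, can be checked numerically with a short NumPy sketch:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([1.0, 2.0, 3.0])
c = 10.0

# Translation invariance: adding a constant to every logit leaves the output unchanged
print(np.allclose(softmax(z), softmax(z + c)))     # True

# LogSumExp relationship: the gradient of log(sum(exp(z))) is the softmax itself
eps = 1e-6
lse = lambda v: np.log(np.exp(v).sum())
numeric_grad = np.array([(lse(z + eps * np.eye(3)[i]) - lse(z)) / eps for i in range(3)])
print(np.allclose(numeric_grad, softmax(z), atol=1e-4))   # True
```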
The partial derivative of the softmax output with respect to input logits is:
$$\frac{\partial \sigma_i}{\partial z_j} = \sigma_i (\delta_{ij} - \sigma_j)$$
where delta_ij is the Kronecker delta (1 when i = j, 0 otherwise). When paired with the cross-entropy loss function, the gradient simplifies to:
$$\frac{\partial L}{\partial z_i} = \sigma_i - y_i$$
where y_i is the one-hot encoded true label. This clean gradient expression (predicted probability minus true label) is one of the reasons the softmax plus cross-entropy combination is so widely used in classification tasks. The simplicity of this gradient also makes the backward pass computationally efficient.
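Both the Jacobian and the simplified cross-entropy gradient can be verified with a small NumPy sketch (the logits and label here are arbitrary):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([0.5, 1.5, -0.2, 2.0])
y = np.array([0.0, 0.0, 0.0, 1.0])         # one-hot true label
s = softmax(z)

# Jacobian of the softmax: d(sigma_i)/d(z_j) = sigma_i * (delta_ij - sigma_j)
jacobian = np.diag(s) - np.outer(s, s)

# Gradient of cross-entropy loss L = -sum(y * log(sigma)) with respect to the logits
grad = s - y                                # the "predicted minus true" form
# The same gradient via the chain rule: dL/dz = J^T (dL/dsigma), with dL/dsigma = -y / sigma
print(np.allclose(jacobian.T @ (-y / s), grad))   # True
```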
The "full" in full softmax refers to the fact that the denominator requires summing over all K output classes. This computation has time complexity O(K) for each input example, which is acceptable when K is small (e.g., 10 classes in digit recognition or 1,000 classes in ImageNet). However, in many real-world applications, K can be extremely large:
| Application | Typical vocabulary/class size (K) |
|---|---|
| Digit classification (MNIST) | 10 |
| Image classification (ImageNet) | 1,000 |
| Part-of-speech tagging | 50 to 100 |
| Named entity recognition | 10 to 50 |
| Word-level language modeling | 50,000 to 200,000 |
| Subword language modeling (BPE) | 30,000 to 100,000 |
| Machine translation output | 30,000 to 100,000 |
| Web-scale recommendation | Millions |
For a language model with a vocabulary of 100,000 tokens and a hidden dimension of 2,048, the output projection matrix alone contains approximately 200 million parameters. At each training step, computing the full softmax requires a matrix-vector multiplication of size (K x d), followed by exponentiation and normalization across all K values. When K is in the hundreds of thousands, this computation dominates the training cost and can account for a majority of the total floating-point operations per step.
The memory cost is also significant. The output embedding matrix of size K x d must be stored in GPU memory, and the full vector of K softmax probabilities must be materialized for each example in the batch during both the forward and backward passes.
A naive implementation of the softmax function is prone to numerical overflow and underflow. When logit values are large and positive, computing e^(z_i) can exceed the range of floating-point representation, producing infinity. When logit values are large and negative, e^(z_i) underflows to zero, which can cause division-by-zero errors.
The standard solution is the log-sum-exp trick (also called the "exp-normalize trick" or "safe softmax"). Before computing the exponentials, the maximum value m = max(z_1, ..., z_K) is subtracted from every logit:
$$\sigma(\mathbf{z})_i = \frac{e^{z_i - m}}{\sum_{j=1}^{K} e^{z_j - m}}$$
This transformation is valid because the softmax is translation-invariant. After the shift, the largest exponent is e^0 = 1, which prevents overflow. The corresponding log-softmax can be computed as:
$$\log \sigma(\mathbf{z})_i = (z_i - m) - \log \sum_{j=1}^{K} e^{z_j - m}$$
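A minimal NumPy sketch of the max-subtraction trick; with logits around 1000, naive exponentiation would overflow, but the shifted version does not:

```python
import numpy as np

def stable_softmax(z):
    """Softmax with the max-subtraction (log-sum-exp) trick."""
    m = z.max()
    e = np.exp(z - m)            # largest exponent is exp(0) = 1, so no overflow
    return e / e.sum()

def log_softmax(z):
    m = z.max()
    return (z - m) - np.log(np.exp(z - m).sum())

z = np.array([1000.0, 1001.0, 1002.0])    # naive np.exp(z) would overflow to inf
print(stable_softmax(z))                   # [0.09003057 0.24472847 0.66524096]
print(np.exp(log_softmax(z)).sum())        # 1.0
```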
All major deep learning frameworks, including PyTorch, TensorFlow, and JAX, implement this numerically stable version by default. In practice, combining the softmax and cross-entropy into a single fused operation (such as PyTorch's CrossEntropyLoss or TensorFlow's tf.nn.softmax_cross_entropy_with_logits) provides even better numerical stability by avoiding explicit computation of the intermediate probabilities.
Because full softmax becomes expensive for large output spaces, researchers have developed numerous approximation techniques. These generally fall into two categories: softmax-based approaches that restructure the computation, and sampling-based approaches that avoid computing the full normalization constant.
Hierarchical softmax (H-Softmax), introduced by Morin and Bengio in 2005, replaces the flat output layer with a binary tree structure in which each word is a leaf node. Instead of computing probabilities over all K classes, the model traverses a path from the root to the target leaf, making a binary decision at each internal node. This reduces the computational complexity from O(K) to O(log_2 K).
Using a Huffman tree (as in Word2Vec) assigns shorter paths to more frequent words, further reducing the average computation. Hierarchical softmax achieves training speedups of 25x to 100x compared to full softmax. However, it has drawbacks: the tree structure introduces a dependency on the quality of the word hierarchy, and at test time the full distribution over all words must still be computed when ranking is needed. It also tends to perform poorly on small vocabularies.
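A toy sketch of how a word's probability is assembled from binary decisions along its tree path; the tree layout, variable names, and parameterization here are illustrative rather than the Word2Vec implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hierarchical_word_prob(h, path_nodes, path_directions, node_vectors):
    """Probability of one word: a product of sigmoid decisions at each internal node."""
    prob = 1.0
    for node, direction in zip(path_nodes, path_directions):
        # One binary decision per internal node: sigma(+score) for one branch, sigma(-score) for the other
        prob *= sigmoid(direction * node_vectors[node] @ h)
    return prob

# Hypothetical toy tree over 4 words: each path has log2(4) = 2 binary decisions
rng = np.random.default_rng(0)
node_vectors = rng.normal(size=(3, 8))      # 3 internal nodes suffice for 4 leaves
h = rng.normal(size=8)                      # hidden state of the model
paths = {                                    # (internal nodes on path, turn directions)
    "w0": ([0, 1], [+1, +1]), "w1": ([0, 1], [+1, -1]),
    "w2": ([0, 2], [-1, +1]), "w3": ([0, 2], [-1, -1]),
}
probs = {w: hierarchical_word_prob(h, n, d, node_vectors) for w, (n, d) in paths.items()}
print(sum(probs.values()))                   # sums to 1 over all leaves
```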
Adaptive softmax, proposed by Grave et al. in 2017, exploits Zipf's law (the observation that a small fraction of words account for most of the probability mass in natural language). It partitions the vocabulary into clusters based on word frequency, assigning smaller projection dimensions to infrequent word clusters. This approach is specifically optimized for GPU computation, taking advantage of efficient matrix-matrix operations. Experiments on benchmarks such as EuroParl and the One Billion Word dataset showed speedups of 2x to 10x over full softmax with minimal loss in perplexity.
Differentiated softmax (D-Softmax) allocates varying embedding dimensions to words based on their frequency. Frequent words receive larger embeddings, while rare words receive smaller ones. Unlike hierarchical softmax, the speedup persists at test time because fewer parameters are needed overall. The trade-off is that rare words may be modeled less accurately due to their smaller representation.
Noise contrastive estimation reformulates the softmax computation as a binary classification problem. Instead of computing probabilities over all K classes, the model learns to distinguish between true data samples and artificially generated noise samples drawn from a known distribution. For each training example, only k noise samples are drawn (typically k = 25), and a logistic regression classifier is trained on the k + 1 samples. The computational cost becomes O(k) per example rather than O(K). Gutmann and Hyvarinen (2010) proved that as the number of noise samples increases, the NCE gradient converges to the gradient of the full softmax. NCE achieves speedups of 8x to 45x during training.
Negative sampling, popularized by Mikolov et al. (2013) in the Word2Vec system, is a simplified variant of NCE. It drops the normalization constant entirely and uses a simplified objective function. For each positive (target) word, k negative words are sampled from a noise distribution (often the unigram distribution raised to the 3/4 power). While negative sampling lacks the asymptotic consistency guarantees of NCE (meaning it does not converge to the true softmax distribution), it works well in practice for learning word embeddings. It is generally not recommended for language modeling tasks where accurate probability estimates are needed.
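A sketch of the per-example negative sampling objective, assuming pre-drawn negative samples and using illustrative variable names:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(center_vec, target_vec, negative_vecs):
    """
    Word2Vec-style negative sampling objective for one (center, target) pair.
    Maximizes log sigma(v_target . v_center) plus, for each sampled negative word,
    log sigma(-v_neg . v_center). No normalization over the full vocabulary.
    """
    positive_term = np.log(sigmoid(target_vec @ center_vec))
    negative_term = np.sum(np.log(sigmoid(-negative_vecs @ center_vec)))
    return -(positive_term + negative_term)     # loss = negative log-likelihood

# Hypothetical toy setup: embedding dimension 8, k = 5 negative samples
rng = np.random.default_rng(0)
center = rng.normal(size=8)
target = rng.normal(size=8)
negatives = rng.normal(size=(5, 8))             # drawn from a noise distribution
print(negative_sampling_loss(center, target, negatives))
```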
Importance sampling approximates the softmax gradient by sampling from a proposal distribution Q rather than summing over all classes. The samples are reweighted by importance weights P(w)/Q(w) to correct for the sampling bias. Adaptive importance sampling, which learns a mixture of bigram and unigram distributions during training, has achieved speedups of up to 100x. However, importance sampling can diverge if the proposal distribution is a poor match for the true distribution.
| Method | Complexity | Training speedup | Exact at test time | Best for |
|---|---|---|---|---|
| Full softmax | O(K) | 1x (baseline) | Yes | Small vocabularies, exact probability needs |
| Hierarchical softmax | O(log K) | 25-100x | No (requires full computation) | Large vocabularies during training |
| Adaptive softmax | O(K) with smaller constants | 2-10x | Yes | GPU-based language modeling |
| Differentiated softmax | O(K) with fewer parameters | ~2x | Yes | Balanced speed at train and test time |
| NCE | O(k) where k is noise samples | 8-45x | No | Large-vocabulary training |
| Negative sampling | O(k) where k is negative samples | ~25x | No | Word embedding learning |
| Importance sampling | O(k) where k is samples | 19-100x | No | Large-vocabulary training |
Beyond the computational cost, Yang et al. (2018) identified a theoretical limitation called the softmax bottleneck. They formulated language modeling as a matrix factorization problem and showed that when the true data distribution has a higher rank than the hidden dimension of the model, a single softmax layer cannot represent the distribution exactly. In other words, the expressiveness of softmax-based language models is fundamentally limited by the rank of the context-to-logit mapping.
To address this, they proposed Mixture of Softmaxes (MoS), which introduces discrete latent variables and computes the output distribution as a weighted average of multiple softmax distributions. MoS achieved state-of-the-art perplexity on Penn Treebank (47.69) and WikiText-2 (40.68) at the time of publication. Later work by Chang and McCallum (2022) further showed that the softmax bottleneck makes language models unable to represent multi-mode word distributions.
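A rough sketch of the MoS idea, computing a convex combination of several softmax distributions; the projections and shapes here are illustrative, not the exact architecture from the paper:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def mixture_of_softmaxes(h, mixture_W, component_Ws, output_W):
    """Output distribution as a weighted average of several softmax distributions."""
    pi = softmax(mixture_W @ h)                                    # mixture weights, shape (n_components,)
    logits = np.stack([output_W @ np.tanh(Wk @ h) for Wk in component_Ws])   # (n_components, K)
    return pi @ softmax(logits, axis=-1)                           # convex combination of K-dim distributions

rng = np.random.default_rng(0)
d, K, n_comp = 16, 100, 3
h = rng.normal(size=d)
p = mixture_of_softmaxes(h, rng.normal(size=(n_comp, d)),
                         rng.normal(size=(n_comp, d, d)), rng.normal(size=(K, d)))
print(p.shape, p.sum())    # (100,) ~1.0 -- still a valid distribution, but not limited to rank d
```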
While full softmax always produces dense probability distributions (every class receives nonzero probability), several alternatives have been proposed that produce sparse outputs:
Sparsemax, introduced by Martins and Astudillo at ICML 2016, projects the input onto the probability simplex in a way that can assign exactly zero probability to some classes. This produces more interpretable and selective distributions, which is useful in attention mechanisms where focusing on a small number of inputs is desirable. Sparsemax achieves similar performance to softmax on natural language inference tasks while producing more compact attention patterns.
Entmax (alpha-entmax) generalizes both softmax and sparsemax as a continuous family parameterized by alpha. When alpha = 1, entmax reduces to softmax. When alpha = 2, it reduces to sparsemax. Intermediate values allow fine-grained control over the degree of sparsity.
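A minimal NumPy sketch of the sparsemax projection described above, using the standard sort-and-threshold algorithm:

```python
import numpy as np

def sparsemax(z):
    """Euclidean projection of z onto the probability simplex (Martins & Astudillo, 2016)."""
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]                   # sort logits in decreasing order
    k = np.arange(1, len(z) + 1)
    cumsum = np.cumsum(z_sorted)
    support = k * z_sorted > cumsum - 1           # how many classes stay nonzero
    k_max = k[support].max()
    tau = (cumsum[k_max - 1] - 1) / k_max         # threshold subtracted from every logit
    return np.maximum(z - tau, 0.0)

z = np.array([1.0, 2.0, 3.0, 4.0])
print(sparsemax(z))         # [0. 0. 0. 1.] -- some classes receive exactly zero probability
print(sparsemax(z).sum())   # 1.0
```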
The Gumbel-softmax trick, introduced independently by Jang et al. (2017) and Maddison et al. (2017), provides a differentiable approximation to sampling from categorical (discrete) distributions. Standard softmax outputs can be used to define categorical distributions, but sampling from them is not differentiable, which prevents the use of the reparameterization trick for backpropagation.
Gumbel-softmax addresses this by adding Gumbel noise to the logits before applying temperature-scaled softmax:
$$y_i = \frac{\exp((\log \pi_i + g_i) / \tau)}{\sum_{j=1}^{K} \exp((\log \pi_j + g_j) / \tau)}$$
where pi_i are the category probabilities, g_i are independent samples from the Gumbel(0, 1) distribution, and tau is a temperature parameter. As tau approaches 0, the Gumbel-softmax output approaches a one-hot vector (a true categorical sample). As tau approaches infinity, it approaches a uniform distribution. This technique is widely used in variational autoencoders and other generative models that require differentiable discrete sampling.
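A minimal NumPy sketch of drawing one Gumbel-softmax sample, generating the Gumbel noise via the inverse-CDF transform g = -log(-log(u)):

```python
import numpy as np

def gumbel_softmax_sample(log_probs, tau, rng):
    """Draw one differentiable 'soft' sample from a categorical distribution."""
    u = rng.uniform(size=log_probs.shape)
    g = -np.log(-np.log(u))                       # Gumbel(0, 1) noise
    y = (log_probs + g) / tau
    y = y - y.max()                               # stable softmax
    e = np.exp(y)
    return e / e.sum()

rng = np.random.default_rng(0)
log_pi = np.log(np.array([0.1, 0.2, 0.7]))        # category probabilities pi
print(gumbel_softmax_sample(log_pi, tau=0.1, rng=rng))   # close to one-hot at low tau
print(gumbel_softmax_sample(log_pi, tau=5.0, rng=rng))   # close to uniform at high tau
```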
In transformer architectures, the softmax function is applied at every layer of the self-attention mechanism to compute attention weights. Given query Q, key K, and value V matrices, the attention output is:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$
Here the softmax is applied over the sequence dimension (not the vocabulary dimension), so the number of entries being normalized equals the sequence length rather than the vocabulary size. For long sequences, this becomes a bottleneck because the attention matrix is of size (sequence_length x sequence_length), leading to quadratic memory and computation costs.
FlashAttention, introduced by Dao et al. in 2022, addresses this by tiling the softmax computation to avoid materializing the full attention matrix in GPU high-bandwidth memory (HBM). It uses an online softmax algorithm that computes the normalization constant incrementally, block by block, achieving 2x to 4x speedups and 10x to 20x memory savings compared to standard attention implementations. FlashAttention computes the exact same result as full softmax attention; it is a hardware-aware optimization, not an approximation.
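A simplified sketch of the online (block-wise) softmax idea that underlies this approach, maintaining a running maximum and a rescaled normalizer; it omits the tiling of the value accumulation that FlashAttention also performs:

```python
import numpy as np

def online_softmax(scores, block_size=4):
    """
    Compute the softmax of `scores` one block at a time, keeping only a running
    maximum and a running (rescaled) normalizer. The result is exactly the full
    softmax; no approximation is involved.
    """
    m = -np.inf        # running maximum
    s = 0.0            # running sum of exp(score - m)
    for start in range(0, len(scores), block_size):
        block = scores[start:start + block_size]
        m_new = max(m, block.max())
        s = s * np.exp(m - m_new) + np.exp(block - m_new).sum()   # rescale the old sum
        m = m_new
    return np.exp(scores - m) / s    # second pass produces the final probabilities

x = np.random.default_rng(0).normal(size=10)
full = np.exp(x - x.max()) / np.exp(x - x.max()).sum()
print(np.allclose(online_softmax(x), full))    # True: an exact rewrite, not an approximation
```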
Full softmax is used across many areas of machine learning and artificial intelligence:
Multi-class classification. In tasks like image classification (e.g., classifying images into one of 1,000 ImageNet categories), the full softmax is applied at the output layer to produce class probabilities. Combined with cross-entropy loss, it forms the standard training objective for classification networks such as ResNet, VGG, and Vision Transformers.
Language modeling. Large language models use full softmax to compute the probability distribution over the next token at each position. Despite the large vocabulary sizes, modern hardware and optimizations (such as mixed-precision training and tensor parallelism) have made full softmax feasible even for vocabularies exceeding 100,000 tokens.
Machine translation. Encoder-decoder models for machine translation apply full softmax at each decoding step to select the next word in the target language. Beam search and other decoding strategies rely on the exact probability values produced by full softmax.
Reinforcement learning. In policy-based methods, the softmax function converts action-value estimates (Q-values) into a policy (a probability distribution over actions). This is known as Boltzmann exploration or softmax exploration, where the temperature parameter controls the trade-off between exploration and exploitation. Higher temperatures encourage more exploration by making the policy more uniform, while lower temperatures make the agent more likely to select the action with the highest estimated value.
Recommender systems. Softmax is used in recommendation systems to convert item scores into selection probabilities, especially in session-based and sequential recommendation models.
All major deep learning frameworks provide optimized implementations of the full softmax:
| Framework | Softmax function | Fused softmax + cross-entropy |
|---|---|---|
| PyTorch | torch.nn.functional.softmax() | torch.nn.CrossEntropyLoss() |
| TensorFlow | tf.nn.softmax() | tf.nn.softmax_cross_entropy_with_logits() |
| JAX | jax.nn.softmax() | optax.softmax_cross_entropy() |
| NumPy (manual) | np.exp(x) / np.sum(np.exp(x)) | Must be implemented manually |
The fused softmax-cross-entropy functions are preferred in practice because they are more numerically stable (they operate in log space internally) and more computationally efficient (they avoid materializing the intermediate probability vector).
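As a usage illustration in PyTorch (the shapes here are arbitrary), the fused call takes raw logits directly and matches the explicit two-step computation:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(8, 50_000)              # batch of 8, vocabulary of 50,000 classes
targets = torch.randint(0, 50_000, (8,))     # true class index for each example

# Fused operation: takes raw logits and works in log space internally
loss_fused = F.cross_entropy(logits, targets)

# Equivalent two-step version (log-softmax followed by negative log-likelihood)
loss_two_step = F.nll_loss(F.log_softmax(logits, dim=-1), targets)

# A naive torch.log(F.softmax(logits, dim=-1)) would risk log(0) = -inf for underflowed classes
print(torch.allclose(loss_fused, loss_two_step))   # True
```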
Full softmax remains the default choice for classification and language modeling tasks where exact probability distributions are needed and the number of output classes is manageable. Its mathematical properties (differentiability, normalization, translation invariance) make it well suited for gradient-based optimization. For very large output spaces, approximate methods offer significant speedups at the cost of some accuracy. The choice of method depends on the specific application: full softmax for exact probabilities, hierarchical softmax or adaptive softmax for large-vocabulary training, and negative sampling for learning word embeddings.