The full softmax (also called the standard softmax or exact softmax) is the complete computation of the softmax function over all possible output classes in a neural network. In contrast to approximate methods such as hierarchical softmax, negative sampling, or adaptive softmax, the full softmax evaluates every class in the output vocabulary to produce an exact probability distribution. While this guarantees mathematically precise probabilities, it becomes a computational bottleneck when the number of output classes is very large, as is common in language models with vocabularies of hundreds of thousands of tokens.
Imagine you have a huge bag with thousands of different colored marbles. You want to figure out the chance of picking each color. Full softmax is like carefully weighing every single marble in the bag, one by one, and then calculating the exact chance for each color. This gives you a perfect answer, but if the bag has a million marbles, it takes a really long time. Some faster methods skip most of the marbles and only look at a few of them to get a "good enough" guess. Full softmax never takes shortcuts; it always checks every marble.
The softmax function was first used in statistical mechanics as the Boltzmann distribution, appearing in foundational work by Ludwig Boltzmann in 1868 and formalized by Josiah Willard Gibbs in 1902. In decision theory, R. Duncan Luce applied the axiom of independence of irrelevant alternatives to formulate what is now known as Luce's choice axiom (1959). The modern term "softmax" was coined by John S. Bridle in two 1989 conference papers, where he described it as a "differentiable generalisation of the winner-take-all operation" that "preserves the rank order of its input values."
Given an input vector z = (z_1, z_2, ..., z_K) of raw scores (called logits) from the final layer of a neural network, the full softmax function computes the probability for each class i as:
$$\sigma(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$
where K is the total number of classes, e is Euler's number (approximately 2.71828), and the denominator sums the exponentials of all logit values. The output is a vector of K probabilities that are all positive and sum to exactly 1.
For example, given an input vector (1, 2, 3, 4, 1, 2, 3), the full softmax yields approximately (0.024, 0.064, 0.175, 0.475, 0.024, 0.064, 0.175). The largest input value (4) receives the highest probability (0.475), while the relative ordering of all inputs is preserved.
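As a minimal illustration, the following NumPy sketch reproduces this example using the naive form of the computation (a numerically stable variant is discussed later in this section):

```python
import numpy as np

def full_softmax(z):
    """Exact softmax over every entry of the logit vector z (naive form)."""
    exp_z = np.exp(z)              # exponentiate each logit
    return exp_z / exp_z.sum()     # normalize so the outputs sum to 1

z = np.array([1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0])
probs = full_softmax(z)
print(np.round(probs, 3))   # [0.024 0.064 0.175 0.475 0.024 0.064 0.175]
print(probs.sum())          # 1.0
```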
The softmax function can be parameterized with a temperature value T (or equivalently, an inverse temperature parameter beta = 1/T):
$$\sigma(\mathbf{z})_i = \frac{e^{z_i / T}}{\sum_{j=1}^{K} e^{z_j / T}}$$
The temperature controls the "sharpness" of the output distribution:
| Temperature | Effect on distribution | Behavior |
|---|---|---|
| T = 1 | Standard softmax | Default behavior; no scaling applied |
| T < 1 (low) | Sharper, more peaked | The highest-probability class dominates; the model appears more "confident" |
| T > 1 (high) | Flatter, more uniform | Probabilities spread more evenly across classes; the model appears less "confident" |
| T approaches 0 | Approaches argmax | Output converges to a one-hot vector selecting the maximum logit |
| T approaches infinity | Approaches uniform | All classes receive equal probability of 1/K |
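A minimal NumPy sketch of the temperature-scaled softmax defined above (the function and variable names are illustrative):

```python
import numpy as np

def softmax_with_temperature(z, T=1.0):
    """Temperature-scaled softmax; T < 1 sharpens the distribution, T > 1 flattens it."""
    z = np.asarray(z, dtype=float) / T
    z = z - z.max()                 # shift for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

logits = [1.0, 2.0, 3.0, 4.0]
for T in (0.5, 1.0, 2.0):
    print(T, np.round(softmax_with_temperature(logits, T), 3))
# T = 0.5 concentrates mass on the largest logit; T = 2.0 spreads it more evenly
```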
Temperature scaling is widely used in large language models such as GPT and LLaMA to control the randomness of text generation. Lower temperatures produce more deterministic and repetitive text, while higher temperatures increase diversity and creativity at the cost of coherence. Temperature scaling is also used for model calibration, where a single scalar T is learned on a validation set to make predicted probabilities better reflect true likelihoods.
The full softmax function has several mathematical properties that make it useful in machine learning:
| Property | Description |
|---|---|
| Output range | Each output value lies strictly in the open interval (0, 1), meaning no output is ever exactly 0 or exactly 1 |
| Normalization | All output values sum to exactly 1, forming a valid probability distribution |
| Translation invariance | Adding a constant c to every element of the input vector does not change the output: sigma(z + c) = sigma(z). This is because the constant cancels in the ratio of exponentials |
| Monotonicity | The function preserves the rank ordering of inputs. If z_i > z_j, then sigma(z)_i > sigma(z)_j |
| Differentiability | The function is continuously differentiable everywhere, making it compatible with gradient descent and backpropagation |
| Geometric interpretation | The softmax maps R^K to the interior of the standard (K-1)-simplex, reducing the effective dimensionality by one due to the unit-sum constraint |
| LogSumExp relationship | The softmax is the gradient of the LogSumExp function. That is, the partial derivative of LSE with respect to z_i equals sigma(z)_i |
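Two of these properties, translation invariance and the LogSumExp relationship, can be checked numerically with a short NumPy sketch:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([1.0, 2.0, 3.0])
c = 10.0

# Translation invariance: adding a constant to every logit leaves the output unchanged
print(np.allclose(softmax(z), softmax(z + c)))     # True

# LogSumExp relationship: the gradient of log(sum(exp(z))) is the softmax itself
eps = 1e-6
lse = lambda v: np.log(np.exp(v).sum())
numeric_grad = np.array([(lse(z + eps * np.eye(3)[i]) - lse(z)) / eps for i in range(3)])
print(np.allclose(numeric_grad, softmax(z), atol=1e-4))   # True
```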
The partial derivative of the softmax output with respect to input logits is:
$$\frac{\partial \sigma_i}{\partial z_j} = \sigma_i (\delta_{ij} - \sigma_j)$$
where delta_ij is the Kronecker delta (1 when i = j, 0 otherwise). When paired with the cross-entropy loss function, the gradient simplifies to:
$$\frac{\partial L}{\partial z_i} = \sigma_i - y_i$$
where y_i is the one-hot encoded true label. This clean gradient expression (predicted probability minus true label) is one of the reasons the softmax plus cross-entropy combination is so widely used in classification tasks. The simplicity of this gradient also makes the backward pass computationally efficient.
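Both the Jacobian and the simplified cross-entropy gradient can be verified with a small NumPy sketch (the logits and label here are arbitrary):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([0.5, 1.5, -0.2, 2.0])
y = np.array([0.0, 0.0, 0.0, 1.0])         # one-hot true label
s = softmax(z)

# Jacobian of the softmax: d(sigma_i)/d(z_j) = sigma_i * (delta_ij - sigma_j)
jacobian = np.diag(s) - np.outer(s, s)

# Gradient of cross-entropy loss L = -sum(y * log(sigma)) with respect to the logits
grad = s - y                                # the "predicted minus true" form
# The same gradient via the chain rule: dL/dz = J^T (dL/dsigma), with dL/dsigma = -y / sigma
print(np.allclose(jacobian.T @ (-y / s), grad))   # True
```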
The "full" in full softmax refers to the fact that the denominator requires summing over all K output classes. This computation has time complexity O(K) for each input example, which is acceptable when K is small (e.g., 10 classes in digit recognition or 1,000 classes in ImageNet). However, in many real-world applications, K can be extremely large:
| Application | Typical vocabulary/class size (K) |
|---|---|
| Digit classification (MNIST) | 10 |
| Image classification (ImageNet) | 1,000 |
| Part-of-speech tagging | 50 to 100 |
| Named entity recognition | 10 to 50 |
| Word-level language modeling | 50,000 to 200,000 |
| Subword language modeling (BPE) | 30,000 to 100,000 |
| Machine translation output | 30,000 to 100,000 |
| Web-scale recommendation | Millions |
For a language model with a vocabulary of 100,000 tokens and a hidden dimension of 2,048, the output projection matrix alone contains approximately 200 million parameters. At each training step, computing the full softmax requires a matrix-vector multiplication of size (K x d), followed by exponentiation and normalization across all K values. When K is in the hundreds of thousands, this computation dominates the training cost and can account for a majority of the total floating-point operations per step.
The memory cost is also significant. The output embedding matrix of size K x d must be stored in GPU memory, and the full vector of K softmax probabilities must be materialized for each example in the batch during both the forward and backward passes.
A naive implementation of the softmax function is prone to numerical overflow and underflow. When logit values are large and positive, computing e^(z_i) can exceed the range of floating-point representation, producing infinity. When logit values are large and negative, e^(z_i) underflows to zero, which can cause division-by-zero errors.
The standard solution is the log-sum-exp trick (also called the "exp-normalize trick" or "safe softmax"). Before computing the exponentials, the maximum value m = max(z_1, ..., z_K) is subtracted from every logit:
$$\sigma(\mathbf{z})_i = \frac{e^{z_i - m}}{\sum_{j=1}^{K} e^{z_j - m}}$$
This transformation is valid because the softmax is translation-invariant. After the shift, the largest exponent is e^0 = 1, which prevents overflow. The corresponding log-softmax can be computed as:
$$\log \sigma(\mathbf{z})_i = (z_i - m) - \log \sum_{j=1}^{K} e^{z_j - m}$$
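A minimal NumPy sketch of the max-subtraction trick; with logits around 1000, naive exponentiation would overflow, but the shifted version does not:

```python
import numpy as np

def stable_softmax(z):
    """Softmax with the max-subtraction (log-sum-exp) trick."""
    m = z.max()
    e = np.exp(z - m)            # largest exponent is exp(0) = 1, so no overflow
    return e / e.sum()

def log_softmax(z):
    m = z.max()
    return (z - m) - np.log(np.exp(z - m).sum())

z = np.array([1000.0, 1001.0, 1002.0])    # naive np.exp(z) would overflow to inf
print(stable_softmax(z))                   # [0.09003057 0.24472847 0.66524096]
print(np.exp(log_softmax(z)).sum())        # 1.0
```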
All major deep learning frameworks, including PyTorch, TensorFlow, and JAX, implement this numerically stable version by default. In practice, combining the softmax and cross-entropy into a single fused operation (such as PyTorch's CrossEntropyLoss or TensorFlow's tf.nn.softmax_cross_entropy_with_logits) provides even better numerical stability by avoiding explicit computation of the intermediate probabilities.
Because full softmax becomes expensive for large output spaces, researchers have developed numerous approximation techniques. These generally fall into two categories: softmax-based approaches that restructure the computation, and sampling-based approaches that avoid computing the full normalization constant.
Hierarchical softmax (H-Softmax), introduced by Morin and Bengio in 2005, replaces the flat output layer with a binary tree structure in which each word is a leaf node. Instead of computing probabilities over all K classes, the model traverses a path from the root to the target leaf, making a binary decision at each internal node. This reduces the computational complexity from O(K) to O(log_2 K).
Using a Huffman tree (as in Word2Vec) assigns shorter paths to more frequent words, further reducing the average computation. Hierarchical softmax achieves training speedups of 25x to 100x compared to full softmax. However, it has drawbacks: the tree structure introduces a dependency on the quality of the word hierarchy, and at test time the full distribution over all words must still be computed when ranking is needed. It also tends to perform poorly on small vocabularies.
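A toy sketch of how a word's probability is assembled from binary decisions along its tree path; the tree layout, variable names, and parameterization here are illustrative rather than the Word2Vec implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hierarchical_word_prob(h, path_nodes, path_directions, node_vectors):
    """Probability of one word: a product of sigmoid decisions at each internal node."""
    prob = 1.0
    for node, direction in zip(path_nodes, path_directions):
        # One binary decision per internal node: sigma(+score) for one branch, sigma(-score) for the other
        prob *= sigmoid(direction * node_vectors[node] @ h)
    return prob

# Hypothetical toy tree over 4 words: each path has log2(4) = 2 binary decisions
rng = np.random.default_rng(0)
node_vectors = rng.normal(size=(3, 8))      # 3 internal nodes suffice for 4 leaves
h = rng.normal(size=8)                      # hidden state of the model
paths = {                                    # (internal nodes on path, turn directions)
    "w0": ([0, 1], [+1, +1]), "w1": ([0, 1], [+1, -1]),
    "w2": ([0, 2], [-1, +1]), "w3": ([0, 2], [-1, -1]),
}
probs = {w: hierarchical_word_prob(h, n, d, node_vectors) for w, (n, d) in paths.items()}
print(sum(probs.values()))                   # sums to 1 over all leaves
```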
Adaptive softmax, proposed by Grave et al. in 2017, exploits Zipf's law (the observation that a small fraction of words account for most of the probability mass in natural language). It partitions the vocabulary into clusters based on word frequency, assigning smaller projection dimensions to infrequent word clusters. This approach is specifically optimized for GPU computation, taking advantage of efficient matrix-matrix operations. Experiments on benchmarks such as EuroParl and the One Billion Word dataset showed speedups of 2x to 10x over full softmax with minimal loss in perplexity.
Differentiated softmax (D-Softmax) allocates varying embedding dimensions to words based on their frequency. Frequent words receive larger embeddings, while rare words receive smaller ones. Unlike hierarchical softmax, the speedup persists at test time because fewer parameters are needed overall. The trade-off is that rare words may be modeled less accurately due to their smaller representation.
Noise contrastive estimation reformulates the softmax computation as a binary classification problem. Instead of computing probabilities over all K classes, the model learns to distinguish between true data samples and artificially generated noise samples drawn from a known distribution. For each training example, only k noise samples are drawn (typically k = 25), and a logistic regression classifier is trained on the k + 1 samples. The computational cost becomes O(k) per example rather than O(K). Gutmann and Hyvarinen (2010) proved that as the number of noise samples increases, the NCE gradient converges to the gradient of the full softmax. NCE achieves speedups of 8x to 45x during training.
Negative sampling, popularized by Mikolov et al. (2013) in the Word2Vec system, is a simplified variant of NCE. It drops the normalization constant entirely and uses a simplified objective function. For each positive (target) word, k negative words are sampled from a noise distribution (often the unigram distribution raised to the 3/4 power). While negative sampling lacks the asymptotic consistency guarantees of NCE (meaning it does not converge to the true softmax distribution), it works well in practice for learning word embeddings. It is generally not recommended for language modeling tasks where accurate probability estimates are needed.
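A sketch of the per-example negative sampling objective, assuming pre-drawn negative samples and using illustrative variable names:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(center_vec, target_vec, negative_vecs):
    """
    Word2Vec-style negative sampling objective for one (center, target) pair.
    Maximizes log sigma(v_target . v_center) plus, for each sampled negative word,
    log sigma(-v_neg . v_center). No normalization over the full vocabulary.
    """
    positive_term = np.log(sigmoid(target_vec @ center_vec))
    negative_term = np.sum(np.log(sigmoid(-negative_vecs @ center_vec)))
    return -(positive_term + negative_term)     # loss = negative log-likelihood

# Hypothetical toy setup: embedding dimension 8, k = 5 negative samples
rng = np.random.default_rng(0)
center = rng.normal(size=8)
target = rng.normal(size=8)
negatives = rng.normal(size=(5, 8))             # drawn from a noise distribution
print(negative_sampling_loss(center, target, negatives))
```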
Importance sampling approximates the softmax gradient by sampling from a proposal distribution Q rather than summing over all classes. The samples are reweighted by importance weights P(w)/Q(w) to correct for the sampling bias. Adaptive importance sampling, which learns a mixture of bigram and unigram distributions during training, has achieved speedups of up to 100x. However, importance sampling can diverge if the proposal distribution is a poor match for the true distribution.
| Method | Complexity | Training speedup | Exact at test time | Best for |
|---|---|---|---|---|
| Full softmax | O(K) | 1x (baseline) | Yes | Small vocabularies, exact probability needs |
| Hierarchical softmax | O(log K) | 25-100x | No (requires full computation) | Large vocabularies during training |
| Adaptive softmax | O(K) with smaller constants | 2-10x | Yes | GPU-based language modeling |
| Differentiated softmax | O(K) with fewer parameters | ~2x | Yes | Balanced speed at train and test time |
| NCE | O(k) where k is noise samples | 8-45x | No | Large-vocabulary training |
| Negative sampling | O(k) where k is negative samples | ~25x | No | Word embedding learning |
| Importance sampling | O(k) where k is samples | 19-100x | No | Large-vocabulary training |
Beyond the computational cost, Yang et al. (2018) identified a theoretical limitation called the softmax bottleneck. They formulated language modeling as a matrix factorization problem and showed that when the true data distribution has a higher rank than the hidden dimension of the model, a single softmax layer cannot represent the distribution exactly. In other words, the expressiveness of softmax-based language models is fundamentally limited by the rank of the context-to-logit mapping.
To address this, they proposed Mixture of Softmaxes (MoS), which introduces discrete latent variables and computes the output distribution as a weighted average of multiple softmax distributions. MoS achieved state-of-the-art perplexity on Penn Treebank (47.69) and WikiText-2 (40.68) at the time of publication. Later work by Chang and McCallum (2022) further showed that the softmax bottleneck makes language models unable to represent multi-mode word distributions.
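A rough sketch of the MoS idea, computing a convex combination of several softmax distributions; the projections and shapes here are illustrative, not the exact architecture from the paper:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def mixture_of_softmaxes(h, mixture_W, component_Ws, output_W):
    """Output distribution as a weighted average of several softmax distributions."""
    pi = softmax(mixture_W @ h)                                    # mixture weights, shape (n_components,)
    logits = np.stack([output_W @ np.tanh(Wk @ h) for Wk in component_Ws])   # (n_components, K)
    return pi @ softmax(logits, axis=-1)                           # convex combination of K-dim distributions

rng = np.random.default_rng(0)
d, K, n_comp = 16, 100, 3
h = rng.normal(size=d)
p = mixture_of_softmaxes(h, rng.normal(size=(n_comp, d)),
                         rng.normal(size=(n_comp, d, d)), rng.normal(size=(K, d)))
print(p.shape, p.sum())    # (100,) ~1.0 -- still a valid distribution, but not limited to rank d
```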
While full softmax always produces dense probability distributions (every class receives nonzero probability), several alternatives have been proposed that produce sparse outputs:
Sparsemax, introduced by Martins and Astudillo at ICML 2016, projects the input onto the probability simplex in a way that can assign exactly zero probability to some classes. This produces more interpretable and selective distributions, which is useful in attention mechanisms where focusing on a small number of inputs is desirable. Sparsemax achieves similar performance to softmax on natural language inference tasks while producing more compact attention patterns.
Entmax (alpha-entmax) generalizes both softmax and sparsemax as a continuous family parameterized by alpha. When alpha = 1, entmax reduces to softmax. When alpha = 2, it reduces to sparsemax. Intermediate values allow fine-grained control over the degree of sparsity.
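A minimal NumPy sketch of the sparsemax projection described above, using the standard sort-and-threshold algorithm:

```python
import numpy as np

def sparsemax(z):
    """Euclidean projection of z onto the probability simplex (Martins & Astudillo, 2016)."""
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]                   # sort logits in decreasing order
    k = np.arange(1, len(z) + 1)
    cumsum = np.cumsum(z_sorted)
    support = k * z_sorted > cumsum - 1           # how many classes stay nonzero
    k_max = k[support].max()
    tau = (cumsum[k_max - 1] - 1) / k_max         # threshold subtracted from every logit
    return np.maximum(z - tau, 0.0)

z = np.array([1.0, 2.0, 3.0, 4.0])
print(sparsemax(z))         # [0. 0. 0. 1.] -- some classes receive exactly zero probability
print(sparsemax(z).sum())   # 1.0
```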
The Gumbel-softmax trick, introduced independently by Jang et al. (2017) and Maddison et al. (2017), provides a differentiable approximation to sampling from categorical (discrete) distributions. Standard softmax outputs can be used to define categorical distributions, but sampling from them is not differentiable, which prevents the use of the reparameterization trick for backpropagation.
Gumbel-softmax addresses this by adding Gumbel noise to the logits before applying temperature-scaled softmax:
$$y_i = \frac{\exp((\log \pi_i + g_i) / \tau)}{\sum_{j=1}^{K} \exp((\log \pi_j + g_j) / \tau)}$$
where pi_i are the category probabilities, g_i are independent samples from the Gumbel(0, 1) distribution, and tau is a temperature parameter. As tau approaches 0, the Gumbel-softmax output approaches a one-hot vector (a true categorical sample). As tau approaches infinity, it approaches a uniform distribution. This technique is widely used in variational autoencoders and other generative models that require differentiable discrete sampling.
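A minimal NumPy sketch of drawing one Gumbel-softmax sample, generating the Gumbel noise via the inverse-CDF transform g = -log(-log(u)):

```python
import numpy as np

def gumbel_softmax_sample(log_probs, tau, rng):
    """Draw one differentiable 'soft' sample from a categorical distribution."""
    u = rng.uniform(size=log_probs.shape)
    g = -np.log(-np.log(u))                       # Gumbel(0, 1) noise
    y = (log_probs + g) / tau
    y = y - y.max()                               # stable softmax
    e = np.exp(y)
    return e / e.sum()

rng = np.random.default_rng(0)
log_pi = np.log(np.array([0.1, 0.2, 0.7]))        # category probabilities pi
print(gumbel_softmax_sample(log_pi, tau=0.1, rng=rng))   # close to one-hot at low tau
print(gumbel_softmax_sample(log_pi, tau=5.0, rng=rng))   # close to uniform at high tau
```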
In transformer architectures, the softmax function is applied at every layer of the self-attention mechanism to compute attention weights. Given query Q, key K, and value V matrices, the attention output is:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$
Here the softmax is applied over the sequence dimension (not the vocabulary dimension), so the number of entries being normalized equals the sequence length rather than the vocabulary size. For long sequences, this becomes a bottleneck because the attention matrix is of size (sequence_length x sequence_length), leading to quadratic memory and computation costs.
FlashAttention, introduced by Dao et al. in 2022, addresses this by tiling the softmax computation to avoid materializing the full attention matrix in GPU high-bandwidth memory (HBM). It uses an online softmax algorithm that computes the normalization constant incrementally, block by block, achieving 2x to 4x speedups and 10x to 20x memory savings compared to standard attention implementations. FlashAttention computes the exact same result as full softmax attention; it is a hardware-aware optimization, not an approximation.
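A simplified sketch of the online (block-wise) softmax idea that underlies this approach, maintaining a running maximum and a rescaled normalizer; it omits the tiling of the value accumulation that FlashAttention also performs:

```python
import numpy as np

def online_softmax(scores, block_size=4):
    """
    Compute the softmax of `scores` one block at a time, keeping only a running
    maximum and a running (rescaled) normalizer. The result is exactly the full
    softmax; no approximation is involved.
    """
    m = -np.inf        # running maximum
    s = 0.0            # running sum of exp(score - m)
    for start in range(0, len(scores), block_size):
        block = scores[start:start + block_size]
        m_new = max(m, block.max())
        s = s * np.exp(m - m_new) + np.exp(block - m_new).sum()   # rescale the old sum
        m = m_new
    return np.exp(scores - m) / s    # second pass produces the final probabilities

x = np.random.default_rng(0).normal(size=10)
full = np.exp(x - x.max()) / np.exp(x - x.max()).sum()
print(np.allclose(online_softmax(x), full))    # True: an exact rewrite, not an approximation
```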
Full softmax is used across many areas of machine learning and artificial intelligence:
Multi-class classification. In tasks like image classification (e.g., classifying images into one of 1,000 ImageNet categories), the full softmax is applied at the output layer to produce class probabilities. Combined with cross-entropy loss, it forms the standard training objective for classification networks such as ResNet, VGG, and Vision Transformers.
Language modeling. Large language models use full softmax to compute the probability distribution over the next token at each position. Despite the large vocabulary sizes, modern hardware and optimizations (such as mixed-precision training and tensor parallelism) have made full softmax feasible even for vocabularies exceeding 100,000 tokens.
Machine translation. Encoder-decoder models for machine translation apply full softmax at each decoding step to select the next word in the target language. Beam search and other decoding strategies rely on the exact probability values produced by full softmax.
Reinforcement learning. In policy-based methods, the softmax function converts action-value estimates (Q-values) into a policy (a probability distribution over actions). This is known as Boltzmann exploration or softmax exploration, where the temperature parameter controls the trade-off between exploration and exploitation. Higher temperatures encourage more exploration by making the policy more uniform, while lower temperatures make the agent more likely to select the action with the highest estimated value.
Recommender systems. Softmax is used in recommendation systems to convert item scores into selection probabilities, especially in session-based and sequential recommendation models.
All major deep learning frameworks provide optimized implementations of the full softmax:
| Framework | Softmax function | Fused softmax + cross-entropy |
|---|---|---|
| PyTorch | torch.nn.functional.softmax() | torch.nn.CrossEntropyLoss() |
| TensorFlow | tf.nn.softmax() | tf.nn.softmax_cross_entropy_with_logits() |
| JAX | jax.nn.softmax() | optax.softmax_cross_entropy() |
| NumPy (manual) | np.exp(x) / np.sum(np.exp(x)) | Must be implemented manually |
The fused softmax-cross-entropy functions are preferred in practice because they are more numerically stable (they operate in log space internally) and more computationally efficient (they avoid materializing the intermediate probability vector).
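As a usage illustration in PyTorch (the shapes here are arbitrary), the fused call takes raw logits directly and matches the explicit two-step computation:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(8, 50_000)              # batch of 8, vocabulary of 50,000 classes
targets = torch.randint(0, 50_000, (8,))     # true class index for each example

# Fused operation: takes raw logits and works in log space internally
loss_fused = F.cross_entropy(logits, targets)

# Equivalent two-step version (log-softmax followed by negative log-likelihood)
loss_two_step = F.nll_loss(F.log_softmax(logits, dim=-1), targets)

# A naive torch.log(F.softmax(logits, dim=-1)) would risk log(0) = -inf for underflowed classes
print(torch.allclose(loss_fused, loss_two_step))   # True
```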
Full softmax remains the default choice for classification and language modeling tasks where exact probability distributions are needed and the number of output classes is manageable. Its mathematical properties (differentiability, normalization, translation invariance) make it well suited for gradient-based optimization. For very large output spaces, approximate methods offer significant speedups at the cost of some accuracy. The choice of method depends on the specific application: full softmax for exact probabilities, hierarchical softmax or adaptive softmax for large-vocabulary training, and negative sampling for learning word embeddings.