See also: Machine learning terms
In machine learning, the softmax function (also called the softargmax or normalized exponential function) is a mathematical function that converts a vector of real numbers (often called logits) into a probability distribution. Each output value lies between 0 and 1, and all outputs sum to exactly 1. Softmax is the standard final activation function in neural networks for multi-class classification tasks and also plays a central role in the attention mechanism of transformer architectures.
The function originated in statistical mechanics as the Boltzmann distribution and was later adopted in decision theory and machine learning. The name "softmax" reflects that it is a smooth (differentiable) approximation of the argmax function, which returns a one-hot vector with a 1 at the position of the largest input and 0 elsewhere. Where argmax makes a hard selection, softmax makes a soft selection, assigning nonzero probability to every input.
The softmax function has roots spanning statistical physics, decision theory, and computer science.
Statistical mechanics (1868 to 1902). Ludwig Boltzmann first introduced the underlying exponential distribution in his 1868 paper on the equilibrium of kinetic energy among material points. Josiah Willard Gibbs later formalized and popularized the distribution in his influential 1902 textbook Elementary Principles in Statistical Mechanics, where it became known as the Boltzmann distribution (or Gibbs distribution). In this physical context, the function describes the probability that a system occupies a particular microstate given its energy, with the temperature controlling how sharply the distribution concentrates on lower-energy states.
Decision theory (1959). R. Duncan Luce adopted the same mathematical form in his Individual Choice Behavior (1959). Luce's choice axiom states that the relative odds of choosing one option over another are not affected by the presence or absence of other alternatives. This "independence of irrelevant alternatives" property leads directly to the softmax form for modeling choice probabilities, and it became a standard tool in economics and psychology.
Machine learning (1989 to 1990). John S. Bridle is credited with introducing the term "softmax" in two 1989 conference papers (published in proceedings in 1990). Bridle described the function as "a normalised exponential multi-input generalisation of the logistic non-linearity" and argued that it should replace the argmax in feedforward classification networks because it "preserves the rank order of its input values, and is a differentiable generalisation of the 'winner-take-all' operation of picking the maximum value."
Given an input vector z = (z_1, z_2, ..., z_n), the softmax function computes the i-th output as:
softmax(z)_i = exp(z_i) / sum_{j=1}^{n} exp(z_j)
for i in {1, 2, ..., n}.
Here, exp(.) denotes the exponential function, z_i is the i-th element of the input vector, and the denominator sums the exponentials of all elements, serving as a normalization constant.
For example, given the input vector z = [2.0, 1.0, 0.1], the softmax output is approximately [0.659, 0.242, 0.099]. The largest input (2.0) receives the highest probability, but the other elements still receive nonzero probability.
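As a quick illustration (a minimal NumPy sketch, not tied to any particular library API), the example above can be reproduced directly:

```python
import numpy as np

def softmax(z):
    """Softmax as defined above: exponentiate each element, then normalize."""
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

z = np.array([2.0, 1.0, 0.1])
print(softmax(z))  # approximately [0.659 0.242 0.099]
```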
The softmax function produces a valid probability distribution. The sum of all output values equals 1:
sum_{i=1}^{n} softmax(z)_i = 1
This property allows the outputs to be interpreted directly as class probabilities in classification tasks.
Because the exponential function is always positive, every softmax output is strictly greater than zero. No class ever receives exactly zero probability under standard softmax.
Softmax is monotonic: if z_i > z_j, then softmax(z)_i > softmax(z)_j. A higher input value always maps to a higher output probability. This means softmax never reverses the ranking of inputs; it simply rescales them into a probability distribution.
The softmax function is differentiable everywhere, which makes it compatible with gradient-based optimization. The Jacobian of the softmax function has a known closed form:
For i = j: d softmax(z)_i / d z_j = softmax(z)_i * (1 - softmax(z)_i)
For i != j: d softmax(z)_i / d z_j = -softmax(z)_i * softmax(z)_j
This can be written compactly as: d softmax(z) / d z = diag(softmax(z)) - softmax(z) * softmax(z)^T.
These derivatives are used during backpropagation to compute gradients through the softmax layer.
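The closed form is easy to check numerically; here is a minimal NumPy sketch of the Jacobian diag(p) - p p^T, with arbitrary example logits:

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

def softmax_jacobian(z):
    """Jacobian of softmax: diag(p) - p p^T, where p = softmax(z)."""
    p = softmax(z)
    return np.diag(p) - np.outer(p, p)

J = softmax_jacobian(np.array([2.0, 1.0, 0.1]))
print(J)
print(J.sum(axis=1))  # each row sums to ~0, because the outputs always sum to 1
```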
Softmax is invariant to adding a constant to all inputs:
softmax(z + c) = softmax(z) for any scalar c
This property is the mathematical basis of the numerical stability trick discussed below. Note that softmax is not invariant to scaling: multiplying all inputs by a constant changes the output distribution.
A temperature parameter T (or tau) can be introduced to control how "peaked" or "flat" the output distribution is:
softmax(z / T)_i = exp(z_i / T) / sum_{j=1}^{n} exp(z_j / T)
The temperature parameter has a direct effect on the entropy of the output distribution:
| Temperature | Effect | Distribution shape | Typical use |
|---|---|---|---|
| T approaching 0 | Outputs converge to a one-hot vector (argmax) | Very peaked | Hard decisions, greedy decoding |
| T = 1 | Standard softmax | Moderate | Default training and inference |
| T > 1 | Outputs become more uniform | Flattened | Exploration in reinforcement learning, knowledge distillation |
| T approaching infinity | All outputs approach 1/n (uniform) | Flat | Maximum exploration |
Temperature scaling is used extensively in knowledge distillation (Hinton et al., 2015), where a high temperature (T = 2 to 20) softens the teacher model's output distribution so that the student model can learn from the relative probabilities of incorrect classes, not just the top prediction. It is also commonly used during text generation with large language models to control the randomness of the output: lower temperatures produce more deterministic text, while higher temperatures produce more diverse and creative outputs.
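A small sketch of temperature scaling (NumPy, arbitrary example logits) makes the sharpening and flattening behavior concrete:

```python
import numpy as np

def softmax_with_temperature(z, T=1.0):
    """Divide the logits by the temperature T before applying softmax."""
    exp_z = np.exp(np.asarray(z) / T)
    return exp_z / exp_z.sum()

logits = [2.0, 1.0, 0.1]
print(softmax_with_temperature(logits, T=0.5))   # sharper than standard softmax
print(softmax_with_temperature(logits, T=1.0))   # standard softmax: [0.659 0.242 0.099]
print(softmax_with_temperature(logits, T=10.0))  # close to uniform: each value near 1/3
```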
A naive implementation of softmax can cause numerical overflow or underflow. When the input values are large and positive, exp(z_i) can overflow to infinity. When the input values are large and negative, exp(z_i) can underflow to zero, and the denominator can become zero, causing a division-by-zero error.
The standard solution exploits softmax's invariance to constant shifts. By subtracting the maximum input value before computing the exponentials, the largest exponent becomes exp(0) = 1, preventing overflow:
softmax(z)_i = exp(z_i - max(z)) / sum_{j=1}^{n} exp(z_j - max(z))
This is mathematically equivalent to the original definition but numerically stable. All major deep learning frameworks (PyTorch, TensorFlow, JAX) implement this trick internally.
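The shift-by-the-maximum trick is simple to verify; a minimal NumPy sketch with illustrative inputs where a naive implementation would overflow:

```python
import numpy as np

def softmax_stable(z):
    """Numerically stable softmax: subtract max(z) before exponentiating."""
    z = np.asarray(z, dtype=np.float64)
    exp_z = np.exp(z - z.max())  # largest exponent is exp(0) = 1, so no overflow
    return exp_z / exp_z.sum()

z = [1000.0, 999.0, 998.0]
# np.exp(1000.0) overflows to inf, so the naive formula would return nan;
# the shifted version is mathematically identical and stays finite.
print(softmax_stable(z))  # approximately [0.665 0.245 0.090]
```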
For the related log-softmax computation (log of the softmax output), the log-sum-exp (LSE) trick is used:
log softmax(z)_i = z_i - log(sum_{j=1}^{n} exp(z_j))
The log-sum-exp is computed stably as:
log(sum_{j=1}^{n} exp(z_j)) = max(z) + log(sum_{j=1}^{n} exp(z_j - max(z)))
Computing log-softmax directly (rather than computing softmax and then taking the log) avoids taking the logarithm of very small numbers, which would result in large negative values with poor floating-point precision. This is why PyTorch provides torch.nn.LogSoftmax and torch.nn.functional.log_softmax as separate operations, and why nn.CrossEntropyLoss internally combines log-softmax with negative log-likelihood loss for numerical stability.
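A minimal sketch of a stable log-softmax built on the log-sum-exp trick (NumPy, illustrative inputs):

```python
import numpy as np

def log_softmax(z):
    """log softmax(z)_i = z_i - logsumexp(z), with logsumexp computed stably."""
    z = np.asarray(z, dtype=np.float64)
    m = z.max()
    lse = m + np.log(np.exp(z - m).sum())  # log(sum_j exp(z_j)) without overflow
    return z - lse

print(log_softmax([1000.0, 999.0, 998.0]))  # approximately [-0.408 -1.408 -2.408]
```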
The softmax function is mathematically identical to the Boltzmann distribution (or Gibbs distribution) from statistical mechanics. The correspondence is direct:
| Softmax concept | Statistical mechanics equivalent |
|---|---|
| Input values z_i | Negative energy of microstate i (-E_i) |
| Temperature T | Thermodynamic temperature (kT, where k is Boltzmann's constant) |
| Denominator (sum of exponentials) | Partition function Z |
| Softmax output (probability of class i) | Probability of the system being in microstate i |
In the physics formulation, the probability of a system being in microstate i with energy E_i at temperature T is:
P(i) = exp(-E_i / kT) / Z, where Z = sum_j exp(-E_j / kT)
The sign convention differs (physics uses negative energy, machine learning uses positive logits), but the mathematical structure is identical. The temperature parameter in softmax directly corresponds to the physical temperature: at low temperature, the distribution concentrates on the lowest-energy (highest-logit) state, while at high temperature, all states become equally likely.
This connection is more than a historical curiosity. Methods from statistical physics, such as simulated annealing and energy-based models, exploit the same exponential-distribution machinery. The partition function Z is central to both domains and is generally intractable to compute exactly for large systems, motivating approximation techniques in both fields.
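As a small illustration of the correspondence (NumPy, with made-up energy values and kT set to 1), the Boltzmann probabilities are simply a softmax of the negated, temperature-scaled energies:

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - z.max())
    return exp_z / exp_z.sum()

energies = np.array([0.5, 1.0, 2.0])  # hypothetical microstate energies E_i
kT = 1.0                              # temperature times Boltzmann's constant
print(softmax(-energies / kT))        # P(i) = exp(-E_i / kT) / Z; lowest energy is most probable
```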
Softmax is the standard activation function in the output layer of neural networks for multi-class classification. Given a network that produces a vector of raw scores (logits) for each class, softmax converts these logits into a probability distribution over classes.
For a classification problem with K classes, the final layer typically has K output neurons, one per class. The softmax function is applied to these K outputs, and the predicted class is the one with the highest probability (argmax of the softmax output).
This pattern is used across many architectures, from convolutional image classifiers to transformer-based language models.
In large language models, the softmax layer converts the final hidden state into a probability distribution over the entire vocabulary (which can contain 32,000 to 256,000 tokens). This is one of the most computationally expensive operations in language model inference because it scales linearly with vocabulary size.
The sigmoid function is the special case of softmax for binary (two-class) classification. For two classes with logits z_1 and z_2, the softmax probability for class 1 is:
softmax(z)_1 = exp(z_1) / (exp(z_1) + exp(z_2)) = 1 / (1 + exp(-(z_1 - z_2))) = sigmoid(z_1 - z_2)
This means that for binary classification, using a single output neuron with a sigmoid activation is mathematically equivalent to using two output neurons with softmax. In practice, binary classification typically uses sigmoid because it is simpler (one output instead of two).
More generally, sigmoid can be viewed as the softmax over two categories where one of the logits is fixed at zero.
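The equivalence can be checked numerically; a small sketch (NumPy, arbitrary logits):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(z):
    exp_z = np.exp(z - z.max())
    return exp_z / exp_z.sum()

z1, z2 = 1.3, -0.4
print(softmax(np.array([z1, z2]))[0])  # class-1 probability from a two-logit softmax
print(sigmoid(z1 - z2))                # same value from a sigmoid of the difference
```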
| Property | Sigmoid | Softmax |
|---|---|---|
| Number of classes | 2 (binary) | Any K >= 2 |
| Output range | (0, 1) scalar | (0, 1) vector summing to 1 |
| Number of output neurons | 1 | K |
| Mathematical relationship | Special case of softmax | Generalization of sigmoid |
| Multi-label support | Yes (independent per class) | No (outputs sum to 1, so classes compete) |
For multi-label classification, where each input can belong to multiple classes simultaneously, sigmoid is applied independently to each output neuron (since the classes are not mutually exclusive). Softmax is inappropriate for multi-label tasks because its outputs are constrained to sum to 1.
Softmax plays a different but equally important role in the attention mechanism of transformer models. In scaled dot-product attention (Vaswani et al., 2017), softmax converts raw attention scores into attention weights:
Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V
where Q (queries), K (keys), and V (values) are matrices derived from the input, and d_k is the dimension of the key vectors.
The scaling factor 1/sqrt(d_k) is applied before the softmax to prevent the dot products from growing too large as the dimension increases. Without scaling, large dot products push softmax into regions of extremely small gradients (because the output becomes nearly one-hot), slowing down training. This scaling issue was explicitly noted in the original "Attention Is All You Need" paper by Vaswani et al.
In multi-head attention, softmax is applied independently within each attention head, producing a separate set of attention weights per head.
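A minimal single-head sketch of scaled dot-product attention (NumPy, toy shapes, no batching or masking) shows where the softmax sits:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # row-wise stability shift
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V for a single head."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # raw attention scores, shape (seq_len, seq_len)
    weights = softmax(scores, axis=-1)    # each row is a probability distribution over keys
    return weights @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
```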
Softmax was chosen for attention for several reasons: it converts arbitrary real-valued scores into normalized weights that sum to 1 and can be read as a distribution over the keys, it is differentiable everywhere so the whole mechanism can be trained end to end by gradient descent, and it preserves the ranking of the raw scores.
However, softmax also introduces challenges. Because it always assigns nonzero weight to every key, every token attends to every other token (at least slightly). This "dense" attention pattern means the full seq_len x seq_len attention matrix must be computed, leading to O(n^2) time and memory complexity in the sequence length n.
In autoregressive models (like GPT and other decoder-only transformers), causal masking ensures that each token can only attend to previous tokens. Before applying softmax, the attention scores for future positions are set to negative infinity:
S_masked[i][j] = S[i][j] if j <= i, else -infinity
When softmax is applied, exp(-infinity) = 0, so future tokens receive exactly zero attention weight. This ensures the autoregressive property is maintained while still using standard softmax computation.
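A short sketch of the masking step (NumPy, toy scores) shows how the negative-infinity entries become exact zeros after softmax:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

seq_len = 4
scores = np.random.default_rng(0).normal(size=(seq_len, seq_len))

# Causal mask: position i may only attend to positions j <= i.
future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
masked_scores = np.where(future, -np.inf, scores)

weights = softmax(masked_scores, axis=-1)
print(np.round(weights, 3))  # upper triangle is exactly 0: future tokens get no attention
```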
For a model with L layers and H heads per layer, each forward pass requires L * H independent softmax operations over (seq_len x seq_len) matrices. For a model like GPT-3 (96 layers, 96 heads) processing a 2048-token sequence, this amounts to 9,216 softmax operations per forward pass, each over a matrix with over 4 million entries.
The quadratic memory cost of storing the full attention matrix is a major bottleneck for processing long sequences. FlashAttention (Dao et al., 2022) addresses this by computing attention in blocks using an online softmax algorithm (Milakov and Gimelshein, 2018), keeping intermediate results in fast on-chip SRAM rather than materializing the full attention matrix. FlashAttention produces numerically identical results to standard attention while using O(n) memory instead of O(n^2) and running 2 to 4 times faster due to reduced memory I/O.
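The online softmax idea can be sketched in a few lines: keep a running maximum and a running, rescaled sum of exponentials so that the normalizer is built up in a single pass. This is only the core recurrence, not FlashAttention itself (which additionally tiles the attention matrix and rescales partial outputs):

```python
import numpy as np

def online_softmax(z):
    """Compute the softmax normalizer in one pass using a running max and rescaled sum."""
    m = -np.inf   # running maximum seen so far
    d = 0.0       # running sum of exp(z_j - m)
    for x in z:
        m_new = max(m, x)
        d = d * np.exp(m - m_new) + np.exp(x - m_new)  # rescale old sum to the new max
        m = m_new
    return np.exp(np.asarray(z) - m) / d

z = np.random.default_rng(0).normal(size=8)
reference = np.exp(z - z.max()) / np.exp(z - z.max()).sum()
print(np.allclose(online_softmax(z), reference))  # True
```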
Softmax is almost always paired with cross-entropy loss in classification tasks. The cross-entropy loss for a single example with true class y is:
L = -log(softmax(z)_y) = -z_y + log(sum_{j=1}^{n} exp(z_j))
The gradient of this combined loss with respect to the logits has a particularly clean form:
dL/dz_i = softmax(z)_i - 1_{i=y}
where 1_{i=y} is 1 if i equals the true class y and 0 otherwise. This means the gradient is simply the softmax output minus the one-hot target vector. This simplicity (and its numerical stability when computed as a single fused operation) is why deep learning frameworks provide combined softmax-cross-entropy functions like PyTorch's nn.CrossEntropyLoss and TensorFlow's tf.nn.softmax_cross_entropy_with_logits.
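A small sketch of the combined gradient (NumPy, arbitrary logits): it is just the softmax output with 1 subtracted at the true class.

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - z.max())
    return exp_z / exp_z.sum()

def softmax_cross_entropy_grad(z, y):
    """Gradient of -log softmax(z)_y with respect to the logits: softmax(z) - one_hot(y)."""
    grad = softmax(z)
    grad[y] -= 1.0
    return grad

z = np.array([2.0, 1.0, 0.1])
print(softmax_cross_entropy_grad(z, y=0))  # approximately [-0.341  0.242  0.099]
```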
The log-softmax function computes the logarithm of the softmax output directly:
log_softmax(z)_i = z_i - log(sum_{j=1}^{n} exp(z_j))
Computing log-softmax as a single fused operation (rather than computing softmax and then taking the log) is numerically superior. Taking the logarithm of very small softmax outputs produces large negative numbers with poor floating-point precision. The fused computation avoids this by working in log-space throughout.
Log-softmax is used whenever the logarithm of probabilities is needed, which is common in several contexts:
- Cross-entropy loss (the negative log-probability of the true class): nn.CrossEntropyLoss computes this as a single fused operation for numerical stability.
- Negative log-likelihood loss: nn.NLLLoss expects log-probabilities as input. The standard pattern is to apply nn.LogSoftmax to the network output and then pass the result to nn.NLLLoss. This two-step approach is mathematically equivalent to nn.CrossEntropyLoss but gives the user access to the intermediate log-probabilities.

| Computation | Numerical stability | Speed | When to use |
|---|---|---|---|
| softmax(z) then log() | Poor (log of small numbers) | Slower (two operations) | Avoid |
| log_softmax(z) directly | Good (fused log-sum-exp) | Faster (single operation) | Whenever log-probabilities are needed |
| CrossEntropyLoss (logits input) | Best (fully fused) | Fastest | Standard classification training |
In reinforcement learning, the softmax function is used as an action selection policy known as Boltzmann exploration (or softmax exploration). Given estimated action values Q(s, a) for each action a in state s, the probability of selecting action a is:
P(a | s) = exp(Q(s, a) / T) / sum_{a'} exp(Q(s, a') / T)
This approach offers a more nuanced alternative to epsilon-greedy exploration. While epsilon-greedy selects a random action with fixed probability epsilon (treating all non-greedy actions equally), Boltzmann exploration assigns higher selection probabilities to actions with higher estimated values. Actions that the agent believes are nearly as good as the best action are selected much more often than actions the agent believes are poor.
The temperature parameter T controls the exploration-exploitation tradeoff: low temperatures make the policy nearly greedy (mostly exploiting the highest-valued action), while high temperatures make action selection nearly uniform (maximum exploration).
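A hedged sketch of Boltzmann action selection (NumPy, made-up action values):

```python
import numpy as np

def boltzmann_policy(q_values, T=1.0):
    """P(a|s) = exp(Q(s,a)/T) / sum_a' exp(Q(s,a')/T)."""
    z = np.asarray(q_values) / T
    exp_z = np.exp(z - z.max())
    return exp_z / exp_z.sum()

rng = np.random.default_rng(0)
q = np.array([1.0, 0.9, -2.0])          # estimated action values; the last action is clearly bad
probs = boltzmann_policy(q, T=0.5)
action = rng.choice(len(q), p=probs)    # sample an action from the softmax policy
print(probs)  # near-best actions are selected often; the bad action almost never
```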
Sutton and Barto discuss softmax action selection in Reinforcement Learning: An Introduction (Section 2.3), noting that it is especially useful when the worst actions are very bad and should be avoided even during exploration, unlike epsilon-greedy which gives them equal chance.
Yang et al. (2017) identified an expressiveness limitation they called the "softmax bottleneck" in neural language models. When a language model uses a softmax output layer with a low-rank weight matrix (which is common because the hidden dimension is much smaller than the vocabulary size), the model cannot represent certain high-rank distributions over the next token.
Formally, if the hidden dimension d is less than the rank of the true log-probability matrix minus 1, then no setting of the output weight matrix can produce the correct distribution for all contexts. This limits the model's ability to capture the full complexity of natural language distributions.
To address this, Yang et al. proposed Mixture of Softmaxes (MoS), which computes multiple softmax distributions and combines them as a weighted mixture. MoS can represent higher-rank distributions and achieved substantial perplexity improvements on Penn Treebank (47.69) and WikiText-2 (40.68) benchmarks. The paper was published at ICLR 2018.
Modern large language models mitigate this bottleneck through much larger hidden dimensions (4,096 to 16,384 or more), which make the rank constraint less binding in practice.
For models with very large output vocabularies, computing the full softmax is expensive because the normalization constant requires summing over all vocabulary entries. Hierarchical softmax (Morin and Bengio, 2005) reduces this cost from O(V) to O(log V), where V is the vocabulary size.
The idea is to organize the vocabulary as a binary tree, with each word at a leaf node. Instead of computing a single softmax over V classes, the model makes a sequence of binary decisions along the path from the root to the target leaf. Each internal node has a learned binary classifier. The probability of a word is the product of the probabilities along its path from the root.
Mikolov et al. (2013) used hierarchical softmax with a Huffman tree in Word2Vec, assigning shorter paths to more frequent words. This made training on large corpora (billions of words) practical; for a vocabulary of 100,000 words, it reduced the per-example cost from 100,000 operations to roughly 17 (log_2 of 100,000). Hierarchical softmax is also implemented in Facebook's fastText library.
Hierarchical softmax has largely fallen out of use in modern large language models, which rely on GPU parallelism to compute the full softmax efficiently. However, it remains relevant for resource-constrained settings or extremely large output spaces.
Sparsemax (Martins and Astudillo, 2016) projects the input onto the probability simplex using an Euclidean projection rather than the exponential mapping used by softmax. The result is a sparse probability distribution: many outputs are exactly zero, and only the most relevant inputs receive nonzero probability. Sparsemax is differentiable almost everywhere (with a well-defined subgradient at non-differentiable points) and can be computed efficiently.
Sparsemax is useful in attention mechanisms where only a few tokens should receive attention, or in classification problems where the model should commit to a small subset of classes.
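For concreteness, a minimal sketch of the sparsemax projection (NumPy, following the sort-and-threshold procedure; the example logits are arbitrary):

```python
import numpy as np

def sparsemax(z):
    """Sparsemax: Euclidean projection of z onto the probability simplex."""
    z = np.asarray(z, dtype=np.float64)
    z_sorted = np.sort(z)[::-1]                 # sort in decreasing order
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    in_support = 1 + k * z_sorted > cumsum      # entries that remain nonzero
    k_z = k[in_support][-1]                     # support size
    tau = (cumsum[in_support][-1] - 1) / k_z    # threshold
    return np.maximum(z - tau, 0.0)

print(sparsemax([1.0, 0.8, 0.1]))  # [0.6 0.4 0. ]: the smallest logit is zeroed out exactly
```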
Entmax (Peters et al., 2019) generalizes both softmax and sparsemax into a parametric family indexed by alpha. When alpha = 1, entmax reduces to softmax; when alpha = 2, it reduces to sparsemax. Intermediate values of alpha produce distributions with intermediate levels of sparsity. The parameter alpha can even be learned jointly with the model parameters.
The entmax family is based on Tsallis entropy, a generalization of Shannon entropy. Higher alpha values encourage sparser outputs.
| Function | Sparsity | Key property | When to use |
|---|---|---|---|
| Softmax (alpha = 1) | None (all outputs > 0) | Smooth, well-understood | Default for classification and attention |
| 1.5-entmax | Moderate | Tunable sparsity | When moderate sparsity is desired |
| Sparsemax (alpha = 2) | High (many exact zeros) | Euclidean projection | Sparse attention, interpretable models |
The Gumbel-softmax (Jang et al., 2016; Maddison et al., 2016) is a continuous relaxation of discrete categorical sampling. Standard sampling from a categorical distribution is not differentiable, which prevents gradients from flowing through discrete choices during training. Gumbel-softmax solves this by adding Gumbel noise to the log-probabilities and applying softmax with a low temperature:
y_i = exp((log(pi_i) + g_i) / T) / sum_j exp((log(pi_j) + g_j) / T)
where g_i are independent samples from the standard Gumbel distribution (g = -log(-log(u)), u ~ Uniform(0,1)) and T is a temperature parameter. As T approaches 0, the samples approach one-hot vectors (true discrete samples). At higher T, the output is a soft approximation that allows gradient computation.
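A minimal sketch of Gumbel-softmax sampling (NumPy; the class probabilities pi are made up for illustration):

```python
import numpy as np

def gumbel_softmax_sample(log_probs, T=1.0, rng=None):
    """Add Gumbel noise to log-probabilities, then apply a temperature-scaled softmax."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.uniform(size=len(log_probs))
    g = -np.log(-np.log(u))                    # standard Gumbel(0, 1) noise
    z = (np.asarray(log_probs) + g) / T
    exp_z = np.exp(z - z.max())
    return exp_z / exp_z.sum()

log_pi = np.log([0.5, 0.3, 0.2])               # log-probabilities of a categorical distribution
print(gumbel_softmax_sample(log_pi, T=0.1))    # nearly one-hot: close to a true discrete sample
print(gumbel_softmax_sample(log_pi, T=5.0))    # soft and nearly uniform, but differentiable
```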
Gumbel-softmax is widely used in variational autoencoders with discrete latent variables, text generation with GANs, and any architecture that needs to make differentiable discrete selections during training.
All major deep learning frameworks provide optimized, numerically stable implementations of softmax and related operations.
PyTorch:
- torch.nn.Softmax(dim): Module form, applies softmax along the specified dimension.
- torch.nn.functional.softmax(input, dim): Functional form.
- torch.nn.LogSoftmax(dim): Computes log-softmax in a numerically stable way.
- torch.nn.functional.log_softmax(input, dim): Functional form of log-softmax.
- torch.nn.CrossEntropyLoss: Combines log-softmax and NLLLoss into a single operation. Accepts raw logits as input (not softmax outputs).
- torch.nn.functional.gumbel_softmax(logits, tau, hard): Gumbel-softmax sampling with optional straight-through estimator.

TensorFlow / Keras:
- tf.nn.softmax(logits, axis): Standard softmax.
- tf.nn.log_softmax(logits, axis): Numerically stable log-softmax.
- tf.nn.softmax_cross_entropy_with_logits(labels, logits): Fused softmax + cross-entropy.
- tf.keras.layers.Softmax(axis): Keras layer form.

A common mistake is to apply softmax to the network output and then pass the result to a loss function that internally applies softmax again (double softmax). This leads to degraded training performance. Both PyTorch's nn.CrossEntropyLoss and TensorFlow's tf.nn.softmax_cross_entropy_with_logits expect raw logits as input, not softmax outputs.
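A short PyTorch sketch of the correct pattern (random logits and labels, purely illustrative):

```python
import torch
import torch.nn as nn

logits = torch.randn(4, 10)              # raw network outputs: batch of 4, 10 classes
targets = torch.tensor([1, 0, 9, 3])     # ground-truth class indices

loss_fn = nn.CrossEntropyLoss()
loss = loss_fn(logits, targets)          # correct: pass raw logits, not softmax outputs
print(loss.item())

# Incorrect (double softmax): loss_fn(torch.softmax(logits, dim=1), targets)
# still runs, but the loss is computed on softmax-of-softmax and training degrades.
```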
Beyond its use as a final layer in classifiers and in attention mechanisms, softmax appears in a number of other contexts across machine learning, such as the gating networks of mixture-of-experts layers.
Imagine you and your friends each got different scores on a test. Softmax is like a machine that looks at everyone's scores and turns them into "chances of winning a prize." The person with the highest score gets the biggest chance, but everyone still gets at least a tiny chance. And when you add up all the chances, they equal exactly 100%. So if your scores were 10, 5, and 2, softmax would say something like: "Player 1 has about a 99% chance, Player 2 has less than a 1% chance, and Player 3 has only a tiny fraction of a percent." The bigger your score, the bigger your slice of the pie. But nobody's slice is ever exactly zero.