# Cross-Entropy

> Source: https://aiwiki.ai/wiki/cross-entropy
> Updated: 2026-07-11
> Categories: Deep Learning, Machine Learning, Mathematics
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

*See also: [Machine learning terms](/wiki/machine_learning_terms), [Loss function](/wiki/loss_function), [Entropy](/wiki/entropy)*

Cross-entropy is a measure from [information theory](/wiki/information_theory) of how many bits (or nats) are needed to encode data drawn from a true probability distribution P when using a code optimized for a different, predicted distribution Q. In machine learning it is the most widely used [loss function](/wiki/loss_function) for training classifiers, defined as $$H(P, Q) = -\sum_x P(x) \log Q(x)$$, where P is the true label distribution and Q is the model's prediction. Minimizing cross-entropy makes the model's predicted distribution match the true distribution as closely as possible; the loss is low when predictions are accurate and grows without bound as the model assigns vanishing probability to the correct outcome. Cross-entropy is mathematically equivalent to the negative log-likelihood of the data and, in the classification literature, is also called "log loss" or "logistic loss".[3][15]

The concept traces its roots to Claude Shannon's 1948 paper "A Mathematical Theory of Communication," published in the *Bell System Technical Journal*, which laid the groundwork for information theory.[1] Shannon introduced [entropy](/wiki/entropy) as a measure of the average amount of information produced by a stochastic source of data, writing that quantities of the form $$H = -\sum p_i \log p_i$$ "play a central role in information theory as measures of information, choice and uncertainty," and adding: "We shall call H the entropy of the set of probabilities $$p_1, \ldots, p_n$$."[1] Cross-entropy extends this idea to the comparison of two distributions, and it has since become a foundational tool in statistical learning, coding theory, and modern deep learning.

## Cross-entropy at a glance

| Question | Short answer |
|---|---|
| What does it measure? | The mismatch between a true distribution P and a predicted distribution Q, in bits or nats |
| Core formula | $$H(P, Q) = -\sum P(x) \log Q(x)$$ |
| Single-label classification form | $$L = -\log Q(\text{correct class})$$, the negative log-probability of the true label[3] |
| Also known as | Log loss, logistic loss, negative log-likelihood (for the Bernoulli case)[15] |
| Minimum value | $$H(P)$$, the entropy of the true distribution, reached when $$Q = P$$ |
| First defined | Claude Shannon, 1948, *Bell System Technical Journal*[1] |
| Primary use | Loss function for logistic regression, [softmax](/wiki/softmax) classifiers, [neural networks](/wiki/neural_network), and [language models](/wiki/language_model) |

## Information-theoretic foundations

To understand cross-entropy, it helps to first review the building blocks from information theory: self-information, entropy, and Kullback-Leibler divergence.

### Self-information

Given a discrete event $$x$$ that occurs with probability $$P(x)$$, the self-information (also called surprisal) is defined as:

$$
I(x) = -\log P(x)
$$

Self-information quantifies how "surprising" an event is. A certain event ($$P(x) = 1$$) carries zero information, while a very unlikely event carries high information. The choice of logarithm base determines the unit: base 2 gives bits, base *e* gives nats, and base 10 gives hartleys.

### Shannon entropy

Shannon entropy measures the average self-information across all possible events in a distribution $$P$$:

$$
H(P) = -\sum_{x} P(x) \log P(x)
$$

Entropy represents the minimum average number of bits (or nats) required to encode events drawn from distribution *P* using an optimal coding scheme.[1] It reaches its maximum value when all outcomes are equally likely (a uniform distribution) and its minimum value of zero when one outcome is certain.
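The two definitions above can be sketched in a few lines of plain Python. This is an illustrative sketch rather than a library API: the function names `self_information` and `entropy` and the example distributions are assumptions chosen for the demonstration.

```python
import math

def self_information(p, base=2.0):
    """Surprisal -log(p) of an event with probability p, in units set by `base`."""
    return -math.log(p) / math.log(base)

def entropy(dist, base=2.0):
    """Shannon entropy of a discrete distribution given as a list of probabilities."""
    # Outcomes with p == 0 are skipped, following the convention 0 * log 0 = 0.
    return sum(p * self_information(p, base) for p in dist if p > 0)

fair_coin = [0.5, 0.5]    # uniform: maximum entropy for two outcomes
biased_coin = [0.9, 0.1]  # more predictable, so lower entropy
certain = [1.0, 0.0]      # one outcome is certain, so zero entropy

for name, dist in [("fair", fair_coin), ("biased", biased_coin), ("certain", certain)]:
    print(name, entropy(dist))
```

The fair coin reaches the two-outcome maximum of 1 bit, the biased coin falls below it, and the certain outcome has zero entropy.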

### Kullback-Leibler divergence

The Kullback-Leibler (KL) divergence measures how one probability distribution $$Q$$ differs from a reference distribution $$P$$:[2]

$$
D_{\mathrm{KL}}(P \parallel Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}
$$

KL divergence is always non-negative and equals zero only when $$P = Q$$. It is not symmetric, meaning $$D_{\mathrm{KL}}(P \parallel Q) \ne D_{\mathrm{KL}}(Q \parallel P)$$ in general.[2] With base-2 logarithms, KL divergence can be interpreted as the expected number of extra bits per sample needed to encode samples from *P* when using a code optimized for *Q* instead of *P*.
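A small numerical check makes the asymmetry concrete. The sketch below assumes two made-up three-outcome distributions; `kl_divergence` is an illustrative helper, not a library function.

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) in nats for discrete distributions given as equal-length lists."""
    # Assumes q[i] > 0 wherever p[i] > 0; otherwise the divergence is infinite.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]  # hypothetical reference distribution
q = [0.4, 0.4, 0.2]  # hypothetical approximating distribution

forward = kl_divergence(p, q)  # D_KL(P || Q)
reverse = kl_divergence(q, p)  # D_KL(Q || P)
print(forward, reverse)        # both positive, but not equal
```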

### From KL divergence to cross-entropy

The relationship between cross-entropy, entropy, and KL divergence is given by:

$$
H(P, Q) = H(P) + D_{\mathrm{KL}}(P \parallel Q)
$$

This identity shows that cross-entropy equals the entropy of *P* plus the KL divergence from *P* to *Q*.[4] In machine learning, the true distribution *P* is fixed (it represents the ground-truth labels), so *H(P)* is a constant that does not depend on the model parameters. As a result, minimizing cross-entropy with respect to the model is equivalent to minimizing the KL divergence between the true distribution and the model's predicted distribution.[4]

This relationship also makes clear that cross-entropy is always at least as large as the entropy of the true distribution: $$H(P, Q) \ge H(P)$$. The gap between the two is precisely the KL divergence, which measures the "inefficiency" of using *Q* to model *P*.
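The decomposition can be verified numerically. The following sketch uses the same kind of toy distributions as an assumption; the helper names are illustrative.

```python
import math

def entropy(p):
    """H(P) in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(P, Q) in nats; assumes q[i] > 0 wherever p[i] > 0."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """D_KL(P || Q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]
q = [0.4, 0.4, 0.2]

lhs = cross_entropy(p, q)
rhs = entropy(p) + kl_divergence(p, q)
print(lhs, rhs)  # equal up to floating-point rounding
```

Setting `q = p` makes the KL term vanish, so the cross-entropy collapses to the entropy of *P*, its lower bound.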

## Mathematical definition

Given two discrete probability distributions $$P$$ (the true distribution) and $$Q$$ (the predicted distribution) over the same set of events, the cross-entropy is defined as:

$$
H(P, Q) = -\sum_{x} P(x) \log Q(x)
$$

Here, *P(x)* is the true probability of event *x*, and *Q(x)* is the predicted probability. For discrete distributions, cross-entropy is always non-negative, and for a fixed *P* it is minimized over *Q* when $$Q = P$$. In that case, $$H(P, Q) = H(P)$$, the entropy of the true distribution.[4]

For continuous distributions, the cross-entropy is defined using an integral:

$$
H(P, Q) = -\int P(x) \log Q(x)\, dx
$$

### Key properties

| Property | Description |
|---|---|
| Non-negativity | $$H(P, Q) \ge 0$$ for all discrete distributions *P* and *Q* (the continuous, integral form can be negative) |
| Minimum at *P = Q* | Cross-entropy is minimized when the predicted distribution matches the true distribution |
| Asymmetry | $$H(P, Q) \ne H(Q, P)$$ in general; the order of arguments matters |
| Relation to KL divergence | $$H(P, Q) = H(P) + D_{\mathrm{KL}}(P \parallel Q)$$, so $$H(P, Q) \ge H(P)$$ |
| Decomposition | Separates into entropy (irreducible uncertainty) plus divergence (model error) |
| Additivity | When the predicted distribution factorizes as $$Q_{XY} = Q_X Q_Y$$ (the variables are independent under the model), $$H(P_{XY}, Q_{XY}) = H(P_X, Q_X) + H(P_Y, Q_Y)$$ |

## How does cross-entropy differ in information theory versus machine learning?

The term "cross-entropy" appears in both information theory and machine learning, but there are some differences in convention and emphasis that can cause confusion.

In information theory, cross-entropy $$H(P, Q)$$ measures the average number of bits (or nats) needed to encode events from a true distribution *P* using an optimal code designed for an approximating distribution *Q*. The focus is on coding efficiency and compression.[1] Information theorists typically use base-2 logarithms, reporting cross-entropy in bits, and the distributions *P* and *Q* are both genuine probability distributions over the same event space.

In machine learning, cross-entropy serves as a loss function for training classifiers. Several conventions differ from the information-theoretic usage:

| Aspect | Information theory | Machine learning |
|---|---|---|
| Logarithm base | Base 2 (bits) | Natural logarithm (nats) |
| Role of *P* | Any true distribution | Empirical distribution of training labels (often one-hot) |
| Role of *Q* | Approximate distribution | Model's predicted distribution (parameterized) |
| Primary goal | Measure coding inefficiency | Provide a differentiable training objective |
| Interpretation | Expected message length | Negative log-likelihood of correct labels |

Because training labels in classification are typically one-hot vectors (a single class has probability 1 and all others have probability 0), the cross-entropy loss for a single sample reduces to the negative log-probability of the correct class: $$L = -\log Q(y_{\text{correct}})$$.[3] This simplification does not arise in general information-theoretic settings where *P* may be a full distribution over multiple outcomes.

Another distinction: the cross-entropy $$H(P, Q)$$ includes the irreducible entropy $$H(P)$$ of the true distribution. In machine learning, $$H(P)$$ is constant with respect to the model parameters, so it contributes nothing to the gradients, and with one-hot labels it is exactly zero. Some practitioners therefore use "cross-entropy loss" and "KL divergence" interchangeably during training, even though in general they differ by the constant $$H(P)$$.

## Binary cross-entropy

In [binary classification](/wiki/binary_classification) problems with two possible outcomes (positive = 1, negative = 0), the cross-entropy simplifies to a particularly clean form. Let $$y$$ denote the true label (0 or 1) and $$\hat{y}$$ denote the predicted probability that the label is 1. The binary cross-entropy (BCE) for a single sample is:

$$
L(y, \hat{y}) = -\left[y \log(\hat{y}) + (1 - y) \log(1 - \hat{y})\right]
$$

When $$y = 1$$, only the first term contributes, penalizing the model if $$\hat{y}$$ is far from 1. When $$y = 0$$, only the second term contributes, penalizing the model if $$\hat{y}$$ is far from 0.

For a dataset of *N* samples, the average binary cross-entropy loss is:

$$
L = -\frac{1}{N} \sum_{i=1}^{N} \left[y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i)\right]
$$

Binary cross-entropy is the standard loss function for training binary classifiers, including logistic regression models and neural networks with a [sigmoid function](/wiki/sigmoid_function) output layer.[3] It is also commonly referred to as "log loss" in the statistics and machine learning literature, and is the exact quantity computed by scikit-learn's `log_loss` metric for a logistic model.[15]

The gradient of binary cross-entropy with respect to the predicted probability has an intuitive form:

$$
\frac{\partial L}{\partial \hat{y}} = -\frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}}
$$

When combined with a sigmoid activation, the gradient with respect to the pre-activation [logits](/wiki/logits) simplifies to $$(\hat{y} - y)$$, which provides strong learning signals when the prediction is far from the true label.[4]
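The simplified gradient can be checked against a finite-difference estimate. This is a minimal sketch in plain Python; the helper names and the test point ($$y = 1$$, logit $$-2$$) are arbitrary choices for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce(y, y_hat):
    """Binary cross-entropy for one sample; y_hat must lie strictly in (0, 1)."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def numeric_grad(f, z, h=1e-6):
    """Central finite-difference estimate of df/dz."""
    return (f(z + h) - f(z - h)) / (2 * h)

y, z = 1.0, -2.0  # true label 1, logit that favors the wrong class

analytic = sigmoid(z) - y                               # closed form: y_hat - y
numeric = numeric_grad(lambda t: bce(y, sigmoid(t)), z)  # brute-force check
print(analytic, numeric)
```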

## Categorical cross-entropy

For [multi-class classification](/wiki/multi-class_classification) with *C* classes, the target labels are typically represented as one-hot encoded vectors. If $$y_i$$ is a one-hot vector where $$y_{i,c} = 1$$ for the correct class and 0 otherwise, and $$\hat{y}_{i,c}$$ is the predicted probability for class $$c$$, the categorical cross-entropy loss for a single sample is:

$$
L(y_i, \hat{y}_i) = -\sum_{c=1}^{C} y_{i,c} \log(\hat{y}_{i,c})
$$

Because the true label is one-hot, only the term corresponding to the correct class survives. If the correct class is $$k$$, this reduces to:

$$
L = -\log(\hat{y}_{i,k})
$$

This is simply the negative log-probability assigned to the correct class. The model is penalized more heavily when it assigns a low probability to the correct class.[9] For example, if the model assigns probability 0.9 to the correct class, the loss is approximately 0.105. If it assigns only 0.01, the loss jumps to approximately 4.605, creating a strong signal to correct the error.
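The two figures quoted above can be reproduced with a short sketch. The function name and the three-class probability vectors are assumptions made for the example.

```python
import math

def categorical_cross_entropy(y_true, y_pred):
    """-sum_c y_c * log(y_hat_c) for one sample; zero-target classes are skipped."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)

one_hot = [0.0, 1.0, 0.0]             # correct class is index 1
confident_right = [0.05, 0.90, 0.05]  # assigns 0.9 to the correct class
confident_wrong = [0.60, 0.01, 0.39]  # assigns 0.01 to the correct class

loss_right = categorical_cross_entropy(one_hot, confident_right)
loss_wrong = categorical_cross_entropy(one_hot, confident_wrong)
print(round(loss_right, 3), round(loss_wrong, 3))
```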

For a dataset of *N* samples, the average categorical cross-entropy is:

$$
L = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \log(\hat{y}_{i,c})
$$

### Sparse categorical cross-entropy

Some frameworks (such as TensorFlow and Keras) offer a "sparse" variant of categorical cross-entropy. The mathematical formula is identical, but instead of requiring one-hot encoded target vectors, it accepts integer class labels directly. This is computationally more efficient for problems with a large number of classes, since there is no need to allocate and store full one-hot vectors.

| Variant | Target format | Use case | Output activation |
|---|---|---|---|
| Binary cross-entropy | Scalar (0 or 1) | Two-class classification | Sigmoid |
| Categorical cross-entropy | One-hot vector | Multi-class classification (few classes) | [Softmax](/wiki/softmax) |
| Sparse categorical cross-entropy | Integer label | Multi-class classification (many classes) | Softmax |
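The equivalence between the one-hot and sparse forms can be illustrated in plain Python. The sketch below does not use any framework; the function names are hypothetical and only mirror the idea behind the framework variants.

```python
import math

def to_one_hot(label, num_classes):
    return [1.0 if c == label else 0.0 for c in range(num_classes)]

def categorical_ce(one_hot, probs):
    """Dense form: sums over all classes of a one-hot target."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs) if t > 0)

def sparse_categorical_ce(label, probs):
    """Sparse form: indexes the correct class directly, no one-hot vector needed."""
    return -math.log(probs[label])

probs = [0.1, 0.2, 0.6, 0.1]  # hypothetical model output over 4 classes
label = 2                     # integer class label

dense_loss = categorical_ce(to_one_hot(label, len(probs)), probs)
sparse_loss = sparse_categorical_ce(label, probs)
print(dense_loss, sparse_loss)  # same value
```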

## Equivalence to maximum likelihood estimation

One of the most important theoretical results connecting cross-entropy to classical statistics is its equivalence to maximum likelihood estimation (MLE).[11] When training a model by minimizing cross-entropy loss, the optimization objective is mathematically identical to maximizing the likelihood of the observed data under the model.[4]

Consider a dataset of *N* independent samples. The likelihood function is:

$$
L(\theta) = \prod_{i=1}^{N} Q(y_i \mid x_i; \theta)
$$

Taking the negative logarithm and dividing by *N* yields:

$$
-\frac{1}{N} \log L(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \log Q(y_i \mid x_i; \theta)
$$

This is exactly the cross-entropy loss when *P* is the empirical distribution of the data. Therefore, minimizing cross-entropy is equivalent to maximizing the log-likelihood.[11] This equivalence provides a strong statistical justification for using cross-entropy as a loss function: it is the principled way to fit a probabilistic model to observed data.

This connection also extends to information theory. The distribution *Q* that minimizes the cross-entropy *H(P, Q)* over a family of distributions is the one that best approximates *P* in the KL divergence sense. When *P* is the empirical distribution of the data, MLE, minimum KL divergence, and minimum cross-entropy all select the same model, which unifies the information-theoretic and statistical perspectives.[9]
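A toy Bernoulli model illustrates the equivalence. In this sketch the eight labels are invented data, and a coarse grid search stands in for a real optimizer; both are assumptions for the example.

```python
import math

labels = [1, 0, 1, 1, 0, 1, 1, 1]  # toy dataset: six positives out of eight

def avg_nll(theta, data):
    """Average negative log-likelihood of a Bernoulli(theta) model."""
    return -sum(math.log(theta if y == 1 else 1 - theta) for y in data) / len(data)

def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

freq = sum(labels) / len(labels)  # empirical frequency of label 1
empirical = [1 - freq, freq]      # empirical distribution over {0, 1}

# The average NLL equals the cross-entropy between the empirical
# distribution and the model distribution, for any theta.
theta = 0.6
print(avg_nll(theta, labels), cross_entropy(empirical, [1 - theta, theta]))

# Minimizing it over a grid recovers the maximum likelihood estimate.
grid = [i / 100 for i in range(1, 100)]
best = min(grid, key=lambda t: avg_nll(t, labels))
print(best, freq)
```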

## Cross-entropy as a loss function in deep learning

### Why does cross-entropy work well for classification?

Cross-entropy has become the default loss function for classification tasks in deep learning for several practical reasons.

**Strong gradients for incorrect predictions.** The gradient of cross-entropy loss with respect to the model's output logits has a simple and well-behaved form. For a softmax output layer, the gradient of the loss with respect to logit $$z_k$$ is:

$$
\frac{\partial L}{\partial z_k} = \hat{y}_k - y_k
$$

This means the gradient is simply the difference between the predicted probability and the true label. When the model is confidently wrong (for example, predicting 0.01 for the correct class), the gradient is large, pushing the weights to correct the mistake quickly. When the model is already close to the correct answer, the gradient is small, leading to gentle adjustments.

**No vanishing gradient problem at saturation.** Unlike mean squared error (MSE), cross-entropy does not suffer from vanishing gradients when the output neuron saturates. With a sigmoid or softmax activation, MSE gradients can become extremely small when the output is near 0 or 1 (since the derivative of the sigmoid is near zero in those regions). Cross-entropy cancels out this saturation effect, ensuring that learning continues even when predictions are far from the target.[4]

**Convexity properties.** When combined with a softmax or sigmoid output layer, the cross-entropy loss is convex with respect to the logits (the pre-activation values).[3] This makes the optimization landscape smoother and reduces the risk of getting stuck in poor local minima for the final layer.

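**Probabilistic interpretation.** Cross-entropy is a strictly proper scoring rule: its expected value is minimized only when the predicted probabilities equal the true conditional probabilities, so it rewards calibrated estimates rather than merely correct rankings. Because minimizing cross-entropy is equivalent to maximum likelihood estimation, the model's outputs can be read as probabilities and used for decision-making, risk assessment, or downstream probabilistic reasoning. In practice, large neural networks trained with cross-entropy can still be overconfident, so calibration is usually checked on held-out data.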

### How does cross-entropy compare with mean squared error?

| Property | Cross-entropy | Mean squared error (MSE) |
|---|---|---|
| Gradient when confidently wrong | Large (fast correction) | Can be small (slow correction) |
| Vanishing gradient at saturation | No | Yes, with sigmoid/softmax |
| Convexity with softmax/sigmoid | Convex in logits | Non-convex |
| Probabilistic interpretation | Direct (negative log-likelihood) | Indirect |
| Typical use case | Classification | Regression |
| Sensitivity to outliers | Moderate | High |
| Gradient with respect to the logit (sigmoid output) | $$(\hat{y} - y)$$, linear in the error | $$(\hat{y} - y) \hat{y} (1 - \hat{y})$$, saturates |

The gradient comparison is particularly revealing. With MSE and a sigmoid output, the gradient includes a $$\hat{y} (1 - \hat{y})$$ term from the sigmoid derivative. When $$\hat{y}$$ is close to 0 or 1 (saturated), this term approaches zero, making the gradient vanishingly small even when the prediction is completely wrong. Cross-entropy avoids this problem because the logarithm in the loss cancels the sigmoid derivative, producing the clean $$(\hat{y} - y)$$ gradient.
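The saturation effect is easy to see numerically. The sketch below assumes a single sigmoid output, the loss $$\tfrac{1}{2}(\hat{y} - y)^2$$ for MSE, and an arbitrarily chosen logit of $$-10$$ for a positive example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_ce(z, y):
    """d/dz of binary cross-entropy with a sigmoid output."""
    return sigmoid(z) - y

def grad_mse(z, y):
    """d/dz of 0.5 * (sigmoid(z) - y)**2; includes the sigmoid derivative."""
    y_hat = sigmoid(z)
    return (y_hat - y) * y_hat * (1.0 - y_hat)

z, y = -10.0, 1.0  # confidently wrong: y_hat is near 0 but the label is 1

print(grad_ce(z, y))   # close to -1: strong corrective signal
print(grad_mse(z, y))  # tiny magnitude: the sigmoid derivative has vanished
```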

### Softmax and cross-entropy: numerical stability

In practice, computing softmax probabilities and then taking their logarithm for cross-entropy can lead to numerical issues. The softmax function involves computing exponentials, which can overflow (producing infinity) or underflow (producing zero) for large or very negative logit values.

**The overflow problem.** If any logit $$z_k$$ is very large, then $$\exp(z_k)$$ can exceed the range of floating-point representation (roughly $$10^{308}$$ for float64), resulting in infinity.

**The underflow problem.** After softmax normalization, some probabilities may underflow to exactly zero in floating point. Taking $$\log(0)$$ then produces negative infinity, corrupting the loss computation.

**The log-sum-exp trick.** The standard solution is to subtract the maximum logit before computing softmax:

$$
\mathrm{softmax}(z_k) = \frac{\exp(z_k - \max(z))}{\sum_j \exp(z_j - \max(z))}
$$

This shift does not change the result mathematically (the constant cancels in numerator and denominator) but prevents overflow by ensuring the largest exponent is zero.[4]
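The effect of the shift can be demonstrated with Python's `math` module, where `math.exp` raises `OverflowError` for large arguments. The function names and the logit values are illustrative assumptions.

```python
import math

def softmax_naive(logits):
    exps = [math.exp(z) for z in logits]  # overflows for large logits
    total = sum(exps)
    return [e / total for e in exps]

def softmax_stable(logits):
    m = max(logits)                           # shift so the largest exponent is 0
    exps = [math.exp(z - m) for z in logits]  # every term now lies in (0, 1]
    total = sum(exps)
    return [e / total for e in exps]

big = [1000.0, 999.0, 998.0]

try:
    softmax_naive(big)
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True

print(naive_overflowed)
print(softmax_stable(big))  # same probabilities as for logits [2, 1, 0]
```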

**Fused log-softmax and cross-entropy.** Modern deep learning frameworks provide fused operations that compute the log-softmax and cross-entropy together in a single numerically stable pass. The combined log-softmax can be written as:

$$
\log \mathrm{softmax}(z_k) = z_k - \log\left(\sum_j \exp(z_j)\right)
$$

By using the log-sum-exp trick on the second term, this computation avoids ever materializing the raw softmax probabilities. The cross-entropy loss for the correct class *k* then simplifies to:

$$
L = -z_k + \log\left(\sum_j \exp(z_j)\right)
$$

This fused formulation is both faster and more numerically robust than computing softmax and log separately. In PyTorch, `torch.nn.CrossEntropyLoss` accepts raw logits directly and handles all of this internally; its documentation specifies that the loss "is equivalent to applying LogSoftmax on an input, followed by NLLLoss," and that the input "is expected to contain the unnormalized logits for each class."[13]
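The fused formula can be sketched without any framework. The code below is not PyTorch's implementation; it is a plain-Python illustration of $$L = -z_k + \log \sum_j \exp(z_j)$$ with hypothetical function names.

```python
import math

def log_sum_exp(logits):
    """log(sum_j exp(z_j)), computed with the max-shift for stability."""
    m = max(logits)
    return m + math.log(sum(math.exp(z - m) for z in logits))

def cross_entropy_from_logits(logits, target):
    """Loss for integer class `target`, never forming the softmax probabilities."""
    return -logits[target] + log_sum_exp(logits)

logits = [2.0, 1.0, 0.1]
print(cross_entropy_from_logits(logits, 0))

# Extreme logits: a separate softmax would give probability 0 for class 1,
# and log(0) would fail, but the fused form stays finite.
extreme = [1000.0, 0.0, -1000.0]
print(cross_entropy_from_logits(extreme, 1))
```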

## Framework implementations

All major deep learning frameworks provide built-in cross-entropy loss functions. Understanding the differences between them is important for correct usage.

| Framework | Function | Input type | Task | Notes |
|---|---|---|---|---|
| PyTorch | `nn.CrossEntropyLoss` | Raw logits | Multi-class | Combines log-softmax + NLL loss; accepts integer class labels[13] |
| PyTorch | `nn.BCELoss` | Probabilities (after sigmoid) | Binary / multi-label | Requires manual sigmoid; less numerically stable |
| PyTorch | `nn.BCEWithLogitsLoss` | Raw logits | Binary / multi-label | Fuses sigmoid + BCE; recommended over `BCELoss` |
| TensorFlow | `CategoricalCrossentropy` | Probabilities or logits | Multi-class | Set `from_logits=True` for logits input[14] |
| TensorFlow | `SparseCategoricalCrossentropy` | Probabilities or logits | Multi-class | Integer labels; `from_logits=True` recommended[14] |
| TensorFlow | `BinaryCrossentropy` | Probabilities or logits | Binary / multi-label | Set `from_logits=True` for stability |
| JAX (Optax) | `softmax_cross_entropy` | Raw logits | Multi-class | Pure-function API |

A common mistake is to apply softmax or sigmoid before passing the output to a loss function that already applies it internally. For example, using `nn.Softmax` followed by `nn.CrossEntropyLoss` in PyTorch applies softmax twice, producing incorrect gradients and poor training results. Always check whether your loss function expects raw logits or probabilities.

PyTorch's `nn.CrossEntropyLoss` also supports optional `weight`, `ignore_index`, and `label_smoothing` parameters, making it flexible for weighted losses, masked sequences, and regularized training without needing separate implementations.[13]

## Variants and extensions

Several modifications to the standard cross-entropy loss have been proposed to handle specific challenges in machine learning.[10]

### Weighted cross-entropy

In datasets with imbalanced class distributions, the standard cross-entropy loss can be biased toward the majority class because it treats all samples equally. Weighted cross-entropy addresses this by assigning different weights to different classes:

$$
L = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} w_c y_{i,c} \log(\hat{y}_{i,c})
$$

where *w_c* is the weight for class *c*. A common approach is to set weights inversely proportional to class frequency: classes with fewer training examples receive higher weights, encouraging the model to pay more attention to underrepresented classes.

In PyTorch, class weights can be passed directly to `nn.CrossEntropyLoss(weight=tensor)`. In TensorFlow, the `class_weight` parameter in `model.fit()` achieves a similar effect by scaling each sample's loss according to its class.
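The weighted formula and the inverse-frequency heuristic can be sketched in plain Python. The specific weighting rule used here, $$w_c = N / (C \cdot n_c)$$, is one common convention chosen as an assumption; the labels and probabilities are made-up data, and the function names are illustrative.

```python
import math
from collections import Counter

def inverse_frequency_weights(labels, num_classes):
    """w_c = N / (C * n_c); assumes every class appears at least once."""
    counts = Counter(labels)
    n = len(labels)
    return [n / (num_classes * counts[c]) for c in range(num_classes)]

def weighted_cross_entropy(labels, probs, weights):
    """Implements -(1/N) * sum_i w_{y_i} * log(y_hat_{i, y_i}) for integer labels."""
    total = sum(-weights[y] * math.log(p[y]) for y, p in zip(labels, probs))
    return total / len(labels)

labels = [0, 0, 0, 1]  # class 1 is the minority
probs = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.4, 0.6]]

weights = inverse_frequency_weights(labels, num_classes=2)
print(weights)  # the minority class gets the larger weight
print(weighted_cross_entropy(labels, probs, weights))
print(weighted_cross_entropy(labels, probs, [1.0, 1.0]))  # unweighted baseline
```

This sketch divides by *N*, as in the formula above. Frameworks may normalize differently, so absolute loss values are not always comparable across implementations.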

### Focal loss

Focal loss was introduced by Tsung-Yi Lin and colleagues at Facebook AI Research in 2017 to address the extreme class imbalance encountered in dense object detection tasks.[5] In one-stage detectors like RetinaNet, the vast majority of anchor boxes correspond to background (easy negatives), which can overwhelm the detector during training. The authors traced the accuracy gap between one-stage and two-stage detectors directly to this problem, stating that "the extreme foreground-background class imbalance encountered during training of dense detectors is the central cause," and proposing "to address this class imbalance by reshaping the standard cross entropy loss such that it down-weights the loss assigned to well-classified examples."[5]

Focal loss modifies the standard cross-entropy by adding a modulating factor:

$$
\mathrm{FL}(p_t) = -\alpha_t (1 - p_t)^\gamma \log(p_t)
$$

where $$p_t$$ is the model's predicted probability for the correct class, $$\alpha_t$$ is a class-balancing weight, and $$\gamma \ge 0$$ is a focusing parameter that, in the authors' words, "smoothly adjusts the rate at which easy examples are down-weighted."[5] When $$\gamma = 0$$, focal loss reduces to standard cross-entropy. As $$\gamma$$ increases, the loss contribution from well-classified examples (those with high $$p_t$$) is down-weighted, allowing the model to focus its learning on hard, misclassified examples.
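The down-weighting can be illustrated with a direct transcription of the formula. In this sketch `alpha_t` defaults to 1 purely to isolate the effect of the modulating factor; that default is an assumption for the example, not a recommended setting.

```python
import math

def focal_loss(p_t, gamma=2.0, alpha_t=1.0):
    """FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t) for one example."""
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

def cross_entropy(p_t):
    return -math.log(p_t)

easy, hard = 0.9, 0.1  # probability assigned to the correct class

# Ratio of focal loss to cross-entropy equals (1 - p_t)**gamma.
print(focal_loss(easy) / cross_entropy(easy))  # about 0.01: easy example nearly ignored
print(focal_loss(hard) / cross_entropy(hard))  # about 0.81: hard example mostly kept
```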

In the original paper, the authors found that $$\gamma = 2$$ worked well across a range of object detection benchmarks.[5] Trained with focal loss, their RetinaNet detector reached 39.1 average precision (AP) on the COCO test-dev benchmark with a ResNet-101-FPN backbone, rising to 40.8 AP with a ResNeXt-101-FPN backbone, surpassing the published accuracy of all existing two-stage detectors at the time while matching the speed of one-stage methods.[5] The key insight is that weighted cross-entropy handles class imbalance by reweighting entire classes, while focal loss handles it by reweighting individual examples based on difficulty. This makes focal loss particularly effective when the imbalance is severe and there is a large number of easy negatives.

Focal loss has since been adopted in many domains beyond object detection, including medical image segmentation, natural language processing, and audio classification.

### Label smoothing

Label smoothing, proposed by Christian Szegedy and colleagues in 2016 as part of refinements to the Inception architecture, is a regularization technique that modifies the target distribution used in cross-entropy.[6] Instead of using hard one-hot targets, label smoothing mixes the one-hot target with a uniform distribution over all $$C$$ classes, so the correct class receives $$1 - \epsilon + \epsilon / C$$ and every other class receives $$\epsilon / C$$:

$$
y_c^{\text{smooth}} = y_c (1 - \epsilon) + \frac{\epsilon}{C}
$$

where $$\epsilon$$ is a small smoothing parameter and $$C$$ is the number of classes. In the original Inception-v3 experiments, the authors used $$\epsilon = 0.1$$ over $$C = 1000$$ ImageNet classes, and reported that this and their other refinements helped the model reach a 3.5% top-5 error rate on the ILSVRC 2012 validation set in an ensemble, multi-crop setting.[6] Label smoothing prevents the model from becoming overly confident and can improve generalization. The smoothed loss decomposes into two terms: $$(1 - \epsilon)$$ times the standard cross-entropy with the hard targets, plus $$\epsilon$$ times the cross-entropy between the uniform distribution and the prediction, which penalizes overconfident predictions that place near-zero probability on some classes.
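The smoothing formula can be checked on a small example. The sketch below assumes four classes and an invented prediction vector; it also verifies that the smoothed loss equals $$(1 - \epsilon)$$ times the hard-target cross-entropy plus $$\epsilon$$ times the cross-entropy against a uniform target.

```python
import math

def smooth_labels(one_hot, epsilon=0.1):
    """y_smooth_c = y_c * (1 - epsilon) + epsilon / C."""
    c = len(one_hot)
    return [y * (1.0 - epsilon) + epsilon / c for y in one_hot]

def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

epsilon = 0.1
one_hot = [0.0, 1.0, 0.0, 0.0]
pred = [0.1, 0.7, 0.1, 0.1]  # hypothetical model output
uniform = [1.0 / len(one_hot)] * len(one_hot)

smoothed = smooth_labels(one_hot, epsilon)
print(smoothed)  # correct class: 0.925, every other class: 0.025

direct = cross_entropy(smoothed, pred)
decomposed = (1 - epsilon) * cross_entropy(one_hot, pred) + epsilon * cross_entropy(uniform, pred)
print(direct, decomposed)  # equal up to rounding
```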

Label smoothing has become a standard component in many modern training recipes, particularly for image classification on benchmarks like ImageNet. Research by Müller, Kornblith, and Hinton (2019) showed that while label smoothing generally improves generalization and model calibration, it can actually hurt performance when the model is used as a teacher in knowledge distillation.[8] As they put it, "if a teacher network is trained with label smoothing, knowledge distillation into a student network is much less effective."[8] The reason is that label smoothing encourages the model to treat all incorrect classes as roughly equally probable, erasing the "dark knowledge" about inter-class similarities that distillation relies on.

### Binary cross-entropy with logits

Rather than applying a sigmoid function to the output and then computing binary cross-entropy, many frameworks offer a version that accepts raw logits. This combined operation is numerically more stable for the same reasons as the fused softmax-cross-entropy described above. In PyTorch, this is `torch.nn.BCEWithLogitsLoss`. It also supports a `pos_weight` parameter for adjusting the relative weight of positive versus negative samples, which is useful in multi-label classification where each label may have a different class balance.
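One commonly used algebraic rearrangement of binary cross-entropy on a logit $$z$$ is $$\max(z, 0) - z y + \log(1 + e^{-|z|})$$, which never exponentiates a large positive number. The sketch below illustrates that identity in plain Python; it is not a description of any framework's internals, and the function names are hypothetical.

```python
import math

def bce_with_logits(z, y):
    """Stable binary cross-entropy on a raw logit z with label y in {0, 1}."""
    # exp(-|z|) is always in (0, 1], so it cannot overflow.
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))

def bce_naive(z, y):
    """Sigmoid followed by log; fails or loses precision for extreme logits."""
    y_hat = 1.0 / (1.0 + math.exp(-z))
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

print(bce_with_logits(0.5, 1.0), bce_naive(0.5, 1.0))  # agree for moderate logits
print(bce_with_logits(-800.0, 1.0))  # finite, while the naive form overflows in exp
```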

## Cross-entropy in knowledge distillation

Knowledge distillation, introduced by Hinton, Vinyals, and Dean in 2015, is a model compression technique that transfers knowledge from a large "teacher" model to a smaller "student" model.[7] The motivation, in the authors' words, is that "making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets."[7] Cross-entropy plays a central role in this process.

The key insight is that the teacher model's soft probability outputs contain more information than hard labels alone. For example, when classifying an image of a cat, a well-trained teacher might output probabilities like [cat: 0.85, tiger: 0.10, dog: 0.04, ...]. The relatively high probability assigned to "tiger" encodes the teacher's learned knowledge that cats and tigers share visual features. These inter-class relationships, sometimes called "dark knowledge," would be lost with hard one-hot labels.

The distillation procedure uses a temperature parameter *T* to soften the teacher's output distribution. The softmax with temperature is:

$$
q_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}
$$

Higher temperatures produce softer (more uniform) distributions that reveal more information about the teacher's learned similarities between classes. At $$T = 1$$, this is the standard softmax.

The student model is trained with a combined loss function:

$$
L = \alpha H(y_{\text{hard}}, q_{\text{student}}) + (1 - \alpha) T^2 H(q_{\text{teacher}}^T, q_{\text{student}}^T)
$$

where $$H(y_{\text{hard}}, q_{\text{student}})$$ is the standard cross-entropy with hard labels (at $$T = 1$$), $$H(q_{\text{teacher}}^T, q_{\text{student}}^T)$$ is the cross-entropy between the teacher's and student's soft distributions (at temperature $$T$$), and $$\alpha$$ controls the balance between the two terms. The $$T^2$$ factor compensates for the fact that gradients from the soft targets are scaled down by $$1/T^2$$ when temperature is raised.
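The combined loss can be sketched for a single example in plain Python. The logits, the values $$\alpha = 0.1$$ and $$T = 4$$, and the function names are assumptions chosen only for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with temperature, using the max-shift for numerical stability."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(student_logits, teacher_logits, label, alpha=0.1, temperature=4.0):
    # Hard-label term: standard cross-entropy at T = 1.
    hard = -math.log(softmax(student_logits)[label])
    # Soft-label term: teacher vs. student distributions at temperature T.
    soft = cross_entropy(softmax(teacher_logits, temperature),
                         softmax(student_logits, temperature))
    return alpha * hard + (1 - alpha) * temperature ** 2 * soft

teacher_logits = [5.0, 2.0, 0.5]  # hypothetical teacher output
student_logits = [2.0, 1.5, 0.2]  # hypothetical student output

print(softmax(teacher_logits, 1.0))  # peaked distribution
print(softmax(teacher_logits, 4.0))  # softer distribution at higher temperature
print(distillation_loss(student_logits, teacher_logits, label=0))
```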

Hinton et al. found that setting $$\alpha$$ to a small value (giving most weight to the distillation loss) generally produced the best results.[7] The technique has been widely adopted for deploying efficient models in production, and it underpins the creation of models like DistilBERT and many other compressed architectures.

## Cross-entropy and perplexity

In natural language processing, [perplexity](/wiki/perplexity) is the standard metric for evaluating [language models](/wiki/language_model). Perplexity is directly derived from cross-entropy: it is the exponentiation of the cross-entropy loss.[9]

If $$L$$ is the average cross-entropy loss per token (in nats), then perplexity is:

$$
\mathrm{PP} = e^L
$$

If the loss is measured in bits (using log base 2), then:

$$
\mathrm{PP} = 2^L
$$

Perplexity has an intuitive interpretation: it represents the effective number of equally likely choices the model is uncertain between at each step. A language model with a perplexity of 10 on a given text corpus is, on average, as uncertain as if it were choosing uniformly among 10 possible next tokens at each position.

Because perplexity is a monotonic function of cross-entropy, any reduction in a language model's cross-entropy loss lowers its perplexity; a decrease of 0.1 nats, for example, cuts perplexity by roughly 10%. For large language models such as GPT and BERT, cross-entropy over the predicted tokens (next tokens for GPT, masked tokens for BERT) is the primary pre-training objective. Improvements in model architecture, data quality, or training methodology are often assessed by their effect on perplexity (and thus on cross-entropy).

In autoregressive language models, cross-entropy is computed over the entire sequence. Given a sequence of tokens $$(t_1, t_2, \ldots, t_N)$$, the loss is the average negative log-probability that the model *Q* assigns to each token given all preceding tokens:

$$
L = -\frac{1}{N} \sum_{i=1}^{N} \log Q(t_i \mid t_1, \ldots, t_{i-1})
$$

This is exactly the cross-entropy between the empirical distribution of next tokens and the model's predicted distribution at each position.
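The link between per-token loss and perplexity can be sketched as follows. The list of token probabilities is invented for the example and stands in for the probabilities a model would assign to the observed tokens; the function names are illustrative.

```python
import math

def sequence_cross_entropy(token_probs):
    """Average negative log-probability (nats) of the observed tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def perplexity(loss_nats):
    return math.exp(loss_nats)

# A model that always spreads its probability uniformly over 10 candidates
# assigns 0.1 to each observed token.
uniform_over_10 = [0.1] * 5
loss = sequence_cross_entropy(uniform_over_10)
print(loss, perplexity(loss))  # loss is ln(10), perplexity is about 10

# The same perplexity follows from the loss in bits with base 2.
loss_bits = loss / math.log(2)
print(2 ** loss_bits)
```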

| Cross-entropy (nats) | Perplexity | Interpretation |
|---|---|---|
| 0 | 1 | Perfect prediction (certainty) |
| 1.0 | ≈ 2.72 | Low uncertainty |
| 2.3 | ≈ 10 | Moderate uncertainty |
| 4.6 | ≈ 100 | High uncertainty |
| 6.9 | ≈ 1000 | Very high uncertainty |

Modern large language models achieve perplexities in the range of 10 to 30 on standard benchmarks like WikiText-103, corresponding to cross-entropy values of roughly 2.3 to 3.4 nats per token. Scaling laws research by Kaplan and colleagues (2020) showed that cross-entropy loss decreases predictably as a power law with increasing model size, dataset size, and compute budget; they reported that "the loss scales as a power-law with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven orders of magnitude," based on a study of more than 200 trained language models.[12]

## Applications

### Image classification

In computer vision, cross-entropy is the standard loss function for image classification tasks. Models like ResNet, VGG, Inception, and Vision Transformers (ViT) are all trained using categorical cross-entropy over the class labels. The output layer typically uses a softmax activation that produces a probability distribution over classes, and the cross-entropy loss measures how far this distribution is from the one-hot ground truth.
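As a rough illustration of the softmax output layer followed by cross-entropy against a one-hot label, the sketch below uses plain Python. The three-class logits are made up, and real image classifiers compute the same quantity over batches inside a deep learning framework.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_cross_entropy(one_hot, probs):
    """-sum_x P(x) log Q(x); only the true class contributes for one-hot P."""
    return -sum(t * math.log(q) for t, q in zip(one_hot, probs) if t > 0)

# Hypothetical logits for a 3-class problem; the true class is index 0.
logits = [2.0, 0.5, -1.0]
target = [1, 0, 0]

probs = softmax(logits)
loss = categorical_cross_entropy(target, probs)
print(f"probs = {[round(p, 3) for p in probs]}, loss = {loss:.3f}")
```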

### Natural language processing

Cross-entropy plays a central role in virtually all NLP tasks that involve predicting tokens or classes:

- **Language modeling.** Autoregressive models like GPT predict the next token in a sequence, and the loss is the cross-entropy between the predicted token distribution and the actual next token.
- **Machine translation.** Sequence-to-sequence models for translation use cross-entropy at each decoding step.
- **Text classification.** Sentiment analysis, topic classification, and spam detection models use binary or categorical cross-entropy.
- **Named entity recognition.** Token-level classification tasks use cross-entropy for each token's predicted label.

### Object detection and segmentation

One-stage detectors like RetinaNet and two-stage detectors like Faster R-CNN both use cross-entropy (or its focal loss variant) for the classification head.[5] In semantic segmentation, pixel-wise cross-entropy computes the loss for each pixel independently and averages across the image.
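Pixel-wise cross-entropy can be sketched as a per-pixel loss averaged over the image. The nested-list layout and the 2x2, three-class example below are assumptions made for readability; segmentation frameworks operate on tensors instead.

```python
import math

def log_softmax(logits):
    """Numerically stable log of the softmax probabilities."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum_exp for z in logits]

def pixelwise_cross_entropy(logit_map, label_map):
    """Mean cross-entropy over all pixels.

    logit_map[r][c] is a list of per-class logits for pixel (r, c);
    label_map[r][c] is that pixel's integer class index.
    """
    total, count = 0.0, 0
    for logit_row, label_row in zip(logit_map, label_map):
        for logits, label in zip(logit_row, label_row):
            total -= log_softmax(logits)[label]  # loss for this pixel alone
            count += 1
    return total / count

# Hypothetical 2x2 image with 3 classes.
logit_map = [
    [[3.0, 0.0, 0.0], [0.0, 3.0, 0.0]],
    [[0.0, 0.0, 3.0], [0.0, 0.0, 0.0]],
]
label_map = [[0, 1], [2, 1]]

loss = pixelwise_cross_entropy(logit_map, label_map)
print(f"mean pixel loss = {loss:.4f}")
```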

### Generative models

Variational autoencoders (VAEs) use binary cross-entropy as the reconstruction loss when the data consists of binary or near-binary values (such as binarized MNIST images). Generative adversarial networks (GANs) use binary cross-entropy in the discriminator's loss function to distinguish between real and generated samples.
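A minimal sketch of the discriminator's binary cross-entropy loss follows, assuming the common formulation in which real samples carry target 1 and generated samples carry target 0. The function names and the probabilities are hypothetical.

```python
import math

def binary_cross_entropy(target, prob):
    """BCE for a single prediction; prob must lie strictly in (0, 1)."""
    return -(target * math.log(prob) + (1 - target) * math.log(1 - prob))

def discriminator_loss(real_probs, fake_probs):
    """real_probs / fake_probs hold the discriminator outputs D(x) in (0, 1)."""
    real_loss = sum(binary_cross_entropy(1, p) for p in real_probs) / len(real_probs)
    fake_loss = sum(binary_cross_entropy(0, p) for p in fake_probs) / len(fake_probs)
    return real_loss + fake_loss

# A discriminator that separates the two sets well...
good = discriminator_loss(real_probs=[0.9, 0.8], fake_probs=[0.1, 0.2])
# ...versus one that outputs 0.5 for everything.
confused = discriminator_loss(real_probs=[0.5, 0.5], fake_probs=[0.5, 0.5])

print(f"good = {good:.3f}, confused = {confused:.3f}")
```

A discriminator that cannot tell real from generated samples sits at a loss of $$2 \log 2$$ under this formulation.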

### Reinforcement learning

In reinforcement learning, cross-entropy appears in policy gradient methods, where the agent's policy is parameterized as a probability distribution over actions and trained with [gradient descent](/wiki/gradient_descent)-style updates. The cross-entropy method (CEM) for optimization takes its name from the same quantity: it repeatedly refits a sampling distribution to the best-performing samples, which amounts to minimizing the cross-entropy between those samples and the sampling distribution.
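The cross-entropy method can be sketched for a one-dimensional problem as follows. This is a simplified illustration under stated assumptions: a Gaussian sampling distribution, a fixed elite fraction, and a toy score function. All parameter names and defaults are hypothetical rather than taken from any particular library.

```python
import random
import statistics

def cross_entropy_method(score, mean=0.0, std=5.0, n_samples=100,
                         n_elite=10, n_iters=30, seed=0):
    """Maximize score(x) over real x by iteratively refitting a Gaussian."""
    rng = random.Random(seed)
    for _ in range(n_iters):
        # 1. Sample candidates from the current distribution.
        samples = [rng.gauss(mean, std) for _ in range(n_samples)]
        # 2. Keep the highest-scoring candidates (the "elite" set).
        elite = sorted(samples, key=score, reverse=True)[:n_elite]
        # 3. Refit the Gaussian to the elites by maximum likelihood, which
        #    minimizes the cross-entropy between the elite samples and the
        #    sampling distribution.
        mean = statistics.fmean(elite)
        std = statistics.pstdev(elite) + 1e-6  # floor keeps std positive
    return mean

# Toy objective with its maximum at x = 3.
best = cross_entropy_method(lambda x: -(x - 3.0) ** 2)
print(f"estimated maximizer: {best:.3f}")
```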

## Practical considerations

### Numerical clipping

When computing cross-entropy manually, it is important to clip predicted probabilities away from exactly 0 and 1. Taking $$\log(0)$$ produces negative infinity, which will corrupt the training process. A common practice is to clip predictions to a small range such as $$[10^{-7}, 1 - 10^{-7}]$$ before computing the logarithm. Framework-provided loss functions that accept logits directly avoid the problem altogether: they compute the loss through a numerically stable log-softmax (or log-sigmoid), so manual clipping is unnecessary.
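A small sketch of the clipping step for binary cross-entropy, using the $$10^{-7}$$ bound mentioned above. The function name is hypothetical. Note that Python's `math.log(0.0)` raises a `ValueError` rather than returning negative infinity, so the unclipped computation would fail outright here.

```python
import math

EPS = 1e-7  # clipping bound from the text

def clipped_binary_cross_entropy(target, prob, eps=EPS):
    """Binary cross-entropy with the prediction clipped to [eps, 1 - eps]."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(target * math.log(prob) + (1 - target) * math.log(1.0 - prob))

# Predictions of exactly 0 or 1 on the wrong side now give a large but
# finite loss instead of an infinite one.
loss_at_zero = clipped_binary_cross_entropy(1, 0.0)
loss_at_one = clipped_binary_cross_entropy(0, 1.0)

print(f"{loss_at_zero:.3f} {loss_at_one:.3f}")
```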

### Learning rate sensitivity

Cross-entropy loss can produce large gradient magnitudes early in training, when the model's predictions are nearly random. For a $$C$$-class classification problem with random initialization, the predictions are close to uniform and the initial loss is approximately $$\log(C)$$. With many classes (e.g., $$C = 10000$$ in large-vocabulary language models), this is about 9.2 nats. Techniques such as learning rate warmup, gradient clipping, or using an adaptive optimizer like Adam can help manage this.
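The $$\log(C)$$ starting point is easy to check: when all logits are equal, the softmax is uniform and the loss equals $$\log(C)$$ regardless of the target. The sketch below uses all-zero logits as a stand-in for an untrained model, which is an idealization of random initialization.

```python
import math

def cross_entropy_from_logits(logits, target_index):
    """Cross-entropy computed directly from logits via log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_index]

initial_losses = {}
for num_classes in (10, 1000, 10000):
    # Equal logits give a uniform predicted distribution.
    loss = cross_entropy_from_logits([0.0] * num_classes, target_index=0)
    initial_losses[num_classes] = loss
    print(f"C = {num_classes:>5}: loss = {loss:.3f}, log(C) = {math.log(num_classes):.3f}")
```

Comparing the first logged training loss against $$\log(C)$$ is a common sanity check: a much larger value suggests a problem with initialization or the data pipeline.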

### Class imbalance

For imbalanced datasets, standard cross-entropy can lead to models that predict the majority class almost exclusively. Several strategies help mitigate this:

| Strategy | Description | When to use |
|---|---|---|
| Weighted cross-entropy | Assign higher loss weights to minority classes | Moderate imbalance |
| Focal loss | Down-weight easy (majority class) examples | Severe imbalance with many easy negatives |
| Oversampling | Duplicate minority class examples in training data | Small datasets |
| Undersampling | Remove majority class examples from training data | Large datasets with severe imbalance |
| Synthetic data (SMOTE) | Generate synthetic minority class examples | Tabular data |
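The two loss-based strategies in the table can be sketched as follows. The focal loss here follows the form $$-(1 - p_t)^{\gamma} \log p_t$$ from Lin et al. [5], without the optional class-balancing factor. The class weights, the value of `gamma`, and the example probabilities are illustrative assumptions.

```python
import math

def weighted_cross_entropy(probs, target_index, class_weights):
    """Cross-entropy scaled by a per-class weight for the true class."""
    return -class_weights[target_index] * math.log(probs[target_index])

def focal_loss(probs, target_index, gamma=2.0):
    """Cross-entropy down-weighted by (1 - p_t)^gamma for confident predictions."""
    p_t = probs[target_index]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

# Hypothetical binary problem: class 0 is the majority, class 1 the minority.
weights = [1.0, 5.0]
minority_loss = weighted_cross_entropy([0.6, 0.4], target_index=1, class_weights=weights)

easy = [0.95, 0.05]  # confident, correct prediction for class 0
hard = [0.30, 0.70]  # wrong-leaning prediction for class 0
easy_focal = focal_loss(easy, 0)
hard_focal = focal_loss(hard, 0)

print(f"weighted minority loss = {minority_loss:.3f}")
print(f"focal: easy = {easy_focal:.5f}, hard = {hard_focal:.3f}")
```

With `gamma = 0` the focal loss reduces to ordinary cross-entropy; larger values shrink the contribution of easy examples more aggressively.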

### Calibration

While cross-entropy training encourages calibrated probability outputs in theory, modern deep neural networks often produce overconfident predictions in practice.[8] This happens because modern networks have enough capacity to drive training loss to near zero, resulting in very sharp output distributions. Techniques like temperature scaling, Platt scaling, and label smoothing can improve calibration after or during training. Temperature scaling is particularly popular because it requires only a single parameter and does not affect the model's accuracy.
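Temperature scaling divides the logits by a single scalar $$T$$ before the softmax. The sketch below uses an arbitrary $$T = 2$$ on made-up logits; in practice $$T$$ is typically fit on a held-out validation set.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature T > 0."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 1.0, 0.0]                 # hypothetical model outputs
sharp = softmax(logits)                  # T = 1: original confidence
soft = softmax(logits, temperature=2.0)  # T > 1: softer distribution

# The ranking of classes is unchanged, so accuracy is unaffected;
# only the confidence attached to the prediction changes.
print([round(p, 3) for p in sharp])
print([round(p, 3) for p in soft])
```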

### Multi-label classification

When each input can belong to multiple classes simultaneously (for example, tagging an image with multiple attributes), the problem is framed as multiple independent binary classification tasks. Binary cross-entropy is applied independently to each label, and the total loss is the sum across all labels. In this setting, a sigmoid activation is applied to each output independently rather than a softmax across all outputs.
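A minimal sketch of the multi-label setup: one sigmoid per output and binary cross-entropy summed across labels. The logits and the three-label target are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def multilabel_bce(logits, targets):
    """Sum of independent binary cross-entropy terms, one per label."""
    total = 0.0
    for z, t in zip(logits, targets):
        p = sigmoid(z)  # each label gets its own probability
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total

# Hypothetical image tagged with labels 0 and 2 but not label 1.
logits = [2.0, -1.0, 0.0]
targets = [1, 0, 1]

probs = [sigmoid(z) for z in logits]
loss = multilabel_bce(logits, targets)

# Unlike softmax outputs, the per-label probabilities need not sum to 1.
print([round(p, 3) for p in probs], round(loss, 3))
```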

## Explain like I'm 5 (ELI5)

Imagine you have a bag of different-colored balls, and you want to teach a friend to guess the color of a ball before pulling it out of the bag. Your friend starts by guessing the chances of each color. Cross-entropy is a way to measure how good their guesses are compared to the real chances.

If the bag has mostly red balls and your friend guesses that red is likely, their cross-entropy score will be low (good). If they guess that blue is most likely when the bag is mostly red, their score will be high (bad). The goal of training a machine learning model is to make this score as low as possible, which means the model's guesses get closer and closer to reality.

The "cross" part of the name comes from comparing across two different probability estimates: the real one (what is actually in the bag) and the guessed one (what your friend thinks is in the bag).

## References

1. Shannon, C. E. (1948). "A Mathematical Theory of Communication." *Bell System Technical Journal*, 27(3), 379-423.
2. Kullback, S.; Leibler, R. A. (1951). "On Information and Sufficiency." *Annals of Mathematical Statistics*, 22(1), 79-86.
3. Bishop, C. M. (2006). *Pattern Recognition and Machine Learning*. Springer.
4. Goodfellow, I.; Bengio, Y.; Courville, A. (2016). *Deep Learning*. MIT Press. Chapter 3.13 (Information Theory) and Chapter 6.2.2 (Cross-Entropy Loss).
5. Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. (2017). "Focal Loss for Dense Object Detection." *IEEE International Conference on Computer Vision (ICCV)*. arXiv:1708.02002.
6. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. (2016). "Rethinking the Inception Architecture for Computer Vision." *IEEE Conference on Computer Vision and Pattern Recognition (CVPR)*. arXiv:1512.00567.
7. Hinton, G.; Vinyals, O.; Dean, J. (2015). "Distilling the Knowledge in a Neural Network." *arXiv preprint arXiv:1503.02531*.
8. Müller, R.; Kornblith, S.; Hinton, G. (2019). "When Does Label Smoothing Help?" *Advances in Neural Information Processing Systems (NeurIPS)*. arXiv:1906.02629.
9. Murphy, K. P. (2022). *Probabilistic Machine Learning: An Introduction*. MIT Press.
10. Zhang, Z.; Sabuncu, M. R. (2018). "Generalized Cross Entropy Loss for Training Deep Neural Networks with Noisy Labels." *Advances in Neural Information Processing Systems (NeurIPS)*.
11. Mao, L. (2020). "Cross Entropy, KL Divergence, and Maximum Likelihood Estimation." *Lei Mao's Log Book*. https://leimao.github.io/blog/Cross-Entropy-KL-Divergence-MLE/
12. Kaplan, J.; McCandlish, S.; Henighan, T. et al. (2020). "Scaling Laws for Neural Language Models." *arXiv preprint arXiv:2001.08361*.
13. PyTorch documentation. "CrossEntropyLoss." PyTorch 2.x. https://docs.pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html
14. TensorFlow / Keras documentation. "CategoricalCrossentropy, SparseCategoricalCrossentropy, BinaryCrossentropy losses." https://www.tensorflow.org/api_docs/python/tf/keras/losses
15. scikit-learn documentation. "sklearn.metrics.log_loss (logistic / cross-entropy loss)." https://scikit-learn.org/stable/modules/generated/sklearn.metrics.log_loss.html