See also: Machine learning terms, Loss function, Entropy
Cross-entropy is a concept from information theory that measures the difference between two probability distributions. In the context of machine learning, it has become one of the most widely used loss functions for training classification models, including neural networks, logistic regression, and softmax regression. The core idea is straightforward: cross-entropy quantifies how well a predicted probability distribution matches the true distribution of the data. When the predicted distribution closely matches the actual distribution, cross-entropy is low; when the two distributions diverge, cross-entropy is high.
The concept traces its roots to Claude Shannon's 1948 paper "A Mathematical Theory of Communication," which laid the groundwork for information theory. Shannon introduced entropy as a measure of the average amount of information produced by a stochastic source of data. Cross-entropy extends this idea to the comparison of two distributions, and it has since become a foundational tool in statistical learning, coding theory, and modern deep learning.
To understand cross-entropy, it helps to first review the building blocks from information theory: self-information, entropy, and Kullback-Leibler divergence.
Given a discrete event x that occurs with probability P(x), the self-information (also called surprisal) is defined as:
I(x) = -log P(x)
Self-information quantifies how "surprising" an event is. A certain event (P(x) = 1) carries zero information, while a very unlikely event carries high information. The choice of logarithm base determines the unit: base 2 gives bits, base e gives nats, and base 10 gives hartleys.
Shannon entropy measures the average self-information across all possible events in a distribution P:
H(P) = -\sum_{x} P(x) \log P(x)
Entropy represents the minimum average number of bits (or nats) required to encode events drawn from distribution P using an optimal coding scheme. It reaches its maximum value when all outcomes are equally likely (a uniform distribution) and its minimum value of zero when one outcome is certain.
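As a quick illustrative sketch (NumPy; the example distributions are made up), entropy can be computed directly from the definition:

```python
import numpy as np

def entropy(p, base=2):
    """Shannon entropy of a discrete distribution p (array of probabilities)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                              # treat 0 * log(0) as 0
    return -np.sum(p * np.log(p)) / np.log(base)

print(entropy([0.5, 0.5]))                    # 1.0 bit  (fair coin, maximum for 2 outcomes)
print(entropy([1.0, 0.0]))                    # 0.0 bits (certain outcome)
print(entropy([0.25, 0.25, 0.25, 0.25]))      # 2.0 bits (uniform over 4 outcomes)
```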
The Kullback-Leibler (KL) divergence measures how one probability distribution Q differs from a reference distribution P:
D_{KL}(P || Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}
KL divergence is always non-negative and equals zero only when P = Q. It is not symmetric, meaning D_{KL}(P || Q) ≠ D_{KL}(Q || P) in general. KL divergence can be interpreted as the additional number of bits required to encode samples from P when using a code optimized for Q instead of P.
The relationship between cross-entropy, entropy, and KL divergence is given by:
H(P, Q) = H(P) + D_{KL}(P || Q)
This identity shows that cross-entropy equals the entropy of P plus the KL divergence from P to Q. In machine learning, the true distribution P is fixed (it represents the ground-truth labels), so H(P) is a constant that does not depend on the model parameters. As a result, minimizing cross-entropy with respect to the model is equivalent to minimizing the KL divergence between the true distribution and the model's predicted distribution.
This relationship also makes clear that cross-entropy is always at least as large as the entropy of the true distribution: H(P, Q) >= H(P). The gap between the two is precisely the KL divergence, which measures the "inefficiency" of using Q to model P.
Given two discrete probability distributions P (the true distribution) and Q (the predicted distribution) over the same set of events, the cross-entropy is defined as:
H(P, Q) = -\sum_{x} P(x) \log Q(x)
Here, P(x) is the true probability of event x, and Q(x) is the predicted probability. Cross-entropy is always non-negative, and it achieves its minimum value when Q = P. In that case, H(P, Q) = H(P), the entropy of the true distribution.
For continuous distributions, the cross-entropy is defined using an integral:
H(P, Q) = -\int P(x) \log Q(x) dx
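As a small numerical check of these definitions (a NumPy sketch; the distributions are chosen arbitrarily for illustration), one can verify the identity H(P, Q) = H(P) + D_{KL}(P || Q) and the fact that cross-entropy is minimized at Q = P:

```python
import numpy as np

def cross_entropy(p, q):
    """H(P, Q) = -sum_x P(x) log Q(x), in nats."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return -np.sum(p * np.log(q))

def entropy(p):
    p = np.asarray(p, float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def kl_divergence(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = np.array([0.7, 0.2, 0.1])   # "true" distribution
q = np.array([0.5, 0.3, 0.2])   # model's predicted distribution

print(cross_entropy(p, q))                 # ~0.887 nats
print(entropy(p) + kl_divergence(p, q))    # same value: H(P) + D_KL(P || Q)
print(cross_entropy(p, p))                 # ~0.802 nats = H(P), the minimum
```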
| Property | Description |
|---|---|
| Non-negativity | H(P, Q) >= 0 for all distributions P and Q |
| Minimum at P = Q | Cross-entropy is minimized when the predicted distribution matches the true distribution |
| Asymmetry | H(P, Q) ≠ H(Q, P) in general; the order of arguments matters |
| Relation to KL divergence | H(P, Q) = H(P) + D_{KL}(P \|\| Q), so H(P, Q) >= H(P) |
| Decomposition | Separates into entropy (irreducible uncertainty) plus divergence (model error) |
| Additivity | For independent variables, H(P_{XY}, Q_{XY}) = H(P_X, Q_X) + H(P_Y, Q_Y) |
The term "cross-entropy" appears in both information theory and machine learning, but there are some differences in convention and emphasis that can cause confusion.
In information theory, cross-entropy H(P, Q) measures the average number of bits (or nats) needed to encode events from a true distribution P using an optimal code designed for an approximating distribution Q. The focus is on coding efficiency and compression. Information theorists typically use base-2 logarithms, reporting cross-entropy in bits, and the distributions P and Q are both genuine probability distributions over the same event space.
In machine learning, cross-entropy serves as a loss function for training classifiers. Several conventions differ from the information-theoretic usage:
| Aspect | Information theory | Machine learning |
|---|---|---|
| Logarithm base | Base 2 (bits) | Natural logarithm (nats) |
| Role of P | Any true distribution | Empirical distribution of training labels (often one-hot) |
| Role of Q | Approximate distribution | Model's predicted distribution (parameterized) |
| Primary goal | Measure coding inefficiency | Provide a differentiable training objective |
| Interpretation | Expected message length | Negative log-likelihood of correct labels |
Because training labels in classification are typically one-hot vectors (a single class has probability 1 and all others have probability 0), the cross-entropy loss for a single sample reduces to the negative log-probability of the correct class: L = -log Q(y_correct). This simplification does not arise in general information-theoretic settings where P may be a full distribution over multiple outcomes.
Another distinction: in information theory, cross-entropy includes the irreducible entropy H(P) of the true distribution. In machine learning, since H(P) is constant with respect to model parameters, it is often dropped from the optimization objective. Some practitioners therefore use "cross-entropy loss" and "KL divergence" interchangeably during training, even though they differ by the constant H(P).
In binary classification problems with two possible outcomes (positive = 1, negative = 0), the cross-entropy simplifies to a particularly clean form. Let y denote the true label (0 or 1) and y_hat denote the predicted probability that the label is 1. The binary cross-entropy (BCE) for a single sample is:
L(y, y_hat) = -[y * log(y_hat) + (1 - y) * log(1 - y_hat)]
When y = 1, only the first term contributes, penalizing the model if y_hat is far from 1. When y = 0, only the second term contributes, penalizing the model if y_hat is far from 0.
For a dataset of N samples, the average binary cross-entropy loss is:
L = -(1/N) * sum_{i=1}^{N} [y_i * log(y_hat_i) + (1 - y_i) * log(1 - y_hat_i)]
Binary cross-entropy is the standard loss function for training binary classifiers, including logistic regression models and neural networks with a sigmoid output layer. It is also commonly referred to as "log loss" in the statistics and machine learning literature.
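A minimal NumPy sketch of the averaged binary cross-entropy (the clipping constant eps is an arbitrary choice for illustration):

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Average BCE over a batch; y_true in {0, 1}, y_pred in (0, 1)."""
    y_true = np.asarray(y_true, float)
    y_pred = np.clip(np.asarray(y_pred, float), eps, 1 - eps)   # avoid log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = [1, 0, 1, 1]
y_pred = [0.9, 0.2, 0.7, 0.95]
print(binary_cross_entropy(y_true, y_pred))   # ~0.184
```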
The gradient of binary cross-entropy with respect to the predicted probability has an intuitive form:
dL/d(y_hat) = -(y / y_hat) + (1 - y) / (1 - y_hat)
When combined with a sigmoid activation, the gradient with respect to the pre-activation logits simplifies to (y_hat - y), which provides strong learning signals when the prediction is far from the true label.
For multi-class classification with C classes, the target labels are typically represented as one-hot encoded vectors. If y_i is a one-hot vector where y_{i,c} = 1 for the correct class and 0 otherwise, and y_hat_{i,c} is the predicted probability for class c, the categorical cross-entropy loss for a single sample is:
L(y_i, y_hat_i) = -\sum_{c=1}^{C} y_{i,c} * log(y_hat_{i,c})
Because the true label is one-hot, only the term corresponding to the correct class survives. If the correct class is k, this reduces to:
L = -log(y_hat_{i,k})
This is simply the negative log-probability assigned to the correct class. The model is penalized more heavily when it assigns a low probability to the correct class. For example, if the model assigns probability 0.9 to the correct class, the loss is approximately 0.105. If it assigns only 0.01, the loss jumps to approximately 4.605, creating a strong signal to correct the error.
For a dataset of N samples, the average categorical cross-entropy is:
L = -(1/N) * sum_{i=1}^{N} sum_{c=1}^{C} y_{i,c} * log(y_hat_{i,c})
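A short sketch of the averaged categorical cross-entropy with one-hot targets (NumPy; the probabilities are illustrative values):

```python
import numpy as np

def categorical_cross_entropy(y_true, y_pred, eps=1e-7):
    """y_true: (N, C) one-hot targets; y_pred: (N, C) predicted probabilities."""
    y_pred = np.clip(y_pred, eps, 1.0)                    # guard against log(0)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

y_true = np.array([[0, 1, 0],
                   [1, 0, 0]])
y_pred = np.array([[0.1, 0.8, 0.1],
                   [0.3, 0.4, 0.3]])
print(categorical_cross_entropy(y_true, y_pred))          # ~0.714
```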
Some frameworks (such as TensorFlow and Keras) offer a "sparse" variant of categorical cross-entropy. The mathematical formula is identical, but instead of requiring one-hot encoded target vectors, it accepts integer class labels directly. This is computationally more efficient for problems with a large number of classes, since there is no need to allocate and store full one-hot vectors.
| Variant | Target format | Use case | Output activation |
|---|---|---|---|
| Binary cross-entropy | Scalar (0 or 1) | Two-class classification | Sigmoid |
| Categorical cross-entropy | One-hot vector | Multi-class classification (few classes) | Softmax |
| Sparse categorical cross-entropy | Integer label | Multi-class classification (many classes) | Softmax |
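In Keras, for example, choosing between the dense and sparse variants is just a matter of how the labels are encoded. A minimal sketch (layer sizes and input shape are arbitrary):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(10),   # raw logits: no softmax layer here
])

# from_logits=True lets the loss apply a numerically stable log-softmax internally.
model.compile(optimizer="adam",
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=["accuracy"])
# The targets would then be integer class ids in [0, 10), not one-hot vectors.
```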
One of the most important theoretical results connecting cross-entropy to classical statistics is its equivalence to maximum likelihood estimation (MLE). When training a model by minimizing cross-entropy loss, the optimization objective is mathematically identical to maximizing the likelihood of the observed data under the model.
Consider a dataset of N independent samples. The likelihood function is:
L(theta) = prod_{i=1}^{N} Q(y_i | x_i; theta)
Taking the negative logarithm and dividing by N yields:
-(1/N) * log L(theta) = -(1/N) * sum_{i=1}^{N} log Q(y_i | x_i; theta)
This is exactly the cross-entropy loss when P is the empirical distribution of the data. Therefore, minimizing cross-entropy is equivalent to maximizing the log-likelihood. This equivalence provides a strong statistical justification for using cross-entropy as a loss function: it is the principled way to fit a probabilistic model to observed data.
This connection also extends to information theory. The distribution Q that minimizes the cross-entropy H(P, Q) over a family of distributions is the one that best approximates P in the KL divergence sense. In exponential family models, MLE, minimum KL divergence, and minimum cross-entropy all coincide, unifying the information-theoretic and statistical perspectives.
Cross-entropy has become the default loss function for classification tasks in deep learning for several practical reasons.
Strong gradients for incorrect predictions. The gradient of cross-entropy loss with respect to the model's output logits has a simple and well-behaved form. For a softmax output layer, the gradient of the loss with respect to logit z_k is:
dL/dz_k = y_hat_k - y_k
This means the gradient is simply the difference between the predicted probability and the true label. When the model is confidently wrong (for example, predicting 0.01 for the correct class), the gradient is large, pushing the weights to correct the mistake quickly. When the model is already close to the correct answer, the gradient is small, leading to gentle adjustments.
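This can be checked numerically with PyTorch autograd, for instance (a quick sketch; the logit values are arbitrary):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, -1.0, 0.5], requires_grad=True)
target = torch.tensor(0)                          # correct class is index 0

loss = F.cross_entropy(logits.unsqueeze(0), target.unsqueeze(0))
loss.backward()

y_hat = F.softmax(logits.detach(), dim=0)
y = F.one_hot(target, num_classes=3).float()
print(logits.grad)    # gradient computed by autograd
print(y_hat - y)      # identical: dL/dz = y_hat - y
```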
No vanishing gradient problem at saturation. Unlike mean squared error (MSE), cross-entropy does not suffer from vanishing gradients when the output neuron saturates. With a sigmoid or softmax activation, MSE gradients can become extremely small when the output is near 0 or 1 (since the derivative of the sigmoid is near zero in those regions). Cross-entropy cancels out this saturation effect, ensuring that learning continues even when predictions are far from the target.
Convexity properties. When combined with a softmax or sigmoid output layer, the cross-entropy loss is convex with respect to the logits (the pre-activation values). This makes the optimization landscape smoother and reduces the risk of getting stuck in poor local minima for the final layer.
Probabilistic interpretation. Cross-entropy naturally produces well-calibrated probability estimates. Because minimizing cross-entropy is equivalent to maximum likelihood estimation, the resulting model provides probability outputs that are meaningful and can be used directly for decision-making, risk assessment, or downstream probabilistic reasoning.
| Property | Cross-entropy | Mean squared error (MSE) |
|---|---|---|
| Gradient when confidently wrong | Large (fast correction) | Can be small (slow correction) |
| Vanishing gradient at saturation | No | Yes, with sigmoid/softmax |
| Convexity with softmax/sigmoid | Convex in logits | Non-convex |
| Probabilistic interpretation | Direct (negative log-likelihood) | Indirect |
| Typical use case | Classification | Regression |
| Sensitivity to outliers | Moderate | High |
| Gradient form (sigmoid output) | (y_hat - y), linear | (y_hat - y) * y_hat * (1 - y_hat), saturates |
The gradient comparison is particularly revealing. With MSE and a sigmoid output, the gradient includes a y_hat * (1 - y_hat) term from the sigmoid derivative. When y_hat is close to 0 or 1 (saturated), this term approaches zero, making the gradient vanishingly small even when the prediction is completely wrong. Cross-entropy avoids this problem because the logarithm in the loss cancels the sigmoid derivative, producing the clean (y_hat - y) gradient.
In practice, computing softmax probabilities and then taking their logarithm for cross-entropy can lead to numerical issues. The softmax function involves computing exponentials, which can overflow (producing infinity) or underflow (producing zero) for large or very negative logit values.
The overflow problem. If any logit z_k is very large, then exp(z_k) can exceed the range of floating-point representation (roughly 10^308 for float64), resulting in infinity.
The underflow problem. After softmax normalization, some probabilities may be extremely close to zero. Taking log(0) then produces negative infinity, corrupting the loss computation.
The log-sum-exp trick. The standard solution is to subtract the maximum logit before computing softmax:
softmax(z_k) = exp(z_k - max(z)) / sum_j exp(z_j - max(z))
This shift does not change the result mathematically (the constant cancels in numerator and denominator) but prevents overflow by ensuring the largest exponent is zero.
Fused log-softmax and cross-entropy. Modern deep learning frameworks provide fused operations that compute the log-softmax and cross-entropy together in a single numerically stable pass. The combined log-softmax can be written as:
log softmax(z_k) = z_k - log(sum_j exp(z_j))
By using the log-sum-exp trick on the second term, this computation avoids ever materializing the raw softmax probabilities. The cross-entropy loss for the correct class k then simplifies to:
L = -z_k + log(sum_j exp(z_j))
This fused formulation is both faster and more numerically robust than computing softmax and log separately. In PyTorch, torch.nn.CrossEntropyLoss accepts raw logits directly and handles all of this internally.
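A NumPy sketch of the stabilized computation, mirroring what the fused framework operations do internally (the example logits are chosen to show that even extreme values cause no overflow):

```python
import numpy as np

def cross_entropy_from_logits(logits, target_idx):
    """Numerically stable -log softmax(logits)[target_idx] via the log-sum-exp trick."""
    z = np.asarray(logits, float)
    z_shift = z - z.max()                              # largest exponent becomes 0: no overflow
    log_sum_exp = np.log(np.sum(np.exp(z_shift))) + z.max()
    return -z[target_idx] + log_sum_exp                # L = -z_k + log sum_j exp(z_j)

print(cross_entropy_from_logits([1000.0, 0.0, -5.0], 0))   # ~0.0, no overflow
print(cross_entropy_from_logits([1.0, 2.0, 3.0], 2))       # ~0.408
```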
All major deep learning frameworks provide built-in cross-entropy loss functions. Understanding the differences between them is important for correct usage.
| Framework | Function | Input type | Task | Notes |
|---|---|---|---|---|
| PyTorch | nn.CrossEntropyLoss | Raw logits | Multi-class | Combines log-softmax + NLL loss; accepts integer class labels |
| PyTorch | nn.BCELoss | Probabilities (after sigmoid) | Binary / multi-label | Requires manual sigmoid; less numerically stable |
| PyTorch | nn.BCEWithLogitsLoss | Raw logits | Binary / multi-label | Fuses sigmoid + BCE; recommended over BCELoss |
| TensorFlow | CategoricalCrossentropy | Probabilities or logits | Multi-class | Set from_logits=True for logits input |
| TensorFlow | SparseCategoricalCrossentropy | Probabilities or logits | Multi-class | Integer labels; from_logits=True recommended |
| TensorFlow | BinaryCrossentropy | Probabilities or logits | Binary / multi-label | Set from_logits=True for stability |
| JAX (Optax) | softmax_cross_entropy | Raw logits | Multi-class | Pure-function API |
A common mistake is to apply softmax or sigmoid before passing the output to a loss function that already applies it internally. For example, using nn.Softmax followed by nn.CrossEntropyLoss in PyTorch applies softmax twice, producing incorrect gradients and poor training results. Always check whether your loss function expects raw logits or probabilities.
PyTorch's nn.CrossEntropyLoss also supports optional weight, ignore_index, and label_smoothing parameters, making it flexible for weighted losses, masked sequences, and regularized training without needing separate implementations.
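A minimal PyTorch usage sketch (shapes, values, and the smoothing factor are arbitrary):

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss(label_smoothing=0.1)   # expects raw logits and integer labels

logits = torch.randn(4, 5, requires_grad=True)   # batch of 4 samples, 5 classes, NO softmax applied
labels = torch.tensor([2, 0, 4, 1])              # integer class indices

loss = criterion(logits, labels)
loss.backward()                                   # gradients flow through the fused log-softmax
```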
Several modifications to the standard cross-entropy loss have been proposed to handle specific challenges in machine learning.
In datasets with imbalanced class distributions, the standard cross-entropy loss can be biased toward the majority class because it treats all samples equally. Weighted cross-entropy addresses this by assigning different weights to different classes:
L = -(1/N) * sum_{i=1}^{N} sum_{c=1}^{C} w_c * y_{i,c} * log(y_hat_{i,c})
where w_c is the weight for class c. A common approach is to set weights inversely proportional to class frequency: classes with fewer training examples receive higher weights, encouraging the model to pay more attention to underrepresented classes.
In PyTorch, class weights can be passed directly to nn.CrossEntropyLoss(weight=tensor). In TensorFlow, the class_weight parameter in model.fit() achieves the same effect.
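A sketch of inverse-frequency class weights in PyTorch (the class counts are made up for illustration):

```python
import torch
import torch.nn as nn

# Hypothetical class counts from an imbalanced training set.
class_counts = torch.tensor([900., 90., 10.])
weights = class_counts.sum() / (len(class_counts) * class_counts)   # inverse frequency
# -> roughly [0.37, 3.70, 33.3]: rare classes contribute more to the loss

criterion = nn.CrossEntropyLoss(weight=weights)
logits = torch.randn(8, 3, requires_grad=True)
labels = torch.randint(0, 3, (8,))
loss = criterion(logits, labels)
```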
Focal loss was introduced by Tsung-Yi Lin and colleagues at Facebook AI Research in 2017 to address the extreme class imbalance encountered in dense object detection tasks. In one-stage detectors like RetinaNet, the vast majority of anchor boxes correspond to background (easy negatives), which can overwhelm the detector during training.
Focal loss modifies the standard cross-entropy by adding a modulating factor:
FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t)
where p_t is the model's predicted probability for the correct class, alpha_t is a class-balancing weight, and gamma >= 0 is a focusing parameter. When gamma = 0, focal loss reduces to standard cross-entropy. As gamma increases, the loss contribution from well-classified examples (those with high p_t) is down-weighted, allowing the model to focus its learning on hard, misclassified examples.
In the original paper, the authors found that gamma = 2 worked well across a range of object detection benchmarks. The key insight is that weighted cross-entropy handles class imbalance by reweighting entire classes, while focal loss handles it by reweighting individual examples based on difficulty. This makes focal loss particularly effective when the imbalance is severe and there is a large number of easy negatives.
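A compact sketch of focal loss for multi-class logits (PyTorch; the alpha handling is simplified to a single scalar rather than per-class weights):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Focal loss: down-weights well-classified examples by (1 - p_t)^gamma."""
    log_probs = F.log_softmax(logits, dim=-1)
    log_p_t = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)   # log prob of true class
    p_t = log_p_t.exp()
    loss = -alpha * (1 - p_t) ** gamma * log_p_t
    return loss.mean()

logits = torch.randn(16, 4)
targets = torch.randint(0, 4, (16,))
print(focal_loss(logits, targets))                            # scalar loss
print(focal_loss(logits, targets, gamma=0.0, alpha=1.0))      # reduces to plain cross-entropy
```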
Focal loss has since been adopted in many domains beyond object detection, including medical image segmentation, natural language processing, and audio classification.
Label smoothing, proposed by Christian Szegedy and colleagues in 2016 as part of refinements to the Inception architecture, is a regularization technique that modifies the target distribution used in cross-entropy. Instead of using hard one-hot targets, label smoothing replaces the target for the correct class with 1 - epsilon and distributes the remaining epsilon uniformly across all other classes:
y_c^{smooth} = y_c * (1 - epsilon) + epsilon / C
where epsilon is a small smoothing parameter (typically 0.1) and C is the number of classes. This prevents the model from becoming overly confident and can improve generalization. The smoothed loss function can be decomposed into two terms: the standard cross-entropy with respect to hard targets plus an entropy regularization term that penalizes low-entropy (overconfident) predictions.
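A short sketch of smoothing one-hot targets (NumPy); in practice, PyTorch's label_smoothing argument or the equivalent framework option performs this internally:

```python
import numpy as np

def smooth_labels(one_hot, epsilon=0.1):
    """Replace hard 0/1 targets with (1 - eps) for the true class and eps/C elsewhere."""
    num_classes = one_hot.shape[-1]
    return one_hot * (1.0 - epsilon) + epsilon / num_classes

hard = np.array([0., 0., 1., 0.])
print(smooth_labels(hard))   # [0.025, 0.025, 0.925, 0.025]
```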
Label smoothing has become a standard component in many modern training recipes, particularly for image classification on benchmarks like ImageNet. Research by Muller, Kornblith, and Hinton (2019) showed that while label smoothing generally improves generalization and model calibration, it can actually hurt performance when the model is used as a teacher in knowledge distillation. The reason is that label smoothing encourages the model to treat all incorrect classes as equally probable, erasing the "dark knowledge" about inter-class similarities that distillation relies on.
Rather than applying a sigmoid function to the output and then computing binary cross-entropy, many frameworks offer a version that accepts raw logits. This combined operation is numerically more stable for the same reasons as the fused softmax-cross-entropy described above. In PyTorch, this is torch.nn.BCEWithLogitsLoss. It also supports a pos_weight parameter for adjusting the relative weight of positive versus negative samples, which is useful in multi-label classification where each label may have a different class balance.
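A short PyTorch sketch (the pos_weight value is illustrative, not a recommendation):

```python
import torch
import torch.nn as nn

# Suppose positives are roughly 10x rarer than negatives: weight them up accordingly.
criterion = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([10.0]))

logits = torch.randn(8, 1, requires_grad=True)    # raw scores, no sigmoid applied
targets = torch.randint(0, 2, (8, 1)).float()
loss = criterion(logits, targets)
```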
Knowledge distillation, introduced by Hinton, Vinyals, and Dean in 2015, is a model compression technique that transfers knowledge from a large "teacher" model to a smaller "student" model. Cross-entropy plays a central role in this process.
The key insight is that the teacher model's soft probability outputs contain more information than hard labels alone. For example, when classifying an image of a cat, a well-trained teacher might output probabilities like [cat: 0.85, tiger: 0.10, dog: 0.04, ...]. The relatively high probability assigned to "tiger" encodes the teacher's learned knowledge that cats and tigers share visual features. These inter-class relationships, sometimes called "dark knowledge," would be lost with hard one-hot labels.
The distillation procedure uses a temperature parameter T to soften the teacher's output distribution. The softmax with temperature is:
q_i = exp(z_i / T) / sum_j exp(z_j / T)
Higher temperatures produce softer (more uniform) distributions that reveal more information about the teacher's learned similarities between classes. At T = 1, this is the standard softmax.
The student model is trained with a combined loss function:
L = alpha * H(y_hard, q_student) + (1 - alpha) * T^2 * H(q_teacher^T, q_student^T)
where H(y_hard, q_student) is the standard cross-entropy with hard labels (at T = 1), H(q_teacher^T, q_student^T) is the cross-entropy between the teacher's and student's soft distributions (at temperature T), and alpha controls the balance between the two terms. The T^2 factor compensates for the fact that gradients from the soft targets are scaled down by 1/T^2 when temperature is raised.
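A sketch of the combined distillation objective (PyTorch; the temperature and alpha values are illustrative choices, not the paper's prescriptions):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, hard_labels, T=4.0, alpha=0.1):
    """alpha * hard-label CE  +  (1 - alpha) * T^2 * CE(teacher soft targets, student)."""
    hard_loss = F.cross_entropy(student_logits, hard_labels)            # at T = 1
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    soft_log_student = F.log_softmax(student_logits / T, dim=-1)
    soft_loss = -(soft_targets * soft_log_student).sum(dim=-1).mean()   # cross-entropy at T
    return alpha * hard_loss + (1 - alpha) * T**2 * soft_loss

student_logits = torch.randn(4, 10, requires_grad=True)
teacher_logits = torch.randn(4, 10)                 # teacher outputs, treated as fixed
labels = torch.randint(0, 10, (4,))
loss = distillation_loss(student_logits, teacher_logits, labels)
```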
Hinton et al. found that setting alpha to a small value (giving most weight to the distillation loss) generally produced the best results. The technique has been widely adopted for deploying efficient models in production, and it underpins the creation of models like DistilBERT and many other compressed architectures.
In natural language processing, perplexity is the standard metric for evaluating language models. Perplexity is directly derived from cross-entropy: it is the exponential of the average cross-entropy loss.
If L is the average cross-entropy loss per token (in nats), then perplexity is:
PP = e^L
If the loss is measured in bits (using log base 2), then:
PP = 2^L
Perplexity has an intuitive interpretation: it represents the effective number of equally likely choices the model is uncertain between at each step. A language model with a perplexity of 10 on a given text corpus is, on average, as uncertain as if it were choosing uniformly among 10 possible next tokens at each position.
This connection means that reducing the cross-entropy loss of a language model by even a small amount translates into a measurable reduction in perplexity. For large language models such as GPT and BERT, cross-entropy over the training tokens is the primary optimization objective. Improvements in model architecture, data quality, or training methodology are often assessed by their effect on perplexity (and thus on cross-entropy).
In autoregressive language models, cross-entropy is computed over the entire sequence. Given a sequence of tokens (t_1, t_2, ..., t_N), the loss is the average negative log-probability of each token given all preceding tokens:
L = -(1/N) * sum_{i=1}^{N} log P(t_i | t_1, ..., t_{i-1})
This is exactly the cross-entropy between the empirical distribution of next tokens and the model's predicted distribution at each position.
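A tiny sketch of the relationship (the per-token losses are made up):

```python
import math

# Hypothetical per-token cross-entropy losses (nats) from a language model.
token_losses = [2.1, 3.0, 1.7, 2.6]
avg_loss = sum(token_losses) / len(token_losses)   # 2.35 nats per token
perplexity = math.exp(avg_loss)                    # ~10.5 effective choices per token
print(avg_loss, perplexity)
```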
| Cross-entropy (nats) | Perplexity | Interpretation |
|---|---|---|
| 0 | 1 | Perfect prediction (certainty) |
| 1.0 | 2.72 | Low uncertainty |
| 2.3 | 10 | Moderate uncertainty |
| 4.6 | 100 | High uncertainty |
| 6.9 | 1000 | Very high uncertainty |
Modern large language models achieve perplexities in the range of 10 to 30 on standard benchmarks like WikiText-103, corresponding to cross-entropy values of roughly 2.3 to 3.4 nats per token. Scaling laws research has shown that cross-entropy loss decreases predictably as a power law with increasing model size, dataset size, and compute budget.
In computer vision, cross-entropy is the standard loss function for image classification tasks. Models like ResNet, VGG, Inception, and Vision Transformers (ViT) are all trained using categorical cross-entropy over the class labels. The output layer typically uses a softmax activation that produces a probability distribution over classes, and the cross-entropy loss measures how far this distribution is from the one-hot ground truth.
Cross-entropy also plays a central role in virtually all NLP tasks that involve predicting tokens or classes, from language modeling and machine translation to text classification and sequence labeling.
One-stage detectors like RetinaNet and two-stage detectors like Faster R-CNN both use cross-entropy (or its focal loss variant) for the classification head. In semantic segmentation, pixel-wise cross-entropy computes the loss for each pixel independently and averages across the image.
Variational autoencoders (VAEs) use binary cross-entropy as the reconstruction loss when the data consists of binary or near-binary values (such as binarized MNIST images). Generative adversarial networks (GANs) use binary cross-entropy in the discriminator's loss function to distinguish between real and generated samples.
In reinforcement learning, cross-entropy appears in policy gradient methods, where the agent's policy is parameterized as a probability distribution over actions. Cross-entropy between a sampling distribution and a target distribution over actions also underlies the cross-entropy method (CEM), a derivative-free optimization algorithm used for policy search.
When computing cross-entropy manually, it is important to clip predicted probabilities away from exactly 0 and 1. Taking log(0) produces negative infinity, which will corrupt the training process. A common practice is to clip predictions to a small range such as [1e-7, 1 - 1e-7] before computing the logarithm. When using framework-provided loss functions that accept logits directly, this clipping is handled automatically.
Cross-entropy loss can be large early in training, when the model's predictions are nearly random. For a C-class classification problem with random initialization, the initial loss is approximately log(C). With many classes (e.g., C = 10000 in large vocabulary language models), this initial loss can be quite high. Techniques such as learning rate warmup, gradient clipping, or using an adaptive optimizer like Adam can help manage this.
For imbalanced datasets, standard cross-entropy can lead to models that predict the majority class almost exclusively. Several strategies help mitigate this:
| Strategy | Description | When to use |
|---|---|---|
| Weighted cross-entropy | Assign higher loss weights to minority classes | Moderate imbalance |
| Focal loss | Down-weight easy (majority class) examples | Severe imbalance with many easy negatives |
| Oversampling | Duplicate minority class examples in training data | Small datasets |
| Undersampling | Remove majority class examples from training data | Large datasets with severe imbalance |
| Synthetic data (SMOTE) | Generate synthetic minority class examples | Tabular data |
While cross-entropy training encourages calibrated probability outputs in theory, modern deep neural networks often produce overconfident predictions in practice. This happens because modern networks have enough capacity to drive training loss to near zero, resulting in very sharp output distributions. Techniques like temperature scaling, Platt scaling, and label smoothing can improve calibration after or during training. Temperature scaling is particularly popular because it requires only a single parameter and does not affect the model's accuracy.
When each input can belong to multiple classes simultaneously (for example, tagging an image with multiple attributes), the problem is framed as multiple independent binary classification tasks. Binary cross-entropy is applied independently to each label, and the total loss is the sum across all labels. In this setting, a sigmoid activation is applied to each output independently rather than a softmax across all outputs.
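A short sketch of this multi-label setup in PyTorch (the number of labels and the targets are arbitrary):

```python
import torch
import torch.nn as nn

criterion = nn.BCEWithLogitsLoss()                   # sigmoid per label, applied internally

logits = torch.randn(4, 6, requires_grad=True)       # 4 samples, 6 independent labels
targets = torch.tensor([[1., 0., 1., 0., 0., 1.],    # multi-hot: several labels can be 1
                        [0., 1., 0., 0., 1., 0.],
                        [1., 1., 0., 1., 0., 0.],
                        [0., 0., 0., 0., 0., 1.]])

loss = criterion(logits, targets)                    # averaged over all sample-label pairs
```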
Imagine you have a bag of different-colored balls, and you want to teach a friend to guess the color of a ball before pulling it out of the bag. Your friend starts by guessing the chances of each color. Cross-entropy is a way to measure how good their guesses are compared to the real chances.
If the bag has mostly red balls and your friend guesses that red is likely, their cross-entropy score will be low (good). If they guess that blue is most likely when the bag is mostly red, their score will be high (bad). The goal of training a machine learning model is to make this score as low as possible, which means the model's guesses get closer and closer to reality.
The "cross" part of the name comes from comparing across two different probability estimates: the real one (what is actually in the bag) and the guessed one (what your friend thinks is in the bag).