Cross-entropy loss (also called log loss) is the standard loss function for classification tasks in machine learning. It measures the difference between two probability distributions: the true distribution of labels and the predicted distribution output by a model. Cross-entropy has its roots in information theory, where it quantifies the average number of bits needed to encode data from one distribution using a code optimized for another distribution. In deep learning, cross-entropy loss provides the gradient signal that drives backpropagation in classification models, from simple logistic regression to billion-parameter large language models.
Cross-entropy is so central to modern machine learning that it appears in virtually every classification pipeline. Image classifiers, text classifiers, language models, speech recognition systems, and recommendation engines all typically optimize some form of cross-entropy loss. Its dominance stems from a combination of theoretical elegance (it arises naturally from maximum likelihood estimation) and practical effectiveness (it produces well-calibrated probabilities and strong gradients for learning).
Cross-entropy builds on several concepts from Claude Shannon's information theory, published in 1948 [1].
The entropy of a probability distribution P measures the average amount of surprise (or information) contained in events drawn from P. For a discrete distribution over K outcomes:
H(P) = -sum_{k=1}^{K} p_k * log(p_k)
Entropy is maximized when all outcomes are equally likely (uniform distribution) and minimized (equal to zero) when one outcome is certain. The logarithm base determines the unit: base 2 gives bits, base e gives nats. In machine learning, the natural logarithm (base e) is standard.
Intuitively, entropy captures how "spread out" a distribution is. A distribution concentrated on a single class has low entropy; a distribution spread across many classes has high entropy.
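The entropy formula above can be checked in a few lines of Python; this is a minimal sketch using the natural logarithm, so results are in nats:

```python
import math

def entropy(p):
    """H(P) = -sum p_k * log(p_k), in nats (terms with p_k = 0 contribute 0)."""
    return -sum(pk * math.log(pk) for pk in p if pk > 0)

uniform = [0.25, 0.25, 0.25, 0.25]   # maximally uncertain over 4 outcomes
peaked = [1.0, 0.0, 0.0, 0.0]        # one outcome is certain

print(entropy(uniform))  # log(4) ~ 1.386 nats, the maximum for K = 4
print(entropy(peaked))   # 0.0 nats
```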
The Kullback-Leibler (KL) divergence measures how one probability distribution P differs from a reference distribution Q:
D_KL(P || Q) = sum_{k=1}^{K} p_k * log(p_k / q_k)
KL divergence is always non-negative and equals zero only when P = Q. It is not symmetric: D_KL(P || Q) does not equal D_KL(Q || P) in general, which is why it is called a divergence rather than a distance.
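Both the non-negativity and the asymmetry are easy to observe numerically; the distributions below are illustrative values, not from any dataset:

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) = sum p_k * log(p_k / q_k), in nats."""
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0)

p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]

print(kl_divergence(p, q))  # small positive number
print(kl_divergence(q, p))  # a different value: KL is not symmetric
print(kl_divergence(p, p))  # 0.0: the distributions are identical
```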
Cross-entropy between the true distribution P and the predicted distribution Q is defined as:
H(P, Q) = -sum_{k=1}^{K} p_k * log(q_k)
Cross-entropy can be decomposed as:
H(P, Q) = H(P) + D_KL(P || Q)
Since the entropy H(P) of the true distribution is constant (it does not depend on the model's predictions), minimizing cross-entropy is equivalent to minimizing KL divergence between the true distribution and the model's output. This is also equivalent to maximum likelihood estimation: choosing the model parameters that maximize the probability of the observed data [2].
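The decomposition H(P, Q) = H(P) + D_KL(P || Q) can be verified directly; this sketch reuses the definitions above on arbitrary example distributions:

```python
import math

def entropy(p):
    return -sum(pk * math.log(pk) for pk in p if pk > 0)

def cross_entropy(p, q):
    return -sum(pk * math.log(qk) for pk, qk in zip(p, q) if pk > 0)

def kl_divergence(p, q):
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0)

p = [0.7, 0.2, 0.1]   # "true" distribution
q = [0.5, 0.3, 0.2]   # model's prediction

# The decomposition holds exactly (up to floating-point error):
assert abs(cross_entropy(p, q) - (entropy(p) + kl_divergence(p, q))) < 1e-12
```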
For binary classification (two classes, typically labeled 0 and 1), the cross-entropy loss for a single sample reduces to:
L = -[y * log(p) + (1 - y) * log(1 - p)]
where y is the true label (0 or 1) and p is the model's predicted probability for class 1.
When y = 1, only the first term is active: L = -log(p). The loss is small when p is close to 1 (correct and confident) and large when p is close to 0 (wrong and confident).
When y = 0, only the second term is active: L = -log(1 - p). The loss is small when p is close to 0 and large when p is close to 1.
For a dataset of N samples, the total binary cross-entropy loss is the average over all samples:
L = -(1/N) * sum_{i=1}^{N} [y_i * log(p_i) + (1 - y_i) * log(1 - p_i)]
Binary cross-entropy is used in logistic regression, binary classification heads, multi-label classification (where each class is treated as an independent binary prediction), and as a component of more complex losses.
For multi-class classification with K mutually exclusive classes, the cross-entropy loss for a single sample is:
L = -sum_{k=1}^{K} y_k * log(p_k)
where y is a one-hot encoded vector (y_k = 1 for the true class, y_k = 0 for all others) and p_k is the model's predicted probability for class k.
Since y is one-hot, this simplifies to:
L = -log(p_c)
where c is the index of the true class. The loss is simply the negative log probability assigned to the correct class. This is why cross-entropy loss is also called negative log-likelihood loss.
For a dataset of N samples:
L = -(1/N) * sum_{i=1}^{N} log(p_{i,c_i})
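With integer ("sparse") labels, the per-sample loss is just -log(p_c), averaged over the batch; a minimal sketch with made-up probability vectors:

```python
import math

def categorical_cross_entropy(probs, true_classes):
    """Average of -log(p_c) over samples; probs is a list of probability vectors."""
    return sum(-math.log(p[c]) for p, c in zip(probs, true_classes)) / len(true_classes)

probs = [
    [0.7, 0.2, 0.1],   # sample 1: model favors class 0
    [0.1, 0.8, 0.1],   # sample 2: model favors class 1
]
labels = [0, 1]        # integer class indices

print(categorical_cross_entropy(probs, labels))  # (-ln 0.7 - ln 0.8) / 2 ~ 0.290
```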
| Loss type | Formula (single sample) | Number of classes | Use case |
|---|---|---|---|
| Binary cross-entropy | -[y * log(p) + (1-y) * log(1-p)] | 2 (or multi-label) | Binary classification, multi-label |
| Categorical cross-entropy | -log(p_c) | K (mutually exclusive) | Multi-class classification |
| Sparse categorical cross-entropy | -log(p_c) (integer labels) | K (mutually exclusive) | Same as categorical, different label format |
In neural networks, the final layer for classification typically produces a vector of raw scores (logits) z = [z_1, z_2, ..., z_K], one per class. These logits are converted to probabilities using the softmax function:
p_k = exp(z_k) / sum_{j=1}^{K} exp(z_j)
Softmax ensures that the outputs are non-negative and sum to one, forming a valid probability distribution. The cross-entropy loss is then computed on these probabilities.
The combination of softmax and cross-entropy loss has a particularly clean gradient. When computing the gradient of the loss with respect to the logits z, the derivative simplifies to:
dL/dz_k = p_k - y_k
This is simply the difference between the predicted probability and the true label for each class. This clean gradient is one of the reasons cross-entropy loss pairs so well with softmax: the gradient is large when the prediction is wrong and small when it is right, providing a strong and intuitive learning signal.
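The gradient identity dL/dz_k = p_k - y_k can be confirmed with a finite-difference check; the logits here are arbitrary example values:

```python
import math

def softmax(z):
    m = max(z)                       # subtract max for numerical stability
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def loss(z, c):
    return -math.log(softmax(z)[c])  # cross-entropy with true class index c

z, c = [2.0, 1.0, 0.1], 0
p = softmax(z)
y = [1.0 if k == c else 0.0 for k in range(len(z))]

# The finite-difference gradient matches p_k - y_k for every logit:
h = 1e-6
for k in range(len(z)):
    z_plus = list(z)
    z_plus[k] += h
    numeric = (loss(z_plus, c) - loss(z, c)) / h
    assert abs(numeric - (p[k] - y[k])) < 1e-4
```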
Naively computing cross-entropy as -log(softmax(z)) can cause numerical problems. The softmax function involves computing exp(z_k), which can overflow to infinity for large z_k or underflow to zero for very negative z_k. Taking the log of these extreme values compounds the problem.
The standard solution is to compute log-softmax directly, combining the log and softmax into a single numerically stable operation:
log(p_k) = log(exp(z_k) / sum_j exp(z_j)) = z_k - log(sum_j exp(z_j))
To prevent overflow in the log-sum-exp term, a constant m = max(z) is subtracted from all logits:
log(sum_j exp(z_j)) = m + log(sum_j exp(z_j - m))
Since z_j - m is at most zero, exp(z_j - m) cannot overflow. This trick is implemented in all deep learning frameworks.
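The max-subtraction trick can be sketched directly; note that a naive `math.exp(1000.0)` would overflow, while the stable version returns finite values:

```python
import math

def log_softmax(z):
    """Numerically stable log-softmax: z_k - (m + log(sum_j exp(z_j - m)))."""
    m = max(z)
    lse = m + math.log(sum(math.exp(v - m) for v in z))
    return [v - lse for v in z]

# Logits this large would overflow a naive softmax; the stable version is fine.
z = [1000.0, 999.0, 0.0]
print(log_softmax(z))  # all finite; exponentiating them recovers probabilities
```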
In PyTorch, the recommended approach is to use nn.CrossEntropyLoss, which takes raw logits as input and internally computes log-softmax plus negative log-likelihood in a numerically stable way:
```python
import torch.nn as nn

criterion = nn.CrossEntropyLoss()
loss = criterion(logits, targets)  # logits: (batch, num_classes), targets: (batch,)
```
Computing softmax first and then passing probabilities to a separate log-loss function is numerically inferior and should be avoided.
Label smoothing is a regularization technique that modifies the target distribution used in cross-entropy loss [3]. Instead of using hard one-hot targets (where the true class has probability 1.0 and all other classes have probability 0.0), label smoothing softens the targets:
y_k_smooth = (1 - alpha) * y_k + alpha / K
where alpha is the smoothing parameter (typically 0.1) and K is the number of classes.
For the true class, the target becomes (1 - alpha + alpha/K) instead of 1.0. For incorrect classes, the target becomes alpha/K instead of 0.0.
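Applying the smoothing formula elementwise reproduces these targets; a minimal sketch for K = 10 and alpha = 0.1:

```python
def smooth_labels(y_onehot, alpha, K):
    """y_smooth = (1 - alpha) * y + alpha / K, applied elementwise."""
    return [(1 - alpha) * y + alpha / K for y in y_onehot]

# K = 10 classes, true class at index 0, alpha = 0.1:
y = [1.0] + [0.0] * 9
print(smooth_labels(y, 0.1, 10))  # [0.91, 0.01, 0.01, ...] — still sums to 1
```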
With hard targets, the cross-entropy loss drives the model to make the logit for the correct class infinitely larger than all other logits (since -log(p_c) approaches zero only as p_c approaches 1, which requires z_c to be much larger than all other z_k). This encourages overconfident predictions and large logit values, which can hurt generalization.
Label smoothing prevents this by giving the model a softer target that can be achieved with smaller logit differences. This produces better-calibrated probabilities (the model's confidence more closely matches its actual accuracy) and can improve generalization [3].
Label smoothing was used in training Inception v2 and many subsequent models. It remains a standard technique in training vision transformers and other classification models.
| Technique | Target for true class | Target for other classes | Effect |
|---|---|---|---|
| Hard targets (standard) | 1.0 | 0.0 | Encourages maximum confidence |
| Label smoothing (alpha=0.1, K=10) | 0.91 | 0.01 | Prevents overconfidence, improves calibration |
| Label smoothing (alpha=0.2, K=10) | 0.82 | 0.02 | Stronger smoothing, more regularization |
Focal loss, introduced by Tsung-Yi Lin et al. (2017) at Facebook AI Research, is a modification of cross-entropy designed to handle extreme class imbalance in object detection [4].
In one-stage object detectors like RetinaNet, the model evaluates tens of thousands of candidate locations per image. The vast majority of these candidates are background (negative examples), and only a tiny fraction contain objects of interest. With standard cross-entropy, the loss is dominated by the large number of easy negatives. Although each easy negative contributes a small loss individually, their sheer number overwhelms the gradient signal from the rare hard positives, preventing the model from learning effectively.
Focal loss adds a modulating factor to the standard cross-entropy:
FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t)
where p_t is the model's predicted probability for the true class, alpha_t is a per-class weighting factor, and gamma is the focusing parameter (typically gamma = 2).
The key term is (1 - p_t)^gamma. For well-classified examples where p_t is high (say, 0.9), this factor is small ((1 - 0.9)^2 = 0.01), so the loss contribution is heavily down-weighted. For misclassified examples where p_t is low (say, 0.1), the factor is close to 1 ((1 - 0.1)^2 = 0.81), so the loss is nearly unchanged.
The effect is that focal loss automatically focuses the model's learning on the hard, misclassified examples and ignores the easy ones. This is particularly effective for class-imbalanced detection tasks.
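The down-weighting described above follows directly from the (1 - p_t)^gamma factor; this sketch compares focal loss to plain cross-entropy at the two probabilities used as examples:

```python
import math

def focal_loss(p_t, gamma=2.0, alpha_t=1.0):
    """FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t)."""
    return -alpha_t * (1 - p_t) ** gamma * math.log(p_t)

def cross_entropy(p_t):
    return -math.log(p_t)

# An easy example (p_t = 0.9) is down-weighted by (1 - 0.9)^2 = 0.01;
# a hard example (p_t = 0.1) keeps (1 - 0.1)^2 = 0.81 of its loss.
print(focal_loss(0.9) / cross_entropy(0.9))  # ~0.01
print(focal_loss(0.1) / cross_entropy(0.1))  # ~0.81
```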
| gamma value | Effect on easy examples | Effect on hard examples |
|---|---|---|
| 0 (standard CE) | Full loss contribution | Full loss contribution |
| 1 | Moderate down-weighting | Nearly full loss |
| 2 (typical) | Strong down-weighting | Nearly full loss |
| 5 | Very strong down-weighting | Nearly full loss |
Focal loss was a key component of RetinaNet, which demonstrated that one-stage detectors could match or exceed the accuracy of two-stage detectors (like Faster R-CNN) when the class imbalance problem was properly addressed. The paper has been cited over 25,000 times and focal loss has been adopted widely beyond object detection, including in medical imaging, NLP, and any setting with severe class imbalance.
Cross-entropy loss plays a central role in training language models, where it serves as the primary training objective.
Autoregressive language models (like GPT, LLaMA, and Claude) are trained to predict the next token in a sequence, given all preceding tokens. For a sequence of tokens (x_1, x_2, ..., x_T), the model produces a probability distribution over the vocabulary V at each position:
p(x_t | x_1, ..., x_{t-1})
The training loss is the average cross-entropy across all positions in the sequence:
L = -(1/T) * sum_{t=1}^{T} log p(x_t | x_1, ..., x_{t-1})
This is equivalent to maximizing the log-likelihood of the training data under the model. The vocabulary size for modern language models ranges from 32,000 to over 100,000 tokens, making this a very high-dimensional classification problem at each position.
Cross-entropy loss in language modeling is sometimes reported in bits per byte (BPB) or bits per character (BPC) rather than nats per token. To convert from the natural logarithm (nats) to bits, divide by ln(2). To convert from per-token to per-byte, multiply by the average number of tokens per byte (which depends on the tokenizer).
These metrics allow comparison across models that use different tokenization schemes, since the byte-level and character-level rates are tokenizer-independent.
Perplexity is the standard evaluation metric for language models, and it is defined as the exponentiated cross-entropy:
PPL = exp(L) = exp(-(1/T) * sum_{t=1}^{T} log p(x_t | x_1, ..., x_{t-1}))
Perplexity can be interpreted as the effective number of equally likely choices the model is considering at each position. A perplexity of 10 means the model is, on average, as uncertain as if it were choosing uniformly among 10 options at each step [5].
Lower perplexity indicates a better model. A model that assigns all probability mass to the correct next token at every position would have perplexity 1.0 (cross-entropy 0.0).
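The "effective number of choices" interpretation is easy to check: a model assigning probability 0.1 to every correct token behaves like a uniform choice among 10 options.

```python
import math

def perplexity(log_probs):
    """PPL = exp(average negative log-probability); log_probs are in nats."""
    return math.exp(-sum(log_probs) / len(log_probs))

# Five positions, each with probability 0.1 on the correct token:
print(perplexity([math.log(0.1)] * 5))   # 10.0
# The corresponding cross-entropy in nats and in bits:
print(-math.log(0.1), -math.log2(0.1))   # ~2.30 nats, ~3.32 bits
```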
| Perplexity | Cross-entropy (nats) | Cross-entropy (bits) | Interpretation |
|---|---|---|---|
| 1.0 | 0.0 | 0.0 | Perfect prediction |
| 10 | 2.30 | 3.32 | Choosing among ~10 options |
| 100 | 4.61 | 6.64 | Choosing among ~100 options |
| 1000 | 6.91 | 9.97 | Choosing among ~1000 options |
Perplexity has been the primary metric for comparing language models since the earliest n-gram models. It remains central in the era of large language models, though downstream task performance has become equally or more important as a practical evaluation criterion.
Perplexity is preferred over raw cross-entropy for two practical reasons. First, it has a more intuitive interpretation ("the model is choosing among N options" is easier to grasp than "the loss is L nats"). Second, perplexity is measured on an exponential scale, which makes differences between good models more visible: a model with cross-entropy 3.0 and one with cross-entropy 2.5 have perplexities of 20.1 and 12.2, respectively, making the improvement much more apparent.
Cross-entropy is not the only loss function used for classification-related tasks. Several alternatives exist, each with specific strengths.
MSE can technically be used for classification by treating the one-hot target as a regression target. However, MSE is a poor choice for classification because its gradient does not grow fast enough when the model is badly wrong: the gradient of (y - p)^2 with respect to p is 2(p - y), which grows only linearly with the error, while the gradient of -log(p) is -1/p, which grows without bound as p approaches 0. Cross-entropy therefore provides much stronger corrective gradients for badly misclassified examples, leading to faster and more stable training.
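The gap in gradient magnitude is stark at a badly wrong prediction; a quick numerical comparison for y = 1 and a small p:

```python
# Gradient magnitudes with respect to p at a badly wrong prediction (y = 1):
p = 0.01
mse_grad = abs(2 * (p - 1))   # |2(p - y)| = 1.98: bounded, roughly constant
ce_grad = abs(-1 / p)         # |-1/p| = 100.0: grows as p approaches 0
print(mse_grad, ce_grad)
```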
Hinge loss, used in support vector machines, is L = max(0, 1 - y * z) where y is in {-1, +1} and z is the raw score. Hinge loss focuses only on samples near the decision boundary (the "support vectors") and ignores well-classified samples entirely. Cross-entropy, by contrast, always provides a gradient, even for correctly classified examples, encouraging the model to increase confidence.
Contrastive learning losses (like InfoNCE) are used in self-supervised and representation learning. InfoNCE is structurally similar to cross-entropy: it treats the positive pair as the correct class and all negative pairs as incorrect classes, computing a cross-entropy-like loss over these "classes." This connection has been made explicit in several theoretical analyses.
Connectionist Temporal Classification (CTC) loss is used in sequence-to-sequence tasks (like speech recognition) where the alignment between input and output is unknown. CTC marginalizes over all possible alignments, computing the total probability of the target sequence. It uses cross-entropy as a component but adds the marginalization over alignments.
| Loss function | Best for | Produces probabilities? | Gradient behavior |
|---|---|---|---|
| Cross-entropy | Classification | Yes (with softmax) | Strong gradient when wrong |
| MSE | Regression | Not naturally | Linear gradient |
| Hinge loss | SVM, max-margin | No | Zero gradient when correct |
| Focal loss | Imbalanced classification | Yes | Down-weights easy examples |
| CTC loss | Sequence alignment | Yes (marginal) | Marginalizes over alignments |
| InfoNCE | Contrastive learning | Relative probabilities | Cross-entropy-like |
When classes are imbalanced (some classes appear much more frequently than others), standard cross-entropy gives equal importance to each sample. This can cause the model to achieve low loss simply by predicting the majority class.
Weighted cross-entropy assigns a weight w_k to each class:
L = -sum_{k=1}^{K} w_k * y_k * log(p_k)
Common weighting strategies include inverse class frequency (w_k proportional to 1/n_k, where n_k is the number of training samples in class k), inverse square-root frequency (w_k proportional to 1/sqrt(n_k), a milder correction), and manually tuned weights selected on a validation set.
Weighted cross-entropy is simpler than focal loss and often sufficient for moderate class imbalance. For severe imbalance (ratio > 100:1), focal loss or oversampling strategies may be more effective.
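A per-sample sketch of weighted cross-entropy with hypothetical class counts, using inverse-frequency weights so a mistake on the rare class costs more:

```python
import math

def weighted_cross_entropy(p, c, weights):
    """-w_c * log(p_c): per-sample weighted cross-entropy with integer label c."""
    return -weights[c] * math.log(p[c])

# Hypothetical two-class problem where class 1 is rare (9:1 imbalance):
counts = [900, 100]
n = sum(counts)
weights = [n / (len(counts) * cnt) for cnt in counts]  # inverse-frequency weights

p = [0.8, 0.2]   # model leans toward the majority class
print(weighted_cross_entropy(p, 0, weights))  # majority class: down-weighted loss
print(weighted_cross_entropy(p, 1, weights))  # rare class: up-weighted loss
```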
Knowledge distillation (Hinton et al., 2015) uses a modified cross-entropy loss to transfer knowledge from a large "teacher" model to a smaller "student" model [6]. The student is trained to match the teacher's soft probability distribution (not just the hard labels) using cross-entropy between the teacher's output distribution and the student's output distribution, both computed with a temperature parameter T > 1 that softens the distributions:
L_distill = -sum_{k=1}^{K} p_k^teacher(T) * log(p_k^student(T))
where p_k(T) = exp(z_k / T) / sum_j exp(z_j / T).
The total training loss for knowledge distillation is typically a weighted combination of the distillation loss and the standard cross-entropy loss with hard labels:
L_total = alpha * L_distill + (1 - alpha) * L_hard
This use of cross-entropy is foundational to model compression and the training of smaller, faster models that retain much of the performance of their larger counterparts.
Cross-entropy loss is the primary quantity monitored during training of classification models. Key patterns to watch for:
Deep learning frameworks offer different reduction modes for cross-entropy loss: mean (average over the batch), sum (total over the batch), and none (per-sample loss). The choice affects the effective learning rate: using sum reduction with a batch size of 32 produces gradients that are 32 times larger than mean reduction. Most practitioners use mean reduction, as it makes the learning rate independent of batch size.
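The scaling relationship between the two reductions can be shown without any framework, using plain per-sample losses:

```python
import math

# Per-sample negative log-likelihoods for a batch of 4 (illustrative values):
per_sample = [-math.log(p) for p in [0.9, 0.6, 0.3, 0.8]]

mean_loss = sum(per_sample) / len(per_sample)   # 'mean' reduction
sum_loss = sum(per_sample)                      # 'sum' reduction

# 'sum' scales with batch size, so its gradients are batch_size times larger:
assert abs(sum_loss - mean_loss * len(per_sample)) < 1e-12
```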
In mixed-precision training, cross-entropy loss should be computed in float32 (full precision) even when the rest of the forward pass uses float16 or bfloat16. The log and softmax operations are particularly sensitive to precision, and computing them in half precision can cause significant numerical errors. PyTorch's nn.CrossEntropyLoss handles this correctly when used with automatic mixed precision (torch.cuda.amp).