# Cross-Entropy Loss

> Source: https://aiwiki.ai/wiki/cross_entropy_loss
> Updated: 2026-06-23
> Categories: Deep Learning, Machine Learning
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**Cross-entropy loss** is the standard [loss function](/wiki/loss_function) for classification and language modeling, defined as the negative log-probability a model assigns to the correct answer: for a single example its value is H(P, Q) = -sum_k p_k * log(q_k), where P is the true label distribution and Q is the model's predicted distribution. When labels are one-hot, this reduces to L = -log(p_c), the negative log-probability of the true class c, which is why cross-entropy loss is also called log loss or negative log-likelihood (NLL). Minimizing cross-entropy is mathematically equivalent to maximum likelihood estimation, so the loss arises naturally rather than being chosen by convention [1][2]. It supplies the gradient signal that drives [backpropagation](/wiki/backpropagation) in nearly every classification model, from logistic regression to billion-parameter [large language models](/wiki/large_language_model), and exponentiating it gives [perplexity](/wiki/perplexity), the standard language-model metric (PPL = exp(cross-entropy)) [5].

Cross-entropy has its roots in [information theory](/wiki/information_theory), where it quantifies the average number of bits (or nats) needed to encode data drawn from one distribution using a code optimized for another. Image classifiers, text classifiers, [language models](/wiki/large_language_model), speech recognition systems, and recommendation engines all typically optimize some form of cross-entropy loss. Its dominance stems from a combination of theoretical elegance (it arises from maximum likelihood estimation) and practical effectiveness (it yields probabilistic outputs and strong gradients when predictions are wrong).

## What are the information theory foundations of cross-entropy?

Cross-entropy builds on several concepts from Claude Shannon's information theory, published in 1948 [1].

### Entropy

The entropy of a probability distribution P measures the average amount of surprise (or information) contained in events drawn from P. For a discrete distribution over K outcomes:

H(P) = -sum_{k=1}^{K} p_k * log(p_k)

Entropy is maximized when all outcomes are equally likely (uniform distribution) and minimized (equal to zero) when one outcome is certain. The logarithm base determines the unit: base 2 gives bits, base e gives nats. In machine learning, the natural logarithm (base e) is standard.

Intuitively, entropy captures how "spread out" a distribution is. A distribution concentrated on a single class has low entropy; a distribution spread across many classes has high entropy.
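
As a rough illustration of the definition above, the following sketch computes entropy for three hand-picked distributions. The function name `entropy` and the example values are assumptions made for this illustration, not part of any library.

```python
import math

def entropy(p, base=math.e):
    """Shannon entropy of a discrete distribution p (a list of probabilities).

    Terms with p_k == 0 are skipped, following the convention 0 * log(0) = 0.
    """
    return sum(-pk * math.log(pk, base) for pk in p if pk > 0)

uniform = [0.25, 0.25, 0.25, 0.25]   # maximally spread out over K = 4 outcomes
peaked = [0.97, 0.01, 0.01, 0.01]    # concentrated on one outcome
certain = [1.0, 0.0, 0.0, 0.0]       # one outcome is certain

print(entropy(uniform))              # log(4) nats, the maximum for K = 4
print(entropy(uniform, base=2))      # about 2 bits
print(entropy(peaked))               # much smaller than log(4)
print(entropy(certain))              # zero
```

Switching `base` from e to 2 changes only the unit (nats to bits), not the ordering of the distributions.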

### KL divergence

The Kullback-Leibler (KL) divergence measures how one probability distribution P differs from a reference distribution Q:

D_KL(P || Q) = sum_{k=1}^{K} p_k * log(p_k / q_k)

KL divergence is always non-negative and equals zero only when P = Q. It is not symmetric: D_KL(P || Q) does not equal D_KL(Q || P) in general, which is why it is called a divergence rather than a distance.

### Cross-entropy

Cross-entropy between the true distribution P and the predicted distribution Q is defined as:

H(P, Q) = -sum_{k=1}^{K} p_k * log(q_k)

Cross-entropy can be decomposed as:

H(P, Q) = H(P) + D_KL(P || Q)

Since the entropy H(P) of the true distribution is constant (it does not depend on the model's predictions), minimizing cross-entropy is equivalent to minimizing KL divergence between the true distribution and the model's output. This is also equivalent to maximum likelihood estimation: choosing the model parameters that maximize the probability of the observed data [2].
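
The decomposition can be checked numerically. The sketch below uses illustrative distributions `p` and `q` (assumed values) and hypothetical helper names; it also shows that KL divergence is not symmetric.

```python
import math

def entropy(p):
    return -sum(pk * math.log(pk) for pk in p if pk > 0)

def cross_entropy(p, q):
    # Requires q_k > 0 wherever p_k > 0; otherwise the value is infinite.
    return -sum(pk * math.log(qk) for pk, qk in zip(p, q) if pk > 0)

def kl_divergence(p, q):
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0)

p = [0.7, 0.2, 0.1]   # "true" distribution (illustrative values)
q = [0.5, 0.3, 0.2]   # model prediction (illustrative values)

lhs = cross_entropy(p, q)
rhs = entropy(p) + kl_divergence(p, q)
print(lhs, rhs)                                   # equal up to rounding
print(kl_divergence(p, q), kl_divergence(q, p))   # differ: KL is asymmetric
```

Because `entropy(p)` does not depend on `q`, any change to `q` that lowers the cross-entropy lowers the KL divergence by the same amount.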

## What is binary cross-entropy?

For binary classification (two classes, typically labeled 0 and 1), the cross-entropy loss for a single sample reduces to:

L = -[y * log(p) + (1 - y) * log(1 - p)]

where y is the true label (0 or 1) and p is the model's predicted probability for class 1.

When y = 1, only the first term is active: L = -log(p). The loss is small when p is close to 1 (correct and confident) and large when p is close to 0 (wrong and confident).

When y = 0, only the second term is active: L = -log(1 - p). The loss is small when p is close to 0 and large when p is close to 1.

For a dataset of N samples, the total binary cross-entropy loss is the average over all samples:

L = -(1/N) * sum_{i=1}^{N} [y_i * log(p_i) + (1 - y_i) * log(1 - p_i)]

Binary cross-entropy is used in logistic regression, binary classification heads, multi-label classification (where each class is treated as an independent binary prediction), and as a component of more complex losses.
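
A minimal sketch of the batch-averaged formula follows. The clipping constant `eps` is an assumption added here to keep `log` away from zero; it is not part of the definition.

```python
import math

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean binary cross-entropy over a batch.

    y_true: list of 0/1 labels; p_pred: predicted probabilities for class 1.
    eps is an assumed clipping constant that keeps log() away from 0.
    """
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

labels = [1, 0, 1, 0]
confident_right = [0.95, 0.05, 0.90, 0.10]
confident_wrong = [0.05, 0.95, 0.10, 0.90]

print(binary_cross_entropy(labels, confident_right))  # small loss
print(binary_cross_entropy(labels, confident_wrong))  # large loss
print(binary_cross_entropy([1], [0.5]))               # log(2), about 0.693
```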

## What is categorical cross-entropy, and how does it differ from binary cross-entropy?

For multi-class classification with K mutually exclusive classes, the cross-entropy loss for a single sample is:

L = -sum_{k=1}^{K} y_k * log(p_k)

where y is a one-hot encoded vector (y_k = 1 for the true class, y_k = 0 for all others) and p_k is the model's predicted probability for class k.

Since y is one-hot, this simplifies to:

L = -log(p_c)

where c is the index of the true class. The loss is simply the negative log probability assigned to the correct class. This is why cross-entropy loss is also called negative log-likelihood loss [2].

For a dataset of N samples:

L = -(1/N) * sum_{i=1}^{N} log(p_{i,c_i})

The difference from binary cross-entropy is the structure of the prediction problem: binary cross-entropy treats each output as an independent yes/no decision (so a single example can belong to several classes at once, as in multi-label tasks), whereas categorical cross-entropy assumes the K classes are mutually exclusive and makes them compete through a single [softmax](/wiki/softmax) over the whole class set.

| Loss type | Formula (single sample) | Number of classes | Use case |
|---|---|---|---|
| Binary cross-entropy | -[y * log(p) + (1-y) * log(1-p)] | 2 (or multi-label) | Binary classification, multi-label |
| Categorical cross-entropy | -log(p_c) | K (mutually exclusive) | Multi-class classification |
| Sparse categorical cross-entropy | -log(p_c) (integer labels) | K (mutually exclusive) | Same as categorical, different label format |
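
The categorical and sparse categorical rows differ only in how the label is encoded. The sketch below, with hypothetical function names and an illustrative prediction, shows both forms returning the same value.

```python
import math

def categorical_cross_entropy(y_onehot, probs):
    """-sum_k y_k * log(p_k) for a one-hot target vector."""
    return -sum(y * math.log(p) for y, p in zip(y_onehot, probs) if y > 0)

def sparse_categorical_cross_entropy(class_index, probs):
    """-log(p_c) for an integer class label c."""
    return -math.log(probs[class_index])

probs = [0.1, 0.7, 0.2]   # predicted distribution over K = 3 classes
print(categorical_cross_entropy([0, 1, 0], probs))   # -log(0.7)
print(sparse_categorical_cross_entropy(1, probs))    # same value
```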

## How does cross-entropy relate to softmax?

In [neural networks](/wiki/neural_network), the final layer for classification typically produces a vector of raw scores (logits) z = [z_1, z_2, ..., z_K], one per class. These logits are converted to probabilities using the [softmax](/wiki/softmax) function:

p_k = exp(z_k) / sum_{j=1}^{K} exp(z_j)

Softmax ensures that the outputs are non-negative and sum to one, forming a valid probability distribution. The cross-entropy loss is then computed on these probabilities.

The combination of softmax and cross-entropy loss has a particularly clean gradient. When computing the gradient of the loss with respect to the logits z, the derivative simplifies to:

dL/dz_k = p_k - y_k

This is simply the difference between the predicted probability and the true label for each class [2]. This clean gradient is one of the reasons cross-entropy loss pairs so well with softmax: the gradient is large when the prediction is wrong and small when it is right, providing a strong and intuitive learning signal.
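
One way to convince yourself of the p_k - y_k result is a finite-difference check. The sketch below compares the analytic gradient with a central-difference estimate; the logits, the true class, and the step size `h` are illustrative assumptions.

```python
import math

def softmax(z):
    m = max(z)                                  # shift for numerical safety
    exps = [math.exp(zk - m) for zk in z]
    total = sum(exps)
    return [e / total for e in exps]

def loss(z, c):
    """Cross-entropy of softmax(z) against true class index c."""
    return -math.log(softmax(z)[c])

z = [2.0, -1.0, 0.5]   # illustrative logits
c = 0                  # index of the true class
y = [1.0 if k == c else 0.0 for k in range(len(z))]

# Analytic gradient: dL/dz_k = p_k - y_k
analytic = [pk - yk for pk, yk in zip(softmax(z), y)]

# Central finite differences as an independent check
h = 1e-6
numeric = []
for k in range(len(z)):
    z_plus = list(z)
    z_minus = list(z)
    z_plus[k] += h
    z_minus[k] -= h
    numeric.append((loss(z_plus, c) - loss(z_minus, c)) / (2 * h))

print(analytic)
print(numeric)   # agrees with the analytic gradient to several decimal places
```

The gradient components sum to zero, since both the softmax output and the one-hot target sum to one.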

## How is cross-entropy computed in a numerically stable way?

Naively computing cross-entropy as -log(softmax(z)) can cause numerical problems. The softmax function involves computing exp(z_k), which can overflow to infinity for large z_k or underflow to zero for very negative z_k. Overflow in both the numerator and the denominator yields inf/inf = NaN, and a probability that underflows to exactly zero yields log(0) = -infinity, so the loss becomes NaN or infinite.

### The log-softmax trick

The standard solution is to compute log-softmax directly, combining the log and softmax into a single numerically stable operation:

log(p_k) = log(exp(z_k) / sum_j exp(z_j)) = z_k - log(sum_j exp(z_j))

To prevent overflow in the log-sum-exp term, a constant m = max(z) is subtracted from all logits:

log(sum_j exp(z_j)) = m + log(sum_j exp(z_j - m))

Since z_j - m is at most zero, exp(z_j - m) cannot overflow, and the largest term is exactly exp(0) = 1, so the sum inside the logarithm is at least 1. This trick is implemented in the major deep learning frameworks.
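
The max-subtraction trick can be sketched in a few lines of plain Python. The function names below are hypothetical, and the logits near 1000 are chosen only to trigger overflow in the naive computation.

```python
import math

def log_softmax(z):
    """Numerically stable log-softmax using the max-subtraction trick."""
    m = max(z)
    log_sum_exp = m + math.log(sum(math.exp(zj - m) for zj in z))
    return [zk - log_sum_exp for zk in z]

def cross_entropy_from_logits(z, c):
    """-log(softmax(z)[c]) computed without ever forming softmax(z)."""
    return -log_softmax(z)[c]

big = [1000.0, 999.0, 998.0]
print(cross_entropy_from_logits(big, 0))   # finite, about 0.408

try:
    math.exp(1000.0)                       # the naive route
except OverflowError as err:
    print("naive exp overflows:", err)
```

Shifting every logit by the same constant leaves the result unchanged, which is why `[1000, 999, 998]` and `[0, -1, -2]` give the same loss.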

In [PyTorch](/wiki/pytorch), the recommended approach is to use `nn.CrossEntropyLoss`, which takes raw logits as input and internally computes log-softmax plus negative log-likelihood in a numerically stable way [7]:

```python
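import torch
import torch.nn as nn

logits = torch.randn(4, 3)             # raw scores, shape (batch, num_classes)
targets = torch.tensor([0, 2, 1, 2])   # class indices, shape (batch,)

criterion = nn.CrossEntropyLoss()      # log-softmax + NLL, applied to raw logits
loss = criterion(logits, targets)      # scalar tensor (mean over the batch)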
```

The PyTorch documentation states that `nn.CrossEntropyLoss` "combines `nn.LogSoftmax()` and `nn.NLLLoss()` in one single class," which is exactly why it is preferred over applying softmax and a log-loss separately [7]. Computing softmax first and then passing probabilities to a separate log-loss function is numerically inferior and should be avoided.

## What is label smoothing?

Label smoothing is a regularization technique that modifies the target distribution used in cross-entropy loss [3]. Instead of using hard one-hot targets (where the true class has probability 1.0 and all other classes have probability 0.0), label smoothing softens the targets:

y_k_smooth = (1 - alpha) * y_k + alpha / K

where alpha is the smoothing parameter (typically 0.1) and K is the number of classes.

For the true class, the target becomes (1 - alpha + alpha/K) instead of 1.0. For incorrect classes, the target becomes alpha/K instead of 0.0.
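
A small sketch of the smoothing formula, using a hypothetical helper `smooth_labels` and an illustrative 10-class problem:

```python
def smooth_labels(class_index, num_classes, alpha=0.1):
    """Return the smoothed target (1 - alpha) * y_k + alpha / K for every class k."""
    off_value = alpha / num_classes
    on_value = 1.0 - alpha + off_value
    return [on_value if k == class_index else off_value
            for k in range(num_classes)]

targets = smooth_labels(class_index=3, num_classes=10, alpha=0.1)
print(targets[3])      # about 0.91 for the true class
print(targets[0])      # about 0.01 for every other class
print(sum(targets))    # about 1.0, still a valid distribution
```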

### Why does label smoothing help?

With hard targets, the cross-entropy loss drives the model to make the logit for the correct class infinitely larger than all other logits (since -log(p_c) approaches zero only as p_c approaches 1, which requires z_c to be much larger than all other z_k). This encourages overconfident predictions and large logit values, which can hurt generalization.

Label smoothing prevents this by giving the model a softer target that can be matched exactly with finite logit differences. This tends to produce better-calibrated probabilities (the model's confidence more closely matches its actual accuracy) and can improve generalization [3].

Label smoothing was introduced by Szegedy et al. in the 2016 "Rethinking the Inception Architecture" paper, where they applied a smoothing value of alpha = 0.1 over the 1,000 ImageNet classes and described it as "a mechanism to regularize the classifier layer by estimating the marginalized effect of label-dropout during training" [3]. It was used in training [Inception](/wiki/inception) v2 and v3 and remains a standard technique in training [vision transformers](/wiki/vision_transformer) and other classification models.

| Technique | Target for true class | Target for other classes | Effect |
|---|---|---|---|
| Hard targets (standard) | 1.0 | 0.0 | Encourages maximum confidence |
| Label smoothing (alpha=0.1, K=10) | 0.91 | 0.01 | Prevents overconfidence, improves calibration |
| Label smoothing (alpha=0.2, K=10) | 0.82 | 0.02 | Stronger smoothing, more regularization |

## What is focal loss?

Focal loss, introduced by Tsung-Yi Lin et al. (2017) at Facebook AI Research, is a modification of cross-entropy designed to handle extreme class imbalance in object detection [4]. The paper's central claim is direct: "We discover that the extreme foreground-background class imbalance encountered during training of dense detectors is the central cause" of why one-stage detectors had underperformed two-stage detectors [4].

### The class imbalance problem

In one-stage object detectors like RetinaNet, the model evaluates tens of thousands of candidate locations per image. The vast majority of these candidates are background (negative examples), and only a tiny fraction contain objects of interest. With standard cross-entropy, the loss is dominated by the large number of easy negatives. Although each easy negative contributes a small loss individually, their sheer number overwhelms the gradient signal from the rare hard positives, preventing the model from learning effectively.

### How does focal loss work?

Focal loss adds a modulating factor to the standard cross-entropy:

FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t)

where p_t is the model's predicted probability for the true class, alpha_t is a per-class weighting factor, and gamma is the focusing parameter. The authors recommend gamma = 2 as a default and report that it is relatively robust across the range gamma in [0.5, 5] [4].

The key term is (1 - p_t)^gamma. For well-classified examples where p_t is high (say, 0.9), this factor is small ((1 - 0.9)^2 = 0.01), so the loss contribution is heavily down-weighted. For misclassified examples where p_t is low (say, 0.1), the factor is close to 1 ((1 - 0.1)^2 = 0.81), so the loss is nearly unchanged.

The effect is that focal loss automatically focuses the model's learning on the hard, misclassified examples and ignores the easy ones. This is particularly effective for class-imbalanced detection tasks.
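
The down-weighting can be seen by comparing focal loss with plain cross-entropy for a few values of p_t. In this sketch `alpha_t` is fixed to 1.0 as a simplifying assumption, so that only the modulating factor differs.

```python
import math

def cross_entropy(p_t):
    return -math.log(p_t)

def focal_loss(p_t, gamma=2.0, alpha_t=1.0):
    """FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t) for one example."""
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

for p_t in (0.9, 0.5, 0.1):
    ce = cross_entropy(p_t)
    fl = focal_loss(p_t, gamma=2.0)
    # The ratio FL / CE equals the modulating factor (1 - p_t)**gamma.
    print(f"p_t={p_t}: CE={ce:.4f}  FL={fl:.4f}  ratio={fl / ce:.2f}")
```

Setting `gamma=0` makes the modulating factor equal to 1, which recovers standard cross-entropy.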

| gamma value | Effect on easy examples | Effect on hard examples |
|---|---|---|
| 0 (standard CE) | Full loss contribution | Full loss contribution |
| 1 | Moderate down-weighting | Nearly full loss |
| 2 (typical) | Strong down-weighting | Nearly full loss |
| 5 | Very strong down-weighting | Nearly full loss |

Focal loss was a key component of RetinaNet, which demonstrated that one-stage detectors could match or exceed the accuracy of two-stage detectors (like Faster R-CNN) when the class imbalance problem was properly addressed. The paper has been cited more than 20,000 times and focal loss has been adopted widely beyond object detection, including in medical imaging, NLP, and other settings with severe class imbalance [4].

## How is cross-entropy used in language modeling?

Cross-entropy loss plays a central role in training [language models](/wiki/large_language_model), where it serves as the primary training objective.

### Next-token prediction

Autoregressive language models (like GPT, [LLaMA](/wiki/llama), and [Claude](/wiki/claude)) are trained to predict the next [token](/wiki/tokenization) in a sequence, given all preceding tokens. For a sequence of tokens (x_1, x_2, ..., x_T), the model produces a probability distribution over the vocabulary V at each position:

p(x_t | x_1, ..., x_{t-1})

The training loss is the average cross-entropy across all positions in the sequence:

L = -(1/T) * sum_{t=1}^{T} log p(x_t | x_1, ..., x_{t-1})

This is equivalent to maximizing the log-likelihood of the training data under the model [2]. Vocabulary sizes for modern language models typically range from about 32,000 to well over 100,000 tokens, making this a very high-dimensional classification problem at each position. Next-token cross-entropy is the single objective used during the entire pretraining phase of GPT-style models; the loss is computed in parallel over every position in the context window.
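
As a toy illustration of the sequence-level formula, the sketch below averages the negative log-probabilities of the tokens that actually occurred. The per-token probabilities are hypothetical; in a real model they would be read off the softmax output at each position.

```python
import math

def sequence_cross_entropy(token_probs):
    """Average negative log-probability (nats per token).

    token_probs[t] is the probability the model assigned to the token that
    actually occurred at position t, i.e. p(x_t | x_1, ..., x_{t-1}).
    """
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

# Hypothetical per-token probabilities for a 5-token sequence.
token_probs = [0.20, 0.05, 0.60, 0.90, 0.10]
loss = sequence_cross_entropy(token_probs)
print(loss)   # nats per token
```

Low-probability tokens (here 0.05 and 0.10) dominate the average, which is what pushes the model to stop being surprised by them.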

### Bits per byte and bits per character

Cross-entropy loss in language modeling is sometimes reported in bits per byte (BPB) or bits per character (BPC) rather than nats per token. To convert from the natural logarithm (nats) to bits, divide by ln(2) (approximately 0.693). To convert from per-token to per-byte, multiply by the average number of tokens per byte (which depends on the [tokenizer](/wiki/tokenization)).

These metrics allow comparison across models that use different tokenization schemes, since the byte-level and character-level rates are tokenizer-independent.
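
The two conversions described above can be combined into one helper. The function name and the token and byte counts below are assumptions chosen for illustration.

```python
import math

def nats_per_token_to_bits_per_byte(loss_nats, num_tokens, num_bytes):
    """Convert a per-token loss in nats to bits per byte.

    num_tokens / num_bytes is the tokens-per-byte ratio of the evaluated text
    under the tokenizer in use.
    """
    bits_per_token = loss_nats / math.log(2)          # nats -> bits
    return bits_per_token * (num_tokens / num_bytes)  # per token -> per byte

# Hypothetical numbers: a loss of 2.0 nats/token on text where 1,000 tokens
# cover 4,000 bytes (0.25 tokens per byte).
bpb = nats_per_token_to_bits_per_byte(2.0, num_tokens=1000, num_bytes=4000)
print(bpb)   # about 0.72 bits per byte
```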

## What is perplexity, and how does it relate to cross-entropy?

[Perplexity](/wiki/perplexity) is the standard evaluation metric for language models, and it is defined as the exponentiated cross-entropy:

PPL = exp(L) = exp(-(1/T) * sum_{t=1}^{T} log p(x_t | x_1, ..., x_{t-1}))

Perplexity can be interpreted as the effective number of equally likely choices the model is considering at each position. Jurafsky and Martin describe perplexity as "the weighted average branching factor of a language," that is, the average number of possible next words, and equivalently as the geometric mean of the inverse per-token probabilities [5]. A perplexity of 10 means the model is, on average, as uncertain as if it were choosing uniformly among 10 options at each step [5].

Lower perplexity indicates a better model. A model that assigns all probability mass to the correct next token at every position would have perplexity 1.0 (cross-entropy 0.0).
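
The "effective number of choices" reading can be checked on toy inputs. The helper below is a sketch with hypothetical per-token probabilities, not an evaluation harness.

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-probability over the sequence."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

print(perplexity([0.1] * 20))     # about 10: as if choosing uniformly among 10 options
print(perplexity([1.0] * 20))     # 1.0: perfect prediction
print(perplexity([0.5, 0.125]))   # about 4: geometric mean of 1/0.5 = 2 and 1/0.125 = 8
```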

| Perplexity | Cross-entropy (nats) | Cross-entropy (bits) | Interpretation |
|---|---|---|---|
| 1.0 | 0.0 | 0.0 | Perfect prediction |
| 10 | 2.30 | 3.32 | Choosing among ~10 options |
| 100 | 4.61 | 6.64 | Choosing among ~100 options |
| 1000 | 6.91 | 9.97 | Choosing among ~1000 options |

Perplexity has been the primary metric for comparing language models since the earliest n-gram models. It remains central in the era of large language models, though downstream task performance has become equally or more important as a practical evaluation criterion.

### Why report perplexity and not just cross-entropy?

Perplexity is preferred over raw cross-entropy for two practical reasons. First, it has a more intuitive interpretation ("the model is choosing among N options" is easier to grasp than "the loss is L nats"). Second, perplexity is measured on an exponential scale, which makes differences between good models more visible: a model with cross-entropy 3.0 and one with cross-entropy 2.5 have perplexities of 20.1 and 12.2, respectively, making the improvement much more apparent.

## How does cross-entropy compare with other loss functions?

Cross-entropy is not the only loss function used for classification-related tasks. Several alternatives exist, each with specific strengths.

### Mean squared error (MSE)

MSE can technically be used for classification by treating the one-hot target as a regression target. However, MSE is a poor choice for classification because its gradients are weaker when the model is very wrong (the gradient of (y - p)^2 with respect to p is 2(p - y), which is linear, while the gradient of -log(p) is -1/p, which grows much faster as p approaches 0). This means cross-entropy provides stronger corrective gradients for badly misclassified examples, leading to faster and more stable training.
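
The two gradient formulas quoted above can be tabulated directly. This sketch takes the gradient with respect to the predicted probability p of the true class (y = 1), as in the text, and ignores the softmax or sigmoid that would sit in front of p in a real network.

```python
def mse_grad(p, y=1.0):
    """d/dp of (y - p)**2."""
    return 2.0 * (p - y)

def ce_grad(p):
    """d/dp of -log(p), the loss when the true class has probability p."""
    return -1.0 / p

for p in (0.5, 0.1, 0.01, 0.001):
    print(f"p={p}: MSE grad={mse_grad(p):.3f}  CE grad={ce_grad(p):.1f}")
```

The MSE gradient never exceeds 2 in magnitude, while the cross-entropy gradient grows without bound as p approaches 0.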

### Hinge loss

Hinge loss, used in [support vector machines](/wiki/support_vector_machine_svm), is L = max(0, 1 - y * z) where y is in {-1, +1} and z is the raw score. Hinge loss is nonzero only for samples that are misclassified or fall inside the margin (y * z < 1), which are the samples that end up as "support vectors"; samples classified correctly with a margin of at least 1 are ignored entirely. Cross-entropy, by contrast, always provides a nonzero gradient, even for correctly classified examples, encouraging the model to keep increasing confidence.
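
To make the contrast concrete, the sketch below evaluates both losses for a positive example (y = +1) at several scores. The cross-entropy is written in its logistic form log(1 + exp(-y * z)), which is binary cross-entropy with a sigmoid output re-expressed for labels in {-1, +1}; that rewriting is an addition made for this illustration.

```python
import math

def hinge_loss(y, z):
    return max(0.0, 1.0 - y * z)

def logistic_loss(y, z):
    """Binary cross-entropy written for labels y in {-1, +1} and raw score z."""
    return math.log1p(math.exp(-y * z))

for z in (-2.0, 0.5, 1.0, 3.0):
    print(f"z={z}: hinge={hinge_loss(1, z):.3f}  cross-entropy={logistic_loss(1, z):.3f}")
```

Once the score reaches the margin (z >= 1) the hinge loss is exactly zero, while the cross-entropy term stays positive and keeps shrinking as z grows.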

### Contrastive loss

[Contrastive learning](/wiki/contrastive_learning) losses (like InfoNCE) are used in self-supervised and representation learning. InfoNCE is structurally similar to cross-entropy: it treats the positive pair as the correct class and all negative pairs as incorrect classes, computing a cross-entropy-like loss over these "classes." This connection has been made explicit in several theoretical analyses.

### CTC loss

Connectionist Temporal Classification (CTC) loss is used in sequence tasks (like speech recognition) where the alignment between input and output is unknown. CTC marginalizes over all possible alignments, computing the total probability of the target sequence. The loss is the negative log of that total probability, so it is a negative log-likelihood in the same sense as cross-entropy, with the marginalization over alignments added.

| Loss function | Best for | Produces probabilities? | Gradient behavior |
|---|---|---|---|
| Cross-entropy | Classification | Yes (with softmax) | Strong gradient when wrong |
| MSE | Regression | Not naturally | Linear gradient |
| Hinge loss | SVM, max-margin | No | Zero gradient when correct |
| Focal loss | Imbalanced classification | Yes | Down-weights easy examples |
| CTC loss | Sequence alignment | Yes (marginal) | Marginalizes over alignments |
| InfoNCE | Contrastive learning | Relative probabilities | Cross-entropy-like |

## What is weighted cross-entropy?

When classes are imbalanced (some classes appear much more frequently than others), standard cross-entropy gives equal importance to each sample. This can cause the model to achieve low loss simply by predicting the majority class.

Weighted cross-entropy assigns a weight w_k to each class:

L = -sum_{k=1}^{K} w_k * y_k * log(p_k)

Common weighting strategies include:

- **Inverse frequency weighting**: w_k = N / (K * n_k), where n_k is the number of samples in class k and N is the total number of samples.
- **Effective number weighting**: Uses the effective number of samples (Cui et al., 2019), which accounts for data overlap.
- **Manual weighting**: Setting weights based on domain knowledge about the relative importance of different classes.

Weighted cross-entropy is simpler than focal loss and often sufficient for moderate class imbalance. For severe imbalance (ratio > 100:1), focal loss or oversampling strategies may be more effective.
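
Inverse frequency weighting, the first strategy listed above, can be sketched as follows. The helper names and the 9:1 toy dataset are assumptions for illustration, and the weight function assumes every class appears at least once.

```python
import math
from collections import Counter

def inverse_frequency_weights(labels, num_classes):
    """w_k = N / (K * n_k) for each class k (assumes every class appears)."""
    counts = Counter(labels)
    n = len(labels)
    return [n / (num_classes * counts[k]) for k in range(num_classes)]

def weighted_cross_entropy(class_index, probs, weights):
    """-w_c * log(p_c) for a single sample with integer label c."""
    return -weights[class_index] * math.log(probs[class_index])

labels = [0] * 90 + [1] * 10          # hypothetical 9:1 imbalanced dataset
weights = inverse_frequency_weights(labels, num_classes=2)
print(weights)                        # roughly [0.556, 5.0]
print(weighted_cross_entropy(1, [0.8, 0.2], weights))  # rare class, up-weighted
```

With these weights, each class contributes the same total weight to the loss (n_k * w_k = N / K for every k).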

## How is cross-entropy used in knowledge distillation?

[Knowledge distillation](/wiki/knowledge_distillation) (Hinton et al., 2015) uses a modified cross-entropy loss to transfer knowledge from a large "teacher" model to a smaller "student" model [6]. The student is trained to match the teacher's soft probability distribution (not just the hard labels) using cross-entropy between the teacher's output distribution and the student's output distribution, both computed with a temperature parameter T > 1 that softens the distributions:

L_distill = -sum_{k=1}^{K} p_k^teacher(T) * log(p_k^student(T))

where p_k(T) = exp(z_k / T) / sum_j exp(z_j / T).

Hinton, Vinyals, and Dean describe the core idea as training the small model "to generalize in the same way as the large model," using the teacher's class probabilities as "soft targets" that carry more information than hard labels [6]. The total training loss for knowledge distillation is typically a weighted combination of the distillation loss and the standard cross-entropy loss with hard labels:

L_total = alpha * L_distill + (1 - alpha) * L_hard
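
The temperature softmax and the combined loss can be sketched in plain Python. The logits, T = 2.0, and alpha = 0.5 below are illustrative assumptions, and the function names are hypothetical.

```python
import math

def softmax_with_temperature(logits, T=1.0):
    scaled = [z / T for z in logits]
    m = max(scaled)                              # shift for numerical safety
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, T=2.0):
    """Cross-entropy between temperature-softened teacher and student outputs."""
    p_teacher = softmax_with_temperature(teacher_logits, T)
    p_student = softmax_with_temperature(student_logits, T)
    return -sum(pt * math.log(ps) for pt, ps in zip(p_teacher, p_student))

def total_loss(teacher_logits, student_logits, true_class, T=2.0, alpha=0.5):
    """alpha * L_distill + (1 - alpha) * L_hard, with L_hard computed at T = 1."""
    hard = -math.log(softmax_with_temperature(student_logits, 1.0)[true_class])
    soft = distillation_loss(teacher_logits, student_logits, T)
    return alpha * soft + (1.0 - alpha) * hard

teacher = [5.0, 2.0, -1.0]   # hypothetical logits
student = [3.0, 1.5, 0.0]
print(softmax_with_temperature(teacher, T=1.0))  # sharp distribution
print(softmax_with_temperature(teacher, T=4.0))  # softer, flatter distribution
print(total_loss(teacher, student, true_class=0))
```

Raising T flattens the teacher distribution, which exposes the relative probabilities of the non-target classes to the student.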

This use of cross-entropy is foundational to model compression and the training of smaller, faster models that retain much of the performance of their larger counterparts.

## Practical considerations

### Monitoring training with cross-entropy

Cross-entropy loss is the primary quantity monitored during training of classification models. Key patterns to watch for:

- **Training loss decreasing, validation loss decreasing**: Normal training, model is learning.
- **Training loss decreasing, validation loss increasing**: [Overfitting](/wiki/overfitting). Apply [regularization](/wiki/regularization) ([dropout](/wiki/dropout), weight decay, data augmentation).
- **Training loss plateauing**: Learning rate may be too small, or the model may have reached its capacity.
- **Training loss spiking**: Numerical instability, learning rate too large, or corrupt data.

### Choice of reduction

Deep learning frameworks offer different reduction modes for cross-entropy loss: mean (average over the batch), sum (total over the batch), and none (per-sample loss). The choice affects the effective learning rate: using sum reduction with a batch size of 32 produces gradients that are 32 times larger than mean reduction. Most practitioners use mean reduction, which keeps the scale of the loss and its gradients from growing with the batch size and so makes the [learning rate](/wiki/learning_rate) less sensitive to it.
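
The relationship between the three modes is simple arithmetic, sketched here on a hypothetical batch of four per-sample losses. The variable names mirror the reduction modes and are not framework calls.

```python
import math

# Hypothetical probabilities assigned to the true class for a batch of 4.
per_sample_losses = [-math.log(p) for p in (0.9, 0.6, 0.3, 0.8)]

loss_none = per_sample_losses                   # "none": one value per sample
loss_sum = sum(per_sample_losses)               # "sum": total over the batch
loss_mean = loss_sum / len(per_sample_losses)   # "mean": average over the batch

print(loss_sum / loss_mean)   # equals the batch size, here 4 (up to rounding)
```

Because differentiation is linear, the same factor of N carries over from the loss values to the gradients.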

### Mixed-precision and cross-entropy

In mixed-precision training, cross-entropy loss should be computed in float32 (full precision) even when the rest of the forward pass uses float16 or bfloat16. The log and softmax operations are particularly sensitive to precision, and computing them in half precision can cause significant numerical errors. PyTorch's `nn.CrossEntropyLoss` handles this correctly when used with automatic mixed precision (`torch.amp`, or the older `torch.cuda.amp` namespace) [7].

## References

1. Shannon, C. E. (1948). "A Mathematical Theory of Communication." Bell System Technical Journal, 27(3), 379-423. [https://people.math.harvard.edu/~ctm/home/text/others/shannon/entropy/entropy.pdf](https://people.math.harvard.edu/~ctm/home/text/others/shannon/entropy/entropy.pdf)
2. Bishop, C. M. (2006). "Pattern Recognition and Machine Learning." Springer. Chapter 4.3.4, "Multiclass logistic regression." [https://www.microsoft.com/en-us/research/publication/pattern-recognition-machine-learning/](https://www.microsoft.com/en-us/research/publication/pattern-recognition-machine-learning/)
3. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. (2016). "Rethinking the Inception Architecture for Computer Vision." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016). [https://arxiv.org/abs/1512.00567](https://arxiv.org/abs/1512.00567)
4. Lin, T.-Y., Goyal, P., Girshick, R., He, K., and Dollár, P. (2017). "Focal Loss for Dense Object Detection." Proceedings of the IEEE International Conference on Computer Vision (ICCV 2017). [https://arxiv.org/abs/1708.02002](https://arxiv.org/abs/1708.02002)
5. Jurafsky, D. and Martin, J. H. (2024). "Speech and Language Processing." 3rd edition draft. Chapter 3, "N-gram Language Models." [https://web.stanford.edu/~jurafsky/slp3/](https://web.stanford.edu/~jurafsky/slp3/)
6. Hinton, G., Vinyals, O., and Dean, J. (2015). "Distilling the Knowledge in a Neural Network." [NeurIPS](/wiki/neurips) 2014 Deep Learning Workshop. [https://arxiv.org/abs/1503.02531](https://arxiv.org/abs/1503.02531)
7. PyTorch documentation. "CrossEntropyLoss." Accessed 2026. [https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html)

