Cross-entropy loss (also called log loss) is the standard loss function for classification tasks in machine learning. It measures the difference between two probability distributions: the true distribution of labels and the predicted distribution output by a model. Cross-entropy has its roots in information theory, where it quantifies the average number of bits needed to encode data from one distribution using a code optimized for another distribution. In deep learning, cross-entropy loss provides the gradient signal that drives backpropagation in classification models, from simple logistic regression to billion-parameter large language models.
Cross-entropy is so central to modern machine learning that it appears in virtually every classification pipeline. Image classifiers, text classifiers, language models, speech recognition systems, and recommendation engines all typically optimize some form of cross-entropy loss. Its dominance stems from a combination of theoretical elegance (it arises naturally from maximum likelihood estimation) and practical effectiveness (it produces well-calibrated probabilities and strong gradients for learning).
Cross-entropy builds on several concepts from Claude Shannon's information theory, published in 1948 [1].
The entropy of a probability distribution P measures the average amount of surprise (or information) contained in events drawn from P. For a discrete distribution over K outcomes:
H(P) = -sum_{k=1}^{K} p_k * log(p_k)
Entropy is maximized when all outcomes are equally likely (uniform distribution) and minimized (equal to zero) when one outcome is certain. The logarithm base determines the unit: base 2 gives bits, base e gives nats. In machine learning, the natural logarithm (base e) is standard.
Intuitively, entropy captures how "spread out" a distribution is. A distribution concentrated on a single class has low entropy; a distribution spread across many classes has high entropy.
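The entropy formula above can be checked in a few lines of Python; this is a minimal sketch using the natural logarithm, so results are in nats:

```python
import math

def entropy(p):
    """H(P) = -sum p_k * log(p_k), in nats (terms with p_k = 0 contribute 0)."""
    return -sum(pk * math.log(pk) for pk in p if pk > 0)

uniform = [0.25, 0.25, 0.25, 0.25]   # maximally uncertain over 4 outcomes
peaked = [1.0, 0.0, 0.0, 0.0]        # one outcome is certain

print(entropy(uniform))  # log(4) ~ 1.386 nats, the maximum for K = 4
print(entropy(peaked))   # 0.0 nats
```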
The Kullback-Leibler (KL) divergence measures how one probability distribution P differs from a reference distribution Q:
D_KL(P || Q) = sum_{k=1}^{K} p_k * log(p_k / q_k)
KL divergence is always non-negative and equals zero only when P = Q. It is not symmetric: D_KL(P || Q) does not equal D_KL(Q || P) in general, which is why it is called a divergence rather than a distance.
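Both the non-negativity and the asymmetry are easy to observe numerically; the distributions below are illustrative values, not from any dataset:

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) = sum p_k * log(p_k / q_k), in nats."""
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0)

p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]

print(kl_divergence(p, q))  # small positive number
print(kl_divergence(q, p))  # a different value: KL is not symmetric
print(kl_divergence(p, p))  # 0.0: the distributions are identical
```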
Cross-entropy between the true distribution P and the predicted distribution Q is defined as:
H(P, Q) = -sum_{k=1}^{K} p_k * log(q_k)
Cross-entropy can be decomposed as:
H(P, Q) = H(P) + D_KL(P || Q)
Since the entropy H(P) of the true distribution is constant (it does not depend on the model's predictions), minimizing cross-entropy is equivalent to minimizing KL divergence between the true distribution and the model's output. This is also equivalent to maximum likelihood estimation: choosing the model parameters that maximize the probability of the observed data [2].
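The decomposition H(P, Q) = H(P) + D_KL(P || Q) can be verified directly; this sketch reuses the definitions above on arbitrary example distributions:

```python
import math

def entropy(p):
    return -sum(pk * math.log(pk) for pk in p if pk > 0)

def cross_entropy(p, q):
    return -sum(pk * math.log(qk) for pk, qk in zip(p, q) if pk > 0)

def kl_divergence(p, q):
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0)

p = [0.7, 0.2, 0.1]   # "true" distribution
q = [0.5, 0.3, 0.2]   # model's prediction

# The decomposition holds exactly (up to floating-point error):
assert abs(cross_entropy(p, q) - (entropy(p) + kl_divergence(p, q))) < 1e-12
```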
For binary classification (two classes, typically labeled 0 and 1), the cross-entropy loss for a single sample reduces to:
L = -[y * log(p) + (1 - y) * log(1 - p)]
where y is the true label (0 or 1) and p is the model's predicted probability for class 1.
When y = 1, only the first term is active: L = -log(p). The loss is small when p is close to 1 (correct and confident) and large when p is close to 0 (wrong and confident).
When y = 0, only the second term is active: L = -log(1 - p). The loss is small when p is close to 0 and large when p is close to 1.
For a dataset of N samples, the total binary cross-entropy loss is the average over all samples:
L = -(1/N) * sum_{i=1}^{N} [y_i * log(p_i) + (1 - y_i) * log(1 - p_i)]
Binary cross-entropy is used in logistic regression, binary classification heads, multi-label classification (where each class is treated as an independent binary prediction), and as a component of more complex losses.
For multi-class classification with K mutually exclusive classes, the cross-entropy loss for a single sample is:
L = -sum_{k=1}^{K} y_k * log(p_k)
where y is a one-hot encoded vector (y_k = 1 for the true class, y_k = 0 for all others) and p_k is the model's predicted probability for class k.
Since y is one-hot, this simplifies to:
L = -log(p_c)
where c is the index of the true class. The loss is simply the negative log probability assigned to the correct class. This is why cross-entropy loss is also called negative log-likelihood loss.
For a dataset of N samples:
L = -(1/N) * sum_{i=1}^{N} log(p_{i,c_i})
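With integer ("sparse") labels, the per-sample loss is just -log(p_c), averaged over the batch; a minimal sketch with made-up probability vectors:

```python
import math

def categorical_cross_entropy(probs, true_classes):
    """Average of -log(p_c) over samples; probs is a list of probability vectors."""
    return sum(-math.log(p[c]) for p, c in zip(probs, true_classes)) / len(true_classes)

probs = [
    [0.7, 0.2, 0.1],   # sample 1: model favors class 0
    [0.1, 0.8, 0.1],   # sample 2: model favors class 1
]
labels = [0, 1]        # integer class indices

print(categorical_cross_entropy(probs, labels))  # (-ln 0.7 - ln 0.8) / 2 ~ 0.290
```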
| Loss type | Formula (single sample) | Number of classes | Use case |
|---|---|---|---|
| Binary cross-entropy | -[y * log(p) + (1-y) * log(1-p)] | 2 (or multi-label) | Binary classification, multi-label |
| Categorical cross-entropy | -log(p_c) | K (mutually exclusive) | Multi-class classification |
| Sparse categorical cross-entropy | -log(p_c) (integer labels) | K (mutually exclusive) | Same as categorical, different label format |
In neural networks, the final layer for classification typically produces a vector of raw scores (logits) z = [z_1, z_2, ..., z_K], one per class. These logits are converted to probabilities using the softmax function:
p_k = exp(z_k) / sum_{j=1}^{K} exp(z_j)
Softmax ensures that the outputs are non-negative and sum to one, forming a valid probability distribution. The cross-entropy loss is then computed on these probabilities.
The combination of softmax and cross-entropy loss has a particularly clean gradient. When computing the gradient of the loss with respect to the logits z, the derivative simplifies to:
dL/dz_k = p_k - y_k
This is simply the difference between the predicted probability and the true label for each class. This clean gradient is one of the reasons cross-entropy loss pairs so well with softmax: the gradient is large when the prediction is wrong and small when it is right, providing a strong and intuitive learning signal.
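The gradient identity dL/dz_k = p_k - y_k can be confirmed with a finite-difference check; the logits here are arbitrary example values:

```python
import math

def softmax(z):
    m = max(z)                       # subtract max for numerical stability
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def loss(z, c):
    return -math.log(softmax(z)[c])  # cross-entropy with true class index c

z, c = [2.0, 1.0, 0.1], 0
p = softmax(z)
y = [1.0 if k == c else 0.0 for k in range(len(z))]

# The finite-difference gradient matches p_k - y_k for every logit:
h = 1e-6
for k in range(len(z)):
    z_plus = list(z)
    z_plus[k] += h
    numeric = (loss(z_plus, c) - loss(z, c)) / h
    assert abs(numeric - (p[k] - y[k])) < 1e-4
```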
Naively computing cross-entropy as -log(softmax(z)) can cause numerical problems. The softmax function involves computing exp(z_k), which can overflow to infinity for large z_k or underflow to zero for very negative z_k. Taking the log of these extreme values compounds the problem.
The standard solution is to compute log-softmax directly, combining the log and softmax into a single numerically stable operation:
log(p_k) = log(exp(z_k) / sum_j exp(z_j)) = z_k - log(sum_j exp(z_j))
To prevent overflow in the log-sum-exp term, a constant m = max(z) is subtracted from all logits:
log(sum_j exp(z_j)) = m + log(sum_j exp(z_j - m))
Since z_j - m is at most zero, exp(z_j - m) cannot overflow. This trick is implemented in all deep learning frameworks.
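The max-subtraction trick can be sketched directly; note that a naive `math.exp(1000.0)` would overflow, while the stable version returns finite values:

```python
import math

def log_softmax(z):
    """Numerically stable log-softmax: z_k - (m + log(sum_j exp(z_j - m)))."""
    m = max(z)
    lse = m + math.log(sum(math.exp(v - m) for v in z))
    return [v - lse for v in z]

# Logits this large would overflow a naive softmax; the stable version is fine.
z = [1000.0, 999.0, 0.0]
print(log_softmax(z))  # all finite; exponentiating them recovers probabilities
```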
In PyTorch, the recommended approach is to use nn.CrossEntropyLoss, which takes raw logits as input and internally computes log-softmax plus negative log-likelihood in a numerically stable way:
```python
import torch.nn as nn

criterion = nn.CrossEntropyLoss()
loss = criterion(logits, targets)  # logits: (batch, num_classes), targets: (batch,)
```
Computing softmax first and then passing probabilities to a separate log-loss function is numerically inferior and should be avoided.
Label smoothing is a regularization technique that modifies the target distribution used in cross-entropy loss [3]. Instead of using hard one-hot targets (where the true class has probability 1.0 and all other classes have probability 0.0), label smoothing softens the targets:
y_k_smooth = (1 - alpha) * y_k + alpha / K
where alpha is the smoothing parameter (typically 0.1) and K is the number of classes.
For the true class, the target becomes (1 - alpha + alpha/K) instead of 1.0. For incorrect classes, the target becomes alpha/K instead of 0.0.
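Applying the smoothing formula elementwise reproduces these targets; a minimal sketch for K = 10 and alpha = 0.1:

```python
def smooth_labels(y_onehot, alpha, K):
    """y_smooth = (1 - alpha) * y + alpha / K, applied elementwise."""
    return [(1 - alpha) * y + alpha / K for y in y_onehot]

# K = 10 classes, true class at index 0, alpha = 0.1:
y = [1.0] + [0.0] * 9
print(smooth_labels(y, 0.1, 10))  # [0.91, 0.01, 0.01, ...] — still sums to 1
```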
With hard targets, the cross-entropy loss drives the model to make the logit for the correct class infinitely larger than all other logits (since -log(p_c) approaches zero only as p_c approaches 1, which requires z_c to be much larger than all other z_k). This encourages overconfident predictions and large logit values, which can hurt generalization.
Label smoothing prevents this by giving the model a softer target that can be achieved with smaller logit differences. This produces better-calibrated probabilities (the model's confidence more closely matches its actual accuracy) and can improve generalization [3].
Label smoothing was used in training Inception v2 and many subsequent models. It remains a standard technique in training vision transformers and other classification models.
| Technique | Target for true class | Target for other classes | Effect |
|---|---|---|---|
| Hard targets (standard) | 1.0 | 0.0 | Encourages maximum confidence |
| Label smoothing (alpha=0.1, K=10) | 0.91 | 0.01 | Prevents overconfidence, improves calibration |
| Label smoothing (alpha=0.2, K=10) | 0.82 | 0.02 | Stronger smoothing, more regularization |
Focal loss, introduced by Tsung-Yi Lin et al. (2017) at Facebook AI Research, is a modification of cross-entropy designed to handle extreme class imbalance in object detection [4].
In one-stage object detectors like RetinaNet, the model evaluates tens of thousands of candidate locations per image. The vast majority of these candidates are background (negative examples), and only a tiny fraction contain objects of interest. With standard cross-entropy, the loss is dominated by the large number of easy negatives. Although each easy negative contributes a small loss individually, their sheer number overwhelms the gradient signal from the rare hard positives, preventing the model from learning effectively.
Focal loss adds a modulating factor to the standard cross-entropy:
FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t)
where p_t is the model's predicted probability for the true class, alpha_t is a per-class weighting factor, and gamma is the focusing parameter (typically gamma = 2).
The key term is (1 - p_t)^gamma. For well-classified examples where p_t is high (say, 0.9), this factor is small ((1 - 0.9)^2 = 0.01), so the loss contribution is heavily down-weighted. For misclassified examples where p_t is low (say, 0.1), the factor is close to 1 ((1 - 0.1)^2 = 0.81), so the loss is nearly unchanged.
The effect is that focal loss automatically focuses the model's learning on the hard, misclassified examples and ignores the easy ones. This is particularly effective for class-imbalanced detection tasks.
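The down-weighting described above follows directly from the (1 - p_t)^gamma factor; this sketch compares focal loss to plain cross-entropy at the two probabilities used as examples:

```python
import math

def focal_loss(p_t, gamma=2.0, alpha_t=1.0):
    """FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t)."""
    return -alpha_t * (1 - p_t) ** gamma * math.log(p_t)

def cross_entropy(p_t):
    return -math.log(p_t)

# An easy example (p_t = 0.9) is down-weighted by (1 - 0.9)^2 = 0.01;
# a hard example (p_t = 0.1) keeps (1 - 0.1)^2 = 0.81 of its loss.
print(focal_loss(0.9) / cross_entropy(0.9))  # ~0.01
print(focal_loss(0.1) / cross_entropy(0.1))  # ~0.81
```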
| gamma value | Effect on easy examples | Effect on hard examples |
|---|---|---|
| 0 (standard CE) | Full loss contribution | Full loss contribution |
| 1 | Moderate down-weighting | Nearly full loss |
| 2 (typical) | Strong down-weighting | Nearly full loss |
| 5 | Very strong down-weighting | Nearly full loss |
Focal loss was a key component of RetinaNet, which demonstrated that one-stage detectors could match or exceed the accuracy of two-stage detectors (like Faster R-CNN) when the class imbalance problem was properly addressed. The paper has been cited over 25,000 times and focal loss has been adopted widely beyond object detection, including in medical imaging, NLP, and any setting with severe class imbalance.
Cross-entropy loss plays a central role in training language models, where it serves as the primary training objective.
Autoregressive language models (like GPT, LLaMA, and Claude) are trained to predict the next token in a sequence, given all preceding tokens. For a sequence of tokens (x_1, x_2, ..., x_T), the model produces a probability distribution over the vocabulary V at each position:
p(x_t | x_1, ..., x_{t-1})
The training loss is the average cross-entropy across all positions in the sequence:
L = -(1/T) * sum_{t=1}^{T} log p(x_t | x_1, ..., x_{t-1})
This is equivalent to maximizing the log-likelihood of the training data under the model. The vocabulary size for modern language models ranges from 32,000 to over 100,000 tokens, making this a very high-dimensional classification problem at each position.
Cross-entropy loss in language modeling is sometimes reported in bits per byte (BPB) or bits per character (BPC) rather than nats per token. To convert from the natural logarithm (nats) to bits, divide by ln(2). To convert from per-token to per-byte, multiply by the average number of tokens per byte (which depends on the tokenizer).
These metrics allow comparison across models that use different tokenization schemes, since the byte-level and character-level rates are tokenizer-independent.
Perplexity is the standard evaluation metric for language models, and it is defined as the exponentiated cross-entropy:
PPL = exp(L) = exp(-(1/T) * sum_{t=1}^{T} log p(x_t | x_1, ..., x_{t-1}))
Perplexity can be interpreted as the effective number of equally likely choices the model is considering at each position. A perplexity of 10 means the model is, on average, as uncertain as if it were choosing uniformly among 10 options at each step [5].
Lower perplexity indicates a better model. A model that assigns all probability mass to the correct next token at every position would have perplexity 1.0 (cross-entropy 0.0).
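The "effective number of choices" interpretation is easy to check: a model assigning probability 0.1 to every correct token behaves like a uniform choice among 10 options.

```python
import math

def perplexity(log_probs):
    """PPL = exp(average negative log-probability); log_probs are in nats."""
    return math.exp(-sum(log_probs) / len(log_probs))

# Five positions, each with probability 0.1 on the correct token:
print(perplexity([math.log(0.1)] * 5))   # 10.0
# The corresponding cross-entropy in nats and in bits:
print(-math.log(0.1), -math.log2(0.1))   # ~2.30 nats, ~3.32 bits
```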
| Perplexity | Cross-entropy (nats) | Cross-entropy (bits) | Interpretation |
|---|---|---|---|
| 1.0 | 0.0 | 0.0 | Perfect prediction |
| 10 | 2.30 | 3.32 | Choosing among ~10 options |
| 100 | 4.61 | 6.64 | Choosing among ~100 options |
| 1000 | 6.91 | 9.97 | Choosing among ~1000 options |
Perplexity has been the primary metric for comparing language models since the earliest n-gram models. It remains central in the era of large language models, though downstream task performance has become equally or more important as a practical evaluation criterion.
Perplexity is preferred over raw cross-entropy for two practical reasons. First, it has a more intuitive interpretation ("the model is choosing among N options" is easier to grasp than "the loss is L nats"). Second, perplexity is measured on an exponential scale, which makes differences between good models more visible: a model with cross-entropy 3.0 and one with cross-entropy 2.5 have perplexities of 20.1 and 12.2, respectively, making the improvement much more apparent.
Cross-entropy is not the only loss function used for classification-related tasks. Several alternatives exist, each with specific strengths.
MSE can technically be used for classification by treating the one-hot target as a regression target. However, MSE is a poor choice for classification because its gradient does not grow fast enough when the model is badly wrong: the gradient of (y - p)^2 with respect to p is 2(p - y), which grows only linearly with the error, while the gradient of -log(p) is -1/p, which grows without bound as p approaches 0. Cross-entropy therefore provides much stronger corrective gradients for badly misclassified examples, leading to faster and more stable training.
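The gap in gradient magnitude is stark at a badly wrong prediction; a quick numerical comparison for y = 1 and a small p:

```python
# Gradient magnitudes with respect to p at a badly wrong prediction (y = 1):
p = 0.01
mse_grad = abs(2 * (p - 1))   # |2(p - y)| = 1.98: bounded, roughly constant
ce_grad = abs(-1 / p)         # |-1/p| = 100.0: grows as p approaches 0
print(mse_grad, ce_grad)
```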
Hinge loss, used in support vector machines, is L = max(0, 1 - y * z) where y is in {-1, +1} and z is the raw score. Hinge loss focuses only on samples near the decision boundary (the "support vectors") and ignores well-classified samples entirely. Cross-entropy, by contrast, always provides a gradient, even for correctly classified examples, encouraging the model to increase confidence.
Contrastive learning losses (like InfoNCE) are used in self-supervised and representation learning. InfoNCE is structurally similar to cross-entropy: it treats the positive pair as the correct class and all negative pairs as incorrect classes, computing a cross-entropy-like loss over these "classes." This connection has been made explicit in several theoretical analyses.
Connectionist Temporal Classification (CTC) loss is used in sequence-to-sequence tasks (like speech recognition) where the alignment between input and output is unknown. CTC marginalizes over all possible alignments, computing the total probability of the target sequence. It uses cross-entropy as a component but adds the marginalization over alignments.
| Loss function | Best for | Produces probabilities? | Gradient behavior |
|---|---|---|---|
| Cross-entropy | Classification | Yes (with softmax) | Strong gradient when wrong |
| MSE | Regression | Not naturally | Linear gradient |
| Hinge loss | SVM, max-margin | No | Zero gradient when correct |
| Focal loss | Imbalanced classification | Yes | Down-weights easy examples |
| CTC loss | Sequence alignment | Yes (marginal) | Marginalizes over alignments |
| InfoNCE | Contrastive learning | Relative probabilities | Cross-entropy-like |
When classes are imbalanced (some classes appear much more frequently than others), standard cross-entropy gives equal importance to each sample. This can cause the model to achieve low loss simply by predicting the majority class.
Weighted cross-entropy assigns a weight w_k to each class:
L = -sum_{k=1}^{K} w_k * y_k * log(p_k)
Common weighting strategies include inverse class frequency (w_k proportional to 1/n_k, where n_k is the number of training samples in class k), inverse square-root frequency (w_k proportional to 1/sqrt(n_k), a milder correction), and manually tuned weights selected on a validation set.
Weighted cross-entropy is simpler than focal loss and often sufficient for moderate class imbalance. For severe imbalance (ratio > 100:1), focal loss or oversampling strategies may be more effective.
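A per-sample sketch of weighted cross-entropy with hypothetical class counts, using inverse-frequency weights so a mistake on the rare class costs more:

```python
import math

def weighted_cross_entropy(p, c, weights):
    """-w_c * log(p_c): per-sample weighted cross-entropy with integer label c."""
    return -weights[c] * math.log(p[c])

# Hypothetical two-class problem where class 1 is rare (9:1 imbalance):
counts = [900, 100]
n = sum(counts)
weights = [n / (len(counts) * cnt) for cnt in counts]  # inverse-frequency weights

p = [0.8, 0.2]   # model leans toward the majority class
print(weighted_cross_entropy(p, 0, weights))  # majority class: down-weighted loss
print(weighted_cross_entropy(p, 1, weights))  # rare class: up-weighted loss
```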
Knowledge distillation (Hinton et al., 2015) uses a modified cross-entropy loss to transfer knowledge from a large "teacher" model to a smaller "student" model [6]. The student is trained to match the teacher's soft probability distribution (not just the hard labels) using cross-entropy between the teacher's output distribution and the student's output distribution, both computed with a temperature parameter T > 1 that softens the distributions:
L_distill = -sum_{k=1}^{K} p_k^teacher(T) * log(p_k^student(T))
where p_k(T) = exp(z_k / T) / sum_j exp(z_j / T).
The total training loss for knowledge distillation is typically a weighted combination of the distillation loss and the standard cross-entropy loss with hard labels:
L_total = alpha * L_distill + (1 - alpha) * L_hard
This use of cross-entropy is foundational to model compression and the training of smaller, faster models that retain much of the performance of their larger counterparts.
Cross-entropy loss is the primary quantity monitored during training of classification models. Key patterns to watch for:
Deep learning frameworks offer different reduction modes for cross-entropy loss: mean (average over the batch), sum (total over the batch), and none (per-sample loss). The choice affects the effective learning rate: using sum reduction with a batch size of 32 produces gradients that are 32 times larger than mean reduction. Most practitioners use mean reduction, as it makes the learning rate independent of batch size.
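The scaling relationship between the two reductions can be shown without any framework, using plain per-sample losses:

```python
import math

# Per-sample negative log-likelihoods for a batch of 4 (illustrative values):
per_sample = [-math.log(p) for p in [0.9, 0.6, 0.3, 0.8]]

mean_loss = sum(per_sample) / len(per_sample)   # 'mean' reduction
sum_loss = sum(per_sample)                      # 'sum' reduction

# 'sum' scales with batch size, so its gradients are batch_size times larger:
assert abs(sum_loss - mean_loss * len(per_sample)) < 1e-12
```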
In mixed-precision training, cross-entropy loss should be computed in float32 (full precision) even when the rest of the forward pass uses float16 or bfloat16. The log and softmax operations are particularly sensitive to precision, and computing them in half precision can cause significant numerical errors. PyTorch's nn.CrossEntropyLoss handles this correctly when used with automatic mixed precision (torch.cuda.amp).