Log loss (also called logarithmic loss, logistic loss, or binary cross-entropy) is a loss function used to evaluate and train classification models that output probability estimates. It measures how well predicted probabilities match the true class labels by penalizing predictions that are both confident and wrong. Log loss is the standard objective function for logistic regression, neural networks performing classification, and many other probabilistic classifiers. It also serves as a widely used evaluation metric in machine learning competitions and benchmarks.
Log loss has deep roots in information theory, where it corresponds to the cross-entropy between the empirical distribution of the labels and the model's predicted distribution. Minimizing log loss is mathematically equivalent to maximum likelihood estimation (MLE) under a Bernoulli or categorical model, and it is a strictly proper scoring rule, meaning that it is uniquely minimized when the predicted probabilities equal the true conditional probabilities of the classes.
Imagine you are playing a guessing game. Your friend is thinking of an animal, and you have to say how sure you are about your guess. If you say "I'm 90% sure it's a cat" and it really is a cat, you get a good score. But if you say "I'm 90% sure it's a cat" and it turns out to be a dog, you get a very bad score because you were very confident and very wrong.
Log loss works the same way. It rewards you for being confident and correct, and it punishes you harshly for being confident and incorrect. The best strategy is to say probabilities that honestly reflect how likely each answer is. If you are not sure, it is better to say "50-50" than to guess wrong with high confidence.
For a single observation with true binary label $y \in \{0, 1\}$ and predicted probability $\hat{y} = P(y=1)$, the log loss is:
$$L(y, \hat{y}) = -\bigl[y \log(\hat{y}) + (1 - y) \log(1 - \hat{y})\bigr]$$
For a dataset of $N$ observations, the average log loss is:
$$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \bigl[y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i)\bigr]$$
When $y = 1$, only the $-\log(\hat{y})$ term is active. When $y = 0$, only the $-\log(1 - \hat{y})$ term is active. In both cases, the loss equals the negative logarithm of the probability assigned to the correct class.
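Translated into code, the average binary log loss is only a few lines of NumPy. The sketch below is illustrative (the function name and the $\epsilon$ value are not tied to any particular library); it clips probabilities so that $\log(0)$ never occurs:

```python
import numpy as np

def binary_log_loss(y_true, y_prob, eps=1e-15):
    """Average binary log loss; probabilities are clipped to avoid log(0)."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

# Confident-and-correct predictions score near 0; confident-and-wrong score high.
print(binary_log_loss([1, 0, 1], [0.9, 0.1, 0.8]))   # ~0.14
print(binary_log_loss([1, 0, 1], [0.1, 0.9, 0.2]))   # ~2.07
```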
For multiclass classification with $K$ classes, the log loss generalizes to:
$$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} y_{i,k} \log(\hat{y}_{i,k})$$
where $y_{i,k}$ is 1 if observation $i$ belongs to class $k$ and 0 otherwise (one-hot encoding), and $\hat{y}_{i,k}$ is the predicted probability for class $k$. In practice, the predicted probabilities are typically produced by a softmax function applied to the model's raw output logits.
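As a minimal sketch of the multiclass case, assuming the model emits raw logits that are converted to probabilities with softmax (all names here are illustrative):

```python
import numpy as np

def softmax(z):
    """Row-wise softmax over raw logits."""
    z = z - z.max(axis=1, keepdims=True)   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def multiclass_log_loss(y_onehot, logits, eps=1e-15):
    """Average cross-entropy between one-hot labels and softmax probabilities."""
    probs = np.clip(softmax(logits), eps, 1.0)
    return -np.mean(np.sum(y_onehot * np.log(probs), axis=1))

y = np.array([[1, 0, 0], [0, 0, 1]])                # one-hot labels: 2 samples, 3 classes
z = np.array([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])   # raw model outputs (logits)
print(multiclass_log_loss(y, z))                    # ~0.18
```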
The negative logarithm has several properties that make it well-suited as a loss function:
| Property | Description |
|---|---|
| Range | Log loss is non-negative. It equals 0 only when the model assigns probability 1.0 to the correct class. It approaches infinity as the predicted probability for the correct class approaches 0. |
| Asymmetry of penalties | A prediction of 0.01 for a true positive incurs a loss of $-\log(0.01) \approx 4.61$, while a prediction of 0.99 for a true positive incurs only $-\log(0.99) \approx 0.01$. Confident wrong predictions are punished far more severely than uncertain ones. |
| Differentiability | Log loss is smooth and differentiable everywhere in $(0,1)$, making it compatible with gradient descent optimization. |
| Convexity | When used with logistic regression (where $\hat{y} = \sigma(w^T x)$), the log loss is a convex function of the model parameters, guaranteeing a unique global minimum. |
The following table illustrates how log loss changes with predicted probability for a true positive ($y = 1$):
| Predicted probability ($\hat{y}$) | Log loss ($-\log(\hat{y})$) | Interpretation |
|---|---|---|
| 0.99 | 0.0101 | Highly confident, correct |
| 0.90 | 0.1054 | Confident, correct |
| 0.70 | 0.3567 | Moderately confident, correct |
| 0.50 | 0.6931 | Maximum uncertainty (coin flip) |
| 0.30 | 1.2040 | Moderately confident, wrong |
| 0.10 | 2.3026 | Confident, wrong |
| 0.01 | 4.6052 | Highly confident, wrong |
Log loss is directly connected to several foundational concepts in information theory.
The Shannon entropy of a discrete probability distribution $p$ is defined as:
$$H(p) = -\sum_{x} p(x) \log p(x)$$
It measures the average amount of information (in bits or nats, depending on the logarithm base) needed to encode outcomes drawn from $p$. The cross-entropy between a true distribution $p$ and a model distribution $q$ is:
$$H(p, q) = -\sum_{x} p(x) \log q(x)$$
Cross-entropy measures the average number of bits needed to encode outcomes from $p$ when using a code optimized for $q$. In the context of classification, $p$ is the empirical label distribution (one-hot vectors) and $q$ is the model's predicted distribution. The average log loss over a dataset is exactly the cross-entropy between these two distributions.
Cross-entropy decomposes into Shannon entropy plus the Kullback-Leibler (KL) divergence:
$$H(p, q) = H(p) + D_{\text{KL}}(p \,\|\, q)$$
Since $H(p)$ is a constant determined by the true data distribution (and does not depend on the model), minimizing the cross-entropy loss is equivalent to minimizing the KL divergence $D_{\text{KL}}(p \,\|\, q)$. The KL divergence measures the "extra" bits of information needed when using $q$ instead of $p$, and it equals zero if and only if $p = q$. This means that training a model by minimizing log loss drives the model's predicted distribution toward the true conditional distribution of the labels.
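The decomposition can be checked numerically with a pair of toy distributions:

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])   # "true" distribution
q = np.array([0.5, 0.3, 0.2])   # model distribution

entropy = -np.sum(p * np.log(p))           # H(p)
cross_entropy = -np.sum(p * np.log(q))     # H(p, q)
kl_divergence = np.sum(p * np.log(p / q))  # D_KL(p || q)

# H(p, q) = H(p) + D_KL(p || q); the two sides agree up to floating-point error.
print(cross_entropy, entropy + kl_divergence)   # ~0.887 for both
```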
From the perspective of coding theory, a model that minimizes log loss is finding the most efficient code for the observed labels. The Kraft-McMillan theorem establishes that any uniquely decodable code corresponds to an implicit probability distribution, and the expected code length under that distribution equals the cross-entropy. A model with lower log loss assigns shorter codes (higher probabilities) to the events that actually occur.
Log loss is the negative log-likelihood of the data under a Bernoulli (binary) or categorical (multiclass) probabilistic model. This connection is one of the most important results linking statistical estimation to machine learning optimization.
Consider a binary classification model that predicts $\hat{y}_i = P(y_i = 1 \mid x_i)$ for each observation $x_i$. If the labels are independent Bernoulli random variables, the likelihood of the observed data is:
$$L(\theta) = \prod_{i=1}^{N} \hat{y}_i^{y_i} (1 - \hat{y}_i)^{1 - y_i}$$
Taking the logarithm:
$$\log L(\theta) = \sum_{i=1}^{N} \bigl[y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i)\bigr]$$
Maximizing this log-likelihood is equivalent to minimizing its negation, which is exactly the sum of the per-sample log losses. Therefore, minimizing log loss is identical to maximum likelihood estimation.
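A short numerical check makes the equivalence concrete for the binary case (the labels and probabilities below are arbitrary illustrative values):

```python
import numpy as np

y = np.array([1, 0, 1, 1])              # observed binary labels
y_hat = np.array([0.8, 0.3, 0.6, 0.9])  # predicted P(y = 1)

# Bernoulli likelihood of the observed data under the model's probabilities
likelihood = np.prod(y_hat**y * (1 - y_hat)**(1 - y))

# Sum of per-sample log losses
total_log_loss = -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# The negative log-likelihood equals the total log loss (~1.196 for both).
print(-np.log(likelihood), total_log_loss)
```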
For multiclass classification with a categorical distribution, the same reasoning applies. The likelihood is:
$$L(\theta) = \prod_{i=1}^{N} \prod_{k=1}^{K} \hat{y}_{i,k}^{\,y_{i,k}}$$
The negative log-likelihood reduces to the multiclass cross-entropy loss. This is the theoretical justification for using cross-entropy as the training objective in neural networks with softmax output layers.
A scoring rule assigns a numerical score to a probabilistic forecast based on the observed outcome. A scoring rule is proper if the expected score is optimized when the forecaster reports their true believed probabilities. It is strictly proper if the optimum is unique, meaning the forecaster can do no better than reporting the exact true probabilities.
Log loss (also known as the logarithmic scoring rule) was introduced by I.J. Good in 1952 and is one of the most well-known strictly proper scoring rules. Its strict properness has several practical implications.
First, it encourages honest probability estimation. A model trained with log loss cannot improve its expected loss by systematically over- or under-estimating probabilities. Second, it promotes calibration. A well-calibrated model is one where, among all predictions assigned probability $p$, approximately a fraction $p$ of them are actually positive. Because log loss is strictly proper, models optimized under it tend to produce well-calibrated probabilities. Third, it is a local scoring rule. The log loss depends only on the probability assigned to the outcome that actually occurred, not on the probabilities assigned to other outcomes. This locality property is unique among strictly proper scoring rules on non-binary outcome spaces.
| Scoring rule | Formula (binary case) | Strictly proper? | Key characteristics |
|---|---|---|---|
| Log loss (logarithmic) | $-[y\log\hat{y} + (1-y)\log(1-\hat{y})]$ | Yes | Local; penalizes confident errors severely; unbounded |
| Brier score (quadratic) | $(y - \hat{y})^2$ | Yes | Bounded [0, 1]; gentler penalty for confident wrong predictions |
| Spherical score | $\hat{y}_k / \lVert \hat{y} \rVert_2$ | Yes | Normalized; used in some meteorological forecasting applications |
| 0-1 loss (accuracy) | $\mathbb{1}[\hat{y}_{\text{class}} \neq y]$ | No | Threshold-based; ignores probability quality; not proper |
| Hinge loss | $\max(0, 1 - y \cdot f(x))$ | No | Used in SVMs; margin-based; does not yield probabilities |
A key practical distinction is that log loss penalizes confident wrong predictions much more severely than the Brier score does. For example, predicting 0.001 for a true positive produces a log loss of about 6.9 but a Brier score of only about 1.0. This heavy penalization makes log loss especially sensitive to outlier predictions.
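The contrast is easy to reproduce for exactly this case:

```python
import numpy as np

y_hat = 0.001   # probability assigned to the true class (y = 1)

log_loss = -np.log(y_hat)     # ~6.91: severe, unbounded penalty
brier = (1 - y_hat) ** 2      # ~0.998: bounded penalty

print(log_loss, brier)
```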
In logistic regression, the predicted probability is $\hat{y} = \sigma(w^T x) = 1 / (1 + e^{-w^T x})$, where $\sigma$ is the sigmoid function. The gradient of the log loss with respect to the weight vector $w$ is:
$$\nabla_w \mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i) x_i = \frac{1}{N} X^T (\hat{y} - y)$$
This clean gradient formula, where the gradient is proportional to the prediction error $(\hat{y}_i - y_i)$, is a consequence of the sigmoid function being the canonical link function for the Bernoulli distribution. It is analogous to the gradient of mean squared error in linear regression.
In multiclass classification with a softmax output layer, the gradient of the cross-entropy loss with respect to the logit $z_k$ for class $k$ is:
$$\frac{\partial \mathcal{L}}{\partial z_k} = \hat{y}_k - y_k$$
where $\hat{y}_k = \text{softmax}(z)_k$. This simple form (predicted probability minus true label) is one reason why cross-entropy combined with softmax is the standard choice for training classification neural networks. The gradient never saturates (goes to zero) when the model is making large errors, unlike loss functions paired with non-canonical activation functions.
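The identity can be verified with a finite-difference check; the sketch below uses a single sample with three classes (values are arbitrary):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([1.5, -0.3, 0.8])   # logits for one sample
y = np.array([0.0, 1.0, 0.0])    # one-hot label (true class is index 1)

def loss(z):
    return -np.sum(y * np.log(softmax(z)))

# Analytic gradient: predicted probabilities minus the one-hot label
analytic = softmax(z) - y

# Central finite differences for comparison
eps = 1e-6
numeric = np.array([
    (loss(z + eps * np.eye(3)[k]) - loss(z - eps * np.eye(3)[k])) / (2 * eps)
    for k in range(3)
])

print(analytic)   # approximately [ 0.60, -0.90, 0.30]
print(numeric)    # matches the analytic gradient to several decimal places
```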
Because log loss is differentiable and (for linear models) convex, it can be minimized using standard gradient-based optimization methods such as batch or stochastic gradient descent and quasi-Newton methods like L-BFGS.
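As an illustrative sketch (synthetic data, arbitrary learning rate and step count), batch gradient descent on the average log loss for logistic regression needs nothing beyond the gradient formula given above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic, linearly separable toy data: 200 samples, 2 features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)

w = np.zeros(2)
lr = 0.1
for _ in range(500):
    y_hat = sigmoid(X @ w)
    grad = X.T @ (y_hat - y) / len(y)   # gradient of the average log loss
    w -= lr * grad

y_hat = np.clip(sigmoid(X @ w), 1e-15, 1 - 1e-15)
loss = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
print(w, loss)   # the loss keeps shrinking as the weights align with the separating direction
```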
Computing $\log(\hat{y})$ directly can cause numerical problems when $\hat{y}$ is very close to 0 or 1. If $\hat{y} = 0$, then $\log(0) = -\infty$, producing undefined or infinite loss values. Several techniques address this.
Clipping (epsilon clamping): The simplest approach is to clip predicted probabilities to the range $[\epsilon, 1 - \epsilon]$ for a small constant like $\epsilon = 10^{-15}$. For example, scikit-learn's log_loss function uses this strategy internally.
Log-sum-exp trick: Instead of computing softmax and then taking the log separately, the log-softmax function can be computed in a numerically stable way using the identity $\log(\text{softmax}(z)_k) = z_k - \log(\sum_j e^{z_j})$, with the log-sum-exp evaluated using the max-subtraction trick.
Combined loss functions: Deep learning frameworks provide fused operations such as PyTorch's BCEWithLogitsLoss (which combines a sigmoid layer and binary cross-entropy) and CrossEntropyLoss (which combines log-softmax and negative log-likelihood). These fused operations are more numerically stable than computing the steps separately.
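The effect of the max-subtraction trick is easy to demonstrate: with extreme logits, the naive computation overflows while the shifted version stays finite (the logits are chosen deliberately to force overflow):

```python
import numpy as np

def log_softmax(z):
    """Numerically stable log-softmax using the max-subtraction trick."""
    z = z - z.max()                        # shift so the largest logit is 0
    return z - np.log(np.sum(np.exp(z)))   # z_k - log(sum_j exp(z_j))

z = np.array([1000.0, 0.0, -1000.0])       # extreme logits

# Naive computation: exp(1000) overflows to inf, so the result is nan / -inf.
naive = np.log(np.exp(z) / np.sum(np.exp(z)))

print(naive)           # [nan, -inf, -inf], with overflow warnings
print(log_softmax(z))  # [0.0, -1000.0, -2000.0], all finite
```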
Label smoothing is a regularization technique that replaces hard one-hot labels with softened versions:
$$y_{\text{smooth}} = y \cdot (1 - \alpha) + \frac{\alpha}{K}$$
where $\alpha$ is a small smoothing parameter (commonly 0.1) and $K$ is the number of classes. Label smoothing prevents the model from becoming overconfident and can improve generalization. It was popularized by Szegedy et al. (2016) in their work on the Inception image-classification architecture.
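Applied to one-hot labels with $\alpha = 0.1$ and $K = 3$, the transformation looks like this (the helper name is illustrative):

```python
import numpy as np

def smooth_labels(y_onehot, alpha=0.1):
    """Mix one-hot labels with the uniform distribution over K classes."""
    K = y_onehot.shape[-1]
    return y_onehot * (1 - alpha) + alpha / K

y = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
print(smooth_labels(y))
# [[0.9333, 0.0333, 0.0333],
#  [0.0333, 0.9333, 0.0333]]  -- each row still sums to 1
```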
When training data has a skewed class distribution, standard log loss can lead to a model that predicts the majority class too often. Common mitigation strategies include the following.
Weighted cross-entropy: Assign different weights $w_k$ to each class in the loss function: $\mathcal{L} = -\frac{1}{N} \sum_i \sum_k w_k \, y_{i,k} \log(\hat{y}_{i,k})$. The weights are typically set inversely proportional to class frequency.
Focal loss: Introduced by Lin et al. (2017) for object detection, focal loss adds a modulating factor $(1 - \hat{y}_t)^\gamma$ to the standard cross-entropy, down-weighting easy examples and focusing learning on hard, misclassified examples:
$$\text{FL}(\hat{y}_t) = -(1 - \hat{y}_t)^\gamma \log(\hat{y}_t)$$
where $\hat{y}_t$ is the predicted probability for the true class and $\gamma \geq 0$ is a focusing parameter. When $\gamma = 0$, focal loss reduces to standard cross-entropy.
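A direct transcription of the formula (with $\gamma = 2$, a commonly used value) shows how easy examples are down-weighted relative to standard cross-entropy:

```python
import numpy as np

def focal_loss(p_t, gamma=2.0):
    """Focal loss given the predicted probability of the true class."""
    p_t = np.asarray(p_t, dtype=float)
    return -((1 - p_t) ** gamma) * np.log(p_t)

# An easy example (p_t = 0.9) is down-weighted far more than a hard one (p_t = 0.1).
print(focal_loss(0.9))   # ~0.001  vs. cross-entropy ~0.105
print(focal_loss(0.1))   # ~1.87   vs. cross-entropy ~2.303
```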
Oversampling and undersampling: Adjusting the training data distribution through oversampling the minority class or downsampling the majority class.
Unlike accuracy, which ranges from 0 to 1, log loss has no fixed upper bound, which can make interpretation less intuitive. Some useful reference points:
| Scenario | Log loss value |
|---|---|
| Perfect predictions (all probabilities = 1.0 for true class) | 0.0 |
| Random guessing for binary classification (always predict 0.5) | $-\log(0.5) \approx 0.6931$ |
| Random guessing for 10-class classification (always predict 0.1) | $-\log(0.1) \approx 2.3026$ |
| Baseline with class prior (predict class frequency) | Entropy of the class distribution |
A model is performing better than random guessing if its log loss is below $\log(K)$, where $K$ is the number of classes. Comparing a model's log loss to the entropy of the class distribution gives a sense of how much the model has learned beyond the base rates.
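These baselines are straightforward to compute for a concrete (hypothetical) class distribution:

```python
import numpy as np

# Hypothetical label counts for a 3-class problem
counts = np.array([700, 200, 100])
priors = counts / counts.sum()            # [0.7, 0.2, 0.1]

# A model that always predicts the class priors achieves a log loss equal
# to the entropy of the class distribution.
baseline = -np.sum(priors * np.log(priors))
print(baseline)              # ~0.80 nats

# Uniform random guessing over K classes gives log(K).
print(np.log(len(counts)))   # ~1.10 nats
```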
Log loss is the native loss function for logistic regression. The logistic regression model directly parameterizes class probabilities using the sigmoid function, and the model parameters are fit by minimizing the average log loss (equivalently, maximizing the likelihood).
Virtually all neural networks trained for classification use cross-entropy loss. For binary classification, the final layer typically uses a sigmoid activation paired with binary cross-entropy. For multiclass classification, a softmax output layer is paired with categorical cross-entropy. The clean gradient properties make cross-entropy especially effective for training deep networks via backpropagation.
Gradient boosting methods such as XGBoost, LightGBM, and CatBoost use log loss as their default objective function for classification tasks. Each boosting iteration fits a new tree to the negative gradient of the log loss.
Log loss is one of the most popular evaluation metrics on Kaggle and in machine learning benchmarks. It is preferred over accuracy because it evaluates the quality of probability estimates, not just whether the top prediction is correct. Competitions that use log loss reward participants for producing well-calibrated, nuanced probability estimates rather than just getting the hard classification right.
Because log loss is a strictly proper scoring rule, it is used alongside calibration plots to assess whether a model's predicted probabilities are reliable. A model with low log loss but poor calibration (as revealed by a reliability diagram) may benefit from post-hoc calibration techniques such as Platt scaling or isotonic regression.
Log loss is available in all popular machine learning frameworks.
| Library | Function / class | Notes |
|---|---|---|
| scikit-learn | sklearn.metrics.log_loss | Evaluation metric; clips probabilities with $\epsilon = 10^{-15}$ |
| PyTorch | torch.nn.CrossEntropyLoss | Combines log-softmax and NLL loss; accepts raw logits |
| PyTorch | torch.nn.BCEWithLogitsLoss | Combines sigmoid and binary cross-entropy; numerically stable |
| TensorFlow / Keras | tf.keras.losses.BinaryCrossentropy | Supports label smoothing and from-logits mode |
| TensorFlow / Keras | tf.keras.losses.CategoricalCrossentropy | Multiclass version; supports label smoothing |
| XGBoost | binary:logistic, multi:softprob | Default objectives for classification |
| LightGBM | binary, multiclass | Optimizes log loss by default for classification |
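As a brief usage sketch (the values are arbitrary), the scikit-learn metric and the PyTorch training loss are invoked as follows:

```python
import torch
from sklearn.metrics import log_loss

# scikit-learn: evaluate predicted probabilities against true binary labels
y_true = [1, 0, 1, 1]
y_prob = [0.8, 0.3, 0.6, 0.9]
print(log_loss(y_true, y_prob))   # average binary log loss, ~0.30

# PyTorch: CrossEntropyLoss expects raw logits and integer class indices
logits = torch.tensor([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])
targets = torch.tensor([0, 2])
print(torch.nn.CrossEntropyLoss()(logits, targets))  # applies log-softmax internally
```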
The logarithmic scoring rule was proposed by I.J. Good in 1952 as a method for evaluating the quality of probabilistic predictions. Good connected it to the concept of "weight of evidence" in Bayesian inference.
The theoretical foundations were formalized further in the 1970s when proper scoring rules were developed into a principled framework for assessing probabilistic forecasts, with applications in meteorology, economics, and psychology. The Brier score (proposed by Glenn Brier in 1950) predates the logarithmic scoring rule by two years and is the other major proper scoring rule used in practice.
In information theory, the connection between cross-entropy and coding efficiency was established through Claude Shannon's foundational 1948 paper "A Mathematical Theory of Communication." The relationship between maximum likelihood estimation and cross-entropy minimization was recognized as a unifying bridge between statistics and information theory.
In machine learning, log loss became the standard training objective for logistic regression and later for neural network classifiers. The term "log loss" gained widespread popularity through its adoption as a competition metric on platforms like Kaggle. Variants such as focal loss (Lin et al., 2017) have extended the basic cross-entropy framework to handle specific challenges like class imbalance in object detection.