Log loss (also called logarithmic loss, logistic loss, or binary cross-entropy) is a loss function used to evaluate and train classification models that output probability estimates. It measures how well predicted probabilities match the true class labels by penalizing predictions that are both confident and wrong. Log loss is the standard objective function for logistic regression, neural networks performing classification, and many other probabilistic classifiers. It also serves as a widely used evaluation metric in machine learning competitions and benchmarks.
Log loss has deep roots in information theory, where it corresponds to the cross-entropy between the empirical distribution of the labels and the model's predicted distribution. Minimizing log loss is mathematically equivalent to maximum likelihood estimation (MLE) under a Bernoulli or categorical model, and it is a strictly proper scoring rule, meaning that it is uniquely minimized when the predicted probabilities equal the true conditional probabilities of the classes.
Imagine you are playing a guessing game. Your friend is thinking of an animal, and you have to say how sure you are about your guess. If you say "I'm 90% sure it's a cat" and it really is a cat, you get a good score. But if you say "I'm 90% sure it's a cat" and it turns out to be a dog, you get a very bad score because you were very confident and very wrong.
Log loss works the same way. It rewards you for being confident and correct, and it punishes you harshly for being confident and incorrect. The best strategy is to say probabilities that honestly reflect how likely each answer is. If you are not sure, it is better to say "50-50" than to guess wrong with high confidence.
For a single observation with true binary label $y \in \{0, 1\}$ and predicted probability $\hat{y} = P(y=1)$, the log loss is:
$$L(y, \hat{y}) = -\bigl[y \log(\hat{y}) + (1 - y) \log(1 - \hat{y})\bigr]$$
For a dataset of $N$ observations, the average log loss is:
$$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \bigl[y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i)\bigr]$$
When $y = 1$, only the $-\log(\hat{y})$ term is active. When $y = 0$, only the $-\log(1 - \hat{y})$ term is active. In both cases, the loss equals the negative logarithm of the probability assigned to the correct class.
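Translated into code, the average binary log loss is only a few lines of NumPy. The sketch below is illustrative (the function name and the $\epsilon$ value are not tied to any particular library); it clips probabilities so that $\log(0)$ never occurs:

```python
import numpy as np

def binary_log_loss(y_true, y_prob, eps=1e-15):
    """Average binary log loss; probabilities are clipped to avoid log(0)."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

# Confident-and-correct predictions score near 0; confident-and-wrong score high.
print(binary_log_loss([1, 0, 1], [0.9, 0.1, 0.8]))   # ~0.14
print(binary_log_loss([1, 0, 1], [0.1, 0.9, 0.2]))   # ~2.07
```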
For multiclass classification with $K$ classes, the log loss generalizes to:
$$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} y_{i,k} \log(\hat{y}_{i,k})$$
where $y_{i,k}$ is 1 if observation $i$ belongs to class $k$ and 0 otherwise (one-hot encoding), and $\hat{y}_{i,k}$ is the predicted probability for class $k$. In practice, the predicted probabilities are typically produced by a softmax function applied to the model's raw output logits.
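As a minimal sketch of the multiclass case, assuming the model emits raw logits that are converted to probabilities with softmax (all names here are illustrative):

```python
import numpy as np

def softmax(z):
    """Row-wise softmax over raw logits."""
    z = z - z.max(axis=1, keepdims=True)   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def multiclass_log_loss(y_onehot, logits, eps=1e-15):
    """Average cross-entropy between one-hot labels and softmax probabilities."""
    probs = np.clip(softmax(logits), eps, 1.0)
    return -np.mean(np.sum(y_onehot * np.log(probs), axis=1))

y = np.array([[1, 0, 0], [0, 0, 1]])                # one-hot labels: 2 samples, 3 classes
z = np.array([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])   # raw model outputs (logits)
print(multiclass_log_loss(y, z))                    # ~0.18
```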
The negative logarithm has several properties that make it well-suited as a loss function:
| Property | Description |
|---|---|
| Range | Log loss is non-negative. It equals 0 only when the model assigns probability 1.0 to the correct class. It approaches infinity as the predicted probability for the correct class approaches 0. |
| Asymmetry of penalties | A prediction of 0.01 for a true positive incurs a loss of $-\log(0.01) \approx 4.61$, while a prediction of 0.99 for a true positive incurs only $-\log(0.99) \approx 0.01$. Confident wrong predictions are punished far more severely than uncertain ones. |
| Differentiability | Log loss is smooth and differentiable everywhere in $(0,1)$, making it compatible with gradient descent optimization. |
| Convexity | When used with logistic regression (where $\hat{y} = \sigma(w^T x)$), the log loss is a convex function of the model parameters, guaranteeing a unique global minimum. |
The following table illustrates how log loss changes with predicted probability for a true positive ($y = 1$):
| Predicted probability ($\hat{y}$) | Log loss ($-\log(\hat{y})$) | Interpretation |
|---|---|---|
| 0.99 | 0.0101 | Highly confident, correct |
| 0.90 | 0.1054 | Confident, correct |
| 0.70 | 0.3567 | Moderately confident, correct |
| 0.50 | 0.6931 | Maximum uncertainty (coin flip) |
| 0.30 | 1.2040 | Moderately confident, wrong |
| 0.10 | 2.3026 | Confident, wrong |
| 0.01 | 4.6052 | Highly confident, wrong |
Log loss is directly connected to several foundational concepts in information theory.
The Shannon entropy of a discrete probability distribution $p$ is defined as:
$$H(p) = -\sum_{x} p(x) \log p(x)$$
It measures the average amount of information (in bits or nats, depending on the logarithm base) needed to encode outcomes drawn from $p$. The cross-entropy between a true distribution $p$ and a model distribution $q$ is:
$$H(p, q) = -\sum_{x} p(x) \log q(x)$$
Cross-entropy measures the average number of bits needed to encode outcomes from $p$ when using a code optimized for $q$. In the context of classification, $p$ is the empirical label distribution (one-hot vectors) and $q$ is the model's predicted distribution. The average log loss over a dataset is exactly the cross-entropy between these two distributions.
Cross-entropy decomposes into Shannon entropy plus the Kullback-Leibler (KL) divergence:
$$H(p, q) = H(p) + D_{\text{KL}}(p \,\|\, q)$$
Since $H(p)$ is a constant determined by the true data distribution (and does not depend on the model), minimizing the cross-entropy loss is equivalent to minimizing the KL divergence $D_{\text{KL}}(p \,\|\, q)$. The KL divergence measures the "extra" bits of information needed when using $q$ instead of $p$, and it equals zero if and only if $p = q$. This means that training a model by minimizing log loss drives the model's predicted distribution toward the true conditional distribution of the labels.
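The decomposition can be checked numerically with a pair of toy distributions:

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])   # "true" distribution
q = np.array([0.5, 0.3, 0.2])   # model distribution

entropy = -np.sum(p * np.log(p))           # H(p)
cross_entropy = -np.sum(p * np.log(q))     # H(p, q)
kl_divergence = np.sum(p * np.log(p / q))  # D_KL(p || q)

# H(p, q) = H(p) + D_KL(p || q); the two sides agree up to floating-point error.
print(cross_entropy, entropy + kl_divergence)   # ~0.887 for both
```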
From the perspective of coding theory, a model that minimizes log loss is finding the most efficient code for the observed labels. The Kraft-McMillan theorem establishes that any uniquely decodable code corresponds to an implicit probability distribution, and the expected code length under that distribution equals the cross-entropy. A model with lower log loss assigns shorter codes (higher probabilities) to the events that actually occur.
Log loss is the negative log-likelihood of the data under a Bernoulli (binary) or categorical (multiclass) probabilistic model. This connection is one of the most important results linking statistical estimation to machine learning optimization.
Consider a binary classification model that predicts $\hat{y}_i = P(y_i = 1 \mid x_i)$ for each observation $x_i$. If the labels are independent Bernoulli random variables, the likelihood of the observed data is:
$$L(\theta) = \prod_{i=1}^{N} \hat{y}_i^{y_i} (1 - \hat{y}_i)^{1 - y_i}$$
Taking the logarithm:
$$\log L(\theta) = \sum_{i=1}^{N} \bigl[y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i)\bigr]$$
Maximizing this log-likelihood is equivalent to minimizing its negation, which is exactly the sum of the per-sample log losses. Therefore, minimizing log loss is identical to maximum likelihood estimation.
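A short numerical check makes the equivalence concrete for the binary case (the labels and probabilities below are arbitrary illustrative values):

```python
import numpy as np

y = np.array([1, 0, 1, 1])              # observed binary labels
y_hat = np.array([0.8, 0.3, 0.6, 0.9])  # predicted P(y = 1)

# Bernoulli likelihood of the observed data under the model's probabilities
likelihood = np.prod(y_hat**y * (1 - y_hat)**(1 - y))

# Sum of per-sample log losses
total_log_loss = -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# The negative log-likelihood equals the total log loss (~1.196 for both).
print(-np.log(likelihood), total_log_loss)
```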
For multiclass classification with a categorical distribution, the same reasoning applies. The likelihood is:
$$L(\theta) = \prod_{i=1}^{N} \prod_{k=1}^{K} \hat{y}_{i,k}^{\,y_{i,k}}$$
The negative log-likelihood reduces to the multiclass cross-entropy loss. This is the theoretical justification for using cross-entropy as the training objective in neural networks with softmax output layers.
A scoring rule assigns a numerical score to a probabilistic forecast based on the observed outcome. A scoring rule is proper if the expected score is optimized when the forecaster reports their true believed probabilities. It is strictly proper if the optimum is unique, meaning the forecaster can do no better than reporting the exact true probabilities.
Log loss (also known as the logarithmic scoring rule) was introduced by I.J. Good in 1952 and is one of the most well-known strictly proper scoring rules. Its strict properness has several practical implications.
First, it encourages honest probability estimation. A model trained with log loss cannot improve its expected loss by systematically over- or under-estimating probabilities. Second, it promotes calibration. A well-calibrated model is one where, among all predictions assigned probability $p$, approximately a fraction $p$ of them are actually positive. Because log loss is strictly proper, models optimized under it tend to produce well-calibrated probabilities. Third, it is a local scoring rule. The log loss depends only on the probability assigned to the outcome that actually occurred, not on the probabilities assigned to other outcomes. This locality property is unique among strictly proper scoring rules on non-binary outcome spaces.
| Scoring rule | Formula (binary case) | Strictly proper? | Key characteristics |
|---|---|---|---|
| Log loss (logarithmic) | $-[y\log\hat{y} + (1-y)\log(1-\hat{y})]$ | Yes | Local; penalizes confident errors severely; unbounded |
| Brier score (quadratic) | $(y - \hat{y})^2$ | Yes | Bounded [0, 1]; gentler penalty for confident wrong predictions |
| Spherical score | $\hat{y}_k / \lVert \hat{y} \rVert_2$ | Yes | Normalized; used in some meteorological forecasting applications |
| 0-1 loss (accuracy) | $\mathbb{1}[\hat{y}_{\text{class}} \neq y]$ | No | Threshold-based; ignores probability quality; not proper |
| Hinge loss | $\max(0, 1 - y \cdot f(x))$ | No | Used in SVMs; margin-based; does not yield probabilities |
A key practical distinction is that log loss penalizes confident wrong predictions much more severely than the Brier score does. For example, predicting 0.001 for a true positive produces a log loss of about 6.9 but a Brier score of only about 1.0. This heavy penalization makes log loss especially sensitive to outlier predictions.
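The contrast is easy to reproduce for exactly this case:

```python
import numpy as np

y_hat = 0.001   # probability assigned to the true class (y = 1)

log_loss = -np.log(y_hat)     # ~6.91: severe, unbounded penalty
brier = (1 - y_hat) ** 2      # ~0.998: bounded penalty

print(log_loss, brier)
```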
In logistic regression, the predicted probability is $\hat{y} = \sigma(w^T x) = 1 / (1 + e^{-w^T x})$, where $\sigma$ is the sigmoid function. The gradient of the log loss with respect to the weight vector $w$ is:
$$\nabla_w \mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i) x_i = \frac{1}{N} X^T (\hat{y} - y)$$
This clean gradient formula, where the gradient is proportional to the prediction error $(\hat{y}_i - y_i)$, is a consequence of the sigmoid function being the canonical link function for the Bernoulli distribution. It is analogous to the gradient of mean squared error in linear regression.
In multiclass classification with a softmax output layer, the gradient of the cross-entropy loss with respect to the logit $z_k$ for class $k$ is:
$$\frac{\partial \mathcal{L}}{\partial z_k} = \hat{y}_k - y_k$$
where $\hat{y}_k = \text{softmax}(z)_k$. This simple form (predicted probability minus true label) is one reason why cross-entropy combined with softmax is the standard choice for training classification neural networks. The gradient never saturates (goes to zero) when the model is making large errors, unlike loss functions paired with non-canonical activation functions.
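The identity can be verified with a finite-difference check; the sketch below uses a single sample with three classes (values are arbitrary):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([1.5, -0.3, 0.8])   # logits for one sample
y = np.array([0.0, 1.0, 0.0])    # one-hot label (true class is index 1)

def loss(z):
    return -np.sum(y * np.log(softmax(z)))

# Analytic gradient: predicted probabilities minus the one-hot label
analytic = softmax(z) - y

# Central finite differences for comparison
eps = 1e-6
numeric = np.array([
    (loss(z + eps * np.eye(3)[k]) - loss(z - eps * np.eye(3)[k])) / (2 * eps)
    for k in range(3)
])

print(analytic)   # approximately [ 0.60, -0.90, 0.30]
print(numeric)    # matches the analytic gradient to several decimal places
```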
Because log loss is differentiable and (for linear models) convex, it can be minimized using standard gradient-based optimization methods such as batch or stochastic gradient descent and quasi-Newton methods like L-BFGS.
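As an illustrative sketch (synthetic data, arbitrary learning rate and step count), batch gradient descent on the average log loss for logistic regression needs nothing beyond the gradient formula given above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic, linearly separable toy data: 200 samples, 2 features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)

w = np.zeros(2)
lr = 0.1
for _ in range(500):
    y_hat = sigmoid(X @ w)
    grad = X.T @ (y_hat - y) / len(y)   # gradient of the average log loss
    w -= lr * grad

y_hat = np.clip(sigmoid(X @ w), 1e-15, 1 - 1e-15)
loss = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
print(w, loss)   # the loss keeps shrinking as the weights align with the separating direction
```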
Computing $\log(\hat{y})$ directly can cause numerical problems when $\hat{y}$ is very close to 0 or 1. If $\hat{y} = 0$, then $\log(0) = -\infty$, producing undefined or infinite loss values. Several techniques address this.
Clipping (epsilon clamping): The simplest approach is to clip predicted probabilities to the range $[\epsilon, 1 - \epsilon]$ for a small constant like $\epsilon = 10^{-15}$. For example, scikit-learn's log_loss function uses this strategy internally.
Log-sum-exp trick: Instead of computing softmax and then taking the log separately, the log-softmax function can be computed in a numerically stable way using the identity $\log(\text{softmax}(z)_k) = z_k - \log(\sum_j e^{z_j})$, with the log-sum-exp evaluated using the max-subtraction trick.
Combined loss functions: Deep learning frameworks provide fused operations such as PyTorch's BCEWithLogitsLoss (which combines a sigmoid layer and binary cross-entropy) and CrossEntropyLoss (which combines log-softmax and negative log-likelihood). These fused operations are more numerically stable than computing the steps separately.
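The effect of the max-subtraction trick is easy to demonstrate: with extreme logits, the naive computation overflows while the shifted version stays finite (the logits are chosen deliberately to force overflow):

```python
import numpy as np

def log_softmax(z):
    """Numerically stable log-softmax using the max-subtraction trick."""
    z = z - z.max()                        # shift so the largest logit is 0
    return z - np.log(np.sum(np.exp(z)))   # z_k - log(sum_j exp(z_j))

z = np.array([1000.0, 0.0, -1000.0])       # extreme logits

# Naive computation: exp(1000) overflows to inf, so the result is nan / -inf.
naive = np.log(np.exp(z) / np.sum(np.exp(z)))

print(naive)           # [nan, -inf, -inf], with overflow warnings
print(log_softmax(z))  # [0.0, -1000.0, -2000.0], all finite
```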
Label smoothing is a regularization technique that replaces hard one-hot labels with softened versions:
$$y_{\text{smooth}} = y \cdot (1 - \alpha) + \frac{\alpha}{K}$$
where $\alpha$ is a small smoothing parameter (commonly 0.1) and $K$ is the number of classes. Label smoothing prevents the model from becoming overconfident and can improve generalization. It was popularized by Szegedy et al. (2016) in their work on the Inception image-classification architecture.
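Applied to one-hot labels with $\alpha = 0.1$ and $K = 3$, the transformation looks like this (the helper name is illustrative):

```python
import numpy as np

def smooth_labels(y_onehot, alpha=0.1):
    """Mix one-hot labels with the uniform distribution over K classes."""
    K = y_onehot.shape[-1]
    return y_onehot * (1 - alpha) + alpha / K

y = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
print(smooth_labels(y))
# [[0.9333, 0.0333, 0.0333],
#  [0.0333, 0.9333, 0.0333]]  -- each row still sums to 1
```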
When training data has a skewed class distribution, standard log loss can lead to a model that predicts the majority class too often. Common mitigation strategies include the following.
Weighted cross-entropy: Assign different weights $w_k$ to each class in the loss function: $\mathcal{L} = -\frac{1}{N} \sum_i \sum_k w_k \, y_{i,k} \log(\hat{y}_{i,k})$. The weights are typically set inversely proportional to class frequency.
Focal loss: Introduced by Lin et al. (2017) for object detection, focal loss adds a modulating factor $(1 - \hat{y}_t)^\gamma$ to the standard cross-entropy, down-weighting easy examples and focusing learning on hard, misclassified examples:
$$\text{FL}(\hat{y}_t) = -(1 - \hat{y}_t)^\gamma \log(\hat{y}_t)$$
where $\hat{y}_t$ is the predicted probability for the true class and $\gamma \geq 0$ is a focusing parameter. When $\gamma = 0$, focal loss reduces to standard cross-entropy.
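A direct transcription of the formula (with $\gamma = 2$, a commonly used value) shows how easy examples are down-weighted relative to standard cross-entropy:

```python
import numpy as np

def focal_loss(p_t, gamma=2.0):
    """Focal loss given the predicted probability of the true class."""
    p_t = np.asarray(p_t, dtype=float)
    return -((1 - p_t) ** gamma) * np.log(p_t)

# An easy example (p_t = 0.9) is down-weighted far more than a hard one (p_t = 0.1).
print(focal_loss(0.9))   # ~0.001  vs. cross-entropy ~0.105
print(focal_loss(0.1))   # ~1.87   vs. cross-entropy ~2.303
```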
Oversampling and undersampling: Adjusting the training data distribution through oversampling the minority class or downsampling the majority class.
Unlike accuracy, which ranges from 0 to 1, log loss has no fixed upper bound, which can make interpretation less intuitive. Some useful reference points:
| Scenario | Log loss value |
|---|---|
| Perfect predictions (all probabilities = 1.0 for true class) | 0.0 |
| Random guessing for binary classification (always predict 0.5) | $-\log(0.5) \approx 0.6931$ |
| Random guessing for 10-class classification (always predict 0.1) | $-\log(0.1) \approx 2.3026$ |
| Baseline with class prior (predict class frequency) | Entropy of the class distribution |
A model is performing better than random guessing if its log loss is below $\log(K)$, where $K$ is the number of classes. Comparing a model's log loss to the entropy of the class distribution gives a sense of how much the model has learned beyond the base rates.
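These baselines are straightforward to compute for a concrete (hypothetical) class distribution:

```python
import numpy as np

# Hypothetical label counts for a 3-class problem
counts = np.array([700, 200, 100])
priors = counts / counts.sum()            # [0.7, 0.2, 0.1]

# A model that always predicts the class priors achieves a log loss equal
# to the entropy of the class distribution.
baseline = -np.sum(priors * np.log(priors))
print(baseline)              # ~0.80 nats

# Uniform random guessing over K classes gives log(K).
print(np.log(len(counts)))   # ~1.10 nats
```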
Log loss is the native loss function for logistic regression. The logistic regression model directly parameterizes class probabilities using the sigmoid function, and the model parameters are fit by minimizing the average log loss (equivalently, maximizing the likelihood).
Virtually all neural networks trained for classification use cross-entropy loss. For binary classification, the final layer typically uses a sigmoid activation paired with binary cross-entropy. For multiclass classification, a softmax output layer is paired with categorical cross-entropy. The clean gradient properties make cross-entropy especially effective for training deep networks via backpropagation.
Gradient boosting methods such as XGBoost, LightGBM, and CatBoost use log loss as their default objective function for classification tasks. Each boosting iteration fits a new tree to the negative gradient of the log loss.
Log loss is one of the most popular evaluation metrics on Kaggle and in machine learning benchmarks. It is preferred over accuracy because it evaluates the quality of probability estimates, not just whether the top prediction is correct. Competitions that use log loss reward participants for producing well-calibrated, nuanced probability estimates rather than just getting the hard classification right.
Because log loss is a strictly proper scoring rule, it is used alongside calibration plots to assess whether a model's predicted probabilities are reliable. A model with low log loss but poor calibration (as revealed by a reliability diagram) may benefit from post-hoc calibration techniques such as Platt scaling or isotonic regression.
Log loss is available in all popular machine learning frameworks.
| Library | Function / class | Notes |
|---|---|---|
| scikit-learn | sklearn.metrics.log_loss | Evaluation metric; clips probabilities with $\epsilon = 10^{-15}$ |
| PyTorch | torch.nn.CrossEntropyLoss | Combines log-softmax and NLL loss; accepts raw logits |
| PyTorch | torch.nn.BCEWithLogitsLoss | Combines sigmoid and binary cross-entropy; numerically stable |
| TensorFlow / Keras | tf.keras.losses.BinaryCrossentropy | Supports label smoothing and from-logits mode |
| TensorFlow / Keras | tf.keras.losses.CategoricalCrossentropy | Multiclass version; supports label smoothing |
| XGBoost | binary:logistic, multi:softprob | Default objectives for classification |
| LightGBM | binary, multiclass | Optimizes log loss by default for classification |
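As a brief usage sketch (the values are arbitrary), the scikit-learn metric and the PyTorch training loss are invoked as follows:

```python
import torch
from sklearn.metrics import log_loss

# scikit-learn: evaluate predicted probabilities against true binary labels
y_true = [1, 0, 1, 1]
y_prob = [0.8, 0.3, 0.6, 0.9]
print(log_loss(y_true, y_prob))   # average binary log loss, ~0.30

# PyTorch: CrossEntropyLoss expects raw logits and integer class indices
logits = torch.tensor([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])
targets = torch.tensor([0, 2])
print(torch.nn.CrossEntropyLoss()(logits, targets))  # applies log-softmax internally
```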
The logarithmic scoring rule was proposed by I.J. Good in 1952 as a method for evaluating the quality of probabilistic predictions. Good connected it to the concept of "weight of evidence" in Bayesian inference.
The theoretical foundations were formalized further in the 1970s when proper scoring rules were developed into a principled framework for assessing probabilistic forecasts, with applications in meteorology, economics, and psychology. The Brier score (proposed by Glenn Brier in 1950) predates the logarithmic scoring rule by two years and is the other major proper scoring rule used in practice.
In information theory, the connection between cross-entropy and coding efficiency was established through Claude Shannon's foundational 1948 paper "A Mathematical Theory of Communication." The relationship between maximum likelihood estimation and cross-entropy minimization was recognized as a unifying bridge between statistics and information theory.
In machine learning, log loss became the standard training objective for logistic regression and later for neural network classifiers. The term "log loss" gained widespread popularity through its adoption as a competition metric on platforms like Kaggle. Variants such as focal loss (Lin et al., 2017) have extended the basic cross-entropy framework to handle specific challenges like class imbalance in object detection.