# Log Loss

> Source: https://aiwiki.ai/wiki/log_loss
> Updated: 2026-06-24
> Categories: Machine Learning, Mathematics, Training & Optimization
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**Log loss** is the negative log-likelihood of the predicted probabilities and the standard loss function for probabilistic [classification](/wiki/classification): for binary labels it is computed as $-\frac{1}{N}\sum_{i=1}^{N}\bigl[y_i\log\hat{y}_i + (1-y_i)\log(1-\hat{y}_i)\bigr]$, where $\hat{y}_i$ is the model's predicted probability that example $i$ is positive. A perfect prediction scores 0, and the loss grows without bound (toward infinity) as the probability assigned to the correct class approaches 0, so log loss heavily penalizes predictions that are confident and wrong [1][13]. Log loss is also called **logarithmic loss**, **logistic loss**, or **binary [cross-entropy loss](/wiki/cross_entropy_loss)**, and it is mathematically identical to the [cross-entropy](/wiki/cross_entropy_loss) between the true label distribution and the model's predicted distribution.

Log loss is the native training objective for [logistic regression](/wiki/logistic_regression), [neural networks](/wiki/neural_network) performing classification, and most other probabilistic classifiers. The scikit-learn documentation defines it as "the loss function used in (multinomial) logistic regression and extensions of it such as neural networks, defined as the negative log-likelihood of a logistic model that returns `y_proba` probabilities for its training data `y_true`" [13]. Minimizing log loss is equivalent to [maximum likelihood estimation](/wiki/maximum_likelihood_estimation) under a Bernoulli or categorical model, and log loss is a strictly proper scoring rule, meaning it is uniquely minimized when the predicted probabilities equal the true conditional probabilities of the classes [1][4]. It is widely used both as a training objective and as an evaluation metric in [machine learning](/wiki/machine_learning) competitions and benchmarks.

## What is log loss in simple terms (ELI5)

Imagine you are playing a guessing game. Your friend is thinking of an animal, and you have to say how sure you are about your guess. If you say "I'm 90% sure it's a cat" and it really is a cat, you get a good score. But if you say "I'm 90% sure it's a cat" and it turns out to be a dog, you get a very bad score because you were very confident and very wrong.

Log loss works the same way. It rewards you for being confident and correct, and it punishes you harshly for being confident and incorrect. The best strategy is to say probabilities that honestly reflect how likely each answer is. If you are not sure, it is better to say "50-50" than to guess wrong with high confidence.

## How is log loss calculated?

### Binary log loss

For a single observation with true binary label $y \in \{0, 1\}$ and predicted probability $\hat{y} = P(y=1)$, the log loss is:

$$L(y, \hat{y}) = -\bigl[y \log(\hat{y}) + (1 - y) \log(1 - \hat{y})\bigr]$$

For a dataset of $N$ observations, the average log loss is:

$$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \bigl[y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i)\bigr]$$

When $y = 1$, only the $-\log(\hat{y})$ term is active. When $y = 0$, only the $-\log(1 - \hat{y})$ term is active. In both cases, the loss equals the negative logarithm of the probability assigned to the correct class. This is exactly the formula scikit-learn uses for a single sample, $L_{\log}(y, p) = -(y \log(p) + (1 - y)\log(1 - p))$ [13].
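The averaged formula can be sketched in a few lines of standard-library Python. This is a minimal illustration, not any library's implementation: the function name `binary_log_loss` and the sample labels and probabilities are made up for the example, the natural logarithm is assumed, and probabilities of exactly 0 or 1 are not handled.

```python
import math

def binary_log_loss(y_true, y_prob):
    """Average binary log loss (natural log) over paired labels and probabilities."""
    if len(y_true) != len(y_prob):
        raise ValueError("y_true and y_prob must have the same length")
    total = 0.0
    for y, p in zip(y_true, y_prob):
        # Only the term for the observed class is active.
        total += -math.log(p) if y == 1 else -math.log(1.0 - p)
    return total / len(y_true)

labels = [1, 0, 1, 1]          # illustrative data
probs = [0.9, 0.2, 0.6, 0.3]   # predicted P(y = 1)
loss = binary_log_loss(labels, probs)
print(round(loss, 4))  # 0.5108
```

The last example dominates the average: assigning 0.3 to a true positive contributes about 1.20, more than the other three terms combined.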

### Multiclass log loss (categorical cross-entropy)

For multiclass classification with $K$ classes, the log loss generalizes to:

$$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} y_{i,k} \log(\hat{y}_{i,k})$$

where $y_{i,k}$ is 1 if observation $i$ belongs to class $k$ and 0 otherwise (one-hot encoding), and $\hat{y}_{i,k}$ is the predicted probability for class $k$. In practice, the predicted probabilities are typically produced by a [softmax](/wiki/softmax) function applied to the model's raw output logits.
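Because the labels are one-hot, the inner sum over $k$ keeps only the log-probability of the true class. The sketch below illustrates this with a hand-written softmax; the helper names and the example logits are hypothetical, and labels are passed as class indices rather than one-hot vectors for brevity.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities, subtracting the max for stability."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def multiclass_log_loss(class_indices, prob_rows):
    """Average of -log(probability assigned to the true class)."""
    losses = [-math.log(row[k]) for k, row in zip(class_indices, prob_rows)]
    return sum(losses) / len(losses)

logit_rows = [[2.0, 1.0, 0.1], [0.0, 0.0, 0.0]]  # illustrative logits, K = 3
true_classes = [0, 2]
prob_rows = [softmax(row) for row in logit_rows]
loss = multiclass_log_loss(true_classes, prob_rows)
print(round(loss, 4))
```

The second row has equal logits, so its softmax is uniform and its per-sample loss is $\log(3)$.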

### Behavior and properties of the loss

The negative logarithm has several properties that make it well-suited as a loss function:

| Property | Description |
|---|---|
| Range | Log loss is non-negative. It equals 0 only when the model assigns probability 1.0 to the correct class. It approaches infinity as the predicted probability for the correct class approaches 0. |
| Asymmetry of penalties | A prediction of 0.01 for a true positive incurs a loss of $-\log(0.01) \approx 4.61$, while a prediction of 0.99 for a true positive incurs only $-\log(0.99) \approx 0.01$. Confident wrong predictions are punished far more severely than uncertain ones. |
| Differentiability | Log loss is smooth and differentiable everywhere in $(0,1)$, making it compatible with [gradient descent](/wiki/stochastic_gradient_descent_sgd) optimization. |
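| Convexity | When used with logistic regression (where $\hat{y} = \sigma(w^T x)$), the log loss is a convex function of the model parameters, so every local minimum is a global minimum. The minimizer is not guaranteed to be unique or finite: on linearly separable data, for example, the loss approaches 0 only as the weights grow without bound. |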

The following table illustrates how log loss changes with predicted probability for a true positive ($y = 1$):

| Predicted probability ($\hat{y}$) | Log loss ($-\log(\hat{y})$) | Interpretation |
|---|---|---|
| 0.99 | 0.0101 | Highly confident, correct |
| 0.90 | 0.1054 | Confident, correct |
| 0.70 | 0.3567 | Moderately confident, correct |
| 0.50 | 0.6931 | Maximum uncertainty (coin flip) |
| 0.30 | 1.2040 | Moderately confident, wrong |
| 0.10 | 2.3026 | Confident, wrong |
| 0.01 | 4.6052 | Highly confident, wrong |

## How does log loss relate to cross-entropy and information theory?

Log loss is directly connected to several foundational concepts in [information theory](/wiki/information_theory), and the average log loss over a dataset is exactly the [cross-entropy](/wiki/cross_entropy_loss) between the empirical label distribution and the model's predicted distribution.

### Shannon entropy and cross-entropy

The **Shannon entropy** of a discrete probability distribution $p$ is defined as:

$$H(p) = -\sum_{x} p(x) \log p(x)$$

It measures the average amount of information (in bits or nats, depending on the logarithm base) needed to encode outcomes drawn from $p$. The concept and the link between code length and probability were established in Claude Shannon's 1948 paper "A Mathematical Theory of Communication" [2]. The **cross-entropy** between a true distribution $p$ and a model distribution $q$ is:

$$H(p, q) = -\sum_{x} p(x) \log q(x)$$

Cross-entropy measures the average number of bits needed to encode outcomes from $p$ when using a code optimized for $q$. In the context of classification, $p$ is the empirical label distribution (one-hot vectors) and $q$ is the model's predicted distribution. The average log loss over a dataset is exactly the cross-entropy between these two distributions [7][12].

### KL divergence decomposition

Cross-entropy decomposes into Shannon entropy plus the Kullback-Leibler (KL) divergence [11]:

$$H(p, q) = H(p) + D_{\text{KL}}(p \| q)$$

Since $H(p)$ is a constant determined by the true data distribution (and does not depend on the model), minimizing the cross-entropy loss is equivalent to minimizing the [KL divergence](/wiki/kl_divergence) $D_{\text{KL}}(p \| q)$. The KL divergence measures the "extra" bits of information needed when using $q$ instead of $p$, and it equals zero if and only if $p = q$ [12]. This means that training a model by minimizing log loss drives the model's predicted distribution toward the true conditional distribution of the labels.
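The decomposition can be checked numerically on a small example. The two distributions below are arbitrary illustrative values, and the helper names are not taken from any library.

```python
import math

def entropy(p):
    """Shannon entropy H(p) in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Cross-entropy H(p, q) in nats."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """KL divergence D_KL(p || q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]  # "true" distribution (illustrative)
q = [0.5, 0.3, 0.2]  # model distribution (illustrative)

gap = cross_entropy(p, q) - (entropy(p) + kl_divergence(p, q))
print(abs(gap) < 1e-12)        # the two sides agree up to rounding
print(kl_divergence(p, p))     # 0.0 when the model matches p exactly
```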

### Information-theoretic interpretation

From the perspective of coding theory, a model that minimizes log loss is finding the most efficient code for the observed labels. The Kraft-McMillan theorem establishes that any uniquely decodable code corresponds to an implicit probability distribution, and the expected code length under that distribution equals the cross-entropy [12]. A model with lower log loss assigns shorter codes (higher probabilities) to the events that actually occur.

## Why is log loss the same as maximum likelihood?

Log loss is the negative log-likelihood of the data under a Bernoulli (binary) or categorical (multiclass) probabilistic model. This connection is one of the most important results linking statistical estimation to machine learning optimization. As Goodfellow, Bengio, and Courville note in *Deep Learning*, most modern neural networks are trained using maximum likelihood, so the cost function is the negative log-likelihood, equivalently described as the cross-entropy between the training data and the model distribution [7].

### Derivation for binary classification

Consider a binary classification model that predicts $\hat{y}_i = P(y_i = 1 \mid x_i)$ for each observation $x_i$. If the labels are independent Bernoulli random variables, the likelihood of the observed data is:

$$L(\theta) = \prod_{i=1}^{N} \hat{y}_i^{y_i} (1 - \hat{y}_i)^{1 - y_i}$$

Taking the logarithm:

$$\log L(\theta) = \sum_{i=1}^{N} \bigl[y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i)\bigr]$$

Maximizing this log-likelihood is equivalent to minimizing its negation, which is exactly the sum of the per-sample log losses. Therefore, minimizing log loss is identical to maximum likelihood estimation [5][6].

### Derivation for multiclass classification

For multiclass classification with a [categorical distribution](/wiki/multi-class_classification), the same reasoning applies. The likelihood is:

$$L(\theta) = \prod_{i=1}^{N} \prod_{k=1}^{K} \hat{y}_{i,k}^{y_{i,k}}$$

The negative log-likelihood reduces to the multiclass cross-entropy loss. This is the theoretical justification for using cross-entropy as the training objective in neural networks with softmax output layers [7].
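Written out, taking the negative logarithm of the likelihood above and dividing by $N$ turns the double product into the double sum from the multiclass definition:

$$-\frac{1}{N}\log L(\theta) = -\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K} y_{i,k}\log(\hat{y}_{i,k}) = \mathcal{L}$$

The factor $\frac{1}{N}$ does not change the minimizer, so the parameters that minimize the average log loss are the maximum likelihood estimates.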

## Why is log loss a proper scoring rule?

A **scoring rule** assigns a numerical score to a probabilistic forecast based on the observed outcome. A scoring rule is **proper** if the expected score is optimized when the forecaster reports their true believed probabilities. It is **strictly proper** if that optimum is unique, meaning any report other than the true probabilities yields a strictly worse expected score [4].

Log loss (also known as the **logarithmic scoring rule**) was introduced by I.J. Good in 1952 and is one of the most well-known strictly proper scoring rules [1]. Good's paper "Rational Decisions" was based on a lecture delivered to the Royal Statistical Society in 1951, and Good connected the rule to the "weight of evidence" (the log Bayes factor) in Bayesian inference [1]. Its strict properness has several practical implications.

First, it encourages honest probability estimation. A model trained with log loss cannot improve its expected loss by systematically over- or under-estimating probabilities. Second, it promotes [calibration](/wiki/calibration). A well-calibrated model is one where, among all predictions assigned probability $p$, approximately a fraction $p$ of them are actually positive. Because log loss is strictly proper, models optimized under it tend to produce well-calibrated probabilities on the data they are fit to, although a model that overfits can still be miscalibrated on held-out data. Third, it is a local scoring rule. The log loss depends only on the probability assigned to the outcome that actually occurred, not on the probabilities assigned to other outcomes. Gneiting and Raftery (2007) report that, under regularity conditions, the logarithmic score is the only proper scoring rule that is local in this sense [4].
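Strict properness can be illustrated numerically in the binary case. The sketch below assumes an event whose true probability is 0.3 (an arbitrary value) and searches a grid of candidate reports for the one with the lowest expected log loss; the function name is hypothetical.

```python
import math

def expected_log_loss(p_true, q):
    """Expected log loss when the event has probability p_true and q is reported."""
    return -(p_true * math.log(q) + (1.0 - p_true) * math.log(1.0 - q))

p_true = 0.3
grid = [i / 100 for i in range(1, 100)]  # candidate reports 0.01 .. 0.99
best_q = min(grid, key=lambda q: expected_log_loss(p_true, q))
print(best_q)  # 0.3
```

Reporting anything other than 0.3, whether hedged toward 0.5 or pushed toward 0, gives a strictly higher expected loss on this grid.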

### Comparison with other scoring rules

| Scoring rule | Formula (binary case) | Strictly proper? | Key characteristics |
|---|---|---|---|
| Log loss (logarithmic) | $-[y\log\hat{y} + (1-y)\log(1-\hat{y})]$ | Yes | Local; penalizes confident errors severely; unbounded |
| Brier score (quadratic) | $(y - \hat{y})^2$ | Yes | Bounded [0, 1]; gentler penalty for confident wrong predictions |
| Spherical score | $\hat{y}_k / \|\hat{y}\|$ | Yes | Normalized; used in some meteorological forecasting applications |
| 0-1 loss ([accuracy](/wiki/accuracy)) | $\mathbb{1}[\hat{y}_{\text{class}} \neq y]$ | No | Threshold-based; ignores probability quality; not proper |
| [Hinge loss](/wiki/hinge_loss) | $\max(0, 1 - y \cdot f(x))$ | No | Used in [SVMs](/wiki/hinge_loss); margin-based; does not yield probabilities |

The Brier score, proposed by Glenn Brier in 1950, predates Good's logarithmic rule by two years and is the other major proper scoring rule used in practice [3]. A key practical distinction is that log loss penalizes confident wrong predictions much more severely than the Brier score does. For example, predicting 0.001 for a true positive produces a log loss of about 6.9 but a Brier score of just under 1.0 (0.998), which is close to the Brier score's maximum. This heavy penalization means that a small number of confidently wrong predictions can dominate an average log loss.
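The difference in how the two rules treat confident errors can be tabulated directly. This is a small sketch for a true positive ($y = 1$); the function names are illustrative.

```python
import math

def log_score(y, p):
    """Log loss for one binary prediction p = P(y = 1)."""
    return -math.log(p) if y == 1 else -math.log(1.0 - p)

def brier_score(y, p):
    """Squared error between the label and the predicted probability."""
    return (y - p) ** 2

# Increasingly confident wrong predictions for a true positive.
for p in (0.5, 0.1, 0.001):
    print(p, round(log_score(1, p), 4), round(brier_score(1, p), 4))
```

The Brier score saturates near 1 as the prediction approaches 0, while the log loss keeps growing.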

## Gradient and optimization

### Gradient for logistic regression

In [logistic regression](/wiki/logistic_regression), the predicted probability is $\hat{y} = \sigma(w^T x) = 1 / (1 + e^{-w^T x})$, where $\sigma$ is the [sigmoid function](/wiki/sigmoid_function). The gradient of the log loss with respect to the weight vector $w$ is:

$$\nabla_w \mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i) x_i = \frac{1}{N} X^T (\hat{y} - y)$$

This clean gradient formula, where the gradient is proportional to the prediction error $(\hat{y}_i - y_i)$, is a consequence of the sigmoid function being the canonical link function for the Bernoulli distribution [6]. It is analogous to the gradient of [mean squared error](/wiki/mean_squared_error_mse) in [linear regression](/wiki/linear_regression).
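One way to build confidence in the gradient formula is to compare it against finite differences. The following is a sketch on a tiny made-up dataset, with hypothetical helper names; it is not how any particular library computes gradients.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))

def mean_log_loss(w, X, y):
    """Average binary log loss of a logistic model with weights w."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(dot(w, xi))
        total -= yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total / len(X)

def analytic_gradient(w, X, y):
    """(1/N) * sum_i (y_hat_i - y_i) * x_i, as in the formula above."""
    grad = [0.0] * len(w)
    for xi, yi in zip(X, y):
        error = sigmoid(dot(w, xi)) - yi
        for j, xij in enumerate(xi):
            grad[j] += error * xij / len(X)
    return grad

def numeric_gradient(w, X, y, h=1e-6):
    """Central finite differences, used only as an independent check."""
    grad = []
    for j in range(len(w)):
        up, down = list(w), list(w)
        up[j] += h
        down[j] -= h
        grad.append((mean_log_loss(up, X, y) - mean_log_loss(down, X, y)) / (2 * h))
    return grad

X = [[1.0, 0.5], [1.0, -1.5], [1.0, 2.0], [1.0, 0.0]]  # first column is a bias term
y = [1, 0, 1, 0]
w = [0.1, -0.3]
analytic = analytic_gradient(w, X, y)
numeric = numeric_gradient(w, X, y)
max_gap = max(abs(a - n) for a, n in zip(analytic, numeric))
print(max_gap)  # should be tiny
```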

### Gradient for softmax cross-entropy

In multiclass classification with a softmax output layer, the gradient of the cross-entropy loss with respect to the logit $z_k$ for class $k$ is:

$$\frac{\partial \mathcal{L}}{\partial z_k} = \hat{y}_k - y_k$$

where $\hat{y}_k = \text{softmax}(z)_k$. This simple form (predicted probability minus true label) is one reason why cross-entropy combined with softmax is the standard choice for training classification neural networks [7]. The gradient does not saturate (shrink toward zero) while the model is making large errors, unlike pairings such as squared error on a sigmoid or softmax output, where the activation's derivative can make the gradient vanish even for badly wrong predictions.
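The softmax gradient can be verified the same way for a single example. The logits and true class below are arbitrary, and the helper names are illustrative.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy_from_logits(z, true_class):
    """Per-example loss -log(softmax(z)[true_class])."""
    return -math.log(softmax(z)[true_class])

z = [1.5, -0.3, 0.8]
true_class = 2

# Analytic gradient: predicted probability minus one-hot label.
probs = softmax(z)
analytic = [p - (1.0 if k == true_class else 0.0) for k, p in enumerate(probs)]

# Central finite differences as an independent check.
h = 1e-6
numeric = []
for k in range(len(z)):
    up, down = list(z), list(z)
    up[k] += h
    down[k] -= h
    diff = cross_entropy_from_logits(up, true_class) - cross_entropy_from_logits(down, true_class)
    numeric.append(diff / (2 * h))

max_gap = max(abs(a - n) for a, n in zip(analytic, numeric))
print(max_gap)  # should be tiny
```

The components of the gradient sum to zero, because both the softmax output and the one-hot label sum to 1.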

### Optimization methods

Because log loss is differentiable and (for linear models) convex, it can be minimized using standard gradient-based optimization methods:

- **[Gradient descent](/wiki/stochastic_gradient_descent_sgd):** Standard batch gradient descent computes the full gradient over all training examples.
- **Stochastic gradient descent (SGD):** Updates parameters using the gradient from a single randomly sampled example or a small mini-batch.
- **Adam and other adaptive methods:** Adaptive learning rate methods like Adam, AdaGrad, and RMSProp are commonly used for training deep neural networks with cross-entropy loss.
- **Newton's method and L-BFGS:** For logistic regression specifically, second-order methods can exploit the convexity to converge faster.
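As a rough sketch of the first method in the list, the loop below runs batch gradient descent on a one-feature logistic regression. The data, learning rate, and step count are arbitrary illustrative choices, and the labels are deliberately not linearly separable so that a finite minimizer exists.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    return sigmoid(w * x + b)

def mean_log_loss(w, b, xs, ys):
    total = 0.0
    for x, y in zip(xs, ys):
        p = predict(w, b, x)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(xs)

def fit(xs, ys, learning_rate=0.5, steps=200):
    """Batch gradient descent using the (y_hat - y) * x gradient."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [predict(w, b, x) - y for x, y in zip(xs, ys)]
        w -= learning_rate * sum(e * x for e, x in zip(errors, xs)) / n
        b -= learning_rate * sum(errors) / n
    return w, b

xs = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [0, 0, 1, 0, 1, 1]

initial_loss = mean_log_loss(0.0, 0.0, xs, ys)  # log(2) at zero weights
w, b = fit(xs, ys)
final_loss = mean_log_loss(w, b, xs, ys)
print(round(initial_loss, 4), round(final_loss, 4))
```

Stochastic and adaptive variants change how the gradient is sampled and scaled, but use the same per-example error term.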

## Practical considerations

### Numerical stability

Computing $\log(\hat{y})$ directly can cause numerical problems when $\hat{y}$ is very close to 0 or 1. If $\hat{y} = 0$, then $\log(0) = -\infty$, producing undefined or infinite loss values. Several techniques address this.

**Clipping (epsilon clamping):** The simplest approach is to clip predicted probabilities to the range $[\epsilon, 1 - \epsilon]$ for a small constant $\epsilon$. scikit-learn's `log_loss` applies this internally: as the current documentation states, "`y_proba` values are clipped to `[eps, 1-eps]` where `eps` is the machine precision for `y_proba`'s dtype" [13]. Earlier versions exposed a user-settable `eps` parameter that defaulted to $10^{-15}$; this default was changed to `"auto"` in scikit-learn 1.2, deprecated in 1.3, and removed in 1.5, so clipping is now performed automatically at the data type's machine precision [13].
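A minimal sketch of clipping, written independently of scikit-learn: the function name is hypothetical, and using the float64 machine epsilon from `sys.float_info` as the default `eps` is an assumption chosen to mirror the behavior described above.

```python
import math
import sys

def clipped_binary_log_loss(y_true, y_prob, eps=sys.float_info.epsilon):
    """Binary log loss with probabilities clamped to [eps, 1 - eps]."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # keeps both logarithms finite
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)

# Two maximally wrong predictions that would otherwise give log(0).
worst_case = clipped_binary_log_loss([1, 0], [0.0, 1.0])
print(round(worst_case, 2))  # about 36.04, i.e. -log(eps)
```

Clipping bounds the per-sample loss at $-\log(\epsilon)$, so the choice of $\epsilon$ determines how much a single fully wrong prediction can contribute.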

**Log-sum-exp trick:** Instead of computing softmax and then taking the log separately, the log-softmax function can be computed in a numerically stable way using the identity $\log(\text{softmax}(z)_k) = z_k - \log(\sum_j e^{z_j})$, with the log-sum-exp evaluated using the max-subtraction trick.
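The identity can be sketched as follows, with the maximum logit subtracted before exponentiating. The function name and the large example logits are illustrative.

```python
import math

def log_softmax(logits):
    """Stable log-softmax: z_k - logsumexp(z), with max subtraction."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum_exp for z in logits]

big = [1000.0, 1001.0, 1002.0]

# A naive exp of these logits overflows in double precision.
try:
    math.exp(big[0])
    overflowed = False
except OverflowError:
    overflowed = True

stable = log_softmax(big)
print(overflowed, [round(v, 4) for v in stable])
```

Shifting every logit by the same constant leaves the softmax unchanged, so the result matches the log-softmax of `[0.0, 1.0, 2.0]`.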

**Combined loss functions:** Deep learning frameworks provide fused operations such as PyTorch's `BCEWithLogitsLoss` (which combines a sigmoid layer and binary cross-entropy) and `CrossEntropyLoss` (which combines log-softmax and negative log-likelihood). PyTorch's documentation notes that `CrossEntropyLoss` combines `LogSoftmax` and `NLLLoss` in one operation and should be fed raw logits, which is more numerically stable than computing the steps separately [15].

### Label smoothing

Label smoothing is a [regularization](/wiki/regularization) technique that replaces hard one-hot labels with softened versions:

$$y_{\text{smooth}} = y \cdot (1 - \alpha) + \frac{\alpha}{K}$$

where $\alpha$ is a small smoothing parameter (commonly 0.1) and $K$ is the number of classes. Label smoothing prevents the model from becoming overconfident and can improve [generalization](/wiki/generalization). It was popularized by Szegedy et al. (2016) in the Inception-v3 architecture, which combined label smoothing with factorized convolutions, RMSProp, and batch-normalized auxiliary classifiers [8].
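The smoothing formula applied to a single one-hot vector looks like this; the function name is illustrative and $\alpha = 0.1$ is the common value mentioned above.

```python
def smooth_labels(one_hot, alpha=0.1):
    """Apply y * (1 - alpha) + alpha / K to each entry of a one-hot vector."""
    k = len(one_hot)
    return [y * (1.0 - alpha) + alpha / k for y in one_hot]

smoothed = smooth_labels([0, 0, 1, 0], alpha=0.1)
print([round(v, 3) for v in smoothed])  # [0.025, 0.025, 0.925, 0.025]
```

The smoothed vector still sums to 1, so it remains a valid target distribution for the cross-entropy loss.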

### Class imbalance

When training data has a skewed class distribution, standard log loss can lead to a model that predicts the majority class too often. Common mitigation strategies include the following.

**Weighted cross-entropy:** Assign different weights $w_k$ to each class in the loss function: $\mathcal{L} = -\frac{1}{N} \sum_i \sum_k w_k \, y_{i,k} \log(\hat{y}_{i,k})$. The weights are typically set inversely proportional to class frequency.
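A sketch of the weighted loss is shown below. The weighting rule $N / (K \cdot n_k)$, where $n_k$ is the count of class $k$, is one common inverse-frequency convention and is an assumption here; the function names are hypothetical, and every class is assumed to appear at least once.

```python
import math

def inverse_frequency_weights(class_indices, num_classes):
    """Weights N / (K * n_k); assumes each class occurs at least once."""
    n = len(class_indices)
    counts = [class_indices.count(k) for k in range(num_classes)]
    return [n / (num_classes * c) for c in counts]

def weighted_log_loss(class_indices, prob_rows, class_weights):
    """Average of -w_k * log(probability of the true class k)."""
    total = 0.0
    for k, row in zip(class_indices, prob_rows):
        total -= class_weights[k] * math.log(row[k])
    return total / len(class_indices)

labels = [0, 0, 0, 1]  # class 1 is the minority
probs = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.6, 0.4]]
weights = inverse_frequency_weights(labels, num_classes=2)
print([round(w, 3) for w in weights])  # [0.667, 2.0]
print(round(weighted_log_loss(labels, probs, weights), 4))
```

With these weights, the single minority example counts three times as much as each majority example.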

**[Focal loss](/wiki/focal_loss):** Introduced by Lin et al. (2017) for object detection, focal loss adds a modulating factor $(1 - \hat{y}_t)^\gamma$ to the standard cross-entropy, down-weighting easy examples and focusing learning on hard, misclassified examples:

$$\text{FL}(\hat{y}_t) = -(1 - \hat{y}_t)^\gamma \log(\hat{y}_t)$$

where $\hat{y}_t$ is the predicted probability for the true class and $\gamma \geq 0$ is a focusing parameter. When $\gamma = 0$, focal loss reduces to standard cross-entropy. The authors designed focal loss for the RetinaNet detector, where roughly 100,000 easy background examples can overwhelm a handful of objects per image, and reported that $\gamma = 2$ worked best in their experiments [9].
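The effect of the modulating factor can be seen by evaluating the formula for an easy and a hard example. This sketch implements only the per-example expression above (the function name is illustrative, and the paper's class-balancing weight is omitted).

```python
import math

def focal_loss(p_true_class, gamma=2.0):
    """FL = -(1 - p_t)^gamma * log(p_t) for one example."""
    return -((1.0 - p_true_class) ** gamma) * math.log(p_true_class)

for p_t in (0.9, 0.1):  # an easy example and a hard one
    ce = focal_loss(p_t, gamma=0.0)  # gamma = 0 recovers cross-entropy
    fl = focal_loss(p_t, gamma=2.0)
    print(p_t, round(ce, 4), round(fl, 4), round(fl / ce, 2))
```

With $\gamma = 2$, the easy example's loss is scaled by about 0.01 while the hard example's loss is scaled by about 0.81, which shifts the training signal toward hard examples.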

**Oversampling and undersampling:** Adjusting the training data distribution through [oversampling](/wiki/oversampling) the minority class or [downsampling](/wiki/downsampling) the majority class.

### Interpreting log loss values

Unlike [accuracy](/wiki/accuracy), which ranges from 0 to 1, log loss has no fixed upper bound, which can make interpretation less intuitive. Some useful reference points:

| Scenario | Log loss value |
|---|---|
| Perfect predictions (all probabilities = 1.0 for true class) | 0.0 |
| Random guessing for binary classification (always predict 0.5) | $-\log(0.5) \approx 0.6931$ |
| Random guessing for 10-class classification (always predict 0.1) | $-\log(0.1) \approx 2.3026$ |
| Baseline with class prior (predict class frequency) | Entropy of the class distribution |

A model outperforms uniform guessing if its log loss is below $\log(K)$, where $K$ is the number of classes. On imbalanced data the stricter baseline is the entropy of the class distribution, and comparing a model's log loss to it gives a sense of how much the model has learned beyond the base rates.
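The two baselines from the table can be computed directly. The helper names and the 90/10 class split are illustrative.

```python
import math

def uniform_baseline(num_classes):
    """Log loss of always predicting 1/K for every class."""
    return math.log(num_classes)

def prior_baseline(class_frequencies):
    """Log loss of always predicting the class frequencies (their entropy)."""
    return -sum(f * math.log(f) for f in class_frequencies if f > 0)

print(round(uniform_baseline(2), 4))          # 0.6931
print(round(uniform_baseline(10), 4))         # 2.3026
print(round(prior_baseline([0.9, 0.1]), 4))   # 0.3251
```

On a 90/10 dataset, a model with a log loss of 0.5 beats uniform guessing but is still worse than simply predicting the base rates.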

## What is log loss used for?

### Logistic regression

Log loss is the native loss function for [logistic regression](/wiki/logistic_regression). The logistic regression model directly parameterizes class probabilities using the sigmoid function, and the model parameters are fit by minimizing the average log loss (equivalently, maximizing the likelihood) [13].

### Neural networks

Virtually all [neural networks](/wiki/neural_network) trained for classification use cross-entropy loss. For binary classification, the final layer typically uses a [sigmoid activation](/wiki/sigmoid_function) paired with binary cross-entropy. For multiclass classification, a [softmax](/wiki/softmax) output layer is paired with categorical cross-entropy. The clean gradient properties make cross-entropy especially effective for training deep networks via [backpropagation](/wiki/backpropagation) [7].

### Gradient-boosted trees

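[Gradient boosting](/wiki/gradient_boosting) methods such as [XGBoost](/wiki/xgboost), [LightGBM](/wiki/lightgbm), and CatBoost use log loss as the default objective in their classifier interfaces. In classic gradient boosting, each iteration fits a new tree to the negative gradient of the log loss with respect to the current raw scores; XGBoost and LightGBM additionally use second-order (Hessian) information when growing each tree.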

### Kaggle competitions and benchmarks

Log loss is one of the most popular evaluation metrics on [Kaggle](https://www.kaggle.com) and in machine learning benchmarks. To compute it, a classifier must assign a probability to each class rather than only output the most likely class [13]. It is preferred over accuracy because it evaluates the quality of probability estimates, not just whether the top prediction is correct: a perfect set of predictions scores 0, while being confident in the wrong class escalates the loss sharply. Competitions that use log loss reward participants for producing well-calibrated, nuanced probability estimates rather than just getting the hard classification right.

### Calibration assessment

Because log loss is a strictly proper scoring rule, it is used alongside calibration plots to assess whether a model's predicted probabilities are reliable. A model with low log loss but poor [calibration](/wiki/calibration) (as revealed by a reliability diagram) may benefit from post-hoc calibration techniques such as Platt scaling or isotonic regression.

## Software implementations

Log loss is available in the major machine learning frameworks.

| Library | Function / class | Notes |
|---|---|---|
| [scikit-learn](/wiki/scikit-learn) | `sklearn.metrics.log_loss` | Evaluation metric; clips probabilities to `[eps, 1-eps]` at the dtype's machine precision (the user-settable `eps` was removed in v1.5) [13] |
| [PyTorch](/wiki/pytorch) | `torch.nn.CrossEntropyLoss` | Combines log-softmax and NLL loss; accepts raw logits [15] |
| PyTorch | `torch.nn.BCEWithLogitsLoss` | Combines sigmoid and binary cross-entropy; numerically stable |
| [TensorFlow](/wiki/tensorflow) / [Keras](/wiki/keras) | `tf.keras.losses.BinaryCrossentropy` | Supports label smoothing and from-logits mode |
| TensorFlow / Keras | `tf.keras.losses.CategoricalCrossentropy` | Multiclass version; supports label smoothing |
| XGBoost | `binary:logistic`, `multi:softprob` | Default objectives for classification |
| LightGBM | `binary`, `multiclass` | Optimizes log loss by default for classification |

## When was log loss introduced?

The logarithmic scoring rule was proposed by I.J. Good in 1952 as a method for evaluating the quality of probabilistic predictions, in a paper based on a 1951 lecture to the Royal Statistical Society [1]. Good connected it to the concept of "weight of evidence" (the log Bayes factor) in Bayesian inference [1].

The theoretical foundations were formalized further in the following decades as proper scoring rules were developed into a principled framework for assessing probabilistic forecasts, with applications in meteorology, economics, and psychology; Gneiting and Raftery (2007) gave the modern unifying treatment [4]. The Brier score (proposed by Glenn Brier in 1950) predates the logarithmic scoring rule by two years and is the other major proper scoring rule used in practice [3].

In information theory, the connection between cross-entropy and coding efficiency was established through Claude Shannon's foundational 1948 paper "A Mathematical Theory of Communication" [2]. The relationship between maximum likelihood estimation and cross-entropy minimization was recognized as a unifying bridge between statistics and information theory [7][12].

In machine learning, log loss became the standard training objective for logistic regression and later for neural network classifiers. The term "log loss" gained widespread popularity through its adoption as a competition metric on platforms like Kaggle. Variants such as focal loss (Lin et al., 2017) have extended the basic cross-entropy framework to handle specific challenges like class imbalance in object detection [9].

## See also

- [Cross-entropy loss](/wiki/cross_entropy_loss)
- [Loss function](/wiki/loss_function)
- [Logistic regression](/wiki/logistic_regression)
- [Maximum likelihood estimation](/wiki/maximum_likelihood_estimation)
- [Softmax](/wiki/softmax)
- [KL divergence](/wiki/kl_divergence)
- [Entropy](/wiki/entropy)
- [Sigmoid function](/wiki/sigmoid_function)
- [Accuracy](/wiki/accuracy)
- [Hinge loss](/wiki/hinge_loss)
- [Focal loss](/wiki/focal_loss)
- [Mean squared error](/wiki/mean_squared_error_mse)
- [Gradient boosting](/wiki/gradient_boosting)

## References

1. Good, I.J. (1952). "Rational Decisions." *Journal of the Royal Statistical Society, Series B*, 14(1), 107-114. https://rss.onlinelibrary.wiley.com/doi/10.1111/j.2517-6161.1952.tb00104.x
2. Shannon, C.E. (1948). "A Mathematical Theory of Communication." *Bell System Technical Journal*, 27(3), 379-423 and 27(4), 623-656.
3. Brier, G.W. (1950). "Verification of forecasts expressed in terms of probability." *Monthly Weather Review*, 78(1), 1-3.
4. Gneiting, T. and Raftery, A.E. (2007). "Strictly Proper Scoring Rules, Prediction, and Estimation." *Journal of the American Statistical Association*, 102(477), 359-378.
5. Bishop, C.M. (2006). *Pattern Recognition and Machine Learning*. Springer. Chapter 4.
6. Murphy, K.P. (2012). *Machine Learning: A Probabilistic Perspective*. MIT Press. Chapters 8 and 10.
7. Goodfellow, I., Bengio, Y., and Courville, A. (2016). *Deep Learning*. MIT Press. Chapter 6. https://www.deeplearningbook.org/contents/mlp.html
8. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. (2016). "Rethinking the Inception Architecture for Computer Vision." *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition*, 2818-2826.
9. Lin, T.Y., Goyal, P., Girshick, R., He, K., and Dollár, P. (2017). "Focal Loss for Dense Object Detection." *Proceedings of the IEEE International Conference on Computer Vision*, 2980-2988. https://openaccess.thecvf.com/content_ICCV_2017/papers/Lin_Focal_Loss_for_ICCV_2017_paper.pdf
10. Hastie, T., Tibshirani, R., and Friedman, J. (2009). *The Elements of Statistical Learning*. 2nd edition. Springer. Chapter 4.
11. Kullback, S. and Leibler, R.A. (1951). "On Information and Sufficiency." *Annals of Mathematical Statistics*, 22(1), 79-86.
12. Cover, T.M. and Thomas, J.A. (2006). *Elements of Information Theory*. 2nd edition. Wiley.
13. scikit-learn developers. "sklearn.metrics.log_loss." scikit-learn documentation. https://scikit-learn.org/stable/modules/generated/sklearn.metrics.log_loss.html
14. Pedregosa, F. et al. (2011). "Scikit-learn: Machine Learning in Python." *Journal of Machine Learning Research*, 12, 2825-2830.
15. PyTorch developers. "CrossEntropyLoss." PyTorch documentation. https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html

