# Entropy

> Source: https://aiwiki.ai/wiki/entropy
> Updated: 2026-07-11
> Categories: Machine Learning, Mathematics
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**Entropy** is a quantitative measure of the uncertainty, randomness, or average information content in a probability distribution. For a discrete random variable, entropy measured in bits is a lower bound on the average number of yes/no questions needed to pin down an outcome, and an optimal questioning strategy comes within one question of that bound. Originating from [information theory](/wiki/information_theory), the concept was formalized by Claude Shannon in his landmark 1948 paper "A Mathematical Theory of Communication," published in the *Bell System Technical Journal* (volume 27, pages 379-423), and has since become one of the most widely used quantities in [machine learning](/wiki/machine_learning), statistics, physics, and data science.[1] In machine learning, entropy serves as the foundation for [loss functions](/wiki/loss_function) such as [cross-entropy](/wiki/cross-entropy), splitting criteria in [decision trees](/wiki/decision_tree), feature selection methods, and exploration strategies in [reinforcement learning](/wiki/reinforcement_learning_rl).

## Explain like I'm 5 (ELI5)

Imagine you have a bag of candy. If every single piece is a red gummy bear, there is no surprise when you reach in and pull one out. You already know what you are going to get. That bag has **low entropy** because there is no uncertainty.

Now imagine a bag filled with dozens of different candies: gummy bears, chocolate bars, lollipops, sour worms, and more, all mixed together. Every time you reach in, you have no idea what you will pull out. That bag has **high entropy** because there is a lot of uncertainty.

Entropy is just a number that tells you how surprised you should expect to be. When outcomes are predictable, entropy is low. When outcomes are unpredictable and spread out evenly among many possibilities, entropy is high. In machine learning, algorithms use entropy to figure out which questions (features) reduce surprise the most, helping them make better predictions.

## Where does the concept of entropy come from?

The concept of entropy has roots in 19th-century thermodynamics. Rudolf Clausius introduced the term in 1865 to describe the irreversible dissipation of energy in physical systems. Ludwig Boltzmann later gave it a statistical interpretation in the 1870s, relating the entropy of a thermodynamic system to the number of microscopic configurations (microstates) consistent with its macroscopic state. Boltzmann's formula, $$S = k_B \ln W$$, where $$k_B$$ is Boltzmann's constant and W is the number of microstates, laid the groundwork for statistical mechanics.

In 1948, Claude Shannon independently developed an analogous quantity for communication systems. His paper, published in the Bell System Technical Journal, defined information entropy as a measure of the average uncertainty associated with a random variable's outcomes. Shannon reportedly chose the name "entropy" on the advice of the mathematician John von Neumann, who according to Shannon's recollection told him: "You should call it entropy, for two reasons. In the first place your uncertainty function has been used in statistical mechanics under that name, so it already has a name. In the second place, and more important, nobody knows what entropy really is, so in a debate you will always have the advantage."[1] Shannon's work established the field of information theory and demonstrated that entropy sets a fundamental limit on lossless data compression.[1]

E. T. Jaynes further bridged the gap between thermodynamics and information theory in 1957 by proposing the **principle of maximum entropy**. Jaynes argued that, given incomplete information, the most honest probability distribution to assign is the one that maximizes entropy subject to known constraints.[3] This principle became a cornerstone of Bayesian inference and statistical modeling.

## What is Shannon entropy?

### Definition

For a discrete random variable X with possible outcomes x_1, x_2, ..., x_n and probability mass function p(x), the **Shannon entropy** H(X) is defined as:

$$H(X) = -\sum_{i=1}^{n} p(x_i) \log p(x_i)$$

The convention $$0 \log 0 = 0$$ is adopted, consistent with the limit $$p \log p \to 0$$ as $$p \to 0^+$$. The base of the logarithm determines the unit of measurement:

| Logarithm base | Unit name | Common usage |
|---|---|---|
| 2 | Bit (shannon) | Information theory, computing |
| e | Nat (natural unit) | Theoretical analysis, machine learning |
| 10 | Hartley (ban) | Early information theory |

When base 2 is used, entropy is a lower bound on the average number of binary (yes/no) questions needed to identify the outcome of the random variable, and an optimal strategy needs fewer than one additional question on average.[1][9]

### Intuition

Each term $$p(x_i) \log(1/p(x_i))$$ represents the "surprise" or "information content" of observing outcome $$x_i$$, weighted by its probability. Rare events carry more surprise (higher information content) than common events. Entropy is the expected value of this surprise across all outcomes.

For example, consider a fair coin with $$P(\text{heads}) = P(\text{tails}) = 0.5$$. The entropy is:

$$
H = -(0.5 \log_2 0.5 + 0.5 \log_2 0.5) = -(0.5 \times (-1) + 0.5 \times (-1)) = 1 \text{ bit}
$$

This means one binary digit is needed on average to encode each coin flip. By contrast, a biased coin with $$P(\text{heads}) = 0.9$$ and $$P(\text{tails}) = 0.1$$ has entropy:

$$
H = -(0.9 \log_2 0.9 + 0.1 \log_2 0.1) \approx 0.469 \text{ bits}
$$

The biased coin is more predictable, so less information is needed to describe its outcomes.
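
The two coin calculations can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; the function name `shannon_entropy` and its `base` parameter are illustrative choices, not part of any particular library.

```python
import math

def shannon_entropy(probs, base=2.0):
    """Return H = -sum(p * log(p)); base 2 gives bits, base e gives nats."""
    if any(p < 0 for p in probs) or not math.isclose(sum(probs), 1.0, abs_tol=1e-9):
        raise ValueError("probs must be non-negative and sum to 1")
    # Terms with p == 0 are skipped, matching the convention 0 * log(0) = 0.
    return -sum(p * math.log(p, base) for p in probs if p > 0)

fair = shannon_entropy([0.5, 0.5])
biased = shannon_entropy([0.9, 0.1])

print(f"fair coin:   {fair:.3f} bits")    # fair coin:   1.000 bits
print(f"biased coin: {biased:.3f} bits")  # biased coin: 0.469 bits
```

Passing `base=math.e` returns the same quantities in nats.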

### Key properties

Shannon entropy satisfies several important mathematical properties:

| Property | Statement | Significance |
|---|---|---|
| Non-negativity | $$H(X) \ge 0$$ | Uncertainty is never negative for discrete variables |
| Maximum entropy | $$H(X) \le \log(n)$$, with equality iff p is uniform | Uniform distribution is the most uncertain |
| Concavity | $$H(\lambda p + (1-\lambda) q) \ge \lambda H(p) + (1-\lambda) H(q)$$ | The entropy of a mixture is at least the weighted average of the component entropies |
| Additivity | $$H(X, Y) = H(X) + H(Y)$$ for independent X, Y | Entropies of independent sources add |
| Conditioning reduces entropy | $$H(X \mid Y) \le H(X)$$ | On average, observing another variable cannot increase uncertainty |
| Subadditivity | $$H(X, Y) \le H(X) + H(Y)$$ | Joint entropy never exceeds the sum of marginals |
| Permutation invariance | H is unchanged by reordering outcomes | Entropy depends only on probability values, not labels |

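The properties above follow directly from the definition. Conversely, Shannon showed that a short list of axioms (continuity in the probabilities, growth with the number of equally likely outcomes, and a grouping rule for choices made in stages) determines the entropy function up to a positive constant factor, so any measure of uncertainty satisfying them must take the logarithmic form.[1][2]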
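
Several of the tabulated properties can be spot-checked numerically. The sketch below uses two made-up distributions `p` and `q`; a numerical check on one example illustrates the inequalities but is of course not a proof.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability terms are skipped."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Two hypothetical distributions over three outcomes.
p = [0.7, 0.2, 0.1]
q = [0.1, 0.3, 0.6]
n = len(p)

# Maximum entropy: H(p) <= log2(n), with equality for the uniform distribution.
within_bound = entropy_bits(p) <= math.log2(n)
uniform_hits_bound = math.isclose(entropy_bits([1 / n] * n), math.log2(n))

# Concavity: H(lam*p + (1-lam)*q) >= lam*H(p) + (1-lam)*H(q).
lam = 0.4
mixture = [lam * a + (1 - lam) * b for a, b in zip(p, q)]
concave = entropy_bits(mixture) >= lam * entropy_bits(p) + (1 - lam) * entropy_bits(q)

# Additivity: for independent X ~ p and Y ~ q the joint is the outer product.
joint = [a * b for a in p for b in q]
additive = math.isclose(entropy_bits(joint), entropy_bits(p) + entropy_bits(q))

print(within_bound, uniform_hits_bound, concave, additive)
```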

## Binary entropy function

The **binary entropy function** is a special case of Shannon entropy for a Bernoulli random variable with parameter p (probability of success):

$$
H_b(p) = -p \log_2 p - (1 - p) \log_2 (1 - p)
$$

This function maps the interval [0, 1] to [0, 1] and has the following characteristics:

- It equals 0 at the endpoints $$p = 0$$ and $$p = 1$$ (complete certainty).
- It reaches its maximum value of 1 bit at $$p = 0.5$$ (maximum uncertainty).
- It is symmetric about $$p = 0.5$$, meaning $$H_b(p) = H_b(1 - p)$$.
- It is strictly concave on $$(0, 1)$$.

The binary entropy function appears frequently in coding theory, hypothesis testing, and as a building block for analyzing more complex entropy expressions.[2]
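
A direct implementation makes the listed characteristics easy to verify. This is a small sketch; the name `binary_entropy` is chosen here for illustration.

```python
import math

def binary_entropy(p):
    """H_b(p) in bits for a Bernoulli variable with success probability p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if p in (0.0, 1.0):
        return 0.0  # complete certainty, using the convention 0 * log(0) = 0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Evaluate on a coarse grid to see the symmetric, single-peaked shape.
for p in (0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0):
    print(f"H_b({p:.1f}) = {binary_entropy(p):.4f}")
```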

## Related information-theoretic quantities

### Joint entropy

The **joint entropy** of two discrete random variables X and Y measures the total uncertainty in the pair (X, Y):

$$
H(X, Y) = -\sum_{x} \sum_{y} p(x, y) \log p(x, y)
$$

Joint entropy satisfies the inequality $$H(X, Y) \le H(X) + H(Y)$$, with equality if and only if X and Y are independent.[2] The chain rule for entropy decomposes joint entropy as:

$$
H(X, Y) = H(X) + H(Y \mid X) = H(Y) + H(X \mid Y)
$$

### Conditional entropy

The **conditional entropy** H(Y|X) quantifies the remaining uncertainty about Y after observing X:

$$
H(Y \mid X) = -\sum_{x} \sum_{y} p(x, y) \log p(y \mid x)
$$

Equivalently, $$H(Y \mid X) = H(X, Y) - H(X)$$. Conditional entropy is always non-negative and satisfies $$H(Y \mid X) \le H(Y)$$, meaning that conditioning on additional information never increases entropy on average.[2]
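
The definitions of joint and conditional entropy, and the chain rule connecting them, can be checked on a small joint table. The distribution below is hypothetical and chosen only so the numbers are easy to follow.

```python
import math

# Hypothetical joint distribution p(x, y) for binary X and Y.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

def entropy(probs):
    """Shannon entropy in bits of an iterable of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Marginals p(x) and p(y), obtained by summing out the other variable.
p_x, p_y = {}, {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0.0) + p
    p_y[y] = p_y.get(y, 0.0) + p

h_xy = entropy(joint.values())
h_x = entropy(p_x.values())
h_y = entropy(p_y.values())

# Conditional entropy from its definition: -sum p(x, y) * log p(y | x).
h_y_given_x = -sum(p * math.log2(p / p_x[x]) for (x, y), p in joint.items() if p > 0)

print(f"H(X,Y) = {h_xy:.4f}, H(X) = {h_x:.4f}, H(Y|X) = {h_y_given_x:.4f}")
```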

### Mutual information

**Mutual information** I(X; Y) measures the amount of information that one random variable contains about another:

$$
I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X) = H(X) + H(Y) - H(X, Y)
$$

Mutual information is symmetric, meaning $$I(X; Y) = I(Y; X)$$, and is always non-negative. It equals zero if and only if X and Y are statistically independent.[2] Mutual information can also be expressed as the Kullback-Leibler divergence between the joint distribution and the product of marginals:

$$
I(X; Y) = D_{\mathrm{KL}}(p(x, y) \parallel p(x) p(y))
$$

In machine learning, mutual information is used for feature selection, clustering evaluation, and measuring the dependence between variables.
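
The entropy form and the KL form of mutual information give the same number. The sketch below computes both for a hypothetical joint table and confirms that an independent joint distribution yields zero.

```python
import math

def marginals(joint):
    """Return (p_x, p_y) dictionaries for a joint table {(x, y): probability}."""
    p_x, p_y = {}, {}
    for (x, y), p in joint.items():
        p_x[x] = p_x.get(x, 0.0) + p
        p_y[y] = p_y.get(y, 0.0) + p
    return p_x, p_y

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mutual_information(joint):
    """I(X; Y) in bits, computed as KL(p(x, y) || p(x) p(y))."""
    p_x, p_y = marginals(joint)
    return sum(p * math.log2(p / (p_x[x] * p_y[y]))
               for (x, y), p in joint.items() if p > 0)

# Hypothetical dependent pair.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
p_x, p_y = marginals(joint)
mi_kl = mutual_information(joint)
mi_entropy = entropy(p_x.values()) + entropy(p_y.values()) - entropy(joint.values())

# Independent pair built as the product of the same marginals.
independent = {(x, y): p_x[x] * p_y[y] for x in p_x for y in p_y}
mi_independent = mutual_information(independent)

print(f"I(X;Y) = {mi_kl:.4f} bits, independent case = {mi_independent:.4f} bits")
```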

### KL divergence (relative entropy)

The **Kullback-Leibler (KL) divergence**, also called relative entropy, measures how one probability distribution P diverges from a reference distribution Q:[4]

$$
D_{\mathrm{KL}}(P \parallel Q) = \sum_{x} P(x) \log(P(x) / Q(x))
$$

Key properties of KL divergence include:

- **Non-negativity**: $$D_{\mathrm{KL}}(P \parallel Q) \ge 0$$, with equality iff $$P = Q$$ (Gibbs' inequality).[2]
- **Asymmetry**: $$D_{\mathrm{KL}}(P \parallel Q) \ne D_{\mathrm{KL}}(Q \parallel P)$$ in general, so it is not a true distance metric.
- **Unboundedness**: $$D_{\mathrm{KL}}$$ can be infinite if Q assigns zero probability to an event that P assigns positive probability to.

KL divergence is foundational to variational inference, variational autoencoders, and many optimization procedures in machine learning.
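
The three properties are easy to see in code. This sketch assumes the two distributions are given as aligned lists of probabilities over the same outcomes; the function name is illustrative.

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) in bits for aligned probability lists p and q."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue          # 0 * log(0 / q) is taken to be 0
        if qi == 0:
            return math.inf   # P puts mass where Q has none
        total += pi * math.log2(pi / qi)
    return total

p = [0.5, 0.5]
q = [0.9, 0.1]

forward = kl_divergence(p, q)
reverse = kl_divergence(q, p)
print(f"D(P||Q) = {forward:.3f} bits, D(Q||P) = {reverse:.3f} bits")
print("D(P||P) =", kl_divergence(p, p))
print("Q with a zero:", kl_divergence(p, [1.0, 0.0]))
```

The forward and reverse values differ, which is the asymmetry noted above.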

### Cross-entropy

**Cross-entropy** between a true distribution P and an estimated distribution Q is defined as:

$$
H(P, Q) = -\sum_{x} P(x) \log Q(x)
$$

Cross-entropy relates to entropy and KL divergence through the identity:

$$
H(P, Q) = H(P) + D_{\mathrm{KL}}(P \parallel Q)
$$

Since $$H(P)$$ is constant with respect to Q, minimizing cross-entropy with respect to Q is equivalent to minimizing KL divergence. This is why cross-entropy is the standard [loss function](/wiki/loss_function) for classification tasks in deep learning.[10] When used with a [softmax](/wiki/softmax) output layer, minimizing cross-entropy is equivalent to maximum likelihood estimation.[10]
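
The identity and the link to maximum likelihood can both be checked numerically. In this sketch the logits and the target distribution are made-up values; with a one-hot target, the cross-entropy reduces to the negative log-probability the model assigns to the true class.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

logits = [2.0, 1.0, 0.1]      # hypothetical model outputs
q = softmax(logits)           # estimated distribution Q
p = [0.7, 0.2, 0.1]           # hypothetical true distribution P

lhs = cross_entropy(p, q)
rhs = entropy(p) + kl_divergence(p, q)
print(f"H(P,Q) = {lhs:.6f} nats, H(P) + KL = {rhs:.6f} nats")

# One-hot target: cross-entropy equals the negative log-likelihood of class 0.
one_hot = [1.0, 0.0, 0.0]
nll = -math.log(q[0])
print(f"one-hot cross-entropy = {cross_entropy(one_hot, q):.6f}, NLL = {nll:.6f}")
```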

## Differential entropy

For continuous random variables with probability density function f(x), the **differential entropy** (or continuous entropy) is defined as:

$$
h(X) = -\int f(x) \log f(x) \, dx
$$

Differential entropy shares some properties with discrete entropy but differs in important ways:

| Property | Discrete entropy | Differential entropy |
|---|---|---|
| Non-negativity | Always >= 0 | Can be negative |
| Coordinate invariance | Yes | No (changes under coordinate transformations) |
| Translation invariance | N/A | $$h(X + c) = h(X)$$ |
| Scaling | N/A | $$h(aX) = h(X) + \log\lvert a \rvert$$ |
| Maximizing distribution | Uniform (fixed finite set of outcomes) | Gaussian (fixed variance) |

Notable differential entropy values for common distributions:

| Distribution | Differential entropy |
|---|---|
| Uniform $$U(a, b)$$ | $$\ln(b - a)$$ |
| Gaussian $$N(\mu, \sigma^2)$$ | $$(1/2) \ln(2 \pi e \sigma^2)$$ |
| Exponential $$\mathrm{Exp}(\lambda)$$ | $$1 + \ln(1/\lambda)$$ |

The Gaussian distribution has the maximum differential entropy among all distributions with a given mean and variance, a result that justifies many Gaussian assumptions in statistical modeling.[2][12]
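
The Gaussian closed form can be compared against a direct numerical integration of $$-\int f \ln f \, dx$$. The sketch below uses a simple midpoint rule over a range of plus or minus ten standard deviations; the integration range and step count are arbitrary choices made for this example. It also shows that differential entropy is negative for a narrow density.

```python
import math

def gaussian_differential_entropy(sigma):
    """Closed form (1/2) * ln(2 * pi * e * sigma^2), in nats."""
    return 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)

def normal_pdf(x, sigma):
    return math.exp(-x * x / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def numeric_differential_entropy(pdf, lo, hi, steps=20_000):
    """Midpoint-rule approximation of -integral f(x) ln f(x) dx over [lo, hi]."""
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        fx = pdf(lo + (i + 0.5) * dx)
        if fx > 0:
            total -= fx * math.log(fx) * dx
    return total

sigma = 0.1
closed = gaussian_differential_entropy(sigma)
numeric = numeric_differential_entropy(lambda x: normal_pdf(x, sigma), -1.0, 1.0)
print(f"closed form: {closed:.6f} nats, numeric: {numeric:.6f} nats")

# Scaling property: doubling sigma adds ln(2) nats.
shift = gaussian_differential_entropy(2 * sigma) - closed
print(f"h(2X) - h(X) = {shift:.6f}, ln 2 = {math.log(2):.6f}")
```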

## How is entropy used in decision trees?

Entropy plays a central role in the construction of [decision trees](/wiki/decision_tree), particularly in the ID3, C4.5, and C5.0 algorithms. At each node, the algorithm selects the feature that maximally reduces the entropy of the target variable.[5]

### Information gain

[Information gain](/wiki/information_gain) (IG) is the reduction in entropy achieved by partitioning a dataset according to a feature A:

$$
\mathrm{IG}(S, A) = H(S) - \sum_{v \in \mathrm{Values}(A)} \frac{\lvert S_v \rvert}{\lvert S \rvert} H(S_v)
$$

where H(S) is the entropy of the full dataset, S_v is the subset of examples with feature value v, and |S_v|/|S| is the proportion of examples in that subset. The feature with the highest information gain is chosen as the splitting criterion.[5]

For a binary classification problem with positive proportion p_+ and negative proportion p_-:

$$
H(S) = -p_+ \log_2 p_+ - p_- \log_2 p_-
$$

A pure node (all examples belong to one class) has entropy 0, while a node with equal class proportions has entropy 1 bit.
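
Information gain is straightforward to compute from label counts. The sketch below uses a tiny invented dataset with two categorical features, `outlook` and `windy`, and a binary label `play`; all names and values are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy in bits of the empirical label distribution."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def information_gain(rows, feature, target):
    """H(S) minus the size-weighted entropy of the subsets induced by `feature`."""
    groups = {}
    for row in rows:
        groups.setdefault(row[feature], []).append(row[target])
    weighted = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[target] for row in rows]) - weighted

rows = [
    {"outlook": "sunny", "windy": "no",  "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "yes"},
    {"outlook": "sunny", "windy": "no",  "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rain",  "windy": "no",  "play": "no"},
    {"outlook": "rain",  "windy": "yes", "play": "no"},
    {"outlook": "rain",  "windy": "no",  "play": "no"},
    {"outlook": "rain",  "windy": "yes", "play": "yes"},
]

ig_outlook = information_gain(rows, "outlook", "play")
ig_windy = information_gain(rows, "windy", "play")
print(f"IG(outlook) = {ig_outlook:.4f} bits, IG(windy) = {ig_windy:.4f} bits")
```

Here `outlook` leaves each subset with a 3-to-1 class ratio and gains about 0.19 bits, while `windy` leaves both subsets evenly split and gains nothing, so a tree builder would split on `outlook`.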

### Gain ratio

One limitation of information gain is its bias toward features with many distinct values. The **gain ratio**, used by the C4.5 algorithm, addresses this by normalizing information gain by the feature's intrinsic information:[5]

$$
\mathrm{GainRatio}(S, A) = \mathrm{IG}(S, A) / \mathrm{SplitInfo}(S, A)
$$

where $$\mathrm{SplitInfo}(S, A) = -\sum_{v} \frac{\lvert S_v \rvert}{\lvert S \rvert} \log_2\frac{\lvert S_v \rvert}{\lvert S \rvert}$$. This penalizes features that produce many small partitions.
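
The effect of the normalization shows up clearly with an identifier-like feature. In the sketch below, `row_id` takes a unique value per example and `outlook` is a binary feature; both are invented for illustration.

```python
import math
from collections import Counter

def entropy(items):
    """Entropy in bits of the empirical distribution of `items`."""
    n = len(items)
    return sum((c / n) * math.log2(n / c) for c in Counter(items).values())

def information_gain(values, labels):
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    weighted = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - weighted

def gain_ratio(values, labels):
    """IG divided by SplitInfo, which is the entropy of the feature values."""
    split_info = entropy(values)
    return information_gain(values, labels) / split_info if split_info > 0 else 0.0

labels  = ["yes", "yes", "yes", "no", "no", "no", "no", "no"]
outlook = ["sunny"] * 4 + ["rain"] * 4
row_id  = list(range(8))  # unique per example, so every subset is trivially pure

for name, values in (("outlook", outlook), ("row_id", row_id)):
    print(f"{name}: IG = {information_gain(values, labels):.3f}, "
          f"gain ratio = {gain_ratio(values, labels):.3f}")
```

Information gain prefers `row_id` (about 0.954 versus 0.549 bits), but its split information is 3 bits, so the gain ratio ranks `outlook` higher.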

### How does entropy differ from Gini impurity?

| Criterion | Formula | Range | Used by |
|---|---|---|---|
| Entropy | $$-\sum p_i \log_2 p_i$$ | $$[0, \log_2 C]$$ | ID3, C4.5, C5.0 |
| Gini impurity | $$1 - \sum p_i^2$$ | $$[0, 1 - 1/C]$$ | CART |

Both entropy and Gini impurity measure node impurity, and in practice they often yield similar tree structures. Entropy is slightly more expensive to compute because of the logarithm, but it has a direct information-theoretic interpretation. The CART algorithm uses Gini impurity by default, while ID3 and C4.5 rely on entropy.
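
A side-by-side evaluation shows how closely the two criteria track each other. This sketch takes class proportions directly; the example distributions are arbitrary.

```python
import math

def entropy_impurity(proportions):
    """Entropy in bits of a list of class proportions."""
    return sum(p * math.log2(1 / p) for p in proportions if p > 0)

def gini_impurity(proportions):
    """Gini impurity: one minus the sum of squared class proportions."""
    return 1.0 - sum(p * p for p in proportions)

examples = {
    "pure node":         [1.0, 0.0],
    "mostly one class":  [0.9, 0.1],
    "balanced binary":   [0.5, 0.5],
    "balanced 4 class":  [0.25, 0.25, 0.25, 0.25],
}

for name, props in examples.items():
    print(f"{name:18s} entropy = {entropy_impurity(props):.3f}  "
          f"gini = {gini_impurity(props):.3f}")
```

Both measures are zero for a pure node and reach their maxima, $$\log_2 C$$ and $$1 - 1/C$$ respectively, when the C classes are equally represented.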

## Entropy in feature selection

Entropy-based metrics are widely used to rank and select features in high-dimensional datasets:

- **Mutual information** between each feature and the target variable quantifies how much knowing that feature reduces uncertainty about the target. Features with high mutual information are considered more informative.
- **Information gain** can be computed for each feature independently and used as a filter method for feature selection before model training.
- **Minimum redundancy maximum relevance (mRMR)** selects features that have high mutual information with the target but low mutual information with each other, reducing redundancy.

Entropy-based feature selection is particularly valuable in text classification, genomics, and other domains where the number of features can reach tens of thousands.
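
The mRMR idea can be sketched as a greedy loop that scores each candidate by its relevance minus its mean redundancy with the features already chosen (the difference form of the criterion). The dataset and feature names `a`, `b`, `c` below are invented; `b` is an exact copy of `a`, so a pure relevance ranking would pick both.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Empirical mutual information in bits between two discrete sequences."""
    n = len(xs)
    c_xy, c_x, c_y = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((c / n) * math.log2(c * n / (c_x[x] * c_y[y]))
               for (x, y), c in c_xy.items())

def mrmr(features, target, k):
    """Greedy selection of k feature names by relevance minus mean redundancy."""
    selected, remaining = [], list(features)
    while remaining and len(selected) < k:
        def score(name):
            relevance = mutual_information(features[name], target)
            if not selected:
                return relevance
            redundancy = sum(mutual_information(features[name], features[s])
                             for s in selected) / len(selected)
            return relevance - redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

target = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "a": [0, 0, 0, 1, 1, 1, 1, 1],   # strongly related to the target
    "b": [0, 0, 0, 1, 1, 1, 1, 1],   # exact duplicate of "a"
    "c": [0, 0, 1, 0, 1, 1, 0, 1],   # weaker, but nearly independent of "a"
}

print(mrmr(features, target, k=2))
```

After choosing `a`, the duplicate `b` is fully redundant and scores below `c`, so the selection is `a` followed by `c`.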

## What is the principle of maximum entropy?

The **principle of maximum entropy** states that, given a set of constraints derived from observed data, the probability distribution that best represents the current state of knowledge is the one with the highest entropy. This principle, formalized by E. T. Jaynes in 1957, provides a principled method for constructing probability distributions when information is incomplete.[3]

In machine learning, maximum entropy models have several important applications:

- **Maximum entropy classifiers** (equivalent to multinomial logistic regression) find the distribution over class labels that maximizes entropy while matching observed feature expectations. The softmax function naturally arises from this framework.
- **Maximum Entropy Markov Models (MEMMs)** extend the maximum entropy approach to sequential data, enabling applications in part-of-speech tagging, named entity recognition, and other natural language processing tasks.
- **Exponential family distributions** (Gaussian, Poisson, exponential, Bernoulli) can all be derived as maximum entropy distributions subject to appropriate moment constraints.[11]

The maximum entropy principle provides a theoretical justification for many standard modeling choices. For instance, assuming a Gaussian distribution when only the mean and variance are known is equivalent to applying the maximum entropy principle.
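
A classic illustration is a six-sided die whose only known property is its mean. The maximum entropy solution has the exponential-family form $$p_i \propto e^{\lambda i}$$, and the sketch below finds $$\lambda$$ by bisection. The function name, search bounds, and tolerance are assumptions made for this example.

```python
import math

def maxent_die(target_mean, faces=(1, 2, 3, 4, 5, 6), tol=1e-12):
    """Maximum-entropy distribution on `faces` with the given mean."""
    if not min(faces) < target_mean < max(faces):
        raise ValueError("target_mean must lie strictly between the extreme faces")

    def dist(lam):
        weights = [math.exp(lam * f) for f in faces]
        z = sum(weights)
        return [w / z for w in weights]

    def mean(lam):
        return sum(f * p for f, p in zip(faces, dist(lam)))

    lo, hi = -20.0, 20.0          # mean(lam) increases monotonically with lam
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if mean(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    return dist((lo + hi) / 2)

fair = maxent_die(3.5)      # no bias needed: the uniform distribution
loaded = maxent_die(4.5)    # probabilities tilt toward the high faces
print([round(p, 4) for p in fair])
print([round(p, 4) for p in loaded])
```

With a mean of 3.5 the constraint is already satisfied by the uniform distribution, so $$\lambda = 0$$; a higher mean forces an exponential tilt and lowers the entropy.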

## How is entropy used in reinforcement learning?

In reinforcement learning, **entropy regularization** adds an entropy bonus to the reward signal, encouraging the agent's policy to remain stochastic and explore more broadly. The entropy-regularized objective is:

$$
J(\pi) = \mathbb{E}\left[\sum_t \left(r_t + \alpha H(\pi(\cdot \mid s_t))\right)\right]
$$

where $$\alpha$$ is a temperature parameter controlling the tradeoff between reward maximization and entropy. Higher values of $$\alpha$$ favor more exploratory behavior, while lower values emphasize exploitation.
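
The role of the temperature is easiest to see in a one-step simplification with a single state and fixed action values. In that setting the policy maximizing expected value plus $$\alpha$$ times entropy is a softmax of the values divided by $$\alpha$$. The sketch below is this simplified case only, not an implementation of any full algorithm, and the action values are made up.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(policy):
    """Policy entropy in nats."""
    return -sum(p * math.log(p) for p in policy if p > 0)

def soft_objective(policy, q_values, alpha):
    """Expected value plus alpha times the policy entropy (single state)."""
    return sum(p * q for p, q in zip(policy, q_values)) + alpha * entropy(policy)

def soft_optimal_policy(q_values, alpha):
    """Maximizer of soft_objective for one state: softmax(Q / alpha)."""
    return softmax([q / alpha for q in q_values])

q_values = [1.0, 0.8, 0.1]   # hypothetical action values
entropies = []
for alpha in (0.05, 0.5, 5.0):
    policy = soft_optimal_policy(q_values, alpha)
    entropies.append(entropy(policy))
    print(f"alpha = {alpha}: policy = {[round(p, 3) for p in policy]}, "
          f"entropy = {entropies[-1]:.3f} nats")
```

Small values of `alpha` give a nearly greedy policy, while large values push the policy toward uniform.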

The **Soft Actor-Critic (SAC)** algorithm, introduced by Haarnoja et al. at the 35th International Conference on Machine Learning (ICML) in 2018, is the most prominent example of entropy-regularized reinforcement learning. SAC is described by its authors as "an off-policy actor-critic deep RL algorithm based on the maximum entropy reinforcement learning framework. In this framework, the actor aims to maximize expected reward while also maximizing entropy."[6] Key benefits of entropy regularization include:

- **Improved exploration**: The entropy bonus prevents the policy from collapsing to a deterministic strategy too early, helping the agent discover better solutions.
- **Robustness**: Stochastic policies are more robust to perturbations in the environment.
- **Multi-modal behavior**: Entropy regularization can help the agent learn multiple near-optimal strategies rather than committing to a single one.

A follow-up version of SAC uses an automatically tuned temperature parameter $$\alpha$$ that adjusts over the course of training to maintain a target entropy level, eliminating the need for manual tuning.[6]

## What is entropy used for in machine learning?

### Text analysis and natural language processing

Entropy measures the information density of text. A passage with high entropy uses a diverse vocabulary and is harder to predict, while low-entropy text is repetitive and predictable. Applications include:

- **Language modeling**: Perplexity, a standard evaluation metric for language models, is defined as $$2^H$$, where H is the model's average per-token cross-entropy on held-out text, measured in bits (equivalently $$e^H$$ when H is in nats). Lower perplexity indicates a better model.[11]
- **Keyword extraction**: Words with high pointwise mutual information relative to a document or topic are identified as keywords.
- **Machine translation**: Entropy-based alignment models help identify corresponding words and phrases across languages.
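
Perplexity can be computed from the probabilities a model assigned to the tokens that actually occurred. The probabilities below are invented for illustration; the sketch averages the negative base-2 log-probabilities to get cross-entropy in bits per token and then exponentiates.

```python
import math

def perplexity(token_probs):
    """Perplexity from the model's probability for each observed token."""
    if not token_probs or any(p <= 0 for p in token_probs):
        raise ValueError("need at least one token and strictly positive probabilities")
    cross_entropy_bits = -sum(math.log2(p) for p in token_probs) / len(token_probs)
    return 2 ** cross_entropy_bits

# A model that always spreads its belief over four equally likely tokens.
uniform_guess = perplexity([0.25] * 5)

# A hypothetical model that is fairly confident about each observed token.
confident = perplexity([0.9, 0.8, 0.95, 0.7, 0.85])

print(f"uniform over 4 tokens: {uniform_guess:.2f}")
print(f"confident model:       {confident:.2f}")
```

A perplexity of 4 means the model is, on average, as uncertain as if it were choosing uniformly among four tokens.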

### Anomaly detection

Entropy can serve as a signal for detecting anomalies in data streams. Under normal conditions, a system produces outputs with a characteristic entropy level. Significant deviations from this baseline may indicate anomalous behavior. For example:

- In network security, sudden changes in the entropy of packet header fields can indicate distributed denial-of-service (DDoS) attacks.
- In manufacturing, entropy monitoring of sensor data can detect equipment degradation before failure occurs.
- In financial markets, unusual entropy patterns in trading data may signal market manipulation.
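
A baseline-and-deviation check of this kind can be sketched with a windowed entropy over a stream of categorical symbols. The stream, window size, baseline, and tolerance below are all made-up values; a real system would estimate the baseline from historical data.

```python
import math
from collections import Counter

def window_entropy(symbols):
    """Entropy in bits of the empirical symbol distribution in one window."""
    n = len(symbols)
    return sum((c / n) * math.log2(n / c) for c in Counter(symbols).values())

def entropy_alerts(stream, window, baseline, tolerance):
    """Return (start_index, entropy) for non-overlapping windows that deviate."""
    alerts = []
    for start in range(0, len(stream) - window + 1, window):
        h = window_entropy(stream[start:start + window])
        if abs(h - baseline) > tolerance:
            alerts.append((start, h))
    return alerts

normal = ["a", "b", "c", "d"] * 4    # four symbols used evenly: 2 bits per window
burst = ["a"] * 8                    # one symbol dominates: entropy collapses
stream = normal + burst + normal

alerts = entropy_alerts(stream, window=8, baseline=2.0, tolerance=0.5)
print(alerts)
```

Only the window starting at index 16, where a single symbol dominates, is flagged.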

### Image processing and computer vision

Entropy quantifies the information content of images and is used for:

- **Image thresholding**: Maximum entropy thresholding selects the gray-level threshold that maximizes the sum of the entropies of the resulting foreground and background intensity distributions, producing effective foreground-background segmentation.
- **Texture analysis**: Local entropy maps characterize the complexity of texture regions in an image.
- **Image quality assessment**: Entropy correlates with the perceived information content and detail in an image.
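
One common formulation of entropy-based thresholding, usually attributed to Kapur and colleagues, picks the threshold that maximizes the sum of the entropies of the two class-conditional gray-level distributions. The sketch below applies that formulation to a small invented histogram; it is a simplified illustration rather than a production routine.

```python
import math

def kapur_threshold(hist):
    """Return t such that bins [0, t) and [t, n) maximize the summed class entropies."""
    total = sum(hist)
    probs = [h / total for h in hist]
    best_t, best_score = None, -math.inf
    for t in range(1, len(probs)):
        w0, w1 = sum(probs[:t]), sum(probs[t:])
        if w0 <= 0 or w1 <= 0:
            continue  # one side is empty, so it defines no distribution
        # Entropy (in nats) of each side after renormalizing to sum to 1.
        h0 = -sum((p / w0) * math.log(p / w0) for p in probs[:t] if p > 0)
        h1 = -sum((p / w1) * math.log(p / w1) for p in probs[t:] if p > 0)
        if h0 + h1 > best_score:
            best_t, best_score = t, h0 + h1
    return best_t

# Hypothetical 12-bin histogram with a dark cluster and a bright cluster.
hist = [10, 10, 10, 10, 0, 0, 0, 0, 10, 10, 10, 10]
t = kapur_threshold(hist)
print("threshold bin:", t)
```

Any threshold that falls in the empty gap between the two clusters gives the same score here, so the function returns the first such bin.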

### Data compression

Shannon's source coding theorem establishes that the entropy of a source is the theoretical lower bound on the average number of bits required to encode its output without loss.[1] Practical compression algorithms such as Huffman coding and arithmetic coding approach this limit. Shannon estimated the entropy rate of printed English at roughly 0.6 to 1.3 bits per character based on human prediction experiments, far below the 4.7 bits per character that 26 equally likely letters would require; this redundancy explains why English text can typically be compressed to a small fraction of its original size.[9][13]
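
The bound can be observed by building a Huffman code and comparing its average codeword length with the source entropy. The sketch below tracks only codeword lengths, not the bit patterns, and the two source distributions are invented for illustration.

```python
import heapq
import math

def huffman_code_lengths(probs):
    """Return {symbol: codeword length in bits} for a Huffman code."""
    # Heap items are (probability, tie_breaker, {symbol: depth so far}).
    heap = [(p, i, {sym: 0}) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, d1 = heapq.heappop(heap)
        p2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {sym: depth + 1 for sym, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

def entropy_bits(probs):
    return sum(p * math.log2(1 / p) for p in probs.values() if p > 0)

def average_length(probs, lengths):
    return sum(probs[s] * lengths[s] for s in probs)

dyadic = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
skewed = {"a": 0.9, "b": 0.1}

for name, source in (("dyadic", dyadic), ("skewed", skewed)):
    lengths = huffman_code_lengths(source)
    print(f"{name}: entropy = {entropy_bits(source):.3f} bits, "
          f"average length = {average_length(source, lengths):.3f} bits")
```

For the dyadic source the Huffman code meets the entropy exactly (1.75 bits). For the skewed two-symbol source each symbol still costs a whole bit, which is why schemes such as arithmetic coding, or coding blocks of symbols together, are needed to get closer to the limit.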

## How does Shannon entropy relate to thermodynamic entropy?

Shannon entropy and thermodynamic entropy are formally analogous. Boltzmann's entropy $$S = k_B \ln W$$ counts the number of microstates consistent with a macrostate, while Shannon's entropy $$H = -\sum p \log p$$ measures the uncertainty over possible outcomes. The connection becomes exact when the probabilities in Shannon's formula correspond to the Boltzmann distribution over energy states:

$$
S = k_B H
$$

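where H is measured in nats (natural logarithm) and $$k_B$$ is Boltzmann's constant (approximately $$1.38 \times 10^{-23}$$ J/K). For W equally likely microstates, $$H = \ln W$$ and the relation reduces to Boltzmann's formula.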

Landauer's principle (1961) makes this connection physically concrete: erasing one bit of information requires a minimum energy dissipation of $$k_B T \ln 2$$, where T is the temperature of the environment.[7] This result establishes a fundamental link between computation, information, and thermodynamics. Experimental verification of Landauer's principle was achieved in 2012 by Bérut and colleagues at ENS Lyon, who used a single colloidal particle trapped in a double-well optical potential as a one-bit memory and measured that the heat dissipated during erasure approached the Landauer value of $$k_B T \ln 2$$ in the slow (quasi-static) limit.[8]
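
The size of the Landauer bound is easy to evaluate. The sketch below uses the exact SI value of Boltzmann's constant; the temperature of 300 K is an arbitrary room-temperature choice for the example.

```python
import math

K_B = 1.380649e-23  # Boltzmann's constant in J/K (exact in the 2019 SI)

def landauer_limit(temperature_kelvin, bits=1):
    """Minimum heat in joules dissipated when erasing `bits` bits at temperature T."""
    return bits * K_B * temperature_kelvin * math.log(2)

per_bit = landauer_limit(300.0)
per_gigabyte = landauer_limit(300.0, bits=8e9)
print(f"one bit at 300 K:      {per_bit:.3e} J")
print(f"one gigabyte at 300 K: {per_gigabyte:.3e} J")
```

At room temperature the bound is about $$2.87 \times 10^{-21}$$ J per bit.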

## Summary of key formulas

| Quantity | Formula | Description |
|---|---|---|
| Shannon entropy | $$H(X) = -\sum p(x) \log p(x)$$ | Average uncertainty of a discrete random variable |
| Binary entropy | $$H_b(p) = -p \log p - (1-p) \log(1-p)$$ | Entropy of a Bernoulli variable |
| Joint entropy | $$H(X,Y) = -\sum\sum p(x,y) \log p(x,y)$$ | Total uncertainty in a pair of variables |
| Conditional entropy | $$H(Y \mid X) = H(X,Y) - H(X)$$ | Remaining uncertainty after observing X |
| Mutual information | $$I(X;Y) = H(X) + H(Y) - H(X,Y)$$ | Shared information between two variables |
| KL divergence | $$D_{\mathrm{KL}}(P \parallel Q) = \sum P(x) \log(P(x)/Q(x))$$ | Divergence of P from the reference distribution Q |
| Cross-entropy | $$H(P,Q) = -\sum P(x) \log Q(x)$$ | Expected code length using Q to encode P |
| Differential entropy | $$h(X) = -\int f(x) \log f(x) \, dx$$ | Continuous analog of Shannon entropy |
| Information gain | $$\mathrm{IG}(S,A) = H(S) - \sum \frac{\lvert S_v \rvert}{\lvert S \rvert} H(S_v)$$ | Entropy reduction from splitting on feature A |

## References

1. Shannon, C. E. (1948). "A Mathematical Theory of Communication." *Bell System Technical Journal*, 27(3), 379-423.
2. Cover, T. M., & Thomas, J. A. (2006). *Elements of Information Theory* (2nd ed.). Wiley-Interscience.
3. Jaynes, E. T. (1957). "Information Theory and Statistical Mechanics." *Physical Review*, 106(4), 620-630.
4. Kullback, S., & Leibler, R. A. (1951). "On Information and Sufficiency." *Annals of Mathematical Statistics*, 22(1), 79-86.
5. Quinlan, J. R. (1986). "Induction of Decision Trees." *Machine Learning*, 1(1), 81-106.
6. Haarnoja, T., Zhou, A., Abbeel, P., & Levine, S. (2018). "Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor." *Proceedings of the 35th International Conference on Machine Learning (ICML)*.
7. Landauer, R. (1961). "Irreversibility and Heat Generation in the Computing Process." *IBM Journal of Research and Development*, 5(3), 183-191.
8. Bérut, A., et al. (2012). "Experimental verification of Landauer's principle linking information and thermodynamics." *Nature*, 483, 187-189.
9. MacKay, D. J. C. (2003). *Information Theory, Inference, and Learning Algorithms*. Cambridge University Press.
10. Goodfellow, I., Bengio, Y., & Courville, A. (2016). *Deep Learning*. MIT Press. Chapter 3: Probability and Information Theory.
11. Murphy, K. P. (2022). *Probabilistic Machine Learning: An Introduction*. MIT Press.
12. Bishop, C. M. (2006). *Pattern Recognition and Machine Learning*. Springer. Chapter 1.6: Information Theory.
13. Shannon, C. E. (1951). "Prediction and Entropy of Printed English." *Bell System Technical Journal*, 30(1), 50-64.