Entropy is a fundamental measure of uncertainty, randomness, or information content in a probability distribution. Originating from information theory, the concept was formalized by Claude Shannon in his landmark 1948 paper "A Mathematical Theory of Communication" and has since become one of the most widely used quantities in machine learning, statistics, physics, and data science. In machine learning, entropy serves as the foundation for loss functions such as cross-entropy, splitting criteria in decision trees, feature selection methods, and exploration strategies in reinforcement learning.
Imagine you have a bag of candy. If every single piece is a red gummy bear, there is no surprise when you reach in and pull one out. You already know what you are going to get. That bag has low entropy because there is almost no uncertainty.
Now imagine a bag filled with dozens of different candies: gummy bears, chocolate bars, lollipops, sour worms, and more, all mixed together. Every time you reach in, you have no idea what you will pull out. That bag has high entropy because there is a lot of uncertainty.
Entropy is just a number that tells you how surprised you should expect to be. When outcomes are predictable, entropy is low. When outcomes are unpredictable and spread out evenly among many possibilities, entropy is high. In machine learning, algorithms use entropy to figure out which questions (features) reduce surprise the most, helping them make better predictions.
The concept of entropy has roots in 19th-century thermodynamics. Rudolf Clausius introduced the term in 1865 to describe the irreversible dissipation of energy in physical systems. Ludwig Boltzmann later gave it a statistical interpretation in the 1870s, relating the entropy of a thermodynamic system to the number of microscopic configurations (microstates) consistent with its macroscopic state. Boltzmann's formula, S = k_B ln W, where k_B is Boltzmann's constant and W is the number of microstates, laid the groundwork for statistical mechanics.
In 1948, Claude Shannon independently developed an analogous quantity for communication systems. His paper, published in the Bell System Technical Journal, defined information entropy as a measure of the average uncertainty associated with a random variable's outcomes. Shannon reportedly chose the name "entropy" on the advice of the mathematician John von Neumann, who noted that "nobody knows what entropy really is, so in a debate you will always have the advantage." Shannon's work established the field of information theory and demonstrated that entropy sets a fundamental limit on lossless data compression.
E. T. Jaynes further bridged the gap between thermodynamics and information theory in 1957 by proposing the principle of maximum entropy. Jaynes argued that, given incomplete information, the most honest probability distribution to assign is the one that maximizes entropy subject to known constraints. This principle became a cornerstone of Bayesian inference and statistical modeling.
For a discrete random variable X with possible outcomes x_1, x_2, ..., x_n and probability mass function p(x), the Shannon entropy H(X) is defined as:
$$H(X) = -\sum_{i=1}^{n} p(x_i) \log p(x_i)$$
The convention 0 log 0 = 0 is adopted, consistent with the limit as p approaches zero. The base of the logarithm determines the unit of measurement:
| Logarithm base | Unit name | Common usage |
|---|---|---|
| 2 | Bit (shannon) | Information theory, computing |
| e | Nat (natural unit) | Theoretical analysis, machine learning |
| 10 | Hartley (ban) | Early information theory |
When base 2 is used, entropy measures the minimum average number of binary questions (yes/no) needed to identify the outcome of the random variable.
Each term p(x_i) log(1/p(x_i)) represents the "surprise" or "information content" of observing outcome x_i, weighted by its probability. Rare events carry more surprise (higher information content) than common events. Entropy is the expected value of this surprise across all outcomes.
For example, consider a fair coin with P(heads) = P(tails) = 0.5. The entropy is:
$$H = -(0.5 \log_2 0.5 + 0.5 \log_2 0.5) = -(0.5 \times (-1) + 0.5 \times (-1)) = 1 \text{ bit}$$
This means one binary digit is needed on average to encode each coin flip. By contrast, a biased coin with P(heads) = 0.9 and P(tails) = 0.1 has entropy:
$$H = -(0.9 \log_2 0.9 + 0.1 \log_2 0.1) \approx 0.469 \text{ bits}$$
The biased coin is more predictable, so less information is needed to describe its outcomes.
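These calculations are easy to verify programmatically. The short Python sketch below (the helper name `shannon_entropy` is purely illustrative) computes the entropy in bits of any discrete distribution and reproduces the two coin values above:

```python
import math

def shannon_entropy(probs, base=2):
    """Shannon entropy of a discrete distribution given as a list of probabilities."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

print(shannon_entropy([0.5, 0.5]))   # fair coin    -> 1.0 bit
print(shannon_entropy([0.9, 0.1]))   # biased coin  -> ~0.469 bits
print(shannon_entropy([1.0]))        # certain outcome -> 0.0 bits
```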
Shannon entropy satisfies several important mathematical properties:
| Property | Statement | Significance |
|---|---|---|
| Non-negativity | H(X) >= 0 | Uncertainty is never negative for discrete variables |
| Maximum entropy | H(X) <= log(n), with equality iff p is uniform | Uniform distribution is the most uncertain |
| Concavity | H(lambda p + (1-lambda) q) >= lambda H(p) + (1-lambda) H(q) | Mixing distributions cannot decrease entropy |
| Additivity | H(X, Y) = H(X) + H(Y) for independent X, Y | Entropies of independent sources add |
| Conditioning reduces entropy | H(X\|Y) <= H(X) | On average, observing another variable never increases uncertainty about X |
| Subadditivity | H(X, Y) <= H(X) + H(Y) | Joint entropy never exceeds the sum of marginals |
| Permutation invariance | H is unchanged by reordering outcomes | Entropy depends only on probability values, not labels |
Shannon showed that a small set of axioms (continuity, monotonic growth in the number of equally likely outcomes, and a recursive grouping property) uniquely determines the entropy function up to a constant factor, meaning any measure of uncertainty satisfying these axioms must take the logarithmic form above.
The binary entropy function is a special case of Shannon entropy for a Bernoulli random variable with parameter p (probability of success):
$$H_b(p) = -p \log_2 p - (1 - p) \log_2 (1 - p)$$
This function maps the interval [0, 1] to [0, 1] and has the following characteristics:

- It equals 0 at p = 0 and p = 1, where the outcome is certain.
- It reaches its maximum value of 1 bit at p = 0.5, the point of greatest uncertainty.
- It is symmetric about p = 0.5, since H_b(p) = H_b(1 - p).
- It is concave in p.
The binary entropy function appears frequently in coding theory, hypothesis testing, and as a building block for analyzing more complex entropy expressions.
The joint entropy of two discrete random variables X and Y measures the total uncertainty in the pair (X, Y):
$$H(X, Y) = -\sum_{x} \sum_{y} p(x, y) \log p(x, y)$$
Joint entropy satisfies the inequality H(X, Y) <= H(X) + H(Y), with equality if and only if X and Y are independent. The chain rule for entropy decomposes joint entropy as:
$$H(X, Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)$$
The conditional entropy H(Y|X) quantifies the remaining uncertainty about Y after observing X:
$$H(Y|X) = -\sum_{x} \sum_{y} p(x, y) \log p(y|x)$$
Equivalently, H(Y|X) = H(X, Y) - H(X). Conditional entropy is always non-negative and satisfies H(Y|X) <= H(Y), meaning that conditioning on additional information never increases entropy on average.
Mutual information I(X; Y) measures the amount of information that one random variable contains about another:
$$I(X; Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = H(X) + H(Y) - H(X, Y)$$
Mutual information is symmetric, meaning I(X; Y) = I(Y; X), and is always non-negative. It equals zero if and only if X and Y are statistically independent. Mutual information can also be expressed as the Kullback-Leibler divergence between the joint distribution and the product of marginals:
$$I(X; Y) = D_{\mathrm{KL}}\big(p(x, y) \,\|\, p(x)\, p(y)\big)$$
In machine learning, mutual information is used for feature selection, clustering evaluation, and measuring the dependence between variables.
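The relationships above are straightforward to check numerically. The sketch below assumes a small, made-up joint distribution over two binary variables and derives the joint entropy, conditional entropy, and mutual information from it:

```python
import numpy as np

def entropy(p):
    """Entropy in bits of an array of probabilities (zero entries are ignored)."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Hypothetical joint distribution p(x, y) over two binary variables.
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])

H_xy = entropy(p_xy.ravel())        # joint entropy H(X, Y)
H_x  = entropy(p_xy.sum(axis=1))    # marginal entropy H(X)
H_y  = entropy(p_xy.sum(axis=0))    # marginal entropy H(Y)

H_y_given_x = H_xy - H_x            # chain rule: H(Y|X) = H(X,Y) - H(X)
mutual_info = H_x + H_y - H_xy      # I(X;Y) = H(X) + H(Y) - H(X,Y)

print(H_xy, H_x, H_y, H_y_given_x, mutual_info)
```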
The Kullback-Leibler (KL) divergence, also called relative entropy, measures how one probability distribution P diverges from a reference distribution Q:
$$D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}$$
Key properties of KL divergence include:

- Non-negativity: D_KL(P || Q) >= 0, with equality if and only if P = Q (Gibbs' inequality).
- Asymmetry: in general D_KL(P || Q) != D_KL(Q || P), so KL divergence is not a true distance metric.
- It does not satisfy the triangle inequality.
- It is finite only when Q assigns nonzero probability to every outcome to which P assigns nonzero probability.
KL divergence is foundational to variational inference, variational autoencoders, and many optimization procedures in machine learning.
Cross-entropy between a true distribution P and an estimated distribution Q is defined as:
$$H(P, Q) = -\sum_{x} P(x) \log Q(x)$$
Cross-entropy relates to entropy and KL divergence through the identity:
$$H(P, Q) = H(P) + D_{\mathrm{KL}}(P \,\|\, Q)$$
Since H(P) is constant with respect to Q, minimizing cross-entropy with respect to Q is equivalent to minimizing KL divergence. This is why cross-entropy is the standard loss function for classification tasks in deep learning. When used with a softmax output layer, minimizing cross-entropy is equivalent to maximum likelihood estimation.
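As a sanity check of the identity H(P, Q) = H(P) + D_KL(P || Q), the following sketch computes both sides for a pair of made-up distributions:

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def cross_entropy(p, q):
    return -np.sum(p * np.log2(q))

def kl_divergence(p, q):
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

p = np.array([0.7, 0.2, 0.1])   # "true" distribution (illustrative)
q = np.array([0.5, 0.3, 0.2])   # model's estimate

print(cross_entropy(p, q))               # H(P, Q)
print(entropy(p) + kl_divergence(p, q))  # H(P) + D_KL(P || Q), the same value
```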
For continuous random variables with probability density function f(x), the differential entropy (or continuous entropy) is defined as:
$$h(X) = -\int f(x) \log f(x)\, dx$$
Differential entropy shares some properties with discrete entropy but differs in important ways:
| Property | Discrete entropy | Differential entropy |
|---|---|---|
| Non-negativity | Always >= 0 | Can be negative |
| Coordinate invariance | Yes | No (changes under coordinate transformations) |
| Translation invariance | N/A | h(X + c) = h(X) |
| Scaling | N/A | h(aX) = h(X) + log\|a\| |
| Maximum entropy (fixed variance) | Uniform distribution | Gaussian distribution |
Notable differential entropy values for common distributions:
| Distribution | Differential entropy |
|---|---|
| Uniform U(a, b) | ln(b - a) |
| Gaussian N(mu, sigma^2) | (1/2) ln(2 pi e sigma^2) |
| Exponential Exp(lambda) | 1 + ln(1/lambda) |
The Gaussian distribution has the maximum differential entropy among all distributions with a given mean and variance, a result that justifies many Gaussian assumptions in statistical modeling.
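As an illustration, the sketch below compares the closed-form Gaussian entry from the table with a Monte Carlo estimate of h(X) = E[-ln f(X)]; the chosen sigma and sample size are arbitrary, and values are in nats since the natural logarithm is used:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0

# Closed form from the table: h = 0.5 * ln(2 * pi * e * sigma^2), in nats.
closed_form = 0.5 * np.log(2 * np.pi * np.e * sigma**2)

# Monte Carlo estimate: average of -ln f(X) over samples drawn from f.
x = rng.normal(loc=0.0, scale=sigma, size=200_000)
log_density = -0.5 * np.log(2 * np.pi * sigma**2) - x**2 / (2 * sigma**2)
mc_estimate = -log_density.mean()

print(closed_form, mc_estimate)   # the two values should agree closely
```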
Entropy plays a central role in the construction of decision trees, particularly in the ID3, C4.5, and C5.0 algorithms. At each node, the algorithm selects the feature that maximally reduces the entropy of the target variable.
Information gain (IG) is the reduction in entropy achieved by partitioning a dataset according to a feature A:
$$IG(S, A) = H(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\, H(S_v)$$
where H(S) is the entropy of the full dataset, S_v is the subset of examples with feature value v, and |S_v|/|S| is the proportion of examples in that subset. The feature with the highest information gain is chosen as the splitting criterion.
For a binary classification problem with positive proportion p_+ and negative proportion p_-:
$$H(S) = -p_+ \log_2 p_+ - p_- \log_2 p_-$$
A pure node (all examples belong to one class) has entropy 0, while a node with equal class proportions has entropy 1 bit.
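The following sketch computes information gain on a small, made-up dataset; the labels and feature values are purely illustrative:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy in bits of a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(labels, feature_values):
    """IG(S, A): entropy of S minus the weighted entropy of the partition by A."""
    total = len(labels)
    remainder = 0.0
    for v in set(feature_values):
        subset = [y for y, x in zip(labels, feature_values) if x == v]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(labels) - remainder

# Toy dataset (illustrative): does "outlook" help predict "play"?
play    = ["yes", "yes", "no", "no", "yes", "no"]
outlook = ["sunny", "sunny", "rain", "rain", "overcast", "sunny"]

print(entropy(play))                    # 1.0 bit (3 yes, 3 no)
print(information_gain(play, outlook))  # entropy reduction from splitting on outlook, ~0.541
```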
One limitation of information gain is its bias toward features with many distinct values. The gain ratio, introduced in the C4.5 algorithm, addresses this by normalizing information gain by the feature's intrinsic information:
$$\mathrm{GainRatio}(S, A) = \frac{IG(S, A)}{\mathrm{SplitInfo}(S, A)}$$
where

$$\mathrm{SplitInfo}(S, A) = -\sum_{v} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|}$$

This penalizes features that produce many small partitions.
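A minimal sketch of the gain ratio, reusing the information gain value (about 0.541) from the previous example; the dataset is the same toy example and the numbers are illustrative:

```python
import math
from collections import Counter

def split_entropy(values):
    """SplitInfo: entropy in bits of the feature-value counts themselves."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def gain_ratio(info_gain, feature_values):
    """Normalize an information gain by the intrinsic information of the split."""
    split_info = split_entropy(feature_values)
    return info_gain / split_info if split_info > 0 else 0.0

# Continuing the toy example above: outlook splits 6 examples into groups of 3, 2, 1.
outlook = ["sunny", "sunny", "rain", "rain", "overcast", "sunny"]
print(gain_ratio(0.541, outlook))   # information gain from the previous sketch, normalized
```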
| Criterion | Formula | Range | Used by |
|---|---|---|---|
| Entropy | -\sum p_i log_2 p_i | [0, log_2 C] | ID3, C4.5, C5.0 |
| Gini impurity | 1 - \sum p_i^2 | [0, 1 - 1/C] | CART |
Both entropy and Gini impurity measure node impurity, and in practice they often yield similar tree structures. Entropy is more computationally expensive due to the logarithm but provides a stronger information-theoretic interpretation. The CART algorithm uses Gini impurity by default, while ID3 and C4.5 rely on entropy.
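The similarity between the two criteria can be seen by evaluating both across a range of binary class proportions, as in this short sketch:

```python
import numpy as np

def entropy_impurity(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def gini_impurity(p):
    return 1.0 - np.sum(p**2)

# Both criteria are 0 for pure nodes and peak at equal class proportions.
for p_pos in [0.0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0]:
    dist = np.array([p_pos, 1.0 - p_pos])
    print(p_pos, round(entropy_impurity(dist), 3), round(gini_impurity(dist), 3))
```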
Entropy-based metrics are widely used to rank and select features in high-dimensional datasets:

- Information gain ranks each feature by how much it reduces the entropy of the target variable.
- Mutual information scores capture arbitrary (not merely linear) dependence between a feature and the target.
- Criteria such as minimum-redundancy maximum-relevance (mRMR) balance a feature's relevance to the target against its redundancy with already selected features, with both quantities measured via mutual information.
Entropy-based feature selection is particularly valuable in text classification, genomics, and other domains where the number of features can reach tens of thousands.
The principle of maximum entropy states that, given a set of constraints derived from observed data, the probability distribution that best represents the current state of knowledge is the one with the highest entropy. This principle, formalized by E. T. Jaynes in 1957, provides a principled method for constructing probability distributions when information is incomplete.
In machine learning, maximum entropy models have several important applications:

- Maximum entropy (multinomial logistic regression) classifiers, historically prominent in natural language processing tasks such as part-of-speech tagging and text classification.
- Maximum entropy Markov models and conditional random fields for sequence labeling.
- Maximum entropy reinforcement learning, where policies are encouraged to remain as stochastic as possible while still maximizing reward.
The maximum entropy principle provides a theoretical justification for many standard modeling choices. For instance, assuming a Gaussian distribution when only the mean and variance are known is equivalent to applying the maximum entropy principle.
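As a concrete illustration, the sketch below solves a classic maximum entropy problem numerically: finding the distribution over the faces of a die with a prescribed mean. It uses SciPy's general-purpose SLSQP optimizer, and the target mean of 4.5 is an arbitrary example value:

```python
import numpy as np
from scipy.optimize import minimize

# Outcomes of a hypothetical six-sided die and a mean constraint (illustrative).
values = np.arange(1, 7)
target_mean = 4.5   # observed average; an unconstrained fair die would give 3.5

def neg_entropy(p):
    p = np.clip(p, 1e-12, 1.0)
    return np.sum(p * np.log(p))   # negative entropy in nats

constraints = [
    {"type": "eq", "fun": lambda p: np.sum(p) - 1.0},                   # probabilities sum to 1
    {"type": "eq", "fun": lambda p: np.dot(p, values) - target_mean},   # mean constraint
]
result = minimize(neg_entropy, x0=np.full(6, 1 / 6), method="SLSQP",
                  bounds=[(0.0, 1.0)] * 6, constraints=constraints)

# The solution takes an exponential (Boltzmann-like) form: p_i proportional to exp(lambda * i).
print(np.round(result.x, 4))
```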
In reinforcement learning, entropy regularization adds an entropy bonus to the reward signal, encouraging the agent's policy to remain stochastic and explore more broadly. The entropy-regularized objective is:
$$J(\pi) = \mathbb{E}\left[\sum_t \big(r_t + \alpha\, H(\pi(\cdot \mid s_t))\big)\right]$$
where alpha is a temperature parameter controlling the tradeoff between reward maximization and entropy. Higher values of alpha favor more exploratory behavior, while lower values emphasize exploitation.
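The sketch below illustrates the entropy bonus for a single step of a made-up categorical policy; the reward, action probabilities, and alpha value are arbitrary choices:

```python
import numpy as np

def policy_entropy(probs):
    """Entropy (in nats) of a categorical policy's action distribution."""
    probs = probs[probs > 0]
    return -np.sum(probs * np.log(probs))

alpha = 0.2                            # temperature / entropy coefficient
reward = 1.0                           # environment reward at this step
policy = np.array([0.7, 0.2, 0.1])     # action probabilities pi(.|s_t)

# Entropy-regularized reward: the agent is paid extra for keeping its policy stochastic.
augmented_reward = reward + alpha * policy_entropy(policy)
print(policy_entropy(policy), augmented_reward)
```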
The Soft Actor-Critic (SAC) algorithm, introduced by Haarnoja et al. in 2018, is the most prominent example of entropy-regularized reinforcement learning. Key benefits of entropy regularization include:

- More thorough exploration, since the policy is rewarded for remaining stochastic.
- Reduced risk of premature convergence to a suboptimal deterministic policy.
- Improved robustness to hyperparameter choices and environment perturbations.
- The ability to capture multiple near-optimal modes of behavior rather than committing to a single one.
SAC uses an automatically tuned temperature parameter alpha that adjusts over the course of training to maintain a target entropy level, eliminating the need for manual tuning.
Entropy measures the information density of text. A passage with high entropy uses a diverse vocabulary and is harder to predict, while low-entropy text is repetitive and predictable. Applications include:

- Evaluating language models, whose cross-entropy (or perplexity) on held-out text measures how well they predict it.
- Detecting repetitive, templated, or spam-like text, which tends to have unusually low entropy.
- Estimating the compressibility and vocabulary richness of a corpus.
Entropy can serve as a signal for detecting anomalies in data streams. Under normal conditions, a system produces outputs with a characteristic entropy level. Significant deviations from this baseline may indicate anomalous behavior. For example:

- A sudden drop in the entropy of destination addresses in network traffic can indicate a denial-of-service attack focused on a single target, while a spike can indicate scanning behavior.
- Files with near-maximal byte-level entropy are often encrypted or packed, a common trait of malware payloads.
- Shifts in the entropy of event logs or user activity patterns can flag unusual system behavior.
Entropy quantifies the information content of images and is used for:

- Image thresholding and segmentation, for example maximum-entropy threshold selection.
- Texture analysis, where local entropy describes how varied a neighborhood of pixels is.
- Image registration, where mutual information between two images is maximized to align them.
Shannon's source coding theorem establishes that the entropy of a source is the theoretical lower bound on the average number of bits required to encode its output without loss. Practical compression algorithms such as Huffman coding and arithmetic coding approach this limit. The entropy rate of English text, estimated at roughly 1.0 to 1.5 bits per character, explains why English text can typically be compressed to about 12 to 19 percent of its original size relative to a standard 8-bit-per-character encoding.
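The sketch below estimates the per-character entropy of a text from single-character frequencies alone. Because it ignores dependence between characters, the estimate (around 4 bits per character for typical English) is an upper bound on the 1.0 to 1.5 bits per character cited above:

```python
import math
from collections import Counter

def char_entropy(text):
    """Empirical per-character entropy (in bits) under an i.i.d. character model."""
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

text = "the quick brown fox jumps over the lazy dog " * 20   # illustrative text
print(char_entropy(text))   # an upper bound on the true entropy rate, since context is ignored
```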
Shannon entropy and thermodynamic entropy are formally analogous. Boltzmann's entropy S = k_B ln W counts the number of microstates consistent with a macrostate, while Shannon's entropy H = -Σ p log p measures the uncertainty over possible outcomes. The connection becomes exact when the probabilities in Shannon's formula correspond to the Boltzmann distribution over energy states:
$$S = k_B H$$
where k_B is Boltzmann's constant (approximately 1.38 x 10^-23 J/K).
Landauer's principle (1961) makes this connection physically concrete: erasing one bit of information requires a minimum energy dissipation of k_B T ln 2, where T is the temperature of the environment. This result establishes a fundamental link between computation, information, and thermodynamics. Experimental verification of Landauer's principle was achieved in 2012 by Berut et al.
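Plugging in room temperature gives a sense of scale; this is a minimal calculation assuming T = 300 K:

```python
import math

k_B = 1.380649e-23   # Boltzmann's constant, J/K
T = 300.0            # room temperature, K

# Minimum energy dissipated when erasing one bit, per Landauer's principle.
landauer_limit = k_B * T * math.log(2)
print(landauer_limit)   # ~2.9e-21 joules
```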
| Quantity | Formula | Description |
|---|---|---|
| Shannon entropy | H(X) = -\sum p(x) log p(x) | Average uncertainty of a discrete random variable |
| Binary entropy | H_b(p) = -p log p - (1-p) log(1-p) | Entropy of a Bernoulli variable |
| Joint entropy | H(X,Y) = -\sum\sum p(x,y) log p(x,y) | Total uncertainty in a pair of variables |
| Conditional entropy | H(Y\|X) = H(X,Y) - H(X) | Uncertainty remaining in Y after observing X |
| Mutual information | I(X;Y) = H(X) + H(Y) - H(X,Y) | Shared information between two variables |
| KL divergence | D_KL(P \|\| Q) = \sum P(x) log(P(x)/Q(x)) | How far P diverges from a reference distribution Q |
| Cross-entropy | H(P,Q) = -\sum P(x) log Q(x) | Expected code length using Q to encode P |
| Differential entropy | h(X) = -\int f(x) log f(x) dx | Continuous analog of Shannon entropy |
| Information gain | IG(S,A) = H(S) - \sum_v (\|S_v\|/\|S\|) H(S_v) | Entropy reduction from splitting dataset S on feature A |