Entropy is a fundamental measure of uncertainty, randomness, or information content in a probability distribution. Originating from information theory, the concept was formalized by Claude Shannon in his landmark 1948 paper "A Mathematical Theory of Communication" and has since become one of the most widely used quantities in machine learning, statistics, physics, and data science. In machine learning, entropy serves as the foundation for loss functions such as cross-entropy, splitting criteria in decision trees, feature selection methods, and exploration strategies in reinforcement learning.
Imagine you have a bag of candy. If every single piece is a red gummy bear, there is no surprise when you reach in and pull one out. You already know what you are going to get. That bag has low entropy because there is almost no uncertainty.
Now imagine a bag filled with dozens of different candies: gummy bears, chocolate bars, lollipops, sour worms, and more, all mixed together. Every time you reach in, you have no idea what you will pull out. That bag has high entropy because there is a lot of uncertainty.
Entropy is just a number that tells you how surprised you should expect to be. When outcomes are predictable, entropy is low. When outcomes are unpredictable and spread out evenly among many possibilities, entropy is high. In machine learning, algorithms use entropy to figure out which questions (features) reduce surprise the most, helping them make better predictions.
The concept of entropy has roots in 19th-century thermodynamics. Rudolf Clausius introduced the term in 1865 to describe the irreversible dissipation of energy in physical systems. Ludwig Boltzmann later gave it a statistical interpretation in the 1870s, relating the entropy of a thermodynamic system to the number of microscopic configurations (microstates) consistent with its macroscopic state. Boltzmann's formula, S = k_B ln W, where k_B is Boltzmann's constant and W is the number of microstates, laid the groundwork for statistical mechanics.
In 1948, Claude Shannon independently developed an analogous quantity for communication systems. His paper, published in the Bell System Technical Journal, defined information entropy as a measure of the average uncertainty associated with a random variable's outcomes. Shannon reportedly chose the name "entropy" on the advice of the mathematician John von Neumann, who noted that "nobody knows what entropy really is, so in a debate you will always have the advantage." Shannon's work established the field of information theory and demonstrated that entropy sets a fundamental limit on lossless data compression.
E. T. Jaynes further bridged the gap between thermodynamics and information theory in 1957 by proposing the principle of maximum entropy. Jaynes argued that, given incomplete information, the most honest probability distribution to assign is the one that maximizes entropy subject to known constraints. This principle became a cornerstone of Bayesian inference and statistical modeling.
For a discrete random variable X with possible outcomes x_1, x_2, ..., x_n and probability mass function p(x), the Shannon entropy H(X) is defined as:
$$H(X) = -\sum_{i=1}^{n} p(x_i) \log p(x_i)$$
The convention 0 log 0 = 0 is adopted, consistent with the limit as p approaches zero. The base of the logarithm determines the unit of measurement:
| Logarithm base | Unit name | Common usage |
|---|---|---|
| 2 | Bit (shannon) | Information theory, computing |
| e | Nat (natural unit) | Theoretical analysis, machine learning |
| 10 | Hartley (ban) | Early information theory |
When base 2 is used, entropy measures the minimum average number of binary questions (yes/no) needed to identify the outcome of the random variable.
Each term p(x_i) log(1/p(x_i)) represents the "surprise" or "information content" of observing outcome x_i, weighted by its probability. Rare events carry more surprise (higher information content) than common events. Entropy is the expected value of this surprise across all outcomes.
For example, consider a fair coin with P(heads) = P(tails) = 0.5. The entropy is:
$$H = -(0.5 \log_2 0.5 + 0.5 \log_2 0.5) = -(0.5 \times (-1) + 0.5 \times (-1)) = 1 \text{ bit}$$
This means one binary digit is needed on average to encode each coin flip. By contrast, a biased coin with P(heads) = 0.9 and P(tails) = 0.1 has entropy:
$$H = -(0.9 \log_2 0.9 + 0.1 \log_2 0.1) \approx 0.469 \text{ bits}$$
The biased coin is more predictable, so less information is needed to describe its outcomes.
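These calculations are easy to verify programmatically. The short Python sketch below (the helper name `shannon_entropy` is purely illustrative) computes the entropy in bits of any discrete distribution and reproduces the two coin values above:

```python
import math

def shannon_entropy(probs, base=2):
    """Shannon entropy of a discrete distribution given as a list of probabilities."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

print(shannon_entropy([0.5, 0.5]))   # fair coin    -> 1.0 bit
print(shannon_entropy([0.9, 0.1]))   # biased coin  -> ~0.469 bits
print(shannon_entropy([1.0]))        # certain outcome -> 0.0 bits
```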
Shannon entropy satisfies several important mathematical properties:
| Property | Statement | Significance |
|---|---|---|
| Non-negativity | H(X) >= 0 | Uncertainty is never negative for discrete variables |
| Maximum entropy | H(X) <= log(n), with equality iff p is uniform | Uniform distribution is the most uncertain |
| Concavity | H(lambda p + (1-lambda) q) >= lambda H(p) + (1-lambda) H(q) | Mixing distributions cannot decrease entropy |
| Additivity | H(X, Y) = H(X) + H(Y) for independent X, Y | Entropies of independent sources add |
| Conditioning reduces entropy | H(X\|Y) <= H(X) | On average, observing another variable never increases uncertainty about X |
| Subadditivity | H(X, Y) <= H(X) + H(Y) | Joint entropy never exceeds the sum of marginals |
| Permutation invariance | H is unchanged by reordering outcomes | Entropy depends only on probability values, not labels |
Shannon showed that a small set of axioms (continuity, monotonic growth in the number of equally likely outcomes, and a recursive grouping property) uniquely determines the entropy function up to a constant factor, meaning any measure of uncertainty satisfying these axioms must take the logarithmic form above.
The binary entropy function is a special case of Shannon entropy for a Bernoulli random variable with parameter p (probability of success):
$$H_b(p) = -p \log_2 p - (1 - p) \log_2 (1 - p)$$
This function maps the interval [0, 1] to [0, 1] and has the following characteristics:

- It equals 0 at p = 0 and p = 1, where the outcome is certain.
- It reaches its maximum value of 1 bit at p = 0.5, the point of greatest uncertainty.
- It is symmetric about p = 0.5, since H_b(p) = H_b(1 - p).
- It is concave in p.
The binary entropy function appears frequently in coding theory, hypothesis testing, and as a building block for analyzing more complex entropy expressions.
The joint entropy of two discrete random variables X and Y measures the total uncertainty in the pair (X, Y):
$$H(X, Y) = -\sum_{x} \sum_{y} p(x, y) \log p(x, y)$$
Joint entropy satisfies the inequality H(X, Y) <= H(X) + H(Y), with equality if and only if X and Y are independent. The chain rule for entropy decomposes joint entropy as:
$$H(X, Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)$$
The conditional entropy H(Y|X) quantifies the remaining uncertainty about Y after observing X:
$$H(Y|X) = -\sum_{x} \sum_{y} p(x, y) \log p(y|x)$$
Equivalently, H(Y|X) = H(X, Y) - H(X). Conditional entropy is always non-negative and satisfies H(Y|X) <= H(Y), meaning that conditioning on additional information never increases entropy on average.
Mutual information I(X; Y) measures the amount of information that one random variable contains about another:
$$I(X; Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = H(X) + H(Y) - H(X, Y)$$
Mutual information is symmetric, meaning I(X; Y) = I(Y; X), and is always non-negative. It equals zero if and only if X and Y are statistically independent. Mutual information can also be expressed as the Kullback-Leibler divergence between the joint distribution and the product of marginals:
$$I(X; Y) = D_{\mathrm{KL}}\big(p(x, y) \,\|\, p(x)\, p(y)\big)$$
In machine learning, mutual information is used for feature selection, clustering evaluation, and measuring the dependence between variables.
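The relationships above are straightforward to check numerically. The sketch below assumes a small, made-up joint distribution over two binary variables and derives the joint entropy, conditional entropy, and mutual information from it:

```python
import numpy as np

def entropy(p):
    """Entropy in bits of an array of probabilities (zero entries are ignored)."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Hypothetical joint distribution p(x, y) over two binary variables.
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])

H_xy = entropy(p_xy.ravel())        # joint entropy H(X, Y)
H_x  = entropy(p_xy.sum(axis=1))    # marginal entropy H(X)
H_y  = entropy(p_xy.sum(axis=0))    # marginal entropy H(Y)

H_y_given_x = H_xy - H_x            # chain rule: H(Y|X) = H(X,Y) - H(X)
mutual_info = H_x + H_y - H_xy      # I(X;Y) = H(X) + H(Y) - H(X,Y)

print(H_xy, H_x, H_y, H_y_given_x, mutual_info)
```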
The Kullback-Leibler (KL) divergence, also called relative entropy, measures how one probability distribution P diverges from a reference distribution Q:
$$D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}$$
Key properties of KL divergence include:

- Non-negativity: D_KL(P || Q) >= 0, with equality if and only if P = Q (Gibbs' inequality).
- Asymmetry: in general D_KL(P || Q) != D_KL(Q || P), so KL divergence is not a true distance metric.
- It does not satisfy the triangle inequality.
- It is finite only when Q assigns nonzero probability to every outcome to which P assigns nonzero probability.
KL divergence is foundational to variational inference, variational autoencoders, and many optimization procedures in machine learning.
Cross-entropy between a true distribution P and an estimated distribution Q is defined as:
$$H(P, Q) = -\sum_{x} P(x) \log Q(x)$$
Cross-entropy relates to entropy and KL divergence through the identity:
$$H(P, Q) = H(P) + D_{\mathrm{KL}}(P \,\|\, Q)$$
Since H(P) is constant with respect to Q, minimizing cross-entropy with respect to Q is equivalent to minimizing KL divergence. This is why cross-entropy is the standard loss function for classification tasks in deep learning. When used with a softmax output layer, minimizing cross-entropy is equivalent to maximum likelihood estimation.
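As a sanity check of the identity H(P, Q) = H(P) + D_KL(P || Q), the following sketch computes both sides for a pair of made-up distributions:

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def cross_entropy(p, q):
    return -np.sum(p * np.log2(q))

def kl_divergence(p, q):
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

p = np.array([0.7, 0.2, 0.1])   # "true" distribution (illustrative)
q = np.array([0.5, 0.3, 0.2])   # model's estimate

print(cross_entropy(p, q))               # H(P, Q)
print(entropy(p) + kl_divergence(p, q))  # H(P) + D_KL(P || Q), the same value
```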
For continuous random variables with probability density function f(x), the differential entropy (or continuous entropy) is defined as:
$$h(X) = -\int f(x) \log f(x)\, dx$$
Differential entropy shares some properties with discrete entropy but differs in important ways:
| Property | Discrete entropy | Differential entropy |
|---|---|---|
| Non-negativity | Always >= 0 | Can be negative |
| Coordinate invariance | Yes | No (changes under coordinate transformations) |
| Translation invariance | N/A | h(X + c) = h(X) |
| Scaling | N/A | h(aX) = h(X) + log\|a\| |
| Maximum entropy (fixed variance) | Uniform distribution | Gaussian distribution |
Notable differential entropy values for common distributions:
| Distribution | Differential entropy |
|---|---|
| Uniform U(a, b) | ln(b - a) |
| Gaussian N(mu, sigma^2) | (1/2) ln(2 pi e sigma^2) |
| Exponential Exp(lambda) | 1 + ln(1/lambda) |
The Gaussian distribution has the maximum differential entropy among all distributions with a given mean and variance, a result that justifies many Gaussian assumptions in statistical modeling.
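As an illustration, the sketch below compares the closed-form Gaussian entry from the table with a Monte Carlo estimate of h(X) = E[-ln f(X)]; the chosen sigma and sample size are arbitrary, and values are in nats since the natural logarithm is used:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0

# Closed form from the table: h = 0.5 * ln(2 * pi * e * sigma^2), in nats.
closed_form = 0.5 * np.log(2 * np.pi * np.e * sigma**2)

# Monte Carlo estimate: average of -ln f(X) over samples drawn from f.
x = rng.normal(loc=0.0, scale=sigma, size=200_000)
log_density = -0.5 * np.log(2 * np.pi * sigma**2) - x**2 / (2 * sigma**2)
mc_estimate = -log_density.mean()

print(closed_form, mc_estimate)   # the two values should agree closely
```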
Entropy plays a central role in the construction of decision trees, particularly in the ID3, C4.5, and C5.0 algorithms. At each node, the algorithm selects the feature that maximally reduces the entropy of the target variable.
Information gain (IG) is the reduction in entropy achieved by partitioning a dataset according to a feature A:
$$IG(S, A) = H(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\, H(S_v)$$
where H(S) is the entropy of the full dataset, S_v is the subset of examples with feature value v, and |S_v|/|S| is the proportion of examples in that subset. The feature with the highest information gain is chosen as the splitting criterion.
For a binary classification problem with positive proportion p_+ and negative proportion p_-:
$$H(S) = -p_+ \log_2 p_+ - p_- \log_2 p_-$$
A pure node (all examples belong to one class) has entropy 0, while a node with equal class proportions has entropy 1 bit.
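The following sketch computes information gain on a small, made-up dataset; the labels and feature values are purely illustrative:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy in bits of a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(labels, feature_values):
    """IG(S, A): entropy of S minus the weighted entropy of the partition by A."""
    total = len(labels)
    remainder = 0.0
    for v in set(feature_values):
        subset = [y for y, x in zip(labels, feature_values) if x == v]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(labels) - remainder

# Toy dataset (illustrative): does "outlook" help predict "play"?
play    = ["yes", "yes", "no", "no", "yes", "no"]
outlook = ["sunny", "sunny", "rain", "rain", "overcast", "sunny"]

print(entropy(play))                    # 1.0 bit (3 yes, 3 no)
print(information_gain(play, outlook))  # entropy reduction from splitting on outlook, ~0.541
```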
One limitation of information gain is its bias toward features with many distinct values. The gain ratio, introduced in the C4.5 algorithm, addresses this by normalizing information gain by the feature's intrinsic information:
$$\mathrm{GainRatio}(S, A) = \frac{IG(S, A)}{\mathrm{SplitInfo}(S, A)}$$
where

$$\mathrm{SplitInfo}(S, A) = -\sum_{v} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|}$$

This penalizes features that produce many small partitions.
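A minimal sketch of the gain ratio, reusing the information gain value (about 0.541) from the previous example; the dataset is the same toy example and the numbers are illustrative:

```python
import math
from collections import Counter

def split_entropy(values):
    """SplitInfo: entropy in bits of the feature-value counts themselves."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def gain_ratio(info_gain, feature_values):
    """Normalize an information gain by the intrinsic information of the split."""
    split_info = split_entropy(feature_values)
    return info_gain / split_info if split_info > 0 else 0.0

# Continuing the toy example above: outlook splits 6 examples into groups of 3, 2, 1.
outlook = ["sunny", "sunny", "rain", "rain", "overcast", "sunny"]
print(gain_ratio(0.541, outlook))   # information gain from the previous sketch, normalized
```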
| Criterion | Formula | Range | Used by |
|---|---|---|---|
| Entropy | -\sum p_i log_2 p_i | [0, log_2 C] | ID3, C4.5, C5.0 |
| Gini impurity | 1 - \sum p_i^2 | [0, 1 - 1/C] | CART |
Both entropy and Gini impurity measure node impurity, and in practice they often yield similar tree structures. Entropy is more computationally expensive due to the logarithm but provides a stronger information-theoretic interpretation. The CART algorithm uses Gini impurity by default, while ID3 and C4.5 rely on entropy.
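The similarity between the two criteria can be seen by evaluating both across a range of binary class proportions, as in this short sketch:

```python
import numpy as np

def entropy_impurity(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def gini_impurity(p):
    return 1.0 - np.sum(p**2)

# Both criteria are 0 for pure nodes and peak at equal class proportions.
for p_pos in [0.0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0]:
    dist = np.array([p_pos, 1.0 - p_pos])
    print(p_pos, round(entropy_impurity(dist), 3), round(gini_impurity(dist), 3))
```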
Entropy-based metrics are widely used to rank and select features in high-dimensional datasets:

- Information gain ranks each feature by how much it reduces the entropy of the target variable.
- Mutual information scores capture arbitrary (not merely linear) dependence between a feature and the target.
- Criteria such as minimum-redundancy maximum-relevance (mRMR) balance a feature's relevance to the target against its redundancy with already selected features, with both quantities measured via mutual information.
Entropy-based feature selection is particularly valuable in text classification, genomics, and other domains where the number of features can reach tens of thousands.
The principle of maximum entropy states that, given a set of constraints derived from observed data, the probability distribution that best represents the current state of knowledge is the one with the highest entropy. This principle, formalized by E. T. Jaynes in 1957, provides a principled method for constructing probability distributions when information is incomplete.
In machine learning, maximum entropy models have several important applications:

- Maximum entropy (multinomial logistic regression) classifiers, historically prominent in natural language processing tasks such as part-of-speech tagging and text classification.
- Maximum entropy Markov models and conditional random fields for sequence labeling.
- Maximum entropy reinforcement learning, where policies are encouraged to remain as stochastic as possible while still maximizing reward.
The maximum entropy principle provides a theoretical justification for many standard modeling choices. For instance, assuming a Gaussian distribution when only the mean and variance are known is equivalent to applying the maximum entropy principle.
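As a concrete illustration, the sketch below solves a classic maximum entropy problem numerically: finding the distribution over the faces of a die with a prescribed mean. It uses SciPy's general-purpose SLSQP optimizer, and the target mean of 4.5 is an arbitrary example value:

```python
import numpy as np
from scipy.optimize import minimize

# Outcomes of a hypothetical six-sided die and a mean constraint (illustrative).
values = np.arange(1, 7)
target_mean = 4.5   # observed average; an unconstrained fair die would give 3.5

def neg_entropy(p):
    p = np.clip(p, 1e-12, 1.0)
    return np.sum(p * np.log(p))   # negative entropy in nats

constraints = [
    {"type": "eq", "fun": lambda p: np.sum(p) - 1.0},                   # probabilities sum to 1
    {"type": "eq", "fun": lambda p: np.dot(p, values) - target_mean},   # mean constraint
]
result = minimize(neg_entropy, x0=np.full(6, 1 / 6), method="SLSQP",
                  bounds=[(0.0, 1.0)] * 6, constraints=constraints)

# The solution takes an exponential (Boltzmann-like) form: p_i proportional to exp(lambda * i).
print(np.round(result.x, 4))
```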
In reinforcement learning, entropy regularization adds an entropy bonus to the reward signal, encouraging the agent's policy to remain stochastic and explore more broadly. The entropy-regularized objective is:
$$J(\pi) = \mathbb{E}\left[\sum_t \big(r_t + \alpha\, H(\pi(\cdot \mid s_t))\big)\right]$$
where alpha is a temperature parameter controlling the tradeoff between reward maximization and entropy. Higher values of alpha favor more exploratory behavior, while lower values emphasize exploitation.
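The sketch below illustrates the entropy bonus for a single step of a made-up categorical policy; the reward, action probabilities, and alpha value are arbitrary choices:

```python
import numpy as np

def policy_entropy(probs):
    """Entropy (in nats) of a categorical policy's action distribution."""
    probs = probs[probs > 0]
    return -np.sum(probs * np.log(probs))

alpha = 0.2                            # temperature / entropy coefficient
reward = 1.0                           # environment reward at this step
policy = np.array([0.7, 0.2, 0.1])     # action probabilities pi(.|s_t)

# Entropy-regularized reward: the agent is paid extra for keeping its policy stochastic.
augmented_reward = reward + alpha * policy_entropy(policy)
print(policy_entropy(policy), augmented_reward)
```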
The Soft Actor-Critic (SAC) algorithm, introduced by Haarnoja et al. in 2018, is the most prominent example of entropy-regularized reinforcement learning. Key benefits of entropy regularization include:

- More thorough exploration, since the policy is rewarded for remaining stochastic.
- Reduced risk of premature convergence to a suboptimal deterministic policy.
- Improved robustness to hyperparameter choices and environment perturbations.
- The ability to capture multiple near-optimal modes of behavior rather than committing to a single one.
SAC uses an automatically tuned temperature parameter alpha that adjusts over the course of training to maintain a target entropy level, eliminating the need for manual tuning.
Entropy measures the information density of text. A passage with high entropy uses a diverse vocabulary and is harder to predict, while low-entropy text is repetitive and predictable. Applications include:

- Evaluating language models, whose cross-entropy (or perplexity) on held-out text measures how well they predict it.
- Detecting repetitive, templated, or spam-like text, which tends to have unusually low entropy.
- Estimating the compressibility and vocabulary richness of a corpus.
Entropy can serve as a signal for detecting anomalies in data streams. Under normal conditions, a system produces outputs with a characteristic entropy level. Significant deviations from this baseline may indicate anomalous behavior. For example:

- A sudden drop in the entropy of destination addresses in network traffic can indicate a denial-of-service attack focused on a single target, while a spike can indicate scanning behavior.
- Files with near-maximal byte-level entropy are often encrypted or packed, a common trait of malware payloads.
- Shifts in the entropy of event logs or user activity patterns can flag unusual system behavior.
Entropy quantifies the information content of images and is used for:

- Image thresholding and segmentation, for example maximum-entropy threshold selection.
- Texture analysis, where local entropy describes how varied a neighborhood of pixels is.
- Image registration, where mutual information between two images is maximized to align them.
Shannon's source coding theorem establishes that the entropy of a source is the theoretical lower bound on the average number of bits required to encode its output without loss. Practical compression algorithms such as Huffman coding and arithmetic coding approach this limit. The entropy rate of English text, estimated at roughly 1.0 to 1.5 bits per character, explains why English text can typically be compressed to about 12 to 19 percent of its original size relative to a standard 8-bit-per-character encoding.
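The sketch below estimates the per-character entropy of a text from single-character frequencies alone. Because it ignores dependence between characters, the estimate (around 4 bits per character for typical English) is an upper bound on the 1.0 to 1.5 bits per character cited above:

```python
import math
from collections import Counter

def char_entropy(text):
    """Empirical per-character entropy (in bits) under an i.i.d. character model."""
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

text = "the quick brown fox jumps over the lazy dog " * 20   # illustrative text
print(char_entropy(text))   # an upper bound on the true entropy rate, since context is ignored
```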
Shannon entropy and thermodynamic entropy are formally analogous. Boltzmann's entropy S = k_B ln W counts the number of microstates consistent with a macrostate, while Shannon's entropy H = -Σ p log p measures the uncertainty over possible outcomes. The connection becomes exact when the probabilities in Shannon's formula correspond to the Boltzmann distribution over energy states:
$$S = k_B H$$
where k_B is Boltzmann's constant (approximately 1.38 x 10^-23 J/K).
Landauer's principle (1961) makes this connection physically concrete: erasing one bit of information requires a minimum energy dissipation of k_B T ln 2, where T is the temperature of the environment. This result establishes a fundamental link between computation, information, and thermodynamics. Experimental verification of Landauer's principle was achieved in 2012 by Berut et al.
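Plugging in room temperature gives a sense of scale; this is a minimal calculation assuming T = 300 K:

```python
import math

k_B = 1.380649e-23   # Boltzmann's constant, J/K
T = 300.0            # room temperature, K

# Minimum energy dissipated when erasing one bit, per Landauer's principle.
landauer_limit = k_B * T * math.log(2)
print(landauer_limit)   # ~2.9e-21 joules
```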
| Quantity | Formula | Description |
|---|---|---|
| Shannon entropy | H(X) = -\sum p(x) log p(x) | Average uncertainty of a discrete random variable |
| Binary entropy | H_b(p) = -p log p - (1-p) log(1-p) | Entropy of a Bernoulli variable |
| Joint entropy | H(X,Y) = -\sum\sum p(x,y) log p(x,y) | Total uncertainty in a pair of variables |
| Conditional entropy | H(Y\|X) = H(X,Y) - H(X) | Uncertainty remaining in Y after observing X |
| Mutual information | I(X;Y) = H(X) + H(Y) - H(X,Y) | Shared information between two variables |
| KL divergence | D_KL(P \|\| Q) = \sum P(x) log(P(x)/Q(x)) | How far P diverges from a reference distribution Q |
| Cross-entropy | H(P,Q) = -\sum P(x) log Q(x) | Expected code length using Q to encode P |
| Differential entropy | h(X) = -\int f(x) log f(x) dx | Continuous analog of Shannon entropy |
| Information gain | IG(S,A) = H(S) - \sum_v (\|S_v\|/\|S\|) H(S_v) | Entropy reduction from splitting dataset S on feature A |