Information theory
Last reviewed
Apr 28, 2026
Sources
22 citations
Review status
Source-backed
Revision
v1 ยท 4,496 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
Apr 28, 2026
Sources
22 citations
Review status
Source-backed
Revision
v1 ยท 4,496 words
Add missing citations, update stale details, or suggest a clearer explanation.
Information theory is the mathematical study of the quantification, storage, and communication of information. The field was founded by [[claude_shannon|Claude Shannon]] in his 1948 paper "A Mathematical Theory of Communication," published in two parts in the Bell System Technical Journal while he was a researcher at [[bell_labs|Bell Labs]] [1]. Shannon's framework gave engineers a precise way to talk about how much information a source produces, how much can be squeezed through a noisy channel, and how close any practical scheme can come to the fundamental limits.
The core quantities of the theory, [[entropy|entropy]], [[mutual_information|mutual information]], channel capacity, and [[kl_divergence|Kullback-Leibler divergence]], are now standard tools across electrical engineering, statistics, computer science, theoretical physics, and machine learning [2][3]. Modern data compression formats, error-correcting codes used in mobile networks and storage, classification loss functions, [[variational_autoencoder|variational autoencoders]], policy optimization in reinforcement learning, and the [[information_bottleneck|information bottleneck]] view of deep networks all draw directly on Shannon's vocabulary.
The pre-Shannon period contained important precursors. In 1924, Harry Nyquist published "Certain Factors Affecting Telegraph Speed," relating signaling rate to bandwidth and showing that the speed of intelligence transmission grows logarithmically with the number of distinct signal levels [4]. In 1928, Ralph Hartley followed with "Transmission of Information," proposing a measure of information equal to log(s^n), where s is the alphabet size and n the number of symbols [5]. Hartley's measure ignored probabilities and treated all messages as equally likely, but it established the logarithm as the natural unit for information.
Shannon's 1948 paper unified and generalized these ideas. Working at Bell Labs, Shannon framed communication as a probabilistic problem: a source emits symbols according to some probability distribution, the symbols pass through a noisy channel, and a receiver tries to reconstruct the original message. Shannon defined the entropy of a source, proved that this entropy gives the minimum average number of bits per symbol needed to encode the source losslessly, and proved that there is a maximum rate, the channel capacity, at which information can be transmitted with arbitrarily low error probability over a noisy channel [1].
The paper appeared in two parts in the July and October 1948 issues of the Bell System Technical Journal. Shannon used the term "communication theory" in the 1948 paper itself; the label "information theory" came into use shortly after. In 1949, Shannon and Warren Weaver expanded the work into a short book, The Mathematical Theory of Communication, with Weaver contributing a non-technical exposition for a general audience [6]. The same year, Shannon published "Communication Theory of Secrecy Systems" [7], laying the mathematical foundation for modern cryptography.
Within a few years, Shannon's framework had been extended in several directions. In 1951, Solomon Kullback and Richard Leibler introduced the directed divergence between two probability distributions, now called the Kullback-Leibler divergence or relative entropy [8]. In 1957, Edwin Jaynes connected information theory to statistical physics through the maximum entropy principle [9]. The IEEE established the Transactions on Information Theory journal in 1963.
The Shannon entropy of a discrete random variable X with probability mass function p(x) is
H(X) = - sum_x p(x) log p(x),
where 0 log 0 is taken to be 0 by convention [2]. The base of the logarithm fixes the unit:
| Logarithm base | Unit | Typical use |
|---|---|---|
| 2 | bit (also called shannon) | Communication, computer science |
| e | nat | Calculus, statistics, machine learning |
| 10 | hartley or dit | Older engineering literature |
Entropy is non-negative and equals zero exactly when the distribution is concentrated on a single outcome. For a fixed alphabet of size n, entropy is maximized by the uniform distribution, with H = log n. It is additive for independent variables: if X and Y are independent, then H(X, Y) = H(X) + H(Y). Entropy is invariant under bijective relabeling of outcomes; only the probabilities matter.
A standard interpretation is that entropy measures the average uncertainty in X, the average information content of an outcome, or the minimum average number of bits needed to encode samples from X losslessly. A fair coin has H = 1 bit. A fair six-sided die has H = log2(6) ~= 2.585 bits. A source that always emits the same letter has entropy 0 [3]. For a Bernoulli random variable with success probability p, the binary entropy function H_b(p) = -p log p - (1-p) log(1-p) is symmetric around p = 0.5, attains its maximum value of 1 bit there, and falls to zero at p = 0 and p = 1.
For two discrete random variables X and Y with joint distribution p(x, y), the joint entropy is
H(X, Y) = - sum_{x, y} p(x, y) log p(x, y).
The conditional entropy of Y given X is
H(Y | X) = sum_x p(x) H(Y | X = x) = - sum_{x, y} p(x, y) log p(y | x),
and it satisfies the chain rule
H(X, Y) = H(X) + H(Y | X) = H(Y) + H(X | Y).
Conditional entropy is the average remaining uncertainty in Y after X is observed. It is non-negative and zero exactly when Y is a deterministic function of X [2].
The mutual information between X and Y is
I(X; Y) = H(X) + H(Y) - H(X, Y) = H(X) - H(X | Y) = H(Y) - H(Y | X).
It is symmetric, non-negative, and zero exactly when X and Y are statistically independent. Mutual information measures the average reduction in uncertainty about X from observing Y, or equivalently the amount of shared information between the two variables. It can also be written in terms of the joint and product distributions as
I(X; Y) = D_KL( p(x, y) || p(x) p(y) ),
which identifies mutual information as the KL divergence between the joint distribution and the product of the marginals [2]. Unlike correlation, mutual information captures non-linear dependencies and any form of statistical association.
The [[cross_entropy|cross-entropy]] between two distributions p and q over the same alphabet is
H(p, q) = - sum_x p(x) log q(x).
If p is the true distribution of a source and q is the distribution assumed by an encoder, then H(p, q) is the average number of bits per symbol used by an optimal code designed for q when the source is actually p. By Gibbs' inequality, H(p, q) >= H(p), with equality if and only if p = q.
The Kullback-Leibler divergence (also called relative entropy or information divergence) is the gap between cross-entropy and entropy:
D_KL(p || q) = sum_x p(x) log ( p(x) / q(x) ) = H(p, q) - H(p).
It was introduced by [[solomon_kullback|Solomon Kullback]] and [[richard_leibler|Richard Leibler]] in 1951 [8]. KL divergence is always non-negative and equals zero if and only if p = q. It is not symmetric in general (D_KL(p || q) is usually different from D_KL(q || p)) and it does not satisfy the triangle inequality, so it is not a metric. It can nonetheless be interpreted as the expected number of extra bits per symbol incurred when coding samples from p with a code optimal for q.
The relation H(p, q) = H(p) + D_KL(p || q) matters in machine learning. When p is an empirical distribution that does not depend on model parameters, H(p) is constant, so minimizing the cross-entropy H(p, q_theta) with respect to theta is equivalent to minimizing D_KL(p || q_theta) and to maximum likelihood estimation [3].
For a continuous random variable X with probability density function f(x), the differential entropy is
h(X) = - integral f(x) log f(x) dx.
Differential entropy shares additivity and a chain rule with discrete entropy but has important differences. It is not invariant under coordinate transformations: a smooth change of variables introduces a Jacobian determinant that shifts h(X). It can also be negative; a Gaussian with very small variance has large negative differential entropy.
For a Gaussian with variance sigma^2, h(X) = 0.5 log(2 pi e sigma^2). Among all continuous distributions on the real line with given variance, the Gaussian has the largest differential entropy. KL divergence and mutual information are well defined and non-negative for continuous random variables, even when individual differential entropies are not [2].
A discrete memoryless channel is specified by a conditional distribution p(y | x) describing the probability of receiving symbol y when symbol x is transmitted. The channel capacity is
C = max_{p(x)} I(X; Y),
where the maximum runs over all input distributions. The capacity has units of bits per channel use. For continuous channels with input power constraints, an analogous definition uses differential entropy. For an additive white Gaussian noise channel with bandwidth B, signal power S, and noise power N, Shannon derived the famous formula
C = B log2(1 + S/N) bits per second [1].
This equation set the bar for every modulation and coding system designed since.
Shannon's 1948 paper proved two foundational theorems that remain the pillars of information theory.
The source coding theorem, also called Shannon's first theorem, states that a stationary source with entropy rate H bits per symbol can be losslessly compressed to an average of H bits per symbol and no further. Concretely, for any rate R > H there exists a coding scheme that achieves average code length R and decodes without error in the limit of long blocks; for any rate R < H, no such scheme exists, since some symbols must be assigned the same code word and the source cannot be reconstructed [1][2]. The source coding theorem fixes the fundamental limit of any lossless compressor.
The noisy channel coding theorem, or Shannon's second theorem, states that for a discrete memoryless channel with capacity C, any rate R < C is achievable, meaning there exists a sequence of codes of increasing block length such that the maximum probability of decoding error tends to zero. For any R > C, the probability of error is bounded below by a positive constant and reliable communication is impossible [1][2]. The proof uses random coding and joint typicality and shows the existence of capacity-achieving codes without exhibiting one. Constructing explicit codes that approach capacity occupied much of the next sixty years of research.
A related result is the rate-distortion theorem, which characterizes the trade-off between compression rate and reconstruction distortion in lossy compression. It defines the rate-distortion function R(D) as the minimum information rate needed to reconstruct a source within average distortion D for a chosen distortion function such as squared error or Hamming distance. Lossy compressors for images, audio, and video are designed against the limit set by R(D).
Lossless compression algorithms approach the entropy bound predicted by the source coding theorem. The earliest practical scheme was the Shannon-Fano code, developed by Shannon and independently by Robert Fano in 1948 and 1949, which assigns shorter binary code words to more probable symbols using a recursive partitioning procedure. In 1952, David Huffman introduced [[huffman_coding|Huffman coding]] as a class assignment at MIT [10]. Huffman coding produces an optimal prefix code for a known symbol distribution, with average length within one bit of the source entropy, and is still used in formats such as JPEG, MP3, and DEFLATE, the algorithm behind ZIP and gzip.
[[arithmetic_coding|Arithmetic coding]], developed in the 1970s and 1980s, encodes an entire message as a single fraction in [0, 1) and asymptotically achieves the entropy bound for any distribution, beating Huffman coding by less than one bit per message [3]. Modern formats such as JPEG 2000, H.264 CABAC, AV1, and many neural compressors rely on arithmetic coding or its close cousin range coding.
The [[lempel_ziv|Lempel-Ziv]] family of algorithms (LZ77 in 1977, LZ78 in 1978, LZW in 1984) takes a different approach. These methods are universal: they do not require a model of the source distribution, yet they asymptotically achieve the entropy rate of any stationary ergodic source. LZ77 and its descendants form the basis of DEFLATE, gzip, PNG, and zstd, while LZ78 and LZW underlie GIF. Bzip2 combines the Burrows-Wheeler transform with Huffman coding, and modern compressors such as zstd, brotli, and LZMA combine dictionary methods with arithmetic-style entropy coders to approach the practical limits of compression.
For lossy compression, transform coding methods such as the discrete cosine transform underlie JPEG, MP3, and MPEG. Vector quantization, wavelet coding, and learned neural network codecs operate within the rate-distortion framework, balancing rate against perceptual or pixel-level distortion measures.
Shannon's noisy channel coding theorem is non-constructive: it proves capacity-achieving codes exist without showing how to build one. The history of [[error_correcting_code|error-correcting codes]] is the history of trying to close the gap.
In 1950, Richard Hamming published the family of single-error-correcting Hamming codes [11]. In 1960, Irving Reed and Gustave Solomon introduced [[reed_solomon_code|Reed-Solomon codes]], a class of polynomial codes over finite fields that correct multiple symbol errors and are still used for CDs, DVDs, QR codes, hard drives, deep-space probes such as Voyager, and DSL. Convolutional codes, decoded by Andrew Viterbi's 1967 algorithm, dominated mobile and satellite communications for several decades.
A dramatic step toward Shannon's limit came in 1993 when Claude Berrou, Alain Glavieux, and Punya Thitimajshima introduced turbo codes, which use parallel concatenated convolutional codes and iterative decoding to come within a fraction of a decibel of capacity on additive white Gaussian noise channels [12]. [[ldpc_code|Low-density parity-check (LDPC) codes]] were proposed by Robert Gallager in 1962 in his doctoral thesis but largely ignored until they were rediscovered in the late 1990s. Modern LDPC codes power Wi-Fi 6, 5G data channels, DVB-S2, and many storage systems.
In 2008, Erdal Arikan introduced [[polar_code|polar codes]], the first explicit family of codes proven to achieve the capacity of a wide class of channels with low-complexity encoding and decoding [13]. Polar codes are part of the 5G standard for control channels and represent the first time a constructive solution matches the existence proof in Shannon's 1948 theorem.
Information-theoretic ideas appear throughout machine learning, both as building blocks of training objectives and as conceptual frameworks for understanding model behavior.
The standard loss function for classification is the [[cross_entropy_loss|cross-entropy loss]]. Given training labels with empirical class distribution p(y | x) and a model output distribution q_theta(y | x), the loss is
L(theta) = - E_{x, y ~ data} log q_theta(y | x).
This is the cross-entropy between the empirical and predicted distributions, averaged over the data. Because p is fixed, minimizing cross-entropy is equivalent to minimizing the KL divergence D_KL(p || q_theta) and to [[maximum_likelihood_estimation|maximum likelihood estimation]] of the model parameters [3]. [[logistic_regression|Logistic regression]], softmax classifiers, and the output layer of essentially every modern deep classifier are trained against cross-entropy loss.
In binary classification, cross-entropy reduces to the log-loss - y log q - (1 - y) log (1 - q), which is convex in the logits and produces well-calibrated probability estimates when the model class is rich enough. Negative log-likelihood objectives for regression with Gaussian noise reduce to mean squared error, again following from a cross-entropy interpretation.
In [[language_model|language modeling]], cross-entropy per token is the central evaluation metric. The [[perplexity|perplexity]] of a model on a corpus is exp(H), where H is the average cross-entropy in nats per token, or equivalently 2^H in bits. Lower perplexity means the model assigns higher probability to held-out text. Perplexity has been a primary scoring statistic for language models from n-gram systems through transformer-based large language models.
KL divergence is the engine of variational inference. Given a posterior p(z | x) that is intractable, variational methods approximate it with a simpler family q_phi(z | x) by minimizing D_KL(q_phi(z | x) || p(z | x)). This is equivalent to maximizing the evidence lower bound (ELBO):
ELBO = E_{q_phi(z | x)} [ log p(x | z) ] - D_KL( q_phi(z | x) || p(z) ).
The variational autoencoder, introduced by Diederik Kingma and Max Welling in 2013 [14], parameterizes both q_phi(z | x) and p(x | z) with neural networks and optimizes the ELBO end to end. The KL term regularizes the encoder distribution toward the prior p(z), while the reconstruction term encourages decoded samples to match the inputs. The same structure underlies normalizing flows, diffusion models in their variational formulation, and many probabilistic generative models.
KL divergence also constrains policy updates in reinforcement learning. Trust Region Policy Optimization ([[trpo|TRPO]]) limits the KL divergence between successive policies to keep updates within a trust region around the current policy [15]. Proximal Policy Optimization ([[ppo|PPO]]), now a default policy gradient method and a key ingredient in fine-tuning large language models with human feedback, uses a clipped surrogate objective that approximates the same KL constraint [16]. Both methods reflect the principle that small KL steps in policy space yield more stable training than large unconstrained gradient steps.
Mutual information is the natural language for describing what a representation preserves. The InfoMax principle, proposed by Ralph Linsker in 1988, says that a good representation Z of input X should maximize I(X; Z) subject to capacity constraints. [[infogan|InfoGAN]] (Chen et al., 2016) augments generative adversarial networks with a term that maximizes mutual information between latent codes and generated samples to encourage disentangled representations.
[[contrastive_learning|Contrastive learning]] methods such as Contrastive Predictive Coding, SimCLR, and MoCo use the [[infonce|InfoNCE]] objective introduced by Aaron van den Oord, Yazhe Li, and Oriol Vinyals in 2018 [17]. InfoNCE provides a tractable lower bound on mutual information by training a critic to discriminate positive pairs from negatives. The Mutual Information Neural Estimator (MINE), introduced by Belghazi et al. in 2018, estimates I(X; Y) through the Donsker-Varadhan representation of KL divergence and is used both as an analysis tool and as a training signal.
Maximum entropy methods are another long-standing application. Logistic regression and softmax classifiers can be derived as the unique distributions that maximize entropy subject to linear feature-expectation constraints, an approach pioneered in natural language processing by Berger, Della Pietra, and Della Pietra in the 1990s. Conditional random fields generalize the idea to structured prediction. The cross-entropy method of Rubinstein is a related stochastic optimization scheme that iteratively updates a sampling distribution toward rare desirable events.
The information bottleneck principle, introduced by Naftali Tishby, Fernando Pereira, and William Bialek in 2000 [18], frames representation learning as a constrained optimization problem. Given input X and target Y, the goal is to learn a representation T that minimizes I(X; T) (compression of input information) while preserving I(T; Y) (relevant information about the target). The Lagrangian objective
L = I(X; T) - beta I(T; Y)
trades off compression and prediction. In 2017, Tishby and Ravid Shwartz-Ziv argued that deep networks trained with stochastic gradient descent traverse two phases on the information plane: a fitting phase where I(T; Y) increases and a compression phase where I(T; X) decreases [19]. This claim attracted widespread attention and inspired a body of follow-up work, although Andrew Saxe and collaborators in 2018 showed that the proposed compression phase depends on activation function choice and is absent in many practical networks [20]. The information bottleneck remains an influential conceptual lens for analyzing what neural networks compute, even when its strongest empirical claims are debated.
Rate-distortion theory provides another bridge between information theory and modern machine learning, particularly in neural compression. Methods such as Balle et al.'s end-to-end learned image compression train an encoder, decoder, and entropy model jointly to optimize a rate-distortion Lagrangian, with the rate term estimated through differentiable entropy models that track the source coding bound.
Information theory and statistical mechanics share the same core mathematics. The thermodynamic entropy of a macrostate, in the Boltzmann-Gibbs formulation, is S = - k_B sum_i p_i log p_i, where p_i is the probability of microstate i and k_B is Boltzmann's constant. This is the Shannon entropy in nats multiplied by k_B, a coincidence of form that reflects an underlying coincidence of meaning [9]. The link is explored in detail under [[boltzmann_entropy|Boltzmann entropy]].
[[edwin_jaynes|Edwin Jaynes]] argued in 1957 that statistical mechanics could be derived from information theory through the principle of maximum entropy: when the state of a system is constrained only by macroscopic averages, the least biased probability distribution consistent with those averages is the one with maximum entropy [9]. The Boltzmann distribution emerges as the maximum entropy distribution given a fixed expected energy. The same principle is used outside physics whenever a probabilistic model must be chosen subject to constraints with no further information.
In black hole physics, Jacob Bekenstein and Stephen Hawking showed that black holes carry an entropy proportional to the area of the event horizon, with S = (k_B c^3 A) / (4 G hbar). This Bekenstein-Hawking formula treats information as a physical quantity and inspired the holographic principle in quantum gravity.
The [[landauer_principle|Landauer principle]], formulated by Rolf Landauer at IBM in 1961, states that erasing one bit of information dissipates at least k_B T ln 2 units of energy as heat into the environment, where T is the temperature. The principle ties logical irreversibility to thermodynamic irreversibility and gives a fundamental lower bound on the energy cost of computation.
[[quantum_information|Quantum information theory]] generalizes the classical theory to quantum states. Von Neumann entropy S(rho) = - tr(rho log rho), where rho is the density matrix, replaces Shannon entropy. Holevo's theorem bounds how much classical information a quantum channel can transmit. Quantum entanglement, quantum error correction, and quantum channel capacities form a separate but parallel theory with its own coding theorems.
In cryptography, Shannon's 1949 paper "Communication Theory of Secrecy Systems" introduced the notions of confusion and diffusion that still guide cipher design [7]. Shannon proved that the one-time pad provides perfect secrecy, meaning the ciphertext gives no information about the plaintext to an eavesdropper, and that any perfectly secret cipher requires a key at least as long as the message.
In signal processing, the Nyquist-Shannon sampling theorem states that a band-limited continuous signal can be perfectly reconstructed from samples taken at twice its highest frequency. This theorem underlies digital audio, digital video, software-defined radio, and medical imaging. Wiener filtering and the connection between mutual information and minimum mean square error link information theory to estimation and detection theory.
In bioinformatics, sequence alignment scores, position-specific scoring matrices for motif discovery, and information-theoretic measures of conservation are routine. The information content of a DNA-binding motif, measured in bits, summarizes how strongly conserved each position is. Mutual information detects coevolving residue pairs in proteins.
In linguistics and natural language processing, information theory underpins word frequency statistics, n-gram models, and the entropy rate of natural languages. Shannon estimated the entropy of printed English at roughly 1 to 1.3 bits per letter through human prediction experiments, a figure that subsequent neural language models have approached and surpassed.
Machine learning model selection uses information criteria such as the Akaike information criterion (AIC) and Bayesian information criterion (BIC) to balance model fit against complexity, both derived from information-theoretic considerations [3]. Mutual information is also used as a feature selection criterion, while [[bayes_theorem|Bayes' theorem]] interacts with information theory through the chain rule and the data processing inequality.
After Shannon, many researchers built on the foundations he laid. Robert Fano collaborated with Shannon on early coding ideas and wrote one of the first textbooks on information theory. Solomon Kullback and Richard Leibler introduced relative entropy. Peter Elias developed convolutional codes. Robert Gallager invented LDPC codes and wrote a widely used graduate textbook. Toby Berger contributed to rate-distortion theory and multiterminal information theory. Thomas M. Cover and Joy A. Thomas authored Elements of Information Theory, the standard graduate textbook in the field [2].
[[david_mackay|David MacKay]] (1967-2016) wrote Information Theory, Inference, and Learning Algorithms [3], an influential textbook bridging classical information theory and modern Bayesian machine learning, and contributed to LDPC code analysis and Bayesian neural networks. Jorma Rissanen developed arithmetic coding and the minimum description length principle. Imre Csiszar's work on information geometry and Erdal Arikan's polar codes are further milestones. The IEEE awards the Shannon Award annually.