Kullback-Leibler divergence, often abbreviated KL divergence and written D_KL(P || Q), is a measure of how one probability distribution P differs from a second reference distribution Q. It quantifies the expected number of extra information units (typically bits or nats) needed to encode samples drawn from P when the encoding is optimized for Q rather than for the true distribution. Introduced by Solomon Kullback and Richard Leibler in their 1951 paper On Information and Sufficiency, the quantity has become one of the most fundamental tools in information theory, statistics, and modern machine learning [1]. KL divergence appears at the heart of variational inference, reinforcement learning trust-region methods, knowledge distillation, variational autoencoders, reinforcement learning from human feedback, and the cross-entropy loss that trains nearly every modern large language model.
Despite the name, KL divergence is not a true distance metric. It is asymmetric (D_KL(P || Q) is generally not equal to D_KL(Q || P)) and does not satisfy the triangle inequality. This asymmetry, far from being a defect, is the source of much of its expressive power. The forward and reverse forms of KL produce qualitatively different behavior when used as objectives, and choosing between them is a substantive modeling decision in everything from approximate Bayesian inference to model alignment. The two formulations are sometimes called the M-projection (moment matching, mean-seeking) and the I-projection (information projection, mode-seeking), and the difference between them shapes the behavior of variational autoencoders, GANs, expectation propagation, and policy optimization in reinforcement learning.
For two discrete probability distributions P and Q defined on the same sample space, the Kullback-Leibler divergence is defined as the sum over the support: D_KL(P || Q) = sum over x of P(x) * log(P(x) / Q(x)). The convention 0 * log(0 / q) = 0 handles points where P(x) is zero. The expression is infinite (or left undefined) at any point where Q(x) is zero but P(x) is positive, so a finite divergence requires that P be absolutely continuous with respect to Q. In other words, Q must assign nonzero probability to every event to which P assigns nonzero probability; otherwise the divergence is infinite.
For two continuous distributions with densities p(x) and q(x), the analogous integral form is D_KL(P || Q) = integral of p(x) * log(p(x) / q(x)) dx, taken over the support of P. The same absolute continuity requirement applies: q(x) must be positive wherever p(x) is positive. The integral form is the natural generalization, and most modern applications in deep learning involve continuous distributions and use the integral version, often estimated via Monte Carlo samples.
KL divergence is also commonly written as the expectation of the log-ratio under P: D_KL(P || Q) = E_{x ~ P}[log(P(x) / Q(x))] = E_{x ~ P}[log P(x) - log Q(x)]. This form makes the connection to estimation explicit. To estimate KL divergence from samples drawn from P, one can compute the average difference between log P and log Q over the samples, provided both densities are tractable. The expectation form also makes the connection to cross-entropy and entropy especially clear: D_KL(P || Q) = H(P, Q) - H(P), where H(P, Q) is the cross-entropy between P and Q and H(P) is the entropy of P.
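As a concrete check of both the definition and the decomposition, the following sketch (using NumPy; the two distributions are arbitrary illustrative values) computes D_KL(P || Q) directly and again as H(P, Q) - H(P).

```python
import numpy as np

# Two example distributions on a four-element sample space (illustrative values).
p = np.array([0.50, 0.25, 0.15, 0.10])
q = np.array([0.25, 0.25, 0.25, 0.25])

def kl_divergence(p, q):
    """D_KL(P || Q) in nats, assuming Q > 0 wherever P > 0."""
    mask = p > 0                                  # convention: 0 * log(0 / q) = 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

entropy = -np.sum(p * np.log(p))                  # H(P)
cross_entropy = -np.sum(p * np.log(q))            # H(P, Q)

print(kl_divergence(p, q))                        # direct definition
print(cross_entropy - entropy)                    # same value via H(P, Q) - H(P)
```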
The choice of logarithm base determines the unit of measurement. Base 2 produces bits, base e produces nats, and base 10 produces hartleys. Information theorists conventionally use bits, while statisticians and machine learning researchers usually use nats. The choice is arbitrary in the sense that all bases produce equivalent rankings and gradients, differing only by a multiplicative constant. Most modern deep learning code uses natural logarithms because they integrate cleanly with the natural logarithm-based softmax cross-entropy loss.
KL divergence emerged from a specific question in mathematical statistics. In 1951, Solomon Kullback and Richard Leibler, both then working as cryptanalysts for the U.S. government (in the organization that became the National Security Agency), published On Information and Sufficiency in the Annals of Mathematical Statistics [1]. The paper sought to characterize the information that a sample provides for discriminating between two statistical hypotheses, building on Ronald Fisher's earlier notion of statistical sufficiency and Claude Shannon's then-new theory of communication. Kullback and Leibler defined what they called the mean information for discrimination, the quantity later named after them, and showed how it captures the expected log-likelihood ratio under one hypothesis when the other hypothesis is the alternative being considered.
The paper proved several foundational properties of the divergence: that it is non-negative, that it equals zero if and only if the two distributions are identical almost everywhere, that it is invariant under one-to-one transformations of the random variable, and that it satisfies an additivity property for independent observations. They also connected the divergence to maximum likelihood estimation, showing that minimizing the KL divergence between the empirical distribution and a member of a parametric family is equivalent to maximum likelihood estimation within that family. The paper's framing of statistical inference in information-theoretic terms helped establish the bridge between the two fields.
Kullback expanded the ideas considerably in his 1959 book Information Theory and Statistics, which remains a classic reference [2]. The book introduced what Kullback called the divergence (more precisely, the symmetrized form J(P, Q) = D_KL(P || Q) + D_KL(Q || P)) as a measure of separation between distributions, and applied the framework to a broad range of estimation and hypothesis testing problems. Subsequent work by other researchers, notably Imre Csiszar, extended the theory to general measure-theoretic settings and developed the notion of f-divergences as a unified family containing KL divergence as a special case.
In the decades since, KL divergence has been rediscovered and renamed in numerous fields. In statistics it is sometimes called the relative entropy, the I-divergence, the discrimination information, or the directed divergence. In information theory the term relative entropy is more standard, and in physics the same quantity appears in the second law of thermodynamics and the H-theorem under various names. The deep connections between statistical mechanics, information theory, and machine learning that have emerged in the twenty-first century are made vivid by the recurring appearance of D_KL across all three.
KL divergence has several mathematical properties that explain both its utility and its limitations as a distance-like quantity. The most important properties are non-negativity, the identity-of-indiscernibles condition, asymmetry, lack of triangle inequality, convexity, and invariance under reparameterization.
Non-negativity. Gibbs' inequality states that D_KL(P || Q) is greater than or equal to zero for any pair of distributions P and Q on the same space. The proof follows from Jensen's inequality applied to the concave logarithm function, or equivalently from the convexity of x * log(x). Non-negativity is the property that makes KL divergence a meaningful measure of dissimilarity: lower values indicate that P and Q are closer in some sense, and zero is the minimum.
Identity of indiscernibles. D_KL(P || Q) equals zero if and only if P and Q are equal almost everywhere. This means that two distributions that agree on every measurable event have zero divergence in either direction, and any positive divergence indicates a genuine difference. The condition is essential for using KL divergence as a training objective, since it guarantees that the global minimum of D_KL(P || Q) with respect to Q (for fixed P) is achieved exactly when Q matches P.
Asymmetry. In general, D_KL(P || Q) is not equal to D_KL(Q || P). The two quantities can differ by orders of magnitude, and choosing between them in an optimization problem leads to qualitatively different solutions. The asymmetry reflects an underlying asymmetry of the encoding interpretation: D_KL(P || Q) is the cost of using a Q-optimal code for P-distributed data, which is not the same problem as the reverse.
Failure of the triangle inequality. KL divergence does not satisfy D_KL(P || R) less than or equal to D_KL(P || Q) + D_KL(Q || R) in general. This means KL is not a metric and cannot be directly used to define a notion of geodesic distance between distributions. Several derived quantities, such as the square root of the Jensen-Shannon divergence or the Hellinger distance, do satisfy the triangle inequality and are sometimes used when a true metric is needed.
Convexity. D_KL(P || Q) is jointly convex in the pair (P, Q) when both arguments range over the simplex of probability distributions. This convexity has important consequences for optimization and information geometry. It guarantees, for example, that the set of distributions Q with D_KL(P || Q) less than or equal to some threshold forms a convex set, and that the minimum of D_KL(P || Q) over a convex family of Q's is unique.
Reparameterization invariance. KL divergence is invariant under any one-to-one differentiable change of coordinates. If y = f(x) is an invertible mapping, then D_KL(p_X || q_X) = D_KL(p_Y || q_Y) where p_Y and q_Y are the densities of the transformed variables. This invariance is what makes KL divergence the natural information-theoretic notion of distance between distributions, since it depends only on the underlying probability measures and not on the parameterization used to describe them.
Pinsker's inequality. The total variation distance between P and Q is bounded above by the square root of D_KL(P || Q) divided by two: TV(P, Q) less than or equal to sqrt(D_KL(P || Q) / 2). This inequality, due to Mark Pinsker in 1964, allows KL bounds to be converted into bounds on the more conventional total variation distance, which is useful when reasoning about the approximation quality of variational methods.
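Non-negativity, asymmetry, and Pinsker's bound are all easy to check numerically on small discrete distributions. The sketch below (NumPy, with arbitrary example distributions that share full support) illustrates all three.

```python
import numpy as np

def kl(p, q):
    """D_KL(P || Q) in nats for discrete distributions with full support."""
    return np.sum(p * np.log(p / q))

p = np.array([0.90, 0.05, 0.05])
q = np.array([0.40, 0.30, 0.30])

forward, reverse = kl(p, q), kl(q, p)
tv = 0.5 * np.sum(np.abs(p - q))            # total variation distance

print(forward, reverse)                     # both non-negative, and unequal (asymmetry)
print(tv <= np.sqrt(forward / 2))           # Pinsker's inequality holds
print(tv <= np.sqrt(reverse / 2))           # ... in either direction
```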
The choice between D_KL(P || Q) and D_KL(Q || P) when fitting a model is the source of one of the most consequential design decisions in modern machine learning. When P is the true (or target) distribution and Q is a parametric model being fit, D_KL(P || Q) is called the forward KL or M-projection, and D_KL(Q || P) is called the reverse KL or I-projection. The two have very different behavior, especially when the model family for Q is too restrictive to represent P exactly.
Forward KL, sometimes called the moment-matching projection, places strong penalties on regions where Q assigns low probability but P assigns substantial mass. The integrand p(x) * log(p(x) / q(x)) blows up when q(x) is small and p(x) is not. To avoid this divergence, an optimizer fitting Q under forward KL will spread Q to cover all the modes and tails of P, even if this means that Q places significant mass in regions where P has none. Forward KL is therefore called mode-covering or mean-seeking. Maximum likelihood estimation, in which the empirical distribution is the target P and the model family is Q, is an instance of forward KL minimization.
Reverse KL, sometimes called the information projection, has the opposite behavior. The integrand q(x) * log(q(x) / p(x)) is well-behaved when q(x) is small and large when q(x) is significant in regions where p(x) is small. An optimizer fitting Q under reverse KL will pick a single mode of P (or a small subset of modes) and concentrate Q there, ignoring the other modes entirely. This is called mode-seeking behavior. Variational inference using mean-field approximations and many GAN-style training objectives implicitly minimize reverse-KL-like quantities and exhibit this mode-seeking behavior in practice.
The table below summarizes the qualitative behavior of forward and reverse KL when Q has insufficient capacity to represent the true P.
| Property | Forward KL D_KL(P \|\| Q) | Reverse KL D_KL(Q \|\| P) |
|---|---|---|
| Other names | M-projection, moment-matching, mean-seeking | I-projection, information projection, mode-seeking |
| Penalizes when | Q is small where P is large | Q is large where P is small |
| Behavior with multimodal P | Covers all modes, smears mass between them | Picks one mode, concentrates there |
| Estimation requires | Samples from P, evaluations of Q | Samples from Q, evaluations of P |
| Bias | Underestimates uncertainty if Q too narrow | Underestimates support, overconfident |
| Common use | Maximum likelihood, supervised learning | Variational inference, RL, GANs |
| Effect on a Gaussian Q fit to bimodal P | Q centered between the modes, wide variance | Q concentrated on one mode, narrow variance |
The practical implications of the choice are most vivid in variational inference, where the goal is to approximate an intractable posterior with a tractable family. Variational inference traditionally uses reverse KL, leading to mode-seeking approximations that systematically underestimate posterior variance, a bias known to practitioners as the variance shrinkage problem. Some methods such as expectation propagation use forward KL instead and produce moment-matched approximations that capture posterior variance better but at higher computational cost.
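A small numerical experiment makes the contrast concrete. The sketch below (using NumPy and SciPy; the mixture, the evaluation grid, and the optimizer settings are illustrative choices) fits a single Gaussian to a bimodal mixture by minimizing each direction of KL on a dense grid, recovering the qualitative behavior summarized in the table: a wide fit centered between the modes for forward KL, and a narrow fit locked onto one mode for reverse KL.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Bimodal target: an equal mixture of N(-3, 1) and N(+3, 1) (illustrative choice).
xs = np.linspace(-10, 10, 2001)
dx = xs[1] - xs[0]
p = 0.5 * norm.pdf(xs, -3, 1) + 0.5 * norm.pdf(xs, 3, 1)

def kl_grid(a, b):
    """Grid approximation of D_KL(A || B) for densities a, b evaluated on xs."""
    mask = a > 1e-12
    return np.sum(a[mask] * np.log(a[mask] / np.maximum(b[mask], 1e-300))) * dx

def q_density(params):
    mu, log_sigma = params
    return norm.pdf(xs, mu, np.exp(log_sigma))

forward = minimize(lambda t: kl_grid(p, q_density(t)), x0=[0.5, 0.0])
reverse = minimize(lambda t: kl_grid(q_density(t), p), x0=[0.5, 0.0])

print("forward KL fit: mu=%.2f sigma=%.2f" % (forward.x[0], np.exp(forward.x[1])))
print("reverse KL fit: mu=%.2f sigma=%.2f" % (reverse.x[0], np.exp(reverse.x[1])))
# Forward KL lands near mu = 0 with a large sigma (mode-covering);
# reverse KL lands near one mode with sigma close to 1 (mode-seeking).
```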
KL divergence is intimately tied to entropy and cross-entropy, and many of the loss functions used in modern machine learning can be rewritten as KL divergences. Recall that the entropy of a distribution P is H(P) = -sum P(x) * log P(x), the cross-entropy between P and Q is H(P, Q) = -sum P(x) * log Q(x), and the KL divergence satisfies D_KL(P || Q) = H(P, Q) - H(P).
This decomposition has an immediate practical consequence. When P is fixed (for example, the empirical distribution of training data), the entropy H(P) is a constant. Minimizing cross-entropy H(P, Q) over Q is therefore equivalent to minimizing the KL divergence D_KL(P || Q) over Q. Both objectives produce the same optimum and the same gradients with respect to the parameters of Q. The cross-entropy loss used to train classifiers, language models, and most other supervised neural networks is, mathematically, a KL divergence in disguise.
For language modeling, the empirical distribution P consists of one-hot indicator vectors at each token position, so the cross-entropy reduces to -log Q(x_t | x_<t) where x_t is the actual token and Q is the model's predicted distribution. Averaging over a corpus, this is the negative log likelihood, and the trained model is the maximum likelihood estimator over the model family. The connection to KL divergence makes the information-theoretic interpretation explicit: the cross-entropy loss measures how many bits (or nats) per token are needed to encode the actual text using the model's predictions, and its excess over the entropy of the true token distribution is the KL divergence between that distribution and the model.
The related quantity called perplexity, computed as the exponential of the cross-entropy measured in nats, is a multiplicative measure of model quality. A perplexity of N can be loosely interpreted as the model being as uncertain as if it faced N equally likely options at each step, which makes it intuitive to compare across model sizes and datasets. Lower KL divergence to the true distribution corresponds to lower perplexity; the two quantities are monotone transformations of one another and convey the same information.
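The decomposition and the perplexity relation can both be verified in a few lines. The sketch below (using PyTorch; the logits and targets are random stand-ins for a model's output over a small vocabulary) checks that cross-entropy equals KL divergence plus the target entropy, and computes perplexity as the exponential of the cross-entropy in nats.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab, batch = 10, 4
logits = torch.randn(batch, vocab)                  # stand-in model outputs
p = torch.softmax(torch.randn(batch, vocab), -1)    # stand-in soft target distribution P

log_q = F.log_softmax(logits, dim=-1)
cross_entropy = -(p * log_q).sum(-1).mean()                 # H(P, Q)
entropy = -(p * p.log()).sum(-1).mean()                     # H(P)
kl = F.kl_div(log_q, p, reduction="batchmean")              # D_KL(P || Q)
print(cross_entropy.item(), (kl + entropy).item())          # equal: H(P, Q) = D_KL + H(P)

# With one-hot targets (the language-modeling case) H(P) = 0, so the cross-entropy
# is itself a KL divergence, and perplexity is its exponential.
tokens = torch.randint(0, vocab, (batch,))
nll = F.cross_entropy(logits, tokens)
print(torch.exp(nll).item())                                # perplexity over this batch
```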
Mutual information between two random variables X and Y is defined as the KL divergence between their joint distribution and the product of their marginals: I(X; Y) = D_KL(p(x, y) || p(x) * p(y)). This expresses mutual information as the amount by which the joint distribution differs from the independence assumption that would make X and Y unrelated. Mutual information inherits non-negativity from KL divergence (so it is always greater than or equal to zero), with equality if and only if X and Y are independent. The interpretation is that I(X; Y) measures how much knowing one variable reduces uncertainty about the other.
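For discrete variables the definition can be applied directly to a joint probability table. The sketch below (NumPy, with an arbitrary 2-by-2 joint distribution) computes I(X; Y) as the KL divergence between the joint and the product of its marginals.

```python
import numpy as np

# Joint distribution of two binary variables (illustrative values).
joint = np.array([[0.30, 0.10],
                  [0.15, 0.45]])
marg_x = joint.sum(axis=1, keepdims=True)
marg_y = joint.sum(axis=0, keepdims=True)
product = marg_x * marg_y                     # distribution if X and Y were independent

mutual_info = np.sum(joint * np.log(joint / product))   # I(X; Y) = D_KL(joint || product)
print(mutual_info)    # non-negative; zero only when X and Y are independent
```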
The asymmetry of KL divergence motivates the construction of symmetric variants. The Jensen-Shannon divergence (JSD) is defined as JSD(P, Q) = (1/2) * D_KL(P || M) + (1/2) * D_KL(Q || M), where M = (P + Q) / 2 is the midpoint distribution. JSD is symmetric in its two arguments by construction and is bounded above by log 2 (or by 1 in bits), making it a more well-behaved similarity measure than raw KL. The square root of JSD satisfies the triangle inequality, so it is a true metric on the space of distributions. In the original GAN formulation of Goodfellow and colleagues, the generator's objective at the optimal discriminator reduces to minimizing the JSD between the data and generator distributions, and JSD appears in many other settings where a symmetric, bounded measure of distributional similarity is needed.
The symmetrized KL J(P, Q) = D_KL(P || Q) + D_KL(Q || P), originally proposed in Kullback's book, is another classical symmetric variant. Unlike JSD, it can be infinite when either distribution lacks support that the other has, which limits its practical use. JSD avoids this problem because the midpoint M always has support wherever P or Q does.
KL divergence is a special case of a broader family of quantities called f-divergences, introduced independently by Csiszar, Morimoto, and Ali-Silvey in the 1960s. An f-divergence is defined as D_f(P || Q) = integral of q(x) * f(p(x) / q(x)) dx for any convex function f satisfying f(1) = 0. Different choices of f produce different members of the family.
| f(t) | Resulting divergence | Notes |
|---|---|---|
| t * log(t) | Forward KL D_KL(P \|\| Q) | Mean-seeking direction |
| -log(t) | Reverse KL D_KL(Q \|\| P) | Mode-seeking direction |
| (t - 1)^2 / 2 | Pearson chi-squared | Quadratic, light tails |
| (1 - sqrt(t))^2 | Hellinger squared distance | Symmetric, bounded |
| \|t - 1\| / 2 | Total variation distance | Symmetric, bounded, a true metric |
| t * log(t) - (1 + t) * log((1 + t) / 2) | Jensen-Shannon (one term) | Symmetric, bounded |
| (t^a - 1) / (a * (a - 1)) | Alpha-divergence, monotonically related to Renyi of order a | Family interpolating KL |
The Renyi divergence of order alpha generalizes KL by replacing the logarithm with a power: D_alpha(P || Q) = (1 / (alpha - 1)) * log(sum over x of P(x)^alpha * Q(x)^(1 - alpha)). The limit as alpha approaches one recovers KL divergence, while different values of alpha produce different sensitivities to differences between the distributions. Orders greater than one weight large likelihood ratios more heavily and underpin Renyi differential privacy in privacy-preserving machine learning, while orders below one remain finite under support mismatch and have been used in Renyi variational inference to interpolate between mass-covering and mode-seeking behavior.
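The limiting behavior is easy to verify numerically. The sketch below (NumPy, with arbitrary example distributions) evaluates the Renyi divergence at several orders and shows the values approaching the KL divergence as alpha approaches one.

```python
import numpy as np

p = np.array([0.6, 0.3, 0.1])
q = np.array([0.2, 0.5, 0.3])

def renyi(p, q, alpha):
    """Renyi divergence of order alpha (in nats), for alpha != 1."""
    return np.log(np.sum(p**alpha * q**(1 - alpha))) / (alpha - 1)

kl = np.sum(p * np.log(p / q))
for alpha in (0.5, 0.9, 0.99, 0.999, 1.001, 1.5, 2.0):
    print(alpha, renyi(p, q, alpha))
print("KL:", kl)       # the values near alpha = 1 converge to this
```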
Unifying these divergences under the f-divergence umbrella has proven theoretically fruitful. Many properties such as monotonicity under data processing, joint convexity, and invariance under reparameterization hold for the entire family, with KL divergence as the canonical instance. Different members of the family have been used as training objectives in different machine learning settings, with the choice often motivated by either statistical efficiency or computational convenience.
In most practical applications, the densities p and q are not directly available, and KL divergence must be estimated from samples. The simplest estimator is the Monte Carlo plug-in, which evaluates KL as an empirical average of the log-ratio under samples from P: D_KL(P || Q) is approximately equal to (1 / n) * sum log(p(x_i) / q(x_i)) for x_i drawn from P. This estimator is unbiased when both p and q can be evaluated at the samples, which is the typical case in deep generative modeling where one knows the form of both distributions. When only samples from one or both distributions are available, more elaborate estimators are needed.
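The plug-in estimator is a one-liner when both densities are tractable. The sketch below (using PyTorch; the two univariate Gaussians are arbitrary examples) compares the Monte Carlo estimate against the closed-form KL divergence available for Gaussians.

```python
import torch
from torch.distributions import Normal, kl_divergence

torch.manual_seed(0)
p = Normal(loc=0.0, scale=1.0)            # P, the sampling distribution
q = Normal(loc=1.0, scale=2.0)            # Q, the reference

x = p.sample((100_000,))                  # samples from P
mc_estimate = (p.log_prob(x) - q.log_prob(x)).mean()

print(mc_estimate.item())                 # Monte Carlo plug-in estimate
print(kl_divergence(p, q).item())         # exact closed form, for comparison
```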
When only samples are available, k-nearest-neighbor estimators originally due to Wang, Kulkarni, and Verdu provide a non-parametric way to estimate KL between continuous distributions in low to moderate dimensions. The estimator computes distances from each sample of P to its nearest neighbor in P (for the entropy term) and to its nearest neighbor in samples of Q (for the cross-term). These estimators are consistent and have well-understood asymptotic properties but suffer from the curse of dimensionality and become impractical above roughly twenty dimensions.
Neural KL estimators have largely supplanted k-nearest-neighbor methods for high-dimensional problems. The Donsker-Varadhan variational representation of KL divergence states that D_KL(P || Q) is the supremum over all functions T of E_P[T] - log E_Q[exp(T)]. This dual formulation can be optimized by training a neural network T to maximize the right-hand side, producing a lower bound on KL that becomes tight as T approaches the optimum log(p / q). The Mutual Information Neural Estimator (MINE) of Belghazi and colleagues in 2018 applied this idea specifically to mutual information estimation. Variants based on the f-GAN framework or the Nguyen-Wainwright-Jordan estimator allow KL to be estimated from samples alone using deep neural networks, although care must be taken with the bias and variance of these estimators in practice.
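A minimal sketch of the idea (using PyTorch; the two Gaussians, the critic architecture, and the training settings are illustrative choices rather than those of any particular paper) trains a small critic to maximize the Donsker-Varadhan bound E_P[T] - log E_Q[exp(T)], which lower-bounds the true KL.

```python
import torch
import torch.nn as nn
from torch.distributions import MultivariateNormal, kl_divergence

torch.manual_seed(0)
dim = 2
p = MultivariateNormal(torch.zeros(dim), torch.eye(dim))
q = MultivariateNormal(torch.ones(dim), 2.0 * torch.eye(dim))

critic = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

for step in range(2000):
    xp, xq = p.sample((512,)), q.sample((512,))
    t_p, t_q = critic(xp), critic(xq)
    # Donsker-Varadhan bound: E_P[T] - log E_Q[exp(T)] <= D_KL(P || Q).
    log_mean_exp = torch.logsumexp(t_q, dim=0).squeeze() - torch.log(torch.tensor(float(len(xq))))
    bound = t_p.mean() - log_mean_exp
    loss = -bound                         # gradient ascent on the bound
    opt.zero_grad()
    loss.backward()
    opt.step()

print("neural estimate:", bound.item())
print("exact KL:       ", kl_divergence(p, q).item())
```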
For unnormalized models or for KL between intractable distributions, the closely related contrastive divergence and noise-contrastive estimation methods sidestep the need to estimate KL directly by estimating only its gradient or a related quantity. Score matching and denoising score matching, which power modern diffusion models, can also be viewed as alternatives to direct KL minimization that avoid the partition function entirely.
KL divergence appears at the heart of many of the techniques that make modern AI work. The table below summarizes some of the most important applications, all of which are discussed in more detail below.
| Application | Role of KL divergence | Direction |
|---|---|---|
| Maximum likelihood estimation | Equivalent to forward KL between empirical and model | Forward KL |
| Variational inference | ELBO = log evidence minus reverse KL to true posterior | Reverse KL |
| Variational autoencoders | KL term regularizes encoder to prior, typically Gaussian | Reverse KL to prior |
| TRPO and PPO | Trust region constraint on policy update | KL between old and new policy, D_KL(pi_old \|\| pi_new) |
| RLHF KL penalty | Anchors fine-tuned model to base, prevents reward hacking | Reverse KL to reference |
| Knowledge distillation | Cross-entropy between teacher and student soft targets | Forward KL teacher to student |
| Information bottleneck | Penalizes information passing through a bottleneck | Multiple |
| Mutual information estimation | I(X; Y) = D_KL(joint \|\| product of marginals) | Forward KL |
| Bayesian model selection | Bayes factors are exponentiated KL differences | Forward and reverse |
| Diffusion model training | Denoising score matching equivalent to KL bound | Forward KL bound |
Variational inference approximates an intractable posterior p(z | x) with a tractable distribution q(z) drawn from some chosen family. The standard derivation begins from the identity log p(x) = E_q[log p(x, z) - log q(z)] + D_KL(q(z) || p(z | x)). Since the KL term is non-negative, the first term on the right-hand side is a lower bound on log p(x), called the evidence lower bound (ELBO). Maximizing the ELBO over the parameters of q is therefore equivalent to minimizing the reverse KL divergence between the approximating distribution and the true posterior, since the marginal log-evidence does not depend on q.
The choice of reverse KL in variational inference is consequential. It produces mode-seeking approximations that tend to be over-confident and to miss secondary modes of the posterior. Several alternative formulations such as expectation propagation, alpha-divergence variational inference, and importance-weighted variational inference modify the divergence used and produce different bias-variance trade-offs. Despite these alternatives, the reverse-KL ELBO remains the standard objective for amortized variational inference and powers most variational autoencoders.
The variational autoencoder (VAE), introduced by Kingma and Welling in 2014, is one of the most influential applications of variational inference in deep learning. The VAE objective is the ELBO computed with an amortized inference network that produces the parameters of q(z | x) as a function of x. The ELBO decomposes into a reconstruction term (the expected log-likelihood of x given z) and a KL regularization term D_KL(q(z | x) || p(z)) that pulls the encoder distribution toward the prior, typically a standard multivariate Gaussian.
The KL regularization term has a closed-form expression when both q(z | x) and p(z) are Gaussian, and the analytical formula is differentiable through the encoder parameters. This is what makes VAE training tractable: the reconstruction term is estimated by drawing samples through the reparameterization trick, and the KL term is computed analytically without sampling. The relative weight of the two terms (often denoted beta in the beta-VAE family) controls the trade-off between reconstruction accuracy and latent disentanglement, with higher beta values pushing the latent distribution closer to the prior at the cost of poorer reconstruction.
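The closed form for a diagonal Gaussian posterior against a standard normal prior is short enough to write out directly. The sketch below (using PyTorch; mu and log_var stand in for the outputs of an encoder network and are filled with random values here) computes the per-example KL term and checks it against torch.distributions.

```python
import torch
from torch.distributions import Normal, kl_divergence

torch.manual_seed(0)
batch, latent = 8, 16
mu = torch.randn(batch, latent)            # stand-in encoder means
log_var = torch.randn(batch, latent)       # stand-in encoder log-variances

# Closed-form D_KL(N(mu, diag(sigma^2)) || N(0, I)), summed over latent dimensions.
kl_closed = -0.5 * (1 + log_var - mu.pow(2) - log_var.exp()).sum(dim=-1)

# Cross-check against the library's analytic Gaussian KL.
q = Normal(mu, (0.5 * log_var).exp())
prior = Normal(torch.zeros_like(mu), torch.ones_like(mu))
kl_check = kl_divergence(q, prior).sum(dim=-1)
print(torch.allclose(kl_closed, kl_check, atol=1e-5))   # True
```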
The phenomenon of posterior collapse, in which the encoder learns to set q(z | x) equal to p(z) for every x and the latent code carries no information, is a well-known failure mode of VAE training. Posterior collapse occurs when the decoder is too powerful or the KL term too strong, and several techniques such as KL annealing, free bits, and reweighted ELBOs have been proposed to mitigate it. All of these techniques are essentially modifications of how KL is computed or weighted, illustrating just how central the KL term is to VAE behavior.
KL divergence plays a central role in modern policy gradient methods for reinforcement learning. Trust Region Policy Optimization (TRPO), introduced by Schulman and colleagues in 2015, constrains each policy update to produce a new policy whose KL divergence from the old policy is bounded by a small trust-region radius. The constraint is enforced as part of a constrained optimization problem, solved approximately by conjugate gradient applied to a linear approximation of the objective and a quadratic approximation of the KL constraint, followed by a backtracking line search. The trust region prevents catastrophic policy collapse from overly aggressive updates, a problem that plagued earlier policy gradient methods.
Proximal Policy Optimization (PPO), introduced by Schulman and colleagues in 2017, replaces TRPO's hard KL constraint with a clipped surrogate objective that approximates the same trust-region behavior at much lower computational cost. PPO has become the dominant policy gradient algorithm in deep reinforcement learning and is the backbone of reinforcement learning from human feedback for large language model alignment. Some PPO variants retain an explicit KL penalty in addition to or instead of the clipping mechanism, and tuning the KL coefficient is one of the principal hyperparameters of PPO-based RLHF training pipelines.
The KL constraint in TRPO and PPO is mathematically the divergence D_KL(pi_old || pi_new), evaluated as an expectation over states and actions sampled under the old policy. The intuition is that by limiting how much the new policy can differ from the old policy in this divergence, the importance-weighted policy improvement estimate remains accurate. Without this trust region, policy gradient methods can take arbitrarily large steps in policy space and either diverge entirely or degrade performance.
Reinforcement learning from human feedback (RLHF) trains a language model to maximize a reward derived from human preference data while staying close to a reference model, typically the initial supervised fine-tuned model. The total loss combines the reward signal with a KL penalty: total reward = r(x, y) - beta * D_KL(pi(y | x) || pi_ref(y | x)), where pi is the trained policy, pi_ref is the reference, and beta is a coefficient controlling the strength of the KL anchor.
The KL term serves several purposes. It prevents the policy from drifting too far from a coherent language model and producing nonsensical text in pursuit of reward. It limits reward hacking by penalizing deviations from a reference distribution that produces well-calibrated responses. And it provides a knob (beta) for trading off how much the policy is allowed to deviate from the reference, which empirically governs the trade-off between alignment quality and capability preservation.
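In implementations the penalty is usually applied per token, using log-probabilities of the sampled response under the policy and under the frozen reference model. The sketch below (using PyTorch; the tensor values, the beta coefficient, and the reward shaping are illustrative stand-ins, not a specific library's API) shows the basic shape of the computation.

```python
import torch

torch.manual_seed(0)
seq_len = 6
beta = 0.1                                       # strength of the KL anchor (illustrative)

# Per-token log-probs of the sampled response; in a real pipeline these come from
# forward passes of the policy and the frozen reference model over the same tokens.
logp_policy = torch.randn(seq_len) - 2.0
logp_ref = torch.randn(seq_len) - 2.0

# Per-token KL estimate under samples from the policy: log pi(y|x) - log pi_ref(y|x).
kl_per_token = logp_policy - logp_ref

reward_from_model = torch.tensor(1.3)            # scalar reward for the full response
shaped_reward = reward_from_model - beta * kl_per_token.sum()
print(kl_per_token.sum().item(), shaped_reward.item())
```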
The direct preference optimization (DPO) method, introduced by Rafailov and colleagues in 2023, reformulates RLHF as a supervised learning problem by deriving an analytical relationship between the optimal RLHF policy and the implicit reward, exploiting the structure of the KL-regularized objective. DPO eliminates the need for an explicit reward model and PPO loop, training the policy directly from preference pairs with a binary cross-entropy loss. The mathematical derivation hinges on the same KL-regularized formulation used in PPO-based RLHF, illustrating just how foundational the KL term is to the RLHF problem regardless of which optimization algorithm is used.
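The resulting per-pair loss is compact. The sketch below (using PyTorch; the sequence log-likelihoods of the chosen and rejected responses are random stand-ins) implements the DPO objective, in which the KL-regularization coefficient beta reappears as the scale of the implicit reward.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
beta = 0.1                                        # same role as the RLHF KL coefficient

# Sequence log-likelihoods of chosen (w) and rejected (l) responses under the policy
# and the frozen reference model (random stand-ins for a batch of four pairs).
logp_w, logp_l = torch.randn(4), torch.randn(4)
ref_logp_w, ref_logp_l = torch.randn(4), torch.randn(4)

implicit_reward_w = beta * (logp_w - ref_logp_w)
implicit_reward_l = beta * (logp_l - ref_logp_l)
dpo_loss = -F.logsigmoid(implicit_reward_w - implicit_reward_l).mean()
print(dpo_loss.item())
```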
KL-controlled RL, of which RLHF is one instance, is a more general framework for using a reference policy to constrain reinforcement learning, developed for sequence models and dialogue by Jaques and colleagues among others. The framework has applications beyond language models, including dialogue systems, robotics, and game playing. In all of these settings, the role of KL is the same: it provides a soft constraint that keeps the policy close to a known-good reference while permitting controlled deviation toward higher reward.
Knowledge distillation, introduced in its modern form by Hinton, Vinyals, and Dean in 2015, transfers knowledge from a large teacher model to a smaller student by training the student to match the teacher's softened output distributions. The loss function is the KL divergence (or equivalently the cross-entropy minus the constant teacher entropy) between the teacher's predicted distribution and the student's predicted distribution, with a temperature parameter applied to the softmax to produce more informative soft targets.
The distillation loss is typically computed as D_KL(softmax(z_T / tau) || softmax(z_S / tau)) where z_T and z_S are the teacher and student logits and tau is the temperature. Higher temperatures produce softer distributions that carry more information about the teacher's relative confidence over different classes, and the distillation gradient is rescaled by tau squared to maintain comparable magnitudes across temperatures. The student is trained on a weighted sum of this distillation KL and the original cross-entropy loss against the hard labels, with the weighting controlling how much the student relies on the teacher versus the ground truth.
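A typical implementation of the combined loss (using PyTorch; the logits are random stand-ins, and the temperature tau and mixing weight alpha are illustrative hyperparameters) looks like the following.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, classes = 4, 10
tau, alpha = 4.0, 0.5                               # temperature and mixing weight
z_teacher = torch.randn(batch, classes)             # stand-in teacher logits
z_student = torch.randn(batch, classes, requires_grad=True)
labels = torch.randint(0, classes, (batch,))

# KL between softened teacher and student distributions, scaled by tau^2 so the
# gradient magnitude stays comparable across temperatures.
soft_teacher = F.softmax(z_teacher / tau, dim=-1)
log_soft_student = F.log_softmax(z_student / tau, dim=-1)
distill = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * tau**2

hard = F.cross_entropy(z_student, labels)           # ordinary supervised loss
loss = alpha * distill + (1 - alpha) * hard
loss.backward()
print(distill.item(), hard.item(), loss.item())
```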
Knowledge distillation has become one of the standard techniques for compressing large language models. Distilled models such as DistilBERT, DistilGPT2, MiniLM, and many open-source distilled variants of larger models all use KL-divergence-based losses. More recent work on sequence-level distillation, on-policy distillation, and reverse-KL distillation explores variations on the basic theme. The reverse-KL form, in which the student matches the teacher with D_KL(student || teacher), produces mode-seeking distillation that focuses the student on the teacher's most confident predictions and has been shown to improve performance for some classes of student architectures.
The information bottleneck principle, proposed by Tishby, Pereira, and Bialek in 2000, frames representation learning as a trade-off between two mutual information quantities, both of which are KL divergences in disguise. The objective is to find a representation Z of the input X that is informative about a target Y while compressing as much irrelevant information as possible: minimize I(X; Z) - beta * I(Z; Y). Since both mutual information terms can be expressed as KL divergences, the entire framework is a constrained KL optimization.
The information bottleneck has been applied to interpretability, robustness, and disentanglement of deep neural networks. The variational information bottleneck of Alemi and colleagues uses variational bounds on both mutual information terms, expressing the loss in terms of explicit KL divergences that can be computed and optimized. Connections between information bottleneck training and natural training procedures such as cross-entropy minimization have been explored extensively, although the empirical equivalence is not perfect and depends on architectural choices.
Diffusion models, the dominant paradigm for high-quality image generation, were originally derived as variational lower bounds on data likelihood involving sums of KL divergences between forward and reverse process transitions. The denoising diffusion probabilistic model (DDPM) loss simplifies under specific noise schedules to a sum of mean-squared error terms, but the derivation reveals that the model is implicitly minimizing a KL bound on the negative log-likelihood at every time step. The connection is made explicit in the variational diffusion model formulation, which expresses the entire loss in terms of KL divergences and connects diffusion training to score matching, energy-based modeling, and continuous normalizing flows.
The equivalence between denoising score matching and KL divergence minimization, formalized in work by Vincent and others, provides a crisp explanation for why diffusion models produce high-quality samples: their loss function is a tight bound on the negative log-likelihood of the data under the model. Recent work on rectified flow, consistency models, and shortcut models has further developed the connections between KL divergences and the deterministic ODE formulations of these models.
KL divergence in deep learning is usually estimated and optimized rather than computed exactly. Several practical issues arise repeatedly, including numerical stability, the choice of estimator, and the appropriate scaling of KL terms relative to other loss components.
Numerical stability is a recurring concern because the log-ratio in KL divergence can become large or undefined when the two distributions disagree sharply. The standard remedy is to compute KL divergence directly from logits rather than from softmax probabilities, using the log-sum-exp trick to maintain precision. The PyTorch nn.KLDivLoss expects log-probabilities for the predicted distribution and probabilities for the target, a deliberate API choice that avoids the numerical pitfalls of computing log on small probabilities. JAX, TensorFlow, and other frameworks provide similar utilities.
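A common stable pattern (shown here as a PyTorch sketch; the logits are random stand-ins) is to stay in log space throughout, computing both distributions with log_softmax and passing log-probabilities for the target via log_target=True, rather than exponentiating and then taking logarithms of possibly tiny probabilities.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits_p = torch.randn(4, 100) * 5.0      # stand-in target logits (fairly sharp)
logits_q = torch.randn(4, 100) * 5.0      # stand-in model logits

# Stable: log_softmax keeps everything in log space via the log-sum-exp trick.
log_p = F.log_softmax(logits_p, dim=-1)
log_q = F.log_softmax(logits_q, dim=-1)
kl_stable = F.kl_div(log_q, log_p, reduction="batchmean", log_target=True)

# Less stable: materialize probabilities, then take logs of potentially tiny values.
p = F.softmax(logits_p, dim=-1)
kl_naive = (p * (p.log() - log_q)).sum(-1).mean()

print(kl_stable.item(), kl_naive.item())  # agree here, but the first form degrades gracefully
```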
The choice of KL estimator matters substantially when KL is part of a learning objective. The naive Monte Carlo estimator log(p / q) under samples from p is unbiased but high-variance. Several lower-variance estimators have been developed, including the Kingma-Welling reparameterized estimator used in VAE training and the bias-corrected estimators discussed in John Schulman's well-known blog post on KL approximation. The right estimator depends on whether the goal is to estimate KL accurately or to compute a low-variance gradient through KL with respect to model parameters.
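For monitoring the divergence between two policies from samples, the estimators discussed in that post can be written in a few lines. The sketch below (using PyTorch; the two Gaussians are arbitrary stand-ins for the old and new policies) compares the naive log-ratio estimator with the lower-variance estimator (r - 1) - log r, where r is the likelihood ratio; both are unbiased for D_KL(Q || P) under samples from Q, but the latter typically has far smaller variance when the distributions are close.

```python
import torch
from torch.distributions import Normal, kl_divergence

torch.manual_seed(0)
q = Normal(0.0, 1.0)            # distribution we sample from (e.g. the current policy)
p = Normal(0.1, 1.0)            # distribution we compare against

x = q.sample((200_000,))
log_r = p.log_prob(x) - q.log_prob(x)          # log likelihood ratio log(p / q)

k1 = -log_r                                    # naive estimator of D_KL(Q || P)
k3 = (log_r.exp() - 1) - log_r                 # (r - 1) - log r: unbiased, lower variance

print("exact  :", kl_divergence(q, p).item())
print("k1 mean:", k1.mean().item(), " var:", k1.var().item())
print("k3 mean:", k3.mean().item(), " var:", k3.var().item())
```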
When KL appears as a regularizer alongside another loss, its scale relative to the main loss is a critical hyperparameter. In VAEs, the beta hyperparameter controls the trade-off between reconstruction and KL regularization. In RLHF, the beta coefficient controls the strength of the KL anchor. In knowledge distillation, the alpha mixing weight controls the relative influence of the distillation KL versus the supervised cross-entropy. Tuning these coefficients is often more important than tuning model architecture or learning rate, and several adaptive schemes have been proposed for setting them dynamically during training.
Despite its centrality, KL divergence has well-known limitations that motivate the use of alternative measures in some settings. The most important limitations are unboundedness, sensitivity to support mismatches, and the asymmetry that makes interpretation context-dependent.
KL divergence is unbounded above and can be infinite when the two distributions have different supports. This causes problems when the candidate distributions arise from limited sampling or when the model family has narrow support. The Wasserstein distance, which arises from optimal transport theory, addresses this by measuring the cost of transporting probability mass between distributions and remaining finite even for distributions with disjoint supports. Wasserstein distance has become especially popular in generative modeling since the introduction of Wasserstein GANs.
KL divergence is also not a true distance, which complicates its use as a similarity measure. The Jensen-Shannon divergence and the Hellinger distance address this with symmetric and bounded measures, but at the cost of losing some of KL's tight connections to information theory and maximum likelihood. The Renyi divergences provide a parametric family that interpolates between KL and other choices, with trade-offs that depend on the application.
Finally, KL divergence is sensitive to where probability mass is placed but not to how far apart mismatched mass sits in the underlying space. Two narrow distributions with essentially disjoint supports have a very large (or infinite) KL divergence whether their modes are adjacent or far apart, whereas an optimal transport cost grows with the separation. For high-dimensional applications where the geometry of the data matters, KL divergence may not capture the kind of dissimilarity one cares about, and metrics that respect the underlying geometry such as Wasserstein or Sinkhorn divergences may be more appropriate.