Variational inference (VI), also known as variational Bayes (VB), is a family of methods in machine learning and statistics that turn the problem of approximate Bayesian inference into an optimization problem. Given an intractable posterior distribution p(z|x) over latent variables z conditioned on observed data x, variational inference posits a simpler family of candidate distributions q(z; phi) parameterized by phi, then searches for the member of that family that lies closest to the true posterior in the sense of Kullback-Leibler divergence. This reformulation lets practitioners trade the intractable integration of fully Bayesian computation for the comparatively tame machinery of gradient descent, and it has become the dominant technique for scaling Bayesian methods to the very large models and very large datasets that define modern deep learning [1].
The core identity that makes variational inference work is a decomposition of the marginal log-likelihood log p(x) into the sum of an evidence lower bound (ELBO) and a KL divergence between the variational distribution and the true posterior. Because KL divergence is non-negative, maximizing the ELBO is equivalent to minimizing the KL gap, and the value of the ELBO at convergence provides a useful lower bound on the model evidence that can be used for model comparison. Variational inference now serves as the inferential engine for variational autoencoders, latent Dirichlet allocation, Bayesian neural networks, large-scale topic models, and the lower bounds that define many diffusion models [2][3].
Most interesting probabilistic models combine observed data x with unobserved latent variables z. Inference means computing the posterior p(z|x), which by Bayes' rule equals p(x|z) p(z) / p(x). The numerator is usually easy: it is the joint p(x,z), available in closed form by definition of the model. The denominator p(x) is the marginal likelihood (also called the evidence) and is obtained by integrating the joint over all latent configurations. For all but the simplest models, this integral has no closed form. Mixture models, hierarchical Bayesian models, topic models, hidden Markov models with continuous states, and deep generative models all share this property: the joint is tractable, the posterior is not [1].
Two broad strategies have historically dominated approximate inference. The first is Markov chain Monte Carlo, which constructs a stochastic process whose stationary distribution is the target posterior, then draws samples from it. MCMC is asymptotically exact and remains the gold standard for many statistical applications, but its samples are correlated, convergence is hard to diagnose, and the wall-clock cost grows with both model size and dataset size. The second strategy is variational inference, which sacrifices asymptotic exactness for speed. Instead of sampling, VI fits an approximation to the posterior. The approximation is biased, but the optimization is amenable to all the tools that have made deep learning practical: stochastic gradients, GPU acceleration, mini-batching, and automatic differentiation [1][3]. Hybrid methods such as Hamiltonian variational inference and MCMC-augmented VI combine the two families, and importance-weighted bounds let VI approach the exactness of sampling at the cost of more compute.
The foundation of variational inference is the ELBO. Writing q(z) for the variational distribution and p(x,z) for the joint, the log marginal likelihood can be expanded as log p(x) = E_q[log p(x,z)] - E_q[log q(z)] + KL(q(z) || p(z|x)). The first two terms together form the ELBO, often denoted L(q):
L(q) = E_q[log p(x,z)] - E_q[log q(z)]
Because the KL divergence is always non-negative, L(q) is a lower bound on log p(x), with equality if and only if q matches the true posterior exactly. Maximizing L(q) over the variational family therefore drives q toward p(z|x). The ELBO can be rewritten as L(q) = E_q[log p(x|z)] - KL(q(z) || p(z)), decomposing the bound into a reconstruction term and a complexity penalty. In statistical physics, -L(q) is called the variational free energy [4].
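As a concrete check of the bound, the following sketch (a toy illustration, not drawn from the cited papers) uses a conjugate model, z ~ N(0, 1) and x | z ~ N(z, 1), for which log p(x) is available in closed form; a Monte Carlo estimate of L(q) stays below log p(x) for a poor choice of q and becomes tight when q equals the exact posterior N(x/2, 1/2).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = 1.3                                   # a single observation

def elbo(q_mean, q_std, num_samples=100_000):
    z = rng.normal(q_mean, q_std, size=num_samples)          # z ~ q
    log_joint = norm.logpdf(z, 0.0, 1.0) + norm.logpdf(x, z, 1.0)
    log_q = norm.logpdf(z, q_mean, q_std)
    return np.mean(log_joint - log_q)                        # Monte Carlo estimate of L(q)

print("log p(x)          :", norm.logpdf(x, 0.0, np.sqrt(2.0)))   # exact evidence
print("ELBO, poor q      :", elbo(0.0, 1.0))                      # strictly below log p(x)
print("ELBO, posterior q :", elbo(x / 2.0, np.sqrt(0.5)))         # tight: q is the true posterior
```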
Direct maximization of L(q) over arbitrary q is just as hard as the original problem. The art of variational inference lies in choosing the family of q distributions to be expressive enough to capture the important structure of the posterior, yet simple enough that the expectations in L(q) are tractable. The earliest and still most common restriction is the mean-field family.
Mean-field variational inference assumes the variational distribution factorizes across latent variables: q(z_1, ..., z_m) = product over j of q_j(z_j). Each factor q_j is allowed to be any distribution over z_j, but the joint enforces that latent variables are mutually independent under q. This independence assumption is borrowed from mean-field theory in statistical mechanics, where it gives the technique its name. The cost is that any posterior correlations between latent variables vanish under q; the benefit is enormous tractability [1][5].
Under the mean-field assumption, the ELBO can be optimized by coordinate ascent. The optimal update for each factor q_j, holding the others fixed, has a clean closed form: q_j*(z_j) is proportional to exp(E_{-j}[log p(z_j | z_{-j}, x)]). When the model is in the conjugate exponential family, this update produces a distribution of the same parametric form as the conditional, and the algorithm reduces to repeated updates of natural parameters. The result is the coordinate ascent variational inference (CAVI) algorithm: cycle through factors, apply the closed-form update, and monitor the ELBO until it stops improving [1].
CAVI is guaranteed to converge to a local maximum of the ELBO, but the ELBO is generally non-convex. Its weaknesses are that every iteration touches every data point, which scales poorly, and the mean-field assumption can produce dramatically wrong uncertainty estimates when the true posterior has strong correlations.
The best-known consequence of mean-field VI is variance underestimation. Because the optimization minimizes KL(q || p), the reverse KL direction, the objective heavily penalizes placing probability mass where the true posterior has essentially none, but is comparatively forgiving of failing to cover regions where the posterior does have mass. The result is a mode-seeking approximation: q tends to lock onto a single mode of p and to be sharper than p around that mode. For multi-modal posteriors this can be misleading, since a mean-field q will typically capture only one mode. This contrasts with the mode-covering behavior of the forward KL, KL(p || q) [6].
Structured variational inference relaxes the mean-field assumption by allowing some latent variables to remain coupled in q while still factorizing across others. A common pattern is to retain the natural conditional independence structure of the model: in a hidden Markov model, q(z_{1:T}) might be a chain that respects the temporal dependence, while parameters are factorized away. Structured VI typically yields tighter ELBOs at the cost of more complex updates.
The scalability barrier of CAVI was broken by stochastic variational inference (SVI), introduced by Matthew Hoffman, David Blei, Chong Wang, and John Paisley in their 2013 Journal of Machine Learning Research paper. SVI applies stochastic optimization to the ELBO, replacing the full-data gradient with an unbiased estimate computed from a single data point or a small mini-batch. The key insight is that for conditionally conjugate exponential family models, the natural gradient of the ELBO with respect to global variational parameters has a particularly simple form in terms of expected sufficient statistics. Subsampling data and rescaling the local statistics give an unbiased noisy natural gradient, and Robbins-Monro stochastic approximation guarantees convergence [7].
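The following NumPy sketch shows the flavor of the update on an assumed toy conjugate model (a single Gaussian mean with unit noise variance and a N(0, 1) prior, rather than the topic models of the paper): the sufficient statistics of a mini-batch are rescaled as if the batch were the whole dataset, and a Robbins-Monro step on the natural parameters converges to the exact posterior.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100_000
data = rng.normal(2.0, 1.0, size=N)            # true mean 2.0

# q(mu) is Gaussian, tracked by natural parameters lam = (precision * mean, precision).
lam = np.array([0.0, 1.0])                     # initialize at the N(0, 1) prior
batch_size, delay, kappa = 100, 1.0, 0.7       # Robbins-Monro step-size schedule

for t in range(1, 2001):
    batch = rng.choice(data, size=batch_size, replace=False)
    # "Intermediate" estimate: prior natural parameters plus the mini-batch
    # sufficient statistics rescaled to the full dataset size.
    lam_hat = np.array([0.0 + (N / batch_size) * batch.sum(),
                        1.0 + N])
    rho = (t + delay) ** (-kappa)              # decaying step size
    lam = (1.0 - rho) * lam + rho * lam_hat    # noisy natural-gradient step

print("SVI estimate   : mean=%.4f precision=%.1f" % (lam[0] / lam[1], lam[1]))
print("Exact posterior: mean=%.4f precision=%.1f" % (data.sum() / (1.0 + N), 1.0 + N))
```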
SVI was a watershed for applied Bayesian modeling. It made it possible to fit latent Dirichlet allocation and the hierarchical Dirichlet process to corpora with millions of documents. It also reframed variational inference in the language of stochastic optimization, paving the way for the black-box and automatic-differentiation methods that followed.
A limitation of SVI in its original form is that it requires conditional conjugacy and model-specific derivations. Black box variational inference (BBVI), introduced by Rajesh Ranganath, Sean Gerrish, and David Blei in 2014, removed both restrictions. BBVI estimates the gradient of the ELBO using only samples from q and evaluations of log p(x,z) and log q(z), with no model-specific calculations needed [8].
The core trick is the score function (REINFORCE) gradient estimator. Differentiating the ELBO yields gradient L(phi) = E_q[(log p(x,z) - log q(z; phi)) * gradient log q(z; phi)], which can be approximated by drawing samples from q, evaluating the log probabilities, and averaging. The estimator is unbiased but has high variance, so BBVI introduces variance reduction techniques such as control variates and Rao-Blackwellization. Because the estimator only requires sampling and pointwise log-density evaluations, BBVI works on essentially any model that can be specified in code [8].
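The sketch below (an illustration on an assumed toy model, z ~ N(0, 1) with x_i ~ N(z, 1), not the authors' code) implements the score-function estimator for a Gaussian q parameterized by mu and log sigma, and uses a running baseline as a simple control variate.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(1.5, 1.0, size=50)                            # observed data

def log_joint(z):                                            # log p(x, z) for each sample of z
    return norm.logpdf(z, 0.0, 1.0) + norm.logpdf(x[:, None], z, 1.0).sum(axis=0)

mu, log_sigma = 0.0, 0.0                                     # variational parameters phi
lr, num_samples, baseline = 0.01, 64, 0.0

for step in range(3000):
    sigma = np.exp(log_sigma)
    z = rng.normal(mu, sigma, size=num_samples)              # z ~ q(z; phi)
    weight = log_joint(z) - norm.logpdf(z, mu, sigma)        # log p(x, z) - log q(z; phi)
    centered = weight - baseline                             # baseline uses only past samples, so unbiased
    # Score functions of the Gaussian: d/d mu log q = (z - mu) / sigma^2,
    #                                  d/d log_sigma log q = (z - mu)^2 / sigma^2 - 1.
    grad_mu = np.mean(centered * (z - mu) / sigma ** 2)
    grad_log_sigma = np.mean(centered * ((z - mu) ** 2 / sigma ** 2 - 1.0))
    mu += lr * grad_mu
    log_sigma += lr * grad_log_sigma
    baseline = 0.9 * baseline + 0.1 * weight.mean()          # running control variate

print("q mean", mu, "vs exact posterior mean", x.sum() / (len(x) + 1))
print("q std ", np.exp(log_sigma), "vs exact posterior std ", (1.0 / (len(x) + 1)) ** 0.5)
```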
A second route to model-agnostic variational inference, with much lower gradient variance than score-function methods, is the reparameterization trick. The idea, popularized by Diederik Kingma and Max Welling in their 2013 Auto-Encoding Variational Bayes paper and independently by Danilo Rezende, Shakir Mohamed, and Daan Wierstra in 2014, is to express the random variable z under q as a deterministic transformation of an auxiliary noise variable epsilon whose distribution does not depend on phi: z = g(epsilon, phi). For the most common case of a Gaussian q with mean mu and standard deviation sigma, the reparameterization is z = mu + sigma * epsilon with epsilon drawn from a standard normal. The expectation in the ELBO then becomes an expectation over the fixed noise distribution, and gradients with respect to phi can flow through g via standard backpropagation [9][10].
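A short PyTorch sketch of the reparameterized gradient on the same assumed toy model (z ~ N(0, 1), x_i ~ N(z, 1)) follows: because z is an explicit function of mu, sigma, and the parameter-free noise epsilon, automatic differentiation carries the ELBO gradient back to the variational parameters.

```python
import torch

torch.manual_seed(0)
x = torch.randn(50) + 1.5                                  # observed data

mu = torch.zeros(1, requires_grad=True)
log_sigma = torch.zeros(1, requires_grad=True)
opt = torch.optim.Adam([mu, log_sigma], lr=0.05)

for step in range(2000):
    opt.zero_grad()
    sigma = log_sigma.exp()
    eps = torch.randn(64)                                  # parameter-free noise
    z = mu + sigma * eps                                   # z = g(eps, phi); gradients flow through g
    log_prior = torch.distributions.Normal(0.0, 1.0).log_prob(z)
    log_lik = torch.distributions.Normal(z.unsqueeze(1), 1.0).log_prob(x).sum(dim=1)
    log_q = torch.distributions.Normal(mu, sigma).log_prob(z)
    elbo = (log_prior + log_lik - log_q).mean()            # 64-sample Monte Carlo estimate of the ELBO
    (-elbo).backward()                                     # minimize the negative ELBO
    opt.step()

print("q mean", mu.item(), "q std", log_sigma.exp().item())
```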
The reparameterization trick produces gradient estimates with much lower variance than score-function estimators in continuous latent variable models, which is one of the reasons it has dominated the deep generative modeling literature. It applies whenever q can be written as a differentiable transformation of a parameter-free noise distribution, a property satisfied by Gaussians, log-normals, location-scale families, and any distribution defined by a neural network with stochastic inputs.
The combination of stochastic optimization, reparameterization, and automatic differentiation produced one of the most broadly useful tools for applied Bayesian modeling: automatic differentiation variational inference (ADVI), introduced by Alp Kucukelbir, Dustin Tran, Rajesh Ranganath, Andrew Gelman, and David Blei in their 2017 Journal of Machine Learning Research paper. ADVI takes a probabilistic model written in a generic modeling language and runs a fully automatic mean-field or full-rank Gaussian variational approximation. The user provides only the model and data; ADVI handles transformations of constrained variables to unconstrained space, the variational family, gradient estimation, and optimization [11].
ADVI works by mapping every latent variable to an unconstrained real-valued representation, fitting a Gaussian variational distribution in that unconstrained space, and then transforming back. The reparameterization trick gives low-variance gradients, and stochastic optimization with adaptive step sizes handles convergence. ADVI is the default variational inference engine in the Stan probabilistic programming language and has been ported into numerous other tools [11].
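To make the recipe concrete, the sketch below hand-rolls the ADVI steps for an assumed one-parameter model (an Exponential(1) prior on a rate theta with Exponential(theta) observations; this is an illustration, not the Stan implementation): the positive parameter is mapped to unconstrained zeta = log(theta), a Gaussian q over zeta is fit by reparameterized gradients, and the log-Jacobian of the inverse transform (here simply zeta) is added to the objective.

```python
import math
import torch

torch.manual_seed(0)
x = torch.distributions.Exponential(2.0).sample((200,))      # data with true rate 2.0

m = torch.tensor(0.0, requires_grad=True)                    # variational mean in unconstrained space
log_s = torch.tensor(0.0, requires_grad=True)                # log of the variational std
opt = torch.optim.Adam([m, log_s], lr=0.05)

def log_joint(theta):
    log_prior = torch.distributions.Exponential(1.0).log_prob(theta)
    log_lik = torch.distributions.Exponential(theta.unsqueeze(1)).log_prob(x).sum(dim=1)
    return log_prior + log_lik

for step in range(2000):
    opt.zero_grad()
    zeta = m + log_s.exp() * torch.randn(32)                 # reparameterized q(zeta)
    theta = zeta.exp()                                       # map back to the constrained space
    log_jacobian = zeta                                      # log |d theta / d zeta| = zeta for exp
    entropy = 0.5 * math.log(2 * math.pi * math.e) + log_s   # entropy of the Gaussian q
    elbo = (log_joint(theta) + log_jacobian).mean() + entropy
    (-elbo).backward()
    opt.step()

# Mean of the implied log-normal approximation; close to the exact Gamma posterior mean of about 2.
print("approximate posterior mean of theta:", (m + 0.5 * log_s.exp() ** 2).exp().item())
```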
Mean-field and Gaussian variational families, despite their convenience, are too simple to capture the structure of complex posteriors. Normalizing flows, introduced for variational inference by Danilo Rezende and Shakir Mohamed in 2015, build expressive variational distributions by composing a sequence of invertible transformations applied to a simple base distribution. Each transformation is a diffeomorphism whose Jacobian determinant can be computed efficiently, so the change-of-variables formula yields a tractable density for the resulting distribution. By stacking many such layers, flows can represent arbitrarily complex multi-modal or skewed posteriors while still allowing exact density evaluation and easy sampling [12]. Subsequent work has produced inverse autoregressive flows, real NVP, neural spline flows, and many other variants. Normalizing flow posteriors are a natural drop-in replacement for the Gaussian q in a variational autoencoder.
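The sketch below implements one planar-flow layer of the kind introduced in the cited paper and composes a few of them over a standard-normal base; the change-of-variables correction keeps the density of the transformed samples exact. The invertibility constraint on u discussed in the paper (reparameterizing u so that u^T w >= -1) is omitted here for brevity.

```python
import torch

class PlanarFlow(torch.nn.Module):
    """One layer f(z) = z + u * tanh(w^T z + b) with a tractable log-det Jacobian."""
    def __init__(self, dim):
        super().__init__()
        self.u = torch.nn.Parameter(0.1 * torch.randn(dim))
        self.w = torch.nn.Parameter(0.1 * torch.randn(dim))
        self.b = torch.nn.Parameter(torch.zeros(()))

    def forward(self, z):                                        # z: (batch, dim)
        lin = z @ self.w + self.b                                # (batch,)
        z_new = z + self.u * torch.tanh(lin).unsqueeze(-1)
        psi = (1.0 - torch.tanh(lin) ** 2).unsqueeze(-1) * self.w
        log_det = torch.log((1.0 + psi @ self.u).abs() + 1e-8)   # |det df/dz| = |1 + u^T psi(z)|
        return z_new, log_det

flows = torch.nn.ModuleList([PlanarFlow(2) for _ in range(4)])
base = torch.distributions.MultivariateNormal(torch.zeros(2), torch.eye(2))

z = base.sample((256,))
log_q = base.log_prob(z)
for flow in flows:
    z, log_det = flow(z)
    log_q = log_q - log_det          # change of variables: subtract each layer's log-det
# z now holds samples from the flow posterior q_K; log_q is their exact log-density.
```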
Classical variational inference fits separate variational parameters for each data point's latent variables, which becomes impractical when the dataset is large. Amortized variational inference replaces the per-instance parameters with a shared inference network: a neural network that maps each observation x to the parameters of q(z|x). After training, computing the approximate posterior for a new observation requires only a single forward pass through the inference network. This is the architecture that defines the encoder side of a variational autoencoder [9].
Amortization brings two benefits and one cost. The benefits are speed (inference at test time is a forward pass) and statistical sharing across data points. The cost is the amortization gap, the difference between the ELBO achievable with the inference network and the ELBO that would be achievable by optimizing q(z|x) freely for each x. For complex posteriors this gap can be substantial, motivating semi-amortized methods that refine the network's output with a few steps of per-instance optimization.
The lineage of variational inference traces through several distinct traditions. The variational principle itself comes from physics, where it provides a way to bound free energies by minimizing functionals over candidate distributions. The connection to Bayesian inference was made by Geoffrey Hinton and Drew van Camp in their 1993 paper Keeping Neural Networks Simple by Minimizing the Description Length of the Weights, which used a variational bound to penalize the information content of network weights and is widely regarded as the first application of variational Bayes to neural networks [13].
Throughout the 1990s, variational methods spread through graphical models. Michael Jordan, Zoubin Ghahramani, Tommi Jaakkola, and Lawrence Saul published their landmark survey An Introduction to Variational Methods for Graphical Models in the journal Machine Learning in 1999, synthesizing the existing techniques into a coherent framework [5]. The early 2000s saw variational inference scale to large topic models: latent Dirichlet allocation, introduced by Blei, Ng, and Jordan in 2003, was fitted with a mean-field variational EM algorithm [14].
The modern era began with three nearly-simultaneous developments in 2013 and 2014. Hoffman and colleagues published stochastic variational inference, breaking the data-size barrier for conjugate exponential family models. Kingma and Welling published Auto-Encoding Variational Bayes, introducing the variational autoencoder and the reparameterization trick. Ranganath, Gerrish, and Blei published Black Box Variational Inference, removing the conjugacy requirement. The pieces fit together: SVI gave scalability, BBVI gave generality, and the VAE showed that VI could power state-of-the-art generative models. The years that followed produced ADVI, normalizing flows, importance-weighted autoencoders, and a flood of variants now used throughout machine learning [3][7][8][9].
The choice between variational inference and MCMC has shaped a great deal of practical Bayesian modeling. The two approaches make opposite trade-offs along several axes.
| Criterion | Variational Inference | Markov Chain Monte Carlo |
|---|---|---|
| Output | A parametric approximation q(z) | A set of samples from p(z\|x) |
| Asymptotic guarantee | Local optimum of ELBO; biased | Asymptotically exact |
| Speed | Fast; uses gradients and mini-batches | Slow per sample; serial chain |
| Scalability to large data | Excellent (SVI, ADVI, amortization) | Limited; requires subsampling tricks |
| Scalability to large models | Excellent; runs on GPUs | Limited; HMC is expensive in high dim |
| Uncertainty calibration | Often underestimates variance | Reliable when well-mixed |
| Multimodal posteriors | Mode-seeking; tends to capture one mode | Can in principle visit all modes |
| Convergence diagnostics | ELBO trajectory; less mature | R-hat, ESS, trace plots; well developed |
| Best fit | Big data, deep generative models | Small to medium data, careful inference |
In practice, many statisticians use VI for exploration and rapid iteration, then validate with MCMC for calibrated uncertainty. Engineers building production systems with millions of latent variables typically rely on variational methods exclusively [1].
The following timeline tracks the methods that have shaped modern variational inference.
| Year | Method | Authors | Key idea |
|---|---|---|---|
| 1993 | Variational free energy for neural networks | Hinton, van Camp | First application of VI to network weights via MDL |
| 1999 | Variational methods survey | Jordan, Ghahramani, Jaakkola, Saul | Unified framework for VI in graphical models |
| 2003 | LDA with variational EM | Blei, Ng, Jordan | Mean-field VI for topic modeling |
| 2005 | Variational message passing | Winn, Bishop | Generic VI through messages on a factor graph |
| 2013 | Stochastic variational inference (SVI) | Hoffman, Blei, Wang, Paisley | Stochastic gradients on natural parameters |
| 2013 | Auto-Encoding Variational Bayes (VAE) | Kingma, Welling | Reparameterization trick and amortized inference |
| 2014 | Stochastic backpropagation | Rezende, Mohamed, Wierstra | Reparameterized gradients in deep generative models |
| 2014 | Black box variational inference (BBVI) | Ranganath, Gerrish, Blei | Score-function gradients; no model-specific work |
| 2015 | Normalizing flows for VI | Rezende, Mohamed | Expressive q via invertible transformations |
| 2015 | Bayes by Backprop | Blundell et al. | VI over neural network weights |
| 2016 | Importance weighted autoencoders | Burda, Grosse, Salakhutdinov | Tighter ELBO via importance sampling |
| 2017 | Automatic differentiation VI (ADVI) | Kucukelbir et al. | Fully automatic Gaussian VI; deployed in Stan |
| 2021 | Variational diffusion models | Kingma, Salimans, Poole, Ho | Diffusion models written as hierarchical VI |
The most famous large-scale application of variational inference outside of deep learning is latent Dirichlet allocation (LDA), the topic model introduced by Blei, Ng, and Jordan in 2003. LDA models each document as a mixture over topics, where each topic is itself a distribution over words. The latent variables are the per-document topic proportions and the per-word topic assignments, and the posterior given a corpus is intractable because of the coupling between topics and assignments [14].
The original paper used a mean-field variational EM algorithm: factorize q over topic proportions and assignments, run CAVI on each factor, and use the resulting expectations to update topic-word distributions. This scaled to corpora of tens of thousands of documents, and stochastic variational inference later pushed it to corpora of millions [7][14].
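In practice, libraries such as gensim train LDA with online variational Bayes; the usage sketch below (with a made-up four-document toy corpus) shows the typical workflow.

```python
from gensim import corpora
from gensim.models import LdaModel

texts = [["bayesian", "inference", "posterior"],
         ["neural", "network", "gradient"],
         ["posterior", "variational", "bound"],
         ["gradient", "descent", "network"]]
dictionary = corpora.Dictionary(texts)                     # word <-> id mapping
corpus = [dictionary.doc2bow(doc) for doc in texts]        # bag-of-words representation

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10)
for topic_id, words in lda.show_topics(num_topics=2, num_words=3, formatted=False):
    print(topic_id, [word for word, _ in words])
```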
The variational autoencoder (VAE), introduced by Kingma and Welling in 2013, is the canonical modern application of variational inference. A VAE is a deep generative model in which a latent code z is drawn from a simple prior (typically a standard normal), passed through a decoder neural network to produce parameters of a likelihood over data x. Inference is amortized through an encoder network that outputs the mean and variance of a Gaussian q(z|x). The two networks are trained jointly by maximizing the ELBO, which decomposes into a reconstruction term and a KL term that pulls q toward the prior [9].
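The sketch below is a compact PyTorch rendering of that objective, assuming 784-dimensional binarized inputs with a Bernoulli likelihood and a 20-dimensional Gaussian latent code; it illustrates the standard recipe rather than reproducing the original implementation.

```python
import torch
import torch.nn.functional as F

class VAE(torch.nn.Module):
    def __init__(self, x_dim=784, z_dim=20, hidden=400):
        super().__init__()
        self.enc = torch.nn.Sequential(torch.nn.Linear(x_dim, hidden), torch.nn.ReLU())
        self.enc_mu = torch.nn.Linear(hidden, z_dim)
        self.enc_logvar = torch.nn.Linear(hidden, z_dim)
        self.dec = torch.nn.Sequential(
            torch.nn.Linear(z_dim, hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, x_dim))                    # Bernoulli logits over pixels

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        z = mu + (0.5 * logvar).exp() * torch.randn_like(mu)   # reparameterized sample from q(z|x)
        return self.dec(z), mu, logvar

def negative_elbo(model, x):
    logits, mu, logvar = model(x)
    # Reconstruction term: Bernoulli log-likelihood summed over pixels.
    recon = F.binary_cross_entropy_with_logits(logits, x, reduction="none").sum(dim=1)
    # KL(q(z|x) || N(0, I)) in closed form for a diagonal Gaussian encoder.
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=1)
    return (recon + kl).mean()

model = VAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(128, 784).round()              # stand-in for a binarized mini-batch
opt.zero_grad()
negative_elbo(model, x).backward()            # one gradient step on the negative ELBO
opt.step()
```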
VAEs became a workhorse of unsupervised learning in the mid-2010s. They produce useful continuous latent representations for downstream tasks, generate plausible samples, and provide a principled probabilistic framework for representation learning. The original VAE has been extended in many directions: beta-VAEs trade reconstruction for disentanglement, conditional VAEs allow class-conditional generation, hierarchical VAEs stack many layers of latent variables, and VQ-VAEs use discrete codes.
Variational inference provides one of the few practical routes to fully Bayesian deep learning. Treating the weights of a neural network as random variables with a prior and a posterior gives calibrated uncertainty over predictions, but the posterior over millions of weights is intractable. Bayes by Backprop, introduced by Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra in 2015, applies mean-field variational inference to the weights: each weight has its own Gaussian variational distribution, and the ELBO is optimized via the reparameterization trick to learn weight means and variances jointly [15].
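A sketch of a single Bayesian linear layer in this style follows, assuming standard-normal priors on weights and biases; the KL term between the Gaussian variational distribution and the Gaussian prior is available in closed form, and the likelihood term here is a placeholder unit-variance Gaussian regression loss.

```python
import torch
import torch.nn.functional as F

class BayesianLinear(torch.nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.w_mu = torch.nn.Parameter(0.1 * torch.randn(out_features, in_features))
        self.w_rho = torch.nn.Parameter(torch.full((out_features, in_features), -3.0))
        self.b_mu = torch.nn.Parameter(torch.zeros(out_features))
        self.b_rho = torch.nn.Parameter(torch.full((out_features,), -3.0))

    @staticmethod
    def _kl(mu, sigma):
        # KL( N(mu, sigma^2) || N(0, 1) ), summed over all weights in the layer.
        return (0.5 * (sigma ** 2 + mu ** 2 - 1.0) - sigma.log()).sum()

    def forward(self, x):
        w_sigma, b_sigma = F.softplus(self.w_rho), F.softplus(self.b_rho)
        w = self.w_mu + w_sigma * torch.randn_like(w_sigma)   # reparameterized weight sample
        b = self.b_mu + b_sigma * torch.randn_like(b_sigma)
        self.kl = self._kl(self.w_mu, w_sigma) + self._kl(self.b_mu, b_sigma)
        return F.linear(x, w, b)

layer = BayesianLinear(10, 1)
opt = torch.optim.Adam(layer.parameters(), lr=1e-2)
x, y = torch.randn(64, 10), torch.randn(64, 1)     # stand-in regression batch
nll = 0.5 * ((layer(x) - y) ** 2).sum()            # unit-variance Gaussian likelihood
loss = nll + layer.kl                              # negative ELBO (one batch treated as the full data)
loss.backward()
opt.step()
```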
The resulting Bayesian neural network provides per-weight uncertainty estimates that can be propagated to predictions by Monte Carlo sampling at test time. Related techniques include Monte Carlo dropout (which can be interpreted as VI with a Bernoulli-structured variational distribution over the weights) and stochastic weight averaging. None of these approaches captures the full weight posterior faithfully, but they provide vastly better uncertainty estimates than point-estimate networks at modest extra cost.
The Gaussian mixture model is one of the simplest examples on which variational inference can be derived end-to-end, which is why it appears in essentially every introduction to the topic. With cluster assignments and component parameters as latent variables and mean-field q over both, CAVI yields closed-form updates that closely resemble the Expectation-Maximization (EM) algorithm but with proper Bayesian treatment of the parameters [1].
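The sketch below carries out those updates in NumPy for an assumed textbook setup: K component means with broad Gaussian priors, unit-variance components, and uniform mixing proportions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 150), rng.normal(3.0, 1.0, 150)])
K, prior_var = 2, 10.0

m = rng.normal(0.0, 1.0, K)                 # variational means of the component means
s_sq = np.ones(K)                           # variational variances of the component means
phi = np.full((len(x), K), 1.0 / K)         # responsibilities q(c_n)

for iteration in range(100):
    # Update q(c_n): phi_nk proportional to exp( E[mu_k] x_n - E[mu_k^2] / 2 ).
    log_phi = np.outer(x, m) - 0.5 * (m ** 2 + s_sq)
    phi = np.exp(log_phi - log_phi.max(axis=1, keepdims=True))
    phi /= phi.sum(axis=1, keepdims=True)
    # Update q(mu_k): a Gaussian whose precision accumulates the responsibilities.
    precision = 1.0 / prior_var + phi.sum(axis=0)
    m = (phi * x[:, None]).sum(axis=0) / precision
    s_sq = 1.0 / precision

print("variational means of the components:", np.round(m, 2))   # close to -2 and 3
```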
The explosion of interest in diffusion models since 2020 has reframed the most successful generative architectures of the present era as hierarchical variational models. A diffusion model can be derived as a Markovian hierarchical VAE in which the encoder is a fixed forward Gaussian noising process and the decoder is a learned reverse denoising process. The training objective is a variational lower bound on log p(x), which under the standard parameterization simplifies to a weighted sum of squared-error denoising losses across noise scales. The Variational Diffusion Models paper by Kingma, Salimans, Poole, and Ho in 2021 showed that this bound has a clean form in terms of the signal-to-noise ratio and that the noise schedule itself can be learned jointly with the denoising network [16]. Score-based generative models, denoising diffusion probabilistic models, and continuous-time SDE-based generative models can all be cast in variational terms.
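The sketch below shows the shape of that training objective with unit per-timestep weights (the common simplified loss; the cited paper derives the exact weighting in terms of the signal-to-noise ratio), using a stand-in MLP as the denoiser and a linear noise schedule chosen for illustration.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)                 # assumed linear noise schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)         # cumulative signal retention

# Stand-in denoising network: takes the noisy sample and a normalized timestep.
denoiser = torch.nn.Sequential(torch.nn.Linear(3, 64), torch.nn.ReLU(), torch.nn.Linear(64, 2))

x = torch.randn(128, 2)                               # stand-in clean data
t = torch.randint(0, T, (128,))
eps = torch.randn_like(x)
a = alpha_bar[t].unsqueeze(-1)
z_t = a.sqrt() * x + (1.0 - a).sqrt() * eps           # sample from the fixed forward (noising) process
eps_hat = denoiser(torch.cat([z_t, t.unsqueeze(-1) / T], dim=1))
loss = ((eps_hat - eps) ** 2).sum(dim=1).mean()       # per-timestep denoising (squared-error) term
loss.backward()
```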
Variational inference is implemented in essentially every modern probabilistic programming system. Most libraries converge on a similar workflow: write a model, choose a variational family (often called a guide function in Pyro terminology), and run a stochastic optimizer.
| Library | Language | Backend | Notes |
|---|---|---|---|
| Stan | Stan modeling language | Custom autodiff | ADVI is the built-in VI engine; widely used by statisticians |
| PyMC | Python | PyTensor / JAX | Mean-field, full-rank, and normalizing flow VI |
| NumPyro | Python | JAX | SVI, autoguides, and JIT-compiled gradients on GPU/TPU |
| Pyro | Python | PyTorch | General-purpose SVI with rich guide language |
| TensorFlow Probability | Python | TensorFlow | Variational distributions and surrogate posteriors |
| Edward2 | Python | TensorFlow | Functional probabilistic programming with VI |
| Turing.jl | Julia | Various | ADVI and other VI methods in the Julia ecosystem |
Most modern systems also offer automatic guide construction, often called autoguides, that pick a reasonable variational family without user intervention. The combination of automatic differentiation, autoguides, and adaptive optimizers has made variational inference about as easy to apply as point estimation [11].
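As one concrete example of that workflow, the NumPyro sketch below (assuming a recent NumPyro release; the toy model estimates a single Gaussian mean) pairs a model with an automatic mean-field guide and runs stochastic optimization of the ELBO.

```python
import jax.random as random
import numpyro
import numpyro.distributions as dist
from numpyro.infer import SVI, Trace_ELBO
from numpyro.infer.autoguide import AutoNormal
from numpyro.optim import Adam

def model(data):
    mu = numpyro.sample("mu", dist.Normal(0.0, 10.0))        # prior on the unknown mean
    numpyro.sample("obs", dist.Normal(mu, 1.0), obs=data)    # likelihood

data = dist.Normal(2.0, 1.0).sample(random.PRNGKey(0), (500,))
guide = AutoNormal(model)                                    # automatic mean-field Gaussian guide
svi = SVI(model, guide, Adam(step_size=0.05), Trace_ELBO())
result = svi.run(random.PRNGKey(1), 2000, data)              # stochastic optimization of the ELBO
print(result.params)                                         # optimized variational parameters
```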
Applying variational inference well requires attention to several recurring issues. The first is choice of variational family. Mean-field is the simplest, but its independence assumption can be devastating when latent variables are strongly correlated. A full-rank Gaussian captures pairwise correlations at quadratic memory cost. Normalizing flows give arbitrary flexibility but add training cost.
The second is gradient estimator variance. The reparameterization trick has lower variance than the score-function estimator and should be used whenever possible. For discrete latent variables, gradient estimators like Gumbel-softmax, REBAR, RELAX, and straight-through estimators each offer different bias-variance trade-offs.
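The sketch below shows the Gumbel-softmax relaxation with the temperature as an assumed hyperparameter; sampling becomes a differentiable function of the class logits, so reparameterized gradients apply (PyTorch also ships a built-in torch.nn.functional.gumbel_softmax).

```python
import torch

def gumbel_softmax_sample(logits, tau=0.5):
    u = torch.rand_like(logits).clamp(1e-9, 1.0)
    gumbel = -torch.log(-torch.log(u))                     # Gumbel(0, 1) noise
    return torch.softmax((logits + gumbel) / tau, dim=-1)  # relaxed (soft) one-hot sample

logits = torch.zeros(3, requires_grad=True)
values = torch.tensor([1.0, 2.0, 3.0])                     # stand-in downstream quantity
y = gumbel_softmax_sample(logits)
loss = (y * values).sum()                                  # differentiable in the logits
loss.backward()                                            # gradients reach the discrete distribution's logits
```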
The third is posterior collapse, a pathology specific to amortized models like VAEs. When the decoder is sufficiently expressive, the optimizer can learn to ignore the latent code entirely, driving q(z|x) to the prior. Mitigations include KL annealing, free bits, constraining the decoder, and stronger encoders.
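A minimal sketch of linear KL annealing follows, with the warm-up length as an assumed hyperparameter; the coefficient scales the KL term of the negative ELBO so the reconstruction term dominates early in training.

```python
def kl_weight(step, warmup_steps=10_000):
    """Ramp the KL coefficient linearly from 0 to 1 over the warm-up period."""
    return min(1.0, step / warmup_steps)

# Inside a VAE training loop (see the earlier VAE sketch):
#     loss = recon + kl_weight(step) * kl
```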
The fourth is diagnostics. The ELBO trajectory is the basic indicator of convergence, but a high ELBO does not imply that q is close to the true posterior, because the gap to log p(x), which is exactly the KL divergence from q to the posterior, is never observed. Pareto-smoothed importance sampling can detect when the approximation is poor.
Variational inference inherits a small set of limitations from the optimization structure. The bias from the choice of variational family cannot be removed without enriching the family at compute cost. The reverse-KL objective is mode-seeking, which underestimates posterior variance. The amortization gap creates a structural ceiling on accuracy in inference networks.
Research continues on all of these fronts. Importance-weighted bounds, alpha divergences, Stein discrepancies, and Renyi divergences provide alternative objectives. Hamiltonian variational inference, MCMC-augmented VI, and unbiased VI use sampling to refine variational approximations. Implicit variational distributions parameterized by neural samplers can in principle represent any distribution but bring estimation challenges of their own.
Despite these caveats, variational inference is the only practical approach for an enormous fraction of modern probabilistic modeling. The success of variational autoencoders, latent Dirichlet allocation, Bayesian neural networks, and diffusion models testifies to the breadth of its impact [1][3].