Variational inference (VI), also known as variational Bayes (VB), is a family of methods in machine learning and statistics that turn the problem of approximate Bayesian inference into an optimization problem. Given an intractable posterior distribution p(z|x) over latent variables z conditioned on observed data x, variational inference posits a simpler family of candidate distributions q(z; phi) parameterized by phi, then searches for the member of that family that lies closest to the true posterior in the sense of Kullback-Leibler divergence. This reformulation lets practitioners trade the intractable integration of fully Bayesian computation for the comparatively tame machinery of gradient descent, and it has become the dominant technique for scaling Bayesian methods to the very large models and very large datasets that define modern deep learning [1].
The core identity that makes variational inference work is a decomposition of the marginal log-likelihood log p(x) into the sum of an evidence lower bound (ELBO) and a KL divergence between the variational distribution and the true posterior. Because KL divergence is non-negative, maximizing the ELBO is equivalent to minimizing the KL gap, and the value of the ELBO at convergence provides a useful lower bound on the model evidence that can be used for model comparison. Variational inference now serves as the inferential engine for variational autoencoders, latent Dirichlet allocation, Bayesian neural networks, large-scale topic models, and the lower bounds that define many diffusion models [2][3].
Most interesting probabilistic models combine observed data x with unobserved latent variables z. Inference means computing the posterior p(z|x), which by Bayes' rule equals p(x|z) p(z) / p(x). The numerator is usually easy: it is the joint p(x,z), available in closed form by definition of the model. The denominator p(x) is the marginal likelihood (also called the evidence) and is obtained by integrating the joint over all latent configurations. For all but the simplest models, this integral has no closed form. Mixture models, hierarchical Bayesian models, topic models, hidden Markov models with continuous states, and deep generative models all share this property: the joint is tractable, the posterior is not [1].
Two broad strategies have historically dominated approximate inference. The first is Markov chain Monte Carlo, which constructs a stochastic process whose stationary distribution is the target posterior, then draws samples from it. MCMC is asymptotically exact and remains the gold standard for many statistical applications, but its samples are correlated, convergence is hard to diagnose, and the wall-clock cost grows with both model size and dataset size. The second strategy is variational inference, which sacrifices asymptotic exactness for speed. Instead of sampling, VI fits an approximation to the posterior. The approximation is biased, but the optimization is amenable to all the tools that have made deep learning practical: stochastic gradients, GPU acceleration, mini-batching, and automatic differentiation [1][3]. Hybrid methods such as Hamiltonian variational inference and MCMC-augmented VI combine the two families, and importance-weighted bounds let VI approach the exactness of sampling at the cost of more compute.
The foundation of variational inference is the ELBO. Writing q(z) for the variational distribution and p(x,z) for the joint, the log marginal likelihood can be expanded as log p(x) = E_q[log p(x,z)] - E_q[log q(z)] + KL(q(z) || p(z|x)). The first two terms together form the ELBO, often denoted L(q):
L(q) = E_q[log p(x,z)] - E_q[log q(z)]
Because the KL divergence is always non-negative, L(q) is a lower bound on log p(x), with equality if and only if q matches the true posterior exactly. Maximizing L(q) over the variational family therefore drives q toward p(z|x). The ELBO can be rewritten as L(q) = E_q[log p(x|z)] - KL(q(z) || p(z)), decomposing the bound into a reconstruction term and a complexity penalty. In statistical physics, -L(q) is called the variational free energy [4].
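As a concrete check of the bound, the following sketch (a toy illustration, not drawn from the cited papers) uses a conjugate model, z ~ N(0, 1) and x | z ~ N(z, 1), for which log p(x) is available in closed form; a Monte Carlo estimate of L(q) stays below log p(x) for a poor choice of q and becomes tight when q equals the exact posterior N(x/2, 1/2).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = 1.3                                   # a single observation

def elbo(q_mean, q_std, num_samples=100_000):
    z = rng.normal(q_mean, q_std, size=num_samples)          # z ~ q
    log_joint = norm.logpdf(z, 0.0, 1.0) + norm.logpdf(x, z, 1.0)
    log_q = norm.logpdf(z, q_mean, q_std)
    return np.mean(log_joint - log_q)                        # Monte Carlo estimate of L(q)

print("log p(x)          :", norm.logpdf(x, 0.0, np.sqrt(2.0)))   # exact evidence
print("ELBO, poor q      :", elbo(0.0, 1.0))                      # strictly below log p(x)
print("ELBO, posterior q :", elbo(x / 2.0, np.sqrt(0.5)))         # tight: q is the true posterior
```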
Direct maximization of L(q) over arbitrary q is just as hard as the original problem. The art of variational inference lies in choosing the family of q distributions to be expressive enough to capture the important structure of the posterior, yet simple enough that the expectations in L(q) are tractable. The earliest and still most common restriction is the mean-field family.
Mean-field variational inference assumes the variational distribution factorizes across latent variables: q(z_1, ..., z_m) = product over j of q_j(z_j). Each factor q_j is allowed to be any distribution over z_j, but the joint enforces that latent variables are mutually independent under q. This independence assumption is borrowed from mean-field theory in statistical mechanics, where it gives the technique its name. The cost is that any posterior correlations between latent variables vanish under q; the benefit is enormous tractability [1][5].
Under the mean-field assumption, the ELBO can be optimized by coordinate ascent. The optimal update for each factor q_j, holding the others fixed, has a clean closed form: q_j*(z_j) is proportional to exp(E_{-j}[log p(z_j | z_{-j}, x)]). When the model is in the conjugate exponential family, this update produces a distribution of the same parametric form as the conditional, and the algorithm reduces to repeated updates of natural parameters. The result is the coordinate ascent variational inference (CAVI) algorithm: cycle through factors, apply the closed-form update, and monitor the ELBO until it stops improving [1].
CAVI is guaranteed to converge to a local maximum of the ELBO, but the ELBO is generally non-convex. Its weaknesses are that every iteration touches every data point, which scales poorly, and the mean-field assumption can produce dramatically wrong uncertainty estimates when the true posterior has strong correlations.
The best-known consequence of mean-field VI is variance underestimation. Because the optimization minimizes KL(q || p), the reverse KL direction, the objective heavily penalizes placing probability mass where the true posterior has essentially none, but is comparatively forgiving of failing to cover regions where the posterior does have mass. The result is a mode-seeking approximation: q tends to lock onto a single mode of p and to be sharper than p around that mode. For multi-modal posteriors this can be misleading, since a mean-field q will typically capture only one mode. This contrasts with the mode-covering behavior of the forward KL, KL(p || q) [6].
Structured variational inference relaxes the mean-field assumption by allowing some latent variables to remain coupled in q while still factorizing across others. A common pattern is to retain the natural conditional independence structure of the model: in a hidden Markov model, q(z_{1:T}) might be a chain that respects the temporal dependence, while parameters are factorized away. Structured VI typically yields tighter ELBOs at the cost of more complex updates.
The scalability barrier of CAVI was broken by stochastic variational inference (SVI), introduced by Matthew Hoffman, David Blei, Chong Wang, and John Paisley in their 2013 Journal of Machine Learning Research paper. SVI applies stochastic optimization to the ELBO, replacing the full-data gradient with an unbiased estimate computed from a single data point or a small mini-batch. The key insight is that for conditionally conjugate exponential family models, the natural gradient of the ELBO with respect to global variational parameters has a particularly simple form in terms of expected sufficient statistics. Subsampling data and rescaling the local statistics give an unbiased noisy natural gradient, and Robbins-Monro stochastic approximation guarantees convergence [7].
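The following NumPy sketch shows the flavor of the update on an assumed toy conjugate model (a single Gaussian mean with unit noise variance and a N(0, 1) prior, rather than the topic models of the paper): the sufficient statistics of a mini-batch are rescaled as if the batch were the whole dataset, and a Robbins-Monro step on the natural parameters converges to the exact posterior.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100_000
data = rng.normal(2.0, 1.0, size=N)            # true mean 2.0

# q(mu) is Gaussian, tracked by natural parameters lam = (precision * mean, precision).
lam = np.array([0.0, 1.0])                     # initialize at the N(0, 1) prior
batch_size, delay, kappa = 100, 1.0, 0.7       # Robbins-Monro step-size schedule

for t in range(1, 2001):
    batch = rng.choice(data, size=batch_size, replace=False)
    # "Intermediate" estimate: prior natural parameters plus the mini-batch
    # sufficient statistics rescaled to the full dataset size.
    lam_hat = np.array([0.0 + (N / batch_size) * batch.sum(),
                        1.0 + N])
    rho = (t + delay) ** (-kappa)              # decaying step size
    lam = (1.0 - rho) * lam + rho * lam_hat    # noisy natural-gradient step

print("SVI estimate   : mean=%.4f precision=%.1f" % (lam[0] / lam[1], lam[1]))
print("Exact posterior: mean=%.4f precision=%.1f" % (data.sum() / (1.0 + N), 1.0 + N))
```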
SVI was a watershed for applied Bayesian modeling. It made it possible to fit latent Dirichlet allocation and the hierarchical Dirichlet process to corpora with millions of documents. It also reframed variational inference in the language of stochastic optimization, paving the way for the black-box and automatic-differentiation methods that followed.
A limitation of SVI in its original form is that it requires conditional conjugacy and model-specific derivations. Black box variational inference (BBVI), introduced by Rajesh Ranganath, Sean Gerrish, and David Blei in 2014, removed both restrictions. BBVI estimates the gradient of the ELBO using only samples from q and evaluations of log p(x,z) and log q(z), with no model-specific calculations needed [8].
The core trick is the score function (REINFORCE) gradient estimator. Differentiating the ELBO yields gradient L(phi) = E_q[(log p(x,z) - log q(z; phi)) * gradient log q(z; phi)], which can be approximated by drawing samples from q, evaluating the log probabilities, and averaging. The estimator is unbiased but has high variance, so BBVI introduces variance reduction techniques such as control variates and Rao-Blackwellization. Because the estimator only requires sampling and pointwise log-density evaluations, BBVI works on essentially any model that can be specified in code [8].
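The sketch below (an illustration on an assumed toy model, z ~ N(0, 1) with x_i ~ N(z, 1), not the authors' code) implements the score-function estimator for a Gaussian q parameterized by mu and log sigma, and uses a running baseline as a simple control variate.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(1.5, 1.0, size=50)                            # observed data

def log_joint(z):                                            # log p(x, z) for each sample of z
    return norm.logpdf(z, 0.0, 1.0) + norm.logpdf(x[:, None], z, 1.0).sum(axis=0)

mu, log_sigma = 0.0, 0.0                                     # variational parameters phi
lr, num_samples, baseline = 0.01, 64, 0.0

for step in range(3000):
    sigma = np.exp(log_sigma)
    z = rng.normal(mu, sigma, size=num_samples)              # z ~ q(z; phi)
    weight = log_joint(z) - norm.logpdf(z, mu, sigma)        # log p(x, z) - log q(z; phi)
    centered = weight - baseline                             # baseline uses only past samples, so unbiased
    # Score functions of the Gaussian: d/d mu log q = (z - mu) / sigma^2,
    #                                  d/d log_sigma log q = (z - mu)^2 / sigma^2 - 1.
    grad_mu = np.mean(centered * (z - mu) / sigma ** 2)
    grad_log_sigma = np.mean(centered * ((z - mu) ** 2 / sigma ** 2 - 1.0))
    mu += lr * grad_mu
    log_sigma += lr * grad_log_sigma
    baseline = 0.9 * baseline + 0.1 * weight.mean()          # running control variate

print("q mean", mu, "vs exact posterior mean", x.sum() / (len(x) + 1))
print("q std ", np.exp(log_sigma), "vs exact posterior std ", (1.0 / (len(x) + 1)) ** 0.5)
```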
A second route to model-agnostic variational inference, with much lower gradient variance than score-function methods, is the reparameterization trick. The idea, popularized by Diederik Kingma and Max Welling in their 2013 Auto-Encoding Variational Bayes paper and independently by Danilo Rezende, Shakir Mohamed, and Daan Wierstra in 2014, is to express the random variable z under q as a deterministic transformation of an auxiliary noise variable epsilon whose distribution does not depend on phi: z = g(epsilon, phi). For the most common case of a Gaussian q with mean mu and standard deviation sigma, the reparameterization is z = mu + sigma * epsilon with epsilon drawn from a standard normal. The expectation in the ELBO then becomes an expectation over the fixed noise distribution, and gradients with respect to phi can flow through g via standard backpropagation [9][10].
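A short PyTorch sketch of the reparameterized gradient on the same assumed toy model (z ~ N(0, 1), x_i ~ N(z, 1)) follows: because z is an explicit function of mu, sigma, and the parameter-free noise epsilon, automatic differentiation carries the ELBO gradient back to the variational parameters.

```python
import torch

torch.manual_seed(0)
x = torch.randn(50) + 1.5                                  # observed data

mu = torch.zeros(1, requires_grad=True)
log_sigma = torch.zeros(1, requires_grad=True)
opt = torch.optim.Adam([mu, log_sigma], lr=0.05)

for step in range(2000):
    opt.zero_grad()
    sigma = log_sigma.exp()
    eps = torch.randn(64)                                  # parameter-free noise
    z = mu + sigma * eps                                   # z = g(eps, phi); gradients flow through g
    log_prior = torch.distributions.Normal(0.0, 1.0).log_prob(z)
    log_lik = torch.distributions.Normal(z.unsqueeze(1), 1.0).log_prob(x).sum(dim=1)
    log_q = torch.distributions.Normal(mu, sigma).log_prob(z)
    elbo = (log_prior + log_lik - log_q).mean()            # 64-sample Monte Carlo estimate of the ELBO
    (-elbo).backward()                                     # minimize the negative ELBO
    opt.step()

print("q mean", mu.item(), "q std", log_sigma.exp().item())
```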
The reparameterization trick produces gradient estimates with much lower variance than score-function estimators in continuous latent variable models, which is one of the reasons it has dominated the deep generative modeling literature. It applies whenever q can be written as a differentiable transformation of a parameter-free noise distribution, a property satisfied by Gaussians, log-normals, location-scale families, and any distribution defined by a neural network with stochastic inputs.
The combination of stochastic optimization, reparameterization, and automatic differentiation produced one of the most broadly useful tools for applied Bayesian modeling: automatic differentiation variational inference (ADVI), introduced by Alp Kucukelbir, Dustin Tran, Rajesh Ranganath, Andrew Gelman, and David Blei in their 2017 Journal of Machine Learning Research paper. ADVI takes a probabilistic model written in a generic modeling language and runs a fully automatic mean-field or full-rank Gaussian variational approximation. The user provides only the model and data; ADVI handles transformations of constrained variables to unconstrained space, the variational family, gradient estimation, and optimization [11].
ADVI works by mapping every latent variable to an unconstrained real-valued representation, fitting a Gaussian variational distribution in that unconstrained space, and then transforming back. The reparameterization trick gives low-variance gradients, and stochastic optimization with adaptive step sizes handles convergence. ADVI is the default variational inference engine in the Stan probabilistic programming language and has been ported into numerous other tools [11].
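To make the recipe concrete, the sketch below hand-rolls the ADVI steps for an assumed one-parameter model (an Exponential(1) prior on a rate theta with Exponential(theta) observations; this is an illustration, not the Stan implementation): the positive parameter is mapped to unconstrained zeta = log(theta), a Gaussian q over zeta is fit by reparameterized gradients, and the log-Jacobian of the inverse transform (here simply zeta) is added to the objective.

```python
import math
import torch

torch.manual_seed(0)
x = torch.distributions.Exponential(2.0).sample((200,))      # data with true rate 2.0

m = torch.tensor(0.0, requires_grad=True)                    # variational mean in unconstrained space
log_s = torch.tensor(0.0, requires_grad=True)                # log of the variational std
opt = torch.optim.Adam([m, log_s], lr=0.05)

def log_joint(theta):
    log_prior = torch.distributions.Exponential(1.0).log_prob(theta)
    log_lik = torch.distributions.Exponential(theta.unsqueeze(1)).log_prob(x).sum(dim=1)
    return log_prior + log_lik

for step in range(2000):
    opt.zero_grad()
    zeta = m + log_s.exp() * torch.randn(32)                 # reparameterized q(zeta)
    theta = zeta.exp()                                       # map back to the constrained space
    log_jacobian = zeta                                      # log |d theta / d zeta| = zeta for exp
    entropy = 0.5 * math.log(2 * math.pi * math.e) + log_s   # entropy of the Gaussian q
    elbo = (log_joint(theta) + log_jacobian).mean() + entropy
    (-elbo).backward()
    opt.step()

# Mean of the implied log-normal approximation; close to the exact Gamma posterior mean of about 2.
print("approximate posterior mean of theta:", (m + 0.5 * log_s.exp() ** 2).exp().item())
```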
Mean-field and Gaussian variational families, despite their convenience, are too simple to capture the structure of complex posteriors. Normalizing flows, introduced for variational inference by Danilo Rezende and Shakir Mohamed in 2015, build expressive variational distributions by composing a sequence of invertible transformations applied to a simple base distribution. Each transformation is a diffeomorphism whose Jacobian determinant can be computed efficiently, so the change-of-variables formula yields a tractable density for the resulting distribution. By stacking many such layers, flows can represent arbitrarily complex multi-modal or skewed posteriors while still allowing exact density evaluation and easy sampling [12]. Subsequent work has produced inverse autoregressive flows, real NVP, neural spline flows, and many other variants. Normalizing flow posteriors are a natural drop-in replacement for the Gaussian q in a variational autoencoder.
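The sketch below implements one planar-flow layer of the kind introduced in the cited paper and composes a few of them over a standard-normal base; the change-of-variables correction keeps the density of the transformed samples exact. The invertibility constraint on u discussed in the paper (reparameterizing u so that u^T w >= -1) is omitted here for brevity.

```python
import torch

class PlanarFlow(torch.nn.Module):
    """One layer f(z) = z + u * tanh(w^T z + b) with a tractable log-det Jacobian."""
    def __init__(self, dim):
        super().__init__()
        self.u = torch.nn.Parameter(0.1 * torch.randn(dim))
        self.w = torch.nn.Parameter(0.1 * torch.randn(dim))
        self.b = torch.nn.Parameter(torch.zeros(()))

    def forward(self, z):                                        # z: (batch, dim)
        lin = z @ self.w + self.b                                # (batch,)
        z_new = z + self.u * torch.tanh(lin).unsqueeze(-1)
        psi = (1.0 - torch.tanh(lin) ** 2).unsqueeze(-1) * self.w
        log_det = torch.log((1.0 + psi @ self.u).abs() + 1e-8)   # |det df/dz| = |1 + u^T psi(z)|
        return z_new, log_det

flows = torch.nn.ModuleList([PlanarFlow(2) for _ in range(4)])
base = torch.distributions.MultivariateNormal(torch.zeros(2), torch.eye(2))

z = base.sample((256,))
log_q = base.log_prob(z)
for flow in flows:
    z, log_det = flow(z)
    log_q = log_q - log_det          # change of variables: subtract each layer's log-det
# z now holds samples from the flow posterior q_K; log_q is their exact log-density.
```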
Classical variational inference fits separate variational parameters for each data point's latent variables, which becomes impractical when the dataset is large. Amortized variational inference replaces the per-instance parameters with a shared inference network: a neural network that maps each observation x to the parameters of q(z|x). After training, computing the approximate posterior for a new observation requires only a single forward pass through the inference network. This is the architecture that defines the encoder side of a variational autoencoder [9].
Amortization brings two benefits and one cost. The benefits are speed (inference at test time is a forward pass) and statistical sharing across data points. The cost is the amortization gap, the difference between the ELBO achievable with the inference network and the ELBO that would be achievable by optimizing q(z|x) freely for each x. For complex posteriors this gap can be substantial, motivating semi-amortized methods that refine the network's output with a few steps of per-instance optimization.
The lineage of variational inference traces through several distinct traditions. The variational principle itself comes from physics, where it provides a way to bound free energies by minimizing functionals over candidate distributions. The connection to Bayesian inference was made by Geoffrey Hinton and Drew van Camp in their 1993 paper Keeping Neural Networks Simple by Minimizing the Description Length of the Weights, which used a variational bound to penalize the information content of network weights and is widely regarded as the first application of variational Bayes to neural networks [13].
Throughout the 1990s, variational methods spread through graphical models. Michael Jordan, Zoubin Ghahramani, Tommi Jaakkola, and Lawrence Saul published their landmark survey An Introduction to Variational Methods for Graphical Models in the journal Machine Learning in 1999, synthesizing the existing techniques into a coherent framework [5]. The early 2000s saw variational inference scale to large topic models: latent Dirichlet allocation, introduced by Blei, Ng, and Jordan in 2003, was fitted with a mean-field variational EM algorithm [14].
The modern era began with three nearly-simultaneous developments in 2013 and 2014. Hoffman and colleagues published stochastic variational inference, breaking the data-size barrier for conjugate exponential family models. Kingma and Welling published Auto-Encoding Variational Bayes, introducing the variational autoencoder and the reparameterization trick. Ranganath, Gerrish, and Blei published Black Box Variational Inference, removing the conjugacy requirement. The pieces fit together: SVI gave scalability, BBVI gave generality, and the VAE showed that VI could power state-of-the-art generative models. The years that followed produced ADVI, normalizing flows, importance-weighted autoencoders, and a flood of variants now used throughout machine learning [3][7][8][9].
The choice between variational inference and MCMC has shaped a great deal of practical Bayesian modeling. The two approaches make opposite trade-offs along several axes.
| Criterion | Variational Inference | Markov Chain Monte Carlo |
|---|---|---|
| Output | A parametric approximation q(z) | A set of samples from p(z\|x) |
| Asymptotic guarantee | Local optimum of ELBO; biased | Asymptotically exact |
| Speed | Fast; uses gradients and mini-batches | Slow per sample; serial chain |
| Scalability to large data | Excellent (SVI, ADVI, amortization) | Limited; requires subsampling tricks |
| Scalability to large models | Excellent; runs on GPUs | Limited; HMC is expensive in high dim |
| Uncertainty calibration | Often underestimates variance | Reliable when well-mixed |
| Multimodal posteriors | Mode-seeking; tends to capture one mode | Can in principle visit all modes |
| Convergence diagnostics | ELBO trajectory; less mature | R-hat, ESS, trace plots; well developed |
| Best fit | Big data, deep generative models | Small to medium data, careful inference |
In practice, many statisticians use VI for exploration and rapid iteration, then validate with MCMC for calibrated uncertainty. Engineers building production systems with millions of latent variables typically rely on variational methods exclusively [1].
The following timeline tracks the methods that have shaped modern variational inference.
| Year | Method | Authors | Key idea |
|---|---|---|---|
| 1993 | Variational free energy for neural networks | Hinton, van Camp | First application of VI to network weights via MDL |
| 1999 | Variational methods survey | Jordan, Ghahramani, Jaakkola, Saul | Unified framework for VI in graphical models |
| 2003 | LDA with variational EM | Blei, Ng, Jordan | Mean-field VI for topic modeling |
| 2005 | Variational message passing | Winn, Bishop | Generic VI through messages on a factor graph |
| 2013 | Stochastic variational inference (SVI) | Hoffman, Blei, Wang, Paisley | Stochastic gradients on natural parameters |
| 2013 | Auto-Encoding Variational Bayes (VAE) | Kingma, Welling | Reparameterization trick and amortized inference |
| 2014 | Stochastic backpropagation | Rezende, Mohamed, Wierstra | Reparameterized gradients in deep generative models |
| 2014 | Black box variational inference (BBVI) | Ranganath, Gerrish, Blei | Score-function gradients; no model-specific work |
| 2015 | Normalizing flows for VI | Rezende, Mohamed | Expressive q via invertible transformations |
| 2015 | Bayes by Backprop | Blundell et al. | VI over neural network weights |
| 2016 | Importance weighted autoencoders | Burda, Grosse, Salakhutdinov | Tighter ELBO via importance sampling |
| 2017 | Automatic differentiation VI (ADVI) | Kucukelbir et al. | Fully automatic Gaussian VI; deployed in Stan |
| 2021 | Variational diffusion models | Kingma, Salimans, Poole, Ho | Diffusion models written as hierarchical VI |
The most famous large-scale application of variational inference outside of deep learning is latent Dirichlet allocation (LDA), the topic model introduced by Blei, Ng, and Jordan in 2003. LDA models each document as a mixture over topics, where each topic is itself a distribution over words. The latent variables are the per-document topic proportions and the per-word topic assignments, and the posterior given a corpus is intractable because of the coupling between topics and assignments [14].
The original paper used a mean-field variational EM algorithm: factorize q over topic proportions and assignments, run CAVI on each factor, and use the resulting expectations to update topic-word distributions. This scaled to corpora of tens of thousands of documents, and stochastic variational inference later pushed it to corpora of millions [7][14].
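In practice, libraries such as gensim train LDA with online variational Bayes; the usage sketch below (with a made-up four-document toy corpus) shows the typical workflow.

```python
from gensim import corpora
from gensim.models import LdaModel

texts = [["bayesian", "inference", "posterior"],
         ["neural", "network", "gradient"],
         ["posterior", "variational", "bound"],
         ["gradient", "descent", "network"]]
dictionary = corpora.Dictionary(texts)                     # word <-> id mapping
corpus = [dictionary.doc2bow(doc) for doc in texts]        # bag-of-words representation

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10)
for topic_id, words in lda.show_topics(num_topics=2, num_words=3, formatted=False):
    print(topic_id, [word for word, _ in words])
```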
The variational autoencoder (VAE), introduced by Kingma and Welling in 2013, is the canonical modern application of variational inference. A VAE is a deep generative model in which a latent code z is drawn from a simple prior (typically a standard normal), passed through a decoder neural network to produce parameters of a likelihood over data x. Inference is amortized through an encoder network that outputs the mean and variance of a Gaussian q(z|x). The two networks are trained jointly by maximizing the ELBO, which decomposes into a reconstruction term and a KL term that pulls q toward the prior [9].
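The sketch below is a compact PyTorch rendering of that objective, assuming 784-dimensional binarized inputs with a Bernoulli likelihood and a 20-dimensional Gaussian latent code; it illustrates the standard recipe rather than reproducing the original implementation.

```python
import torch
import torch.nn.functional as F

class VAE(torch.nn.Module):
    def __init__(self, x_dim=784, z_dim=20, hidden=400):
        super().__init__()
        self.enc = torch.nn.Sequential(torch.nn.Linear(x_dim, hidden), torch.nn.ReLU())
        self.enc_mu = torch.nn.Linear(hidden, z_dim)
        self.enc_logvar = torch.nn.Linear(hidden, z_dim)
        self.dec = torch.nn.Sequential(
            torch.nn.Linear(z_dim, hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, x_dim))                    # Bernoulli logits over pixels

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        z = mu + (0.5 * logvar).exp() * torch.randn_like(mu)   # reparameterized sample from q(z|x)
        return self.dec(z), mu, logvar

def negative_elbo(model, x):
    logits, mu, logvar = model(x)
    # Reconstruction term: Bernoulli log-likelihood summed over pixels.
    recon = F.binary_cross_entropy_with_logits(logits, x, reduction="none").sum(dim=1)
    # KL(q(z|x) || N(0, I)) in closed form for a diagonal Gaussian encoder.
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=1)
    return (recon + kl).mean()

model = VAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(128, 784).round()              # stand-in for a binarized mini-batch
opt.zero_grad()
negative_elbo(model, x).backward()            # one gradient step on the negative ELBO
opt.step()
```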
VAEs became a workhorse of unsupervised learning in the mid-2010s. They produce useful continuous latent representations for downstream tasks, generate plausible samples, and provide a principled probabilistic framework for representation learning. The original VAE has been extended in many directions: beta-VAEs trade reconstruction for disentanglement, conditional VAEs allow class-conditional generation, hierarchical VAEs stack many layers of latent variables, and VQ-VAEs use discrete codes.
Variational inference provides one of the few practical routes to fully Bayesian deep learning. Treating the weights of a neural network as random variables with a prior and a posterior gives calibrated uncertainty over predictions, but the posterior over millions of weights is intractable. Bayes by Backprop, introduced by Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra in 2015, applies mean-field variational inference to the weights: each weight has its own Gaussian variational distribution, and the ELBO is optimized via the reparameterization trick to learn weight means and variances jointly [15].
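A sketch of a single Bayesian linear layer in this style follows, assuming standard-normal priors on weights and biases; the KL term between the Gaussian variational distribution and the Gaussian prior is available in closed form, and the likelihood term here is a placeholder unit-variance Gaussian regression loss.

```python
import torch
import torch.nn.functional as F

class BayesianLinear(torch.nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.w_mu = torch.nn.Parameter(0.1 * torch.randn(out_features, in_features))
        self.w_rho = torch.nn.Parameter(torch.full((out_features, in_features), -3.0))
        self.b_mu = torch.nn.Parameter(torch.zeros(out_features))
        self.b_rho = torch.nn.Parameter(torch.full((out_features,), -3.0))

    @staticmethod
    def _kl(mu, sigma):
        # KL( N(mu, sigma^2) || N(0, 1) ), summed over all weights in the layer.
        return (0.5 * (sigma ** 2 + mu ** 2 - 1.0) - sigma.log()).sum()

    def forward(self, x):
        w_sigma, b_sigma = F.softplus(self.w_rho), F.softplus(self.b_rho)
        w = self.w_mu + w_sigma * torch.randn_like(w_sigma)   # reparameterized weight sample
        b = self.b_mu + b_sigma * torch.randn_like(b_sigma)
        self.kl = self._kl(self.w_mu, w_sigma) + self._kl(self.b_mu, b_sigma)
        return F.linear(x, w, b)

layer = BayesianLinear(10, 1)
opt = torch.optim.Adam(layer.parameters(), lr=1e-2)
x, y = torch.randn(64, 10), torch.randn(64, 1)     # stand-in regression batch
nll = 0.5 * ((layer(x) - y) ** 2).sum()            # unit-variance Gaussian likelihood
loss = nll + layer.kl                              # negative ELBO (one batch treated as the full data)
loss.backward()
opt.step()
```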
The resulting Bayesian neural network provides per-weight uncertainty estimates that can be propagated to predictions by Monte Carlo sampling at test time. Related techniques include Monte Carlo dropout (which can be interpreted as VI with a Bernoulli-structured variational distribution over the weights) and stochastic weight averaging. None of these approaches captures the full weight posterior faithfully, but they provide vastly better uncertainty estimates than point-estimate networks at modest extra cost.
The Gaussian mixture model is one of the simplest examples on which variational inference can be derived end-to-end, which is why it appears in essentially every introduction to the topic. With cluster assignments and component parameters as latent variables and mean-field q over both, CAVI yields closed-form updates that closely resemble the Expectation-Maximization (EM) algorithm but with proper Bayesian treatment of the parameters [1].
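The sketch below carries out those updates in NumPy for an assumed textbook setup: K component means with broad Gaussian priors, unit-variance components, and uniform mixing proportions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 150), rng.normal(3.0, 1.0, 150)])
K, prior_var = 2, 10.0

m = rng.normal(0.0, 1.0, K)                 # variational means of the component means
s_sq = np.ones(K)                           # variational variances of the component means
phi = np.full((len(x), K), 1.0 / K)         # responsibilities q(c_n)

for iteration in range(100):
    # Update q(c_n): phi_nk proportional to exp( E[mu_k] x_n - E[mu_k^2] / 2 ).
    log_phi = np.outer(x, m) - 0.5 * (m ** 2 + s_sq)
    phi = np.exp(log_phi - log_phi.max(axis=1, keepdims=True))
    phi /= phi.sum(axis=1, keepdims=True)
    # Update q(mu_k): a Gaussian whose precision accumulates the responsibilities.
    precision = 1.0 / prior_var + phi.sum(axis=0)
    m = (phi * x[:, None]).sum(axis=0) / precision
    s_sq = 1.0 / precision

print("variational means of the components:", np.round(m, 2))   # close to -2 and 3
```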
The explosion of interest in diffusion models since 2020 has reframed the most successful generative architectures of the present era as hierarchical variational models. A diffusion model can be derived as a Markovian hierarchical VAE in which the encoder is a fixed forward Gaussian noising process and the decoder is a learned reverse denoising process. The training objective is a variational lower bound on log p(x), which under the standard parameterization simplifies to a weighted sum of squared-error denoising losses across noise scales. The Variational Diffusion Models paper by Kingma, Salimans, Poole, and Ho in 2021 showed that this bound has a clean form in terms of the signal-to-noise ratio and that the noise schedule itself can be learned jointly with the denoising network [16]. Score-based generative models, denoising diffusion probabilistic models, and continuous-time SDE-based generative models can all be cast in variational terms.
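The sketch below shows the shape of that training objective with unit per-timestep weights (the common simplified loss; the cited paper derives the exact weighting in terms of the signal-to-noise ratio), using a stand-in MLP as the denoiser and a linear noise schedule chosen for illustration.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)                 # assumed linear noise schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)         # cumulative signal retention

# Stand-in denoising network: takes the noisy sample and a normalized timestep.
denoiser = torch.nn.Sequential(torch.nn.Linear(3, 64), torch.nn.ReLU(), torch.nn.Linear(64, 2))

x = torch.randn(128, 2)                               # stand-in clean data
t = torch.randint(0, T, (128,))
eps = torch.randn_like(x)
a = alpha_bar[t].unsqueeze(-1)
z_t = a.sqrt() * x + (1.0 - a).sqrt() * eps           # sample from the fixed forward (noising) process
eps_hat = denoiser(torch.cat([z_t, t.unsqueeze(-1) / T], dim=1))
loss = ((eps_hat - eps) ** 2).sum(dim=1).mean()       # per-timestep denoising (squared-error) term
loss.backward()
```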
Variational inference is implemented in essentially every modern probabilistic programming system. Most libraries converge on a similar workflow: write a model, choose a variational family (often called a guide function in Pyro terminology), and run a stochastic optimizer.
| Library | Language | Backend | Notes |
|---|---|---|---|
| Stan | Stan modeling language | Custom autodiff | ADVI is the built-in VI engine; widely used by statisticians |
| PyMC | Python | PyTensor / JAX | Mean-field, full-rank, and normalizing flow VI |
| NumPyro | Python | JAX | SVI, autoguides, and JIT-compiled gradients on GPU/TPU |
| Pyro | Python | PyTorch | General-purpose SVI with rich guide language |
| TensorFlow Probability | Python | TensorFlow | Variational distributions and surrogate posteriors |
| Edward2 | Python | TensorFlow | Functional probabilistic programming with VI |
| Turing.jl | Julia | Various | ADVI and other VI methods in the Julia ecosystem |
Most modern systems also offer automatic guide construction, often called autoguides, that pick a reasonable variational family without user intervention. The combination of automatic differentiation, autoguides, and adaptive optimizers has made variational inference about as easy to apply as point estimation [11].
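As one concrete example of that workflow, the NumPyro sketch below (assuming a recent NumPyro release; the toy model estimates a single Gaussian mean) pairs a model with an automatic mean-field guide and runs stochastic optimization of the ELBO.

```python
import jax.random as random
import numpyro
import numpyro.distributions as dist
from numpyro.infer import SVI, Trace_ELBO
from numpyro.infer.autoguide import AutoNormal
from numpyro.optim import Adam

def model(data):
    mu = numpyro.sample("mu", dist.Normal(0.0, 10.0))        # prior on the unknown mean
    numpyro.sample("obs", dist.Normal(mu, 1.0), obs=data)    # likelihood

data = dist.Normal(2.0, 1.0).sample(random.PRNGKey(0), (500,))
guide = AutoNormal(model)                                    # automatic mean-field Gaussian guide
svi = SVI(model, guide, Adam(step_size=0.05), Trace_ELBO())
result = svi.run(random.PRNGKey(1), 2000, data)              # stochastic optimization of the ELBO
print(result.params)                                         # optimized variational parameters
```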
Applying variational inference well requires attention to several recurring issues. The first is choice of variational family. Mean-field is the simplest, but its independence assumption can be devastating when latent variables are strongly correlated. A full-rank Gaussian captures pairwise correlations at quadratic memory cost. Normalizing flows give arbitrary flexibility but add training cost.
The second is gradient estimator variance. The reparameterization trick has lower variance than the score-function estimator and should be used whenever possible. For discrete latent variables, gradient estimators like Gumbel-softmax, REBAR, RELAX, and straight-through estimators each offer different bias-variance trade-offs.
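The sketch below shows the Gumbel-softmax relaxation with the temperature as an assumed hyperparameter; sampling becomes a differentiable function of the class logits, so reparameterized gradients apply (PyTorch also ships a built-in torch.nn.functional.gumbel_softmax).

```python
import torch

def gumbel_softmax_sample(logits, tau=0.5):
    u = torch.rand_like(logits).clamp(1e-9, 1.0)
    gumbel = -torch.log(-torch.log(u))                     # Gumbel(0, 1) noise
    return torch.softmax((logits + gumbel) / tau, dim=-1)  # relaxed (soft) one-hot sample

logits = torch.zeros(3, requires_grad=True)
values = torch.tensor([1.0, 2.0, 3.0])                     # stand-in downstream quantity
y = gumbel_softmax_sample(logits)
loss = (y * values).sum()                                  # differentiable in the logits
loss.backward()                                            # gradients reach the discrete distribution's logits
```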
The third is posterior collapse, a pathology specific to amortized models like VAEs. When the decoder is sufficiently expressive, the optimizer can learn to ignore the latent code entirely, driving q(z|x) to the prior. Mitigations include KL annealing, free bits, constraining the decoder, and stronger encoders.
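A minimal sketch of linear KL annealing follows, with the warm-up length as an assumed hyperparameter; the coefficient scales the KL term of the negative ELBO so the reconstruction term dominates early in training.

```python
def kl_weight(step, warmup_steps=10_000):
    """Ramp the KL coefficient linearly from 0 to 1 over the warm-up period."""
    return min(1.0, step / warmup_steps)

# Inside a VAE training loop (see the earlier VAE sketch):
#     loss = recon + kl_weight(step) * kl
```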
The fourth is diagnostics. The ELBO trajectory is the basic indicator of convergence, but a high ELBO does not imply that q is close to the true posterior, because the gap to log p(x), which is exactly the KL divergence from q to the posterior, is never observed. Pareto-smoothed importance sampling can detect when the approximation is poor.
Variational inference inherits a small set of limitations from the optimization structure. The bias from the choice of variational family cannot be removed without enriching the family at compute cost. The reverse-KL objective is mode-seeking, which underestimates posterior variance. The amortization gap creates a structural ceiling on accuracy in inference networks.
Research continues on all of these fronts. Importance-weighted bounds, alpha divergences, Stein discrepancies, and Renyi divergences provide alternative objectives. Hamiltonian variational inference, MCMC-augmented VI, and unbiased VI use sampling to refine variational approximations. Implicit variational distributions parameterized by neural samplers can in principle represent any distribution but bring estimation challenges of their own.
Despite these caveats, variational inference is the only practical approach for an enormous fraction of modern probabilistic modeling. The success of variational autoencoders, latent Dirichlet allocation, Bayesian neural networks, and diffusion models testifies to the breadth of its impact [1][3].