A variational autoencoder (VAE) is a generative model that learns a probabilistic latent space representation of data using an encoder-decoder architecture trained by maximizing the evidence lower bound (ELBO). Introduced by Diederik Kingma and Max Welling in their 2013 paper "Auto-Encoding Variational Bayes" (arXiv:1312.6114), and concurrently by Danilo Rezende, Shakir Mohamed, and Daan Wierstra in "Stochastic Backpropagation and Approximate Inference in Deep Generative Models" (2014), the VAE provides a principled framework for learning latent variable models with neural networks while simultaneously enabling the generation of new data samples. Unlike a standard autoencoder, which maps inputs to fixed points in latent space, a VAE maps inputs to probability distributions, making the latent space continuous and suitable for sampling.
VAEs sit alongside generative adversarial networks and diffusion models as one of the foundational families of deep learning generative models. While the raw VAE has largely been superseded by diffusion models for unconditional image synthesis, the VAE component itself remains central to the modern image and audio generation stack: Stable Diffusion and most other latent diffusion systems use a VAE encoder-decoder pair to compress images into a workable latent space, and discrete VAE variants serve as the tokenizer of choice for autoregressive image, video, and audio models including DALL-E, Parti, MusicLM, AudioLM, and many others.
Before VAEs, standard autoencoders were widely used for dimensionality reduction and feature learning. A standard autoencoder consists of an encoder network that compresses input data into a lower-dimensional representation (the latent code) and a decoder network that reconstructs the input from this code. While effective for compression and reconstruction, standard autoencoders have a significant limitation: their latent spaces are not structured in a way that supports meaningful generation. Points sampled randomly from a standard autoencoder's latent space typically do not decode into realistic outputs because the encoder has no incentive to organize the space smoothly. There are gaps and discontinuities, and the decoder has never seen those regions during training.
Classical approaches to latent variable models, such as factor analysis, Gaussian mixture models, and the variational Bayes methods of the 2000s, could in principle produce a structured generative model, but they relied on tractable conjugate distributions and could not scale to high-dimensional data like images. Markov chain Monte Carlo methods are too slow for training neural networks, and the score function gradient estimator (REINFORCE) suffers from impractical variance for continuous latent variables. Until 2013, training an expressive deep latent variable model with continuous latents and amortized inference was an open problem.
The core insight behind the VAE is to combine three ingredients. First, impose a probabilistic structure on the latent space by treating the latent as a random variable with a fixed prior. Second, replace the expensive per-example posterior inference of classical variational methods with a single neural network that maps each input to its approximate posterior, an idea known as amortized inference. Third, reformulate the random sampling step so that gradients can flow through it, using what is now called the reparameterization trick. The combination yields an objective that is differentiable, amenable to mini-batch stochastic gradient descent, and gives a structured continuous latent space useful for generation, interpolation, and downstream representation learning.
A VAE consists of three main components: an encoder network, a stochastic latent layer, and a decoder network.
The encoder, also called the inference network or recognition network, takes an input x and produces the parameters of a probability distribution q_phi(z|x) over latent variables z. In the most common formulation, the encoder outputs two vectors of the same dimensionality as the latent space: a mean vector mu(x) and a log-variance vector log sigma^2(x). Together these define a diagonal multivariate Gaussian q_phi(z|x) = N(mu(x), diag(sigma^2(x))), an approximation to the true (and intractable) posterior p(z|x). The diagonal covariance assumption is sometimes called the mean-field approximation in the variational inference literature.
The encoder can be implemented with any differentiable architecture appropriate for the data type. For image data, convolutional neural networks (CNNs) are typical, often with strided convolutions or pooling that reduce spatial resolution. For sequential data, recurrent neural networks and Transformers are common. For graph or molecular data, message-passing networks have been used. The encoder produces parameters of q_phi(z|x), not z itself; sampling happens in a separate stochastic layer described below.
The latent space is the space of latent variables z. Each input is associated not with a single point but with a probability distribution. During training, the model samples z from q_phi(z|x). The dimensionality of the latent space is a hyperparameter chosen by the practitioner. For visualization or toy datasets it can be as low as 2; for image VAEs trained on natural images it ranges from a few hundred dimensions to tens of thousands, often shaped as a low-resolution feature map (for example, 4 channels at 64x64 spatial resolution, or 16,384 values, in Stable Diffusion 1).
The prior distribution p(z) is usually chosen to be a standard multivariate Gaussian N(0, I), where I is the identity matrix. This choice is mathematically convenient because it has a simple density, supports an analytical KL divergence with any other Gaussian, and provides a smooth, isotropic structure that the encoder's output distributions are regularized toward. Other priors have been explored, including mixtures of Gaussians, von Mises-Fisher distributions on the sphere, and learned priors used by VQ-VAE and other discrete variants.
The decoder takes a latent sample z and produces the parameters of the data distribution p_theta(x|z), where theta denotes the decoder parameters. For continuous data such as natural images, the decoder typically outputs the mean of a Gaussian distribution with a fixed or learned variance, in which case the negative log-likelihood reduces to mean squared error up to a constant. For binary or strictly bounded data such as MNIST pixel values, the decoder outputs Bernoulli probabilities and the loss becomes binary cross-entropy. For discrete tokens, a categorical distribution is used.
Like the encoder, the decoder can be implemented with any suitable neural network architecture. For images, transposed convolutions, upsampling layers, and residual blocks are typical. The encoder and decoder together form a bottleneck structure with the latent z acting as an information-constraining bridge.
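A minimal sketch of this encoder-decoder structure in PyTorch, assuming flattened 28x28 MNIST-style inputs, a 20-dimensional latent, and Bernoulli decoder outputs (all three are illustrative choices, not requirements):

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    """Minimal MLP VAE: the encoder outputs (mu, logvar), the decoder outputs Bernoulli logits."""
    def __init__(self, x_dim=784, h_dim=400, z_dim=20):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.enc_mu = nn.Linear(h_dim, z_dim)       # mean of q_phi(z|x)
        self.enc_logvar = nn.Linear(h_dim, z_dim)   # log-variance of q_phi(z|x)
        self.dec = nn.Sequential(
            nn.Linear(z_dim, h_dim), nn.ReLU(),
            nn.Linear(h_dim, x_dim),                # logits of p_theta(x|z)
        )

    def encode(self, x):
        h = self.enc(x)
        return self.enc_mu(h), self.enc_logvar(h)

    def decode(self, z):
        return self.dec(z)
```

The stochastic sampling step that connects `encode` and `decode` is described in the sections that follow.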
The theoretical justification for the VAE comes from variational inference, a technique from Bayesian statistics. The goal is to train a generative model p_theta(x) on data x, where p_theta is built from a latent variable z with a known prior:
p_theta(x) = integral p_theta(x | z) p(z) dz
For any expressive decoder this integral is intractable, and so is the posterior p_theta(z|x) needed for maximum likelihood training. Variational inference sidesteps this problem by introducing a tractable approximate posterior q_phi(z|x) and bounding the log-evidence using Jensen's inequality:
log p_theta(x) >= E_{z ~ q_phi(z|x)} [ log p_theta(x|z) ] - KL( q_phi(z|x) || p(z) )
The right-hand side is the evidence lower bound or ELBO. It can be decomposed into two interpretable terms.
| Term | Name | Role |
|---|---|---|
| E[log p_theta(x|z)] | Reconstruction term | Measures how well the decoder reconstructs x from the latent code z. Encourages z to retain useful information about x. |
| KL(q_phi(z|x) || p(z)) | KL regularizer | Measures how far the encoder's output distribution deviates from the prior. Encourages the latent space to remain smooth and well-organized. |
Maximizing the ELBO with respect to both phi and theta is equivalent to simultaneously improving the variational approximation and the generative model. The slack between log p_theta(x) and the ELBO equals the KL divergence KL(q_phi(z|x) || p_theta(z|x)) between the approximate and true posteriors. As the approximation improves, the slack shrinks. In practice the loss function is typically written as a minimization problem (negative ELBO) with the reconstruction loss and KL divergence as additive terms.
When both the approximate posterior and the prior are diagonal Gaussians with the prior being N(0, I), the KL divergence has a convenient closed form. For a latent space of dimensionality J:
KL( N(mu, diag(sigma^2)) || N(0, I) ) = -0.5 * sum_{j=1..J} ( 1 + log(sigma_j^2) - mu_j^2 - sigma_j^2 )
This term acts as a regularizer that prevents the encoder from simply memorizing each input as a narrow spike in latent space. By penalizing distributions that deviate from the standard Gaussian prior, the KL term encourages the latent space to be continuous, meaning nearby points decode to similar outputs, and complete, meaning every point in the latent space decodes to a plausible output.
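In code, this closed form is a one-liner; the sketch below assumes the `mu`/`logvar` encoder outputs described earlier:

```python
import torch

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over the latent dimensions."""
    return -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1)
```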
One of the key technical contributions of Kingma and Welling's paper is the reparameterization trick, which makes the entire pipeline differentiable. During the forward pass, the model must sample z from q_phi(z|x). Sampling is a stochastic operation, and standard backpropagation cannot compute gradients through a random sampling step.
The reparameterization trick expresses the random variable z as a deterministic, differentiable function of the distribution parameters and an independent noise variable. For a diagonal Gaussian, instead of sampling z directly from N(mu, sigma^2), the model samples epsilon from N(0, I) and computes:
z = mu(x) + sigma(x) * epsilon, epsilon ~ N(0, I)
This transformation moves the source of randomness (epsilon) outside the computational graph of the model parameters, making the entire pipeline differentiable with respect to mu and sigma. Gradients can now flow through the sampling step, enabling end-to-end training with stochastic gradient descent and standard automatic differentiation.
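A sketch of the reparameterized sampling step, again assuming the `mu`/`logvar` parameterization from the encoder:

```python
import torch

def reparameterize(mu, logvar):
    """Draw z ~ N(mu, diag(exp(logvar))) as a differentiable function of mu and logvar."""
    std = torch.exp(0.5 * logvar)   # sigma = exp(logvar / 2)
    eps = torch.randn_like(std)     # epsilon ~ N(0, I); the only source of randomness
    return mu + std * eps           # z = mu + sigma * epsilon
```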
In the variational inference literature this is called a pathwise gradient estimator. Compared to the score function (REINFORCE) estimator, which works for any distribution but has very high variance, the pathwise estimator has substantially lower variance, especially when there are strong functional dependencies between the latent variables and the parameters. Low-variance gradients are what made it practical to train deep generative models with millions of parameters at scale.
The reparameterization trick is not limited to Gaussian distributions. It can be applied to any distribution that can be written as a deterministic transformation of a fixed base distribution, such as the Cauchy and uniform distributions. Distributions that lack a direct reparameterization, such as the Gamma, Beta, Dirichlet, and discrete categorical, have been adapted using techniques like the implicit reparameterization gradient, the Gumbel-Softmax relaxation for discrete variables, and the straight-through estimator used in VQ-VAE.
Given a mini-batch of data x_1, ..., x_M, a single VAE training step performs the following: (1) pass each x_i through the encoder to obtain mu(x_i) and log sigma^2(x_i); (2) sample epsilon_i ~ N(0, I) and form z_i = mu(x_i) + sigma(x_i) * epsilon_i via the reparameterization trick; (3) pass z_i through the decoder and evaluate the reconstruction term log p_theta(x_i | z_i); (4) compute the KL term in closed form from mu(x_i) and sigma(x_i); (5) take a gradient step on the mini-batch average of the negative ELBO with respect to both phi and theta.
This is the entire training loop in its simplest form. The Monte Carlo estimate of the reconstruction term using a single sample of epsilon per data point is unbiased but noisy; more samples reduce variance at the cost of compute. The KL term is computed in closed form when both q and the prior are Gaussian, eliminating sampling noise from that side of the loss.
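A sketch of that step in PyTorch, reusing the hypothetical `VAE` module, `reparameterize`, and `kl_to_standard_normal` helpers from the earlier sketches, with a Bernoulli (binary cross-entropy) likelihood:

```python
import torch
import torch.nn.functional as F

model = VAE()                                       # model sketch from earlier
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

def training_step(x):                               # x: (batch, 784) tensor in [0, 1]
    mu, logvar = model.encode(x)                    # parameters of q_phi(z|x)
    z = reparameterize(mu, logvar)                  # one Monte Carlo sample per example
    logits = model.decode(z)                        # Bernoulli logits of p_theta(x|z)
    recon = F.binary_cross_entropy_with_logits(
        logits, x, reduction="none").sum(dim=-1)    # negative reconstruction log-likelihood
    kl = kl_to_standard_normal(mu, logvar)          # closed-form KL, no sampling noise
    loss = (recon + kl).mean()                      # negative ELBO averaged over the batch
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```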
A well-known training challenge with VAEs is posterior collapse, also known as the KL vanishing problem. It occurs when the encoder learns to output q_phi(z|x) that is nearly identical to the prior p(z) for every input. The KL term goes to zero, the latent code carries no information about x, and the decoder learns to model the marginal data distribution p(x) on its own, ignoring z entirely. This is especially common with strong autoregressive decoders (such as PixelCNN on images or LSTM/Transformer language models on text) that can model p(x) reasonably well without any latent input.
Several mitigations have been proposed:
| Technique | Idea |
|---|---|
| KL annealing | Start with KL weight at 0, ramp to 1 over the first epochs, so the model first learns useful encodings before the regularizer kicks in. |
| Free bits | Set a per-dimension floor on the KL term: gradient is zero whenever the KL is below the threshold. Gives the model permission to use a minimum amount of latent capacity. |
| Beta scheduling | Use a beta < 1 early in training, raise to or above 1 later. |
| Decoder weakening | Restrict decoder capacity (smaller receptive field, dropout, masking) so the latent z is necessary. |
| Skip connections from z | Inject z at multiple decoder layers so the decoder cannot easily ignore it. |
| Lagging inference fix | Update the inference network more often than the generator (He et al., 2019) to stop the encoder from falling behind. |
| Discrete latents (VQ-VAE) | Avoid the KL term entirely by using a fixed uniform prior over codebook entries. |
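As an illustration of the first two rows, KL annealing and a simplified per-dimension free-bits floor can be grafted onto the loss above in a few lines; the warm-up length and the floor value are arbitrary example settings:

```python
import torch

def kl_weight(step, warmup_steps=10_000):
    """Linear KL annealing: the weight ramps from 0 to 1 over the first warmup_steps."""
    return min(1.0, step / warmup_steps)

def free_bits_kl(mu, logvar, floor=0.5):
    """Per-dimension KL clamped at a floor, so dimensions below it contribute no gradient."""
    kl_per_dim = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp())   # shape (batch, z_dim)
    return torch.clamp(kl_per_dim, min=floor).sum(dim=-1)
```

The total loss then becomes `recon + kl_weight(step) * free_bits_kl(mu, logvar)`, with either mitigation usable on its own.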
Other practical issues include the so-called holes problem, where the prior places mass in regions that the aggregate posterior does not cover, leading to poor samples drawn from the prior even when reconstruction is good, and posterior over-simplification, where the diagonal Gaussian assumption of q_phi cannot capture correlated structure in the true posterior.
| Feature | Standard autoencoder | Variational autoencoder |
|---|---|---|
| Latent representation | Deterministic point | Probability distribution (mean and variance) |
| Latent space structure | Unstructured; may have gaps and discontinuities | Smooth, continuous, regularized by KL divergence |
| Loss function | Reconstruction loss only | Reconstruction loss + KL divergence (ELBO) |
| Generative capability | Poor; random samples typically produce unrealistic outputs | Good; samples from the prior decode into plausible data |
| Training objective | Minimize reconstruction error | Maximize evidence lower bound |
| Sampling | Not designed for sampling | Designed for sampling from the latent prior |
| Probabilistic interpretation | None; purely a feature extractor | Explicit graphical model with prior and likelihood |
| Use cases | Dimensionality reduction, denoising, feature learning | Data generation, interpolation, anomaly detection, representation learning |
VAEs are one of several major families of deep generative models. Each makes different trade-offs between sample quality, training stability, likelihood evaluation, and inference speed.
| Family | Likelihood | Sample quality | Training stability | Inference speed | Notes |
|---|---|---|---|---|---|
| VAE | ELBO (lower bound) | Moderate; classically blurry, sharper with NVAE / VQ-VAE-2 | Stable, single loss | Single forward pass | Smooth, structured latent space; useful for compression and editing |
| GAN | None directly | High visual fidelity | Unstable; needs careful balancing of generator and discriminator | Single forward pass | Susceptible to mode collapse |
| Diffusion model | Variational bound or score matching | State of the art for images, video, audio | Stable | Slow; many denoising steps | Often paired with a VAE to operate in latent space |
| Normalizing flow | Exact | Good but historically below GAN/diffusion | Stable | Fast | Requires invertible architecture, limits expressiveness |
| Autoregressive model (PixelCNN, GPT) | Exact | Sharp; sequential generation | Stable | Very slow for high-dimensional data | Strong on text and discrete tokens |
| Energy-based model | Implicit | Variable | Often unstable | Slow (MCMC) | Flexible but harder to train |
VAEs occupy a useful middle ground: they train stably, give a tractable likelihood proxy, encode data into a meaningful latent space that supports interpolation and editing, and produce reasonable samples on a single forward pass. The trade-off is that pure-VAE samples are typically less sharp than GAN or diffusion outputs, primarily because the per-pixel Gaussian or Bernoulli decoder distribution implicitly assumes pixel-wise independence and averages over uncertainty.
In modern practice these families are often combined. Stable Diffusion uses a VAE encoder to compress images, then runs a diffusion model in the resulting latent space, then uses the VAE decoder to map back to pixels. VQ-VAE and dVAE provide discrete tokens that autoregressive models like DALL-E 1, Parti, and MusicLM consume. VAE-GAN hybrids combine a VAE with an adversarial loss to sharpen reconstructions. Each pairing tries to keep the strengths (stable training, structured latent space, fast inference) while patching the weaknesses (blurriness).
Since the original VAE paper, dozens of variants have been developed. The most influential are summarized below.
| Variant | Year | Key modification | Primary advantage |
|---|---|---|---|
| Conditional VAE (CVAE) | 2015 | Condition encoder and decoder on label or context y | Controlled generation; structured prediction |
| Importance Weighted Autoencoder (IWAE) | 2015 | Tighter ELBO using K importance samples | Richer posteriors; better likelihood |
| Beta-VAE | 2017 | KL term scaled by beta > 1 | Disentangled latent factors |
| VQ-VAE | 2017 | Discrete codebook; nearest-neighbor quantization | Avoids posterior collapse; high-quality discrete tokens |
| InfoVAE / MMD-VAE | 2017 | Maximum Mean Discrepancy instead of KL | More informative latent codes |
| Wasserstein Autoencoder (WAE) | 2018 | Wasserstein distance on aggregated posterior | Sharper outputs than vanilla VAE |
| Beta-TCVAE | 2018 | Decompose KL into total correlation, mutual info, dimension-wise KL | Better disentanglement metric |
| Disentangled Sequential VAE | 2018 | Factorize content vs dynamics for video / sequences | Clean separation of static and dynamic factors |
| VQ-VAE-2 | 2019 | Hierarchical discrete codes; PixelCNN prior | Image quality competitive with GANs |
| NVAE | 2020 | Deep hierarchical VAE with residual cells, spectral regularization | State-of-the-art VAE image quality on CelebA HQ, CIFAR-10 |
| dVAE (DALL-E) | 2021 | Discrete VAE with Gumbel-softmax relaxation; 8192-entry codebook | Image tokenizer for autoregressive transformer |
| LDM autoencoder | 2022 | KL or VQ-regularized convolutional VAE for latent diffusion | Backbone of Stable Diffusion and many successors |
| VAE-GAN | 2016/ongoing | Adversarial loss on top of decoder | Sharper, more realistic outputs |
Introduced by Higgins et al. at ICLR 2017, the beta-VAE multiplies the KL term by an adjustable hyperparameter beta:
Loss = -E[log p_theta(x|z)] + beta * KL( q_phi(z|x) || p(z) )
For beta > 1, the model places greater pressure on the latent distribution to match the factorized prior, which empirically encourages individual latent dimensions to align with independent factors of variation in the data: pose, lighting, identity, color, and so on. The beta-VAE was a starting point for a large literature on disentangled representation learning. Burgess et al. (2018) extended it with a controlled capacity schedule, and Chen et al. (2018) introduced beta-TCVAE, which decomposes the KL term into total correlation, mutual information, and dimension-wise KL components and applies the disentanglement penalty more selectively. Locatello et al. (2019) later showed that fully unsupervised disentanglement is impossible without inductive biases on the data or model, tempering some of the early enthusiasm but leaving disentangled VAEs useful when those biases are present.
Sohn, Lee, and Yan (2015) introduced the conditional VAE in their NeurIPS paper "Learning Structured Output Representation using Deep Conditional Generative Models." The model conditions both the encoder q_phi(z | x, y) and the decoder p_theta(x | z, y) on additional information y, such as a class label, text caption, or partial input. This turns a generic generative model into a structured output predictor: given y, sample z from the conditional prior or posterior and decode to get a distribution over x. CVAEs are widely used in semi-supervised learning, image inpainting, structured prediction, and conditional molecular design.
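One common way to implement the conditioning, sketched below, is simply to concatenate a one-hot or embedded y onto the encoder and decoder inputs (the layer sizes are illustrative):

```python
import torch
import torch.nn as nn

class CVAE(nn.Module):
    """Conditional VAE: both q_phi(z|x,y) and p_theta(x|z,y) see the condition y."""
    def __init__(self, x_dim=784, y_dim=10, h_dim=400, z_dim=20):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim + y_dim, h_dim), nn.ReLU())
        self.enc_mu = nn.Linear(h_dim, z_dim)
        self.enc_logvar = nn.Linear(h_dim, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim + y_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))

    def encode(self, x, y):
        h = self.enc(torch.cat([x, y], dim=-1))      # condition the posterior on y
        return self.enc_mu(h), self.enc_logvar(h)

    def decode(self, z, y):
        return self.dec(torch.cat([z, y], dim=-1))   # condition the likelihood on y
```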
Burda, Grosse, and Salakhutdinov (2015) introduced the IWAE, which uses K samples from the encoder per data point and computes a tighter log-likelihood bound:
L_K = E[ log (1/K) sum_{k=1..K} p_theta(x, z_k) / q_phi(z_k | x) ]
As K grows, this bound converges to the true log-likelihood. IWAE typically learns richer posteriors than the standard VAE because the encoder no longer needs to produce a single tight posterior; the importance weights correct for any mismatch. The trade-off is K-fold higher compute per training step.
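A sketch of the K-sample bound, assuming the Gaussian encoder and Bernoulli decoder from the earlier sketches; the key detail is the log-mean-exp over importance weights rather than an average of log-weights:

```python
import math
import torch
import torch.nn.functional as F

def iwae_bound(model, x, K=5):
    """Importance-weighted bound L_K for a batch x of shape (B, x_dim); higher is better."""
    mu, logvar = model.encode(x)
    std = torch.exp(0.5 * logvar)
    eps = torch.randn(K, *mu.shape)                           # K samples per data point
    z = mu + std * eps                                        # (K, B, z_dim)
    logits = model.decode(z)                                  # (K, B, x_dim)
    log_px_z = -F.binary_cross_entropy_with_logits(
        logits, x.expand(K, *x.shape), reduction="none").sum(-1)     # log p_theta(x | z_k)
    log_pz = torch.distributions.Normal(0.0, 1.0).log_prob(z).sum(-1)  # log p(z_k)
    log_qz = torch.distributions.Normal(mu, std).log_prob(z).sum(-1)   # log q_phi(z_k | x)
    log_w = log_px_z + log_pz - log_qz                        # log importance weights
    return (torch.logsumexp(log_w, dim=0) - math.log(K)).mean()
```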
Introduced by Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu at NeurIPS 2017, the Vector Quantised VAE (VQ-VAE) replaces the continuous latent with a discrete codebook of learned embedding vectors. The encoder produces a continuous output z_e, which is then snapped to its nearest codebook entry e_k, and the decoder reconstructs from e_k. Gradients are propagated through the non-differentiable quantization step using a straight-through estimator: the gradient of the encoder is computed as if quantization were the identity, while a separate codebook loss pulls each entry toward the encoder outputs assigned to it.
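A sketch of the quantization step with the straight-through gradient; `codebook` would be a learned (num_codes, code_dim) parameter, and the commitment weight of 0.25 follows the value suggested in the original paper:

```python
import torch
import torch.nn.functional as F

def vector_quantize(z_e, codebook, beta=0.25):
    """Snap encoder outputs z_e (B, D) to their nearest codebook entries (K, D)."""
    dists = torch.cdist(z_e, codebook)                  # (B, K) pairwise distances
    idx = dists.argmin(dim=-1)                          # nearest codebook index per vector
    z_q = codebook[idx]                                 # quantized latents
    # Codebook loss pulls entries toward encoder outputs; commitment loss does the reverse.
    vq_loss = F.mse_loss(z_q, z_e.detach()) + beta * F.mse_loss(z_e, z_q.detach())
    # Straight-through: forward pass uses z_q, backward treats quantization as the identity.
    z_q = z_e + (z_q - z_e).detach()
    return z_q, idx, vq_loss
```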
VQ-VAE removes the KL regularizer entirely (the prior is uniform over the codebook by convention) and as a result is largely immune to posterior collapse. The discrete latent codes can then be modeled by a separate autoregressive prior such as PixelCNN or a Transformer, which acts as a generative model in token space. This two-stage recipe (train a VQ-VAE, then train an autoregressive prior over its tokens) became the standard pattern for many modern multimodal systems.
VQ-VAE-2 (Razavi et al., 2019) extended the original with a hierarchy of discrete codes, with a top-level code capturing global structure such as object identity and a bottom-level code capturing local texture and detail. With strong autoregressive priors, VQ-VAE-2 generated images on ImageNet competitive with the best GANs at the time, with the additional benefit of avoiding mode collapse and providing exact likelihoods.
NVAE (Vahdat and Kautz, NeurIPS 2020) is a deep hierarchical VAE that pushes pure-VAE image generation to state-of-the-art quality. It uses depthwise separable convolutions, batch normalization, residual parameterization of normal distributions, and spectral regularization to stabilize training. NVAE was the first VAE successfully applied to natural images at 256x256 resolution and pushed CIFAR-10 likelihood from 2.98 to 2.91 bits per dimension. It produces convincing high-resolution faces on CelebA HQ.
The VAEs used inside Stable Diffusion and other latent diffusion models are a separate lineage. Rombach et al. (CVPR 2022) train a convolutional autoencoder with either a small KL penalty or a VQ regularizer, plus a perceptual loss (LPIPS) and an adversarial discriminator (PatchGAN) to keep reconstructions sharp. The result is a VAE whose primary purpose is not to generate samples but to produce a compact, perceptually faithful latent space in which a separate diffusion U-Net then operates. Stable Diffusion 1 and 2 used a 4-channel latent at 1/8 spatial resolution; Stable Diffusion 3 (2024) introduced a 16-channel latent that captures more color and texture detail and reduces typical artifacts on hands, faces, and small text.
VAEs and their variants have been applied across a wide range of domains.
| Domain | Use case | Representative work |
|---|---|---|
| Image generation | Sample novel images; interpolate between images | NVAE, VQ-VAE-2, VAE-GAN |
| Image editing | Move along latent directions to edit attributes | Beta-VAE, StyleVAE |
| Latent diffusion | Encode images into latent space for diffusion | Stable Diffusion 1/2/3, SDXL |
| Image tokenization | Convert image to discrete tokens for transformers | DALL-E dVAE, Parti VQGAN, MUSE |
| Audio compression | Encode waveforms to discrete tokens | SoundStream, EnCodec |
| Audio / music generation | Token model over VQ-VAE codes | MusicVAE, AudioLM, MusicLM, Jukebox |
| Speech synthesis | Latent-conditioned vocoders and codecs | Tacotron-VAE, NaturalSpeech 2 |
| Text generation | Sentence-level latent variable language models | Bowman et al. 2016, Optimus |
| Anomaly detection | High reconstruction error or low ELBO flags anomalies | Medical imaging, fraud detection, manufacturing |
| Drug discovery | Continuous latent space over molecules; Bayesian optimization | Gomez-Bombarelli et al. 2018 |
| Single-cell biology | Denoise and embed gene expression matrices | scVI, totalVI |
| Recommender systems | Collaborative filtering with latent user/item factors | Mult-VAE (Liang et al. 2018) |
| Reinforcement learning | World models with VAE-encoded observations | Ha and Schmidhuber's World Models, Dreamer |
| Robotics | Latent dynamics models from sensor data | PlaNet, latent imagination |
VAEs can generate new images by sampling z from the prior and decoding. The smooth structure of the latent space enables image interpolation, attribute manipulation, and structure-preserving edits. While vanilla VAE samples on natural images are characteristically blurry, NVAE and VQ-VAE-2 achieve sharpness comparable to GANs, and latent-diffusion VAEs reach the high quality of modern text-to-image systems by delegating distribution modeling to a separate diffusion network.
The most consequential recent use of VAEs is as the perceptual compressor in latent diffusion models. A pretrained VAE encoder maps a 512x512x3 image into a small latent grid (64x64x4 for SD 1, 64x64x16 for SD 3), the diffusion model operates in that latent space on roughly 48 times fewer values than pixel-space diffusion (64x64x4 versus 512x512x3), and the VAE decoder maps the result back to image space. This factorization is the central reason Stable Diffusion can run on consumer GPUs and has shaped the architecture of nearly every text-to-image system released since 2022, including SDXL, Stable Diffusion 3, Pixart-Alpha, Flux, and most open-source video diffusion models.
Discrete VAEs serve as image tokenizers for autoregressive image models. OpenAI's DALL-E (2021) trained a discrete VAE that compresses each 256x256 RGB image into a 32x32 grid of tokens drawn from an 8192-entry codebook, then trained a 12-billion parameter autoregressive transformer on the concatenation of text tokens and image tokens. Later systems, including Parti, MUSE, and many video models, use VQ-GAN or related discrete tokenizers built on the same principle.
The VQ-VAE family is the dominant tokenizer in neural audio. SoundStream (Google, 2021) and EnCodec (Meta, 2022) use convolutional encoder-decoder architectures with residual vector quantization, adversarial losses, and reconstruction losses to compress 24 kHz audio into low-bitrate discrete tokens at 1.5 to 24 kbps. These tokens form the input to MusicLM, AudioLM, MusicGen, and speech generation systems such as VALL-E. Earlier work by Roberts et al. (Google, 2018) introduced MusicVAE, a hierarchical VAE for short musical sequences that supports interpolation between melodies and rhythms.
Gomez-Bombarelli et al. (ACS Central Science, 2018) trained a VAE on SMILES string representations of molecules, producing a continuous latent space over chemistry. They then used a Gaussian process surrogate model and Bayesian optimization to search the latent space for molecules with desired properties such as octanol-water partition coefficient or fluorescent emission. The same template (encode molecules into a continuous latent, optimize, decode back to candidate molecules) has been applied to drug discovery, materials science, and OLED design. Subsequent work introduced graph-based and grammar-based VAEs that respect molecular validity better than pure SMILES models.
VAEs trained on normal data flag anomalies as inputs that produce high reconstruction error or low ELBO. The intuition is that the model learns a manifold of normal examples and struggles to reconstruct out-of-distribution inputs. Applications include medical imaging (flagging tumors against a background of healthy tissue), industrial quality control (defect detection on production lines), fraud detection in financial transactions, and intrusion detection in network traffic. The probabilistic interpretation makes VAEs natural for tasks where the cost of false negatives is high and where uncertainty estimates matter.
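A sketch of the scoring rule, reusing the hypothetical model and helpers from the earlier sketches; the decision threshold would be calibrated on held-out normal data:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def anomaly_score(model, x):
    """Negative ELBO per example; higher means more anomalous under the trained VAE."""
    mu, logvar = model.encode(x)
    z = reparameterize(mu, logvar)
    logits = model.decode(z)
    recon = F.binary_cross_entropy_with_logits(logits, x, reduction="none").sum(-1)
    kl = kl_to_standard_normal(mu, logvar)
    return recon + kl                                   # shape (batch,) anomaly scores
```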
In computational biology, VAEs are used to model gene expression matrices. scVI (Lopez et al. 2018) and follow-ups such as totalVI use deep generative models with negative binomial likelihoods to denoise single-cell RNA-seq data, integrate datasets from different experimental batches, and infer cell-type structure. The latent representations support clustering, differential expression analysis, and trajectory inference.
Bowman et al. (2016) introduced a sentence-level VAE with an LSTM decoder, demonstrating smooth interpolation between sentences in latent space and generation of novel sentences from the prior. Posterior collapse is a chronic problem for text VAEs because LSTM and Transformer decoders are strong autoregressive models that can model p(x) well without the latent. Mitigations include KL annealing, free bits, weakened decoders, and discrete latents. Optimus (Li et al. 2020) revisited the idea using BERT and GPT-2 as encoder and decoder, scaling the VAE recipe to large pretrained language models.
VAEs are widely used in model-based reinforcement learning to compress raw pixel observations into a low-dimensional latent state. Ha and Schmidhuber's World Models (2018) used a VAE plus an MDN-RNN to compress and predict the dynamics of car-racing and Doom environments. Hafner et al.'s Dreamer family of agents extended this approach, later switching to discrete categorical latents trained with straight-through gradients, to achieve state-of-the-art sample efficiency on Atari and DeepMind Control Suite tasks.
Mult-VAE (Liang et al., WWW 2018) applies a VAE with multinomial likelihood to implicit feedback collaborative filtering. The model treats each user as a bag of consumed items, encodes the bag into a latent user representation, and decodes a multinomial distribution over the catalog. Mult-VAE outperformed prior matrix factorization and autoencoder baselines on standard benchmarks and is now a common reference baseline in the recommender-systems literature.
For unconditional, end-to-end image synthesis, pure VAEs have largely been surpassed by diffusion models, which produce sharper samples and benefit from a much simpler maximum-likelihood-style training objective in the form of score matching. NVAE remains the strongest pure-VAE model on academic benchmarks, but its samples do not match modern diffusion or autoregressive systems on natural images at scale.
Where VAEs continue to dominate is wherever a fast, structured, and bidirectional mapping between data and a latent space is needed. The latent diffusion architecture popularized by Stable Diffusion uses a VAE encoder-decoder pair as its perceptual compressor; nearly every open-source text-to-image, text-to-video, and text-to-audio system released since 2022 inherits this pattern. Discrete VAEs power image and audio tokenizers for autoregressive transformers (DALL-E 1, Parti, MUSE, MusicLM, AudioLM, MusicGen). VAEs remain the workhorse for representation learning in single-cell biology, recommender systems, anomaly detection, and world-model RL. The basic recipe (amortized variational inference plus the reparameterization trick plus a deep network encoder and decoder) has also been adapted as a building block in normalizing flows, hierarchical Bayesian models, and the formal derivations of denoising diffusion as an infinite-depth VAE (Kingma et al. 2021).
There is also ongoing research on improving the VAE itself. Examples include continuous-time hierarchical VAEs, energy-based decoder priors, diffusion-decoder VAEs that swap the Gaussian or Bernoulli decoder for a small diffusion model conditioned on the latent (eps-VAE, 2024), and lightweight video VAEs that compress along both space and time for video diffusion. The architecture has not stood still; it has been quietly absorbed into nearly every modern generative pipeline.
VAEs have known weaknesses worth flagging.
| Limitation | Cause | Common mitigation |
|---|---|---|
| Blurry samples on natural images | Pixel-wise Gaussian decoder averages over multimodal pixel distributions | Hierarchical VAE (NVAE), VQ-VAE, perceptual + adversarial losses (LDM autoencoder) |
| Posterior collapse | Strong decoder ignores latent | KL annealing, free bits, decoder weakening, discrete latents |
| Holes in latent space | Aggregate posterior does not match prior | Two-stage models, learned priors, normalizing-flow priors, VQ codebooks |
| Mean-field posterior is too restrictive | Diagonal Gaussian assumption misses correlations | Normalizing-flow posterior (IAF), hierarchical posterior, IWAE |
| Likelihood is only a lower bound | ELBO leaves a slack proportional to KL between true and approximate posterior | IWAE, more expressive q, larger encoder |
| Disentanglement is fragile | Disentangled factors are not identifiable without supervision (Locatello 2019) | Add weak labels or explicit inductive biases |
| Training-inference mismatch | The encoder is trained jointly with the decoder, so a frozen VAE used for downstream tasks may have been over-fit to its decoder's quirks | Train a stronger separate encoder; finetune jointly with downstream task |
A minimal VAE for binarized MNIST is one of the standard pedagogical examples in deep learning frameworks. PyTorch tutorials, the official TensorFlow Probability VAE example, and the Hugging Face diffusers library all ship VAE implementations that can be used directly. The diffusers library exposes the AutoencoderKL class used in Stable Diffusion, the AutoencoderTiny class used for fast latent previews, and the various VQ models used in MUSE-style pipelines. NVIDIA released the official PyTorch implementation of NVAE on GitHub, and the original DALL-E discrete VAE checkpoint is available from OpenAI.
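As a rough usage sketch of the latent-diffusion use case via diffusers (the checkpoint name and the 0.18215 scaling factor follow the Stable Diffusion 1.x convention; verify both against the model card of whatever VAE you actually load):

```python
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")  # SD 1.x-compatible VAE
vae.eval()

image = torch.rand(1, 3, 512, 512) * 2 - 1            # dummy image scaled to [-1, 1]

with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample()   # (1, 4, 64, 64) latent grid
    latents = latents * 0.18215                        # SD 1.x latent scaling convention
    decoded = vae.decode(latents / 0.18215).sample     # back to (1, 3, 512, 512)
```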
For practitioners, common recipes are:
| Goal | Recommended approach |
|---|---|
| Toy generation; teach the math | A 2-layer MLP encoder/decoder VAE on MNIST or Fashion-MNIST with a 2D latent for visualization |
| Reasonable image quality | NVAE or a deep convolutional VAE with skip connections and KL annealing |
| Image tokenization | VQ-VAE or VQ-GAN with a perceptual + adversarial loss |
| Latent diffusion preprocessor | KL- or VQ-regularized convolutional autoencoder with LPIPS + PatchGAN losses (LDM autoencoder) |
| Anomaly detection | Standard VAE with a domain-appropriate likelihood (Bernoulli, Gaussian, negative binomial) |
| Tabular / structured data | Categorical / continuous mixed VAE with embedding layers per feature |
The Kingma and Welling paper has been cited tens of thousands of times and is one of the most influential machine learning papers of the 2010s. Its template, amortized variational inference plus the reparameterization trick, established the modern playbook for probabilistic deep learning. Subsequent influential work that builds directly on the VAE includes normalizing flows (Rezende and Mohamed 2015), hierarchical latent models, and the variational diffusion model framework (Kingma et al. 2021), which formally connects denoising diffusion probabilistic models to a continuous-time, infinite-depth hierarchical VAE.
Kingma went on to lead key follow-up work including the introduction of inverse autoregressive flows (Kingma et al. 2016), the Adam optimizer that is now the default for training nearly all VAEs and most other deep models, and variational diffusion models. Welling's group at the University of Amsterdam has continued to work on probabilistic and equivariant deep learning, including normalizing flows and graph neural networks. The 2019 monograph "An Introduction to Variational Autoencoders" by Kingma and Welling (Foundations and Trends in Machine Learning) remains the canonical pedagogical reference.
More than a decade after the original paper, the VAE is no longer the front-line image generator, but the ideas it introduced (amortized inference, the reparameterization trick, deep latent variable models, ELBO as a practical training objective) are now baseline machinery in modern generative modeling. Most of today's text-to-image, text-to-video, text-to-audio, and text-to-music systems rely on some form of VAE somewhere in their pipeline, even when they advertise themselves under a different name.