VQ-VAE (Vector Quantized Variational Autoencoder)
Last reviewed
Jun 8, 2026
Sources
9 citations
Review status
Source-backed
Revision
v1 · 2,109 words
VQ-VAE (Vector Quantized Variational Autoencoder) is a generative model that learns a discrete latent representation of data by quantizing the output of a neural encoder against a learned codebook of embedding vectors. It was introduced in the 2017 paper "Neural Discrete Representation Learning" by Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu at DeepMind, presented at NeurIPS 2017 [1]. Unlike a standard variational autoencoder, whose latent space is a continuous distribution, VQ-VAE maps each spatial or temporal position of an input to an index in a finite codebook, producing a grid or sequence of discrete tokens.
This discreteness is the model's defining property. Because the latents are integer codes drawn from a fixed vocabulary, they can be modeled afterward by a powerful autoregressive model such as PixelCNN, WaveNet, or a Transformer, in the same way that language models predict word tokens. This two-stage "tokenize, then model the tokens" recipe became one of the most influential paradigms in generative modeling. VQ-VAE is the conceptual ancestor of the discrete image tokenizer in DALL-E [3], of VQGAN [4], and of the residual neural audio codecs SoundStream [6] and EnCodec [7] that underpin modern audio and music generation.
The design of VQ-VAE responds to two practical problems with continuous-latent VAEs.
The first is posterior collapse. When a VAE is paired with a sufficiently expressive autoregressive decoder, the decoder can model the data distribution on its own and learns to ignore the latent variable; the approximate posterior collapses toward the prior and the latent code carries little information [1]. By committing the encoder output to a discrete codebook entry, VQ-VAE forces the latent to remain informative, and the authors report that the model does not suffer from this collapse [1].
The second motivation is representational. Many modalities of interest, such as language, are inherently discrete, and even continuous signals like images and audio are often well summarized by a sequence of symbols. Discrete codes pair naturally with the strongest available density estimators for sequences, the autoregressive models, which assign a probability to one symbol at a time over a finite vocabulary. A continuous VAE latent does not give those priors a clean discrete target to predict, whereas VQ-VAE does [1].
VQ-VAE has three parts: an encoder, a shared codebook (also called the embedding space or dictionary), and a decoder.
The codebook is a learnable table of K embedding vectors, each of dimension D, written e_1 through e_K. The encoder is a convolutional network that maps an input x to a field of continuous vectors z_e(x), for example a grid of feature vectors over an image or a sequence of vectors over audio. Each encoder output vector is then quantized by nearest-neighbor lookup: it is replaced by the codebook entry closest to it in Euclidean distance. Formally the discrete code at a position is the index k that minimizes the distance between z_e(x) and e_k, and the quantized representation z_q(x) is the corresponding embedding e_k [1]. The decoder reconstructs the input from z_q(x). The set of chosen indices is the discrete latent; for images it is a smaller grid of integers (for instance a 32 by 32 field of codes for a 128 by 128 image).
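The nearest-neighbor lookup described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `quantize`, the flattened `(N, D)` layout of encoder outputs, and the toy values are all assumptions made for the example.

```python
import numpy as np

def quantize(z_e, codebook):
    """Replace each encoder vector with its nearest codebook entry.

    z_e:      (N, D) encoder outputs, one row per spatial or temporal position.
    codebook: (K, D) embeddings e_1..e_K.
    Returns (indices, z_q) with shapes (N,) and (N, D).
    """
    # Squared Euclidean distance from every encoder vector to every code: (N, K).
    diff = z_e[:, None, :] - codebook[None, :, :]
    dist = np.sum(diff ** 2, axis=-1)
    # The discrete code is the index of the closest embedding.
    indices = np.argmin(dist, axis=1)
    # The quantized representation is the embedding itself.
    z_q = codebook[indices]
    return indices, z_q

# Toy data (assumed for illustration): K = 3 codes of dimension D = 2.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]])
z_e = np.array([[0.1, -0.2], [0.9, 1.3], [-0.8, 1.7], [0.6, 0.6]])
indices, z_q = quantize(z_e, codebook)
print(indices)  # [0 1 2 1]
```

For an image, `z_e` would be the encoder's feature grid reshaped to one row per position, and `indices` reshaped back to the grid gives the field of integer codes.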
The nearest-neighbor argmin is not differentiable, so VQ-VAE is trained with a three-part objective. Using sg to denote the stop-gradient operator (the identity in the forward pass, zero gradient in the backward pass), the loss is [1]:
L = log p(x | z_q(x)) + || sg[z_e(x)] - e ||^2 + beta * || z_e(x) - sg[e] ||^2
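The three terms are the reconstruction loss, the codebook loss, and the commitment loss. The reconstruction term, written log p(x | z_q(x)) and minimized in practice as a negative log-likelihood, trains the decoder and the encoder. The codebook term, || sg[z_e(x)] - e ||^2, moves the selected embedding toward the encoder output; because of the stop-gradient it updates only the codebook. The commitment term, beta * || z_e(x) - sg[e] ||^2, keeps the encoder output close to the embedding it selected so that it commits to a code and does not grow without bound; because of the stop-gradient it updates only the encoder.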
The authors report that results are robust to beta over a wide range and use beta = 0.25 in their experiments [1].
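As a rough numerical illustration of the objective above, the sketch below evaluates its terms on toy arrays. It assumes a fixed-variance Gaussian likelihood, so the reconstruction term reduces to a mean squared error, and it averages over positions; both are choices made for this example. The stop-gradient changes only which parameters receive gradient, so the codebook and commitment terms share the same forward value up to the factor beta. The function name `vq_vae_loss_terms` is hypothetical.

```python
import numpy as np

def vq_vae_loss_terms(x, x_recon, z_e, z_q, beta=0.25):
    """Forward values of the loss terms (no gradients are computed here)."""
    # Reconstruction: MSE as a stand-in for the negative log-likelihood (assumption).
    recon = float(np.mean((x - x_recon) ** 2))
    # Squared distance between encoder outputs and their selected embeddings.
    sq_dist = float(np.mean(np.sum((z_e - z_q) ** 2, axis=-1)))
    codebook_term = sq_dist            # gradient would flow to the codebook only
    commitment_term = beta * sq_dist   # gradient would flow to the encoder only
    total = recon + codebook_term + commitment_term
    return recon, codebook_term, commitment_term, total

# Toy values (assumed for illustration).
x = np.array([[1.0, 2.0]])
x_recon = np.array([[1.0, 1.0]])
z_e = np.array([[0.0, 0.0], [1.0, 1.0]])
z_q = np.array([[0.0, 1.0], [1.0, 1.0]])
recon, codebook_term, commitment_term, total = vq_vae_loss_terms(x, x_recon, z_e, z_q)
print(recon, codebook_term, commitment_term, total)  # 0.5 0.5 0.125 1.125
```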
To train the encoder despite the non-differentiable quantization, VQ-VAE uses a straight-through estimator. In the forward pass the discrete z_q(x) is fed to the decoder; in the backward pass the gradient of the reconstruction loss at the decoder input is copied unchanged back to the encoder output z_e(x), as if the quantization were an identity function [1]. The codebook and commitment terms then supply the remaining gradient signal that aligns the encoder outputs and the codebook entries.
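The gradient routing can be made concrete with a hand-written backward pass. The sketch below is an assumption-laden illustration in plain NumPy: it uses summed (not averaged) squared errors, takes the reconstruction gradient at the decoder input as a given array `grad_z_q`, and the function name is hypothetical. In automatic-differentiation frameworks the same effect is commonly obtained by computing the decoder input as z_e + sg[z_q - z_e].

```python
import numpy as np

def straight_through_backward(z_e, codebook, indices, grad_z_q, beta=0.25):
    """Gradients at the encoder output and the codebook for one batch.

    grad_z_q: (N, D) gradient of the reconstruction loss at the decoder input.
    """
    e = codebook[indices]
    # Straight-through: the reconstruction gradient is copied to z_e unchanged.
    # The commitment term beta * ||z_e - sg[e]||^2 adds 2 * beta * (z_e - e).
    grad_z_e = grad_z_q + 2.0 * beta * (z_e - e)
    # The codebook term ||sg[z_e] - e||^2 reaches only the selected codes,
    # contributing 2 * (e - z_e), accumulated over positions sharing a code.
    grad_codebook = np.zeros_like(codebook)
    np.add.at(grad_codebook, indices, 2.0 * (e - z_e))
    return grad_z_e, grad_codebook

# Toy values (assumed for illustration).
codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
z_e = np.array([[0.2, 0.0], [0.8, 1.0], [1.0, 1.4]])
indices = np.array([0, 1, 1])
grad_z_q = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
grad_z_e, grad_codebook = straight_through_backward(z_e, codebook, indices, grad_z_q)
```

Note that the codebook receives no reconstruction gradient in this scheme, and a code that no position selects receives no gradient at all.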
The original paper also describes an alternative to the codebook loss in which the embeddings are updated by an exponential moving average (EMA) of the encoder outputs assigned to each code, treating quantization as online k-means. EMA updates are widely used in practice because they tend to be more stable and improve codebook usage [1]. They were adopted by many later systems, including VQ-VAE-2 and audio codecs.
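A minimal sketch of the EMA update, assuming the usual formulation in which a running count and a running sum of assigned encoder outputs are kept per code and the embedding is their ratio. The names `cluster_size` and `embed_sum`, the default decay `gamma`, and the small `eps` guard against division by zero are assumptions for this example.

```python
import numpy as np

def ema_codebook_update(codebook, cluster_size, embed_sum, z_e, indices,
                        gamma=0.99, eps=1e-5):
    """One EMA step; returns updated (codebook, cluster_size, embed_sum)."""
    K = codebook.shape[0]
    one_hot = np.eye(K)[indices]      # (N, K) assignment matrix
    counts = one_hot.sum(axis=0)      # number of encoder outputs per code
    sums = one_hot.T @ z_e            # (K, D) sum of encoder outputs per code
    # Exponential moving averages of the counts and the sums.
    cluster_size = gamma * cluster_size + (1.0 - gamma) * counts
    embed_sum = gamma * embed_sum + (1.0 - gamma) * sums
    # Each embedding becomes the running mean of the vectors assigned to it.
    codebook = embed_sum / np.maximum(cluster_size, eps)[:, None]
    return codebook, cluster_size, embed_sum

# Toy values (assumed); gamma = 0.5 exaggerates the step for readability.
codebook = np.array([[0.0, 0.0], [4.0, 4.0]])
cluster_size = np.ones(2)
embed_sum = codebook.copy()
z_e = np.array([[2.0, 0.0], [2.0, 2.0]])
indices = np.array([0, 0])
new_codebook, new_size, new_sum = ema_codebook_update(
    codebook, cluster_size, embed_sum, z_e, indices, gamma=0.5)
```

In this toy step, code 0 moves toward the mean of its assigned vectors while the unused code 1 stays where it was.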
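VQ-VAE is trained as an autoencoder, so on its own it reconstructs but does not generate novel samples from scratch. Generation uses a two-stage procedure [1]. First, the encoder, codebook, and decoder are trained with the objective above, with the prior over codes held uniform. Second, with the VQ-VAE frozen, an autoregressive prior is fit to the discrete codes of the training data. New samples are produced by sampling a grid or sequence of codes from the prior, looking up the corresponding embeddings, and passing them through the decoder.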
In the original work the prior over image codes was a PixelCNN, and the prior over audio codes was a WaveNet [1]. This separation matters: the VQ-VAE compresses the signal into a much shorter discrete sequence, so the autoregressive model operates in a compact latent space rather than over raw pixels or audio samples, which is far cheaper. The same template, with the prior replaced by a Transformer, became the standard approach for autoregressive image and audio generation.
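The second stage can be illustrated with a deliberately tiny stand-in for the prior. The sketch below fits a first-order Markov (bigram) model over code indices by counting, samples a code sequence from it one symbol at a time, and decodes the looked-up embeddings. The bigram model replaces the PixelCNN, WaveNet, or Transformer purely for brevity, and `decoder` is a hypothetical placeholder for a trained decoder network; neither reflects the original implementation.

```python
import numpy as np

def fit_bigram_prior(code_sequences, K, alpha=1.0):
    """Estimate p(first code) and p(next | previous) with add-alpha smoothing."""
    start = np.full(K, alpha)
    trans = np.full((K, K), alpha)
    for seq in code_sequences:
        start[seq[0]] += 1
        for prev, nxt in zip(seq[:-1], seq[1:]):
            trans[prev, nxt] += 1
    return start / start.sum(), trans / trans.sum(axis=1, keepdims=True)

def sample_codes(start, trans, length, rng):
    """Ancestral sampling: draw each code conditioned on the previous one."""
    codes = [rng.choice(len(start), p=start)]
    for _ in range(length - 1):
        codes.append(rng.choice(len(start), p=trans[codes[-1]]))
    return np.array(codes)

def generate(start, trans, codebook, decoder, length, rng):
    codes = sample_codes(start, trans, length, rng)
    z_q = codebook[codes]          # look up embeddings for the sampled codes
    return codes, decoder(z_q)     # decode them into a sample

# Toy setup (assumed): codes produced by a frozen encoder on training data.
train_codes = [[0, 1, 2, 0, 1, 2], [1, 2, 0, 1, 2, 0]]
codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
decoder = lambda z_q: z_q @ np.array([1.0, -1.0])   # hypothetical decoder
start, trans = fit_bigram_prior(train_codes, K=3)
rng = np.random.default_rng(0)
codes, sample = generate(start, trans, codebook, decoder, length=8, rng=rng)
```

The point of the structure is that the prior only ever sees short sequences of integers; all pixel- or waveform-level detail is handled by the decoder.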
The 2017 paper demonstrated the method across modalities: high-quality image reconstruction and generation on ImageNet, raw-audio modeling with applications to speaker conversion (changing the speaker while preserving content) and unsupervised discovery of phoneme-like units, and video. The discrete codes were shown to capture high-level content, such as phonetic structure in speech, while discarding low-level detail like the exact speaker identity [1].
VQ-VAE-2, "Generating Diverse High-Fidelity Images with VQ-VAE-2" by Ali Razavi, Aaron van den Oord, and Oriol Vinyals (DeepMind), was presented at NeurIPS 2019 [2]. It scaled the approach to high-resolution, high-fidelity image synthesis with two main changes.
First, it made the latent representation hierarchical. Instead of a single grid of codes, VQ-VAE-2 uses multiple levels: a coarse top-level code that captures global structure such as overall shape and color layout, and one or more finer bottom-level codes conditioned on the top that fill in local detail and texture [2]. Second, it used substantially more powerful autoregressive priors over each level, built on PixelCNN with multi-headed self-attention in the PixelSNAIL style, with the lower-resolution levels conditioning the higher-resolution ones [2].
The result was that samples on ImageNet at resolutions up to 256 by 256 rivaled the fidelity of the best contemporary generative adversarial networks while avoiding their characteristic failure modes: VQ-VAE-2 produced diverse samples and did not suffer from mode collapse, since it optimizes a likelihood-based objective rather than an adversarial one [2].
VQ-VAE established the discrete-tokenization paradigm that pervades modern generative AI. Its core idea, encode a signal into discrete tokens and then model those tokens with a strong sequence model, recurs across images, audio, and video.
| System | Year | Domain | Relationship to VQ-VAE |
|---|---|---|---|
| VQ-VAE-2 | 2019 | Images | Hierarchical, high-fidelity successor [2] |
| Jukebox | 2020 | Music | Hierarchical VQ-VAE codes plus a Transformer prior [5] |
| DALL-E (dVAE) | 2021 | Text to image | Discrete image tokenizer; uses a Gumbel-softmax relaxation instead of nearest-neighbor lookup [3] |
| VQGAN | 2021 | Images | VQ-VAE codebook plus adversarial and perceptual losses; Transformer prior [4] |
| SoundStream | 2021 | Audio codec | Residual vector quantization of audio [6] |
| EnCodec | 2022 | Audio codec | Residual vector quantization, SoundStream-style [7] |
Jukebox from OpenAI used a hierarchical VQ-VAE to tokenize raw music waveforms, then trained large Transformers as the prior over the codes [5]. OpenAI's DALL-E trained a discrete variational autoencoder (dVAE) with a codebook of 8192 image tokens, then modeled the joint sequence of text and image tokens with a Transformer; the dVAE replaces the hard nearest-neighbor lookup with a Gumbel-softmax relaxation but plays the same role as a VQ-VAE image tokenizer [3]. VQGAN, from "Taming Transformers for High-Resolution Image Synthesis" by Patrick Esser, Robin Rombach, and Bjorn Ommer, augmented the VQ-VAE objective with a patch-based adversarial loss and a perceptual loss to learn a more context-rich, perceptually sharp codebook, then composed the codes with an autoregressive Transformer [4].
In audio, vector quantization of encoder features evolved into residual vector quantization (RVQ), in which several codebooks are applied in sequence, each quantizing the residual error left by the previous one. RVQ powers Google's SoundStream [6] and Meta's EnCodec [7], the neural codecs that tokenize audio for downstream generative models such as AudioLM and MusicGen. The general lesson, that a learned discrete code lets a Transformer treat continuous media as a sequence of tokens, traces to VQ-VAE.
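Residual vector quantization as described above can be sketched as a loop over codebooks. This is an illustrative NumPy version under simple assumptions (hand-picked codebooks, no training, hypothetical function names) and not the SoundStream or EnCodec implementation.

```python
import numpy as np

def rvq_encode(z, codebooks):
    """Quantize z with several codebooks in sequence.

    Returns (indices, residual): indices has shape (N, num_codebooks) and
    residual is the error left after the last stage.
    """
    residual = z.copy()
    indices = []
    for cb in codebooks:
        dist = np.sum((residual[:, None, :] - cb[None, :, :]) ** 2, axis=-1)
        idx = np.argmin(dist, axis=1)
        indices.append(idx)
        # The next stage quantizes whatever this stage failed to capture.
        residual = residual - cb[idx]
    return np.stack(indices, axis=1), residual

def rvq_decode(indices, codebooks):
    """Reconstruct by summing the selected entry from every codebook."""
    return sum(cb[indices[:, i]] for i, cb in enumerate(codebooks))

# Toy codebooks (assumed): a coarse stage followed by a finer one.
codebooks = [
    np.array([[0.0, 0.0], [1.0, 1.0]]),
    np.array([[0.0, 0.0], [0.25, -0.25], [-0.25, 0.25]]),
]
z = np.array([[1.2, 0.8], [0.1, 0.0]])
indices, residual = rvq_encode(z, codebooks)
z_hat = rvq_decode(indices, codebooks)
print(indices.tolist())  # [[1, 1], [0, 0]]
```

Each position is now represented by one index per codebook, so the number of representable combinations grows multiplicatively with the number of stages while each codebook stays small.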
The most discussed limitation of VQ-VAE is codebook collapse, also called dead codes or low codebook utilization. During training a subset of codebook entries may stop being selected as nearest neighbors; they receive no gradient from the codebook loss and remain stranded, so the effective vocabulary shrinks and reconstructions degrade. The problem tends to worsen as the nominal codebook size K grows [8]. Practitioners developed a range of countermeasures, including EMA codebook updates [1], random restarts or re-seeding of unused codes (Jukebox reinitializes any code whose usage falls below a threshold to a recent encoder output) [5], code splitting, and entropy or commitment regularizers that encourage uniform usage.
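Two of these ideas are easy to sketch: measuring utilization and reseeding unused codes. The example below is a hedged illustration, not code from any cited system. Perplexity of the code usage distribution is a common diagnostic that the text above does not mention, and the threshold, the use of per-batch counts as the usage statistic, and the function names are assumptions.

```python
import numpy as np

def codebook_perplexity(indices, K):
    """exp(entropy) of code usage; equals K when all codes are used equally."""
    counts = np.bincount(indices, minlength=K)
    p = counts / counts.sum()
    nz = p[p > 0]
    return float(np.exp(-np.sum(nz * np.log(nz))))

def reset_dead_codes(codebook, usage, z_e, threshold, rng):
    """Reinitialize codes with usage below threshold to random encoder outputs."""
    codebook = codebook.copy()
    dead = np.flatnonzero(usage < threshold)
    if dead.size:
        picks = rng.integers(0, z_e.shape[0], size=dead.size)
        codebook[dead] = z_e[picks]
    return codebook, dead

# Toy values (assumed): only 2 of K = 4 codes are ever selected.
K = 4
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [9.0, 9.0], [-9.0, 9.0]])
z_e = np.array([[0.1, 0.0], [0.0, 0.2], [1.1, 0.9], [0.8, 1.0]])
indices = np.array([0, 0, 1, 1])
usage = np.bincount(indices, minlength=K)
perplexity = codebook_perplexity(indices, K)
rng = np.random.default_rng(0)
new_codebook, dead = reset_dead_codes(codebook, usage, z_e, threshold=1, rng=rng)
print(dead.tolist())  # [2, 3]
```

After the reset, the previously stranded codes sit on top of actual encoder outputs, so they are likely to be selected again on subsequent batches.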
A second line of work removes the explicit, learned codebook altogether. Finite Scalar Quantization (FSQ), introduced in 2023 by Fabian Mentzer, David Minnen, Eirikur Agustsson, and Michael Tschannen in "Finite Scalar Quantization: VQ-VAE Made Simple," projects the encoder output to a small number of dimensions and rounds each dimension to one of a few fixed levels; with d dimensions and L levels per dimension, this defines an implicit codebook of size L^d without any learned embedding table [8]. Gradients pass through the rounding via the straight-through estimator. FSQ avoids codebook collapse by construction and needs none of the commitment losses, reseeding, or splitting tricks that VQ requires, while matching VQ within a small margin on reconstruction and generation [8]. The closely related lookup-free quantization (LFQ), used in MAGVIT-v2, reduces the per-code embedding dimension to zero by binarizing each latent dimension at a threshold, enabling very large vocabularies (on the order of 2^18) and improving generation quality, with entropy-based penalties encouraging full codebook use [9]. These methods have largely supplanted the original learned-codebook quantizer in many recent tokenizers, while preserving the discrete-token, two-stage paradigm that VQ-VAE pioneered.
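Both codebook-free schemes reduce to elementwise operations, as the forward-pass sketch below suggests. It makes several simplifying assumptions that are not taken from the cited papers: the same odd number of levels in every dimension, a tanh to bound the values before rounding, a zero threshold for LFQ, and a mixed-radix numbering of the implicit codes. The straight-through gradient is omitted, and the function names are hypothetical.

```python
import numpy as np

def fsq_quantize(z, levels):
    """FSQ forward pass for an odd number of levels per dimension.

    z: (N, d) low-dimensional projection of the encoder output.
    Returns (z_hat, index): rounded values in {-h, ..., h} with h = (levels-1)/2,
    and each row's position in the implicit codebook of size levels**d.
    """
    half = (levels - 1) / 2.0
    bounded = np.tanh(z) * half              # squash into (-half, half)
    z_hat = np.round(bounded)                # snap to the nearest integer level
    digits = (z_hat + half).astype(int)      # shift to 0 .. levels-1
    index = digits @ (levels ** np.arange(z.shape[1]))  # mixed-radix code id
    return z_hat, index

def lfq_quantize(z):
    """LFQ forward pass: binarize each dimension at zero (assumed threshold)."""
    bits = (z > 0).astype(int)
    codes = np.where(z > 0, 1.0, -1.0)
    index = bits @ (2 ** np.arange(z.shape[1]))  # integer token in [0, 2**d)
    return codes, index

# Toy values (assumed): L = 5 levels, d = 2 gives 25 implicit codes.
z = np.array([[0.0, 0.0], [10.0, -10.0], [0.3, -0.3]])
z_hat, fsq_index = fsq_quantize(z, levels=5)
print(fsq_index.tolist())  # [12, 4, 8]

# d = 3 binary dimensions give 2**3 = 8 implicit codes.
lfq_codes, lfq_index = lfq_quantize(np.array([[0.5, -1.0, 2.0]]))
print(lfq_index.tolist())  # [5]
```

Because every combination of levels is a valid code, no entry can be stranded the way a learned embedding can, which is the sense in which these quantizers avoid collapse by construction.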