VQ-VAE (Vector Quantized Variational Autoencoder)

Updated Jul 12, 2026

VQ-VAE (Vector Quantized Variational Autoencoder) is a generative neural network that compresses data into a grid or sequence of discrete tokens drawn from a learned codebook, rather than into the continuous latent space of a standard autoencoder. It was introduced in the 2017 paper "Neural Discrete Representation Learning" by Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu at DeepMind, presented at NeurIPS 2017 ^[1]. In the authors' words, the model differs from a variational autoencoder "in two key ways: the encoder network outputs discrete, rather than continuous, codes; and the prior is learnt rather than static" ^[1]. Each spatial or temporal position of an input is mapped to an index in a finite codebook by nearest-neighbor lookup, turning images, audio, or video into integer codes that a sequence model can predict the way a language model predicts words.

This discreteness is the model's defining property and the source of its influence. Because the latents are integer codes drawn from a fixed vocabulary, they can be modeled afterward by a powerful autoregressive model such as PixelCNN, WaveNet, or a Transformer. This two-stage "tokenize, then model the tokens" recipe became one of the most influential paradigms in generative modeling. VQ-VAE is the conceptual ancestor of the discrete image tokenizer in DALL-E, of VQGAN, and of the residual neural audio codecs SoundStream and EnCodec that underpin modern audio and music generation ^[3]^[4]^[6]^[7].

What is VQ-VAE?

VQ-VAE is an autoencoder whose latent space is a finite set of learned vectors (the codebook) instead of a continuous Gaussian. An encoder turns the input into a field of continuous vectors; each of those vectors is snapped to its nearest codebook entry; and a decoder reconstructs the input from the snapped (quantized) vectors. The list of chosen codebook indices is the discrete latent. The original paper showed that a single VQ-VAE could compress a 128 by 128 by 3 color image down to a 32 by 32 by 1 grid of integer codes using a codebook of $K = 512$ entries, a roughly 42.6x reduction in bits, while keeping reconstructions only slightly blurrier than the originals ^[1].
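The compression figure quoted above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes 8 bits per color channel and the smallest whole number of bits per code index; the function names are illustrative.

```python
import math

def bits_for_image(height, width, channels, bits_per_channel=8):
    """Bits needed to store a raw image (assumes fixed-width channels)."""
    return height * width * channels * bits_per_channel

def bits_for_codes(grid_h, grid_w, codebook_size):
    """Bits needed to store one codebook index per grid position."""
    return grid_h * grid_w * math.ceil(math.log2(codebook_size))

raw_bits = bits_for_image(128, 128, 3)    # 128 * 128 * 3 * 8 = 393216
code_bits = bits_for_codes(32, 32, 512)   # 32 * 32 * 9 = 9216
ratio = raw_bits / code_bits
print(f"{ratio:.2f}x")                    # 42.67x
```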

The authors describe the Vector Quantised Variational AutoEncoder as differing from VAEs "in two key ways," namely the discrete codes and the learned prior ^[1]. The resulting model combines the autoencoder's reconstruction objective with the vector-quantization technique long used in signal compression.

Why discrete latents? (motivation)

The design of VQ-VAE responds to two practical problems with continuous-latent VAEs.

The first is posterior collapse. When a VAE is paired with a sufficiently expressive autoregressive decoder, the decoder can model the data distribution on its own and learns to ignore the latent variable; the approximate posterior collapses toward the prior and the latent code carries little information ^[1]. The paper states that "using the VQ method allows the model to circumvent issues of 'posterior collapse', where the latents are ignored when they are paired with a powerful autoregressive decoder" ^[1]. By committing the encoder output to a discrete codebook entry, VQ-VAE forces the latent to remain informative.

The second motivation is representational. Many modalities of interest, such as language, are inherently discrete, and even continuous signals like images and audio are often well summarized by a sequence of symbols. Discrete codes pair naturally with the strongest available density estimators for sequences, the autoregressive models, which assign a probability to one symbol at a time over a finite vocabulary. A continuous VAE latent does not give those priors a clean discrete target to predict, whereas VQ-VAE does ^[1].

How does VQ-VAE work?

Encoder, codebook, and quantization

VQ-VAE has three parts: an encoder, a shared codebook (also called the embedding space or dictionary), and a decoder.

The codebook is a learnable table of $K$ embedding vectors, each of dimension $D$, written $e_1$ through $e_K$. The encoder is a convolutional network that maps an input $x$ to a field of continuous vectors $z_e(x)$, for example a grid of feature vectors over an image or a sequence of vectors over audio. Each encoder output vector is then quantized by nearest-neighbor lookup: it is replaced by the codebook entry closest to it in Euclidean distance. Formally, the discrete code at a position is the index $k$ that minimizes the distance between $z_e(x)$ and $e_k$, and the quantized representation $z_q(x)$ is the corresponding embedding $e_k$ ^[1]. The decoder reconstructs the input from $z_q(x)$. The set of chosen indices is the discrete latent; for images it is a smaller grid of integers (for instance the 32 by 32 field of codes for a 128 by 128 image in the original paper, with $K = 512$ codes in the dictionary) ^[1].
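The nearest-neighbor lookup can be sketched in plain Python. The tiny codebook ($K = 3$, $D = 2$) and the encoder outputs below are made-up values for illustration, and a real implementation would use batched tensor operations instead of loops.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def quantize(z_e, codebook):
    """Replace each encoder vector with its nearest codebook entry.

    Returns the chosen indices (the discrete latent) and the quantized vectors.
    """
    indices, z_q = [], []
    for vec in z_e:
        k = min(range(len(codebook)),
                key=lambda i: squared_distance(vec, codebook[i]))
        indices.append(k)
        z_q.append(list(codebook[k]))
    return indices, z_q

# Hypothetical codebook and encoder outputs.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
z_e = [[0.9, 1.2], [0.1, -0.2], [-0.8, 0.7]]

indices, z_q = quantize(z_e, codebook)
print(indices)  # [1, 0, 2]
```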

What is the VQ-VAE loss function?

The nearest-neighbor argmin is not differentiable, so VQ-VAE is trained with a three-part objective. Using $\mathrm{sg}$ to denote the stop-gradient operator (the identity in the forward pass, zero gradient in the backward pass), the loss is ^[1]:

$$L = \log p(x \mid z_q(x)) + \lVert \mathrm{sg}[z_e(x)] - e \rVert^2 + \beta \lVert z_e(x) - \mathrm{sg}[e] \rVert^2$$

The three terms are:

Reconstruction loss, the first term, trains the encoder and decoder so that the decoder reproduces $x$ from the quantized code. The equation follows the paper's notation; as a quantity to minimize, this term is read as the negative log-likelihood $-\log p(x \mid z_q(x))$, which for a fixed-variance Gaussian likelihood is a squared error up to constants.
Codebook (embedding) loss, the second term, moves the selected codebook vectors $e$ toward the encoder outputs. The stop-gradient on $z_e(x)$ means only the codebook is updated by this term. This is equivalent to a vector quantization objective, the online clustering of encoder outputs.
Commitment loss, the third term, pulls the encoder outputs toward the codebook vectors so that the encoder "commits" to an embedding and its outputs do not grow arbitrarily; the stop-gradient on $e$ means only the encoder is updated here. The scalar $\beta$ weights this term.

The authors report that results did not vary for $\beta$ between 0.1 and 2.0, and they use $\beta = 0.25$ in all their experiments ^[1].
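The three terms can be evaluated by hand for a single latent position. The sketch below assumes a mean squared error as the reconstruction term (a stand-in for the negative log-likelihood) and writes out the gradients that the stop-gradient operator leaves in place; all input values are made up.

```python
BETA = 0.25  # commitment weight used in the paper

def squared_error(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def vq_vae_loss_terms(x, x_hat, z_e, e, beta=BETA):
    """Return the three loss terms and the two gradients that sg[.] permits."""
    reconstruction = squared_error(x, x_hat) / len(x)  # assumed MSE stand-in
    codebook_term = squared_error(z_e, e)              # ||sg[z_e] - e||^2
    commitment_term = beta * squared_error(z_e, e)     # beta * ||z_e - sg[e]||^2
    # sg[z_e] treats z_e as a constant, so the codebook term only moves e.
    grad_codebook = [2.0 * (ei - zi) for zi, ei in zip(z_e, e)]
    # sg[e] treats e as a constant, so the commitment term only moves z_e.
    grad_encoder = [2.0 * beta * (zi - ei) for zi, ei in zip(z_e, e)]
    return reconstruction, codebook_term, commitment_term, grad_codebook, grad_encoder

rec, cb, commit, g_e, g_z = vq_vae_loss_terms(
    x=[1.0, 0.0], x_hat=[0.5, 0.0], z_e=[0.9, 1.2], e=[1.0, 1.0])
total = rec + cb + commit
```

The codebook and commitment terms have the same forward value up to the factor $\beta$; they differ only in which parameters receive the gradient.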

What is the straight-through estimator in VQ-VAE?

To train the encoder despite the non-differentiable quantization, VQ-VAE uses a straight-through estimator. In the forward pass the discrete $z_q(x)$ is fed to the decoder; in the backward pass the gradient of the reconstruction loss at the decoder input is copied unchanged back to the encoder output $z_e(x)$, as if the quantization were an identity function ^[1]. The codebook and commitment terms then supply the remaining gradient signal that aligns the encoder outputs and the codebook entries.
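A hand-written forward and backward pass makes the gradient copy explicit. This sketch assumes a toy decoder that multiplies the quantized vector by a scalar weight `w` and a sum-of-squares reconstruction loss; both are illustrative choices, not the paper's architecture.

```python
def nearest_index(vec, codebook):
    return min(range(len(codebook)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(vec, codebook[i])))

def straight_through_step(z_e, codebook, x, w):
    """Forward through the quantizer, then copy the decoder-input gradient to z_e."""
    # Forward pass: the decoder sees the quantized vector, not z_e.
    k = nearest_index(z_e, codebook)
    z_q = codebook[k]
    x_hat = [w * q for q in z_q]                      # toy decoder (assumption)
    loss = sum((h - t) ** 2 for h, t in zip(x_hat, x))
    # Backward pass: gradient of the loss at the decoder input z_q.
    grad_z_q = [2.0 * (h - t) * w for h, t in zip(x_hat, x)]
    # Straight-through: treat quantization as the identity and reuse the gradient.
    grad_z_e = list(grad_z_q)
    return k, loss, grad_z_e

codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
k, loss, grad_z_e = straight_through_step(
    z_e=[0.9, 1.2], codebook=codebook, x=[2.0, 1.0], w=2.0)
print(k, loss, grad_z_e)  # 1 1.0 [0.0, 4.0]
```

In automatic-differentiation frameworks the same effect is commonly obtained by feeding the decoder `z_e + stop_gradient(z_q - z_e)`, which equals `z_q` in the forward pass and has an identity gradient with respect to `z_e`.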

EMA codebook updates

The original paper also describes an alternative to the codebook loss in which the embeddings are updated by an exponential moving average (EMA) of the encoder outputs assigned to each code, treating quantization as online k-means. EMA updates are widely used in practice because they tend to be more stable and improve codebook usage ^[1]. They were adopted by many later systems, including VQ-VAE-2 and audio codecs.
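One EMA step can be sketched as follows: each code keeps a running count of assigned vectors and a running sum of those vectors, and the embedding is their ratio. The variable names, the default decay of 0.99, and the skip for codes with zero running count are assumptions of this sketch.

```python
def ema_update(codebook, cluster_size, ema_sum, z_e_batch, indices, gamma=0.99):
    """Update codebook entries in place from one batch of encoder outputs."""
    K, D = len(codebook), len(codebook[0])
    counts = [0] * K
    sums = [[0.0] * D for _ in range(K)]
    for vec, k in zip(z_e_batch, indices):
        counts[k] += 1
        for d in range(D):
            sums[k][d] += vec[d]
    for k in range(K):
        # Running count and running sum of the vectors assigned to code k.
        cluster_size[k] = gamma * cluster_size[k] + (1 - gamma) * counts[k]
        for d in range(D):
            ema_sum[k][d] = gamma * ema_sum[k][d] + (1 - gamma) * sums[k][d]
        if cluster_size[k] > 0:
            codebook[k] = [s / cluster_size[k] for s in ema_sum[k]]
    return codebook

# Toy state: two codes, with running statistics initialised from the codebook.
codebook = [[0.0, 0.0], [1.0, 1.0]]
cluster_size = [1.0, 1.0]
ema_sum = [[0.0, 0.0], [1.0, 1.0]]
ema_update(codebook, cluster_size, ema_sum,
           z_e_batch=[[2.0, 2.0], [2.0, 0.0]], indices=[1, 1], gamma=0.5)
```

With the deliberately small decay of 0.5 used here, code 1 moves from `[1.0, 1.0]` toward the batch mean `[2.0, 1.0]`, while the unused code 0 stays where it was.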

How does VQ-VAE generate new samples?

VQ-VAE is trained as an autoencoder, so on its own it reconstructs but does not generate novel samples from scratch. Generation uses a two-stage procedure ^[1]:

Train the VQ-VAE to obtain an encoder, a codebook, and a decoder, and encode the training data into sequences or grids of discrete codes.
Train a separate autoregressive prior over those discrete codes. To sample, draw a code sequence from the prior, then decode it with the VQ-VAE decoder.

In the original work the prior over image codes was a PixelCNN, and the prior over audio codes was a WaveNet ^[1]. This separation matters: the VQ-VAE compresses the signal into a much shorter discrete sequence, so the autoregressive model operates in a compact latent space rather than over raw pixels or audio samples, which is far cheaper. The same template, with the prior replaced by a Transformer, became the standard approach for autoregressive image and audio generation.
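The two-stage procedure can be illustrated end to end with toy stand-ins. In this sketch a count-based bigram model plays the role of the autoregressive prior (in place of a PixelCNN, WaveNet, or Transformer), and a plain codebook lookup plays the role of the decoder network; the code sequences are made up.

```python
import random
from collections import Counter, defaultdict

def fit_bigram_prior(code_sequences, vocab_size):
    """Stage 2 stand-in: bigram counts with add-one smoothing over code indices."""
    counts = defaultdict(Counter)
    for seq in code_sequences:
        prev = None  # None marks the start of a sequence
        for k in seq:
            counts[prev][k] += 1
            prev = k

    def next_probs(prev):
        c = counts[prev]
        total = sum(c.values()) + vocab_size
        return [(c[k] + 1) / total for k in range(vocab_size)]

    return next_probs

def sample_codes(next_probs, length, vocab_size, rng):
    """Draw a code sequence one index at a time from the prior."""
    seq, prev = [], None
    for _ in range(length):
        k = rng.choices(range(vocab_size), weights=next_probs(prev))[0]
        seq.append(k)
        prev = k
    return seq

def decode(codes, codebook):
    """Decoder stand-in: look up each code's embedding."""
    return [codebook[k] for k in codes]

# Hypothetical stage-1 outputs: a codebook and encoded training sequences.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
train_codes = [[0, 1, 2, 1], [0, 1, 1, 2], [0, 2, 1, 0]]

prior = fit_bigram_prior(train_codes, vocab_size=3)
rng = random.Random(0)
new_codes = sample_codes(prior, length=4, vocab_size=3, rng=rng)
sample = decode(new_codes, codebook)
```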

The 2017 paper demonstrated the method across modalities: high-quality image reconstruction and generation on ImageNet, raw-audio modeling with applications to speaker conversion (changing the speaker while preserving content) and unsupervised discovery of phoneme-like units, and video. The discrete codes were shown to capture high-level content, such as phonetic structure in speech, while discarding low-level detail like the exact speaker identity ^[1].

How is VQ-VAE different from a standard VAE?

A standard variational autoencoder encodes each input into the parameters of a continuous probability distribution (typically a diagonal Gaussian) and is trained by maximizing an evidence lower bound that balances reconstruction against a KL-divergence term toward a fixed prior. VQ-VAE replaces that continuous, KL-regularized latent with a discrete codebook lookup and swaps the KL term for the codebook and commitment losses described above.

Property	Standard VAE	VQ-VAE
Latent space	Continuous (e.g. Gaussian)	Discrete codes from a learned codebook of K entries ^[1]
How the encoder output is mapped	Sampled from a distribution (reparameterization trick)	Snapped to the nearest codebook vector (argmin) ^[1]
Regularizer	KL divergence to a fixed prior	Codebook loss plus $\beta$-weighted commitment loss ($\beta = 0.25$) ^[1]
Gradient through the latent	Reparameterization (fully differentiable)	Straight-through estimator across the non-differentiable quantizer ^[1]
Prior	Static (chosen in advance)	Learned afterward as a separate autoregressive model ^[1]
Posterior collapse	Common with strong decoders	Circumvented by the discrete bottleneck ^[1]

The practical payoff of these differences is that the VQ-VAE latent is a clean sequence of tokens, which is exactly what a powerful autoregressive prior or Transformer can model, whereas a continuous VAE latent is not.

VQ-VAE-2

VQ-VAE-2, "Generating Diverse High-Fidelity Images with VQ-VAE-2" by Ali Razavi, Aaron van den Oord, and Oriol Vinyals (DeepMind), was presented at NeurIPS 2019 ^[2]. It scaled the approach to high-resolution, high-fidelity image synthesis with two main changes.

First, it made the latent representation hierarchical. Instead of a single grid of codes, VQ-VAE-2 uses multiple levels: a coarse top-level code that captures global structure such as overall shape and color layout, and one or more finer bottom-level codes conditioned on the top that fill in local detail and texture ^[2]. Second, it used substantially more powerful autoregressive priors over each level, built on PixelCNN with multi-headed self-attention in the PixelSNAIL style, with the lower-resolution levels conditioning the higher-resolution ones ^[2].

The result was that samples on ImageNet at resolutions up to 256 by 256 rivaled the fidelity of the best contemporary generative adversarial networks while avoiding their characteristic failure modes: VQ-VAE-2 produced diverse samples and did not suffer from mode collapse, since it optimizes a likelihood-based objective rather than an adversarial one ^[2].

What is VQ-VAE used for? (impact and descendants)

VQ-VAE established the discrete-tokenization paradigm that pervades modern generative AI. Its core idea, encode a signal into discrete tokens and then model those tokens with a strong sequence model, recurs across images, audio, and video.

System	Year	Domain	Relationship to VQ-VAE
VQ-VAE-2	2019	Images	Hierarchical, high-fidelity successor ^[2]
Jukebox	2020	Music	Hierarchical VQ-VAE codes plus a Transformer prior ^[5]
DALL-E (dVAE)	2021	Text to image	Discrete image tokenizer; uses a Gumbel-softmax relaxation instead of nearest-neighbor argmin ^[3]
VQGAN	2021	Images	VQ-VAE codebook plus adversarial and perceptual losses; Transformer prior ^[4]
SoundStream	2021	Audio codec	Residual vector quantization of audio ^[6]
EnCodec	2022	Audio codec	Residual vector quantization, SoundStream-style ^[7]

Jukebox from OpenAI used a hierarchical VQ-VAE to tokenize raw music waveforms, then trained large Transformers as the prior over the codes ^[5]. OpenAI's DALL-E trained a discrete variational autoencoder (dVAE) that compressed each 256 by 256 image into a 32 by 32 grid of tokens, each token one of 8192 possible values, reducing the Transformer's context size by a factor of 192 with little loss of visual quality; it then modeled the joint sequence of text and image tokens with a Transformer ^[3]. The dVAE replaces the hard nearest-neighbor lookup with a Gumbel-softmax relaxation but plays the same role as a VQ-VAE image tokenizer ^[3]. VQGAN, from "Taming Transformers for High-Resolution Image Synthesis" by Patrick Esser, Robin Rombach, and Bjorn Ommer, augmented the VQ-VAE objective with a patch-based adversarial loss and a perceptual loss to learn a more context-rich, perceptually sharp codebook, then composed the codes with an autoregressive Transformer ^[4].
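The factor of 192 follows from counting sequence positions, assuming the Transformer would otherwise see one position per pixel channel value:

```python
import math

pixel_positions = 256 * 256 * 3   # one position per RGB channel value (assumption)
token_positions = 32 * 32         # one position per dVAE token
context_reduction = pixel_positions // token_positions
bits_per_token = math.log2(8192)  # 13 bits select one of 8192 codes

print(context_reduction, bits_per_token)  # 192 13.0
```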

In audio, vector quantization of encoder features evolved into residual vector quantization (RVQ), in which several codebooks are applied in sequence, each quantizing the residual error left by the previous one. RVQ powers Google's SoundStream ^[6] and Meta's EnCodec ^[7], the neural codecs that tokenize audio for downstream generative models such as AudioLM and MusicGen. The general lesson, that a learned discrete code lets a Transformer treat continuous media as a sequence of tokens, traces to VQ-VAE.
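Residual vector quantization can be sketched as a loop over codebooks in which each stage quantizes what the previous stages left over. The two small codebooks below are made-up values; real codecs learn them and use many more entries.

```python
def nearest_index(vec, codebook):
    return min(range(len(codebook)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(vec, codebook[i])))

def residual_quantize(vec, codebooks):
    """Quantize vec with a sequence of codebooks, one residual at a time.

    Returns one index per codebook and the summed approximation.
    """
    residual = list(vec)
    approx = [0.0] * len(vec)
    indices = []
    for codebook in codebooks:
        k = nearest_index(residual, codebook)
        indices.append(k)
        approx = [a + c for a, c in zip(approx, codebook[k])]
        residual = [r - c for r, c in zip(residual, codebook[k])]
    return indices, approx

# Hypothetical coarse and fine codebooks.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, -1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [-0.25, -0.25]],
]
indices, approx = residual_quantize([1.25, 1.0], codebooks)
print(indices, approx)  # [1, 1] [1.25, 1.0]
```

Each added codebook refines the approximation, so the number of stages trades bitrate against reconstruction error.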

What are the limitations of VQ-VAE?

The most discussed limitation of VQ-VAE is codebook collapse, also called dead codes or low codebook utilization. During training a subset of codebook entries may stop being selected as nearest neighbors; they receive no gradient from the codebook loss and remain stranded, so the effective vocabulary shrinks and reconstructions degrade. The problem tends to worsen as the nominal codebook size K grows ^[8]. Practitioners developed a range of countermeasures, including EMA codebook updates ^[1], random restarts or re-seeding of unused codes (Jukebox reinitializes any code whose usage falls below a threshold to a recent encoder output) ^[5], code splitting, and entropy or commitment regularizers that encourage uniform usage.
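Codebook utilization and a simple reseeding step can be sketched as below. The threshold `min_count`, the helper names, and the choice of a uniformly random encoder output as the replacement are assumptions of this sketch, loosely modelled on the random-restart idea rather than on any specific implementation.

```python
import random

def codebook_usage(indices, codebook_size):
    """Count how often each code was selected in a batch."""
    counts = [0] * codebook_size
    for k in indices:
        counts[k] += 1
    return counts

def reseed_dead_codes(codebook, counts, encoder_outputs, rng, min_count=1):
    """Reset codes used fewer than min_count times to a random encoder output."""
    reseeded = []
    for k, c in enumerate(counts):
        if c < min_count:
            codebook[k] = list(rng.choice(encoder_outputs))
            reseeded.append(k)
    return reseeded

codebook = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [-5.0, 5.0]]
encoder_outputs = [[0.1, 0.0], [0.2, -0.1], [0.9, 1.1], [0.0, 0.1]]
indices = [0, 0, 1, 0]  # codes 2 and 3 were never selected

counts = codebook_usage(indices, len(codebook))
utilization = sum(c > 0 for c in counts) / len(codebook)
reseeded = reseed_dead_codes(codebook, counts, encoder_outputs, random.Random(0))
print(counts, utilization, reseeded)  # [3, 1, 0, 0] 0.5 [2, 3]
```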

A second line of work removes the explicit, learned codebook altogether. Finite Scalar Quantization (FSQ), introduced in 2023 by Fabian Mentzer, David Minnen, Eirikur Agustsson, and Michael Tschannen in "Finite Scalar Quantization: VQ-VAE Made Simple," projects the encoder output to a small number of dimensions $d$ and rounds each dimension to one of $L$ fixed levels, which defines an implicit codebook of size $L^d$ without any learned embedding table ^[8]. Gradients pass through the rounding via the straight-through estimator. FSQ avoids codebook collapse by construction and needs none of the commitment losses, reseeding, or splitting tricks that VQ requires, while matching VQ within a small margin on reconstruction and generation ^[8]. The closely related lookup-free quantization (LFQ), used in MAGVIT-v2, reduces the per-code embedding dimension to zero by binarizing each latent dimension at a threshold, enabling very large vocabularies (on the order of $2^{18}$) and improving generation quality, with entropy-based penalties encouraging full codebook use ^[9]. These methods have replaced the original learned-codebook quantizer in a number of recent tokenizers, while preserving the discrete-token, two-stage paradigm that VQ-VAE pioneered.
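Both codebook-free quantizers reduce to a few lines of arithmetic. The sketch below assumes an odd number of levels per dimension for FSQ, a `tanh` bound before rounding, and a threshold of zero for LFQ; the function names and input values are illustrative, and the straight-through gradient is omitted.

```python
import math

def fsq_quantize(z, levels):
    """Bound each dimension with tanh, then round to one of L integer levels.

    Assumes every L in `levels` is odd, so codes lie in -(L-1)/2 .. (L-1)/2.
    """
    return [round(math.tanh(v) * (L - 1) / 2) for v, L in zip(z, levels)]

def fsq_index(codes, levels):
    """Mixed-radix index of the per-dimension codes in the implicit codebook."""
    index = 0
    for code, L in zip(codes, levels):
        index = index * L + (code + (L - 1) // 2)
    return index

def lfq_index(z):
    """Binarize each dimension at zero and read the bits as an integer."""
    return sum(1 << i for i, v in enumerate(z) if v > 0)

levels = [5, 5, 5]                       # implicit codebook of 5**3 = 125 entries
codes = fsq_quantize([2.0, 0.0, -0.4], levels)
print(codes, fsq_index(codes, levels))   # [2, 0, -1] 111
print(lfq_index([0.3, -1.2, 0.8]))       # 5
```

Because the levels are fixed, every one of the $L^d$ combinations is reachable by construction and there is no embedding table that could go unused.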

ELI5: VQ-VAE explained simply

Imagine you want to describe a picture using a fixed box of crayons. A normal autoencoder lets you mix any color you like (a continuous space). VQ-VAE forces you to pick the single closest crayon from your box for each patch of the picture (a discrete codebook). You then write down only the list of crayon numbers, which is a short, tidy code. Because that code is just a list of numbers, a second model can learn the patterns in those lists and invent brand-new pictures by writing down a new list of crayon numbers and coloring it in. The clever trick that makes this trainable is to pretend, during the math that adjusts the network, that snapping to the nearest crayon did nothing, so the learning signal can still flow back to the part that chose the colors.

References

[1] van den Oord, A., Vinyals, O., and Kavukcuoglu, K. "Neural Discrete Representation Learning." NeurIPS 2017. arXiv:1711.00937. https://arxiv.org/abs/1711.00937
[2] Razavi, A., van den Oord, A., and Vinyals, O. "Generating Diverse High-Fidelity Images with VQ-VAE-2." NeurIPS 2019. arXiv:1906.00446. https://arxiv.org/abs/1906.00446
[3] Ramesh, A., et al. "Zero-Shot Text-to-Image Generation." (DALL-E). ICML 2021. arXiv:2102.12092. https://arxiv.org/abs/2102.12092
[4] Esser, P., Rombach, R., and Ommer, B. "Taming Transformers for High-Resolution Image Synthesis." (VQGAN). CVPR 2021. arXiv:2012.09841. https://arxiv.org/abs/2012.09841
[5] Dhariwal, P., Jun, H., Payne, C., Kim, J. W., Radford, A., and Sutskever, I. "Jukebox: A Generative Model for Music." 2020. arXiv:2005.00341. https://arxiv.org/abs/2005.00341
[6] Zeghidour, N., Luebs, A., Omran, A., Skoglund, J., and Tagliasacchi, M. "SoundStream: An End-to-End Neural Audio Codec." 2021. arXiv:2107.03312. https://arxiv.org/abs/2107.03312
[7] Defossez, A., Copet, J., Synnaeve, G., and Adi, Y. "High Fidelity Neural Audio Compression." (EnCodec). 2022. arXiv:2210.13438. https://arxiv.org/abs/2210.13438
[8] Mentzer, F., Minnen, D., Agustsson, E., and Tschannen, M. "Finite Scalar Quantization: VQ-VAE Made Simple." ICLR 2024. arXiv:2309.15505. https://arxiv.org/abs/2309.15505
[9] Yu, L., et al. "Language Model Beats Diffusion: Tokenizer is Key to Visual Generation." (MAGVIT-v2). ICLR 2024. arXiv:2310.05737. https://arxiv.org/abs/2310.05737
