VQGAN (Taming Transformers)
Last reviewed
Jun 8, 2026
Sources
8 citations
Review status
Source-backed
Revision
v1 · 1,905 words
VQGAN, short for Vector Quantized Generative Adversarial Network, is an image-synthesis method introduced in the 2020 paper "Taming Transformers for High-Resolution Image Synthesis" by Patrick Esser, Robin Rombach, and Bjorn Ommer of the IWR at Heidelberg University (the group later known as CompVis) [1][2]. The work was presented as an oral at CVPR 2021 [2][3].
The method has two stages. First, a convolutional autoencoder with a learned discrete codebook (in the spirit of VQ-VAE) is trained to compress an image into a small grid of integer code indices and to reconstruct it. Unlike VQ-VAE, this autoencoder is trained with a patch-based adversarial discriminator (the GAN part) and a perceptual reconstruction loss rather than a pure pixel-space L2 loss, which lets the codebook capture rich, perceptually meaningful image parts at a high compression rate. Second, the resulting sequence of code indices is modeled with an autoregressive Transformer (a GPT-2-style network), so that new images can be generated by sampling code sequences and decoding them back to pixels [1][3].
By moving the Transformer from raw pixels to a short sequence of discrete latent codes, VQGAN made it computationally feasible to apply attention-based models to image generation, including conditional and megapixel synthesis. It became one of the most influential generative-vision papers of its era: it popularized the discrete image-tokenizer paradigm, powered the VQGAN+CLIP art movement of 2021, and its adversarial autoencoder was the direct precursor to the latent diffusion models that became Stable Diffusion [3][6][7].
The paper frames a tension between two model families with complementary strengths. Transformers are highly expressive and contain no built-in locality assumptions, which lets them learn long-range relationships, but their self-attention has a computational and memory cost that grows quadratically with sequence length. Applied directly to pixels, even a modest image becomes an intractably long sequence, so pixel-level autoregressive Transformers were limited to small, low-resolution images [1][3].
Convolutional neural networks have the opposite profile. Their built-in inductive biases, locality and translation equivariance through shared kernels, make them efficient and well suited to the strong local statistical regularities of natural images, but those same biases can limit their ability to model global composition [1].
The central idea is to use each family where it is strongest: let a CNN learn a "context-rich vocabulary" of image constituents, encoded as a compact set of discrete tokens, and let a Transformer model the global composition of those tokens. Because the Transformer operates on a sequence orders of magnitude shorter than the pixel count, its quadratic cost becomes manageable while it retains its expressivity over the learned codes [1][3].
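The size of that saving can be checked with a little arithmetic. The sketch below is illustrative only: the 256×256 image size and the factor f = 16 are assumed example values, and the helper names are hypothetical. It counts query-key pairs as a stand-in for the quadratic attention cost.

```python
def attention_pairs(seq_len: int) -> int:
    """Number of query-key pairs in full self-attention over seq_len tokens."""
    return seq_len * seq_len


def token_grid_len(height: int, width: int, f: int) -> int:
    """Sequence length after downsampling each spatial side by a factor f."""
    return (height // f) * (width // f)


# Assumed example values, not figures quoted from the paper.
H = W = 256
pixel_len = H * W                       # one position per pixel, channels ignored
code_len = token_grid_len(H, W, f=16)   # a 16 x 16 grid of code indices

ratio = attention_pairs(pixel_len) // attention_pairs(code_len)

print(pixel_len, code_len)  # 65536 256
print(ratio)                # 65536
```

Under these assumptions the token sequence is 256 times shorter than the pixel sequence, and the number of attention pairs drops by a factor of 65,536.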
The first stage learns a discrete representation of images. A convolutional encoder maps an input image to a grid of continuous feature vectors. Each vector is then quantized by snapping it to its nearest entry in a learned codebook of vectors, so the image is represented by a grid of integer indices into that codebook. A convolutional decoder reconstructs the image from the quantized grid. The encoder, decoder, and codebook are trained jointly, with the standard VQ-VAE machinery: a reconstruction term, a codebook (dictionary) loss that pulls codebook entries toward the encoder outputs, and a commitment loss that keeps encoder outputs close to their chosen codes, with gradients passed through the non-differentiable quantization step via the straight-through estimator [1].
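The quantization step can be sketched in a few lines of plain Python. This is a gradient-free illustration under stated assumptions, not the authors' implementation: the function names, the tiny two-dimensional codebook, and the example vectors are all hypothetical. It shows only the nearest-neighbor lookup and the value of the squared distance between encoder outputs and their chosen codes.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(
    z_e: List[List[Vector]], codebook: List[Vector]
) -> Tuple[List[List[int]], List[List[Vector]], float]:
    """Snap each encoder vector to its nearest codebook entry.

    Returns the grid of indices, the quantized grid, and the mean squared
    distance between encoder outputs and their chosen codes.
    """
    indices, z_q, total, count = [], [], 0.0, 0
    for row in z_e:
        idx_row, q_row = [], []
        for v in row:
            k = min(range(len(codebook)), key=lambda i: sq_dist(v, codebook[i]))
            idx_row.append(k)
            q_row.append(codebook[k])
            total += sq_dist(v, codebook[k])
            count += 1
        indices.append(idx_row)
        z_q.append(q_row)
    return indices, z_q, total / count


# Hypothetical toy data: a 3-entry codebook and a 2 x 2 grid of encoder outputs.
codebook = [(0.0, 0.0), (1.0, 1.0), (-1.0, 1.0)]
z_e = [[(0.1, -0.2), (0.9, 1.2)], [(-0.8, 0.7), (0.4, 0.4)]]

indices, z_q, vq_term = quantize(z_e, codebook)
print(indices)  # [[0, 1], [2, 0]]
```

In a real training setup the codebook loss and the commitment loss share this numeric value and differ only in where the gradient is stopped (encoder side versus codebook side), and the straight-through estimator copies the decoder's gradient from the quantized vectors back to the encoder outputs. None of that is modeled here, since the sketch has no automatic differentiation.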
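VQGAN's key departure from VQ-VAE is the training objective for this autoencoder. Instead of relying on a pixel-wise L2 reconstruction loss, which tends to produce blurry outputs and so limits how aggressively images can be compressed, VQGAN adds two ingredients [1]: a perceptual (LPIPS) reconstruction loss, which compares images in the feature space of a pretrained network rather than pixel by pixel, and a patch-based adversarial loss, in which a discriminator judges whether local image patches look real or reconstructed.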
The combination lets the model use a much more aggressive spatial compression while still reconstructing perceptually convincing images. The paper studies downsampling factors f of 8, 16, and even larger (so each token stands in for an f × f block of pixels), choosing the factor according to whether fine detail or broad layout matters most for the data [1][3]. The result is a compact, perceptually rich codebook of context-rich visual parts, which is what makes the second stage tractable. To keep the two objectives balanced during training, the adversarial term is scaled by an adaptive weight computed from the relative gradient magnitudes of the perceptual reconstruction loss and the adversarial loss at the decoder's last layer [1].
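The adaptive weighting mentioned above can be written compactly. The formulation below follows the paper's notation as best recalled here and should be checked against [1]; the symbols are assumptions of this sketch: $\mathcal{L}_{\text{rec}}$ is the perceptual reconstruction loss, $\mathcal{L}_{\text{GAN}}$ the adversarial loss, $\mathcal{L}_{\text{VQ}}$ the reconstruction-plus-codebook objective, $\nabla_{G_L}[\cdot]$ the gradient with respect to the last decoder layer, and $\delta$ a small constant for numerical stability.

```latex
\lambda \;=\; \frac{\nabla_{G_L}\!\left[\mathcal{L}_{\text{rec}}\right]}
                   {\nabla_{G_L}\!\left[\mathcal{L}_{\text{GAN}}\right] + \delta},
\qquad
\mathcal{L}_{\text{total}} \;=\; \mathcal{L}_{\text{VQ}} \;+\; \lambda\,\mathcal{L}_{\text{GAN}}
```

Read this way, the weight shrinks when the adversarial gradient is large relative to the reconstruction gradient and grows when it is small, so neither term dominates the decoder update.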
With the codebook fixed, an image becomes a grid of code indices, which is flattened (in raster-scan order) into a one-dimensional sequence of tokens. A GPT-2-style autoregressive Transformer is trained to predict each token given all previous tokens, learning the distribution over plausible code sequences. New images are produced by autoregressively sampling a sequence of indices and decoding it with the stage-1 decoder [1][3].
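The flattening and sampling loop can be illustrated with a toy stand-in for the Transformer. Everything named here is hypothetical: `toy_next_probs` is a hand-written distribution over a 4-entry vocabulary, not a trained model, and the sketch stops before the decoding step because it has no stage-1 decoder.

```python
import random
from typing import Callable, List


def flatten_raster(grid: List[List[int]]) -> List[int]:
    """Row-major (raster-scan) flattening of a 2D grid of code indices."""
    return [idx for row in grid for idx in row]


def unflatten_raster(seq: List[int], width: int) -> List[List[int]]:
    """Inverse of flatten_raster for a grid with the given row width."""
    return [seq[i:i + width] for i in range(0, len(seq), width)]


def sample_sequence(
    next_probs: Callable[[List[int]], List[float]],
    length: int,
    rng: random.Random,
) -> List[int]:
    """Draw tokens one at a time, each conditioned on the prefix so far."""
    seq: List[int] = []
    for _ in range(length):
        probs = next_probs(seq)
        token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        seq.append(token)
    return seq


def toy_next_probs(prefix: List[int]) -> List[float]:
    """Hypothetical stand-in for the Transformer: favours repeating the last token."""
    vocab = 4
    if not prefix:
        return [1.0 / vocab] * vocab
    return [0.7 if k == prefix[-1] else 0.1 for k in range(vocab)]


grid = [[3, 1], [0, 2]]
tokens = flatten_raster(grid)
print(tokens)  # [3, 1, 0, 2]

sampled = sample_sequence(toy_next_probs, length=4, rng=random.Random(0))
sampled_grid = unflatten_raster(sampled, width=2)  # would be fed to the decoder
```

In the actual method the sampled grid of indices is looked up in the codebook and passed through the stage-1 decoder to produce pixels.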
Because the representation is a sequence of tokens, conditioning is natural: the model performs conditional synthesis by prepending conditioning information, such as a class label or a spatially encoded condition like a semantic segmentation map or depth, as additional tokens that the Transformer attends to. This supports class-conditional generation as well as semantically guided, layout-to-image synthesis [1][3].
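A minimal sketch of the prepending step follows. It rests on two assumptions that are not stated in the text above: conditioning ids are shifted past the image codebook so the two id ranges do not collide, and the training loss is computed only on image positions. The function name and the codebook size of 1024 are hypothetical.

```python
from typing import List, Tuple


def build_conditional_sequence(
    cond_ids: List[int], image_tokens: List[int], codebook_size: int
) -> Tuple[List[int], List[bool]]:
    """Prepend conditioning tokens to the image tokens.

    Conditioning ids are offset by codebook_size (an assumption of this
    sketch) so they occupy a separate range of the vocabulary. The returned
    mask marks the positions whose prediction would be scored in the loss.
    """
    cond_tokens = [codebook_size + c for c in cond_ids]
    sequence = cond_tokens + list(image_tokens)
    loss_mask = [False] * len(cond_tokens) + [True] * len(image_tokens)
    return sequence, loss_mask


# Class-conditional example: one class token followed by four image tokens.
sequence, loss_mask = build_conditional_sequence(
    cond_ids=[7], image_tokens=[5, 900, 12, 3], codebook_size=1024
)
print(sequence)   # [1031, 5, 900, 12, 3]
print(loss_mask)  # [False, True, True, True, True]
```

A spatial condition such as a segmentation map would work the same way in this sketch, with `cond_ids` holding the flattened indices of the encoded condition instead of a single class id.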
Even on the compressed token grid, a full megapixel image yields a sequence too long to attend over globally. VQGAN handles this with a sliding-window (patch-wise) attention scheme: the image is generated patch by patch, and when predicting tokens in a given region the Transformer attends only to a fixed-size local window of surrounding tokens rather than the entire grid. Because the underlying data is approximately stationary (translation invariant) and each token already aggregates a large receptive field of pixels, this local attention is sufficient to maintain coherence while keeping cost bounded. This is what enables synthesis at megapixel resolutions, including semantically guided landscapes, well beyond what pixel-level Transformers could reach [1][3].
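The windowing logic can be sketched as index arithmetic. The details below are assumptions made for illustration: a 16×16 window on a 64×64 token grid, a window centred on the current position where possible and clamped at the borders, and a context consisting of the in-window positions that come earlier in raster order. The function names are hypothetical.

```python
from typing import List, Tuple


def window_bounds(pos: int, size: int, window: int) -> Tuple[int, int]:
    """Start and end (exclusive) of a window containing pos, clamped to [0, size)."""
    window = min(window, size)
    start = pos - window // 2
    start = max(0, min(start, size - window))
    return start, start + window


def context_for(
    row: int, col: int, height: int, width: int, window: int
) -> List[Tuple[int, int]]:
    """In-window positions that precede (row, col) in raster-scan order."""
    r0, r1 = window_bounds(row, height, window)
    c0, c1 = window_bounds(col, width, window)
    return [
        (r, c)
        for r in range(r0, r1)
        for c in range(c0, c1)
        if (r, c) < (row, col)  # tuple comparison matches raster order
    ]


print(window_bounds(0, 64, 16))   # (0, 16)
print(window_bounds(32, 64, 16))  # (24, 40)
print(window_bounds(63, 64, 16))  # (48, 64)

local_context = context_for(32, 32, height=64, width=64, window=16)
print(len(local_context))         # 136
```

In this example the token at position (32, 32) attends to 136 earlier tokens inside its window, compared with the 2,080 earlier tokens it would see with attention over the whole 64×64 grid, and the context never exceeds the window area however large the grid grows.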
The paper reported strong results across unconditional, class-conditional, and conditional (layout-guided) settings. On class-conditional ImageNet generation, VQGAN with a large autoregressive Transformer (a roughly 1.4-billion-parameter model in the released configuration) achieved state-of-the-art Frechet Inception Distance (FID) among autoregressive image models and was reported to outperform BigGAN on this benchmark [1][3]. The authors also presented what they described as the first results on semantically guided synthesis of megapixel images with Transformers, generating high-resolution images conditioned on semantic layouts, depth maps, and other structured inputs [1][3].
A recurring finding is that the learned discrete codebook is the decisive ingredient: ablations showed that the convolutional VQGAN representation substantially outperformed a comparable pixel-based or weaker (for example, VQVAE-style) codebook when paired with the same Transformer, both in reconstruction quality and in downstream generative likelihood, and that the approach scaled favorably with a transformer prior relative to convolutional autoregressive baselines such as PixelSNAIL [1].
| Aspect | Choice in VQGAN |
|---|---|
| Stage-1 model | Convolutional encoder/decoder with a learned discrete codebook |
| Quantization | Nearest-neighbor lookup into the codebook; straight-through gradients |
| Stage-1 losses | Perceptual (LPIPS) + patch-based adversarial + codebook/commitment |
| Compression | Downsampling factor f of 8, 16, or higher |
| Stage-2 model | GPT-2-style autoregressive Transformer over code indices |
| Conditioning | Prepended tokens (class label, segmentation, depth) |
| High resolution | Sliding-window (local) attention, patch-wise generation |
| Headline result | SOTA FID among autoregressive models on class-conditional ImageNet |
VQGAN became one of the most consequential generative-vision works of the early 2020s, in three distinct lineages.
VQGAN+CLIP and AI art. In 2021, artists and researchers connected VQGAN's image generator to OpenAI's CLIP, which scores how well an image matches a text prompt. Katherine Crowson, building on Ryan Murdoch's earlier BigGAN-plus-CLIP "Big Sleep," wrote a widely shared Google Colab notebook that optimized a VQGAN latent so that the decoded image maximized CLIP's text-image similarity for a user prompt. The resulting "VQGAN+CLIP" became one of the first broadly accessible text-to-image tools and a defining technique of the 2021 generative-art wave, before diffusion-based systems superseded it [4][5].
Latent diffusion and Stable Diffusion. The same Heidelberg/CompVis group built directly on the VQGAN autoencoder to create latent diffusion. In "High-Resolution Image Synthesis with Latent Diffusion Models" (Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bjorn Ommer, CVPR 2022), they reused a VQGAN-style perceptual-plus-adversarial autoencoder to compress images into a compact latent space and then ran a diffusion model in that latent space rather than an autoregressive Transformer. This made high-resolution text-to-image generation efficient and became the basis of Stable Diffusion, released in 2022 [6][7]. In this sense VQGAN is the architectural ancestor of latent diffusion: the autoencoder is essentially the same idea, with diffusion swapped in for the autoregressive prior.
Discrete image tokens and autoregressive generation. VQGAN cemented the paradigm of treating images as sequences of discrete tokens produced by a learned tokenizer, which a sequence model then generates. This tokenizer-plus-sequence-model recipe underlies a broad class of later systems: autoregressive token models such as the original DALL-E and Google's Parti, the masked-token (non-autoregressive) approach of MUSE, and later refinements of visual autoregressive modeling such as VAR. Google's own "Vector-quantized Image Modeling with Improved VQGAN" (ViT-VQGAN, 2021) replaced the convolutional backbone with a Vision Transformer and improved the codebook, underscoring how central the VQGAN tokenizer had become to autoregressive image generation [3][8].
VQGAN is best understood as VQ-VAE with a better-trained autoencoder. The Vector Quantized Variational Autoencoder, introduced by Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu in 2017, established the core machinery: a convolutional encoder, a learned discrete codebook with nearest-neighbor quantization, straight-through gradients, and a convolutional decoder, with an autoregressive prior (originally PixelCNN-style) learned over the discrete codes in a second stage [1].
VQGAN keeps this two-stage structure but changes two things. First, the stage-1 objective: instead of a pixel-space reconstruction loss, VQGAN trains the autoencoder with a perceptual (LPIPS) loss and a patch-based adversarial loss, which yields sharper reconstructions at much higher compression and a perceptually richer codebook. Second, the stage-2 prior: VQGAN replaces the convolutional autoregressive prior with a powerful GPT-2-style Transformer, and adds sliding-window attention to reach high resolutions [1][3]. Those two changes (an adversarially trained tokenizer and a Transformer prior) are precisely what let the discrete-codebook approach scale from the modest images of VQ-VAE to high-resolution, semantically controllable synthesis, and what made the codebook idea durable enough to seed both the autoregressive image-token line and the latent-diffusion line that followed.