VQGAN (Taming Transformers)

Updated Jun 25, 2026

VQGAN (Vector Quantized Generative Adversarial Network) is a two-stage image-synthesis method that first compresses an image into a small grid of discrete codebook tokens with an adversarially trained autoencoder, then models those tokens with an autoregressive Transformer to generate new images ^[1]^[3]. It was introduced in the December 2020 paper "Taming Transformers for High-Resolution Image Synthesis" by Patrick Esser, Robin Rombach, and Bjorn Ommer of Heidelberg University (the CompVis group), presented as an oral at CVPR 2021 ^[1]^[2]^[3]. VQGAN's central innovation is replacing the pixel-wise reconstruction loss of VQ-VAE with a perceptual plus patch-based adversarial loss, which yields a far more aggressive, perceptually rich compression and made it practical to apply attention-based models to high-resolution image generation. The same adversarial autoencoder became the direct architectural precursor to latent diffusion and Stable Diffusion ^[6]^[7].

What is VQGAN?

VQGAN, short for Vector Quantized Generative Adversarial Network, is an image-synthesis method introduced in the 2020 paper "Taming Transformers for High-Resolution Image Synthesis" by Patrick Esser, Robin Rombach, and Bjorn Ommer of the IWR at Heidelberg University (the group later known as CompVis) ^[1]^[2]. The work was presented as an oral at CVPR 2021 ^[2]^[3].

The method has two stages. First, a convolutional autoencoder with a learned discrete codebook (in the spirit of VQ-VAE) is trained to compress an image into a small grid of integer code indices and to reconstruct it. Unlike VQ-VAE, this autoencoder is trained with a patch-based adversarial discriminator (the GAN part) and a perceptual reconstruction loss rather than a pure pixel-space L2 loss, which lets the codebook capture rich, perceptually meaningful image parts at a high compression rate. Second, the resulting sequence of code indices is modeled with an autoregressive Transformer (a GPT-2-style network), so that new images can be generated by sampling code sequences and decoding them back to pixels ^[1]^[3].

The paper's abstract summarizes the recipe directly: the authors show how to "(i) use CNNs to learn a context-rich vocabulary of image constituents, and in turn (ii) utilize transformers to efficiently model their composition within high-resolution images" ^[1]. By moving the Transformer off of raw pixels and onto a short sequence of discrete latent codes, VQGAN made it computationally feasible to apply attention-based models to image generation, including conditional and megapixel synthesis. It became one of the most influential generative-vision papers of its era: it popularized the discrete image-tokenizer paradigm, powered the VQGAN+CLIP art movement of 2021, and its adversarial autoencoder was the direct precursor to the latent diffusion models that became Stable Diffusion ^[3]^[6]^[7].

Why combine CNNs and Transformers?

The paper frames a complementary tension between two model families. Transformers are highly expressive and contain no built-in locality assumptions, which lets them learn long-range relationships, but their self-attention has a computational and memory cost that grows quadratically with sequence length. Applied directly to pixels, even a modest image becomes an intractably long sequence, so pixel-level autoregressive Transformers were limited to small, low-resolution images ^[1]^[3]. As the authors put it, transformers "contain no inductive bias that prioritizes local interactions," which "makes them expressive, but also computationally infeasible for long sequences, such as high-resolution images" ^[1].

Convolutional neural networks have the opposite profile. Their built-in inductive biases, locality and translation equivariance through shared kernels, make them efficient and well suited to the strong local statistical regularities of natural images, but those same biases can limit their ability to model global composition ^[1].

The central idea is to use each family where it is strongest: let a CNN learn a "context-rich vocabulary" of image constituents, encoded as a compact set of discrete tokens, and let a Transformer model the global composition of those tokens ^[1]. Because the Transformer operates on a sequence orders of magnitude shorter than the pixel count, its quadratic cost becomes manageable while it retains its expressivity over the learned codes ^[1]^[3].
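The size of that saving can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes a 256x256 image and a downsampling factor of 16 (one of the configurations the article discusses), treats each pixel as one sequence position, and uses the hypothetical helper names `attention_pairs` and `token_count`.

```python
def attention_pairs(sequence_length: int) -> int:
    """Query-key pairs that full self-attention scores for one head."""
    return sequence_length * sequence_length


def token_count(height: int, width: int, downsample: int) -> int:
    """Tokens in the latent grid when each side shrinks by `downsample`."""
    return (height // downsample) * (width // downsample)


height = width = 256
pixel_positions = height * width                 # one position per pixel
latent_tokens = token_count(height, width, 16)   # assumed factor f = 16

print(pixel_positions, attention_pairs(pixel_positions))  # 65536 4294967296
print(latent_tokens, attention_pairs(latent_tokens))      # 256 65536

# How many times fewer attention pairs the token sequence needs.
print(attention_pairs(pixel_positions) // attention_pairs(latent_tokens))  # 65536
```

Under these assumptions the sequence is 256 times shorter, and because attention cost grows with the square of the length, the number of scored pairs drops by a factor of 65,536.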

How does VQGAN work?

Stage 1: the adversarial discrete codebook

The first stage learns a discrete representation of images. A convolutional encoder maps an input image to a grid of continuous feature vectors. Each vector is then quantized by snapping it to its nearest entry in a learned codebook of vectors, so the image is represented by a grid of integer indices into that codebook. A convolutional decoder reconstructs the image from the quantized grid. The encoder, decoder, and codebook are trained jointly, with the standard VQ-VAE machinery: a reconstruction term, a codebook (dictionary) loss that pulls codebook entries toward the encoder outputs, and a commitment loss that keeps encoder outputs close to their chosen codes, with gradients passed through the non-differentiable quantization step via the straight-through estimator ^[1].
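The quantization step can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function names, the tiny three-entry codebook, and the commitment weight `beta = 0.25` are all assumptions chosen for the example. Only forward values are computed, so the straight-through estimator, which concerns how gradients flow, appears only as a comment.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(vector, codebook):
    """Return (index, code) of the codebook entry nearest to `vector`."""
    index = min(range(len(codebook)),
                key=lambda k: squared_distance(vector, codebook[k]))
    return index, codebook[index]


def vq_losses(encoder_grid, codebook, beta=0.25):
    """Quantize a 2-D grid of feature vectors and compute the VQ loss values.

    The codebook loss and the commitment loss have the same forward value
    (up to `beta`); they differ only in which side a stop-gradient freezes.
    In an autograd framework the decoder would receive
    z + stop_gradient(z_q - z), the straight-through estimator.
    """
    indices, total, count = [], 0.0, 0
    for row in encoder_grid:
        index_row = []
        for vector in row:
            index, code = quantize(vector, codebook)
            index_row.append(index)
            total += squared_distance(vector, code)
            count += 1
        indices.append(index_row)
    mean_sq = total / count
    return indices, mean_sq, beta * mean_sq


# Hypothetical 3-entry codebook of 2-D vectors and a 2x2 grid of encoder outputs.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
encoder_grid = [[[0.1, -0.1], [0.9, 1.2]],
                [[-0.8, 0.7], [0.4, 0.4]]]

indices, codebook_loss, commitment_loss = vq_losses(encoder_grid, codebook)
print(indices)  # [[0, 1], [2, 0]]
print(codebook_loss, commitment_loss)
```

The grid of integers in `indices` is what the second stage sees; the continuous vectors are discarded once the nearest codes have been chosen.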

VQGAN's key departure from VQ-VAE is the training objective for this autoencoder. Instead of relying on a pixel-wise L2 reconstruction loss, which tends to produce blurry outputs and forces a low compression rate, VQGAN adds two ingredients ^[1]:

A perceptual loss measured in the feature space of a pretrained network (LPIPS), which rewards perceptual similarity rather than exact per-pixel match.
A patch-based adversarial loss: a convolutional discriminator judges local image patches as real or reconstructed, pushing the decoder to produce realistic textures and sharp detail.

The combination lets the model use a much more aggressive spatial compression while still reconstructing perceptually convincing images. The paper studies downsampling factors f of 8, 16, and larger, where each token stands in for an f x f block of pixels, and chooses the factor according to whether fine detail or broad layout matters most for the data ^[1]^[3]. As a concrete example, a widely used configuration pairs a downsampling factor of f = 16 with a codebook of 16,384 entries, so a 256x256 image is encoded as just 256 tokens (a 16x16 grid) and a 512x512 image as 1,024 tokens (a 32x32 grid) ^[3]^[9]. The result is a compact codebook of context-rich visual parts, which is what makes the second stage tractable. To keep training stable, the adversarial term is scaled by an adaptive weight that balances its gradient magnitude against that of the perceptual reconstruction term ^[1].
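Written out, the stage-1 objective described above might look as follows. This is a sketch in commonly used VQ-VAE notation rather than a transcription of the paper: E, G, and Z stand for the encoder, decoder, and codebook, z_q for the quantized latent, sg[.] for the stop-gradient operator, beta for a commitment weight, G_L for the last decoder layer, and delta for a small constant added for numerical stability. All of these symbols are introduced here for illustration.

```latex
\begin{aligned}
\mathcal{L}_{\mathrm{VQ}}(E, G, \mathcal{Z})
  &= \mathcal{L}_{\mathrm{rec}}(x, \hat{x})
   + \lVert \mathrm{sg}[E(x)] - z_q \rVert_2^2
   + \beta \, \lVert \mathrm{sg}[z_q] - E(x) \rVert_2^2 \\
\lambda
  &= \frac{\lVert \nabla_{G_L} \mathcal{L}_{\mathrm{rec}} \rVert}
          {\lVert \nabla_{G_L} \mathcal{L}_{\mathrm{GAN}} \rVert + \delta} \\
\mathcal{L}_{\mathrm{total}}
  &= \mathcal{L}_{\mathrm{VQ}} + \lambda \, \mathcal{L}_{\mathrm{GAN}}
\end{aligned}
```

Here the reconstruction term is the perceptual loss, the second and third terms are the codebook and commitment losses, and the ratio defining lambda rescales the adversarial gradients at the decoder output to roughly the magnitude of the reconstruction gradients.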

Stage 2: the autoregressive Transformer prior

With the codebook fixed, an image becomes a grid of code indices, which is flattened (in raster-scan order) into a one-dimensional sequence of tokens. A GPT-2-style autoregressive Transformer is trained to predict each token given all previous tokens, learning the distribution over plausible code sequences. New images are produced by autoregressively sampling a sequence of indices and decoding it with the stage-1 decoder ^[1]^[3].
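The flattening and the resulting training targets can be illustrated with a short sketch. The helper names and the tiny 2x3 grid are hypothetical, and the empty first context would in practice be filled by a start or conditioning token.

```python
def flatten_raster(grid):
    """Flatten a 2-D grid of code indices row by row (raster-scan order)."""
    return [index for row in grid for index in row]


def unflatten_raster(sequence, width):
    """Rebuild the 2-D grid from a raster-ordered sequence."""
    if len(sequence) % width != 0:
        raise ValueError("sequence length must be a multiple of the grid width")
    return [sequence[start:start + width]
            for start in range(0, len(sequence), width)]


def next_token_pairs(sequence):
    """(context, target) pairs that an autoregressive model is trained on."""
    return [(sequence[:t], sequence[t]) for t in range(len(sequence))]


grid = [[5, 9, 2],
        [7, 7, 1]]          # hypothetical 2x3 grid of code indices

tokens = flatten_raster(grid)
print(tokens)                            # [5, 9, 2, 7, 7, 1]
print(next_token_pairs(tokens)[3])       # ([5, 9, 2], 7)
print(unflatten_raster(tokens, width=3)) # [[5, 9, 2], [7, 7, 1]]
```

At sampling time the same ordering is used in reverse: the generated sequence is reshaped into a grid before the stage-1 decoder turns it back into pixels.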

Because the representation is a sequence of tokens, conditioning is natural: the model performs conditional synthesis by prepending conditioning information, such as a class label or a spatially encoded condition like a semantic segmentation map or depth, as additional tokens that the Transformer attends to. This supports class-conditional generation as well as semantically guided, layout-to-image synthesis ^[1]^[3].
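A conditional sampling loop along these lines could be sketched as below. Everything here is an illustrative assumption: `toy_next_token_probs` is a hypothetical stand-in for a trained Transformer, the codebook has only 8 entries, and conditioning and image tokens share one toy vocabulary, whereas a real system would typically keep them separate.

```python
import random


def toy_next_token_probs(sequence, codebook_size=8):
    """Hypothetical stand-in for a trained model: favors repeating the last token."""
    weights = [1.0] * codebook_size
    if sequence:
        weights[sequence[-1] % codebook_size] += 4.0
    total = sum(weights)
    return [w / total for w in weights]


def sample_conditional(next_token_probs, cond_tokens, num_image_tokens, rng):
    """Prepend conditioning tokens, then sample image tokens one at a time."""
    sequence = list(cond_tokens)
    for _ in range(num_image_tokens):
        probs = next_token_probs(sequence)   # model sees condition + tokens so far
        token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        sequence.append(token)
    # Only the image tokens are handed to the stage-1 decoder.
    return sequence[len(cond_tokens):]


rng = random.Random(0)
class_token = [3]                            # hypothetical class-label token
image_tokens = sample_conditional(toy_next_token_probs, class_token, 16, rng)
print(len(image_tokens))                     # 16, i.e. a 4x4 token grid
```

A spatial condition such as a segmentation map would be handled the same way, except that the prefix would itself be a sequence of tokens rather than a single label.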

How does VQGAN generate megapixel images?

Even on the compressed token grid, a full megapixel image yields a sequence too long to attend over globally. VQGAN handles this with a sliding-window (patch-wise) attention scheme: the image is generated patch by patch, and when predicting a token the Transformer attends only to a fixed-size local window of surrounding tokens rather than the entire grid. Because the underlying data is approximately stationary (translation invariant) and each token already aggregates a large receptive field of pixels, this local attention is sufficient to maintain coherence while keeping cost bounded. This is what enables synthesis in the megapixel range, including semantically guided landscapes, well beyond what pixel-level Transformers could reach. The authors report "the first results on semantically-guided synthesis of megapixel images with transformers" ^[1]^[3].
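One way to picture the sliding window is to compute, for each token position, which already-generated tokens fall inside its window. The sketch below is an assumption-laden illustration rather than the released sampling code: it uses a 16x16 window, a window roughly centered on the target and clamped at the borders, and a 64x64 token grid, which would correspond to a 1024x1024 image at a downsampling factor of 16.

```python
def window_start(position, grid_size, window):
    """Start of a `window`-wide span roughly centered on `position`, clamped to the grid."""
    if grid_size <= window:
        return 0
    start = position - window // 2
    return max(0, min(start, grid_size - window))


def local_context(row, col, height, width, window=16):
    """Window bounds and the earlier (raster-order) positions visible at (row, col)."""
    top = window_start(row, height, window)
    left = window_start(col, width, window)
    bottom = min(top + window, height)
    right = min(left + window, width)
    context = [(r, c)
               for r in range(top, bottom)
               for c in range(left, right)
               if (r, c) < (row, col)]       # tuple order equals raster order
    return (top, left, bottom, right), context


bounds, context = local_context(32, 32, height=64, width=64)
print(bounds)        # (24, 24, 40, 40)
print(len(context))  # 136

bounds, context = local_context(63, 63, height=64, width=64)
print(bounds)        # (48, 48, 64, 64)
print(len(context))  # 255
```

However large the grid becomes, the context never exceeds 255 tokens in this setup, which is why the per-token cost stays bounded as resolution grows.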

What results did VQGAN achieve?

The paper reported strong results across unconditional, class-conditional, and conditional (layout-guided) settings. On class-conditional ImageNet generation, VQGAN with a large autoregressive Transformer (a roughly 1.4-billion-parameter model in the released configuration) achieved state-of-the-art Frechet Inception Distance (FID) among autoregressive image models and was reported to outperform BigGAN on this benchmark ^[1]^[3]^[4]. The released model is described by the authors as "a pretrained, 1.4B transformer model trained for class-conditional ImageNet synthesis, which obtains state-of-the-art FID scores among autoregressive approaches and outperforms BigGAN" ^[4]. The authors also presented what they described as the first results on semantically guided synthesis of megapixel images with Transformers, generating high-resolution images conditioned on semantic layouts, depth maps, and other structured inputs ^[1]^[3].

A recurring finding is that the learned discrete codebook is the decisive ingredient. In ablations, the convolutional VQGAN representation substantially outperformed a comparable pixel-based or weaker (for example, VQ-VAE-style) codebook when paired with the same Transformer, both in reconstruction quality and in downstream generative likelihood. The approach also scaled favorably with a Transformer prior relative to convolutional autoregressive baselines such as PixelSNAIL ^[1].

Aspect	Choice in VQGAN
Stage-1 model	Convolutional encoder/decoder with a learned discrete codebook
Quantization	Nearest-neighbor lookup into the codebook; straight-through gradients
Stage-1 losses	Perceptual (LPIPS) + patch-based adversarial + codebook/commitment
Compression	Downsampling factor f of 8, 16, or higher (e.g. f = 16, 16,384-entry codebook: 256x256 image to 256 tokens)
Stage-2 model	GPT-2-style autoregressive Transformer over code indices
Conditioning	Prepended tokens (class label, segmentation, depth)
High resolution	Sliding-window (local) attention, patch-wise generation
Headline result	SOTA FID among autoregressive models on class-conditional ImageNet, outperforming BigGAN

Why was VQGAN influential?

VQGAN became one of the most consequential generative-vision works of the early 2020s, in three distinct lineages.

VQGAN+CLIP and AI art. In 2021, artists and researchers connected VQGAN's image generator to OpenAI's CLIP, which scores how well an image matches a text prompt. Katherine Crowson (RiversHaveWings), building on Ryan Murdock's (advadnoun) earlier BigGAN-plus-CLIP "Big Sleep," wrote a widely shared Google Colab notebook that optimized a VQGAN latent so that the decoded image maximized CLIP's text-image similarity for a user prompt ^[4]^[5]. The resulting "VQGAN+CLIP" became one of the first broadly accessible text-to-image tools and a defining technique of the 2021 generative-art wave, before diffusion-based systems superseded it ^[4]^[5].

Latent diffusion and Stable Diffusion. The same Heidelberg/CompVis group built directly on the VQGAN autoencoder to create latent diffusion. In "High-Resolution Image Synthesis with Latent Diffusion Models" (Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bjorn Ommer, CVPR 2022), they reused a VQGAN-style perceptual-plus-adversarial autoencoder to compress images into a compact latent space and then ran a diffusion model in that latent space rather than an autoregressive Transformer. This made high-resolution text-to-image generation efficient and became the basis of Stable Diffusion, released in 2022 ^[6]^[7]. In this sense VQGAN is the architectural ancestor of latent diffusion: the autoencoder is essentially the same idea, with diffusion swapped in for the autoregressive prior.

Discrete image tokens and autoregressive generation. VQGAN cemented the paradigm of treating images as sequences of discrete tokens produced by a learned tokenizer, which a sequence model then generates. This tokenizer-plus-sequence-model recipe underlies a broad class of systems: autoregressive token models such as OpenAI's DALL-E and Google's Parti, the masked-token (non-autoregressive) approach of MUSE, and later refinements of visual autoregressive modeling such as VAR. Google's "Vector-quantized Image Modeling with Improved VQGAN" (ViT-VQGAN, 2021) replaced the convolutional backbone with a Vision Transformer and improved the codebook, underscoring how central the VQGAN tokenizer had become to autoregressive image generation ^[3]^[8].

How is VQGAN different from VQ-VAE?

VQGAN is best understood as VQ-VAE with a better-trained autoencoder. The Vector Quantized Variational Autoencoder, introduced by Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu in 2017, established the core machinery: a convolutional encoder, a learned discrete codebook with nearest-neighbor quantization, straight-through gradients, and a convolutional decoder, with an autoregressive prior (originally PixelCNN-style) learned over the discrete codes in a second stage ^[1].

VQGAN keeps this two-stage structure but changes two things. First, the stage-1 objective: instead of a pixel-space reconstruction loss, VQGAN trains the autoencoder with a perceptual (LPIPS) loss and a patch-based adversarial loss, which yields sharper reconstructions at much higher compression and a perceptually richer codebook. Second, the stage-2 prior: VQGAN replaces the convolutional autoregressive prior with a powerful GPT-2-style Transformer, and adds sliding-window attention to reach high resolutions ^[1]^[3]. Those two changes (an adversarially trained tokenizer and a Transformer prior) are precisely what let the discrete-codebook approach scale from the modest images of VQ-VAE to high-resolution, semantically controllable synthesis, and what made the codebook idea durable enough to seed both the autoregressive image-token line and the latent-diffusion line that followed.

Aspect	VQ-VAE (2017)	VQGAN (2020)
Stage-1 loss	Pixel-space L2 reconstruction	Perceptual (LPIPS) + patch-based adversarial
Reconstruction quality	Blurry, low compression	Sharp, aggressive compression
Stage-2 prior	Convolutional autoregressive (PixelCNN)	GPT-2-style autoregressive Transformer
High-resolution synthesis	Limited	Megapixel via sliding-window attention

References

1. Esser, P., Rombach, R., and Ommer, B. "Taming Transformers for High-Resolution Image Synthesis." arXiv:2012.09841, December 2020 (revised June 2021). https://arxiv.org/abs/2012.09841
2. CompVis. "Taming Transformers for High-Resolution Image Synthesis" (project page). https://compvis.github.io/taming-transformers/
3. Esser, P., Rombach, R., and Ommer, B. "Taming Transformers for High-Resolution Image Synthesis." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021 (Oral). https://openaccess.thecvf.com/content/CVPR2021/html/Esser_Taming_Transformers_for_High-Resolution_Image_Synthesis_CVPR_2021_paper.html
4. CompVis. "taming-transformers" (code repository). https://github.com/CompVis/taming-transformers
5. Steinbruck, A. "VQGAN+CLIP, How does it work?" 2021. https://alexasteinbruck.medium.com/vqgan-clip-how-does-it-work-210a5dca5e52
6. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. "High-Resolution Image Synthesis with Latent Diffusion Models." arXiv:2112.10752, December 2021; CVPR 2022. https://arxiv.org/abs/2112.10752
7. CompVis. "latent-diffusion" (code repository). https://github.com/CompVis/latent-diffusion
8. Yu, J., Li, X., Koh, J. Y., et al. "Vector-quantized Image Modeling with Improved VQGAN." arXiv:2110.04627, October 2021; ICLR 2022. https://arxiv.org/abs/2110.04627
9. flax-community. "vqgan_f16_16384" (model card). Hugging Face. https://huggingface.co/flax-community/vqgan_f16_16384
