MaskGIT

Overview

MaskGIT, short for Masked Generative Image Transformer, is an image-synthesis method introduced by Google Research in the 2022 paper "MaskGIT: Masked Generative Image Transformer" by Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T. Freeman ^[1]^[2]. The paper was first posted to arXiv on February 8, 2022, and was presented at the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2022 ^[1]^[3].

MaskGIT generates images over a grid of discrete visual tokens produced by a VQGAN-style tokenizer, rather than over raw pixels. Its central contribution is the decoding procedure: instead of producing the tokens one at a time in a fixed raster-scan order, as an autoregressive model does, MaskGIT uses a bidirectional Transformer trained with a masked-token-prediction objective and then decodes all tokens in parallel over a small number of refinement steps. At each step the model predicts every still-masked token at once, keeps only the predictions it is most confident about, and leaves the rest masked for the next step. A 256x256 image, which the tokenizer represents as 256 tokens, is produced in 8 steps rather than the 256 sequential steps an autoregressive model would need. The authors report a 30x to 64x decoding speedup over the autoregressive baseline, with image quality that matches or exceeds the autoregressive and GAN baselines of the time ^[1]^[2].

MaskGIT established the "masked generative" paradigm for visual generation. It was the direct basis for Google's Muse text-to-image model and the MAGVIT video generator, and it sits alongside diffusion models and autoregressive token models as one of the three main families of token-based and iterative image generators ^[4]^[5].

Background: slow autoregressive image generation

By 2021 a dominant recipe for generative vision was the two-stage tokenizer-plus-sequence-model approach popularized by VQ-VAE and VQGAN. A learned tokenizer compresses an image into a small grid of integer codes drawn from a fixed codebook, and a second-stage model learns the distribution over those code sequences; to sample a new image, code indices are generated and decoded back to pixels by the tokenizer's decoder ^[1].

In VQGAN and in other autoregressive token models such as DALL-E and Google's Parti, the second-stage model is an autoregressive Transformer. It flattens the two-dimensional token grid into a sequence in raster-scan order (left to right, top to bottom) and predicts each token conditioned only on the tokens before it. This design has two drawbacks that MaskGIT set out to fix ^[1]:

Speed. Tokens are produced strictly one at a time, so the number of forward passes equals the number of tokens: a 16x16 grid is 256 sequential steps, and a 32x32 grid for a 512x512 image is 1024 steps. Because the steps cannot be parallelized, generation is slow and scales poorly with resolution ^[1].
Unidirectional context. The raster ordering is arbitrary for a 2D image, and causal attention means a token can only see tokens that precede it in that scan. A token in the upper-left cannot condition on content generated later in the lower-right, even though images have no natural reading direction ^[1].

MaskGIT addresses both issues by replacing sequential, causal decoding with parallel, bidirectional, iterative decoding.

How MaskGIT works

Tokenizer

MaskGIT keeps the first stage of the VQGAN recipe essentially unchanged. A convolutional VQGAN-style autoencoder with a learned discrete codebook compresses an image by a fixed spatial factor of 16, so a 256x256 image becomes a 16x16 grid of 256 tokens and a 512x512 image becomes a 32x32 grid of 1024 tokens, each cell an index into the codebook. All of MaskGIT's modeling happens on these discrete tokens, and the tokenizer's decoder maps the final token grid back to a pixel image ^[1].
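
The token counts above follow directly from the downsampling factor. The sketch below reproduces that arithmetic; the function name `token_grid` is illustrative and does not come from the paper or its code.

```python
def token_grid(image_size: int, downsample: int = 16) -> tuple[int, int]:
    """Return (grid side length, total token count) for a square image."""
    if image_size % downsample != 0:
        raise ValueError("image size must be divisible by the downsampling factor")
    side = image_size // downsample
    return side, side * side


for size in (256, 512):
    side, count = token_grid(size)
    print(f"{size}x{size} image -> {side}x{side} grid, {count} tokens")
```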

Training: masked visual token modeling

The second stage is a bidirectional Transformer trained with a masked-token-prediction objective in the spirit of BERT, adapted from language to image tokens. During training, a random subset of the tokens in an image is replaced with a special [MASK] token, and the model is trained to reconstruct the original tokens at the masked positions by attending to all of the unmasked tokens in every direction. Because attention is bidirectional rather than causal, every prediction can use the full surrounding context ^[1].
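
A minimal sketch of such an objective is a negative log-likelihood evaluated only at the masked positions. The function name, the plain-list data layout, and the choice to average rather than sum over masked positions are assumptions made for illustration, not details taken from the paper.

```python
import math


def masked_token_loss(probs, targets, is_masked):
    """Average negative log-likelihood over masked positions only.

    probs[i] is the predicted distribution over the codebook at position i,
    targets[i] is the original token id, and is_masked[i] marks the positions
    that were replaced by [MASK] in the input.
    """
    losses = [
        -math.log(probs[i][targets[i]])
        for i in range(len(targets))
        if is_masked[i]
    ]
    if not losses:
        return 0.0
    return sum(losses) / len(losses)


# Toy example: 3 positions, a codebook of 2 entries, positions 0 and 2 masked.
probs = [[0.5, 0.5], [0.9, 0.1], [0.25, 0.75]]
targets = [0, 0, 1]
is_masked = [True, False, True]
loss = masked_token_loss(probs, targets, is_masked)
print(f"masked-token loss: {loss:.4f}")
```

Position 1 is unmasked, so its prediction does not contribute to the loss.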

The key departure from BERT is that MaskGIT does not use a fixed masking ratio. For each training image, the fraction of tokens that are masked is drawn from a schedule so that the model is exposed to everything from lightly masked grids (almost fully observed) to nearly or fully masked grids. This variable masking ratio is what makes the model usable as a generator: at inference it must be able to predict tokens when almost the entire canvas is still masked, a regime BERT's fixed low masking ratio never trains for ^[1].
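
One way to realize a variable masking ratio is to draw a uniform random number per image and map it through a schedule function to a mask count. The sketch below assumes a cosine mapping and a reserved id of -1 for [MASK]; both are illustrative choices, and the helper names are hypothetical.

```python
import math
import random

MASK_ID = -1  # hypothetical id reserved for the [MASK] token


def cosine_mask_fraction(r: float) -> float:
    """Map r in [0, 1] to a masking fraction: 1 at r = 0, falling to 0 at r = 1."""
    return math.cos(0.5 * math.pi * r)


def mask_for_training(tokens, rng):
    """Return (masked copy, is_masked flags) using a randomly drawn masking ratio."""
    n = len(tokens)
    r = rng.random()
    # Always mask at least one token so there is something to predict.
    num_masked = max(1, math.ceil(cosine_mask_fraction(r) * n))
    masked_positions = set(rng.sample(range(n), num_masked))
    flags = [i in masked_positions for i in range(n)]
    masked = [MASK_ID if flag else tok for tok, flag in zip(tokens, flags)]
    return masked, flags


rng = random.Random(0)
tokens = list(range(256))  # stand-in for a 16x16 grid of codebook indices
masked, flags = mask_for_training(tokens, rng)
print(f"masked {sum(flags)} of {len(tokens)} tokens")
```

Repeated calls produce grids that range from almost fully observed to almost fully masked, which is the exposure the paragraph above describes.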

Inference: iterative parallel decoding

At generation time MaskGIT starts from a blank canvas in which every token is the [MASK] token and fills the grid over a small fixed number of iterations T. Each iteration performs four operations ^[1]:

Predict. The Transformer takes the current partially masked grid and outputs, for every masked position in parallel, a probability distribution over the codebook.
Sample. A token is sampled from the predicted distribution at each masked position.
Score. Each newly sampled token is assigned a confidence equal to the model's predicted probability for that token. Already-unmasked tokens from earlier iterations are treated as fully confident and are not changed.
Mask. The mask schedule determines how many tokens should remain masked after this iteration. The lowest-confidence predictions are returned to the masked state, and only the highest-confidence ones are committed, so the number of committed tokens grows each step until the grid is full.
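
The four operations can be sketched as a short loop. This is an illustration under stated assumptions rather than the authors' implementation: `toy_predict` is a random stand-in for the Transformer, so the resulting tokens are meaningless, and the mask count per step uses a floored cosine function chosen for the example.

```python
import math
import random

MASK_ID = -1  # hypothetical id for the [MASK] token


def toy_predict(tokens, codebook_size, rng):
    """Stand-in for the Transformer: one random distribution per position."""
    dists = []
    for _ in tokens:
        weights = [rng.random() + 1e-6 for _ in range(codebook_size)]
        total = sum(weights)
        dists.append([w / total for w in weights])
    return dists


def decode(num_tokens, num_steps, codebook_size, rng):
    """Fill an all-[MASK] grid in num_steps parallel iterations."""
    tokens = [MASK_ID] * num_tokens
    forward_passes = 0
    for step in range(1, num_steps + 1):
        # 1. Predict: one forward pass covers every position.
        dists = toy_predict(tokens, codebook_size, rng)
        forward_passes += 1

        # 2. Sample and 3. Score: committed tokens get infinite confidence
        # so they are never returned to the masked state.
        proposal = list(tokens)
        confidence = [math.inf] * num_tokens
        for i, tok in enumerate(tokens):
            if tok == MASK_ID:
                sampled = rng.choices(range(codebook_size), weights=dists[i])[0]
                proposal[i] = sampled
                confidence[i] = dists[i][sampled]

        # 4. Mask: re-mask the lowest-confidence samples; the count left
        # masked shrinks to zero at the final step.
        fraction = math.cos(0.5 * math.pi * step / num_steps)
        num_masked = math.floor(fraction * num_tokens)
        by_confidence = sorted(range(num_tokens), key=lambda i: confidence[i])
        for i in by_confidence[:num_masked]:
            proposal[i] = MASK_ID
        tokens = proposal
    return tokens, forward_passes


result, passes = decode(num_tokens=16, num_steps=8, codebook_size=32,
                        rng=random.Random(0))
print(f"forward passes: {passes}, tokens still masked: {result.count(MASK_ID)}")
```

In this sketch the loop always makes exactly `num_steps` forward passes, whatever the size of the grid.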

The number of tokens kept at each step is set by a mask scheduling function evaluated at the current step's progress. The authors tested linear, concave, and convex schedules and found that the cosine schedule, a member of the concave family, worked best: it commits relatively few tokens in the early steps, when the model is most uncertain because little context exists, and commits many tokens later, when most of the image is already in place ^[1]. To inject diversity so that the procedure does not collapse to a single deterministic output, the confidence scores are perturbed by a noise term whose temperature is annealed over the iterations; later open-source reimplementations commonly formalize this as adding Gumbel noise, scaled by the annealed temperature, to the log-probabilities ^[1]^[6].
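
A small numerical sketch makes the shape of the cosine schedule concrete and shows one way to perturb confidences. The rounding rule (floor), the linear annealing of the temperature to zero, the base temperature of 4.5, and all function names are assumptions made for this example.

```python
import math
import random


def cosine_schedule(progress: float) -> float:
    """Fraction of tokens left masked at a given progress in [0, 1]."""
    return math.cos(0.5 * math.pi * progress)


def masked_after_each_step(num_tokens: int, num_steps: int) -> list[int]:
    """Number of tokens still masked after steps 1..num_steps."""
    return [
        math.floor(cosine_schedule(t / num_steps) * num_tokens)
        for t in range(1, num_steps + 1)
    ]


def noisy_confidence(prob, step, num_steps, base_temperature, rng):
    """Log-probability plus Gumbel noise scaled by a linearly annealed temperature."""
    temperature = base_temperature * (1.0 - step / num_steps)
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep both logs finite
    gumbel = -math.log(-math.log(u))
    return math.log(prob) + temperature * gumbel


remaining = masked_after_each_step(num_tokens=256, num_steps=8)
committed = [256 - remaining[0]] + [
    remaining[i - 1] - remaining[i] for i in range(1, len(remaining))
]
print("still masked after each step:", remaining)
print("committed at each step:      ", committed)

rng = random.Random(0)
print("early noisy score:", noisy_confidence(0.5, 1, 8, 4.5, rng))
print("final noisy score:", noisy_confidence(0.5, 8, 8, 4.5, rng))
```

The count of masked tokens falls slowly at first and quickly toward the end, and at the final step the temperature reaches zero, so the noisy score reduces to the plain log-probability.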

Because each iteration predicts all remaining tokens in a single forward pass, the total number of forward passes equals T rather than the number of tokens. MaskGIT used T = 8 for 256x256 images and T = 12 for 512x512 images, far fewer than the 256 and 1024 sequential steps required by the autoregressive baseline ^[1].
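
The step counts can be compared directly. The sketch below counts forward passes only, under the assumption that the autoregressive baseline needs one pass per token. The resulting ratio is not a wall-clock speedup, because the cost of a single pass differs between the two kinds of model. The function name is illustrative.

```python
def forward_passes(num_tokens: int, parallel_steps: int):
    """Return (autoregressive passes, parallel passes, ratio between them)."""
    autoregressive = num_tokens  # assumed: one pass per generated token
    return autoregressive, parallel_steps, autoregressive / parallel_steps


for label, num_tokens, steps in (("256x256", 256, 8), ("512x512", 1024, 12)):
    ar, parallel, ratio = forward_passes(num_tokens, steps)
    print(f"{label}: {ar} sequential passes vs {parallel} parallel passes "
          f"({ratio:.1f}x fewer)")
```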

Image editing through masking

The same trained model supports inpainting, outpainting (called extrapolation in the paper), and class-conditional editing without any task-specific retraining. To edit an image, the regions to keep are tokenized and left unmasked while the regions to fill are set to [MASK], and the standard iterative decoding fills the masked region consistently with the surrounding tokens. The authors demonstrate inpainting on 512x512 Places2 images, horizontal extrapolation that stitches images into wider panoramas, and class-conditional replacement of objects within an image ^[1]^[7].
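
Preparing an edit amounts to choosing which cells of the token grid start out masked. The sketch below masks a rectangular region of a flattened grid. The function name, the rectangle parameters, and the reserved id of -1 for [MASK] are illustrative assumptions.

```python
MASK_ID = -1  # hypothetical id for the [MASK] token


def prepare_inpainting_grid(tokens, side, top, left, height, width):
    """Return a copy of a flattened side x side token grid with a rectangle masked."""
    if len(tokens) != side * side:
        raise ValueError("token list does not match the grid size")
    if top < 0 or left < 0 or top + height > side or left + width > side:
        raise ValueError("rectangle falls outside the grid")
    grid = list(tokens)
    for row in range(top, top + height):
        for col in range(left, left + width):
            grid[row * side + col] = MASK_ID
    return grid


# Toy 4x4 grid: mask the central 2x2 block and keep the border tokens.
original = list(range(100, 116))
edited = prepare_inpainting_grid(original, side=4, top=1, left=1, height=2, width=2)
for row in range(4):
    print(edited[row * 4:(row + 1) * 4])
```

Iterative decoding would then start from this partially masked grid instead of a fully masked one, so the kept tokens act as fixed context.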

Results

On class-conditional ImageNet generation, MaskGIT improved sharply over the autoregressive VQGAN it was built on and was competitive with or better than the leading GAN of the period. The headline comparisons reported in the paper are summarized below ^[1].

Model	Resolution	FID (lower is better)	Inception Score (higher is better)
VQGAN	256x256	15.78	78.3
DCTransformer	256x256	36.51	-
ADM (diffusion)	256x256	10.94	101.0
BigGAN-deep	256x256	6.95	198.2
MaskGIT	256x256	6.18	182.1
BigGAN-deep	512x512	8.43	-
MaskGIT	512x512	7.32	-

At 256x256, MaskGIT cut VQGAN's Fréchet Inception Distance (FID) from 15.78 to 6.18 and raised the Inception Score from 78.3 to 182.1. It also beat the ADM diffusion baseline on both metrics and edged out BigGAN-deep on FID (6.18 versus 6.95), although BigGAN-deep retained the higher Inception Score (198.2 versus 182.1) ^[1]. At 512x512, MaskGIT reached an FID of 7.32, surpassing BigGAN-deep's 8.43, which the authors presented as a new state of the art for that benchmark ^[1]^[2].

The efficiency gain was the paper's signature result. Measured against the autoregressive VQGAN sampler, MaskGIT accelerated decoding by roughly 30x to 64x, with the speedup growing as resolution and token count increased, because the autoregressive cost grows with the number of tokens while MaskGIT's cost stays fixed at T steps ^[1]^[2]. Ablations showed the cosine mask schedule and the choice of T to be the main levers on the quality-versus-speed trade-off, with quality peaking around 8 to 12 iterations ^[1].

Influence: Muse and MAGVIT

MaskGIT introduced the masked generative paradigm that a series of later Google models extended to new modalities and to text conditioning ^[4]^[5].

Muse (Chang et al., January 2023) scaled the MaskGIT approach to text-to-image generation. It conditions the masked-token Transformer on text embeddings from a pretrained large language model and grows the model to around 3 billion parameters, using a cascade with a super-resolution stage. The authors reported that Muse was substantially faster than pixel-space diffusion models such as Imagen and DALL-E 2, and than autoregressive models such as Parti, because it inherits MaskGIT's parallel iterative decoding ^[4].
MAGVIT (Yu et al., 2023), short for Masked Generative Video Transformer, adapted MaskGIT to video by introducing a 3D tokenizer over space and time and applying masked iterative decoding to video tokens, supporting tasks such as frame prediction and video inpainting. The follow-up MAGVIT-v2 improved the tokenizer for both images and video, and the contemporaneous Phenaki text-to-video model used a related masked-token scheme over a video tokenizer ^[5].

The paradigm also spread into open-source reimplementations and into other domains, including masked acoustic and music token models that borrow MaskGIT's confidence-based parallel decoding ^[6].

Relationship to other methods

MaskGIT is best understood by contrasting how it traverses the token grid relative to its neighbors in the token-based and iterative generation landscape.

Versus autoregressive token models (VQGAN, Parti, the DALL-E line). MaskGIT shares the same VQGAN-style tokenizer and discrete-token representation, but replaces the causal, raster-order, one-token-at-a-time prior with a bidirectional Transformer that fills tokens in confidence order over a few parallel steps. This removes the unidirectional-context limitation and turns hundreds or thousands of sequential steps into roughly a dozen ^[1].
Versus diffusion models. Both MaskGIT and a diffusion model generate by iterative refinement from a degraded state, but the corruption process differs. Diffusion operates in a continuous space and gradually denoises Gaussian noise over many steps, whereas MaskGIT operates in a discrete token space and progressively unmasks tokens over a handful of steps. MaskGIT's formulation can be viewed as a discrete-state, absorbing-state ("masking") analogue of diffusion, and it typically needs far fewer steps than the pixel-space diffusion models of its era ^[1]^[4].
Versus later masked-autoregressive work. The masked-prediction idea was later combined back with autoregression and with continuous tokens. Masked Autoregressive (MAR) modeling, for example, predicts tokens in a random order using a masked Transformer and a per-token diffusion head over continuous latents, blending MaskGIT's order-agnostic decoding with a diffusion-style output, and visual autoregressive lines such as next-scale prediction likewise revisited the ordering question that MaskGIT first reframed ^[4]^[5].

References

1. Chang, H., Zhang, H., Jiang, L., Liu, C., and Freeman, W. T. "MaskGIT: Masked Generative Image Transformer." arXiv:2202.04200, February 2022. https://arxiv.org/abs/2202.04200
2. Chang, H., Zhang, H., Jiang, L., Liu, C., and Freeman, W. T. "MaskGIT: Masked Generative Image Transformer" (project page). https://masked-generative-image-transformer.github.io/
3. Chang, H., Zhang, H., Jiang, L., Liu, C., and Freeman, W. T. "MaskGIT: Masked Generative Image Transformer." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. https://openaccess.thecvf.com/content/CVPR2022/html/Chang_MaskGIT_Masked_Generative_Image_Transformer_CVPR_2022_paper.html
4. Chang, H., Zhang, H., Barber, J., et al. "Muse: Text-To-Image Generation via Masked Generative Transformers." arXiv:2301.00704, January 2023. https://arxiv.org/abs/2301.00704
5. Yu, L., Cheng, Y., Sohn, K., et al. "MAGVIT: Masked Generative Video Transformer." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023. arXiv:2212.05199. https://arxiv.org/abs/2212.05199
6. Garcia, H., Seetharaman, P., Salamon, J., and Pardo, B. "VampNet: Music Generation via Masked Acoustic Token Modeling." arXiv:2307.04686, July 2023. https://arxiv.org/abs/2307.04686
7. Google Research. "maskgit: Official Jax Implementation of MaskGIT" (code repository). https://github.com/google-research/maskgit
