MaskGIT
Last reviewed
Jun 8, 2026
Sources
7 citations
Review status
Source-backed
Revision
v1 · 2,059 words
MaskGIT, short for Masked Generative Image Transformer, is an image-synthesis method introduced by Google Research in the 2022 paper "MaskGIT: Masked Generative Image Transformer" by Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T. Freeman [1][2]. The paper was first posted to arXiv on February 8, 2022, and was presented at the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2022 [1][3].
MaskGIT generates images over a grid of discrete visual tokens produced by a VQGAN-style tokenizer, rather than over raw pixels. Its central contribution is the decoding procedure: instead of producing the tokens one at a time in a fixed raster-scan order, as an autoregressive model does, MaskGIT uses a bidirectional Transformer trained with a masked-token-prediction objective and then decodes all tokens in parallel over a small number of refinement steps. At each step the model predicts every still-masked token at once, keeps only the predictions it is most confident about, and leaves the rest masked for the next step. A typical 256x256 image is produced in around 8 steps rather than the 256 sequential steps an autoregressive model would need, a 30x to 64x acceleration with image quality that matches or exceeds the autoregressive and GAN baselines of the time [1][2].
MaskGIT established the "masked generative" paradigm for visual generation. It was the direct basis for Google's Muse text-to-image model and the MAGVIT video generator, and it sits alongside diffusion models and autoregressive token models as one of the three main families of token-based and iterative image generators [4][5].
By 2021 a dominant recipe for generative vision was the two-stage tokenizer-plus-sequence-model approach popularized by VQ-VAE and VQGAN. A learned tokenizer compresses an image into a small grid of integer codes drawn from a fixed codebook, and a second-stage model learns the distribution over those code sequences; to sample a new image, code indices are generated and decoded back to pixels by the tokenizer's decoder [1].
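In systems such as VQGAN and the autoregressive token line that includes DALL-E and Google's Parti, the second-stage model is an autoregressive Transformer. It flattens the two-dimensional token grid into a sequence in raster-scan order (left to right, top to bottom) and predicts each token conditioned only on the tokens before it. This design has two drawbacks that MaskGIT set out to fix [1]. First, decoding is strictly sequential, so the number of forward passes grows with the number of tokens. Second, context is causal, so each prediction sees only the tokens that precede it in raster order and none of the surrounding tokens that come later.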
MaskGIT addresses both issues by replacing sequential, causal decoding with parallel, bidirectional, iterative decoding.
MaskGIT keeps the first stage of the VQGAN recipe essentially unchanged. A convolutional VQGAN-style autoencoder with a learned discrete codebook compresses an image by a fixed spatial factor of 16, so a 256x256 image becomes a 16x16 grid of 256 tokens and a 512x512 image becomes a 32x32 grid of 1024 tokens, each cell an index into the codebook. All of MaskGIT's modeling happens on these discrete tokens, and the tokenizer's decoder maps the final token grid back to a pixel image [1].
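The token counts quoted above follow directly from the downsampling factor. A minimal sketch of that arithmetic, in which the function name `token_grid` is an illustrative choice rather than anything from the paper:

```python
def token_grid(image_size: int, downsample: int = 16) -> tuple[int, int]:
    """Return (grid side, total tokens) for a square image of the given size."""
    if image_size % downsample != 0:
        raise ValueError("image size must be divisible by the downsampling factor")
    side = image_size // downsample
    return side, side * side


print(token_grid(256))  # (16, 256)
print(token_grid(512))  # (32, 1024)
```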
The second stage is a bidirectional Transformer trained with a masked-token-prediction objective in the spirit of BERT, adapted from language to image tokens. During training, a random subset of the tokens in an image is replaced with a special [MASK] token, and the model is trained to reconstruct the original tokens at the masked positions by attending to all of the unmasked tokens in every direction. Because attention is bidirectional rather than causal, every prediction can use the full surrounding context [1].
The key departure from BERT is that MaskGIT does not use a fixed masking ratio. For each training image, the fraction of tokens that are masked is drawn from a schedule so that the model is exposed to everything from lightly masked grids (almost fully observed) to nearly or fully masked grids. This variable masking ratio is what makes the model usable as a generator: at inference it must be able to predict tokens when almost the entire canvas is still masked, a regime BERT's fixed low masking ratio never trains for [1].
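The variable masking ratio can be illustrated with a small sketch. It assumes the ratio is obtained by passing a uniform random number through a cosine function; the sentinel `MASK_ID` and the helper names are hypothetical, and a real implementation would operate on batched tensors rather than Python lists.

```python
import math
import random

MASK_ID = -1  # hypothetical sentinel standing in for the [MASK] token


def cosine_ratio(r: float) -> float:
    """Map r in [0, 1) to a masking ratio in (0, 1]."""
    return math.cos(0.5 * math.pi * r)


def mask_for_training(tokens, rng):
    """Mask a randomly sized subset of tokens; return the inputs and masked positions."""
    n = len(tokens)
    ratio = cosine_ratio(rng.random())         # a different ratio for every image
    num_masked = max(1, math.ceil(ratio * n))  # always mask at least one token
    masked = set(rng.sample(range(n), num_masked))
    inputs = [MASK_ID if i in masked else t for i, t in enumerate(tokens)]
    return inputs, sorted(masked)


rng = random.Random(0)
tokens = list(range(256))  # stand-in for a 16x16 grid of codebook indices
for _ in range(3):
    inputs, masked = mask_for_training(tokens, rng)
    print(f"masked {len(masked)} of {len(tokens)} tokens")
```

The training loss would then be computed only at the positions returned in `masked`, with the model free to attend to every unmasked token.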
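At generation time MaskGIT starts from a blank canvas in which every token is the [MASK] token and fills the grid over a small fixed number of iterations T. Each iteration performs four operations [1]: it predicts a distribution over the codebook for every masked position in parallel; samples a token at each masked position and takes its predicted probability as a confidence score; uses the mask schedule to compute how many tokens should remain masked after this step; and re-masks that many of the lowest-confidence tokens while committing the rest.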
The number of tokens kept at each step is set by a mask scheduling function evaluated at the current step's progress. The authors tested linear, concave, and convex schedules and found that a cosine schedule worked best: it commits relatively few tokens in the early steps, when the model is most uncertain because little context exists, and commits many tokens later, when most of the image is already in place [1]. To inject diversity so the procedure does not collapse to a single deterministic output, the confidence scores are perturbed by a temperature-controlled noise term whose temperature is annealed over the iterations; later open-source reimplementations commonly formalize this as adding Gumbel noise scaled by an annealed temperature to the log-probabilities [1][6].
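The effect of the cosine schedule is easiest to see by tabulating it. The sketch below assumes the number of tokens left masked after step t is floor(N · cos(π/2 · t/T)) and that the last step commits whatever remains; the rounding rule and the function names are illustrative assumptions, not details taken from the paper.

```python
import math


def remaining_masked(num_tokens: int, num_steps: int) -> list[int]:
    """Tokens still masked after each step under a cosine schedule."""
    counts = []
    for step in range(1, num_steps + 1):
        if step == num_steps:
            counts.append(0)  # the final step commits everything that is left
        else:
            ratio = math.cos(0.5 * math.pi * step / num_steps)
            counts.append(math.floor(num_tokens * ratio))
    return counts


def committed_per_step(num_tokens: int, num_steps: int) -> list[int]:
    """Tokens newly committed at each step."""
    previous = num_tokens
    committed = []
    for remaining in remaining_masked(num_tokens, num_steps):
        committed.append(previous - remaining)
        previous = remaining
    return committed


print(committed_per_step(256, 8))  # [5, 15, 24, 31, 39, 45, 48, 49]
```

For a 16x16 grid and 8 steps, only 5 tokens are committed in the first step and 49 in the last, which matches the "few early, many late" behavior described above.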
Because each iteration predicts all remaining tokens in a single forward pass, the total number of forward passes equals T rather than the number of tokens. MaskGIT used T = 8 for 256x256 images and T = 12 for 512x512 images, far fewer than the 256 and 1024 sequential steps required by the autoregressive baseline [1].
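The decoding loop described above can be sketched end to end with a toy stand-in for the network. Everything here is an assumption made for illustration: `toy_model` returns random distributions instead of Transformer outputs, `MASK_ID` is a hypothetical sentinel, the schedule uses floor rounding, and the temperature-annealed confidence noise is omitted for brevity.

```python
import math
import random

MASK_ID = -1  # hypothetical sentinel standing in for the [MASK] token


def toy_model(tokens, vocab_size, rng):
    """Stand-in for the bidirectional Transformer: one random distribution per position."""
    probs = []
    for _ in tokens:
        weights = [rng.random() + 1e-9 for _ in range(vocab_size)]
        total = sum(weights)
        probs.append([w / total for w in weights])
    return probs


def decode(num_tokens, num_steps, vocab_size=8, seed=0):
    rng = random.Random(seed)
    tokens = [MASK_ID] * num_tokens  # blank canvas
    forward_passes = 0
    for step in range(1, num_steps + 1):
        # 1. Predict every position in a single forward pass.
        probs = toy_model(tokens, vocab_size, rng)
        forward_passes += 1
        # 2. Sample a candidate for each masked position and record its confidence.
        candidates = {}
        for i, tok in enumerate(tokens):
            if tok == MASK_ID:
                choice = rng.choices(range(vocab_size), weights=probs[i])[0]
                candidates[i] = (choice, probs[i][choice])
        # 3. Ask the cosine schedule how many tokens stay masked after this step.
        if step == num_steps:
            keep_masked = 0
        else:
            ratio = math.cos(0.5 * math.pi * step / num_steps)
            keep_masked = math.floor(num_tokens * ratio)
        keep_masked = max(0, min(keep_masked, len(candidates) - 1))
        # 4. Commit the most confident candidates; the rest stay masked.
        ranked = sorted(candidates, key=lambda i: candidates[i][1], reverse=True)
        for i in ranked[: len(candidates) - keep_masked]:
            tokens[i] = candidates[i][0]
    return tokens, forward_passes


tokens, passes = decode(num_tokens=256, num_steps=8)
print(passes, MASK_ID in tokens)  # 8 False
```

The forward-pass counter makes the point of the paragraph concrete: the loop calls the model T times regardless of how many tokens the grid contains.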
The same trained model supports inpainting, outpainting (called extrapolation in the paper), and class-conditional editing without any task-specific retraining. To edit an image, the regions to keep are tokenized and left unmasked while the regions to fill are set to [MASK], and the standard iterative decoding fills the masked region consistently with the surrounding tokens. The authors demonstrate inpainting on 512x512 Places2 images, horizontal extrapolation that stitches images into wider panoramas, and class-conditional replacement of objects within an image [1][7].
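Preparing an edit amounts to choosing which cells of the token grid start out masked. A minimal sketch for a rectangular inpainting region, with `MASK_ID` and `mask_region` as hypothetical names and a tiny 4x4 grid standing in for a real 32x32 one:

```python
MASK_ID = -1  # hypothetical sentinel standing in for the [MASK] token


def mask_region(grid, top, left, height, width):
    """Return a copy of a 2-D token grid with one rectangle set to MASK_ID."""
    out = [row[:] for row in grid]
    for r in range(top, top + height):
        for c in range(left, left + width):
            out[r][c] = MASK_ID
    return out


grid = [[r * 4 + c for c in range(4)] for r in range(4)]  # toy 4x4 token grid
for row in mask_region(grid, top=1, left=1, height=2, width=2):
    print(row)
```

Iterative decoding would then start from this partially masked grid rather than from a fully masked one, so the kept tokens act as context for the filled region.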
On class-conditional ImageNet generation, MaskGIT improved sharply over the autoregressive VQGAN it was built on and was competitive with or better than the leading GAN of the period. The headline comparisons reported in the paper are summarized below [1].
| Model | Resolution | FID (lower is better) | Inception Score (higher is better) |
|---|---|---|---|
| VQGAN | 256x256 | 15.78 | 78.3 |
| DCTransformer | 256x256 | 36.51 | - |
| ADM (diffusion) | 256x256 | 10.94 | 101.0 |
| BigGAN-deep | 256x256 | 6.95 | 198.2 |
| MaskGIT | 256x256 | 6.18 | 182.1 |
| BigGAN-deep | 512x512 | 8.43 | - |
| MaskGIT | 512x512 | 7.32 | - |
At 256x256, MaskGIT cut VQGAN's Fréchet Inception Distance (FID) from 15.78 to 6.18 and raised the Inception Score from 78.3 to 182.1. It also beat the ADM diffusion baseline (FID 10.94) and edged out BigGAN-deep on FID (6.18 versus 6.95), although BigGAN-deep kept the higher Inception Score (198.2 versus 182.1) [1]. At 512x512, MaskGIT reached an FID of 7.32, surpassing BigGAN-deep's 8.43, which the authors presented as a new state of the art for that benchmark [1][2].
The efficiency gain was the paper's signature result. Measured against the autoregressive VQGAN sampler, MaskGIT accelerated decoding by roughly 30x to 64x, with the speedup growing as resolution and token count increased, because the autoregressive cost grows with the number of tokens while MaskGIT's cost stays fixed at T steps [1][2]. Ablations showed the cosine mask schedule and the choice of T to be the main levers on the quality-versus-speed trade-off, with quality peaking around 8 to 12 iterations [1].
MaskGIT introduced the masked generative paradigm that a series of later Google models extended to new modalities and to text conditioning [4][5].
The paradigm also spread into open-source reimplementations and into other domains, including masked acoustic and music token models that borrow MaskGIT's confidence-based parallel decoding [6].
MaskGIT is best understood by contrasting how it traverses the token grid relative to its neighbors in the token-based and iterative generation landscape.