Masked Autoregressive (MAR) generation
Last reviewed
Jun 8, 2026
Sources
3 citations
Review status
Source-backed
Revision
v1 · 1,829 words
Masked Autoregressive (MAR) generation is an image-generation method introduced in the 2024 paper "Autoregressive Image Generation without Vector Quantization" by Tianhong Li, Yonglong Tian, He Li, Mingyang Deng, and Kaiming He [1]. The work was a collaboration between MIT CSAIL (Li, Deng, He), Google DeepMind (Tian), and Tsinghua University (He Li), and was presented as a spotlight paper at NeurIPS 2024 [1][2].
The paper's central claim is that autoregressive models for images do not inherently require discrete, vector-quantized tokens. Conventional wisdom held that because an autoregressive model predicts a probability distribution over the next token, that token must live in a finite discrete vocabulary so the distribution can be represented as a categorical softmax. Li and colleagues observe that a discrete space is convenient for this purpose but not necessary. Their key technical contribution, a per-token "Diffusion Loss," lets an autoregressive model operate directly on continuous-valued tokens by using a small denoising network to define the per-token distribution, in place of a categorical cross-entropy loss [1]. This removes the dependency on a learned discrete codebook such as the one in VQ-VAE or VQGAN.
The paper studies this idea across a family of generation orders, from strict raster-scan autoregression to generalized "masked autoregressive" (MAR) variants that predict sets of tokens in random order, in the spirit of MaskGIT. Combining the MAR formulation with Diffusion Loss yields a generator whose largest configuration reaches a Frechet Inception Distance (FID) of 1.55 on ImageNet 256x256, while a smaller configuration generates an image in under 0.3 seconds at an FID below 2.0, a favorable speed and quality trade-off relative to contemporary diffusion transformers [1].
A standard autoregressive generative model factorizes the probability of a sequence as a product of conditional next-token distributions and is trained to predict each token given the preceding ones. For text this is natural, because words come from a finite vocabulary, so each conditional distribution is a categorical distribution produced by a softmax over that vocabulary and trained with cross-entropy.
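In symbols, the factorization described above can be written as follows. The notation is chosen here for illustration: x_i is the i-th token and n is the sequence length.

```latex
p(x_1, \ldots, x_n) = \prod_{i=1}^{n} p(x_i \mid x_1, \ldots, x_{i-1})
```

For discrete tokens, each factor on the right is a categorical distribution over the vocabulary.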
Images are continuous-valued, so applying this recipe historically required first converting an image into a sequence of discrete tokens. The dominant approach was a two-stage pipeline: a VQ-VAE or VQGAN tokenizer compresses the image into a grid of integer indices drawn from a learned codebook, and a Transformer then models that grid as a discrete sequence [1]. Visual Autoregressive Modeling (VAR) and MaskGIT are later members of this discrete-token family.
Vector quantization has well-documented drawbacks. The quantization step is non-differentiable, so it is trained with workarounds such as the straight-through estimator and auxiliary codebook and commitment losses, which make tokenizers finicky to optimize. Snapping each continuous feature vector to its nearest codebook entry is also lossy, which caps reconstruction quality and can leave parts of the codebook unused. The MAR authors argue that these difficulties are an artifact of forcing a discrete representation purely to satisfy the categorical-softmax assumption, rather than an intrinsic requirement of autoregressive modeling [1]. Their contribution is to remove that assumption.
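The lossy snapping step and the unused-codebook problem can be illustrated with a toy example. This is a minimal sketch, not the VQ-VAE or VQGAN implementation: the four-entry 2-D codebook, the feature vectors, and the helper name `quantize` are all hypothetical, and real codebooks are learned and far larger.

```python
import math

def quantize(vec, codebook):
    """Return (index, entry) of the codebook entry nearest to vec (Euclidean)."""
    dists = [math.dist(vec, entry) for entry in codebook]
    idx = min(range(len(codebook)), key=dists.__getitem__)
    return idx, codebook[idx]

# Hypothetical codebook and continuous feature vectors.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
features = [(0.1, 0.2), (0.9, 0.1), (0.4, 0.6), (0.2, 0.9)]

indices, errors = [], []
for f in features:
    idx, entry = quantize(f, codebook)
    indices.append(idx)
    errors.append(math.dist(f, entry))  # distance lost by snapping to the entry

# Codebook entries that no feature vector was mapped to.
unused = sorted(set(range(len(codebook))) - set(indices))

print(indices)  # [0, 1, 2, 2]
print(unused)   # [3]
```

Every entry of `errors` is positive, which is the reconstruction loss the paragraph refers to, and entry 3 is never selected, which is a small-scale version of an unused code.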
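MAR has two largely independent components: a loss that models continuous tokens (Diffusion Loss) and a generation order that relaxes strict autoregression (masked autoregression). Each can be used without the other: the paper pairs Diffusion Loss with strict raster-order autoregression as well as with MAR, and it compares MAR trained with cross-entropy against MAR trained with Diffusion Loss [1].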
The core idea is to model the per-token probability distribution with a small diffusion model rather than a softmax. The main Transformer reads the already-generated tokens and outputs, for each position to be predicted, a conditioning vector z. That vector parameterizes the distribution of the continuous token x at that position [1].
This distribution is represented implicitly by a small denoising network, a multilayer perceptron (MLP) of a few residual blocks conditioned on z via adaptive layer normalization. During training, the method follows the DDPM recipe at the level of a single token: it samples a noise level, corrupts the ground-truth token x into a noised version, and trains the MLP to predict the added noise conditioned on z and the noise level. The resulting denoising objective is the Diffusion Loss, and it serves the same role that cross-entropy plays for discrete tokens, namely defining and fitting p(x given z) [1].
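The training step described above can be sketched for a single token as follows. This is an illustrative sketch under stated assumptions, not the official implementation: the linear beta schedule, the value of `T`, and the names `diffusion_loss` and `zero_model` are hypothetical, and `eps_model` stands in for the small conditioned MLP.

```python
import math
import random

T = 1000  # number of noise levels (an assumption for this sketch)

# Linear beta schedule, chosen for simplicity; not necessarily the paper's.
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]
alpha_bars, prod = [], 1.0
for b in betas:
    prod *= 1.0 - b
    alpha_bars.append(prod)  # cumulative product of (1 - beta)

def diffusion_loss(x, z, eps_model, rng):
    """Noise-prediction loss for one continuous token x given conditioning z."""
    t = rng.randrange(T)                    # sample a noise level
    eps = [rng.gauss(0.0, 1.0) for _ in x]  # noise to be added
    a = alpha_bars[t]
    # Corrupt the ground-truth token into its noised version x_t.
    x_t = [math.sqrt(a) * xi + math.sqrt(1.0 - a) * ei for xi, ei in zip(x, eps)]
    eps_hat = eps_model(x_t, t, z)          # stand-in for the small MLP
    # Mean squared error between the true and the predicted noise.
    return sum((e - eh) ** 2 for e, eh in zip(eps, eps_hat)) / len(x)

def zero_model(x_t, t, z):
    """Placeholder denoiser that always predicts zero noise."""
    return [0.0 for _ in x_t]

rng = random.Random(0)
loss = diffusion_loss([0.5, -1.0, 0.25], z=[0.0] * 4, eps_model=zero_model, rng=rng)
```

In a real system the loss would be averaged over tokens and noise levels and backpropagated through both the MLP and the Transformer that produced z.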
At inference, a token is produced by running the reverse diffusion process: starting from Gaussian noise and iteratively denoising with the MLP conditioned on the z that the Transformer emitted for that position. Because the denoising network is small and shared across positions, the overhead is modest. The reported default adds roughly 21 million parameters to a roughly 400 million parameter Transformer and increases inference time by roughly 10 percent [1]. The method also exposes a temperature parameter that scales the noise injected during sampling, giving the same diversity-versus-fidelity control that sampling temperature provides in discrete autoregressive models, and the authors find this temperature to be important for quality [1].
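A per-token sampler with temperature can be sketched as a standard DDPM ancestral sampling loop in which the injected noise is multiplied by the temperature. The schedule, the step count, and every name below are assumptions made for illustration. The toy `eps_model` is the exact noise predictor for a token distribution that is a point mass at z, which is convenient only because the result is easy to check.

```python
import math
import random

def make_schedule(T=1000):
    """Linear beta schedule (an assumption, chosen only for simplicity)."""
    betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]
    alpha_bars, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        alpha_bars.append(prod)
    return betas, alpha_bars

def sample_token(z, eps_model, dim, betas, alpha_bars, temperature=1.0, rng=None):
    """Reverse diffusion for one token, conditioned on z."""
    rng = rng or random.Random()
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # start from Gaussian noise
    for t in reversed(range(len(betas))):
        eps_hat = eps_model(x, t, z)
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        mean = [(xi - coef * ei) / math.sqrt(1.0 - betas[t])
                for xi, ei in zip(x, eps_hat)]
        if t == 0:
            x = mean                                # no noise on the final step
        else:
            sigma = math.sqrt(betas[t])
            # Temperature scales the noise injected at each step.
            x = [m + temperature * sigma * rng.gauss(0.0, 1.0) for m in mean]
    return x

def make_point_mass_model(alpha_bars):
    """Toy stand-in for the MLP: exact predictor for a point mass at z."""
    def eps_model(x_t, t, z):
        a = alpha_bars[t]
        return [(xi - math.sqrt(a) * zi) / math.sqrt(1.0 - a)
                for xi, zi in zip(x_t, z)]
    return eps_model

betas, alpha_bars = make_schedule()
z = [0.3, -0.7]
token = sample_token(z, make_point_mass_model(alpha_bars), dim=2,
                     betas=betas, alpha_bars=alpha_bars,
                     temperature=0.9, rng=random.Random(0))
```

With this toy predictor the sampled token lands on z up to floating-point error. Lowering `temperature` reduces the noise added at each intermediate step, which is the diversity-versus-fidelity control the text describes.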
Crucially, because the tokens are now continuous, MAR uses a continuous tokenizer with no codebook. The default is the KL-regularized continuous variational autoencoder from latent diffusion (the KL-16 tokenizer), which encodes a 256x256 image into a 16x16 grid of continuous latent vectors [1].
The second component generalizes the generation order. A strict autoregressive model fixes a raster order and uses causal attention so that each token sees only its predecessors. MAR instead generates tokens in a random order and can predict many tokens at once, drawing on the masked autoencoder (MAE) and MaskGIT lineage [1].
Concretely, MAR uses bidirectional (full) attention over the known tokens and a set of mask tokens marking positions still to be generated, similar to an MAE encoder-decoder. Training randomly masks a high fraction of tokens, in the range of 70 to 100 percent, and asks the model to predict the masked ones. Generation proceeds over a fixed number of steps (64 by default): at each step the model generates a randomly chosen subset of the currently masked positions, and the masking ratio is lowered along a cosine schedule until no tokens remain masked. Predicting several tokens per step is what makes generation fast, since the number of Transformer passes is the step count rather than the token count [1].
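The cosine schedule can be made concrete by computing which positions are generated at each step. This sketch assumes a uniformly random generation order and a 16x16 grid of 256 tokens. The function name `mar_step_plan` and the clamping rule, which forces at least one new token per step, are assumptions made for illustration. The Transformer and the per-token sampler are omitted.

```python
import math
import random

def mar_step_plan(num_tokens=256, num_steps=64, rng=None):
    """Return, for each step, the token positions generated at that step.

    Assumes num_tokens >= num_steps so every step can generate a token.
    """
    rng = rng or random.Random()
    order = list(range(num_tokens))
    rng.shuffle(order)                      # random generation order
    plan, masked = [], num_tokens
    for step in range(num_steps):
        # Cosine schedule: fraction of tokens still masked after this step.
        ratio = math.cos(math.pi / 2.0 * (step + 1) / num_steps)
        target = math.floor(num_tokens * ratio)
        # Generate at least one token now, and leave at least one
        # masked token for each of the remaining steps.
        remaining_steps = num_steps - step - 1
        masked_after = max(remaining_steps, min(target, masked - 1))
        start = num_tokens - masked
        plan.append(order[start:start + (masked - masked_after)])
        masked = masked_after
    return plan

plan = mar_step_plan(rng=random.Random(0))
tokens_per_step = [len(p) for p in plan]
```

Early steps generate only a few tokens and later steps generate many, so all 256 positions are filled in 64 Transformer passes rather than 256.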
The paper frames a spectrum of models under one roof, controlled by the order and how many tokens are produced per step:
| Variant | Order | Attention | Tokens per step | FID (no CFG) |
|---|---|---|---|---|
| Standard autoregressive (AR) | Raster | Causal | 1 | 19.23 |
| MAR (random order) | Random | Causal | 1 | 13.07 |
| MAR (bidirectional) | Random | Bidirectional | 1 | 3.43 |
| MAR (bidirectional, parallel) | Random | Bidirectional | Multiple | 3.50 |
Numbers are for the paper's default configuration on ImageNet 256x256 without classifier-free guidance [1]. Moving from strict raster AR toward random-order, bidirectional masked generation improves quality substantially while also enabling parallel decoding, and the nearly identical FID of the single-token and parallel rows shows the speedup comes at little cost in quality [1].
To improve the quality of class-conditional samples, MAR also supports classifier-free guidance, training with the class label dropped on a fraction of examples and combining conditional and unconditional predictions at inference [1].
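Classifier-free guidance is commonly implemented by extrapolating from the unconditional prediction toward the conditional one. The sketch below shows that common form together with label dropping during training. The function names, the guidance scale, and the drop probability are hypothetical and are not taken from the paper.

```python
import random

def cfg_combine(cond, uncond, scale):
    """Combine conditional and unconditional predictions.

    scale = 0 returns the unconditional prediction, scale = 1 returns the
    conditional one, and scale > 1 extrapolates past the conditional one.
    """
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]

def maybe_drop_label(label, null_label, drop_prob, rng):
    """During training, replace the class label with a null label at random."""
    return null_label if rng.random() < drop_prob else label

guided = cfg_combine(cond=[1.0, 2.0], uncond=[0.0, 1.0], scale=3.0)
print(guided)  # [3.0, 4.0]

rng = random.Random(0)
# Hypothetical setup: 1000 classes, with index 1000 reserved as the null label.
train_label = maybe_drop_label(label=207, null_label=1000, drop_prob=0.1, rng=rng)
```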
On the standard ImageNet 256x256 class-conditional benchmark, evaluated with the Frechet Inception Distance, MAR with Diffusion Loss scales smoothly with model size and reaches strong scores with classifier-free guidance [1]:
| Model | Parameters | FID-50K (with CFG) | Inception Score |
|---|---|---|---|
| MAR-B | 208M | 2.31 | 281.7 |
| MAR-L | 479M | 1.78 | 296.0 |
| MAR-H | 943M | 1.55 | 303.7 |
The headline configuration, MAR-H, reaches an FID of 1.55, and MAR-L generates an image in under 0.3 seconds (on an A100 GPU) while staying below an FID of 2.0 [1][2]. The authors highlight this as a favorable speed-versus-accuracy trade-off compared with the Diffusion Transformer (DiT), whose DiT-XL/2 reaches 2.27 FID at 675M parameters; MAR also surpasses latent-diffusion baselines such as LDM-4 (3.60 FID) and discrete masked-token models such as MaskGIT [1].
Two controlled comparisons support the paper's thesis. First, with the Transformer architecture held fixed, Diffusion Loss on continuous KL-16 tokens reaches 3.50 FID versus 8.79 FID for a cross-entropy loss on vector-quantized tokens, a relative reduction of roughly 60 percent and direct evidence that modeling continuous tokens with diffusion beats discretizing them [1]. Second, the variant table above shows that the masked, random-order formulation is far stronger than strict raster autoregression. The number of autoregressive steps can be tuned (the paper sweeps roughly 8 to 128 steps) to trade quality for speed [1].
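The relative reduction implied by the two FID values quoted above can be checked with a line of arithmetic (the variable names are illustrative):

```python
fid_cross_entropy = 8.79
fid_diffusion_loss = 3.50

# Lower FID is better, so the improvement is a relative reduction.
relative_reduction = (fid_cross_entropy - fid_diffusion_loss) / fid_cross_entropy
print(f"{relative_reduction:.1%}")  # prints 60.2%
```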
MAR is notable for decoupling two ideas that had been bundled together in image generation: the autoregressive modeling strategy and the discrete-token representation. By showing that a continuous-token autoregressive model can match or beat strong diffusion and discrete-token baselines, the paper removed a standard justification for vector-quantized tokenizers and offered a path that sidesteps their training difficulties and reconstruction ceiling [1].
It also blurs the boundary between autoregressive models and diffusion models. In MAR a diffusion process operates per token to define the output distribution, while an autoregressive (or masked) Transformer supplies the conditioning, so the method is in a sense autoregressive on the outside and diffusive on the inside [1]. This composability proved influential: the Diffusion Loss idea was taken up in later continuous-token generative systems and unified image-and-text models, and the official PyTorch implementation, released as the "mar" repository, became a common starting point for follow-up research [1].
MAR sits at the intersection of several families of generative models: