Visual Autoregressive modeling (VAR)
Last reviewed
Jun 8, 2026
Sources
8 citations
Review status
Source-backed
Revision
v1 · 1,583 words
Visual Autoregressive modeling (VAR) is an image generation paradigm, introduced in 2024, that reframes autoregressive image synthesis as coarse-to-fine "next-scale prediction" rather than the conventional raster-order "next-token prediction." Instead of emitting image tokens one at a time in reading order, VAR generates an image as a pyramid of token maps that grow from a single token up to the full-resolution grid, predicting each entire higher-resolution map in parallel while conditioning on all coarser maps already produced [1]. The method was presented in the paper "Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction" by Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, and Liwei Wang of ByteDance and Peking University [1].
VAR was the first GPT-style autoregressive approach reported to surpass diffusion transformers on standard image generation benchmarks. On ImageNet 256x256 it improved the autoregressive baseline's Fréchet Inception Distance (FID) from 18.65 to 1.80 and Inception Score (IS) from 80.4 to 356.4, with roughly 20 times faster inference, while outperforming the Diffusion Transformer (DiT) on image quality, speed, data efficiency, and scalability [1]. The work received one of two Best Paper awards at the Conference on Neural Information Processing Systems (NeurIPS) 2024 [2][3]. Separately from its technical contribution, lead author Keyu Tian became the subject of a widely reported lawsuit filed by ByteDance over alleged interference with internal model training; this dispute is unrelated to the validity of the published results and is described neutrally below [4][5].
A standard autoregressive image generator first compresses an image into a grid of discrete tokens using a vector-quantized autoencoder such as VQ-VAE or VQGAN, then flattens that 2D grid into a 1D sequence and predicts tokens one by one in raster-scan order (left to right, top to bottom). A Transformer factorizes the image likelihood as a product of per-token conditionals, each token attending only to those that precede it in the flattened sequence [1].
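The raster-order setup can be illustrated with a minimal NumPy sketch. The helper names `raster_flatten` and `causal_mask` and the toy 3 by 4 grid are assumptions made for illustration; they are not taken from the VAR paper or any particular codebase.

```python
import numpy as np

def raster_flatten(token_grid):
    """Flatten an (h, w) grid of token ids row by row (left to right, top to bottom)."""
    return np.asarray(token_grid).reshape(-1)

def causal_mask(seq_len):
    """Boolean mask: entry [i, j] is True when position i may attend to position j."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

# Toy 3x4 grid of token ids (hypothetical values, chosen so ids equal raster positions).
grid = np.arange(12).reshape(3, 4)
seq = raster_flatten(grid)          # 1D sequence of length 12
mask = causal_mask(seq.size)        # 12x12 lower-triangular mask

# The token at row 0, column 1 sits at sequence position 1. Its neighbor directly
# below (row 1, column 1) sits at position 5 and is therefore invisible to it.
can_see_below = bool(mask[1, 5])
```

In this sketch each position attends only to itself and earlier positions, so a token never sees the neighbors below it or to its right, even though they are spatially adjacent.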
This formulation has two recognized weaknesses. First, it is slow: generating an h by w token grid requires on the order of h times w sequential forward passes, so the number of sequential steps grows quadratically with the side length of the grid. Second, the raster ordering imposes a one-dimensional, unidirectional structure on data that is inherently two-dimensional and bidirectional: each token must be predicted from only those spatial neighbors that happen to precede it in the flattened sequence, which breaks the spatial locality that images exhibit [1]. Partly for these reasons, autoregressive image models lagged behind diffusion models, and diffusion approaches, including latent diffusion and the Diffusion Transformer, dominated high-fidelity image synthesis before VAR [1][6]. VAR was motivated by the observation that the success of large language models rests on next-token prediction over a sensible unit of data, and that the appropriate autoregressive "unit" for images may be an entire resolution level rather than a single token [1].
VAR replaces the single-grid tokenizer with a multi-scale, residual vector-quantized autoencoder. The encoder maps an image to a continuous feature map, which is then quantized into a sequence of token maps at progressively higher resolutions, from the coarsest map up to the full latent resolution. The token maps are produced by residual quantization, proceeding from coarse to fine: at each scale the current residual (the target features minus the reconstruction accumulated from the coarser scales) is downsampled to that scale's resolution and quantized to form that scale's token map, and the upsampled quantized result is then subtracted from the residual before moving on to the next, finer scale. This distributes the overall quantization error across scales rather than concentrating it at a single resolution [1][6]. All scales share one codebook, and decoding sums the upsampled contributions of every token map to rebuild the feature map from which the image is reconstructed [1][6]. The result is a pyramid of token maps, ordered from a 1 by 1 map up to the final 16 by 16 latent grid for ImageNet 256x256, that together represent the image [1].
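The residual quantization loop described above can be sketched in NumPy. This is a simplified illustration under stated assumptions, not the authors' implementation: it uses block averaging for downsampling and nearest-neighbor repetition for upsampling, a random toy codebook, and power-of-two scales that divide the 8 by 8 feature map evenly. All function names are hypothetical.

```python
import numpy as np

def downsample(x, size):
    """Block-average an (H, H, C) map down to (size, size, C). Assumes size divides H."""
    H, _, C = x.shape
    f = H // size
    return x.reshape(size, f, size, f, C).mean(axis=(1, 3))

def upsample(z, factor):
    """Nearest-neighbor upsample of an (h, h, C) map by an integer factor."""
    return np.repeat(np.repeat(z, factor, axis=0), factor, axis=1)

def quantize(x, codebook):
    """Index of the nearest codebook vector (squared L2) at each position of (h, h, C)."""
    dist = ((x[..., None, :] - codebook) ** 2).sum(-1)   # shape (h, h, V)
    return dist.argmin(-1)

def multiscale_residual_quantize(f, codebook, scales):
    """Coarse-to-fine residual quantization with one shared codebook."""
    H = f.shape[0]
    residual = f.copy()
    f_hat = np.zeros_like(f)                 # reconstruction accumulated so far
    token_maps = []
    for s in scales:
        idx = quantize(downsample(residual, s), codebook)
        token_maps.append(idx)
        contribution = upsample(codebook[idx], H // s)
        f_hat += contribution
        residual -= contribution             # what the finer scales still have to explain
    return token_maps, f_hat, residual

def decode_features(token_maps, codebook, full_size):
    """Decoding side: sum the upsampled contributions of every token map."""
    return sum(upsample(codebook[idx], full_size // idx.shape[0]) for idx in token_maps)

rng = np.random.default_rng(0)
features = rng.normal(size=(8, 8, 4))        # toy continuous feature map
codebook = rng.normal(size=(32, 4))          # toy shared codebook (32 entries)
token_maps, f_hat, residual = multiscale_residual_quantize(features, codebook, (1, 2, 4, 8))
```

Calling `decode_features(token_maps, codebook, 8)` rebuilds `f_hat` from the token maps alone. In this sketch `f_hat + residual` always equals the original features, so the final residual is exactly the quantization error left after all scales.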
Given this pyramid, VAR defines its autoregressive unit as a whole token map. The Transformer predicts the token map at scale k conditioned on all previously generated, coarser token maps (scales 1 through k-1). Crucially, all tokens within a single scale are predicted in parallel in one forward pass, so the number of sequential generation steps equals the number of scales rather than the number of tokens [1]. This is enforced by a block-wise causal attention mask: tokens may attend freely to all tokens at their own scale and to all tokens at coarser scales, but never to finer scales that have not yet been generated [1][6]. During training, teacher forcing with this mask lets the model learn all scales at once; at inference, KV caching is used and no mask is needed, because only already-generated scales are present in the cache [1].
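The block-wise causal mask can be written in a few lines. The sketch below is an assumption-labeled illustration: the function name `blockwise_causal_mask` and the tiny three-scale pyramid are hypothetical, and the convention that `True` means "may attend" is a choice made here.

```python
import numpy as np

def blockwise_causal_mask(scales):
    """Boolean attention mask for token maps concatenated from coarse to fine.

    Entry [i, j] is True when query token i may attend to key token j, which is
    the case when j belongs to the same scale as i or to a coarser one.
    """
    tokens_per_scale = [s * s for s in scales]
    # Scale index of every token in the concatenated sequence, e.g. [0, 1, 1, 1, 1, 2, ...].
    scale_id = np.repeat(np.arange(len(scales)), tokens_per_scale)
    return scale_id[:, None] >= scale_id[None, :]

# Toy pyramid with 1x1, 2x2 and 3x3 token maps: 1 + 4 + 9 = 14 tokens in total.
mask = blockwise_causal_mask((1, 2, 3))
```

Unlike the lower-triangular mask used for next-token prediction, this mask is lower-triangular only at the level of blocks: inside each scale's block every token sees every other token.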
Because the latent resolutions grow geometrically, a grid that would require hundreds of raster-order steps is generated in only a handful of scale steps, which is the primary source of VAR's speed advantage over token-by-token autoregression [1][6]. One consequence noted by independent analysts is that the coarsest scales fix the global layout of the image, so most of the perceptual content is determined within the first few steps [6].
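The step-count argument can be checked with simple arithmetic. The scale schedule below is an assumption: it is the ten-scale schedule ending at 16 by 16 that is commonly associated with the public reference implementation, and the article itself only states the endpoints (1 by 1 and 16 by 16). The function names are hypothetical.

```python
def raster_steps(h, w):
    """Sequential steps for token-by-token generation of an h x w grid."""
    return h * w

def next_scale_steps(scales):
    """Sequential steps for next-scale prediction: one per token map."""
    return len(scales)

def total_tokens(scales):
    """Total tokens emitted across the whole pyramid."""
    return sum(s * s for s in scales)

# Assumed schedule of token-map side lengths (not stated in this article).
scales = (1, 2, 3, 4, 5, 6, 8, 10, 13, 16)

steps_raster = raster_steps(16, 16)      # 256 sequential steps
steps_var = next_scale_steps(scales)     # 10 sequential steps
tokens_var = total_tokens(scales)        # 680 tokens in total
```

Under this assumed schedule the pyramid contains more tokens in total than the single 16 by 16 grid, yet needs far fewer sequential steps, since every token map is produced in one parallel pass.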
On the ImageNet 256x256 class-conditional benchmark, VAR substantially improved its autoregressive baseline and surpassed strong diffusion baselines. Reported headline figures are summarized below.
| Metric (ImageNet 256x256) | Value |
|---|---|
| FID (autoregressive baseline to VAR) | 18.65 to 1.80 |
| Inception Score (baseline to VAR) | 80.4 to 356.4 |
| Inference speedup vs raster-order AR | ~20x |
| Largest model | VAR-d30, ~2.0B parameters (depth 30) |
| Scaling-law correlation coefficient | near -0.998 |
These results, drawn from the original paper, indicate that VAR outperformed the Diffusion Transformer (DiT) on FID and IS while being markedly faster [1]. Beyond raw quality, the authors reported two findings echoing the large-language-model literature. First, scaling up VAR exhibited clear power-law scaling laws: test loss and FID decreased predictably as a power law in model size and compute, with linear correlation coefficients near -0.998 in log space, evidence that the LLM-style scaling behavior transfers to this image paradigm [1]. Second, VAR showed zero-shot generalization to downstream tasks it was not trained for, including image in-painting, out-painting, and class-conditional editing, by fixing known token regions and letting the model complete the rest [1].
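A correlation coefficient "in log space" refers to fitting a straight line to the logarithms of both quantities. The sketch below shows that procedure on synthetic data only: the model sizes, the exponent of -0.2, the prefactor, and the noise level are all made up for illustration and are not the paper's measurements.

```python
import numpy as np

def fit_power_law(x, y):
    """Fit log10(y) = slope * log10(x) + intercept and report the Pearson r in log space."""
    log_x, log_y = np.log10(x), np.log10(y)
    slope, intercept = np.polyfit(log_x, log_y, 1)
    r = np.corrcoef(log_x, log_y)[0, 1]
    return slope, intercept, r

# Synthetic example: hypothetical parameter counts and a noisy power-law "loss".
params = np.array([1e7, 3e7, 1e8, 3e8, 1e9, 2e9])
rng = np.random.default_rng(0)
loss = 50.0 * params ** -0.2 * np.exp(rng.normal(0.0, 0.01, params.size))

slope, intercept, r = fit_power_law(params, loss)
```

A power law y = c * x^a becomes the line log y = a log x + log c, so the fitted slope estimates the exponent, and a correlation coefficient close to -1 indicates that the points lie almost exactly on a decreasing line.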
The VAR paper received a Best Paper (Oral) award at NeurIPS 2024, one of two papers so honored that year [2][3]. The award citation highlighted the method's reframing of autoregressive image generation and its demonstration that an autoregressive model could match or exceed diffusion-based generation while exhibiting LLM-like scaling [2]. The code and pretrained models were released openly under the FoundationVision project [1].
The recognition drew unusual public attention because of an unrelated legal dispute involving the lead author. In November 2024, ByteDance filed a civil suit, reported as accepted by the Haidian District People's Court in Beijing, against Keyu Tian, a former intern, seeking RMB 8 million (approximately 1.1 million US dollars) in damages plus a public apology [4][5]. ByteDance alleged that Tian had tampered with code to disrupt the training of an internal model project at the company; the company has publicly stated it dismissed the individual for serious disciplinary violations [4][5]. Multiple reports note that the alleged interference concerned a separate commercial training effort and was not described by ByteDance as having affected the VAR research itself, and that the award committee did not rescind the prize, citing the scientific merit of the work [4][5]. The allegations are contested and had not, as of mid-2026, been adjudicated to a public final judgment in the sources reviewed here; this article states them as attributed claims and takes no position on their resolution.
VAR sits at the intersection of several lines of work. Its tokenizer builds directly on the discrete vector-quantization tradition of VQ-VAE and VQGAN, extending single-grid quantization to a residual multi-scale pyramid [1][6]. Its central claim is positioned against the Diffusion Transformer and latent diffusion, which it reports outperforming on ImageNet, reopening the autoregressive-versus-diffusion question for image generation that diffusion had appeared to settle [1][6]. Conceptually it imports the next-token-prediction and scaling-law framework of large language models into vision by changing the prediction unit from a token to a scale [1].
VAR also seeded follow-up research. Most directly, ByteDance's Infinity model (December 2024) extends the next-scale-prediction idea to high-resolution text-to-image synthesis, replacing index-wise tokens with a bitwise, "infinite-vocabulary" tokenizer and adding a bitwise self-correction mechanism; Infinity reported surpassing diffusion systems such as SD3-Medium and SDXL on benchmarks like GenEval and ImageReward, and was accepted as an oral presentation at CVPR 2025 [7][8]. A broader body of subsequent work has built on or analyzed next-scale prediction, including efficiency-oriented variants and sampling refinements, establishing VAR as a reference point for autoregressive visual generation [7][8].