Make-A-Scene
Last reviewed
Jun 3, 2026
Sources
6 citations
Review status
Source-backed
Revision
v1 · 1,675 words
Make-A-Scene is a text-to-image generation model published by Meta AI (then Meta AI Research) in 2022. Its central idea is that a user can guide an image not only with a text prompt but also with an optional scene layout, a coarse segmentation map that can be drawn as a simple sketch, giving direct control over composition that text alone cannot express. Unlike the diffusion-based systems that dominated text-to-image work soon afterward, Make-A-Scene is a token-based autoregressive transformer: it represents text, scene, and image as discrete tokens and predicts them one after another. The paper, "Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors," was written by Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, and Yaniv Taigman, posted to arXiv on 24 March 2022, and presented as an oral at the European Conference on Computer Vision (ECCV) 2022. [1][2][3]
The authors framed three gaps in then-current text-to-image methods. [1] First, controllability: most models accepted only text, so attributes such as style or color could be described but structure, form, and arrangement could be specified only loosely. Second, human perception: training losses were applied uniformly over the whole image, with no prior knowledge of which regions matter most to people, which left faces and salient objects looking weak. Third, quality and resolution: prior systems such as DALL-E and GLIDE generated at 256x256 pixels, and the authors argued that reaching 512x512 with few artifacts required a higher-quality image representation. Make-A-Scene targets all three. [1]
The defining feature is optional conditioning on a scene layout supplied alongside the text. [1] The scene is a semantic segmentation map built from a union of three complementary groups: panoptic segmentation (the general "stuff" and "things" of a photo), human segmentation, and face segmentation, plus an extra channel that maps the edges separating classes and instances. At inference time these scene tokens are either generated by the transformer itself (so the model invents a layout when the user provides only text) or extracted from an input image, which lets a person impose extra constraints on the result. [1]
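The edge channel can be pictured with a small sketch. The function below is only an illustration of the idea, not the authors' preprocessing: `edge_map` is a hypothetical helper, the label ids are made up, and the rule used (mark a pixel when its right or bottom neighbor carries a different label) is an assumption chosen for simplicity.

```python
def edge_map(labels):
    """Mark pixels whose right or bottom neighbor has a different label.

    `labels` is a 2D list of integer class/instance ids (hypothetical format).
    Returns a 2D list of 0/1 values with the same shape.
    """
    h, w = len(labels), len(labels[0])
    edges = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            here = labels[y][x]
            # A boundary exists wherever an adjacent pixel belongs to another region.
            if x + 1 < w and labels[y][x + 1] != here:
                edges[y][x] = 1
            if y + 1 < h and labels[y + 1][x] != here:
                edges[y][x] = 1
    return edges


# Toy 3x4 layout with made-up ids: 0 = sky, 1 = tree.
layout = [
    [0, 0, 0, 0],
    [0, 0, 1, 1],
    [1, 1, 1, 1],
]
edges = edge_map(layout)
for row in edges:
    print(row)
```

Stacking such a map as an extra channel next to the per-class masks gives the tokenizer an explicit signal about where one region or instance ends and the next begins.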
The conditioning is described as implicit rather than explicit. In many earlier segmentation-to-image GAN methods (for example pix2pix and SPADE) the output is tied to the segmentation by a loss. In Make-A-Scene there is no loss binding the generated image tokens to the scene tokens, so the model is free to disregard the scene and rely on text alone. The authors report that in practice both the text and the scene firmly steer the image, and that the loose coupling increases sample variety. [1] Because the layout can be drawn quickly, the same mechanism supports several editing workflows that the paper demonstrates: scene editing (replacing or adding classes, such as turning sky and trees into sea, or dropping in a sketch of a giant dog), text editing with an anchored scene, recovering from unusual or out-of-distribution prompts by sketching the uncommon arrangement, and illustrating a children's story the authors wrote to show the workflow end to end. [1]
In Meta's accompanying product-style framing, the system was described as letting people "describe and illustrate their vision through both text descriptions and freeform sketches," with a "novel intermediate representation that captures the scene layout." Meta reported that images generated from both a sketch and text were rated as better aligned with the original sketch 99.54 percent of the time. Artists including Sofia Crespo, Scott Eaton, Alexander Reben, and Refik Anadol tested the system during development. [4]
Make-A-Scene follows the two-stage recipe common to discrete-representation image models: first train tokenizers that turn images and scenes into discrete tokens, then train an autoregressive transformer over those tokens. [1] It uses two modified Vector-Quantized Variational Autoencoders, building on the VQGAN framework, and a transformer based on the GPT-3 architecture.
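The first stage relies on vector quantization, which can be sketched in a few lines. This is a generic illustration of the nearest-codebook lookup used by VQ-VAE-style tokenizers, not code from the paper; the 2-dimensional codebook, the encoder outputs, and the `quantize` helper are all made up for the example.

```python
def quantize(vector, codebook):
    """Return the index of the codebook entry nearest to `vector`.

    Distance is squared Euclidean, as in the standard VQ-VAE formulation.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))


# Hypothetical 4-entry codebook and three encoder output vectors.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
encoder_outputs = [[0.1, 0.2], [0.9, 0.8], [0.2, 0.95]]

# Stage 1: continuous features become discrete token ids.
tokens = [quantize(v, codebook) for v in encoder_outputs]

# The decoder sees only the codebook vectors selected by those ids.
reconstructed = [codebook[t] for t in tokens]
print(tokens)
```

The second stage then treats the integer ids as an ordinary vocabulary, so the transformer never handles pixels directly.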
| Component | Role |
|---|---|
| VQ-SEG | A VQ-VAE variant for the scene. Encodes and reconstructs the multi-channel semantic segmentation (panoptic, human, and face groups plus the edge map) into scene tokens. |
| VQ-IMG | The image tokenizer. A face-aware and object-aware VQ-VAE/VQGAN that encodes images into image tokens and decodes generated tokens back into pixels. |
| Scene-based transformer | A ~4-billion-parameter autoregressive transformer (GPT-3 style) that predicts scene and image tokens conditioned on the text. |
The token sequence has three consecutive, independent spaces: text tokens encoded with a byte-pair-encoding (BPE) tokenizer, then scene tokens from VQ-SEG, then image tokens from VQ-IMG, concatenated as one sequence. [1] In the reported experiments the transformer has about 4 billion parameters and operates on a sequence of 1,536 tokens: 256 text tokens, 256 scene tokens, and 1024 image tokens. The image tokens are decoded into a 256x256 or 512x512 image depending on the model variant. [1]
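The layout of that sequence can be sketched as follows. Only the three segment lengths come from the reported experiments; the vocabulary sizes, the padding id, the offset scheme that keeps the three spaces from colliding, and the `build_sequence` helper are assumptions made for illustration.

```python
TEXT_LEN, SCENE_LEN, IMAGE_LEN = 256, 256, 1024  # segment lengths reported in the paper
TEXT_VOCAB, SCENE_VOCAB = 50_000, 1_024          # hypothetical vocabulary sizes
PAD_ID = 0                                       # hypothetical padding id (text space)


def build_sequence(text_ids, scene_ids, image_ids):
    """Concatenate [text | scene | image] into one fixed-length id sequence."""
    if len(text_ids) > TEXT_LEN:
        raise ValueError("text prompt is longer than the text segment")
    if len(scene_ids) != SCENE_LEN or len(image_ids) != IMAGE_LEN:
        raise ValueError("scene and image segments must be full length")
    # Right-pad the text so every sequence has the same layout.
    text = list(text_ids) + [PAD_ID] * (TEXT_LEN - len(text_ids))
    # Shift scene and image ids so the three token spaces never overlap.
    scene = [TEXT_VOCAB + s for s in scene_ids]
    image = [TEXT_VOCAB + SCENE_VOCAB + i for i in image_ids]
    return text + scene + image


seq = build_sequence([5, 17, 42], [3] * SCENE_LEN, [7] * IMAGE_LEN)
print(len(seq))
```

With this layout, positions 0 to 255 hold text, 256 to 511 hold the scene, and 512 to 1535 hold the image, so an autoregressive model predicting left to right sees the prompt first, then the layout, then the picture.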
A recurring theme in the paper is "human priors": the observation that image quality is upper-bounded by the tokenizer's reconstruction quality, and that uniform losses waste capacity on regions people barely notice while under-serving regions they scrutinize. [1] Make-A-Scene adds region-specific losses on top of the standard VQGAN objective.
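The effect of a region-specific term can be illustrated with a deliberately simplified stand-in: a per-pixel L1 reconstruction loss in which pixels inside a mask (for example a detected face) receive a larger weight. This is not the loss formulation used in the paper; `region_weighted_l1`, the weight of 4.0, and the toy values are all assumptions.

```python
def region_weighted_l1(pred, target, mask, region_weight=4.0):
    """Weighted mean absolute error over a 2D grid.

    Pixels where `mask` is truthy count `region_weight` times as much
    as the rest; with an all-zero mask this reduces to plain mean L1.
    """
    total, weight_sum = 0.0, 0.0
    for p_row, t_row, m_row in zip(pred, target, mask):
        for p, t, m in zip(p_row, t_row, m_row):
            w = region_weight if m else 1.0
            total += w * abs(p - t)
            weight_sum += w
    return total / weight_sum


pred = [[0.5, 0.5], [0.5, 0.5]]
target = [[0.0, 0.5], [0.5, 1.0]]
no_mask = [[0, 0], [0, 0]]
face_mask = [[1, 0], [0, 0]]  # hypothetical: top-left pixel lies on a face

uniform = region_weighted_l1(pred, target, no_mask)
weighted = region_weighted_l1(pred, target, face_mask)
print(uniform, weighted)
```

The two wrong pixels have the same absolute error, but the weighted loss is higher because one of them falls in the masked region, which pushes the tokenizer to spend more capacity there.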
Make-A-Scene adapts classifier-free guidance, a technique introduced for diffusion models, to the autoregressive transformer setting. [1] The transformer is fine-tuned with the text prompt randomly replaced by padding tokens, so it can also sample unconditionally. At inference the model runs a conditional stream (conditioned on text) and an unconditional stream (conditioned on empty padding), then combines their logits with a guidance scale to bias generation toward the prompt. The authors note this removes the need for the post-generation re-ranking and filtering that systems like DALL-E used (often via CLIP), which makes generation faster and improves text adherence. [1]
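The logit combination can be sketched as below. The interpolation form (unconditional logits plus a scaled difference) is the usual way classifier-free guidance is written; the three-entry vocabulary, the logit values, and the scale of 3.0 are illustrative assumptions, and `guided_logits` is a hypothetical helper name.

```python
import math


def guided_logits(cond, uncond, scale):
    """Push logits away from the unconditional stream, toward the conditional one."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy next-token logits over a 3-token vocabulary.
cond = [2.0, 1.0, 0.0]    # stream conditioned on the text prompt
uncond = [1.0, 1.0, 1.0]  # stream conditioned on padding only

mixed = guided_logits(cond, uncond, scale=3.0)
print(mixed)
print(softmax(cond)[0], softmax(mixed)[0])
```

A scale of 1 reproduces the conditional logits and a scale of 0 reproduces the unconditional ones; values above 1 sharpen the distribution toward tokens the prompt makes more likely, which is what replaces a separate re-ranking pass.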
Make-A-Scene was trained on roughly 35 million text-image pairs drawn from CC12M, Conceptual Captions, and subsets of YFCC100M and RedCaps; the VQ-SEG and VQ-IMG tokenizers were trained on CC12M, Conceptual Captions, and MS-COCO. [1] Evaluation used Fréchet Inception Distance (FID) over a 30,000-image subset of MS-COCO validation prompts (with no re-ranking) and human preference studies on Amazon Mechanical Turk covering image quality, photorealism, and text alignment. The authors treated human evaluation as the higher authority and FID as a secondary metric, and deliberately avoided the Inception Score. [1]
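For reference, FID is conventionally defined as the Fréchet distance between two Gaussians fitted to Inception features of real and generated images. The symbols below are the standard ones rather than notation from the paper: $\mu_r, \Sigma_r$ are the mean and covariance of the real-image features, and $\mu_g, \Sigma_g$ those of the generated-image features.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

The score is zero only when both feature distributions have identical means and covariances, which is why lower values indicate generated images that are statistically closer to the reference set.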
The table below collects the headline FID numbers reported by the authors (lower is better). [1]
| Model | FID (unfiltered) | FID (trained without MS-COCO, "filtered") |
|---|---|---|
| DALL-E (paper's 4B re-implementation) | N/A | 34.60 |
| GLIDE | N/A | 12.24 |
| XMC-GAN | 9.33 | N/A |
| LAFITE | 8.12 | 26.94 |
| Make-A-Scene (256x256) | 7.55 | 11.84 |
| Make-A-Scene, best model with scene given as input | 4.69 | N/A |
| Ground-truth lower bound (MS-COCO train vs. val) | 2.47 | N/A |
The reported state-of-the-art FID is 7.55 in the unfiltered setting and 11.84 when the model is trained without the MS-COCO training set. [1] Providing a scene as an additional input lowered FID to 4.69, approaching the loose 2.47 ground-truth lower bound the authors computed between MS-COCO subsets. In human preference comparisons against their DALL-E re-implementation and CogView, Make-A-Scene was favored across quality, photorealism, and text alignment, and an ablation showed each added element (scene tokens, face-aware training, classifier-free guidance, object-aware 512x512 training) contributing to the gains. [1]
Make-A-Video, Meta's text-to-video system announced on 29 September 2022, was positioned as a direct follow-on. Meta's blog states that "Make-A-Video follows our announcement earlier this year of Make-A-Scene, a multimodal generative AI method that gives people more control over the AI generated content they create." [5] Make-A-Scene's demonstration that text plus freeform sketches could drive high-fidelity image generation was the precursor Meta extended into motion. [4][5]
The image tokenizer is the most directly reused piece of Make-A-Scene. Meta's 2024 mixed-modal model Chameleon builds its image tokenizer on the Make-A-Scene work, stating that it trains "a new image tokenizer based on Gafni et al. (2022), which encodes a 512x512 image into 1024 discrete tokens from a codebook of size 8192." [6] Following the human-prior idea, Chameleon's tokenizer training upsampled images containing human faces, and the team noted the same kind of limitation Make-A-Scene's design implies: difficulty reconstructing images with large amounts of text. [6] This lineage carried Make-A-Scene's token-based, VQ-VAE-style image representation into a later generation of early-fusion multimodal foundation models. [6]
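The figures in that quotation imply a fixed compression budget, which a short calculation makes concrete. Only the three constants come from the quoted sentence; the square token grid and the 8-bit RGB input are assumptions made for the arithmetic.

```python
import math

IMAGE_SIDE = 512      # pixels per side, from the quoted description
NUM_TOKENS = 1024     # tokens per image, from the quoted description
CODEBOOK_SIZE = 8192  # codebook entries, from the quoted description

# Assumption: the tokens form a square grid over the image.
grid_side = math.isqrt(NUM_TOKENS)
patch_side = IMAGE_SIDE // grid_side

# Each token picks one of 8192 entries, so it carries log2(8192) bits.
bits_per_token = CODEBOOK_SIZE.bit_length() - 1
token_bits = NUM_TOKENS * bits_per_token

# Assumption: the input is 8-bit RGB.
raw_bits = IMAGE_SIDE * IMAGE_SIDE * 3 * 8
ratio = raw_bits / token_bits

print(grid_side, patch_side, bits_per_token, token_bits, round(ratio, 1))
```

Under these assumptions each token summarizes a 16x16 pixel patch in 13 bits, a reduction of several hundred times relative to raw pixels, which is consistent with fine detail such as dense text being hard to reconstruct.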