Make-A-Scene


Updated Jul 16, 2026

Make-A-Scene is a text-to-image generation model published by Meta AI (then Meta AI Research) in 2022. Its central idea is that a user can guide an image not only with a text prompt but also with an optional scene layout: a coarse segmentation map, which can be drawn as a simple sketch, that gives direct control over composition in a way text alone cannot. Unlike the diffusion-based systems that dominated text-to-image work soon afterward, Make-A-Scene is a token-based autoregressive transformer: it represents text, scene, and image as discrete tokens and predicts them one after another. The paper, "Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors," was written by Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, and Yaniv Taigman, posted to arXiv on 24 March 2022, and presented as an oral at the European Conference on Computer Vision (ECCV) 2022. ^[1]^[2]^[3]

Motivation

The authors framed three gaps in then-current text-to-image methods. ^[1] First, controllability: most models accepted only text, so attributes such as style or color could be described but structure, form, and arrangement could be specified only loosely. Second, human perception: training losses were applied uniformly over the whole image, with no prior knowledge of which regions matter most to people, which left faces and salient objects looking weak. Third, quality and resolution: prior systems such as DALL-E and GLIDE generated at 256x256 pixels, and the authors argued that reaching 512x512 with few artifacts required a higher-quality image representation. Make-A-Scene targets all three. ^[1]

Scene control mechanism

The defining feature is optional conditioning on a scene layout supplied alongside the text. ^[1] The scene is a semantic segmentation map built from a union of three complementary groups: panoptic segmentation (the general "stuff" and "things" of a photo), human segmentation, and face segmentation, plus an extra channel that maps the edges separating classes and instances. At inference time these scene tokens are either generated by the transformer itself (so the model invents a layout when the user provides only text) or extracted from an input image, which lets a person impose extra constraints on the result. ^[1]
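The layout described above can be pictured as a stack of binary class channels plus one edge channel. The following is a minimal sketch under stated assumptions, not the paper's preprocessing code: the function name `scene_channels`, the toy class ids, and the neighbour-comparison rule for edges are all hypothetical, and the sketch separates classes only, not instances.

```python
def scene_channels(labels, num_classes):
    """Turn a 2D grid of class ids into one-hot class channels plus one edge channel."""
    height, width = len(labels), len(labels[0])

    # One binary channel per class: 1 where the pixel belongs to that class.
    channels = [
        [[1 if labels[y][x] == c else 0 for x in range(width)] for y in range(height)]
        for c in range(num_classes)
    ]

    # Edge channel (assumed rule): a pixel is an edge if its right or lower
    # neighbour carries a different class id.
    edges = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            right_differs = x + 1 < width and labels[y][x + 1] != labels[y][x]
            below_differs = y + 1 < height and labels[y + 1][x] != labels[y][x]
            if right_differs or below_differs:
                edges[y][x] = 1

    return channels + [edges]


# Toy 3x4 layout with hypothetical ids: 0 = sky, 1 = tree, 2 = person.
layout = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [2, 1, 1, 0],
]
maps = scene_channels(layout, num_classes=3)
```

Here `maps` holds three class channels followed by the edge channel; a tokenizer such as VQ-SEG would take a multi-channel map of this kind as its input.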

The conditioning is described as implicit rather than explicit. In many earlier segmentation-to-image GAN methods (for example pix2pix and SPADE) the output is tied to the segmentation by a loss. In Make-A-Scene there is no loss binding the generated image tokens to the scene tokens, so the model is free to disregard the scene and rely on text alone. The authors report that in practice both the text and the scene strongly steer the image, and that the loose coupling increases sample variety. ^[1]

Because the layout can be drawn quickly, the same mechanism supports several editing workflows that the paper demonstrates: scene editing (replacing or adding classes, such as turning sky and trees into sea, or dropping in a sketch of a giant dog), text editing with an anchored scene, recovering from unusual or out-of-distribution prompts by sketching the uncommon arrangement, and illustrating a children's story the authors wrote to show the workflow end to end. ^[1]

In Meta's accompanying product-style framing, the system was described as letting people "describe and illustrate their vision through both text descriptions and freeform sketches," with a "novel intermediate representation that captures the scene layout." Meta reported that, when human evaluators compared an image generated from both a sketch and text with one generated from text alone, the sketch-and-text image was rated as better aligned with the original sketch 99.54 percent of the time. Artists including Sofia Crespo, Scott Eaton, Alexander Reben, and Refik Anadol tested the system during development. ^[4]

Architecture

Make-A-Scene follows the two-stage recipe common to discrete-representation image models: first train tokenizers that turn images and scenes into discrete tokens, then train an autoregressive transformer over those tokens. ^[1] It uses two modified Vector-Quantized Variational Autoencoders, building on the VQGAN framework, and a transformer based on the GPT-3 architecture.

Component	Role
VQ-SEG	A VQ-VAE variant for the scene. Encodes and reconstructs the multi-channel semantic segmentation (panoptic, human, and face groups plus the edge map) into scene tokens.
VQ-IMG	The image tokenizer. A face-aware and object-aware VQ-VAE/VQGAN that encodes images into image tokens and decodes generated tokens back into pixels.
Scene-based transformer	A ~4-billion-parameter autoregressive transformer (GPT-3 style) that predicts scene and image tokens conditioned on the text.

The token sequence has three consecutive, independent spaces: text tokens encoded with a byte-pair-encoding (BPE) tokenizer, then scene tokens from VQ-SEG, then image tokens from VQ-IMG, concatenated as one sequence. ^[1] In the reported experiments the transformer has about 4 billion parameters and generates a sequence of 256 text tokens, 256 scene tokens, and 1024 image tokens, which are decoded into a 256x256 or 512x512 image depending on the model variant. ^[1]
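The sequence layout above can be made concrete with a little arithmetic. This is a minimal sketch, assuming the three segments are simply concatenated in the stated order and that the 1024 image tokens form a square grid; `Segment`, `layout_segments`, and `downsample_factor` are hypothetical names used only for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    name: str
    start: int  # inclusive position in the full sequence
    end: int    # exclusive position in the full sequence


def layout_segments(lengths):
    """Assign consecutive position ranges to each named token space."""
    segments, offset = [], 0
    for name, length in lengths:
        segments.append(Segment(name, offset, offset + length))
        offset += length
    return segments


# Token counts as reported in the article: text, then scene, then image.
LENGTHS = [("text", 256), ("scene", 256), ("image", 1024)]
SEGMENTS = layout_segments(LENGTHS)
TOTAL_TOKENS = SEGMENTS[-1].end


def downsample_factor(image_side, num_tokens):
    """Spatial reduction per side, assuming a square token grid (an assumption)."""
    grid_side = math.isqrt(num_tokens)
    if grid_side * grid_side != num_tokens or image_side % grid_side != 0:
        raise ValueError("token count must form a square grid that divides the image side")
    return image_side // grid_side
```

Under these assumptions the full sequence is 1536 tokens long, the image tokens form a 32x32 grid, and the tokenizer reduces each side by a factor of 8 for 256x256 images and 16 for 512x512 images.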

Human priors in the tokenizer

A recurring theme in the paper is "human priors": the observation that image quality is upper-bounded by the tokenizer's reconstruction quality, and that uniform losses waste capacity on regions people barely notice while under-serving regions they scrutinize. ^[1] Make-A-Scene adds region-specific losses on top of the standard VQGAN objective.

Face-aware VQ. Faces are located using the segmentation information, and up to a fixed number of face crops per image are passed through a pretrained face-embedding network. A feature-matching loss between reconstructed and ground-truth face activations adds "awareness" of faces and pushes the tokenizer toward higher-quality face reconstruction. ^[1]
Object-aware VQ. The same idea is generalized to objects defined as "things" in the panoptic categories, using a pretrained VGG network trained on ImageNet for the feature-matching loss. Running this loss over image crops let the authors raise output resolution from 256x256 to 512x512 by adding a single extra downsample and upsample layer to VQ-IMG. ^[1]
Face emphasis in the scene space. Small face parts such as eyes, nose, and lips tended to disappear when reconstructing the segmentation map, so a weighted binary cross-entropy loss raises their importance and their edges are folded into the segmentation edge map. ^[1]
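The weighted binary cross-entropy idea in the last item can be sketched as follows. This is an illustration only, assuming one scalar weight per segmentation channel and a plain mean over pixels; the function name `weighted_bce`, the weight values, and the clamping constant are hypothetical and are not taken from the paper.

```python
import math


def weighted_bce(target, predicted, channel_weights, eps=1e-7):
    """Mean binary cross-entropy over all pixels, scaled per channel.

    target and predicted are lists of channels; each channel is a flat list
    of per-pixel values (0/1 targets, probabilities in [0, 1] for predictions).
    """
    total, count = 0.0, 0
    for t_channel, p_channel, weight in zip(target, predicted, channel_weights):
        for t, p in zip(t_channel, p_channel):
            p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
            total += -weight * (t * math.log(p) + (1 - t) * math.log(1 - p))
            count += 1
    return total / count


# Two toy channels: channel 0 is a large region, channel 1 a small face part.
target = [[1, 0], [1, 0]]
predicted = [[0.9, 0.1], [0.5, 0.5]]

uniform_loss = weighted_bce(target, predicted, channel_weights=[1.0, 1.0])
face_weighted_loss = weighted_bce(target, predicted, channel_weights=[1.0, 4.0])
```

Raising the weight on the poorly reconstructed channel increases its share of the loss, which is the mechanism the paragraph describes for keeping small face parts from disappearing.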

Classifier-free guidance for transformers

Make-A-Scene adapts classifier-free guidance, a technique introduced for diffusion models, to the autoregressive transformer setting. ^[1] The transformer is fine-tuned with the text prompt randomly replaced by padding tokens, so it can also sample unconditionally. At inference the model runs a conditional stream (conditioned on text) and an unconditional stream (conditioned on empty padding), then combines their logits with a guidance scale to bias generation toward the prompt. The authors note this removes the need for the post-generation re-ranking and filtering that systems like DALL-E used (often via CLIP), which makes generation faster and improves text adherence. ^[1]
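The two steps described above, dropping the text during fine-tuning and mixing two logit streams at inference, can be sketched like this. It is a minimal sketch assuming the commonly used guidance form, unconditional logits plus a scale times the conditional-minus-unconditional difference; the function names, the drop probability, and the guidance scale are hypothetical, and plain lists stand in for model outputs.

```python
import math
import random


def maybe_drop_text(text_tokens, pad_id, drop_prob, rng):
    """Training-time step: sometimes replace the whole prompt with padding."""
    if rng.random() < drop_prob:
        return [pad_id] * len(text_tokens)
    return list(text_tokens)


def guided_logits(cond, uncond, scale):
    """Inference-time step: push the logits away from the unconditional stream."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(cond, uncond, scale, rng):
    """Sample the next token id from the guided distribution."""
    probs = softmax(guided_logits(cond, uncond, scale))
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


rng = random.Random(0)
next_token = sample_token(cond=[10.0, 0.0], uncond=[0.0, 0.0], scale=5.0, rng=rng)
```

With a scale of 1 the guided logits equal the conditional logits, and with a scale of 0 they equal the unconditional ones; larger scales exaggerate whatever the text prompt changes about the prediction.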

Results

Make-A-Scene was trained on roughly 35 million text-image pairs drawn from CC12M, Conceptual Captions, and subsets of YFCC100M and RedCaps; the VQ-SEG and VQ-IMG tokenizers were trained on CC12M, Conceptual Captions, and MS-COCO. ^[1] Evaluation used Fréchet Inception Distance (FID) over a 30,000-image subset of MS-COCO validation prompts (with no re-ranking) and human preference studies on Amazon Mechanical Turk covering image quality, photorealism, and text alignment. The authors treated human evaluation as the higher authority and FID as a secondary metric, and deliberately avoided the Inception Score. ^[1]
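FID measures the Fréchet distance between two Gaussians fitted to feature vectors of real and generated images. The sketch below is a deliberate simplification, not the metric used in the paper: it assumes diagonal covariances, in which case the distance reduces to squared differences of per-dimension means and standard deviations, and it takes toy feature lists rather than Inception activations. The function name `diagonal_frechet_distance` is hypothetical.

```python
import math
from statistics import fmean, variance


def diagonal_frechet_distance(real_features, generated_features):
    """Fréchet distance between two Gaussians with diagonal covariance.

    Each argument is a list of equal-length feature vectors (at least two
    per set). The full metric uses complete covariance matrices and a matrix
    square root; this version ignores cross-dimension correlations.
    """
    dims = len(real_features[0])
    distance = 0.0
    for d in range(dims):
        real = [vector[d] for vector in real_features]
        generated = [vector[d] for vector in generated_features]
        mean_gap = fmean(real) - fmean(generated)
        std_gap = math.sqrt(variance(real)) - math.sqrt(variance(generated))
        distance += mean_gap ** 2 + std_gap ** 2
    return distance


# Toy features: the generated set is the real set shifted by 1 in dimension 0.
real = [[0.0, 1.0], [2.0, 3.0]]
generated = [[1.0, 1.0], [3.0, 3.0]]
shifted_distance = diagonal_frechet_distance(real, generated)
```

Identical feature sets give a distance of zero, and the distance grows as the means or spreads of the two sets diverge, which is why lower FID values indicate generated images whose statistics are closer to those of real images.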

The table below collects the headline FID numbers reported by the authors (lower is better). ^[1]

Model	FID (unfiltered)	FID (trained without MS-COCO, "filtered")
DALL-E (paper's 4B re-implementation)	N/A	34.60
GLIDE	N/A	12.24
XMC-GAN	9.33	N/A
LAFITE	8.12	26.94
Make-A-Scene (256x256)	7.55	11.84
Make-A-Scene, best model with scene given as input	4.69	N/A
Ground-truth lower bound (MS-COCO train vs. val)	2.47	N/A

Make-A-Scene's reported FID of 7.55 in the unfiltered setting, and 11.84 when trained without the MS-COCO training set, was the lowest among the compared models in each setting and state of the art at the time. ^[1] Providing a scene as an additional input lowered FID to 4.69, approaching the loose 2.47 ground-truth lower bound the authors computed between MS-COCO subsets. In human preference comparisons against their DALL-E re-implementation and CogView, Make-A-Scene was favored across quality, photorealism, and text alignment, and an ablation showed each added element (scene tokens, face-aware training, classifier-free guidance, object-aware 512x512 training) contributing to the gains. ^[1]

Relationship to other Meta systems

Make-A-Video

Make-A-Video, Meta's text-to-video system announced on 29 September 2022, was positioned as a direct follow-on. Meta's blog states that "Make-A-Video follows our announcement earlier this year of Make-A-Scene, a multimodal generative AI method that gives people more control over the AI generated content they create." ^[5] Make-A-Scene's demonstration that text plus freeform sketches could drive high-fidelity image generation was the precursor Meta extended into motion. ^[4]^[5]

Chameleon

The image tokenizer is the most directly reused piece of Make-A-Scene. Meta's 2024 mixed-modal model Chameleon builds its image tokenizer on the Make-A-Scene work, stating that it trains "a new image tokenizer based on Gafni et al. (2022), which encodes a 512x512 image into 1024 discrete tokens from a codebook of size 8192." ^[6] Following the human-prior idea, Chameleon's tokenizer training upsampled images containing human faces, and the team noted a weakness of the tokenizer: difficulty reconstructing images that contain large amounts of text. ^[6] This lineage carried Make-A-Scene's token-based, VQ-VAE-style image representation into a later generation of early-fusion multimodal foundation models. ^[6]

References

1. Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, Yaniv Taigman. "Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors." arXiv:2203.13131, 24 March 2022. https://arxiv.org/abs/2203.13131
2. ECCV 2022 / ECVA. "Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors" (open-access proceedings PDF). https://www.ecva.net/papers/eccv_2022/papers_ECCV/papers/136750087.pdf
3. Springer / ECCV 2022. "Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors," Computer Vision - ECCV 2022, Lecture Notes in Computer Science, doi:10.1007/978-3-031-19784-0_6. https://link.springer.com/chapter/10.1007/978-3-031-19784-0_6
4. Meta AI. "Greater creative control for AI image generation" (Make-A-Scene), 14 July 2022. https://ai.meta.com/blog/greater-creative-control-for-ai-image-generation/
5. Meta AI. "Introducing Make-A-Video: An AI system that generates videos from text," 29 September 2022. https://ai.meta.com/blog/generative-ai-text-to-video/
6. Chameleon Team, FAIR at Meta. "Chameleon: Mixed-Modal Early-Fusion Foundation Models." arXiv:2405.09818, 2024. https://arxiv.org/abs/2405.09818
