Make-A-Video
Last reviewed
Jun 3, 2026
Sources
6 citations
Review status
Source-backed
Revision
v1 · 1,309 words
Make-A-Video is a text-to-video generation system from Meta AI, announced on September 29, 2022. Its defining idea is that a model can learn to generate moving images without ever seeing a paired example of text matched to video. Instead, Make-A-Video learns what the world looks like and how language describes it from text-image pairs, then learns how things move from ordinary video clips that carry no captions. The accompanying paper, "Make-A-Video: Text-to-Video Generation without Text-Video Data," was posted to arXiv by Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, Devi Parikh, Sonal Gupta, and Yaniv Taigman, and was later accepted as a poster at ICLR 2023.[1][2][3]
At launch it was a research project rather than a shipping product. Meta published a project page with example clips and a sign-up form for anyone interested in future access, and said it planned to release a demo experience later.[3][4]
Collecting large datasets of video with accurate text descriptions is hard and expensive, which had held back text-to-video work relative to the rapid progress in text-to-image generation around 2022. Make-A-Video sidesteps the problem by splitting the learning into two parts that use data that already exists in abundance.[1]
First, a text-to-image diffusion model provides the visual and language knowledge. This part is trained on text-image pairs, so it learns how objects, scenes, and styles look and how they are named. Make-A-Video reuses the prior-plus-decoder design of earlier CLIP-based text-to-image diffusion work, which the paper credits to Ramesh et al. (2022): a prior network maps text to a CLIP image embedding, and a decoder turns that embedding into pixels.[1]
Second, the model learns motion from unlabeled video. Because the text understanding is already supplied by the image side, the video clips do not need captions; the network only has to learn how realistic motion unfolds over time. This separation is what lets the system avoid paired text-video data entirely.[1][3]
To turn an image generator into a video generator, Make-A-Video adds temporal structure to the existing spatial network. The decoder's two-dimensional convolution and attention layers are extended with new temporal layers: pseudo-3D convolution layers, in which each 2D spatial convolution is followed by a 1D temporal convolution, and temporal attention layers paired with the existing spatial attention layers. Factoring space and time this way keeps the new model close to the pretrained image weights and avoids the cost of full 3D convolutions over the whole clip.[1]
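The factorization can be illustrated with a toy, single-channel sketch in plain Python. This is not the paper's implementation: the function names, the 3x3 and length-3 kernel sizes, the zero padding, and the identity temporal kernel are all assumptions chosen for illustration. It shows that a 2D filter applied per frame followed by a 1D filter along time behaves exactly like the image-only filter when the temporal kernel is the identity, and it compares weight counts under the simplifying assumption of one input and one output channel.

```python
# Toy single-channel pseudo-3D convolution (illustrative only, hypothetical names).
# A video is a list of frames; a frame is a list of rows of floats.

def conv2d_same(frame, kernel):
    """Apply a 3x3 spatial filter to one frame with zero padding."""
    h, w = len(frame), len(frame[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += frame[yy][xx] * kernel[dy + 1][dx + 1]
            out[y][x] = acc
    return out


def conv1d_time_same(frames, kernel):
    """Apply a length-3 filter along the time axis, independently per pixel."""
    t, h, w = len(frames), len(frames[0]), len(frames[0][0])
    out = [[[0.0] * w for _ in range(h)] for _ in range(t)]
    for i in range(t):
        for dt in (-1, 0, 1):
            j = i + dt
            if 0 <= j < t:
                k = kernel[dt + 1]
                for y in range(h):
                    for x in range(w):
                        out[i][y][x] += k * frames[j][y][x]
    return out


def pseudo3d_conv(video, spatial_kernel, temporal_kernel):
    """Spatial filter on every frame, then a temporal filter across frames."""
    spatial = [conv2d_same(frame, spatial_kernel) for frame in video]
    return conv1d_time_same(spatial, temporal_kernel)


def weights_per_channel_pair(k=3):
    """Weights for one input/output channel pair: full 3D vs. factored."""
    return k * k * k, k * k + k


if __name__ == "__main__":
    video = [[[float(t + 4 * y + x) for x in range(4)] for y in range(4)]
             for t in range(3)]
    box = [[1.0 / 9.0] * 3 for _ in range(3)]
    identity_in_time = [0.0, 1.0, 0.0]  # assumed starting point for the new layer
    same = pseudo3d_conv(video, box, identity_in_time) == [
        conv2d_same(frame, box) for frame in video
    ]
    print("identity temporal kernel preserves the image-only output:", same)
    print("weights (full 3D, factored):", weights_per_channel_pair(3))
```

With an identity temporal kernel the added layer changes nothing, so a network extended this way can start from the behavior of the pretrained image model and learn temporal mixing from there. That starting point is a design assumption of this sketch, not a detail stated above.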
The full system is a cascade of several networks, which Meta later described as five models in total. Generation runs through these stages in sequence.[1][5]
| Stage | Component | Role | Operates on |
|---|---|---|---|
| 1 | Prior (P) | Maps input text to a CLIP image embedding | Embeddings, no spatial resolution |
| 2 | Spatiotemporal decoder (Dt) | Generates 16 low-resolution frames from the embedding | 16 frames at 64x64 |
| 3 | Frame interpolation network | Increases the frame count to raise the effective frame rate | 16 frames to 76 frames at 64x64 |
| 4 | Spatiotemporal super-resolution (SRl) | Upscales each frame while keeping motion consistent | 64x64 to 256x256 |
| 5 | Spatial super-resolution (SRh) | Final spatial upscaling, applied per frame | 256x256 to 768x768 |
The decoder first produces 16 frames at 64x64 pixels. A separate masked frame-interpolation network then fills in intermediate frames; using a frame skip of 5, it expands a 16-frame clip to 76 frames, computed as (16 - 1) x 5 + 1. The first super-resolution network operates across space and time together to upscale from 64x64 to 256x256, so it can keep the added detail consistent from frame to frame. The final super-resolution network works on individual frames to reach 768x768.[1]
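The frame and resolution arithmetic of the cascade can be checked with a short sketch. It tracks only clip shapes and does not model any network. The `ClipShape` type and the function names are hypothetical, and the stage order follows the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClipShape:
    frames: int
    height: int
    width: int


def interpolated_frame_count(frames, frame_skip):
    """Frames after filling the gaps between consecutive frames.

    Each of the (frames - 1) gaps is expanded to frame_skip steps,
    and the final frame is counted once at the end.
    """
    return (frames - 1) * frame_skip + 1


def cascade_shapes(base_frames=16, base_res=64, frame_skip=5,
                   srl_res=256, srh_res=768):
    """Return the clip shape after each stage that produces pixels."""
    decoder = ClipShape(base_frames, base_res, base_res)
    interp = ClipShape(interpolated_frame_count(base_frames, frame_skip),
                       base_res, base_res)
    srl = ClipShape(interp.frames, srl_res, srl_res)
    srh = ClipShape(interp.frames, srh_res, srh_res)
    return [("decoder", decoder), ("interpolation", interp),
            ("SRl", srl), ("SRh", srh)]


if __name__ == "__main__":
    for name, shape in cascade_shapes():
        print(f"{name:>13}: {shape.frames} frames at {shape.height}x{shape.width}")
```

Changing `frame_skip` or `base_frames` shows how the final frame count depends on the interpolation setting, while the two super-resolution stages change only the spatial size.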
During training, the model is conditioned on frame rate, with clips sampled at a random rate between 1 and 30 frames per second. At inference, the interpolation stage lets the system target a chosen frame rate and produce smoother motion over more frames.[1]
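One way to picture frame-rate conditioning is as a data-loading step that picks a random rate and then selects frames from the source video at that rate. The sketch below is an assumption about how such sampling could work, not the procedure from the paper: the function name, the uniform choice of rate, the cap at the source video's native rate, and the rounding of frame positions are all hypothetical.

```python
import math
import random


def sample_clip_indices(total_frames, native_fps, num_frames=16,
                        max_fps=30, rng=None):
    """Pick a random target fps and the source-frame indices that realize it.

    Returns (fps, indices). The fps value is what a model would be
    conditioned on; indices select num_frames frames from the source video.
    """
    if total_frames < num_frames:
        raise ValueError("source video has fewer frames than requested")
    rng = rng or random.Random()

    # Do not sample faster than the source, which would repeat frames.
    highest = min(max_fps, native_fps)
    # Do not sample so slowly that the clip runs past the end of the source.
    lowest = max(1, math.ceil((num_frames - 1) * native_fps / (total_frames - 1)))
    if lowest > highest:
        raise ValueError("source video is too short for any allowed fps")

    fps = rng.randint(lowest, highest)
    stride = native_fps / fps              # source frames per sampled frame
    span = (num_frames - 1) * stride       # source frames covered by the clip
    start = rng.randint(0, int(total_frames - 1 - span))
    indices = [start + round(i * stride) for i in range(num_frames)]
    return fps, indices


if __name__ == "__main__":
    fps, indices = sample_clip_indices(total_frames=300, native_fps=30,
                                       rng=random.Random(0))
    print("conditioning fps:", fps)
    print("frame indices:", indices)
```

A low sampled rate spreads the 16 frames over a long stretch of the source video, so the model sees large motion between frames. A high rate packs them into a short stretch with small motion.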
The split design is mirrored in the datasets. The text-to-image backbone was trained on a 2.3-billion-image subset of LAION-5B, filtered to remove NSFW imagery, toxic words in the text, and images with watermarks. The temporal components learned motion from unlabeled video: the full WebVid-10M dataset and a 10-million-clip subset of HD-VILA-100M. None of the video used in this part carried text aligned to the clips.[1]
The project page demonstrated four uses: generating a video from a text description, animating a still image by adding motion, interpolating between two input images to create the motion between them, and producing variations of an existing video.[4]
| Property | Value |
|---|---|
| Modality | Text to video (plus image animation and video variation) |
| Final resolution | 768x768 |
| Frames per clip | 16 generated, interpolated to 76 |
| Frame-rate conditioning | 1 to 30 fps during training |
| Backbone | Text-to-image diffusion model (CLIP prior plus decoder) |
| Paired text-video data | None used |
| Status at launch | Research project, sign-up for future access |
The paper reported state-of-the-art results for its time using both automatic metrics and human studies. On UCF-101 in a zero-shot setting it reported an Inception Score of 33.00 (higher is better) and a Fréchet Video Distance (FVD) of 367.23 (lower is better). On MSR-VTT it reported a Fréchet Inception Distance (FID) of 13.17 (lower is better) and a CLIPSIM text-video similarity of 0.3049 (higher is better). In human evaluations against the contemporary CogVideo system, raters preferred Make-A-Video on both visual quality and faithfulness to the prompt.[1]
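CLIPSIM is commonly described as the average similarity between the CLIP embedding of the prompt and the CLIP embeddings of the generated frames. The sketch below shows only that averaging step, using hand-written toy vectors in place of real CLIP embeddings. The function names are hypothetical, and the exact preprocessing and scaling behind the reported 0.3049 are not reproduced here.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def clipsim(text_embedding, frame_embeddings):
    """Mean text-frame cosine similarity over all frames of one video."""
    scores = [cosine(text_embedding, frame) for frame in frame_embeddings]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy stand-ins for embeddings; a real evaluation would obtain these
    # from a CLIP text encoder and image encoder.
    text = [1.0, 0.0]
    frames = [[1.0, 0.0], [0.0, 1.0]]
    print(clipsim(text, frames))
```

Because the score is averaged per frame, it measures how well individual frames match the prompt and says little about motion quality, which is one reason the paper pairs it with FVD and human studies.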
Meta's project page summarized its own user studies as roughly a 3x improvement over the previous state of the art in both how well the output represented the text and in overall quality.[4]
Because realistic generated video can be misused, Meta framed the launch as a deliberately cautious research preview. The project applied filters intended to reduce harmful content and added a visible watermark to every generated video so viewers could tell the footage was AI-made. Meta said it would apply its responsible-AI process before any wider release rather than open the system to the public immediately.[3][4]
Make-A-Video was an early step in a line of Meta generative-video research. Its image-generation lineage traces to Make-A-Scene, Meta's 2022 text-to-image work.[5]
In November 2023 Meta introduced Emu Video, which took a different and simpler route to the same goal. Rather than a deep cascade, Emu Video factorizes generation into two diffusion models: it first generates an image from the text prompt, then generates video conditioned on both the prompt and that image. Emu Video produced 512x512 clips of about four seconds at 16 frames per second, and in human evaluations Meta reported it was preferred over Make-A-Video by 96% of respondents on quality and 85% on faithfulness to the text. Meta explicitly contrasted the two, noting that Make-A-Video used a deep cascade of five models while Emu Video used just two.[5]
In October 2024 Meta announced Movie Gen, a larger family of foundation models for video and audio generation and editing, continuing the same research direction with longer, higher-quality output.[6]