Make-A-Video

Updated Jun 28, 2026

Make-A-Video is a text-to-video generation system from Meta AI, announced on September 29, 2022, whose defining contribution is generating moving video from a text prompt without ever training on a single paired example of text matched to video. Instead, Make-A-Video learns what the world looks like and how language describes it from text-image pairs, then learns how things move from ordinary, unlabeled video clips that carry no captions. The accompanying paper, "Make-A-Video: Text-to-Video Generation without Text-Video Data," was posted to arXiv by Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, Devi Parikh, Sonal Gupta, and Yaniv Taigman, and was later accepted as a poster at ICLR 2023.^[1]^[2]^[3]

At launch it was a research project rather than a shipping product. Meta published a project page with example clips and a sign-up form for anyone interested in future access, and said it planned to release a demo experience later.^[3]^[4]

What is Make-A-Video?

Make-A-Video is a diffusion-based generative AI model that turns a short text description into a short video clip. Its headline idea, stated plainly in the paper, is to recycle the rapid progress in text-to-image generation for video: "learn what the world looks like and how it is described from paired text-image data, and learn how the world moves from unsupervised video footage."^[1] In practice that means the model never needs the expensive, hard-to-collect dataset of videos labeled with accurate captions that earlier text-to-video methods relied on.

The paper lists three advantages of this design: it accelerates training because the model does not have to learn visual and multimodal representations from scratch; it removes the need for paired text-video data; and the generated videos inherit the diversity and "vastness" of contemporary image-generation models, including fantastical and stylized depictions.^[1] The paper describes Make-A-Video as setting "the new state-of-the-art in text-to-video generation," a claim it backs with both automatic metrics and human studies.^[1]

How does Make-A-Video work without text-video pairs?

Collecting large datasets of video with accurate text descriptions is hard and expensive, which had held back text-to-video work relative to the rapid progress in text-to-image generation around 2022. Make-A-Video sidesteps the problem by splitting the learning into two parts that use data that already exists in abundance.^[1]

First, a text-to-image diffusion model provides the visual and language knowledge. This part is trained on text-image pairs, so it learns how objects, scenes, and styles look and how they are named. Make-A-Video reuses the prior-plus-decoder design of earlier text-to-image diffusion work, the approach popularized by DALL-E 2, in which a prior network maps text to a CLIP image embedding and a decoder turns that embedding into pixels.^[1]

Second, the model learns motion from unlabeled video. Because the text understanding is already supplied by the image side, the video clips do not need captions; the network only has to learn how realistic motion unfolds over time. This separation is what lets the system avoid paired text-video data entirely.^[1]^[3]

How is Make-A-Video built? (architecture)

To turn an image generator into a video generator, Make-A-Video adds temporal structure to the existing spatial network. The decoder's two-dimensional convolution and attention layers are extended with new temporal layers: pseudo-3D convolution layers that stack a 1D temporal convolution on top of each 2D spatial convolution, and temporal attention layers stacked after the spatial attention layers. Factoring space and time this way keeps the new model close to the pretrained image weights and avoids the cost of full 3D convolutions over the whole clip.^[1]
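The factorization described above can be illustrated with a minimal pure-Python sketch. It is not Meta's implementation: it uses a single channel, nested lists instead of tensors, and hypothetical function names (`conv2d_same`, `conv1d_time_same`, `pseudo3d_conv`). It assumes zero padding, and it assumes the temporal kernel starts as an identity so that the stacked layer initially reproduces the per-frame output of the spatial layer.

```python
from typing import List

Frame = List[List[float]]  # H x W, single channel for simplicity
Video = List[Frame]        # T x H x W


def conv2d_same(frame: Frame, kernel: List[List[float]]) -> Frame:
    """Zero-padded 'same' 2D cross-correlation of one frame with an odd-sized kernel."""
    h, w = len(frame), len(frame[0])
    kh, kw = len(kernel), len(kernel[0])
    ph, pw = kh // 2, kw // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    yy, xx = y + i - ph, x + j - pw
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += kernel[i][j] * frame[yy][xx]
            out[y][x] = acc
    return out


def conv1d_time_same(video: Video, kernel: List[float]) -> Video:
    """Zero-padded 'same' 1D cross-correlation along the time axis, applied per pixel."""
    t, h, w = len(video), len(video[0]), len(video[0][0])
    pad = len(kernel) // 2
    out = [[[0.0] * w for _ in range(h)] for _ in range(t)]
    for f in range(t):
        for k, weight in enumerate(kernel):
            src = f + k - pad
            if 0 <= src < t and weight != 0.0:
                for y in range(h):
                    for x in range(w):
                        out[f][y][x] += weight * video[src][y][x]
    return out


def pseudo3d_conv(video: Video, spatial_kernel: List[List[float]],
                  temporal_kernel: List[float]) -> Video:
    """Spatial 2D convolution on every frame, followed by a 1D convolution over time."""
    spatial = [conv2d_same(frame, spatial_kernel) for frame in video]
    return conv1d_time_same(spatial, temporal_kernel)


# Assumed identity initialization of the temporal kernel: time mixing is a no-op at first.
IDENTITY_TEMPORAL_KERNEL = [0.0, 1.0, 0.0]
```

With `IDENTITY_TEMPORAL_KERNEL`, the output equals the frame-by-frame spatial convolution, which is one way to see why a factored layer can stay close to pretrained image weights. For a k x k spatial kernel and a length-k temporal kernel, the factored layer needs on the order of k^2 + k weights per channel pair, compared with k^3 for a full 3D kernel.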

The full system is a cascade of several networks, which Meta later described as five models in total. Generation runs through these stages in sequence.^[1]^[5]

Stage	Component	Role	Operates on
1	Prior (P)	Maps input text to a CLIP image embedding	Embeddings, no spatial resolution
2	Spatiotemporal decoder (Dt)	Generates 16 low-resolution frames from the embedding	16 frames at 64x64
3	Frame interpolation network	Increases the frame count to raise the effective frame rate	16 frames to 76 frames at 64x64
4	Spatiotemporal super-resolution (SRl)	Upscales each frame while keeping motion consistent	64x64 to 256x256
5	Spatial super-resolution (SRh)	Final spatial upscaling, applied per frame	256x256 to 768x768

The decoder first produces 16 frames at 64x64 pixels. A separate masked frame-interpolation network then fills in intermediate frames; using a frame skip of 5, it expands a 16-frame clip to 76 frames, computed as (16 - 1) x 5 + 1. The first super-resolution network operates across space and time together to upscale from 64x64 to 256x256, so it can keep the added detail consistent from frame to frame. The final super-resolution network works on individual frames to reach 768x768.^[1]
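The shape bookkeeping for the cascade can be checked with a short sketch. It only tracks frame counts and resolutions using the numbers quoted above and runs no model. The names `ClipShape`, `interpolate_frames`, `upscale`, and `trace_cascade` are hypothetical helpers introduced for illustration, and the prior is omitted because it works on embeddings with no spatial resolution.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ClipShape:
    frames: int
    height: int
    width: int


def interpolate_frames(shape: ClipShape, frame_skip: int = 5) -> ClipShape:
    """Frame count after interpolation: (frames - 1) * frame_skip + 1."""
    return ClipShape((shape.frames - 1) * frame_skip + 1, shape.height, shape.width)


def upscale(shape: ClipShape, size: int) -> ClipShape:
    """Spatial upscaling to size x size; the frame count is unchanged."""
    return ClipShape(shape.frames, size, size)


def trace_cascade(frame_skip: int = 5) -> List[Tuple[str, ClipShape]]:
    """Follow a clip's shape through the decoder, interpolation, and both SR stages."""
    decoded = ClipShape(frames=16, height=64, width=64)
    interpolated = interpolate_frames(decoded, frame_skip)
    sr_low = upscale(interpolated, 256)
    sr_high = upscale(sr_low, 768)
    return [
        ("spatiotemporal decoder", decoded),
        ("frame interpolation", interpolated),
        ("spatiotemporal super-resolution", sr_low),
        ("spatial super-resolution", sr_high),
    ]


for name, shape in trace_cascade():
    print(f"{name}: {shape.frames} frames at {shape.height}x{shape.width}")
```

The formula counts the 15 gaps between the 16 decoded frames, gives each gap 5 output frames, and adds the final frame back, which yields 76.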

During training, the model conditions on frame rate, with clips sampled at a random rate between 1 and 30 frames per second. At inference, the interpolation stage lets the system target a chosen frame rate and produce smoother, longer-feeling motion.^[1]
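One plausible way to build such training clips is sketched below. The source states only that the sampling rate is random between 1 and 30 frames per second. The index arithmetic, the 16-frame clip length reused from the decoder, the 30 fps source video, and the function name `sample_clip_indices` are all assumptions made for illustration.

```python
import random
from typing import List, Optional, Tuple


def sample_clip_indices(
    num_source_frames: int,
    source_fps: float,
    num_frames: int = 16,
    min_fps: int = 1,
    max_fps: int = 30,
    rng: Optional[random.Random] = None,
) -> Tuple[int, List[int]]:
    """Pick a random target fps and the source-frame indices of a clip at that rate.

    Returns (target_fps, indices). The target fps would be passed to the model
    as the conditioning signal alongside the sampled frames.
    """
    if max_fps > source_fps:
        raise ValueError("target fps cannot exceed the source fps in this sketch")
    rng = rng or random.Random()
    target_fps = rng.randint(min_fps, max_fps)
    step = source_fps / target_fps          # source frames between sampled frames
    span = (num_frames - 1) * step          # source frames covered by the whole clip
    if span > num_source_frames - 1:
        raise ValueError("source video is too short for the sampled frame rate")
    start = rng.uniform(0, num_source_frames - 1 - span)
    indices = [int(start + i * step) for i in range(num_frames)]
    return target_fps, indices


# Example: a 1000-frame video recorded at 30 fps (hypothetical values).
fps, frame_indices = sample_clip_indices(1000, 30.0, rng=random.Random(0))
print(fps, frame_indices)
```

A low sampled rate spreads the 16 frames over a long stretch of the source video, while a rate near the source rate keeps them adjacent, so the same network sees both slow and fast motion labeled with the matching frame rate.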

What data was Make-A-Video trained on?

The split design is mirrored in the datasets. The text-to-image backbone was trained on a 2.3-billion-image subset of LAION-5B, filtered to remove NSFW imagery, toxic words in the text, and images with watermarks. The temporal components learned motion from unlabeled video: the full WebVid-10M dataset and a 10-million-clip subset of HD-VILA-100M. None of the video used in this part carried text aligned to the clips.^[1]

What can Make-A-Video do?

The project page demonstrated four uses: generating a video from a text description, animating a still image by adding motion, interpolating between two input images to create the motion between them, and producing variations of an existing video.^[4]

Property	Value
Modality	Text to video (plus image animation and video variation)
Final resolution	768x768
Frames per clip	16 generated, interpolated to 76
Frame-rate conditioning	1 to 30 fps during training
Backbone	Text-to-image diffusion model (CLIP prior plus decoder)
Paired text-video data	None used
Status at launch	Research project, sign-up for future access

How good were the results?

The paper reported state-of-the-art results for its time using both automatic metrics and human studies. On UCF-101 in a zero-shot setting it reported an Inception Score of 33.00 and a Fréchet Video Distance (FVD) of 367.23. On MSR-VTT it reported a Fréchet Inception Distance (FID) of 13.17 and a CLIPSIM text-video similarity of 0.3049. In human evaluations against the contemporary CogVideo system, raters preferred Make-A-Video on both visual quality and faithfulness to the prompt.^[1]
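To make the CLIPSIM figure concrete, the sketch below computes a text-video similarity as the mean cosine similarity between one text embedding and each frame embedding, which is how the metric is commonly described. It assumes the embeddings have already been produced by a CLIP model, which is not included here, and the exact scaling or model variant behind the reported 0.3049 may differ. The function names are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_a * norm_b)


def clipsim(text_embedding: Sequence[float],
            frame_embeddings: List[Sequence[float]]) -> float:
    """Mean cosine similarity between the text embedding and every frame embedding."""
    if not frame_embeddings:
        raise ValueError("at least one frame embedding is required")
    total = sum(cosine_similarity(text_embedding, f) for f in frame_embeddings)
    return total / len(frame_embeddings)
```

Averaging over frames means a clip scores well only if the prompt stays recognizable throughout, not just in a single frame. Higher values indicate closer agreement between text and video, whereas for FVD and FID lower is better.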

Meta's project page summarized its own user studies as roughly a 3x improvement over the previous state of the art in both how well the output represented the text and in overall quality.^[4]

How did Meta handle responsible release?

Because realistic generated video can be misused, Meta framed the launch as a deliberately cautious research preview. The project applied filters intended to reduce harmful content and added a visible watermark to every generated video so viewers could tell the footage was AI-made. Meta said it would apply its responsible-AI process before any wider release rather than open the system to the public immediately.^[3]^[4]

How does Make-A-Video relate to later Meta video models?

Make-A-Video was an early step in a line of Meta generative-video research. Its image-generation lineage traces to Make-A-Scene, Meta's 2022 text-to-image work.^[5]

In November 2023 Meta introduced Emu Video, which took a different and simpler route to the same goal. Rather than a deep cascade, Emu Video factorizes generation into two diffusion models: it first generates an image from the text prompt, then generates video conditioned on both the prompt and that image. Emu Video produced 512x512 clips of about four seconds at 16 frames per second, and in human evaluations Meta reported it was preferred over Make-A-Video by 96% of respondents on quality and 85% on faithfulness to the text. Meta explicitly contrasted the two, noting that Make-A-Video used a deep cascade of five models while Emu Video used just two.^[5]

In October 2024 Meta announced Movie Gen, a larger family of foundation models for video and audio generation and editing, continuing the same research direction with longer, higher-quality output.^[6]

How does Make-A-Video compare to other 2022 text-to-video systems?

Make-A-Video arrived during a burst of text-to-video research in late 2022. Google announced Imagen Video and Phenaki within days of Make-A-Video, and Tsinghua University's CogVideo was the contemporary open baseline the paper compared against in human evaluations.^[1] What set Make-A-Video apart was its central premise of training without any paired text-video data, reusing a text-to-image model for visual and language knowledge and learning motion separately from unlabeled footage.^[1]^[3]

When was Make-A-Video released?

Meta AI publicly announced Make-A-Video on September 29, 2022, the same day the paper was posted to arXiv (arXiv:2209.14792).^[1]^[3] It was released as a research project with a project page and a sign-up form for future access rather than as a public product, and the work was subsequently accepted as a poster at ICLR 2023.^[2]^[3]^[4]

References

[1] Singer, U., Polyak, A., Hayes, T., Yin, X., An, J., Zhang, S., Hu, Q., Yang, H., Ashual, O., Gafni, O., Parikh, D., Gupta, S., Taigman, Y. "Make-A-Video: Text-to-Video Generation without Text-Video Data." arXiv:2209.14792, September 29, 2022. https://arxiv.org/abs/2209.14792
[2] "Make-A-Video: Text-to-Video Generation without Text-Video Data." OpenReview, ICLR 2023 (poster). https://openreview.net/forum?id=nJfylDvgzlq
[3] "Introducing Make-A-Video: An AI system that generates videos from text." Meta AI Blog, September 29, 2022. https://ai.meta.com/blog/generative-ai-text-to-video/
[4] "Make-A-Video by Meta AI." Project page. https://makeavideo.studio/
[5] "Emu Video and Emu Edit: Our latest generative AI research milestones." Meta AI Blog, November 16, 2023. https://ai.meta.com/blog/emu-text-to-video-generation-image-editing-research/
[6] "Meta Movie Gen." Meta AI Research, October 2024. https://ai.meta.com/research/movie-gen/
