Emu Video

Updated Jun 27, 2026

Emu Video is a text-to-video generation model from Meta AI, announced on November 16, 2023, that creates short clips by first turning a text prompt into an image and then generating a video conditioned on both the text and that image. This "factorized" design uses just two diffusion models to produce 512x512, four-second videos at 16 frames per second. In Meta's human evaluations, its outputs were preferred on quality over those of prior text-to-video systems, including a 96% preference over Meta's own Make-A-Video. ^[1]^[2]

Emu Video was unveiled alongside the image editing model Emu Edit. The accompanying research paper, "Emu Video: Factorizing Text-to-Video Generation by Explicit Image Conditioning," was posted to arXiv on November 17, 2023, and later published at the European Conference on Computer Vision (ECCV) 2024. ^[1]^[2]^[3] The work builds on Meta's Emu text-to-image foundation model and preceded the larger Movie Gen system, announced in 2024. ^[1]^[2]^[8]

What is Emu Video?

Emu Video is a research model that generates a short video from a natural-language text prompt. Rather than mapping text directly to a full video, it factorizes the task into two conditional generation steps: it first generates a still image from the prompt, then generates motion from that image plus the text. The same factorized model can also "animate" a user-provided image according to a text prompt, since the second stage is already built to turn an image plus text into a video. ^[1]^[2] At launch, Meta presented Emu Video as a research demonstration rather than a consumer product, releasing a project page with sample generations alongside the announcement. ^[2]^[1]

Emu (which Meta has described as standing for "Expressive Media Universe") is Meta's foundational text-to-image model, unveiled at the Meta Connect 2023 event in September 2023. It is a latent diffusion model pre-trained on roughly 1.1 billion image-text pairs and then "quality-tuned" on a small set of about 2,000 carefully selected high-quality images, producing 1024x1024 output. In Meta's reported human studies, Emu's images were preferred over Stable Diffusion XL more than 70 percent of the time. Emu Video reuses Emu as its image-generation backbone, which is how the factorized pipeline gets the intermediate image it conditions on. ^[4]^[5]^[2]

Before Emu Video, Meta's main text-to-video research model was Make-A-Video (2022), which relied on a cascade of several models to go from text to a high-resolution video. Emu Video's design goal was to reach comparable or better quality with a much simpler pipeline. ^[2]^[6]

How does Emu Video work?

Rather than learning to map text directly to a full video, Emu Video factorizes the problem into two conditional generation steps. The paper summarizes the approach as "first generating an image conditioned on the text, and then generating a video conditioned on the text and the generated image." Conditioning the video stage on an explicit, already-generated image is the key design choice: the image fixes the appearance and composition of the scene, so the video model can focus on producing coherent motion. The authors present this explicit image conditioning as the reason their method reaches higher quality than approaches that generate video directly from text. ^[1]^[2]

Meta describes the result as a method that "is simple to implement and uses just two diffusion models" to produce video, in contrast to the deep cascade of five models used by Make-A-Video. Because the second stage already maps an image plus text to a video, the same model can also animate a user-supplied image according to a text prompt. ^[2]^[1]
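
The two-stage control flow described above can be sketched in a few lines of Python. This is a minimal illustration, not Meta's implementation: the class name `FactorizedT2V`, the two callables, and the toy stand-in functions are all hypothetical, and the "images" are tiny nested lists rather than real pixel arrays. The sketch shows only how one second-stage model can serve both text-to-video (stage 1 supplies the image) and image animation (the user supplies it).

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

Image = List[List[float]]  # toy placeholder for an image (rows of pixel values)
Video = List[Image]        # a video is an ordered list of frames


@dataclass
class FactorizedT2V:
    """Hypothetical wrapper around two conditional generators."""
    text_to_image: Callable[[str], Image]                  # stage 1: text -> image
    image_text_to_video: Callable[[Image, str, int], Video]  # stage 2: (image, text) -> video
    num_frames: int = 8                                    # illustrative default only

    def generate(self, prompt: str, image: Optional[Image] = None) -> Video:
        # Text-to-video: no image supplied, so stage 1 creates one from the prompt.
        # Image animation: the caller's image is used and stage 1 is skipped.
        if image is None:
            image = self.text_to_image(prompt)
        return self.image_text_to_video(image, prompt, self.num_frames)


def toy_text_to_image(prompt: str) -> Image:
    """Stand-in for stage 1: a 2x2 'image' derived from the prompt length."""
    value = float(len(prompt))
    return [[value, value], [value, value]]


def toy_image_text_to_video(image: Image, prompt: str, num_frames: int) -> Video:
    """Stand-in for stage 2: frame t is the conditioning image brightened by t."""
    return [[[px + t for px in row] for row in image] for t in range(num_frames)]


pipeline = FactorizedT2V(toy_text_to_image, toy_image_text_to_video, num_frames=4)
generated = pipeline.generate("a dog running on a beach")
animated = pipeline.generate("slow pan left", image=[[1.0, 2.0], [3.0, 4.0]])
```

In this toy version the first frame of `animated` equals the supplied image, which mirrors the idea that the conditioning image fixes appearance while the second stage contributes motion.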

Two engineering decisions are highlighted as critical to making the simpler pipeline work at high resolution:

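Adjusted noise schedules for diffusion. The paper emphasizes a zero terminal signal-to-noise-ratio (SNR) noise schedule, in which the final training timestep is pure noise, matching the pure-noise starting point used at sampling time. This removes a train-test mismatch that otherwise hurts high-resolution generation.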
Multi-stage training. Training proceeds through multiple stages and resolutions rather than relying on a long cascade of separate super-resolution models.

Together these let the model "directly generate high quality and high resolution videos, without requiring a deep cascade of models as in prior work." ^[1]^[3]
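
To make the zero terminal SNR idea concrete, the sketch below applies one rescaling recipe known from the general diffusion literature: shift and scale the square root of the cumulative signal coefficient so that it reaches exactly zero at the last timestep while the first timestep is left unchanged. The linear beta schedule, the 1,000-step count, and all function names are assumptions chosen for illustration; the text above does not specify the exact schedule or procedure Emu Video uses.

```python
import math
from typing import List


def linear_betas(num_steps: int, beta_start: float = 1e-4,
                 beta_end: float = 0.02) -> List[float]:
    """A plain linear beta schedule (assumed here purely for illustration)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def sqrt_alphas_cumprod(betas: List[float]) -> List[float]:
    """sqrt of the running product of (1 - beta_t): the signal scale at each step."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(math.sqrt(prod))
    return out


def rescale_zero_terminal_snr(betas: List[float]) -> List[float]:
    """Shift and scale the signal curve so the last step carries no signal."""
    s = sqrt_alphas_cumprod(betas)
    first, last = s[0], s[-1]
    # Shift so the final value is 0, then scale so the first value is preserved.
    s = [(x - last) * first / (first - last) for x in s]
    alphas_bar = [x * x for x in s]
    # Convert cumulative products back into per-step alphas, then betas.
    alphas = [alphas_bar[0]] + [
        alphas_bar[i] / alphas_bar[i - 1] for i in range(1, len(alphas_bar))
    ]
    return [1.0 - a for a in alphas]


def terminal_snr(betas: List[float]) -> float:
    """SNR at the final timestep: alpha_bar_T / (1 - alpha_bar_T)."""
    a_bar = sqrt_alphas_cumprod(betas)[-1] ** 2
    return a_bar / (1.0 - a_bar)


original = linear_betas(1000)
rescaled = rescale_zero_terminal_snr(original)
snr_before = terminal_snr(original)  # small but strictly positive
snr_after = terminal_snr(rescaled)   # exactly zero
```

With the original schedule a trace of the clean signal survives at the last training step, whereas sampling starts from pure noise; after rescaling, the last beta becomes 1 and the two conditions match.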

What are Emu Video's output specifications?

Property	Value
Output resolution	512x512 pixels
Video length	4 seconds
Frame rate	16 frames per second
Diffusion models in pipeline	2
Image backbone	Emu text-to-image model
Additional capability	Animate a user-provided image from a text prompt

Meta states that the approach can "generate 512x512 four-second long videos at 16 frames per second." This is the headline configuration cited in Meta's announcement and in contemporaneous press coverage. ^[2]^[7]
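
As a quick check on what the headline configuration implies, the arithmetic below multiplies out the stated resolution, duration, and frame rate. The helper name `clip_stats` and the assumption of three color channels (RGB) are illustrative and are not taken from Meta's materials; the frame count is nominal, since the exact number of frames in released clips is not stated here.

```python
def clip_stats(width: int, height: int, seconds: int, fps: int,
               channels: int = 3) -> dict:
    """Nominal size of an uncompressed clip (channels=3 assumes RGB)."""
    frames = seconds * fps              # 4 s * 16 fps
    pixels = width * height * frames    # pixels across all frames
    return {"frames": frames, "pixels": pixels, "values": pixels * channels}


stats = clip_stats(512, 512, seconds=4, fps=16)
print(stats)  # {'frames': 64, 'pixels': 16777216, 'values': 50331648}
```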

How does Emu Video compare to Make-A-Video and other systems?

The authors evaluate Emu Video primarily through human preference studies rather than automated metrics alone, comparing it against both research systems and commercial products. The paper's abstract reports that Emu Video's generated videos were "strongly preferred in quality" over all prior work it tested. ^[1]

Comparison	Result reported
vs. Make-A-Video (Meta)	Preferred by 96% on quality; 85% on faithfulness to the prompt
vs. Imagen Video (Google)	Preferred 81% on quality
vs. PYOCO (Nvidia)	Preferred 90% on quality
vs. RunwayML Gen-2 (commercial)	Emu Video preferred
vs. Pika Labs (commercial)	Emu Video preferred
Image animation vs. prior work	Preferred 96%

The quality win rates of 81 percent versus Imagen Video, 90 percent versus PYOCO, and 96 percent versus Make-A-Video are stated directly in the paper's abstract, which also notes that the model "outperforms commercial solutions such as RunwayML's Gen2 and Pika Labs." The 85 percent faithfulness figure against Make-A-Video comes from Meta's announcement blog, which states the model "was preferred over Make-A-Video by 96% of respondents based on quality and by 85% of respondents based on faithfulness to the text prompt." For the image-animation task, the paper reports its generations were "preferred 96% over prior work." As with all single-vendor human-preference studies, these numbers reflect the authors' own evaluation protocol and prompt sets rather than an independent benchmark. ^[1]^[2]
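
A pairwise preference figure such as "preferred 96%" is a win rate: the share of comparisons in which raters chose one system's output over the other's. The sketch below computes a win rate and a Wilson score interval from a list of votes. The vote counts are invented for illustration, and the function names are hypothetical; the number of raters and prompts in Meta's study, and its exact aggregation protocol, are not given in the text above.

```python
import math
from typing import Sequence, Tuple


def win_rate(votes: Sequence[str], model: str = "A") -> float:
    """Fraction of pairwise comparisons won by `model`."""
    if not votes:
        raise ValueError("need at least one vote")
    return sum(v == model for v in votes) / len(votes)


def wilson_interval(p: float, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion p observed over n trials."""
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Hypothetical data: 100 comparisons, 96 of them won by model "A".
votes = ["A"] * 96 + ["B"] * 4
rate = win_rate(votes)
low, high = wilson_interval(rate, len(votes))
```

The interval is a reminder that a reported percentage carries sampling uncertainty that depends on how many comparisons were collected, which is one reason single-vendor studies are hard to compare across papers.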

Who built Emu Video and when was it released?

The paper is credited to Rohit Girdhar, Mannat Singh, Andrew Brown, Quentin Duval, Samaneh Azadi, Sai Saketh Rambhatla, Akbar Shah, Xi Yin, Devi Parikh, and Ishan Misra at Meta (GenAI / FAIR). Meta announced the model through its AI research blog on November 16, 2023, and released a project page with sample generations, presenting Emu Video as a research demonstration rather than a consumer product. The first version of the paper appeared on arXiv on November 17, 2023; a revised version was posted in August 2024, corresponding to publication at the European Conference on Computer Vision (ECCV) 2024. ^[1]^[2]^[3]

How does Emu Video relate to Movie Gen?

Emu Video was a research step toward Meta's later media-generation work. In October 2024, Meta announced Movie Gen, a suite of media foundation models whose video component, Movie Gen Video, has about 30 billion parameters and can generate higher-definition clips of up to 16 seconds at 16 frames per second, along with companion models for audio, editing, and personalization. Movie Gen is substantially larger and broader in scope than Emu Video, and represents Meta's subsequent generation of text-to-video systems after the Emu line. ^[8]^[9]

References

1. Rohit Girdhar, Mannat Singh, Andrew Brown, et al. "Emu Video: Factorizing Text-to-Video Generation by Explicit Image Conditioning." arXiv:2311.10709, November 17, 2023. https://arxiv.org/abs/2311.10709
2. Meta AI. "Emu Video and Emu Edit: Our latest generative AI research milestones." November 16, 2023. https://ai.meta.com/blog/emu-text-to-video-generation-image-editing-research/
3. Rohit Girdhar, et al. "Emu Video: Factorizing Text-to-Video Generation by Explicit Image Conditioning" (HTML, v2). arXiv. https://arxiv.org/html/2311.10709v2
4. School of Machine Learning. "Emu: The Most Advanced Next-Generation Image Model From Meta." September 29, 2023. https://www.schoolofmachinelearning.com/2023/09/29/emu-image-generation-model-from-meta/
5. Maginative. "A Deep Dive Inside Emu, Meta's New Image Generation AI Model." https://www.maginative.com/article/a-deep-dive-inside-emu-metas-new-image-generation-ai-model/
6. InfoQ. "Meta Announces Generative AI Models Emu Video and Emu Edit." November 2023. https://www.infoq.com/news/2023/11/meta-emu-ai/
7. SiliconANGLE. "Meta announces new breakthroughs in AI image editing and video generation with Emu." November 16, 2023. https://siliconangle.com/2023/11/16/meta-announces-new-breakthroughs-ai-image-editing-video-generation-emu/
8. Meta AI. "Movie Gen: A Cast of Media Foundation Models." Research publication, October 2024. https://ai.meta.com/research/publications/movie-gen-a-cast-of-media-foundation-models/
9. Meta AI. "How Meta Movie Gen could usher in a new AI-enabled era for content creators." October 4, 2024. https://ai.meta.com/blog/movie-gen-media-foundation-models-generative-ai-video/
