Emu Video
Last reviewed
Jun 3, 2026
Sources
9 citations
Review status
Source-backed
Revision
v1 · 1,211 words
Emu Video is a text-to-video generation model from Meta AI, announced on November 16, 2023, alongside the image-editing model Emu Edit. Its central idea is a "factorized" approach that splits text-to-video generation into two diffusion steps: first generate an image from the text prompt, then generate a video conditioned on both the text and that generated image. The accompanying research paper, "Emu Video: Factorizing Text-to-Video Generation by Explicit Image Conditioning," was posted to arXiv on November 17, 2023, and was later published at ECCV 2024. [1][2][3]
The work builds on Meta's Emu text-to-image foundation model and sits earlier in a lineage that continued with the larger Movie Gen system in 2024. In human evaluations reported by the authors, Emu Video's outputs were preferred over prior research systems, including Meta's own Make-A-Video, and over commercial tools available at the time. [1][2]
Emu (which Meta has described as standing for "Expressive Media Universe") is Meta's foundational text-to-image model, unveiled at the Meta Connect 2023 event in September 2023. It is a latent diffusion model pre-trained on roughly 1.1 billion image-text pairs and then "quality-tuned" on a small set of about 2,000 carefully selected high-quality images, producing 1024x1024 output. In Meta's reported human studies, Emu's images were preferred over Stable Diffusion XL more than 70 percent of the time. Emu Video reuses Emu as its image-generation backbone, which is how the factorized pipeline gets the intermediate image it conditions on. [4][5][2]
Before Emu Video, Meta's main text-to-video research model was Make-A-Video (2022), which relied on a cascade of several models to go from text to a high-resolution video. Emu Video's design goal was to reach comparable or better quality with a much simpler pipeline. [2][6]
Rather than learning to map text directly to a full video, Emu Video factorizes the problem into two conditional generation steps. The paper summarizes the approach as "first generating an image conditioned on the text, and then generating a video conditioned on the text and the generated image." Conditioning the video stage on an explicit, already-generated image is the key design choice: the image fixes the appearance and composition of the scene, so the video model can focus on producing coherent motion. The authors present this explicit image conditioning as the reason their method reaches higher quality than approaches that generate video directly from text. [1][2]
Meta describes the result as using "just two diffusion models" to produce video, in contrast to the deeper cascade (for example, five models) used by Make-A-Video. The same factorized model can also "animate" a user-supplied image according to a text prompt, since the second stage is already built to turn an image plus text into a video. [2][1]
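The control flow of the two-stage design can be illustrated with a minimal sketch. Everything in it is a hypothetical stand-in: the class `FactorizedTextToVideo`, the toy functions, and the list-based `Image` and `Video` types are invented for illustration and are not Meta's code or API. The toy functions do no diffusion at all; they only show how the second stage consumes either a generated image or a user-supplied one.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical toy types: a real system would use image and video tensors.
Image = List[List[float]]   # toy "image": a 2-D grid of values
Video = List[Image]         # toy "video": a list of frames


@dataclass
class FactorizedTextToVideo:
    """Two-stage pipeline: text -> image, then (text, image) -> video."""
    text_to_image: Callable[[str], Image]
    image_text_to_video: Callable[[str, Image], Video]

    def generate(self, prompt: str, image: Optional[Image] = None) -> Video:
        # Stage 1: create the conditioning image, unless the caller
        # supplies one (the "animate an image" use case).
        if image is None:
            image = self.text_to_image(prompt)
        # Stage 2: the video is conditioned on both the text and the image.
        return self.image_text_to_video(prompt, image)


def toy_text_to_image(prompt: str) -> Image:
    # Placeholder for an image model: a 2x2 grid filled with the prompt length.
    value = float(len(prompt))
    return [[value, value], [value, value]]


def toy_image_text_to_video(prompt: str, image: Image, num_frames: int = 4) -> Video:
    # Placeholder for a video model: each frame shifts the image by its index.
    return [[[px + t for px in row] for row in image] for t in range(num_frames)]


pipeline = FactorizedTextToVideo(toy_text_to_image, toy_image_text_to_video)
generated = pipeline.generate("a dog")                  # text -> image -> video
animated = pipeline.generate("zoom", image=[[1.0]])     # user image + text -> video
```

In this sketch, image animation needs no extra component: supplying `image` simply skips the first stage, which mirrors the reason the article gives for why the same model can animate a user-provided picture.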
Two engineering decisions are highlighted as critical to making the simpler pipeline work at high resolution: adjusted noise schedules for diffusion, and multi-stage training.
Together these let the model "directly generate high quality and high resolution videos, without requiring a deep cascade of models as in prior work." [1][3]
| Property | Value |
|---|---|
| Output resolution | 512x512 pixels |
| Video length | 4 seconds |
| Frame rate | 16 frames per second |
| Diffusion models in pipeline | 2 |
| Image backbone | Emu text-to-image model |
| Additional capability | Animate a user-provided image from a text prompt |
Meta states that the approach generates "512x512 four-second long videos at 16 frames per second." This is the headline configuration cited in Meta's announcement and in contemporaneous press coverage. [2][7]
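The headline configuration implies a fixed frame count, which a short calculation makes concrete. The helper `clip_stats` below is a hypothetical name, and the three-channel (RGB) assumption is not stated in the source; the result describes the size of the decoded output clip, not the representation the diffusion models operate on internally.

```python
def clip_stats(width: int, height: int, seconds: int, fps: int, channels: int = 3):
    """Return (frame count, total pixel values) for an uncompressed clip.

    channels=3 assumes RGB output, which is an assumption for illustration.
    """
    frames = seconds * fps
    values = frames * width * height * channels
    return frames, values


frames, values = clip_stats(width=512, height=512, seconds=4, fps=16)
print(frames)  # 64
print(values)  # 50331648
```

Four seconds at 16 frames per second is 64 frames, so a single clip carries roughly 50 million pixel values under the RGB assumption.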
The authors evaluate Emu Video primarily through human preference studies rather than automated metrics alone, comparing it against both research systems and commercial products. The paper's abstract reports that Emu Video's generated videos were "strongly preferred in quality" over all prior work it tested. [1]
| Comparison | Result reported |
|---|---|
| vs. Make-A-Video (Meta) | Preferred 96% on quality; 85% on faithfulness to the prompt |
| vs. Imagen Video (Google) | Preferred 81% on quality |
| vs. PYOCO (Nvidia) | Preferred 90% on quality |
| vs. RunwayML Gen-2 (commercial) | Emu Video preferred |
| vs. Pika Labs (commercial) | Emu Video preferred |
| Image animation vs. prior work | Preferred 96% |
The quality win rates of 81 percent versus Imagen Video, 90 percent versus PYOCO, and 96 percent versus Make-A-Video are stated directly in the paper's abstract, which also notes that the model "outperforms commercial solutions such as RunwayML's Gen2 and Pika Labs." The 85 percent faithfulness figure against Make-A-Video comes from Meta's announcement blog. For the image-animation task, the paper reports its generations were "preferred 96% over prior work." As with all single-vendor human-preference studies, these numbers reflect the authors' own evaluation protocol and prompt sets rather than an independent benchmark. [1][2]
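To make the idea of a preference win rate concrete, the sketch below shows one simple way such a figure could be computed from pairwise votes. It is an assumed protocol for illustration only (majority vote per prompt, with an odd number of raters so that ties cannot occur), not a description of the authors' actual evaluation procedure; the function name `win_rate` and the sample votes are invented.

```python
from collections import Counter
from typing import List


def win_rate(votes_per_prompt: List[List[str]], model: str) -> float:
    """Fraction of prompts on which `model` wins the majority of rater votes.

    Each inner list holds one vote per rater, naming the preferred model.
    Assumes an odd number of raters per prompt so a strict majority exists.
    """
    wins = 0
    for votes in votes_per_prompt:
        winner, _ = Counter(votes).most_common(1)[0]
        if winner == model:
            wins += 1
    return wins / len(votes_per_prompt)


# Invented example: four prompts, three raters each, comparing models "A" and "B".
sample_votes = [
    ["A", "A", "B"],
    ["A", "B", "B"],
    ["A", "A", "A"],
    ["A", "B", "A"],
]
print(win_rate(sample_votes, model="A"))  # 0.75
```

Under this scheme the reported percentage depends on the prompt set, the number of raters, and the question they are asked, which is why figures from different studies are not directly comparable.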
The paper is credited to Rohit Girdhar, Mannat Singh, Andrew Brown, and collaborators at Meta (GenAI / FAIR). The first version appeared on arXiv on November 17, 2023; a revised version was posted in August 2024, corresponding to publication at the European Conference on Computer Vision (ECCV) 2024. Meta released a project page with sample generations and announced the model through its AI research blog on November 16, 2023. At launch, Emu Video was presented as a research demonstration rather than a consumer product. [1][2][3]
Emu Video was a research step toward Meta's later media-generation work. In October 2024, Meta announced Movie Gen, a suite of media foundation models whose video component, Movie Gen Video, has about 30 billion parameters and can generate higher-definition clips of up to 16 seconds at 16 frames per second, along with companion models for audio, editing, and personalization. Movie Gen is substantially larger and broader in scope than Emu Video, and represents Meta's subsequent generation of text-to-video systems after the Emu line. [8][9]