Movie Gen
Last reviewed
Jun 3, 2026
Sources
8 citations
Review status
Source-backed
Revision
v1 · 1,342 words
Movie Gen is a family of media-generation foundation models developed by Meta AI and announced on October 4, 2024. The family covers text-to-video generation, video-to-audio and text-to-audio generation, instruction-based video editing, and personalized video generation from a user's photo. Meta presented the work as research rather than a shipping product, stating it had no plans to put the models into public products at launch. The accompanying paper, "Movie Gen: A Cast of Media Foundation Models," was posted to arXiv on October 17, 2024, with a revised version following on February 26, 2025. The first listed author is Adam Polyak, and the paper credits a large Movie Gen team at Meta.[1][2][3]
The release positioned Meta against other text-to-video systems that appeared in 2024, including OpenAI's Sora, and built on Meta's earlier image and video work such as Emu and Emu Video. Meta said the models achieved state-of-the-art results, as judged by human evaluators, across text-to-video synthesis, video personalization, video editing, video-to-audio generation, and text-to-audio generation.[1][2]
Movie Gen is built around two foundation models, Movie Gen Video and Movie Gen Audio. The personalization and editing capabilities are derived from the video model rather than built as separate base models.
| Model | Approximate size | Capability |
|---|---|---|
| Movie Gen Video | 30B parameters | Joint text-to-image and text-to-video generation; up to 16 seconds of 1080p HD video |
| Movie Gen Audio | 13B parameters | Video-to-audio and text-to-audio generation: sound effects, Foley, ambient sound, and instrumental music |
| Personalized Movie Gen Video | Derived from the 30B video model | Generates video of a specific person from a single reference image plus a text prompt |
| Movie Gen Edit | Derived from the 30B video model | Precise, instruction-based editing of real or generated video |
Movie Gen Video is a 30-billion-parameter foundation model trained jointly for text-to-image and text-to-video generation. It is a transformer trained with a Flow Matching objective, which learns to transform random noise into a sample by predicting velocities in a compressed latent space rather than predicting pixels directly. This places it in the broader family of diffusion and flow-based generative models.[2][4]
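As a minimal illustration of the objective described above, the following Python sketch builds one training pair for a linear-path flow matching loss and integrates a velocity field with Euler steps. It is a simplified textbook form, not Meta's implementation: the function names, the straight-line path without a minimum-noise term, and the Euler sampler are all assumptions made for illustration.

```python
import random


def flow_matching_pair(x1, t, rng):
    """Build one training example for a linear-path flow matching objective.

    x1  : list of floats, a clean latent sample (data)
    t   : float in [0, 1], the interpolation time
    rng : random.Random, source of Gaussian noise
    Returns (x_t, target_velocity).
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]                  # pure noise sample
    x_t = [(1.0 - t) * n + t * d for n, d in zip(x0, x1)]   # point on the straight path
    v = [d - n for n, d in zip(x0, x1)]                     # constant velocity along that path
    return x_t, v


def mse(pred, target):
    """Mean squared error between a predicted and a target velocity."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)


def euler_sample(velocity_fn, x0, steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Usage: a hypothetical 3-value "latent" and one training pair at t = 0.25.
rng = random.Random(0)
x1 = [0.5, -1.0, 2.0]
x_t, v = flow_matching_pair(x1, 0.25, rng)
```

During training, a network would take `x_t`, `t`, and the text conditioning as input, and `mse` would compare its output with `v`. At generation time the learned network would stand in for `velocity_fn`.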
The model generates up to 16 seconds of video at 16 frames per second (256 frames), which corresponds to a maximum context length of about 73,000 video tokens for the transformer. Generation happens at a base resolution near 768x768 pixels; a separate Spatial Upsampler, itself a video-to-video model, then raises the output to full HD 1080p. The system supports multiple aspect ratios, which Meta described as a first for the field at the time.[1][2][4]
To make video tractable, Movie Gen Video operates in a spatio-temporally compressed latent produced by a Temporal Autoencoder (TAE), which compresses input video by a factor of 8 across each of the height, width, and time dimensions. For text conditioning, the model concatenates the outputs of three text encoders, UL2, ByT5, and a long-prompt variant of MetaCLIP, after projecting them to a shared 6,144-dimensional space; the combination is meant to capture both semantic meaning and character-level detail in prompts.[2][4]
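The token budget quoted for the video model can be checked with simple arithmetic. The sketch below uses the figures given in this article (16 seconds, 16 frames per second, a 768x768 base resolution, and 8x compression along each axis) plus one assumption that the article does not state: a 2x2 spatial patchification of the latent before it enters the transformer. That step is included here because it reproduces the figure of roughly 73,000 tokens. The function and parameter names are illustrative.

```python
def video_token_count(seconds=16, fps=16, height=768, width=768,
                      tae_factor=8, patch=2):
    """Estimate transformer sequence length for one clip.

    tae_factor : compression applied to time, height, and width (from the article)
    patch      : spatial patch size applied to the latent (assumption, not in the article)
    """
    frames = seconds * fps                   # raw frame count
    lat_t = frames // tae_factor             # latent frames after temporal compression
    lat_h = height // tae_factor             # latent height
    lat_w = width // tae_factor              # latent width
    tokens = lat_t * (lat_h // patch) * (lat_w // patch)
    return frames, (lat_t, lat_h, lat_w), tokens


frames, latent_shape, tokens = video_token_count()
print(frames, latent_shape, tokens)  # 256 (32, 96, 96) 73728
```

Under these assumptions the count is 73,728 tokens, in line with the approximate figure reported for the model.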
Movie Gen Audio is a roughly 13-billion-parameter model for video-to-audio and text-to-audio generation. Given a video and an optional text prompt, it produces 48 kHz audio that is synchronized to the on-screen action. The model generates diegetic sound effects timed to visible events (including Foley-style effects), diegetic ambient sound that matches the scene, and non-diegetic instrumental music that fits the mood. It does not generate speech or dialogue.[1][2]
A single generation produces audio up to about 45 seconds long. The model handles variable-length output and, through an audio-extension technique, can produce coherent soundtracks for videos several minutes long, well beyond the 16-second limit of the video model. Meta reported state-of-the-art results for both the video-to-audio and the text-to-audio settings in its human evaluations.[1][2]
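To give a sense of scale for these limits, the sketch below computes the number of samples in one 45-second generation at 48 kHz and plans overlapping windows that would cover a longer video. The windowing scheme and the 5-second overlap are hypothetical and chosen only for illustration; the article does not describe how Meta's audio-extension technique divides or stitches segments.

```python
SAMPLE_RATE_HZ = 48_000   # output sample rate stated in the article
MAX_SEGMENT_S = 45.0      # approximate single-generation limit stated in the article


def samples_per_segment(seconds=MAX_SEGMENT_S):
    """Number of audio samples in one generated segment."""
    return int(seconds * SAMPLE_RATE_HZ)


def plan_segments(video_seconds, overlap_s=5.0):
    """Return (start, end) windows in seconds that cover the whole video.

    overlap_s is a hypothetical amount of shared context between
    consecutive windows; it is not a value taken from the paper.
    """
    if not 0.0 <= overlap_s < MAX_SEGMENT_S:
        raise ValueError("overlap_s must be in [0, MAX_SEGMENT_S)")
    total = float(video_seconds)
    hop = MAX_SEGMENT_S - overlap_s
    windows = []
    start = 0.0
    while True:
        end = min(start + MAX_SEGMENT_S, total)
        windows.append((start, end))
        if end >= total:
            break
        start += hop
    return windows


print(samples_per_segment())   # 2160000
print(plan_segments(180))      # five windows covering a 3-minute video
```

A 16-second clip from the video model fits in a single window, while a three-minute video needs five windows under this hypothetical overlap.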
Two of the four advertised capabilities are specializations of the video model.
Movie Gen Edit performs instruction-based video editing. A user supplies an existing clip, real or AI-generated, along with a text instruction, and the model applies the requested change while leaving the rest of the content intact. Meta highlighted localized and stylistic edits such as changing styles, adding or removing elements, altering backgrounds, and adjusting transitions.[1][2]
Personalized Movie Gen Video conditions generation on a single image of a person together with a text prompt, producing a video that features that individual while following the prompt and preserving identity and natural motion. Meta said this personalization path set a new state of the art for identity-preserving video generation in its evaluations.[1][2]
The paper describes training on large, filtered collections of paired media and text rather than a single fixed dataset. Movie Gen Video was trained on the order of 100 million video-text pairs and on the order of 1 billion image-text pairs, reflecting its joint image and video objective.[2][4]
Meta describes a multi-stage data curation pipeline rather than naming specific dataset sources or licenses. The pipeline applies visual filtering (for quality, aspect ratio, on-screen text via OCR, and scene-cut detection), motion filtering to remove static or erratic clips, and content filtering for deduplication and to improve concept diversity. The paper does not provide an explicit statement of the licensing status or provenance of the underlying media.[2][4]
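The general shape of such a staged filter can be sketched as follows. This is only an illustration of the visual, motion, and deduplication stages named above: the `Clip` fields, the threshold values, and the use of an exact hash for deduplication are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Clip:
    clip_id: str
    content_hash: str      # hypothetical duplicate fingerprint
    aspect_ratio: float    # width / height
    text_coverage: float   # fraction of frame area covered by OCR-detected text
    has_scene_cut: bool    # True if a scene cut was detected inside the clip
    motion_score: float    # hypothetical: 0 = static, larger = more motion


def passes_visual(c):
    """Visual stage: aspect ratio, on-screen text, and scene cuts (thresholds assumed)."""
    return 0.5 <= c.aspect_ratio <= 2.0 and c.text_coverage <= 0.1 and not c.has_scene_cut


def passes_motion(c):
    """Motion stage: drop static or erratic clips (thresholds assumed)."""
    return 0.05 <= c.motion_score <= 0.9


def curate(clips):
    """Apply the stages in order and drop clips whose fingerprint was already kept."""
    seen = set()
    kept = []
    for c in clips:
        if not passes_visual(c) or not passes_motion(c):
            continue
        if c.content_hash in seen:
            continue
        seen.add(c.content_hash)
        kept.append(c)
    return kept


clips = [
    Clip("a", "h1", 1.78, 0.02, False, 0.40),   # kept
    Clip("b", "h1", 1.78, 0.02, False, 0.40),   # duplicate of "a"
    Clip("c", "h2", 1.78, 0.50, False, 0.40),   # too much on-screen text
    Clip("d", "h3", 1.00, 0.00, False, 0.00),   # static
]
print([c.clip_id for c in curate(clips)])  # ['a']
```

A production pipeline would use learned quality and motion models and near-duplicate detection rather than fixed thresholds and exact hashes.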
At announcement, Movie Gen was a research project with no public product, API, or open weights. Meta said it did not plan to incorporate the models into public products until the following year and framed the release as an effort to open an early dialogue with creators. The project page described the work as moving "toward a potential future release," to be developed with feedback from filmmakers and creators rather than launched directly to the public.[1][3]
As part of that feedback effort, Meta said it was working with filmmakers and creators, including the horror studio Blumhouse and selected artists, to test the tools before any wider availability. Press reporting at the time of the announcement, including from Wired and Bloomberg, noted that access was limited to some Meta staff and external partners and that Meta intended to bring the technology to its apps, with Instagram cited as a likely surface, during 2025.[5][6][7]
That product step arrived in stages in 2025. On June 11, 2025, Meta announced a generative AI video editing feature, available in the Meta AI app, on the Meta.AI website, and in Meta's standalone Edits app, that it described as "inspired by our Movie Gen models." The initial version offered more than 50 preset prompts that could transform up to 10 seconds of video, changing elements such as outfit, location, style, and lighting, and Meta said custom text-prompted edits would follow later in the year. Meta presented the feature as a first step toward bringing AI video generation and editing across its products, rather than a release of the full Movie Gen models themselves.[8]