Movie Gen

Updated Jun 27, 2026

Movie Gen is a suite of media-generation foundation models from Meta AI, announced on October 4, 2024, that generates high-definition video with synchronized audio from text prompts. The suite is built around two models: Movie Gen Video, a roughly 30-billion-parameter transformer that produces up to 16 seconds of 1080p video at 16 frames per second, and Movie Gen Audio, a roughly 13-billion-parameter model that generates sound effects, ambient sound, and instrumental music timed to the picture. It also supports instruction-based video editing and personalized video generated from a single photo of a person.^[1]^[2]

Meta presented Movie Gen as research rather than a shipping product and said at the announcement that it had no immediate plans to put the models into public products. The accompanying paper, "Movie Gen: A Cast of Media Foundation Models," was posted to arXiv on October 17, 2024, with a revised version following on February 26, 2025. The first listed author is Adam Polyak, and the paper credits a large Movie Gen team at Meta; the abstract describes the release as "a cast of foundation models that generates high-quality, 1080p HD videos with different aspect ratios and synchronized audio."^[1]^[2]^[4]

What is Movie Gen?

Movie Gen covers four media-generation capabilities under one research umbrella: text-to-video generation, video-to-audio and text-to-audio generation, instruction-based video editing, and personalized video generation from a user's photo. The release positioned Meta against other text-to-video systems that appeared in 2024, including OpenAI's Sora. It builds on Meta's earlier image and video work, notably Emu and Emu Video (Meta's 2023 text-to-video system), while delivering higher resolution, longer clips, and synchronized audio. Meta reported that the models reached state-of-the-art results as judged by human evaluators. In the paper's words, "Our models set a new state-of-the-art on multiple tasks: text-to-video synthesis, video personalization, video editing, video-to-audio generation, and text-to-audio generation."^[1]^[2]^[4]

What models make up Movie Gen?

Movie Gen is built around two foundation models, with the personalization and editing capabilities derived from the video model rather than shipped as separate base models.

Model	Approximate size	Capability
Movie Gen Video	30B parameters	Joint text-to-image and text-to-video generation; up to 16 seconds of 1080p HD video
Movie Gen Audio	13B parameters	Video-to-audio and text-to-audio generation: sound effects, Foley, ambient sound, and instrumental music
Personalized Movie Gen Video	Derived from the 30B video model	Generates video of a specific person from a single reference image plus a text prompt
Movie Gen Edit	Derived from the 30B video model	Precise, instruction-based editing of real or generated video

How big is the Movie Gen video model?

Movie Gen Video is a 30-billion-parameter foundation model trained jointly for text-to-image and text-to-video generation. It is a transformer trained with a Flow Matching objective, which learns to transform random noise into a sample by predicting velocities in a compressed latent space rather than predicting pixels directly. This places it in the broader family of diffusion and flow-based generative models.^[2]^[4]
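As a rough illustration of the velocity-prediction idea, the sketch below implements a generic Flow Matching loss and a fixed-step Euler sampler in plain Python. It assumes a straight-line path between noise and data and a small `SIGMA_MIN` constant; those choices and every function name are assumptions made for illustration, not details taken from this article or confirmed as Meta's implementation. The `model` argument stands in for the transformer and operates on a flat list of latent values.

```python
import random

SIGMA_MIN = 1e-5  # assumed small constant; not stated in this article


def interpolate(x0, x1, t, sigma_min=SIGMA_MIN):
    """Point at time t on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [t * d + (1.0 - (1.0 - sigma_min) * t) * n for n, d in zip(x0, x1)]


def target_velocity(x0, x1, sigma_min=SIGMA_MIN):
    """Time derivative of interpolate(); this is the regression target."""
    return [d - (1.0 - sigma_min) * n for n, d in zip(x0, x1)]


def flow_matching_loss(model, x1, rng):
    """Mean squared error between predicted and target velocity for one sample."""
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]  # Gaussian noise, same size as the data
    t = rng.random()                        # uniform time in [0, 1)
    v_pred = model(interpolate(x0, x1, t), t)
    v_true = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(x1)


def euler_sample(model, x0, steps=50):
    """Integrate dx/dt = model(x, t) from t=0 to t=1 with fixed Euler steps."""
    x, dt = list(x0), 1.0 / steps
    for i in range(steps):
        v = model(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_oracle(x1, sigma_min=SIGMA_MIN):
    """Toy 'model' that knows the single target x1 and returns the exact velocity."""
    def oracle(x, t):
        denom = 1.0 - (1.0 - sigma_min) * t
        return [(d - (1.0 - sigma_min) * xi) / denom for xi, d in zip(x, x1)]
    return oracle
```

With the oracle for a single known sample, `euler_sample` ends within roughly `SIGMA_MIN` of that sample, which is a quick sanity check that the path and its velocity agree. A real model would replace the oracle with a trained network.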

The model generates up to 16 seconds of video at 16 frames per second, which corresponds to a maximum context length of about 73,000 video tokens for the transformer. Generation happens at a base resolution near 768x768 pixels; a separate Spatial Upsampler, itself a video-to-video model, then raises the output to full HD 1080p. The system supports multiple aspect ratios, which Meta described as a first for the field at the time.^[1]^[2]^[4]
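The figure of about 73,000 tokens can be reproduced with simple arithmetic. The sketch below assumes the factor-of-8 compression in height, width, and time that this section attributes to the Temporal Autoencoder, plus a 2x2 spatial patch per token; the patch size is an assumption that this article does not state, and the function name is hypothetical.

```python
def video_token_count(seconds=16, fps=16, height=768, width=768,
                      tae_factor=8, patch_hw=2):
    """Estimate the transformer sequence length for one clip."""
    frames = seconds * fps                    # 16 s * 16 fps = 256 frames
    latent_t = frames // tae_factor           # 256 / 8 = 32 latent frames
    latent_h = height // tae_factor           # 768 / 8 = 96
    latent_w = width // tae_factor            # 768 / 8 = 96
    # Assumed: each token covers a patch_hw x patch_hw block of one latent frame.
    tokens_per_frame = (latent_h // patch_hw) * (latent_w // patch_hw)  # 48 * 48
    return latent_t * tokens_per_frame


print(video_token_count())  # 73728
```

Under these assumptions the count is 32 x 48 x 48 = 73,728, consistent with the roughly 73,000 tokens quoted above.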

To make video tractable, Movie Gen Video operates in a spatio-temporally compressed latent produced by a Temporal Autoencoder (TAE), which compresses input video by a factor of 8 across each of the height, width, and time dimensions. For text conditioning, the model concatenates the outputs of three text encoders, UL2, ByT5, and a long-prompt variant of MetaCLIP, after projecting them to a shared 6,144-dimensional space; the combination is meant to capture both semantic meaning and character-level detail in prompts.^[2]^[4]
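The text-conditioning step can be pictured as follows. This is a minimal sketch with toy dimensions: `SHARED_DIM` stands in for the 6,144-dimensional space, the per-encoder widths are made up, and joining the projected tokens along the sequence axis is an assumption (it is the natural reading once all three outputs share one width, but this article does not spell it out). Any normalization the real model applies is omitted, and all names are hypothetical.

```python
import random

SHARED_DIM = 8  # toy stand-in for the 6,144-dimensional shared space


def make_projection(in_dim, out_dim, rng):
    """Random in_dim x out_dim matrix standing in for a learned linear layer."""
    scale = in_dim ** -0.5
    return [[rng.gauss(0.0, scale) for _ in range(out_dim)] for _ in range(in_dim)]


def project(tokens, weight):
    """Map each token vector (length in_dim) to the shared width."""
    out_dim = len(weight[0])
    return [
        [sum(tok[i] * weight[i][j] for i in range(len(tok))) for j in range(out_dim)]
        for tok in tokens
    ]


def build_text_context(encoder_outputs, projections):
    """Project every encoder's tokens, then concatenate along the sequence axis."""
    context = []
    for name, tokens in encoder_outputs.items():
        context.extend(project(tokens, projections[name]))
    return context


rng = random.Random(0)
toy_widths = {"ul2": 4, "byt5": 6, "metaclip": 5}   # made-up encoder widths
toy_lengths = {"ul2": 3, "byt5": 5, "metaclip": 2}  # made-up token counts
encoder_outputs = {
    name: [[rng.gauss(0.0, 1.0) for _ in range(toy_widths[name])]
           for _ in range(toy_lengths[name])]
    for name in toy_widths
}
projections = {name: make_projection(w, SHARED_DIM, rng)
               for name, w in toy_widths.items()}
context = build_text_context(encoder_outputs, projections)
print(len(context), len(context[0]))  # 10 8
```

The result is one sequence of 3 + 5 + 2 = 10 vectors, each of the shared width, which a video model could then attend to as its text context.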

What does Movie Gen Audio do?

Movie Gen Audio is a roughly 13-billion-parameter model for video-to-audio and text-to-audio generation. Given a video and an optional text prompt, it produces 48 kHz audio that is synchronized to the on-screen action. The model generates diegetic sound effects (including Foley-style effects) timed to visible events, diegetic ambient sound that matches the scene, and non-diegetic instrumental music that fits the mood. It does not generate speech or dialogue.^[1]^[2]

A single generation produces audio up to about 45 seconds long. The model handles variable-length output and, through an audio-extension technique, can produce coherent soundtracks for videos several minutes long, well beyond the 16-second limit of the video model. Meta reported state-of-the-art results for both the video-to-audio and the text-to-audio settings in its human evaluations.^[1]^[2]
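To give a sense of scale, the sketch below computes how many samples a 45-second clip contains at 48 kHz and shows one simple way to cover a longer video with overlapping windows. The windowing scheme, the 5-second overlap, and the function names are hypothetical; this article does not describe how Meta's audio-extension technique actually works.

```python
SAMPLE_RATE_HZ = 48_000   # from the article
MAX_CLIP_SECONDS = 45     # approximate single-generation limit from the article


def samples_for(seconds, sample_rate=SAMPLE_RATE_HZ):
    """Number of audio samples needed for a clip of the given duration."""
    return int(round(seconds * sample_rate))


def plan_segments(video_seconds, clip_seconds=MAX_CLIP_SECONDS, overlap_seconds=5):
    """Return (start, end) windows that cover the video with some overlap.

    The overlap is a hypothetical way to let each window see the tail of the
    previous one; it is not Meta's documented method.
    """
    if video_seconds <= 0:
        raise ValueError("video_seconds must be positive")
    if not 0 <= overlap_seconds < clip_seconds:
        raise ValueError("overlap must be non-negative and shorter than the clip")
    segments, start = [], 0.0
    step = clip_seconds - overlap_seconds
    while True:
        end = min(start + clip_seconds, video_seconds)
        segments.append((start, end))
        if end >= video_seconds:
            return segments
        start += step


print(samples_for(45))     # 2160000
print(plan_segments(120))  # three windows: 0-45 s, 40-85 s, 80-120 s
```

A two-minute video therefore needs three such windows under these assumptions, while a clip within the video model's 16-second limit fits in a single one.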

How does Movie Gen edit and personalize video?

Two of the four advertised capabilities are specializations of the video model.

Movie Gen Edit performs instruction-based video editing. A user supplies an existing clip, real or AI-generated, along with a text instruction, and the model applies the requested change while leaving the rest of the content intact. Meta highlighted localized and stylistic edits such as changing styles, adding or removing elements, altering backgrounds, and adjusting transitions.^[1]^[2]

Personalized Movie Gen Video conditions generation on a single image of a person together with a text prompt, producing a video that features that individual while following the prompt and preserving identity and natural motion. Meta said this personalization path set a new state of the art for identity-preserving video generation in its evaluations.^[1]^[2]

What data was Movie Gen trained on?

The paper describes training on large, filtered collections of paired media and text rather than a single fixed dataset. Movie Gen Video was trained on the order of 100 million video-text pairs and on the order of 1 billion image-text pairs, reflecting its joint image and video objective.^[2]^[4]

Meta describes a multi-stage data curation pipeline but does not name specific dataset sources. The pipeline applies visual filtering (for quality, aspect ratio, on-screen text via OCR, and scene-cut detection), motion filtering to remove static or erratic clips, and content filtering that deduplicates clips and improves concept diversity. The paper also gives no explicit statement of the licensing status or provenance of the underlying media.^[2]^[4]
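The staged structure of such a pipeline can be sketched as a chain of filters. Everything specific here is hypothetical: the `Clip` fields, the thresholds, and the exact-match deduplication are placeholders chosen for illustration, not values or methods reported by Meta, and the concept-diversity step is not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Clip:
    clip_id: str
    width: int
    height: int
    text_area_fraction: float  # share of the frame covered by OCR-detected text
    scene_cuts: int            # number of detected cuts inside the clip
    motion_score: float        # hypothetical scale: 0.0 = static, 1.0 = erratic
    signature: str             # stand-in for a near-duplicate fingerprint


def passes_visual(clip):
    """Visual stage: quality, aspect ratio, on-screen text, scene cuts."""
    aspect = clip.width / clip.height
    return (
        min(clip.width, clip.height) >= 720   # hypothetical quality floor
        and 0.5 <= aspect <= 2.0              # hypothetical aspect-ratio band
        and clip.text_area_fraction <= 0.10   # hypothetical text limit
        and clip.scene_cuts == 0              # keep single continuous shots
    )


def passes_motion(clip, low=0.05, high=0.90):
    """Motion stage: drop clips that are nearly static or erratic."""
    return low <= clip.motion_score <= high


def curate(clips):
    """Run the visual and motion stages, then drop repeated signatures."""
    kept, seen = [], set()
    for clip in clips:
        if not (passes_visual(clip) and passes_motion(clip)):
            continue
        if clip.signature in seen:            # content stage: deduplication only
            continue
        seen.add(clip.signature)
        kept.append(clip)
    return kept
```

For example, given one clean clip, a static clip, a text-heavy clip, a duplicate of the first, and a clip with scene cuts, `curate` keeps only the first. Cheap checks run before deduplication here so that the `seen` set only tracks clips that would otherwise be kept.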

Is Movie Gen publicly available?

At announcement, Movie Gen was a research project with no public product, API, or open weights. Meta said it did not plan to incorporate the models into public products until the following year and framed the release as an effort to open an early dialogue with creators. The project page described the work as moving "toward a potential future release," to be developed with feedback from filmmakers and creators rather than launched directly to the public.^[1]^[3]

As part of that feedback effort, Meta said it was working with filmmakers and creators, including the horror studio Blumhouse and selected artists, to test the tools before any wider availability. Press reporting at the time of the announcement, including from Wired and Bloomberg, noted that access was limited to some Meta staff and external partners and that Meta intended to bring the technology to its apps, with Instagram cited as a likely surface, during 2025.^[5]^[6]^[7]

That product step arrived in stages in 2025. On June 11, 2025, Meta announced a generative AI video editing feature, available in the Meta AI app, on the Meta.AI website, and in Meta's standalone Edits app, that it described as "inspired by our Movie Gen models." The initial version offered more than 50 preset prompts that could transform up to 10 seconds of video, changing elements such as outfit, location, style, and lighting, and Meta said custom text-prompted edits would follow later in the year. Meta presented the feature as a first step toward bringing AI video generation and editing across its products, rather than a release of the full Movie Gen models themselves.^[8]

References

1. Meta AI. "How Meta Movie Gen could usher in a new AI-enabled era for content creators." AI at Meta Blog, October 4, 2024. https://ai.meta.com/blog/movie-gen-media-foundation-models-generative-ai-video/
2. Polyak, Adam, et al. (The Movie Gen Team @ Meta). "Movie Gen: A Cast of Media Foundation Models." Meta AI Research, October 2024. https://ai.meta.com/research/publications/movie-gen-a-cast-of-media-foundation-models/
3. "Movie Gen." Meta AI research project page. https://ai.meta.com/research/movie-gen/
4. Polyak, Adam, et al. "Movie Gen: A Cast of Media Foundation Models." arXiv:2410.13720, submitted October 17, 2024 (revised February 26, 2025). https://arxiv.org/abs/2410.13720
5. Variety. "Meta Teams With Blumhouse and Filmmakers Like Casey Affleck to Test Movie Gen AI Tool." October 2024. https://variety.com/2024/digital/news/meta-movie-gen-blumhouse-casey-affleck-1236180282/
6. Reece Rogers. "Meta Announces Movie Gen, an AI Model for Generating Video and Audio." Wired, October 4, 2024 (via Techmeme). https://www.techmeme.com/241004/p8
7. Kurt Wagner. "Meta's Movie Gen Can Create 16-Second Videos From Text Prompts." Bloomberg, October 4, 2024 (via Techmeme). https://www.techmeme.com/241004/p9
8. Meta. "You Can Now Edit Videos With Meta AI." Meta Newsroom, June 11, 2025. https://about.fb.com/news/2025/06/edit-videos-with-meta-ai/
