Text-to-video generation

Updated Jun 21, 2026

Text-to-video (often abbreviated T2V) is the generative AI capability of producing video clips, with or without sound, directly from a written prompt. A user types a description of what should appear on screen, and a model returns a sequence of frames that, ideally, looks coherent across time and matches the request. Most T2V systems are diffusion models built on a diffusion transformer backbone and trained on web-scale video and caption datasets. As of 2026 the leading products are OpenAI's Sora 2 and Google DeepMind's Veo 3, both of which generate synchronized native audio alongside the picture. ^[16]^[18] The same approach extends to image-to-video and video-to-video editing.

The field went from research curiosity to consumer product in roughly two and a half years. In May 2022, Tsinghua University's CogVideo could just barely produce 480x480 clips of a few seconds. ^[6] By February 2024, OpenAI's Sora preview generated minute-long 1080p shots with consistent character identity and basic physics. ^[16] By mid-2025, Google DeepMind's Veo 3 added native dialogue, sound effects, and music to its outputs, closing one of the most visible gaps between AI video and conventional film. ^[18] The speed of consumer adoption was striking: OpenAI's standalone Sora app reached one million downloads in under five days after its September 2025 launch, faster than ChatGPT had managed, and topped the US App Store within its first week. ^[42]^[43]

Text-to-video models share a heritage with text-to-image systems. Most modern T2V architectures are diffusion models, often built around a diffusion transformer (DiT), trained on enormous video and caption datasets. ^[11] The challenges that made the problem hard for so long, namely temporal consistency, computational cost, and reliable physics, have shaped the technical history of the field. The challenges that remain, principally hands, dense text rendering, long-form story logic, and the question of whose footage went into training, are the focus of current work.

What is text-to-video generation?

A text-to-video system takes a natural language prompt as input and outputs a video file: a sequence of frames usually rendered at 24 to 30 frames per second, lasting anywhere from one or two seconds in the early models up to a minute or more in 2024 to 2026 systems. Most commercial products today produce clips of five to ten seconds at resolutions between 720p and 1080p, with some higher-tier modes reaching 4K.

The core conditioning is text. Many modern systems also accept additional inputs:

A starting image (image-to-video, sometimes called I2V).
A reference video for style or motion (video-to-video).
A reference subject for character consistency.
An ending frame, so the model interpolates between two stills.
A camera motion description or an explicit camera path.
Audio, in the case of recent models that synchronize lip motion to a provided voice track.

A text-to-video model is, at heart, a probability distribution over short videos conditioned on text. Sampling from that distribution is what produces a clip. This framing distinguishes T2V from older techniques like 3D rendering, motion capture, or compositing, where a deterministic pipeline turns explicit instructions into pixels.

History

Pre-2022: precursors

Research on generating video with neural networks predates the modern wave by several years. Among the earliest published work was Vondrick, Pirsiavash, and Torralba's VGAN (2016, "Generating Videos with Scene Dynamics," NeurIPS 2016), which trained a generative adversarial network on unlabeled video clips to produce one-second 64x64 outputs by separating foreground motion from a static background. ^[1] Saito and colleagues followed in 2017 with TGAN ("Temporal Generative Adversarial Nets with Singular Value Clipping," ICCV 2017), and Tulyakov and collaborators at NVIDIA released MoCoGAN in 2018 ("MoCoGAN: Decomposing Motion and Content for Video Generation," CVPR 2018), which split the latent space into content and motion components. ^[2]^[3] DeepMind's DVD-GAN (Clark et al., 2019) scaled the approach to Kinetics-600 at resolutions up to 256x256 and also reported results on UCF-101. ^[4] None of these systems took a free-form sentence as input; they were class-conditional or unconditional.

Datasets like UCF-101 and Moving MNIST anchored the early benchmarks. Results were short, low resolution, and class-conditional rather than open-ended. Variational autoencoders and autoregressive transformers followed. By 2021, Microsoft's NUWA (Wu et al., arXiv:2111.12417) could synthesize short clips from captions using a 3D transformer with sparse attention, but the quality remained too low for any practical use. ^[5] The field was stuck on a familiar trio of problems: too little high-quality video and caption data, too little compute, and architectures that did not scale gracefully to space and time together.

2022: the breakthrough year

Three papers in 2022 set the modern direction.

In May 2022, researchers at Tsinghua University released CogVideo (Hong et al., arXiv:2205.15868). It was the first publicly released text-to-video model to show a clear quality jump over earlier work. CogVideo used an autoregressive transformer pretrained on text-to-image and then fine-tuned on text-video pairs. Outputs were 32 frames at 480x480, about four seconds long. ^[6] The team also released the weights, which was unusual at the time and seeded a generation of open work in China.

On September 29, 2022, Meta AI announced Make-A-Video (Singer et al., arXiv:2209.14792). Make-A-Video reused a pretrained text-to-image diffusion model and bolted on temporal layers, then trained on unlabeled video to learn motion priors without paired text-video supervision. ^[7] The released demo videos were short and a bit smudgy, but the trick of starting from a strong T2I model became a recipe many later systems followed. Meta did not release weights.

A week later, on October 5, 2022, Google researchers posted Imagen Video (Ho et al., arXiv:2210.02303). Imagen Video used a cascade of seven diffusion models: a base text-to-video model, then three temporal super-resolution stages and three spatial super-resolution stages, ending at 1280x768 and 24 frames per second. It was pixel-space, not latent, which made it expensive but visually clean. ^[8] Almost simultaneously, Google released Phenaki (Villegas et al.), which introduced C-ViViT, a tokenizer that compressed video to a discrete sequence so a transformer could autoregressively generate variable-length clips of up to a few minutes from chained prompts. ^[9] Neither was made public as a product.

2023: open weights and the first commercial wave

The arrival of Stable Diffusion in 2022 had created a culture of open weights and rapid forking around image generation. In 2023, the same dynamic showed up for video.

Alibaba's DAMO Academy released ModelScope T2V (also called Text2Video-Synthesis) in early 2023. It was the first open weights diffusion video model, downloadable from Hugging Face, and produced 256x256 clips. Quality was unpolished, and the Shutterstock watermark inherited from the training data became a kind of folk meme. Soon after, Zeroscope appeared as a community fine-tune that scrubbed the watermark and increased resolution.

VideoCrafter from Tencent ARC, Show-1 from a National University of Singapore team, and AnimateDiff from Yuwei Guo and collaborators all appeared in 2023, each a slightly different attempt to take a strong Stable Diffusion checkpoint and add motion. AnimateDiff in particular was widely adopted because it worked as a plug-in module: any existing Stable Diffusion 1.5 LoRA could now produce short animated clips.

On the commercial side, Runway (often called Runway ML) launched Gen-1 in February 2023, a video-to-video stylization tool that took an existing clip and a text or image prompt, then re-rendered the source. Gen-2, launched in March 2023 and made generally available in June, was true text-to-video and image-to-video. Pika Labs opened a Discord-based beta around the same time and rolled out Pika 1.0 in late 2023, undercutting Runway on price.

The year ended with Stability AI's Stable Video Diffusion in November 2023. SVD took a Stable Diffusion 2.1 image model, inserted temporal convolutional layers, and trained on a curated subset of the LVD dataset. The 14-frame and 25-frame variants were released with weights, the first widely usable open T2V from a Western lab. ^[13]

2024: Sora and the scale-up

Two Google research releases bracketed the turn of the year. Lumiere (Bar-Tal et al., arXiv:2401.12945), posted in January 2024, introduced a Space-Time U-Net that produced the entire temporal duration of a video in a single pass instead of generating distant keyframes and interpolating. ^[15] Google had also published VideoPoet (Kondratyuk et al., arXiv:2312.14125) in December 2023, an autoregressive language model trained to predict tokens spanning video, audio, and text, which was presented at Google I/O in May 2024. ^[14] Neither Lumiere nor VideoPoet became a consumer product, but both demonstrated that Google had multiple parallel video generation efforts inside Research and DeepMind.

The pivot point of the field is February 15, 2024. OpenAI announced Sora with a set of demo videos that were qualitatively different from anything else publicly known: minute-long 1080p shots, recognizable characters across a take, plausible if not perfect physics, and prompts as elaborate as multi-paragraph short stories. The technical report ("Video generation models as world simulators") described a diffusion transformer trained on "spacetime patches," a unified token format that let the model handle videos of arbitrary aspect ratios, resolutions, and durations during both training and inference. ^[16] OpenAI framed the work as more than a media tool, writing that its "results suggest that scaling video generation models is a promising path towards building general purpose simulators of the physical world." ^[16] Sora stayed in research preview through 2024, with limited red team access. It launched as a product on December 9, 2024 inside a Sora.com web app, included with ChatGPT Plus and Pro subscriptions. ^[19]

The rest of 2024 was a sprint to catch up.

Luma Labs released Dream Machine (often shortened to Luma) in June 2024, accessible from a web interface with no waitlist, the first widely available product close to Sora's quality. Runway answered with Gen-3 Alpha later in June, retraining its model on new data. Kuaishou launched Kling 1.0 in June as well, a product that started in China and quickly expanded worldwide. ^[31] MiniMax shipped Hailuo AI, also called Video-01, in late summer. ByteDance, owner of TikTok and Douyin, integrated its Seedance model into the Doubao app. Tsinghua spinout Shengshu Technology released Vidu, demonstrating its U-ViT architecture.

Google DeepMind made the year's other major announcement on May 14, 2024, when Veo was unveiled at Google I/O. A research preview followed, then in December 2024 Veo 2 was released with up to 4K resolution and improved physics. ^[17]

Meta announced Movie Gen on September 27, 2024, a 30 billion parameter foundation model with paired audio and personalized video features. Meta did not ship Movie Gen as a consumer product; the work was published as a research paper. ^[23]

China's open weights ecosystem made the year's quietest but most consequential moves. Zhipu released CogVideoX in August 2024, with 2 billion and 5 billion parameter checkpoints freely downloadable. ^[21] Tencent released HunyuanVideo on December 3, 2024, a 13 billion parameter DiT, by some benchmarks the strongest open weights video model to date (Kong et al., arXiv:2412.03603). ^[22]

The open weights wave was not entirely Chinese. San Francisco startup Genmo released Mochi 1 in October 2024, a 10 billion parameter Asymmetric Diffusion Transformer that the company billed as the largest openly released video model at the time, with weights on Hugging Face under an Apache 2.0 license. ^[24] Israeli company Lightricks released LTX Video in November 2024, a smaller 2 billion parameter DiT optimized for fast generation on a single consumer GPU. ^[25] The HPC-AI Tech team released Open-Sora, an open reproduction of Sora that progressed through versions 1.0, 1.1, and 1.2 across 2024 and offered training code, weights, and a clear technical report. Adobe announced Firefly Video Model in beta on October 14, 2024, scheduled for general availability in 2025 and marketed on the basis of being trained only on licensed and Adobe Stock material. ^[26]

2025 to 2026: audio, longer takes, and consolidation

Veo 3, announced at Google I/O on May 20, 2025, became the first major video model to include native audio synthesis, generating dialogue, sound effects, and music aligned to the picture. Google DeepMind CEO Demis Hassabis framed the launch in historical terms, saying "for the first time, we're emerging from the silent era of video generation." ^[44] The model initially produced eight-second clips, with Veo 3.1 later in the year extending duration and adding finer control over scene transitions. Google rolled access into the Gemini app and the Vertex AI API, with later integrations into YouTube Shorts and the Flow filmmaking tool. ^[18]

OpenAI followed with Sora 2 on September 30, 2025, adding native audio, more reliable physical simulation, and a TikTok-style social app called Sora that surfaces user-generated clips. The Sora app reached the No. 1 spot on the US App Store within days and crossed one million downloads in under five days, faster than ChatGPT reached the same milestone. ^[42]^[43] Sora 2 is rumored, though not officially confirmed, to use rectified flow rather than standard diffusion. ^[20]

Runway shipped Gen-4 in March 2025, doubling down on character and scene consistency for short film production. Pika 2 had launched in December 2024 with a feature called Scene Ingredients that let users compose clips from labeled images. Kling went through 1.5, 2.0, and 2.1 versions in 2025, becoming the dominant T2V product in the Chinese market and a strong international option; Kuaishou also opened Kling AI Studio, a multi-tool editor, in April 2025. Alibaba released Wan 2.1 in February 2025 and Wan 2.2 in July, both with open weights; Wan 2.2 includes an A14B mixture-of-experts variant with roughly 14 billion active parameters. ^[30] Tencent followed its December 2024 base model with HunyuanVideo-I2V in March 2025, an image-to-video extension trained on the same backbone. Adobe took Firefly Video Model to general availability inside Premiere Pro and the Firefly web app in February 2025.

By early 2026 the consumer market had stabilized into roughly three tiers. At the top, Sora 2 and Veo 3 traded the lead on visual quality. In the middle, Kling, Runway Gen-4, Hailuo, and Luma competed on price and feature breadth. At the open weights tier, Wan 2.2, HunyuanVideo, and CogVideoX kept the research and indie filmmaking community supplied. Hollywood pilots that had begun in 2024 turned into actual productions in 2025, mostly for short films, music videos, and previsualization.

Timeline of key models

Year	Model	Org	Notes
2016	VGAN	MIT (Vondrick et al.)	First neural video generator (NeurIPS)
2017	TGAN	Saito et al.	Temporal GAN (ICCV)
2018	MoCoGAN	NVIDIA (Tulyakov et al.)	Motion-content disentangled GAN (CVPR)
2021 Nov	NUWA	Microsoft	Multimodal 3D transformer with sparse attention
2022 May	CogVideo	Tsinghua	First open T2V; 4s, 480x480
2022 Sep	Make-A-Video	Meta AI	T2I prior plus temporal layers
2022 Oct	Imagen Video	Google	Cascaded pixel-space diffusion
2022 Oct	Phenaki	Google	Variable-length via C-ViViT
2023 Feb	Runway Gen-1	Runway	Video-to-video stylization
2023 Mar	ModelScope T2V	Alibaba DAMO	First open weights diffusion T2V
2023 Mar	Runway Gen-2	Runway	Commercial T2V and I2V
2023 Jul	Zeroscope	community	Watermark-free fine-tune
2023 Sep	Show-1	NUS	Pixel plus latent hybrid
2023 Oct	VideoCrafter 1/2	Tencent ARC	Open weights
2023 Nov	Stable Video Diffusion	Stability AI	14 and 25 frame open release
2023 Dec	VideoPoet	Google	Autoregressive transformer with audio
2024 Jan	Lumiere	Google Research	Space-Time U-Net, single-pass duration
2024 Feb	Sora preview	OpenAI	DiT, spacetime patches, 60s
2024 May	Veo	Google DeepMind	I/O announcement
2024 Jun	Dream Machine	Luma Labs	Open consumer access
2024 Jun	Gen-3 Alpha	Runway	Retrained foundation
2024 Jun	Kling 1.0	Kuaishou	Strong China-built option
2024 Aug	CogVideoX	Zhipu AI	2B and 5B open weights
2024 Sep	Movie Gen	Meta	Research only, with audio
2024 Sep	Hailuo Video-01	MiniMax	Free tier launch
2024 Oct	Mochi 1	Genmo	10B open weights, Apache 2.0
2024 Oct	Firefly Video Model	Adobe	Beta; trained on licensed material
2024 Nov	LTX Video	Lightricks	2B DiT optimized for single-GPU
2024 Dec	Veo 2	Google DeepMind	4K, better physics
2024 Dec	HunyuanVideo	Tencent	13B open weights DiT
2024 Dec	Sora release	OpenAI	Public product launch
2024 Dec	Pika 2.0	Pika Labs	Scene Ingredients
2025 Feb	Wan 2.1	Alibaba	Open weights DiT
2025 Feb	Firefly Video GA	Adobe	In Premiere Pro and Firefly web
2025 Mar	HunyuanVideo-I2V	Tencent	Image-to-video extension
2025 Mar	Runway Gen-4	Runway	Character consistency focus
2025 Apr	Kling AI Studio	Kuaishou	Multi-tool editor
2025 May	Veo 3	Google DeepMind	Native audio
2025 Jul	Wan 2.2	Alibaba	A14B MoE open weights
2025 Sep	Sora 2	OpenAI	Native audio, social app

How do text-to-video models work?

Text-to-video is hard because video is high dimensional and temporally structured. A 5 second 1080p clip at 30 fps contains roughly 310 million pixels (about 930 million values across three color channels), two orders of magnitude more than a single frame. Naive approaches that treat video as a stack of independent images produce flickering nonsense. The field has converged on a small set of architectural ideas that handle space and time jointly while keeping computation tractable.
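
The size estimate above can be checked with a few lines of arithmetic. The sketch below assumes uncompressed RGB frames (three values per pixel), which is an illustrative assumption and not something the sources specify.

```python
# Back-of-envelope size of a raw, uncompressed video clip.
WIDTH, HEIGHT = 1920, 1080   # one 1080p frame
FPS = 30
SECONDS = 5
CHANNELS = 3                 # RGB; an assumption made for illustration

frames = FPS * SECONDS
pixels_per_frame = WIDTH * HEIGHT
pixels_per_clip = pixels_per_frame * frames
values_per_clip = pixels_per_clip * CHANNELS

print(f"frames in clip:      {frames}")
print(f"pixels per frame:    {pixels_per_frame:,}")
print(f"pixels per clip:     {pixels_per_clip:,}")
print(f"RGB values per clip: {values_per_clip:,}")
print(f"clip vs single frame: {pixels_per_clip // pixels_per_frame}x")
```

The clip holds 150 frames, so it is 150 times larger than one frame, which is where the "two orders of magnitude" figure comes from.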

Diffusion in latent space

Most current T2V models are latent diffusion models. The pipeline has three pieces. First, a 3D variational autoencoder compresses video to a compact latent representation, often by a factor of 8 spatially and 4 to 8 temporally. Second, a denoising network operates entirely in this latent space, removing noise from a randomly initialized tensor over many sampling steps until it lands at a clean latent. Third, the VAE decoder maps the final latent back to pixels. ^[12]
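
To make the compression step concrete, the following sketch computes the latent tensor shape for a clip under the factors mentioned above. The function name, the 16 latent channels, and the 720p example clip are hypothetical choices for illustration; real VAEs differ in channel count and in how they handle the first frame.

```python
def latent_shape(frames, height, width,
                 spatial_factor=8, temporal_factor=4, latent_channels=16):
    """Shape (C, T, H, W) of a 3D-VAE latent, using ceiling division.

    latent_channels=16 is an assumed value, not taken from any specific model.
    """
    def ceil_div(a, b):
        return -(-a // b)

    return (
        latent_channels,
        ceil_div(frames, temporal_factor),
        ceil_div(height, spatial_factor),
        ceil_div(width, spatial_factor),
    )


def count(shape):
    """Number of scalar values in a tensor of the given shape."""
    total = 1
    for dim in shape:
        total *= dim
    return total


# Example: 5 seconds of 720p video at 24 fps, stored as RGB.
pixel_shape = (3, 120, 720, 1280)
z_shape = latent_shape(frames=120, height=720, width=1280)

print("latent shape:", z_shape)
print("compression ratio:", count(pixel_shape) // count(z_shape))
```

Under these assumptions the denoising network sees 48 times fewer values than the raw clip, which is the saving that makes multi-second generation tractable.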

This is the same recipe that latent diffusion, the basis of Stable Diffusion, used for image generation, extended to time. Sora, Veo, Kling, HunyuanVideo, and Wan all use variants of this scheme. The advantage is huge: a single forward pass can cover several seconds of video in latent space, where the same operation in pixel space would be infeasible.

Imagen Video was an exception. It worked entirely in pixel space using a cascade of low-resolution and super-resolution diffusion models. The result was visually clean but extremely expensive. ^[8]

Diffusion transformers

Most early diffusion models, including Stable Diffusion 1 and 2 and the original video extensions like Stable Video Diffusion, used a U-Net backbone: a convolutional neural network with skip connections that downsampled and upsampled features through a bottleneck. U-Nets are sample efficient and work well at the scale of single images. ^[13]

The shift to a diffusion transformer started with William Peebles and Saining Xie's 2022 DiT paper for class-conditional image generation. DiT replaces the U-Net's convolutional backbone with a pure transformer that operates on patches of the latent, the same way vision transformers operate on image patches. It scales more cleanly: bigger transformers do better, more reliably, than bigger U-Nets. ^[11]

Sora's technical report explicitly named DiT as the inspiration. By 2025, Veo, Kling, HunyuanVideo, Wan, and many other top systems had moved to transformer backbones. ^[16] The trade-off is that self-attention cost grows quadratically with sequence length, so model designers spend significant effort on local or hierarchical attention variants to keep cost manageable.
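
The scaling problem can be illustrated by counting query-key pairs. This is a simplified cost model, assuming one interaction per token pair and a fixed local window; the window size of 2,048 tokens is an arbitrary illustrative value, and real systems use more elaborate schemes.

```python
def full_attention_pairs(num_tokens):
    """Every token attends to every token: cost grows as n squared."""
    return num_tokens * num_tokens


def windowed_attention_pairs(num_tokens, window):
    """Each token attends to at most `window` tokens: cost grows linearly in n."""
    return num_tokens * min(window, num_tokens)


for n in (10_000, 20_000, 40_000):
    full = full_attention_pairs(n)
    local = windowed_attention_pairs(n, window=2_048)
    print(f"{n:>6} tokens  full={full:>13,}  windowed={local:>11,}")
```

Doubling the clip length doubles the token count, which quadruples the cost of full attention but only doubles the cost of the windowed variant.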

Spacetime patches

The most discussed contribution of the Sora paper was its handling of variable inputs. Earlier video models trained at a fixed resolution and duration, often 16 frames at 256x256. Sora instead patchified videos into a sequence of spacetime patches. A patch is a small cube of latent voxels, and any video, regardless of aspect ratio, length, or resolution, becomes a sequence of these patches. Position embeddings carry the spatial and temporal coordinates. ^[16]
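
A rough sketch of the idea follows: any latent grid is cut into small cubes, and each cube becomes one token tagged with its (time, row, column) coordinate. The patch size of 1x2x2 and the function name are assumptions for illustration; OpenAI has not published Sora's actual patch dimensions.

```python
from itertools import product


def spacetime_patch_coords(latent_frames, latent_h, latent_w, patch=(1, 2, 2)):
    """Return one (t, y, x) grid coordinate per spacetime patch.

    patch=(1, 2, 2) is an assumed patch size, chosen only for illustration.
    """
    pt, ph, pw = patch
    if latent_frames % pt or latent_h % ph or latent_w % pw:
        raise ValueError("latent dimensions must be divisible by the patch size")
    return list(product(
        range(latent_frames // pt),
        range(latent_h // ph),
        range(latent_w // pw),
    ))


# Three differently shaped latents all become flat token sequences.
widescreen = spacetime_patch_coords(30, 90, 160)
vertical = spacetime_patch_coords(30, 160, 90)
short_square = spacetime_patch_coords(8, 64, 64)

print(len(widescreen), len(vertical), len(short_square))
print("first token:", widescreen[0], "last token:", widescreen[-1])
```

Because the transformer only sees a sequence of tokens plus their coordinates, the same network can handle all three shapes without resizing or cropping.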

The practical implication is that Sora can train on whatever video it has, in whatever shape, and at inference can produce widescreen, square, or vertical clips of any reasonable duration. Most subsequent commercial systems adopted some version of this idea, often called "native resolution" or "variable aspect ratio" training.

Two-stage and cascaded approaches

Make-A-Video followed a two-stage recipe: first generate a strong first frame using a text-to-image model, then animate it. ^[7] This pattern still shows up in image-to-video products. The user starts from a still and the system handles only the motion. The advantage is that any progress in T2I quality immediately benefits T2V; the disadvantage is that the model never learns to plan motion at the same time as it composes a scene.

Imagen Video's cascaded approach generated low-resolution video first, then ran multiple super-resolution diffusion stages, both spatial and temporal. ^[8] The cascade trick survives in some commercial pipelines as a way to produce 4K output without training a full 4K model.

Flow matching and rectified flow

Standard diffusion adds Gaussian noise to data and trains a network to reverse the process step by step, typically with 25 to 100 sampling steps. Flow matching and rectified flow, popularized by papers from Lipman et al. and Liu et al. in 2022, reformulate the same problem as learning a velocity field that pushes noise toward data along straight paths. ^[28]^[29] The result is faster sampling, often only a handful of steps for similar quality.
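
The straight-path idea can be shown with a toy example. The sketch below is not any production sampler: it uses plain Python lists instead of tensors, and the "network" is replaced by an oracle velocity function that is exact only when the target distribution is a single data point. All names are hypothetical.

```python
def training_pair(noise, data, t):
    """Point on the straight line from noise (t=0) to data (t=1),
    plus the constant velocity a network would be trained to predict there."""
    x_t = [(1 - t) * n + t * d for n, d in zip(noise, data)]
    velocity = [d - n for n, d in zip(noise, data)]
    return x_t, velocity


def euler_sample(velocity_fn, noise, steps):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-size Euler steps."""
    x = list(noise)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy setup: the "dataset" is one point, so the ideal velocity field is known.
DATA = [2.0, -1.0]


def oracle_velocity(x, t):
    """Stand-in for a trained network (valid only for this one-point toy)."""
    return [(d - xi) / (1.0 - t) for d, xi in zip(DATA, x)]


noise = [0.3, 0.7]
print(training_pair(noise, DATA, t=0.5))
print(euler_sample(oracle_velocity, noise, steps=1))
print(euler_sample(oracle_velocity, noise, steps=4))
```

Because the path from noise to data is a straight line, even a single Euler step lands on the target in this toy case, which is the intuition behind the reduced step counts.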

Stable Diffusion 3 adopted rectified flow on the image side, and open video models such as HunyuanVideo and Wan 2.1 are trained with flow matching. Sora 2 is widely believed to use rectified flow, although OpenAI has not confirmed details. ^[20] Veo 3 also appears to use a flow-based formulation. Whether or not the term shows up in marketing, by 2025 most frontier T2V systems had moved off vanilla DDPM-style diffusion.

Long video and autoregressive chunks

Getting beyond 10 seconds is a separate problem from getting from 0 to 10. The naive approach, generate a 60-second latent in one pass, hits memory and quality walls. Phenaki's solution in 2022 was autoregressive chunking: generate a few seconds, condition the next chunk on the last frames of the previous one, and chain prompts to control the story across the whole sequence. ^[9]
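
The chunk-and-condition loop described above can be sketched as follows. `generate_chunk` is a hypothetical stand-in that returns labelled placeholder frames instead of calling a real model, and the chunk length and overlap values are arbitrary.

```python
def generate_chunk(prompt, context_frames, num_new_frames):
    """Hypothetical stand-in for a video model call.

    A real model would denoise new frames conditioned on `context_frames`;
    here each "frame" is just a dict recording its index and prompt.
    """
    start = context_frames[-1]["index"] + 1 if context_frames else 0
    return [{"index": start + i, "prompt": prompt} for i in range(num_new_frames)]


def generate_long_video(prompts, frames_per_chunk=48, overlap=8):
    """Chain chunks: each one is conditioned on the tail of the video so far."""
    video = []
    for prompt in prompts:
        context = video[-overlap:]  # empty for the first chunk
        video.extend(generate_chunk(prompt, context, frames_per_chunk))
    return video


story = [
    "a fox wakes up in a snowy forest",
    "the fox trots toward a frozen river",
    "the fox crosses the ice at sunrise",
]
video = generate_long_video(story)
print(len(video), "frames")
print(video[47]["prompt"], "->", video[48]["prompt"])
```

Memory use stays bounded by the chunk size, but errors can accumulate across chunks, which is one reason long outputs tend to drift.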

Sora used long context windows to produce a single coherent minute, but later systems often returned to chunked autoregressive generation for longer outputs. ^[16] Veo 3.1 and Sora 2 both support multi-minute durations through chunk-and-condition pipelines.

How do text-to-video models generate audio?

Until 2025, almost every video model produced silent clips. Audio was added after the fact using stock libraries or separate audio generators such as ElevenLabs, Suno, or Stable Audio.

Google DeepMind's Veo 3, announced at I/O on May 20, 2025, was the first major commercial T2V system to generate native audio. The model produced dialogue, ambient sound, music, and lip-synced speech in the same forward pass as the picture. Audio was conditioned on the prompt, so a request like "a chef explaining how to fold dough, with quiet kitchen sounds in the background" returned both the visual and the soundtrack from one generation. ^[18] Demis Hassabis marked the shift by saying the field was "emerging from the silent era of video generation." ^[44]

OpenAI's Sora 2 followed in September 2025 with native audio. The Sora 2 social app made this immediately obvious: a wall of short clips with synchronized voice, music, and effects, generated by users from text alone. ^[20]

Research precursors include MM-Diffusion (CVPR 2023), which trained a joint model on video and audio, and MMAudio, a 2024 model that generated audio conditioned on both video and text. Meta's Movie Gen Audio model, released in research form alongside Movie Gen in September 2024, also handled paired audio. ^[23]

Native audio creates new problems. Voices generated this way may impersonate real people; sound effects may copy from training material; the line between video and music generation blurs. Both Google and OpenAI applied watermarking and content moderation to their audio outputs, though the effectiveness of these has been debated. ^[36]

Major commercial products

Product	Company	Launched	Notes
Sora / Sora 2	OpenAI	Feb 2024 preview, Dec 2024 product, Sora 2 Sep 2025	DiT, spacetime patches, native audio in Sora 2
Veo 1/2/3	Google DeepMind	May 2024, Dec 2024, May 2025	4K in Veo 2, native audio in Veo 3
Runway Gen-2/3/4	Runway	Mar 2023, Jun 2024, Mar 2025	Long-running brand; film industry focus
Pika 1.0/2.0	Pika Labs	Dec 2023, Dec 2024	Discord then web; Scene Ingredients in 2.0
Dream Machine	Luma Labs	Jun 2024	Open consumer access at launch
Kling 1.0/1.5/2.0/2.1	Kuaishou	Jun 2024 onward	Largest Chinese T2V product
Hailuo AI / Video-01	MiniMax	Sep 2024	Free tier; competitive quality
Seedance	ByteDance	2024	Inside Doubao app
Vidu	Shengshu / Tsinghua	2024	U-ViT architecture; rapid iteration
Firefly Video	Adobe	Oct 2024 beta, Feb 2025 GA	Trained on licensed Adobe Stock; in Premiere Pro
HeyGen	HeyGen	2022 onward	Talking-head avatars from text
Synthesia	Synthesia	2017 onward	Enterprise avatar video
D-ID	D-ID	2017 onward	Talking-photo animation

On pricing and access: in early 2026, generating a typical 8-second 1080p clip costs about 25 to 75 cents on most consumer products, with audio-enabled outputs at the higher end. Subscriptions in the 20 to 200 dollar per month range are standard. Sora and Veo gate access by geography and subscription tier. Kling, Hailuo, and Wan are accessible globally, with regional payment quirks.

Is text-to-video open source?

The strongest products (Sora 2, Veo 3) are closed and accessible only through paid APIs and apps, but a large and capable open weights ecosystem exists alongside them, dominated by Chinese labs.

Model	Org	Released	Parameters	Notes
CogVideo	Tsinghua	May 2022	9B	First public release
ModelScope T2V	Alibaba DAMO	Mar 2023	1.7B	Watermarked
Zeroscope	community	2023	derived	Cleaned ModelScope fine-tune
VideoCrafter 1/2	Tencent ARC	2023	~1.4B	Latent diffusion
Stable Video Diffusion	Stability AI	Nov 2023	1.5B	First Western open T2V
AnimateDiff	community	2023	module	Plug-in for SD 1.5
CogVideoX	Zhipu AI	Aug 2024	2B and 5B	DiT, image- and text-conditioned
Open-Sora	HPC-AI Tech	2024	up to 1.1B	Sora reproduction; Apache 2.0
Mochi 1	Genmo	Oct 2024	10B	Asymmetric DiT; Apache 2.0
LTX Video	Lightricks	Nov 2024	2B	Single-GPU optimized
HunyuanVideo	Tencent	Dec 2024	13B	Strongest open T2V at release
Wan 2.1	Alibaba	Feb 2025	1.3B and 14B	T2V, I2V, V2V
HunyuanVideo-I2V	Tencent	Mar 2025	13B	Image-to-video extension
Wan 2.2	Alibaba	Jul 2025	up to 14B A14B MoE	Mixture-of-experts

The open weights stack is heavily Chinese, with HunyuanVideo, CogVideoX, and Wan dominating leaderboards. Stability AI's Stable Video Diffusion was the first Western open T2V, although its capability gap to closed frontier models widened over 2024 to 2025 as Stability shifted focus. ^[13] The other notable Western open releases were Mochi 1 from Genmo (October 2024) and LTX Video from Lightricks (November 2024), both released under permissive licenses for commercial use. ^[24]^[25] HPC-AI Tech's Open-Sora project, released in 2024, served as a research-grade reproduction with public training code rather than a competitive product. The pattern follows the broader open weights ecosystem in language models, where Qwen, DeepSeek, GLM, and Yi dominate while major US labs hold their best work back.

Open weights matter for video in part because the cost of training a frontier T2V model is now in the high tens to low hundreds of millions of dollars, far beyond academic budgets. Open releases let researchers study video diffusion behavior, fine-tune for specific domains, and run inference offline. ComfyUI workflows for HunyuanVideo, Wan 2.2, and LTX Video are now common in independent film and VFX work.

How is text-to-video quality measured?

Measuring T2V quality is unsolved. The same problems that plague image generation evaluation apply, and several new ones come from the time dimension.

FVD (Fréchet Video Distance) is the oldest metric in active use. It computes the Fréchet distance between Gaussian fits to the feature distributions of generated and real videos, using an inflated 3D Inception network (I3D) as the feature extractor. FVD is the video analog of FID for images. It correlates poorly with human preference, especially at the high end where most outputs are reasonable.
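
The distance itself is simple once features are extracted. The sketch below is a deliberately simplified version: it assumes diagonal covariance (each feature dimension treated independently) so that it runs on the standard library alone, whereas real FVD implementations use full covariance matrices and a matrix square root. The feature vectors are made-up numbers, not I3D outputs.

```python
import math
from statistics import fmean, pvariance


def gaussian_stats(features):
    """Per-dimension mean and variance of a list of feature vectors."""
    dims = list(zip(*features))
    return [fmean(d) for d in dims], [pvariance(d) for d in dims]


def frechet_distance_diag(real_features, generated_features):
    """Squared Frechet distance between two diagonal-covariance Gaussians.

    Simplification: the full metric uses Tr(S1 + S2 - 2 * sqrtm(S1 @ S2));
    with diagonal covariances that trace reduces to the per-dimension sum below.
    """
    mu_r, var_r = gaussian_stats(real_features)
    mu_g, var_g = gaussian_stats(generated_features)
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_r, mu_g))
    cov_term = sum(a + b - 2.0 * math.sqrt(a * b) for a, b in zip(var_r, var_g))
    return mean_term + cov_term


# Made-up 2-D "features" for two real and two generated videos.
real = [[0.0, 0.0], [2.0, 2.0]]
generated = [[1.0, 0.0], [3.0, 2.0]]

print(frechet_distance_diag(real, real))
print(frechet_distance_diag(real, generated))
```

Identical feature sets score zero, and the score grows as the means or spreads of the two sets diverge; lower is better.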

VBench, introduced in late 2023 by researchers at Nanyang Technological University, Shanghai AI Laboratory, and collaborating groups, with versions through 2026, is the dominant systematic benchmark today. VBench breaks video quality into 16 dimensions: subject consistency, background consistency, temporal flickering, motion smoothness, dynamic degree, aesthetic quality, imaging quality, object class accuracy, multiple objects, human action, color, spatial relationship, scene, appearance style, temporal style, and overall consistency. Each dimension is scored by a dedicated automated probe. ^[27] The VBench leaderboard on Hugging Face is widely cited; in early 2026 Sora 2, Veo 3, and Wan 2.2 trade the top three slots depending on the metric.

EvalCrafter is another comprehensive benchmark with similar structure. VideoFC focuses on factual or physical consistency.

Human preference arenas have become the most trusted comparison method. The Artificial Analysis Video Arena, modeled on LMSYS Chatbot Arena, presents users with two anonymous video generations from the same prompt and asks them to pick a winner. Aggregated rankings produce an Elo-style leaderboard. ^[41] The arena format avoids the problem that automated metrics can be gamed and that single-model evaluations are noisy.
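
How pairwise votes turn into a ranking can be sketched with the classic Elo update. This is a generic illustration, not the arena's documented method: the K-factor of 32, the 400-point scale, and the starting rating of 1000 are conventional assumptions, and production leaderboards may fit ratings differently.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the Elo model (400-point scale)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(rating_a, rating_b, a_won, k=32):
    """Return the new (rating_a, rating_b) after one head-to-head vote."""
    delta = k * ((1.0 if a_won else 0.0) - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Hypothetical model names and votes, for illustration only.
ratings = {"model_x": 1000.0, "model_y": 1000.0}
votes = [("model_x", "model_y", True),
         ("model_x", "model_y", True),
         ("model_x", "model_y", False)]

for a, b, a_won in votes:
    ratings[a], ratings[b] = update_elo(ratings[a], ratings[b], a_won)

print(ratings)
```

An upset win against a higher-rated model moves ratings more than an expected win, so the leaderboard converges as votes accumulate.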

No metric captures everything. Hand and finger artifacts, dense text rendering, and long-form story coherence are all weak spots that aggregated scores sometimes miss. Reading the VBench score is helpful; watching the actual outputs side by side is, as of 2026, still the only way to fully evaluate a system.

What is text-to-video used for?

The most common consumer use is making short, polished clips for social media. TikTok, Instagram Reels, YouTube Shorts, and Douyin all show heavy organic use of Kling, Sora 2, Hailuo, and Veo 3 outputs. The Sora 2 social app extends this directly: posts on the Sora app are generated, not uploaded. ^[20]

In marketing and advertising, T2V is mostly used for ideation, mood reels, and quick spec ads. The 2024 Toys R Us ad generated with Sora was an early stunt that drew mixed reactions; agencies now use the tools more discreetly. Procter and Gamble, Mondelez, and other major advertisers have publicly acknowledged using AI video in production pipelines.

In film and television, the picture is more cautious. Tyler Perry paused an 800 million dollar studio expansion in February 2024 after seeing Sora demos. ^[35] James Cameron joined the board of Stability AI in September 2024. Marvel Studios used an AI-generated opening credits sequence in Secret Invasion in 2023 to mixed reception. Several short films generated entirely or mostly with T2V tools won festival prizes in 2024 and 2025, including The Frost (Pika) and Critterz (Sora). Use in major studio releases by 2026 is mostly limited to previsualization, set extension, and concept work, not final-pixel footage.

In enterprise contexts, talking-head products like Synthesia, HeyGen, and D-ID sit in a related but distinct lane. They generate videos of human-looking avatars reading scripts, and they dominate corporate training, e-commerce explainers, and localization. The output is less impressive as cinema but the per-clip cost beats hiring presenters by orders of magnitude.

In education and accessibility, T2V is being explored for sign language video synthesis, rapid creation of instructional content, and dubbing or lip-sync remapping for foreign-language editions of existing videos.

Copyright, ethics, and labor

The legal questions around T2V are unsettled and active. Three threads matter most.

First, training data. Most major T2V models are trained on web-scale video datasets that include copyrighted footage. Early disclosures from Runway and Stability AI revealed scrapes of YouTube, Vimeo, and stock libraries. The New York Times v. OpenAI lawsuit, filed in December 2023, names Sora's predecessors among the technologies trained on Times content. Movie Gen's training data was not fully disclosed, although Meta confirmed it included a mix of licensed and publicly available footage. ^[23] As of early 2026, no court has issued a definitive ruling on whether training on copyrighted video constitutes fair use.

Second, output similarity. T2V models can produce video that looks substantively like specific copyrighted works, sometimes by accident, sometimes when prompted. Sora 2 was caught generating recognizable Pixar-style and Studio Ghibli-style content within days of launch. ^[20] OpenAI added moderation layers and IP-aware filters in response. Disney, Universal, and Warner Bros. Discovery initially warned that they reserved their rights and have since gone to court: all three sued the image generator Midjourney in 2025, and in September 2025 they jointly sued MiniMax over its Hailuo video generator.

Third, labor. The 2023 Writers Guild of America strike ended in September 2023 with the first major Hollywood contract limiting how studios may use AI to write or rewrite scripts. The 2023 SAG-AFTRA strike, which ended in November 2023 after 118 days, produced a contract requiring informed consent and compensation for any digital replica of an actor and barring the use of AI-generated performers to displace background actors without payment. ^[32] SAG-AFTRA's separate video game contract, ratified in 2025 only after a strike that began in July 2024, added similar protections for voice and motion capture. ^[33] IATSE (the International Alliance of Theatrical Stage Employees, representing crew) ratified a basic agreement in August 2024 that included AI-related provisions covering crew job security and consultation rights. ^[34] Tyler Perry's February 2024 announcement that he had paused an 800 million dollar expansion of his Atlanta studio explicitly cited Sora's demos. ^[35] The economic anxiety is that any actor's likeness, once trained on, can be re-used cheaply, and that whole categories of work, from extras to commercial spots, may be substituted by generated footage.

Deepfake abuse is a related concern. T2V tools can produce video of real people without consent, although most commercial products restrict this through prompt filters and face detectors. Attempts to generate political figures or celebrities are typically blocked or watermarked. The open weights ecosystem has weaker constraints, and modified versions of open models that strip safety training are routinely shared.

Watermarking, provenance, and regulation

Two technical responses to deepfake and copyright concerns are widely deployed. The first is invisible watermarking: a perturbation embedded in the pixels (and, where present, audio) that survives compression and re-encoding while remaining imperceptible. Google DeepMind's SynthID, introduced for images in August 2023 and extended to video for Veo and to audio for Lyria, attaches a model-side signature that DeepMind's verifier can detect at high rates even after edits. ^[36] OpenAI announced an analogous internal watermark for Sora outputs in February 2024. ^[16] Meta watermarked Movie Gen outputs. ^[23] Watermarking is not a panacea: cropping, heavy compression, or pixel-level adversarial attack can degrade detection.
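
The idea of an imperceptible, keyed perturbation can be illustrated with a toy spread-spectrum scheme. This is not how SynthID or any production watermark works (their internals are not described here); it is a sketch under stated assumptions: a "frame" is a flat list of grayscale values, the key seeds a pseudo-random pattern of +1 and -1, and mild re-encoding is simulated as additive Gaussian noise. All function names and parameters are hypothetical.

```python
import random


def keyed_pattern(key, n):
    """Pseudo-random +/-1 pattern derived from a secret key (toy scheme)."""
    rng = random.Random(key)
    return [rng.choice((-1.0, 1.0)) for _ in range(n)]


def embed(pixels, key, strength=2.0):
    """Add a low-amplitude keyed pattern to the pixel values."""
    pattern = keyed_pattern(key, len(pixels))
    return [p + strength * w for p, w in zip(pixels, pattern)]


def detect(pixels, key):
    """Correlate mean-removed pixels with the keyed pattern.

    The score is near `strength` if the mark is present and near 0 otherwise.
    """
    pattern = keyed_pattern(key, len(pixels))
    mean = sum(pixels) / len(pixels)
    return sum((p - mean) * w for p, w in zip(pixels, pattern)) / len(pixels)


# Stand-in "frame": 200,000 random grayscale values in [0, 255].
rng = random.Random(0)
frame = [rng.uniform(0, 255) for _ in range(200_000)]
marked = embed(frame, key=1234)

# Simulate mild lossy re-encoding with additive noise.
noisy = [p + rng.gauss(0, 5) for p in marked]

score_marked = detect(noisy, key=1234)  # marked content, correct key
score_clean = detect(frame, key=1234)   # unmarked content
score_wrong = detect(noisy, key=9999)   # marked content, wrong key
print(score_marked > 1.0, abs(score_clean) < 1.0, abs(score_wrong) < 1.0)
```

The detector only works because it averages over many pixels, which also hints at the weaknesses noted above: cropping shrinks the sample and misaligns the pattern, and an attacker who can estimate the pattern can subtract it.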

The second response is content provenance through cryptographically signed metadata. The C2PA standard, developed by the Coalition for Content Provenance and Authenticity and backed by Adobe, Microsoft, the BBC, and others since 2021, attaches a chain of signed assertions to a media file recording how it was produced and edited. ^[37] Adobe Firefly Video, OpenAI Sora, and Google Veo all attach C2PA Content Credentials to their outputs by default. Because the metadata can be stripped from a file, C2PA is best understood as an opt-in disclosure mechanism rather than a forensic guarantee.
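
A chain of signed assertions, and the reason stripping defeats it, can be sketched in a few lines. This is a toy model, not the C2PA format: real Content Credentials use certificate-based signatures inside a structured manifest, whereas this sketch assumes a shared HMAC key and JSON claims, and every name in it is hypothetical.

```python
import hashlib
import hmac
import json

SIGNING_KEY = b"demo-key"  # hypothetical shared secret, for illustration only


def sign_assertion(media_bytes, action, prev_signature=""):
    """Create a signed claim binding an action to the media hash and the previous link."""
    claim = {
        "action": action,
        "media_sha256": hashlib.sha256(media_bytes).hexdigest(),
        "prev": prev_signature,
    }
    payload = json.dumps(claim, sort_keys=True).encode()
    claim["signature"] = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    return claim


def verify_chain(media_bytes, chain):
    """Check each signature, each link to the previous claim, and the final media hash."""
    prev = ""
    for claim in chain:
        body = {k: v for k, v in claim.items() if k != "signature"}
        payload = json.dumps(body, sort_keys=True).encode()
        expected = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
        if not hmac.compare_digest(expected, claim["signature"]) or body["prev"] != prev:
            return False
        prev = claim["signature"]
    # An empty chain verifies nothing, so it is treated as a failure.
    return bool(chain) and chain[-1]["media_sha256"] == hashlib.sha256(media_bytes).hexdigest()


generated = b"raw frames from the generator"
edited = b"raw frames after a colour grade"
chain = [sign_assertion(generated, "created_by_ai_model")]
chain.append(sign_assertion(edited, "edited", chain[-1]["signature"]))

intact = verify_chain(edited, chain)         # file and manifest both present
tampered = verify_chain(b"tampered", chain)  # media no longer matches the manifest
stripped = verify_chain(edited, [])          # manifest removed from the file
print(intact, tampered, stripped)
```

A valid chain proves where a file came from, but an absent chain proves nothing either way, which is why provenance metadata discloses origin for cooperating parties without detecting uncooperative ones.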

The regulatory picture moved fastest in Europe. The EU AI Act, formally adopted in June 2024 and entering staged force from August 2024 through 2026, imposes specific obligations on synthetic media. Article 50 requires that providers of generative AI systems mark outputs as artificially generated in a machine-readable format, and that deployers disclose when "deep fake" or AI-generated audio, image, or video is used in publicly distributed content unless certain artistic or law-enforcement exceptions apply. Provisions for general-purpose AI models with systemic risk apply to large foundation models, including frontier T2V systems above certain compute thresholds. ^[38]

In the United States, federal action has been piecemeal. Executive Order 14110 of October 2023 directed the Department of Commerce to develop standards for content authentication and watermarking; the order was rescinded in January 2025. State-level laws followed, including Tennessee's ELVIS Act of March 2024 (protecting voice and likeness from unauthorized AI replication) and California laws AB 2602 and AB 1836 of September 2024 (digital replicas in employment and post-mortem rights). ^[40]

Elsewhere, China's Cyberspace Administration deep synthesis regulations took effect on January 10, 2023, and require provider registration and conspicuous labeling of generated media; further interim measures on generative AI took effect in August 2023. ^[39] Japan's Cabinet Office released AI promotion guidelines in 2024. India is drafting rules; Brazil and Australia are debating legislation.

What are the limitations of text-to-video?

Despite the speed of progress, T2V systems still fail in characteristic ways.

Physics is the famous one. Sora's launch demos included a memorable case where a glass falls off a table and the liquid passes through the surface instead of spilling. ^[16] Object permanence is fragile: a person walking behind a tree may emerge as a different person on the other side. Cause and effect are sometimes inverted. "A man bites into an apple" can produce a man putting an apple to his mouth and then the apple appearing whole again.

Hands are the second famous failure. Fingers are miscounted, fuse together, or articulate impossibly. The problem is shared with text-to-image and has been reduced but not eliminated. Sora 2 and Veo 3 produce hands that pass casual inspection most of the time; close-ups still betray the model.

Text rendering inside the video is a third weakness. Signs, labels, and printed words come out as a blur of plausible-looking but meaningless characters more often than not. Veo 3 was the first to render legible English signage reliably; other models still struggle. ^[18]

Long-form coherence drops as duration grows. A 5 second Sora 2 clip is usually self-consistent. A 60 second one may drift in lighting, color grade, or camera framing. Cuts between shots in multi-shot generations are even harder; current systems do not plan story structure, only continuous footage.

Motion fidelity is uneven. Slow, broad motion looks great. Fast action, sports, and complex articulated movement like dance or martial arts often glitch. Specific human motions like throwing, kicking, or skiing are hit and miss.

Audio, where present, has its own problems. Lip sync is generally good; emotional inflection is often flat; non-English languages get less attention from the major closed models.

Recent developments and where the field is going

Several threads are worth watching through 2026.

Long-form output keeps stretching. Veo 3.1 and Sora 2.1 both extended single-pass clip lengths past one minute. Multi-minute coherent narratives are still tricky, but pipelines that chain prompts with character and location consistency are improving.

Personalization and identity preservation are a major focus. Runway Gen-4 advertised "reference any person or place" features. Meta's Movie Gen included a Personalized Video mode. ^[23] Closed models tighten guardrails around generating known individuals, while open models generally do not.

World models are blending in. Runway's Gen-4 and DeepMind's Genie 3 were both pitched as more world-model-like than traditional T2V, predicting rendered video conditioned not just on text but on actions or simulated dynamics. The convergence between video generation and game-like world simulation is one of the more interesting current frontiers, echoing OpenAI's own framing of Sora as a step toward "general purpose simulators of the physical world." ^[16]

Audio quality is climbing. Beyond the headline native audio of Veo 3 and Sora 2, work continues on cleanly disentangled music, dialogue, and effects channels, and on multi-speaker scenes with speaker identification.

Real-time and on-device generation is still out of reach for high-resolution outputs but is getting closer. Pruned and distilled versions of CogVideoX run on consumer GPUs at low resolution. Sub-second video generation is plausible for short, low-resolution clips by late 2026.

Finally, agentic video pipelines are appearing. Tools like Flow from Google and the multi-agent video editors built on top of Hunyuan and Wan let an agent plan shots, generate them, edit, and assemble a finished short film with minimal human direction. Whether this is good art or a curiosity is, as of early 2026, an open question.

Key papers

Date	Paper	Authors and source	Contribution
2016	VGAN	Vondrick et al., NeurIPS 2016	First neural video generator
2017	TGAN	Saito et al., ICCV 2017	Temporal GAN with singular value clipping
2018	MoCoGAN	Tulyakov et al., CVPR 2018	Motion-content disentanglement
2019	DVD-GAN	Clark et al., arXiv:1907.06571	GAN scaling on UCF-101
2021 Nov	NUWA	Wu et al., arXiv:2111.12417	3D transformer, sparse attention
2022 Apr	Video Diffusion Models	Ho et al., NeurIPS 2022	First major paper on diffusion for video
2022 May	CogVideo	Hong et al., arXiv:2205.15868	First public T2V; autoregressive transformer
2022 Sep	Make-A-Video	Singer et al., arXiv:2209.14792	T2I prior plus temporal layers
2022 Oct	Imagen Video	Ho et al., arXiv:2210.02303	Cascaded pixel-space diffusion
2022 Oct	Phenaki	Villegas et al., arXiv:2210.02399	C-ViViT, variable-length
2022 Dec	DiT	Peebles and Xie, arXiv:2212.09748	Diffusion transformer for images
2023 Apr	Latent Video Diffusion	Blattmann et al., arXiv:2304.08818	Foundation for SVD
2023 Nov	Stable Video Diffusion	Blattmann et al., arXiv:2311.15127	Open weights, latent diffusion
2023 Dec	VideoPoet	Kondratyuk et al., arXiv:2312.14125	Autoregressive transformer, joint audio
2024 Jan	Lumiere	Bar-Tal et al., arXiv:2401.12945	Space-Time U-Net
2024 Feb	Sora technical report	OpenAI	Spacetime patches at scale
2024 Aug	CogVideoX	Yang et al., arXiv:2408.06072	Open weights DiT
2024 Sep	Movie Gen	Polyak et al., Meta	30B model with paired audio
2024 Oct	Mochi 1	Genmo	10B Asymmetric DiT, Apache 2.0
2024 Dec	HunyuanVideo	Kong et al., arXiv:2412.03603	13B open weights DiT
2025	Wan 2.1/2.2	Alibaba	Open weights DiT, MoE in 2.2

References

1. Vondrick, C., Pirsiavash, H., and Torralba, A. (2016). "Generating Videos with Scene Dynamics." NeurIPS 2016. arXiv:1609.02612.
2. Saito, M., Matsumoto, E., and Saito, S. (2017). "Temporal Generative Adversarial Nets with Singular Value Clipping." ICCV 2017. arXiv:1611.06624.
3. Tulyakov, S. et al. (2018). "MoCoGAN: Decomposing Motion and Content for Video Generation." CVPR 2018. arXiv:1707.04993.
4. Clark, A., Donahue, J., and Simonyan, K. (2019). "Adversarial Video Generation on Complex Datasets." DeepMind. arXiv:1907.06571.
5. Wu, C. et al. (2021). "NUWA: Visual Synthesis Pre-training for Neural visUal World creAtion." Microsoft. arXiv:2111.12417.
6. Hong, W. et al. (2022). "CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers." arXiv:2205.15868.
7. Singer, U. et al. (2022). "Make-A-Video: Text-to-Video Generation without Text-Video Data." Meta AI. arXiv:2209.14792.
8. Ho, J. et al. (2022). "Imagen Video: High Definition Video Generation with Diffusion Models." Google Research. arXiv:2210.02303.
9. Villegas, R. et al. (2022). "Phenaki: Variable Length Video Generation From Open Domain Textual Description." Google Research. arXiv:2210.02399.
10. Ho, J. et al. (2022). "Video Diffusion Models." NeurIPS 2022. arXiv:2204.03458.
11. Peebles, W. and Xie, S. (2022). "Scalable Diffusion Models with Transformers." arXiv:2212.09748.
12. Blattmann, A. et al. (2023). "Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models." CVPR 2023. arXiv:2304.08818.
13. Blattmann, A. et al. (2023). "Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets." Stability AI. arXiv:2311.15127.
14. Kondratyuk, D. et al. (2023). "VideoPoet: A Large Language Model for Zero-Shot Video Generation." Google. arXiv:2312.14125.
15. Bar-Tal, O. et al. (2024). "Lumiere: A Space-Time Diffusion Model for Video Generation." Google Research. arXiv:2401.12945.
16. OpenAI (2024). "Video generation models as world simulators." Sora technical report, openai.com, February 15, 2024.
17. Google DeepMind (2024). "Veo: a generative video model." Google I/O announcement, May 14, 2024.
18. Google DeepMind (2025). "Veo 3." Google I/O announcement, May 20, 2025.
19. OpenAI (2024). "Sora is here." Product announcement, openai.com, December 9, 2024.
20. OpenAI (2025). "Sora 2." Product announcement, openai.com, September 30, 2025.
21. Yang, Z. et al. (2024). "CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer." Zhipu AI. arXiv:2408.06072.
22. Kong, W. et al. (2024). "HunyuanVideo: A Systematic Framework For Large Video Generative Models." Tencent. arXiv:2412.03603.
23. Polyak, A. et al. (2024). "Movie Gen: A Cast of Media Foundation Models." Meta AI research paper, September 27, 2024.
24. Genmo (2024). "Mochi 1: A new SOTA in open-source video generation." genmo.ai, October 22, 2024.
25. Lightricks (2024). "Introducing LTX Video." lightricks.com, November 2024.
26. Adobe (2024). "Adobe announces Firefly Video Model in beta." news.adobe.com, October 14, 2024.
27. Huang, Z. et al. (2023). "VBench: Comprehensive Benchmark Suite for Video Generative Models." arXiv:2311.17982. Updated through 2026.
28. Lipman, Y. et al. (2022). "Flow Matching for Generative Modeling." ICLR 2023. arXiv:2210.02747.
29. Liu, X., Gong, C., and Liu, Q. (2022). "Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow." ICLR 2023. arXiv:2209.03003.
30. Wan team, Alibaba (2025). "Wan: Open and Advanced Large-Scale Video Generative Models." Alibaba Cloud, technical reports for 2.1 and 2.2.
31. Kuaishou (2024). "Kling AI launches text-to-video model." kuaishou.com, June 6, 2024.
32. SAG-AFTRA (2023). "AI Provisions in the 2023 TV/Theatrical Contracts." sagaftra.org, November 2023.
33. SAG-AFTRA (2024). "Interactive Media Agreement reached after strike." sagaftra.org, June 2024.
34. IATSE (2024). "Basic Crafts Agreement and Basic Agreement ratifications." iatse.net, August 2024.
35. Perry, T. (2024). "Tyler Perry interview on Sora." The Hollywood Reporter, February 22, 2024.
36. Google DeepMind (2024). "SynthID for video and audio." deepmind.google, May 2024.
37. C2PA (2023). "Content Credentials Specification 2.0." c2pa.org.
38. European Parliament and Council (2024). "Regulation (EU) 2024/1689 (Artificial Intelligence Act)." Official Journal of the European Union, June 13, 2024.
39. Cyberspace Administration of China (2022). "Provisions on the Administration of Deep Synthesis Internet Information Services." cac.gov.cn, effective January 10, 2023.
40. Tennessee General Assembly (2024). "Ensuring Likeness Voice and Image Security Act (ELVIS Act)." Public Chapter 588, signed March 21, 2024.
41. Artificial Analysis. "Video Generation Arena." artificialanalysis.ai, accessed 2026.
42. CNBC (2025). "OpenAI's Sora hit 1 million downloads in less than five days." cnbc.com, October 9, 2025.
43. TechCrunch (2025). "OpenAI's Sora soars to No. 1 on Apple's US App Store." techcrunch.com, October 3, 2025.
44. TechRadar (2025). "Google's Veo 3 marks the end of AI video's 'silent era'." techradar.com, May 2025.
45. Wikipedia. "Text-to-video model." Updated through 2026.
