Text-to-video generation
Last reviewed
May 2, 2026
Sources
42 citations
Review status
Source-backed
Revision
v2 · 7,653 words
Text-to-video (often abbreviated T2V) is the generative AI capability of producing video clips, with or without sound, from a written prompt. A user types a description of what should appear on screen, and a model returns a sequence of frames that, ideally, looks coherent across time and matches the request. The same approach extends to image-to-video, video-to-video editing, and more recently to native audio synchronized with the picture.
The field went from research curiosity to consumer product in roughly three years. In May 2022, Tsinghua University's CogVideo could just barely produce 480x480 clips of a few seconds. By February 2024, OpenAI's Sora preview generated minute-long 1080p shots with consistent character identity and basic physics. By mid-2025, Google DeepMind's Veo 3 added native dialogue, sound effects, and music to its outputs, closing one of the most visible remaining gaps between AI video and conventional film.
Text-to-video models share a heritage with text-to-image systems. Most modern T2V architectures are diffusion models, often built around a diffusion transformer (DiT), trained on enormous video and caption datasets. The challenges that made the problem hard for so long, namely temporal consistency, computational cost, and reliable physics, have shaped the technical history of the field. The challenges that remain, principally hands, dense text rendering, long-form story logic, and the question of whose footage went into training, are the focus of current work.
A text-to-video system takes a natural language prompt as input and outputs a video file: a sequence of frames usually rendered at 24 to 30 frames per second, lasting anywhere from one or two seconds in the early models up to a minute or more in systems released from 2024 to 2026. Most commercial products today produce clips of five to ten seconds at resolutions between 720p and 1080p, with some higher-tier modes reaching 4K.
The core conditioning is text. Many modern systems also accept additional inputs, such as a reference image to animate (image-to-video) or an existing clip to edit (video-to-video).
A text-to-video model is, at heart, a probability distribution over short videos conditioned on text. Sampling from that distribution is what produces a clip. This framing distinguishes T2V from older techniques like 3D rendering, motion capture, or compositing, where a deterministic pipeline turns explicit instructions into pixels.
Research on generating video from text predates the modern wave by several years. The earliest published work that produced video from neural networks was Vondrick, Pirsiavash, and Torralba's VGAN (2016, "Generating Videos with Scene Dynamics," NeurIPS 2016), which trained a generative adversarial network on unlabeled video clips to produce one-second 64x64 outputs by separating foreground motion from a static background. Saito and colleagues followed in 2017 with TGAN ("Temporal Generative Adversarial Nets with Singular Value Clipping," ICCV 2017), and Tulyakov and collaborators at NVIDIA released MoCoGAN in 2018 ("MoCoGAN: Decomposing Motion and Content for Video Generation," CVPR 2018), which split the latent space into content and motion components. DeepMind's DVD-GAN (Clark et al., 2019) scaled the approach to UCF-101 at 256x256. None of these systems took a free-form sentence as input; they were class-conditional or unconditional.
Datasets like UCF-101 and Moving MNIST anchored the early benchmarks. Results were short, low resolution, and class-conditional rather than open-ended. Variational autoencoders and autoregressive transformers followed. By 2021, Microsoft's NUWA (Wu et al., arXiv:2111.12417) could synthesize short clips from captions using a 3D transformer with sparse attention, but the quality remained too low for any practical use. The field was stuck on a familiar trio of problems: too little high-quality video and caption data, too little compute, and architectures that did not scale gracefully to space and time together.
Three papers in 2022 set the modern direction.
In May 2022, researchers at Tsinghua University released CogVideo (Hong et al., arXiv:2205.15868). It was the first publicly released text-to-video model to clearly outperform earlier work. CogVideo used an autoregressive transformer pretrained on text-to-image and then fine-tuned on text-video pairs. Outputs were 32 frames at 480x480, four seconds long. The team also released the weights, which was unusual at the time and seeded a generation of open work in China.
On September 29, 2022, Meta AI announced Make-A-Video (Singer et al., arXiv:2209.14792). Make-A-Video reused a pretrained text-to-image diffusion model and bolted on temporal layers, then trained on unlabeled video to learn motion priors without paired text-video supervision. The released demo videos were short and a bit smudgy, but the trick of starting from a strong T2I model became a recipe many later systems followed. Meta did not release weights.
A week later, on October 5, 2022, Google researchers posted Imagen Video (Ho et al., arXiv:2210.02303). Imagen Video used a cascade of seven diffusion models: a base text-to-video model, then three temporal super-resolution stages and three spatial super-resolution stages, ending at 1280x768 and 24 frames per second. It was pixel-space, not latent, which made it expensive but visually clean. Almost simultaneously, Google released Phenaki (Villegas et al.), which introduced C-ViViT, a tokenizer that compressed video to a discrete sequence so a transformer could autoregressively generate variable-length clips of up to a few minutes from chained prompts. Neither was made public as a product.
The arrival of Stable Diffusion in 2022 had created a culture of open weights and rapid forking around image generation. In 2023, the same dynamic showed up for video.
Damo Academy at Alibaba released ModelScope T2V (also called Text2Video-Synthesis) in early 2023. It was the first open weights diffusion video model, downloadable from Hugging Face, capable of producing 256x256 clips. Quality was unpolished, but the watermark from the Shutterstock training data became a kind of folk meme. Soon after, Zeroscope appeared as a community fine-tune that scrubbed the watermark and increased resolution.
VideoCrafter from Tencent ARC, Show-1 from a National University of Singapore team, and AnimateDiff from Yuwei Guo and collaborators all appeared in 2023, each a slightly different attempt to take a strong Stable Diffusion checkpoint and add motion. AnimateDiff in particular was widely adopted because it worked as a plug-in module: any existing Stable Diffusion 1.5 LoRA could now produce short animated clips.
On the commercial side, Runway (often called Runway ML) launched Gen-1 in February 2023, a video-to-video stylization tool that took an existing clip and a text or image prompt, then re-rendered the source. Gen-2, launched in March 2023 and made generally available in June, was true text-to-video and image-to-video. Pika Labs opened a Discord-based beta around the same time and rolled out Pika 1.0 in late 2023, undercutting Runway on price.
The year ended with Stability AI's Stable Video Diffusion in November 2023. SVD took a Stable Diffusion 2.1 image model, inserted temporal convolutional layers, and trained on a curated subset of the LVD dataset. The 14-frame and 25-frame variants were released with weights, the first widely usable open T2V from a Western lab.
2024 opened with Google research on two fronts. Lumiere (Bar-Tal et al., arXiv:2401.12945), posted in January 2024, introduced a Space-Time U-Net that produced the entire temporal duration of a video in a single pass instead of generating distant keyframes and interpolating. Google had also published VideoPoet (Kondratyuk et al., arXiv:2312.14125) in December 2023, an autoregressive language model trained to predict tokens that included video, audio, and text, presented at Google I/O in May 2024. Neither Lumiere nor VideoPoet became consumer products, but both demonstrated that Google had multiple parallel video generation efforts inside Research and DeepMind.
The pivot point of the field is February 15, 2024. OpenAI announced Sora with a set of demo videos that were qualitatively different from anything else publicly known: minute-long 1080p shots, recognizable characters across a take, plausible if not perfect physics, and prompts as elaborate as multi-paragraph short stories. The technical report ("Video generation models as world simulators") described a diffusion transformer trained on "spacetime patches," a unified token format that let the model handle videos of arbitrary aspect ratios, resolutions, and durations during both training and inference. Sora stayed in research preview through 2024, with limited red team access. It launched as a product on December 9, 2024 inside a Sora.com web app, included with ChatGPT Plus and Pro subscriptions.
The rest of 2024 was a sprint to catch up.
Luma Labs released Dream Machine (often shortened to Luma) in June 2024, accessible from a web interface with no waitlist, the first widely available product close to Sora's quality. Runway answered with Gen-3 Alpha later in June, retraining its model on new data. Kuaishou launched Kling 1.0 in June as well, a product that launched in China and quickly expanded worldwide. MiniMax shipped Hailuo AI, also called Video-01, in late summer. ByteDance, owner of TikTok and Douyin, integrated its Seedance model into the Doubao app. Tsinghua spinout Shengshu Technology released Vidu, demonstrating its U-ViT architecture.
Google DeepMind made the year's other major announcement on May 14, 2024, when Veo was unveiled at Google I/O. A research preview followed, then in December 2024 Veo 2 was released with up to 4K resolution and improved physics.
Meta announced Movie Gen on September 27, 2024, a 30 billion parameter foundation model with paired audio and personalized video features. Meta did not ship Movie Gen as a consumer product; the work was published as a research paper.
China's open weights ecosystem made the year's quietest but most consequential moves. Zhipu released CogVideoX in August 2024, with 2 billion and 5 billion parameter checkpoints freely downloadable. Tencent released HunyuanVideo on December 3, 2024, a 13 billion parameter DiT, by some benchmarks the strongest open weights video model to date (Kong et al., arXiv:2412.03603).
The open weights wave was not entirely Chinese. San Francisco startup Genmo released Mochi 1 in October 2024, a 10 billion parameter Asymmetric Diffusion Transformer that they billed as the largest openly released video model at the time, with weights on Hugging Face under an Apache 2.0 license. Israeli company Lightricks released LTX Video in November 2024, a smaller 2 billion parameter DiT optimized for fast generation on a single consumer GPU. The HPC-AI Tech academic group released Open-Sora, a community reproduction of Sora that progressed through versions 1.0, 1.1, and 1.2 across 2024 and offered training code, weights, and a clear technical report. Adobe announced Firefly Video Model in beta on October 14, 2024, scheduled for general availability in 2025, marketed on the basis of being trained only on licensed and Adobe Stock material.
Veo 3, announced at Google I/O on May 20, 2025, became the first major video model to include native audio synthesis, generating dialogue, sound effects, and music aligned to the picture. The model produced eight second clips initially, with Veo 3.1 later in the year extending duration and adding finer control over scene transitions. Google rolled access into the Gemini app and the Vertex AI API, with later integrations into YouTube Shorts and the Flow filmmaking tool.
OpenAI followed with Sora 2 on September 30, 2025, adding native audio, more reliable physical simulation, and a TikTok-style social app called Sora that surfaces user-generated clips. The Sora app rocketed to the top of the iPhone charts in its first weekend. Sora 2 is rumored, though not officially confirmed, to use rectified flow rather than standard diffusion.
Runway shipped Gen-4 in March 2025, doubling down on character and scene consistency for short film production. Pika 2 launched in December 2024 with a feature called Scene Ingredients that let users compose clips from labeled images. Kling went through 1.5, 2.0, and 2.1 versions in 2025, becoming the dominant T2V product in the Chinese market and a strong international option; Kuaishou also opened Kling AI Studio, a multi-tool editor, in April 2025. Alibaba released Wan 2.1 in February 2025 and Wan 2.2 in July, both with open weights, including a 14 billion parameter A14B mixture-of-experts variant. Tencent followed its December 2024 base model with HunyuanVideo-I2V in March 2025, an image-to-video extension trained on the same backbone. Adobe took Firefly Video Model to general availability inside Premiere Pro and the Firefly web app in February 2025.
By early 2026 the consumer market had stabilized into roughly three tiers. At the top, Sora 2 and Veo 3 traded the lead on visual quality. In the middle, Kling, Runway Gen-4, Hailuo, and Luma competed on price and feature breadth. At the open weights tier, Wan 2.2, Hunyuan Video, and CogVideoX kept the research and indie filmmaking community supplied. Hollywood pilots that had begun in 2024 turned into actual productions in 2025, mostly for short films, music videos, and previsualization.
| Year | Model | Org | Notes |
|---|---|---|---|
| 2016 | VGAN | MIT (Vondrick et al.) | First neural video generator (NeurIPS) |
| 2017 | TGAN | Saito et al. | Temporal GAN (ICCV) |
| 2018 | MoCoGAN | NVIDIA (Tulyakov et al.) | Motion-content disentangled GAN (CVPR) |
| 2021 Nov | NUWA | Microsoft | Multimodal 3D transformer with sparse attention |
| 2022 May | CogVideo | Tsinghua | First open T2V; 4s, 480x480 |
| 2022 Sep | Make-A-Video | Meta AI | T2I prior plus temporal layers |
| 2022 Oct | Imagen Video | Google | Cascaded pixel-space diffusion |
| 2022 Oct | Phenaki | Google | Variable-length via C-ViViT |
| 2023 Feb | Runway Gen-1 | Runway | Video-to-video stylization |
| 2023 Mar | ModelScope T2V | Alibaba DAMO | First open weights diffusion T2V |
| 2023 Mar | Runway Gen-2 | Runway | Commercial T2V and I2V |
| 2023 Jul | Zeroscope | community | Watermark-free fine-tune |
| 2023 Sep | Show-1 | NUS | Pixel plus latent hybrid |
| 2023 Oct | VideoCrafter 1/2 | Tencent ARC | Open weights |
| 2023 Nov | Stable Video Diffusion | Stability AI | 14 and 25 frame open release |
| 2023 Dec | VideoPoet | Google | Autoregressive transformer with audio |
| 2024 Jan | Lumiere | Google Research | Space-Time U-Net, single-pass duration |
| 2024 Feb | Sora preview | OpenAI | DiT, spacetime patches, 60s |
| 2024 May | Veo | Google DeepMind | I/O announcement |
| 2024 Jun | Dream Machine | Luma Labs | Open consumer access |
| 2024 Jun | Gen-3 Alpha | Runway | Retrained foundation |
| 2024 Jun | Kling 1.0 | Kuaishou | Strong China-built option |
| 2024 Aug | CogVideoX | Zhipu AI | 2B and 5B open weights |
| 2024 Sep | Movie Gen | Meta | Research only, with audio |
| 2024 Sep | Hailuo Video-01 | MiniMax | Free tier launch |
| 2024 Oct | Mochi 1 | Genmo | 10B open weights, Apache 2.0 |
| 2024 Oct | Firefly Video Model | Adobe | Beta; trained on licensed material |
| 2024 Nov | LTX Video | Lightricks | 2B DiT optimized for single-GPU |
| 2024 Dec | Veo 2 | Google DeepMind | 4K, better physics |
| 2024 Dec | HunyuanVideo | Tencent | 13B open weights DiT |
| 2024 Dec | Sora release | OpenAI | Public product launch |
| 2024 Dec | Pika 2.0 | Pika Labs | Scene Ingredients |
| 2025 Feb | Wan 2.1 | Alibaba | Open weights DiT |
| 2025 Feb | Firefly Video GA | Adobe | In Premiere Pro and Firefly web |
| 2025 Mar | HunyuanVideo-I2V | Tencent | Image-to-video extension |
| 2025 Mar | Runway Gen-4 | Runway | Character consistency focus |
| 2025 Apr | Kling AI Studio | Kuaishou | Multi-tool editor |
| 2025 May | Veo 3 | Google DeepMind | Native audio |
| 2025 Jul | Wan 2.2 | Alibaba | A14B MoE open weights |
| 2025 Sep | Sora 2 | OpenAI | Native audio, social app |
Text-to-video is hard because video is high dimensional and temporally structured. A 5 second 1080p clip at 30 fps has roughly 310 million pixels (close to a billion individual color values), about two orders of magnitude more than a single frame. Naive approaches that treat video as a stack of independent images produce flickering nonsense. The field has converged on a small set of architectural ideas that handle space and time jointly while keeping computation tractable.
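The size estimate above is easy to check by hand. The short sketch below does the arithmetic, assuming 1920x1080 frames and three 8-bit color channels (both are assumptions; the text only says "1080p").

```python
# Back-of-the-envelope size of a raw clip, using the figures from the text.
WIDTH, HEIGHT = 1920, 1080   # assumption: "1080p" means 1920x1080
FPS = 30
SECONDS = 5
CHANNELS = 3                 # assumption: RGB, no alpha channel

frames = FPS * SECONDS
pixels_per_frame = WIDTH * HEIGHT
pixels_per_clip = pixels_per_frame * frames
values_per_clip = pixels_per_clip * CHANNELS

print(f"frames:           {frames}")
print(f"pixels per frame: {pixels_per_frame:,}")
print(f"pixels per clip:  {pixels_per_clip:,}")
print(f"color values:     {values_per_clip:,}")
print(f"clip vs. frame:   {pixels_per_clip // pixels_per_frame}x")
```

The clip-to-frame ratio is simply the frame count (150), which is where the "two orders of magnitude" figure comes from.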
Most current T2V models are latent diffusion models. The pipeline has three pieces. First, a 3D variational autoencoder compresses video to a compact latent representation, often by a factor of 8 spatially and 4 to 8 temporally. Second, a denoising network operates entirely in this latent space, removing noise from a randomly initialized tensor over many sampling steps until it lands at a clean latent. Third, the VAE decoder maps the final latent back to pixels.
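To get a feel for how much the first stage shrinks the problem, the sketch below computes the latent tensor shape for one clip. The compression factors come from the ranges quoted above; the 16 latent channels, the 720p resolution, and the function name `latent_shape` are illustrative assumptions, not the settings of any particular model.

```python
import math

def latent_shape(frames, height, width, t_factor=4, s_factor=8, latent_channels=16):
    """Shape (T, C, H, W) of the latent after a 3D VAE encoder.

    All factors are illustrative assumptions, not a specific model's values.
    """
    return (math.ceil(frames / t_factor), latent_channels,
            height // s_factor, width // s_factor)

frames, height, width = 120, 720, 1280          # 5 s at 24 fps, 720p
pixel_values = frames * 3 * height * width      # RGB values in pixel space
t, c, h, w = latent_shape(frames, height, width)
latent_values = t * c * h * w

print("latent shape:", (t, c, h, w))
print(f"compression:  {pixel_values / latent_values:.0f}x fewer values")
```

Under these assumptions the denoising network works on about 48 times fewer values than the decoded video contains, which is the main reason the latent approach is tractable.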
This is the same recipe that latent diffusion image models such as Stable Diffusion use, extended to time. Sora, Veo, Kling, Hunyuan Video, and Wan all use variants of this scheme. The advantage is huge: a single forward pass on a GPU can cover several seconds of video in latent space, whereas the same operation in pixel space would be infeasible.
Imagen Video was an exception. It worked entirely in pixel space using a cascade of low-resolution and super-resolution diffusion models. The result was visually clean but extremely expensive.
Most early diffusion models, including Stable Diffusion 1 and 2 and the original video extensions like Stable Video Diffusion, used a U-Net backbone: a convolutional neural network with skip connections that downsampled and upsampled features through a bottleneck. U-Nets are sample efficient and work well at the scale of single images.
The shift to a diffusion transformer started with William Peebles and Saining Xie's 2022 DiT paper for class-conditional image generation. DiT replaces the U-Net's convolutional backbone with a pure transformer that operates on patches of the latent, the same way vision transformers operate on image patches. It scales more cleanly: bigger transformers do better, more reliably, than bigger U-Nets.
Sora's technical report explicitly named DiT as the inspiration. By 2025, Veo, Kling 2.0, Hunyuan Video, Wan, and many other top systems had moved to transformer backbones. The trade-off is that the cost of self-attention grows quadratically with sequence length, and video produces very long sequences, so model designers spend significant effort on local or hierarchical attention variants to keep cost manageable.
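The quadratic cost is worth making concrete. The sketch below counts tokens for a latent video and the number of token pairs that full self-attention would compare. The latent dimensions and patch sizes are illustrative assumptions, and the helper names are hypothetical.

```python
def num_tokens(latent_frames, latent_h, latent_w, patch_t=1, patch_hw=2):
    """Token count when a latent video is cut into patch_t x patch_hw x patch_hw patches."""
    return (latent_frames // patch_t) * (latent_h // patch_hw) * (latent_w // patch_hw)

def attention_pairs(n_tokens):
    """Full self-attention compares every token with every token."""
    return n_tokens * n_tokens

shorter = num_tokens(30, 90, 160)   # assumed latent for roughly 5 seconds
longer = num_tokens(60, 90, 160)    # same resolution, twice the duration

print("tokens:", shorter, "->", longer)
print("attention cost ratio:", attention_pairs(longer) / attention_pairs(shorter))
```

Doubling the duration doubles the token count but quadruples the attention work, which is why windowed or hierarchical attention becomes attractive for longer clips.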
The most discussed contribution of the Sora paper was its handling of variable inputs. Earlier video models trained at a fixed resolution and duration, often 16 frames at 256x256. Sora instead patchified videos into a sequence of spacetime patches. A patch is a small cube of latent voxels, and any video, regardless of aspect ratio, length, or resolution, becomes a sequence of these patches. Position embeddings carry the spatial and temporal coordinates.
The practical implication is that Sora can train on whatever video it has, in whatever shape, and at inference can produce widescreen, square, or vertical clips of any reasonable duration. Most subsequent commercial systems adopted some version of this idea, often called "native resolution" or "variable aspect ratio" training.
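A minimal NumPy sketch of the patch idea follows. It is not Sora's implementation, which has not been published in code form; the patch sizes, the channel count, and the `patchify` helper are assumptions chosen for illustration. The point is that latents of different shapes turn into token sequences of different lengths but identical token width, with coordinates kept for position embeddings.

```python
import numpy as np

def patchify(latent, pt=2, ph=2, pw=2):
    """Cut a latent video (T, H, W, C) into flattened spacetime patches.

    Returns (tokens, coords): tokens has shape (N, pt*ph*pw*C) and coords has
    shape (N, 3), holding each patch's (t, y, x) index for position embeddings.
    """
    T, H, W, C = latent.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0, "pad the latent first"
    nt, nh, nw = T // pt, H // ph, W // pw
    x = latent.reshape(nt, pt, nh, ph, nw, pw, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)            # (nt, nh, nw, pt, ph, pw, C)
    tokens = x.reshape(nt * nh * nw, pt * ph * pw * C)
    grid = np.meshgrid(np.arange(nt), np.arange(nh), np.arange(nw), indexing="ij")
    coords = np.stack(grid, axis=-1).reshape(-1, 3)
    return tokens, coords

rng = np.random.default_rng(0)
wide = rng.normal(size=(8, 18, 32, 4))   # widescreen latent, 8 latent frames
tall = rng.normal(size=(4, 32, 18, 4))   # vertical latent, 4 latent frames

wide_tokens, wide_coords = patchify(wide)
tall_tokens, tall_coords = patchify(tall)
print(wide_tokens.shape, tall_tokens.shape)
```

Because both clips yield tokens of the same width, a single transformer can consume either sequence; only the sequence length and the coordinates differ.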
Make-A-Video followed a two-stage recipe: first generate a strong first frame using a text-to-image model, then animate it. This pattern still shows up in image-to-video products. The user starts from a still and the system handles only the motion. The advantage is that any progress in T2I quality immediately benefits T2V; the disadvantage is that the model never learns to plan motion at the same time as it composes a scene.
Imagen Video's cascaded approach generated low-resolution video first, then ran multiple super-resolution diffusion stages, both spatial and temporal. The cascade trick survives in some commercial pipelines as a way to produce 4K output without training a full 4K model.
Standard diffusion adds Gaussian noise to data and trains a network to reverse the process step by step, typically with 25 to 100 sampling steps. Flow matching and rectified flow, popularized by papers from Lipman et al. and Liu et al. in 2022, reformulate the same problem as learning a velocity field that pushes noise toward data along straight paths. The result is faster sampling, often only a handful of steps for similar quality.
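The straight-path idea can be sketched in a few lines. The code below shows how a rectified-flow training target is built and how an Euler sampler integrates a velocity field from noise to data. No network is trained here: `toy_velocity` is a stand-in that uses the closed-form field for a "dataset" consisting of a single point, an assumption made purely so the example runs on its own.

```python
import numpy as np

def training_pair(x1, rng):
    """Build one rectified-flow training example for a data sample x1.

    A network would be trained to predict `target` from (x_t, t).
    """
    x0 = rng.normal(size=x1.shape)     # pure noise
    t = rng.uniform()                  # random time in [0, 1)
    x_t = (1.0 - t) * x0 + t * x1      # point on the straight noise-to-data path
    target = x1 - x0                   # constant velocity along that path
    return x_t, t, target

def euler_sample(velocity, x0, steps):
    """Integrate dx/dt = velocity(x, t) from t=0 (noise) to t=1 (data)."""
    x, dt = x0.copy(), 1.0 / steps
    for k in range(steps):
        x = x + dt * velocity(x, k * dt)
    return x

# Stand-in for a trained network: exact velocity when all data equals `mu`.
mu = np.array([2.0, -1.0])

def toy_velocity(x, t):
    return (mu - x) / (1.0 - t)

rng = np.random.default_rng(0)
noise = rng.normal(size=2)
one_step = euler_sample(toy_velocity, noise, steps=1)
four_steps = euler_sample(toy_velocity, noise, steps=4)
print(one_step, four_steps)
```

In this toy case the paths are exactly straight, so one Euler step lands on the data point just as four steps do. Real models learn only approximately straight paths, which is why they still need a handful of steps.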
Stable Video Diffusion 2.0 used flow matching. Sora 2 is widely believed to use rectified flow, although OpenAI has not confirmed details. Veo 3 also appears to use a flow-based formulation. Whether or not the term shows up in marketing, by 2025 most frontier T2V systems had moved off vanilla DDPM-style diffusion.
Getting beyond 10 seconds is a separate problem from getting from 0 to 10. The naive approach, generate a 60-second latent in one pass, hits memory and quality walls. Phenaki's solution in 2022 was autoregressive chunking: generate a few seconds, condition the next chunk on the last frames of the previous one, and chain prompts to control the story across the whole sequence.
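The control flow of chunked generation is simple even though the model inside it is not. In the sketch below, `generate_chunk` is a hypothetical stand-in for a real video model: it produces scalar "frames" that continue from the last context frame, so only the chaining logic is meaningful. The chunk length and overlap are arbitrary assumptions.

```python
import numpy as np

def generate_chunk(prompt, context, chunk_len, rng):
    """Hypothetical model call: returns `chunk_len` scalar 'frames' that
    continue smoothly from the end of `context` (the prompt is ignored here)."""
    start = context[-1] if len(context) else 0.0
    return start + np.cumsum(rng.normal(scale=0.1, size=chunk_len))

def generate_long(prompts, chunk_len=48, overlap=8, seed=0):
    """Chain chunks: each is conditioned on the last `overlap` frames of the
    video so far and on its own prompt."""
    rng = np.random.default_rng(seed)
    video = np.empty(0)
    for prompt in prompts:
        context = video[-overlap:]                 # empty for the first chunk
        chunk = generate_chunk(prompt, context, chunk_len, rng)
        video = np.concatenate([video, chunk])
    return video

video = generate_long(["a fox wakes up", "the fox runs", "the fox sleeps"])
print(video.shape)
```

Each prompt steers one chunk, and the overlap frames are what keep the seam between chunks from showing. Errors can still accumulate across chunks, which is one reason very long outputs tend to drift.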
Sora used long context windows to produce a single coherent minute, but later systems often returned to chunked autoregressive generation for longer outputs. Veo 3.1 and Sora 2 both support multi-minute durations through chunk-and-condition pipelines.
Until 2025, almost every video model produced silent clips. Audio was tacked on after the fact using stock libraries or separate audio generation tools such as ElevenLabs, Suno, or Stable Audio.
Google DeepMind's Veo 3, announced at I/O on May 20, 2025, was the first major commercial T2V system to generate native audio. The model produced dialogue, ambient sound, music, and lip-synced speech in the same forward pass as the picture. Audio was conditioned on the prompt, so a request like "a chef explaining how to fold dough, with quiet kitchen sounds in the background" returned both the visual and the soundtrack from one generation.
OpenAI's Sora 2 followed in September 2025 with native audio. The Sora 2 social app made this immediately obvious: a wall of short clips with synchronized voice, music, and effects, generated by users from text alone.
Research precursors include MM-Diffusion (CVPR 2023), which trained a joint model on video and audio, and MMAudio, a 2024 model that generated audio conditioned on both video and text. Meta's Movie Gen Audio model, released in research form alongside Movie Gen in September 2024, also handled paired audio.
Native audio creates new problems. Voices generated this way may impersonate real people; sound effects may copy from training material; the line between video and music generation blurs. Both Google and OpenAI applied watermarking and content moderation to their audio outputs, though the effectiveness of these has been debated.
| Product | Company | Launched | Notes |
|---|---|---|---|
| Sora / Sora 2 | OpenAI | Feb 2024 preview, Dec 2024 product, Sora 2 Sep 2025 | DiT, spacetime patches, native audio in Sora 2 |
| Veo 1/2/3 | Google DeepMind | May 2024, Dec 2024, May 2025 | 4K in Veo 2, native audio in Veo 3 |
| Runway Gen-2/3/4 | Runway | Mar 2023, Jun 2024, Mar 2025 | Long-running brand; film industry focus |
| Pika 1.0/2.0 | Pika Labs | Dec 2023, Dec 2024 | Discord then web; Scene Ingredients in 2.0 |
| Dream Machine | Luma Labs | Jun 2024 | Open consumer access at launch |
| Kling 1.0/1.5/2.0/2.1 | Kuaishou | Jun 2024 onward | Largest Chinese T2V product |
| Hailuo AI / Video-01 | MiniMax | Sep 2024 | Free tier; competitive quality |
| Seedance | ByteDance | 2024 | Inside Doubao app |
| Vidu | Shengshu / Tsinghua | 2024 | U-ViT architecture; rapid iteration |
| Firefly Video | Adobe | Oct 2024 beta, Feb 2025 GA | Trained on licensed Adobe Stock; in Premiere Pro |
| HeyGen | HeyGen | 2022 onward | Talking-head avatars from text |
| Synthesia | Synthesia | 2017 onward | Enterprise avatar video |
| D-ID | D-ID | 2017 onward | Talking-photo animation |
On pricing and access: in early 2026 a typical generation of an 8 second 1080p clip costs about 25 to 75 cents on most consumer products, with audio-enabled outputs at the higher end. Subscriptions in the 20 to 200 dollar per month range are standard. Sora and Veo gate on geography and on subscription tier. Kling, Hailuo, and Wan are accessible globally with regional payment quirks.
| Model | Org | Released | Parameters | Notes |
|---|---|---|---|---|
| CogVideo | Tsinghua | May 2022 | 9B | First public release |
| ModelScope T2V | Alibaba DAMO | Mar 2023 | 1.7B | Watermarked |
| Zeroscope | community | 2023 | derived | Cleaned ModelScope fine-tune |
| VideoCrafter 1/2 | Tencent ARC | 2023 | ~1.4B | Latent diffusion |
| Stable Video Diffusion | Stability AI | Nov 2023 | 1.5B | First Western open T2V |
| AnimateDiff | community | 2023 | module | Plug-in for SD 1.5 |
| CogVideoX | Zhipu AI | Aug 2024 | 2B and 5B | DiT, image- and text-conditioned |
| Open-Sora | HPC-AI Tech | 2024 | up to 1.1B | Sora reproduction; Apache 2.0 |
| Mochi 1 | Genmo | Oct 2024 | 10B | Asymmetric DiT; Apache 2.0 |
| LTX Video | Lightricks | Nov 2024 | 2B | Single-GPU optimized |
| HunyuanVideo | Tencent | Dec 2024 | 13B | Strongest open T2V at release |
| Wan 2.1 | Alibaba | Feb 2025 | 1.3B and 14B | T2V, I2V, V2V |
| HunyuanVideo-I2V | Tencent | Mar 2025 | 13B | Image-to-video extension |
| Wan 2.2 | Alibaba | Jul 2025 | up to 14B active (A14B MoE) | Mixture-of-experts |
The open weights stack is heavily Chinese, with HunyuanVideo, CogVideoX, and Wan dominating leaderboards. Stability AI's Stable Video Diffusion was the first Western open T2V, although its capability gap to closed frontier models widened over 2024 to 2025 as Stability shifted focus. The other notable Western open releases were Mochi 1 from Genmo (October 2024) and LTX Video from Lightricks (November 2024), both released under permissive licenses for commercial use. The HPC-AI Tech academic group's Open-Sora project, released in 2024, served as a research-grade reproduction with public training code rather than a competitive product. The pattern follows the broader open weights ecosystem in language models, where Qwen, DeepSeek, GLM, and Yi dominate while major US labs hold their best work back.
Open weights matter for video in part because the cost of training a frontier T2V model is now in the high tens to low hundreds of millions of dollars, far beyond academic budgets. Open releases let researchers study video diffusion behavior, fine-tune for specific domains, and run inference offline. ComfyUI workflows for HunyuanVideo, Wan 2.2, and LTX Video are now common in independent film and VFX work.
Measuring T2V quality is unsolved. The same problems that plague image generation evaluation apply, and several new ones come from the time dimension.
FVD (Fréchet Video Distance) is the oldest metric in active use. It computes the Fréchet distance between feature distributions of generated and real videos, using an inflated 3D Inception network (I3D) as the feature extractor. FVD is the video analog of FID for images. It correlates poorly with human preference, especially at the high end where most outputs are reasonable.
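The distance itself is a closed-form expression once each feature set is summarized by a mean and a covariance. The sketch below implements that formula in NumPy on synthetic features. A real FVD computation would first extract features from a pretrained video network, which is omitted here, and the `frechet_distance` helper is an illustrative name rather than a library function.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussians fitted to two feature sets.

    feats_a, feats_b: arrays of shape (num_videos, feature_dim), feature_dim > 1.
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((cov_a cov_b)^(1/2)) is the sum of square roots of the eigenvalues of
    # cov_a @ cov_b, which are real and non-negative up to numerical noise.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)

rng = np.random.default_rng(0)
real = rng.normal(size=(500, 4))      # synthetic stand-in for real-video features
shifted = real + 3.0                  # same spread, mean moved by 3 per dimension

print(frechet_distance(real, real))
print(frechet_distance(real, shifted))
```

Identical feature sets give a distance of essentially zero. A pure mean shift contributes only the squared distance between the means, so the second call returns about 36 (four dimensions, each shifted by 3).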
VBench, introduced in late 2023 by researchers at Nanyang Technological University, Shanghai AI Laboratory, and other groups, and updated in later versions through 2026, is the dominant systematic benchmark today. VBench breaks video quality into 16 dimensions: subject consistency, background consistency, temporal flickering, motion smoothness, dynamic degree, aesthetic quality, imaging quality, object class accuracy, multiple objects, human action, color, spatial relationship, scene, appearance style, temporal style, and overall consistency. Each dimension is scored by a dedicated automated probe. The VBench leaderboard hosted on Hugging Face is widely cited; in early 2026 Sora 2, Veo 3, and Wan 2.2 trade the top three slots depending on the metric.
EvalCrafter is another comprehensive benchmark with similar structure. VideoFC focuses on factual or physical consistency.
Human preference arenas have become the most trusted comparison method. The Artificial Analysis Video Arena, modeled on LMSYS Chatbot Arena, presents users with two anonymous video generations from the same prompt and asks them to pick a winner. Aggregated rankings produce an Elo-style leaderboard. The arena format avoids the problem that automated metrics can be gamed and that single-model evaluations are noisy.
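How pairwise votes become a ranking can be shown with the classic Elo update. This is a generic sketch: the K-factor of 32, the starting rating of 1000, and the model names are assumptions, and production arenas may instead fit a statistical model such as Bradley-Terry over all votes at once.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update(ratings, winner, loser, k=32.0):
    """Shift rating points from loser to winner; upsets move more points."""
    delta = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
    ratings[winner] += delta
    ratings[loser] -= delta

ratings = {"model_a": 1000.0, "model_b": 1000.0, "model_c": 1000.0}
votes = [("model_a", "model_b"), ("model_a", "model_c"), ("model_b", "model_c")]
for winner, loser in votes:
    update(ratings, winner, loser)

for name, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {rating:.1f}")
```

The update is zero-sum, so the total rating pool stays fixed while repeated votes pull consistently preferred models upward.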
No metric captures everything. Hand and finger artifacts, dense text rendering, and long-form story coherence are all weak spots that aggregated scores sometimes miss. Reading the VBench score is helpful; watching the actual outputs side by side is, as of 2026, still the only way to fully evaluate a system.
The most common consumer use is making short, polished clips for social media. TikTok, Instagram Reels, YouTube Shorts, and Douyin all show heavy organic use of Kling, Sora 2, Hailuo, and Veo 3 outputs. The Sora 2 social app extends this directly: posts on the Sora app are generated, not uploaded.
In marketing and advertising, T2V is mostly used for ideation, mood reels, and quick spec ads. The 2024 Toys R Us ad generated with Sora was an early stunt that drew mixed reactions; agencies now use the tools more discreetly. Procter and Gamble, Mondelez, and other major advertisers have publicly acknowledged using AI video in production pipelines.
In film and television, the picture is more cautious. Tyler Perry paused an 800 million dollar studio expansion in February 2024 after seeing Sora demos. James Cameron joined the board of Stability AI in September 2024. Marvel Studios used an AI-generated opening title sequence in Secret Invasion in 2023 to mixed reception. Several short films generated entirely or mostly with T2V tools won festival prizes in 2024 and 2025, including The Frost (Pika) and Critterz (Sora). Use in major studio releases by 2026 is mostly limited to previsualization, set extension, and concept work, not final-pixel footage.
In enterprise contexts, talking-head products like Synthesia, HeyGen, and D-ID sit in a related but distinct lane. They generate videos of human-looking avatars reading scripts, and they dominate corporate training, e-commerce explainers, and localization. The output is less impressive as cinema but the per-clip cost beats hiring presenters by orders of magnitude.
In education and accessibility, T2V is being explored for sign language video synthesis, rapid creation of instructional content, and dubbing or lip-sync remapping for foreign-language editions of existing videos.
The legal questions around T2V are unsettled and active. Three threads matter most.
First, training data. Most major T2V models are trained on web-scale video datasets that include copyrighted footage. Early disclosures from Runway and Stability AI revealed scrapes of YouTube, Vimeo, and stock libraries. The New York Times v. OpenAI lawsuit, filed in December 2023, names Sora's predecessors among the technologies trained on Times content. Movie Gen's training data was not fully disclosed, although Meta confirmed it included a mix of licensed and publicly available footage. As of early 2026, no court has issued a definitive ruling on whether training on copyrighted video constitutes fair use.
Second, output similarity. T2V models can produce video that looks substantively like specific copyrighted works, sometimes by accident, sometimes when prompted. Sora 2 was caught generating recognizable Pixar-style and Studio Ghibli-style content within days of launch. OpenAI added moderation layers and IP-aware filters in response. Disney, Universal, and Warner have all warned that they reserve their rights, although as of early 2026 no major studio has filed against an AI video lab.
Third, labor. The 2023 Writers Guild of America strike ended in September 2023 with the first major Hollywood contract limiting how studios may use AI to write or rewrite scripts. The 2023 SAG-AFTRA strike, which ended in November 2023 after 118 days, produced a contract requiring informed consent and compensation for any digital replica of an actor and barring the use of AI-generated performers to displace background actors without payment. SAG-AFTRA's separate video game contract, ratified in mid-2025 only after a fresh strike that began in July 2024, added similar protections for voice and motion capture performers. IATSE (the International Alliance of Theatrical Stage Employees, representing crew) ratified a basic agreement in August 2024 that included AI-related provisions covering crew job security and consultation rights. Tyler Perry's February 2024 announcement that he had paused an 800 million dollar expansion of his Atlanta studio explicitly cited Sora's demos. The economic anxiety is that any actor's likeness, once trained on, can be reused cheaply, and that whole categories of work, from extras to commercial spots, may be replaced by generated footage.
Deepfake abuse is a related concern. T2V tools can produce video of real people without consent, although most commercial products restrict this through prompt filters and face detectors. Attempts to generate political figures or celebrities are typically blocked or watermarked. The open weights ecosystem has weaker constraints, and modified versions of open models that strip safety training are routinely shared.
Two technical responses to deepfake and copyright concerns are widely deployed. The first is invisible watermarking: a perturbation embedded in the pixels (and, where present, audio) that survives compression and re-encoding while remaining imperceptible. Google DeepMind's SynthID, introduced for images in August 2023 and extended to video for Veo and to audio for Lyria, attaches a model-side signature that DeepMind's verifier can detect at high rates even after edits. OpenAI announced an analogous internal watermark for Sora outputs in February 2024. Meta watermarked Movie Gen outputs. Watermarking is not a panacea: cropping, heavy compression, or pixel-level adversarial attack can degrade detection.
The second response is content provenance through cryptographically signed metadata. The C2PA standard ("Coalition for Content Provenance and Authenticity"), shepherded by Adobe, Microsoft, the BBC, and others since 2021, attaches a chain of signed assertions to a media file recording how it was produced and edited. Adobe Firefly Video, OpenAI Sora, and Google Veo all attach C2PA Content Credentials to their outputs by default. Because the signature can be stripped, C2PA is best understood as an opt-in disclosure mechanism rather than a forensic guarantee.
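The chain-of-signed-assertions idea can be sketched in a few lines. The example below is not the C2PA format: the real standard embeds a structured manifest in the file and signs it with certificate-backed credentials, whereas this sketch assumes a shared secret and HMAC-SHA256 purely to stay self-contained. The key, field names, and tool names are hypothetical.

```python
import hashlib
import hmac
import json

# Hypothetical demo key; real Content Credentials use certificate-backed signatures.
SIGNING_KEY = b"demo-key-not-a-real-credential"

def _sign(payload: dict, key: bytes) -> str:
    """HMAC-SHA256 over a canonical JSON encoding of the claim."""
    body = json.dumps(payload, sort_keys=True).encode()
    return hmac.new(key, body, hashlib.sha256).hexdigest()

def add_assertion(manifest: list[dict], action: str, tool: str,
                  media: bytes, key: bytes = SIGNING_KEY) -> list[dict]:
    """Return a new manifest with a signed claim bound to the media hash and the prior claim."""
    payload = {
        "action": action,
        "tool": tool,
        "media_sha256": hashlib.sha256(media).hexdigest(),
        "prev_signature": manifest[-1]["signature"] if manifest else None,
    }
    return manifest + [{**payload, "signature": _sign(payload, key)}]

def verify(manifest: list[dict], media: bytes, key: bytes = SIGNING_KEY) -> bool:
    """Check every signature, the links between claims, and the final media hash."""
    if not manifest:
        return False  # metadata stripped: nothing to verify
    prev = None
    for claim in manifest:
        payload = {k: v for k, v in claim.items() if k != "signature"}
        if payload.get("prev_signature") != prev:
            return False
        if not hmac.compare_digest(claim.get("signature", ""), _sign(payload, key)):
            return False
        prev = claim["signature"]
    return manifest[-1]["media_sha256"] == hashlib.sha256(media).hexdigest()

# Illustrative usage: a generated clip that is later edited.
clip = b"generated video bytes"
manifest = add_assertion([], "created", "hypothetical-t2v-model", clip)
edited = clip + b" (trimmed)"
manifest = add_assertion(manifest, "edited", "hypothetical-editor", edited)

print("edited clip with manifest:", verify(manifest, edited))
print("edited clip, manifest stripped:", verify([], edited))
```

The second check shows why provenance metadata works as disclosure rather than proof: a file with its manifest removed is simply unverifiable, not detectably synthetic.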
The regulatory picture moved fastest in Europe. The EU AI Act, formally adopted in June 2024, entered into force in August 2024 with obligations phasing in over the following years, and it imposes specific obligations on synthetic media. Article 50 requires that providers of generative AI systems mark outputs as artificially generated in a machine-readable format, and that deployers disclose when "deep fake" or AI-generated audio, image, or video is used in publicly distributed content unless certain artistic or law-enforcement exceptions apply. Provisions for general-purpose AI models with systemic risk apply to large foundation models, including frontier T2V systems above certain compute thresholds.
In the United States, federal action has been piecemeal. Executive Order 14110 of October 2023 directed the Department of Commerce to develop standards for content authentication and watermarking; the order was rescinded in January 2025. State-level laws followed, including Tennessee's ELVIS Act of March 2024 (protecting voice and likeness from unauthorized AI replication) and California laws AB 2602 and AB 1836 of September 2024 (covering digital replicas in employment contracts and post-mortem likeness rights). China's Cyberspace Administration deep synthesis regulations took effect on January 10, 2023, and require provider registration and conspicuous labeling of generated media; further interim measures on generative AI took effect in August 2023. Japan's Cabinet Office released AI promotion guidelines in 2024. India is drafting rules; Brazil and Australia are debating legislation.
Despite the speed of progress, T2V systems still fail in characteristic ways.
Physics is the famous one. Sora's launch demos included a memorable case where a glass falls off a table and the liquid passes through the surface instead of spilling. Object permanence is fragile: a person walking behind a tree may emerge as a different person on the other side. Cause and effect are sometimes inverted. "A man bites into an apple" can produce a man putting an apple to his mouth and then the apple appearing whole again.
Hands are the second famous failure. Fingers are miscounted, fuse together, or bend in impossible ways. The problem is shared with text-to-image and has been reduced but not eliminated. Sora 2 and Veo 3 produce hands that pass casual inspection most of the time; close-ups still betray the model.
Text rendering inside the video is a third weakness. Signs, labels, and printed words come out as a blur of plausible-looking but meaningless characters more often than not. Veo 3 was the first to render legible English signage reliably; other models still struggle.
Long-form coherence drops as duration grows. A 5-second Sora 2 clip is usually self-consistent. A 60-second one may drift in lighting, color grade, or camera framing. Cuts between shots in multi-shot generations are even harder; current systems do not plan story structure, only continuous footage.
Motion fidelity is uneven. Slow, broad motion looks great. Fast action, sports, and complex articulated movement like dance or martial arts often glitch. Specific human motions like throwing, kicking, or skiing are hit and miss.
Audio, where present, has its own problems. Lip sync is generally good; emotional inflection is often flat; non-English languages get less attention from the major closed models.
Several threads are worth watching through 2026.
Long-form output keeps stretching. Veo 3.1 and Sora 2.1 both extended single-pass clip lengths past one minute. Multi-minute coherent narratives are still tricky, but pipelines that chain prompts with character and location consistency are improving.
Personalization and identity preservation are a major focus. Runway Gen-4 advertised "reference any person or place" features. Meta's Movie Gen included a Personalized Video mode. Closed models tighten guardrails around generating known individuals, while open models generally do not.
World models are blending in. Runway's Gen-4 and DeepMind's Genie 3 were both pitched as more world-model-like than traditional T2V, predicting rendered video conditioned not just on text but on actions or simulated dynamics. The convergence between video generation and game-like world simulation is one of the more interesting current frontiers.
Audio quality is climbing. Beyond the headline native audio of Veo 3 and Sora 2, work continues on cleanly disentangled music, dialogue, and effects channels, and on multi-speaker scenes with speaker identification.
Real-time and on-device generation is still out of reach for high-resolution outputs but is getting closer. Pruned and distilled versions of CogVideoX run on consumer GPUs at low resolution. Sub-second video generation is plausible for short, low-resolution clips by late 2026.
Finally, agentic video pipelines are appearing. Tools like Flow from Google and the multi-agent video editors built on top of Hunyuan and Wan let an agent plan shots, generate them, edit, and assemble a finished short film with minimal human direction. Whether this is good art or a curiosity is, as of early 2026, an open question.
| Date | Paper or model | Authors and source | Contribution |
|---|---|---|---|
| 2016 | VGAN | Vondrick et al., NeurIPS 2016 | First neural video generator |
| 2017 | TGAN | Saito et al., ICCV 2017 | Temporal GAN with singular value clipping |
| 2018 | MoCoGAN | Tulyakov et al., CVPR 2018 | Motion-content disentanglement |
| 2019 | DVD-GAN | Clark et al., arXiv:1907.06571 | GAN scaling on UCF-101 |
| 2021 Nov | NUWA | Wu et al., arXiv:2111.12417 | 3D transformer, sparse attention |
| 2022 Apr | Video Diffusion Models | Ho et al., NeurIPS 2022 | First major paper on diffusion for video |
| 2022 May | CogVideo | Hong et al., arXiv:2205.15868 | First public T2V; autoregressive transformer |
| 2022 Sep | Make-A-Video | Singer et al., arXiv:2209.14792 | T2I prior plus temporal layers |
| 2022 Oct | Imagen Video | Ho et al., arXiv:2210.02303 | Cascaded pixel-space diffusion |
| 2022 Oct | Phenaki | Villegas et al., arXiv:2210.02399 | C-ViViT, variable-length |
| 2022 Dec | DiT | Peebles and Xie, arXiv:2212.09748 | Diffusion transformer for images |
| 2023 Apr | Latent Video Diffusion | Blattmann et al., arXiv:2304.08818 | Foundation for SVD |
| 2023 Nov | Stable Video Diffusion | Blattmann et al., arXiv:2311.15127 | Open weights, latent diffusion |
| 2023 Dec | VideoPoet | Kondratyuk et al., arXiv:2312.14125 | Autoregressive transformer, joint audio |
| 2024 Jan | Lumiere | Bar-Tal et al., arXiv:2401.12945 | Space-Time U-Net |
| 2024 Feb | Sora technical report | OpenAI | Spacetime patches at scale |
| 2024 Aug | CogVideoX | Yang et al., arXiv:2408.06072 | Open weights DiT |
| 2024 Oct | Movie Gen | Polyak et al., Meta | 30B model with paired audio |
| 2024 Oct | Mochi 1 | Genmo | 10B Asymmetric DiT, Apache 2.0 |
| 2024 Dec | HunyuanVideo | Kong et al., arXiv:2412.03603 | 13B open weights DiT |
| 2025 | Wan 2.1/2.2 | Alibaba | Open weights DiT, MoE in 2.2 |