Veo
Last reviewed
May 31, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v8 · 5,192 words
Veo is a family of text-to-video generative AI models developed by Google DeepMind. The original Veo model was unveiled by DeepMind CEO Demis Hassabis at Google I/O on May 14, 2024, where it was positioned as Google's direct answer to OpenAI's Sora.[1][2] The family has since evolved through three major successors, Veo 2 (December 16, 2024), Veo 3 (May 20, 2025), and Veo 3.1 (October 15, 2025), which together added capabilities including 4K resolution, native synchronized audio generation, and reference-image conditioning.[3][4][5] As of May 2026, Veo 3.1 is the most recent publicly released version, with Veo 4 widely anticipated but not officially announced.
Veo is distributed to consumers through Google's Gemini app and the dedicated Flow filmmaking tool, and to developers and enterprises through the Gemini API, Google AI Studio, and Vertex AI.[6] All Veo-generated videos carry an invisible SynthID watermark embedded in every frame.[7] By July 2025, roughly two months after Veo 3's launch, Google CEO Sundar Pichai reported that users had created more than 40 million videos with Veo 3, and by late 2025 Google reported over 275 million videos generated through the Flow tool alone.[8][5]
Within the broader competitive landscape of generative video, Veo competes with Sora (OpenAI), Runway Gen-3 and Runway Gen-4 (Runway), Pika Labs, and Kling (Kuaishou). Veo 3 was the first major model from a leading AI lab to natively generate synchronized audio (dialogue, sound effects, and ambient sound) alongside the visual output in a single generation pass, a capability that DeepMind CEO Demis Hassabis described as marking the moment "AI video generation left the era of the silent film."[9]
Veo is the culmination of more than a decade of video-generation research at Google and DeepMind, and synthesizes techniques from a long line of predecessor systems.[10] Before Google Brain and DeepMind merged into Google DeepMind in April 2023, both organizations had independently pursued video synthesis along several different research tracks.
DeepMind's early contributions included the Generative Query Network (GQN), which learned 3D scene representations from 2D observations, and DVD-GAN, a generative adversarial network for high-resolution video. On the Google Research side, Imagen Video (October 2022) applied cascaded diffusion-model techniques to generate 1280x768 video at 24 frames per second from text prompts, while Phenaki (October 2022) introduced an autoregressive transformer approach that could produce variable-length video from sequences of text prompts, enabling basic storytelling through prompt chaining.[11]
Subsequent projects closed the gap between research demonstrations and practical applications. WALT explored latent video diffusion with a window-attention mechanism to handle long temporal sequences efficiently. VideoPoet (December 2023) used a large language model backbone to unify text-to-video, image-to-video, video stylization, and video inpainting within a single architecture. Lumiere (January 2024) introduced a Space-Time U-Net (STUNet) architecture that generated the entire temporal duration of a video in a single pass, improving global temporal consistency.[12]
Veo synthesized lessons from all of these projects. It combined the latent diffusion approach that made Imagen Video computationally tractable, the prompt-chaining storytelling capability from Phenaki, and the temporal consistency improvements pioneered by Lumiere. The result was a model substantially more capable than any of its predecessors across resolution, duration, realism, and prompt adherence. Veo also drew heavily on Google's foundational work on the transformer architecture and the multimodal understanding capabilities built into the Gemini model family.
The original Veo model was unveiled by Demis Hassabis, head of Google DeepMind, and Douglas Eck, who leads DeepMind's generative-media research, during the Google I/O 2024 keynote on May 14, 2024.[1][2] The announcement came on the same day Google announced Imagen 3 and during a keynote that mentioned "AI" 121 times, reflecting the company's full-court press on generative AI.
Google described Veo 1 as capable of generating 1080p resolution videos "beyond a minute" in length from text prompts.[1] The model supported several input modes:[2]
Veo 1 demonstrated an understanding of cinematic concepts such as aerial shots, dolly zooms, time-lapses, and various lighting conditions, and it showed a basic understanding of physical interactions (fluid dynamics, gravity), though this was an area where subsequent versions would improve substantially.
Veo 1 was initially made available through a waitlist on Google Labs, inside a new web-based front end called VideoFX.[2] Access was limited during this early phase as Google gathered user feedback, tested safety measures, and monitored for misuse. Google indicated at launch that some of Veo's capabilities would ultimately be brought to YouTube Shorts.[2]
Google announced Veo 2 on December 16, 2024, describing it as a substantial upgrade in quality, realism, and creative control.[3] The announcement was authored by Aäron van den Oord (Research Scientist, Google DeepMind) and Elias Roman (Senior Director, Product Management, Google Labs), and was published on the Google Blog alongside the release of Imagen 3 and the introduction of the Whisk experimental tool.
| Feature | Veo 1 | Veo 2 |
|---|---|---|
| Maximum resolution | 1080p | 4K |
| Maximum duration | "Beyond a minute" | Several minutes |
| Physics understanding | Basic | Improved real-world physics simulation |
| Human motion | Limited | Better nuance in movement and expression |
| Cinematography control | Moderate | Advanced (lens types, depth-of-field, genre cues) |
Veo 2 produced fewer hallucinations such as extra fingers or unexpected objects.[3] It also understood "the unique language of cinematography," interpreting specific requests such as "18mm lens" and "shallow depth of field."[3] Human figures saw particular improvement, with more nuanced facial expressions, more natural body movement, and more realistic hand gestures.
Google conducted head-to-head comparison tests using 1,003 prompts from Meta's MovieGenBench dataset, with human evaluators judging 720p, eight-second clips produced by Veo 2 against output from Meta Movie Gen, Kling v1.5, MiniMax, and OpenAI's Sora Turbo.[3] In both "overall preference" and "prompt adherence" categories, Veo 2 received higher ratings than all compared models. Press coverage from outlets including Fortune and The Decoder characterized Veo 2 as having "trounced" the competition.[13][14]
These benchmarks were conducted by Google using its own evaluation methodology, and independent third-party benchmarks may yield different rankings.
Veo 2 was rolled out to VideoFX in Google Labs with an expanded user base. For developers, Veo 2 became generally available on Vertex AI with support for advanced video controls, including the ability to specify the last frame of a video or extend clips in length, and was also offered through the Gemini API. Veo 2 was made available to Gemini Advanced subscribers in the Gemini app in April 2025.[15]
Veo 3 was announced at Google I/O on May 20, 2025, during Sundar Pichai's keynote presentation.[4][16] The headline feature was native audio generation, making Veo 3 the first major video-generation model from a leading AI lab to produce synchronized sound alongside visuals as part of a single generation process.
Veo 3 generates audio natively as part of the video creation process rather than requiring a separate audio model or post-production step.[4][17] The audio generation covers three main categories:
| Audio type | Description | Examples |
|---|---|---|
| Dialogue | Character speech with accurate lip synchronization | Conversations, narration, monologues |
| Sound effects | Context-aware sounds matching on-screen actions | Footsteps, door creaking, water splashing, phone ringing |
| Ambient noise | Background sounds that establish scene atmosphere | City traffic, wind, office hum, ocean waves, birds |
Google described this capability as breaking "the silent era of video generation."[9] The model produces dialogue with accurate lip-sync, environmental sounds that match the scene context, and sound effects that respond to visual actions. Users can control the tone, accent, and emotion of dialogue through their text prompts. The audio and video are generated jointly, meaning the model considers both modalities simultaneously rather than generating video first and then adding audio as an afterthought.
Beyond audio, Veo 3 delivered improvements in physics simulation, realism, and prompt adherence.[4] The model excelled at understanding short narrative descriptions, allowing users to describe a brief scene or story in their prompt and receive a clip that faithfully brings the narrative to life. Physics understanding continued to improve, with more realistic gravity, momentum, and material interactions. Veo 3 generated 4- to 8-second clips at resolutions up to 4K and in both 16:9 and 9:16 aspect ratios.[17]
Veo 3 generated significant public attention, with multiple demo videos going viral on social media.[18][19] One widely shared example, a fictional street interview that appeared so realistic it was widely mistaken for real footage, racked up more than 14 million views on X.[19] Online users posted fake news segments in multiple languages within Veo 3's first week, including an anchor announcing a fake death of a public figure and a fake political news conference, sparking widespread concern about misinformation.[20] On July 10, 2025, Sundar Pichai stated that users had created more than 40 million videos with Veo 3 since launch, and Google introduced a photo-to-video feature in the Gemini app in the same period.[8]
Alongside the standard Veo 3 model, Google released Veo 3 Fast, a variant optimized for speed and cost efficiency. Veo 3 Fast generates videos more quickly and at a lower per-second cost, making it suitable for rapid iteration, prototyping, and workflows where generation speed is more important than maximum quality. On the Gemini API, Veo 3 Fast is priced at $0.15 per second compared to $0.40 per second for the standard model.[21]
Veo 3 launched initially in private preview on Vertex AI and was subsequently made generally available.[6] It was also released through the Gemini API in Google AI Studio, the Gemini consumer app, and the Flow creative tool, Google's dedicated AI filmmaking platform that was introduced at I/O 2025 specifically to showcase Veo.[22] Google AI Pro subscribers ($19.99 per month) received access to Veo 3 Fast with three generations per day in the Gemini app.
Veo 3.1 was released on October 15, 2025, as a paid preview in the Gemini API.[5][23] It builds on Veo 3 with enhanced audio quality, improved visual realism, and several new editing and control capabilities that move the platform closer to a full video-production toolkit.
Veo 3.1 outputs 24 fps video at 720p or up to 1080p resolution and supports both horizontal (16:9) and vertical (9:16) formats, allowing portrait-orientation clips suitable for mobile-first platforms like YouTube Shorts, Instagram Reels, and TikTok.[24]
Veo 3.1 and Veo 3.1 Fast launched simultaneously across the Gemini API, Google AI Studio (in a Veo Studio demo), Vertex AI, the Gemini app, and Flow.[5] Pricing is identical to Veo 3: $0.40 per second for the Standard model and $0.15 per second for the Fast variant on the Gemini API.[21]
Google DeepMind has not published a full technical paper detailing Veo's architecture, but several key aspects of the system have been described publicly through blog posts, developer documentation, and presentations.[10]
Veo uses a latent diffusion transformer architecture that combines the efficiency of latent space operations with the sequence-modeling strengths of transformers. The pipeline begins with a specialized video autoencoder consisting of an encoder and a decoder. The encoder compresses raw video frames into a lower-dimensional, information-dense latent representation. By operating within this compressed latent space, the computationally expensive diffusion process becomes far more manageable, enabling generation of high-resolution video without prohibitive amounts of processing power.
The compressed latent space is then tokenized, converting the spatio-temporal data into a sequence of tokens that a transformer network can process. The transformer's self-attention mechanism captures long-range dependencies across both spatial dimensions (within a frame) and the temporal dimension (across frames). This means the model can understand not just what appears in a single frame but how objects should consistently evolve, move, and interact over time.
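Because Google has not published Veo's compression ratios or patch sizes, the following is only a rough sketch of why working in a compressed latent space matters for a transformer. Every number in it (the 4x temporal and 8x spatial compression, the 1x2x2 patch size, the 8-second 720p clip) is an illustrative assumption, and `token_count` is a hypothetical helper, not part of any Google API.

```python
def token_count(frames, height, width, patch_t, patch_h, patch_w):
    """Number of spatio-temporal patches (tokens) in a frames x height x width grid."""
    if frames % patch_t or height % patch_h or width % patch_w:
        raise ValueError("grid dimensions must be divisible by the patch size")
    return (frames // patch_t) * (height // patch_h) * (width // patch_w)


# Assumed clip: 8 seconds at 24 fps and 1280x720 pixels.
pixel_frames, pixel_h, pixel_w = 8 * 24, 720, 1280

# Assumed autoencoder compression: 4x in time, 8x along each spatial axis.
latent_frames, latent_h, latent_w = pixel_frames // 4, pixel_h // 8, pixel_w // 8

# Assumed patch size: 1 latent frame x 2 x 2 latent positions per token.
latent_tokens = token_count(latent_frames, latent_h, latent_w, 1, 2, 2)
pixel_tokens = token_count(pixel_frames, pixel_h, pixel_w, 1, 2, 2)

# Self-attention compares every token with every other token,
# so its cost grows with the square of the sequence length.
sequence_reduction = pixel_tokens // latent_tokens
attention_reduction = sequence_reduction ** 2

print(latent_tokens, pixel_tokens, sequence_reduction, attention_reduction)
```

Under these assumptions the latent sequence is 256 times shorter than the pixel-space one, and the pairwise attention work shrinks by the square of that factor, which is the efficiency argument the paragraph above makes.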
Veo follows the standard forward-and-reverse diffusion paradigm.[25] During training, the model takes clean latent representations of video and systematically adds Gaussian noise over a series of scheduled steps (the forward process) until nothing but random noise remains. By learning to predict and remove this noise at each step, the model internalizes the statistical structure of video data at every level of detail.
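Veo's actual noise schedule is not public, so the forward process described above can only be sketched generically. The example below uses the linear beta schedule from the original DDPM formulation as a stand-in; the function names, schedule endpoints, and the three-element list standing in for a video latent are all assumptions made for illustration.

```python
import math
import random


def alpha_bar_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal fraction per step for a linear beta schedule (assumed)."""
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(latent, alpha_bar, rng):
    """Draw x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps, eps ~ N(0, 1)."""
    signal, sigma = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    noise = [rng.gauss(0.0, 1.0) for _ in latent]
    noisy = [signal * x + sigma * e for x, e in zip(latent, noise)]
    return noisy, noise


alpha_bars = alpha_bar_schedule(1000)
clean_latent = [0.5, -1.0, 2.0]  # toy stand-in for a compressed video latent
rng = random.Random(0)

# First step: almost all signal. Last step: almost pure Gaussian noise.
slightly_noisy, _ = add_noise(clean_latent, alpha_bars[0], rng)
nearly_pure_noise, target_noise = add_noise(clean_latent, alpha_bars[-1], rng)
```

In this formulation the training target is `target_noise`: the network sees the noisy latent and the step index and is trained to predict the noise that was added.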
At inference time, the process runs in reverse. The model starts from random Gaussian noise in the latent space and iteratively denoises it, guided by the text prompt or image conditioning signal, until a coherent video latent emerges. The decoder then transforms this latent representation back into pixel space to produce the final video frames. The number of denoising steps influences both quality and generation speed; the "Fast" variants of Veo 3 and Veo 3.1 use fewer denoising steps or a distilled version of the model.
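The sampler Veo uses has not been disclosed. As a hedged illustration of the reverse process and of the step-count trade-off mentioned above, the sketch below uses a deterministic DDIM-style update, a well-known technique swapped in here as a stand-in. The "oracle" noise predictor is a toy that already knows the clean latent; a real model would be a trained network conditioned on the prompt. All names and numbers are assumptions.

```python
import math
import random


def alpha_bar_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal fraction per step for a linear beta schedule (assumed)."""
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def make_oracle(clean, alpha_bars):
    """Toy predictor that returns the exact noise because it knows the clean latent."""
    def predict(x, t):
        a = alpha_bars[t]
        return [(xi - math.sqrt(a) * c) / math.sqrt(1.0 - a) for xi, c in zip(x, clean)]
    return predict


def ddim_sample(noise_predictor, start, alpha_bars, num_inference_steps):
    """Deterministic DDIM-style denoising over an evenly spaced subset of timesteps."""
    total = len(alpha_bars)
    if not 1 <= num_inference_steps <= total:
        raise ValueError("num_inference_steps must be between 1 and len(alpha_bars)")
    stride = total // num_inference_steps
    timesteps = list(range(total - 1, -1, -stride))[:num_inference_steps]
    x = list(start)
    for i, t in enumerate(timesteps):
        a_t = alpha_bars[t]
        # After the last visited step, jump to the noise-free latent (alpha_bar = 1).
        a_prev = alpha_bars[timesteps[i + 1]] if i + 1 < len(timesteps) else 1.0
        eps = noise_predictor(x, t)
        x0_pred = [(xi - math.sqrt(1.0 - a_t) * e) / math.sqrt(a_t) for xi, e in zip(x, eps)]
        x = [math.sqrt(a_prev) * x0 + math.sqrt(1.0 - a_prev) * e for x0, e in zip(x0_pred, eps)]
    return x, len(timesteps)


alpha_bars = alpha_bar_schedule(1000)
clean = [0.5, -1.0, 2.0]  # toy stand-in for the latent the model should reach
rng = random.Random(0)
start = [rng.gauss(0.0, 1.0) for _ in clean]  # begin from pure Gaussian noise
oracle = make_oracle(clean, alpha_bars)

full_result, full_steps = ddim_sample(oracle, start, alpha_bars, 50)
fast_result, fast_steps = ddim_sample(oracle, start, alpha_bars, 8)
```

With a perfect predictor both runs land on the same latent, so the toy only shows that the number of network evaluations is a tunable knob. With a learned, imperfect predictor, fewer steps generally trade some quality for speed, which is the trade-off the Fast variants are described as making.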
A significant factor in Veo's output quality is the richness of its conditioning mechanism. Google enriched its training data with detailed, multi-sentence captions for each training video, going well beyond simple one-line descriptions, enabling the model to associate nuanced text descriptions with specific visual elements, camera movements, and scene dynamics.
The model understands specialized cinematic terminology. Users can specify camera angles (e.g., low angle, bird's-eye view), lens types (e.g., 35 mm, fisheye, anamorphic), camera movements (e.g., dolly, tracking shot, crane shot), lighting setups (e.g., golden hour, chiaroscuro, neon), and genre-specific visual styles (e.g., film noir, documentary, anime).
Starting with Veo 3, the conditioning system was extended to audio. The model generates synchronized dialogue, sound effects, and ambient audio conditioned on the same text prompt and the generated visual content, producing a unified audiovisual output.
Veo's evolution shows a clear progression from a controlled Labs experiment to a fully productized creative platform that spans consumer, developer, and enterprise channels.
VideoFX was the first consumer-facing tool for Veo, launched alongside the original model in May 2024 as part of Google Labs.[2] It provided a simple web-based interface for text-to-video generation. On VideoFX, Veo 2 generated clips at 720p resolution and up to 8 seconds in length, though the underlying model supported higher resolutions and longer durations through other channels.
Flow is Google's dedicated AI-filmmaking tool, introduced at Google I/O 2025.[22] It is custom-designed for Veo, Imagen, and Gemini models and provides a more complete creative environment than VideoFX. Flow allows users to generate images and videos from scratch, swap objects within scenes, extend scenes, direct camera movement, and control pacing. It includes a timeline-based interface that supports iterative refinement of generated content and is built around the idea of "longer projects with continuity," preserving the same characters and actors across cuts.[26] Flow is available to subscribers of Google AI Pro and Google AI Ultra plans. By October 2025, Google reported that users had created more than 275 million AI videos through the Flow platform.[23]
Developers can access Veo models programmatically through Google AI Studio and the Gemini API. All Veo models from Veo 2 through Veo 3.1 (including both Standard and Fast variants) are accessible through this route, with charges applied on a pay-per-second basis only for successfully generated videos.[21]
For enterprise customers, Veo is available on Vertex AI, Google Cloud's managed machine learning platform.[6] Vertex AI integration enables companies to incorporate Veo into existing cloud infrastructure, combine it with other Google Cloud services, and manage access through enterprise-grade identity and access controls. Veo 2, Veo 3, Veo 3 Fast, Veo 3.1, and Veo 3.1 Fast have all reached general availability on Vertex AI.
Consumer access to Veo is available directly within the Gemini app. Google AI Pro subscribers receive access to Veo 3.1 Fast with up to three video generations per day, while Google AI Ultra subscribers receive the highest level of access to the full Veo 3.1 model.[21]
All videos generated by Veo are watermarked using SynthID, a technology developed by Google DeepMind that embeds an imperceptible digital watermark directly into the pixels of every video frame.[7] This watermark is invisible to the human eye but detectable by automated tools, enabling identification of AI-generated media. The watermark is designed to be robust against common transformations such as cropping, resizing, and compression, though it is not intended to withstand motivated adversarial attacks.
Because SynthID watermarks every individual frame, the mark remains detectable even after substantial trimming or editing of a video.[27] Google reported in late 2025 that over 10 billion pieces of content had been watermarked with SynthID across four modalities (images via Imagen, video via Veo, audio via Lyria, and text via Gemini), making it the most widely deployed invisible AI watermarking system in existence.[27]
Beyond watermarking, Veo passes all generated content through multiple safety layers: automated safety filters that block requests for harmful, misleading, or inappropriate content; memorization-checking processes that reduce the likelihood of reproducing specific content from the training data; and content policies aligned with Google's broader AI Principles. Google has also made SynthID detection tools available to selected third parties to support the broader ecosystem's ability to identify AI-generated media.[7]
The AI video generation landscape has grown increasingly competitive since 2024, with multiple well-funded companies releasing capable models. The following table compares Veo 3.1 with several prominent alternatives as of late 2025.
| Feature | Veo 3.1 (Google) | Sora 2 (OpenAI) | Runway Gen-4 (Runway) | Pika 2.2 | Kling 2.6 (Kuaishou) |
|---|---|---|---|---|---|
| Developer | Google DeepMind | OpenAI | Runway | Pika Labs | Kuaishou |
| Max resolution | 4K | 1080p | 4K (upscaled) | 1080p | 1080p |
| Base clip duration | 8 seconds | Up to 20 seconds | Up to 10 seconds | Up to 10 seconds | 5–10 seconds |
| Extended duration | 1+ minute (scene extension) | 20 seconds | Extendable in 8 s increments | Limited | Up to 3 minutes (extension) |
| Native audio | Yes (dialogue, SFX, ambient) | Yes | No | No | Yes (since v2.6) |
| Text-to-video | Yes | Yes | Yes | Yes | Yes |
| Image-to-video | Yes | Yes | Yes | Yes | Yes |
| Reference images | Up to 3 | No | First/last frame | First/last frame | No |
| Camera controls | Yes | Limited | Yes (advanced) | Limited | Yes (motion brush) |
| API access | Gemini API, Vertex AI | OpenAI API | Runway API | Pika API | Kling API |
| Consumer pricing (from) | $19.99/mo | $20/mo | $12/mo | $8/mo | $10/mo |
| API cost (per second) | $0.15–$0.60 | $0.10–$0.50 | Credit-based | Credit-based | Credit-based |
| AI watermark | SynthID | C2PA metadata | C2PA metadata | Watermark (free tier) | Watermark (free tier) |
In Google's internal benchmarks conducted in December 2024 using 1,003 prompts from Meta's MovieGenBench dataset, human evaluators preferred Veo 2 over Sora Turbo, Meta Movie Gen, Kling v1.5, and MiniMax for both overall quality and prompt adherence.[3][13] Independent community evaluations, such as the Artificial Analysis Video Arena, have ranked Veo models competitively, though relative rankings can shift rapidly as all providers release frequent updates.
Despite its capabilities, Veo has several documented limitations as of late 2025:
In July 2025, MIT Technology Review reported that Veo 3 added garbled, nonsensical subtitles to generated videos even when users explicitly requested no captions, affecting up to 40 percent of dialogue scenes.[28] The root cause was attributed to training on YouTube videos, vlogs, and TikTok content that contained embedded subtitles, leading the model to "learn" that captions enhance similarity to human-created videos. The problem persisted more than a month after Google announced fixes on June 9, 2025.
In June 2025, CNBC reported that Google had used its catalog of YouTube videos, estimated at 20 billion videos, to train Veo 3 and other Gemini-family models.[29] Multiple leading creators and intellectual-property professionals told CNBC they had not been informed that their content could be used in this way. Google noted that its terms of service permit using YouTube content to improve "the product experience … including through machine learning and AI applications," but users have no opt-out mechanism. Even using one percent of YouTube would amount to roughly 2.3 billion minutes of training data, 40 times the volume reportedly used by some competing AI models. Google offers indemnification for users facing copyright challenges over content generated with Veo.[29]
Veo 3's realism, combined with its native audio generation, fueled rapid concerns about misinformation. Time magazine reported that Veo 3 could generate plausible deepfakes of riots, election fraud, and conflict.[20] In one notable incident, Philippine officials reportedly shared a Veo 3–generated street-interview video to support Vice President Sara Duterte during impeachment proceedings, illustrating real-world political misuse.[30]
In July 2025, Media Matters for America reported that racist and antisemitic videos generated using Veo 3 were being widely uploaded to TikTok.[31] Ars Technica's Ryan Whitwam observed that "vagueness in the prompt and the AI's inability to understand the subtleties of racist tropes (i.e., the use of monkeys instead of humans in some videos) make it easy to skirt the rules."
A Gizmodo report noted that early users frequently directed Veo 3 toward low-quality content, including fake "man on the street" interviews, low-effort haul videos, and repetitive jokes, raising questions about the social value of such ultra-cheap video at scale.
Veo has found applications across a range of creative and professional domains:
| Model | 720p/1080p (per second) | 4K (per second) |
|---|---|---|
| Veo 2 | $0.35 | N/A |
| Veo 3 | $0.40 | $0.60 |
| Veo 3 Fast | $0.15 | $0.35 |
| Veo 3.1 | $0.40 | $0.60 |
| Veo 3.1 Fast | $0.15 | $0.35 |
Charges apply only when videos are successfully generated.[21] There is no free tier for Veo video generation on the Gemini API. For Veo 3 and later models, the per-second price includes both video and audio output.
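As a worked example of the per-second billing described above, the sketch below turns the Gemini API table into a small cost calculator. The prices are copied from the table; the helper name `clip_cost` and the dictionary layout are illustrative assumptions, not part of any Google SDK. Prices are stored in cents to avoid floating-point rounding.

```python
# Gemini API prices in US cents per generated second, copied from the table above.
PRICE_CENTS_PER_SECOND = {
    "Veo 2": {"720p/1080p": 35},
    "Veo 3": {"720p/1080p": 40, "4K": 60},
    "Veo 3 Fast": {"720p/1080p": 15, "4K": 35},
    "Veo 3.1": {"720p/1080p": 40, "4K": 60},
    "Veo 3.1 Fast": {"720p/1080p": 15, "4K": 35},
}


def clip_cost(model, seconds, tier="720p/1080p", succeeded=True):
    """Return the charge in US dollars for one generation request.

    Failed generations are not billed. A model/tier pair with no listed
    price (for example Veo 2 at 4K) raises KeyError.
    """
    cents = PRICE_CENTS_PER_SECOND[model][tier] * seconds
    return cents / 100 if succeeded else 0.0


# An 8-second clip on the standard and Fast variants of Veo 3.1.
standard = clip_cost("Veo 3.1", 8)        # 8 s x $0.40
fast = clip_cost("Veo 3.1 Fast", 8)       # 8 s x $0.15
in_4k = clip_cost("Veo 3.1", 8, "4K")     # 8 s x $0.60
failed = clip_cost("Veo 3.1", 8, succeeded=False)
print(standard, fast, in_4k, failed)
```

At these rates an 8-second clip costs $3.20 on the standard model, $1.20 on the Fast variant, and $4.80 at 4K, so iterating on drafts with the Fast variant costs well under half as much per attempt.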
| Plan | Monthly price | Veo access | AI credits |
|---|---|---|---|
| Google AI Pro | $19.99 | Veo 3.1 Fast (up to 3 per day in Gemini app); limited Flow access | 1,000/month |
| Google AI Ultra | $249.99 | Veo 3.1 (highest tier); full Flow access | 25,000/month |
Vertex AI pricing for Veo 2 is $0.50 per second of generated video. Veo 3 pricing on Vertex AI was initially set at $0.75 per second at launch in May 2025 and was reduced to $0.40 per second in September 2025.[6] Enterprise customers may negotiate custom pricing through Google Cloud sales.
| Date | Event |
|---|---|
| May 14, 2024 | Veo 1 announced at Google I/O 2024 by Demis Hassabis and Douglas Eck; VideoFX launched in Google Labs with waitlist access |
| December 16, 2024 | Veo 2 announced with 4K resolution, improved physics understanding, and benchmark wins against Sora Turbo and other models |
| April 2025 | Veo 2 made available to Gemini Advanced subscribers in the Gemini app |
| May 20, 2025 | Veo 3 announced at Google I/O 2025 with native audio generation; Flow filmmaking tool introduced |
| June 19, 2025 | CNBC reports that Veo 3 was trained on YouTube videos, drawing creator concerns |
| July 10, 2025 | Sundar Pichai reports 40 million videos generated with Veo 3 since launch |
| July 15, 2025 | MIT Technology Review documents Veo 3's persistent "garbled subtitles" problem |
| September 2025 | Veo 3 pricing on Vertex AI reduced from $0.75 to $0.40 per second; Veo 3 Fast reaches GA |
| October 15, 2025 | Veo 3.1 released in paid preview with reference-image support, multi-person dialogue, scene extension, and vertical-video support |
| Late 2025 | Google reports 275M+ videos generated through Flow |
As of May 2026, Veo 3.1 remains the most recent publicly released Veo model. Google has not officially announced Veo 4, though industry observers consider Google I/O 2026 (scheduled for May 19–20, 2026) a likely venue based on the company's historical pattern of unveiling major Veo releases at I/O. Until Google publishes an official announcement, no Veo 4 capabilities, pricing, or release date should be considered confirmed.
Veo's two-year arc, from a waitlist-only Labs experiment in May 2024 to a full creative platform with native audio, reference-image conditioning, multi-platform availability, and billions of frames watermarked through SynthID, illustrates how rapidly generative video has matured. The technology has also surfaced acute challenges around training data provenance, misinformation, deepfakes, and the unresolved economic relationship between AI labs and the creators whose work feeds these models, debates that are likely to define the next phase of the generative-video industry.