Veo is a family of text-to-video generative AI models developed by Google DeepMind. First announced at Google I/O on May 14, 2024, Veo generates high-definition video clips from text prompts, image inputs, or combinations of both. The model family has evolved rapidly through multiple iterations, with Veo 2 arriving in December 2024, Veo 3 launching at Google I/O in May 2025, and Veo 3.1 following in October 2025. Veo is available to consumers through Google's Gemini app and the dedicated Flow filmmaking tool, and to developers and enterprises through the Gemini API and Vertex AI.
As of late 2025, the Veo family represents one of the most capable commercial video generation systems available, competing directly with models such as OpenAI's Sora, Runway's Gen-3, Pika Labs, and Kuaishou's Kling. Veo 3 was notably the first major video generation model to include native audio generation, producing dialogue, sound effects, and ambient sound alongside the visual output.
Veo builds on over a decade of video generation research at Google and DeepMind. Before the two organizations merged into Google DeepMind in April 2023, both had independently pursued video synthesis. DeepMind contributed early work on the Generative Query Network (GQN), which learned 3D scene representations from 2D observations, and DVD-GAN, an early generative adversarial network for high-resolution video. On the Google Research side, Imagen Video (October 2022) applied cascaded diffusion models to generate 1280x768 video at 24 frames per second from text prompts, while Phenaki (October 2022) introduced an autoregressive approach that could produce variable-length video from sequences of text prompts, enabling basic storytelling through prompt chaining.
Subsequent projects further closed the gap between research demonstrations and practical applications. WALT explored latent video diffusion with a window attention mechanism to handle long temporal sequences efficiently. VideoPoet (December 2023) used a large language model backbone to unify multiple video generation tasks, including text-to-video, image-to-video, video stylization, and video inpainting, within a single architecture. Lumiere (January 2024) introduced a Space-Time U-Net (STUNet) architecture that generated the entire temporal duration of a video in a single pass, improving global temporal consistency.
Veo synthesized lessons from all of these projects. It combined the latent diffusion approach that made Imagen Video computationally tractable, the prompt-chaining storytelling capability from Phenaki, and the temporal consistency improvements pioneered by Lumiere. The result was a model that was substantially more capable than any of its predecessors across resolution, duration, realism, and prompt adherence. Veo also drew heavily on Google's foundational work on the Transformer architecture and the multimodal understanding capabilities built into the Gemini model family.
While Google DeepMind has not published a full technical paper detailing Veo's architecture, several key aspects of the system have been described publicly through blog posts, developer documentation, and presentations.
Veo uses a Latent Diffusion Transformer (LDT) architecture that combines the efficiency of latent space operations with the sequence modeling strengths of Transformers. The pipeline begins with a specialized video autoencoder consisting of an encoder and a decoder. The encoder takes raw video frames and compresses them into a lower-dimensional, information-dense latent representation. This compression step is critical: by operating within this compressed latent space, the computationally expensive diffusion process becomes far more manageable, enabling the generation of high-resolution video without requiring prohibitive amounts of processing power.
The compressed latent space is then tokenized, converting the spatio-temporal data into a sequence of tokens that a Transformer network can process. This tokenization step is what distinguishes Veo's architecture from earlier pure convolutional diffusion models. The Transformer's self-attention mechanism allows it to capture long-range dependencies across both spatial dimensions (within a frame) and the temporal dimension (across frames). This means the model can understand not just what appears in a single frame but how objects should consistently evolve, move, and interact over time.
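Google has not published Veo's tokenization scheme, but the general idea described above — folding a compressed latent video tensor into a flat sequence of spatio-temporal patch tokens that a Transformer can attend over — can be sketched in a few lines. All patch sizes and shapes below are illustrative assumptions, not Veo's actual values:

```python
import numpy as np

def patchify_latent(latent, t_patch=2, s_patch=4):
    """Fold a latent video tensor (T, H, W, C) into a sequence of
    spatio-temporal patch tokens, as in generic latent diffusion
    transformers. Illustrative only; Veo's real scheme is unpublished."""
    T, H, W, C = latent.shape
    assert T % t_patch == 0 and H % s_patch == 0 and W % s_patch == 0
    # Split each axis into (num_patches, patch_size) pairs.
    x = latent.reshape(T // t_patch, t_patch,
                       H // s_patch, s_patch,
                       W // s_patch, s_patch, C)
    # Group the patch-index axes together, then flatten each patch
    # into a single token vector.
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    tokens = x.reshape(-1, t_patch * s_patch * s_patch * C)
    return tokens  # shape: (num_tokens, token_dim)

# A 16-frame, 32x32, 8-channel latent becomes 512 tokens of width 256.
latent = np.random.randn(16, 32, 32, 8)
tokens = patchify_latent(latent)
print(tokens.shape)  # (512, 256)
```

Because every token can attend to every other token, spatial consistency within a frame and temporal consistency across frames are handled by the same self-attention mechanism.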
Veo follows the standard forward-and-reverse diffusion paradigm. During training, the model takes clean latent representations of video and systematically adds Gaussian noise over a series of scheduled steps (the forward process) until nothing but random noise remains. By learning to predict and remove this noise at each step, the model internalizes the statistical structure of video data at every level of detail, from large-scale scene composition and camera motion down to fine textures and lighting gradients.
At inference time, the process runs in reverse. The model starts from random Gaussian noise in the latent space and iteratively denoises it, guided by the text prompt or image conditioning signal, until a coherent video latent emerges. The decoder then transforms this latent representation back into pixel space to produce the final video frames.
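The forward-and-reverse loop described above follows the standard denoising diffusion recipe. The toy sketch below uses textbook DDPM update rules with a stand-in noise predictor; the schedule values, step count, and predictor are illustrative assumptions, since Veo's actual configuration is unpublished:

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear noise schedule over a small number of steps (illustrative
# values; Veo's actual schedule and step count are unpublished).
STEPS = 50
betas = np.linspace(1e-4, 0.02, STEPS)
alphas_bar = np.cumprod(1.0 - betas)

def forward_noise(x0, t):
    """Forward process: jump directly to step t by mixing the clean
    latent with Gaussian noise at the scheduled ratio."""
    noise = rng.standard_normal(x0.shape)
    xt = np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1 - alphas_bar[t]) * noise
    return xt, noise

def reverse_denoise(predict_noise, shape):
    """Reverse process: start from pure Gaussian noise and iteratively
    subtract the noise predicted by the model (here a stand-in callable
    that would, in a real system, also receive the prompt conditioning)."""
    x = rng.standard_normal(shape)
    for t in reversed(range(STEPS)):
        eps = predict_noise(x, t)
        alpha = 1.0 - betas[t]
        # Posterior mean of x_{t-1} given the predicted noise.
        x = (x - betas[t] / np.sqrt(1 - alphas_bar[t]) * eps) / np.sqrt(alpha)
        if t > 0:  # no noise is added at the final step
            x += np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x

# With a trivial predictor, the loop still runs end to end and
# returns a latent of the requested shape.
latent = reverse_denoise(lambda x, t: np.zeros_like(x), (4, 8))
print(latent.shape)  # (4, 8)
```

In a real system the `predict_noise` callable is the Latent Diffusion Transformer itself, and the resulting latent is passed through the decoder to produce pixels; the number of loop iterations is the knob that the "Fast" variants discussed below presumably turn down.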
The number of denoising steps influences both the quality of the output and the generation speed. The "Fast" variants of Veo 3 and Veo 3.1 likely use fewer denoising steps or a distilled version of the model to achieve faster generation times at some cost to visual fidelity.
A significant factor in Veo's output quality is the richness of its conditioning mechanism. To improve prompt adherence, Google enriched its training data with detailed, multi-sentence captions for each training video, going well beyond simple one-line descriptions. This enabled the model to associate nuanced text descriptions with specific visual elements, camera movements, and scene dynamics.
The model understands specialized cinematic terminology. Users can specify camera angles (e.g., low angle, bird's eye view), lens types (e.g., 35mm, fisheye, anamorphic), camera movements (e.g., dolly, tracking shot, crane shot), lighting setups (e.g., golden hour, chiaroscuro, neon), and genre-specific visual styles (e.g., film noir, documentary, anime). Veo interprets these terms and produces output that reflects the requested style.
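Veo accepts free-form text and imposes no prompt grammar, but the cinematic vocabulary above composes naturally into structured prompts. The helper below is purely illustrative — a hypothetical convenience, not part of any Veo API:

```python
def build_cinematic_prompt(subject, camera_angle=None, lens=None,
                           movement=None, lighting=None, style=None):
    """Compose a prompt from cinematic vocabulary. Illustrative only:
    Veo takes free-form text, so this structure is a convention,
    not a requirement."""
    parts = [subject]
    if camera_angle:
        parts.append(f"{camera_angle} shot")
    if lens:
        parts.append(f"shot on a {lens} lens")
    if movement:
        parts.append(f"{movement} camera movement")
    if lighting:
        parts.append(f"{lighting} lighting")
    if style:
        parts.append(f"in the style of {style}")
    return ", ".join(parts)

prompt = build_cinematic_prompt(
    "a detective walking down a rain-soaked alley",
    camera_angle="low angle", lens="35mm",
    movement="slow dolly", lighting="neon", style="film noir")
print(prompt)
# a detective walking down a rain-soaked alley, low angle shot,
# shot on a 35mm lens, slow dolly camera movement, neon lighting,
# in the style of film noir
```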
Starting with Veo 3, the conditioning system was extended to include audio. The model generates synchronized dialogue, sound effects, and ambient audio conditioned on the same text prompt and the generated visual content, producing a unified audiovisual output.
All videos generated by Veo are watermarked using SynthID, a technology developed by Google DeepMind that embeds an imperceptible digital watermark directly into the pixels of every video frame. This watermark is invisible to the human eye but can be detected by automated tools, enabling identification of AI-generated media. The watermark is designed to be robust against common transformations such as cropping, resizing, and compression, though it is not intended to withstand motivated adversarial attacks.
Beyond watermarking, Veo passes all generated content through multiple safety layers. These include automated safety filters that block requests for harmful, misleading, or inappropriate content, as well as memorization-checking processes that reduce the likelihood of the model reproducing specific content from its training data.
The original Veo model was unveiled by Google CEO Sundar Pichai at Google I/O on May 14, 2024. The announcement came during a keynote that mentioned artificial intelligence 121 times, reflecting Google's company-wide focus on AI at the time. At launch, Google described Veo as capable of generating 1080p resolution videos exceeding one minute in length from text prompts.
Veo 1 could capture a range of visual and cinematic styles, including landscape shots, time-lapses, and genre-specific aesthetics. The model supported several input modes, including plain text prompts, image inputs paired with text, and video inputs for editing commands.
The model demonstrated an understanding of cinematic concepts such as aerial shots, dolly zooms, time-lapses, and various lighting conditions. It also showed a basic understanding of physical interactions, though this was an area where subsequent versions would improve substantially.
Veo 1 was initially made available through a waitlist on Google Labs, accessible inside a new front-end tool called VideoFX. Access was limited during this early phase as Google gathered user feedback, tested safety measures, and monitored for misuse. The waitlist approach allowed Google to scale access gradually while maintaining control over the system's public exposure.
Google announced Veo 2 on December 16, 2024, describing it as a significant upgrade in quality, realism, and creative control. The announcement was authored by Aäron van den Oord and published on the Google Blog alongside updates to Imagen 3 and the introduction of the Whisk experiment.
Veo 2 introduced several substantial advances over its predecessor:
| Feature | Veo 1 | Veo 2 |
|---|---|---|
| Maximum resolution | 1080p | 4K |
| Physics understanding | Basic | Improved real-world physics simulation |
| Human motion | Limited | Better nuance in movement and expression |
| Cinematography control | Moderate | Advanced (genre, lens type, cinematic effects) |
| Maximum duration | 60+ seconds | Several minutes |
| Temporal consistency | Good | Improved consistency across longer sequences |
The most visible upgrade was the jump from 1080p to 4K resolution, a fourfold increase in pixel count that brought generated video closer to professional production standards. Veo 2 also brought a markedly improved understanding of real-world physics, producing more realistic interactions between objects, more natural fluid dynamics, and more convincing lighting and shadow behavior.
Human figures saw particular improvement. Veo 2 rendered more nuanced facial expressions, more natural body movement, and more realistic hand gestures, an area where earlier video generation models frequently struggled with artifacts.
Google conducted head-to-head comparison tests using 1,003 prompts from Meta's MovieGenBench dataset. Human evaluators judged eight-second, 720p video clips produced by Veo 2 against output from Meta Movie Gen, Kling v1.5, Minimax, and OpenAI's Sora Turbo. In both the "overall preference" and "prompt adherence" categories, Veo 2 received higher ratings than all compared models. These results were widely covered in the technology press, with outlets such as Fortune and The Decoder reporting that Veo 2 had "trounced" the competition.
It is worth noting that these benchmarks were conducted by Google using its own evaluation methodology. Independent third-party benchmarks may yield different results depending on the specific evaluation criteria and prompt sets used.
Veo 2 was rolled out to VideoFX in Google Labs with an expanded user base, though a waitlist remained in place. Google also announced plans to bring Veo 2 capabilities to YouTube Shorts and other Google products throughout 2025. For developers, Veo 2 became generally available on Vertex AI with support for advanced video controls, including the ability to specify the last frame of a video or extend clips in length. Veo 2 was also made available through the Gemini API.
Veo 3 was announced at Google I/O on May 20, 2025, during Sundar Pichai's keynote presentation. The headline feature was native audio generation, making Veo 3 the first major video generation model from a leading AI lab to produce synchronized sound alongside visuals as part of a single generation process.
Veo 3 generates audio natively as part of the video creation process rather than requiring a separate audio model or post-production step. This represents a fundamental shift from treating video and audio as separate problems. The audio generation covers three main categories:
| Audio Type | Description | Examples |
|---|---|---|
| Dialogue | Character speech with accurate lip synchronization | Conversations, narration, monologues |
| Sound effects | Context-aware sounds matching on-screen actions | Footsteps, door creaking, water splashing, phone ringing |
| Ambient noise | Background sounds that establish scene atmosphere | City traffic, wind, office hum, ocean waves, forest birds |
Google described this capability as breaking the "silent era of video generation." The model produces dialogue with accurate lip-sync, environmental sounds that match the scene context, and sound effects that respond to visual actions. Users can control the tone, accent, and emotion of dialogue through their text prompts. The audio and video are generated jointly, meaning the model considers both modalities simultaneously rather than generating video first and then adding audio as an afterthought.
Beyond audio, Veo 3 delivered improvements in physics simulation, realism, and prompt adherence. The model excelled at understanding short narrative descriptions, allowing users to describe a brief scene or story in their prompt and receive a clip that faithfully brings the narrative to life. Physics understanding continued to improve, with more realistic gravity, momentum, and material interactions.
Veo 3 generated significant public attention, with some demo videos going viral on social media. One widely shared example showed a fictional street interview that appeared so realistic it sparked discussions about the implications of AI-generated media for misinformation. Sundar Pichai later noted that within weeks of launch, users had created over 40 million videos with Veo 3.
Alongside the standard Veo 3 model, Google released Veo 3 Fast, a variant optimized for speed and cost efficiency. Veo 3 Fast generates videos more quickly and at a lower per-second cost, making it suitable for rapid iteration, prototyping, and workflows where generation speed is more important than maximum quality. On the Gemini API, Veo 3 Fast is priced at $0.15 per second compared to $0.40 per second for the standard model.
Veo 3 launched initially in private preview on Vertex AI and was subsequently made generally available. It was also released through the Gemini API in Google AI Studio, the Gemini consumer app, and the Flow creative tool. Google AI Pro subscribers ($19.99/month) received access to Veo 3 Fast with three generations per day in the Gemini app.
Veo 3.1 was released on October 15, 2025, as a paid preview in the Gemini API. It builds on Veo 3 with enhanced audio quality, improved visual realism, and several new editing and control capabilities that move the platform closer to a full video production toolkit.
Veo 3.1 also introduced vertical video support, allowing generation of portrait-orientation clips suitable for mobile-first platforms like YouTube Shorts, Instagram Reels, and TikTok.
Veo is distributed through multiple platforms, each designed for a different audience and use case.
VideoFX was the first consumer-facing tool for Veo, launched alongside the original model in May 2024 as part of Google Labs. It provided a simple web-based interface for text-to-video generation. On VideoFX, Veo 2 generated videos at 720p resolution and up to 8 seconds in length, though the underlying model supported higher resolutions and longer durations when accessed through other channels.
Flow is Google's dedicated AI filmmaking tool, introduced at Google I/O 2025. It is custom-designed for Veo, Imagen, and Gemini models and provides a more complete creative environment than VideoFX. Flow allows users to generate images and videos from scratch, swap objects within scenes, extend scenes, direct camera movement, and control pacing. It includes a timeline-based interface that supports iterative refinement of generated content. Flow is available to subscribers of Google AI Pro and Google AI Ultra plans in the United States, with additional countries being added over time. As of early 2026, Google reported that users have created over 275 million AI videos through the Flow platform.
Developers can access Veo models programmatically through Google AI Studio and the Gemini API. This enables building video generation capabilities into custom applications, automated workflows, and third-party tools. All Veo models from Veo 2 through Veo 3.1 (including both Standard and Fast variants) are accessible through this route. Developers are charged on a pay-per-second basis only for successfully generated videos.
For enterprise customers, Veo is available on Vertex AI, Google Cloud's managed machine learning platform. Vertex AI integration enables companies to incorporate Veo into existing cloud infrastructure, combine it with other Google Cloud services, and manage access through enterprise-grade identity and access controls. Veo 2, Veo 3, and Veo 3 Fast have all reached general availability on Vertex AI.
Consumer access to Veo is available directly within the Gemini app. Google AI Pro subscribers receive access to Veo 3.1 Fast with up to three video generations per day, while Google AI Ultra subscribers receive the highest level of access to the full Veo 3.1 model.
Veo pricing varies by model version, output resolution, and the access method used. The following tables reflect pricing as of late 2025.
| Model | 720p/1080p (per second) | 4K (per second) |
|---|---|---|
| Veo 2 | $0.35 | N/A |
| Veo 3 | $0.40 | $0.60 |
| Veo 3 Fast | $0.15 | $0.35 |
| Veo 3.1 | $0.40 | $0.60 |
| Veo 3.1 Fast | $0.15 | $0.35 |
Charges apply only when videos are successfully generated. There is no free tier for Veo video generation on the Gemini API. For Veo 3 and later models, the per-second price includes both video and audio output; audio generation does not incur a separate charge.
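Because billing is a flat per-second rate on successfully generated output, clip cost is a simple multiplication. The sketch below encodes the Gemini API rates from the table above; the dictionary keys are illustrative labels, not real API model identifiers:

```python
# Gemini API per-second rates from the table above (late-2025 pricing).
# Keys are illustrative labels, not actual API model IDs.
VEO_RATES = {  # (model, resolution_tier) -> USD per second of output
    ("veo-2", "hd"): 0.35,
    ("veo-3", "hd"): 0.40, ("veo-3", "4k"): 0.60,
    ("veo-3-fast", "hd"): 0.15, ("veo-3-fast", "4k"): 0.35,
    ("veo-3.1", "hd"): 0.40, ("veo-3.1", "4k"): 0.60,
    ("veo-3.1-fast", "hd"): 0.15, ("veo-3.1-fast", "4k"): 0.35,
}

def clip_cost(model, resolution_tier, seconds):
    """Cost in USD of one successfully generated clip. For Veo 3 and
    later, audio is included in the per-second rate, so no separate
    charge applies."""
    return VEO_RATES[(model, resolution_tier)] * seconds

# An 8-second Veo 3.1 clip at 1080p costs $3.20; the Fast variant
# brings the same clip down to $1.20.
print(clip_cost("veo-3.1", "hd", 8))       # 3.2
print(clip_cost("veo-3.1-fast", "hd", 8))  # 1.2
```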
| Plan | Monthly Price | Veo Access | AI Credits |
|---|---|---|---|
| Google AI Pro | $19.99 | Veo 3.1 Fast (up to 3 per day in Gemini app); limited Flow access | 1,000/month |
| Google AI Ultra | $249.99 | Veo 3.1 (highest access tier); full Flow access | 25,000/month |
Vertex AI pricing for Veo 2 is $0.50 per second of generated video. Veo 3 pricing on Vertex AI was initially set at $0.75 per second at launch in May 2025 and was subsequently reduced to $0.40 per second in September 2025. Enterprise customers may also negotiate custom pricing through Google Cloud sales.
The AI video generation landscape has grown increasingly competitive since 2024, with multiple well-funded companies releasing capable models. The following table compares Veo with several prominent alternatives as of late 2025.
| Feature | Veo 3.1 (Google) | Sora 2 (OpenAI) | Runway Gen-3 Alpha | Pika 2.2 | Kling 2.6 |
|---|---|---|---|---|---|
| Developer | Google DeepMind | OpenAI | Runway | Pika Labs | Kuaishou |
| Max resolution | 4K | 1080p | 4K (upscaled) | 1080p | 1080p |
| Base clip duration | 8 seconds | Up to 20 seconds | Up to 10 seconds | Up to 10 seconds | 5-10 seconds |
| Extended duration | 1+ minute (scene extension) | 20 seconds | Extendable in 8s increments | Limited | Up to 3 minutes (extension) |
| Native audio | Yes (dialogue, SFX, ambient) | Yes | No | No | Yes (as of v2.6) |
| Text-to-video | Yes | Yes | Yes | Yes | Yes |
| Image-to-video | Yes | Yes | Yes | Yes | Yes |
| Reference images | Up to 3 | No | First/last frame | First/last frame | No |
| Camera controls | Yes | Limited | Yes (advanced) | Limited | Yes (motion brush) |
| API access | Gemini API, Vertex AI | OpenAI API | Runway API | Pika API | Kling API |
| Consumer pricing (from) | $19.99/mo | $20/mo | $12/mo | $8/mo | $10/mo |
| API cost (per second) | $0.15 - $0.60 | $0.10 - $0.50 | Credit-based | Credit-based | Credit-based |
| AI watermark | SynthID | C2PA metadata | C2PA metadata | Watermark (free tier) | Watermark (free tier) |
In Google's internal benchmarks conducted in December 2024 using 1,003 prompts from Meta's MovieGenBench dataset, human evaluators preferred Veo 2 over Sora Turbo, Meta Movie Gen, Kling v1.5, and Minimax for both overall quality and prompt adherence. Independent community evaluations, such as the Artificial Analysis Video Arena, have also ranked Veo models competitively, though relative rankings can shift rapidly as all providers release frequent updates.
Veo has found applications across a range of creative and professional domains, including filmmaking and previsualization through the Flow tool, short-form social content for platforms such as YouTube Shorts, rapid prototyping and storyboarding, and enterprise media workflows built on Vertex AI.
Despite its capabilities, Veo has several known limitations as of late 2025, including a short base clip length (eight seconds, extendable through scene extension), occasional physics and anatomy artifacts in complex scenes, unreliable rendering of legible text within generated video, and limited fine-grained control over multi-shot narratives.
Google has implemented several measures to address the ethical implications of realistic AI video generation, including SynthID watermarking of all output, automated safety filters that block harmful or misleading requests, memorization checks that reduce reproduction of training data, and staged rollouts that limited public access during early deployment.
The rapid improvement in AI video generation quality has raised broader societal concerns about deepfakes, misinformation, and the impact on creative professions. Google has stated that SynthID and content policies are part of an ongoing effort to balance the creative potential of the technology with responsible deployment.
| Date | Event |
|---|---|
| May 14, 2024 | Veo 1 announced at Google I/O 2024; VideoFX launched in Google Labs with waitlist access |
| December 16, 2024 | Veo 2 announced with 4K resolution support, improved physics understanding, and state-of-the-art benchmark results against Sora Turbo and other models |
| May 20, 2025 | Veo 3 announced at Google I/O 2025 with native audio generation (dialogue, sound effects, ambient sound); Flow filmmaking tool introduced |
| September 2025 | Veo 3 pricing on Vertex AI reduced from $0.75/second to $0.40/second; Veo 3 Fast reaches general availability |
| October 15, 2025 | Veo 3.1 released in paid preview with reference image support, multi-person dialogue, scene extension, and vertical video support |