Veo 3
Last reviewed
May 17, 2026
Sources
40 citations
Review status
Source-backed
Revision
v5 · 9,970 words
Veo 3 is a video generation model developed by Google DeepMind and announced at Google I/O on May 20, 2025. It is the third major iteration of the Veo family and the first commercially available video generation model to natively produce synchronized audio alongside video in a single generation pass. The audio output includes spoken dialogue with lip-synced characters, sound effects, ambient noise, and background music, all derived from a text prompt. Veo 3 generates clips up to eight seconds at resolutions up to 1080p (with upscaling paths to 4K), running at 24 frames per second with 48 kHz stereo audio output.
The model is accessible through multiple Google platforms: the Flow filmmaking application, the Gemini app, Vertex AI, Google AI Studio, and the Gemini API. Consumer access at launch was restricted to Google AI Ultra subscribers in the United States at $249.99 per month. Veo 3 was later updated to Veo 3.1 in October 2025, which introduced reference image support, scene extension tools, enhanced audio fidelity, and a "Fast" tier variant optimized for speed and lower cost. A second major Veo 3.1 update in January 2026 added native 4K output, native vertical generation, and audio support for the Ingredients to Video mode. Google DeepMind CEO Demis Hassabis described the launch as the moment "the silent film era ended" for AI-generated video, a phrase that was widely repeated in coverage of the announcement.
Veo 3 sits at the intersection of two research directions in generative media: latent diffusion models for video synthesis, which had been the dominant paradigm since 2023, and joint audio-visual generation, which had largely been a research curiosity prior to Veo 3 due to the difficulty of aligning sound and image at training scale. The model treats audio and video as two views of the same underlying scene rather than as separate outputs requiring alignment in post-processing. This architectural choice is the headline difference between Veo 3 and the models it competed with at launch, such as OpenAI's Sora, Runway Gen-4, and Kling 2.0, none of which produced native audio at the time of Veo 3's release.
The model launched into a crowded competitive landscape. By May 2025, OpenAI's Sora had been generally available since December 2024, Runway had released Gen-4 in March 2025, Kuaishou had iterated through several Kling versions, and several open-weight Chinese models including Wan 2.1 from Alibaba and Hailuo from MiniMax had captured significant attention. What set Veo 3 apart was not pure visual fidelity, where its competitors were broadly comparable, but the combination of synchronized native audio with a first-party integration into Google's existing creative and developer ecosystems, including the Gemini app, YouTube Shorts, and Vertex AI.
By mid-2026, that landscape had narrowed substantially. OpenAI announced the shutdown of its standalone Sora consumer experience on March 24, 2026, with the web and mobile apps going dark on April 26, 2026 and the API scheduled to follow on September 24, 2026. With Sora withdrawn, Veo 3.1 became the de facto Western leader in the commercial text-to-video space, alongside Runway and a strengthening Kling lineup.
The first version of the Veo model was announced at Google I/O in May 2024. Google DeepMind described Veo 1 as capable of generating 1080p video clips over a minute in length from text or image prompts. The model supported a range of visual styles and demonstrated a reasonable understanding of physical motion and scene composition. Veo 1 was made available to a limited set of creators and filmmakers through a waitlist program, and was later integrated into YouTube Shorts as a generative tool. It did not include any audio generation capability.
Veo 1 was positioned as Google's direct response to OpenAI's announcement of Sora in February 2024, which had captured the public imagination with its demo reels of cinematic-quality short clips. While Sora demos showed strong visual quality, neither model was widely available at the time of their announcements, and both companies were criticized for cherry-picking the best examples for promotional material.
In December 2024, Google released Veo 2, available initially through the VideoFX experimental platform. Veo 2 improved substantially on its predecessor with stronger physics modeling, more realistic motion trajectories, better camera control, and the ability to generate clips beyond two minutes in some configurations. The model supported up to 4K resolution output. Veo 2 also demonstrated improved understanding of cinematic framing conventions and was made available on Vertex AI for enterprise customers. It still generated video only, without audio.
The Veo 2 release coincided with the public availability of Sora through OpenAI's ChatGPT Plus subscription, and the two models were widely compared in benchmarks during late 2024 and early 2025. Veo 2 was often cited as having a slight edge in physics realism and cinematographic vocabulary, while Sora was viewed as stronger in prompt adherence for unusual or surreal scenarios. Veo 2 also became the underlying engine for the first generation of Flow when that platform launched in 2025.
Veo 3 was announced on May 20, 2025, at Google I/O. The central advance over Veo 2 was native audio-visual generation: rather than generating video and then applying audio as a post-processing step, Veo 3 generates both modalities simultaneously within a shared latent space. Google described this as the first time a video generation model had been trained to treat sound and image as jointly conditioned outputs from a single prompt. The model was made available the same day through Google AI Ultra subscriptions and via Vertex AI. Developer access through the Gemini API followed shortly after launch.
The announcement drew immediate public attention, partly because of how quickly creators began sharing viral outputs, including realistic dialogue scenes, mock advertisements, and scenarios previously only achievable through conventional production. The Will Smith eating spaghetti test, a benchmark that had originated with a deliberately distorted AI video in March 2023, was re-run with Veo 3 within days of launch and produced results that were widely shared as evidence of how far the technology had progressed in two years.
Veo 3 Fast was introduced shortly after the original Veo 3 launch as a lighter-weight variant optimized for speed and cost. Initial pricing was $0.40 per second on the Gemini API and Vertex AI. Veo 3 Fast generates video at approximately twice the speed of the standard model while sacrificing a small amount of visual quality, generally cited by Google as 1 to 8% degradation on internal benchmarks. The Fast variant initially produced 720p output but received a 1080p update later in 2025.
In September 2025, Google announced a price cut bringing Veo 3 Fast pricing down to $0.15 per second for video with audio and $0.10 per second for video only. The same announcement reduced Veo 3 standard pricing from $0.75 to $0.40 per second with audio, and from $0.50 to $0.20 per second without audio. Google framed the price changes as part of a broader effort to expand commercial accessibility as the company scaled its generation infrastructure.
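The relative size of each cut follows from the figures above. The sketch below works through the arithmetic in cents; the dictionary labels are illustrative rather than official product identifiers, and it assumes the initial $0.40 Fast price applied to output with audio.

```python
# Per-second prices in US cents as (before, after) the September 2025 cut,
# taken from the figures quoted in the text. Labels are illustrative only.
# Assumption: the initial Veo 3 Fast price of 40 cents covered video with audio.
PRICE_CUTS_CENTS = {
    "Veo 3, with audio": (75, 40),
    "Veo 3, video only": (50, 20),
    "Veo 3 Fast, with audio": (40, 15),
}


def percent_reduction(before: int, after: int) -> float:
    """Return the price reduction as a percentage of the original price."""
    return round((before - after) / before * 100, 1)


reductions = {
    label: percent_reduction(before, after)
    for label, (before, after) in PRICE_CUTS_CENTS.items()
}

for label, pct in reductions.items():
    print(f"{label}: {pct}% lower")
```

On these figures the Fast tier saw the steepest proportional cut, at 62.5%.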
Veo 3.1 was released on October 14, 2025 as an update to Veo 3. Google described it as focused on improved prompt adherence, scene comprehension, audio-visual alignment, and consistency across frames. The release also introduced a structured three-tier pricing model with Light, Fast, and Standard variants. Veo 3.1 became immediately available through the Gemini app, Flow, the Gemini API, and Vertex AI for enterprise users.
The 3.1 update added several features that addressed common complaints from creators about Veo 3, particularly character consistency and the inability to extend clips beyond eight seconds without visible discontinuities. The headline new feature, Ingredients to Video, allowed users to upload up to three reference images and have the model maintain consistency with them throughout a generated clip, which made it practical for the first time to produce multi-shot sequences featuring the same character.
On January 13, 2026, Google released a major update to Veo 3.1 that added native 4K output, native vertical generation, audio support for the Ingredients to Video mode, and improved expressiveness in dialogue. The update was framed by Google as moving Veo from a "social media tool" into a tier suited for broadcast and short-form commercial production.
The headline change was native 4K (3840x2160) generation rather than upscaling. Google describes the 4K path as reconstructing fine detail (hair, fabric weave, skin texture, foliage, raindrops) at the model level rather than interpolating from a 1080p source. This made Veo 3.1 the first mainstream commercial text-to-video model to support true 4K generation as a first-class output, ahead of Sora 2, which capped at 1080p. The 4K mode was available through Flow, the Gemini API, and Vertex AI but not in the consumer Gemini mobile app or YouTube interface.
The update also added native 9:16 vertical generation inside the Ingredients to Video flow, so creators could lock characters across shots while producing TikTok-format or YouTube Shorts content without center-cropping. Audio support inside Ingredients to Video closed a gap from the October 2025 release, where reference-driven multi-shot sequences still had to be voiced and scored separately. The character consistency model was retrained to keep an Ingredients reference identity stable across radically different environments. Audio expressiveness was also tightened, with improvements to dialogue prosody, emotional delivery, and two-speaker overlap handling, although multi-character scenes remained a documented weakness.
Veo 3.1 Lite was launched on April 1, 2026 as Google's lowest-cost public tier, one week after OpenAI announced the discontinuation of Sora. Priced below 50% of the Veo 3.1 Fast tier, it targeted high-volume developer workloads, agency prototyping pipelines, and small creator tools. The Lite tier traded sample-level fidelity (particularly fine texture, hand anatomy, and reflective surfaces) for cost and latency improvements while preserving native audio and Ingredients to Video character locking. It served as Google's competitive answer to budget Chinese models such as Wan 2.1, Hailuo, and Kling Fast variants.
As of mid-May 2026, Google had not officially announced either Veo 3.2 or Veo 4, but a leak and public remarks from the Veo team suggested both projects existed. On January 18, 2026, a leaker who tracks Google internals reported an unreleased "Veo 3.2" build (codenamed Snowbunny) observed in internal Google Workspace services. Leaked references mentioned an "Artemis" world-model engine, a 30-second native single-shot cap (up from eight seconds), explicit physics simulation, and Ingredients 2.0 character locking. None of these claims were confirmed by Google. Following the late March 2026 Sora shutdown announcement, Medhini Narasimhan from the Veo team made public references to a forthcoming Veo 4, with the most likely announcement window being Google I/O 2026 on May 19 and 20.
The defining feature of Veo 3 is its ability to generate audio that is synchronized with video at generation time. Previous video generation models, including Veo 2, Sora, Kling, and Runway models, produced video-only output; any audio had to be added manually in post-production.
Veo 3 generates three categories of audio: spoken dialogue; sound effects and ambient sound; and background music.
Users specify dialogue by enclosing it in quotation marks within the prompt. For example, a prompt such as "a barista at a Brooklyn coffee shop says, 'Oat milk's gonna be a dollar extra'" produces a video where the barista's mouth shapes match the spoken line. Sound effects and ambient audio are generated based on the implicit acoustic content of the described scene rather than requiring explicit specification. A prompt for "a thunderstorm at night" produces rolling thunder, rain on the surface visible in the scene, and ambient wind, all without the user having to list those audio elements.
The quality of audio generation varies. Dialogue with single speakers and clear emotional valence is generally produced reliably. Scenes with multiple overlapping speakers, rapid speech, or technical vocabulary are more error-prone. Background music tends to be generic, with the model rarely producing the kind of melodically distinctive scoring that human composers create. Sound effects for unusual or fabricated objects are also less reliable, since the model has limited training data on what such sounds should be.
Lip-sync is achieved as an emergent property of joint training rather than as an explicit alignment step. Because audio and video share the same latent representation during training, the model learns that certain mouth shapes correspond to certain phonemes, and these associations are encoded in the joint latent space. At inference time, the model generates both streams in parallel, with cross-modal attention ensuring that the visual mouth positions for a given moment correspond to the phonemes being uttered at that moment.
In practical terms, Veo 3 produces convincing lip-sync for English dialogue in roughly natural speech rhythms. Performance degrades for very rapid speech, whispered or shouted speech, and non-English languages, where training data is less abundant. The model also struggles with multiple characters speaking simultaneously, often producing dialogue from one character that is incorrectly attributed to another in the visual frame.
Veo 3 is built on a latent diffusion transformer architecture. Both the video and audio streams are compressed into lower-dimensional latent representations before the diffusion process begins. The transformer then operates simultaneously across visual spacetime patches (height, width, and time) and temporal audio frames, allowing the model to learn statistical dependencies between audio events and corresponding visual events within a unified latent space.
The architecture uses cross-frame attention mechanisms to maintain object consistency across video frames, motion vector modeling to predict natural trajectories, temporal position embeddings, and a joint audio-visual decoder that reconstructs both streams from the shared latent representation at inference time. Because audio and video share the same latent space during training, synchronization is an intrinsic property of the model rather than a post-hoc alignment step.
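To make the idea of one transformer attending over both spacetime patches and audio frames more concrete, the sketch below counts the tokens in such a joint sequence. Every number in it (latent grid size, patch size, audio latent rate) is a hypothetical placeholder chosen for easy arithmetic; Google has not published these values, and this is not a description of Veo 3's actual tokenizer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatentShape:
    """Hypothetical shape of a compressed video latent (not a published value)."""
    frames: int   # latent time steps
    height: int   # latent rows
    width: int    # latent columns


def video_tokens(latent: LatentShape, pt: int, ph: int, pw: int) -> int:
    """Number of spacetime patches when the latent is cut into pt x ph x pw blocks."""
    if latent.frames % pt or latent.height % ph or latent.width % pw:
        raise ValueError("latent dimensions must be divisible by the patch size")
    return (latent.frames // pt) * (latent.height // ph) * (latent.width // pw)


def audio_tokens(duration_s: int, latent_rate_hz: int) -> int:
    """Number of temporal audio frames at an assumed audio-latent frame rate."""
    return duration_s * latent_rate_hz


# Assumed example values only.
latent = LatentShape(frames=48, height=68, width=120)
n_video = video_tokens(latent, pt=2, ph=2, pw=2)
n_audio = audio_tokens(duration_s=8, latent_rate_hz=25)

# A joint model places both token sets in a single attention sequence.
joint_sequence_length = n_video + n_audio
print(n_video, n_audio, joint_sequence_length)
```

Under these assumed sizes the video tokens outnumber the audio tokens by more than two orders of magnitude, which illustrates why most of the sequence length, and therefore most of the attention cost, would come from the visual stream.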
Independent technical analysis by Tyler Frink and others suggests that Veo 3 employs a hierarchical approach in which a roughly 12-billion-parameter transformer generates keyframes at fixed temporal intervals while a larger 3D U-Net interpolates the intermediate frames. This 3D U-Net is structurally similar to the U-Nets used in 2D diffusion models but operates on spatiotemporal latents that combine height, width, and time. Google has not published exact parameter counts in its public documentation, so these figures are best regarded as informed estimates rather than official specifications.
The model was trained on large-scale datasets combining video, audio, and text metadata, enabling it to associate sonic properties with visual contexts: the sound of footsteps on different surfaces, the acoustic qualities of indoor versus outdoor spaces, and the relationship between a speaker's visible mouth position and the corresponding phonemes.
Google DeepMind's technical reports describe the training objective as maximizing the joint likelihood of the audio and video given a text or image conditioning input. This differs from prior approaches in which separate video and audio models were trained independently and then connected through an alignment module. The joint training approach means the model learns that a door slamming should produce a sharp transient sound at the exact frame the door contacts its frame, rather than learning this relationship through a secondary alignment step that is inherently imprecise. The tradeoff is increased training complexity and inference cost, since the model must maintain high-dimensional representations of both modalities throughout the generation process.
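Written out, the distinction between the two training setups looks roughly as follows. The notation is an assumption introduced here for illustration (v for video, a for audio, c for the text or image conditioning, and θ, φ, ψ for model parameters); it is not taken from Google's reports.

```latex
% Joint training: a single model maximizes the likelihood of both modalities together.
\max_{\theta} \; \mathbb{E}_{(v, a, c) \sim \mathcal{D}} \left[ \log p_{\theta}(v, a \mid c) \right]

% Independently trained models: video and audio are fit separately,
% and a separate alignment stage has to reconcile them afterwards.
\max_{\phi} \; \mathbb{E}_{(v, c) \sim \mathcal{D}} \left[ \log p_{\phi}(v \mid c) \right],
\qquad
\max_{\psi} \; \mathbb{E}_{(a, c) \sim \mathcal{D}} \left[ \log p_{\psi}(a \mid c) \right]
```

In the second form, nothing in either objective ties the timing of the audio to the content of the video, which is why synchronization has to be recovered by a later alignment step. Diffusion models typically optimize a denoising surrogate for these likelihoods rather than the likelihood itself.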
Veo 3 generates video at up to 1080p natively, with post-generation upscaling available to 4K in supported workflows. The model supports both 16:9 (landscape) and 9:16 (portrait/vertical) aspect ratios. Output runs at 24 frames per second, which is the standard for cinematic content. Audio is output at 48 kHz stereo. Veo 3.1 Standard supports a native 4K (3840x2160) output mode added in the January 2026 update, which reconstructs fine detail at the model level rather than upscaling from a 1080p source. Native 4K is available through Flow, the Gemini API, and Vertex AI but is gated out of the consumer Gemini app and YouTube Shorts interfaces.
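These specifications imply some simple per-clip quantities. The short calculation below is a sketch that assumes the maximum eight-second clip and the standard pixel dimensions for 1080p (1920x1080); the constant names are illustrative.

```python
FPS = 24                  # frames per second
SAMPLE_RATE_HZ = 48_000   # audio samples per second, per channel
CHANNELS = 2              # stereo
CLIP_SECONDS = 8          # assumed maximum native clip length

video_frames = FPS * CLIP_SECONDS                    # frames in one clip
samples_per_frame = SAMPLE_RATE_HZ // FPS            # audio samples spanning one video frame
samples_per_channel = SAMPLE_RATE_HZ * CLIP_SECONDS  # audio samples per channel in one clip
total_samples = samples_per_channel * CHANNELS       # across both stereo channels

# Native 4K carries four times the pixels of 1080p per frame.
pixels_1080p = 1920 * 1080
pixels_4k = 3840 * 2160
pixel_ratio = pixels_4k // pixels_1080p

print(video_frames, samples_per_frame, total_samples, pixel_ratio)
```

Each video frame therefore spans 2,000 audio samples per channel, which gives a sense of how fine the timing alignment between sound and image has to be.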
Clip length at launch was limited to approximately four to eight seconds per generation. This is a practical constraint of the computational cost of joint audio-visual generation, as generating both streams simultaneously over longer durations increases inference time and memory requirements substantially. Longer clips can be produced by chaining multiple generations through the Scenebuilder feature in Flow or the scene extension feature in Veo 3.1, but each segment retains its eight-second native cap. Leaked references to the unreleased Veo 3.2 ("Snowbunny") build suggest a future native cap of around 30 seconds, although this has not been confirmed by Google.
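The effect of the eight-second cap on longer sequences can be estimated with a one-line calculation. This is a minimal sketch with a hypothetical helper name, and it assumes segments are laid end to end with no overlap between them.

```python
import math

NATIVE_CAP_SECONDS = 8  # per-generation cap described above


def segments_needed(target_seconds: float, cap_seconds: int = NATIVE_CAP_SECONDS) -> int:
    """Minimum number of generations to cover a target duration, assuming no overlap."""
    if target_seconds <= 0:
        raise ValueError("target_seconds must be positive")
    return math.ceil(target_seconds / cap_seconds)


for target in (8, 9, 30, 60):
    print(target, "seconds ->", segments_needed(target), "segments")
```

A 30-second sequence needs at least four chained generations under the current cap, whereas the rumored 30-second cap would cover it in one.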
Veo 3 accepts text prompts and, in supported configurations, image prompts (image-to-video). The model interprets descriptions of scenes, characters, actions, camera movements, lighting conditions, and audio cues within a single prompt. For example, a prompt can specify a camera movement, the style of dialogue a character should speak, the acoustic environment of the scene, and the mood of background music simultaneously.
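One way to keep such multi-part prompts organized is to assemble them from named pieces. The helper below is a hypothetical convenience function, not part of any Google SDK; it only builds the prompt string and follows the convention of putting spoken lines in quotation marks.

```python
from typing import Optional, Tuple


def build_prompt(
    scene: str,
    camera: Optional[str] = None,
    dialogue: Optional[Tuple[str, str]] = None,
    audio: Optional[str] = None,
) -> str:
    """Join scene, camera, dialogue, and audio cues into one prose prompt."""
    sentences = [scene]
    if camera:
        sentences.append(camera)
    if dialogue:
        speaker, line = dialogue
        # Spoken lines go inside quotation marks so they read as dialogue.
        sentences.append(f'{speaker} says, "{line}"')
    if audio:
        sentences.append(audio)
    # Normalize each part so it ends with exactly one period.
    return " ".join(s.strip().rstrip(".") + "." for s in sentences)


prompt = build_prompt(
    scene="A barista works behind the counter of a Brooklyn coffee shop",
    camera="Slow dolly in",
    dialogue=("The barista", "Oat milk's gonna be a dollar extra"),
    audio="Espresso machine hiss and quiet cafe chatter in the background",
)
print(prompt)
```

Keeping the pieces separate makes it easier to vary one element, such as the camera move, while holding the rest of the prompt constant across generations.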
Prompt adherence improved with the Veo 3.1 update in October 2025, which addressed cases where the model would deviate from explicitly described scene elements. The model handles cinematic vocabulary such as dolly shots, tracking shots, rack focus, and specific color grading references with reasonable fidelity.
Veo 3 supports image-to-video generation, where a static image is provided as the first frame and the model generates subsequent motion and audio. This feature is useful for animating still artwork, product photographs, or reference images. At launch, image-to-video was not available in the European Economic Area, Switzerland, or the United Kingdom due to regional restrictions tied to provisions of the EU AI Act.
The image-to-video pathway uses a separate conditioning mechanism in which the input image is encoded and fed to the diffusion process as a starting state. This means the model has stronger constraints on the first frame than on subsequent frames, and visual consistency degrades gradually over the clip duration. In practice, motion that is described in the prompt usually appears as intended, but fine details from the input image can drift as the clip progresses.
Veo 3 provides camera control through natural language description. Users can specify camera movements such as zooms, pans, tilts, and tracking shots within the prompt. The Flow platform provides a more structured interface for camera control as part of its scene-building workflow. Supported camera descriptors include dolly in, dolly out, orbit, crane up, crane down, Dutch angle, and static lock. The model interprets these directions within the context of the scene geometry it generates, so a "tracking shot following a runner" will attempt to maintain the runner as the subject throughout the clip's duration.
Compared with Runway Gen-4, which provides a structured camera-control UI with explicit dolly and crane parameters, Veo 3's camera control is more language-driven and less precisely repeatable. The same prompt run twice will produce different specific camera trajectories even when the requested shot type is held constant. For creators who need precise control, Flow provides a complementary visual interface in which camera movements can be specified through the timeline rather than buried inside a prose prompt.
Veo 3 accepts visual style guidance within prompts. Creators can reference specific film looks, such as film grain, aspect ratio, color grading style (warm desaturated, high contrast black and white, neon-lit), or a named cinematographer's aesthetic. The model does not reproduce copyrighted works, but will approximate visual properties associated with described styles. This capability is more reliable for broad aesthetic categories than for precise replication of specific works.
The original Veo 3 model, released May 20, 2025, generates up to eight seconds of video with synchronized audio from a text or image prompt. It is billed per second of generated video on Vertex AI ($0.50 per second for video only, $0.75 per second for video with audio at launch, reduced in September 2025 to $0.20 and $0.40 per second respectively) and is accessible to Google AI Ultra subscribers through Flow.
Veo 3 Fast is a lighter-weight variant of Veo 3 designed for faster generation and lower cost. It originally produced output at 720p rather than 1080p and generated video roughly twice as fast as the standard model. A subsequent update added 1080p output to Veo 3 Fast. The quality tradeoff relative to the standard model is described by Google as small (approximately 1 to 8% degradation on internal quality benchmarks). Veo 3 Fast is suited for rapid prototyping, iterating on creative concepts, or high-volume generation workflows where cost is a priority. On the Gemini API, Veo 3 Fast is priced at $0.15 per second following the September 2025 price cut.
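Veo 3.1 was released on October 14, 2025 as an update to Veo 3. Google described it as focused on improved prompt adherence, scene comprehension, audio-visual alignment, and consistency across frames. Key additions in Veo 3.1 include Ingredients to Video, which accepts up to three reference images to keep characters consistent throughout a clip; scene extension tools for continuing a sequence beyond the eight-second native cap; and enhanced audio fidelity.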
Veo 3.1 also introduced a three-tier pricing structure: Veo 3.1 Light ($0.05 per video on the Gemini API), Veo 3.1 Fast ($0.15 per video), and Veo 3.1 Standard ($0.40 per video).
Veo 3.1 Fast generates video at approximately twice the speed of standard Veo 3.1 with a 1 to 8% quality reduction, making it practical for draft workflows. Veo 3.1 Light is the lowest-cost tier, outputting shorter clips at reduced resolution, suited for applications where generation volume matters more than output quality.
The January 2026 update added native 4K (3840x2160) generation for Veo 3.1 Standard, audio support inside Ingredients to Video, native vertical generation in the same flow, and improvements to dialogue prosody and two-speaker overlap handling. The 4K mode was gated to Flow, the Gemini API, and Vertex AI, and was not available in the Gemini mobile app or YouTube.
Veo 3.1 Lite was launched on April 1, 2026 as the lowest-cost public tier in the Veo 3 family. Priced below half the Veo 3.1 Fast rate, it preserved native audio output and Ingredients to Video character locking while reducing texture fidelity, hand anatomy reliability, and reflective surface accuracy. The launch came one week after the Sora discontinuation announcement and was an explicit attempt to keep budget customers from migrating to Chinese alternatives such as Wan 2.1 or low-end Kling tiers.
Veo 3.2 had not been publicly announced as of mid-May 2026, but a January 18, 2026 leak reported a build labeled "Veo 3.2" in internal Google Workspace services. Leaked references mentioned a Snowbunny codename, an Artemis world-model engine, a 30-second native single-shot cap, fluid dynamics and object permanence simulation, and an Ingredients 2.0 character locking system. None of these features had been confirmed by Google.
At launch in May 2025, Veo 3 was available exclusively to Google AI Ultra subscribers in the United States. Google AI Ultra costs $249.99 per month, with introductory pricing of 50% off for the first three months bringing the initial cost down to $124.99 per month for new subscribers. The Ultra tier provides approximately 12,500 generation credits monthly, sufficient for roughly 625 Fast-tier videos or 125 Quality-tier videos in Flow.
Google AI Pro ($19.99 per month) received limited access to Veo 3 generation in Flow and the Gemini mobile app in subsequent months. Pro users receive approximately 1,000 credits per month, enough for roughly 10 high-quality Veo videos. By late May 2025, Google had expanded Veo 3 access to over 71 countries through the Gemini mobile app, and Pro subscribers in supported regions received a limited trial of Veo 3 generation.
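The credit figures above imply a per-video credit cost for each quality level. The sketch below derives those implied costs from the approximate numbers in the text; Google's actual credit schedule may differ, and the Pro-tier Fast estimate is an extrapolation that assumes Pro subscribers are charged the same per-video credits as Ultra subscribers.

```python
ULTRA_MONTHLY_CREDITS = 12_500
PRO_MONTHLY_CREDITS = 1_000

# Implied per-video credit costs, derived from the approximate video counts
# quoted for the Ultra tier (625 Fast videos, 125 Quality videos).
credits_per_fast_video = ULTRA_MONTHLY_CREDITS // 625
credits_per_quality_video = ULTRA_MONTHLY_CREDITS // 125


def videos_per_month(monthly_credits: int, credits_per_video: int) -> int:
    """Whole videos that fit in a monthly credit allowance."""
    return monthly_credits // credits_per_video


pro_quality_videos = videos_per_month(PRO_MONTHLY_CREDITS, credits_per_quality_video)
# Extrapolation: assumes the Fast rate applies unchanged on the Pro tier.
pro_fast_videos = videos_per_month(PRO_MONTHLY_CREDITS, credits_per_fast_video)

print(credits_per_fast_video, credits_per_quality_video)
print(pro_quality_videos, pro_fast_videos)
```

The implied rate of 100 credits per high-quality video is consistent across both tiers: it reproduces the figure of roughly 10 videos per month quoted for Pro subscribers.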
Flow is a filmmaking-oriented platform launched by Google alongside Veo 3 at Google I/O 2025. It integrates Veo 3, the Imagen 4 image generation model, and Gemini's natural language capabilities into a single interface designed for cinematic production workflows. Flow provides a visual interface for building scenes, controlling camera movements, generating and extending clips, and managing prompt-driven production. The platform targets independent filmmakers, content creators, and advertising professionals who want to produce cinematic content without conventional production infrastructure.
Flow is the primary consumer interface for Veo 3 generation for Google AI subscribers. It includes a Scenebuilder tool that allows creators to chain multiple generated clips into a longer narrative sequence, with controls for maintaining visual consistency between scenes. The platform also supports text-to-video, frame-to-video, and camera control prompting from a single unified interface.
Flow launched with showcase content produced by working filmmakers, including Dave Clark (whose short "Freelancers" explores the relationship between two estranged adopted brothers and was produced largely with Google's generative tools), Henry Daubrez (whose project "Electric Pink" drew on his own creative journey), and Junie Lau (whose "Dear Stranger" explores the relationship between a grandmother and grandchild across parallel worlds). These showcase projects formed part of Google's launch messaging about Flow as a serious tool for narrative filmmaking rather than only a meme generator.
Vertex AI is Google Cloud's enterprise AI platform and provides API access to Veo 3 for developers and enterprise customers. Vertex AI access is billed per second of generated video. Enterprise customers generating significant video volumes can negotiate custom pricing, and Google has indicated that volume discounts of 15 to 30% are available for monthly commitments above $10,000.
Vertex AI was one of the two platforms where Veo 3 was available on launch day, alongside the Google AI Ultra subscription tier. The platform offers enterprise features including unified authentication and access control, integration with other Google models including Gemini, Imagen, and Chirp, security and compliance handled within the enterprise account, and detailed cost and resource tracking. Google bills only for successful generations, so requests that are blocked by safety filters or fail with technical errors are not charged.
After the March 2026 Sora shutdown, several enterprise customers running Sora pilots migrated their workloads to Veo 3.1 on Vertex AI, with Google emphasizing data residency, audit logging, and Google Cloud identity integration. The Vertex AI listing also gained a more explicit enterprise SLA for Veo 3.1 Standard during the same window.
The Gemini API provides developer access to Veo 3 through Google AI Studio. Veo 3 was made available via the Gemini API shortly after its launch at Google I/O. Developers can call the model programmatically to generate video and audio from prompts or reference images. The API is suitable for building applications that incorporate generated video content, media pipelines, or automated content creation workflows. Pricing on the Gemini API is per second for Veo 3 and Veo 3 Fast, and per video for the Veo 3.1 tier structure.
YouTube CEO Neal Mohan announced at the Cannes Lions festival in June 2025 that Veo 3 would be integrated into YouTube Shorts later that summer. The integration was framed as part of YouTube's broader generative tooling rollout for creators, including text-to-image stickers and AI-assisted editing features. The Veo 3 integration in Shorts uses the Fast variant for cost reasons, with shorter generation latencies and lower per-clip cost. The vertical 9:16 output of Veo 3.1 is well suited to the Shorts format, and Google has positioned the integration as a way for creators to produce supplementary content (cutaways, intros, b-roll) directly within the Shorts editor.
The expanded MENA-region rollout of Veo 3 on YouTube Shorts was announced separately, reflecting Google's international expansion priorities for the model. Creators in the MENA region received access through the Shorts editor in late 2025. The January 2026 update brought native vertical Ingredients to Video generation into the Shorts editor for partnered creators, though consumer Shorts access did not include the 4K mode reserved for Flow, the Gemini API, and Vertex AI.
In 2025, Canva integrated Veo 3 into its platform through a "Create a Video Clip" feature, extending access to Canva's user base of designers and marketing professionals. The integration allows Canva users to generate short video clips directly within the Canva design environment without requiring a separate Google AI subscription. The Canva integration was Google's first significant third-party distribution arrangement for Veo 3 and is one of the few ways to access the model without a Google subscription tier.
| Platform | Model | Price (initial) | Price (Sept 2025+) or monthly credits |
|---|---|---|---|
| Vertex AI | Veo 3 (video only) | $0.50 per second | $0.20 per second |
| Vertex AI | Veo 3 (video + audio) | $0.75 per second | $0.40 per second |
| Vertex AI | Veo 3 Fast | $0.40 per second | $0.15 per second |
| Gemini API | Veo 3 Standard (video + audio) | $0.75 per second | $0.40 per second |
| Gemini API | Veo 3 Fast | $0.40 per second | $0.15 per second |
| Gemini API | Veo 3.1 Standard | n/a (post-Oct 2025) | $0.40 per video (8 sec) |
| Gemini API | Veo 3.1 Fast | n/a (post-Oct 2025) | $0.15 per video (8 sec) |
| Gemini API | Veo 3.1 Light | n/a (post-Oct 2025) | $0.05 per video (8 sec) |
| Gemini API | Veo 3.1 Lite | n/a (post-Apr 2026) | Below $0.07 per video (sub-Fast pricing) |
| Google AI Ultra | Flow (via subscription) | $249.99/month | (12,500 monthly credits) |
| Google AI Pro | Flow (via subscription) | $19.99/month | (1,000 monthly credits) |
Vertex AI and Gemini API pricing is subject to change and enterprise volume discounts may apply. The per-second billing model on Vertex AI means costs scale predictably with output duration: at launch pricing, an eight-second clip with audio cost $6.00 on Vertex AI. After the September 2025 price cut, the same clip costs $3.20 on Vertex AI or $0.40 per generated video at the Veo 3.1 Standard tier on the Gemini API.
The pricing structure for Veo 3.1 deliberately decoupled cost from clip length by moving to a flat per-video price for each tier. This makes the cost of a single eight-second clip predictable, but it also means that shorter clips do not become proportionally cheaper. The three-tier Light/Fast/Standard model provides explicit cost levers mapping to quality tradeoffs, with Veo 3.1 Lite added in April 2026 as a fourth, lower-cost tier for high-volume budget workloads.
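The difference between the two billing models can be worked through directly from the prices in the table. In the sketch below the dictionary keys are illustrative labels rather than API model identifiers, and prices are held as `Decimal` values so the dollar amounts stay exact.

```python
from decimal import Decimal

# Per-second prices in USD for video with audio, from the table above.
PER_SECOND_USD = {
    ("veo-3", "launch"): Decimal("0.75"),
    ("veo-3", "sept-2025"): Decimal("0.40"),
    ("veo-3-fast", "launch"): Decimal("0.40"),
    ("veo-3-fast", "sept-2025"): Decimal("0.15"),
}

# Flat per-video prices in USD for the Veo 3.1 tiers on the Gemini API.
PER_VIDEO_USD = {
    "veo-3.1-standard": Decimal("0.40"),
    "veo-3.1-fast": Decimal("0.15"),
    "veo-3.1-light": Decimal("0.05"),
}


def per_second_cost(model: str, period: str, seconds: int) -> Decimal:
    """Cost of one clip under per-second billing."""
    return PER_SECOND_USD[(model, period)] * seconds


def per_video_cost(tier: str, clips: int = 1) -> Decimal:
    """Cost of a number of clips under flat per-video billing."""
    return PER_VIDEO_USD[tier] * clips


launch_cost = per_second_cost("veo-3", "launch", 8)
post_cut_cost = per_second_cost("veo-3", "sept-2025", 8)
half_length_cost = per_second_cost("veo-3", "sept-2025", 4)
flat_cost = per_video_cost("veo-3.1-standard")

print(launch_cost, post_cut_cost, half_length_cost, flat_cost)
```

Halving the clip length halves the per-second bill (from $3.20 to $1.60 here), while the flat Veo 3.1 Standard price stays at $0.40 regardless of duration.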
At the time of Veo 3's launch, the major competing commercial video generation models were Sora from OpenAI, Kling 2.0 from Kuaishou, and Runway Gen-4 from Runway. Veo 3 was the only model in this group to offer native audio generation as a core output. OpenAI's Sora 2 brought native audio later in 2025, which partially closed this gap, but Veo 3 retained the lead in lip-sync accuracy and audio integration depth.
| Model | Developer | Native audio | Max resolution | Max clip length | Approx. API cost |
|---|---|---|---|---|---|
| Veo 3 | Google DeepMind | Yes (dialogue, SFX, music) | 1080p (4K via upscale) | 8 sec | $0.40/sec (with audio, post-Sept 2025) |
| Veo 3.1 | Google DeepMind | Yes | 1080p / native 4K (Jan 2026) | 8 sec | $0.40/video |
| Veo 3.1 Lite | Google DeepMind | Yes | 1080p | 8 sec | Below $0.07/video |
| Sora 2 | OpenAI | Yes (added late 2025) | 1080p | 20 sec | Discontinued April 2026 |
| Kling 2.0 | Kuaishou | No | 1080p | 10 sec | ~$0.10/sec |
| Kling 3.0 | Kuaishou | Yes (multi-character) | 1080p | 15 sec | ~$0.10/sec |
| Runway Gen-4 | Runway | No (add separately) | 1080p | 10 sec | ~$0.15/sec |
| Hailuo 2 | MiniMax | No | 1080p | 10 sec | Lower cost |
| Pika 2.0 | Pika Labs | No | 1080p | 5 sec | Subscription |
| Wan 2.1 | Alibaba | No | 720p | 5 sec | Lower cost |
Veo 3's primary differentiation from competitors is the synchronized audio output. In independent evaluations conducted shortly after launch, Veo 3 consistently placed first or near the top for lip-sync accuracy, dialogue generation, and ambient sound realism. In pure visual quality metrics, Runway Gen-4 and Sora 2 were considered competitive peers, with some benchmarks placing Runway Gen-4 slightly ahead on frame-level visual fidelity following its late 2025 "World Engine" architecture update. For high-volume or cost-sensitive workflows, Kling and Wan offered lower per-second costs at the expense of visual quality and, initially, any audio capability.
Veo 3 has a higher price point per second of video than most competitors, reflecting the cost of joint audio-visual generation. Veo 3 Fast and the Veo 3.1 tier structure partially close this gap, with Light at $0.05 per video competitive with Wan 2.1 and the April 2026 Lite tier priced below half the Fast rate.
On the Artificial Analysis Video Arena rankings for text-to-video models with audio, Kling 3.0 (released February 2026) and Veo 3.1 traded the top spots through early 2026, with Kling gaining ground through multi-character audio and multi-shot sequencing that Veo 3.1 did not initially match. The January 2026 4K update reopened the gap on visual fidelity, and the April 2026 Lite tier helped Google reclaim the budget end of the market. Sora ceased to be available as a standalone product in April 2026, narrowing the field of major Western competitors. As of mid-2026, the main alternatives to Veo 3.1 are Runway Gen-4.5, Kling 3.0, and Luma Dream Machine.
Veo 3 launched roughly five months after OpenAI's Sora became generally available in December 2024. Sora 2, OpenAI's revised model with explicit physics modeling and tightly synchronized audio, was released later in 2025 and closed much of Veo 3's audio advantage. The two models occupied similar positions in the market for most of late 2025 and early 2026: high-quality, high-cost video generators bundled into consumer subscriptions (ChatGPT Pro for Sora 2, Google AI Ultra for Veo 3). Direct comparisons in independent reviews tended to favor Veo 3 for talking-head and dialogue-driven content and Sora 2 for longer clips and surreal or stylized scenes.
The direct comparison effectively ended on March 24, 2026, when OpenAI announced the discontinuation of the standalone Sora experience. The web and mobile apps were taken offline on April 26, 2026, with the API scheduled for shutdown on September 24, 2026. Reporting cited declining monthly active users and inference costs that exceeded subscription revenue by a factor of four to six. The shutdown left Veo 3.1 without a direct Western equivalent, and Google launched Veo 3.1 Lite within a week as a budget counterweight to Chinese alternatives.
Runway Gen-4 was released on March 31, 2025, roughly seven weeks before Veo 3. Gen-4's signature features were character consistency through image conditioning and a structured camera-control interface with explicit dolly and crane parameters. It did not generate audio natively. For client work that requires repeatable shots of the same character across multiple clips, Gen-4 was widely considered the safer choice. For end-to-end video with dialogue, Veo 3 was the only viable option until Sora 2 added audio later in 2025. By 2026 Runway had announced a custom variant developed in partnership with Lionsgate, although the partnership has been reported as commercially slower-moving than the original announcement suggested.
Kuaishou's Kling models occupied the lower-priced segment of the market through 2025. Kling 1.6 and 2.0 offered competitive visual quality at a fraction of Veo 3's cost but lacked audio. Kling 3.0, released in February 2026, added multi-character audio with voice reference support and multi-shot sequences, becoming a direct competitor to Veo 3.1 on quality and feature parity at lower cost. The Veo 3.1 January 2026 4K update and the April 2026 Lite tier were widely read as Google's response to Kling 3.0 on the two dimensions where Kling had pulled ahead: fidelity at the top end and price at the bottom end.
MiniMax's Hailuo 2 (and its October 2025 Hailuo 2.3 update) and Pika Labs's Pika 2.0 (December 2024) occupy the lower-cost end of the video generation market. Both produce shorter, lower-resolution clips than Veo 3 and lack native audio. They remain popular for rapid prototyping and meme generation, where iteration speed and cost matter more than fidelity.
Veo 3 and Flow together provide independent filmmakers with a path to producing short-form content with production values that previously required camera equipment, sets, actors, and post-production facilities. Filmmakers use Veo 3 to generate establishing shots, b-roll footage, scene extensions, and test renders for visual concepts before committing to full production. The native dialogue capability means that character scenes can be roughed out from prompts rather than requiring voice actors and lip-sync work.
The showcase filmmakers Google partnered with at Flow's launch (Dave Clark, Henry Daubrez, and Junie Lau) demonstrated the model on emotionally complex narrative material rather than only on demo reels. Clark's "Freelancers" used Veo 3 alongside other generative tools to explore family relationships, and the resulting short was widely shared as evidence that generative video could carry narrative weight rather than only generate spectacle.
Production cost reductions of 60 to 70% have been reported in independent productions that lean heavily on Veo 3 and Flow, though these figures are not formally audited and tend to reflect single productions rather than systematic benchmarks. The cost savings come from reducing the need for crew, equipment, location fees, and post-production work, though they assume that the project can tolerate the visual style and limitations of generated video.
Larger studios have remained cautious about replacing live-action production with generative video. The 2023 Hollywood strikes ended with contractual restrictions on generative AI in covered productions, and even where contracts permit AI use, studio risk committees have generally treated Veo 3 outputs as suitable for pre-visualization, look development, and certain background plates rather than as final-pixel content. Reporting through early 2026 emphasized the difficulty of integrating any current text-to-video tool into a conventional studio post-production workflow, particularly for shot-to-shot character consistency over feature-length runtimes.
Marketing teams use Veo 3 to generate concept videos, product demonstration clips, and social media content. The ability to generate video with synchronized voiceover or character dialogue reduces the production cost of short-form advertising content. Agencies use the model to test multiple creative directions quickly before committing production resources to a final version. Direct-to-consumer brands have used Veo 3 to generate product unboxing-style videos, spokesperson clips, and testimonial-format content at a fraction of the cost of hiring actors and production crews. The short clip length of eight seconds aligns well with the format requirements of many paid social advertising placements, where shorter clips often outperform longer ones in engagement metrics.
Creators on platforms such as YouTube, TikTok, and Instagram use Veo 3 to generate short clips with audio for posting directly as content or as components of larger edited videos. The 9:16 vertical output format supported by Veo 3.1 is suited to mobile-first platforms. Clips of characters delivering short monologues, mock product reviews, and fictional news segments became common Veo 3 output formats in the months following launch.
A particularly common format that emerged was the staged street interview, in which a Veo 3-generated character would deliver an opinion or punchline on a topical question. These clips often passed for real human content in the first scroll, with viewers only noticing inconsistencies on closer inspection. The trend prompted some platforms to flag synthetic content more aggressively in their feeds.
Production teams use Veo 3 Fast to generate animated storyboards and rough cuts before entering full production. The lower cost and faster generation speed of the Fast variant make iterative ideation more practical. Directors can visualize lighting, camera angles, and action sequences without on-set production. The eight-second clip limit aligns well with the shot-level granularity of storyboarding, where each shot typically lasts a few seconds. Following the April 2026 Lite tier launch, more pre-visualization workflows shifted onto Veo 3.1 Lite as a cheaper draft layer ahead of Standard finals.
Educators and corporate training developers use Veo 3 to produce explainer videos, historical recreations, and scenario-based training content. The ability to generate a character explaining a concept with synchronized dialogue reduces reliance on human talent for short educational clips. Training departments at companies have used Veo 3 to generate compliance training videos and product onboarding content without commissioning full productions. The dialogue-generation capability is particularly useful for scenario-based training, where multiple variations of a workplace interaction can be generated to illustrate desired and undesired behaviors.
All videos generated by Veo 3 carry a SynthID watermark embedded by Google DeepMind. SynthID is a steganographic watermarking system that encodes a signal imperceptible to human viewers into the generated content. The watermark is designed to persist through common transformations such as resizing, compression, format conversion, frame rate changes, and color adjustments. SynthID has been the default for Imagen and Veo model families on Vertex AI since 2024.
Google also announced plans to add a visible "Made with AI" overlay to Veo 3 outputs as access expanded beyond the initial Ultra subscriber base, providing a visible indicator for viewers encountering the content on social media or other platforms. This visible watermark appeared in the bottom corner of videos generated through Flow for most users, though Ultra subscribers and Vertex AI customers could in some cases generate outputs without the visible mark while retaining the invisible SynthID signal.
SynthID detection requires access to Google's SynthID Detector tool, which is not publicly available to end users in any general-purpose form. This means a typical viewer cannot independently verify whether a given video was generated by Veo 3. Google acknowledged this limitation and noted that the Detector tool was not widely deployed at the time of Veo 3's launch, with broader availability planned but contingent on policy reviews.
SynthID only detects content generated by Google AI systems. Videos generated by Sora 2, Runway, Kling, or other non-Google systems are outside its scope. Academic research has also demonstrated that adversarial methods can degrade SynthID watermark detectability, with one published approach ("UnMarker," presented at IEEE S&P 2025) achieving a 79% bypass rate against the system on tested image and video samples. Subsequent bypass research targeting Gemini-family SynthID detectors has been published on GitHub and in academic venues, indicating that watermark robustness remains an open research problem.
In practice, SynthID functions primarily as a backstop for platforms and fact-checking organizations that have access to Google's detection tools, rather than as a real-time provenance signal that ordinary viewers can verify themselves. The visible "Made with AI" overlay provides a simpler and more universal signal but is itself trivially removable through cropping or editing.
Veo 3's model card, published on May 23, 2025, describes the safety evaluation process that preceded launch. Google DeepMind conducted internal development evaluations, external assurance evaluations, and red-teaming by specialists ahead of release. The DeepMind Responsibility and Safety Council reviewed the model's performance against Google's AI Principles and approved the release with documented mitigations.
Mitigations include pre-training data filtering, post-training safety interventions (including the SynthID watermark), and production-time content filtering to block clearly harmful prompts. Google's content policies prohibit generating sexual content, content depicting real public figures in deceptive contexts (with some exceptions for satire and clearly non-deceptive content), and content that promotes violence or self-harm.
The model card acknowledges several known limitations. Veo 3's outputs skew toward lighter skin tones when race is unspecified in the prompt, reflecting biases in the training data. The model can also exhibit semantic biases that wrongly associate certain terms with specific demographics. Google has stated that ongoing testing and mitigation work is intended to address these issues, though the model card does not specify timelines or quantitative goals.
Veo 3 drew significant immediate attention after its May 2025 launch. The key novelty of synchronized audio generation produced viral social media moments, as creators shared clips that demonstrated realistic dialogue, ambient sound, and music generated from text prompts. Widely circulated examples included scenes of characters engaged in realistic conversations, mock documentary segments, and fictional news broadcasts. One of the most widely shared early examples was a clip of the Loch Ness Monster playing bagpipes, which circulated across social media platforms within days of launch and was cited by multiple journalists as a demonstration of the model's ability to create entertaining, physically plausible content with matching audio.
Technology press coverage was generally positive about the technical advance. CNBC described Veo 3 as representing a meaningful jump from prior video generation tools due to the audio integration. The Verge and TechCrunch covered its launch as one of the more significant announcements from Google I/O 2025, alongside Gemini 2.5 and Imagen 4. DataCamp noted that Veo 3 represented a qualitative shift in what was expected of video generation systems, as the inclusion of native audio meant that generated clips could be published without additional post-production work for many use cases.
Creators who gained early access through Google AI Ultra reported that the model produced outputs substantially more convincing than Veo 2, particularly for any scene involving character speech. The lip-sync quality was widely noted as a departure from the uncanny valley quality of prior AI video dialogue.
Some commercial users reported that audio synchronization worked well on approximately 25% of generations on the first attempt, with multiple regenerations often required to achieve the desired audio-visual alignment. This was one of the issues Veo 3.1 addressed in its October 2025 update.
The "Will Smith eating spaghetti" test became one of the most discussed reception moments for Veo 3. The benchmark originated in March 2023, when an AI-generated video produced through ModelScope showed a grotesquely distorted Will Smith attempting to eat spaghetti, with the actor's face deforming and spaghetti appearing in unintended places. The clip went viral as a marker of how primitive video generation was at the time, and "eating spaghetti" became shorthand for testing how far a new video model had progressed on realistic human motion and food physics.
Within days of Veo 3's launch, AI content creator Javi Lopez ran the test and posted the result on X. The video showed a recognizable Will Smith eating noodles with synchronized audio, including chewing and slurping sounds, in a comparatively realistic style. PetaPixel and several other outlets covered the test as the first generative video to convincingly "pass" the benchmark. The clip was widely shared and discussed, with reactions ranging from impressed (Forbes called it a milestone) to faintly unsettled (YouTuber Marques Brownlee responded "I don't feel so good" to the sound of crunching spaghetti). Some viewers latched on to the crunching sound as an example of the model's audio generation still being imperfect, since spaghetti would not normally crunch.
The test illustrated a broader pattern in Veo 3's reception: the model was clearly a step change over prior tools, but the very realism that made it impressive also surfaced new ways for the output to be subtly wrong. Audio inaccuracies, finger artifacts, and small physics errors became the new focus of critique once the obvious problems of earlier generations were no longer the main story.
Media organizations and researchers raised concerns about the misinformation potential of the model's realism. TIME Magazine published a feature in June 2025 reporting that the publication had been able to generate realistic-looking fabricated videos using Veo 3, including scenes depicting a Pakistani crowd setting fire to a Hindu temple, Chinese researchers handling a bat in a laboratory, an election worker shredding ballots, and Palestinians accepting U.S. aid in Gaza. Experts interviewed for that reporting noted that while the videos contained detectable flaws on close inspection, they were realistic enough to be plausible when viewed quickly in a social media context. The combination of visual realism with synchronized dialogue and ambient audio, absent from prior generation tools, made Veo 3 outputs harder to identify as AI-generated than those of its predecessors.
Within the first week of Veo 3's release, online users had already posted fabricated news segments in multiple languages, including a fake report announcing the death of J.K. Rowling and fake political press conferences with public figures. These clips were shared on platforms with mixed labeling, and at least some viewers initially mistook them for genuine reporting. The fact-checking organization Africa Check published guidance on identifying Veo 3-generated content, focusing on visual tells (text rendering, finger anatomy, lighting inconsistencies) and audio tells (unnatural prosody, mismatched chewing sounds, generic background music) that ordinary viewers could use to flag suspect clips.
Google responded by accelerating the rollout of visible "Made with AI" overlays on Flow-generated content and reaffirming its content policy commitments, but critics noted that the visible overlay was easily removable through cropping and that the underlying SynthID watermark was not detectable without proprietary tooling.
Within the AI industry, Veo 3 was widely viewed as forcing competitors to add native audio generation. Sora 2 added audio later in 2025, Kling 3.0 added multi-character audio in February 2026, and several other models followed suit. By early 2026, native audio generation had moved from a Veo 3 differentiator to a baseline expectation for new video models.
Stock photography and footage services responded by adding AI-generated video to their offerings, with Adobe Stock and others integrating partner models alongside their conventional libraries. The professional film industry's response was more mixed. The 2023 writers' and actors' strikes in Hollywood had concluded with contractual protections against generative AI use in some contexts, and the rise of Veo 3 reopened questions about what those protections covered when the underlying technology improved.
The spring 2026 Sora discontinuation reshaped the competitive narrative. With OpenAI exiting the standalone video product space and Google introducing both the January 2026 4K update and the April 2026 Lite tier, industry coverage shifted from a two-horse race to a market in which the practical alternatives were Veo 3.1, the Kling family, Runway Gen-4.5, and Luma Dream Machine.
Veo 3 has several documented technical limitations:
Clip length: Generation is limited to eight seconds per clip. Longer-form content requires stitching multiple clips together or using the scene extension capability introduced in Veo 3.1. The limit reflects the computational cost of joint audio-visual diffusion. Leaked references to the unreleased Veo 3.2 suggest a future native cap of around 30 seconds, but as of mid-May 2026 this had not been confirmed or shipped.
Audio synchronization reliability: Audio synchronization quality varies across generations. Complex scenes with multiple speakers, rapid movement, or overlapping sounds are more likely to produce misaligned audio. Users often need multiple generations to achieve a clip where both the visual content and audio match the prompt as intended. Veo 3.1 reduced but did not eliminate this problem, and the January 2026 update made further incremental improvements, particularly in two-speaker overlap handling.
Hand and finger generation: Human hands remain a persistent weak point for Veo 3, consistent with limitations seen in other video and image generation models. Fingers frequently appear in anatomically incorrect configurations, including extra fingers, merged digits, and unnaturally bent joints, particularly during rapid motion or when the hand is partially occluded. The Veo 3.1 Lite tier exacerbates this failure mode in exchange for lower cost.
Text rendering: On-screen text (signs, documents, labels) often appears garbled or illegible. Veo 3 is not reliable for generating scenes where readable text is part of the composition; this work is typically done in post-production. The model produces text-like marks that look approximately correct at a glance but resolve into nonsense on closer inspection.
Physics accuracy: While Veo 3 handles common physical scenarios with reasonable fidelity, complex fluid dynamics, cloth simulation, and intricate mechanical interactions produce inconsistent results. The model sometimes prioritizes visual plausibility over physical accuracy in ambiguous cases, leading to motion that looks acceptable in isolation but does not strictly obey the laws it appears to be following. Leaked references to the Veo 3.2 "Artemis" engine describe an explicit world-model layer for fluid and rigid-body simulation, although Google has not confirmed any of these claims.
Hallucinations: The model occasionally generates visual content not specified in the prompt, or loses consistency in object appearance across frames. Characters may change subtly in appearance between shots, and objects can morph or disappear in longer clips. The Ingredients to Video feature in Veo 3.1 partially mitigates this for character consistency but does not eliminate it for general object continuity.
Regional restrictions: Image-to-video was not available at launch in the European Economic Area, Switzerland, or the United Kingdom, reflecting regulatory differences in those jurisdictions. Some other features have rolled out region by region rather than universally.
Access cost: The $249.99 monthly cost of Google AI Ultra placed Veo 3 out of reach for many independent creators at launch. The Veo 3 Fast API tier at $0.15 per second, the Veo 3.1 Light tier at $0.05 per video, and the Veo 3.1 Lite tier from April 2026 provided progressively more affordable options as the ecosystem matured, but the original Quality tier in Flow remained expensive for hobbyist use.
Multi-character interactions: Scenes with two or more characters speaking are less reliable than single-character scenes. Lip-sync can break down, dialogue can be attributed to the wrong visible character, and conversational rhythm can feel mechanical. Kling 3.0's multi-character audio with voice reference partially leapfrogged Veo 3 on this dimension in early 2026, and the Veo 3.1 January 2026 update narrowed but did not close that gap.
Misinformation risk: The realism of Veo 3's audio-visual output, particularly the native dialogue generation with lip-sync, creates a meaningful capability for generating fabricated video content that is harder to detect as AI-generated than outputs from prior tools. This has been documented by media researchers and is an ongoing concern for platform trust and information integrity.