# Veo

> Source: https://aiwiki.ai/wiki/veo
> Updated: 2026-06-21
> Categories: Deep Learning, Generative AI, Google DeepMind, Video Generation
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**Veo** is a family of [text-to-video](/wiki/text_to_video) generative AI models developed by [Google DeepMind](/wiki/google_deepmind), and is best known as the first video model from a leading AI lab to natively generate synchronized audio (dialogue, sound effects, and ambient sound) in the same pass as the visuals.[^1][^9] The original Veo model was unveiled by DeepMind CEO Demis Hassabis at [Google I/O](/wiki/google_io) on May 14, 2024, where it was positioned as Google's direct answer to [OpenAI](/wiki/openai)'s [Sora](/wiki/sora).[^1][^2] The family has since evolved through three major successors, [Veo 2](/wiki/veo_2) (December 16, 2024), [Veo 3](/wiki/veo_3) (May 20, 2025), and [Veo 3.1](/wiki/veo_3_1) (October 15, 2025), each adding substantial capabilities including 4K resolution, native synchronized audio generation, and reference-image conditioning.[^3][^4][^5] As of June 2026, Veo 3.1 is the most recent publicly released version; Google DeepMind's official Veo page still lists Veo 3.1 as its leading video model, and no Veo 4 has been officially announced.[^1]

Veo is distributed to consumers through Google's [Gemini app](/wiki/gemini_app) and the dedicated [Flow](/wiki/flow) filmmaking tool, and to developers and enterprises through the [Gemini API](/wiki/gemini_api), [Google AI Studio](/wiki/google_ai_studio), and [Vertex AI](/wiki/vertex_ai).[^6] All Veo-generated videos carry an invisible [SynthID](/wiki/synthid) watermark embedded in every frame.[^7] Adoption was rapid: on July 10, 2025, roughly seven weeks after Veo 3's launch, Google reported "over 40 million Veo 3 videos generated across the Gemini app and Flow," and by October 2025 it reported over 275 million videos generated through the Flow tool alone.[^33][^23]

Within the broader competitive landscape of generative video, Veo competes with [Sora](/wiki/sora) (OpenAI), [Runway Gen-3](/wiki/runway_gen_3) and [Runway Gen-4](/wiki/runway_gen_4) (Runway), Pika Labs, and [Kling](/wiki/kling) (Kuaishou). Veo 3 was the first major model from a leading AI lab to natively generate synchronized audio (dialogue, sound effects, and ambient sound) alongside the visual output in a single generation pass. Describing the breakthrough in a press briefing, DeepMind CEO Demis Hassabis said, "For the first time, we're emerging from the silent era of video generation."[^32]

## What is Veo used for?

Veo is used to generate short video clips (with synchronized audio from Veo 3 onward) from text prompts, still images, or reference images, for applications spanning content creation, filmmaking pre-visualization, advertising, education, prototyping, and entertainment. A full breakdown of applications appears in the [Use cases](#use-cases) section below.

## Background

Veo is the culmination of more than a decade of video-generation research at Google and DeepMind, and synthesizes techniques from a long line of predecessor systems.[^10] Before [Google Brain](/wiki/google_brain) and [DeepMind](/wiki/deepmind) merged into Google DeepMind in April 2023, both organizations had independently pursued video synthesis along several different research tracks.

DeepMind's early contributions included the Generative Query Network (GQN), which learned 3D scene representations from 2D observations, and DVD-GAN, a [generative adversarial network](/wiki/generative_adversarial_network) for high-resolution video. On the Google Research side, [Imagen Video](/wiki/imagen_video) (October 2022) applied cascaded [diffusion model](/wiki/diffusion_model) techniques to generate 1280x768 video at 24 frames per second from text prompts, while Phenaki (October 2022) introduced an autoregressive [transformer](/wiki/transformer) approach that could produce variable-length video from sequences of text prompts, enabling basic storytelling through prompt chaining.[^11]

Subsequent projects closed the gap between research demonstrations and practical applications. WALT explored latent video diffusion with a window-attention mechanism to handle long temporal sequences efficiently. VideoPoet (December 2023) used a [large language model](/wiki/large_language_model) backbone to unify text-to-video, image-to-video, video stylization, and video inpainting within a single architecture. [Lumiere](/wiki/lumiere) (January 2024) introduced a Space-Time U-Net (STUNet) architecture that generated the entire temporal duration of a video in a single pass, improving global temporal consistency.[^12]

Veo synthesized lessons from all of these projects. It combined the latent diffusion approach that made Imagen Video computationally tractable, the prompt-chaining storytelling capability from Phenaki, and the temporal consistency improvements pioneered by Lumiere. The result was a model substantially more capable than any of its predecessors across resolution, duration, realism, and prompt adherence. Veo also drew heavily on Google's foundational work on the [transformer](/wiki/transformer) architecture and the multimodal understanding capabilities built into the [Gemini](/wiki/gemini) model family.

## Veo 1 (May 2024)

The original Veo model was unveiled by Demis Hassabis, head of Google DeepMind, and Douglas Eck, who leads DeepMind's generative-media research, during the Google I/O 2024 keynote on May 14, 2024.[^1][^2] The announcement came on the same day Google announced [Imagen](/wiki/imagen) 3 and during a keynote that mentioned "AI" 121 times, reflecting the company's full-court press on generative AI.

### Capabilities at launch

Google described Veo 1 as capable of generating 1080p resolution videos "beyond a minute" in length from text prompts.[^1] The model supported several input modes:[^2]

- **Text-to-video**: generating clips from natural-language descriptions of scenes, actions, and visual styles
- **Image-to-video**: animating a still image based on an accompanying text prompt describing the desired motion
- **Masked editing**: making targeted changes to specific regions of a previously generated video while leaving the rest intact
- **Storyboard generation**: given a sequence of prompts that together tell a story, Veo could produce longer narrative videos exceeding one minute by chaining generated segments

Veo 1 demonstrated an understanding of cinematic concepts such as aerial shots, dolly zooms, time-lapses, and various lighting conditions, and it showed a basic understanding of physical interactions (fluid dynamics, gravity), though this was an area where subsequent versions would improve substantially.

### Initial availability

Veo 1 was initially made available through a waitlist on Google Labs, inside a new web-based front end called **VideoFX**.[^2] Access was limited during this early phase as Google gathered user feedback, tested safety measures, and monitored for misuse. Google indicated at launch that some of Veo's capabilities would ultimately be brought to YouTube Shorts.[^2]

## Veo 2 (December 2024)

Google announced Veo 2 on December 16, 2024, describing it as a substantial upgrade in quality, realism, and creative control.[^3] The announcement was authored by Aäron van den Oord (Research Scientist, Google DeepMind) and Elias Roman (Senior Director, Product Management, Google Labs), and was published on the Google Blog alongside an updated version of Imagen 3 and the introduction of the Whisk experimental tool.

### Key improvements

| Feature | Veo 1 | Veo 2 |
|---|---|---|
| Maximum resolution | 1080p | 4K |
| Maximum duration | "Beyond a minute" | Several minutes |
| Physics understanding | Basic | Improved real-world physics simulation |
| Human motion | Limited | Better nuance in movement and expression |
| Cinematography control | Moderate | Advanced (lens types, depth-of-field, genre cues) |

Veo 2 produced fewer hallucinations such as extra fingers or unexpected objects.[^3] It also understood "the unique language of cinematography," interpreting specific requests such as "18mm lens" and "shallow depth of field."[^3] Human figures saw particular improvement, with more nuanced facial expressions, more natural body movement, and more realistic hand gestures.

### How did Veo 2 perform in benchmarks?

Google conducted head-to-head comparison tests using 1,003 prompts from Meta's MovieGenBench dataset, with human evaluators judging 720p, eight-second clips produced by Veo 2 against output from Meta Movie Gen, [Kling](/wiki/kling) v1.5, MiniMax, and OpenAI's Sora Turbo.[^3] In both "overall preference" and "prompt adherence" categories, Veo 2 received higher ratings than all compared models. Press coverage from outlets including *Fortune* and *The Decoder* characterized Veo 2 as having "trounced" the competition.[^13][^14]

These benchmarks were conducted by Google using its own evaluation methodology, and independent third-party benchmarks may yield different rankings.

### Availability

Veo 2 was rolled out to VideoFX in Google Labs with an expanded user base. For developers, Veo 2 became generally available on Vertex AI with support for advanced video controls, including the ability to specify the last frame of a video or extend clips in length, and was also offered through the [Gemini API](/wiki/gemini_api). Veo 2 was made available to Gemini Advanced subscribers in the Gemini app in April 2025.[^15]

## Veo 3 (May 2025)

Veo 3 was announced at Google I/O on May 20, 2025, during Sundar Pichai's keynote presentation.[^4][^16] The headline feature was native audio generation, making Veo 3 the first major video-generation model from a leading AI lab to produce synchronized sound alongside visuals as part of a single generation process.

### Native audio generation

Veo 3 generates audio natively as part of the video creation process rather than requiring a separate audio model or post-production step.[^4][^17] The audio generation covers three main categories:

| Audio type | Description | Examples |
|---|---|---|
| Dialogue | Character speech with accurate lip synchronization | Conversations, narration, monologues |
| Sound effects | Context-aware sounds matching on-screen actions | Footsteps, door creaking, water splashing, phone ringing |
| Ambient noise | Background sounds that establish scene atmosphere | City traffic, wind, office hum, ocean waves, birds |

DeepMind CEO Demis Hassabis framed the capability as a turning point, telling reporters, "For the first time, we're emerging from the silent era of video generation."[^9][^32] The model produces dialogue with accurate lip-sync, environmental sounds that match the scene context, and sound effects that respond to visual actions. Users can control the tone, accent, and emotion of dialogue through their text prompts. The audio and video are generated jointly, meaning the model considers both modalities simultaneously rather than generating video first and then adding audio as an afterthought.

### Additional improvements

Beyond audio, Veo 3 delivered improvements in physics simulation, realism, and prompt adherence.[^4] The model excelled at understanding short narrative descriptions, allowing users to describe a brief scene or story in their prompt and receive a clip that faithfully brings the narrative to life. Physics understanding continued to improve, with more realistic gravity, momentum, and material interactions. Veo 3 generated 4- to 8-second clips at resolutions up to 4K and in both 16:9 and 9:16 aspect ratios.[^17]

### Viral impact and rapid adoption

Veo 3 generated significant public attention, with multiple demo videos going viral on social media.[^18][^19] One widely shared example, a fictional street interview that appeared so realistic it was widely mistaken for real footage, racked up more than 14 million views on X.[^19] Online users posted fake news segments in multiple languages within Veo 3's first week, including an anchor announcing a fake death of a public figure and a fake political news conference, sparking widespread concern about misinformation.[^20] On July 10, 2025, roughly seven weeks after launch, Google reported "over 40 million Veo 3 videos generated across the Gemini app and Flow," and introduced a photo-to-video feature in the [Gemini app](/wiki/gemini_app) in the same period.[^33]

### Veo 3 Fast

Alongside the standard Veo 3 model, Google released **Veo 3 Fast**, a variant optimized for speed and cost efficiency. Veo 3 Fast generates videos more quickly and at a lower per-second cost, making it suitable for rapid iteration, prototyping, and workflows where generation speed is more important than maximum quality. On the Gemini API, Veo 3 Fast is priced at $0.15 per second compared to $0.40 per second for the standard model.[^21]

### Availability

Veo 3 launched initially in private preview on [Vertex AI](/wiki/vertex_ai) and was subsequently made generally available.[^6] It was also released through the Gemini API in Google AI Studio, the Gemini consumer app, and the [Flow](/wiki/flow) creative tool, Google's dedicated AI filmmaking platform that was introduced at I/O 2025 specifically to showcase Veo.[^22] Google AI Pro subscribers ($19.99 per month) received access to Veo 3 Fast with three generations per day in the Gemini app.

## Veo 3.1 (October 2025)

Veo 3.1 was released on October 15, 2025, as a paid preview in the Gemini API.[^5][^23] It builds on Veo 3 with enhanced audio quality, improved visual realism, and several new editing and control capabilities that move the platform closer to a full video-production toolkit. Google summarized the release in its announcement: "Veo 3.1 brings richer audio, more narrative control, and enhanced realism that captures true-to-life textures."[^23]

### What is new in Veo 3.1?

- **Reference image support**: users can provide up to three reference images of a character, object, or scene to guide the generation process (the "Ingredients to Video" feature), helping maintain visual consistency across multiple shots or applying a specific artistic style[^5]
- **Multi-person dialogue**: Veo 3.1 supports two or more characters taking turns speaking without confusion or audio overlap, a notable improvement over Veo 3's primarily single-speaker audio capabilities
- **Scene extension**: users can create longer videos lasting a minute or more by generating new clips that seamlessly connect to a previous video; each new segment is generated based on the final second (all 24 frames) of the preceding clip, and chaining up to 20 extensions can produce videos exceeding 140 seconds[^5]
- **First-and-last-frame support**: available in both Standard and Fast variants, this lets users define exactly how a video begins and ends by specifying both frames; the model generates the transition between them, complete with synchronized audio[^5]
- **Outpainting**: expanding the frame beyond its original boundaries
- **Object insertion and removal**: users can add new objects to or remove existing objects from generated scenes (the "Insert" and "Remove" tools in Flow)[^23]
- **Improved character consistency**: better preservation of character appearance, expressions, and movements even when prompts are short or underspecified
- **Enhanced native audio**: richer audio generation, better environmental sound matching, and tighter audio-video synchronization
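
The scene-extension figures in the list above can be checked with simple arithmetic. The sketch below is illustrative only: the 8-second base clip and the 20-extension limit come from this article, while the 7 seconds added per extension is an assumption introduced here (the article does not state the per-extension length), and `extended_duration` is a hypothetical helper, not part of any Google API.

```python
def extended_duration(base_seconds, extension_seconds, extensions):
    """Total clip length after chaining `extensions` scene extensions."""
    if extensions < 0:
        raise ValueError("extensions must be non-negative")
    return base_seconds + extension_seconds * extensions


BASE_SECONDS = 8                 # base clip length cited in this article
MAX_EXTENSIONS = 20              # chaining limit cited in this article
ASSUMED_EXTENSION_SECONDS = 7    # assumption: not stated in this article

total = extended_duration(BASE_SECONDS, ASSUMED_EXTENSION_SECONDS, MAX_EXTENSIONS)
print(total)  # 148
```

Under that assumed increment, the total is consistent with the "exceeding 140 seconds" figure; a different increment would change the result proportionally.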

Veo 3.1 outputs video at up to 1080p resolution (and 720p at 24 fps) and supports both horizontal (16:9) and vertical (9:16) formats, allowing portrait-orientation clips suitable for mobile-first platforms like YouTube Shorts, Instagram Reels, and TikTok.[^24]

### Availability and pricing

Veo 3.1 and Veo 3.1 Fast launched simultaneously across the Gemini API, Google AI Studio (in a Veo Studio demo), Vertex AI, the Gemini app, and Flow.[^5] Pricing is identical to Veo 3: $0.40 per second for the Standard model and $0.15 per second for the Fast variant on the Gemini API.[^21]

## Technical approach

Google DeepMind has not published a full technical paper detailing Veo's architecture, but several key aspects of the system have been described publicly through blog posts, developer documentation, and presentations.[^10]

### Latent diffusion transformer

Veo uses a latent diffusion transformer architecture that combines the efficiency of [latent space](/wiki/latent_space) operations with the sequence-modeling strengths of [transformers](/wiki/transformer). The pipeline begins with a specialized video autoencoder consisting of an encoder and a decoder. The encoder compresses raw video frames into a lower-dimensional, information-dense latent representation. By operating within this compressed latent space, the computationally expensive diffusion process becomes far more manageable, enabling generation of high-resolution video without prohibitive amounts of processing power.

The compressed latent space is then tokenized, converting the spatio-temporal data into a sequence of tokens that a transformer network can process. The transformer's [self-attention](/wiki/attention) mechanism captures long-range dependencies across both spatial dimensions (within a frame) and the temporal dimension (across frames). This means the model can understand not just what appears in a single frame but how objects should consistently evolve, move, and interact over time.
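
To make the effect of compression and tokenization concrete, the following sketch counts how many tokens a transformer would see for one clip. Google has not published Veo's autoencoder or patch settings, so every factor here (4x temporal and 8x spatial downsampling, 1x2x2 patches) is a hypothetical value chosen for illustration, and `latent_token_count` is an illustrative helper rather than anything from Veo itself.

```python
def latent_token_count(frames, height, width,
                       temporal_down, spatial_down,
                       temporal_patch, spatial_patch):
    """Count transformer tokens for one clip under hypothetical settings."""
    checks = (
        (frames, temporal_down * temporal_patch),
        (height, spatial_down * spatial_patch),
        (width, spatial_down * spatial_patch),
    )
    for size, factor in checks:
        if size % factor:
            raise ValueError(f"{size} is not divisible by {factor}")

    # Step 1: the encoder shrinks the clip into a smaller latent grid.
    latent_t = frames // temporal_down
    latent_h = height // spatial_down
    latent_w = width // spatial_down

    # Step 2: the latent grid is cut into spatio-temporal patches (tokens).
    return ((latent_t // temporal_patch)
            * (latent_h // spatial_patch)
            * (latent_w // spatial_patch))


frames = 8 * 24  # an 8-second clip at 24 frames per second
tokens = latent_token_count(frames, 720, 1280,
                            temporal_down=4, spatial_down=8,
                            temporal_patch=1, spatial_patch=2)
pixels = frames * 720 * 1280

print(tokens)            # 172800
print(pixels // tokens)  # 1024
```

With these assumed factors, each token stands in for roughly a thousand pixel positions, which is why self-attention over the whole clip becomes feasible at all.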

### Diffusion process

Veo follows the standard forward-and-reverse diffusion paradigm.[^25] During training, the model takes clean latent representations of video and systematically adds Gaussian noise over a series of scheduled steps (the forward process) until nothing but random noise remains. By learning to predict and remove this noise at each step, the model internalizes the statistical structure of video data at every level of detail.
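
The forward process can be sketched in a few lines. This is a generic illustration of Gaussian noising with a linear variance schedule of the kind used in the diffusion literature; the schedule values, step count, and tiny four-element "latent" are assumptions made for the example, since Veo's actual noise schedule and latent shape are not public.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, rising linearly (assumed schedule)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """Fraction of original signal variance left after each step."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(latent, alpha_bar, rng):
    """Jump straight to a noise level: sqrt(a)*x + sqrt(1-a)*eps."""
    noise = [rng.gauss(0.0, 1.0) for _ in latent]
    signal_scale = math.sqrt(alpha_bar)
    noise_scale = math.sqrt(1.0 - alpha_bar)
    noisy = [signal_scale * x + noise_scale * e for x, e in zip(latent, noise)]
    return noisy, noise  # the sampled noise is what the model learns to predict


rng = random.Random(0)
alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
clean_latent = [1.0, -0.5, 0.25, 0.0]  # stand-in for a video latent

noisy_early, _ = add_noise(clean_latent, alpha_bars[0], rng)
noisy_late, _ = add_noise(clean_latent, alpha_bars[-1], rng)
print(noisy_early)  # still close to clean_latent
print(noisy_late)   # essentially pure Gaussian noise
```

During training, a network would be given a noisy latent and its step index and asked to recover the sampled noise; that training loop is omitted here.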

At inference time, the process runs in reverse. The model starts from random Gaussian noise in the latent space and iteratively denoises it, guided by the text prompt or image conditioning signal, until a coherent video latent emerges. The decoder then transforms this latent representation back into pixel space to produce the final video frames. The number of denoising steps influences both quality and generation speed; the "Fast" variants of Veo 3 and Veo 3.1 use fewer denoising steps or a distilled version of the model.
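
One common way to trade quality for speed, used by samplers such as DDIM, is to visit only an evenly spaced subset of the training timesteps at inference time. The sketch below shows that subsampling step on its own. It is a generic illustration, not a description of how the Veo Fast variants are built: Google has not said which sampler or step counts it uses, and the 1,000 training steps and the `strided_timesteps` helper are assumptions for the example.

```python
def strided_timesteps(train_steps, sample_steps):
    """Evenly spaced timesteps, from the noisiest (last) down to 0."""
    if not 1 <= sample_steps <= train_steps:
        raise ValueError("sample_steps must be between 1 and train_steps")
    last = train_steps - 1
    if sample_steps == 1:
        return [last]
    return [last - (i * last) // (sample_steps - 1) for i in range(sample_steps)]


# Each retained timestep costs one forward pass of the denoising network,
# so a shorter schedule is proportionally cheaper to run.
print(strided_timesteps(1000, 5))        # [999, 750, 500, 250, 0]
print(len(strided_timesteps(1000, 50)))  # 50
```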

### Conditioning and prompt understanding

A significant factor in Veo's output quality is the richness of its conditioning mechanism. Google enriched its training data with detailed, multi-sentence captions for each training video, going well beyond simple one-line descriptions, enabling the model to associate nuanced text descriptions with specific visual elements, camera movements, and scene dynamics.

The model understands specialized cinematic terminology. Users can specify camera angles (e.g., low angle, bird's-eye view), lens types (e.g., 35 mm, fisheye, anamorphic), camera movements (e.g., dolly, tracking shot, crane shot), lighting setups (e.g., golden hour, chiaroscuro, neon), and genre-specific visual styles (e.g., film noir, documentary, anime).

Starting with Veo 3, the conditioning system was extended to audio. The model generates synchronized dialogue, sound effects, and ambient audio conditioned on the same text prompt and the generated visual content, producing a unified audiovisual output.

## Availability

Veo's evolution shows a clear progression from a controlled Labs experiment to a fully productized creative platform that spans consumer, developer, and enterprise channels.

### VideoFX

VideoFX was the first consumer-facing tool for Veo, launched alongside the original model in May 2024 as part of [Google Labs](/wiki/google_labs).[^2] It provided a simple web-based interface for text-to-video generation. On VideoFX, Veo 2 generated clips at 720p and up to 8 seconds long, although the underlying model supported higher resolutions and longer durations through other channels.

### Flow

**Flow** is Google's dedicated AI-filmmaking tool, introduced at Google I/O 2025.[^22] It is custom-designed for Veo, [Imagen](/wiki/imagen), and [Gemini](/wiki/gemini) models and provides a more complete creative environment than VideoFX. Flow allows users to generate images and videos from scratch, swap objects within scenes, extend scenes, direct camera movement, and control pacing. It includes a timeline-based interface that supports iterative refinement of generated content and is built around the idea of "longer projects with continuity," preserving the same characters and actors across cuts.[^26] Flow is available to subscribers of Google AI Pro and Google AI Ultra plans. By October 2025, Google reported that users had created more than 275 million AI videos through the Flow platform.[^23]

### Google AI Studio and the Gemini API

Developers can access Veo models programmatically through Google AI Studio and the Gemini API. All Veo models from Veo 2 through Veo 3.1 (including both Standard and Fast variants) are accessible through this route, with charges applied on a pay-per-second basis only for successfully generated videos.[^21]

### Vertex AI

For enterprise customers, Veo is available on [Vertex AI](/wiki/vertex_ai), Google Cloud's managed [machine learning](/wiki/machine_learning) platform.[^6] Vertex AI integration enables companies to incorporate Veo into existing cloud infrastructure, combine it with other Google Cloud services, and manage access through enterprise-grade identity and access controls. Veo 2, Veo 3, Veo 3 Fast, Veo 3.1, and Veo 3.1 Fast have all reached general availability on Vertex AI.

### Gemini app

Consumer access to Veo is available directly within the [Gemini app](/wiki/gemini_app). Google AI Pro subscribers receive access to Veo 3.1 Fast with up to three video generations per day, while Google AI Ultra subscribers receive the highest level of access to the full Veo 3.1 model.[^21]

## SynthID watermarking

All videos generated by Veo are watermarked using [SynthID](/wiki/synthid), a technology developed by Google DeepMind that embeds an imperceptible digital watermark directly into the pixels of every video frame.[^7] This watermark is invisible to the human eye but detectable by automated tools, enabling identification of AI-generated media. The watermark is designed to be robust against common transformations such as cropping, resizing, and compression, though it is not intended to withstand motivated adversarial attacks.

Because SynthID watermarks every individual frame, the mark remains detectable even after substantial trimming or editing of a video.[^27] Google reported in late 2025 that over 10 billion pieces of content had been watermarked with SynthID across four modalities (images via Imagen, video via Veo, audio via Lyria, and text via Gemini), making it the most widely deployed invisible AI watermarking system in existence.[^27]

Beyond watermarking, Veo passes all generated content through multiple safety layers: automated safety filters that block requests for harmful, misleading, or inappropriate content; memorization-checking processes that reduce the likelihood of reproducing specific content from the training data; and content policies aligned with Google's broader AI Principles. Google has also made SynthID detection tools available to selected third parties to support the broader ecosystem's ability to identify AI-generated media.[^7]

## How does Veo compare with competitors?

The AI video generation landscape has grown increasingly competitive since 2024, with multiple well-funded companies releasing capable models. The following table compares Veo 3.1 with several prominent alternatives as of late 2025.

| Feature | Veo 3.1 (Google) | [Sora](/wiki/sora) 2 (OpenAI) | [Runway Gen-4](/wiki/runway_gen_4) (Runway) | Pika 2.2 | [Kling](/wiki/kling) 2.6 (Kuaishou) |
|---|---|---|---|---|---|
| Developer | Google DeepMind | OpenAI | Runway | Pika Labs | Kuaishou |
| Max resolution | 4K | 1080p | 4K (upscaled) | 1080p | 1080p |
| Base clip duration | 8 seconds | Up to 20 seconds | Up to 10 seconds | Up to 10 seconds | 5-10 seconds |
| Extended duration | 1+ minute (scene extension) | 20 seconds | Extendable in 8 s increments | Limited | Up to 3 minutes (extension) |
| Native audio | Yes (dialogue, SFX, ambient) | Yes | No | No | Yes (since v2.6) |
| Text-to-video | Yes | Yes | Yes | Yes | Yes |
| Image-to-video | Yes | Yes | Yes | Yes | Yes |
| Reference images | Up to 3 | No | First/last frame | First/last frame | No |
| Camera controls | Yes | Limited | Yes (advanced) | Limited | Yes (motion brush) |
| API access | Gemini API, Vertex AI | OpenAI API | Runway API | Pika API | Kling API |
| Consumer pricing (from) | $19.99/mo | $20/mo | $12/mo | $8/mo | $10/mo |
| API cost (per second) | $0.15-$0.60 | $0.10-$0.50 | Credit-based | Credit-based | Credit-based |
| AI watermark | SynthID | C2PA metadata | C2PA metadata | Watermark (free tier) | Watermark (free tier) |

In Google's internal benchmarks conducted in December 2024 using 1,003 prompts from Meta's MovieGenBench dataset, human evaluators preferred Veo 2 over Sora Turbo, Meta Movie Gen, Kling v1.5, and MiniMax for both overall quality and prompt adherence.[^3][^13] Independent community evaluations, such as the Artificial Analysis Video Arena, have ranked Veo models competitively, though relative rankings can shift rapidly as all providers release frequent updates.

### How does Veo differ from Sora?

Veo and [Sora](/wiki/sora) are the two most prominent text-to-video systems from frontier AI labs, and they differ on several axes. Veo's distinguishing strengths are native synchronized audio (introduced in Veo 3, May 2025), output up to 4K resolution, reference-image conditioning of up to three images, and per-frame [SynthID](/wiki/synthid) watermarking.[^4][^5][^7] OpenAI's Sora and Sora 2 also generate audio and support longer single clips (up to roughly 20 seconds), and tag outputs with C2PA metadata rather than an in-pixel watermark. In Google's own December 2024 evaluation on the MovieGenBench prompt set, human raters preferred Veo 2 over Sora Turbo on both overall quality and prompt adherence, although these were vendor-run tests and independent rankings vary.[^3][^13]

## Limitations and controversies

### Known limitations

Despite its capabilities, Veo has several documented limitations as of late 2025:

- **Duration constraints**: base generation clips remain limited to approximately 8 seconds; longer videos require iterative scene extension that can introduce subtle discontinuities
- **Temporal consistency over long durations**: very long extended videos can exhibit gradual drift in character appearance, clothing details, or background elements
- **Fine-grained control**: precise frame-by-frame control over character positioning, timing, and object placement remains limited compared with traditional video editing
- **Text rendering**: like virtually all current video generation models, Veo can struggle with rendering legible text, signs, or written content within frames
- **Cost**: at $0.40-$0.60 per second for standard-quality output, generating a minute of video through the API can cost $24-$36, prohibitive for many high-volume use cases

### Subtitle artifacts

In July 2025, *MIT Technology Review* reported that Veo 3 added garbled, nonsensical subtitles to generated videos even when users explicitly requested no captions, affecting up to 40 percent of dialogue scenes.[^28] The root cause was attributed to training on YouTube videos, vlogs, and TikTok content that contained embedded subtitles, leading the model to "learn" that captions enhance similarity to human-created videos. The problem persisted more than a month after Google announced fixes on June 9, 2025.

### Training-data controversy

In June 2025, *CNBC* reported that Google had used its catalog of YouTube videos, estimated at 20 billion videos, to train Veo 3 and other Gemini-family models.[^29] Multiple leading creators and intellectual-property professionals told CNBC they had not been informed that their content could be used in this way. Google noted that its terms of service permit using YouTube content to improve "the product experience … including through machine learning and AI applications," but users have no opt-out mechanism. Even using one percent of YouTube would amount to roughly 2.3 billion minutes of training data, 40 times the volume reportedly used by some competing AI models. Google offers indemnification for users facing copyright challenges over content generated with Veo.[^29]

### Misinformation and deepfake concerns

Veo 3's realism, combined with its native audio generation, fueled rapid concerns about misinformation. *Time* magazine reported that Veo 3 could generate plausible deepfakes of riots, election fraud, and conflict.[^20] In one notable incident, Philippine officials reportedly shared a Veo 3-generated street-interview video to support Vice President Sara Duterte during impeachment proceedings, illustrating real-world political misuse.[^30]

In July 2025, Media Matters for America reported that racist and antisemitic videos generated using Veo 3 were being widely uploaded to TikTok.[^31] Ars Technica's Ryan Whitwam observed that "vagueness in the prompt and the AI's inability to understand the subtleties of racist tropes (i.e., the use of monkeys instead of humans in some videos) make it easy to skirt the rules."

### Quality concerns at the low end

A *Gizmodo* report noted that early users frequently directed Veo 3 toward low-quality content, including fake "man on the street" interviews, low-effort haul videos, and repetitive jokes, raising questions about the social value of such ultra-cheap video at scale.

## Use cases

Veo has found applications across a range of creative and professional domains:

- **Content creation**: short-form video for YouTube Shorts, Instagram Reels, and TikTok
- **Filmmaking and pre-visualization**: directors and producers use Flow to prototype scenes, test camera angles, explore lighting, and visualize narrative sequences before physical production
- **Advertising and marketing**: draft video ads, product demos, and concept videos for client presentations
- **Education**: explanatory videos, historical recreations, and visualizations of scientific concepts
- **Prototyping and product design**: mock-up videos to visualize user experiences, app flows, or physical product concepts
- **Accessibility**: lowers the barrier to video creation for individuals and small teams that lack traditional video production resources
- **Gaming and entertainment**: game developers and interactive media creators generate concept art in motion, cutscene prototypes, and environmental visualizations

## How much does Veo cost? (Pricing as of late 2025)

### Gemini API (developer pricing)

| Model | 720p/1080p (per second) | 4K (per second) |
|---|---|---|
| Veo 2 | $0.35 | N/A |
| Veo 3 | $0.40 | $0.60 |
| Veo 3 Fast | $0.15 | $0.35 |
| Veo 3.1 | $0.40 | $0.60 |
| Veo 3.1 Fast | $0.15 | $0.35 |

Charges apply only when videos are successfully generated.[^21] There is no free tier for Veo video generation on the Gemini API. For Veo 3 and later models, the per-second price includes both video and audio output.
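
Because billing is per second, the cost of a clip is just the table rate multiplied by its length. The sketch below transcribes the table above into a lookup and reproduces the one-minute cost range quoted in the limitations section. The dictionary keys are this article's display names, not API model identifiers, and `estimate_cost` is a hypothetical helper written for illustration.

```python
from decimal import Decimal

# Per-second prices in US dollars, transcribed from the table above.
PRICE_PER_SECOND = {
    "Veo 2": {"720p/1080p": Decimal("0.35")},
    "Veo 3": {"720p/1080p": Decimal("0.40"), "4K": Decimal("0.60")},
    "Veo 3 Fast": {"720p/1080p": Decimal("0.15"), "4K": Decimal("0.35")},
    "Veo 3.1": {"720p/1080p": Decimal("0.40"), "4K": Decimal("0.60")},
    "Veo 3.1 Fast": {"720p/1080p": Decimal("0.15"), "4K": Decimal("0.35")},
}


def estimate_cost(model, tier, seconds):
    """Estimated charge for `seconds` of successfully generated video."""
    try:
        rate = PRICE_PER_SECOND[model][tier]
    except KeyError:
        raise ValueError(f"no listed price for {model} at {tier}") from None
    return rate * seconds


print(estimate_cost("Veo 3.1", "720p/1080p", 60))      # 24.00
print(estimate_cost("Veo 3.1", "4K", 60))              # 36.00
print(estimate_cost("Veo 3.1 Fast", "720p/1080p", 8))  # 1.20
```

`Decimal` is used instead of floating point so that currency amounts stay exact; the figures are only as current as the table they were copied from.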

### Consumer subscriptions

| Plan | Monthly price | Veo access | AI credits |
|---|---|---|---|
| Google AI Pro | $19.99 | Veo 3.1 Fast (up to 3 per day in Gemini app); limited Flow access | 1,000/month |
| Google AI Ultra | $249.99 | Veo 3.1 (highest tier); full Flow access | 25,000/month |

### Vertex AI (enterprise)

Vertex AI pricing for Veo 2 is $0.50 per second of generated video. Veo 3 pricing on Vertex AI was initially set at $0.75 per second at launch in May 2025 and was reduced to $0.40 per second in September 2025.[^6] Enterprise customers may negotiate custom pricing through Google Cloud sales.

## When was each Veo version released? (Timeline)

| Date | Event |
|---|---|
| May 14, 2024 | Veo 1 announced at Google I/O 2024 by Demis Hassabis and Douglas Eck; VideoFX launched in Google Labs with waitlist access |
| December 16, 2024 | Veo 2 announced with 4K resolution, improved physics understanding, and benchmark wins against Sora Turbo and other models |
| April 2025 | Veo 2 made available to Gemini Advanced subscribers in the Gemini app |
| May 20, 2025 | Veo 3 announced at Google I/O 2025 with native audio generation; Flow filmmaking tool introduced |
| June 19, 2025 | CNBC reports that Veo 3 was trained on YouTube videos, drawing creator concerns |
| July 10, 2025 | Google reports over 40 million Veo 3 videos generated across the Gemini app and Flow; photo-to-video launches in the Gemini app |
| July 15, 2025 | *MIT Technology Review* documents Veo 3's persistent "garbled subtitles" problem |
| September 2025 | Veo 3 pricing on Vertex AI reduced from $0.75 to $0.40 per second; Veo 3 Fast reaches GA |
| October 15, 2025 | Veo 3.1 released in paid preview with reference-image support, multi-person dialogue, scene extension, and vertical-video support |
| October 2025 | Google reports over 275 million videos generated through Flow |

## Legacy and current status

As of June 2026, Veo 3.1 remains the most recent publicly released Veo model, and Google DeepMind's official Veo page still lists it as the company's leading video generation model.[^1] Google has not officially announced Veo 4. Industry observers had considered Google I/O 2026 (held in May 2026) a likely venue, because Veo 1 and Veo 3 were both unveiled at I/O. Until Google publishes an official announcement, no Veo 4 capabilities, pricing, or release date should be considered confirmed.

Veo's two-year arc, from a waitlist-only Labs experiment in May 2024 to a full creative platform with native audio, reference-image conditioning, multi-platform availability, and billions of frames watermarked through SynthID, illustrates how rapidly generative video has matured. The technology has also surfaced acute challenges around training data provenance, misinformation, deepfakes, and the unresolved economic relationship between AI labs and the creators whose work feeds these models, debates that are likely to define the next phase of the generative-video industry.

## See also

- [Veo 2](/wiki/veo_2)
- [Veo 3](/wiki/veo_3)
- [Veo 3.1](/wiki/veo_3_1)
- [Sora](/wiki/sora)
- [Runway Gen-3](/wiki/runway_gen_3)
- [Runway Gen-4](/wiki/runway_gen_4)
- [Kling](/wiki/kling)
- [Flow](/wiki/flow)
- [Gemini app](/wiki/gemini_app)
- [Vertex AI](/wiki/vertex_ai)
- [SynthID](/wiki/synthid)
- [Imagen Video](/wiki/imagen_video)
- [Lumiere](/wiki/lumiere)
- [Diffusion model](/wiki/diffusion_model)
- [Text-to-video](/wiki/text_to_video)
- [Google DeepMind](/wiki/google_deepmind)

## References

[^1]: Google DeepMind. "Veo." https://deepmind.google/models/veo/
[^2]: TechCrunch. "Google Veo, a serious swing at AI-generated video, debuts at Google I/O 2024." May 14, 2024. https://techcrunch.com/2024/05/14/google-veo-a-serious-swing-at-ai-generated-video-debuts-at-google-io-2024/
[^3]: Google Blog. "State-of-the-art video and image generation with Veo 2 and Imagen 3." December 16, 2024. https://blog.google/technology/google-labs/video-image-generation-update-december-2024/
[^4]: Google Cloud Blog. "Announcing Veo 3, Imagen 4, and Lyria 2 on Vertex AI." May 20, 2025. https://cloud.google.com/blog/products/ai-machine-learning/announcing-veo-3-imagen-4-and-lyria-2-on-vertex-ai
[^5]: Google Developers Blog. "Introducing Veo 3.1 and new creative capabilities in the Gemini API." October 15, 2025. https://developers.googleblog.com/introducing-veo-3-1-and-new-creative-capabilities-in-the-gemini-api/
[^6]: Google Cloud. "Vertex AI generative AI pricing." https://cloud.google.com/vertex-ai/generative-ai/pricing
[^7]: Google DeepMind. "Watermarking AI-generated text and video with SynthID." https://deepmind.google/blog/watermarking-ai-generated-text-and-video-with-synthid/
[^8]: Blockchain.news. "Google Veo 3 AI Video Generator Surpasses 40 Million Videos, Introduces Photo-to-Video Feature in Gemini App." July 11, 2025. https://blockchain.news/ainews/google-veo-3-ai-video-generator-surpasses-40-million-videos-introduces-photo-to-video-feature-in-gemini-app
[^9]: Wikipedia. "Veo (text-to-video model)." https://en.wikipedia.org/wiki/Veo_(text-to-video_model)
[^10]: Google DeepMind. "Veo model card." https://deepmind.google/models/veo/
[^11]: Google Research. "Imagen Video: High-definition video generation with diffusion models." October 2022. https://imagen.research.google/video/
[^12]: Google Research. "Lumiere: A Space-Time Diffusion Model for Video Generation." January 2024. https://lumiere-video.github.io/
[^13]: Fortune. "Google DeepMind's new Veo 2 AI video generator trounces OpenAI's Sora with 4K resolution." December 16, 2024. https://fortune.com/2024/12/16/google-deepmind-veo-2-ai-video-generator-4k-openai-sora/
[^14]: The Decoder. "Google's Veo 2 outperforms OpenAI's Sora Turbo in head-to-head AI video generation tests." December 2024. https://the-decoder.com/googles-veo-2-outperforms-openais-sora-turbo-in-head-to-head-ai-video-generation-tests/
[^15]: 9to5Google. "Google announces Veo 2 video generation model, expanding VideoFX access." December 16, 2024. https://9to5google.com/2024/12/16/google-veo-2/
[^16]: Business Standard. "Google I/O 2025: All AI products announced last night, from Beam to Veo 3." May 21, 2025. https://www.business-standard.com/technology/tech-news/google-io-2025-ai-launches-beam-veo3-imagen4-gemini-updates-125052100158_1.html
[^17]: Google Developers Blog. "Build with Veo 3, now available in the Gemini API." https://developers.googleblog.com/en/veo-3-now-available-gemini-api/
[^18]: The Tech Outlook. "Google Introduces Veo 3 With Native Audio Generation; Deep Think Mode Brought to Gemini 2.5 Pro." May 21, 2025. https://www.thetechoutlook.com/new-release/google-introduces-veo-3-with-native-audio-generation-deep-think-mode-brought-to-gemini-2-5-pro-more-updates-from-google-i-o-2025-keynote/
[^19]: BizzBuzz. "Google Launches Veo 3 AI Model, Sparking Viral Videos and Misinformation Fears." May 2025. https://www.bizzbuzz.news/technology/google-launches-veo-3-ai-model-sparking-viral-videos-and-misinformation-fears-1362803
[^20]: Time. "Google's Veo 3 Can Make Deepfakes of Riots, Election Fraud, Conflict." May 2025. https://time.com/7290050/veo-3-google-misinformation-deepfake/
[^21]: Google. "Gemini Developer API Pricing." https://ai.google.dev/gemini-api/docs/pricing
[^22]: Google Blog. "Introducing Flow: Google's AI filmmaking tool designed for Veo." May 20, 2025. https://blog.google/innovation-and-ai/products/google-flow-veo-ai-filmmaking-tool/
[^23]: Google Blog. "Bringing new Veo 3.1 updates into Flow to edit AI video." October 15, 2025. https://blog.google/technology/ai/veo-updates-flow/
[^24]: AICloudIT. "Google Veo 3.1 Update (Oct 2025): Full Breakdown of What's New." https://www.aicloudit.com/blog/ai/google-veo-3-1-complete-guide-ai-video-model/
[^25]: Lumiere. "A Space-Time Diffusion Model for Video Generation." Google Research, January 2024. https://lumiere-video.github.io/
[^26]: Google Flow. "Flow product page." https://labs.google/flow/about
[^27]: Google DeepMind. "SynthID." https://deepmind.google/models/synthid/
[^28]: MIT Technology Review. "Google's generative video model Veo 3 has a subtitles problem." July 15, 2025. https://www.technologyreview.com/2025/07/15/1120156/googles-generative-video-model-veo-3-has-a-subtitles-problem/
[^29]: CNBC. "Google is using YouTube videos to train its Gemini, Veo 3 AI models." June 19, 2025. https://www.cnbc.com/2025/06/19/google-youtube-ai-training-veo-3.html
[^30]: AI Incident Database. "Incident 1128: Philippine Officials Reportedly Share Veo 3-Generated Video to Support Vice President Sara Duterte During Impeachment." https://incidentdatabase.ai/cite/1128/
[^31]: Ars Technica / Media Matters reporting. As referenced in Wikipedia's "Veo (text-to-video model)" article. https://en.wikipedia.org/wiki/Veo_(text-to-video_model)
[^32]: TechCrunch. "Veo 3 can generate videos, and soundtracks to go along with them." May 20, 2025. https://techcrunch.com/2025/05/20/googles-veo-3-can-generate-videos-and-soundtracks-to-go-along-with-them/
[^33]: Google Blog. "Introducing Gemini with photo to video capability." July 10, 2025. https://blog.google/products-and-platforms/products/gemini/photo-to-video/

