Gemini 2.5 Flash
Last reviewed
May 18, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v2 ยท 4,939 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 18, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v2 ยท 4,939 words
Add missing citations, update stale details, or suggest a clearer explanation.
Gemini 2.5 Flash is a fast, cost-optimized multimodal large language model developed by Google DeepMind and the mid-tier member of the Gemini 2.5 family. It was first released as a public preview on April 17, 2025 under the identifier gemini-2.5-flash-preview-04-17, then reached general availability on June 17, 2025 as gemini-2.5-flash. It was the first Flash-class Gemini with built-in reasoning, and Google described it as the company's first "fully hybrid reasoning model," meaning developers can turn the internal chain of thought on or off and set an explicit thinking budget that caps how many tokens the model spends deliberating before answering.[^1]
The model carries the same 1 million token context window, native multimodal input set (text, image, audio, video, PDF), and tool-use surface as Gemini 2.5 Pro, with a smaller and faster expert routing configuration to keep latency and per-token pricing low. At preview, Google priced it at $0.15 per million input tokens, $0.60 per million output tokens with thinking disabled, and $3.50 per million output tokens with thinking enabled, making it one of the cheapest reasoning models in the market. At GA, Google unified output pricing at $2.50 per million tokens regardless of thinking mode, with input pricing rising to $0.30 per million for text, image, and video.[^5]
Gemini 2.5 Flash sat at the center of Google's 2025 model strategy. It became the default in the Gemini app after Google I/O on May 20, 2025, replacing Gemini 2.0 Flash, and on Vertex AI and the Gemini API it became Google's recommended workhorse for high-volume reasoning, agentic coding, retrieval-augmented generation, and chat backends. It powered AI Mode in Google Search and large parts of Workspace AI features for the remainder of 2025. The model was superseded as default by Gemini 3 Flash on December 17, 2025 but remained available through 2026 as a price-performance option below the Gemini 3 line. Google has scheduled the gemini-2.5-flash endpoint for shutdown on October 16, 2026, with gemini-3-flash-preview listed as the recommended replacement.[^21]
Google has used the "Flash" label for the smaller, faster member of each Gemini generation since Gemini 1.5 Flash arrived in May 2024 as a distillation of Gemini 1.5 Pro optimized for high-throughput summarization, classification, and chat. Gemini 2.0 Flash followed in December 2024, adding native tool use including Google Search grounding, a code-execution sandbox, and real-time streaming through the Multimodal Live API at roughly twice the throughput of Gemini 1.5 Pro. Two days later Google released the experimental Gemini 2.0 Flash Thinking, which spent extra inference compute on internal deliberation and foreshadowed the architectural direction of the 2.5 family.
The 2.5 generation was unveiled on March 25, 2025 with Gemini 2.5 Pro Experimental. Google said that "all of our models going forward will be thinking models" and described the 2.5 line as integrating reasoning into the default response path.[^2] Pro went first; Flash followed three weeks later. The competitive context in April 2025 included Anthropic's Claude 3.7 Sonnet (February 24, 2025), which had introduced configurable extended thinking; OpenAI's o4-mini, a small reasoning model in the same price band; and DeepSeek R1, the open-weight reasoner from late January.
Gemini 2.5 Flash was first announced on April 9, 2025 at Google Cloud Next in Las Vegas as a low-latency model with built-in thinking capabilities. Public preview opened eight days later, on April 17, 2025, under the identifier gemini-2.5-flash-preview-04-17 through Google AI Studio, the Gemini API, and Vertex AI. The accompanying Google Developers Blog post, by Tulsee Doshi, framed the model as Google's "first fully hybrid reasoning model," with the ability to turn thinking on or off and to set thinking budgets that trade quality, cost, and latency.[^1]
Google's launch claims were threefold: the model ranked second on the LMArena Hard Prompts leaderboard at an Elo of approximately 1,392, behind only Gemini 2.5 Pro; it used the same multimodal input set, function calling, and 1 million token context window as Pro at a fraction of the price; and the thinking budget exposed a 0 to 24,576 token range of internal reasoning.[^1] VentureBeat's coverage focused on the cost split: $3.50 per million output tokens with thinking on, $0.60 with thinking off, headlined as cutting costs by 600 percent when turned down.[^10]
Google followed with an updated checkpoint at I/O on May 20, 2025. gemini-2.5-flash-preview-05-20 became the default in the Gemini app the same day, with Google reporting improvements across reasoning, multimodality, code, and long-context benchmarks while using 20 to 30 percent fewer tokens.[^3] The model reached general availability on June 17, 2025 as gemini-2.5-flash, alongside Gemini 2.5 Pro's GA and the preview of Gemini 2.5 Flash-Lite.[^4][^17] The GA release emphasized supervised fine-tuning (SFT) availability, structured output support, and unified output pricing, and the April 17 preview endpoint was deprecated on July 15, 2025.
Google shipped several variants and snapshots of Gemini 2.5 Flash through 2025 and into 2026.
| Identifier | Status | Key date | Notes |
|---|---|---|---|
gemini-2.5-flash-preview-04-17 | Preview (retired) | April 17, 2025 | First public release, hybrid reasoning, thinking budget 0 to 24,576 |
gemini-2.5-flash-preview-05-20 | Preview (retired) | May 20, 2025 | Google I/O refresh, 20 to 30 percent fewer tokens per task |
gemini-2.5-flash | General availability | June 17, 2025 | Stable production endpoint, unified output pricing, January 2025 knowledge cutoff[^13] |
gemini-2.5-flash-lite | Sibling preview, then GA | June 17, 2025 (preview), July 22, 2025 (stable) | Smaller and cheaper Flash-class model for translation, classification, high-volume tasks[^4][^17] |
gemini-2.5-flash-preview-09-2025 | Preview (deprecated) | September 25, 2025 | Better tool use, 24 percent fewer output tokens, +5 points on SWE-bench Verified[^7] |
gemini-flash-latest | Floating alias | September 25, 2025 | Pointer to the current Flash preview, later migrated to Gemini 3[^7] |
gemini-2.5-flash-native-audio (Live API) | Preview, then upgraded | May 20, 2025 announce; late 2025 upgrade | Native audio dialogue via the Live API, 30+ HD voices in 24 languages, 70-language understanding[^21][^22][^23] |
gemini-2.5-flash-image (Nano Banana) | Preview, then stable | August 26, 2025 | Image generation and editing model with character consistency, $30 per million output tokens (~$0.039 per image)[^24][^25] |
gemini-2.5-flash for TTS | Preview | 2025 | Text-to-speech with fine control over style and pacing[^13] |
The September 25, 2025 preview was the last major iteration of the text model before Gemini 3 took over, lifting SWE-bench Verified from 48.9 percent to 54.0 percent and cutting output tokens by 24 percent.[^7] Google subsequently published a deprecation schedule listing October 16, 2026 as the earliest possible shutdown date for the gemini-2.5-flash endpoint, with gemini-3-flash-preview named as the recommended replacement; the Live API preview gemini-live-2.5-flash-preview was retired earlier, on December 9, 2025, in favor of gemini-3.1-flash-live-preview.[^21]
Google DeepMind has not published the parameter count or training corpus of Gemini 2.5 Flash. The technical report "Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context" describes the model qualitatively rather than in architectural detail.[^8] Gemini 2.5 Flash is a decoder-only Transformer built on the same architectural family as Gemini 1.5 and 2.0. It uses a Mixture-of-Experts (MoE) routing scheme inherited from Gemini 1.5, where only a subset of expert sub-networks is activated for each input token, placing the active-parameter count at a small fraction of Gemini 2.5 Pro's. The model is natively multimodal: text, image, audio, video, and PDF inputs are processed by per-modality encoders that emit tokens in a shared representation, then routed through a single Transformer stack.
The defining 2.5 change is the integration of reasoning into the default response path, but Flash exposes that integration as an opt-in via the thinking budget parameter. With the budget at zero the model behaves like a fast, non-reasoning model; with a positive value it produces an internal chain of thought of up to that many tokens before its visible response. The default at GA was "dynamic thinking," where the model decides for itself how much to spend. The maximum budget was 24,576 tokens, smaller than the 32,000 token ceiling on Gemini 2.5 Pro. Reasoning was trained through supervised fine-tuning on chain-of-thought traces combined with reinforcement learning from preference data. Like Pro, Flash returns inspectable thought summaries (a redacted form of internal reasoning) rather than the full raw chain-of-thought transcript, and it supports the same tool-use surfaces: function calling, structured outputs, batch processing, context caching, Google Search grounding, URL context, and a code-execution sandbox.
The stable gemini-2.5-flash text endpoint accepts up to 1,048,576 input tokens and emits up to 65,536 output tokens per response, with a January 2025 knowledge cutoff.[^13] Supported features on the stable endpoint include function calling, structured outputs, code execution, Google Search and Google Maps grounding, URL context, context caching, the Batch API, file search, flex inference, priority inference, and supervised fine-tuning, with image and audio generation and the Live API exposed only through dedicated sibling endpoints (gemini-2.5-flash-image and the Live API variants).[^13]
The hybrid reasoning design is Gemini 2.5 Flash's most distinctive feature. The thinking_budget parameter has three behaviors: zero disables thinking entirely; a positive integer caps internal reasoning tokens at that value; the default lets the model decide. The model is trained to stop early when the prompt does not require deeper reasoning, so the 24,576 ceiling does not commit it to spending that many tokens per request.[^1] A single endpoint can therefore serve both reasoning-light traffic (translation, classification, formatting) and reasoning-heavy traffic (multi-step math, complex code, document analysis) without forcing the developer to swap models. Simon Willison called the pattern a "poor person's mixture of models" that achieves cost efficiency by varying inference compute per request, and in his test prompts he measured costs ranging from 0.1025 cents (11 in / 1,705 out, thinking off) to 2.87 cents (5,174 out + 3,023 thinking) on the preview pricing.[^9]
Gemini 2.5 Flash accepts text, images, audio, video, and PDFs in the same prompt with capacity limits similar to Gemini 2.5 Pro: up to 3,000 images per prompt, roughly 45 minutes of video with audio (or one hour without), about 8.4 hours of audio, and PDFs up to 1,000 pages or 50 MB. Output is text by default, with native audio output available through the Live API variant. On MMMU the May I/O snapshot scored 79.7 percent and on Global MMLU Lite it scored 88.4 percent.[^3] Independent benchmarkers placed the model in the upper third of the Artificial Analysis Intelligence Index, with output speeds above 200 tokens per second on the non-reasoning configuration.[^11]
Flash was not positioned as the leading coding model in the 2.5 family, since Gemini 2.5 Pro received the bulk of Google's coding-specific reinforcement learning. Still, SWE-bench Verified climbed from 48.9 percent on the May I/O preview to 54.0 percent on the September 25 refresh, and Replit, Cursor, and similar developer tools adopted Flash as the lower-cost coding option below Pro.[^7] An early tester from Manus AI publicly reported a "15 percent leap in performance for long-horizon agentic tasks" on the September preview.[^7] The 1 million token context window holds roughly 750,000 words of English text, several hours of audio, or a multi-thousand-line codebase, with MRCR long-context retrieval in the high 80s percent band at 128k tokens. Function calling, structured outputs (JSON schema enforcement), batch processing, context caching, Google Search grounding, URL context, and a code-execution sandbox are all available, and the model supports the Model Context Protocol (MCP) standard that became important in the second half of 2025.
A separate sibling model, exposed through the Gemini Live API, handles real-time voice and video conversations on top of the Gemini 2.5 Flash backbone. The native-audio variant, announced in preview at Google I/O 2025, generates speech directly from the model rather than chaining a text response to an external text-to-speech system. Google's documentation lists more than 30 HD voices, output across 24 languages, and understanding of approximately 70 languages, with audio I/O at 16 kHz raw 16-bit PCM input and 24 kHz output.[^22][^23] Distinctive Live API features include affective dialogue, in which the model adapts its tone and style to the user's emotional cues; proactive audio, which lets developers control when and in what contexts the model speaks; improved barge-in handling for interruptions in noisy environments; and live streaming speech-to-speech translation with style transfer that preserves intonation, pacing, and pitch. A later upgrade to the native-audio model, published in the last quarter of 2025, raised the ComplexFuncBench Audio score to 71.5 percent and the instruction-following score to about 90 percent (up from 84 percent), and added 2,000+ language pairs for live translation.[^22] Google rolled the upgraded model out to Google AI Studio, Vertex AI, Gemini Live, and Search Live.[^22]
Gemini 2.5 Flash Image, internally codenamed "Nano Banana" during anonymous testing on LMArena, launched as a separate model on August 26, 2025.[^24] The model targets image generation and natural-language image editing, with Google highlighting three capabilities: character consistency across scenes (placing the same subject in different environments while preserving identity), targeted local edits with natural language prompts (changing poses, removing background elements, colorizing photos), and image grounding that draws on Gemini's broader world knowledge for accurate composition. The Gemini API priced output at $30.00 per million tokens, which works out to approximately $0.039 per generated image at 1,290 output tokens per image.[^24] The stable gemini-2.5-flash-image endpoint accepts up to 65,536 input tokens and emits up to 32,768 output tokens, with a June 2025 knowledge cutoff and support for batch, caching, flex, and priority inference; it does not support function calling, the Live API, code execution, or audio.[^25] The gemini-2.5-flash-image-preview identifier was deprecated and scheduled for shutdown on January 15, 2026, with gemini-2.5-flash-image as the recommended successor.[^21]
The table below summarizes the public benchmark numbers Google reported for Gemini 2.5 Flash across the preview and GA period, with comparison columns for Gemini 2.5 Pro and Gemini 2.0 Flash where the same benchmark was run.
| Benchmark | Gemini 2.5 Flash | Gemini 2.5 Pro | Gemini 2.0 Flash |
|---|---|---|---|
| LMArena Elo (Hard Prompts) | ~1,392 | ~1,400+ | ~1,290 |
| AIME 2024 | 88.0% | 92.0% | 36.7% |
| AIME 2025 (Sept refresh) | 72.0% | 86.7% | n/a |
| GPQA Diamond | 82.8% | 84.0% | 60.1% |
| Global MMLU Lite | 88.4% | 89.8% | 83.4% |
| MMMU | 79.7% | 81.7% | 70.7% |
| MMMU Pro (Sept refresh) | 66.7% | 75.4% | n/a |
| SWE-bench Verified | 48.9% (May) to 54.0% (Sept) | 63.8% (June GA) | 31.5% |
| LiveCodeBench Pro Elo | 1,143 (Sept) | n/a | n/a |
| FACTS Grounding | 85.3% | 84.1% | n/a |
| Humanity's Last Exam (no tools) | ~5% | 18.8% | n/a |
A few comparisons stand out. On AIME 2024 the Flash 88.0 percent sat within four points of Gemini 2.5 Pro, an unusually small gap for a Flash-class model. On GPQA Diamond the gap was just 1.2 points. The headline weakness was Humanity's Last Exam, where Flash trailed Pro by roughly 14 points without tools. On the Artificial Analysis Intelligence Index, the non-reasoning configuration scored 21 against a category median of 17, ranked 31st of 84 models tracked at the time, with output speed of 215.4 tokens per second (4th overall) and time to first token of 0.65 seconds against a median of 1.87 seconds. The blended cost of about $0.85 per million tokens at a 3:1 input-to-output ratio placed Flash 53rd for input pricing and 80th for output pricing, in the same cost band as OpenAI's GPT-4o-mini and below Claude 3.5 Haiku. Artificial Analysis also noted unusually high verbosity, with Flash producing 17 million output tokens against a category median of 9.1 million during evaluation.[^11]
Gemini 2.5 Flash pricing changed twice during the model's lifetime. The April 17 preview introduced a split between thinking-enabled and thinking-disabled output pricing. The June 17 GA release unified output pricing at $2.50 per million tokens regardless of thinking mode and raised input pricing modestly. The table below summarizes both regimes, plus the discounted batch tier and the premium priority tier that Google added later in 2025.[^5]
| Period / Tier | Input (text, image, video) | Input (audio) | Output thinking off | Output thinking on |
|---|---|---|---|---|
| Preview (April 17 to June 17, 2025) | $0.15 / 1M | $1.00 / 1M | $0.60 / 1M | $3.50 / 1M |
| GA Standard (from June 17, 2025) | $0.30 / 1M | $1.00 / 1M | $2.50 / 1M | $2.50 / 1M |
| GA Batch / Flex | $0.15 / 1M | $0.50 / 1M | $1.25 / 1M | $1.25 / 1M |
| GA Priority | $0.54 / 1M | $1.80 / 1M | $4.50 / 1M | $4.50 / 1M |
Context caching at the GA tier was charged at $0.03 per million cached tokens for text, image, and video, $0.10 per million for audio, and $1.00 per million tokens per hour for cache storage, with a roughly 75 percent cache-hit discount that became one of the main cost levers for retrieval-augmented generation workloads.[^5] The pricing shift between preview and GA drew pushback from developers who had built their cost models around the $0.60 thinking-off output rate; Google's defense was that production users rarely kept all traffic on the thinking-off path. GA pricing put Gemini 2.5 Flash at roughly three times the input cost of Gemini 2.0 Flash ($0.10) and 6.25 times the output cost ($0.40), offset by Google's claim that the model used 20 to 30 percent fewer tokens for the same tasks.[^3] The September 25 refresh further improved efficiency, with a 24 percent reduction in output tokens.[^7]
On Vertex AI, the GA release on June 17, 2025 added supervised fine-tuning (SFT) as a generally available capability, letting enterprises adapt Gemini 2.5 Flash to custom datasets, industry-specific terminology, and bespoke brand voice.[^4] Google positioned Flash on Vertex AI as a roughly 1.5x speed-up over Gemini 2.0 Flash at a lower cost,[^4] and made it the recommended workhorse for large-scale summarization, responsive chat backends, and structured data extraction. The Vertex AI release added structured output enforcement with JSON schema, batch processing for asynchronous high-throughput jobs, and provisioned throughput options for predictable production traffic. The launch partner list included Spline, Rooms, Snap, SmartBear, Figma, Bridgewater Associates, Box, Geotab, Warp, Salesforce, Workday, and Harvey, with Workspace's document-aware chat in Gmail, Docs, and Drive using Flash for summarization and short-form generation.[^17] Within Google, Flash powered AI Mode in Google Search for queries that did not require Pro-level reasoning and served as the default fast subtask model in Project Astra.
Google announced Gemini 2.5 Flash-Lite as a preview alongside the June 17, 2025 GA release of Flash and Pro, with the stable version reaching general availability on July 22, 2025.[^4][^17] Flash-Lite shares the 1 million token context window and adjustable thinking mode of Flash, supports Google Search grounding and code execution, and accepts multimodal input, but is tuned for the high-volume, latency-sensitive end of the market: translation, classification, simple chat, and bulk text processing.[^17] The September 25, 2025 preview refresh delivered a roughly 50 percent reduction in output tokens for Flash-Lite (compared to 24 percent for Flash) along with better instruction following, reduced verbosity, and multimodal improvements; Google described the two models as targeting different points on the cost-quality curve, with Flash-Lite tilted toward conciseness and efficiency and Flash tilted toward tool use and multi-step complexity.[^7] The gemini-flash-lite-latest floating alias was introduced alongside gemini-flash-latest in the same release.[^7]
Third-party adoption began within days of the April 17 preview. Replit, Cursor, Codecademy, and Aider added the model as a selectable backend, with LangChain, LlamaIndex, and the Vercel AI SDK shipping first-class support within the first month and exposing the thinking budget as a configurable parameter. OpenRouter listed Gemini 2.5 Flash as one of its top-traffic endpoints by mid-summer. Inside Google, the May 20 I/O snapshot replaced Gemini 2.0 Flash as the default in the Gemini app, and AI Mode in Google Search ran on the 2.5 Flash line for queries that did not require Pro-level reasoning.[^18] Workspace's document-aware chat in Gmail, Docs, and Drive used Flash for summarization and short-form generation. Project Astra used Flash for the always-on assistant loop and switched to Pro for harder subtasks. Google's GA announcement named Spline, Rooms, Snap, SmartBear, Figma, Bridgewater Associates, Box, Geotab, Warp, Salesforce, Workday, and Harvey among the launch partners.[^17] Google stated in December 2025 that the Flash family collectively was processing "trillions of tokens" across "hundreds of thousands of apps built by millions of developers," and that Gemini API traffic crossed roughly one trillion tokens per day after the Gemini 3 launch.
VentureBeat's preview coverage ran with the "thinking budgets that cut AI costs by 600% when turned down" framing,[^10] and Simon Willison called it "the cheapest reasoning model from a major lab," pointing to the LMArena Hard Prompts placement as the most striking data point.[^9] The most common developer complaint was the unpredictability of the dynamic thinking default, where a simple request sometimes triggered a long internal deliberation that drove up the bill; the fix was the explicit thinking_budget cap, which became the recommended pattern by GA. Artificial Analysis later noted that the model produced unusually verbose responses (17 million tokens against a median of 9.1 million), which raised the effective per-task cost above the headline price.[^11] The June 17 GA release drew mixed reactions: enterprise reviewers liked the unified pricing and SFT availability, while developers who had built around the preview's $0.60 thinking-off output rate had to redo cost models.
The table below places Gemini 2.5 Flash alongside its closest contemporary peers in the fast, multimodal, cost-efficient tier of 2025 models.
| Model | Vendor | Released | Context | Reasoning mode | Multimodal input | Input price | Output price |
|---|---|---|---|---|---|---|---|
| Gemini 2.5 Flash (GA) | Google DeepMind | June 17, 2025 | 1M | Hybrid (thinking budget) | Text, image, audio, video, PDF | $0.30 / 1M | $2.50 / 1M |
| Gemini 2.5 Flash-Lite | Google DeepMind | July 22, 2025 | 1M | Hybrid (lower budget) | Text, image, audio, video, PDF | Lower than Flash | Lower than Flash |
| Gemini 2.0 Flash | Google DeepMind | December 2024 | 1M | None | Text, image, audio, video | $0.10 / 1M | $0.40 / 1M |
| Gemini 2.5 Pro | Google DeepMind | June 17, 2025 GA | 1M | Default thinking + Deep Think | Text, image, audio, video, PDF | $1.25 / 1M | $10.00 / 1M |
| Gemini 3 Flash | Google DeepMind | December 17, 2025 | 1M | Fast and Thinking modes | Text, image, audio, video, PDF | $0.50 / 1M | $3.00 / 1M |
| Claude 3.5 Haiku | Anthropic | November 2024 | 200k | None | Text, image | $0.80 / 1M | $4.00 / 1M |
| OpenAI o4-mini | OpenAI | April 16, 2025 | 200k | Always-on reasoning | Text, image | $1.10 / 1M | $4.40 / 1M |
| GPT-4.1 mini | OpenAI | April 14, 2025 | 1M | None | Text, image | $0.40 / 1M | $1.60 / 1M |
The most direct competitive contrast was with OpenAI's o4-mini, released the day before Gemini 2.5 Flash on April 16, 2025. Both targeted the same price-performance band, but o4-mini was always-on reasoning while Gemini 2.5 Flash was hybrid. Against Claude 3.5 Haiku, Gemini 2.5 Flash was cheaper on both input and output, had a much larger context window, and supported more modalities and reasoning. Against GPT-4.1 mini, Gemini 2.5 Flash was more expensive at the GA price but added the thinking budget and broader multimodal input. Inside Google's own lineup the relationship between Flash and Gemini 2.5 Pro settled into a clear split: Pro for hard reasoning, long-horizon agentic coding, and Deep Think workloads; Flash for high-volume chat, document analysis, RAG, multimodal Q&A, and lightweight agents; and Flash-Lite for the very high-volume, latency-sensitive tail of translation, classification, and simple chat.
Google described the safety work for Gemini 2.5 Flash under the same Frontier Safety Framework that covered Gemini 2.5 Pro. The company reported that the model had been evaluated against the then-current critical-capability thresholds for risks such as cyber-offense, autonomy, and biological uplift, and that no mitigations beyond standard release-time controls were required. The smaller model size and shorter thinking budget meant Flash was seen as a lower-risk release than Pro on most autonomy axes. Known limitations included the January 2025 knowledge cutoff (Google Search grounding is the recommended workaround for post-cutoff information),[^13] noticeable time-to-first-token delays on very long prompts, and the absence of the 2 million token context expansion announced for Gemini 2.5 Pro; the Flash ceiling remained 1 million tokens. The model can hallucinate, particularly when reasoning across multiple long documents, with fabricated quotes and confident citations of sources not present in the input among the most-discussed failure modes in mid-2025. Image, audio, and video generation are not native to the stable gemini-2.5-flash text endpoint; they are handled by sibling models such as Imagen 4, Veo 3, the Gemini 2.5 Flash Image variant, and the Live API native audio mode.[^13]
Gemini 3 Flash launched on December 17, 2025 and replaced Gemini 2.5 Flash as the default in the Gemini app and AI Mode in Google Search on launch day. The Gemini 3 Flash release was a substantial leap on benchmarks, with 90.4 percent on GPQA Diamond and 78 percent on SWE-bench Verified, while pricing rose modestly to $0.50 per million input and $3.00 per million output tokens. Gemini 2.5 Flash remained available on the Gemini API and Vertex AI after the Gemini 3 launch as a price-performance option, with gemini-flash-latest moving on to Gemini 3 Flash. Google's published deprecation schedule lists October 16, 2026 as the earliest possible shutdown date for gemini-2.5-flash, gemini-2.5-pro, and gemini-2.5-flash-lite, with gemini-3-flash-preview, gemini-3.1-pro-preview, and gemini-3.1-flash-lite respectively named as the recommended replacements; the Live API preview gemini-live-2.5-flash-preview was retired earlier, on December 9, 2025.[^21]
The lasting impact sat in three areas. First, it normalized the idea that a Flash-class model could be a real reasoning model rather than just a fast non-reasoner, a pattern most other labs followed through 2025 and 2026. Second, the hybrid reasoning design with an explicit thinking budget became a widely copied pattern at Anthropic, OpenAI, and several open-weight labs. Third, the unified production endpoint that could serve both reasoning and non-reasoning traffic through a single model identifier reduced the operational overhead of routing requests across multiple models. The accompanying sibling-model strategy, in which Google paired a single text backbone with dedicated Live API, image, and TTS endpoints under the same generation name, also became a template Google reused for the Gemini 3 family.