Gemini 2.5 Flash

27 min read

Updated Jul 23, 2026

Gemini 2.5 Flash is a fast, cost-optimized multimodal large language model developed by Google DeepMind and the mid-tier member of the Gemini 2.5 family. It pairs a 1 million token context window with a controllable "thinking budget" that lets developers turn the model's internal reasoning on or off and cap how many tokens it spends deliberating, making it one of the cheapest reasoning-capable models from a major lab. It was first released as a public preview on April 17, 2025 under the identifier gemini-2.5-flash-preview-04-17, then reached general availability on June 17, 2025 as gemini-2.5-flash. It was the first Flash-class Gemini with built-in reasoning, and Google described it as the company's first "fully hybrid reasoning model," meaning developers can turn the internal chain of thought on or off and set an explicit thinking budget that caps how many tokens the model spends deliberating before answering.^[1]

The model carries the same 1 million token context window, native multimodal input set (text, image, audio, video, PDF), and tool-use surface as Gemini 2.5 Pro, with a smaller and faster expert routing configuration to keep latency and per-token pricing low. At preview, Google priced it at $0.15 per million input tokens, $0.60 per million output tokens with thinking disabled, and $3.50 per million output tokens with thinking enabled, making it one of the cheapest reasoning models in the market. At GA, Google unified output pricing at $2.50 per million tokens regardless of thinking mode, with input pricing rising to $0.30 per million for text, image, and video.^[5]

Gemini 2.5 Flash sat at the center of Google's 2025 model strategy. It became the default in the Gemini app after Google I/O on May 20, 2025, replacing Gemini 2.0 Flash, and on Vertex AI and the Gemini API it became Google's recommended workhorse for high-volume reasoning, agentic coding, retrieval-augmented generation, and chat backends. It powered AI Mode in Google Search and large parts of Workspace AI features for the remainder of 2025. The model was superseded as default by Gemini 3 Flash on December 17, 2025 but remained available through 2026 as a price-performance option below the Gemini 3 line.^[26] Google has scheduled the gemini-2.5-flash endpoint for shutdown on October 16, 2026, with gemini-3.5-flash now listed as the recommended replacement.^[21]

What is Gemini 2.5 Flash?

Gemini 2.5 Flash is the fast, cost-efficient, hybrid-reasoning model in Google DeepMind's Gemini 2.5 lineup, positioned one rung below the flagship Gemini 2.5 Pro. "Hybrid reasoning" means a single model can run with its internal chain of thought switched off for cheap, low-latency answers or switched on (up to a developer-set token budget) for harder, multi-step problems, without swapping to a different model. It accepts text, images, audio, video, and PDFs, holds up to 1,048,576 input tokens of context, and emits up to 65,536 output tokens per response, with a January 2025 knowledge cutoff.^[13] In Google's words, it is the company's "first fully hybrid reasoning model, giving developers the ability to turn thinking on or off."^[1]

Background: where does the "Flash" name come from?

Google has used the "Flash" label for the smaller, faster member of each Gemini generation since Gemini 1.5 Flash arrived in May 2024 as a distillation of Gemini 1.5 Pro optimized for high-throughput summarization, classification, and chat. Gemini 2.0 Flash followed in December 2024, adding native tool use including Google Search grounding, a code-execution sandbox, and real-time streaming through the Multimodal Live API at roughly twice the throughput of Gemini 1.5 Pro. Two days later Google released the experimental Gemini 2.0 Flash Thinking, which spent extra inference compute on internal deliberation and foreshadowed the architectural direction of the 2.5 family.

The 2.5 generation was unveiled on March 25, 2025 with Gemini 2.5 Pro Experimental. Google said that "all of our models going forward will be thinking models" and described the 2.5 line as integrating reasoning into the default response path.^[2] Pro went first; Flash followed three weeks later. The competitive context in April 2025 included Anthropic's Claude 3.7 Sonnet (February 24, 2025), which had introduced configurable extended thinking; OpenAI's o4-mini, a small reasoning model in the same price band; and DeepSeek R1, the open-weight reasoner from late January.

When was Gemini 2.5 Flash released?

Gemini 2.5 Flash was first announced on April 9, 2025 at Google Cloud Next in Las Vegas as a low-latency model with built-in thinking capabilities. Public preview opened eight days later, on April 17, 2025, under the identifier gemini-2.5-flash-preview-04-17 through Google AI Studio, the Gemini API, and Vertex AI. The accompanying Google Developers Blog post, by Tulsee Doshi, framed the model as Google's "first fully hybrid reasoning model," with the ability to turn thinking on or off and to set thinking budgets that trade quality, cost, and latency.^[1]

Google's launch claims were threefold: the model ranked second on the LMArena Hard Prompts leaderboard at an Elo of approximately 1,392, behind only Gemini 2.5 Pro; it used the same multimodal input set, function calling, and 1 million token context window as Pro at a fraction of the price; and the thinking budget exposed a 0 to 24,576 token range of internal reasoning.^[1] VentureBeat's coverage focused on the cost split: $3.50 per million output tokens with thinking on, $0.60 with thinking off, headlined as cutting costs by 600 percent when turned down.^[10]

Google followed with an updated checkpoint at I/O on May 20, 2025. gemini-2.5-flash-preview-05-20 became the default in the Gemini app the same day, with Google reporting improvements across reasoning, multimodality, code, and long-context benchmarks while using 20 to 30 percent fewer tokens.^[3] The model reached general availability on June 17, 2025 as gemini-2.5-flash, alongside Gemini 2.5 Pro's GA and the preview of Gemini 2.5 Flash-Lite.^[4]^[17] The GA release emphasized supervised fine-tuning (SFT) availability, structured output support, and unified output pricing, and the April 17 preview endpoint was deprecated on July 15, 2025.

What are the Gemini 2.5 Flash variants and snapshots?

Google shipped several variants and snapshots of Gemini 2.5 Flash through 2025 and into 2026.

Identifier	Status	Key date	Notes
`gemini-2.5-flash-preview-04-17`	Preview (retired)	April 17, 2025	First public release, hybrid reasoning, thinking budget 0 to 24,576
`gemini-2.5-flash-preview-05-20`	Preview (retired)	May 20, 2025	Google I/O refresh, 20 to 30 percent fewer tokens per task
`gemini-2.5-flash`	General availability	June 17, 2025	Stable production endpoint, unified output pricing, January 2025 knowledge cutoff^[13]
`gemini-2.5-flash-lite`	Sibling preview, then GA	June 17, 2025 (preview), July 22, 2025 (stable)	Smaller and cheaper Flash-class model for translation, classification, high-volume tasks^[4]^[17]
`gemini-2.5-flash-preview-09-2025`	Preview (deprecated)	September 25, 2025	Better tool use, 24 percent fewer output tokens, +5 points on SWE-bench Verified^[7]
`gemini-flash-latest`	Floating alias	September 25, 2025	Pointer to the current Flash preview, later migrated to Gemini 3^[7]
`gemini-2.5-flash-native-audio` (Live API)	Preview, then upgraded	May 20, 2025 announce; late 2025 upgrade	Native audio dialogue via the Live API, 30+ HD voices in 24 languages, 70-language understanding^[21]^[22]^[23]
`gemini-2.5-flash-image` (Nano Banana)	Preview, then stable	August 26, 2025	Image generation and editing model with character consistency, $30 per million output tokens (~$0.039 per image)^[24]^[25]
`gemini-2.5-flash` for TTS	Preview	2025	Text-to-speech with fine control over style and pacing^[13]

The September 25, 2025 preview was the last major iteration of the text model before Gemini 3 took over, lifting SWE-bench Verified from 48.9 percent to 54.0 percent and cutting output tokens by 24 percent.^[7] Google subsequently published a deprecation schedule listing October 16, 2026 as the earliest possible shutdown date for the gemini-2.5-flash endpoint, with gemini-3.5-flash named as the recommended replacement; the Live API preview gemini-live-2.5-flash-preview was retired earlier, on December 9, 2025, in favor of gemini-3.1-flash-live-preview.^[21]

How is Gemini 2.5 Flash built?

Google DeepMind has not published the parameter count or training corpus of Gemini 2.5 Flash. The technical report "Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context" describes the model qualitatively rather than in architectural detail.^[8] Gemini 2.5 Flash is a decoder-only Transformer built on the same architectural family as Gemini 1.5 and 2.0. It uses a Mixture-of-Experts (MoE) routing scheme inherited from Gemini 1.5, where only a subset of expert sub-networks is activated for each input token, placing the active-parameter count at a small fraction of Gemini 2.5 Pro's. The model is natively multimodal: text, image, audio, video, and PDF inputs are processed by per-modality encoders that emit tokens in a shared representation, then routed through a single Transformer stack.

The defining 2.5 change is the integration of reasoning into the default response path, but Flash exposes that integration as an opt-in via the thinking budget parameter. With the budget at zero the model behaves like a fast, non-reasoning model; with a positive value it produces an internal chain of thought of up to that many tokens before its visible response. The default at GA was "dynamic thinking," where the model decides for itself how much to spend. The maximum budget was 24,576 tokens, smaller than the 32,000 token ceiling on Gemini 2.5 Pro. Reasoning was trained through supervised fine-tuning on chain-of-thought traces combined with reinforcement learning from preference data. Like Pro, Flash returns inspectable thought summaries (a redacted form of internal reasoning) rather than the full raw chain-of-thought transcript, and it supports the same tool-use surfaces: function calling, structured outputs, batch processing, context caching, Google Search grounding, URL context, and a code-execution sandbox.

The stable gemini-2.5-flash text endpoint accepts up to 1,048,576 input tokens and emits up to 65,536 output tokens per response, with a January 2025 knowledge cutoff.^[13] Supported features on the stable endpoint include function calling, structured outputs, code execution, Google Search and Google Maps grounding, URL context, context caching, the Batch API, file search, flex inference, priority inference, and supervised fine-tuning, with image and audio generation and the Live API exposed only through dedicated sibling endpoints (gemini-2.5-flash-image and the Live API variants).^[13]

What can Gemini 2.5 Flash do?

What is the thinking budget?

The hybrid reasoning design is Gemini 2.5 Flash's most distinctive feature. The thinking_budget parameter has three behaviors: zero disables thinking entirely; a positive integer caps internal reasoning tokens at that value; the default lets the model decide. Google's documentation states that for 2.5 Flash "the budget can range from 0 to 24576 tokens."^[1] The model is trained to stop early when the prompt does not require deeper reasoning, so the 24,576 ceiling does not commit it to spending that many tokens per request.^[1] A single endpoint can therefore serve both reasoning-light traffic (translation, classification, formatting) and reasoning-heavy traffic (multi-step math, complex code, document analysis) without forcing the developer to swap models. Simon Willison called the pattern a "poor person's mixture of models" that achieves cost efficiency by varying inference compute per request, and in his test prompts he measured costs ranging from 0.1025 cents (11 in / 1,705 out, thinking off) to 2.87 cents (5,174 out + 3,023 thinking) on the preview pricing.^[9]

How good is Gemini 2.5 Flash at multimodal understanding?

Gemini 2.5 Flash accepts text, images, audio, video, and PDFs in the same prompt with capacity limits similar to Gemini 2.5 Pro: up to 3,000 images per prompt, roughly 45 minutes of video with audio (or one hour without), about 8.4 hours of audio, and PDFs up to 1,000 pages or 50 MB. Output is text by default, with native audio output available through the Live API variant. On MMMU the May I/O snapshot scored 79.7 percent and on Global MMLU Lite it scored 88.4 percent.^[3] Independent benchmarkers placed the model in the upper third of the Artificial Analysis Intelligence Index, with output speeds above 200 tokens per second on the non-reasoning configuration.^[11]

How good is Gemini 2.5 Flash at coding, long context, and tool use?

Flash was not positioned as the leading coding model in the 2.5 family, since Gemini 2.5 Pro received the bulk of Google's coding-specific reinforcement learning. Still, SWE-bench Verified climbed from 48.9 percent on the May I/O preview to 54.0 percent on the September 25 refresh, and Replit, Cursor, and similar developer tools adopted Flash as the lower-cost coding option below Pro.^[7] An early tester from Manus AI publicly reported a "15 percent leap in performance for long-horizon agentic tasks" on the September preview.^[7] The 1 million token context window holds roughly 750,000 words of English text, several hours of audio, or a multi-thousand-line codebase, with MRCR long-context retrieval in the high 80s percent band at 128k tokens. Function calling, structured outputs (JSON schema enforcement), batch processing, context caching, Google Search grounding, URL context, and a code-execution sandbox are all available, and the model supports the Model Context Protocol (MCP) standard that became important in the second half of 2025.

What does Gemini 2.5 Flash native audio and the Live API do?

A separate sibling model, exposed through the Gemini Live API, handles real-time voice and video conversations on top of the Gemini 2.5 Flash backbone. The native-audio variant, announced in preview at Google I/O 2025, generates speech directly from the model rather than chaining a text response to an external text-to-speech system. Google's documentation lists more than 30 HD voices, output across 24 languages, and understanding of approximately 70 languages, with audio I/O at 16 kHz raw 16-bit PCM input and 24 kHz output.^[22]^[23] Distinctive Live API features include affective dialogue, in which the model adapts its tone and style to the user's emotional cues; proactive audio, which lets developers control when and in what contexts the model speaks; improved barge-in handling for interruptions in noisy environments; and live streaming speech-to-speech translation with style transfer that preserves intonation, pacing, and pitch. A later upgrade to the native-audio model, published in the last quarter of 2025, raised the ComplexFuncBench Audio score to 71.5 percent and the instruction-following score to about 90 percent (up from 84 percent), and added 2,000+ language pairs for live translation.^[22] Google rolled the upgraded model out to Google AI Studio, Vertex AI, Gemini Live, and Search Live.^[22]

What is Gemini 2.5 Flash Image (Nano Banana)?

Gemini 2.5 Flash Image, internally codenamed "Nano Banana" during anonymous testing on LMArena, launched as a separate model on August 26, 2025.^[24] The model targets image generation and natural-language image editing, with Google highlighting three capabilities: character consistency across scenes (placing the same subject in different environments while preserving identity), targeted local edits with natural language prompts (changing poses, removing background elements, colorizing photos), and image grounding that draws on Gemini's broader world knowledge for accurate composition. The Gemini API priced output at $30.00 per million tokens, which works out to approximately $0.039 per generated image at 1,290 output tokens per image.^[24] The stable gemini-2.5-flash-image endpoint accepts up to 65,536 input tokens and emits up to 32,768 output tokens, with a June 2025 knowledge cutoff and support for batch, caching, flex, and priority inference; it does not support function calling, the Live API, code execution, or audio.^[25] The gemini-2.5-flash-image-preview identifier was deprecated and scheduled for shutdown on January 15, 2026, with gemini-2.5-flash-image as the recommended successor.^[21]

How does Gemini 2.5 Flash score on benchmarks?

The table below summarizes the public benchmark numbers Google reported for Gemini 2.5 Flash across the preview and GA period, with comparison columns for Gemini 2.5 Pro and Gemini 2.0 Flash where the same benchmark was run.

Benchmark	Gemini 2.5 Flash	Gemini 2.5 Pro	Gemini 2.0 Flash
LMArena Elo (Hard Prompts)	~1,392	~1,400+	~1,290
AIME 2024	88.0%	92.0%	36.7%
AIME 2025 (Sept refresh)	72.0%	86.7%	n/a
GPQA Diamond	82.8%	84.0%	60.1%
Global MMLU Lite	88.4%	89.8%	83.4%
MMMU	79.7%	81.7%	70.7%
MMMU Pro (Sept refresh)	66.7%	75.4%	n/a
SWE-bench Verified	48.9% (May) to 54.0% (Sept)	63.8% (June GA)	31.5%
LiveCodeBench Pro Elo	1,143 (Sept)	n/a	n/a
FACTS Grounding	85.3%	84.1%	n/a
Humanity's Last Exam (no tools)	~5%	18.8%	n/a

A few comparisons stand out. On AIME 2024 the Flash 88.0 percent sat within four points of Gemini 2.5 Pro, an unusually small gap for a Flash-class model. On GPQA Diamond the gap was just 1.2 points. The headline weakness was Humanity's Last Exam, where Flash trailed Pro by roughly 14 points without tools. On the Artificial Analysis Intelligence Index, the non-reasoning configuration scored 21 against a category median of 17, ranked 31st of 84 models tracked at the time, with output speed of 215.4 tokens per second (4th overall) and time to first token of 0.65 seconds against a median of 1.87 seconds. The blended cost of about $0.85 per million tokens at a 3:1 input-to-output ratio placed Flash 53rd for input pricing and 80th for output pricing, in the same cost band as OpenAI's GPT-4o-mini and below Claude 3.5 Haiku. Artificial Analysis also noted unusually high verbosity, with Flash producing 17 million output tokens against a category median of 9.1 million during evaluation.^[11]

How much does Gemini 2.5 Flash cost?

Gemini 2.5 Flash pricing changed twice during the model's lifetime. The April 17 preview introduced a split between thinking-enabled and thinking-disabled output pricing. The June 17 GA release unified output pricing at $2.50 per million tokens regardless of thinking mode and raised input pricing modestly. The table below summarizes both regimes, plus the discounted batch tier and the premium priority tier that Google added later in 2025.^[5]

Period / Tier	Input (text, image, video)	Input (audio)	Output thinking off	Output thinking on
Preview (April 17 to June 17, 2025)	$0.15 / 1M	$1.00 / 1M	$0.60 / 1M	$3.50 / 1M
GA Standard (from June 17, 2025)	$0.30 / 1M	$1.00 / 1M	$2.50 / 1M	$2.50 / 1M
GA Batch / Flex	$0.15 / 1M	$0.50 / 1M	$1.25 / 1M	$1.25 / 1M
GA Priority	$0.54 / 1M	$1.80 / 1M	$4.50 / 1M	$4.50 / 1M

Context caching at the GA tier was charged at $0.03 per million cached tokens for text, image, and video, $0.10 per million for audio, and $1.00 per million tokens per hour for cache storage, with a roughly 75 percent cache-hit discount that became one of the main cost levers for retrieval-augmented generation workloads.^[5] The pricing shift between preview and GA drew pushback from developers who had built their cost models around the $0.60 thinking-off output rate; Google's defense was that production users rarely kept all traffic on the thinking-off path. GA pricing put Gemini 2.5 Flash at roughly three times the input cost of Gemini 2.0 Flash ($0.10) and 6.25 times the output cost ($0.40), offset by Google's claim that the model used 20 to 30 percent fewer tokens for the same tasks.^[3] The September 25 refresh further improved efficiency, with a 24 percent reduction in output tokens.^[7]

What enterprise features does Gemini 2.5 Flash offer on Vertex AI?

On Vertex AI, the GA release on June 17, 2025 added supervised fine-tuning (SFT) as a generally available capability, letting enterprises adapt Gemini 2.5 Flash to custom datasets, industry-specific terminology, and bespoke brand voice.^[4] Google positioned Flash on Vertex AI as a roughly 1.5x speed-up over Gemini 2.0 Flash at a lower cost,^[4] and made it the recommended workhorse for large-scale summarization, responsive chat backends, and structured data extraction. The Vertex AI release added structured output enforcement with JSON schema, batch processing for asynchronous high-throughput jobs, and provisioned throughput options for predictable production traffic. The launch partner list included Spline, Rooms, Snap, SmartBear, Figma, Bridgewater Associates, Box, Geotab, Warp, Salesforce, Workday, and Harvey, with Workspace's document-aware chat in Gmail, Docs, and Drive using Flash for summarization and short-form generation.^[17] Within Google, Flash powered AI Mode in Google Search for queries that did not require Pro-level reasoning and served as the default fast subtask model in Project Astra.

What is Gemini 2.5 Flash-Lite?

Google announced Gemini 2.5 Flash-Lite as a preview alongside the June 17, 2025 GA release of Flash and Pro, with the stable version reaching general availability on July 22, 2025.^[4]^[17] Flash-Lite shares the 1 million token context window and adjustable thinking mode of Flash, supports Google Search grounding and code execution, and accepts multimodal input, but is tuned for the high-volume, latency-sensitive end of the market: translation, classification, simple chat, and bulk text processing.^[17] The September 25, 2025 preview refresh delivered a roughly 50 percent reduction in output tokens for Flash-Lite (compared to 24 percent for Flash) along with better instruction following, reduced verbosity, and multimodal improvements; Google described the two models as targeting different points on the cost-quality curve, with Flash-Lite tilted toward conciseness and efficiency and Flash tilted toward tool use and multi-step complexity.^[7] The gemini-flash-lite-latest floating alias was introduced alongside gemini-flash-latest in the same release.^[7]

How was Gemini 2.5 Flash adopted and received?

Third-party adoption began within days of the April 17 preview. Replit, Cursor, Codecademy, and Aider added the model as a selectable backend, with LangChain, LlamaIndex, and the Vercel AI SDK shipping first-class support within the first month and exposing the thinking budget as a configurable parameter. OpenRouter listed Gemini 2.5 Flash as one of its top-traffic endpoints by mid-summer. Inside Google, the May 20 I/O snapshot replaced Gemini 2.0 Flash as the default in the Gemini app, and AI Mode in Google Search ran on the 2.5 Flash line for queries that did not require Pro-level reasoning.^[18] Workspace's document-aware chat in Gmail, Docs, and Drive used Flash for summarization and short-form generation. Project Astra used Flash for the always-on assistant loop and switched to Pro for harder subtasks. Google's GA announcement named Spline, Rooms, Snap, SmartBear, Figma, Bridgewater Associates, Box, Geotab, Warp, Salesforce, Workday, and Harvey among the launch partners.^[17] Google stated in December 2025 that the Flash family collectively was processing "trillions of tokens" across "hundreds of thousands of apps built by millions of developers," and that Gemini API traffic crossed roughly one trillion tokens per day after the Gemini 3 launch.

VentureBeat's preview coverage ran with the "thinking budgets that cut AI costs by 600% when turned down" framing,^[10] and Simon Willison called it "the cheapest reasoning model from a major lab," pointing to the LMArena Hard Prompts placement as the most striking data point.^[9] The most common developer complaint was the unpredictability of the dynamic thinking default, where a simple request sometimes triggered a long internal deliberation that drove up the bill; the fix was the explicit thinking_budget cap, which became the recommended pattern by GA. Artificial Analysis later noted that the model produced unusually verbose responses (17 million tokens against a median of 9.1 million), which raised the effective per-task cost above the headline price.^[11] The June 17 GA release drew mixed reactions: enterprise reviewers liked the unified pricing and SFT availability, while developers who had built around the preview's $0.60 thinking-off output rate had to redo cost models.

How does Gemini 2.5 Flash compare with rival models?

The table below places Gemini 2.5 Flash alongside its closest contemporary peers in the fast, multimodal, cost-efficient tier of 2025 models.

Model	Vendor	Released	Context	Reasoning mode	Multimodal input	Input price	Output price
Gemini 2.5 Flash (GA)	Google DeepMind	June 17, 2025	1M	Hybrid (thinking budget)	Text, image, audio, video, PDF	$0.30 / 1M	$2.50 / 1M
Gemini 2.5 Flash-Lite	Google DeepMind	July 22, 2025	1M	Hybrid (lower budget)	Text, image, audio, video, PDF	Lower than Flash	Lower than Flash
Gemini 2.0 Flash	Google DeepMind	December 2024	1M	None	Text, image, audio, video	$0.10 / 1M	$0.40 / 1M
Gemini 2.5 Pro	Google DeepMind	June 17, 2025 GA	1M	Default thinking + Deep Think	Text, image, audio, video, PDF	$1.25 / 1M	$10.00 / 1M
Gemini 3 Flash	Google DeepMind	December 17, 2025	1M	Fast and Thinking modes	Text, image, audio, video, PDF	$0.50 / 1M	$3.00 / 1M
Claude 3.5 Haiku	Anthropic	November 2024	200k	None	Text, image	$0.80 / 1M	$4.00 / 1M
OpenAI o4-mini	OpenAI	April 16, 2025	200k	Always-on reasoning	Text, image	$1.10 / 1M	$4.40 / 1M
GPT-4.1 mini	OpenAI	April 14, 2025	1M	None	Text, image	$0.40 / 1M	$1.60 / 1M

The most direct competitive contrast was with OpenAI's o4-mini, released the day before Gemini 2.5 Flash on April 16, 2025. Both targeted the same price-performance band, but o4-mini was always-on reasoning while Gemini 2.5 Flash was hybrid. Against Claude 3.5 Haiku, Gemini 2.5 Flash was cheaper on both input and output, had a much larger context window, and supported more modalities and reasoning. Against GPT-4.1 mini, Gemini 2.5 Flash was more expensive at the GA price but added the thinking budget and broader multimodal input. Inside Google's own lineup the relationship between Flash and Gemini 2.5 Pro settled into a clear split: Pro for hard reasoning, long-horizon agentic coding, and Deep Think workloads; Flash for high-volume chat, document analysis, RAG, multimodal Q&A, and lightweight agents; and Flash-Lite for the very high-volume, latency-sensitive tail of translation, classification, and simple chat.

What are the safety measures and limitations of Gemini 2.5 Flash?

Google described the safety work for Gemini 2.5 Flash under the same Frontier Safety Framework that covered Gemini 2.5 Pro. The company reported that the model had been evaluated against the then-current critical-capability thresholds for risks such as cyber-offense, autonomy, and biological uplift, and that no mitigations beyond standard release-time controls were required. The smaller model size and shorter thinking budget meant Flash was seen as a lower-risk release than Pro on most autonomy axes. Known limitations included the January 2025 knowledge cutoff (Google Search grounding is the recommended workaround for post-cutoff information),^[13] noticeable time-to-first-token delays on very long prompts, and the absence of the 2 million token context expansion announced for Gemini 2.5 Pro; the Flash ceiling remained 1 million tokens. The model can hallucinate, particularly when reasoning across multiple long documents, with fabricated quotes and confident citations of sources not present in the input among the most-discussed failure modes in mid-2025. Image, audio, and video generation are not native to the stable gemini-2.5-flash text endpoint; they are handled by sibling models such as Imagen 4, Veo 3, the Gemini 2.5 Flash Image variant, and the Live API native audio mode.^[13]

What replaced Gemini 2.5 Flash?

Gemini 3 Flash launched on December 17, 2025 and replaced Gemini 2.5 Flash as the default in the Gemini app and AI Mode in Google Search on launch day.^[26] The Gemini 3 Flash release was a substantial leap on benchmarks, with 90.4 percent on GPQA Diamond and 78 percent on SWE-bench Verified, while pricing rose modestly to $0.50 per million input and $3.00 per million output tokens.^[26] Google described it as offering "frontier intelligence built for speed at a fraction of the cost."^[26] Gemini 3 Flash in turn handed off the default slot to Gemini 3.5 Flash, which launched at Google I/O on May 19-20, 2026 and became the default in the consumer Gemini app, AI Mode in Search, and the Gemini API and Vertex AI.^[27] Gemini 2.5 Flash remained available on the Gemini API and Vertex AI after the Gemini 3 launch as a price-performance option, with gemini-flash-latest moving on to the newer Flash line. Google's published deprecation schedule lists October 16, 2026 as the earliest possible shutdown date for gemini-2.5-flash, gemini-2.5-pro, and gemini-2.5-flash-lite, with gemini-3.5-flash, gemini-3.1-pro-preview, and gemini-3.1-flash-lite respectively named as the recommended replacements; the Live API preview gemini-live-2.5-flash-preview was retired earlier, on December 9, 2025.^[21]

The lasting impact sat in three areas. First, it normalized the idea that a Flash-class model could be a real reasoning model rather than just a fast non-reasoner, a pattern most other labs followed through 2025 and 2026. Second, the hybrid reasoning design with an explicit thinking budget became a widely copied pattern at Anthropic, OpenAI, and several open-weight labs. Third, the unified production endpoint that could serve both reasoning and non-reasoning traffic through a single model identifier reduced the operational overhead of routing requests across multiple models. The accompanying sibling-model strategy, in which Google paired a single text backbone with dedicated Live API, image, and TTS endpoints under the same generation name, also became a template Google reused for the Gemini 3 family.

ELI5: What is Gemini 2.5 Flash?

Imagine a smart, fast assistant who can read, look at pictures, listen to audio, and watch video, all at once. Most assistants either think hard about every question (slow and expensive) or never think much at all (fast and cheap). Gemini 2.5 Flash lets you choose: you can tell it "don't bother thinking, just answer" for easy questions, or "think for up to this many steps" for hard ones. That way you only pay for deep thinking when you actually need it, which is why it became one of the cheapest "thinking" AI models you could rent in 2025.

References

^Google Developers Blog, "Start building with Gemini 2.5 Flash," Tulsee Doshi, April 17, 2025. developers.googleblog.com/...-with-gemini-25-flash
^Google DeepMind, "Gemini 2.5: Our newest Gemini model with thinking," March 25, 2025. blog.google/...i-model-thinking-updates-march-2025
^Google DeepMind, "Google I/O 2025: Updates to Gemini 2.5 from Google DeepMind," May 20, 2025. blog.google/...google-gemini-updates-io-2025
^Google Cloud Blog, "Gemini 2.5 Updates: Flash/Pro GA, SFT, Flash-Lite on Vertex AI," June 17, 2025. cloud.google.com/...sh-lite-flash-pro-ga-vertex-ai
^Google AI for Developers, "Gemini Developer API pricing." ai.google.dev/...pricing
Google Developers Blog, "Gemini 2.5: Updates to our family of thinking models," May 2025. developers.googleblog.com/...hinking-model-updates
^Google Developers Blog, "Continuing to bring you our latest models, with an improved Gemini 2.5 Flash and Flash-Lite release," September 25, 2025. developers.googleblog.com/...nd-flash-lite-release
^Google DeepMind, "Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context" (technical report). storage.googleapis.com/...gemini_v2_5_report.pdf
^Simon Willison, "Start building with Gemini 2.5 Flash," April 17, 2025. simonwillison.net/...building-with-gemini-25-flash
^VentureBeat, "Google's Gemini 2.5 Flash introduces thinking budgets that cut AI costs by 600% when turned down," April 17, 2025. venturebeat.com/...i-costs-by-600-when-turned-down
^Artificial Analysis, "Gemini 2.5 Flash: Intelligence, Performance & Price Analysis." artificialanalysis.ai/...gemini-2-5-flash
LLM Stats, "Gemini 2.5 Flash: Pricing, Context Window, Benchmarks, and More." llm-stats.com/...gemini-2.5-flash
^Google AI for Developers, "Gemini 2.5 Flash model card." ai.google.dev/...gemini-2.5-flash
Wikipedia, "Gemini (language model)." en.wikipedia.org/...Gemini_(language_model)
BGR, "Gemini 2.5 Flash Is Google's Cheapest Thinking AI: What You Need To Know," April 17, 2025. bgr.com/...apest-thinking-ai-what-you-need-to-know
Nerdschalk, "Google Rolls Out Gemini 2.5 Flash with Adjustable 'Thinking Budget' and 24,576 Token Limit," April 2025. nerdschalk.com/...ing-budget-and-24576-token-limit
^Google, "We're expanding our Gemini 2.5 family of models," June 17, 2025. blog.google/...gemini-2-5-model-family-expands
^9to5Google, "Gemini app releases 2.5 Flash & Live camera on iPhone, 2.5 Pro Deep Think mode coming," May 20, 2025. 9to5google.com/...gemini-app-i-o-2025
Google Cloud Documentation, "Gemini 2.5 Flash | Generative AI on Vertex AI." docs.cloud.google.com/...2-5-flash
VentureBeat, "Google launches production-ready Gemini 2.5 AI models to challenge OpenAI's enterprise dominance," June 17, 2025. venturebeat.com/...ge-openais-enterprise-dominance
^Google AI for Developers, "Gemini deprecations." ai.google.dev/...deprecations
^Google, "Gemini 2.5 Native Audio upgrade, plus text-to-speech model updates." blog.google/...gemini-audio-model-updates
^Google AI for Developers, "Live API capabilities guide." ai.google.dev/...live-guide
^Google Developers Blog, "Introducing Gemini 2.5 Flash Image, our state-of-the-art image model," August 26, 2025. developers.googleblog.com/...emini-2-5-flash-image
^Google AI for Developers, "Gemini 2.5 Flash Image (Nano Banana) model card." ai.google.dev/...gemini-2.5-flash-image
^Google, "Introducing Gemini 3 Flash: Benchmarks, global availability," December 17, 2025. blog.google/...gemini-3-flash
^Google, "Gemini 3.5: frontier intelligence with action," May 2026. blog.google/...gemini-3-5

Improve this article

Add missing citations, update stale details, or suggest a clearer explanation. Every suggestion is reviewed for sourcing before it goes live.

4 revisions by 1 contributors · v5 · 5,406 words · full history

Fact-checks are independent of edits: a reviewer re-verifies the article against its sources and stamps the date. How we verify

Suggest edit

Gemini 2.5 Flash

What is Gemini 2.5 Flash?

Background: where does the "Flash" name come from?

When was Gemini 2.5 Flash released?

What are the Gemini 2.5 Flash variants and snapshots?

How is Gemini 2.5 Flash built?

What can Gemini 2.5 Flash do?

What is the thinking budget?

How good is Gemini 2.5 Flash at multimodal understanding?

How good is Gemini 2.5 Flash at coding, long context, and tool use?

What does Gemini 2.5 Flash native audio and the Live API do?

What is Gemini 2.5 Flash Image (Nano Banana)?

How does Gemini 2.5 Flash score on benchmarks?

How much does Gemini 2.5 Flash cost?

What enterprise features does Gemini 2.5 Flash offer on Vertex AI?

What is Gemini 2.5 Flash-Lite?

How was Gemini 2.5 Flash adopted and received?

How does Gemini 2.5 Flash compare with rival models?

What are the safety measures and limitations of Gemini 2.5 Flash?

What replaced Gemini 2.5 Flash?

ELI5: What is Gemini 2.5 Flash?

See also

References

Improve this article

What links here

What links here

What is Gemini 2.5 Flash?

Background: where does the "Flash" name come from?

When was Gemini 2.5 Flash released?

What are the Gemini 2.5 Flash variants and snapshots?

How is Gemini 2.5 Flash built?

What can Gemini 2.5 Flash do?

What is the thinking budget?

How good is Gemini 2.5 Flash at multimodal understanding?

How good is Gemini 2.5 Flash at coding, long context, and tool use?

What does Gemini 2.5 Flash native audio and the Live API do?

What is Gemini 2.5 Flash Image (Nano Banana)?

How does Gemini 2.5 Flash score on benchmarks?

How much does Gemini 2.5 Flash cost?

What enterprise features does Gemini 2.5 Flash offer on Vertex AI?

What is Gemini 2.5 Flash-Lite?

How was Gemini 2.5 Flash adopted and received?

How does Gemini 2.5 Flash compare with rival models?

What are the safety measures and limitations of Gemini 2.5 Flash?

What replaced Gemini 2.5 Flash?

ELI5: What is Gemini 2.5 Flash?

See also

References

Improve this article

Related Articles

Gemini 3

Gemma 3

SmolVLA

Flamingo (visual language model)

Gemini Omni

Gemini 3.5 Flash

What links here

Related Articles

Gemini 3

Gemma 3

SmolVLA

Flamingo (visual language model)

Gemini Omni

Gemini 3.5 Flash

What links here