Gemini 2.5 Pro is a large language model developed by Google DeepMind and the flagship reasoning model of the Gemini 2.5 family. It was first released as an experimental preview on March 25, 2025, under the model identifier gemini-2.5-pro-exp-03-25. The model introduced "thinking" as a default behavior across the Gemini line, meaning every response begins with an internal chain of reasoning before any text is shown to the user. Gemini 2.5 Pro reached general availability on June 17, 2025, and is offered through Google AI Studio, the Gemini API, and Vertex AI.
At launch the model debuted at the top of the LMArena (Chatbot Arena) leaderboard, becoming the first publicly accessible model to clear an Elo score of approximately 1,300, with a lead of close to 40 points over the previous number one. It also took the top position on WebDev Arena, posted state-of-the-art scores on GPQA Diamond, AIME 2025, and Humanity's Last Exam, and pulled level with or ahead of Anthropic's Claude 3.7 Sonnet and OpenAI's GPT-4.5 across most public reasoning benchmarks. The model's broad capability profile, large context window, and standard-tier pricing of $1.25 per million input tokens and $10.00 per million output tokens led to fast adoption inside developer tools such as Cursor, Replit, and Codecademy through the spring and summer of 2025.
The Gemini 2.5 family also includes the smaller Gemini 2.5 Flash and Gemini 2.5 Flash-Lite siblings, plus an enhanced reasoning configuration called Gemini 2.5 Pro Deep Think, announced at Google I/O on May 20, 2025. Gemini 2.5 Pro was eventually superseded by Gemini 3 Pro in November 2025, but it remained the default "Pro" model on the Gemini API, in the Gemini consumer app's paid tiers, and across Google products for roughly eight months, from its late-March launch until the Gemini 3 rollout.
Gemini 2.5 Pro is the successor to the earlier flagships of Google's Gemini line: it follows Gemini 1.5 Pro (February 2024) and the short-lived Gemini 2.0 Pro Experimental (February 2025), and it is the first Gemini model in which reasoning is the default rather than a separate variant.
Gemini 1.5 Pro arrived in February 2024 with a 1 million token context window, later expanded to 2 million tokens for waitlisted developers. It was the first Gemini model to use a Mixture-of-Experts (MoE) architecture, where only a subset of expert sub-networks is activated for each token, and it supported native input across text, images, audio, video, and code. Gemini 1.5 Pro introduced the very long context window that would come to define the Gemini line, but it had no dedicated chain-of-thought mode.
Gemini 2.0 Flash Experimental was announced on December 11, 2024, and reframed Google's strategy around what the company called the "agentic era." It added native tool use (Google Search, code execution), native audio output, real-time streaming through the Multimodal Live API, and roughly twice the speed of 1.5 Pro at lower cost. On December 19, 2024, Google released the experimental Gemini 2.0 Flash Thinking, a variant trained to spend additional inference compute on internal deliberation before answering. Flash Thinking foreshadowed the architectural direction of the 2.5 family.
In early 2025, Google released a wave of production Gemini 2.0 models: Gemini 2.0 Flash on January 30, then Gemini 2.0 Pro Experimental and Gemini 2.0 Flash-Lite on February 5 and February 25, respectively. Gemini 2.0 Pro Experimental shipped with a 1 million token context window, native tool use, and stronger coding performance, but no built-in reasoning step. It was a brief stop on the way to 2.5.
The broader market context for the 2.5 launch was unusually crowded. DeepSeek R1, an open-weight reasoning model from a Chinese research lab, had landed in January 2025 and demonstrated that distilled chain-of-thought training could produce frontier-class results at a fraction of the cost. xAI released Grok 3 on February 17, 2025, and announced its own Big Brain reasoning mode. OpenAI released GPT-4.5 on February 27, 2025, as a non-reasoning frontier model with strong general knowledge but uneven math and code performance. Anthropic released Claude 3.7 Sonnet on February 24, 2025, which became the first commercial model to expose a configurable extended thinking mode to all developers. By the time Gemini 2.5 Pro arrived four weeks later, every major lab had committed to some form of test-time reasoning. Google's contribution was to make thinking the default rather than a toggle.
Gemini 2.5 Pro was announced on March 25, 2025, in a Google DeepMind blog post titled "Gemini 2.5: Our newest Gemini model with thinking," co-credited to Demis Hassabis and Koray Kavukcuoglu. The post described the model as "our most intelligent model yet" and framed the 2.5 series as a generation in which all Gemini models would be thinking models, capable of reasoning through their own thoughts before producing a response.
The initial release shipped under the experimental model ID gemini-2.5-pro-exp-03-25. It was made available immediately and at no cost in Google AI Studio, and rolled out the same day to the Gemini consumer app for Gemini Advanced subscribers. Vertex AI access followed within days. The free tier included generous rate limits intended to encourage broad early experimentation, and developer documentation listed a 1 million token context window with an announced ("coming soon") expansion to 2 million tokens.
Google's launch claims fell into three categories. First, the model debuted at number one on LMArena (the human-preference leaderboard run by the LMSYS / Chatbot Arena project), with a reported lead of close to 40 Elo points over the second-place model. Independent verification by the Chatbot Arena team placed the launch score in the 1,300 to 1,310 range, making Gemini 2.5 Pro the first model to clear the 1,300 line on a leaderboard whose top scores had hovered in the 1,250 to 1,290 band for most of 2024. Second, Google reported industry-leading scores on Humanity's Last Exam (18.8 percent without tools), GPQA Diamond (84.0 percent), and AIME 2025 (86.7 percent), along with strong but not market-leading results on SWE-bench Verified (63.8 percent). Third, the company emphasized improvements in coding: a top placement on the WebDev Arena leaderboard for full web app generation, and a 70.4 percent score on LiveCodeBench v5 for competitive programming.
The "Pro Experimental" label was important. Google had used the phrase "Pro Experimental" earlier in 2025 for Gemini 2.0 Pro Experimental, signaling that a model was production-eligible in spirit but still subject to changes in pricing, rate limits, and behavior before stable release. Gemini 2.5 Pro Experimental sat in this category from March 25 through early May, when a paid preview tier opened on Vertex AI. The model received an updated checkpoint, gemini-2.5-pro-preview-05-06, around the time of Google I/O 2025 (May 20-21). A further refresh, gemini-2.5-pro-preview-06-05, accompanied the WebDev Arena update and improved coding behavior. The June 17, 2025, GA release retired the preview suffix in favor of the stable identifier gemini-2.5-pro.
Google shipped multiple variants and snapshots of Gemini 2.5 in 2025. The table below summarizes the major identifiers as they appeared in the Gemini API and Vertex AI catalogs.
| Identifier | Status | Key dates | Notes |
|---|---|---|---|
| gemini-2.5-pro-exp-03-25 | Experimental | March 25, 2025 | Initial launch; free in AI Studio |
| gemini-2.5-pro-preview-05-06 | Preview (paid) | May 6, 2025 | Improved coding, WebDev Arena number one |
| gemini-2.5-pro-preview-06-05 | Preview (paid) | June 5, 2025 | Final preview snapshot before GA |
| gemini-2.5-pro | General availability | June 17, 2025 | Stable production model |
| gemini-2.5-pro-deep-think | Limited preview | May 20, 2025 (announced) | Extended reasoning, parallel hypotheses |
| gemini-2.5-flash-preview-04-17 | Preview | April 17, 2025 | First Flash with thinking |
| gemini-2.5-flash | General availability | June 17, 2025 | Stable Flash |
| gemini-2.5-flash-lite-preview-06-17 | Preview | June 17, 2025 | Smallest 2.5 variant |
| gemini-2.5-flash-lite | General availability | July 22, 2025 | Stable Flash-Lite |
| gemini-2.5-pro-preview-tts | Preview | Mid-2025 | Speech synthesis variant |
Gemini 2.5 Pro is the headline model. It uses the largest parameter budget of the 2.5 family and has the longest configurable thinking budget (up to 32,000 tokens of internal reasoning). It is the only Gemini 2.5 model approved for the Deep Think configuration and the only one with the upgraded coding RL training that drove the WebDev Arena number-one finish.
Gemini 2.5 Flash is the first Flash-class model with built-in thinking. It shares the same 1 million token context window as Pro and supports the same input modalities, but uses a smaller and faster expert routing configuration. Google reported that the I/O 2025 preview of Flash used 20 to 30 percent fewer tokens than its predecessor on internal evaluations while improving on reasoning and code benchmarks. Flash reached general availability on June 17, 2025.
Gemini 2.5 Flash-Lite is the smallest and cheapest model in the 2.5 family, targeted at high-volume, latency-sensitive workloads such as text classification, translation, content moderation, and intelligent routing. Google described it as roughly 1.5 times faster than Gemini 2.0 Flash at lower cost. Flash-Lite entered preview on June 17, 2025, and reached general availability on July 22, 2025.
Deep Think is an enhanced reasoning configuration of Gemini 2.5 Pro that uses additional inference techniques to consider multiple hypotheses in parallel before committing to a final answer. It was announced at Google I/O on May 20, 2025, with Google reporting state-of-the-art results on the 2025 USA Mathematical Olympiad (USAMO) under research conditions. Public reports cited a 49.4 percent score on USAMO 2025 and number-one placement on LiveCodeBench's competition-coding leaderboard. Deep Think was held back for an extended safety review and rolled out first to trusted testers via the Gemini API; broader access to Google AI Ultra subscribers in the Gemini app followed in August 2025.
Gemini 2.5 Pro Preview TTS is a speech-synthesis-focused variant offered in preview through the Gemini API. It supports controllable speaker voices, emotional inflection, and over 24 languages. The TTS variant is optimized for structured audio output (audiobook, podcast, narration) rather than open-ended chat.
Google DeepMind has not published a full technical report disclosing the parameter count or training corpus of Gemini 2.5 Pro. The technical brief released in mid-2025 ("Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context") describes the model qualitatively rather than in terms of architecture details. From the brief and from independent commentary, the following design choices are well established.
Gemini 2.5 Pro is a decoder-only Transformer built on the same architectural family as Gemini 1.5 and Gemini 2.0. It uses a Mixture-of-Experts routing scheme inherited from Gemini 1.5, with sparsely activated expert sub-networks chosen per token. The total parameter count has not been disclosed, but discussion among researchers and informed observers (including Simon Willison, Vellum AI, and Helicone analyses) places the active-parameter count in the tens of billions and the total parameter count well above one trillion, consistent with the Gemini family's MoE pattern.
The model is natively multimodal in the same sense as earlier Gemini releases: text, image, audio, video, and PDF inputs are converted to a shared token representation by per-modality encoders, then processed by a single Transformer stack. There is no separate vision tower running on the side. This is in contrast to systems that bolt a vision encoder onto a text-only model after the fact.
The defining 2.5 change is the integration of reasoning into the default response path. Rather than a separate "thinking" model, every Gemini 2.5 Pro request runs through an internal chain-of-thought pass. Developers can set a thinking budget (in tokens) that caps how much computation the model spends on internal deliberation. On the Flash models, setting the budget to zero disables thinking entirely, producing latencies and costs comparable to a non-reasoning model; on 2.5 Pro the budget has a minimum floor and thinking cannot be switched off completely. The default thinking budget is dynamic, based on Google's internal heuristics. Reasoning is trained through a combination of supervised fine-tuning on chain-of-thought traces and reinforcement learning from preference data, drawing on techniques explored earlier in Gemini 2.0 Flash Thinking and on research published by Google DeepMind throughout 2024.
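The budget is exposed through the API's generation config. A minimal sketch using the google-genai Python SDK, assuming the stable gemini-2.5-pro identifier and a GEMINI_API_KEY set in the environment:

```python
# Minimal sketch: capping the thinking budget via the google-genai SDK.
from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment

response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents="Prove that the square root of 2 is irrational.",
    config=types.GenerateContentConfig(
        # Cap internal deliberation at 4,096 tokens. Omitting the field
        # (or passing -1) requests the dynamic default budget.
        thinking_config=types.ThinkingConfig(thinking_budget=4096),
    ),
)
print(response.text)
```

Because thinking tokens are billed at the output rate, the budget also functions as a direct cost control.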
The model supports inspectable thought summaries: a redacted form of its internal reasoning that developers can return alongside the final answer. Full chain-of-thought transcripts are not returned in production, partly to discourage distillation by competitors and partly to keep raw deliberation, which can include exploratory or partially incorrect intermediate steps, out of user-facing surfaces.
Gemini 2.5 Pro's reasoning behavior is most visible in long, multi-step problems. Internal Google evaluations described in the launch blog show the model decomposing graduate-level science problems, working through multi-step proofs in mathematics, and reading lengthy code diffs before proposing a fix. Developers can ask the model to summarize its thinking, which yields a paragraph or two of human-readable reasoning that can be used for debugging or transparency.
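Programmatically, the same summaries are available through the thinking config. A sketch, again with the google-genai SDK: `include_thoughts=True` returns the summarized reasoning as separate response parts flagged with `thought=True`.

```python
# Sketch: returning a thought summary alongside the final answer.
from google import genai
from google.genai import types

client = genai.Client()
response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents="Which weighs more: a pound of feathers or a kilogram of iron?",
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(include_thoughts=True),
    ),
)
for part in response.candidates[0].content.parts:
    prefix = "[thought]" if part.thought else "[answer]"
    print(prefix, part.text)
```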
The configurable thinking budget (up to 32,000 tokens) lets applications trade latency for accuracy. Internal Google plots show smooth improvements on AIME, GPQA Diamond, and Humanity's Last Exam as the budget increases. The diminishing-returns point depends on the task; for many real-world coding queries, a budget of a few thousand tokens captures most of the benefit.
Google invested heavily in coding for the 2.5 series. The May 6, 2025, preview snapshot (gemini-2.5-pro-preview-05-06) was specifically positioned as a coding upgrade, and Google's developer blog reported large improvements on the company's internal evals. Public results that landed in the Vellum, Helicone, and Artificial Analysis comparisons placed Gemini 2.5 Pro at or near the top on most coding benchmarks: WebDev Arena (number one for full app generation), Aider Polyglot (74.0 percent on real-world code editing across multiple languages), and LiveCodeBench v5 (70.4 percent on competitive programming).
On SWE-bench Verified, the standard benchmark for autonomous bug-fixing on real GitHub repositories, the model scored 63.8 percent at launch using a custom agent harness. That number trailed Claude 3.7 Sonnet's 70.3 percent (extended thinking) on the same benchmark, and was acknowledged as such by Google. SWE-bench Verified remained the one major coding benchmark where Anthropic's Claude line held a clear lead through mid-2025.
Qualitatively, Replit, Cursor, and Codecademy all integrated Gemini 2.5 Pro within weeks of the preview. Replit president Michele Catasta publicly described the model as "the best frontier model for the capability over latency ratio" for their agent use cases, and a Replit engineer compared its judgment to that of "a more senior developer." Cursor users widely adopted the model for refactoring and multi-file edits, particularly in front-end work where the WebDev Arena strength translated to better UI generation.
Gemini 2.5 Pro accepts text, images, audio, video, and PDFs in the same prompt. Concrete capacity limits include up to 3,000 images per prompt (each up to 7 MB inline or 30 MB through Cloud Storage), approximately 45 minutes of video with audio (or one hour without), up to roughly 8.4 hours of audio per prompt, and PDFs up to 1,000 pages or 50 MB.
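As an illustration of the unified input path, a hedged sketch passing a PDF inline alongside a text instruction with the google-genai SDK (the file name is hypothetical; files that push a request past roughly 20 MB would go through the Files API instead):

```python
# Sketch: one prompt mixing a PDF and a text instruction.
from pathlib import Path
from google import genai
from google.genai import types

client = genai.Client()
pdf_bytes = Path("report.pdf").read_bytes()  # hypothetical local file

response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents=[
        types.Part.from_bytes(data=pdf_bytes, mime_type="application/pdf"),
        "Summarize the key findings in three bullet points.",
    ],
)
print(response.text)
```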
Video understanding is one of the model's most differentiated capabilities. At Google I/O 2025, Google showed a preview version scoring 84.8 percent on VideoMME, the standard benchmark for multimodal video comprehension. The same demo showcased a "video-to-app" workflow where the model watched a recording of a UI interaction and produced a working web application that reproduced the interaction. The capability was widely shared on social media and was one of the reasons Gemini 2.5 Pro became the preferred model for creative and educational video tasks in mid-2025.
Audio handling is also strong. The model can transcribe speech, describe ambient sounds, identify musical instruments, and reason about spoken content over hours of input. The Live API exposes a real-time streaming mode that supports interruptions and turn-taking.
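A rough sketch of the Live API's turn-based streaming pattern, using the SDK's async client and text output for brevity; the model identifier here is a placeholder, since the Live API uses dedicated live model IDs rather than gemini-2.5-pro itself:

```python
# Sketch: real-time streaming over the Live API (text-only for brevity).
import asyncio
from google import genai
from google.genai import types

client = genai.Client()

async def main():
    config = types.LiveConnectConfig(response_modalities=["TEXT"])
    async with client.aio.live.connect(
        model="gemini-live-placeholder",  # placeholder; see the model catalog
        config=config,
    ) as session:
        await session.send_client_content(
            turns=types.Content(role="user", parts=[types.Part(text="Hello!")])
        )
        async for message in session.receive():
            if message.text:
                print(message.text, end="")

asyncio.run(main())
```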
Gemini 2.5 Pro launched with a 1 million token context window. Google announced at the same time that 2 million tokens were "coming soon," matching the eventual ceiling of Gemini 1.5 Pro. The expansion never shipped: the ceiling remained 1 million tokens at GA and for the rest of the model's run as the default Gemini Pro through November 2025.
A 1 million token window is enough for roughly 750,000 words of English text, several hours of audio, or a codebase of tens of thousands of lines. On the MRCR (Multi-Round Coreference Resolution) benchmark at 128k context, the model scored 94.5 percent, indicating strong long-context retrieval. Citizen Health, an early enterprise customer, used the model to ingest decades of longitudinal patient records, including physician notes, imaging reports, and genomic data, in a single API call.
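When filling a window that large, it helps to measure before sending. A small sketch using the SDK's count_tokens endpoint (the dump file is hypothetical):

```python
# Sketch: checking how much of the 1M-token window an input consumes.
from pathlib import Path
from google import genai

client = genai.Client()
corpus = Path("codebase_dump.txt").read_text()  # hypothetical file

count = client.models.count_tokens(model="gemini-2.5-pro", contents=corpus)
print(f"{count.total_tokens:,} tokens of the ~1,048,576-token window")
```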
Gemini 2.5 Pro supports function calling, structured outputs, batch processing, and context caching, all of which are important for building AI agents that interact with external tools and APIs over multiple steps. The model also supports the Model Context Protocol (MCP) standard for connecting to data sources, and Google integrated Project Mariner's computer-use capabilities into the Gemini 2.5 API as a preview, which allowed the model to navigate web interfaces and interact with desktop applications autonomously.
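As a concrete illustration of the function-calling path, a sketch using the SDK's automatic mode, which accepts a plain Python callable as a tool; get_order_status is a hypothetical stub:

```python
# Sketch: automatic function calling with a hypothetical tool.
from google import genai
from google.genai import types

def get_order_status(order_id: str) -> str:
    """Return the shipping status for an order ID (illustrative stub)."""
    return "shipped" if order_id == "A-1001" else "unknown"

client = genai.Client()
response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents="Has order A-1001 shipped yet?",
    config=types.GenerateContentConfig(tools=[get_order_status]),
)
print(response.text)  # the SDK executes the call and feeds the result back
```

In automatic mode the SDK runs the tool locally and returns the result to the model in a second turn, so multi-step agent loops need no manual plumbing for simple cases.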
The table below summarizes the public benchmark numbers Google reported at and shortly after the March 2025 launch, along with their counterparts on competing frontier models from the same period.
| Benchmark | Gemini 2.5 Pro | Claude 3.7 Sonnet (ext. thinking) | GPT-4.5 | Grok 3 (Big Brain) | DeepSeek R1 |
|---|---|---|---|---|---|
| LMArena Elo (March 2025) | ~1,300+ (#1) | ~1,290 | ~1,280 | ~1,275 | ~1,265 |
| GPQA Diamond | 84.0% | 84.8% | 71.4% | 84.6% | 71.5% |
| AIME 2025 | 86.7% | 80.0% (approx) | 36.7% | 93.3% | 79.8% |
| AIME 2024 | 92.0% | 78.0% | 36.7% | 93.3% (cited) | 79.8% |
| SWE-bench Verified | 63.8% | 70.3% | 38.0% | n/a | 49.2% |
| Humanity's Last Exam (no tools) | 18.8% | 8.9% | 6.4% | n/a | 9.4% |
| MMMU | 81.7% | 75.0% | 74.4% | 78.0% | n/a |
| Global MMLU (Lite) | 89.8% | n/a | 89.6% | n/a | n/a |
| LiveCodeBench v5 | 70.4% | n/a | n/a | 79.4% | n/a |
| Aider Polyglot | 74.0% | 64.9% | 44.9% | n/a | 56.9% |
| MRCR (128k) | 94.5% | n/a | n/a | n/a | n/a |
| VideoMME (I/O preview) | 84.8% | n/a | n/a | n/a | n/a |
| SimpleQA | 52.9% | 28.2% | 62.5% | n/a | n/a |
A few comparisons are worth noting. Claude 3.7 Sonnet held a clear lead on SWE-bench Verified, the headline benchmark for agentic coding. Grok 3's Big Brain mode posted higher AIME 2025 numbers than Gemini 2.5 Pro, but trailed it on broader reasoning benchmarks. GPT-4.5 was strong on world-knowledge benchmarks like SimpleQA but weak on competition mathematics, since it was a non-reasoning model. DeepSeek R1, despite being open-weight and substantially cheaper to serve, was not competitive on multimodal or long-context tasks.
Gemini 2.5 Pro Deep Think pushed several of these numbers further. Public reporting cited a 49.4 percent score on USAMO 2025, gold-medal-level performance on the 2025 International Mathematical Olympiad in research evaluations, an 84.0 percent score on MMMU at the Deep Think setting, and a top-of-leaderboard finish on LiveCodeBench's competition coding category. Deep Think also led the FrontierMath benchmark in its tier 1 to 3 range with approximately 29 percent.
Independent evaluations through 2025 reinforced the launch numbers but added important nuance. METR's preregistered task-suite evaluations placed Gemini 2.5 Pro near but not at the frontier on long-horizon agent tasks, behind both Claude 3.7 Sonnet and OpenAI's o3 family on certain rollouts. Vellum's coding leaderboard had Gemini 2.5 Pro and Claude 3.7 Sonnet trading the top spot through April and May 2025 depending on the task type. Simon Willison's running notes called the model "genuinely good at hard things" and singled out the video-to-code demo as the most novel new capability he had seen in 2025.
Gemini 2.5 Pro is billed per million tokens through both the Gemini API (Google AI Studio) and Vertex AI. Pricing is tiered by prompt size: requests with up to 200,000 input tokens are charged at the standard rate, while requests with longer prompts use a higher per-token rate. Multiple service tiers exist. The free rate-limited tier in Google AI Studio is intended for experimentation and small projects, and paid Standard, Batch (also called Flex), and Priority tiers cover production workloads.
Gemini 2.5 Pro, in USD per million tokens:

| Tier | Input (≤200k prompt) | Input (>200k prompt) | Output (≤200k prompt) | Output (>200k prompt) |
|---|---|---|---|---|
| Standard | $1.25 | $2.50 | $10.00 | $15.00 |
| Batch / Flex | $0.625 | $1.25 | $5.00 | $7.50 |
| Priority | $2.25 | $4.50 | $18.00 | $27.00 |
Context caching is available at $0.125 per million tokens for prompts under 200k tokens ($0.25 above 200k), with cache storage at $4.50 per million tokens per hour.
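A sketch of the caching flow with the google-genai SDK and a hypothetical large document: the cache is created once, then referenced by name on each subsequent call so the shared prefix bills at the cached-token rate.

```python
# Sketch: cache a large shared prefix, then query it repeatedly.
from pathlib import Path
from google import genai
from google.genai import types

client = genai.Client()
manual = Path("product_manual.txt").read_text()  # hypothetical document

cache = client.caches.create(
    model="gemini-2.5-pro",
    config=types.CreateCachedContentConfig(
        contents=[manual],
        ttl="3600s",  # storage is billed per token-hour while the cache lives
    ),
)
response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents="What does the manual say about battery care?",
    config=types.GenerateContentConfig(cached_content=cache.name),
)
print(response.text)
```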
Gemini 2.5 Flash, in USD per million tokens:

| Tier | Input (text/image/video) | Input (audio) | Output |
|---|---|---|---|
| Standard | $0.30 | $1.00 | $2.50 |
| Batch / Flex | $0.15 | $0.50 | $1.25 |
| Priority | $0.54 | $1.80 | $4.50 |
Gemini 2.5 Flash-Lite, in USD per million tokens:

| Tier | Input (text/image/video) | Input (audio) | Output |
|---|---|---|---|
| Standard | $0.10 | $0.30 | $0.40 |
| Batch / Flex | $0.05 | $0.15 | $0.20 |
| Priority | $0.18 | $0.54 | $0.72 |
Gemini 2.5 Pro's $1.25 input and $10.00 output Standard pricing put it well below GPT-4.5 (which OpenAI priced at $75 per million input tokens and $150 per million output tokens, more than ten times higher) and roughly in line with Claude 3.7 Sonnet ($3 / $15 per million tokens). Pricing for the long-context tier (above 200k input tokens) doubled the input cost and increased output cost by 50 percent, similar to how Anthropic and OpenAI have approached very long contexts. The Batch / Flex tier offered a roughly 50 percent discount for asynchronous workloads where latency was not critical.
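A worked example of the tiered math, using the Standard rates in the table above: a single call with a 300,000-token prompt falls in the long-context tier, so both input and output bill at the higher rate.

```python
# Worked example: Standard-tier cost for one long-context request.
INPUT_RATE = 2.50 / 1_000_000    # USD per input token, prompts >200k tokens
OUTPUT_RATE = 15.00 / 1_000_000  # USD per output token, prompts >200k tokens

input_tokens, output_tokens = 300_000, 2_000
cost = input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE
print(f"${cost:.2f}")  # $0.75 input + $0.03 output = $0.78
```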
Availability extended across three primary surfaces. Free experimentation lived in Google AI Studio, with rate limits of a few requests per minute on the experimental endpoint. Production use ran through the paid Gemini API or through Vertex AI for enterprise customers needing fine-tuning, audit logs, and Google Cloud integrations. Consumer access came through the Gemini app's paid tiers (Gemini Advanced, then later Google AI Pro and Google AI Ultra after Google's May 2025 subscription restructuring).
Third-party adoption began within hours of the March 25 launch. Cursor added the experimental endpoint as a selectable model the same week and benchmarked it as one of the top performers on its internal coding evals. Replit integrated the model into Replit Agent and Ghostwriter; the company publicly described the integration as their preferred frontier model for agent workflows. Codecademy added Gemini 2.5 Pro to its AI tutoring features. Aider, the open-source command-line code editor, included it in the polyglot leaderboard, where it took the top spot for several weeks.
In the broader developer ecosystem, the model became a default choice for chat applications and tools that wanted reasoning at low cost. LangChain, LlamaIndex, and the Vercel AI SDK shipped first-class support within the first month. Companies building AI agents layered Gemini 2.5 Pro into pipelines that previously required calls to OpenAI o1 or Claude 3.7 Sonnet, citing the lower price and the larger context window.
Enterprise adoption ran along two tracks. Vertex AI customers used it for code generation, document understanding, and translation across long enterprise documents. Citizen Health used it on patient records. Cognition (the company behind Devin) integrated it as one of several backend models, comparing its performance against Anthropic and OpenAI counterparts on real customer workloads.
Inside Google, Gemini 2.5 Pro became the default "Pro" model in the Gemini app's paid tiers, took over reasoning queries in Search's AI Mode, and provided the underlying intelligence for Project Astra (Google's universal-assistant prototype) and Project Mariner (its browsing agent). Workspace customers saw Gemini 2.5 Pro power document-aware chat in Gmail, Docs, and Drive, particularly for summarization tasks that benefit from the long context window.
The initial March 25 release drew substantial attention from the developer community. The 1,300-plus Elo score on LMArena was widely noted as the first time any model had cleared that line. Simon Willison, in his March 25 write-up, called it "the new state-of-the-art for everything that involves complicated reasoning, including coding" and singled out the price-to-performance ratio as the model's defining feature.
The Vellum AI coding leaderboard, run by the AI development platform of the same name, had Gemini 2.5 Pro and Claude 3.7 Sonnet trading the top spot through April. Vellum's commentary described the two models as "functionally interchangeable" for most production code tasks, with Gemini cheaper and Claude marginally more reliable on long-horizon agentic flows.
METR's preregistered evaluations, published in late April 2025, treated the model carefully. METR's task suite measures the time horizon over which an agent can autonomously make progress on a real software task. Gemini 2.5 Pro fell behind Claude 3.7 Sonnet and OpenAI's o3 in median horizon length, but came in ahead of GPT-4.5 and DeepSeek R1. METR noted that the gap with Claude 3.7 Sonnet was not large and was task-dependent, with Gemini doing relatively better on debugging and worse on multi-step refactoring.
Independent reviewers including Simon Willison and Latent Space picked out video-to-code, the 1 million token context window in practice, and the price as the standout features. Critics pointed to occasional refusals on benign prompts containing sensitive keywords, hallucinated quotes from documents in long-context queries, and a regression in response quality observed on certain checkpoints in mid-2025. The Google AI developer forum thread "Gemini 2.5 Pro's Response Quality Regression" became one of the most-discussed support threads on the forum, drawing acknowledgement from Google product managers and a series of fixes through July and August 2025.
Reception inside academia was generally positive. The model was adopted as a baseline in many subsequent reasoning-model papers, often replacing GPT-4o or Claude 3 Opus as the stand-in for a frontier closed model. Researchers cited the publicly available thought summaries as a useful diagnostic tool for understanding chain-of-thought failures, even though the redacted form prevents direct study of the underlying reasoning trace.
The table below summarizes the major contemporary frontier models that Gemini 2.5 Pro competed with through early 2025.
| Model | Lab | Released | Context | Reasoning mode | Multimodal | Headline strength |
|---|---|---|---|---|---|---|
| Gemini 2.5 Pro | Google DeepMind | March 25, 2025 | 1M | Default thinking + Deep Think | Text, image, audio, video, PDF | Reasoning, video, long context, price |
| Claude 3.7 Sonnet | Anthropic | February 24, 2025 | 200k | Configurable extended thinking | Text, image | Coding (SWE-bench), instruction following |
| GPT-4.5 | OpenAI | February 27, 2025 | 128k | None (non-reasoning) | Text, image, audio (limited) | World knowledge, writing |
| Grok 3 | xAI | February 17, 2025 | 1M | Big Brain (toggle) | Text, image | Math (AIME), real-time data via X |
| DeepSeek R1 | DeepSeek | January 20, 2025 | 64k | Always-on reasoning | Text only | Open weights, low cost |
| OpenAI o3-mini | OpenAI | January 31, 2025 | 200k | Always-on reasoning | Text, image | Cost-efficient reasoning |
Gemini 2.5 Pro had a distinctive competitive position. It was the only model in this group to combine top-tier reasoning, native video understanding, and a 1 million token context window in a single endpoint. Its weakness against Claude 3.7 Sonnet was concentrated on SWE-bench Verified and on certain long-horizon agent tasks. Its weakness against GPT-4.5 was concentrated on factual and conversational tasks where extra reasoning sometimes hurt rather than helped (a phenomenon Simon Willison and others noted on SimpleQA-style queries, where the model's first instinct was correct but a long thinking pass talked it out of the right answer).
Gemini 2.5 Pro's pricing and free-tier access made it the most accessible frontier reasoning model for hobbyists and small teams in 2025. Through Google AI Studio, a developer could use the model at no cost within rate limits, with no credit card. That was a meaningful difference compared to Claude 3.7 Sonnet (which required an Anthropic account and paid credits for any sustained usage) and to GPT-4.5 (which was paid only and the most expensive model on the market by per-token pricing).
The next-generation Claude 4 models from Anthropic, Claude Opus 4 and Claude Sonnet 4, released in May 2025, raised the bar on agentic coding (Claude Opus 4 reached 72.5 percent on SWE-bench Verified at launch, and Claude Sonnet 4 reached 72.7 percent). They overtook Gemini 2.5 Pro on most coding benchmarks but did not match its multimodal breadth or long-context ceiling. Through the second half of 2025, the practical choice between Gemini 2.5 Pro and the Claude 4 line came down to workload: video and long-context tasks favored Gemini, agentic coding favored Claude.
Google DeepMind described the safety work for Gemini 2.5 Pro under the rubric of its Frontier Safety Framework, the company's internal policy for tracking and mitigating risks from advanced AI systems. The framework defines critical capability levels for specific risk categories (such as cyber-offense, autonomy, and biological weapon uplift) and triggers internal mitigations when a model approaches one of these thresholds. For Gemini 2.5 Pro, Google reported that the model had been evaluated against the Frontier Safety Framework's then-current thresholds, including red-teaming for cybersecurity and biosecurity uplift, and that no mitigations beyond standard release-time controls were required.
The Deep Think variant was held back specifically for additional safety review. Google's I/O announcement noted that Deep Think would be released to trusted testers first while the company conducted further frontier-safety evaluations, an unusually conservative posture for a Google model release that drew positive comment from the AI safety community. Public access to Deep Think did not roll out broadly until August 2025.
Known limitations carried by Gemini 2.5 Pro through its life cycle included the following.
The knowledge cutoff is January 2025. Events, publications, and data after that date are not in the model's weights. Tool use (Google Search, code execution) is the recommended workaround for queries that need post-cutoff information.
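A sketch of that workaround: enabling the built-in Google Search tool so the model grounds its answer in live results rather than its January 2025 weights.

```python
# Sketch: grounding a post-cutoff question with the Google Search tool.
from google import genai
from google.genai import types

client = genai.Client()
response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents="Summarize this week's biggest AI model releases.",
    config=types.GenerateContentConfig(
        tools=[types.Tool(google_search=types.GoogleSearch())],
    ),
)
print(response.text)
```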
Very long prompts produce noticeable time-to-first-token delays. Prompts in the high hundreds of thousands of tokens can take tens of seconds to begin streaming, which makes them unsuitable for interactive applications.
The announced 2 million token context expansion did not reach general availability for Gemini 2.5 Pro. The 1 million token ceiling was the practical maximum throughout the model's life as the default Pro endpoint.
Like all large language models, Gemini 2.5 Pro can hallucinate. The most-discussed real-world failure modes in mid-2025 included fabricated quotes from documents passed in long context, mixed-up details when reasoning across multiple PDFs in a single prompt, and confident citations of sources that were not present in the input. Google issued multiple checkpoint updates between June and September 2025 to address regressions on these behaviors.
Safety filters occasionally declined benign prompts that contained sensitive keywords. The defaults are configurable through the API, but the filter set is enforced more strictly in consumer-facing surfaces (the Gemini app, AI Overviews) than in developer-facing surfaces.
Deep Think adds latency and token cost. It is not suited to latency-sensitive production applications, and some Deep Think evaluations require explicit opt-in beyond the standard gemini-2.5-pro-deep-think model identifier.
Finally, the model's output is text-only by default. Image, audio, and video generation are handled by separate Google models such as Imagen 3, Veo 2, and the Live API's native audio mode. Native multimodal output, present in Gemini 2.0 Flash for some modalities, was not part of the 2.5 Pro release.
Gemini 3 Pro launched on November 18, 2025, replacing Gemini 2.5 Pro as Google's flagship. The 3 Pro release came with a substantial leap on most benchmarks: 1,501 LMArena Elo, 91.9 percent on GPQA Diamond, 76.2 percent on SWE-bench Verified, and 37.5 percent on Humanity's Last Exam without tools. It kept the 1 million token context window and the thinking-by-default architecture, and added persistent memory and stronger agentic behavior. Pricing also rose, with Gemini 3 Pro charging $2 per million input tokens and $12 per million output tokens at standard context, compared to $1.25 / $10.00 for Gemini 2.5 Pro.
Gemini 2.5 Pro remained available on the Gemini API and Vertex AI after the Gemini 3 launch, both as a price-performance option for production workloads that did not need the extra capability and as a fallback during the rollout of Gemini 3. As of mid-2026 the model continues to be served, though Google's documentation lists a discontinuation date no earlier than October 16, 2026.
The Flash and Flash-Lite siblings followed parallel transitions. Gemini 3 Flash launched in December 2025 and became the new default in the Gemini app, while Gemini 3.1 Pro Preview and Gemini 3.1 Flash Lite arrived in February and March 2026, respectively. The 2.5 line as a whole stepped into a long-tail role, supporting cost-sensitive and latency-sensitive workloads while the 3 series took over the frontier.