Gemini 2.5 Pro is a large language model developed by Google DeepMind and the flagship reasoning model of the Gemini 2.5 family. It was first released as an experimental preview on March 25, 2025, under the model identifier gemini-2.5-pro-exp-03-25. The model introduced "thinking" as a default behavior across the Gemini line, meaning every response begins with an internal chain of reasoning before any text is shown to the user. Gemini 2.5 Pro reached general availability on June 17, 2025, and is offered through Google AI Studio, the Gemini API, and Vertex AI.
At launch the model debuted at the top of the LMArena (Chatbot Arena) leaderboard, becoming the first publicly accessible model to clear an Elo score of approximately 1,300, with a lead of close to 40 points over the previous number one. It also took the top position on WebDev Arena, posted state-of-the-art scores on GPQA Diamond, AIME 2025, and Humanity's Last Exam, and pulled level with or ahead of Anthropic's Claude 3.7 Sonnet and OpenAI's GPT-4.5 across most public reasoning benchmarks. The model's broad capability profile, large context window, and standard-tier pricing of $1.25 per million input tokens and $10.00 per million output tokens led to fast adoption inside developer tools such as Cursor, Replit, and Codecademy through the spring and summer of 2025.
The Gemini 2.5 family also includes the smaller Gemini 2.5 Flash and Gemini 2.5 Flash-Lite siblings, plus an enhanced reasoning configuration called Gemini 2.5 Pro Deep Think, announced at Google I/O on May 20, 2025. Gemini 2.5 Pro was eventually superseded by Gemini 3 Pro in November 2025, but it remained the default "Pro" model on the Gemini API, in the Gemini consumer app's paid tiers, and across Google products for roughly eight months, from its late-March launch until the Gemini 3 rollout.
Gemini 2.5 Pro is the successor to the earlier flagships of Google's Gemini line: it follows Gemini 1.5 Pro (February 2024) and the short-lived Gemini 2.0 Pro Experimental (February 2025), and it is the first Gemini model in which reasoning is the default rather than a separate variant.
Gemini 1.5 Pro arrived in February 2024 with a 1 million token context window, later expanded to 2 million tokens for waitlisted developers. It was the first Gemini model to use a Mixture-of-Experts (MoE) architecture, where only a subset of expert sub-networks is activated for each token, and it supported native input across text, images, audio, video, and code. Gemini 1.5 Pro introduced the very long context window that would come to define the Gemini line, but it had no dedicated chain-of-thought mode.
Gemini 2.0 Flash Experimental was announced on December 11, 2024, and reframed Google's strategy around what the company called the "agentic era." It added native tool use (Google Search, code execution), native audio output, real-time streaming through the Multimodal Live API, and roughly twice the speed of 1.5 Pro at lower cost. On December 19, 2024, Google released the experimental Gemini 2.0 Flash Thinking, a variant trained to spend additional inference compute on internal deliberation before answering. Flash Thinking foreshadowed the architectural direction of the 2.5 family.
In early 2025, Google released a wave of production Gemini 2.0 models: Gemini 2.0 Flash on January 30, then Gemini 2.0 Pro Experimental and Gemini 2.0 Flash-Lite on February 5 and February 25, respectively. Gemini 2.0 Pro Experimental shipped with a 1 million token context window, native tool use, and stronger coding performance, but no built-in reasoning step. It was a brief stop on the way to 2.5.
The broader market context for the 2.5 launch was unusually crowded. DeepSeek R1, an open-weight reasoning model from a Chinese research lab, had landed in January 2025 and demonstrated that distilled chain-of-thought training could produce frontier-class results at a fraction of the cost. xAI released Grok 3 on February 17, 2025, and announced its own Big Brain reasoning mode. OpenAI released GPT-4.5 on February 27, 2025, as a non-reasoning frontier model with strong general knowledge but uneven math and code performance. Anthropic released Claude 3.7 Sonnet on February 24, 2025, which became the first commercial model to expose a configurable extended thinking mode to all developers. By the time Gemini 2.5 Pro arrived four weeks later, every major lab had committed to some form of test-time reasoning. Google's contribution was to make thinking the default rather than a toggle.
Gemini 2.5 Pro was announced on March 25, 2025, in a Google DeepMind blog post titled "Gemini 2.5: Our newest Gemini model with thinking," co-credited to Demis Hassabis and Koray Kavukcuoglu. The post described the model as "our most intelligent model yet" and framed the 2.5 series as a generation in which all Gemini models would be thinking models, capable of reasoning through their own thoughts before producing a response.
The initial release shipped under the experimental model ID gemini-2.5-pro-exp-03-25. It was made available immediately and at no cost in Google AI Studio, and rolled out the same day to the Gemini consumer app for Gemini Advanced subscribers. Vertex AI access followed within days. The free tier included generous rate limits intended to encourage broad early experimentation, and developer documentation listed a 1 million token context window with an announced ("coming soon") expansion to 2 million tokens.
Google's launch claims fell into three categories. First, the model debuted at number one on LMArena (the human-preference leaderboard run by the LMSYS / Chatbot Arena project), with a reported lead of close to 40 Elo points over the second-place model. Independent verification by the Chatbot Arena team placed the launch score in the 1,300 to 1,310 range, making Gemini 2.5 Pro the first model to clear the 1,300 line on a leaderboard whose top scores had hovered in the 1,250 to 1,290 band for most of 2024. Second, Google reported industry-leading scores on Humanity's Last Exam (18.8 percent without tools), GPQA Diamond (84.0 percent), and AIME 2025 (86.7 percent), along with strong but not market-leading results on SWE-bench Verified (63.8 percent). Third, the company emphasized improvements in coding: a top placement on the WebDev Arena leaderboard for full web app generation, and a 70.4 percent score on LiveCodeBench v5 for competitive programming.
The "Pro Experimental" label was important. Google had used the phrase "Pro Experimental" earlier in 2025 for Gemini 2.0 Pro Experimental, signaling that a model was production-eligible in spirit but still subject to changes in pricing, rate limits, and behavior before stable release. Gemini 2.5 Pro Experimental sat in this category from March 25 through early May, when a paid preview tier opened on Vertex AI. The model received an updated checkpoint, gemini-2.5-pro-preview-05-06, around the time of Google I/O 2025 (May 20-21). A further refresh, gemini-2.5-pro-preview-06-05, accompanied the WebDev Arena update and improved coding behavior. The June 17, 2025, GA release retired the preview suffix in favor of the stable identifier gemini-2.5-pro.
Google shipped multiple variants and snapshots of Gemini 2.5 in 2025. The table below summarizes the major identifiers as they appeared in the Gemini API and Vertex AI catalogs.
| Identifier | Status | Key dates | Notes |
|---|---|---|---|
| gemini-2.5-pro-exp-03-25 | Experimental | March 25, 2025 | Initial launch; free in AI Studio |
| gemini-2.5-pro-preview-05-06 | Preview (paid) | May 6, 2025 | Improved coding, WebDev Arena number one |
| gemini-2.5-pro-preview-06-05 | Preview (paid) | June 5, 2025 | Final preview snapshot before GA |
| gemini-2.5-pro | General availability | June 17, 2025 | Stable production model |
| gemini-2.5-pro-deep-think | Limited preview | May 20, 2025 (announced) | Extended reasoning, parallel hypotheses |
| gemini-2.5-flash-preview-04-17 | Preview | April 17, 2025 | First Flash with thinking |
| gemini-2.5-flash | General availability | June 17, 2025 | Stable Flash |
| gemini-2.5-flash-lite-preview-06-17 | Preview | June 17, 2025 | Smallest 2.5 variant |
| gemini-2.5-flash-lite | General availability | July 22, 2025 | Stable Flash-Lite |
| gemini-2.5-pro-preview-tts | Preview | Mid-2025 | Speech synthesis variant |
Gemini 2.5 Pro is the headline model. It uses the largest parameter budget of the 2.5 family and has the longest configurable thinking budget (up to 32,000 tokens of internal reasoning). It is the only Gemini 2.5 model approved for the Deep Think configuration and the only one with the upgraded coding RL training that drove the WebDev Arena number-one finish.
Gemini 2.5 Flash is the first Flash-class model with built-in thinking. It shares the same 1 million token context window as Pro and supports the same input modalities, but uses a smaller and faster expert routing configuration. Google reported that the I/O 2025 preview of Flash used 20 to 30 percent fewer tokens than its predecessor on internal evaluations while improving on reasoning and code benchmarks. Flash reached general availability on June 17, 2025.
Gemini 2.5 Flash-Lite is the smallest and cheapest model in the 2.5 family, targeted at high-volume, latency-sensitive workloads such as text classification, translation, content moderation, and intelligent routing. Google described it as roughly 1.5 times faster than Gemini 2.0 Flash at lower cost. Flash-Lite entered preview on June 17, 2025, and reached general availability on July 22, 2025.
Deep Think is an enhanced reasoning configuration of Gemini 2.5 Pro that uses additional inference techniques to consider multiple hypotheses in parallel before committing to a final answer. It was announced at Google I/O on May 20, 2025, with Google reporting state-of-the-art results on the 2025 USA Mathematical Olympiad (USAMO) under research conditions. Public reports cited a 49.4 percent score on USAMO 2025 and number-one placement on LiveCodeBench's competition-coding leaderboard. Deep Think was held back for an extended safety review and rolled out first to trusted testers via the Gemini API; broader access to Google AI Ultra subscribers in the Gemini app followed in August 2025.
Gemini 2.5 Pro Preview TTS is a speech-synthesis-focused variant offered in preview through the Gemini API. It supports controllable speaker voices, emotional inflection, and over 24 languages. The TTS variant is optimized for structured audio output (audiobook, podcast, narration) rather than open-ended chat.
Google DeepMind has not published a full technical report disclosing the parameter count or training corpus of Gemini 2.5 Pro. The technical brief released in mid-2025 ("Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context") describes the model qualitatively rather than in terms of architecture details. From the brief and from independent commentary, the following design choices are well established.
Gemini 2.5 Pro is a decoder-only Transformer built on the same architectural family as Gemini 1.5 and Gemini 2.0. It uses a Mixture-of-Experts routing scheme inherited from Gemini 1.5, with sparsely activated expert sub-networks chosen per token. The total parameter count has not been disclosed, but discussion among researchers and informed observers (including Simon Willison, Vellum AI, and Helicone analyses) places the active-parameter count in the tens of billions and the total parameter count well above one trillion, consistent with the Gemini family's MoE pattern.
The model is natively multimodal in the same sense as earlier Gemini releases: text, image, audio, video, and PDF inputs are converted to a shared token representation by per-modality encoders, then processed by a single Transformer stack. There is no separate vision tower running on the side. This is in contrast to systems that bolt a vision encoder onto a text-only model after the fact.
The defining 2.5 change is the integration of reasoning into the default response path. Rather than a separate "thinking" model, every Gemini 2.5 Pro request runs through an internal chain-of-thought pass. Developers can set a thinking budget (in tokens) that caps how much computation the model spends on internal deliberation. On the Flash models, setting the budget to zero disables thinking entirely, producing latencies and costs comparable to a non-reasoning model; on 2.5 Pro the budget has a minimum floor and thinking cannot be switched off completely. The default thinking budget is dynamic, based on Google's internal heuristics. Reasoning is trained through a combination of supervised fine-tuning on chain-of-thought traces and reinforcement learning from preference data, drawing on techniques explored earlier in Gemini 2.0 Flash Thinking and on research published by Google DeepMind throughout 2024.
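The budget is exposed through the API's generation config. A minimal sketch using the google-genai Python SDK, assuming the stable gemini-2.5-pro identifier and a GEMINI_API_KEY set in the environment:

```python
# Minimal sketch: capping the thinking budget via the google-genai SDK.
from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment

response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents="Prove that the square root of 2 is irrational.",
    config=types.GenerateContentConfig(
        # Cap internal deliberation at 4,096 tokens. Omitting the field
        # (or passing -1) requests the dynamic default budget.
        thinking_config=types.ThinkingConfig(thinking_budget=4096),
    ),
)
print(response.text)
```

Because thinking tokens are billed at the output rate, the budget also functions as a direct cost control.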
The model supports inspectable thought summaries: a redacted form of its internal reasoning that developers can return alongside the final answer. Full chain-of-thought transcripts are not returned in production, partly to discourage distillation by competitors and partly to keep raw deliberation, which can include exploratory or partially incorrect intermediate steps, out of user-facing surfaces.
Gemini 2.5 Pro's reasoning behavior is most visible in long, multi-step problems. Internal Google evaluations described in the launch blog show the model decomposing graduate-level science problems, working through multi-step proofs in mathematics, and reading lengthy code diffs before proposing a fix. Developers can ask the model to summarize its thinking, which yields a paragraph or two of human-readable reasoning that can be used for debugging or transparency.
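Programmatically, the same summaries are available through the thinking config. A sketch, again with the google-genai SDK: `include_thoughts=True` returns the summarized reasoning as separate response parts flagged with `thought=True`.

```python
# Sketch: returning a thought summary alongside the final answer.
from google import genai
from google.genai import types

client = genai.Client()
response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents="Which weighs more: a pound of feathers or a kilogram of iron?",
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(include_thoughts=True),
    ),
)
for part in response.candidates[0].content.parts:
    prefix = "[thought]" if part.thought else "[answer]"
    print(prefix, part.text)
```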
The configurable thinking budget (up to 32,000 tokens) lets applications trade latency for accuracy. Internal Google plots show smooth improvements on AIME, GPQA Diamond, and Humanity's Last Exam as the budget increases. The diminishing-returns point depends on the task; for many real-world coding queries, a budget of a few thousand tokens captures most of the benefit.
Google invested heavily in coding for the 2.5 series. The May 6, 2025, preview snapshot (gemini-2.5-pro-preview-05-06) was specifically positioned as a coding upgrade, and Google's developer blog reported large improvements on the company's internal evals. Public results that landed in the Vellum, Helicone, and Artificial Analysis comparisons placed Gemini 2.5 Pro at or near the top on most coding benchmarks: WebDev Arena (number one for full app generation), Aider Polyglot (74.0 percent on real-world code editing across multiple languages), and LiveCodeBench v5 (70.4 percent on competitive programming).
On SWE-bench Verified, the standard benchmark for autonomous bug-fixing on real GitHub repositories, the model scored 63.8 percent at launch using a custom agent harness. That number trailed Claude 3.7 Sonnet's 70.3 percent (extended thinking) on the same benchmark, and was acknowledged as such by Google. SWE-bench Verified remained the one major coding benchmark where Anthropic's Claude line held a clear lead through mid-2025.
Qualitatively, Replit, Cursor, and Codecademy all integrated Gemini 2.5 Pro within weeks of the preview. Replit president Michele Catasta publicly described the model as "the best frontier model for the capability over latency ratio" for their agent use cases, and a Replit engineer compared its judgment to that of "a more senior developer." Cursor users widely adopted the model for refactoring and multi-file edits, particularly in front-end work where the WebDev Arena strength translated to better UI generation.
Gemini 2.5 Pro accepts text, images, audio, video, and PDFs in the same prompt. Concrete capacity limits include up to 3,000 images per prompt (each up to 7 MB inline or 30 MB through Cloud Storage), approximately 45 minutes of video with audio (or one hour without), up to roughly 8.4 hours of audio per prompt, and PDFs up to 1,000 pages or 50 MB.
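As an illustration of the unified input path, a hedged sketch passing a PDF inline alongside a text instruction with the google-genai SDK (the file name is hypothetical; files that push a request past roughly 20 MB would go through the Files API instead):

```python
# Sketch: one prompt mixing a PDF and a text instruction.
from pathlib import Path
from google import genai
from google.genai import types

client = genai.Client()
pdf_bytes = Path("report.pdf").read_bytes()  # hypothetical local file

response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents=[
        types.Part.from_bytes(data=pdf_bytes, mime_type="application/pdf"),
        "Summarize the key findings in three bullet points.",
    ],
)
print(response.text)
```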
Video understanding is one of the model's most differentiated capabilities. At Google I/O 2025, Google showed a preview version scoring 84.8 percent on VideoMME, the standard benchmark for multimodal video comprehension. The same demo showcased a "video-to-app" workflow where the model watched a recording of a UI interaction and produced a working web application that reproduced the interaction. The capability was widely shared on social media and was one of the reasons Gemini 2.5 Pro became the preferred model for creative and educational video tasks in mid-2025.
Audio handling is also strong. The model can transcribe speech, describe ambient sounds, identify musical instruments, and reason about spoken content over hours of input. The Live API exposes a real-time streaming mode that supports interruptions and turn-taking.
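A rough sketch of the Live API's turn-based streaming pattern, using the SDK's async client and text output for brevity; the model identifier here is a placeholder, since the Live API uses dedicated live model IDs rather than gemini-2.5-pro itself:

```python
# Sketch: real-time streaming over the Live API (text-only for brevity).
import asyncio
from google import genai
from google.genai import types

client = genai.Client()

async def main():
    config = types.LiveConnectConfig(response_modalities=["TEXT"])
    async with client.aio.live.connect(
        model="gemini-live-placeholder",  # placeholder; see the model catalog
        config=config,
    ) as session:
        await session.send_client_content(
            turns=types.Content(role="user", parts=[types.Part(text="Hello!")])
        )
        async for message in session.receive():
            if message.text:
                print(message.text, end="")

asyncio.run(main())
```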
Gemini 2.5 Pro launched with a 1 million token context window. Google announced at the same time that 2 million tokens were "coming soon," matching the eventual ceiling of Gemini 1.5 Pro. The expansion never shipped: the ceiling remained 1 million tokens at GA and for the rest of the model's run as the default Gemini Pro through November 2025.
A 1 million token window is enough for roughly 750,000 words of English text, several hours of audio, or a codebase of tens of thousands of lines. On the MRCR (Multi-Round Coreference Resolution) benchmark at 128k context, the model scored 94.5 percent, indicating strong long-context retrieval. Citizen Health, an early enterprise customer, used the model to ingest decades of longitudinal patient records, including physician notes, imaging reports, and genomic data, in a single API call.
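When filling a window that large, it helps to measure before sending. A small sketch using the SDK's count_tokens endpoint (the dump file is hypothetical):

```python
# Sketch: checking how much of the 1M-token window an input consumes.
from pathlib import Path
from google import genai

client = genai.Client()
corpus = Path("codebase_dump.txt").read_text()  # hypothetical file

count = client.models.count_tokens(model="gemini-2.5-pro", contents=corpus)
print(f"{count.total_tokens:,} tokens of the ~1,048,576-token window")
```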
Gemini 2.5 Pro supports function calling, structured outputs, batch processing, and context caching, all of which are important for building AI agents that interact with external tools and APIs over multiple steps. The model also supports the Model Context Protocol (MCP) standard for connecting to data sources, and Google integrated Project Mariner's computer-use capabilities into the Gemini 2.5 API as a preview, which allowed the model to navigate web interfaces and interact with desktop applications autonomously.
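As a concrete illustration of the function-calling path, a sketch using the SDK's automatic mode, which accepts a plain Python callable as a tool; get_order_status is a hypothetical stub:

```python
# Sketch: automatic function calling with a hypothetical tool.
from google import genai
from google.genai import types

def get_order_status(order_id: str) -> str:
    """Return the shipping status for an order ID (illustrative stub)."""
    return "shipped" if order_id == "A-1001" else "unknown"

client = genai.Client()
response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents="Has order A-1001 shipped yet?",
    config=types.GenerateContentConfig(tools=[get_order_status]),
)
print(response.text)  # the SDK executes the call and feeds the result back
```

In automatic mode the SDK runs the tool locally and returns the result to the model in a second turn, so multi-step agent loops need no manual plumbing for simple cases.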
The table below summarizes the public benchmark numbers Google reported at and shortly after the March 2025 launch, along with their counterparts on competing frontier models from the same period.
| Benchmark | Gemini 2.5 Pro | Claude 3.7 Sonnet (ext. thinking) | GPT-4.5 | Grok 3 (Big Brain) | DeepSeek R1 |
|---|---|---|---|---|---|
| LMArena Elo (March 2025) | ~1,300+ (#1) | ~1,290 | ~1,280 | ~1,275 | ~1,265 |
| GPQA Diamond | 84.0% | 84.8% | 71.4% | 84.6% | 71.5% |
| AIME 2025 | 86.7% | 80.0% (approx) | 36.7% | 93.3% | 79.8% |
| AIME 2024 | 92.0% | 78.0% | 36.7% | 93.3% (cited) | 79.8% |
| SWE-bench Verified | 63.8% | 70.3% | 38.0% | n/a | 49.2% |
| Humanity's Last Exam (no tools) | 18.8% | 8.9% | 6.4% | n/a | 9.4% |
| MMMU | 81.7% | 75.0% | 74.4% | 78.0% | n/a |
| Global MMLU (Lite) | 89.8% | n/a | 89.6% | n/a | n/a |
| LiveCodeBench v5 | 70.4% | n/a | n/a | 79.4% | n/a |
| Aider Polyglot | 74.0% | 64.9% | 44.9% | n/a | 56.9% |
| MRCR (128k) | 94.5% | n/a | n/a | n/a | n/a |
| VideoMME (I/O preview) | 84.8% | n/a | n/a | n/a | n/a |
| SimpleQA | 52.9% | 28.2% | 62.5% | n/a | n/a |
A few comparisons are worth noting. Claude 3.7 Sonnet held a clear lead on SWE-bench Verified, the headline benchmark for agentic coding. Grok 3's Big Brain mode posted higher AIME 2025 numbers than Gemini 2.5 Pro, but trailed it on broader reasoning benchmarks. GPT-4.5 was strong on world-knowledge benchmarks like SimpleQA but weak on competition mathematics, since it was a non-reasoning model. DeepSeek R1, despite being open-weight and substantially cheaper to serve, was not competitive on multimodal or long-context tasks.
Gemini 2.5 Pro Deep Think pushed several of these numbers further. Public reporting cited a 49.4 percent score on USAMO 2025, gold-medal-level performance on the 2025 International Mathematical Olympiad in research evaluations, an 84.0 percent score on MMMU at the Deep Think setting, and a top-of-leaderboard finish on LiveCodeBench's competition coding category. Deep Think also led the FrontierMath benchmark in its tier 1 to 3 range with approximately 29 percent.
Independent evaluations through 2025 reinforced the launch numbers but added important nuance. METR's preregistered task-suite evaluations placed Gemini 2.5 Pro near but not at the frontier on long-horizon agent tasks, behind both Claude 3.7 Sonnet and OpenAI's o3 family on certain rollouts. Vellum's coding leaderboard had Gemini 2.5 Pro and Claude 3.7 Sonnet trading the top spot through April and May 2025 depending on the task type. Simon Willison's running notes called the model "genuinely good at hard things" and singled out the video-to-code demo as the most novel new capability he had seen in 2025.
Gemini 2.5 Pro is billed per million tokens through both the Gemini API (Google AI Studio) and Vertex AI. Pricing is tiered by prompt size: requests with up to 200,000 input tokens are charged at the standard rate, while requests with longer prompts use a higher per-token rate. Multiple service tiers exist. The free rate-limited tier in Google AI Studio is intended for experimentation and small projects, and paid Standard, Batch (also called Flex), and Priority tiers cover production workloads.
Gemini 2.5 Pro, in USD per million tokens:

| Tier | Input (≤200k prompt) | Input (>200k prompt) | Output (≤200k prompt) | Output (>200k prompt) |
|---|---|---|---|---|
| Standard | $1.25 | $2.50 | $10.00 | $15.00 |
| Batch / Flex | $0.625 | $1.25 | $5.00 | $7.50 |
| Priority | $2.25 | $4.50 | $18.00 | $27.00 |
Context caching is available at $0.125 per million tokens for prompts under 200k tokens ($0.25 above 200k), with cache storage at $4.50 per million tokens per hour.
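A sketch of the caching flow with the google-genai SDK and a hypothetical large document: the cache is created once, then referenced by name on each subsequent call so the shared prefix bills at the cached-token rate.

```python
# Sketch: cache a large shared prefix, then query it repeatedly.
from pathlib import Path
from google import genai
from google.genai import types

client = genai.Client()
manual = Path("product_manual.txt").read_text()  # hypothetical document

cache = client.caches.create(
    model="gemini-2.5-pro",
    config=types.CreateCachedContentConfig(
        contents=[manual],
        ttl="3600s",  # storage is billed per token-hour while the cache lives
    ),
)
response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents="What does the manual say about battery care?",
    config=types.GenerateContentConfig(cached_content=cache.name),
)
print(response.text)
```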
Gemini 2.5 Flash, in USD per million tokens:

| Tier | Input (text/image/video) | Input (audio) | Output |
|---|---|---|---|
| Standard | $0.30 | $1.00 | $2.50 |
| Batch / Flex | $0.15 | $0.50 | $1.25 |
| Priority | $0.54 | $1.80 | $4.50 |
Gemini 2.5 Flash-Lite, in USD per million tokens:

| Tier | Input (text/image/video) | Input (audio) | Output |
|---|---|---|---|
| Standard | $0.10 | $0.30 | $0.40 |
| Batch / Flex | $0.05 | $0.15 | $0.20 |
| Priority | $0.18 | $0.54 | $0.72 |
Gemini 2.5 Pro's $1.25 input and $10.00 output Standard pricing put it well below GPT-4.5 (which OpenAI priced at $75 per million input tokens and $150 per million output tokens, more than ten times higher) and roughly in line with Claude 3.7 Sonnet ($3 / $15 per million tokens). Pricing for the long-context tier (above 200k input tokens) doubled the input cost and increased output cost by 50 percent, similar to how Anthropic and OpenAI have approached very long contexts. The Batch / Flex tier offered a roughly 50 percent discount for asynchronous workloads where latency was not critical.
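A worked example of the tiered math, using the Standard rates in the table above: a single call with a 300,000-token prompt falls in the long-context tier, so both input and output bill at the higher rate.

```python
# Worked example: Standard-tier cost for one long-context request.
INPUT_RATE = 2.50 / 1_000_000    # USD per input token, prompts >200k tokens
OUTPUT_RATE = 15.00 / 1_000_000  # USD per output token, prompts >200k tokens

input_tokens, output_tokens = 300_000, 2_000
cost = input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE
print(f"${cost:.2f}")  # $0.75 input + $0.03 output = $0.78
```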
Availability extended across three primary surfaces. Free experimentation lived in Google AI Studio, with rate limits of a few requests per minute on the experimental endpoint. Production use ran through the paid Gemini API or through Vertex AI for enterprise customers needing fine-tuning, audit logs, and Google Cloud integrations. Consumer access came through the Gemini app's paid tiers (Gemini Advanced, then later Google AI Pro and Google AI Ultra after Google's May 2025 subscription restructuring).
Third-party adoption began within hours of the March 25 launch. Cursor added the experimental endpoint as a selectable model the same week and benchmarked it as one of the top performers on its internal coding evals. Replit integrated the model into Replit Agent and Ghostwriter; the company publicly described the integration as their preferred frontier model for agent workflows. Codecademy added Gemini 2.5 Pro to its AI tutoring features. Aider, the open-source command-line code editor, included it in the polyglot leaderboard, where it took the top spot for several weeks.
In the broader developer ecosystem, the model became a default choice for chat applications and tools that wanted reasoning at low cost. LangChain, LlamaIndex, and the Vercel AI SDK shipped first-class support within the first month. Companies building AI agents layered Gemini 2.5 Pro into pipelines that previously required calls to OpenAI o1 or Claude 3.7 Sonnet, citing the lower price and the larger context window.
Enterprise adoption ran along two tracks. Vertex AI customers used it for code generation, document understanding, and translation across long enterprise documents. Citizen Health used it on patient records. Cognition (the company behind Devin) integrated it as one of several backend models, comparing its performance against Anthropic and OpenAI counterparts on real customer workloads.
Inside Google, Gemini 2.5 Pro became the default "Pro" model in the Gemini app's paid tiers, took over reasoning queries in Search's AI Mode, and provided the underlying intelligence for Project Astra (Google's universal-assistant prototype) and Project Mariner (its browsing agent). Workspace customers saw Gemini 2.5 Pro power document-aware chat in Gmail, Docs, and Drive, particularly for summarization tasks that benefit from the long context window.
The initial March 25 release drew substantial attention from the developer community. The 1,300-plus Elo score on LMArena was widely noted as the first time any model had cleared that line. Simon Willison, in his March 25 write-up, called it "the new state-of-the-art for everything that involves complicated reasoning, including coding" and singled out the price-to-performance ratio as the model's defining feature.
The Vellum AI coding leaderboard, run by the AI development platform of the same name, had Gemini 2.5 Pro and Claude 3.7 Sonnet trading the top spot through April. Vellum's commentary described the two models as "functionally interchangeable" for most production code tasks, with Gemini cheaper and Claude marginally more reliable on long-horizon agentic flows.
METR's preregistered evaluations, published in late April 2025, treated the model carefully. METR's task suite measures the time horizon over which an agent can autonomously make progress on a real software task. Gemini 2.5 Pro fell behind Claude 3.7 Sonnet and OpenAI's o3 in median horizon length, but came in ahead of GPT-4.5 and DeepSeek R1. METR noted that the gap with Claude 3.7 Sonnet was not large and was task-dependent, with Gemini doing relatively better on debugging and worse on multi-step refactoring.
Independent reviewers including Simon Willison and Latent Space picked out video-to-code, the 1 million token context window in practice, and the price as the standout features. Critics pointed to occasional refusals on benign prompts containing sensitive keywords, hallucinated quotes from documents in long-context queries, and a regression in response quality observed on certain checkpoints in mid-2025. The Google AI developer forum thread "Gemini 2.5 Pro's Response Quality Regression" became one of the most-discussed support threads on the forum, drawing acknowledgement from Google product managers and a series of fixes through July and August 2025.
Reception inside academia was generally positive. The model was adopted as a baseline in many subsequent reasoning-model papers, often replacing GPT-4o or Claude 3 Opus as the stand-in for a frontier closed model. Researchers cited the publicly available thought summaries as a useful diagnostic tool for understanding chain-of-thought failures, even though the redacted form prevents direct study of the underlying reasoning trace.
The table below summarizes the major contemporary frontier models that Gemini 2.5 Pro competed with through early 2025.
| Model | Lab | Released | Context | Reasoning mode | Multimodal | Headline strength |
|---|---|---|---|---|---|---|
| Gemini 2.5 Pro | Google DeepMind | March 25, 2025 | 1M | Default thinking + Deep Think | Text, image, audio, video, PDF | Reasoning, video, long context, price |
| Claude 3.7 Sonnet | Anthropic | February 24, 2025 | 200k | Configurable extended thinking | Text, image | Coding (SWE-bench), instruction following |
| GPT-4.5 | OpenAI | February 27, 2025 | 128k | None (non-reasoning) | Text, image, audio (limited) | World knowledge, writing |
| Grok 3 | xAI | February 17, 2025 | 1M | Big Brain (toggle) | Text, image | Math (AIME), real-time data via X |
| DeepSeek R1 | DeepSeek | January 20, 2025 | 64k | Always-on reasoning | Text only | Open weights, low cost |
| OpenAI o3-mini | OpenAI | January 31, 2025 | 200k | Always-on reasoning | Text, image | Cost-efficient reasoning |
Gemini 2.5 Pro had a distinctive competitive position. It was the only model in this group to combine top-tier reasoning, native video understanding, and a 1 million token context window in a single endpoint. Its weakness against Claude 3.7 Sonnet was concentrated on SWE-bench Verified and on certain long-horizon agent tasks. Its weakness against GPT-4.5 was concentrated on factual and conversational tasks where extra reasoning sometimes hurt rather than helped (a phenomenon Simon Willison and others noted on SimpleQA-style queries, where the model's first instinct was correct but a long thinking pass talked it out of the right answer).
Gemini 2.5 Pro's pricing and free-tier access made it the most accessible frontier reasoning model for hobbyists and small teams in 2025. Through Google AI Studio, a developer could use the model at no cost within rate limits, with no credit card. That was a meaningful difference compared to Claude 3.7 Sonnet (which required an Anthropic account and paid credits for any sustained usage) and to GPT-4.5 (which was paid only and the most expensive model on the market by per-token pricing).
The next-generation Claude 4 models from Anthropic, Claude Opus 4 and Claude Sonnet 4, released in May 2025, raised the bar on agentic coding (Claude Opus 4 reached 72.5 percent on SWE-bench Verified at launch, and Claude Sonnet 4 reached 72.7 percent). They overtook Gemini 2.5 Pro on most coding benchmarks but did not match its multimodal breadth or long-context ceiling. Through the second half of 2025, the practical choice between Gemini 2.5 Pro and the Claude 4 line came down to workload: video and long-context tasks favored Gemini, agentic coding favored Claude.
Google DeepMind described the safety work for Gemini 2.5 Pro under the rubric of its Frontier Safety Framework, the company's internal policy for tracking and mitigating risks from advanced AI systems. The framework defines critical capability levels for specific risk categories (such as cyber-offense, autonomy, and biological weapon uplift) and triggers internal mitigations when a model approaches one of these thresholds. For Gemini 2.5 Pro, Google reported that the model had been evaluated against the Frontier Safety Framework's then-current thresholds, including red-teaming for cybersecurity and biosecurity uplift, and that no mitigations beyond standard release-time controls were required.
The Deep Think variant was held back specifically for additional safety review. Google's I/O announcement noted that Deep Think would be released to trusted testers first while the company conducted further frontier-safety evaluations, an unusually conservative posture for a Google model release that drew positive comment from the AI safety community. Public access to Deep Think did not roll out broadly until August 2025.
Known limitations carried by Gemini 2.5 Pro through its life cycle included the following.
The knowledge cutoff is January 2025. Events, publications, and data after that date are not in the model's weights. Tool use (Google Search, code execution) is the recommended workaround for queries that need post-cutoff information.
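A sketch of that workaround: enabling the built-in Google Search tool so the model grounds its answer in live results rather than its January 2025 weights.

```python
# Sketch: grounding a post-cutoff question with the Google Search tool.
from google import genai
from google.genai import types

client = genai.Client()
response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents="Summarize this week's biggest AI model releases.",
    config=types.GenerateContentConfig(
        tools=[types.Tool(google_search=types.GoogleSearch())],
    ),
)
print(response.text)
```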
Very long prompts produce noticeable time-to-first-token delays. Prompts in the high hundreds of thousands of tokens can take tens of seconds to begin streaming, which makes them unsuitable for interactive applications.
The announced 2 million token context expansion did not reach general availability for Gemini 2.5 Pro. The 1 million token ceiling was the practical maximum throughout the model's life as the default Pro endpoint.
Like all large language models, Gemini 2.5 Pro can hallucinate. The most-discussed real-world failure modes in mid-2025 included fabricated quotes from documents passed in long context, mixed-up details when reasoning across multiple PDFs in a single prompt, and confident citations of sources that were not present in the input. Google issued multiple checkpoint updates between June and September 2025 to address regressions on these behaviors.
Safety filters occasionally declined benign prompts that contained sensitive keywords. The defaults are configurable through the API, but the filter set is enforced more strictly in consumer-facing surfaces (the Gemini app, AI Overviews) than in developer-facing surfaces.
Deep Think adds latency and token cost. It is not suited to latency-sensitive production applications, and some Deep Think evaluations require explicit opt-in beyond the standard gemini-2.5-pro-deep-think model identifier.
Finally, the model's output is text-only by default. Image, audio, and video generation are handled by separate Google models such as Imagen 3, Veo 2, and the Live API's native audio mode. Native multimodal output, present in Gemini 2.0 Flash for some modalities, was not part of the 2.5 Pro release.
Gemini 3 Pro launched on November 18, 2025, replacing Gemini 2.5 Pro as Google's flagship. The 3 Pro release came with a substantial leap on most benchmarks: 1,501 LMArena Elo, 91.9 percent on GPQA Diamond, 76.2 percent on SWE-bench Verified, and 37.5 percent on Humanity's Last Exam without tools. It kept the 1 million token context window and the thinking-by-default architecture, and added persistent memory and stronger agentic behavior. Pricing also rose, with Gemini 3 Pro charging $2 per million input tokens and $12 per million output tokens at standard context, compared to $1.25 / $10.00 for Gemini 2.5 Pro.
Gemini 2.5 Pro remained available on the Gemini API and Vertex AI after the Gemini 3 launch, both as a price-performance option for production workloads that did not need the extra capability and as a fallback during the rollout of Gemini 3. As of mid-2026 the model continues to be served, though Google's documentation lists a discontinuation date no earlier than October 16, 2026.
The Flash and Flash-Lite siblings followed parallel transitions. Gemini 3 Flash launched in December 2025 and became the new default in the Gemini app, while Gemini 3.1 Pro Preview and Gemini 3.1 Flash Lite arrived in February and March 2026, respectively. The 2.5 line as a whole stepped into a long-tail role, supporting cost-sensitive and latency-sensitive workloads while the 3 series took over the frontier.