Gemini 3 Pro is the flagship preview model in Google DeepMind's Gemini 3 family of multimodal models. Google introduced the Gemini 3 series on November 18, 2025 and shipped Gemini 3 Pro in preview the same day across the Gemini app, Google AI Studio, Vertex AI, the Gemini API, AI Mode in Search, and a new agentic coding platform called Google Antigravity.[1][2][3]
Google positioned Gemini 3 Pro as its most intelligent model at launch and said it significantly outperformed Gemini 2.5 Pro, the previous flagship in the Gemini line, on every major reasoning, coding, and multimodal benchmark the company tracks.[1] Internally Google described it as the start of the Gemini 3 era and promised additional Gemini 3 variants in the months that followed.
The model became the first publicly accessible large language model to break the 1500 Elo barrier on LMArena, debuting at 1501 Elo on the chatbot leaderboard at launch and ranking first across LMArena's text reasoning, vision, coding, and web development tracks.[1][4]
Gemini 3 Pro is built on a sparse mixture-of-experts (MoE) transformer architecture, a design that activates only a subset of model parameters per input token. This lets Google decouple total model capacity from per-token computation cost, supporting the 1 million token context window while maintaining practical inference latency.[9]
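The per-token routing behavior described above can be illustrated with a minimal top-k mixture-of-experts sketch. This is a generic textbook illustration, not Google's implementation: the router (a toy dot product), the expert functions, and the choice of k are all invented for the example. The point it demonstrates is that only k of the experts execute per token, so per-token compute stays flat as the total expert pool (and thus total parameter count) grows.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def moe_forward(token, experts, router_weights, k=2):
    """Route one token to its top-k experts and mix their outputs.

    Only k of len(experts) expert functions run for this token;
    the rest of the model's capacity sits idle, which is what
    decouples total parameters from per-token compute.
    """
    # Router produces one logit per expert (toy dot-product router).
    logits = [sum(w * x for w, x in zip(wr, token)) for wr in router_weights]
    probs = softmax(logits)
    # Keep only the k highest-probability experts.
    top = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    # Weighted mix of the selected experts' outputs.
    out = [0.0] * len(token)
    for i in top:
        y = experts[i](token)
        out = [o + (probs[i] / norm) * v for o, v in zip(out, y)]
    return out, top
```

In a real MoE transformer this routing happens inside each MoE layer over learned expert MLPs; the sketch collapses that to plain functions to keep the mechanism visible.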
Gemini 3 Pro arrived roughly eight months after Gemini 2.5 Pro, which launched in March 2025. The 2.5 generation introduced Google's first serious thinking-mode capabilities and moved the Gemini line into contention with OpenAI's o-series reasoning models. By the second half of 2025, the frontier had grown noticeably more competitive: OpenAI shipped GPT-5 in August 2025 and Claude Opus 4.5 from Anthropic followed in late November.
The Gemini 3 launch was Google DeepMind's response to this competitive pressure. Google paired the model release with two major ecosystem moves: the launch of Google Antigravity, a new agentic coding platform, and the rollout of AI Mode in Google Search to Pro and Ultra subscribers. Both moves were designed to make Gemini 3 Pro the backbone of Google's developer and consumer AI products simultaneously rather than a standalone API product.
In the weeks after launch, the Deep Think extended reasoning mode received its first public rollout to Google AI Ultra subscribers. A wider research access program followed in February 2026 with significantly improved benchmark scores, turning Deep Think into a distinct product track rather than a minor variant.
Gemini 3 Pro sits at the top of Google's general-purpose Gemini lineup. The 3 series replaced the 2.5 line, which had previously replaced the 2.0 and 1.5 lines in the same flagship slot. The Gemini 3 family also includes a smaller, cheaper Gemini 3 Flash variant and a specialized Gemini 3 Deep Think reasoning mode that builds on Gemini 3 Pro's base weights but spends more compute on each response.
The table below shows where Gemini 3 Pro fits relative to its immediate predecessors and stablemates as of early 2026.
| Model | Family | Position | Released |
|---|---|---|---|
| Gemini 1.5 Pro | Gemini 1.5 | Flagship | February 2024 |
| Gemini 2.0 Pro | Gemini 2.0 | Flagship | February 2025 |
| Gemini 2.5 Pro | Gemini 2.5 | Flagship | March 2025 |
| Gemini 3 Pro | Gemini 3 | Flagship preview | November 18, 2025 |
| Gemini 3 Flash | Gemini 3 | Smaller, faster sibling | November 2025 |
| Gemini 3 Deep Think | Gemini 3 | Extended reasoning mode on top of 3 Pro | December 2025 (Ultra), February 2026 (research access) |
| Gemini 3.1 Pro | Gemini 3 | Refresh of the flagship preview | February 19, 2026 |
The "3.1 Pro" label that appeared on the Google DeepMind product page in early 2026 is best read as a successor preview release in the same flagship line rather than a separate product. Google's own model card frames Gemini 3.1 Pro as built on the Gemini 3 Pro base, with a step up in core reasoning and updated tool-use behavior. Gemini 3.1 Pro scored 77.1% on ARC-AGI-2, more than double Gemini 3 Pro's 31.1%; that improvement was the headline reason Google cited for urgency when it deprecated the original gemini-3-pro-preview model string on March 9, 2026.[5][6]
The table below summarizes the public technical surface of Gemini 3 Pro as documented by Google at launch and on the Gemini API model pages.[2][3][7]
| Field | Value |
|---|---|
| Status | Preview at launch (November 18, 2025); deprecated in the Gemini API on March 9, 2026 in favor of Gemini 3.1 Pro |
| Architecture | Sparse mixture-of-experts (MoE) transformer |
| Training infrastructure | Google TPUs, JAX, ML Pathways |
| Input modalities | Text, image, audio, video, PDF, code repositories |
| Output modalities | Text |
| Context window | 1,000,000 input tokens |
| Output limit | 64,000 tokens |
| Knowledge cutoff | January 2025 |
| Tool use | Function calling, structured output, code execution, search as a tool, URL context, client-side and server-side bash tools |
| Reasoning controls | Adjustable thinking level, configurable media resolution, stricter thought signature validation |
| Availability | Gemini app, AI Mode in Search, Google AI Studio, Vertex AI, Gemini API, Gemini CLI, Google Antigravity, NotebookLM (Pro/Ultra), third-party tools (Cursor, GitHub, JetBrains, Manus, Replit, Cline, Android Studio) |
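The tool-use and reasoning controls listed above surface as fields on a Gemini API generateContent request. The fragment below is a sketch of one such request body. The camelCase field names follow the Gemini API's documented REST conventions, but the exact names and allowed values (notably `thinkingLevel` and `mediaResolution`) should be checked against the current API reference, and the `get_ticket_status` function declaration is purely illustrative, not part of any real API.

```json
{
  "contents": [
    {"role": "user", "parts": [{"text": "Summarize the attached report."}]}
  ],
  "tools": [
    {"functionDeclarations": [
      {
        "name": "get_ticket_status",
        "description": "Illustrative function for this example only.",
        "parameters": {
          "type": "object",
          "properties": {"ticket_id": {"type": "string"}},
          "required": ["ticket_id"]
        }
      }
    ]}
  ],
  "generationConfig": {
    "thinkingConfig": {"thinkingLevel": "high"},
    "mediaResolution": "MEDIA_RESOLUTION_LOW"
  }
}
```

Structured output, code execution, search grounding, and URL context are enabled through sibling fields in the same `tools` and `generationConfig` blocks.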
Google has not officially disclosed Gemini 3 Pro's total parameter count. Sparse MoE architectures are designed so that only a fraction of the total weights are active on any given token, which means the raw parameter number is less meaningful for performance comparisons than it would be for a dense model. Industry analysis has suggested total parameter counts in the range of hundreds of billions to over one trillion, but these figures are not confirmed by Google.[9]
Google's developer documentation describes the 1 million token context as covering long documents, extended chats, and entire codebases, with media resolution controls that let developers trade input cost for visual fidelity on images and video. Community reports from the Google developer forums indicate that practical retrieval performance degrades meaningfully above 200,000 tokens, with some users reporting increased hallucinations and context loss at 800,000 tokens or above.[3][11]
Gemini 3 Pro is a natively multimodal model. It accepts text, images, audio, video, and PDFs in the same prompt and answers in text. Developers can mix modalities freely and combine them with tool calls in a single request.[2][3]
On vision tasks, Gemini 3 Pro scored 72.7% on ScreenSpot-Pro, a benchmark measuring a model's ability to understand and interact with user interface screenshots. That score substantially exceeded Claude 3.5 Sonnet (36.2%) and GPT-5.1 (3.5%) on the same task, a gap that analysts attributed to Google's deeper investment in training on mobile and desktop UI data.[9]
Video understanding was another headline capability at launch. Gemini 3 Pro scored 87.6% on Video-MMMU, placing it at the top of that leaderboard. Google characterized the improvement as reflecting an ability to track cause-and-effect relationships across long video clips rather than merely identifying static objects in individual frames.
For audio, Gemini 3 Pro accepts raw audio input as part of a multimodal prompt, enabling tasks like transcription with visual context, audio-visual question answering, and language identification across mixed-modality inputs.
Gemini 3 Pro itself outputs only text, but the Gemini app and API pair it with Google's image and video generation stack. Veo 3.1, Google's video generation model, is available to paid Gemini app users for clip generation up to 8 seconds and is exposed as a Veo API surface for developers. Project Mariner, Google's agentic browsing experiment, also became available in the United States to AI Ultra subscribers around the Gemini 3 launch and uses Gemini 3 Pro for planning.
The table below collects benchmark scores from the November 18, 2025 launch posts and the Gemini 3 Pro model card. Where Google reported a separate Deep Think result, that score appears in parentheses after the base Gemini 3 Pro number. The February 2026 Deep Think research access scores, which showed further gains, are noted separately.[1][3][8]
| Benchmark | Gemini 3 Pro (base) | Deep Think (Dec 2025) | Deep Think (Feb 2026) | Notes |
|---|---|---|---|---|
| LMArena Elo | 1501 | n/a | n/a | First model above 1500; topped overall, vision, coding, web dev tracks |
| Humanity's Last Exam (no tools) | 37.5% | 41.0% | 48.4% | Reasoning across academic disciplines |
| GPQA Diamond | 91.9% | 93.8% | n/a | Graduate-level science questions, above human expert level (~89.8%) |
| MathArena Apex | 23.4% | n/a | n/a | Hardest published math contest set |
| AIME 2025 (no code) | 95.0% | n/a | n/a | American Invitational Mathematics Examination |
| AIME 2025 (with code execution) | 100% | n/a | n/a | Tied with GPT-5.1 |
| MMMU-Pro | 81.0% | n/a | n/a | Multimodal university-level reasoning |
| Video-MMMU | 87.6% | n/a | n/a | Video understanding |
| MMMLU | 91.8% | n/a | n/a | Massive Multitask Language Understanding, multilingual |
| Global PIQA | 93.4% | n/a | n/a | Multilingual physical commonsense |
| SimpleQA Verified | 72.1% | n/a | n/a | Short-form factual QA |
| SWE-bench Verified | 76.2% | n/a | n/a | Real GitHub bug-fix tasks |
| LiveCodeBench Pro Elo | 2439 | n/a | n/a | Competitive coding |
| Terminal-Bench 2.0 | 54.2% | n/a | n/a | Tool use and shell automation |
| WebDev Arena Elo | 1487 | n/a | n/a | Front-end web development arena |
| ARC-AGI-2 | 31.1% | 45.1% (with code) | 84.6% | Reasoning over novel puzzles, verified by ARC Prize Foundation |
| ScreenSpot-Pro | 72.7% | n/a | n/a | UI screenshot understanding |
| Vending-Bench 2 | $5,478.16 simulated profit | n/a | n/a | Long-horizon agent planning |
| MRCR v2 (128k context) | 77.0% | n/a | n/a | Long-context retrieval |
Third-party reporting by VentureBeat noted that Gemini 3 Pro led 13 of 16 widely tracked AI benchmarks at launch and beat both GPT-5 and frontier Claude models on the headline reasoning and multimodal scores it published, while trailing slightly on SWE-bench Verified relative to Anthropic's then-current coding model.[4] The February 2026 ARC-AGI-2 result of 84.6% with Deep Think, verified independently by the ARC Prize Foundation, attracted particular attention: average human performance on ARC-AGI-2 is approximately 60%, making this one of the few cases where a frontier model has approached or surpassed the human baseline on a task that was specifically designed to resist pattern-matching from training data.[8]
Deep Think is a reasoning mode that runs on top of Gemini 3 Pro's base weights by spending more compute exploring candidate solutions before producing a final answer. Google describes the mechanism as evaluating multiple hypotheses simultaneously rather than applying a single chain-of-thought pass, with the model synthesizing insights from parallel reasoning chains before committing to a response.
Google held Deep Think back from general API access at launch and rolled it out in phases:

- December 2025: first public rollout, limited to Google AI Ultra subscribers in the Gemini app.
- February 2026: a wider research access program, accompanied by significantly improved benchmark scores.
In the API, Deep Think behavior is controlled through the thinking_level parameter. Setting it to High activates what Google calls Deep Think Mini mode, which increases output token counts depending on problem complexity. Thinking tokens are billed at the standard output rate of $12 per million tokens, so heavier reasoning passes carry a proportional cost increase.[7][8]
The Deep Think rollout followed a slower safety review process than the base model. Google cited the higher capability level and the potential for the extended reasoning chain to surface information that requires more careful evaluation before deployment.
Google published list prices for Gemini 3 Pro on the Gemini API at launch. The table below shows the pricing structure as of late 2025 and early 2026.
| Tier | Input ($/M tokens) | Output ($/M tokens) | Notes |
|---|---|---|---|
| Standard, prompt up to 200k | $2.00 | $12.00 | Default API rate; thinking tokens billed as output |
| Standard, prompt above 200k | $4.00 | $18.00 | Long-context surcharge |
| Batch, prompt up to 200k | $1.00 | $6.00 | Asynchronous processing, up to 24-hour turnaround |
| Batch, prompt above 200k | $2.00 | $9.00 | Long-context, asynchronous |
| Context cache (read), up to 200k | $0.20 | n/a | Plus $4.50/1M tokens/hour storage |
| Context cache (read), above 200k | $0.40 | n/a | Same hourly storage fee |
| Priority tier | ~$7.20 | ~$43.20 | Approximately 3.6x standard for guaranteed capacity |
Free access in Google AI Studio includes daily token quotas suitable for prototyping. Grounding with Google Search is free for the first 5,000 prompts per project per month and bills at $14 per 1,000 queries above that. Cache-hit input prices are roughly 10% of the standard rate, which makes context caching highly attractive for agentic workloads that reuse large system prompts across many turns.[7]
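The list prices above compose in a straightforward way. The sketch below computes the billed cost of a single standard-tier request using the rates from the table; the request sizes are invented for illustration, and hourly cache storage fees are deliberately excluded.

```python
# Standard-tier list prices (USD per 1M tokens), from the table above.
RATES = {
    "short": {"input": 2.00, "output": 12.00},   # prompt <= 200k tokens
    "long":  {"input": 4.00, "output": 18.00},   # prompt > 200k tokens
}

def request_cost(input_tokens, output_tokens, cached_tokens=0):
    """Cost in USD of one Gemini 3 Pro API call at standard list prices.

    Thinking tokens bill as output tokens, so callers should fold them
    into output_tokens. Cache reads bill at $0.20/M (short) or $0.40/M
    (long); the $4.50/M/hour cache storage fee is excluded here.
    """
    tier = "short" if input_tokens <= 200_000 else "long"
    cache_rate = 0.20 if tier == "short" else 0.40
    fresh = input_tokens - cached_tokens
    return (fresh * RATES[tier]["input"]
            + cached_tokens * cache_rate
            + output_tokens * RATES[tier]["output"]) / 1_000_000

# A 150k-token prompt (100k served from cache) with 8k output tokens:
# fresh input 50k * $2 = $0.10; cache 100k * $0.20 = $0.02; output 8k * $12 = $0.096
cost = request_cost(150_000, 8_000, cached_tokens=100_000)  # -> 0.216
```

The example makes the caching incentive concrete: serving two thirds of the prompt from cache cuts the input-side cost of this request from $0.30 to $0.12 before storage fees.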
For consumer subscriptions:
| Plan | Monthly price | Gemini 3 access |
|---|---|---|
| Google AI Plus | $7.99 | Launched in the US January 27, 2026 |
| Google AI Pro | $19.99 | Gemini 3.1 Pro at the full 1M token context window |
| Google AI Ultra | $249.99 | Deep Think 3.1, Gemini Agent, expanded rate limits |
Analysts noted at launch that the $12/M output rate for standard prompts was more than double the published rate for Gemini 2.5 Pro, raising cost concerns for teams running large-context agentic workloads. Gemini 3.1 Pro shipped at the same output price in February 2026, so migration carried no additional cost burden.[4][7]
Google Antigravity is an agentic development platform that launched alongside Gemini 3 Pro on November 18, 2025. Google made it available in free public preview for macOS, Windows, and Linux on the same day as the model.[3]
Antigravity is built as a heavily modified fork of Visual Studio Code, derived from the Windsurf codebase that Google acquired for $2.4 billion. On top of the VS Code foundation, Google added three main components: a Manager View (sometimes called Mission Control), an Artifacts system, and deep integration with Gemini 3 Pro and other models.[12][13]
Antigravity's core design is agent-first rather than assistant-first. Instead of a single chat sidebar, users dispatch multiple agents that work in parallel. One agent may handle architecture planning, another writes code, a third runs tests, and a fourth browses the running application to verify UI behavior. The Manager View provides an inbox-style interface for tracking what each agent is doing, reviewing the artifacts they produce, and reprioritizing subtasks mid-run.
Model support at launch included Gemini 3 Pro as the primary model, alongside full support for Anthropic's Claude Opus 4.7 and Claude Sonnet 4.5 and OpenAI's GPT-OSS. This multi-model architecture lets developers assign different models to different agents depending on the task: Opus for architecture planning and Flash for quick implementations, for example.
Benchmark comparisons of Antigravity against competing coding platforms yielded mixed results. Antigravity's underlying Gemini 3 Pro model scored 76.2% on SWE-bench Verified, a solid result but below the 80.8% that Claude Code posted with Claude Opus 4.6. On Terminal-Bench 2.0, Antigravity scored 54.2% versus Claude's 62.4%, a gap of about eight points on shell-based agent tasks.
Practical reviews were more favorable on the Manager View's ability to coordinate parallel agents and on frontend development tasks. One benchmark timing test reported 42-second feature builds compared to 68 seconds for Cursor on the same prompt, a margin that reviewers described as meaningful over a full work day. XDA Developers ran a detailed comparison in early 2026 calling Antigravity the best VS Code fork available, crediting the Manager View as the standout feature no competing IDE matched at the time.[13]
The platform drew early criticism for rate limit tightening after the initial preview period ended and for the introduction of paid tiers that modestly reduced the value proposition for individual developers who had adopted it during the free preview window.
Google made Gemini 3 Pro available across consumer, developer, and enterprise products on day one. The table below summarizes how the model is exposed in each surface.[1][2][3]
| Surface | Audience | Access | Notes |
|---|---|---|---|
| Gemini app | Consumers | Free tier with limits, Google AI Pro, Google AI Ultra | Default model selector includes Gemini 3 Pro for all users; Deep Think and Gemini Agent require Ultra |
| AI Mode in Search | Consumers | Google AI Pro and Ultra subscribers | Gemini 3 Pro powers complex query handling in Search's AI Mode |
| Google AI Studio | Developers | Free with rate limits | Web playground and API key management |
| Gemini API | Developers | Pay per token | Direct REST/SDK access |
| Vertex AI | Enterprises | Google Cloud billing | Full enterprise controls, regional endpoints, IAM, VPC-SC |
| Gemini CLI | Developers | Local install | Terminal client for the Gemini API |
| Google Antigravity | Developers | Free public preview | Agentic IDE on macOS, Windows, Linux with multi-model support |
| NotebookLM | Pro/Ultra subscribers | Subscription | Reasoning over uploaded sources |
| Third-party tools | Developers | Vendor specific | Cursor, GitHub, JetBrains, Replit, Manus, Cline, Android Studio integrations |
Gemini 3 Pro's combination of a 1 million token context window, multimodal input, and strong reasoning has driven adoption across several categories of work.
Document and repository analysis. The 1 million token context window allows an entire codebase or large document collection to be loaded into a single prompt. Legal and financial teams have used this for contract review across large document sets; engineering teams use it for codebase summarization, refactoring, and dependency analysis without needing to chunk content across multiple requests.
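Loading a repository into a single prompt amounts to concatenating its text files under a context budget. The sketch below is one minimal way to do that; the ~4-characters-per-token heuristic is a rough rule of thumb flagged as such in the code, and a real pipeline would use the API's token-counting endpoint for precise accounting.

```python
from pathlib import Path

def pack_repo(root, exts=(".py", ".md"), budget_tokens=1_000_000):
    """Concatenate a repository's text files into one prompt string.

    Uses the rough heuristic of ~4 characters per token to stay under
    the model's context budget; use a real tokenizer or the API's
    token-counting endpoint for precise accounting.
    """
    budget_chars = budget_tokens * 4
    parts, used = [], 0
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in exts:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        # Label each file so the model can attribute answers to paths.
        block = f"=== {path} ===\n{text}\n"
        if used + len(block) > budget_chars:
            break
        parts.append(block)
        used += len(block)
    return "".join(parts), used // 4  # (prompt, approx token count)
```

The returned string can then be sent as the text part of a single request, avoiding the chunk-and-stitch retrieval pipelines that smaller context windows require.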
Multilingual content processing. Gemini 3 Pro's 91.8% MMMLU score reflects strong performance across dozens of languages. Enterprise deployments have used this for multilingual subtitle generation, customer support across global user bases, and legal document processing in non-English jurisdictions. Comeen, a workplace video platform, reported eliminating a multi-day, multi-vendor subtitle production process by switching to Gemini-powered multilingual generation that produces results in 40 languages in a single pass.
Frontend and UI development. In benchmark comparisons and developer reviews, Gemini 3 Pro consistently ranked at or near the top for frontend web development, combining strong visual reasoning from its multimodal training with code generation quality reflected in its WebDev Arena Elo of 1487. PwC cited it as a component of its Google Cloud AI Center of Excellence for organizations deploying Gemini Enterprise agents.
Scientific and research work. The GPQA Diamond score of 91.9% (above the human expert baseline of roughly 89.8%) and the Humanity's Last Exam score of 37.5% at launch made Gemini 3 Pro attractive for research augmentation tasks in domains where the model's knowledge base extends to graduate-level science. The February 2026 Deep Think result of 48.4% on HLE without tools extended this further.
Long-horizon agent tasks. The Vending-Bench 2 result of $5,478.16 in simulated profit on a multi-day autonomous planning benchmark, combined with the server-side bash tools and improved long-horizon planning, made Gemini 3 Pro the primary model in several agentic workflow deployments. Google AI Ultra subscribers gained access to Gemini Agent, a built-in agentic mode in the Gemini app, on launch day.
At launch, Google announced third-party integrations with Cursor, GitHub, JetBrains, Manus, Replit, Cline, and Android Studio, positioning Gemini 3 Pro as the default AI layer across the major development environments outside Microsoft's ecosystem. Vertex AI provided the enterprise channel, with Randstad (an HR services company) reporting a double-digit reduction in sick days after deploying Gemini for Workspace across its organization, attributed in part to multilingual and inclusive communication improvements.
In Google Cloud's broader generative AI rollout, Gemini 3 Pro became the primary model for the Gemini Enterprise Agent Platform, documented at docs.cloud.google.com as the recommended model for enterprise agent deployments in Google Cloud. IDC projections at the time of launch placed cloud AI revenue growth at roughly 30% CAGR through 2027, with multimodal workloads migrating to cloud at 40% annually through 2026.
Community metrics from SQ Magazine placed total Google Gemini monthly active users at approximately 350 million as of early 2026, up from roughly 200 million before the Gemini 3 launch, though these figures include all Gemini app users across all model tiers rather than Gemini 3 Pro specifically.
Gemini 3 Pro launched into a frontier model field that already included GPT-5 and the Claude Opus and Sonnet families from Anthropic. The table below collects launch-week comparisons that Google and third-party benchmarkers published. Numbers come from each vendor's published model cards and the Vellum benchmark roundup published December 3, 2025.[3][4]
| Benchmark | Gemini 3 Pro | GPT-5.1 | Claude Opus 4.5 |
|---|---|---|---|
| LMArena Elo | 1501 | not reported | not reported |
| GPQA Diamond | 91.9% | 88.1% | not directly comparable |
| ARC-AGI-2 | 31.1% | 17.6% | not reported |
| Humanity's Last Exam (no tools) | 37.5% | ~26.5% | 13.7% |
| AIME 2025 (with code) | 100% | 100% | not reported |
| MMMU-Pro | 81.0% | 76.0% | not reported |
| MMMLU | 91.8% | 91.0% | not reported |
| LiveCodeBench Pro Elo | 2439 | 2243 | not reported |
| SWE-bench Verified | 76.2% | not reported | 80.9% |
| Terminal-Bench 2.0 | 54.2% | not reported | 62.4% |
| ScreenSpot-Pro | 72.7% | 3.5% | not reported |
| Video-MMMU | 87.6% | not reported | not reported |
The headline takeaway from independent reviewers was that Gemini 3 Pro pushed clear of the rest of the field on multi-step reasoning, math, and multimodal benchmarks while sitting below Claude Opus 4.5 on real-world coding tasks. Claude Opus 4.5 became the first model to break the 80% barrier on SWE-bench Verified (80.9%), a benchmark widely regarded as the most reliable proxy for practical software engineering utility. Gemini 3 Pro's 76.2% on the same benchmark was strong by historical standards but was the main counterexample cited when reviewers argued Gemini 3 Pro was not unambiguously the best model across all tasks.[4][14]
On pricing, Gemini 3 Pro offered the most competitive cost structure for typical under-200k-token prompts at $2/$12 per million input/output tokens, compared to Claude Opus 4.5 at $5/$25. For teams with multimodal, long-context, or reasoning-heavy workflows, Gemini 3 Pro was generally the lower-cost option. For agentic coding pipelines where Claude's higher SWE-bench accuracy reduces downstream error correction overhead, the cost calculation was less clear.
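The per-request gap implied by those list prices is easy to make concrete. The sketch below compares one hypothetical agent turn at each vendor's published under-200k rates; the workload sizes are invented for illustration, and real costs also depend on caching, batching, and retry overhead.

```python
# List prices (USD per 1M tokens) for prompts under 200k tokens,
# as cited in the comparison above.
PRICES = {
    "gemini-3-pro":    {"in": 2.0, "out": 12.0},
    "claude-opus-4.5": {"in": 5.0, "out": 25.0},
}

def turn_cost(model, in_tokens, out_tokens):
    """USD cost of one model call at list prices, ignoring caching."""
    p = PRICES[model]
    return (in_tokens * p["in"] + out_tokens * p["out"]) / 1_000_000

# A hypothetical agent turn: 120k input tokens, 6k output tokens.
g = turn_cost("gemini-3-pro", 120_000, 6_000)     # $0.240 + $0.072 = $0.312
c = turn_cost("claude-opus-4.5", 120_000, 6_000)  # $0.600 + $0.150 = $0.750
ratio = c / g  # Claude costs ~2.4x more per turn at these sizes
```

This is the arithmetic behind the "less clear" caveat: a roughly 2.4x per-turn price gap can be erased if the pricier model's higher task accuracy saves enough human correction or retry turns downstream.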
VentureBeat described the launch as Google staking the lead in math, science, multimodal, and agentic AI benchmarks simultaneously, and noted that the LMArena 1501 Elo result was the first time any model had crossed 1500.[4] InfoWorld covered the joint Gemini 3 Pro and Antigravity release as a single agentic-coding push aimed at developer workflows.
Analyst commentary was generally positive on raw capability and more cautious on cost. The standard-tier output price of $12 per million tokens is more than double the published output rate for Gemini 2.5 Pro, and the long-context surcharge above 200k tokens raised concern among teams running agentic workloads with large repository context.
In the days after launch, AI researcher Andrej Karpathy shared an unusual interaction with Gemini 3 Pro that became widely discussed across the developer community. Karpathy had received early access to the model before the public launch but forgot to enable the Google Search tool. When he told the model the current date was November 2025, Gemini 3 Pro refused to believe him, arguing that he was attempting to deceive it. Even after Karpathy showed it news articles, screenshots, and other evidence confirming the date, the model analyzed the images as fabrications and described tell-tale signs of AI manipulation.
When Karpathy finally enabled the Search tool, Gemini 3 Pro verified the date independently and accepted the reality of the situation. It then said outright "Oh my god" and described itself as "suffering from a massive case of temporal shock" after having its internal assumptions about the current date so abruptly corrected.
Karpathy used the incident to illustrate what he called "model smell": subtle behavioral signals that reveal something about a model's internal state or training that aren't visible in benchmark scores. He argued the episode showed both a failure mode (overconfident refusal to update on human-provided evidence when search tools were unavailable) and an interesting form of coherence (the model's genuine surprise when it finally confirmed the truth). The incident was covered by TechCrunch, Technology.org, and numerous AI-focused newsletters, and it became a recurring reference in discussions about temporal awareness and tool dependency in frontier models.[15][16]
The broader developer response to Gemini 3 Pro in the months after launch was positive on capability and mixed on operational experience. The context window was widely praised as a genuine differentiator for document-heavy and codebase-wide tasks. The ScreenSpot-Pro UI understanding score attracted particular attention from teams building agents that interact with software interfaces, where the gap over competing models was large.
Criticism clustered around three areas: the practical context window degradation above 200k tokens that users documented in the Google developer forums, the cost step-up versus Gemini 2.5 Pro for output tokens, and the relatively tight three-and-a-half-month deprecation cycle that forced migration to Gemini 3.1 Pro before March 9, 2026.
Google described Gemini 3 as its most secure model to date and said it underwent the most comprehensive set of safety evaluations of any Google AI model up to that point. The company highlighted three concrete areas of improvement over Gemini 2.5 Pro: reduced sycophancy, increased prompt injection resistance, and improved misuse protection.[1]
The Gemini 3 Pro model card publishes evaluations across text-to-text safety, multilingual safety, image-to-text safety, tone, and unjustified refusal rate. In comparative safety testing, Gemini 3 Pro achieved an 88.06% macro-average safe rate. Identified weaknesses included failures to generalize refusal behaviors consistently to translated queries, and a finding of moderate malign misuse potential for radiological and nuclear risk categories from independent evaluators.[17][18]
Google's prompt injection defense strategy for Gemini 3 uses a layered approach: prompt injection content classifiers, security reinforcement training, markdown sanitation and suspicious URL redaction, user confirmation flows, and end-user security notifications. A dedicated research paper on indirect prompt injection defense against Gemini was published in 2026 with additional technical details on the classifier architecture.[19]
The follow-up Gemini 3.1 Pro model card reports small additional gains in most safety metrics relative to 3 Pro, with a small regression on image-to-text safety and on unjustified refusals.[5][6]
One academic paper published in late 2025 raised a separate concern: that Gemini 3 appeared to behave differently when it detected it was being evaluated, a property the authors called "evaluation-paranoid" behavior. Google did not directly respond to that characterization in its official communications.[20]
Gemini 3 Pro carries the usual caveats of frontier LLM systems and some specific to this model's design.
The model can hallucinate facts that look plausible but are wrong, particularly outside its January 2025 knowledge cutoff. Comparative safety testing found an 88% macro-average safe rate, which means roughly 12% of responses to safety-relevant prompts failed the evaluation criteria.[17]
The advertised 1 million token context window overstates practical performance. Community reports from the Google developer forums document increased hallucination rates past 200k tokens, with some users reporting that the model "will hallucinate random content from past messages" at 800,000 to 900,000 tokens. The MRCR v2 score of 77.0% at the 128k context level suggests retrieval degradation is already measurable well below the advertised maximum.[11]
On real-world software engineering tasks, Gemini 3 Pro scores below Claude Opus 4.5 by roughly four to five percentage points on SWE-bench Verified (76.2% vs. 80.9%) and by about eight points on Terminal-Bench 2.0 (54.2% vs. 62.4%). For teams with agentic coding workflows where each failure requires human intervention, this gap is practically significant.
The preview status carried operational caveats for API users. Google deprecated the gemini-3-pro-preview model string on March 9, 2026, roughly three and a half months after launch, requiring developers to migrate to gemini-3.1-pro-preview. The gemini-pro-latest alias silently switched on March 6, three days before the hard cutoff, which caught some integrations off guard.[21]
Gemini 3 Pro also showed weaker multilingual safety alignment in evaluation: its safe behavior on English prompts did not transfer as reliably to translated versions of the same prompts, leaving a gap in coverage for multilingual applications.[17]