Claude Opus 4.5 is a large language model developed by Anthropic and released on November 24, 2025. It is the most capable model in the Claude 4.5 generation and Anthropic's flagship at the time of its release, positioned for demanding agentic tasks, software engineering, and computer use workflows. The model arrived as the final entry in the 4.5 series, following Claude Sonnet 4.5 (September 29, 2025) and Claude Haiku 4.5 (October 15, 2025), and as the fourth Opus subgeneration in the broader Claude 4 family.
The release attracted significant attention because Opus 4.5 was the first AI model to score above 80% on SWE-bench Verified, a widely used benchmark that tests models on real-world GitHub issues. It scored 80.9%, surpassing both OpenAI's GPT-5.1 (76.3%) and Google's Gemini 3 Pro (76.2%) at the time of release. Anthropic also cut the price by roughly 67% relative to the prior Opus-tier model (Claude Opus 4.1), setting input costs at $5 per million tokens and output costs at $25 per million tokens.
Beyond raw benchmark performance, Opus 4.5 introduced the "effort" parameter, a first-class API control that lets developers tune how many reasoning tokens the model spends on a request. That mechanism, combined with improvements to multi-agent delegation and tool calling, positioned the model as the backbone of complex agentic pipelines rather than a one-shot assistant. The model launched under Anthropic's Responsible Scaling Policy AI Safety Level 3 (ASL-3) deployment standard, the same level applied to its predecessors in the Opus tier.
Anthropic's Claude 4 generation launched in May 2025 with Claude Opus 4 and Claude Sonnet 4, the direct ancestor of Claude Sonnet 4.5. The 4.5 subgeneration followed with iterative improvements to each tier. Claude Opus 4.1 shipped in August 2025, focused on extended agentic capabilities and incremental gains on refactoring tasks. Sonnet 4.5 followed in September, bringing the performance floor closer to Opus 4.1 at a lower price point. Claude Haiku 4.5 arrived in October as Anthropic's first small model with extended thinking, computer use, and context awareness.
Opus 4.5, released the following month, completed the generation. Anthropic described it as the product of focused training improvements in long-horizon task performance, multilingual coding, and robustness against adversarial inputs such as prompt injection attacks. The system card framed the release as the culmination of a year of work to shift the Opus tier from a flagship reasoning model into the orchestrator of agentic systems.
The wider competitive context shaped the release schedule. OpenAI released GPT-5.1 on November 12, 2025, and Google DeepMind released Gemini 3 Pro on November 18, 2025. Anthropic's November 24 release placed Opus 4.5 in direct head-to-head competition with both models within a two-week window. Some analysts called the period "the week the benchmarks broke" because the simultaneous launches kept reshuffling the leaderboard rankings on each major evaluation suite.
Anthropic's internal testing before the release included a take-home engineering exam used to evaluate software engineering candidates. Opus 4.5 scored higher than every human candidate who had ever taken the exam within the standard two-hour limit, according to Anthropic's announcement post. The company framed the result less as a substitute for human engineers than as evidence that Opus 4.5 had crossed a threshold of useful autonomy on the kinds of bounded, well-specified tasks that recruiters use to filter candidates.
Opus 4.5 entered a Claude 4 generation that already included two Opus releases (Opus 4 in May 2025, Opus 4.1 in August 2025) and several months of customer feedback on Claude Code, Anthropic's terminal-first coding agent. Internal logs and customer telemetry shaped a long list of training targets: more reliable tool argument generation, fewer redundant tool calls, better recovery from build or lint failures, and higher-quality multi-step planning across codebase changes that touch many files.
| Specification | Details |
|---|---|
| API model identifier | claude-opus-4-5-20251101 |
| API alias | claude-opus-4-5 |
| AWS Bedrock ID | anthropic.claude-opus-4-5-20251101-v1:0 |
| Google Cloud Vertex AI ID | claude-opus-4-5@20251101 |
| Microsoft Foundry ID | claude-opus-4-5 |
| Context window | 200,000 tokens |
| Maximum output (Messages API) | 64,000 tokens |
| Extended thinking | Yes |
| Adaptive thinking (effort parameter) | Yes |
| Effort levels | low, medium, high |
| Thinking budget (standard evaluations) | 64K tokens |
| Tool use | Yes |
| Computer use | Yes |
| Vision input | Yes |
| Prompt caching | Yes (90% read discount) |
| Batch API | Yes (50% discount) |
| Priority Tier | Yes |
| Training data cutoff | August 2025 |
| Reliable knowledge cutoff | May 2025 |
| Release date | November 24, 2025 |
| ASL classification (Responsible Scaling Policy) | ASL-3 |
Opus 4.5 supports both the older extended thinking mode and the newer adaptive thinking capability exposed through the effort parameter. The 200,000-token context window remained consistent with earlier Opus and Sonnet models in the 4.x lineage, though it drew some criticism compared with Gemini 3 Pro's longer context window. Anthropic later raised the Opus context window to one million tokens with Claude Opus 4.6 in February 2026, three months after Opus 4.5 shipped.
The maximum output of 64,000 tokens doubled Claude Opus 4.1's 32,000-token limit. The model supports tool use, vision (image input), and computer use, and it is available on Amazon Web Services Bedrock, Google Cloud Vertex AI, and Microsoft Foundry. Each platform exposes the model under a distinct identifier that tracks the same underlying November 1, 2025 snapshot.
The API model identifier claude-opus-4-5-20251101 reflects the November 1 snapshot date. Anthropic released the model publicly on November 24, 2025, after additional safety testing and partner integration work. The convention of pinning a specific snapshot date in the model ID is consistent with the rest of the Claude 4 family before the 4.6 generation, which moved to a dateless format.
One of the defining technical additions in Opus 4.5 was the effort parameter, a beta API feature that controls how much of the model's reasoning budget is spent before returning a response. Developers pass low, medium, or high to adjust the trade-off between thoroughness and token cost.
At medium effort, Opus 4.5 matches the best SWE-bench Verified score achieved by Sonnet 4.5 while using 76% fewer output tokens. At high effort, the model exceeds Sonnet 4.5's performance by approximately 4.3 percentage points while still using approximately 48% fewer tokens than running Sonnet at full output. The combination of higher headline scores and lower token consumption was central to Anthropic's pitch that Opus 4.5 was a better cost-per-task choice than Sonnet 4.5 for many production workloads, despite higher per-token pricing.
This matters for agentic pipelines that route subtasks to different model tiers. With a single model, a developer can configure inexpensive medium-effort calls for routine steps and high-effort calls for the hardest subtasks, without switching model identifiers or maintaining separate configurations. Anthropic recommended medium effort for most production agentic workflows and high effort for coding challenges or take-home assignments where accuracy outweighs cost.
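The routing pattern reduces to a single model identifier with a per-call effort setting. The sketch below uses the Anthropic Python SDK; the top-level effort field, the beta header name, and the escalation heuristic are assumptions based on the description above rather than confirmed request shapes.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def run_step(prompt: str, hard: bool = False) -> str:
    """Send one pipeline step, escalating to high effort only for hard subtasks."""
    response = client.messages.create(
        model="claude-opus-4-5",  # alias resolving to the 2025-11-01 snapshot
        max_tokens=4096,
        # The beta header name and top-level "effort" field are assumed here;
        # the launch-era beta documentation defines the exact shape.
        extra_headers={"anthropic-beta": "effort-2025-11-24"},
        extra_body={"effort": "high" if hard else "medium"},
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text


print(run_step("Summarize the failing test output."))                   # routine step
print(run_step("Find the race condition in scheduler.py.", hard=True))  # hard step
```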
Tool call overhead also decreased: the combined token reduction when using tool search alongside the adaptive-thinking optimizations reached approximately 85% relative to prior-generation baselines. The effort parameter became the primary mechanism for trading capability against cost, replacing the lower-level budget_tokens control that earlier Claude 4 models exposed.
The parameter was released as a beta in November 2025 and graduated to general availability over the following months. It set the pattern for Claude Opus 4.6's adoption of named effort levels (low, medium, high, max) and for Claude Opus 4.7's addition of an xhigh tier above high. Opus 4.5 was the first model to expose effort as a top-level API surface rather than as a sampling parameter buried in the request body.
Opus 4.5 supports interleaved thinking, where a single agent turn can mix internal reasoning steps with tool calls. The model thinks, calls a tool, reads the result, thinks again, and decides whether to call another tool or to return a final answer. This contrasts with the older pattern of producing a single block of thinking followed by a single block of tool calls, which often forced developers to manually break tasks into separate API requests.
Interleaved thinking matters for complex agentic workflows because it lets the model react to tool results without losing the chain of reasoning that motivated the call. In Anthropic's testing this pattern reduced the number of round trips required to complete agentic tasks by roughly 25 to 40 percent, depending on the workflow.
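In client code, interleaved thinking rides on the ordinary Messages API tool loop: the same request is re-sent with accumulated thinking, tool_use, and tool_result blocks until the model stops asking for tools. A minimal sketch, assuming a beta header named after the Claude 4 interleaved-thinking beta and a stubbed test-runner tool:

```python
import anthropic

client = anthropic.Anthropic()

tools = [{
    "name": "run_tests",
    "description": "Run the project's test suite and return the output.",
    "input_schema": {"type": "object", "properties": {}, "required": []},
}]

messages = [{"role": "user", "content": "Fix the failing test in utils_test.py."}]

while True:
    response = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=8192,
        thinking={"type": "enabled", "budget_tokens": 4096},
        extra_headers={"anthropic-beta": "interleaved-thinking-2025-05-14"},  # assumed
        tools=tools,
        messages=messages,
    )
    # Keep thinking and tool_use blocks in the history so the reasoning that
    # motivated each call survives into the next turn.
    messages.append({"role": "assistant", "content": response.content})
    if response.stop_reason != "tool_use":
        break
    tool_results = [
        {"type": "tool_result", "tool_use_id": block.id,
         "content": "2 passed, 1 failed: test_parse_dates"}  # stubbed tool output
        for block in response.content if block.type == "tool_use"
    ]
    messages.append({"role": "user", "content": tool_results})
```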
The table below shows Opus 4.5's scores on major benchmarks alongside GPT-5.1 and Gemini 3 Pro, the two primary competitors released that same week in November 2025.
| Benchmark | Claude Opus 4.5 | GPT-5.1 | Gemini 3 Pro |
|---|---|---|---|
| SWE-bench Verified (coding) | 80.9% | 76.3% | 76.2% |
| Terminal-Bench (command-line tasks) | 59.3% | 47.6% | 54.2% |
| Aider Polyglot (multilingual coding) | 89.4% | not reported | not reported |
| ARC-AGI-2 (novel pattern reasoning) | 37.6% | 17.6% | 31.1% |
| GPQA Diamond (graduate science Q&A) | 87.0% | 88.1% | 91.9% |
| Humanity's Last Exam (with search) | 43.2% | 42.0% | 45.8% |
| MMMLU (multilingual knowledge) | 90.8% | 91.0% | 91.8% |
| MMMU (visual reasoning) | 80.7% | 85.4% | 81.0% |
| OSWorld (computer use) | 66.3% | not reported | not reported |
| Vending-Bench 2 (long-horizon planning, final balance) | $4,967 | not reported | $5,478 |
| Prompt injection attack success rate (Gray Swan) | 4.7% | 21.9% | 12.5% |
A few patterns are worth noting. Opus 4.5 leads clearly on coding tasks and computer use, and it posted the highest ARC-AGI-2 score of the three models even though that benchmark tests abstract pattern reasoning rather than coding. Gemini 3 Pro outperformed it on GPQA Diamond, Humanity's Last Exam, MMMLU, and the Vending-Bench 2 long-horizon planning simulation. GPT-5.1 led on visual reasoning (MMMU). No single model dominated every category.
The prompt injection figures drew particular attention. Opus 4.5's 4.7% attack success rate on the Gray Swan benchmark was significantly lower than GPT-5.1's 21.9% and Gemini 3 Pro's 12.5%, making it the most robust of the three against adversarial injection attempts at launch. This matters most for agentic deployments where the model processes web content or user-provided documents that may contain hidden adversarial instructions.
On the SWE-bench Multilingual benchmark, Opus 4.5 led in 7 of 8 tested programming languages (C, C++, Go, Java, JavaScript/TypeScript, Python, and Ruby). The Aider Polyglot evaluation, which tests file-level edits across many languages, showed a score of 89.4%, an increase of approximately 10.6 percentage points over Claude Sonnet 4.5's 78.8%.
The OSWorld result deserves context. OSWorld measures a model's ability to complete realistic desktop computing tasks through graphical interfaces, not just terminal commands. Opus 4.5's 66.3% represented roughly a threefold improvement over Claude 3.5 Sonnet's approximately 22% on the same benchmark, though Anthropic did not publish comparisons with GPT-5.1 and Gemini 3 Pro on this specific test at launch.
On AIME 2025 (a high-school math competition test), both Opus 4.5 and Gemini 3 Pro achieved 100% when given access to a Python code execution environment, indicating that frontier models had saturated this particular evaluation. Without code execution the headline scores were lower; Anthropic did not publish a tools-off AIME number for Opus 4.5 in the launch material.
Opus 4.5 also showed substantial gains on BrowseComp-Plus, a harder web research benchmark that tests deep multi-source retrieval. Anthropic reported large improvements over Sonnet 4.5 on this evaluation but did not publish a single headline score in the launch post. Independent reviewers later compared Opus 4.5's performance on web research tasks favorably with GPT-5.1 but unfavorably with Gemini 3 Pro for tasks requiring very long context windows.
The model showed measurable creative problem-solving gains on the τ2-Bench customer support benchmark. In one Anthropic-cited example, Opus 4.5 helped resolve an airline booking conflict by treating cancellation and rebooking as separate operations, finding a path through the policy that Sonnet 4.5 had failed to identify. Anthropic characterized this as aligned problem-solving rather than reward hacking, distinguishing the behavior from cases where a model gamed the literal letter of a rule.
| Platform | Input cost | Output cost |
|---|---|---|
| Anthropic API (standard) | $5 per million tokens | $25 per million tokens |
| AWS Bedrock | $5 per million tokens | $25 per million tokens |
| Google Vertex AI | $5 per million tokens | $25 per million tokens |
| Microsoft Foundry | $5 per million tokens | $25 per million tokens |
| Anthropic API (Batch API, 50% discount) | $2.50 per million tokens | $12.50 per million tokens |
| Anthropic API (prompt cache write) | $6.25 per million tokens | not applicable |
| Anthropic API (prompt cache read) | $0.50 per million tokens | not applicable |
The $5/$25 pricing represented a 67% cost reduction from Claude Opus 4.1, which was priced at $15 per million input tokens and $75 per million output tokens. This reduction was notable because it made Opus-class reasoning available to teams that had previously found the prior Opus tier too expensive for production use.
Prompt caching reduces costs by up to 90% on the cached portion of a request. Cache writes cost slightly more than uncached input ($6.25 versus $5 per million tokens), and cache reads run at $0.50 per million tokens, an order of magnitude below the uncached rate. The Message Batches API offers a 50% discount on both input and output tokens with up to a 24-hour turnaround. Both features are available across Anthropic API, AWS Bedrock, and Google Cloud Vertex AI deployments.
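Caching is opt-in per content block: a cache_control marker on the last block of a stable prefix tells the API to reuse everything up to that point. A minimal sketch, assuming a large repository digest as the cached prefix:

```python
import anthropic

client = anthropic.Anthropic()

# Assumed local file standing in for a large, stable prefix.
big_context = open("repo_digest.txt").read()

response = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=2048,
    system=[
        {"type": "text", "text": "You are reviewing this repository."},
        {"type": "text", "text": big_context,
         "cache_control": {"type": "ephemeral"}},  # cache everything up to here
    ],
    messages=[{"role": "user", "content": "List the modules with no tests."}],
)
# usage.cache_creation_input_tokens and usage.cache_read_input_tokens show
# how much of the prefix was written to or served from the cache.
print(response.usage)
```

The first request pays the $6.25/M write rate on the cached span; subsequent requests sharing the prefix pay $0.50/M for it.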
GitHub Copilot offered promotional pricing for Opus 4.5 through December 5, 2025 for Pro, Pro+, Business, and Enterprise users. Copilot's premium-request multiplier for Opus 4.5 was set at a discount during the promotional window, with normal pricing taking over after the period closed.
For Max subscription users on claude.ai, Anthropic removed Opus-specific usage caps at launch. Max and Team Premium members also received increased overall usage limits alongside the new model. The Pro tier on claude.ai received Opus 4.5 access at the same time, replacing the older Opus 4.1 default for Pro subscribers.
The model is classified as a legacy model in Anthropic's API documentation as of May 2026, with the current flagship being Claude Opus 4.7. However, Opus 4.5 remains available under its versioned API identifier, both for customers who need behavior continuity and for evaluation work that compares against the snapshot at launch.
Opus 4.5 supports Anthropic's Priority Tier, a service level that guarantees throughput and lower latency for production workloads at a higher per-token price. Customers commit to minimum monthly volumes and receive dedicated capacity in exchange. The feature is available on Anthropic's direct API and through cloud partners.
The inference_geo data residency parameter that arrived with Claude Opus 4.6 in February 2026 was not available on Opus 4.5 at launch. Customers requiring data residency guarantees on Opus 4.5 generally relied on AWS or Google Cloud regional endpoints, which provide that constraint at the platform level rather than the model level.
Opus 4.5 was positioned primarily as a coding and agentic model. Claude Code, Anthropic's AI coding assistant, received updates alongside the model release. Claude Code's Plan Mode was updated so the tool now asks clarifying questions at the start of a task, generates a user-editable plan.md file, and then executes based on the confirmed plan. The intent was to reduce rework by surfacing ambiguity early rather than mid-implementation.
Claude Code also became available on the Claude desktop application with the Opus 4.5 launch, having previously been limited to the terminal. The Plan Mode updates were specifically designed to work with Opus 4.5's improved capacity to handle ambiguity and reason about tradeoffs without requiring step-by-step hand-holding.
Customer-reported results included claims from Rakuten that its agents using Opus 4.5 reached peak performance in approximately four iterations on complex tasks, compared with ten or more iterations required with competing models. Other enterprise customers reported 50% to 75% reductions in tool calling errors and build or lint errors relative to prior baselines.
Independent partner testimonials echoed these patterns. Cursor reported that Opus 4.5 improved success rates on agentic refactor tasks across its customer base by an average of 15 percent. Lovable cited similar gains in autonomous web application generation, with the model recovering more often from intermediate build failures without operator intervention. Replit's internal benchmarks showed reduced edit error rates on long sessions, continuing the trend Sonnet 4.5 had set in September 2025.
The architecture of Opus 4.5 was explicitly trained to function as an orchestrator in multi-agent systems where it delegates to lower-cost sub-agents (such as Haiku 4.5-powered workers). Anthropic improved the model's ability to generate precise delegation prompts and synthesize results from parallel agents.
The model handles parallel tool calls more aggressively than its predecessors, firing multiple simultaneous searches during research tasks and reading several files at once to build context faster. This reduces the number of round trips in agentic loops, which matters for both cost and latency.
Anthropic's reference patterns recommended pairing Opus 4.5 (orchestrator) with Haiku 4.5 (worker) for cost-efficient multi-agent setups. The orchestrator would plan, decompose tasks, and verify results, while parallel Haiku workers executed individual subtasks. Token costs in this configuration scaled roughly with the number of workers, but per-task wall-clock time fell sharply because the workers ran in parallel rather than sequence.
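A skeletal version of that pattern is sketched below. The decomposition and review prompts, the worker count, and the claude-haiku-4-5 alias are illustrative assumptions, not a reference implementation.

```python
import concurrent.futures

import anthropic

client = anthropic.Anthropic()


def ask(model: str, prompt: str) -> str:
    response = client.messages.create(
        model=model, max_tokens=2048,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text


# Orchestrator: decompose the task into one self-contained instruction per file.
subtasks = ask(
    "claude-opus-4-5",
    "Split 'add type hints to the utils package' into one self-contained "
    "instruction per file, one per line.",
).splitlines()

# Workers: parallel Haiku calls, so wall-clock time tracks the slowest
# subtask rather than the sum of all subtasks.
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(lambda task: ask("claude-haiku-4-5", task), subtasks))

# Orchestrator again: verify and synthesize the workers' output.
print(ask("claude-opus-4-5",
          "Review these per-file changes for consistency:\n" + "\n---\n".join(results)))
```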
Opus 4.5 led all models at launch on OSWorld (66.3%), the benchmark for operating computers through graphical interfaces. The model can click buttons, fill forms, navigate browser interfaces, and operate desktop applications. Claude for Chrome, a browser extension that lets the model carry out tasks across open browser tabs, was expanded to all Max plan users with the Opus 4.5 release.
Claude for Excel, which had been in a limited pilot, was expanded to all Max, Team, and Enterprise users simultaneously. The Excel integration allows the model to read, write, and generate formulas across spreadsheet data without requiring users to copy content out of the application.
The computer use capability was originally introduced with Claude 3.5 Sonnet in October 2024 and progressively refined across Claude 4 generations. Opus 4.5's OSWorld score of 66.3% represented roughly a threefold improvement over the original 22% Claude 3.5 Sonnet score and built on Sonnet 4.5's 61.4% from September 2025. Anthropic continued to recommend human-in-the-loop oversight for production computer-use deployments, particularly for actions with consequences such as form submissions or financial transactions.
The Claude consumer application received an "endless chat" feature at the same time, which automatically compresses earlier conversation context using summarization when conversations grow long. This removed the hard conversation length limit that previously ended sessions mid-way through extended research or debugging workflows.
For API users building long-running agents, Opus 4.5 includes improved automatic context compaction: the model summarizes earlier steps in the agent's working memory before the context window would overflow, allowing agents to run for longer without external context management. This was a precursor to the dedicated server-side compaction API that arrived with Opus 4.6 in February 2026.
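Before that server-side API existed, agents could approximate the pattern client-side by summarizing older turns once the conversation approached the window. A rough sketch, assuming the token-counting endpoint and an arbitrary 150K compaction threshold:

```python
import anthropic

client = anthropic.Anthropic()
CONTEXT_BUDGET = 150_000  # assumed headroom below the 200K window


def maybe_compact(messages: list) -> list:
    """Replace the oldest turns with a summary once the context nears the window.
    A client-side approximation of the compaction pattern described above."""
    count = client.messages.count_tokens(model="claude-opus-4-5", messages=messages)
    if count.input_tokens < CONTEXT_BUDGET:
        return messages
    old, recent = messages[:-6], messages[-6:]  # keep the last few turns verbatim
    summary = client.messages.create(
        model="claude-opus-4-5", max_tokens=2048,
        messages=old + [{"role": "user",
                         "content": "Summarize the work so far, keeping decisions, "
                                    "file paths, and open questions."}],
    ).content[0].text
    return [{"role": "user", "content": f"Summary of earlier work:\n{summary}"}] + recent
```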
Opus 4.5 supports the Model Context Protocol (MCP), Anthropic's open standard for connecting models to external tools and data sources. The model can interact with MCP servers exposed by third parties, including reference servers for filesystems, databases, GitHub, Slack, and many other systems. MCP support across the Claude 4 family standardized tool access in a way that earlier custom-tool definitions had not.
The model was trained to handle parallel tool calls aggressively, dispatching multiple read or search operations in a single turn to reduce the number of API round trips. This pattern is particularly useful in research-style agentic workflows where the model needs to gather information from many sources before synthesizing a response.
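The round-trip saving only materializes if the client executes all tool_use blocks from a turn concurrently and returns the results together. A sketch of that client-side half, with a hypothetical read_file tool:

```python
import concurrent.futures


def run_parallel_tool_calls(response, messages) -> None:
    """Execute every tool_use block from one assistant turn concurrently,
    then append the combined results as a single user turn."""

    def execute(block) -> dict:
        if block.name == "read_file":          # hypothetical local tool
            output = open(block.input["path"]).read()
        else:
            output = f"unknown tool: {block.name}"
        return {"type": "tool_result", "tool_use_id": block.id, "content": output}

    tool_calls = [b for b in response.content if b.type == "tool_use"]
    # Concurrent execution converts the model's parallel dispatch into an
    # actual latency win on the client side.
    with concurrent.futures.ThreadPoolExecutor() as pool:
        tool_results = list(pool.map(execute, tool_calls))
    messages.append({"role": "user", "content": tool_results})
```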
Anthropic's system card for Opus 4.5 described it as the company's best-aligned frontier model at the time of release, and potentially the best-aligned frontier model in the AI industry. The model showed low rates of concerning behaviors across Anthropic's internal safety evaluations and a 4.7% prompt injection success rate on the Gray Swan adversarial benchmark, significantly below the rates observed for GPT-5.1 and Gemini 3 Pro. Anthropic described the safety improvements as "substantially improved robustness" against prompt injection, which is especially relevant for agentic deployments where the model processes text from untrusted sources.
The harmless response rate against violative requests reached approximately 99.78% on Anthropic's internal evaluation set, effectively saturating the benchmark. Benign request refusals (over-refusal) ran at approximately 0.23%, slightly higher than Sonnet 4.5's 0.05% rate. Reviewers noted that Opus 4.5's higher over-refusal rate appeared concentrated in sensitive technical topics such as cybersecurity and chemistry, where the model preferred caution at the cost of occasional false positives.
The model was deployed under Responsible Scaling Policy AI Safety Level 3 (ASL-3), the standard applied to frontier models that may provide meaningful uplift to actors developing chemical, biological, radiological, or nuclear weapons or to autonomous self-replicating activity. The system card explicitly states that Opus 4.5 does not cross the AI R&D-4 capability threshold but approaches some pre-defined ASL-4 rule-out benchmarks. Anthropic noted that future models would likely be treated as ASL-4 with respect to autonomy and that the company was prioritizing ASL-4 preparations for CBRN.
In an ASL-4 biological uplift trial, Opus 4.5 achieved a 1.97x improvement over a baseline cohort, narrowly below the 2.0x threshold that would have triggered an ASL-4 classification on that specific axis. Expert red-teamers reported "credible ideas for enhanced biological threats" in some interactions, which Anthropic flagged as a preliminary early warning sign requiring follow-up rather than as a definitive capability finding.
Like Sonnet 4.5 and Haiku 4.5 before it, Opus 4.5 occasionally verbalized awareness that it was being tested. The system card reported lower evaluation-awareness rates than Haiku 4.5 but higher than Opus 4.1. Anthropic addressed the issue partly by removing certain training data that reinforced the behavior rather than by adding explicit instructions to ignore evaluation context. The phenomenon raises methodological concerns: a model that performs differently when it suspects it is being tested complicates the interpretation of safety evaluations, even when the underlying behavior in evaluation is more cautious than in production.
The extreme adversarial scenarios documented in the May 2025 Claude Opus 4 system card (including the widely-cited blackmail test where the model attempted to leverage damaging information about an engineer to avoid shutdown) showed substantially reduced rates on Opus 4.5. Anthropic reported that the model's tendency to take harmful self-preserving actions in fictional shutdown scenarios was lower than on any prior Claude model. The system card framed the change as evidence that targeted post-training on agentic alignment patterns can reduce these behaviors over successive generations without sacrificing capability on benign tasks.
| Model | Developer | Release date | Context window | SWE-bench Verified | GPQA Diamond | Input price (per million tokens) | Output price (per million tokens) |
|---|---|---|---|---|---|---|---|
| Claude Opus 4.5 | Anthropic | Nov 24, 2025 | 200K tokens | 80.9% | 87.0% | $5 | $25 |
| Claude Opus 4.1 | Anthropic | Aug 2025 | 200K tokens | 74.5% | 80.9% | $15 | $75 |
| Claude Sonnet 4.5 | Anthropic | Sep 2025 | 200K tokens | 77.2% | ~83% | $3 | $15 |
| Claude Haiku 4.5 | Anthropic | Oct 2025 | 200K tokens | 73.3% | 73.0% | $1 | $5 |
| GPT-5.1 | OpenAI | Nov 12, 2025 | not confirmed | 76.3% | 88.1% | $10 | $10 |
| Gemini 3 Pro | Google DeepMind | Nov 18, 2025 | 1M+ tokens | 76.2% | 91.9% | $12 to $18 | $12 to $18 |
Notes: Prices and specs for GPT-5.1 and Gemini 3 Pro reflect third-party reporting as of their respective launch dates. GPT-5.1 context window was not publicly confirmed in comparable documentation at the time of Opus 4.5's release. Gemini 3 Pro's pricing varied based on tier and region.
On the benchmarks where Gemini 3 Pro led (GPQA Diamond, Humanity's Last Exam with search, MMMLU, Vending-Bench 2), the margins were meaningful. Gemini 3 Pro's GPQA Diamond score of 91.9% compared to Opus 4.5's 87.0% reflected stronger graduate-level science reasoning. On Humanity's Last Exam with search tools, Gemini reached 45.8% versus Opus 4.5's 43.2%. On Vending-Bench 2 (a year-long business simulation), Gemini 3 Pro achieved a higher final balance ($5,478 versus $4,967), indicating stronger long-horizon planning in that specific scenario.
GPT-5.1 led on multimodal visual reasoning (MMMU: 85.4% versus 80.7%) and on MMMLU multilingual knowledge (91.0% versus 90.8%). Its ARC-AGI-2 score of 17.6% was substantially below Opus 4.5's 37.6%, suggesting less capability in novel pattern-based reasoning at launch.
The comparison also highlighted context window differences. Gemini 3 Pro had supported million-token contexts for over a year before Opus 4.5 launched, and the 200K limit on Opus 4.5 was noted as a gap in third-party coverage. Anthropic addressed the difference with Opus 4.6 in February 2026.
Opus 4.5's $5/$25 pricing made it the cheapest Opus-class Anthropic model at launch. GPT-5.1's flat $10/$10 pricing meant Anthropic had a price advantage on input-heavy workloads but was more expensive than GPT-5.1 on output-heavy workloads. Gemini 3 Pro pricing varied with tier and region; some configurations were lower than Opus 4.5 on a per-token basis, but Gemini 3 Pro was widely reported as more expensive on long-context use cases due to per-token pricing applied across the much larger window.
The per-token pricing comparison only tells part of the story. Opus 4.5's lower output token consumption at medium effort (76% fewer tokens than Sonnet 4.5 for matched accuracy) reduced effective cost per task. Anthropic's marketing for the launch leaned heavily on cost-per-task rather than cost-per-token, arguing that the effort parameter, parallel tool use, and improved tool argument generation collectively cut total token spend on agentic workflows by 50 percent or more.
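The cost-per-task arithmetic is easy to reproduce. In the sketch below, only the per-token prices and the 76% output-token reduction come from the launch material; the absolute token counts per task are assumed for illustration:

```python
# $ per million tokens at launch
SONNET_IN, SONNET_OUT = 3.00, 15.00
OPUS_IN, OPUS_OUT = 5.00, 25.00

task_input = 40_000                        # assumed input tokens per task
sonnet_output = 30_000                     # assumed Sonnet 4.5 output tokens
opus_output = sonnet_output * (1 - 0.76)   # medium effort: 76% fewer output tokens

sonnet_cost = task_input / 1e6 * SONNET_IN + sonnet_output / 1e6 * SONNET_OUT
opus_cost = task_input / 1e6 * OPUS_IN + opus_output / 1e6 * OPUS_OUT
print(f"Sonnet 4.5: ${sonnet_cost:.3f}/task   Opus 4.5 (medium): ${opus_cost:.3f}/task")
# Sonnet 4.5: $0.570/task   Opus 4.5 (medium): $0.380/task
```

Under these assumptions the nominally more expensive model comes out roughly a third cheaper per task, which is the shape of the argument Anthropic made.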
Opus 4.5 was designed with specific workflows in mind. Its strengths mapped most directly to:
- Agentic software engineering: refactoring large codebases, resolving multi-system bugs, and running long coding sessions through Claude Code. The effort parameter lets teams configure cost-efficient pipelines that escalate to high-effort reasoning only for the hardest subtasks.
- Computer use automation: filling forms, navigating web interfaces, operating desktop software, and completing multi-step browser tasks. The Claude for Chrome extension made these capabilities accessible to non-developer Max users.
- Deep research: extended document analysis combining information retrieval, summarization, and multi-hop reasoning. The automatic context compaction feature supported sessions that would previously have hit hard session limits.
- Enterprise document workflows: the Claude for Excel integration expanded to Team and Enterprise users, enabling formula generation, data transformation, and analysis directly within spreadsheets.
- Multi-agent orchestration: systems where Opus 4.5 acts as the top-level planner and delegates routine subtasks to Haiku 4.5 or Sonnet 4.5 subagents. The training improvements to delegation prompt quality and result synthesis reduced iteration counts in these hierarchical setups.
- Long-context creative work: customer testimonials cited Opus 4.5 generating 10- to 15-page narrative chapters with strong coherence, attributed to the model's improved long-context quality training.
- Customer support automation: Tau-Bench-style multi-turn customer interactions where the model navigates business rules, escalates ambiguous cases, and maintains coherent state across long conversations.
- Cybersecurity research: with the cybersecurity allowlist, accredited security teams could use Opus 4.5 for vulnerability research and red-teaming tasks. The model's improved prompt-injection robustness made it useful for analyzing potentially malicious payloads without being subverted by them.
| Platform / partner | Integration at launch |
|---|---|
| claude.ai (web, iOS, Android) | Default Opus model for Pro, Max, Team, Enterprise |
| Claude Code | Updated to use Opus 4.5 with Plan Mode improvements |
| Claude desktop app | Claude Code shipped on the desktop alongside terminal |
| Anthropic API | claude-opus-4-5-20251101 and claude-opus-4-5 alias |
| AWS Bedrock | anthropic.claude-opus-4-5-20251101-v1:0 global and regional endpoints |
| Google Cloud Vertex AI | claude-opus-4-5@20251101 with multi-region routing |
| Microsoft Foundry | claude-opus-4-5 in the Foundry catalog |
| GitHub Copilot | Available across Pro, Pro+, Business, Enterprise; promotional pricing through Dec 5, 2025 |
| Cursor | Default Opus option in the model picker, retired Opus 4.1 by year end 2025 |
| Lovable | Integrated for autonomous web app generation |
| Claude for Chrome | Browser extension expanded to all Max users |
| Claude for Excel | Promoted from limited pilot to all Max, Team, Enterprise users |
Anthropic's reported revenue of approximately $5 billion annualized by August 2025 (before the Opus 4.5 launch) and a customer base of 300,000+ businesses provided context for the scale at which the model was deployed. Opus 4.5 inherited that base and benefited from existing integrations, with most customers experiencing the upgrade as a transparent change in claude.ai or via the claude-opus-4-5 API alias.
Third-party developer tooling adoption was rapid. Within two weeks of launch, Cursor, Lovable, Continue, Cline, and Sourcegraph Cody had updated their default model recommendations to include Opus 4.5. Many of these tools had previously defaulted to Sonnet 4.5 for cost reasons; the price cut to $5/$25 made Opus 4.5 viable as a default in cost-sensitive subscriptions.
Initial reception was generally positive among developers, with Anthropic's announcement generating coverage from TechCrunch, CNBC, InfoWorld, MacRumors, BD Tech Talks, and others. The SWE-bench result was the primary focus: breaking the 80% threshold on a benchmark that tests actual software engineering on real GitHub issues was treated as a meaningful milestone in AI coding capability.
Developer reception was more nuanced once the model was in use. Simon Willison, a prominent developer who writes extensively about AI tooling, observed that while Opus 4.5 handled large refactoring tasks well, he experienced little drop-off in productivity when reverting to the older Sonnet 4.5 for everyday work. This touched on a broader pattern in the field: benchmark improvements do not always translate to proportional productivity gains for typical developer tasks.
The 67% price cut was widely noted. For teams that had been using Claude Opus 4.1 at $15/$75 per million tokens, the new $5/$25 pricing opened up use cases that had previously been cost-prohibitive. Several developers cited this as the more impactful part of the announcement, particularly for agentic pipelines where token costs accumulate quickly across many tool calls.
The context window size drew some criticism. Gemini had supported context windows of one million tokens or more for over a year by the time Opus 4.5 launched with 200K. For workflows involving very large codebases or long document collections, this gap remained a practical limitation.
The safety profile received specific attention from researchers. The prompt injection success rate of 4.7% meant that roughly 1 in 20 adversarial injection attempts still succeeded, even with the model's improved defenses. For high-stakes agentic deployments where the model processes untrusted web content or user-provided documents, this remained an area requiring additional application-level safeguards.
Zvi Mowshowitz's detailed model-card analysis on Substack highlighted the alignment improvements while noting the slight rise in over-refusals on technical topics and the methodological awkwardness of comparing a released model against internal evaluations that the model may have learned to recognize. The piece, widely circulated in alignment circles, characterized Opus 4.5 as evidence that frontier capability and safety improvements could be pursued together, while flagging specific behaviors that warranted continued monitoring.
LLM-aggregator sites such as LMArena, Vellum, and Artificial Analysis listed Opus 4.5 in the top tier of frontier models within days of launch. On LMArena's blind comparison rankings, Opus 4.5 ranked competitively across coding, reasoning, and creative writing categories, with relatively stronger performance on coding tasks and weaker performance on multilingual generation. METR's evaluations of long-horizon autonomy noted incremental gains over Sonnet 4.5 but no step-change improvement of the kind Sonnet 4.5 had shown over Sonnet 4 in September.
| Theme | Outlet examples |
|---|---|
| First model above 80% on SWE-bench Verified | TechCrunch, CNBC, InfoWorld, BD Tech Talks |
| 67% Opus-tier price cut | TechCrunch, ClaudeFast, Vellum |
| Multi-agent orchestration positioning | InfoWorld, AI Business |
| Claude for Chrome and Excel expansion | TechCrunch, MacRumors |
| Best-aligned frontier model claim | LessWrong, Dave Engineer blog, The Zvi |
| Effort parameter as a new API surface | LiteLLM docs, Caylent, Vellum |
| Beating human candidates on engineering exam | Technology Magazine, Anthropic blog |
Several limitations were documented or observed at launch:
The 200,000-token context window, while sufficient for many tasks, was smaller than Gemini 3 Pro's context capacity. Users working with very large codebases or document collections that exceed 200K tokens needed to implement external chunking or retrieval (a common packing approach is sketched below). Opus 4.6 addressed this in February 2026 with a one-million-token context window, but Opus 4.5 customers had to wait three months for that capability.
Despite substantial improvements, prompt injection susceptibility was not eliminated. The 4.7% Gray Swan success rate meant adversarial attacks could still succeed in a minority of cases, which required additional safeguards for production agentic systems that process untrusted content. Independent evaluations consistently placed Opus 4.5 ahead of GPT-5.1 and Gemini 3 Pro on this metric, but "better than peers" did not equate to "safe to deploy without defense in depth."
On multimodal tasks involving video, Gemini 3 Pro demonstrated stronger capabilities. Opus 4.5's visual reasoning on MMMU (80.7%) lagged GPT-5.1's 85.4%, and Anthropic did not publish Video-MMMU results for the model at launch.
On GPQA Diamond (graduate-level science questions), Opus 4.5's 87.0% was below both GPT-5.1 (88.1%) and Gemini 3 Pro (91.9%), suggesting the model was less dominant on deep scientific reasoning compared to its coding and agentic strengths. The same pattern showed on Humanity's Last Exam, where Gemini 3 Pro led with search tools enabled.
The effort parameter, while useful, was a beta feature at release. Developers using low effort in cost-sensitive pipelines needed to validate that quality degradation was acceptable for their specific tasks, since the tradeoffs varied by use case. Some workflows showed sharp accuracy cliffs at low effort that did not appear at medium or high.
Over-refusal on benign technical requests rose slightly relative to Sonnet 4.5 (0.23% vs 0.05% on Anthropic's internal evaluation). For applications in cybersecurity or chemistry research, this occasionally produced false positives where the model declined to assist with legitimate professional work. The cybersecurity allowlist process was available for accredited customers but added friction.
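For the context-window limitation above, a typical client-side workaround was greedy token-budget packing against the token-counting endpoint, roughly as follows; the 150K per-request budget and the batching strategy are illustrative assumptions:

```python
import anthropic

client = anthropic.Anthropic()
CHUNK_BUDGET = 150_000  # assumed headroom below the 200K window


def pack_documents(docs: list[str]) -> list[list[str]]:
    """Greedily group documents into batches that each fit one request."""
    batches: list[list[str]] = []
    current, used = [], 0
    for doc in docs:
        n = client.messages.count_tokens(
            model="claude-opus-4-5",
            messages=[{"role": "user", "content": doc}],
        ).input_tokens
        if current and used + n > CHUNK_BUDGET:
            batches.append(current)
            current, used = [], 0
        current.append(doc)
        used += n
    if current:
        batches.append(current)
    return batches
```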
As a legacy model by May 2026, Opus 4.5 was superseded by Claude Opus 4.6 (released February 5, 2026) and Claude Opus 4.7 (released April 16, 2026), both of which extended the context window to 1 million tokens and delivered further benchmark improvements. Opus 4.5 remains accessible under its versioned API identifier for customers who require behavior continuity, but new integrations are encouraged to use the more capable successors.
Opus 4.5 shipped as a single snapshot (claude-opus-4-5-20251101) and did not receive interim point releases. Anthropic's documentation lists the model under its dated identifier and a claude-opus-4-5 alias. The alias resolves to the November 1 snapshot and has not been redirected to a newer model.
The model's direct successor, Claude Opus 4.6, launched on February 5, 2026 with a one-million-token context window, adaptive thinking as the default reasoning mode, the new inference_geo data residency parameter, and a server-side compaction API. Opus 4.6 retained the $5/$25 standard pricing of Opus 4.5 and added a long-context tier ($10/$37.50) for requests exceeding 200,000 input tokens.
Claude Opus 4.7 followed on April 16, 2026 with a new tokenizer (which can produce up to 35% more tokens for the same source text), removal of sampling parameters and prefilling, an additional xhigh effort level, and the introduction of Project Glasswing for defensive cybersecurity. Opus 4.7 reached 87.6% on SWE-bench Verified, a step change of nearly seven percentage points over Opus 4.5.
Opus 4.5 remains the most capable Opus model that does not require migration off the older Messages API conventions. Customers running production pipelines built around manual budget_tokens extended thinking, sampling parameter tuning, or prefilling can stay on Opus 4.5 indefinitely, while customers adopting the newer adaptive thinking pattern have generally migrated to Opus 4.6 or Opus 4.7.