Claude Opus 4.6 is a large language model developed by Anthropic and released on February 5, 2026. It is the third incremental update to the Opus tier of the Claude 4 family, succeeding Claude Opus 4.5 (November 2025) and preceding Claude Opus 4.7 (April 2026). At the time of its release it was the most capable generally available model in the broader Claude family, designed for long-horizon agentic work, complex software engineering, and enterprise knowledge tasks.
The model introduced two major changes relative to its predecessor. First, manual extended thinking with the budget_tokens parameter was deprecated in favor of adaptive thinking, a mode in which the model automatically decides when and how deeply to reason before responding. Second, the context window expanded from 200,000 tokens to one million tokens (in beta at launch, generally available from March 13, 2026), giving the model the ability to ingest entire codebases, long legal contracts, or multi-document research corpora in a single request.
On GDPval-AA, an Elo-based evaluation of economically valuable knowledge work spanning finance, legal, and research domains, Opus 4.6 outscored Claude Opus 4.5 by 190 Elo points and OpenAI's GPT-5.2 by approximately 144 Elo points. On the long-context retrieval benchmark MRCR v2 at the one-million-token level, Opus 4.6 achieved 76% accuracy compared to 18.5% for Claude Sonnet 4.5, the sharpest jump in long-context performance Anthropic had reported at that point.
Pricing was held constant at $5 per million input tokens and $25 per million output tokens for prompts up to 200,000 tokens, matching Claude Opus 4.5 and making the performance gains cost-neutral for existing customers.
The Claude 4 family launched in May 2025 with Claude Sonnet 4 and Claude Opus 4. Those two models introduced hybrid reasoning (the ability to produce visible extended thinking chains before a final answer) and set new records on the SWE-bench Verified coding benchmark. The 4.x subgenerations that followed refined specific capabilities without changing the base architecture.
Claude Opus 4.1 shipped in August 2025, focusing on improvements to agentic tool calling and longer autonomous operation. Claude Sonnet 4.5 followed in September 2025, billed by Anthropic as the best coding model available at the time. Claude Haiku 4.5 arrived in October 2025 as the first small model in the family with extended thinking and computer use.
Claude Opus 4.5 was released November 24, 2025. It was the first publicly available model to cross 80% on SWE-bench Verified (80.9%) and cut Opus-tier pricing by roughly two-thirds compared to Opus 4.1, bringing it to $5/$25 per million tokens. It also introduced the effort parameter, which replaced raw budget_tokens control with three named levels (low, medium, high) governing reasoning depth. Opus 4.5 was widely covered as a step change in cost-to-capability ratio for the Opus tier.[1][21]
Opus 4.6 continued that trajectory. The February 2026 release arrived alongside a wider set of API changes including the general availability of fine-grained tool streaming, the launch of data residency controls, and a compaction API for server-side context summarization. Two days later, on February 7, 2026, Anthropic published a fast mode research preview for Opus 4.6, offering up to 2.5x faster output at premium pricing.[3]
Claude Sonnet 4.6 was released twelve days later on February 17, 2026, completing the mid-cycle update to the Claude 4 family. Claude Opus 4.7 followed on April 16, 2026, with a new tokenizer, step-change improvements in agentic coding, and the introduction of an xhigh effort level.[12]
Anthropic's release cadence in late 2025 and early 2026 placed Opus 4.6 squarely against OpenAI's GPT-5.2 family (winter 2025-2026) and Google's Gemini 3 Pro line (announced November 18, 2025). Within roughly ten weeks of GPT-5.2 and the first Gemini 3 Pro updates, Anthropic shipped Opus 4.6 with a price-matched, more capable Opus tier and an upgraded long-context architecture, signaling that the company's commercial flywheel through Claude Code and enterprise integrations would not slow.[1][9]
| Model | Release date | API model ID | Key change |
|---|---|---|---|
| Claude Opus 4 | May 22, 2025 | claude-opus-4-20250514 | Hybrid reasoning, Claude 4 launch |
| Claude Opus 4.1 | August 5, 2025 | claude-opus-4-1-20250805 | Agentic tool calling and extended operation |
| Claude Opus 4.5 | November 24, 2025 | claude-opus-4-5-20251101 | 80%+ SWE-bench, effort parameter, price cut |
| Claude Opus 4.6 | February 5, 2026 | claude-opus-4-6 | 1M context window, adaptive thinking, compaction API |
| Claude Opus 4.7 | April 16, 2026 | claude-opus-4-7 | New tokenizer, xhigh effort, agentic coding step-change |
Anthropic released Claude Opus 4.6 on February 5, 2026, simultaneously across the Claude API, claude.ai (Pro and Team subscribers), Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry. The accompanying announcement was titled "Introducing Claude Opus 4.6" and described the model as "our most intelligent model for complex agentic tasks and long-horizon work."[1]
The launch included several associated product announcements. Claude Code received improvements for multi-agent coordination, allowing a lead agent to spin up and direct teammate agents running in parallel. Claude for Excel was updated to use Opus 4.6 and gained support for native operations such as pivot table editing and conditional formatting. Claude for PowerPoint entered research preview, capable of reading existing slide layouts and maintaining brand consistency when inserting content.[1]
GitHub made Opus 4.6 generally available in GitHub Copilot on the same day across all Copilot tiers (Pro, Pro+, Business, Enterprise) via the model picker in Visual Studio Code, Visual Studio, github.com, GitHub Mobile, and the GitHub CLI. Enterprise and Business administrators were required to enable the Opus 4.6 policy in Copilot settings before the model became accessible to team members. The official changelog characterized the model as excelling "in agentic coding, with specialization on especially hard tasks requiring planning and tool calling."[6]
AWS announced availability of Opus 4.6 on Amazon Bedrock on February 5, 2026. Google Cloud's Vertex AI listed the model under the ID claude-opus-4-6 with regional endpoints in us-east5, europe-west1, and asia-southeast1.[5][7]
Anthropic's product team framed the release as a deliberate widening of the Opus customer base. The Head of Product cited adoption among product managers, financial analysts, and other knowledge workers as a reason to invest in features such as Excel and PowerPoint integrations rather than chase a benchmark number. TechCrunch's coverage led with the multi-agent feature and the PowerPoint sidebar, situating the release within the broader 2026 industry shift away from chat-window-only deployments.[15]
Anthropic published quotes from a small set of design partners alongside the announcement. GitHub, Asana, and Cognition each described practical gains in multi-step coding work and agentic planning when running early versions of Opus 4.6 against their internal harnesses. The launch post highlighted partner-cited gains on multi-step bug repair, large-codebase navigation, and interruption-tolerant long-running tasks. Anthropic did not publish raw partner numbers, but referenced internal metrics that indicated agent teams built on Opus 4.6 cut time-to-solution on complex repository tasks by an average of 65% relative to single-agent Opus 4.5 setups.[1]
| Specification | Details |
|---|---|
| API model identifier | claude-opus-4-6 |
| AWS Bedrock model ID | anthropic.claude-opus-4-6-v1 |
| Google Vertex AI model ID | claude-opus-4-6 |
| Context window | 1,000,000 tokens (beta at launch; generally available March 13, 2026) |
| Standard context window | 200,000 tokens |
| Maximum output (Messages API) | 128,000 tokens |
| Maximum output (Message Batches API) | 300,000 tokens (with output-300k-2026-03-24 beta header) |
| Reasoning mode | Adaptive thinking (replaces budget_tokens) |
| Effort levels | Low, medium, high, max |
| Extended thinking | Yes (via adaptive mode) |
| Vision | Yes |
| Computer use | Yes |
| Prompt caching | Yes (5-minute and 1-hour TTL) |
| Tool use | Yes (function calling, MCP, computer use) |
| Structured outputs | Yes |
| Training data cutoff | August 2025 |
| Reliable knowledge cutoff | May 2025 |
| Input modalities | Text, images, code, PDF documents |
| Output modalities | Text |
Opus 4.6 replaced the type: "enabled" / budget_tokens extended thinking interface with adaptive thinking (thinking: {type: "adaptive"}). Under adaptive thinking the model decides at inference time whether a given query benefits from extended reasoning and, if so, how many thinking tokens to spend. The four effort levels map to different defaults: low effort minimizes thinking for speed, medium applies moderate reasoning and may skip it for simple queries, high (the default) applies continuous deep reasoning, and max enables unlimited reasoning depth for the most demanding tasks.[3]
The change removed the need for developers to manually tune budget_tokens for each task type. Anthropic deprecated type: "enabled" with budget_tokens for new models starting with Opus 4.6, though older models with that parameter continued to work. The effort parameter itself, which had been a beta API control with Opus 4.5, was promoted to general availability with the Opus 4.6 launch.[3]
In practice, adaptive thinking is built around two properties. The model can produce zero internal reasoning for a trivial routing question, hundreds of reasoning tokens for a moderately complex coding edit, and tens of thousands of tokens for a multi-step proof or repository plan, all within the same model and API call. Effort acts as a soft cap rather than a hard budget. Developers steer cost and speed by raising or lowering effort instead of writing branching logic that decides whether to call extended thinking.[3][8]
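As a minimal sketch of what an adaptive-thinking request might look like over the raw Messages API, based on the parameter shapes described above (the thinking and effort fields follow this article's description and should be checked against current documentation):

```python
import os
import requests

# Minimal adaptive-thinking request against the Messages API. The
# "thinking" and "effort" fields follow the shapes described in this
# article; exact spellings should be checked against current docs.
resp = requests.post(
    "https://api.anthropic.com/v1/messages",
    headers={
        "x-api-key": os.environ["ANTHROPIC_API_KEY"],
        "anthropic-version": "2023-06-01",
        "content-type": "application/json",
    },
    json={
        "model": "claude-opus-4-6",
        "max_tokens": 4096,
        "thinking": {"type": "adaptive"},  # replaces type: "enabled" + budget_tokens
        "effort": "high",                  # low | medium | high (default) | max
        "messages": [
            {"role": "user", "content": "Plan a refactor that removes this cyclic import."}
        ],
    },
)
resp.raise_for_status()
print(resp.json()["content"])
```

Under this interface the same call shape serves both trivial and deep-reasoning queries; only the effort string changes when a developer wants to cap or expand thinking.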
The one-million-token context window was offered in beta from February 5, 2026, with long-context pricing applying to requests exceeding 200,000 input tokens. On March 13, 2026, Anthropic promoted it to general availability with no beta header required and removed the dedicated 1M rate limits; standard account limits applied across all context lengths from that date. The GA launch also raised the per-request media limit from 100 to 600 images or PDF pages.[3]
The context window size (approximately 750,000 words or 3.4 million Unicode characters) allowed Opus 4.6 to hold very large codebases, full books, or extensive document sets in a single conversation turn. Anthropic reported a 76% accuracy rate on the MRCR v2 needle-in-haystack benchmark at the 1M-token level, compared to 18.5% for Claude Sonnet 4.5. The company described this as a significant reduction in "context rot", the tendency of earlier models to lose coherence or forget earlier content in very long conversations.[1][8]
At the 256K-token level, Opus 4.6 reached 93.0% on the eight-needle variant of MRCR v2, an internal data point Anthropic surfaced to demonstrate that retrieval quality stayed high well into the new context regime rather than collapsing once requests exceeded the legacy 200K window. The MRCR series tests whether a model can locate and distinguish between multiple similar pieces of information at increasing distances inside long inputs.[8]
Opus 4.6 launched with server-side context compaction in beta, exposed as a new compaction API. The API automatically detected when a conversation was approaching the context limit and summarized the interaction history into compact blocks, preserving key details while freeing space for new content. This allowed long-running agentic workflows to continue beyond what any fixed context window could hold. The feature addressed a common failure mode in autonomous agents where tasks stalled when the model's working memory filled up.[3]
In the launch documentation, Anthropic positioned compaction as the server-side counterpart to the older client-side compaction shipped in the Python and TypeScript SDKs. Server-side compaction was designed to support "effectively infinite" conversations by triggering summarization automatically inside the platform rather than requiring developer-side state machines. Compaction also worked alongside the memory tool that Anthropic had introduced with Sonnet 4.5 in September 2025, giving long-running agents both a scratchpad and an automatic context-shrinking mechanism.[3][8]
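The launch materials cited here do not document the compaction API's request shape, so any example is necessarily illustrative. The sketch below invents a context_management block purely to show where a server-side opt-in would sit in a Messages API call; the field name and structure are hypothetical:

```python
import os
import requests

# Hypothetical illustration: the "context_management" field below is an
# invented placeholder, since the article names the compaction API but
# not its payload. Only the endpoint, headers, and core fields are real.
long_running_history = [
    {"role": "user", "content": "Continue the migration from step 14."},
    # ... many earlier turns approaching the context limit ...
]
resp = requests.post(
    "https://api.anthropic.com/v1/messages",
    headers={
        "x-api-key": os.environ["ANTHROPIC_API_KEY"],
        "anthropic-version": "2023-06-01",
        "content-type": "application/json",
    },
    json={
        "model": "claude-opus-4-6",
        "max_tokens": 4096,
        "context_management": {"compaction": {"enabled": True}},  # hypothetical
        "messages": long_running_history,
    },
)
```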
Opus 4.6 removed support for assistant message prefilling. Any request that included a partially filled assistant message returned a 400 error. Developers who relied on prefilling to steer output format were directed to use structured outputs with JSON schema, system prompt instructions, or the output_config.format parameter instead. This was a breaking API change relative to earlier Claude models.[3][8]
In parallel with the prefill removal, Anthropic moved the output_format parameter for structured outputs to output_config.format on the same release date, signaling a broader push to consolidate post-Sonnet 4.5 API surfaces into a single output configuration object. Existing callers who used output_format directly were given a transition window before the older field returned a validation error.[3]
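A hedged sketch of the migration: the output_config.format location comes from the text above, while the inner JSON-schema layout is an assumption based on common structured-output conventions:

```python
# Before (rejected with a 400 on Opus 4.6): prefilling the assistant
# turn to force a JSON prefix.
#   {"role": "assistant", "content": "{\"title\":"}
#
# After: declare the shape via output_config.format. The field path is
# from this article; the schema envelope below is assumed.
body = {
    "model": "claude-opus-4-6",
    "max_tokens": 1024,
    "output_config": {
        "format": {
            "type": "json_schema",
            "schema": {
                "type": "object",
                "properties": {
                    "title": {"type": "string"},
                    "severity": {"type": "string", "enum": ["low", "medium", "high"]},
                },
                "required": ["title", "severity"],
            },
        }
    },
    "messages": [{"role": "user", "content": "Summarize this bug report as JSON."}],
}
```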
Anthropic introduced the inference_geo parameter alongside Opus 4.6, allowing API customers to specify that inference should run only in US datacenters. US-only inference was available at a 1.1x pricing multiplier. The feature was available for models released after February 1, 2026, and was specifically targeted at enterprise customers in regulated industries that required data sovereignty guarantees.[3]
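In request terms the control is a single field. A sketch, noting that the article names the parameter but not its accepted values, so the "us" string is assumed:

```python
# Residency-pinned request body; "inference_geo" is the parameter named
# above, while the "us" value is an assumed spelling.
body = {
    "model": "claude-opus-4-6",
    "max_tokens": 1024,
    "inference_geo": "us",  # US-only inference, billed at a 1.1x multiplier
    "messages": [{"role": "user", "content": "Review this patient-intake workflow."}],
}
```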
The February 5, 2026 release bundled several non-model-specific platform changes that affected anyone building on the Claude API. Fine-grained tool streaming moved from public beta to general availability across all models and platforms, removing the fine-grained-tool-streaming-2025-05-14 beta header. The effort parameter graduated from beta on Opus 4.5 to general availability on Opus 4.6. The compaction API and the inference_geo parameter both shipped in beta. Together these changes positioned Opus 4.6 less as a single new model and more as the centerpiece of an early-2026 platform refresh that touched billing, residency, streaming, and context management at the same time.[3]
| Usage tier | Input (per million tokens) | Output (per million tokens) |
|---|---|---|
| Standard (up to 200K input tokens) | $5.00 | $25.00 |
| Long-context (200K to 1M input tokens) | $10.00 | $37.50 |
| Batch API (async, up to 200K) | $2.50 | $12.50 |
| US-only inference surcharge | 1.1x multiplier | 1.1x multiplier |
| Prompt caching read | Up to 90% discount | n/a |
The standard tier pricing matched Claude Opus 4.5 exactly. Long-context pricing (for requests exceeding 200,000 input tokens) was introduced with Opus 4.6 as a separate tier, doubling the input rate and raising the output rate by 50% for requests that used the extended window, in line with the higher computational cost of processing very large contexts.[3][10]
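A small worked example of the tiering, assuming (as the table reads) that a request whose input exceeds 200,000 tokens bills entirely at the long-context rate:

```python
def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate one Opus 4.6 request's cost under the two-tier pricing.

    Assumes the long-context rate applies to the whole request once
    input exceeds 200K tokens, per the tiers in the table above.
    """
    if input_tokens > 200_000:
        in_rate, out_rate = 10.00, 37.50  # long-context tier, $ per MTok
    else:
        in_rate, out_rate = 5.00, 25.00   # standard tier, $ per MTok
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# A 600K-token codebase review producing 20K output tokens:
# 600,000 x $10.00/MTok + 20,000 x $37.50/MTok = $6.00 + $0.75 = $6.75
print(f"${request_cost_usd(600_000, 20_000):.2f}")
```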
Batch API processing through the Message Batches API offered a 50% discount on standard per-token rates. Batch jobs could also use the extended 300,000-token output cap on Opus 4.6 starting March 30, 2026, with the output-300k-2026-03-24 beta header. The fast mode research preview, launched February 7, 2026, offered up to 2.5x faster output token generation at a pricing premium; Anthropic directed interested customers to join a waitlist.[3]
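A sketch of opting a batch into the extended output cap via the beta header named above; the requests/params envelope follows the Message Batches API's general shape, and the interaction between max_tokens and the header is inferred from the text:

```python
import os
import requests

resp = requests.post(
    "https://api.anthropic.com/v1/messages/batches",
    headers={
        "x-api-key": os.environ["ANTHROPIC_API_KEY"],
        "anthropic-version": "2023-06-01",
        "anthropic-beta": "output-300k-2026-03-24",  # extended 300K output cap
        "content-type": "application/json",
    },
    json={
        "requests": [
            {
                "custom_id": "report-001",
                "params": {
                    "model": "claude-opus-4-6",
                    "max_tokens": 300_000,  # assumes the cap is requested per-job
                    "messages": [
                        {"role": "user", "content": "Draft the full audit report."}
                    ],
                },
            }
        ]
    },
)
resp.raise_for_status()
```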
Prompt caching continued to offer up to a 90% discount on cached input reads, with both 5-minute and 1-hour cache TTLs generally available. On February 19, 2026, Anthropic launched automatic caching for the Messages API, which caches the last cacheable block in each request and shifts the cache point forward as the conversation grows, simplifying optimization for any Opus 4.6 deployment that did not need fine-grained breakpoint control.[3]
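For deployments that still want explicit control, cache breakpoints attach to individual content blocks. A sketch of caching a large system prompt, where the ttl spelling for the 1-hour option is an assumption:

```python
# Cache a large, stable system prompt so repeated requests reuse it at
# the discounted cached-read rate. The "ttl" spelling is assumed.
with open("style_guide.md") as f:
    style_guide = f.read()

body = {
    "model": "claude-opus-4-6",
    "max_tokens": 2048,
    "system": [
        {
            "type": "text",
            "text": style_guide,
            "cache_control": {"type": "ephemeral", "ttl": "1h"},
        }
    ],
    "messages": [{"role": "user", "content": "Check this draft against the style guide."}],
}
```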
At $5/$25 per million tokens, Opus 4.6 sat at roughly one-third the headline rate of GPT-5.2 Pro and was directly comparable to Gemini 3 Pro's mid-tier rate. The long-context surcharge above 200,000 tokens raised effective per-token cost for very large requests, but did not affect prompts that fit within the legacy window. For agentic workloads where token consumption was driven by multi-step tool use rather than input size, the standard tier remained the relevant rate.[10][16]
| Benchmark | Opus 4.6 score | Description |
|---|---|---|
| SWE-bench Verified | 80.8% | Real-world GitHub issue resolution |
| SWE-bench Multilingual | 77.83% | SWE-bench across multiple languages |
| Terminal-Bench 2.0 | 65.4% | Agentic terminal coding tasks |
| GPQA Diamond | 91.3% | PhD-level science reasoning |
| Humanity's Last Exam (no tools) | 40.0% | Multidisciplinary expert-level questions |
| Humanity's Last Exam (with tools) | 53.3% | Same, with tool access |
| MRCR v2 (1M tokens) | 76.0% | Long-context needle-in-haystack retrieval |
| MRCR v2 (256K tokens, 8-needle) | 93.0% | Long-context retrieval at 256K |
| BrowseComp | 83.7% | Hard web information retrieval |
| CharXiv (no tools) | 69.1% | Scientific chart understanding |
| CharXiv (with tools) | 84.7% | Scientific chart understanding with tools |
| MMMLU | 91.1% | Multilingual knowledge questions |
| MMMU Pro | 73.9% | Visual reasoning |
| MCP-Atlas | 75.8% | Multi-step tool use orchestration |
| Finance Agent v1.1 | 60.1% | Financial analysis agent tasks |
| OSWorld-Verified | 72.7% | Computer use and desktop control |
| ARC AGI 2 | 68.8% | Abstract reasoning and generalization |
| BigLaw Bench | 90.2% | Legal reasoning and analysis |
| τ2-bench Retail | 91.9% | Multi-turn customer service tasks |
| GDPval-AA (Elo) | 1606 | Knowledge work across finance, legal, research |
The SWE-bench Verified score of 80.8% was effectively flat against the 80.9% set by Claude Opus 4.5, a 0.1-point difference within run-to-run noise. Anthropic noted that SWE-bench Verified results were averaged over 25 trials run with adaptive thinking at max effort and default sampling settings, suggesting the benchmark had reached a regime where small differences mostly reflected variance rather than capability gaps. The launch material did not include separate scores for MMLU, AIME, or HumanEval, three benchmarks that earlier Anthropic launches had cited but that had largely saturated for frontier-class models by 2026.[2][8]
The larger gains came on long-context, agentic, and enterprise knowledge-work evaluations. The GDPval-AA Elo gain of 190 points over Opus 4.5 (1416 to 1606) was the headline number in Anthropic's own announcement and was used in subsequent comparisons against GPT-5.2.[1]
Terminal-Bench 2.0, a benchmark testing autonomous terminal operation and long-horizon coding, placed Opus 4.6 at 65.4%, measured with adaptive thinking at max effort. That score edged out GPT-5.2 at 64.7% and Gemini 3 Pro at 56.2% on the same benchmark, according to figures cited in third-party analysis.[8]
Humanity's Last Exam (HLE) without tools at 40.0% represented meaningful progress on a benchmark widely regarded as near-impossible for current systems. With tool access the score rose to 53.3%, a 13.3-point improvement attributable to the model's ability to delegate specific lookups or calculations.[8]
The MRCR v2 jump (from Claude Sonnet 4.5's 18.5% to Opus 4.6's 76.0% at the 1M-token level) was among the most cited results in coverage of the release. The benchmark tests whether a model can correctly retrieve specific pieces of information buried in a very long document. The four-fold improvement reflected both the larger context window and training improvements aimed specifically at maintaining attention quality over long distances.[8][9]
Anthropic reported that the gain in long-context retrieval translated to qualitative improvements on tasks such as cross-repository code analysis, contract comparison across hundreds of pages, and multi-document research synthesis. The MRCR result became a frequent reference point in third-party benchmarking comparing Opus 4.6 to Gemini 3 Pro, which had supported a 1M-2M token context for over a year before Anthropic matched it.[9][10]
DataCamp's review of Opus 4.6 published shortly after launch reported that the model passed eight rigorous hand-crafted logic tests, including hex-to-decimal conversions combined with prime-number filtering, matrix rotation under spatial constraints, seating-arrangement constraint satisfaction with backtracking, modular arithmetic problems, factorial-and-string puzzles, root-cause code debugging, and physics-style counterfactuals. Each test had been designed to exercise a specific reasoning skill in a way that could not be answered by surface-level pattern matching. The result became a frequent talking point in developer-facing coverage, alongside the SWE-bench and GDPval-AA numbers.[8]
Opus 4.6 was designed with autonomous software engineering as its primary use case. Anthropic stated the model "plans more carefully, sustains agentic tasks for longer, can operate more reliably in larger codebases, and has better code review and debugging skills to catch its own mistakes." In internal tests, agent teams built on Opus 4.6 reduced time-to-solution for complex repository-level tasks by an average of 65%.[1]
The model's performance on Claude Code reflected this emphasis. Claude Code gained support for agent teams at launch, allowing a lead Claude instance to coordinate multiple teammate agents working in parallel on different parts of a codebase. Each agent maintained its own context window, enabling thorough execution of sub-tasks without crowding out the lead agent's working memory. This multiagent architecture allowed tasks that might previously have required hours of sequential work to be parallelized across several minutes.[1]
Third-party demonstrations from the launch period included a multi-agent run that produced a complete working C compiler from scratch, generating roughly 100,000 lines of code and yielding binaries capable of booting Linux across x86, ARM, and RISC-V architectures in approximately eight hours. Agent teams remained an experimental Claude Code feature that users had to enable explicitly with an environment flag.[10]
During the design-partner program, Anthropic's red-team and security-research customers reported that the model surfaced over 500 zero-day vulnerabilities in test code corpora during pre-release evaluation. These findings supported the additional cybersecurity safeguards that shipped with the launch and contributed to the six new cybersecurity probes that Anthropic added to its standard pre-deployment testing for Opus 4.6.[10][4]
With one million tokens of context (approximately 750,000 words), Opus 4.6 could hold entire codebases, full-length books, or large document collections in a single request. Practical applications included cross-repository code analysis, review of full legal proceedings, synthesis across large research corpora, and analysis of complete financial filings without document chunking.[1]
The high MRCR v2 score (76%) and the Anthropic-reported reduction in context rot were direct measures of this capability. Earlier Claude models tended to lose track of content introduced many tens of thousands of tokens earlier; Opus 4.6's training specifically targeted this regression. Anthropic positioned the long-context tier of Opus 4.6 as the recommended model for any workflow that previously required retrieval-augmented generation pipelines, with the caveat that long-context pricing applied above 200,000 input tokens.[1][3]
The GDPval-AA benchmark, which evaluates performance on economically valuable tasks in finance, legal, and other professional domains, put Opus 4.6 at 1606 Elo. That placed it 144 points ahead of OpenAI's GPT-5.2 and 190 points ahead of Claude Opus 4.5. Specific enterprise capabilities included financial modeling, legal document analysis, research synthesis, and structured data extraction from complex documents.[1][8]
The BigLaw Bench score of 90.2% demonstrated strong performance on legal reasoning tasks typical of large law firm work, including contract review, regulatory analysis, and legal research tasks. Reddit users early in the release cycle highlighted legal documents specifically as a domain where Opus 4.6 noticeably outperformed earlier Claude models, citing fewer hallucinated citations and more careful handling of conditional clauses.[10][16]
Opus 4.6 continued support for computer use, the ability to control a graphical desktop interface by analyzing screenshots and generating mouse and keyboard actions. The OSWorld-Verified score of 72.7% represented the model's accuracy on a standardized set of computer control tasks, up from 66.3% on Opus 4.5. Computer use support was available through the Claude API and on Vertex AI.[1][8]
The vision pipeline used by computer use also benefited from training improvements aimed at multimodal robustness. Per Anthropic's launch documentation, images and screenshots were processed at the same maximum resolution as Opus 4.5 (about 1,568 pixels on the long edge), with high-resolution image support reserved for the later Opus 4.7 release.[12]
The model accepted text, images, code, and PDF documents as input. The CharXiv benchmark (chart and figure understanding from scientific papers) scored 69.1% without tools and 84.7% with tool access. Vision capabilities supported analysis of photographs, diagrams, charts, and document scans within the context window. On the visual reasoning benchmark MMMU Pro, Opus 4.6 scored 73.9%, an improvement over Opus 4.5's 70.6% but below leading multimodal-focused systems such as Gemini 3 Pro at 81.0% and GPT-5.2 at 79.5%.[8][9]
The four effort levels (low, medium, high, max) gave developers a single dial that traded cost and latency for quality. At low effort, the model rarely engaged extended thinking and produced outputs comparable in latency to standard chat completions. At medium effort, the model decided per turn whether to think before responding. High effort, the default, applied continuous deep reasoning for most non-trivial prompts. Max effort lifted the soft cap on reasoning tokens, useful for the hardest planning, mathematical, or research tasks. Anthropic recommended high or max effort for most agentic coding work and low or medium for high-throughput pipelines where cost mattered more than peak quality.[1][3]
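In application code the dial reduces to one field per request. A sketch of routing workloads to effort levels, using the level names and guidance above (the request shape matches the adaptive-thinking sketch earlier):

```python
# Workload-to-effort routing per the guidance above.
EFFORT_BY_WORKLOAD = {
    "bulk_extraction": "low",     # high-throughput pipelines, cost-sensitive
    "chat_support": "medium",     # think only when a turn warrants it
    "agentic_coding": "high",     # the default: continuous deep reasoning
    "research_planning": "max",   # uncapped reasoning for the hardest tasks
}

def build_request(workload: str, prompt: str) -> dict:
    return {
        "model": "claude-opus-4-6",
        "max_tokens": 4096,
        "thinking": {"type": "adaptive"},
        "effort": EFFORT_BY_WORKLOAD[workload],
        "messages": [{"role": "user", "content": prompt}],
    }
```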
| Benchmark | Claude Opus 4.6 | GPT-5.2 (OpenAI) | Gemini 3 Pro (Google) | Grok 4 (xAI) |
|---|---|---|---|---|
| SWE-bench Verified | 80.8% | 80.0% | 76.2% | ~75% |
| Terminal-Bench 2.0 | 65.4% | 64.7% | 56.2% | N/A |
| GPQA Diamond | 91.3% | 93.2% | 91.9% | N/A |
| Humanity's Last Exam (with tools) | 53.3% | 50.0% | 37.5% | N/A |
| ARC AGI 2 | 68.8% | 54.2% | 45.1% | N/A |
| BrowseComp | 83.7% | 77.9% | 59.2% | N/A |
| GDPval-AA (Elo) | 1606 | 1462 | 1195 | N/A |
| OSWorld-Verified | 72.7% | N/A | N/A | N/A |
| MMMU Pro | 73.9% | 79.5% | 81.0% | N/A |
| Context window | 200K/1M | 128K | 2M | N/A |
| Max output | 128K tokens | 32K tokens | 8K tokens | N/A |
| Standard input price | $5/MTok | $15/MTok | $12.50/MTok | $2/MTok |
| Standard output price | $25/MTok | $60/MTok | $37.50/MTok | $15/MTok |
Notes: Competitor scores are drawn from third-party benchmark comparisons published around the time of Opus 4.6's release. Pricing figures reflect the model tiers most comparable in capability to Opus 4.6. Grok 4 data was limited at the time of publication.[8][9][10]
At launch, Opus 4.6 led competitors on enterprise knowledge work (GDPval-AA), long-context retrieval (MRCR v2), agentic terminal coding (Terminal-Bench 2.0), legal reasoning (BigLaw Bench), and abstract reasoning (ARC AGI 2). GPT-5.2 led on visual reasoning and several mathematics benchmarks; in particular, GPT-5.2's 93.2% on GPQA Diamond edged out Opus 4.6's 91.3%. Gemini 3 Pro offered the largest native context window (2M tokens) at lower per-token pricing for prompts under 200K tokens. Grok 4 posted competitive coding numbers at significantly lower prices, but Anthropic did not include direct comparisons in the launch material.[8][9]
Anthropic's pricing for Opus 4.6 was substantially lower than comparable GPT-5 tiers and Gemini 3 Pro's listed rates, though consistently higher than Grok 4's; the long-context surcharge narrowed but did not erase the gap to Gemini 3 Pro. The model's combination of price parity with its predecessor, expanded capabilities, and a one-million-token window without a price increase for legacy-size requests was widely cited in comparisons as a favorable value proposition relative to OpenAI's higher-tier models.[10]
Opus 4.6 was positioned primarily as a software engineering model. Its ability to hold large codebases in context (up to roughly 750,000 words of code in a single request) allowed it to reason about entire repositories rather than individual files. Anthropic cited specific improvements in code review, where the model demonstrated higher rates of catching its own mistakes and bugs it had introduced in earlier turns. Claude Code's agent team feature built directly on these capabilities, allowing automated multi-step coding workflows.[1]
The headline coding integrations included GitHub Copilot (across all paid tiers and IDE surfaces from launch day), JetBrains, Cursor, Sourcegraph, and Replit. Within Anthropic's own product line, Claude Code v2 series releases shipped with Opus 4.6 as the recommended model for agent-team workflows, and the Claude Agent SDK was updated to expose the new compaction and adaptive thinking surfaces directly.[6][1]
The BigLaw Bench score (90.2%) and GDPval-AA performance (1606 Elo) made Opus 4.6 competitive for tasks in professional services. Long-context capabilities allowed the model to analyze full contract texts, lengthy regulatory filings, or extended case law without the chunking and retrieval overhead required by smaller-context models. Finance and legal teams deploying it in 2026 used it for contract review, regulatory analysis, financial modeling from lengthy filings, and research synthesis across large document sets.[1][8]
Claude in Excel, expanded to all Max, Team, and Enterprise users with the Opus 4.6 launch, gave finance teams direct access to Opus-class reasoning inside spreadsheets. The integration supported pivot table editing, conditional formatting, and formula generation across complex models, with the Excel side panel using the model's tool-calling ability to manipulate the workbook directly rather than producing text instructions for users to apply manually.[1]
The combination of BrowseComp performance (83.7%), long-context retention, and tool use capabilities made Opus 4.6 well suited for deep research tasks requiring retrieval and synthesis across many sources. With the compaction API, research agents could maintain continuity across conversations much longer than the context window would otherwise allow, accumulating findings across multiple tool-use cycles without losing earlier results.[1][3]
Agent teams (preview at launch) enabled multi-agent workflows where Opus 4.6 instances coordinated parallel workstreams. Practical enterprise deployments included coordinated document processing pipelines, multi-step financial analysis workflows, and distributed software testing. The data residency controls (inference_geo) met compliance requirements in industries with data sovereignty constraints.[1][3]
Claude for Excel (updated to Opus 4.6 at launch) added support for pivot table editing and conditional formatting via a side-panel interface. Claude for PowerPoint, entering research preview on February 5, 2026, read existing slide layouts, respected template fonts and brand colors, and generated editable objects rather than static images. These integrations brought Opus 4.6 capabilities into common enterprise workflow tools without requiring API access.[1][15]
Opus 4.6 launched under Anthropic's ASL-3 standard, the same Responsible Scaling Policy classification applied to Claude Opus 4 in May 2025 and to all Opus-tier models since. Opus 4.6's system card characterized the safety profile as roughly comparable to Opus 4.5, with measurable improvements on a few alignment metrics and small movements on others. Anthropic described the model as maintaining the alignment quality of Opus 4.5, which the company had previously called potentially the best-aligned frontier model in the industry.[4][1]
Misalignment metrics, including rates of sycophancy, deception, and reckless agentic behavior, were reported at low levels. The system card noted that Opus 4.6 had the lowest over-refusal rate of any recent Claude model, an improvement that mattered specifically in production deployments where false-positive refusals interrupted legitimate workflows. As with earlier Claude models, alignment training combined supervised fine-tuning with Constitutional AI, the technique Anthropic introduced in 2022 to teach models to follow a written set of principles rather than relying solely on human preference labels.[4][1]
Anthropic developed six new cybersecurity evaluation probes alongside Opus 4.6 to assess potential misuse given the model's improved cyber capabilities. The probes covered areas such as exploit chaining, vulnerability discovery in real codebases, and prompt-injection resistance during agentic web tasks. The model's improved performance on cyber-offensive evaluations was a primary reason Anthropic invested in the additional probes; the company reported that combined safeguards kept residual risk within ASL-3 thresholds, the AI safety classification used for models that can offer meaningful uplift to bad actors.[4]
Prompt-injection resistance, a key evaluation for any agentic deployment, remained an active area of work. Anthropic did not publish a single Gray Swan attack-success-rate number alongside Opus 4.6 (Opus 4.5 had reported 4.7% on that benchmark, which the company called best-in-class for its release week). Subsequent Anthropic communications around Opus 4.7 in April 2026 described continued reductions in unnecessary refusal rates relative to Opus 4.6, citing approximately 0.71% on Opus 4.6 and 0.28% on Opus 4.7.[12][22]
Opus 4.6's training data cutoff was August 2025, with a reliable knowledge cutoff of May 2025. The reliable cutoff was the date through which Anthropic considered the model's knowledge of world events to be reliable; the August 2025 training cutoff captured later events less reliably and was a frequent source of errors in benchmarking exercises that asked about late-2025 news.[4][2]
Coverage of the February 5 launch focused on three points: the one-million-token context window (in beta at launch), the GDPval-AA margin over GPT-5.2, and the agent teams feature in Claude Code. The CNBC report on the release used the phrase "vibe working" to describe the shift toward AI models capable of executing entire work segments autonomously. TechCrunch led with agent teams and the PowerPoint sidebar, framing the release as Anthropic's bid to expand from coding-centric usage into broader knowledge work.[15]
Developers testing the model in the weeks following launch noted a clearer split in practice between Opus 4.6 (reserved for complex agentic tasks and large-context use cases) and Claude Sonnet 4.6 (which launched February 17, 2026, and became the daily-driver model for most professional developers). One analysis of Claude Code usage data from spring 2026 found that developers preferred Claude Sonnet 4.5 over Claude Opus 4.5 59% of the time in head-to-head testing, illustrating a broader pattern of smaller models narrowing the gap with Opus-tier models on everyday tasks.[1][8]
Enterprise adoption of Opus 4.6 was supported by the model's safety profile. Anthropic's system card for the February 2026 release reported low rates of misaligned behavior (sycophancy, deception) and the lowest over-refusal rates among recent Claude models, two metrics that matter in production deployments where false positives can disrupt workflows. Six new cybersecurity evaluation probes were developed alongside the release for enhanced threat detection assessments.[4]
Overchat AI, a platform tracking model capabilities, described Opus 4.6 as "Anthropic's Best Model Sets New Records" at launch. DataCamp's review noted the model achieved perfect scores across eight rigorous hand-crafted logic tests covering spatial reasoning, constraint satisfaction, modular arithmetic, code debugging, and physics counterfactuals. The DeepLearning.AI Batch newsletter highlighted the MRCR v2 jump and the Terminal-Bench 2.0 lead as the two most consequential numbers for working developers.[8][9][16]
Third-party reviews also raised criticisms. Several Reddit and Hacker News commenters reported that Opus 4.6 produced flatter, more generic prose than Opus 4.5 on creative writing tasks, although detailed system prompts could mitigate the effect. Others pointed out that the long-context surcharge above 200,000 tokens raised real-world costs for any workflow that genuinely needed the new window, partly offsetting the headline price-parity claim. Both critiques recurred in coverage of Sonnet 4.6 twelve days later, where the lack of any long-context surcharge for the smaller model became one of the main selling points.[10][14]
The Codecademy review of the model framed Opus 4.6 as a meaningful upgrade for users on the API and Claude Pro tiers, but noted that less-demanding workloads could continue running on Sonnet 4.5 without major regressions. That tradeoff echoed Simon Willison's commentary on Opus 4.5, where he had observed a similar Opus/Sonnet split for most everyday tasks.[14][21]
Despite its capabilities, Opus 4.6 carried several limitations acknowledged at launch and in subsequent analysis.
The long-context tier pricing (doubling input cost above 200,000 tokens) made large-context workflows meaningfully more expensive. For applications that could operate within 200,000 tokens, Claude Sonnet 4.6 (released February 17, 2026) offered a lower-cost alternative that also supported the 1M-token context window, with no long-context surcharge.[3][13]
The removal of assistant message prefilling was a breaking change for existing integrations. Developers who had used prefilling to guide output format had to migrate to structured outputs or system-prompt instructions, adding friction for teams running existing production deployments. The associated migration to output_config.format for structured outputs was small in scope but required code changes for any caller using the older field name.[3]
The agent teams feature was a preview at launch, meaning Anthropic offered it without the stability guarantees of a general-availability release. Token consumption scaled multiplicatively when multiple agents ran in parallel, making cost management more complex for teams deploying the feature in production. Reddit threads in March 2026 described agent-team runs that produced impressive results but consumed orders of magnitude more tokens than equivalent single-agent runs, prompting Anthropic to recommend the Claude Max subscription for serious experimentation.[1][10]
Multimodal performance, while functional, lagged specialized vision models. The CharXiv score of 69.1% without tools was competitive but below Gemini 3 Pro's leading multimodal numbers. The MMMU Pro score of 73.9% trailed both GPT-5.2 (79.5%) and Gemini 3 Pro (81.0%). For image-heavy workflows, the per-token economics further favored alternatives.[8][9]
GPQA Diamond at 91.3% was competitive but trailed GPT-5.2 (93.2%) and Gemini 3 Pro (91.9%) on graduate-level science reasoning. The pattern was consistent with Opus 4.5 (87.0% on the same benchmark) and reflected the family's relative weakness on pure scientific knowledge benchmarks compared to its strengths on coding, agentic tasks, and economically valuable knowledge work.[8][9]
The creative writing regression noted in early Reddit and Hacker News reception was harder to quantify but recurred in commentary throughout February and March 2026. Anthropic's launch materials emphasized utility and coding gains; the company did not address the creative-prose criticisms directly, although prompting techniques and explicit style guidance were widely shared in community channels as workarounds.[10][14]
Opus 4.6 was superseded by Claude Opus 4.7 on April 16, 2026, which introduced a new tokenizer, a step-change improvement in agentic coding (SWE-bench Verified rose from 80.8% to 87.6% and SWE-bench Pro from 53.4% to 64.3%), and an xhigh effort level. As of the May 2026 Anthropic API documentation, Opus 4.6 is listed as a current model with migration to Opus 4.7 recommended for new agentic coding projects, while Opus 4.6 remains the recommended option for workloads tuned to its specific tokenizer, sampling parameters, or prefilling-replacement code paths.[12][3]