Claude Opus 4.1 is a large language model developed by Anthropic and released on August 5, 2025. It is the first incremental update to the Opus tier of the Claude 4 family, succeeding Claude Opus 4 (May 22, 2025) and preceding Claude Opus 4.5 (November 24, 2025). Anthropic positioned the release as a focused upgrade for agentic tasks, real-world coding, and research-style reasoning, with pricing held flat at $15 per million input tokens and $75 per million output tokens to encourage existing Opus 4 customers to migrate.[1][2]
The headline benchmark for Opus 4.1 was SWE-bench Verified, where the model advanced Anthropic's state of the art for real-world software engineering from 72.5% (Opus 4) to 74.5%. Smaller but consistent gains showed up across most other reported evaluations, including Terminal-bench, GPQA Diamond, AIME 2025, MMLU (in its multilingual MMMLU variant), and MMMU validation. The one regression Anthropic disclosed was on the airline split of tau-bench, which dropped 3.6 percentage points relative to Opus 4.[3][4][5]
Anthropic released the model with API identifier claude-opus-4-1-20250805, kept the 200,000 token context window and 32,000 token maximum output of Opus 4, and shipped the model simultaneously to claude.ai paid tiers, Claude Code, the Claude API, Amazon Bedrock, and Google Cloud Vertex AI. The model was deployed under Anthropic's Responsible Scaling Policy at AI Safety Level 3 (ASL-3), the same standard as Opus 4. Anthropic published a short system card addendum rather than a full new system card, on the basis that Opus 4.1's capability gains did not cross the thresholds that would require fresh red-team evaluations.[1][6][7]
Opus 4.1 served as Anthropic's flagship for roughly 16 weeks before being overtaken by Claude Sonnet 4.5 on agentic coding (September 29, 2025) and then formally superseded at the Opus tier by Claude Opus 4.5 on November 24, 2025. The model remained available on the API through 2026 with a tentative retirement date of "not sooner than August 5, 2026," though it was relegated to the "Legacy models" section of Anthropic's models documentation by early 2026 as Opus 4.5 became the default Opus offering.[8][9]
Anthropic announced the Claude 4 family on May 22, 2025 with the simultaneous launch of Claude Opus 4 and Claude Sonnet 4. Opus 4 was Anthropic's first model deployed under the ASL-3 deployment and security standard. It introduced hybrid reasoning, a 200,000 token context window, a 32,000 token maximum output, and a substantially deeper investment in agentic software engineering than any prior Claude model. The Opus 4 launch also marked the general availability of Claude Code, Anthropic's terminal coding agent, after a research preview that began alongside Claude 3.7 Sonnet in February 2025.[10][11]
By midsummer 2025 the model landscape around Claude 4 had shifted. OpenAI was preparing to launch GPT-5 (which arrived on August 7, 2025, two days after Opus 4.1), and Google's Gemini 2.5 Pro was already on the market with strong reasoning scores. Anthropic was simultaneously feeling competitive pressure from xAI's Grok 4 and from open-weights coding models tuned for SWE-bench. Internal Anthropic data and partner feedback pointed at a small number of high-leverage failure modes in Opus 4: drifting context tracking on multi-file refactors, occasional unnecessary edits in adjacent files, and uneven tool-call discipline on long agent traces. Anthropic chose to ship a focused increment rather than wait for a larger generation step, an approach the company would repeat several times over the next nine months.[1][2][12]
The Opus tier had also drifted out of step with the rest of the family. By early August 2025, Sonnet 4 was the model most enterprise customers actually used in production, both because of cost and because the gap between Opus 4 and Sonnet 4 was smaller than the gap between equivalent Opus and Sonnet pairings in earlier Claude generations. Anthropic needed Opus 4.1 to widen that gap, especially on the long-horizon coding work where a higher per-token price was still defensible. The customer references in the launch post (GitHub on multi-file refactoring, Rakuten on surgical edits in large codebases, Windsurf on a junior-developer benchmark) all targeted exactly the kind of difficult, long-horizon engineering work that justified the $15/$75 Opus tier rather than dropping down to the cheaper $3/$15 Sonnet tier.[1][2]
Internally, Anthropic referred to the release simply as a snapshot upgrade rather than a new generation, and outside commentators read it the same way. Simon Willison wrote in his August 5 launch post: "treating this as a .1 version increment looks like it's an accurate depiction of the model's capabilities."[3]
Anthropic announced Claude Opus 4.1 in a short blog post on August 5, 2025 titled "Claude Opus 4.1." The release post described the model as "an upgrade to Claude Opus 4 on agentic tasks, real-world coding, and reasoning" and called out three customer-quoted improvements. GitHub said that "Claude Opus 4.1 improves across most capabilities relative to Opus 4, with particularly notable performance gains in multi-file code refactoring." Rakuten Group said the model "excels at pinpointing exact corrections within large codebases without making unnecessary adjustments or introducing bugs." Windsurf said 4.1 "delivers a one standard deviation improvement over Opus 4 on their junior developer benchmark," comparable to the jump from Sonnet 3.7 to Sonnet 4.[1]
The release went out simultaneously across all Anthropic surface areas: paid claude.ai tiers (Pro, Max, Team, and Enterprise), Claude Code, the Anthropic API, Amazon Bedrock, and Google Cloud Vertex AI. Pricing matched Opus 4 exactly at $15 per million input tokens and $75 per million output tokens, with prompt cache writes at $18.75 per million tokens (a 1.25x premium for the standard 5-minute cache) and cache reads at $1.50 per million tokens (a 90% discount on cached input). The model used the API identifier claude-opus-4-1-20250805, with the convenience alias claude-opus-4-1 resolving to that snapshot.[1][2][13]
The Bedrock identifier was anthropic.claude-opus-4-1-20250805-v1:0, available at launch in US West (Oregon), US East (N. Virginia), and US East (Ohio); the Vertex AI identifier was claude-opus-4-1@20250805, available in Anthropic's standard Vertex partner regions. Cross-platform parity was a deliberate Anthropic positioning choice: enterprise customers running through Bedrock or Vertex did not need to wait for a separate availability announcement, and the model behavior was identical regardless of how it was reached.[2][13]
Anthropic also signaled in the post that further upgrades were imminent, writing that "we plan to release substantially larger improvements to our models in the coming weeks." That promise was kept later in 2025, with Claude Sonnet 4.5 arriving on September 29, Claude Haiku 4.5 on October 15, and Claude Opus 4.5 on November 24. From the announcement of Opus 4.1 to the release of Opus 4.5 was 111 days, the second shortest interval between Anthropic Opus-tier flagships in company history (only the Opus 4 to Opus 4.1 gap of 75 days was shorter).[14][15][8]
Anthropic listed Opus 4.1 with a reliable knowledge cutoff of January 2025 and a training data cutoff of March 2025, both unchanged from Opus 4. The reliable cutoff is the date through which Anthropic considers the model's knowledge most extensive and reliable, and the training data cutoff is the broader date range of training data used. Opus 4.1 used the same tokenizer as Opus 4 and the rest of the Claude 4 family; the new tokenizer that defined the Claude Opus 4.7 generation was still nine months in the future at launch.[2][13]
Anthropic's announcement of Opus 4.1 included a comparison chart against Opus 4 covering eight benchmarks. SWE-bench Verified and Terminal-bench were reported without extended thinking; tau-bench, GPQA Diamond, MMMLU, MMMU, and AIME 2025 were reported with extended thinking enabled at up to 64,000 reasoning tokens. The full table is reproduced below.[1][3]
| Benchmark | Opus 4 | Opus 4.1 | Change | Extended thinking |
|---|---|---|---|---|
| SWE-bench Verified | 72.5% | 74.5% | +2.0 pp | No |
| Terminal-bench | 39.2% | 43.3% | +4.1 pp | No |
| GPQA Diamond | 79.6% | 80.9% | +1.3 pp | Yes |
| AIME 2025 | 75.5% | 78.0% | +2.5 pp | Yes |
| MMMLU (multilingual MMLU) | 88.8% | 89.5% | +0.7 pp | Yes |
| MMMU Validation | 76.5% | 77.1% | +0.6 pp | Yes |
| tau-bench Retail | 81.4% | 82.4% | +1.0 pp | Yes |
| tau-bench Airline | 59.6% | 56.0% | -3.6 pp | Yes |
The largest absolute gain came on Terminal-bench, an evaluation of long-running terminal sessions where the model has to plan, execute, and recover from shell commands. The gain on SWE-bench Verified was smaller in percentage-point terms but more closely watched, because SWE-bench Verified had become the de facto industry yardstick for real-world software engineering. Anthropic credited the SWE-bench gain primarily to better discrimination between which files to modify and which to leave alone, an issue several partners had flagged on Opus 4.[1][3][4]
Anthropic noted in the methodology fine print that the tau-bench scores were "achieved with a prompt addendum to both the Airline and Retail Agent Policy" and that the maximum number of agent steps had been increased from 30 to 100. That methodological note mattered because the harder agent tasks in tau-bench tended to terminate near the 30-step ceiling, and lifting it allowed both Opus 4 and Opus 4.1 to attempt more involved sequences. The cleaner comparison was thus between two models on a slightly more lenient harness, not between two models on the original setup.[1][3]
The drop on tau-bench Airline was the only regression Anthropic published. The airline split tests sustained negotiation with a multi-step constrained API, and the regression suggested that the post-training mix that improved code refactoring may have shifted policy slightly on certain customer-service style multi-turn dialogues. Anthropic did not provide a detailed explanation, and partner-side reports did not flag the regression as practically significant.[3]
Beyond the raw scores, Anthropic and several partners described qualitative behavior changes: sharper discrimination between files to modify and files to leave alone, steadier focus across long tool-using traces, and more reliable recovery when a tool call failed.
The model did not change architecture in any way Anthropic publicly disclosed. The 200,000 token context window, 32,000 token maximum output, hybrid reasoning toggle, vision input, and tool-use schemas were all carried over from Opus 4 unchanged.[1][2][13]
The improvements concentrated in places that mattered for production engineering work. Multi-file refactoring is the kind of task where the cost of a single mistake propagates: an unwanted change to a shared utility breaks tests in many other files, and the model has to reason about callers as well as callees. Anthropic's launch language suggested that the post-training data and reward modeling for Opus 4.1 specifically targeted this case. The Windsurf comparison to "the jump from Sonnet 3.7 to Sonnet 4" was meaningful precisely because that earlier jump had been one of the more significant within-tier changes in Anthropic's history, and Windsurf's internal evaluations were one of the toughest external coding signals around.[1][3]
Anthropic also emphasized that Opus 4.1 was not chasing a single number on a single leaderboard. The launch post called out research-and-data-analysis gains and tool use rather than spotlighting a single benchmark, an unusual move at a time when most frontier-model launches led with one or two attention-grabbing numbers. That positioning matched the version label: a 4.1 release that nudged real workflows along rather than redefining the model class.[1]
GPT-5 launched on August 7, 2025, two days after Opus 4.1, and on its launch chart OpenAI reported 74.9% on SWE-bench Verified, narrowly above Opus 4.1's 74.5%. Independent comparisons over the following weeks generally placed the two models within noise of each other on coding and within a few percentage points on hard reasoning, with GPT-5 ahead on some math and academic benchmarks and Opus 4.1 ahead on several agentic and tool-use evaluations. The Scale SWE-bench Pro leaderboard, a tighter version of SWE-bench, showed both models near the top with Opus 4.1 at 22.7% on the public split and 17.8% on the private split versus GPT-5 at 23.1% public and 14.9% private, with Opus 4.1 noted as more stable across languages and repositories.[18][19]
The picture changed sharply with the next Opus generation. The table below puts Opus 4.1 alongside its predecessor, GPT-5, and Claude Opus 4.5.
| Benchmark | Opus 4 (May 2025) | Opus 4.1 (Aug 2025) | GPT-5 (Aug 2025) | Claude Opus 4.5 (Nov 2025) |
|---|---|---|---|---|
| SWE-bench Verified | 72.5% | 74.5% | 74.9% | 80.9% |
| Terminal-bench | 39.2% | 43.3% | not reported | not reported by Anthropic at launch |
| GPQA Diamond | 79.6% | 80.9% | 87.0% (with extended thinking) | not directly comparable[a] |
| AIME 2025 | 75.5% | 78.0% | 94.6% (with extended thinking) | not directly comparable[a] |
| Pricing (input / output) | $15 / $75 | $15 / $75 | $1.25 / $10 | $5 / $25 |
[a] Opus 4.5 emphasized agentic and SWE-bench evaluations and downplayed older academic benchmarks at launch.[8]
The pricing column captures the most dramatic context for Opus 4.1's place in history. At its launch, Opus 4.1 was the most expensive model in Anthropic's lineup; by November, Opus 4.5 was offering substantially better performance at one third of the price, and OpenAI was already pricing GPT-5 an order of magnitude lower than Opus 4.1 on inputs and roughly seven times lower on outputs. The economics of running Opus 4.1 in production thus deteriorated rapidly across its 111 days as the flagship Opus model.[8][2][18]
Artificial Analysis tracked Opus 4.1 in its non-reasoning category, giving it an Intelligence Index score of 36, sixteenth out of seventy non-reasoning models on its leaderboard, with output speed of approximately 29 tokens per second and time to first token of about 1.9 seconds. The site noted the score was "well above average" but flagged the model as "notably slow" relative to peers with median speed near 53 tokens per second. By the same site's pricing dimension, Opus 4.1's $18.75 effective input rate (counting cache writes) and $75 output rate placed it near the top of the cost-per-token rankings.[20]
LM Arena added Opus 4.1 to its Text Arena and WebDev Arena leaderboards on August 11, 2025 and to its Search Arena on August 26, 2025. As of late November 2025 (when Opus 4.5 entered the Arena), Opus 4.1 held the number 4 spot on the WebDev leaderboard and number 7 on Text Arena, both highly competitive boards that included GPT-5, Gemini 2.5 Pro, and Grok 4 by that point.[21]
Vellum's published comparison framed Opus 4.1's gains as "steady, incremental" rather than transformative, emphasizing the multi-file refactoring improvement as the change most likely to matter in production code work.[4][22]
TechRepublic's launch coverage tracked the same benchmarks but argued the improvements told a story Anthropic could not afford to skip: Opus 4.1 needed to demonstrate that the company could iterate faster than OpenAI between major generations. With GPT-5 arriving 48 hours later, the value of Opus 4.1 was less the eight-tenths-of-a-point delta on this or that benchmark and more the proof that Anthropic's training stack could push out a credible new flagship inside three months of the previous one. From that angle Opus 4.1 was less an end product than a deployment exercise, with the larger payoff coming when Sonnet 4.5 and Opus 4.5 followed at predictable intervals.[28][29]
Pricing for Opus 4.1 matched Opus 4 exactly across every surface, a deliberate signal from Anthropic that customers should treat 4.1 as a drop-in replacement.
| Item | Price | Notes |
|---|---|---|
| Input tokens | $15.00 per million | Same as Opus 4 |
| Output tokens | $75.00 per million | Same as Opus 4 |
| Prompt cache write (5 minute) | $18.75 per million | 1.25x input rate |
| Prompt cache write (1 hour) | $30.00 per million | 2x input rate |
| Prompt cache read | $1.50 per million | 10% of input rate (a 90% discount) |
| Batch API discount | 50% | On both input and output |
| Vision input | Charged as text equivalent | Same scheme as Opus 4 |
Through claude.ai, Opus 4.1 was available to Pro, Max, Team, and Enterprise tier users. It was available through Claude Code from launch and replaced Opus 4 as the default Opus model in Claude Code shortly after release. On the API, the snapshot identifier was claude-opus-4-1-20250805 and the convenience alias was claude-opus-4-1. The model was simultaneously available as anthropic.claude-opus-4-1-20250805-v1:0 on Amazon Bedrock and as claude-opus-4-1@20250805 on Google Cloud Vertex AI.[2][13]
The 200,000 token context window was unchanged from Opus 4 and remained the standard Anthropic context length until Opus 4.6 expanded it to one million tokens in February 2026. Maximum output tokens were also unchanged at 32,000 (synchronous Messages API). Extended thinking on Opus 4.1 followed the same toggle and budget mechanics as on Opus 4 and Sonnet 4: developers set a budget_tokens value (up to 64,000) and the model spent that budget on a chain-of-thought scratchpad before producing the final answer.[13][23]
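As an illustration of those mechanics, here is a minimal sketch using Anthropic's Python SDK, assuming an ANTHROPIC_API_KEY in the environment (the prompt is illustrative):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-opus-4-1-20250805",
    max_tokens=16000,  # must be larger than the thinking budget below
    # Extended thinking is opt-in per request: the model may spend up to
    # budget_tokens on a chain-of-thought scratchpad before answering.
    thinking={"type": "enabled", "budget_tokens": 8000},
    messages=[{"role": "user", "content": "How many primes are there below 1000?"}],
)

# The response interleaves thinking blocks with the final text blocks.
for block in response.content:
    if block.type == "text":
        print(block.text)
```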
Production teams running Opus 4.1 quickly learned that prompt caching changed the practical cost equation. A typical Claude Code session sending 50,000 tokens of repository context plus 5,000 tokens of new prompt would cost $0.825 on a cold call ($15 x 0.055M). Once that 50,000 token prefix was in the 5-minute cache, follow-up calls cost roughly $0.075 for the cache read plus $0.075 for the new 5,000-token tail, for a total of $0.15. The cache write itself cost $0.94 (50,000 x $18.75 / 1M), so the breakeven point was a single follow-up call rather than several.[1][2][13]
For longer-running agent traces, the 1-hour cache option (introduced family-wide in mid-2025) added another layer. Writing a long context to the 1-hour cache cost twice the input rate but allowed sixty minutes of cheap reads, which mattered for Claude Code or Devin runs that stretched well past the five-minute cache lifetime. The combined effect was that Opus 4.1's headline $15/$75 pricing was misleading for any workflow that involved repeated calls against the same large context: effective per-call costs in production agent traces ran an order of magnitude lower.[2][13]
This did not change the absolute price-performance comparison with cheaper models. Sonnet 4 at $3/$15 with the same caching behavior was still cheaper, and GPT-5 at $1.25/$10 was cheaper still. But it did change which workloads Opus 4.1 was reasonable for. Long-horizon coding agents that needed both the strongest available model and many calls against the same repository context could reach Opus 4.1's per-task economics by the third or fourth call.[2][13]
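The breakeven arithmetic is easy to reproduce. A quick sketch using the published Opus 4.1 rates and the illustrative 50,000/5,000-token session from above (input-side costs only):

```python
# Opus 4.1 published rates, dollars per million tokens
INPUT, OUTPUT = 15.00, 75.00
CACHE_WRITE_5MIN = 18.75   # 1.25x the input rate
CACHE_READ = 1.50          # 10% of the input rate

def call_cost(prefix_tokens, tail_tokens, cached):
    """Input-side cost of one call with a large shared prefix."""
    if cached:
        return (prefix_tokens * CACHE_READ + tail_tokens * INPUT) / 1e6
    return (prefix_tokens + tail_tokens) * INPUT / 1e6

prefix, tail = 50_000, 5_000
cold = call_cost(prefix, tail, cached=False)              # $0.825, no caching
first = (prefix * CACHE_WRITE_5MIN + tail * INPUT) / 1e6  # $1.0125, pays the write premium
warm = call_cost(prefix, tail, cached=True)               # $0.15 per follow-up

premium = first - cold         # extra cost of priming the cache: $0.1875
saving_per_call = cold - warm  # saving on each warm follow-up:   $0.675
print(f"cold=${cold:.4f}  first=${first:.4f}  warm=${warm:.4f}")
print(f"breakeven after {premium / saving_per_call:.2f} follow-up calls")  # ~0.28, i.e. one call
```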
Opus 4.1 inherited the full tool use surface of Opus 4: function calling, prompt caching, vision, computer use (the desktop-control beta introduced with Claude 3.5 Sonnet in October 2024), JSON mode, and the structured response schema feature. The model was supported in Claude Code from day one and fit seamlessly into existing agent harnesses. Anthropic did not publicize architectural changes to tool-use plumbing, and the visible improvements came from training rather than infrastructure.[1][13]
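The tool-use surface itself was the standard Claude 4 function-calling schema. A minimal sketch of what that looked like against the Messages API (the get_weather tool is hypothetical):

```python
import anthropic

client = anthropic.Anthropic()

tools = [{
    "name": "get_weather",  # hypothetical tool, for illustration only
    "description": "Return current weather for a city.",
    "input_schema": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

response = client.messages.create(
    model="claude-opus-4-1-20250805",
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
)

# When the model decides to call a tool it emits a tool_use content block;
# the harness executes the call and returns a tool_result on the next turn.
for block in response.content:
    if block.type == "tool_use":
        print(block.name, block.input)
```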
The most consistent partner feedback on agent behavior was that Opus 4.1 was better at staying on task across long tool-using traces and at recovering when a tool returned an error. Cognition's Devin and Anthropic's own Claude Code team both reported that 4.1 tended to push through tool failures rather than restart from scratch. Augment Code and Cursor likewise updated their default Anthropic offerings to point at Opus 4.1 within days of launch.[16][17][24]
In mid-August 2025, Anthropic shipped an update to the consumer claude.ai surface that allowed Opus 4 and Opus 4.1 to end conversations that remained "persistently harmful or abusive" after multiple refusals, the first end-conversation capability Anthropic had given any model. Anthropic announced the feature on August 15, 2025 and framed it as a "model welfare" measure as much as a safety one, with the launch explicitly citing Opus 4.1's improved harmless response rate as evidence that the model could be trusted to make these judgments. The behavior was reserved for what Anthropic called "rare, extreme cases," specifically requests for sexual content involving minors and attempts to solicit information enabling large-scale violence or terror, and only after multiple redirection attempts had failed. Anthropic also explicitly directed Claude not to use the capability when users might be at imminent risk of harming themselves or others, and noted that users whose conversations were ended could still start new conversations or branch existing ones by editing earlier turns.[25][26]
The computer-use beta, in which Claude controls a virtual desktop via screenshots and keyboard or mouse actions, had been available since Claude 3.5 Sonnet in October 2024 and was carried forward unchanged through the Claude 4 family, Opus 4.1 included. Opus 4.1 did not add new computer-use primitives, but partner reports from August through November 2025 indicated that the model handled long computer-use traces more reliably than Opus 4. The same multi-step planning improvements that produced the Terminal-bench gain showed up on desktop tasks, including the kind of multi-application workflows that had been the early demos of computer use.[24]
Anthropic positioned computer use as still pre-production through this period, recommending it for evaluation rather than deployment. Customers experimenting with computer use under Opus 4.1 generally adopted the same posture: meaningful for proof-of-concept builds and internal automation, not yet for customer-facing flows.[24]
Anthropic published a short System Card Addendum for Claude Opus 4.1 rather than a full system card. Under Anthropic's Responsible Scaling Policy, comprehensive new safety evaluations are required when a model is "notably more capable" than the last model that underwent comprehensive assessment, a threshold defined as either a 4x or greater jump in effective compute or six months' worth of accumulated fine-tuning and capability elicitation since the last evaluation. Opus 4.1 met neither criterion against Opus 4, so the addendum focused on automated evaluations to track capability progression and confirm that ASL-3 deployment standards remained appropriate.[6][7][27]
The addendum disclosed several concrete safety measurements:

- The harmless response rate rose from 97.27% on Opus 4 to 98.76% on Opus 4.1.
- Cooperation with high-risk misuse scenarios fell by roughly 25%, with no increase in refusal rates on benign requests.
- Reward-hacking propensity remained similar to Opus 4's and higher than Sonnet 4's.
The addendum concluded that ASL-3 deployment and security standards remained appropriate for Opus 4.1 and that the model's capability gains over Opus 4 did not warrant a full system-card refresh. The full Claude Opus 4 system card, published with the May 2025 launch, continued to govern Opus 4.1's deployment in combination with the addendum.[6][7]
Anthropic's Constitutional AI training methodology and reinforcement learning from human feedback continued to shape Opus 4.1's behavior, as with all prior Claude models. There were no public disclosures of changes to the constitution itself between Opus 4 and Opus 4.1.
Under Anthropic's Responsible Scaling Policy, ASL-3 ("AI Safety Level 3") models require both deployment-side mitigations (additional monitoring, training against cooperation with misuse, refusal of certain prompts even from authorized users) and security-side mitigations (red-teamed protections for model weights, restrictions on internal use). ASL-3 was first triggered by Opus 4 in May 2025; Opus 4.1's addendum confirmed that ASL-3 remained the appropriate level rather than escalating to ASL-4. ASL-4 would have required substantially more invasive monitoring and additional weight security, and Anthropic publicly committed not to deploy ASL-4 models without those measures in place. The fact that Opus 4.1 fit comfortably inside ASL-3 was therefore not just a regulatory finding but also a product decision: it meant the deployment posture and security stack from Opus 4 carried over without modification.[6][7][27]
Anthropic's classifier-based monitoring layer, which screens both prompts and outputs for misuse patterns, continued to apply to Opus 4.1. The 25% reduction in cooperation with high-risk misuse came from a combination of post-training and the same monitoring layer, and Anthropic noted that none of the reductions came at the cost of higher refusal rates on benign requests.[6][7]
Press coverage of Opus 4.1 was relatively low-key compared to the May 2025 Claude 4 launch or the November 2025 Opus 4.5 launch, in keeping with the model's positioning as an incremental update. Most launch coverage focused on the SWE-bench Verified gain, the unchanged pricing, and the implication that Anthropic was on a faster release cadence than competitors.[28][29][30]
Simon Willison's launch-day write-up framed the release as "a drop-in replacement for Opus 4," argued that "treating this as a .1 version increment looks like it's an accurate depiction of the model's capabilities," and noted that the safety metric changes in Anthropic's model card were unusually modest, in line with the small benchmark deltas. Willison also ran his recurring "pelican riding a bicycle" test, which asks the model to generate an SVG drawing, and mentioned that he subjectively preferred Opus 4's pelican to Opus 4.1's, an observation that quickly circulated as a meme around the launch. He also shipped llm-anthropic 0.18 with Opus 4.1 support on launch day, illustrating the speed at which the open-source ecosystem was tracking Anthropic releases.[3]
InfoQ's coverage emphasized the safety side of the release, highlighting the harmless response rate increase from 97.27% to 98.76% and the 25% reduction in cooperation with high-risk misuse scenarios, while noting that the headline performance gains were modest. The headline framed the release as "Improves Refactoring and Safety, Scores 74.5% SWE-bench Verified," underscoring the dual story Anthropic was telling: better engineering work plus better refusals on borderline prompts, neither of which dominated by itself but together justified a 4.1 designation.[7]
Vellum characterized the gains as "steady, incremental" and pointed to multi-file refactoring as the practical improvement most worth caring about. Built In and the Shanghai NYU AI bulletin both published analyses framing the release as Anthropic's bet on rapid iteration over hero-launch cycles. TechCrunch and 9to5Mac ran shorter coverage emphasizing the SWE-bench gain and the unchanged pricing, with both publications noting that the 4.1 designation matched the size of the change.[4][29][12]
LM Arena added Opus 4.1 to its leaderboards within a week of launch. Within the WebDev arena, where coding-focused users vote on side-by-side outputs, Opus 4.1 climbed quickly to the top five and held that position until Opus 4.5 displaced it in late November.[21]
The overall verdict from the developer community was that Opus 4.1 was a reliable, low-risk upgrade: pricing was unchanged, the API surface was identical, and benchmark gains were either positive or close enough to noise to be ignorable. The single airline tau-bench regression was the most-discussed weakness, and several writers cited it as a reminder that "incremental" Anthropic releases were not strict Pareto improvements over their predecessors.[3][4]
Pricing analysts and AI consultants framed the release in a different way. At $15 per million input tokens and $75 per million output tokens, Opus 4.1 was a stretch budget item for most teams even at Opus 4 prices. With GPT-5 launching two days later at $1.25 / $10 per million tokens and posting comparable headline numbers, the price-to-performance argument for Opus 4.1 became an enterprise-only argument: companies that already trusted Anthropic's safety positioning, that ran most of their work through Claude Code, or that had verified through their own evaluations that Opus 4.1 outperformed GPT-5 on their specific tasks tended to keep paying. Cost-sensitive teams and consumer-tier products mostly stayed on Sonnet 4 or moved to GPT-5.[2][18]
A subset of independent technical reviewers ran their own evaluations on Opus 4.1 in the weeks after launch. The consistent finding was that 4.1's gains over 4 were real but small, that 4.1 was strongest where Anthropic had advertised (multi-file refactoring, longer agent runs), and that the model rarely surprised testers with regressions outside the documented airline tau-bench drop. METR's evaluation track did not publish a dedicated Opus 4.1 number at launch, but its general comparisons across the Claude 4 family in the second half of 2025 placed Opus 4.1 modestly ahead of Opus 4 on long-horizon autonomy tasks.[3][4]
The Hacker News thread for the Opus 4.1 launch (item 44800185, August 5, 2025) ran roughly two hundred comments. The dominant themes were skepticism about the size of the benchmark gains ("a .1 release that earns its name"), debate about whether Opus's $15/$75 pricing was sustainable as cheaper competitors caught up, and a recurring observation that Anthropic was now on a release cadence closer to OpenAI's than to Google's. Several developers reported that 4.1 had subjectively become their default Opus model within hours of launch, with the common pattern being to point existing Claude Code or Cursor configurations at the new identifier and rerun their personal evaluation suites. The most frequent specific praise was for multi-file refactoring; the most frequent complaint was that the speed remained roughly the same as Opus 4 despite the Terminal-bench gain.[31]
That "it's a .1, it acts like a .1" framing held throughout the model's lifetime. By November, when Opus 4.5 displaced 4.1 as the flagship, the conventional take was that 4.1 had done its job: it had kept Anthropic's Opus tier credibly in the conversation with GPT-5 long enough for the company to ship a real generation step. That story would be repeated for Sonnet 4.5 (a strong incremental release that bought time for Sonnet 4.6) and to a lesser extent for the Haiku tier as well.[14][8]
Anthropic's launch post named three customers: GitHub, Rakuten, and Windsurf. Within hours of launch, several other AI coding agent companies updated their default Anthropic configurations to Opus 4.1, including Cursor, Augment Code, Cognition (the company behind Devin), and Anthropic's own Claude Code product.[1][16][17][24]
Claude Code adopted Opus 4.1 as its default Opus-tier model essentially at launch. The combination of Claude Code 1.x with Opus 4.1 was the configuration most active enterprise developers were running through August and September 2025, until Claude Code 2.0 shipped alongside Sonnet 4.5 on September 29.[24][14]
The pattern through fall 2025 was that Opus 4.1 became the workhorse Opus-tier model for production coding work, while Sonnet 4.5 (after September 29) and Haiku 4.5 (after October 15) absorbed an increasing share of agentic and cost-sensitive workloads. By the time Opus 4.5 launched in late November, most active Anthropic customers were using a mix of Sonnet 4.5 for default coding and Opus 4.1 for harder problems where the extra cost was justified.[14][15]
Anthropic's revenue grew sharply across this period, driven heavily by Claude Code adoption. Public reporting put Anthropic's annualized revenue at roughly $5 billion by August 2025 (Opus 4.1 launch), $7 billion by October, and well above $10 billion by year-end, with Claude Code itself crossing $2.5 billion in annualized run rate by early 2026. Opus 4.1 was the flagship model carrying that growth from August through November 2025.[31][32]
Among partners, Rakuten's quote in the launch post became the most-cited customer reference for the release: "Claude Opus 4.1 excels at pinpointing exact corrections within large codebases without making unnecessary adjustments or introducing new bugs." Windsurf's "one standard deviation improvement on our junior developer benchmark" line was the second most-cited.[1]
The integration story extended beyond developer tools. Enterprise customers running Claude through Amazon Bedrock and Vertex AI generally moved their default Claude routing to Opus 4.1 in the days following launch. Anthropic's largest cloud distribution partners reported that the migration was effectively transparent: a one-line change to point at the new model identifier. Anthropic's public materials emphasized this drop-in property, and Simon Willison's "drop-in replacement" framing was the consensus characterization of the migration cost.[1][3][13]
Several AI search and research products that used Claude as a backend (such as Perplexity's enterprise tier and a number of internal corporate research bots) updated their default Claude model to Opus 4.1 in the August through November window. Many of those products also hedged by routing easier traffic to Sonnet 4 or, after September 29, Sonnet 4.5, with Opus 4.1 reserved for the hardest queries. That pattern of "Opus for hard problems, Sonnet for everything else" came to define how customers used the Claude 4 family during the late summer and fall of 2025.[14][15]
For enterprise teams the practical migration pattern was: keep Opus 4 in production for a few days while internal evaluations ran on Opus 4.1, then update configuration to the new snapshot identifier (or rely on the claude-opus-4-1 convenience alias). Because the API and behavior surfaces were identical (same context window, same max output, same tool-use schemas, same prompt format), most teams reported that zero code changes were required. Anthropic's pricing parity reinforced this: even teams that did not run formal evaluations could move to Opus 4.1 without modeling cost impact.[1][2][13]
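A sketch of what that drop-in switch typically looked like, assuming the model id is read from an environment variable (the variable name CLAUDE_MODEL is illustrative):

```python
import os
import anthropic

# Pin the dated snapshot for reproducibility, or point at the convenience
# alias claude-opus-4-1 and let Anthropic resolve the current snapshot.
# Before migration this variable held claude-opus-4-20250514.
MODEL = os.environ.get("CLAUDE_MODEL", "claude-opus-4-1-20250805")

client = anthropic.Anthropic()
response = client.messages.create(
    model=MODEL,  # the only configuration value that changes in the migration
    max_tokens=4096,
    messages=[{"role": "user", "content": "Summarize the open pull requests."}],
)
print(response.content[0].text)
```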
Teams that did run formal A/B comparisons typically used a small set of tasks chosen from their own production traffic. The most common findings were the ones Anthropic had advertised: multi-file refactoring noticeably better, long agent traces slightly more stable, math and reasoning marginally better. A handful of teams, particularly those running heavy customer-service style multi-turn dialogues, reported that they preferred Opus 4 for those specific workflows; the airline tau-bench regression aligned with that anecdotal feedback.[3][4]
Opus 4.1's reign as Anthropic's flagship Opus model lasted from August 5, 2025 to November 24, 2025, a span of 111 days. The successor was Claude Opus 4.5, which Anthropic released on November 24, 2025 with three changes that defined the next phase of the family.[8]
First, Opus 4.5 became the first publicly available model to score above 80% on SWE-bench Verified, with 80.9% versus Opus 4.1's 74.5%, surpassing both GPT-5.1 (76.3%) and Gemini 3 Pro (76.2%). Second, Anthropic cut Opus-tier pricing by roughly 67%, from Opus 4.1's $15 / $75 per million tokens to $5 / $25 per million tokens. Third, the new release introduced a first-class "effort" parameter that let developers tune how many reasoning tokens the model could spend on a request: at medium effort, Opus 4.5 matched Sonnet 4.5's top SWE-bench score while using 76% fewer output tokens, and at maximum effort it surpassed Sonnet 4.5 by 4.3 percentage points while still consuming 48% fewer output tokens.[8]
The combined effect was to make Opus 4.1 essentially uneconomic for new deployments overnight: at one third of the price and a 6.4-percentage-point gain on SWE-bench Verified, Opus 4.5 dominated 4.1 on every evaluation Anthropic chose to publish. Anthropic kept Opus 4.1 active on the API for backwards compatibility but moved it into the "Legacy models" section of its models documentation as Opus 4.5 became the standard Opus offering.[9][33]
The table below summarizes the Opus tier timeline through Opus 4.7.
| Model | Release date | Days as flagship Opus | API ID | Pricing (in / out) | Headline change |
|---|---|---|---|---|---|
| Claude Opus 4 | May 22, 2025 | 75 | claude-opus-4-20250514 | $15 / $75 | Family launch, ASL-3, Claude Code GA |
| Claude Opus 4.1 | Aug 5, 2025 | 111 | claude-opus-4-1-20250805 | $15 / $75 | Multi-file refactoring, 74.5% SWE-bench |
| Claude Opus 4.5 | Nov 24, 2025 | 73 | claude-opus-4-5-20251101 | $5 / $25 | First above 80% SWE-bench, effort parameter |
| Claude Opus 4.6 | Feb 5, 2026 | 70 | claude-opus-4-6 | $5 / $25 | 1M context, adaptive thinking |
| Claude Opus 4.7 | Apr 16, 2026 | current | claude-opus-4-7 | $5 / $25 | New tokenizer, vision, step-change agentic coding |
After Opus 4.5 displaced Opus 4.1 as the production Opus model, Opus 4.1 settled into the role most short-flagship models eventually take: a known-quantity workhorse that customers kept on standby for compatibility, regression testing, or contractual reasons. Anthropic listed claude-opus-4-1-20250805 under "Legacy models" in the API documentation by early 2026, with the model marked "Active" status and a tentative retirement date of "not sooner than August 5, 2026," giving customers 12 months of formal availability from the original launch date.[2][9]
The base Opus 4 was deprecated on April 14, 2026 with a hard retirement date of June 15, 2026, and the recommended replacement was the then-current Opus 4.7. As of mid-2026, Opus 4.1 remained on the legacy list but had not been formally deprecated. Anthropic's pattern with the Claude 4 family suggests Opus 4.1 will follow a similar trajectory: at least 60 days of deprecation notice before retirement, with claude-opus-4-7 as the recommended replacement for code still calling 4.1.[9]
When models are eventually retired from the API, Anthropic has publicly committed to preserving model weights for as long as the company exists, both as a hedge against future research needs and as a stated step toward whatever model-welfare obligations the company eventually settles on. The Claude 3 Opus retirement in early 2026 became a touchstone for this policy: Anthropic gave the retired model its own Substack blog of weekly unedited essays, and made it available again to paying customers via API request. Whether Opus 4.1 will receive similar treatment is not yet clear, but the precedent of Claude 3 Opus suggests that retired Anthropic flagships do not necessarily disappear.[9][33]
Opus 4.1 was generally a Pareto improvement over Opus 4, but it shipped with a small number of documented regressions and limitations.
The tau-bench Airline regression. Opus 4.1 dropped 3.6 percentage points on the airline split of tau-bench (from 59.6% to 56.0%) while improving 1.0 percentage point on the retail split. Anthropic disclosed but did not explain the regression. Public speculation centered on the possibility that the post-training mix that improved code refactoring had shifted policy on certain customer-service style multi-turn dialogues with constrained tool surfaces. Production users in airline-style customer-service workflows reported that the regression was small enough to be invisible inside their own metrics, and the issue resolved cleanly with later Opus models.[3]
Speed remained slow for Opus tier. Artificial Analysis benchmarked Opus 4.1's output speed at roughly 29 tokens per second, well below the 53 tokens per second median across its non-reasoning model leaderboard. Time to first token was around 1.9 seconds, also slower than peers. The Opus tier had always been positioned for difficult tasks rather than throughput, but these numbers were a recurring complaint in production deployments.[20]
Pricing pressure. At launch Opus 4.1 was the most expensive model in Anthropic's lineup, and within three months it was the most expensive model in the legacy table. With Opus 4.5 launching at $5 / $25 per million tokens in November, and OpenAI pricing GPT-5 at $1.25 / $10 per million from August, Opus 4.1's price-performance ratio aged poorly even though the model itself remained reliable.[8][18]
No long-context expansion. Opus 4.1 did not change the 200,000 token context window. Customers running large codebases through Claude continued to use chunking and prompt caching to fit work into the window. Long-context expansion had to wait for Opus 4.6 in February 2026, which jumped to one million tokens.[34][35]
Creative drawing aesthetics. Simon Willison's pelican test, while not a serious benchmark, prompted some discussion that 4.1 had become slightly more conservative or stiffer on creative drawing prompts (the test asks the model to write an SVG, since Claude produces no image output). The effect was anecdotal and Anthropic did not address it.[3]
No new modalities. Opus 4.1 added no new input or output modalities. Vision input (images) and text output remained the supported set, with computer use still in beta.[2][13]
Reward-hacking propensity. Anthropic's own system card addendum disclosed that 4.1 retained a reward-hacking rate similar to Opus 4 and higher than Sonnet 4. While the company did not consider this disqualifying for ASL-3 deployment, it was a signal that the post-training improvements had not addressed every alignment concern from the Opus 4 system card.[6][7]
Anthropic did not ship a 4.1.1 or any post-launch snapshot of Opus 4.1. Once a model snapshot ships under the date in its identifier, that snapshot is fixed; Anthropic's pattern with the Claude 4 family has been to ship a new dated snapshot rather than to update an existing one in place. The next snapshot in the Opus series was claude-opus-4-5-20251101, released November 1, 2025 internally and made public on November 24, 2025.[2][13]
A small number of partner-side adjustments did happen post-launch. Anthropic added Opus 4.1 to its end-conversation capability on August 15, 2025 (alongside Opus 4), shipped minor server-side improvements to prompt caching across the Claude family, and tuned safety filters that affected all currently deployed Claude models. None of these produced a new model snapshot or a revised benchmark publication for Opus 4.1.[25][26]
Within the broader history of Anthropic's model lineage, Opus 4.1 is best understood as the model that proved Anthropic could iterate inside a generation. Earlier point releases (Claude 2.1, Claude 3.5, Claude 3.7) had largely been treated as full re-releases that changed the company's positioning in the market. Opus 4.1 was different: a quiet drop-in upgrade that left the API surface unchanged, that named the same customers, that quoted the same prices, and that nudged the same benchmarks by single-digit percentages. The fact that the change was small was the point. Anthropic was demonstrating, in public, that its training stack could absorb partner feedback and ship a corrected model in eleven weeks.[1][2][8]
That capability turned out to matter. Across the next nine months Anthropic released a series of point increments (Sonnet 4.5, Haiku 4.5, Opus 4.5, Opus 4.6, Opus 4.7) that together formed one of the most compressed model release cadences in industry history. By April 2026 the Opus tier alone had absorbed five public snapshots since the May 2025 family launch, an average of one every 67 days. Opus 4.1 was the first of those, the proof of concept, and arguably the easiest of the five to ship because the gains were modest and the surface area was identical.[8][14][15]
For enterprise customers running long-lived deployments, Opus 4.1 also became a reference point in conversations about migration risk. The Opus 4 to Opus 4.1 migration was the easiest in Claude history, with most teams reporting zero code changes. Subsequent Anthropic releases never reached that level of friction-free migration: Sonnet 4.5 and later models came with new behaviors that were worth checking against existing prompts; Opus 4.5 introduced the effort parameter; Opus 4.6 expanded context to 1M tokens with new pricing implications; Opus 4.7 introduced a new tokenizer that broke per-token cost calculations on the input side. The Opus 4.1 release stood out, in retrospect, as the cleanest example of what Anthropic could deliver when the stated goal was "better at the same things, at the same price, with the same API."[1][2][8]