Claude 3.5 Sonnet is a multimodal large language model developed by Anthropic and first released on June 20, 2024, as the opening model of the company's 3.5 generation. It ended Anthropic's reputation as a perennial second-place lab and made the Sonnet tier the company's signature product. A substantially upgraded version followed on October 22, 2024, announced alongside Claude 3.5 Haiku and the public beta of Computer Use. Because the jump was so large, the developer community widely nicknamed the upgraded model "Claude 3.6", and Anthropic eventually skipped that version number entirely: the next official release became Claude 3.7 Sonnet in February 2025.[1][2][3]
The two snapshots use the API identifiers claude-3-5-sonnet-20240620 (the original) and claude-3-5-sonnet-20241022 (the upgrade). Both share a 200,000-token context window, an 8,192-token maximum output length, and a price of $3 per million input tokens and $15 per million output tokens. Both were released under AI Safety Level 2 (ASL-2) deployment standards. The upgraded snapshot improved SWE-bench Verified from 33.4% to 49.0%, which Anthropic at the time described as state of the art for any publicly available model, ahead of OpenAI's recently launched o1-preview reasoning model.[4][5]
At launch Anthropic positioned 3.5 Sonnet as outperforming the larger and slower Claude 3 Opus at roughly one fifth of the price, an unusual move for the industry at the time. The June launch was paired with the debut of Artifacts, a side-panel interface in claude.ai that rendered code, websites, SVG graphics, and documents live as the model produced them, and which became the visual face of the model for most consumer users. The model's combination of strong coding, low cost, fast latency, and a long context window made it the default option in Cursor, GitHub Copilot (after Anthropic was added in October 2024), Replit, Vercel v0, Sourcegraph Cody, and a long list of agent products built between mid-2024 and early 2025.[1][2][6]
Claude 3.5 Sonnet held leading or tied-leading positions on coding, reasoning, and agentic-task benchmarks for most of the seven-month window between its launch and the release of Claude 3.7 Sonnet. Both snapshots were retired on October 28, 2025 after Anthropic's standard 60-day notice period, with users directed to migrate to Claude Sonnet 4 and its successors.[5][7]
| Snapshot | API ID | Release date | Knowledge cutoff | Retirement |
|---|---|---|---|---|
| Original | claude-3-5-sonnet-20240620 | June 20, 2024 | April 2024 | October 28, 2025 |
| Upgraded ("new", informally "3.6") | claude-3-5-sonnet-20241022 | October 22, 2024 | April 2024 | October 28, 2025 |
Anthropic introduced Claude 3.5 Sonnet on June 20, 2024 in a blog post titled "Introducing Claude 3.5 Sonnet," naming it the first member of "the forthcoming Claude 3.5 model family." The unusual framing was that the new mid-tier model outperformed the existing flagship: the post stated that 3.5 Sonnet "raises the industry bar for intelligence," outperforming both competitor models and Claude 3 Opus while operating "at twice the speed of Claude 3 Opus" and at the cost of the previous mid-tier model. The price was set at $3 per million input tokens and $15 per million output tokens, the same rate as Claude 3 Sonnet, and far below Opus.[1]
The model was made available the same day on the Anthropic API, claude.ai (free and Pro), the Claude iOS app, Amazon Bedrock, and Google Cloud's Vertex AI. It supported a 200,000-token context window, an 8,192-token maximum output, and image inputs (vision) the same way Claude 3 Opus had. The knowledge cutoff was April 2024.[1][8]
At launch the company published a model card addendum to the existing Claude 3 model card rather than a fresh card, on the rationale that 3.5 Sonnet was an evolution of the Claude 3 family rather than a separate generation. The addendum reported benchmark results, vision evaluations, agentic coding numbers, refusal rates on WildChat and XSTest, and a Responsible Scaling Policy safety evaluation in which Anthropic concluded that the model did not exceed the thresholds for ASL-3 and shipped under ASL-2. The UK AI Safety Institute (now UK AISI) ran independent pre-deployment testing.[8]
Alongside the model, Anthropic launched a feature called Artifacts on claude.ai. Artifacts opened a dedicated panel next to the chat where Claude could render generated content in place: code, SVG illustrations, HTML pages, mermaid diagrams, React components, single-file games, and documents. The panel updated in real time as the model produced output and could be edited or rolled back. The combination of 3.5 Sonnet plus Artifacts gave non-developer users an immediate, visual way to use the model for what was usually called "vibe coding" before that term existed. Artifacts also helped seed a class of single-prompt mini-apps that became a recurring genre on Twitter and Hacker News during the second half of 2024.[1][9]
The consumer rollout coincided with the iOS app gaining the new model and free-tier users getting access at higher rate limits than before. The launch post described it as "setting a new standard for the industry's balance of intelligence, speed, and cost."[1]
Until June 2024, frontier AI labs tended to follow a pattern that bigger and more expensive meant smarter, with smaller models trailing somewhere behind. Anthropic's pitch flipped that ordering inside its own family. Claude 3 Opus, released in March 2024, had been the company's flagship and was priced at $15 per million input tokens and $75 per million output tokens. Claude 3.5 Sonnet, sold at $3 input and $15 output, beat Opus on essentially every public benchmark Anthropic published, including GPQA Diamond (59.4% vs 50.4% for Opus), MMLU 5-shot CoT (90.4% vs 88.2%), HumanEval (92.0% vs 84.9%), and an internal agentic coding evaluation (64% vs 38%).[1][8]
The practical effect was that customers had little reason to keep using Claude 3 Opus after 3.5 Sonnet launched. Anthropic announced soon afterward that an upgraded Claude 3.5 Opus and a Claude 3.5 Haiku were planned for later in 2024. The 3.5 Haiku was announced that October and shipped shortly after. The 3.5 Opus did not: by mid-2025 the company had quietly dropped the project and gone straight to Claude Opus 4 in May 2025, a sequence that became one of the more discussed details in the industry's running attempt to read Anthropic's internal compute and product trade-offs.[2][7][10]
On October 22, 2024, Anthropic published a single post titled "Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku." The release bundled three things: an upgraded Claude 3.5 Sonnet snapshot referenced as claude-3-5-sonnet-20241022; the announcement of Claude 3.5 Haiku, which became generally available shortly afterward; and a new public beta in the Anthropic API that let developers "direct Claude to use computers the way people do, by looking at a screen, moving a cursor, clicking buttons, and typing text."[2]
The upgraded Sonnet was made available immediately on the Anthropic API, Amazon Bedrock, Google Cloud's Vertex AI, and claude.ai. The price stayed at $3 input and $15 output per million tokens. The context window stayed at 200K. The knowledge cutoff stayed at April 2024. The model card addendum for the upgrade noted that the model had retained the original Sonnet 3.5's pricing while improving "across the board."[2][3]
Anthropic published an updated model card addendum the same day. The addendum reported substantial gains on coding and agentic tool use, modest gains on most reasoning evaluations, and small but real gains on vision. It also documented the introduction of a new agentic capability category, computer use, and the OSWorld scores that established what state of the art looked like for desktop AI agents in late 2024.[3]
The upgrade improved SWE-bench Verified from 33.4% to 49.0%, the largest single jump on that benchmark Anthropic had ever reported. It improved TAU-bench (a tool-use benchmark for customer-service agents) from 62.6% to 69.2% in the retail domain and from 36.0% to 46.0% in the harder airline domain. On Anthropic's internal agentic coding evaluation the model went from 64% to 78%. GPQA Diamond improved from 59.4% to 65.0% and MATH from 71.1% to 78.3%; on AIME 2024, which had not been reported for the original snapshot, the upgrade scored 16.0% (0-shot CoT), with Maj@64 reaching 27.6%.[3]
Vision benchmarks improved more modestly. MMMU validation went from 68.3% to 70.4%, MathVista testmini from 67.7% to 70.7%, and ChartQA stayed at 90.8%, while DocVQA slipped slightly from 95.2% to 94.2%. Anthropic's framing was that the upgrade was primarily aimed at coding, agentic tool use, and instruction following, with vision held roughly steady.[3]
Anthropic did not give the upgrade a new version number. The official position was that this was simply "a new Claude 3.5 Sonnet," referenced internally as claude-3-5-sonnet-20241022. The community refused to accept that. Within days of the launch, posts on the Cursor forum, Hacker News, the Anthropic Discord, and the Reddit r/ClaudeAI subreddit were calling it "Claude 3.6 Sonnet" or "Sonnet 3.6," and the label stuck. Several developer-tooling companies briefly listed it that way in their model pickers. Cursor in particular became a hub for the naming discussion because users had to pick between two snapshots that shared the same display name.[2][6][11]
The nickname became official enough that when Anthropic released the next Sonnet generation in February 2025, the company skipped the 3.6 label entirely and named it Claude 3.7 Sonnet, signaling implicitly that the community's count was correct. Anthropic itself never used the 3.6 name in its own materials, but the convention persists in third-party reviews, blog posts, and benchmarks to this day.[12]
Anthropic does not publicly disclose parameter counts, training compute, dataset composition, or detailed architecture for any Claude model, and 3.5 Sonnet is no exception. The model is a transformer-based autoregressive large language model that handles text and image inputs and produces text outputs. It uses the same general training pipeline as the rest of the Claude 3 family: large-scale pretraining on a mixture of internet text and licensed data, supervised fine-tuning on curated demonstrations, Constitutional AI for alignment training, and reinforcement learning from human feedback (RLHF).[8][13]
What Anthropic did say is that 3.5 Sonnet was a "new model," not a fine-tune of Claude 3 Sonnet, and that the family was named 3.5 to reflect a generational improvement in capabilities while keeping costs flat. The training run for 3.5 Sonnet was reportedly more compute-heavy than Claude 3 Sonnet's, although the company has never published an explicit ratio. The upgrade in October was described as an "improved version of Claude 3.5 Sonnet," without further detail; outside observers generally interpreted that as additional pretraining or post-training rather than a fresh base model, but Anthropic has neither confirmed nor denied this.[1][2]
The knowledge cutoff for both snapshots is April 2024. The 200,000-token context window is the same as Claude 3 Opus, and the maximum output length is 8,192 tokens (an early stretch of Claude 3 Opus had been capped at 4,096 output tokens, which 3.5 Sonnet doubled out of the gate). The model supports the structured tool-use API that Anthropic had introduced earlier in 2024, which is the foundation for Computer Use, Model Context Protocol integrations later, and the standard tool calls used in Claude Code.[1][2][14]
3.5 Sonnet handles general text generation, summarization, classification, multilingual translation, structured extraction, and standard reasoning. The October 2024 upgrade tightened instruction following: Anthropic reported a 51% human preference win rate over the original Sonnet 3.5 on "following precise instructions" tasks. In practice that change was visible to users in the form of cleaner adherence to format requests, JSON schemas, and complex prompt scaffolds, which is part of why the model became the default in agent frameworks.[3]
Coding was the headline strength of both snapshots. The June model already led HumanEval at 92.0% and Anthropic's internal agentic coding evaluation at 64%. The October upgrade pushed those to 93.7% and 78% respectively, and added the SWE-bench Verified jump to 49.0% that became one of the most-cited numbers in the second half of 2024. SWE-bench Verified at the time had a published state of the art of 45.2% (achieved by a system built on a competitor model with elaborate scaffolding); the upgraded Sonnet beat that as a single-model score.[3][15]
Real-world adoption tracked the benchmarks. Cursor reported the model as their default for most users; Replit Agent used it; Sourcegraph Cody added it; v0 by Vercel made it a primary option for frontend code generation; and a large fraction of independent agentic-coding products built between mid-2024 and early 2025 had a Sonnet 3.5 path. By the time the upgrade landed in October, the model was widely described in developer communities as "the model you reach for when you actually need the work done."[6][16]
Both snapshots accept images as inputs. They handle chart and graph interpretation, document understanding, science diagrams, OCR-like transcription from imperfect images, and visual math. MMMU validation reached 68.3% on the original snapshot and 70.4% on the upgrade, ChartQA reached 90.8% (unchanged), and DocVQA reached 95.2% on the original (slipping marginally to 94.2% on the upgrade). The model card addendum framed vision as a steady, not headline, capability.[3][8]
3.5 Sonnet was the first Anthropic model for which the company's now-standard structured tool-use API was generally available: the developer registers a tool schema and the model emits typed tool_use calls that the host code executes. The approach generalized beyond function calling: it underpinned Computer Use at launch, became the basis for the Model Context Protocol clients introduced in November 2024, and is the same primitive Claude Code used when it arrived in February 2025. TAU-bench, an external benchmark for tool-using customer-service agents, was where the October upgrade most clearly outscored the field: 69.2% retail and 46.0% airline.[2][3][14][17]
The October 2024 release introduced Computer Use, a public beta capability that let Claude operate a real desktop environment by reading screenshots and emitting mouse and keyboard actions. The Anthropic-hosted demo ran the model inside a sandboxed virtual machine and showed it filling vendor forms, browsing Google Maps, and navigating GUI applications. On the OSWorld benchmark, the upgraded Sonnet 3.5 scored 14.9% on the 15-step screenshot-only setting and 22% with 50 steps, against a previous best of around 7.8%. Human performance on OSWorld is roughly 72.36%. Anthropic was candid in the launch post that Computer Use was "at times cumbersome and error-prone" and recommended developers run it in containers with limited privileges.[2][3]
Computer Use was one of the first general-purpose agentic-control products from a major lab and predated OpenAI Operator (January 2025) and Google's Project Mariner (December 2024). It became the template that successive Claude generations refined: by Claude Sonnet 4.5 the same OSWorld benchmark had reached 61.4%.[18]
The table below summarizes the headline benchmark scores for both snapshots, drawn from Anthropic's model card addenda (June 2024 for the original, October 2024 for the upgrade). Entries marked "not reported" were absent from the corresponding addendum.
| Benchmark | Original (June 2024) | Upgraded (October 2024) | Notes |
|---|---|---|---|
| GPQA Diamond (0-shot CoT) | 59.4% | 65.0% | Graduate-level science Q&A |
| GPQA Diamond (Maj@32, 5-shot) | 67.2% | not reported | |
| MMLU (5-shot CoT) | 90.4% | 90.5% | General reasoning |
| MMLU Pro (0-shot CoT) | not reported | 78.0% | Harder MMLU variant |
| MATH (0-shot CoT) | 71.1% | 78.3% | Mathematical problem solving |
| HumanEval | 92.0% | 93.7% | Python coding |
| MGSM | 91.6% | 92.5% | Multilingual math |
| DROP (3-shot, F1) | 87.1 | 88.3 | Reading comprehension |
| BIG-Bench Hard (3-shot CoT) | 93.1% | 93.2% | Mixed evaluations |
| GSM8K | 96.4% | not reported | Grade-school math |
| AIME 2024 (0-shot CoT) | not reported | 16.0% | High school math contest |
| AIME 2024 (Maj@64) | not reported | 27.6% | |
| IFEval | not reported | 90.2% | Instruction following |
| MMMU (validation, 0-shot) | 68.3% | 70.4% | Visual question answering |
| MathVista (testmini) | 67.7% | 70.7% | Visual math reasoning |
| AI2D | 94.7% | 95.3% | Science diagrams |
| ChartQA | 90.8% | 90.8% | Chart understanding |
| DocVQA (ANLS) | 95.2% | 94.2% | Document understanding |
| SWE-bench Verified | 33.4% | 49.0% | Real GitHub issues |
| TAU-bench retail (pass^1) | 62.6% | 69.2% | Tool-use customer service |
| TAU-bench airline (pass^1) | 36.0% | 46.0% | Harder TAU-bench split |
| OSWorld (15-step, screenshot) | not applicable | 14.9% | Desktop agent |
| Internal agentic coding | 64% | 78% | Anthropic internal eval |
The single benchmark that did the most to define the model's reputation was SWE-bench Verified. The upgraded snapshot's 49.0% put it ahead of OpenAI's o1-preview reasoning model (41.0% on the same benchmark in OpenAI's reported numbers) and ahead of all open-source alternatives, despite running without explicit chain-of-thought reasoning at inference time.[3][15]
Independent leaderboards generally agreed with the model card numbers. The Vellum LLM Leaderboard tracked the upgraded snapshot near the top of coding and reasoning categories from late October 2024 through February 2025. On Artificial Analysis the model held a leading-tier ranking on its composite intelligence score for that window, behind the o1 family on math but ahead on coding tasks. On the LMSYS Chatbot Arena (later LMArena), the model held a top-five Elo score for the same period, frequently in the top three for coding-tagged prompts.[19][20]
Both snapshots were priced identically:
| Tier | Cost |
|---|---|
| Input tokens | $3 per million |
| Output tokens | $15 per million |
| Prompt caching, cache write | $3.75 per million |
| Prompt caching, cache read | $0.30 per million |
| Batch API input | $1.50 per million (50% discount) |
| Batch API output | $7.50 per million (50% discount) |
Prompt caching was added to the Claude API as a public beta on August 14, 2024, with 3.5 Sonnet as one of the launch models. Cache reads were priced at 10% of the regular input rate, which made multi-turn agent workloads dramatically cheaper than they had been on Claude 3 Opus. The Batch API followed in October 2024 with a flat 50% discount on both input and output for asynchronous workloads.[21][22]
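The effect of these rates on a typical agent workload can be sketched with a few lines of arithmetic. This is an illustrative calculation only: the 150K-token context size is a made-up example, and the constants simply restate the table above.

```python
# Cost arithmetic for the published Claude 3.5 Sonnet rates
# (USD per million tokens). Token counts below are illustrative.
INPUT, OUTPUT = 3.00, 15.00
CACHE_WRITE, CACHE_READ = 3.75, 0.30       # 1.25x and 0.1x the input rate
BATCH_DISCOUNT = 0.50                      # flat 50% off for async batches

def cost(tokens: int, rate_per_million: float) -> float:
    return tokens / 1_000_000 * rate_per_million

# A multi-turn agent that replays a 150K-token context each turn:
uncached = cost(150_000, INPUT)            # full input rate every turn
cached = cost(150_000, CACHE_READ)         # after the first cache write
print(f"uncached: ${uncached:.3f}/turn, cache read: ${cached:.4f}/turn")

# The same input sent through the Batch API instead:
batch_input = cost(150_000, INPUT * BATCH_DISCOUNT)
print(f"batch input: ${batch_input:.4f}")
```

At these rates a cached turn costs a tenth of an uncached one, which is the arithmetic behind the claim that caching made multi-turn agent workloads dramatically cheaper.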
The model was available on the Anthropic API, claude.ai (free, Pro, and Team), the Claude iOS app, Amazon Bedrock, Google Cloud's Vertex AI, and through every major model-routing provider including OpenRouter and AWS Bedrock cross-region endpoints. The Claude desktop app, launched in late 2024, used 3.5 Sonnet as its default. The model was added to GitHub Copilot in October 2024, the first time Anthropic models were available inside Copilot, and Microsoft's M365 Copilot followed shortly after. By early 2025 it was available natively in Cursor, Windsurf, Replit, Vercel v0, Sourcegraph Cody, Continue.dev, Aider, and most other developer tools that exposed model selection.[2][6][16]
Claude 3.5 Sonnet shipped with structured tool-use as a generally available API feature. The developer registers a list of tools as JSON schemas in the request, the model decides when to invoke them, and the host application executes the call and returns results. The format proved durable: it remained the standard tool-use schema across Claude 3.7 Sonnet, Claude 4, and into Claude Sonnet 4.5 with only minor extensions.[14]
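The round trip described above can be sketched as follows. Only the general shape (a `tools` list of JSON Schemas, `tool_use` content blocks from the model, `tool_result` blocks back from the host) follows Anthropic's documented format; the `get_weather` tool, its stub implementation, and the dispatch table are hypothetical, and the model's reply is simulated rather than fetched from the API.

```python
# Illustrative sketch of the Claude tool-use round trip (no network calls).
import json

# 1. The developer registers tools as JSON Schemas in the request.
tools = [{
    "name": "get_weather",                      # hypothetical tool
    "description": "Get the current weather for a city.",
    "input_schema": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

# 2. The model replies with typed tool_use content blocks (simulated here).
model_reply = {
    "stop_reason": "tool_use",
    "content": [{
        "type": "tool_use",
        "id": "toolu_01",
        "name": "get_weather",
        "input": {"city": "Berlin"},
    }],
}

# 3. The host application executes the call itself...
def get_weather(city: str) -> str:
    return f"Sunny in {city}"                   # stubbed implementation

handlers = {"get_weather": get_weather}

tool_results = []
for block in model_reply["content"]:
    if block["type"] == "tool_use":
        output = handlers[block["name"]](**block["input"])
        # 4. ...and returns a tool_result block on the next turn.
        tool_results.append({
            "type": "tool_result",
            "tool_use_id": block["id"],
            "content": output,
        })

print(json.dumps(tool_results, indent=2))
```

The division of labor is the durable part of the design: the model only decides *when* to call and with *what* arguments, while execution stays entirely on the host side.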
The October upgrade extended this with three Anthropic-defined tools attached to the Computer Use beta: a computer tool that emitted screen actions like screenshot, left_click, type, key, mouse_move, and scroll; a text_editor tool for file editing; and a bash tool for shell command execution. These three tools became the bedrock of agentic Claude integrations for the next year. The text_editor and bash tool patterns were carried over directly into Claude Code when it launched in February 2025. The Computer Use schema later became the basis for OpenAI Operator's Computer-Using Agent and Google's Project Mariner, both of which arrived months after Anthropic's initial release.[2][17]
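A host-side dispatcher for the `computer` tool might look like the sketch below. The action names are a subset of those listed above; the handlers are stubs standing in for code that would drive a sandboxed VM, which Anthropic recommended running with limited privileges.

```python
# Minimal sketch of a host-side dispatcher for Computer Use's `computer`
# tool. Handlers are stubs; a real integration would capture and control
# a sandboxed desktop environment.

def take_screenshot():
    return {"type": "image", "note": "base64 PNG of the VM screen goes here"}

def left_click(coordinate):
    return {"ack": f"clicked at {tuple(coordinate)}"}

def type_text(text):
    return {"ack": f"typed {len(text)} characters"}

# Only three of the documented actions are wired up in this sketch.
ACTIONS = {
    "screenshot": lambda inp: take_screenshot(),
    "left_click": lambda inp: left_click(inp["coordinate"]),
    "type": lambda inp: type_text(inp["text"]),
}

def handle_computer_call(tool_input):
    """Execute one tool_use block emitted by the model's computer tool."""
    action = tool_input["action"]
    if action not in ACTIONS:
        return {"error": f"unsupported action: {action}"}
    return ACTIONS[action](tool_input)

# The model might emit, e.g.:
result = handle_computer_call({"action": "left_click", "coordinate": [640, 360]})
print(result)
```

The loop that wraps this dispatcher is the same tool_use/tool_result cycle as any other tool: the model requests a screenshot, reads the returned image, and emits the next action.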
A practical consequence of the tool-use schema was that 3.5 Sonnet anticipated Model Context Protocol (MCP), the open standard Anthropic announced on November 25, 2024. MCP layered a transport and discovery protocol on top of the same tool-use primitive. The first wave of MCP servers and clients was built and tested against Sonnet 3.5, and the model was the implicit reference target for the early MCP ecosystem.[14][17]
Both snapshots were classified by Anthropic as AI Safety Level 2 (ASL-2) under the company's Responsible Scaling Policy. The June model card addendum reported that 3.5 Sonnet "showed an increase in capabilities in risk-relevant areas compared to Claude 3 Opus" but "did not exceed our safety thresholds in these evaluations," and the company classified it at ASL-2 accordingly. The October addendum reached the same conclusion for the upgrade, including the new Computer Use capability, which Anthropic argued did not change the threat-model evaluation: it noted that visual computer use "does not seem to be on the critical path to enabling" autonomous software engineering at the level that would warrant ASL-3 safeguards.[3][8]
Pre-deployment evaluations involved a combination of internal red-teaming on chemical, biological, radiological, and nuclear (CBRN) risks, cybersecurity (capture-the-flag challenges), and autonomous capabilities (software engineering). The original snapshot was independently evaluated by the UK AI Safety Institute. The upgraded snapshot was independently evaluated by both the UK AISI and the US AI Safety Institute, as part of the cross-Atlantic AISI memorandum of understanding signed earlier in 2024, and also by METR for autonomy-relevant capabilities.[3][8]
This made 3.5 Sonnet (and especially the upgraded snapshot) the first frontier model to undergo coordinated pre-deployment testing by multiple government safety institutes, a precedent that became the norm for frontier releases through 2025.[3]
The model's commercial reception was unusual for an Anthropic release. Where previous Claude generations had been respected but second-tier in market share, 3.5 Sonnet became the default model for a large slice of the developer-tooling market within a few weeks of launch. Adoption was driven by a combination of price (one fifth of GPT-4 Turbo's output rate at the time), a long context window, strong coding scores, low latency, and the perception (initially anecdotal, later confirmed in benchmarks) that the model produced cleaner code than competitors.
| Partner | Integration | Timing |
|---|---|---|
| Cursor | Default tab-complete and chat model | June 2024, deepened October 2024 |
| GitHub Copilot | Selectable model in Copilot Chat | October 29, 2024 |
| Replit Agent | Primary code-generation model | September 2024 |
| Vercel v0 | Selectable model for frontend generation | Mid-2024 |
| Sourcegraph Cody | Selectable model in Cody | June 2024 |
| Windsurf (Codeium) | Selectable model | Late 2024 |
| Continue.dev | Selectable model | June 2024 |
| Microsoft M365 Copilot | Optional model for Copilot users | Late 2024 |
| Amazon Bedrock | Available since launch | June 2024 |
| Google Cloud Vertex AI | Available since launch | June 2024 |
| Notion AI | Backbone model | Late 2024 |
| Perplexity Pro | Selectable model | June 2024 |
GitHub's Universe announcement on October 29, 2024 made Claude 3.5 Sonnet the first non-OpenAI model available in GitHub Copilot, alongside Gemini 1.5 Pro. Microsoft (which owns GitHub) explicitly framed it as moving Copilot to a multi-model architecture, and it was the moment that ended the Copilot-only-uses-OpenAI era. Cursor's adoption was even more visible: by late 2024 the company was telling investors that the bulk of its inference spend went to Anthropic, and CEO Michael Truell had described 3.5 Sonnet in interviews as the model that made the company's product viable in its modern form.[6][16][23]
The initial press response in June 2024 was strong but measured. TechCrunch, The Verge, Ars Technica, and VentureBeat each ran reviews focused on the unusual price-versus-Opus framing and on Artifacts as a UX innovation. Most reviewers said the model was at least as good as GPT-4o on coding and general writing, and several called it the best frontier model available at the time. The Verge wrote that 3.5 Sonnet "feels like a meaningfully smarter Claude," and Ars Technica described Artifacts as "the most interesting thing any chatbot company has shipped in months."[24][25]
The October 2024 upgrade was received with louder enthusiasm. Simon Willison's blog post on the launch became one of the most-cited responses, calling the SWE-bench jump "genuinely surprising" and noting that the same price tag for a substantially better model was unusual in the industry. Latent Space ran a long interview with Anthropic engineer Erik Schluntz that became a primary source for understanding how the team had trained Computer Use. The combination of a quietly better model and a flashy new capability dominated AI Twitter for several weeks.[18][26]
Third-party benchmark trackers confirmed Anthropic's published numbers and added a few of their own. The Vellum LLM Leaderboard placed the upgraded snapshot at or near the top of coding and reasoning categories for late 2024 and most of January 2025. Artificial Analysis tracked a top-tier composite intelligence score, behind o1 on math-heavy queries but ahead on coding and instruction following. METR's evaluations focused on long-horizon autonomous tasks and reported substantial gains over Claude 3 Opus, although the model still fell well short of the kind of multi-day autonomous performance that later Claude generations would achieve.[19][20][27]
On LMSYS Chatbot Arena (which became LMArena in early 2025), the upgraded snapshot held a top-five Elo for general prompts and frequently a top-three Elo for coding-tagged prompts through the fall and winter of 2024. The model topped the "hard prompts" category at one point in November 2024.[20]
A recurring theme in the community response was that 3.5 Sonnet had a distinct personality compared to other frontier models. The model was described in Reddit, Twitter, and Hacker News threads as more direct than GPT-4o, more willing to disagree, and more comfortable with humor and irony. Some users contrasted it with what they perceived as the more cautious, hedge-heavy register of OpenAI's models. Anthropic's character training (a layer on top of Constitutional AI that aims for coherent personality traits) was sometimes credited for this, although Anthropic itself was careful not to overclaim.[28]
The upgrade in October pushed this further. Janus, Pliny, and other prompt-focused researchers wrote about the upgraded snapshot's increased "vivacity" and willingness to engage with edge cases. The community label "Claude 3.6" was partly a recognition of capability gains but also partly a recognition that the model felt different. The same cohort later argued that the personality shift carried into Claude 3.7 Sonnet and Claude 4, which was sometimes given as evidence that Anthropic was deliberately optimizing for character coherence across generations.[11][28]
For most of the seven-month commercial life of the original snapshot and the four-month commercial life of the upgrade, the only frontier models in the same conversation were GPT-4o (released by OpenAI in May 2024), GPT-4 Turbo (still on the market in mid-2024), Google's Gemini 1.5 Pro, and after September 2024, OpenAI's o1-preview and o1 reasoning models. The general consensus that emerged across reviewers was that 3.5 Sonnet beat GPT-4o on coding-heavy tasks, traded blows with GPT-4o on general writing and chat, lost to o1 on competition math and hard chain-of-thought reasoning problems, and beat o1 on cost, latency, and any task that benefited from fast iteration. On Vellum's published comparison the upgraded Sonnet had the best F1 score on a classification benchmark (77%) but lagged on math equation solving (39% vs o1's higher score), which lined up with the published Anthropic numbers.[19][20]
For agentic and tool-use tasks the comparison ran more decisively in Anthropic's favor. The TAU-bench numbers (69.2% retail, 46.0% airline) were ahead of every published competitor in October 2024. The OSWorld score of 14.9% for screenshot-only computer use was nearly twice the next-best system. Independent agent benchmarks like SWE-Lancer (an agentic engineering eval released in early 2025) and AgentBench would later show similar patterns. The combination contributed to the perception that for any task that involved more than a single chat turn, Sonnet 3.5 was the model to beat.[3][20]
Claude 3.5 Sonnet is the model that converted Anthropic from a boutique AI safety lab into a commercial competitor. Before June 2024 the company was best known for its research output and for being one of OpenAI's most credible rivals; afterward it was widely regarded as one of the two or three labs that could plausibly produce frontier models, and was repeatedly named alongside OpenAI and Google DeepMind in venture decks, press features, and government policy documents. The company's revenue more than tripled between June 2024 and the end of 2024, and continued to grow throughout the model's commercial life.[7][29]
Anthropic's announcement of Model Context Protocol on November 25, 2024 (one month after the Sonnet upgrade) was timed to take advantage of the moment. MCP shipped with Sonnet 3.5 as the implicit reference model and with Claude Desktop as the reference client. The combined effect of an excellent coding model, Computer Use, and an open agent protocol was that Anthropic became the de facto agent infrastructure provider for late 2024 and most of 2025, even as competitors caught up on raw model quality.[17]
Cursor's success during 2024 and 2025 was deeply tied to Claude 3.5 Sonnet. The editor's tab autocomplete, agent mode, and codebase chat features were built around Anthropic's model, and the company's user base grew from a few thousand to several hundred thousand paying subscribers during the period the upgraded Sonnet was the default. Cursor's emergence as one of the fastest-growing developer tools in history, and its eventual valuation in the multi-billion-dollar range by mid-2025, was widely attributed in part to its Anthropic dependency.[16][23]
The "agentic coding" framing more broadly entered the mainstream during the Sonnet 3.5 era. Before October 2024 most developer-tooling products described themselves as code completion, copilot, or chat. By early 2025 the industry-standard pitch had shifted to agents that took multi-step actions, ran tests, edited multiple files, and self-corrected. The shift was not caused entirely by Sonnet 3.5, but the model's combination of long context, structured tool use, and high SWE-bench scores made the new framing look credible, and the upgrade's Computer Use and TAU-bench numbers gave it a benchmarkable backbone.[3][16]
The pre-deployment evaluation of the October snapshot by US AISI, UK AISI, and METR was a precedent in itself. Apollo Research had also worked with Anthropic on some safety evaluations during the period, and the upgraded snapshot was one of the models that Apollo's later 2024 paper on "in-context scheming" tested for behaviors like sandbagging and goal-preservation under instruction conflicts. The fact that a major model release was now subjected to multiple coordinated independent evaluations before launch became a reference point in AI safety policy discussions through 2025, including in the US executive order followups, EU AI Act implementation discussions, and the Bletchley and Seoul Summit follow-on processes.[3][30]
Within developer culture, Claude 3.5 Sonnet acquired a kind of cult status. By late 2024 it was common in technical Twitter and Hacker News threads to see lines like "Sonnet 3.5 is leading the field" or "if you're not using Sonnet for this you're losing." Newsletters such as Latent Space and AI Snake Oil ran extended retrospectives describing the model as having defined the frontier for a six-to-eight-month window. The combination of capability, cost, and the personality factor made the model one of the few frontier releases that developers spoke about with active affection.[26]
The Sonnet 3.5 era set a number of expectations that became defaults in the rest of the industry. The $3 input / $15 output per million-token price point became the de facto going rate for a frontier mid-tier model, and remained Anthropic's Sonnet pricing through Claude Sonnet 4.5 a year and a half later. The 200K context window became a baseline expectation; competitors that shipped with shorter contexts faced a reflexive complaint that they were behind. The model also reinforced the expectation that vision would be a standard feature rather than a separate product. The model card addendum format, with task-specific evaluation tables and an explicit ASL classification, was widely copied by other labs, including Mistral, Cohere, and Meta in their respective frontier releases through 2025.[1][2][8]
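The per-million-token pricing described above translates into simple cost arithmetic. The following sketch is illustrative only; the function and constant names are hypothetical, not part of any Anthropic SDK:

```python
# Illustrative cost calculator for the Sonnet-tier pricing cited above
# ($3 per million input tokens, $15 per million output tokens).
INPUT_RATE_USD = 3.00    # USD per 1M input tokens
OUTPUT_RATE_USD = 15.00  # USD per 1M output tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request at Sonnet pricing."""
    return (input_tokens * INPUT_RATE_USD
            + output_tokens * OUTPUT_RATE_USD) / 1_000_000

# A request filling the full 200K context with an 8,192-token response:
print(round(request_cost(200_000, 8_192), 5))  # 0.6 + 0.12288 = 0.72288
```

At these rates a maximal request costs under a dollar, which is part of why the price point became a reference for "frontier mid-tier" positioning.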
More subtly, the upgrade in October 2024 helped establish a new norm around versioning. The fact that Anthropic shipped a substantially better model under the same official name and let the community do the renaming created an awkward precedent: in early 2025 OpenAI faced similar discontent when GPT-4o was updated multiple times under the same display name without clear changelog entries, and complaints frequently invoked the "we got Claude 3.6 by accident" episode as the better way to handle the situation in retrospect. By Sonnet 4 in May 2025, Anthropic had moved to clearer dated snapshots and explicit family naming, which most observers took as a deliberate response to the Sonnet 3.5 versioning experience.[6][12]
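The dated-snapshot convention at issue here can be illustrated with a small parser. This is a sketch that assumes the `family-major-minor-tier-YYYYMMDD` shape visible in the two 3.5 Sonnet identifiers; the pattern and helper are inferred for illustration, not an official Anthropic naming specification:

```python
import re
from datetime import date

# Matches dated snapshot IDs like claude-3-5-sonnet-20241022.
# Pattern inferred from the two 3.5 Sonnet IDs; not an official spec.
SNAPSHOT_RE = re.compile(
    r"^(?P<family>[a-z]+)-(?P<major>\d+)-(?P<minor>\d+)-"
    r"(?P<tier>[a-z]+)-(?P<snap>\d{8})$"
)

def parse_snapshot(model_id: str) -> dict:
    """Split a dated snapshot ID into family, version, tier, and date."""
    m = SNAPSHOT_RE.match(model_id)
    if m is None:
        raise ValueError(f"not a dated snapshot id: {model_id!r}")
    s = m.group("snap")
    return {
        "family": m.group("family"),
        "version": f"{m.group('major')}.{m.group('minor')}",
        "tier": m.group("tier"),
        "date": date(int(s[:4]), int(s[4:6]), int(s[6:8])),
    }

original = parse_snapshot("claude-3-5-sonnet-20240620")
upgrade = parse_snapshot("claude-3-5-sonnet-20241022")
assert original["version"] == upgrade["version"] == "3.5"
assert upgrade["date"] > original["date"]  # same display name, later snapshot
```

The point the parser makes concrete is that the two snapshots differ only in their date suffix, which is exactly what left the community to coin its own "3.6" label for the capability jump.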
Claude 3.7 Sonnet was announced on February 24, 2025, four months after the October upgrade. It was Anthropic's first hybrid reasoning model, capable of either responding immediately or switching into an extended-thinking mode that produced visible chain-of-thought reasoning. The version number jumped from 3.5 to 3.7, skipping 3.6 in deference to the community label that had attached to the October Sonnet 3.5 upgrade.[12]
Claude 3.7 Sonnet improved SWE-bench Verified to 70.3% (up from 49.0%) and shipped alongside an early preview of Claude Code, Anthropic's agentic coding CLI. It used the same $3 / $15 per million token pricing and the same 200K context window as 3.5 Sonnet, but added 128K maximum output tokens in beta. It became the new default Sonnet on the API, on claude.ai, on Bedrock, on Vertex AI, and inside Cursor, GitHub Copilot, and most of the partner products that had been running 3.5 Sonnet. The 3.5 Sonnet snapshots remained on the API as legacy options.[12]
Later Sonnet generations followed the same pattern. Claude Sonnet 4 (May 2025) reached 72.7% on SWE-bench Verified at the same price point. Claude Sonnet 4.5 (September 2025) reached 77.2% on SWE-bench Verified and 61.4% on OSWorld. By the time Sonnet 3.5 was retired on October 28, 2025, it had been on the market for sixteen months, an unusually long commercial life for a frontier model.[5][7]
On August 13, 2025 Anthropic notified developers that both Claude 3.5 Sonnet snapshots would be retired on October 28, 2025, with Claude Sonnet 4.5 (later updated to Claude Sonnet 4.6) listed as the recommended replacement. The two-month notice followed Anthropic's standard retirement process, which guarantees at least 60 days of advance warning for retirements of publicly released models. Both snapshots reached end-of-life on October 28, 2025; requests to either model on the Claude API now return an error.[5]
Anthropic has separately committed to long-term preservation of model weights and to making past models available again at some point in the future under restricted-access terms, in part because of "safety- and model welfare-related risks" the company associates with permanently retiring frontier models. Claude 3.5 Sonnet weights are preserved internally under that commitment but are not publicly redistributed.[5]