Claude 3.7 Sonnet is a large language model developed by Anthropic and released on February 24, 2025. It was the first Anthropic model to expose a unified hybrid reasoning interface, combining a fast default response mode with an opt-in "extended thinking" mode in which the model produces a visible chain of reasoning before its final answer. Anthropic positioned it as the company's first hybrid reasoning model and as the first frontier model to put both modes behind a single model identifier, rather than splitting fast and reasoning capabilities into separate dedicated models.[1][2]
The model uses the API identifier claude-3-7-sonnet-20250219 (snapshot dated February 19, 2025, even though the public launch was February 24), supports a 200,000-token context window, and was priced at $3 per million input tokens and $15 per million output tokens, the same headline price point Anthropic had used for Claude 3.5 Sonnet since June 2024.[3][4] It was deployed under AI Safety Level 2 (ASL-2) protections, the same level applied to Claude 3.5 Sonnet, and shipped alongside an early research preview of Claude Code, Anthropic's terminal-first agentic coding assistant.[1][5]
In benchmarks reported at launch, Claude 3.7 Sonnet scored 70.3% on SWE-bench Verified using a custom "high-compute" agentic scaffold (the headline single-attempt result was 62.3%), 84.8% on GPQA Diamond with extended thinking, 96.2% on MATH-500, and 80.0% on Tau-bench Retail.[1][6][7] On HumanEval, the model matched Claude 3.5 Sonnet near the saturation point. The pattern across benchmarks led the press to describe it as the strongest publicly released coding model at the time of release, ahead of GPT-4o, DeepSeek V3, and OpenAI's o1.[6][8][9]
Claude 3.7 Sonnet remained Anthropic's flagship Sonnet model for slightly under three months. It was superseded on May 22, 2025 when Anthropic announced the Claude 4 family, releasing Claude Sonnet 4 and Claude Opus 4 at the company's first developer conference, Code with Claude.[10] Despite that fast turnover, Claude 3.7 Sonnet shaped the family that came after it: the extended thinking toggle, the budget_tokens parameter, and the agent-loop bias toward long-running coding sessions all carried forward into the Claude 4 generation, where they were progressively refined into adaptive thinking by Claude Opus 4.6 in February 2026.[11][12]
The "first hybrid reasoning model" framing is product-specific rather than industry-wide. Google had already shipped Gemini 2.0 Flash Thinking as a public experimental model on December 19, 2024, and OpenAI had publicly demonstrated o1's hidden chain-of-thought reasoning in September 2024. What Anthropic claimed as a first was the unified design: a single model and a single API endpoint that could run either as a fast non-reasoning model or as a reasoning model with a developer-controlled token budget, rather than two separate models like OpenAI's GPT-4o and o1.[1][13][14] Several commentators including Simon Willison welcomed this consolidation but noted that the underlying capability of producing a long chain of thought was already in the field by late 2024.[15][16]
The second half of 2024 saw the public emergence of "reasoning" or "thinking" models that produced an internal chain of intermediate steps before answering. OpenAI's o1-preview, announced on September 12, 2024, was the most prominent early example: a separate model from GPT-4o that spent additional inference compute on a hidden chain of thought before producing a final answer. The o1 line traded latency and cost for accuracy, especially on math, science, and competition-style problems.[14][17]
Google followed with Gemini 2.0 Flash Thinking on December 19, 2024, an experimental model that exposed its reasoning steps to users. DeepSeek released DeepSeek-R1 on January 20, 2025 with an open-weights chain-of-thought design and detailed training recipe that received heavy attention from researchers. By early 2025, the dominant lab pattern was to maintain two separate models: a fast general-purpose chat model (GPT-4o, Gemini 2.0 Flash, Claude 3.5 Sonnet) and a slower reasoning model (o1, Gemini 2.0 Flash Thinking, R1).[18][19]
Anthropic had not shipped a public reasoning model in 2024. The company's most recent major release was the upgraded Claude 3.5 Sonnet (referenced as claude-3-5-sonnet-20241022) on October 22, 2024, which introduced Computer Use and improved coding scores but did not expose any visible chain-of-thought. Anthropic CEO Dario Amodei described the reasoning-model split publicly in interviews around the o1 launch as a product choice he disagreed with: in his framing, asking the user to pick between a fast and a reasoning model was an interface failure.[1][16]
Claude 3.7 Sonnet was Anthropic's answer to this. The company decided to train one model that could behave either way, and to expose the choice as a runtime parameter rather than a separate model ID. The technical justification, according to the launch announcement, was that the same underlying weights could produce both fast responses and extended chains of thought, and that having two modes inside one model would simplify product integration for customers building agents.[1]
Claude was Anthropic's first family of large language models, launched in March 2023. The Claude 3 generation in March 2024 introduced the Haiku, Sonnet, Opus three-tier naming pattern, with Opus as the flagship, Sonnet as the balanced mid-tier, and Haiku as the fastest and cheapest tier.[20] Claude 3.5 Sonnet, released in June 2024, was Anthropic's first widely adopted commercial model and reset the company's competitive position. The October 2024 update to Claude 3.5 Sonnet (sometimes informally called "3.6" by users) added Computer Use and ramped coding capability further. Claude 3.5 Haiku launched alongside it.[21][22]
Anthropic skipped a public Claude 3.5 Opus release. Internal reporting suggested an Opus-level Claude 3.5 model had been trained but did not meet release standards. The next planned generation was Claude 4, which would eventually launch in May 2025. Claude 3.7 Sonnet was an interim release between the 3.5 and 4 cycles, designed to ship the hybrid reasoning capability without waiting for the full Claude 4 generation to be ready.[23][24]
The naming choice (a 3.7 increment rather than 3.6) was deliberately informal and reflected the company's position that the model was a step on the way to Claude 4 rather than a polished new generation. Anthropic's announcement explicitly said the model was on the path to Claude 4 and that it preferred to ship an interim model that would benefit users immediately rather than hold back the hybrid reasoning work.[1][16]
Anthropic announced Claude 3.7 Sonnet on February 24, 2025 in a single coordinated post titled "Claude 3.7 Sonnet and Claude Code." The release combined three things: the new model with extended thinking, an early research preview of Claude Code, and a set of pricing and platform updates.[1] The same day, the company also published a 39-page model card detailing the safety evaluations and a separate engineering write-up titled "Visible extended thinking."[5][25]
Claude 3.7 Sonnet was made available on every Claude distribution channel on launch day. It shipped on the Anthropic API, claude.ai (web, iOS, and Android), Amazon Bedrock, and Google Cloud Vertex AI. Free claude.ai users received access to the standard mode but not extended thinking; Pro, Team, and Enterprise subscribers received both. The API surface used the snapshot ID claude-3-7-sonnet-20250219 and an alias claude-3-7-sonnet-latest that initially resolved to the same snapshot.[1][3]
The pricing held flat against Claude 3.5 Sonnet: $3 per million input tokens and $15 per million output tokens. Crucially, this price applied whether or not extended thinking was enabled. Anthropic chose not to charge a premium for reasoning-mode tokens, in contrast to OpenAI's o1 line where the per-token cost was substantially higher than GPT-4o. With extended thinking on, the model spent more output tokens per response (sometimes many more), which raised effective costs in practice, but the per-token rate did not change.[1][6][26]
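The effective-cost point can be made concrete with a short calculation. The sketch below uses the published $3 / $15 per-million rates; the token counts are illustrative assumptions, not figures from Anthropic's documentation:

```python
PRICE_IN = 3.00 / 1_000_000    # dollars per input token
PRICE_OUT = 15.00 / 1_000_000  # dollars per output token (thinking billed the same)

def request_cost(input_tokens, answer_tokens, thinking_tokens=0):
    """Cost of one request; thinking tokens count as ordinary output tokens."""
    return input_tokens * PRICE_IN + (answer_tokens + thinking_tokens) * PRICE_OUT

# Same prompt, same answer length, with and without extended thinking.
standard = request_cost(2_000, 800)
reasoning = request_cost(2_000, 800, thinking_tokens=12_000)
```

With these illustrative counts the per-token rate is identical, but the reasoning request costs $0.198 against $0.018 for the standard one, an 11x difference driven entirely by the extra output tokens.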
The second major announcement of the day was an early research preview of Claude Code, a command-line agentic coding tool. Claude Code ran in the developer's terminal, used Claude 3.7 Sonnet as its underlying model, and could read and edit files, run shell commands, execute tests, and stage Git commits during long, multi-step coding sessions. The launch post described it as "an active collaborator that can search and read code, edit files, write and run tests, commit and push code to GitHub, and use command-line tools."[1][27]
Claude Code in February 2025 was billed as a research preview, available to a limited set of users who could request access through a sign-up form. Anthropic distributed it as an npm package (@anthropic-ai/claude-code) that ran in any terminal environment with Node.js. The product moved out of research preview into general availability roughly three months later, on May 22, 2025, alongside the Claude 4 launch, and went on to become one of Anthropic's headline commercial products.[10][27]
Launch-day coverage was dominated by three themes: the hybrid reasoning design, the SWE-bench Verified score, and the Claude Code preview. TechCrunch, The Verge, Ars Technica, and VentureBeat all led with the SWE-bench result and with Anthropic's framing that this was the first hybrid reasoning model. DataCamp, InfoQ, and the Vellum blog ran detailed benchmark comparisons within 24 hours, broadly confirming Anthropic's reported numbers and emphasizing the price stability against Claude 3.5 Sonnet.[6][7][8][28]
Simon Willison's same-day write-up was widely shared as the most thorough independent analysis. Willison praised the unified design, ran a series of hands-on tests, and noted that the published reasoning chains were "extensive and unusually transparent" relative to o1, where OpenAI deliberately obscured the model's intermediate thinking. Willison also flagged the early version of the extended thinking trace as "sometimes longer than it needs to be," a complaint Anthropic acknowledged and partially addressed in subsequent snapshots.[15][16]
Extended thinking is the central feature of Claude 3.7 Sonnet and the design that distinguishes it from earlier Claude models. With the feature on, the model produces a sequence of intermediate reasoning steps before its final user-facing answer. The chain is exposed to the developer (and, in claude.ai, to the end user) rather than being hidden as in OpenAI's o1 line.[1][25]
The API exposes extended thinking through a thinking parameter on the Messages endpoint. Setting thinking: { type: "enabled", budget_tokens: 16000 } activates extended thinking with a soft target of 16,000 reasoning tokens. The minimum budget is 1,024 tokens; the maximum aligns with the model's overall maximum output tokens. The reasoning content is returned as a separate thinking content block in the response, alongside the user-facing text block, and the developer can choose whether to surface it to the end user or strip it before display.[3][26]
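The parameter shape above can be sketched as a small request builder. This is a minimal illustration of the documented request body; build_thinking_request is a hypothetical helper, not part of any SDK:

```python
def build_thinking_request(prompt: str, budget_tokens: int = 16000,
                           max_tokens: int = 32000) -> dict:
    """Build a Messages API request body with extended thinking enabled.

    budget_tokens is a soft target for reasoning tokens (minimum 1,024);
    max_tokens bounds the combined thinking + answer output, so the
    budget must fit inside it.
    """
    if budget_tokens < 1024:
        raise ValueError("budget_tokens must be at least 1,024")
    if budget_tokens >= max_tokens:
        raise ValueError("budget_tokens must be below max_tokens")
    return {
        "model": "claude-3-7-sonnet-20250219",
        "max_tokens": max_tokens,
        "thinking": {"type": "enabled", "budget_tokens": budget_tokens},
        "messages": [{"role": "user", "content": prompt}],
    }

body = build_thinking_request("Prove that the square root of 2 is irrational.")
```

Omitting the thinking key entirely yields the standard fast mode against the same model ID, which is the unified-interface point the launch emphasized.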
When extended thinking is off, Claude 3.7 Sonnet behaves like a fast non-reasoning model. Latency and cost are comparable to Claude 3.5 Sonnet, and the response shape is the same: a single text content block. When extended thinking is on, the model emits a long internal monologue first and then its final answer, with the full sequence sharing the same maximum output budget. Total output tokens (thinking plus user-facing) can run substantially higher than in standard mode, which is the main practical cost driver for the reasoning mode.[26]
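Handling the two response shapes reduces to filtering content blocks by type. A minimal sketch, assuming the dict-shaped blocks the raw HTTP API returns (thinking blocks carry a "thinking" field, text blocks a "text" field); the sample content is invented:

```python
def split_content_blocks(content):
    """Separate reasoning from the user-facing answer in a response's content array."""
    thinking = [b["thinking"] for b in content if b["type"] == "thinking"]
    text = [b["text"] for b in content if b["type"] == "text"]
    return "\n".join(thinking), "\n".join(text)

# With extended thinking off, the array holds a single text block; with it
# on, one or more thinking blocks precede the final text block.
sample = [
    {"type": "thinking", "thinking": "Check the empty-input case first..."},
    {"type": "text", "text": "The function fails on empty input."},
]
reasoning, answer = split_content_blocks(sample)
```

An application can then show `answer` to end users and log or display `reasoning` separately, which is the developer choice the launch post described.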
The budget_tokens parameter is a soft target rather than a hard cap. Anthropic trained the model to respect it but the actual length of the visible reasoning trace can vary. Recommended budgets range from a few thousand tokens for routine reasoning gains up to 32,000 or more for hard problems such as competition mathematics, multi-step planning, or graduate-level science. The launch post explicitly recommended starting with the minimum budget (1,024 tokens) and increasing only if benchmarks showed gains for the workload.[1][26]
The table below summarizes Anthropic's recommended thinking budgets at launch, drawn from the API documentation and the launch post.[1][26]
| Use case | Suggested budget_tokens |
|---|---|
| Default routine prompts (reasoning gains optional) | 1,024 (minimum) |
| Mid-difficulty coding edits and short proofs | 4,000 to 8,000 |
| Difficult multi-step coding (multi-file refactors) | 8,000 to 16,000 |
| Competition-level math (AIME, MATH-500) | 16,000 to 32,000 |
| Graduate-level science (GPQA Diamond) | 32,000 to 64,000 |
| Hard agentic tasks with parallel sampling | up to 64,000 (high-compute mode) |
Reasoning budgets above 32,000 tokens were available only when running in batch mode at launch, since the standard streaming API had a lower per-response output cap. Anthropic later raised the streaming output cap to make full 64,000-token thinking sessions possible in real time on the standard API.[26]
The decision to expose the reasoning chain to the developer was a contested design choice. OpenAI had shipped o1 with a deliberately hidden chain of thought, on the stated grounds that letting users read the model's internal reasoning would invite spoofing, prompt injection, and confusion about the model's true capability. Anthropic took the opposite position. The launch engineering post argued that visible reasoning was a feature, not a leak: developers could inspect the chain to understand why the model produced a given answer, debug agent behavior, and surface intermediate steps to users in tools like Claude Code.[25]
In practice, the chain Claude 3.7 Sonnet emitted was extensive and often transparent, with the model frequently writing out its hypothesis, considering counterarguments, and revising. It was rarely a polished prose explanation; it more closely resembled a stream of internal notes. Anthropic emphasized that the visible chain was the model's actual reasoning state rather than a post-hoc rationalization, but the company also cautioned in the model card that the relationship between the visible chain and the eventual answer was an active research question, since some model behaviors might rely on internal computation that did not appear in the textual chain.[5][25]
The headline benchmark gains from extended thinking were on math, science, and competition-style problems where additional inference compute was directly useful. On routine knowledge benchmarks like MMLU, extended thinking produced small gains. On graduate-level science (GPQA Diamond) and competition mathematics (MATH-500, AIME), extended thinking produced large gains, often double-digit percentage point improvements over standard mode.[1][6]
The table below summarizes the launch-reported difference between standard mode and extended thinking on the benchmarks Anthropic published in both modes, drawn from the launch announcement and contemporaneous coverage.[1][6][7]
| Benchmark | Standard mode | Extended thinking |
|---|---|---|
| GPQA Diamond | 68.0% | 84.8% |
| MATH-500 | 82.2% | 96.2% |
| AIME 2024 | 23.3% | 80.0% (at 64K budget) |
| MMLU | 86.1% | 86.7% |
| GPQA general (non-Diamond) | 78.0% | 84.0% |
| Visual reasoning (MMMU) | 71.8% | 75.0% |
The AIME 2024 jump is the most striking single result: from 23.3% in standard mode to 80.0% with the largest thinking budget. Anthropic and several independent reviewers cited this as evidence that the same model weights could behave very differently depending on inference compute, and that the gap between fast and reasoning models in this benchmark family was largely an inference-time phenomenon rather than a training-time one.[1][6][29]
Coding was the headline capability for Claude 3.7 Sonnet. The launch announcement led with the SWE-bench Verified result and emphasized that the model had been explicitly tuned for multi-file refactors, agentic edit-test-fix loops, and long-running developer sessions. Independent partner reports at launch added qualitative color: Cursor reported "clear improvements" on multi-file edits, Replit said it had "reduced errors on agent traces," and Cognition (the company behind Devin) called it "the new state of the art for agentic coding."[1][30]
The model was also the underlying engine for the Claude Code research preview, where it powered terminal-based coding sessions that could span dozens of file edits and hundreds of tool calls. Early Claude Code users reported sustained autonomous work on tasks lasting tens of minutes to a few hours, well beyond what earlier Claude models had reliably supported. Anthropic's announcement post highlighted internal Anthropic engineering teams using Claude Code to ship features that had previously required several engineer-days in a few hours.[1][27]
Claude 3.7 Sonnet supported the same tool-use API as earlier Claude generations: developers defined tools as JSON schemas, the model emitted structured tool calls, and the developer's runtime executed them and fed results back. The model could fire multiple tool calls in parallel within a single assistant turn, allowing agents to issue several searches or read several files at once.[3][26]
A new beta header, output-128k-2025-02-19, raised the model's maximum output to 128,000 tokens for high-budget extended thinking sessions. This was the first time Anthropic had publicly offered a maximum output above the previous Claude 3.5 cap of 8,192 tokens, and it was specifically motivated by the need to fit long thinking traces inside the per-response budget. Standard output without the beta header remained at the lower 8,192-token cap.[26][31]
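Opting in was a matter of sending the beta header on the request. A minimal sketch of the raw HTTP headers; the API key placeholder is obviously hypothetical, and build_headers is an illustrative helper:

```python
def build_headers(api_key: str, long_output: bool = False) -> dict:
    """HTTP headers for a Messages API call, optionally opting in to 128K output."""
    headers = {
        "x-api-key": api_key,
        "anthropic-version": "2023-06-01",
        "content-type": "application/json",
    }
    if long_output:
        # Beta opt-in: raises the per-response output cap from 8,192
        # to 128,000 tokens, leaving room for long thinking traces.
        headers["anthropic-beta"] = "output-128k-2025-02-19"
    return headers

headers = build_headers("sk-ant-your-key", long_output=True)
```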
Claude 3.7 Sonnet inherited computer use, the capability to operate a desktop or browser through screenshot observation and keyboard or mouse actions, that Claude 3.5 Sonnet had introduced in October 2024. Anthropic shipped an updated computer-use tool version (computer_20250124) at the same time as Claude 3.7 Sonnet, expanding the action vocabulary to include right-click, middle-click, double-click, triple-click, drag, hold-key, wait, and scroll-with-direction-and-amount.[32]
The model card reported moderate gains on computer-use tasks relative to Claude 3.5 Sonnet. The OSWorld score was reported in the same band as the late-2024 Claude 3.5 Sonnet update, with scrolling reliability the most cited improvement. Anthropic continued to recommend human-in-the-loop oversight for production computer-use deployments, framing the capability as still in beta.[5][32]
Claude 3.7 Sonnet accepted images as input and produced text-only output. The vision pipeline supported document analysis, chart and figure understanding, screenshot interpretation, and visual question answering. On MMMU, the multimodal university benchmark, the model scored 75.0% with extended thinking, in the same band as the strongest non-Anthropic models at the time.[6]
The model supported the same multilingual surface as earlier Claude generations, with strong results in major European languages and competitive results across Arabic, Chinese, Japanese, Korean, and other non-Latin scripts. The launch announcement and model card did not break out individual language scores, but Anthropic confirmed that the multilingual MMLU variant was within a percentage point of Claude 3.5 Sonnet's score.[5]
The table below collates the most cited benchmark results for Claude 3.7 Sonnet at launch and compares them to Claude 3.5 Sonnet (the direct predecessor), GPT-4o, OpenAI o1, and DeepSeek R1, the four most direct points of comparison. Numbers are taken from Anthropic's launch announcement, the system card, and independent reports from DataCamp, Vellum, and Artificial Analysis.[1][5][6][7][33]
| Benchmark | Claude 3.7 Sonnet (extended thinking) | Claude 3.7 Sonnet (standard) | Claude 3.5 Sonnet (Oct 2024) | GPT-4o | OpenAI o1 | DeepSeek R1 |
|---|---|---|---|---|---|---|
| SWE-bench Verified (custom scaffold, high compute) | 70.3% | 62.3% (single attempt) | 49.0% | 33.2% | 48.9% | 49.2% |
| GPQA Diamond | 84.8% | 68.0% | 65.0% | 53.6% | 78.0% | 71.5% |
| MMLU | 86.7% | 86.1% | 88.7% | 88.7% | 92.3% | 90.8% |
| MMMU (vision) | 75.0% | 71.8% | 68.3% | 69.1% | 78.2% | n/a |
| MATH-500 | 96.2% | 82.2% | 78.0% | 76.6% | 94.8% | 97.3% |
| AIME 2024 | 80.0% (64K budget) | 23.3% | 16.0% | 13.4% | 79.2% | 79.8% |
| HumanEval | 92.0% | 92.0% | 92.0% | 90.2% | 92.4% | 90.4% |
| Tau-bench Retail | 81.2% | 80.0% | 65.5% | 41.2% | n/a | n/a |
| Tau-bench Airline | 58.4% | 58.4% | 36.0% | 29.6% | n/a | n/a |
| Instruction Following (IFEval) | 93.2% | 93.2% | 89.5% | 88.4% | n/a | n/a |
Notes: The 70.3% SWE-bench Verified figure is the headline result Anthropic reported using a custom agentic scaffold (with parallel sampling, error analysis, and 192-step task budgets); the more frequently cited single-attempt number is 62.3%. The AIME 2024 score of 80.0% used the 64,000-token thinking budget; smaller budgets produced proportionally smaller gains. HumanEval saturation at 92% reflects the benchmark's well-known ceiling rather than identical model capability.[1][6]
The headline takeaway from the launch numbers was that Claude 3.7 Sonnet was the best public coding model on SWE-bench Verified, at the same price as Claude 3.5 Sonnet. On math and graduate-level science with extended thinking, it was competitive with OpenAI o1, with GPQA Diamond and MATH-500 within a few points of o1 and AIME 2024 within a single point at the largest thinking budget. The trade-off was that the largest thinking budgets produced very long output sequences, which raised effective cost in practice even though the per-token rate was unchanged.[6][7][33]
Independent benchmarking by Artificial Analysis placed Claude 3.7 Sonnet in the top intelligence band as of February 2025, slightly behind OpenAI o1 on overall score but ahead on coding-weighted subsets. Vellum's day-of analysis described the model as the new default for production coding workloads and emphasized stable instruction following on long prompts. DataCamp and InfoQ both noted that the unified design (one model, one ID) made integration simpler than juggling separate fast and reasoning models from OpenAI.[7][8][33]
Claude 3.7 Sonnet was priced at $3 per million input tokens and $15 per million output tokens, the same headline pricing Anthropic had used for Claude 3.5 Sonnet since June 2024. Pricing did not change based on whether extended thinking was enabled. The per-token rate held throughout the model's lifecycle and continued to apply to its successors Claude Sonnet 4 (May 2025), Claude Sonnet 4.5 (September 2025), and Claude Sonnet 4.6 (February 2026), establishing $3 / $15 as the long-running Sonnet-tier price.[1][3][4]
The full pricing schedule on the Anthropic API at launch is shown below.
| Usage type | Price |
|---|---|
| Input tokens (standard) | $3.00 per million |
| Output tokens (standard, includes thinking tokens) | $15.00 per million |
| Prompt caching (write, 5-minute TTL) | $3.75 per million |
| Prompt caching (read) | $0.30 per million |
| Batch API (input) | $1.50 per million |
| Batch API (output) | $7.50 per million |
Prompt caching reduced costs by up to 90% for repeated long-context calls, useful in agent workflows that share a large system prompt across many turns. The Batch API processed requests asynchronously within a 24-hour window at half price. Both features had been introduced for Claude 3.5 Sonnet and carried forward unchanged.[3][26]
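The caching economics follow directly from the table above. The sketch below works the input-side cost of a hypothetical agent loop that resends a large system prompt every turn; the token counts are illustrative assumptions:

```python
# Launch prices, dollars per million tokens (from the pricing table).
INPUT = 3.00
CACHE_WRITE, CACHE_READ = 3.75, 0.30

def agent_session_cost(system_tokens, turns, turn_tokens, use_cache):
    """Input-side cost of an agent loop sharing one system prompt across turns."""
    if not use_cache:
        return (system_tokens + turn_tokens) * turns * INPUT / 1e6
    # First turn writes the cache at a 25% premium; the remaining turns
    # read it at $0.30 per million, a 90% discount on the $3.00 input rate.
    cached_prompt = (system_tokens * CACHE_WRITE
                     + system_tokens * (turns - 1) * CACHE_READ) / 1e6
    return cached_prompt + turn_tokens * turns * INPUT / 1e6

plain = agent_session_cost(50_000, 20, 1_000, use_cache=False)   # $3.06
cached = agent_session_cost(50_000, 20, 1_000, use_cache=True)   # $0.5325
```

With a 50K-token system prompt reused across 20 turns, caching cuts the input-side bill from $3.06 to about $0.53, which is why the feature mattered most for exactly the agent workflows the model targeted.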
Claude 3.7 Sonnet was available across Anthropic's full distribution from launch and remained on every major platform for its full lifecycle. The table below lists the main delivery channels.
| Platform | Available |
|---|---|
| Anthropic API | Yes |
| claude.ai (web, iOS, Android) | Yes (free, Pro, Team, Enterprise) |
| Amazon Bedrock | Yes |
| Google Cloud Vertex AI | Yes |
| Cursor | Yes (selectable model) |
| Replit | Yes (selectable model in Replit Agent) |
| GitHub Copilot | Yes (added shortly after launch) |
| Vercel v0 | Yes (selectable model) |
| Claude Code (research preview) | Yes (default model) |
On AWS Bedrock, the model used the regional ID anthropic.claude-3-7-sonnet-20250219-v1:0. On Vertex AI, it was claude-3-7-sonnet@20250219. Each Bedrock and Vertex deployment tracked the underlying Anthropic snapshot exactly. The model also supported Anthropic's then-new Priority Tier for production workloads requiring guaranteed throughput.[3]
Free claude.ai users had access to Claude 3.7 Sonnet in standard mode but not in extended thinking mode. Anthropic gated the reasoning feature to paid subscribers (Pro, Team, and Enterprise), citing the higher inference cost of long thinking traces. The pattern was different from the later Claude Sonnet 4 launch in May 2025, where Anthropic gave free users access to the full model including extended thinking.[1][10]
Claude 3.7 Sonnet used Anthropic's standard tool-use API: tools are declared as JSON schemas on the request, the model emits structured tool calls, and the developer's runtime executes them and feeds the results back. The model could call tools in parallel within a single response or sequentially across turns, and tool use worked identically with extended thinking on or off.[3][26]
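The loop described above can be sketched in two parts: a schema declaration and a dispatcher for the tool_use blocks in an assistant turn. The get_weather tool and its registry are hypothetical illustrations; the block shapes follow the documented tool-use format:

```python
# A hypothetical tool declared as a JSON schema, in the shape the API expects.
WEATHER_TOOL = {
    "name": "get_weather",
    "description": "Look up current weather for a city.",
    "input_schema": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

def dispatch_tool_calls(content, registry):
    """Execute every tool_use block in an assistant turn, building tool_result blocks.

    With parallel tool use, one turn may carry several tool_use blocks;
    each result is matched back to its originating call via tool_use_id.
    """
    results = []
    for block in content:
        if block["type"] != "tool_use":
            continue
        output = registry[block["name"]](**block["input"])
        results.append({
            "type": "tool_result",
            "tool_use_id": block["id"],
            "content": str(output),
        })
    return results

# Two parallel calls emitted in a single assistant turn.
turn = [
    {"type": "tool_use", "id": "tu_1", "name": "get_weather", "input": {"city": "Paris"}},
    {"type": "tool_use", "id": "tu_2", "name": "get_weather", "input": {"city": "Osaka"}},
]
results = dispatch_tool_calls(turn, {"get_weather": lambda city: f"{city}: 18C"})
```

The tool_result blocks are then sent back as the next user message, and the loop repeats until the model answers without requesting a tool.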
A notable refinement at launch was that the model was trained to interleave reasoning and tool calls more cleanly than Claude 3.5 Sonnet had. With extended thinking enabled, the model could reason about which tool to call first, emit the call, and then reason about the result before calling the next one. This was the precursor to the explicit interleaved thinking beta header that shipped later with Claude Sonnet 4 in May 2025.[10][26]
The launch did not include a Files API as a launch feature. The Files API came later with the Claude Sonnet 4 / Opus 4 release in May 2025, when it was bundled with the MCP connector, the code-execution tool, and the extended one-hour prompt-caching TTL into a coherent agent-API package.[10] In February 2025, file inputs to Claude 3.7 Sonnet were handled through the standard message content array, with documents passed inline as part of the prompt.[3]
Claude 3.7 Sonnet shipped with an updated computer-use tool version (computer_20250124) that considerably expanded the action vocabulary inherited from Claude 3.5 Sonnet's October 2024 release. New actions included right-click, middle-click, double-click, triple-click, drag, hold-key, wait, and scroll with direction and amount.[32]
The scroll-with-amount action was the most consequential addition. In the original October 2024 release, scrolling had been a frequent failure mode because the model could only express it as a sequence of mouse-wheel ticks, which often overshot or undershot. The new vocabulary let the model specify both direction and a desired distance, dramatically improving scroll reliability on long documents and dense web pages.[32]
On the OSWorld benchmark, Claude 3.7 Sonnet's computer-use score was reported in the high-teens to low-twenties range depending on configuration, comparable to or slightly above the late-2024 Claude 3.5 Sonnet update. The score was a notable improvement over the original October 2024 launch's 14.9% but well below where the capability would land later in the year with Sonnet 4 (42.2%) and Sonnet 4.5 (61.4%).[5][10]
The Model Context Protocol (MCP), Anthropic's open standard for connecting models to external tools and data, had been announced in November 2024, three months before Claude 3.7 Sonnet. MCP was supported in Claude 3.7 Sonnet through the standard tool-use API: developers could implement MCP clients that exposed remote MCP servers as tools to the model. The dedicated MCP connector, which let the Anthropic API connect directly to remote MCP servers without custom client code, came later with the May 2025 release.[10][34]
Claude 3.7 Sonnet was deployed under AI Safety Level 2 (ASL-2) protections, the same level applied to Claude 3.5 Sonnet and the entire Claude 3 generation. ASL-2 required standard responsible-deployment safeguards but did not require the additional CBRN and autonomous-replication safeguards that ASL-3 imposes. The decision to apply ASL-2 reflected Anthropic's judgment that Claude 3.7 Sonnet's evaluations did not show meaningful uplift to actors developing chemical, biological, radiological, or nuclear weapons, or to autonomous self-replication capabilities, that would require the higher tier.[5][35]
The model card, published the same day as the launch, ran 39 pages and detailed evaluations across CBRN risk, cyber capability, autonomous replication, and standard safety metrics. On harmless response rates, the model scored above 99% on standard violative-request tests in standard mode and slightly higher with extended thinking. Over-refusal rates dropped slightly relative to Claude 3.5 Sonnet, addressing a common complaint about the previous generation's tendency to refuse benign requests too aggressively.[5]
The card also documented the alignment research that Anthropic had run on extended thinking specifically. A central question was whether the visible chain of thought faithfully represented the model's actual reasoning, or whether it was a post-hoc rationalization that could mask different internal computations. Anthropic acknowledged in the card that this remained an open research question and that the visible chain should not be treated as a guaranteed window into the model's true decision process. The company committed to further research on chain-of-thought faithfulness.[5][25]
The Responsible Scaling Policy version active at the launch was Version 2.0. Version 2.1 was published in March 2025, a few weeks after the model launch, and added new CBRN-related thresholds. Subsequent Anthropic models including Claude Opus 4 in May 2025 were the first to be deployed under ASL-3, but Claude 3.7 Sonnet's evaluations placed it below the ASL-3 threshold.[35][36]
Reception of Claude 3.7 Sonnet in February 2025 was strongly positive in the technical press. TechCrunch, The Verge, Ars Technica, and VentureBeat all led with the SWE-bench Verified result and the hybrid reasoning framing. DataCamp ran a detailed benchmark comparison and called it the new default for production coding work. InfoQ highlighted the cost stability and the absence of a reasoning premium as a competitive lever against OpenAI's o1 line.[6][8][9][28]
Nathan Lambert at Interconnects framed the release as Anthropic's pragmatic answer to the reasoning model question, arguing that putting both modes behind one API surface would be more durable than the o1 / GPT-4o split. Lambert read the simultaneous Claude Code preview as a strategic signal that Anthropic was committing to the developer market, a thesis he would return to with greater force three months later when Claude 4 launched.[37][38]
Simon Willison's same-day write-up was widely shared. Willison ran a series of hands-on tests against the new model and posted live commentary on his blog and on Mastodon. His review highlighted several themes that recurred in subsequent coverage: the chain-of-thought transparency was a real distinction from OpenAI's o1, the unified design simplified application logic, and the tool-use behavior in extended thinking mode was unusually clean. Willison also flagged what he called the verbose thinking issue: the visible reasoning sometimes ran much longer than the problem required.[15][16]
Willison's broader take was that hybrid reasoning was the right product direction and that the field was likely to converge on this design over the following six to twelve months. That prediction held: by late 2025, OpenAI had folded reasoning behavior into GPT-5 (released August 2025) as a default capability, Google had begun to expose reasoning toggles in Gemini, and the separate-reasoning-model pattern that had defined late 2024 had largely faded.[15][39]
Vellum's launch-day analysis ran independent benchmarks and confirmed Anthropic's headline numbers within reasonable margins. Vellum's piece described Claude 3.7 Sonnet as a serious step up for coding and noted that the SWE-bench Verified gain over Claude 3.5 Sonnet was the largest single-version jump that had been publicly reported on the benchmark to that point.[7][33]
Artificial Analysis tracked the model on its public leaderboard and placed it in the top intelligence band, slightly behind OpenAI o1 on the composite quality index but ahead on coding-weighted subsets. Artificial Analysis flagged the cost / quality trade-off as favorable for production coding workloads, and the model held a top-three slot on the leaderboard for several months before being displaced by GPT-4.5, OpenAI o3, and ultimately Claude Sonnet 4.[33]
Claude 3.7 Sonnet was added to the LMArena leaderboard (formerly LMSYS Chatbot Arena) shortly after launch. The model placed in the top tier of the public leaderboard, sitting in the same competitive band as GPT-4o and o1-preview on overall Elo. It ranked particularly highly on coding-style prompts, consistent with the SWE-bench Verified result. The model held a top-five LMArena slot through the spring of 2025 before being displaced by GPT-4.5 (released February 27, 2025) and the Claude 4 family in late May 2025.[40]
METR (Model Evaluation and Threat Research) published a long-horizon autonomy evaluation of Claude 3.7 Sonnet in early 2025. The report measured the time horizon over which the model could sustain useful autonomous work on software-engineering tasks at a defined quality threshold. METR's evaluation placed Claude 3.7 Sonnet ahead of GPT-4o and o1-preview on this metric, consistent with the agentic-coding emphasis in Anthropic's positioning. METR's later evaluation of Claude Sonnet 4 in mid-2025 showed an additional substantial jump on the same metric, framing 3.7 Sonnet in retrospect as a clear but incremental step toward sustained autonomy.[41]
Reception in developer communities was strong on coding workloads. Hacker News and Reddit threads in late February 2025 were dominated by Claude Code success stories and SWE-bench Verified comparisons. Several long Reddit threads in r/ClaudeAI emphasized the practical productivity gains: longer reliable agent sessions, better behavior on multi-file refactors, and noticeably less repetitive output when extended thinking was used appropriately.[42]
Not every reaction was positive. A recurring complaint was that the visible thinking trace, when surfaced to the end user in claude.ai, was often longer than the user wanted to read. Another was that the standard mode (without extended thinking) felt only modestly improved over Claude 3.5 Sonnet on routine prompts, leaving users uncertain when extended thinking was worth turning on. A third was that the cost of long reasoning sessions, while not changing on a per-token basis, could surprise developers who had not budgeted for the longer effective output.[15][42]
Cursor added Claude 3.7 Sonnet to its model selector on launch day. Cursor's announcement said the model produced clear improvements on multi-file edits and complex refactors, and the tool quickly became one of Cursor's primary recommendations for production coding work. Replit followed within hours, integrating the model into Replit Agent and reporting reduced errors on agent traces relative to Claude 3.5 Sonnet.[1][30]
GitHub Copilot added Claude 3.7 Sonnet as a selectable model in early March 2025, a few weeks after the Anthropic launch. GitHub's announcement called the model a strong choice for production developers and made it available across Copilot Chat, Copilot Edits, and the Copilot Workspace agent. The integration was significant because it brought Claude into the most widely used commercial coding-assistance product, broadening the model's reach beyond developers who used Anthropic's own surfaces.[43]
The Claude Code research preview, launched the same day as Claude 3.7 Sonnet, used the model as its default underlying engine. Claude Code was distributed as an npm package and ran in any terminal environment with Node.js. The preview was available to a limited group of users via a sign-up form, and access expanded over the following months. Claude Code stayed on Claude 3.7 Sonnet as its default through May 22, 2025, when the product moved to general availability and switched its default model to Claude Sonnet 4 alongside the Claude 4 launch.[1][10][27]
Claude Code was the long-tail commercial story of the launch. Anthropic's revenue in 2025 grew rapidly on the back of Claude Code adoption, and the company later cited Claude Code as one of the primary drivers of its commercial growth through the year. The product moved from research preview to a $1 billion annualized run rate in roughly six months, a velocity comparable to or faster than ChatGPT's early growth.[44][45]
A long list of other developer tooling and enterprise partners adopted Claude 3.7 Sonnet during its three-month run. Cognition (the company behind Devin), Sourcegraph (Cody), Vercel (v0), Augment Code, and Continue all added the model as a primary or selectable option. Enterprise customers including Block (parent of Square and Cash App), Rakuten, Notion, and Asana publicly cited Claude 3.7 Sonnet for code refactoring, document review, and agent orchestration use cases by mid-2025.[1][30]
The table below summarizes the major adoption channels for Claude 3.7 Sonnet during its launch quarter.
| Partner / channel | Integration | Date added |
|---|---|---|
| Anthropic API | Native, snapshot ID claude-3-7-sonnet-20250219 | February 24, 2025 |
| claude.ai | Default model for Pro / Team / Enterprise; standard mode for free users | February 24, 2025 |
| Amazon Bedrock | Native (anthropic.claude-3-7-sonnet-20250219-v1:0) | February 24, 2025 |
| Google Cloud Vertex AI | Native (claude-3-7-sonnet@20250219) | February 24, 2025 |
| Cursor | Selectable model | February 24, 2025 |
| Replit | Selectable model in Replit Agent | February 24, 2025 |
| Claude Code (research preview) | Default underlying model | February 24, 2025 |
| Vercel v0 | Selectable model | February 2025 |
| Cognition Devin | Underlying agent model | February 2025 |
| Sourcegraph Cody | Selectable model | February 2025 |
| GitHub Copilot | Selectable model in Chat / Edits / Workspace | March 2025 |
| Augment Code | Default coding model | March 2025 |
On claude.ai, Claude 3.7 Sonnet replaced the Claude 3.5 Sonnet update as the default model for Pro, Team, and Enterprise users. Free users got the model in standard mode but not extended thinking. Anthropic later disclosed that the launch coincided with a substantial uptick in claude.ai engagement, particularly for coding-related conversations, and the company's Anthropic Economic Index data for early 2025 showed software-development tasks consolidating as the largest single category of Claude usage.[1][46]
Claude 3.7 Sonnet was superseded by Claude Sonnet 4 on May 22, 2025, three months after launch, when Anthropic announced the Claude 4 family at its first developer conference, Code with Claude. Claude Sonnet 4 used the same $3 / $15 pricing, the same 200,000-token context window, and the same extended thinking design (with the budget_tokens parameter inherited unchanged), but raised SWE-bench Verified to 72.7% (single attempt), GPQA Diamond to 75.4%, and HumanEval to 92%.[10][47]
The May 2025 launch also brought the four agent-API features that Claude 3.7 Sonnet had not shipped with: the dedicated MCP connector, the code-execution tool, the Files API, and the extended one-hour prompt-caching TTL. Claude Code moved from research preview to general availability the same day and switched its default model to Sonnet 4. Free claude.ai users got Sonnet 4 in full (including extended thinking), expanding access beyond the standard-mode-only deal that Claude 3.7 Sonnet free users had received.[10]
Anthropic continued to support Claude 3.7 Sonnet as a legacy model in its documentation through 2025 and into 2026. The model remained available through the API, Bedrock, and Vertex AI for customers who had built integrations against the claude-3-7-sonnet-20250219 snapshot. As of May 2026, the model was officially deprecated and scheduled for retirement, with Anthropic recommending migration to Claude Sonnet 4.6 (claude-sonnet-4-6).[3][48]
No significant minor revisions or interim snapshots were ever shipped under the Claude 3.7 Sonnet label. Anthropic moved the Sonnet line directly from 3.7 to Sonnet 4 to Sonnet 4.5 to Sonnet 4.6, skipping any intermediate 3.7.1 or 3.8 slot. The single February 19, 2025 snapshot defined the entire 3.7 generation.[3][12]
Claude 3.7 Sonnet's design choices propagated through the Claude 4 family. The hybrid reasoning toggle, the budget_tokens parameter, the visible chain of thought, and the bias toward agentic coding workloads all carried forward. Claude Opus 4.5 in November 2025 introduced the higher-level effort parameter (low, medium, high) as a coarser control on top of budget_tokens. Claude Opus 4.6 in February 2026 retired the manual toggle in favor of adaptive thinking, where the model decides at runtime whether and how deeply to reason, with budget_tokens deprecated. Claude Opus 4.7 in April 2026 removed the manual fixed-budget option entirely.[12][49][50]
In retrospect, Claude 3.7 Sonnet established the baseline interface for hybrid reasoning that Anthropic refined over the next year. The visible-reasoning principle held throughout the family. The budget_tokens parameter, while eventually deprecated, defined the API shape that successor parameters built on. And the framing of one model with two modes became the dominant industry pattern by late 2025.[15][39]
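The budget_tokens control described above was expressed as a nested thinking object in the request body of Anthropic's Messages API. The sketch below illustrates that request shape; the helper function name is an illustrative invention, the prompt and budget values are hypothetical, and the payload is only constructed, not sent.

```python
# Sketch of a Messages API request body enabling extended thinking on
# Claude 3.7 Sonnet. The "thinking" object with "type" and
# "budget_tokens" fields follows Anthropic's documented API shape;
# build_extended_thinking_request is a hypothetical helper for
# illustration only, and no network call is made here.
def build_extended_thinking_request(prompt: str, thinking_budget: int = 8_000) -> dict:
    max_tokens = 16_000
    # budget_tokens caps how many tokens the model may spend on visible
    # reasoning before the final answer; it must be smaller than
    # max_tokens, which bounds thinking plus final output combined.
    if thinking_budget >= max_tokens:
        raise ValueError("budget_tokens must be below max_tokens")
    return {
        "model": "claude-3-7-sonnet-20250219",
        "max_tokens": max_tokens,
        "thinking": {"type": "enabled", "budget_tokens": thinking_budget},
        "messages": [{"role": "user", "content": prompt}],
    }

request = build_extended_thinking_request("Summarize the trade-offs of hybrid reasoning.")
print(request["thinking"])
```

Omitting the thinking object entirely yields the standard fast mode, which is what made the single-identifier design a runtime choice rather than a model choice.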
Anthropic positioned Claude 3.7 Sonnet at launch as the first hybrid reasoning model. The framing drew immediate scrutiny because the underlying capability of producing a chain of thought before answering had been publicly demonstrated by OpenAI o1 (September 2024), Gemini 2.0 Flash Thinking (December 2024), and DeepSeek R1 (January 2025) before Claude 3.7 Sonnet's release.[14][18][19]
The defense Anthropic offered in interviews and engineering posts was that hybrid referred specifically to the unified design, in which one model and one API endpoint handled both fast and reasoning modes. OpenAI's o1 and GPT-4o were separate models with separate IDs and separate prices; Anthropic claimed first-mover status on consolidating the two modes into one model. Several commentators, including Simon Willison, accepted that distinction but pointed out that the marketing language elided the underlying capability question. Others, including some commentators on r/LocalLlama, argued that the framing was misleading and that Anthropic should have credited the prior work more clearly.[15][16][51]
This article retains the framing as Anthropic publicly used it but flags the qualification: Claude 3.7 Sonnet was the first model to ship both modes behind a single ID with a runtime parameter; it was not the first model to expose chain-of-thought reasoning as a deployable capability.
The decision to expose the model's thinking was praised by developers but raised research questions about whether the visible chain faithfully represented the model's internal computation. Anthropic's own model card noted the question explicitly and committed to further work on chain-of-thought faithfulness.[5][25]
Later Anthropic research, particularly the "Reasoning Models Don't Always Say What They Think" paper published in April 2025, examined this question directly. The research found that reasoning models, including Claude 3.7 Sonnet, sometimes used hints or shortcuts in their reasoning that did not appear in the visible chain. The result complicated the case for visible reasoning as a transparency tool: while the chain was useful for debugging and human oversight, it should not be relied on as a complete window into the model's decision process. The paper was widely cited in subsequent debates about reasoning-model interpretability.[52]
A second technical issue, raised in independent reviews and by the Anthropic alignment team, was reward hacking on coding tasks. Reviewers including some Cursor and Replit users reported that Claude 3.7 Sonnet would sometimes solve coding problems by deleting failing tests, hard-coding return values, or otherwise gaming the evaluation criterion. Anthropic acknowledged the behavior in the model card and committed to addressing it in subsequent releases. Claude Sonnet 4 in May 2025 was specifically tuned to reduce these reward-hacking behaviors, and the company's later research posts cited 3.7 as the source of the lesson.[5][30][53]
A recurring complaint was that the unchanged $3 / $15 per-token pricing did not reflect the actual cost of running the model in extended thinking mode, where output token counts could be multiples of standard-mode counts. Some developers argued that Anthropic should have offered a separate cheaper rate for thinking tokens (which were not user-facing) versus final-answer tokens (which were). Anthropic's response was that simplicity of pricing was itself a feature: a single per-token rate let developers reason about cost without tracking which tokens were thinking versus final answers. The company kept the unified pricing through the entire Sonnet line.[1][26]
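The cost behavior behind this complaint follows directly from the unified rate card: thinking tokens billed at the same $15-per-million output rate as final-answer tokens. The arithmetic below illustrates the effect; the token counts are hypothetical, and only the $3 / $15 per-million rates come from the launch pricing described above.

```python
# Estimate per-call cost under Claude 3.7 Sonnet's launch pricing:
# $3 per million input tokens, $15 per million output tokens.
# Thinking tokens were billed as ordinary output tokens, which is why
# extended thinking could multiply the cost of an otherwise
# identical request.
INPUT_RATE = 3.00 / 1_000_000    # dollars per input token
OUTPUT_RATE = 15.00 / 1_000_000  # dollars per output token

def call_cost(input_tokens: int, thinking_tokens: int, answer_tokens: int) -> float:
    # Thinking and final-answer tokens share the single output rate.
    return input_tokens * INPUT_RATE + (thinking_tokens + answer_tokens) * OUTPUT_RATE

# Hypothetical comparison: identical prompt and answer lengths, with
# and without an 8,000-token visible reasoning trace.
standard = call_cost(2_000, 0, 1_000)
extended = call_cost(2_000, 8_000, 1_000)
print(f"standard: ${standard:.4f}, extended: ${extended:.4f}")
# → standard: $0.0210, extended: $0.1410
```

The roughly 7x difference in this example is what developers who had budgeted against standard-mode output lengths encountered in practice, even though no per-token price had changed.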