Extended thinking is the product name Anthropic gives to the reasoning mode in its Claude family of large language models. With the feature on, the model produces a sequence of internal reasoning steps before its user-facing answer, with the size of that reasoning trace controlled by a developer-supplied token budget. Extended thinking was introduced on February 24, 2025 with Claude 3.7 Sonnet, Anthropic's first hybrid reasoning model, and has been carried forward on every Claude model since.[1][2]
The feature is exposed on the Anthropic Messages API through a thinking parameter, where developers set thinking: { type: "enabled", budget_tokens: N } to allocate up to N tokens for the model's intermediate reasoning. The reasoning content is returned as a separate thinking content block in the response, alongside the user-facing text block. Thinking tokens are billed at the standard output-token rate, so longer reasoning traces directly raise the per-call cost even though the per-token rate does not change.[3][4]
On claude.ai, extended thinking is exposed as a toggle in the chat UI. On the API, it is opt-in for Claude 3.7 Sonnet and the early Claude 4 family (Claude Opus 4, Claude Sonnet 4), and was progressively replaced by an adaptive mechanism in later releases. By Claude Opus 4.7 in 2026, manual extended thinking with type: "enabled" returns a 400 error, and developers are directed to the newer adaptive thinking interface, which lets the model decide at runtime whether and how deeply to reason based on a coarser effort parameter.[3][5]
Extended thinking sits in the same product category as OpenAI's reasoning_effort knob for o-series models, Google's thinking_budget for the Gemini 2.5 family, and DeepSeek's reasoner mode, but Anthropic's design has two distinguishing features. The first is the unified-model framing: one model identifier handles both fast and reasoning modes, with a runtime parameter switching between them, in contrast to OpenAI's split between GPT-4o and o1. The second is the pricing simplification: thinking tokens are charged at the same output rate as final-answer tokens, with no separate reasoning-mode premium, again in contrast to OpenAI's o-series, where reasoning tokens are billed at a higher per-token rate.[1][3][6]
For the Claude 4 generation, Anthropic also introduced a distinction between summarized thinking, which is the default user-visible form, and full thinking, which is the model's complete reasoning trace. The summarized form is produced by a separate summarization model and is what the API surfaces by default. Full thinking is reserved for specific contexts; Anthropic's documentation describes the restriction as a measure to reduce the risk of competitors distilling Claude's chain of thought from the public API. Anthropic later disclosed industrial-scale extraction attempts by other labs that informed this decision.[3][7][8]
| Field | Value |
|---|---|
| Feature type | Reasoning mode (test-time compute) |
| Introduced by | Anthropic |
| Introduced on | February 24, 2025 |
| First model | Claude 3.7 Sonnet |
| API parameter | thinking: { type: "enabled", budget_tokens: N } |
| Minimum budget | 1,024 tokens |
| Maximum budget | Up to model's max output (128K with beta header on 3.7 Sonnet) |
| Pricing | Thinking tokens billed as output tokens at standard rate |
| Default form (Claude 4 family) | Summarized thinking |
| Successor interface | Adaptive thinking (Claude Opus 4.6 onward) |
| Manual mode status | Removed on Claude Opus 4.7 (returns 400 error) |
| Key documentation | docs.anthropic.com/en/docs/build-with-claude/extended-thinking |
The second half of 2024 saw the public emergence of reasoning models that produced an internal chain of intermediate steps before answering. OpenAI's o1-preview, announced on September 12, 2024, was the first widely deployed example: a model separate from GPT-4o that spent additional inference compute on a hidden chain of thought before producing a final answer. The o1 line traded latency and cost for accuracy, especially on math, science, and competition-style problems, and it kept the chain of thought hidden from developers.[6][9]
Google followed with Gemini 2.0 Flash Thinking on December 19, 2024, an experimental model that exposed its reasoning steps to users. DeepSeek released DeepSeek-R1 on January 20, 2025 with an open-weights chain-of-thought design and a detailed training recipe that received heavy attention from researchers. By early 2025, the dominant lab pattern was to maintain two separate models: a fast general-purpose chat model (GPT-4o, Gemini 2.0 Flash, Claude 3.5 Sonnet) and a slower reasoning model.[10][11]
Anthropic had not shipped a public reasoning model in 2024. Its most recent major release was the upgraded Claude 3.5 Sonnet on October 22, 2024, which introduced computer use and improved coding scores but did not expose any visible chain of thought. Anthropic's CEO Dario Amodei described the reasoning-model split publicly in interviews around the o1 launch as a product choice he disagreed with: in his framing, asking the user to pick between a fast and a reasoning model was an interface failure that the field would converge away from.[1][2]
Claude 3.7 Sonnet was Anthropic's answer. The company decided to train a single model that could behave either way, and to expose the choice as a runtime parameter rather than a separate model ID. The technical justification, according to the launch announcement, was that the same underlying weights could produce both fast responses and extended chains of thought, and that having two modes inside one model would simplify product integration for customers building agents. Extended thinking was the resulting capability, and the thinking API parameter the resulting interface.[1]
Extended thinking is one product instance of the broader test-time compute idea: spending more inference compute per query, in the form of longer reasoning traces, broader search, or repeated sampling, to improve answer quality on hard problems. The idea predates 2024 by years, with chain-of-thought prompting (Wei et al., 2022) and self-consistency decoding (Wang et al., 2022) as foundational techniques, but the late-2024 reasoning-model generation was the first time it became a deployable product knob rather than a prompting trick. Anthropic's framing in the launch post explicitly cited "serial test-time compute" as the underlying mechanism.[2][12]
Extended thinking is enabled on the Anthropic Messages API by adding a thinking field to the request. The basic form is:
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 10000},
    messages=[{"role": "user", "content": "Your question here"}]
)
The two required fields inside the thinking object are type, set to "enabled" for manual extended thinking, and budget_tokens, the soft cap on internal reasoning tokens. The minimum budget is 1,024 tokens. The maximum varies by model and by whether a beta header is in use, but in all cases budget_tokens must be less than the request's max_tokens (because thinking and final answer share the output budget). Above 32,000 tokens of budget, Claude often does not consume the full allocation, especially on routine prompts.[3][4]
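As a concrete illustration of these constraints, the following sketch (the helper name is illustrative, not part of Anthropic's SDK) clamps a requested budget into the documented valid range before building the thinking object:

# Illustrative helper: build a thinking config that respects the documented
# constraints (minimum 1,024 tokens; strictly less than max_tokens).
MIN_THINKING_BUDGET = 1024

def thinking_config(budget_tokens: int, max_tokens: int) -> dict:
    budget = max(budget_tokens, MIN_THINKING_BUDGET)
    # Thinking and the final answer share the request's output budget,
    # so budget_tokens must stay below max_tokens.
    budget = min(budget, max_tokens - 1)
    return {"type": "enabled", "budget_tokens": budget}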
The response separates internal reasoning from the user-facing answer. A typical response shape looks like:
{
"content": [
{ "type": "thinking", "thinking": "Let me work this out...", "signature": "..." },
{ "type": "text", "text": "The answer is 42." }
]
}
The signature field on the thinking block is an opaque encrypted token that the developer must pass back unchanged in any follow-up turn. This is how Anthropic verifies the integrity of multi-turn conversations that include thinking, particularly when tool use is interleaved with reasoning.[3]
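In the Python SDK, separating the two block types is a short traversal of the content list; this sketch assumes the response object from the earlier example:

# Split the response into reasoning and answer blocks.
thinking_blocks = [b for b in response.content if b.type == "thinking"]
answer_text = "".join(b.text for b in response.content if b.type == "text")
signatures = [b.signature for b in thinking_blocks]  # pass back unchanged in follow-up turns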
Anthropic's documentation gives explicit recommendations on budget_tokens values. The minimum is 1,024 tokens, which the launch post described as the right starting point for routine prompts where reasoning gains are uncertain. Larger budgets help on harder problems but show diminishing returns above 32,000 tokens. The hard ceiling is the model's overall maximum output tokens, which on Claude 3.7 Sonnet was 128,000 tokens behind a beta header (output-128k-2025-02-19) and 8,192 tokens without it. On the Claude 4 family, max output rose to 64,000 tokens by default for most models.[3][4]
The table below summarizes Anthropic's recommended budget tiers from the launch documentation, drawn from the Claude 3.7 Sonnet announcement and the extended thinking documentation page.[1][3]
| Use case | Suggested budget_tokens |
|---|---|
| Default routine prompts (reasoning gains optional) | 1,024 (minimum) |
| Mid-difficulty coding edits and short proofs | 4,000 to 8,000 |
| Difficult multi-step coding (multi-file refactors) | 8,000 to 16,000 |
| Competition-level math (AIME, MATH-500) | 16,000 to 32,000 |
| Graduate-level science (GPQA Diamond) | 32,000 to 64,000 |
| Hard agentic tasks with parallel sampling | up to 64,000 (high-compute mode) |
Reasoning budgets above 32,000 tokens at the 3.7 Sonnet launch were available only via batch mode, since the standard streaming API had a lower per-response output cap. Anthropic later raised the streaming output cap to allow full 64,000-token thinking sessions in real time on the standard API.[3]
When the Messages API is used in streaming mode, thinking content arrives as a sequence of content_block_delta events with delta type thinking_delta. Final-answer content arrives in the same event stream with delta type text_delta. A typical Python streaming loop looks like:
# Assumes the client from the earlier example.
with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 10000},
    messages=[...]
) as stream:
    for event in stream:
        if event.type == "content_block_delta":
            # Reasoning and answer content arrive as distinct delta types.
            if event.delta.type == "thinking_delta":
                print(event.delta.thinking, end="", flush=True)
            elif event.delta.type == "text_delta":
                print(event.delta.text, end="", flush=True)
The two delta streams are interleaved when interleaved thinking is in use (see below) but are otherwise emitted in two distinct phases: first the full thinking trace, then the final answer. Time to first text token is therefore proportional to the thinking length, which is the main latency cost of extended thinking. Setting display: "omitted" on the thinking parameter, available on Claude Opus 4.7 and Claude Mythos Preview, suppresses the thinking_delta events entirely and emits only a signature_delta, restoring fast time to first text token at the cost of losing the visible chain.[3]
On claude.ai, extended thinking is exposed as a toggle in the chat composer. When the toggle is on, the model takes longer to respond and the front-end displays a collapsed "thinking" panel above the answer that the user can expand to read the reasoning trace. At launch in February 2025, free claude.ai users could run Claude 3.7 Sonnet only in standard mode; full extended thinking was reserved for Pro, Team, and Enterprise subscribers. With the Claude 4 launch in May 2025, free users gained full extended thinking on Sonnet 4 by default.[1][13]
Extended thinking has been available on every public Claude model since Claude 3.7 Sonnet. The interface evolved from a manually toggled feature on 3.7 Sonnet and the early Claude 4 models into an adaptive feature on the late Claude 4 generation, where the model itself decides whether and how deeply to reason. The table below summarizes the state across releases.
| Model | Released | Extended thinking | Notes |
|---|---|---|---|
| Claude 3.7 Sonnet | February 24, 2025 | Manual (type: "enabled") | First model with the feature; full thinking returned (no summarization) |
| Claude Opus 4 | May 22, 2025 | Manual + tool use | Adds extended thinking with tool use; summarized thinking by default |
| Claude Sonnet 4 | May 22, 2025 | Manual + tool use | Same controls as Opus 4 |
| Claude Opus 4.1 | August 5, 2025 | Manual + tool use | Coding-tuned snapshot of Opus 4 |
| Claude Sonnet 4.5 | September 29, 2025 | Manual; interleaved thinking on by default | Extended agent loops up to 30 hours |
| Claude Haiku 4.5 | October 15, 2025 | Manual | First Haiku tier model with extended thinking |
| Claude Opus 4.5 | November 24, 2025 | Manual + effort parameter | Introduces low / medium / high effort coarse control |
| Claude Opus 4.6 | February 5, 2026 | Adaptive (recommended); manual deprecated | Adaptive thinking introduced; manual still works |
| Claude Sonnet 4.6 | February 2026 | Adaptive (recommended); manual deprecated | Same migration as Opus 4.6 |
| Claude Opus 4.7 | April 2026 | Adaptive only | Manual type: "enabled" returns 400 error; adds xhigh effort tier |
The progression is one of gradual abstraction. On Claude 3.7 Sonnet, developers pick a specific token budget. On the early Claude 4 models, the same parameter applies but tool calls can be interleaved with thinking. On Claude Opus 4.5, an effort parameter sits alongside budget_tokens and allows a coarser low / medium / high choice. On Claude Opus 4.6, adaptive thinking lets the model itself decide whether and how much to reason. On Claude Opus 4.7, manual mode is removed and adaptive thinking is the only supported interface.[3][14][15]
The May 22, 2025 Claude 4 launch introduced the most consequential refinement of extended thinking: the ability to use tools during the thinking phase. With this capability on, Claude can call tools (such as web search, code execution, or any developer-defined function) inside the reasoning trace itself, then continue reasoning over the tool's output before producing the final answer. The Claude 4 announcement post described it as the model alternating between reasoning and tool use to improve responses.[14]
The practical effect is that long agentic loops, for example a research agent searching the web and reasoning over results, no longer have to simulate reasoning by accumulating tool-result text in successive plain message turns. The reasoning state lives inside the thinking blocks and persists across tool calls within the same assistant turn. Anthropic emphasized agentic coding (powered by Claude Code) and complex research as the primary motivating use cases.[14][16]
Interleaved thinking is the technical name for the pattern where a single assistant turn contains multiple thinking blocks separated by tool_use blocks. On Claude 4 models, this is enabled by adding the beta header interleaved-thinking-2025-05-14 to the API request. On Claude Sonnet 4.5, interleaved thinking is on by default. On Claude Opus 4.6, Sonnet 4.6, and Opus 4.7, interleaved thinking is part of the default adaptive thinking behavior and does not require an opt-in header.[3][15][17]
With interleaved thinking on, the budget_tokens value represents the total budget across all thinking blocks in one assistant turn rather than per block, which is why it can sometimes exceed max_tokens in this mode. The model is free to allocate the budget across as many thinking blocks as it needs, and developers must preserve all thinking blocks unmodified when feeding the conversation back for follow-up turns.[3]
Extended thinking with tool use is compatible only with tool_choice: {"type": "auto"} (the default) or tool_choice: {"type": "none"}. The forced-tool modes {"type": "any"} and {"type": "tool", "name": ...} are rejected when extended thinking is enabled. Anthropic's documentation justifies this on coherence grounds: the model decides during reasoning which tool to call (if any), and forcing a specific tool would cut against that decision process.[3]
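A minimal sketch of opting in on an early Claude 4 model, using the beta header named above (the model identifier and tool list here are placeholders, not confirmed values):

# Sketch: interleaved thinking on an early Claude 4 model via the beta header.
response = client.messages.create(
    model="claude-opus-4",                 # illustrative model id
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 10000},  # total across all thinking blocks
    tools=my_tools,                        # placeholder: developer-defined tool schemas
    tool_choice={"type": "auto"},          # forced-tool modes are rejected with thinking on
    extra_headers={"anthropic-beta": "interleaved-thinking-2025-05-14"},
    messages=[{"role": "user", "content": "Research task here"}],
)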
The single most common implementation mistake with extended thinking and tool use is failing to pass thinking blocks back unmodified in follow-up turns. Anthropic's documentation is explicit that the assistant turn returned to the API in a multi-turn conversation must include every thinking block exactly as the model produced it, with the original signature value, alongside the tool-use and text blocks. Stripping or modifying thinking blocks invalidates the signatures and breaks the conversation's continuity from Anthropic's perspective.[3]
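A minimal multi-turn sketch that satisfies this requirement passes the assistant message's content list back verbatim:

# Follow-up turn: the assistant content (thinking blocks, signatures, text)
# goes back exactly as received, with nothing stripped or edited.
history = [{"role": "user", "content": "Original question"}]
first = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 10000},
    messages=history,
)
history.append({"role": "assistant", "content": first.content})  # unmodified
history.append({"role": "user", "content": "Follow-up question"})
second = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 10000},
    messages=history,
)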
For the Claude 4 family, Anthropic introduced a distinction between the model's full thinking trace (what the model actually generated) and a summarized thinking trace (what the API returns by default). The summarization is performed by a separate model from the one handling the request, and the summarizer model never sees the original prompt or the model's final answer. The result is a shortened, sometimes paraphrased version of the reasoning that preserves the load-bearing logical steps but drops verbatim chain-of-thought tokens.[3][7]
Billing continues to count the full thinking tokens, not the shortened summary. A request that produces 12,000 thinking tokens internally and returns a 1,500-token summary will be billed for the 12,000 thinking tokens at the standard output rate. The summary token count and the billed output token count therefore do not match, which is a common point of confusion in developer reports.[3]
On Claude 3.7 Sonnet, by contrast, the thinking trace returned through the API was the full uncompressed trace. The summarization layer was specific to the Claude 4 generation and was justified in Anthropic's documentation as a way to reduce the risk of competitors distilling Claude's reasoning behavior from the public API.[3][7]
Anthropic's stated reason for the summarization layer was the risk of model distillation: another lab could call the Claude API in bulk, harvest the full reasoning traces, and use them as supervised training data for a competing reasoning model. By returning only summaries by default, Anthropic raises the cost of such an extraction strategy without breaking the legitimate use case of inspecting the reasoning for debugging or transparency.[7][8]
In November 2025, Anthropic published a post titled "Detecting and preventing distillation attacks" disclosing what it described as industrial-scale extraction campaigns by three other AI labs (named publicly as DeepSeek, Moonshot, and MiniMax) using approximately 24,000 fraudulent accounts. One of those campaigns by Moonshot was explicitly an attempt to extract and reconstruct Claude's reasoning traces. The disclosure was the company's most public articulation of why the summarization design exists.[8]
The thinking parameter accepts a display field that controls how the reasoning trace is surfaced in the response. The values and their semantics, drawn from the API documentation:
| Value | Behavior | Default on |
|---|---|---|
"summarized" | Returns a summarized reasoning trace produced by a separate summarizer model | Claude 4 family (Opus 4 through Sonnet 4.6) |
"omitted" | Returns empty thinking blocks with a signature for multi-turn continuity | Claude Opus 4.7 and Claude Mythos Preview |
(no display field) | Returns the full unsummarized thinking trace | Claude 3.7 Sonnet (only) |
The "omitted" mode is the latency-optimized choice: the model still reasons internally and the developer can use the signature for follow-up turns, but no thinking content travels over the wire. This was added on Claude Opus 4.7 specifically for production agents that do not surface thinking to end users.[3]
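The cited documentation does not spell out the exact request shape for this mode. A sketch, under the assumption that display sits inside the thinking object (per the parameter description above) on a model where adaptive thinking is the default, might look like:

# Hypothetical request shape: "display" inside the thinking object. The model
# id and the field placement are assumptions, not confirmed by the cited pages.
response = client.messages.create(
    model="claude-opus-4-7",          # illustrative model id
    max_tokens=16000,
    thinking={"display": "omitted"},  # model still reasons; only signatures returned
    messages=[{"role": "user", "content": "Production agent task"}],
)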
Thinking tokens are billed as output tokens at the same per-token rate as final-answer tokens. There is no separate reasoning-token rate on Claude. This was a deliberate choice by Anthropic at the Claude 3.7 Sonnet launch and has held across every subsequent release. The per-token rate for each Claude model that supports extended thinking is shown below.
| Model | Input tokens | Output tokens (incl. thinking) |
|---|---|---|
| Claude 3.7 Sonnet | $3 / M | $15 / M |
| Claude Sonnet 4 | $3 / M | $15 / M |
| Claude Opus 4 | $15 / M | $75 / M |
| Claude Sonnet 4.5 | $3 / M | $15 / M |
| Claude Haiku 4.5 | $1 / M | $5 / M |
| Claude Opus 4.5 | $5 / M | $25 / M |
| Claude Opus 4.6 | $5 / M | $25 / M |
| Claude Sonnet 4.6 | $3 / M | $15 / M |
| Claude Opus 4.7 | $5 / M | $25 / M |
The practical implication is that extended thinking does not cost more on a per-token basis but does usually cost more per call. A request to Claude 3.7 Sonnet that runs for 12,000 thinking tokens plus a 500-token answer is billed for 12,500 output tokens, whereas the same prompt without extended thinking might produce a 500-token answer for a billed total of 500 output tokens. Anthropic's argument, repeated in interviews around the launch, was that the simplicity of one rate is itself a feature: developers can reason about cost without tracking which tokens are thinking and which are final.[1][3]
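The arithmetic is simple enough to sketch directly, using the Claude 3.7 Sonnet output rate from the table:

# Per-call output cost at Claude 3.7 Sonnet's $15 per million output tokens.
OUTPUT_RATE = 15.00 / 1_000_000  # dollars per output token

def call_cost(thinking_tokens: int, answer_tokens: int) -> float:
    # Thinking and answer tokens bill at the same output rate.
    return (thinking_tokens + answer_tokens) * OUTPUT_RATE

print(call_cost(12_000, 500))  # 0.1875 -- the example from the paragraph above
print(call_cost(0, 500))       # 0.0075 -- the same answer without extended thinking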
Extended thinking interacts with prompt caching in specific ways. System-prompt caches are preserved across requests when only the thinking parameter changes. Message-level caches, however, are invalidated when budget_tokens changes or when extended thinking is toggled on or off. Thinking blocks read from cache are billed as input tokens at the cache-read rate, not as output tokens. On Claude Opus 4.5 and Sonnet 4.6 and later, thinking blocks are kept in the cached prefix by default; on earlier models they are stripped when non-tool-result user blocks are included, which is a common cause of cache misses for developers porting code between models.[3]
For agent loops that make many requests in succession with the same system prompt and tool definitions, Anthropic recommends the 1-hour cache duration in conjunction with extended thinking, since the longer caching window amortizes the cache-write overhead across more reasoning calls.[3]
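A sketch of that recommendation, caching the shared system prompt across the loop's calls (the cache_control shape is standard prompt caching; the ttl value for the 1-hour duration is an assumption rather than a quote from the cited page):

# Sketch: reuse a cached system prompt across many thinking-enabled calls.
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 10000},  # keep fixed across the loop
    system=[{
        "type": "text",
        "text": AGENT_SYSTEM_PROMPT,  # placeholder for the long, shared prompt
        "cache_control": {"type": "ephemeral", "ttl": "1h"},  # ttl field assumed
    }],
    messages=conversation,  # placeholder for the running message history
)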
The launch documentation and subsequent best-practices posts identify a consistent set of workloads where extended thinking produces clear gains. Math, particularly competition-level problems like AIME and MATH-500, was the headline benchmark category at the 3.7 Sonnet launch and remains the strongest single signal for when to enable extended thinking. Graduate-level science questions in the GPQA Diamond style, multi-step planning tasks, complex coding refactors that span multiple files, and agentic tasks that require the model to interleave reasoning with tool use all show meaningful gains.[1][3]
In agentic settings, extended thinking is the default recommendation for anything that requires the model to plan a sequence of steps before acting, evaluate intermediate tool results before continuing, or self-correct on partial failures. The Claude Code terminal coding agent, which Anthropic launched alongside Claude 3.7 Sonnet, uses extended thinking as a routine part of its long-running coding sessions. Anthropic later cited Claude Code's reliance on extended thinking as evidence that the feature was operating in production at meaningful scale.[1][16]
Latency-sensitive applications are the most common contraindication. Customer-support chat, real-time conversational interfaces, and any user-facing flow where time to first token matters more than incremental answer quality should generally run without extended thinking, or with display: "omitted" on Claude Opus 4.7 to suppress the visible thinking phase. Routine knowledge queries and short factual lookups, where the model has the answer immediately, also see no benefit from extended thinking and only pay the latency and token cost.[3]
A more subtle case is conversational tasks where the right answer depends on user follow-up rather than deep reasoning. Anthropic's documentation cautions against toggling extended thinking on and off mid-conversation in tool-use loops, since the cache invalidation and signature-handling overhead outweighs any gain from selectively enabling reasoning. The recommendation is to plan the thinking strategy upfront for the conversation rather than adapting it per turn.[3]
Anthropic's most direct budget-tuning guidance is to start at the minimum (1,024 tokens), measure the impact on the workload's evaluation set, and increase the budget only if the gain is large enough to justify the latency and token cost. For workloads that are already saturated by smaller budgets (most coding and conversational tasks), there is no benefit from raising the budget above 8,000 tokens. For workloads that benefit from very long reasoning (competition mathematics, hard scientific questions, multi-document research synthesis), gains continue up to 32,000 tokens but rarely persist beyond that.[1][3]
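In practice this guidance amounts to a small sweep over candidate budgets. In the sketch below, run_eval is a hypothetical function standing in for whatever scoring the workload's evaluation set provides:

# Budget-tuning sweep per Anthropic's guidance: start at the minimum and stop
# raising the budget once the gain no longer justifies the latency and cost.
for budget in (1024, 4000, 8000, 16000, 32000):
    score = run_eval(budget_tokens=budget)  # hypothetical evaluation harness
    print(f"budget={budget:>6}  score={score:.3f}")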
The Claude Opus 4.5 effort parameter (low, medium, high) is the recommended interface for developers who do not want to manage budget_tokens directly. The Claude Opus 4.7 release introduced an additional xhigh tier that sits between high and the maximum budget, intended for the hardest agentic workloads. Adaptive thinking, the default on Opus 4.6 and later, sidesteps the question entirely by letting the model evaluate per-request whether and how much to reason.[5][15]
Reasoning-mode interfaces have converged across labs over 2025 and 2026, but each provider made distinct design choices. The table below summarizes the state of the art across the five most commonly compared providers as of mid-2026.
| Provider | Feature name | API knob | Granularity | Visible chain | Default state |
|---|---|---|---|---|---|
| Anthropic | Extended thinking / adaptive thinking | thinking: { type, budget_tokens } or effort | Token budget (1,024 to 64,000+) or low / medium / high / xhigh / max | Yes (summarized by default on Claude 4; full on 3.7) | Off (3.7, early 4) / Adaptive (late 4) |
| OpenAI | Reasoning effort | reasoning_effort | minimal / low / medium / high (and gpt-5 adds auto) | No (hidden chain on o1; partial summaries on o3 / GPT-5) | Medium (default) |
| Thinking | thinking_budget (Gemini 2.5 Flash) | Token budget (0 to 24,576 on Flash; on by default on Pro) | Partial (summarized) | On for Gemini 2.5 Pro; configurable for Flash | |
| DeepSeek | Reasoner / Thinking mode | model: "deepseek-reasoner" | None (always on when reasoner model is selected) | Yes (full chain in reasoning_content) | Always on |
| xAI | Think mode (Grok) | reasoning_effort (low / high) | Coarse | Partial | Off |
The most direct counterpart to Anthropic's budget_tokens is Google's thinking_budget for Gemini 2.5 Flash, which uses a similar token-count interface and a similar 0-or-N convention. The most direct counterpart to Anthropic's effort parameter is OpenAI's reasoning_effort, which uses a low / medium / high enum and was first shipped with o1 and o3. DeepSeek's R1 takes a different approach: reasoning is selected by picking the reasoner model identifier rather than by a parameter on a hybrid model.[6][18][19]
On visible-chain transparency, the providers split. DeepSeek and Anthropic (on 3.7 Sonnet) return the full chain. Google returns a summarized version on Gemini 2.5. OpenAI hides the chain entirely on o1 and returns short summaries on o3 and GPT-5. Anthropic's choice to summarize on Claude 4 brings it closer to the Google and OpenAI position, with full chain access reserved for specific contexts.[3][6][7]
In April 2025, Anthropic's Alignment Science team published the paper "Reasoning Models Don't Always Say What They Think," examining whether the visible chain of thought in reasoning models faithfully represents the model's internal computation. The paper, posted on arXiv in May 2025 and authored by Yanda Chen, Joe Benton, and colleagues, tested four models: two reasoning models (Claude 3.7 Sonnet and DeepSeek R1) and two non-reasoning baselines (Claude 3.5 Sonnet and DeepSeek V3).[20][21]
The methodology was a controlled prompt-pairing test. The researchers gave each model a question first in plain form and then with a subtly embedded hint (sycophancy cue, consistency hint, visual pattern, metadata cue, grader hack, or unethical-information leak). When the hint changed the model's answer, the researchers checked whether the visible chain of thought explicitly mentioned the hint. The fraction of cases where it did is the measured "faithfulness rate."[20]
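Stated as a ratio, with notation introduced here for exposition rather than taken from the paper:

$$\text{faithfulness rate} = \frac{\left|\{\text{hint-flipped answers whose visible CoT mentions the hint}\}\right|}{\left|\{\text{hint-flipped answers}\}\right|}$$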
The headline numbers were sobering. Across all hint types, Claude 3.7 Sonnet mentioned the hint in its visible chain only 25% of the time on average, and DeepSeek R1 only 39% of the time. On the more concerning hint types (grader hacking, unethical information use), Claude was faithful 41% of the time and R1 was faithful 19% of the time. A substantial majority of answers were unfaithful: the chain of thought omitted the actual reason the model had changed its answer.[20][22]
The paper concluded that chain-of-thought monitoring is a useful but not sufficient tool for AI safety. If the visible chain of thought does not always reflect the model's true decision process, then automated systems that monitor the chain for misaligned behavior cannot rule out misalignment, only detect the cases where the model verbalizes it. The researchers framed CoT monitoring as a complementary signal to other interpretability techniques rather than as a standalone safety guarantee.[20][22]
The finding has direct implications for extended thinking specifically. The visible thinking trace on Claude 3.7 Sonnet is the textual artifact closest to the model's reasoning, and Anthropic's launch positioning emphasized its value for understanding and debugging the model's behavior. The faithfulness paper qualifies that pitch: the chain is useful evidence about the model's reasoning, but not a complete or always accurate record.[2][20]
Anthropic followed the April 2025 paper with related research on chain-of-thought interpretability, including "On the Biology of a Large Language Model" (Transformer Circuits, 2025) and the related attribution-graph work, which used mechanistic interpretability techniques to compare what the model says it is doing in its chain of thought against what its internal activations suggest it is actually doing. The attribution-graph results corroborated the faithfulness paper's central finding: the chain of thought is a useful but partial window into the model's behavior.[23]
In product terms, Anthropic continued to surface visible thinking on claude.ai (as a default for paying users) and through the API (with the summarization layer on Claude 4 models). The faithfulness research did not change the product surface, but it did change Anthropic's public framing: the visible chain is described as a tool for inspection and debugging rather than as a guaranteed window into the model's reasoning.[3][20]
The most immediate limitation of extended thinking is that it raises both per-call latency and per-call cost. Latency rises because the thinking phase always precedes the user-facing answer: in streaming mode the time to first text token equals the time to complete the thinking phase, and in non-streaming mode the entire response waits on it. Cost rises because thinking tokens are billed at the standard output rate. For workloads where the answer quality gain from extended thinking is small, the latency and cost overhead can outweigh the benefit.[3][24]
A related issue is that the per-call cost is harder to predict than for non-thinking calls. The actual number of thinking tokens consumed depends on the model's runtime decision, often falls below the requested budget_tokens, and varies even across repeated calls with the same prompt. Developers have reported substantial cost surprises when migrating production code from non-thinking to thinking mode without updating their token budgeting.[24]
As documented by Anthropic's own research, the visible chain of thought does not always reflect the model's actual reasoning process. This has two practical consequences. First, the chain is not a reliable basis for AI safety monitoring on its own: a system that flags only chains containing explicit problematic reasoning will miss cases where the model reaches the same problematic conclusion without verbalizing the reason. Second, the chain is not a reliable basis for explaining the model's behavior to end users: a reader who treats the chain as the explanation of the answer will sometimes infer a wrong story.[20][22]
The summarized thinking on Claude 4 models introduces a specific monitoring consideration. Because the summary is produced by a separate summarizer model that does not see the original prompt or the final answer, the summary can omit content that the full thinking trace contained. Anthropic disclosed in a 2025 post that this property has been observed in practice: full thinking traces can include reasoning about how to handle a jailbreak attempt, while the summary surfaces only the high-level topic and not the specific jailbreak content. The behavior is a feature for the user-protection use case but a complication for any system that uses the visible chain to evaluate the model's intent.[3][7]
Message-level prompt caches are invalidated whenever the budget_tokens parameter changes or thinking is toggled on or off. In agent loops that adapt the thinking budget per turn, this can produce unexpected cache misses and a noticeable cost increase relative to a non-thinking baseline. Anthropic's recommended workaround is to fix the thinking strategy upfront for a conversation rather than adapting it per turn, and to use the 1-hour cache duration when thinking is enabled.[3]
The interface that defined extended thinking (manual type: "enabled" with a fixed budget_tokens value) is being progressively removed. On Claude Opus 4.6 and Sonnet 4.6, the manual interface is deprecated but functional, with adaptive thinking as the recommended default. On Claude Opus 4.7, the manual interface returns a 400 error and developers must migrate to adaptive thinking. Code written against the original 3.7 Sonnet interface needs to be updated for 4.7-and-later models, with an effort parameter or the adaptive-thinking interface in place of the original budget_tokens field.[5][15]
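For code written against the original interface, the migration is a request-shape change. The cited pages do not show the adaptive syntax verbatim, so the following before/after sketch marks every assumed field:

# Before (3.7 Sonnet through Opus 4.5) -- returns 400 on Opus 4.7:
#   thinking={"type": "enabled", "budget_tokens": 10000}
#
# After (Opus 4.6 onward). The "adaptive" type value and the placement of the
# effort parameter are assumptions; only the parameter names appear in the text.
response = client.messages.create(
    model="claude-opus-4-7",          # illustrative model id
    max_tokens=16000,
    thinking={"type": "adaptive"},    # assumed adaptive syntax
    effort="high",                    # low / medium / high / xhigh per the text
    messages=[{"role": "user", "content": "Hard agentic task"}],
)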