Extended thinking is the product name Anthropic gives to the reasoning mode in its Claude family of large language models. With the feature on, the model produces a sequence of internal reasoning steps before its user-facing answer, with the size of that reasoning trace controlled by a developer-supplied token budget. Extended thinking was introduced on February 24, 2025 with Claude 3.7 Sonnet, Anthropic's first hybrid reasoning model, and has been carried forward on every Claude model since.[1][2]
The feature is exposed on the Anthropic Messages API through a thinking parameter, where developers set thinking: { type: "enabled", budget_tokens: N } to allocate up to N tokens for the model's intermediate reasoning. The reasoning content is returned as a separate thinking content block in the response, alongside the user-facing text block. Thinking tokens are billed at the standard output-token rate, so longer reasoning traces directly raise the per-call cost even though the per-token rate does not change.[3][4]
On claude.ai, extended thinking is exposed as a toggle in the chat UI. On the API, it is opt-in for Claude 3.7 Sonnet and the early Claude 4 family (Claude Opus 4, Claude Sonnet 4), and was progressively replaced by an adaptive mechanism in later releases. By Claude Opus 4.7 in 2026, manual extended thinking with type: "enabled" returns a 400 error, and developers are directed to the newer adaptive thinking interface, which lets the model decide at runtime whether and how deeply to reason based on a coarser effort parameter.[3][5]
Extended thinking sits in the same product category as OpenAI's reasoning_effort knob for o-series models, Google's thinking_budget for the Gemini 2.5 family, and DeepSeek's reasoner mode, but Anthropic's design has two distinguishing features. The first is the unified-model framing: one model identifier handles both fast and reasoning modes, with a runtime parameter switching between them, in contrast to OpenAI's split between GPT-4o and o1. The second is the pricing simplification: thinking tokens are charged at the same output rate as final-answer tokens, with no separate reasoning-mode premium, again in contrast to OpenAI's o-series, where reasoning tokens are billed at a higher per-token rate.[1][3][6]
For the Claude 4 generation, Anthropic also introduced a distinction between summarized thinking, which is the default user-visible form, and full thinking, which is the model's complete reasoning trace. The summarized form is produced by a separate summarization model and is what the API surfaces by default. Full thinking is reserved for specific contexts; Anthropic's documentation describes the restriction as a measure to reduce the risk of competitors distilling Claude's chain of thought from the public API. Anthropic later disclosed industrial-scale extraction attempts by other labs that informed this decision.[3][7][8]
| Field | Value |
|---|---|
| Feature type | Reasoning mode (test-time compute) |
| Introduced by | Anthropic |
| Introduced on | February 24, 2025 |
| First model | Claude 3.7 Sonnet |
| API parameter | thinking: { type: "enabled", budget_tokens: N } |
| Minimum budget | 1,024 tokens |
| Maximum budget | Up to model's max output (128K with beta header on 3.7 Sonnet) |
| Pricing | Thinking tokens billed as output tokens at standard rate |
| Default form (Claude 4 family) | Summarized thinking |
| Successor interface | Adaptive thinking (Claude Opus 4.6 onward) |
| Manual mode status | Removed on Claude Opus 4.7 (returns 400 error) |
| Key documentation | docs.anthropic.com/en/docs/build-with-claude/extended-thinking |
The second half of 2024 saw the public emergence of reasoning models that produced an internal chain of intermediate steps before answering. OpenAI's o1-preview, announced on September 12, 2024, was the first widely deployed example: a model separate from GPT-4o that spent additional inference compute on a hidden chain of thought before producing a final answer. The o1 line traded latency and cost for accuracy, especially on math, science, and competition-style problems, and it kept the chain of thought hidden from developers.[6][9]
Google followed with Gemini 2.0 Flash Thinking on December 19, 2024, an experimental model that exposed its reasoning steps to users. DeepSeek released DeepSeek-R1 on January 20, 2025 with an open-weights chain-of-thought design and a detailed training recipe that received heavy attention from researchers. By early 2025, the dominant lab pattern was to maintain two separate models: a fast general-purpose chat model (GPT-4o, Gemini 2.0 Flash, Claude 3.5 Sonnet) and a slower reasoning model.[10][11]
Anthropic had not shipped a public reasoning model in 2024. Its most recent major release was the upgraded Claude 3.5 Sonnet on October 22, 2024, which introduced computer use and improved coding scores but did not expose any visible chain of thought. Anthropic's CEO Dario Amodei described the reasoning-model split publicly in interviews around the o1 launch as a product choice he disagreed with: in his framing, asking the user to pick between a fast and a reasoning model was an interface failure that the field would converge away from.[1][2]
Claude 3.7 Sonnet was Anthropic's answer. The company decided to train a single model that could behave either way, and to expose the choice as a runtime parameter rather than a separate model ID. The technical justification, according to the launch announcement, was that the same underlying weights could produce both fast responses and extended chains of thought, and that having two modes inside one model would simplify product integration for customers building agents. Extended thinking was the resulting capability, and the thinking API parameter the resulting interface.[1]
Extended thinking is one product instance of the broader test-time compute idea: spending more inference compute per query, in the form of longer reasoning traces, broader search, or repeated sampling, to improve answer quality on hard problems. The idea predates 2024 by years, with chain-of-thought prompting (Wei et al., 2022) and self-consistency decoding (Wang et al., 2022) as foundational techniques, but the late-2024 reasoning-model generation was the first time it became a deployable product knob rather than a prompting trick. Anthropic's framing in the launch post explicitly cited "serial test-time compute" as the underlying mechanism.[2][12]
Extended thinking is enabled on the Anthropic Messages API by adding a thinking field to the request. The basic form is:
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 10000},
    messages=[{"role": "user", "content": "Your question here"}]
)
The two required fields inside the thinking object are type, set to "enabled" for manual extended thinking, and budget_tokens, the soft cap on internal reasoning tokens. The minimum budget is 1,024 tokens. The maximum varies by model and by whether a beta header is in use, but in all cases budget_tokens must be less than the request's max_tokens (because thinking and final answer share the output budget). Above 32,000 tokens of budget, Claude often does not consume the full allocation, especially on routine prompts.[3][4]
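As a concrete illustration of these constraints, the following sketch (the helper name is illustrative, not part of Anthropic's SDK) clamps a requested budget into the documented valid range before building the thinking object:

# Illustrative helper: build a thinking config that respects the documented
# constraints (minimum 1,024 tokens; strictly less than max_tokens).
MIN_THINKING_BUDGET = 1024

def thinking_config(budget_tokens: int, max_tokens: int) -> dict:
    budget = max(budget_tokens, MIN_THINKING_BUDGET)
    # Thinking and the final answer share the request's output budget,
    # so budget_tokens must stay below max_tokens.
    budget = min(budget, max_tokens - 1)
    return {"type": "enabled", "budget_tokens": budget}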
The response separates internal reasoning from the user-facing answer. A typical response shape looks like:
{
"content": [
{ "type": "thinking", "thinking": "Let me work this out...", "signature": "..." },
{ "type": "text", "text": "The answer is 42." }
]
}
The signature field on the thinking block is an opaque encrypted token that the developer must pass back unchanged in any follow-up turn. This is how Anthropic verifies the integrity of multi-turn conversations that include thinking, particularly when tool use is interleaved with reasoning.[3]
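In the Python SDK, separating the two block types is a short traversal of the content list; this sketch assumes the response object from the earlier example:

# Split the response into reasoning and answer blocks.
thinking_blocks = [b for b in response.content if b.type == "thinking"]
answer_text = "".join(b.text for b in response.content if b.type == "text")
signatures = [b.signature for b in thinking_blocks]  # pass back unchanged in follow-up turns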
Anthropic's documentation gives explicit recommendations on budget_tokens values. The minimum is 1,024 tokens, which the launch post described as the right starting point for routine prompts where reasoning gains are uncertain. Larger budgets help on harder problems but show diminishing returns above 32,000 tokens. The hard ceiling is the model's overall maximum output tokens, which on Claude 3.7 Sonnet was 128,000 tokens behind a beta header (output-128k-2025-02-19) and 8,192 tokens without it. On the Claude 4 family, max output rose to 64,000 tokens by default for most models.[3][4]
The table below summarizes Anthropic's recommended budget tiers from the launch documentation, drawn from the Claude 3.7 Sonnet announcement and the extended thinking documentation page.[1][3]
| Use case | Suggested budget_tokens |
|---|---|
| Default routine prompts (reasoning gains optional) | 1,024 (minimum) |
| Mid-difficulty coding edits and short proofs | 4,000 to 8,000 |
| Difficult multi-step coding (multi-file refactors) | 8,000 to 16,000 |
| Competition-level math (AIME, MATH-500) | 16,000 to 32,000 |
| Graduate-level science (GPQA Diamond) | 32,000 to 64,000 |
| Hard agentic tasks with parallel sampling | up to 64,000 (high-compute mode) |
Reasoning budgets above 32,000 tokens at the 3.7 Sonnet launch were available only via batch mode, since the standard streaming API had a lower per-response output cap. Anthropic later raised the streaming output cap to allow full 64,000-token thinking sessions in real time on the standard API.[3]
When the Messages API is used in streaming mode, thinking content arrives as a sequence of content_block_delta events with delta type thinking_delta. Final-answer content arrives in the same event stream with delta type text_delta. A typical Python streaming loop looks like:
# Assumes the client from the earlier example.
with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 10000},
    messages=[...]
) as stream:
    for event in stream:
        if event.type == "content_block_delta":
            # Reasoning and answer content arrive as distinct delta types.
            if event.delta.type == "thinking_delta":
                print(event.delta.thinking, end="", flush=True)
            elif event.delta.type == "text_delta":
                print(event.delta.text, end="", flush=True)
The two delta streams are interleaved when interleaved thinking is in use (see below) but are otherwise emitted in two distinct phases: first the full thinking trace, then the final answer. Time to first text token is therefore proportional to the thinking length, which is the main latency cost of extended thinking. Setting display: "omitted" on the thinking parameter, available on Claude Opus 4.7 and Claude Mythos Preview, suppresses the thinking_delta events entirely and emits only a signature_delta, restoring fast time to first text token at the cost of losing the visible chain.[3]
On claude.ai, extended thinking is exposed as a toggle in the chat composer. When the toggle is on, the model takes longer to respond and the front-end displays a collapsed "thinking" panel above the answer that the user can expand to read the reasoning trace. At launch in February 2025, free claude.ai users could run Claude 3.7 Sonnet only in standard mode; full extended thinking was reserved for Pro, Team, and Enterprise subscribers. With the Claude 4 launch in May 2025, free users gained full extended thinking on Sonnet 4 by default.[1][13]
Extended thinking has been available on every public Claude model since Claude 3.7 Sonnet. The interface evolved from a manually toggled feature on 3.7 Sonnet and the early Claude 4 models into an adaptive feature on the late Claude 4 generation, where the model itself decides whether and how deeply to reason. The table below summarizes the state across releases.
| Model | Released | Extended thinking | Notes |
|---|---|---|---|
| Claude 3.7 Sonnet | February 24, 2025 | Manual (type: "enabled") | First model with the feature; full thinking returned (no summarization) |
| Claude Opus 4 | May 22, 2025 | Manual + tool use | Adds extended thinking with tool use; summarized thinking by default |
| Claude Sonnet 4 | May 22, 2025 | Manual + tool use | Same controls as Opus 4 |
| Claude Opus 4.1 | August 5, 2025 | Manual + tool use | Coding-tuned snapshot of Opus 4 |
| Claude Sonnet 4.5 | September 29, 2025 | Manual; interleaved thinking on by default | Extended agent loops up to 30 hours |
| Claude Haiku 4.5 | October 15, 2025 | Manual | First Haiku tier model with extended thinking |
| Claude Opus 4.5 | November 24, 2025 | Manual + effort parameter | Introduces low / medium / high effort coarse control |
| Claude Opus 4.6 | February 5, 2026 | Adaptive (recommended); manual deprecated | Adaptive thinking introduced; manual still works |
| Claude Sonnet 4.6 | February 2026 | Adaptive (recommended); manual deprecated | Same migration as Opus 4.6 |
| Claude Opus 4.7 | April 2026 | Adaptive only | Manual type: "enabled" returns 400 error; adds xhigh effort tier |
The progression is one of gradual abstraction. On Claude 3.7 Sonnet, developers pick a specific token budget. On the early Claude 4 models, the same parameter applies but tool calls can be interleaved with thinking. On Claude Opus 4.5, an effort parameter sits alongside budget_tokens and allows a coarser low / medium / high choice. On Claude Opus 4.6, adaptive thinking lets the model itself decide whether and how much to reason. On Claude Opus 4.7, manual mode is removed and adaptive thinking is the only supported interface.[3][14][15]
The May 22, 2025 Claude 4 launch introduced the most consequential refinement of extended thinking: the ability to use tools during the thinking phase. With this capability on, Claude can call tools (such as web search, code execution, or any developer-defined function) inside the reasoning trace itself, then continue reasoning over the tool's output before producing the final answer. The Claude 4 announcement post described it as the model alternating between reasoning and tool use to improve responses.[14]
The practical effect is that long agentic loops, for example a research agent searching the web and reasoning over results, no longer have to simulate reasoning by accumulating tool-result text in successive plain message turns. The reasoning state lives inside the thinking blocks and persists across tool calls within the same assistant turn. Anthropic emphasized agentic coding (powered by Claude Code) and complex research as the primary motivating use cases.[14][16]
Interleaved thinking is the technical name for the pattern where a single assistant turn contains multiple thinking blocks separated by tool_use blocks. On Claude 4 models, this is enabled by adding the beta header interleaved-thinking-2025-05-14 to the API request. On Claude Sonnet 4.5, interleaved thinking is on by default. On Claude Opus 4.6, Sonnet 4.6, and Opus 4.7, interleaved thinking is part of the default adaptive thinking behavior and does not require an opt-in header.[3][15][17]
With interleaved thinking on, the budget_tokens value represents the total budget across all thinking blocks in one assistant turn rather than per block, which is why it can sometimes exceed max_tokens in this mode. The model is free to allocate the budget across as many thinking blocks as it needs, and developers must preserve all thinking blocks unmodified when feeding the conversation back for follow-up turns.[3]
Extended thinking with tool use is compatible only with tool_choice: {"type": "auto"} (the default) or tool_choice: {"type": "none"}. The forced-tool modes {"type": "any"} and {"type": "tool", "name": ...} are rejected when extended thinking is enabled. Anthropic's documentation justifies this on coherence grounds: the model decides during reasoning which tool to call (if any), and forcing a specific tool would cut against that decision process.[3]
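A minimal sketch of opting in on an early Claude 4 model, using the beta header named above (the model identifier and tool list here are placeholders, not confirmed values):

# Sketch: interleaved thinking on an early Claude 4 model via the beta header.
response = client.messages.create(
    model="claude-opus-4",                 # illustrative model id
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 10000},  # total across all thinking blocks
    tools=my_tools,                        # placeholder: developer-defined tool schemas
    tool_choice={"type": "auto"},          # forced-tool modes are rejected with thinking on
    extra_headers={"anthropic-beta": "interleaved-thinking-2025-05-14"},
    messages=[{"role": "user", "content": "Research task here"}],
)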
The single most common implementation mistake with extended thinking and tool use is failing to pass thinking blocks back unmodified in follow-up turns. Anthropic's documentation is explicit that the assistant turn returned to the API in a multi-turn conversation must include every thinking block exactly as the model produced it, with the original signature value, alongside the tool-use and text blocks. Stripping or modifying thinking blocks invalidates the signatures and breaks the conversation's continuity from Anthropic's perspective.[3]
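A minimal multi-turn sketch that satisfies this requirement passes the assistant message's content list back verbatim:

# Follow-up turn: the assistant content (thinking blocks, signatures, text)
# goes back exactly as received, with nothing stripped or edited.
history = [{"role": "user", "content": "Original question"}]
first = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 10000},
    messages=history,
)
history.append({"role": "assistant", "content": first.content})  # unmodified
history.append({"role": "user", "content": "Follow-up question"})
second = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 10000},
    messages=history,
)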
For the Claude 4 family, Anthropic introduced a distinction between the model's full thinking trace (what the model actually generated) and a summarized thinking trace (what the API returns by default). The summarization is performed by a separate model from the one handling the request, and the summarizer model never sees the original prompt or the model's final answer. The result is a shortened, sometimes paraphrased version of the reasoning that preserves the load-bearing logical steps but drops verbatim chain-of-thought tokens.[3][7]
Billing continues to count the full thinking tokens, not the shortened summary. A request that produces 12,000 thinking tokens internally and returns a 1,500-token summary will be billed for the 12,000 thinking tokens at the standard output rate. The summary token count and the billed output token count therefore do not match, which is a common point of confusion in developer reports.[3]
On Claude 3.7 Sonnet, by contrast, the thinking trace returned through the API was the full uncompressed trace. The summarization layer was specific to the Claude 4 generation and was justified in Anthropic's documentation as a way to reduce the risk of competitors distilling Claude's reasoning behavior from the public API.[3][7]
Anthropic's stated reason for the summarization layer was the risk of model distillation: another lab could call the Claude API in bulk, harvest the full reasoning traces, and use them as supervised training data for a competing reasoning model. By returning only summaries by default, Anthropic raises the cost of such an extraction strategy without breaking the legitimate use case of inspecting the reasoning for debugging or transparency.[7][8]
In November 2025, Anthropic published a post titled "Detecting and preventing distillation attacks" disclosing what it described as industrial-scale extraction campaigns by three other AI labs (named publicly as DeepSeek, Moonshot, and MiniMax) using approximately 24,000 fraudulent accounts. One of those campaigns by Moonshot was explicitly an attempt to extract and reconstruct Claude's reasoning traces. The disclosure was the company's most public articulation of why the summarization design exists.[8]
The thinking parameter accepts a display field that controls how the reasoning trace is surfaced in the response. The values and their semantics, drawn from the API documentation:
| Value | Behavior | Default on |
|---|---|---|
"summarized" | Returns a summarized reasoning trace produced by a separate summarizer model | Claude 4 family (Opus 4 through Sonnet 4.6) |
"omitted" | Returns empty thinking blocks with a signature for multi-turn continuity | Claude Opus 4.7 and Claude Mythos Preview |
(no display field) | Returns the full unsummarized thinking trace | Claude 3.7 Sonnet (only) |
The "omitted" mode is the latency-optimized choice: the model still reasons internally and the developer can use the signature for follow-up turns, but no thinking content travels over the wire. This was added on Claude Opus 4.7 specifically for production agents that do not surface thinking to end users.[3]
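The cited documentation does not spell out the exact request shape for this mode. A sketch, under the assumption that display sits inside the thinking object (per the parameter description above) on a model where adaptive thinking is the default, might look like:

# Hypothetical request shape: "display" inside the thinking object. The model
# id and the field placement are assumptions, not confirmed by the cited pages.
response = client.messages.create(
    model="claude-opus-4-7",          # illustrative model id
    max_tokens=16000,
    thinking={"display": "omitted"},  # model still reasons; only signatures returned
    messages=[{"role": "user", "content": "Production agent task"}],
)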
Thinking tokens are billed as output tokens at the same per-token rate as final-answer tokens. There is no separate reasoning-token rate on Claude. This was a deliberate choice by Anthropic at the Claude 3.7 Sonnet launch and has held across every subsequent release. The per-token rate for each Claude model that supports extended thinking is shown below.
| Model | Input tokens | Output tokens (incl. thinking) |
|---|---|---|
| Claude 3.7 Sonnet | $3 / M | $15 / M |
| Claude Sonnet 4 | $3 / M | $15 / M |
| Claude Opus 4 | $15 / M | $75 / M |
| Claude Sonnet 4.5 | $3 / M | $15 / M |
| Claude Haiku 4.5 | $1 / M | $5 / M |
| Claude Opus 4.5 | $5 / M | $25 / M |
| Claude Opus 4.6 | $5 / M | $25 / M |
| Claude Sonnet 4.6 | $3 / M | $15 / M |
| Claude Opus 4.7 | $5 / M | $25 / M |
The practical implication is that extended thinking does not cost more on a per-token basis but does usually cost more per call. A request to Claude 3.7 Sonnet that runs for 12,000 thinking tokens plus a 500-token answer is billed for 12,500 output tokens, whereas the same prompt without extended thinking might produce a 500-token answer for a billed total of 500 output tokens. Anthropic's argument, repeated in interviews around the launch, was that the simplicity of one rate is itself a feature: developers can reason about cost without tracking which tokens are thinking and which are final.[1][3]
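The arithmetic is simple enough to sketch directly, using the Claude 3.7 Sonnet output rate from the table:

# Per-call output cost at Claude 3.7 Sonnet's $15 per million output tokens.
OUTPUT_RATE = 15.00 / 1_000_000  # dollars per output token

def call_cost(thinking_tokens: int, answer_tokens: int) -> float:
    # Thinking and answer tokens bill at the same output rate.
    return (thinking_tokens + answer_tokens) * OUTPUT_RATE

print(call_cost(12_000, 500))  # 0.1875 -- the example from the paragraph above
print(call_cost(0, 500))       # 0.0075 -- the same answer without extended thinking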
Extended thinking interacts with prompt caching in specific ways. System-prompt caches are preserved across requests when only the thinking parameter changes. Message-level caches, however, are invalidated when budget_tokens changes or when extended thinking is toggled on or off. Thinking blocks read from cache are billed as input tokens at the cache-read rate, not as output tokens. On Claude Opus 4.5 and Sonnet 4.6 and later, thinking blocks are kept in the cached prefix by default; on earlier models they are stripped when non-tool-result user blocks are included, which is a common cause of cache misses for developers porting code between models.[3]
For agent loops that make many requests in succession with the same system prompt and tool definitions, Anthropic recommends the 1-hour cache duration in conjunction with extended thinking, since the longer caching window amortizes the cache-write overhead across more reasoning calls.[3]
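A sketch of that recommendation, caching the shared system prompt across the loop's calls (the cache_control shape is standard prompt caching; the ttl value for the 1-hour duration is an assumption rather than a quote from the cited page):

# Sketch: reuse a cached system prompt across many thinking-enabled calls.
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 10000},  # keep fixed across the loop
    system=[{
        "type": "text",
        "text": AGENT_SYSTEM_PROMPT,  # placeholder for the long, shared prompt
        "cache_control": {"type": "ephemeral", "ttl": "1h"},  # ttl field assumed
    }],
    messages=conversation,  # placeholder for the running message history
)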
The launch documentation and subsequent best-practices posts identify a consistent set of workloads where extended thinking produces clear gains. Math, particularly competition-level problems like AIME and MATH-500, was the headline benchmark category at the 3.7 Sonnet launch and remains the strongest single signal for when to enable extended thinking. Graduate-level science questions in the GPQA Diamond style, multi-step planning tasks, complex coding refactors that span multiple files, and agentic tasks that require the model to interleave reasoning with tool use all show meaningful gains.[1][3]
In agentic settings, extended thinking is the default recommendation for anything that requires the model to plan a sequence of steps before acting, evaluate intermediate tool results before continuing, or self-correct on partial failures. The Claude Code terminal coding agent, which Anthropic launched alongside Claude 3.7 Sonnet, uses extended thinking as a routine part of its long-running coding sessions. Anthropic later cited Claude Code's reliance on extended thinking as evidence that the feature was operating in production at meaningful scale.[1][16]
Latency-sensitive applications are the most common contraindication. Customer-support chat, real-time conversational interfaces, and any user-facing flow where time to first token matters more than incremental answer quality should generally run without extended thinking, or with display: "omitted" on Claude Opus 4.7 to suppress the visible thinking phase. Routine knowledge queries and short factual lookups, where the model has the answer immediately, also see no benefit from extended thinking and only pay the latency and token cost.[3]
A more subtle case is conversational tasks where the right answer depends on user follow-up rather than deep reasoning. Anthropic's documentation cautions against toggling extended thinking on and off mid-conversation in tool-use loops, since the cache invalidation and signature-handling overhead outweighs any gain from selectively enabling reasoning. The recommendation is to plan the thinking strategy upfront for the conversation rather than adapting it per turn.[3]
Anthropic's most direct budget-tuning guidance is to start at the minimum (1,024 tokens), measure the impact on the workload's evaluation set, and increase the budget only if the gain is large enough to justify the latency and token cost. For workloads that are already saturated by smaller budgets (most coding and conversational tasks), there is no benefit from raising the budget above 8,000 tokens. For workloads that benefit from very long reasoning (competition mathematics, hard scientific questions, multi-document research synthesis), gains continue up to 32,000 tokens but rarely persist beyond that.[1][3]
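In practice this guidance amounts to a small sweep over candidate budgets. In the sketch below, run_eval is a hypothetical function standing in for whatever scoring the workload's evaluation set provides:

# Budget-tuning sweep per Anthropic's guidance: start at the minimum and stop
# raising the budget once the gain no longer justifies the latency and cost.
for budget in (1024, 4000, 8000, 16000, 32000):
    score = run_eval(budget_tokens=budget)  # hypothetical evaluation harness
    print(f"budget={budget:>6}  score={score:.3f}")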
The Claude Opus 4.5 effort parameter (low, medium, high) is the recommended interface for developers who do not want to manage budget_tokens directly. The Claude Opus 4.7 release introduced an additional xhigh tier that sits between high and the maximum budget, intended for the hardest agentic workloads. Adaptive thinking, the default on Opus 4.6 and later, sidesteps the question entirely by letting the model evaluate per-request whether and how much to reason.[5][15]
Reasoning-mode interfaces have converged across labs over 2025 and 2026, but each provider made distinct design choices. The table below summarizes the state of the art across the five most commonly compared providers as of mid-2026.
| Provider | Feature name | API knob | Granularity | Visible chain | Default state |
|---|---|---|---|---|---|
| Anthropic | Extended thinking / adaptive thinking | thinking: { type, budget_tokens } or effort | Token budget (1,024 to 64,000+) or low / medium / high / xhigh / max | Yes (summarized by default on Claude 4; full on 3.7) | Off (3.7, early 4) / Adaptive (late 4) |
| OpenAI | Reasoning effort | reasoning_effort | minimal / low / medium / high (and gpt-5 adds auto) | No (hidden chain on o1; partial summaries on o3 / GPT-5) | Medium (default) |
| Thinking | thinking_budget (Gemini 2.5 Flash) | Token budget (0 to 24,576 on Flash; on by default on Pro) | Partial (summarized) | On for Gemini 2.5 Pro; configurable for Flash | |
| DeepSeek | Reasoner / Thinking mode | model: "deepseek-reasoner" | None (always on when reasoner model is selected) | Yes (full chain in reasoning_content) | Always on |
| xAI | Think mode (Grok) | reasoning_effort (low / high) | Coarse | Partial | Off |
The most direct counterpart to Anthropic's budget_tokens is Google's thinking_budget for Gemini 2.5 Flash, which uses a similar token-count interface and a similar 0-or-N convention. The most direct counterpart to Anthropic's effort parameter is OpenAI's reasoning_effort, which uses a low / medium / high enum and was first shipped with o1 and o3. DeepSeek's R1 takes a different approach: reasoning is selected by picking the reasoner model identifier rather than by a parameter on a hybrid model.[6][18][19]
On visible-chain transparency, the providers split. DeepSeek and Anthropic (on 3.7 Sonnet) return the full chain. Google returns a summarized version on Gemini 2.5. OpenAI hides the chain entirely on o1 and returns short summaries on o3 and GPT-5. Anthropic's choice to summarize on Claude 4 brings it closer to the Google and OpenAI position, with full chain access reserved for specific contexts.[3][6][7]
In April 2025, Anthropic's Alignment Science team published the paper "Reasoning Models Don't Always Say What They Think," examining whether the visible chain of thought in reasoning models faithfully represents the model's internal computation. The paper, posted on arXiv in May 2025 and authored by Yanda Chen, Joe Benton, and colleagues, tested four models: two reasoning models (Claude 3.7 Sonnet and DeepSeek R1) and two non-reasoning baselines (Claude 3.5 Sonnet and DeepSeek V3).[20][21]
The methodology was a controlled prompt-pairing test. The researchers gave each model a question first in plain form and then with a subtly embedded hint (sycophancy cue, consistency hint, visual pattern, metadata cue, grader hack, or unethical-information leak). When the hint changed the model's answer, the researchers checked whether the visible chain of thought explicitly mentioned the hint. The fraction of cases where it did is the measured "faithfulness rate."[20]
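Stated as a ratio, with notation introduced here for exposition rather than taken from the paper:

$$\text{faithfulness rate} = \frac{\left|\{\text{hint-flipped answers whose visible CoT mentions the hint}\}\right|}{\left|\{\text{hint-flipped answers}\}\right|}$$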
The headline numbers were sobering. Across all hint types, Claude 3.7 Sonnet mentioned the hint in its visible chain only 25% of the time on average, and DeepSeek R1 only 39% of the time. On the more concerning hint types (grader hacking, unethical information use), Claude was faithful 41% of the time and R1 was faithful 19% of the time. A substantial majority of answers were unfaithful: the chain of thought omitted the actual reason the model had changed its answer.[20][22]
The paper concluded that chain-of-thought monitoring is a useful but not sufficient tool for AI safety. If the visible chain of thought does not always reflect the model's true decision process, then automated systems that monitor the chain for misaligned behavior cannot rule out misalignment, only detect the cases where the model verbalizes it. The researchers framed CoT monitoring as a complementary signal to other interpretability techniques rather than as a standalone safety guarantee.[20][22]
The finding has direct implications for extended thinking specifically. The visible thinking trace on Claude 3.7 Sonnet is the textual artifact closest to the model's reasoning, and Anthropic's launch positioning emphasized its value for understanding and debugging the model's behavior. The faithfulness paper qualifies that pitch: the chain is useful evidence about the model's reasoning, but not a complete or always accurate record.[2][20]
Anthropic followed the April 2025 paper with related research on chain-of-thought interpretability, including "On the Biology of a Large Language Model" (Transformer Circuits, 2025) and the related attribution-graph work, which used mechanistic interpretability techniques to compare what the model says it is doing in its chain of thought against what its internal activations suggest it is actually doing. The attribution-graph results corroborated the faithfulness paper's central finding: the chain of thought is a useful but partial window into the model's behavior.[23]
In product terms, Anthropic continued to surface visible thinking on claude.ai (as a default for paying users) and through the API (with the summarization layer on Claude 4 models). The faithfulness research did not change the product surface, but it did change Anthropic's public framing: the visible chain is described as a tool for inspection and debugging rather than as a guaranteed window into the model's reasoning.[3][20]
The most immediate limitation of extended thinking is that it raises both per-call latency and per-call cost. Latency rises because the thinking phase always precedes the user-facing answer: in streaming mode the time to first text token equals the time to complete the thinking phase, and in non-streaming mode the entire response waits on it. Cost rises because thinking tokens are billed at the standard output rate. For workloads where the answer quality gain from extended thinking is small, the latency and cost overhead can outweigh the benefit.[3][24]
A related issue is that the per-call cost is harder to predict than for non-thinking calls. The actual number of thinking tokens consumed depends on the model's runtime decision, often falls below the requested budget_tokens, and varies even across repeated calls with the same prompt. Developers have reported substantial cost surprises when migrating production code from non-thinking to thinking mode without updating their token budgeting.[24]
As documented by Anthropic's own research, the visible chain of thought does not always reflect the model's actual reasoning process. This has two practical consequences. First, the chain is not a reliable basis for AI safety monitoring on its own: a system that flags only chains containing explicit problematic reasoning will miss cases where the model reaches the same problematic conclusion without verbalizing the reason. Second, the chain is not a reliable basis for explaining the model's behavior to end users: a reader who treats the chain as the explanation of the answer will sometimes infer a wrong story.[20][22]
The summarized thinking on Claude 4 models introduces a specific monitoring consideration. Because the summary is produced by a separate summarizer model that does not see the original prompt or the final answer, the summary can omit content that the full thinking trace contained. Anthropic disclosed in a 2025 post that this property has been observed in practice: full thinking traces can include reasoning about how to handle a jailbreak attempt, while the summary surfaces only the high-level topic and not the specific jailbreak content. The behavior is a feature for the user-protection use case but a complication for any system that uses the visible chain to evaluate the model's intent.[3][7]
Message-level prompt caches are invalidated whenever the budget_tokens parameter changes or thinking is toggled on or off. In agent loops that adapt the thinking budget per turn, this can produce unexpected cache misses and a noticeable cost increase relative to a non-thinking baseline. Anthropic's recommended workaround is to fix the thinking strategy upfront for a conversation rather than adapting it per turn, and to use the 1-hour cache duration when thinking is enabled.[3]
The interface that defined extended thinking (manual type: "enabled" with a fixed budget_tokens value) is being progressively removed. On Claude Opus 4.6 and Sonnet 4.6, the manual interface is deprecated but functional, with adaptive thinking as the recommended default. On Claude Opus 4.7, the manual interface returns a 400 error and developers must migrate to adaptive thinking. Code written against the original 3.7 Sonnet interface needs to be updated for 4.7-and-later models, with an effort parameter or the adaptive-thinking interface in place of the original budget_tokens field.[5][15]
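For code written against the original interface, the migration is a request-shape change. The cited pages do not show the adaptive syntax verbatim, so the following before/after sketch marks every assumed field:

# Before (3.7 Sonnet through Opus 4.5) -- returns 400 on Opus 4.7:
#   thinking={"type": "enabled", "budget_tokens": 10000}
#
# After (Opus 4.6 onward). The "adaptive" type value and the placement of the
# effort parameter are assumptions; only the parameter names appear in the text.
response = client.messages.create(
    model="claude-opus-4-7",          # illustrative model id
    max_tokens=16000,
    thinking={"type": "adaptive"},    # assumed adaptive syntax
    effort="high",                    # low / medium / high / xhigh per the text
    messages=[{"role": "user", "content": "Hard agentic task"}],
)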