Interleaved thinking
Last reviewed
May 24, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 3,717 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 24, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 3,717 words
Add missing citations, update stale details, or suggest a clearer explanation.
Interleaved thinking is a feature of the Anthropic Messages API that lets Claude produce extended reasoning blocks between tool calls inside a single multi-turn agentic loop, rather than only at the start of an assistant turn.[1] It was launched in public beta on 2025-05-22 alongside Claude Opus 4 and Claude Sonnet 4 through the request header anthropic-beta: interleaved-thinking-2025-05-14, which allowed the model to insert new thinking blocks after each tool_result rather than committing to a plan up front.[2][3] On the Claude Opus 4.6, Claude Sonnet 4.6, Claude Haiku 4.5, and Claude Opus 4.7 generations of the Anthropic API, interleaved thinking is automatically enabled whenever adaptive thinking is active, and the beta header is deprecated or ignored on those models.[1][4] Anthropic attributes parts of its gains on agentic benchmarks such as SWE-bench Verified, OSWorld, and τ-bench to the pattern of thinking between actions, which lets Claude analyse an intermediate result before deciding on the next tool use step.[5][6][7] Interleaved thinking is widely used by Anthropic's own agent products, including Claude Code and Anthropic Computer Use, and by third-party orchestrators such as AWS Strands Agents.[8][9]
Anthropic introduced extended thinking on Claude Sonnet 3.7 on 2025-02-24 as a "hybrid reasoning" mode in which a developer could set a thinking budget of up to roughly 128,000 tokens of internal chain-of-thought before the model produced its final answer.[10] In that initial release, thinking happened only once per assistant turn, at the start, before any tool_use blocks were emitted. After a tool returned a result, the model continued the same turn without producing additional thinking content, so any reflection on tool output had to fit into the original plan or wait for the next user message.[1]
Anthropic's tool use guide and several engineering blog posts described this pattern as a limitation for long-horizon agents: agents that need to react to surprising outputs (for example, a search query returning empty results, a calculator producing an out-of-range answer, or a UI screenshot showing a different state than expected) had to either pre-plan every contingency or restart reasoning in a new turn, which fragmented context and increased latency.[7][11] In March 2025 the company released the "think" tool, a lightweight pattern that gave the model a scratchpad it could call as a tool between actions. On τ-bench's airline domain the optimized "think" prompt lifted pass-at-1 from 0.370 to 0.570, a 54% relative improvement, and on the retail domain from 0.783 to 0.812; SWE-bench Verified saw a smaller 1.6% statistically significant lift.[11] Those numbers established that giving the model an explicit slot to reason about tool output, rather than racing straight to the next action, mattered for accuracy on multi-step tasks.
The interleaved-thinking feature was the next step. Instead of a separate "think" tool the developer had to define, the Messages API itself was extended so that the model could emit additional thinking content blocks between any two tool_use blocks within one assistant turn, without leaving the agentic loop.[1][2]
The public beta of interleaved thinking was announced in the Anthropic API release notes for 2025-05-22, the same day Claude Opus 4 and Claude Sonnet 4 launched. The release-notes entry reads: "We've launched interleaved thinking in public beta, a feature that enables Claude to think in between tool calls. To enable interleaved thinking, use the beta header interleaved-thinking-2025-05-14."[2]
The launch was tied to Anthropic's broader push toward agentic workloads. The Claude 4 announcement positioned both new models as systems with "extended thinking with tool use (beta)" that could "use tools, like web search, during extended thinking, allowing Claude to alternate between reasoning and tool use to improve responses." Opus 4 was reported at 72.5% on SWE-bench Verified and Sonnet 4 at 72.7%, with Opus 4 capable of "sustained performance on long-running tasks that require focused effort and thousands of steps."[3] Anthropic also said Opus 4 was 65% less likely than Claude Sonnet 3.7 to "engage in shortcut" behaviour on agentic tasks.[3]
In the same May 22 release the company shipped several adjacent features that are typically used together with interleaved thinking: the Files API for uploading documents to the Messages API, a code-execution tool for sandboxed Python, an MCP connector for connecting remote Model Context Protocol servers to the API, and a change in default behaviour for extended thinking so that the API returns a summarised thinking output with the full thinking encrypted in a signature field.[2] The signature is opaque and must be passed back unchanged on subsequent turns when interleaved thinking is active so that the API can validate the previous reasoning chain.[1]
The string 2025-05-14 in the beta-header name refers to an internal version identifier rather than the public-facing launch date. The header surfaced in Anthropic's beta-headers list and SDK documentation in May 2025 and remained the sole way to enable interleaved thinking on Claude 4-family models such as Claude Opus 4, Claude Opus 4.1, Claude Opus 4.5, Claude Sonnet 4, and Claude Sonnet 4.5.[1]
In the Messages API, an assistant response is a sequence of content blocks. With extended thinking but no interleaved thinking, a typical agentic turn looks like:
thinking block at the start of the turn (the model's plan),tool_use block requesting a tool call,tool_result returned by the caller in the next message,tool_use block in a new assistant message, with no fresh thinking block,text blocks once the model has all the data it needs.[1]With the interleaved-thinking-2025-05-14 header enabled (or with adaptive thinking on the 4.6 and 4.7 models, which turns the behaviour on by default), the same turn can include additional thinking blocks after each tool_result, so the structure becomes a think → act → think → act → … → text loop within a single assistant span.[1] The Anthropic docs give a worked example: a user asks for total revenue from 150 units at $50 each and how that compares to monthly average. Without interleaved thinking the model issues two tool calls (a calculator, then a SQL query) and then writes the answer. With interleaved thinking, the model produces a new thinking block between the calculator result ($7,500) and the database query, where it explicitly notes the intermediate value and decides which database column to read, and another thinking block after the database returns the average ($5,200) where it computes the percentage difference before composing the final text.[1]
Three implementation details set interleaved thinking apart from plain extended thinking:
budget_tokens parameter applies to the total thinking spent in one assistant turn across all interleaved blocks, and that total budget can legitimately exceed the max_tokens cap. Anthropic flags this explicitly in the extended-thinking documentation because it inverts an assumption from the non-interleaved case.[1]thinking block that the model produced has to be passed back to the API unchanged on the next turn, including its opaque signature. The API uses the signature to verify continuity of reasoning; modifying or stripping it breaks the chain.[1]The model decides when to insert an interleaved thinking block; the developer does not request one explicitly. In adaptive mode on Opus 4.7, in particular, "every inter-tool reasoning step moves into a thinking block instead of plain text," so the model never explains its tool-selection reasoning in the visible text channel.[1]
On 2026-02-05 Anthropic launched Claude Opus 4.6 and introduced adaptive thinking (thinking: {type: "adaptive"}), a mode in which the model itself decides whether and how much to use extended thinking based on the complexity of each request. The accompanying documentation states: "Adaptive thinking also automatically enables interleaved thinking. This means Claude can think between tool calls, making it especially effective for agentic workflows."[4][12]
The same release deprecated the manual thinking: {type: "enabled", budget_tokens: N} configuration on the new models. On Claude Sonnet 4.6 (launched 2026-02-17) the deprecation is soft: manual mode still works, and the interleaved-thinking-2025-05-14 beta header is still functional but deprecated.[4][13] On Claude Opus 4.6 the manual header is deprecated and "safely ignored," and on Claude Opus 4.7 (launched 2026-04-16) adaptive thinking is the only supported thinking mode, with manual thinking.type: "enabled" rejected by the API.[4][14]
Anthropic's documentation summarises interleaved-thinking availability across modes as follows: in adaptive mode it is automatic on Claude Mythos Preview, Claude Opus 4.7, Claude Opus 4.6, and Claude Sonnet 4.6; in manual mode on Sonnet 4.6 it works via the interleaved-thinking-2025-05-14 beta header; in manual mode on Opus 4.6 it is not available at all, so developers who need thinking between tool calls on Opus 4.6 are directed to use adaptive mode.[4]
The migration from the beta header to automatic behaviour is a quiet but consequential API change. Code written for Claude Sonnet 4 that explicitly set interleaved-thinking-2025-05-14 continues to run on Sonnet 4.6 but produces no warning that the header is now a no-op when adaptive thinking is on. Anthropic's migration guide encourages developers to switch to thinking.type: "adaptive" and drop the beta header on supported models.[14]
The pattern most often compared with interleaved thinking is the ReAct loop, in which an agent alternates short reasoning traces with actions in plain text. ReAct-style agents typically express reasoning as visible model output ("Thought: ... Action: ..."), often outside any built-in thinking channel, and rely on prompt engineering to keep the reasoning useful. Interleaved thinking instead puts the inter-action reasoning into structured thinking blocks at the API level, with summarised text returned to the developer and the full chain encrypted in the signature for continuity.[1][9]
A second comparison is with reasoning models from other labs. OpenAI o1 and the subsequent o3, o4-mini, and GPT-5 families also produce hidden reasoning tokens before a final answer. The OpenAI Responses API supports preserving reasoning items adjacent to tool calls so that subsequent calls can keep using them; OpenAI documents this as a separate capability rather than an always-on behaviour for every model. For o1 and o1-mini, "reasoning items were always ignored in follow-up API requests"; for o3 and o4-mini, "some reasoning items adjacent to function calls are included in the model's context to help improve model performance while using the least amount of reasoning tokens."[15] Anthropic's contribution with interleaved thinking is that the inter-tool reasoning chain is a first-class object in the Messages API request body, with explicit handling (the signature field) for replaying the chain across multiple agentic turns without re-deriving it.[1]
A third comparison is with Anthropic's own "think" tool, released in March 2025. The "think" tool is a developer-defined tool whose only job is to give the model a place to write reflection before the next real tool call; it competes with other tools for slots and requires careful prompting. Interleaved thinking removes the indirection: the same reflective slot is available natively, and the model does not have to spend a tool slot on it. Anthropic still recommends the "think" tool for domains, such as customer-service policy enforcement on τ-bench, where the prompt engineering around when to think is itself part of the contract.[11]
Claude Code, Anthropic's terminal-based coding agent, is one of the heaviest users of interleaved thinking. Anthropic's blog and documentation describe Claude Code's loop as one in which the model reads files, runs shells, edits code, and reruns tests, with reasoning blocks between each step that re-evaluate the latest test output or compiler error before deciding what to edit next.[16] On the Opus 4.6 system card and accompanying announcement, Anthropic reports an SWE-bench Verified score of 80.84%, with the qualifier that "SWE-bench results are averaged over 25 trials, each run with adaptive thinking, max effort, default sampling settings (temperature, top_p), and with the thinking tool"; the company's documentation links the headline number to the fact that adaptive thinking automatically enables interleaved thinking.[5][17] Opus 4.7 lifted the same metric to 87.6%.[14]
The original Claude 4 launch reported Opus 4 at 72.5% and Sonnet 4 at 72.7% on SWE-bench Verified, both at a point when the interleaved-thinking-2025-05-14 header was the path to thinking between actions.[3] The Anthropic engineering team's accompanying writeups described coding agents as "the canonical workload" for interleaved thinking: they have many discrete tool calls (run tests, read errors, search the codebase, write a patch), each tool result can reveal that the original plan was wrong, and the cost of acting on a bad plan is concrete and measurable in failed tests.[16]
Anthropic Computer Use (often called Claude Computer Use) is a second large interleaved-thinking workload. Anthropic's best-practices guide for computer and browser use says that thinking helps most "when the model needs to plan a multi-step sequence before starting, recover from an unexpected UI state, cross-reference information between what's on screen and the task instructions, or complete challenging projects on professional software," all situations in which the next action depends on inspecting the screenshot returned by the previous one.[7] On OSWorld, Anthropic reports a 16-month progression from 14.9% (Sonnet 3.5) to 28.0% (Sonnet 3.5 v2), 42.2% (Sonnet 3.6), 61.4% (Sonnet 4.5), and 72.5% (Sonnet 4.6), with human baseline performance around 72%.[6][13] Opus 4.6 records 72.7% on the same benchmark.[6] Anthropic ties these gains to the combination of computer-use tooling and the model's ability to think between screenshot tool results.[6][7]
Anthropic's engineering blog post on its multi-agent research system, which underlies the Claude Research feature, gives interleaved thinking a specific role: "Each Subagent independently performs web searches, evaluates tool results using interleaved thinking, and returns findings to the LeadResearcher." The post explains the design rationale: "We instructed them to use interleaved thinking after tool results to evaluate quality, identify gaps, and refine their next query. This makes subagents more effective in adapting to any task."[8] The post reports that on the overall research system, the multi-agent design (Opus 4 as lead with Sonnet 4 subagents) "outperformed single-agent Claude Opus 4 by 90.2% on our internal research eval," though that figure covers the whole architecture rather than interleaved thinking in isolation.[8]
τ-bench, an Anthropic-cited benchmark for tool-augmented customer-service agents, is the original empirical case for thinking between tool calls. On the τ-bench airline domain, the "think" tool combined with optimized prompting reached a pass@1 of 0.570 versus the 0.370 baseline, a 54% relative improvement; on the retail domain, 0.812 versus 0.783 baseline. Anthropic positions interleaved thinking as the native API analogue of that earlier "think" tool: the same pattern of pausing to reflect on tool output, with less prompt-engineering overhead.[11]
The pattern was quickly picked up outside Anthropic. AWS published a tutorial showing how to use the interleaved-thinking-2025-05-14 beta header with the Strands Agents framework on Amazon Bedrock. In the AWS walkthrough, Claude 4 with thinking: {"type": "enabled", "budget_tokens": 8000} and the beta header determines which of four cities is closest to the International Space Station. The model fetches the ISS position via an HTTP tool, runs a Python REPL to compute distances, detects a longitude error mid-stream, inserts a new thinking block to correct it without re-issuing the whole event-loop iteration, and identifies New York at 11,680.55 km with bearing 110.1 degrees ESE and elevation 2.1 degrees.[9] AWS describes the behaviour as: "rather than following the rigid ReAct pattern of sequential loops, interleaved thinking enables reasoning and action to occur within a single block, making the process more fluid."[9]
Interleaved thinking interacts with several other parts of the Anthropic API in ways the documentation calls out.
Token budgeting. Because the budget_tokens total covers all interleaved blocks in a single assistant turn, developers cannot use it as a per-block cap. The recommended pattern with adaptive thinking on Opus 4.6, Sonnet 4.6, and Opus 4.7 is to drop budget_tokens entirely and use the effort parameter (low, medium, high, max, plus xhigh on Opus 4.7) as a soft signal.[4][14]
Prompt caching. The Anthropic documentation notes that "consecutive requests using adaptive thinking preserve [prompt cache] breakpoints. However, switching between adaptive and enabled/disabled thinking modes breaks cache breakpoints for messages. System prompts and tool definitions remain cached regardless of mode changes."[4] Earlier writeups also warn that interleaved thinking "amplifies cache invalidation" because new thinking blocks can land between cached tool calls.[1]
Thinking display. With Opus 4.7 the default display value moved from "summarized" to "omitted", so the thinking field of each block is empty unless the developer sets display: "summarized" explicitly. The opaque signature is unchanged; only the visible thinking text is suppressed.[4][14]
Platform availability. The header is accepted unconditionally on the first-party Claude API and on Claude Platform on AWS. On Amazon Bedrock and Vertex AI it is only accepted on the specific Claude 4 model versions listed in the partner-platform documentation, and partner platforms can return a unexpected value(s) interleaved-thinking-2025-05-14 for the anthropic-beta header error on unsupported model IDs.[1][9]
Validation edge cases. Vercel's AI SDK and several other client libraries reported issues where, with the interleaved-thinking-2025-05-14 beta header set, the API would reject sequences in which tool_use ids appeared without a tool_result block immediately after, because the SDK was inserting thinking content between them. The accepted fix is to ensure each tool_use is followed by its tool_result in the conversation history before any further thinking content.[18]
| Mode | Configuration | Models | Inter-tool thinking? |
|---|---|---|---|
| Disabled | omit thinking or {type: "disabled"} | All (except Mythos Preview) | No |
| Extended (manual) | {type: "enabled", budget_tokens: N} | All thinking-capable models; rejected on Opus 4.7 | Only with interleaved-thinking-2025-05-14 header on Claude 4 family; not available on Opus 4.6 in manual mode |
| Extended with interleaved (beta) | manual mode + interleaved-thinking-2025-05-14 header | Opus 4, Opus 4.1, Opus 4.5, Sonnet 4, Sonnet 4.5, Sonnet 4.6 | Yes |
| Adaptive | {type: "adaptive"} | Opus 4.6, Opus 4.7, Sonnet 4.6, Mythos Preview | Yes, automatic |
The data in this table follow Anthropic's adaptive-thinking and extended-thinking reference pages.[1][4]
The Anthropic documentation calls out three practical limits on interleaved thinking. First, it is only available for tool use through the Messages API; it does not apply to the Completions API or to text-only workloads.[1] Second, the developer must pass thinking blocks back unchanged on each subsequent turn, including the encrypted signature, which complicates conversation editing or summarisation pipelines that rewrite assistant turns.[1] Third, on partner platforms the beta header is gated by specific model IDs rather than by capability, so portable agents have to detect the runtime platform before sending the header.[9]
A subtler issue is observability. With summarised thinking and an opaque signature, developers see neither the full reasoning chain nor a stable hash of it that they can log for audit; the visible thinking field is a summary produced by a different model than the one the developer targeted. Anthropic notes that "you're charged for the full thinking tokens generated by the original request, not the summary tokens," and that "the billed output token count will not match the count of tokens you see in the response."[1] This makes per-turn cost reconciliation harder than with non-interleaved extended thinking, where token counts and visible content line up.
A third limitation is benchmark attribution. Although Anthropic and outside observers credit interleaved thinking with parts of the gains on SWE-bench Verified, OSWorld, and τ-bench, the published numbers in Anthropic's announcements typically come from runs that use adaptive thinking with max effort and the "thinking" tool together, so isolating the marginal contribution of interleaving from the marginal contribution of more thinking tokens, better tool choice, or model-architecture changes is difficult from public data.[5][11][17]
Finally, the model decides when to interleave. There is no API surface for the developer to force a thinking block at a specific point in the turn or to forbid one. On Opus 4.7, in particular, "inter-tool reasoning always lives inside thinking blocks," so applications that previously relied on visible plain-text reasoning between tool calls have to migrate either to summarised thinking output or to omitted thinking with a separate explainer step.[4]