Task budgets
Last reviewed
May 24, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 3,828 words
Task budgets are an Anthropic API feature introduced in public beta with Claude Opus 4.7 on 16 April 2026 that let a developer tell the model how many tokens it has for a full agentic loop, including thinking, tool calls, tool results, and final output.[1][2] A task budget is set on the Messages API by adding a task_budget object to output_config and sending the task-budgets-2026-03-13 beta header; the minimum accepted budget is 20,000 tokens.[1] During generation Claude sees a server-injected running countdown that decrements as it produces thinking and outputs and as it processes tool results, and it uses that signal to pace itself, scope work, and finish gracefully as the budget is consumed.[1] The feature is advisory rather than enforced: it is distinct from max_tokens, which is a hard per-request ceiling the model is not aware of, and from the effort parameter, which controls per-step reasoning depth qualitatively.[1][2]
Before Opus 4.7, Anthropic offered two main controls over how much compute a Claude request would burn: max_tokens, a hard upper bound on tokens generated per request, and (on extended-thinking models) budget_tokens, which capped how many tokens Claude could spend on a single thinking block.[1][3] Both controls applied at the level of a single API call, and neither was visible to the model during generation. With agentic workflows that issue many sequential Messages API calls (a coding agent that explores a repo and edits files, a deep research agent that runs dozens of searches before drafting a report, a browser agent that clicks through pages), developers reported two recurring failure modes.[2][4] First, runs could silently inflate in cost: at high effort levels, Claude would spend tens of thousands of tokens on thinking and tool calls per turn, with no signal to the model that the human caller cared about total spend.[1][4] Second, agents that did hit a hard cutoff would stop mid-action, leaving a partial tool call or a half-written file with no summary, because the cutoff was applied externally and the model never had a chance to wrap up.[1]
Anthropic's response, shipped with Opus 4.7, was to give Claude its own view of how much budget the loop had left and let the model self-moderate.[1][2] The beta header, task-budgets-2026-03-13, carries a date more than a month earlier than the 16 April 2026 launch of Opus 4.7, which suggests the feature was in private testing for some weeks before it was exposed publicly alongside the release.[1][2] Task budgets are eligible for Zero Data Retention arrangements, which puts them in the same compliance posture as other beta features on the API.[1]
The model card and what's-new page place task budgets alongside two other Opus 4.7 changes that reshape how developers control Claude's compute: the new xhigh effort level between high and max, and the removal of extended thinking's explicit budget_tokens parameter in favor of adaptive thinking driven entirely by the effort parameter.[2][3] Together those three changes shift the per-request "how much thinking?" decision from a hand-set token cap to a coarser effort label, and add a new per-loop "how much work in total?" cap that the model itself can see and reason about.[2][3]
Task budgets live inside the output_config object on a Messages API request. The minimal Python invocation looks like:[1][2]
import anthropic

client = anthropic.Anthropic()

response = client.beta.messages.create(
    model="claude-opus-4-7",
    max_tokens=128000,
    output_config={
        "effort": "high",
        "task_budget": {"type": "tokens", "total": 64000},
    },
    messages=[
        {
            "role": "user",
            "content": "Review the codebase and propose a refactor plan.",
        }
    ],
    betas=["task-budgets-2026-03-13"],
)
The task_budget object has three documented fields:[1]
type: always the literal string "tokens". No other budget type is currently defined.
total: the number of tokens Claude can spend across the agentic loop, including thinking, tool calls, tool results, and output.
remaining (optional): the budget remainder carried over from a prior request. Defaults to total when omitted.
The minimum accepted value for total is 20,000 tokens; values below the minimum return an HTTP 400 error.[1] The beta header task-budgets-2026-03-13 is required on every request that includes a task budget.[1] Task budgets are exposed on the beta Messages endpoint (client.beta.messages.create in the Python SDK and equivalents in TypeScript, Go, Java, C#, PHP, and Ruby), and the same shape works against direct cURL calls to https://api.anthropic.com/v1/messages.[1]
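The field rules above can be checked client-side before a request is sent. The following is a minimal sketch, not part of any SDK: build_task_budget and MIN_TASK_BUDGET_TOKENS are hypothetical names introduced here for illustration, and the server remains the authority on what is accepted.

```python
from typing import Optional

MIN_TASK_BUDGET_TOKENS = 20_000  # documented minimum for `total`


def build_task_budget(total: int, remaining: Optional[int] = None) -> dict:
    """Return a task_budget dict, raising early on a too-small total.

    Hypothetical helper: the API itself answers HTTP 400 for totals
    below the minimum; this only surfaces the problem before the call.
    """
    if total < MIN_TASK_BUDGET_TOKENS:
        raise ValueError(
            f"total must be at least {MIN_TASK_BUDGET_TOKENS} tokens, got {total}"
        )
    budget = {"type": "tokens", "total": total}
    if remaining is not None:
        # Only included when carrying budget over from a prior request.
        budget["remaining"] = remaining
    return budget
```

The returned dictionary is meant to be placed under the task_budget key of output_config, in the same position as the literal in the request shown earlier.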
The documentation gives the same code in eight languages, including identical structures in TypeScript, Go, Java, C#, PHP, and Ruby SDKs, so the budget object is a stable cross-language shape rather than a Python-only quirk.[1] The Go SDK in particular models Remaining as a pointer-wrapped optional, consistent with the field being absent on initial requests and only set when a caller manually carries budget across compaction.[1]
When a request arrives with a task budget set, Anthropic injects a budget-countdown marker into the conversation the model sees during generation.[1] The marker is a server-side signal, not an additional message in the API payload: callers cannot read or write it directly. The countdown shows how many tokens remain in the current agentic loop and updates as the model produces thinking, tool calls, and output, and as it processes tool results.[1] Claude uses that signal to pace itself across the loop and finish gracefully as the budget is consumed.[1]
The documentation is explicit that the countdown reflects what Claude has processed in the current loop, not the raw byte-size of the request payload.[1] In an agentic loop a client typically resends the full conversation history on every follow-up request (the Messages API is stateless), but the budget only decrements by tokens Claude actually sees this turn. The cookbook example illustrates the distinction. Consider a loop with a 100,000-token budget and a single shell tool.[1]
On turn 1, the user message ("Audit this repo for security issues and report findings.") is roughly 20 tokens. Claude emits a thinking block plus a tool_use block, totaling about 5,000 generated tokens. The countdown ends near 95,000 remaining.[1]
On turn 2, the client appends the assistant turn and a 2,800-token tool_result block (the output of npm audit --json) and resends the full history. The resent turn-1 content does not count again. The 2,800-token tool result is new content Claude sees this turn and counts against the budget. Claude spends another 4,000 tokens on thinking and a second tool call. The countdown ends near 88,200 remaining.[1]
On turn 3, the client resends history with a 1,200-token grep output appended. Claude writes a 6,000-token final report and stops with stop_reason: "end_turn". Remaining ends near 81,000.[1]
| Turn | Request payload (input tokens sent) | Tokens counted against budget this turn | Budget remaining after |
|---|---|---|---|
| 1 | ~20 | 5,000 (thinking + tool_use) | ~95,000 |
| 2 | ~7,800 (turn 1 history + tool result) | 6,800 (2,800 tool result + 4,000 thinking and tool_use) | ~88,200 |
| 3 | ~13,000 (full history + second tool result) | 7,200 (1,200 tool result + 6,000 text) | ~81,000 |
| Total | ~20,820 sent across requests | 19,000 counted against budget | ~81,000 |
The cumulative payload (about 20,820 tokens transmitted by the client) is larger than the cumulative budget spend (19,000 tokens counted), because turn-1 content is sent three times but counted once.[1] Prompt caching of the resent prefix on turns 2 and 3 further widens the gap between transmitted bytes and budget spend.[1]
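The arithmetic of the worked example can be reproduced with a few lines of Python. This is only a sketch of the bookkeeping described above, under the assumption (taken from the table) that each turn is charged for the new tool-result tokens plus the tokens Claude generates, and that the roughly 20-token user message is ignored; the real countdown is tracked server-side and simulate_countdown is a hypothetical name.

```python
def simulate_countdown(total, turns):
    """Return the budget remaining after each turn.

    `turns` is a list of (new_tool_result_tokens, generated_tokens) pairs.
    Resent history is deliberately not charged again.
    """
    remaining = total
    after_each_turn = []
    for tool_result_tokens, generated_tokens in turns:
        remaining -= tool_result_tokens + generated_tokens
        after_each_turn.append(remaining)
    return after_each_turn


# The three turns from the example: (tool result, thinking/tool_use/text).
example_turns = [(0, 5000), (2800, 4000), (1200, 6000)]
print(simulate_countdown(100_000, example_turns))  # [95000, 88200, 81000]
```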
If an agentic loop compacts or summarizes earlier turns before sending the next request, the server has no record of how much budget was spent before compaction.[1] In that case the caller can pass remaining on the next request so the countdown continues from where it left off rather than resetting to total:[1]
output_config = {
    "effort": "high",
    "task_budget": {
        "type": "tokens",
        "total": 128000,
        "remaining": 128000 - tokens_spent_so_far,
    },
}
For loops that resend full uncompacted history on every turn, the documentation recommends omitting remaining and letting the server track the countdown.[1] The reason is that mutating remaining client-side breaks prompt caching: the budget-countdown marker is injected per turn server-side and does not need to match across requests, but if the caller decrements a remaining field inside the request body each turn, the changed value invalidates any cache prefix that contains it.[1]
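One way to follow both recommendations is to build output_config through a single function that adds remaining only after compaction. The sketch below is illustrative: make_output_config and its compacted flag are hypothetical names, and the caller is assumed to track tokens_spent_so_far itself.

```python
def make_output_config(total, effort="high", compacted=False, tokens_spent_so_far=0):
    """Build output_config, adding `remaining` only after compaction.

    For uncompacted loops the returned value is identical on every turn,
    so it never changes a cached request prefix.
    """
    budget = {"type": "tokens", "total": total}
    if compacted:
        # The server has lost track of earlier spend; carry it over by hand.
        budget["remaining"] = total - tokens_spent_so_far
    return {"effort": effort, "task_budget": budget}
```

A loop that never rewrites its history would call make_output_config(128000) on every turn; a loop that has just summarized earlier turns would pass compacted=True together with its own running count.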
A task budget is a soft hint, not a hard cap.[1][2] Claude may occasionally exceed the budget if it is in the middle of an action that would be more disruptive to interrupt than to finish (for example, completing a partial tool call whose state cannot be cleanly resumed).[1] The enforced limit on total output tokens remains max_tokens, which truncates the response with stop_reason: "max_tokens" when reached.[1]
For a hard cap on cost or latency, Anthropic recommends combining the two parameters: use task_budget to give Claude a target to pace against, and use max_tokens as the absolute per-request ceiling that prevents runaway generation.[1] Because task_budget spans the full agentic loop (potentially many requests) while max_tokens caps each individual request, the two values are independent: one does not need to be at or below the other.[1] A loop with a 200,000-token total task budget might run with max_tokens of 64,000 per request, expecting roughly three to five turns of work before the loop completes.[1]
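Because max_tokens is still enforced on each request, a harness that sets a task budget still has to handle truncated responses. The sketch below shows one way a loop might branch on stop_reason; next_action and the action labels it returns are hypothetical, and only the stop_reason strings themselves come from the API.

```python
def next_action(stop_reason: str) -> str:
    """Map a Messages API stop_reason to a (hypothetical) harness action."""
    if stop_reason == "end_turn":
        return "done"        # Claude finished the loop on its own
    if stop_reason == "tool_use":
        return "run_tools"   # execute the tools and send the results back
    if stop_reason == "max_tokens":
        return "truncated"   # hard per-request ceiling hit mid-response
    return "inspect"         # any other stop reason: leave to the caller
```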
The advisory nature has a second consequence that the documentation flags as an agent design pitfall. A budget that is too small for the task can cause refusal-like behavior:
When Claude sees a budget that is clearly insufficient for the work being asked (for example, a 20,000-token budget for a multi-hour agentic coding task), it may decline to attempt the task at all, scope it down aggressively, or stop early with a partial result rather than start work it cannot finish.[1]
The recommended fix is to raise the budget before tuning other parameters; the migration guide repeats the warning in slightly different words and adds that the model may "complete the task less thoroughly, referencing its budget as the constraint."[3] Sizing budgets against an actual task-length distribution rather than a fixed default avoids those failure modes.[1]
Anthropic's recommended process for sizing a task budget is to measure first.[1] Callers should run a representative sample of their existing tasks without task_budget set and record the total tokens Claude spends per task: for an agentic loop, that is the sum of usage.output_tokens plus tool-result tokens across every request in the loop.[1] The documentation gives a Python helper that runs a loop to completion and accumulates response.usage.output_tokens across every turn:[1]
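def run_task_and_count_tokens(messages: list) -> int:
    """Run one agentic loop to completion and return the tokens it generated.

    Assumes `client`, `tools`, and `run_tools` are defined by the caller.
    """
    total_spend = 0
    while True:
        response = client.beta.messages.create(
            model="claude-opus-4-7",
            max_tokens=128000,
            messages=messages,
            tools=tools,
            betas=["task-budgets-2026-03-13"],
        )
        # Accumulate the tokens generated on this turn.
        # Tool-result tokens are not included in this count.
        total_spend += response.usage.output_tokens
        if response.stop_reason == "end_turn":
            return total_spend
        # Append the assistant turn and the tool results, then continue.
        messages += [
            {"role": "assistant", "content": response.content},
            {"role": "user", "content": run_tools(response.content)},
        ]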
The recommended starting point is the p99 of per-task token spend across a sample, tuned up or down from there based on how the model's behavior changes once a budget is in place.[1] Sizing from p99 rather than the median lets most tasks finish comfortably inside the budget while still giving the model a signal to pace against on the long tail.[1]
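Turning a sample of measurements into a starting budget is a short calculation. The sketch below uses the nearest-rank definition of the 99th percentile and raises the result to the 20,000-token minimum; both the percentile method and the name p99_budget are assumptions made for illustration, since the documentation does not specify how the percentile should be computed.

```python
MIN_TASK_BUDGET_TOKENS = 20_000  # documented minimum for `total`


def p99_budget(per_task_spend):
    """Suggest a starting task budget from measured per-task token spend."""
    if not per_task_spend:
        raise ValueError("need at least one measured task")
    ordered = sorted(per_task_spend)
    # Nearest-rank p99: ceil(0.99 * n), computed in integer arithmetic.
    rank = (99 * len(ordered) + 99) // 100
    p99 = ordered[rank - 1]
    # Never suggest less than the API accepts.
    return max(p99, MIN_TASK_BUDGET_TOKENS)
```

The input would be the per-task totals collected by a measurement helper such as run_task_and_count_tokens, one entry per task in the sample.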
Task budgets sit alongside several other Anthropic API parameters that influence how much compute Claude spends. The documentation calls out four interactions explicitly:[1]
max_tokens is orthogonal. It is a hard per-request cap on generated tokens, while task_budget is an advisory cap across the full agentic loop. Anthropic recommends max_tokens of at least 64,000 when running at xhigh or max effort, to give Claude room to think and act on each individual request.[1][3]
Effort. The effort parameter (with levels low, medium, high, xhigh, and max on Opus 4.7) controls how deeply Claude reasons per step.[2][3] Task budgets control how much total work Claude does across an agentic loop. The two are explicitly described as complementary: effort tunes depth, task budgets tune breadth.[1] In practice a developer might pair xhigh effort with a large task budget for a complex coding task, or medium effort with a tight budget for a constrained extraction pipeline.[3]
Adaptive thinking. Task budgets include thinking tokens in the count, so adaptive thinking naturally scales down as the budget depletes.[1] On Opus 4.7 the explicit thinking.budget_tokens parameter from earlier models was removed entirely; thinking depth is controlled by effort, and overall loop spend (including thinking) is controlled by task_budget.[3]
Prompt caching. As noted above, the budget-countdown marker is injected server-side per turn, so it does not need to match across requests. If a caller decrements task_budget.remaining on each follow-up request, the changed value invalidates cache prefixes that contain it. The documentation's recommendation is to set the budget once on the initial request and let the model self-regulate against the server-side countdown rather than mutating the budget client-side.[1]
The migration guide also lists task budgets among the recommended changes when moving from Opus 4.6 to Opus 4.7, alongside re-evaluating max_tokens for the new tokenizer and raising headroom at xhigh or max effort.[3]
The Opus 4.7 documentation positions task budgets against several adjacent mechanisms.[1][2][3]
Versus max_tokens. max_tokens was, before Opus 4.7, the only built-in API cap on Claude's output. It is enforced server-side: when the response would exceed max_tokens it is truncated and stop_reason is set to max_tokens.[1] The crucial difference is visibility: max_tokens is not passed to the model, so the model is not aware of it; task budgets are passed to the model via the countdown marker, so the model can plan around them.[1][2] In practice this changes the failure mode at the limit: max_tokens produces a mid-sentence truncation; a task budget that approaches zero produces (in normal operation) a summarized handoff or graceful close.[1]
Versus budget_tokens (extended thinking). Earlier Claude models that supported extended thinking accepted thinking: {"type": "enabled", "budget_tokens": N}, which capped how many tokens the model could spend in a single thinking block.[3] That parameter was removed on Opus 4.7; setting it returns a 400 error.[3] Where budget_tokens capped one phase (thinking) of one request, task budgets cap all phases (thinking, tool calls, tool results, output) of a full multi-request loop.[1][3] The semantic shift mirrors the shift from extended thinking to adaptive thinking: from explicit token budgets the developer hand-sets, to model-managed pacing against coarser controls.[3]
Versus prompt-engineered budget hints. Developers had previously simulated task budgets by adding instructions like "You have roughly 30,000 tokens for this task; spend them wisely" inside the system prompt. The documentation does not name this technique explicitly, but the design rationale (give the model a real-time, accurate, decrementing signal rather than a static prompt-time number) addresses its failure mode: a prompt-time hint cannot tell the model how much budget has actually been spent five tool calls in.[1]
Versus the effort parameter. Effort and task budgets answer different questions. Effort answers "how thoroughly should Claude think about each step?" with five qualitative levels.[3] Task budgets answer "how many tokens total can Claude spend across the whole loop?" with a concrete number.[1] A developer can set both at once: the recommended starting configuration for Opus 4.7 agentic coding is xhigh effort with a large task budget sized from measured p99 spend.[1][3]
Versus client-side hard cutoffs. Some agent harnesses implemented their own hard cutoffs by counting tokens client-side and refusing to issue the next request once a threshold was crossed.[4] That approach has two problems the task budget design fixes: the cutoff happens between requests rather than mid-loop, so the model has no chance to wrap up; and the threshold count was approximate because it depended on client-side tokenization estimates rather than what the server actually counted.[1]
Task budgets are designed for agentic workflows where Claude makes multiple tool calls and decisions before finalizing its output to await the next human response.[1] The documentation calls out three situations:[1]
The release coverage and migration guidance identify a few concrete agent types that fit those criteria.
Agentic coding agents. Multi-step refactor and debugging loops are the canonical use case named in both the docs example ("Review the codebase and propose a refactor plan") and the Opus 4.7 launch announcement.[1][2] At xhigh or max effort, Opus 4.7 will routinely spend tens of thousands of tokens per turn on thinking and tool calls; over a multi-turn loop a single task can consume hundreds of thousands of tokens.[2][4] Setting a task budget gives the model a concrete envelope to plan against and reduces the risk of an autonomous coding run inflating in cost beyond what the caller expected.[1][4]
Deep research agents. Deep research workflows that dispatch many search queries, fetch and read documents, and synthesize a final report match the agentic-loop shape task budgets were designed for: many tool calls, a long final synthesis, and a spend target the caller wants the model to respect.[1][2]
Browser agents and computer-use loops. Long computer-use sessions accumulate budget across many screenshot inputs, thinking blocks, and click or type actions. Anthropic's tokenizer change in Opus 4.7 increases image-token usage on full-resolution screenshots, which makes a budget cap especially useful for limiting spend on vision-heavy browser agents.[3]
Audit, redlining, and document-analysis pipelines. The launch notes cite improvements on .docx redlining and .pptx editing, both of which involve agentic verification loops that read, edit, and re-check structured documents.[2] Task budgets give an explicit envelope for those self-checking loops, which can otherwise iterate longer than expected when the model decides further verification is warranted.[2]
At launch, task budgets are supported only on Claude Opus 4.7 in public beta.[1] The documentation lists the full support matrix:[1]
| Model | Support |
|---|---|
| Claude Opus 4.7 | Public beta (set task-budgets-2026-03-13 header) |
| Claude Opus 4.6 | Not supported |
| Claude Sonnet 4.6 | Not supported |
| Claude Haiku 4.5 | Not supported |
Task budgets are also not supported on Claude Code or on Anthropic's Cowork surfaces at launch.[1] Callers wanting the feature must invoke it through the Messages API on Claude Opus 4.7 directly, rather than through higher-level Anthropic agent products. The what's-new page notes that Claude Managed Agents have no API breaking changes at all on Opus 4.7 (since the managed product handles effort and budgeting internally), so the task-budgets mechanism is specifically a Messages-API affordance for developers building their own agent harnesses.[2][3]
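A harness that targets several models can gate the budget fields on the support matrix above. The sketch below is a hypothetical helper (budget_kwargs is not an SDK function) and assumes the model is identified by the exact string claude-opus-4-7 used in the request examples; other identifiers or aliases are not handled.

```python
TASK_BUDGET_BETA = "task-budgets-2026-03-13"
TASK_BUDGET_MODELS = {"claude-opus-4-7"}  # only model supported at launch


def budget_kwargs(model: str, total: int) -> dict:
    """Return the extra request arguments for a task budget, or {} if unsupported."""
    if model not in TASK_BUDGET_MODELS:
        # Unsupported model: send neither the budget nor the beta header.
        return {}
    return {
        "output_config": {"task_budget": {"type": "tokens", "total": total}},
        "betas": [TASK_BUDGET_BETA],
    }
```

The result can be merged into the other request arguments with dictionary unpacking, so that requests to unsupported models are left unchanged.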
The launch timing is consistent with task budgets being co-designed with Opus 4.7's new pacing behavior. Pre-launch reviewers reported improved "loop resistance" and "graceful error recovery" in multi-step workflows on Opus 4.7, and the model is described in Anthropic's release notes as devising ways to verify its own outputs and pacing itself across long agentic traces.[2][4][5] Task budgets are the explicit knob that exposes that pacing capability to developers.[1]
The documentation flags several practical limitations of the current beta:[1]
Soft, not hard. The most important caveat is the one most likely to surprise developers used to max_tokens: task budgets are advisory. Claude may exceed them by some amount when interrupting would be more disruptive than finishing.[1] For strict cost ceilings the recommendation is to layer max_tokens on top.[1]
Too-small budgets cause refusal-like behavior. A budget the model judges insufficient for the work being asked can produce refusals, aggressive scope reductions, or early stops with partial results.[1][3] Callers debugging unexpected stops are told to raise the budget before tuning other parameters.[1]
Client-side remaining mutation breaks caching. Decrementing remaining on each follow-up request to mirror the server-side countdown invalidates prompt-cache prefixes that contain the changed value.[1] The recommended pattern is to set the budget once and let the model self-regulate against the server-side countdown; mutate remaining only when the client has actually rewritten the conversation (for example through compaction or summarization).[1]
Budget reflects what Claude sees, not what is sent. Because the budget decrements based on per-turn-visible tokens rather than full request payload size, naive client-side token counters that count resent history will drift away from the server-side number over multi-turn loops. The documentation's recommendation is to budget against measured per-task usage (not raw transmitted bytes) and let the server do the bookkeeping.[1]
Beta-only and Opus-4.7-only at launch. Task budgets are unavailable on Sonnet 4.6, Haiku 4.5, and Opus 4.6, and unavailable on Claude Code and Cowork.[1] Developers wanting per-loop budgeting on those models must continue to use client-side heuristics or hard max_tokens cutoffs.
Tokenizer change interacts with budget sizing. Opus 4.7's new tokenizer can produce roughly 1.0x to 1.35x as many tokens as Opus 4.6 for the same text, with code and structured data near the high end.[2][3] Budgets sized off Opus 4.6 measurements need to be widened on Opus 4.7 to deliver the same amount of model work.[3]
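As a rough illustration of that adjustment, a budget measured on Opus 4.6 can be scaled by a chosen factor within the quoted range. The sketch below defaults to the 1.35x upper end, which is a conservative assumption rather than a documented recommendation, and widen_budget is a hypothetical name; integer arithmetic is used so the result is rounded up without floating-point error.

```python
def widen_budget(measured_tokens: int, factor_percent: int = 135) -> int:
    """Scale a budget measured on the old tokenizer, rounding up.

    `factor_percent` expresses the multiplier as a percentage,
    so 135 means 1.35x and 100 means no change.
    """
    return (measured_tokens * factor_percent + 99) // 100


print(widen_budget(100_000))  # 135000
```

Workloads dominated by prose rather than code may need a smaller factor; re-measuring on Opus 4.7 replaces the estimate with real data.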
The migration guide includes "Consider adopting task budgets (beta) for agentic workflows" on its Opus 4.7 migration checklist, which marks the feature as a recommended but not required upgrade step for callers moving from Opus 4.6.[3]
Coverage in the days after Opus 4.7's 16 April 2026 launch repeatedly flagged task budgets as a quietly important release item. One review called task budgets "the most underrated feature" of the launch, describing them as "runtime token budgets you can set on long-running agents so they can't silently burn through your quota."[5] Production-focused commentary framed task budgets as the practical answer to agent cost variance, noting that the soft-limit design lets the model "either reprioritize, ask for more, or halt and summarize state" when it sees the budget running low.[4]
The mechanism received attention partly because of two coincident Opus 4.7 changes that compound its value: the new xhigh effort tier defaults to higher thinking spend per turn, and the new tokenizer increases token counts for the same source text by up to 35%.[2][3][4] Both changes push cost upward on identical workloads, so a per-loop budget that the model can see and pace against is a natural complement.[4] The migration guide lists "Adopt task budgets" alongside "Re-evaluate max_tokens" and "Set a large max_tokens at max or xhigh effort" as a coherent group of cost-control recommendations for Opus 4.7 adopters.[3]