Tau2-bench

τ²-bench
Full name	τ²-bench: Evaluating Conversational Agents in a Dual-Control Environment
Abbreviation	τ²-bench, Tau2-bench
First release	2025-06-12 (v0.1.0)
Latest version	0.2.1 (November 2025)
Paper	arXiv:2506.07982 (2025-06-09)
Authors	Victor Barres, Honghua Dong, Soham Ray, Xujie Si, Karthik Narasimhan
Organization	Sierra, with University of Toronto and the Vector Institute
Type	Agent evaluation, conversational AI, tool use
Modality	Text dialogue with structured tool calls
Domains	Airline, Retail, Telecom, Mock
Tasks	Telecom: 114 (sampled from 2,285); Retail: 115; Airline: 50
Metric	pass^k, pass^1, database assertions, action matching
Languages	English
Best reported pass^1 (paper)	Retail ~74% (gpt-4.1); Airline ~56% (gpt-4.1); Telecom 49% (claude-3-7-sonnet)
Saturated	No
Code	sierra-research/tau2-bench
Predecessor	τ-bench (Yao et al., 2024, arXiv:2406.12045)
Successor	τ³-bench (Knowledge and Voice extensions)
License	MIT

τ²-bench (also written Tau2-bench or τ^2-bench) is a benchmark for evaluating conversational AI agents in dual-control environments, where both the agent and a simulated user can call tools to read from and write to a shared world state. It was introduced in a June 2025 paper from Sierra (the customer-experience AI startup co-founded by Bret Taylor and Clay Bavor) with researchers at the University of Toronto and the Vector Institute, and is released under an MIT license ^[1]^[2]. τ²-bench is the direct successor to τ-bench (Yao et al., 2024) and adds a new telecom troubleshooting domain alongside revised airline and retail domains ^[3]. The squared notation captures the move from a single locus of control (the agent) to two (agent and user), modeled as a Decentralized Partially Observable Markov Decision Process (Dec-POMDP) ^[1].

Background and motivation

Most agent benchmarks before τ²-bench, including τ-bench, WebArena, AgentBench, and ToolSandbox, give tools only to the agent and treat the user, where there is one, as a passive provider of preferences. This does not match real customer-service work. In a telecom support call the user has to toggle airplane mode or read off the status bar, while the agent runs database queries on a system the user cannot see ^[1]. The agent has to drive troubleshooting from incomplete information, explain each step clearly enough for a non-expert to follow, and interpret what the user reports back.

The authors also flag a practical concern: τ-bench's user simulator was error-prone. Manual annotation of 100 airline and 50 retail conversations found error rates of 47% and 40% respectively, with 13% and 12% task-critical ^[1]. The new telecom domain reduces simulator errors to 16% total and 6% critical.

Authors and organization

The authors are Victor Barres (Sierra) and Honghua Dong (Sierra and Toronto/Vector, internship), who contributed equally, plus Soham Ray (Sierra), Xujie Si (Toronto/Vector), and Karthik Narasimhan (Sierra) ^[1]. Narasimhan also co-authored the original τ-bench and was a Princeton professor before joining Sierra as Head of Research. Sierra was founded in 2023 by Bret Taylor (former co-CEO of Salesforce, chair of the OpenAI board) and Clay Bavor (former head of Google Labs).

Dual-control formalization

τ²-bench formalizes the interaction as a Dec-POMDP defined by the tuple (S, {A_i}, {O_i}, T, R, U, M), where i ranges over the agent and user ^[1].

Component	Definition	Telecom example
Message space M	Messages exchanged between agent and user	User: "I cannot use mobile data."
Global state S	S_world (agent DB + user DB) crossed with S_history	CRM data on agent side, phone status on user side
Action space A_i	Either a tool call or a message (one player acts per turn)	Agent: get_customer_by_id(id); user: toggle_airplane_mode()
Observation O_i	Tool outputs or messages from the other player	User sees "Airplane Mode is now OFF"
Transition T	Joint action maps state to new state and observations	enable_roaming() updates the line so the phone can roam
Reward R	Maps state to [0, 1], 1 if assertions hold	Reward 1 if mocked phone reports "connected"
Instructions U	Scenario for user simulator and policy doc for agent	Persona plus telecom troubleshooting policy

The framing makes explicit what each player cannot see. The agent cannot check the user's airplane-mode state directly; it has to ask. The user cannot see the carrier's CRM record. Each side has its own database, its own tools, and its own partial view of the shared world ^[1].
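The split between the two players can be made concrete with a toy model. The sketch below is illustrative only: `World`, `step`, `reward`, the `TOOLS` registry, and the database fields are hypothetical names chosen for this example and are not the tau2-bench API. Only the tool names enable_roaming and toggle_airplane_mode are taken from the examples in the table above.

```python
from dataclasses import dataclass, field


@dataclass
class World:
    """Shared world state with one database per player (illustrative fields)."""
    agent_db: dict = field(default_factory=lambda: {"roaming_enabled": False})
    user_db: dict = field(default_factory=lambda: {"airplane_mode": True})


def enable_roaming(world: World) -> str:
    """Agent-side tool: writes to the agent database only."""
    world.agent_db["roaming_enabled"] = True
    return "Roaming enabled on the line"


def toggle_airplane_mode(world: World) -> str:
    """User-side tool: writes to the user database only."""
    world.user_db["airplane_mode"] = not world.user_db["airplane_mode"]
    state = "ON" if world.user_db["airplane_mode"] else "OFF"
    return f"Airplane Mode is now {state}"


# Each player has its own tool set; neither can call the other's tools.
TOOLS = {
    "agent": {"enable_roaming": enable_roaming},
    "user": {"toggle_airplane_mode": toggle_airplane_mode},
}


def step(world: World, player: str, tool: str) -> str:
    """Apply one tool call and return the observation seen by that player."""
    if tool not in TOOLS[player]:
        raise PermissionError(f"{player} cannot call {tool}")
    return TOOLS[player][tool](world)


def reward(world: World) -> float:
    """Return 1.0 if every final-state assertion holds, else 0.0."""
    ok = world.agent_db["roaming_enabled"] and not world.user_db["airplane_mode"]
    return 1.0 if ok else 0.0


world = World()
obs_agent = step(world, "agent", "enable_roaming")
obs_user = step(world, "user", "toggle_airplane_mode")
final_reward = reward(world)
```

In this toy setup the task can only be solved if both players act: the agent cannot flip airplane mode itself, so it would have to ask the user to do so through a message.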

Domains

τ²-bench ships with three primary domains plus a mock domain. Retail and airline are inherited from τ-bench with task-quality fixes; telecom is new.

Domain	Agent databases	Agent tools	User tools	Tasks
Retail	500 users, 50 products, 1,000 orders	7 write, 6 read	None	115
Airline	500 users, 300 flights, 2,000 reservations	6 write, 6 read	None	50
Telecom	5 plans, 9 lines, 4 customers	6 write, 7 read	15 write, 15 read	114 (from 2,285 generated)
Mock	Synthetic	Configurable	Configurable	Framework testing

In retail and airline the user has no tools (as in τ-bench), so dual control is not active there; they are kept as comparison points. Telecom is the only domain where the user has tools: 15 write tools (toggle_airplane_mode, toggle_data, etc.) and 15 read tools (get_status_bar, get_network_status, etc.), mimicking an end user operating a smartphone ^[1].

Telecom troubleshooting

Telecom tasks are organized around three intents that form a difficulty hierarchy: service_issue (mean 2.31 actions to solve), mobile_data_issue (mean 4.31), and mms_issue (mean 6.00) ^[1]. mms_issue is hardest because resolving an MMS problem often requires first ruling out underlying service or data issues. Each task also comes with an optional persona: Easy (a 41-year-old office administrator who is patient and confirms instructions), Hard (a 64-year-old retired librarian who finds technology confusing and gets flustered), or None ^[1].

Compositional task generator

One of the four headline contributions is a programmatic task generator that builds composite tasks from atomic subtasks. Each atomic subtask is defined by three sets of functions ^[1]: initialization functions that set up the broken state (e.g. set_airplane_mode(True)), solution functions that fix the issue (e.g. toggle_airplane_mode()), and assertion functions that check the final state (e.g. assert_service_status("connected")).

Mutually exclusive subtasks live in the same group. A composite task picks at most one subtask per group and concatenates their function calls; the framework verifies the composite is solvable and is not already solved before the solution functions run ^[1]. The authors built 15 atomic subtask groups for telecom, yielding 2,285 valid combinations, then subsampled 114 tasks for a balanced distribution over intents and difficulty (subtask counts range from 2 to 9). This is a real shift away from τ-bench, where each task was hand-written.
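The composition rule can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the generator shipped with tau2-bench: the group names, state keys, `make_subtask`, `composites`, and `is_valid` are all hypothetical, and the phone state is reduced to a plain dictionary.

```python
from itertools import product

# A healthy phone state; every subtask breaks exactly one field (illustrative).
HEALTHY = {"airplane_mode": False, "mobile_data": True, "sim_status": "active"}


def make_subtask(key, broken, fixed):
    """Build one atomic subtask as init / solution / assertion functions."""
    def init(state):
        state[key] = broken

    def solve(state):
        state[key] = fixed

    def check(state):
        return state[key] == fixed

    return {"name": f"{key}={broken}", "init": init, "solve": solve, "check": check}


# Subtasks in the same group are mutually exclusive.
GROUPS = {
    "airplane": [make_subtask("airplane_mode", True, False)],
    "data": [make_subtask("mobile_data", False, True)],
    "sim": [
        make_subtask("sim_status", "locked", "active"),
        make_subtask("sim_status", "missing", "active"),
    ],
}


def composites(groups):
    """Yield every non-empty choice of at most one subtask per group."""
    options = [[None, *subtasks] for subtasks in groups.values()]
    for choice in product(*options):
        picked = [sub for sub in choice if sub is not None]
        if picked:
            yield picked


def is_valid(task):
    """A composite must start unsolved and end solved after its solutions run."""
    state = dict(HEALTHY)
    for sub in task:
        sub["init"](state)
    if all(sub["check"](state) for sub in task):
        return False  # already solved before any solution function ran
    for sub in task:
        sub["solve"](state)
    return all(sub["check"](state) for sub in task)


valid_tasks = [task for task in composites(GROUPS) if is_valid(task)]
```

With three toy groups of sizes 1, 1, and 2, the product has 2 × 2 × 3 = 12 choices, one of which is empty, leaving 11 composite tasks. The same counting logic, applied to the paper's 15 groups and filtered for validity, is what produces a pool in the thousands from a small set of hand-written subtasks.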

Evaluation methodology

The pass^k metric

τ²-bench inherits pass^k from τ-bench. pass^k is not the standard pass@k from code-generation benchmarks; the two are easy to confuse ^[3]:

Metric	Definition	What it measures
pass@k	At least one of k attempts succeeds	Best-of-k performance
pass^k	All k independent attempts succeed	Reliability across repeated runs

For a model whose attempts succeed independently with probability p, pass^k equals p^k, so it decays geometrically in k. Under that independence assumption, a model with 90% pass^1 has a pass^8 of only about 43% (0.9^8 ≈ 0.43). τ-bench used pass^k because real customer-service deployments cannot tolerate occasional failures: an agent that gets a refund right 9 times out of 10 is not shippable when each user only gets one shot.
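The two metrics can be estimated from the same set of trials. The sketch below assumes the combinatorial estimators commonly used for them: with n trials and c successes on a task, pass^k is estimated as C(c, k) / C(n, k) and pass@k as 1 − C(n − c, k) / C(n, k), averaged over tasks. The function names are illustrative and not taken from the tau2-bench codebase.

```python
from math import comb


def pass_hat_k(num_trials: int, num_success: int, k: int) -> float:
    """Estimate the chance that k randomly drawn trials ALL succeed."""
    if not 1 <= k <= num_trials:
        raise ValueError("k must be between 1 and the number of trials")
    return comb(num_success, k) / comb(num_trials, k)


def pass_at_k(num_trials: int, num_success: int, k: int) -> float:
    """Estimate the chance that AT LEAST ONE of k drawn trials succeeds."""
    if not 1 <= k <= num_trials:
        raise ValueError("k must be between 1 and the number of trials")
    return 1.0 - comb(num_trials - num_success, k) / comb(num_trials, k)


def benchmark_pass_hat_k(successes_per_task, num_trials: int, k: int) -> float:
    """Average the per-task pass^k estimate over all tasks."""
    scores = [pass_hat_k(num_trials, c, k) for c in successes_per_task]
    return sum(scores) / len(scores)


# One task, four trials, three successes (made-up numbers for illustration).
reliability_2 = pass_hat_k(4, 3, 2)   # all of 2 drawn trials succeed
best_of_2 = pass_at_k(4, 3, 2)        # at least one of 2 drawn trials succeeds

# Geometric decay under independence: p = 0.9, k = 8.
independent_pass_8 = 0.9 ** 8
```

For the made-up task above, pass^2 is 3/6 = 0.5 while pass@2 is 1.0, which shows how far apart the two metrics can be on identical data.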

Task success criteria

Each task can specify any subset of the following criteria ^[1]: database check (final agent DB vs reference), status assertions on the final world state, natural language assertions graded by an LLM (such as "the agent diagnosed the cause of the issue"), communication info check (the agent told the user specific facts), and action matching (every solution function appears in the trajectory). Telecom uses only assertion functions; the others carry over from τ-bench style retail and airline tasks.
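A rough sketch of how such per-task criteria might be combined into a single binary reward follows. It assumes the reward is 1 only when every specified check passes. `Task`, `evaluate`, and all field names are hypothetical, and the LLM-graded natural language assertions are left out because they need a model call.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Task:
    """Criteria a task may specify; any subset can be left empty (illustrative)."""
    expected_db: Optional[dict] = None
    status_assertions: list = field(default_factory=list)  # callables: world -> bool
    required_info: list = field(default_factory=list)      # facts the agent must state
    required_actions: list = field(default_factory=list)   # tool names that must appear


def evaluate(task: Task, final_db: dict, world: dict,
             agent_messages: list, actions_taken: list) -> float:
    """Return 1.0 only if every criterion the task specifies is satisfied."""
    checks = []
    if task.expected_db is not None:
        checks.append(final_db == task.expected_db)            # database check
    checks.extend(check(world) for check in task.status_assertions)
    transcript = " ".join(agent_messages).lower()
    checks.extend(info.lower() in transcript for info in task.required_info)
    checks.extend(name in actions_taken for name in task.required_actions)
    return 1.0 if all(checks) else 0.0


def service_connected(world: dict) -> bool:
    """Hypothetical status assertion on the final world state."""
    return world.get("service_status") == "connected"


# A telecom-style task that relies on status assertions only.
telecom_task = Task(status_assertions=[service_connected])
solved = evaluate(telecom_task, {}, {"service_status": "connected"}, [], [])
unsolved = evaluate(telecom_task, {}, {"service_status": "no_service"}, [], [])
```

Checking final state rather than the exact trajectory means any sequence of actions that fixes the phone counts as a success, which suits telecom tasks where several repair orders are equally valid.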

Ablation modes

τ²-bench's three-way ablation separates reasoning from coordination failures ^[1]:

Mode	Setup	What it isolates
Default	Both agent and user have their tools and must coordinate	Full dual-control performance
No-User	Agent gets a ticket and controls all tools (including the user's)	Reasoning and tool use only
Oracle Plan	Agent is handed the exact tool-call sequence	Coordination only, with reasoning removed

The No-User vs Default gap isolates the cost of talking a user through actions. The Oracle Plan vs Default gap isolates the cost of figuring out what to do.
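The three modes differ only in who holds which tools and whether the plan is supplied. A minimal sketch of that configuration is shown here. `ModeConfig`, `configure`, and the mode strings are hypothetical names for illustration, and the tool sets are a small sample of the tool names mentioned earlier rather than the full telecom tool list.

```python
from dataclasses import dataclass

# Small sample of tools named in the article; the real sets are larger.
AGENT_TOOLS = frozenset({"get_customer_by_id", "enable_roaming"})
USER_TOOLS = frozenset({"toggle_airplane_mode", "get_status_bar"})


@dataclass(frozen=True)
class ModeConfig:
    agent_tools: frozenset   # tools the agent may call
    user_tools: frozenset    # tools the simulated user may call
    user_in_loop: bool       # whether a simulated user takes part
    plan_given: bool         # whether the agent is handed the solution sequence


def configure(mode: str) -> ModeConfig:
    """Map an ablation mode name to its tool and participant setup."""
    if mode == "default":
        return ModeConfig(AGENT_TOOLS, USER_TOOLS, True, False)
    if mode == "no_user":
        # The agent takes over the user's tools and works from a ticket.
        return ModeConfig(AGENT_TOOLS | USER_TOOLS, frozenset(), False, False)
    if mode == "oracle_plan":
        # Same split as default, but the tool-call sequence is provided.
        return ModeConfig(AGENT_TOOLS, USER_TOOLS, True, True)
    raise ValueError(f"unknown mode: {mode}")


def gap(score_a: float, score_b: float) -> float:
    """Difference in pass^1 between two modes, in percentage points."""
    return round(score_a - score_b, 2)
```

Given measured pass^1 scores, `gap(no_user, default)` would estimate the cost of coordinating with a user and `gap(oracle_plan, default)` the cost of planning.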

Results from the original paper

The paper evaluates four models with a gpt-4.1-2025-04-14 user simulator, four trials per task, temperature 0 ^[1].

Headline pass^1 numbers

Model	Retail (pass^1)	Airline (pass^1)	Telecom (pass^1)
gpt-4.1 (2025-04-14)	~74%	~56%	34%
gpt-4.1-mini (2025-04-14)	not headlined	not headlined	~50%
o4-mini (2025-04-16)	not headlined	not headlined	42%
claude-3-7-sonnet (2025-02-19)	not headlined	not headlined	49%

The most cited finding is that gpt-4.1, the strongest model on retail, drops sharply on telecom (from about 74% to 34% pass^1). The other three models do relatively better on telecom: o4-mini, claude-3-7-sonnet, and gpt-4.1-mini land between 42% and roughly 50% on first attempts ^[1]. claude-3-7-sonnet's pass^1 on telecom (49%) matches its airline score, but its pass^k decays faster on telecom as k grows, indicating less consistency.

What the ablation showed

In telecom, switching from No-User to Default (forcing the agent to guide a user instead of doing everything itself) dropped pass^1 by 18 points for gpt-4.1 and 25 points for o4-mini ^[1]. That gap is the paper's key empirical claim: even with the right tools and knowledge, communicating with a partial-information partner is a substantial bottleneck. Oracle Plan showed a smaller but real gap above Default for both models, with o4-mini benefiting more from a handed plan than gpt-4.1, suggesting o4-mini is relatively better at executing a known sequence than at planning one.

User simulator reliability

Constraining the simulator with structured tools dramatically reduces simulator errors. Manual annotation of 50 telecom conversations found 8 errors total (16%), only 3 task-critical (6%). On airline that figure was 47% total and 13% critical; on retail, 40% total and 12% critical ^[1].

Releases and version history

The Sierra repository hosts a unified codebase covering τ-bench, τ²-bench, and the τ³-bench voice and knowledge extensions ^[2].

Version	Date	Notable changes
0.1.0	2025-06-12	Initial public release with the arXiv paper. Mock, airline, retail, and telecom domains.
0.1.2	2025-07-17	Editable install; custom data directories via TAU2_DATA_DIR; tau2 check-data.
0.1.3	2025-08-26	Stability fixes for LLM argument parsing.
0.2.0	2025-10-06	Web leaderboard at tau-bench.com with trajectory browsing.
0.2.1	2025-11	Gymnasium integration for RL; tau2 play interactive mode; standardized splits.

In early 2026 Sierra published τ³-bench, which adds τ-Knowledge (testing agents over large internal document collections) and τ-Voice (full-duplex voice evaluation across 278 retail, airline, and telecom tasks). Under clean audio the best voice agents reach roughly 54%, but realistic audio with interruptions widens the gap between voice and text agents to 16 to 28 percentage points ^[4].

τ²-Bench-Verified and forks

In late 2025 the Amazon AGI team published τ²-Bench-Verified, a corrected fork addressing places where the original task definitions, expected actions, or evaluation criteria did not match the stated policies or database contents ^[5]. A separate community fork, AGI-Eval-Official/tau2-bench-revised, provides another set of fixes ^[6]. Sierra's own October 2025 web leaderboard release incorporated 75+ task fixes, so reported numbers across forks are not directly comparable. Third parties such as Artificial Analysis run their own evaluations on the telecom domain and publish a separate leaderboard covering over 400 models, with their methodology differing in detail from tau-bench.com ^[7].

Comparison with related benchmarks

Benchmark	Focus	Difference from τ²-bench
τ-bench (Yao et al., 2024)	Customer service via tool use	Single-control: only the agent has tools. τ²-bench adds dual-control telecom.
WebArena	Web navigation in a sandboxed browser	One agent acting on web pages; no simulated user with its own tools.
AgentBench	Broad agent capability suite	Heterogeneous tasks rather than coordination-focused dialogue.
ToolBench / API-Bank	Tool selection and API calling	Static tool sequences without an active user.
GAIA	General assistant questions	Fact retrieval and reasoning, not multi-turn coordination.
FlowBench	Workflow-guided planning	Explicit workflow in prompt; complementary to τ²-bench.
MultiAgentBench	Pure multi-agent collaboration	Symmetric agents with no asymmetric agent-user roles.

Limitations

The dual-control mechanism only exists in telecom; airline and retail still inherit τ-bench's setup, in which the user has no tools. Domain extension is still mostly manual. The benchmark does not explicitly model the expert-novice gap inherent to real customer support, where the agent must infer the user's mental model and adapt its explanations ^[1]. Because the user simulator is itself an LLM, evaluating very strong agents eventually runs into the question of whether the simulator can keep up. Default mode runs are expensive: pairing a gpt-4.1 agent with a gpt-4.1 user simulator costs about $0.14 per task, and one full pass over all domains is around $40 ^[1].
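The cost figure can be cross-checked against the task counts given earlier. The short calculation below is a back-of-the-envelope sketch: it assumes the quoted per-task cost applies uniformly across domains, and the four-trial total is an extrapolation from the paper's evaluation setup, not a number reported in the source.

```python
# Task counts per domain, as listed in the article.
TASKS = {"telecom": 114, "retail": 115, "airline": 50}
COST_PER_TASK_USD = 0.14  # approximate figure; assumed uniform across domains

total_tasks = sum(TASKS.values())
cost_one_pass = total_tasks * COST_PER_TASK_USD

# Extrapolation: four trials per task, as in the paper's evaluation setup.
cost_four_trials = 4 * cost_one_pass

print(f"{total_tasks} tasks, about ${cost_one_pass:.2f} per pass")
print(f"about ${cost_four_trials:.2f} for four trials per task")
```

The single-pass total of roughly $39 over 279 tasks is consistent with the "around $40" figure.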

References

[1] Barres, V., Dong, H., Ray, S., Si, X., and Narasimhan, K. (2025). τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment. arXiv:2506.07982. https://arxiv.org/abs/2506.07982
[2] Sierra Research. tau2-bench GitHub repository. https://github.com/sierra-research/tau2-bench
[3] Yao, S., Shinn, N., Razavi, P., and Narasimhan, K. (2024). τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. arXiv:2406.12045. https://arxiv.org/abs/2406.12045
[4] Sierra. τ³-Bench: Advancing agent evaluation to knowledge and voice. https://sierra.ai/blog/bench-advancing-agent-benchmarking-to-knowledge-and-voice
[5] Amazon AGI. tau2-bench-verified. https://github.com/amazon-agi/tau2-bench-verified
[6] AGI-Eval-Official. tau2-bench-revised. https://github.com/AGI-Eval-Official/tau2-bench-revised
[7] Artificial Analysis. τ²-Bench Telecom Benchmark Leaderboard. https://artificialanalysis.ai/evaluations/tau2-bench
