Tau2-bench
Last reviewed
May 10, 2026
Sources
7 citations
Review status
Source-backed
Revision
v2 · 2,455 words
| τ²-bench | |
|---|---|
| Full name | τ²-bench: Evaluating Conversational Agents in a Dual-Control Environment |
| Abbreviation | τ²-bench, Tau2-bench |
| First release | 2025-06-12 (v0.1.0) |
| Latest version | 0.2.1 (November 2025) |
| Paper | arXiv:2506.07982 (2025-06-09) |
| Authors | Victor Barres, Honghua Dong, Soham Ray, Xujie Si, Karthik Narasimhan |
| Organization | Sierra, with University of Toronto and the Vector Institute |
| Type | Agent evaluation, conversational AI, tool use |
| Modality | Text dialogue with structured tool calls |
| Domains | Airline, Retail, Telecom, Mock |
| Tasks | Telecom: 114 (sampled from 2,285); Retail: 115; Airline: 50 |
| Metric | pass^k, pass^1, database assertions, action matching |
| Languages | English |
| Best reported pass^1 (paper) | Retail ~74% (gpt-4.1); Airline ~56% (gpt-4.1); Telecom 49% (claude-3-7-sonnet) |
| Saturated | No |
| Code | sierra-research/tau2-bench |
| Predecessor | τ-bench (Yao et al., 2024, arXiv:2406.12045) |
| Successor | τ³-bench (Knowledge and Voice extensions) |
| License | MIT |
τ²-bench (also written Tau2-bench or τ^2-bench) is a benchmark for evaluating conversational AI agents in dual-control environments, where both the agent and a simulated user can call tools to read from and write to a shared world state. It was introduced in a June 2025 paper from Sierra (the customer-experience AI startup co-founded by Bret Taylor and Clay Bavor) with researchers at the University of Toronto and the Vector Institute, and is released under an MIT license [1][2]. τ²-bench is the direct successor to τ-bench (Yao et al., 2024) and adds a new telecom troubleshooting domain alongside revised airline and retail domains [3]. The squared notation captures the move from a single locus of control (the agent) to two (agent and user), modeled as a Decentralized Partially Observable Markov Decision Process (Dec-POMDP) [1].
Most agent benchmarks before τ²-bench, including τ-bench, WebArena, AgentBench, and ToolSandbox, treat the user as a passive provider of preferences. Only the agent has tools. This does not match real customer-service work. In a telecom support call the user has to toggle airplane mode or read off the status bar, while the agent runs database queries on a system the user cannot see [1]. The agent has to drive troubleshooting from incomplete information, narrate clearly enough for a non-expert to follow, and interpret what the user reports back.
The authors also flag a practical concern: τ-bench's user simulator was error-prone. Manual annotation of 100 airline and 50 retail conversations found simulator error rates of 47% and 40% respectively, with task-critical errors in 13% and 12% [1]. The new telecom domain drops simulator errors to 16% total and 6% critical.
The authors are Victor Barres (Sierra) and Honghua Dong (Sierra and Toronto/Vector, internship), who contributed equally, plus Soham Ray (Sierra), Xujie Si (Toronto/Vector), and Karthik Narasimhan (Sierra) [1]. Narasimhan also co-authored the original τ-bench and was a Princeton professor before joining Sierra as Head of Research. Sierra was founded in 2023 by Bret Taylor (former co-CEO of Salesforce, chair of the OpenAI board) and Clay Bavor (former head of Google Labs).
τ²-bench formalizes the interaction as a Dec-POMDP defined by the tuple (S, {A_i}, {O_i}, T, R, U, M), where i ranges over the agent and user [1].
| Component | Definition | Telecom example |
|---|---|---|
| Message space M | Messages exchanged between agent and user | User: "I cannot use mobile data." |
| Global state S | S_world (agent DB + user DB) crossed with S_history | CRM data on agent side, phone status on user side |
| Action space A_i | Either a tool call or a message (one player acts per turn) | Agent: get_customer_by_id(id); user: toggle_airplane_mode() |
| Observation O_i | Tool outputs or messages from the other player | User sees "Airplane Mode is now OFF" |
| Transition T | Joint action maps state to new state and observations | enable_roaming() updates the line so the phone can roam |
| Reward R | Maps state to [0, 1], 1 if assertions hold | Reward 1 if mocked phone reports "connected" |
| Instructions U | Scenario for user simulator and policy doc for agent | Persona plus telecom troubleshooting policy |
The framing makes explicit what each player cannot see. The agent cannot check the user's airplane-mode state directly; it has to ask. The user cannot see the carrier's CRM record. Each side has its own database, its own tools, and its own partial view of the shared world [1].
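The split between the two sides can be made concrete with a toy model. The following is a minimal sketch, not the τ²-bench implementation: the `World` class, the `act` and `reward` helpers, and the tool bodies are all hypothetical, and only the tool names and the "Airplane Mode is now OFF" observation come from the examples above.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class World:
    """Toy shared world state, split into the side each player can reach."""
    agent_db: Dict[str, object] = field(default_factory=dict)  # e.g. CRM record
    user_db: Dict[str, object] = field(default_factory=dict)   # e.g. phone status


# Hypothetical tool bodies: each mutates the world and returns an observation.
def enable_roaming(world: World) -> str:
    world.agent_db["roaming_enabled"] = True
    return "Roaming enabled on the line"


def toggle_airplane_mode(world: World) -> str:
    world.user_db["airplane_mode"] = not world.user_db["airplane_mode"]
    state = "ON" if world.user_db["airplane_mode"] else "OFF"
    return f"Airplane Mode is now {state}"


AGENT_TOOLS: Dict[str, Callable[[World], str]] = {"enable_roaming": enable_roaming}
USER_TOOLS: Dict[str, Callable[[World], str]] = {"toggle_airplane_mode": toggle_airplane_mode}


def act(world: World, player: str, tool: str) -> str:
    """Run one tool call; a player may only call tools from its own set."""
    tools = AGENT_TOOLS if player == "agent" else USER_TOOLS
    if tool not in tools:
        raise PermissionError(f"{player} cannot call {tool}")
    return tools[tool](world)


def reward(world: World) -> float:
    """Return 1.0 only if the toy final-state assertions hold."""
    ok = (world.agent_db.get("roaming_enabled") is True
          and world.user_db["airplane_mode"] is False)
    return 1.0 if ok else 0.0


world = World(agent_db={"roaming_enabled": False}, user_db={"airplane_mode": True})
obs_user = act(world, "user", "toggle_airplane_mode")   # only the user can do this
obs_agent = act(world, "agent", "enable_roaming")       # only the agent can do this
final_reward = reward(world)
```

In this sketch the agent cannot call `toggle_airplane_mode` at all, so the only way to reach a rewarded state is to get the user to do it, which is the coordination problem the benchmark targets.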
τ²-bench ships with three primary domains plus a mock domain. Retail and airline are inherited from τ-bench with task-quality fixes; telecom is new.
| Domain | Agent databases | Agent tools | User tools | Tasks |
|---|---|---|---|---|
| Retail | 500 users, 50 products, 1,000 orders | 7 write, 6 read | None | 115 |
| Airline | 500 users, 300 flights, 2,000 reservations | 6 write, 6 read | None | 50 |
| Telecom | 5 plans, 9 lines, 4 customers | 6 write, 7 read | 15 write, 15 read | 114 (from 2,285 generated) |
| Mock | Synthetic | Configurable | Configurable | Framework testing |
In retail and airline the user has no tools (as in τ-bench), so dual control is not active there; they are kept as comparison points. Telecom is the only domain where the user has tools: 15 write tools (toggle_airplane_mode, toggle_data, etc.) and 15 read tools (get_status_bar, get_network_status, etc.), mimicking an end user fiddling with a smartphone [1].
Telecom tasks are organized around three intents that form a difficulty hierarchy: service_issue (mean 2.31 actions to solve), mobile_data_issue (mean 4.31), and mms_issue (mean 6.00) [1]. mms_issue is hardest because resolving an MMS problem often requires first ruling out underlying service or data issues. Each task also comes with an optional persona: Easy (a 41-year-old office administrator who is patient and confirms instructions), Hard (a 64-year-old retired librarian who finds technology confusing and gets flustered), or None [1].
One of the four headline contributions is a programmatic task generator that builds composite tasks from atomic subtasks. Each atomic subtask is defined by three sets of functions [1]: initialization functions that set up the broken state (e.g. set_airplane_mode(True)), solution functions that fix the issue (e.g. toggle_airplane_mode()), and assertion functions that check the final state (e.g. assert_service_status("connected")).
Mutually exclusive subtasks live in the same group. A composite task picks at most one subtask per group and concatenates their function calls; the framework verifies the composite is solvable and is not already solved before the solution functions run [1]. The authors built 15 atomic subtask groups for telecom, yielding 2,285 valid combinations, then subsampled 114 tasks for a balanced distribution over intents and difficulty (subtask counts range from 2 to 9). This is a real shift away from τ-bench, where each task was hand-written.
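The composition rule described here (at most one subtask per group, then a solvability check) can be sketched in a few lines. This is an illustrative toy under stated assumptions, not the paper's generator: the `Subtask` class, the two groups, the subtask names, and the string-valued phone state are all hypothetical.

```python
from dataclasses import dataclass
from itertools import product
from typing import Dict, List, Tuple

Phone = Dict[str, str]
HEALTHY: Phone = {"airplane_mode": "off", "network_mode": "4g"}


@dataclass(frozen=True)
class Subtask:
    name: str
    key: str      # field of the toy phone state this subtask touches
    broken: str   # value written by the initialization function
    fixed: str    # value written by the solution function

    def init(self, phone: Phone) -> None:
        phone[self.key] = self.broken

    def solve(self, phone: Phone) -> None:
        phone[self.key] = self.fixed

    def check(self, phone: Phone) -> bool:
        return phone[self.key] == self.fixed


# Subtasks in one group write the same field, so they are mutually exclusive.
GROUPS: Dict[str, List[Subtask]] = {
    "airplane": [Subtask("airplane_on", "airplane_mode", "on", "off")],
    "network": [
        Subtask("network_2g", "network_mode", "2g", "4g"),
        Subtask("network_3g", "network_mode", "3g", "4g"),
    ],
}


def is_valid(picked: Tuple[Subtask, ...]) -> bool:
    """A composite must be broken after init and fixed after the solutions run."""
    phone = dict(HEALTHY)
    for sub in picked:
        sub.init(phone)
    if all(sub.check(phone) for sub in picked):
        return False  # already solved before any solution function ran
    for sub in picked:
        sub.solve(phone)
    return all(sub.check(phone) for sub in picked)


def generate(groups: Dict[str, List[Subtask]]) -> List[Tuple[Subtask, ...]]:
    """Enumerate composites that pick nothing or exactly one subtask per group."""
    tasks = []
    for choice in product(*[[None, *subs] for subs in groups.values()]):
        picked = tuple(sub for sub in choice if sub is not None)
        if picked and is_valid(picked):
            tasks.append(picked)
    return tasks


tasks = generate(GROUPS)
task_names = [[sub.name for sub in task] for task in tasks]
```

With one group of one subtask and one group of two, the enumeration yields (1 + 1) × (2 + 1) − 1 = 5 composites. The same counting, applied to 15 groups with the paper's validity filter, is what produces the much larger pool that the authors subsample from.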
τ²-bench inherits pass^k from τ-bench. pass^k is not the standard pass@k from code-generation benchmarks; the two are easy to confuse [3]:
| Metric | Definition | What it measures |
|---|---|---|
| pass@k | At least one of k attempts succeeds | Best-of-k performance |
| pass^k | All k independent attempts succeed | Reliability across repeated runs |
For a model with per-attempt success probability p on a task, and attempts that succeed independently, pass^k equals p^k. A model that succeeds 90% of the time on every task therefore has a pass^8 of only about 43% (0.9^8 ≈ 0.43). τ-bench used pass^k because real customer-service deployments cannot tolerate occasional failures: an agent that gets a refund right 9 times out of 10 is not shippable when each user only gets one shot.
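The difference between the two metrics is easiest to see on per-task trial counts. The sketch below uses the standard combinatorial estimators over n recorded trials per task; treating this as how the τ²-bench code computes pass^k is an assumption, and the function names and example counts are made up for illustration.

```python
from math import comb
from typing import List


def pass_hat_k(successes: List[int], n: int, k: int) -> float:
    """pass^k: mean over tasks of C(c, k) / C(n, k), the chance that k trials
    drawn without replacement from the n recorded trials all succeeded."""
    return sum(comb(c, k) / comb(n, k) for c in successes) / len(successes)


def pass_at_k(successes: List[int], n: int, k: int) -> float:
    """pass@k: mean over tasks of 1 - C(n - c, k) / C(n, k), the chance that
    at least one of k drawn trials succeeded."""
    return sum(1 - comb(n - c, k) / comb(n, k) for c in successes) / len(successes)


# Hypothetical results: successes out of n = 4 trials on three tasks.
successes = [4, 2, 0]
n_trials = 4

reliability_1 = pass_hat_k(successes, n_trials, 1)   # (1 + 0.5 + 0) / 3
reliability_4 = pass_hat_k(successes, n_trials, 4)   # only the first task counts
best_of_4 = pass_at_k(successes, n_trials, 4)        # first two tasks count

# Independent attempts with p = 0.9 on a single task: pass^8 = 0.9 ** 8.
independent_pass_8 = 0.9 ** 8
```

In this toy data the task solved in 2 of 4 trials counts fully toward pass@4 but contributes nothing to pass^4, which is why the two metrics can tell opposite stories about the same runs.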
Each task can specify any subset of the following criteria [1]: database check (final agent DB vs reference), status assertions on the final world state, natural language assertions graded by an LLM (such as "the agent diagnosed the cause of the issue"), communication info check (the agent told the user specific facts), and action matching (every solution function appears in the trajectory). Telecom uses only assertion functions; the others carry over from τ-bench style retail and airline tasks.
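Two of these criteria are simple enough to sketch directly. The code below is a hedged illustration, not the benchmark's evaluator: the function names are hypothetical, and combining criteria by requiring all of them to hold is an assumption based on the reward definition (1 if assertions hold) rather than something taken from the codebase.

```python
from typing import Dict, List


def db_check(final_db: Dict[str, object], reference_db: Dict[str, object]) -> bool:
    """Database check: the final agent DB must equal the reference DB."""
    return final_db == reference_db


def action_match(trajectory: List[str], solution: List[str]) -> bool:
    """Action matching: every solution function name appears in the trajectory."""
    return set(solution).issubset(trajectory)


def task_reward(results: Dict[str, bool]) -> float:
    """Assumed combination rule: reward is 1.0 only if every criterion holds."""
    return 1.0 if all(results.values()) else 0.0


# Hypothetical trajectory of tool-call names from one conversation.
trajectory = ["get_customer_by_id", "enable_roaming", "toggle_airplane_mode"]
results = {
    "db": db_check({"roaming_enabled": True}, {"roaming_enabled": True}),
    "actions": action_match(trajectory, ["enable_roaming", "toggle_airplane_mode"]),
}
reward_value = task_reward(results)
```

Note that this action-matching sketch ignores call order and arguments; a stricter evaluator could compare both.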
τ²-bench's three-way ablation separates reasoning from coordination failures [1]:
| Mode | Setup | What it isolates |
|---|---|---|
| Default | Both agent and user have their tools and must coordinate | Full dual-control performance |
| No-User | Agent gets a ticket and controls all tools (including the user's) | Reasoning and tool use only |
| Oracle Plan | Agent is handed the exact tool-call sequence | Coordination only, with reasoning removed |
The No-User vs Default gap isolates the cost of talking a user through actions. The Oracle Plan vs Default gap isolates the cost of figuring out what to do.
The paper evaluates four models with a gpt-4.1-2025-04-14 user simulator, four trials per task, temperature 0 [1].
| Model | Retail (pass^1) | Airline (pass^1) | Telecom (pass^1) |
|---|---|---|---|
| gpt-4.1 (2025-04-14) | ~74% | ~56% | 34% |
| gpt-4.1-mini (2025-04-14) | not headlined | not headlined | ~50% |
| o4-mini (2025-04-16) | not headlined | not headlined | 42% |
| claude-3-7-sonnet (2025-02-19) | not headlined | not headlined | 49% |
The most cited finding is that gpt-4.1, the strongest model on retail, drops sharply on telecom (74% to 34% on pass^1). Smaller and reasoning-tuned models do relatively better on telecom: gpt-4.1-mini, o4-mini, and claude-3-7-sonnet all land near 42 to 50% on first attempts [1]. claude-3-7-sonnet's pass^1 on telecom (49%) matches its airline score, but its pass^k decays faster on telecom as k grows, indicating less consistency.
In telecom, switching from No-User to Default (forcing the agent to guide a user instead of doing everything itself) dropped pass^1 by 18 points for gpt-4.1 and 25 points for o4-mini [1]. That gap is the paper's key empirical claim: even with the right tools and knowledge, communicating with a partial-information partner is a substantial bottleneck. Oracle Plan showed a smaller but real gap above Default for both models, with o4-mini benefiting more from a handed plan than gpt-4.1, suggesting o4-mini is relatively better at executing a known sequence than at planning one.
Constraining the simulator with structured tools dramatically reduces simulator errors. Manual annotation of 50 telecom conversations found 8 errors total (16%), only 3 of them task-critical (6%). For comparison, the airline figures were 47% total and 13% critical, and the retail figures 40% total and 12% critical [1].
The Sierra repository hosts a unified codebase covering τ-bench, τ²-bench, and the τ³-bench voice and knowledge extensions [2].
| Version | Date | Notable changes |
|---|---|---|
| 0.1.0 | 2025-06-12 | Initial public release with the arXiv paper. Mock, airline, retail, and telecom domains. |
| 0.1.2 | 2025-07-17 | Editable install; custom data directories via TAU2_DATA_DIR; tau2 check-data. |
| 0.1.3 | 2025-08-26 | Stability fixes for LLM argument parsing. |
| 0.2.0 | 2025-10-06 | Web leaderboard at tau-bench.com with trajectory browsing. |
| 0.2.1 | 2025-11 | Gymnasium integration for RL; tau2 play interactive mode; standardized splits. |
In early 2026 Sierra published τ³-bench, which adds τ-Knowledge (testing agents over large internal document collections) and τ-Voice (full-duplex voice evaluation across 278 retail, airline, and telecom tasks). Under clean audio the best voice agents reach roughly 54%, but realistic audio with interruptions widens the gap between voice and text agents to 16 to 28 percentage points [4].
In late 2025 the Amazon AGI team published τ²-Bench-Verified, a corrected fork addressing places where the original task definitions, expected actions, or evaluation criteria did not match the stated policies or database contents [5]. A separate community fork, AGI-Eval-Official/tau2-bench-revised, provides another set of fixes [6]. Sierra's own October 2025 web leaderboard release incorporated 75+ task fixes, so reported numbers across forks are not directly comparable. Third parties such as Artificial Analysis run their own evaluations on the telecom domain and publish a separate leaderboard covering over 400 models, with their methodology differing in detail from tau-bench.com [7].
| Benchmark | Focus | Difference from τ²-bench |
|---|---|---|
| τ-bench (Yao et al., 2024) | Customer service via tool use | Single-control: only the agent has tools. τ²-bench adds dual-control telecom. |
| WebArena | Web navigation in a sandboxed browser | One agent acting on web pages; no simulated user with its own tools. |
| AgentBench | Broad agent capability suite | Heterogeneous tasks rather than coordination-focused dialogue. |
| ToolBench / API-Bank | Tool selection and API calling | Static tool sequences without an active user. |
| GAIA | General assistant questions | Fact retrieval and reasoning, not multi-turn coordination. |
| FlowBench | Workflow-guided planning | Explicit workflow in prompt; complementary to τ²-bench. |
| MultiAgentBench | Pure multi-agent collaboration | Symmetric agents with no asymmetric agent-user roles. |
The dual-control mechanism only exists in telecom; airline and retail still inherit τ-bench's setup, in which the user has no tools. Domain extension is still mostly manual. The benchmark does not explicitly model the expert-novice gap inherent to real customer support, where the agent must infer the user's mental model and adapt its explanations [1]. Because the user simulator is itself an LLM, evaluating very strong agents eventually runs into the question of whether the simulator can keep up. Default mode runs are expensive: pairing a gpt-4.1 agent with a gpt-4.1 user simulator costs about $0.14 per task, and one full pass over all domains is around $40 [1].
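The cost figure can be sanity-checked against the task counts given earlier. This is a back-of-the-envelope sketch that assumes "one full pass" means one trial per task across retail, airline, and telecom, and that the $0.14 average applies uniformly to every domain; neither assumption is confirmed by the source.

```python
# Task counts per domain, as listed in the domain table.
TASKS = {"retail": 115, "airline": 50, "telecom": 114}
COST_PER_TASK_USD = 0.14  # reported average for gpt-4.1 agent + gpt-4.1 user

total_tasks = sum(TASKS.values())
one_pass_usd = round(total_tasks * COST_PER_TASK_USD, 2)

# The paper's protocol uses four trials per task, so a full evaluation
# under the same assumptions costs four times as much.
four_trials_usd = round(4 * one_pass_usd, 2)
```

Under these assumptions one pass covers 279 tasks at about $39, in line with the quoted figure of around $40, and a four-trial evaluation comes to roughly $156.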