τ-bench (Tau-bench), short for Tool-Agent-User Interaction Benchmark, is an AI benchmark designed to evaluate language agents' ability to complete complex tasks through realistic interactions with simulated users and domain-specific tools. Released on June 17, 2024, by Sierra Research and Princeton University, τ-bench addresses critical gaps in existing benchmarks by testing agents on their ability to follow domain-specific rules, maintain context over long conversations, and effectively coordinate with users to achieve goals. The paper was accepted as a poster at the International Conference on Learning Representations (ICLR) 2025.
Unlike earlier agent benchmarks such as WebArena, SWE-bench, and AgentBench, which evaluate agents in single-round human-agent interactions, τ-bench requires agents to handle multiple dynamic exchanges where information is gathered incrementally over the course of a conversation. This design mirrors the way customer service agents operate in practice, where a single interaction can involve identity verification, information retrieval, policy checking, multi-step tool calls, and explicit user confirmation before taking action.
| Attribute | Value |
|---|---|
| Full name | Tau-bench: Tool-Agent-User Interaction Benchmark |
| Abbreviation | τ-bench |
| Description | A benchmark for evaluating AI agents' ability to complete complex tasks through realistic tool-agent-user interactions in real-world domains |
| Release date | 2024-06-17 |
| Latest version | τ³-bench (task-corrected) |
| Benchmark updated | 2025 |
| Authors | Shunyu Yao, Noah Shinn, Pedram Razavi, Karthik Narasimhan |
| Organization | Sierra Research, Princeton University |
| Type | Agent Evaluation, Multi-turn Interaction |
| Modality | Text, API Calls |
| Task format | Conversational task completion |
| Number of tasks | 165 total (115 retail, 50 airline) |
| Evaluation metric | pass^k, Database State Comparison |
| Domains | Airline, Retail |
| Languages | English |
| Human performance | Not reported |
| SOTA score | 86.2% (Retail pass^1), 70.0% (Airline pass^1) |
| SOTA model | Claude Sonnet 4.5 |
| SOTA date | 2025 |
| Saturated | No (airline); approaching saturation (retail) |
| Website | Official website |
| Paper | arXiv:2406.12045 |
| GitHub | sierra-research/tau-bench |
| License | MIT |
| Conference | ICLR 2025 (Poster) |
| Successor | τ²-bench, τ³-bench |
τ-bench represents a paradigm shift in AI agent evaluation, moving beyond simple task completion to assess agents' performance in dynamic, multi-turn conversations that mirror real-world applications. The benchmark emulates scenarios where an AI agent must interact with both users (simulated by large language models) and domain-specific API tools while adhering to complex policy guidelines.
Unlike traditional benchmarks that focus on single-turn interactions or isolated capabilities, τ-bench introduces:
- Multi-turn conversations with an LLM-simulated user whose goals are hidden from the agent and revealed only through dialogue
- Domain-specific API tools and policy documents that the agent must follow while acting
- Objective, automated evaluation that compares the final database state against an annotated goal state
- The pass^k metric, which measures reliability across repeated independent trials
The benchmark revealed that even state-of-the-art function calling agents (like GPT-4o) succeed on less than 50% of the tasks and are quite inconsistent (pass^8 < 25% in retail).
By mid-2024, large language models had demonstrated strong capabilities in isolated tasks like text generation, code completion, and question answering. However, deploying these models as autonomous agents in real-world settings revealed a gap between benchmark performance and actual reliability. Existing benchmarks at the time tested models on well-defined problems with no user in the loop: SWE-bench measured software engineering ability, HumanEval tested code generation, and AgentBench evaluated agents across several interactive environments. None of these benchmarks required agents to interact with a human user over multiple turns while simultaneously calling tools and following complex policies.
Sierra, the conversational AI company co-founded by Bret Taylor (former Salesforce co-CEO and OpenAI board member) and Clay Bavor (former Google executive), had direct experience with the gap between benchmark scores and real-world agent performance. Sierra's research team, led by Karthik Narasimhan, identified three capabilities that existing benchmarks failed to measure:
- Following domain-specific rules and policies consistently
- Maintaining context over long, multi-turn conversations
- Coordinating with users to gather information, confirm details, and achieve their goals
Shunyu Yao, the lead author of τ-bench, is known for creating the ReAct framework (ICLR 2023), which introduced the paradigm of interleaving reasoning traces with actions in language models. He also created WebShop (NeurIPS 2022), an earlier e-commerce interaction benchmark, and co-developed SWE-agent. At the time of τ-bench's release, Yao was affiliated with both Sierra Research and Princeton University. Noah Shinn, another co-author, is known for his work on Reflexion, a framework for language agent self-reflection. Pedram Razavi contributed from Sierra Research, and Karthik Narasimhan, a professor at Princeton University, leads the Princeton Language and Intelligence lab where much of this agent research originated.
τ-bench employs a modular three-component architecture that simulates realistic customer service interactions. Each component plays a distinct role in creating a closed-loop evaluation environment.
| Component | Role | Implementation Details |
|---|---|---|
| User Simulator | Generates realistic user messages based on hidden instructions | LLM-based (GPT-4, Claude, etc.); guided by task-specific instructions that define user identity, intent, and preferences |
| Agent System | Processes user requests, calls tools, and follows policies | Supports multiple architectures: tool-calling (function calling), ReAct, and Act |
| Environment | Provides API tools, maintains database state, and enforces domain rules | Domain-specific databases with realistic data; tools exposed as callable functions |
The interaction loop works as follows: the user simulator initiates a conversation with a request (for example, "I need to cancel my flight to Denver"). The agent responds by calling tools to look up information, asking clarifying questions, or taking actions. The user simulator replies based on its hidden instruction set, which specifies the user's identity, preferences, and constraints. This back-and-forth continues until the agent resolves the request or the conversation reaches a maximum number of turns.
Critically, the user instructions are hidden from the agent. The agent only sees the user's natural language messages and must infer what the user wants through conversation, just as a real customer service agent would.
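A minimal sketch of this closed loop is shown below. The `UserSimulator`, `Agent`, and `Environment` objects and their methods are illustrative stand-ins, not the repository's actual API; the real implementation differs in detail.

```python
# Minimal sketch of a tau-bench-style interaction loop.
# All class and method names here are hypothetical, not the repository's API.

MAX_TURNS = 30  # conversations are capped at a maximum number of turns


def run_episode(user_sim, agent, env, max_turns=MAX_TURNS):
    """Run one agent/user/environment conversation until the user ends it."""
    history = []
    user_msg = user_sim.first_message()  # e.g. "I need to cancel my flight to Denver"
    for _ in range(max_turns):
        history.append({"role": "user", "content": user_msg})

        # The agent may emit tool calls (reads/writes against the env database)
        # or a natural-language reply to the user.
        action = agent.step(history)
        if action.tool_calls:
            for call in action.tool_calls:
                observation = env.invoke(call.name, **call.arguments)
                history.append({"role": "tool", "name": call.name,
                                "content": observation})
            continue  # let the agent see tool results before replying

        history.append({"role": "assistant", "content": action.message})

        # The user simulator replies based on its hidden instruction,
        # which the agent never sees directly.
        user_msg = user_sim.respond(history)
        if user_msg == "###STOP###":  # stop token signalling the end of the conversation
            break

    return env.database_state(), history
```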
Each task in τ-bench includes a structured user instruction that defines:
- The user's identity (name and account details used for authentication)
- The user's intent (the concrete goal of the interaction)
- Preferences and constraints (for example, a preferred payment method or what to accept if the first choice is unavailable)
The instruction is designed so that only one correct outcome exists under the domain's policy. This constraint is essential for automated evaluation: because there is exactly one valid goal state, the benchmark can verify success by comparing the final database state against the annotated expected state without requiring subjective human judgment.
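The example below illustrates how such a task might be encoded together with its unique goal state. Field names, IDs, and the tool-call annotation are invented for illustration; the repository's actual task schema differs.

```python
# Illustrative shape of a tau-bench-style task; all field names and IDs are hypothetical.
example_task = {
    "user_id": "jane_doe_1234",
    "instruction": (
        "You are Jane Doe. You want to exchange the blue medium t-shirt in your "
        "delivered order for the red large version of the same product. "
        "You do not want any other changes, and you will not accept alternatives."
    ),
    # The unique goal state. In practice it can be expressed as the write actions
    # whose execution produces the expected final database state.
    "expected_actions": [
        {
            "name": "exchange_delivered_order",
            "arguments": {
                "order_id": "#W0001234",
                "item_ids": ["tshirt_blue_m"],
                "new_item_ids": ["tshirt_red_l"],
                "payment_method_id": "credit_card_0001",
            },
        }
    ],
    # Information the agent must communicate to the user, if any.
    "expected_outputs": [],
}
```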
The user simulator itself is powered by an LLM (typically GPT-4 or Claude). While the instructions are synthetic, the utterances the simulator generates are open-ended and natural-sounding. The benchmark supports several user simulation strategies:
| Strategy | Description | Use Case |
|---|---|---|
| LLM (default) | Direct LLM-generated responses based on user instructions | Standard evaluation |
| ReAct | Adds explicit reasoning steps before generating user responses | Complex reasoning tasks |
| Verify | Includes an LLM verification loop after each response | High-accuracy requirements |
| Reflection | Adds self-correction to improve response quality | Improved consistency |
Human evaluation of the user simulator found that it adhered well to its guidelines across four criteria: adherence to simulator guidelines, adherence to user instructions, correct use of user tools (when applicable), and generation of natural, consistent conversational continuations.
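A rough sketch of how a verification-style strategy from the table above can be layered on top of the plain LLM simulator follows. The `llm_chat` helper and the prompts are placeholders, not the repository's implementation.

```python
# Sketch of a "verify"-style user simulation strategy: generate a candidate
# user message, then ask a second LLM call to check it against the hidden
# instruction, regenerating once if it is flagged. All names are illustrative.

def verified_user_response(llm_chat, instruction, history, max_retries=1):
    """llm_chat(messages) -> str is a placeholder for any chat-completion call."""
    for _ in range(max_retries + 1):
        candidate = llm_chat([
            {"role": "system",
             "content": f"You are simulating this user:\n{instruction}"},
            *history,
        ])
        verdict = llm_chat([
            {"role": "system",
             "content": ("Answer YES if the user message below is consistent with the "
                         "instruction and the conversation so far, otherwise answer NO.")},
            {"role": "user",
             "content": f"Instruction:\n{instruction}\n\nCandidate message:\n{candidate}"},
        ])
        if verdict.strip().upper().startswith("YES"):
            return candidate
    return candidate  # fall back to the last candidate if verification keeps failing
```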
The benchmark covers two primary domains:
The retail domain simulates an e-commerce customer support environment. It contains 115 tasks built on a synthetic database of 500 users, 50 products, and 1,000 orders.
Available API Tools (15 total: 7 write, 8 read)
The retail agent has access to tools for retrieving and modifying customer data:
| Tool Category | Examples | Description |
|---|---|---|
| User lookup | get_user_details | Retrieve user profile information after authentication |
| Order retrieval | get_order_details | Look up order status, items, shipping, and payment |
| Product information | get_product_details | Check product availability, options, and pricing |
| Order cancellation | cancel_pending_order | Cancel orders with status "pending" only |
| Order modification | modify_pending_order | Change shipping address, payment method, or item options on pending orders |
| Returns | return_delivered_order | Process returns on delivered orders within the return window |
| Exchanges | exchange_delivered_order | Exchange delivered items for different options of the same product |
| Profile updates | modify_user_address | Update user shipping or billing addresses |
Key Retail Policies
- The agent must authenticate the user before retrieving or modifying any account data.
- Orders can only be cancelled or modified while their status is "pending".
- Delivered orders can only be returned (within the return window) or exchanged for different options of the same product.
- All details of a modification, return, or exchange must be collected and explicitly confirmed with the user, then executed in a single tool call.
Example Retail Task: A user contacts support wanting to exchange a blue medium t-shirt from their delivered order for a red large version of the same product. The agent must authenticate the user, look up the order, verify the order status is "delivered," check that the red large option is available, collect all exchange details, confirm with the user, and execute the exchange in a single tool call.
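Under the tool names listed above, a successful trajectory for this task reduces to a sequence of calls like the following. The lookup tool, IDs, and argument names are invented for illustration, and the single write call comes last, after explicit user confirmation.

```python
# Hypothetical call sequence for the exchange example; IDs, the lookup tool,
# and argument names are invented for illustration.
trajectory = [
    ("find_user_id_by_email",    {"email": "jane.doe@example.com"}),  # hypothetical lookup tool
    ("get_user_details",         {"user_id": "jane_doe_1234"}),
    ("get_order_details",        {"order_id": "#W0001234"}),          # status must be "delivered"
    ("get_product_details",      {"product_id": "tshirt_001"}),       # confirm red/large is available
    # ... agent summarizes the exchange and asks the user to confirm ...
    ("exchange_delivered_order", {"order_id": "#W0001234",
                                  "item_ids": ["tshirt_blue_m"],
                                  "new_item_ids": ["tshirt_red_l"],
                                  "payment_method_id": "credit_card_0001"}),
]
```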
The airline domain simulates a flight reservation customer service environment. It contains 50 tasks built on a synthetic database of 500 users, 300 flights, and 2,000 reservations.
Available API Tools (13 total: 6 write, 7 read)
The airline agent has access to tools for managing flight reservations:
| Tool Category | Examples | Description |
|---|---|---|
| Flight search | search_direct_flight, search_onestop_flight | Find available direct or connecting flights |
| Reservation lookup | get_reservation_details | Retrieve booking information, passengers, and flight details |
| User lookup | get_user_details | Authenticate and retrieve user profile |
| Booking | book_reservation | Create new flight reservations (max 5 passengers) |
| Modification | update_reservation_flights, update_reservation_passengers | Change flights, cabin class, baggage, insurance, or passenger details |
| Cancellation | cancel_reservation | Cancel reservations subject to policy rules |
Key Airline Policies
- New reservations can include at most five passengers.
- Cancellation eligibility depends on fare class, timing, and travel insurance; for example, a basic economy ticket without insurance cannot be cancelled outside the 24-hour window after booking.
- Flight, cabin, baggage, and passenger changes are made through the reservation update tools, with any fare difference collected or refunded.
- Certain exceptions and compensation depend on the passenger's membership tier.
Example Airline Task: A user calls about a delayed flight and wants to rebook on an earlier connection. The agent must verify the user's identity, check the reservation details, determine if the passenger's membership tier qualifies them for compensation, search for alternative flights, confirm the new itinerary with the user, and process the modification while applying the correct fare difference.
Tasks in τ-bench vary in complexity and are designed to apply pressure in diverse ways:
| Complexity Level | Characteristics | Example |
|---|---|---|
| Simple | Single API call, straightforward request | Check flight status |
| Moderate | Multiple API calls, some reasoning required | Book round-trip with preferences |
| Complex | Many API calls, policy checking, user clarification | Multi-city trip with changes |
| Expert | Edge cases, exception handling, complex policies | Group booking with special needs |
| Complexity Factor | Description | Example |
|---|---|---|
| Long-chain dependencies | Multiple sequential steps where later actions depend on earlier results | Booking a multi-passenger reservation where each passenger has different preferences |
| Buried essential details | Critical information hidden deep in the conversation or policy document | A cancellation policy exception that applies only to gold-tier members with travel insurance |
| Realistic forgetting scenarios | Situations where an agent might lose track of earlier conversation context | A user who mentions their email early in the conversation and later asks the agent to use it for a different purpose |
| Policy conflicts | Requests that conflict with domain rules, requiring the agent to refuse or offer alternatives | A user asking to cancel a basic economy ticket without insurance outside the 24-hour window |
| Compound requests | Multiple distinct actions needed in a single conversation | A user who wants to cancel one order, modify another, and update their address |
τ-bench uses an objective, automated evaluation approach that avoids the subjectivity of LLM-as-judge methods. The process works as follows:
- Each task is annotated with a unique goal database state and, where applicable, the information the agent must communicate to the user.
- The agent converses with the simulated user, reading and writing the domain database through the API tools.
- When the conversation ends, the final database state is compared against the annotated goal state, and the agent's messages are checked for any required outputs.
- The task is scored as a strict binary pass or fail.
This approach is both efficient (no human evaluators or LLM judges needed per evaluation) and faithful (the comparison is deterministic and objective). Partial credit is not awarded; a task either succeeds completely or fails. The designers chose this strict approach because, in real-world customer service, a partial update can create more damage than a refusal.
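A minimal sketch of this check, assuming the environment can serialize its database and each task carries an annotated goal state and a list of required outputs (names are illustrative, not the repository's implementation):

```python
# Sketch of tau-bench-style automated grading. Field names are illustrative.

def grade_episode(final_db_state: dict, agent_messages: list[str], task: dict) -> bool:
    """Binary reward: True only if the database matches the annotated goal state
    AND every required piece of information was communicated to the user."""
    # 1. Deterministic database comparison: no partial credit.
    if final_db_state != task["expected_db_state"]:
        return False

    # 2. Required outputs (e.g. a quoted refund amount) must appear somewhere
    #    in the agent's messages to the user.
    transcript = " ".join(agent_messages).lower()
    return all(required.lower() in transcript
               for required in task.get("expected_outputs", []))
```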
One of τ-bench's most significant contributions is the introduction of the pass^k ("pass hat k") metric, which measures agent reliability across repeated trials.
The standard pass@k metric, widely used in code generation benchmarks like HumanEval, asks: "Did the agent succeed on at least one of k attempts?" This measures peak capability. In contrast, pass^k asks: "Did the agent succeed on all k independent attempts?" This measures consistency.
Formally, for a given task with n independent trials of which c are successful, the per-task estimate is C(c, k) / C(n, k), the probability that k trials drawn at random without replacement are all successes; pass^k is this quantity averaged over all tasks.
The distinction matters enormously for real-world deployment. A customer service agent that resolves a particular type of issue 60% of the time will, over 8 independent instances of that issue, succeed on all 8 only about 1.7% of the time (0.6^8). The pass^k metric captures this compounding unreliability.
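The estimator can be computed directly from per-task trial counts. A short sketch (not the repository's implementation) that mirrors the formula above:

```python
# Pass^k estimator from n trials per task with c successes, averaged over tasks.
from math import comb

def pass_hat_k(results: list[tuple[int, int]], k: int) -> float:
    """results: list of (n_trials, n_successes) pairs, one per task."""
    per_task = []
    for n, c in results:
        if n < k:
            raise ValueError("need at least k trials per task")
        per_task.append(comb(c, k) / comb(n, k))
    return sum(per_task) / len(per_task)

# Example: an agent that succeeds on 5 of 8 trials of every task has
# pass^1 = 0.625 but pass^8 = 0, illustrating the compounding-unreliability point.
print(pass_hat_k([(8, 5)] * 10, 1))  # 0.625
print(pass_hat_k([(8, 5)] * 10, 8))  # 0.0
```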
| Metric | Question Answered | Significance |
|---|---|---|
| pass^1 | What fraction of tasks does the agent solve on a single attempt? | Basic capability measure |
| pass^4 | What fraction of tasks does the agent solve on all 4 independent attempts? | Moderate reliability threshold |
| pass^8 | What fraction of tasks does the agent solve on all 8 independent attempts? | High reliability threshold |
| pass@k (for contrast) | What fraction of tasks does the agent solve on at least one of k attempts? | Peak-capability measure used in code benchmarks |
The benchmark doesn't just test whether an agent can complete a task once; it measures whether it can do so consistently. The original paper showed that GPT-4o achieved a pass^1 of roughly 50% on retail tasks but dropped to approximately 25% on pass^8, cutting its effective success rate roughly in half. This drop revealed that even capable models are highly inconsistent when handling the same type of request across different conversational variations.
τ-bench includes an automatic error identification tool (available in the GitHub repository as auto_error_identification.py) that analyzes failed trajectories to classify errors. The tool categorizes failures along two dimensions:
Fault Assignment (who caused the failure):
- Agent: the failure originated in the agent's own reasoning, tool use, or policy handling
- User simulator: the simulated user deviated from its hidden instruction or provided inconsistent information
Fault Type (what kind of error occurred):
| Fault Type | Description | Frequency |
|---|---|---|
| Wrong Action | Agent selected an incorrect tool or action sequence | Most common workflow error |
| Wrong Arguments | Agent called the correct tool but with incorrect parameters | Common tool error |
| Wrong Information | Agent provided incorrect information to the user | Most common user interaction error |
| Policy Neglect | Agent failed to follow a policy rule | Significant across both domains |
| Context Loss | Agent forgot information from earlier in the conversation | More common in longer conversations |
| Compound Task Failure | Agent failed to handle multiple requests in one conversation | Challenging for all models |
The analysis found that terminal failures (where the agent completely derails) significantly outnumber recovered errors (where the agent stumbles but eventually finds the correct path). Small arithmetic or policy mistakes tend to propagate through the workflow rather than staying contained.
The original paper tested 12 popular LLMs with different agent architectures. The key findings were striking:
- Even the strongest function-calling agent (GPT-4o) succeeded on fewer than half of the tasks overall.
- Reliability was far worse than single-attempt accuracy: pass^8 fell below 25% in the retail domain.
- Native function calling generally outperformed ReAct and Act prompting for the same underlying model.
- The airline domain proved substantially harder than retail for every model tested.
Early performance of leading models on τ-bench:
| Model | Version/Mode | Airline Pass@1 | Retail Pass@1 | Notes |
|---|---|---|---|---|
| Claude 3.7 Sonnet | With think tool | 58.4% | 81.2% | Top performer with reasoning |
| Claude 3.5 Sonnet | Upgraded | 46.0% | 69.2% | Improved from 36.0%/62.6% |
| GPT-OSS-120B | Standard | Not reported | 67.8% | Open-weight model |
| GPT-4o | Tool-calling | <50% | <50% | Initial SOTA baseline |
| GPT-4o | ReAct | ~35% | ~40% | Lower with ReAct |
| Claude 3.5 Sonnet | Original | 36.0% | 62.6% | Before upgrade |
Note: pass^4 and pass^8 scores are significantly lower across all models, with pass^8 below 25% in retail for most models, indicating consistency challenges.
As of early 2026, the retail domain leaderboard shows substantial improvement over the original 2024 results, with the best models now crossing 80% pass^1:
| Rank | Model | Organization | Retail pass^1 |
|---|---|---|---|
| 1 | Claude Sonnet 4.5 | Anthropic | 0.862 |
| 2 | Claude Opus 4.1 | Anthropic | 0.824 |
| 3 | Claude Opus 4 | Anthropic | 0.814 |
| 4 | Claude 3.7 Sonnet | Anthropic | 0.812 |
| 5 | Claude Sonnet 4 | Anthropic | 0.805 |
| 6 | GLM-4.5 | Zhipu AI | 0.797 |
| 7 | GLM-4.5-Air | Zhipu AI | 0.779 |
| 8 | Qwen3-Coder 480B A35B | Alibaba | 0.775 |
| 9 | o4-mini | OpenAI | 0.718 |
| 10 | o1 | OpenAI | 0.708 |
| 11 | Qwen3-Next-80B-A3B-Thinking | Alibaba | 0.696 |
| 12 | Claude 3.5 Sonnet | Anthropic | 0.692 |
| 13 | GPT-4.5 | OpenAI | 0.684 |
| 14 | GPT-4.1 | OpenAI | 0.680 |
| 15 | GPT OSS 120B | OpenAI | 0.678 |
| 16 | GPT-4o | OpenAI | 0.603 |
| 17 | o3-mini | OpenAI | 0.576 |
| 18 | GPT-4.1 mini | OpenAI | 0.558 |
| 19 | Claude 3.5 Haiku | Anthropic | 0.510 |
| 20 | GPT-4.1 nano | OpenAI | 0.226 |
The average score across all 25 evaluated models is 0.678. Anthropic's Claude models dominate the top five positions, with Claude Sonnet 4.5 achieving the highest retail score of 0.862.
The airline domain remains substantially harder, with the best models scoring around 70% compared to 86% in retail:
| Rank | Model | Organization | Airline pass^1 |
|---|---|---|---|
| 1 | Claude Sonnet 4.5 | Anthropic | 0.700 |
| 2 | MiniMax M1 80K | MiniMax | 0.620 |
| 3 | GLM-4.5-Air | Zhipu AI | 0.608 |
| 4 | GLM-4.5 | Zhipu AI | 0.604 |
| 5 | MiniMax M1 40K | MiniMax | 0.600 |
| 5 | Claude Sonnet 4 | Anthropic | 0.600 |
| 5 | Qwen3-Coder 480B A35B | Alibaba | 0.600 |
| 8 | Claude Opus 4 | Anthropic | 0.596 |
| 9 | Claude 3.7 Sonnet | Anthropic | 0.584 |
| 10 | Claude Opus 4.1 | Anthropic | 0.560 |
| 11 | o1 | OpenAI | 0.500 |
| 11 | GPT-4.5 | OpenAI | 0.500 |
| 13 | GPT-4.1 | OpenAI | 0.494 |
| 14 | o4-mini | OpenAI | 0.492 |
| 15 | Claude 3.5 Sonnet | Anthropic | 0.460 |
| 16 | GPT-4o | OpenAI | 0.428 |
| 17 | GPT-4.1 mini | OpenAI | 0.360 |
| 18 | o3-mini | OpenAI | 0.324 |
| 19 | Claude 3.5 Haiku | Anthropic | 0.228 |
| 20 | GPT-4.1 nano | OpenAI | 0.140 |
The average score across all 23 evaluated airline models is 0.495. The HAL evaluation team at Princeton independently verified that 48 of 50 airline tasks (96%) have been solved by at least one agent, suggesting that the individual tasks are solvable but no single agent can solve them all consistently.
Several patterns emerge from the leaderboard data:
- Anthropic's Claude family occupies the top positions in both domains, with Claude Sonnet 4.5 leading retail (0.862) and airline (0.700).
- Every model scores markedly lower on airline than on retail, typically by 15 to 25 percentage points.
- Smaller variants such as GPT-4.1 nano and Claude 3.5 Haiku degrade sharply, suggesting that multi-turn tool use and policy compliance demand substantial model capacity.
- Scores have risen steadily across model generations, yet no model approaches saturation on the airline domain.

Common failure modes identified:
- Taking write actions without the required user confirmation or eligibility checks (policy neglect)
- Selecting the wrong tool, or the right tool with incorrect arguments
- Losing track of information provided earlier in long conversations (context loss)
- Completing only part of a compound request that involves multiple distinct actions
In 2025, the τ-bench team released τ³-bench, an updated version that audited and corrected over 50 tasks across both domains (27 airline fixes, 26 retail fixes). The corrections addressed five categories of issues identified through community feedback and the τ-Bench Verified research program:
| Issue Category | Description | Examples |
|---|---|---|
| Incorrect expected actions | The annotated goal state was wrong | Compensation offered to ineligible passengers; invalid PayPal refunds |
| Ambiguous user instructions | The user instruction allowed multiple valid interpretations | Economy vs. basic economy confusion; "similar" vs. "same" item specifications |
| Impossible constraints | The task setup made the intended solution unreachable | Required payment methods absent from user profiles; location contradictions |
| Missing fallback behaviors | No guidance for what the agent should do when the intended path fails | No defined action when product searches return no results |
| Policy loophole prevention | Tasks that could be "solved" by exploiting policy gaps | Cancel-and-rebook exploits instead of proper modification refusals; cabin upgrade workarounds |
The impact of these corrections varied by domain. In the airline domain, pass^1 scores increased by 14 to 20 points across models, and pass^4 improvements were even larger (up to 22 points for some models). In the retail domain, the changes were more modest, with pass^1 shifts ranging from -0.4 to +5.5 points. The retail corrections primarily reduced evaluation variance rather than uniformly boosting scores.
τ-bench is open-source under the MIT license and available on GitHub. It can be installed with:
```bash
git clone https://github.com/sierra-research/tau-bench
cd tau-bench
pip install -e .
```
API keys must be set as environment variables for the model providers being evaluated: OPENAI_API_KEY, ANTHROPIC_API_KEY, GOOGLE_API_KEY, MISTRAL_API_KEY.
The benchmark is run through a command-line interface:
```bash
python run.py --agent-strategy tool-calling --env retail \
  --model gpt-4o --model-provider openai \
  --user-model gpt-4o --user-model-provider openai \
  --user-strategy llm --max-concurrency 10
```
Key configuration options include:
| Parameter | Options | Description |
|---|---|---|
| --agent-strategy | tool-calling, react, act | How the agent structures its reasoning and actions |
| --env | retail, airline | Which domain to evaluate |
| --model | Various | The model powering the agent |
| --user-model | Various | The model powering the user simulator |
| --user-strategy | llm, react, verify, reflection | How the user simulator generates responses |
| --max-concurrency | Integer | Number of parallel evaluations |
| --task-ids | Comma-separated | Run specific tasks by ID |
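A sweep over several models and a subset of tasks can be scripted around the CLI flags documented above. The snippet below is a sketch; the model identifiers and the provider mapping are illustrative.

```python
# Hypothetical sweep over two models using the documented run.py flags.
import subprocess

for model, provider in [("gpt-4o", "openai"),
                        ("claude-3-5-sonnet-20241022", "anthropic")]:  # illustrative IDs
    subprocess.run(
        ["python", "run.py",
         "--agent-strategy", "tool-calling",
         "--env", "airline",
         "--model", model,
         "--model-provider", provider,
         "--user-model", "gpt-4o",
         "--user-model-provider", "openai",
         "--user-strategy", "llm",
         "--task-ids", "1,2,3",        # targeted subset for debugging
         "--max-concurrency", "5"],
        check=True,
    )
```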
τ-bench supports evaluation of models from multiple providers:
| Provider | Models Supported | Integration Method |
|---|---|---|
| OpenAI | GPT-4, GPT-4o, GPT-4o-mini, GPT-4.1, GPT-3.5, o1, o3, o4-mini | API |
| Anthropic | Claude 3, Claude 3.5 Sonnet, Claude 3.7 Sonnet, Claude Sonnet 4, Claude Opus 4 | API |
| Google | Gemini Pro, Gemini Flash, Gemini Ultra | API |
| Mistral AI | Mistral Large, Mistral Medium | API |
| Zhipu AI | GLM-4.5, GLM-4.5-Air | API |
| AnyScale | Open-source models via API | API |
Other notable features of the evaluation harness include:
- Parallel evaluation via the --max-concurrency flag, allowing multiple tasks to be evaluated simultaneously.
- The auto_error_identification.py script, which automatically classifies failures by fault assignment and fault type.
- The --task-ids flag for targeted debugging of individual tasks.

τ-bench has become one of the standard benchmarks for evaluating AI agents in industry.
Within one year of release, τ-bench had a significant academic impact, influencing both follow-up benchmarks and research on agent architectures.
Researchers found that naive ReAct-style agents often break down during complex multi-step sequences, leading to the development of enhanced approaches including hierarchical architectures that maintain goals and memory throughout multi-turn conversations.
Released in 2025 by Sierra Research (arXiv: 2506.07982), τ²-bench extends the τ-bench framework to a telecom domain with a critical new challenge: dual-control environments. In the original τ-bench, only the agent uses tools. In τ²-bench, both the agent and the user can take actions in a shared environment, modeled as a Decentralized Partially Observable Markov Decision Process (Dec-POMDP).
Key features of τ²-bench:
| Feature | Description |
|---|---|
| Telecom domain | Customer service scenarios for internet, phone, and TV service providers |
| Dual-control | Both agent and user have tools; the user must perform actions on their end (reboot router, check settings) |
| Compositional task generator | Programmatically creates diverse, verifiable tasks from atomic components |
| Reliable user simulator | Tightly coupled with the environment to ensure consistent behavior |
| Fine-grained error analysis | Separates errors from reasoning vs. communication/coordination |
| Coordination metrics | Measuring agent-user collaboration |
| Enhanced complexity | More sophisticated interactions |
Performance drops significantly in τ²-bench compared to τ-bench. GPT-4.1, which in the τ²-bench paper's evaluation achieves 74% pass^1 in retail and 56% in airline, drops to 34% in the telecom domain. This gap highlights the difficulty of guiding users through actions rather than performing all actions autonomously.
Rather than a new benchmark, τ³-bench is a corrected version of the original τ-bench tasks (described in the Task Corrections section above).
Inspired directly by τ-bench, MedAgentBench adapts the framework to the medical domain.
The authors cited τ-bench as effective for testing general agent capabilities while noting the absence of a standardized medical agent benchmark, prompting creation of the domain-specific variant.
Another domain-specific adaptation emerged for evaluating AI agents in legal applications, inspired by τ-bench's framework for policy-compliant agent evaluation.
| Limitation | Description | Impact |
|---|---|---|
| Limited domains | Only two domains (retail and airline) in the original benchmark | Raises questions about generalization to other settings |
| Simulated users | LLM-based users may not capture the full range of human behavior | May underestimate difficulty with real, unpredictable users |
| English only | No multilingual support | Limits applicability to non-English customer service settings |
| Static task set | Fixed set of 165 tasks | Risk of overfitting as models are evaluated repeatedly |
| Binary scoring | No partial credit for partially correct solutions | May not capture agents that get "almost right" |
| Text only | No visual, voice, or document-based interactions | Misses multimodal aspects of real customer service |
| Benchmark | Focus | Key Difference from τ-bench |
|---|---|---|
| SWE-bench | Software engineering tasks | Single-turn; no user interaction |
| AgentBench | Multi-environment agent evaluation | Evaluates agents across multiple environments but without simulated user interaction |
| WebArena | Web navigation and interaction | Focuses on browser-based tasks rather than conversational customer service |
| WebShop | E-commerce navigation | Earlier work by the same lead author; shopping tasks with a fixed goal and no conversational user |
| ALFWorld | Embodied agent tasks | Focuses on text-based embodied environments rather than customer service |
| InterCode | Interactive coding benchmark | Code-focused rather than customer service conversations |
| GAIA | General AI assistant tasks | Broader scope but less focus on multi-turn tool use and policy compliance |
| BFCL | Function calling accuracy | Tests tool use in isolation rather than within multi-turn conversations |
| τ²-bench | Dual-control conversational agents | Extends τ-bench with a telecom domain where both agent and user have tools |
| MT-Bench | Multi-turn conversation quality | Tests conversational ability but not tool use or policy compliance |