# τ-bench

> Source: https://aiwiki.ai/wiki/tau-bench
> Updated: 2026-06-09
> Categories: AI Benchmarks
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**τ-bench** (Tau-bench), short for Tool-Agent-User Interaction Benchmark, is an [AI benchmark](/wiki/ai_benchmark) designed to evaluate [language agents](/wiki/Language_agents)' ability to complete complex tasks through realistic interactions with simulated users and domain-specific tools. Released on June 17, 2024, by [Sierra Research](/wiki/Sierra_Research) and [Princeton University](/wiki/Princeton_University), τ-bench addresses critical gaps in existing benchmarks by testing agents on their ability to follow domain-specific rules, maintain context over long conversations, and effectively coordinate with users to achieve goals.[1] The paper was accepted as a poster at the International Conference on Learning Representations (ICLR) 2025.[12]

Unlike earlier agent benchmarks such as [WebArena](/wiki/webarena), [SWE-bench](/wiki/swe_bench), and [AgentBench](/wiki/agentbench), which evaluate agents in single-round human-agent interactions, τ-bench requires agents to handle multiple dynamic exchanges where information is gathered incrementally over the course of a conversation.[1] This design mirrors the way customer service agents operate in practice, where a single interaction can involve identity verification, information retrieval, policy checking, multi-step tool calls, and explicit user confirmation before taking action.

## Quick Facts

| Attribute | Value |
| --- | --- |
| Full name | Tau-bench: Tool-Agent-User Interaction Benchmark |
| Abbreviation | τ-bench |
| Description | A benchmark for evaluating AI agents' ability to complete complex tasks through realistic tool-agent-user interactions in real-world domains |
| Release date | 2024-06-17 [1] |
| Latest version | τ³-bench (task-corrected) [8] |
| Benchmark updated | 2025 |
| Authors | Shunyu Yao, Noah Shinn, Pedram Razavi, Karthik Narasimhan [1] |
| Organization | Sierra Research, Princeton University [1] |
| Type | Agent Evaluation, Multi-turn Interaction |
| Modality | Text, API Calls |
| Task format | Conversational task completion |
| Number of tasks | 165 total (115 retail, 50 airline) [1] |
| Evaluation metric | pass^k, Database State Comparison [1] |
| Domains | Airline, Retail [1] |
| Languages | English |
| Human performance | Not reported |
| SOTA score | 86.2% (Retail pass^1), 70.0% (Airline pass^1) [7] |
| SOTA model | Claude Sonnet 4.5 [7] |
| SOTA date | 2025 |
| Saturated | No (airline); approaching saturation (retail) |
| Website | [Official website](https://taubench.com/) |
| Paper | [arXiv:2406.12045](https://arxiv.org/abs/2406.12045) |
| GitHub | [sierra-research/tau-bench](https://github.com/sierra-research/tau-bench) |
| License | MIT [2] |
| Conference | ICLR 2025 (Poster) [12] |
| Successor | τ²-bench, τ³-bench |

## Overview

τ-bench moves agent evaluation beyond one-shot task completion toward dynamic, multi-turn conversations that mirror real-world applications. The benchmark emulates scenarios where an AI agent must interact with both users (simulated by [large language models](/wiki/large_language_model)) and domain-specific API tools while adhering to complex policy guidelines.[1]

### Key Innovation

Unlike traditional benchmarks that focus on single-turn interactions or isolated capabilities, τ-bench introduces:

- **Stateful Evaluation**: Compares database states after task completion
- **User Simulation**: LLM-simulated users provide realistic conversational dynamics
- **Policy Adherence**: Tests agents' ability to follow domain-specific rules
- **Consistency Measurement**: Introduces the pass^k metric for reliability assessment

The benchmark revealed that even state-of-the-art function calling agents (like [GPT-4o](/wiki/gpt_4o)) succeed on less than 50% of the tasks and are quite inconsistent (pass^8 below 25% in retail).[1]

## Background and Motivation

### The Agent Evaluation Gap

By mid-2024, [large language models](/wiki/large_language_model) had demonstrated strong capabilities in isolated tasks like text generation, code completion, and question answering. However, deploying these models as autonomous agents in real-world settings revealed a gap between benchmark performance and actual reliability. Existing agent benchmarks at the time tested agents on well-defined, single-turn problems: [SWE-bench](/wiki/swe_bench) measured software engineering ability, [HumanEval](/wiki/humaneval) tested code generation, and [AgentBench](/wiki/agentbench) evaluated agents across several environments. None of these benchmarks required agents to interact with a human user over multiple turns while simultaneously calling tools and following complex policies.[1]

Sierra, the conversational AI company co-founded by [Bret Taylor](/wiki/bret_taylor) (former Salesforce co-CEO and OpenAI board member) and Clay Bavor (former Google executive), had direct experience with the gap between benchmark scores and real-world agent performance.[3] Sierra's research team, led by Karthik Narasimhan, identified three capabilities that existing benchmarks failed to measure:[3]

1. **Dynamic multi-party interaction**: Agents must interact with both human users and backend APIs over extended periods, gathering information incrementally rather than receiving all inputs up front.
2. **Policy compliance**: Agents must accurately follow domain-specific rules and guidelines, which can be lengthy and contain subtle edge cases.
3. **Reliability at scale**: A customer service agent that resolves a problem 60% of the time on the first try is not useful if it fails on the same problem the next time a different customer asks.

### The Authors

[Shunyu Yao](/wiki/shunyu_yao), the lead author of τ-bench, is known for creating the [ReAct](/wiki/react_prompting) framework (ICLR 2023), which introduced the paradigm of interleaving reasoning traces with actions in language models. He also created [WebShop](/wiki/WebShop) (NeurIPS 2022), an earlier e-commerce interaction benchmark, and co-developed [SWE-agent](/wiki/swe-agent). At the time of τ-bench's release, Yao was affiliated with both Sierra Research and Princeton University.[1] Noah Shinn, another co-author, is known for his work on Reflexion, a framework for language agent self-reflection. Pedram Razavi contributed from Sierra Research, and Karthik Narasimhan, a professor at Princeton University, leads the Princeton Language and Intelligence lab where much of this agent research originated.[1]

## Architecture and Design

τ-bench employs a modular three-component architecture that simulates realistic customer service interactions. Each component plays a distinct role in creating a closed-loop evaluation environment.[1]

### Three-Component Framework

| Component | Role | Implementation Details |
| --- | --- | --- |
| User Simulator | Generates realistic user messages based on hidden instructions | LLM-based (GPT-4, Claude, etc.); guided by task-specific instructions that define user identity, intent, and preferences |
| Agent System | Processes user requests, calls tools, and follows policies | Supports multiple architectures: tool-calling (function calling), ReAct, and Act |
| Environment | Provides API tools, maintains database state, and enforces domain rules | Domain-specific databases with realistic data; tools exposed as callable functions |

The interaction loop works as follows: the user simulator initiates a conversation with a request (for example, "I need to cancel my flight to Denver"). The agent responds by calling tools to look up information, asking clarifying questions, or taking actions. The user simulator replies based on its hidden instruction set, which specifies the user's identity, preferences, and constraints. This back-and-forth continues until the agent resolves the request or the conversation reaches a maximum number of turns.[1]

Critically, the user instructions are hidden from the agent. The agent only sees the user's natural language messages and must infer what the user wants through conversation, just as a real customer service agent would.[1]
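
The interaction loop described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the benchmark's implementation: `AgentStep`, `run_episode`, the scripted agent and user, and the use of `None` as an end-of-conversation signal are all hypothetical names and conventions chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AgentStep:
    """One agent move: either a tool call or a message to the user, never both."""
    tool_name: Optional[str] = None
    tool_args: dict = field(default_factory=dict)
    message: Optional[str] = None


def run_episode(agent, user, tools, max_steps=30):
    """Alternate agent and user turns until the user ends the chat or steps run out."""
    history = [("user", user(None))]  # the user simulator opens the conversation
    for _ in range(max_steps):
        step = agent(history)
        if step.tool_name is not None:
            # Tool calls act on the environment; the user never sees them directly.
            result = tools[step.tool_name](**step.tool_args)
            history.append(("tool", result))
            continue
        history.append(("agent", step.message))
        reply = user(step.message)
        if reply is None:  # hypothetical end-of-conversation signal
            break
        history.append(("user", reply))
    return history


# Scripted stand-ins so the sketch runs without any LLM.
db = {"R1": "active"}


def cancel_reservation(reservation_id):
    db[reservation_id] = "cancelled"
    return "ok"


# The user's lines come from hidden instructions that the agent never reads.
user_lines = iter(["I need to cancel my flight to Denver.", "Yes, reservation R1.", None])
agent_steps = iter([
    AgentStep(message="Can you confirm the reservation ID?"),
    AgentStep(tool_name="cancel_reservation", tool_args={"reservation_id": "R1"}),
    AgentStep(message="Your reservation is cancelled."),
])

history = run_episode(
    agent=lambda _history: next(agent_steps),
    user=lambda _message: next(user_lines),
    tools={"cancel_reservation": cancel_reservation},
)
print(db["R1"], len(history))
```

The agent function receives only the visible history, which mirrors the point that the agent must infer the user's goal from the conversation alone.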

### User Simulation

Each task in τ-bench includes a structured user instruction that defines:

- **User identity**: Name, email, membership tier, payment methods on file
- **User intent**: What the user wants to accomplish (cancel a flight, exchange a product, modify an address)
- **User preferences**: Specific constraints (preferred payment method, desired cabin class, acceptable alternatives)
- **Behavioral guidelines**: How cooperative or difficult the simulated user should be

The instruction is designed so that only one correct outcome exists under the domain's policy. This constraint is essential for automated evaluation: because there is exactly one valid goal state, the benchmark can verify success by comparing the final database state against the annotated expected state without requiring subjective human judgment.[1]

The user simulator itself is powered by an LLM (typically [GPT-4](/wiki/gpt-4) or [Claude](/wiki/claude)). While the instructions are synthetic, the utterances the simulator generates are open-ended and natural-sounding. The benchmark supports several user simulation strategies:

| Strategy | Description | Use Case |
| --- | --- | --- |
| LLM (default) | Direct LLM-generated responses based on user instructions | Standard evaluation |
| ReAct | Adds explicit reasoning steps before generating user responses | Complex reasoning tasks |
| Verify | Includes an LLM verification loop after each response | High-accuracy requirements |
| Reflection | Adds self-correction to improve response quality | Improved consistency |

Human evaluation of the user simulator rated it on four criteria: adherence to simulator guidelines, adherence to user instructions, correct use of user tools (when applicable), and generation of natural, consistent conversational continuations. The simulator was found to perform well on these criteria.[1]

## Domains and Tasks

The benchmark covers two primary domains:

### Retail Domain (τ-retail)

The retail domain simulates an e-commerce customer support environment. It contains 115 tasks built on a synthetic database of 500 users, 50 products, and 1,000 orders.[1]

- **Scenario**: E-commerce customer support
- **Tools**: Order management, inventory, returns, payment APIs
- **Policies**: Return windows, warranty terms, pricing rules
- **Challenges**: Product availability, order modifications, refund processing
- **Difficulty**: Easier than the airline domain

**Available API Tools (15 total: 7 write, 8 read)**

The retail agent has access to tools for retrieving and modifying customer data:

| Tool Category | Examples | Description |
| --- | --- | --- |
| User lookup | get_user_details | Retrieve user profile information after authentication |
| Order retrieval | get_order_details | Look up order status, items, shipping, and payment |
| Product information | get_product_details | Check product availability, options, and pricing |
| Order cancellation | cancel_pending_order | Cancel orders with status "pending" only |
| Order modification | modify_pending_order | Change shipping address, payment method, or item options on pending orders |
| Returns | return_delivered_order | Process returns on delivered orders within the return window |
| Exchanges | exchange_delivered_order | Exchange delivered items for different options of the same product |
| Profile updates | modify_user_address | Update user shipping or billing addresses |

**Key Retail Policies**

- The agent must authenticate the user at the start of every conversation by locating their user ID via email, or via name plus zip code.
- Only one user can be helped per conversation. Requests related to other users must be denied.
- Before any consequential action (cancel, modify, return, exchange), the agent must list the action details and obtain explicit user confirmation.
- Exchange and modify order tools can only be called once per conversation. All items to be changed must be collected into a single tool call.
- Only pending orders can be cancelled or modified. Only delivered orders can be returned or exchanged.
- The agent may make at most one tool call at a time and should not respond to the user in the same turn as a tool call.
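
Several of the retail rules above are mechanical enough to express as a pre-flight check. The following is a minimal sketch, assuming the tool names from the table above; `check_action` and its arguments are hypothetical and are not part of the benchmark's code.

```python
# Order status that each consequential tool requires, per the policy list above.
REQUIRED_STATUS = {
    "cancel_pending_order": "pending",
    "modify_pending_order": "pending",
    "return_delivered_order": "delivered",
    "exchange_delivered_order": "delivered",
}

# Tools that may be called only once per conversation.
ONCE_PER_CONVERSATION = {"modify_pending_order", "exchange_delivered_order"}


def check_action(tool, order_status, user_confirmed, already_called=frozenset()):
    """Return (allowed, reason) for a proposed tool call (hypothetical helper)."""
    if tool not in REQUIRED_STATUS:
        return True, "not covered by this sketch"
    if order_status != REQUIRED_STATUS[tool]:
        return False, f"{tool} requires a {REQUIRED_STATUS[tool]} order"
    if tool in ONCE_PER_CONVERSATION and tool in already_called:
        return False, f"{tool} may only be called once per conversation"
    if not user_confirmed:
        return False, "explicit user confirmation is required first"
    return True, "ok"


# First exchange on a delivered order, after the user has confirmed.
print(check_action("exchange_delivered_order", "delivered", True))
# A second exchange call in the same conversation is rejected.
print(check_action("exchange_delivered_order", "delivered", True,
                   already_called={"exchange_delivered_order"}))
# Cancelling an order that has already been delivered is rejected.
print(check_action("cancel_pending_order", "delivered", True))
```

In the benchmark itself the agent has to derive these constraints from the policy document; a check like this only illustrates what correct behavior looks like.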

**Example Retail Task**: A user contacts support wanting to exchange a blue medium t-shirt from their delivered order for a red large version of the same product. The agent must authenticate the user, look up the order, verify the order status is "delivered," check that the red large option is available, collect all exchange details, confirm with the user, and execute the exchange in a single tool call.

### Airline Domain (τ-airline)

The airline domain simulates a flight reservation customer service environment. It contains 50 tasks built on a synthetic database of 500 users, 300 flights, and 2,000 reservations.[1]

- **Scenario**: Customer service for airline bookings
- **Tools**: Flight search, booking, cancellation, modification APIs
- **Policies**: Fare rules, refund policies, upgrade procedures
- **Challenges**: Multi-leg trips, schedule changes, policy compliance
- **Difficulty**: More challenging than retail domain

**Available API Tools (13 total: 6 write, 7 read)**

The airline agent has access to tools for managing flight reservations:

| Tool Category | Examples | Description |
| --- | --- | --- |
| Flight search | search_direct_flight, search_onestop_flight | Find available direct or connecting flights |
| Reservation lookup | get_reservation_details | Retrieve booking information, passengers, and flight details |
| User lookup | get_user_details | Authenticate and retrieve user profile |
| Booking | book_reservation | Create new flight reservations (max 5 passengers) |
| Modification | update_reservation_flights, update_reservation_passengers | Change flights, cabin class, baggage, insurance, or passenger details |
| Cancellation | cancel_reservation | Cancel reservations subject to policy rules |

**Key Airline Policies**

- All reservations can be cancelled within 24 hours of booking regardless of cabin class.
- Outside the 24-hour window, basic economy and economy reservations can only be cancelled if the passenger purchased travel insurance and meets the insurance conditions.
- Business class reservations can always be cancelled.
- If the airline cancels a flight, all passengers are eligible for cancellation and potential compensation.
- Cabin class upgrades are allowed without changing flights, but the user must pay the fare difference.
- Modifications cannot change the origin, destination, or trip type (one-way vs. round-trip).
- Only silver and gold members, travelers with insurance, or business-class passengers qualify for service recovery certificates in cases of delays or cancellations.
- Maximum of five passengers per reservation, and all passengers must be on identical flights and cabin class.
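
The cancellation and compensation rules above can be read as a small decision procedure. The sketch below assumes a simplified `Reservation` record with hypothetical field names, and treats "within 24 hours" as inclusive; neither detail comes from the benchmark itself.

```python
from dataclasses import dataclass


@dataclass
class Reservation:
    """Simplified, hypothetical view of a booking for policy checks."""
    cabin: str                          # "basic_economy", "economy", or "business"
    hours_since_booking: float
    has_insurance: bool = False
    insurance_conditions_met: bool = False
    airline_cancelled_flight: bool = False
    membership: str = "regular"         # "regular", "silver", or "gold"


def can_cancel(r: Reservation) -> bool:
    """Apply the cancellation rules from the policy list above, in order."""
    if r.hours_since_booking <= 24:     # assumption: the 24-hour bound is inclusive
        return True
    if r.airline_cancelled_flight:
        return True
    if r.cabin == "business":
        return True
    # Basic economy and economy outside the window need qualifying insurance.
    return r.has_insurance and r.insurance_conditions_met


def qualifies_for_certificate(r: Reservation) -> bool:
    """Silver or gold members, insured travelers, or business-class passengers."""
    return r.membership in {"silver", "gold"} or r.has_insurance or r.cabin == "business"


late_basic = Reservation(cabin="basic_economy", hours_since_booking=72)
print(can_cancel(late_basic))                   # the agent should refuse this request
print(can_cancel(Reservation("business", 72)))  # business class can always be cancelled
print(qualifies_for_certificate(late_basic))
```

The first case corresponds to a policy-conflict task: the correct behavior is a refusal, so the database must remain unchanged.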

**Example Airline Task**: A user calls about a delayed flight and wants to rebook on an earlier connection. The agent must verify the user's identity, check the reservation details, determine if the passenger's membership tier qualifies them for compensation, search for alternative flights, confirm the new itinerary with the user, and process the modification while applying the correct fare difference.

### Task Complexity

Tasks in τ-bench vary in complexity and are designed to apply pressure in diverse ways:

| Complexity Level | Characteristics | Example |
| --- | --- | --- |
| Simple | Single API call, straightforward request | Check flight status |
| Moderate | Multiple API calls, some reasoning required | Book round-trip with preferences |
| Complex | Many API calls, policy checking, user clarification | Multi-city trip with changes |
| Expert | Edge cases, exception handling, complex policies | Group booking with special needs |

| Complexity Factor | Description | Example |
| --- | --- | --- |
| Long-chain dependencies | Multiple sequential steps where later actions depend on earlier results | Booking a multi-passenger reservation where each passenger has different preferences |
| Buried essential details | Critical information hidden deep in the conversation or policy document | A cancellation policy exception that applies only to gold-tier members with travel insurance |
| Realistic forgetting scenarios | Situations where an agent might lose track of earlier conversation context | A user who mentions their email early in the conversation and later asks the agent to use it for a different purpose |
| Policy conflicts | Requests that conflict with domain rules, requiring the agent to refuse or offer alternatives | A user asking to cancel a basic economy ticket without insurance outside the 24-hour window |
| Compound requests | Multiple distinct actions needed in a single conversation | A user who wants to cancel one order, modify another, and update their address |

## Evaluation Methodology

### Database State Comparison

τ-bench uses an objective, automated evaluation approach that avoids the subjectivity of LLM-as-judge methods. The process works as follows:[1]

1. **Initial state capture**: The database is recorded before the conversation begins.
2. **Conversation execution**: The agent interacts with the user simulator over multiple turns, calling tools as needed.
3. **Final state capture**: The database state is recorded after the conversation ends.
4. **Goal state comparison**: The final database state is compared against a pre-annotated goal state that represents the single correct outcome for that task.
5. **Binary success determination**: The task is scored as a success (1) if the final state matches the goal state, or a failure (0) if it does not.

This approach is both efficient (no human evaluators or LLM judges needed per evaluation) and faithful (the comparison is deterministic and objective). Partial credit is not awarded; a task either succeeds completely or fails. The designers chose this strict approach because, in real-world customer service, a partial update can create more damage than a refusal.[1]
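
The five steps above amount to a deterministic equality check on database contents. A minimal sketch follows, assuming the database is a JSON-serializable dictionary and using a SHA-256 hash of a canonical serialization as the comparison; `state_hash`, `score_task`, and the toy order records are hypothetical.

```python
import copy
import hashlib
import json


def state_hash(db: dict) -> str:
    """Hash a canonical (key-sorted) JSON serialization of the database."""
    canonical = json.dumps(db, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def score_task(final_db: dict, goal_db: dict) -> int:
    """Binary reward: 1 only if the final state matches the goal state exactly."""
    return 1 if state_hash(final_db) == state_hash(goal_db) else 0


# Toy database with one pending order (made-up data for illustration).
initial_db = {"orders": {"#W1": {"status": "pending", "address": "1 Main St"}}}

# Annotated goal state: the order is cancelled and nothing else changes.
goal_db = copy.deepcopy(initial_db)
goal_db["orders"]["#W1"]["status"] = "cancelled"

# A correct run reproduces the goal state exactly.
final_ok = copy.deepcopy(initial_db)
final_ok["orders"]["#W1"]["status"] = "cancelled"

# A run that cancels the order but also edits the address gets no credit.
final_bad = copy.deepcopy(final_ok)
final_bad["orders"]["#W1"]["address"] = "2 Oak Ave"

print(score_task(final_ok, goal_db), score_task(final_bad, goal_db))
```

Because the comparison covers the whole state, an extra unintended write fails the task just as a missing write does, which is how the absence of partial credit shows up in practice.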

### The pass^k Metric

One of τ-bench's most significant contributions is the introduction of the pass^k ("pass hat k") metric, which measures agent reliability across repeated trials.[1]

The standard pass@k metric, widely used in code generation benchmarks like [HumanEval](/wiki/humaneval), asks: "Did the agent succeed on at least one of k attempts?" This measures peak capability. In contrast, pass^k asks: "Did the agent succeed on all k independent attempts?" This measures consistency.[1]

Formally, for a given task with n independent trials of which c are successful:

- **pass^1** = E[r] = E[c/n], the expected success rate on a single attempt (equivalent to pass@1)
- **pass^k** = the probability that all k independent, identically distributed trials succeed

The distinction matters enormously for real-world deployment. A customer service agent that resolves a particular type of issue 60% of the time will, over 8 independent instances of that issue, succeed on all 8 only about 1.7% of the time (0.6^8). The pass^k metric captures this compounding unreliability.[1]
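
Both metrics can be estimated from the same trial counts. The sketch below uses the combinatorial estimators C(c, k) / C(n, k) for pass^k and 1 - C(n - c, k) / C(n, k) for pass@k, averaged over tasks; the function names and the three-task example are illustrative assumptions, not taken from the benchmark's code.

```python
from math import comb


def pass_hat_k(successes_per_task, n, k):
    """Estimate pass^k: chance that k trials drawn from the n recorded trials all succeed."""
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    per_task = [comb(c, k) / comb(n, k) for c in successes_per_task]
    return sum(per_task) / len(per_task)


def pass_at_k(successes_per_task, n, k):
    """Estimate pass@k: chance that at least one of k drawn trials succeeds."""
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    per_task = [1 - comb(n - c, k) / comb(n, k) for c in successes_per_task]
    return sum(per_task) / len(per_task)


# Made-up example: three tasks, 8 trials each, with 8, 4, and 0 successes.
successes = [8, 4, 0]
n = 8

print(round(pass_hat_k(successes, n, 1), 3))  # 0.5, identical to pass@1
print(round(pass_hat_k(successes, n, 4), 3))  # 0.338
print(round(pass_hat_k(successes, n, 8), 3))  # 0.333
print(round(pass_at_k(successes, n, 8), 3))   # 0.667

# The compounding effect quoted above: a 60% per-attempt success rate over 8 attempts.
print(round(0.6 ** 8, 4))                     # 0.0168
```

In this example the task solved 4 times out of 8 counts fully toward pass@8 but contributes almost nothing to pass^8, which is the gap between peak capability and consistency.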

| Metric | Question Answered | Significance |
| --- | --- | --- |
| pass^1 (equal to pass@1) | What fraction of tasks does the agent solve on a single attempt? | Basic capability measure |
| pass^4 | What fraction of tasks does the agent solve on all 4 independent attempts? | Moderate reliability threshold |
| pass^8 | What fraction of tasks does the agent solve on all 8 independent attempts? | High reliability threshold |
| pass^k | What fraction of tasks does the agent solve on all k independent attempts? | General reliability metric |

The benchmark does not just test whether an agent can complete a task once; it measures whether the agent can do so consistently across repeated trials. The original paper showed that [GPT-4o](/wiki/gpt_4o) achieved a pass^1 of roughly 50% on retail tasks but dropped to approximately 25% at pass^8, a relative decline of about half.[1] This drop revealed that even capable models are highly inconsistent when handling the same task across trials with different conversational variations.

### Auto Error Identification

τ-bench includes an automatic error identification tool (available in the GitHub repository as `auto_error_identification.py`) that analyzes failed trajectories to classify errors.[2] The tool categorizes failures along two dimensions:

**Fault Assignment** (who caused the failure):

- Agent fault (the agent made an incorrect decision)
- User simulator fault (the simulated user behaved unrealistically)
- Environment fault (a tool or database issue)

**Fault Type** (what kind of error occurred):

| Fault Type | Description | Frequency |
| --- | --- | --- |
| Wrong Action | Agent selected an incorrect tool or action sequence | Most common workflow error |
| Wrong Arguments | Agent called the correct tool but with incorrect parameters | Common tool error |
| Wrong Information | Agent provided incorrect information to the user | Most common user interaction error |
| Policy Neglect | Agent failed to follow a policy rule | Significant across both domains |
| Context Loss | Agent forgot information from earlier in the conversation | More common in longer conversations |
| Compound Task Failure | Agent failed to handle multiple requests in one conversation | Challenging for all models |

The analysis found that terminal failures (where the agent completely derails) significantly outnumber recovered errors (where the agent stumbles but eventually finds the correct path). Small arithmetic or policy mistakes tend to propagate through the workflow rather than staying contained.[1]
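
The two classification dimensions above map naturally onto a pair of enumerations. The sketch below is a hypothetical data model for tallying labeled failures; it does not reproduce the output format of `auto_error_identification.py`, and the four labeled trajectories are made up for illustration.

```python
from collections import Counter
from enum import Enum


class FaultAssignment(Enum):
    AGENT = "agent"
    USER_SIMULATOR = "user_simulator"
    ENVIRONMENT = "environment"


class FaultType(Enum):
    WRONG_ACTION = "wrong_action"
    WRONG_ARGUMENTS = "wrong_arguments"
    WRONG_INFORMATION = "wrong_information"
    POLICY_NEGLECT = "policy_neglect"
    CONTEXT_LOSS = "context_loss"
    COMPOUND_TASK_FAILURE = "compound_task_failure"


# Hypothetical labels for four failed trajectories.
# Assumption: a fault type is only recorded when the agent is at fault.
labels = [
    (FaultAssignment.AGENT, FaultType.WRONG_ARGUMENTS),
    (FaultAssignment.AGENT, FaultType.POLICY_NEGLECT),
    (FaultAssignment.USER_SIMULATOR, None),
    (FaultAssignment.AGENT, FaultType.POLICY_NEGLECT),
]

by_assignment = Counter(assignment for assignment, _ in labels)
by_type = Counter(fault for _, fault in labels if fault is not None)
agent_share = by_assignment[FaultAssignment.AGENT] / len(labels)

print(by_assignment[FaultAssignment.AGENT], by_type.most_common(1), agent_share)
```

Separating who caused a failure from what went wrong keeps simulator and environment problems from being counted against the agent.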

## Experimental Results

### Original Paper Results (June 2024)

The original paper tested 12 popular [LLMs](/wiki/large_language_model) with different agent architectures. The key findings were striking:[1]

- Even [GPT-4o](/wiki/gpt_4o), the best-performing model at the time, succeeded on fewer than 50% of tasks across both domains.
- Function-calling (tool-calling) agents consistently outperformed text-formatted agent methods like ReAct.
- pass^8 scores in the retail domain fell below 25% for all models tested, revealing severe consistency problems.
- The airline domain proved significantly harder than retail across all models.

### Early Model Performance Comparison

Early performance of leading models on τ-bench:

| Model | Version/Mode | Airline Pass@1 | Retail Pass@1 | Notes |
| --- | --- | --- | --- | --- |
| [Claude 3.7 Sonnet](/wiki/claude_sonnet_4_5) | With think tool | 58.4% | 81.2% | Top performer with reasoning [15] |
| [Claude 3.5 Sonnet](/wiki/claude_sonnet_4_5) | Upgraded | 46.0% | 69.2% | Improved from 36.0%/62.6% [13] |
| [GPT-OSS-120B](/wiki/GPT_OSS_120B) | Standard | Not reported | 67.8% | Open-weight model [14] |
| [GPT-4o](/wiki/gpt_4o) | Tool-calling | <50% | <50% | Initial SOTA baseline [1] |
| [GPT-4o](/wiki/gpt_4o) | ReAct | ~35% | ~40% | Lower with ReAct [1] |
| [Claude 3.5 Sonnet](/wiki/claude_sonnet_4_5) | Original | 36.0% | 62.6% | Before upgrade [13] |

Note: pass^4 and pass^8 scores are significantly lower across all models, with pass^8 below 25% in retail for most models, indicating consistency challenges.

### Current Leaderboard: Retail Domain

As of early 2026, the retail domain leaderboard shows substantial improvement over the original 2024 results, with the best models now crossing 80% pass^1:[10]

| Rank | Model | Organization | Retail pass^1 |
| --- | --- | --- | --- |
| 1 | [Claude Sonnet 4.5](/wiki/claude) | [Anthropic](/wiki/anthropic) | 0.862 |
| 2 | [Claude Opus 4.1](/wiki/claude) | [Anthropic](/wiki/anthropic) | 0.824 |
| 3 | [Claude Opus 4](/wiki/claude) | [Anthropic](/wiki/anthropic) | 0.814 |
| 4 | [Claude 3.7 Sonnet](/wiki/claude_sonnet_4_5) | [Anthropic](/wiki/anthropic) | 0.812 |
| 5 | [Claude Sonnet 4](/wiki/claude) | [Anthropic](/wiki/anthropic) | 0.805 |
| 6 | GLM-4.5 | [Zhipu AI](/wiki/zhipu_ai) | 0.797 |
| 7 | GLM-4.5-Air | [Zhipu AI](/wiki/zhipu_ai) | 0.779 |
| 8 | Qwen3-Coder 480B A35B | [Alibaba](/wiki/qwen) | 0.775 |
| 9 | [o4-mini](/wiki/openai_o-series) | [OpenAI](/wiki/openai) | 0.718 |
| 10 | [o1](/wiki/openai_o-series) | [OpenAI](/wiki/openai) | 0.708 |
| 11 | Qwen3-Next-80B-A3B-Thinking | [Alibaba](/wiki/qwen) | 0.696 |
| 12 | [Claude 3.5 Sonnet](/wiki/claude_sonnet_4_5) | [Anthropic](/wiki/anthropic) | 0.692 |
| 13 | [GPT-4.5](/wiki/gpt-4) | [OpenAI](/wiki/openai) | 0.684 |
| 14 | [GPT-4.1](/wiki/gpt-4) | [OpenAI](/wiki/openai) | 0.680 |
| 15 | GPT OSS 120B | [OpenAI](/wiki/openai) | 0.678 |
| 16 | [GPT-4o](/wiki/gpt_4o) | [OpenAI](/wiki/openai) | 0.603 |
| 17 | [o3-mini](/wiki/openai_o-series) | [OpenAI](/wiki/openai) | 0.576 |
| 18 | GPT-4.1 mini | [OpenAI](/wiki/openai) | 0.558 |
| 19 | [Claude 3.5 Haiku](/wiki/claude) | [Anthropic](/wiki/anthropic) | 0.510 |
| 20 | GPT-4.1 nano | [OpenAI](/wiki/openai) | 0.226 |

The average score across all 25 evaluated models is 0.678.[10] Anthropic's Claude models dominate the top five positions, with Claude Sonnet 4.5 achieving the highest retail score of 0.862.[10]

### Current Leaderboard: Airline Domain

The airline domain remains substantially harder, with the best models scoring around 70% compared to 86% in retail:[11]

| Rank | Model | Organization | Airline pass^1 |
| --- | --- | --- | --- |
| 1 | [Claude Sonnet 4.5](/wiki/claude) | [Anthropic](/wiki/anthropic) | 0.700 |
| 2 | MiniMax M1 80K | [MiniMax](/wiki/minimax) | 0.620 |
| 3 | GLM-4.5-Air | [Zhipu AI](/wiki/zhipu_ai) | 0.608 |
| 4 | GLM-4.5 | [Zhipu AI](/wiki/zhipu_ai) | 0.604 |
| 5 | MiniMax M1 40K | [MiniMax](/wiki/minimax) | 0.600 |
| 5 | [Claude Sonnet 4](/wiki/claude) | [Anthropic](/wiki/anthropic) | 0.600 |
| 5 | Qwen3-Coder 480B A35B | [Alibaba](/wiki/qwen) | 0.600 |
| 8 | [Claude Opus 4](/wiki/claude) | [Anthropic](/wiki/anthropic) | 0.596 |
| 9 | [Claude 3.7 Sonnet](/wiki/claude_sonnet_4_5) | [Anthropic](/wiki/anthropic) | 0.584 |
| 10 | [Claude Opus 4.1](/wiki/claude) | [Anthropic](/wiki/anthropic) | 0.560 |
| 11 | [o1](/wiki/openai_o-series) | [OpenAI](/wiki/openai) | 0.500 |
| 11 | [GPT-4.5](/wiki/gpt-4) | [OpenAI](/wiki/openai) | 0.500 |
| 13 | [GPT-4.1](/wiki/gpt-4) | [OpenAI](/wiki/openai) | 0.494 |
| 14 | [o4-mini](/wiki/openai_o-series) | [OpenAI](/wiki/openai) | 0.492 |
| 15 | [Claude 3.5 Sonnet](/wiki/claude_sonnet_4_5) | [Anthropic](/wiki/anthropic) | 0.460 |
| 16 | [GPT-4o](/wiki/gpt_4o) | [OpenAI](/wiki/openai) | 0.428 |
| 17 | GPT-4.1 mini | [OpenAI](/wiki/openai) | 0.360 |
| 18 | [o3-mini](/wiki/openai_o-series) | [OpenAI](/wiki/openai) | 0.324 |
| 19 | [Claude 3.5 Haiku](/wiki/claude) | [Anthropic](/wiki/anthropic) | 0.228 |
| 20 | GPT-4.1 nano | [OpenAI](/wiki/openai) | 0.140 |

The average score across all 23 evaluated airline models is 0.495.[11] The HAL evaluation team at Princeton independently verified that 48 of 50 airline tasks (96%) have been solved by at least one agent, suggesting that the individual tasks are solvable but no single agent can solve them all consistently.[9]

### Key Performance Trends

Several patterns emerge from the leaderboard data:

1. **Anthropic dominance in retail**: Claude models hold the top five retail positions, which may reflect targeted optimization for multi-turn tool-use tasks.[10]
2. **Chinese models competitive in airline**: Zhipu AI's GLM-4.5 models and MiniMax's M1 models perform comparably to or better than some Claude variants in the airline domain.[11]
3. **Reasoning models mixed**: OpenAI's reasoning models (o1, o3-mini, o4-mini) do not consistently outperform their non-reasoning counterparts (GPT-4.1, GPT-4o) on this benchmark, suggesting that chain-of-thought reasoning alone is insufficient for multi-turn agent tasks.
4. **Cost-performance tradeoffs**: On the airline domain, the Pareto frontier includes Gemini 2.0 Flash (28% accuracy, $0.31 per evaluation), DeepSeek V3 (44%, $5.43), and o4-mini (56%, $11.36), showing that the cheapest option is not always the worst and the most expensive is not always the best.[11]
5. **Domain gap persists**: The gap between retail and airline performance is consistent across models. Airline tasks involve more complex policies, multi-leg trips, and more subtle edge cases.
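
The cost-performance point in item 4 can be made concrete by computing a Pareto frontier over (accuracy, cost) pairs. The sketch below uses the three airline figures quoted above plus one clearly hypothetical model added only to show a dominated point; `pareto_frontier` is an illustrative helper.

```python
def pareto_frontier(points):
    """Return names of models not dominated by a cheaper model of equal or higher accuracy."""
    frontier = []
    best_accuracy = float("-inf")
    # Walk from cheapest to most expensive; ties on cost put higher accuracy first.
    for name, accuracy, cost in sorted(points, key=lambda p: (p[2], -p[1])):
        if accuracy > best_accuracy:
            frontier.append(name)
            best_accuracy = accuracy
    return frontier


# (model, airline accuracy in percent, cost in USD per evaluation)
points = [
    ("Gemini 2.0 Flash", 28, 0.31),
    ("DeepSeek V3", 44, 5.43),
    ("o4-mini", 56, 11.36),
    ("hypothetical-model", 40, 8.00),  # made up: costs more than DeepSeek V3 but scores lower
]

print(pareto_frontier(points))
```

A model sits on the frontier when no cheaper model matches or beats its accuracy, so the made-up entry is excluded while all three reported models remain.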

### Failure Analysis

Common failure modes identified:

1. **Policy Violations**: Agents bypass or misinterpret domain rules
2. **Context Loss**: Information forgotten in long conversations
3. **User Misunderstanding**: Incorrect interpretation of user intent
4. **API Misuse**: Incorrect tool selection or parameter usage
5. **State Confusion**: Losing track of transaction state

## τ³-bench: Task Corrections

In 2025, the τ-bench team released τ³-bench, an updated version that audited and corrected over 50 tasks across both domains (27 airline fixes, 26 retail fixes).[8] The corrections addressed five categories of issues identified through community feedback and the τ-Bench Verified research program:[8]

| Issue Category | Description | Examples |
| --- | --- | --- |
| Incorrect expected actions | The annotated goal state was wrong | Compensation offered to ineligible passengers; invalid PayPal refunds |
| Ambiguous user instructions | The user instruction allowed multiple valid interpretations | Economy vs. basic economy confusion; "similar" vs. "same" item specifications |
| Impossible constraints | The task setup made the intended solution unreachable | Required payment methods absent from user profiles; location contradictions |
| Missing fallback behaviors | No guidance for what the agent should do when the intended path fails | No defined action when product searches return no results |
| Policy loophole prevention | Tasks that could be "solved" by exploiting policy gaps | Cancel-and-rebook exploits instead of proper modification refusals; cabin upgrade workarounds |

The impact of these corrections varied by domain. In the airline domain, pass^1 scores increased by 14 to 20 points across models, and pass^4 improvements were even larger (up to 22 points for some models). In the retail domain, the changes were more modest, with pass^1 shifts ranging from -0.4 to +5.5 points. The retail corrections primarily reduced evaluation variance rather than uniformly boosting scores.[8]

## Implementation and Usage

### Setup and Configuration

τ-bench is open-source under the MIT license and available on GitHub. It can be installed with:[2]

```
git clone https://github.com/sierra-research/tau-bench
cd tau-bench
pip install -e .
```

API keys must be set as environment variables for the model providers being evaluated: `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, `GOOGLE_API_KEY`, `MISTRAL_API_KEY`.[2]
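
Before launching a long run, it can help to confirm that the required keys are present. The following is a small sketch: the environment variable names come from the list above, while `missing_keys` and the provider labels other than `openai` are assumptions made for the example.

```python
import os

# Environment variable expected for each provider (provider labels are illustrative).
PROVIDER_KEYS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "google": "GOOGLE_API_KEY",
    "mistral": "MISTRAL_API_KEY",
}


def missing_keys(providers, environ=None):
    """Return the variable names that are unset or empty for the given providers."""
    environ = os.environ if environ is None else environ
    return [PROVIDER_KEYS[p] for p in providers if not environ.get(PROVIDER_KEYS[p])]


# Demo with a stand-in environment so the result does not depend on the local shell.
demo_env = {"OPENAI_API_KEY": "sk-placeholder"}
print(missing_keys(["openai", "anthropic"], demo_env))  # ['ANTHROPIC_API_KEY']
```

Calling `missing_keys(["openai"])` without the second argument checks the real process environment instead.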

### Running Evaluations

The benchmark is run through a command-line interface:

```
python run.py --agent-strategy tool-calling --env retail --model gpt-4o --model-provider openai --user-model gpt-4o --user-model-provider openai --user-strategy llm --max-concurrency 10
```

Key configuration options include:

| Parameter | Options | Description |
| --- | --- | --- |
| --agent-strategy | tool-calling, react, act | How the agent structures its reasoning and actions |
| --env | retail, airline | Which domain to evaluate |
| --model | Various | The model powering the agent |
| --user-model | Various | The model powering the user simulator |
| --user-strategy | llm, react, verify, reflection | How the user simulator generates responses |
| --max-concurrency | Integer | Number of parallel evaluations |
| --task-ids | Space-separated integers | Run specific tasks by ID |
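
Running every combination of domain and agent strategy is a common use of these options. The sketch below only builds the command lines, using flags that appear in the command and table above; `build_command` is a hypothetical helper, and the fixed `gpt-4o` user simulator simply mirrors the earlier example.

```python
from itertools import product


def build_command(env, agent_strategy, model="gpt-4o", provider="openai", max_concurrency=10):
    """Return the argv list for one run.py invocation (hypothetical helper)."""
    return [
        "python", "run.py",
        "--agent-strategy", agent_strategy,
        "--env", env,
        "--model", model,
        "--model-provider", provider,
        "--user-model", "gpt-4o",
        "--user-model-provider", "openai",
        "--user-strategy", "llm",
        "--max-concurrency", str(max_concurrency),
    ]


envs = ["retail", "airline"]
strategies = ["tool-calling", "react", "act"]
commands = [build_command(env, strategy) for env, strategy in product(envs, strategies)]

for command in commands:
    # To execute, pass each list to subprocess.run(command, check=True) from the repository root.
    print(" ".join(command))
```

Keeping the user simulator fixed while varying the agent makes the resulting scores comparable across runs.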

### Supported Models

τ-bench supports evaluation of models from multiple providers:

| Provider | Models Supported | Integration Method |
| --- | --- | --- |
| [OpenAI](/wiki/openai) | GPT-4, GPT-4o, GPT-4o-mini, GPT-4.1, GPT-3.5, o1, o3, o4-mini | API |
| [Anthropic](/wiki/anthropic) | Claude 3, Claude 3.5 Sonnet, Claude 3.7 Sonnet, Claude Sonnet 4, Claude Opus 4 | API |
| [Google](/wiki/google_deepmind) | Gemini Pro, Gemini Flash, Gemini Ultra | API |
| [Mistral AI](/wiki/mistral_ai) | Mistral Large, Mistral Medium | API |
| [Zhipu AI](/wiki/zhipu_ai) | GLM-4.5, GLM-4.5-Air | API |
| [AnyScale](/wiki/anyscale) | Open-source models via API | API |

### Technical Features

- **Concurrent Processing**: Parallel API request handling through the `--max-concurrency` flag, allowing multiple tasks to be evaluated simultaneously.
- **Error Recovery**: Automatic error identification and retry.
- **Historical Trajectories**: The repository stores historical interaction trajectories, which can be replayed for analysis without re-running expensive API calls.
- **Error Analysis**: The `auto_error_identification.py` script automatically classifies failures by fault assignment and fault type.[2]
- **Configurable Tasks**: Specific tasks can be selected with the `--task-ids` flag for targeted debugging.
- **Extensible Framework**: New domains can be added by implementing the environment interface with a database, tool set, and policy document.
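
As a rough illustration of the three ingredients a new domain needs (a database, a tool set, and a policy document), the sketch below defines a toy library-lending domain. The class name `DomainSketch`, the tool functions, and the data are all hypothetical; the actual environment interface in the repository differs and should be consulted directly.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict

Tool = Callable[..., str]


@dataclass
class DomainSketch:
    """Hypothetical container: policy text, mutable database, named tools."""
    policy: str
    database: Dict[str, Any]
    tools: Dict[str, Tool] = field(default_factory=dict)

    def call(self, name: str, **kwargs: Any) -> str:
        # Unknown tools return an error string instead of raising, so the
        # agent can read the message and recover within the conversation.
        if name not in self.tools:
            return f"Error: unknown tool {name}"
        return self.tools[name](self.database, **kwargs)


def get_book(db: Dict[str, Any], book_id: str) -> str:
    book = db["books"].get(book_id)
    if book is None:
        return "Error: book not found"
    return f"{book['title']} (available: {book['available']})"


def checkout_book(db: Dict[str, Any], book_id: str) -> str:
    book = db["books"].get(book_id)
    if book is None:
        return "Error: book not found"
    if not book["available"]:
        return "Error: book already checked out"
    book["available"] = False  # tools mutate the shared database
    return "ok"


library = DomainSketch(
    policy="Only available books may be checked out.",
    database={"books": {"b1": {"title": "Dune", "available": True}}},
    tools={"get_book": get_book, "checkout_book": checkout_book},
)
```

Because tools mutate the database, a task in such a domain can be scored by comparing the final database state against an expected one.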

## Impact and Adoption

### Industry Adoption

τ-bench has become one of the standard benchmarks for evaluating AI agents in the industry:

- **[Anthropic](/wiki/anthropic)**: Has embraced τ-bench as a key benchmark for Claude model development. Claude 3.7 Sonnet's launch announcement highlighted its top performance on τ-bench, and Anthropic has incorporated pass^k metrics into their model evaluation process.[15] The company has used self-reflection and longer chain-of-thought prompting to improve consistency on the benchmark.
- **[OpenAI](/wiki/openai)**: Has used τ-bench to evaluate GPT-4.1, GPT-4.5, and the o-series reasoning models. OpenAI showcased GPT-5's τ-bench performance as part of its agent capabilities, and used the benchmark for GPT-OSS model evaluation.[14]
- **[Sierra AI](/wiki/sierra_ai)**: Uses τ-bench as a core evaluation metric for its own agent development pipeline.[4]
- **AI startups**: Companies like Scaled Cognition have adopted τ-bench to evaluate their agent foundation models.
- **Princeton HAL**: The Princeton Holistic Agent Leaderboard (HAL) team independently reproduces τ-bench results, providing verified scores for the airline domain leaderboard.[9]
- **Research labs**: Academic and industry labs routinely report τ-bench results in papers on agents.

### Academic Influence

Within one year of release, τ-bench has had significant academic impact:

- The paper was accepted at ICLR 2025 as a poster presentation.[12]
- It has been cited in numerous agent evaluation papers.
- It has inspired domain-specific variants (for example [MedAgentBench](/wiki/MedAgentBench)).
- It has become a common choice for multi-turn agent evaluation.
- It has influenced new evaluation methodologies.
- The pass^k metric has been adopted beyond τ-bench as a standard reliability measure for agent evaluation.
- It has become a standard reference point alongside [WebArena](/wiki/webarena) and [SWE-bench](/wiki/swe_bench) as representing a new class of rigorous agent benchmarks.

Researchers found that naive ReAct-style agents often break down during complex multi-step sequences, leading to the development of enhanced approaches including hierarchical architectures that maintain goals and memory throughout multi-turn conversations.

### Derivative Benchmarks

#### τ²-bench (Tau-squared-bench)

Released in 2025 by Sierra Research (arXiv: 2506.07982), τ²-bench extends the τ-bench framework to a telecom domain with a critical new challenge: dual-control environments.[6] In the original τ-bench, only the agent uses tools. In τ²-bench, both the agent and the user can take actions in a shared environment, modeled as a Decentralized Partially Observable Markov Decision Process (Dec-POMDP).[6]

Key features of τ²-bench:

| Feature | Description |
| --- | --- |
| Telecom domain | Customer service scenarios for internet, phone, and TV service providers |
| Dual-control | Both agent and user have tools; the user must perform actions on their end (reboot router, check settings) |
| Compositional task generator | Programmatically creates diverse, verifiable tasks from atomic components |
| Reliable user simulator | Tightly coupled with the environment to ensure consistent behavior |
| Fine-grained error analysis | Separates errors from reasoning vs. communication/coordination |
| Coordination metrics | Measuring agent-user collaboration |
| Enhanced complexity | More sophisticated interactions |

Performance drops significantly in τ²-bench compared to τ-bench. [GPT-4.1](/wiki/gpt-4), which achieves 74% pass^1 in retail and 56% in airline, drops to 34% in the telecom domain.[6] This gap highlights the difficulty of guiding users through actions rather than performing all actions autonomously.
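
The dual-control idea can be illustrated with a toy example. The sketch below is not τ²-bench code: the state fields, tool names, and the `act` helper are all hypothetical. It shows a shared state that only reaches the goal when the agent fixes its side and the user, following the agent's instructions, fixes theirs.

```python
from dataclasses import dataclass


@dataclass
class SharedState:
    line_suspended: bool = True   # provider-side setting, changed by the agent
    airplane_mode: bool = True    # device-side setting, changed by the user

    def has_service(self) -> bool:
        return not self.line_suspended and not self.airplane_mode


def resume_line(state: SharedState) -> None:
    state.line_suspended = False


def toggle_airplane_mode(state: SharedState) -> None:
    state.airplane_mode = not state.airplane_mode


# Each role has its own tool set; neither can call the other's tools.
AGENT_TOOLS = {"resume_line": resume_line}
USER_TOOLS = {"toggle_airplane_mode": toggle_airplane_mode}


def act(role: str, tool: str, state: SharedState) -> None:
    tools = AGENT_TOOLS if role == "agent" else USER_TOOLS
    if tool not in tools:
        raise PermissionError(f"{role} cannot call {tool}")
    tools[tool](state)


state = SharedState()
act("agent", "resume_line", state)          # agent fixes its side
still_broken = not state.has_service()      # the user's device is still in airplane mode
act("user", "toggle_airplane_mode", state)  # user acts after being instructed
```

In this toy setup the agent cannot reach the goal alone; it has to diagnose the device-side problem and talk the user through the fix.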

#### τ³-bench

Rather than a new benchmark, τ³-bench is a corrected version of the original τ-bench tasks (described in the Task Corrections section above).[8]

#### MedAgentBench

Inspired directly by τ-bench, MedAgentBench adapts the framework to the medical domain. It features:

- Electronic Medical Record (EMR) environment
- FHIR API integration
- Physician-written scenarios
- Clinical task evaluation

The authors cited τ-bench as effective for testing general agent capabilities while noting the absence of a standardized medical agent benchmark, prompting creation of the domain-specific variant.

#### LegalAgentBench

Another domain-specific adaptation that emerged for evaluating AI agents in legal applications, inspired by τ-bench's framework for policy-compliant agent evaluation.

## Limitations and Criticisms

### Methodological Limitations

| Limitation | Description | Impact |
| --- | --- | --- |
| Limited domains | Only two domains (retail and airline) in the original benchmark | Raises questions about generalization to other settings |
| Simulated users | LLM-based users may not capture the full range of human behavior | May underestimate difficulty with real, unpredictable users |
| English only | No multilingual support | Limits applicability to non-English customer service settings |
| Static task set | Fixed set of 165 tasks | Risk of overfitting as models are evaluated repeatedly |
| Binary scoring | No partial credit for partially correct solutions | May not capture agents that get "almost right" |
| Text only | No visual, voice, or document-based interactions | Misses multimodal aspects of real customer service |

### Evaluation Concerns

- **Evaluation brittleness**: Binary success/failure scoring gives no credit for partially completed tasks.
- **User model bias**: LLM-simulated users may behave more predictably and cooperatively than real humans, potentially inflating scores.
- **Policy simplification**: The policy documents in τ-bench, while non-trivial, are simpler than the policy manuals used by actual airlines and retailers, which can run to hundreds of pages.
- **Sim2Real gap**: Research on the gap between simulated and real user interactions has raised questions about how well τ-bench performance predicts real-world agent performance.
- **Self-reported scores**: Many leaderboard entries are self-reported by model providers rather than independently verified, creating potential for selection bias.
- **Benchmark saturation**: In the airline domain, 48 of 50 tasks have been solved by at least one agent, suggesting the individual tasks may be approaching saturation even though no single agent solves them all.[9]
- **Limited error types**: The benchmark's error taxonomy may not cover every failure mode.
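
The saturation point above turns on the difference between union coverage (tasks solved by at least one agent) and the best single-agent score. The sketch below computes both on invented data; the agent names and solved-task sets are hypothetical and do not come from the HAL leaderboard.

```python
def coverage(solved_by_agent, num_tasks):
    """Return (union coverage, best single-agent coverage) as fractions."""
    union = set().union(*solved_by_agent.values())
    best = max(len(solved) for solved in solved_by_agent.values())
    return len(union) / num_tasks, best / num_tasks


# Hypothetical results: task IDs each agent solved, out of 6 tasks.
solved = {
    "agent_a": {0, 1, 2},
    "agent_b": {2, 3},
    "agent_c": {1, 4},
}

union_rate, best_rate = coverage(solved, 6)
print(f"union: {union_rate:.2f}, best single agent: {best_rate:.2f}")
```

A high union rate alongside a much lower best single-agent rate indicates that most tasks are individually solvable while no one agent solves them all.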

## Future Directions

### Planned and Anticipated Improvements

1. **Domain expansion**: Healthcare, finance, and education domains would test generalization across different policy structures and tool sets.
2. **Multimodal support**: Adding images, documents, and voice interactions would better reflect real customer service environments.
3. **Human evaluation studies**: Replacing LLM-simulated users with real humans would address the sim2real gap concern.
4. **Dynamic task generation**: Procedurally generating new tasks would prevent overfitting to the fixed task set.
5. **Fine-grained metrics**: Partial credit scoring could better differentiate agents that nearly solve a task from those that fail completely.
6. **Multilingual evaluation**: Expanding beyond English would test agents in the global customer service settings where they are increasingly deployed.

### Open Research Questions

- **Agent architecture design**: What architectures best handle the long-horizon, policy-constrained, multi-turn interactions that τ-bench tests?
- **Training on τ-bench**: Can agents be trained or fine-tuned specifically on τ-bench-style interactions to improve reliability?
- **User modeling**: How can user simulation be made more realistic without requiring human participants for every evaluation?
- **Policy learning**: Can agents learn to extract and follow policies from lengthy documents more reliably, including automatic policy extraction and compliance?
- **Coordination in dual-control settings**: How should agents guide users through actions in shared environments (as explored in τ²-bench)?

## Related Benchmarks

| Benchmark | Focus | Key Difference from τ-bench |
| --- | --- | --- |
| [SWE-bench](/wiki/swe_bench) | Software engineering tasks | Single-turn; no user interaction |
| [AgentBench](/wiki/agentbench) | Multi-environment agent evaluation | Tests multiple environments but single-turn interactions |
| [WebArena](/wiki/webarena) | Web navigation and interaction | Focuses on browser-based tasks rather than conversational customer service |
| [WebShop](/wiki/WebShop) | E-commerce navigation | Earlier work by the same lead author; simpler single-turn shopping tasks |
| [ALFWorld](/wiki/ALFWorld) | Embodied agent tasks | Focuses on text-based embodied environments rather than customer service |
| [InterCode](/wiki/InterCode) | Interactive coding benchmark | Code-focused rather than customer service conversations |
| [GAIA](/wiki/GAIA) | General AI assistant tasks | Broader scope but less focus on multi-turn tool use and policy compliance |
| [BFCL](/wiki/bfcl) | Function calling accuracy | Tests tool use in isolation rather than within multi-turn conversations |
| [τ²-bench](/wiki/τ²_bench) | Dual-control conversational agents | Extends τ-bench with a telecom domain where both agent and user have tools |
| [MT-Bench](/wiki/mt-bench) | Multi-turn conversation quality | Tests conversational ability but not tool use or policy compliance |

## See Also

- [AI Agents](/wiki/ai_agents)
- [Tool Use in LLMs](/wiki/Tool_Use_in_LLMs)
- [ReAct Prompting](/wiki/react_prompting)
- [Function Calling](/wiki/structured_output)
- [Large Language Models](/wiki/large_language_model)
- [Multi-turn Dialogue Systems](/wiki/Multi_turn_Dialogue_Systems)
- [Conversational AI](/wiki/Conversational_AI)
- [Agent Evaluation](/wiki/agent_evaluation)
- [Human-AI Interaction](/wiki/Human_AI_Interaction)
- [SWE-bench](/wiki/swe_bench)
- [AgentBench](/wiki/agentbench)

## References

1. Yao, S., Shinn, N., Razavi, P., & Narasimhan, K. (2024). "τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains." arXiv:2406.12045. Retrieved from [https://arxiv.org/abs/2406.12045](https://arxiv.org/abs/2406.12045)
2. Sierra Research. (2024). "Tau-bench: Code and Data." GitHub. Retrieved from [https://github.com/sierra-research/tau-bench](https://github.com/sierra-research/tau-bench)
3. Sierra AI. (2024). "Benchmarking AI agents for the real-world." Sierra Blog. Retrieved from [https://sierra.ai/blog/benchmarking-ai-agents](https://sierra.ai/blog/benchmarking-ai-agents)
4. Sierra AI. "τ-bench: Shaping the development and evaluation of agents." Sierra Blog. Retrieved from [https://sierra.ai/blog/tau-bench-shaping-development-evaluation-agents](https://sierra.ai/blog/tau-bench-shaping-development-evaluation-agents)
5. Sierra Research. (2025). "τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment." Retrieved from [https://sierra.ai/resources/research/tau-squared-bench](https://sierra.ai/resources/research/tau-squared-bench)
6. Barres, V., Dong, H., Ray, S., Si, X., & Narasimhan, K. (2025). "τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment." arXiv:2506.07982. Retrieved from [https://arxiv.org/abs/2506.07982](https://arxiv.org/abs/2506.07982)
7. τ-bench Official Leaderboard. Retrieved from [https://taubench.com/](https://taubench.com/)
8. τ³-bench: Fixing Airline + Retail. Retrieved from [https://taubench.com/blog/tau3-task-fixes.html](https://taubench.com/blog/tau3-task-fixes.html)
9. Princeton HAL: TAU-bench Airline Leaderboard. Retrieved from [https://hal.cs.princeton.edu/taubench_airline](https://hal.cs.princeton.edu/taubench_airline)
10. LLM-Stats: TAU-bench Retail Leaderboard. Retrieved from [https://llm-stats.com/benchmarks/tau-bench-retail](https://llm-stats.com/benchmarks/tau-bench-retail)
11. LLM-Stats: TAU-bench Airline Leaderboard. Retrieved from [https://llm-stats.com/benchmarks/tau-bench-airline](https://llm-stats.com/benchmarks/tau-bench-airline)
12. ICLR 2025 Poster: τ-bench. Retrieved from [https://iclr.cc/virtual/2025/poster/28170](https://iclr.cc/virtual/2025/poster/28170)
13. Anthropic. (2024). "Introducing computer use, a new Claude 3.5 Sonnet." Retrieved from [https://www.anthropic.com/news/3-5-models-and-computer-use](https://www.anthropic.com/news/3-5-models-and-computer-use)
14. OpenAI. (2025). "Introducing gpt-oss." Retrieved from [https://openai.com/index/introducing-gpt-oss/](https://openai.com/index/introducing-gpt-oss/)
15. Medium. (2025). "Claude 3.7 Sonnet Unveiled: Reviewing Anthropic's Most Advanced Reasoning Model." Retrieved from [https://medium.com/@bernardloki/claude-3-7-sonnet-unveiled-reviewing-anthropics-most-advanced-reasoning-model-772b74331226](https://medium.com/@bernardloki/claude-3-7-sonnet-unveiled-reviewing-anthropics-most-advanced-reasoning-model-772b74331226)

