τ-bench (Tau-bench), short for Tool-Agent-User Interaction Benchmark, is an AI benchmark designed to evaluate language agents' ability to complete complex tasks through realistic interactions with simulated users and domain-specific tools. Released on June 17, 2024, by Sierra Research and Princeton University, τ-bench addresses critical gaps in existing benchmarks by testing agents on their ability to follow domain-specific rules, maintain context over long conversations, and effectively coordinate with users to achieve goals. The paper was accepted as a poster at the International Conference on Learning Representations (ICLR) 2025.
Unlike earlier agent benchmarks such as WebArena, SWE-bench, and AgentBench, which evaluate agents in single-round human-agent interactions, τ-bench requires agents to handle multiple dynamic exchanges where information is gathered incrementally over the course of a conversation. This design mirrors the way customer service agents operate in practice, where a single interaction can involve identity verification, information retrieval, policy checking, multi-step tool calls, and explicit user confirmation before taking action.
| Attribute | Value |
|---|---|
| Full name | Tau-bench: Tool-Agent-User Interaction Benchmark |
| Abbreviation | τ-bench |
| Description | A benchmark for evaluating AI agents' ability to complete complex tasks through realistic tool-agent-user interactions in real-world domains |
| Release date | 2024-06-17 |
| Latest version | τ³-bench (task-corrected) |
| Benchmark updated | 2025 |
| Authors | Shunyu Yao, Noah Shinn, Pedram Razavi, Karthik Narasimhan |
| Organization | Sierra Research, Princeton University |
| Type | Agent Evaluation, Multi-turn Interaction |
| Modality | Text, API Calls |
| Task format | Conversational task completion |
| Number of tasks | 165 total (115 retail, 50 airline) |
| Evaluation metric | pass^k, Database State Comparison |
| Domains | Airline, Retail |
| Languages | English |
| Human performance | Not reported |
| SOTA score | 86.2% (Retail pass^1), 70.0% (Airline pass^1) |
| SOTA model | Claude Sonnet 4.5 |
| SOTA date | 2025 |
| Saturated | No (airline); approaching saturation (retail) |
| Website | Official website |
| Paper | arXiv:2406.12045 |
| GitHub | sierra-research/tau-bench |
| License | MIT |
| Conference | ICLR 2025 (Poster) |
| Successor | τ²-bench, τ³-bench |
τ-bench represents a paradigm shift in AI agent evaluation, moving beyond simple task completion to assess agents' performance in dynamic, multi-turn conversations that mirror real-world applications. The benchmark emulates scenarios where an AI agent must interact with both users (simulated by large language models) and domain-specific API tools while adhering to complex policy guidelines.
Unlike traditional benchmarks that focus on single-turn interactions or isolated capabilities, τ-bench introduces:
- Multi-turn conversations with an LLM-simulated user whose goals are hidden from the agent and revealed only through dialogue
- Domain-specific API tools and policy documents that the agent must follow while acting
- Objective, automated evaluation that compares the final database state against an annotated goal state
- The pass^k metric, which measures reliability across repeated independent trials
The benchmark revealed that even state-of-the-art function calling agents (like GPT-4o) succeed on less than 50% of the tasks and are quite inconsistent (pass^8 < 25% in retail).
By mid-2024, large language models had demonstrated strong capabilities in isolated tasks like text generation, code completion, and question answering. However, deploying these models as autonomous agents in real-world settings revealed a gap between benchmark performance and actual reliability. Existing benchmarks at the time tested models on well-defined problems with no user in the loop: SWE-bench measured software engineering ability, HumanEval tested code generation, and AgentBench evaluated agents across several interactive environments. None of these benchmarks required agents to interact with a human user over multiple turns while simultaneously calling tools and following complex policies.
Sierra, the conversational AI company co-founded by Bret Taylor (former Salesforce co-CEO and OpenAI board member) and Clay Bavor (former Google executive), had direct experience with the gap between benchmark scores and real-world agent performance. Sierra's research team, led by Karthik Narasimhan, identified three capabilities that existing benchmarks failed to measure:
- Following domain-specific rules and policies consistently
- Maintaining context over long, multi-turn conversations
- Coordinating with users to gather information, confirm details, and achieve their goals
Shunyu Yao, the lead author of τ-bench, is known for creating the ReAct framework (ICLR 2023), which introduced the paradigm of interleaving reasoning traces with actions in language models. He also created WebShop (NeurIPS 2022), an earlier e-commerce interaction benchmark, and co-developed SWE-agent. At the time of τ-bench's release, Yao was affiliated with both Sierra Research and Princeton University. Noah Shinn, another co-author, is known for his work on Reflexion, a framework for language agent self-reflection. Pedram Razavi contributed from Sierra Research, and Karthik Narasimhan, a professor at Princeton University, leads the Princeton Language and Intelligence lab where much of this agent research originated.
τ-bench employs a modular three-component architecture that simulates realistic customer service interactions. Each component plays a distinct role in creating a closed-loop evaluation environment.
| Component | Role | Implementation Details |
|---|---|---|
| User Simulator | Generates realistic user messages based on hidden instructions | LLM-based (GPT-4, Claude, etc.); guided by task-specific instructions that define user identity, intent, and preferences |
| Agent System | Processes user requests, calls tools, and follows policies | Supports multiple architectures: tool-calling (function calling), ReAct, and Act |
| Environment | Provides API tools, maintains database state, and enforces domain rules | Domain-specific databases with realistic data; tools exposed as callable functions |
The interaction loop works as follows: the user simulator initiates a conversation with a request (for example, "I need to cancel my flight to Denver"). The agent responds by calling tools to look up information, asking clarifying questions, or taking actions. The user simulator replies based on its hidden instruction set, which specifies the user's identity, preferences, and constraints. This back-and-forth continues until the agent resolves the request or the conversation reaches a maximum number of turns.
Critically, the user instructions are hidden from the agent. The agent only sees the user's natural language messages and must infer what the user wants through conversation, just as a real customer service agent would.
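A minimal sketch of this closed loop is shown below. The `UserSimulator`, `Agent`, and `Environment` objects and their methods are illustrative stand-ins, not the repository's actual API; the real implementation differs in detail.

```python
# Minimal sketch of a tau-bench-style interaction loop.
# All class and method names here are hypothetical, not the repository's API.

MAX_TURNS = 30  # conversations are capped at a maximum number of turns


def run_episode(user_sim, agent, env, max_turns=MAX_TURNS):
    """Run one agent/user/environment conversation until the user ends it."""
    history = []
    user_msg = user_sim.first_message()  # e.g. "I need to cancel my flight to Denver"
    for _ in range(max_turns):
        history.append({"role": "user", "content": user_msg})

        # The agent may emit tool calls (reads/writes against the env database)
        # or a natural-language reply to the user.
        action = agent.step(history)
        if action.tool_calls:
            for call in action.tool_calls:
                observation = env.invoke(call.name, **call.arguments)
                history.append({"role": "tool", "name": call.name,
                                "content": observation})
            continue  # let the agent see tool results before replying

        history.append({"role": "assistant", "content": action.message})

        # The user simulator replies based on its hidden instruction,
        # which the agent never sees directly.
        user_msg = user_sim.respond(history)
        if user_msg == "###STOP###":  # stop token signalling the end of the conversation
            break

    return env.database_state(), history
```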
Each task in τ-bench includes a structured user instruction that defines:
- The user's identity (name and account details used for authentication)
- The user's intent (the concrete goal of the interaction)
- Preferences and constraints (for example, a preferred payment method or what to accept if the first choice is unavailable)
The instruction is designed so that only one correct outcome exists under the domain's policy. This constraint is essential for automated evaluation: because there is exactly one valid goal state, the benchmark can verify success by comparing the final database state against the annotated expected state without requiring subjective human judgment.
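The example below illustrates how such a task might be encoded together with its unique goal state. Field names, IDs, and the tool-call annotation are invented for illustration; the repository's actual task schema differs.

```python
# Illustrative shape of a tau-bench-style task; all field names and IDs are hypothetical.
example_task = {
    "user_id": "jane_doe_1234",
    "instruction": (
        "You are Jane Doe. You want to exchange the blue medium t-shirt in your "
        "delivered order for the red large version of the same product. "
        "You do not want any other changes, and you will not accept alternatives."
    ),
    # The unique goal state. In practice it can be expressed as the write actions
    # whose execution produces the expected final database state.
    "expected_actions": [
        {
            "name": "exchange_delivered_order",
            "arguments": {
                "order_id": "#W0001234",
                "item_ids": ["tshirt_blue_m"],
                "new_item_ids": ["tshirt_red_l"],
                "payment_method_id": "credit_card_0001",
            },
        }
    ],
    # Information the agent must communicate to the user, if any.
    "expected_outputs": [],
}
```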
The user simulator itself is powered by an LLM (typically GPT-4 or Claude). While the instructions are synthetic, the utterances the simulator generates are open-ended and natural-sounding. The benchmark supports several user simulation strategies:
| Strategy | Description | Use Case |
|---|---|---|
| LLM (default) | Direct LLM-generated responses based on user instructions | Standard evaluation |
| ReAct | Adds explicit reasoning steps before generating user responses | Complex reasoning tasks |
| Verify | Includes an LLM verification loop after each response | High-accuracy requirements |
| Reflection | Adds self-correction to improve response quality | Improved consistency |
Human evaluation of the user simulator found that it adhered well to its guidelines across four criteria: adherence to simulator guidelines, adherence to user instructions, correct use of user tools (when applicable), and generation of natural, consistent conversational continuations.
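A rough sketch of how a verification-style strategy from the table above can be layered on top of the plain LLM simulator follows. The `llm_chat` helper and the prompts are placeholders, not the repository's implementation.

```python
# Sketch of a "verify"-style user simulation strategy: generate a candidate
# user message, then ask a second LLM call to check it against the hidden
# instruction, regenerating once if it is flagged. All names are illustrative.

def verified_user_response(llm_chat, instruction, history, max_retries=1):
    """llm_chat(messages) -> str is a placeholder for any chat-completion call."""
    for _ in range(max_retries + 1):
        candidate = llm_chat([
            {"role": "system",
             "content": f"You are simulating this user:\n{instruction}"},
            *history,
        ])
        verdict = llm_chat([
            {"role": "system",
             "content": ("Answer YES if the user message below is consistent with the "
                         "instruction and the conversation so far, otherwise answer NO.")},
            {"role": "user",
             "content": f"Instruction:\n{instruction}\n\nCandidate message:\n{candidate}"},
        ])
        if verdict.strip().upper().startswith("YES"):
            return candidate
    return candidate  # fall back to the last candidate if verification keeps failing
```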
The benchmark covers two primary domains:
The retail domain simulates an e-commerce customer support environment. It contains 115 tasks built on a synthetic database of 500 users, 50 products, and 1,000 orders.
Available API Tools (15 total: 7 write, 8 read)
The retail agent has access to tools for retrieving and modifying customer data:
| Tool Category | Examples | Description |
|---|---|---|
| User lookup | get_user_details | Retrieve user profile information after authentication |
| Order retrieval | get_order_details | Look up order status, items, shipping, and payment |
| Product information | get_product_details | Check product availability, options, and pricing |
| Order cancellation | cancel_pending_order | Cancel orders with status "pending" only |
| Order modification | modify_pending_order | Change shipping address, payment method, or item options on pending orders |
| Returns | return_delivered_order | Process returns on delivered orders within the return window |
| Exchanges | exchange_delivered_order | Exchange delivered items for different options of the same product |
| Profile updates | modify_user_address | Update user shipping or billing addresses |
Key Retail Policies
- The agent must authenticate the user before retrieving or modifying any account data.
- Orders can only be cancelled or modified while their status is "pending".
- Delivered orders can only be returned (within the return window) or exchanged for different options of the same product.
- All details of a modification, return, or exchange must be collected and explicitly confirmed with the user, then executed in a single tool call.
Example Retail Task: A user contacts support wanting to exchange a blue medium t-shirt from their delivered order for a red large version of the same product. The agent must authenticate the user, look up the order, verify the order status is "delivered," check that the red large option is available, collect all exchange details, confirm with the user, and execute the exchange in a single tool call.
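Under the tool names listed above, a successful trajectory for this task reduces to a sequence of calls like the following. The lookup tool, IDs, and argument names are invented for illustration, and the single write call comes last, after explicit user confirmation.

```python
# Hypothetical call sequence for the exchange example; IDs, the lookup tool,
# and argument names are invented for illustration.
trajectory = [
    ("find_user_id_by_email",    {"email": "jane.doe@example.com"}),  # hypothetical lookup tool
    ("get_user_details",         {"user_id": "jane_doe_1234"}),
    ("get_order_details",        {"order_id": "#W0001234"}),          # status must be "delivered"
    ("get_product_details",      {"product_id": "tshirt_001"}),       # confirm red/large is available
    # ... agent summarizes the exchange and asks the user to confirm ...
    ("exchange_delivered_order", {"order_id": "#W0001234",
                                  "item_ids": ["tshirt_blue_m"],
                                  "new_item_ids": ["tshirt_red_l"],
                                  "payment_method_id": "credit_card_0001"}),
]
```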
The airline domain simulates a flight reservation customer service environment. It contains 50 tasks built on a synthetic database of 500 users, 300 flights, and 2,000 reservations.
Available API Tools (13 total: 6 write, 7 read)
The airline agent has access to tools for managing flight reservations:
| Tool Category | Examples | Description |
|---|---|---|
| Flight search | search_direct_flight, search_onestop_flight | Find available direct or connecting flights |
| Reservation lookup | get_reservation_details | Retrieve booking information, passengers, and flight details |
| User lookup | get_user_details | Authenticate and retrieve user profile |
| Booking | book_reservation | Create new flight reservations (max 5 passengers) |
| Modification | update_reservation_flights, update_reservation_passengers | Change flights, cabin class, baggage, insurance, or passenger details |
| Cancellation | cancel_reservation | Cancel reservations subject to policy rules |
Key Airline Policies
- New reservations can include at most five passengers.
- Cancellation eligibility depends on fare class, timing, and travel insurance; for example, a basic economy ticket without insurance cannot be cancelled outside the 24-hour window after booking.
- Flight, cabin, baggage, and passenger changes are made through the reservation update tools, with any fare difference collected or refunded.
- Certain exceptions and compensation depend on the passenger's membership tier.
Example Airline Task: A user calls about a delayed flight and wants to rebook on an earlier connection. The agent must verify the user's identity, check the reservation details, determine if the passenger's membership tier qualifies them for compensation, search for alternative flights, confirm the new itinerary with the user, and process the modification while applying the correct fare difference.
Tasks in τ-bench vary in complexity and are designed to apply pressure in diverse ways:
| Complexity Level | Characteristics | Example |
|---|---|---|
| Simple | Single API call, straightforward request | Check flight status |
| Moderate | Multiple API calls, some reasoning required | Book round-trip with preferences |
| Complex | Many API calls, policy checking, user clarification | Multi-city trip with changes |
| Expert | Edge cases, exception handling, complex policies | Group booking with special needs |
| Complexity Factor | Description | Example |
|---|---|---|
| Long-chain dependencies | Multiple sequential steps where later actions depend on earlier results | Booking a multi-passenger reservation where each passenger has different preferences |
| Buried essential details | Critical information hidden deep in the conversation or policy document | A cancellation policy exception that applies only to gold-tier members with travel insurance |
| Realistic forgetting scenarios | Situations where an agent might lose track of earlier conversation context | A user who mentions their email early in the conversation and later asks the agent to use it for a different purpose |
| Policy conflicts | Requests that conflict with domain rules, requiring the agent to refuse or offer alternatives | A user asking to cancel a basic economy ticket without insurance outside the 24-hour window |
| Compound requests | Multiple distinct actions needed in a single conversation | A user who wants to cancel one order, modify another, and update their address |
τ-bench uses an objective, automated evaluation approach that avoids the subjectivity of LLM-as-judge methods. The process works as follows:
- Each task is annotated with a unique goal database state and, where applicable, the information the agent must communicate to the user.
- The agent converses with the simulated user, reading and writing the domain database through the API tools.
- When the conversation ends, the final database state is compared against the annotated goal state, and the agent's messages are checked for any required outputs.
- The task is scored as a strict binary pass or fail.
This approach is both efficient (no human evaluators or LLM judges needed per evaluation) and faithful (the comparison is deterministic and objective). Partial credit is not awarded; a task either succeeds completely or fails. The designers chose this strict approach because, in real-world customer service, a partial update can create more damage than a refusal.
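A minimal sketch of this check, assuming the environment can serialize its database and each task carries an annotated goal state and a list of required outputs (names are illustrative, not the repository's implementation):

```python
# Sketch of tau-bench-style automated grading. Field names are illustrative.

def grade_episode(final_db_state: dict, agent_messages: list[str], task: dict) -> bool:
    """Binary reward: True only if the database matches the annotated goal state
    AND every required piece of information was communicated to the user."""
    # 1. Deterministic database comparison: no partial credit.
    if final_db_state != task["expected_db_state"]:
        return False

    # 2. Required outputs (e.g. a quoted refund amount) must appear somewhere
    #    in the agent's messages to the user.
    transcript = " ".join(agent_messages).lower()
    return all(required.lower() in transcript
               for required in task.get("expected_outputs", []))
```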
One of τ-bench's most significant contributions is the introduction of the pass^k ("pass hat k") metric, which measures agent reliability across repeated trials.
The standard pass@k metric, widely used in code generation benchmarks like HumanEval, asks: "Did the agent succeed on at least one of k attempts?" This measures peak capability. In contrast, pass^k asks: "Did the agent succeed on all k independent attempts?" This measures consistency.
Formally, for a given task with n independent trials of which c are successful, the per-task estimate is C(c, k) / C(n, k), the probability that k trials drawn at random without replacement are all successes; pass^k is this quantity averaged over all tasks.
The distinction matters enormously for real-world deployment. A customer service agent that resolves a particular type of issue 60% of the time will, over 8 independent instances of that issue, succeed on all 8 only about 1.7% of the time (0.6^8). The pass^k metric captures this compounding unreliability.
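The estimator can be computed directly from per-task trial counts. A short sketch (not the repository's implementation) that mirrors the formula above:

```python
# Pass^k estimator from n trials per task with c successes, averaged over tasks.
from math import comb

def pass_hat_k(results: list[tuple[int, int]], k: int) -> float:
    """results: list of (n_trials, n_successes) pairs, one per task."""
    per_task = []
    for n, c in results:
        if n < k:
            raise ValueError("need at least k trials per task")
        per_task.append(comb(c, k) / comb(n, k))
    return sum(per_task) / len(per_task)

# Example: an agent that succeeds on 5 of 8 trials of every task has
# pass^1 = 0.625 but pass^8 = 0, illustrating the compounding-unreliability point.
print(pass_hat_k([(8, 5)] * 10, 1))  # 0.625
print(pass_hat_k([(8, 5)] * 10, 8))  # 0.0
```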
| Metric | Question Answered | Significance |
|---|---|---|
| pass^1 | What fraction of tasks does the agent solve on a single attempt? | Basic capability measure |
| pass^4 | What fraction of tasks does the agent solve on all 4 independent attempts? | Moderate reliability threshold |
| pass^8 | What fraction of tasks does the agent solve on all 8 independent attempts? | High reliability threshold |
| pass@k (for contrast) | What fraction of tasks does the agent solve on at least one of k attempts? | Peak-capability measure used in code benchmarks |
The benchmark doesn't just test whether an agent can complete a task once; it measures whether it can do so consistently. The original paper showed that GPT-4o achieved a pass^1 of roughly 50% on retail tasks but dropped to approximately 25% on pass^8, cutting its effective success rate roughly in half. This drop revealed that even capable models are highly inconsistent when handling the same type of request across different conversational variations.
τ-bench includes an automatic error identification tool (available in the GitHub repository as auto_error_identification.py) that analyzes failed trajectories to classify errors. The tool categorizes failures along two dimensions:
Fault Assignment (who caused the failure):
- Agent: the failure originated in the agent's own reasoning, tool use, or policy handling
- User simulator: the simulated user deviated from its hidden instruction or provided inconsistent information
Fault Type (what kind of error occurred):
| Fault Type | Description | Frequency |
|---|---|---|
| Wrong Action | Agent selected an incorrect tool or action sequence | Most common workflow error |
| Wrong Arguments | Agent called the correct tool but with incorrect parameters | Common tool error |
| Wrong Information | Agent provided incorrect information to the user | Most common user interaction error |
| Policy Neglect | Agent failed to follow a policy rule | Significant across both domains |
| Context Loss | Agent forgot information from earlier in the conversation | More common in longer conversations |
| Compound Task Failure | Agent failed to handle multiple requests in one conversation | Challenging for all models |
The analysis found that terminal failures (where the agent completely derails) significantly outnumber recovered errors (where the agent stumbles but eventually finds the correct path). Small arithmetic or policy mistakes tend to propagate through the workflow rather than staying contained.
The original paper tested 12 popular LLMs with different agent architectures. The key findings were striking:
- Even the strongest function-calling agent (GPT-4o) succeeded on fewer than half of the tasks overall.
- Reliability was far worse than single-attempt accuracy: pass^8 fell below 25% in the retail domain.
- Native function calling generally outperformed ReAct and Act prompting for the same underlying model.
- The airline domain proved substantially harder than retail for every model tested.
Early performance of leading models on τ-bench:
| Model | Version/Mode | Airline Pass@1 | Retail Pass@1 | Notes |
|---|---|---|---|---|
| Claude 3.7 Sonnet | With think tool | 58.4% | 81.2% | Top performer with reasoning |
| Claude 3.5 Sonnet | Upgraded | 46.0% | 69.2% | Improved from 36.0%/62.6% |
| GPT-OSS-120B | Standard | Not reported | 67.8% | Open-weight model |
| GPT-4o | Tool-calling | <50% | <50% | Initial SOTA baseline |
| GPT-4o | ReAct | ~35% | ~40% | Lower with ReAct |
| Claude 3.5 Sonnet | Original | 36.0% | 62.6% | Before upgrade |
Note: pass^4 and pass^8 scores are significantly lower across all models, with pass^8 below 25% in retail for most models, indicating consistency challenges.
As of early 2026, the retail domain leaderboard shows substantial improvement over the original 2024 results, with the best models now crossing 80% pass^1:
| Rank | Model | Organization | Retail pass^1 |
|---|---|---|---|
| 1 | Claude Sonnet 4.5 | Anthropic | 0.862 |
| 2 | Claude Opus 4.1 | Anthropic | 0.824 |
| 3 | Claude Opus 4 | Anthropic | 0.814 |
| 4 | Claude 3.7 Sonnet | Anthropic | 0.812 |
| 5 | Claude Sonnet 4 | Anthropic | 0.805 |
| 6 | GLM-4.5 | Zhipu AI | 0.797 |
| 7 | GLM-4.5-Air | Zhipu AI | 0.779 |
| 8 | Qwen3-Coder 480B A35B | Alibaba | 0.775 |
| 9 | o4-mini | OpenAI | 0.718 |
| 10 | o1 | OpenAI | 0.708 |
| 11 | Qwen3-Next-80B-A3B-Thinking | Alibaba | 0.696 |
| 12 | Claude 3.5 Sonnet | Anthropic | 0.692 |
| 13 | GPT-4.5 | OpenAI | 0.684 |
| 14 | GPT-4.1 | OpenAI | 0.680 |
| 15 | GPT OSS 120B | OpenAI | 0.678 |
| 16 | GPT-4o | OpenAI | 0.603 |
| 17 | o3-mini | OpenAI | 0.576 |
| 18 | GPT-4.1 mini | OpenAI | 0.558 |
| 19 | Claude 3.5 Haiku | Anthropic | 0.510 |
| 20 | GPT-4.1 nano | OpenAI | 0.226 |
The average score across all 25 evaluated models is 0.678. Anthropic's Claude models dominate the top five positions, with Claude Sonnet 4.5 achieving the highest retail score of 0.862.
The airline domain remains substantially harder, with the best models scoring around 70% compared to 86% in retail:
| Rank | Model | Organization | Airline pass^1 |
|---|---|---|---|
| 1 | Claude Sonnet 4.5 | Anthropic | 0.700 |
| 2 | MiniMax M1 80K | MiniMax | 0.620 |
| 3 | GLM-4.5-Air | Zhipu AI | 0.608 |
| 4 | GLM-4.5 | Zhipu AI | 0.604 |
| 5 | MiniMax M1 40K | MiniMax | 0.600 |
| 5 | Claude Sonnet 4 | Anthropic | 0.600 |
| 5 | Qwen3-Coder 480B A35B | Alibaba | 0.600 |
| 8 | Claude Opus 4 | Anthropic | 0.596 |
| 9 | Claude 3.7 Sonnet | Anthropic | 0.584 |
| 10 | Claude Opus 4.1 | Anthropic | 0.560 |
| 11 | o1 | OpenAI | 0.500 |
| 11 | GPT-4.5 | OpenAI | 0.500 |
| 13 | GPT-4.1 | OpenAI | 0.494 |
| 14 | o4-mini | OpenAI | 0.492 |
| 15 | Claude 3.5 Sonnet | Anthropic | 0.460 |
| 16 | GPT-4o | OpenAI | 0.428 |
| 17 | GPT-4.1 mini | OpenAI | 0.360 |
| 18 | o3-mini | OpenAI | 0.324 |
| 19 | Claude 3.5 Haiku | Anthropic | 0.228 |
| 20 | GPT-4.1 nano | OpenAI | 0.140 |
The average score across all 23 evaluated airline models is 0.495. The HAL evaluation team at Princeton independently verified that 48 of 50 airline tasks (96%) have been solved by at least one agent, suggesting that the individual tasks are solvable but no single agent can solve them all consistently.
Several patterns emerge from the leaderboard data:
- Anthropic's Claude family occupies the top positions in both domains, with Claude Sonnet 4.5 leading retail (0.862) and airline (0.700).
- Every model scores markedly lower on airline than on retail, typically by 15 to 25 percentage points.
- Smaller variants such as GPT-4.1 nano and Claude 3.5 Haiku degrade sharply, suggesting that multi-turn tool use and policy compliance demand substantial model capacity.
- Scores have risen steadily across model generations, yet no model approaches saturation on the airline domain.

Common failure modes identified:
- Taking write actions without the required user confirmation or eligibility checks (policy neglect)
- Selecting the wrong tool, or the right tool with incorrect arguments
- Losing track of information provided earlier in long conversations (context loss)
- Completing only part of a compound request that involves multiple distinct actions
In 2025, the τ-bench team released τ³-bench, an updated version that audited and corrected over 50 tasks across both domains (27 airline fixes, 26 retail fixes). The corrections addressed five categories of issues identified through community feedback and the τ-Bench Verified research program:
| Issue Category | Description | Examples |
|---|---|---|
| Incorrect expected actions | The annotated goal state was wrong | Compensation offered to ineligible passengers; invalid PayPal refunds |
| Ambiguous user instructions | The user instruction allowed multiple valid interpretations | Economy vs. basic economy confusion; "similar" vs. "same" item specifications |
| Impossible constraints | The task setup made the intended solution unreachable | Required payment methods absent from user profiles; location contradictions |
| Missing fallback behaviors | No guidance for what the agent should do when the intended path fails | No defined action when product searches return no results |
| Policy loophole prevention | Tasks that could be "solved" by exploiting policy gaps | Cancel-and-rebook exploits instead of proper modification refusals; cabin upgrade workarounds |
The impact of these corrections varied by domain. In the airline domain, pass^1 scores increased by 14 to 20 points across models, and pass^4 improvements were even larger (up to 22 points for some models). In the retail domain, the changes were more modest, with pass^1 shifts ranging from -0.4 to +5.5 points. The retail corrections primarily reduced evaluation variance rather than uniformly boosting scores.
τ-bench is open-source under the MIT license and available on GitHub. It can be installed with:
```bash
git clone https://github.com/sierra-research/tau-bench
cd tau-bench
pip install -e .
```
API keys must be set as environment variables for the model providers being evaluated: OPENAI_API_KEY, ANTHROPIC_API_KEY, GOOGLE_API_KEY, MISTRAL_API_KEY.
The benchmark is run through a command-line interface:
```bash
python run.py --agent-strategy tool-calling --env retail \
  --model gpt-4o --model-provider openai \
  --user-model gpt-4o --user-model-provider openai \
  --user-strategy llm --max-concurrency 10
```
Key configuration options include:
| Parameter | Options | Description |
|---|---|---|
| --agent-strategy | tool-calling, react, act | How the agent structures its reasoning and actions |
| --env | retail, airline | Which domain to evaluate |
| --model | Various | The model powering the agent |
| --user-model | Various | The model powering the user simulator |
| --user-strategy | llm, react, verify, reflection | How the user simulator generates responses |
| --max-concurrency | Integer | Number of parallel evaluations |
| --task-ids | Comma-separated | Run specific tasks by ID |
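A sweep over several models and a subset of tasks can be scripted around the CLI flags documented above. The snippet below is a sketch; the model identifiers and the provider mapping are illustrative.

```python
# Hypothetical sweep over two models using the documented run.py flags.
import subprocess

for model, provider in [("gpt-4o", "openai"),
                        ("claude-3-5-sonnet-20241022", "anthropic")]:  # illustrative IDs
    subprocess.run(
        ["python", "run.py",
         "--agent-strategy", "tool-calling",
         "--env", "airline",
         "--model", model,
         "--model-provider", provider,
         "--user-model", "gpt-4o",
         "--user-model-provider", "openai",
         "--user-strategy", "llm",
         "--task-ids", "1,2,3",        # targeted subset for debugging
         "--max-concurrency", "5"],
        check=True,
    )
```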
τ-bench supports evaluation of models from multiple providers:
| Provider | Models Supported | Integration Method |
|---|---|---|
| OpenAI | GPT-4, GPT-4o, GPT-4o-mini, GPT-4.1, GPT-3.5, o1, o3, o4-mini | API |
| Anthropic | Claude 3, Claude 3.5 Sonnet, Claude 3.7 Sonnet, Claude Sonnet 4, Claude Opus 4 | API |
| Google | Gemini Pro, Gemini Flash, Gemini Ultra | API |
| Mistral AI | Mistral Large, Mistral Medium | API |
| Zhipu AI | GLM-4.5, GLM-4.5-Air | API |
| AnyScale | Open-source models via API | API |
Other notable features of the evaluation harness include:
- Parallel evaluation via the --max-concurrency flag, allowing multiple tasks to be evaluated simultaneously.
- The auto_error_identification.py script, which automatically classifies failures by fault assignment and fault type.
- The --task-ids flag for targeted debugging of individual tasks.

τ-bench has become one of the standard benchmarks for evaluating AI agents in industry.
Within one year of release, τ-bench had a significant academic impact, influencing both follow-up benchmarks and research on agent architectures.
Researchers found that naive ReAct-style agents often break down during complex multi-step sequences, leading to the development of enhanced approaches including hierarchical architectures that maintain goals and memory throughout multi-turn conversations.
Released in 2025 by Sierra Research (arXiv: 2506.07982), τ²-bench extends the τ-bench framework to a telecom domain with a critical new challenge: dual-control environments. In the original τ-bench, only the agent uses tools. In τ²-bench, both the agent and the user can take actions in a shared environment, modeled as a Decentralized Partially Observable Markov Decision Process (Dec-POMDP).
Key features of τ²-bench:
| Feature | Description |
|---|---|
| Telecom domain | Customer service scenarios for internet, phone, and TV service providers |
| Dual-control | Both agent and user have tools; the user must perform actions on their end (reboot router, check settings) |
| Compositional task generator | Programmatically creates diverse, verifiable tasks from atomic components |
| Reliable user simulator | Tightly coupled with the environment to ensure consistent behavior |
| Fine-grained error analysis | Separates errors from reasoning vs. communication/coordination |
| Coordination metrics | Measuring agent-user collaboration |
| Enhanced complexity | More sophisticated interactions |
Performance drops significantly in τ²-bench compared to τ-bench. GPT-4.1, which in the τ²-bench paper's evaluation achieves 74% pass^1 in retail and 56% in airline, drops to 34% in the telecom domain. This gap highlights the difficulty of guiding users through actions rather than performing all actions autonomously.
Rather than a new benchmark, τ³-bench is a corrected version of the original τ-bench tasks (described in the Task Corrections section above).
Inspired directly by τ-bench, MedAgentBench adapts the framework to the medical domain.
The authors cited τ-bench as effective for testing general agent capabilities while noting the absence of a standardized medical agent benchmark, prompting creation of the domain-specific variant.
Another domain-specific adaptation emerged for evaluating AI agents in legal applications, inspired by τ-bench's framework for policy-compliant agent evaluation.
| Limitation | Description | Impact |
|---|---|---|
| Limited domains | Only two domains (retail and airline) in the original benchmark | Raises questions about generalization to other settings |
| Simulated users | LLM-based users may not capture the full range of human behavior | May underestimate difficulty with real, unpredictable users |
| English only | No multilingual support | Limits applicability to non-English customer service settings |
| Static task set | Fixed set of 165 tasks | Risk of overfitting as models are evaluated repeatedly |
| Binary scoring | No partial credit for partially correct solutions | May not capture agents that get "almost right" |
| Text only | No visual, voice, or document-based interactions | Misses multimodal aspects of real customer service |
| Benchmark | Focus | Key Difference from τ-bench |
|---|---|---|
| SWE-bench | Software engineering tasks | Single-turn; no user interaction |
| AgentBench | Multi-environment agent evaluation | Evaluates agents across multiple environments but without simulated user interaction |
| WebArena | Web navigation and interaction | Focuses on browser-based tasks rather than conversational customer service |
| WebShop | E-commerce navigation | Earlier work by the same lead author; shopping tasks with a fixed goal and no conversational user |
| ALFWorld | Embodied agent tasks | Focuses on text-based embodied environments rather than customer service |
| InterCode | Interactive coding benchmark | Code-focused rather than customer service conversations |
| GAIA | General AI assistant tasks | Broader scope but less focus on multi-turn tool use and policy compliance |
| BFCL | Function calling accuracy | Tests tool use in isolation rather than within multi-turn conversations |
| τ²-bench | Dual-control conversational agents | Extends τ-bench with a telecom domain where both agent and user have tools |
| MT-Bench | Multi-turn conversation quality | Tests conversational ability but not tool use or policy compliance |