| τ-bench | |
|---|---|
| Overview | |
| Full name | Tau-bench: Tool-Agent-User Interaction Benchmark |
| Abbreviation | τ-bench |
| Description | A benchmark for evaluating AI agents' ability to complete complex tasks through realistic tool-agent-user interactions in real-world domains |
| Release date | 2024-06-17 |
| Latest version | 1.0 |
| Benchmark updated | 2024-06-17 |
| Authors | Shunyu Yao, Noah Shinn, Pedram Razavi, Karthik Narasimhan |
| Organization | Sierra Research, Princeton University |
| Technical Details | |
| Type | Agent Evaluation, Multi-turn Interaction |
| Modality | Text, API Calls |
| Task format | Conversational task completion |
| Number of tasks | Not publicly specified |
| Total examples | Not publicly specified |
| Evaluation metric | Pass@k, Database State Comparison |
| Domains | Airline, Retail |
| Languages | English |
| Performance | |
| Human performance | Not reported |
| Baseline | <50% (GPT-4o) |
| SOTA score | 81.2% (Retail Pass@1 with think tool) |
| SOTA model | Claude 3.7 Sonnet |
| SOTA date | 2025 |
| Saturated | No |
| Resources | |
| Website | Official website |
| Paper | Paper |
| GitHub | Repository |
| Dataset | Download |
| License | Apache 2.0 |
| Successor | τ²-bench |
τ-bench (Tau-bench) is an AI benchmark designed to evaluate language agents' ability to complete complex tasks through realistic interactions with simulated users and domain-specific tools. Released on June 17, 2024, by Sierra Research and Princeton University, τ-bench addresses critical gaps in existing benchmarks by testing agents on their ability to follow domain-specific rules, maintain context over long conversations, and effectively coordinate with users to achieve goals.
τ-bench represents a paradigm shift in AI agent evaluation, moving beyond simple task completion to assess agents' performance in dynamic, multi-turn conversations that mirror real-world applications. The benchmark emulates scenarios where an AI agent must interact with both users (simulated by large language models) and domain-specific API tools while adhering to complex policy guidelines.
Unlike traditional benchmarks that focus on single-turn interactions or isolated capabilities, τ-bench combines LLM-simulated users, domain-specific API tools governed by policy documents, and a reliability-oriented Pass@k evaluation in a single multi-turn setting.
The benchmark revealed that even state-of-the-art function-calling agents (such as GPT-4o) succeed on less than 50% of tasks and are highly inconsistent (Pass@8 below 25% in the retail domain).
τ-bench employs a three-component architecture:
| Component | Function | Implementation |
|---|---|---|
| User Simulator | Simulates realistic user behavior and requests | LLM-based (GPT-4, Claude, etc.) |
| Agent System | Processes requests and executes tasks | Various architectures (Tool-calling, ReAct, Act) |
| Environment | Provides APIs and maintains state | Domain-specific databases and tools |
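The three components above can be sketched as a minimal interaction loop. Everything below is an illustrative assumption for exposition, not τ-bench's actual API: the class names, the toy `update_address` tool, and the scripted user are all invented, and the rule-based `Agent` stands in for what would really be an LLM.

```python
from dataclasses import dataclass, field

@dataclass
class Environment:
    """Domain-specific database plus tool endpoints (toy version)."""
    db: dict = field(default_factory=dict)

    def call_tool(self, name: str, **kwargs) -> str:
        if name == "update_address":  # hypothetical domain tool
            self.db[kwargs["user_id"]] = kwargs["address"]
            return "ok"
        return "unknown tool"

class UserSimulator:
    """Stands in for the LLM-based user; scripted here for determinism."""
    def __init__(self, turns):
        self.turns = iter(turns)

    def reply(self, _agent_msg: str) -> str:
        return next(self.turns, "###DONE###")

class Agent:
    """A trivial rule-based agent; a real agent would call an LLM."""
    def act(self, user_msg: str, env: Environment) -> str:
        if user_msg.startswith("change address"):
            _, _, user_id, address = user_msg.split(maxsplit=3)
            return env.call_tool("update_address", user_id=user_id, address=address)
        return "How can I help?"

def run_episode(agent, user, env, max_turns=10):
    """Alternate user and agent turns until the user signals completion."""
    msg = "Hello!"
    for _ in range(max_turns):
        user_msg = user.reply(msg)
        if user_msg == "###DONE###":
            break
        msg = agent.act(user_msg, env)
    return env.db
```

The point of the structure is that the final `env.db` is all the evaluator needs: success is judged from database state, not from the conversation transcript.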
The benchmark covers two primary domains: airline reservations and retail customer service.
τ-bench introduces the Pass@k metric to measure agent reliability:
| Metric | Description | Significance |
|---|---|---|
| Pass@1 | Success rate on first attempt | Basic capability |
| Pass@4 | Success rate across 4 attempts | Consistency measure |
| Pass@8 | Success rate across 8 attempts | High reliability threshold |
| Pass@k | Success rate across k attempts | General reliability metric |
The benchmark doesn't just test whether an agent can complete a task once; it measures whether it can do so consistently multiple times.
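Because the metric requires all k sampled attempts to succeed (the paper writes this pass^k), scores fall as k grows. Given n independent trials with c successes, the standard unbiased estimate is C(c, k) / C(n, k); a minimal sketch:

```python
from math import comb

def pass_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that k i.i.d. attempts ALL succeed,
    given c successes observed in n >= k trials: C(c, k) / C(n, k)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    return comb(c, k) / comb(n, k)

# An agent that succeeds on 6 of 8 trials looks strong at k=1
# but collapses when required to succeed 8 times in a row.
print(pass_k(8, 6, 1))  # 0.75
print(pass_k(8, 6, 8))  # 0.0
```

This is why the reported Pass@8 numbers sit far below Pass@1 for the same model.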
Evaluation uses objective database state comparison:
1. The initial database state is recorded
2. The agent completes the conversational task
3. The final database state is captured
4. The final state is compared with the expected goal state
5. Success is determined by a state match
This efficient and faithful evaluation process provides objective measurement of agent decision-making.
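The comparison step reduces to a single equality check on the database state; the exact-match rule is the key design choice. A minimal sketch (the order fields are illustrative, not from the benchmark's schemas):

```python
def episode_success(final_db: dict, goal_db: dict) -> bool:
    """Task succeeds iff the final database state exactly matches
    the annotated goal state, regardless of the conversation itself."""
    return final_db == goal_db

goal = {"order_42": {"status": "placed", "address": "42 Main St"}}

# A correct run produces exactly the goal state...
ok = episode_success({"order_42": {"status": "placed", "address": "42 Main St"}}, goal)
# ...while a run that changed the wrong field fails,
# even if the conversation transcript "sounded" helpful.
bad = episode_success({"order_42": {"status": "cancelled", "address": "old"}}, goal)
```

Grounding success in state rather than dialogue is what makes the evaluation cheap to run and robust to plausible-sounding but incorrect agent behavior.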
Tasks in τ-bench vary in complexity:
| Complexity Level | Characteristics | Example |
|---|---|---|
| Simple | Single API call, straightforward request | Check flight status |
| Moderate | Multiple API calls, some reasoning required | Book round-trip with preferences |
| Complex | Many API calls, policy checking, user clarification | Multi-city trip with changes |
| Expert | Edge cases, exception handling, complex policies | Group booking with special needs |
Current performance of leading models on τ-bench:
| Model | Version/Mode | Airline Pass@1 | Retail Pass@1 | Notes |
|---|---|---|---|---|
| Claude 3.7 Sonnet | With think tool | 58.4% | 81.2% | Top performer with reasoning |
| Claude 3.5 Sonnet | Upgraded | 46.0% | 69.2% | Improved from 36.0%/62.6% |
| GPT-OSS-120B | Standard | Not reported | 67.8% | Open-weight model |
| GPT-4o | Tool-calling | <50% | <50% | Initial SOTA baseline |
| GPT-4o | ReAct | ~35% | ~40% | Lower with ReAct |
| Claude 3.5 Sonnet | Original | 36.0% | 62.6% | Before upgrade |
Note: Pass@4 and Pass@8 scores are significantly lower across all models, with Pass@8 <25% in retail for most models, indicating consistency challenges.
1. **Low Success Rates**: Even SOTA models achieved <50% success at initial release
2. **Poor Consistency**: Pass@8 rates below 25% indicate high variability
3. **Domain Sensitivity**: Significant performance gap between airline (harder) and retail (easier)
4. **Architecture Impact**: Tool-calling consistently outperforms ReAct
Common failure modes identified:
1. **Policy Violations**: Agents bypass or misinterpret domain rules
2. **Context Loss**: Information is forgotten in long conversations
3. **User Misunderstanding**: Incorrect interpretation of user intent
4. **API Misuse**: Incorrect tool selection or parameter usage
5. **State Confusion**: Losing track of transaction state
τ-bench supports evaluation of models from multiple providers:
| Provider | Models Supported | Integration Method |
|---|---|---|
| OpenAI | GPT-4, GPT-4o, GPT-3.5 | API |
| Anthropic | Claude 3, Claude 3.5, Claude 3.7 | API |
| Google | Gemini Pro, Gemini Ultra | API |
| Mistral AI | Mistral Large, Mistral Medium | API |
| AnyScale | Open-source models | API |
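Multi-provider support of this kind is typically implemented as a registry mapping provider names to client factories behind a common completion interface. A hypothetical sketch, with stub clients in place of the real provider SDKs (none of these names are τ-bench's actual code):

```python
from typing import Callable, Dict

def make_stub_client(provider: str) -> Callable[[str], str]:
    """Stand-in for a real SDK client; a real factory would
    construct e.g. an OpenAI or Anthropic API client here."""
    def complete(prompt: str) -> str:
        return f"[{provider}] echo: {prompt}"
    return complete

# Registry: provider name -> zero-arg factory returning a completion callable.
PROVIDERS: Dict[str, Callable[[], Callable[[str], str]]] = {
    "openai":    lambda: make_stub_client("openai"),
    "anthropic": lambda: make_stub_client("anthropic"),
    "google":    lambda: make_stub_client("google"),
}

def get_client(provider: str) -> Callable[[str], str]:
    try:
        return PROVIDERS[provider]()
    except KeyError:
        raise ValueError(f"unsupported provider: {provider}") from None
```

The benefit of the registry pattern is that the agent loop only ever sees the uniform completion callable, so adding a provider does not touch evaluation code.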
The benchmark supports several agent strategies:
| Strategy | Description | Use Case |
|---|---|---|
| LLM (default) | Direct LLM responses | Standard evaluation |
| ReAct | Reasoning + Acting | Complex reasoning tasks |
| Verify | Verification after each step | High accuracy requirements |
| Reflection | Self-correction mechanism | Improved consistency |
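The ReAct row differs from the default strategy by interleaving an explicit reasoning step with each action. A deterministic toy sketch; only the thought/action/observation structure comes from ReAct, while the fixed `lookup` policy and tool table are invented for illustration (a real agent would generate both thought and action with an LLM):

```python
def react_loop(task: str, tools: dict, max_steps: int = 5):
    """Toy ReAct loop: alternate a 'thought' with an 'action'
    until an observation answers the task or steps run out."""
    trace = []
    for step in range(max_steps):
        thought = f"step {step}: to handle {task!r} I should look it up"
        action, arg = "lookup", task  # trivial fixed policy
        observation = tools[action](arg)
        trace.append((thought, action, observation))
        if observation is not None:
            return observation, trace
    return None, trace

# A one-entry tool table: 'lookup' is backed by a plain dict.
tools = {"lookup": {"flight AA1": "on time"}.get}
answer, trace = react_loop("flight AA1", tools)  # answer == "on time"
```

The trace of (thought, action, observation) triples is what distinguishes ReAct from direct tool-calling, and is also why it costs more tokens per step.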
τ-bench gained significant traction within one year of release, inspiring a successor benchmark (τ²-bench) and domain-specific adaptations.
τ²-bench is an extension of the original benchmark.
A medical domain adaptation of τ-bench has also been developed.
The authors observed that while benchmarks like τ-bench effectively test general agent capabilities, there was no standardized medical agent benchmark available, prompting creation of a domain-specific variant.
| Limitation | Description | Impact |
|---|---|---|
| Limited Domains | Only 2-3 domains covered | Generalization questions |
| Simulated Users | LLM simulation may not capture human behavior | Ecological validity |
| English Only | No multilingual support | Limited global applicability |
| Static Tasks | Fixed task set | Potential overfitting |
| No Visual Elements | Text-only interactions | Missing multimodal aspects |
1. Domain Expansion: Healthcare, finance, education domains 2. Multimodal Support: Images, documents, voice interactions 3. Human Evaluation: Real user interaction studies 4. Dynamic Task Generation: Procedural task creation 5. Fine-grained Metrics: Partial credit scoring