Tau-bench (stylized tau-bench or τ-bench) is a benchmark suite developed by Sierra AI to evaluate the reliability of language model agents operating in realistic, multi-turn customer service scenarios. First published in June 2024, the benchmark assesses whether agents can follow domain-specific policies, use API tools correctly, and sustain consistent behavior across many independent interactions with simulated users. The suite has since expanded through three major versions -- tau-bench (June 2024), tau2-bench (June 2025), and tau3-bench (March 2026) -- each adding new domains, evaluation modalities, and complexity. A companion benchmark, mu-bench, was released in April 2026 to assess multilingual speech transcription quality for voice agents.
The tau-bench family is widely used in academic research and commercial model development to measure the dimension of agentic reliability that single-attempt benchmarks cannot expose: whether an agent can succeed not just once, but consistently, across repeated interactions.
By mid-2024 the dominant methodology for evaluating large language models in agentic settings relied on pass@k-style metrics inherited from code generation research. Under pass@k, a model receives credit if at least one of k independent sampling attempts produces a correct answer. This is useful for measuring raw capability, but it systematically overstates the practical reliability of agents deployed in production. A customer service agent that resolves a flight rebooking correctly 70% of the time still fails three out of ten customers, and those failures cascade through support queues, human escalation, and customer trust.
Existing benchmarks like AgentBench, WebArena, and GAIA each illuminated different slices of agent capability -- database manipulation, web browsing, multi-step tool use -- but none focused specifically on the triad of (1) multi-turn dialogue with a live user, (2) simultaneous compliance with a written policy document, and (3) stateful tool use against a backing database. SWE-bench captured software engineering skill through real GitHub issues, but an agent that can write a correct patch need not be the same agent that can process a return request under a retailer's refund policy while managing a disgruntled customer across ten conversational turns.
Sierra AI sat at the intersection of these concerns. Founded in 2023 by Bret Taylor (former Salesforce co-CEO and OpenAI board chair) and Clay Bavor (former Google VP), Sierra builds production AI agents for enterprises, including 40% of the Fortune 50. Its researchers had firsthand visibility into the gap between benchmark-reported performance and the consistency failures that appeared in deployed agents. The tau-bench project grew out of that operational experience.
The original tau-bench paper, titled "tau-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains," was published on arXiv (arXiv:2406.12045) on June 17, 2024 by Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan, all affiliated with Sierra AI. The paper was subsequently accepted as a poster at ICLR 2025. Code and data were released simultaneously at github.com/sierra-research/tau-bench under an open-source license, and a companion leaderboard was later established at taubench.com.
Tau-bench was built around a specific thesis: that evaluation of production agents requires measuring reliability across many independent attempts on the same underlying task, not just measuring average performance across many different tasks. The distinction matters because in production, an agent handles thousands of structurally similar requests each day. A system with 70% single-attempt accuracy will fail roughly 300 out of every 1,000 interactions, a rate that is operationally unacceptable for most enterprise deployments.
To test this kind of reliability, tau-bench adopted a simulation architecture in which a language model acts as the user, generating natural conversational utterances in response to the agent. Each conversation begins with a task specification -- a scenario drawn from a realistic customer service context -- and ends when either the agent successfully completes the task (verified by comparing the final database state to a ground-truth goal state) or the conversation is terminated due to failure. The same underlying task can be run many times with different lexical realizations, allowing repeated-trial reliability to be measured directly.
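In sketch form, one evaluation trial looks roughly like the loop below. This is an illustration of the architecture just described, not Sierra's code; every name in it (run_episode, env.tools, task.goal_db_state, and so on) is a hypothetical stand-in for the benchmark's actual interfaces.

```python
# Illustrative sketch of one tau-bench trial -- not Sierra's implementation.
# All interfaces (agent, user_simulator, env, task) are hypothetical stand-ins.

MAX_TURNS = 30  # assumed conversation budget

def run_episode(task, agent, user_simulator, env) -> bool:
    """Run one trial of one task; success is judged by final database state."""
    history = []
    user_msg = user_simulator.first_turn(task.scenario)
    for _ in range(MAX_TURNS):
        history.append({"role": "user", "content": user_msg})
        # The agent may reply in text or emit tool calls, which the
        # environment executes against its mutable database.
        agent_msg = agent.respond(history, tools=env.tools, policy=env.policy_doc)
        history.append({"role": "assistant", "content": agent_msg})
        if env.conversation_over(agent_msg):
            break
        # The LLM user simulator generates the next utterance from the
        # scenario instructions plus the full conversation so far.
        user_msg = user_simulator.next_turn(task.scenario, history)
    # Success: the final database state matches the annotated goal state.
    return env.db_state() == task.goal_db_state
```

Running the same task many times, each with a different sampled dialogue, yields the per-task success counts from which repeated-trial reliability is computed.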
The original release contained two domains:
tau-Retail covers e-commerce customer service. The database contains 500 simulated users, 50 product types, and 1,000 orders. Agents are supplied with a set of Python-callable API tools (for looking up orders, processing returns, applying discounts, updating shipping addresses, and so on) and a written policy document specifying rules such as return windows, eligibility criteria, and exception-handling procedures. The domain contains 115 tasks. Because the database is relatively small and the policy is tractable, tau-Retail is the easier of the two domains -- but still exposed dramatic reliability failures in state-of-the-art models at launch.
tau-Airline covers airline reservation management. The database contains 500 users, 300 flights, and 2,000 reservations. Tasks include rebooking, cancellations, seat upgrades, baggage fee disputes, and policy-mandated refunds. The domain contains 50 tasks and is structurally more complex than tau-Retail because airline policies involve more conditional logic (fare class rules, change fee waivers, same-day standby procedures). Agents must parse long policy documents and apply the correct rule to a specific reservation state.
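To make the tool-plus-policy setup concrete, here is what one retail-style tool might look like. The function, its signature, and the 30-day window are invented for illustration; the actual tool and policy definitions live in the sierra-research/tau-bench repository.

```python
# Hypothetical retail-style tool in the spirit described above -- illustrative
# only; real definitions live in sierra-research/tau-bench.
from datetime import date, timedelta

RETURN_WINDOW_DAYS = 30  # assumed policy value from the written policy document

def process_return(db: dict, order_id: str, item_id: str, reason: str) -> dict:
    """Initiate a return if the order is delivered and inside the window.

    The policy document, not the tool, is where these rules are stated; the
    agent must check them before calling. Calling with ineligible arguments
    is the 'policy violation' failure mode the benchmark is built to expose.
    """
    order = db["orders"][order_id]
    if order["status"] != "delivered":
        raise ValueError("only delivered orders can be returned")
    if date.today() - order["delivered_on"] > timedelta(days=RETURN_WINDOW_DAYS):
        raise ValueError("order is outside the return window")
    order.setdefault("returns", []).append({"item_id": item_id, "reason": reason})
    return {"status": "return_initiated", "order_id": order_id}
```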
The user simulator in the original tau-bench uses GPT-4 (specifically gpt-4-0613) to generate utterances. The simulator receives a system prompt containing the task scenario -- for example, "you want to return a defective blender purchased 12 days ago" -- along with the full conversation history. It generates the next user turn in response to the agent's most recent message. This design keeps the evaluation fully automated and reproducible, avoids the cost and latency of human annotators, and enables the repeated trials that pass^k requires.
The simulator's instructions are written in natural language rather than formal grammars, which produces more varied and naturalistic dialogue than rule-based alternatives. Sierra researchers acknowledged that the simulator can occasionally behave inconsistently or drift from the scenario, and they incorporated this as a source of realistic noise rather than attempting to eliminate it.
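A minimal harness for such a simulator might look like the sketch below. The system prompt wording and the role-flipping helper are assumptions; only the simulator model name (gpt-4-0613) comes from the paper.

```python
# Sketch of an LLM user simulator in the style tau-bench describes.
# The prompt text and helper names are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

def simulate_user_turn(scenario: str, history: list[dict]) -> str:
    """Generate the next user utterance from the scenario plus full history."""
    system = (
        "You are a customer talking to a support agent. "
        f"Your situation: {scenario} "
        "Stay in character, reveal details only when asked, and speak naturally."
    )
    # Roles are flipped: the agent's messages are shown to the simulator as
    # the other party ('user'), and prior simulated-user turns as 'assistant'.
    flipped = [
        {"role": "assistant" if m["role"] == "user" else "user",
         "content": m["content"]}
        for m in history
    ]
    resp = client.chat.completions.create(
        model="gpt-4-0613",  # the simulator model named in the original paper
        messages=[{"role": "system", "content": system}, *flipped],
    )
    return resp.choices[0].message.content
```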
The central methodological contribution of tau-bench is the pass^k metric (read as "pass to the k"). Pass^k is defined as the probability that an agent successfully completes a given task on all k independent trials. It contrasts with the pass@k metric standard in code generation, which measures the probability that at least one of k trials succeeds.
Formally, if p is the agent's single-trial success probability on a given task, then pass^k = p^k under the assumption of independent trials. The exponential decay is severe: a model with 85% single-trial accuracy (pass^1 = 0.85) has pass^8 = 0.85^8 ≈ 0.27, meaning it would successfully handle all eight customers with the same underlying issue less than 30% of the time. This is the regime that production agents live in.
Tau-bench typically reports results averaged across tasks in a domain, giving a mean pass^1 and mean pass^k for selected k values (commonly k = 2, 4, 8). The pass^k vs. k curve reveals the reliability profile of a model: a model with high pass^1 but rapidly collapsing pass^k is fragile and stochastic; a model whose pass^k curve flattens near its pass^1 value is robust and consistent.
The distinction from the closely related pass@k is crucial. Pass@k gives credit for a single success among k attempts -- it measures what a model can do given multiple chances, which is appropriate when you want to assess capability. Pass^k measures what a model reliably does every time -- appropriate when you want to assess deployment fitness.
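Both metrics can be estimated from the same data: run each task n times and count c successes. The pass@k estimator below is the standard unbiased one from the code generation literature; the pass^k form shown is its natural combinatorial counterpart (the tau-bench paper's exact estimator may differ, so treat this as a sketch).

```python
# Estimating pass@k and pass^k from n trials per task with c successes.
# pass@k follows the standard unbiased estimator (Chen et al., 2021);
# the pass^k form here is the analogous all-k-succeed counterpart.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """P(at least one success among k trials drawn from n trials, c successful)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_hat_k(n: int, c: int, k: int) -> float:
    """P(all k trials drawn from n trials succeed)."""
    if c < k:
        return 0.0
    return comb(c, k) / comb(n, k)

# With an empirical success rate of 0.85, pass^8 collapses toward 0.85**8:
print(pass_hat_k(n=1000, c=850, k=8))  # ~0.27, matching the p**k analysis
```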
The June 2024 paper reported that even the strongest available models fell well short of reliable performance. GPT-4o, then considered the frontier model for function-calling tasks, achieved a pass^1 rate below 50% across both domains, and its pass^8 rate in tau-Retail dropped to roughly 25%, about half its single-attempt rate. Claude 3.5 Sonnet performed somewhat better on single-attempt tasks but showed similar reliability degradation under repeated evaluation.
The most common failure mode was policy violation: agents would identify the correct tool to call but invoke it with arguments that, while plausible, violated some condition specified in the policy document. A second major failure mode was conversation management -- agents that correctly understood the user's goal but elicited insufficient information before acting, or that lost track of task state over a long dialogue.
These results had immediate impact on how researchers and practitioners thought about agent benchmarking. Several model evaluation teams that had previously relied on pass@k-style metrics began supplementing their evaluations with pass^k reporting.
Tau2-bench (stylized tau2-bench or τ²-bench) was published as the arXiv paper "τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment" on June 9, 2025, with authors Victor Barres, Honghua Dong, Soham Ray, Xujie Si, and Karthik Narasimhan. The accompanying codebase was released at github.com/sierra-research/tau2-bench, and the project received a dedicated leaderboard entry on Artificial Analysis.
The fundamental advance in tau2-bench was the introduction of a dual-control environment. In the original tau-bench, the user simulator is a passive information provider: it answers the agent's questions and makes requests, but has no independent ability to act on the environment. This is appropriate for many simple service interactions, but it fails to capture a large class of real-world support scenarios where the user must perform actions on their side -- restarting a device, checking a setting, toggling a network mode -- before the agent can proceed.
Tau2-bench models this as a decentralized partially observable Markov decision process (Dec-POMDP), in which both the agent and the user have tools and can independently modify the shared environment state. The user simulator is tightly coupled to the actual environment: it generates actions and responses based on the real configuration of the shared world, preventing it from inventing settings that do not exist or failing to report changes that actually occurred.
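Schematically, a dual-control episode interleaves two actors over one shared state, as in the sketch below; the class and method names are invented for illustration, not the tau2-bench API.

```python
# Illustrative dual-control loop -- both parties can mutate shared state.
# All names (env, agent, user_sim, task) are hypothetical stand-ins.

def dual_control_episode(task, agent, user_sim, env, max_turns=40) -> bool:
    history = []
    for _ in range(max_turns):
        # Agent turn: may reply in text or call agent-side tools
        # (e.g., resetting a line from the provider side) on env.state.
        a = agent.act(history, tools=env.agent_tools, policy=env.policy_doc)
        env.apply(a.tool_calls)          # agent-side writes to shared state
        history.append(("agent", a.text))
        if a.done:
            break
        # User turn: the simulator is state-coupled -- it reads env.state,
        # so it can only report settings and outcomes that actually exist.
        u = user_sim.act(task.scenario, history, env.observe_user_side())
        env.apply(u.tool_calls)          # user-side actions, e.g. toggling a mode
        history.append(("user", u.text))
    return env.state_matches(task.goal_state)
```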
This bidirectional simulation design produces a qualitatively harder evaluation. Agents must not only call the right tools themselves, but must also correctly instruct users on what actions to take, wait for confirmation of those actions, interpret the results, and update their troubleshooting strategy accordingly. A drop of up to 25 percentage points in task success rate was observed when agents moved from solo (agent-only tool use) to interactive (dual-control) mode, even for agents built on frontier models such as GPT-4.1 and o4-mini.
Tau2-bench introduced a third domain, tau-Telecom, inspired by the kinds of interactions that occur in real telecommunications support. Tasks include resolving broken mobile data connections, fixing Multimedia Messaging Service (MMS) failures, and switching mobile network modes between 4G and 5G. Each task requires the agent to follow a telecom policy document, communicate diagnostic instructions to the user, interpret the user's reported outcomes, and adapt the troubleshooting path accordingly.
The telecom domain is substantially harder than retail or airline. At the time of tau2-bench's release, GPT-4.1 achieved pass^1 rates of approximately 74% on retail and 56% on airline, but only 34% on telecom. Smaller and reasoning-focused models like GPT-4.1-mini, o4-mini, and Claude 3.7 Sonnet achieved telecom pass rates around 50%, outscoring the larger GPT-4.1 and suggesting that the telecom domain stresses different capabilities than the other two.
Tau2-bench introduced a compositional task generation approach. Rather than hand-authoring tasks individually, tasks are assembled from atomic actions that represent the smallest coherent troubleshooting steps. This allows systematic control over task complexity, enables automatic verification of correct solutions, and scales to produce larger and more varied task sets than manual authoring allows. The approach also makes it easier to identify which sub-skills an agent lacks when it fails a composite task.
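The idea can be illustrated with pre- and post-conditions: an atomic action is applicable when its preconditions hold, and a task is a chain of actions that reaches a goal fact. The dataclass and telecom-flavored actions below are invented for this sketch, not tau2-bench's generator.

```python
# Illustrative compositional task generation from atomic actions.
# AtomicAction and the sample actions are invented for this sketch.
from dataclasses import dataclass
import random

@dataclass(frozen=True)
class AtomicAction:
    name: str
    preconditions: frozenset[str]   # facts that must hold before this step
    effects: frozenset[str]         # facts that hold after this step

ACTIONS = [
    AtomicAction("check_status_bar", frozenset(), frozenset({"signal_known"})),
    AtomicAction("toggle_airplane_mode", frozenset({"signal_known"}),
                 frozenset({"radio_reset"})),
    AtomicAction("enable_mobile_data", frozenset({"radio_reset"}),
                 frozenset({"data_working"})),
]

def compose_task(goal: str, rng: random.Random) -> list[AtomicAction]:
    """Greedily chain applicable actions until the goal fact is reachable."""
    state: set[str] = set()
    plan: list[AtomicAction] = []
    while goal not in state:
        candidates = [a for a in ACTIONS
                      if a.preconditions <= state and not a.effects <= state]
        action = rng.choice(candidates)
        plan.append(action)
        state |= action.effects
    return plan  # a verified solution, usable as a ground-truth trajectory

print([a.name for a in compose_task("data_working", random.Random(0))])
```

Because each composed task carries its own verified solution path, correctness checking and difficulty control come for free, which is the scaling advantage over hand-authoring.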
The tau2-bench release included more than 75 community-contributed corrections to the original retail and airline domains. These fixes addressed ambiguous task descriptions, inconsistent policy language, and evaluation criteria that did not correctly capture task success. The corrections tightened the correlation between human judgment and automated evaluation, increasing confidence that pass^k scores reflect genuine task completion rather than evaluation artifacts.
Tau3-bench (stylized tau3-bench or τ³-bench) was released by Sierra AI on March 18, 2026, with an accompanying blog post titled "Advancing agent benchmarking to knowledge and voice." It extended the tau-bench framework in two new directions: unstructured knowledge retrieval (tau-Knowledge) and real-time voice (tau-Voice), each of which represents a modality where production agents routinely struggle but prior evaluation infrastructure was limited.
Tau-Knowledge evaluates agent performance in scenarios where the agent must retrieve relevant information from a large, unstructured knowledge base before it can take action. The domain modeled is tau-Banking, a fintech customer support context. The knowledge base contains 698 documents organized across 21 product categories, totaling approximately 195,000 tokens -- more than fits in the context window of most current models.
Tasks in tau-Knowledge require agents to perform multi-step problem resolution: search the knowledge base for relevant policy or product information, reason over the retrieved content, and then call the appropriate transactional tools to complete the user's request. Four retrieval strategies were evaluated: keyword search, embedding-based semantic retrieval, long-context window processing (for models with large enough windows to attempt it), and terminal-style freeform file exploration.
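To illustrate the constrained and freeform ends of that spectrum, the sketch below contrasts a keyword-search tool with a terminal-style tool over the same document store; both signatures and the knowledge-base layout are assumptions, not the benchmark's actual tools.

```python
# Two illustrative knowledge-base access styles -- invented signatures.
import subprocess
from pathlib import Path

KB_ROOT = Path("knowledge_base/")  # assumed layout: one .md file per document

def keyword_search(query: str, top_k: int = 5) -> list[str]:
    """Constrained RAG-style tool: rank documents by naive term overlap."""
    terms = set(query.lower().split())
    scored = []
    for doc in KB_ROOT.rglob("*.md"):
        text = doc.read_text().lower()
        scored.append((sum(t in text for t in terms), str(doc)))
    return [path for score, path in sorted(scored, reverse=True)[:top_k] if score]

def terminal(command: str) -> str:
    """Freeform tool: the agent runs shell commands (ls, grep, cat) itself,
    the access style tau-Knowledge found most effective."""
    out = subprocess.run(command, shell=True, cwd=KB_ROOT,
                         capture_output=True, text=True, timeout=10)
    return out.stdout[:4000]  # truncate to keep observations bounded
```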
A key finding from tau-Knowledge was that models perform significantly better when given flexible, freeform access to the knowledge base through a terminal-like interface than when restricted to traditional retrieval-augmented generation pipelines. Even with exact, oracle-provided documents, top models achieved only about 40% pass^1 -- indicating that comprehension and execution, not just retrieval, are bottlenecks. The best models tested at initial release achieved roughly 25% pass^1 without oracle retrieval, underscoring that knowledge-grounded agent tasks remain substantially unsolved.
Tau-Voice extends 278 tasks from the existing tau-bench text domains into real-time voice evaluation. The task definitions, API tools, and ground-truth evaluation logic are byte-for-byte identical to their text counterparts, enabling a direct comparison between a model's voice and text performance on identical problems.
The voice evaluation framework uses a voice user simulator that synthesizes caller utterances through diverse speaker personas and applies realistic audio degradation: background noise, G.711 mu-law compression (the codec used in standard telephony), random frame drops, vocal tics, non-directed speech, and an LLM-driven turn-taking policy that generates interruptions, backchannels, and overlapping speech. Agents under evaluation use real-time, full-duplex audio APIs (as opposed to the half-duplex turn-by-turn architecture common in earlier voice benchmarks), receiving audio in 200 ms chunks and generating spoken responses in kind.
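A minimal numpy rendering of that degradation pipeline might look like the following; the SNR, drop probability, and 8-bit quantization shortcut are illustrative assumptions, and production telephony simulation would use a real G.711 codec implementation.

```python
# Illustrative telephony degradation: noise, G.711-style mu-law companding
# at 8 kHz, and random 20 ms frame drops. All parameters are assumptions.
import numpy as np

MU = 255.0          # mu-law compression constant used by G.711
SR = 8000           # telephony sample rate
FRAME = SR // 50    # 20 ms frames

def mu_law_roundtrip(x: np.ndarray) -> np.ndarray:
    """Compand to mu-law and back, emulating 8-bit codec quantization loss."""
    y = np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)
    q = np.round((y + 1) / 2 * 255) / 255 * 2 - 1        # 8-bit quantization
    return np.sign(q) * ((1 + MU) ** np.abs(q) - 1) / MU

def degrade(audio: np.ndarray, snr_db: float = 15, p_drop: float = 0.02,
            rng: np.random.Generator | None = None) -> np.ndarray:
    rng = rng or np.random.default_rng()
    noise = rng.normal(0, audio.std() / (10 ** (snr_db / 20)), audio.shape)
    out = mu_law_roundtrip(np.clip(audio + noise, -1, 1))
    for start in range(0, len(out) - FRAME, FRAME):      # random frame drops
        if rng.random() < p_drop:
            out[start:start + FRAME] = 0.0
    return out
```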
Voice agents are evaluated against the same database-state comparison used for text agents, so the score reflects end-to-end task completion rather than just transcription quality or dialogue naturalness.
Progress on tau-Voice over its first eight months of reporting was substantial. In August 2025, the leading voice agent (OpenAI gpt-realtime-1.0) achieved approximately 30% pass^1. By April 2026, the leading system (xAI's grok-voice-think-fast-1.0) reached 67% pass^1 -- more than doubling performance in eight months. Under challenging realistic audio conditions (noise, accents, compression), top systems scored between 26% and 38% pass^1, compared to 85% or higher for equivalent text-based agents on the same tasks. Voice agents therefore retain roughly 79% of text-model capability under clean conditions but only roughly 30-45% under realistic telephony conditions.
Four failure categories were identified through error analysis: speech recognition failures under accented speech and background noise; dialogue state tracking errors across corrections and clarifications; tool-use reliability degradation (the voice interface introduced higher rates of incorrect tool calls relative to the same model on text); and robustness gaps when users produced non-standard conversational patterns.
The tau3-bench release also incorporated community-contributed fixes to the retail, airline, and telecom domains, further tightening the evaluation criteria.
Alongside the tau3-bench voice work, Sierra AI released mu-bench (stylized mu-bench or μ-bench, for Multilingual Utterances benchmark) on April 20, 2026. Mu-bench is a companion benchmark focused specifically on automatic speech recognition (ASR) quality for customer service phone audio, separate from the end-to-end task completion focus of tau-Voice.
The dataset consists of 4,270 human-annotated audio clips drawn from 250 real customer service phone conversations, recorded at 8 kHz mono -- the audio fidelity typical of telephony networks. Annotators produced word-level ground-truth transcripts for each caller utterance. Five languages are covered: English, Spanish, Turkish, Vietnamese, and Mandarin.
Mu-bench introduces a new metric, Utterance Error Rate (UER), designed to address a shortcoming of the traditional Word Error Rate (WER). WER penalizes all transcription differences equally, so dropping a filler word like "um" counts the same as dropping a meaning-bearing word. UER distinguishes between errors that change the meaning of an utterance and errors that do not -- a distinction that matters for downstream agent behavior, since an agent acting on a mistranscribed intent is far more harmful than one that normalizes filler words differently.
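For reference, WER is word-level edit distance divided by reference length. The sketch below computes it and shows two single-word errors that WER scores similarly but a UER-style metric would treat very differently; the meaning-preservation judgment itself (how mu-bench decides which errors matter) is not reproduced here.

```python
# Standard word error rate via Levenshtein distance, for contrast with UER.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# WER counts one error in each case; a UER-style metric would score the
# first as harmless (filler normalization) and the second as meaning-changing.
print(wer("um I want to cancel", "I want to cancel"))   # 0.2
print(wer("cancel order two", "cancel order ten"))      # ~0.33
```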
Key findings from the mu-bench launch included substantial provider variation across languages: Mandarin transcription accuracy was in some cases five times worse than English for the same provider, and no single provider dominated across all five languages. Google's Chirp-3 led on multilingual accuracy, while Deepgram's Nova-3 was nearly eight times faster at p50 latency but trailed on non-English tasks. The dataset, code, and an open leaderboard are publicly available through the Sierra Research GitHub organization and Hugging Face.
As of May 2026, the tau-bench retail leaderboard (pass^1 scores) was led by Anthropic models, which have dominated the top positions since 2025.
| Rank | Model | Organization | Pass^1 (Retail) |
|---|---|---|---|
| 1 | Claude Sonnet 4.5 | Anthropic | 86.2% |
| 2 | Claude Opus 4.1 | Anthropic | 82.4% |
| 3 | Claude Opus 4 | Anthropic | 81.4% |
| 4 | Claude 3.7 Sonnet | Anthropic | 81.2% |
| 5 | Claude Sonnet 4 | Anthropic | 80.5% |
| 6 | GLM-4.5 | Zhipu AI | 79.7% |
| 7 | GLM-4.5-Air | Zhipu AI | 77.9% |
| 8 | Qwen3-Coder 480B | Alibaba Cloud | 77.5% |
| 9 | o4-mini | OpenAI | 71.8% |
| 10 | o1 | OpenAI | 70.8% |
| 11 | Claude 3.5 Sonnet | Anthropic | 69.2% |
| 12 | GPT-4.5 | OpenAI | 68.4% |
| 13 | GPT-4.1 | OpenAI | 68.0% |
Airline scores are systematically lower than retail scores, reflecting the greater policy complexity of the domain.
| Rank | Model | Organization | Pass^1 (Airline) |
|---|---|---|---|
| 1 | Claude Sonnet 4.5 | Anthropic | 70.0% |
| 2 | MiniMax M1 80K | MiniMax | 62.0% |
| 3 | GLM-4.5-Air | Zhipu AI | 60.8% |
| 4 | GLM-4.5 | Zhipu AI | 60.4% |
| 5 | Qwen3-Coder 480B | Alibaba Cloud | 60.0% |
| 5 | Claude Sonnet 4 | Anthropic | 60.0% |
| 7 | Claude Opus 4 | Anthropic | 59.6% |
| 8 | Claude 3.7 Sonnet | Anthropic | 58.4% |
| 9 | GPT-4.5 | OpenAI | 50.0% |
| 10 | GPT-4.1 | OpenAI | 49.4% |
The telecom leaderboard reflects scores from the tau2-bench dual-control telecom domain. By early 2026, newer-generation models had achieved very high scores, reflecting the rapid capability improvements in frontier systems over the prior year.
| Rank | Model | Organization | Pass^1 (Telecom) |
|---|---|---|---|
| 1 | LongCat-Flash-Thinking-2601 | Meituan | 99.3% |
| 2 | Claude Opus 4.6 | Anthropic | 99.3% |
| 3 | GPT-5.4 | OpenAI | 98.9% |
| 4 | GPT-5.2 | OpenAI | 98.7% |
| 5 | Claude Opus 4.5 | Anthropic | 98.2% |
| 6 | GPT-5.5 | OpenAI | 98.0% |
| 7 | Claude Sonnet 4.6 | Anthropic | 97.9% |
| 8 | MiMo-V2-Pro | Xiaomi | 96.8% |
| 9 | GPT-5 | OpenAI | 96.7% |
The telecom domain was nearing saturation by mid-2026, prompting Sierra to focus new evaluation work on the harder tau-Knowledge and tau-Voice components.
Tau-bench occupies a distinct position within the broader landscape of agent evaluation benchmarks.
| Benchmark | Primary focus | Environment | User simulation | Reliability metric | Domains |
|---|---|---|---|---|---|
| Tau-bench | Customer service policy adherence | Simulated database + API | Yes (LLM user) | pass^k | Retail, Airline, Telecom, Banking |
| SWE-bench | Software engineering | Real GitHub repos | No | pass@k | Software repos |
| GAIA | General tool use and reasoning | Web + tools | No | Accuracy | Multiple |
| AgentBench | Multi-environment agent skills | 8 distinct envs | No | Success rate | OS, DB, KG, Web, Game |
| WebArena | Web navigation and tasks | Simulated websites | No | Task success | E-commerce, Reddit, CMS |
SWE-bench is the closest analogue to tau-bench in terms of real-world relevance and industry adoption. It tests whether agents can resolve real GitHub issues by writing patches that pass the repository's test suite and CI pipeline. Unlike tau-bench, SWE-bench does not involve dynamic user interaction or policy adherence, and its tasks are one-shot rather than multi-turn. The two benchmarks are complementary: SWE-bench measures software engineering capability; tau-bench measures customer-facing agentic reliability.
GAIA evaluates general-purpose tool use and multi-step reasoning across diverse tasks including web search, file parsing, and calculation. GAIA tests a broader range of cognitive skills than tau-bench but does not simulate ongoing dialogue with a user, does not require policy adherence, and does not measure reliability across repeated trials.
AgentBench, developed by researchers at Tsinghua University and published at ICLR 2024, evaluates agents across eight distinct environments including operating system shell tasks, SQL database queries, knowledge graph traversal, a digital card game, household simulation, and web browsing. Its breadth makes it useful for diagnosing weaknesses across disparate capability areas. It does not simulate user interaction and does not employ a pass^k-style reliability metric.
WebArena situates agents inside simulated websites -- e-commerce stores, forums, content management systems -- and asks them to complete realistic web tasks. It is closer to tau-bench in its focus on realistic operating environments, but it does not involve multi-turn user dialogue, policy documents, or explicit reliability measurement.
The key differentiator of the tau-bench family across all these comparisons is its combination of user simulation and the pass^k metric. No other major benchmark simultaneously requires agents to interact with a dynamic conversational user, adhere to a written domain policy, and demonstrate reliability over repeated attempts. This combination makes tau-bench particularly relevant to enterprise deployment contexts.
Several limitations of the tau-bench suite have been acknowledged by its developers and noted by the research community.
Simulator-as-oracle. The user simulator in tau-bench uses a language model (originally GPT-4) to generate user utterances. This creates an evaluation that may favor agents built on the same model family. If the agent and the simulator share underlying architecture or training data, their interactions may be unrealistically cooperative compared to what a real human user would produce. Sierra partially addressed this in tau2-bench by making the simulator state-coupled to the actual environment rather than purely LLM-generated.
Domain scope. Despite expansions to airline, telecom, and banking, the tau-bench domains remain a small subset of real-world service contexts. Industries such as healthcare, legal, insurance, and logistics each impose different policy structures and user behaviors that the current benchmark does not capture.
English-only evaluation. The text benchmarks and tau-Voice (as of March 2026) are English-only. The mu-bench companion benchmark addresses multilingual transcription quality, but tau-bench does not yet evaluate end-to-end multilingual agent performance.
Evaluator dependence. Task success in tau-bench is determined by comparing final database state to a ground-truth goal state. This approach is robust for most tasks, but some tasks have multiple valid completion paths, and the ground-truth state may not capture all of them. Community-contributed fixes in tau2-bench and tau3-bench reduced but did not eliminate this issue.
TTS-mediated accents. In tau-Voice, speaker accents are synthesized through text-to-speech rather than collected from real speakers. Synthetic accents may differ from natural accents in ways that affect both ASR accuracy and the difficulty of voice-based evaluation.
Saturation risk. The rapid performance improvement on tau-Telecom (approaching 99% pass^1 by mid-2026) illustrates that domains with bounded policy complexity can saturate quickly. The tau-Knowledge and tau-Voice extensions were designed partly to provide harder, less saturatable challenges, but the saturation trajectory of those domains remains to be seen.
Tau-bench has had measurable influence on how AI developers and researchers approach agent evaluation.
Several major AI laboratories now include pass^k scores alongside pass@k in published model evaluations. Anthropic's technical reports for Claude 3.7 Sonnet and later models reference tau-bench performance as a signal of customer-service agentic reliability. The benchmark has been cited in research expanding its methodology to new domains: MedAgentBench adapts the tau-bench simulation architecture to hospital settings, supplying agents with FHIR APIs and clinical databases and using physician-authored task scenarios.
The reinforcement learning integration added in tau2-bench version 0.2.1 (November 2025) made the benchmark usable as a training environment, not just an evaluation target. Gymnasium compatibility allows RL practitioners to train agents directly on tau-bench tasks and measure improvement in pass^k terms, closing the loop between training and deployment-oriented evaluation.
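The Gymnasium integration suggests a wrapper along the lines sketched below; this follows the standard Gymnasium custom-environment API, but the class internals, text observation encoding, and sparse reward are assumptions about, not a copy of, the actual tau2-bench integration.

```python
# Schematic Gymnasium-style wrapper for a tau-bench-like task -- the real
# tau2-bench integration may differ; all names here are illustrative.
import gymnasium as gym
from gymnasium import spaces

class TauTaskEnv(gym.Env):
    """One episode = one conversation; reward 1.0 only on verified success."""

    def __init__(self, task, user_sim, backend):
        self.task, self.user_sim, self.backend = task, user_sim, backend
        self.observation_space = spaces.Text(max_length=100_000)  # dialogue so far
        self.action_space = spaces.Text(max_length=10_000)        # agent message

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self.backend.reset(self.task)
        self.history = [("user", self.user_sim.first_turn(self.task.scenario))]
        return self._obs(), {}

    def step(self, action: str):
        self.history.append(("agent", action))
        self.backend.execute_tool_calls(action)      # mutate the task database
        terminated = self.backend.conversation_over(action)
        if not terminated:
            self.history.append(
                ("user", self.user_sim.next_turn(self.task.scenario, self.history)))
        # Sparse terminal reward: 1.0 iff final DB state matches the goal state.
        reward = float(terminated and self.backend.db_matches(self.task.goal_state))
        return self._obs(), reward, terminated, False, {}

    def _obs(self) -> str:
        return "\n".join(f"{role}: {text}" for role, text in self.history)
```

A sparse, state-verified reward like this is what lets pass^k serve as both the training signal and the deployment-fitness metric.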
The taubench.com leaderboard, launched in October 2025, provides a public, reproducible tracking surface with trajectory visualization and automated deployment. This infrastructure has lowered the barrier to community evaluation, allowing teams without large internal evaluation budgets to submit model runs and compare against the community.
Sierra AI itself has used tau-bench as a public signal of progress in the agent reliability problem that motivates its commercial products. The suite's expansion through three major versions in under two years, with each version introducing qualitatively new evaluation challenges, reflects both the rapid evolution of agent capabilities and the ongoing demand for harder evaluation targets.