τ-bench
Last reviewed
May 18, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v2 · 4,503 words
Tau-bench (stylized tau-bench or τ-bench) is a benchmark suite developed by Sierra AI to evaluate the reliability of language model agents operating in realistic, multi-turn customer service scenarios.[^1] First published in June 2024, the benchmark assesses whether agents can follow domain-specific policies, use API tools correctly, and sustain consistent behavior across many independent interactions with simulated users.[^1] The suite has since expanded through three major versions -- tau-bench (June 2024), tau2-bench (June 2025), and tau3-bench (March 2026) -- each adding new domains, evaluation modalities, and complexity.[^2][^3] A companion benchmark, mu-bench, was released in April 2026 to assess multilingual speech transcription quality for voice agents.[^4]
The tau-bench family is widely used in academic research and commercial model development to measure the dimension of agentic reliability that single-attempt benchmarks cannot expose: whether an agent can succeed not just once, but consistently, across repeated interactions.[^1]
By mid-2024 the dominant methodology for evaluating large language models in agentic settings relied on pass@k-style metrics inherited from code generation research. Under pass@k, a model receives credit if at least one of k independent sampling attempts produces a correct answer. This is useful for measuring raw capability, but it systematically overstates the practical reliability of agents deployed in production. A customer service agent that resolves a flight rebooking correctly 70% of the time still fails three out of ten customers, and those failures cascade through support queues, human escalation, and customer trust.
Existing benchmarks like AgentBench, WebArena, and GAIA each illuminated different slices of agent capability -- database manipulation, web browsing, multi-step tool use -- but none focused specifically on the triad of (1) multi-turn dialogue with a live user, (2) simultaneous compliance with a written policy document, and (3) stateful tool use against a backing database. SWE-bench captured software engineering skill through real GitHub issues, but an agent that can write a correct patch need not be the same agent that can process a return request under a retailer's refund policy while managing a disgruntled customer across ten conversational turns.
Sierra AI sat at the intersection of both concerns. Founded in 2023 by Bret Taylor (former Salesforce co-CEO and OpenAI board chair) and Clay Bavor (former Google VP), Sierra builds production AI agents for enterprises. Its researchers had firsthand visibility into the gap between benchmark-reported performance and the consistency failures that appeared in deployed agents. The tau-bench project grew out of that operational experience.[^1]
The original tau-bench paper, titled "τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains," was posted to arXiv (arXiv:2406.12045) on June 17, 2024 by Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan.[^1] Code and data were released at github.com/sierra-research/tau-bench under an open-source license.[^5] A companion leaderboard was later established at taubench.com.[^6]
Tau-bench was built around a specific thesis: that evaluation of production agents requires measuring reliability across many independent attempts on the same underlying task, not just measuring average performance across many different tasks.[^1] The distinction matters because in production, an agent handles thousands of structurally similar requests each day. A system with 70% single-attempt accuracy will fail roughly 300 out of every 1,000 interactions, a rate that is operationally unacceptable for most enterprise deployments.
To test this kind of reliability, tau-bench adopted a simulation architecture in which a language model acts as the user, generating natural conversational utterances in response to the agent.[^1] Each conversation begins with a task specification -- a scenario drawn from a realistic customer service context -- and ends when either the agent successfully completes the task (verified by comparing the final database state to a ground-truth goal state) or the conversation is terminated due to failure. The same underlying task can be run many times with different lexical realizations, allowing repeated-trial reliability to be measured directly.
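The loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the tau-bench harness: `run_episode`, the scripted user and agents, the `###STOP###` marker, and the dictionary standing in for the database are all hypothetical. It shows only the core idea that success is judged by comparing the final state to a goal state, not by what the agent says.

```python
import copy
from typing import Callable, Dict, List, Tuple

Database = Dict[str, Dict[str, str]]
Message = Tuple[str, str]  # (speaker, text)


def run_episode(
    agent: Callable[[List[Message], Database], str],
    user: Callable[[List[Message]], str],
    initial_db: Database,
    goal_db: Database,
    max_turns: int = 10,
) -> bool:
    """Alternate user and agent turns, then compare final and goal state."""
    db = copy.deepcopy(initial_db)  # every trial starts from a fresh database
    history: List[Message] = []
    for _ in range(max_turns):
        user_msg = user(history)
        history.append(("user", user_msg))
        if user_msg == "###STOP###":  # hypothetical end-of-conversation marker
            break
        history.append(("agent", agent(history, db)))
    return db == goal_db  # judged on state, not on wording


def scripted_user(history: List[Message]) -> str:
    """Stand-in for the LLM user simulator: replays a fixed script."""
    turns = ["Please cancel order 17.", "Yes, go ahead.", "###STOP###"]
    n_user_turns = sum(1 for speaker, _ in history if speaker == "user")
    return turns[min(n_user_turns, len(turns) - 1)]


def careful_agent(history: List[Message], db: Database) -> str:
    last = history[-1][1]
    if "cancel" in last:
        return "Do you confirm you want to cancel order 17?"
    if last.startswith("Yes"):
        db["17"]["status"] = "cancelled"  # stand-in for an API tool call
        return "Order 17 is cancelled."
    return "Anything else?"


def hasty_agent(history: List[Message], db: Database) -> str:
    return "Your order is cancelled."  # claims success, never calls a tool


initial = {"17": {"status": "pending"}}
goal = {"17": {"status": "cancelled"}}
print(run_episode(careful_agent, scripted_user, initial, goal))  # True
print(run_episode(hasty_agent, scripted_user, initial, goal))    # False
```

The second agent produces a plausible reply but leaves the database untouched, so the state comparison marks the trial as failed.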
The original release contained two domains:[^1]
tau-Retail covers e-commerce customer service. The database contains 500 simulated users, 50 product types, and 1,000 orders. Agents are supplied with a set of Python-callable API tools (for looking up orders, processing returns, applying discounts, updating shipping addresses, and so on) and a written policy document specifying rules such as return windows, eligibility criteria, and exception-handling procedures. The domain contains 115 tasks. Because the database is relatively small and the policy is tractable, tau-Retail is the easier of the two domains -- but still exposed dramatic reliability failures in state-of-the-art models at launch.[^1]
tau-Airline covers airline reservation management. The database contains 500 users, 300 flights, and 2,000 reservations. Tasks include rebooking, cancellations, seat upgrades, baggage fee disputes, and policy-mandated refunds. The domain contains 50 tasks and is structurally more complex than tau-Retail because airline policies involve more conditional logic (fare class rules, change fee waivers, same-day standby procedures). Agents must parse long policy documents and apply the correct rule to a specific reservation state.[^1]
The user simulator in the original tau-bench uses GPT-4 (specifically gpt-4-0613) to generate utterances.[^1] The simulator receives a system prompt containing the task scenario -- for example, "you want to return a defective blender purchased 12 days ago" -- along with the full conversation history. It generates the next user turn in response to the agent's most recent message. This design keeps the evaluation fully automated and reproducible, avoids the cost and latency of human annotators, and enables the repeated trials that pass^k requires.
The simulator's instructions are written in natural language rather than formal grammars, which produces more varied and naturalistic dialogue than rule-based alternatives. Sierra researchers acknowledged that the simulator can occasionally behave inconsistently or drift from the scenario, and they incorporated this as a source of realistic noise rather than attempting to eliminate it.
The central methodological contribution of tau-bench is the pass^k metric (read as "pass to the k").[^1] Pass^k is defined as the probability that an agent successfully completes a given task on all k independent trials. It contrasts with the pass@k metric standard in code generation, which measures the probability that at least one of k trials succeeds.
Formally, if p is the agent's single-trial success probability on a given task, then pass^k = p^k under the assumption of independent trials. The exponential decay is severe: a model with 85% single-trial accuracy (pass^1 = 0.85) has a pass^8 of roughly 0.85^8 ≈ 0.27, meaning it would successfully handle all eight customers with the same underlying issue less than 30% of the time. This is the regime that production agents live in.
Tau-bench typically reports results averaged across tasks in a domain, giving a mean pass^1 and mean pass^k for selected k values (commonly k = 2, 4, 8). The pass^k vs. k curve reveals the reliability profile of a model: a model with high pass^1 but rapidly collapsing pass^k is fragile and stochastic; a model whose pass^k curve flattens near its pass^1 value is robust and consistent.
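The decay and the effect of averaging across tasks can be checked with a short calculation. The sketch below assumes independent trials, as in the p^k formula above; the function names and the two-task example are illustrative and not taken from the benchmark code.

```python
from statistics import mean
from typing import Sequence


def pass_hat_k(p: float, k: int) -> float:
    """Probability that k independent trials of one task all succeed."""
    return p ** k


def mean_pass_hat_k(task_probs: Sequence[float], k: int) -> float:
    """Average pass^k over tasks with different success probabilities."""
    return mean(pass_hat_k(p, k) for p in task_probs)


# One task with an 85% single-trial success rate.
for k in (1, 2, 4, 8):
    print(k, f"{pass_hat_k(0.85, k):.2f}")
# 1 0.85
# 2 0.72
# 4 0.52
# 8 0.27

# Two hypothetical tasks that also average 85% at k = 1.
uneven = [1.0, 0.7]
print(f"{mean_pass_hat_k(uneven, 1):.2f}", f"{mean_pass_hat_k(uneven, 8):.2f}")
# 0.85 0.53
```

The second example shows why the curve is informative: two models with the same mean pass^1 can have different mean pass^8, because the average of p^k over tasks is not the same as the average p raised to the k. A model that is perfect on some tasks and weak on others keeps more of its score than one that is uniformly unreliable.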
The distinction from the closely related pass@k is crucial. Pass@k gives credit for a single success among k attempts -- it measures what a model can do given multiple chances, which is appropriate when you want to assess capability. Pass^k measures what a model reliably does every time -- appropriate when you want to assess deployment fitness.
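In practice both quantities are estimated from a finite number of trials. The sketch below assumes the usual combinatorial estimators for a task that was run n times with c successes: the chance that k trials drawn without replacement are all successes (pass^k) versus the chance that at least one of them is (pass@k). The function names are illustrative.

```python
from math import comb


def pass_hat_k_estimate(n: int, c: int, k: int) -> float:
    """Estimate pass^k: all k sampled trials are among the c successes."""
    return comb(c, k) / comb(n, k)


def pass_at_k_estimate(n: int, c: int, k: int) -> float:
    """Estimate pass@k: at least one of k sampled trials is a success."""
    return 1.0 - comb(n - c, k) / comb(n, k)


# A task attempted 8 times with 6 successes.
n, c = 8, 6
for k in (1, 2, 4):
    print(k, f"{pass_hat_k_estimate(n, c, k):.3f}", f"{pass_at_k_estimate(n, c, k):.3f}")
# 1 0.750 0.750
# 2 0.536 0.964
# 4 0.214 1.000
```

Both metrics start from the same value at k = 1 and then move in opposite directions as k grows, which is the contrast between capability and deployment fitness drawn above.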
The June 2024 paper reported that even the strongest available models fell well short of reliable performance.[^1] GPT-4o, then considered the frontier model for function-calling tasks, succeeded on fewer than 50% of tasks overall, and its pass^8 rate in tau-Retail fell below 25% -- a substantial collapse from pass^1 to pass^8.[^1] Claude 3.5 Sonnet performed somewhat better on single-attempt tasks but showed similar reliability degradation under repeated evaluation.
The most common failure mode was policy violation: agents would identify the correct tool to call but invoke it with arguments that, while plausible, violated some condition specified in the policy document. A second major failure mode was conversation management -- agents that correctly understood the user's goal but elicited insufficient information before acting, or that lost track of task state over a long dialogue.[^1]
These results had immediate impact on how researchers and practitioners thought about agent benchmarking. Several model evaluation teams that had previously relied on pass@k-style metrics began supplementing their evaluations with pass^k reporting.
Tau2-bench (also written tau^2-bench or τ²-bench) was published as the arXiv paper "τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment" on June 9, 2025, with authors Victor Barres, Honghua Dong, Soham Ray, Xujie Si, and Karthik Narasimhan.[^2] The accompanying codebase was released at github.com/sierra-research/tau2-bench.[^7]
The fundamental advance in tau2-bench was the introduction of a dual-control environment.[^2] In the original tau-bench, the user simulator is a passive information provider: it answers the agent's questions and makes requests, but has no independent ability to act on the environment. This is appropriate for many simple service interactions, but it fails to capture a large class of real-world support scenarios where the user must perform actions on their side -- restarting a device, checking a setting, toggling a network mode -- before the agent can proceed.
Tau2-bench models this as a decentralized partially observable Markov decision process (Dec-POMDP), in which both the agent and the user have tools and can independently modify the shared environment state.[^2] The user simulator is tightly coupled to the actual environment: it generates actions and responses based on the real configuration of the shared world, preventing it from inventing settings that do not exist or failing to report changes that actually occurred.
This bidirectional simulation design produces a qualitatively harder evaluation. Agents must not only call the right tools themselves, but must also correctly instruct users on what actions to take, wait for confirmation of those actions, interpret the results, and update their troubleshooting strategy accordingly.[^2]
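A toy version of the dual-control idea is sketched below. Everything in it is hypothetical (the `SharedWorld` fields, the tool names, the single fault scenario); it is not the tau2-bench API. It illustrates two points from the description above: each party has tools the other lacks, and the user's reports are read from the real shared state and are not invented.

```python
from dataclasses import dataclass


@dataclass
class SharedWorld:
    """State that both parties can change (all fields are hypothetical)."""
    airplane_mode: bool = True    # lives on the user's phone
    line_suspended: bool = True   # lives in the carrier's backend

    def mobile_data_works(self) -> bool:
        return not self.airplane_mode and not self.line_suspended


class AgentTools:
    """Actions only the agent can take."""

    def __init__(self, world: SharedWorld) -> None:
        self.world = world

    def resume_line(self) -> str:
        self.world.line_suspended = False
        return "line resumed"


class UserTools:
    """Actions only the simulated user can take on their own device."""

    def __init__(self, world: SharedWorld) -> None:
        self.world = world

    def toggle_airplane_mode(self) -> str:
        self.world.airplane_mode = not self.world.airplane_mode
        return "airplane mode " + ("on" if self.world.airplane_mode else "off")

    def check_status_bar(self) -> str:
        # The report is derived from the real state, so it cannot be invented.
        return "data connected" if self.world.mobile_data_works() else "no service"


world = SharedWorld()
agent, user = AgentTools(world), UserTools(world)

agent.resume_line()                        # the agent fixes its own side
after_agent_fix = user.check_status_bar()  # still broken on the device
user.toggle_airplane_mode()                # the user follows the instruction
after_user_action = user.check_status_bar()
print(after_agent_fix, "->", after_user_action)  # no service -> data connected
```

Neither party can resolve the fault alone, which is why the agent's ability to instruct the user and interpret the reported result becomes part of what is measured.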
Tau2-bench introduced a third domain, tau-Telecom, inspired by the kinds of interactions that occur in real telecommunications support.[^2] Tasks include resolving broken mobile data connections, fixing Multimedia Messaging Service (MMS) failures, and switching mobile network modes between 4G and 5G. Each task requires the agent to follow a telecom policy document, communicate diagnostic instructions to the user, interpret the user's reported outcomes, and adapt the troubleshooting path accordingly.
The telecom domain was substantially harder than retail or airline at the time of tau2-bench's release. The paper's analysis separates reasoning errors from communication/coordination failures, and shows that the shift from no-user to dual-control settings produces a marked performance drop even for frontier models.[^2]
Tau2-bench introduced a compositional task generation approach. Rather than hand-authoring tasks individually, tasks are assembled from atomic actions that represent the smallest coherent troubleshooting steps.[^2] This allows systematic control over task complexity, enables automatic verification of correct solutions, and scales to produce larger and more varied task sets than manual authoring allows. The approach also makes it easier to identify which sub-skills an agent lacks when it fails a composite task.
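The compositional idea can be illustrated with a small generator. The atomic issues, state keys, and helper names below are invented for the example and are not the benchmark's actual task definitions; the sketch only shows how combining atomic faults gives controllable difficulty, an automatically known goal state, and a per-issue diagnosis when an agent fails.

```python
from itertools import combinations
from typing import Dict, List, Sequence

# Hypothetical atomic issues: name -> (state key, broken value).
ATOMIC_ISSUES = {
    "airplane_mode_on": ("airplane_mode", True),
    "data_disabled": ("mobile_data", False),
    "wrong_network_mode": ("network_mode", "2G"),
}
HEALTHY = {"airplane_mode": False, "mobile_data": True, "network_mode": "5G"}


def compose_task(issue_names: Sequence[str]) -> Dict:
    """Build one task by breaking the healthy state in several ways."""
    initial = dict(HEALTHY)
    for name in issue_names:
        key, broken_value = ATOMIC_ISSUES[name]
        initial[key] = broken_value
    return {
        "issues": tuple(issue_names),
        "initial_state": initial,
        "goal_state": dict(HEALTHY),  # the correct solution is known by construction
    }


def generate_tasks(max_issues: int) -> List[Dict]:
    """Enumerate every combination of up to max_issues atomic issues."""
    tasks = []
    for size in range(1, max_issues + 1):
        for combo in combinations(sorted(ATOMIC_ISSUES), size):
            tasks.append(compose_task(combo))
    return tasks


def unresolved_issues(task: Dict, final_state: Dict) -> List[str]:
    """Name the atomic issues the agent left unfixed."""
    return [
        name
        for name in task["issues"]
        if final_state[ATOMIC_ISSUES[name][0]] != task["goal_state"][ATOMIC_ISSUES[name][0]]
    ]


tasks = generate_tasks(max_issues=3)
print(len(tasks))  # 7

hardest = tasks[-1]  # all three issues at once
final = dict(hardest["initial_state"])
final["airplane_mode"] = False  # the agent fixed two of the three issues
final["mobile_data"] = True
print(unresolved_issues(hardest, final))  # ['wrong_network_mode']
```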
The tau2-bench release included community-contributed corrections to the original retail and airline domains. These fixes addressed ambiguous task descriptions, inconsistent policy language, and evaluation criteria that did not correctly capture task success. The corrections tightened the correlation between human judgment and automated evaluation, increasing confidence that pass^k scores reflect genuine task completion rather than evaluation artifacts.[^7]
Tau3-bench (stylized tau3-bench or τ³-bench) was released by Sierra AI on March 18, 2026, with an accompanying blog post titled "τ³-Bench: Advancing agent evaluation to knowledge and voice."[^3] It extended the tau-bench framework in two new directions: unstructured knowledge retrieval (tau-Knowledge) and real-time voice (tau-Voice), each representing a modality where production agents routinely struggle but prior evaluation infrastructure was limited.
Tau-Knowledge evaluates agent performance in scenarios where the agent must retrieve relevant information from a large, unstructured knowledge base before it can take action.[^8] The domain modeled is tau-Banking, a fintech-inspired customer support setting covering products such as personal and business checking, savings, credit cards, buy-now-pay-later (BNPL), dispute procedures, and identity verification protocols. The knowledge base contains 698 documents organized across 21 product categories, totaling approximately 195,000 tokens.[^8]
Each task requires information from an average of 18.6 documents and an average of 9.5 tool calls, with some tasks requiring up to 33 tool calls -- placing tau-Knowledge among the most operationally demanding agent evaluations published to date.[^8]
Tasks require multi-step problem resolution: search the knowledge base for relevant policy or product information, reason over the retrieved content, and then call the appropriate transactional tools to complete the user's request. The evaluation harness provides three retrieval strategies that models may select between: BM25 keyword search, dense embedding retrieval, and a freeform shell that lets the model navigate and read files directly.[^8]
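For readers unfamiliar with the first of these strategies, the following is a compact, self-contained BM25 scorer. It is a generic textbook formulation with common default parameters (k1 = 1.5, b = 0.75) and whitespace tokenization; how the tau-Knowledge harness implements or configures its keyword search is not specified here, and the three documents are made-up toy examples.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every document against the query with Okapi BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Number of documents that contain each term at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            tf = term_freq[term]
            # Length normalization: long documents need more matches to score well.
            denom = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / denom
        scores.append(score)
    return scores


docs = [
    "how to dispute a credit card transaction",
    "opening a business checking account online",
    "savings account interest rates and fees",
]
scores = bm25_scores("dispute credit card", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, docs[best])  # 0 how to dispute a credit card transaction
```

Keyword scoring of this kind only rewards literal term overlap, which is one plausible reason a model might prefer to browse the files directly when the user's wording differs from the policy text.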
At launch in March 2026, the best frontier model (GPT-5.2 with high reasoning) achieved only 25.5% pass^1 and 9.3% pass^4 on tau-Knowledge, underscoring how far state-of-the-art systems remained from reliable knowledge-grounded performance.[^8] By May 2026, GPT-5.5 with extra-high reasoning led the leaderboard at 37.4% pass^1 and 20.6% pass^4 -- an 11.9-point absolute gain in pass^1 over roughly two months.[^8] A consistent finding was that models did substantially better when given flexible, freeform shell access to the knowledge base than when restricted to traditional retrieval-augmented generation pipelines.[^8]
Tau-Voice extends 278 tasks from the existing tau-bench text domains (retail, airline, and telecom) into real-time voice evaluation.[^9] The task definitions, API tools, and ground-truth evaluation logic are identical to their text counterparts, enabling a direct comparison between a model's voice and text performance on identical problems.
The voice evaluation framework synthesizes caller utterances through diverse speaker personas and applies realistic audio degradation: background noise, G.711 mu-law compression (the codec used in standard telephony), 8 kHz sample rate, vocal tics, non-directed speech, and an LLM-driven turn-taking policy that generates interruptions, backchannels, and overlapping speech.[^9] Agents under evaluation use real-time, full-duplex audio APIs (as opposed to the half-duplex turn-by-turn architecture common in earlier voice benchmarks).
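To give a sense of what mu-law compression does to a signal, the sketch below implements the continuous mu-law companding formula with 8-bit quantization. This is an approximation for illustration: the real G.711 codec uses a piecewise-linear encoding of this curve, and nothing here is taken from the tau-Voice pipeline. The function names are hypothetical.

```python
import math

MU = 255  # companding constant used by the mu-law variant of G.711


def mu_law_compress(x: float) -> float:
    """Map a sample in [-1, 1] to [-1, 1] on a logarithmic scale."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def mu_law_expand(y: float) -> float:
    """Inverse of mu_law_compress."""
    return math.copysign(((1 + MU) ** abs(y) - 1) / MU, y)


def quantize_8bit(v: float) -> int:
    """Round a value in [-1, 1] to one of 256 evenly spaced levels."""
    return round((v + 1) / 2 * 255)


def dequantize_8bit(q: int) -> float:
    return q / 255 * 2 - 1


def mu_law_roundtrip(x: float) -> float:
    """Compress, store in 8 bits, and expand again."""
    return mu_law_expand(dequantize_8bit(quantize_8bit(mu_law_compress(x))))


def linear_roundtrip(x: float) -> float:
    """Store the raw sample in 8 bits without companding."""
    return dequantize_8bit(quantize_8bit(x))


quiet_sample = 0.01
mu_error = abs(mu_law_roundtrip(quiet_sample) - quiet_sample)
linear_error = abs(linear_roundtrip(quiet_sample) - quiet_sample)
print(mu_error < linear_error)  # True
```

Companding spends more of the 8-bit budget on quiet samples, which preserves speech intelligibility at low bit rates but still discards detail that speech models trained on clean wideband audio may rely on.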
Two evaluation modes are reported: a Clean condition (single persona, no acoustic effects, strict turn-taking) and a Realistic condition (diverse personas, environmental noise, free-form turn-taking).[^9] Voice agents are scored against the same database-state comparison used for text agents, so the metric reflects end-to-end task completion rather than just transcription quality or dialogue naturalness.
Progress on tau-Voice over its first eight months of reporting was substantial. In August 2025, the leading voice agent (OpenAI gpt-realtime-1.0) achieved approximately 30% pass^1.[^9] By April 2026, the leading system (xAI's grok-voice-think-fast-1.0) reached 67% pass^1 -- more than doubling performance in eight months.[^9] Under the realistic audio conditions, top systems scored between 26% and 38% pass^1, compared to substantially higher rates for equivalent text-based agents on the same tasks; Sierra reports that voice agents retain roughly 79% of text-model capability under matched conditions.[^9]
The tau3-bench release also incorporated community-contributed fixes to the retail, airline, and telecom domains, further tightening the evaluation criteria.[^3]
Alongside the tau3-bench voice work, Sierra AI released mu-bench (stylized mu-bench or μ-bench, for Multilingual Utterances Transcription Benchmark) on April 20, 2026.[^4] Mu-bench is a companion benchmark focused specifically on automatic speech recognition (ASR) quality for customer service phone audio, separate from the end-to-end task completion focus of tau-Voice.
The dataset consists of 4,270 human-annotated utterances drawn from 250 real customer service phone conversations (roughly 5.1 hours of audio), recorded at 8 kHz mono -- the audio fidelity typical of telephony networks.[^4] Calls were scripted interactions with an AI banking agent built on Sierra's voice platform (checking card status, confirming case codes, disputing transactions, requesting credit-limit increases) but with callers using their own phones in their own environments, producing realistic background noise, disfluencies, emotional variation, interruptions, and diverse speaking styles. Five languages are covered: English, Spanish, Turkish, Vietnamese, and Mandarin.[^4]
Mu-bench introduces a new metric, Utterance Error Rate (UER), designed to address a shortcoming of the traditional Word Error Rate (WER). WER penalizes every word-level difference equally, so a dropped filler such as "um" costs as much as a dropped or misheard content word. UER distinguishes between meaning-changing errors (such as misheard digits) and surface-level errors (such as dropped filler words) -- a distinction that matters for downstream agent behavior, since an agent acting on a mistranscribed intent is far more harmful than one that normalizes filler words differently.[^4]
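The shortcoming of WER is easy to reproduce. The sketch below computes standard WER (word-level edit distance divided by reference length) and pairs it with a deliberately simple meaning check. The second function is a hypothetical stand-in to illustrate the distinction; it is not Sierra's definition of UER, and the filler list is an assumption.

```python
from typing import List

FILLERS = {"um", "uh", "erm"}  # assumed filler words, for illustration only


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


def content_words(text: str) -> List[str]:
    return [w for w in text.lower().split() if w not in FILLERS]


def has_meaning_error(reference: str, hypothesis: str) -> bool:
    """Toy check: flag the utterance only if non-filler words differ."""
    return content_words(reference) != content_words(hypothesis)


reference = "um my code is four two seven"
dropped_filler = "my code is four two seven"
misheard_digit = "um my code is four two eleven"

print(word_error_rate(reference, dropped_filler) == word_error_rate(reference, misheard_digit))  # True
print(has_meaning_error(reference, dropped_filler))  # False
print(has_meaning_error(reference, misheard_digit))  # True
```

Both transcripts receive the same WER, yet only the second would send an agent the wrong case code.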
The initial provider comparison included Google Chirp-3, Deepgram Nova-3, Microsoft Azure Speech, ElevenLabs Scribe v2, and OpenAI GPT-4o Mini Transcribe.[^4] Google's Chirp-3 led on multilingual accuracy but was among the slowest providers, while Deepgram's Nova-3 was approximately 8× faster at p50 latency but trailed on non-English tasks.[^4] No single provider dominated across all five languages, and Mandarin transcription accuracy was in some cases substantially worse than English for the same provider. The dataset, code, and an open leaderboard are publicly available through the Sierra Research GitHub organization and Hugging Face.[^10]
As of May 2026, the tau-bench retail leaderboard (pass^1 scores) was led by Anthropic models, which have dominated the top positions since 2025.[^11]
| Rank | Model | Organization | Pass^1 (Retail) |
|---|---|---|---|
| 1 | Claude Sonnet 4.5 | Anthropic | 86.2% |
| 2 | Claude Opus 4.1 | Anthropic | 82.4% |
| 3 | Claude Opus 4 | Anthropic | 81.4% |
| 4 | Claude 3.7 Sonnet | Anthropic | 81.2% |
| 5 | Claude Sonnet 4 | Anthropic | 80.5% |
| 6 | GLM-4.5 | Zhipu AI | 79.7% |
| 7 | GLM-4.5-Air | Zhipu AI | 77.9% |
| 8 | Qwen3-Coder 480B | Alibaba Cloud | 77.5% |
| 9 | o4-mini | OpenAI | 71.8% |
| 10 | o1 | OpenAI | 70.8% |
| 11 | Qwen3-Next-80B-A3B-Thinking | Alibaba Cloud | 69.6% |
| 12 | Claude 3.5 Sonnet | Anthropic | 69.2% |
| 13 | GPT-4.5 | OpenAI | 68.4% |
| 14 | GPT-4.1 | OpenAI | 68.0% |
Airline scores are systematically lower than retail scores, reflecting the greater policy complexity of the domain.[^12]
| Rank | Model | Organization | Pass^1 (Airline) |
|---|---|---|---|
| 1 | Claude Sonnet 4.5 | Anthropic | 70.0% |
| 2 | MiniMax M1 80K | MiniMax | 62.0% |
| 3 | GLM-4.5-Air | Zhipu AI | 60.8% |
| 4 | GLM-4.5 | Zhipu AI | 60.4% |
| 5 | MiniMax M1 40K | MiniMax | 60.0% |
| 5 | Claude Sonnet 4 | Anthropic | 60.0% |
| 5 | Qwen3-Coder 480B | Alibaba Cloud | 60.0% |
| 8 | Claude Opus 4 | Anthropic | 59.6% |
| 9 | Claude 3.7 Sonnet | Anthropic | 58.4% |
| 10 | Claude Opus 4.1 | Anthropic | 56.0% |
| 11 | o1 | OpenAI | 50.0% |
| 11 | GPT-4.5 | OpenAI | 50.0% |
| 13 | GPT-4.1 | OpenAI | 49.4% |
| 14 | o4-mini | OpenAI | 49.2% |
The telecom leaderboard reflects scores from the tau2-bench dual-control telecom domain. By early 2026, newer-generation models were reported to have achieved very high scores; these are drawn from third-party leaderboard aggregations and have shown larger month-to-month variation than the retail and airline numbers above.
| Rank | Model | Organization | Pass^1 (Telecom) |
|---|---|---|---|
| 1 | LongCat-Flash-Thinking-2601 | Meituan | 99.3% |
| 1 | Claude Opus 4.6 | Anthropic | 99.3% |
| 3 | GPT-5.4 | OpenAI | 98.9% |
| 4 | GPT-5.2 | OpenAI | 98.7% |
| 5 | Claude Opus 4.5 | Anthropic | 98.2% |
| 6 | GPT-5.5 | OpenAI | 98.0% |
| 7 | Claude Sonnet 4.6 | Anthropic | 97.9% |
| 8 | MiMo-V2-Pro | Xiaomi | 96.8% |
| 9 | GPT-5 | OpenAI | 96.7% |
The telecom domain was nearing saturation by mid-2026, prompting Sierra to focus new evaluation work on the harder tau-Knowledge and tau-Voice components.[^3]
The tau-Knowledge (tau-Banking) leaderboard reflects far lower absolute performance than the older domains, consistent with the difficulty of multi-step retrieval-grounded reasoning over ~200K tokens of policy material.[^8]
| Model | Organization | Pass^1 | Pass^4 |
|---|---|---|---|
| GPT-5.5 (xhigh reasoning, May 2026) | OpenAI | 37.4% | 20.6% |
| GPT-5.2 (high reasoning, launch baseline) | OpenAI | 25.5% | 9.3% |
Tau-bench occupies a distinct position within the broader landscape of agent evaluation benchmarks.
| Benchmark | Primary focus | Environment | User simulation | Reliability metric | Domains |
|---|---|---|---|---|---|
| Tau-bench | Customer service policy adherence | Simulated database + API | Yes (LLM user) | pass^k | Retail, Airline, Telecom, Banking |
| SWE-bench | Software engineering | Real GitHub repos | No | % resolved (single attempt) | Software repos |
| GAIA | General tool use and reasoning | Web + tools | No | Accuracy | Multiple |
| AgentBench | Multi-environment agent skills | 8 distinct envs | No | Success rate | OS, DB, KG, Web, Game |
| WebArena | Web navigation and tasks | Simulated websites | No | Task success | E-commerce, Reddit, CMS |
SWE-bench is the closest analogue to tau-bench in terms of real-world relevance and industry adoption. It tests whether agents can resolve GitHub issues by writing code, running tests, and passing CI pipelines. Unlike tau-bench, SWE-bench does not involve dynamic user interaction or policy adherence, and its tasks are one-shot rather than multi-turn. The two benchmarks are complementary: SWE-bench measures software engineering capability; tau-bench measures customer-facing agentic reliability.
GAIA evaluates general-purpose tool use and multi-step reasoning across diverse tasks including web search, file parsing, and calculation. GAIA tests a broader range of cognitive skills than tau-bench but does not simulate ongoing dialogue with a user, does not require policy adherence, and does not measure reliability across repeated trials.
AgentBench, developed by researchers at Tsinghua University and published at ICLR 2024, evaluates agents across eight distinct environments including operating system shell tasks, SQL database queries, knowledge graph traversal, a digital card game, household simulation, and web browsing. Its breadth makes it useful for diagnosing weaknesses across disparate capability areas. It does not simulate user interaction and does not employ a pass^k-style reliability metric.
WebArena situates agents inside simulated websites -- e-commerce stores, forums, content management systems -- and asks them to complete realistic web tasks. It is closer to tau-bench in its focus on realistic operating environments, but it does not involve multi-turn user dialogue, policy documents, or explicit reliability measurement.
The key differentiator of the tau-bench family across all these comparisons is its combination of user simulation and the pass^k metric. No other major benchmark simultaneously requires agents to interact with a dynamic conversational user, adhere to a written domain policy, and demonstrate reliability over repeated attempts. This combination makes tau-bench particularly relevant to enterprise deployment contexts.
Several limitations of the tau-bench suite have been acknowledged by its developers and noted by the research community.
Simulator-as-oracle. The user simulator in tau-bench uses a language model (originally GPT-4) to generate user utterances.[^1] This creates an evaluation that may favor agents built on the same model family. If the agent and the simulator share underlying architecture or training data, their interactions may be unrealistically cooperative compared to what a real human user would produce. Sierra partially addressed this in tau2-bench by coupling the user simulator to the actual environment state, so that its reports and actions are constrained by the environment rather than generated freely by the LLM.[^2]
Domain scope. Despite expansions to telecom and banking, the tau-bench domains remain a small subset of real-world service contexts. Industries such as healthcare, legal, insurance, and logistics each impose different policy structures and user behaviors that the current benchmark does not capture.
English-only evaluation. The text benchmarks and tau-Voice (as of March 2026) are English-only.[^9] The mu-bench companion benchmark addresses multilingual transcription quality, but tau-bench does not yet evaluate end-to-end multilingual agent performance.
Evaluator dependence. Task success in tau-bench is determined by comparing final database state to a ground-truth goal state.[^1] This approach is robust for most tasks, but some tasks have multiple valid completion paths, and the ground-truth state may not capture all of them. Community-contributed fixes in tau2-bench and tau3-bench reduced but did not eliminate this issue.
TTS-mediated accents. In tau-Voice, speaker accents are synthesized through text-to-speech rather than collected from real speakers.[^9] Synthetic accents may differ from natural accents in ways that affect both ASR accuracy and the difficulty of voice-based evaluation. Mu-bench partially compensates by using audio recorded from real callers in five languages.[^4]
Saturation risk. The rapid performance improvement on tau-Telecom (with top reported scores above 99% pass^1 in the first half of 2026) illustrates that domains with bounded policy complexity can saturate quickly. The tau-Knowledge and tau-Voice extensions were designed partly to provide harder, less saturatable challenges, but the saturation trajectory of those domains remains to be seen.
Tau-bench has had measurable influence on how AI developers and researchers approach agent evaluation.
Several major AI laboratories now include pass^k scores alongside pass@k in published model evaluations. Anthropic's technical reports for Claude 3.7 Sonnet and later models reference tau-bench performance as a signal of customer-service agentic reliability. The benchmark has been cited in research expanding its methodology to new domains, including adaptations of the tau-bench simulation architecture to clinical and other regulated settings.
The reinforcement learning integration introduced in the tau2-bench codebase made the benchmark usable as a training environment, not just an evaluation target.[^7] Gymnasium-style compatibility allows RL practitioners to train agents directly on tau-bench tasks and measure improvement in pass^k terms, closing the loop between training and deployment-oriented evaluation.
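What "Gymnasium-style" means in practice is an environment exposing `reset()` and `step()` with the standard return tuples. The toy class below mirrors that interface without importing Gymnasium or tau2-bench; the class name, action strings, and reward scheme are hypothetical and are not the actual tau2-bench environment.

```python
from typing import Any, Callable, Dict, Tuple


class ToyServiceEnv:
    """Toy environment following the Gymnasium reset/step conventions."""

    def __init__(self, max_turns: int = 5) -> None:
        self.max_turns = max_turns
        self.db: Dict[str, str] = {}
        self.turns = 0

    def reset(self, seed: Any = None, options: Any = None) -> Tuple[str, Dict]:
        self.db = {"order_17": "pending"}
        self.turns = 0
        return "user: please cancel order 17", {}

    def step(self, action: str) -> Tuple[str, float, bool, bool, Dict]:
        self.turns += 1
        if action == "cancel_order(17)":
            self.db["order_17"] = "cancelled"
        terminated = self.db["order_17"] == "cancelled"  # goal state reached
        truncated = not terminated and self.turns >= self.max_turns
        reward = 1.0 if terminated else 0.0
        observation = "user: thanks" if terminated else "user: is it cancelled yet?"
        return observation, reward, terminated, truncated, {}


def rollout(env: ToyServiceEnv, policy: Callable[[str], str]) -> float:
    """Run one episode and return its total reward."""
    observation, _ = env.reset()
    total = 0.0
    while True:
        observation, reward, terminated, truncated, _ = env.step(policy(observation))
        total += reward
        if terminated or truncated:
            return total


def all_trials_pass(env: ToyServiceEnv, policy: Callable[[str], str], k: int) -> bool:
    """A pass^k-style check: every one of k episodes must succeed."""
    return all(rollout(env, policy) == 1.0 for _ in range(k))


def acting_policy(observation: str) -> str:
    return "cancel_order(17)"


def stalling_policy(observation: str) -> str:
    return "apologize()"


env = ToyServiceEnv()
print(all_trials_pass(env, acting_policy, k=4))    # True
print(all_trials_pass(env, stalling_policy, k=4))  # False
```

With stochastic policies, the same loop would yield per-task success counts that feed directly into a pass^k estimate, which is the sense in which training and evaluation share one interface.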
The taubench.com leaderboard provides a public tracking surface with trajectory visualization.[^6] This infrastructure has lowered the barrier to community evaluation, allowing teams without large internal evaluation budgets to submit model runs and compare against the community.
Sierra AI itself has used tau-bench as a public signal of progress in the agent reliability problem that motivates its commercial products. The suite's expansion through three major versions in under two years, with each version introducing qualitatively new evaluation challenges, reflects both the rapid evolution of agent capabilities and the ongoing demand for harder evaluation targets.