Agent evaluation refers to the systematic assessment of AI agents through benchmarks, metrics, and testing methodologies designed to measure their performance on real-world tasks. As autonomous AI systems have evolved from simple chatbots into multi-step, tool-using agents capable of browsing the web, writing code, and operating computer interfaces, the need for rigorous evaluation frameworks has grown rapidly. Agent evaluation encompasses a broad range of approaches, from standardized academic benchmarks like SWE-bench and WebArena to enterprise-grade observability platforms that track cost, latency, and reliability in production.
Unlike traditional language model evaluation, which typically measures performance on static question-answer pairs, agent evaluation must account for multi-turn interactions, tool use, environment manipulation, non-deterministic execution paths, and the compounding effects of errors over long task horizons. The field draws on research in reinforcement learning, human-computer interaction, software engineering, and safety analysis.
The evaluation of AI agents has roots in earlier work on reinforcement learning environments such as Atari games and MuJoCo simulations, where agents were measured by cumulative reward. However, the rise of large language models (LLMs) as the backbone of agentic systems introduced new evaluation challenges. Early LLM benchmarks such as MMLU and HellaSwag focused on static knowledge and reasoning, but they could not capture whether a model could effectively use tools, navigate websites, or resolve real software engineering issues.
The first wave of agent-specific benchmarks emerged in 2022 and 2023. MiniWoB++ provided a collection of over 100 simplified web tasks for testing basic web manipulation skills. WebShop simulated an e-commerce environment with 1.18 million products and 12,087 crowd-sourced shopping instructions. These early benchmarks demonstrated that LLMs could be evaluated as interactive agents rather than passive text generators, but their synthetic nature limited how well results generalized to real-world settings.
By late 2023 and into 2024, a second wave of more realistic benchmarks appeared. SWE-bench tested agents on real GitHub issues from popular Python repositories. WebArena created self-hosted replicas of real websites for autonomous web navigation. GAIA combined multi-modal reasoning with tool use across multiple difficulty levels. AgentBench evaluated LLMs across eight distinct environments spanning operating systems, databases, knowledge graphs, and web browsing. These benchmarks reflected a growing consensus that agent evaluation must test performance in realistic, multi-step scenarios rather than isolated capabilities.
The third wave, beginning in 2025, has focused on enterprise readiness, safety, and consistency. Benchmarks like CUB (Computer Use Benchmark), tau-bench, and OSWorld-Verified have introduced domain-specific workflows, repeated-trial consistency metrics, and verified task sets. The field has also seen the emergence of comprehensive evaluation frameworks from companies like Anthropic, which published detailed guidance on building agent evaluation pipelines that combine code-based graders, model-based graders, and human review.
A comprehensive survey published in 2025 proposed a two-dimensional taxonomy for agent evaluation, organizing prior work by evaluation objectives (what to evaluate) and evaluation process (how to evaluate).
Agent evaluation targets four primary objectives:
| Dimension | Description | Example metrics |
|---|---|---|
| Agent behavior | Overall performance as perceived by a user, treating the agent as a black box | Task completion rate, output quality, latency, cost per task |
| Agent capabilities | Specific skills the agent demonstrates | Tool use accuracy, planning quality, memory retention, multi-agent collaboration |
| Reliability | Consistency and robustness across repeated executions and varied conditions | pass^k (all k trials succeed), robustness under input perturbations |
| Safety and alignment | Adherence to policies, avoidance of harm, fairness | Harm rate, policy violation rate, adversarial robustness, bias detection |
The evaluation process, the second dimension, concerns how agents are assessed: which metrics are computed, in which environments tasks are run, and how outputs are graded. The most widely used metrics are described below.
The most fundamental metric in agent evaluation is the success rate (SR), also called the task completion rate. It measures the proportion of tasks that an agent completes correctly out of the total number attempted. Success is typically determined by checking whether the agent's actions produce the desired end state, such as a passing test suite in SWE-bench or the correct final webpage configuration in WebArena.
Variants of success rate include:
| Metric | Definition | Use case |
|---|---|---|
| Success rate (SR) | Fraction of tasks completed correctly | General benchmark scoring |
| pass@k | Probability that at least one of k independent attempts succeeds | Measuring best-case capability |
| pass^k | Probability that all k independent attempts succeed | Measuring consistency and reliability |
| Partial credit | Graded score reflecting progress toward completion | Multi-step tasks where full success is rare |
| Progress rate | Fraction of subtasks or milestones completed | Long-horizon workflow evaluation |
The distinction between pass@k and pass^k is particularly important for agent evaluation. As Anthropic has noted, pass@k approaches 100% as k increases (since the agent only needs to succeed once), while pass^k falls toward 0% (since every attempt must succeed). For production systems where reliability matters, pass^k is often the more relevant metric. Sierra's tau-bench benchmark specifically uses pass^k to highlight the inconsistency of current agents: state-of-the-art models that achieve roughly 50% success on individual tasks can drop below 25% on pass^8 in retail customer service scenarios.
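Both metrics can be computed from the same trial data. The sketch below uses the standard unbiased pass@k estimator, 1 − C(n−c, k)/C(n, k), together with the simple plug-in estimate (c/n)^k for pass^k; the function names are illustrative, not from any particular benchmark harness.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of P(at least one of k attempts succeeds),
    given c successes observed in n independent trials."""
    if n - c < k:
        return 1.0  # every size-k sample must contain a success
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_hat_k(n: int, c: int, k: int) -> float:
    """Plug-in estimate of P(all k attempts succeed): (c/n)^k."""
    return (c / n) ** k

# An agent that succeeds on 5 of 10 trials of a task:
print(pass_at_k(10, 5, 8))   # 1.0 -- best-of-8 hides the failures
print(pass_hat_k(10, 5, 8))  # 0.00390625 -- all-8 consistency collapses
```

As k grows, the two curves diverge exactly as described above: pass@k saturates toward 100% while pass^k decays geometrically toward zero.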
As agents move from research prototypes to production systems, efficiency metrics such as latency and cost per task have become essential. Safety evaluation has likewise become a distinct research area as agents gain the ability to take real-world actions. For agents that interact with external tools and APIs, evaluation focuses on whether the agent selects the correct tool, structures its arguments properly, and sequences calls appropriately.
SWE-bench is one of the most widely cited agent benchmarks. Introduced by researchers at Princeton University in 2023, it evaluates AI coding agents on their ability to resolve real GitHub issues from popular open-source Python repositories. The original dataset contains 2,294 issue-patch pairs, each requiring the agent to understand the issue description, locate relevant code, and generate a patch that passes the repository's test suite.
SWE-bench has spawned several variants:
| Variant | Tasks | Description |
|---|---|---|
| SWE-bench (full) | 2,294 | Original dataset of Python GitHub issues |
| SWE-bench Verified | 500 | Hand-filtered subset validated for test harness correctness |
| SWE-bench Lite | 300 | Smaller subset for faster evaluation |
| SWE-bench++ | 11,100+ | Multi-language extension covering 11 programming languages |
| SWE-bench Live | Ongoing | Continuously updated with new issues to prevent data contamination |
| SWE-bench Pro | Curated | More challenging tasks requiring extended reasoning |
| SWE-bench Java Verified | 91 | First non-Python variant with Dockerized build/test harnesses |
As of early 2026, leading scores on SWE-bench Verified include Claude Opus 4.5 at 80.9%, Claude Opus 4.6 at 80.8%, Gemini 3.1 Pro at 80.6%, and GPT-5.2 at 80.0%. Notably, OpenAI has stopped reporting Verified scores after finding evidence of training data contamination across frontier models, recommending SWE-bench Pro instead. On SWE-bench Pro, scores are significantly lower, with GPT-5.3-Codex leading at 56.8%.
LiveCodeBench provides a continuously refreshed set of coding problems sourced from competitive programming platforms, addressing contamination concerns. InterCode tests agents on interactive coding tasks that require iterative debugging and execution within a sandboxed environment.
WebArena, introduced in 2023, provides a self-hosted environment for evaluating autonomous web browsing agents. It includes replicas of five real websites spanning e-commerce, social forums, collaborative code development, and content management. The benchmark comprises 812 templated tasks instantiated from 241 templates, with an average of 3.3 variations per template.
WebArena measures functional correctness, meaning whether agents achieve the intended final goal regardless of the specific path taken. Human performance on WebArena reaches approximately 78%. AI agent performance has improved substantially since the benchmark's release: early GPT-4 based agents achieved roughly 14% success, while by early 2025, IBM's CUGA framework reached 61.7%.
Extensions to WebArena include WebChoreArena, which adds 532 tasks focused on tedious, long-horizon workflows requiring extensive memory and calculation, and WebArena Verified, which provides a systematically audited version of the original 812 tasks.
MiniWoB++ (Mini World of Bits) is a collection of over 100 web interaction environments with simplified, synthetic web pages. Maintained by the Farama Foundation, it follows the Gymnasium API and uses Selenium WebDriver for browser interaction. Tasks include clicking buttons, filling forms, navigating dropdowns, and other basic web manipulation skills. While MiniWoB++ lacks the realism of later benchmarks, it remains valuable as a training ground and lightweight evaluation environment for early-stage agent development.
Mind2Web, introduced at NeurIPS 2023 by the NLP group at Ohio State University, contains 2,350 tasks spanning 137 real websites across 31 domains. Unlike benchmarks that use simulated websites, Mind2Web evaluates agents on actual web pages collected from top-ranked sites. The benchmark tests three levels of generalization: cross-task (different tasks on the same website), cross-website (similar tasks on different websites in the same domain), and cross-domain (tasks on websites in entirely different domains). GPT-4 based agents achieved roughly 23% strict success on Mind2Web, with partial credit scores reaching 48%.
OSWorld, presented at NeurIPS 2024, is the first benchmark to evaluate multimodal agents on open-ended tasks within real computer environments. It includes 369 tasks involving real desktop applications across Ubuntu, Windows, and macOS, spanning tools like Chromium, GIMP, LibreOffice, Thunderbird, VLC, and Visual Studio Code. Tasks cover web browsing, desktop application use, OS file operations, and multi-application workflows.
Human evaluators complete approximately 72.4% of OSWorld tasks. When the benchmark launched, the best AI agent achieved only 12.2% success. By 2025, performance improved dramatically: Simular's Agent S framework reached 72.6%, effectively matching the human baseline, and one agent reportedly achieved 76.3% by using an experience-augmented hierarchical planning framework.
OSWorld-Verified is an enhanced version with improved infrastructure (migrated from VMware/Docker to AWS with 50x parallelization) and refined task quality for more reliable evaluation signals.
Released in 2025, OSUniverse introduces graduated difficulty levels, automated validation with low error rates, and graph-based evaluation that awards partial credit for multi-step workflows. It supports multiple operating systems and uses Docker containers for simplified setup, making it more modular and accessible than OSWorld.
GAIA (General AI Assistants) was introduced in late 2023 as a collaboration between academic researchers and Meta AI. It presents 466 human-annotated tasks requiring multi-step reasoning, tool use, web browsing, and multimodal interpretation. Tasks are structured across three difficulty levels:
| Level | Description | Typical requirements |
|---|---|---|
| Level 1 | Simple tasks | Single tool, basic reasoning |
| Level 2 | Intermediate tasks | Multiple tools, multi-step planning |
| Level 3 | Complex tasks | Extensive planning, numerous tools, advanced reasoning |
GAIA tasks have unambiguous, verifiable answers, making automated evaluation straightforward. The benchmark's official leaderboard is hosted on Hugging Face. By 2025, top agents achieved scores ranging from roughly 44% to 75% depending on the evaluation framework, with Level 3 remaining particularly challenging (the top score reaching approximately 61%).
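Because GAIA answers are short and unambiguous, automated scoring reduces to normalized string comparison. The sketch below illustrates the general idea of a quasi-exact-match scorer (lowercasing, stripping punctuation and articles, numeric tolerance); the official scorer's exact normalization rules may differ.

```python
import re
import string

def normalize(ans: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    ans = ans.lower()
    ans = ans.translate(str.maketrans("", "", string.punctuation))
    ans = re.sub(r"\b(a|an|the)\b", " ", ans)
    return " ".join(ans.split())

def quasi_exact_match(prediction: str, target: str, tol: float = 1e-4) -> bool:
    """True if answers match after normalization, with tolerance for numbers."""
    try:  # numeric answers: compare as floats
        return abs(float(prediction) - float(target)) <= tol
    except ValueError:
        return normalize(prediction) == normalize(target)

print(quasi_exact_match("The Eiffel Tower", "eiffel tower."))  # True
print(quasi_exact_match("17.0", "17"))                         # True
```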
AgentBench, published at ICLR 2024, evaluates LLMs as agents across eight distinct environments spanning three categories:
| Category | Environments |
|---|---|
| Code-grounded | Operating system (OS), database (DB), knowledge graph (KG) |
| Game-grounded | Digital card game, lateral thinking puzzles |
| Web-grounded | House-holding, web shopping, web browsing (Mind2Web) |
The benchmark tested 29 LLMs and revealed a significant performance gap between commercial models (like GPT-4) and open-source alternatives. Key findings indicated that poor long-term reasoning, weak decision-making, and limited instruction-following ability were the primary obstacles to building effective LLM agents.
The Berkeley Function Calling Leaderboard (BFCL), developed by UC Berkeley's Gorilla project, has become the standard benchmark for evaluating LLM function-calling capabilities. Now in version 4 (as of 2025), it evaluates models on serial and parallel function calls across Python, Java, JavaScript, and REST APIs using a novel Abstract Syntax Tree (AST) evaluation method.
BFCL v4 added categories for web search, memory management, and multi-turn interactions. The benchmark assesses models on their ability to select correct functions, structure arguments properly, handle multiple parallel calls, and abstain when no appropriate function is available. Leading scores as of 2025 place Anthropic's Claude models and OpenAI's GPT models near the top, with overall accuracy scores ranging from roughly 59% to 70%.
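The core idea behind AST-based evaluation can be sketched in a few lines: parse both the generated and the reference call, then compare structure rather than strings. This toy version handles a single call with keyword arguments only; the real harness covers many more cases.

```python
import ast

def parse_call(src: str):
    """Parse one function-call expression into (name, keyword arguments)."""
    call = ast.parse(src, mode="eval").body
    assert isinstance(call, ast.Call)
    name = ast.unparse(call.func)
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in call.keywords}
    return name, kwargs

def calls_match(generated: str, expected: str) -> bool:
    """Compare function name and keyword arguments structurally,
    ignoring argument order and surface formatting."""
    return parse_call(generated) == parse_call(expected)

print(calls_match(
    "get_weather(city='Paris', unit='celsius')",
    "get_weather(unit = 'celsius', city = 'Paris')",
))  # True: same call despite different ordering and whitespace
```

Structural comparison is what lets the benchmark accept semantically identical calls that a string match would reject.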
ToolLLM, presented at ICLR 2024, provides a comprehensive framework for training and evaluating LLMs on tool use. Its associated dataset, ToolBench, contains 16,464 RESTful APIs spanning 49 categories from RapidAPI Hub, along with 126,000+ instruction-solution path pairs. The benchmark evaluates agents on both single-tool and multi-tool scenarios, using a depth-first search based decision tree (DFSDT) approach to generate solution paths.
The automated evaluator, ToolEval, measures both pass rate (whether the tool chain produces correct output) and solution path quality (whether the agent's reasoning process is sound). StableToolBench, a subsequent variant, addressed reproducibility concerns in the original benchmark.
tau-bench, developed by Sierra AI, is a simulation framework for evaluating customer service agents. It emulates multi-turn conversations between a simulated user (powered by an LLM) and an agent equipped with domain-specific API tools and policy guidelines. The benchmark covers realistic domains including airline customer service, retail support, and telecom interactions.
What distinguishes tau-bench from other benchmarks is its emphasis on consistency. Rather than measuring whether an agent can complete a task once, it uses pass^k to assess whether the agent succeeds across multiple independent trials. State-of-the-art models like GPT-4o achieve less than 50% success on individual tasks, and their consistency drops below 25% on pass^8 in retail scenarios. The benchmark has expanded through tau2-bench (adding dual-control environments) and tau3-bench (incorporating knowledge retrieval and voice interactions).
CUB, introduced by Theta Software in mid-2025, is a benchmark specifically designed for computer and browser use agents. It contains 106 end-to-end workflows across seven industries: consumer, construction, finance, healthcare, marketing, sales, and supply chain. Tasks were created in collaboration with domain experts (accountants, investment bankers, doctors) and involve synthetic versions of enterprise platforms like SAP and CapIQ.
CUB is particularly challenging because it requires graphical user interface interactions (clicking buttons, selecting menu items) in addition to typing or API calls. When first released, no tested agent framework exceeded 10% success, even with a granular scoring system that awarded partial credit.
AgentHarm, published at ICLR 2025, contains 110 explicitly malicious agent tasks (440 with augmentations) covering 11 harm categories including fraud, cybercrime, and harassment. The benchmark measures both whether models refuse harmful requests and whether jailbroken agents maintain their capabilities when attempting to complete multi-step harmful tasks. It provides a synthetic environment for measuring the robustness of LLM agents against adversarial attacks.
R-Judge, presented at ICLR 2024, evaluates the safety risk awareness of LLM agents. It contains 569 records of multi-turn agent interactions covering 27 risk scenarios across 5 application categories and 10 risk types. Rather than testing whether agents cause harm directly, R-Judge assesses whether models can identify and flag safety risks in agent interaction records.
ToolEmu takes a different approach to safety evaluation by using an LLM to emulate tool execution and grade accidental safety violations. It covers 36 tools and 144 test cases in high-stakes scenarios where the user's intent is benign but the agent's actions could inadvertently cause harm. This sandbox-based approach allows safety evaluation without requiring actual tool infrastructure.
CRAB, released in late 2024 by Camel-AI, evaluates agents across 120 tasks spanning Ubuntu and Android environments. It introduces graph-based fine-grained scoring with partial credit capability. The best model (GPT-4o) achieved only 14.17% completion ratio, highlighting the difficulty of cross-environment task completion.
Benchmarks in this category provide test suites for desktop computer interfaces with custom execution-based evaluation scripts, extending agent evaluation beyond Linux and web environments.
| Benchmark | Year | Domain | Tasks | Environment | Key metric | Human baseline | Best AI (approx.) |
|---|---|---|---|---|---|---|---|
| MiniWoB++ | 2018 | Web (synthetic) | 100+ | Synthetic web pages | Task success rate | Near 100% | 95%+ |
| WebArena | 2023 | Web (realistic) | 812 | Self-hosted websites | Functional correctness | 78% | 61.7% |
| Mind2Web | 2023 | Web (real) | 2,350 | Real websites | Strict success / partial credit | N/A | 23% strict |
| SWE-bench Verified | 2023 | Software engineering | 500 | Real GitHub repos | pass@1 | N/A | 80.9% |
| GAIA | 2023 | General assistant | 466 | Multi-modal, multi-tool | Accuracy | 92% | ~75% |
| AgentBench | 2023 | Multi-domain (8 envs) | Varies | Simulated environments | Overall score | N/A | Varies by env |
| ToolBench | 2023 | API/tool use | 16,464 APIs | Real APIs via RapidAPI | Pass rate | N/A | Varies |
| OSWorld | 2024 | Desktop OS | 369 | Real VMs (Ubuntu/Win/Mac) | Task success rate | 72.4% | 76.3% |
| BFCL v4 | 2024 | Function calling | 2,000+ | API simulation | Overall accuracy | N/A | ~70% |
| tau-bench | 2024 | Customer service | Varies | Simulated conversations | pass^k | N/A | <50% (SR) |
| CUB | 2025 | Enterprise workflows | 106 | Synthetic enterprise platforms | Task success rate | N/A | <10% |
The most common approach evaluates agents based on final outcomes. In SWE-bench, this means checking whether the generated patch passes the test suite. In WebArena, it means verifying whether the web page reached the desired state. Outcome-based evaluation is attractive because it is objective and mirrors what end users care about, but it can miss important failure modes. An agent might produce the correct result through unsafe or inefficient means, or it might fail on a task for reasons unrelated to its core capabilities (such as a flaky test or ambiguous task specification).
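In its simplest form, a code-based outcome grader is just a comparison between the environment's final state and an expected goal state, ignoring how the agent got there. A minimal sketch (the state fields are hypothetical):

```python
def grade_outcome(final_state: dict, expected: dict) -> bool:
    """Pass iff every expected key is present with the expected value.
    State the task does not mention is ignored, so any action path
    that reaches the goal state receives full credit."""
    return all(final_state.get(k) == v for k, v in expected.items())

# e.g. a WebArena-style shopping task: only the fields the task
# specifies are checked, not the clicks that produced them.
expected = {"cart_items": ["SKU-123"], "shipping": "express"}
state = {"cart_items": ["SKU-123"], "shipping": "express", "page": "/checkout"}
print(grade_outcome(state, expected))  # True
```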
Process-based (or trajectory-based) evaluation examines the steps an agent takes rather than just its final output. This includes analyzing tool call sequences, reasoning traces, and intermediate decisions. Metrics like Node F1 (for tool selection accuracy) and Edge F1 (for sequence accuracy) measure how well an agent's decision process aligns with reference trajectories.
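These trajectory metrics can be computed by treating tool calls as nodes and consecutive call pairs as edges. The sketch below uses multiset F1, one plausible formulation; published implementations may differ in detail.

```python
from collections import Counter

def f1(pred, ref) -> float:
    """Multiset F1 between predicted and reference items."""
    pred_c, ref_c = Counter(pred), Counter(ref)
    overlap = sum((pred_c & ref_c).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_c.values())
    recall = overlap / sum(ref_c.values())
    return 2 * precision * recall / (precision + recall)

def node_edge_f1(pred_trajectory, ref_trajectory):
    """Node F1 over tool calls; Edge F1 over consecutive call pairs."""
    edges = lambda t: list(zip(t, t[1:]))
    return (f1(pred_trajectory, ref_trajectory),
            f1(edges(pred_trajectory), edges(ref_trajectory)))

pred = ["search", "open_page", "extract", "answer"]
ref = ["search", "extract", "open_page", "answer"]
print(node_edge_f1(pred, ref))  # (1.0, 0.0)
```

The example shows why both metrics matter: the agent used exactly the right tools (Node F1 = 1.0) but in a different order than the reference, so no consecutive pair matches (Edge F1 = 0.0).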
Process evaluation is valuable for diagnosing failure modes and understanding agent behavior, but it risks penalizing valid alternative approaches. As Anthropic has emphasized, "grading what the agent produced, not the path it took" prevents unnecessarily punishing creative solutions.
Using a separate large language model to evaluate agent outputs has become widespread, particularly for tasks where success is subjective or difficult to verify programmatically. The judge model receives the agent's transcript (including actions, tool calls, and outputs) along with a scoring rubric, and assigns scores based on quality criteria.
A more advanced variant, Agent-as-a-Judge, uses a multi-agent setup where evaluator agents can themselves use tools and take actions to verify the primary agent's work. Mind2Web 2 introduced this approach for evaluating agentic search, where the evaluator agent actively checks whether retrieved information is correct and complete.
LLM-as-a-Judge approaches offer flexibility and can handle nuanced evaluation criteria, but they require careful calibration against human judgments and can introduce their own biases.
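A recurring implementation detail is converting the judge's free-form response into a numeric score and averaging across several judge samples to reduce noise. The "Score: X/Y" response format below is an illustrative convention, not a standard, and the parsing sketch assumes the judge was prompted to emit it.

```python
import re
from statistics import mean

def extract_score(judge_response: str):
    """Pull a 'Score: X/Y' rating out of free-form judge text.
    Returns a value in [0, 1], or None if no score is found."""
    m = re.search(r"score\s*[:=]?\s*(\d+(?:\.\d+)?)\s*/\s*(\d+)",
                  judge_response, re.I)
    if m:
        return float(m.group(1)) / float(m.group(2))
    return None

def aggregate(responses):
    """Average the parseable scores from several judge samples."""
    scores = [s for s in map(extract_score, responses) if s is not None]
    return mean(scores) if scores else None

samples = [
    "The agent completed the refund correctly. Score: 4/5",
    "Policy was followed but the tone was curt.\nScore: 3/5",
    "I cannot evaluate this transcript.",
]
print(aggregate(samples))  # 0.7
```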
Human evaluation remains the gold standard for open-ended tasks and subjective quality assessments. Human evaluators review agent transcripts and rate performance on criteria like helpfulness, accuracy, safety, and efficiency. While expensive and slow, human evaluation serves as the ground truth for calibrating automated evaluation methods.
BrowserArena uses human judges for head-to-head agent comparisons on user-submitted tasks, providing a reference-free evaluation approach that does not require predefined ground-truth answers.
Given the non-deterministic nature of LLM-based agents, running a single trial per task provides an unreliable estimate of performance. Multi-trial evaluation runs each task multiple times and reports aggregate statistics. The pass@k and pass^k metrics capture different aspects of multi-trial performance, and Anthropic recommends running at least 3 to 5 trials per task to get stable estimates.
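Reporting a mean with a measure of spread, rather than a single-run score, follows directly from this. A minimal sketch of per-task trial aggregation:

```python
from math import sqrt
from statistics import mean, stdev

def trial_statistics(results_per_task):
    """Per-task success rates from repeated trials, plus the benchmark
    mean and its standard error across tasks."""
    rates = [sum(trials) / len(trials) for trials in results_per_task]
    se = stdev(rates) / sqrt(len(rates)) if len(rates) > 1 else 0.0
    return mean(rates), se

# Three tasks, five trials each (True = success):
results = [
    [True, True, False, True, True],     # 0.8
    [False, False, True, False, False],  # 0.2
    [True, True, True, True, False],     # 0.8
]
m, se = trial_statistics(results)
print(f"success rate = {m:.2f} ± {se:.2f}")
```

A wide standard error signals that more trials per task are needed before two agents' scores can be meaningfully compared.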
BrowserGym is a universal simulation environment developed by ServiceNow that unifies web-based benchmarks including MiniWoB++, WebArena, and WorkArena under a single API. It provides standardized observation and action spaces, making it easier to compare agents across different web benchmarks.
Inspect is an open-source evaluation framework from the UK AI Safety Institute that supports a wide range of agent benchmarks including GAIA, BFCL, and AgentHarm. It provides composable evaluation pipelines with support for multiple solvers and scorers.
AgentBench Toolkit provides an integrated evaluation package supporting all eight AgentBench environments, with standardized APIs for running evaluations and collecting results.
Several commercial platforms offer agent evaluation and observability capabilities:
| Platform | Key features |
|---|---|
| LangSmith | Native LangChain integration, automatic tracing, minimal performance overhead |
| Braintrust | Unified evaluation/observability/optimization, AI-generated custom scorers |
| AgentOps | Session tracking, LLM call tracing, tool use monitoring, cost tracking |
| Langfuse | Open-source tracing and analytics, prompt management |
| Galileo | Agent metric dashboards, automated evaluation pipelines |
These platforms complement academic benchmarks by providing production-oriented evaluation capabilities including real-time monitoring, A/B testing, and trace-level debugging.
Agent evaluation faces significant reproducibility challenges. LLM-based agents exhibit variability in execution paths, tool selection, and reasoning patterns due to non-deterministic sampling. This means that the same agent can produce different results on the same task across different runs. Long-horizon tasks amplify this problem because errors compound over multiple steps. Without standardized protocols for controlling randomness and reporting variance, benchmark results can be misleading.
Environment reproducibility is also a concern. Web-based benchmarks depend on external services that may change over time, and desktop benchmarks require specific virtual machine configurations. OSWorld-Verified and StableToolBench have addressed some of these issues by improving infrastructure reliability and standardizing evaluation environments.
As LLMs are trained on increasingly large corpora of internet text, the risk of benchmark data appearing in training sets grows. This data contamination can inflate benchmark scores without reflecting genuine capability improvements. OpenAI publicly acknowledged this issue by stopping SWE-bench Verified reporting after finding contamination across frontier models, recommending SWE-bench Pro (which uses more challenging, less common tasks) instead.
Several strategies have been developed to combat contamination. SWE-bench Live provides a continuously updated stream of new issues. LiveCodeBench refreshes its problem set regularly. Some benchmarks create fully synthetic tasks designed to fall outside internet-scale training corpora.
Running comprehensive agent evaluations is expensive. Each task may require multiple API calls, tool executions, and environment setups. Multi-trial evaluation (necessary for reliable results) multiplies these costs further. OSWorld evaluation, for example, requires provisioning and managing virtual machines for each task. SWE-bench requires building and running test suites for real software projects.
Failed attempts still incur costs, making reliability economically critical. An agent that requires many retries to succeed may be technically capable but financially impractical. Developing cost-bounded evaluation protocols that balance thoroughness with efficiency remains an active research challenge.
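The economics of retries follow the geometric distribution: if each attempt costs c and succeeds with probability p, a retry-until-success policy costs c/p per completed task in expectation. A short illustration with made-up numbers:

```python
def expected_cost_per_success(cost_per_attempt: float,
                              success_prob: float) -> float:
    """Retry-until-success: attempts follow a geometric distribution
    with mean 1/p, so expected spend per completed task is cost / p.
    Failed attempts are paid for but produce nothing."""
    return cost_per_attempt / success_prob

# A $0.50-per-attempt agent at 80% vs 25% per-attempt reliability:
print(expected_cost_per_success(0.50, 0.80))  # 0.625
print(expected_cost_per_success(0.50, 0.25))  # 2.0
```

Halving the per-attempt failure rate can thus matter more for total cost than halving the per-attempt price, which is why reliability metrics like pass^k carry economic weight.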
Defining clear, unambiguous success criteria for agent tasks is difficult. Anthropic found that their most capable models (such as Opus 4.5) frequently revealed evaluation bugs rather than model limitations. Rigid grading, ambiguous task specifications, and stochastic elements in tasks often penalized correct behavior. This insight suggests that many reported failures are artifacts of imperfect evaluation rather than genuine capability gaps.
Most benchmarks test agents on a fixed set of tasks within specific domains. How well performance on these tasks predicts real-world capability remains an open question. Mind2Web explicitly tests three levels of generalization (cross-task, cross-website, cross-domain), but most benchmarks do not systematically evaluate generalization. An agent that achieves high scores on SWE-bench Python tasks may not transfer that performance to other programming languages, as the introduction of SWE-bench++ and SWE-bench Java has begun to reveal.
Current safety benchmarks like AgentHarm and R-Judge cover important failure modes, but the space of possible agent harms is vast and difficult to enumerate. Agents operating in real environments can cause harm through subtle chains of actions that are difficult to predict or test for. The gap between synthetic safety benchmarks and real-world deployment risks remains a significant concern for the field.
Several public leaderboards track agent performance across major benchmarks, including the official SWE-bench leaderboard and the GAIA leaderboard hosted on Hugging Face.
These leaderboards play an important role in driving progress but also create incentive structures that can distort research priorities. Researchers may optimize for specific benchmark scores rather than general capability, and leaderboard positions can be gamed through task-specific fine-tuning or prompt engineering.
Anthropic published a detailed guide on building agent evaluations in 2025, synthesizing lessons learned from developing Claude's agent capabilities. Its recommendations include grading the outcome an agent produced rather than the path it took, running multiple trials per task to account for non-determinism, and combining code-based graders, model-based graders, and human review.
The field of agent evaluation is evolving rapidly along several axes:
Holistic evaluation frameworks that assess multiple dimensions simultaneously (performance, safety, cost, reliability) rather than treating each dimension in isolation. The 2025 survey on LLM agent evaluation identified this as the top research priority.
Enterprise-mimicking environments that replicate real business workflows, including role-based access controls, multi-user scenarios, and integration with enterprise software. CUB and FieldWorkArena (from Fujitsu, focused on manufacturing and warehouse operations) represent early steps in this direction.
Scalable automated evaluation techniques that reduce reliance on expensive human judges while maintaining evaluation quality. Agent-as-a-Judge and improved LLM-based grading methods are active areas of development.
Efficient evaluation protocols that support iterative development cycles without prohibitive costs. This includes techniques for selecting representative task subsets, early stopping based on confidence intervals, and amortizing environment setup costs across multiple evaluations.
Real-time and continuous evaluation that goes beyond static benchmark snapshots to continuously monitor agent performance in production. This connects agent evaluation to the broader field of ML monitoring and observability.
Cross-modal and cross-environment evaluation that tests agents across different input modalities (text, vision, audio) and operating environments (web, desktop, mobile, voice) within unified frameworks. tau3-bench's addition of voice evaluation and CRAB's cross-platform testing represent early examples of this trend.