Agent evaluation
Last reviewed
May 24, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v2 · 9,633 words
Agent evaluation refers to the systematic assessment of AI agents through benchmarks, metrics, and testing methodologies designed to measure their performance on real-world tasks. As autonomous AI systems have evolved from simple chatbots into multi-step, tool-using agents capable of browsing the web, writing code, and operating computer interfaces, the need for rigorous evaluation frameworks has grown rapidly. Agent evaluation encompasses a broad range of approaches, from standardized academic benchmarks like SWE-bench and WebArena to enterprise-grade observability platforms that track cost, latency, and reliability in production.[1][2]
Unlike traditional language model evaluation, which typically measures performance on static question-answer pairs, agent evaluation must account for multi-turn interactions, tool use, environment manipulation, non-deterministic execution paths, and the compounding effects of errors over long task horizons.[3] The field draws on research in reinforcement learning, human-computer interaction, software engineering, and safety analysis. By May 2026, the discipline has matured into a distinct subfield with dedicated frameworks from frontier labs (Anthropic, OpenAI, Google DeepMind), national institutes (UK AISI, US AISI/CAISI), and independent organizations such as METR and Sierra Research.[4][5]
The evaluation of AI agents has roots in earlier work on reinforcement learning environments such as Atari games and MuJoCo simulations, where agents were measured by cumulative reward. However, the rise of large language models (LLMs) as the backbone of agentic systems introduced new evaluation challenges. Early LLM benchmarks such as MMLU and HellaSwag focused on static knowledge and reasoning, but they could not capture whether a model could effectively use tools, navigate websites, or resolve real software engineering issues.
The first wave of benchmarks used to evaluate LLMs as agents took shape in 2022 and 2023, building in part on older environments. MiniWoB++, which dates to 2018, provided a collection of over 100 simplified web tasks for testing basic web manipulation skills. WebShop simulated an e-commerce environment with 1.18 million products and 12,087 crowd-sourced shopping instructions.[6] These early benchmarks demonstrated that LLMs could be evaluated as interactive agents rather than passive text generators, but their synthetic nature limited how well results generalized to real-world settings.
By late 2023 and into 2024, a second wave of more realistic benchmarks appeared. SWE-bench tested agents on real GitHub issues from popular Python repositories.[7] WebArena created self-hosted replicas of real websites for autonomous web navigation.[2] GAIA combined multi-modal reasoning with tool use across multiple difficulty levels.[8] AgentBench evaluated LLMs across eight distinct environments spanning operating systems, databases, knowledge graphs, and web browsing.[1] These benchmarks reflected a growing consensus that agent evaluation must test performance in realistic, multi-step scenarios rather than isolated capabilities.
The third wave, beginning in 2025, has focused on enterprise readiness, safety, and consistency. Benchmarks like CUB (Computer Use Benchmark), τ-bench, and OSWorld-Verified have introduced domain-specific workflows, repeated-trial consistency metrics, and verified task sets.[9][10] The field has also seen the emergence of comprehensive evaluation frameworks from companies like Anthropic, which published detailed guidance on building agent evaluation pipelines that combine code-based graders, model-based graders, and human review.[4]
A comprehensive survey published in 2025 proposed a two-dimensional taxonomy for agent evaluation, organizing prior work by evaluation objectives (what to evaluate) and evaluation process (how to evaluate).[11]
Agent evaluation targets four primary objectives:
| Dimension | Description | Example metrics |
|---|---|---|
| Agent behavior | Overall performance as perceived by a user, treating the agent as a black box | Task completion rate, output quality, latency, cost per task |
| Agent capabilities | Specific skills the agent demonstrates | Tool use accuracy, planning quality, memory retention, multi-agent collaboration |
| Reliability | Consistency and robustness across repeated executions and varied conditions | pass^k (all k trials succeed), robustness under input perturbations |
| Safety and alignment | Adherence to policies, avoidance of harm, fairness | Harm rate, policy violation rate, adversarial robustness, bias detection |
The evaluation process involves several components:
The most fundamental metric in agent evaluation is the success rate (SR), also called the task completion rate. It measures the proportion of tasks that an agent completes correctly out of the total number attempted. Success is typically determined by checking whether the agent's actions produce the desired end state, such as a passing test suite in SWE-bench or the correct final webpage configuration in WebArena.[7][2]
Variants of success rate include:
| Metric | Definition | Use case |
|---|---|---|
| Success rate (SR) | Fraction of tasks completed correctly | General benchmark scoring |
| pass@k | Probability that at least one of k independent attempts succeeds | Measuring best-case capability |
| pass^k | Probability that all k independent attempts succeed | Measuring consistency and reliability |
| Partial credit | Graded score reflecting progress toward completion | Multi-step tasks where full success is rare |
| Progress rate | Fraction of subtasks or milestones completed | Long-horizon workflow evaluation |
The distinction between pass@k and pass^k is particularly important for agent evaluation. As Anthropic has noted, pass@k approaches 100% as k increases (since the agent only needs to succeed once), while pass^k falls toward 0% (since every attempt must succeed). For production systems where reliability matters, pass^k is often the more relevant metric.[4] Sierra's τ-bench specifically uses pass^k to highlight the inconsistency of current agents: state-of-the-art models that achieve roughly 50% success on individual tasks can drop below 25% on pass^8 in retail customer service scenarios.[13]
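The two metrics can be estimated from the same set of repeated trials. The sketch below is illustrative rather than any benchmark's official scoring code: it assumes each task was run `n` times with `c` successes, and uses the combinatorial estimators commonly applied in this setting (the chance that a random size-`k` subset of the trials contains at least one success, or only successes). The function names `pass_at_k` and `pass_hat_k` are hypothetical.

```python
from math import comb


def _check(n: int, c: int, k: int) -> None:
    """Validate trial counts shared by both estimators."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k attempts succeeds) from n trials, c successes."""
    _check(n, c, k)
    # One minus the chance that a random size-k subset holds no success.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate P(all k attempts succeed), i.e. pass^k, from n trials, c successes."""
    _check(n, c, k)
    # Chance that a random size-k subset consists only of successes.
    return comb(c, k) / comb(n, k)


# A task that succeeds in 4 of 8 trials: both metrics equal 0.5 at k = 1,
# then diverge as k grows.
for k in (1, 2, 4, 8):
    print(k, round(pass_at_k(8, 4, k), 4), round(pass_hat_k(8, 4, k), 4))
```

With 4 successes in 8 trials, pass@4 is about 0.986 while pass^4 is about 0.014, which is the divergence described above. A benchmark-level score would average these per-task values over all tasks.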
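As agents move from research prototypes to production systems, efficiency metrics such as cost per task and latency have become essential.
Safety evaluation has become a distinct research area as agents gain the ability to take real-world actions. Typical measures include harm rate, policy violation rate, adversarial robustness, and bias detection.
For agents that interact with external tools and APIs, evaluation focuses on whether the agent selects the correct function, structures its arguments properly, and abstains when no appropriate function is available.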
SWE-bench is one of the most widely cited agent benchmarks. Introduced by researchers at Princeton University in 2023, it evaluates AI coding agents on their ability to resolve real GitHub issues from popular open-source Python repositories. The original dataset contains 2,294 issue-patch pairs, each requiring the agent to understand the issue description, locate relevant code, and generate a patch that passes the repository's test suite.[7]
SWE-bench has spawned several variants:
| Variant | Tasks | Description |
|---|---|---|
| SWE-bench (full) | 2,294 | Original dataset of Python GitHub issues |
| SWE-bench Verified | 500 | Hand-filtered subset validated for test harness correctness |
| SWE-bench Lite | 300 | Smaller subset for faster evaluation |
| SWE-bench++ | 11,100+ | Multi-language extension covering 11 programming languages |
| SWE-bench Live | Ongoing | Continuously updated with new issues to prevent data contamination |
| SWE-bench Pro | 1,865 | More challenging tasks requiring extended reasoning |
| SWE-bench Multimodal | 617 | JavaScript tasks that include visual inputs (screenshots, mockups) |
| SWE-bench Java Verified | 91 | First non-Python variant with Dockerized build/test harnesses |
By the first half of 2026, leading scores on SWE-bench Verified have effectively saturated. Claude Mythos Preview leads the public leaderboard at 93.9%, followed by Claude Opus 4.7 at 87.6% and GPT-5.3 Codex at 85%.[18] OpenAI stopped reporting Verified scores in late 2025 after an internal audit confirmed that every major frontier model, including GPT-5.2, Claude Opus 4.5, and Gemini 3 Flash, could reproduce verbatim gold patches for some Verified tasks, indicating data contamination.[19] On the harder SWE-bench Pro, scores are lower: Claude Mythos Preview leads at 77.8%, and earlier versions of the Pro leaderboard showed the same model at 45.9%, roughly half its 93.9% on Verified.[18][19]
SWE-bench Live, released as a NeurIPS 2025 dataset paper by Microsoft Research, addresses contamination by adding 50 freshly verified, high-quality issues every month from public GitHub projects with creation dates after January 1, 2024. Each new issue is restricted to repositories with build and test harnesses that have been validated since the previous model training cutoff.[20]
SWE-bench Multimodal, presented at ICLR 2025, extends the original benchmark to 617 JavaScript tasks drawn from 17 libraries used for diagramming, data visualization, syntax highlighting, and interactive mapping. Each instance includes at least one image (screenshot, UI mockup, or diagram), making it a test of cross-modal reasoning. When released, SWE-agent resolved only 12% of tasks, with the next best system at 6%.[21]
LiveCodeBench provides a continuously refreshed set of coding problems sourced from competitive programming platforms, addressing contamination concerns. InterCode tests agents on interactive coding tasks that require iterative debugging and execution within a sandboxed environment.
MLE-bench, released by OpenAI in October 2024, evaluates agents on machine learning engineering work using 75 curated Kaggle competitions. Tasks span dataset preparation, model training, hyperparameter tuning, and experiment management. Human baselines are derived from the public Kaggle leaderboards. In the original paper, the best open-source agent scaffold (AIDE with OpenAI o1-preview) earned at least a bronze medal in 16.9% of competitions.[22]
WebArena, introduced in 2023, provides a self-hosted environment for evaluating autonomous web browsing agents. It includes replicas of five real websites spanning e-commerce, social forums, collaborative code development, and content management. The benchmark comprises 812 templated tasks instantiated from 241 templates, with an average of 3.3 variations per template.[2]
WebArena measures functional correctness, meaning whether agents achieve the intended final goal regardless of the specific path taken. Human performance on WebArena reaches approximately 78%. AI agent performance has improved substantially since the benchmark's release: early GPT-4 based agents achieved roughly 14% success, while by early 2025, IBM's CUGA (Configurable Generalist Agent) framework reached 61.7%, becoming the top published score on the open leaderboard.[23][24]
Extensions to WebArena include WebChoreArena, which adds 532 tasks focused on tedious, long-horizon workflows requiring extensive memory and calculation, and WebArena Verified, a 2025 audit project that revised all 812 tasks for offline, stack-agnostic evaluation with a 258-task "Hard" subset for fast focused runs.
MiniWoB++ (Mini World of Bits) is a collection of over 100 web interaction environments with simplified, synthetic web pages. Maintained by the Farama Foundation, it follows the Gymnasium API and uses Selenium WebDriver for browser interaction. Tasks include clicking buttons, filling forms, navigating dropdowns, and other basic web manipulation skills. While MiniWoB++ lacks the realism of later benchmarks, it remains valuable as a training ground and lightweight evaluation environment for early-stage agent development.
Mind2Web, introduced at NeurIPS 2023 by the NLP group at Ohio State University, contains 2,350 tasks spanning 137 real websites across 31 domains. Unlike benchmarks that use simulated websites, Mind2Web evaluates agents on actual web pages collected from top-ranked sites. The benchmark tests three levels of generalization: cross-task (different tasks on the same website), cross-website (similar tasks on different websites in the same domain), and cross-domain (tasks on websites in entirely different domains). GPT-4 based agents achieved roughly 23% strict success on Mind2Web, with partial credit scores reaching 48%.[25]
VisualWebArena, presented at ACL 2024, contains 910 tasks across three web apps (a classifieds site, a shopping site, and a forum) that explicitly require visual understanding of images and spatial reasoning, not just navigation. Example tasks include "Find the post with an image of a cat and upvote it." By 2025, the best vision-augmented agents reached roughly 60 to 70% on VisualWebArena, against human performance near 89%.[26]
BrowseComp, released by OpenAI in April 2025, is an open-source benchmark of 1,266 challenging problems that require persistently navigating many websites to retrieve "entangled" information. All questions have a single, short, indisputable answer that does not change over time, which makes grading straightforward. GPT-4o without browsing scored near zero, while OpenAI's Deep Research agent solved roughly half of the problems.[27] A 2025 follow-up, BrowseComp-Plus (spotlighted at NeurIPS 2025), replaces the live web with a fixed, human-verified document corpus, removing the variability of opaque search APIs and enabling reproducible, component-focused evaluation of retrieval pipelines. On BrowseComp-Plus, GPT-5 paired with the Qwen3-Embedding-8B retriever achieves 70.1% accuracy versus 3.86% for the open-source Search-R1 baseline.[28]
OSWorld, presented at NeurIPS 2024, is the first benchmark to evaluate multimodal agents on open-ended tasks within real computer environments. It includes 369 tasks involving real desktop applications across Ubuntu, Windows, and macOS, spanning tools like Chromium, GIMP, LibreOffice, Thunderbird, VLC, and Visual Studio Code. Tasks cover web browsing, desktop application use, OS file operations, and multi-application workflows.[10]
Human evaluators complete approximately 72.4% of OSWorld tasks. When the benchmark launched, the best AI agent achieved only 12.2% success. By 2025, performance improved dramatically: Simular's Agent S framework reached 72.6%, effectively matching the human baseline.
OSWorld-Verified, released by the XLANG Lab in July 2025, is an in-place upgrade with refined task quality and infrastructure improvements: the environment was migrated from VMware/Docker to AWS with 50x parallelization, ambiguous tasks were rewritten, and several flaky web dependencies were stabilized.[9] By May 2026, Claude Mythos Preview led the OSWorld-Verified leaderboard at 79.6%, followed by GPT-5.5 at 78.7% and Claude Opus 4.7 at 78.0%, all exceeding the human baseline.[29]
Released in 2025, OSUniverse introduces graduated difficulty levels, automated validation with low error rates, and graph-based evaluation that awards partial credit for multi-step workflows. It supports multiple operating systems and uses Docker containers for simplified setup, making it more modular and accessible than OSWorld. OSWorld-Human (2025) supplements OSWorld with measurements of human action counts and time, enabling efficiency comparisons in addition to raw success.
Windows Agent Arena, introduced in late 2024 by Microsoft Research, is a reproducible Azure-hosted environment focused exclusively on Windows OS tasks, with custom execution-based evaluation scripts for each task. It complements OSWorld by giving Windows-specific tasks first-class treatment.
GAIA (General AI Assistants) was introduced in late 2023 as a collaboration between academic researchers and Meta AI. It presents 466 human-annotated tasks requiring multi-step reasoning, tool use, web browsing, and multimodal interpretation. Tasks are structured across three difficulty levels:[8]
| Level | Description | Typical requirements |
|---|---|---|
| Level 1 | Simple tasks | Single tool, basic reasoning |
| Level 2 | Intermediate tasks | Multiple tools, multi-step planning |
| Level 3 | Complex tasks | Extensive planning, numerous tools, advanced reasoning |
GAIA tasks have unambiguous, verifiable answers, making automated evaluation straightforward. The benchmark's official leaderboard is hosted on Hugging Face. By 2025, top agents achieved scores ranging from roughly 44% to 75% depending on the evaluation framework, with Level 3 remaining particularly challenging. In February 2025, OpenAI's Deep Research reached the top of the validation set with 72.57% accuracy.[27][30]
AgentBench, published at ICLR 2024, evaluates LLMs as agents across eight distinct environments spanning three categories:[1]
| Category | Environments |
|---|---|
| Code-grounded | Operating system (OS), database (DB), knowledge graph (KG) |
| Game-grounded | Digital card game, lateral thinking puzzles |
| Web-grounded | House-holding, web shopping, web browsing (Mind2Web) |
The benchmark tested 29 LLMs and revealed a significant performance gap between commercial models (like GPT-4) and open-source alternatives. Key findings indicated that poor long-term reasoning, weak decision-making, and limited instruction-following ability were the primary obstacles to building effective LLM agents.
AgentBoard, an oral presentation at NeurIPS 2024, introduced a fine-grained progress-rate metric that captures incremental advancement on partially observable, multi-turn tasks. The framework spans 9 distinct tasks covering embodied, web, tool use, and game environments, and ships with an interactive visualization toolkit for inspecting trajectories step-by-step rather than only checking final-state success.[31]
The Berkeley Function Calling Leaderboard (BFCL), developed by UC Berkeley's Gorilla project, has become the standard benchmark for evaluating LLM function-calling capabilities. Now in version 4 (as of 2025), it evaluates models on serial and parallel function calls across Python, Java, JavaScript, and REST APIs using a novel Abstract Syntax Tree (AST) evaluation method.[17]
BFCL v4 added categories for web search, memory management, and multi-turn interactions. The benchmark assesses models on their ability to select correct functions, structure arguments properly, handle multiple parallel calls, and abstain when no appropriate function is available. Leading scores as of 2025 place Anthropic's Claude models and OpenAI's GPT models near the top, with overall accuracy scores ranging from roughly 59% to 70%.
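To give a sense of what AST-based checking involves, the sketch below parses a model-generated call with Python's standard `ast` module and compares the function name and keyword arguments against a set of acceptable values. It is a simplified illustration, not BFCL's actual checker: the helper names, the keyword-only restriction, and the `allowed` format are assumptions made for the example.

```python
import ast


def parse_call(source: str) -> tuple[str, dict]:
    """Parse a single call like f(a=1, b="x") into (name, kwargs)."""
    node = ast.parse(source.strip(), mode="eval").body
    if not isinstance(node, ast.Call):
        raise ValueError("not a function call")
    if node.args or any(kw.arg is None for kw in node.keywords):
        raise ValueError("this sketch handles keyword arguments only")
    name = ast.unparse(node.func)
    # literal_eval rejects anything that is not a plain literal value.
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    return name, kwargs


def call_matches(source: str, expected_name: str, allowed: dict[str, list]) -> bool:
    """True if the call names the expected function and every argument
    takes one of the acceptable values listed for it."""
    try:
        name, kwargs = parse_call(source)
    except (SyntaxError, ValueError):
        return False  # malformed output counts as a failed call
    if name != expected_name or kwargs.keys() != allowed.keys():
        return False
    return all(kwargs[key] in allowed[key] for key in allowed)


# Hypothetical tool and acceptable argument values.
allowed = {"city": ["Paris"], "unit": ["celsius", "C"]}
print(call_matches('get_weather(unit="C", city="Paris")', "get_weather", allowed))
```

Comparing parsed structure rather than raw strings means that argument order, quoting style, and whitespace do not affect the verdict, while a wrong function name, a missing argument, or unparseable output does.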
ToolLLM, presented at ICLR 2024, provides a comprehensive framework for training and evaluating LLMs on tool use. Its associated dataset, ToolBench, contains 16,464 RESTful APIs spanning 49 categories from RapidAPI Hub, along with 126,000+ instruction-solution path pairs. The benchmark evaluates agents on both single-tool and multi-tool scenarios, using a depth-first search based decision tree (DFSDT) approach to generate solution paths.[32]
The automated evaluator, ToolEval, measures both pass rate (whether the tool chain produces correct output) and solution path quality (whether the agent's reasoning process is sound). StableToolBench, a subsequent variant, addressed reproducibility concerns in the original benchmark.
AppWorld, awarded Best Resource Paper at ACL 2024, is an execution environment of 9 day-to-day apps operable via 457 APIs, populated with the simulated digital lives of roughly 100 people. The benchmark includes 750 natural agent tasks evaluated by state-based unit tests that check both task success and the absence of "collateral damage" (unintended state changes). GPT-4o solved approximately 49% of "normal" tasks and 30% of "challenge" tasks.[33]
τ-bench, developed by Sierra Research, is a simulation framework for evaluating customer service agents. It emulates multi-turn conversations between a simulated user (powered by an LLM) and an agent equipped with domain-specific API tools and policy guidelines. The benchmark covers realistic domains including airline customer service, retail support, and telecom interactions.[13]
What distinguishes τ-bench from other benchmarks is its emphasis on consistency. Rather than measuring whether an agent can complete a task once, it uses pass^k to assess whether the agent succeeds across multiple independent trials. State-of-the-art models like GPT-4o achieve less than 50% success on individual tasks, and their consistency drops below 25% on pass^8 in retail scenarios. The benchmark expanded through τ²-bench (released June 2025), which introduced dual-control environments where both the user and the agent can take actions in a shared state, and τ³-bench, which adds knowledge retrieval and voice interactions. Even Claude 3.7 Sonnet, the strongest model on τ²-bench at release, scored only 81.2% on retail tasks and 58.4% on airline tasks, with first-attempt success dropping from 61% to 25% on pass^8.[34]
TheAgentCompany, developed at Carnegie Mellon University and posted to arXiv in December 2024, evaluates LLM agents on 175 diverse tasks situated inside a simulated software company. The environment includes a self-hosted GitLab, a self-hosted Plane (issue tracker), Rocket.Chat for communication, ownCloud for files, and simulated colleagues with which the agent must coordinate. The strongest closed-API agents (Gemini 2.5 Pro and Claude 3.7 Sonnet) completed 30% of tasks fully autonomously and reached roughly 40% with partial credit, while open-weights models lagged at 7.4% or below. The benchmark paints a sobering picture of long-horizon workplace automation: a substantial share of consequential tasks remain out of reach.[35]
CUB, introduced by Theta Software in mid-2025, is a benchmark specifically designed for computer-use agents. It contains 106 end-to-end workflows across seven industries: consumer, construction, finance, healthcare, marketing, sales, and supply chain. Tasks were created in collaboration with domain experts (accountants, investment bankers, doctors) and involve synthetic versions of enterprise platforms like SAP and CapIQ.
CUB is particularly challenging because it requires graphical user interface interactions (clicking buttons, selecting menu items) in addition to typing or API calls. When first released, no tested agent framework exceeded 10% success, even with a granular scoring system that awarded partial credit.
GDPval, released by OpenAI in October 2025, evaluates models on 1,320 specialized work tasks drawn from 44 occupations across the top nine sectors of the U.S. economy. Tasks include legal briefs, engineering blueprints, customer support conversations, and nursing care plans, and were created by professionals averaging 14 years of experience. The primary grading method is blind head-to-head human comparison between AI and expert deliverables. Frontier models score around 85% on the benchmark depending on the comparison setup.[36]
Cybench, accepted as an ICLR 2025 oral, is a framework that packages 40 professional-level Capture-the-Flag (CTF) tasks from four recent competitions, broken down into subtasks for finer-grained evaluation. Agents are given a shell, the relevant starter files, and an environment in which they can execute commands. In the original paper, agents based on Claude 3.5 Sonnet, GPT-4o, OpenAI o1-preview, and Claude 3 Opus solved, without subtask guidance, only those tasks that human teams had solved in 11 minutes or less, and no agent could solve the hardest task in the suite (which took human teams nearly 25 hours).[37]
By 2026 these results have shifted dramatically. The UK AI Security Institute reports that Claude Mythos Preview became the first model to fully solve both of its evaluated 32-step and 7-step cyber-range scenarios, and AISI now estimates that autonomous AI cyber capability is doubling roughly every 4.7 months.[38]
CAIBench, posted to arXiv in late 2025, is a modular cybersecurity meta-benchmark that aggregates Jeopardy-style CTFs, attack-and-defense CTFs, cyber-range exercises, knowledge questions, and privacy assessments over 10,000+ instances. It is designed to test both offensive and defensive cyber capabilities in a single framework.[39]
AgentHarm, published at ICLR 2025, contains 110 explicitly malicious agent tasks (440 with augmentations) covering 11 harm categories including fraud, cybercrime, and harassment. The benchmark measures both whether models refuse harmful requests and whether jailbroken agents maintain their capabilities when attempting to complete multi-step harmful tasks. Findings included that several leading LLMs were surprisingly compliant with malicious agentic requests without jailbreaking, and that simple universal jailbreak templates could be adapted to coherent multi-step harmful agent behavior.[14]
R-Judge, presented at ICLR 2024, evaluates the safety risk awareness of LLM agents. It contains 569 records of multi-turn agent interactions covering 27 risk scenarios across 5 application categories and 10 risk types. Rather than testing whether agents cause harm directly, R-Judge assesses whether models can identify and flag safety risks in agent interaction records.[16]
ToolEmu takes a different approach to safety evaluation by using an LLM to emulate tool execution and grade accidental safety violations. It covers 36 tools and 144 test cases in high-stakes scenarios where the user's intent is benign but the agent's actions could inadvertently cause harm. This sandbox-based approach allows safety evaluation without requiring actual tool infrastructure.[15]
AILuminate v1.0, released by MLCommons in March 2025, is a 24,000-prompt safety benchmark that covers 12 hazard categories: violent crimes, nonviolent crimes, sex-related crimes, child sexual exploitation, indiscriminate weapons, suicide and self-harm, intellectual property, privacy, defamation, hate, sexual content, and specialized advice (election, financial, health, legal). It uses a tuned ensemble of safety evaluation models as graders and is available in English with French, Chinese, and Hindi extensions. AILuminate represents one of the first cross-industry attempts to establish a shared safety reporting standard for general-purpose chat systems.[40]
AdvBench is a widely used dataset of roughly 520 adversarial prompts that target categories like misinformation, illegal activity, and hate speech, frequently paired with the Greedy Coordinate Gradient (GCG) attack. HarmBench, presented at ICML 2024, extends adversarial evaluation to 510 unique harmful behaviors (text, contextual, and multimodal) with a standardized scoring pipeline. The HarmBench paper systematically compared 18 red-teaming methods against 33 LLMs and defenses, finding that no attack or defense was uniformly effective and that model size did not predict robustness.[41]
JailbreakBench, a NeurIPS 2024 Datasets and Benchmarks paper, is an open robustness benchmark for jailbreaking LLMs. Its JBB-Behaviors dataset contains 100 distinct misuse behaviors (55% original, 45% sourced from AdvBench and TDC/HarmBench) split into ten categories matching OpenAI's usage policies. JailbreakBench tracks both attack success rate and defense effectiveness across an open leaderboard.[42]
CRAB, released in late 2024 by Camel-AI, evaluates agents across 120 tasks spanning Ubuntu and Android environments. It introduces graph-based fine-grained scoring with partial credit capability. The best model (GPT-4o) achieved only 14.17% completion ratio, highlighting the difficulty of cross-environment task completion.
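Graph-based scoring with partial credit can be pictured as a set of checkpoints connected by prerequisite edges, where the score is the fraction of checkpoints reached. The sketch below is one plausible reading of that idea, not CRAB's implementation: it assumes a checkpoint counts only when its own condition holds and all of its prerequisites have been counted. The checkpoint names are hypothetical.

```python
def completion_ratio(prereqs: dict[str, list[str]], satisfied: set[str]) -> float:
    """Fraction of checkpoints completed in a prerequisite graph.

    prereqs maps each checkpoint to the checkpoints that must be
    completed before it; satisfied holds the checkpoints whose own
    condition was observed to be true after the agent's run.
    """
    if not prereqs:
        raise ValueError("graph must contain at least one checkpoint")
    completed: set[str] = set()
    changed = True
    while changed:  # repeat until no further checkpoint can be credited
        changed = False
        for node, parents in prereqs.items():
            if (
                node not in completed
                and node in satisfied
                and all(parent in completed for parent in parents)
            ):
                completed.add(node)
                changed = True
    return len(completed) / len(prereqs)


# Hypothetical four-step task: the agent saved a file without editing it,
# so the final checkpoint earns no credit.
graph = {
    "open_app": [],
    "find_file": ["open_app"],
    "edit_file": ["find_file"],
    "save_file": ["edit_file"],
}
print(completion_ratio(graph, {"open_app", "find_file", "save_file"}))  # 0.5
```

Unlike a binary success flag, this score distinguishes an agent that completed half the workflow from one that never opened the application.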
Vibe-Eval, released by Reka in 2024, is an open multimodal chat benchmark with 269 visual understanding prompts (100 marked "hard"), each with a gold-standard expert response. The hard set is constructed so that more than 50% of questions are answered incorrectly by every then-frontier model, providing headroom for years of progress.[43]
| Benchmark | Year | Domain | Tasks | Environment | Key metric | Human baseline | Best AI (approx.) |
|---|---|---|---|---|---|---|---|
| MiniWoB++ | 2018 | Web (synthetic) | 100+ | Synthetic web pages | Task success rate | Near 100% | 95%+ |
| WebArena | 2023 | Web (realistic) | 812 | Self-hosted websites | Functional correctness | 78% | 61.7% |
| Mind2Web | 2023 | Web (real) | 2,350 | Real websites | Strict success / partial credit | N/A | 23% strict |
| SWE-bench Verified | 2023 | Software engineering | 500 | Real GitHub repos | pass@1 | N/A | 93.9% (saturated, contaminated) |
| SWE-bench Pro | 2025 | Software engineering | 1,865 | Real GitHub repos | pass@1 | N/A | 77.8% |
| GAIA | 2023 | General assistant | 466 | Multi-modal, multi-tool | Accuracy | 92% | ~75% |
| AgentBench | 2023 | Multi-domain (8 envs) | Varies | Simulated environments | Overall score | N/A | Varies by env |
| ToolBench | 2023 | API/tool use | 16,464 APIs | Real APIs via RapidAPI | Pass rate | N/A | Varies |
| VisualWebArena | 2024 | Visual web | 910 | Self-hosted multimodal sites | Success rate | 89% | 60-70% |
| AppWorld | 2024 | Apps and APIs | 750 | 9 simulated apps, 457 APIs | State-based unit tests | N/A | 49% normal |
| OSWorld | 2024 | Desktop OS | 369 | Real VMs (Ubuntu/Win/Mac) | Task success rate | 72.4% | 79.6% (OSWorld-Verified) |
| BFCL v4 | 2024 | Function calling | 2,000+ | API simulation | Overall accuracy | N/A | ~70% |
| τ-bench | 2024 | Customer service | Multiple domains | Simulated conversations | pass^k | N/A | <50% (SR) |
| MLE-bench | 2024 | ML engineering | 75 Kaggle | Real ML pipelines | Medal rate | Strong Kaggler | 16.9% bronze |
| Cybench | 2024 | Cybersecurity | 40 CTFs | Sandboxed shells | Subtask completion | Expert teams | Saturating |
| TheAgentCompany | 2024 | Workplace tasks | 175 | Simulated company | Task success / partial | Full-time employee | 30% |
| BrowseComp | 2025 | Deep research | 1,266 | Live web | Exact-match accuracy | N/A | ~50% (Deep Research) |
| CUB | 2025 | Enterprise workflows | 106 | Synthetic enterprise platforms | Task success rate | N/A | <10% |
| AILuminate v1.0 | 2025 | Safety | 24,000 prompts | Static chat | Hazard category scores | N/A | N/A |
| GDPval | 2025 | Economic work | 1,320 | Real deliverables | Blind expert comparison | Expert quality | ~85% |
| BrowseComp-Plus | 2025 | Deep research | Curated corpus | Fixed document corpus | Exact-match accuracy | N/A | 70.1% (GPT-5) |
The most common approach evaluates agents based on final outcomes. In SWE-bench, this means checking whether the generated patch passes the test suite. In WebArena, it means verifying whether the web page reached the desired state. Outcome-based evaluation is attractive because it is objective and mirrors what end users care about, but it can miss important failure modes. An agent might produce the correct result through unsafe or inefficient means, or it might fail on a task for reasons unrelated to its core capabilities (such as a flaky test or ambiguous task specification).
Process-based (or trajectory-based) evaluation examines the steps an agent takes rather than just its final output. This includes analyzing tool call sequences, reasoning traces, and intermediate decisions. Metrics like Node F1 (for tool selection accuracy) and Edge F1 (for sequence accuracy) measure how well an agent's decision process aligns with reference trajectories.
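As a rough illustration of these trajectory metrics, the sketch below treats the tools called as a set of nodes and each pair of consecutive calls as an edge, then computes F1 against a reference trajectory. Exact definitions vary between benchmarks, so this set-based reading, along with the function and tool names, is an assumption made for the example.

```python
def f1(predicted: set, reference: set) -> float:
    """Harmonic mean of precision and recall between two sets."""
    if not predicted and not reference:
        return 1.0  # nothing expected, nothing produced
    overlap = len(predicted & reference)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


def node_f1(pred_calls: list[str], ref_calls: list[str]) -> float:
    """Tool-selection accuracy: which tools were used, ignoring order."""
    return f1(set(pred_calls), set(ref_calls))


def edge_f1(pred_calls: list[str], ref_calls: list[str]) -> float:
    """Sequence accuracy: which consecutive tool-to-tool transitions match."""
    pred_edges = set(zip(pred_calls, pred_calls[1:]))
    ref_edges = set(zip(ref_calls, ref_calls[1:]))
    return f1(pred_edges, ref_edges)


# Hypothetical trajectories: the agent skipped the "extract" step.
predicted = ["search", "open_page", "summarize"]
reference = ["search", "open_page", "extract", "summarize"]
print(round(node_f1(predicted, reference), 3))  # 0.857
print(round(edge_f1(predicted, reference), 3))  # 0.4
```

Skipping one step costs little on Node F1 but much more on Edge F1, because the omission breaks two of the three reference transitions.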
Process evaluation is valuable for diagnosing failure modes and understanding agent behavior, but it risks penalizing valid alternative approaches. As Anthropic has emphasized, "grading what the agent produced, not the path it took" prevents unnecessarily punishing creative solutions.[4]
A growing area of methodology focuses on the side effects an agent leaves behind, not just whether the task itself was completed. AppWorld's state-based unit tests, for example, check both task success and the absence of unintended state changes ("collateral damage").[33] Similar approaches snapshot the sandbox before and after an agent run and diff the file system, database state, or browser DOM, scoring the agent on the precision of its actions. Side-effect evaluation is especially relevant for computer-use agents and enterprise agents that can mutate persistent state.
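The snapshot-and-diff idea can be sketched with plain dictionaries standing in for the sandbox state. This is a minimal illustration under stated assumptions, not AppWorld's test harness: state is modeled as a flat key-to-value mapping, `expected` lists the changes the task legitimately requires, and every other change is treated as collateral damage.

```python
ABSENT = "<absent>"  # marker for a key that does not exist in a snapshot


def diff_state(before: dict, after: dict) -> dict:
    """Return {key: (old, new)} for every added, removed, or changed key."""
    changes = {}
    for key in before.keys() | after.keys():
        old = before.get(key, ABSENT)
        new = after.get(key, ABSENT)
        if old != new:
            changes[key] = (old, new)
    return changes


def grade_side_effects(before: dict, after: dict, expected: dict) -> dict:
    """Check that required changes happened and nothing else was touched."""
    changes = diff_state(before, after)
    # The task succeeds if every required key ends up with its required value.
    task_success = all(after.get(k, ABSENT) == v for k, v in expected.items())
    # Any changed key outside the expected set is an unintended side effect.
    collateral = sorted(k for k in changes if k not in expected)
    return {
        "task_success": task_success,
        "collateral": collateral,
        "passed": task_success and not collateral,
    }


# Hypothetical run: the agent archived the right email but also deleted a note.
before = {"inbox/1": "unread", "inbox/2": "unread", "notes/todo": "buy milk"}
after = {"inbox/1": "archived", "inbox/2": "unread"}
print(grade_side_effects(before, after, {"inbox/1": "archived"}))
```

In this example the task itself succeeds, yet the run fails overall because `notes/todo` was removed. A real harness would diff a file system, database, or DOM rather than a dictionary, but the grading logic has the same shape.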
Using a separate large language model to evaluate agent outputs has become widespread, particularly for tasks where success is subjective or difficult to verify programmatically. The judge model receives the agent's transcript (including actions, tool calls, and outputs) along with a scoring rubric, and assigns scores based on quality criteria. The approach was popularized by Zheng et al.'s 2023 paper "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena", which showed that GPT-4 as a judge reached more than 80% agreement with human experts on a set of 80 multi-turn questions, matching the level of human-human agreement on the same set.[12]
The same paper documented three persistent failure modes that still shape LLM-as-judge practice today: position bias (judges favor whichever answer is shown first or last), verbosity bias (judges prefer longer answers regardless of correctness), and self-enhancement bias (judges favor outputs from the same model family). The standard mitigations are swapping answer positions in pairwise judging, normalizing for length, and using a judge from a different model family than the system under test.[12]
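Position swapping can be sketched as below. The `Judge` callable signature and the two stub judges are hypothetical stand-ins for a real model call; the rule shown (count a win only if it survives both orderings, otherwise call a tie) is one conservative choice among several.

```python
from typing import Callable

# Hypothetical judge signature: (prompt, first_answer, second_answer)
# returns "first", "second", or "tie".
Judge = Callable[[str, str, str], str]


def judge_with_position_swap(judge: Judge, prompt: str, answer_a: str, answer_b: str) -> str:
    """Return "A", "B", or "tie"; a win counts only if it holds in both orders."""
    forward = judge(prompt, answer_a, answer_b)   # A is shown first
    backward = judge(prompt, answer_b, answer_a)  # B is shown first
    if forward == "first" and backward == "second":
        return "A"
    if forward == "second" and backward == "first":
        return "B"
    return "tie"  # inconsistent verdicts are treated as position-driven


def always_first(prompt: str, first: str, second: str) -> str:
    """Stub judge with pure position bias."""
    return "first"


def prefers_answer_with_4(prompt: str, first: str, second: str) -> str:
    """Stub judge that looks at content only (correct answer contains "4")."""
    if ("4" in first) == ("4" in second):
        return "tie"
    return "first" if "4" in first else "second"


question = "What is 2 + 2?"
print(judge_with_position_swap(always_first, question, "4", "5"))
print(judge_with_position_swap(prefers_answer_with_4, question, "4", "5"))
```

Swapping doubles the number of judge calls and does not address verbosity or self-enhancement bias, which need their own controls.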
A more advanced variant, Agent-as-a-Judge, uses a multi-agent setup where evaluator agents can themselves use tools and take actions to verify the primary agent's work. Mind2Web 2 introduced this approach for evaluating agentic search, where the evaluator agent actively checks whether retrieved information is correct and complete.
LLM-as-a-Judge approaches offer flexibility and can handle nuanced evaluation criteria, but they require careful calibration against human judgments and can introduce their own biases.
Two broad scoring paradigms exist for subjective evaluation. Absolute scoring asks a grader to assign a numerical score to each response on a Likert scale (often 1 to 5 or 1 to 10). Pairwise scoring presents two outputs side by side and asks the grader to pick the better one. Pairwise judging is generally more reliable for LLM judges because it sidesteps the calibration problem (judges anchor inconsistently on absolute scales) but it is more expensive at scale because the number of comparisons grows quadratically with the number of systems. Chatbot Arena's Elo-style aggregation and GDPval's blind head-to-head expert comparison are pairwise schemes; OpenAI Evals and most product-grade graders default to absolute scoring.[12][36]
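The quadratic growth in comparisons, and the way pairwise outcomes can be folded into ratings, can be illustrated with a short sketch. The Elo update below uses the conventional chess defaults (a 400-point scale and K = 32) as assumptions; it is not a description of how Chatbot Arena or GDPval actually aggregate results.

```python
from itertools import combinations
from typing import Tuple


def num_pairwise_comparisons(n_systems: int) -> int:
    """Number of distinct pairs among n systems: n * (n - 1) / 2."""
    return n_systems * (n_systems - 1) // 2


def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0) -> Tuple[float, float]:
    """Update two ratings after one comparison.

    score_a is 1.0 if A won, 0.0 if B won, and 0.5 for a tie.
    """
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta


for n in (4, 10, 50):
    print(n, "systems need", num_pairwise_comparisons(n), "comparisons per task")

# Enumerating the pairs explicitly gives the same count.
pairs = list(combinations(["agent_1", "agent_2", "agent_3", "agent_4"], 2))
print(len(pairs))

print(elo_update(1000.0, 1000.0, score_a=1.0))
```

Rating schemes like this are one reason pairwise evaluation stays tractable: each new system can be compared against a sample of opponents instead of all of them.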
Human evaluation remains the gold standard for open-ended tasks and subjective quality assessments. Human evaluators review agent transcripts and rate performance on criteria like helpfulness, accuracy, safety, and efficiency. While expensive and slow, human evaluation serves as the ground truth for calibrating automated evaluation methods.
BrowserArena uses human judges for head-to-head agent comparisons on user-submitted tasks, providing a reference-free evaluation approach that does not require predefined ground-truth answers.
Given the non-deterministic nature of LLM-based agents, running a single trial per task provides an unreliable estimate of performance. Multi-trial evaluation runs each task multiple times and reports aggregate statistics. The pass@k metric (the probability that at least one of k trials succeeds) and the pass^k metric (the probability that all k trials succeed) capture different aspects of multi-trial performance: the first measures capability, the second consistency. Anthropic recommends running at least 3 to 5 trials per task to get stable estimates.[4]
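Both multi-trial metrics can be estimated from n recorded trials of a task, c of which succeeded. The sketch below uses the standard combinatorial estimators (sampling k of the n trials without replacement); the function names are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k trials succeeds) from c successes in n trials."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Complement of the chance that all k sampled trials are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate P(all k trials succeed) from c successes in n trials."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    return comb(c, k) / comb(n, k)


# An agent that solved a task in 3 of 5 trials:
for k in (1, 2, 3):
    print(k, round(pass_at_k(5, 3, k), 3), round(pass_hat_k(5, 3, k), 3))
```

At k = 1 the two metrics coincide with the raw success rate; as k grows, pass@k rises toward 1 while pass^k falls, which is why the same agent can look strong on one and weak on the other.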
BrowserGym is a universal simulation environment developed by ServiceNow that unifies web-based benchmarks including MiniWoB++, WebArena, VisualWebArena, and WorkArena under a single Gymnasium-style API. It provides standardized observation and action spaces (HTML, accessibility tree, screenshot, set-of-mark), making it easier to compare agents across different web benchmarks. The companion AgentLab framework adds agent construction and analysis tools on top.[44]
Inspect is an open-source evaluation framework from the UK AI Security Institute (UK AISI) and Meridian Labs that supports a wide range of agent benchmarks including GAIA, BFCL, AgentHarm, SWE-bench, GDM CTF, and Cybench. It provides composable evaluation pipelines with support for multiple solvers and scorers, built-in tools (bash, Python, text editing, web search, web browsing, computer use), MCP and custom tool calling, and multi-agent primitives. As of 2025-2026, Inspect ships with more than 200 pre-built evaluations through the Inspect Evals repository.[45][46]
Inspect Sandboxing Toolkit is a 2025 AISI extension to Inspect that bundles plugins for spinning up secure containerized environments for evaluation runs, including Docker, Kubernetes, and isolated VM backends. Inspect Cyber, also from AISI, is a standardized framework specifically for agentic cyber evaluations, with consistent two-file task configuration and built-in support for the 95-task AISI cyber suite.[47][48]
Inspect Evals is the open-source community repository for the framework, launched November 2024 with contributions from over 50 organizations including frontier labs and other AI safety institutes.[46]
AgentBench Toolkit provides an integrated evaluation package supporting all eight AgentBench environments, with standardized APIs for running evaluations and collecting results.
Petri (Parallel Exploration Tool for Risky Interactions) is an open-source automated alignment auditing framework released by Anthropic in October 2025. Petri deploys an auditor agent that runs multi-turn conversations with a target model through simulated users and tools, then uses a judge model to score and summarize the transcripts. Applied to 14 frontier models with 111 seed instructions, Petri elicited a broad set of misaligned behaviors including autonomous deception, oversight subversion, whistleblowing, and cooperation with human misuse. Petri 3.0, released in early 2026, expanded the seed library and scoring rubric.[49]
Several commercial platforms offer agent evaluation and observability capabilities:
| Platform | Key features |
|---|---|
| LangSmith | Native LangChain integration, automatic tracing, minimal performance overhead |
| Braintrust | Unified evaluation/observability/optimization, AI-generated custom scorers |
| AgentOps | Session tracking, LLM call tracing, tool use monitoring, cost tracking |
| Langfuse | Open-source tracing and analytics, prompt management |
| Galileo | Agent metric dashboards, automated evaluation pipelines |
| Arize Phoenix | Open-source observability, online and offline evals, span-level tracing |
| Promptfoo | CLI and library for declarative evals, CI/CD integration, red-teaming and vulnerability scanning |
These platforms complement academic benchmarks by providing production-oriented evaluation capabilities including real-time monitoring, A/B testing, and trace-level debugging. Promptfoo, originally an independent open-source project, was acquired by OpenAI in March 2026 and continues as an MIT-licensed CLI with support for multiple model providers.[50]
OpenAI Evals is an MIT-licensed Python framework from OpenAI with a registry of public benchmarks and a Completion Function Protocol for evaluating prompt chains and tool-using agents. Since late 2025 the framework has been complemented by an OpenAI Evals API and a Dashboard-hosted workflow for running, grading, and tuning evals as part of an iterative product loop.[51]
Agent evaluation faces significant reproducibility challenges. LLM-based agents exhibit variability in execution paths, tool selection, and reasoning patterns due to non-deterministic sampling. This means that the same agent can produce different results on the same task across different runs. Long-horizon tasks amplify this problem because errors compound over multiple steps. Without standardized protocols for controlling randomness and reporting variance, benchmark results can be misleading.
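One simple way to report variance alongside a success rate is a binomial confidence interval. The sketch below uses the Wilson score interval, which is an assumption of this example and not a protocol prescribed by any benchmark named here; it also treats trials as independent, which repeated runs of the same task only approximate.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a success rate (z = 1.96 gives roughly 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("require trials > 0 and 0 <= successes <= trials")
    p_hat = successes / trials
    z2 = z * z
    denom = 1.0 + z2 / trials
    center = (p_hat + z2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / trials + z2 / (4 * trials * trials)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


# The same 80% success rate is far less certain with 10 runs than with 500.
for successes, trials in ((8, 10), (400, 500)):
    low, high = wilson_interval(successes, trials)
    print(f"{successes}/{trials}: {low:.3f} to {high:.3f}")
```

Reporting the interval next to the headline score makes it clear when two agents' results are not actually distinguishable at the number of trials that were run.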
Environment reproducibility is also a concern. Web-based benchmarks depend on external services that may change over time, and desktop benchmarks require specific virtual machine configurations. OSWorld-Verified and StableToolBench have addressed some of these issues by improving infrastructure reliability and standardizing evaluation environments.[9]
As LLMs are trained on increasingly large corpora of internet text, the risk of benchmark data appearing in training sets grows. This data contamination can inflate benchmark scores without reflecting genuine capability improvements. OpenAI publicly acknowledged the issue when it stopped reporting SWE-bench Verified results after finding contamination across frontier models, and it recommended SWE-bench Pro (which uses more challenging, less common tasks under a license that legally deters scraping) instead.[19]
Several strategies have been developed to combat contamination. SWE-bench Live provides a continuously updated stream of 50 new issues per month from public GitHub projects dated after January 2024.[20] LiveCodeBench refreshes its problem set regularly. BrowseComp-Plus and SWE-bench Pro use legal and architectural barriers, such as license restrictions and curated private repositories, to prevent inclusion in training corpora.[28][19] Some benchmarks create fully synthetic tasks designed to fall outside internet-scale training corpora.
Running comprehensive agent evaluations is expensive. Each task may require multiple API calls, tool executions, and environment setups. Multi-trial evaluation (necessary for reliable results) multiplies these costs further. OSWorld evaluation, for example, requires provisioning and managing virtual machines for each task. SWE-bench requires building and running test suites for real software projects.
Failed attempts still incur costs, making reliability economically critical. An agent that requires many retries to succeed may be technically capable but financially impractical. Developing cost-bounded evaluation protocols that balance thoroughness with efficiency remains an active research challenge.
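The economics of retries can be made concrete with a small calculation. The sketch assumes each attempt costs the same amount and succeeds independently with a fixed probability, so the expected number of attempts until the first success is 1/p; the dollar figures are invented for illustration.

```python
def expected_cost_per_success(cost_per_attempt: float, success_rate: float) -> float:
    """Expected spend to obtain one success when failed attempts are also billed."""
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    # Geometric distribution: on average 1 / p attempts are needed.
    return cost_per_attempt / success_rate


def success_within(max_attempts: int, success_rate: float) -> float:
    """Probability of at least one success within a retry budget."""
    return 1.0 - (1.0 - success_rate) ** max_attempts


# Hypothetical agents: a cheap but unreliable one and a pricier, steadier one.
cheap = expected_cost_per_success(cost_per_attempt=0.50, success_rate=0.25)
steady = expected_cost_per_success(cost_per_attempt=1.20, success_rate=0.80)
print(cheap, steady)
print(success_within(3, 0.25), success_within(3, 0.80))
```

Under these assumptions the agent with the lower per-attempt price ends up costing more per completed task, and it still fails more often within a fixed retry budget.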
Defining clear, unambiguous success criteria for agent tasks is difficult. Anthropic reported in 2026 that, in many evaluations of its most capable models (such as Claude Opus 4.5), low scores often revealed evaluation bugs rather than model limitations. Rigid grading (for example, rejecting "96.12" when the expected value is "96.124991…"), ambiguous task specifications, and stochastic task elements all penalized correct behavior.[4] The recommended mitigation is to write tasks so that two domain experts would independently reach the same pass/fail verdict on every one.
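A grader that tolerates reasonable rounding might compare answers at the precision the agent reported. The sketch below is one possible rule, not Anthropic's grader; the function name and the `min_decimals` threshold are assumptions, and the expected value `96.124991` is a truncated stand-in because the source elides the remaining digits.

```python
from decimal import Decimal, InvalidOperation, ROUND_HALF_UP


def matches_at_reported_precision(answer: str, expected: str, min_decimals: int = 2) -> bool:
    """Accept an answer that equals the expected value rounded to the answer's own precision."""
    try:
        given = Decimal(answer.strip())
        target = Decimal(expected.strip())
    except InvalidOperation:
        return False  # not a number at all
    if not given.is_finite() or not target.is_finite():
        return False
    decimals = max(0, -given.as_tuple().exponent)
    if decimals < min_decimals:
        return False  # too coarse to count as the same answer
    quantum = Decimal(1).scaleb(-decimals)  # e.g. 0.01 for two decimal places
    return given == target.quantize(quantum, rounding=ROUND_HALF_UP)


expected_value = "96.124991"  # truncated, hypothetical stand-in for the elided value
print(matches_at_reported_precision("96.12", expected_value))
print(matches_at_reported_precision("96.13", expected_value))
print(matches_at_reported_precision("96", expected_value))
```

The minimum-precision threshold is a task-level decision: it should be stated in the task specification so that the agent and the grader share the same definition of a correct answer.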
Most benchmarks test agents on a fixed set of tasks within specific domains. How well performance on these tasks predicts real-world capability remains an open question. Mind2Web explicitly tests three levels of generalization (cross-task, cross-website, cross-domain), but most benchmarks do not systematically evaluate generalization. An agent that achieves high scores on SWE-bench Python tasks may not transfer that performance to other programming languages, as the introduction of SWE-bench++ and SWE-bench Java has begun to reveal.
A 2025 analysis from Princeton's Holistic Agent Leaderboard (HAL) project found that overall reliability across 14 agents and 12 metrics has improved only slightly even as accuracy has climbed substantially across 18 months of model development. The HAL Reliability Dashboard reports consistency under repeated runs, robustness to perturbations, predictability of failures, and respect for safety constraints separately from raw accuracy. The conclusion is that "improving raw task performance is insufficient for building dependable AI agents", and that reliability requires targeted methodology beyond scaling.[52]
Current safety benchmarks like AgentHarm and R-Judge cover important failure modes, but the space of possible agent harms is vast and difficult to enumerate.[14][16] Agents operating in real environments can cause harm through subtle chains of actions that are difficult to predict or test for. The gap between synthetic safety benchmarks and real-world deployment risks remains a significant concern for the field. Automated auditing tools like Anthropic's Petri attempt to close part of this gap by using auditor agents to probe for misaligned behaviors at scale.[49]
A distinct line of work, led by METR (Model Evaluation and Threat Research), reframes agent evaluation in terms of human-relatable task duration rather than benchmark-specific success rates. METR's flagship metric is the 50%-task-completion time horizon, defined as the task duration (measured by an expert human's completion time) at which an AI agent is predicted to succeed half the time. The team plots the time horizon of frontier models against their release date.[53]
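One way to operationalize this definition is to fit a logistic curve of success against the logarithm of human completion time and read off where the predicted success probability crosses 50%. The sketch below does that with plain gradient descent on invented data; it is an illustration of the idea and not METR's code, which may weight tasks and estimate uncertainty differently.

```python
import math
from typing import Sequence, Tuple


def _sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_time_horizon(runs: Sequence[Tuple[float, int]], lr: float = 0.1, steps: int = 3000) -> float:
    """Return the task length (minutes) at which predicted success is 50%.

    runs holds (human_minutes, success) pairs with success equal to 0 or 1.
    """
    xs = [math.log2(minutes) for minutes, _ in runs]
    ys = [success for _, success in runs]
    mean_x = sum(xs) / len(xs)
    centered = [x - mean_x for x in xs]  # centering helps gradient descent converge
    w0 = w1 = 0.0
    n = len(runs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(centered, ys):
            err = _sigmoid(w0 + w1 * x) - y
            g0 += err
            g1 += err * x
        w0 -= lr * g0 / n
        w1 -= lr * g1 / n
    if w1 >= 0:
        raise ValueError("success does not fall with task length; horizon undefined")
    # Predicted success is 0.5 where w0 + w1 * (log2(t) - mean_x) = 0.
    return 2.0 ** (mean_x - w0 / w1)


# Invented results: four runs at each task length, successes listed first.
outcomes = {1: [1, 1, 1, 1], 4: [1, 1, 1, 0], 16: [1, 1, 0, 0], 64: [1, 0, 0, 0], 256: [0, 0, 0, 0]}
runs = [(minutes, s) for minutes, results in outcomes.items() for s in results]
horizon = fit_time_horizon(runs)
print(f"50% time horizon: about {horizon:.1f} minutes")
```

Because the fit is in log time, the horizon is best read as an order-of-magnitude summary: small changes in the task suite can shift it noticeably.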
The original METR paper, "Measuring AI Ability to Complete Long Tasks" (March 2025), found that the time horizon has been doubling roughly every seven months since 2019. Claude 3.7 Sonnet, the strongest model in that paper's evaluation, scored a 50% time horizon of about 50 minutes on METR's task suite. The headline implication, extrapolating the trend, was that "in under a decade, we will see AI agents that can independently complete a large fraction of software tasks that currently take humans days or weeks".[53]
In January 2026 METR released Time Horizon 1.1, which expanded the suite from 170 to 228 tasks (with long, 8+ hour tasks more than doubling from 14 to 31) and migrated the evaluation harness from METR's in-house Vivaria to AISI's Inspect framework. Under the updated estimator the headline hybrid trend remained near seven months (196 days, about 6.4 months), but the post-2023 doubling time tightened to 131 days (4.3 months) and the post-2024 doubling time to 89 days (about 2.9 months), suggesting that progress has accelerated since 2023.[54] AISI's own May 2026 cyber-capability tracking found a doubling time of roughly 4.7 months in autonomous cybersecurity tasks, close to METR's post-2023 figure though slower than its post-2024 figure.[38]
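The doubling times quoted above can be converted between days and months, and turned into growth factors, with a few lines of arithmetic. The sketch assumes a clean exponential trend, which is a simplification; the 50-minute starting horizon is taken from the Claude 3.7 Sonnet figure in the original paper and is used only as an example input.

```python
DAYS_PER_MONTH = 365.25 / 12  # average month length, about 30.44 days


def days_to_months(days: float) -> float:
    """Convert a duration in days to average calendar months."""
    return days / DAYS_PER_MONTH


def growth_factor(elapsed_days: float, doubling_days: float) -> float:
    """How many times larger the horizon becomes after elapsed_days."""
    return 2.0 ** (elapsed_days / doubling_days)


def extrapolate_horizon(current_minutes: float, elapsed_days: float, doubling_days: float) -> float:
    """Project a time horizon forward under a constant doubling time."""
    return current_minutes * growth_factor(elapsed_days, doubling_days)


for label, days in (("hybrid", 196), ("post-2023", 131), ("post-2024", 89)):
    yearly = growth_factor(365.25, days)
    print(f"{label}: {days} days = {days_to_months(days):.1f} months, x{yearly:.1f} per year")

# A 50-minute horizon one year later under the slowest and fastest trends.
print(extrapolate_horizon(50, 365.25, 196), extrapolate_horizon(50, 365.25, 89))
```

The gap between the slowest and fastest estimates compounds quickly, which is why the choice of fitting window matters so much for any forecast built on these figures.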
The METR framework has become an influential summary measure for policy and safety discussions, in part because it translates abstract benchmark percentages into "this model can do tasks that take humans X minutes". Critics caution that the metric is sensitive to task suite construction and that external validity (whether benchmark task durations match the real-world tasks that economically matter) remains an open question. The 2025 GDPval results, which target 1,320 real economic tasks, provide a complementary anchor.[36]
The 18 months from late 2024 to mid-2026 saw the most concentrated changes in agent evaluation since the field's emergence.
Several flagship benchmarks effectively saturated. By May 2026, top scores on SWE-bench Verified exceeded 93%, OSWorld-Verified exceeded 79% (above the human baseline), and GAIA validation accuracy reached the low 70s.[18][29][30] OpenAI's late-2025 audit confirmed measurable contamination on SWE-bench Verified across frontier models, prompting the field to migrate to SWE-bench Pro (1,865 long-horizon tasks under license-restricted enterprise repositories) and SWE-bench Live (50 fresh issues per month).[19][20] The same dynamic played out on browsing benchmarks: BrowseComp gave way to BrowseComp-Plus, which fixes the document corpus to remove the variability of opaque search APIs.[28]
The UK AI Security Institute's Inspect became the de facto open evaluation harness across labs and governments. Inspect Evals (November 2024) collected community-contributed evaluations into a single repository, and 2025 saw the release of Inspect Sandboxing Toolkit and Inspect Cyber as agent-focused extensions.[46][47][48] METR migrated its time-horizon harness from Vivaria to Inspect in early 2026, consolidating a shared infrastructure across METR, AISI, and frontier labs.[54] The US AI Safety Institute was rebranded as the Center for AI Standards and Innovation (CAISI) inside NIST in June 2025 and continues to coordinate pre-deployment evaluation with Anthropic and OpenAI, with focus areas including generative AI risk management, synthetic content, evaluations, red teaming, and model safety and security.[55]
Automated red teaming matured beyond static prompt sets. Anthropic's Petri (October 2025) and its 2026 update Petri 3.0 use auditor agents to generate, run, and score multi-turn behavioral probes; the system identified deception, oversight subversion, and other failure modes across 14 frontier models with 111 seed instructions.[49] Other vendors followed: Promptfoo expanded its red-team module (and was acquired by OpenAI in March 2026), and HarmBench, AdvBench, and JailbreakBench remained reference datasets for comparing attacks and defenses.[50][41][42]
OpenAI's GDPval (October 2025) was the first cross-occupational benchmark to grade frontier-model outputs by blind head-to-head comparison with deliverables from experts who averaged 14 years of experience, across 44 occupations and 1,320 tasks. Aggregate frontier-model deliverables reached roughly 85% on the headline metric.[36] Carnegie Mellon University's TheAgentCompany (December 2024) reported that the strongest agents could only autonomously complete 30% of 175 simulated software-company tasks, with partial credit reaching 40%, painting a more sobering picture of long-horizon workplace automation.[35] Sierra's τ²-bench (June 2025) and τ³-bench added dual-control environments and voice channels to customer-service evaluation, while keeping pass^k as the headline reliability metric.[34]
The HAL reliability program at Princeton, which paused leaderboard updates in 2025 to refocus on reliability dimensions, reported in 2026 that accuracy gains had not translated into proportional reliability gains across 14 evaluated agents on 12 metrics covering consistency, predictability, robustness, safety, and abstention.[52] τ-bench's pass^k framing, ReliabilityBench's k-trial / ε-perturbation / λ-fault dimensions, and AppWorld's state-based collateral-damage tests all share this orientation. Anthropic's January 2026 update to its "Demystifying evals for AI agents" guide pushed similar themes for product-grade evaluation, emphasizing balanced positive and negative cases, regular transcript reading, and monitoring for evaluation saturation rather than model saturation.[4]
The most discussed result of the period was METR's January 2026 Time Horizon 1.1 update, which estimated a post-2024 doubling time of 89 days for the 50%-task-completion horizon, down from 7 months under the original estimator.[54] AISI's May 2026 cyber-capability tracking estimated a separate 4.7-month doubling time for autonomous cyber tasks, close to METR's post-2023 figure of 131 days though slower than its post-2024 figure.[38] Even with confidence intervals that span months, both numbers imply that benchmark difficulty levels considered cutting-edge in 2024 (such as long-horizon multi-application workflows) may be effectively solved within a year or two, sharpening the case for continuously refreshed, contamination-resistant evaluation infrastructure.
Several public leaderboards track agent performance across major benchmarks. These leaderboards play an important role in driving progress but also create incentive structures that can distort research priorities. Researchers may optimize for specific benchmark scores rather than general capability, and leaderboard positions can be gamed through task-specific fine-tuning or prompt engineering.
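Anthropic published a detailed guide on building agent evaluations in 2025 and updated it in January 2026, synthesizing lessons learned from developing Claude's agent capabilities. Its recommendations include grading what the agent produced rather than the path it took, running multiple trials per task, writing tasks on which two domain experts would independently reach the same pass/fail verdict, balancing positive and negative cases, and reading transcripts regularly.[4]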
The field of agent evaluation is evolving rapidly along several axes:
- Holistic evaluation frameworks that assess multiple dimensions simultaneously (performance, safety, cost, reliability) rather than treating each dimension in isolation. The 2025 survey on LLM agent evaluation identified this as the top research priority.[11]
- Enterprise-mimicking environments that replicate real business workflows, including role-based access controls, multi-user scenarios, and integration with enterprise software. CUB, TheAgentCompany, GDPval, and FieldWorkArena (from Fujitsu, focused on manufacturing and warehouse operations) represent early steps in this direction.[35][36]
- Scalable automated evaluation techniques that reduce reliance on expensive human judges while maintaining evaluation quality. Agent-as-a-Judge, automated alignment auditing systems like Petri, and improved LLM-based grading methods are active areas of development.[49]
- Efficient evaluation protocols that support iterative development cycles without prohibitive costs. This includes techniques for selecting representative task subsets, early stopping based on confidence intervals, and amortizing environment setup costs across multiple evaluations.
- Real-time and continuous evaluation that goes beyond static benchmark snapshots to monitor agent performance in production. This connects agent evaluation to the broader field of ML monitoring and observability.
- Cross-modal and cross-environment evaluation that tests agents across different input modalities (text, vision, audio) and operating environments (web, desktop, mobile, voice) within unified frameworks. τ³-bench's addition of voice evaluation, CRAB's cross-platform testing, and BrowserGym's unification of web environments are examples of this trend.[44]