# AgentBench

> Source: https://aiwiki.ai/wiki/agentbench
> Updated: 2026-06-23
> Categories: AI Agents, AI Benchmarks, Large Language Models
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**AgentBench** is a multi-dimensional [benchmark](/wiki/benchmark) for evaluating [large language models](/wiki/large_language_model) (LLMs) as autonomous agents across eight distinct interactive environments, and was the first systematic framework to assess LLM agent ability across a broad range of real-world tasks under standardized conditions. Introduced in August 2023 by researchers from [Tsinghua University](/wiki/tsinghua_university), The Ohio State University, and [UC Berkeley](/wiki/uc_berkeley), it tests how well LLMs operate as agents that take actions, observe outcomes, and make decisions in multi-turn settings.[^1] Its headline finding is that top commercial models such as [GPT-4](/wiki/gpt-4) are strong agents while open-source models lag far behind: the paper states there is "a significant disparity in performance between them and many OSS competitors that are no larger than 70B."[^1] The paper was accepted as a poster at the [International Conference on Learning Representations](/wiki/iclr) (ICLR) 2024, and the project is released under the Apache 2.0 license through the THUDM organization on GitHub.[^2][^3]

Unlike traditional benchmarks that evaluate LLMs on static question-answering or text completion, AgentBench places models in interactive settings where they must reason over observations, plan sequences of actions, and adapt to changing environments. It contains 1,360 test samples across the eight environments, and the original evaluation covered 27 LLMs (later expanded to 29), including both commercial API-based models and open-source alternatives.[^1] The authors summarize the core obstacle bluntly: "poor long-term reasoning, decision-making, and instruction following abilities are the main obstacles for developing usable LLM agents."[^1] A v3 revision of the arXiv preprint was uploaded on October 4, 2025, shortly before the October 10 release of the function-calling implementation that integrates AgentBench with the [AgentRL](#agentbench-fc-and-agentrl-integration) reinforcement-learning framework.[^4][^5]

## Background and Motivation

The rapid development of LLMs has expanded their role far beyond text generation. Researchers and practitioners increasingly deploy LLMs as the core reasoning engine for autonomous agents that interact with external tools, databases, operating systems, and web interfaces. However, as of mid-2023, no comprehensive benchmark existed to systematically evaluate LLMs in these agentic roles across a broad range of tasks.[^1]

Prior evaluation efforts tended to focus on individual environments. ALFWorld evaluated household task planning,[^6] WebShop tested online shopping behavior,[^7] and Mind2Web assessed web browsing capabilities.[^8] Each of these benchmarks provided valuable insight into specific agent skills, but none offered a holistic view of an LLM's general-purpose agent capabilities. Existing benchmarks also typically targeted fine-tuned models, making it difficult to compare prompted LLMs directly.

AgentBench was created to fill this gap. The authors designed a unified evaluation framework spanning eight environments that together cover code interaction, game strategy, web navigation, and [knowledge graph](/wiki/knowledge_graph) querying. By testing the same set of LLMs across all eight tasks under standardized conditions, AgentBench enables comprehensive comparisons of agent performance.[^1]

## What environments does AgentBench cover?

AgentBench organizes its eight evaluation environments into three categories: code-grounded, game-grounded, and web-grounded. Five environments were newly created by the research team (operating system, database, knowledge graph, digital card game, and lateral thinking puzzles), while three were adapted from existing published datasets (house-holding, web shopping, and web browsing).[^1]

### Code-Grounded Environments

#### Operating System (OS)

The Operating System environment tests an LLM's ability to interact with an Ubuntu Linux system through bash commands. Each task provides a natural language instruction (for example, "Find the total number of non-empty directories inside the /etc directory"), and the model must issue appropriate bash commands to accomplish the goal. The action space includes any valid bash command, and the observation returned is the standard output of the executed command.[^1]

Tasks are divided into two types. The first type asks for a deterministic answer, such as counting files or retrieving specific system information. The second type involves operational goals like changing file permissions or modifying system configurations. The evaluation metric is **success rate (SR)**, where a task is considered successful only if the final answer or system state matches the expected outcome exactly. The development set contains 26 tasks and the test set contains 144 tasks. The environment runs inside [Docker](https://www.docker.com/) containers for isolation and reproducibility.[^1]

#### Database (DB)

The Database environment evaluates an LLM's ability to query relational databases using [SQL](https://en.wikipedia.org/wiki/SQL). Tasks present a natural language question about data stored in a MySQL database (for example, "What was the total number of medals won by the United States?"), and the model must write and execute SQL queries to retrieve the correct answer.[^1]

Datasets were derived from WikiSQL and augmented with additional variety to cover aggregation, joins, filtering, and subqueries. The model interacts with a MySQL command-line interface, receiving query results as observations after each statement. The evaluation metric is **success rate (SR)** based on whether the final query output matches the expected answer. The development set contains 60 tasks and the test set contains 300 tasks.[^1]
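The answer check described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's own harness: it swaps in Python's built-in SQLite for MySQL so it runs without a server, and the `medals` table, its rows, and the helper names `run_query` and `is_success` are all hypothetical.

```python
import sqlite3


def run_query(conn, sql):
    """Execute one SQL statement and return every result row as a tuple."""
    return conn.execute(sql).fetchall()


def is_success(rows, expected):
    """Success means the final query output matches the expected answer exactly."""
    return rows == expected


# Hypothetical in-memory table standing in for a benchmark database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE medals (country TEXT, gold INTEGER, silver INTEGER, bronze INTEGER)"
)
conn.executemany(
    "INSERT INTO medals VALUES (?, ?, ?, ?)",
    [("United States", 39, 41, 33), ("Japan", 27, 14, 17)],
)

# The query a model might issue for the example question about total medals.
answer = run_query(
    conn,
    "SELECT SUM(gold + silver + bronze) FROM medals WHERE country = 'United States'",
)
print(answer)                          # [(113,)]
print(is_success(answer, [(113,)]))    # True
```

Exact matching is strict by design: a query that returns the right number in an extra column, or as a differently typed value, would count as a failure under this sketch.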

#### Knowledge Graph (KG)

The Knowledge Graph environment assesses an LLM's ability to answer questions using a large-scale knowledge base. Tasks require the model to navigate and query a knowledge graph using basic querying tools provided by the environment. A typical task might ask, "Find tropical cyclones that are similar to Hurricane Marie and affected Eastern North America."[^1]

This environment tests the model's ability to make decisions with incomplete information, since the model cannot see the full knowledge graph and must explore it step by step. The evaluation metric is **answer F1 score**, measuring overlap between the model's answer and the ground truth. The development set contains 20 tasks and the test set contains 150 tasks.[^1]
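As a rough illustration of the answer F1 metric, the sketch below treats the predicted and gold answers as sets of entities and computes the harmonic mean of precision and recall. The function name and the example entities are hypothetical, and the benchmark's exact matching rules (for instance, how entity identifiers are normalized) may differ.

```python
def answer_f1(predicted, gold):
    """Set-based F1 between predicted and gold answer entities."""
    pred, ref = set(predicted), set(gold)
    if not pred and not ref:
        return 1.0  # both empty: treat as a perfect match
    overlap = len(pred & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical answer sets: two of three predictions are correct,
# and two of four gold entities are found.
score = answer_f1(
    ["cyclone_a", "cyclone_b", "cyclone_c"],
    ["cyclone_b", "cyclone_c", "cyclone_d", "cyclone_e"],
)
print(round(score, 4))  # 0.5714
```

Because F1 rewards partial overlap, a model that finds some of the correct entities still earns credit, unlike the exact-match success rate used in the OS and DB environments.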

### Game-Grounded Environments

#### Digital Card Game (DCG)

The Digital Card Game environment uses Aquawar, a simplified card game system from the 2021 THU Agent Competition (THUAC). In Aquawar, the LLM controls a team of fish characters, each with different abilities, and battles against an algorithm-based opponent in a turn-based format. The model must make strategic decisions each turn about which fish to deploy and which abilities to use.[^1]

This environment tests strategic [reasoning](/wiki/reasoning) and planning under uncertainty, as the model must anticipate the opponent's moves and manage limited resources. The evaluation metric is **win rate** against the built-in opponent algorithm. The development set contains 12 tasks and the test set contains 20 tasks.[^1]

#### Lateral Thinking Puzzles (LTP)

The Lateral Thinking Puzzles environment presents situation puzzles, a popular group game format where one player hosts a riddle and others must figure out the scenario by asking yes-or-no questions. The host responds with "yes," "no," or "irrelevant" to guide the guesser toward the solution.[^1]

In AgentBench, the puzzle plot is simplified into several key points that represent game progress. The LLM plays the role of the guesser and must ask strategic questions to uncover the puzzle solution. This environment evaluates the model's lateral reasoning abilities and its capacity to form and test hypotheses through targeted questioning. The evaluation metric is **game progress**, measuring what fraction of the key plot points the model successfully uncovers. The development set contains 20 tasks and the test set contains 50 tasks.[^1]
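The game progress metric can be read as a simple fraction, sketched below. The key-point labels are hypothetical placeholders, and the sketch assumes each key point is either uncovered or not, with no partial credit.

```python
def game_progress(uncovered, key_points):
    """Fraction of a puzzle's key plot points that the guesser uncovered."""
    points = set(key_points)
    if not points:
        raise ValueError("a puzzle needs at least one key point")
    return len(set(uncovered) & points) / len(points)


# Hypothetical puzzle with five key points, two of which were uncovered.
progress = game_progress(
    ["point_1", "point_4"],
    ["point_1", "point_2", "point_3", "point_4", "point_5"],
)
print(progress)  # 0.4
```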

#### House-Holding (HH)

The House-Holding environment is based on ALFWorld, a text-based game environment derived from the TextWorld toolkit. Tasks require the LLM to perform common household activities in a simulated home, such as "Put a clean pan on the dining table" or "Find a pencil and place it on the desk."[^1][^6]

The model must navigate rooms, interact with objects, and execute multi-step plans that require commonsense reasoning about the physical world. To clean a pan, for instance, the model needs to pick it up, go to the sink, use the sink, and then navigate to the dining table and place the pan there. The evaluation metric is **success rate (SR)**. The development set contains 20 tasks and the test set contains 50 tasks.[^1]

### Web-Grounded Environments

#### Web Shopping (WS)

The Web Shopping environment uses WebShop, a simulated e-commerce website where the LLM must find and purchase products that match a given specification. The model searches for products, browses listings, reads product descriptions, selects options (such as size and color), and makes purchasing decisions.[^1][^7]

Originally designed for evaluation of specifically trained models, AgentBench adapted WebShop to assess LLMs with prompting alone, without additional [fine-tuning](/wiki/fine_tuning). The evaluation metric is **reward**, a score between 0 and 1 that measures how well the selected product matches the desired specification. The development set contains 80 tasks and the test set contains 200 tasks.[^1]

#### Web Browsing (WB)

The Web Browsing environment is based on [Mind2Web](/wiki/mind2web), which presents general-purpose web interaction tasks across multiple website domains. Tasks require the LLM to perform actions such as clicking buttons, selecting dropdown options, and typing text on real website interfaces.[^1][^8]

Compared to Mind2Web's original evaluation setting, AgentBench adapted the tasks to allow assessment of prompted LLMs without fine-tuning. The evaluation metric is **step success rate**, measuring the proportion of individual interaction steps the model executes correctly. The development set contains 31 tasks and the test set contains 100 tasks.[^1]
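A step-level metric of this kind can be sketched as follows. The sketch assumes, following Mind2Web's definition, that a step counts as correct only when both the selected element and the operation match the reference; the `Step` type and the example element identifiers are hypothetical.

```python
from typing import NamedTuple


class Step(NamedTuple):
    element: str    # identifier of the page element acted on
    operation: str  # e.g. "CLICK", "SELECT", or "TYPE:<text>"


def step_success_rate(predicted, gold):
    """Share of steps where both the element and the operation match the reference."""
    if not gold:
        raise ValueError("the reference trajectory must contain at least one step")
    if len(predicted) != len(gold):
        raise ValueError("predicted and reference trajectories must align step by step")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


gold = [Step("search_box", "TYPE:laptop"), Step("search_btn", "CLICK"), Step("sort_menu", "SELECT")]
pred = [Step("search_box", "TYPE:laptop"), Step("search_btn", "CLICK"), Step("filter_menu", "SELECT")]
rate = step_success_rate(pred, gold)
print(round(rate, 3))  # 0.667
```

Scoring per step rather than per task gives partial credit for trajectories that start correctly but go wrong later.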

### Dataset Summary

| Environment | Category | Tasks (Dev) | Tasks (Test) | Metric | New/Adapted |
|---|---|---|---|---|---|
| Operating System (OS) | Code-grounded | 26 | 144 | Success Rate | New |
| Database (DB) | Code-grounded | 60 | 300 | Success Rate | New |
| Knowledge Graph (KG) | Code-grounded | 20 | 150 | Answer F1 | New |
| Digital Card Game (DCG) | Game-grounded | 12 | 20 | Win Rate | New |
| Lateral Thinking Puzzles (LTP) | Game-grounded | 20 | 50 | Game Progress | New |
| House-Holding (HH) | Game-grounded | 20 | 50 | Success Rate | Adapted from ALFWorld |
| Web Shopping (WS) | Web-grounded | 80 | 200 | Reward | Adapted from WebShop |
| Web Browsing (WB) | Web-grounded | 31 | 100 | Step Success Rate | Adapted from Mind2Web |

In total the benchmark comprises 1,360 test samples across the eight environments. The full evaluation requires approximately 4,000 LLM generation calls for the development set and 13,000 for the test set, totaling about 17,000 calls per model across all environments.[^1]

## How does AgentBench evaluate models?

### Architecture

AgentBench uses a modular server-client architecture that separates the evaluation into three components: Task Servers, Agent Servers, and an Evaluation Client. These components communicate through HTTP, allowing them to run on separate machines if needed. Each environment runs inside its own Docker container, providing isolation and ensuring that one environment's state does not interfere with another. The framework supports running multiple model-environment pairs simultaneously.[^1]

The architecture incorporates an Edmonds-Karp max-flow algorithm to optimize scheduling of evaluation tasks across available resources. It also supports resumable evaluation, so if a run is interrupted, it can pick up where it left off without re-running completed tasks.[^1]
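To make the scheduling idea concrete, here is a generic Edmonds-Karp sketch (breadth-first search for shortest augmenting paths). How AgentBench actually builds its flow network is not described here, so the graph below is an assumption: edges from `source` to agents carry each agent's concurrency limit, agent-to-task edges carry pending work, and edges from tasks to `sink` carry each task server's free worker slots. All node names are hypothetical.

```python
from collections import defaultdict, deque


def edmonds_karp(capacity, source, sink):
    """Return (max_flow, residual) for a directed graph given as nested capacity dicts."""
    residual = defaultdict(lambda: defaultdict(int))
    for u, edges in capacity.items():
        for v, c in edges.items():
            residual[u][v] += c
            residual[v][u] += 0  # make sure the reverse edge exists
    total = 0
    while True:
        # Breadth-first search for the shortest augmenting path.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, c in residual[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return total, residual
        # Find the bottleneck capacity along the path, then push that much flow.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        total += bottleneck


# Hypothetical scheduling graph: two agents, two task servers with one free slot each.
capacity = {
    "source": {"agent_a": 2, "agent_b": 1},
    "agent_a": {"task_os": 1, "task_db": 1},
    "agent_b": {"task_os": 1},
    "task_os": {"sink": 1},
    "task_db": {"sink": 1},
}
max_flow, residual = edmonds_karp(capacity, "source", "sink")

# Flow on each original edge (valid here because the graph has no antiparallel edges).
flow = {
    (u, v): c - residual[u][v]
    for u, edges in capacity.items()
    for v, c in edges.items()
    if c - residual[u][v] > 0
}
print(max_flow)  # 2
```

In this reading, the maximum flow is the number of agent-task pairs that can run at once, and the positive entries of `flow` on agent-to-task edges say which pairs to launch.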

### Prompting Strategy

AgentBench evaluates all models using [chain-of-thought](/wiki/chain_of_thought) (CoT) prompting. In each interaction turn, the model receives a description of the current state (observation), generates a reasoning chain, and then outputs an action. The environment executes the action and returns the next observation, continuing this loop until the task is complete or a turn limit is reached.[^1]

The number of interaction turns per task ranges from 5 to 50, depending on environment complexity. This multi-turn format tests not just the model's ability to generate a single correct response, but its capacity to maintain coherent plans over extended interaction sequences.[^1]
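The observe, reason, act loop with a turn limit can be sketched with a toy environment. Nothing below comes from the AgentBench codebase: `GuessEnv`, `scripted_agent`, and `run_episode` are hypothetical names, and the scripted agent stands in for an LLM call that would return a reasoning chain and an action.

```python
from dataclasses import dataclass


@dataclass
class GuessEnv:
    """Toy environment: the task is complete once the agent names the target number."""
    target: int
    done: bool = False

    def reset(self):
        self.done = False
        return "Guess a number between 1 and 10."

    def step(self, action):
        guess = int(action)
        if guess == self.target:
            self.done = True
            return "Correct."
        return "Too low." if guess < self.target else "Too high."


def scripted_agent(instruction, history):
    """Stand-in for an LLM: returns (thought, action) given the dialogue so far."""
    low, high = 1, 10
    for action, observation in history:
        if observation == "Too low.":
            low = int(action) + 1
        elif observation == "Too high.":
            high = int(action) - 1
    thought = f"The answer lies in [{low}, {high}]; try the midpoint."
    return thought, str((low + high) // 2)


def run_episode(env, agent, max_turns):
    """Loop until the task is complete or the turn limit is reached."""
    instruction = env.reset()
    history = []
    for turn in range(1, max_turns + 1):
        thought, action = agent(instruction, history)  # reasoning chain, then action
        observation = env.step(action)                 # environment executes the action
        history.append((action, observation))
        if env.done:
            return "completed", turn
    return "task limit exceeded", max_turns


print(run_episode(GuessEnv(target=7), scripted_agent, max_turns=5))  # ('completed', 4)
print(run_episode(GuessEnv(target=7), scripted_agent, max_turns=3))  # ('task limit exceeded', 3)
```

The second call shows why the turn budget matters: the same agent that solves the task in four turns is recorded as a failure when only three are allowed.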

### Scoring Methodology

Because the eight environments use different metrics (success rate, step success rate, F1 score, win rate, reward, and game progress), raw scores are not directly comparable, so AgentBench normalizes them before computing an overall score. Each environment's raw scores are rescaled so that the average across all evaluated models equals 1.0. The overall AgentBench score is then the mean of a model's normalized scores across all eight environments.[^1]

For consistency in future evaluations, fixed normalization weights are calculated as the reciprocal of the average score per task from the initial evaluation round. This ensures that adding new models to the leaderboard does not change the relative scoring of previously evaluated models.[^1]
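The normalization can be sketched as follows, using made-up scores for three reference models on two environments. The function names and all numbers are hypothetical; only the procedure (weight equals the reciprocal of the reference average, overall equals the mean of weighted scores) follows the description above.

```python
def fixed_weights(reference_scores):
    """Weight per environment = 1 / (average raw score over the reference models)."""
    environments = next(iter(reference_scores.values())).keys()
    weights = {}
    for env in environments:
        mean = sum(scores[env] for scores in reference_scores.values()) / len(reference_scores)
        if mean == 0:
            raise ValueError(f"cannot normalize {env}: reference average is zero")
        weights[env] = 1.0 / mean
    return weights


def overall_score(model_scores, weights):
    """Mean of the model's weighted (normalized) scores across environments."""
    return sum(model_scores[env] * w for env, w in weights.items()) / len(weights)


# Hypothetical raw scores from an initial evaluation round.
reference = {
    "model_a": {"env_x": 40.0, "env_y": 60.0},
    "model_b": {"env_x": 20.0, "env_y": 30.0},
    "model_c": {"env_x": 0.0, "env_y": 30.0},
}
weights = fixed_weights(reference)
overall = {name: overall_score(scores, weights) for name, scores in reference.items()}

# A model added later is scored with the same frozen weights.
late_model = overall_score({"env_x": 10.0, "env_y": 80.0}, weights)
print({name: round(value, 3) for name, value in overall.items()})
print(round(late_model, 3))
```

By construction the reference models average an overall score of 1.0, and because the weights are frozen, scoring `late_model` leaves the earlier models' scores untouched.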

## Which models did AgentBench evaluate?

The original v1 AgentBench evaluation tested 27 LLMs spanning commercial API-based models and open-source alternatives.[^1] Later updates added models including GLM-4 and Claude 3 (both first released in 2024), bringing the total to 29.[^4]

### API-Based Commercial Models

| Model | Overall Score |
|---|---|
| [GPT-4](/wiki/gpt-4) (0613) | 4.01 |
| Claude 3 (Opus) | 3.11 |
| GLM-4 | 2.89 |
| Claude 2 | 2.49 |
| Claude v1.3 | 2.44 |
| [GPT-3.5](/wiki/gpt-3.5)-turbo | 2.32 |
| text-davinci-003 | 1.71 |
| claude-instant | 1.60 |
| chat-bison-001 | 1.39 |
| text-davinci-002 | (below 1.00) |

### Open-Source Models

| Model | Parameters | Overall Score (qualitative where no figure is listed) |
|---|---|---|
| [CodeLlama](/wiki/code_llama)-34B | 34B | 0.96 |
| Vicuna-13B | 13B | 0.93 |
| [Llama 2](/wiki/llama_2)-70B | 70B | 0.78 |
| openchat-8192-13B | 13B | Best OSS in some tasks |
| WizardLM-30B | 30B | Moderate |
| Vicuna-33B | 33B | Moderate |
| Guanaco-65B | 65B | Low |
| Guanaco-33B | 33B | Low |
| [Llama 2](/wiki/llama_2)-13B | 13B | Low |
| [Llama 2](/wiki/llama_2)-7B | 7B | Low |
| Vicuna-7B | 7B | Low |
| WizardLM-13B | 13B | Low |
| [CodeLlama](/wiki/code_llama)-13B | 13B | Low |
| [CodeLlama](/wiki/code_llama)-7B | 7B | Low |
| ChatGLM-6B | 6B | Low |
| CodeGeeX2-6B | 6B | Low |
| Dolly-12B | 12B | Low |
| OASST-12B | 12B | Low |
| Koala-13B | 13B | Low |

The average overall score for API-based models was 2.32, while open-source models averaged 0.51, representing a roughly 4.5x performance gap in the original evaluation.[^1]

## Detailed Results

### Per-Environment Scores for Top Models

| Model | OS (SR%) | DB (SR%) | KG (F1) | DCG (Win%) | LTP (Progress) | HH (SR%) | WS (Reward) | WB (Step SR%) | Overall |
|---|---|---|---|---|---|---|---|---|---|
| [GPT-4](/wiki/gpt-4) | 42.4 | 32.0 | 58.8 | 74.5 | 16.6 | 78.0 | 61.1 | 29.0 | 4.01 |
| Claude 2 | 18.1 | 27.3 | 41.3 | 55.5 | 8.4 | 54.0 | 61.4 | 0.0 | 2.49 |
| Claude v1.3 | 9.7 | 22.0 | 38.9 | 40.9 | 8.2 | 58.0 | 55.7 | 25.0 | 2.44 |
| [GPT-3.5](/wiki/gpt-3.5)-turbo | 32.6 | 36.7 | 25.9 | 33.7 | 10.5 | 16.0 | 64.1 | 20.0 | 2.32 |
| [CodeLlama](/wiki/code_llama)-34B | 2.8 | 14.0 | 23.5 | 8.4 | 0.7 | 4.0 | 52.1 | 20.0 | 0.96 |
| Vicuna-13B | 10.4 | 6.7 | 9.4 | 0.1 | 8.0 | 8.0 | 41.7 | 12.0 | 0.93 |
| [Llama 2](/wiki/llama_2)-70B | 9.7 | 13.0 | 8.0 | 21.3 | 0.0 | 2.0 | 5.6 | 19.0 | 0.78 |

### Key Observations from the Results

GPT-4 achieved the highest overall score of 4.01 and led performance in six of the eight environments. Its 78% success rate on House-Holding tasks demonstrated near-practical usability in that domain. GPT-4 also showed strong results in the Digital Card Game (74.5% win rate) and Knowledge Graph (58.8 F1), indicating solid strategic reasoning and information-retrieval capabilities.[^1]

GPT-3.5-turbo outperformed more expensive models in two specific areas: Database (36.7% SR, higher than GPT-4's 32.0%) and Web Shopping (64.1 reward, the highest among all models). This suggests that certain agentic tasks benefit more from specific training characteristics than from raw model scale.[^1]

Claude 2 and Claude v1.3 performed comparably to each other in overall terms (2.49 vs. 2.44), though their strengths differed. Claude 2 had better OS, DB, and KG performance, while Claude v1.3 outperformed Claude 2 in House-Holding (58% vs. 54%) and Web Browsing (25% vs. 0%). Notably, Claude 2 scored 0% on Web Browsing, indicating a complete failure to follow the required output format for that task.[^1]

Among open-source models, CodeLlama-34B achieved the highest overall score (0.96), largely driven by its strong Web Shopping performance (52.1 reward). However, it performed poorly on the Operating System task (2.8% SR) despite being a code-focused model, suggesting that interacting with a Linux system requires different skills than writing code. Vicuna-13B illustrates that alignment quality can outweigh model size: it outperformed Llama 2-13B, scored close to CodeLlama-34B (0.93 vs. 0.96), a model nearly three times its size, and finished ahead of Llama 2-70B (0.78), likely because Vicuna was trained on ShareGPT data generated by GPT-4 and GPT-3.5-turbo.[^1]
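The per-environment leaders cited in these observations can be recomputed directly from the table above. The short script below copies the table's values and reports which model has the highest score in each environment; it covers only the seven models listed in that table, not the full leaderboard.

```python
ENVIRONMENTS = ["OS", "DB", "KG", "DCG", "LTP", "HH", "WS", "WB"]

# Scores copied from the per-environment table, in the column order above.
SCORES = {
    "GPT-4":         [42.4, 32.0, 58.8, 74.5, 16.6, 78.0, 61.1, 29.0],
    "Claude 2":      [18.1, 27.3, 41.3, 55.5, 8.4, 54.0, 61.4, 0.0],
    "Claude v1.3":   [9.7, 22.0, 38.9, 40.9, 8.2, 58.0, 55.7, 25.0],
    "GPT-3.5-turbo": [32.6, 36.7, 25.9, 33.7, 10.5, 16.0, 64.1, 20.0],
    "Llama 2-70B":   [9.7, 13.0, 8.0, 21.3, 0.0, 2.0, 5.6, 19.0],
    "CodeLlama-34B": [2.8, 14.0, 23.5, 8.4, 0.7, 4.0, 52.1, 20.0],
    "Vicuna-13B":    [10.4, 6.7, 9.4, 0.1, 8.0, 8.0, 41.7, 12.0],
}

# For each environment, pick the model with the highest score in that column.
leaders = {
    env: max(SCORES, key=lambda model: SCORES[model][i])
    for i, env in enumerate(ENVIRONMENTS)
}
gpt4_wins = sum(1 for model in leaders.values() if model == "GPT-4")

print(leaders)
print(gpt4_wins)  # 6
```

The output matches the text: GPT-4 leads six environments, while GPT-3.5-turbo leads Database and Web Shopping.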

## Why do LLM agents fail on AgentBench?

The paper provides a detailed analysis of why models fail, categorizing execution outcomes into several types.[^1]

### Error Types

| Error Type | Description | Dominant In |
|---|---|---|
| Task Limit Exceeded (TLE) | The model runs out of allowed interaction turns without completing the task | LTP (82.5%), KG (67.9%) |
| Invalid Format | The model's output does not follow the required response format | DB (53.3%), DCG (38.5%) |
| Invalid Action | The model generates an action that is not valid in the current environment state | HH (64.1%) |
| Context Limit Exceeded (CLE) | The conversation exceeds the model's context window | Various environments |
| Task Error | Unexpected errors during task execution | Minor contributor |

**Task Limit Exceeded** was the most common failure mode overall, particularly in environments requiring extended reasoning chains like Lateral Thinking Puzzles and Knowledge Graph tasks. This indicates that many models struggle with long-term planning and cannot efficiently converge on solutions within the allowed number of turns.[^1]

**Invalid Format** errors were most prevalent in structured output environments like Database and Digital Card Game, where models must produce outputs in a specific syntax (SQL queries or game commands). This points to weaknesses in instruction following when output constraints are strict.[^1]

**Invalid Action** errors dominated in the House-Holding environment, where models frequently attempted actions that were not physically possible in the current state (for example, trying to pick up an object that was not nearby). This reflects limitations in maintaining an accurate internal model of the environment state.[^1]
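A breakdown like the one in the error table can be produced by tallying one outcome label per episode. The sketch below is illustrative only: the `Outcome` enum mirrors the categories named above, and the ten sample episodes are invented rather than taken from the paper.

```python
from collections import Counter
from enum import Enum


class Outcome(Enum):
    COMPLETED = "completed"
    TASK_LIMIT_EXCEEDED = "task limit exceeded"
    INVALID_FORMAT = "invalid format"
    INVALID_ACTION = "invalid action"
    CONTEXT_LIMIT_EXCEEDED = "context limit exceeded"
    TASK_ERROR = "task error"


def outcome_shares(outcomes):
    """Percentage of episodes ending in each outcome category."""
    counts = Counter(outcomes)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no episodes to summarize")
    return {outcome: 100.0 * counts[outcome] / total for outcome in Outcome}


# Ten hypothetical episodes from a single environment.
episodes = (
    [Outcome.COMPLETED] * 3
    + [Outcome.TASK_LIMIT_EXCEEDED] * 5
    + [Outcome.INVALID_FORMAT]
    + [Outcome.INVALID_ACTION]
)
shares = outcome_shares(episodes)
for outcome, share in shares.items():
    print(f"{outcome.value}: {share:.1f}%")
```

Keeping completed runs in the same tally makes the shares sum to 100%, so each failure mode can be read as a fraction of all episodes rather than of failures only.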

### How big is the gap between commercial and open-source models?

The most striking finding from AgentBench is the substantial gap between commercial and open-source models. The paper frames it as "a significant disparity in performance between them and many OSS competitors that are no larger than 70B."[^1] Commercial API-based models averaged an overall score of 2.32, while open-source models averaged 0.51, a roughly 4.5x difference that was consistent across most environments. The gap was especially pronounced in environments requiring complex reasoning: among the models in the per-environment table above, GPT-4 reached 42.4% success on Operating System tasks against a best open-source result of 10.4% (Vicuna-13B), and 58.8 F1 on Knowledge Graph tasks against 23.5 for the best open-source model (CodeLlama-34B).[^1]

The researchers attributed this gap to three main factors, writing that "poor long-term reasoning, decision-making, and instruction following abilities are the main obstacles for developing usable LLM agents."[^1] They noted that training on high-quality multi-round alignment data and improving instruction following could help close this gap.[^1]

## When was AgentBench released, and how has it been updated?

AgentBench has been released in several iterations between 2023 and 2025:[^3][^4]

- **v0.1 (August 2023):** Original release accompanying the arXiv preprint, containing all eight environments and evaluation of 27 LLMs. arXiv v1 was submitted on August 7, 2023.[^9]
- **v0.2 (late 2023):** Updated version with refinements to the evaluation framework. arXiv v2 followed on October 25, 2023.[^10] Results for newer models including GLM-4 and Claude 3, both first released in 2024, were added afterwards, bringing the total to 29 models.
- **VisualAgentBench (August 2024):** A sibling benchmark for large multimodal models as visual foundation agents, with five new environments spanning embodied AI, GUI interaction, and visual design (see [below](#visualagentbench)).[^11]
- **AgentBench FC (October 10, 2025):** A function-calling reimplementation of AgentBench, integrated with the AgentRL framework. arXiv v3 was uploaded on October 4, 2025.[^4][^5]

The project is maintained under the THUDM (Tsinghua University Data Mining) GitHub organization. As of mid-2026 the main branch of the repository has accumulated roughly 3,400 stars and continues to receive updates.[^3]

### AgentBench FC and AgentRL Integration

The AgentBench FC version, released on October 10, 2025, is a substantial reimplementation built around modern function-calling LLM APIs rather than the original CoT-text-action loop. It fully containerizes five tasks with Docker Compose for one-command deployment: `alfworld` (AF), `dbbench` (DB), `knowledgegraph` (KG), `os_interaction` (OS), and `webshop` (WS). The WebShop environment alone requires approximately 16 GB of RAM.[^3]

FC is co-released with **AgentRL**, an "end-to-end multitask and multiturn LLM Agent RL framework" introduced in the paper "AgentRL: Scaling Agentic Reinforcement Learning with a Multi-Turn, Multi-Task Framework" (arXiv:2510.04206, October 5, 2025).[^5] AgentRL features a fully asynchronous generation-training pipeline, a unified function-call API across environments, cross-policy sampling for exploration in multi-turn settings, and task-advantage normalization to stabilize multi-task training. The framework uses AgentBench FC's containerized environments to provide the training and evaluation substrate, and the AgentRL paper reports models trained with the framework that outperform [GPT-5](/wiki/gpt-5), Claude Sonnet 4, and [DeepSeek-R1](/wiki/deepseek_r1) on the tested benchmarks.[^5]

### VisualAgentBench

VisualAgentBench (VAB), published as arXiv:2408.06327 in August 2024, extends the AgentBench methodology to large multimodal models (LMMs) acting as visual foundation agents.[^11] Built on the same modular evaluation backbone as AgentBench, VAB introduces five new environments spanning three categories:

- **VAB-OmniGibson** and **VAB-Minecraft** (Embodied)
- **VAB-Mobile** and **VAB-WebArena-Lite** (Graphical User Interface)
- **VAB-CSS** (Visual Design)

VAB also adds trajectory training datasets for behaviour cloning, enabling researchers to develop visual foundation agents through fine-tuning open LMMs, a capability the text-only AgentBench lacked.[^11]

## Impact and Significance

AgentBench has had a notable impact on the [AI agent](/wiki/ai_agents) evaluation landscape. It was the first benchmark to provide a unified, multi-environment assessment of LLM agent capabilities, setting a precedent for how agentic AI systems should be evaluated.[^1]

### Contributions to the Field

The benchmark made several contributions:

1. **Systematic multi-environment evaluation.** Before AgentBench, agent evaluations were fragmented across individual benchmarks. AgentBench demonstrated the value of testing the same model across diverse tasks to reveal strengths and weaknesses that single-environment benchmarks miss.[^1]
2. **Quantification of the gap between commercial and open-source models.** By directly comparing 27 to 29 models under identical conditions, AgentBench provided the first rigorous quantification of how far behind open-source models were in agentic tasks as of 2023.[^1]
3. **Identification of key bottlenecks.** The failure analysis pinpointed specific capabilities (long-term reasoning, instruction following, format compliance) that needed improvement, giving the research community clear targets for future work.[^1]
4. **Reusable evaluation infrastructure.** The open-source framework, with its modular server-client architecture and Docker-based environments, can be extended with new tasks and models.[^3]

### Influence on Subsequent Work

AgentBench appeared amid a wave of agent-focused benchmarks and evaluation frameworks, several of which followed it. Related benchmarks include:

- **[SWE-bench](/wiki/swe-bench)** (October 2023) for software-engineering tasks based on real GitHub issues.
- **[WebArena](/wiki/webarena)** (July 2023) for long-horizon web tasks across realistic websites, and **[VisualWebArena](/wiki/visualwebarena)** (January 2024) for visual web tasks.
- **[GAIA](/wiki/gaia_benchmark)** (November 2023, Meta/Hugging Face) for general-assistant questions requiring tool use.
- **[OSWorld](/wiki/osworld)** (April 2024) for full-stack computer use across operating systems.
- **[tau-bench](/wiki/tau-bench)** (June 2024, Sierra) for tool-agent-user policy-adherence dialogues.
- **MedAgentBench** (January 2025), a Stanford-led adaptation that benchmarks medical LLM agents on a FHIR-compliant electronic health record environment with 300 clinically derived tasks; the strongest model in the initial study, Claude 3.5 Sonnet v2, achieved a 69.67% success rate.[^12]
- **FHIR-AgentBench** (September 2025), another medical-EHR-focused benchmark for interoperable health record question answering.[^13]

The benchmark's emphasis on practical, real-world tasks rather than synthetic puzzles helped shift the evaluation paradigm for LLM agents toward interactive, multi-turn settings. AgentBench is now frequently grouped with these benchmarks in surveys of agent evaluation, where it is positioned as a breadth-oriented diagnostic that complements depth-oriented suites such as SWE-bench (coding), OSWorld (computer use), and GAIA (general assistance).[^14]

The benchmark also influenced model development. The finding that alignment quality (as demonstrated by Vicuna's performance relative to its base model) matters as much as scale for agent tasks encouraged research into better alignment techniques specifically targeting multi-turn, agentic interactions.[^1]

## What are AgentBench's limitations?

While AgentBench represents a significant advance in agent evaluation, the authors and subsequent researchers have identified several limitations:

- **Fixed opponent in DCG.** The Digital Card Game uses a fixed algorithm-based opponent, so win rates may not reflect performance against adaptive or human opponents.[^1]
- **Snapshot evaluation.** Results reflect model capabilities at a specific point in time. LLMs improve rapidly, and the 2023 scores (largely retained through later arXiv revisions) are not representative of frontier models released in 2024-2026 such as [GPT-4o](/wiki/gpt-4), [Claude](/wiki/claude) 3.5/4, [Gemini](/wiki/gemini) 2.x, and [Llama](/wiki/llama) 3/4. The October 2025 v3 preprint and the FC release did not include a refreshed leaderboard for current frontier models.[^4]
- **Prompting sensitivity.** Results can vary based on prompt design. While CoT prompting was used consistently in v1/v2, the FC release re-formulates the interaction in a function-calling style, so direct numerical comparisons across versions are not straightforward.[^3]
- **Limited environment diversity.** Although eight environments is broader than most prior benchmarks, they do not cover all relevant agent domains, such as robotic control, scientific experimentation, GUI-based computer use at the operating-system level, or collaborative multi-agent scenarios. Surveys of agent evaluation in 2025 note that legacy single-agent suites like AgentBench miss the dynamics of agents collaborating or competing.[^14]
- **Evaluation cost.** Running the full AgentBench suite requires approximately 17,000 LLM generation calls (development plus test sets), which can be expensive for large commercial models.[^1]
- **Saturation and contamination risk.** Survey work in 2025 identifies AgentBench as one of several legacy benchmark suites at risk of saturation and data leakage, requiring versioning, contamination checks, and periodic task renewal to remain predictive of real-world performance.[^14]

## Authors

AgentBench was developed by a team of 22 researchers across three institutions, with Yuxiao Dong and Jie Tang of Tsinghua serving as corresponding authors.[^1]

- **[Tsinghua University](/wiki/tsinghua_university)** (KEG / THUDM): Xiao Liu (lead author), Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Minlie Huang, Yuxiao Dong, and Jie Tang.
- **The Ohio State University:** Yu Gu, Xiang Deng, Yu Su, and Huan Sun.
- **[UC Berkeley](/wiki/uc_berkeley):** Sheng Shen and Tianjun Zhang.

Several authors are also affiliated with Zhipu AI, the commercial spin-out responsible for the GLM family of models.

## See Also

- [Benchmark](/wiki/benchmark)
- [Large Language Model](/wiki/large_language_model)
- [AI Agents](/wiki/ai_agents)
- [Chain of Thought](/wiki/chain_of_thought)
- [SWE-bench](/wiki/swe-bench)
- [GAIA Benchmark](/wiki/gaia_benchmark)
- [OSWorld](/wiki/osworld)
- [tau-bench](/wiki/tau-bench)
- [WebArena](/wiki/webarena)
- [Mind2Web](/wiki/mind2web)
- [Tool Use](/wiki/tool_use)
- [ReAct](/wiki/react)

## References

[^1]: Liu, X., Yu, H., Zhang, H., Xu, Y., Lei, X., Lai, H., Gu, Y., Ding, H., Men, K., Yang, K., Zhang, S., Deng, X., Zeng, A., Du, Z., Zhang, C., Shen, S., Zhang, T., Su, Y., Sun, H., Huang, M., Dong, Y., & Tang, J. (2023). "AgentBench: Evaluating LLMs as Agents." arXiv:2308.03688. https://arxiv.org/abs/2308.03688
[^2]: Liu, X. et al. (2024). "AgentBench: Evaluating LLMs as Agents." Published as a poster at ICLR 2024. OpenReview. https://openreview.net/forum?id=zAdUB0aCTQ
[^3]: THUDM. "AgentBench: A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)." GitHub repository. https://github.com/THUDM/AgentBench
[^4]: Liu, X. et al. (2025). "AgentBench: Evaluating LLMs as Agents" (v3 revision, uploaded October 4, 2025). arXiv:2308.03688v3. https://arxiv.org/abs/2308.03688v3
[^5]: Zhang, H., Liu, X., Lv, B., Sun, X., Jing, B., Iong, I. L., Hou, Z., Qi, Z., Lai, H., Xu, Y., Lu, R., Wang, H., Tang, J., & Dong, Y. (2025). "AgentRL: Scaling Agentic Reinforcement Learning with a Multi-Turn, Multi-Task Framework." arXiv:2510.04206. https://arxiv.org/abs/2510.04206
[^6]: Shridhar, M., Yuan, X., Côté, M.-A., Bisk, Y., Trischler, A., & Hausknecht, M. (2020). "ALFWorld: Aligning Text and Embodied Environments for Interactive Learning." arXiv:2010.03768. https://arxiv.org/abs/2010.03768
[^7]: Yao, S., Chen, H., Yang, J., & Narasimhan, K. (2022). "WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents." NeurIPS 2022. https://arxiv.org/abs/2207.01206
[^8]: Deng, X., Gu, Y., Zheng, B., Chen, S., Stevens, S., Wang, B., Sun, H., & Su, Y. (2023). "Mind2Web: Towards a Generalist Agent for the Web." NeurIPS 2023 Spotlight. https://arxiv.org/abs/2306.06070
[^9]: arXiv:2308.03688v1 metadata (submitted August 7, 2023). https://arxiv.org/abs/2308.03688v1
[^10]: arXiv:2308.03688v2 metadata (submitted October 25, 2023). https://arxiv.org/abs/2308.03688v2
[^11]: Liu, X. et al. (2024). "VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents." arXiv:2408.06327. https://arxiv.org/abs/2408.06327
[^12]: Jiang, Y. et al. (2025). "MedAgentBench: A Realistic Virtual EHR Environment to Benchmark Medical LLM Agents." arXiv:2501.14654. https://arxiv.org/abs/2501.14654 ; Stanford ML Group project page. https://stanfordmlgroup.github.io/projects/medagentbench/
[^13]: "FHIR-AgentBench: Benchmarking LLM Agents for Realistic Interoperable EHR Question Answering." arXiv:2509.19319. https://arxiv.org/abs/2509.19319
[^14]: "Evaluation and Benchmarking of LLM Agents: A Survey." arXiv:2507.21504 (also in Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2). https://arxiv.org/abs/2507.21504