AgentBench is a multi-dimensional benchmark designed to evaluate large language models (LLMs) as autonomous agents across eight distinct interactive environments. Introduced by researchers from Tsinghua University, The Ohio State University, and UC Berkeley in August 2023, AgentBench provides the first systematic framework for assessing how well LLMs can operate as agents that take actions, observe outcomes, and make decisions in real-world scenarios. The benchmark was published as a conference paper at the International Conference on Learning Representations (ICLR) 2024 and is publicly available on GitHub under the Apache 2.0 license.
Unlike traditional benchmarks that evaluate LLMs on static question-answering or text completion tasks, AgentBench places models in multi-turn interactive settings where they must reason over observations, plan sequences of actions, and adapt to changing environments. The original evaluation covered 29 LLMs, including both commercial API-based models and open-source alternatives, revealing a significant performance gap between proprietary systems like GPT-4 and open-source models.
The rapid development of LLMs has expanded their role far beyond text generation. Researchers and practitioners increasingly deploy LLMs as the core reasoning engine for autonomous agents that interact with external tools, databases, operating systems, and web interfaces. However, as of mid-2023, no comprehensive benchmark existed to systematically evaluate LLMs in these agentic roles across a broad range of tasks.
Prior evaluation efforts tended to focus on individual environments. For example, ALFWorld evaluated household task planning, WebShop tested online shopping behavior, and Mind2Web assessed web browsing capabilities. Each of these benchmarks provided valuable insight into specific agent skills, but none offered a holistic view of an LLM's general-purpose agent capabilities. Existing benchmarks also typically required fine-tuned models, making it difficult to compare prompted LLMs directly.
AgentBench was created to fill this gap. The authors designed a unified evaluation framework spanning eight environments that together cover code interaction, game strategy, web navigation, and knowledge graph querying. By testing the same set of LLMs across all eight tasks under standardized conditions, AgentBench enables fair, comprehensive comparisons of agent performance.
AgentBench organizes its eight evaluation environments into three categories: code-grounded, game-grounded, and web-grounded. Five of these environments were newly created by the research team, while three were adapted from existing published datasets.
The Operating System environment tests an LLM's ability to interact with an Ubuntu Linux system through bash commands. Each task provides a natural language instruction (for example, "Find the total number of non-empty directories inside the /etc directory"), and the model must issue appropriate bash commands to accomplish the goal. The action space includes any valid bash command, and the observation returned is the standard output of the executed command.
Tasks are divided into two types. The first type asks for a deterministic answer, such as counting files or retrieving specific system information. The second type involves operational goals like changing file permissions or modifying system configurations. The evaluation metric is success rate (SR), where a task is considered successful only if the final answer or system state matches the expected outcome exactly.
The development set contains 26 tasks and the test set contains 144 tasks. The environment runs inside Docker containers for isolation and reproducibility.
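The execute-and-observe format can be sketched as a small driver loop. The helper names below are illustrative inventions, and the commands run on the host for simplicity; AgentBench itself executes agent commands inside a Docker container.

```python
import subprocess

def run_bash(command: str, timeout: int = 10) -> str:
    """Execute one bash action and return its stdout as the observation.
    (Illustrative sketch; AgentBench sandboxes commands in Docker.)"""
    result = subprocess.run(
        ["bash", "-c", command],
        capture_output=True, text=True, timeout=timeout,
    )
    return result.stdout.strip()

def evaluate_episode(actions: list[str], expected_answer: str) -> bool:
    """Replay a sequence of bash actions; the episode succeeds only if the
    final observation exactly matches the expected answer (the SR criterion)."""
    observation = ""
    for action in actions:
        observation = run_bash(action)
    return observation == expected_answer
```

For the directory-counting example above, an agent might emit something like `find /etc -mindepth 1 -type d ! -empty | wc -l` as its bash action (a plausible solution, not one taken from the benchmark).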
The Database environment evaluates an LLM's ability to query relational databases using SQL. Tasks present a natural language question about data stored in a MySQL database (for example, "What was the total number of medals won by the United States?"), and the model must write and execute SQL queries to retrieve the correct answer.
Datasets were derived from WikiSQL and augmented with additional variety to cover different query types including aggregation, joins, filtering, and subqueries. The model interacts with a MySQL command-line interface, receiving query results as observations after each SQL statement. The evaluation metric is success rate (SR) based on whether the final query output matches the expected answer.
The development set contains 60 tasks and the test set contains 300 tasks.
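To make the interaction concrete, the sketch below uses Python's built-in `sqlite3` in place of the MySQL command-line interface that AgentBench actually exposes; the table schema and rows are invented for the example.

```python
import sqlite3

# In-memory database with a made-up medals table for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE medals (country TEXT, gold INT, silver INT, bronze INT)")
conn.executemany(
    "INSERT INTO medals VALUES (?, ?, ?, ?)",
    [("United States", 39, 41, 33), ("China", 38, 32, 18)],
)

# The kind of SQL action an agent might emit for the question
# "What was the total number of medals won by the United States?"
query = """
SELECT gold + silver + bronze AS total
FROM medals
WHERE country = 'United States'
"""
(total,) = conn.execute(query).fetchone()
print(total)  # 39 + 41 + 33 = 113
```

The query result (here, `113`) is returned to the model as its next observation, and the task is scored by comparing the final output against the expected answer.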
The Knowledge Graph environment assesses an LLM's ability to answer questions using a large-scale knowledge base. Tasks require the model to navigate and query a knowledge graph using basic querying tools provided by the environment. A typical task might ask, "Find tropical cyclones that are similar to Hurricane Marie and affected Eastern North America."
This environment tests the model's ability to make decisions with incomplete information, since the model cannot see the full knowledge graph and must explore it step by step. The evaluation metric is Answer F1 score, which measures the overlap between the model's answer and the ground truth.
The development set contains 20 tasks and the test set contains 150 tasks.
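The Answer F1 metric can be sketched as set overlap between predicted and gold answer entities. This is the standard formulation; the benchmark's exact implementation may differ in details such as answer normalization.

```python
def answer_f1(predicted: set[str], gold: set[str]) -> float:
    """Set-level F1 between predicted and gold answer entities
    (illustrative sketch of the metric, not AgentBench's exact code)."""
    if not predicted or not gold:
        return 0.0
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)
```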
The Digital Card Game environment uses Aquawar, a simplified card game system from the 2021 THU Agent Competition (THUAC). In Aquawar, the LLM controls a team of fish characters, each with different abilities, and battles against an algorithm-based opponent in a turn-based format. The model must make strategic decisions each turn about which fish to deploy and which abilities to use.
This environment tests strategic reasoning and planning under uncertainty, as the model must anticipate the opponent's moves and manage limited resources. The evaluation metric is win rate against the built-in opponent algorithm.
The development set contains 12 tasks and the test set contains 20 tasks.
The Lateral Thinking Puzzles environment presents situation puzzles, a popular group game format where one player hosts a riddle and others must figure out the scenario by asking yes-or-no questions. The host responds with "yes," "no," or "irrelevant" to guide the guesser toward the solution.
In AgentBench, the puzzle plot is simplified into several key points that represent game progress. The LLM plays the role of the guesser and must ask strategic questions to uncover the puzzle solution. This environment evaluates the model's lateral reasoning abilities and its capacity to form and test hypotheses through targeted questioning. The evaluation metric is game progress, measuring what fraction of the key plot points the model successfully uncovers.
The development set contains 20 tasks and the test set contains 50 tasks.
The House-Holding environment is based on ALFWorld, a well-established text-based game environment derived from the TextWorld toolkit. Tasks require the LLM to perform common household activities in a simulated home, such as "Put a clean pan on the dining table" or "Find a pencil and place it on the desk."
The model must navigate rooms, interact with objects, and execute multi-step plans that require commonsense reasoning about the physical world. For instance, to clean a pan, the model needs to pick it up, go to the sink, use the sink, then navigate to the dining table and place the pan there. The evaluation metric is success rate (SR).
The development set contains 20 tasks and the test set contains 50 tasks.
The Web Shopping environment uses WebShop, a simulated e-commerce website where the LLM must find and purchase products that match a given specification. The model searches for products, browses listings, reads product descriptions, selects options (such as size and color), and makes purchasing decisions.
WebShop was originally designed to evaluate specifically trained models; AgentBench adapted it to assess LLMs through prompting alone, without additional fine-tuning. The evaluation metric is reward, a score between 0 and 1 that measures how well the selected product matches the desired specification.
The development set contains 80 tasks and the test set contains 200 tasks.
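A heavily simplified sketch of an attribute-matching reward in this spirit is shown below. The real WebShop reward also accounts for price constraints and text matching, so this is illustrative only.

```python
def shopping_reward(selected_attrs: set[str], goal_attrs: set[str]) -> float:
    """Fraction of required product attributes (category, options such as
    size and color, etc.) matched by the purchase. Simplified illustration,
    not the actual WebShop reward function."""
    if not goal_attrs:
        return 1.0
    return len(selected_attrs & goal_attrs) / len(goal_attrs)
```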
The Web Browsing environment is based on Mind2Web, which presents general-purpose web interaction tasks across multiple website domains. Tasks require the LLM to perform actions such as clicking buttons, selecting dropdown options, and typing text on real website interfaces.
Compared to Mind2Web's original evaluation setting, AgentBench adapted the tasks to allow assessment of prompted LLMs without fine-tuning. The evaluation metric is step success rate, measuring the proportion of individual interaction steps the model executes correctly.
The development set contains 31 tasks and the test set contains 100 tasks.
| Environment | Category | Tasks (Dev) | Tasks (Test) | Metric | New/Adapted |
|---|---|---|---|---|---|
| Operating System (OS) | Code-grounded | 26 | 144 | Success Rate | New |
| Database (DB) | Code-grounded | 60 | 300 | Success Rate | New |
| Knowledge Graph (KG) | Code-grounded | 20 | 150 | Answer F1 | New |
| Digital Card Game (DCG) | Game-grounded | 12 | 20 | Win Rate | New |
| Lateral Thinking Puzzles (LTP) | Game-grounded | 20 | 50 | Game Progress | New |
| House-Holding (HH) | Game-grounded | 20 | 50 | Success Rate | Adapted from ALFWorld |
| Web Shopping (WS) | Web-grounded | 80 | 200 | Reward | Adapted from WebShop |
| Web Browsing (WB) | Web-grounded | 31 | 100 | Step Success Rate | Adapted from Mind2Web |
AgentBench uses a modular server-client architecture that separates the evaluation into three components: Task Servers, Agent Servers, and an Evaluation Client. These components communicate through HTTP, allowing them to run on separate machines if needed.
Each environment runs inside its own Docker container, providing isolation and ensuring that one environment's state does not interfere with another. The framework supports running multiple model-environment pairs simultaneously, which is important given that the full evaluation requires approximately 4,000 LLM generation calls for the development set and 13,000 for the test set.
The architecture incorporates an Edmonds-Karp max-flow algorithm to optimize the scheduling of evaluation tasks across available resources. It also supports resumable evaluation, so if a run is interrupted, it can pick up where it left off without re-running completed tasks.
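For reference, Edmonds-Karp computes maximum flow by repeatedly augmenting along shortest paths found with breadth-first search. The minimal implementation below illustrates the algorithm on a generic graph; how AgentBench formulates its scheduling graph (workers, task servers, capacities) is specific to the framework and not reproduced here.

```python
from collections import deque, defaultdict

def edmonds_karp(capacity, source, sink):
    """Max flow via BFS augmenting paths (Edmonds-Karp).
    `capacity` maps (u, v) -> edge capacity; residuals live in `flow`."""
    graph = defaultdict(set)
    for u, v in capacity:
        graph[u].add(v)
        graph[v].add(u)  # include reverse edges for the residual graph
    flow = defaultdict(int)
    max_flow = 0
    while True:
        # BFS for a shortest augmenting path in the residual graph.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v in graph[u]:
                if v not in parent and capacity.get((u, v), 0) - flow[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return max_flow
        # Collect the path, find its bottleneck, and push flow along it.
        path = []
        v = sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(capacity.get((u, v), 0) - flow[(u, v)] for u, v in path)
        for u, v in path:
            flow[(u, v)] += bottleneck
            flow[(v, u)] -= bottleneck
        max_flow += bottleneck
```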
AgentBench evaluates all models using chain-of-thought (CoT) prompting, which has become a standard strategy for eliciting reasoning from LLMs. In each interaction turn, the model receives a description of the current state (observation), generates a reasoning chain, and then outputs an action. The environment executes the action and returns the next observation, continuing this loop until the task is complete or a turn limit is reached.
The estimated number of interaction turns per task ranges from 5 to 50, depending on the environment's complexity. This multi-turn format is critical because it tests not just the model's ability to generate a single correct response, but its capacity to maintain coherent plans over extended interaction sequences.
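The loop described above can be sketched as follows. Here `env` and `model` are hypothetical stand-ins for the framework's task-server and agent-server interfaces, which differ in detail.

```python
def run_task(env, model, max_turns: int = 50):
    """One evaluation episode: observe, reason, act, repeat.
    Illustrative interfaces only, not AgentBench's actual API."""
    observation = env.reset()                     # task instruction + initial state
    for _ in range(max_turns):
        thought, action = model.act(observation)  # CoT reasoning, then an action
        observation, done = env.step(action)      # execute action, observe result
        if done:
            break
    return env.score()                            # SR, F1, win rate, reward, or progress
```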
Because the eight environments use different metrics (success rate, F1 score, win rate, reward, and game progress), AgentBench employs a normalization procedure to compute an overall score. Each environment's raw scores are rescaled so that the average score across all evaluated models equals 1.0. The overall AgentBench score is then computed as the average of these normalized scores across all eight environments.
For consistency in future evaluations, fixed normalization weights are calculated as the reciprocal of the average score per task from the initial evaluation round. This ensures that adding new models to the leaderboard does not change the relative scoring of previously evaluated models.
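Under this description, the procedure can be sketched as below. This is a simplified reading; the paper's exact weighting details may differ.

```python
def normalization_weights(raw_scores):
    """raw_scores: {env: {model: raw score}}. The weight per environment is
    the reciprocal of the mean raw score over the initial model pool, so the
    normalized average over that pool is 1.0."""
    return {
        env: 1.0 / (sum(scores.values()) / len(scores))
        for env, scores in raw_scores.items()
    }

def overall_score(model_scores, weights):
    """Average of weight-normalized per-environment scores for one model."""
    normalized = [model_scores[env] * w for env, w in weights.items()]
    return sum(normalized) / len(normalized)
```

Because the weights are frozen after the initial round, a newly added model's overall score depends only on its own raw results, keeping earlier leaderboard entries stable.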
The original AgentBench evaluation tested 29 LLMs spanning commercial API-based models and open-source alternatives. Later updates added additional models including GLM-4 and Claude 3.
| Model | Overall Score |
|---|---|
| GPT-4 (0613) | 4.01 |
| Claude 3 | 3.11 |
| GLM-4 | 2.89 |
| Claude 2 | 2.49 |
| Claude v1.3 | 2.44 |
| GPT-3.5-turbo | 2.32 |
| text-davinci-003 | 1.71 |
| claude-instant | 1.60 |
| chat-bison-001 | 1.39 |
| text-davinci-002 | (below 1.00) |
| Model | Parameters | Overall Score |
|---|---|---|
| CodeLlama-34B | 34B | 0.96 |
| Vicuna-13B | 13B | 0.93 |
| Llama 2-70B | 70B | 0.78 |
| openchat-8192-13B | 13B | Notable (best OSS in some tasks) |
| WizardLM-30B | 30B | Moderate |
| Vicuna-33B | 33B | Moderate |
| Guanaco-65B | 65B | Low |
| Guanaco-33B | 33B | Low |
| Llama 2-13B | 13B | Low |
| Llama 2-7B | 7B | Low |
| Vicuna-7B | 7B | Low |
| WizardLM-13B | 13B | Low |
| CodeLlama-13B | 13B | Low |
| CodeLlama-7B | 7B | Low |
| ChatGLM-6B | 6B | Low |
| CodeGeeX2-6B | 6B | Low |
| Dolly-12B | 12B | Low |
| OASST-12B | 12B | Low |
| Koala-13B | 13B | Low |
The average overall score for API-based models was 2.32, while open-source models averaged just 0.51, representing a roughly 4.5x performance gap.
| Model | OS (SR%) | DB (SR%) | KG (F1) | DCG (Win%) | LTP (Progress) | HH (SR%) | WS (Reward) | WB (Step SR%) | Overall |
|---|---|---|---|---|---|---|---|---|---|
| GPT-4 | 42.4 | 32.0 | 58.8 | 74.5 | 16.6 | 78.0 | 61.1 | 29.0 | 4.01 |
| Claude 2 | 18.1 | 27.3 | 41.3 | 55.5 | 8.4 | 54.0 | 61.4 | 0.0 | 2.49 |
| Claude v1.3 | 9.7 | 22.0 | 38.9 | 40.9 | 8.2 | 58.0 | 55.7 | 25.0 | 2.44 |
| GPT-3.5-turbo | 32.6 | 36.7 | 25.9 | 33.7 | 10.5 | 16.0 | 64.1 | 20.0 | 2.32 |
| Llama 2-70B | 9.7 | 13.0 | 8.0 | 21.3 | 0.0 | 2.0 | 5.6 | 19.0 | 0.78 |
| CodeLlama-34B | 2.8 | 14.0 | 23.5 | 8.4 | 0.7 | 4.0 | 52.1 | 20.0 | 0.96 |
| Vicuna-13B | 10.4 | 6.7 | 9.4 | 0.1 | 8.0 | 8.0 | 41.7 | 12.0 | 0.93 |
GPT-4 achieved the highest overall score of 4.01 and led performance in six of the eight environments. Its 78% success rate on House-Holding tasks demonstrated near-practical usability in that domain. GPT-4 also showed strong results in the Digital Card Game (74.5% win rate) and Knowledge Graph (58.8 F1), indicating solid strategic reasoning and information retrieval capabilities.
GPT-3.5-turbo outperformed GPT-4 in two specific areas: Database (36.7% SR, higher than GPT-4's 32.0%) and Web Shopping (64.1 reward, the highest among all evaluated models). This suggests that certain agentic tasks depend more on specific training characteristics than on raw model scale.
Claude 2 and Claude v1.3 performed comparably to each other in overall terms (2.49 vs. 2.44), though their strengths differed. Claude 2 had better OS, DB, and KG performance, while Claude v1.3 outperformed Claude 2 in House-Holding (58% vs. 54%) and Web Browsing (25% vs. 0%). Notably, Claude 2 scored 0% on Web Browsing, indicating a complete failure to follow the required output format for that task.
Among open-source models, CodeLlama-34B achieved the highest overall score (0.96), largely driven by its strong Web Shopping performance (52.1 reward). However, it performed poorly on the Operating System task (2.8% SR) despite being a code-focused model, suggesting that interacting with a Linux system requires different skills than writing code. Vicuna-13B demonstrated that alignment quality matters as much as model size: it outperformed Llama 2-13B and performed comparably to models several times its size, likely because Vicuna was trained on ShareGPT conversations generated with GPT-4 and GPT-3.5-turbo.
The paper provides a detailed analysis of why models fail, categorizing execution outcomes into several types.
| Error Type | Description | Dominant In |
|---|---|---|
| Task Limit Exceeded (TLE) | The model runs out of allowed interaction turns without completing the task | LTP (82.5%), KG (67.9%) |
| Invalid Format | The model's output does not follow the required response format | DB (53.3%), DCG (38.5%) |
| Invalid Action | The model generates an action that is not valid in the current environment state | HH (64.1%) |
| Context Limit Exceeded (CLE) | The conversation exceeds the model's context window | Various environments |
| Task Error | Unexpected errors during task execution | Minor contributor |
Task Limit Exceeded was the most common failure mode overall, particularly in environments requiring extended reasoning chains like Lateral Thinking Puzzles and Knowledge Graph tasks. This indicates that many models struggle with long-term planning and cannot efficiently converge on solutions within the allowed number of turns.
Invalid Format errors were most prevalent in structured output environments like Database and Digital Card Game, where models must produce outputs in a specific syntax (SQL queries or game commands). This points to weaknesses in instruction following, especially when output constraints are strict.
Invalid Action errors dominated in the House-Holding environment, where models frequently attempted actions that were not physically possible in the current state (for example, trying to pick up an object that was not nearby). This reflects limitations in maintaining an accurate internal model of the environment state.
The most striking finding from AgentBench is the substantial gap between commercial and open-source models. Commercial API-based models averaged an overall score of 2.32, while open-source models averaged 0.51. This roughly 4.5x difference was consistent across most environments.
The performance gap was especially pronounced in environments requiring complex reasoning, such as Knowledge Graph, Digital Card Game, and House-Holding, where the strongest open-source models trailed GPT-4 by wide margins.
The researchers attributed this gap to three main factors: poor long-term reasoning ability, weak decision-making under uncertainty, and inadequate instruction-following capabilities in open-source models. They noted that training on high-quality multi-round alignment data and improving instruction-following could help close this gap.
AgentBench has had a notable impact on the AI agent evaluation landscape. It was the first benchmark to provide a unified, multi-environment assessment of LLM agent capabilities, setting a precedent for how agentic AI systems should be evaluated.
The benchmark made several important contributions:
Systematic multi-environment evaluation. Before AgentBench, agent evaluations were fragmented across individual benchmarks. AgentBench demonstrated the value of testing the same model across diverse tasks to reveal strengths and weaknesses that single-environment benchmarks miss.
Quantification of the commercial-open source gap. By directly comparing 29 models under identical conditions, AgentBench provided the first rigorous quantification of how far behind open-source models were in agentic tasks as of 2023.
Identification of key bottlenecks. The failure analysis pinpointed specific capabilities (long-term reasoning, instruction following, format compliance) that needed improvement, giving the research community clear targets for future work.
Reusable evaluation infrastructure. The open-source framework, with its modular server-client architecture and Docker-based environments, can be extended with new tasks and models. Several subsequent benchmarks, including MedAgentBench for medical AI agents, built upon the AgentBench framework.
AgentBench inspired a wave of agent-focused benchmarks and evaluation frameworks. Projects like SWE-bench (for software engineering), Terminal-Bench (for command-line tasks), and LiveAgentBench (for real-world challenges) followed the pattern of evaluating LLMs in interactive, multi-turn settings. The benchmark's emphasis on practical, real-world tasks rather than synthetic puzzles helped shift the evaluation paradigm for LLM agents.
The benchmark also influenced model development. The finding that alignment quality (as demonstrated by Vicuna's performance relative to its base model) matters as much as scale for agent tasks encouraged research into better alignment techniques specifically targeting multi-turn, agentic interactions.
While AgentBench represents a significant advance in agent evaluation, the authors and subsequent researchers have identified several limitations.
AgentBench has been released in multiple versions.
The project is maintained under the THUDM (Tsinghua University Data Mining) GitHub organization and continues to receive updates.
AgentBench was developed by a team of 22 researchers across three institutions: Tsinghua University, The Ohio State University, and UC Berkeley.