WebArena is a realistic, self-hosted web environment and benchmark designed for developing and evaluating autonomous AI agents that perform tasks on the web. Introduced in a 2023 paper by Shuyan Zhou, Frank F. Xu, Hao Zhu, and colleagues at Carnegie Mellon University, WebArena provides fully functional, interactive websites spanning multiple domains, along with a suite of 812 long-horizon tasks that test an agent's ability to carry out realistic web-based activities described in natural language. The benchmark was published as a conference paper at ICLR 2024 and has become one of the most widely used evaluation platforms for web agents.
Unlike earlier benchmarks that relied on simplified or synthetic web pages, WebArena recreates the complexity of real-world websites, complete with authentication systems, databases, dynamic content, and multi-step workflows. As of early 2025, the best autonomous agents achieve roughly 60% task success on WebArena, compared to a human baseline of 78.24%, illustrating both the progress and the remaining challenges in building capable AI browser agents.
Before WebArena, most benchmarks for web-based AI agents either used static snapshots of websites or simplified toy environments that did not capture the full complexity of real web interactions. Benchmarks like MiniWoB and MiniWoB++ provided small-scale browser tasks (such as clicking buttons or filling forms), but their synthetic nature limited their ability to measure whether agents could handle the diversity, ambiguity, and long-horizon planning that characterize real web usage.
Mind2Web, introduced at NeurIPS 2023, offered a larger dataset of over 2,000 tasks across 137 real websites, but it evaluated agents on pre-recorded HTML snapshots rather than in live, interactive environments. This meant agents could not observe the consequences of their actions or adapt their strategies in real time.
WebArena was designed to bridge this gap by providing a fully interactive, reproducible environment where agents execute multi-step tasks on real web applications. The key design goals were:

- Realism: fully functional websites with authentication, databases, and dynamic content, rather than simplified or static pages.
- Reproducibility: a self-hosted environment that can be reset to a deterministic initial state, so results are comparable across runs and over time.
- Functional evaluation: success judged by whether the intended goal was actually achieved, not by matching a reference sequence of actions.
WebArena provides a self-hosted environment consisting of four primary web applications, each representing a distinct domain commonly found on the internet. These websites are deployed as Docker containers and populated with realistic data.
| Website | Domain | Platform | Data Scale |
|---|---|---|---|
| OneStopShop | E-commerce | Adobe Magento | ~90,000 products across 300+ categories |
| Postmill (Reddit clone) | Social forum | Postmill (open-source) | 95 subreddits, 127,390 posts, 661,781 users |
| GitLab | Software development | GitLab CE | 300 repositories, 1,000+ user accounts |
| CMS Admin | Content management | Adobe Commerce Admin | Product/order management portal |
OneStopShop is an e-commerce site built on the Adobe Magento platform, stocked with approximately 90,000 products across more than 300 categories. Products include realistic prices, options, descriptions, images, and customer reviews. Agents can browse, search, add items to cart, check out, and manage account settings.
Postmill is an open-source Reddit alternative that serves as the social forum component. The researchers seeded it with data sampled from the top 50 subreddits, resulting in 95 active communities, over 127,000 posts, and more than 661,000 user accounts. Agents can search threads, post comments, upvote content, and manage their profiles.
GitLab is a self-hosted instance of GitLab Community Edition, populated with 300 repositories and over 1,000 user accounts. The repositories span multiple programming languages and include both popular projects (with many issues and merge requests) and smaller personal projects, reflecting the diversity found on real code hosting platforms.
CMS Admin is the back-end content management system for the e-commerce store, built on the Adobe Commerce admin panel. Tasks here involve managing product listings, configuring store settings, reviewing orders, and editing content pages.
In addition to the four main websites, WebArena provides several utility tools that agents can use during task completion: a map service (based on OpenStreetMap) for location and routing queries, an offline copy of Wikipedia (served via Kiwix) as a knowledge resource, a calculator, and a scratchpad for taking notes.
These tools are designed to encourage human-like problem-solving strategies, where agents look up information, perform calculations, and take notes as part of completing a task.
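The problem-solving pattern these tools support can be sketched in a few lines of Python. The scratchpad list and helper function below are illustrative only, not part of the WebArena API:

```python
# Illustrative sketch of the scratchpad-and-calculator pattern: the agent
# records intermediate values as notes, then computes a final answer.
# The helper names and values are invented for illustration.

scratchpad = []

def note(text):
    """Append an observation to the agent's scratchpad."""
    scratchpad.append(text)

# While comparing products, the agent records the prices it has seen...
note("item A: 12.50")
note("item B: 9.99")

# ...then totals them, as it might with the calculator tool.
prices = [float(line.split(": ")[1]) for line in scratchpad]
total = sum(prices)
```

This mirrors the human strategy the benchmark designers had in mind: gather facts across pages first, then aggregate.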
WebArena includes 812 test tasks instantiated from 241 hand-crafted task templates, with an average of 3.3 task variations per template. Each task is described as a high-level natural language instruction that an agent must interpret and execute through a series of web interactions.
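The template-to-task relationship can be sketched in a few lines of Python; the template text and variable values below are invented for illustration, not drawn from the actual benchmark:

```python
# One hand-crafted template yields several concrete test tasks by
# substituting variables (WebArena averages 3.3 variations per template).
# The template wording and products here are invented examples.

template = "What is the price of {product} on the shopping site?"

variants = [
    template.format(product=p)
    for p in ["a USB-C cable", "a yoga mat", "a desk lamp"]
]
```

Because variants of one template share structure but differ in required answers, an agent cannot succeed by memorizing a single trajectory per template.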
The tasks are organized into three primary categories:
| Category | Description | Approximate Count |
|---|---|---|
| Information Seeking | Agent must find specific information and return a textual answer | ~270 tasks |
| Site Navigation | Agent must navigate to a specific page or state on the website | ~300 tasks |
| Content and Configuration | Agent must create, modify, or configure content on a website | ~240 tasks |
Information Seeking tasks require the agent to navigate across multiple pages, locate specific data, and return a textual response. Examples include finding the price of a product, identifying the top-voted post in a subreddit, or determining the number of open issues in a GitLab repository. These tasks often demand multi-step reasoning and the ability to filter, sort, and aggregate information.
Site Navigation tasks ask the agent to reach a particular page or interface element using search functions, menus, links, and other interactive components. An example might be navigating to the settings page for a specific GitLab project or finding the order history section of the e-commerce site.
Content and Configuration tasks require the agent to actively modify website state, such as posting a comment on a forum thread, creating a new GitLab issue, updating a product description in the CMS, or changing account settings. These tasks test the agent's ability to understand form fields, input data correctly, and confirm that changes have been applied.
A distinctive feature of WebArena is the inclusion of cross-site tasks that require agents to interact with multiple websites within a single task. For example, an agent might need to look up a location on the map, then search for a related product on the e-commerce site. These tasks test an agent's ability to transfer information between contexts and manage multiple browser tabs.
WebArena also includes tasks that are intentionally impossible to complete given the current state of the environment. These "unachievable" tasks are labeled with an expected answer of "N/A" and test whether agents can recognize when a requested action cannot be performed, rather than fabricating an incorrect answer or taking unnecessary actions.
WebArena provides agents with observations that mimic the information available to a human browser user. Each observation includes the URL of the current page, the content of the currently focused page, and the set of open browser tabs.
The page content can be rendered in three different formats, which can be configured depending on the agent architecture:
| Observation Mode | Description | Use Case |
|---|---|---|
| Raw HTML (DOM tree) | The complete HTML source of the page | Text-based agents that can parse HTML |
| Accessibility tree | A structured, compact representation with element roles, text, and properties | LLM-based agents (most commonly used) |
| Screenshot | A pixel-based RGB image of the rendered page | Multimodal or vision-based agents |
The accessibility tree has become the most popular observation format for LLM-based agents. It provides a simplified view of the page where each element is described by its role (e.g., link, button, textbox), its visible text, and its properties (e.g., whether it is focusable or checked). Each element is assigned a unique numeric ID that agents can reference when specifying actions.
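A rough illustration of that format, generated here from a handful of invented elements (real trees are produced by the browser, and the IDs, roles, and names below are made up):

```python
# Render (id, depth, role, name) tuples in the compact indented text
# format used for accessibility-tree observations. All element data
# below is invented for illustration.

elements = [
    (1, 0, "RootWebArea", "One Stop Market"),
    (12, 1, "link", "My Account"),
    (35, 1, "searchbox", "Search entire store here..."),
    (47, 1, "button", "Search"),
]

def render(elements):
    """One line per element: [id] role 'name', indented by tree depth."""
    return "\n".join(
        "\t" * depth + f"[{elem_id}] {role} '{name}'"
        for elem_id, depth, role, name in elements
    )

print(render(elements))
```

An agent acting on this observation would refer to elements by ID, e.g. clicking the search button by citing element 47.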
WebArena defines a compact set of browser actions organized into four groups:
| Action Group | Actions |
|---|---|
| Element Operations | click, hover, type, key press, scroll |
| Tab Management | focus tab, open tab, close tab |
| URL Navigation | goto URL, go back, go forward |
| Special | stop (with answer), noop (no operation) |
Agents interact with web elements by referencing either on-screen coordinates (x, y) or the unique element IDs assigned in the accessibility tree. Using element IDs transforms element selection into a classification problem, simplifying the agent's decision-making.
The environment follows the Gymnasium (formerly OpenAI Gym) interface, exposing standard reset() and step(action) methods. This design makes it straightforward to integrate with existing reinforcement learning and agent frameworks.
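A minimal sketch of that interaction loop, using a toy stand-in environment: the class, observations, and reward logic below are illustrative, though the action strings mimic WebArena's element-ID-based format and the 30-step cap matches the budget used in the paper's experiments.

```python
# Sketch of the Gymnasium-style reset()/step() loop. ToyWebEnv stands in
# for the real WebArena environment; in practice the observation would be
# a page representation (e.g. an accessibility tree) and the reward would
# come from the benchmark's functional-correctness evaluators.

class ToyWebEnv:
    def reset(self):
        """Return the initial observation and an info dict."""
        self.steps = 0
        return "start page", {}

    def step(self, action):
        """Advance one step; the episode ends when the agent issues stop."""
        self.steps += 1
        observation = f"page after {action}"
        terminated = action.startswith("stop")
        reward = 1.0 if terminated else 0.0  # stand-in for the evaluator
        truncated = self.steps >= 30         # 30-step interaction budget
        return observation, reward, terminated, truncated, {}

env = ToyWebEnv()
obs, info = env.reset()
for action in ["click [12]", "type [35] [usb cable] [1]", "stop [$4.99]"]:
    obs, reward, terminated, truncated, _ = env.step(action)
    if terminated or truncated:
        break
```

The five-tuple returned by `step` follows the modern Gymnasium convention (observation, reward, terminated, truncated, info), which is what makes drop-in integration with agent frameworks straightforward.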
One of WebArena's most important contributions is its focus on functional correctness rather than action-sequence matching. Instead of checking whether the agent performed the exact same sequence of clicks and keystrokes as a human demonstrator, WebArena evaluates whether the agent actually achieved the intended goal.
For information-seeking tasks, WebArena uses three types of answer matching: exact match (the predicted answer must equal the reference string), must-include matching (the answer must contain a set of required phrases), and fuzzy matching (a language model judges whether the prediction is semantically equivalent to the reference).
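The rule-based checks (exact equality and required-phrase inclusion) can be sketched as follows; the third strategy, fuzzy matching, delegates to an LLM judge and is omitted here. Function names are descriptive, not the exact WebArena API:

```python
# Sketch of two rule-based answer checks in the spirit of WebArena's
# information-seeking evaluation. Names are illustrative.

def exact_match(prediction, reference):
    """Case- and surrounding-whitespace-insensitive string equality."""
    return prediction.strip().lower() == reference.strip().lower()

def must_include(prediction, required_phrases):
    """True iff every required phrase appears in the prediction."""
    pred = prediction.lower()
    return all(phrase.lower() in pred for phrase in required_phrases)
```

A must-include check lets a task accept any phrasing of the answer so long as the key facts (a price, a username, a count) are present.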
For site navigation and content/configuration tasks, the evaluation uses programmatic reward functions that inspect the actual state of the website: verifying that the final URL matches an expected location, or querying the live page or the site's backend to confirm that the required content or configuration change is present.
This approach accommodates the fact that there are often multiple valid paths to accomplish the same task. Two agents might take completely different sequences of actions but both successfully complete the task, and WebArena's evaluation correctly recognizes both as successful.
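The idea can be sketched with a toy checker that inspects final state rather than comparing action sequences; the dict standing in for the live site, and all keys and values, are invented for illustration:

```python
# State-based evaluation in the spirit of functional correctness: the
# checker looks only at the final site state (a dict stands in for the
# live store here), so any action sequence reaching that state passes.

def check_task_success(final_state, expected):
    """True iff every expected key/value holds in the final state."""
    return all(final_state.get(k) == v for k, v in expected.items())

# Two agents reach the required state via different trajectories;
# both are scored as successful.
expected = {"product_42_price": "19.99"}
state_after_agent_a = {"product_42_price": "19.99", "cart": []}
state_after_agent_b = {"product_42_price": "19.99", "cart": ["usb cable"]}
```

An action-matching evaluator would likely score these two agents differently; a state-based one does not.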
Five computer science graduate students evaluated 170 task templates (one instantiation per template). Human evaluators received the same instructions as the agent baselines and achieved an overall success rate of 78.24%.
Human failures were attributed to misinterpretation of the task intent (50% of failures), incomplete answers, and off-target execution.
The original paper evaluated several large language models using different prompting strategies. All agents used the accessibility tree as the observation format, with a maximum of 30 interaction steps per task.
| Agent | Prompting Strategy | Success Rate |
|---|---|---|
| GPT-4 | Chain-of-thought (CoT) | 14.41% |
| GPT-4 | CoT + unachievable hint | 11.70% |
| GPT-3.5 Turbo | CoT + unachievable hint | 8.75% |
| text-bison-001 | CoT + unachievable hint | 5.05% |
| Human | Direct execution | 78.24% |
Several important findings emerged from this initial evaluation:

- Even the best model, GPT-4, succeeded on only 14.41% of tasks, far below the human success rate of 78.24%.
- Explicitly hinting that some tasks might be unachievable lowered GPT-4's overall success from 14.41% to 11.70%, suggesting the hint encouraged premature abandonment of solvable tasks.
- Model capability mattered greatly: under the same prompting strategy, GPT-4 outperformed GPT-3.5 Turbo (8.75%) and text-bison-001 (5.05%) by a wide margin.
Since the initial publication, the WebArena leaderboard has seen dramatic progress. In roughly two years, the best agent scores have climbed from 14% to over 60%, driven by advances in agent architectures, specialized training, and improved foundation models.
| Agent | Organization | Success Rate | Date | Open/Closed |
|---|---|---|---|---|
| CUGA | IBM | 61.7% | Feb 2025 | Open-source |
| AWA 1.5 (Jace) | Jace AI | 57.1% | 2024 | Closed |
| Operator | OpenAI | ~58% | 2025 | Closed |
| Gemini 2.5 Pro | Google | 54.8% | 2025 | Closed |
| ScribeAgent + GPT-4o | Scribe | 53.0% | 2024 | Closed |
| ORCHESTRA | UNC / Ventus | 52.1% | 2024 | Closed |
| Agent Workflow Memory | Research | 35.5% | 2024 | Open |
The jump from 14% to over 60% did not come from a single breakthrough. Rather, the community converged on a modular architecture often described as the "standard model" for web agents, consisting of three components: a high-level Planner that decomposes tasks into sub-goals, a specialized Executor that carries out individual actions, and a structured Memory system that tracks progress and stores relevant information.
Several research directions contributed to the rapid improvement:

- Stronger foundation models, reflected in leaderboard entries such as Gemini 2.5 Pro and OpenAI's Operator.
- Reusable workflow memory, in which agents distill successful episodes into procedures they can apply to later tasks (e.g., Agent Workflow Memory).
- Fine-tuning on real web workflows, exemplified by ScribeAgent.
- Increasingly sophisticated planner/executor/memory systems, such as ORCHESTRA and CUGA.
As WebArena became widely adopted, researchers at ServiceNow identified several sources of noise and inconsistency in the original evaluation. In response, they developed WebArena Verified, a cleaned and audited version of the benchmark that improves evaluation reliability.
WebArena Verified addresses issues such as incorrect or underspecified gold answers, overly strict or flaky evaluation scripts that misjudged valid agent behavior, and ambiguous task wording that admitted multiple reasonable interpretations.
WebArena Verified also introduces WebArena Verified Hard, a curated 137-task subset that retains the most difficult tasks while reducing evaluation cost by 83%. This subset is designed for quick but meaningful comparisons between agent systems.
WebArena Verified was presented at the Scaling Environments for Agents (SEA) Workshop at NeurIPS 2025 and is available as a Python package on PyPI.
The original WebArena paper spawned a family of related benchmarks, each extending the core framework to address new challenges.
VisualWebArena (Koh et al., 2024) extends WebArena to evaluate multimodal agents on tasks that require visual understanding of web content. While WebArena tasks can largely be solved using text-based observations (accessibility trees), VisualWebArena introduces 910 tasks where agents must interpret images, visual layouts, colors, and other graphical elements to succeed.
VisualWebArena adds a new Classifieds website (with real-world listing data) alongside the existing Shopping and Reddit environments. The benchmark was published at ACL 2024. The best vision-language model agents achieved 16.4% success, compared to a human baseline of 88.7%.
VideoWebArena (2024) evaluates agents on tasks that require processing and understanding long-context video inputs. The benchmark consists of 2,021 tasks based on 74 manually crafted video tutorials totaling nearly four hours of content. Tasks are split into two categories: skill retention (can the agent replicate a demonstrated workflow?) and factual retention (can the agent retrieve specific information from a video?).
The best model achieved 13.3% success on factual retention tasks, compared to 73.9% for humans. VideoWebArena was published at ICLR 2025.
TheAgentCompany extends the WebArena philosophy to a simulated corporate work environment. It benchmarks LLM agents on 175 consequential professional tasks in a simulated software company, including tasks that involve terminal use, coding, email, and interactions with simulated human colleagues (LLM-based NPCs). Published at ICML 2025, the best agent (Claude 3.5 Sonnet) completed only 24% of tasks.
WebChoreArena (Miyai et al., 2025) extends WebArena's scope to labor-intensive and tedious web tasks. Its 532 tasks emphasize three challenges: massive memory (processing large amounts of on-screen information), calculation (precise mathematical reasoning), and long-term memory (retaining information across multiple web pages). On WebChoreArena, GPT-4o's accuracy drops from 42.8% (on WebArena) to just 6.8%, while Gemini 2.5 Pro achieves 44.9%.
WebArena-Infinity is designed for continuous and scalable evaluation in evolving web environments, moving beyond the fixed task set of the original benchmark to support dynamic, continuously updated evaluation scenarios.
WebArena exists within a broader landscape of agent evaluation benchmarks. The following table compares WebArena with other prominent web agent benchmarks.
| Benchmark | Year | Tasks | Environment | Evaluation | Key Difference |
|---|---|---|---|---|---|
| MiniWoB++ | 2018 | 100+ | Synthetic browser tasks | Reward function | Small-scale, simplified tasks |
| Mind2Web | 2023 | 2,000+ | Static HTML snapshots | Action matching | Offline; no live interaction |
| WebArena | 2023 | 812 | Self-hosted real websites | Functional correctness | Interactive, realistic, reproducible |
| VisualWebArena | 2024 | 910 | Self-hosted real websites | Functional correctness | Requires visual understanding |
| WorkArena | 2024 | 33+ | ServiceNow platform | Task completion | Enterprise software focus |
| OSWorld | 2024 | 369 | Full desktop OS | Screenshot comparison | Tests OS-level interaction |
| WebChoreArena | 2025 | 532 | Self-hosted real websites | Functional correctness | Tedious, memory-heavy tasks |
WebArena's primary advantage over offline benchmarks like Mind2Web is its interactive nature: agents can observe the results of their actions and adjust their strategies accordingly. Compared to synthetic benchmarks like MiniWoB++, WebArena provides a far more realistic and challenging evaluation. And compared to benchmarks that use live public websites, WebArena's self-hosted design ensures reproducibility and prevents issues with websites changing between evaluation runs.
WebArena's Gymnasium-compatible interface has led to its integration with several popular agent development and evaluation frameworks.
BrowserGym, developed by ServiceNow, is a unified environment for web task automation that supports WebArena, WorkArena, and MiniWoB within a single framework. BrowserGym provides standardized observation formats (HTML, accessibility tree, screenshot, set-of-marks) and action spaces, making it easier to develop agents that work across multiple benchmarks.
AgentLab, also from ServiceNow, extends BrowserGym with sophisticated parallelization and analysis tools. It supports running 20 to 100 parallel evaluation tasks and is the recommended framework for new WebArena experiments as of December 2024.
The WebArena codebase itself uses Playwright for browser automation, enabling programmatic control of Chromium-based browsers.
WebArena's environment is distributed as a set of Docker containers, each hosting one of the web applications. The researchers provide scripts to reset the environment to a deterministic initial state before each evaluation run, ensuring that task outcomes are not affected by lingering state from previous runs.
The configuration includes the following service ports:
| Service | Default Port |
|---|---|
| Shopping (OneStopShop) | 7770 |
| Shopping Admin (CMS) | 7780 |
| Reddit (Postmill) | 9999 |
| GitLab | 8023 |
| Map (OpenStreetMap) | 3000 |
| Wikipedia (Kiwix) | 8888 |
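An agent harness addresses each service by a base URL assembled from these ports. A hedged sketch, assuming a single host machine (the `WEBARENA_HOST` variable name is illustrative, not the harness's actual configuration mechanism):

```python
import os

# Base URLs for the self-hosted services, built from the default ports
# in the table above. WEBARENA_HOST is an invented variable name used
# here for illustration.
HOST = os.environ.get("WEBARENA_HOST", "localhost")

SITE_URLS = {
    "shopping":       f"http://{HOST}:7770",
    "shopping_admin": f"http://{HOST}:7780",
    "reddit":         f"http://{HOST}:9999",
    "gitlab":         f"http://{HOST}:8023",
    "map":            f"http://{HOST}:3000",
    "wikipedia":      f"http://{HOST}:8888",
}
```

Keeping all endpoints configurable in one place is what allows the same task definitions to run against any self-hosted deployment.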
As of February 2026, optimized Docker images for all WebArena environments are available on Docker Hub, with images up to 92% smaller than the originals. These optimized images include auto-login headers and a single container option for the Map service.
WebArena has had a substantial influence on the development of autonomous web agents. Its contributions include:
Establishing a standard evaluation protocol: WebArena's emphasis on functional correctness over action-sequence matching has become the default approach for evaluating web agents. This paradigm recognizes that there are multiple valid strategies for completing any given task.
Providing a realistic testing ground: By using fully functional web applications with real data, WebArena exposed the gap between agent performance on simplified benchmarks and their ability to handle real-world complexity.
Catalyzing rapid progress: The clear, measurable nature of WebArena's task success metric has motivated rapid iteration in agent design. The jump from 14% to over 60% success in two years demonstrates the benchmark's role in driving the field forward.
Spawning an ecosystem: The WebArena framework has been extended into VisualWebArena, VideoWebArena, TheAgentCompany, WebChoreArena, and WebArena-Infinity, creating a comprehensive suite of benchmarks that cover text, vision, video, and enterprise domains.
Highlighting remaining challenges: Even with 60% success rates, agents still fall short of human performance at 78%. The remaining gap involves tasks requiring deep visual understanding, common-sense reasoning, long-horizon planning, and robust error recovery.
Despite its strengths, WebArena has several recognized limitations. Its four sites cover only a narrow slice of the web's diversity, and the fixed task set invites overfitting as the benchmark ages. Full evaluation runs are slow and computationally expensive, a cost the WebArena Verified Hard subset was created to reduce. The predominantly text-based observations understate the visual reasoning many real pages require, a gap that motivated VisualWebArena. Finally, the sandboxed environment omits real-world obstacles such as CAPTCHAs, rate limiting, and anti-bot defenses.
WebArena was created by researchers primarily from Carnegie Mellon University's Language Technologies Institute, including Shuyan Zhou, Frank F. Xu, and Hao Zhu.
The project website is hosted at webarena.dev, and the source code is available on GitHub under the web-arena-x organization.