WebArena is a realistic, self-hosted web environment and benchmark designed for developing and evaluating autonomous AI agents that perform tasks on the web. Introduced in a 2023 paper by Shuyan Zhou, Frank F. Xu, Hao Zhu, and colleagues at Carnegie Mellon University, WebArena provides fully functional, interactive websites spanning multiple domains, along with a suite of 812 long-horizon tasks that test an agent's ability to carry out realistic web-based activities described in natural language. The benchmark was published as a conference paper at ICLR 2024 and has become one of the most widely used evaluation platforms for web agents.
Unlike earlier benchmarks that relied on simplified or synthetic web pages, WebArena recreates the complexity of real-world websites, complete with authentication systems, databases, dynamic content, and multi-step workflows. As of early 2025, the best autonomous agents achieve roughly 60% task success on WebArena, compared to a human baseline of 78.24%, illustrating both the progress and the remaining challenges in building capable AI browser agents.
Before WebArena, most benchmarks for web-based AI agents either used static snapshots of websites or simplified toy environments that did not capture the full complexity of real web interactions. Benchmarks like MiniWoB and MiniWoB++ provided small-scale browser tasks (such as clicking buttons or filling forms), but their synthetic nature limited their ability to measure whether agents could handle the diversity, ambiguity, and long-horizon planning that characterize real web usage.
Mind2Web, introduced at NeurIPS 2023, offered a larger dataset of over 2,000 tasks across 137 real websites, but it evaluated agents on pre-recorded HTML snapshots rather than in live, interactive environments. This meant agents could not observe the consequences of their actions or adapt their strategies in real time.
WebArena was designed to bridge this gap by providing a fully interactive, reproducible environment where agents execute multi-step tasks on real web applications. The key design goals were:

- Realism: fully functional websites with authentication, databases, and dynamic content, rather than simplified or static pages.
- Reproducibility: a self-hosted environment that can be reset to a deterministic initial state, so results are comparable across runs and over time.
- Functional evaluation: success judged by whether the intended goal was actually achieved, not by matching a reference sequence of actions.
WebArena provides a self-hosted environment consisting of four primary web applications, each representing a distinct domain commonly found on the internet. These websites are deployed as Docker containers and populated with realistic data.
| Website | Domain | Platform | Data Scale |
|---|---|---|---|
| OneStopShop | E-commerce | Adobe Magento | ~90,000 products across 300+ categories |
| Postmill (Reddit clone) | Social forum | Postmill (open-source) | 95 subreddits, 127,390 posts, 661,781 users |
| GitLab | Software development | GitLab CE | 300 repositories, 1,000+ user accounts |
| CMS Admin | Content management | Adobe Commerce Admin | Product/order management portal |
OneStopShop is an e-commerce site built on the Adobe Magento platform, stocked with approximately 90,000 products across more than 300 categories. Products include realistic prices, options, descriptions, images, and customer reviews. Agents can browse, search, add items to cart, check out, and manage account settings.
Postmill is an open-source Reddit alternative that serves as the social forum component. The researchers seeded it with data sampled from the top 50 subreddits, resulting in 95 active communities, over 127,000 posts, and more than 661,000 user accounts. Agents can search threads, post comments, upvote content, and manage their profiles.
GitLab is a self-hosted instance of GitLab Community Edition, populated with 300 repositories and over 1,000 user accounts. The repositories span multiple programming languages and include both popular projects (with many issues and merge requests) and smaller personal projects, reflecting the diversity found on real code hosting platforms.
CMS Admin is the back-end content management system for the e-commerce store, built on the Adobe Commerce admin panel. Tasks here involve managing product listings, configuring store settings, reviewing orders, and editing content pages.
In addition to the four main websites, WebArena provides several utility tools that agents can use during task completion: a map service (based on OpenStreetMap) for location and routing queries, an offline copy of Wikipedia (served via Kiwix) as a knowledge resource, a calculator, and a scratchpad for taking notes.
These tools are designed to encourage human-like problem-solving strategies, where agents look up information, perform calculations, and take notes as part of completing a task.
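The problem-solving pattern these tools support can be sketched in a few lines of Python. The scratchpad list and helper function below are illustrative only, not part of the WebArena API:

```python
# Illustrative sketch of the scratchpad-and-calculator pattern: the agent
# records intermediate values as notes, then computes a final answer.
# The helper names and values are invented for illustration.

scratchpad = []

def note(text):
    """Append an observation to the agent's scratchpad."""
    scratchpad.append(text)

# While comparing products, the agent records the prices it has seen...
note("item A: 12.50")
note("item B: 9.99")

# ...then totals them, as it might with the calculator tool.
prices = [float(line.split(": ")[1]) for line in scratchpad]
total = sum(prices)
```

This mirrors the human strategy the benchmark designers had in mind: gather facts across pages first, then aggregate.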
WebArena includes 812 test tasks instantiated from 241 hand-crafted task templates, with an average of 3.3 task variations per template. Each task is described as a high-level natural language instruction that an agent must interpret and execute through a series of web interactions.
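The template-to-task relationship can be sketched in a few lines of Python; the template text and variable values below are invented for illustration, not drawn from the actual benchmark:

```python
# One hand-crafted template yields several concrete test tasks by
# substituting variables (WebArena averages 3.3 variations per template).
# The template wording and products here are invented examples.

template = "What is the price of {product} on the shopping site?"

variants = [
    template.format(product=p)
    for p in ["a USB-C cable", "a yoga mat", "a desk lamp"]
]
```

Because variants of one template share structure but differ in required answers, an agent cannot succeed by memorizing a single trajectory per template.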
The tasks are organized into three primary categories:
| Category | Description | Approximate Count |
|---|---|---|
| Information Seeking | Agent must find specific information and return a textual answer | ~270 tasks |
| Site Navigation | Agent must navigate to a specific page or state on the website | ~300 tasks |
| Content and Configuration | Agent must create, modify, or configure content on a website | ~240 tasks |
Information Seeking tasks require the agent to navigate across multiple pages, locate specific data, and return a textual response. Examples include finding the price of a product, identifying the top-voted post in a subreddit, or determining the number of open issues in a GitLab repository. These tasks often demand multi-step reasoning and the ability to filter, sort, and aggregate information.
Site Navigation tasks ask the agent to reach a particular page or interface element using search functions, menus, links, and other interactive components. An example might be navigating to the settings page for a specific GitLab project or finding the order history section of the e-commerce site.
Content and Configuration tasks require the agent to actively modify website state, such as posting a comment on a forum thread, creating a new GitLab issue, updating a product description in the CMS, or changing account settings. These tasks test the agent's ability to understand form fields, input data correctly, and confirm that changes have been applied.
A distinctive feature of WebArena is the inclusion of cross-site tasks that require agents to interact with multiple websites within a single task. For example, an agent might need to look up a location on the map, then search for a related product on the e-commerce site. These tasks test an agent's ability to transfer information between contexts and manage multiple browser tabs.
WebArena also includes tasks that are intentionally impossible to complete given the current state of the environment. These "unachievable" tasks are labeled with an expected answer of "N/A" and test whether agents can recognize when a requested action cannot be performed, rather than fabricating an incorrect answer or taking unnecessary actions.
WebArena provides agents with observations that mimic the information available to a human browser user. Each observation includes the URL of the current page, the content of the currently focused page, and the set of open browser tabs.
The page content can be rendered in three different formats, which can be configured depending on the agent architecture:
| Observation Mode | Description | Use Case |
|---|---|---|
| Raw HTML (DOM tree) | The complete HTML source of the page | Text-based agents that can parse HTML |
| Accessibility tree | A structured, compact representation with element roles, text, and properties | LLM-based agents (most commonly used) |
| Screenshot | A pixel-based RGB image of the rendered page | Multimodal or vision-based agents |
The accessibility tree has become the most popular observation format for LLM-based agents. It provides a simplified view of the page where each element is described by its role (e.g., link, button, textbox), its visible text, and its properties (e.g., whether it is focusable or checked). Each element is assigned a unique numeric ID that agents can reference when specifying actions.
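A rough illustration of that format, generated here from a handful of invented elements (real trees are produced by the browser, and the IDs, roles, and names below are made up):

```python
# Render (id, depth, role, name) tuples in the compact indented text
# format used for accessibility-tree observations. All element data
# below is invented for illustration.

elements = [
    (1, 0, "RootWebArea", "One Stop Market"),
    (12, 1, "link", "My Account"),
    (35, 1, "searchbox", "Search entire store here..."),
    (47, 1, "button", "Search"),
]

def render(elements):
    """One line per element: [id] role 'name', indented by tree depth."""
    return "\n".join(
        "\t" * depth + f"[{elem_id}] {role} '{name}'"
        for elem_id, depth, role, name in elements
    )

print(render(elements))
```

An agent acting on this observation would refer to elements by ID, e.g. clicking the search button by citing element 47.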
WebArena defines a compact set of browser actions organized into four groups:
| Action Group | Actions |
|---|---|
| Element Operations | click, hover, type, key press, scroll |
| Tab Management | focus tab, open tab, close tab |
| URL Navigation | goto URL, go back, go forward |
| Special | stop (with answer), noop (no operation) |
Agents interact with web elements by referencing either on-screen coordinates (x, y) or the unique element IDs assigned in the accessibility tree. Using element IDs transforms element selection into a classification problem, simplifying the agent's decision-making.
The environment follows the Gymnasium (formerly OpenAI Gym) interface, exposing standard reset() and step(action) methods. This design makes it straightforward to integrate with existing reinforcement learning and agent frameworks.
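A minimal sketch of that interaction loop, using a toy stand-in environment: the class, observations, and reward logic below are illustrative, though the action strings mimic WebArena's element-ID-based format and the 30-step cap matches the budget used in the paper's experiments.

```python
# Sketch of the Gymnasium-style reset()/step() loop. ToyWebEnv stands in
# for the real WebArena environment; in practice the observation would be
# a page representation (e.g. an accessibility tree) and the reward would
# come from the benchmark's functional-correctness evaluators.

class ToyWebEnv:
    def reset(self):
        """Return the initial observation and an info dict."""
        self.steps = 0
        return "start page", {}

    def step(self, action):
        """Advance one step; the episode ends when the agent issues stop."""
        self.steps += 1
        observation = f"page after {action}"
        terminated = action.startswith("stop")
        reward = 1.0 if terminated else 0.0  # stand-in for the evaluator
        truncated = self.steps >= 30         # 30-step interaction budget
        return observation, reward, terminated, truncated, {}

env = ToyWebEnv()
obs, info = env.reset()
for action in ["click [12]", "type [35] [usb cable] [1]", "stop [$4.99]"]:
    obs, reward, terminated, truncated, _ = env.step(action)
    if terminated or truncated:
        break
```

The five-tuple returned by `step` follows the modern Gymnasium convention (observation, reward, terminated, truncated, info), which is what makes drop-in integration with agent frameworks straightforward.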
One of WebArena's most important contributions is its focus on functional correctness rather than action-sequence matching. Instead of checking whether the agent performed the exact same sequence of clicks and keystrokes as a human demonstrator, WebArena evaluates whether the agent actually achieved the intended goal.
For information-seeking tasks, WebArena uses three types of answer matching: exact match (the predicted answer must equal the reference string), must-include matching (the answer must contain a set of required phrases), and fuzzy matching (a language model judges whether the prediction is semantically equivalent to the reference).
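The rule-based checks (exact equality and required-phrase inclusion) can be sketched as follows; the third strategy, fuzzy matching, delegates to an LLM judge and is omitted here. Function names are descriptive, not the exact WebArena API:

```python
# Sketch of two rule-based answer checks in the spirit of WebArena's
# information-seeking evaluation. Names are illustrative.

def exact_match(prediction, reference):
    """Case- and surrounding-whitespace-insensitive string equality."""
    return prediction.strip().lower() == reference.strip().lower()

def must_include(prediction, required_phrases):
    """True iff every required phrase appears in the prediction."""
    pred = prediction.lower()
    return all(phrase.lower() in pred for phrase in required_phrases)
```

A must-include check lets a task accept any phrasing of the answer so long as the key facts (a price, a username, a count) are present.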
For site navigation and content/configuration tasks, the evaluation uses programmatic reward functions that inspect the actual state of the website: verifying that the final URL matches an expected location, or querying the live page or the site's backend to confirm that the required content or configuration change is present.
This approach accommodates the fact that there are often multiple valid paths to accomplish the same task. Two agents might take completely different sequences of actions but both successfully complete the task, and WebArena's evaluation correctly recognizes both as successful.
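The idea can be sketched with a toy checker that inspects final state rather than comparing action sequences; the dict standing in for the live site, and all keys and values, are invented for illustration:

```python
# State-based evaluation in the spirit of functional correctness: the
# checker looks only at the final site state (a dict stands in for the
# live store here), so any action sequence reaching that state passes.

def check_task_success(final_state, expected):
    """True iff every expected key/value holds in the final state."""
    return all(final_state.get(k) == v for k, v in expected.items())

# Two agents reach the required state via different trajectories;
# both are scored as successful.
expected = {"product_42_price": "19.99"}
state_after_agent_a = {"product_42_price": "19.99", "cart": []}
state_after_agent_b = {"product_42_price": "19.99", "cart": ["usb cable"]}
```

An action-matching evaluator would likely score these two agents differently; a state-based one does not.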
Five computer science graduate students evaluated 170 task templates (one instantiation per template). Human evaluators received the same instructions as the agent baselines and achieved an overall success rate of 78.24%.
Human failures were attributed to misinterpretation of the task intent (50% of failures), incomplete answers, and off-target execution.
The original paper evaluated several large language models using different prompting strategies. All agents used the accessibility tree as the observation format, with a maximum of 30 interaction steps per task.
| Agent | Prompting Strategy | Success Rate |
|---|---|---|
| GPT-4 | Chain-of-thought (CoT) | 14.41% |
| GPT-4 | CoT + unachievable hint | 11.70% |
| GPT-3.5 Turbo | CoT + unachievable hint | 8.75% |
| text-bison-001 | CoT + unachievable hint | 5.05% |
| Human | Direct execution | 78.24% |
Several important findings emerged from this initial evaluation:

- Even the best model, GPT-4, succeeded on only 14.41% of tasks, far below the human success rate of 78.24%.
- Explicitly hinting that some tasks might be unachievable lowered GPT-4's overall success from 14.41% to 11.70%, suggesting the hint encouraged premature abandonment of solvable tasks.
- Model capability mattered greatly: under the same prompting strategy, GPT-4 outperformed GPT-3.5 Turbo (8.75%) and text-bison-001 (5.05%) by a wide margin.
Since the initial publication, the WebArena leaderboard has seen dramatic progress. In roughly two years, the best agent scores have climbed from 14% to over 60%, driven by advances in agent architectures, specialized training, and improved foundation models.
| Agent | Organization | Success Rate | Date | Open/Closed |
|---|---|---|---|---|
| CUGA | IBM | 61.7% | Feb 2025 | Open-source |
| AWA 1.5 (Jace) | Jace AI | 57.1% | 2024 | Closed |
| Operator | OpenAI | ~58% | 2025 | Closed |
| Gemini 2.5 Pro | Google | 54.8% | 2025 | Closed |
| ScribeAgent + GPT-4o | Scribe | 53.0% | 2024 | Closed |
| ORCHESTRA | UNC / Ventus | 52.1% | 2024 | Closed |
| Agent Workflow Memory | Research | 35.5% | 2024 | Open |
The jump from 14% to over 60% did not come from a single breakthrough. Rather, the community converged on a modular architecture often described as the "standard model" for web agents, consisting of three components: a high-level Planner that decomposes tasks into sub-goals, a specialized Executor that carries out individual actions, and a structured Memory system that tracks progress and stores relevant information.
Several research directions contributed to the rapid improvement:

- Stronger foundation models, reflected in leaderboard entries such as Gemini 2.5 Pro and OpenAI's Operator.
- Reusable workflow memory, in which agents distill successful episodes into procedures they can apply to later tasks (e.g., Agent Workflow Memory).
- Fine-tuning on real web workflows, exemplified by ScribeAgent.
- Increasingly sophisticated planner/executor/memory systems, such as ORCHESTRA and CUGA.
As WebArena became widely adopted, researchers at ServiceNow identified several sources of noise and inconsistency in the original evaluation. In response, they developed WebArena Verified, a cleaned and audited version of the benchmark that improves evaluation reliability.
WebArena Verified addresses issues such as incorrect or underspecified gold answers, overly strict or flaky evaluation scripts that misjudged valid agent behavior, and ambiguous task wording that admitted multiple reasonable interpretations.
WebArena Verified also introduces WebArena Verified Hard, a curated 137-task subset that retains the most difficult tasks while reducing evaluation cost by 83%. This subset is designed for quick but meaningful comparisons between agent systems.
WebArena Verified was presented at the Scaling Environments for Agents (SEA) Workshop at NeurIPS 2025 and is available as a Python package on PyPI.
The original WebArena paper spawned a family of related benchmarks, each extending the core framework to address new challenges.
VisualWebArena (Koh et al., 2024) extends WebArena to evaluate multimodal agents on tasks that require visual understanding of web content. While WebArena tasks can largely be solved using text-based observations (accessibility trees), VisualWebArena introduces 910 tasks where agents must interpret images, visual layouts, colors, and other graphical elements to succeed.
VisualWebArena adds a new Classifieds website (with real-world listing data) alongside the existing Shopping and Reddit environments. The benchmark was published at ACL 2024. The best vision-language model agents achieved 16.4% success, compared to a human baseline of 88.7%.
VideoWebArena (2024) evaluates agents on tasks that require processing and understanding long-context video inputs. The benchmark consists of 2,021 tasks based on 74 manually crafted video tutorials totaling nearly four hours of content. Tasks are split into two categories: skill retention (can the agent replicate a demonstrated workflow?) and factual retention (can the agent retrieve specific information from a video?).
The best model achieved 13.3% success on factual retention tasks, compared to 73.9% for humans. VideoWebArena was published at ICLR 2025.
TheAgentCompany extends the WebArena philosophy to a simulated corporate work environment. It benchmarks LLM agents on 175 consequential professional tasks in a simulated software company, including tasks that involve terminal use, coding, email, and interactions with simulated human colleagues (LLM-based NPCs). Published at ICML 2025, the best agent (Claude 3.5 Sonnet) completed only 24% of tasks.
WebChoreArena (Miyai et al., 2025) extends WebArena's scope to labor-intensive and tedious web tasks. Its 532 tasks emphasize three challenges: massive memory (processing large amounts of on-screen information), calculation (precise mathematical reasoning), and long-term memory (retaining information across multiple web pages). On WebChoreArena, GPT-4o's accuracy drops from 42.8% (on WebArena) to just 6.8%, while Gemini 2.5 Pro achieves 44.9%.
WebArena-Infinity is designed for continuous and scalable evaluation in evolving web environments, moving beyond the fixed task set of the original benchmark to support dynamic, continuously updated evaluation scenarios.
WebArena exists within a broader landscape of agent evaluation benchmarks. The following table compares WebArena with other prominent web agent benchmarks.
| Benchmark | Year | Tasks | Environment | Evaluation | Key Difference |
|---|---|---|---|---|---|
| MiniWoB++ | 2018 | 100+ | Synthetic browser tasks | Reward function | Small-scale, simplified tasks |
| Mind2Web | 2023 | 2,000+ | Static HTML snapshots | Action matching | Offline; no live interaction |
| WebArena | 2023 | 812 | Self-hosted real websites | Functional correctness | Interactive, realistic, reproducible |
| VisualWebArena | 2024 | 910 | Self-hosted real websites | Functional correctness | Requires visual understanding |
| WorkArena | 2024 | 33+ | ServiceNow platform | Task completion | Enterprise software focus |
| OSWorld | 2024 | 369 | Full desktop OS | Screenshot comparison | Tests OS-level interaction |
| WebChoreArena | 2025 | 532 | Self-hosted real websites | Functional correctness | Tedious, memory-heavy tasks |
WebArena's primary advantage over offline benchmarks like Mind2Web is its interactive nature: agents can observe the results of their actions and adjust their strategies accordingly. Compared to synthetic benchmarks like MiniWoB++, WebArena provides a far more realistic and challenging evaluation. And compared to benchmarks that use live public websites, WebArena's self-hosted design ensures reproducibility and prevents issues with websites changing between evaluation runs.
WebArena's Gymnasium-compatible interface has led to its integration with several popular agent development and evaluation frameworks.
BrowserGym, developed by ServiceNow, is a unified environment for web task automation that supports WebArena, WorkArena, and MiniWoB within a single framework. BrowserGym provides standardized observation formats (HTML, accessibility tree, screenshot, set-of-marks) and action spaces, making it easier to develop agents that work across multiple benchmarks.
AgentLab, also from ServiceNow, extends BrowserGym with sophisticated parallelization and analysis tools. It supports running 20 to 100 parallel evaluation tasks and is the recommended framework for new WebArena experiments as of December 2024.
The WebArena codebase itself uses Playwright for browser automation, enabling programmatic control of Chromium-based browsers.
WebArena's environment is distributed as a set of Docker containers, each hosting one of the web applications. The researchers provide scripts to reset the environment to a deterministic initial state before each evaluation run, ensuring that task outcomes are not affected by lingering state from previous runs.
The configuration includes the following service ports:
| Service | Default Port |
|---|---|
| Shopping (OneStopShop) | 7770 |
| Shopping Admin (CMS) | 7780 |
| Reddit (Postmill) | 9999 |
| GitLab | 8023 |
| Map (OpenStreetMap) | 3000 |
| Wikipedia (Kiwix) | 8888 |
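An agent harness addresses each service by a base URL assembled from these ports. A hedged sketch, assuming a single host machine (the `WEBARENA_HOST` variable name is illustrative, not the harness's actual configuration mechanism):

```python
import os

# Base URLs for the self-hosted services, built from the default ports
# in the table above. WEBARENA_HOST is an invented variable name used
# here for illustration.
HOST = os.environ.get("WEBARENA_HOST", "localhost")

SITE_URLS = {
    "shopping":       f"http://{HOST}:7770",
    "shopping_admin": f"http://{HOST}:7780",
    "reddit":         f"http://{HOST}:9999",
    "gitlab":         f"http://{HOST}:8023",
    "map":            f"http://{HOST}:3000",
    "wikipedia":      f"http://{HOST}:8888",
}
```

Keeping all endpoints configurable in one place is what allows the same task definitions to run against any self-hosted deployment.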
As of February 2026, optimized Docker images for all WebArena environments are available on Docker Hub, with images up to 92% smaller than the originals. These optimized images include auto-login headers and a single container option for the Map service.
WebArena has had a substantial influence on the development of autonomous web agents. Its contributions include:
Establishing a standard evaluation protocol: WebArena's emphasis on functional correctness over action-sequence matching has become the default approach for evaluating web agents. This paradigm recognizes that there are multiple valid strategies for completing any given task.
Providing a realistic testing ground: By using fully functional web applications with real data, WebArena exposed the gap between agent performance on simplified benchmarks and their ability to handle real-world complexity.
Catalyzing rapid progress: The clear, measurable nature of WebArena's task success metric has motivated rapid iteration in agent design. The jump from 14% to over 60% success in two years demonstrates the benchmark's role in driving the field forward.
Spawning an ecosystem: The WebArena framework has been extended into VisualWebArena, VideoWebArena, TheAgentCompany, WebChoreArena, and WebArena-Infinity, creating a comprehensive suite of benchmarks that cover text, vision, video, and enterprise domains.
Highlighting remaining challenges: Even with 60% success rates, agents still fall short of human performance at 78%. The remaining gap involves tasks requiring deep visual understanding, common-sense reasoning, long-horizon planning, and robust error recovery.
Despite its strengths, WebArena has several recognized limitations. Its four sites cover only a narrow slice of the web's diversity, and the fixed task set invites overfitting as the benchmark ages. Full evaluation runs are slow and computationally expensive, a cost the WebArena Verified Hard subset was created to reduce. The predominantly text-based observations understate the visual reasoning many real pages require, a gap that motivated VisualWebArena. Finally, the sandboxed environment omits real-world obstacles such as CAPTCHAs, rate limiting, and anti-bot defenses.
WebArena was created by researchers primarily from Carnegie Mellon University's Language Technologies Institute, including Shuyan Zhou, Frank F. Xu, and Hao Zhu.
The project website is hosted at webarena.dev, and the source code is available on GitHub under the web-arena-x organization.