VisualWebArena

Updated Jun 23, 2026

VisualWebArena (often abbreviated VWA) is a benchmark of 910 realistic, visually grounded web tasks for evaluating multimodal autonomous AI agents, released in January 2024 by researchers at Carnegie Mellon University and presented at ACL 2024 ^[1]. It extends the earlier text-only WebArena by adding self-hosted Classifieds, Shopping, and Reddit-style sites where many tasks cannot be solved without interpreting images, and it is best known for showing a large gap between agents and people: the strongest model in the launch paper, GPT-4 with vision and Set-of-Mark prompting, reached 16.37 percent task success versus 88.7 percent for humans ^[1]. Each task pairs a natural language instruction with a starting URL and, for about a quarter of tasks, one or more reference images that the agent must understand to act correctly ^[1].

VisualWebArena matters because it pinned down, with hard numbers, how far vision-language AI agents lagged behind humans on web tasks that require both reading and seeing. The paper concludes that "while VLMs show promise, there remains a considerable performance gap compared to humans, who achieve very high success rates on VisualWebArena" ^[1]. That gap, plus the public release of Docker images, made VWA one of the more cited testbeds for browser agents during 2024 and 2025. The repository is now wrapped by BrowserGym along with its sibling benchmarks, so many browser agent papers report VWA scores almost as a default.

What problem does VisualWebArena solve?

Before WebArena and VisualWebArena, the standard benchmarks for web-based AI agents were either heavily simplified (MiniWoB and MiniWoB++ presented small synthetic widgets like buttons and forms) or static (datasets like Mind2Web gave agents pre-recorded HTML rather than live websites). Live-web evaluations had problems of their own. Real sites change, ban scrapers, and throw rate limits at automation, which makes any reported success rate hard to reproduce a few months later.

WebArena, introduced by Shuyan Zhou and colleagues in mid-2023, took a different route. The team self-hosted four open-source web applications (a Reddit-style forum, a Magento e-commerce site called OneStopShop, a self-hosted GitLab, and a content management system for administering the online store) and wrote 812 tasks against them ^[3]. Because the sites were containerized, anyone could spin up a fresh copy and reproduce the benchmark deterministically. WebArena was accepted at ICLR 2024 and quickly became a default reference point for browser agent research.

WebArena, however, was almost entirely text-driven. Agents got a serialized accessibility tree or HTML and produced actions in a textual action space. The visual side of the web (product photos, charts, screenshots in forum posts, item thumbnails on classifieds) was either invisible to the agent or reduced to alt text. That left a clear hole. A lot of real human web work involves looking at pictures: comparing two listings on a marketplace, picking out a product that matches a reference image, identifying which post on a forum has the cat in it. None of that fit cleanly into WebArena's evaluation. As the VWA authors put it, "most computer interfaces cater to human perception" and "visual information often augments textual data in ways that text-only models struggle to harness effectively" ^[1].

VisualWebArena was the answer to that gap. The authors (Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Ruslan Salakhutdinov, and Daniel Fried) kept WebArena's design philosophy of self-hosted, containerized, programmatically checkable tasks, and wrote 910 visually grounded tasks across three image-heavy sites ^[1]. The work came mostly out of CMU's Language Technologies Institute. Of the three environments, the Classifieds site is a new contribution built with real-world data, while the Shopping and Reddit environments are inherited from WebArena ^[1].

What sites and task design does it use?

VisualWebArena hosts three websites. They are open-source clones populated with realistic mock data, so anyone can run the benchmark behind a firewall without scraping or rate limits.

Site	Backend	Description
Classifieds	osclass	A Craigslist-style marketplace, new to VWA, populated with 65,955 seeded listings ^[1]. Listings have photos, prices, categories, and seller pages, which makes it the most visually demanding of the three sites.
Reddit	postmill	A Reddit-style forum (inherited from WebArena) with image posts, comments, voting, and subreddit-style "forums." Many tasks require identifying which post contains a particular image or reading text inside an image.
Shopping	Magento Open Source	An e-commerce site (also used by WebArena) repurposed for visually grounded product browsing. Many VWA tasks involve matching products by appearance.

Each site is shipped as a Docker image with a known database snapshot, so two researchers running the benchmark on different machines see exactly the same listings, posts, products, and prices. That matters more than it sounds. Earlier live-web benchmarks were full of unreproducible numbers because two papers running the "same" benchmark might be hitting different versions of a website.

Tasks are written by human annotators and stored as JSON specifications. The annotators were six computer science graduate students (co-authors of the paper) who first wrote intent templates and then expanded them into concrete tasks, collecting 314 unique templates at an average of 2.9 tasks per template ^[1]. Each task has at minimum:

A natural language instruction ("Find the cheapest red bicycle on Classifieds and add it to your watchlist").
A starting URL or set of starting URLs.
One or more reference images (a photo of a bicycle, a screenshot of a product page, a visual collage), for the 25.2 percent of tasks that include image inputs ^[1].
An evaluator that programmatically checks whether the agent's final state matches the goal. This is a mix of URL matchers, DOM matchers, exact-string matchers, and (for harder tasks) reference-answer comparisons.
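As a rough illustration of the fields listed above, the sketch below models one task record in Python. The class name, field names, the `eval_type` values, and the example URL are all hypothetical and chosen for readability; they are not the schema used in the released task files.

```python
import json
from dataclasses import dataclass, field


@dataclass
class TaskSpec:
    """Hypothetical, simplified task record; field names are illustrative only."""
    task_id: int
    site: str                 # "classifieds", "reddit", or "shopping"
    intent: str               # natural language instruction
    start_urls: list          # one or more starting URLs
    eval_type: str            # assumed labels: "url", "dom", or "string"
    reference_images: list = field(default_factory=list)  # optional image inputs

    def has_image_input(self) -> bool:
        return len(self.reference_images) > 0


def load_task(raw: str) -> TaskSpec:
    """Parse one JSON task record and check the minimum required content."""
    data = json.loads(raw)
    if not data["start_urls"]:
        raise ValueError("a task needs at least one starting URL")
    return TaskSpec(**data)


example = json.dumps({
    "task_id": 0,
    "site": "classifieds",
    "intent": "Find the cheapest red bicycle on Classifieds and add it to your watchlist",
    "start_urls": ["http://classifieds.example/"],  # placeholder host
    "eval_type": "dom",
    "reference_images": [],
})
task = load_task(example)
print(task.site, task.has_image_input())
```

Keeping the image inputs as an optional list mirrors the point that only a minority of tasks carry reference images, while every task carries an instruction, a start URL, and an evaluator.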

The paper categorizes tasks by the visual skill they require. Roughly, the categories are object recognition ("find the listing with this exact product"), scene understanding ("which of these posts shows an outdoor photo"), counting ("how many red items are on the first page"), and color or attribute matching. The split is not perfectly clean, since most tasks use more than one skill at once, but it is enough to show that an agent strong on object recognition can still fail on counting. The benchmark also deliberately includes 46 tasks (5.1 percent) that are unachievable, to test whether an agent knows when to stop rather than hallucinate a result ^[1].

How many tasks does VisualWebArena have?

VisualWebArena ships 910 tasks total, split across the three sites. The distribution leans toward the Shopping site, with Classifieds and the Reddit-style forum contributing the denser visual content. The exact per-site counts, taken from the released task configuration files, are below and sum to 910 ^[1]^[2].

Site	Task count
Classifieds	234
Reddit	210
Shopping	466
Total	910

The paper reports that 25.2 percent of tasks (229 tasks) require one or more input images, where the agent is given a photo to find or compare against, and almost all of the rest require the agent to look at images embedded in the page ^[1]. A further 17.1 percent of tasks require optical character recognition (reading text out of an image), and 8.7 percent require exact image matching ^[1]. The distribution of action steps per task is broadly similar to WebArena's, with most tasks completable in well under twenty actions if the agent does the right thing.
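The arithmetic behind these figures is easy to check. The short script below uses only the counts and percentages quoted in this section; the task counts derived from the percentages are rounded approximations, not numbers taken from the paper.

```python
# Per-site task counts as given in the table in this section.
task_counts = {"Classifieds": 234, "Reddit": 210, "Shopping": 466}
total = sum(task_counts.values())

# Share of the benchmark contributed by each site.
shares = {site: 100 * n / total for site, n in task_counts.items()}
for site, n in task_counts.items():
    print(f"{site:<12} {n:>4}  {shares[site]:5.1f}%")

# Percentages quoted from the paper, converted to approximate task counts.
image_input_tasks = round(0.252 * total)
ocr_tasks = round(0.171 * total)
exact_match_tasks = round(0.087 * total)
print(total, image_input_tasks, ocr_tasks, exact_match_tasks)
```

The Shopping site alone accounts for just over half of the tasks, which is worth keeping in mind when reading overall success rates.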

How is VisualWebArena evaluated?

VisualWebArena reports a single primary metric: success rate. A task either succeeds or fails. The benchmark does not give partial credit, which is intentional. The authors argue that for browser tasks, partial completion ("the agent added the wrong product to the cart") is functionally a failure, and rewarding it with partial credit makes the leaderboard misleading.
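A minimal sketch of how a binary, no-partial-credit success rate could be aggregated is shown below. The function name and the input format (pairs of site name and boolean outcome) are assumptions made for the example, not the benchmark's own harness.

```python
from collections import defaultdict


def success_rates(results):
    """Aggregate (site, succeeded) pairs into per-site and overall percentages.

    Every task counts as exactly 0 or 1; there is no partial credit.
    """
    per_site = defaultdict(lambda: [0, 0])  # site -> [successes, attempts]
    for site, ok in results:
        per_site[site][0] += int(bool(ok))
        per_site[site][1] += 1
    if not per_site:
        raise ValueError("no results to aggregate")
    rates = {site: 100 * wins / n for site, (wins, n) in per_site.items()}
    total_wins = sum(wins for wins, _ in per_site.values())
    total_attempts = sum(n for _, n in per_site.values())
    rates["overall"] = 100 * total_wins / total_attempts
    return rates


demo = [("reddit", True), ("reddit", False), ("shopping", False), ("shopping", False)]
rates = success_rates(demo)
print(rates)
```

Note that the overall figure is computed over all tasks rather than as a mean of per-site rates, so sites with more tasks weigh more heavily.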

Under the hood, evaluation is a mix of three checker types:

URL-based checks confirm that the agent ended up on the correct page, possibly with specific query parameters.
HTML-based checks query the DOM for a specific element or attribute ("is this product in the cart with quantity 2").
Reference-string checks compare the agent's textual answer (for tasks that ask the agent to extract a value) against a gold answer.

These checkers run inside the same Docker network as the sites, so they can hit internal APIs the agent does not have access to. For Reddit-style tasks the checker can ask the postmill database directly whether a comment was created; for Classifieds it can verify that the watchlist contains the right item ID. This is more reliable than scraping the agent's final screenshot.
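The three checker types can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the benchmark's evaluator: the function names are invented, the HTML check reads from a plain dictionary standing in for values a real checker would pull from the DOM or database, and the case-insensitive string comparison is one possible normalization.

```python
from urllib.parse import urlparse, parse_qs


def url_check(final_url, expected_path, expected_params=None):
    """Pass if the agent ended on the expected path with the expected query values."""
    parsed = urlparse(final_url)
    if parsed.path.rstrip("/") != expected_path.rstrip("/"):
        return False
    query = parse_qs(parsed.query)
    return all(query.get(k) == [v] for k, v in (expected_params or {}).items())


def html_check(page_state, key, expected):
    """Pass if a value read from the page matches; page_state is a stand-in dict."""
    return page_state.get(key) == expected


def string_check(answer, gold):
    """Pass if the agent's textual answer matches the gold answer after trimming."""
    return answer.strip().lower() == gold.strip().lower()


# A task succeeds only if every check attached to it passes.
checks = [
    url_check("http://shop.example/cart/?item=42&qty=2", "/cart", {"qty": "2"}),
    html_check({"cart_quantity": 2}, "cart_quantity", 2),
    string_check(" 19.99 ", "19.99"),
]
task_succeeded = all(checks)
print(task_succeeded)
```

Combining the checks with `all` reflects the binary scoring: a single failed check makes the whole task a failure.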

Humans set the upper bound. The original paper recruited a small set of annotators, gave them the same instructions and the same starting state as the agents, and recorded their success rate. The overall human success rate is 88.7 percent (91.07 percent on Classifieds, 87.10 percent on Reddit, and 88.39 percent on Shopping), and that 88.7 percent figure is what most subsequent papers compare against ^[1]. It is not 100 percent because some tasks are genuinely ambiguous or have edge cases that even careful humans miss.

What were the original results?

The launch paper reports results for a range of vision-language models and multimodal models. The numbers below are the overall success rates from Table 3 of the paper ^[1].

Model	Inputs	Overall success rate
Human annotators	live webpage, same task and starting state	88.70%
GPT-4o (added in the ACL camera-ready)	image + Set-of-Mark	19.78% ^[1]
GPT-4 with vision (GPT-4V) + Set-of-Mark prompting	image + captions + Set-of-Mark	16.37%
GPT-4V (multimodal, no Set-of-Mark)	image + captions + accessibility tree	15.05%
GPT-4 + BLIP-2 captions	accessibility tree + captions	12.75%
GPT-4, text-only	accessibility tree only	7.25%
Gemini-Pro (multimodal)	image + captions + accessibility tree	6.04%
IDEFICS-80B-Instruct	image + captions + accessibility tree	0.99%
CogVLM	image + captions + accessibility tree	0.33%

The spread between the best closed model and the human ceiling is the part most often quoted. A roughly five-fold gap between GPT-4V with the strongest available prompting and a panel of humans is hard to explain away as "we just need a bigger model." The paper itself states that "all existing models substantially underperform compared to humans, which indicate significant headroom in VisualWebArena for future work" ^[1]. The error analysis points at recurring failure modes: agents miss small UI elements they should have clicked, hallucinate items that are not on the page, lose track of multi-step plans, and fail to ground references like "the second listing" to actual elements.
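The size of the gap can be computed directly from the table. The snippet below uses only the overall success rates listed above and reports, for each model, the difference from the human rate in percentage points and the ratio of human to agent success.

```python
human = 88.70  # overall human success rate, in percent

# Overall success rates from the results table, in percent.
agents = {
    "GPT-4o + SoM": 19.78,
    "GPT-4V + SoM": 16.37,
    "GPT-4V": 15.05,
    "GPT-4 + captions": 12.75,
    "GPT-4 text-only": 7.25,
    "Gemini-Pro": 6.04,
    "IDEFICS-80B-Instruct": 0.99,
    "CogVLM": 0.33,
}

gaps = {name: human - rate for name, rate in agents.items()}
ratios = {name: human / rate for name, rate in agents.items()}

for name in agents:
    print(f"{name:<22} gap {gaps[name]:6.2f} pts   humans {ratios[name]:6.1f}x")
```

For GPT-4V with Set-of-Mark, the ratio works out to about 5.4, which is where the "roughly five-fold" description comes from.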

Open-source VLMs in early 2024 sat near the floor, with IDEFICS-80B at 0.99 percent and CogVLM at 0.33 percent overall ^[1]. That is not really a fair fight (they were trained mainly on captioning and visual question answering, not on long-horizon agentic tasks), but it set a baseline that follow-up open-source agent work has been chasing ever since.

What is Set-of-Mark prompting and why does it help?

One of the more durable contributions of VisualWebArena is the demonstration that Set-of-Mark prompting helps non-trivially on visually grounded web tasks. Set-of-Mark is a technique introduced in earlier work (Yang et al., 2023) where every interactive element on a page is overlaid with a numbered marker before the screenshot is sent to the model ^[4]. Instead of asking GPT-4V to describe coordinates or point at a button, the prompt says something like "click element 17."
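The core idea can be sketched without any image handling. In the example below, each interactive element gets a numeric ID, the IDs are listed in text alongside the marked-up screenshot, and a model reply that names an ID is mapped back to a click position. The `Element` class, the helper names, and the `click [N]` reply format are assumptions made for illustration; drawing the markers onto the screenshot is omitted.

```python
import re
from dataclasses import dataclass


@dataclass
class Element:
    tag: str
    text: str
    box: tuple  # (x, y, width, height) in screenshot pixels


def assign_marks(elements):
    """Give each interactive element a numeric ID, in page order."""
    return {i: el for i, el in enumerate(elements, start=1)}


def describe_marks(marks):
    """Text listing that could accompany the marked-up screenshot in the prompt."""
    return "\n".join(f"[{i}] <{el.tag}> {el.text}" for i, el in marks.items())


def resolve_click(action, marks):
    """Map a reply such as 'click [2]' to the centre of that element, or None."""
    match = re.fullmatch(r"\s*click \[(\d+)\]\s*", action)
    if not match or int(match.group(1)) not in marks:
        return None
    x, y, w, h = marks[int(match.group(1))].box
    return (x + w / 2, y + h / 2)


marks = assign_marks([
    Element("a", "Red bicycle", (10, 20, 100, 40)),
    Element("button", "Add to watchlist", (200, 300, 80, 20)),
])
print(describe_marks(marks))
print(resolve_click("click [2]", marks))
```

The point of the indirection is that the model only has to name a marker it can see, while the harness, which knows the bounding boxes, handles the translation to screen coordinates.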

VisualWebArena evaluates Set-of-Mark on top of GPT-4V. Overall it lifts success from 15.05 percent to 16.37 percent, with the gains concentrated on the Reddit and Classifieds sites: Reddit improves from 12.38 percent to 17.14 percent and Classifieds from 8.12 percent to 9.83 percent ^[1]. The authors note that "the SoM representation is superior for strong VLM agents, which can more accurately disentangle and click on the desired elements," while for weaker VLMs Set-of-Mark does not help, which they attribute to the finding that only GPT-4V reliably demonstrates the Set-of-Mark grounding ability ^[1]. That is not enough to close the gap with humans, but it is enough to inform a debate that was active at the time about whether vision-language models could ground references on their own. The VWA result was: the strong ones benefit from the overlays, and the weak ones cannot use them.

Many subsequent web agent systems adopted Set-of-Mark or a closely related variant. Later computer-use products such as Anthropic's Computer Use (late 2024) and OpenAI's Operator took a different route and act on raw screenshots with pixel coordinates, so element overlays are best seen as one widely used grounding option rather than a universal requirement.

Implementation

The full benchmark lives in the GitHub repository web-arena-x/visualwebarena ^[2]. The repository contains Docker Compose files that bring up the three sites along with the WebArena sites for a unified environment, the 910 task JSON files, reference agent implementations for GPT-4V and the open-source models in the paper, an evaluation harness that runs the programmatic checkers, and scripts to take Set-of-Mark screenshots and drive a real Chromium browser via Playwright.

The stated minimum hardware to host the full set of sites is around 16 GB of RAM and roughly 100 GB of disk, mostly because the Magento and osclass database snapshots are not small. Most teams run the sites on a single workstation or a modest cloud VM and point their agents at it from a separate process.

Unlike live-web benchmarks, the Docker-based setup means a 2025 paper running VWA can reproduce the 2024 numbers nearly exactly. The agent code changes; the world does not.

How does VisualWebArena compare to other agent benchmarks?

VisualWebArena is one of a growing family of agent benchmarks released since 2023. The table below summarizes the closest siblings and how they relate.

Benchmark	Year	Domain	Modality	Notes
WebArena	2023	Self-hosted Reddit, GitLab, Magento, CMS	Text (HTML/AX tree)	Direct predecessor; 812 tasks; shares several authors with the VWA team ^[3].
VisualWebArena	2024	Self-hosted Classifieds, Reddit, Shopping	Text + images	910 visually grounded tasks ^[1].
Mind2Web	2023	137 real websites, static traces	Text + screenshots	Static dataset, not live; pre-recorded HTML and trajectories ^[12].
OSWorld	2024	Full desktop OS (Linux, Windows, macOS)	Screen + keyboard/mouse	369 tasks across desktop apps; broader than browser only ^[6].
AgentBench	2023	OS, web, DB, knowledge graph, code, etc.	Mixed	Eight environments; broad evaluation, less depth on browser tasks ^[8].
GAIA	2023	Open-ended assistant questions	Multimodal	466 questions; tests tool use, not specifically browsers ^[9].
WebVoyager	2024	Live-web tasks across real sites	Text + screenshots	Uses real production websites; reproducibility is harder ^[7].
WorkArena	2024	Self-hosted ServiceNow	Text + screenshots	Enterprise workflows on ServiceNow ^[5].
BrowserGym	2024	Wraps WebArena, VWA, WorkArena, MiniWoB++	Mixed	Common harness rather than a benchmark; reports unified scores across the wrapped benchmarks.

VWA and WebArena form the spine of most browser-agent leaderboards. OSWorld is the obvious step beyond the browser. GAIA tests something different, namely whether an assistant can stitch together tools and reach the right answer. None of these benchmarks fully overlap; in practice, papers report numbers on two or three of them and treat the bundle as a coverage metric.

One thing worth noticing is that all of these benchmarks emerged in roughly an eighteen-month window. The pace of release is itself a sign of how unsettled the agent evaluation space was during this period.

How have scores changed from 2024 to 2026?

The leaderboard has moved a lot since the launch. Numbers below are drawn from the original paper, the BrowserGym repository, follow-up papers, and vendor blog posts; some are still single data points and may not be perfectly comparable across runs.

Period	Best published VWA success rate	Driver
Early 2024 (launch)	16.37%	GPT-4V plus Set-of-Mark prompting from the original paper ^[1].
Mid 2024 (ACL camera-ready)	19.78%	GPT-4o with Set-of-Mark, added in the updated paper ^[1].
Late 2024	mid 20s to low 30s percent	Anthropic Claude and OpenAI's GPT-4 family plus dedicated browser agent stacks reported gains in this range; Anthropic's Computer Use public preview drew explicit comparisons to VWA-style tasks ^[10].
2025	mid 30s to mid 40s percent for top systems	OpenAI's Operator, Google Gemini-based browser agents, and dedicated open-source agents (BrowserGym recipes, Anthropic and Meta agent stacks) clustered in this range on VWA reports ^[11].
Late 2025 / early 2026	the strongest systems began to approach ~ 50% on VWA	Driven by reasoning-trained multimodal models combined with stronger browser harnesses; the human ceiling near 88.7 percent remained well above the best agent results.

These numbers come with the usual caveats. Different papers run different subsets, use different action spaces, and use different timeouts. The trend is real, but the absolute numbers should be treated with some skepticism. The cleanest like-for-like comparisons come from BrowserGym, which fixes the harness and reports a unified VWA score across whatever model the user plugs in.

A pattern that holds across the period: closed-source models stayed ahead of open-source ones on VWA, the gap narrowed, and the human ceiling did not move. As of early 2026 no agent has reported parity with humans on VWA in a peer-reviewed paper.

How has VisualWebArena influenced agent research?

VisualWebArena's influence shows up in several places.

First, the number itself. The 16.37 percent launch result became one of the more cited reality checks against the early-2024 narrative that GPT-4 class models were almost ready to replace human web workers ^[1]. A roughly five-times-worse-than-humans result on tasks an undergraduate can do in an afternoon was hard to spin.

Second, Set-of-Mark prompting and numbered overlays became a common option in subsequent browser agent stacks. Open-source agent harnesses like BrowserGym offer it as a configurable preprocessing step.

Third, the evaluation methodology (programmatic checkers running inside the same Docker network as the site) became something of a template. WorkArena uses the same approach ^[5]. Several internal evaluation harnesses at large labs are structured the same way. The reproducibility benefit is large enough that almost no new browser agent benchmark released after VWA goes back to live-web evaluation as a primary mode.

Fourth, the dataset itself feeds into training. Several teams have used VWA tasks (or close variants) as training or fine-tuning data for open-source browser agents. The line between training set and evaluation set is therefore not as clean as it was at launch, and most serious leaderboards now distinguish between zero-shot and trained-on-VWA results.

Finally, VWA helped surface the broader point that web agent quality is not just a vision-language-model problem. The same GPT-4V scores very differently with and without Set-of-Mark, with and without a planning loop, with and without retry behavior. That is partly an empirical claim and partly a research-agenda claim. It implies that better browser agents will come at least as much from harness work as from raw model improvements.

Limitations

VisualWebArena is not a perfect benchmark, and the authors are explicit about most of the limitations.

The sites are clones, not real production websites, so they miss the long tail of weirdness on the real web (broken JavaScript, anti-bot interstitials, A/B tests, popups). An agent that crushes VWA might still struggle on a real classifieds site that uses a custom widget the benchmark never thought to include.

Tasks are written by humans and reflect what humans thought was a reasonable mix of visual skills. They under-represent tasks involving charts, scientific images, or specialized domains like medical imaging, and they over-represent the Shopping site, which holds 466 of the 910 tasks ^[1]^[2].

The single-success-or-failure metric does not distinguish between catastrophic failure (the agent did the wrong thing) and benign failure (the agent decided the task was impossible and stopped). For some operational settings the second kind of failure is much less costly than the first.

The benchmark is in English. None of the sites are localized for other languages, and none of the tasks specifically test multilingual or cross-cultural visual web work. A handful of regional spinoffs try to fill this gap.

Finally, the human ceiling near 88.7 percent is not 100 percent. Some VWA tasks are genuinely ambiguous, which puts a soft cap on how much further agent scores can climb before the benchmark is effectively saturated.

References

[1] Koh, J. Y., Lo, R., Jang, L., Duvvur, V., Lim, M. C., Huang, P.-Y., Neubig, G., Zhou, S., Salakhutdinov, R., and Fried, D. (2024). "VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks." Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024), pages 881-905. arXiv:2401.13649. https://aclanthology.org/2024.acl-long.50/
[2] web-arena-x/visualwebarena. Official GitHub repository. https://github.com/web-arena-x/visualwebarena
[3] Zhou, S., Xu, F. F., Zhu, H., Zhou, X., Lo, R., Sridhar, A., Cheng, X., Bisk, Y., Fried, D., Alon, U., and Neubig, G. (2023). "WebArena: A Realistic Web Environment for Building Autonomous Agents." arXiv:2307.13854. ICLR 2024.
[4] Yang, J., Zhang, H., Li, F., Zou, X., Li, C., and Gao, J. (2023). "Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V." arXiv:2310.11441.
[5] Drouin, A., Gasse, M., Caccia, M., Laradji, I. H., Del Verme, M., Marty, T., Boisvert, L., Thakkar, M., Cappart, Q., Vazquez, D., Chapados, N., and Lacoste, A. (2024). "WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?" arXiv:2403.07718.
[6] Xie, T., Zhang, D., Chen, J., Li, X., Zhao, S., Cao, R., Hua, T. J., Cheng, Z., Shin, D., Lei, F., et al. (2024). "OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments." arXiv:2404.07972.
[7] He, H., Yao, W., Ma, K., Yu, W., Dai, Y., Zhang, H., Lan, Z., and Yu, D. (2024). "WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models." arXiv:2401.13919.
[8] Liu, X., Yu, H., Zhang, H., Xu, Y., Lei, X., Lai, H., Gu, Y., Ding, H., Men, K., Yang, K., et al. (2023). "AgentBench: Evaluating LLMs as Agents." arXiv:2308.03688.
[9] Mialon, G., Fourrier, C., Swift, C., Wolf, T., LeCun, Y., and Scialom, T. (2023). "GAIA: A Benchmark for General AI Assistants." arXiv:2311.12983.
[10] Anthropic (2024). "Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku." Anthropic blog, October 22, 2024.
[11] OpenAI (2025). "Introducing Operator." OpenAI blog, January 23, 2025.
[12] Deng, X., Gu, Y., Zheng, B., Chen, S., Stevens, S., Wang, B., Sun, H., and Su, Y. (2023). "Mind2Web: Towards a Generalist Agent for the Web." NeurIPS 2023.
