VisualWebArena
Last reviewed
May 2, 2026
Sources
12 citations
Review status
Source-backed
Revision
v1 · 3,495 words
VisualWebArena (often abbreviated VWA) is a benchmark for evaluating multimodal autonomous agents on visually grounded web tasks. It was released in January 2024 by researchers at Carnegie Mellon University and collaborators, and the accompanying paper appeared at ACL 2024. The benchmark extends the earlier text-only WebArena by adding three self-hosted websites (a Classifieds site, a Reddit-style forum, and a Shopping site) where many tasks cannot be solved without looking at images. Each task comes with a natural language instruction, and most include reference images that the agent must interpret to figure out what to do.
VisualWebArena matters because it pinned down, with hard numbers, how far vision-language agents lagged behind humans on web tasks that require both reading and seeing. In the launch paper, the strongest model tested (GPT-4 with vision and Set-of-Mark prompting) reached roughly 16 to 17 percent task success, while a small group of human annotators reached around 88 percent on the same tasks. That gap, plus the public release of Docker images, made VWA one of the more cited testbeds for browser agents during 2024 and 2025. The repository is now wrapped by BrowserGym along with its sibling benchmarks, so most modern browser agent papers report VWA scores almost by default.
Before WebArena and VisualWebArena, the standard benchmarks for web-based AI agents were either heavily simplified (MiniWoB and MiniWoB++ presented small synthetic widgets like buttons and forms) or static (datasets in the style of Mind2Web gave agents pre-recorded HTML rather than live websites). Live-web evaluations had problems of their own. Real sites change, ban scrapers, and throw rate limits at automation, which makes any reported success rate hard to reproduce a few months later.
WebArena, introduced by Shuyan Zhou and colleagues in mid-2023, took a different route. The team self-hosted four open-source web applications (a Reddit-style forum, a Magento-based e-commerce store called OneStopShop, a self-hosted GitLab, and a content management system) and wrote 812 tasks against them. Because the sites were containerized, anyone could spin up a fresh copy and reproduce the benchmark deterministically. WebArena was accepted at ICLR 2024 and quickly became a default reference point for browser agent research.
WebArena, however, was almost entirely text-driven. Agents got a serialized accessibility tree or HTML and produced actions in a textual action space. The visual side of the web (product photos, charts, screenshots in forum posts, item thumbnails on classifieds) was either invisible to the agent or reduced to alt text. That left a clear hole. A lot of real human web work involves looking at pictures: comparing two listings on a marketplace, picking out a product that matches a reference image, identifying which post on a forum has the cat in it. None of that fit cleanly into WebArena's evaluation.
VisualWebArena was the answer to that gap. The authors (Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Ruslan Salakhutdinov, and Daniel Fried) kept WebArena's design philosophy of self-hosted, containerized, programmatically checkable tasks, and grafted on three image-heavy sites and 910 visually grounded tasks. The work came out of Carnegie Mellon University, and the author list overlaps with the WebArena team.
VisualWebArena hosts three websites. They are open-source clones populated with realistic mock data, so anyone can run the benchmark behind a firewall without scraping or rate limits.
| Site | Backend | Description |
|---|---|---|
| Classifieds | osclass | An eBay-style marketplace with thousands of seeded listings. Listings have photos, prices, categories, and seller pages, which makes it the most visually demanding of the three sites. |
| Reddit | postmill | A Reddit-style forum with image posts, comments, voting, and subreddit-style "forums." Many tasks require identifying which post contains a particular image or reading text inside an image. |
| Shopping | Magento Open Source | An e-commerce site (also used by WebArena) repurposed for visually grounded product browsing. Many VWA tasks involve matching products by appearance. |
Each site is shipped as a Docker image with a known database snapshot, so two researchers running the benchmark on different machines see exactly the same listings, posts, products, and prices. That matters more than it sounds. Earlier live-web benchmarks were full of unreproducible numbers because two papers running the "same" benchmark might be hitting different versions of a website.
Tasks are written by human annotators and stored as JSON specifications. Each task has at minimum:
The paper categorizes tasks by the visual skill they require. Roughly, the categories are object recognition ("find the listing with this exact product"), scene understanding ("which of these posts shows an outdoor photo"), counting ("how many red items are on the first page"), and color or attribute matching. The split is not perfectly clean, since most tasks use more than one skill at once, but it is enough to show that an agent strong on object recognition can still fail on counting.
VisualWebArena ships 910 tasks in total, split across the three sites. Shopping accounts for roughly half of them, and the remainder is divided fairly evenly between Classifieds and the Reddit-style forum. The Shopping site already exists in WebArena; here it is used for a visually grounded subset of e-commerce tasks.
| Site | Approximate task count |
|---|---|
| Classifieds | ~ 234 |
| Reddit | ~ 210 |
| Shopping | ~ 466 |
| Total | 910 |
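As a quick check on the table above, the per-site shares can be worked out directly from the approximate counts. This is a minimal sketch; the dictionary simply restates the table, and the function name `site_shares` is illustrative.

```python
# Approximate per-site task counts, restated from the table above.
TASK_COUNTS = {"Classifieds": 234, "Reddit": 210, "Shopping": 466}


def site_shares(counts):
    """Return each site's share of the total task count, as a percentage."""
    total = sum(counts.values())
    return {site: round(100 * n / total, 1) for site, n in counts.items()}


total_tasks = sum(TASK_COUNTS.values())
shares = site_shares(TASK_COUNTS)

print(total_tasks)
for site, pct in shares.items():
    print(f"{site}: {pct}%")
```

Under these counts, Shopping holds a little over half of the tasks, with Classifieds and Reddit at roughly a quarter each.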
The paper reports that roughly a quarter of tasks require an input image (the agent is given a photo to find or compare against), and almost all of the rest require the agent to look at images embedded in the page. The distribution of action steps per task is broadly similar to WebArena's, with most tasks completable in well under twenty actions if the agent does the right thing.
VisualWebArena reports a single primary metric: success rate. A task either succeeds or fails. The benchmark does not give partial credit, which is intentional. The authors argue that for browser tasks, partial completion ("the agent added the wrong product to the cart") is functionally a failure, and rewarding it with partial credit makes the leaderboard misleading.
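The all-or-nothing metric described above reduces to a simple proportion. The sketch below assumes each task has already been judged as pass or fail; the outcome list is made up for illustration and is not benchmark data.

```python
from typing import Iterable


def success_rate(outcomes: Iterable[bool]) -> float:
    """Fraction of tasks that passed. Each task counts as 1 or 0, with no partial credit."""
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("no task outcomes to score")
    return sum(outcomes) / len(outcomes)


# Hypothetical run: 3 of 8 tasks pass.
example_outcomes = [True, False, False, True, False, False, True, False]
rate = success_rate(example_outcomes)
print(f"{rate:.1%}")
```

A task where the agent got most of the way there (for example, adding the wrong product to the cart) would appear as `False` in this list, exactly like a task where the agent did nothing useful.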
Under the hood, evaluation is a mix of three checker types:
These checkers run inside the same Docker network as the sites, so they can hit internal APIs the agent does not have access to. For Reddit-style tasks the checker can ask the postmill database directly whether a comment was created; for Classifieds it can verify that the watchlist contains the right item ID. This is more reliable than scraping the agent's final screenshot.
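The idea of checking backend state rather than the agent's final screenshot can be sketched as follows. This is an illustration only: the in-memory SQLite database, the `watchlist` table, and its columns are hypothetical stand-ins, not the actual osclass schema or the benchmark's checker code.

```python
import sqlite3


def watchlist_contains(conn, user_id, item_id):
    """Return True if the given item is on the user's watchlist in the site database."""
    row = conn.execute(
        "SELECT 1 FROM watchlist WHERE user_id = ? AND item_id = ? LIMIT 1",
        (user_id, item_id),
    ).fetchone()
    return row is not None


# Hypothetical stand-in for the site's database after the agent has finished.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE watchlist (user_id INTEGER, item_id INTEGER)")
conn.execute("INSERT INTO watchlist VALUES (1, 4021)")

task_passed = watchlist_contains(conn, user_id=1, item_id=4021)
wrong_item_passed = watchlist_contains(conn, user_id=1, item_id=9999)
print(task_passed, wrong_item_passed)
```

The check depends only on what the site actually stored, so it gives the same verdict regardless of how the page happened to render at the end of the episode.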
Humans set the upper bound. The original paper recruited a small set of annotators, gave them the same instructions and the same starting state as the agents, and recorded their success rate. That number, around 88.7 percent, is what most subsequent papers compare against. It is not 100 percent because some tasks are genuinely ambiguous or have edge cases that even careful humans miss.
The launch paper reports results for a range of vision-language models and multimodal models. The numbers below are the headline success rates from the paper, rounded.
| Model | Notes | Overall success rate |
|---|---|---|
| Human annotators | small panel, same task and starting state | ~ 88.7% |
| GPT-4 with vision (GPT-4V) + Set-of-Mark prompting | best non-human result | ~ 16 to 17% |
| GPT-4V (no Set-of-Mark) | text accessibility tree plus screenshot | ~ 12 to 14% |
| GPT-4 with text-only access | accessibility tree only, no images | low single digits on visually grounded tasks |
| Gemini Pro Vision | early Gemini multimodal model | ~ 6 to 8% |
| LLaVA 1.5 | open-source 13B VLM | low single digits |
| IDEFICS | open-source multimodal model | low single digits |
| Mini-GPT4 | open-source VLM | near zero |
| CogVLM | open-source VLM | low single digits |
The spread between the best closed model and the human ceiling is the part most often quoted. A roughly five-fold gap between GPT-4V with the strongest available prompting and a small panel of humans is hard to explain away as "we just need a bigger model." The error analysis in the paper points at recurring failure modes: agents miss small UI elements they should have clicked, hallucinate items that are not on the page, lose track of multi-step plans, and fail to ground references like "the second listing" to actual elements.
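The "roughly five-fold" figure follows from the rounded numbers in the table. A small worked calculation, using the approximate 16 to 17 percent range and the 88.7 percent human rate quoted above:

```python
HUMAN_RATE = 88.7                # percent, human annotators
AGENT_RATE_RANGE = (16.0, 17.0)  # percent, GPT-4V + Set-of-Mark (rounded range)


def gap(human, agent):
    """Return the human-to-agent ratio and the absolute gap in percentage points."""
    return {"ratio": human / agent, "points": human - agent}


smallest_gap = gap(HUMAN_RATE, max(AGENT_RATE_RANGE))
largest_gap = gap(HUMAN_RATE, min(AGENT_RATE_RANGE))

print(f"ratio: {smallest_gap['ratio']:.1f}x to {largest_gap['ratio']:.1f}x")
print(f"points: {smallest_gap['points']:.1f} to {largest_gap['points']:.1f}")
```

Either end of the range leaves humans a bit more than five times as successful, with an absolute gap of a little over 70 percentage points.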
Open-source VLMs in early 2024 sat near the floor. That is not really a fair fight (they were trained mainly on captioning and visual question answering, not on long-horizon agentic tasks), but it set a baseline that follow-up open-source agent work has been chasing ever since.
One of the more durable contributions of VisualWebArena is the demonstration that Set-of-Mark prompting helps non-trivially on visually grounded web tasks. Set-of-Mark is a technique introduced in earlier work (Yang et al., 2023) where every interactive element on a page is overlaid with a numbered marker before the screenshot is sent to the model. Instead of asking GPT-4V to describe coordinates or point at a button, the prompt says something like "click element 17."
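The bookkeeping behind numbered markers can be sketched in a few lines. This is a simplified illustration, not the benchmark's implementation: it skips drawing the overlays onto the screenshot, and the `Element` class, the example page, and the `click [N]` action syntax are assumptions made for the example.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    tag: str
    text: str
    box: tuple  # (left, top, width, height) in pixels


def assign_marks(elements):
    """Give every interactive element a numeric mark, starting at 1."""
    return {mark: el for mark, el in enumerate(elements, start=1)}


def marks_as_text(marks):
    """Render the mark legend that would accompany the annotated screenshot."""
    return "\n".join(f"[{m}] <{el.tag}> {el.text}" for m, el in marks.items())


def resolve_click(action, marks):
    """Turn a model reply such as 'click [2]' into the centre pixel of that element."""
    match = re.fullmatch(r"click \[(\d+)\]", action.strip())
    if match is None:
        raise ValueError(f"unrecognised action: {action!r}")
    mark = int(match.group(1))
    if mark not in marks:
        raise KeyError(f"no element carries mark {mark}")
    left, top, width, height = marks[mark].box
    return (left + width / 2, top + height / 2)


# Hypothetical page with two interactive elements.
page = [
    Element("a", "Blue mountain bike", (40, 120, 200, 24)),
    Element("button", "Add to watchlist", (260, 118, 120, 28)),
]
marks = assign_marks(page)
legend = marks_as_text(marks)
click_point = resolve_click("click [2]", marks)
print(legend)
print(click_point)
```

The model only has to name a mark; the harness, which knows where each marked element sits, is responsible for turning that mark back into a concrete click.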
VisualWebArena evaluates Set-of-Mark on top of GPT-4V. Going by the rounded success rates in the results table, the improvement is a few points absolute (somewhere between two and five). That is not enough to close the gap with humans, but it spoke to a debate that was active at the time about whether vision-language models could ground references on their own. The VWA result suggested that, at least for the models tested, they do better with the overlays.
Most subsequent web agent systems use either Set-of-Mark or a closely related variant. Anthropic's Computer Use work in late 2024 and OpenAI's Operator both ship visible markings or coordinate grids in their tool harnesses, in part because of evidence from VWA and similar benchmarks that this is what the underlying models actually need.
The full benchmark lives in the GitHub repository web-arena-x/visualwebarena. The repository contains Docker Compose files that bring up the three sites along with the WebArena sites for a unified environment, the 910 task JSON files, reference agent implementations for GPT-4V and the open-source models in the paper, an evaluation harness that runs the programmatic checkers, and scripts to take Set-of-Mark screenshots and drive a real Chromium browser via Playwright.
The stated minimum hardware to host the full set of sites is around 16 GB of RAM and roughly 100 GB of disk, mostly because the Magento and osclass database snapshots are not small. Most teams run the sites on a single workstation or a modest cloud VM and point their agents at it from a separate process.
Unlike live-web benchmarks, the Docker-based setup means a 2025 paper running VWA can reproduce the 2024 numbers nearly exactly. The agent code changes; the world does not.
VisualWebArena is one of a growing family of agent benchmarks released in 2023, 2024, and 2025. The table below summarizes the closest siblings and how they relate.
| Benchmark | Year | Domain | Modality | Notes |
|---|---|---|---|---|
| WebArena | 2023 | Self-hosted Reddit, GitLab, Magento, CMS | Text (HTML/AX tree) | Direct predecessor; 812 tasks; shares several authors with the VWA team. |
| VisualWebArena | 2024 | Self-hosted Classifieds, Reddit, Shopping | Text + images | 910 visually grounded tasks. |
| Mind2Web | 2023 | 137 real websites, static traces | Text + screenshots | Static dataset, not live; pre-recorded HTML and trajectories. |
| OSWorld | 2024 | Full desktop OS (Linux, Windows, macOS) | Screen + keyboard/mouse | 369 tasks across desktop apps; broader than browser only. |
| AgentBench | 2023 | OS, web, DB, knowledge graph, code, etc. | Mixed | Eight environments; broad evaluation, less depth on browser tasks. |
| GAIA | 2023 | Open-ended assistant questions | Multimodal | 466 questions; tests tool use, not specifically browsers. |
| WebVoyager | 2024 | Live-web tasks across real sites | Text + screenshots | Uses real production websites; reproducibility is harder. |
| WorkArena | 2024 | Self-hosted ServiceNow | Text + screenshots | Enterprise workflows on ServiceNow. |
| BrowserGym | 2024 | Wraps WebArena, VWA, WorkArena, MiniWoB++ | Mixed | Common harness rather than a benchmark; reports unified scores across the wrapped benchmarks. |
VWA and WebArena form the spine of most browser-agent leaderboards. OSWorld is the obvious step beyond the browser. GAIA tests something different, namely whether an assistant can stitch together tools and reach the right answer. None of these benchmarks fully overlap; in practice, papers report numbers on two or three of them and treat the bundle as a coverage metric.
One thing worth noticing is that all of these benchmarks emerged in roughly an eighteen-month window. The pace of release is itself a sign of how unsettled the agent evaluation space was during this period.
The leaderboard has moved a lot since the launch. Numbers below are drawn from the original paper, the BrowserGym repository, follow-up papers, and vendor blog posts; some are still single data points and may not be perfectly comparable across runs.
| Period | Best published VWA success rate | Driver |
|---|---|---|
| Early 2024 (launch) | ~ 16 to 17% | GPT-4V plus Set-of-Mark prompting from the original paper. |
| Mid 2024 | low to mid 20s percent | Gains from improved prompting, longer context, and better screenshot annotations on top of GPT-4 class models. |
| Late 2024 | mid 20s to low 30s percent | Anthropic Claude and OpenAI's GPT-4 family plus dedicated browser agent stacks reported gains in this range; Anthropic's Computer Use public preview drew explicit comparisons to VWA-style tasks. |
| 2025 | mid 30s to mid 40s percent for top systems | OpenAI's Operator, Google Gemini-based browser agents, and dedicated open-source agents (BrowserGym recipes, Anthropic and Meta agent stacks) clustered in this range on VWA reports. |
| Late 2025 / early 2026 | the strongest systems began to approach ~ 50% on VWA | Driven by reasoning-trained multimodal models combined with stronger browser harnesses; the human ceiling near 88.7% remained well above the best agent results. |
These numbers come with the usual caveats. Different papers run different subsets, use different action spaces, and use different timeouts. The trend is real, but the absolute numbers should be treated with some skepticism. The cleanest like-for-like comparisons come from BrowserGym, which fixes the harness and reports a unified VWA score across whatever model the user plugs in.
A pattern that holds across the period: closed-source models stayed ahead of open-source ones on VWA, the gap narrowed, and the human ceiling did not move. As of early 2026 no agent has reported parity with humans on VWA in a peer-reviewed paper.
VisualWebArena's influence shows up in several places.
First, the number itself. The launch result of roughly 16 to 17 percent became one of the more cited reality checks against the early-2024 narrative that GPT-4 class models were almost ready to replace human web workers. A success rate about five times lower than that of humans, on tasks an undergraduate can do in an afternoon, was hard to spin.
Second, Set-of-Mark prompting and numbered overlays became close to standard in subsequent browser agent stacks. Both Anthropic's Computer Use and OpenAI's Operator ship some form of element annotation by default. Open-source agent harnesses like BrowserGym bake it in as a configurable preprocessing step.
Third, the evaluation methodology (programmatic checkers running inside the same Docker network as the site) became something of a template. WorkArena uses the same approach. Several internal evaluation harnesses at large labs are structured the same way. The reproducibility benefit is large enough that almost no new browser agent benchmark released after VWA goes back to live-web evaluation as a primary mode.
Fourth, the dataset itself feeds into training. Several teams have used VWA tasks (or close variants) as training or fine-tuning data for open-source browser agents. The line between training set and evaluation set is therefore not as clean as it was at launch, and most serious leaderboards now distinguish between zero-shot and trained-on-VWA results.
Finally, VWA helped surface the broader point that web agent quality is not just a vision-language-model problem. The same GPT-4V scores very differently with and without Set-of-Mark, with and without a planning loop, with and without retry behavior. That is partly an empirical claim and partly a research-agenda claim. It implies that better browser agents will come at least as much from harness work as from raw model improvements.
VisualWebArena is not a perfect benchmark, and the authors are explicit about most of the limitations.
The sites are clones, not real production websites, so they miss the long tail of weirdness on the real web (broken JavaScript, anti-bot interstitials, A/B tests, popups). An agent that crushes VWA might still struggle on a real classifieds site that uses a custom widget the benchmark never thought to include.
Tasks are written by humans and reflect what humans thought was a reasonable mix of visual skills. They under-represent tasks involving charts, scientific images, or specialized domains like medical imaging, and they over-represent the Shopping site since it is the largest of the three.
The single-success-or-failure metric does not distinguish between catastrophic failure (the agent did the wrong thing) and benign failure (the agent decided the task was impossible and stopped). For some operational settings the second kind of failure is much less costly than the first.
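One way to see what the single metric hides is to tally failures by kind alongside the success rate. The sketch below is not part of the benchmark: the three-way `Outcome` labels and the example run are hypothetical, and deciding which label a real trajectory deserves would need its own judgment step.

```python
from collections import Counter
from enum import Enum


class Outcome(Enum):
    SUCCESS = "success"
    WRONG_ACTION = "wrong_action"  # agent did the wrong thing
    GAVE_UP = "gave_up"            # agent declared the task impossible and stopped


def summarize(outcomes):
    """Report the success rate together with a breakdown of the failure kinds."""
    total = len(outcomes)
    if total == 0:
        raise ValueError("no outcomes to summarize")
    counts = Counter(outcomes)
    return {
        "success_rate": counts[Outcome.SUCCESS] / total,
        "wrong_action_rate": counts[Outcome.WRONG_ACTION] / total,
        "gave_up_rate": counts[Outcome.GAVE_UP] / total,
    }


# Hypothetical run of five tasks.
run = [
    Outcome.SUCCESS,
    Outcome.WRONG_ACTION,
    Outcome.GAVE_UP,
    Outcome.GAVE_UP,
    Outcome.SUCCESS,
]
summary = summarize(run)
print(summary)
```

Two agents with the same `success_rate` can differ sharply in `wrong_action_rate`, which is the distinction the benchmark's headline number does not capture.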
The benchmark is in English. None of the sites are localized for other languages, and none of the tasks specifically test multilingual or cross-cultural visual web work. WebVoyager and a handful of regional spinoffs try to fill this gap.
Finally, the human ceiling near 88.7% is not 100 percent. Some VWA tasks are genuinely ambiguous, which puts a soft cap on how much further agent scores can climb before the benchmark is effectively saturated.