WebVoyager

Updated Jun 28, 2026

WebVoyager is an end-to-end web agent and its companion benchmark, introduced by Hongliang He and seven coauthors in a paper accepted to ACL 2024 ^[1]^[2]. It is a multimodal (vision plus text) agent that completes user instructions by operating live, real-world websites such as Amazon, GitHub, and Google Flights from browser screenshots, and its benchmark contains 643 tasks across 15 popular sites ^[1]^[5]. The original GPT-4V-based agent reached a 59.1% task success rate, judged by an automatic GPT-4V evaluator that agreed with human graders 85.3% of the time ^[1].

The authors frame the project as closing the gap between lab benchmarks and the open internet: "we introduce WebVoyager, an innovative Large Multimodal Model (LMM) powered web agent that can complete user instructions end-to-end by interacting with real-world websites" ^[1]. WebVoyager has since become one of the most widely cited yardsticks for browser agents, used to report scores for systems like OpenAI Operator and open-source frameworks.

What problem does WebVoyager solve?

Web agents built before WebVoyager mostly fell into two camps, and both had a gap between what they measured and what people wanted. The first camp evaluated on static snapshots of pages. Mind2Web, for example, collects real page traces and then scores an agent step by step against a recorded golden action sequence ^[3]. That approach is reproducible and cheap, but it assumes a single correct path and never lets the agent recover from a wrong turn, so it measures imitation of one trajectory rather than whether a task got done. The second camp used self-hosted simulators. WebArena stands up functional clones of sites like a shopping store, a forum, and a content management system inside Docker, so an agent can click through a real backend without touching the open internet ^[4]. Sandboxes give you a stable target and programmatic success checks, but the cloned sites are simplified and frozen, missing the pop-ups, ads, logins, and constant layout churn of the actual web.

WebVoyager takes the harder route. The agent talks to the real Amazon, the real Booking.com, the real Google Maps, with everything those sites throw at a visitor. The authors argue this is the setting that matters for autonomous applications, and that an agent which only works on a tidy clone has not really been tested. The cost is that success is open-ended and hard to check automatically, which is why the paper spends as much effort on the evaluation protocol as on the agent itself.

How does WebVoyager work?

WebVoyager is described as a Large Multimodal Model (LMM) powered agent, with GPT-4V (the gpt-4-vision-preview model) as the backbone ^[1]^[5]. At each step it observes the current browser screenshot rather than a long DOM dump, on the reasoning that raw HTML is verbose enough to drown out the signal a model needs to decide what to do.

To connect what the model sees to what it can click, the agent uses Set-of-Mark prompting ^[6]. Before each decision, interactive elements on the page (links, buttons, text fields) are overlaid with bounding boxes, and each box carries a numbered label in its top-left corner. The labels use black borders and backgrounds for legibility. The model then refers to an element by its number instead of guessing pixel coordinates, which turns grounding into a choice among a small set of labels. Alongside the marked screenshot, the agent receives a compact block of auxiliary text for each labeled element: the element type, any text content, and the aria-label accessibility attribute when present.
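As a rough illustration of the auxiliary text described above, the sketch below assigns a numeric label to each interactive element and formats one line per label. The `Element` class, its field names, and the line format are assumptions made for this example; they are not taken from the WebVoyager code, which draws the boxes inside the browser.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Element:
    """Hypothetical record for one interactive element on a page."""
    tag: str                          # element type, e.g. "button" or "input"
    text: str = ""                    # visible text content, if any
    aria_label: Optional[str] = None  # accessibility label, if present


def build_aux_text(elements: list[Element]) -> str:
    """Return one line per element, keyed by its numeric label."""
    lines = []
    for label, el in enumerate(elements):
        parts = [f"[{label}]", f"<{el.tag}>"]
        if el.text:
            parts.append(f'"{el.text}"')
        if el.aria_label:
            parts.append(f'aria-label="{el.aria_label}"')
        lines.append(" ".join(parts))
    return "\n".join(lines)


# Illustrative elements only; a real page would be read from the browser.
elements = [
    Element(tag="input", aria_label="Search"),
    Element(tag="button", text="Go"),
    Element(tag="a", text="Sign in"),
]
print(build_aux_text(elements))
```

The model can then answer with a label such as `[1]` rather than a pixel position, and the harness maps that label back to the element it marked.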

The action space has seven operations the model can emit:

Action	Effect
Click	Select a labeled element (link, button)
Type	Select a text box, clear it, and enter new text
Scroll	Move the viewport up or down
Wait	Pause to let the page finish loading
GoBack	Return to the previous page
Google	Jump to Google search when stuck
Answer	End the task and return a final response

The agent runs this observe-think-act loop until it issues Answer or hits a step limit. A text-only variant of the same agent swaps the marked screenshots for the textual accessibility-tree representation that WebArena uses, which lets the authors isolate how much the visual channel actually contributes ^[1].
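The loop can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `Name argument` action format, the `parse_action` and `run_episode` helpers, and the default step limit of 15 are all assumptions, and a scripted stand-in replaces the model so the example runs without a browser or an API key.

```python
import re
from typing import Callable, Optional

# The seven operations listed in the action table.
ACTIONS = {"Click", "Type", "Scroll", "Wait", "GoBack", "Google", "Answer"}


def parse_action(line: str) -> tuple[str, str]:
    """Split a string like 'Click [3]' into (action name, argument text).

    The 'Name argument' format is an assumption made for this sketch.
    """
    match = re.match(r"\s*(\w+)\s*(.*)", line)
    if not match or match.group(1) not in ACTIONS:
        raise ValueError(f"unparseable action: {line!r}")
    return match.group(1), match.group(2).strip()


def run_episode(policy: Callable[[int], str], max_steps: int = 15) -> Optional[str]:
    """Call the policy until it emits Answer or the step limit is reached.

    Returns the answer text, or None if the limit was hit first.
    """
    for step in range(max_steps):
        name, arg = parse_action(policy(step))
        if name == "Answer":
            return arg
        # A real agent would execute the action in a browser here and
        # capture a fresh marked screenshot for the next step.
    return None


# Stand-in for the model: replays a fixed script instead of calling an LMM.
script = ["Click [3]", "Type [5]; wireless mouse", "Scroll down", "Answer $24.99"]
result = run_episode(lambda step: script[step])
print(result)
```

Rejecting unknown action names at parse time mirrors the fact that unparseable output is one of the failure modes the paper reports.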

What sites and tasks does the WebVoyager benchmark cover?

The benchmark draws its 643 tasks from 15 high-traffic websites chosen to span everyday online activity: shopping, travel booking, reference lookup, code hosting, media, and computation. As the paper puts it, "in total, we collected 40+ tasks per website, resulting in a total of 643 tasks" ^[5].

Category	Sites
Shopping and tech	Amazon, Apple
Travel and maps	Booking, Google Flights, Google Map
Search and reference	Google Search, Cambridge Dictionary, Wolfram Alpha
Research and code	ArXiv, GitHub, Huggingface
Learning	Coursera
Food	Allrecipes
News and sports	BBC News, ESPN

Tasks read like real user requests, for instance finding a product under a price cap with specific features, booking a hotel for given dates, or pulling a statistic from a sports page. To build the set, the authors rewrote a portion of Mind2Web tasks for five overlapping sites, then used GPT-4 Turbo with a self-instruct procedure to draft roughly 100 new tasks per site, filtering for low repetition and validating each one by hand. Answers were annotated in two flavors: stable "golden" answers where a definite correct response exists, and "possible" answers for open-ended or time-sensitive questions whose correct value can drift. About 22.3% of tasks carry a golden answer ^[5].

How is WebVoyager scored automatically?

Because tasks are open-ended and run against changing live sites, there is no simple string match for success. WebVoyager's answer is a multimodal autorater. After a run, GPT-4V receives the task instruction, the agent's final textual response, and the last k screenshots from the trajectory, and it returns a binary success or failure verdict ^[1].
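The inputs to the judge can be pictured as a small record. The sketch below only assembles what the evaluator would be shown and does not call any model; `JudgeInput`, `build_judge_input`, and the use of file paths for screenshots are hypothetical names and choices for this example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class JudgeInput:
    """Hypothetical container for what the evaluator model is shown."""
    instruction: str
    final_response: str
    screenshots: list[str]  # file paths, oldest first


def build_judge_input(
    instruction: str,
    final_response: str,
    trajectory: Sequence[str],
    k: Optional[int] = None,
) -> JudgeInput:
    """Keep only the last k screenshots; k=None keeps the full trajectory."""
    if k is not None and k < 1:
        raise ValueError("k must be at least 1")
    kept = list(trajectory) if k is None else list(trajectory[-k:])
    return JudgeInput(instruction, final_response, kept)


# Illustrative trajectory of six screenshots.
shots = [f"step_{i}.png" for i in range(6)]
judge_in = build_judge_input(
    "Find the cheapest flight", "The cheapest is $120", shots, k=3
)
print(judge_in.screenshots)
```

Treating `k` as a parameter makes it easy to rerun the same trajectories with a shorter or longer window and compare the verdicts.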

The authors validate this judge against human labels. Overall agreement is 85.3%, with a Cohen's kappa of 0.70, which the paper notes matches the agreement between two human annotators on the same data. Reliability climbs as the judge sees more of the trajectory: kappa rises from about 0.51 at k=1 to 0.59 at k=2, 0.62 at k=3, and 0.70 when the judge sees the full run ^[1]. The takeaway is that a vision-language model can stand in for a human grader on open-ended web tasks closely enough to support evaluation at scale, which is the piece that makes a live-site benchmark practical at all.
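Agreement and Cohen's kappa measure different things: raw agreement is the share of tasks where judge and human give the same verdict, while kappa discounts the agreement expected by chance, kappa = (p_o - p_e) / (1 - p_e). The sketch below computes both for binary verdicts. The ten labels are made-up toy data for illustration, not figures from the paper.

```python
def agreement_and_kappa(human: list[bool], judge: list[bool]) -> tuple[float, float]:
    """Return (observed agreement, Cohen's kappa) for two binary raters."""
    n = len(human)
    if n == 0 or n != len(judge):
        raise ValueError("label lists must be non-empty and of equal length")
    # p_o: fraction of items on which the two raters agree.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # p_e: agreement expected by chance, from each rater's success rate.
    p_h = sum(human) / n
    p_j = sum(judge) / n
    expected = p_h * p_j + (1 - p_h) * (1 - p_j)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Toy verdicts (True = success); invented for this example.
human = [True, True, True, True, True, True, False, False, False, False]
judge = [True, True, True, True, True, False, False, False, False, True]
observed, kappa = agreement_and_kappa(human, judge)
print(f"agreement={observed:.2f} kappa={kappa:.2f}")
```

In this toy case the raters agree on 8 of 10 items, yet kappa is noticeably lower than 0.8 because two raters who each say "success" 60% of the time would agree fairly often by chance.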

How do agents score on WebVoyager?

The GPT-4V WebVoyager agent reached an overall 59.1% task success rate. Two baselines fell well short. The text-only version of the agent, which reads the accessibility tree instead of marked screenshots, managed 40.1%, and GPT-4 with the official All Tools browsing setting reached 30.8% ^[1]. The roughly 19-point gap between the multimodal agent and its text-only twin is the paper's main evidence that screenshots plus Set-of-Mark add real signal beyond the accessibility tree.

Success varied widely by site, which tracks how visual or how text-dense each one is.

Website	Success rate
Google Flights	76.7%
Coursera	73.8%
Google Map	70.7%
Apple	65.1%
Cambridge Dictionary	65.1%
GitHub	63.4%
Wolfram Alpha	63.0%
BBC News	61.9%
Google Search	59.5%
Amazon	58.5%
Allrecipes	53.3%
ArXiv	51.2%
Huggingface	44.2%
Booking	43.2%
ESPN	38.6%
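As a quick sanity check, the per-site rates in the table can be averaged and compared with the headline number. The sketch below takes the unweighted mean across the 15 sites. That is an assumption about how to summarize the table: an overall rate computed over tasks would weight sites by their task counts, so the two figures need not match exactly.

```python
# Per-site success rates (percent) copied from the table above.
site_success = {
    "Google Flights": 76.7,
    "Coursera": 73.8,
    "Google Map": 70.7,
    "Apple": 65.1,
    "Cambridge Dictionary": 65.1,
    "GitHub": 63.4,
    "Wolfram Alpha": 63.0,
    "BBC News": 61.9,
    "Google Search": 59.5,
    "Amazon": 58.5,
    "Allrecipes": 53.3,
    "ArXiv": 51.2,
    "Huggingface": 44.2,
    "Booking": 43.2,
    "ESPN": 38.6,
}

# Unweighted mean across sites (every site counts equally).
macro_average = sum(site_success.values()) / len(site_success)

# Best and worst sites, and the spread between them in percentage points.
best = max(site_success, key=site_success.get)
worst = min(site_success, key=site_success.get)
spread = site_success[best] - site_success[worst]

print(f"macro average: {macro_average:.1f}%")
print(f"best: {best}, worst: {worst}, spread: {spread:.1f} points")
```

The unweighted mean lands close to the reported 59.1%, while the gap between the best and worst site is far larger than the gap between the agent and its text-only baseline.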

A detail worth noting: on text-heavy Allrecipes the multimodal agent did slightly worse than the text-only baseline, because dense recipe text is harder to read off a screenshot than off the accessibility tree. On visually structured sites like Booking and Google Flights, with their date pickers and result grids, vision helped more ^[1].

The authors also tested generalization beyond their own tasks. The agent ran on 90 web-browsing tasks (levels 1 and 2) from the GAIA benchmark and on 50 interactive tasks from SeeAct. On the SeeAct online test set it scored 30%, above the 26% reported for the strongest SeeAct autonomous agent ^[1].

An error analysis over 300 sampled failures grouped the mistakes into four buckets: getting stuck navigating, such as poor search queries or repeated useless actions (44.4%); visual grounding errors, like confusing adjacent elements or misreading a calendar (24.8%); hallucination, including incomplete answers or typing the wrong thing into a valid box (21.8%); and prompt misalignment, such as unparseable output or quitting early (9.0%) ^[1].

How do newer agents score on WebVoyager?

WebVoyager has become a common yardstick for browser agents released after the paper, although scores from different groups are not always measured the same way. OpenAI reported that its Computer-Using Agent, the model behind Operator, reached 87% on WebVoyager (the same system scored 58.1% on WebArena and 38.1% on OSWorld) ^[7]. Open-source frameworks have reported numbers in the same range. Browser Use claimed 89.1% and noted that the result came after small changes to the WebVoyager harness; other researchers reported trouble reproducing it ^[8].

These figures come with an asterisk. Because the sites are live, different teams run the tasks at different times against different page versions, and small harness or prompting tweaks shift the score. A 2026 re-evaluation by Emergence using a standardized protocol put Operator at 68.6% overall, well under the 87% headline, which illustrates how much methodology and timing move the result ^[9]. Comparisons across published WebVoyager numbers are best read as rough, not exact.

How does WebVoyager differ from Mind2Web and WebArena?

WebVoyager sits alongside two reference points it is usually contrasted with. Mind2Web is offline and stepwise: it scores actions against recorded golden trajectories on static page snapshots, which is reproducible but penalizes any valid alternate path and never tests recovery ^[3]. WebArena is interactive but sandboxed: agents act on self-hosted, functional clones of real sites with deterministic, programmatic success checks, trading away the messiness of the open web for stability and reproducibility ^[4]. A vision-heavy successor, VisualWebArena, extends that sandbox to tasks that require understanding page images ^[10].

WebVoyager's contribution is to keep the interactivity of WebArena while moving onto the genuine internet, and to make that practical with a validated GPT-4V autorater in place of hand-coded success checks. The trade-off is reproducibility: the same property that makes the benchmark realistic, live and ever-changing sites, also makes exact replication hard.

What are the limitations of WebVoyager?

The paper and later usage surface several constraints. Live sites change their layouts, content, and availability, so a task that passed last month can break or become unanswerable, and two evaluation runs are never strictly comparable. Many sites deploy anti-bot defenses, CAPTCHAs, login walls, and rate limits that can block an automated agent for reasons unrelated to its reasoning. Time-sensitive tasks (prices, flight times, news) have answers that drift, which is why a chunk of the benchmark uses flexible "possible" answers rather than fixed keys. The GPT-4V autorater, while well correlated with humans, is not perfect and can be fooled by a confident but wrong final answer, as the hallucination error category shows. And reproducibility is a standing problem: the official code explicitly warns that results are influenced by nondeterministic API outputs, prompt changes, and shifting website styles or pop-ups ^[5].

References

1. He, H., Yao, W., Ma, K., Yu, W., Dai, Y., Zhang, H., Lan, Z., & Yu, D. (2024). WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models. arXiv:2401.13919. https://arxiv.org/abs/2401.13919
2. He, H., et al. (2024). WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024), Volume 1: Long Papers, pp. 6864-6890. https://aclanthology.org/2024.acl-long.371/
3. Deng, X., Gu, Y., Zheng, B., Chen, S., Stevens, S., Wang, B., Sun, H., & Su, Y. (2023). Mind2Web: Towards a Generalist Agent for the Web. arXiv:2306.06070. https://arxiv.org/abs/2306.06070
4. Zhou, S., Xu, F. F., Zhu, H., Zhou, X., Lo, R., Sridhar, A., et al. (2023). WebArena: A Realistic Web Environment for Building Autonomous Agents. arXiv:2307.13854. https://arxiv.org/abs/2307.13854
5. WebVoyager official code repository (MinorJerry/WebVoyager). GitHub. https://github.com/MinorJerry/WebVoyager
6. Yang, J., Zhang, H., Li, F., Zou, X., Li, C., & Gao, J. (2023). Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V. arXiv:2310.11441. https://arxiv.org/abs/2310.11441
7. OpenAI (2025). Computer-Using Agent. https://openai.com/index/computer-using-agent/
8. Browser Use (2025). Browser Use = state of the art Web Agent. https://browser-use.com/posts/sota-technical-report
9. Emergence AI (2026). Emergence WebVoyager: Toward Consistent and Transparent Evaluation of (Web) Agents in the Wild. arXiv:2603.29020. https://arxiv.org/abs/2603.29020
10. Koh, J. Y., Lo, R., Jang, L., Duvvur, V., Lim, M. C., Huang, P., Neubig, G., Zhou, S., Salakhutdinov, R., & Fried, D. (2024). VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks. arXiv:2401.13649. https://arxiv.org/abs/2401.13649
