WebVoyager
Last reviewed
May 31, 2026
Sources
10 citations
Review status
Source-backed
Revision
v1 · 2,099 words
WebVoyager is an end-to-end web-agent benchmark and a multimodal agent of the same name, introduced by Hongliang He and colleagues in a paper accepted to ACL 2024 [1][2]. Unlike earlier setups that test LLM agents inside sandboxed simulators or against frozen page snapshots, WebVoyager has the agent operate live, real-world websites such as Amazon, GitHub, and Google Flights, then judges whether the agent actually completed each task. The benchmark contains 643 tasks spread across 15 popular sites, and the agent navigates primarily from browser screenshots annotated with Set-of-Mark labels. The original GPT-4V-based agent reached a 59.1% task success rate, and the authors paired it with an automatic evaluator, also built on GPT-4V, that agreed with human judges 85.3% of the time [1].
Web agents built before WebVoyager mostly fell into two camps, and both had a gap between what they measured and what people wanted. The first camp evaluated on static snapshots of pages. Mind2Web, for example, collects real page traces and then scores an agent step by step against a recorded golden action sequence [3]. That approach is reproducible and cheap, but it assumes a single correct path and never lets the agent recover from a wrong turn, so it measures imitation of one trajectory rather than whether a task got done. The second camp used self-hosted simulators. WebArena stands up functional clones of sites like a shopping store, a forum, and a content management system inside Docker, so an agent can click through a real backend without touching the open internet [4]. Sandboxes give you a stable target and programmatic success checks, but the cloned sites are simplified and frozen, missing the pop-ups, ads, logins, and constant layout churn of the actual web.
WebVoyager takes the harder route. The agent talks to the real Amazon, the real Booking.com, the real Google Maps, with everything those sites throw at a visitor. The authors argue this is the setting that matters for autonomous applications, and that an agent which only works on a tidy clone has not really been tested. The cost is that success is open-ended and hard to check automatically, which is why the paper spends as much effort on the evaluation protocol as on the agent itself.
WebVoyager is described as a Large Multimodal Model (LMM) powered agent, with GPT-4V (the gpt-4-vision-preview model) as the backbone [1][5]. At each step it observes the current browser screenshot rather than a long DOM dump, on the reasoning that raw HTML is verbose enough to drown out the signal a model needs to decide what to do.
To connect what the model sees to what it can click, the agent uses Set-of-Mark prompting [6]. Before each decision, interactive elements on the page (links, buttons, text fields) get drawn over with bounding boxes, and each box carries a numbered label in its top-left corner. The labels use black borders and backgrounds for legibility. The model then refers to an element by its number instead of guessing pixel coordinates, which turns grounding into a small classification problem. Alongside the marked screenshot, the agent receives a compact block of auxiliary text for each labeled element: the element type, any text content, and the aria-label accessibility attribute when present.
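The auxiliary text described above can be pictured as one short line per labeled element. The sketch below is only an illustration: the `MarkedElement` class, the `describe` function, and the output layout are assumptions made here, not the format used by the paper or its code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MarkedElement:
    """One interactive element that received a numbered mark (hypothetical type)."""
    label: int                        # number drawn on the screenshot
    tag: str                          # element type, e.g. "button" or "input"
    text: str = ""                    # visible text content, if any
    aria_label: Optional[str] = None  # aria-label attribute, when present


def describe(elements: List[MarkedElement]) -> str:
    """Render one line of auxiliary text per labeled element."""
    lines = []
    for el in elements:
        parts = [f"[{el.label}]", f"<{el.tag}>"]
        if el.text:
            parts.append(f'"{el.text}"')
        if el.aria_label:
            parts.append(f'aria-label="{el.aria_label}"')
        lines.append(" ".join(parts))
    return "\n".join(lines)
```

The model can then answer with a label number, and the harness maps that number back to the underlying element.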
The action space has seven operations the model can emit:
| Action | Effect |
|---|---|
| Click | Select a labeled element (link, button) |
| Type | Select a text box, clear it, and enter new text |
| Scroll | Move the viewport up or down |
| Wait | Pause to let the page finish loading |
| GoBack | Return to the previous page |
| Google | Jump to Google search when stuck |
| Answer | End the task and return a final response |
The agent runs this observe-think-act loop until it issues Answer or hits a step limit. A text-only variant of the same agent swaps the marked screenshots for the textual accessibility-tree representation that WebArena uses, which lets the authors isolate how much the visual channel actually contributes [1].
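A minimal sketch of that loop follows, assuming a made-up reply format such as `Type [5]; red shoes`. The reply syntax, the short name `Google` for the search jump, the helper names `parse_action` and `run_episode`, and the default step limit are all illustrative assumptions, not the paper's prompt or harness.

```python
import re
from typing import Callable, Optional, Tuple

Action = Tuple[str, Optional[str], Optional[str]]

# Seven action names; "Google" stands for the jump to Google search (assumed name).
ACTIONS = {"Click", "Type", "Scroll", "Wait", "GoBack", "Google", "Answer"}

# Hypothetical reply syntax: Name, optional [target], optional "; argument".
_PATTERN = re.compile(r"^\s*(\w+)\s*(?:\[(\w+)\])?\s*(?:;\s*(.*))?$", re.DOTALL)


def parse_action(reply: str) -> Action:
    """Parse a model reply into (name, target, argument)."""
    match = _PATTERN.match(reply)
    if not match or match.group(1) not in ACTIONS:
        raise ValueError(f"unparseable action: {reply!r}")
    name, target, arg = match.groups()
    return name, target, arg.strip() if arg else None


def run_episode(
    policy: Callable[[str], str],    # observation -> model reply
    execute: Callable[[Action], str],  # action -> next observation
    first_observation: str,
    max_steps: int = 15,             # placeholder step limit
) -> Optional[str]:
    """Observe, think, act until Answer is issued or the step limit is hit."""
    observation = first_observation
    for _ in range(max_steps):
        name, target, arg = parse_action(policy(observation))
        if name == "Answer":
            return arg               # final response ends the task
        observation = execute((name, target, arg))
    return None                      # ran out of steps without answering
```

In a real harness `policy` would wrap the model call and `execute` would drive a browser; here both are plain callables so the control flow can be exercised with stubs.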
The benchmark draws its 643 tasks from 15 high-traffic websites chosen to span everyday online activity: shopping, travel booking, reference lookup, code hosting, media, and computation. Each site carries 40 or more tasks [5].
| Category | Sites |
|---|---|
| Shopping and tech | Amazon, Apple |
| Travel and maps | Booking, Google Flights, Google Map |
| Search and reference | Google Search, Cambridge Dictionary, Wolfram Alpha |
| Research and code | ArXiv, GitHub, Huggingface |
| Learning | Coursera |
| Food | Allrecipes |
| News and sports | BBC News, ESPN |
Tasks read like real user requests, for instance finding a product under a price cap with specific features, booking a hotel for given dates, or pulling a statistic from a sports page. To build the set, the authors rewrote a portion of Mind2Web tasks for five overlapping sites, then used GPT-4 Turbo with a self-instruct procedure to draft roughly 100 new tasks per site, filtering for low repetition and validating each one by hand. Answers were annotated in two flavors: stable "golden" answers where a definite correct response exists, and "possible" answers for open-ended or time-sensitive questions whose correct value can drift. About 22.3% of tasks carry a golden answer [5].
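One way to picture the two answer flavors is as a small record type. The field names, the `Task` class, and the `golden_share` helper below are hypothetical and do not mirror the benchmark's actual data files.

```python
from dataclasses import dataclass
from typing import List, Literal


@dataclass
class Task:
    """A benchmark task with its reference answer (hypothetical schema)."""
    site: str
    question: str
    reference: str
    answer_type: Literal["golden", "possible"]  # stable vs. drifting answer


def golden_share(tasks: List[Task]) -> float:
    """Fraction of tasks whose reference answer is a stable golden answer."""
    if not tasks:
        raise ValueError("need at least one task")
    return sum(t.answer_type == "golden" for t in tasks) / len(tasks)
```

Keeping the answer type on each record lets an evaluator treat a "possible" reference as guidance rather than as an exact key.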
Because tasks are open-ended and run against changing live sites, there is no simple string match for success. WebVoyager's answer is a multimodal autorater. After a run, GPT-4V receives the task instruction, the agent's final textual response, and the last k screenshots from the trajectory, and it returns a binary success or failure verdict [1].
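The judge's input can be sketched as a simple bundle of the three pieces named above. The function name `judge_inputs`, the dictionary keys, and the convention that `k=None` means the whole trajectory are assumptions for illustration; the actual prompt and image encoding are not reproduced here.

```python
from typing import Dict, List, Optional


def judge_inputs(
    task: str,
    answer: str,
    screenshots: List[str],   # screenshot paths or IDs, oldest first
    k: Optional[int] = None,  # None means show the full trajectory
) -> Dict[str, object]:
    """Collect what the judge sees: task, final answer, last k screenshots."""
    if k is not None and k < 1:
        raise ValueError("k must be at least 1")
    kept = screenshots if k is None else screenshots[-k:]
    return {"task": task, "answer": answer, "screenshots": list(kept)}
```

If `k` exceeds the trajectory length, the slice simply returns every screenshot.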
The authors validate this judge against human labels. Overall agreement is 85.3%, with a Cohen's kappa of 0.70, which the paper notes matches the agreement between two human annotators on the same data. Reliability climbs as the judge sees more of the trajectory: kappa rises from about 0.51 at k=1 to 0.59 at k=2, 0.62 at k=3, and 0.70 when the judge sees the full run [1]. The takeaway is that a vision-language model can stand in for a human grader on open-ended web tasks closely enough to support evaluation at scale, which is the piece that makes a live-site benchmark practical at all.
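Cohen's kappa corrects raw agreement for the agreement two raters would reach by chance: kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is chance agreement. The sketch below is a plain implementation of that standard formula for two label lists; the function name is chosen here and it is not taken from the paper's code.

```python
from typing import Hashable, Sequence


def cohen_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's label frequencies, summed.
    labels = set(rater_a) | set(rater_b)
    expected = sum(
        (list(rater_a).count(lab) / n) * (list(rater_b).count(lab) / n)
        for lab in labels
    )
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)
```

Two raters who agree on 8 of 10 items while each labeling half the items as successes get p_o = 0.8, p_e = 0.5, and kappa = 0.6. Rearranging the formula to p_e = (p_o - kappa) / (1 - kappa), the reported 85.3% and 0.70 would imply chance agreement near 0.51, assuming both figures describe the same comparison.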
The GPT-4V WebVoyager agent reached an overall 59.1% task success rate. Two baselines fell well short. The text-only version of the agent, which reads the accessibility tree instead of marked screenshots, managed 40.1%, and GPT-4 with the official All Tools browsing setting reached 30.8% [1]. The roughly 19-point gap between the multimodal agent and its text-only twin is the paper's main evidence that screenshots plus Set-of-Mark add real signal beyond the accessibility tree.
Success varied widely by site, from 38.6% on ESPN to 76.7% on Google Flights, a spread that partly tracks how visual or how text-dense each site is.
| Website | Success rate |
|---|---|
| Google Flights | 76.7% |
| Coursera | 73.8% |
| Google Map | 70.7% |
| Apple | 65.1% |
| Cambridge Dictionary | 65.1% |
| GitHub | 63.4% |
| Wolfram Alpha | 63.0% |
| BBC News | 61.9% |
| Google Search | 59.5% |
| Amazon | 58.5% |
| Allrecipes | 53.3% |
| ArXiv | 51.2% |
| Huggingface | 44.2% |
| Booking | 43.2% |
| ESPN | 38.6% |
A detail worth noting: on text-heavy Allrecipes the multimodal agent did slightly worse than the text-only baseline, because dense recipe text is harder to read off a screenshot than off the accessibility tree. On visually structured sites like Booking and Google Flights, with their date pickers and result grids, vision helped more [1].
The authors also tested generalization beyond their own tasks. The agent ran on 90 web-browsing tasks (levels 1 and 2) from the GAIA benchmark and on 50 interactive tasks from SeeAct. On the SeeAct online test set it scored 30%, above the 26% reported for the strongest SeeAct autonomous agent [1].
An error analysis over 300 sampled failures grouped the mistakes into four buckets: getting stuck navigating, such as poor search queries or repeated useless actions (44.4%); visual grounding errors, like confusing adjacent elements or misreading a calendar (24.8%); hallucination, including incomplete answers or typing the wrong thing into a valid box (21.8%); and prompt misalignment, such as unparseable output or quitting early (9.0%) [1].
WebVoyager has become a common yardstick for browser agents released after the paper, though scores from different groups are not always measured the same way. OpenAI reported that its Computer-Using Agent, the model behind Operator, reached 87% on WebVoyager (the same system scored 58.1% on WebArena and 38.1% on OSWorld) [7]. Open-source frameworks have reported numbers in the same range, with Browser Use claiming 89.1%, though it noted this came after small changes to the WebVoyager harness, and other researchers reported trouble reproducing it [8].
These figures come with an asterisk. Because the sites are live, different teams run the tasks at different times against different page versions, and small harness or prompting tweaks shift the score. A 2026 re-evaluation by Emergence using a standardized protocol put Operator at 68.6% overall, well under the 87% headline, which illustrates how much methodology and timing move the result [9]. Comparisons across published WebVoyager numbers are best read as rough, not exact.
WebVoyager sits alongside two reference points it is usually contrasted with. Mind2Web is offline and stepwise: it scores actions against recorded golden trajectories on static page snapshots, which is reproducible but penalizes any valid alternate path and never tests recovery [3]. WebArena is interactive but sandboxed: agents act on self-hosted, functional clones of real sites with deterministic, programmatic success checks, trading away the messiness of the open web for stability and reproducibility [4]. A vision-heavy successor, VisualWebArena, extends that sandbox to tasks that require understanding page images [10].
WebVoyager's contribution is to keep the interactivity of WebArena while moving onto the genuine internet, and to make that practical with a validated GPT-4V autorater in place of hand-coded success checks. The trade-off is reproducibility: the same property that makes the benchmark realistic, live and ever-changing sites, also makes exact replication hard.
The paper and later usage surface several constraints. Live sites change their layouts, content, and availability, so a task that passed last month can break or become unanswerable, and two evaluation runs are never strictly comparable. Many sites deploy anti-bot defenses, CAPTCHAs, login walls, and rate limits that can block an automated agent for reasons unrelated to its reasoning. Time-sensitive tasks (prices, flight times, news) have answers that drift, which is why a chunk of the benchmark uses flexible "possible" answers rather than fixed keys. The GPT-4V autorater, while well correlated with humans, is not perfect and can be fooled by a confident but wrong final answer, as the hallucination error category shows. And reproducibility is a standing problem: the official code explicitly warns that results are influenced by nondeterministic API outputs, prompt changes, and shifting website styles or pop-ups [5].