Online-Mind2Web
Last reviewed
Sources
2 citations
Review status
Source-backed
Revision
v1 · 1,497 words
Online-Mind2Web is a benchmark for evaluating AI agents that operate web browsers. Unlike benchmarks built on cached page snapshots, it runs against live, real websites. It was introduced in the 2025 paper "An Illusion of Progress? Assessing the Current State of Web Agents" by researchers at The Ohio State University and the University of California, Berkeley [1]. The benchmark consists of 300 diverse and realistic tasks spanning 136 popular live websites, and it is designed to measure how well web agents cope with the dynamic content, loading delays, pop-ups, and constant change of the open internet [1].
The benchmark is the online, live-web counterpart to Mind2Web, an earlier offline web-agent dataset from the same group. Its central empirical claim is that previously reported web-agent success rates were inflated by easy or shortcut-solvable tasks and by unreliable automatic scoring, and that under rigorous, realistic evaluation most agents complete only a modest fraction of tasks [1]. Alongside the benchmark, the authors released WebJudge, an automatic evaluator based on a large language model that agrees with human judgment roughly 85 percent of the time, which allows the live benchmark to be scored at scale without exhaustive manual annotation [1]. The paper was accepted to the Conference on Language Modeling (COLM) 2025, and the dataset, code, and human evaluation labels are publicly available from the OSU-NLP-Group repository [1][2].
By early 2025, multiple web-agent systems were reporting very high task success rates, with some claiming close to 90 percent on the WebVoyager benchmark [1]. The authors of Online-Mind2Web argued that these numbers create an illusion of progress: in their view, the high scores reflected shortcomings in the benchmarks themselves rather than genuine agent capability, so the apparent competence of web agents was substantially overstated [1].
The paper identifies three recurring problems with prior web-agent evaluation [1]:
These observations motivated a benchmark built on live websites with carefully curated, currently valid tasks, so that reported scores would reflect what agents can actually do on the real internet today.
Online-Mind2Web is a curated set of 300 tasks distributed across 136 live websites covering everyday domains such as clothing, food, housing, travel, and transportation [1]. Each task is a natural-language instruction that an agent must carry out by interacting with a real website in real time, so the agent must cope with dynamic page content, network latency, interstitial pop-ups, and other conditions that cached benchmarks remove [1].
The tasks are stratified by difficulty according to the number of steps a human needs to complete them [1]:
| Difficulty | Step count | Number of tasks |
|---|---|---|
| Easy | 5 steps or fewer | 83 |
| Medium | 6 to 10 steps | 143 |
| Hard | 11 steps or more | 74 |
| Total | | 300 |
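The stratification rule in the table can be sketched in a few lines of Python. This is a minimal illustration that assumes each task carries a human reference step count; the function names `difficulty_for` and `tier_shares` are hypothetical and are not taken from the benchmark's released code.

```python
# Hypothetical helper names; thresholds and tier sizes come from the table above.

def difficulty_for(step_count: int) -> str:
    """Map a human reference step count to a difficulty tier."""
    if step_count < 1:
        raise ValueError("step_count must be a positive integer")
    if step_count <= 5:
        return "easy"
    if step_count <= 10:
        return "medium"
    return "hard"


# Tier sizes as reported in the table above.
REPORTED_TIER_SIZES = {"easy": 83, "medium": 143, "hard": 74}


def tier_shares(sizes: dict) -> dict:
    """Return each tier's share of all tasks, as a percentage rounded to 0.1."""
    total = sum(sizes.values())
    return {tier: round(100 * count / total, 1) for tier, count in sizes.items()}


if __name__ == "__main__":
    print(difficulty_for(7))
    print(tier_shares(REPORTED_TIER_SIZES))
```

Applied to the reported tier sizes, the medium tier accounts for just under half of the 300 tasks, with the easy and hard tiers each making up roughly a quarter.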
Because the tasks run against the open web, the benchmark reflects current site behavior rather than a frozen snapshot, which is its key methodological difference from the original offline Mind2Web. The trade-off is that live evaluation is inherently less reproducible than a sandbox: site layouts, inventory, and availability change over time, so results represent the state of those sites at the time of testing.
A practical obstacle to live web-agent evaluation is scoring. There is no cached ground-truth page to compare against, so determining whether a task succeeded normally requires a human to inspect the agent's trajectory. To make the benchmark usable at scale, the authors introduced WebJudge, an automatic evaluator that uses a large language model as a judge [1].
WebJudge operates in three stages [1]:
The authors report that WebJudge reaches roughly 85 percent agreement with human judgment, with an average success-rate gap of about 3.8 percentage points between automatic and human scoring [1]. Using the o4-mini model as the judge backbone, agreement was about 85.7 percent, and a fine-tuned 7-billion-parameter judge model (WebJudge-7B) reached about 87 percent while reducing the number of model calls needed per trajectory [1]. These figures substantially exceed those of prior automatic evaluators: the paper reports the WebVoyager-style judge at about 73.9 percent agreement and an autonomous-evaluation baseline at about 79.4 percent [1]. WebJudge also held up better than these baselines as task difficulty increased [1].
| Evaluator | Agreement with human judgment |
|---|---|
| WebJudge-7B | ~87% |
| WebJudge (o4-mini) | ~85.7% |
| Autonomous Eval baseline | ~79.4% |
| WebVoyager-style judge | ~73.9% |
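The two evaluator metrics discussed above, agreement with human judgment and the success-rate gap, measure different things, and a short sketch makes the distinction concrete. The code below assumes per-task boolean labels (success or failure) from a judge and from a human annotator for a single agent. The function names and the toy labels are illustrative assumptions, and the paper's exact aggregation across agents is not reproduced here.

```python
# Illustrative sketch; names and toy labels are assumptions, not the paper's code or data.

def agreement_rate(judge_labels, human_labels):
    """Fraction of tasks on which the judge and the human give the same verdict."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must have the same length")
    if not judge_labels:
        raise ValueError("label lists must not be empty")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


def success_rate(labels):
    """Fraction of tasks labeled as successful."""
    return sum(labels) / len(labels)


def success_rate_gap(judge_labels, human_labels):
    """Absolute difference between the two success rates, in percentage points."""
    return abs(success_rate(judge_labels) - success_rate(human_labels)) * 100


if __name__ == "__main__":
    # Toy labels for five tasks, made up purely for illustration.
    human = [True, True, False, False, True]
    judge = [True, False, False, True, True]
    print(agreement_rate(judge, human))    # per-task agreement
    print(success_rate_gap(judge, human))  # aggregate gap
```

In the toy labels the judge disagrees with the human on two of five tasks, yet the two errors cancel and the aggregate success rates match. This is one reason to consider per-task agreement alongside the success-rate gap rather than relying on the gap alone.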
Evaluating a range of web agents on Online-Mind2Web produced a markedly different picture from earlier benchmarks. Most agents clustered around a 30 percent success rate, far below the high numbers previously reported elsewhere, and performance fell sharply as tasks moved from easy to hard [1]. A notable finding is that many recent and more elaborate agents did not outperform SeeAct, a comparatively simple agent released in early 2024, which suggests that added system complexity did not reliably translate into better real-web performance [1]. Only two systems, OpenAI Operator and Claude Computer Use with Claude 3.7, clearly pulled ahead of that early baseline.
The table below summarizes human-evaluated success rates reported in the paper for representative agents [1]:
| Agent | Human-evaluated success rate |
|---|---|
| OpenAI Operator | ~61% |
| Claude Computer Use (Claude 3.7) | ~56% |
| SeeAct | ~31% |
| Browser Use | ~30% |
| Claude Computer Use (Claude 3.5) | ~29% |
| Agent-E | ~28% |
In the table, Claude Computer Use refers to the computer-use agent capability built on Anthropic's Claude models. Even the strongest system in the study, OpenAI Operator, completed only about 61 percent of tasks, and the older Claude 3.5 computer-use configuration performed in line with the simpler scaffolds rather than ahead of them [1]. The authors note that success rates of around 90 percent reported on WebVoyager effectively collapse once agents are tested in the more realistic, dynamic setting of Online-Mind2Web [1].
These results are necessarily a snapshot: because the benchmark runs against live websites and the field moves quickly, later agents and updated models report different and generally higher numbers on the public leaderboard than the systems measured in the original paper.
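The difficulty breakdown mentioned above (success falling from easy to hard tasks) could be computed from per-task outcomes along the following lines. This is a minimal sketch that assumes each result is a `(tier, succeeded)` pair. The function name `success_by_tier` and the toy records are hypothetical and do not represent results from the paper or the leaderboard.

```python
from collections import defaultdict

# Hypothetical helper; the record format is an assumption made for illustration.

def success_by_tier(results):
    """Compute the success rate per difficulty tier from (tier, succeeded) pairs."""
    totals = defaultdict(int)
    wins = defaultdict(int)
    for tier, succeeded in results:
        totals[tier] += 1
        wins[tier] += bool(succeeded)
    return {tier: wins[tier] / totals[tier] for tier in totals}


if __name__ == "__main__":
    # Toy records, invented only to exercise the function.
    toy_records = [
        ("easy", True), ("easy", True),
        ("medium", True), ("medium", False),
        ("hard", False), ("hard", False),
    ]
    print(success_by_tier(toy_records))
```

Reporting the per-tier rates next to the overall rate makes it visible when an agent's headline score is carried mostly by the easy tier.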
Online-Mind2Web reframed how web-agent progress is measured by arguing that realistic, live evaluation matters as much as raw success rates, and that automatic scoring must be validated against human judgment before its numbers can be trusted [1]. The "illusion of progress" framing became a widely cited cautionary point in agent evaluation, and the benchmark, together with WebJudge, became a common reference for realistic web-agent testing.
The benchmark sits within a family of agent evaluations and was explicitly positioned against several of them [1]:
By emphasizing live websites, curated and currently valid tasks, difficulty stratification, and a human-aligned automatic scorer, Online-Mind2Web offered a more conservative and arguably more honest measure of web-agent capability, and it contributed to a broader shift toward rigorous, reality-grounded evaluation of computer-use agents [1].