Online-Mind2Web

Updated Jun 8, 2026

Overview

Online-Mind2Web is a benchmark for evaluating AI agents that operate web browsers. Agents are run against live websites rather than cached page snapshots. The benchmark was introduced in the 2025 paper "An Illusion of Progress? Assessing the Current State of Web Agents" by researchers at The Ohio State University and the University of California, Berkeley ^[1]. It consists of 300 diverse, realistic tasks spanning 136 popular live websites, and it is designed to measure how well web agents cope with the dynamic content, loading delays, pop-ups, and constant change of the open internet ^[1].

The benchmark is the live-web counterpart to Mind2Web, an earlier offline web-agent dataset from the same group. Its central empirical claim is that previously reported web-agent success rates were inflated by easy or shortcut-solvable tasks and by unreliable automatic scoring, and that under rigorous, realistic evaluation most agents complete only a modest fraction of tasks ^[1]. Alongside the benchmark, the authors released WebJudge, an automatic evaluator based on a large language model that agrees with human judgment roughly 85 percent of the time, allowing the live benchmark to be scored at scale without exhaustive manual annotation ^[1]. The paper was accepted to the Conference on Language Modeling (COLM) 2025, and the dataset, code, and human evaluation labels are publicly available from the OSU-NLP-Group repository ^[1]^[2].

Motivation: the illusion of progress

By early 2025, several web-agent systems were reporting very high task success rates, with some approaching 90 percent on the WebVoyager benchmark ^[1]. The authors of Online-Mind2Web argued that these numbers created an illusion of progress: the high scores reflected shortcomings in the benchmarks themselves rather than genuine agent capability, so the apparent competence of web agents was substantially overstated ^[1].

The paper identifies three recurring problems with prior web-agent evaluation ^[1]:

Shortcut-solvable tasks. The authors report that a large share of WebVoyager tasks could be completed through simple search-engine shortcuts rather than by genuinely navigating a target site, so high scores did not reflect realistic web navigation.
Outdated or invalid tasks. When sampling tasks from the original offline Mind2Web dataset, the authors found that a substantial fraction had become invalid or outdated because the underlying websites had changed since the cached snapshots were collected.
Limited website diversity. Sandboxed benchmarks such as WebArena and VisualWebArena offer reproducibility but cover only a small number of self-hosted sites, which limits how well results generalize to the broader web.

These observations motivated a benchmark built on live websites with carefully curated, currently valid tasks, so that reported scores would reflect what agents can actually do on the real internet today.

What Online-Mind2Web is

Online-Mind2Web is a curated set of 300 tasks distributed across 136 live websites covering everyday domains such as clothing, food, housing, travel, and transportation ^[1]. Each task is a natural-language instruction that an agent must carry out by interacting with a real website in real time, so the agent must cope with dynamic page content, network latency, interstitial pop-ups, and other conditions that cached benchmarks remove ^[1].

The tasks are stratified by difficulty according to the number of steps a human needs to complete them ^[1]:

Difficulty	Step count	Number of tasks
Easy	5 steps or fewer	83
Medium	6 to 10 steps	143
Hard	11 steps or more	74
Total		300
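The step-count thresholds in the table can be expressed as a small helper. This is a minimal sketch, not code from the benchmark repository: the function name `difficulty`, the constant `REPORTED_COUNTS`, and the example step counts are all hypothetical and chosen only for illustration.

```python
from collections import Counter

# Task counts per difficulty level as reported in the table above.
REPORTED_COUNTS = {"easy": 83, "medium": 143, "hard": 74}


def difficulty(reference_steps: int) -> str:
    """Map a human reference step count to a difficulty label."""
    if reference_steps < 1:
        raise ValueError("a task needs at least one step")
    if reference_steps <= 5:
        return "easy"
    if reference_steps <= 10:
        return "medium"
    return "hard"


# Hypothetical step counts for six made-up tasks (not benchmark data).
example_steps = [3, 5, 6, 10, 11, 17]
example_counts = Counter(difficulty(n) for n in example_steps)

print(dict(example_counts))
print(sum(REPORTED_COUNTS.values()))  # the three levels should cover all tasks
```

The boundary values matter here: a 5-step task is still easy and a 10-step task is still medium, which is why the comparisons use `<=`.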

Because the tasks run against the open web, the benchmark reflects current site behavior rather than a frozen snapshot, which is its key methodological difference from the original offline Mind2Web. The trade-off is that live evaluation is inherently less reproducible than a sandbox: site layouts, inventory, and availability change over time, so results represent the state of those sites at the time of testing.

WebJudge auto-evaluation

A practical obstacle to live web-agent evaluation is scoring. There is no cached ground-truth page to compare against, so determining whether a task succeeded normally requires a human to inspect the agent's trajectory. To make the benchmark usable at scale, the authors introduced WebJudge, an automatic evaluator that uses a large language model as a judge ^[1].

WebJudge operates in three stages ^[1]:

Key point identification. The model reads the task description and derives the key points that must be satisfied for the task to count as complete.
Key screenshot selection. From the agent's recorded trajectory, the model selects the screenshots most relevant to those key points.
Outcome judgment. The model issues a final success or failure verdict based on the task description, the selected screenshots, and the agent's action history.
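The three stages can be sketched as a simple pipeline. This is an illustrative outline only and not the released WebJudge implementation: the `JudgeModel` interface, the prompts, the 1 to 5 relevance rating with a threshold, and every function name are assumptions of this sketch. Screenshots are represented by text captions, and a stub stands in for the language model so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, List

# Assumption: the judge model is any function from a text prompt to a text
# reply. A real judge would send screenshot images to a multimodal model.
JudgeModel = Callable[[str], str]


@dataclass
class Trajectory:
    task: str
    screenshots: List[str]  # stand-ins for images (captions here)
    actions: List[str]      # the agent's action history


def identify_key_points(model: JudgeModel, task: str) -> List[str]:
    """Stage 1: derive the requirements the task must satisfy."""
    reply = model(f"List key points, one per line.\nTask: {task}")
    return [line.strip() for line in reply.splitlines() if line.strip()]


def select_key_screenshots(model: JudgeModel, traj: Trajectory,
                           key_points: List[str], threshold: int = 3) -> List[str]:
    """Stage 2: keep the screenshots the model rates as relevant.

    The 1-5 rating and the threshold are assumptions of this sketch.
    """
    selected = []
    for shot in traj.screenshots:
        reply = model(
            "Rate relevance from 1 to 5.\n"
            f"Key points: {key_points}\nScreenshot: {shot}"
        ).strip()
        score = int(reply[0]) if reply[:1].isdigit() else 0
        if score >= threshold:
            selected.append(shot)
    return selected


def judge_outcome(model: JudgeModel, traj: Trajectory,
                  key_points: List[str], shots: List[str]) -> bool:
    """Stage 3: final verdict from task, key screenshots, and action history."""
    reply = model(
        "Answer success or failure.\n"
        f"Task: {traj.task}\nKey points: {key_points}\n"
        f"Screenshots: {shots}\nActions: {traj.actions}"
    )
    return reply.strip().lower().startswith("success")


def web_judge(model: JudgeModel, traj: Trajectory) -> bool:
    """Run the three stages in order and return the verdict."""
    key_points = identify_key_points(model, traj.task)
    shots = select_key_screenshots(model, traj, key_points)
    return judge_outcome(model, traj, key_points, shots)


def stub_model(prompt: str) -> str:
    """Canned replies so the sketch runs without a real model."""
    if prompt.startswith("List key points"):
        return "Search for running shoes\nFilter by size 10"
    if prompt.startswith("Rate relevance"):
        shot = prompt.rsplit("Screenshot: ", 1)[-1]
        return "5" if "size 10" in shot else "1"
    return "success"


traj = Trajectory(
    task="Find running shoes in size 10",
    screenshots=["homepage", "search results", "results filtered to size 10"],
    actions=["type 'running shoes'", "click 'Size 10'"],
)
print(web_judge(stub_model, traj))
```

Filtering screenshots before the final judgment keeps the last prompt short, which is one plausible reason to separate stage 2 from stage 3 on long trajectories.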

The authors report that WebJudge reaches roughly 85 percent agreement with human judgment, with an average success-rate gap of about 3.8 percentage points between automatic and human scoring ^[1]. With the o4-mini model as the judge backbone, agreement was about 85.7 percent, and a fine-tuned 7-billion-parameter judge model (WebJudge-7B) reached about 87 percent while reducing the number of model calls needed per trajectory ^[1]. These figures are substantially higher than those of prior automatic evaluators: the paper reports the WebVoyager-style judge at about 73.9 percent agreement and an autonomous-evaluation baseline at about 79.4 percent ^[1]. WebJudge also held up better than these baselines as task difficulty increased ^[1].

Evaluator	Agreement with human judgment
WebJudge-7B	~87%
WebJudge (o4-mini)	~85.7%
Autonomous Eval baseline	~79.4%
WebVoyager-style judge	~73.9%
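Agreement and success-rate gap measure different things, which a short calculation makes concrete. The sketch below assumes that agreement is the fraction of trajectories on which the judge and the human give the same verdict, and that the gap for one agent is the absolute difference between its judge-scored and human-scored success rates. The labels are made up for illustration and are not benchmark data.

```python
from typing import Sequence


def agreement_rate(judge: Sequence[bool], human: Sequence[bool]) -> float:
    """Fraction of trajectories where judge and human verdicts match."""
    if len(judge) != len(human) or not human:
        raise ValueError("label lists must be non-empty and equally long")
    return sum(j == h for j, h in zip(judge, human)) / len(human)


def success_rate(labels: Sequence[bool]) -> float:
    """Fraction of trajectories labeled as successful."""
    return sum(labels) / len(labels)


# Hypothetical verdicts for ten trajectories of one agent.
human = [True, False, True, True, False, False, True, False, False, False]
judge = [True, True, True, True, True, False, True, False, False, False]

agreement = agreement_rate(judge, human)
gap = abs(success_rate(judge) - success_rate(human))

print(f"agreement: {agreement:.0%}")
print(f"success-rate gap: {gap * 100:.1f} percentage points")
```

Because judge errors in opposite directions cancel in the aggregate, a judge can show a small success-rate gap while still disagreeing with humans on many individual trajectories, so both numbers are worth reporting.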

Results

Evaluating a range of web agents on Online-Mind2Web produced a markedly different picture from earlier benchmarks. Most agents clustered around a 30 percent success rate, far below the high numbers previously reported elsewhere, and performance fell sharply as tasks moved from easy to hard ^[1]. A notable finding is that many recent and more elaborate agents did not outperform SeeAct, a comparatively simple agent released in early 2024, which suggests that added system complexity did not reliably translate into better real-web performance ^[1]. Only two systems clearly pulled ahead of that early baseline.

The table below summarizes human-evaluated success rates reported in the paper for representative agents ^[1]:

Agent	Human-evaluated success rate
OpenAI Operator	~61%
Claude Computer Use (Claude 3.7)	~56%
SeeAct	~31%
Browser Use	~30%
Claude Computer Use (Claude 3.5)	~29%
Agent-E	~28%

Here Claude Computer Use refers to the computer-use agent capability built on Anthropic's Claude models. Even the strongest system in the study, OpenAI Operator, completed only about 61 percent of tasks, and the older Claude 3.5 computer-use configuration performed in line with the simpler scaffolds rather than ahead of them ^[1]. The authors note that success rates of around 90 percent reported on WebVoyager effectively collapse once agents are tested in the more realistic, dynamic setting of Online-Mind2Web ^[1].

These results are necessarily a snapshot: because the benchmark runs against live websites and the field moves quickly, later agents and updated models report different and generally higher numbers on the public leaderboard than the systems measured in the original paper.

Significance and relationship to other web-agent benchmarks

Online-Mind2Web reframed how web-agent progress is measured by arguing that realistic, live evaluation matters as much as raw success rates, and that automatic scoring must be validated against human judgment before its numbers can be trusted ^[1]. The "illusion of progress" framing became a widely cited cautionary point in agent evaluation, and the benchmark, together with WebJudge, became a common reference for realistic web-agent testing.

The benchmark sits within a family of agent evaluations and was explicitly positioned against several of them ^[1]:

Mind2Web is the original offline predecessor from the same group, built on cached webpage snapshots; Online-Mind2Web is its live-web successor and re-grounds tasks against sites as they currently behave.
WebVoyager evaluates agents on live sites but, according to the authors, contains many tasks solvable by search-engine shortcuts and relies on a less accurate automatic judge.
WebArena and VisualWebArena provide reproducible sandboxed environments but cover a limited set of self-hosted websites, which constrains how far their results generalize.
OSWorld extends agent evaluation beyond the browser to full computer-use tasks across desktop applications and operating systems; it is referenced in the paper as part of the broader landscape of realistic agent benchmarks.

By emphasizing live websites, curated and currently valid tasks, difficulty stratification, and a human-aligned automatic scorer, Online-Mind2Web offered a more conservative and arguably more honest measure of web-agent capability, and it contributed to a broader shift toward rigorous, reality-grounded evaluation of computer-use agents ^[1].

References

1. Xue, Tianci; Qi, Weijian; Shi, Tianneng; Song, Chan Hee; Gou, Boyu; Song, Dawn; Sun, Huan; Su, Yu. "An Illusion of Progress? Assessing the Current State of Web Agents." arXiv:2504.01382, 2025 (accepted to COLM 2025). https://arxiv.org/abs/2504.01382
2. OSU-NLP-Group. "Online-Mind2Web" (code, data, and human evaluation labels). GitHub repository. https://github.com/OSU-NLP-Group/Online-Mind2Web
