Mind2Web is a dataset and benchmark for developing and evaluating generalist AI agents that follow natural language instructions to complete tasks on the open web. It was introduced in the 2023 paper "Mind2Web: Towards a Generalist Agent for the Web" by Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, and Yu Su of the Ohio State University NLP group, and was selected as a NeurIPS 2023 spotlight paper in the Datasets and Benchmarks track.[1][2]
The dataset contains 2,350 open-ended tasks collected from 137 live websites across 31 domains, with crowdsourced action sequences captured on real web pages.[1][2] Mind2Web also introduced MindAct, a two-stage framework that combines a small fine-tuned ranker with a large language model to predict actions from filtered HTML, and reported the first systematic baselines for Flan-T5, GPT-3.5, and GPT-4 as web agents.[1] The benchmark became a foundational reference for later work on web agents and AI browser agents, and a starting point for follow-on resources such as Multimodal-Mind2Web, SeeAct, Online-Mind2Web, and Mind2Web 2.[2]
Before Mind2Web, most benchmarks for web automation either used a small set of simplified pages, such as MiniWoB and MiniWoB++, or focused on a narrow vertical such as e-commerce shopping in WebShop.[1] These setups were useful for studying interaction primitives, but they did not measure whether an agent could read complex modern HTML, plan multi-step actions, and generalize across the broad surface of the public web.
The Mind2Web authors argued that a credible benchmark for general web agents needed three properties:[1]

1. Diverse coverage of domains, websites, and tasks, rather than a handful of hand-picked sites.
2. Real-world websites instead of simulated or simplified environments.
3. A broad spectrum of user interaction patterns, beyond a fixed set of templated operations.
Mind2Web was the first dataset that attempted to satisfy all three properties at once. Its release coincided with the rapid growth of LLM-driven web agent research in 2023 and 2024 and gave the community a shared yardstick for that work.[2][3]
Mind2Web was developed by researchers from the Ohio State University NLP group led by Huan Sun and Yu Su, with the dataset, models, and code released under permissive licenses (CC BY 4.0 for the dataset, MIT for the code).[2] The paper was first posted to arXiv on June 9, 2023 and accepted as a NeurIPS 2023 spotlight in the Datasets and Benchmarks track.[1]
| Item | Value |
|---|---|
| Paper title | Mind2Web: Towards a Generalist Agent for the Web |
| Authors | Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, Yu Su |
| Affiliation | The Ohio State University |
| First arXiv version | June 9, 2023 |
| Conference | NeurIPS 2023 (Datasets and Benchmarks, Spotlight) |
| arXiv ID | 2306.06070 |
| Code repository | github.com/OSU-NLP-Group/Mind2Web |
| Project page | osu-nlp-group.github.io/Mind2Web |
| Dataset license | CC BY 4.0 |
| Code license | MIT |
The original release shipped HTML snapshots, MHTML files, screenshots, DOM dumps, and HAR network traces for each annotated trajectory, along with a script to reproduce the dataset splits and baseline experiments.[2]
Mind2Web is fundamentally an offline dataset of human-collected web trajectories. Each example pairs a high-level natural language instruction with a sequence of (web page, target element, action) tuples that, when executed, complete the task on the original website.[1]
| Statistic | Value |
|---|---|
| Total tasks | 2,350 |
| Distinct websites | 137 |
| Domains | 31 |
| Average actions per task | 7.3 |
| Average raw elements per page | ~1,135 |
| Average elements after preprocessing | ~580 |
| Top-level categories | Travel, Information, Service, Shopping, Entertainment |
Source: original Mind2Web paper.[1]
The 137 websites were selected from popular real services and split into five top-level categories: Travel, Information, Service, Shopping, and Entertainment. Within these categories the dataset spans 31 domains, including airlines, car rental, restaurants, social media, weather, real estate, music, sports, and gaming.[1]
The authors divided the data into one training split and three test splits. Each test split corresponds to a different generalization regime, which is the central evaluation idea in the paper.[1][2]
| Split | Tasks | Purpose |
|---|---|---|
| Train | 1,009 | Fine-tuning data and the pool for in-context demonstrations |
| Cross-Task | 252 | New tasks on websites already seen during training |
| Cross-Website | 177 | Held-out websites within domains seen during training |
| Cross-Domain | 912 | Websites in domains not seen during training |
The Cross-Domain split is by far the largest test set and is intended to measure whether an agent has learned transferable web skills as opposed to memorizing site-specific layouts.[1]
Mind2Web models web interactions as a sequence of low-level operations performed on individual DOM elements. The action space in the original release covers four operations: Click, Hover, Type, and Select Option. Type and Select Option also carry a value argument such as the text to enter or the option to choose.[1][2]
| Operation | Required arguments | Example |
|---|---|---|
| Click | Target element | Click the "Search" button |
| Hover | Target element | Hover over the "Account" menu |
| Type | Target element, text value | Type "Columbus" into the city field |
| Select Option | Target element, option value | Select "Economy" from the cabin class drop-down |
Each action is grounded to a unique element in a DOM snapshot of the page, so any agent that operates on the dataset must both pick the correct operation and identify the correct element among hundreds or thousands of candidates.[1]
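To make the data model concrete, here is a minimal sketch of a trajectory in this shape. The class and field names, and the element ids, are illustrative rather than the dataset's actual schema, which additionally stores HTML snapshots and candidate lists per step.

```python
from dataclasses import dataclass, field


@dataclass
class Action:
    """One step of a trajectory (illustrative fields, not the real schema)."""
    op: str              # "CLICK", "HOVER", "TYPE", or "SELECT"
    element_id: str      # id of the grounded element in the DOM snapshot
    value: str = ""      # text to type or option to select, if any


@dataclass
class Task:
    """A task: a natural language instruction plus its action sequence."""
    website: str
    domain: str
    instruction: str
    actions: list[Action] = field(default_factory=list)


# A toy trajectory in the shape described above.
example = Task(
    website="aa.com",
    domain="Airlines",
    instruction="Find a one-way economy flight from Columbus to New York",
    actions=[
        Action("CLICK", "e17"),              # "One way" toggle
        Action("TYPE", "e42", "Columbus"),   # origin city field
        Action("TYPE", "e43", "New York"),   # destination city field
        Action("SELECT", "e88", "Economy"),  # cabin class drop-down
        Action("CLICK", "e91"),              # "Search" button
    ],
)
```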
The Mind2Web data was collected through a multi-stage crowdsourcing pipeline.[1] Annotators first proposed candidate tasks for each target website. The proposals were filtered by the authors for feasibility and diversity, then handed to a second pool of annotators who completed the tasks inside a Playwright-based browser tool. The tool recorded the full DOM, screenshots, network requests, and the chosen element at every step. A final verification pass by the authors checked that each saved trajectory actually completed the stated task. The result is a dataset of natural, human-written instructions paired with executable browsing traces.[1][2]
Mind2Web evaluates agents at two granularities, the individual step and the whole task, both reported on each of the three test splits.[1]
For every action in a trajectory the agent must produce both an operation and a target element. The paper reports three step-level metrics:[1]

- Element accuracy: whether the predicted target element matches the ground-truth element.
- Operation F1: token-level F1 between the predicted operation, including any typed text or selected option, and the ground truth.
- Step success rate: whether both the element and the operation with its value are correct at that step.

At the task level, success rate counts a task as solved only if every step in its trajectory succeeds.
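A minimal sketch of these metrics follows, under simplified matching assumptions; the official evaluation code also handles element candidate matching and value normalization in more detail.

```python
from collections import Counter


def token_f1(pred: str, gold: str) -> float:
    """Token-level F1 between a predicted and a gold operation string."""
    pred_tokens, gold_tokens = pred.lower().split(), gold.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def score_task(preds: list[dict], golds: list[dict]) -> dict:
    """Step- and task-level metrics for one trajectory.

    Each step dict carries "element" (an element id) and "op"
    (the operation plus any value, e.g. "TYPE Columbus").
    """
    elem_correct, step_correct, op_f1_sum = 0, 0, 0.0
    for p, g in zip(preds, golds):
        elem_ok = p["element"] == g["element"]
        f1 = token_f1(p["op"], g["op"])
        elem_correct += elem_ok
        op_f1_sum += f1
        step_correct += elem_ok and f1 == 1.0
    n = len(golds)
    return {
        "element_accuracy": elem_correct / n,
        "operation_f1": op_f1_sum / n,
        "step_success_rate": step_correct / n,
        "task_success": step_correct == n,  # every step must be correct
    }
```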
Because Mind2Web evaluates against pre-recorded snapshots rather than a live browser, an agent gets exactly one attempt at each step: it cannot recover from a wrong click by exploring further, and an alternative path that would also complete the task is marked wrong. Combined with the requirement that every step be correct, this makes task success rate quite low even for strong baselines.[1][3]
The paper introduces MindAct, a framework that pairs small fine-tuned models with LLMs to act on real web pages. MindAct addresses a basic engineering problem: a typical Mind2Web page contains over a thousand DOM elements, far too many to fit into the prompt of a single LLM call. MindAct therefore breaks the problem into two stages.[1][2]
A small encoder model ranks the elements on the current page by how likely each is to be the correct target for the next action, given the task description and the action history. The Mind2Web release uses a DeBERTa-v3-base cross-encoder for this stage.[1][2]
| Test split | Recall@50 of DeBERTa-v3-base ranker |
|---|---|
| Cross-Task | 88.9% |
| Cross-Website | 85.3% |
| Cross-Domain | 85.7% |
Source: original Mind2Web paper.[1]
With Recall@50 around 85 to 89 percent, the candidate stage retains the correct element for most steps while shrinking the action space from roughly 580 elements after preprocessing to a much smaller set that can be reasoned over by a downstream model.
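A minimal sketch of this ranking stage using the Hugging Face Transformers API is shown below. The checkpoint name is the generic pre-trained DeBERTa-v3-base, a stand-in for the fine-tuned ranker released with the paper, and the query format is likewise illustrative.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Stand-in checkpoint: substitute the fine-tuned MindAct ranker released
# with the paper; this is the base model it was fine-tuned from.
CKPT = "microsoft/deberta-v3-base"
tokenizer = AutoTokenizer.from_pretrained(CKPT)
ranker = AutoModelForSequenceClassification.from_pretrained(CKPT, num_labels=1)


def rank_elements(task: str, history: str, elements: list[str], k: int = 50) -> list[str]:
    """Score every DOM element against the task and keep the top k.

    `elements` holds short textual representations of candidate DOM
    nodes (tag, salient attributes, inner text), one per element.
    """
    query = f"task: {task} previous actions: {history}"
    batch = tokenizer(
        [query] * len(elements), elements,
        padding=True, truncation=True, return_tensors="pt",
    )
    with torch.no_grad():
        scores = ranker(**batch).logits.squeeze(-1)
    top = torch.topk(scores, k=min(k, len(elements))).indices
    return [elements[i] for i in top]
```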
The top candidate elements from Stage 1 are packed into multiple-choice prompts and presented to an action prediction model, which selects the correct option and emits an operation (and any associated text or option value). The paper formulates the task as multi-choice question answering rather than free-form generation, which sidesteps the difficulty of getting an LLM to copy a long DOM snippet verbatim.[1]
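A minimal sketch of the multiple-choice formatting follows. The "None of the above" escape option follows the spirit of the paper's setup, but the exact prompt wording here is illustrative, not the paper's prompt.

```python
import string


def build_prompt(task: str, history: str, candidates: list[str]) -> str:
    """Format top-ranked elements as a multiple-choice question.

    The model picks an option letter instead of generating a DOM
    snippet verbatim, matching the multi-choice QA formulation.
    """
    options = ["A. None of the above"]  # escape hatch when no candidate fits
    for letter, cand in zip(string.ascii_uppercase[1:], candidates):
        options.append(f"{letter}. {cand}")
    return (
        f"Task: {task}\n"
        f"Previous actions: {history}\n"
        "Choose the element to act on next, then give the operation\n"
        "(CLICK, HOVER, TYPE <text>, or SELECT <option>).\n"
        + "\n".join(options)
    )
```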
MindAct supports two families of action models:

- Fine-tuned encoder-decoder models from the Flan-T5 family (base, large, and XL), trained on the Mind2Web training split.
- In-context LLMs, specifically GPT-3.5-turbo and GPT-4, which receive demonstrations in the prompt instead of being fine-tuned.
For the GPT models the paper evaluates on a 50-task subset of each test split because of API cost, and uses three in-context demonstrations drawn from the training set.[1]
The paper reports MindAct results across all three splits and all three step-level metrics, plus task success rate. The numbers below are taken from the original Mind2Web paper, version 3 of the arXiv preprint.[1]

Element accuracy:
| Model | Cross-Task | Cross-Website | Cross-Domain |
|---|---|---|---|
| Flan-T5 base | 43.6% | 32.1% | 33.9% |
| Flan-T5 large | 53.4% | 39.2% | 39.7% |
| Flan-T5 XL | 55.1% | 42.0% | 42.1% |
| GPT-3.5-turbo (in-context) | 20.3% | 19.3% | 21.6% |
| GPT-4 (in-context, 50-task subset) | 41.6% | 35.8% | 37.1% |
Operation F1:

| Model | Cross-Task | Cross-Website | Cross-Domain |
|---|---|---|---|
| Flan-T5 base | 76.8 | 67.6 | 67.3 |
| Flan-T5 large | 75.7 | 67.1 | 67.2 |
| Flan-T5 XL | 75.7 | 65.2 | 66.5 |
| GPT-3.5-turbo | 56.6 | 48.8 | 52.8 |
| GPT-4 | 60.6 | 51.1 | 46.5 |
Step success rate:

| Model | Cross-Task | Cross-Website | Cross-Domain |
|---|---|---|---|
| Flan-T5 base | 41.0% | 29.5% | 31.6% |
| Flan-T5 large | 50.3% | 35.3% | 37.3% |
| Flan-T5 XL | 52.0% | 38.9% | 39.6% |
| GPT-3.5-turbo | 17.4% | 16.2% | 18.6% |
| GPT-4 | 36.2% | 30.1% | 26.4% |
Task success rate:

| Model | Cross-Task | Cross-Website | Cross-Domain |
|---|---|---|---|
| Flan-T5 base | 4.0% | 1.7% | 1.6% |
| Flan-T5 large | 7.1% | 1.1% | 2.7% |
| Flan-T5 XL | 5.2% | 5.1% | 2.9% |
| GPT-3.5-turbo | 0.8% | 0.6% | 1.0% |
| GPT-4 | 2.0% | 2.0% | 2.0% |
A few patterns are visible in these numbers.[1] First, larger fine-tuned models do better than smaller ones: Flan-T5 XL is the strongest fine-tuned baseline on element accuracy and step success rate on every split. Second, GPT-3.5-turbo struggled with this task in 2023, with element accuracy near 20 percent across the board. Third, GPT-4 in a few-shot setting is competitive with the fine-tuned Flan-T5 models on the harder Cross-Website and Cross-Domain splits even though it never sees the training set, which the authors take as an early signal that capable LLMs can generalize to unseen websites without web-specific fine-tuning. Finally, the absolute task success rate is in the single digits everywhere, which highlights how unforgiving the all-steps-correct metric is for trajectories averaging more than seven actions.
The three test splits are intended to isolate different kinds of generalization that a deployable web agent must handle.[1]
Across the reported baselines, element accuracy drops sharply from Cross-Task to the two unseen-website settings, with Cross-Website and Cross-Domain performing comparably, which the paper takes as evidence that genuine generalization to new sites and domains is the central open problem in web agent research.[1]
In early 2024 the OSU group released Multimodal-Mind2Web, a paired version of the dataset that aligns each HTML snapshot with the corresponding rendered screenshot for every step in the trajectories.[2] The multimodal release was intended to support vision-language model baselines that read the page visually rather than from raw HTML.
Multimodal-Mind2Web underpins the SeeAct framework, a generalist web agent built on GPT-4V and other large multimodal models introduced by Boyuan Zheng, Boyu Gou, Jihyung Kil, Huan Sun, and Yu Su at ICML 2024.[4] SeeAct evaluates GPT-4V as the planning brain of a web agent and reports that, when its textual plans are manually grounded into actions on live websites, GPT-4V can complete 51.1 percent of tasks in the SeeAct evaluation, substantially better than text-only LLMs and earlier fine-tuned baselines such as Flan-T5 and BLIP-2.[4] The paper also identifies element grounding as the dominant remaining bottleneck for vision-based web agents, since multimodal models can describe what to do but often fail to point at the right pixel or DOM node.
The original Mind2Web evaluates against pre-recorded HTML, which is inexpensive and reproducible but cannot capture the dynamic, stateful nature of real web sessions. Two follow-up benchmarks address this gap.
Mind2Web-Live was introduced in 2024 as part of the WebCanvas project for online evaluation of web agents. It re-curates 542 tasks from Mind2Web with intermediate evaluation states, allowing live browser execution and a key-node-based scoring metric instead of strict trajectory matching. WebCanvas reports that the best-performing agent at the time of release reached a 23.1 percent task success rate and a 48.8 percent task completion rate on the Mind2Web-Live test set.[5]
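As a rough illustration of key-node scoring, the sketch below computes a completion rate as the fraction of required intermediate states an agent's session reached. The field names and the exact node-matching logic are hypothetical simplifications of WebCanvas's protocol.

```python
def key_node_scores(reached: set[str], required: list[str]) -> dict:
    """Key-node scoring in the spirit of WebCanvas (illustrative).

    `required` lists the intermediate states (key nodes) a correct
    session must pass through; `reached` is the set the agent hit.
    """
    matched = sum(node in reached for node in required)
    return {
        "completion_rate": matched / len(required),  # partial credit
        "task_success": matched == len(required),    # all nodes reached
    }
```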
Online-Mind2Web, released by the OSU group in March 2025, is a refreshed online benchmark of 300 diverse and realistic tasks across 136 websites, designed to approximate how end users actually invoke web agents.[6] The accompanying paper, "An Illusion of Progress? Assessing the Current State of Web Agents," introduces an LLM-as-a-judge automatic evaluator that achieves about 85 percent agreement with human raters and reports head-to-head results for several frontier systems.[6]
| Agent | Task success rate on Online-Mind2Web (human evaluation) |
|---|---|
| OpenAI Operator | 61.3% |
| Claude Computer Use 3.7 | 56.3% |
| SeeAct | 30.7% |
| Browser Use | 30.0% |
| Claude Computer Use 3.5 | 29.0% |
| Agent-E | 28.0% |
Source: "An Illusion of Progress? Assessing the Current State of Web Agents."[6]
The paper also reports a steep degradation by difficulty, with easy tasks averaging around 85 percent success but hard tasks falling to about 38 percent, and notes that earlier results on benchmarks such as WebVoyager appear to overestimate agent capability when re-tested under stricter conditions.[6]
Mind2Web 2 is a separate benchmark released by Boyu Gou, Zanming Huang, Yuting Ning, and a larger team from the Ohio State University and Amazon AGI in 2025. It targets a different problem from the original Mind2Web: agentic search systems such as OpenAI Deep Research that browse the web autonomously and return long, citation-backed answers.[7][8]
| Item | Mind2Web (2023) | Mind2Web 2 (2025) |
|---|---|---|
| Primary task | Action prediction on real web pages | Long-horizon agentic search and synthesis |
| Tasks | 2,350 | 130 |
| Evaluation | Step and trajectory matching | Agent-as-a-Judge over a tree-structured rubric |
| Output type | Action sequences | Citation-backed natural language answers |
| Conference | NeurIPS 2023 (Datasets and Benchmarks) | NeurIPS 2025 (Datasets and Benchmarks) |
Mind2Web 2 contains 130 long-horizon tasks built with more than 1,000 hours of human labor, each paired with a tree-structured rubric of fine-grained evaluation nodes. The paper introduces an Agent-as-a-Judge framework that scores both correctness (does the answer satisfy all task requirements) and attribution (can each claim be traced to a cited source). It evaluates ten frontier agentic search systems and finds that OpenAI Deep Research reaches roughly 50 to 70 percent of human performance while spending about half the time.[7][8]
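As a rough illustration of rubric-tree aggregation, the sketch below evaluates a tree whose leaves carry pass/fail verdicts (in the real system, produced by a judge model) and whose internal nodes combine their children. The AND/OR aggregation scheme here is an assumption for illustration, not the paper's exact scoring rule.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RubricNode:
    """One node of a tree-structured rubric (illustrative structure)."""
    name: str
    agg: str = "and"                        # "and": all children must hold; "or": any
    children: list["RubricNode"] = field(default_factory=list)
    passed: Optional[bool] = None           # leaf verdict, e.g. from a judge LLM


def satisfied(node: RubricNode) -> bool:
    """Evaluate a rubric tree bottom-up from its leaf verdicts."""
    if not node.children:                   # leaf node: judged directly
        return bool(node.passed)
    results = [satisfied(child) for child in node.children]
    return all(results) if node.agg == "and" else any(results)


# Toy rubric: the answer must name a paper AND (cite arXiv OR cite the venue).
rubric = RubricNode("answer", children=[
    RubricNode("names_paper", passed=True),
    RubricNode("attribution", agg="or", children=[
        RubricNode("cites_arxiv", passed=False),
        RubricNode("cites_venue", passed=True),
    ]),
])
print(satisfied(rubric))  # True
```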
Despite the shared name, Mind2Web 2 is best understood as a sibling benchmark rather than a strict successor: it answers a different question about a newer class of systems, while the original Mind2Web continues to be used as a static benchmark for action prediction.[2][7]
Mind2Web sits in a small family of widely cited benchmarks for web-acting agents. The two most often compared with it are WebArena and VisualWebArena.
WebArena was released in 2023 by Shuyan Zhou, Frank F. Xu, Hao Zhu, and colleagues at Carnegie Mellon University and accepted at ICLR 2024. It hosts fully functional copies of four real-world web applications (an e-commerce store, a Reddit-style forum, a GitLab instance, and a content management system) and ships 812 long-horizon tasks evaluated by functional outcome rather than by matching a reference trajectory. The paper reports that the best GPT-4 agent at submission time achieved a 14.41 percent task success rate, against a human baseline of 78.24 percent.[9]
VisualWebArena extends WebArena to visually grounded tasks. It introduces 910 tasks across three new web applications that explicitly require image and spatial reasoning, and benchmarks multimodal agents that must look at the rendered page rather than rely on text alone.[10]
| Benchmark | Year | Setting | Tasks | Websites or apps | Primary signal |
|---|---|---|---|---|---|
| Mind2Web | 2023 | Offline (real HTML snapshots) | 2,350 | 137 real websites | Action and element matching |
| WebArena | 2023 | Live, sandboxed (self-hosted) | 812 | 4 self-hosted apps | Functional task success |
| VisualWebArena | 2024 | Live, sandboxed (self-hosted) | 910 | 3 self-hosted multimodal apps | Functional task success |
| Mind2Web-Live | 2024 | Live (real websites) | 542 | Subset of Mind2Web sites | Key-node task completion |
| Online-Mind2Web | 2025 | Live (real websites) | 300 | 136 real websites | LLM-as-a-judge task success |
| Mind2Web 2 | 2025 | Live agentic search | 130 | Open web | Agent-as-a-Judge over rubric |
In practice the field has converged on using Mind2Web for offline element selection, WebArena and VisualWebArena for sandboxed functional evaluation, and Online-Mind2Web for live evaluation of frontier browser agents.[6][9][10]
Mind2Web has been used as either a primary or comparison benchmark for most major web agent systems released between 2023 and 2025, including SeeAct, Browser Use, Agent-E, and the SeeAct V variant adapted for live evaluation.[2][4][6] Several recurring themes have emerged from this body of work.
First, the Mind2Web release demonstrated that web agent research could move from isolated demos to a shared benchmark with reproducible splits and clear metrics, much as GLUE and SuperGLUE had earlier done for natural language understanding.[1] Subsequent benchmarks adopted the Mind2Web vocabulary of cross-task, cross-website, and cross-domain generalization.
Second, the MindAct two-stage design (rank then act) became a common architectural pattern for HTML-based agents. Many later systems retain a small ranker over DOM elements and pass a shortlist to an LLM for action selection, even when the LLM is replaced by a vision-language model.[2][4]
Third, the gap between offline element accuracy on Mind2Web and live functional success on WebArena, Online-Mind2Web, and Mind2Web 2 helped sharpen the community's understanding that picking the right click is necessary but not sufficient for finishing real web tasks. Frontier systems such as OpenAI Operator and Claude Computer Use now report results on live benchmarks, with Mind2Web's offline numbers used as a stable diagnostic for ablation studies and smaller models.[6][7]
The Mind2Web authors and later commentators have noted several limitations of the original benchmark.[1][6]
First, evaluation against static snapshots cannot model the consequences of an action. An agent that clicks the wrong button cannot recover, and an agent that takes a different but equally correct path is marked wrong. Step-level matching therefore under-credits agents whose strategies diverge from the human annotator's exact route.
Second, the dataset captures the web as it existed in 2022 and 2023. Real websites change over time, so any live re-execution of Mind2Web tasks against the current internet quickly becomes brittle. Mind2Web-Live and Online-Mind2Web were created in part to address this drift.[5][6]
Third, the action space is small. Click, Hover, Type, and Select Option do not cover drag-and-drop, file upload, multi-page forms with iframes, or actions that require waiting for animations or asynchronous responses. Modern browser agents frequently encounter these patterns, but they are not represented in the original Mind2Web action set.[1]
Fourth, the Cross-Domain split, while large, still draws all of its websites from popular English-language services across a handful of categories, and the dataset reflects the design conventions of those sites. Generalization to administrative portals, internal enterprise tools, or non-English websites is not directly measured.[1]
Finally, an analysis in "An Illusion of Progress?" argues that several earlier reports of web agent performance, including some that drew on Mind2Web variants, may have over-estimated agent capability because of evaluation leakage or weak human baselines. The Online-Mind2Web authors explicitly call for stricter evaluation protocols when comparing frontier agents.[6]