Mind2Web is a dataset and benchmark for developing and evaluating generalist AI agents that follow natural language instructions to complete tasks on the open web. It was introduced in the 2023 paper "Mind2Web: Towards a Generalist Agent for the Web" by Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, and Yu Su of the Ohio State University NLP group, and was selected as a NeurIPS 2023 spotlight paper in the Datasets and Benchmarks track.[1][2]
The dataset contains 2,350 open-ended tasks collected from 137 live websites across 31 domains, with crowdsourced action sequences captured on real web pages.[1][2] Mind2Web also introduced MindAct, a two-stage framework that combines a small fine-tuned ranker with a large language model to predict actions from filtered HTML, and reported the first systematic baselines for Flan-T5, GPT-3.5, and GPT-4 as web agents.[1] The benchmark became a foundational reference for later work on web agents and AI browser agents, and a starting point for follow-on resources such as Multimodal-Mind2Web, SeeAct, Online-Mind2Web, and Mind2Web 2.[2]
Before Mind2Web, most benchmarks for web automation either used a small set of simplified pages, such as MiniWoB and MiniWoB++, or focused on a narrow vertical such as e-commerce shopping in WebShop.[1] These setups were useful for studying interaction primitives, but they did not measure whether an agent could read complex modern HTML, plan multi-step actions, and generalize across the broad surface of the public web.
The Mind2Web authors argued that a credible benchmark for general web agents needed three properties:[1]

1. Diverse coverage of domains, websites, and tasks, rather than a handful of hand-picked sites.
2. Real-world websites instead of simulated or simplified environments.
3. A broad spectrum of user interaction patterns, beyond a fixed set of templated operations.
Mind2Web was the first dataset that attempted to satisfy all three properties at once. Its release coincided with the rapid growth of LLM-driven web agent research in 2023 and 2024 and gave the community a shared yardstick for that work.[2][3]
Mind2Web was developed by researchers from the Ohio State University NLP group led by Huan Sun and Yu Su, with the dataset, models, and code released under permissive licenses (CC BY 4.0 for the dataset, MIT for the code).[2] The paper was first posted to arXiv on June 9, 2023 and accepted as a NeurIPS 2023 spotlight in the Datasets and Benchmarks track.[1]
| Item | Value |
|---|---|
| Paper title | Mind2Web: Towards a Generalist Agent for the Web |
| Authors | Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, Yu Su |
| Affiliation | The Ohio State University |
| First arXiv version | June 9, 2023 |
| Conference | NeurIPS 2023 (Datasets and Benchmarks, Spotlight) |
| arXiv ID | 2306.06070 |
| Code repository | github.com/OSU-NLP-Group/Mind2Web |
| Project page | osu-nlp-group.github.io/Mind2Web |
| Dataset license | CC BY 4.0 |
| Code license | MIT |
The original release shipped HTML snapshots, MHTML files, screenshots, DOM dumps, and HAR network traces for each annotated trajectory, along with a script to reproduce the dataset splits and baseline experiments.[2]
Mind2Web is fundamentally an offline dataset of human-collected web trajectories. Each example pairs a high-level natural language instruction with a sequence of (web page, target element, action) tuples that, when executed, complete the task on the original website.[1]
| Statistic | Value |
|---|---|
| Total tasks | 2,350 |
| Distinct websites | 137 |
| Domains | 31 |
| Average actions per task | 7.3 |
| Average raw elements per page | ~1,135 |
| Average elements after preprocessing | ~580 |
| Top-level categories | Travel, Information, Service, Shopping, Entertainment |
Source: original Mind2Web paper.[1]
The 137 websites were selected from popular real services and split into five top-level categories: Travel, Information, Service, Shopping, and Entertainment. Within these categories the dataset spans 31 domains, including airlines, car rental, restaurants, social media, weather, real estate, music, sports, and gaming.[1]
The authors divided the data into one training split and three test splits. Each test split corresponds to a different generalization regime, which is the central evaluation idea in the paper.[1][2]
| Split | Tasks | Purpose |
|---|---|---|
| Train | 1,009 | Fine-tuning data and the pool for in-context demonstrations |
| Cross-Task | 252 | New tasks on websites already seen during training |
| Cross-Website | 177 | Held-out websites within domains seen during training |
| Cross-Domain | 912 | Websites in domains not seen during training |
The Cross-Domain split is by far the largest test set and is intended to measure whether an agent has learned transferable web skills as opposed to memorizing site-specific layouts.[1]
Mind2Web models web interactions as a sequence of low-level operations performed on individual DOM elements. The action space in the original release covers four operations: Click, Hover, Type, and Select Option. Type and Select Option also carry a value argument such as the text to enter or the option to choose.[1][2]
| Operation | Required arguments | Example |
|---|---|---|
| Click | Target element | Click the "Search" button |
| Hover | Target element | Hover over the "Account" menu |
| Type | Target element, text value | Type "Columbus" into the city field |
| Select Option | Target element, option value | Select "Economy" from the cabin class drop-down |
Each action is grounded to a unique element in a DOM snapshot of the page, so any agent that operates on the dataset must both pick the correct operation and identify the correct element among hundreds or thousands of candidates.[1]
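To make the data model concrete, here is a minimal sketch of a trajectory in this shape. The class and field names, and the element ids, are illustrative rather than the dataset's actual schema, which additionally stores HTML snapshots and candidate lists per step.

```python
from dataclasses import dataclass, field


@dataclass
class Action:
    """One step of a trajectory (illustrative fields, not the real schema)."""
    op: str              # "CLICK", "HOVER", "TYPE", or "SELECT"
    element_id: str      # id of the grounded element in the DOM snapshot
    value: str = ""      # text to type or option to select, if any


@dataclass
class Task:
    """A task: a natural language instruction plus its action sequence."""
    website: str
    domain: str
    instruction: str
    actions: list[Action] = field(default_factory=list)


# A toy trajectory in the shape described above.
example = Task(
    website="aa.com",
    domain="Airlines",
    instruction="Find a one-way economy flight from Columbus to New York",
    actions=[
        Action("CLICK", "e17"),              # "One way" toggle
        Action("TYPE", "e42", "Columbus"),   # origin city field
        Action("TYPE", "e43", "New York"),   # destination city field
        Action("SELECT", "e88", "Economy"),  # cabin class drop-down
        Action("CLICK", "e91"),              # "Search" button
    ],
)
```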
The Mind2Web data was collected through a multi-stage crowdsourcing pipeline.[1] Annotators first proposed candidate tasks for each target website. The proposals were filtered by the authors for feasibility and diversity, then handed to a second pool of annotators who completed the tasks inside a Playwright-based browser tool. The tool recorded the full DOM, screenshots, network requests, and the chosen element at every step. A final verification pass by the authors checked that each saved trajectory actually completed the stated task. The result is a dataset of natural, human-written instructions paired with executable browsing traces.[1][2]
Mind2Web evaluates agents at two granularities, the individual step and the whole task, both reported on each of the three test splits.[1]
For every action in a trajectory the agent must produce both an operation and a target element. The paper reports three step-level metrics:[1]

- Element accuracy: whether the predicted target element matches the ground-truth element.
- Operation F1: token-level F1 between the predicted operation, including any typed text or selected option, and the ground truth.
- Step success rate: whether both the element and the operation with its value are correct at that step.

At the task level, success rate counts a task as solved only if every step in its trajectory succeeds.
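A minimal sketch of these metrics follows, under simplified matching assumptions; the official evaluation code also handles element candidate matching and value normalization in more detail.

```python
from collections import Counter


def token_f1(pred: str, gold: str) -> float:
    """Token-level F1 between a predicted and a gold operation string."""
    pred_tokens, gold_tokens = pred.lower().split(), gold.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def score_task(preds: list[dict], golds: list[dict]) -> dict:
    """Step- and task-level metrics for one trajectory.

    Each step dict carries "element" (an element id) and "op"
    (the operation plus any value, e.g. "TYPE Columbus").
    """
    elem_correct, step_correct, op_f1_sum = 0, 0, 0.0
    for p, g in zip(preds, golds):
        elem_ok = p["element"] == g["element"]
        f1 = token_f1(p["op"], g["op"])
        elem_correct += elem_ok
        op_f1_sum += f1
        step_correct += elem_ok and f1 == 1.0
    n = len(golds)
    return {
        "element_accuracy": elem_correct / n,
        "operation_f1": op_f1_sum / n,
        "step_success_rate": step_correct / n,
        "task_success": step_correct == n,  # every step must be correct
    }
```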
Because Mind2Web evaluates against pre-recorded snapshots rather than a live browser, an agent gets exactly one attempt at each step: it cannot recover from a wrong click by exploring further, and an alternative path that would also complete the task is marked wrong. Combined with the requirement that every step be correct, this makes task success rate quite low even for strong baselines.[1][3]
The paper introduces MindAct, a framework that pairs small fine-tuned models with LLMs to act on real web pages. MindAct addresses a basic engineering problem: a typical Mind2Web page contains over a thousand DOM elements, far too many to fit into the prompt of a single LLM call. MindAct therefore breaks the problem into two stages.[1][2]
A small encoder model ranks the elements on the current page by how likely each is to be the correct target for the next action, given the task description and the action history. The Mind2Web release uses a DeBERTa-v3-base cross-encoder for this stage.[1][2]
| Test split | Recall@50 of DeBERTa-v3-base ranker |
|---|---|
| Cross-Task | 88.9% |
| Cross-Website | 85.3% |
| Cross-Domain | 85.7% |
Source: original Mind2Web paper.[1]
With Recall@50 around 85 to 89 percent, the candidate stage retains the correct element for most steps while shrinking the action space from roughly 580 elements after preprocessing to a much smaller set that can be reasoned over by a downstream model.
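A minimal sketch of this ranking stage using the Hugging Face Transformers API is shown below. The checkpoint name is the generic pre-trained DeBERTa-v3-base, a stand-in for the fine-tuned ranker released with the paper, and the query format is likewise illustrative.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Stand-in checkpoint: substitute the fine-tuned MindAct ranker released
# with the paper; this is the base model it was fine-tuned from.
CKPT = "microsoft/deberta-v3-base"
tokenizer = AutoTokenizer.from_pretrained(CKPT)
ranker = AutoModelForSequenceClassification.from_pretrained(CKPT, num_labels=1)


def rank_elements(task: str, history: str, elements: list[str], k: int = 50) -> list[str]:
    """Score every DOM element against the task and keep the top k.

    `elements` holds short textual representations of candidate DOM
    nodes (tag, salient attributes, inner text), one per element.
    """
    query = f"task: {task} previous actions: {history}"
    batch = tokenizer(
        [query] * len(elements), elements,
        padding=True, truncation=True, return_tensors="pt",
    )
    with torch.no_grad():
        scores = ranker(**batch).logits.squeeze(-1)
    top = torch.topk(scores, k=min(k, len(elements))).indices
    return [elements[i] for i in top]
```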
The top candidate elements from Stage 1 are packed into multiple-choice prompts and presented to an action prediction model, which selects the correct option and emits an operation (and any associated text or option value). The paper formulates the task as multi-choice question answering rather than free-form generation, which sidesteps the difficulty of getting an LLM to copy a long DOM snippet verbatim.[1]
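A minimal sketch of the multiple-choice formatting follows. The "None of the above" escape option follows the spirit of the paper's setup, but the exact prompt wording here is illustrative, not the paper's prompt.

```python
import string


def build_prompt(task: str, history: str, candidates: list[str]) -> str:
    """Format top-ranked elements as a multiple-choice question.

    The model picks an option letter instead of generating a DOM
    snippet verbatim, matching the multi-choice QA formulation.
    """
    options = ["A. None of the above"]  # escape hatch when no candidate fits
    for letter, cand in zip(string.ascii_uppercase[1:], candidates):
        options.append(f"{letter}. {cand}")
    return (
        f"Task: {task}\n"
        f"Previous actions: {history}\n"
        "Choose the element to act on next, then give the operation\n"
        "(CLICK, HOVER, TYPE <text>, or SELECT <option>).\n"
        + "\n".join(options)
    )
```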
MindAct supports two families of action models:

- Fine-tuned encoder-decoder models from the Flan-T5 family (base, large, and XL), trained on the Mind2Web training split.
- In-context LLMs, specifically GPT-3.5-turbo and GPT-4, which receive demonstrations in the prompt instead of being fine-tuned.
For the GPT models the paper evaluates on a 50-task subset of each test split because of API cost, and uses three in-context demonstrations drawn from the training set.[1]
The paper reports MindAct results across all three splits and all three step-level metrics, plus task success rate. The numbers below are taken from the original Mind2Web paper, version 3 of the arXiv preprint.[1]

Element accuracy:
| Model | Cross-Task | Cross-Website | Cross-Domain |
|---|---|---|---|
| Flan-T5 base | 43.6% | 32.1% | 33.9% |
| Flan-T5 large | 53.4% | 39.2% | 39.7% |
| Flan-T5 XL | 55.1% | 42.0% | 42.1% |
| GPT-3.5-turbo (in-context) | 20.3% | 19.3% | 21.6% |
| GPT-4 (in-context, 50-task subset) | 41.6% | 35.8% | 37.1% |
Operation F1:

| Model | Cross-Task | Cross-Website | Cross-Domain |
|---|---|---|---|
| Flan-T5 base | 76.8 | 67.6 | 67.3 |
| Flan-T5 large | 75.7 | 67.1 | 67.2 |
| Flan-T5 XL | 75.7 | 65.2 | 66.5 |
| GPT-3.5-turbo | 56.6 | 48.8 | 52.8 |
| GPT-4 | 60.6 | 51.1 | 46.5 |
Step success rate:

| Model | Cross-Task | Cross-Website | Cross-Domain |
|---|---|---|---|
| Flan-T5 base | 41.0% | 29.5% | 31.6% |
| Flan-T5 large | 50.3% | 35.3% | 37.3% |
| Flan-T5 XL | 52.0% | 38.9% | 39.6% |
| GPT-3.5-turbo | 17.4% | 16.2% | 18.6% |
| GPT-4 | 36.2% | 30.1% | 26.4% |
Task success rate:

| Model | Cross-Task | Cross-Website | Cross-Domain |
|---|---|---|---|
| Flan-T5 base | 4.0% | 1.7% | 1.6% |
| Flan-T5 large | 7.1% | 1.1% | 2.7% |
| Flan-T5 XL | 5.2% | 5.1% | 2.9% |
| GPT-3.5-turbo | 0.8% | 0.6% | 1.0% |
| GPT-4 | 2.0% | 2.0% | 2.0% |
A few patterns are visible in these numbers.[1] First, larger fine-tuned models do better than smaller ones: Flan-T5 XL is the strongest fine-tuned baseline on element accuracy and step success rate on every split. Second, GPT-3.5-turbo struggled with this task in 2023, with element accuracy near 20 percent across the board. Third, GPT-4 in a few-shot setting is competitive with the fine-tuned Flan-T5 models on the harder Cross-Website and Cross-Domain splits even though it never sees the training set, which the authors take as an early signal that capable LLMs can generalize to unseen websites without web-specific fine-tuning. Finally, the absolute task success rate is in the single digits everywhere, which highlights how unforgiving the all-steps-correct metric is for trajectories averaging more than seven actions.
The three test splits are intended to isolate different kinds of generalization that a deployable web agent must handle.[1]
Across the reported baselines, element accuracy drops sharply from Cross-Task to the two unseen-website settings, with Cross-Website and Cross-Domain performing comparably, which the paper takes as evidence that genuine generalization to new sites and domains is the central open problem in web agent research.[1]
In early 2024 the OSU group released Multimodal-Mind2Web, a paired version of the dataset that aligns each HTML snapshot with the corresponding rendered screenshot for every step in the trajectories.[2] The multimodal release was intended to support vision-language model baselines that read the page visually rather than from raw HTML.
Multimodal-Mind2Web underpins the SeeAct framework, a generalist web agent built on GPT-4V and other large multimodal models introduced by Boyuan Zheng, Boyu Gou, Jihyung Kil, Huan Sun, and Yu Su at ICML 2024.[4] SeeAct evaluates GPT-4V as the planning brain of a web agent and reports that, when its textual plans are manually grounded into actions on live websites, GPT-4V can complete 51.1 percent of tasks in the SeeAct evaluation, substantially better than text-only LLMs and earlier fine-tuned baselines such as Flan-T5 and BLIP-2.[4] The paper also identifies element grounding as the dominant remaining bottleneck for vision-based web agents, since multimodal models can describe what to do but often fail to point at the right pixel or DOM node.
The original Mind2Web evaluates against pre-recorded HTML, which is inexpensive and reproducible but cannot capture the dynamic, stateful nature of real web sessions. Two follow-up benchmarks address this gap.
Mind2Web-Live was introduced in 2024 as part of the WebCanvas project for online evaluation of web agents. It re-curates 542 tasks from Mind2Web with intermediate evaluation states, allowing live browser execution and a key-node-based scoring metric instead of strict trajectory matching. WebCanvas reports that the best-performing agent at the time of release reached a 23.1 percent task success rate and a 48.8 percent task completion rate on the Mind2Web-Live test set.[5]
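As a rough illustration of key-node scoring, the sketch below computes a completion rate as the fraction of required intermediate states an agent's session reached. The field names and the exact node-matching logic are hypothetical simplifications of WebCanvas's protocol.

```python
def key_node_scores(reached: set[str], required: list[str]) -> dict:
    """Key-node scoring in the spirit of WebCanvas (illustrative).

    `required` lists the intermediate states (key nodes) a correct
    session must pass through; `reached` is the set the agent hit.
    """
    matched = sum(node in reached for node in required)
    return {
        "completion_rate": matched / len(required),  # partial credit
        "task_success": matched == len(required),    # all nodes reached
    }
```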
Online-Mind2Web, released by the OSU group in March 2025, is a refreshed online benchmark of 300 diverse and realistic tasks across 136 websites, designed to approximate how end users actually invoke web agents.[6] The accompanying paper, "An Illusion of Progress? Assessing the Current State of Web Agents," introduces an LLM-as-a-judge automatic evaluator that achieves about 85 percent agreement with human raters and reports head-to-head results for several frontier systems.[6]
| Agent | Task success rate on Online-Mind2Web (human evaluation) |
|---|---|
| OpenAI Operator | 61.3% |
| Claude Computer Use 3.7 | 56.3% |
| SeeAct | 30.7% |
| Browser Use | 30.0% |
| Claude Computer Use 3.5 | 29.0% |
| Agent-E | 28.0% |
Source: "An Illusion of Progress? Assessing the Current State of Web Agents."[6]
The paper also reports a steep degradation by difficulty, with easy tasks averaging around 85 percent success but hard tasks falling to about 38 percent, and notes that earlier results on benchmarks such as WebVoyager appear to overestimate agent capability when re-tested under stricter conditions.[6]
Mind2Web 2 is a separate benchmark released by Boyu Gou, Zanming Huang, Yuting Ning, and a larger team from the Ohio State University and Amazon AGI in 2025. It targets a different problem from the original Mind2Web: agentic search systems such as OpenAI Deep Research that browse the web autonomously and return long, citation-backed answers.[7][8]
| Item | Mind2Web (2023) | Mind2Web 2 (2025) |
|---|---|---|
| Primary task | Action prediction on real web pages | Long-horizon agentic search and synthesis |
| Tasks | 2,350 | 130 |
| Evaluation | Step and trajectory matching | Agent-as-a-Judge over a tree-structured rubric |
| Output type | Action sequences | Citation-backed natural language answers |
| Conference | NeurIPS 2023 (Datasets and Benchmarks) | NeurIPS 2025 (Datasets and Benchmarks) |
Mind2Web 2 contains 130 long-horizon tasks built with more than 1,000 hours of human labor, each paired with a tree-structured rubric of fine-grained evaluation nodes. The paper introduces an Agent-as-a-Judge framework that scores both correctness (does the answer satisfy all task requirements) and attribution (can each claim be traced to a cited source). It evaluates ten frontier agentic search systems and finds that OpenAI Deep Research reaches roughly 50 to 70 percent of human performance while spending about half the time.[7][8]
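As a rough illustration of rubric-tree aggregation, the sketch below evaluates a tree whose leaves carry pass/fail verdicts (in the real system, produced by a judge model) and whose internal nodes combine their children. The AND/OR aggregation scheme here is an assumption for illustration, not the paper's exact scoring rule.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RubricNode:
    """One node of a tree-structured rubric (illustrative structure)."""
    name: str
    agg: str = "and"                        # "and": all children must hold; "or": any
    children: list["RubricNode"] = field(default_factory=list)
    passed: Optional[bool] = None           # leaf verdict, e.g. from a judge LLM


def satisfied(node: RubricNode) -> bool:
    """Evaluate a rubric tree bottom-up from its leaf verdicts."""
    if not node.children:                   # leaf node: judged directly
        return bool(node.passed)
    results = [satisfied(child) for child in node.children]
    return all(results) if node.agg == "and" else any(results)


# Toy rubric: the answer must name a paper AND (cite arXiv OR cite the venue).
rubric = RubricNode("answer", children=[
    RubricNode("names_paper", passed=True),
    RubricNode("attribution", agg="or", children=[
        RubricNode("cites_arxiv", passed=False),
        RubricNode("cites_venue", passed=True),
    ]),
])
print(satisfied(rubric))  # True
```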
Despite the shared name, Mind2Web 2 is best understood as a sibling benchmark rather than a strict successor: it answers a different question about a newer class of systems, while the original Mind2Web continues to be used as a static benchmark for action prediction.[2][7]
Mind2Web sits in a small family of widely cited benchmarks for web-acting agents. The two most often compared with it are WebArena and VisualWebArena.
WebArena was released in 2023 by Shuyan Zhou, Frank F. Xu, Hao Zhu, and colleagues at Carnegie Mellon University and accepted at ICLR 2024. It hosts fully functional copies of four real-world web applications (an e-commerce store, a Reddit-style forum, a GitLab instance, and a content management system) and ships 812 long-horizon tasks evaluated by functional outcome rather than by matching a reference trajectory. The paper reports that the best GPT-4 agent at submission time achieved a 14.41 percent task success rate, against a human baseline of 78.24 percent.[9]
VisualWebArena extends WebArena to visually grounded tasks. It introduces 910 tasks across three new web applications that explicitly require image and spatial reasoning, and benchmarks multimodal agents that must look at the rendered page rather than rely on text alone.[10]
| Benchmark | Year | Setting | Tasks | Websites or apps | Primary signal |
|---|---|---|---|---|---|
| Mind2Web | 2023 | Offline (real HTML snapshots) | 2,350 | 137 real websites | Action and element matching |
| WebArena | 2023 | Live, sandboxed (self-hosted) | 812 | 4 self-hosted apps | Functional task success |
| VisualWebArena | 2024 | Live, sandboxed (self-hosted) | 910 | 3 self-hosted multimodal apps | Functional task success |
| Mind2Web-Live | 2024 | Live (real websites) | 542 | Subset of Mind2Web sites | Key-node task completion |
| Online-Mind2Web | 2025 | Live (real websites) | 300 | 136 real websites | LLM-as-a-judge task success |
| Mind2Web 2 | 2025 | Live agentic search | 130 | Open web | Agent-as-a-Judge over rubric |
In practice the field has converged on using Mind2Web for offline element selection, WebArena and VisualWebArena for sandboxed functional evaluation, and Online-Mind2Web for live evaluation of frontier browser agents.[6][9][10]
Mind2Web has been used as either a primary or comparison benchmark for most major web agent systems released between 2023 and 2025, including SeeAct, Browser Use, Agent-E, and the SeeAct V variant adapted for live evaluation.[2][4][6] Several recurring themes have emerged from this body of work.
First, the Mind2Web release demonstrated that web agent research could move from isolated demos to a shared benchmark with reproducible splits and clear metrics, much as GLUE and SuperGLUE had earlier done for natural language understanding.[1] Subsequent benchmarks adopted the Mind2Web vocabulary of cross-task, cross-website, and cross-domain generalization.
Second, the MindAct two-stage design (rank then act) became a common architectural pattern for HTML-based agents. Many later systems retain a small ranker over DOM elements and pass a shortlist to an LLM for action selection, even when the LLM is replaced by a vision-language model.[2][4]
Third, the gap between offline element accuracy on Mind2Web and live functional success on WebArena, Online-Mind2Web, and Mind2Web 2 helped sharpen the community's understanding that picking the right click is necessary but not sufficient for finishing real web tasks. Frontier systems such as OpenAI Operator and Claude Computer Use now report results on live benchmarks, with Mind2Web's offline numbers used as a stable diagnostic for ablation studies and smaller models.[6][7]
The Mind2Web authors and later commentators have noted several limitations of the original benchmark.[1][6]
First, evaluation against static snapshots cannot model the consequences of an action. An agent that clicks the wrong button cannot recover, and an agent that takes a different but equally correct path is marked wrong. Step-level matching therefore under-credits agents whose strategies diverge from the human annotator's exact route.
Second, the dataset captures the web as it existed in 2022 and 2023. Real websites change over time, so any live re-execution of Mind2Web tasks against the current internet quickly becomes brittle. Mind2Web-Live and Online-Mind2Web were created in part to address this drift.[5][6]
Third, the action space is small. Click, Hover, Type, and Select Option do not cover drag-and-drop, file upload, multi-page forms with iframes, or actions that require waiting for animations or asynchronous responses. Modern browser agents frequently encounter these patterns, but they are not represented in the original Mind2Web action set.[1]
Fourth, the Cross-Domain split, while large, still draws all of its websites from popular English-language services across a handful of categories, and the dataset reflects the design conventions of those sites. Generalization to administrative portals, internal enterprise tools, or non-English websites is not directly measured.[1]
Finally, an analysis in "An Illusion of Progress?" argues that several earlier reports of web agent performance, including some that drew on Mind2Web variants, may have over-estimated agent capability because of evaluation leakage or weak human baselines. The Online-Mind2Web authors explicitly call for stricter evaluation protocols when comparing frontier agents.[6]