MM-BrowseComp
Last reviewed
Jun 2, 2026
Sources
3 citations
Review status
Source-backed
Revision
v1 · 2,131 words
MM-BrowseComp is a benchmark for evaluating multimodal web-browsing AI agents, introduced in August 2025 by researchers from ByteDance, Nanjing University, M-A-P, the Institute of Automation of the Chinese Academy of Sciences (CASIA), and Zhejiang University [1][2]. It is a multimodal extension of BrowseComp, the text-only browsing benchmark released by OpenAI earlier the same year. MM-BrowseComp consists of 224 hand-crafted, multi-hop questions that are designed so the information needed to answer them is embedded in images or videos on web pages rather than in plain text, forcing agents to retrieve and reason over visual content during the search process [1][2]. In the authors' evaluation, even the strongest system tested, OpenAI o3 with tools, reached only 29.02% overall accuracy, while most other models and agents scored below 10% [1][3].
Browsing agents built on large language models and vision-language models can answer questions by issuing search queries, reading web pages, and chaining together evidence across many sources, a workflow popularized by commercial systems such as OpenAI Deep Research and Google Gemini Deep Research [1]. Benchmarks that measure this capability had, until 2025, focused almost entirely on text. MM-BrowseComp targets the gap left by those text-only evaluations: it asks whether an agent can find and interpret facts that exist only in the visual modality, such as a detail visible in a photograph, a frame of a video, or a chart, none of which a purely text-based pipeline can read [1].
The benchmark inherits the design philosophy of BrowseComp and SimpleQA. Questions are easy to verify but hard to solve: each has a short, unique, time-stable answer (a name, number, or color) reached only through a long chain of search and reasoning steps [1]. The headline finding of the paper is that current models are far weaker at browsing for and reasoning over visual content than over text, and that no existing system, open or closed, comes close to solving the benchmark [1].
The work was posted to arXiv as arXiv:2508.13186 on 14 August 2025, with code and data released on GitHub under the MMBrowseComp organization [1][2]. The lead authors include Xingyuan Bu, Wenjie Wang, and Jiaheng Liu, with Ge Zhang, Wangchunshu Zhou, and Zhaoxiang Zhang among the senior contributors [1].
BrowseComp, released by OpenAI in 2025, established the format MM-BrowseComp builds on: a set of questions that require an agent to locate deeply hidden, hard-to-find information across many web pages while remaining trivially checkable once the answer is known [1]. BrowseComp and its derivatives evaluate only textual information. The MM-BrowseComp authors identify two limitations that follow from this: such benchmarks cannot test user queries that contain images, and they ignore the large amount of knowledge that lives in the interleaved text, images, and videos of real web pages [1].
MM-BrowseComp adopts the same inverted construction method as BrowseComp, starting from a known fact and reverse-engineering a question that isolates it as the sole answer, and the same constraint that answers be concise and verifiable [1]. The decisive difference is the mandatory multimodal dependency: the essential information must reside in the visual modality and must not be recoverable from any text source, which eliminates text-only shortcuts [1]. In this sense MM-BrowseComp is to multimodal browsing what BrowseComp is to text browsing. The relationship is explicit in the paper's comparison figures, which show that the same agents score far lower on MM-BrowseComp than on BrowseComp and other prominent benchmarks [1].
The core design principle of MM-BrowseComp is that solving a task requires understanding content in image or video modalities [1][2]. This manifests in two ways. First, the input prompt itself may contain one or more images: 57% of the questions include at least one image in the prompt, while the remaining 43% begin as purely text-based prompts [1]. Second, and regardless of how the question is phrased, the critical evidence encountered during the search and reasoning process may be embedded in images or videos on the web, so an agent must actively inspect visual material it discovers rather than relying on surrounding captions or alt text [1].
To make the visual dependency strict, annotators were instructed that the information needed to complete a task should appear primarily in the visual modality and should not appear in any text source, deliberately removing textual shortcuts [1]. Questions are multi-hop and intentionally difficult: the authors required that even state-of-the-art vision-language models or agents could not answer them in a single attempt, and that a second human annotator could not reliably solve them within five minutes of active web searching [1].
The 224 questions are distributed across 22 distinct subtasks grouped into five broad categories, giving roughly balanced coverage of common knowledge domains [1]:
| Category | Share of dataset |
|---|---|
| Media | 29% |
| Technology | 26% |
| Society | 18% |
| Academics | 14% |
| Geography | 13% |
MM-BrowseComp was assembled by an annotation team of more than twenty master's- and PhD-level AI researchers, each assigned to two or three of the 22 subtasks that matched their domain expertise so that every subtask was authored by multiple annotators [1]. A gold-standard example was provided for each subtask as a reference [1].
Construction followed several quality standards beyond the multimodal-dependency rule. Each question had to be inherently difficult, defined operationally by two checks: it had to remain unanswerable by both Gemini-2.5-Pro and GPT-4o when each was given web search and a single attempt, and it had to resist solution by an unfamiliar annotator allowed up to five minutes of searching [1]. Answers had to be concise, easily verifiable phrases, and they had to be temporally stable, so annotators drew on authoritative sources and added explicit time constraints where an answer might otherwise drift [1]. To enforce answer uniqueness, experts searched for alternative valid answers using tools such as OpenAI Deep Research and tightened a question's constraints until only the intended answer remained correct [1].
Quality control ran in three phases: a pilot-and-calibration round of three instances per subtask reviewed by the core team; full-scale construction with a secondary review and revision cycle; and a final tool-dependency check that discarded any question solvable by Gemini-2.5-Pro or GPT-4o without browsing tools, followed by factual verification of every question, answer, and checklist [1]. The pipeline began with 300 candidate instances. Of these, 161 (53.7%) were accepted directly, 63 (21.0%) were revised to meet the standards, and 76 (25.3%) were discarded, leaving the final set of 224 questions [1].
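The filtering arithmetic above can be checked in a few lines. This is only a sanity check on the reported counts; the variable names are illustrative and do not come from the paper or its repository.

```python
# Candidate counts reported for the three-phase quality-control pipeline.
candidates = 300
outcomes = {"accepted directly": 161, "revised": 63, "discarded": 76}

# Every candidate ends up in exactly one bucket.
assert sum(outcomes.values()) == candidates

for label, count in outcomes.items():
    print(f"{label}: {count} ({count / candidates:.1%})")

# The final benchmark keeps questions accepted directly or after revision.
final_size = outcomes["accepted directly"] + outcomes["revised"]
print("final benchmark size:", final_size)
```

The kept share is 224 of 300, so roughly three in four candidates survived in some form.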
A defining feature of MM-BrowseComp is its per-question verified checklist, which the authors call an irreducible reasoning checklist [1][2]. The checklist enumerates the minimal, indispensable sequence of search and reasoning steps required to reach the correct answer; annotators were required to ensure every step is necessary and that the full sequence must be completed to derive the answer [1]. The checklist serves as a diagnostic tool, letting evaluators inspect an agent's reasoning trajectory rather than only the final answer, and it makes it possible to distinguish genuine reasoning from a lucky guess: if a model produces the right answer without completing the full checklist, the outcome was likely guessed [1].
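A minimal sketch of how a checklist could separate a reasoned answer from a probable guess for a single attempt. The `ChecklistItem` class, the `classify` function, and the example steps are hypothetical names invented for illustration; they are not the benchmark's data format or the authors' evaluation code.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ChecklistItem:
    description: str
    completed: bool  # did the agent's trajectory cover this step?


def classify(answer_correct: bool, checklist: list[ChecklistItem]) -> str:
    """Label one attempt using the final answer plus the checklist."""
    all_done = all(item.completed for item in checklist)
    if answer_correct and all_done:
        return "reasoned"        # right answer, every required step observed
    if answer_correct:
        return "likely guessed"  # right answer, but required steps are missing
    return "incorrect"


# Hypothetical attempt: the agent found the page but never inspected the image.
steps = [
    ChecklistItem("find the page hosting the photograph", True),
    ChecklistItem("read the detail shown in the photograph", False),
]
print(classify(True, steps))
```

Because every checklist step is meant to be indispensable, a correct answer with an incomplete checklist is treated here as suspicious rather than as partial credit.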
The paper reports three metrics [1]:
| Metric | Definition |
|---|---|
| Overall Accuracy (OA) | Percentage of questions answered correctly, judging the final answer only |
| Strict Accuracy (SA) | A question counts as correct only if the final answer is right and every checklist item is completed |
| Average Checklist Score (AVG CS) | Mean completion rate of the checklist across all questions |
All scores are reported at Pass@1 [1]. Checklist items were further labeled as textual or visual, enabling a modality-specific breakdown of where agents fail [1]. Because of the high compute cost of running full agent frameworks, the open-source agents were evaluated on a 54-instance subset uniformly sampled across subtasks, while the proprietary models and tool-free baselines were run on the full set [1].
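The three metrics and the modality breakdown can be sketched as follows. This is an illustrative reading of the definitions above, not the authors' scoring script: the `Result` record, the function names, and the toy data are assumptions, and pooling checklist items across questions for the modality breakdown is one possible choice that the source does not specify.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Result:
    answer_correct: bool
    # One (modality, completed) pair per checklist item; modality is "text" or "visual".
    checklist: list[tuple[str, bool]]


def overall_accuracy(results: list[Result]) -> float:
    """Share of questions whose final answer is correct."""
    return sum(r.answer_correct for r in results) / len(results)


def strict_accuracy(results: list[Result]) -> float:
    """Share of questions with a correct answer and a fully completed checklist."""
    hits = sum(r.answer_correct and all(done for _, done in r.checklist) for r in results)
    return hits / len(results)


def avg_checklist_score(results: list[Result]) -> float:
    """Mean of the per-question checklist completion rates."""
    rates = [sum(done for _, done in r.checklist) / len(r.checklist) for r in results]
    return sum(rates) / len(rates)


def completion_by_modality(results: list[Result]) -> dict[str, float]:
    """Completion rate of checklist items, pooled per modality (an assumption)."""
    totals: dict[str, tuple[int, int]] = {}
    for r in results:
        for modality, done in r.checklist:
            hit, n = totals.get(modality, (0, 0))
            totals[modality] = (hit + done, n + 1)
    return {m: hit / n for m, (hit, n) in totals.items()}


# Toy data, invented for illustration only.
results = [
    Result(True, [("text", True), ("visual", True)]),
    Result(True, [("text", True), ("visual", False)]),
    Result(False, [("text", True), ("visual", False)]),
    Result(False, [("text", False), ("visual", False)]),
]
print(overall_accuracy(results), strict_accuracy(results), avg_checklist_score(results))
print(completion_by_modality(results))
```

In the toy data, the second question is answered correctly without its visual step, so it counts toward overall accuracy but not strict accuracy.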
The authors evaluated three groups of systems: tool-free VLMs, tool-augmented VLMs (official tool-enabled services), and open-source deep-search agents [1]. Selected results from Table 1 of the paper, reported as Pass@1 on the full 224-question set unless noted, are below [1].
| Model / agent | Group | Overall accuracy | Strict accuracy | Avg checklist score |
|---|---|---|---|---|
| OpenAI o3 (with tools) | Tool-augmented VLM | 29.02% | 19.64% | 36.49% |
| Gemini-2.5-Pro-Preview-05-06 (with tools) | Tool-augmented VLM | 7.14% | 3.57% | 15.21% |
| Gemini-2.5-Flash-Preview-05-20 (with tools) | Tool-augmented VLM | 3.12% | 3.12% | 11.34% |
| GPT-4.1 | Tool-free VLM | 7.59% | 5.36% | 14.68% |
| o4-mini-high | Tool-free VLM | 7.14% | 3.13% | 13.67% |
| Gemini-2.5-Pro-Preview-05-06 | Tool-free VLM | 6.31% | 4.50% | 11.56% |
| o4-mini | Tool-free VLM | 5.36% | 2.23% | 12.41% |
| Llama-4-Maverick-17B-128E-Instruct | Tool-free VLM | 2.68% | 0.45% | 6.09% |
| GPT-4o-2024-11-20 | Tool-free VLM | 1.34% | 0.45% | 4.63% |
| Qwen2.5-VL-72B-Instruct | Tool-free VLM | 0.45% | 0.00% | 3.58% |
| Qwen2.5-VL-7B-Instruct | Tool-free VLM | 0.00% | 0.00% | 0.15% |
Among open-source agents, evaluated on the 54-instance subset, the best performer was Agent-R1, a reflective agent following the ReAct paradigm, which reached an overall accuracy of 5.56% with a Gemini-2.5-Flash backbone; other frameworks such as OWL, DeerFlow, and WebDancer (including the WebDancer-32B model) generally scored lower [1]. The paper notes that open-source agents struggled in part because several lacked dedicated visual tools and relied on image captioning, which loses information [1].
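To give a sense of scale, the reported percentages can be converted back into approximate question counts. The counts below are inferred by rounding and are not stated in the source; `implied_correct` and `step_size` are hypothetical helper names.

```python
def implied_correct(percent: float, n_questions: int) -> int:
    """Nearest whole number of correct answers consistent with a reported percentage."""
    return round(percent / 100 * n_questions)


def step_size(n_questions: int) -> float:
    """Percentage points gained or lost when one more question is answered correctly."""
    return 100 / n_questions


reported = [
    ("OpenAI o3 with tools, full set", 29.02, 224),
    ("Agent-R1, 54-instance subset", 5.56, 54),
]
for name, pct, n in reported:
    k = implied_correct(pct, n)
    print(f"{name}: about {k}/{n} correct; one question = {step_size(n):.2f} points")
```

On the 54-instance subset a single question moves the score by almost two percentage points, so small differences between open-source agents should be read with caution.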
OpenAI o3 was not only the best in the tool-augmented group but the top scorer across every system evaluated [1]. The authors attribute its lead to native multimodal reasoning: rather than calling a separate captioning tool, o3 autonomously wrote and executed code to download a target image to a local file and reloaded it into its context for analysis, letting it perceive visual detail directly during reasoning [1]. By contrast, the Gemini models showed little improvement when given tools, often terminating early for lack of information instead of engaging in multi-step tool use [1].
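The download-and-reload behavior described above might look roughly like the sketch below. It is not the code o3 produced, which the source does not reproduce. The function names are hypothetical, and encoding the file as a base64 data URL is an assumed way of handing an image back to a multimodal model.

```python
import base64
import mimetypes
import urllib.request
from pathlib import Path


def download_image(url: str, dest: Path) -> Path:
    """Save the resource at `url` to a local file and return its path."""
    with urllib.request.urlopen(url, timeout=30) as response:
        dest.write_bytes(response.read())
    return dest


def image_as_data_url(path: Path) -> str:
    """Re-encode a local image so it can be attached to a multimodal prompt."""
    mime, _ = mimetypes.guess_type(path.name)
    payload = base64.b64encode(path.read_bytes()).decode("ascii")
    return f"data:{mime or 'application/octet-stream'};base64,{payload}"


# Usage (placeholder URL, not from the source):
# local = download_image("https://example.com/photo.jpg", Path("photo.jpg"))
# attachment = image_as_data_url(local)
```

The point of the pattern is that the model inspects the original pixels itself instead of reading a caption generated by a separate tool.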
A modality-specific analysis found that most models perform best on textual checklist items and drop sharply on visual items that require image or video understanding [1]. The authors trace this to both weak visual comprehension and a lack of proactive intent to analyze visual content while searching [1]. A test-time scaling study with the Agent-R1 framework (using QwQ-32B for reasoning and Qwen2.5-VL-72B for visual understanding) showed that aggregating 16 independent runs raised overall accuracy but barely moved strict accuracy, which the authors read as evidence that extra sampling buys lucky guesses rather than better reasoning [1].
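The gap between the two metrics under repeated sampling can be illustrated with a toy calculation. The sketch assumes an "any of the first n runs" aggregation rule, which may differ from the procedure used in the paper; the function name and all data are invented.

```python
def any_of_n(runs_per_question, n):
    """Return (overall, strict) accuracy when a question counts as solved
    if any of its first n runs succeeds.

    Each run is a pair (answer_correct, checklist_complete).
    """
    overall = strict = 0
    for runs in runs_per_question:
        window = runs[:n]
        overall += any(correct for correct, _ in window)
        strict += any(correct and complete for correct, complete in window)
    total = len(runs_per_question)
    return overall / total, strict / total


# Toy data: 4 questions, 4 runs each.
runs = [
    [(True, True), (True, True), (False, False), (True, True)],       # solved methodically
    [(False, False), (False, False), (True, False), (False, False)],  # one correct answer, checklist incomplete
    [(False, False), (True, False), (False, False), (False, False)],  # same pattern
    [(False, False)] * 4,                                             # never solved
]
print(any_of_n(runs, 1))
print(any_of_n(runs, 4))
```

With one run the toy agent scores 25% on both metrics; with four runs its overall accuracy rises to 75% while strict accuracy stays at 25%, the same qualitative pattern the authors describe.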
MM-BrowseComp was among the first benchmarks to isolate multimodal deep-search ability, where the input, the intermediate evidence, and the final answer can all demand visual understanding [1]. Its low top scores, with the best system under 30% and most systems in the single digits, showed substantial headroom for multimodal browsing agents at a time when text-only browsing benchmarks were approaching saturation [1]. The verified-checklist design contributed an evaluation idea beyond final-answer accuracy: by recording the minimal reasoning path, it separates methodical reasoning from chance and, the authors suggest, could supply dense reward signals for training agents with reinforcement learning [1]. The benchmark's framing, that high performance requires a synergy of strong reasoning and a complete toolset rather than either alone, has been cited in later work on visual browsing benchmarks [1].
The benchmark is small by design: its 224 questions reflect labor-intensive expert annotation and strict multi-stage filtering. The open-source agents were assessed on only a 54-question subset because of compute cost, which limits the precision of their reported scores [1]. The dataset is deliberately not distributed in plain text, to prevent training-data contamination and answer leakage, so reproduction depends on the released protected materials [1]. Because answers are pinned to authoritative sources at construction time, some items could in principle become stale if those sources change, although the authors added temporal constraints to mitigate this [1]. Finally, the evaluation captures a specific 2025 generation of models and agents: the published leaderboard reflects systems available at that time, such as OpenAI o3 and the Gemini 2.5 preview models. A later expansion of the dataset to additional questions was noted on the project repository [2].