DeepResearch Bench
Last reviewed
May 10, 2026
Sources
12 citations
Review status
Source-backed
Revision
v2 · 2,491 words
| DeepResearch Bench | |
|---|---|
| Overview | |
| Full name | DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents |
| Abbreviation | DRB |
| Description | A benchmark of 100 PhD-level research tasks designed to evaluate Deep Research Agents on multi-step web exploration, retrieval, and report synthesis |
| Initial release | June 13, 2025 (arXiv preprint) |
| Authors | Mingxuan Du, Benfeng Xu, Chiwei Zhu, Xiaorui Wang, Zhendong Mao |
| Organization | University of Science and Technology of China; Metastone Technology |
| Technical Details | |
| Type | Research Agent Evaluation, Report Generation, Long-form Synthesis |
| Modality | Text, Web content |
| Task format | Open-ended research report generation |
| Number of tasks | 100 (50 English + 50 Chinese) |
| Domains | 22 fields, including Science and Technology, Finance and Business, Software, Health, History, Industry, Transportation, Tourism, Art and Design, Entertainment |
| Evaluation frameworks | RACE (report quality) and FACT (citation accuracy and effective citations) |
| Judge LLM | Gemini 2.5 Pro for RACE, Gemini 2.5 Flash for FACT |
| Top RACE score | 48.88 (Gemini 2.5 Pro Deep Research) |
| Top FACT effective citations | 111.21 (Gemini 2.5 Pro Deep Research) |
| Top citation accuracy | 94.04% (Claude 3.5 Sonnet with Search) |
| Resources | |
| Website | Official website |
| Paper | arXiv:2506.11763 |
| GitHub | Ayanami0730/deep_research_bench |
| Leaderboard | Hugging Face Space |
| License | Apache 2.0 |
DeepResearch Bench is a benchmark for evaluating Deep Research Agents, a class of large language model systems that autonomously plan multi-step web searches, gather sources, and write long-form analyst-grade reports. It was introduced in the June 2025 paper DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents by Mingxuan Du, Benfeng Xu, Chiwei Zhu, Xiaorui Wang, and Zhendong Mao, with affiliations at the University of Science and Technology of China and Metastone Technology[1][2]. The benchmark is built around 100 PhD-level tasks (50 in English and 50 in Chinese) that span 22 fields, and it uses two custom evaluation frameworks called RACE and FACT to measure report quality and citation behavior. Code and data are released under Apache 2.0 on GitHub, and a public leaderboard is hosted on Hugging Face Spaces[3][4].
In early 2025, OpenAI, Google DeepMind, Perplexity, and xAI all shipped agents that spend several minutes browsing the open web and then return a structured report with citations: OpenAI Deep Research, Gemini Deep Research, Perplexity Deep Research, and Grok DeepSearch. The category came to be known as Deep Research Agents, or DRAs.
These systems are awkward to evaluate. Standard QA benchmarks like MMLU ask short-answer questions; deep research outputs are long, free-form, and reference-heavy. Search-style benchmarks like GAIA, BrowseComp, and Humanity's Last Exam grade a single final answer rather than the structure and sourcing of a multi-page report[1][5].
The team analyzed 96,147 anonymized queries from a web-search-enabled chatbot and identified 44,019 (about 45.8%) as deep research queries. The 100 benchmark tasks follow that same topical mix so the test set looks like real demand[2][6].
The benchmark consists of 100 prompts, balanced 50/50 between English and Chinese. Each task is a multi-paragraph research request that would normally take a human analyst several hours to complete. The 22 fields cluster into four broad areas: Science and Technology (physics, chemistry, biology, environmental science, engineering), Finance and Business (investing, personal finance, marketing, human resources), Software (software usage, internet topics), and Other (art and design, entertainment, history, industry, transportation, tourism, and additional categories)[6][2].
| Domain cluster | Example fields | Notes |
|---|---|---|
| Science and Technology | Physics, Chemistry, Biology, Environmental science, Engineering | Largest cluster by query volume |
| Finance and Business | Investing, Personal finance, Marketing, Human resources | Second-largest cluster |
| Software | Software usage, Internet | Includes how-to and integration questions |
| Other | Art and Design, Entertainment, History, Industry, Transportation, Tourism | Long tail of high-effort topics |
Tasks were written by more than 100 domain experts, each a PhD holder or senior practitioner with at least five years of relevant experience. Every prompt went through multiple rounds of review, and the bilingual versions were authored locally rather than translated[1][2].
A representative task asks for an analysis, not a fact. Examples include prompts on the state of a research subfield, comparative reviews of competing technologies, market-sizing exercises, and policy analyses that need both academic and news sources. The agent must plan its own browsing, decide what to cite, and produce a report that is both thorough and readable.
DeepResearch Bench introduces two complementary frameworks. RACE grades the quality of the written report. FACT grades how well the agent uses external sources. The two scores are reported separately so that a model can be strong on one axis and weak on the other.
RACE evaluates the report across four dimensions[1][7]: Comprehensiveness, Depth, Instruction-Following, and Readability.
For each task, RACE generates task-specific evaluation criteria using a judge LLM. It then scores the target report against a high-quality reference using:
S_final(R_tgt) = S_int(R_tgt) / (S_int(R_tgt) + S_int(R_ref))
Reference reports were drawn from Gemini 2.5 Pro Deep Research; the relative formulation keeps scores comparable across tasks of different difficulty. Dimension weights shift per task to reflect what the prompt requires.
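As a minimal sketch of the relative formulation, assume the judge returns per-dimension scores on a shared scale and that task-specific weights combine them; the dimension names come from the paper, while the function names and numeric values below are invented for illustration:

```python
# Illustrative sketch of RACE's relative scoring -- not the official code.
# Each report receives per-dimension scores from the judge LLM; task-specific
# weights combine them, and the target is normalized against the reference.

def weighted_score(dim_scores: dict[str, float], weights: dict[str, float]) -> float:
    """Combine per-dimension judge scores using task-specific weights."""
    return sum(weights[d] * dim_scores[d] for d in weights)

def race_final(target: dict[str, float], reference: dict[str, float],
               weights: dict[str, float]) -> float:
    """S_final(R_tgt) = S_int(R_tgt) / (S_int(R_tgt) + S_int(R_ref))."""
    s_tgt = weighted_score(target, weights)
    s_ref = weighted_score(reference, weights)
    return s_tgt / (s_tgt + s_ref)

# Hypothetical judge outputs for one task (numbers invented for illustration).
weights   = {"comprehensiveness": 0.3, "depth": 0.3,
             "instruction_following": 0.2, "readability": 0.2}
target    = {"comprehensiveness": 7.5, "depth": 7.0,
             "instruction_following": 8.0, "readability": 8.5}
reference = {"comprehensiveness": 8.0, "depth": 8.0,
             "instruction_following": 8.0, "readability": 8.0}

print(race_final(target, reference, weights))  # ~0.489, just below the reference
```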
Gemini 2.5 Pro reached 71.33% pairwise agreement with humans and 99.54% Pearson correlation, narrowly beating o4-mini and Claude 3.7 Sonnet, so it was selected as the RACE judge. Average judge cost per task is about $0.13[1].
FACT looks at the citations rather than the prose. The pipeline extracts statement-URL pairs from the report, deduplicates them, retrieves each cited page, and asks a judge LLM whether the page actually supports the statement attributed to it.
From this, FACT reports two main metrics[1][7]: Citation Accuracy (C. Acc.), the share of checked citations whose source supports the claim, and Average Effective Citations (E. Cit.), the average number of supported citations per report.
The split matters. A model that piles on URLs to look thorough can have high E. Cit. but low C. Acc.; a careful model that only cites a few sources can be the opposite. The framework keeps both visible.
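A minimal sketch of how the two metrics could be computed once the judge has labeled each statement-URL pair; the data structure and field names are assumptions made for this example, not the official implementation:

```python
# Illustrative sketch of the two FACT metrics, computed after the judge LLM
# has labeled each extracted statement-URL pair as supported or not.
from dataclasses import dataclass

@dataclass
class Citation:
    statement: str   # claim extracted from the report
    url: str         # source the report cites for that claim
    supported: bool  # judge verdict: does the fetched page back the claim?

def citation_accuracy(citations: list[Citation]) -> float:
    """C. Acc.: share of checked citations whose source supports the claim."""
    return sum(c.supported for c in citations) / len(citations)

def effective_citations(reports: list[list[Citation]]) -> float:
    """E. Cit.: average number of supported citations per report."""
    return sum(sum(c.supported for c in report) for report in reports) / len(reports)
```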
The authors validated RACE against three expert annotators per task on the Chinese subset. RACE Full reached 71.33% pairwise agreement with humans, slightly above the human inter-annotator agreement of 68.44%. A vanilla prompt baseline reached only 58.89%. Removing the reference report dropped the score to 66.56%, an argument for relative scoring against a reference[1].
| RACE configuration | Pairwise agreement | Overall consistency |
|---|---|---|
| RACE (Full) | 71.33% | 72.56% |
| RACE without reference | 66.56% | 68.19% |
| Vanilla prompt | 58.89% | 60.46% |
| Human inter-annotator | 68.44% | n/a |
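Concretely, pairwise agreement asks whether the judge and the human annotators prefer the same report in every head-to-head pair on a task. A short sketch of that computation, under the simplifying assumption noted in the comments:

```python
# Sketch of pairwise agreement on a single task: for every pair of reports,
# check whether the judge and the human annotators prefer the same one.
# Ties are handled naively here (strict ">" on both sides), which is an
# assumption -- the paper may treat them differently.
from itertools import combinations

def pairwise_agreement(judge_scores: dict[str, float],
                       human_scores: dict[str, float]) -> float:
    models = sorted(set(judge_scores) & set(human_scores))
    pairs = list(combinations(models, 2))
    agree = sum(
        (judge_scores[a] > judge_scores[b]) == (human_scores[a] > human_scores[b])
        for a, b in pairs
    )
    return agree / len(pairs)
```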
The original paper reported scores for two groups of systems: end-to-end Deep Research Agents that come as a product, and general LLMs given a search tool. Data collection ran in April and May 2025[1][8].
| Model | Comprehensiveness | Depth | Instruction-Following | Readability | Overall RACE |
|---|---|---|---|---|---|
| Gemini 2.5 Pro Deep Research | 48.53 | 48.50 | 49.18 | 49.44 | 48.88 |
| OpenAI Deep Research | 46.87 | 45.25 | 49.27 | 47.14 | 46.98 |
| Perplexity Deep Research | 40.69 | 39.39 | not reported | not reported | 42.25 |
| Grok Deeper Search | not reported | not reported | not reported | not reported | 40.24 |
| Model | Overall RACE |
|---|---|
| Claude 3.7 Sonnet with Search | 40.67 |
| Perplexity Sonar Reasoning Pro (high) | 40.22 |
| Perplexity Sonar Reasoning (high) | 40.18 |
The spread is informative. Gemini 2.5 Pro Deep Research wins overall, but only by about two RACE points over OpenAI Deep Research, and OpenAI Deep Research itself wins the Instruction-Following dimension at 49.27. The two leading specialized DRAs sit roughly six to nine points above the general LLMs that just have search bolted on, while Perplexity Deep Research and Grok Deeper Search land in roughly the same band as those search-augmented baselines.
The FACT numbers tell a different story[1]:
| Model | Average Effective Citations | Citation Accuracy |
|---|---|---|
| Gemini 2.5 Pro Deep Research | 111.21 | 81.44% |
| OpenAI Deep Research | 40.79 | not reported as top |
| Perplexity Deep Research | 31.26 | 90.24% |
| Gemini 2.5 Pro Grounding | 32.88 | not reported as top |
| Claude 3.5 Sonnet with Search | not reported as top | 94.04% |
| Claude 3.7 Sonnet with Search | not reported as top | 93.68% |
Gemini 2.5 Pro Deep Research cites the most by a wide margin (about 111 supported citations per report) but accuracy is in the low 80s. Anthropic's Claude with search cites less but supports its claims about 94% of the time. Perplexity Deep Research lands in between with about 31 effective citations at 90% accuracy. Volume and accuracy of citations are not the same capability, and a single composite score would hide that[1].
The authors describe a five-step pipeline for building the benchmark: collect real queries from a web-search-enabled chatbot, filter them down to genuine deep research requests, derive the topic distribution across the 22 fields, have domain experts author tasks matching that distribution in both languages, and put every task through multiple rounds of review[1][6].
Human evaluation took roughly 225 person-hours: 50 Chinese tasks, three expert annotators each, about 1.5 hours per task on average[1].
The official implementation is in the public GitHub repository Ayanami0730/deep_research_bench[3]. The harness is written in Python (3.9+) and the README points users at two API keys: a Gemini key for the judge LLM and a Jina Reader key for fetching cited pages. The repository contains the 100 prompts, reference reports, judge prompts for both RACE and FACT, and scripts to evaluate a model from a JSONL file of generated reports. The license is Apache 2.0.
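A hypothetical sketch of packaging one agent's outputs as such a JSONL file is shown below; the field names used here ("id", "prompt", "article") are assumptions for illustration, and the repository README defines the authoritative schema.

```python
# Hypothetical sketch of writing one agent's reports to JSONL for evaluation.
# Field names are assumptions for illustration; see the repository README
# for the authoritative input format.
import json

reports = [
    {
        "id": "task_001",
        "prompt": "Survey the current state of solid-state battery research ...",
        "article": "# Solid-State Batteries in 2025\n\n... full report with citations ...",
    },
]

with open("my_agent.jsonl", "w", encoding="utf-8") as f:
    for record in reports:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```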
Three results stand out[1]: specialized Deep Research Agents clearly lead general LLMs given a search tool on report quality; citation volume and citation accuracy diverge sharply across systems; and performance is fairly stable across the 22 topic domains and the two languages, suggesting the framework is not just measuring one narrow cluster of skills.
DeepResearch Bench was picked up quickly by the research-agent community. By mid-2025, the public leaderboard had added entries for Moonshot AI's Kimi-Researcher, ByteDance's Doubao DeepResearch, Anthropic's Claude-Researcher, and several Chinese enterprise systems including Zhipu Deep Research, LINK-Researcher, and Xiaoyi DeepResearch[4][9]. NVIDIA's AI-Q Blueprint documentation uses DeepResearch Bench as an evaluation reference[10].
A follow-up paper from the same authors, DeepResearch Bench II: Diagnosing Deep Research Agents via Rubrics from Expert Reports, extends the methodology by deriving rubrics from human-written expert reports rather than from a single reference report[11].
The authors are explicit about the limits[1]: RACE scores are relative to reference reports from a single system and depend on an LLM judge, FACT can only verify citations whose pages can still be fetched, and the human validation of RACE covered only the Chinese half of the task set.
Deep Research Agents are one of the more visible product directions in generative AI in 2025. Without a benchmark, every vendor's claim that their agent writes "analyst-grade" reports is unfalsifiable. DeepResearch Bench is the first public benchmark to measure both the prose and the sourcing of these reports, on tasks that look like real user questions.
The split between RACE and FACT is conceptually useful. A high RACE score with a low FACT score is the classic failure mode of a fluent agent that hallucinates citations; a high FACT score with a low RACE score is the failure mode of a careful but boring agent. By keeping the two scores separate, the benchmark forces vendors to be specific about which problem they have actually solved.
The practical lesson from the original paper is more sober than the marketing copy: the strongest agents are close on report quality but quite far apart on whether their citations actually support what the paragraph says[1].