# DeepResearch Bench

> Source: https://aiwiki.ai/wiki/deepresearch_bench
> Updated: 2026-06-09
> Categories: AI Benchmarks
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

*Not to be confused with [Deep Research Bench](/wiki/deep_research_bench), a separate benchmark by FutureSearch.*

| DeepResearch Bench | |
| --- | --- |
| **Overview** | |
| Full name | DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents |
| Abbreviation | DRB |
| Description | A benchmark of 100 PhD-level research tasks designed to evaluate Deep Research Agents on multi-step web exploration, retrieval, and report synthesis |
| Initial release | June 13, 2025 (arXiv preprint) |
| Authors | Mingxuan Du, Benfeng Xu, Chiwei Zhu, Xiaorui Wang, Zhendong Mao |
| Organization | University of Science and Technology of China; Metastone Technology |
| **Technical Details** | |
| Type | Research Agent Evaluation, Report Generation, Long-form Synthesis |
| Modality | Text, Web content |
| Task format | Open-ended research report generation |
| Number of tasks | 100 (50 English + 50 Chinese) |
| Domains | 22 fields, including Science and Technology, Finance and Business, Software, Health, History, Industry, Transportation, Tourism, Art and Design, Entertainment |
| Evaluation frameworks | RACE (report quality) and FACT (citation accuracy and effective citations) |
| Judge LLM | Gemini 2.5 Pro for RACE, Gemini 2.5 Flash for FACT |
| Top RACE score | 48.88 (Gemini 2.5 Pro Deep Research) |
| Top FACT effective citations | 111.21 (Gemini 2.5 Pro Deep Research) |
| Top citation accuracy | 94.04% (Claude 3.5 Sonnet with Search) |
| **Resources** | |
| Website | [Official website](https://deepresearch-bench.github.io/) |
| Paper | [arXiv:2506.11763](https://arxiv.org/abs/2506.11763) |
| GitHub | [Ayanami0730/deep_research_bench](https://github.com/Ayanami0730/deep_research_bench) |
| Leaderboard | [Hugging Face Space](https://huggingface.co/spaces/muset-ai/DeepResearch-Bench-Leaderboard) |
| License | Apache 2.0 |

**DeepResearch Bench** is a benchmark for evaluating [Deep Research Agents](/wiki/deep_research_agents), a class of [large language model](/wiki/large_language_model) systems that autonomously plan multi-step web searches, gather sources, and write long-form analyst-grade reports. It was introduced in the June 2025 paper *DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents* by Mingxuan Du, Benfeng Xu, Chiwei Zhu, Xiaorui Wang, and Zhendong Mao, with affiliations at the University of Science and Technology of China and Metastone Technology[1][2]. The benchmark is built around 100 PhD-level tasks (50 in English and 50 in Chinese) that span 22 fields, and it uses two custom evaluation frameworks called RACE and FACT to measure report quality and citation behavior. Code and data are released under Apache 2.0 on GitHub, and a public leaderboard is hosted on [Hugging Face](/wiki/hugging_face) Spaces[3][4].

## Background and motivation

In early 2025, [OpenAI](/wiki/openai) Deep Research, [Gemini](/wiki/gemini) Deep Research from [Google DeepMind](/wiki/google_deepmind), [Perplexity](/wiki/perplexity) Deep Research, and Grok DeepSearch from [xAI](/wiki/xai) all shipped agents that spend several minutes browsing the open web, then return a structured report with citations. The category came to be known as Deep Research Agents, or DRAs.

These systems are awkward to evaluate. Standard QA benchmarks like [MMLU](/wiki/mmlu) ask short-answer questions; deep research outputs are long, free-form, and reference-heavy. Search-style benchmarks like [GAIA](/wiki/gaia_benchmark), [BrowseComp](/wiki/browsecomp), and [Humanity's Last Exam](/wiki/humanity_s_last_exam) grade a single final answer rather than the structure and sourcing of a multi-page report[1][5].

The team analyzed 96,147 anonymized queries from a web-search-enabled chatbot and identified 44,019 (about 45.8%) as deep research queries. The 100 benchmark tasks follow that same topical mix so the test set looks like real demand[2][6].

## Task design

### Composition

The benchmark consists of 100 prompts, balanced 50/50 between English and Chinese. Each task is a multi-paragraph research request that would normally take a human analyst several hours to complete. The 22 fields cluster into four broad areas: Science and Technology (physics, chemistry, biology, environmental science, engineering), Finance and Business (investing, personal finance, marketing, human resources), Software (software usage, internet topics), and Other (art and design, entertainment, history, industry, transportation, tourism, and additional categories)[6][2].

| Domain cluster | Example fields | Notes |
| --- | --- | --- |
| Science and Technology | Physics, Chemistry, Biology, Environmental science, Engineering | Largest cluster by query volume |
| Finance and Business | Investing, Personal finance, Marketing, Human resources | Second-largest cluster |
| Software | Software usage, Internet | Includes how-to and integration questions |
| Other | Art and Design, Entertainment, History, Industry, Transportation, Tourism | Long tail of high-effort topics |

### Authoring

Tasks were written by more than 100 domain experts, each a PhD holder or senior practitioner with at least five years of relevant experience. Every prompt went through multiple rounds of review, and the English and Chinese prompts were written natively in each language rather than translated[1][2].

### Style of prompts

A representative task asks for an analysis, not a fact. Examples include prompts on the state of a research subfield, comparative reviews of competing technologies, market sizing exercises, and policy analyses that need both academic and news sources. The agent must plan its own browsing, decide what to cite, and produce a usable, readable report.

## Evaluation frameworks

DeepResearch Bench introduces two complementary frameworks. RACE grades the quality of the written report. FACT grades how well the agent uses external sources. The two scores are reported separately so that a model can be strong on one axis and weak on the other.

### RACE: Reference-based Adaptive Criteria-driven Evaluation

RACE evaluates the report across four dimensions[1][7]:

1. Comprehensiveness: how much of the relevant scope is covered.
2. Insight or Depth: whether the analysis goes beyond surface description.
3. Instruction-Following: how well the report obeys the prompt's constraints.
4. Readability: structure, organization, and clarity for a human reader.

For each task, RACE generates task-specific evaluation criteria using a judge LLM. It then scores the target report against a high-quality reference using:

```
S_final(R_tgt) = S_int(R_tgt) / (S_int(R_tgt) + S_int(R_ref))
```

Reference reports were drawn from Gemini 2.5 Pro Deep Research; the relative formulation keeps scores comparable across tasks of different difficulty. Dimension weights shift per task to reflect what the prompt requires.
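The relative formulation can be made concrete with a short sketch. It assumes, as a simplification not spelled out in this article, that the intermediate score `S_int` is a weighted average of the four dimension scores. The function names, weights, and raw scores below are invented for illustration and are not taken from the official harness.

```python
from typing import Dict

DIMENSIONS = ("comprehensiveness", "insight", "instruction_following", "readability")


def intermediate_score(dim_scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of per-dimension scores (assumed form of S_int)."""
    total_weight = sum(weights[d] for d in DIMENSIONS)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weights[d] * dim_scores[d] for d in DIMENSIONS) / total_weight


def race_final_score(s_int_target: float, s_int_reference: float) -> float:
    """S_final = S_int(target) / (S_int(target) + S_int(reference))."""
    denominator = s_int_target + s_int_reference
    if denominator <= 0:
        raise ValueError("intermediate scores must sum to a positive value")
    return s_int_target / denominator


# Hypothetical task-specific weights and judge scores on a 0-10 scale.
weights = {"comprehensiveness": 0.30, "insight": 0.30,
           "instruction_following": 0.25, "readability": 0.15}
target_scores = {"comprehensiveness": 6, "insight": 5,
                 "instruction_following": 8, "readability": 7}
reference_scores = {"comprehensiveness": 7, "insight": 7,
                    "instruction_following": 8, "readability": 8}

s_tgt = intermediate_score(target_scores, weights)
s_ref = intermediate_score(reference_scores, weights)
final = race_final_score(s_tgt, s_ref)
print(f"RACE-style score: {100 * final:.2f}")
```

Under this formula a report that matches the reference on every dimension scores exactly 0.5. If the published numbers are this ratio scaled by 100, values just under 50 indicate near-parity with the reference report.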

Gemini 2.5 Pro reached 71.33% pairwise agreement with humans and 99.54% Pearson correlation, narrowly beating o4-mini and Claude 3.7 Sonnet, so it was selected as the RACE judge. Average judge cost per task is about $0.13[1].

### FACT: Framework for Factual Abundance and Citation Trustworthiness

FACT looks at the citations rather than the prose. The pipeline:

1. Extract every (statement, URL) pair from the report.
2. Deduplicate pairs that describe the same fact with different wording.
3. Pull each cited webpage through the Jina Reader API.
4. Have a judge LLM (Gemini 2.5 Flash) decide for each pair whether the source actually supports the statement (binary support or not support).
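The four steps above can be sketched as a small pipeline. This is an illustrative skeleton, not the official implementation: `fetch_page` and `is_supported` are hypothetical callables standing in for the Jina Reader call and the Gemini 2.5 Flash judge, and the deduplication here only catches exact matches after whitespace and case normalization, whereas the real framework also merges differently worded statements.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class JudgedCitation:
    statement: str
    url: str
    supported: bool


def normalize(statement: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(statement.lower().split())


def dedupe_pairs(pairs: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Keep the first occurrence of each (normalized statement, URL) pair."""
    seen = set()
    unique = []
    for statement, url in pairs:
        key = (normalize(statement), url.strip())
        if key not in seen:
            seen.add(key)
            unique.append((statement, url.strip()))
    return unique


def judge_report(
    pairs: Iterable[Tuple[str, str]],
    fetch_page: Callable[[str], str],
    is_supported: Callable[[str, str], bool],
) -> List[JudgedCitation]:
    """Deduplicate pairs, fetch each cited page once, and record a binary verdict."""
    page_cache: Dict[str, str] = {}
    judged = []
    for statement, url in dedupe_pairs(pairs):
        if url not in page_cache:
            page_cache[url] = fetch_page(url)
        judged.append(JudgedCitation(statement, url, is_supported(statement, page_cache[url])))
    return judged


# Stub page store and a naive substring "judge", for demonstration only.
pages = {"https://a.example": "the sky is blue.", "https://b.example": "cats meow."}
extracted = [
    ("The sky is blue.", "https://a.example"),
    ("the  sky is BLUE.", "https://a.example"),  # duplicate after normalization
    ("Cats bark.", "https://b.example"),
]
verdicts = judge_report(extracted, pages.__getitem__, lambda s, text: normalize(s) in text)
print([(v.url, v.supported) for v in verdicts])
```

Passing the fetcher and judge in as arguments keeps the skeleton testable offline; a real run would swap in network-backed implementations.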

From this, FACT reports two main metrics[1][7]:

- Citation Accuracy (C. Acc.): fraction of (statement, URL) pairs that are supported.
- Average Effective Citations (E. Cit.): mean number of supported citations per task.

The split matters. A model that piles on URLs to look thorough can have high E. Cit. but low C. Acc.; a careful model that cites only a few sources can show the opposite pattern. The framework keeps both visible.
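A toy calculation shows how the two metrics can diverge. The sketch below pools all pairs across tasks to compute accuracy, which is an assumption about aggregation (the paper may instead average per-task accuracies); the verdict lists are invented.

```python
from typing import Sequence, Tuple


def fact_metrics(per_task_verdicts: Sequence[Sequence[bool]]) -> Tuple[float, float]:
    """Return (citation_accuracy, average_effective_citations).

    Each inner sequence holds the binary support verdicts for one task's
    deduplicated (statement, URL) pairs.
    """
    if not per_task_verdicts:
        raise ValueError("need at least one task")
    total_pairs = sum(len(v) for v in per_task_verdicts)
    total_supported = sum(sum(v) for v in per_task_verdicts)
    citation_accuracy = total_supported / total_pairs if total_pairs else 0.0
    avg_effective = total_supported / len(per_task_verdicts)
    return citation_accuracy, avg_effective


# A prolific agent: many citations, a quarter of them unsupported.
prolific = [[True] * 8 + [False] * 4, [True] * 10 + [False] * 2]
# A careful agent: few citations, almost all supported.
careful = [[True] * 3, [True, True, False]]

prolific_acc, prolific_eff = fact_metrics(prolific)
careful_acc, careful_eff = fact_metrics(careful)
print(f"prolific: C. Acc. {prolific_acc:.2%}, E. Cit. {prolific_eff:.1f}")
print(f"careful:  C. Acc. {careful_acc:.2%}, E. Cit. {careful_eff:.1f}")
```

In this made-up case the prolific agent wins on effective citations while the careful agent wins on accuracy, which is the trade-off the two metrics are meant to expose.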

### Human alignment

The authors validated RACE against three expert annotators per task on the Chinese subset. RACE Full reached 71.33% pairwise agreement with humans, slightly above the human inter-annotator agreement of 68.44%. A vanilla prompt baseline reached only 58.89%. Removing the reference report dropped the score to 66.56%, an argument for relative scoring against a reference[1].

| RACE configuration | Pairwise agreement | Overall consistency |
| --- | --- | --- |
| RACE (Full) | 71.33% | 72.56% |
| RACE without reference | 66.56% | 68.19% |
| Vanilla prompt | 58.89% | 60.46% |
| Human inter-annotator | 68.44% | n/a |
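Pairwise agreement can be read as the share of report pairs on which the judge and the human annotators prefer the same report. The sketch below follows that reading. The paper's exact tie handling and its averaging across tasks are not described in this article, so skipping pairs where humans are tied is an assumption, and the scores are invented.

```python
from itertools import combinations
from typing import Sequence


def pairwise_agreement(human: Sequence[float], judge: Sequence[float]) -> float:
    """Fraction of report pairs ranked in the same order by humans and the judge."""
    if len(human) != len(judge):
        raise ValueError("score lists must have the same length")
    agree = 0
    total = 0
    for i, j in combinations(range(len(human)), 2):
        human_diff = human[i] - human[j]
        judge_diff = judge[i] - judge[j]
        if human_diff == 0:
            continue  # assumption: pairs without a human preference are skipped
        total += 1
        if human_diff * judge_diff > 0:
            agree += 1
    if total == 0:
        raise ValueError("no comparable pairs")
    return agree / total


# Hypothetical scores for four reports on one task.
human_scores = [8, 6, 7, 5]
judge_scores = [0.52, 0.47, 0.45, 0.40]
agreement = pairwise_agreement(human_scores, judge_scores)
print(f"pairwise agreement: {agreement:.2%}")
```

Here the judge disagrees with the humans only on the second and third reports, so five of the six pairs agree.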

## Leaderboard at release

The original paper reported scores for two groups of systems: end-to-end Deep Research Agents that come as a product, and general LLMs given a search tool. Data collection ran in April and May 2025[1][8].

### Deep Research Agents (RACE)

| Model | Comprehensiveness | Depth | Instruction-Following | Readability | Overall RACE |
| --- | --- | --- | --- | --- | --- |
| Gemini 2.5 Pro Deep Research | 48.53 | 48.50 | 49.18 | 49.44 | 48.88 |
| OpenAI Deep Research | 46.87 | 45.25 | 49.27 | 47.14 | 46.98 |
| Perplexity Deep Research | 40.69 | 39.39 | not reported | not reported | 42.25 |
| Grok Deeper Search | not reported | not reported | not reported | not reported | 40.24 |

### LLMs with web search

| Model | Overall RACE |
| --- | --- |
| Claude 3.7 Sonnet with Search | 40.67 |
| Perplexity Sonar Reasoning Pro (high) | 40.22 |
| Perplexity Sonar Reasoning (high) | 40.18 |

The spread is informative. Gemini 2.5 Pro Deep Research wins overall, but only by about two RACE points over OpenAI Deep Research, and OpenAI Deep Research itself wins the Instruction-Following dimension at 49.27. The two leading DRAs sit roughly six to nine points above general LLMs that just have search bolted on, while Perplexity Deep Research and Grok Deeper Search land within about two points of that group.
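The gaps can be checked directly from the two tables above. A minimal sketch using only the overall RACE scores quoted in this section:

```python
# Overall RACE scores copied from the tables in this section.
dra_scores = {
    "Gemini 2.5 Pro Deep Research": 48.88,
    "OpenAI Deep Research": 46.98,
    "Perplexity Deep Research": 42.25,
    "Grok Deeper Search": 40.24,
}
search_llm_scores = {
    "Claude 3.7 Sonnet with Search": 40.67,
    "Perplexity Sonar Reasoning Pro (high)": 40.22,
    "Perplexity Sonar Reasoning (high)": 40.18,
}

# Compare every DRA against the strongest search-augmented LLM.
best_llm_name, best_llm_score = max(search_llm_scores.items(), key=lambda kv: kv[1])
gaps = {name: round(score - best_llm_score, 2) for name, score in dra_scores.items()}

for name, gap in gaps.items():
    print(f"{name}: {gap:+.2f} vs {best_llm_name}")
```

Measured against Claude 3.7 Sonnet with Search, the two leading agents are ahead by 8.21 and 6.31 points, Perplexity Deep Research by 1.58, and Grok Deeper Search trails by 0.43.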

### Citation behavior (FACT)

The FACT numbers tell a different story[1]:

| Model | Average Effective Citations | Citation Accuracy |
| --- | --- | --- |
| Gemini 2.5 Pro Deep Research | 111.21 | 81.44% |
| OpenAI Deep Research | 40.79 | not listed here |
| Perplexity Deep Research | 31.26 | 90.24% |
| Gemini 2.5 Pro Grounding | 32.88 | not listed here |
| Claude 3.5 Sonnet with Search | not listed here | 94.04% |
| Claude 3.7 Sonnet with Search | not listed here | 93.68% |

Gemini 2.5 Pro Deep Research cites the most by a wide margin (about 111 supported citations per report) but accuracy is in the low 80s. [Anthropic](/wiki/anthropic)'s Claude with search cites less but supports its claims about 94% of the time. Perplexity Deep Research lands in between with about 31 effective citations at 90% accuracy. Volume and accuracy of citations are not the same capability, and a single composite score would hide that[1].

## Construction process

The authors describe a five-step pipeline for building the benchmark[1][6]:

1. Query analysis: 96,147 raw user queries from a web-search-enabled chatbot were collected.
2. Deep research filtering: 44,019 queries (about 45.8%) were classified as deep research.
3. Topic clustering: deep research queries were grouped into 22 fields, with Science and Technology and Finance and Business carrying the largest weight.
4. Expert authoring: 100+ PhD-level or senior-practitioner contributors wrote the 100 prompts to match the field distribution.
5. Bilingual authoring: the English and Chinese prompts were written natively in each language rather than translated.
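Matching 100 prompts to an observed field distribution (step 4) is a proportional allocation problem. The sketch below uses the largest-remainder method, which is a swapped-in technique chosen for illustration; the article does not say how the authors rounded field quotas, and the query counts are invented.

```python
from typing import Dict


def allocate_tasks(field_counts: Dict[str, int], n_tasks: int) -> Dict[str, int]:
    """Split n_tasks across fields in proportion to query counts (largest remainder)."""
    total = sum(field_counts.values())
    if total <= 0:
        raise ValueError("field counts must sum to a positive value")
    quotas = {field: n_tasks * count / total for field, count in field_counts.items()}
    allocation = {field: int(quota) for field, quota in quotas.items()}
    leftover = n_tasks - sum(allocation.values())
    # Hand the remaining slots to the fields with the largest fractional parts.
    by_remainder = sorted(quotas, key=lambda f: quotas[f] - allocation[f], reverse=True)
    for field in by_remainder[:leftover]:
        allocation[field] += 1
    return allocation


# Hypothetical deep research query counts per cluster.
query_counts = {
    "Science and Technology": 420,
    "Finance and Business": 310,
    "Software": 150,
    "Other": 125,
}
task_allocation = allocate_tasks(query_counts, 100)
print(task_allocation)
```

The allocation always sums to the requested number of tasks, so no field quota is lost to rounding.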

Human evaluation took roughly 225 person-hours: 50 Chinese tasks, three expert annotators each, about 1.5 hours per task on average[1].

## Implementation

The official implementation is in the public GitHub repository `Ayanami0730/deep_research_bench`[3]. The harness is written in Python (3.9+) and the README points users at two API keys: a Gemini key for the judge LLM and a Jina Reader key for fetching cited pages. The repository contains the 100 prompts, reference reports, judge prompts for both RACE and FACT, and scripts to evaluate a model from a JSONL file of generated reports. The license is Apache 2.0.
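Preparing a JSONL file of generated reports might look like the following sketch. The field names `id`, `prompt`, and `article` are assumptions made for illustration; check the repository README for the exact schema and file location the scripts expect.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, Iterable, List


def write_reports(path: Path, reports: Iterable[Dict[str, object]]) -> None:
    """Write one JSON object per line; keep non-ASCII text (e.g. Chinese) readable."""
    with path.open("w", encoding="utf-8") as f:
        for record in reports:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_reports(path: Path) -> List[Dict[str, object]]:
    """Load a JSONL file back into a list of dicts, skipping blank lines."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical records: one English task and one Chinese task.
reports = [
    {"id": 1, "prompt": "Survey the state of solid-state batteries.", "article": "# Report\n..."},
    {"id": 2, "prompt": "分析新能源汽车市场。", "article": "# 报告\n..."},
]

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "my_agent.jsonl"
    write_reports(out_path, reports)
    loaded = read_reports(out_path)

print(len(loaded), "reports round-tripped")
```

Writing with `ensure_ascii=False` and UTF-8 encoding matters for a bilingual benchmark, since it keeps the Chinese prompts and reports human-readable in the file.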

## What the paper found

Three results stand out[1]:

- The strongest commercial DRAs are close on report quality but separated on citation behavior. Gemini 2.5 Pro Deep Research leads on RACE and citation volume; Claude with search leads on citation accuracy.
- General LLMs with search tools sit below the leading DRAs on RACE. The gap from 48.88 (Gemini 2.5 Pro Deep Research) to 40.67 (Claude 3.7 Sonnet with Search) is 8.21 points, or about 20% relative to the lower score.
- RACE meets or exceeds human inter-annotator agreement (71.33% vs 68.44% pairwise), the bar for using a synthetic judge.

Performance is fairly stable across the 22 topic domains and the two languages, suggesting the framework is not just measuring a single narrow cluster of skills.

## Reception and follow-on work

DeepResearch Bench was picked up quickly by the research-agent community. By mid-2025, the public leaderboard had added entries for [Moonshot AI](/wiki/moonshot_ai)'s Kimi-Researcher, ByteDance's Doubao DeepResearch, Anthropic's Claude-Researcher, and several Chinese enterprise systems including Zhipu Deep Research, LINK-Researcher, and Xiaoyi DeepResearch[4][9]. NVIDIA's AI-Q Blueprint documentation uses DeepResearch Bench as an evaluation reference[10].

A follow-up paper from the same authors, *DeepResearch Bench II: Diagnosing Deep Research Agents via Rubrics from Expert Report*, extends the methodology by deriving rubrics from human-written expert reports rather than from a single reference report[11].

## Limitations

The authors are explicit about the limits[1]:

- Scale: 100 tasks is small, constrained by the cost of expert authoring and evaluation bandwidth.
- Coverage: 22 fields cover everyday research, but specialized domains (clinical medicine, advanced mathematics) are underrepresented.
- Languages: only English and Chinese; French, German, Spanish, and Japanese are not covered.
- Reference bias: using Gemini 2.5 Pro to write reference reports could nudge scoring toward agents sharing its style, although OpenAI Deep Research with a different style still scores within two points.
- Judge cost: $0.13 per task per judge call adds up across a leaderboard.

## Why it matters

Deep Research Agents are one of the more visible product directions in [generative AI](/wiki/generative_ai) in 2025. Without a benchmark, every vendor's claim that their agent writes "analyst-grade" reports is unfalsifiable. DeepResearch Bench is the first public benchmark to measure both the prose and the sourcing of these reports, on tasks that look like real user questions.

The split between RACE and FACT is conceptually useful. A high RACE score with a low FACT score is the classic failure mode of a fluent agent that hallucinates citations; a high FACT score with a low RACE score is the failure mode of a careful but boring agent. By keeping the two scores separate, the benchmark forces vendors to be specific about which problem they have actually solved.

The practical lesson from the original paper is more sober than the marketing copy: the strongest agents are close on report quality but quite far apart on whether their citations actually support what the paragraph says[1].

## See also

- [Deep Research Agents](/wiki/deep_research_agents)
- [GAIA](/wiki/gaia_benchmark)
- [BrowseComp](/wiki/browsecomp)
- [Humanity's Last Exam](/wiki/humanity_s_last_exam)
- [OpenAI Deep Research](/wiki/openai_deep_research)
- [Gemini Deep Research](/wiki/gemini_deep_research)
- [Perplexity Deep Research](/wiki/perplexity_deep_research)
- [Grok DeepSearch](/wiki/grok_deepsearch)
- [LLM-as-a-Judge](/wiki/llm_as_a_judge)
- [Retrieval-Augmented Generation](/wiki/retrieval_augmented_generation)

## References

1. Du, M., Xu, B., Zhu, C., Wang, X., and Mao, Z. (2025). "DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents." arXiv:2506.11763. https://arxiv.org/abs/2506.11763
2. DeepResearch Bench official website. https://deepresearch-bench.github.io/
3. Ayanami0730. "deep_research_bench" (GitHub repository, Apache 2.0). https://github.com/Ayanami0730/deep_research_bench
4. muset-ai. "DeepResearch Bench Leaderboard" (Hugging Face Space). https://huggingface.co/spaces/muset-ai/DeepResearch-Bench-Leaderboard
5. Hugging Face Papers. "DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents." https://huggingface.co/papers/2506.11763
6. HyperAI Datasets. "DeepResearch Bench dataset entry." https://hyper.ai/en/datasets/40910
7. OpenReview. "DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents" submission page. https://openreview.net/forum?id=hQ0K2Hhq7H
8. Semantic Scholar. "DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents." https://www.semanticscholar.org/paper/DeepResearch-Bench:-A-Comprehensive-Benchmark-for-Du-Xu/cca73506ab839718879a49ccce389d33907aa053
9. ResearchGate. "DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents." https://www.researchgate.net/publication/392717336_DeepResearch_Bench_A_Comprehensive_Benchmark_for_Deep_Research_Agents
10. NVIDIA. "Deep Research Bench Evaluation of NVIDIA AI-Q Blueprint." https://docs.nvidia.com/aiq-blueprint/1.2.1/evaluation/benchmarks/deep-research-bench.html
11. ResearchGate. "DeepResearch Bench II: Diagnosing Deep Research Agents via Rubrics from Expert Report." https://www.researchgate.net/publication/399755315_DeepResearch_Bench_II_Diagnosing_Deep_Research_Agents_via_Rubrics_from_Expert_Report
12. Alici.AI. "Top 10 Deep Research Agents in 2025." https://alici.ai/blog/top-deep-research-agents-2025