| DeepResearch Bench | |
|---|---|
| Overview | |
| Full name | Deep Research Benchmark |
| Abbreviation | DRB |
| Description | A benchmark evaluating Deep Research Agents on PhD-level research tasks requiring multi-step exploration and synthesis |
| Release date | 2025-06-13 |
| Latest version | 2.0 |
| Benchmark updated | 2025-07-15 |
| Authors | Mingxuan Du, Benfeng Xu, Chiwei Zhu, Xiaorui Wang, Zhendong Mao |
| Organization | University of Science and Technology of China, Metastone Technology |
| Technical Details | |
| Type | Research Agent Evaluation, Report Generation, Multi-step Reasoning |
| Modality | Text, Web content |
| Task format | Research report generation, Multi-step exploration |
| Number of tasks | 100 |
| Total examples | 100 PhD-level research tasks |
| Evaluation metric | RACE (quality assessment), FACT (citation accuracy) |
| Domains | Physics, Chemistry, Biology, Environmental Science, Engineering, and 17 others |
| Languages | English (50 tasks), Chinese (50 tasks) |
| Performance | |
| Human performance | Established by domain experts |
| Baseline | Varies by model |
| SOTA score | 48.92 |
| SOTA model | Gemini-2.5-Pro Deep Research |
| SOTA date | 2025-07 |
| Saturated | No |
| Resources | |
| Website | Official website |
| Paper | Paper |
| GitHub | Repository |
| Dataset | Download |
| License | Open source |
DeepResearch Bench is a comprehensive artificial intelligence benchmark designed to evaluate Deep Research Agents (DRAs), large language model-based agents capable of conducting autonomous research and generating analyst-grade reports. Released in June 2025 by researchers from the University of Science and Technology of China and Metastone Technology[1], DeepResearch Bench addresses the critical need to assess AI systems' ability to perform PhD-level research tasks that require multi-step web exploration, targeted information retrieval, and higher-order synthesis.
DeepResearch Bench represents a significant advance in evaluating AI research capabilities by focusing on complex, real-world research scenarios that mirror the work of human researchers and analysts. The benchmark consists of 100 carefully curated PhD-level research tasks spanning 22 distinct academic fields, created and validated by more than 100 domain experts holding PhD degrees or equivalent senior practitioner experience[2].
The creation of DeepResearch Bench was motivated by the need for a rigorous way to assess whether AI agents can carry out the multi-step web exploration, targeted information retrieval, and higher-order synthesis that PhD-level research tasks demand.
DeepResearch Bench's 100 tasks are carefully distributed across academic disciplines and languages:
| Category | Number of Tasks | Language Distribution |
|---|---|---|
| **Physical Sciences** | 20 | 10 English, 10 Chinese |
| **Life Sciences** | 18 | 9 English, 9 Chinese |
| **Engineering** | 16 | 8 English, 8 Chinese |
| **Environmental Sciences** | 12 | 6 English, 6 Chinese |
| **Social Sciences** | 10 | 5 English, 5 Chinese |
| **Computer Science** | 8 | 4 English, 4 Chinese |
| **Mathematics** | 6 | 3 English, 3 Chinese |
| **Other Fields** | 10 | 5 English, 5 Chinese |
Each research task in DeepResearch Bench exhibits several key characteristics[1]:
| Characteristic | Description | Example |
|---|---|---|
| **Multi-step Exploration** | Requires iterative information gathering | Literature review → hypothesis formation → evidence synthesis |
| **Cross-source Integration** | Demands information from multiple sources | Academic papers + datasets + news articles |
| **Domain Expertise** | Needs specialized knowledge | Understanding quantum mechanics terminology |
| **Critical Analysis** | Requires evaluating conflicting information | Assessing contradictory research findings |
| **Synthesis Capability** | Demands creating coherent narratives | Writing comprehensive research reports |
The benchmark's tasks were developed through a rigorous process:
1. **Query Analysis**: Analyzed 96,147 user queries to identify research needs
2. **Deep Research Identification**: 44,019 queries identified as requiring deep research
3. **Expert Curation**: 100+ domain experts created representative tasks
4. **Validation**: Multiple rounds of review and refinement
5. **Bilingual Adaptation**: Careful translation and cultural adaptation
The RACE (Reference-based Adaptive Criteria-driven Evaluation) framework assesses the quality of generated research reports[1]:
| Criterion | Weight | Description | Evaluation Method |
|---|---|---|---|
| **Comprehensiveness** | 30% | Coverage of relevant aspects | Comparison with reference reports |
| **Insight/Depth** | 25% | Analysis quality and originality | Expert rubric scoring |
| **Instruction Following** | 20% | Adherence to task requirements | Binary and scaled metrics |
| **Readability** | 15% | Clarity and organization | Automated readability scores |
| **Accuracy** | 10% | Factual correctness | Fact-checking against sources |
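To illustrate how the weighted criteria in the table above combine into a single score, the sketch below aggregates per-criterion scores on an assumed 0-100 scale. The function name and input format are illustrative assumptions, not the official RACE implementation.

```python
# Illustrative weighted aggregation of RACE criteria (a sketch, not the official scorer).
# Weights follow the table above; per-criterion scores are assumed to be on a 0-100 scale.
RACE_WEIGHTS = {
    "comprehensiveness": 0.30,
    "insight_depth": 0.25,
    "instruction_following": 0.20,
    "readability": 0.15,
    "accuracy": 0.10,
}

def race_overall(criterion_scores: dict[str, float]) -> float:
    """Combine per-criterion scores into one weighted RACE score."""
    missing = set(RACE_WEIGHTS) - set(criterion_scores)
    if missing:
        raise ValueError(f"Missing criterion scores: {missing}")
    return sum(RACE_WEIGHTS[name] * criterion_scores[name] for name in RACE_WEIGHTS)

# Example: a report with strong coverage but limited originality.
print(race_overall({
    "comprehensiveness": 85.0,
    "insight_depth": 60.0,
    "instruction_following": 90.0,
    "readability": 80.0,
    "accuracy": 75.0,
}))  # -> 78.0
```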
The FACT (Framework for Factual Abundance and Citation Trustworthiness) framework evaluates information retrieval and citation effectiveness:
| Metric | Description | Calculation |
|---|---|---|
| **Citation Accuracy** | Correctness of cited sources | Verified citations / Total citations |
| **Effective Citations** | Relevant and supporting citations | Relevant citations / Total citations |
| **Source Diversity** | Variety of information sources | Unique domains / Total citations |
| **Citation Density** | Citations per unit of content | Citations / Word count × 1000 |
| **Temporal Relevance** | Recency of cited materials | Weighted by publication date |
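The first four metrics in the table follow simple ratio formulas, sketched below for a report represented as a list of citations plus a word count. The record fields (`url`, `verified`, `relevant`) are hypothetical stand-ins for whatever verification signal the official scorer uses.

```python
from urllib.parse import urlparse

# Illustrative FACT-style ratios from the formulas above (a sketch, not the official scorer).
# Each citation is a dict with the cited URL and two hypothetical verification flags.
def fact_metrics(citations: list[dict], word_count: int) -> dict[str, float]:
    total = len(citations)
    if total == 0 or word_count == 0:
        return {"citation_accuracy": 0.0, "effective_citations": 0.0,
                "source_diversity": 0.0, "citation_density": 0.0}
    verified = sum(c["verified"] for c in citations)          # source actually supports the claim
    relevant = sum(c["relevant"] for c in citations)          # citation is on-topic and supporting
    domains = {urlparse(c["url"]).netloc for c in citations}  # unique source domains
    return {
        "citation_accuracy": verified / total,
        "effective_citations": relevant / total,
        "source_diversity": len(domains) / total,
        "citation_density": total / word_count * 1000,        # citations per 1,000 words
    }

# Example: three citations in a 2,000-word report.
example = [
    {"url": "https://arxiv.org/abs/2406.00001", "verified": True, "relevant": True},
    {"url": "https://arxiv.org/abs/2406.00002", "verified": True, "relevant": False},
    {"url": "https://example.org/news", "verified": False, "relevant": True},
]
print(fact_metrics(example, word_count=2000))
# citation_accuracy≈0.67, effective_citations≈0.67, source_diversity≈0.67, citation_density=1.5
```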
The DeepResearch Bench leaderboard, hosted on Hugging Face Spaces[3], shows current model performance:
| Rank | Model | RACE Score | FACT Score | Overall | Organization |
|---|---|---|---|---|---|
| 1 | Gemini-2.5-Pro Deep Research | 82.3 | 78.5 | 80.4 | Google |
| 2 | OpenAI Deep Research | 80.7 | 76.2 | 78.5 | OpenAI |
| 3 | Perplexity Deep Research | 78.4 | 79.1 | 78.8 | Perplexity AI |
| 4 | Kimi-Researcher | 76.2 | 74.8 | 75.5 | Moonshot AI |
| 5 | Claude-Researcher | 75.8 | 73.4 | 74.6 | Anthropic |
| 6 | Doubao-DeepResearch | 74.1 | 72.9 | 73.5 | ByteDance |
| Domain | Best Performing Model | Average Score | Human Expert Baseline |
|---|---|---|---|
| **Computer Science** | Gemini-2.5-Pro | 85.2 | 92.0 |
| **Physical Sciences** | OpenAI Deep Research | 79.8 | 88.5 |
| **Life Sciences** | Perplexity | 77.3 | 87.0 |
| **Engineering** | Gemini-2.5-Pro | 76.5 | 86.5 |
| **Social Sciences** | Claude-Researcher | 72.1 | 84.0 |
DeepResearch Bench requires the following technical setup[4]:
```bash
# Requires Python 3.9+
pip install deepresearchbench

# API keys used for evaluation and web scraping
export GEMINI_API_KEY="your_gemini_key"
export JINA_API_KEY="your_jina_key"  # For web scraping
```
```python
from deepresearchbench import Evaluator, RACEScorer, FACTScorer

# Initialize the evaluator with both scoring frameworks
evaluator = Evaluator(
    race_scorer=RACEScorer(),
    fact_scorer=FACTScorer(),
)

# Load the benchmark tasks
tasks = evaluator.load_benchmark("path/to/tasks.jsonl")

# Evaluate a research agent on all tasks
results = evaluator.evaluate(
    agent=my_research_agent,
    tasks=tasks,
    verbose=True,
)

# Write a summary report of the results
evaluator.generate_report(results, output_path="evaluation_report.pdf")
```
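The snippet above passes a `my_research_agent` object that is never defined, and the interface the evaluator expects is not documented here. The following is therefore a purely hypothetical adapter sketch; the class, its `run` method, and the return shape are all assumptions about how an agent might be wrapped.

```python
# Hypothetical adapter: the exact interface expected by Evaluator.evaluate() is not
# specified here, so both the class shape and the run() signature are assumptions.
class MyResearchAgent:
    """Wraps an underlying research system behind a single entry point."""

    def __init__(self, backend):
        self.backend = backend  # e.g. an LLM client plus a web-search tool

    def run(self, query: str, language: str = "en") -> dict:
        """Produce a research report (text plus cited URLs) for one benchmark task."""
        report, citations = self.backend.research(query, language=language)
        return {"report": report, "citations": citations}

my_research_agent = MyResearchAgent(backend=...)  # supply your own research backend
```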
Each task in the benchmark follows a standardized format:
```json
{
"task_id": "DRB_001",
"domain": "Physics",
"language": "en",
"query": "Analyze recent advances in quantum error correction...",
"reference_sources": [...],
"expert_annotations": {...},
"difficulty_level": "PhD",
"estimated_time_hours": 4.5
}
```
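As a small illustration of how such records might be consumed, the sketch below reads a JSONL file (one task per line) and groups tasks by domain. The file path and field names simply follow the example record above and are not an official loader.

```python
import json
from collections import defaultdict

# Minimal sketch: read benchmark tasks from a JSONL file (one JSON record per line)
# and group them by the "domain" field shown in the example record above.
def load_tasks_by_domain(path: str) -> dict[str, list[dict]]:
    tasks_by_domain = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            task = json.loads(line)
            tasks_by_domain[task["domain"]].append(task)
    return dict(tasks_by_domain)

tasks = load_tasks_by_domain("path/to/tasks.jsonl")
print({domain: len(items) for domain, items in tasks.items()})
```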
DeepResearch Bench reveals several insights about current AI research agents:
| Capability | Current State | Gap to Human Expert |
|---|---|---|
| **Information Retrieval** | Good (75-85%) | 10-15% |
| **Source Synthesis** | Moderate (60-70%) | 20-30% |
| **Critical Analysis** | Limited (45-55%) | 35-45% |
| **Novel Insights** | Poor (25-35%) | 55-65% |
| **Citation Accuracy** | Good (70-80%) | 15-20% |
| Model | English Tasks | Chinese Tasks | Bilingual Average |
|---|---|---|---|
| Gemini-2.5-Pro | 83.7 | 78.9 | 81.3 |
| OpenAI Deep Research | 81.2 | 75.3 | 78.3 |
| Kimi-Researcher | 72.4 | 80.1 | 76.3 |
| Doubao-DeepResearch | 70.8 | 77.5 | 74.2 |
1. **Task Scope**: Limited to 22 academic fields, may not cover all research domains
2. **Language Coverage**: Only English and Chinese, excluding other major research languages
3. **Evaluation Metrics**: RACE and FACT may not capture all aspects of research quality
4. **Human Baseline**: Establishing consistent expert baselines across domains is challenging
5. **Dynamic Information**: Difficulty in evaluating agents on rapidly changing information
| Direction | Description | Timeline |
|---|---|---|
| **Expanded Languages** | Add support for Spanish, French, German | 2025 Q4 |
| **Interactive Tasks** | Multi-turn research dialogues | 2026 Q1 |
| **Real-time Evaluation** | Live web research assessment | 2026 Q2 |
| **Multimodal Integration** | Include figures, charts, data visualization | 2026 Q3 |
| **Collaborative Research** | Multi-agent research scenarios | 2026 Q4 |
DeepResearch Bench addresses a critical gap in AI evaluation by providing the first comprehensive benchmark for assessing AI systems' ability to conduct PhD-level research. By combining rigorous task design with sophisticated evaluation metrics, the benchmark enables systematic comparison of Deep Research Agents across domains and languages and provides a measurable target for progress toward expert-level research capability.
As AI systems increasingly assist in research and analysis tasks, DeepResearch Bench provides essential infrastructure for ensuring these systems meet the high standards required for academic and professional research work.