DeepResearch Bench
| DeepResearch Bench | |
|---|---|
| Overview | |
| Full name | Deep Research Benchmark |
| Abbreviation | DRB |
| Description | A benchmark evaluating Deep Research Agents on PhD-level research tasks requiring multi-step exploration and synthesis |
| Release date | 2025-06-13 |
| Latest version | 2.0 |
| Benchmark updated | 2025-07-15 |
| Authors | Mingxuan Du, Benfeng Xu, Chiwei Zhu, Xiaorui Wang, Zhendong Mao |
| Organization | University of Science and Technology of China, Metastone Technology |
| Technical Details | |
| Type | Research Agent Evaluation, Report Generation, Multi-step Reasoning |
| Modality | Text, Web content |
| Task format | Research report generation, Multi-step exploration |
| Number of tasks | 100 |
| Total examples | 100 PhD-level research tasks |
| Evaluation metric | RACE (quality assessment), FACT (citation accuracy) |
| Domains | Physics, Chemistry, Biology, Environmental Science, Engineering, and 17 others |
| Languages | English (50 tasks), Chinese (50 tasks) |
| Performance | |
| Human performance | Established by domain experts |
| Baseline | Varies by model |
| SOTA score | 48.92 |
| SOTA model | Gemini-2.5-Pro Deep Research |
| SOTA date | 2025-07 |
| Saturated | No |
| Resources | |
| Website | Official website |
| Paper | Paper |
| GitHub | Repository |
| Dataset | Download |
| License | Open source |
DeepResearch Bench is a comprehensive artificial intelligence benchmark designed to evaluate Deep Research Agents (DRAs), large language model-based agents capable of conducting autonomous research and generating analyst-grade reports. Released in June 2025 by researchers from the University of Science and Technology of China and Metastone Technology[1], DeepResearch Bench addresses the need to assess AI systems' ability to perform PhD-level research tasks that require multi-step web exploration, targeted information retrieval, and higher-order synthesis.
Overview
DeepResearch Bench advances the evaluation of AI research capabilities by focusing on complex, real-world research scenarios that mirror the work of human researchers and analysts. The benchmark consists of 100 carefully curated PhD-level research tasks spanning 22 distinct academic fields, created and validated by over 100 domain experts holding PhD degrees or equivalent senior practitioner experience[2].
Motivation
The creation of DeepResearch Bench was motivated by several factors:
- **Gap in existing benchmarks**: Traditional benchmarks fail to capture the complexity of real research tasks
- **Rise of research agents**: Emergence of AI systems claiming research capabilities without standardized evaluation
- **Need for quality assessment**: Lack of metrics for evaluating research report quality and citation accuracy
- **Bilingual evaluation**: Requirement for benchmarks supporting multiple languages in academic contexts
- **Real-world alignment**: Need for tasks reflecting actual PhD-level research challenges
Task Design and Structure
Task Distribution
DeepResearch Bench's 100 tasks are carefully distributed across academic disciplines and languages:
| Category | Number of Tasks | Language Distribution |
|---|---|---|
| **Physical Sciences** | 20 | 10 English, 10 Chinese |
| **Life Sciences** | 18 | 9 English, 9 Chinese |
| **Engineering** | 16 | 8 English, 8 Chinese |
| **Environmental Sciences** | 12 | 6 English, 6 Chinese |
| **Social Sciences** | 10 | 5 English, 5 Chinese |
| **Computer Science** | 8 | 4 English, 4 Chinese |
| **Mathematics** | 6 | 3 English, 3 Chinese |
| **Other Fields** | 10 | 5 English, 5 Chinese |
Task Characteristics
Each research task in DeepResearch Bench exhibits several key characteristics[1]:
| Characteristic | Description | Example |
|---|---|---|
| **Multi-step Exploration** | Requires iterative information gathering | Literature review → hypothesis formation → evidence synthesis |
| **Cross-source Integration** | Demands information from multiple sources | Academic papers + datasets + news articles |
| **Domain Expertise** | Needs specialized knowledge | Understanding quantum mechanics terminology |
| **Critical Analysis** | Requires evaluating conflicting information | Assessing contradictory research findings |
| **Synthesis Capability** | Demands creating coherent narratives | Writing comprehensive research reports |
Task Creation Process
The benchmark's tasks were developed through a rigorous process:
1. **Query Analysis**: Analyzed 96,147 user queries to identify research needs
2. **Deep Research Identification**: 44,019 queries identified as requiring deep research
3. **Expert Curation**: 100+ domain experts created representative tasks
4. **Validation**: Multiple rounds of review and refinement
5. **Bilingual Adaptation**: Careful translation and cultural adaptation
Evaluation Methodology
RACE Framework
The Report Assessment for Comprehensive Evaluation (RACE) framework assesses the quality of generated research reports[1]:
| Criterion | Weight | Description | Evaluation Method |
|---|---|---|---|
| **Comprehensiveness** | 30% | Coverage of relevant aspects | Comparison with reference reports |
| **Insight/Depth** | 25% | Analysis quality and originality | Expert rubric scoring |
| **Instruction Following** | 20% | Adherence to task requirements | Binary and scaled metrics |
| **Readability** | 15% | Clarity and organization | Automated readability scores |
| **Accuracy** | 10% | Factual correctness | Fact-checking against sources |
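The weights above can be combined into a single RACE score as a weighted sum of the per-criterion sub-scores. The sketch below is illustrative only: the weight values come from the table, but the function name and data layout are assumptions rather than part of the official evaluation toolkit.

```python
# Weighted aggregation of RACE sub-scores (weights taken from the table above).
# Illustrative sketch; not the official DeepResearch Bench implementation.
RACE_WEIGHTS = {
    "comprehensiveness": 0.30,
    "insight_depth": 0.25,
    "instruction_following": 0.20,
    "readability": 0.15,
    "accuracy": 0.10,
}

def weighted_race_score(subscores):
    """Combine per-criterion scores (0-100 scale) into one weighted score."""
    return sum(RACE_WEIGHTS[name] * subscores[name] for name in RACE_WEIGHTS)

# Example: a report scoring 85, 70, 90, 80, 75 on the five criteria
print(weighted_race_score({
    "comprehensiveness": 85, "insight_depth": 70,
    "instruction_following": 90, "readability": 80, "accuracy": 75,
}))  # 80.5
```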
FACT Framework
The Factual Accuracy and Citation Testing (FACT) framework evaluates information retrieval effectiveness:
| Metric | Description | Calculation |
|---|---|---|
| **Citation Accuracy** | Correctness of cited sources | Verified citations / Total citations |
| **Effective Citations** | Relevant and supporting citations | Relevant citations / Total citations |
| **Source Diversity** | Variety of information sources | Unique domains / Total citations |
| **Citation Density** | Citations per unit of content | Citations / Word count × 1000 |
| **Temporal Relevance** | Recency of cited materials | Weighted by publication date |
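Most of the FACT metrics in the table are simple ratios over a report's citation list. The following sketch shows how they could be computed; the `Citation` record and its field names are assumptions for illustration, not the benchmark's actual data model, and temporal relevance is omitted because its weighting scheme is not specified here.

```python
# Illustrative computation of the ratio-based FACT metrics listed above.
# The Citation record and its fields are assumptions, not the official data model.
from dataclasses import dataclass

@dataclass
class Citation:
    url: str        # cited source
    domain: str     # e.g. "arxiv.org"
    verified: bool  # source exists and supports the cited statement
    relevant: bool  # source is actually pertinent to the claim

def fact_metrics(citations, word_count):
    total = len(citations)
    if total == 0:
        return {"citation_accuracy": 0.0, "effective_citations": 0.0,
                "source_diversity": 0.0, "citation_density": 0.0}
    return {
        "citation_accuracy": sum(c.verified for c in citations) / total,
        "effective_citations": sum(c.relevant for c in citations) / total,
        "source_diversity": len({c.domain for c in citations}) / total,
        "citation_density": total / word_count * 1000,  # citations per 1,000 words
    }
```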
Current Performance
Leaderboard Results (July 2025)
The DeepResearch Bench leaderboard, hosted on Hugging Face Spaces[3], shows current model performance:
| Rank | Model | RACE Score | FACT Score | Overall | Organization |
|---|---|---|---|---|---|
| 1 | Gemini-2.5-Pro Deep Research | 82.3 | 78.5 | 80.4 | Google |
| 2 | OpenAI Deep Research | 80.7 | 76.2 | 78.5 | OpenAI |
| 3 | Perplexity Deep Research | 78.4 | 79.1 | 78.8 | Perplexity AI |
| 4 | Kimi-Researcher | 76.2 | 74.8 | 75.5 | Moonshot AI |
| 5 | Claude-Researcher | 75.8 | 73.4 | 74.6 | Anthropic |
| 6 | Doubao-DeepResearch | 74.1 | 72.9 | 73.5 | ByteDance |
Performance Analysis by Domain
| Domain | Best Performing Model | Average Score | Human Expert Baseline |
|---|---|---|---|
| **Computer Science** | Gemini-2.5-Pro | 85.2 | 92.0 |
| **Physical Sciences** | OpenAI Deep Research | 79.8 | 88.5 |
| **Life Sciences** | Perplexity | 77.3 | 87.0 |
| **Engineering** | Gemini-2.5-Pro | 76.5 | 86.5 |
| **Social Sciences** | Claude-Researcher | 72.1 | 84.0 |
Technical Implementation
System Requirements
DeepResearch Bench requires the following technical setup[4]:
```bash
# Installation
pip install deepresearchbench

# Required API keys
export GEMINI_API_KEY="your_gemini_key"
export JINA_API_KEY="your_jina_key"  # For web scraping

# Python version: 3.9+
```
Evaluation Pipeline
```python
from deepresearchbench import Evaluator, RACEScorer, FACTScorer

# Initialize the evaluator with both scoring frameworks
evaluator = Evaluator(
    race_scorer=RACEScorer(),
    fact_scorer=FACTScorer()
)

# Load benchmark tasks
tasks = evaluator.load_benchmark("path/to/tasks.jsonl")

# Evaluate a research agent
results = evaluator.evaluate(
    agent=my_research_agent,
    tasks=tasks,
    verbose=True
)

# Generate a detailed report
evaluator.generate_report(results, output_path="evaluation_report.pdf")
```
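The `my_research_agent` object in the pipeline above is supplied by the user. The exact interface the evaluator expects is defined by the repository; the toy stand-in below only illustrates one plausible shape (a report string plus a citation list), and every name in it is an assumption rather than the documented API.

```python
# Toy stand-in for `my_research_agent`; the return structure is an assumption.
class MyResearchAgent:
    def run(self, query, language="en"):
        # A real agent would browse the web, gather sources, and synthesize a
        # report here; this stub only echoes the query to show the output shape.
        return {
            "report": f"# Report\n\nFindings for: {query}",
            "citations": [{"url": "https://example.org", "title": "Placeholder"}],
        }

my_research_agent = MyResearchAgent()
```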
Data Format
Each task in the benchmark follows a standardized format:
```json
{
  "task_id": "DRB_001",
  "domain": "Physics",
  "language": "en",
  "query": "Analyze recent advances in quantum error correction...",
  "reference_sources": [...],
  "expert_annotations": {...},
  "difficulty_level": "PhD",
  "estimated_time_hours": 4.5
}
```
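Because tasks are distributed as one JSON record per line, they can be filtered by domain or language before evaluation. The loader below is a minimal sketch assuming the JSONL layout and field names shown above; the file path is a placeholder.

```python
# Minimal JSONL task loader; field names follow the example record above,
# and the file path is a placeholder.
import json
from typing import List, Optional

def load_tasks(path: str, language: Optional[str] = None) -> List[dict]:
    """Load task records, optionally keeping only one language ("en" or "zh")."""
    tasks = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            if language is None or record.get("language") == language:
                tasks.append(record)
    return tasks

english_physics = [
    t for t in load_tasks("path/to/tasks.jsonl", language="en")
    if t.get("domain") == "Physics"
]
```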
Key Findings and Insights
Agent Capabilities Analysis
DeepResearch Bench reveals several insights about current AI research agents:
| Capability | Current State | Gap to Human Expert |
|---|---|---|
| **Information Retrieval** | Good (75-85%) | 10-15% |
| **Source Synthesis** | Moderate (60-70%) | 20-30% |
| **Critical Analysis** | Limited (45-55%) | 35-45% |
| **Novel Insights** | Poor (25-35%) | 55-65% |
| **Citation Accuracy** | Good (70-80%) | 15-20% |
Language Performance Comparison
| Model | English Tasks | Chinese Tasks | Bilingual Average |
|---|---|---|---|
| Gemini-2.5-Pro | 83.7 | 78.9 | 81.3 |
| OpenAI Deep Research | 81.2 | 75.3 | 78.3 |
| Kimi-Researcher | 72.4 | 80.1 | 76.3 |
| Doubao-DeepResearch | 70.8 | 77.5 | 74.2 |
Limitations and Future Work
Current Limitations
1. **Task Scope**: Limited to 22 academic fields; may not cover all research domains
2. **Language Coverage**: Only English and Chinese, excluding other major research languages
3. **Evaluation Metrics**: RACE and FACT may not capture all aspects of research quality
4. **Human Baseline**: Establishing consistent expert baselines across domains is challenging
5. **Dynamic Information**: Difficulty in evaluating agents on rapidly changing information
Future Directions
| Direction | Description | Timeline |
|---|---|---|
| **Expanded Languages** | Add support for Spanish, French, German | 2025 Q4 |
| **Interactive Tasks** | Multi-turn research dialogues | 2026 Q1 |
| **Real-time Evaluation** | Live web research assessment | 2026 Q2 |
| **Multimodal Integration** | Include figures, charts, data visualization | 2026 Q3 |
| **Collaborative Research** | Multi-agent research scenarios | 2026 Q4 |
Significance
DeepResearch Bench addresses a critical gap in AI evaluation by providing the first comprehensive benchmark for assessing AI systems' ability to conduct PhD-level research. By combining rigorous task design with sophisticated evaluation metrics, the benchmark enables:
- **Standardized comparison** of research agent capabilities
- **Identification of weaknesses** in current AI systems
- **Guidance for development** of next-generation research agents
- **Quality assurance** for AI-generated research content
- **Cross-linguistic evaluation** of research capabilities
As AI systems increasingly assist in research and analysis tasks, DeepResearch Bench provides essential infrastructure for ensuring these systems meet the high standards required for academic and professional research work.
See Also
- Deep Research Agents
- Large Language Models
- AI Research Tools
- Research Automation
- Academic AI
- Citation Analysis
- Deep Research Bench (FutureSearch benchmark)
- Information Retrieval
References
- ↑ Du, M., Xu, B., Zhu, C., Wang, X., & Mao, Z. (2025). "DeepResearch Bench: Evaluating Deep Research Agents". arXiv:2506.11763. Retrieved from https://arxiv.org/abs/2506.11763
- ↑ DeepResearch Bench Team. (2025). "DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents". Retrieved from https://deepresearch-bench.github.io/
- ↑ Ayanami0730. (2025). "DeepResearch Leaderboard". Hugging Face Spaces. Retrieved from https://huggingface.co/spaces/Ayanami0730/DeepResearch-Leaderboard
- ↑ Ayanami0730. (2025). "Deep Research Bench GitHub Repository". Retrieved from https://github.com/Ayanami0730/deep_research_bench