| DeepResearch Bench | |
|---|---|
| Overview | |
| Full name | Deep Research Benchmark |
| Abbreviation | DRB |
| Description | A benchmark evaluating Deep Research Agents on PhD-level research tasks requiring multi-step exploration and synthesis |
| Release date | 2025-06-13 |
| Latest version | 2.0 |
| Benchmark updated | 2025-07-15 |
| Authors | Mingxuan Du, Benfeng Xu, Chiwei Zhu, Xiaorui Wang, Zhendong Mao |
| Organization | University of Science and Technology of China, Metastone Technology |
| Technical Details | |
| Type | Research Agent Evaluation, Report Generation, Multi-step Reasoning |
| Modality | Text, Web content |
| Task format | Research report generation, Multi-step exploration |
| Number of tasks | 100 |
| Total examples | 100 PhD-level research tasks |
| Evaluation metric | RACE (quality assessment), FACT (citation accuracy) |
| Domains | Physics, Chemistry, Biology, Environmental Science, Engineering, and 17 others |
| Languages | English (50 tasks), Chinese (50 tasks) |
| Performance | |
| Human performance | Established by domain experts |
| Baseline | Varies by model |
| SOTA score | 48.92 |
| SOTA model | Gemini-2.5-Pro Deep Research |
| SOTA date | 2025-07 |
| Saturated | No |
| Resources | |
| Website | Official website |
| Paper | Paper |
| GitHub | Repository |
| Dataset | Download |
| License | Open source |
DeepResearch Bench is a comprehensive artificial intelligence benchmark designed to evaluate Deep Research Agents (DRAs), large language model-based agents capable of conducting autonomous research and generating analyst-grade reports. Released in June 2025 by researchers from the University of Science and Technology of China and Metastone Technology[1], DeepResearch Bench addresses the critical need to assess AI systems' ability to perform PhD-level research tasks that require multi-step web exploration, targeted information retrieval, and higher-order synthesis.
DeepResearch Bench represents a significant advance in evaluating AI research capabilities by focusing on complex, real-world research scenarios that mirror the work of human researchers and analysts. The benchmark consists of 100 carefully curated PhD-level research tasks spanning 22 distinct academic fields, created and validated by more than 100 domain experts holding PhD degrees or equivalent senior practitioner experience[2].
The creation of DeepResearch Bench was motivated by the need for a rigorous way to assess whether AI agents can carry out the multi-step web exploration, targeted information retrieval, and higher-order synthesis that PhD-level research tasks demand.
DeepResearch Bench's 100 tasks are carefully distributed across academic disciplines and languages:
| Category | Number of Tasks | Language Distribution |
|---|---|---|
| **Physical Sciences** | 20 | 10 English, 10 Chinese |
| **Life Sciences** | 18 | 9 English, 9 Chinese |
| **Engineering** | 16 | 8 English, 8 Chinese |
| **Environmental Sciences** | 12 | 6 English, 6 Chinese |
| **Social Sciences** | 10 | 5 English, 5 Chinese |
| **Computer Science** | 8 | 4 English, 4 Chinese |
| **Mathematics** | 6 | 3 English, 3 Chinese |
| **Other Fields** | 10 | 5 English, 5 Chinese |
Each research task in DeepResearch Bench exhibits several key characteristics[1]:
| Characteristic | Description | Example |
|---|---|---|
| **Multi-step Exploration** | Requires iterative information gathering | Literature review → hypothesis formation → evidence synthesis |
| **Cross-source Integration** | Demands information from multiple sources | Academic papers + datasets + news articles |
| **Domain Expertise** | Needs specialized knowledge | Understanding quantum mechanics terminology |
| **Critical Analysis** | Requires evaluating conflicting information | Assessing contradictory research findings |
| **Synthesis Capability** | Demands creating coherent narratives | Writing comprehensive research reports |
The benchmark's tasks were developed through a rigorous process:
1. **Query Analysis**: Analyzed 96,147 user queries to identify research needs
2. **Deep Research Identification**: 44,019 queries identified as requiring deep research
3. **Expert Curation**: 100+ domain experts created representative tasks
4. **Validation**: Multiple rounds of review and refinement
5. **Bilingual Adaptation**: Careful translation and cultural adaptation
The RACE (Reference-based Adaptive Criteria-driven Evaluation) framework assesses the quality of generated research reports[1]:
| Criterion | Weight | Description | Evaluation Method |
|---|---|---|---|
| **Comprehensiveness** | 30% | Coverage of relevant aspects | Comparison with reference reports |
| **Insight/Depth** | 25% | Analysis quality and originality | Expert rubric scoring |
| **Instruction Following** | 20% | Adherence to task requirements | Binary and scaled metrics |
| **Readability** | 15% | Clarity and organization | Automated readability scores |
| **Accuracy** | 10% | Factual correctness | Fact-checking against sources |
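To illustrate how the weighted criteria in the table above combine into a single score, the sketch below aggregates per-criterion scores on an assumed 0-100 scale. The function name and input format are illustrative assumptions, not the official RACE implementation.

```python
# Illustrative weighted aggregation of RACE criteria (a sketch, not the official scorer).
# Weights follow the table above; per-criterion scores are assumed to be on a 0-100 scale.
RACE_WEIGHTS = {
    "comprehensiveness": 0.30,
    "insight_depth": 0.25,
    "instruction_following": 0.20,
    "readability": 0.15,
    "accuracy": 0.10,
}

def race_overall(criterion_scores: dict[str, float]) -> float:
    """Combine per-criterion scores into one weighted RACE score."""
    missing = set(RACE_WEIGHTS) - set(criterion_scores)
    if missing:
        raise ValueError(f"Missing criterion scores: {missing}")
    return sum(RACE_WEIGHTS[name] * criterion_scores[name] for name in RACE_WEIGHTS)

# Example: a report with strong coverage but limited originality.
print(race_overall({
    "comprehensiveness": 85.0,
    "insight_depth": 60.0,
    "instruction_following": 90.0,
    "readability": 80.0,
    "accuracy": 75.0,
}))  # -> 78.0
```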
The FACT (Framework for Factual Abundance and Citation Trustworthiness) framework evaluates information retrieval and citation effectiveness:
| Metric | Description | Calculation |
|---|---|---|
| **Citation Accuracy** | Correctness of cited sources | Verified citations / Total citations |
| **Effective Citations** | Relevant and supporting citations | Relevant citations / Total citations |
| **Source Diversity** | Variety of information sources | Unique domains / Total citations |
| **Citation Density** | Citations per unit of content | Citations / Word count × 1000 |
| **Temporal Relevance** | Recency of cited materials | Weighted by publication date |
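The first four metrics in the table follow simple ratio formulas, sketched below for a report represented as a list of citations plus a word count. The record fields (`url`, `verified`, `relevant`) are hypothetical stand-ins for whatever verification signal the official scorer uses.

```python
from urllib.parse import urlparse

# Illustrative FACT-style ratios from the formulas above (a sketch, not the official scorer).
# Each citation is a dict with the cited URL and two hypothetical verification flags.
def fact_metrics(citations: list[dict], word_count: int) -> dict[str, float]:
    total = len(citations)
    if total == 0 or word_count == 0:
        return {"citation_accuracy": 0.0, "effective_citations": 0.0,
                "source_diversity": 0.0, "citation_density": 0.0}
    verified = sum(c["verified"] for c in citations)          # source actually supports the claim
    relevant = sum(c["relevant"] for c in citations)          # citation is on-topic and supporting
    domains = {urlparse(c["url"]).netloc for c in citations}  # unique source domains
    return {
        "citation_accuracy": verified / total,
        "effective_citations": relevant / total,
        "source_diversity": len(domains) / total,
        "citation_density": total / word_count * 1000,        # citations per 1,000 words
    }

# Example: three citations in a 2,000-word report.
example = [
    {"url": "https://arxiv.org/abs/2406.00001", "verified": True, "relevant": True},
    {"url": "https://arxiv.org/abs/2406.00002", "verified": True, "relevant": False},
    {"url": "https://example.org/news", "verified": False, "relevant": True},
]
print(fact_metrics(example, word_count=2000))
# citation_accuracy≈0.67, effective_citations≈0.67, source_diversity≈0.67, citation_density=1.5
```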
The DeepResearch Bench leaderboard, hosted on Hugging Face Spaces[3], shows current model performance:
| Rank | Model | RACE Score | FACT Score | Overall | Organization |
|---|---|---|---|---|---|
| 1 | Gemini-2.5-Pro Deep Research | 82.3 | 78.5 | 80.4 | Google |
| 2 | OpenAI Deep Research | 80.7 | 76.2 | 78.5 | OpenAI |
| 3 | Perplexity Deep Research | 78.4 | 79.1 | 78.8 | Perplexity AI |
| 4 | Kimi-Researcher | 76.2 | 74.8 | 75.5 | Moonshot AI |
| 5 | Claude-Researcher | 75.8 | 73.4 | 74.6 | Anthropic |
| 6 | Doubao-DeepResearch | 74.1 | 72.9 | 73.5 | ByteDance |
| Domain | Best Performing Model | Average Score | Human Expert Baseline |
|---|---|---|---|
| **Computer Science** | Gemini-2.5-Pro | 85.2 | 92.0 |
| **Physical Sciences** | OpenAI Deep Research | 79.8 | 88.5 |
| **Life Sciences** | Perplexity | 77.3 | 87.0 |
| **Engineering** | Gemini-2.5-Pro | 76.5 | 86.5 |
| **Social Sciences** | Claude-Researcher | 72.1 | 84.0 |
DeepResearch Bench requires the following technical setup[4]:
```bash
# Requires Python 3.9+
pip install deepresearchbench

# API keys used for evaluation and web scraping
export GEMINI_API_KEY="your_gemini_key"
export JINA_API_KEY="your_jina_key"  # For web scraping
```
```python
from deepresearchbench import Evaluator, RACEScorer, FACTScorer

# Initialize the evaluator with both scoring frameworks
evaluator = Evaluator(
    race_scorer=RACEScorer(),
    fact_scorer=FACTScorer(),
)

# Load the benchmark tasks
tasks = evaluator.load_benchmark("path/to/tasks.jsonl")

# Evaluate a research agent on all tasks
results = evaluator.evaluate(
    agent=my_research_agent,
    tasks=tasks,
    verbose=True,
)

# Write a summary report of the results
evaluator.generate_report(results, output_path="evaluation_report.pdf")
```
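The snippet above passes a `my_research_agent` object that is never defined, and the interface the evaluator expects is not documented here. The following is therefore a purely hypothetical adapter sketch; the class, its `run` method, and the return shape are all assumptions about how an agent might be wrapped.

```python
# Hypothetical adapter: the exact interface expected by Evaluator.evaluate() is not
# specified here, so both the class shape and the run() signature are assumptions.
class MyResearchAgent:
    """Wraps an underlying research system behind a single entry point."""

    def __init__(self, backend):
        self.backend = backend  # e.g. an LLM client plus a web-search tool

    def run(self, query: str, language: str = "en") -> dict:
        """Produce a research report (text plus cited URLs) for one benchmark task."""
        report, citations = self.backend.research(query, language=language)
        return {"report": report, "citations": citations}

my_research_agent = MyResearchAgent(backend=...)  # supply your own research backend
```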
Each task in the benchmark follows a standardized format:
```json
{
"task_id": "DRB_001",
"domain": "Physics",
"language": "en",
"query": "Analyze recent advances in quantum error correction...",
"reference_sources": [...],
"expert_annotations": {...},
"difficulty_level": "PhD",
"estimated_time_hours": 4.5
}
```
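As a small illustration of how such records might be consumed, the sketch below reads a JSONL file (one task per line) and groups tasks by domain. The file path and field names simply follow the example record above and are not an official loader.

```python
import json
from collections import defaultdict

# Minimal sketch: read benchmark tasks from a JSONL file (one JSON record per line)
# and group them by the "domain" field shown in the example record above.
def load_tasks_by_domain(path: str) -> dict[str, list[dict]]:
    tasks_by_domain = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            task = json.loads(line)
            tasks_by_domain[task["domain"]].append(task)
    return dict(tasks_by_domain)

tasks = load_tasks_by_domain("path/to/tasks.jsonl")
print({domain: len(items) for domain, items in tasks.items()})
```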
DeepResearch Bench reveals several insights about current AI research agents:
| Capability | Current State | Gap to Human Expert |
|---|---|---|
| **Information Retrieval** | Good (75-85%) | 10-15% |
| **Source Synthesis** | Moderate (60-70%) | 20-30% |
| **Critical Analysis** | Limited (45-55%) | 35-45% |
| **Novel Insights** | Poor (25-35%) | 55-65% |
| **Citation Accuracy** | Good (70-80%) | 15-20% |
| Model | English Tasks | Chinese Tasks | Bilingual Average |
|---|---|---|---|
| Gemini-2.5-Pro | 83.7 | 78.9 | 81.3 |
| OpenAI Deep Research | 81.2 | 75.3 | 78.3 |
| Kimi-Researcher | 72.4 | 80.1 | 76.3 |
| Doubao-DeepResearch | 70.8 | 77.5 | 74.2 |
1. **Task Scope**: Limited to 22 academic fields, may not cover all research domains
2. **Language Coverage**: Only English and Chinese, excluding other major research languages
3. **Evaluation Metrics**: RACE and FACT may not capture all aspects of research quality
4. **Human Baseline**: Establishing consistent expert baselines across domains is challenging
5. **Dynamic Information**: Difficulty in evaluating agents on rapidly changing information
| Direction | Description | Timeline |
|---|---|---|
| **Expanded Languages** | Add support for Spanish, French, German | 2025 Q4 |
| **Interactive Tasks** | Multi-turn research dialogues | 2026 Q1 |
| **Real-time Evaluation** | Live web research assessment | 2026 Q2 |
| **Multimodal Integration** | Include figures, charts, data visualization | 2026 Q3 |
| **Collaborative Research** | Multi-agent research scenarios | 2026 Q4 |
DeepResearch Bench addresses a critical gap in AI evaluation by providing the first comprehensive benchmark for assessing AI systems' ability to conduct PhD-level research. By combining rigorous task design with sophisticated evaluation metrics, the benchmark enables systematic comparison of Deep Research Agents across domains and languages and provides a measurable target for progress toward expert-level research capability.
As AI systems increasingly assist in research and analysis tasks, DeepResearch Bench provides essential infrastructure for ensuring these systems meet the high standards required for academic and professional research work.