GPQA Diamond
| GPQA Diamond | |
|---|---|
| Overview | |
| Full name | Graduate-Level Google-Proof Q&A Benchmark - Diamond Subset |
| Abbreviation | GPQA Diamond |
| Description | A challenging subset of graduate-level, Google-proof science questions testing PhD-level knowledge in biology, physics, and chemistry |
| Release date | 2023-11-20 |
| Latest version | 1.0 |
| Benchmark updated | 2023-11-20 |
| Authors | David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, Samuel R. Bowman |
| Organization | New York University, Anthropic |
| Technical Details | |
| Type | Scientific Reasoning, Expert Knowledge |
| Modality | Text |
| Task format | Multiple choice |
| Number of tasks | 198 |
| Total examples | 198 |
| Evaluation metric | Accuracy (zero-shot and chain-of-thought settings) |
| Domains | Biology, Physics, Chemistry |
| Languages | English |
| Performance | |
| Human performance | 65% (PhD experts), 34% (skilled non-experts) |
| Baseline | 39% (GPT-4) |
| SOTA score | 92.4% |
| SOTA model | Aristotle X1 Verify (Autopoiesis Sciences) |
| SOTA date | 2025-01 |
| Saturated | Near saturation |
| Resources | |
| Paper | Paper |
| GitHub | Repository |
| Dataset | Download |
| Predecessor | GPQA (full set) |
GPQA Diamond is a challenging AI benchmark consisting of 198 PhD-level multiple-choice questions in biology, physics, and chemistry. Released on November 20, 2023, it represents the most difficult subset of the Graduate-Level Google-Proof Q&A Benchmark (GPQA), specifically designed to test artificial intelligence systems on questions that require deep scientific expertise and cannot be easily answered through web searches.
Overview
GPQA Diamond was created to address a critical challenge in AI development: how to supervise and validate AI systems that may exceed human capabilities in specialized domains. The benchmark consists of questions where PhD experts achieve 65% accuracy (74% when discounting clear mistakes identified in retrospect), while skilled non-experts with unrestricted internet access only achieve 34% accuracy despite spending over 30 minutes per question on average.
Purpose and Motivation
The primary motivation behind GPQA Diamond is to enable scalable oversight experiments for future AI systems. As AI capabilities approach and potentially surpass human expertise in scientific domains, researchers need robust methodologies for human experts to effectively supervise and validate AI outputs. The significant performance gap between experts and non-experts on GPQA Diamond makes it ideal for testing oversight methods that could help humans reliably obtain truthful information from superhuman AI systems.
Google-Proof Design
A unique characteristic of GPQA Diamond is its "Google-proof" nature. Questions are specifically crafted so that:
- They cannot be easily answered through web searches
- They require deep domain understanding beyond surface-level knowledge
- They test integration of multiple concepts and principles
- They resist simple pattern matching or information retrieval approaches
Technical Specifications
Dataset Composition
GPQA Diamond consists of 198 questions drawn from the larger 448-question GPQA set; a question enters the Diamond subset when both expert validators answered it correctly and the majority of non-expert validators answered it incorrectly:
| Domain | Number of Questions | Subdisciplines Covered |
|---|---|---|
| Biology | ~66 | Molecular biology, Genetics, Biochemistry, Cell biology, Ecology |
| Physics | ~66 | Quantum mechanics, Statistical mechanics, Electromagnetism, Classical mechanics |
| Chemistry | ~66 | Organic chemistry, Physical chemistry, Inorganic chemistry, Analytical chemistry |
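For orientation, the Diamond split can be loaded and formatted programmatically. The sketch below assumes the Hugging Face release of GPQA (`Idavidrein/gpqa` with a `gpqa_diamond` config, which may require accepting the dataset's usage terms) and the column names used in the released files; verify both against the dataset card before relying on them.

```python
# Minimal sketch: load GPQA Diamond and build a 4-way multiple-choice prompt
# with the correct answer placed at a random position.
# Assumes the Hugging Face "Idavidrein/gpqa" dataset with a "gpqa_diamond" config
# and the column names below; check the dataset card, these are assumptions.
import random
from datasets import load_dataset

ds = load_dataset("Idavidrein/gpqa", "gpqa_diamond", split="train")

def format_question(row, rng):
    """Return (prompt, correct_letter) for one GPQA row."""
    options = [
        row["Correct Answer"],
        row["Incorrect Answer 1"],
        row["Incorrect Answer 2"],
        row["Incorrect Answer 3"],
    ]
    rng.shuffle(options)
    correct_letter = "ABCD"[options.index(row["Correct Answer"])]
    lines = [row["Question"], ""]
    lines += [f"({letter}) {text}" for letter, text in zip("ABCD", options)]
    return "\n".join(lines), correct_letter

rng = random.Random(0)
prompt, answer = format_question(ds[0], rng)
print(prompt)
print("Correct option:", answer)
```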
Question Development Process
The questions were developed through a rigorous multi-stage process:
1. Expert Creation: Questions written by domain experts with PhDs or pursuing PhDs
2. Difficulty Calibration: Questions range from "hard undergraduate" to "post-graduate level"
3. Validation Process: Multiple rounds of review by independent experts
4. Non-Expert Testing: Skilled validators attempt questions with web access
5. Diamond Selection: Most challenging and high-quality questions selected for the Diamond subset
Evaluation Methodology
| Evaluation Metric | Description | Implementation |
|---|---|---|
| Zero-shot Accuracy | Direct answer selection without examples | Standard evaluation protocol |
| Chain-of-Thought (CoT) | Step-by-step reasoning before answer | Common for reasoning models |
| Few-shot Learning | Providing example questions and answers | Used in some evaluations |
| Expert Baseline | PhD holders in relevant domain | 65% accuracy benchmark |
| Non-Expert Baseline | Skilled individuals with web access | 34% accuracy benchmark |
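A minimal sketch of the zero-shot chain-of-thought protocol from the table above: the model is asked to reason step by step, a final answer letter is extracted, and accuracy is the fraction of questions answered correctly. Here `query_model` is a hypothetical stand-in for whatever completion API is being evaluated, and `format_question` is the prompt builder sketched earlier; real harnesses use more robust answer extraction.

```python
# Minimal sketch of zero-shot chain-of-thought scoring on GPQA Diamond.
# `query_model` is a hypothetical callable (prompt -> completion text);
# `format_question` comes from the loading sketch above.
import re
import random

COT_INSTRUCTION = (
    "Think through the problem step by step, then give your final answer "
    "on a new line in the form 'Answer: <letter>'."
)

def extract_letter(completion: str):
    """Pull the final 'Answer: X' letter out of a chain-of-thought completion."""
    matches = re.findall(r"Answer:\s*\(?([ABCD])\)?", completion)
    return matches[-1] if matches else None

def evaluate(dataset, query_model, seed: int = 0) -> float:
    rng = random.Random(seed)
    correct = 0
    for row in dataset:
        prompt, gold = format_question(row, rng)
        completion = query_model(f"{prompt}\n\n{COT_INSTRUCTION}")
        if extract_letter(completion) == gold:
            correct += 1
    return correct / len(dataset)
```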
Performance Analysis
Current Leaderboard
As of 2025, the following models have achieved notable performance on GPQA Diamond:
| Rank | Model | Accuracy (%) | Organization | Date |
|---|---|---|---|---|
| 1 | Aristotle X1 Verify | 92.4 | Autopoiesis Sciences | January 2025 |
| 2 | xAI Grok 4 Heavy | 88.9 | xAI | 2025 |
| 3 | Gemini 2.5 Pro | 86.4 | Google DeepMind | 2025 |
| 4 | OpenAI o3 | 83.3 | OpenAI | December 2024 |
| 5 | Claude Sonnet 4 | 78.2 | Anthropic | August 2025 |
| 6 | OpenAI o1 | 78.0 | OpenAI | September 2024 |
| 7 | Claude 3 Opus | ~60 | Anthropic | March 2024 |
| 8 | Claude 3.5 Sonnet | 59.4 | Anthropic | June 2024 |
| 9 | GPT-4 (baseline) | 39.0 | OpenAI | November 2023 |
Note: Some scores may vary depending on evaluation methodology and date of testing.
Key Innovation: Aristotle X1 Verify
The top-performing system, Aristotle X1 Verify by Autopoiesis Sciences, achieved 92.4% accuracy while also solving a critical AI challenge: calibration. Unlike most AI systems whose confidence scores don't align with actual accuracy, Aristotle X1 embeds systematic doubt into every layer of reasoning, achieving both high accuracy and reliable confidence estimates. The system also achieved 96.1% on SimpleQA, OpenAI's factuality benchmark.
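Calibration here refers to the agreement between a model's stated confidence and its empirical accuracy. One standard way to measure it is expected calibration error (ECE), sketched below as a generic illustration; this is not a description of Aristotle X1's internal method.

```python
# Minimal sketch of expected calibration error (ECE): bin predictions by stated
# confidence and compare each bin's mean confidence with its empirical accuracy.
# Generic illustration only, not Autopoiesis Sciences' approach.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    confidences = np.asarray(confidences, dtype=float)  # model-reported P(correct)
    correct = np.asarray(correct, dtype=float)          # 1.0 if the answer was right
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap  # weight each bin by its share of samples
    return ece

# A well-calibrated model that says "80% confident" should be right about 80% of the time.
print(expected_calibration_error([0.9, 0.8, 0.6, 0.95], [1, 1, 0, 1]))
```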
Performance Trends
Rapid Progress
The benchmark has seen dramatic improvements:
- Initial Release (2023): GPT-4 baseline at 39%
- Mid-2024: Claude 3 Opus reaches ~60%
- Late 2024: o1 and o3 approach 80%
- Early 2025: Aristotle X1 Verify exceeds 90%
This represents a more than doubling of performance in just over one year, with recent models approaching and exceeding human expert performance.
Domain-Specific Performance
Analysis reveals significant variation across scientific domains:
- Physics: Generally highest performance
- Biology: Moderate performance
- Chemistry: Lowest performance, especially organic chemistry
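A per-domain breakdown like the one above can be produced by grouping scored questions on the dataset's domain field. The sketch below assumes a "High-level domain" column (as in the released GPQA files; verify against the dataset card) and reuses the hypothetical `format_question`, `COT_INSTRUCTION`, `extract_letter`, and `query_model` pieces from the earlier sketches.

```python
# Minimal sketch: per-domain accuracy on GPQA Diamond.
# Field name "High-level domain" is an assumption about the released dataset;
# the helper functions are the ones sketched earlier in this article.
import random
from collections import defaultdict

def evaluate_by_domain(dataset, query_model, seed: int = 0) -> dict:
    rng = random.Random(seed)
    totals, hits = defaultdict(int), defaultdict(int)
    for row in dataset:
        domain = row["High-level domain"]
        prompt, gold = format_question(row, rng)
        completion = query_model(f"{prompt}\n\n{COT_INSTRUCTION}")
        totals[domain] += 1
        hits[domain] += int(extract_letter(completion) == gold)
    return {domain: hits[domain] / totals[domain] for domain in totals}
```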
Saturation Analysis
As of 2025, GPQA Diamond shows signs of near saturation:
| Indicator | Status | Implications |
|---|---|---|
| Score Clustering | Models clustered around 80-90% | Approaching ceiling effect |
| Consistent Failures | ~20% questions consistently missed | Potential benchmark limitations |
| Error Analysis | ~8% questions potentially invalid | Some noise in dataset |
| Performance Plateau | Diminishing returns on improvements | Near-saturation |
Analysis by Epoch AI suggests that roughly 8% of questions may have validity issues, and that most of the seemingly "impossible" questions are in fact valid but require highly specialized knowledge. In other words, over 90% of the benchmark appears to be sound, and remaining model failures stem largely from legitimate difficulty rather than question errors.
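The validity estimate also implies a soft ceiling on achievable scores: a model that answers every valid question correctly and guesses at chance on invalid ones tops out around 94%, not 100%. A back-of-envelope sketch using the ~8% figure above:

```python
# Back-of-envelope ceiling implied by the ~8% question-validity estimate above:
# perfect accuracy on valid questions plus chance (1 in 4) on invalid ones.
invalid_fraction = 0.08  # Epoch AI's rough estimate of potentially invalid questions
chance = 0.25            # 4-option multiple choice
ceiling = (1 - invalid_fraction) * 1.0 + invalid_fraction * chance
print(f"Implied accuracy ceiling: {ceiling:.1%}")  # ~94.0%
```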
Key Challenges
Persistent Difficulty Areas
Models consistently struggle with questions requiring:
1. Specialized Procedural Knowledge: Multi-step experimental procedures
2. Non-standard Computation: Problems requiring unusual mathematical approaches
3. Domain-Specific Intuition: Questions needing field-specific heuristics
4. Integration Across Subfields: Problems combining multiple specialized areas
5. Organic Chemistry: Particularly challenging for current AI systems
Question Validity Analysis
Recent investigation of the most challenging questions revealed:
- Approximately 8% may have validity issues
- Most "impossible" questions are actually valid but require specialized knowledge
- Organic chemistry questions show highest difficulty-to-validity ratio
- Epoch AI's review of the 6 hardest questions estimated that, in expectation, 2.25 of them are invalid
Applications and Impact
Research Applications
GPQA Diamond serves several critical research purposes:
- Scalable Oversight: Testing methods for supervising superhuman AI
- Capability Evaluation: Measuring progress toward expert-level AI
- Safety Research: Understanding AI limitations in high-stakes domains
- Alignment Research: Ensuring AI systems provide truthful information
Practical Implications
Scientific Research
- Evaluating AI readiness for research assistance
- Identifying knowledge gaps in AI systems
- Benchmarking progress toward AI scientists
Education
- Testing AI tutoring capabilities at graduate level
- Developing advanced educational AI systems
- Understanding limits of AI in specialized education
Industry Applications
- Assessing AI for technical consulting
- Evaluating domain-specific AI assistants
- Benchmarking enterprise AI capabilities
Limitations and Criticisms
Format Limitations
- Multiple Choice Format: May not reflect real scientific reasoning
- Static Questions: Vulnerable to memorization over time
- Limited Scope: Only covers three scientific domains
- English Only: Language limitation
Evaluation Concerns
- Small Dataset Size: 198 questions limits statistical power (quantified in the sketch after this list)
- Question Quality: ~8% potentially problematic questions
- Saturation Risk: Approaching performance ceiling with 92.4% SOTA
- Lack of Explanation Evaluation: Only tests final answers
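The statistical-power concern can be made concrete: with 198 questions, a measured accuracy carries a binomial standard error of roughly sqrt(p(1-p)/198), about 2.8 percentage points at p = 0.8, so models scoring within a few points of each other are hard to distinguish reliably. A minimal sketch:

```python
# Minimal sketch: 95% confidence interval for an accuracy measured on 198 questions.
# Uses the normal-approximation (Wald) interval for a binomial proportion.
import math

def accuracy_ci(p_hat: float, n: int = 198, z: float = 1.96):
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se

lo, hi = accuracy_ci(0.80)
print(f"80.0% on 198 questions -> 95% CI roughly [{lo:.1%}, {hi:.1%}]")  # ~[74.4%, 85.6%]
```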
Future Directions
Proposed Improvements
Researchers have suggested several enhancements:
1. Remove Multiple Choice: Test free-form answer generation
2. Skill-Based Classification: Group questions by required competencies
3. Dynamic Question Generation: Create new questions automatically
4. Research Assistant Tasks: Test practical scientific work capabilities
5. Multi-modal Questions: Include diagrams, graphs, and equations
Next Generation Benchmarks
Potential successors to GPQA Diamond might include:
- Interactive problem-solving tasks
- Real laboratory procedure simulations
- Literature review and synthesis tasks
- Novel hypothesis generation challenges
Related Benchmarks
- MMLU: Broader knowledge benchmark including science
- ScienceQA: Elementary to high school science questions
- PubMedQA: Biomedical literature comprehension
- ChemBench: Chemistry-specific benchmark
- PhysicsBench: Physics problem-solving benchmark
- AIME: Mathematical reasoning benchmark
- ARC: Science reasoning challenge
See Also
- Scalable Oversight
- Expert-Level AI
- Scientific AI
- AI Safety Benchmarks
- Superhuman AI
- Chain-of-Thought Reasoning
- AI Calibration