| GPQA Diamond | |
|---|---|
| Overview | |
| Full name | Graduate-Level Google-Proof Q&A Benchmark - Diamond Subset |
| Abbreviation | GPQA Diamond |
| Description | A challenging subset of graduate-level, Google-proof science questions testing PhD-level knowledge in biology, physics, and chemistry |
| Release date | 2023-11-20 |
| Latest version | 1.0 |
| Benchmark updated | 2023-11-20 |
| Authors | David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, Samuel R. Bowman |
| Organization | New York University, Anthropic |
| Technical Details | |
| Type | Scientific Reasoning, Expert Knowledge |
| Modality | Text |
| Task format | Multiple choice |
| Number of tasks | 198 |
| Total examples | 198 |
| Evaluation metric | Accuracy (zero-shot and chain-of-thought protocols) |
| Domains | Biology, Physics, Chemistry |
| Languages | English |
| Performance | |
| Human performance | 65% (PhD experts), 34% (skilled non-experts) |
| Baseline | 39% (GPT-4) |
| SOTA score | 92.4% |
| SOTA model | Aristotle X1 Verify (Autopoiesis Sciences) |
| SOTA date | 2025-01 |
| Saturated | Near saturation |
| Resources | |
| Paper | Paper |
| GitHub | Repository |
| Dataset | Download |
| Predecessor | GPQA (full set) |
GPQA Diamond is a challenging AI benchmark consisting of 198 PhD-level multiple-choice questions in biology, physics, and chemistry. Released on November 20, 2023, it represents the most difficult subset of the Graduate-Level Google-Proof Q&A Benchmark (GPQA), specifically designed to test artificial intelligence systems on questions that require deep scientific expertise and cannot be easily answered through web searches.
GPQA Diamond was created to address a critical challenge in AI development: how to supervise and validate AI systems that may exceed human capabilities in specialized domains. The benchmark consists of questions where PhD experts achieve 65% accuracy (74% when discounting clear mistakes identified in retrospect), while skilled non-experts with unrestricted internet access only achieve 34% accuracy despite spending over 30 minutes per question on average.
The primary motivation behind GPQA Diamond is to enable scalable oversight experiments for future AI systems. As AI capabilities approach and potentially surpass human expertise in scientific domains, researchers need robust methodologies for human experts to effectively supervise and validate AI outputs. The significant performance gap between experts and non-experts on GPQA Diamond makes it ideal for testing oversight methods that could help humans reliably obtain truthful information from superhuman AI systems.
A defining characteristic of GPQA Diamond is its "Google-proof" nature: questions are written so that their answers cannot be found through straightforward web searches, and skilled non-experts remain far below expert accuracy even with unrestricted internet access.
GPQA Diamond consists of 198 questions selected from the larger GPQA dataset of 448 questions based on difficulty and quality criteria:
| Domain | Number of Questions | Subdisciplines Covered |
|---|---|---|
| Biology | ~66 | Molecular biology, Genetics, Biochemistry, Cell biology, Ecology |
| Physics | ~66 | Quantum mechanics, Statistical mechanics, Electromagnetism, Classical mechanics |
| Chemistry | ~66 | Organic chemistry, Physical chemistry, Inorganic chemistry, Analytical chemistry |
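For hands-on inspection, the Diamond subset can be loaded programmatically. The following is a minimal sketch, assuming access to the gated Idavidrein/gpqa dataset on Hugging Face, a "gpqa_diamond" configuration with a single train split, and column names such as "Question", "Correct Answer", and "High-level domain"; adjust if the actual schema differs.

```python
# Minimal sketch: load GPQA Diamond and tally questions per domain.
# Assumptions: the gated Hugging Face dataset "Idavidrein/gpqa" with a
# "gpqa_diamond" config, a "train" split, and a "High-level domain" column.
from collections import Counter

from datasets import load_dataset

diamond = load_dataset("Idavidrein/gpqa", "gpqa_diamond", split="train")

print(f"Total questions: {len(diamond)}")       # expected: 198
print(Counter(diamond["High-level domain"]))    # Biology / Physics / Chemistry counts
```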
The questions were developed through a rigorous multi-stage process:
1. Expert Creation: Questions written by domain experts with PhDs or pursuing PhDs
2. Difficulty Calibration: Questions range from "hard undergraduate" to "post-graduate level"
3. Validation Process: Multiple rounds of review by independent experts
4. Non-Expert Testing: Skilled validators attempt questions with web access
5. Diamond Selection: Most challenging and high-quality questions selected for the Diamond subset
| Metric / Baseline | Description | Notes |
|---|---|---|
| Zero-shot Accuracy | Direct answer selection without examples | Standard evaluation protocol |
| Chain-of-Thought (CoT) | Step-by-step reasoning before answer | Common for reasoning models |
| Few-shot Learning | Providing example questions and answers | Used in some evaluations |
| Expert Baseline | PhD holders in relevant domain | 65% accuracy benchmark |
| Non-Expert Baseline | Skilled individuals with web access | 34% accuracy benchmark |
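To make the zero-shot protocol concrete, here is a minimal evaluation sketch. It is not the authors' official harness: query_model is a hypothetical stand-in for whatever model API is under test, and the column names follow the Hugging Face schema assumed earlier.

```python
# Sketch of a zero-shot multiple-choice evaluation loop for GPQA Diamond.
# query_model() is a hypothetical placeholder for the model under test;
# column names ("Question", "Correct Answer", "Incorrect Answer 1..3") are assumed.
import random

def format_prompt(row, rng):
    options = [row["Correct Answer"],
               row["Incorrect Answer 1"],
               row["Incorrect Answer 2"],
               row["Incorrect Answer 3"]]
    rng.shuffle(options)                                  # avoid positional bias
    correct_letter = "ABCD"[options.index(row["Correct Answer"])]
    lettered = "\n".join(f"({letter}) {text}" for letter, text in zip("ABCD", options))
    prompt = (f"{row['Question']}\n\n{lettered}\n\n"
              "Answer with a single letter (A, B, C, or D).")
    return prompt, correct_letter

def evaluate(dataset, query_model, seed=0):
    rng = random.Random(seed)
    n_correct = 0
    for row in dataset:
        prompt, answer = format_prompt(row, rng)
        reply = query_model(prompt).strip().upper()       # e.g. "C"
        n_correct += int(reply[:1] == answer)
    return n_correct / len(dataset)                       # accuracy in [0, 1]
```

A chain-of-thought variant would instead ask the model to reason step by step and then extract the final letter from its response.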
As of 2025, the following models have achieved notable performance on GPQA Diamond:
| Rank | Model | Accuracy (%) | Organization | Date |
|---|---|---|---|---|
| 1 | Aristotle X1 Verify | 92.4 | Autopoiesis Sciences | January 2025 |
| 2 | xAI Grok 4 Heavy | 88.9 | xAI | 2025 |
| 3 | Gemini 2.5 Pro | 86.4 | Google DeepMind | 2025 |
| 4 | OpenAI o3 | 83.3 | OpenAI | December 2024 |
| 5 | Claude Sonnet 4 | 78.2 | Anthropic | August 2025 |
| 6 | OpenAI o1 | 78.0 | OpenAI | September 2024 |
| 7 | Claude 3 Opus | ~60 | Anthropic | March 2024 |
| 8 | Claude 3.5 Sonnet | 59.4 | Anthropic | June 2024 |
| 9 | GPT-4 (baseline) | 39.0 | OpenAI | November 2023 |
Note: Some scores may vary depending on evaluation methodology and date of testing.
The top-performing system, Aristotle X1 Verify by Autopoiesis Sciences, achieved 92.4% accuracy while also solving a critical AI challenge: calibration. Unlike most AI systems whose confidence scores don't align with actual accuracy, Aristotle X1 embeds systematic doubt into every layer of reasoning, achieving both high accuracy and reliable confidence estimates. The system also achieved 96.1% on SimpleQA, OpenAI's factuality benchmark.
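Calibration here means that a model's stated confidence matches its empirical accuracy. The sketch below is a generic illustration of how calibration is commonly measured via expected calibration error (ECE); it is not Autopoiesis Sciences' method, which has not been published in detail.

```python
# Generic sketch: expected calibration error (ECE) with equal-width confidence bins.
# confidences: the model's stated probability of being correct for each question.
# correct: 1 if the model's answer was right, else 0.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap        # weight gap by fraction of samples in bin
    return ece

# A well-calibrated model's 80%-confidence answers are right about 80% of the time.
print(expected_calibration_error([0.9, 0.8, 0.6, 0.95], [1, 1, 0, 1]))
```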
The benchmark has seen dramatic improvement: from GPT-4's 39% baseline at release in November 2023 to scores above 90% by early 2025. This is more than a doubling of performance in just over a year, with recent models approaching and exceeding human expert performance.
Analysis also reveals significant variation across scientific domains, with chemistry, and organic chemistry in particular, remaining the hardest area for current models.
As of 2025, GPQA Diamond shows signs of near saturation:
| Indicator | Status | Implications |
|---|---|---|
| Score Clustering | Models clustered around 80-90% | Approaching ceiling effect |
| Consistent Failures | ~20% of questions consistently missed | Potential benchmark limitations |
| Error Analysis | ~8% of questions potentially invalid | Some noise in dataset |
| Performance Plateau | Diminishing returns on improvements | Near-saturation |
Analysis by Epoch AI suggests approximately 8% of questions may have validity issues, with most "impossible" questions actually being valid but requiring specialized knowledge. The benchmark appears to have ~90% valid questions, with models struggling due to legitimate challenges rather than question errors.
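A rough back-of-the-envelope calculation shows why this matters for saturation. Under the illustrative assumption that a model answers every valid question correctly and guesses uniformly among the four options on the roughly 8% of potentially flawed items, the practical ceiling would be:

$$\text{ceiling} \approx \underbrace{0.92}_{\text{valid items correct}} + \underbrace{0.08 \times 0.25}_{\text{chance on flawed items}} \approx 0.94$$

On that reading, the best reported 2025 scores already sit within a couple of percentage points of what the benchmark can reliably measure.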
Models consistently struggle with questions requiring:
1. Specialized Procedural Knowledge: Multi-step experimental procedures
2. Non-standard Computation: Problems requiring unusual mathematical approaches
3. Domain-Specific Intuition: Questions needing field-specific heuristics
4. Integration Across Subfields: Problems combining multiple specialized areas
5. Organic Chemistry: Particularly challenging for current AI systems
Recent investigation of the most challenging questions supports this picture: the hardest items tend to be valid but demand highly specialized knowledge rather than containing outright errors.
GPQA Diamond serves several critical research purposes, most notably scalable oversight experiments in which non-expert humans must supervise and validate models whose domain expertise exceeds their own.
Researchers have suggested several enhancements:
1. Remove Multiple Choice: Test free-form answer generation (see the sketch below)
2. Skill-Based Classification: Group questions by required competencies
3. Dynamic Question Generation: Create new questions automatically
4. Research Assistant Tasks: Test practical scientific work capabilities
5. Multi-modal Questions: Include diagrams, graphs, and equations
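As a rough illustration of the first suggestion, the sketch below removes the answer options and grades a free-form response with a naive normalized string match. The function names are hypothetical, and a real free-form evaluation would almost certainly require expert or model-based grading.

```python
# Hypothetical sketch of free-form (no multiple choice) grading for GPQA-style items.
# Naive normalized substring matching is shown only for illustration; real grading
# would need expert or model-based judging, since answers are rarely single tokens.
import re

def normalize(text: str) -> str:
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()

def grade_free_form(model_answer: str, reference_answer: str) -> bool:
    return normalize(reference_answer) in normalize(model_answer)

# Usage with a hypothetical model API:
# reply = query_model(row["Question"])                  # question only, no options
# is_correct = grade_free_form(reply, row["Correct Answer"])
```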
Potential successors to GPQA Diamond might build on the enhancements listed above, pairing harder questions with formats less prone to saturation.