GPQA
| GPQA | |
|---|---|
| Overview | |
| Full name | Graduate-Level Google-Proof Q&A Benchmark |
| Abbreviation | GPQA |
| Description | A challenging dataset of graduate-level multiple-choice questions designed to be "Google-proof" and test deep domain knowledge |
| Release date | 2023-11-20 |
| Latest version | 1.0 |
| Benchmark updated | 2023-11 |
| Authors | David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, Samuel R. Bowman |
| Organization | New York University |
| Technical Details | |
| Type | Scientific Reasoning, Graduate-Level Knowledge, Expert Q&A |
| Modality | Text |
| Task format | Multiple-choice (4 options) |
| Number of tasks | 3 domains |
| Total examples | 448 (main), 546 (extended), 198 (diamond) |
| Evaluation metric | Accuracy |
| Domains | Biology, Physics, Chemistry |
| Languages | English |
| Performance | |
| Human performance | 65% (PhD experts), 34% (non-experts with web access) |
| Baseline | 39% (GPT-4, 2023) |
| SOTA score | 87.5% |
| SOTA model | Grok 4 |
| SOTA date | 2025 |
| Saturated | No |
| Resources | |
| Website | Official website |
| Paper | Paper |
| GitHub | Repository |
| Dataset | Download |
| License | MIT |
GPQA (Graduate-Level Google-Proof Q&A Benchmark) is a challenging benchmark dataset of graduate-level multiple-choice questions designed to test advanced scientific knowledge and reasoning capabilities in artificial intelligence systems. Released in November 2023 by researchers at New York University[1], GPQA addresses the challenge of evaluating AI systems on questions that cannot be easily answered through web searches, requiring deep domain expertise and complex multi-step reasoning. The benchmark features questions written and validated by PhD-level domain experts across biology, physics, and chemistry, with a notable performance gap between experts (65%) and non-experts with internet access (34%).
Overview
GPQA represents a significant advancement in evaluating AI systems' capabilities on expert-level knowledge that goes beyond simple information retrieval. The benchmark's defining characteristic is its "Google-proof" nature: questions are specifically crafted to resist being solved through web searches, even when the solver has unlimited time and internet access. This design ensures that models must demonstrate genuine understanding and reasoning rather than sophisticated search and retrieval[1].
Significance
The development of GPQA addresses several critical challenges in AI evaluation:
- **Scalable Oversight**: Tests AI systems on knowledge that may exceed typical human capabilities
- **Expert-Level Evaluation**: Provides genuine graduate-level assessment beyond undergraduate benchmarks
- **Reasoning vs. Retrieval**: Distinguishes between information lookup and deep understanding
- **Future-Proofing**: Designed for AI systems that may surpass human expertise
- **Scientific Rigor**: Questions validated by domain experts ensure accuracy and relevance
Dataset Structure
Dataset Variants
GPQA consists of three carefully curated variants, each serving a different evaluation purpose (a loading sketch follows the table):
| Variant | Number of Questions | Description | Primary Use |
|---|---|---|---|
| **GPQA Extended** | 546 | Complete dataset including all validated questions | Comprehensive evaluation |
| **GPQA Main** | 448 | Primary dataset excluding preliminary questions | Standard benchmarking |
| **GPQA Diamond** | 198 | Most challenging subset | Discrimination between expert and non-expert performance |
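As a practical illustration, the three variants can be loaded with the Hugging Face `datasets` library. The repository id `Idavidrein/gpqa` and the config names below are assumptions based on the public upload; the dataset is gated, so access approval and authentication are required first.

```python
# Sketch: loading the GPQA variants via the Hugging Face `datasets` library.
# Repository id and config names are assumptions based on the public upload;
# the dataset is gated, so prior access approval and a login token are needed.
from datasets import load_dataset

main = load_dataset("Idavidrein/gpqa", "gpqa_main", split="train")
extended = load_dataset("Idavidrein/gpqa", "gpqa_extended", split="train")
diamond = load_dataset("Idavidrein/gpqa", "gpqa_diamond", split="train")

print(len(main), len(extended), len(diamond))  # expected: 448, 546, 198
```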
Domain Distribution
The questions span three core scientific disciplines:
| Domain | Approximate Coverage | Topics Included | Expertise Required |
|---|---|---|---|
| **Biology** | ~33% | Molecular biology, genetics, biochemistry, cell biology | Graduate/PhD level |
| **Physics** | ~33% | Quantum mechanics, statistical mechanics, relativity, particle physics | Graduate/PhD level |
| **Chemistry** | ~33% | Organic chemistry, physical chemistry, inorganic chemistry | Graduate/PhD level |
Question Characteristics
Each GPQA question exhibits several defining features[1]:
- **Length**: Extended problem statements requiring careful parsing
- **Complexity**: Multi-step reasoning chains necessary for solution
- **Precision**: Unambiguous correct answers validated by experts
- **Difficulty**: Challenging even for PhD-level domain experts
- **Independence**: Cannot be solved through simple web searches
Creation Methodology
Expert Question Writing
The creation process involved rigorous expert participation:
| Stage | Participants | Process | Outcome |
|---|---|---|---|
| **Writing** | PhD students/holders | Create original graduate-level questions | Initial question pool |
| **Validation** | Domain experts | Review for accuracy and clarity | Validated questions |
| **Testing** | Non-experts with web access | Attempt solutions with internet | Difficulty verification |
| **Refinement** | Original authors | Address feedback and ambiguities | Final questions |
Quality Control
GPQA employs multiple quality assurance mechanisms:
1. **Expert Agreement**: Multiple experts must agree on the correct answer
2. **Non-Expert Testing**: Ensures questions are genuinely difficult
3. **Web Search Resistance**: Verified through non-expert attempts with internet access
4. **Clarity Review**: Questions refined to eliminate ambiguity
5. **Answer Validation**: All answers double-checked by domain experts
Diamond Subset Creation
The GPQA Diamond subset represents the most discriminative questions:
Selection Criteria
Questions qualify for the Diamond subset if[1] (a minimal filtering sketch follows this list):
- Both expert validators answered correctly
- No more than one out of three non-experts answered correctly
- Clear separation between expert and non-expert performance
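A minimal sketch of this filter, assuming each question carries correctness flags for its two expert and three non-expert validators (the record structure and function name are hypothetical, not part of the released dataset):

```python
# Hypothetical validation record: correctness flags for 2 expert and 3 non-expert
# validators per question, mirroring the criteria listed above.
def is_diamond(expert_correct: list[bool], non_expert_correct: list[bool]) -> bool:
    """Return True if a question qualifies for the Diamond subset."""
    both_experts_correct = all(expert_correct)               # both expert validators right
    few_non_experts_correct = sum(non_expert_correct) <= 1   # at most 1 of 3 non-experts right
    return both_experts_correct and few_non_experts_correct


# Example: both experts correct, one of three non-experts correct -> qualifies
print(is_diamond([True, True], [True, False, False]))  # True
```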
Performance Characteristics
| Group | Diamond Performance | Main Dataset Performance | Notes |
|---|---|---|---|
| **PhD Experts** | 69.7% | 65% | Consistent high performance |
| **Non-Experts + Web** | <33% | 34% | Significant difficulty |
| **Random Baseline** | 25% | 25% | Four-option multiple choice |
Human Performance Analysis
Expert Performance
Detailed analysis of expert performance reveals:
| Metric | Value | Notes |
|---|---|---|
| **Overall Accuracy** | 65% | Across all domains |
| **Adjusted Accuracy** | 74% | Excluding identified errors |
| **Time per Question** | 15-30 minutes | Typical expert solving time |
| **Confidence** | High | Experts generally confident in answers |
| **Diamond Subset** | 69.7% | Slightly higher on curated subset |
Non-Expert Performance
Non-expert validators with unrestricted internet access achieved:
| Metric | Value | Notes |
|---|---|---|
| **Overall Accuracy** | 34% | Despite >30 minutes per question |
| **Web Search Success** | Minimal | Questions resist Google searches |
| **Time Investment** | >30 minutes | Extended effort with internet |
| **Strategy** | Varied | Combined search, reasoning, guessing |
Model Performance Evolution
Current Leaderboard (2024-2025)
Performance on GPQA Diamond subset:
| Rank | Model | Accuracy | Date | Key Innovation |
|---|---|---|---|---|
| 1 | Grok 4 | 87.5% | 2025 | Advanced reasoning |
| 2 | GPT-5 | 87.3% | 2025 | Scaled architecture |
| 3 | Gemini 2.5 Pro | 86.4% | 2025 | Multimodal training |
| 4 | Claude 3.7 Sonnet (Thinking) | 84.8% | 2025 | Extended reasoning |
| 5 | Grok 3 (Beta) | 84.6% | 2025 | Efficient reasoning |
| 6 | Claude Sonnet 4 | 83.8% | 2025 | Improved architecture |
| 7 | OpenAI o3 | 83.3% | 2024 | Reasoning chains |
Performance Progression
The rapid improvement in model performance on GPQA:
| Period | Best Performance | Leading Model | Improvement Rate |
|---|---|---|---|
| Nov 2023 | 39% | GPT-4 | Baseline |
| Early 2024 | 50.4% | Claude 3 Opus | +11.4% |
| Mid 2024 | 59.4% | Claude 3.5 Sonnet | +9% |
| Early 2025 | 84.8% | Claude 3.7 Sonnet | +25.4% |
| Mid 2025 | 87.5% | Grok 4 | +2.7% |
Evaluation Methodology
Standard Evaluation Protocols
GPQA supports multiple evaluation approaches[1]:
| Method | Description | Typical Performance | Use Case |
|---|---|---|---|
| **Zero-shot** | Direct answering without examples | Lower | Baseline capability |
| **Zero-shot CoT** | Step-by-step reasoning | Medium | Reasoning assessment |
| **5-shot CoT** | Five examples with reasoning | Higher | Best performance |
| **Answer-only** | No reasoning required | Lowest | Minimal baseline |
Implementation Details
Standard evaluation setup (an illustrative sketch; the `gpqa` helper module and the `model` object are placeholders rather than a published API):

```python
# Example GPQA evaluation (illustrative; `gpqa` and `model` are placeholders)
from gpqa import GPQADataset, evaluate

# Load the Diamond subset and answer it with zero-shot chain-of-thought prompting
dataset = GPQADataset(variant="diamond")
model_responses = model.generate(
    dataset.questions, method="zero-shot-cot", temperature=0.0
)

# Score the responses against the gold answers
accuracy = evaluate(model_responses, dataset.answers)
```
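In practice, a harness also has to turn each item into a four-option multiple-choice prompt (shuffling where the correct answer appears) and extract the chosen letter from the model's free-form response. A minimal sketch, with illustrative function and parameter names:

```python
import random
import re

LETTERS = "ABCD"


def build_prompt(question: str, correct: str, incorrect: list[str], seed: int) -> tuple[str, str]:
    """Shuffle the four answer options and return (prompt_text, gold_letter)."""
    options = [correct] + incorrect
    random.Random(seed).shuffle(options)  # pass a per-item seed, e.g. the question index
    gold_letter = LETTERS[options.index(correct)]
    lines = [question, ""] + [f"({letter}) {text}" for letter, text in zip(LETTERS, options)]
    lines.append("\nThink step by step, then answer with a single letter (A-D).")
    return "\n".join(lines), gold_letter


def extract_letter(response: str) -> str | None:
    """Take the last standalone A-D letter in the response as the model's answer."""
    matches = re.findall(r"\b([A-D])\b", response)
    return matches[-1] if matches else None
```

Seeding the shuffle per item keeps the evaluation deterministic while avoiding a fixed position for the correct answer, which models could otherwise exploit.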
Key Features and Innovations
Google-Proof Design
The benchmark's resistance to web searches is achieved through:
1. **Original Problems**: Novel questions not found online
2. **Complex Integration**: Requires synthesizing multiple concepts
3. **Deep Reasoning**: Multi-step derivations necessary
4. **Specialized Knowledge**: Graduate-level expertise required
5. **Time Constraints**: Even with web access, solutions take >30 minutes
Scalable Oversight Application
GPQA serves as a testbed for scalable oversight research:
- **Superhuman Domains**: Questions approaching limits of human expertise
- **Validation Methods**: Testing approaches for verifying AI outputs
- **Trust Calibration**: Understanding when to rely on AI vs. human judgment
- **Safety Research**: Ensuring AI systems provide reliable expert-level advice
Limitations and Considerations
Current Limitations
| Limitation | Description | Impact |
|---|---|---|
| **Domain Coverage** | Limited to physics, chemistry, biology | Excludes other fields |
| **Format Restriction** | Multiple-choice only | Doesn't test generation |
| **English Only** | No multilingual support | Limited accessibility |
| **Static Dataset** | Fixed question set | Risk of contamination |
| **Size** | Relatively small (448 main / 198 Diamond questions) | Statistical limitations (see the sketch below) |
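The small size matters when comparing leaderboard entries: on the 198-question Diamond set, differences of a point or two fall within statistical noise. A rough sketch of a 95% confidence interval for an accuracy score, using the normal approximation to the binomial:

```python
import math


def accuracy_confidence_interval(accuracy: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% confidence interval for an accuracy measured on n questions."""
    half_width = z * math.sqrt(accuracy * (1.0 - accuracy) / n)
    return accuracy - half_width, accuracy + half_width


# On GPQA Diamond (n = 198), an 87.5% score carries roughly a +/-4.6 point interval,
# so a gap like 87.5% vs. 87.3% between models is not statistically meaningful.
print(accuracy_confidence_interval(0.875, 198))  # ~ (0.829, 0.921)
```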
Ethical Considerations
- **Data Contamination**: The authors ask that question text not be republished openly, to reduce the risk of training-data contamination
- **Expert Verification**: Ensures accuracy in critical scientific domains
- **Access Control**: Password protection maintains dataset integrity
Research Impact
Applications
GPQA has influenced several research directions:
| Application | Description | Status |
|---|---|---|
| **Scalable Oversight** | Methods for supervising superhuman AI | Active research |
| **Scientific AI** | Developing expert-level scientific assistants | Ongoing |
| **Evaluation Methods** | New approaches to testing expert knowledge | Expanding |
| **Reasoning Research** | Understanding deep scientific reasoning | Advancing |
Related Benchmarks
| Benchmark | Focus | Difficulty | Comparison to GPQA |
|---|---|---|---|
| MMLU | Broad knowledge | Undergraduate | Easier, broader |
| ScienceQA | K-12 science | Elementary/High school | Much easier |
| ARC | Science reasoning | Elementary | Basic level |
| GPQA | Graduate science | PhD level | Most challenging |
Future Directions
Planned Extensions
1. **Domain Expansion**: Adding mathematics, computer science, engineering
2. **Format Diversity**: Free-response and proof-based questions
3. **Multilingual Versions**: Translations for global accessibility
4. **Dynamic Generation**: Procedurally generated questions
5. **Multimodal Integration**: Including diagrams and equations
Significance
GPQA represents a crucial benchmark for evaluating AI systems at the frontier of human expertise. By focusing on graduate-level questions that resist simple web searches, it provides genuine assessment of deep scientific understanding and reasoning. The dramatic performance improvements from 39% to 87.5% in less than two years demonstrate rapid progress in AI capabilities while highlighting the remaining gap to consistent expert-level performance.
The benchmark's emphasis on "Google-proof" questions and scalable oversight makes it particularly valuable for developing AI systems that can reliably assist with expert-level tasks where human verification may be challenging. As AI systems approach and potentially exceed human expert performance on GPQA, the benchmark serves as both a measure of progress and a tool for understanding how to safely deploy advanced AI in scientific domains.
See Also
- Scalable Oversight
- Scientific Reasoning Benchmarks
- MMLU
- Graduate-Level AI Evaluation
- Chain-of-Thought Prompting
References
1. Rein, D., Hou, B. L., Stickland, A. C., Petty, J., Pang, R. Y., Dirani, J., Michael, J., & Bowman, S. R. (2023). "GPQA: A Graduate-Level Google-Proof Q&A Benchmark." arXiv:2311.12022.