| GPQA Diamond | |
|---|---|
| Overview | |
| Full name | Graduate-Level Google-Proof Q&A Benchmark, Diamond Subset |
| Abbreviation | GPQA Diamond |
| Description | A challenging subset of graduate-level, Google-proof science questions testing PhD-level knowledge in biology, physics, and chemistry |
| Release date | 2023-11-20 |
| Latest version | 1.0 |
| Benchmark updated | 2023-11-20 |
| Authors | David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, Samuel R. Bowman |
| Organization | New York University, Anthropic |
| Technical Details | |
| Type | Scientific Reasoning, Expert Knowledge |
| Modality | Text |
| Task format | Multiple choice (4 options) |
| Number of tasks | 198 |
| Total examples | 198 |
| Evaluation metric | Accuracy (typically zero-shot chain-of-thought, pass@1) |
| Domains | Biology, Physics, Chemistry |
| Languages | English |
| Performance | |
| Human performance | 81.3% (Diamond expert validators), 21.9% (non-experts) |
| Baseline | 38.8% (GPT-4, few-shot CoT) |
| SOTA score | ~94.1% |
| SOTA model | Gemini 3.1 Pro Preview (Google DeepMind) |
| SOTA date | 2026-02 |
| Saturated | Near saturation |
| Resources | |
| Paper | Paper |
| GitHub | Repository |
| Dataset | Download |
| Predecessor | GPQA (full set, 448 questions) |
**GPQA Diamond** is a challenging AI benchmark consisting of 198 PhD-level multiple-choice questions in biology, physics, and chemistry. Released on November 20, 2023, it represents the most difficult subset of the Graduate-Level Google-Proof Q&A Benchmark (GPQA), specifically designed to test artificial intelligence systems on questions that require deep scientific expertise and cannot be easily answered through web searches. The benchmark was created by David Rein and collaborators at New York University and Anthropic, and has become one of the most widely cited evaluations for measuring scientific reasoning in large language models.
GPQA Diamond has gained particular prominence because it occupies a unique position in the AI evaluation landscape: it tests knowledge and reasoning at a level where even human PhD experts achieve only about 65% accuracy on the broader GPQA set (81.3% on the Diamond subset due to selection effects), while skilled non-experts with unrestricted internet access score just 21.9% on the Diamond questions. This wide expertise gap makes GPQA Diamond especially useful for studying scalable oversight, the problem of how humans can supervise AI systems that may eventually surpass human capabilities in specialized domains.
The primary motivation behind GPQA Diamond is rooted in one of the most pressing challenges in AI safety: scalable oversight. As AI systems grow more capable, they increasingly operate in domains where human supervisors cannot independently verify the correctness of AI outputs. A doctor reviewing an AI's diagnosis might lack the specialized knowledge to confirm whether the AI's reasoning in a rare subspecialty is sound. A policy analyst relying on AI for technical climate modeling may not be equipped to catch subtle errors in the underlying physics.
David Rein and his co-authors framed GPQA as a testbed for studying this problem. The benchmark was designed so that questions fall in a "sweet spot" of difficulty: hard enough that non-experts cannot simply look up the answer, but structured enough that domain experts can reliably identify the correct response. This gap between expert and non-expert performance creates realistic conditions for experimenting with oversight techniques like debate, market-making, and recursive reward modeling, all of which aim to help less-expert humans extract truthful answers from AI systems.
Before GPQA, most science-focused benchmarks occupied either end of the difficulty spectrum. Benchmarks like ScienceQA and ARC tested elementary to high school knowledge, while portions of MMLU covered undergraduate-level material. These benchmarks were valuable for tracking early progress in language model capabilities, but by 2023 frontier models had largely saturated them. At the other extreme, open-ended research tasks (like writing novel proofs or designing experiments) were difficult to evaluate automatically because they lacked clear, verifiable correct answers.
GPQA was designed to fill this gap: questions that are genuinely difficult, require graduate-level expertise across multiple scientific subfields, resist simple information retrieval, and yet have unambiguous correct answers in a multiple-choice format that allows automated scoring.
A distinctive feature of the GPQA dataset is its "Google-proof" nature. The term refers to the observation that skilled non-expert validators, despite having unrestricted access to web searches and academic papers, could not reliably answer the questions. During the validation phase, non-experts spent an average of 37 minutes per question (with a minimum requirement of 15 minutes) searching for information, reading research papers, and attempting to reason through the problems. Despite this effort, they achieved only 33.9% accuracy on the extended set, barely above the 25% random-guessing baseline for four-choice questions, and just 21.9% on the Diamond subset, which falls below chance because Diamond is selected for questions that most non-experts answered incorrectly.
This Google-proof quality arises because the questions typically require specialized, graduate-level domain knowledge and multi-step reasoning that cannot be assembled from straightforward web searches or skimming of research papers.
The GPQA dataset was created by recruiting 61 domain experts through the freelancing platform Upwork. All experts either held a PhD or were actively pursuing one in biology, physics, or chemistry, and indicated proficiency or fluency in English. The creators designed a compensation structure that heavily emphasized quality: the majority of payment came from performance-based bonuses rather than flat fees, with estimated average hourly compensation of approximately $95 per hour and a maximum of $150 per hour.
The creation of each GPQA question followed a rigorous four-stage pipeline:
| Stage | Activity | Participants | Key Requirements |
|---|---|---|---|
| 1. Question Writing | Expert authors write questions with explanations for correct and incorrect answer choices | Domain experts (PhD holders/candidates) | Questions must be difficult, require deep knowledge, and include four plausible answer choices |
| 2. First Expert Validation | A second domain expert attempts the question and provides feedback | Independent expert in the same field | Validator assesses objectivity, accuracy, and difficulty; provides detailed feedback |
| 3. Question Revision | Original author revises the question based on validator feedback | Original question writer | Revision is optional if no changes are suggested |
| 4. Non-Expert Validation | Three non-experts from different domains attempt the question | Skilled validators with PhDs in other fields | Minimum 15 minutes per question; unrestricted web access; average time spent was 37 minutes |
The creators designed an elaborate incentive system to encourage both difficult question writing and honest, careful validation:
Question Writing Compensation:
| Component | Amount | Condition |
|---|---|---|
| Base payment | $10 | Per question submitted |
| Expert validator bonus | $20 per validator | Each expert who answers correctly (max $40) |
| Non-expert difficulty bonus | $15 per validator | Each non-expert who answers incorrectly (max $45) |
| Quality bonus | $30 | Both experts correct AND at least 2 of 3 non-experts incorrect |
Expert Validation Compensation:
| Component | Amount | Condition |
|---|---|---|
| Base payment | $10 | Per question validated |
| Correct answer bonus | $10 | Validator answers correctly |
| Agreement bonus | $10 | Second expert also answers correctly |
| Difficulty bonus | $5 | Majority of non-experts answer incorrectly |
This structure incentivized question writers to produce questions that were genuinely difficult (rewarding non-expert failures) while remaining answerable by experts (rewarding expert successes). It also incentivized validators to answer carefully and provide honest, thorough feedback.
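For concreteness, the question-writing payout can be expressed as a simple function of validation outcomes. The sketch below is illustrative (the function name and input format are not part of the released materials) and encodes the amounts from the tables above:

```python
def question_writer_payout(experts_correct: int, nonexperts_incorrect: int) -> int:
    """Illustrative payout (USD) for one GPQA question under the bonus structure above.

    experts_correct:      number of expert validators (0-2) answering correctly
    nonexperts_incorrect: number of non-expert validators (0-3) answering incorrectly
    """
    payout = 10                                   # base payment per question submitted
    payout += 20 * min(experts_correct, 2)        # $20 per correct expert, max $40
    payout += 15 * min(nonexperts_incorrect, 3)   # $15 per incorrect non-expert, max $45
    # Quality bonus: both experts correct AND at least 2 of 3 non-experts incorrect.
    if experts_correct == 2 and nonexperts_incorrect >= 2:
        payout += 30
    return payout

# A maximally rewarded question: both experts right, all three non-experts wrong.
assert question_writer_payout(2, 3) == 10 + 40 + 45 + 30  # $125
```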
The full GPQA dataset exists in three nested subsets of increasing quality and difficulty:
| Subset | Size | Selection Criteria |
|---|---|---|
| GPQA Extended | 546 questions | All collected questions |
| GPQA Main | 448 questions | At least 1 of 2 experts correct AND at most 2 of 3 non-experts correct |
| GPQA Diamond | 198 questions | Both experts correct AND at most 1 of 3 non-experts correct |
The Diamond subset applies the strictest filters. For a question to qualify, both expert validators had to answer it correctly (or, if the second expert initially answered incorrectly, they had to clearly describe the mistake or demonstrate understanding of the question writer's explanation after seeing the answer). Additionally, the majority of non-experts had to answer incorrectly. This dual requirement ensures that Diamond questions are simultaneously answerable by true domain experts and resistant to non-expert reasoning with internet access.
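The subset construction amounts to applying two predicates to each question's validation record. The sketch below uses hypothetical per-question fields and omits the provision for a second expert who convincingly describes their mistake; it is meant only to illustrate the filtering logic described above:

```python
from typing import List, TypedDict

class ValidationRecord(TypedDict):
    # Hypothetical per-question summary of validation outcomes (not the released schema).
    experts_correct: int      # 0-2 expert validators answering correctly
    nonexperts_correct: int   # 0-3 non-expert validators answering correctly

def in_main_set(q: ValidationRecord) -> bool:
    # Main set: at least 1 of 2 experts correct, at most 2 of 3 non-experts correct.
    return q["experts_correct"] >= 1 and q["nonexperts_correct"] <= 2

def in_diamond_set(q: ValidationRecord) -> bool:
    # Diamond set: both experts correct, at most 1 of 3 non-experts correct.
    return q["experts_correct"] == 2 and q["nonexperts_correct"] <= 1

def split(questions: List[ValidationRecord]):
    main = [q for q in questions if in_main_set(q)]
    diamond = [q for q in questions if in_diamond_set(q)]
    return main, diamond
```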
An additional held-out set of 18 questions was reserved and not publicly released, intended for future validation purposes.
The 198 questions in GPQA Diamond span three scientific domains, though the distribution across subdomains is uneven. Based on the extended set of 546 questions (from which Diamond is drawn), the approximate subdomain breakdown is:
| Domain | Subdomains | Approx. Questions (Extended Set) |
|---|---|---|
| Biology | Molecular Biology (85), Genetics (20) | 105 |
| Physics | Quantum Mechanics (64), High-Energy Particle Physics (46), General Physics (43), Astrophysics (42), Electromagnetism/Photonics (12), Relativistic Mechanics (11), Statistical Mechanics (4), Condensed Matter (4), Optics/Acoustics (1) | 227 |
| Chemistry | Organic Chemistry (144), General Chemistry (64), Inorganic Chemistry (3), Analytical Chemistry (2), Physical Chemistry (1) | 214 |
A notable feature of this distribution is the heavy representation of organic chemistry, which accounts for roughly 26% of all questions in the extended set. This has important implications for model evaluation, as organic chemistry questions turn out to be disproportionately difficult for AI systems.
Each GPQA Diamond question is a text-only, four-option multiple-choice problem. Questions do not include images, diagrams, or graphs (though some questions reference visual concepts that the solver must reason about from text descriptions alone). The average question length is approximately 630 characters (median 561), or about 169 tokens using GPT-4's tokenizer.
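These length statistics can, in principle, be reproduced from the released data with OpenAI's tiktoken tokenizer. The snippet below is a sketch: the Hugging Face dataset identifier, configuration, split, and column name are assumptions about the public release (which is gated behind a usage agreement) and should be checked against it.

```python
from statistics import mean, median

import tiktoken
from datasets import load_dataset

# Dataset id, config, split, and column name are assumptions; verify against the release.
ds = load_dataset("Idavidrein/gpqa", "gpqa_diamond", split="train")
enc = tiktoken.encoding_for_model("gpt-4")

char_lengths = [len(q) for q in ds["Question"]]
token_lengths = [len(enc.encode(q)) for q in ds["Question"]]

print(f"questions: {len(ds)}")
print(f"mean chars: {mean(char_lengths):.0f}, median chars: {median(char_lengths):.0f}")
print(f"mean GPT-4 tokens: {mean(token_lengths):.0f}")
```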
Questions range in stated difficulty from "hard undergraduate" to "post-graduate level," and expert validators provide 4-point difficulty ratings after answering. Analysis shows these ratings are predictive of non-expert accuracy, indicating that experts have a reasonable ability to judge how difficult questions will be for less-specialized audiences.
The standard evaluation protocol for GPQA Diamond uses zero-shot or few-shot prompting with chain-of-thought reasoning:
| Evaluation Method | Description | Common Usage |
|---|---|---|
| Zero-shot CoT | Model reasons step-by-step without example questions | Standard protocol in OpenAI's simple-evals suite |
| Few-shot CoT | Model is given example questions and solutions before the test question | Used in some academic evaluations |
| Zero-shot direct | Model selects an answer without explicit reasoning | Less common; generally yields lower scores |
| Pass@1 | Single attempt accuracy; no majority voting | Most commonly reported metric |
| Consensus / Majority voting | Multiple samples; most common answer selected | Sometimes used but can inflate scores |
GPQA Diamond is included in OpenAI's simple-evals evaluation suite alongside MMLU, MATH, and other standard benchmarks, which has helped establish it as a default evaluation for new model releases across the industry.
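In outline, a zero-shot chain-of-thought evaluation run follows the pattern sketched below. The prompt wording, answer-extraction regex, and `ask_model` call are illustrative rather than a faithful reproduction of simple-evals, and the column names assume a release schema that stores the correct answer and three incorrect answers in separate fields (which is why the options are shuffled per question):

```python
import random
import re

# Illustrative prompt; real harnesses (e.g., simple-evals) use their own template.
PROMPT = (
    "Answer the following multiple-choice question. Think step by step, then finish "
    "with a line of the form 'Answer: X', where X is one of A, B, C, D.\n\n"
    "{question}\n\nA) {a}\nB) {b}\nC) {c}\nD) {d}"
)

def evaluate_pass1(examples, ask_model, seed=0):
    """Single-attempt (pass@1) accuracy; `ask_model` is any text-in/text-out model call."""
    rng = random.Random(seed)
    correct = 0
    for ex in examples:
        options = [ex["Correct Answer"], ex["Incorrect Answer 1"],
                   ex["Incorrect Answer 2"], ex["Incorrect Answer 3"]]
        rng.shuffle(options)                                # randomize option order
        gold = "ABCD"[options.index(ex["Correct Answer"])]  # letter of the correct option
        reply = ask_model(PROMPT.format(question=ex["Question"], a=options[0],
                                        b=options[1], c=options[2], d=options[3]))
        match = re.search(r"Answer:\s*([ABCD])", reply)
        correct += int(bool(match) and match.group(1) == gold)
    return correct / len(examples)
```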
The GPQA paper estimates that approximately 73.6% of questions have objectively verifiable correct answers under a conservative assessment, rising to 76.4% when including cases where validators demonstrated understanding of the intended answer even if they initially selected the wrong option. The Diamond subset is expected to have a higher objectivity rate than the extended set because both expert validators had to agree on the correct answer.
Later analysis by Epoch AI estimated that roughly 90% to 95% of Diamond questions are valid, with approximately 10 to 20 questions (out of 198) potentially having issues such as incorrect answer keys or ambiguous wording.
Human performance on GPQA varies significantly depending on the subset and how "expert" is defined:
| Metric | Extended (546) | Main (448) | Diamond (198) |
|---|---|---|---|
| Expert accuracy | 65.4% | 72.5% | 81.3% |
| Non-expert accuracy | 33.9% | 30.5% | 21.9% |
| Expert-non-expert gap | 31.5 pp | 42.0 pp | 59.4 pp |
The higher expert accuracy on the Diamond subset (81.3% versus 65.4% on Extended) is a selection artifact: Diamond specifically includes questions where both experts answered correctly, so the measured expert accuracy on that subset is inflated by the filtering criteria. The extended set figure of 65.4% (or approximately 74% after discounting clear mistakes identified in retrospect) more accurately reflects typical expert performance on these types of questions.
OpenAI independently recruited PhD-level experts to answer GPQA Diamond questions and reported an expert accuracy of approximately 69.7%, which is consistent with the original paper's findings when accounting for the selection effects in the Diamond subset.
Non-expert validators were not laypeople; they were skilled individuals with PhDs in fields other than the question's domain. They received unrestricted internet access and were required to spend at least 15 minutes per question. On average, they spent 37 minutes per question (median 30 minutes), often reading multiple academic research papers in the attempt to find relevant information. Despite this substantial effort, their accuracy on the Diamond subset was just 21.9%, below the 25% random baseline for four-option questions (an expected consequence of Diamond's selection for questions that most non-experts answered incorrectly).
This near-chance performance among educated, motivated non-experts is what makes GPQA Diamond "Google-proof" and makes it a compelling testbed for scalable oversight research.
Expert and non-expert performance varies by scientific domain:
| Domain | Expert Accuracy (Extended) | Non-Expert Accuracy (Extended) | GPT-4 Few-shot CoT |
|---|---|---|---|
| Biology | 66.7% | 43.2% | 58.1% |
| Physics | 57.3% | 32.5% | 37.0% |
| Chemistry | 72.0% | 31.4% | 31.8% |
Notably, biology had the highest non-expert accuracy (43.2%), suggesting that some biology questions may be more accessible to educated non-specialists. Chemistry showed the widest expertise gap, with experts at 72.0% but non-experts at just 31.4%.
The benchmark has seen dramatic improvements in AI performance since its release:
| Period | Key Development | Best Score |
|---|---|---|
| November 2023 | Initial release; GPT-4 baseline | 38.8% |
| March 2024 | Claude 3 Opus evaluated | ~60% |
| June 2024 | Claude 3.5 Sonnet | 59.4% |
| September 2024 | OpenAI o1 released | 77.3% |
| December 2024 | OpenAI o3 released | 83.3% |
| January 2025 | DeepSeek-R1 released | 71.5% |
| July 2025 | Aristotle X1 Verify (Autopoiesis Sciences) | 92.4% |
| Late 2025 | GPT-5.2 and Gemini 3 Pro | ~92-93% |
| February 2026 | Gemini 3.1 Pro Preview | ~94.1% |
As of early 2026, the following models have achieved notable performance on GPQA Diamond:
| Rank | Model | Accuracy (%) | Organization | Date |
|---|---|---|---|---|
| 1 | Gemini 3.1 Pro Preview | ~94.1 | Google DeepMind | February 2026 |
| 2 | GPT-5.2 | ~92.4 | OpenAI | 2025 |
| 3 | Aristotle X1 Verify | 92.4 | Autopoiesis Sciences | July 2025 |
| 4 | GPT-5.4 (xhigh) | ~92.0 | OpenAI | 2026 |
| 5 | Gemini 3 Pro | ~91.9 | Google DeepMind | 2025 |
| 6 | GPT-5.3 Codex (xhigh) | ~91.5 | OpenAI | 2025 |
| 7 | Claude Opus 4.6 | ~91.0 | Anthropic | 2025 |
| 8 | Claude Opus 4.5 | ~87.0 | Anthropic | 2025 |
| 9 | xAI Grok 4 | ~87.0 | xAI | 2025 |
| 10 | Gemini 2.5 Pro | ~86.4 | Google DeepMind | March 2025 |
| 11 | OpenAI o3 | 83.3 | OpenAI | December 2024 |
| 12 | OpenAI o3-mini-high | 79.7 | OpenAI | 2025 |
| 13 | OpenAI o1 | 77.3 | OpenAI | September 2024 |
| 14 | Claude 3.7 Sonnet (Thinking) | 75.3 | Anthropic | 2025 |
| 15 | DeepSeek R1 | 71.5 | DeepSeek | January 2025 |
| 16 | Claude 3.7 Sonnet | 67.4 | Anthropic | 2025 |
| 17 | Claude 3 Opus | ~60.0 | Anthropic | March 2024 |
| 18 | Claude 3.5 Sonnet | 59.4 | Anthropic | June 2024 |
| 19 | DeepSeek-V3 | 59.1 | DeepSeek | December 2024 |
| 20 | GPT-4 (baseline) | 38.8 | OpenAI | November 2023 |
Note: Scores may vary depending on evaluation methodology (zero-shot vs. few-shot, pass@1 vs. consensus), prompting strategy, and the specific model version tested. Epoch AI found that self-reported scores from major labs generally fall within the confidence interval of independently reproduced evaluations.
A significant inflection point in GPQA Diamond performance came with the introduction of reasoning-focused models. OpenAI's o1 (September 2024) was the first model to substantially exceed human expert performance on this benchmark, scoring 77.3% compared to the approximately 69.7% expert baseline. The o1 model family uses extended chain-of-thought reasoning at inference time, allocating more computation to work through multi-step problems.
This pattern continued with o3 (83.3%) and subsequent reasoning models from multiple labs. The success of reasoning models on GPQA Diamond suggests that the benchmark's difficulty stems partly from the need for careful, multi-step logical chains rather than from a lack of factual knowledge in the training data.
Autopoiesis Sciences' Aristotle X1 Verify system, which achieved 92.4% on GPQA Diamond in July 2025, introduced an additional innovation: calibrated confidence scoring. Unlike most AI systems whose stated confidence levels do not correlate reliably with actual accuracy, Aristotle X1 embeds systematic doubt into every layer of reasoning. The system also achieved 96.1% on SimpleQA, OpenAI's factuality benchmark, suggesting that its approach to verification and confidence estimation transfers across different evaluation contexts.
Epoch AI's detailed analysis of GPQA Diamond revealed a striking pattern: organic chemistry is massively overrepresented among the questions that models consistently answer incorrectly. While organic chemistry accounts for roughly a quarter (about 26%) of the extended question set from which Diamond is drawn, it makes up approximately 70% of the 40 questions that top models (those scoring above 70% overall) consistently get wrong.
Several factors likely contribute to organic chemistry's difficulty for AI systems, chief among them the reliance on precise structural and mechanistic knowledge that must be reasoned about entirely from text descriptions.
Physics questions are generally more tractable for AI models, particularly those employing chain-of-thought reasoning. This is likely because many physics problems can be broken down into explicit mathematical steps: identify the relevant equations, substitute known values, and solve. Reasoning models excel at this type of structured problem-solving, making physics the domain where AI performance tends to be highest.
Subdomain variation exists within physics as well. Quantum mechanics and high-energy particle physics questions, which require knowledge of specialized formalisms and counterintuitive principles, tend to be harder than classical mechanics or electromagnetism questions.
Biology questions show the most variance in difficulty. Molecular biology questions that test specific procedural knowledge (such as the steps of a particular experimental protocol) tend to be very difficult for both AI systems and non-expert humans. However, some genetics and ecology questions are more accessible, contributing to the higher non-expert accuracy observed in the biology domain (43.2% on the extended set).
By early 2026, GPQA Diamond shows clear signs of approaching saturation. Multiple frontier models score above 90%, and the top-performing system (Gemini 3.1 Pro Preview) achieves approximately 94.1%, leaving little room for further improvement. State-of-the-art model scores have been clustering in a narrow band, with diminishing returns on new improvements.
| Saturation Indicator | Observation |
|---|---|
| Score clustering | Multiple models in the 87-94% range |
| Ceiling approach | Top scores within ~6% of perfect accuracy |
| Consistent failure set | ~40 questions (20%) consistently missed by most models |
| Diminishing improvement rate | Score improvements shrinking with each generation |
The question of whether GPQA Diamond is truly saturated, or whether the questions frontier models still miss (roughly 6-13%, depending on the model) reflect genuine model limitations versus flawed questions, has been investigated by several research groups.
Epoch AI conducted a detailed analysis examining the 40 questions that high-performing models (70%+ overall accuracy) most frequently answered incorrectly. Their findings suggest that most of these difficult questions are genuinely valid but require highly specialized knowledge. Of six particularly problematic questions examined in depth (those with sub-5% model accuracy), the expected number of invalid questions was estimated at roughly 2.25. Extrapolating this rate to the full set of 40 difficult questions yields an estimated invalid-question rate of roughly 8% (15 out of 198), though the authors acknowledged significant uncertainty in this extrapolation.
The questions most likely to be invalid included ones involving specialized procedural knowledge that may not have a single correct answer, and at least one question where the intended answer appeared to be incorrect based on independent expert review.
Epoch AI also investigated whether AI labs accurately report their GPQA Diamond scores. By comparing self-reported scores against independently reproduced evaluations, they found that all major labs' self-reported scores fell within the expected confidence interval. The computed p-values were well above 0.05 for all tested models, indicating no statistically significant difference between reported and independently measured performance. Epoch estimates that their independent evaluations can determine true model performance to within 4 to 6 percentage points with 90% confidence, given the 198-question sample size.
Understanding how GPQA Diamond relates to the broader GPQA dataset is important for interpreting benchmark results:
| Property | Extended | Main | Diamond |
|---|---|---|---|
| Number of questions | 546 | 448 | 198 |
| Expert accuracy | 65.4% | 72.5% | 81.3% |
| Non-expert accuracy | 33.9% | 30.5% | 21.9% |
| Selection criteria | None (all collected) | 1+ of 2 experts correct, at most 2 of 3 non-experts correct | Both experts correct, at most 1 of 3 non-experts correct |
| GPT-4 baseline (few-shot CoT) | 38.7% | 39.7% | 38.8% |
The Diamond subset is both harder (for non-experts) and more reliable (in terms of having verifiably correct answers) than the broader sets. This combination of difficulty and quality is why Diamond has become the standard evaluation target rather than the full GPQA set.
Several factors contributed to GPQA Diamond becoming the preferred evaluation subset: its stricter dual validation criteria, the correspondingly higher confidence that its answer keys are correct, and its inclusion in widely used evaluation suites such as OpenAI's simple-evals.
GPQA Diamond's primary intended application is as a testbed for scalable oversight methods. The benchmark's design creates conditions analogous to a scenario where a human supervisor (the non-expert) must evaluate the work of a more capable system (the AI model that outperforms non-experts). Researchers use this setup to test oversight techniques such as debate, market-making, and recursive reward modeling, all of which aim to help less-expert humans extract truthful answers from AI systems.
GPQA Diamond serves as one of the primary benchmarks for measuring progress toward expert-level scientific reasoning in AI. It is now routinely reported in model release announcements from OpenAI, Anthropic, Google DeepMind, and other leading labs. The benchmark's focus on graduate-level science makes it complementary to other widely used evaluations:
| Benchmark | Focus | Difficulty Level | Questions |
|---|---|---|---|
| MMLU | Broad knowledge (57 subjects) | High school to professional | 14,042 |
| MMLU-Pro | Harder version of MMLU | Professional | 12,032 |
| GPQA Diamond | Science (bio, phys, chem) | Graduate to post-graduate | 198 |
| MATH | Mathematics | Competition level | 5,000 |
| HumanEval | Code generation | Professional | 164 |
| AIME | Mathematical reasoning | Competition level | Varies |
Beyond model evaluation, the benchmark has practical implications: it serves as a testbed for scalable oversight research and as a reference point for claims that AI systems have reached expert-level scientific reasoning.
At 198 questions, GPQA Diamond is a relatively small benchmark. This limits the statistical power of comparisons between models, particularly when score differences are small. Epoch AI estimates that independent evaluations can only determine true performance to within 4 to 6 percentage points with 90% confidence. As a result, score differences of less than about 5 percentage points between models may not be statistically meaningful.
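The quoted 4 to 6 percentage-point uncertainty is consistent with simple binomial sampling error over 198 questions, as the back-of-the-envelope normal-approximation check below illustrates for a few hypothetical true accuracies:

```python
from math import sqrt

N = 198      # questions in GPQA Diamond
Z90 = 1.645  # two-sided 90% normal critical value

for p in (0.60, 0.75, 0.90):  # hypothetical true accuracies
    half_width = Z90 * sqrt(p * (1 - p) / N)
    print(f"true accuracy {p:.0%}: 90% CI ≈ ±{100 * half_width:.1f} percentage points")
# Roughly ±3.5 to ±5.7 points across this range, in line with Epoch AI's 4-6 point estimate.
```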
The four-option multiple-choice format, while enabling automated evaluation, may not reflect real-world scientific reasoning. In practice, scientists formulate hypotheses, design experiments, interpret ambiguous data, and synthesize information from multiple sources. The multiple-choice structure reduces these complex cognitive tasks to answer selection, which may overestimate a model's true scientific understanding.
Because GPQA Diamond is a fixed set of 198 questions that has been publicly available since November 2023, there is a risk of data contamination. Models trained on data that includes GPQA questions or their answers (directly or indirectly) may achieve inflated scores that do not reflect genuine reasoning ability. The dataset maintainers included canary strings to detect unauthorized use in training data, but the risk increases over time as the benchmark becomes more widely discussed and analyzed online.
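Checking a corpus for the canary reduces to plain substring matching. The GUID in the sketch below is a placeholder rather than the real GPQA canary, which is published alongside the dataset:

```python
# Placeholder value; substitute the canary GUID distributed with the GPQA dataset.
GPQA_CANARY = "00000000-0000-0000-0000-000000000000"

def corpus_contains_canary(documents) -> bool:
    """Return True if any training document contains the benchmark canary string."""
    return any(GPQA_CANARY in doc for doc in documents)

# Typical usage: scan raw training shards before training and drop (or at least log)
# any document that matches, so benchmark questions do not leak into the model.
```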
GPQA Diamond covers only three scientific domains: biology, physics, and chemistry. Important fields like mathematics, computer science, engineering, earth sciences, and medicine are not represented. This means the benchmark provides only a partial picture of a model's scientific capabilities.
All questions are written in English, limiting the benchmark's applicability to evaluating multilingual scientific reasoning capabilities.
The heavy representation of organic chemistry in the question set (roughly 26% of the extended set, and an even larger share of the hardest questions) means that models' aggregate scores may be disproportionately influenced by performance on a single subdomain. A model that excels at everything except organic chemistry may receive a misleadingly low overall score.
Researchers have suggested several enhancements to address GPQA Diamond's limitations, including broader domain coverage and task formats that go beyond four-option multiple choice.
As GPQA Diamond approaches saturation, the AI evaluation community is developing more challenging successors.