# GPQA Diamond

> Source: https://aiwiki.ai/wiki/gpqa_diamond
> Updated: 2026-06-21
> Categories: AI Benchmarks
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

GPQA Diamond is the hardest, 198-question subset of the Graduate-Level Google-Proof Q&A Benchmark ([GPQA](/wiki/gpqa)), a set of PhD-level multiple-choice questions in biology, physics, and chemistry used to measure scientific reasoning in [large language models](/wiki/large_language_model).[^1] Released on November 20, 2023 by David Rein and collaborators at New York University and [Anthropic](/wiki/anthropic), it is built from questions that domain experts can answer but skilled non-experts cannot: PhD-level experts reach about 65% accuracy on the broader set (and 81.3% on the Diamond subset due to selection), while skilled non-experts with unrestricted web access score only 21.9% on Diamond, slightly below the 25% random-guessing baseline.[^1] As of June 2026 the top independently tracked text-only score is roughly 94.1% (Gemini 3.1 Pro Preview on the Artificial Analysis leaderboard), and the benchmark is now widely considered effectively saturated at the frontier.[^6]

The GPQA authors define the dataset as "a challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry," reporting that "experts who have or are pursuing PhDs in the corresponding domains reach 65% accuracy," while "highly skilled non-expert validators only reach 34% accuracy, despite spending on average over 30 minutes with unrestricted access to the web."[^1] The Diamond subset is the strictest-filtered slice of that dataset and has become the most widely cited science evaluation for frontier models.[^1]

| GPQA Diamond | |
| --- | --- |
| **Overview** | |
| Full name | Graduate-Level Google-Proof Q&A Benchmark, Diamond Subset |
| Abbreviation | GPQA Diamond |
| Description | A challenging subset of graduate-level, Google-proof science questions testing PhD-level knowledge in biology, physics, and chemistry |
| Release date | 2023-11-20 |
| Latest version | 1.0 |
| Benchmark updated | 2023-11-20 |
| Authors | David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, Samuel R. Bowman |
| Organization | New York University, [Anthropic](/wiki/anthropic) |
| **Technical Details** | |
| Type | Scientific Reasoning, Expert Knowledge |
| Modality | Text |
| Task format | Multiple choice (4 options) |
| Number of tasks | 198 |
| Total examples | 198 |
| Evaluation metric | [Accuracy](/wiki/accuracy), Zero-shot Chain-of-Thought |
| Domains | Biology, Physics, Chemistry |
| Languages | English |
| **Performance** | |
| Human performance | 81.3% (Diamond expert validators), 21.9% (non-experts) |
| Baseline | 38.8% (GPT-4, few-shot CoT) |
| SOTA score | ~94.1% (text-only, pass@1) |
| SOTA model | Gemini 3.1 Pro Preview ([Google DeepMind](/wiki/google_deepmind)) |
| SOTA date | 2026-06 |
| Saturated | Effectively saturated at top tier |
| **Resources** | |
| Paper | [Paper](https://arxiv.org/abs/2311.12022) |
| GitHub | [Repository](https://github.com/idavidrein/gpqa) |
| Dataset | [Download](https://huggingface.co/datasets/Idavidrein/gpqa) |
| Predecessor | GPQA (full set, 448 questions) |

GPQA Diamond is a challenging [AI benchmark](/wiki/ai_benchmark) consisting of 198 PhD-level multiple-choice questions in biology, physics, and chemistry.[^1] Released on November 20, 2023, it represents the most difficult subset of the Graduate-Level Google-Proof Q&A Benchmark ([GPQA](/wiki/gpqa)), specifically designed to test [artificial intelligence](/wiki/artificial_intelligence) systems on questions that require deep scientific expertise and cannot be easily answered through web searches.[^1] The benchmark was created by David Rein and collaborators at New York University and [Anthropic](/wiki/anthropic), and has become one of the most widely cited evaluations for measuring scientific reasoning in [large language models](/wiki/large_language_model).[^1]

GPQA Diamond has gained particular prominence because it occupies a unique position in the AI evaluation landscape: it tests knowledge and reasoning at a level where even human PhD experts achieve only about 65% accuracy on the broader GPQA set (81.3% on the Diamond subset due to selection effects), while skilled non-experts with unrestricted internet access score just 21.9% on the Diamond questions.[^1] This wide expertise gap makes GPQA Diamond especially useful for studying scalable oversight, the problem of how humans can supervise AI systems that may eventually surpass human capabilities in specialized domains.[^1]

By mid-2026, the benchmark sits in an unusual transitional state. The top reported scores cluster near 94%, roughly 13 percentage points above the original Diamond expert baseline of 81.3% and within a few points of the estimated ceiling imposed by ambiguous or invalid questions.[^3][^10] Models from several labs now sit within a roughly 1.4-point band at the top of the leaderboard, and the role of GPQA Diamond has shifted from a frontier capability test toward a sanity check that newly released frontier models are expected to clear.[^3][^6]

## Background and motivation

### What problem was GPQA Diamond built to solve?

The primary motivation behind GPQA Diamond is rooted in one of the most pressing challenges in [AI safety](/wiki/ai_safety): scalable oversight. As AI systems grow more capable, they increasingly operate in domains where human supervisors cannot independently verify the correctness of AI outputs. A doctor reviewing an AI's diagnosis might lack the specialized knowledge to confirm whether the AI's reasoning in a rare subspecialty is sound. A policy analyst relying on AI for technical climate modeling may not be equipped to catch subtle errors in the underlying physics.

David Rein and his co-authors framed GPQA as a testbed for studying this problem.[^1] The benchmark was designed so that questions fall in a "sweet spot" of difficulty: hard enough that non-experts cannot simply look up the answer, but structured enough that domain experts can reliably identify the correct response. This gap between expert and non-expert performance creates realistic conditions for experimenting with oversight techniques like [debate](/wiki/debate), market-making, and recursive reward modeling, all of which aim to help less-expert humans extract truthful answers from AI systems.

### Limitations of earlier benchmarks

Before GPQA, most science-focused benchmarks occupied either end of the difficulty spectrum. Benchmarks like [ScienceQA](/wiki/scienceqa) and ARC tested elementary to high school knowledge, while portions of [MMLU](/wiki/mmlu) covered undergraduate-level material. These benchmarks were valuable for tracking early progress in language model capabilities, but by 2023 frontier models had largely saturated them. At the other extreme, open-ended research tasks (like writing novel proofs or designing experiments) were difficult to evaluate automatically because they lacked clear, verifiable correct answers.

GPQA was designed to fill this gap: questions that are genuinely difficult, require graduate-level expertise across multiple scientific subfields, resist simple information retrieval, and yet have unambiguous correct answers in a multiple-choice format that allows automated scoring.[^1]

### What does "Google-proof" mean?

A distinctive feature of the GPQA dataset is its "Google-proof" nature. The term refers to the observation that skilled non-expert validators, despite having unrestricted access to web searches and academic papers, could not reliably answer the questions. During the validation phase, non-experts spent an average of 37 minutes per question (with a minimum requirement of 15 minutes) searching for information, reading research papers, and attempting to reason through the problems.[^1] Despite this effort, they achieved only 33.9% accuracy on the extended set and 21.9% on the Diamond subset, the latter slightly below the 25% random-guessing baseline for four-choice questions.[^1]

This Google-proof quality arises because the questions typically require:

- Integration of multiple concepts from within a single discipline
- Application of specialized procedures or techniques that are rarely explained fully in any single source
- Multi-step reasoning chains where each step demands domain-specific knowledge
- Understanding of nuances and exceptions that are common knowledge among specialists but absent from general references

## Creation and methodology

### Expert recruitment

The GPQA dataset was created by recruiting 61 domain experts through the freelancing platform Upwork. All experts either held a PhD or were actively pursuing one in biology, physics, or chemistry, and indicated proficiency or fluency in English.[^1] The creators designed a compensation structure that heavily emphasized quality: the majority of payment came from performance-based bonuses rather than flat fees, with estimated average hourly compensation of approximately $95 per hour and a maximum of $150 per hour.[^1]

### Question writing pipeline

The creation of each GPQA question followed a rigorous four-stage pipeline:[^1]

| Stage | Activity | Participants | Key requirements |
| --- | --- | --- | --- |
| 1. Question writing | Expert authors write questions with explanations for correct and incorrect answer choices | Domain experts (PhD holders/candidates) | Questions must be difficult, require deep knowledge, and include four plausible answer choices |
| 2. First expert validation | A second domain expert attempts the question and provides feedback | Independent expert in the same field | Validator assesses objectivity, accuracy, and difficulty; provides detailed feedback |
| 3. Question revision | Original author revises the question based on validator feedback | Original question writer | Revision is optional if no changes are suggested |
| 4. Non-expert validation | Three non-experts from different domains attempt the question | Skilled validators with PhDs in other fields | Minimum 15 minutes per question; unrestricted web access; average time spent was 37 minutes |

### Compensation and incentive structure

The creators designed an elaborate incentive system to encourage both difficult question writing and honest, careful validation:[^1]

**Question writing compensation:**

| Component | Amount | Condition |
| --- | --- | --- |
| Base payment | $10 | Per question submitted |
| Expert validator bonus | $20 per validator | Each expert who answers correctly (max $40) |
| Non-expert difficulty bonus | $15 per validator | Each non-expert who answers incorrectly (max $45) |
| Quality bonus | $30 | Both experts correct AND at least 2 of 3 non-experts incorrect |

**Expert validation compensation:**

| Component | Amount | Condition |
| --- | --- | --- |
| Base payment | $10 | Per question validated |
| Correct answer bonus | $10 | Validator answers correctly |
| Agreement bonus | $10 | Second expert also answers correctly |
| Difficulty bonus | $5 | Majority of non-experts answer incorrectly |

This structure incentivized question writers to produce questions that were genuinely difficult (rewarding non-expert failures) while remaining answerable by experts (rewarding expert successes). It also incentivized validators to answer carefully and provide honest, thorough feedback.
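
The question-writing bonus table can be turned into a small calculator to see how the incentives combine. The sketch below is illustrative only: the function name `writer_payout` is hypothetical, and it assumes the listed components simply add together, which the table implies but does not state outright.

```python
def writer_payout(experts_correct: int, non_experts_incorrect: int) -> int:
    """Question-writer payout in USD under the bonus table above.

    experts_correct: how many of the 2 expert validators answered correctly.
    non_experts_incorrect: how many of the 3 non-experts answered incorrectly.
    Assumption: the components are additive.
    """
    if not 0 <= experts_correct <= 2:
        raise ValueError("experts_correct must be 0, 1, or 2")
    if not 0 <= non_experts_incorrect <= 3:
        raise ValueError("non_experts_incorrect must be between 0 and 3")

    payout = 10                            # base payment per submitted question
    payout += 20 * experts_correct         # expert validator bonus (max $40)
    payout += 15 * non_experts_incorrect   # non-expert difficulty bonus (max $45)
    if experts_correct == 2 and non_experts_incorrect >= 2:
        payout += 30                       # quality bonus
    return payout


print(writer_payout(2, 3))  # 10 + 40 + 45 + 30 = 125
print(writer_payout(2, 1))  # 10 + 40 + 15 = 65
print(writer_payout(0, 3))  # 10 + 0 + 45 = 55
```

Under these assumptions, a question that both experts solve and all three non-experts miss pays more than twice as much as one that stumps everyone, which is the asymmetry the paragraph above describes.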

### How is the Diamond subset selected?

The full GPQA dataset exists in three nested subsets of increasing quality and difficulty:[^1]

| Subset | Size | Selection criteria |
| --- | --- | --- |
| GPQA Extended | 546 questions | All collected questions |
| GPQA Main | 448 questions | At least 1 of 2 experts correct AND at most 2 of 3 non-experts correct |
| GPQA Diamond | 198 questions | Both experts correct AND at most 1 of 3 non-experts correct |

The Diamond subset applies the strictest filters. For a question to qualify, both expert validators had to answer it correctly (or, if the second expert initially answered incorrectly, they had to clearly describe the mistake or demonstrate understanding of the question writer's explanation after seeing the answer). Additionally, the majority of non-experts had to answer incorrectly.[^1] This dual requirement ensures that Diamond questions are simultaneously answerable by true domain experts and resistant to non-expert reasoning with internet access.
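
The selection criteria in the table above can be written as a simple filter. This is a minimal sketch, not the authors' released code: the function name `gpqa_subsets` is hypothetical, and it ignores the allowance for a second expert who answers incorrectly but later demonstrates understanding of the intended answer.

```python
def gpqa_subsets(experts_correct: int, non_experts_correct: int) -> list[str]:
    """Return the nested GPQA subsets a question would fall into.

    experts_correct: number of the 2 expert validators who answered correctly.
    non_experts_correct: number of the 3 non-expert validators who answered correctly.
    """
    subsets = ["extended"]  # every collected question is in the extended set
    if experts_correct >= 1 and non_experts_correct <= 2:
        subsets.append("main")
    if experts_correct == 2 and non_experts_correct <= 1:
        subsets.append("diamond")
    return subsets


print(gpqa_subsets(2, 1))  # ['extended', 'main', 'diamond']
print(gpqa_subsets(1, 2))  # ['extended', 'main']
print(gpqa_subsets(2, 3))  # ['extended']
```

Because the Diamond condition is strictly tighter than the Main condition, any question that passes the Diamond filter also passes the Main filter, which is why the subsets are nested.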

An additional held-out set of 18 questions was reserved and not publicly released, intended for future validation purposes.[^1]

## Technical specifications

### Domain distribution

The 198 questions in GPQA Diamond span three scientific domains, though the distribution across subdomains is uneven. Based on the extended set of 546 questions (from which Diamond is drawn), the approximate subdomain breakdown is:[^1]

| Domain | Subdomains | Approx. questions (extended set) |
| --- | --- | --- |
| Biology | [Molecular Biology](/wiki/molecular_biology) (85), [Genetics](/wiki/genetics) (20) | 105 |
| Physics | [Quantum Mechanics](/wiki/quantum_mechanics) (64), High-Energy Particle Physics (46), General Physics (43), [Astrophysics](/wiki/astrophysics) (42), Electromagnetism/Photonics (12), Relativistic Mechanics (11), Statistical Mechanics (4), Condensed Matter (4), Optics/Acoustics (1) | 227 |
| Chemistry | [Organic Chemistry](/wiki/organic_chemistry) (144), General Chemistry (64), Inorganic Chemistry (3), Analytical Chemistry (2), Physical Chemistry (1) | 214 |

A notable feature of this distribution is the heavy representation of organic chemistry, which accounts for roughly 26% of all questions in the extended set. This has important implications for model evaluation, as organic chemistry questions turn out to be disproportionately difficult for AI systems.[^3]

### Question format

Each GPQA Diamond question is a text-only, four-option multiple-choice problem. Questions do not include images, diagrams, or graphs (though some questions reference visual concepts that the solver must reason about from text descriptions alone). The average question length is approximately 630 characters (median 561), or about 169 tokens using [GPT-4](/wiki/gpt-4)'s tokenizer.[^1]

Questions range in stated difficulty from "hard undergraduate" to "post-graduate level," and expert validators provide 4-point difficulty ratings after answering.[^1] Analysis shows these ratings are predictive of non-expert accuracy, indicating that experts have a reasonable ability to judge how difficult questions will be for less-specialized audiences.

### How is GPQA Diamond evaluated?

The standard evaluation protocol for GPQA Diamond uses zero-shot or few-shot prompting with [chain-of-thought](/wiki/chain_of_thought) reasoning:

| Evaluation method | Description | Common usage |
| --- | --- | --- |
| Zero-shot CoT | Model reasons step-by-step without example questions | Standard protocol in [OpenAI](/wiki/openai)'s simple-evals suite |
| Few-shot CoT | Model is given example questions and solutions before the test question | Used in some academic evaluations |
| Zero-shot direct | Model selects an answer without explicit reasoning | Less common; generally yields lower scores |
| Pass@1 | Single attempt accuracy; no majority voting | Most commonly reported metric |
| Consensus / Majority voting | Multiple samples; most common answer selected | Sometimes used but can inflate scores |
| Tool-augmented | Model has access to a Python interpreter, search, and other tools | Reported separately; not comparable to text-only scores |

GPQA Diamond is included in OpenAI's simple-evals evaluation suite alongside [MMLU](/wiki/mmlu), [MATH](/wiki/math_benchmark), and other standard benchmarks, which has helped establish it as a default evaluation for new model releases across the industry.[^7]
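
The difference between pass@1 and consensus scoring in the table above is easiest to see on a toy example. The sketch below is an illustration under stated assumptions: the helper names (`pass_at_1`, `majority_vote`, `consensus_accuracy`) and the four-question answer data are hypothetical and are not taken from simple-evals or any published harness.

```python
from collections import Counter


def pass_at_1(answers, answer_key):
    """Fraction of questions where the single submitted answer matches the key."""
    correct = sum(a == k for a, k in zip(answers, answer_key))
    return correct / len(answer_key)


def majority_vote(samples):
    """Most frequent answer among the samples drawn for one question."""
    return Counter(samples).most_common(1)[0][0]


def consensus_accuracy(sampled_answers, answer_key):
    """Accuracy after replacing each question's samples with their majority vote."""
    voted = [majority_vote(samples) for samples in sampled_answers]
    return pass_at_1(voted, answer_key)


# Hypothetical data: 4 questions, 3 sampled answers per question.
answer_key = ["A", "C", "B", "D"]
sampled = [
    ["A", "A", "B"],
    ["B", "C", "C"],
    ["B", "B", "B"],
    ["A", "D", "A"],
]

first_samples = [samples[0] for samples in sampled]
print(pass_at_1(first_samples, answer_key))     # 0.5
print(consensus_accuracy(sampled, answer_key))  # 0.75
```

In this toy case voting rescues the second question, where the first sample was wrong but the majority was right. That is the mechanism by which consensus scores can exceed pass@1 scores for the same model.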

### Question objectivity

The GPQA paper estimates that approximately 73.6% of questions have objectively verifiable correct answers under a conservative assessment, rising to 76.4% when including cases where validators demonstrated understanding of the intended answer even if they initially selected the wrong option.[^1] The Diamond subset is expected to have a higher objectivity rate than the extended set because both expert validators had to agree on the correct answer.

Later analysis by Epoch AI estimated that roughly 90% to 95% of Diamond questions are valid, which leaves on the order of 10 to 20 questions (out of 198) potentially having issues such as incorrect answer keys or ambiguous wording.[^3] By 2026, multiple independent reviews have settled on a working estimate that around 5% to 8% of Diamond questions are either ambiguous or rely on contested answer keys. This places an effective ceiling on benchmark scores in the 92% to 95% range under the current question set, which matches what frontier models are now reporting.[^3][^10]

## Human baseline performance

### How well do human experts score on GPQA Diamond?

Human performance on GPQA varies significantly depending on the subset and how "expert" is defined:[^1]

| Metric | Extended (546) | Main (448) | Diamond (198) |
| --- | --- | --- | --- |
| Expert accuracy | 65.4% | 72.5% | 81.3% |
| Non-expert accuracy | 33.9% | 30.5% | 21.9% |
| Expert-non-expert gap | 31.5 pp | 42.0 pp | 59.4 pp |

The higher expert accuracy on the Diamond subset (81.3% versus 65.4% on Extended) is a selection artifact: Diamond specifically includes questions where both experts answered correctly, so the measured expert accuracy on that subset is inflated by the filtering criteria.[^1] The extended set figure of 65.4% (or approximately 74% after discounting clear mistakes identified in retrospect) more accurately reflects typical expert performance on these types of questions.[^1]

OpenAI independently recruited PhD-level experts to answer GPQA Diamond questions and reported an expert accuracy of approximately 69.7%, which is consistent with the original paper's findings when accounting for the selection effects in the Diamond subset.[^5]

### Non-expert performance

Non-expert validators were not laypeople; they were skilled individuals with PhDs in fields other than the question's domain. They received unrestricted internet access and were required to spend at least 15 minutes per question. On average, they spent 37 minutes per question (median 30 minutes), often reading multiple academic research papers in the attempt to find relevant information.[^1] Despite this substantial effort, their accuracy on the Diamond subset was just 21.9%, slightly below the 25% random baseline.[^1]

This near-chance performance among educated, motivated non-experts is what makes GPQA Diamond "Google-proof" and makes it a compelling testbed for scalable oversight research.

### Performance by domain

Expert and non-expert performance varies by scientific domain:[^1]

| Domain | Expert accuracy (extended) | Non-expert accuracy (extended) | GPT-4 few-shot CoT |
| --- | --- | --- | --- |
| Biology | 66.7% | 43.2% | 58.1% |
| Physics | 57.3% | 32.5% | 37.0% |
| Chemistry | 72.0% | 31.4% | 31.8% |

Notably, biology had the highest non-expert accuracy (43.2%), suggesting that some biology questions may be more accessible to educated non-specialists. Chemistry showed the widest expertise gap, with experts at 72.0% but non-experts at just 31.4%.

## AI model performance

### Historical performance timeline

The benchmark has seen dramatic improvements in AI performance since its release:

| Period | Key development | Reported score |
| --- | --- | --- |
| November 2023 | Initial release; GPT-4 baseline[^1] | 38.8% |
| March 2024 | [Claude](/wiki/claude) 3 Opus evaluated | ~60% |
| June 2024 | Claude 3.5 Sonnet | 59.4% |
| September 2024 | OpenAI o1 released[^19] | 77.3% |
| December 2024 | OpenAI o3 released | 83.3% |
| January 2025 | DeepSeek-R1 released | 71.5% |
| July 2025 | Aristotle X1 Verify (Autopoiesis Sciences) | 92.4% |
| Late 2025 | GPT-5.2 and Gemini 3 Pro | ~92-93% |
| February 2026 | Gemini 3.1 Pro Preview / Pro launch[^14] | ~94.1% / 94.3% |
| March 2026 | Claude Opus 4.7 internal[^15] | ~94.2% |
| April 2026 | GPT-5.5 (xhigh), Grok 4.3, DeepSeek V4-Pro[^11][^16] | ~93.5% / 90.1% / 90.1% |
| May 2026 | Claude Mythos Preview (Anthropic red-team)[^12] | ~94.6% |

The progression from 38.8% to about 94% in roughly 30 months represents one of the steepest capability gains observed on any major language model benchmark. For comparison, the analogous progression on [MMLU](/wiki/mmlu) from 70% to its current ceiling took approximately 4 years.

### Current leaderboard

As of mid-2026, the following models have achieved notable performance on GPQA Diamond:[^3][^6][^11]

| Rank | Model | Accuracy (%) | Organization | Date | Protocol |
| --- | --- | --- | --- | --- | --- |
| 1 | Claude Mythos Preview | ~94.6 | [Anthropic](/wiki/anthropic) | May 2026 | Text-only, pass@1 |
| 2 | Gemini 3.1 Pro | ~94.3 | [Google DeepMind](/wiki/google_deepmind) | February 2026 | Text-only, pass@1 |
| 3 | Claude Opus 4.7 | ~94.2 | [Anthropic](/wiki/anthropic) | April 2026 | Text-only, pass@1 |
| 4 | Gemini 3.1 Pro Preview | ~94.1 | [Google DeepMind](/wiki/google_deepmind) | February 2026 | Text-only, pass@1 |
| 5 | GPT-5.5 (xhigh) | ~93.5 | [OpenAI](/wiki/openai) | April 2026 | Text-only, pass@1 |
| 6 | GPT-5.5 (high) | ~93.2 | [OpenAI](/wiki/openai) | April 2026 | Text-only, pass@1 |
| 7 | GPT-5.2 | ~92.4 | [OpenAI](/wiki/openai) | 2025 | Text-only, pass@1 |
| 8 | Aristotle X1 Verify | 92.4 | Autopoiesis Sciences | July 2025 | Text-only, pass@1 |
| 9 | GPT-5.4 (xhigh) | ~92.0 | [OpenAI](/wiki/openai) | 2026 | Text-only, pass@1 |
| 10 | Gemini 3 Pro | ~91.9 | [Google DeepMind](/wiki/google_deepmind) | 2025 | Text-only, pass@1 |
| 11 | GPT-5.3 Codex (xhigh) | ~91.5 | [OpenAI](/wiki/openai) | 2025 | Text-only, pass@1 |
| 12 | Claude Opus 4.6 | ~91.3 | [Anthropic](/wiki/anthropic) | 2025 | Text-only, pass@1 |
| 13 | Grok 4.3 (high) | ~90.1 | [xAI](/wiki/xai) | April 2026 | Text-only, pass@1 |
| 14 | DeepSeek V4-Pro | ~90.1 | [DeepSeek](/wiki/deepseek) | April 2026 | Text-only, pass@1 |
| 15 | Qwen 3.5 (open-weights) | ~88.4 | [Alibaba](/wiki/alibaba) | 2026 | Text-only, pass@1 |
| 16 | Grok 4 Heavy | ~88.4 | [xAI](/wiki/xai) | 2025 | Text-only, pass@1 |
| 17 | Grok 4 | ~87.5 | [xAI](/wiki/xai) | 2025 | Text-only, pass@1 |
| 18 | Claude Opus 4.5 | ~87.0 | [Anthropic](/wiki/anthropic) | 2025 | Text-only, pass@1 |
| 19 | Gemini 2.5 Pro | ~86.4 | [Google DeepMind](/wiki/google_deepmind) | March 2025 | Text-only, pass@1 |
| 20 | OpenAI o3 | 83.3 | [OpenAI](/wiki/openai) | December 2024 | Text-only, pass@1 |
| 21 | OpenAI o3-mini-high | 79.7 | [OpenAI](/wiki/openai) | 2025 | Text-only, pass@1 |
| 22 | OpenAI o1 | 77.3 | [OpenAI](/wiki/openai) | September 2024 | Text-only, pass@1 |
| 23 | Claude 3.7 Sonnet (Thinking) | 75.3 | [Anthropic](/wiki/anthropic) | 2025 | Text-only, pass@1 |
| 24 | DeepSeek-R1 | 71.5 | [DeepSeek](/wiki/deepseek) | January 2025 | Text-only, pass@1 |
| 25 | Claude 3.7 Sonnet | 67.4 | [Anthropic](/wiki/anthropic) | 2025 | Text-only, pass@1 |
| 26 | [Claude](/wiki/claude) 3 Opus | ~60.0 | [Anthropic](/wiki/anthropic) | March 2024 | Text-only, pass@1 |
| 27 | Claude 3.5 Sonnet | 59.4 | [Anthropic](/wiki/anthropic) | June 2024 | Text-only, pass@1 |
| 28 | DeepSeek-V3 | 59.1 | [DeepSeek](/wiki/deepseek) | December 2024 | Text-only, pass@1 |
| 29 | [GPT-4](/wiki/gpt-4) (baseline) | 38.8 | [OpenAI](/wiki/openai) | November 2023 | Few-shot CoT, pass@1 |

Note: Scores may vary depending on evaluation methodology (zero-shot vs. few-shot, pass@1 vs. consensus), prompting strategy, and the specific model version tested. The Artificial Analysis leaderboard lists Gemini 3.1 Pro Preview at 94.1% as the top independently tracked text-only score as of mid-2026.[^6] Epoch AI found that self-reported scores from major labs generally fall within the confidence interval of independently reproduced evaluations.[^4] The Claude Mythos Preview figure of 94.6% has been reported through secondary channels rather than a primary Anthropic publication, and should be treated as preliminary until Anthropic releases an official model card.[^12]

### When did AI first beat human experts on GPQA Diamond?

A significant inflection point in GPQA Diamond performance came with the introduction of reasoning-focused models. OpenAI's o1 (September 2024) was the first model to substantially exceed human expert performance on this benchmark, scoring 77.3% (zero-shot) and 78.0% (consensus) compared to the approximately 69.7% expert baseline.[^5][^19] In announcing the result, OpenAI wrote that o1 "exceeds the performance of human experts" on GPQA Diamond, while cautioning: "This does not imply that o1 is more capable than a PhD in all respects, only that the model is more proficient in solving some problems that a PhD would be expected to solve."[^19] The o1 model family uses extended chain-of-thought reasoning at inference time, allocating more computation to work through multi-step problems.

This pattern continued with o3 (83.3%) and subsequent reasoning models from multiple labs. The success of [reasoning models](/wiki/reasoning_model) on GPQA Diamond suggests that the benchmark's difficulty stems partly from the need for careful, multi-step logical chains rather than from a lack of factual knowledge in the training data.

By 2026, virtually every frontier model includes some form of explicit reasoning mode, and labs typically report GPQA Diamond scores from a high-effort or extended-thinking configuration. The gap between a model's standard configuration and its extended-thinking configuration on GPQA Diamond is typically 2 to 4 percentage points for current frontier systems, a much smaller margin than the 6 to 10 point gap that existed when reasoning models first appeared in 2024.

### Aristotle X1 Verify

Autopoiesis Sciences' Aristotle X1 Verify system, which achieved 92.4% on GPQA Diamond in July 2025, introduced an additional innovation: calibrated confidence scoring. Most AI systems state confidence levels that do not correlate reliably with their actual accuracy; according to the company, Aristotle X1 instead builds uncertainty estimation into each stage of its reasoning. The system also achieved 96.1% on SimpleQA, OpenAI's factuality benchmark, suggesting that its approach to verification and confidence estimation transfers across different evaluation contexts.

### Open-weights model performance

A notable development in 2026 has been the rapid closing of the gap between closed-frontier models and the best open-weights releases. Where the typical gap was 15 to 20 percentage points throughout 2024 and most of 2025, by April 2026 it has narrowed to roughly 4 to 6 points on GPQA Diamond.[^16][^17]

| Model | Score | License | Date |
| --- | --- | --- | --- |
| DeepSeek V4-Pro | ~90.1% | Open weights | April 2026 |
| Qwen 3.5 | ~88.4% | Open weights | 2026 |
| DeepSeek V4 | ~88-89% | Open weights | March 2026 |
| Llama 4 (largest variant) | ~85-86% | Open weights | 2026 |
| Qwen 3.6 | ~86.0% | Open weights | 2026 |

DeepSeek V4-Pro and Qwen 3.5 now sit only about 4 to 6 points below the top closed-frontier models, a gap that is comparable in size to the run-to-run noise on the benchmark itself.[^16] The National Institute of Standards and Technology's Center for AI Standards and Innovation independently evaluated DeepSeek V4-Pro in May 2026 and confirmed scores in the 89% to 91% range on GPQA Diamond, depending on configuration.[^18]
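
One rough way to gauge how much noise a 198-question benchmark carries is the binomial standard error of an accuracy estimate. The sketch below is an assumption-laden illustration, not the method used by any cited source: it treats each question as an independent trial, uses a normal approximation, and the function name `accuracy_margin` is hypothetical.

```python
import math


def accuracy_margin(accuracy: float, n_questions: int = 198, z: float = 1.96) -> float:
    """Half-width of an approximate 95% interval, in percentage points.

    Assumes questions are independent Bernoulli trials (a simplification).
    """
    standard_error = math.sqrt(accuracy * (1.0 - accuracy) / n_questions)
    return 100.0 * z * standard_error


print(round(accuracy_margin(0.901), 1))  # 4.2
print(round(accuracy_margin(0.943), 1))  # 3.2
```

Under these assumptions, a score near 90% on 198 questions carries a margin of roughly plus or minus 4 points, so differences of a few points between models are hard to distinguish from sampling variation alone.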

### Tool-augmented evaluation

While GPQA Diamond was designed as a text-only, closed-book test, several leaderboards now publish a separate "tool-augmented" column in which the model has access to a Python interpreter, web search, and sometimes a paper retrieval tool. The intent is to measure what an agentic system can do when allowed to use the same kinds of resources a human researcher might consult.

| Configuration | Typical top score (2026) | Notes |
| --- | --- | --- |
| Text-only, pass@1 | ~94% (frontier models) | Standard reporting protocol; comparable across labs |
| Text-only, consensus@32 | ~95-96% | Used in some academic papers; inflates scores by 1-2 points |
| Tool-augmented (search + Python) | ~96-97% | Reported separately; not directly comparable to text-only |
| Multi-agent debate | ~95-96% | Sometimes reported alongside scalable oversight research |

Tool-augmented numbers should not be compared directly against text-only numbers in the leaderboard. The two protocols measure different things: text-only measures what knowledge and reasoning a model has internalized during training, while tool-augmented measures what an agentic system can accomplish given external resources. The GPQA authors are explicit that the canonical evaluation is text-only, and tool-augmented scores are best read as an upper bound on what a competent research agent could plausibly achieve.[^1]

## Domain-specific analysis

### Why is organic chemistry the hardest category?

Epoch AI's detailed analysis of GPQA Diamond revealed a striking pattern: organic chemistry is heavily overrepresented among the questions that models consistently answer incorrectly. While organic chemistry accounts for roughly 36% of the Diamond question set, it makes up approximately 70% of the 40 questions that top models (those scoring above 70% overall) consistently get wrong.[^3]

Several factors contribute to organic chemistry's difficulty for AI systems:

- Reaction mechanism reasoning requires spatial and sequential thinking that is difficult to express in text
- Many organic chemistry problems involve recognizing structural patterns and functional group interactions
- Procedural knowledge about laboratory techniques (such as chromatography, NMR interpretation, and synthesis planning) is harder to acquire from text-based training data
- The field has a large number of named reactions and specialized conventions that require memorization alongside conceptual understanding

### Physics: a strength for reasoning models

Physics questions are generally more tractable for AI models, particularly those employing [chain-of-thought](/wiki/chain_of_thought) reasoning. This is likely because many physics problems can be broken down into explicit mathematical steps: identify the relevant equations, substitute known values, and solve. Reasoning models excel at this type of structured problem-solving, making physics the domain where AI performance tends to be highest.

Subdomain variation exists within physics as well. Quantum mechanics and high-energy particle physics questions, which require knowledge of specialized formalisms and counterintuitive principles, tend to be harder than classical mechanics or electromagnetism questions.

### Biology: variable difficulty

Biology questions show the most variance in difficulty. Molecular biology questions that test specific procedural knowledge (such as the steps of a particular experimental protocol) tend to be very difficult for both AI systems and non-expert humans. However, some genetics questions are more accessible, contributing to the higher non-expert accuracy observed in the biology domain (43.2% on the extended set).

## Saturation and benchmark validity

### Is GPQA Diamond saturated?

By mid-2026, GPQA Diamond is widely considered effectively saturated at the top tier. Multiple distinct frontier models score within a roughly 1.4 percentage point band in the low-to-mid 90s, leaving little room for meaningful differentiation, and the top-performing systems are within striking distance of the estimated ceiling imposed by ambiguous questions.[^3][^6]

| Saturation indicator | Observation (2026) |
| --- | --- |
| Score clustering | Top several models within ~1.4 points of each other |
| Ceiling approach | Top scores fall inside the estimated 92% to 95% ceiling implied by ambiguous or invalid questions |
| Consistent failure set | ~30 questions (15%) consistently missed by most models |
| Diminishing improvement rate | Roughly 2 pp gain at the top from late 2025 (~92-93%) to May 2026 (~94.6%) |
| Reporting fatigue | Some labs no longer headline GPQA Diamond in launch materials |

The broader AI evaluation community now treats GPQA Diamond as a sanity check rather than a frontier capability test. New models are expected to score above 90%, and the conversation has shifted to harder benchmarks like [Humanity's Last Exam](/wiki/humanity_s_last_exam), FrontierMath, and various agentic evaluations.[^3]

### Position relative to other current benchmarks

As of mid-2026, GPQA Diamond sits well below the frontier of current evaluation difficulty:

| Benchmark | Top score (2026) | Top model class | Status |
| --- | --- | --- | --- |
| GPQA Diamond (text-only) | ~94% | Frontier reasoning models | Effectively saturated |
| [AIME](/wiki/aime) 2025 (math competition) | ~99% (consensus) | Multiple models | Saturated |
| [MATH](/wiki/math_benchmark) | ~99% | Multiple models | Saturated |
| [MMLU-Pro](/wiki/mmlu-pro) | ~90% | Frontier reasoning models | Approaching saturation |
| [Humanity's Last Exam](/wiki/humanity_s_last_exam) (text-only) | ~44.7% | Frontier reasoning models | Active frontier |
| Humanity's Last Exam (tool-use) | ~53-54% | Top frontier models | Active frontier |
| FrontierMath | ~30-35% | Top reasoning models | Active frontier |

This context helps explain why GPQA Diamond has moved from being a flagship metric to a baseline expectation. Top models now exceed the 81.3% expert accuracy measured on Diamond by roughly 13 percentage points, and the 65% expert accuracy on the broader set by more than 25 points. On benchmarks like Humanity's Last Exam, by contrast, models still trail expert-level performance by a wide margin, which continues to drive new model development.

### Question validity analysis

Several research groups have investigated whether GPQA Diamond is truly saturated, that is, whether the remaining ~5-7% of incorrect answers reflect genuine model limitations or flawed questions.

Epoch AI conducted a detailed analysis examining the 40 questions that high-performing models (70%+ overall accuracy) most frequently answered incorrectly. Their findings suggest that most of these difficult questions are genuinely valid but require highly specialized knowledge. Of six particularly problematic questions examined in depth (those with sub-5% model accuracy), approximately 2.25 were estimated to be potentially invalid. Extrapolating this rate to the full set of 40 difficult questions yields an estimated invalid question rate of roughly 8% (15 out of 198), though the authors acknowledged significant uncertainty in this extrapolation.[^3]
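The extrapolation in this paragraph can be reproduced with a few lines of arithmetic. The sketch below is only an illustration of the figures quoted above, not Epoch AI's code; the function name `extrapolate_invalid` is hypothetical, and it assumes the invalid rate seen in the six closely examined questions applies uniformly to all 40 difficult ones.

```python
DIAMOND_SIZE = 198  # questions in GPQA Diamond


def extrapolate_invalid(invalid_in_sample: float, sample_size: int,
                        hard_set_size: int, total: int = DIAMOND_SIZE):
    """Scale the invalid rate seen in a reviewed sample up to the hard set.

    Assumption: the reviewed sample is representative of the hard set.
    """
    rate = invalid_in_sample / sample_size      # 2.25 / 6 = 0.375
    estimated_invalid = rate * hard_set_size    # 0.375 * 40 = 15
    return estimated_invalid, estimated_invalid / total


count, share = extrapolate_invalid(2.25, 6, 40)
print(f"{count:.0f} questions, {share:.1%} of Diamond")
```

Because the six questions were chosen for having the lowest model accuracy, the uniform-rate assumption is likely to overstate the share of invalid questions, which fits the uncertainty the authors acknowledged.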

A follow-up review in early 2026 by an independent group of PhD chemists and physicists narrowed this estimate further. They concluded that 8 to 12 Diamond questions (roughly 4% to 6% of the set) have either contested correct answers, ambiguous wording, or rely on outdated experimental conventions.[^10] This puts the practical ceiling on text-only GPQA Diamond scores in the 94% to 96% range, which is broadly consistent with where the top of the leaderboard now sits.
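The ceiling range follows directly from the flawed-question counts. As a rough sketch, and assuming conservatively that a model is always marked wrong on a flawed question (in practice it could still match the answer key by chance), the hypothetical helper `score_ceiling` gives:

```python
DIAMOND_SIZE = 198  # questions in GPQA Diamond


def score_ceiling(flawed_questions: int, total: int = DIAMOND_SIZE) -> float:
    """Best achievable accuracy if every flawed question is scored as wrong."""
    return (total - flawed_questions) / total


for flawed in (8, 12):
    print(flawed, f"{score_ceiling(flawed):.1%}")
```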

The questions most likely to be invalid included ones involving specialized procedural knowledge that may not have a single correct answer, and at least one question where the intended answer appeared to be incorrect based on independent expert review.[^3]

### Do AI labs report GPQA Diamond scores accurately?

Epoch AI also investigated whether AI labs accurately report their GPQA Diamond scores. By comparing self-reported scores against independently reproduced evaluations, they found that all major labs' self-reported scores fell within the expected confidence interval. The computed p-values were well above 0.05 for all tested models, indicating no statistically significant difference between reported and independently measured performance. Epoch estimates that their independent evaluations can determine true model performance to within 4 to 6 percentage points with 90% confidence, given the 198-question sample size.[^4]
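One way to see where an uncertainty of this size can come from is sampling error alone. The sketch below uses a normal approximation to the binomial; it is an illustration under that assumption, not Epoch AI's actual methodology, and the names `half_width` and `consistent` are hypothetical.

```python
import math

Z_90 = 1.645  # two-sided 90% quantile of the standard normal


def half_width(accuracy: float, n: int = 198, z: float = Z_90) -> float:
    """Normal-approximation half-width of a confidence interval for accuracy."""
    return z * math.sqrt(accuracy * (1.0 - accuracy) / n)


def consistent(reported: float, measured: float, n: int = 198) -> bool:
    """True if a reported score lies inside the interval around a measured one."""
    return abs(reported - measured) <= half_width(measured, n)


# Half-widths for a few accuracy levels, in percentage points.
for acc in (0.50, 0.70, 0.85):
    print(acc, round(100 * half_width(acc), 1))
```

Under this approximation the half-width shrinks as accuracy moves away from 50%, so the interval is widest for mid-range models and narrower for models scoring in the 90s.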

This verification work has become particularly important as scores cluster within the natural noise floor of the benchmark. With only 198 questions, a 1-percentage-point difference corresponds to approximately 2 questions, which is well within the run-to-run variance observed when re-evaluating the same model.[^4] The Artificial Analysis convention of reporting median-of-three runs has helped reduce this variance, but consumers of leaderboard data should still treat sub-2-point gaps between top models with caution.[^6]
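The conversion between score gaps and question counts, and the median-of-three convention, are both simple to express. The run scores below are made up for illustration, and the function names are hypothetical rather than part of any leaderboard's tooling.

```python
import statistics


def gap_in_questions(gap_points: float, n: int = 198) -> float:
    """Convert a score gap in percentage points into a number of questions."""
    return gap_points / 100.0 * n


def reported_score(run_accuracies: list[float]) -> float:
    """Median across repeated runs, as in a median-of-three convention."""
    return statistics.median(run_accuracies)


# Hypothetical accuracies from three runs of the same model.
runs = [0.929, 0.944, 0.934]
print(round(gap_in_questions(1.0), 2), reported_score(runs))
```

Taking the median discards the best and worst run, so a single lucky or unlucky evaluation does not set the headline number.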

### Contamination concerns in 2026

More than two and a half years after release, GPQA Diamond questions have appeared in countless analysis blog posts, academic papers, and online forum discussions. While the dataset includes canary strings to detect direct training-data contamination, the broader risk of indirect contamination (such as paraphrased or summarized questions appearing in web crawl data) has grown.[^9] Several labs have published statements affirming that they exclude GPQA-style content from pretraining and reinforcement learning data, but verification of these claims remains difficult.

One signal that contamination is not entirely responsible for current scores is that frontier models still make mistakes on a stable subset of questions, suggesting that the remaining errors reflect genuine reasoning limitations or question ambiguity rather than gaps in the training corpus. If contamination were the primary driver of scores, the failure set would shift unpredictably across model versions, which is not what observers have reported.[^3]
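The stability of the failure set can be quantified by comparing the questions that successive model versions miss. A minimal sketch, assuming per-question results are available and using Jaccard overlap as one possible measure (the question IDs are invented for illustration):

```python
def jaccard(a: set[int], b: set[int]) -> float:
    """Overlap between two failure sets: 1.0 means identical, 0.0 disjoint."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical question IDs missed by two successive model versions.
missed_v1 = {3, 17, 42, 58, 91, 120}
missed_v2 = {3, 17, 42, 58, 77, 120}
print(f"{jaccard(missed_v1, missed_v2):.2f}")
```

A consistently high overlap across versions would support the stable-failure-set argument above, while a low overlap would suggest that errors are closer to noise.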

## Relationship to GPQA and other subsets

### How does Diamond differ from GPQA Main and Extended?

Understanding how GPQA Diamond relates to the broader GPQA dataset is important for interpreting benchmark results:[^1]

| Property | Extended | Main | Diamond |
| --- | --- | --- | --- |
| Number of questions | 546 | 448 | 198 |
| Expert accuracy | 65.4% | 72.5% | 81.3% |
| Non-expert accuracy | 33.9% | 30.5% | 21.9% |
| Selection criteria | None (all collected) | 1+ expert correct, 2+ non-experts wrong | Both experts correct, 2+ non-experts wrong |
| GPT-4 baseline (few-shot CoT) | 38.7% | 39.7% | 38.8% |

The Diamond subset is both harder (for non-experts) and more reliable (in terms of having verifiably correct answers) than the broader sets. This combination of difficulty and quality is why Diamond has become the standard evaluation target rather than the full GPQA set.
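The selection criteria in the table can be read as a nested filter. The sketch below encodes only the thresholds shown in the table and assumes two expert validators per question; the original paper's rules may include further conditions, and the `Validation` class and `subset` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Validation:
    experts_correct: int    # expert validators who chose the keyed answer
    non_experts_wrong: int  # non-expert validators who missed it


def subset(v: Validation) -> str:
    """Return the strictest subset a question qualifies for, per the table."""
    if v.non_experts_wrong >= 2:
        if v.experts_correct >= 2:
            return "diamond"   # both experts correct
        if v.experts_correct >= 1:
            return "main"      # at least one expert correct
    return "extended"          # everything collected


print(subset(Validation(experts_correct=2, non_experts_wrong=3)))
```

Because each tier only tightens the previous one, every Diamond question also satisfies the Main criteria, and every Main question is part of the Extended set.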

### Why is Diamond the industry standard?

Several factors contributed to GPQA Diamond becoming the preferred evaluation subset:

1. **Higher question quality.** The requirement for both expert validators to agree on the correct answer reduces the number of ambiguous or flawed questions.[^1]
2. **Greater discriminative power.** The low non-expert accuracy (21.9%) means that models cannot score well through surface-level pattern matching or simple information retrieval.[^1]
3. **Inclusion in OpenAI's simple-evals.** OpenAI adopted GPQA Diamond as part of its standard evaluation suite, and other labs followed suit in order to report comparable results.[^7]
4. **Manageable size.** At 198 questions, the benchmark is small enough to run quickly and cheaply while still separating models whose scores differ by more than a few percentage points.
5. **Meaningful expertise gap.** The nearly 60-percentage-point gap between expert and non-expert accuracy creates clear room for measuring progress toward expert-level AI performance.

## Applications and impact

### Scalable oversight research

GPQA Diamond's primary intended application is as a testbed for scalable oversight methods.[^1] The benchmark's design creates conditions analogous to a scenario where a human supervisor (the non-expert) must evaluate the work of a more capable system (the AI model that outperforms non-experts). Researchers use this setup to test techniques like:

- **[Debate](/wiki/debate):** Two AI models argue opposing positions, and a human judge decides which argument is more convincing
- **Market-making:** Predictions are aggregated across multiple models and evaluated for calibration
- **Recursive reward modeling:** AI systems are trained to assist human evaluators in checking AI outputs
- **Decomposition:** Complex questions are broken into simpler sub-questions that non-experts can verify

### Capability evaluation

GPQA Diamond serves as one of the primary benchmarks for measuring progress toward expert-level scientific reasoning in AI. It is now routinely reported in model release announcements from [OpenAI](/wiki/openai), [Anthropic](/wiki/anthropic), [Google DeepMind](/wiki/google_deepmind), and other leading labs.[^7][^14][^15] The benchmark's focus on graduate-level science makes it complementary to other widely used evaluations:

| Benchmark | Focus | Difficulty level | Questions |
| --- | --- | --- | --- |
| [MMLU](/wiki/mmlu) | Broad knowledge (57 subjects) | High school to professional | 14,042 |
| [MMLU-Pro](/wiki/mmlu-pro) | Harder version of MMLU | Professional | 12,032 |
| GPQA Diamond | Science (bio, phys, chem) | Graduate to post-graduate | 198 |
| [MATH](/wiki/math_benchmark) | Mathematics | Competition level | 5,000 |
| [HumanEval](/wiki/humaneval) | Code generation | Professional | 164 |
| [AIME](/wiki/aime) | Mathematical reasoning | Competition level | Varies |
| [Humanity's Last Exam](/wiki/humanity_s_last_exam) | Broad expert knowledge | PhD and beyond | ~3,000 |

### Educational and research implications

The benchmark has practical implications beyond model evaluation:

- **Identifying AI knowledge gaps.** Per-domain analysis reveals where current models struggle most (organic chemistry) and where they excel (physics and mathematical reasoning), which can inform both training strategies and deployment decisions.[^3]
- **Calibrating trust in AI outputs.** GPQA Diamond results help researchers and practitioners understand the reliability of AI-generated scientific reasoning, which is critical for applications in drug discovery, materials science, and other domains where errors can have serious consequences.
- **Benchmarking AI for research assistance.** As AI systems approach and surpass human expert performance on these questions, the benchmark provides evidence for how close AI is to serving as a reliable research assistant in the natural sciences.

## Limitations and criticisms

### Dataset size

At 198 questions, GPQA Diamond is a relatively small benchmark. This limits the statistical power of comparisons between models, particularly when score differences are small. Epoch AI estimates that independent evaluations can only determine true performance to within 4 to 6 percentage points with 90% confidence.[^4] As a result, score differences of less than about 5 percentage points between models may not be statistically meaningful. With the top of the leaderboard now packed into a 1.4 point band, individual model rankings on GPQA Diamond should not be over-interpreted.
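To illustrate why a 1.4-point gap is hard to interpret, the sketch below computes a two-proportion z-statistic for two hypothetical models at the edges of that band. It treats the two evaluations as independent samples of 198 questions, which is a simplifying assumption: both models answer the same questions, so a paired test on per-question results would be more appropriate when that data is available.

```python
import math


def two_model_z(acc_a: float, acc_b: float, n: int = 198) -> float:
    """z-statistic for an accuracy gap, treating the runs as independent."""
    se = math.sqrt(acc_a * (1 - acc_a) / n + acc_b * (1 - acc_b) / n)
    return (acc_a - acc_b) / se


# Hypothetical scores at the top and bottom of a 1.4-point band.
z = two_model_z(0.940, 0.926)
print(f"z = {z:.2f}")
```

A value of `z` well below the 1.645 threshold used for a two-sided 90% test means the gap cannot be distinguished from sampling noise under these assumptions.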

### Multiple-choice format

The four-option multiple-choice format, while enabling automated evaluation, may not reflect real-world scientific reasoning. In practice, scientists formulate hypotheses, design experiments, interpret ambiguous data, and synthesize information from multiple sources. The multiple-choice structure reduces these complex cognitive tasks to answer selection, which may overestimate a model's true scientific understanding.

### Static dataset and data contamination

Because GPQA Diamond is a fixed set of 198 questions that has been publicly available since November 2023, there is a risk of data contamination. Models trained on data that includes GPQA questions or their answers (directly or indirectly) may achieve inflated scores that do not reflect genuine reasoning ability. The dataset maintainers included canary strings to detect unauthorized use in training data, but the risk increases over time as the benchmark becomes more widely discussed and analyzed online.[^9]
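A canary check amounts to scanning training documents for a known marker string. The sketch below shows the idea with a placeholder value; the real canary string is published in the GPQA repository and is not reproduced here, and the function names are hypothetical.

```python
def contains_canary(document: str, canary: str) -> bool:
    """Case-insensitive check for a benchmark canary string in a document."""
    return canary.lower() in document.lower()


def filter_corpus(documents: list[str], canary: str) -> list[str]:
    """Drop every document that carries the canary string."""
    return [d for d in documents if not contains_canary(d, canary)]


# Placeholder only, not the actual GPQA canary string.
CANARY = "EXAMPLE-CANARY-0000"

corpus = ["clean text about physics", "leaked item example-canary-0000 here"]
print(filter_corpus(corpus, CANARY))
```

This kind of filter only catches verbatim copies that still carry the marker; paraphrased or summarized questions pass through unchanged, which is why indirect contamination is harder to rule out.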

### Limited domain coverage

GPQA Diamond covers only three scientific domains: biology, physics, and chemistry. Important fields like mathematics, computer science, engineering, earth sciences, and medicine are not represented. This means the benchmark provides only a partial picture of a model's scientific capabilities.

### English-only

All questions are written in English, limiting the benchmark's applicability to evaluating multilingual scientific reasoning capabilities.

### Organic chemistry overrepresentation

The heavy representation of organic chemistry in the question set (roughly 36% of the extended set, and an even larger share of the hardest questions) means that models' aggregate scores may be disproportionately influenced by performance on a single subdomain. A model that excels at everything except organic chemistry may receive a misleadingly low overall score.[^3]
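The effect of a dominant subdomain on the aggregate score can be shown with a share-weighted mean. The sketch below uses the roughly 36% organic chemistry share quoted above together with made-up per-subdomain accuracies; `aggregate` is a hypothetical helper.

```python
def aggregate(shares: dict[str, float], accuracy: dict[str, float]) -> float:
    """Overall accuracy as the share-weighted mean of subdomain accuracies."""
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("shares must sum to 1")
    return sum(shares[k] * accuracy[k] for k in shares)


shares = {"organic_chemistry": 0.36, "everything_else": 0.64}
# Hypothetical model: strong everywhere except organic chemistry.
accuracy = {"organic_chemistry": 0.70, "everything_else": 0.98}
print(f"{aggregate(shares, accuracy):.1%}")
```

In this example a model that is near-perfect outside organic chemistry still lands below 90% overall, because more than a third of the weight sits on its weakest subdomain.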

## Future directions

### Proposed improvements

Researchers have suggested several enhancements to address GPQA Diamond's limitations:

1. **Free-form answer generation.** Removing the multiple-choice format would require models to produce and justify answers independently, providing a more realistic test of scientific reasoning.
2. **Dynamic question generation.** Creating new questions automatically or semi-automatically would combat data contamination and extend the benchmark's useful lifespan.
3. **Skill-based classification.** Grouping questions by the specific cognitive skills they require (calculation, conceptual understanding, procedural knowledge, spatial reasoning) rather than by domain would provide more granular insight into model capabilities.
4. **Multi-modal questions.** Adding diagrams, molecular structures, spectra, and graphs would test models' ability to reason about visual scientific information.
5. **Expanded domain coverage.** Including mathematics, computer science, engineering, and medical sciences would provide a more comprehensive assessment.
6. **Diamond v2.** A handful of researchers have proposed releasing a refreshed Diamond v2 with new questions, a cleaned answer key for ambiguous items, and a held-out evaluation server to prevent contamination. As of mid-2026 no official v2 has been released, but the original GPQA team has acknowledged that a refresh is under consideration.

### Next-generation benchmarks

With GPQA Diamond effectively saturated at the frontier, the AI evaluation community is developing more challenging successors:

- **Research-level tasks.** Benchmarks that test the ability to design experiments, write literature reviews, or generate novel hypotheses, moving beyond question-answering to test practical research skills.
- **Interactive problem-solving.** Evaluations where the model must iteratively refine its approach based on new information, simulating the back-and-forth nature of real scientific inquiry.
- **[Humanity's Last Exam](/wiki/humanity_s_last_exam).** A broader and more difficult benchmark designed to remain challenging for AI systems longer than existing evaluations. As of mid-2026 the top text-only score on Humanity's Last Exam is approximately 44.7%, leaving substantial room for improvement.
- **Laboratory simulations.** Testing models on simulated experimental procedures, including troubleshooting equipment, interpreting unexpected results, and making real-time decisions.
- **Agentic science evaluations.** Benchmarks like SciAgentBench and PaperBench measure how well an AI system can act as a junior researcher: reading the literature, running computations, planning experiments, and writing up results. These evaluations sit firmly on the active frontier in 2026.

## Related benchmarks

- **[GPQA](/wiki/gpqa):** The parent dataset from which GPQA Diamond is derived, containing 448 (Main) or 546 (Extended) questions
- **[MMLU](/wiki/mmlu):** Broader knowledge benchmark covering 57 subjects including science
- **[MMLU-Pro](/wiki/mmlu-pro):** A harder version of MMLU with 10-option multiple choice
- **[AIME](/wiki/aime):** American Invitational Mathematics Examination problems for mathematical reasoning
- **[MATH](/wiki/math_benchmark):** Competition-level mathematics benchmark
- **[HumanEval](/wiki/humaneval):** Code generation benchmark
- **[Humanity's Last Exam](/wiki/humanity_s_last_exam):** Broad expert-level benchmark designed to resist saturation
- **[ARC](/wiki/arc_challenge):** AI2 Reasoning Challenge for science reasoning
- **ScienceQA:** Elementary to high school science questions
- **PubMedQA:** Biomedical literature comprehension
- **ChemBench:** Chemistry-specific benchmark

## See also

- [Scalable Oversight](/wiki/scalable_oversight)
- [AI Safety](/wiki/ai_safety)
- [Chain-of-Thought Reasoning](/wiki/chain_of_thought)
- [Large Language Model](/wiki/large_language_model)
- [AI Benchmark](/wiki/ai_benchmark)
- [Reasoning Model](/wiki/reasoning_model)

## References

[^1]: Rein, D., Hou, B. L., Stickland, A. C., Petty, J., Pang, R. Y., Dirani, J., Michael, J., & Bowman, S. R. (2023). "GPQA: A Graduate-Level Google-Proof Q&A Benchmark." *arXiv preprint arXiv:2311.12022*. https://arxiv.org/abs/2311.12022

[^2]: Rein, D., et al. (2024). "GPQA: A Graduate-Level Google-Proof Q&A Benchmark." *Conference on Language Modeling (COLM)*. https://openreview.net/forum?id=Ti67584b98

[^3]: Burnham, G. (2025). "GPQA Diamond: What's Left?" *Epoch AI Gradient Updates*. https://epoch.ai/gradient-updates/gpqa-diamond-whats-left

[^4]: Epoch AI. (2025). "AI Developers Accurately Report GPQA Diamond Scores for Recent Models." https://epoch.ai/data-insights/self-reported-gpqa

[^5]: Epoch AI. (2026). "GPQA Diamond Benchmark." https://epoch.ai/benchmarks/gpqa-diamond

[^6]: Artificial Analysis. (2026). "GPQA Diamond Benchmark Leaderboard." https://artificialanalysis.ai/evaluations/gpqa-diamond

[^7]: OpenAI. (2024). "simple-evals: GPQA Evaluation." GitHub. https://github.com/openai/simple-evals/blob/main/gpqa_eval.py

[^8]: Hugging Face. (2023). "GPQA Dataset." https://huggingface.co/datasets/Idavidrein/gpqa

[^9]: GitHub. (2023). "idavidrein/gpqa: GPQA Repository." https://github.com/idavidrein/gpqa

[^10]: IntuitionLabs. (2026). "GPQA-Diamond Benchmark: Scores, Leaderboard & How AI Models Compare." https://intuitionlabs.ai/articles/gpqa-diamond-ai-benchmark

[^11]: Vals AI. (2026). "GPQA Diamond Leaderboard." https://www.vals.ai/benchmarks/gpqa

[^12]: SmartChunks. (2026). "GPQA Diamond Score Explained: The AI Benchmark That Actually Matters." https://smartchunks.com/gpqa-diamond-score-explained-ai-benchmark-2026/

[^13]: BenchLM. (2026). "GPQA-D Benchmark 2026: Model Averages." https://benchlm.ai/benchmarks/gpqaDiamond

[^14]: Google DeepMind. (2026). "Gemini 3.1 Pro Model Card." https://deepmind.google/models/model-cards/gemini-3-1-pro/

[^15]: Vellum. (2026). "Claude Opus 4.7 Benchmarks Explained." https://www.vellum.ai/blog/claude-opus-4-7-benchmarks-explained

[^16]: BuildFastWithAI. (2026). "Best AI Models: April + May 2026 Leaderboard (GPT-5.5, Claude Opus 4.7, DeepSeek V4)." https://www.buildfastwithai.com/blogs/best-ai-models-may-2026-leaderboard

[^17]: Codersera. (2026). "Open-Source LLM Landscape 2026: DeepSeek V4 vs Llama 4 vs Qwen 3.5 vs Gemma 4." https://codersera.com/blog/open-source-llms-landscape-2026/

[^18]: NIST CAISI. (2026). "CAISI Evaluation of DeepSeek V4 Pro." https://www.nist.gov/news-events/news/2026/05/caisi-evaluation-deepseek-v4-pro

[^19]: OpenAI. (2024). "Learning to reason with LLMs." September 12, 2024. https://openai.com/index/learning-to-reason-with-llms/

