| GPQA Diamond | |
|---|---|
| Overview | |
| Full name | Graduate-Level Google-Proof Q&A Benchmark, Diamond Subset |
| Abbreviation | GPQA Diamond |
| Description | A challenging subset of graduate-level, Google-proof science questions testing PhD-level knowledge in biology, physics, and chemistry |
| Release date | 2023-11-20 |
| Latest version | 1.0 |
| Benchmark updated | 2023-11-20 |
| Authors | David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, Samuel R. Bowman |
| Organization | New York University, Anthropic |
| Technical Details | |
| Type | Scientific Reasoning, Expert Knowledge |
| Modality | Text |
| Task format | Multiple choice (4 options) |
| Number of tasks | 198 |
| Total examples | 198 |
| Evaluation metric | Accuracy (typically zero-shot chain-of-thought, pass@1) |
| Domains | Biology, Physics, Chemistry |
| Languages | English |
| Performance | |
| Human performance | 81.3% (Diamond expert validators), 21.9% (non-experts) |
| Baseline | 38.8% (GPT-4, few-shot CoT) |
| SOTA score | ~94.1% |
| SOTA model | Gemini 3.1 Pro Preview (Google DeepMind) |
| SOTA date | 2026-02 |
| Saturated | Near saturation |
| Resources | |
| Paper | Paper |
| GitHub | Repository |
| Dataset | Download |
| Predecessor | GPQA (full set, 448 questions) |
**GPQA Diamond** is a challenging AI benchmark consisting of 198 PhD-level multiple-choice questions in biology, physics, and chemistry. Released on November 20, 2023, it represents the most difficult subset of the Graduate-Level Google-Proof Q&A Benchmark (GPQA), specifically designed to test artificial intelligence systems on questions that require deep scientific expertise and cannot be easily answered through web searches. The benchmark was created by David Rein and collaborators at New York University and Anthropic, and has become one of the most widely cited evaluations for measuring scientific reasoning in large language models.
GPQA Diamond has gained particular prominence because it occupies a unique position in the AI evaluation landscape: it tests knowledge and reasoning at a level where even human PhD experts achieve only about 65% accuracy on the broader GPQA set (81.3% on the Diamond subset due to selection effects), while skilled non-experts with unrestricted internet access score just 21.9% on the Diamond questions. This wide expertise gap makes GPQA Diamond especially useful for studying scalable oversight, the problem of how humans can supervise AI systems that may eventually surpass human capabilities in specialized domains.
The primary motivation behind GPQA Diamond is rooted in one of the most pressing challenges in AI safety: scalable oversight. As AI systems grow more capable, they increasingly operate in domains where human supervisors cannot independently verify the correctness of AI outputs. A doctor reviewing an AI's diagnosis might lack the specialized knowledge to confirm whether the AI's reasoning in a rare subspecialty is sound. A policy analyst relying on AI for technical climate modeling may not be equipped to catch subtle errors in the underlying physics.
David Rein and his co-authors framed GPQA as a testbed for studying this problem. The benchmark was designed so that questions fall in a "sweet spot" of difficulty: hard enough that non-experts cannot simply look up the answer, but structured enough that domain experts can reliably identify the correct response. This gap between expert and non-expert performance creates realistic conditions for experimenting with oversight techniques like debate, market-making, and recursive reward modeling, all of which aim to help less-expert humans extract truthful answers from AI systems.
Before GPQA, most science-focused benchmarks occupied either end of the difficulty spectrum. Benchmarks like ScienceQA and ARC tested elementary to high school knowledge, while portions of MMLU covered undergraduate-level material. These benchmarks were valuable for tracking early progress in language model capabilities, but by 2023 frontier models had largely saturated them. At the other extreme, open-ended research tasks (like writing novel proofs or designing experiments) were difficult to evaluate automatically because they lacked clear, verifiable correct answers.
GPQA was designed to fill this gap: questions that are genuinely difficult, require graduate-level expertise across multiple scientific subfields, resist simple information retrieval, and yet have unambiguous correct answers in a multiple-choice format that allows automated scoring.
A distinctive feature of the GPQA dataset is its "Google-proof" nature. The term refers to the observation that skilled non-expert validators, despite having unrestricted access to web searches and academic papers, could not reliably answer the questions. During the validation phase, non-experts spent an average of 37 minutes per question (with a minimum requirement of 15 minutes) searching for information, reading research papers, and attempting to reason through the problems. Despite this effort, they achieved only 33.9% accuracy on the extended set, barely above the 25% random-guessing baseline for four-choice questions, and just 21.9% on the Diamond subset, which falls below chance because Diamond is selected for questions that most non-experts answered incorrectly.
This Google-proof quality arises because the questions typically require specialized, graduate-level domain knowledge and multi-step reasoning that cannot be assembled from straightforward web searches or skimming of research papers.
The GPQA dataset was created by recruiting 61 domain experts through the freelancing platform Upwork. All experts either held a PhD or were actively pursuing one in biology, physics, or chemistry, and indicated proficiency or fluency in English. The creators designed a compensation structure that heavily emphasized quality: the majority of payment came from performance-based bonuses rather than flat fees, with estimated average hourly compensation of approximately $95 per hour and a maximum of $150 per hour.
The creation of each GPQA question followed a rigorous four-stage pipeline:
| Stage | Activity | Participants | Key Requirements |
|---|---|---|---|
| 1. Question Writing | Expert authors write questions with explanations for correct and incorrect answer choices | Domain experts (PhD holders/candidates) | Questions must be difficult, require deep knowledge, and include four plausible answer choices |
| 2. First Expert Validation | A second domain expert attempts the question and provides feedback | Independent expert in the same field | Validator assesses objectivity, accuracy, and difficulty; provides detailed feedback |
| 3. Question Revision | Original author revises the question based on validator feedback | Original question writer | Revision is optional if no changes are suggested |
| 4. Non-Expert Validation | Three non-experts from different domains attempt the question | Skilled validators with PhDs in other fields | Minimum 15 minutes per question; unrestricted web access; average time spent was 37 minutes |
The creators designed an elaborate incentive system to encourage both difficult question writing and honest, careful validation:
Question Writing Compensation:
| Component | Amount | Condition |
|---|---|---|
| Base payment | $10 | Per question submitted |
| Expert validator bonus | $20 per validator | Each expert who answers correctly (max $40) |
| Non-expert difficulty bonus | $15 per validator | Each non-expert who answers incorrectly (max $45) |
| Quality bonus | $30 | Both experts correct AND at least 2 of 3 non-experts incorrect |
Expert Validation Compensation:
| Component | Amount | Condition |
|---|---|---|
| Base payment | $10 | Per question validated |
| Correct answer bonus | $10 | Validator answers correctly |
| Agreement bonus | $10 | Second expert also answers correctly |
| Difficulty bonus | $5 | Majority of non-experts answer incorrectly |
This structure incentivized question writers to produce questions that were genuinely difficult (rewarding non-expert failures) while remaining answerable by experts (rewarding expert successes). It also incentivized validators to answer carefully and provide honest, thorough feedback.
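For concreteness, the question-writing payout can be expressed as a simple function of validation outcomes. The sketch below is illustrative (the function name and input format are not part of the released materials) and encodes the amounts from the tables above:

```python
def question_writer_payout(experts_correct: int, nonexperts_incorrect: int) -> int:
    """Illustrative payout (USD) for one GPQA question under the bonus structure above.

    experts_correct:      number of expert validators (0-2) answering correctly
    nonexperts_incorrect: number of non-expert validators (0-3) answering incorrectly
    """
    payout = 10                                   # base payment per question submitted
    payout += 20 * min(experts_correct, 2)        # $20 per correct expert, max $40
    payout += 15 * min(nonexperts_incorrect, 3)   # $15 per incorrect non-expert, max $45
    # Quality bonus: both experts correct AND at least 2 of 3 non-experts incorrect.
    if experts_correct == 2 and nonexperts_incorrect >= 2:
        payout += 30
    return payout

# A maximally rewarded question: both experts right, all three non-experts wrong.
assert question_writer_payout(2, 3) == 10 + 40 + 45 + 30  # $125
```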
The full GPQA dataset exists in three nested subsets of increasing quality and difficulty:
| Subset | Size | Selection Criteria |
|---|---|---|
| GPQA Extended | 546 questions | All collected questions |
| GPQA Main | 448 questions | At least 1 of 2 experts correct AND at most 2 of 3 non-experts correct |
| GPQA Diamond | 198 questions | Both experts correct AND at most 1 of 3 non-experts correct |
The Diamond subset applies the strictest filters. For a question to qualify, both expert validators had to answer it correctly (or, if the second expert initially answered incorrectly, they had to clearly describe the mistake or demonstrate understanding of the question writer's explanation after seeing the answer). Additionally, the majority of non-experts had to answer incorrectly. This dual requirement ensures that Diamond questions are simultaneously answerable by true domain experts and resistant to non-expert reasoning with internet access.
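The subset construction amounts to applying two predicates to each question's validation record. The sketch below uses hypothetical per-question fields and omits the provision for a second expert who convincingly describes their mistake; it is meant only to illustrate the filtering logic described above:

```python
from typing import List, TypedDict

class ValidationRecord(TypedDict):
    # Hypothetical per-question summary of validation outcomes (not the released schema).
    experts_correct: int      # 0-2 expert validators answering correctly
    nonexperts_correct: int   # 0-3 non-expert validators answering correctly

def in_main_set(q: ValidationRecord) -> bool:
    # Main set: at least 1 of 2 experts correct, at most 2 of 3 non-experts correct.
    return q["experts_correct"] >= 1 and q["nonexperts_correct"] <= 2

def in_diamond_set(q: ValidationRecord) -> bool:
    # Diamond set: both experts correct, at most 1 of 3 non-experts correct.
    return q["experts_correct"] == 2 and q["nonexperts_correct"] <= 1

def split(questions: List[ValidationRecord]):
    main = [q for q in questions if in_main_set(q)]
    diamond = [q for q in questions if in_diamond_set(q)]
    return main, diamond
```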
An additional held-out set of 18 questions was reserved and not publicly released, intended for future validation purposes.
The 198 questions in GPQA Diamond span three scientific domains, though the distribution across subdomains is uneven. Based on the extended set of 546 questions (from which Diamond is drawn), the approximate subdomain breakdown is:
| Domain | Subdomains | Approx. Questions (Extended Set) |
|---|---|---|
| Biology | Molecular Biology (85), Genetics (20) | 105 |
| Physics | Quantum Mechanics (64), High-Energy Particle Physics (46), General Physics (43), Astrophysics (42), Electromagnetism/Photonics (12), Relativistic Mechanics (11), Statistical Mechanics (4), Condensed Matter (4), Optics/Acoustics (1) | 227 |
| Chemistry | Organic Chemistry (144), General Chemistry (64), Inorganic Chemistry (3), Analytical Chemistry (2), Physical Chemistry (1) | 214 |
A notable feature of this distribution is the heavy representation of organic chemistry, which accounts for roughly 26% of all questions in the extended set. This has important implications for model evaluation, as organic chemistry questions turn out to be disproportionately difficult for AI systems.
Each GPQA Diamond question is a text-only, four-option multiple-choice problem. Questions do not include images, diagrams, or graphs (though some questions reference visual concepts that the solver must reason about from text descriptions alone). The average question length is approximately 630 characters (median 561), or about 169 tokens using GPT-4's tokenizer.
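These length statistics can, in principle, be reproduced from the released data with OpenAI's tiktoken tokenizer. The snippet below is a sketch: the Hugging Face dataset identifier, configuration, split, and column name are assumptions about the public release (which is gated behind a usage agreement) and should be checked against it.

```python
from statistics import mean, median

import tiktoken
from datasets import load_dataset

# Dataset id, config, split, and column name are assumptions; verify against the release.
ds = load_dataset("Idavidrein/gpqa", "gpqa_diamond", split="train")
enc = tiktoken.encoding_for_model("gpt-4")

char_lengths = [len(q) for q in ds["Question"]]
token_lengths = [len(enc.encode(q)) for q in ds["Question"]]

print(f"questions: {len(ds)}")
print(f"mean chars: {mean(char_lengths):.0f}, median chars: {median(char_lengths):.0f}")
print(f"mean GPT-4 tokens: {mean(token_lengths):.0f}")
```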
Questions range in stated difficulty from "hard undergraduate" to "post-graduate level," and expert validators provide 4-point difficulty ratings after answering. Analysis shows these ratings are predictive of non-expert accuracy, indicating that experts have a reasonable ability to judge how difficult questions will be for less-specialized audiences.
The standard evaluation protocol for GPQA Diamond uses zero-shot or few-shot prompting with chain-of-thought reasoning:
| Evaluation Method | Description | Common Usage |
|---|---|---|
| Zero-shot CoT | Model reasons step-by-step without example questions | Standard protocol in OpenAI's simple-evals suite |
| Few-shot CoT | Model is given example questions and solutions before the test question | Used in some academic evaluations |
| Zero-shot direct | Model selects an answer without explicit reasoning | Less common; generally yields lower scores |
| Pass@1 | Single attempt accuracy; no majority voting | Most commonly reported metric |
| Consensus / Majority voting | Multiple samples; most common answer selected | Sometimes used but can inflate scores |
GPQA Diamond is included in OpenAI's simple-evals evaluation suite alongside MMLU, MATH, and other standard benchmarks, which has helped establish it as a default evaluation for new model releases across the industry.
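In outline, a zero-shot chain-of-thought evaluation run follows the pattern sketched below. The prompt wording, answer-extraction regex, and `ask_model` call are illustrative rather than a faithful reproduction of simple-evals, and the column names assume a release schema that stores the correct answer and three incorrect answers in separate fields (which is why the options are shuffled per question):

```python
import random
import re

# Illustrative prompt; real harnesses (e.g., simple-evals) use their own template.
PROMPT = (
    "Answer the following multiple-choice question. Think step by step, then finish "
    "with a line of the form 'Answer: X', where X is one of A, B, C, D.\n\n"
    "{question}\n\nA) {a}\nB) {b}\nC) {c}\nD) {d}"
)

def evaluate_pass1(examples, ask_model, seed=0):
    """Single-attempt (pass@1) accuracy; `ask_model` is any text-in/text-out model call."""
    rng = random.Random(seed)
    correct = 0
    for ex in examples:
        options = [ex["Correct Answer"], ex["Incorrect Answer 1"],
                   ex["Incorrect Answer 2"], ex["Incorrect Answer 3"]]
        rng.shuffle(options)                                # randomize option order
        gold = "ABCD"[options.index(ex["Correct Answer"])]  # letter of the correct option
        reply = ask_model(PROMPT.format(question=ex["Question"], a=options[0],
                                        b=options[1], c=options[2], d=options[3]))
        match = re.search(r"Answer:\s*([ABCD])", reply)
        correct += int(bool(match) and match.group(1) == gold)
    return correct / len(examples)
```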
The GPQA paper estimates that approximately 73.6% of questions have objectively verifiable correct answers under a conservative assessment, rising to 76.4% when including cases where validators demonstrated understanding of the intended answer even if they initially selected the wrong option. The Diamond subset is expected to have a higher objectivity rate than the extended set because both expert validators had to agree on the correct answer.
Later analysis by Epoch AI estimated that roughly 90% to 95% of Diamond questions are valid, with approximately 10 to 20 questions (out of 198) potentially having issues such as incorrect answer keys or ambiguous wording.
Human performance on GPQA varies significantly depending on the subset and how "expert" is defined:
| Metric | Extended (546) | Main (448) | Diamond (198) |
|---|---|---|---|
| Expert accuracy | 65.4% | 72.5% | 81.3% |
| Non-expert accuracy | 33.9% | 30.5% | 21.9% |
| Expert-non-expert gap | 31.5 pp | 42.0 pp | 59.4 pp |
The higher expert accuracy on the Diamond subset (81.3% versus 65.4% on Extended) is a selection artifact: Diamond specifically includes questions where both experts answered correctly, so the measured expert accuracy on that subset is inflated by the filtering criteria. The extended set figure of 65.4% (or approximately 74% after discounting clear mistakes identified in retrospect) more accurately reflects typical expert performance on these types of questions.
OpenAI independently recruited PhD-level experts to answer GPQA Diamond questions and reported an expert accuracy of approximately 69.7%, which is consistent with the original paper's findings when accounting for the selection effects in the Diamond subset.
Non-expert validators were not laypeople; they were skilled individuals with PhDs in fields other than the question's domain. They received unrestricted internet access and were required to spend at least 15 minutes per question. On average, they spent 37 minutes per question (median 30 minutes), often reading multiple academic research papers in the attempt to find relevant information. Despite this substantial effort, their accuracy on the Diamond subset was just 21.9%, below the 25% random baseline for four-option questions (an expected consequence of Diamond's selection for questions that most non-experts answered incorrectly).
This near-chance performance among educated, motivated non-experts is what makes GPQA Diamond "Google-proof" and makes it a compelling testbed for scalable oversight research.
Expert and non-expert performance varies by scientific domain:
| Domain | Expert Accuracy (Extended) | Non-Expert Accuracy (Extended) | GPT-4 Few-shot CoT |
|---|---|---|---|
| Biology | 66.7% | 43.2% | 58.1% |
| Physics | 57.3% | 32.5% | 37.0% |
| Chemistry | 72.0% | 31.4% | 31.8% |
Notably, biology had the highest non-expert accuracy (43.2%), suggesting that some biology questions may be more accessible to educated non-specialists. Chemistry showed the widest expertise gap, with experts at 72.0% but non-experts at just 31.4%.
The benchmark has seen dramatic improvements in AI performance since its release:
| Period | Key Development | Best Score |
|---|---|---|
| November 2023 | Initial release; GPT-4 baseline | 38.8% |
| March 2024 | Claude 3 Opus evaluated | ~60% |
| June 2024 | Claude 3.5 Sonnet | 59.4% |
| September 2024 | OpenAI o1 released | 77.3% |
| December 2024 | OpenAI o3 released | 83.3% |
| January 2025 | DeepSeek-R1 released | 71.5% |
| July 2025 | Aristotle X1 Verify (Autopoiesis Sciences) | 92.4% |
| Late 2025 | GPT-5.2 and Gemini 3 Pro | ~92-93% |
| February 2026 | Gemini 3.1 Pro Preview | ~94.1% |
As of early 2026, the following models have achieved notable performance on GPQA Diamond:
| Rank | Model | Accuracy (%) | Organization | Date |
|---|---|---|---|---|
| 1 | Gemini 3.1 Pro Preview | ~94.1 | Google DeepMind | February 2026 |
| 2 | GPT-5.2 | ~92.4 | OpenAI | 2025 |
| 3 | Aristotle X1 Verify | 92.4 | Autopoiesis Sciences | July 2025 |
| 4 | GPT-5.4 (xhigh) | ~92.0 | OpenAI | 2026 |
| 5 | Gemini 3 Pro | ~91.9 | Google DeepMind | 2025 |
| 6 | GPT-5.3 Codex (xhigh) | ~91.5 | OpenAI | 2025 |
| 7 | Claude Opus 4.6 | ~91.0 | Anthropic | 2025 |
| 8 | Claude Opus 4.5 | ~87.0 | Anthropic | 2025 |
| 9 | xAI Grok 4 | ~87.0 | xAI | 2025 |
| 10 | Gemini 2.5 Pro | ~86.4 | Google DeepMind | March 2025 |
| 11 | OpenAI o3 | 83.3 | OpenAI | December 2024 |
| 12 | OpenAI o3-mini-high | 79.7 | OpenAI | 2025 |
| 13 | OpenAI o1 | 77.3 | OpenAI | September 2024 |
| 14 | Claude 3.7 Sonnet (Thinking) | 75.3 | Anthropic | 2025 |
| 15 | DeepSeek R1 | 71.5 | DeepSeek | January 2025 |
| 16 | Claude 3.7 Sonnet | 67.4 | Anthropic | 2025 |
| 17 | Claude 3 Opus | ~60.0 | Anthropic | March 2024 |
| 18 | Claude 3.5 Sonnet | 59.4 | Anthropic | June 2024 |
| 19 | DeepSeek-V3 | 59.1 | DeepSeek | December 2024 |
| 20 | GPT-4 (baseline) | 38.8 | OpenAI | November 2023 |
Note: Scores may vary depending on evaluation methodology (zero-shot vs. few-shot, pass@1 vs. consensus), prompting strategy, and the specific model version tested. Epoch AI found that self-reported scores from major labs generally fall within the confidence interval of independently reproduced evaluations.
A significant inflection point in GPQA Diamond performance came with the introduction of reasoning-focused models. OpenAI's o1 (September 2024) was the first model to substantially exceed human expert performance on this benchmark, scoring 77.3% compared to the approximately 69.7% expert baseline. The o1 model family uses extended chain-of-thought reasoning at inference time, allocating more computation to work through multi-step problems.
This pattern continued with o3 (83.3%) and subsequent reasoning models from multiple labs. The success of reasoning models on GPQA Diamond suggests that the benchmark's difficulty stems partly from the need for careful, multi-step logical chains rather than from a lack of factual knowledge in the training data.
Autopoiesis Sciences' Aristotle X1 Verify system, which achieved 92.4% on GPQA Diamond in July 2025, introduced an additional innovation: calibrated confidence scoring. Unlike most AI systems whose stated confidence levels do not correlate reliably with actual accuracy, Aristotle X1 embeds systematic doubt into every layer of reasoning. The system also achieved 96.1% on SimpleQA, OpenAI's factuality benchmark, suggesting that its approach to verification and confidence estimation transfers across different evaluation contexts.
Epoch AI's detailed analysis of GPQA Diamond revealed a striking pattern: organic chemistry is massively overrepresented among the questions that models consistently answer incorrectly. While organic chemistry accounts for roughly a quarter (about 26%) of the extended question set from which Diamond is drawn, it makes up approximately 70% of the 40 questions that top models (those scoring above 70% overall) consistently get wrong.
Several factors likely contribute to organic chemistry's difficulty for AI systems, chief among them the reliance on precise structural and mechanistic knowledge that must be reasoned about entirely from text descriptions.
Physics questions are generally more tractable for AI models, particularly those employing chain-of-thought reasoning. This is likely because many physics problems can be broken down into explicit mathematical steps: identify the relevant equations, substitute known values, and solve. Reasoning models excel at this type of structured problem-solving, making physics the domain where AI performance tends to be highest.
Subdomain variation exists within physics as well. Quantum mechanics and high-energy particle physics questions, which require knowledge of specialized formalisms and counterintuitive principles, tend to be harder than classical mechanics or electromagnetism questions.
Biology questions show the most variance in difficulty. Molecular biology questions that test specific procedural knowledge (such as the steps of a particular experimental protocol) tend to be very difficult for both AI systems and non-expert humans. However, some genetics and ecology questions are more accessible, contributing to the higher non-expert accuracy observed in the biology domain (43.2% on the extended set).
By early 2026, GPQA Diamond shows clear signs of approaching saturation. Multiple frontier models score above 90%, and the top-performing system (Gemini 3.1 Pro Preview) achieves approximately 94.1%, leaving little room for further improvement. State-of-the-art model scores have been clustering in a narrow band, with diminishing returns on new improvements.
| Saturation Indicator | Observation |
|---|---|
| Score clustering | Multiple models in the 87-94% range |
| Ceiling approach | Top scores within ~6% of perfect accuracy |
| Consistent failure set | ~40 questions (20%) consistently missed by most models |
| Diminishing improvement rate | Score improvements shrinking with each generation |
The question of whether GPQA Diamond is truly saturated, or whether the questions frontier models still miss (roughly 6-13%, depending on the model) reflect genuine model limitations versus flawed questions, has been investigated by several research groups.
Epoch AI conducted a detailed analysis examining the 40 questions that high-performing models (70%+ overall accuracy) most frequently answered incorrectly. Their findings suggest that most of these difficult questions are genuinely valid but require highly specialized knowledge. Of six particularly problematic questions examined in depth (those with sub-5% model accuracy), the expected number of invalid questions was estimated at roughly 2.25. Extrapolating this rate to the full set of 40 difficult questions yields an estimated invalid-question rate of roughly 8% (15 out of 198), though the authors acknowledged significant uncertainty in this extrapolation.
The questions most likely to be invalid included ones involving specialized procedural knowledge that may not have a single correct answer, and at least one question where the intended answer appeared to be incorrect based on independent expert review.
Epoch AI also investigated whether AI labs accurately report their GPQA Diamond scores. By comparing self-reported scores against independently reproduced evaluations, they found that all major labs' self-reported scores fell within the expected confidence interval. The computed p-values were well above 0.05 for all tested models, indicating no statistically significant difference between reported and independently measured performance. Epoch estimates that their independent evaluations can determine true model performance to within 4 to 6 percentage points with 90% confidence, given the 198-question sample size.
Understanding how GPQA Diamond relates to the broader GPQA dataset is important for interpreting benchmark results:
| Property | Extended | Main | Diamond |
|---|---|---|---|
| Number of questions | 546 | 448 | 198 |
| Expert accuracy | 65.4% | 72.5% | 81.3% |
| Non-expert accuracy | 33.9% | 30.5% | 21.9% |
| Selection criteria | None (all collected) | 1+ of 2 experts correct, at most 2 of 3 non-experts correct | Both experts correct, at most 1 of 3 non-experts correct |
| GPT-4 baseline (few-shot CoT) | 38.7% | 39.7% | 38.8% |
The Diamond subset is both harder (for non-experts) and more reliable (in terms of having verifiably correct answers) than the broader sets. This combination of difficulty and quality is why Diamond has become the standard evaluation target rather than the full GPQA set.
Several factors contributed to GPQA Diamond becoming the preferred evaluation subset: its stricter dual validation criteria, the correspondingly higher confidence that its answer keys are correct, and its inclusion in widely used evaluation suites such as OpenAI's simple-evals.
GPQA Diamond's primary intended application is as a testbed for scalable oversight methods. The benchmark's design creates conditions analogous to a scenario where a human supervisor (the non-expert) must evaluate the work of a more capable system (the AI model that outperforms non-experts). Researchers use this setup to test oversight techniques such as debate, market-making, and recursive reward modeling, all of which aim to help less-expert humans extract truthful answers from AI systems.
GPQA Diamond serves as one of the primary benchmarks for measuring progress toward expert-level scientific reasoning in AI. It is now routinely reported in model release announcements from OpenAI, Anthropic, Google DeepMind, and other leading labs. The benchmark's focus on graduate-level science makes it complementary to other widely used evaluations:
| Benchmark | Focus | Difficulty Level | Questions |
|---|---|---|---|
| MMLU | Broad knowledge (57 subjects) | High school to professional | 14,042 |
| MMLU-Pro | Harder version of MMLU | Professional | 12,032 |
| GPQA Diamond | Science (bio, phys, chem) | Graduate to post-graduate | 198 |
| MATH | Mathematics | Competition level | 5,000 |
| HumanEval | Code generation | Professional | 164 |
| AIME | Mathematical reasoning | Competition level | Varies |
Beyond model evaluation, the benchmark has practical implications: it serves as a testbed for scalable oversight research and as a reference point for claims that AI systems have reached expert-level scientific reasoning.
At 198 questions, GPQA Diamond is a relatively small benchmark. This limits the statistical power of comparisons between models, particularly when score differences are small. Epoch AI estimates that independent evaluations can only determine true performance to within 4 to 6 percentage points with 90% confidence. As a result, score differences of less than about 5 percentage points between models may not be statistically meaningful.
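The quoted 4 to 6 percentage-point uncertainty is consistent with simple binomial sampling error over 198 questions, as the back-of-the-envelope normal-approximation check below illustrates for a few hypothetical true accuracies:

```python
from math import sqrt

N = 198      # questions in GPQA Diamond
Z90 = 1.645  # two-sided 90% normal critical value

for p in (0.60, 0.75, 0.90):  # hypothetical true accuracies
    half_width = Z90 * sqrt(p * (1 - p) / N)
    print(f"true accuracy {p:.0%}: 90% CI ≈ ±{100 * half_width:.1f} percentage points")
# Roughly ±3.5 to ±5.7 points across this range, in line with Epoch AI's 4-6 point estimate.
```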
The four-option multiple-choice format, while enabling automated evaluation, may not reflect real-world scientific reasoning. In practice, scientists formulate hypotheses, design experiments, interpret ambiguous data, and synthesize information from multiple sources. The multiple-choice structure reduces these complex cognitive tasks to answer selection, which may overestimate a model's true scientific understanding.
Because GPQA Diamond is a fixed set of 198 questions that has been publicly available since November 2023, there is a risk of data contamination. Models trained on data that includes GPQA questions or their answers (directly or indirectly) may achieve inflated scores that do not reflect genuine reasoning ability. The dataset maintainers included canary strings to detect unauthorized use in training data, but the risk increases over time as the benchmark becomes more widely discussed and analyzed online.
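Checking a corpus for the canary reduces to plain substring matching. The GUID in the sketch below is a placeholder rather than the real GPQA canary, which is published alongside the dataset:

```python
# Placeholder value; substitute the canary GUID distributed with the GPQA dataset.
GPQA_CANARY = "00000000-0000-0000-0000-000000000000"

def corpus_contains_canary(documents) -> bool:
    """Return True if any training document contains the benchmark canary string."""
    return any(GPQA_CANARY in doc for doc in documents)

# Typical usage: scan raw training shards before training and drop (or at least log)
# any document that matches, so benchmark questions do not leak into the model.
```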
GPQA Diamond covers only three scientific domains: biology, physics, and chemistry. Important fields like mathematics, computer science, engineering, earth sciences, and medicine are not represented. This means the benchmark provides only a partial picture of a model's scientific capabilities.
All questions are written in English, limiting the benchmark's applicability to evaluating multilingual scientific reasoning capabilities.
The heavy representation of organic chemistry in the question set (roughly 26% of the extended set, and an even larger share of the hardest questions) means that models' aggregate scores may be disproportionately influenced by performance on a single subdomain. A model that excels at everything except organic chemistry may receive a misleadingly low overall score.
Researchers have suggested several enhancements to address GPQA Diamond's limitations, including broader domain coverage and task formats that go beyond four-option multiple choice.
As GPQA Diamond approaches saturation, the AI evaluation community is developing more challenging successors.