SuperGPQA

Updated Jun 2, 2026

SuperGPQA is a large graduate-level knowledge and reasoning benchmark for evaluating large language models across 285 academic disciplines. It contains 26,529 expert-level multiple-choice questions and was released in February 2025 by the Doubao (Seed) team at ByteDance together with the M-A-P open-source community. The benchmark was designed to extend evaluation beyond the handful of mainstream fields covered by earlier tests and into "long-tail" graduate disciplines such as light industry, agriculture, and service-oriented sciences. The accompanying paper, "SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines," is available as arXiv:2502.14739 and was presented as a poster at NeurIPS 2025.^[1]^[2]^[3]

Overview

SuperGPQA targets the upper end of academic difficulty, posing questions at a level expected of graduate students and domain specialists rather than general test-takers. Each item is a multiple-choice question with between four and ten answer options, 9.67 on average, which sharply reduces the chance of a model scoring well by guessing compared with the conventional four-option format.^[2]^[4]

The motivation for the project was the observation that widely used knowledge benchmarks concentrate on a small set of popular subjects and leave more than 200 specialized disciplines effectively unmeasured. The authors argue that a model's competence in mainstream domains says little about its grasp of niche or applied fields, and that a benchmark spanning the full breadth of graduate education gives a more faithful picture of how far current systems remain from broad, expert-level mastery.^[1]^[2]

The dataset is released under an open license (ODC-BY) and is distributed through Hugging Face, with evaluation code and a leaderboard published on GitHub. It has since been integrated into third-party evaluation frameworks, which has helped it become a recurring entry in model report cards.^[4]^[5]

Relationship to GPQA, MMLU, and other benchmarks

Despite the similar name, SuperGPQA is a distinct benchmark from GPQA (Graduate-Level Google-Proof Q&A). GPQA is a small, deliberately hard set of 448 questions written by domain experts and confined to three broad areas: biology, physics, and chemistry. Its GPQA Diamond subset is the most cited slice.^[1]^[6] SuperGPQA keeps the goal of graduate-level difficulty but scales the breadth dramatically, replacing the three-domain focus with 285 disciplines and increasing the question count by a factor of roughly 60 (26,529 versus 448).^[1]^[2]

It also differs from MMLU (Massive Multitask Language Understanding) and its harder successor MMLU-Pro. MMLU spans 57 subjects at a mix of difficulty levels, and many of its questions sit below the graduate threshold. SuperGPQA both raises the difficulty floor and widens disciplinary coverage, explicitly including applied and vocational fields that benchmarks like MMLU and GPQA omit.^[1]^[2] In this sense it sits alongside other recent "frontier" evaluations such as Humanity's Last Exam in trying to push assessment past the point where leading models already saturate older tests.^[2]

The 285 disciplines and question set

SuperGPQA organizes its content hierarchically across three levels: 13 top-level disciplines, 72 fields, and 285 subfields, the last of which corresponds to the "285 graduate disciplines" in the title.^[2]^[4] The 13 top-level disciplines are Engineering, Medicine, Science, Philosophy, Economics, Law, History, Education, Literature and Arts, Management, Military Science, Agronomy, and Sociology.^[3]

The full set comprises 26,529 questions. Each carries metadata including its discipline, field, and subfield, a difficulty label, and a flag indicating whether answering it requires calculation. Questions are distributed across three difficulty tiers labeled easy, middle, and hard. A notable feature of the dataset is its emphasis on reasoning rather than pure recall: 42.33% of the questions require mathematical calculation or rigorous formal inference rather than simple factual lookup.^[2]^[4] This design choice is intended to make the benchmark resistant to models that have memorized surface-level facts but cannot work through multi-step problems.
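The per-question metadata described above can be pictured as a small record type. The sketch below is illustrative only: the field names (`discipline`, `field`, `subfield`, `difficulty`, `is_calculation`) are assumptions chosen to mirror the description, not the dataset's actual column names, and the sample records are placeholders rather than real items.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Question:
    # Field names are illustrative assumptions, not the dataset's actual columns.
    discipline: str        # top-level discipline (13 in total)
    field: str             # field within the discipline (72 in total)
    subfield: str          # subfield within the field (285 in total)
    difficulty: str        # "easy", "middle", or "hard"
    is_calculation: bool   # True if answering requires calculation

def calculation_share(questions):
    """Fraction of questions flagged as requiring calculation."""
    return sum(q.is_calculation for q in questions) / len(questions)

# Placeholder records, not real SuperGPQA items.
sample = [
    Question("Science", "placeholder field", "placeholder subfield", "hard", True),
    Question("Law", "placeholder field", "placeholder subfield", "easy", False),
    Question("Engineering", "placeholder field", "placeholder subfield", "middle", True),
    Question("History", "placeholder field", "placeholder subfield", "middle", False),
]
print(calculation_share(sample))  # 0.5

# Count implied by the figures quoted above (the percentage is rounded).
print(round(26_529 * 0.4233))  # 11230
```

Applied to the full set, the quoted 42.33% share works out to roughly 11,230 of the 26,529 questions.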

The construction pipeline

The questions were assembled through a structured pipeline combining human expertise with language-model assistance, involving a team of more than 80 expert annotators. The authors describe three main stages.^[1]^[2]

The first stage, source screening, has expert annotators collect candidate questions from trustworthy materials such as textbooks and authoritative exercise repositories, ensuring that each item reflects genuine graduate-level content in its discipline.^[2] The second stage, transcription, standardizes the collected questions into a uniform academic format, normalizing notation and phrasing so that items are consistent across disciplines.^[2] The third stage, quality inspection, applies a three-part check: rule-based filtering to catch malformed items, an LLM-based validity pass, and expert review.^[2]

Central to the process is a Human-LLM collaborative filtering mechanism. Candidate questions are run past language models, and items that prove trivial or ambiguous are flagged and refined or discarded through iterative rounds that draw on both model responses and expert feedback. This loop is meant to remove questions that are too easy, poorly specified, or answerable without real domain understanding, while retaining those that genuinely discriminate between stronger and weaker models.^[1]^[2]
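One way to picture the model-assisted part of this loop is a triage step that routes each candidate question according to how a panel of models answered it. The sketch below is a hypothetical illustration, not the authors' procedure: the function name `triage`, the labels it returns, and the all-correct and all-wrong cut-offs are all assumptions.

```python
def triage(model_correct):
    """Route one candidate question given per-model correctness flags.

    Hypothetical rule: a question every model gets right is sent to experts
    as possibly trivial; one that no model gets right is sent to experts as
    possibly ambiguous or mis-keyed; anything else is kept.
    """
    if not model_correct:
        raise ValueError("need at least one model response")
    share = sum(model_correct) / len(model_correct)
    if share == 1.0:
        return "review_too_easy"
    if share == 0.0:
        return "review_possibly_ambiguous"
    return "keep"

# Made-up correctness flags for three candidate questions.
print(triage([True, True, True]))     # review_too_easy
print(triage([False, False, False]))  # review_possibly_ambiguous
print(triage([True, False, True]))    # keep
```

In this sketch the models only flag items. The decision to refine or discard a flagged question stays with the expert reviewers, matching the division of labor described above.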

Evaluation methodology

Models are evaluated on accuracy: the share of questions for which they select the correct option. Because most questions offer close to ten options, the random-guessing floor is roughly 10%, compared with 25% on standard four-choice benchmarks, which leaves more of the score range to separate capable systems from weak ones.^[2]^[4]
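The effect of the option count on the guessing baseline can be checked with a few lines of arithmetic. Under uniform random guessing, the expected accuracy is the mean of 1/k over questions, where k is the number of options on each question. The option counts below are illustrative only and are not the real SuperGPQA distribution.

```python
from fractions import Fraction

def guess_accuracy(option_counts):
    """Expected accuracy of uniform random guessing: the mean of 1/k."""
    return sum(Fraction(1, k) for k in option_counts) / len(option_counts)

# Illustrative option counts only, not the real SuperGPQA distribution.
four_choice = [4] * 10        # a conventional four-option benchmark
mostly_ten = [10] * 9 + [4]   # nine ten-option questions, one four-option

print(float(guess_accuracy(four_choice)))  # 0.25
print(float(guess_accuracy(mostly_ten)))   # 0.115
```

Because 1/k is convex in k, the guessing baseline for a set averaging 9.67 options is at least 1/9.67, or about 10.3%. Its exact value depends on how the option counts are distributed.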

The benchmark reports results at several levels of granularity. In addition to an overall accuracy figure, scores are broken out by difficulty tier (easy, middle, and hard) and aggregated across the discipline, field, and subfield hierarchy, allowing analysis of where a given model is strong or weak.^[3] In the original study the authors evaluated a broad slate of systems, covering 6 reasoning-focused models, 28 instruction-tuned chat models, and 17 base models.^[2]
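A breakdown of this kind amounts to grouping per-question outcomes by a metadata key. The following is a minimal sketch under assumed names: the record keys (`discipline`, `difficulty`, `correct`) and the helper `accuracy_by` are hypothetical, and the records are made up for illustration.

```python
from collections import defaultdict

def accuracy_by(records, key):
    """Accuracy per distinct value of `key` (e.g. difficulty or discipline)."""
    totals = defaultdict(lambda: [0, 0])  # key value -> [correct, total]
    for r in records:
        bucket = totals[r[key]]
        bucket[0] += int(r["correct"])
        bucket[1] += 1
    return {k: c / n for k, (c, n) in totals.items()}

# Made-up per-question outcomes for one model.
records = [
    {"discipline": "Science", "difficulty": "easy", "correct": True},
    {"discipline": "Science", "difficulty": "hard", "correct": False},
    {"discipline": "Law", "difficulty": "easy", "correct": False},
    {"discipline": "Law", "difficulty": "hard", "correct": False},
]

overall = sum(r["correct"] for r in records) / len(records)
print(overall)                             # 0.25
print(accuracy_by(records, "difficulty"))  # {'easy': 0.5, 'hard': 0.0}
print(accuracy_by(records, "discipline"))  # {'Science': 0.5, 'Law': 0.0}
```

The same helper can be reused with field or subfield keys to reach the finer levels of the hierarchy.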

Notable results by model

The table below lists overall accuracy figures reported in the SuperGPQA paper and its official leaderboard. Even the strongest reasoning model fell well short of mastery, which the authors cite as evidence of the distance between current systems and broad expert-level competence.^[1]^[3]

Model	Type	Overall accuracy
DeepSeek-R1	Reasoning	61.82%
OpenAI o1 (2024-12-17)	Reasoning	60.24%
OpenAI o3-mini (high)	Reasoning	55.22%
OpenAI o3-mini (medium)	Reasoning	52.69%
Doubao-1.5-pro-32k (250115)	Chat	55.09%
Qwen-max (2025-01-25)	Chat	50.08%
Claude 3.5 Sonnet (20241022)	Chat	48.16%
Gemini 2.0 Flash	Chat	47.73%
Qwen2.5-72B	Base	34.33%
Qwen2.5-32B	Base	33.16%
DeepSeek V3-Base	Base	32.14%

The headline finding was that DeepSeek-R1 led the field at 61.82%, narrowly ahead of OpenAI's o1.^[1]^[2] Reasoning-tuned models as a group outperformed standard chat models, and chat models in turn outperformed base models, a pattern consistent with the benchmark's heavy reasoning component.^[2]^[3] Among non-reasoning systems, ByteDance's own Doubao 1.5-pro was the strongest entry in the original report.^[3]
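The ordering of the three model types can be illustrated with the figures in the table above. The sketch below averages only the eleven listed entries, so the group means describe this selection and not the full set of 51 evaluated models.

```python
from statistics import mean

# Overall accuracy (%) for the models listed in the table above.
scores = {
    "Reasoning": [61.82, 60.24, 55.22, 52.69],
    "Chat": [55.09, 50.08, 48.16, 47.73],
    "Base": [34.33, 33.16, 32.14],
}

group_means = {kind: round(mean(vals), 1) for kind, vals in scores.items()}
print(group_means)  # {'Reasoning': 57.5, 'Chat': 50.3, 'Base': 33.2}
```

Among the listed models, the gap between chat and base models (about 17 points) is larger than the gap between reasoning and chat models (about 7 points).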

Because the leaderboard is maintained as an open resource, later models from families such as Qwen and Kimi have been added by the community and report higher scores than the original cohort, reflecting general progress on the benchmark since its release.^[5]

Significance

SuperGPQA is one of the broadest graduate-level knowledge benchmarks released to date, and its scale gives it several uses. It provides a single yardstick that spans far more of the academic landscape than prior tests, which makes it useful for spotting disciplines where a model is unexpectedly weak. Its low top score at launch, with no evaluated system reaching 62%, left considerable headroom at a time when leading models were saturating older benchmarks, making it a meaningful target for tracking progress on expert reasoning.^[1]^[2]

The benchmark also illustrates a now-common construction strategy in which human experts and language models share the labor of building an evaluation set, using models to surface weak or trivial items at a scale that manual review alone could not reach. Its release by a major industry lab in partnership with an open community, under an open license with public code and data, has made it straightforward for others to reproduce and extend.^[2]^[4]

Limitations

Several caveats apply to SuperGPQA. The multiple-choice format, even with up to ten options, constrains the kinds of competence it can probe and does not capture open-ended generation, proof-writing, or interactive problem solving.^[2] Like any benchmark sourced from textbooks and exercise collections, parts of it may overlap with the training data of large models, so high scores can reflect memorization as well as reasoning despite the filtering pipeline. The dataset draws substantially on Chinese-language source material and academic conventions, which may influence both content coverage and how models trained on different corpora perform.^[4] Finally, because the public leaderboard accumulates community submissions over time, scores reported for newer models are not always produced under the same controlled conditions as the figures in the original paper, so cross-model comparisons drawn from different sources should be made with care.^[5]

References
