MATH-500
Last reviewed
May 31, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v2 · 3,444 words
MATH-500 is a 500-problem held-out evaluation subset drawn from the test split of the MATH benchmark of Dan Hendrycks et al. (2021). It was introduced in May 2023 by Hunter Lightman and colleagues at OpenAI in the paper "Let's Verify Step by Step", the same work that released the PRM800K dataset and established process reward models (PRMs) as a viable training signal for mathematical reasoning.[1][2] Although the authors never coined the name in their paper (they simply describe "the remaining 500 held-out problems"), the artifact was adopted by the community (and especially by the Hugging Face dataset card HuggingFaceH4/MATH-500) under the label MATH-500 and has become the de facto standard short-form math evaluation for reasoning-class language models including OpenAI o1, OpenAI o3, DeepSeek-R1, Gemini 2.5 Pro, and GPT-5.[3][4][5]
The 500 problems were drawn uniformly at random from the 5,000-problem MATH test set (not stratified by subject or difficulty) and were intended to be "representative of the test set as a whole."[2][4] The remaining 4,500 test problems were folded into the training pool used to collect step-level human labels for PRM800K, a redistribution that the authors emphasised was necessary to avoid overfitting in their reward-model training process.[1][4] Today, frontier models routinely score above 95% on MATH-500, leading many evaluators, including the Artificial Analysis platform and OpenAI's own simple-evals repository, to either retire the benchmark or treat it as a sanity check rather than a discriminator.[5][6][7]
The MATH dataset, introduced in Measuring Mathematical Problem Solving With the MATH Dataset, contains 12,500 competition mathematics problems split into 7,500 training and 5,000 test examples.[8][9] Each problem was drawn from US high-school mathematics competitions including the AMC 10, AMC 12, AIME, and similar olympiad-style events, and each is accompanied by a full step-by-step solution written in a mixture of natural language and LaTeX. The final numerical or symbolic answer is conventionally enclosed in a \boxed{...} macro, which downstream evaluation harnesses still use as the answer-extraction target.[8][10]
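The \boxed{...} convention described above makes answer extraction mostly a matter of brace matching. The following is a minimal sketch, not the code of any particular evaluation harness; the function name `extract_last_boxed` is hypothetical, and taking the last occurrence is an assumption about how a harness would treat solutions that box intermediate results.

```python
from typing import Optional


def extract_last_boxed(text: str) -> Optional[str]:
    r"""Return the contents of the last \boxed{...} in text, or None."""
    marker = r"\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None  # no boxed answer present

    i = start + len(marker)
    depth = 1  # we are already inside the opening brace of \boxed{
    chars = []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                # Matching close brace found: everything collected is the answer.
                return "".join(chars)
        chars.append(ch)
        i += 1
    return None  # braces never balanced


if __name__ == "__main__":
    solution = r"Adding the terms gives $\boxed{\frac{3}{4}}$."
    print(extract_last_boxed(solution))  # \frac{3}{4}
```

A regular expression such as `\\boxed\{(.*?)\}` would stop at the first closing brace and truncate nested LaTeX like `\frac{3}{4}`, which is why the sketch counts brace depth instead.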
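The dataset is partitioned along two orthogonal axes: subject (seven categories, from Prealgebra to Precalculus) and difficulty (levels 1 to 5, with 5 the hardest).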
The MATH paper was authored by Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt, posted to arXiv on 5 March 2021, and accepted to the NeurIPS 2021 Datasets and Benchmarks track.[8][9] The dataset is released under the MIT license.[10][11]
At launch, frontier models performed catastrophically: the largest checkpoint reported in the paper achieved only 6.9% accuracy on the full test set, leading the authors to argue that "simply increasing budgets and model parameter counts will be impractical for achieving strong mathematical reasoning if scaling trends continue."[9] That prediction proved overly pessimistic, partly thanks to the very PRM and reasoning-model techniques that the MATH-500 subset would later help validate.
In Let's Verify Step by Step (arXiv:2305.20050, 31 May 2023), OpenAI researchers Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe set out to compare outcome supervision (rewarding only final answers) with process supervision (rewarding individual reasoning steps).[1][12] To train their process reward models, they needed many candidate solutions per problem. They expanded their effective training pool by merging the original MATH train split (7,500 problems) with 4,500 problems drawn from the MATH test split, leaving only 500 problems unseen during PRM training.[2][4]
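The difference between the two supervision signals, and the reason many candidates per problem are useful, can be illustrated with a small reranking sketch. It assumes a solution is scored by the product of per-step correctness probabilities, which is how the paper is generally understood to define a PRM solution score; treat that detail as an assumption. The `Candidate` class, the function names, and all probabilities below are hypothetical illustration values.

```python
import math
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    answer: str               # final answer extracted from the solution
    step_probs: List[float]   # hypothetical per-step correctness probabilities


def prm_score(candidate: Candidate) -> float:
    """Score a solution as the product of its per-step probabilities (assumed)."""
    return math.prod(candidate.step_probs)


def best_of_n(candidates: List[Candidate]) -> str:
    """Process-supervised reranking: return the answer of the top-scored solution."""
    return max(candidates, key=prm_score).answer


def majority_vote(candidates: List[Candidate]) -> str:
    """Baseline: return the most frequent final answer, ignoring the steps."""
    return Counter(c.answer for c in candidates).most_common(1)[0][0]


if __name__ == "__main__":
    samples = [
        Candidate("12", [0.9, 0.4, 0.5]),    # score 0.18
        Candidate("12", [0.8, 0.5, 0.5]),    # score 0.20
        Candidate("15", [0.95, 0.9, 0.9]),   # score 0.7695
    ]
    print(best_of_n(samples))      # 15
    print(majority_vote(samples))  # 12
```

In this toy case the two strategies disagree: majority voting follows the more common answer, while the step-level scores favour the single solution whose reasoning the (hypothetical) PRM trusts throughout.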
The README of OpenAI's companion repository, openai/prm800k, states the construction directly: "We selected these 500 test problems uniformly at random, and we believe they are representative of the test set as a whole."[2] The directory prm800k/math_splits/ contains two Git-LFS files: train.jsonl (the expanded 12,000-problem training set, i.e. 7,500 + 4,500) and test.jsonl (the 500 held-out problems).[2][4] The paper itself uses phrases such as "a representative subset of the MATH test set" rather than the catchier MATH-500 name.[1]
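The split arithmetic described above can be sketched as follows. This is only an illustration of a uniform random hold-out: the problem identifiers, the seed, and the function name `split_test_set` are hypothetical, and the code does not reproduce the actual selection OpenAI made.

```python
import random
from typing import List, Tuple


def split_test_set(
    test_ids: List[str], held_out_size: int = 500, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Hold out a uniform random sample and return (held_out, moved_to_train)."""
    rng = random.Random(seed)  # hypothetical seed, for reproducibility only
    held_out = set(rng.sample(test_ids, held_out_size))
    moved_to_train = [pid for pid in test_ids if pid not in held_out]
    return sorted(held_out), moved_to_train


if __name__ == "__main__":
    # Placeholder identifiers standing in for the real MATH problems.
    train_ids = [f"train/{i}" for i in range(7500)]
    test_ids = [f"test/{i}" for i in range(5000)]

    held_out, moved = split_test_set(test_ids)
    expanded_train = train_ids + moved
    print(len(held_out), len(moved), len(expanded_train))  # 500 4500 12000
```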
The Lightman et al. paper's contribution to mathematical reasoning was substantial: their best PRM achieved 78.2% on this held-out subset using best-of-1860 reranking, decisively outperforming outcome-supervised baselines and majority voting.[1][12] But it was the subset itself (not the PRM) that survived as community infrastructure. The conversion was largely operational:
- test.jsonl was posted to the public openai/prm800k GitHub repository alongside the paper.[2]
- The file was mirrored on the Hugging Face Hub as the HuggingFaceH4/MATH-500 dataset, with the 500 records exposed as a single test split with fields problem, solution, answer, subject, level, and unique_id. The dataset card explicitly credits the OpenAI paper as the source.[3]
- OpenAI's simple-evals library footnoted that "for newer models (anything on or after o1) we evaluate on MATH-500, which is a newer, IID version of MATH."[7] DeepSeek's R1 paper, Kimi's k1.5 paper, the Qwen and GLM technical reports, and most subsequent reasoning-model launches reported MATH-500 numbers as a top-line metric.[13][14]

Two facts about this trajectory frequently surprise observers. First, the name MATH-500 does not appear in the original Lightman et al. paper at all; it was crystallised by the Hugging Face dataset card and downstream blog posts. Second, the subset is not stratified by difficulty or subject: a quick check of the released test.jsonl shows the expected uniform-sampling distribution across levels 1–5 and across the seven subjects, with the modest sample-size variance one would expect from a 500/5000 random draw.[2][3]
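The "modest sample-size variance" of an unstratified draw can be quantified with the hypergeometric distribution, which describes sampling without replacement. The sketch below computes the expected count and standard deviation for one category; the category size of 1,000 problems is a hypothetical round number chosen for illustration, not a figure from the dataset.

```python
import math
from typing import Tuple


def hypergeom_mean_std(
    population: int, category_count: int, sample_size: int
) -> Tuple[float, float]:
    """Mean and standard deviation of a category's count in a random sample."""
    p = category_count / population
    mean = sample_size * p
    # Finite-population correction: sampling without replacement shrinks variance.
    fpc = (population - sample_size) / (population - 1)
    std = math.sqrt(sample_size * p * (1 - p) * fpc)
    return mean, std


if __name__ == "__main__":
    # Hypothetical category covering 1,000 of the 5,000 test problems.
    mean, std = hypergeom_mean_std(population=5000, category_count=1000, sample_size=500)
    print(f"expected {mean:.0f} problems, standard deviation {std:.1f}")
```

Under these assumed numbers, a category expected to contribute 100 of the 500 problems would typically land within roughly 8 or 9 problems of that figure, which is the kind of deviation a uniform draw produces without any stratification.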
Each MATH-500 example is a JSON record with the following fields, as documented on the HuggingFaceH4/MATH-500 dataset card:[3]
| Field | Type | Description |
|---|---|---|
| problem | string (20–1,730 chars) | Statement of the problem in mixed natural language and LaTeX |
| solution | string (45–3,360 chars) | Reference step-by-step solution |
| answer | string (1–53 chars) | Final answer in \boxed{...}-extractable form |
| subject | string | One of seven subjects (see below) |
| level | int | Difficulty level 1–5 (5 = hardest) |
| unique_id | string (20–40 chars) | Stable identifier mapping back to the underlying MATH file |
The seven subjects are Prealgebra, Algebra, Intermediate Algebra, Counting & Probability, Number Theory, Geometry, and Precalculus.[3][8] The five difficulty levels follow the original MATH conventions; problems carrying Level 5 are roughly comparable in difficulty to USA(J)MO qualifier-level questions and are the focus of the MATH Level 5 sub-benchmark used in the GLM, DeepSeek, and Qwen papers.[3][10]
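The schema and subject list above translate directly into a record check. The following is a minimal sketch using only the field names, types, subjects, and level range stated in this article; the function name `validate_record` and the example record (including its unique_id value) are hypothetical.

```python
from typing import Dict, List

SUBJECTS = {
    "Prealgebra", "Algebra", "Intermediate Algebra", "Counting & Probability",
    "Number Theory", "Geometry", "Precalculus",
}

REQUIRED_FIELDS = {
    "problem": str, "solution": str, "answer": str,
    "subject": str, "level": int, "unique_id": str,
}


def validate_record(record: Dict[str, object]) -> List[str]:
    """Return a list of schema problems; an empty list means the record is valid."""
    errors = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in record:
            errors.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            errors.append(f"wrong type for {name}")

    subject = record.get("subject")
    if isinstance(subject, str) and subject not in SUBJECTS:
        errors.append(f"unknown subject: {subject}")

    level = record.get("level")
    if isinstance(level, int) and not 1 <= level <= 5:
        errors.append(f"level out of range: {level}")
    return errors


if __name__ == "__main__":
    # Hypothetical record, not an actual MATH-500 problem.
    example = {
        "problem": "What is $1 + 1$?",
        "solution": "Adding gives $1 + 1 = \\boxed{2}$.",
        "answer": "2",
        "subject": "Prealgebra",
        "level": 1,
        "unique_id": "test/prealgebra/0000.json",
    }
    print(validate_record(example))  # []
```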
The compact size (500 records, around 450 KB total) is one of the benchmark's most attractive features. A typical chain-of-thought evaluation run takes only minutes on modest hardware, and the dataset is small enough that high-cost reasoning models can be evaluated at full token budget without prohibitive expense.[3]
MATH-500 is widely conflated with the parent MATH dataset, partly because both are colloquially called "MATH" in marketing slides and even in some academic papers. The differences matter:
- lm-evaluation-harness exposes both hendrycks_math (the full 5,000-problem test set) and a hendrycks_math500 task that loads the Hugging Face MATH-500 mirror, allowing researchers to choose which they want.[15]
- simple-evals reports a single MATH column for every model, but for o1 and later that column is actually MATH-500.[7] DeepSeek, Kimi, Qwen, and Anthropic typically use the MATH-500 label explicitly.[13][14]

This convention has caused confusion. For instance, OpenAI's Learning to Reason with LLMs page reports an o1 score of 94.8% on "MATH" and an o1 score of 96.4% on a separate evaluation; the simple-evals repository clarifies that the 96.4 figure is on MATH-500 specifically.[7] DeepSeek-R1's headline "97.3% on MATH" is also on MATH-500 (pass@1), as their paper makes explicit in Table 4.[13]
Several factors combined to elevate this otherwise unremarkable random-sample dataset into a reasoning-era benchmark:
- Answers are conventionally enclosed in \boxed{...}, and the PRM800K repository provides a battle-tested grader (grader.py) that handles algebraic equivalence, a non-trivial problem in math evaluation.[2][18]

The benchmark has saturated quickly. A non-exhaustive timeline drawn from primary sources:
| Date | Model | MATH-500 accuracy (pass@1 unless noted) | Source |
|---|---|---|---|
| May 2023 | GPT-4 + best-of-1860 PRM | 78.2% | Lightman et al. (Fig. 3)[1] |
| Apr 2024 | GPT-4-turbo-2024-04-09 | 73.4% | Reported via simple-evals[7] |
| May 2024 | GPT-4o (2024-05-13) | 76.6% | OpenAI simple-evals[7] |
| Aug 2024 | GPT-4o (2024-08-06) | 75.9% | OpenAI simple-evals[7] |
| Sep 2024 | OpenAI o1-mini | 90.0% | DeepSeek-R1 Table 4[13] |
| Sep 2024 | OpenAI o1-preview / o1-1217 | 96.4% | OpenAI simple-evals[7][13] |
| Oct 2024 | Claude 3.5 Sonnet (2024-10-22) | 78.3% | DeepSeek-R1 Table 4[13] |
| Dec 2024 | DeepSeek-V3 | 90.2% | DeepSeek-R1 Table 4[13] |
| Jan 2025 | DeepSeek-R1-Zero | 95.9% | DeepSeek-R1 paper[13] |
| Jan 2025 | DeepSeek-R1 | 97.3% | DeepSeek-R1 paper[13] |
| Jan 2025 | Kimi k1.5 (long-CoT) | 96.2% | Kimi k1.5 technical report[14] |
| Jan 2025 | DeepSeek-R1-Distill-Qwen-32B | 94.3% | DeepSeek-R1 Table 5[13] |
| Mar 2025 | Gemini 2.5 Pro | ≈97.3% | Reported in independent leaderboard summaries[19] |
| 2025 | OpenAI o3 (high reasoning) | 99.2% | Artificial Analysis leaderboard[5] |
| 2025 | GPT-5 (high reasoning) | 99.4% | Artificial Analysis leaderboard[5] |
| 2025 | LongCat-Flash-Thinking (Meituan) | 99.2% | llm-stats leaderboard[6] |
| 2025 | GLM-4.5 (Zhipu) | 98.2% | llm-stats leaderboard[6] |
| 2025 | Sarvam-105B | 98.6% | llm-stats leaderboard[6] |
Independent re-evaluation by Vals AI as of mid-2025 gave Gemini 3 Pro 96.4% and clustered most frontier models in the 90–99% band; the platform later archived the benchmark because "most recent models now consistently score over 90%."[17]
By every reasonable measure, MATH-500 is saturated for frontier models. The mean across the 32 models tracked on the llm-stats leaderboard is 93.2%, with the top model only 6 percentage points above that mean and the top three within 0.2 points of each other.[6] Artificial Analysis ranks 201 models on MATH-500 but notes that the headline weighted "math" score now leans on harder evaluations such as BRUMO and AIME 2025 because MATH-500 no longer separates frontier models.[5][16]
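A rough calculation shows why gaps of a few tenths of a point carry little information on a 500-problem set. The sketch below uses the standard binomial standard error, which assumes each problem is an independent trial with the same success probability; that is a simplification, and the 97% accuracy is an illustrative value rather than a specific model's score.

```python
import math


def accuracy_standard_error(accuracy: float, n_problems: int = 500) -> float:
    """Binomial standard error of an accuracy estimate (independence assumed)."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_problems)


def points_per_problem(n_problems: int = 500) -> float:
    """Percentage points gained or lost by flipping a single problem."""
    return 100.0 / n_problems


if __name__ == "__main__":
    se = accuracy_standard_error(0.97)
    print(f"standard error at 97%: {100 * se:.2f} points")
    print(f"one problem is worth {points_per_problem():.1f} points")
```

Under these assumptions the standard error near 97% is about three quarters of a percentage point, while a single problem is worth 0.2 points, so a 0.2-point lead between two models corresponds to one problem and sits well inside the sampling noise.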
Several knock-on effects of this saturation are now visible:
Some vendors still report MATH-500 prominently, including Mistral, Zhipu, NVIDIA, Meituan, Sarvam, and most small-model labs, because it remains a useful checkpoint metric and because matching o1 / R1 on MATH-500 is still a strong baseline claim.[6]
MATH-500 is drawn from publicly available competition mathematics problems (AMC 10, AMC 12, AIME, etc.) that have been indexed, discussed, and solved on the open web, including Art of Problem Solving forums, contest archives, and innumerable solution blog posts. This visibility was already a worry for the parent MATH dataset; multiple surveys document that MATH (Hendrycks et al.) is among the most contaminated of widely used LLM benchmarks, alongside GSM8K, HumanEval, and MMLU.[21][22]
Specific concerns:
These factors complicate inter-model comparisons even at the 90%+ tier. They are partly why benchmarks designed against contamination, such as FrontierMath (held-out problems written by professional mathematicians) and MathArena (real-time contests evaluated within hours of release), have grown in importance.[16][25]
| Benchmark | Size (test) | Source | Typical 2025 SOTA | Saturated? |
|---|---|---|---|---|
| MATH-500 | 500 | MATH test split (Lightman 2023)[1] | 99% (GPT-5 high)[5] | Yes[17] |
| MATH (full) | 5,000 | Hendrycks 2021[8] | ≈98% (R1, o1)[13] | Largely[17] |
| GSM8K | 1,319 | Cobbe et al. 2021[26] | ≈97% (GPT-4)[26] | Yes[26] |
| AIME 2024 | 30 | AIME 2024 contest | ≈90% (o3, R1)[16] | Approaching |
| AIME 2025 | 30 | AIME 2025 contest | 92–100% (GPT-5, Gemini 2.5 Pro)[16] | Approaching |
| FrontierMath | 300+ | Held-out problems by ≈60 mathematicians[25] | <30% (frontier)[25] | No |
| MATH-Perturb-Hard | 279 | Perturbations of MATH Level 5[23] | drops 10–25 pts vs. MATH[23] | No |
Versus GSM8K. GSM8K is grade-school arithmetic word problems, deliberately constrained to addition, subtraction, multiplication, and division over natural numbers; SOTA models score in the high 90s and have for years.[26] MATH-500 is harder by roughly a full difficulty tier; Lightman et al. used precisely this gap (between trivial GSM8K performance and weaker MATH performance) to motivate their work on process supervision.[1] In practice, modern reasoning models now saturate both.
Versus AIME 2024/2025. AIME is a 15-problem high-school invitational; "AIME 2024" and "AIME 2025" in LLM evaluation usage typically refer to the 30 problems from both AIME I and AIME II of the given year, with cons@k or pass@k scoring. AIME problems are individually harder than the median MATH-500 problem (MATH Level 5 problems are AIME-derived, but constitute only roughly a quarter of the 500), so AIME serves as the natural "harder" successor.[16][27]
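The two scoring conventions named above can be sketched as follows. For pass@k the code uses the widely used unbiased estimator from n samples with c correct, 1 - C(n-c, k) / C(n, k); for cons@k it takes a majority vote over the sampled answers. Both are common readings of these terms rather than definitions taken from this article, and the function names are hypothetical.

```python
from collections import Counter
from math import comb
from typing import List


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k samples is correct.

    n: samples drawn for the problem, c: how many were correct, k: budget.
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def cons_at_k(sampled_answers: List[str], reference: str) -> bool:
    """Consensus scoring: is the most frequent sampled answer correct?

    Ties are broken by first appearance, which is how Counter orders them.
    """
    top_answer, _ = Counter(sampled_answers).most_common(1)[0]
    return top_answer == reference


if __name__ == "__main__":
    print(round(pass_at_k(n=10, c=3, k=1), 3))            # 0.3
    print(cons_at_k(["204", "204", "196", "204"], "204"))  # True
```

With only 30 problems per AIME year, each problem moves the score by more than three points, which is one reason reports usually average over many samples per problem.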
Versus FrontierMath. Epoch AI's FrontierMath benchmark, introduced in late 2024, operates at a fundamentally different scale of difficulty: 300+ problems written and held in secret by professional research mathematicians, each of which takes human experts hours to days. Top systems still score in the low double digits at best.[16][25]
Versus MATH-Perturb. MATH-Perturb (Huang et al., ICML 2025) constructs 279 hard- and 279 simple-perturbed problems by editing MATH Level 5 originals. The collapse on hard perturbations is the strongest published evidence that MATH-500 saturation is partly memorisation.[23][24]
Despite saturation, MATH-500 still appears in nearly every reasoning-model release in 2025. The reasons are pragmatic: it is cheap, well-tooled (the PRM800K grader is the de facto evaluator), and offers comparability with hundreds of prior published scores. As of 2025, the HuggingFaceH4/MATH-500 dataset shows roughly 175,000 downloads per month and is listed as a training/evaluation dependency by more than 100 models on the Hub.[3]
Several caveats remain in serious use of the benchmark:
- The PRM800K grader (prm800k/grading/grader.py) is the conventional choice and handles algebraic equivalence, fraction normalisation, and other surface variation.[2][18]

For most practical purposes, MATH-500's role has shifted from a discriminating benchmark to a floor: a reasoning model that does not clear ≈90% on MATH-500 is not yet a frontier reasoning model. Discriminative comparison has migrated to AIME, USAMO, FrontierMath, and HLE.[16][25]
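To make the grading caveat concrete, the sketch below shows why exact string comparison is not enough and what a minimal normaliser looks like. It is a deliberately simplified stand-in, not the PRM800K grader: it only handles whitespace, dollar signs, `\dfrac`/`\tfrac`, and integer fractions, and it does not attempt the symbolic equivalence attributed to the real grader. All function names are hypothetical.

```python
import re
from fractions import Fraction
from typing import Optional


def normalize(answer: str) -> str:
    """Strip surface variation: math delimiters, spacing, fraction macros."""
    s = answer.strip().strip("$")
    s = s.replace("\\dfrac", "\\frac").replace("\\tfrac", "\\frac")
    s = s.replace(" ", "")
    # Rewrite \frac{a}{b} with integer a, b as the plain string a/b.
    return re.sub(r"\\frac\{(-?\d+)\}\{(-?\d+)\}", r"\1/\2", s)


def to_fraction(s: str) -> Optional[Fraction]:
    """Parse a plain number or a/b string exactly, or return None."""
    try:
        return Fraction(s)
    except (ValueError, ZeroDivisionError):
        return None


def answers_match(predicted: str, reference: str) -> bool:
    """Compare two answers after normalisation, numerically where possible."""
    a, b = normalize(predicted), normalize(reference)
    if a == b:
        return True
    fa, fb = to_fraction(a), to_fraction(b)
    return fa is not None and fa == fb


if __name__ == "__main__":
    print(answers_match(r"\dfrac{3}{4}", "0.75"))  # True
    print(answers_match("$ 12 $", "12"))           # True
    print(answers_match("x+1", "1+x"))             # False (needs symbolic algebra)
```

The last case is the important limitation: recognising that `x+1` and `1+x` are the same answer requires symbolic comparison, which is the part of the problem a dedicated grader exists to solve, so scores computed with a naive checker like this one are not comparable to published numbers.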