# MATH-500

> Source: https://aiwiki.ai/wiki/math_500
> Updated: 2026-06-09
> Categories: AI Benchmarks, Mathematics
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**MATH-500** is a 500-problem held-out evaluation subset drawn from the test split of the [MATH benchmark](/wiki/math_benchmark) of [Dan Hendrycks](/wiki/dan_hendrycks) et al. (2021). It was introduced in May 2023 by Hunter Lightman and colleagues at [OpenAI](/wiki/openai) in the paper *"Let's Verify Step by Step"*, the same work that released the PRM800K dataset and established [process reward models](/wiki/process_reward_model) (PRMs) as a viable training signal for mathematical reasoning.[^1][^2] Although the authors never coined the name in their paper (they simply describe "the remaining 500 held-out problems"), the artifact was adopted by the community (and especially by the [Hugging Face](/wiki/hugging_face) dataset card `HuggingFaceH4/MATH-500`) under the label *MATH-500* and has become the de facto standard short-form math evaluation for [reasoning-class language models](/wiki/reasoning_models) including [OpenAI o1](/wiki/o1), [OpenAI o3](/wiki/o3), [DeepSeek-R1](/wiki/deepseek_r1), [Gemini 2.5 Pro](/wiki/gemini_2_5_pro), and [GPT-5](/wiki/gpt-5).[^3][^4][^5]

The 500 problems were drawn **uniformly at random** from the 5,000-problem MATH test set (*not* stratified by subject or difficulty) and were intended to be "representative of the test set as a whole."[^2][^4] The remaining 4,500 test problems were folded into the training pool used to collect step-level human labels for PRM800K, a redistribution that the authors emphasised was necessary to avoid overfitting in their reward-model training process.[^1][^4] Today, frontier models routinely score above 95% on MATH-500, leading many evaluators, including leaderboard platforms such as Artificial Analysis and OpenAI's own `simple-evals` repository, to either retire the benchmark or treat it as a sanity check rather than a discriminator.[^5][^6][^7]

## Background: the parent MATH benchmark

The MATH dataset, introduced in *Measuring Mathematical Problem Solving With the MATH Dataset*, contains 12,500 competition mathematics problems split into 7,500 training and 5,000 test examples.[^8][^9] Each problem was drawn from US high-school mathematics competitions including the **AMC 10**, **AMC 12**, **[AIME](/wiki/aime)**, and similar olympiad-style events, and each is accompanied by a full step-by-step solution written in a mixture of natural language and LaTeX. The final numerical or symbolic answer is conventionally enclosed in a `\boxed{...}` macro, which downstream evaluation harnesses still use as the answer-extraction target.[^8][^10]
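
As an illustration of the `\boxed{...}` convention, the sketch below pulls the final answer out of a solution string by matching braces. It is a minimal example, not the code of any particular harness: the function name `extract_boxed` and the sample solution are invented here, and the brace-free form `\boxed 5` is deliberately not handled.

```python
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None if absent."""
    marker = r"\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    i = start + len(marker)
    depth = 1  # we start just inside the opening brace of \boxed{
    chars = []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)  # matching close brace found
        chars.append(ch)
        i += 1
    return None  # braces never balanced


# Made-up solution text, used only to exercise the function.
solution = r"So the sum is $\frac{1}{2} + \frac{1}{3} = \boxed{\frac{5}{6}}$."
print(extract_boxed(solution))  # prints \frac{5}{6}
```

A regular expression such as `\\boxed\{(.*?)\}` would stop at the first closing brace and return `\frac{5`, which is why the sketch counts nesting depth instead.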

The dataset is partitioned along two orthogonal axes:

- **Seven subjects:** Prealgebra, Algebra, Number Theory, Counting & Probability, Geometry, Intermediate Algebra, and Precalculus.[^8][^10]
- **Five difficulty levels** (1 = easiest, 5 = hardest), assigned by the competition organizers and curators.[^8][^10] Level 5 corresponds roughly to late-AIME or invitational-final difficulty (see [MATH Level 5](/wiki/math_level_5)).

The MATH paper was authored by Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt, posted to arXiv on 5 March 2021, and accepted to the [NeurIPS](/wiki/neurips) 2021 Datasets and Benchmarks track.[^8][^9] The dataset is released under the MIT license.[^10][^11]

At launch, frontier models performed catastrophically: the largest checkpoint reported in the paper achieved only 6.9% accuracy on the full test set, leading the authors to argue that "simply increasing budgets and model parameter counts will be impractical for achieving strong mathematical reasoning if scaling trends continue."[^9] That prediction proved overly pessimistic, partly thanks to the very PRM and reasoning-model techniques that the MATH-500 subset would later help validate.

## Origin and subset construction

### The Lightman et al. paper

In *Let's Verify Step by Step* (arXiv:2305.20050, 31 May 2023), OpenAI researchers Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, [Ilya Sutskever](/wiki/ilya_sutskever), and Karl Cobbe set out to compare *outcome supervision* (rewarding only final answers) with *process supervision* (rewarding individual reasoning steps).[^1][^12] To train their process reward models, they needed many candidate solutions per problem. They expanded their effective training pool by merging the original MATH train split (7,500 problems) with 4,500 problems drawn from the MATH test split, leaving only 500 problems unseen during PRM training.[^2][^4]

The README of OpenAI's companion repository, [`openai/prm800k`](https://github.com/openai/prm800k), states the construction directly: "We selected these 500 test problems uniformly at random, and we believe they are representative of the test set as a whole."[^2] The directory `prm800k/math_splits/` contains two Git-LFS files: `train.jsonl` (the expanded 12,000-problem training set, i.e. 7,500 + 4,500) and `test.jsonl` (the 500 held-out problems).[^2][^4] The paper itself uses phrases such as "a representative subset of the MATH test set" rather than the catchier *MATH-500* name.[^1]
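
The split arithmetic can be made concrete with a short sketch. It only illustrates a uniform draw without replacement: the seed, the integer IDs, and the use of Python's `random` module are assumptions made for this example, and the procedure the authors actually ran is not documented in this article.

```python
import random

N_TRAIN = 7_500     # original MATH train split
N_TEST = 5_000      # original MATH test split
N_HELD_OUT = 500    # problems kept unseen during PRM training

SEED = 0  # hypothetical seed, chosen only so the example is reproducible

rng = random.Random(SEED)
test_ids = list(range(N_TEST))                    # stand-ins for real problem IDs
held_out = set(rng.sample(test_ids, N_HELD_OUT))  # uniform draw, no replacement
moved_to_train = [i for i in test_ids if i not in held_out]

expanded_train_size = N_TRAIN + len(moved_to_train)
print(len(held_out), len(moved_to_train), expanded_train_size)  # 500 4500 12000
```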

### From local artifact to standard benchmark

The Lightman et al. paper's contribution to mathematical reasoning was substantial: their best PRM achieved 78.2% on this held-out subset using best-of-1860 reranking, decisively outperforming outcome-supervised baselines and majority voting.[^1][^12] But it was the *subset itself* (not the PRM) that survived as community infrastructure. The conversion was largely operational:

1. **OpenAI's release.** The 500-problem `test.jsonl` was posted to the public `openai/prm800k` GitHub repository alongside the paper.[^2]
2. **Hugging Face mirror.** The Hugging Face H4 team, the same group behind Zephyr and the Open LLM Leaderboard, repackaged the file as the `HuggingFaceH4/MATH-500` dataset, with the 500 records exposed as a single `test` split with fields `problem`, `solution`, `answer`, `subject`, `level`, and `unique_id`. The dataset card explicitly credits the OpenAI paper as the source.[^3]
3. **Adoption in reasoning-model papers.** When OpenAI's *o1* reasoning model appeared in September 2024, OpenAI's `simple-evals` library footnoted that "for newer models (anything on or after o1) we evaluate on MATH-500, which is a newer, IID version of MATH."[^7] DeepSeek's R1 paper, Kimi's k1.5 paper, the Qwen and GLM technical reports, and most subsequent reasoning-model launches reported MATH-500 numbers as a top-line metric.[^13][^14]

Two facts about this trajectory frequently surprise observers. First, the name *MATH-500* does not appear in the original Lightman et al. paper at all; it was crystallised by the Hugging Face dataset card and downstream blog posts. Second, the subset is **not** stratified by difficulty or subject: a quick check of the released `test.jsonl` shows level and subject proportions that track those of the full test split (rather than an even split across levels 1-5 or across the seven subjects), with the modest sample-size variance one would expect from a 500/5,000 random draw.[^2][^3]
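
A distribution check of this kind could be sketched as follows. The four records are invented stand-ins rather than real dataset content, and the field names `subject` and `level` follow the Hugging Face mirror; that the raw `test.jsonl` uses the same keys is an assumption here.

```python
import json
from collections import Counter

# Invented records standing in for lines of test.jsonl (not real dataset content).
SAMPLE_JSONL = """\
{"problem": "p1", "answer": "1", "subject": "Algebra", "level": 2}
{"problem": "p2", "answer": "2", "subject": "Geometry", "level": 5}
{"problem": "p3", "answer": "3", "subject": "Algebra", "level": 5}
{"problem": "p4", "answer": "4", "subject": "Number Theory", "level": 1}
"""


def tally(jsonl_text):
    """Count records per subject and per level in a JSONL string."""
    by_subject, by_level = Counter(), Counter()
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        by_subject[record["subject"]] += 1
        by_level[record["level"]] += 1
    return by_subject, by_level


def shares(counter):
    """Convert raw counts into fractions of the total."""
    total = sum(counter.values())
    return {key: count / total for key, count in sorted(counter.items())}


by_subject, by_level = tally(SAMPLE_JSONL)
print(shares(by_subject))
print(shares(by_level))
```

Running the same tally over the 500-problem file and over the full test split, then comparing the two sets of shares, is one way to see how closely the sample mirrors its parent.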

## Format and difficulty levels

Each MATH-500 example is a JSON record with the following fields, as documented on the `HuggingFaceH4/MATH-500` dataset card:[^3]

| Field | Type | Description |
| --- | --- | --- |
| `problem` | string (20-1,730 chars) | Statement of the problem in mixed natural language and LaTeX |
| `solution` | string (45-3,360 chars) | Reference step-by-step solution |
| `answer` | string (1-53 chars) | Final answer in `\boxed{...}`-extractable form |
| `subject` | string | One of seven subjects (see below) |
| `level` | int | Difficulty level 1-5 (5 = hardest) |
| `unique_id` | string (20-40 chars) | Stable identifier mapping back to the underlying MATH file |
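
For readers who want to work with records programmatically, the schema above might be mirrored by a small validating type. This is a sketch only: the class name `Math500Record` and the example record (including its `unique_id`) are invented for illustration and do not come from the dataset.

```python
from dataclasses import dataclass

SUBJECTS = {
    "Prealgebra", "Algebra", "Intermediate Algebra", "Counting & Probability",
    "Number Theory", "Geometry", "Precalculus",
}


@dataclass(frozen=True)
class Math500Record:
    problem: str
    solution: str
    answer: str
    subject: str
    level: int
    unique_id: str

    def __post_init__(self):
        # Reject values outside the ranges given in the field table.
        if self.subject not in SUBJECTS:
            raise ValueError(f"unexpected subject: {self.subject!r}")
        if not 1 <= self.level <= 5:
            raise ValueError(f"level out of range: {self.level}")


# Invented example record; not taken from the dataset.
raw = {
    "problem": "What is $1+1$?",
    "solution": "Adding gives $\\boxed{2}$.",
    "answer": "2",
    "subject": "Prealgebra",
    "level": 1,
    "unique_id": "example/prealgebra/0.json",
}
record = Math500Record(**raw)
print(record.subject, record.level)  # Prealgebra 1
```

With the Hugging Face `datasets` library installed, rows obtained from `load_dataset("HuggingFaceH4/MATH-500", split="test")` could presumably be passed through the same constructor, though that path is not exercised here.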

The seven subjects are Prealgebra, Algebra, Intermediate Algebra, Counting & Probability, Number Theory, Geometry, and Precalculus.[^3][^8] The five difficulty levels follow the original MATH conventions; problems carrying Level 5 are roughly comparable in difficulty to USA(J)MO qualifier-level questions and are the focus of the [MATH Level 5](/wiki/math_level_5) sub-benchmark used in the GLM, DeepSeek, and Qwen papers.[^3][^10]

The compact size (500 records, around 450 KB total) is one of the benchmark's most attractive features. A typical [chain-of-thought](/wiki/chain_of_thought) evaluation run takes only minutes on modest hardware, and the dataset is small enough that high-cost [reasoning models](/wiki/reasoning_models) can be evaluated at full token budget without prohibitive expense.[^3]

## Distinction from the full MATH benchmark

MATH-500 is widely conflated with the parent MATH dataset, partly because both are colloquially called "MATH" in marketing slides and even in some academic papers. The differences matter:

- **Test-set size.** MATH-500 contains 500 problems; the full MATH test split contains 5,000. The 500 are a random 10% draw, with the *other* 4,500 absorbed into Lightman et al.'s expanded training set.[^2][^4]
- **Training-set status.** Because 4,500 of the original MATH test problems were re-used for PRM800K training, models trained on PRM800K or distillations thereof can no longer be cleanly evaluated on the original MATH test split. MATH-500 is the only "clean" residual.[^1][^2]
- **Use in evaluation harnesses.** Eleuther's `lm-evaluation-harness` exposes both `hendrycks_math` (the full 5,000-problem test set) and a `hendrycks_math500` task that loads the Hugging Face MATH-500 mirror, allowing researchers to choose which they want.[^15]
- **Headline scoring conventions.** OpenAI's `simple-evals` reports a single `MATH` column for every model, but for o1 and later that column is actually MATH-500.[^7] DeepSeek, Kimi, Qwen, and Anthropic typically use the MATH-500 label explicitly.[^13][^14]

This convention has caused confusion. For instance, OpenAI's *Learning to Reason with LLMs* page reports an o1 score of 94.8% on "MATH" and an o1 score of 96.4% on a separate evaluation; the `simple-evals` repository clarifies that the 96.4 figure is on MATH-500 specifically.[^7] DeepSeek-R1's headline "97.3% on MATH" is *also* on MATH-500 (pass@1), as their paper makes explicit in Table 4.[^13]

## Why it became standard for reasoning evaluation

Several factors combined to elevate this otherwise unremarkable random-sample dataset into a reasoning-era benchmark:

1. **Cost-efficient evaluation of [test-time-compute](/wiki/test_time_compute) models.** Reasoning models such as o1, o3, R1, and Gemini-2.5-Pro can spend tens or hundreds of thousands of tokens per problem. A 500-problem test set is small enough to run cheaply while still producing a statistically meaningful number (a 95% confidence interval of about ±2.6 percentage points at the 90% accuracy level, under a normal approximation). The full 5,000-problem MATH set is roughly 10× more expensive to run at high reasoning depth.[^16][^17]
2. **A held-out set that survives the PRM800K re-split.** Because Lightman et al. moved 4,500 problems from the MATH test set into their training pool, MATH-500 is the only part of the test split that remains unseen by models trained on either the standard MATH train split or the expanded PRM800K pool. This made it a natural held-out evaluation for the burgeoning crop of math-specific finetuned models.[^1][^4]
3. **Easy-to-parse answer format.** Reference solutions wrap the final answer in `\boxed{...}`, and the PRM800K repository provides a battle-tested grader (`prm800k/grading/grader.py`) that handles algebraic equivalence, a non-trivial problem in math evaluation.[^2][^18]
4. **Frontier-lab uptake.** Once OpenAI began reporting MATH-500 for o1 and later models, the label cascaded through subsequent reasoning-model releases. DeepSeek-R1, Kimi k1.5, Qwen2.5-Math, GLM-4.5, Nemotron-Nano, LongCat-Flash, Sarvam, and almost every open-weights reasoning checkpoint shipped a MATH-500 number in 2024-2025.[^13][^14][^6]
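
The sampling-noise figure in the first item can be checked with a few lines of arithmetic. The sketch below uses the normal-approximation (Wald) interval for a binomial proportion, which is an assumption about how such an estimate would be made rather than a method stated by any of the cited sources; the function name is likewise invented.

```python
import math

Z_95 = 1.959964  # two-sided 95% quantile of the standard normal distribution


def ci_half_width(accuracy: float, n_problems: int, z: float = Z_95) -> float:
    """Half-width of the normal-approximation interval for a proportion."""
    standard_error = math.sqrt(accuracy * (1.0 - accuracy) / n_problems)
    return z * standard_error


for n in (500, 5000):
    # Report the half-width in percentage points at 90% accuracy.
    print(n, round(100 * ci_half_width(0.90, n), 2))
# prints:
# 500 2.63
# 5000 0.83
```

Under these assumptions, 500 problems at 90% accuracy give roughly ±2.6 points, and the tenfold larger full test set narrows that only by a factor of about √10 ≈ 3.2, which is the trade-off behind the cost argument. Near 99% accuracy the normal approximation becomes unreliable, and an exact or Wilson interval would be a better choice.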

## Top scores over time

The benchmark has saturated quickly. A non-exhaustive timeline drawn from primary sources:

| Date | Model | MATH-500 accuracy (pass@1 unless noted) | Source |
| --- | --- | --- | --- |
| May 2023 | GPT-4 + best-of-1860 PRM | 78.2% | Lightman et al. (Fig. 3)[^1] |
| Apr 2024 | GPT-4-turbo-2024-04-09 | 73.4% | Reported via `simple-evals`[^7] |
| May 2024 | GPT-4o (2024-05-13) | 76.6% | OpenAI `simple-evals`[^7] |
| Aug 2024 | GPT-4o (2024-08-06) | 75.9% | OpenAI `simple-evals`[^7] |
| Sep 2024 | OpenAI o1-mini | 90.0% | DeepSeek-R1 Table 4[^13] |
| Oct 2024 | Claude 3.5 Sonnet (2024-10-22) | 78.3% | DeepSeek-R1 Table 4[^13] |
| Dec 2024 | OpenAI o1 (o1-1217) | 96.4% | OpenAI `simple-evals`; DeepSeek-R1 Table 4[^7][^13] |
| Dec 2024 | DeepSeek-V3 | 90.2% | DeepSeek-R1 Table 4[^13] |
| Jan 2025 | DeepSeek-R1-Zero | 95.9% | DeepSeek-R1 paper[^13] |
| Jan 2025 | DeepSeek-R1 | 97.3% | DeepSeek-R1 paper[^13] |
| Jan 2025 | Kimi k1.5 (long-CoT) | 96.2% | Kimi k1.5 technical report[^14] |
| Jan 2025 | DeepSeek-R1-Distill-Qwen-32B | 94.3% | DeepSeek-R1 Table 5[^13] |
| Mar 2025 | [Gemini 2.5 Pro](/wiki/gemini_2_5_pro) | ≈97.3% | Reported in independent leaderboard summaries[^19] |
| 2025 | [OpenAI o3](/wiki/o3) (high reasoning) | 99.2% | Artificial Analysis leaderboard[^5] |
| 2025 | [GPT-5](/wiki/gpt-5) (high reasoning) | 99.4% | Artificial Analysis leaderboard[^5] |
| 2025 | LongCat-Flash-Thinking (Meituan) | 99.2% | llm-stats leaderboard[^6] |
| 2025 | GLM-4.5 (Zhipu) | 98.2% | llm-stats leaderboard[^6] |
| 2025 | Sarvam-105B | 98.6% | llm-stats leaderboard[^6] |

Independent re-evaluation by Vals AI as of mid-2025 gave Gemini 3 Pro 96.4% and clustered most frontier models in the 90-99% band, with the platform later archiving the benchmark because "most recent models now consistently score over 90%."[^17]

## Saturation

By every reasonable measure, MATH-500 is saturated for frontier models. The mean across the 32 models tracked on the llm-stats leaderboard is 93.2%, with the top model only 6 percentage points above that mean and the top three within 0.2 percentage points of each other.[^6] Artificial Analysis ranks 201 models on MATH-500 but notes that the headline weighted "math" score now leans on harder evaluations such as BRUMO and AIME 2025 because MATH-500 no longer separates frontier models.[^5][^16]

Several knock-on effects of this saturation are now visible:

- **Retirement.** Vals AI archived MATH-500 in 2025, no longer testing new models on it.[^17]
- **Replacement by harder benchmarks.** [AIME 2024](/wiki/aime_2024), [AIME 2025](/wiki/aime_2025), [FrontierMath](/wiki/frontiermath), HMMT 2025, USAMO, Putnam, and HLE (Humanity's Last Exam) now occupy the "hard math" slot in flagship system cards.[^16][^20]
- **Compound metrics.** The Artificial Analysis intelligence index now weights MATH-500 alongside AIME and BRUMO to compute a single math-category number, with MATH-500 acting more as a hygiene check than a discriminator.[^5]

Some vendors still report MATH-500 prominently, including Mistral, Zhipu, NVIDIA, Meituan, Sarvam, and most small-model labs, because it remains a useful checkpoint metric and because matching o1 / R1 on MATH-500 is still a strong baseline claim.[^6]

## Contamination concerns

MATH-500 is drawn from publicly available competition mathematics problems (AMC 10, AMC 12, AIME, etc.) that have been indexed, discussed, and solved on the open web, including Art of Problem Solving forums, contest archives, and innumerable solution blog posts. This visibility was already a worry for the parent MATH dataset; multiple surveys document that **MATH (Hendrycks et al.) is among the most contaminated of widely used LLM benchmarks**, alongside GSM8K, HumanEval, and MMLU.[^21][^22]

Specific concerns:

- **Self-disclosure by OpenAI.** OpenAI has acknowledged including portions of the MATH and [GSM8K](/wiki/gsm8k) training sets in its model training data; once one is comfortable training on the train split, the boundary between train and test (which contains MATH-500) becomes statistical rather than absolute.[^21][^22]
- **Public solutions.** Because problems originate from Art of Problem Solving and similar sites that publish worked solutions, any internet-scale crawl will incidentally encounter the test problems and their answers. The PRM800K license does not restrict this distribution.[^2][^11]
- **Vals AI warning.** The independent Vals AI evaluation explicitly flags MATH-500 as having "a high risk of pre-training on the test set" because all questions are in the public domain.[^17]
- **Robustness evidence from MATH-Perturb.** Huang et al.'s *MATH-Perturb* benchmark (ICML 2025) constructs 279 hand-perturbed variants of MATH Level 5 problems. Frontier reasoning models suffer 10-25 point drops on the "hard perturbation" variants, including o1-mini (−16.49) and Gemini 2.0 Flash Thinking (−12.9), suggesting that some of the headline MATH-500 score on Level 5 reflects memorisation rather than pure reasoning.[^23][^24]

These factors complicate inter-model comparisons even at the 90%+ tier. They are partly why benchmarks designed against contamination, such as FrontierMath (held-out problems written by professional mathematicians) and [MathArena](/wiki/matharena) (real-time contests evaluated within hours of release), have grown in importance.[^16][^25]

## Comparison to related benchmarks

| Benchmark | Size (test) | Source | Typical 2025 SOTA | Saturated? |
| --- | --- | --- | --- | --- |
| MATH-500 | 500 | MATH test split (Lightman 2023)[^1] | 99% (GPT-5 high)[^5] | Yes[^17] |
| MATH (full) | 5,000 | Hendrycks 2021[^8] | Rarely reported separately; recent headline "MATH" scores (o1, R1) are on MATH-500[^7][^13] | Largely[^17] |
| [GSM8K](/wiki/gsm8k) | 1,319 | Cobbe et al. 2021[^26] | ≈97% (GPT-4)[^26] | Yes[^26] |
| [AIME 2024](/wiki/aime_2024) | 30 | AIME 2024 contest | ≈90% (o3, R1)[^16] | Approaching |
| [AIME 2025](/wiki/aime_2025) | 30 | AIME 2025 contest | 92-100% (GPT-5, Gemini 2.5 Pro)[^16] | Approaching |
| [FrontierMath](/wiki/frontiermath) | 300+ | Held-out problems by ≈60 mathematicians[^25] | <30% (frontier)[^25] | No |
| MATH-Perturb-Hard | 279 | Perturbations of MATH Level 5[^23] | drops 10-25 pts vs. MATH[^23] | No |

**Versus GSM8K.** GSM8K is grade-school arithmetic word problems, deliberately constrained to addition, subtraction, multiplication, and division over natural numbers; SOTA models score in the high 90s and have for years.[^26] MATH-500 is harder by roughly a full difficulty tier; Lightman et al. used precisely this gap (between trivial GSM8K performance and weaker MATH performance) to motivate their work on process supervision.[^1] In practice, modern reasoning models now saturate both.

**Versus AIME 2024/2025.** AIME is a 15-problem high-school invitational; "AIME 2024" and "AIME 2025" in LLM evaluation usage typically refer to the 30 problems from both AIME I and AIME II of the given year, with cons@k or pass@k scoring. AIME problems are individually harder than the *median* MATH-500 problem (Level 5 problems, the tier closest to AIME difficulty, make up only about a quarter of the 500), so AIME serves as the natural "harder" successor.[^16][^27]

**Versus FrontierMath.** Epoch AI's FrontierMath benchmark, introduced in late 2024, operates at a fundamentally different level of difficulty: 300+ problems written and kept private by professional research mathematicians, each of which can take human experts hours to days. Top systems still score in the low double digits at best.[^16][^25]

**Versus MATH-Perturb.** *MATH-Perturb* (Huang et al., ICML 2025) constructs 279 hard- and 279 simple-perturbed problems by editing MATH Level 5 originals. The collapse on hard perturbations is the strongest published evidence that MATH-500 saturation is partly memorisation.[^23][^24]

## Reception and ongoing role

Despite saturation, MATH-500 still appears in nearly every reasoning-model release in 2025. The reasons are pragmatic: it is cheap, well-tooled (the PRM800K grader is the de facto evaluator), and offers comparability with hundreds of prior published scores. As of 2025, the `HuggingFaceH4/MATH-500` dataset shows roughly 175,000 downloads per month and is listed as a training/evaluation dependency by more than 100 models on the Hub.[^3]

Several caveats remain in serious use of the benchmark:

- **Use the PRM800K grader.** Naive string-equality grading systematically under-counts correct answers because of LaTeX surface variation. The OpenAI grader (`prm800k/grading/grader.py`) is the conventional choice and handles algebraic equivalence, fraction normalisation, and other surface variation.[^2][^18]
- **Report greedy / pass@1 unless you specify otherwise.** Older numbers using majority voting (Maj@k) or PRM reranking (best-of-N) are not directly comparable to greedy pass@1 numbers reported by recent reasoning-model releases.[^7][^1]
- **Disambiguate "MATH" in claims.** Especially in headline numbers, papers and blog posts frequently say "MATH" when they mean MATH-500. The distinction matters when comparing across groups; checking the cited file path or the score range (above ≈95% generally implies MATH-500) is a useful heuristic.
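
The difference between the scoring conventions in the second caveat can be seen in a toy calculation. Everything in the sketch is hypothetical: the sampled answers are invented, the function names are chosen for this example, and plain string equality stands in for a proper equivalence grader such as the PRM800K one.

```python
from collections import Counter


def pass_at_1(samples, reference):
    """Mean per-sample accuracy: the expected score of one sampled answer."""
    return sum(answer == reference for answer in samples) / len(samples)


def maj_at_k(samples, reference):
    """1.0 if the most frequent sampled answer matches the reference, else 0.0."""
    top_answer, _ = Counter(samples).most_common(1)[0]
    return float(top_answer == reference)


# Hypothetical final answers from four samples on each of two problems.
runs = [
    (["4", "4", "5", "4"], "4"),
    (["7", "7", "9", "8"], "7"),
]
mean_pass_1 = sum(pass_at_1(s, ref) for s, ref in runs) / len(runs)
mean_maj = sum(maj_at_k(s, ref) for s, ref in runs) / len(runs)
print(mean_pass_1, mean_maj)  # 0.625 1.0
```

On the same samples, majority voting reports 100% while pass@1 reports 62.5%, which is why a Maj@k or best-of-N figure should not be set beside a pass@1 figure without saying so.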

For most practical purposes, MATH-500's role has shifted from a discriminating benchmark to a *floor*: a reasoning model that does not clear ≈90% on MATH-500 is not yet a frontier reasoning model. Discriminative comparison has migrated to AIME, USAMO, FrontierMath, and HLE.[^16][^25]

## See also

- [MATH benchmark](/wiki/math_benchmark): the parent dataset
- [MATH Level 5](/wiki/math_level_5): the hardest difficulty stratum of MATH and MATH-500
- [Process reward model (PRM)](/wiki/process_reward_model): the technique introduced alongside MATH-500
- [GSM8K](/wiki/gsm8k): the grade-school counterpart often paired with MATH
- [AIME](/wiki/aime), [AIME 2024](/wiki/aime_2024), [AIME 2025](/wiki/aime_2025): the harder successors
- [FrontierMath](/wiki/frontiermath): research-level held-out math benchmark
- [MathArena](/wiki/matharena): contamination-resistant contest-time evaluation
- [Reasoning models](/wiki/reasoning_models): the class of models that drove MATH-500 adoption
- [Chain-of-thought](/wiki/chain_of_thought): the prompting / training paradigm MATH-500 evaluates
- [Test-time compute](/wiki/test_time_compute): the scaling axis MATH-500 most cleanly measures

## References

[^1]: Lightman, H., Kosaraju, V., Burda, Y., Edwards, H., Baker, B., Lee, T., Leike, J., Schulman, J., Sutskever, I., & Cobbe, K. (2023). *Let's Verify Step by Step.* arXiv preprint arXiv:2305.20050. <https://arxiv.org/abs/2305.20050>

[^2]: OpenAI. *prm800k* GitHub repository, README and `math_splits/` documentation. <https://github.com/openai/prm800k>

[^3]: Hugging Face dataset card, *HuggingFaceH4/MATH-500*. <https://huggingface.co/datasets/HuggingFaceH4/MATH-500>

[^4]: OpenAI. *prm800k* README quote: "We selected these 500 test problems uniformly at random." <https://github.com/openai/prm800k>

[^5]: Artificial Analysis. *MATH-500 Benchmark Leaderboard.* <https://artificialanalysis.ai/evaluations/math-500>

[^6]: llm-stats.com. *MATH-500 Benchmark Leaderboard.* <https://llm-stats.com/benchmarks/math-500>

[^7]: OpenAI. *simple-evals* GitHub repository, README footnote on MATH / MATH-500 usage. <https://github.com/openai/simple-evals>

[^8]: Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., & Steinhardt, J. (2021). *Measuring Mathematical Problem Solving With the MATH Dataset.* arXiv:2103.03874; NeurIPS 2021 Datasets and Benchmarks. <https://arxiv.org/abs/2103.03874>

[^9]: arXiv listing for *Measuring Mathematical Problem Solving With the MATH Dataset.* <https://arxiv.org/abs/2103.03874>

[^10]: Hugging Face dataset card, *hendrycks/competition_math.* <https://huggingface.co/datasets/hendrycks/competition_math>

[^11]: hendrycks/math GitHub repository (LICENSE: MIT). <https://github.com/hendrycks/math>

[^12]: Lightman et al. 2023, abstract and §3.3; process supervision results on the held-out subset. <https://arxiv.org/abs/2305.20050>

[^13]: DeepSeek-AI. (2025). *DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.* arXiv:2501.12948, Tables 4 & 5. <https://arxiv.org/html/2501.12948v1>

[^14]: Kimi Team. (2025). *Kimi k1.5: Scaling Reinforcement Learning with LLMs.* arXiv:2501.12599. <https://arxiv.org/html/2501.12599v1>

[^15]: EleutherAI. *lm-evaluation-harness* `hendrycks_math` task README. <https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/hendrycks_math/README.md>

[^16]: BenchLM.ai. *Math Benchmarks 2026: AIME, HMMT, MATH-500 LLM Scores.* <https://benchlm.ai/math>

[^17]: Vals AI. *MATH 500 Benchmark.* <https://www.vals.ai/benchmarks/math500-04-15-2025>

[^18]: PRM800K repository, `prm800k/grading/` directory. <https://github.com/openai/prm800k>

[^19]: Independent leaderboards referencing Gemini 2.5 Pro MATH-500 score (≈97%). <https://llm-stats.com/benchmarks/math-500>

[^20]: Epoch AI. *FrontierMath.* <https://epoch.ai/frontiermath>

[^21]: Survey of contamination methods in LLMs. arXiv:2404.00699. <https://arxiv.org/html/2404.00699v4>

[^22]: Awesome Data Contamination paper list (lyy1994). <https://github.com/lyy1994/awesome-data-contamination>

[^23]: Huang, K. et al. (2025). *MATH-Perturb: Benchmarking LLMs' Math Reasoning Abilities against Hard Perturbations.* arXiv:2502.06453. <https://arxiv.org/abs/2502.06453>

[^24]: MATH-Perturb project page. <https://math-perturb.github.io/>

[^25]: Glazer et al. (2024). *FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI.* arXiv:2411.04872. <https://arxiv.org/pdf/2411.04872>

[^26]: Verity AI. *GSM8K & MATH: Benchmarking Mathematical Reasoning.* <https://verityai.co/blog/gsm8k-math-benchmarks-mathematical-reasoning>

[^27]: Intuition Labs. *AIME 2025 Benchmark: An Analysis of AI Math Reasoning.* <https://intuitionlabs.ai/articles/aime-2025-ai-benchmark-explained>

