MATH
Last reviewed
Apr 28, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v2 · 5,563 words
MATH is a benchmark of 12,500 competition mathematics problems used to evaluate the mathematical problem-solving ability of machine learning systems, particularly large language models. It was introduced by Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt of UC Berkeley and the University of Chicago in the paper "Measuring Mathematical Problem Solving With the MATH Dataset" (arXiv:2103.03874), published at NeurIPS 2021 in the Datasets and Benchmarks track. The problems are drawn from American high-school mathematics competitions including the AMC 10, AMC 12, and the AIME, and each problem ships with a complete step-by-step solution typeset in LaTeX. MATH became one of the most widely cited evaluations of mathematical reasoning during the GPT-3 and GPT-4 era, and the trajectory from its 2021 baselines (around 5 to 7 percent) to near-saturation by 2025 (above 95 percent on the MATH-500 subset for frontier reasoning models) is one of the cleanest before-and-after stories in modern AI evaluation.
The MATH paper grew out of a broader research program at UC Berkeley, led by Dan Hendrycks, that produced several influential benchmarks in 2020 and 2021: MMLU for general knowledge, APPS for code generation, and MATH for quantitative reasoning. All three followed a similar design philosophy. They drew problems from human-curated sources rather than generating them synthetically, they spanned multiple difficulty levels, and they came with detailed solutions or rationales rather than just final answers. The Hendrycks group argued that these qualities were necessary if benchmarks were going to remain meaningful as models scaled into the hundreds of billions of parameters.
The motivation for MATH specifically was that the math benchmarks available in 2020 were either too easy for modern models (arithmetic-focused datasets like the math word problems in NumGLUE) or unsuitable for end-to-end evaluation (the MAWPS collection focused on simple word problems with single arithmetic operations). GSM8K, released by OpenAI around the same time as MATH, addressed the easy end of the spectrum with grade-school word problems. MATH addressed the hard end. Competition mathematics problems require multi-step reasoning, the choice of an appropriate technique from a wide repertoire (clever substitutions, casework, generating functions, modular arithmetic, geometric constructions), and care with algebraic manipulation. They were a natural target for a benchmark that intended to remain useful as models grew.
The authors collected problems by scraping public archives of past competitions, primarily the Art of Problem Solving (AoPS) wiki, which hosts problem statements and community-written solutions for the AMC and AIME and other contests. After cleaning and deduplication they kept 12,500 problems, each tagged with a subject and a difficulty level. The detailed solutions came from the AoPS community archives and were typeset in LaTeX to preserve mathematical notation. The split into 7,500 training problems and 5,000 test problems was held constant in the released dataset, and the test set is the standard reference point for reported MATH numbers.
Alongside MATH itself the authors released AMPS (Auxiliary Mathematics Problems and Solutions), a 23 GB pretraining corpus designed to teach models the basics of mathematics before fine-tuning on MATH. AMPS contains over 100,000 problems across 693 exercise types scraped from Khan Academy, plus approximately 5 million problems generated by 100 hand-designed Mathematica scripts (37 of which produce step-by-step solutions). The point of AMPS was to demonstrate that domain-specific pretraining could substitute for raw scale: in the original paper a GPT-2 1.5B model pretrained on AMPS reached the same MATH accuracy as scaling the model parameters by roughly 130x without it, an early hint that the right data mattered more than naive parameter count for mathematical reasoning.
MATH problems are organized into seven subject categories that map roughly to the standard high-school and early-college mathematics curriculum:
| Subject | Approximate share of dataset | Typical topics |
|---|---|---|
| Prealgebra | ~17% | Arithmetic, fractions, decimals, basic equations, simple word problems |
| Algebra | ~23% | Linear and quadratic equations, inequalities, polynomial manipulation, functions |
| Number Theory | ~11% | Divisibility, primes, modular arithmetic, base conversion, Diophantine equations |
| Counting and Probability | ~10% | Permutations, combinations, binomial coefficients, expected value, simple probability |
| Geometry | ~11% | Triangles, circles, area and volume, coordinate geometry, transformations |
| Intermediate Algebra | ~18% | Logarithms, sequences and series, complex numbers, conic sections, polynomial roots |
| Precalculus | ~10% | Trigonometry, vectors, matrices, limits, parametric equations |
The seven-subject breakdown is not balanced. Algebra is the largest category at roughly 2,900 problems across train and test combined, Intermediate Algebra and Prealgebra each contribute roughly 2,100 to 2,200, and Counting and Probability, Geometry, Number Theory, and Precalculus each contribute on the order of 1,200 to 1,400. The original Hendrycks et al. paper sometimes describes the structure as six categories by merging Prealgebra and Precalculus, which is why some secondary sources (and earlier versions of this article) report six subjects rather than seven. The released JSON files in the GitHub repository carry seven distinct subject tags.
Every problem carries a difficulty rating from 1 (easiest) to 5 (hardest) assigned by the problem's original source or by the AoPS community. The mapping is roughly:
| Level | Description | Typical source on AMC/AIME |
|---|---|---|
| Level 1 | Easiest | Early AMC 8 and AMC 10 problems, basic AoPS introductory exercises |
| Level 2 | Easy | Middle AMC 10 problems, late AMC 8 |
| Level 3 | Medium | Late AMC 10, early AMC 12 |
| Level 4 | Hard | Late AMC 12, early AIME |
| Level 5 | Hardest | Late AIME and harder competition problems |
The distribution across levels is not uniform. In the 5,000-problem test split the counts rise with difficulty, from 437 problems at Level 1 to 1,324 at Level 5. Level 5 is by far the most studied subset because it provides the cleanest signal of advanced mathematical reasoning, and it is sometimes reported as a separate benchmark called MATH Level 5 in modern leaderboards.
Each MATH problem ships as a JSON record with four fields: the problem statement in LaTeX, the subject tag, the level, and the canonical solution. Final answers are wrapped in \boxed{} so that automated evaluation can extract a single string for comparison. A representative example from the Counting and Probability category looks like this:
Problem: Find the number of ordered pairs of positive integers (a, b) such that a + b = 1000
and neither a nor b has a zero digit.
Solution: We use complementary counting. There are 999 ordered pairs (a, b) of positive
integers with a + b = 1000. We count those in which at least one of a, b has a digit zero.
... [several paragraphs of casework] ...
Thus there are 999 - 261 = \boxed{738} valid pairs.
The solution is intentionally verbose and walks through every step. The authors used these solutions for two purposes: as supervised fine-tuning targets that taught models the structure of mathematical reasoning, and as a qualitative tool for spot-checking model output. The decision to release full solutions rather than answers alone was unusual at the time and influenced later math-focused datasets such as PRM800K and the OpenMathInstruct corpus.
MATH is scored by exact match between the model's final answer and the canonical answer extracted from \boxed{}. There is no partial credit and no judgment of solution quality. A model that produces a correct answer through faulty reasoning is scored as correct, and a model that produces a near-correct answer with a minor algebraic slip is scored as wrong.
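The scoring rule can be sketched in a few lines. This is a minimal illustration, not the official evaluation script: the helper names `extract_boxed` and `is_correct` are hypothetical, and the record's field names (`problem`, `level`, `type`, `solution`) and its level value are assumptions made for the example. The extractor matches braces so that nested LaTeX such as `\boxed{\frac{1}{2}}` is returned whole.

```python
from typing import Optional

# Illustrative record; field names and the level value are assumptions.
record = {
    "problem": "Find the number of ordered pairs of positive integers (a, b) "
               "such that a + b = 1000 and neither a nor b has a zero digit.",
    "level": "Level 5",
    "type": "Counting & Probability",
    "solution": r"Thus there are 999 - 261 = \boxed{738} valid pairs.",
}


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1  # we are inside the opening brace of \boxed{
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
    return None  # braces never closed


def is_correct(model_output: str, rec: dict) -> bool:
    """Exact match on the boxed answer: no partial credit, reasoning ignored."""
    gold = extract_boxed(rec["solution"])
    pred = extract_boxed(model_output)
    if gold is None or pred is None:
        return False
    return pred.strip() == gold.strip()


print(is_correct(r"Casework gives \boxed{738}.", record))  # True
print(is_correct(r"Casework gives \boxed{739}.", record))  # False
```

Note that the second call is scored as wrong regardless of how sound the preceding reasoning was, which is exactly the no-partial-credit behavior described above.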
The official evaluation script handles common formatting variations: equivalent fractions (1/2 versus 0.5), ordering of terms in a sum, presence or absence of explicit multiplication signs in algebraic expressions, and so on. The script uses simple symbolic normalization (collecting like terms, canonicalizing fraction forms) before comparison. Even with normalization, format mismatches still occasionally cost models a point or two compared to a human grader, which is one of the standard caveats when comparing scores across reports.
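To make the idea of normalization concrete, here is a small sketch that treats `1/2`, `0.5`, and `\frac{1}{2}` as the same answer. It is an assumption-laden illustration rather than the official script: the helpers `to_number` and `answers_match` are hypothetical, only plain numeric answers are canonicalized, and anything else falls back to a whitespace-insensitive string comparison.

```python
import re
from fractions import Fraction
from typing import Optional


def to_number(ans: str) -> Optional[Fraction]:
    """Parse a numeric answer into an exact Fraction, or return None."""
    s = ans.strip().replace(" ", "")
    # Handle \frac{a}{b} and \dfrac{a}{b} with integer numerator/denominator.
    m = re.fullmatch(r"\\d?frac\{(-?\d+)\}\{(-?\d+)\}", s)
    if m:
        num, den = int(m.group(1)), int(m.group(2))
        return Fraction(num, den) if den != 0 else None
    try:
        return Fraction(s)  # accepts "1/2", "0.5", "-3"
    except (ValueError, ZeroDivisionError):
        return None


def answers_match(pred: str, gold: str) -> bool:
    """Compare numerically when both sides parse, else compare strings."""
    a, b = to_number(pred), to_number(gold)
    if a is not None and b is not None:
        return a == b
    return pred.replace(" ", "") == gold.replace(" ", "")


print(answers_match("1/2", "0.5"))           # True
print(answers_match(r"\frac{1}{2}", "0.5"))  # True
print(answers_match("x+1", "1+x"))           # False
```

The last line shows the kind of format mismatch the paragraph above warns about: this sketch does not reorder terms, so a grader built this way would mark a correct algebraic answer wrong. Handling that case requires symbolic comparison, which is outside the scope of this example.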
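Reported numbers usually differ in how many samples stand behind each answer. Single-sample accuracy (pass@1, often with greedy decoding) and majority-vote accuracy over k samples (maj@k) are the two forms that appear most often in the results discussed in this article.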
Different papers use different prompts, different sampling parameters, and occasionally different subsets of the test set, so headline numbers should be compared with care. The OpenAI "simple-evals" repository and the EleutherAI lm-evaluation-harness both ship reference implementations that have become semi-official baselines for cross-paper comparison.
The progression of state-of-the-art numbers on MATH between 2021 and 2025 is one of the most-cited examples of rapid benchmark progress in modern AI. The original paper estimated, based on a log-linear fit to GPT-2 and GPT-3 scaling data, that reaching 40 percent accuracy by scaling alone would require roughly 10^35 parameters, a figure more than twenty orders of magnitude beyond the 175 billion parameters of GPT-3 and far outside any plausible compute budget. In practice 40 percent was reached within about 18 months, and 80 percent within about three years, almost entirely because of algorithmic and data improvements rather than parameter scaling.
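The size of that gap is easy to check with a one-line calculation. The snippet below only compares the two parameter counts quoted in this article (10^35 projected, 175 billion for GPT-3); it does not reproduce the paper's log-linear fit, and the variable names are illustrative.

```python
import math

projected_params = 1e35  # parameters the paper's extrapolation called for
gpt3_params = 175e9      # GPT-3 175B, the largest model in the 2021 baselines

# Orders of magnitude separating the projection from GPT-3.
gap = math.log10(projected_params / gpt3_params)
print(f"{gap:.1f} orders of magnitude")  # 23.8 orders of magnitude
```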
| Year | Model | Method | MATH accuracy | Source |
|---|---|---|---|---|
| 2021 | GPT-2 0.1B | AMPS pretrain plus fine-tune | 5.4% | Hendrycks et al. (2021) |
| 2021 | GPT-2 1.5B | AMPS pretrain plus fine-tune | 6.9% | Hendrycks et al. (2021) |
| 2021 | GPT-3 175B | Few-shot | 5.2% | Hendrycks et al. (2021) |
| 2022 | Minerva 8B | Pass@1 | 14.1% | Lewkowycz et al. (2022) |
| 2022 | Minerva 62B | Pass@1 | 27.6% | Lewkowycz et al. (2022) |
| 2022 | Minerva 540B | Pass@1 | 33.6% | Lewkowycz et al. (2022) |
| 2022 | Minerva 540B | Majority vote (k = 64) | 50.3% | Lewkowycz et al. (2022) |
| 2023 | GPT-4 | Zero-shot CoT | 42.5% | OpenAI (2023) GPT-4 technical report |
| 2023 | GPT-4 + Code Interpreter | With Python tool use | 69.7% | Zhou et al. (2023) |
| 2024 | Claude 3 Opus | Zero-shot CoT | 60.1% | Anthropic Claude 3 model card |
| 2024 | GPT-4o | CoT | 76.6% | OpenAI GPT-4o announcement |
| 2024 | o1-preview | Reasoning | 85.5% | OpenAI o1 system card |
| 2024 | o1 (work in progress) | Reasoning | 94.8% | OpenAI o1 announcement |
| 2025 | DeepSeek-R1 | Reasoning, MATH-500 | 97.3% | DeepSeek-R1 paper |
| 2025 | OpenAI o1-1217 | Reasoning, MATH-500 | 96.4% | DeepSeek-R1 paper (comparison) |
A few moments in this table are worth pointing out.
The 2021 baseline. In the original paper, GPT-3 175B in a few-shot setting scored 5.2 percent, essentially indistinguishable from the much smaller GPT-2 1.5B at 6.9 percent (which had the advantage of AMPS pretraining and fine-tuning). The conclusion in the paper was that naive scaling was not solving MATH and that targeted innovation would be required. This conclusion aged well.
Minerva's jump. Minerva (Lewkowycz et al., 2022) was a fine-tuned variant of PaLM trained on 118 GB of mathematical and scientific content scraped from arXiv and the web. The 540B Minerva model reached 50.3 percent on MATH using majority voting over 64 samples, a roughly tenfold improvement over the original GPT-3 baseline in just over a year. Minerva did not use any external tools (no calculator, no code execution), so the gain came entirely from better pretraining data and majority voting. It became the model to beat for the next 18 months.
GPT-4 and tool use. GPT-4 without tools scored about 42 percent on MATH in the technical report, similar to Minerva 62B with majority voting. With access to a Python code interpreter, the same model jumped to 69.7 percent (Zhou et al., 2023, "Solving Challenging Math Word Problems Using GPT-4 Code Interpreter"). The lesson generalized quickly: external tools, especially code execution and symbolic math, were doing most of the work on the harder problems where models would otherwise fumble arithmetic.
The o1 jump. OpenAI o1, released in September 2024, was the first reasoning model trained explicitly to produce long internal chains of thought before answering. On MATH the work-in-progress version was reported at 94.8 percent, with the released o1-preview at 85.5 percent. By comparison, GPT-4o on the same benchmark was around 76 percent. The o1 jump effectively closed the gap between AI and the IMO-gold-medalist human baseline.
Saturation by 2025. DeepSeek-R1 and OpenAI o1-1217 both reached above 96 percent on MATH-500 in early 2025, leaving roughly a 3 to 4 percentage point ceiling for further improvement. At this point MATH had stopped functioning as a discriminative benchmark for frontier models and the field had largely moved on to harder evaluations.
MATH-500 is a 500-problem subset of the original MATH test set, sampled uniformly at random by OpenAI as part of the "Let's Verify Step by Step" research program (Lightman et al., 2023). The remaining 4,500 test problems were folded into the training data for the PRM800K process reward model, which left MATH-500 as the only held-out evaluation slice. The subset has since become a de facto standard for fast MATH evaluation. Running the full 5,000-problem test set takes substantial compute when models generate long chains of thought (the o1-style models routinely produce thousands of reasoning tokens per problem), and 500 problems is roughly the smallest sample that still produces stable percentage estimates for top models.
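The stability claim can be made concrete with the standard error of a proportion. The sketch below assumes each problem is an independent pass/fail trial with the same success probability, which is a simplification (problem difficulty varies), and the function name `standard_error` is illustrative.

```python
import math


def standard_error(accuracy: float, n_problems: int) -> float:
    """Standard error of an observed accuracy under a binomial model."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_problems)


# Compare sample sizes at a frontier-level accuracy of 97 percent.
for n in (100, 500, 5000):
    se_pp = 100 * standard_error(0.97, n)
    print(f"n={n:>4}: standard error of about {se_pp:.2f} percentage points")
```

Under these assumptions, 500 problems at 97 percent accuracy gives a standard error of about 0.76 percentage points, so two models reported one point apart on MATH-500 are not clearly separated by sampling noise alone. The full 5,000-problem set shrinks that to about 0.24 points.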
MATH-500 spans the same seven subjects and five difficulty levels as the full test set in approximately the same proportions. OpenAI's simple-evals harness and Hugging Face's HuggingFaceH4/MATH-500 dataset both ship the same 500-problem split, which makes cross-paper comparison cleaner. Most 2024 and 2025 model release blog posts that report "MATH" numbers are reporting MATH-500 unless otherwise noted; the convention is to use the larger label even when the smaller subset is what was actually evaluated.
DeepSeek-R1's 97.3 percent result, OpenAI o3's reported scores, and most of the 2025 reasoning-model leaderboards are MATH-500 numbers. The full 5,000-problem score is usually a few tenths of a point off, but not in any consistent direction.
MATH Level 5 refers to the 1,324 hardest problems in the MATH test set, those rated at the top difficulty band. These are roughly AIME-level questions and represent the part of the benchmark that has remained meaningful longest. Even when models had saturated the easier levels, Level 5 continued to discriminate between systems through 2024.
For the original GPT-2 1.5B model, Level 5 accuracy was approximately 4 percent versus 15 percent on Level 1, so the difficulty gap was visible from the start. Minerva 540B with majority voting reached 33.6 percent on Level 5 in 2022, GPT-4 reached approximately 25 percent zero-shot, and o1-style reasoning models pushed Level 5 above 75 percent in 2024. Epoch AI maintains a tracker for MATH Level 5 specifically, treating it as a separate benchmark that has not yet saturated for non-reasoning models.
The Level 5 problems are dominated by AIME-style questions: integer-answer problems requiring multi-step casework, generating-function manipulations, geometric constructions, or clever algebraic identities. They are the part of MATH that most directly tests whether a model has learned mathematical reasoning rather than memorized solution templates.
The original paper reported a human reference based on a small sample of human solvers. A three-time International Mathematical Olympiad gold medalist scored approximately 90 percent on a sample of MATH problems, with the few errors attributed to calculation slips and time pressure. A computer science PhD student who did not specialize in math scored about 40 percent, which the authors used to argue that MATH is genuinely difficult even for technically skilled adults.
The 90 percent gold medalist figure is the number most often cited as the human ceiling. Average mathematics undergraduates score in the 20 to 40 percent range on competition problems of this difficulty, and casual college-educated adults score below 10 percent. The implication for AI is that MATH was never a benchmark that average human-level performance would solve. By the time top models passed 50 percent, they were already outperforming the computer science PhD student in the paper's human sample. By the time they reached 90 percent, they were within range of top human competitors.
This trajectory is part of why mathematical-reasoning benchmarks evolved quickly past MATH: once models match an IMO gold medalist on a benchmark, the benchmark is no longer measuring the frontier of mathematical reasoning, only confirming that frontier models can solve high-school competition problems.
The roughly 90-percentage-point gain on MATH between 2021 and 2025 came from a combination of innovations rather than any single technique.
Domain-specific pretraining. AMPS in the original paper, then the Minerva math-and-science corpus, then OpenWebMath (Paster et al., 2023), then the OpenMathInstruct synthetic datasets all showed that targeted math pretraining outperformed general scaling. Minerva's 50.3 percent in 2022 came from PaLM plus 118 GB of arXiv and math-tagged web pages, with no architectural changes.
Chain-of-thought prompting. Wei et al. (2022) showed that simply prompting a sufficiently large model to think step by step before answering improved math performance by 10 to 20 percentage points on MATH and similar benchmarks. CoT became the default inference protocol for math evaluation by mid-2022.
Self-consistency and majority voting. Wang et al. (2022) showed that sampling multiple chains of thought at non-zero temperature and taking the majority answer was a much stronger evaluation protocol than greedy decoding. Minerva 540B improved from 33.6 percent (greedy) to 50.3 percent (maj@64) using only this technique.
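A minimal sketch of the voting step, assuming the final answers have already been extracted from each sampled chain of thought. The function name `majority_answer` and the sample values are illustrative; a real pipeline would also normalize equivalent answer forms before counting.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_answer(samples: Sequence[Optional[str]]) -> Optional[str]:
    """Return the most frequent extracted answer, ignoring failed extractions."""
    answers = [a for a in samples if a is not None]
    if not answers:
        return None
    # most_common(1) returns [(answer, count)]; ties go to the answer seen first.
    return Counter(answers).most_common(1)[0][0]


# Five sampled chains of thought for one problem; one failed to produce an answer.
sampled = ["738", "729", "738", None, "738"]
print(majority_answer(sampled))  # 738
```

The accuracy reported as maj@k is then the fraction of problems for which this voted answer matches the reference answer.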
Tool use. Code interpreter access raised GPT-4's MATH score from about 42 percent to about 70 percent, mostly by removing arithmetic errors. By 2024 most production math evaluations of base LLMs included a Python sandbox.
Process supervision. OpenAI's "Let's Verify Step by Step" paper (Lightman et al., 2023) trained a process reward model on 800,000 step-level correctness labels (the PRM800K dataset) and used it to rerank candidate solutions. Process supervision substantially outperformed outcome supervision on MATH-500, and the technique became a building block for later reasoning models.
Reasoning models. OpenAI o1, OpenAI o3, and DeepSeek-R1 trained models with reinforcement learning to produce long internal chains of thought before answering. On MATH this pushed accuracies into the mid-90s without external tools, effectively closing the benchmark.
Verifier-guided search. Stronger search procedures (best-of-N with learned verifiers, beam search over reasoning steps, tree-of-thought variants) added a few percentage points on top of any base model. By 2024 the typical recipe combined a strong base model, CoT prompting, sampling, and a verifier or reward model.
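Best-of-N reranking with a step-level verifier can be sketched as follows. The per-step probabilities here are made-up numbers standing in for the output of a process reward model, and the names `solution_score` and `rerank` are hypothetical. Scoring a solution by the product of its step probabilities is one aggregation choice (taking the minimum is another); the example does not claim to match any particular paper's implementation.

```python
import math
from typing import List, Sequence, Tuple

# Each candidate is (final_answer, per-step correctness probabilities).
Candidate = Tuple[str, List[float]]


def solution_score(step_probs: Sequence[float]) -> float:
    """Probability that every step is correct, treating steps as independent."""
    return math.prod(step_probs)


def rerank(candidates: Sequence[Candidate]) -> str:
    """Return the final answer of the highest-scoring candidate solution."""
    best = max(candidates, key=lambda c: solution_score(c[1]))
    return best[0]


# Hypothetical verifier outputs for three sampled solutions to one problem.
candidates: List[Candidate] = [
    ("738", [0.95, 0.90, 0.92]),
    ("729", [0.99, 0.40, 0.90]),  # one weak step drags the whole solution down
    ("738", [0.80, 0.85, 0.90]),
]
print(rerank(candidates))  # 738
```

The second candidate illustrates why step-level scoring helps: a single low-confidence step is enough to demote a solution even when the other steps look sound.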
By 2024 and into 2025 the cluster of frontier models on MATH-500 had compressed into the 95 to 98 percent range, leaving little room for differentiation. This saturation was not forecast: the original paper's worry was that progress on MATH would be too slow for scaling alone to matter, but saturation arriving too quickly turned out to be the bigger problem. The field responded with a series of harder math evaluations.
| Benchmark | Year | Description | SOTA at release |
|---|---|---|---|
| GSM8K | 2021 | 8,500 grade-school word problems, OpenAI | About 35% (GPT-3 + verifier) |
| MATH | 2021 | 12,500 AMC and AIME problems | About 7% (GPT-2 1.5B + AMPS) |
| AIME 2024 | 2024 | 15-problem AIME exam used as a held-out test | 13% (GPT-4o) |
| OlympiadBench | 2024 | Olympiad-level math and physics | About 35% (GPT-4V) |
| FrontierMath | 2024 | Hundreds of original research-level problems by Epoch AI | Under 2% at release |
| PutnamBench | 2024 | Putnam competition problems formalized in Lean | Single-digit percent |
| MATH-Perturb-Hard | 2025 | Adversarially perturbed MATH problems | 10 to 25 pp drop vs MATH |
| HARP | 2024 | Human-annotated reasoning problems | Open |
| OlymMATH | 2025 | Olympiad-level math, multilingual | Limited accuracy on hard subset |
AIME became the de facto successor for evaluating top math models. The 2024 AIME exams were used by OpenAI to characterize o1-preview and o1, with the headline that o1 averaged 74 percent on AIME 2024 (single sample), 83 percent with consensus over 64 samples, and 93 percent with a learned reranker over 1000 samples. AIME has the additional advantage that the problems are released annually on a fixed schedule, which makes contamination tracking easy: a model trained before March of a given year could not have seen that year's problems unless they were leaked.
FrontierMath (Glazer et al., 2024, arXiv:2411.04872) took a different approach. The benchmark was built with approximately 60 mathematicians, who contributed original research-level problems that had not been published anywhere else. Fields medalists Terence Tao, Timothy Gowers, and Richard Borcherds reviewed sample problems and commented on their difficulty. Solving a typical FrontierMath problem takes a domain expert hours or days. At release, frontier models including GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro all scored under 2 percent. FrontierMath is the most direct heir to MATH's role as a measurement of the upper limit of AI mathematical reasoning, and as of 2025 it remained far from saturated even for o1-style reasoning models.
MATH-Perturb (Huang et al., 2025) and similar adversarial benchmarks took a different angle. They started from MATH problems and applied targeted perturbations (renaming variables, changing constants, restructuring problems while preserving difficulty) to detect whether models had memorized solution templates rather than learning to solve the underlying problems. Top reasoning models including o1-mini and Gemini 2.0 Flash Thinking dropped 10 to 25 percentage points on MATH-Perturb-Hard relative to standard MATH, suggesting that some of the saturation was an overfitting artifact rather than genuine reasoning ability.
MATH had outsized influence on how the field studies mathematical reasoning. Several research directions trace directly back to choices made in the original paper.
Step-by-step solutions as data. The decision to release full LaTeX-typeset solutions, not just answers, made MATH the natural training set for instruction-tuned math models. The OpenMathInstruct corpus and the AMPS-derivative datasets that followed all built on this design. MATH gave the field a clean, uniformly formatted corpus for fine-tuning a model to produce verifiable mathematical reasoning, something that official contest archives publishing only answers could not provide.
Process versus outcome supervision. The PRM800K work was built on MATH problems, with human labelers marking each step of model-generated solutions as correct or incorrect. The result, that process supervision generalizes better than outcome supervision, pushed the field toward verifier-based training and informed the design of o1-style reasoning models.
Synthetic math data. The success of AMPS demonstrated that synthetic math problems generated by symbolic math systems were a viable pretraining substrate. This idea became central to later math LLMs, with NVIDIA's NeMo-Skills, OpenMathInstruct, and DeepSeekMath all using synthetic data generation pipelines that trace back conceptually to AMPS.
Test-time compute scaling. Majority voting on MATH (Wang et al., 2022, then Minerva) was an early demonstration that more inference-time compute could substantially improve accuracy. This insight matured into the broader test-time compute scaling literature that culminated in o1-style reasoning models, where test-time compute became a first-class axis of model improvement on par with training compute.
Tool use as default. GPT-4 with code interpreter scoring 70 percent on MATH made it hard to justify evaluating math models without also reporting a tool-assisted number. Many evaluations now report both "with tools" and "without tools" results, and production deployments typically use tools.
Benchmark evolution as a research method. MATH's saturation arc became a template. AIME, FrontierMath, OlympiadBench, and MATH-Perturb were all designed in conscious response to lessons learned from MATH: keep the problems hard enough that the benchmark has runway, build in contamination resistance, and make the evaluation robust to template memorization.
The MATH benchmark has several well-documented limitations that are worth keeping in mind when comparing reported scores.
Answer-only evaluation. Exact match on the boxed answer does not check whether the reasoning was correct. A model can guess the right answer through faulty reasoning and be scored as correct. Conversely a model that produces correct reasoning but a different valid answer format may be scored as incorrect. Process supervision research (PRM800K) showed that the gap between answer-correct and reasoning-correct can be substantial, especially for long solutions.
Contamination. MATH problems and their solutions have been on the public internet since 2021 and the original sources (AMC, AIME, AoPS) have been online much longer. Modern web-scale pretraining corpora almost certainly include MATH and its solutions, and there is no reliable way to retroactively decontaminate a closed-source model. Some MATH improvements over the years probably reflect contamination rather than genuine reasoning gains. The MATH-Perturb-Hard drop of 10 to 25 percentage points for top models is consistent with this concern.
Saturation at the top. Once frontier models cluster above 95 percent, distinctions between them on MATH stop being informative. Sample noise, prompt sensitivity, and answer-format variation all become large relative to the remaining headroom. By 2025 MATH-500 results in the 96 to 98 percent range were essentially indistinguishable from each other.
Coverage gaps. MATH is dominated by competition mathematics. It does not cover proof-based mathematics, formal verification, advanced calculus and analysis, abstract algebra beyond what shows up on the AIME, or any kind of research mathematics. This is by design (the dataset targets American high-school competition style), but it means that high MATH scores do not transfer cleanly to the broader question of whether models can do mathematics.
English only. All problems and solutions are in English. There is no multilingual MATH, although translations have been produced informally and some successor benchmarks (OlymMATH) are explicitly multilingual.
Format brittleness. The exact-match grader handles common formatting variants but still occasionally penalizes correct answers that are written differently from the canonical form. Reported scores can shift by one or two points depending on which grader implementation is used.
Solution quality variance. Because the canonical solutions came from the AoPS community, their quality varies. Some are excellent expository solutions, others are terse competition-style writeups, and a few contain minor errors. This is a relatively small issue for evaluation (which only checks the final answer) but matters for fine-tuning, where the quality of the supervised target affects what the model learns.
MATH sits in a family of mathematical reasoning benchmarks that the field uses in roughly graduated order of difficulty.
| Benchmark | Year | Difficulty | Size | Status (2025) |
|---|---|---|---|---|
| GSM8K | 2021 | Grade school | 8,500 problems | Saturated for frontier models |
| MATH | 2021 | High school competition | 12,500 problems | Saturated for reasoning models |
| MATH-500 | 2023 | High school competition (subset) | 500 problems | Saturated for reasoning models |
| MATH Level 5 | 2021 (subset) | AIME-level | 1,324 problems | Mostly saturated for top models |
| AIME 2024 | 2024 | AIME competition | 15 problems | Open at top of leaderboard |
| AIME 2025 | 2025 | AIME competition | 15 problems | Open |
| OlympiadBench | 2024 | Olympiad-level | 8,476 problems | Open |
| PutnamBench | 2024 | Putnam competition (Lean formalized) | 640 problems | Single-digit percent SOTA |
| HARP | 2024 | Human-annotated reasoning | Open | Open |
| FrontierMath | 2024 | Research-level mathematics | Several hundred problems | Open, under 2% at release |
| Humanity's Last Exam | 2025 | Multidisciplinary, includes math | About 3,000 problems | Open |
Among these, GSM8K and MATH are the historical pair that defined modern math evaluation. GSM8K covers the grade-school end and was the easier of the two; MATH covers the high-school competition end and is what most papers actually meant when they reported "math" results between 2021 and 2024. The successor pair AIME and FrontierMath play roughly the same dual role for the post-saturation era, with AIME providing a contamination-resistant high-school benchmark and FrontierMath providing the research-level frontier.
As of 2025 MATH is no longer the headline math benchmark for frontier model releases, but it remains useful in several ways.
First, it is the standard math benchmark in the EleutherAI lm-evaluation-harness and the OpenAI simple-evals harness, so it shows up in essentially every model card and technical report. Even when a model has saturated MATH, the score serves as a sanity check that math capability has not regressed and as a continuous time series back to 2021.
Second, MATH-500 is a fast and cheap evaluation. Five hundred problems takes a few hours of inference even with long chains of thought, which makes it well suited to checkpointing during training and to ablation studies. For models that are not yet at saturation (smaller open-source models, distilled models, narrow-domain models), MATH-500 still discriminates meaningfully.
Third, MATH Level 5 and MATH-Perturb-Hard are the residual hard subsets where measurement still happens. Epoch AI maintains a Level 5 leaderboard. Reports on perturbed MATH and on AIME 2024 and 2025 have effectively replaced the role MATH used to play in distinguishing top reasoning models.
Fourth, MATH solutions remain a widely used training corpus. The 7,500 training problems with their LaTeX solutions are part of essentially every modern math fine-tuning recipe, often expanded by a factor of 100 or more through synthetic generation in the AMPS lineage. Even when the test set is saturated, the training set continues to contribute to the data pipelines that build the next generation of math models.
The trajectory from 7 percent in 2021 to 97 percent in 2025 is the kind of arc benchmark designers hope to capture: a clean baseline that reveals real progress, a long enough runway to support several years of measurement, and a graceful retirement to a more challenging successor. The same group that built MATH (Hendrycks at the Center for AI Safety) went on to build Humanity's Last Exam in 2025, partly with the explicit goal of giving the field another decade of headroom before the next saturation.
Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., & Steinhardt, J. (2021). "Measuring Mathematical Problem Solving With the MATH Dataset." NeurIPS 2021 Datasets and Benchmarks Track. arXiv:2103.03874. https://arxiv.org/abs/2103.03874
Lewkowycz, A., Andreassen, A., Dohan, D., Dyer, E., Michalewski, H., Ramasesh, V., Slone, A., Anil, C., Schlag, I., Gutman-Solo, T., Wu, Y., Neyshabur, B., Gur-Ari, G., & Misra, V. (2022). "Solving Quantitative Reasoning Problems with Language Models." NeurIPS 2022. arXiv:2206.14858. https://arxiv.org/abs/2206.14858
OpenAI (2023). "GPT-4 Technical Report." arXiv:2303.08774. https://arxiv.org/abs/2303.08774
Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q. V., & Zhou, D. (2022). "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." NeurIPS 2022. arXiv:2201.11903. https://arxiv.org/abs/2201.11903
Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., & Zhou, D. (2022). "Self-Consistency Improves Chain of Thought Reasoning in Language Models." arXiv:2203.11171. https://arxiv.org/abs/2203.11171
Lightman, H., Kosaraju, V., Burda, Y., Edwards, H., Baker, B., Lee, T., Leike, J., Schulman, J., Sutskever, I., & Cobbe, K. (2023). "Let's Verify Step by Step." ICLR 2024. arXiv:2305.20050. https://arxiv.org/abs/2305.20050
Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., & Schulman, J. (2021). "Training Verifiers to Solve Math Word Problems" (GSM8K). arXiv:2110.14168. https://arxiv.org/abs/2110.14168
Zhou, A., Wang, K., Lu, Z., Shi, W., Luo, S., Qin, Z., Lu, S., Jia, A., Song, L., Zhan, M., & Li, H. (2023). "Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification." ICLR 2024. arXiv:2308.07921. https://arxiv.org/abs/2308.07921
OpenAI (2024). "OpenAI o1 System Card." September 12, 2024. https://cdn.openai.com/o1-system-card.pdf
OpenAI (2024). "Learning to Reason with LLMs." September 12, 2024. https://openai.com/index/learning-to-reason-with-llms/
Anthropic (2024). "The Claude 3 Model Family: Opus, Sonnet, Haiku." Model card. https://assets.anthropic.com/m/61e7d27f8c8f5919/original/Claude-3-Model-Card.pdf
DeepSeek-AI (2025). "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning." arXiv:2501.12948. https://arxiv.org/abs/2501.12948
Glazer, E., Erdil, E., Besiroglu, T., Chicharro, D., Chen, E., Gunning, A., Olsson, C. F., Denain, J.-S., Ho, A., Sevilla, J., Heim, L., Schwettmann, S., Tao, T., Gowers, T., et al. (2024). "FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI." arXiv:2411.04872. https://arxiv.org/abs/2411.04872
Paster, K., Santos, M. D., Azerbayev, Z., & Ba, J. (2023). "OpenWebMath: An Open Dataset of High-Quality Mathematical Web Text." arXiv:2310.06786. https://arxiv.org/abs/2310.06786
Azerbayev, Z., Schoelkopf, H., Paster, K., Santos, M. D., McAleer, S., Jiang, A. Q., Deng, J., Biderman, S., & Welleck, S. (2024). "Llemma: An Open Language Model for Mathematics." ICLR 2024. arXiv:2310.10631. https://arxiv.org/abs/2310.10631
Huang, K., et al. (2025). "MATH-Perturb: Benchmarking LLMs' Math Reasoning Abilities against Hard Perturbations." arXiv:2502.06453. https://arxiv.org/abs/2502.06453
He, C., Luo, R., Bai, Y., Hu, S., Thai, Z. L., Shen, J., Hu, J., Han, X., Huang, Y., Zhang, Y., Liu, J., Qi, L., Liu, Z., & Sun, M. (2024). "OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems." arXiv:2402.14008. https://arxiv.org/abs/2402.14008
OpenAI (2024). "OpenAI o1-mini: advancing cost-efficient reasoning." https://openai.com/index/openai-o1-mini-advancing-cost-efficient-reasoning/
Hendrycks, D., et al. MATH dataset GitHub repository. https://github.com/hendrycks/math
OpenAI. PRM800K GitHub repository. https://github.com/openai/prm800k
HuggingFaceH4. MATH-500 dataset on Hugging Face. https://huggingface.co/datasets/HuggingFaceH4/MATH-500
Epoch AI. MATH Level 5 benchmark tracker. https://epoch.ai/benchmarks/math-level-5
EleutherAI. lm-evaluation-harness, hendrycks_math task. https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/hendrycks_math/README.md
OpenAI. simple-evals repository. https://github.com/openai/simple-evals