MATH (Mathematics Aptitude Test of Heuristics) is a dataset and benchmark of 12,500 challenging competition-level mathematics problems designed to evaluate the mathematical reasoning capabilities of large language models. Created by Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt, the benchmark was introduced in the 2021 paper "Measuring Mathematical Problem Solving With the MATH Dataset" and presented at the NeurIPS 2021 Datasets and Benchmarks track.
The problems in MATH are drawn from American high school mathematics competitions, including the AMC 10, AMC 12, and AIME, and were collected from archives on the Art of Problem Solving (AoPS) community website. Each problem comes with a detailed step-by-step solution written in LaTeX, enabling both automated evaluation and the training of models to produce mathematical derivations. At the time of its release, the benchmark was considered exceptionally difficult for AI systems: the best large transformer models scored below 7% accuracy, while a three-time International Mathematical Olympiad (IMO) gold medalist achieved 90% on a sample of 20 problems.
MATH has become one of the most widely cited benchmarks for mathematical reasoning in AI. Its influence extends across hundreds of research papers, and model developers routinely report MATH scores in technical reports and announcements. The benchmark played an important role in motivating advances in chain-of-thought prompting, process supervision, and test-time compute scaling. As of early 2026, frontier models score above 95% on the MATH test set, and the benchmark is considered largely saturated for top-tier systems.
Before MATH was introduced, benchmarks for evaluating mathematical reasoning in language models were either too easy or too narrow to provide meaningful signal. Existing datasets tended to focus on elementary arithmetic or single-step algebra problems that did not test genuine multi-step reasoning. The GSM8K benchmark, also released in 2021, addressed grade-school-level math word problems requiring 2 to 8 steps, but its problems could be solved using only basic arithmetic operations.
Hendrycks and colleagues recognized the need for a benchmark that would remain challenging even as models grew in scale. They selected competition mathematics as the problem domain because it requires a combination of conceptual understanding, heuristic reasoning, and multi-step derivation that goes well beyond rote calculation. Competition problems often demand creative approaches, the application of mathematical identities, and careful logical reasoning across many steps.
The paper's central thesis was that simply scaling up transformer models would not be sufficient to solve competition-level mathematics. The authors argued that "if a machine learning model were to achieve high accuracy on MATH, this would represent a substantial step forward for mathematical reasoning," and they provided evidence that existing scaling trends pointed toward a need for new algorithmic advances rather than larger parameter counts alone.
MATH contains 12,500 problems divided into a training set of 7,500 problems and a test set of 5,000 problems. The problems span seven subject areas and five difficulty levels.
The seven mathematical subjects covered by MATH are:
| Subject | Description |
|---|---|
| Prealgebra | Foundational topics including fractions, ratios, percentages, and basic number properties |
| Algebra | Equations, inequalities, polynomials, functions, and algebraic manipulations |
| Number Theory | Divisibility, modular arithmetic, prime numbers, and Diophantine equations |
| Counting and Probability | Combinatorics, permutations, conditional probability, and expected value |
| Geometry | Euclidean geometry, coordinate geometry, areas, volumes, and trigonometric relationships |
| Intermediate Algebra | Complex numbers, sequences and series, logarithms, and advanced polynomial theory |
| Precalculus | Trigonometric functions, vectors, matrices, and conic sections |
Each problem is assigned a difficulty rating from Level 1 (easiest) to Level 5 (hardest). These ratings reflect the relative difficulty as perceived by human competitors. Level 1 problems are typically straightforward applications of well-known formulas or concepts, while Level 5 problems require substantial ingenuity, the combination of multiple mathematical ideas, or extended multi-step reasoning.
The MATH Level 5 subset, consisting of the 1,324 Level 5 problems from the test set, is sometimes used as a standalone evaluation. Because these represent the hardest problems in the dataset, they provide better differentiation between models at the high end of the performance spectrum.
Each problem in MATH is stored as a structured record containing the following fields:
| Field | Description |
|---|---|
| problem | The problem statement, written in natural language with LaTeX mathematical notation |
| solution | A complete step-by-step solution in LaTeX |
| level | The difficulty level (1 through 5) |
| type | The subject category (one of the seven subjects listed above) |
Some problems include Asymptote code for generating geometric figures. Final answers are enclosed in \boxed{} delimiters, which enables automated evaluation by parsing the content inside the box.
A typical Level 1 problem from the Counting and Probability category:
A board game spinner is divided into three parts labeled A, B and C. The probability of the spinner landing on A is 1/3 and the probability of the spinner landing on B is 5/12. What is the probability of the spinner landing on C? Express your answer as a common fraction.
The answer is \boxed{\frac{1}{4}}.
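The arithmetic behind this example can be verified with exact rational arithmetic: the three probabilities of a spinner partition must sum to 1, so P(C) = 1 − 1/3 − 5/12.

```python
from fractions import Fraction

p_a = Fraction(1, 3)
p_b = Fraction(5, 12)
p_c = 1 - p_a - p_b  # the three outcomes partition the spinner, so they sum to 1

print(p_c)  # 1/4
```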
The problems were sourced from American mathematics competitions, primarily from archives hosted on the Art of Problem Solving (AoPS) website. The competition sources include the AMC 10 (American Mathematics Competition for students in 10th grade and below), AMC 12 (for students in 12th grade and below), and AIME (American Invitational Mathematics Examination, a more selective follow-up to the AMC). The problems span multiple decades of competition history.
MATH uses exact-match evaluation on the final answer. The system extracts the content inside the \boxed{} delimiter from a model's generated solution and compares it to the ground truth answer.
To handle mathematically equivalent representations, the evaluation pipeline applies several normalization rules:
Common normalizations include stripping extraneous whitespace, removing \left and \right delimiters, dropping units and degree symbols, and rewriting fractions into canonical \frac{x}{y} format.

This normalization approach allows the evaluation to be fully automated while accounting for the many different ways a correct mathematical answer can be expressed in LaTeX. The grading logic was later reused and refined in subsequent work, including OpenAI's PRM800K project.
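A simplified sketch of this style of normalization (the actual grader applies many more LaTeX rewrites than shown here) might look like:

```python
def normalize_answer(ans: str) -> str:
    """Apply a few equivalence-preserving rewrites before exact match.

    This is an illustrative subset only; the real MATH grading code
    handles many more LaTeX variants.
    """
    ans = ans.strip()
    # \left( ... \right) and ( ... ) denote the same grouping
    ans = ans.replace(r"\left", "").replace(r"\right", "")
    # \tfrac and \dfrac render differently but mean the same as \frac
    ans = ans.replace(r"\tfrac", r"\frac").replace(r"\dfrac", r"\frac")
    # interior whitespace is not significant: \frac{1}{4} vs \frac{1}{ 4}
    ans = ans.replace(" ", "")
    return ans

def is_equiv(a: str, b: str) -> bool:
    """Exact match after normalization."""
    return normalize_answer(a) == normalize_answer(b)

print(is_equiv(r"\dfrac{1}{4}", r"\frac{1}{ 4}"))  # True
```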
The original 2021 paper reported results for several model configurations on the MATH test set. Performance was remarkably low across the board.
| Model | Configuration | MATH Accuracy |
|---|---|---|
| GPT-2 0.1B | Pretrained on AMPS, fine-tuned on MATH | 5.4% |
| GPT-2 0.3B | Pretrained on AMPS, fine-tuned on MATH | 6.2% |
| GPT-2 0.7B | Pretrained on AMPS, fine-tuned on MATH | 6.4% |
| GPT-2 1.5B | Pretrained on AMPS, fine-tuned on MATH | 6.9% |
| GPT-2 1.5B | Fine-tuned on MATH only (no AMPS) | 5.5% |
| GPT-3 13B | Few-shot | 3.0% |
| GPT-3 13B | Fine-tuned | 5.6% |
| GPT-3 175B | Few-shot | 5.2% |
| BART-Large 0.4B | Fine-tuned | 4.9% |
A notable finding was that a 0.1B parameter model pretrained on AMPS achieved performance comparable to a fine-tuned 13B parameter model, representing roughly a 130-fold improvement in parameter efficiency through mathematical pretraining.
The paper reported per-subject accuracy for the GPT-2 1.5B model pretrained on AMPS and fine-tuned on MATH:
| Subject | GPT-2 1.5B Accuracy |
|---|---|
| Prealgebra | 8.3% |
| Algebra | 6.2% |
| Number Theory | 4.8% |
| Counting and Probability | 5.4% |
| Geometry | 8.7% |
| Intermediate Algebra | 6.1% |
| Precalculus | 8.8% |
| Average | 6.9% |
To provide a human baseline, the authors had participants of varying mathematical backgrounds attempt a sample of 20 test problems:
| Participant | Score (out of 20) | Accuracy |
|---|---|---|
| Non-mathematics enthusiast | 8 | 40% |
| Ambivalent participant | 13 | 65% |
| Mathematics enthusiast (1) | 14 | 70% |
| Mathematics enthusiast (2) | 15 | 75% |
| USAMO participant | 18 | 90% |
| Three-time IMO gold medalist | 18 | 90% |
The authors highlighted the enormous gap between the best model (6.9%) and the most skilled human (90%), and projected based on scaling trends that approximately 10^35 parameters would be needed to reach 40% accuracy through scaling alone. This projection underscored their argument that new algorithmic breakthroughs, not just larger models, were necessary.
Alongside the MATH dataset, Hendrycks et al. released AMPS (Auxiliary Mathematics Problems and Solutions), a large pretraining corpus designed to teach models the fundamentals of mathematics. AMPS totals approximately 23 GB and consists of two components.
AMPS includes over 100,000 problems with step-by-step solutions drawn from Khan Academy exercises. These cover topics ranging from basic addition to multivariable calculus, organized across 693 exercise types. The problems and solutions are formatted in LaTeX.
The second component contains approximately 5 million problems generated using Wolfram Mathematica scripts. The authors designed 100 hand-crafted modules covering topics such as conic sections, div/grad/curl, KL divergence, eigenvalues, polyhedra, and Diophantine equations. Of these 100 modules, 37 include full step-by-step solutions.
Pretraining on AMPS before fine-tuning on MATH improved accuracy from 5.5% to 6.9% for the GPT-2 1.5B model, a 25% relative improvement. The authors found that AMPS pretraining provided roughly the same benefit as increasing model size by a factor of 15, demonstrating that domain-specific pretraining data could substitute for substantial parameter scaling.
MATH-500 is a widely used 500-problem subset of the MATH test set, introduced by OpenAI in the 2023 paper "Let's Verify Step by Step" by Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe.
The MATH-500 subset was created as part of OpenAI's research on process supervision for mathematical reasoning. To expand the training data available for their process reward model (PRM), the researchers incorporated 4,500 problems from MATH's original test split into their training set. The remaining 500 problems were selected uniformly at random to serve as a held-out evaluation set. The authors stated that they "believe [these 500 test problems] are representative of the test set as a whole."
The "Let's Verify Step by Step" paper also released PRM800K, a dataset of approximately 800,000 step-level human correctness labels applied to model-generated solutions for MATH problems. Human labelers evaluated each step of a solution as correct, incorrect, or neutral, enabling the training of process reward models that provide fine-grained feedback on mathematical reasoning. Using a process-supervised reward model, the researchers achieved 78.2% accuracy on the MATH-500 subset.
MATH-500 was subsequently adopted by many organizations as a standard evaluation benchmark. It spans six of the seven MATH subject areas: Algebra, Counting and Probability, Geometry, Intermediate Algebra, Number Theory, and Precalculus. The smaller size of 500 problems makes it less expensive and faster to evaluate than the full 5,000-problem test set, while still providing a representative sample. A version of the subset is hosted on Hugging Face by the HuggingFaceH4 team.
Models are typically evaluated on MATH-500 by prompting them to present final answers in \boxed{} LaTeX format, with evaluation logic adapted from the PRM800K grading framework. Most evaluations use a temperature of 0, except for reasoning models with specific temperature requirements.
As of 2025, most frontier models score above 95% on MATH-500, and many exceed 97%. Due to this saturation, some evaluation platforms no longer run MATH-500 on new model releases. The benchmark remains useful for evaluating smaller or open-weight models where there is still meaningful variation in performance.
Performance on the MATH benchmark improved at a remarkable pace between 2021 and 2025. The following tables summarize notable scores across different models and evaluation periods.
| Model | Organization | Year | MATH Accuracy | Method |
|---|---|---|---|---|
| GPT-2 1.5B | OpenAI | 2021 | 6.9% | AMPS pretrain + fine-tune |
| GPT-3 175B | OpenAI | 2021 | 5.2% | Few-shot |
| text-davinci-002 | OpenAI | 2022 | 8.5% | 4-shot CoT |
| text-davinci-003 | OpenAI | 2022 | 15.6% | 4-shot CoT |
| Minerva 540B | Google | 2022 | 33.6% | Pass@1 |
| Minerva 540B | Google | 2022 | 50.3% | Majority vote (k=64) |
| PaLM 2-L | Google | 2023 | 34.3% | 4-shot CoT |
| gpt-3.5-turbo-0301 | OpenAI | 2023 | 33.4% | 4-shot CoT |
| gpt-4-0314 | OpenAI | 2023 | 38.6% | 4-shot CoT |
| GPT-4 (Code Interpreter) | OpenAI | 2023 | 69.7% | Zero-shot with code execution |
| Claude 3 Opus | Anthropic | 2024 | 60.1% | Zero-shot CoT |
| Gemini 1.0 Ultra | Google | 2024 | 53.2% | Zero-shot CoT |
| GPT-4o | OpenAI | 2024 | 76.6% | Zero-shot CoT |
| Claude 3.5 Sonnet | Anthropic | 2024 | 71.1% | Zero-shot CoT |
| o1 | OpenAI | 2024 | 94.8% | Internal chain-of-thought |
| Model | Organization | Year | MATH-500 Accuracy |
|---|---|---|---|
| o1-mini | OpenAI | 2024 | 90.0% |
| DeepSeek-V3 | DeepSeek | 2024 | 90.2% |
| QwQ-32B | Alibaba | 2024 | 90.6% |
| DeepSeek-R1-Distill-Qwen-7B | DeepSeek | 2025 | 92.8% |
| DeepSeek-V3-0324 | DeepSeek | 2025 | 94.0% |
| DeepSeek-R1-Distill-Qwen-32B | DeepSeek | 2025 | 94.3% |
| DeepSeek-R1-Distill-Llama-70B | DeepSeek | 2025 | 94.5% |
| Claude 3.7 Sonnet | Anthropic | 2025 | 96.2% |
| Kimi K2 Instruct | Moonshot AI | 2025 | 97.4% |
| DeepSeek-R1 | DeepSeek | 2025 | 97.3% |
Several patterns emerge from these results. First, performance jumped dramatically between 2021 and 2023 as techniques like chain-of-thought prompting, mathematical pretraining, and code-augmented reasoning were developed. Second, the introduction of reasoning-focused models in late 2024 (such as OpenAI's o1) pushed scores into the 90s on the full test set. Third, by early 2025, multiple models from different organizations converged above 95% on MATH-500, signaling benchmark saturation.
GSM8K and MATH are both mathematical reasoning benchmarks released in 2021, but they differ substantially in difficulty and scope.
| Feature | GSM8K | MATH |
|---|---|---|
| Creator | Karl Cobbe et al. (OpenAI) | Dan Hendrycks et al. (UC Berkeley) |
| Total problems | 8,792 | 12,500 |
| Difficulty level | Grade school | Competition (high school) |
| Required operations | Basic arithmetic (+, −, ×, ÷) | Algebra, geometry, number theory, calculus concepts |
| Steps per problem | 2 to 8 | Varies widely; many require 10+ steps |
| Solution format | Calculator annotations with #### delimiter | Full LaTeX derivations with \boxed{} delimiter |
| Saturation point | ~95% (reached by mid-2024) | ~95% on MATH-500 (reached by early 2025) |
GSM8K problems can generally be solved through straightforward sequential application of arithmetic operations. MATH problems, by contrast, often require recognizing which mathematical technique to apply, combining ideas from different areas, and carrying out multi-step symbolic manipulations. A model that scores well on GSM8K may still struggle with MATH, as the latter demands a deeper level of mathematical reasoning.
Because GSM8K saturated earlier, many organizations stopped prominently reporting GSM8K scores for their latest models by late 2024, shifting their focus to MATH and more challenging benchmarks. MATH provided better differentiation between frontier models during the period when most models were already scoring above 95% on GSM8K.
The American Invitational Mathematics Examination (AIME) is a selective 15-question, 3-hour mathematics competition administered by the Mathematical Association of America. Students must score in the top 2.5% to 5% on the AMC 10 or AMC 12 to qualify for the AIME. Each AIME answer is an integer between 000 and 999.
MATH and AIME are related in two ways. First, some of the problems in the MATH dataset are drawn from past AIME competitions (along with AMC 10 and AMC 12 problems). AIME problems in MATH tend to appear at difficulty Levels 4 and 5, as they are among the more challenging competition problems.
Second, AIME has increasingly been used as a separate benchmark for evaluating AI systems, particularly as MATH has become saturated. Because new AIME problems are released each year, they offer a contamination-resistant evaluation. Models trained before a given year's competition cannot have seen that year's problems in their training data. The 2024 and 2025 AIME problems (30 problems per year, combining AIME I and AIME II) have been widely adopted as benchmarks for frontier reasoning models.
Frontier models have achieved strong performance on AIME evaluations. For example, OpenAI's o1 model placed among the top 500 students nationally on the 2024 AIME. By 2025, several frontier models scored above 80% on AIME problems, with some exceeding 90% when using extended reasoning or tool access.
As both MATH and AIME have approached saturation for the strongest models, the research community has turned to even harder evaluations such as FrontierMath, which contains research-level mathematics problems where current frontier models solve fewer than 2% of problems.
As with many widely used benchmarks, the MATH dataset has faced scrutiny over the possibility of data contamination, where test problems (or very similar problems) leak into model training data, inflating reported scores.
The MATH dataset's problems were collected from publicly available competition archives on the Art of Problem Solving website. These problems, along with their solutions, have been freely accessible online for years, making it likely that they appear in the large web crawls used to pretrain many language models. This creates a risk that models may have memorized specific problems or solution patterns from their pretraining data, rather than demonstrating genuine mathematical reasoning.
Research has identified measurable contamination in certain models. A study using the LLM Decontaminator tool found 79 paraphrased samples, or approximately 1.58% of the MATH test set, present in some training corpora. While this percentage is relatively small, even partial exposure to test problems can affect benchmark scores, particularly for models that are already performing near the top of the scale.
The contamination concern is not unique to MATH. Parallel research on GSM8K found accuracy drops of up to 13% when models were evaluated on GSM1K, a contamination-free mirror dataset designed to match GSM8K's style and difficulty. While no equivalent "MATH1K" contamination-free mirror has been widely adopted, the GSM8K findings suggest that similar effects could influence MATH scores for some models.
To address contamination and test genuine reasoning, researchers have developed MATH-Perturb, a benchmark that applies systematic perturbations to MATH problems. By modifying the numerical values, conditions, or constraints in existing problems, MATH-Perturb creates novel variations that test whether models can apply mathematical reasoning to modified versions of familiar problem structures rather than relying on memorized solutions.
In January 2025, the Art of Problem Solving (AoPS) filed a DMCA takedown notice against the MATH dataset hosted on Hugging Face. AoPS argued that the competition problems in the dataset had been collected from their website without authorization and constituted copyright infringement. As a result, the original hendrycks/competition_math dataset on Hugging Face was taken down. Alternative copies and mirrors of the dataset continue to exist elsewhere, and the benchmark remains in widespread use for model evaluation.
The MATH benchmark has had a substantial impact on the field of AI research, influencing both the development of new techniques and the broader conversation about what language models can and cannot do.
The low initial performance of language models on MATH helped motivate research into eliciting better reasoning through prompting techniques. Chain-of-thought prompting, introduced by Jason Wei and colleagues in 2022, demonstrated that providing step-by-step reasoning examples in the prompt could significantly improve performance on mathematical tasks. The MATH benchmark, alongside GSM8K, served as a primary testing ground for these techniques.
The step-by-step solutions in MATH enabled research on process supervision, where models are trained to evaluate the correctness of each individual reasoning step rather than just the final answer. OpenAI's "Let's Verify Step by Step" paper used MATH extensively to demonstrate that process reward models (PRMs) outperform outcome reward models (ORMs) for mathematical problem solving. This line of research has become central to how reasoning-focused models like OpenAI's o1 and o3 series are trained.
MATH has been a key benchmark for research on test-time compute scaling, the strategy of generating multiple candidate solutions and selecting the best one rather than relying on a single model output. Minerva's use of majority voting (which improved its MATH score from 33.6% to 50.3%) was an early demonstration of this principle. The concept has since evolved into sophisticated inference-time strategies used by modern reasoning models.
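Majority voting of the kind Minerva used can be sketched simply: sample k candidate solutions, extract each final answer, and return the most common one. In the sketch below, the list of sampled answers stands in for actual model calls.

```python
from collections import Counter

def majority_vote(candidate_answers: list[str]) -> str:
    """Return the most frequent final answer among k sampled solutions.

    Mathematically equivalent answers should be normalized before
    counting; otherwise the vote splits across surface forms.
    """
    counts = Counter(candidate_answers)
    answer, _ = counts.most_common(1)[0]
    return answer

# Hypothetical distribution over 64 samples: the correct answer need
# only be the *most common* output, not the majority of all outputs.
samples = ["1/4"] * 30 + ["1/3"] * 20 + ["5/12"] * 14
print(majority_vote(samples))  # 1/4
```

This is why majority voting lifted Minerva's score well above its pass@1 accuracy: a model that produces the right answer more often than any single wrong answer benefits from aggregation even when it is right less than half the time.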
Alongside MMLU and HumanEval, MATH became part of the standard set of benchmarks reported in virtually every major language model release from 2022 through 2025. Its difficulty level provided meaningful differentiation between models during a period when simpler benchmarks like GSM8K had become saturated.
The saturation of MATH by frontier models has prompted the development of harder mathematical reasoning benchmarks:
| Benchmark | Year | Description |
|---|---|---|
| MATH | 2021 | 12,500 competition-level problems; now largely saturated |
| AIME (as AI benchmark) | 2024 | 30 olympiad-level problems per year; fresh problems annually |
| Omni-MATH | 2024 | Olympiad-level problems spanning multiple mathematical domains |
| FrontierMath | 2024 | Research-level mathematics; fewer than 2% solved by frontier models |
| MATH-Perturb | 2025 | Perturbed versions of MATH problems testing robustness |
| U-MATH | 2025 | University-level mathematics problems |
This progression reflects the rapid pace of improvement in AI mathematical reasoning. Benchmarks that seemed insurmountable when introduced are often approached or saturated within two to three years, driving the community to develop increasingly challenging evaluations.
The MATH dataset was released under the MIT License. The dataset was originally hosted on GitHub at github.com/hendrycks/math and on Hugging Face at hendrycks/competition_math. Following the DMCA takedown, the Hugging Face version was removed, though the GitHub repository and alternative mirrors remain available. An alternative version of the dataset is hosted at nlile/hendrycks-MATH-benchmark on Hugging Face.
MATH is supported by several major evaluation frameworks, including EleutherAI's Language Model Evaluation Harness (lm-evaluation-harness), which provides standardized implementations for running the benchmark. OpenAI's simple-evals repository also includes a MATH evaluation. These frameworks handle answer extraction, normalization, and scoring in a reproducible manner.
| Property | Value |
|---|---|
| Total problems | 12,500 |
| Training set | 7,500 |
| Test set | 5,000 |
| Subject areas | 7 |
| Difficulty levels | 5 |
| MATH-500 subset | 500 (randomly sampled from test set) |
| Level 5 subset (test) | 1,324 |
| Problem text length | 16 to 4,310 characters |
| Solution text length | 26 to 6,770 characters |
| Language | English |
| License | MIT |
| Download size | ~5.33 MB |