MATH Level 5

Overview
Full name	MATH dataset, Level 5 difficulty subset
Abbreviation	MATH L5
Description	The hardest difficulty tier of the MATH dataset, comprising competition mathematics problems labeled with the maximum Art of Problem Solving difficulty rating
Release date	March 2021 (parent dataset)
Parent dataset	MATH (12,500 problems)
Level 5 test problems	1,324
Authors	Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, Jacob Steinhardt
Organization	UC Berkeley, University of Chicago, OpenAI (at time of writing)
Venue	NeurIPS 2021 (Datasets and Benchmarks Track)
Technical Details
Type	Mathematical reasoning, problem solving
Modality	Text (LaTeX)
Task format	Open-ended written solutions with exact-match final answers
Evaluation metric	Exact match accuracy on the final boxed answer
Domains	Algebra, counting and probability, geometry, intermediate algebra, number theory, prealgebra, precalculus
Languages	English
Difficulty rubric	AoPS difficulty scale, where Level 5 corresponds to AIME and harder competition problems
Resources
Paper	https://arxiv.org/abs/2103.03874
GitHub	https://github.com/hendrycks/math
Dataset	https://huggingface.co/datasets/hendrycks/competition_math
License	MIT

MATH Level 5 is the hardest difficulty tier of the MATH dataset, introduced by Dan Hendrycks and colleagues in the 2021 paper Measuring Mathematical Problem Solving With the MATH Dataset. The full MATH corpus contains 12,500 problems drawn from United States high school mathematics competitions; each problem is annotated with a difficulty rating from 1 to 5 following the Art of Problem Solving (AoPS) competition rating convention. The 1,324 Level 5 problems in the 5,000 problem test set sit at the top of that scale and are drawn primarily from the American Invitational Mathematics Examination (AIME) and the hardest AMC problems. For roughly four years they served as a frontier evaluation for large language models and as a primary scoreboard for progress in mathematical reasoning, until reasoning-trained models pushed accuracy on full MATH and its 500 problem subset past 95 percent in late 2024 and early 2025.

Level 5 has remained more useful than overall MATH accuracy because it is the slice that saturates most slowly. GPT-3 davinci scored below 7 percent overall on the original MATH benchmark and roughly 4 percent on Level 5 problems, and non reasoning models remained well below 70 percent on Level 5 well into 2024. Improvements driven by chain of thought prompting, mathematics pretraining corpora such as AMPS, process reward modeling, and reinforcement learning from verifier feedback can be traced more cleanly through Level 5 numbers than through the easier tiers, which saturate first.

Origins and motivation

The MATH dataset was created in late 2020 and early 2021 by researchers at UC Berkeley, the University of Chicago, and OpenAI. The motivating question, stated in the abstract of the paper, was whether the scaling trends observed for natural language benchmarks would extend to genuine mathematical problem solving, or whether mathematical reasoning required qualitatively new techniques. The authors collected 12,500 problems from public sources, primarily problem archives associated with American high school competitions such as the AMC 8, AMC 10, AMC 12, and AIME, along with state and regional contests. Each problem comes with a step by step written solution from the competition community.

The paper concluded that simply increasing parameter counts was unlikely to produce strong mathematical reasoning, and that meaningful progress would require either much larger compute budgets than were practical at the time or qualitatively new algorithms. That conclusion turned out to be partly right and partly wrong: dedicated mathematical pretraining and reasoning algorithms did push accuracy up sharply, but scale did help, especially when combined with reinforcement learning on verifiable rewards.

The AoPS difficulty scale

Difficulty in the MATH dataset is annotated using the rating scheme used by the Art of Problem Solving community on the AoPS Wiki. The authors describe the convention directly: a subject's easiest problems for humans are labeled Level 1 and the hardest are labeled Level 5. Concretely, the first several problems of an AMC 8 exam are usually Level 1 problems, the middle problems of AMC 10 and AMC 12 contests typically fall in Levels 2 and 3, the last AMC problems and the first AIME problems land in Level 4, and the remaining AIME problems, especially the harder ones, are Level 5. There is no formal calibration across subjects; ratings reflect the AoPS community's collective judgment about what constitutes a hard problem in each topic.

Dataset composition

Attribute	Value
Total problems	12,500
Training set	7,500
Test set	5,000
Subjects	7
Difficulty levels	5 (Level 1 through Level 5)
Test set Level 5 problems	1,324
Solution format	Full step by step LaTeX writeup with final boxed answer
License	MIT

The seven subject categories are prealgebra, algebra, intermediate algebra, counting and probability, number theory, geometry, and precalculus. Each problem carries a subject tag and a difficulty tag, allowing researchers to slice performance along both axes. The Level 5 test slice is enriched in algebra, intermediate algebra, counting and probability, and precalculus relative to prealgebra and geometry, because those topics are over represented in AIME style problems.
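A minimal sketch of slicing along both axes follows. The records are hypothetical, and the "level" and "type" field names are an assumption about how the released per-problem files store the difficulty and subject tags; adjust them if a given copy of the dataset uses different keys.

```python
from collections import Counter

# Hypothetical records; the "level" and "type" keys are assumed field names.
problems = [
    {"problem": "p1", "level": "Level 5", "type": "Intermediate Algebra"},
    {"problem": "p2", "level": "Level 2", "type": "Prealgebra"},
    {"problem": "p3", "level": "Level 5", "type": "Number Theory"},
    {"problem": "p4", "level": "Level 5", "type": "Intermediate Algebra"},
]


def select(records, level=None, subject=None):
    """Return the records matching the given difficulty and/or subject tag."""
    return [
        r for r in records
        if (level is None or r["level"] == level)
        and (subject is None or r["type"] == subject)
    ]


# Restrict to the hardest tier, then count how many problems each subject contributes.
level5 = select(problems, level="Level 5")
by_subject = Counter(r["type"] for r in level5)
```

Applied to the real test split, the same filter would be expected to return the 1,324 Level 5 problems, and the counter would show which subjects dominate the slice.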

Why Level 5 became a frontier metric

When the MATH paper was published, the gap between overall scores and Level 5 scores already looked substantial. The GPT-2 1.5B model reported in the original paper achieved about 6.9 percent overall accuracy when pretrained on the AMPS auxiliary corpus and fine tuned on MATH, but only around 4 percent on Level 5 problems, with Level 1 problems closer to 15 percent. As large language models scaled and as new training techniques such as chain of thought prompting became standard, the easier levels saturated first. By 2023 frontier models were over 90 percent on Level 1 and Level 2 problems while still struggling with Level 5.

This pattern made Level 5 a useful diagnostic. A model whose overall MATH score climbed from 50 percent to 70 percent could be doing so by sweeping up easier problems while leaving the AIME class problems essentially untouched, or by genuinely improving on the hardest problems. The per level breakdown made the difference visible.
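The point can be illustrated with a toy calculation. The sketch below uses made-up results for two hypothetical models on ten problems (five Level 1, five Level 5); the function names are illustrative and not part of any published evaluation harness.

```python
from collections import defaultdict


def accuracy_by_level(results):
    """results: iterable of (level, is_correct) pairs -> {level: accuracy}."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for level, ok in results:
        totals[level] += 1
        correct[level] += int(ok)
    return {level: correct[level] / totals[level] for level in sorted(totals)}


def overall_accuracy(results):
    """Fraction of all problems answered correctly, ignoring level."""
    results = list(results)
    return sum(int(ok) for _, ok in results) / len(results)


# Hypothetical data: model A sweeps the easy tier, model B is stronger on Level 5.
model_a = [(1, True)] * 5 + [(5, True)] * 2 + [(5, False)] * 3
model_b = [(1, True)] * 3 + [(1, False)] * 2 + [(5, True)] * 4 + [(5, False)] * 1
```

Both toy models score 0.7 overall, but the breakdown gives 0.4 on Level 5 for model A and 0.8 for model B, which is the distinction an aggregate score cannot show.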

Epoch AI formalized Level 5 as a standalone evaluation in 2024 specifically because it had not yet saturated while overall MATH and its 500 problem subset MATH-500 were trending toward 99 percent. Their hosted leaderboard runs the 1,324 Level 5 test problems with multiple equivalence scorers, including normalized string match, SymPy symbolic equivalence, and a model graded equivalence check. The site treats Level 5 as a stricter, slower saturating signal than full MATH.

Subject composition of Level 5

The seven subjects do not contribute equally to the Level 5 slice. Algebra, intermediate algebra, and precalculus dominate, reflecting the topical balance of AIME style problems. Geometry and prealgebra contribute relatively fewer Level 5 problems. The table below characterizes the kinds of problems typical of each subject at the top difficulty.

Subject	Typical Level 5 question style
Algebra	Functional equations, systems with constraints, inequalities with extremal conditions
Counting and probability	Multi step combinatorial identities, casework heavy probability with constraints
Geometry	Synthetic geometry with multiple auxiliary constructions, coordinate or trigonometric setups
Intermediate algebra	Polynomial roots and Vieta's formulas, complex numbers, sequences and series, advanced inequalities
Number theory	Modular arithmetic with multi step casework, Diophantine equations, divisibility puzzles
Prealgebra	Sparse at Level 5, but includes unusually long arithmetic puzzles when present
Precalculus	Trigonometric identities, sums involving roots of unity, telescoping identities

Intermediate algebra in particular has historically been the hardest subject for models, with multi step manipulations of polynomial roots and complex valued sums being the most common failure mode.

Evaluation methodology

The MATH protocol is exact match on the final answer, which is enclosed in a \boxed{} command in the LaTeX solution. Models are expected to produce a full written derivation followed by the boxed final answer, and only the contents of the final box are checked. This avoids the problem of partial credit while preserving the requirement that the model produce a written argument that supports the answer.
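Extracting the final box needs brace matching, since answers such as fractions contain nested braces. The following is a minimal sketch under that assumption; the function name is hypothetical and this is not the reference implementation.

```python
def last_boxed(text):
    """Return the contents of the last \\boxed{...} in text, or None if absent."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1  # we are already inside the opening brace of \boxed{
    out = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out)
        out.append(ch)
    return None  # braces never closed, so treat the answer as missing


solution = r"First guess $\boxed{3}$, but the final answer is $\boxed{\frac{1}{2}}$."
answer = last_boxed(solution)
```

Here `answer` is the string `\frac{1}{2}`: earlier boxes are ignored and the inner braces of the fraction are kept intact.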

Answer equivalence

Because mathematical answers can be expressed in many equivalent forms, the de facto standard pipelines accept multiple representations:

Method	What it checks
Normalized string match	Strips whitespace, normalizes LaTeX commands, and compares strings
SymPy equivalence	Parses both answers as symbolic expressions and tests algebraic equivalence
Model graded equivalence	An auxiliary language model verifies whether two answers are mathematically the same

The Hendrycks reference pipeline uses normalized string match with a hand written canonicalizer, while Epoch AI's MATH Level 5 implementation runs all three scorers and reports the model graded version as primary. Differences between scorers are usually under one percentage point but can matter near the top of the leaderboard.
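The first two scorer families can be sketched with the standard library alone. The example below is an illustration, not either pipeline's actual code: the normalizer handles only a few cosmetic LaTeX tokens, and exact rational comparison with `fractions.Fraction` stands in for a full symbolic check of the kind SymPy would provide.

```python
import re
from fractions import Fraction

_FRAC = re.compile(r"^(-?)\\frac\{(\d+)\}\{(\d+)\}$")


def normalize(ans):
    """Light canonicalization: drop spacing and purely cosmetic LaTeX."""
    s = ans.strip().rstrip(".")
    for token in ("\\left", "\\right", "$", " "):
        s = s.replace(token, "")
    return s.replace("\\dfrac", "\\frac").replace("\\tfrac", "\\frac")


def as_fraction(s):
    """Parse integers, decimals, a/b, and \\frac{a}{b}; None if not a rational."""
    m = _FRAC.match(s)
    if m:
        sign, num, den = m.groups()
        s = f"{sign}{num}/{den}"
    try:
        return Fraction(s)
    except (ValueError, ZeroDivisionError):
        return None


def equivalent(pred, gold):
    """String match after normalization, then an exact rational comparison."""
    p, g = normalize(pred), normalize(gold)
    if p == g:
        return True
    fp, fg = as_fraction(p), as_fraction(g)
    return fp is not None and fp == fg
```

With this sketch, `\dfrac{1}{2}` and `0.5` are accepted as equal, while `x+1` and `1+x` are not, which is the kind of case that motivates symbolic or model graded scorers.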

Sampling and aggregation

Many reported MATH Level 5 numbers correspond to single sample greedy or low temperature decoding (pass@1). Earlier flagship results, especially from Minerva, often used majority voting across many sampled solutions; Minerva 540B reports used self consistency with up to 64 samples. With chain of thought reasoning, sampling diversity tends to help on the hardest problems, so majority voting results are typically several points higher than pass@1 on Level 5.
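The difference between the two aggregation schemes is easy to state in code. This is a minimal sketch with hypothetical sampled answers; real pipelines would first extract and normalize each final answer before voting.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most common non-missing final answer (ties go to the first seen)."""
    counts = Counter(a for a in answers if a is not None)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Hypothetical final answers from eight sampled solutions to one problem;
# None marks a sample that produced no boxed answer.
samples = ["15", "12", "12", None, "12", "15", "7", "12"]
gold = "12"

pass_at_1_correct = samples[0] == gold        # scores only the first sample
maj_correct = majority_vote(samples) == gold  # scores the voted answer
```

In this toy case the first sample is wrong but the vote recovers the correct answer, which is why maj@k figures should not be compared directly with pass@1 figures.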

Baseline results from the original paper

The original Hendrycks et al. paper benchmarked language models available in early 2021. The headline results are summarized below.

Model	Pretraining	Overall accuracy	Approximate Level 5 accuracy
GPT-2 0.1B	Standard	~3.0%	<2%
GPT-2 0.7B	Standard	~3.7%	<3%
GPT-2 1.5B	Standard	~5.4%	<4%
GPT-2 1.5B	+ AMPS	~6.9%	~4%
GPT-3 davinci	Standard	~5%	similar to GPT-2 1.5B

The paper introduces a 23 GB auxiliary pretraining corpus called the Auxiliary Mathematics Problems and Solutions (AMPS) dataset, comprising over 100,000 Khan Academy style problems and roughly five million problems generated from Mathematica scripts. Pretraining on AMPS prior to fine tuning on MATH consistently boosted accuracy by several points but did not change the qualitative picture. The authors used these numbers to argue that scaling alone would not solve the benchmark.

The paper also reported informal human baselines. A computer science PhD student who did not particularly enjoy mathematics scored about 40 percent. A three time International Mathematical Olympiad gold medalist scored about 90 percent. These figures are not formal estimates and apply to the full MATH test set rather than to Level 5 specifically, where the human gap between the two extremes would be wider.

Leaderboard milestones

The following table tracks reported performance on MATH and on Level 5 specifically across major model releases. Numbers refer to the standard test split unless otherwise stated, and Level 5 numbers are listed where they have been published separately.

Year	Model	Overall MATH	Level 5	Notes
2021	GPT-2 1.5B + AMPS	6.9%	~4%	Hendrycks et al. baseline
2021	GPT-3 davinci	~5%	~3-4%	Few-shot prompting
2022	Minerva 8B (maj@k)	25.4%	not isolated	Math-trained PaLM
2022	Minerva 62B (maj@k)	43.4%	not isolated	Math-trained PaLM
2022	Minerva 540B (maj@k)	50.3%	not isolated	First model to exceed 50% overall
2023	GPT-4 (CoT)	~42%	not officially isolated	GPT-4 technical report
2023	GPT-4 + code interpreter	~70%	not officially isolated	the-decoder reporting
2023	Llemma 34B (maj@256)	~43%	not isolated	Open base model, approaching Minerva 62B
2024	DeepSeekMath 7B RL	51.7%	not isolated	GRPO reinforcement learning
2024	Claude 3 Opus	~60%	not officially isolated	Anthropic Claude 3 model card
2024	OpenAI o1 (CoT)	94.8% (work-in-progress)	not isolated	First major reasoning model
2024	OpenAI o3	not formally reported on MATH	not isolated	Reported AIME 2024 96.7%
2025	DeepSeek R1	97.3% on MATH-500	not officially isolated	RL trained reasoning

Reported overall scores hide most of the Level 5 story, because by 2024 the Level 5 contribution to remaining error became dominant. A model at 94 percent overall on MATH typically gets the Level 1, 2, and 3 problems essentially correct; almost all the missed problems are in Level 4 and Level 5, and most of those are in Level 5.
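A short calculation shows how wide the range of possible Level 5 scores is when only the overall figure is reported. The sketch below uses the test set sizes given in this article and makes no assumption about how errors are actually distributed; it only computes the two extremes.

```python
TEST_SIZE = 5000    # MATH test problems
LEVEL5_SIZE = 1324  # Level 5 problems within the test set


def level5_accuracy_range(overall_accuracy):
    """Bounds on Level 5 accuracy implied by an overall accuracy alone."""
    errors = round((1 - overall_accuracy) * TEST_SIZE)
    # Worst case for Level 5: every error falls in Level 5.
    low = (LEVEL5_SIZE - min(errors, LEVEL5_SIZE)) / LEVEL5_SIZE
    # Best case for Level 5: errors fill the other levels first.
    other = TEST_SIZE - LEVEL5_SIZE
    high = (LEVEL5_SIZE - max(0, errors - other)) / LEVEL5_SIZE
    return low, high


low, high = level5_accuracy_range(0.94)
```

At 94 percent overall there are 300 errors in total, so Level 5 accuracy could be anywhere between 1024/1324 (about 77.3 percent) and 100 percent. The pattern described above, with most errors in Level 5, places real models toward the lower end of that range.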

Minerva and the role of mathematical pretraining

Minerva was the first major leap on MATH, jumping from the GPT-3 baseline of around 5 percent to 50.3 percent with the 540B variant in 2022. Built on PaLM and further trained on a 118 GB corpus of scientific papers from arXiv and other mathematical content, Minerva used chain of thought prompting and majority voting with up to 64 samples. The paper's error analysis attributed roughly half of remaining errors to calculation mistakes and half to genuine reasoning failures, and noted that performance dropped sharply on intermediate algebra and other subjects that are over represented at Level 5.

Minerva's contribution was twofold. First, it demonstrated that domain specific pretraining could close most of the gap to the top of the published MATH leaderboard at the time. Second, it gave the field a concrete error decomposition between arithmetic execution and reasoning, which motivated later work on tool use (calculators, Python interpreters), self consistency, and process supervision.

GPT-4 and the contamination question

The GPT-4 technical report from March 2023 reported MATH accuracy in the low forties without external tools and around 70 percent with code interpreter augmentation. OpenAI explicitly acknowledged in the GPT-4 system card and a Hendrycks led decontamination note that the training data included parts of the MATH training set; reported test set numbers are therefore on a held out test split, but the standard 5,000 test set remains potentially within the broader pretraining distribution for any frontier model trained after the dataset's public release. This contamination concern is one reason later evaluations turned to held out subsets, fresh competition problems (such as new AIME years), and adversarial perturbations of MATH problems.

Reasoning models and saturation

OpenAI o1, introduced in September 2024 with a system card on September 12, was the first widely benchmarked model to push MATH past 90 percent without external tools. The o1 series reports a 94.8 percent score on MATH for a work in progress checkpoint, with o1-preview at 85.5 percent. The system card frames MATH as effectively saturated at this point. o3, announced in December 2024, reports 96.7 percent on AIME 2024, the human competition from which most Level 5 problems are drawn, indicating that the AIME class is no longer a strict frontier.

DeepSeek R1, released in January 2025, reports 97.3 percent on the 500 problem MATH-500 subset and is the first widely available open weight model in this regime. R1's reinforcement learning pipeline, using outcome rewards on math problems and the Group Relative Policy Optimization algorithm originally introduced in DeepSeekMath, is the canonical example of pushing reasoning by rewarding successful problem solving rather than imitating human written solutions.
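The group-relative part of that algorithm can be sketched in a few lines. This is a simplified illustration of the advantage computation only, assuming binary outcome rewards; it omits the clipped policy objective and the KL penalty, and the function name, the `eps` parameter, and the use of the population standard deviation are choices made here rather than details taken from the papers.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each sampled solution's reward against its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps (an assumption of this sketch) avoids division by zero when all rewards agree.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled solutions to one problem; 1.0 means the final answer was verified correct.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct samples receive a positive advantage and incorrect ones a negative advantage, with no separate value network needed; a group in which every sample gets the same reward yields all-zero advantages and so contributes no learning signal.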

Despite these gains, the Level 5 slice has resisted full saturation. Even when overall MATH scores exceed 95 percent, the remaining errors are concentrated in Level 5, and frontier model documentation typically reports Level 5 accuracy several points below the overall figure. Public reporting on this slice has become less common as MATH itself has been displaced by harder benchmarks; researchers tracking long horizon progress have moved much of their attention to FrontierMath, HMMT, the unseen AIME, and the IMO grand challenge.

Comparison with related mathematics benchmarks

MATH Level 5 sits in a specific niche between curriculum aligned datasets such as GSM8K and research level mathematics benchmarks such as FrontierMath. The table below positions it relative to other widely used mathematical evaluations.

Benchmark	Difficulty level	Problem source	Typical frontier accuracy (2025)
GSM8K	Grade school word problems	OpenAI commissioned writers	>95%
MATH (full)	Mixed levels 1-5	US high school competitions	>95%
MATH-500	Random 500 problem subset of MATH	Subset of MATH test	>95%
MATH Level 5	AIME and hardest AMC problems	Subset of MATH test, hardest tier	low to high 90s, lags overall MATH
AMC 10 / AMC 12	Late high school	Annual competitions	very high for reasoning models
AIME	Top US high school competition	Annual AIME exams	over 90% for reasoning models
USAMO	US Mathematical Olympiad	Proof based	early stage, partial credit graded
IMO	International Olympiad	Proof based	very early stage, special purpose systems
FrontierMath	Research mathematics	Custom commissioned	low double digits for the strongest models in 2025

The MATH benchmark and its Level 5 subset are descended from the same competition ecosystem that produces the AMC and AIME contests, but use historical problems rather than new ones. This makes contamination a structural risk: any frontier model trained on a recent crawl of the open web is likely to have seen the questions and solutions, since the AoPS community discusses these problems extensively online. The unseen AIME and other fresh competition based benchmarks were introduced specifically to address this.

The relationship between MATH Level 5 accuracy and AIME accuracy is informative but not identical. AIME 2024 contains 30 problems (15 each on AIME I and AIME II), each with an integer answer from 0 to 999. MATH Level 5 contains 1,324 problems with a similar style but a wider variety of answer formats. A model that does well on MATH Level 5 will typically do well on AIME, although AIME's fixed answer format and timed structure mean the two scores are not interchangeable.

Contamination, perturbation, and unseen evaluations

Because MATH was scraped from public competition archives, the same problems appear in many forms across the open web, including AoPS forum discussions, problem of the week sites, and educational content. Several lines of research have studied the extent to which the resulting contamination inflates reported numbers:

The MATH-Perturb benchmark applies systematic semantic perturbations to MATH problems and finds noticeably lower accuracy on perturbed versions, especially at higher difficulty.
Inference-time decontamination techniques rephrase test items and compare model performance to original; gaps tend to be larger at Level 5 than at lower levels.
HARP (a human annotated reasoning benchmark) and other newly written problem sets specifically avoid overlap with MATH's competition sources to provide cleaner held out signal.

A pragmatic alternative is to evaluate on the most recent AIME, which postdates the MATH test set and, for models trained before the contest, could not have appeared in pretraining data. As of 2025 this is the standard comparison for reasoning models. MATH Level 5 retains value as a sample large enough (1,324 problems) to yield tight confidence intervals and as a continuity bridge to historical results.
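The sample size argument can be made concrete with a normal-approximation interval. The sketch below assumes problems are independent and uses a hypothetical 90 percent accuracy for both benchmarks; the function name is illustrative.

```python
import math


def ci_half_width(accuracy, n, z=1.96):
    """Normal-approximation 95% half-width for a binomial proportion."""
    return z * math.sqrt(accuracy * (1 - accuracy) / n)


level5 = ci_half_width(0.90, 1324)  # 1,324 Level 5 problems
aime = ci_half_width(0.90, 30)      # 30 problems in one AIME year
```

Under these assumptions the interval is roughly plus or minus 1.6 points on Level 5 and roughly plus or minus 10.7 points on a single AIME year, so small differences between models are far easier to resolve on the larger set.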

Common failure modes

Error analyses across Minerva, GPT-4, OpenAI o1, and DeepSeek R1 highlight a consistent set of failure modes on Level 5 problems:

Failure type	Description	Frequency among errors
Arithmetic slips	Multi step calculations carried through correctly except for a small algebra or arithmetic error	high for pre-reasoning models, lower for o1 and R1
Casework omissions	Missing a case in combinatorics or number theory problems with multiple branches	persistent across model generations
Misreading the problem	Solving a different problem than the one stated, often by ignoring a constraint	falls with longer chains of thought
Premature claim of factorizations or identities	Asserting a polynomial factorization or trigonometric identity without verification	common in intermediate algebra and precalculus
Plausible but wrong final boxed answer	Reasoning is roughly correct but the final extraction step produces an answer that does not match the question's required form	common at Level 5 where answer formats are unusual

Reasoning trained models with long chains of thought tend to reduce the first three categories more than the last two. The remaining errors at the very top of the leaderboard are dominated by genuine reasoning gaps and idiosyncratic answer formatting issues rather than careless mistakes.

Reception and influence

The MATH dataset has been cited thousands of times since its 2021 release, and its difficulty rating scheme has been reused in derivative benchmarks. MATH-500, the 500 problem subset introduced by OpenAI's Let's Verify Step by Step paper (Lightman et al., 2023), became the standard evaluation in much of the process reward modeling literature, including the PRM800K release. Math-Shepherd, a follow up on process supervision without human annotation, similarly evaluates on MATH and reports per level breakdowns.

The role of MATH Level 5 specifically as a frontier metric has been important in several research narratives:

It served as a continuity measure across the transition from non reasoning models such as GPT-4 to reasoning models such as o1 and R1, allowing apples to apples comparisons even after the field reframed its evaluation strategy around chain of thought.
It motivated the development of process reward models and the PRM800K corpus, since process supervision was most clearly beneficial on the hardest problems.
It anchored Epoch AI's framing of the broader transition from MATH to harder benchmarks: as MATH-500 trended toward 99 percent, Level 5 retained discriminating power for another year.

Limitations

Limitation	Description
Subjective level labels	The AoPS difficulty scale is community curated and not formally calibrated across subjects
Public availability	Problems and solutions are extensively discussed online, creating contamination risk
English only	Problems and solutions are in English LaTeX, limiting cross lingual evaluation
Answer format quirks	Exact match on boxed answers can disadvantage models that produce mathematically equivalent but textually different forms; mitigations exist via symbolic equivalence scorers
Static test set	The 1,324 Level 5 problems do not update over time, so the benchmark cannot track novelty
Diminishing headroom	Reasoning models in 2025 score above 95 percent on overall MATH; remaining errors concentrate in Level 5 but the absolute headroom is narrow

These constraints have driven the field toward complementary benchmarks (FrontierMath, HMMT, unseen AIME, IMO grand challenge) and toward dynamic evaluations that incorporate new problems each year.

References

Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., Steinhardt, J. (2021). Measuring Mathematical Problem Solving With the MATH Dataset. NeurIPS 2021 Datasets and Benchmarks Track. https://arxiv.org/abs/2103.03874
Hendrycks MATH GitHub repository. https://github.com/hendrycks/math
MATH dataset card on Hugging Face. https://huggingface.co/datasets/hendrycks/competition_math
Lewkowycz, A., et al. (2022). Solving Quantitative Reasoning Problems with Language Models (Minerva). NeurIPS 2022. https://research.google/blog/minerva-solving-quantitative-reasoning-problems-with-language-models/
Lightman, H., Kosaraju, V., Burda, Y., et al. (2023). Let's Verify Step by Step. https://arxiv.org/abs/2305.20050
OpenAI prm800k repository and MATH-500 subset. https://github.com/openai/prm800k
Azerbayev, Z., et al. (2023). Llemma: An Open Language Model for Mathematics. https://arxiv.org/abs/2310.10631
Shao, Z., et al. (2024). DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. https://arxiv.org/abs/2402.03300
OpenAI. Learning to reason with LLMs (o1 announcement and system card). September 12, 2024. https://openai.com/index/learning-to-reason-with-llms/
OpenAI. Introducing OpenAI o3 and o4-mini. April 2025. https://openai.com/index/introducing-o3-and-o4-mini/
DeepSeek-AI. (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. https://github.com/deepseek-ai/DeepSeek-R1
Epoch AI. MATH Level 5 benchmark page. https://epoch.ai/benchmarks/math-level-5
OpenAI. (2023). GPT-4 Technical Report. https://arxiv.org/abs/2303.08774
MATH-Perturb. (2025). Benchmarking LLMs' Math Reasoning Abilities against Hard Perturbations. https://arxiv.org/abs/2502.06453
AoPS Wiki, Competition ratings. https://artofproblemsolving.com/wiki/index.php/AoPS_Wiki:Competition_ratings
