MATH Level 5
Last reviewed
May 16, 2026
Sources
15 citations
Review status
Source-backed
Revision
v2 · 4,170 words
| MATH Level 5 | |
|---|---|
| Overview | |
| Full name | MATH dataset, Level 5 difficulty subset |
| Abbreviation | MATH L5 |
| Description | The hardest difficulty tier of the MATH dataset, comprising competition mathematics problems labeled with the maximum Art of Problem Solving difficulty rating |
| Release date | March 2021 (parent dataset) |
| Parent dataset | MATH (12,500 problems) |
| Level 5 test problems | 1,324 |
| Authors | Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, Jacob Steinhardt |
| Organization | UC Berkeley, University of Chicago, OpenAI (at time of writing) |
| Venue | NeurIPS 2021 (Datasets and Benchmarks Track) |
| Technical Details | |
| Type | Mathematical reasoning, problem solving |
| Modality | Text (LaTeX) |
| Task format | Open-ended written solutions with exact-match final answers |
| Evaluation metric | Exact match accuracy on the final boxed answer |
| Domains | Algebra, counting and probability, geometry, intermediate algebra, number theory, prealgebra, precalculus |
| Languages | English |
| Difficulty rubric | AoPS difficulty scale, where Level 5 corresponds to AIME and harder competition problems |
| Resources | |
| Paper | https://arxiv.org/abs/2103.03874 |
| GitHub | https://github.com/hendrycks/math |
| Dataset | https://huggingface.co/datasets/hendrycks/competition_math |
| License | MIT |
MATH Level 5 is the hardest difficulty tier of the MATH dataset, introduced by Dan Hendrycks and colleagues in the 2021 paper Measuring Mathematical Problem Solving With the MATH Dataset. The full MATH corpus contains 12,500 problems drawn from United States high-school mathematics competitions; each problem is annotated with a difficulty rating from 1 to 5 following the Art of Problem Solving (AoPS) competition rating convention. The 1,324 Level 5 problems in the 5,000-problem test set sit at the top of that scale and are drawn primarily from the American Invitational Mathematics Examination (AIME) and the hardest AMC problems. For roughly four years they served as a frontier evaluation for large language models and as a primary scoreboard for progress in mathematical reasoning, until reasoning-trained models pushed accuracy on full MATH and its 500-problem subset past 95 percent in late 2024 and early 2025.
Level 5 has remained more useful than overall MATH accuracy because it is the slice that saturates most slowly. GPT-3 davinci scored below 7 percent overall on the original MATH benchmark and roughly 4 percent on Level 5 problems, and Level 5 accuracy remained well below 70 percent for non-reasoning models well into 2024. Improvements driven by chain-of-thought prompting, mathematics pretraining corpora such as AMPS, process reward modeling, and reinforcement learning from verifier feedback can be traced more cleanly through Level 5 numbers than through the easier tiers, which tend to saturate first.
The MATH dataset was created in late 2020 and early 2021 by researchers at UC Berkeley, the University of Chicago, and OpenAI. The motivating question, stated in the abstract of the paper, was whether the scaling trends observed for natural language benchmarks would extend to genuine mathematical problem solving, or whether mathematical reasoning required qualitatively new techniques. The authors collected 12,500 problems from public sources, primarily problem archives associated with American high school competitions such as the AMC 8, AMC 10, AMC 12, and AIME, along with state and regional contests. Each problem comes with a step by step written solution from the competition community.
The paper concluded that simply increasing parameter counts was unlikely to produce strong mathematical reasoning, and that meaningful progress would require either much larger compute budgets than were practical at the time or qualitatively new algorithms. That conclusion turned out to be partly right and partly wrong: dedicated mathematical pretraining and reasoning algorithms did push accuracy up sharply, but scale did help, especially when combined with reinforcement learning on verifiable rewards.
Difficulty in the MATH dataset is annotated using the rating scheme of the Art of Problem Solving community on the AoPS Wiki. The authors describe the convention directly: a subject's easiest problems for humans are labeled Level 1 and the hardest are labeled Level 5. Concretely, the first several problems of an AMC 8 exam are usually Level 1, the middle problems of AMC 10 and AMC 12 contests typically fall in Levels 2 and 3, the last AMC problems and the first AIME problems land in Level 4, and AIME problems in general, certainly the harder ones, are Level 5. There is no formal calibration across subjects; ratings reflect the AoPS community's collective judgment about what constitutes a hard problem in each topic.
| Attribute | Value |
|---|---|
| Total problems | 12,500 |
| Training set | 7,500 |
| Test set | 5,000 |
| Subjects | 7 |
| Difficulty levels | 5 (Level 1 through Level 5) |
| Test set Level 5 problems | 1,324 |
| Solution format | Full step by step LaTeX writeup with final boxed answer |
| License | MIT |
The seven subject categories are prealgebra, algebra, intermediate algebra, counting and probability, number theory, geometry, and precalculus. Each problem carries a subject tag and a difficulty tag, allowing researchers to slice performance along both axes. The Level 5 test slice is enriched in algebra, intermediate algebra, counting and probability, and precalculus relative to prealgebra and geometry, because those topics are over represented in AIME style problems.
When the MATH paper was published, the gap between overall scores and Level 5 scores already looked substantial. The GPT-2 1.5B model reported in the original paper achieved about 6.9 percent overall accuracy when pretrained on the AMPS auxiliary corpus and fine tuned on MATH, but only around 4 percent on Level 5 problems, with Level 1 problems closer to 15 percent. As large language models scaled and as new training techniques such as chain of thought prompting became standard, the easier levels saturated first. By 2023 frontier models were over 90 percent on Level 1 and Level 2 problems while still struggling with Level 5.
This pattern made Level 5 a useful diagnostic signal. A model whose overall MATH score climbed from 50 percent to 70 percent could be doing so by sweeping up easier problems while leaving the AIME-class problems essentially untouched, or by genuinely improving on the hardest problems. The per-level breakdown made the difference visible.
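The point about per-level breakdowns can be made concrete with a small sketch. The two "models" below are entirely made up for illustration (10 problems per level, not the real level sizes); the helper names `accuracy_by_level`, `overall_accuracy`, and `make_results` are hypothetical and not part of any published MATH tooling.

```python
from collections import defaultdict

def accuracy_by_level(results):
    """results: iterable of (level, is_correct) pairs, with level in 1..5."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for level, is_correct in results:
        totals[level] += 1
        correct[level] += int(is_correct)
    return {level: correct[level] / totals[level] for level in sorted(totals)}

def overall_accuracy(results):
    """Fraction of all problems answered correctly, ignoring level."""
    results = list(results)
    return sum(int(ok) for _, ok in results) / len(results)

def make_results(correct_per_level, per_level=10):
    """Build toy graded results: n_correct successes out of per_level problems."""
    out = []
    for level, n_correct in correct_per_level.items():
        out += [(level, i < n_correct) for i in range(per_level)]
    return out

# Hypothetical model A sweeps the easy tiers; hypothetical model B gains on hard ones.
model_a = make_results({1: 10, 2: 10, 3: 9, 4: 5, 5: 1})
model_b = make_results({1: 9, 2: 8, 3: 7, 4: 6, 5: 5})

for name, res in [("A", model_a), ("B", model_b)]:
    print(name, overall_accuracy(res), accuracy_by_level(res)[5])
```

Both toy models land on the same overall accuracy, yet their Level 5 accuracy differs by a factor of five, which is the distinction an aggregate score cannot show.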
Epoch AI formalized Level 5 as a standalone evaluation in 2024 specifically because it had not yet saturated while overall MATH and its 500 problem subset MATH-500 were trending toward 99 percent. Their hosted leaderboard runs the 1,324 Level 5 test problems with multiple equivalence scorers, including normalized string match, SymPy symbolic equivalence, and a model graded equivalence check. The site treats Level 5 as a stricter, slower saturating signal than full MATH.
The seven subjects do not contribute equally to the Level 5 slice. Algebra, intermediate algebra, and precalculus dominate, reflecting the topical balance of AIME style problems. Geometry and prealgebra contribute relatively fewer Level 5 problems. The table below characterizes the kinds of problems typical of each subject at the top difficulty.
| Subject | Typical Level 5 question style |
|---|---|
| Algebra | Functional equations, systems with constraints, inequalities with extremal conditions |
| Counting and probability | Multi step combinatorial identities, casework heavy probability with constraints |
| Geometry | Synthetic geometry with multiple auxiliary constructions, coordinate or trigonometric setups |
| Intermediate algebra | Polynomial roots and Vieta's formulas, complex numbers, sequences and series, advanced inequalities |
| Number theory | Modular arithmetic with multi step casework, Diophantine equations, divisibility puzzles |
| Prealgebra | Sparse at Level 5, but includes unusually long arithmetic puzzles when present |
| Precalculus | Trigonometric identities, sums involving roots of unity, telescoping identities |
Intermediate algebra in particular has historically been the hardest subject for models, with multi step manipulations of polynomial roots and complex valued sums being the most common failure mode.
The MATH protocol is exact match on the final answer, which is enclosed in a \boxed{} command in the LaTeX solution. Models are expected to produce a full written derivation followed by the boxed final answer, and only the contents of the final box are checked. This avoids the problem of partial credit while preserving the requirement that the model produce a written argument that supports the answer.
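Extracting the contents of the final box needs brace matching, because answers such as fractions contain nested braces. The following is a minimal sketch of that step, not the reference implementation; the function name `extract_last_boxed` is an assumption made for this example.

```python
def extract_last_boxed(solution: str):
    """Return the contents of the last \\boxed{...} in a solution, or None."""
    marker = "\\boxed{"
    start = solution.rfind(marker)
    if start == -1:
        return None  # the model never produced a boxed answer
    i = start + len(marker)
    depth = 1  # we are already inside the opening brace of \boxed{
    chars = []
    while i < len(solution):
        ch = solution[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)  # matching close brace found
        chars.append(ch)
        i += 1
    return None  # braces never balanced

print(extract_last_boxed(r"Thus the probability is $\boxed{\frac{3}{4}}$."))
```

A plain regular expression such as `\\boxed\{(.*?)\}` would stop at the first closing brace and return a truncated answer for the example above, which is why the sketch counts brace depth instead.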
Because mathematical answers can be expressed in many equivalent forms, the de facto standard pipelines accept multiple representations:
| Method | What it checks |
|---|---|
| Normalized string match | Strips whitespace, normalizes LaTeX commands, and compares strings |
| SymPy equivalence | Parses both answers as symbolic expressions and tests algebraic equivalence |
| Model graded equivalence | An auxiliary language model verifies whether two answers are mathematically the same |
The Hendrycks reference pipeline uses normalized string match with a hand written canonicalizer, while Epoch AI's MATH Level 5 implementation runs all three scorers and reports the model graded version as primary. Differences between scorers are usually under one percentage point but can matter near the top of the leaderboard.
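To illustrate why more than one scorer is used, here is a rough sketch of a normalized string match with a numeric fallback. It is a deliberately simplified stand-in: the normalization rules are assumptions chosen for this example rather than the Hendrycks canonicalizer, and the fallback uses Python's `fractions.Fraction` in place of SymPy, so it only recognizes plain numbers and simple `\frac{a}{b}` forms.

```python
import re
from fractions import Fraction

def normalize(ans: str) -> str:
    """Illustrative canonicalizer: strip spacing and cosmetic LaTeX commands."""
    s = ans.strip().replace("$", "")
    s = re.sub(r"\\(left|right)(?![a-zA-Z])", "", s)   # \left( -> (
    s = s.replace("\\dfrac", "\\frac").replace("\\tfrac", "\\frac")
    s = s.replace("\\!", "").replace("\\,", "")        # LaTeX spacing commands
    s = re.sub(r"\s+", "", s)
    return s.rstrip(".")

_FRAC = re.compile(r"^(-?)\\frac\{(\d+)\}\{(\d+)\}$")

def to_fraction(s: str):
    """Parse a normalized answer as an exact rational, or return None."""
    try:
        m = _FRAC.match(s)
        if m:
            sign = -1 if m.group(1) else 1
            return sign * Fraction(int(m.group(2)), int(m.group(3)))
        return Fraction(s)  # handles "3/4", "0.75", "-2"
    except (ValueError, ZeroDivisionError):
        return None

def answers_match(pred: str, gold: str) -> bool:
    """String match first, then exact numeric equivalence as a fallback."""
    p, g = normalize(pred), normalize(gold)
    if p == g:
        return True
    fp, fg = to_fraction(p), to_fraction(g)
    return fp is not None and fp == fg

print(answers_match(r"\dfrac{3}{4}", r"\frac{3}{4}"))  # string match after normalization
print(answers_match("0.75", r"\frac{3}{4}"))           # numeric fallback
print(answers_match("x+1", "1+x"))                     # needs a symbolic scorer
```

The last case is rejected by this sketch even though the two expressions are equal, which is the kind of gap a symbolic or model-graded scorer is meant to close.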
Many reported MATH Level 5 numbers correspond to single sample greedy or low temperature decoding (pass@1). Earlier flagship results, especially from Minerva, often used majority voting across many sampled solutions; Minerva 540B reports used self consistency with up to 64 samples. With chain of thought reasoning, sampling diversity tends to help on the hardest problems, so majority voting results are typically several points higher than pass@1 on Level 5.
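The difference between pass@1 and majority voting can be sketched in a few lines. The sampled answers below are invented for illustration, and `majority_vote`, `pass_at_1`, and `maj_at_k` are hypothetical helper names; a real pipeline would normalize answers before counting them.

```python
from collections import Counter

def majority_vote(answers):
    """Most common non-None answer; ties go to the answer seen first."""
    counted = Counter(a for a in answers if a is not None)
    if not counted:
        return None
    return counted.most_common(1)[0][0]

def pass_at_1(samples, gold):
    """Score only the first sampled answer."""
    return samples[0] == gold

def maj_at_k(samples, gold, k):
    """Score the plurality answer among the first k samples."""
    return majority_vote(samples[:k]) == gold

# Hypothetical extracted final answers from six samples on one problem.
samples = ["12", "15", "15", None, "15", "12"]
print(pass_at_1(samples, "15"), maj_at_k(samples, "15", k=6))
```

Here the first sample is wrong but the plurality answer is right, so the problem counts as solved under maj@6 and unsolved under pass@1. This is why the two decoding setups should not be compared directly.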
The original Hendrycks et al. paper benchmarked language models available in early 2021. The headline results are summarized below.
| Model | Pretraining | Overall accuracy | Approximate Level 5 accuracy |
|---|---|---|---|
| GPT-2 0.1B | Standard | ~3.0% | <2% |
| GPT-2 0.7B | Standard | ~3.7% | <3% |
| GPT-2 1.5B | Standard | ~5.4% | <4% |
| GPT-2 1.5B | + AMPS | ~6.9% | ~4% |
| GPT-3 davinci | Standard | ~5% | similar to GPT-2 1.5B |
The paper introduces a 23 GB auxiliary pretraining corpus called the Auxiliary Mathematics Problems and Solutions (AMPS) dataset, comprising over 100,000 Khan Academy style problems and roughly five million problems generated from Mathematica scripts. Pretraining on AMPS prior to fine tuning on MATH consistently boosted accuracy by several points but did not change the qualitative picture. The authors used these numbers to argue that scaling alone would not solve the benchmark.
The paper also reported informal human baselines. A computer science PhD student who did not particularly enjoy mathematics scored about 40 percent. A three time International Mathematical Olympiad gold medalist scored about 90 percent. These figures are not formal estimates and apply to the full MATH test set rather than to Level 5 specifically, where the human gap between the two extremes would be wider.
The following table tracks reported performance on MATH and on Level 5 specifically across major model releases. Numbers refer to the standard test split unless otherwise stated, and Level 5 numbers are listed where they have been published separately.
| Year | Model | Overall MATH | Level 5 | Notes |
|---|---|---|---|---|
| 2021 | GPT-2 1.5B + AMPS | 6.9% | ~4% | Hendrycks et al. baseline |
| 2021 | GPT-3 davinci | ~5% | ~3-4% | Few-shot prompting |
| 2022 | Minerva 8B (maj@k) | 25.4% | not isolated | Math-trained PaLM |
| 2022 | Minerva 62B (maj@k) | 43.4% | not isolated | Math-trained PaLM |
| 2022 | Minerva 540B (maj@k) | 50.3% | not isolated | First model to exceed 50% overall |
| 2023 | GPT-4 (CoT) | ~42% | not officially isolated | GPT-4 technical report |
| 2023 | GPT-4 + code interpreter | ~70% | not officially isolated | the-decoder reporting |
| 2023 | Llemma 34B (maj@256) | ~43% | not isolated | Open base model, approaching Minerva 62B |
| 2024 | DeepSeekMath 7B RL | 51.7% | not isolated | GRPO reinforcement learning |
| 2024 | Claude 3 Opus | ~60% | not officially isolated | Anthropic Claude 3 model card |
| 2024 | OpenAI o1 (CoT) | 94.8% (work-in-progress) | not isolated | First major reasoning model |
| 2024 | OpenAI o3 | not formally reported on MATH | not isolated | Reported AIME 2024 96.7% |
| 2025 | DeepSeek R1 | 97.3% on MATH-500 | not officially isolated | RL trained reasoning |
Reported overall scores hide most of the Level 5 story, because by 2024 the Level 5 contribution to remaining error became dominant. A model at 94 percent overall on MATH typically gets the Level 1, 2, and 3 problems essentially correct; almost all the missed problems are in Level 4 and Level 5, and most of those are in Level 5.
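A quick calculation shows how much an overall score can hide. Using only the test set size (5,000) and the Level 5 count (1,324) given in this article, the sketch below computes the Level 5 accuracy implied by an overall score under an assumed share of errors falling in Level 5. The function name and the `share_of_errors_in_level5` parameter are illustrative assumptions, not reported quantities.

```python
TEST_SIZE = 5000     # MATH test problems
LEVEL5_SIZE = 1324   # Level 5 test problems

def implied_level5_accuracy(overall_acc, share_of_errors_in_level5=1.0):
    """Level 5 accuracy if the given share of all errors falls in Level 5."""
    total_errors = (1.0 - overall_acc) * TEST_SIZE
    level5_errors = share_of_errors_in_level5 * total_errors
    return 1.0 - level5_errors / LEVEL5_SIZE

# Worst case: every miss at 94% overall is a Level 5 problem.
print(round(implied_level5_accuracy(0.94), 3))
# Milder assumption: 60% of the misses are Level 5.
print(round(implied_level5_accuracy(0.94, 0.6), 3))
```

At 94 percent overall there are 300 missed problems. If all of them were Level 5, Level 5 accuracy would be about 77 percent, so a headline number in the mid-nineties is compatible with a much weaker showing on the hardest tier.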
Minerva was the first major leap on MATH, jumping from the GPT-3 baseline of around 5 percent to 50.3 percent with the 540B variant in 2022. Built on PaLM and further trained on a 118 GB corpus of scientific papers from arXiv and other mathematical content, Minerva used chain of thought prompting and majority voting with up to 64 samples. The paper's error analysis attributed roughly half of remaining errors to calculation mistakes and half to genuine reasoning failures, and noted that performance dropped sharply on intermediate algebra and other subjects that are over represented at Level 5.
Minerva's contribution was twofold. First, it demonstrated that domain specific pretraining could close most of the gap to the top of the published MATH leaderboard at the time. Second, it gave the field a concrete error decomposition between arithmetic execution and reasoning, which motivated later work on tool use (calculators, Python interpreters), self consistency, and process supervision.
The GPT-4 technical report from March 2023 reported MATH accuracy in the low forties without external tools, and later reporting put GPT-4 with code interpreter augmentation at around 70 percent. OpenAI explicitly acknowledged in the GPT-4 system card and a Hendrycks-led decontamination note that the training data included parts of the MATH training set; reported numbers are therefore on the held-out test split, but the standard 5,000-problem test set remains potentially within the broader pretraining distribution for any frontier model trained after the dataset's public release. This contamination concern is one reason later evaluations turned to held-out subsets, fresh competition problems (such as new AIME years), and adversarial perturbations of MATH problems.
OpenAI o1, introduced in September 2024 with a system card on September 12, was the first widely benchmarked model to push MATH past 90 percent without external tools. The o1 series reports a 94.8 percent score on MATH for a work-in-progress checkpoint, with o1-preview at 85.5 percent. The system card frames MATH as effectively saturated at this point. o3, announced in December 2024, reports 96.7 percent on AIME 2024, a recent edition of the competition from which many Level 5 problems are drawn, indicating that AIME-class problems are no longer a strict frontier.
DeepSeek R1, released in January 2025, reports 97.3 percent on the 500 problem MATH-500 subset and is the first widely available open weight model in this regime. R1's reinforcement learning pipeline, using outcome rewards on math problems and the Group Relative Policy Optimization algorithm originally introduced in DeepSeekMath, is the canonical example of pushing reasoning by rewarding successful problem solving rather than imitating human written solutions.
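The core idea of rewarding successful problem solving within a group of samples can be sketched as follows. This shows only the group-relative advantage step commonly described for GRPO, where each sampled solution's outcome reward is compared with the mean of its group. The use of the population standard deviation, the `eps` constant, and the function name are assumptions for this example, and the policy-gradient update itself is omitted.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize outcome rewards within one group of samples for one problem."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical outcome rewards: 1.0 if the boxed answer was correct, else 0.0.
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
advantages = group_relative_advantages(rewards)
print([round(a, 2) for a in advantages])
```

Correct samples receive positive advantages and incorrect ones negative advantages, with no learned value model involved. When every sample in a group gets the same reward, all advantages are zero and that problem contributes no learning signal.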
Despite these gains, the Level 5 slice has resisted full saturation. Even when overall MATH scores exceed 95 percent, the remaining errors are concentrated in Level 5, and frontier model documentation typically reports Level 5 accuracy several points below the overall figure. Public reporting on this slice has become less common as MATH itself has been displaced by harder benchmarks; researchers tracking long horizon progress have moved much of their attention to FrontierMath, HMMT, the unseen AIME, and the IMO grand challenge.
MATH Level 5 sits in a specific niche between curriculum aligned datasets such as GSM8K and research level mathematics benchmarks such as FrontierMath. The table below positions it relative to other widely used mathematical evaluations.
| Benchmark | Difficulty level | Problem source | Typical frontier accuracy (2025) |
|---|---|---|---|
| GSM8K | Grade school word problems | OpenAI commissioned writers | >95% |
| MATH (full) | Mixed levels 1-5 | US high school competitions | >95% |
| MATH-500 | Random 500 problem subset of MATH | Subset of MATH test | >95% |
| MATH Level 5 | AIME and hardest AMC problems | Subset of MATH test, hardest tier | low to high 90s, lags overall MATH |
| AMC 10 / AMC 12 | Late high school | Annual competitions | very high for reasoning models |
| AIME | Top US high school competition | Annual AIME exams | over 90% for reasoning models |
| USAMO | US Mathematical Olympiad | Proof based | early stage, partial credit graded |
| IMO | International Olympiad | Proof based | very early stage, special purpose systems |
| FrontierMath | Research mathematics | Custom commissioned | low double digits for the strongest models in 2025 |
The MATH benchmark and its Level 5 subset are descended from the same competition ecosystem that produces the AMC and AIME contests, but use historical problems rather than new ones. This makes contamination a structural risk: any frontier model trained on a recent crawl of the open web is likely to have seen the questions and solutions, since the AoPS community discusses these problems extensively online. The unseen AIME and other fresh competition based benchmarks were introduced specifically to address this.
MATH Level 5 accuracy and AIME accuracy are related but not interchangeable. AIME 2024 contains 30 problems (15 each on AIME I and AIME II), each answered with an integer from 0 to 999. MATH Level 5 contains 1,324 problems in a similar style but with a wider variety of answer formats. A model that does well on MATH Level 5 will typically do well on AIME, but may handle AIME's strict answer format and time-pressure structure differently.
Because MATH was scraped from public competition archives, the same problems appear in many forms across the open web, including AoPS forum discussions, problem-of-the-week sites, and educational content. Several lines of research have studied the extent to which the resulting contamination inflates reported numbers.
A pragmatic alternative is to evaluate on the most recent AIME, which postdates the MATH test set and which models trained before the exam cannot have seen during pretraining. As of 2025 this is the standard comparison for reasoning models. MATH Level 5 retains value as a sample large enough (1,324 problems) to yield tight confidence intervals and as a continuity bridge to historical results.
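The sample-size argument can be checked with a standard binomial interval. The sketch below uses the Wilson score interval at a nominal 95 percent level to compare a 1,324-problem evaluation with a 30-problem one. The accuracy of roughly 90 percent is a made-up input, and treating problems as independent trials is a simplifying assumption.

```python
import math

def wilson_interval(correct, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 ~ 95%)."""
    p = correct / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

def width(interval):
    return interval[1] - interval[0]

level5 = wilson_interval(1192, 1324)  # ~90% on the Level 5 test slice
aime = wilson_interval(27, 30)        # 90% on one AIME year (I and II)

print("Level 5 width:", round(width(level5), 3))
print("AIME width:   ", round(width(aime), 3))
```

At the same underlying accuracy, the interval on 30 problems is several times wider than the interval on 1,324, so single-year AIME scores that differ by one or two problems are hard to separate statistically.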
Error analyses across Minerva, GPT-4, OpenAI o1, and DeepSeek R1 highlight a consistent set of failure modes on Level 5 problems:
| Failure type | Description | Frequency among errors |
|---|---|---|
| Arithmetic slips | Multi step calculations carried through correctly except for a small algebra or arithmetic error | high for pre-reasoning models, lower for o1 and R1 |
| Casework omissions | Missing a case in combinatorics or number theory problems with multiple branches | persistent across model generations |
| Misreading the problem | Solving a different problem than the one stated, often by ignoring a constraint | falls with longer chains of thought |
| Premature claim of factorizations or identities | Asserting a polynomial factorization or trigonometric identity without verification | common in intermediate algebra and precalculus |
| Plausible but wrong final boxed answer | Reasoning is roughly correct but the final extraction step produces an answer that does not match the question's required form | common at Level 5 where answer formats are unusual |
Reasoning trained models with long chains of thought tend to reduce the first three categories more than the last two. The remaining errors at the very top of the leaderboard are dominated by genuine reasoning gaps and idiosyncratic answer formatting issues rather than careless mistakes.
The MATH dataset has been cited thousands of times since its 2021 release, and its difficulty rating scheme has been reused in derivative benchmarks. MATH-500, the 500 problem subset introduced by OpenAI's Let's Verify Step by Step paper (Lightman et al., 2023), became the standard evaluation in much of the process reward modeling literature, including the PRM800K release. MATH-Shepherd, a follow up on process supervision without human annotation, similarly evaluates on MATH and reports per level breakdowns.
The role of MATH Level 5 specifically as a frontier metric has been important in several research narratives.
| Limitation | Description |
|---|---|
| Subjective level labels | The AoPS difficulty scale is community curated and not formally calibrated across subjects |
| Public availability | Problems and solutions are extensively discussed online, creating contamination risk |
| English only | Problems and solutions are in English LaTeX, limiting cross lingual evaluation |
| Answer format quirks | Exact match on boxed answers can disadvantage models that produce mathematically equivalent but textually different forms; mitigations exist via symbolic equivalence scorers |
| Static test set | The 1,324 Level 5 problems do not update over time, so the benchmark cannot track novelty |
| Diminishing headroom | Reasoning models in 2025 score above 95 percent on overall MATH; remaining errors concentrate in Level 5 but the absolute headroom is narrow |
These constraints have driven the field toward complementary benchmarks (FrontierMath, HMMT, unseen AIME, IMO grand challenge) and toward dynamic evaluations that incorporate new problems each year.