MATH (Mathematics Aptitude Test of Heuristics) is a dataset and benchmark of 12,500 challenging competition-level mathematics problems designed to evaluate the mathematical reasoning capabilities of large language models. Created by Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt, the benchmark was introduced in the 2021 paper "Measuring Mathematical Problem Solving With the MATH Dataset" and presented at the NeurIPS 2021 Datasets and Benchmarks track.
The problems in MATH are drawn from American high school mathematics competitions, including the AMC 10, AMC 12, and AIME, and were collected from archives on the Art of Problem Solving (AoPS) community website. Each problem comes with a detailed step-by-step solution written in LaTeX, enabling both automated evaluation and the training of models to produce mathematical derivations. At the time of its release, the benchmark was considered exceptionally difficult for AI systems: the best large transformer models scored below 7% accuracy, while a three-time International Mathematical Olympiad (IMO) gold medalist achieved 90% on a sample of 20 problems.
MATH has become one of the most widely cited benchmarks for mathematical reasoning in AI. Its influence extends across hundreds of research papers, and model developers routinely report MATH scores in technical reports and announcements. The benchmark played an important role in motivating advances in chain-of-thought prompting, process supervision, and test-time compute scaling. As of early 2026, frontier models score above 95% on the MATH test set, and the benchmark is considered largely saturated for top-tier systems.
Before MATH was introduced, benchmarks for evaluating mathematical reasoning in language models were either too easy or too narrow to provide meaningful signal. Existing datasets tended to focus on elementary arithmetic or single-step algebra problems that did not test genuine multi-step reasoning. The GSM8K benchmark, also released in 2021, addressed grade-school-level math word problems requiring 2 to 8 steps, but its problems could be solved using only basic arithmetic operations.
Hendrycks and colleagues recognized the need for a benchmark that would remain challenging even as models grew in scale. They selected competition mathematics as the problem domain because it requires a combination of conceptual understanding, heuristic reasoning, and multi-step derivation that goes well beyond rote calculation. Competition problems often demand creative approaches, the application of mathematical identities, and careful logical reasoning across many steps.
The paper's central thesis was that simply scaling up transformer models would not be sufficient to solve competition-level mathematics. The authors argued that "if a machine learning model were to achieve high accuracy on MATH, this would represent a substantial step forward for mathematical reasoning," and they provided evidence that existing scaling trends pointed toward a need for new algorithmic advances rather than larger parameter counts alone.
MATH contains 12,500 problems divided into a training set of 7,500 problems and a test set of 5,000 problems. The problems span seven subject areas and five difficulty levels.
The seven mathematical subjects covered by MATH are:
| Subject | Description |
|---|---|
| Prealgebra | Foundational topics including fractions, ratios, percentages, and basic number properties |
| Algebra | Equations, inequalities, polynomials, functions, and algebraic manipulations |
| Number Theory | Divisibility, modular arithmetic, prime numbers, and Diophantine equations |
| Counting and Probability | Combinatorics, permutations, conditional probability, and expected value |
| Geometry | Euclidean geometry, coordinate geometry, areas, volumes, and trigonometric relationships |
| Intermediate Algebra | Complex numbers, sequences and series, logarithms, and advanced polynomial theory |
| Precalculus | Trigonometric functions, vectors, matrices, and conic sections |
Each problem is assigned a difficulty rating from Level 1 (easiest) to Level 5 (hardest). These ratings reflect the relative difficulty as perceived by human competitors. Level 1 problems are typically straightforward applications of well-known formulas or concepts, while Level 5 problems require substantial ingenuity, the combination of multiple mathematical ideas, or extended multi-step reasoning.
The MATH Level 5 subset, consisting of the 1,324 Level 5 problems from the test set, is sometimes used as a standalone evaluation. Because these represent the hardest problems in the dataset, they provide better differentiation between models at the high end of the performance spectrum.
Each problem in MATH is stored as a structured record containing the following fields:
| Field | Description |
|---|---|
| problem | The problem statement, written in natural language with LaTeX mathematical notation |
| solution | A complete step-by-step solution in LaTeX |
| level | The difficulty level (1 through 5) |
| type | The subject category (one of the seven subjects listed above) |
Some problems include Asymptote code for generating geometric figures. Final answers are enclosed in \boxed{} delimiters, which enables automated evaluation by parsing the content inside the box.
A typical Level 1 problem from the Counting and Probability category:
A board game spinner is divided into three parts labeled A, B and C. The probability of the spinner landing on A is 1/3 and the probability of the spinner landing on B is 5/12. What is the probability of the spinner landing on C? Express your answer as a common fraction.
The answer is \boxed{\frac{1}{4}}.
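The arithmetic behind this example can be verified with exact rational arithmetic: the three probabilities of a spinner partition must sum to 1, so P(C) = 1 − 1/3 − 5/12.

```python
from fractions import Fraction

p_a = Fraction(1, 3)
p_b = Fraction(5, 12)
p_c = 1 - p_a - p_b  # the three outcomes partition the spinner, so they sum to 1

print(p_c)  # 1/4
```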
The problems were sourced from American mathematics competitions, primarily from archives hosted on the Art of Problem Solving (AoPS) website. The competition sources include the AMC 10 (American Mathematics Competition for students in 10th grade and below), AMC 12 (for students in 12th grade and below), and AIME (American Invitational Mathematics Examination, a more selective follow-up to the AMC). The problems span multiple decades of competition history.
MATH uses exact-match evaluation on the final answer. The system extracts the content inside the \boxed{} delimiter from a model's generated solution and compares it to the ground truth answer.
To handle mathematically equivalent representations, the evaluation pipeline applies several normalization rules:
Common normalizations include stripping extraneous whitespace, removing \left and \right delimiters, dropping units and degree symbols, and rewriting fractions into canonical \frac{x}{y} format.

This normalization approach allows the evaluation to be fully automated while accounting for the many different ways a correct mathematical answer can be expressed in LaTeX. The grading logic was later reused and refined in subsequent work, including OpenAI's PRM800K project.
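A simplified sketch of this style of normalization (the actual grader applies many more LaTeX rewrites than shown here) might look like:

```python
def normalize_answer(ans: str) -> str:
    """Apply a few equivalence-preserving rewrites before exact match.

    This is an illustrative subset only; the real MATH grading code
    handles many more LaTeX variants.
    """
    ans = ans.strip()
    # \left( ... \right) and ( ... ) denote the same grouping
    ans = ans.replace(r"\left", "").replace(r"\right", "")
    # \tfrac and \dfrac render differently but mean the same as \frac
    ans = ans.replace(r"\tfrac", r"\frac").replace(r"\dfrac", r"\frac")
    # interior whitespace is not significant: \frac{1}{4} vs \frac{1}{ 4}
    ans = ans.replace(" ", "")
    return ans

def is_equiv(a: str, b: str) -> bool:
    """Exact match after normalization."""
    return normalize_answer(a) == normalize_answer(b)

print(is_equiv(r"\dfrac{1}{4}", r"\frac{1}{ 4}"))  # True
```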
The original 2021 paper reported results for several model configurations on the MATH test set. Performance was remarkably low across the board.
| Model | Configuration | MATH Accuracy |
|---|---|---|
| GPT-2 0.1B | Pretrained on AMPS, fine-tuned on MATH | 5.4% |
| GPT-2 0.3B | Pretrained on AMPS, fine-tuned on MATH | 6.2% |
| GPT-2 0.7B | Pretrained on AMPS, fine-tuned on MATH | 6.4% |
| GPT-2 1.5B | Pretrained on AMPS, fine-tuned on MATH | 6.9% |
| GPT-2 1.5B | Fine-tuned on MATH only (no AMPS) | 5.5% |
| GPT-3 13B | Few-shot | 3.0% |
| GPT-3 13B | Fine-tuned | 5.6% |
| GPT-3 175B | Few-shot | 5.2% |
| BART-Large 0.4B | Fine-tuned | 4.9% |
A notable finding was that a 0.1B parameter model pretrained on AMPS achieved performance comparable to a fine-tuned 13B parameter model, representing roughly a 130-fold improvement in parameter efficiency through mathematical pretraining.
The paper reported per-subject accuracy for the GPT-2 1.5B model pretrained on AMPS and fine-tuned on MATH:
| Subject | GPT-2 1.5B Accuracy |
|---|---|
| Prealgebra | 8.3% |
| Algebra | 6.2% |
| Number Theory | 4.8% |
| Counting and Probability | 5.4% |
| Geometry | 8.7% |
| Intermediate Algebra | 6.1% |
| Precalculus | 8.8% |
| Average | 6.9% |
To provide a human baseline, the authors had participants of varying mathematical backgrounds attempt a sample of 20 test problems:
| Participant | Score (out of 20) | Accuracy |
|---|---|---|
| Non-mathematics enthusiast | 8 | 40% |
| Ambivalent participant | 13 | 65% |
| Mathematics enthusiast (1) | 14 | 70% |
| Mathematics enthusiast (2) | 15 | 75% |
| USAMO participant | 18 | 90% |
| Three-time IMO gold medalist | 18 | 90% |
The authors highlighted the enormous gap between the best model (6.9%) and the most skilled human (90%), and projected based on scaling trends that approximately 10^35 parameters would be needed to reach 40% accuracy through scaling alone. This projection underscored their argument that new algorithmic breakthroughs, not just larger models, were necessary.
Alongside the MATH dataset, Hendrycks et al. released AMPS (Auxiliary Mathematics Problems and Solutions), a large pretraining corpus designed to teach models the fundamentals of mathematics. AMPS totals approximately 23 GB and consists of two components.
AMPS includes over 100,000 problems with step-by-step solutions drawn from Khan Academy exercises. These cover topics ranging from basic addition to multivariable calculus, organized across 693 exercise types. The problems and solutions are formatted in LaTeX.
The second component contains approximately 5 million problems generated using Wolfram Mathematica scripts. The authors designed 100 hand-crafted modules covering topics such as conic sections, div/grad/curl, KL divergence, eigenvalues, polyhedra, and Diophantine equations. Of these 100 modules, 37 include full step-by-step solutions.
Pretraining on AMPS before fine-tuning on MATH improved accuracy from 5.5% to 6.9% for the GPT-2 1.5B model, a 25% relative improvement. The authors found that AMPS pretraining provided roughly the same benefit as increasing model size by a factor of 15, demonstrating that domain-specific pretraining data could substitute for substantial parameter scaling.
MATH-500 is a widely used 500-problem subset of the MATH test set, introduced by OpenAI in the 2023 paper "Let's Verify Step by Step" by Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe.
The MATH-500 subset was created as part of OpenAI's research on process supervision for mathematical reasoning. To expand the training data available for their process reward model (PRM), the researchers incorporated 4,500 problems from MATH's original test split into their training set. The remaining 500 problems were selected uniformly at random to serve as a held-out evaluation set. The authors stated that they "believe [these 500 test problems] are representative of the test set as a whole."
The "Let's Verify Step by Step" paper also released PRM800K, a dataset of approximately 800,000 step-level human correctness labels applied to model-generated solutions for MATH problems. Human labelers evaluated each step of a solution as correct, incorrect, or neutral, enabling the training of process reward models that provide fine-grained feedback on mathematical reasoning. Using a process-supervised reward model, the researchers achieved 78.2% accuracy on the MATH-500 subset.
MATH-500 was subsequently adopted by many organizations as a standard evaluation benchmark. It spans six of the seven MATH subject areas: Algebra, Counting and Probability, Geometry, Intermediate Algebra, Number Theory, and Precalculus. The smaller size of 500 problems makes it less expensive and faster to evaluate than the full 5,000-problem test set, while still providing a representative sample. A version of the subset is hosted on Hugging Face by the HuggingFaceH4 team.
Models are typically evaluated on MATH-500 by prompting them to present final answers in \boxed{} LaTeX format, with evaluation logic adapted from the PRM800K grading framework. Most evaluations use a temperature of 0, except for reasoning models with specific temperature requirements.
As of 2025, most frontier models score above 95% on MATH-500, and many exceed 97%. Due to this saturation, some evaluation platforms no longer run MATH-500 on new model releases. The benchmark remains useful for evaluating smaller or open-weight models where there is still meaningful variation in performance.
Performance on the MATH benchmark improved at a remarkable pace between 2021 and 2025. The following tables summarize notable scores across different models and evaluation periods.
| Model | Organization | Year | MATH Accuracy | Method |
|---|---|---|---|---|
| GPT-2 1.5B | OpenAI | 2021 | 6.9% | AMPS pretrain + fine-tune |
| GPT-3 175B | OpenAI | 2021 | 5.2% | Few-shot |
| text-davinci-002 | OpenAI | 2022 | 8.5% | 4-shot CoT |
| text-davinci-003 | OpenAI | 2022 | 15.6% | 4-shot CoT |
| Minerva 540B | Google | 2022 | 33.6% | Pass@1 |
| Minerva 540B | Google | 2022 | 50.3% | Majority vote (k=64) |
| PaLM 2-L | Google | 2023 | 34.3% | 4-shot CoT |
| gpt-3.5-turbo-0301 | OpenAI | 2023 | 33.4% | 4-shot CoT |
| gpt-4-0314 | OpenAI | 2023 | 38.6% | 4-shot CoT |
| GPT-4 (Code Interpreter) | OpenAI | 2023 | 69.7% | Zero-shot with code execution |
| Claude 3 Opus | Anthropic | 2024 | 60.1% | Zero-shot CoT |
| Gemini 1.0 Ultra | Google | 2024 | 53.2% | Zero-shot CoT |
| GPT-4o | OpenAI | 2024 | 76.6% | Zero-shot CoT |
| Claude 3.5 Sonnet | Anthropic | 2024 | 71.1% | Zero-shot CoT |
| o1 | OpenAI | 2024 | 94.8% | Internal chain-of-thought |
| Model | Organization | Year | MATH-500 Accuracy |
|---|---|---|---|
| o1-mini | OpenAI | 2024 | 90.0% |
| DeepSeek-V3 | DeepSeek | 2024 | 90.2% |
| QwQ-32B | Alibaba | 2024 | 90.6% |
| DeepSeek-R1-Distill-Qwen-7B | DeepSeek | 2025 | 92.8% |
| DeepSeek-V3-0324 | DeepSeek | 2025 | 94.0% |
| DeepSeek-R1-Distill-Qwen-32B | DeepSeek | 2025 | 94.3% |
| DeepSeek-R1-Distill-Llama-70B | DeepSeek | 2025 | 94.5% |
| Claude 3.7 Sonnet | Anthropic | 2025 | 96.2% |
| Kimi K2 Instruct | Moonshot AI | 2025 | 97.4% |
| DeepSeek-R1 | DeepSeek | 2025 | 97.3% |
Several patterns emerge from these results. First, performance jumped dramatically between 2021 and 2023 as techniques like chain-of-thought prompting, mathematical pretraining, and code-augmented reasoning were developed. Second, the introduction of reasoning-focused models in late 2024 (such as OpenAI's o1) pushed scores into the 90s on the full test set. Third, by early 2025, multiple models from different organizations converged above 95% on MATH-500, signaling benchmark saturation.
GSM8K and MATH are both mathematical reasoning benchmarks released in 2021, but they differ substantially in difficulty and scope.
| Feature | GSM8K | MATH |
|---|---|---|
| Creator | Karl Cobbe et al. (OpenAI) | Dan Hendrycks et al. (UC Berkeley) |
| Total problems | 8,792 | 12,500 |
| Difficulty level | Grade school | Competition (high school) |
| Required operations | Basic arithmetic (+, −, ×, ÷) | Algebra, geometry, number theory, calculus concepts |
| Steps per problem | 2 to 8 | Varies widely; many require 10+ steps |
| Solution format | Calculator annotations with #### delimiter | Full LaTeX derivations with \boxed{} delimiter |
| Saturation point | ~95% (reached by mid-2024) | ~95% on MATH-500 (reached by early 2025) |
GSM8K problems can generally be solved through straightforward sequential application of arithmetic operations. MATH problems, by contrast, often require recognizing which mathematical technique to apply, combining ideas from different areas, and carrying out multi-step symbolic manipulations. A model that scores well on GSM8K may still struggle with MATH, as the latter demands a deeper level of mathematical reasoning.
Because GSM8K saturated earlier, many organizations stopped prominently reporting GSM8K scores for their latest models by late 2024, shifting their focus to MATH and more challenging benchmarks. MATH provided better differentiation between frontier models during the period when most models were already scoring above 95% on GSM8K.
The American Invitational Mathematics Examination (AIME) is a selective 15-question, 3-hour mathematics competition administered by the Mathematical Association of America. Students must score in the top 2.5% to 5% on the AMC 10 or AMC 12 to qualify for the AIME. Each AIME answer is an integer between 000 and 999.
MATH and AIME are related in two ways. First, some of the problems in the MATH dataset are drawn from past AIME competitions (along with AMC 10 and AMC 12 problems). AIME problems in MATH tend to appear at difficulty Levels 4 and 5, as they are among the more challenging competition problems.
Second, AIME has increasingly been used as a separate benchmark for evaluating AI systems, particularly as MATH has become saturated. Because new AIME problems are released each year, they offer a contamination-resistant evaluation. Models trained before a given year's competition cannot have seen that year's problems in their training data. The 2024 and 2025 AIME problems (30 problems per year, combining AIME I and AIME II) have been widely adopted as benchmarks for frontier reasoning models.
Frontier models have achieved strong performance on AIME evaluations. For example, OpenAI's o1 model placed among the top 500 students nationally on the 2024 AIME. By 2025, several frontier models scored above 80% on AIME problems, with some exceeding 90% when using extended reasoning or tool access.
As both MATH and AIME have approached saturation for the strongest models, the research community has turned to even harder evaluations such as FrontierMath, which contains research-level mathematics problems where current frontier models solve fewer than 2% of problems.
As with many widely used benchmarks, the MATH dataset has faced scrutiny over the possibility of data contamination, where test problems (or very similar problems) leak into model training data, inflating reported scores.
The MATH dataset's problems were collected from publicly available competition archives on the Art of Problem Solving website. These problems, along with their solutions, have been freely accessible online for years, making it likely that they appear in the large web crawls used to pretrain many language models. This creates a risk that models may have memorized specific problems or solution patterns from their pretraining data, rather than demonstrating genuine mathematical reasoning.
Research has identified measurable contamination in certain models. A study using the LLM Decontaminator tool found 79 paraphrased samples, or approximately 1.58% of the MATH test set, present in some training corpora. While this percentage is relatively small, even partial exposure to test problems can affect benchmark scores, particularly for models that are already performing near the top of the scale.
The contamination concern is not unique to MATH. Parallel research on GSM8K found accuracy drops of up to 13% when models were evaluated on GSM1K, a contamination-free mirror dataset designed to match GSM8K's style and difficulty. While no equivalent "MATH1K" contamination-free mirror has been widely adopted, the GSM8K findings suggest that similar effects could influence MATH scores for some models.
To address contamination and test genuine reasoning, researchers have developed MATH-Perturb, a benchmark that applies systematic perturbations to MATH problems. By modifying the numerical values, conditions, or constraints in existing problems, MATH-Perturb creates novel variations that test whether models can apply mathematical reasoning to modified versions of familiar problem structures rather than relying on memorized solutions.
In January 2025, the Art of Problem Solving (AoPS) filed a DMCA takedown notice against the MATH dataset hosted on Hugging Face. AoPS argued that the competition problems in the dataset had been collected from their website without authorization and constituted copyright infringement. As a result, the original hendrycks/competition_math dataset on Hugging Face was taken down. Alternative copies and mirrors of the dataset continue to exist elsewhere, and the benchmark remains in widespread use for model evaluation.
The MATH benchmark has had a substantial impact on the field of AI research, influencing both the development of new techniques and the broader conversation about what language models can and cannot do.
The low initial performance of language models on MATH helped motivate research into eliciting better reasoning through prompting techniques. Chain-of-thought prompting, introduced by Jason Wei and colleagues in 2022, demonstrated that providing step-by-step reasoning examples in the prompt could significantly improve performance on mathematical tasks. The MATH benchmark, alongside GSM8K, served as a primary testing ground for these techniques.
The step-by-step solutions in MATH enabled research on process supervision, where models are trained to evaluate the correctness of each individual reasoning step rather than just the final answer. OpenAI's "Let's Verify Step by Step" paper used MATH extensively to demonstrate that process reward models (PRMs) outperform outcome reward models (ORMs) for mathematical problem solving. This line of research has become central to how reasoning-focused models like OpenAI's o1 and o3 series are trained.
MATH has been a key benchmark for research on test-time compute scaling, the strategy of generating multiple candidate solutions and selecting the best one rather than relying on a single model output. Minerva's use of majority voting (which improved its MATH score from 33.6% to 50.3%) was an early demonstration of this principle. The concept has since evolved into sophisticated inference-time strategies used by modern reasoning models.
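Majority voting of the kind Minerva used can be sketched simply: sample k candidate solutions, extract each final answer, and return the most common one. In the sketch below, the list of sampled answers stands in for actual model calls.

```python
from collections import Counter

def majority_vote(candidate_answers: list[str]) -> str:
    """Return the most frequent final answer among k sampled solutions.

    Mathematically equivalent answers should be normalized before
    counting; otherwise the vote splits across surface forms.
    """
    counts = Counter(candidate_answers)
    answer, _ = counts.most_common(1)[0]
    return answer

# Hypothetical distribution over 64 samples: the correct answer need
# only be the *most common* output, not the majority of all outputs.
samples = ["1/4"] * 30 + ["1/3"] * 20 + ["5/12"] * 14
print(majority_vote(samples))  # 1/4
```

This is why majority voting lifted Minerva's score well above its pass@1 accuracy: a model that produces the right answer more often than any single wrong answer benefits from aggregation even when it is right less than half the time.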
Alongside MMLU and HumanEval, MATH became part of the standard set of benchmarks reported in virtually every major language model release from 2022 through 2025. Its difficulty level provided meaningful differentiation between models during a period when simpler benchmarks like GSM8K had become saturated.
The saturation of MATH by frontier models has prompted the development of harder mathematical reasoning benchmarks:
| Benchmark | Year | Description |
|---|---|---|
| MATH | 2021 | 12,500 competition-level problems; now largely saturated |
| AIME (as AI benchmark) | 2024 | 30 olympiad-level problems per year; fresh problems annually |
| Omni-MATH | 2024 | Olympiad-level problems spanning multiple mathematical domains |
| FrontierMath | 2024 | Research-level mathematics; fewer than 2% solved by frontier models |
| MATH-Perturb | 2025 | Perturbed versions of MATH problems testing robustness |
| U-MATH | 2025 | University-level mathematics problems |
This progression reflects the rapid pace of improvement in AI mathematical reasoning. Benchmarks that seemed insurmountable when introduced are often approached or saturated within two to three years, driving the community to develop increasingly challenging evaluations.
The MATH dataset was released under the MIT License. The dataset was originally hosted on GitHub at github.com/hendrycks/math and on Hugging Face at hendrycks/competition_math. Following the DMCA takedown, the Hugging Face version was removed, though the GitHub repository and alternative mirrors remain available. An alternative version of the dataset is hosted at nlile/hendrycks-MATH-benchmark on Hugging Face.
MATH is supported by several major evaluation frameworks, including EleutherAI's Language Model Evaluation Harness (lm-evaluation-harness), which provides standardized implementations for running the benchmark. OpenAI's simple-evals repository also includes a MATH evaluation. These frameworks handle answer extraction, normalization, and scoring in a reproducible manner.
| Property | Value |
|---|---|
| Total problems | 12,500 |
| Training set | 7,500 |
| Test set | 5,000 |
| Subject areas | 7 |
| Difficulty levels | 5 |
| MATH-500 subset | 500 (randomly sampled from test set) |
| Level 5 subset (test) | 1,324 |
| Problem text length | 16 to 4,310 characters |
| Solution text length | 26 to 6,770 characters |
| Language | English |
| License | MIT |
| Download size | ~5.33 MB |