Omni-MATH
Last reviewed: Jun 8, 2026
Sources: 4 citations
Review status: Source-backed
Revision: v1 · 1,491 words
Omni-MATH is an AI benchmark of Olympiad-level competition mathematics, introduced in October 2024 to measure the mathematical reasoning ability of large language models on problems far harder than those in earlier, increasingly saturated datasets. The benchmark contains 4,428 competition-level problems drawn from roughly 21 mathematics olympiads and contests, each problem carrying human annotations for its mathematical domain (one of more than 33 sub-domains) and a fine-grained difficulty rating on a continuous scale from 0 to 10. Alongside the dataset, the authors released Omni-Judge, an open-source model that automatically verifies whether a model's free-form solution matches the reference answer, removing the need for an expensive proprietary judge [1][2].
The work was published as "Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models" by Bofei Gao, Feifan Song, Zhe Yang, Baobao Chang and colleagues, a team centered at Peking University with collaborators from the Qwen team at Alibaba, Tencent and the Chinese University of Hong Kong, Shenzhen. It first appeared on arXiv on October 10, 2024, and was accepted to the International Conference on Learning Representations (ICLR) 2025 [1][3]. At release, the strongest model evaluated, OpenAI o1-mini, reached only about 60.5 percent accuracy, leaving substantial headroom and positioning Omni-MATH as one of the harder text-only math reasoning evaluations of the 2024 to 2026 period [1][2].
By 2024, the two most widely used math reasoning benchmarks had largely been solved by frontier systems. Both GSM8K, a set of grade-school word problems released by OpenAI in 2021, and MATH, a more demanding dataset of high-school competition problems released by Hendrycks et al. in the same year, were being answered with very high accuracy. OpenAI's o1 model, for example, reported 94.8 percent on MATH, leaving little room to distinguish between strong systems [1]. When a benchmark approaches its ceiling, small score differences stop reflecting meaningful differences in capability, and the benchmark loses its value as a research signal.
The Omni-MATH authors framed their dataset as a direct response to this saturation. They mapped existing datasets onto their own difficulty scale, noting that GSM8K corresponds roughly to the lowest difficulty band and MATH to the next band up, both well below the olympiad range that Omni-MATH targets [1]. Olympiad mathematics demands multi-step proofs, creative problem decomposition and rigorous logical chains rather than routine computation, making it a natural next frontier once school-level math is exhausted. Omni-MATH joined a small group of contemporaneous olympiad-tier efforts, including OlympiadBench, but distinguished itself through its scale, its breadth of source competitions and its unusually granular difficulty and domain labeling [1].
The dataset comprises 4,428 problems, each with a verified final answer and rich metadata. Problems are categorized into more than 33 sub-domains nested under broad areas such as algebra, number theory, geometry, discrete mathematics, calculus, precalculus and applied mathematics, with the authors reporting that models are notably weaker on discrete mathematics than on algebra or calculus [1].
Difficulty is annotated on a continuous 0 to 10 scale with increments as fine as 0.25, giving more than ten distinct levels. For reporting, the paper aggregates problems into four analysis tiers, and it groups the source competitions into five hierarchical difficulty bands [1].
| Aspect | Value |
|---|---|
| Total problems | 4,428 |
| Sub-domains | 33+ |
| Difficulty scale | continuous 0 to 10 (10+ levels) |
| Analysis tiers | T1 (1 to 3), T2 (3 to 5), T3 (5 to 7), T4 (7 to 10) |
| Competition tiers | 5 (Introductory to Famous Worldwide) |
| Source competitions | about 21 |
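The analysis tiers in the table share their boundary values (3, 5 and 7), and the table does not say which tier a boundary rating belongs to. The following is a minimal sketch of the bucketing, assuming lower-inclusive bins with 10 included in T4. The function name `analysis_tier` and the boundary rule are illustrative assumptions, not the authors' stated procedure, and ratings below 1 are rejected because none of the listed tiers covers them.

```python
def analysis_tier(difficulty: float) -> str:
    """Map a difficulty rating to one of the four analysis tiers.

    Assumption: bins are lower-inclusive, i.e. T1 = [1, 3), T2 = [3, 5),
    T3 = [5, 7), T4 = [7, 10]. The source table does not specify how
    boundary values are assigned.
    """
    if not 1 <= difficulty <= 10:
        raise ValueError(f"difficulty {difficulty} is outside the tiered range 1 to 10")
    if difficulty < 3:
        return "T1"
    if difficulty < 5:
        return "T2"
    if difficulty < 7:
        return "T3"
    return "T4"
```

Under this assumed rule a rating of 2.75 lands in T1 and a rating of 3 lands in T2. A different boundary convention would only move problems rated exactly 3, 5 or 7.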
The problems are sourced from a wide range of national and international contests. The most prestigious source is the International Mathematical Olympiad (IMO), which contributes 75 problems. Other sources include the William Lowell Putnam Mathematical Competition, the United States of America Mathematical Olympiad (USAMO) and Junior Olympiad (USAJMO), the China National Olympiad, the Asian Pacific Mathematics Olympiad (APMO), and regional contests such as the Balkan MO, the Junior Balkan MO, Baltic Way and HMMT. At the easier end are introductory contests such as Pascal, which contributes 249 problems [1][2]. The five competition tiers run from Introductory and Transitional through Intermediate and National or International Difficult to Famous Worldwide, mapping onto the numerical 1 to 10 difficulty scale [2].
To guard against the possibility that high scores merely reflect memorized training data, the authors ran a contamination check using 5-gram overlap. They found only minimal leakage, with the most affected model, Qwen2.5-MATH-72B, showing about 0.70 percent of samples flagged as contaminated and 0.27 percent both contaminated and answered correctly, and concluded that data leakage had little impact on their overall findings [1].
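To make the idea of 5-gram overlap concrete, the sketch below measures what fraction of a problem's word-level 5-grams also appear in some other text. It is a simplified illustration only: the whitespace tokenization, the lowercasing and the function names `ngrams` and `overlap_fraction` are assumptions, and the authors' actual contamination procedure may differ in tokenization, in what text is compared and in how a sample is flagged.

```python
def ngrams(text: str, n: int = 5) -> set:
    """Return the set of word-level n-grams in `text`.

    Assumption: tokens are lowercased, whitespace-separated words.
    Texts shorter than n tokens yield an empty set.
    """
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(problem: str, other_text: str, n: int = 5) -> float:
    """Fraction of the problem's n-grams that also occur in `other_text`."""
    problem_grams = ngrams(problem, n)
    if not problem_grams:
        return 0.0  # too short to form a single n-gram
    shared = problem_grams & ngrams(other_text, n)
    return len(shared) / len(problem_grams)
```

A sample could then be flagged when the fraction exceeds some chosen threshold. The source does not state the threshold, so any value used with this sketch would be an assumption.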
Grading olympiad math is harder than grading multiple-choice questions because solutions are open-ended and a correct final answer can be expressed in many equivalent forms. Using a proprietary model such as GPT-4o as the judge is accurate but costly and not freely reproducible. To address this, the authors built Omni-Judge, an open-source verifier that compares a candidate solution against the reference answer and decides whether they match [1].
Omni-Judge is a fine-tuned version of Meta-Llama-3-8B-Instruct, trained for two epochs on 21,451 examples of GPT-4o judgments. On the authors' internal test set of 2,690 samples it reached roughly 95 percent consistency with GPT-4o, and across the benchmark it achieved over 91 percent consistency with GPT-4o and about 86 percent agreement with human judgments [1][2]. It is in effect an LLM-as-a-judge verifier specialized for mathematics, and a vLLM implementation is provided for efficient inference. The paper also reports a rule-based evaluation as a cross-check, under which o1-mini scores 62.2 percent and o1-preview 51.7 percent, close to the model-judged figures [1][2].
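The consistency and agreement figures above can be read as the share of samples on which two graders return the same verdict. The sketch below computes that share for two lists of boolean verdicts. It assumes simple per-sample agreement is the metric in question, and the function name `agreement_rate` is hypothetical rather than part of any released Omni-Judge code.

```python
from typing import Sequence


def agreement_rate(judge_verdicts: Sequence[bool],
                   reference_verdicts: Sequence[bool]) -> float:
    """Share of samples where both graders give the same correct/incorrect verdict."""
    if len(judge_verdicts) != len(reference_verdicts):
        raise ValueError("verdict lists must have the same length")
    if not judge_verdicts:
        raise ValueError("at least one sample is required")
    matches = sum(j == r for j, r in zip(judge_verdicts, reference_verdicts))
    return matches / len(judge_verdicts)
```

Applied to an open judge's verdicts and GPT-4o's verdicts on the same solutions, a value of 0.95 would correspond to 95 percent consistency in this sense.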
At the benchmark's release in late 2024, OpenAI's reasoning models led by a wide margin, but no system came close to solving the dataset. The table below lists representative overall accuracies reported by the authors and on the official leaderboard [1][2].
| Model | Omni-MATH overall accuracy |
|---|---|
| OpenAI o1-mini | 60.54% |
| OpenAI o1-preview | 52.55% |
| Qwen2.5-MATH-72B-Instruct | 36.20% |
| Qwen2-MATH-72B-Instruct | 33.68% |
| Qwen2.5-MATH-7B-Instruct | 33.22% |
| GPT-4o | 30.49% |
| NuminaMATH-72B-CoT | 28.45% |
| Claude 3.5 Sonnet | 26.23% |
| DeepSeek-Coder-V2 | 25.78% |
| Llama 3.1 70B Instruct | 24.16% |
| DeepSeekMath-7B-RL | 16.12% |
The gap between the o1 series and the other models was striking: o1-mini at 60.5 percent scored roughly twice as high as GPT-4o at 30.5 percent and about 24 points above the best non-o1 model in the table, Qwen2.5-MATH-72B-Instruct at 36.2 percent, underlining how much the reinforcement-learning-trained reasoning models gained on hard mathematics [1]. Even so, performance fell sharply with difficulty. On problems rated level 5 or above, o1-mini dropped to about 48.6 percent, showing that the upper difficulty tiers remained far from solved [2]. Error analysis of the leading models found that logical errors in the reasoning chain were the most common failure, followed by error accumulation across steps and then arithmetic or calculation errors [1].
Through 2025 and into 2026, later reasoning models such as OpenAI's o3, DeepSeek-R1 and Google's Gemini 2.5 reported strong gains on olympiad-style math, and DeepSeek-R1 was noted to outperform o3-mini on Omni-MATH in contemporaneous comparisons [4]. These advances paralleled broader milestones in machine mathematical reasoning, including AI systems reaching gold-medal-level performance on IMO-style problem sets, which prompted the community to keep introducing harder successors and more contamination-resistant evaluations rather than relying on any single static benchmark.
Omni-MATH occupies a clear niche in the landscape of math reasoning evaluation. Where GSM8K and MATH measured arithmetic and high-school competition skills that frontier models had effectively mastered by 2024, and where AIME-style benchmarks offered only a handful of problems per year, Omni-MATH supplied thousands of olympiad-tier problems with the domain and difficulty metadata needed for diagnostic analysis [1]. That granularity lets researchers see not just whether a model is good at math but where it breaks down, by topic and by difficulty, which is more informative than a single aggregate number.
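The kind of diagnostic breakdown described here amounts to grouping graded results by a metadata field and computing accuracy within each group. The sketch below shows one way to do that. The record layout (dictionaries with a grouping field such as `domain` and a boolean `correct` field) and the function name `accuracy_by` are assumptions made for illustration, not the dataset's actual schema.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Mapping


def accuracy_by(records: Iterable[Mapping], key: str) -> Dict[Hashable, float]:
    """Group graded records by `key` and return per-group accuracy.

    Assumption: each record is a mapping holding the grouping field `key`
    (e.g. a domain or tier label) and a boolean "correct" field.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for record in records:
        group = record[key]
        totals[group] += 1
        correct[group] += int(record["correct"])
    return {group: correct[group] / totals[group] for group in totals}
```

Calling the function once with a domain field and once with a difficulty-tier field gives the two views, by topic and by difficulty, that a single aggregate score hides.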
Its second lasting contribution is methodological. By releasing Omni-Judge as an open, reproducible verifier with high agreement with both GPT-4o and human graders, the authors lowered the cost and improved the reproducibility of evaluating free-form mathematical solutions, a recurring pain point for open-ended benchmarks [1][2]. Together with OlympiadBench and other olympiad-level suites, Omni-MATH helped establish competition mathematics as a standard proving ground for reasoning models in the 2024 to 2026 era, and it remains a reference point for tracking how quickly frontier systems are closing the gap to expert human mathematicians.