| MathArena | |
|---|---|
| Overview | |
| Full name | MathArena: Evaluating LLMs on Uncontaminated Math Competitions |
| Description | A benchmark for evaluating large language models on uncontaminated math competitions with both final-answer and proof-based problems |
| Release date | 2025-05-29 |
| Latest version | MathArena Apex |
| Benchmark updated | 2025-08-18 |
| Authors | Mislav Balunović, Jasper Dekoninck, Ivo Petrov, Nikola Jovanović, Martin Vechev |
| Organization | ETH Zürich, SRI Lab |
| Technical Details | |
| Type | Mathematical Reasoning, Proof Writing |
| Modality | Text, Mathematical Notation |
| Task format | Final answers, Mathematical proofs, Code-based solutions |
| Number of tasks | 149+ |
| Total examples | Varies by competition |
| Evaluation metric | Pass@1 Accuracy, Human-judged proof scores |
| Domains | Algebra, Number Theory, Combinatorics, Geometry, Analysis |
| Languages | English, Mathematical LaTeX |
| Performance | |
| Human performance | 35.7% (USAMO 2025 median) |
| SOTA score | 31.0% (IMO 2025) |
| SOTA model | Gemini 2.5 Pro |
| SOTA date | 2025-08-18 |
| Saturated | No |
| Resources | |
| Website | Official website |
| Paper | Paper |
| GitHub | Repository |
| Dataset | Download |
MathArena is a comprehensive benchmark designed to evaluate large language models (LLMs) on uncontaminated mathematical competitions and olympiads. Developed by researchers at ETH Zürich's SRI Lab and released in May 2025, MathArena addresses critical issues in existing mathematical benchmarks by providing real-time evaluation on newly released problems and introducing the first systematic assessment of proof writing capabilities in AI systems. The platform evaluates models on actual mathematical competitions including AIME, USAMO, IMO (International Mathematical Olympiad), and Project Euler, ensuring that test problems have not been encountered during model training.[1][2]
MathArena represents a paradigm shift in evaluating mathematical capabilities of AI systems by focusing on genuine mathematical reasoning rather than pattern recognition from potentially contaminated training data. The benchmark uniquely combines evaluation of computational problem-solving with formal mathematical proof writing, providing a comprehensive assessment of mathematical intelligence in LLMs.[1]
The platform operates on a principle of continuous evaluation, adding new problems as they are released from official mathematical competitions worldwide. This approach ensures that models are tested on problems they could not have seen during training, providing an authentic measure of their mathematical reasoning capabilities rather than their ability to retrieve memorized solutions.
MathArena introduces several groundbreaking features to mathematical AI evaluation, most notably its coverage of three distinct problem types.[2]
Final-answer problems require numerical or algebraic answers without formal proofs.
Models are evaluated using Pass@1 accuracy, with 4 standard runs per problem to account for variance in generation.
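The repeated-run scoring described above can be sketched as follows. This is a minimal illustration, not MathArena's actual harness; the function name and input format are assumptions.

```python
from statistics import mean

def pass_at_1(run_results: list[bool]) -> float:
    """Estimate pass@1 for one problem as the fraction of
    independent runs whose final answer matched the ground truth."""
    return mean(1.0 if ok else 0.0 for ok in run_results)

# Four standard runs on one problem: three correct, one incorrect.
print(pass_at_1([True, True, False, True]))  # 0.75
```

Averaging over several runs smooths out sampling variance in the model's generations, so a problem's score reflects typical rather than best-case behavior.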
Proof-based problems require formal mathematical proofs judged by human experts.
Each proof is scored on a 7-point scale by experienced judges with olympiad-level expertise. Responses are anonymized and evaluated independently by two judges to ensure fairness.
For code-based problems, models can use either custom scaffolding with multi-turn code execution or provider-specific code interpreters.
Introduced in August 2025, MathArena Apex is a curated collection of recent final-answer problems specifically selected to challenge state-of-the-art models. These problems are evaluated with 16 runs per model (compared to 4 standard runs) for more robust statistical analysis.[3]
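The benefit of 16 runs over 4 can be seen from the standard error of a per-problem accuracy estimate. A rough sketch under the usual independent-Bernoulli-trials assumption (the function name is illustrative, not from MathArena):

```python
import math

def stderr_of_accuracy(p: float, n_runs: int) -> float:
    """Standard error of an accuracy estimate from n independent
    runs, each succeeding with probability p."""
    return math.sqrt(p * (1 - p) / n_runs)

# Quadrupling the run count from 4 to 16 halves the standard error.
print(stderr_of_accuracy(0.5, 4))   # 0.25
print(stderr_of_accuracy(0.5, 16))  # 0.125
```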
The evaluation process employs rigorous statistical methods, including repeated runs per problem for final-answer tasks and independent double grading for proof-based problems.[1]
Performance on the International Mathematical Olympiad 2025 problems (6 problems, 42 points total):[2]
| Rank | Model | Organization | Score | Percentage | Medal Achievement |
|---|---|---|---|---|---|
| 1 | Gemini 2.5 Pro | Google | 13 points | 31.0% | Below Bronze (requires 19 points) |
| 2 | Grok 4 (updated prompt) | xAI | 9 points | 21.43% | No medal |
| 3 | o3 | OpenAI | 8 points | 19.05% | No medal |
| 4 | GPT-5 | OpenAI | 7 points | 16.67% | No medal |
| 5 | Claude 4 Sonnet | Anthropic | 6 points | 14.29% | No medal |
No model achieved the Bronze medal threshold of 19/42 points, highlighting the difficulty of IMO problems.
USA Mathematical Olympiad 2025 results (6 problems, 42 points total, proof-based):[4]
| Rank | Model | Organization | Score | Percentage |
|---|---|---|---|---|
| 1 | Gemini 2.5 Pro | Google | 10.1 points | 24.0% |
| 2 | o3 | OpenAI | 9.2 points | 21.9% |
| 3 | o4-mini | OpenAI | 8.1 points | 19.3% |
| 4 | GPT-5 | OpenAI | 7.5 points | 17.9% |
| 5 | Claude 4 Opus | Anthropic | 6.8 points | 16.2% |
Human median performance: 35.7%
Top models on numerical problem competitions achieve significantly higher scores:[2]
| Model | Average Accuracy | Strength |
|---|---|---|
| o3 HIGH | 87% | Exceeds top 1% human performance |
| o4-mini HIGH | 86% | Exceeds top 1% human performance |
| Gemini 2.5 Pro | 86% | Exceeds top 1% human performance |
| GPT-5 | 82% | Strong performance |
| Claude 4 Opus | 79% | Strong performance |
Performance on the specially curated MathArena Apex problems (as of August 2025):[3]
| Rank | Model | Accuracy | Cost (USD) |
|---|---|---|---|
| 1 | Qwen3-A22B-2507-Think | 5.21% | $9.89 |
| 2 | Grok 4 | 2.08% | $99.39 |
| 3 | GPT-5 (High) Agent | 2.08% | $183.79 |
| 4 | GPT-5-mini (high) | 1.04% | $13.42 |
| 5 | GLM 4.5 | 1.04% | $14.50 |
These results demonstrate that even state-of-the-art models struggle with carefully selected challenging problems.
MathArena problems span the full spectrum of competition mathematics, covering algebra, number theory, combinatorics, geometry, and analysis.[1]
MathArena uses a color-coded system to indicate problem difficulty based on model performance:[2]
| Color | Success Rate | Interpretation |
|---|---|---|
| Green | >75% | Routinely solved by models |
| Yellow | 25-75% | Moderate difficulty |
| Orange | 1-24% | Very challenging |
| Red | 0% | Unsolved by any model |
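The thresholds in the table can be expressed as a small helper. A minimal sketch; the behavior at the exact 25% and 75% boundaries is an interpretation, since the table lists integer percentage ranges.

```python
def difficulty_color(success_rate: float) -> str:
    """Map a problem's model success rate (0.0-1.0) to its
    difficulty color, per the thresholds in the table above."""
    if success_rate > 0.75:
        return "green"    # routinely solved by models
    if success_rate >= 0.25:
        return "yellow"   # moderate difficulty
    if success_rate > 0.0:
        return "orange"   # very challenging
    return "red"          # unsolved by any model

print(difficulty_color(0.80))  # green
print(difficulty_color(0.10))  # orange
print(difficulty_color(0.0))   # red
```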
Research reveals significant variation in model performance across mathematical domains: models fare best on algebra and number theory problems and worst on geometry and combinatorics.[1] This pattern suggests that current LLMs excel at symbolic manipulation and pattern-based reasoning but struggle with spatial reasoning and complex combinatorial arguments.
MathArena's analysis revealed significant contamination in existing benchmarks: models posted anomalously strong scores on AIME 2024 problems, which predate the training cutoffs of recent models, yet performed markedly worse when evaluated on genuinely new problems.[1]
MathArena implements several strategies to ensure evaluation integrity, chief among them evaluating models on competition problems as soon as they are officially released, before they can enter any training corpus.
MathArena provides an open evaluation framework for reproducing its results.[5] The standard process runs each model multiple times per problem and verifies final answers against the ground truth.
MathArena datasets follow a standardized structure:[5]

* `problem_idx`: Unique identifier
* `problem`: LaTeX-formatted problem statement
* `answer`: Ground truth (optional for proof problems)

Proof-based problems carry additional fields:

* `points`: Maximum score
* `sample_solution`: Example solution
* `grading_scheme`: Scoring rubric
* `difficulty`: Problem difficulty rating
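A record with these fields could be modeled as below. This is a sketch using the field names listed above; the exact types and schema are assumptions, not the official MathArena format.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ProofProblem:
    """One proof-based problem record (hypothetical schema)."""
    problem_idx: int
    problem: str                  # LaTeX-formatted statement
    points: int                   # maximum score
    sample_solution: str          # example solution
    grading_scheme: str           # scoring rubric
    difficulty: str               # difficulty rating
    answer: Optional[str] = None  # optional for proof problems

p = ProofProblem(
    problem_idx=1,
    problem=r"Prove that $\sqrt{2}$ is irrational.",
    points=7,
    sample_solution=r"Assume $\sqrt{2} = p/q$ in lowest terms ...",
    grading_scheme="Full marks for a complete contradiction argument.",
    difficulty="easy",
)
print(p.points)  # 7
```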
MathArena employs expert mathematicians for proof evaluation.[4] The grading process follows IMO standards: anonymized proofs are scored on a 7-point scale and assessed on multiple dimensions, with two judges with olympiad-level expertise grading each response independently.
MathArena has influenced mathematical AI research: leading AI companies use it to evaluate frontier models, and its publicly available results support research on mathematical reasoning in LLMs.[1]
MathArena's authors acknowledge several constraints of the benchmark and have outlined directions for future development.[1] Overall, MathArena complements and improves upon existing mathematical benchmarks by focusing on contamination-free, continuously updated evaluation.