MathArena
| MathArena | |
|---|---|
| Overview | |
| Full name | MathArena: Evaluating LLMs on Uncontaminated Math Competitions |
| Description | A benchmark for evaluating large language models on uncontaminated math competitions with both final-answer and proof-based problems |
| Release date | 2025-05-29 |
| Latest version | MathArena Apex |
| Benchmark updated | 2025-08-18 |
| Authors | Mislav Balunović, Jasper Dekoninck, Ivo Petrov, Nikola Jovanović, Martin Vechev |
| Organization | ETH Zürich, SRI Lab |
| Technical Details | |
| Type | Mathematical Reasoning, Proof Writing |
| Modality | Text, Mathematical Notation |
| Task format | Final answers, Mathematical proofs, Code-based solutions |
| Number of tasks | 149+ |
| Total examples | Varies by competition |
| Evaluation metric | Pass@1 Accuracy, Human-judged proof scores |
| Domains | Algebra, Number Theory, Combinatorics, Geometry, Analysis |
| Languages | English, Mathematical LaTeX |
| Performance | |
| Human performance | 35.7% (USAMO 2025 median) |
| SOTA score | 31.0% (IMO 2025) |
| SOTA model | Gemini 2.5 Pro |
| SOTA date | 2025-08-18 |
| Saturated | No |
| Resources | |
| Website | https://matharena.ai/ |
| Paper | arXiv:2505.23281 |
| GitHub | https://github.com/eth-sri/matharena |
| Dataset | Hosted on HuggingFace |
MathArena is a comprehensive benchmark designed to evaluate large language models (LLMs) on uncontaminated mathematical competitions and olympiads. Developed by researchers at ETH Zürich's SRI Lab and released in May 2025, MathArena addresses critical issues in existing mathematical benchmarks by providing real-time evaluation on newly released problems and introducing the first systematic assessment of proof writing capabilities in AI systems. The platform evaluates models on actual mathematical competitions including AIME, USAMO, IMO (International Mathematical Olympiad), and Project Euler, ensuring that test problems have not been encountered during model training.[1][2]
Overview
MathArena represents a paradigm shift in evaluating mathematical capabilities of AI systems by focusing on genuine mathematical reasoning rather than pattern recognition from potentially contaminated training data. The benchmark uniquely combines evaluation of computational problem-solving with formal mathematical proof writing, providing a comprehensive assessment of mathematical intelligence in LLMs.[1]
The platform operates on a principle of continuous evaluation, adding new problems as they are released from official mathematical competitions worldwide. This approach ensures that models are tested on problems they could not have seen during training, providing an authentic measure of their mathematical reasoning capabilities rather than their ability to retrieve memorized solutions.
Key Innovations
MathArena introduces several groundbreaking features to mathematical AI evaluation:
- Contamination elimination: Real-time evaluation on newly released competition problems
- Proof-writing assessment: First benchmark to systematically evaluate formal mathematical proofs
- Human expert evaluation: IMO-level judges assess proof quality and rigor
- Multi-format support: Handles final-answer, proof-based, and code-based mathematical problems
- Transparency: Open-source implementation with detailed methodology
Methodology
Problem Categories
MathArena evaluates models across three distinct problem types:[2]
Final-Answer Competitions
These problems require numerical or algebraic answers without formal proofs:
- AIME (American Invitational Mathematics Examination)
- HMMT (Harvard-MIT Mathematics Tournament)
- BRUMO (Brown University Mathematics Olympiad)
- SMT (Stanford Mathematics Tournament)
Models are evaluated using Pass@1 accuracy, with 4 standard runs per problem to account for variance in generation.
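The following minimal Python sketch illustrates how Pass@1 accuracy can be aggregated from repeated runs; the function and variable names are illustrative and not taken from the MathArena codebase.

```python
from statistics import mean

def pass_at_1(run_results: dict[str, list[bool]]) -> float:
    """Aggregate Pass@1 accuracy from repeated runs.

    run_results maps each problem ID to a list of per-run booleans
    (True if that run produced the correct final answer). With k runs
    per problem, the per-problem estimate is the fraction of correct
    runs, and the benchmark score averages this over all problems.
    """
    per_problem = [mean(1.0 if ok else 0.0 for ok in runs)
                   for runs in run_results.values()]
    return mean(per_problem)

# Example: two problems, 4 runs each (the standard MathArena setting).
results = {
    "problem_1": [True, True, False, True],     # 3/4 correct
    "problem_2": [False, False, False, False],  # 0/4 correct
}
print(f"Pass@1: {pass_at_1(results):.2%}")  # Pass@1: 37.50%
```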
Proof-Based Competitions
These require formal mathematical proofs judged by human experts:
- USAMO (USA Mathematical Olympiad)
- IMO (International Mathematical Olympiad)
- Putnam Competition
Each proof is scored on a 7-point scale by experienced judges with olympiad-level expertise. Responses are anonymized and evaluated independently by two judges to ensure fairness.
Math+Code Problems
- Project Euler: Computational mathematics problems requiring both mathematical insight and programming implementation
Models can use either custom scaffolding with multi-turn code execution or provider-specific code interpreters.
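A minimal sketch of what such custom multi-turn scaffolding could look like is shown below; it assumes a hypothetical `query_model(messages)` helper (not part of the MathArena codebase) that returns the model's next message as a string.

```python
import re
import subprocess

def run_python(code: str, timeout: int = 60) -> str:
    """Execute a model-written Python snippet and return its combined output."""
    proc = subprocess.run(["python", "-c", code],
                          capture_output=True, text=True, timeout=timeout)
    return proc.stdout + proc.stderr

def solve_with_code(problem: str, query_model, max_turns: int = 5) -> str:
    """Multi-turn loop: the model writes code, sees its output, and iterates
    until it replies without a code block (treated as the final answer)."""
    messages = [{"role": "user", "content": problem}]
    reply = ""
    for _ in range(max_turns):
        reply = query_model(messages)            # hypothetical LLM call
        messages.append({"role": "assistant", "content": reply})
        match = re.search(r"`{3}python\n(.*?)`{3}", reply, re.DOTALL)
        if match is None:                        # no code block: final answer
            break
        output = run_python(match.group(1))
        messages.append({"role": "user",
                         "content": f"Execution output:\n{output}"})
    return reply
```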
MathArena Apex
Introduced in August 2025, MathArena Apex is a curated collection of recent final-answer problems specifically selected to challenge state-of-the-art models. These problems are evaluated with 16 runs per model (compared to 4 standard runs) for more robust statistical analysis.[3]
Evaluation Framework
The evaluation process employs rigorous statistical methods:[1]
- Standard evaluation: 4 runs per problem for final-answer questions
- Apex evaluation: 16 runs per problem for enhanced statistical significance
- Token limits: 64,000 tokens standard (128,000 for specific models)
- Cost tracking: Detailed USD cost calculation per model evaluation
- Confidence intervals: Statistical analysis with configurable significance levels
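As an illustration of the confidence-interval item above, the following is a generic normal-approximation sketch rather than the project's actual statistical code.

```python
import math

def accuracy_confidence_interval(correct: int, total: int, z: float = 1.96):
    """Normal-approximation confidence interval for an accuracy estimate.

    correct / total counts successes over all (problem, run) pairs;
    z = 1.96 corresponds to a 95% confidence level.
    """
    p = correct / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return max(0.0, p - half_width), min(1.0, p + half_width)

# Example: 15 problems x 4 runs = 60 attempts, 45 of them correct.
low, high = accuracy_confidence_interval(45, 60)
print(f"accuracy {45/60:.1%}, 95% CI [{low:.1%}, {high:.1%}]")
# accuracy 75.0%, 95% CI [64.0%, 86.0%]
```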
For proof-based problems:
- Scoring system: 7 points per problem (IMO standard)
- Judge selection: 4 experienced judges with olympiad-level expertise
- Double-blind review: Anonymized responses graded independently
- Real-time assessment: Evaluation begins immediately after problem release
Performance Results
IMO 2025 Results
Performance on the International Mathematical Olympiad 2025 problems (6 problems, 42 points total):[2]
| Rank | Model | Organization | Score | Percentage | Medal Achievement |
|---|---|---|---|---|---|
| 1 | Gemini 2.5 Pro | Google | 13 points | 31.0% | Below Bronze (requires 19 points) |
| 2 | Grok 4 (updated prompt) | xAI | 9 points | 21.43% | No medal |
| 3 | o3 | OpenAI | 8 points | 19.05% | No medal |
| 4 | GPT-5 | OpenAI | 7 points | 16.67% | No medal |
| 5 | Claude 4 Sonnet | Anthropic | 6 points | 14.29% | No medal |
No model achieved the Bronze medal threshold of 19/42 points, highlighting the difficulty of IMO problems.
USAMO 2025 Performance
USA Mathematical Olympiad 2025 results (6 problems, 42 points total, proof-based):[4]
| Rank | Model | Organization | Score | Percentage |
|---|---|---|---|---|
| 1 | Gemini 2.5 Pro | Google | 10.1 points | 24.0% |
| 2 | o3 | OpenAI | 9.2 points | 21.9% |
| 3 | o4-mini | OpenAI | 8.1 points | 19.3% |
| 4 | GPT-5 | OpenAI | 7.5 points | 17.9% |
| 5 | Claude 4 Opus | Anthropic | 6.8 points | 16.2% |
Human median performance: 35.7%
Final-Answer Competition Performance
Top models on numerical problem competitions achieve significantly higher scores:[2]
| Model | Average Accuracy | Strength |
|---|---|---|
| o3 HIGH | 87% | Exceeds top 1% human performance |
| o4-mini HIGH | 86% | Exceeds top 1% human performance |
| Gemini 2.5 Pro | 86% | Exceeds top 1% human performance |
| GPT-5 | 82% | Strong performance |
| Claude 4 Opus | 79% | Strong performance |
MathArena Apex Results
Performance on the specially curated challenging problems (as of August 2025):[3]
| Rank | Model | Accuracy | Cost (USD) |
|---|---|---|---|
| 1 | Qwen3-A22B-2507-Think | 5.21% | $9.89 |
| 2 | Grok 4 | 2.08% | $99.39 |
| 3 | GPT-5 (High) Agent | 2.08% | $183.79 |
| 4 | GPT-5-mini (high) | 1.04% | $13.42 |
| 5 | GLM 4.5 | 1.04% | $14.50 |
These results demonstrate that even state-of-the-art models struggle with carefully selected challenging problems.
Problem Domains and Difficulty
Mathematical Domains Covered
MathArena problems span the full spectrum of competition mathematics:[1]
- Algebra: Polynomial equations, functional equations, inequalities
- Number Theory: Divisibility, modular arithmetic, Diophantine equations
- Combinatorics: Counting, graph theory, combinatorial geometry
- Geometry: Euclidean geometry, coordinate geometry, transformations
- Analysis: Calculus, sequences, series (primarily in Putnam)
- Discrete Mathematics: Logic, set theory, algorithms
Difficulty Classification
MathArena uses a color-coded system to indicate problem difficulty based on model performance:[2]
| Color | Success Rate | Interpretation |
|---|---|---|
| Green | >75% | Routinely solved by models |
| Yellow | 25-75% | Moderate difficulty |
| Orange | 1-24% | Very challenging |
| Red | 0% | Unsolved by any model |
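A minimal Python sketch of this banding follows; the exact handling of boundary values is an assumption, and MathArena's own implementation may differ.

```python
def difficulty_color(success_rate: float) -> str:
    """Map a problem's model success rate (as a fraction, 0.0-1.0) to the
    color-coded difficulty bands described in the table above."""
    if success_rate > 0.75:
        return "green"   # routinely solved by models
    if success_rate >= 0.25:
        return "yellow"  # moderate difficulty
    if success_rate > 0.0:
        return "orange"  # very challenging
    return "red"         # unsolved by any model

print(difficulty_color(0.50))  # yellow
print(difficulty_color(0.00))  # red
```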
Performance Analysis by Domain
Research reveals significant variation in model performance across mathematical domains:[1]
- Strongest performance: Algebra and Number Theory problems
- Moderate performance: Analysis and discrete mathematics
- Weakest performance: Combinatorics and Geometry problems
This pattern suggests that current LLMs excel at symbolic manipulation and pattern-based reasoning but struggle with spatial reasoning and complex combinatorial arguments.
Contamination Analysis
Evidence of Training Data Contamination
MathArena's analysis revealed significant contamination in existing benchmarks:[1]
AIME 2024 Contamination
Models showed anomalous performance on AIME 2024 problems:
- Expected performance based on human percentiles: ~25-30%
- Actual model performance: ~85-90%
- Conclusion: Strong evidence of training data contamination
Comparison with Fresh Problems
When evaluated on genuinely new problems:
- Performance drops by 50-70% compared to potentially contaminated benchmarks
- Models struggle with novel problem formulations
- Significant gap between memorization and genuine reasoning
Contamination Prevention Measures
MathArena implements several strategies to ensure evaluation integrity:
- Real-time evaluation: Problems added immediately upon public release
- Embargo period: No public release of solutions during evaluation
- Version tracking: Monitoring of problem appearances in training corpora
- Statistical analysis: Detection of anomalous performance patterns
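The statistical check can be illustrated with a simple heuristic that compares observed accuracy against the range expected from human-percentile calibration; this is an illustrative sketch, not MathArena's actual test.

```python
def flag_contamination(model_accuracy: float,
                       expected_low: float,
                       expected_high: float,
                       margin: float = 0.20) -> bool:
    """Flag a suspiciously large gap between observed model accuracy and the
    accuracy range expected from human-percentile calibration.

    Example calibration from the AIME 2024 analysis above: expected
    ~0.25-0.30, observed ~0.85-0.90, a gap far beyond any plausible margin.
    """
    return model_accuracy > expected_high + margin

print(flag_contamination(0.88, 0.25, 0.30))  # True: likely contaminated
print(flag_contamination(0.32, 0.25, 0.30))  # False: within expectations
```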
Technical Implementation
Infrastructure
MathArena provides a comprehensive evaluation framework:[5]
- Open source: Complete codebase available on GitHub
- Package management: UV package manager for dependencies
- Dataset hosting: Problems and solutions on HuggingFace (see the loading sketch after this list)
- API support: Integration with major LLM providers
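Problems hosted on HuggingFace can, for example, be loaded with the `datasets` library as sketched below; the dataset identifier and split are hypothetical and must be replaced with the actual MathArena dataset name.

```python
from datasets import load_dataset

# Hypothetical dataset ID shown for illustration only.
dataset = load_dataset("MathArena/aime-2025", split="train")

# Field names follow the dataset format described later in this article.
for record in dataset:
    print(record["problem_idx"], record["problem"][:80])
```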
Evaluation Pipeline
The standard evaluation process consists of:
- Problem selection: Choosing competitions and specific problems
- Model configuration: Setting token limits and generation parameters
- Response generation: Multiple runs per problem for statistical validity
- Answer extraction: Parsing numerical answers or proof text (see the sketch after this list)
- Scoring: Automated for final answers, human judges for proofs
- Statistical analysis: Confidence intervals and significance testing
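A minimal sketch of the answer-extraction and automated-scoring steps is shown below; it assumes, purely for illustration, that models wrap final answers in \boxed{...}, which may differ from the parser used in the MathArena repository.

```python
import re

def extract_final_answer(response: str) -> str | None:
    """Pull a final answer out of a model response.

    Assumes (as an illustration) that the final answer is wrapped in
    \\boxed{...}; the last such expression is taken as the answer.
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1].strip() if matches else None

def score_final_answer(response: str, ground_truth: str) -> bool:
    """Automated scoring for final-answer problems: exact match after
    trimming whitespace (proof problems go to human judges instead)."""
    answer = extract_final_answer(response)
    return answer is not None and answer == ground_truth.strip()

print(score_final_answer(r"... so the answer is \boxed{204}.", "204"))  # True
```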
Dataset Format
MathArena datasets follow a standardized structure:[5]
- Required fields:
  * `problem_idx`: Unique identifier
  * `problem`: LaTeX-formatted problem statement
  * `answer`: Ground truth (optional for proof problems)
- Optional fields:
  * `points`: Maximum score
  * `sample_solution`: Example solution
  * `grading_scheme`: Scoring rubric
  * `difficulty`: Problem difficulty rating
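A minimal sketch of a record with these fields and a simple schema check is given below; it is illustrative only and may differ from the loader in the MathArena repository.

```python
REQUIRED_FIELDS = {"problem_idx", "problem"}   # `answer` may be absent for proof problems
OPTIONAL_FIELDS = {"answer", "points", "sample_solution", "grading_scheme", "difficulty"}

def validate_record(record: dict) -> None:
    """Check a single problem record against the schema described above."""
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        raise ValueError(f"record is missing required fields: {sorted(missing)}")
    unknown = record.keys() - REQUIRED_FIELDS - OPTIONAL_FIELDS
    if unknown:
        raise ValueError(f"record has unrecognized fields: {sorted(unknown)}")

# Example record (hypothetical identifier and placeholder problem text).
validate_record({
    "problem_idx": "imo_2025_p3",
    "problem": r"Let $n$ be a positive integer such that ...",
    "points": 7,
})
```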
Human Evaluation Process
Judge Selection and Training
MathArena employs expert mathematicians for proof evaluation:[4]
- Qualification requirements: IMO-level competition experience
- Training process: Calibration on sample problems with known scores
- Ongoing quality control: Inter-rater reliability monitoring
- Judge pool: 4 core judges with additional experts for specific domains
Grading Methodology
The proof grading process follows IMO standards:
- Anonymization: All model identifiers removed
- Independent review: Two judges score each proof separately
- Score reconciliation: Discussion for scores differing by >2 points
- Final score: Average of two judge scores
- Appeals process: Third judge for contested scores
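The reconciliation and averaging rules can be sketched as follows; `reconcile` stands in for the judges' discussion and is not a function from the MathArena codebase.

```python
def final_proof_score(judge_a: float, judge_b: float, reconcile) -> float:
    """Combine two independent judge scores (0-7 each, IMO scale).

    If the scores differ by more than 2 points, `reconcile` (a stand-in for
    the judges' discussion) returns an agreed pair; the final score is the
    average of the two scores, as described in the list above.
    """
    if abs(judge_a - judge_b) > 2:
        judge_a, judge_b = reconcile(judge_a, judge_b)
    return (judge_a + judge_b) / 2

# Judges agree closely: no discussion needed, final score is the average.
print(final_proof_score(5, 6, reconcile=lambda a, b: (a, b)))  # 5.5
# Large disagreement: discussion (simulated here) narrows the gap first.
print(final_proof_score(2, 6, reconcile=lambda a, b: (4, 5)))  # 4.5
```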
Evaluation Criteria
Proofs are assessed on multiple dimensions:
- Correctness: Mathematical validity of arguments
- Completeness: All cases considered and proven
- Rigor: Formal mathematical exposition
- Clarity: Logical flow and presentation
- Innovation: Creative problem-solving approaches
Impact and Significance
Research Community Impact
MathArena has influenced mathematical AI research through:[1]
- Benchmark adoption: Used by major AI labs for model development
- Methodology influence: Contamination-aware evaluation becoming standard
- Dataset contributions: Open Proof Corpus with 5,000+ evaluated proofs
- Research direction: Renewed focus on genuine reasoning vs. memorization
Industry Adoption
Leading AI companies use MathArena for:
- Model evaluation: Assessing mathematical capabilities
- Training validation: Ensuring models aren't overfitting to known problems
- Competitive analysis: Comparing model performance
- Development priorities: Identifying areas for improvement
Educational Applications
MathArena data supports:
- AI tutoring systems: Understanding common solution approaches
- Problem generation: Creating new mathematical challenges
- Student assessment: Comparing human and AI performance
- Curriculum development: Identifying conceptual difficulties
Limitations and Future Work
Current Limitations
MathArena acknowledges several constraints:[1]
- Language limitation: Currently English-only problems
- Competition focus: May not represent all mathematical reasoning
- Human evaluation bottleneck: Proof grading requires expert time
- Cost considerations: Expensive to run comprehensive evaluations
Planned Extensions
The MathArena team has outlined future developments:
- Multilingual support: Problems in multiple languages
- Automated proof checking: Integration with formal verification systems
- Interactive problem solving: Multi-turn mathematical dialogue
- Curriculum coverage: Beyond competition mathematics
- Real-time leaderboard: Continuous model evaluation
Related Benchmarks
MathArena complements and improves upon existing mathematical benchmarks:
- GSM8K: Grade school math problems (contamination issues identified)
- MATH: High school competition problems (static dataset)
- MMLU Mathematics: Multiple choice questions (limited depth)
- Minerva: Mathematical reasoning benchmark
- MathVista: Visual mathematical reasoning
- OlympiadBench: Olympiad problems (predecessor to MathArena)
See Also
- Mathematical reasoning
- Large language models
- International Mathematical Olympiad
- Proof assistant
- Automated theorem proving
- Mathematics competitions
- AI benchmarking
References
- [1] Balunović, Mislav, et al. "MathArena: Evaluating LLMs on Uncontaminated Math Competitions." arXiv preprint arXiv:2505.23281 (2025).
- [2] MathArena Official Website. https://matharena.ai/. Accessed August 2025.
- [3] MathArena Team. "MathArena Apex: Challenging SOTA Models." August 2025. https://matharena.ai/apex
- [4] Petrov, Ivo, et al. "Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad." arXiv preprint arXiv:2503.21934 (2025).
- [5] MathArena GitHub Repository. https://github.com/eth-sri/matharena. Accessed August 2025.