MathArena

MathArena
Overview
Full name MathArena: Evaluating LLMs on Uncontaminated Math Competitions
Description A benchmark for evaluating large language models on uncontaminated math competitions with both final-answer and proof-based problems

Property "Description" (as page type) with input value "A benchmark for evaluating large language models on uncontaminated math competitions with both final-answer and proof-based problems" contains invalid characters or is incomplete and therefore can cause unexpected results during a query or annotation process.

Release date 2025-05-29
Latest version MathArena Apex
Benchmark updated 2025-08-18
Authors Mislav Balunović, Jasper Dekoninck, Ivo Petrov, Nikola Jovanović, Martin Vechev
Organization ETH Zürich, SRI Lab
Technical Details
Type Mathematical Reasoning, Proof Writing
Modality Text, Mathematical Notation
Task format Final answers, Mathematical proofs, Code-based solutions
Number of tasks 149+
Total examples Varies by competition
Evaluation metric Pass@1 Accuracy, Human-judged proof scores
Domains Algebra, Number Theory, Combinatorics, Geometry, Analysis
Languages English, Mathematical LaTeX
Performance
Human performance 35.7% (USAMO 2025 median)
SOTA score 31.0%
SOTA model Gemini 2.5 Pro
SOTA date 2025-08-18
Saturated No
Resources
Website https://matharena.ai/
Paper arXiv:2505.23281
GitHub https://github.com/eth-sri/matharena
Dataset Download



MathArena is a comprehensive benchmark designed to evaluate large language models (LLMs) on uncontaminated mathematical competitions and olympiads. Developed by researchers at ETH Zürich's SRI Lab and released in May 2025, MathArena addresses critical issues in existing mathematical benchmarks by providing real-time evaluation on newly released problems and introducing the first systematic assessment of proof writing capabilities in AI systems. The platform evaluates models on actual mathematical competitions including AIME, USAMO, IMO (International Mathematical Olympiad), and Project Euler, ensuring that test problems have not been encountered during model training.[1][2]

Overview

MathArena represents a paradigm shift in evaluating mathematical capabilities of AI systems by focusing on genuine mathematical reasoning rather than pattern recognition from potentially contaminated training data. The benchmark uniquely combines evaluation of computational problem-solving with formal mathematical proof writing, providing a comprehensive assessment of mathematical intelligence in LLMs.[1]

The platform operates on a principle of continuous evaluation, adding new problems as they are released from official mathematical competitions worldwide. This approach ensures that models are tested on problems they could not have seen during training, providing an authentic measure of their mathematical reasoning capabilities rather than their ability to retrieve memorized solutions.

Key Innovations

MathArena introduces several distinguishing features for mathematical AI evaluation:

  • Contamination elimination: Real-time evaluation on newly released competition problems
  • Proof-writing assessment: First benchmark to systematically evaluate formal mathematical proofs
  • Human expert evaluation: IMO-level judges assess proof quality and rigor
  • Multi-format support: Handles final-answer, proof-based, and code-based mathematical problems
  • Transparency: Open-source implementation with detailed methodology

Methodology

Problem Categories

MathArena evaluates models across three distinct problem types:[2]

Final-Answer Competitions

These problems require numerical or algebraic answers without formal proofs:

  • AIME (American Invitational Mathematics Examination)
  • HMMT (Harvard-MIT Mathematics Tournament)
  • BRUMO (Brown University Mathematics Olympiad)
  • SMT (Stanford Mathematics Tournament)

Models are evaluated using Pass@1 accuracy, with 4 standard runs per problem to account for variance in generation.
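
Aggregating the runs into a single Pass@1 figure amounts to averaging per-problem success rates. The following sketch illustrates that computation; the function and variable names are illustrative and not taken from the MathArena codebase.

```python
from statistics import mean

def pass_at_1(run_results: list[list[bool]]) -> float:
    """Estimate Pass@1 as the average per-problem success rate.

    run_results[i] holds the correctness of each run of problem i,
    e.g. 4 runs per problem in the standard MathArena setting.
    """
    per_problem = [mean(runs) for runs in run_results]  # success rate per problem
    return mean(per_problem)                            # averaged over problems

# Example: 3 problems with 4 runs each
results = [
    [True, True, False, True],     # solved in 3 of 4 runs
    [False, False, False, False],  # never solved
    [True, True, True, True],      # always solved
]
print(f"Pass@1 = {pass_at_1(results):.2%}")  # Pass@1 = 58.33%
```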

Proof-Based Competitions

These require formal mathematical proofs judged by human experts:

Each proof is scored on a 7-point scale by experienced judges with olympiad-level expertise. Responses are anonymized and evaluated independently by two judges to ensure fairness.

Math+Code Problems

  • Project Euler: Computational mathematics problems requiring both mathematical insight and programming implementation

Models can use either custom scaffolding with multi-turn code execution or provider-specific code interpreters.
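
As an illustration of the math+code format, Project Euler's first problem asks for the sum of all multiples of 3 or 5 below 1000; solving it well rewards combining a mathematical identity with a short program. The sketch below uses the arithmetic-series formula with inclusion-exclusion instead of brute-force iteration (Problem 1 is a long-public warm-up problem, used here only to illustrate the format, not a MathArena evaluation item).

```python
def sum_multiples_below(k: int, limit: int) -> int:
    """Sum of all positive multiples of k strictly below `limit`,
    using the arithmetic-series formula k * n * (n + 1) / 2."""
    n = (limit - 1) // k
    return k * n * (n + 1) // 2

# Inclusion-exclusion: multiples of 3, plus multiples of 5, minus multiples of 15
limit = 1000
answer = (sum_multiples_below(3, limit)
          + sum_multiples_below(5, limit)
          - sum_multiples_below(15, limit))
print(answer)  # 233168
```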

MathArena Apex

Introduced in August 2025, MathArena Apex is a curated collection of recent final-answer problems specifically selected to challenge state-of-the-art models. These problems are evaluated with 16 runs per problem (compared to the standard 4) for more robust statistical analysis.[3]
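
The main benefit of additional runs is a tighter confidence interval on the estimated accuracy. The sketch below illustrates the effect with a simple normal-approximation interval; the exact statistical procedure used by MathArena may differ.

```python
import math

def accuracy_ci(successes: int, runs: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation confidence interval (~95% by default) for an accuracy
    estimated from `successes` correct answers out of `runs` attempts."""
    p = successes / runs
    half_width = z * math.sqrt(p * (1 - p) / runs)
    return max(0.0, p - half_width), min(1.0, p + half_width)

# The same observed 50% accuracy is far less certain with 4 runs than with 16
print(accuracy_ci(2, 4))    # approx. (0.01, 0.99)
print(accuracy_ci(8, 16))   # approx. (0.255, 0.745)
```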

Evaluation Framework

The evaluation process employs rigorous statistical methods:[1]

  • Standard evaluation: 4 runs per problem for final-answer questions
  • Apex evaluation: 16 runs per problem for enhanced statistical significance
  • Token limits: 64,000 tokens standard (128,000 for specific models)
  • Cost tracking: Detailed USD cost calculation per model evaluation
  • Confidence intervals: Statistical analysis with configurable significance levels

For proof-based problems:

  • Scoring system: 7 points per problem (IMO standard)
  • Judge selection: 4 experienced judges with olympiad-level expertise
  • Double-blind review: Anonymized responses graded independently
  • Real-time assessment: Evaluation begins immediately after problem release

Performance Results

IMO 2025 Results

Performance on the International Mathematical Olympiad 2025 problems (6 problems, 42 points total):[2]

Rank Model Organization Score Percentage Medal Achievement
1 Gemini 2.5 Pro Google 13 points 31.0% Below Bronze (requires 19 points)
2 Grok 4 (updated prompt) xAI 9 points 21.43% No medal
3 o3 OpenAI 8 points 19.05% No medal
4 GPT-5 OpenAI 7 points 16.67% No medal
5 Claude 4 Sonnet Anthropic 6 points 14.29% No medal

No model achieved the Bronze medal threshold of 19/42 points, highlighting the difficulty of IMO problems.

USAMO 2025 Performance

USA Mathematical Olympiad 2025 results (6 problems, 42 points total, proof-based):[4]

Rank Model Organization Score Percentage
1 Gemini 2.5 Pro Google 10.1 points 24.0%
2 o3 OpenAI 9.2 points 21.9%
3 o4-mini OpenAI 8.1 points 19.3%
4 GPT-5 OpenAI 7.5 points 17.9%
5 Claude 4 Opus Anthropic 6.8 points 16.2%

Human median performance: 35.7%

Final-Answer Competition Performance

Top models on numerical problem competitions achieve significantly higher scores:[2]

Model Average Accuracy Strength
o3 (high) 87% Exceeds top 1% human performance
o4-mini (high) 86% Exceeds top 1% human performance
Gemini 2.5 Pro 86% Exceeds top 1% human performance
GPT-5 82% Strong performance
Claude 4 Opus 79% Strong performance

MathArena Apex Results

Performance on the specially curated challenging problems (as of August 2025):[3]

Rank Model Accuracy Cost (USD)
1 Qwen3-A22B-2507-Think 5.21% $9.89
2 Grok 4 2.08% $99.39
3 GPT-5 (High) Agent 2.08% $183.79
4 GPT-5-mini (high) 1.04% $13.42
5 GLM 4.5 1.04% $14.50

These results demonstrate that even state-of-the-art models struggle with carefully selected challenging problems.

Problem Domains and Difficulty

Mathematical Domains Covered

MathArena problems span the full spectrum of competition mathematics:[1]

  • Algebra: Polynomial equations, functional equations, inequalities
  • Number Theory: Divisibility, modular arithmetic, Diophantine equations
  • Combinatorics: Counting, graph theory, combinatorial geometry
  • Geometry: Euclidean geometry, coordinate geometry, transformations
  • Analysis: Calculus, sequences, series (primarily in Putnam)
  • Discrete Mathematics: Logic, set theory, algorithms

Difficulty Classification

MathArena uses a color-coded system to indicate problem difficulty based on model performance (a classification helper is sketched after the table):[2]

Color Success Rate Interpretation
Green >75% Routinely solved by models
Yellow 25-75% Moderate difficulty
Orange 1-24% Very challenging
Red 0% Unsolved by any model
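
The thresholds above translate directly into a small classification helper, sketched here for illustration; this is not MathArena's actual implementation.

```python
def difficulty_color(success_rate: float) -> str:
    """Map a model success rate in [0.0, 1.0] to MathArena's difficulty colors."""
    if success_rate > 0.75:
        return "green"    # routinely solved by models
    if success_rate >= 0.25:
        return "yellow"   # moderate difficulty
    if success_rate > 0.0:
        return "orange"   # very challenging
    return "red"          # unsolved by any model

print(difficulty_color(0.90))  # green
print(difficulty_color(0.10))  # orange
```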

Performance Analysis by Domain

Research reveals significant variation in model performance across mathematical domains:[1]

  • Strongest performance: Algebra and Number Theory problems
  • Moderate performance: Analysis and discrete mathematics
  • Weakest performance: Combinatorics and Geometry problems

This pattern suggests that current LLMs excel at symbolic manipulation and pattern-based reasoning but struggle with spatial reasoning and complex combinatorial arguments.

Contamination Analysis

Evidence of Training Data Contamination

MathArena's analysis revealed significant contamination in existing benchmarks:[1]

AIME 2024 Contamination

Models showed anomalous performance on AIME 2024 problems:

  • Expected performance based on human percentiles: ~25-30%
  • Actual model performance: ~85-90%
  • Conclusion: Strong evidence of training data contamination

Comparison with Fresh Problems

When evaluated on genuinely new problems:

  • Performance drops by 50-70% compared to potentially contaminated benchmarks
  • Models struggle with novel problem formulations
  • Significant gap between memorization and genuine reasoning

Contamination Prevention Measures

MathArena implements several strategies to ensure evaluation integrity:

  • Real-time evaluation: Problems added immediately upon public release
  • Embargo period: No public release of solutions during evaluation
  • Version tracking: Monitoring of problem appearances in training corpora
  • Statistical analysis: Detection of anomalous performance patterns

Technical Implementation

Infrastructure

MathArena provides a comprehensive evaluation framework:[5]

  • Open source: Complete codebase available on GitHub
  • Package management: UV package manager for dependencies
  • Dataset hosting: Problems and solutions on HuggingFace
  • API support: Integration with major LLM providers

Evaluation Pipeline

The standard evaluation process consists of the following steps (a code sketch follows the list):

  1. Problem selection: Choosing competitions and specific problems
  2. Model configuration: Setting token limits and generation parameters
  3. Response generation: Multiple runs per problem for statistical validity
  4. Answer extraction: Parsing numerical answers or proof text
  5. Scoring: Automated for final answers, human judges for proofs
  6. Statistical analysis: Confidence intervals and significance testing
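
A minimal, self-contained sketch of how these steps could fit together for final-answer problems is shown below. The `generate` callable stands in for a model API call, and the helper names and the \boxed{...} answer convention are assumptions made for illustration rather than details of the MathArena codebase.

```python
import re

def extract_final_answer(response: str) -> str | None:
    """Pull the contents of the last \\boxed{...} from a model response; a common
    convention for final-answer problems, used here purely for illustration."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", response)
    return matches[-1].strip() if matches else None

def evaluate_final_answer(problems, generate, runs_per_problem=4):
    """Run `generate(problem_statement)` several times per problem and report Pass@1.

    `problems` is a list of dicts with "problem" and "answer" fields, mirroring
    the MathArena dataset format; `generate` stands in for a model API call.
    """
    per_problem = []
    for problem in problems:
        correct = [
            extract_final_answer(generate(problem["problem"])) == problem["answer"]
            for _ in range(runs_per_problem)
        ]
        per_problem.append(sum(correct) / runs_per_problem)
    return sum(per_problem) / len(per_problem)

# Toy usage with a dummy "model" that always answers 42
problems = [{"problem": "What is 6 * 7?", "answer": "42"}]
print(evaluate_final_answer(problems, generate=lambda p: r"The answer is \boxed{42}."))
# prints 1.0 (the dummy model always gets it right)
```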

Dataset Format

MathArena datasets follow a standardized structure (an illustrative record follows the list):[5]

  • Required fields:
 * `problem_idx`: Unique identifier
 * `problem`: LaTeX-formatted problem statement
 * `answer`: Ground truth (optional for proof problems)
  • Optional fields:
 * `points`: Maximum score
 * `sample_solution`: Example solution
 * `grading_scheme`: Scoring rubric
 * `difficulty`: Problem difficulty rating
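
The record below illustrates this structure as a Python dictionary; the problem statement, answer, and field values are invented for the example, and the authoritative schema is documented in the GitHub repository.

```python
# A hypothetical dataset record; values are invented for illustration only.
example_record = {
    # required fields
    "problem_idx": 1,
    "problem": r"Find the number of ordered pairs $(a, b)$ of positive integers "
               r"with $a + b = 2025$ and $\gcd(a, b) = 1$.",
    "answer": "1080",            # ground truth; may be omitted for proof problems
    # optional fields
    "points": 7,                 # maximum score
    "sample_solution": r"Since $\gcd(a, b) = \gcd(a, 2025)$, the count is "
                       r"$\varphi(2025) = 1080$.",
    "grading_scheme": None,      # scoring rubric, if any
    "difficulty": "medium",      # format of this field is an assumption
}
```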

Human Evaluation Process

Judge Selection and Training

MathArena employs expert mathematicians for proof evaluation:[4]

  • Qualification requirements: IMO-level competition experience
  • Training process: Calibration on sample problems with known scores
  • Ongoing quality control: Inter-rater reliability monitoring
  • Judge pool: 4 core judges with additional experts for specific domains

Grading Methodology

The proof grading process follows IMO standards (summarized in code after the list):

  1. Anonymization: All model identifiers removed
  2. Independent review: Two judges score each proof separately
  3. Score reconciliation: Discussion for scores differing by >2 points
  4. Final score: Average of two judge scores
  5. Appeals process: Third judge for contested scores
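
The score-combination rules above can be condensed into a short helper, sketched here from the stated procedure; `resolve_disagreement` is a placeholder for the discussion or third-judge step, and this is not MathArena's grading code.

```python
def combine_judge_scores(score_a: int, score_b: int, resolve_disagreement=None) -> float:
    """Combine two independent judge scores on the 0-7 scale.

    If the scores differ by more than 2 points, the judges discuss the proof (or a
    third judge is consulted); `resolve_disagreement` stands in for that step and
    should return the reconciled score. Otherwise the two scores are averaged.
    """
    if abs(score_a - score_b) > 2 and resolve_disagreement is not None:
        return float(resolve_disagreement(score_a, score_b))
    return (score_a + score_b) / 2

print(combine_judge_scores(6, 7))                            # 6.5
print(combine_judge_scores(2, 6, resolve_disagreement=min))  # 2.0 (placeholder resolution)
```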

Evaluation Criteria

Proofs are assessed on multiple dimensions:

  • Correctness: Mathematical validity of arguments
  • Completeness: All cases considered and proven
  • Rigor: Formal mathematical exposition
  • Clarity: Logical flow and presentation
  • Innovation: Creative problem-solving approaches

Impact and Significance

Research Community Impact

MathArena has influenced mathematical AI research through:[1]

  • Benchmark adoption: Used by major AI labs for model development
  • Methodology influence: Contamination-aware evaluation becoming standard
  • Dataset contributions: Open Proof Corpus with 5,000+ evaluated proofs
  • Research direction: Renewed focus on genuine reasoning vs. memorization

Industry Adoption

Leading AI companies use MathArena for:

  • Model evaluation: Assessing mathematical capabilities
  • Training validation: Ensuring models aren't overfitting to known problems
  • Competitive analysis: Comparing model performance
  • Development priorities: Identifying areas for improvement

Educational Applications

MathArena data supports:

  • AI tutoring systems: Understanding common solution approaches
  • Problem generation: Creating new mathematical challenges
  • Student assessment: Comparing human and AI performance
  • Curriculum development: Identifying conceptual difficulties

Limitations and Future Work

Current Limitations

MathArena acknowledges several constraints:[1]

  • Language limitation: Currently English-only problems
  • Competition focus: May not represent all mathematical reasoning
  • Human evaluation bottleneck: Proof grading requires expert time
  • Cost considerations: Expensive to run comprehensive evaluations

Planned Extensions

The MathArena team has outlined future developments:

  • Multilingual support: Problems in multiple languages
  • Automated proof checking: Integration with formal verification systems
  • Interactive problem solving: Multi-turn mathematical dialogue
  • Curriculum coverage: Beyond competition mathematics
  • Real-time leaderboard: Continuous model evaluation

Related Benchmarks

MathArena complements and improves upon existing mathematical benchmarks:

  • GSM8K: Grade school math problems (contamination issues identified)
  • MATH: High school competition problems (static dataset)
  • MMLU Mathematics: Multiple choice questions (limited depth)
  • Minerva: Mathematical reasoning benchmark
  • MathVista: Visual mathematical reasoning
  • OlympiadBench: Olympiad-level problems (static dataset)

References

  1. Balunović, Mislav, et al. "MathArena: Evaluating LLMs on Uncontaminated Math Competitions." arXiv preprint arXiv:2505.23281 (2025).
  2. MathArena Official Website. https://matharena.ai/. Accessed August 2025.
  3. MathArena Team. "MathArena Apex: Challenging SOTA Models." August 2025. https://matharena.ai/apex
  4. Petrov, Ivo, et al. "Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad." arXiv preprint arXiv:2503.21934 (2025).
  5. MathArena GitHub Repository. https://github.com/eth-sri/matharena. Accessed August 2025.
