AIME 2025
| AIME 2025 | |
|---|---|
| Overview | |
| Full name | American Invitational Mathematics Examination 2025 |
| Abbreviation | AIME 2025 |
| Description | A challenging mathematical reasoning benchmark based on the American Invitational Mathematics Examination 2025 problems, testing olympiad-level mathematical reasoning with complex multi-step problem solving |
| Release date | 2025-02-06 |
| Latest version | 1.0 |
| Benchmark updated | 2025-02-14 |
| Authors | Mathematical Association of America |
| Organization | Mathematical Association of America (MAA), Art of Problem Solving (AoPS) |
| Technical Details | |
| Type | Mathematical Reasoning, Olympiad Mathematics |
| Modality | Text |
| Task format | Open-ended problem solving |
| Number of tasks | 30 |
| Total examples | 30 |
| Evaluation metric | Exact Match, Pass@1, Pass@8 |
| Domains | Algebra, Geometry, Number Theory, Combinatorics, Probability |
| Languages | English |
| Performance | |
| Human performance | 26.67%-40% (4-6 problems correct per 15) |
| Baseline | 20% (Non-reasoning models) |
| SOTA score | 94.6% (GPT-5, August 2025) |
| SOTA model | GPT-5 |
| SOTA date | 2025-08 |
| Saturated | No |
| Resources | |
| Predecessor | AIME 2024 |
AIME 2025 is an AI benchmark that evaluates large language models' ability to solve complex mathematical problems from the 2025 American Invitational Mathematics Examination. AIME I was administered on February 6, 2025, and AIME II on February 12, 2025, with benchmark evaluations conducted immediately afterward to minimize data contamination. The benchmark consists of 30 challenging olympiad-level mathematics problems (combining the AIME I and AIME II sessions) that test advanced mathematical reasoning, symbolic manipulation, and multi-step problem-solving capabilities.
Overview
The AIME 2025 benchmark is one of the most challenging tests of how well large language models (LLMs) can think logically, reason step by step, and solve multi-layered mathematical problems. Unlike simpler mathematical benchmarks, AIME 2025 requires deep mathematical understanding and the ability to apply complex reasoning strategies typically expected of the top high school mathematics students in the United States.
Significance
AIME 2025 has emerged as the gold standard for mathematical reasoning in AI for several reasons:
- Difficulty Level: Problems require olympiad-level mathematics understanding
- Reasoning Depth: Tests structured, symbolic reasoning under constraints
- Non-saturation: Unlike benchmarks like MATH 500 and MGSM, AIME remains unsaturated
- Real Progress Indicator: Gains on AIME typically lag behind gains in language fluency or code generation, making it a sharper signal of genuine reasoning progress
Technical Specifications
Problem Structure
The AIME 2025 benchmark includes:
- 30 total problems (15 from AIME I and 15 from AIME II)
- 3-hour time limit format (for human test-takers)
- Integer answers ranging from 000 to 999 (see the normalization sketch after this list)
- Problems drawn from pre-calculus high school mathematics curriculum
- Increasing difficulty gradient within each 15-problem set
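Because every answer is an integer from 000 to 999, automated grading reduces to exact string match after normalization. A minimal sketch of such a normalizer; the helper is illustrative, not part of any official harness:

```python
def normalize_aime_answer(raw: str) -> str | None:
    """Normalize a candidate answer to the AIME three-digit format.

    Returns the zero-padded string ("42" -> "042"), or None if the
    answer is not an integer in [0, 999].
    """
    raw = raw.strip()
    if not raw.isdigit():           # rejects "", signs, fractions, text
        return None
    value = int(raw)                # int() absorbs leading zeros: "042" -> 42
    return f"{value:03d}" if value <= 999 else None

assert normalize_aime_answer(" 042 ") == "042"
assert normalize_aime_answer("1000") is None
```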
Evaluation Methodology
The standard evaluation protocol for AIME 2025 includes:
| Parameter | Setting | Purpose |
|---|---|---|
| Temperature | [0.0, 0.3, 0.6] | Multiple settings to test consistency |
| Samples per question | 8 | Reduce variance on small dataset |
| Maximum tokens | 32,768 | Allow for detailed reasoning chains |
| Top-p sampling | 0.95 | Control output diversity |
| Random seed | 0 | Ensure reproducibility |
| Prompt format | "Please reason step by step, and put your final answer within \boxed{}" | Standardized reasoning extraction |
Results are typically reported as averages across all temperature settings and runs to provide robust performance metrics.
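A compact sketch of this protocol, assuming a hypothetical `generate(prompt, temperature, top_p, max_tokens, seed)` sampling function and a boolean `grade(completion, answer)` exact-match checker (neither is a real API; per-sample seeds 0..7 are also an assumption, since the table above only fixes a base seed of 0):

```python
import statistics

TEMPERATURES = [0.0, 0.3, 0.6]
N_SAMPLES = 8
INSTRUCTION = "Please reason step by step, and put your final answer within \\boxed{}"

def evaluate(problems, generate, grade):
    """Draw 8 samples per question at each temperature; report pass@1
    (mean per-sample accuracy) and pass@8 (any sample correct),
    averaged across the three temperature settings."""
    pass1_by_temp, pass8_by_temp = [], []
    for temp in TEMPERATURES:
        per_question = []  # one list of 8 correctness flags per problem
        for problem in problems:
            prompt = f"{problem['question']}\n\n{INSTRUCTION}"
            flags = [
                grade(generate(prompt, temperature=temp, top_p=0.95,
                               max_tokens=32_768, seed=i),
                      problem["answer"])
                for i in range(N_SAMPLES)
            ]
            per_question.append(flags)
        pass1_by_temp.append(statistics.mean(sum(f) / N_SAMPLES for f in per_question))
        pass8_by_temp.append(statistics.mean(float(any(f)) for f in per_question))
    return {"pass@1": statistics.mean(pass1_by_temp),
            "pass@8": statistics.mean(pass8_by_temp)}
```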
Performance Analysis
Initial Leaderboard (February 2025)
The following table shows the performance of AI models on AIME 2025 at initial benchmark release:
| Rank | Model | Score (%) | Parameters | Organization |
|---|---|---|---|---|
| 1 | o3-mini | 86.5 | - | OpenAI |
| 2 | DeepSeek R1 | 74.0 | - | DeepSeek |
| 3 | o1 | ~60 | - | OpenAI |
| 4 | DeepSeek-R1-Distill-Llama-70B | 51.4 | 70B | DeepSeek |
| 5 | o1-preview | ~50 | - | OpenAI |
| 6 | Gemini 2.0 Flash | ~45 | - | Google DeepMind |
| 7 | o1-mini | ~40 | - | OpenAI |
| 8 | QwQ-32B-Preview | ~35 | 32B | Alibaba |
| 9 | Non-reasoning models | ~20 | Various | Various |
Updated Performance (Later 2025)
Models released or evaluated after the initial benchmark showed improved performance:
| Model | Score (%) | Release Date | Notes |
|---|---|---|---|
| GPT-5 | 94.6 | August 2025 | Without tools; 99.6% with thinking |
| o4-mini | 92.7 | April 2025 | Successor to o3-mini |
| o3 | 88.9 | April 2025 | Updated evaluation |
Key Findings
Reasoning vs. Non-Reasoning Models
The benchmark clearly demonstrates the superiority of models with explicit reasoning capabilities:
- Reasoning models: 40-86.5% accuracy (initial); up to 94.6% (later models)
- Non-reasoning models: ~20% accuracy
- Performance gap: 2-4x improvement with reasoning architectures
Temperature Impact
Research on AIME 2025 revealed significant temperature sensitivity:
- Larger models (>14B parameters) show more stability across temperatures
- No universal optimal temperature setting exists
- Model-specific tuning recommended for optimal performance
- Ensemble approaches across temperatures can improve results (one common scheme is sketched below)
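A widely used ensembling scheme is majority voting (self-consistency) over answers pooled across temperature settings. This sketch is illustrative rather than part of any standard AIME harness:

```python
from collections import Counter

def majority_vote(answers: list[str | None]) -> str | None:
    """Return the most common valid answer across pooled samples
    (self-consistency); None if no sample produced a valid answer."""
    valid = [a for a in answers if a is not None]
    return Counter(valid).most_common(1)[0][0] if valid else None

# e.g. answers pooled from 8 samples at each of the 3 temperatures
pooled = ["204", "204", "816", None, "204", "816"]
assert majority_vote(pooled) == "204"
```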
Model Brittleness
AIME 2025 highlights critical weaknesses in current AI systems:
- Some models fail on relatively simple AIME problems while succeeding at coding or trivia tasks
- Correct answers are spread across different models rather than concentrated in a single one
- No single model demonstrates a comprehensive problem-solving approach
- Performance varies significantly based on problem type
Mathematical Domains
AIME 2025 tests proficiency across multiple mathematical areas:
| Domain | Topics Covered | Example Problem Types |
|---|---|---|
| Algebra | Polynomial equations, functional equations, inequalities, sequences | Finding roots of complex equations, proving identities |
| Geometry | Euclidean geometry, coordinate geometry, solid geometry, transformations | Triangle centers, circle theorems, 3D visualization |
| Number Theory | Divisibility, modular arithmetic, prime factorization, Diophantine equations | Finding remainders, solving congruences |
| Combinatorics | Counting principles, probability, graph theory, generating functions | Arrangement problems, expected values |
| Trigonometry | Identities, complex numbers, roots of unity | Solving trigonometric equations |
Comparison with AIME 2024
Performance differences between AIME 2024 and 2025 reveal important insights:
| Aspect | AIME 2024 | AIME 2025 | Implications |
|---|---|---|---|
| Average AI Performance | Higher | Lower (initially) | Suggests reduced data contamination |
| Problem Novelty | Potentially compromised | Fresh problems | Better true capability assessment |
| Model Rankings | Different ordering | New hierarchy | Reveals genuine reasoning abilities |
| Saturation Status | Approaching saturation | Far from saturated | More room for improvement |
Limitations and Challenges
Data Contamination Concerns
The benchmark evaluations were conducted immediately after the February 2025 exams to minimize contamination:
- Problems become publicly available shortly after administration
- Evaluations had to be completed before the problems and their solutions could enter model training data
- Scores require ongoing monitoring, since models released later may have seen the 2025 problems during training
Statistical Limitations
- Small dataset size: Only 30 problems limits statistical power (quantified in the sketch after this list)
- High variance: Individual runs show significant variation
- Limited diversity: Focus on competition-style problems
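The small-sample point can be made concrete: each problem is worth 3.3 percentage points, and the binomial standard error of an accuracy estimate over n = 30 problems is large. A quick check using the normal approximation (the accuracies chosen are hypothetical):

```python
import math

N_PROBLEMS = 30
for p in (0.50, 0.74, 0.90):                  # hypothetical accuracies
    se = math.sqrt(p * (1 - p) / N_PROBLEMS)  # binomial standard error
    print(f"accuracy {p:.0%}: SE = {se:.1%}, 95% CI = ±{1.96 * se:.1%}")

# accuracy 50%: SE = 9.1%, 95% CI = ±17.9%
# accuracy 74%: SE = 8.0%, 95% CI = ±15.7%
# accuracy 90%: SE = 5.5%, 95% CI = ±10.7%
```

Two models whose single-run scores differ by 10-15 points can therefore be statistically indistinguishable, which is why the standard protocol averages over multiple samples and temperature settings.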
Evaluation Challenges
- Computational cost: Multiple runs required for reliable results
- Temperature sensitivity: Optimal settings vary by model
- Answer extraction: Parsing final answers from long reasoning chains (see the sketch below)
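Most harnesses handle extraction by taking the last `\boxed{}` expression, which the standardized prompt asks the model to produce. A minimal sketch; real harnesses add fallback heuristics, and this regex ignores nested braces, which is safe for AIME's integer answers:

```python
import re

BOXED = re.compile(r"\\boxed\{([^{}]*)\}")

def extract_boxed_answer(completion: str) -> str | None:
    """Return the contents of the last \\boxed{...} in a completion,
    or None if the model never produced one."""
    matches = BOXED.findall(completion)
    return matches[-1].strip() if matches else None

assert extract_boxed_answer(r"... so the remainder is \boxed{204}.") == "204"
assert extract_boxed_answer("no boxed answer here") is None
```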
Applications and Impact
Educational Technology
AIME 2025 performance indicates potential for:
- AI Tutoring Systems: Models solving AIME problems can serve as advanced math tutors
- Problem Generation: Creating new olympiad-style problems
- Solution Verification: Checking student work on complex problems
Research Applications
- Automated Theorem Proving: Success implies potential for formal mathematics
- Scientific Computing: Mathematical modeling capabilities
- Symbolic AI: Bridge between neural and symbolic reasoning
Industry Applications
- Quantitative Finance: Risk analysis and modeling
- Operations Research: Optimization problem solving
- Cryptography: Number theory applications
Future Directions
Benchmark Evolution
Proposed improvements include:
- Larger problem sets for better statistical significance
- Dynamic problem generation to prevent contamination
- Multi-modal problems incorporating diagrams
- Interactive problem-solving evaluation
Model Development
AIME 2025 drives research in:
- Test-time computation optimization
- Chain-of-thought reasoning improvements
- Mathematical knowledge distillation
- Hybrid symbolic-neural architectures
Related Benchmarks
- AIME 2024: Predecessor benchmark with 15 problems
- MATH: Broader mathematical dataset with 12,500 problems
- GSM8K: Elementary school math word problems
- GPQA Diamond: PhD-level science questions in biology, physics, and chemistry
- Minerva: Technical problem-solving benchmark
- Olympiad Bench: Collection of olympiad problems
- IMO Grand Challenge: International Mathematical Olympiad problems
See Also
- Mathematical Reasoning in AI
- Olympiad Mathematics
- Reasoning Models
- Test-time Computation
- AI Evaluation Metrics
- Benchmark Saturation