| AIME 2025 | |
|---|---|
| Overview | |
| Full name | American Invitational Mathematics Examination 2025 |
| Abbreviation | AIME 2025 |
| Description | A challenging mathematical reasoning benchmark based on the American Invitational Mathematics Examination 2025 problems, testing olympiad-level mathematical reasoning with complex multi-step problem solving |
| Release date | 2025-02-06 |
| Latest version | 1.0 |
| Benchmark updated | 2025-02-14 |
| Authors | Mathematical Association of America |
| Organization | Mathematical Association of America (MAA), Art of Problem Solving (AoPS) |
| Technical Details | |
| Type | Mathematical Reasoning, Olympiad Mathematics |
| Modality | Text |
| Task format | Open-ended problem solving |
| Number of tasks | 30 |
| Total examples | 30 |
| Evaluation metric | Exact Match, Pass@1, Pass@8 |
| Domains | Algebra, Geometry, Number Theory, Combinatorics, Probability |
| Languages | English |
| Performance | |
| Human performance | 26.7–40% (4–6 of 15 problems correct) |
| Baseline | 20% (Non-reasoning models) |
| SOTA score | 94.6% (GPT-5, August 2025) |
| SOTA model | GPT-5 |
| SOTA date | 2025-08 |
| Saturated | No |
| Resources | |
| Website | Official website |
| GitHub | Repository |
| Dataset | Download |
| Predecessor | AIME 2024 |
AIME 2025 is an AI benchmark that evaluates large language models' ability to solve complex mathematical problems from the 2025 American Invitational Mathematics Examination. The AIME I and AIME II exams were held on February 6 and February 12, 2025, respectively, with benchmark evaluations conducted immediately afterward to minimize data contamination. The benchmark consists of 30 challenging olympiad-level mathematics problems (combining both the AIME I and AIME II sessions) that test advanced mathematical reasoning, symbolic manipulation, and multi-step problem-solving capabilities.
The AIME 2025 benchmark represents one of the most challenging tests for evaluating how well large language models (LLMs) can think logically, reason step-by-step, and solve multi-layered mathematical problems. Unlike simpler mathematical benchmarks, AIME 2025 requires deep mathematical understanding and the ability to apply complex reasoning strategies typically expected of the top high school mathematics students in the United States.
AIME 2025 has emerged as a de facto standard for evaluating mathematical reasoning in AI: its problems were published after most models' training cutoffs, reducing contamination; they require genuine multi-step reasoning rather than recall; and their integer answers permit unambiguous exact-match scoring.
The AIME 2025 benchmark includes all 30 problems from the AIME I and AIME II sessions, each with a single integer answer between 0 and 999.
The standard evaluation protocol for AIME 2025 uses the following settings:
| Parameter | Setting | Purpose |
|---|---|---|
| Temperature | [0.0, 0.3, 0.6] | Multiple settings to test consistency |
| Samples per question | 8 | Reduce variance on small dataset |
| Maximum tokens | 32,768 | Allow for detailed reasoning chains |
| Top-p sampling | 0.95 | Control output diversity |
| Random seed | 0 | Ensure reproducibility |
| Prompt format | "Please reason step by step, and put your final answer within \boxed{}" | Standardized reasoning extraction |
Results are typically reported as averages across all temperature settings and runs to provide robust performance metrics.
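To make the protocol concrete, here is a minimal Python sketch of the scoring side (the model-sampling call is deliberately left abstract). The constants mirror the table above; the `extract_boxed` helper and the hard-coded toy completions are illustrative assumptions, while `pass_at_k` is the standard unbiased estimator commonly used for pass@k metrics.

```python
import re
from math import comb

# Protocol constants taken from the table above.
TEMPERATURES = [0.0, 0.3, 0.6]
N_SAMPLES = 8          # samples per question per temperature
MAX_TOKENS = 32_768

def extract_boxed(text: str) -> str | None:
    """Return the contents of the last \\boxed{...} in a response.

    AIME answers are integers in [0, 999], so a non-nested regex suffices;
    answers with nested braces would need a small parser instead.
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # enough correct samples that every size-k draw hits one
    return 1.0 - comb(n - c, k) / comb(n, k)

def score_question(samples: list[str], gold: str) -> dict[str, float]:
    """Exact-match each sampled completion, then report pass@1 and pass@8."""
    n_correct = sum(extract_boxed(s) == gold for s in samples)
    n = len(samples)
    return {"pass@1": pass_at_k(n, n_correct, 1),
            "pass@8": pass_at_k(n, n_correct, min(8, n))}

# Toy demonstration with hard-coded completions (no model call).
samples = ["The total is therefore \\boxed{70}."] * 3 + ["... \\boxed{71}"] * 5
print(score_question(samples, gold="70"))  # {'pass@1': 0.375, 'pass@8': 1.0}
```

Per-question results would then be averaged over all 30 problems, the eight samples, and the three temperature settings to produce the headline score.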
The following table shows the performance of AI models on AIME 2025 at the initial benchmark release:
| Rank | Model | Score (%) | Parameters | Organization |
|---|---|---|---|---|
| 1 | o3-mini | 86.5 | - | OpenAI |
| 2 | DeepSeek-R1 | 74.0 | - | DeepSeek |
| 3 | o1 | ~60 | - | OpenAI |
| 4 | DeepSeek-R1-Distill-Llama-70B | 51.4 | 70B | DeepSeek |
| 5 | o1-preview | ~50 | - | OpenAI |
| 6 | Gemini 2.0 Flash | ~45 | - | Google DeepMind |
| 7 | o1-mini | ~40 | - | OpenAI |
| 8 | QwQ-32B-Preview | ~35 | 32B | Alibaba |
| 9 | Non-reasoning models | ~20 | Various | Various |
Models released or evaluated after the initial benchmark showed improved performance:
| Model | Score (%) | Release Date | Notes |
|---|---|---|---|
| GPT-5 | 94.6 | August 2025 | Without tools; 99.6% with thinking |
| o4-mini | 92.7 | April 2025 | Successor to o3-mini |
| o3 | 88.9 | April 2025 | Updated evaluation |
The benchmark clearly demonstrates the advantage of models with explicit reasoning capabilities: reasoning-focused models such as o3-mini and DeepSeek-R1 scored far above the roughly 20% achieved by non-reasoning models at release.
Research on AIME 2025 has also revealed significant temperature sensitivity, which is one reason the standard protocol samples at several temperatures and averages the results.
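To illustrate why that averaging matters, the snippet below aggregates per-temperature accuracies into the mean-and-spread summary typically reported; the accuracy numbers are invented for illustration. On a 30-problem set, each question is worth 1/30 ≈ 3.3 percentage points, so even a few flipped problems can move the headline score substantially.

```python
from statistics import mean, pstdev

# Invented per-temperature accuracies for one model on the 30 problems.
per_temperature = {0.0: 16 / 30, 0.3: 18 / 30, 0.6: 14 / 30}

scores = list(per_temperature.values())
print(f"mean:  {mean(scores):.1%}")               # headline number
print(f"std:   {pstdev(scores):.1%}")             # run-to-run spread
print(f"range: {max(scores) - min(scores):.1%}")  # 4 problems = ~13 points
```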
At the same time, AIME 2025 highlights weaknesses in current AI systems: even the strongest models miss some problems, and the benchmark is not yet saturated.
AIME 2025 tests proficiency across multiple mathematical areas (a worked toy example follows the table):
| Domain | Topics Covered | Example Problem Types |
|---|---|---|
| Algebra | Polynomial equations, functional equations, inequalities, sequences | Finding roots of complex equations, proving identities |
| Geometry | Euclidean geometry, coordinate geometry, solid geometry, transformations | Triangle centers, circle theorems, 3D visualization |
| Number Theory | Divisibility, modular arithmetic, prime factorization, Diophantine equations | Finding remainders, solving congruences |
| Combinatorics | Counting principles, probability, graph theory, generating functions | Arrangement problems, expected values |
| Trigonometry | Identities, complex numbers, roots of unity | Solving trigonometric equations |
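To make these categories concrete, here is a toy AIME-style number-theory computation, written as a worked example (illustrative only, not an actual 2025 problem). As on the real exam, the answer is an integer between 0 and 999.

```latex
\textbf{Toy problem.} Find the last three digits of $7^{2025}$, i.e.\ compute $7^{2025} \bmod 1000$.

\textbf{Solution.} Since $\gcd(7,1000)=1$ and $\varphi(1000)=400$, Euler's theorem gives
$7^{400}\equiv 1 \pmod{1000}$, so $7^{2025}\equiv 7^{25} \pmod{1000}$.
By repeated squaring, $7^2\equiv 49$, $7^4\equiv 401$, $7^8\equiv 801$, and
$7^{16}\equiv 601 \pmod{1000}$; hence
\[
7^{25} = 7^{16}\cdot 7^{8}\cdot 7 \equiv 601\cdot 801\cdot 7 \equiv 807 \pmod{1000}.
\]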
Performance differences between AIME 2024 and 2025 reveal important insights:
| Aspect | AIME 2024 | AIME 2025 | Implications |
|---|---|---|---|
| Average AI Performance | Higher | Lower (initially) | Suggests reduced data contamination |
| Problem Novelty | Potentially compromised | Fresh problems | Better true capability assessment |
| Model Rankings | Different ordering | New hierarchy | Reveals genuine reasoning abilities |
| Saturation Status | Approaching saturation | Far from saturated | More room for improvement |
The benchmark evaluation was conducted immediately after the February 2025 exams to minimize contamination.
AIME 2025 performance is widely read as an indicator of progress in automated mathematical reasoning. Because the benchmark contains only 30 problems, proposed improvements focus on reducing evaluation variance, for example through the repeated sampling already built into the standard protocol, and the benchmark continues to drive research on reasoning-focused model development.