AIME 2025

AIME 2025

Overview
Full name: American Invitational Mathematics Examination 2025
Abbreviation: AIME 2025
Description: A challenging mathematical reasoning benchmark based on the American Invitational Mathematics Examination 2025 problems, testing olympiad-level mathematical reasoning with complex multi-step problem solving
Release date: 2025-02-06
Latest version: 1.0
Benchmark updated: 2025-02-14
Authors: Mathematical Association of America
Organization: Mathematical Association of America (MAA); Art of Problem Solving (AoPS)

Technical Details
Type: Mathematical Reasoning; Olympiad Mathematics
Modality: Text
Task format: Open-ended problem solving
Number of tasks: 30
Total examples: 30
Evaluation metric: Exact Match; Pass@1; Pass@8
Domains: Algebra; Geometry; Number Theory; Combinatorics; Probability
Languages: English

Performance
Human performance: 26.67%-40% (4-6 problems correct per 15)
Baseline: 20% (non-reasoning models)
SOTA score: 94.6% (GPT-5, August 2025)
SOTA model: GPT-5
SOTA date: 2025-08
Saturated: No

Resources
Website: Official website
GitHub: Repository
Dataset: Download
Predecessor: AIME 2024


AIME 2025 is an AI benchmark that evaluates large language models' ability to solve complex mathematical problems from the 2025 American Invitational Mathematics Examination. AIME I was administered on February 6, 2025, with AIME II following the next week, and benchmark evaluations were conducted immediately afterward to minimize data contamination. The benchmark consists of 30 challenging olympiad-level mathematics problems (15 each from the AIME I and AIME II sessions) that test advanced mathematical reasoning, symbolic manipulation, and multi-step problem solving.

Overview

The AIME 2025 benchmark represents one of the most challenging tests for evaluating how well large language models (LLMs) can think logically, reason step-by-step, and solve multi-layered mathematical problems. Unlike simpler mathematical benchmarks, AIME 2025 requires deep mathematical understanding and the ability to apply complex reasoning strategies typically expected of the top high school mathematics students in the United States.

Significance

AIME 2025 has emerged as the gold standard for mathematical reasoning in AI for several reasons:

  • Difficulty Level: Problems require olympiad-level mathematics understanding
  • Reasoning Depth: Tests structured, symbolic reasoning under constraints
  • Non-saturation: Unlike benchmarks such as MATH-500 and MGSM, AIME 2025 remains unsaturated
  • Real Progress Indicator: Improvement on AIME often lags behind gains in language fluency or code generation, making it a clearer signal of genuine reasoning progress

Technical Specifications

Problem Structure

The AIME 2025 benchmark includes:

  • 30 total problems (15 from AIME I and 15 from AIME II)
  • 3-hour time limit format (for human test-takers)
  • Integer answers ranging from 000 to 999 (so scoring reduces to exact match; see the sketch after this list)
  • Problems drawn from pre-calculus high school mathematics curriculum
  • Increasing difficulty gradient within each 15-problem set
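Because every official answer is an integer from 000 to 999, automated scoring reduces to exact match after trivial normalization. A minimal sketch (the function names are illustrative, not part of any official harness):

```python
def normalize_aime_answer(raw: str):
    """Map a raw answer string onto the AIME integer range 0-999; None if invalid."""
    try:
        value = int(raw.strip())  # "042" and "42" both normalize to 42
    except ValueError:
        return None
    return value if 0 <= value <= 999 else None

def exact_match(predicted: str, reference: int) -> bool:
    """A prediction scores only if it normalizes to exactly the official answer key."""
    return normalize_aime_answer(predicted) == reference
```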

Evaluation Methodology

The standard evaluation protocol for AIME 2025 includes:

Parameter | Setting | Purpose
Temperature | [0.0, 0.3, 0.6] | Multiple settings to test consistency
Samples per question | 8 | Reduce variance on small dataset
Maximum tokens | 32,768 | Allow for detailed reasoning chains
Top-p sampling | 0.95 | Control output diversity
Random seed | 0 | Ensure reproducibility
Prompt format | "Please reason step by step, and put your final answer within \boxed{}" | Standardized reasoning extraction

Results are typically reported as averages across all temperature settings and runs to provide robust performance metrics.
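A minimal evaluation loop following this protocol might look like the sketch below. The `generate` callable and the problem record format are assumptions (the benchmark does not prescribe a specific harness or inference API); answer checking uses exact match on the value extracted from \boxed{}.

```python
import re
import statistics

TEMPERATURES = [0.0, 0.3, 0.6]
SAMPLES_PER_QUESTION = 8
PROMPT_TEMPLATE = (
    "{problem}\n\n"
    "Please reason step by step, and put your final answer within \\boxed{{}}."
)

def extract_boxed(text: str):
    """Return the contents of the last \\boxed{...} in a completion, if any."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", text)
    return matches[-1].strip() if matches else None

def evaluate(problems, generate):
    """problems: iterable of {'problem': str, 'answer': int} records (format assumed).
    generate: caller-supplied function
        generate(prompt, temperature, top_p, max_tokens, seed) -> completion text.
    Returns accuracy averaged over the three temperatures and 8 samples per problem."""
    per_temperature_accuracy = []
    for temperature in TEMPERATURES:
        correct, total = 0, 0
        for item in problems:
            prompt = PROMPT_TEMPLATE.format(problem=item["problem"])
            for _ in range(SAMPLES_PER_QUESTION):
                # Seed handling varies by inference stack; 0 is the protocol's nominal value.
                completion = generate(prompt, temperature=temperature,
                                      top_p=0.95, max_tokens=32_768, seed=0)
                predicted = extract_boxed(completion)
                correct += int(predicted is not None and predicted.isdigit()
                               and int(predicted) == item["answer"])
                total += 1
        per_temperature_accuracy.append(correct / total)
    return statistics.mean(per_temperature_accuracy)
```

Averaging over both temperatures and repeated samples, as in the sketch, is what produces the single reported score for a model.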

Performance Analysis

Initial Leaderboard (February 2025)

The following table shows the performance of AI models on AIME 2025 at initial benchmark release:

Rank | Model | Score (%) | Parameters | Organization
1 | o3-mini | 86.5 | - | OpenAI
2 | DeepSeek R1 | 74.0 | - | DeepSeek
3 | o1 | ~60 | - | OpenAI
4 | DeepSeek-R1-Distill-Llama-70B | 51.4 | 70B | DeepSeek
5 | o1-preview | ~50 | - | OpenAI
6 | Gemini 2.0 Flash | ~45 | - | Google DeepMind
7 | o1-mini | ~40 | - | OpenAI
8 | QwQ-32B-Preview | ~35 | 32B | Alibaba
9 | Non-reasoning models | ~20 | Various | Various

Updated Performance (Later 2025)

Models released or evaluated after the initial benchmark showed improved performance:

Model | Score (%) | Release Date | Notes
GPT-5 | 94.6 | August 2025 | Without tools; 99.6% with thinking
o4-mini | 92.7 | April 2025 | Successor to o3-mini
o3 | 88.9 | April 2025 | Updated evaluation

Key Findings

Reasoning vs. Non-Reasoning Models

The benchmark clearly demonstrates the superiority of models with explicit reasoning capabilities:

  • Reasoning models: 40-86.5% accuracy (initial); up to 94.6% (later models)
  • Non-reasoning models: ~20% accuracy
  • Performance gap: roughly a 2-4x improvement with reasoning architectures (e.g., 86.5% vs. ~20% at release)

Temperature Impact

Research on AIME 2025 revealed significant temperature sensitivity:

  • Larger models (>14B parameters) show more stability across temperatures
  • No universal optimal temperature setting exists
  • Model-specific tuning recommended for optimal performance
  • Ensemble approaches across temperatures can improve results, as sketched below
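One simple ensemble of this kind is majority voting over the answers sampled at the different temperature settings; the following is a hypothetical sketch rather than a published recipe for this benchmark:

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common parsed answer across all samples and temperatures.
    `answers` holds integers, with None marking unparseable completions."""
    valid = [a for a in answers if a is not None]
    if not valid:
        return None
    return Counter(valid).most_common(1)[0][0]

# Example: 8 parsed answers pooled from runs at temperatures 0.0, 0.3, and 0.6
print(majority_vote([42, 42, 17, 42, None, 42, 17, 42]))  # -> 42
```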

Model Brittleness

AIME 2025 highlights critical weaknesses in current AI systems:

  • Some models fail on relatively simple AIME problems while succeeding at coding or trivia tasks
  • Correctly answered questions are spread across different models rather than concentrated in one
  • No single model demonstrates a comprehensive problem-solving approach
  • Performance varies significantly by problem type

Mathematical Domains

AIME 2025 tests proficiency across multiple mathematical areas:

Domain | Topics Covered | Example Problem Types
Algebra | Polynomial equations, functional equations, inequalities, sequences | Finding roots of complex equations, proving identities
Geometry | Euclidean geometry, coordinate geometry, solid geometry, transformations | Triangle centers, circle theorems, 3D visualization
Number Theory | Divisibility, modular arithmetic, prime factorization, Diophantine equations | Finding remainders, solving congruences
Combinatorics | Counting principles, probability, graph theory, generating functions | Arrangement problems, expected values
Trigonometry | Identities, complex numbers, roots of unity | Solving trigonometric equations

Comparison with AIME 2024

Performance differences between AIME 2024 and 2025 reveal important insights:

Aspect | AIME 2024 | AIME 2025 | Implications
Average AI Performance | Higher | Lower (initially) | Suggests reduced data contamination
Problem Novelty | Potentially compromised | Fresh problems | Better true capability assessment
Model Rankings | Different ordering | New hierarchy | Reveals genuine reasoning abilities
Saturation Status | Approaching saturation | Far from saturated | More room for improvement

Limitations and Challenges

Data Contamination Concerns

The benchmark evaluation was conducted immediately after the February 2025 exam sessions to minimize contamination:

  • Problems become publicly available shortly after administration
  • Evaluations had to be completed quickly, before models could be trained on published solutions
  • Ongoing monitoring of performance is needed as solutions circulate

Statistical Limitations

  • Small dataset size: Only 30 problems limits statistical power (see the back-of-the-envelope calculation after this list)
  • High variance: Individual runs show significant variation
  • Limited diversity: Focus on competition-style problems
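To make the statistical-power concern concrete, the binomial standard error of an accuracy estimate over only 30 problems is large; a back-of-the-envelope calculation (illustrative only):

```python
import math

def accuracy_standard_error(p: float, n: int = 30) -> float:
    """Standard error of an observed accuracy p measured on n independent problems."""
    return math.sqrt(p * (1 - p) / n)

# A model scoring around 50% on 30 problems carries roughly +/-9 percentage points
# of standard error, so single-run differences of a few points are largely noise.
print(round(accuracy_standard_error(0.5) * 100, 1))  # -> 9.1
```

This is one reason the standard protocol samples each problem eight times and averages across temperature settings.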

Evaluation Challenges

  • Computational cost: Multiple runs required for reliable results
  • Temperature sensitivity: Optimal settings vary by model
  • Answer extraction: Parsing final answers from reasoning chains

Applications and Impact

Educational Technology

AIME 2025 performance indicates potential for:

  • AI Tutoring Systems: Models solving AIME problems can serve as advanced math tutors
  • Problem Generation: Creating new olympiad-style problems
  • Solution Verification: Checking student work on complex problems

Research Applications

Industry Applications

Future Directions

Benchmark Evolution

Proposed improvements include:

  • Larger problem sets for better statistical significance
  • Dynamic problem generation to prevent contamination (a toy sketch follows this list)
  • Multi-modal problems incorporating diagrams
  • Interactive problem-solving evaluation
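As a toy illustration of what dynamic problem generation could look like (hypothetical, not a proposal from the benchmark's maintainers), each random seed below yields a structurally identical but numerically fresh problem with a machine-checkable answer:

```python
import random

def generate_remainder_problem(seed: int):
    """Produce a fresh AIME-style modular-arithmetic problem and its integer answer."""
    rng = random.Random(seed)
    a, b, m = rng.randint(2, 9), rng.randint(10, 99), rng.randint(100, 999)
    statement = f"Find the remainder when {a}^{b} is divided by {m}."
    answer = pow(a, b, m)  # built-in modular exponentiation keeps the key in 0..m-1
    return statement, answer

print(generate_remainder_problem(seed=0))
```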

Model Development

AIME 2025 drives research in:

Related Benchmarks

  • AIME 2024: Predecessor benchmark with 15 problems
  • MATH: Broader mathematical dataset with 12,500 problems
  • GSM8K: Grade school math word problems
  • GPQA Diamond: PhD-level science questions in biology, physics, and chemistry
  • Minerva: Technical problem-solving benchmark
  • Olympiad Bench: Collection of olympiad problems
  • IMO Grand Challenge: International Mathematical Olympiad problems

See Also

References
