AIME 2024
| AIME 2024 | |
|---|---|
| Overview | |
| Full name | American Invitational Mathematics Examination 2024 |
| Abbreviation | AIME 2024 |
| Description | A challenging mathematical reasoning benchmark based on the American Invitational Mathematics Examination 2024 problems, designed to evaluate AI models' ability to solve complex high school mathematics problems requiring multi-step reasoning |
| Release date | 2024-02-01 |
| Latest version | 1.0 |
| Benchmark updated | 2024-02-07 |
| Authors | Mathematical Association of America |
| Organization | Mathematical Association of America (MAA), Art of Problem Solving (AoPS) |
| Technical Details | |
| Type | Mathematical Reasoning, Problem Solving |
| Modality | Text |
| Task format | Open-ended problem solving |
| Number of tasks | 15 |
| Total examples | 15 |
| Evaluation metric | Exact Match, Pass@1 |
| Domains | Algebra, Geometry, Number Theory, Combinatorics, Probability |
| Languages | English |
| Performance | |
| Human performance | 26.67%-40% (4-6 problems correct) |
| Baseline | 10% (GPT-4o) |
| SOTA score | 93% (o1 with re-ranking) |
| SOTA model | OpenAI o1 |
| SOTA date | 2024-09-12 |
| Saturated | No |
| Resources | |
| Website | Official website |
| GitHub | Repository |
| Dataset | Download |
| Predecessor | AIME 2023 |
| Successor | AIME 2025 |
AIME 2024 is an AI benchmark that evaluates large language models' ability to solve complex mathematical problems from the 2024 American Invitational Mathematics Examination. The benchmark consists of 15 challenging mathematical problems that require advanced problem-solving skills, mathematical reasoning, and multi-step logical thinking typically expected of top high school mathematics students.
Overview
The AIME 2024 benchmark is based on problems from the American Invitational Mathematics Examination, a prestigious invitation-only competition for high school students who score in roughly the top 5% on the AMC 12 (or the top 2.5% on the AMC 10). The benchmark serves as a demanding test of AI models' capabilities in advanced mathematical reasoning, particularly in areas that require creative problem-solving and deep mathematical understanding.
Background
The American Invitational Mathematics Examination (AIME) is one of the most challenging high school mathematics competitions in the United States, serving as a qualification pathway for the USA Mathematical Olympiad (USAMO). The 2024 edition was administered in two sessions: AIME I on February 1, 2024, and AIME II on February 7, 2024. The problems cover topics in algebra, geometry, number theory, combinatorics, and probability theory.
The adaptation of AIME 2024 as an AI benchmark represents a significant milestone in evaluating artificial intelligence systems' mathematical capabilities, as these problems require not just computational ability but genuine mathematical insight and reasoning that have traditionally been considered uniquely human.
Technical Specifications
Problem Format
Each of the 15 problems in AIME 2024 requires:
- Comprehensive understanding of multiple mathematical concepts
- Multi-step reasoning and problem decomposition
- Creative approaches to problem-solving
- Precise numerical answers (integers from 0 to 999)
The problems increase in difficulty progressively, with later problems requiring more sophisticated mathematical techniques and insights.
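Since every AIME answer is an integer between 0 and 999, automated grading reduces to canonicalizing a model's final output and comparing integers. A minimal sketch of such a check in Python (the function names are illustrative, not part of any official harness):

```python
def is_valid_aime_answer(text: str) -> bool:
    """Return True if the text canonicalizes to an integer in [0, 999]."""
    try:
        value = int(text.strip())
    except ValueError:
        return False
    return 0 <= value <= 999


def score_answer(predicted: str, ground_truth: int) -> bool:
    """Exact-match scoring: the prediction must equal the official integer answer."""
    return is_valid_aime_answer(predicted) and int(predicted.strip()) == ground_truth
```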
Evaluation Methodology
The benchmark employs several evaluation approaches:
| Evaluation Method | Description | Implementation |
|---|---|---|
| Exact Match | Models must produce the exact integer answer | Answer extracted from model output and compared to ground truth |
| Pass@1 | Single attempt accuracy | Model given one attempt per problem |
| Pass@k | Best of k attempts | Multiple samples generated, best answer selected |
| Consensus Voting | Majority vote from multiple attempts | Multiple runs aggregated to reduce variance |
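For the Pass@k and Consensus Voting rows, scores can be computed from multiple samples per problem. The sketch below uses the standard unbiased pass@k estimator, 1 - C(n-c, k)/C(n, k) for n samples containing c correct answers, together with a simple majority vote; it is an illustrative implementation rather than the exact code behind any particular leaderboard:

```python
from collections import Counter
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct (k <= n)."""
    # 1 minus the probability that a random size-k subset contains no correct answer.
    return 1.0 - comb(n - c, k) / comb(n, k)


def consensus_answer(samples: list[int]) -> int:
    """Majority vote over integer answers extracted from multiple samples."""
    return Counter(samples).most_common(1)[0][0]
```

Consensus voting among 64 samples corresponds to applying `consensus_answer` per problem before exact-match scoring.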
To reduce variance due to the small dataset size, standard practice is to run models 8 times over the benchmark and average the results. Models are typically prompted with: "Please reason step by step, and put your final answer within \boxed{}".
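Below is a hedged sketch of how one run might extract the \boxed{} answer and how several runs could be averaged; the regular expression and function names are assumptions for illustration and do not reproduce any specific evaluation harness:

```python
import re
from statistics import mean


def extract_boxed_answer(completion: str) -> str | None:
    """Return the contents of the last \\boxed{...} in a model completion."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", completion)
    return matches[-1].strip() if matches else None


def run_accuracy(completions: list[str], answers: list[int]) -> float:
    """Exact-match accuracy of one run over the 15 problems."""
    correct = 0
    for completion, answer in zip(completions, answers):
        extracted = extract_boxed_answer(completion)
        if extracted is not None and extracted.isdigit() and int(extracted) == answer:
            correct += 1
    return correct / len(answers)


def averaged_accuracy(runs: list[list[str]], answers: list[int]) -> float:
    """Average exact-match accuracy over several independent runs (typically 8)."""
    return mean(run_accuracy(run, answers) for run in runs)
```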
Performance Analysis
Model Performance Comparison
The following table shows the performance of various AI models on AIME 2024:
| Model | Pass@1 Score | Methodology | Date |
|---|---|---|---|
| OpenAI o1 (with re-ranking) | 93% (13.9/15) | Re-ranking 1000 samples | September 2024 |
| OpenAI o3 | 91.6% | Single sample | April 2025 |
| OpenAI o3-mini | 87.3% | Single sample | April 2025 |
| OpenAI o1 (consensus) | 83% (12.5/15) | Consensus among 64 samples | September 2024 |
| DeepSeek R1 | 79.8% | Multiple runs averaged | January 2025 |
| OpenAI o1 | 74% (11.1/15) | Single sample | September 2024 |
| o1-mini | 56.67% | Pass@1 | 2024 |
| Gemini-exp-1114 | ~50% | Pass@1 | 2024 |
| Qwen2-Math-72B | 36.67% (11/30 on combined AIME 2024+2025) | Pass@1 | 2024 |
| GPT-4o | 12% (1.8/15) | Single sample | 2024 |
| Claude-3.5-Sonnet | 10% | Exact match | 2024 |
| GPT-4o-mini | 6.67% | Exact match | 2024 |
Note: o3-mini and o3 were released in 2025 (with o4-mini later succeeding o3-mini); performance figures for models released after 2024 are included for reference.
Key Findings
Performance Characteristics
1. **Reasoning vs. Non-Reasoning Models**: Models with explicit chain-of-thought reasoning capabilities significantly outperform traditional language models
2. **Scaling with Compute**: OpenAI demonstrated a log-linear relationship between accuracy and test-time compute (see the fitting sketch below)
3. **Problem Distribution**: Correct answers are distributed across different models, suggesting no single model has comprehensive problem-solving capabilities
4. **Difficulty Gradient**: Performance degrades significantly on later, more difficult problems
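The log-linear scaling claim can be illustrated by fitting accuracy against the logarithm of test-time compute. The data points below are hypothetical and for illustration only; they are not OpenAI's reported measurements:

```python
import numpy as np

# Hypothetical (relative compute, accuracy) pairs, for illustration only.
compute = np.array([1.0, 2.0, 4.0, 8.0, 16.0, 32.0])
accuracy = np.array([0.30, 0.38, 0.45, 0.53, 0.60, 0.68])

# Fit accuracy ~= slope * ln(compute) + intercept.
slope, intercept = np.polyfit(np.log(compute), accuracy, deg=1)
print(f"accuracy ≈ {slope:.3f} * ln(compute) + {intercept:.3f}")
```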
Human Comparison
- **Median Human Score**: 4-6 problems correct (26.67%-40%)
- **Top 500 Students Nationally**: ~13.9 problems correct (93%) places a scorer among the top 500 students nationally
- **USAMO Qualification**: Typically requires 9+ correct answers
The best AI performance (o1 with re-ranking at 93%) places it among the top 500 students nationally, above the USAMO qualification threshold.
Mathematical Domains Covered
The AIME 2024 benchmark tests proficiency across multiple mathematical domains:
| Domain | Example Topics | Percentage of Problems |
|---|---|---|
| Algebra | Polynomial equations, systems of equations, inequalities | ~27% |
| Geometry | Euclidean geometry, coordinate geometry, transformations | ~27% |
| Number Theory | Divisibility, modular arithmetic, prime numbers | ~20% |
| Combinatorics | Counting principles, probability, discrete structures | ~20% |
| Complex Numbers | Roots of unity, complex-number arithmetic and geometry | ~6% |
Limitations and Considerations
Data Contamination Concerns
A significant concern with AIME 2024 as a benchmark is potential data contamination:
- Problems and solutions are publicly available online
- Models may have encountered these problems during pre-training
- Performance differences between AIME 2024 and AIME 2025 suggest possible contamination
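One common, though imperfect, way to probe for contamination is to look for long n-gram overlaps between benchmark problems and candidate training text. The sketch below illustrates that idea; it is an assumption about how such a check could be done, not the method used by any official release:

```python
def ngram_set(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """All whitespace-token n-grams; 13-grams are a common contamination heuristic."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def possibly_contaminated(problem: str, training_document: str, n: int = 13) -> bool:
    """Flag a problem that shares any long n-gram with a training document."""
    return bool(ngram_set(problem, n) & ngram_set(training_document, n))
```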
Statistical Limitations
- **Small Dataset Size**: Only 15 problems, which limits statistical power
- **High Variance**: Individual run results vary significantly
- **Limited Diversity**: Problems focus on specific mathematical competition style
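The variance concern can be made concrete: with 15 problems, a single additional correct answer moves the score by 1/15 ≈ 6.7 percentage points, and the binomial standard error of an accuracy estimate is correspondingly large. A quick back-of-the-envelope calculation:

```python
from math import sqrt

n = 15  # number of problems
for p in (0.1, 0.5, 0.8):  # example accuracy levels
    se = sqrt(p * (1 - p) / n)  # binomial standard error of the accuracy estimate
    print(f"accuracy {p:.0%}: standard error ≈ {se:.1%}")
# Roughly 8-13 percentage points, which is why results are averaged over many runs.
```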
Related Benchmarks
AIME 2024 is part of a broader ecosystem of mathematical reasoning benchmarks:
- AIME 2025: Successor benchmark with 15 new problems
- MATH: Broader mathematical problem dataset with 12,500 problems
- GSM8K: Grade school math problems benchmark
- GPQA Diamond: PhD-level science questions including mathematics
- Minerva Math: Quantitative reasoning evaluation set popularized by Google's Minerva model
- HumanEval: Code generation benchmark, often reported alongside mathematical reasoning results
Impact and Significance
The AIME 2024 benchmark has several important implications:
Research Impact
1. **Capability Assessment**: Provides clear metrics for progress in mathematical reasoning
2. **Architecture Development**: Drives development of reasoning-optimized models
3. **Training Methodology**: Influences approaches to training models for mathematical problem solving
Educational Implications
- Demonstrates AI approaching expert-level mathematical problem-solving
- Raises questions about AI tutoring and educational assistance
- Highlights gaps between computational ability and mathematical understanding
Future Directions
- Development of contamination-resistant evaluation methods
- Extension to other mathematical competition formats
- Integration with interactive theorem proving systems
- Exploration of mathematical creativity vs. pattern matching
See Also
- Mathematical Reasoning in AI
- AI Benchmarks
- OpenAI o-series Models
- Chain-of-Thought Prompting
- USA Mathematical Olympiad
- Mathematical Competitions