| AIME 2024 | |
|---|---|
| Overview | |
| Full name | American Invitational Mathematics Examination 2024 |
| Abbreviation | AIME 2024 |
| Description | A challenging mathematical reasoning benchmark based on the American Invitational Mathematics Examination 2024 problems, designed to evaluate AI models' ability to solve complex high school mathematics problems requiring multi-step reasoning |
| Release date | 2024-02-01 |
| Latest version | 1.0 |
| Benchmark updated | 2024-02-07 |
| Authors | Mathematical Association of America |
| Organization | Mathematical Association of America (MAA), Art of Problem Solving (AoPS) |
| Technical Details | |
| Type | Mathematical Reasoning, Problem Solving |
| Modality | Text |
| Task format | Open-ended problem solving |
| Number of tasks | 15 |
| Total examples | 15 |
| Evaluation metric | Exact Match, Pass@1 |
| Domains | Algebra, Geometry, Number Theory, Combinatorics, Probability |
| Languages | English |
| Performance | |
| Human performance | 26.67%-40% (4-6 problems correct) |
| Baseline | 10% (GPT-4o) |
| SOTA score | 93% (o1 with re-ranking) |
| SOTA model | OpenAI o1 |
| SOTA date | 2024-09-12 |
| Saturated | No |
| Resources | |
| Website | Official website |
| GitHub | Repository |
| Dataset | Download |
| Predecessor | AIME 2023 |
| Successor | AIME 2025 |
AIME 2024 is an AI benchmark that evaluates large language models' ability to solve complex mathematical problems from the 2024 American Invitational Mathematics Examination. The benchmark consists of 15 challenging mathematical problems that require advanced problem-solving skills, mathematical reasoning, and multi-step logical thinking typically expected of top high school mathematics students.
The AIME 2024 benchmark is based on problems from the American Invitational Mathematics Examination, a prestigious invite-only mathematics competition for high school students who perform in the top 5% of the AMC 12 mathematics exam. The benchmark serves as a critical test for evaluating AI models' capabilities in advanced mathematical reasoning, particularly in areas that require creative problem-solving approaches and deep mathematical understanding.
The American Invitational Mathematics Examination (AIME) is one of the most challenging high school mathematics competitions in the United States, serving as a qualification pathway for the USA Mathematical Olympiad (USAMO). The 2024 edition was administered in two sessions: AIME I on February 1, 2024, and AIME II on February 7, 2024. The problems cover topics in algebra, geometry, number theory, combinatorics, and probability theory.
The adaptation of AIME 2024 as an AI benchmark represents a significant milestone in evaluating artificial intelligence systems' mathematical capabilities, as these problems require not just computational ability but genuine mathematical insight and reasoning that has traditionally been considered uniquely human.
Each of the 15 problems in AIME 2024 requires an integer answer between 000 and 999, with no partial credit awarded, and typically demands several steps of non-routine mathematical reasoning.
The problems increase in difficulty progressively, with later problems requiring more sophisticated mathematical techniques and insights.
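Because AIME answers are always integers from 000 to 999 and receive no partial credit, exact-match grading reduces to a simple range check and comparison. A minimal sketch (function names are illustrative, not from any official harness):

```python
def is_valid_aime_answer(text: str) -> bool:
    """Check that a candidate answer parses as an integer in the AIME range 0-999."""
    try:
        value = int(text.strip())
    except ValueError:
        return False
    return 0 <= value <= 999


def grade(candidate: str, ground_truth: int) -> bool:
    """Exact-match grading: the answer is either fully correct or scores zero."""
    return is_valid_aime_answer(candidate) and int(candidate.strip()) == ground_truth
```

Leading zeros (e.g. "033") are accepted because they denote the same integer, matching competition convention.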
The benchmark employs several evaluation approaches:
| Evaluation Method | Description | Implementation |
|---|---|---|
| Exact Match | Models must produce the exact integer answer | Answer extracted from model output and compared to ground truth |
| Pass@1 | Single attempt accuracy | Model given one attempt per problem |
| Pass@k | Best of k attempts | Multiple samples generated, best answer selected |
| Consensus Voting | Majority vote from multiple attempts | Multiple runs aggregated to reduce variance |
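The Pass@k and consensus-voting rows above can be sketched concretely. The `pass_at_k` function below uses the standard unbiased estimator (probability that at least one of k samples drawn from n total, c of them correct, is correct); `consensus_answer` is a plain majority vote. Function names are illustrative:

```python
from collections import Counter
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate given n samples of which c are correct:
    1 - C(n-c, k) / C(n, k), the chance at least one of k draws is correct."""
    if n - c < k:
        return 1.0  # too few incorrect samples to fill k draws: guaranteed hit
    return 1.0 - comb(n - c, k) / comb(n, k)


def consensus_answer(samples: list[int]) -> int:
    """Majority vote: the most frequent final answer across sampled attempts."""
    return Counter(samples).most_common(1)[0][0]
```

With n = c, `pass_at_k` is 1.0 for any k; with a single sample per problem it reduces to plain Pass@1 accuracy.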
Because the benchmark contains only 15 problems, standard practice is to run models 8 times and average the results to reduce variance. Models are typically prompted with: "Please reason step by step, and put your final answer within \boxed{}".
The following table shows the performance of various AI models on AIME 2024:
| Model | Pass@1 Score | Methodology | Date |
|---|---|---|---|
| OpenAI o1 (with re-ranking) | 93% (13.9/15) | Re-ranking 1000 samples | September 2024 |
| OpenAI o3 | 91.6% | Single sample | April 2025 |
| OpenAI o3-mini | 87.3% | Single sample | April 2025 |
| OpenAI o1 (consensus) | 83% (12.5/15) | Consensus among 64 samples | September 2024 |
| DeepSeek R1 | 79.8% | Multiple runs averaged | January 2025 |
| OpenAI o1 | 74% (11.1/15) | Single sample | September 2024 |
| o1-mini | 56.67% | Pass@1 | 2024 |
| Gemini-exp-1114 | ~50% | Pass@1 | 2024 |
| Qwen2-Math-72B | 36.67% (11/30 on combined AIME 2024+2025) | Pass@1 | 2024 |
| GPT-4o | 12% (1.8/15) | Single sample | 2024 |
| Claude-3.5-Sonnet | 10% | Exact match | 2024 |
| GPT-4o-mini | 6.67% | Exact match | 2024 |
Note: o3-mini was released in January 2025 and o3 in April 2025, with o4-mini succeeding o3-mini shortly after. Performance figures for models released after 2024 are included for reference.
1. **Reasoning vs. Non-Reasoning Models**: Models with explicit chain-of-thought reasoning capabilities significantly outperform traditional language models
2. **Scaling with Compute**: OpenAI demonstrated a log-linear relationship between accuracy and test-time compute
3. **Problem Distribution**: Correct answers are distributed across different models, suggesting no single model has comprehensive problem-solving capabilities
4. **Difficulty Gradient**: Performance degrades significantly on later, more difficult problems
The best AI performance (o1 with re-ranking at 93%) places it among the top 500 students nationally, above the USAMO qualification threshold.
The AIME 2024 benchmark tests proficiency across multiple mathematical domains:
| Domain | Example Topics | Percentage of Problems |
|---|---|---|
| Algebra | Polynomial equations, systems of equations, inequalities | ~27% |
| Geometry | Euclidean geometry, coordinate geometry, transformations | ~27% |
| Number Theory | Divisibility, modular arithmetic, prime numbers | ~20% |
| Combinatorics | Counting principles, probability, discrete structures | ~20% |
| Complex Analysis | Complex numbers, roots of unity | ~6% |
A significant concern with AIME 2024 as a benchmark is potential data contamination: the problems and solutions were published publicly in February 2024, so models trained on data collected after that date may have seen them, inflating reported scores.
AIME 2024 is part of a broader ecosystem of mathematical reasoning benchmarks, preceded by AIME 2023 and succeeded by AIME 2025.
The AIME 2024 benchmark has several important implications:
1. **Capability Assessment**: Provides clear metrics for mathematical reasoning progress
2. **Architecture Development**: Drives development of reasoning-optimized models
3. **Training Methodology**: Influences approaches to mathematical problem training