AIME 2024



AIME 2024
Overview
Full name American Invitational Mathematics Examination 2024
Abbreviation AIME 2024
Description A challenging mathematical reasoning benchmark based on the American Invitational Mathematics Examination 2024 problems, designed to evaluate AI models' ability to solve complex high school mathematics problems requiring multi-step reasoning
Release date 2024-02-01
Latest version 1.0
Benchmark updated 2024-02-07
Authors Mathematical Association of America
Organization Mathematical Association of America (MAA), Art of Problem Solving (AoPS)
Technical Details
Type Mathematical Reasoning, Problem Solving
Modality Text
Task format Open-ended problem solving
Number of tasks 15
Total examples 15
Evaluation metric Exact Match, Pass@1
Domains Algebra, Geometry, Number Theory, Combinatorics, Probability
Languages English
Performance
Human performance 26.67%-40% (4-6 problems correct)
Baseline 10% (GPT-4o)
SOTA score 93% (o1 with re-ranking)
SOTA model OpenAI o1
SOTA date 2024-09-12
Saturated No
Resources
Website Official website
GitHub Repository
Dataset Download
Predecessor AIME 2023
Successor AIME 2025


AIME 2024 is an AI benchmark that evaluates large language models' ability to solve complex mathematical problems from the 2024 American Invitational Mathematics Examination. The benchmark consists of 15 challenging mathematical problems that require advanced problem-solving skills, mathematical reasoning, and multi-step logical thinking typically expected of top high school mathematics students.

Overview

The AIME 2024 benchmark is based on problems from the American Invitational Mathematics Examination, a prestigious invitation-only mathematics competition for high school students who score in roughly the top 5% on the AMC 12 exam. The benchmark serves as a critical test of AI models' capabilities in advanced mathematical reasoning, particularly in areas that require creative problem-solving approaches and deep mathematical understanding.

Background

The American Invitational Mathematics Examination (AIME) is one of the most challenging high school mathematics competitions in the United States, serving as a qualification pathway for the USA Mathematical Olympiad (USAMO). The 2024 edition was administered in two sessions: AIME I on February 1, 2024, and AIME II on February 7, 2024. The problems cover topics in algebra, geometry, number theory, combinatorics, and probability theory.

The adaptation of AIME 2024 as an AI benchmark represents a significant milestone in evaluating artificial intelligence systems' mathematical capabilities, as these problems require not just computational ability but genuine mathematical insight and reasoning that has traditionally been considered uniquely human.

Technical Specifications

Problem Format

Each of the 15 problems in AIME 2024 requires:

  • Comprehensive understanding of multiple mathematical concepts
  • Multi-step reasoning and problem decomposition
  • Creative approaches to problem-solving
  • Precise numerical answers (integers from 0 to 999)

The problems increase in difficulty progressively, with later problems requiring more sophisticated mathematical techniques and insights.
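
As a rough illustration of this format, a single benchmark item can be represented as a problem statement paired with its ground-truth integer; the field names in the sketch below are hypothetical rather than the schema of any particular dataset release.

```python
from dataclasses import dataclass

# Hypothetical representation of one AIME 2024 benchmark item.
# Field names are illustrative; published datasets may structure this differently.
@dataclass
class AimeProblem:
    problem_id: str   # e.g. "2024-I-7" (year, session, problem number)
    statement: str    # full problem text
    answer: int       # ground-truth answer, an integer in [0, 999]

    def __post_init__(self) -> None:
        # AIME answers are always integers from 0 to 999.
        if not 0 <= self.answer <= 999:
            raise ValueError(f"AIME answers must lie in [0, 999], got {self.answer}")
```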

Evaluation Methodology

The benchmark employs several evaluation approaches:

| Evaluation Method | Description | Implementation |
|---|---|---|
| Exact Match | Models must produce the exact integer answer | Answer extracted from model output and compared to ground truth |
| Pass@1 | Single-attempt accuracy | Model given one attempt per problem |
| Pass@k | Best of k attempts | Multiple samples generated, best answer selected |
| Consensus Voting | Majority vote from multiple attempts | Multiple runs aggregated to reduce variance |
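
To make the aggregation schemes concrete, the sketch below shows the standard unbiased pass@k estimator (popularized by the HumanEval paper) alongside a simple majority vote over sampled answers. It is a generic illustration, not the scoring code of any particular leaderboard.

```python
from collections import Counter
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate given n samples of which c are correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def consensus_answer(sampled_answers: list[int]) -> int:
    """Majority (plurality) vote over integer answers from repeated sampling."""
    return Counter(sampled_answers).most_common(1)[0][0]

# Example: 64 samples on one problem, 40 of which are correct.
print(pass_at_k(n=64, c=40, k=1))              # 0.625, the per-sample accuracy
print(consensus_answer([371, 371, 110, 371]))  # 371 wins the vote
```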

To reduce variance due to the small dataset size, standard practice involves running models 8 times on the benchmark and averaging the results. Models are typically prompted with: "Please reason step by step, and put your final answer within \boxed{}"
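
A minimal scoring loop under that convention might extract the final \boxed{...} value from each completion, compare it to the ground-truth integer, and average accuracy over the repeated runs. The regular expression and function names below are assumptions made for illustration, not part of any official evaluation harness.

```python
import re
from statistics import mean

def extract_boxed_answer(completion: str) -> int | None:
    """Return the last integer wrapped in \\boxed{...}, if any."""
    matches = re.findall(r"\\boxed\{\s*(\d{1,3})\s*\}", completion)
    return int(matches[-1]) if matches else None

def score_run(completions: list[str], answers: list[int]) -> float:
    """Exact-match accuracy of one run over the 15 problems."""
    correct = sum(extract_boxed_answer(c) == a for c, a in zip(completions, answers))
    return correct / len(answers)

def aggregate(runs: list[list[str]], answers: list[int]) -> float:
    """Average accuracy across repeated runs (typically 8) to reduce variance."""
    return mean(score_run(run, answers) for run in runs)
```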

Performance Analysis

Model Performance Comparison

The following table shows the performance of various AI models on AIME 2024:

| Model | Score | Methodology | Date |
|---|---|---|---|
| OpenAI o1 (with re-ranking) | 93% (13.9/15) | Re-ranking of 1,000 samples | September 2024 |
| OpenAI o3 | 91.6% | Single sample | April 2025 |
| OpenAI o3-mini | 87.3% | Single sample | April 2025 |
| OpenAI o1 (consensus) | 83% (12.5/15) | Consensus among 64 samples | September 2024 |
| DeepSeek R1 | 79.8% | Multiple runs averaged | January 2025 |
| OpenAI o1 | 74% (11.1/15) | Single sample | September 2024 |
| OpenAI o1-mini | 56.67% | Pass@1 | 2024 |
| Gemini-exp-1114 | ~50% | Pass@1 | 2024 |
| Qwen2-Math-72B | 36.67% (11/30 on combined AIME 2024+2025) | Pass@1 | 2024 |
| GPT-4o | 12% (1.8/15) | Single sample | 2024 |
| Claude-3.5-Sonnet | 10% | Exact match | 2024 |
| GPT-4o-mini | 6.67% | Exact match | 2024 |

Note: o3-mini was released in January 2025; o3 and o4-mini (the successor to o3-mini) followed in April 2025. Performance figures for models released after 2024 are included for reference.

Key Findings

Performance Characteristics

1. **Reasoning vs. Non-Reasoning Models**: Models with explicit chain-of-thought reasoning capabilities significantly outperform traditional language models
2. **Scaling with Compute**: OpenAI demonstrated a log-linear relationship between accuracy and test-time compute (see the sketch below)
3. **Problem Distribution**: Correct answers are distributed across different models, suggesting no single model has comprehensive problem-solving capabilities
4. **Difficulty Gradient**: Performance degrades significantly on later, more difficult problems
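
As a toy illustration of the log-linear trend in point 2, the snippet below fits accuracy against the logarithm of test-time compute. The numbers are invented for illustration and are not OpenAI's reported measurements.

```python
import numpy as np

# Hypothetical (relative compute, accuracy) pairs, for illustration only.
compute = np.array([1, 4, 16, 64, 256])             # relative test-time compute budget
accuracy = np.array([0.30, 0.42, 0.55, 0.66, 0.78])

# Fit accuracy ≈ a * ln(compute) + b
a, b = np.polyfit(np.log(compute), accuracy, deg=1)
print(f"accuracy ≈ {a:.3f} * ln(compute) + {b:.3f}")
```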

Human Comparison

  • **Median Human Score**: 4-6 problems correct (26.67%-40%)
  • **Top 500 Students Nationally**: a score of roughly 13.9/15 (93%) places among the top 500 scorers nationwide
  • **USAMO Qualification**: Typically requires 9+ correct answers

The best AI performance (o1 with re-ranking at 93%) places it among the top 500 students nationally, above the USAMO qualification threshold.

Mathematical Domains Covered

The AIME 2024 benchmark tests proficiency across multiple mathematical domains:

| Domain | Example Topics | Percentage of Problems |
|---|---|---|
| Algebra | Polynomial equations, systems of equations, inequalities | ~27% |
| Geometry | Euclidean geometry, coordinate geometry, transformations | ~27% |
| Number Theory | Divisibility, modular arithmetic, prime numbers | ~20% |
| Combinatorics | Counting principles, probability, discrete structures | ~20% |
| Complex Analysis | Complex numbers, roots of unity | ~6% |

Limitations and Considerations

Data Contamination Concerns

A significant concern with AIME 2024 as a benchmark is potential data contamination:

  • Problems and solutions are publicly available online
  • Models may have encountered these problems during pre-training
  • Performance differences between AIME 2024 and AIME 2025 suggest possible contamination (a simple overlap heuristic is sketched below)
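
One common way such contamination is screened for is n-gram overlap between benchmark problems and candidate training text. The sketch below is a deliberately simplified version of that idea; the tokenization, n-gram length, and flagging threshold are arbitrary choices for illustration.

```python
def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Whitespace-tokenized, lowercased n-grams of a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(problem: str, corpus_chunk: str, n: int = 8) -> float:
    """Fraction of the problem's n-grams that also appear in a corpus chunk."""
    problem_grams = ngrams(problem, n)
    if not problem_grams:
        return 0.0
    return len(problem_grams & ngrams(corpus_chunk, n)) / len(problem_grams)

# A high ratio (say, above 0.5) would flag the problem for manual review.
```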

Statistical Limitations

  • **Small Dataset Size**: With only 15 problems, statistical significance is limited (see the back-of-envelope calculation after this list)
  • **High Variance**: Individual run results vary significantly
  • **Limited Diversity**: Problems focus on specific mathematical competition style
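
As a back-of-envelope illustration of the first two points, the binomial standard error on a 15-question benchmark is large: a model whose true per-problem accuracy is 80% can easily swing by one or two problems between runs. The accuracy value below is hypothetical, not a measurement of any specific model.

```python
from math import sqrt

def standard_error(p: float, n: int) -> float:
    """Binomial standard error of accuracy measured on n independent problems."""
    return sqrt(p * (1 - p) / n)

n = 15     # problems in AIME 2024
p = 0.80   # hypothetical true per-problem accuracy
print(f"standard error ≈ {standard_error(p, n):.3f}")  # ≈ 0.103, about ±10 percentage points

# Averaging over 8 runs reduces run-to-run sampling noise, but the
# uncertainty that comes from having only 15 problems remains.
```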

Related Benchmarks

AIME 2024 is part of a broader ecosystem of mathematical reasoning benchmarks:

  • AIME 2025: Successor benchmark with 15 new problems
  • MATH: Broader mathematical problem dataset with 12,500 problems
  • GSM8K: Grade school math problems benchmark
  • GPQA Diamond: PhD-level science questions including mathematics
  • Minerva: Mathematical problem-solving benchmark
  • HumanEval: Code generation benchmark with mathematical components

Impact and Significance

The AIME 2024 benchmark has several important implications:

Research Impact

1. **Capability Assessment**: Provides clear metrics for mathematical reasoning progress
2. **Architecture Development**: Drives development of reasoning-optimized models
3. **Training Methodology**: Influences approaches to mathematical problem training

Educational Implications

  • Demonstrates AI approaching expert-level mathematical problem-solving
  • Raises questions about AI tutoring and educational assistance
  • Highlights gaps between computational ability and mathematical understanding

Future Directions

  • Development of contamination-resistant evaluation methods
  • Extension to other mathematical competition formats
  • Integration with interactive theorem proving systems
  • Exploration of mathematical creativity vs. pattern matching

See Also

References
