AIME 2024



AIME 2024
Overview
Full name American Invitational Mathematics Examination 2024
Abbreviation AIME 2024
Description A challenging mathematical reasoning benchmark based on the American Invitational Mathematics Examination 2024 problems, designed to evaluate AI models' ability to solve complex high school mathematics problems requiring multi-step reasoning
Release date 2024-02-01
Latest version 1.0
Benchmark updated 2024-02-07
Authors Mathematical Association of America
Organization Mathematical Association of America (MAA), Art of Problem Solving (AoPS)
Technical Details
Type Mathematical Reasoning, Problem Solving
Modality Text
Task format Open-ended problem solving
Number of tasks 15
Total examples 15
Evaluation metric Exact Match, Pass@1
Domains Algebra, Geometry, Number Theory, Combinatorics, Probability
Languages English
Performance
Human performance 26.67%-40% (4-6 problems correct)
Baseline 10% (GPT-4o)
SOTA score 93% (o1 with re-ranking)
SOTA model OpenAI o1
SOTA date 2024-09-12
Saturated No
Resources
Website Official website
GitHub Repository
Dataset Download
Predecessor AIME 2023
Successor AIME 2025


AIME 2024 is an AI benchmark that evaluates large language models' ability to solve complex mathematical problems from the 2024 American Invitational Mathematics Examination. The benchmark consists of 15 challenging mathematical problems that require advanced problem-solving skills, mathematical reasoning, and multi-step logical thinking typically expected of top high school mathematics students.

Overview

The AIME 2024 benchmark is based on problems from the American Invitational Mathematics Examination, a prestigious invitation-only mathematics competition for high school students who score in roughly the top 5% on the AMC 12 exam. The benchmark serves as a critical test of AI models' capabilities in advanced mathematical reasoning, particularly in areas that require creative problem-solving approaches and deep mathematical understanding.

Background

The American Invitational Mathematics Examination (AIME) is one of the most challenging high school mathematics competitions in the United States, serving as a qualification pathway for the USA Mathematical Olympiad (USAMO). The 2024 edition was administered in two sessions: AIME I on February 1, 2024, and AIME II on February 7, 2024. The problems cover topics in algebra, geometry, number theory, combinatorics, and probability theory.

The adaptation of AIME 2024 as an AI benchmark represents a significant milestone in evaluating artificial intelligence systems' mathematical capabilities, as these problems require not just computational ability but genuine mathematical insight and reasoning that has traditionally been considered uniquely human.

Technical Specifications

Problem Format

Each of the 15 problems in AIME 2024 requires:

  • Comprehensive understanding of multiple mathematical concepts
  • Multi-step reasoning and problem decomposition
  • Creative approaches to problem-solving
  • Precise numerical answers (integers from 0 to 999)

The problems increase in difficulty progressively, with later problems requiring more sophisticated mathematical techniques and insights.
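
As a rough illustration of this format, a single benchmark item can be represented as a problem statement paired with its ground-truth integer; the field names in the sketch below are hypothetical rather than the schema of any particular dataset release.

```python
from dataclasses import dataclass

# Hypothetical representation of one AIME 2024 benchmark item.
# Field names are illustrative; published datasets may structure this differently.
@dataclass
class AimeProblem:
    problem_id: str   # e.g. "2024-I-7" (year, session, problem number)
    statement: str    # full problem text
    answer: int       # ground-truth answer, an integer in [0, 999]

    def __post_init__(self) -> None:
        # AIME answers are always integers from 0 to 999.
        if not 0 <= self.answer <= 999:
            raise ValueError(f"AIME answers must lie in [0, 999], got {self.answer}")
```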

Evaluation Methodology

The benchmark employs several evaluation approaches:

| Evaluation Method | Description | Implementation |
|---|---|---|
| Exact Match | Models must produce the exact integer answer | Answer extracted from model output and compared to ground truth |
| Pass@1 | Single-attempt accuracy | Model given one attempt per problem |
| Pass@k | Best of k attempts | Multiple samples generated, best answer selected |
| Consensus Voting | Majority vote from multiple attempts | Multiple runs aggregated to reduce variance |
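
To make the aggregation schemes concrete, the sketch below shows the standard unbiased pass@k estimator (popularized by the HumanEval paper) alongside a simple majority vote over sampled answers. It is a generic illustration, not the scoring code of any particular leaderboard.

```python
from collections import Counter
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate given n samples of which c are correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def consensus_answer(sampled_answers: list[int]) -> int:
    """Majority (plurality) vote over integer answers from repeated sampling."""
    return Counter(sampled_answers).most_common(1)[0][0]

# Example: 64 samples on one problem, 40 of which are correct.
print(pass_at_k(n=64, c=40, k=1))              # 0.625, the per-sample accuracy
print(consensus_answer([371, 371, 110, 371]))  # 371 wins the vote
```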

To reduce variance due to the small dataset size, standard practice involves running models 8 times on the benchmark and averaging the results. Models are typically prompted with: "Please reason step by step, and put your final answer within \boxed{}"
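
A minimal scoring loop under that convention might extract the final \boxed{...} value from each completion, compare it to the ground-truth integer, and average accuracy over the repeated runs. The regular expression and function names below are assumptions made for illustration, not part of any official evaluation harness.

```python
import re
from statistics import mean

def extract_boxed_answer(completion: str) -> int | None:
    """Return the last integer wrapped in \\boxed{...}, if any."""
    matches = re.findall(r"\\boxed\{\s*(\d{1,3})\s*\}", completion)
    return int(matches[-1]) if matches else None

def score_run(completions: list[str], answers: list[int]) -> float:
    """Exact-match accuracy of one run over the 15 problems."""
    correct = sum(extract_boxed_answer(c) == a for c, a in zip(completions, answers))
    return correct / len(answers)

def aggregate(runs: list[list[str]], answers: list[int]) -> float:
    """Average accuracy across repeated runs (typically 8) to reduce variance."""
    return mean(score_run(run, answers) for run in runs)
```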

Performance Analysis

Model Performance Comparison

The following table shows the performance of various AI models on AIME 2024:

| Model | Score | Methodology | Date |
|---|---|---|---|
| OpenAI o1 (with re-ranking) | 93% (13.9/15) | Re-ranking of 1,000 samples | September 2024 |
| OpenAI o3 | 91.6% | Single sample | April 2025 |
| OpenAI o3-mini | 87.3% | Single sample | April 2025 |
| OpenAI o1 (consensus) | 83% (12.5/15) | Consensus among 64 samples | September 2024 |
| DeepSeek R1 | 79.8% | Multiple runs averaged | January 2025 |
| OpenAI o1 | 74% (11.1/15) | Single sample | September 2024 |
| OpenAI o1-mini | 56.67% | Pass@1 | 2024 |
| Gemini-exp-1114 | ~50% | Pass@1 | 2024 |
| Qwen2-Math-72B | 36.67% (11/30 on combined AIME 2024+2025) | Pass@1 | 2024 |
| GPT-4o | 12% (1.8/15) | Single sample | 2024 |
| Claude-3.5-Sonnet | 10% | Exact match | 2024 |
| GPT-4o-mini | 6.67% | Exact match | 2024 |

Note: o3-mini was released in January 2025; o3 and o4-mini (the successor to o3-mini) followed in April 2025. Performance figures for models released after 2024 are included for reference.

Key Findings

Performance Characteristics

1. **Reasoning vs. Non-Reasoning Models**: Models with explicit chain-of-thought reasoning capabilities significantly outperform traditional language models
2. **Scaling with Compute**: OpenAI demonstrated a log-linear relationship between accuracy and test-time compute (see the sketch below)
3. **Problem Distribution**: Correct answers are distributed across different models, suggesting no single model has comprehensive problem-solving capabilities
4. **Difficulty Gradient**: Performance degrades significantly on later, more difficult problems
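
As a toy illustration of the log-linear trend in point 2, the snippet below fits accuracy against the logarithm of test-time compute. The numbers are invented for illustration and are not OpenAI's reported measurements.

```python
import numpy as np

# Hypothetical (relative compute, accuracy) pairs, for illustration only.
compute = np.array([1, 4, 16, 64, 256])             # relative test-time compute budget
accuracy = np.array([0.30, 0.42, 0.55, 0.66, 0.78])

# Fit accuracy ≈ a * ln(compute) + b
a, b = np.polyfit(np.log(compute), accuracy, deg=1)
print(f"accuracy ≈ {a:.3f} * ln(compute) + {b:.3f}")
```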

Human Comparison

  • **Median Human Score**: 4-6 problems correct (26.67%-40%)
  • **Top 500 Students Nationally**: a score of roughly 13.9/15 (93%) places among the top 500 scorers nationwide
  • **USAMO Qualification**: Typically requires 9+ correct answers

The best AI performance (o1 with re-ranking at 93%) places it among the top 500 students nationally, above the USAMO qualification threshold.

Mathematical Domains Covered

The AIME 2024 benchmark tests proficiency across multiple mathematical domains:

| Domain | Example Topics | Percentage of Problems |
|---|---|---|
| Algebra | Polynomial equations, systems of equations, inequalities | ~27% |
| Geometry | Euclidean geometry, coordinate geometry, transformations | ~27% |
| Number Theory | Divisibility, modular arithmetic, prime numbers | ~20% |
| Combinatorics | Counting principles, probability, discrete structures | ~20% |
| Complex Analysis | Complex numbers, roots of unity | ~6% |

Limitations and Considerations

Data Contamination Concerns

A significant concern with AIME 2024 as a benchmark is potential data contamination:

  • Problems and solutions are publicly available online
  • Models may have encountered these problems during pre-training
  • Performance differences between AIME 2024 and AIME 2025 suggest possible contamination (a simple overlap heuristic is sketched below)
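
One common way such contamination is screened for is n-gram overlap between benchmark problems and candidate training text. The sketch below is a deliberately simplified version of that idea; the tokenization, n-gram length, and flagging threshold are arbitrary choices for illustration.

```python
def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Whitespace-tokenized, lowercased n-grams of a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(problem: str, corpus_chunk: str, n: int = 8) -> float:
    """Fraction of the problem's n-grams that also appear in a corpus chunk."""
    problem_grams = ngrams(problem, n)
    if not problem_grams:
        return 0.0
    return len(problem_grams & ngrams(corpus_chunk, n)) / len(problem_grams)

# A high ratio (say, above 0.5) would flag the problem for manual review.
```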

Statistical Limitations

  • **Small Dataset Size**: With only 15 problems, statistical significance is limited (see the back-of-envelope calculation after this list)
  • **High Variance**: Individual run results vary significantly
  • **Limited Diversity**: Problems focus on specific mathematical competition style
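
As a back-of-envelope illustration of the first two points, the binomial standard error on a 15-question benchmark is large: a model whose true per-problem accuracy is 80% can easily swing by one or two problems between runs. The accuracy value below is hypothetical, not a measurement of any specific model.

```python
from math import sqrt

def standard_error(p: float, n: int) -> float:
    """Binomial standard error of accuracy measured on n independent problems."""
    return sqrt(p * (1 - p) / n)

n = 15     # problems in AIME 2024
p = 0.80   # hypothetical true per-problem accuracy
print(f"standard error ≈ {standard_error(p, n):.3f}")  # ≈ 0.103, about ±10 percentage points

# Averaging over 8 runs reduces run-to-run sampling noise, but the
# uncertainty that comes from having only 15 problems remains.
```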

Related Benchmarks

AIME 2024 is part of a broader ecosystem of mathematical reasoning benchmarks:

  • AIME 2025: Successor benchmark with 15 new problems
  • MATH: Broader mathematical problem dataset with 12,500 problems
  • GSM8K: Grade school math problems benchmark
  • GPQA Diamond: PhD-level science questions including mathematics
  • Minerva: Mathematical problem-solving benchmark
  • HumanEval: Code generation benchmark with mathematical components

Impact and Significance

The AIME 2024 benchmark has several important implications:

Research Impact

1. **Capability Assessment**: Provides clear metrics for mathematical reasoning progress
2. **Architecture Development**: Drives development of reasoning-optimized models
3. **Training Methodology**: Influences approaches to mathematical problem training

Educational Implications

  • Demonstrates AI approaching expert-level mathematical problem-solving
  • Raises questions about AI tutoring and educational assistance
  • Highlights gaps between computational ability and mathematical understanding

Future Directions

  • Development of contamination-resistant evaluation methods
  • Extension to other mathematical competition formats
  • Integration with interactive theorem proving systems
  • Exploration of mathematical creativity vs. pattern matching

See Also

References
