AIME 2024

Overview
Full name	American Invitational Mathematics Examination 2024
Abbreviation	AIME 2024
Description	A challenging mathematical reasoning benchmark based on the American Invitational Mathematics Examination 2024 problems, designed to evaluate AI models' ability to solve complex high school mathematics problems requiring multi-step reasoning
Release date	2024-02-01
Latest version	1.0
Benchmark updated	2024-02-07
Authors	Mathematical Association of America
Organization	Mathematical Association of America (MAA), Art of Problem Solving (AoPS)
Technical Details
Type	Mathematical Reasoning, Problem Solving
Modality	Text
Task format	Open-ended problem solving
Number of tasks	30 (15 from AIME I + 15 from AIME II)
Total examples	30
Evaluation metric	Exact Match, Pass@1, Cons@N
Domains	Algebra, Geometry, Number Theory, Combinatorics, Probability
Languages	English
Performance
Human performance	~27% (median AIME qualifier solves about 4 of 15)
Baseline	~12% (GPT-4o pass@1)
SOTA score	95.8% (Grok 3 Mini, cons@64)
SOTA model	Grok 3 Mini (xAI)
SOTA date	2025-02-17
Saturated	Yes (top reasoning models exceed 90%)
Resources
Website	Official website
GitHub	Repository
Dataset	Download
Predecessor	AIME 2023
Successor	AIME 2025

AIME 2024 is an AI benchmark that evaluates large language models on the 2024 American Invitational Mathematics Examination, a competition written by the Mathematical Association of America for top high school mathematicians in the United States. The benchmark consists of 30 problems (15 from AIME I and 15 from AIME II), each with an integer answer between 0 and 999. Because each answer is one of 1,000 possible integers, guessing rarely succeeds, and the problems themselves demand layered reasoning. For these reasons AIME 2024 became one of the most cited tests of mathematical reasoning for the reasoning models released in late 2024 and 2025.

Overview

The AIME 2024 benchmark is built on the 2024 cycle of the American Invitational Mathematics Examination, the invitational round of the AMC competition series. AIME is administered to students who score in roughly the top 2.5% of the AMC 10 or the top 5% of the AMC 12. Strong AIME performance is the gateway to the USA Mathematical Olympiad (USAMO) and USA Junior Mathematical Olympiad (USAJMO), and the qualifying index for those competitions combines AMC and AIME scores.

The AIME 2024 problems were administered in two sittings: AIME I on January 31 to February 1, 2024, and AIME II on February 7, 2024. Each sitting contains 15 problems and runs for three hours. After the contest closed, the problems and full solutions were posted publicly on the Art of Problem Solving wiki and other community archives. That public availability made AIME 2024 a convenient target for AI researchers, and it is also the source of the contamination concerns that emerged later.

The machine-readable benchmark most often used in papers is the Hugging Face dataset published by Maxwell Jia, which packages the 30 official problems with reference answers in JSON Lines format. Some early evaluations, including OpenAI's blog post for o1, restricted the benchmark to the 15 AIME I problems only, which is one source of confusion when comparing scores across papers.
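
As a minimal sketch of how such a JSON Lines file could be read, the snippet below parses one record per line and checks that each reference answer falls in the 0 to 999 range. The field names `ID`, `Problem`, and `Answer`, and the two demo records, are assumptions made for illustration; check them against the actual dataset before relying on them.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class AimeProblem:
    problem_id: str
    statement: str
    answer: int


def parse_jsonl(text: str) -> list[AimeProblem]:
    """Parse JSON Lines text into problems, validating the 0-999 answer range."""
    problems = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        # Field names below are assumed, not confirmed against the real dataset.
        answer = int(record["Answer"])
        if not 0 <= answer <= 999:
            raise ValueError(f"answer out of AIME range: {answer}")
        problems.append(
            AimeProblem(str(record["ID"]), record["Problem"], answer)
        )
    return problems


# Two made-up records standing in for real benchmark rows.
sample = (
    '{"ID": "demo-1", "Problem": "What is 2 + 2?", "Answer": "4"}\n'
    '{"ID": "demo-2", "Problem": "What is 10 * 10?", "Answer": 100}\n'
)
problems = parse_jsonl(sample)
```

Converting the answer with `int` up front means a string such as "073" and the integer 73 compare equal later, which avoids a common source of false mismatches.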

Why AIME 2024 became an LLM benchmark

A few features made AIME 2024 the right shape for evaluating reasoning models:

Integer answers from 0 to 999: the model either writes the correct integer or it does not, which makes grading cheap and reproducible.
No partial credit, no negative marking: a clean exact-match metric.
Hard but bounded math: every problem is solvable from a high school curriculum, but the later items demand creative combinations of algebra, number theory, geometry, and combinatorics. That is the level where non-reasoning chat models like GPT-4o consistently failed and where chain-of-thought training started to pay off.
Small, well-known dataset: only 30 problems, easy to run and inspect.
Public reference solutions: makes detailed error analysis straightforward, not just final accuracy.

Technical specifications

Problem format

Each AIME 2024 problem requires the model to output a single integer between 000 and 999. Problems are presented as plain text and may include LaTeX. Common task patterns include:

Counting and probability questions where the answer is the numerator plus denominator of a reduced fraction.
Geometry problems where the answer is some integer length, area, or sum of unknowns.
Number theory problems asking for a specific residue, sum of digits, or count of solutions.
Algebra problems where the answer is a coefficient, a polynomial value at a point, or the sum m+n where the original answer is m/n.
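
The "m+n" convention in the patterns above can be sketched in a few lines. This is an illustrative helper, not part of any official grader: it reduces a fraction, adds numerator and denominator, and formats the result as the three-digit string used on AIME answer sheets. The function names and the 6/16 example are made up for the illustration.

```python
from fractions import Fraction


def aime_answer_from_fraction(numerator: int, denominator: int) -> int:
    """Return m + n for the reduced fraction m/n, checked against 0-999."""
    reduced = Fraction(numerator, denominator)  # Fraction reduces automatically
    answer = reduced.numerator + reduced.denominator
    if not 0 <= answer <= 999:
        raise ValueError(f"{answer} is outside the AIME answer range")
    return answer


def format_answer(answer: int) -> str:
    """Zero-pad to three digits, e.g. 11 becomes '011'."""
    return f"{answer:03d}"


# A probability of 6/16 reduces to 3/8, so the reported answer is 3 + 8 = 11.
answer = aime_answer_from_fraction(6, 16)
formatted = format_answer(answer)
```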

The difficulty curve is steep. Problems 1 to 5 are typically tractable for an experienced AMC solver. Problems 11 to 15 frequently require nonobvious construction or clever invariants and are roughly USAMO entry difficulty.

Evaluation methodology

The benchmark is run in several ways depending on the paper:

Evaluation method	Description	Implementation
Exact match	Model output must equal the ground truth integer	Final answer extracted from `\boxed{}` or last line of model output
Pass@1 (greedy)	Single deterministic attempt	Temperature 0, no sampling
Pass@1 (averaged)	Average correctness across many samples	Common setup is 64 samples per problem with temperature 0.6 and top-p 0.95
Cons@N (consensus)	Majority vote over N samples	Often N=64; reduces variance from sampling
Best-of-N	Re-rank N samples with a learned scorer	Used by OpenAI for the 93% o1 result
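
The sampling-based rows of the table differ only in how the per-problem samples are aggregated. The sketch below makes that concrete under simple assumptions: answers are already extracted as integers, every problem has the same number of samples, and the re-ranker is represented by a precomputed score per sample. The toy data is invented for illustration and does not come from any model.

```python
from collections import Counter


def pass_at_1_averaged(samples, truth):
    """Mean per-sample accuracy, averaged over problems."""
    rates = [
        sum(answer == target for answer in answers) / len(answers)
        for answers, target in zip(samples, truth)
    ]
    return sum(rates) / len(rates)


def cons_at_n(samples, truth):
    """Fraction of problems whose majority-vote answer is correct."""
    hits = 0
    for answers, target in zip(samples, truth):
        # most_common breaks ties by first occurrence, an arbitrary choice here
        majority, _count = Counter(answers).most_common(1)[0]
        hits += majority == target
    return hits / len(truth)


def best_of_n(samples, scores, truth):
    """Fraction correct when a scorer picks one sample per problem."""
    hits = 0
    for answers, sample_scores, target in zip(samples, scores, truth):
        best_index = max(range(len(answers)), key=lambda i: sample_scores[i])
        hits += answers[best_index] == target
    return hits / len(truth)


# Toy data: 2 problems, 4 sampled answers each (hypothetical values).
truth = [5, 12]
samples = [[5, 5, 7, 5], [12, 3, 3, 3]]
scores = [[0.1, 0.9, 0.2, 0.3], [0.8, 0.1, 0.2, 0.3]]

avg = pass_at_1_averaged(samples, truth)
cons = cons_at_n(samples, truth)
best = best_of_n(samples, scores, truth)
```

On this toy data the second problem shows why the three numbers can diverge: only one of four samples is right, so majority voting fails, while a scorer that happens to rank the correct sample first recovers it.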

DeepSeek's R1 paper, for example, fixes temperature to 0.6, top-p to 0.95, and reports pass@1 averaged over 64 sampled responses, which is now a common reference setup. Models are usually prompted with something close to: "Please reason step by step, and put your final answer within \boxed{}."
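
A grader for this prompt style needs to pull the final integer out of free-form output. The following is a minimal sketch of one way to do that, assuming the answer sits in the last `\boxed{}` without nested braces and falling back to the last integer on the final non-empty line. Real evaluation harnesses differ in their extraction rules, so treat the details as assumptions.

```python
import re

PROMPT_SUFFIX = (
    "Please reason step by step, and put your final answer within \\boxed{}."
)

# Matches \boxed{...} with no nested braces inside.
BOXED = re.compile(r"\\boxed\{([^{}]*)\}")


def extract_answer(output: str):
    """Return the final integer answer in 0-999, or None if none is found."""
    boxed = BOXED.findall(output)
    if boxed:
        text = boxed[-1].strip()
        # Accept only a bare 1-3 digit integer such as "73" or "073".
        return int(text) if re.fullmatch(r"[0-9]{1,3}", text) else None
    lines = [line for line in output.splitlines() if line.strip()]
    if not lines:
        return None
    numbers = re.findall(r"[0-9]+", lines[-1])
    if not numbers:
        return None
    value = int(numbers[-1])
    return value if value <= 999 else None


def is_correct(output: str, truth: int) -> bool:
    """Exact match between the extracted answer and the reference integer."""
    return extract_answer(output) == truth


def build_prompt(problem: str) -> str:
    """Append the instruction to the problem statement."""
    return f"{problem}\n\n{PROMPT_SUFFIX}"
```

Returning `None` for an unparseable answer, rather than guessing, keeps extraction failures visible as a separate category from wrong answers.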

Because there are only 30 problems, single-run scores have high variance. A model that gets 24 right one run and 21 right the next has shifted by 10 percentage points without any change in capability. That is why most credible scores either average over many seeds or report cons@N.
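
To get a feel for how often such a swing could arise by chance, one rough model treats each of the 30 problems as an independent trial with the same success probability. That assumption is unrealistic (real problems vary widely in difficulty), and the value p = 0.75 is chosen only for illustration, but the sketch shows how the calculation would go.

```python
from math import comb


def binomial_pmf(n: int, p: float) -> list[float]:
    """Probability of k successes in n independent trials, for k = 0..n."""
    return [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]


def prob_runs_differ_by(n: int, p: float, gap: int) -> float:
    """Probability that two independent runs differ by at least `gap` problems."""
    pmf = binomial_pmf(n, p)
    return sum(
        pmf[i] * pmf[j]
        for i in range(n + 1)
        for j in range(n + 1)
        if abs(i - j) >= gap
    )


# 24 correct versus 21 correct out of 30 is a 10-point swing.
swing_points = 100 * (24 - 21) / 30

# Chance of a swing of 3 or more problems between two runs, assuming p = 0.75.
p_swing = prob_runs_differ_by(30, 0.75, 3)
```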

Performance analysis

Model performance comparison

The table below collects widely-cited AIME 2024 scores from primary sources. Where scores were originally reported on the 15-problem AIME I subset (as in OpenAI's September 2024 blog post), that is noted in the methodology column.

Model	AIME 2024 score	Methodology	Source / date
Grok 3 Mini (Think, high)	95.8%	cons@64, test-time compute scaling	xAI, February 2025
OpenAI o4-mini	93.4%	pass@1, no tools	OpenAI, April 2025
Grok 3 (Think)	93.3%	cons@64	xAI, February 2025
OpenAI o1 (re-ranked)	93% (13.9/15)	re-ranking 1000 samples on AIME I	OpenAI, September 2024
Gemini 2.5 Pro	92.0%	pass@1	Google DeepMind, March 2025
OpenAI o3	91.6%	pass@1	OpenAI, April 2025
OpenAI o3-mini (high)	87.3%	pass@1	OpenAI, January 2025
DeepSeek R1-Zero (cons@64)	86.7%	majority vote over 64 samples	DeepSeek, January 2025
OpenAI o1 (cons@64)	83% (12.5/15)	consensus on AIME I	OpenAI, September 2024
Claude 3.7 Sonnet (extended thinking)	80.0%	parallel extended thinking, 64K token budget	Anthropic, February 2025
DeepSeek R1	79.8%	pass@1 averaged over 64 samples	DeepSeek paper, January 2025
QwQ-32B (Alibaba)	79.5%	pass@1 averaged	Alibaba, March 2025
OpenAI o1 (single sample)	74.4%	pass@1 on AIME I (11.1/15)	OpenAI, September 2024
Gemini 2.0 Flash Thinking	73.3%	pass@1	Google, December 2024
OpenAI o1-mini	63.6%	pass@1 averaged	DeepSeek paper / OpenAI
Grok 3 (base, non-reasoning)	52.2%	pass@1	xAI, February 2025
Gemini 2.0 Flash (experimental)	35.5%	pass@1	Google, December 2024
Claude 3.7 Sonnet (standard mode)	23.3%	pass@1, no extended thinking	Anthropic, February 2025
Gemini 1.5 Pro	19.3%	pass@1	Google, 2024
GPT-4o	~12% (1.8/15)	pass@1 on AIME I	OpenAI, September 2024
Claude 3.5 Sonnet	~10%	pass@1	Anthropic, 2024

A few notes on this table:

The 93% o1 number that OpenAI publicized in September 2024 used best-of-1000 with a learned re-ranker, not a single sample. The single-sample number for the same model was 74.4% on the 15-problem AIME I subset. Press coverage frequently mixed these two figures together.
The DeepSeek R1 paper reports 79.8% pass@1 averaged over 64 samples, slightly above OpenAI o1-1217's 79.2% on the same setup, which is the headline that pushed R1 into the news cycle in January 2025.
Grok 3 Mini's 95.8% sits at the top of public AIME 2024 leaderboards, but it relies on cons@64 with very large test-time compute. At pass@1 the figures from xAI's own announcement are noticeably lower.

Reasoning models versus base models

The most striking pattern in the table is the gap between reasoning-trained models and conventional chat models. GPT-4o and the original Claude 3.5 Sonnet both sit around 10 to 13% on AIME 2024. Models trained with reinforcement learning on chains of thought, including OpenAI o1, DeepSeek R1, and Claude 3.7 Sonnet with extended thinking, score from roughly 70% to above 90%. The same Claude 3.7 Sonnet weights score 23.3% in standard mode and 80.0% with extended thinking enabled, which is the cleanest demonstration of how much of the gain comes from inference-time reasoning rather than from a change in the underlying model. OpenAI's o1 blog also reports a roughly log-linear scaling relationship between accuracy and test-time compute on AIME, a pattern most labs have since adopted in their evaluation reports.

Human comparison

Comparing model scores with human performance is messier than it is often presented. AIME is taken only by AMC qualifiers, so the typical AIME taker is already strong:

The median AIME score across all qualifiers is around 4 problems out of 15, which is roughly 27%.
USAMO qualification typically requires a USAMO index in which the AIME contribution is 9 or more problems out of 15.
The very top scorers, perfect or near-perfect, are a tiny tail of the distribution, on the order of a few hundred students nationally per year.

When OpenAI claimed o1 at 93% placed it "among the top 500 students in the United States," that comparison is to AIME I only, not the AIME I and II combined dataset most papers use today.

Mathematical domains covered

The AIME 2024 benchmark tests proficiency across the standard secondary math contest domains:

Domain	Example topics	Approximate share of problems
Algebra	Polynomial equations, systems, inequalities, sequences	~27%
Geometry	Euclidean geometry, coordinate geometry, transformations	~27%
Number theory	Divisibility, modular arithmetic, primes, Diophantine	~20%
Combinatorics	Counting, probability, recursion	~20%
Complex numbers	Roots of unity, complex algebra	~7%

Topic boundaries are fuzzy. A typical late-exam problem (problem 12, say) might mix coordinate geometry with number theory and a touch of combinatorics, and many problems are deliberately built so that the obvious approach is intractable and a clever observation cuts the work down.

Limitations and considerations

Data contamination

This is the largest caveat hanging over AIME 2024 as a benchmark. The 2024 problems were posted in full, with detailed solutions, on the Art of Problem Solving wiki and elsewhere within hours of the contest. By the time models were trained or fine-tuned in late 2024 and 2025, those pages were almost certainly part of the public web crawls feeding pretraining and instruction tuning datasets.

Researchers building MathArena, a contamination-resistant evaluation framework, found strong signs of contamination on AIME 2024. Several models scored 10 to 20 points above what their performance on freshly released, uncontaminated competitions would predict, and one model (QwQ-32B-Preview) was estimated to score around 60% above the human-aligned expectation. The same project released VAR-AIME24, which substitutes symbolic parameters for the fixed numeric constants in each AIME 2024 problem to test whether models actually solve the problem or recall the answer.
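
The idea behind parameterized variants can be illustrated with a toy template. The sketch below is not taken from VAR-AIME24 and does not use an AIME problem: it defines a made-up counting question whose constants are parameters and whose reference answer is computed from those parameters, so that a memorized answer to one instance does not help on another.

```python
from dataclasses import dataclass
from math import gcd


@dataclass(frozen=True)
class Variant:
    statement: str
    answer: int


def make_variant(limit: int, a: int, b: int) -> Variant:
    """Build a toy counting problem and compute its answer by inclusion-exclusion."""
    lcm = a * b // gcd(a, b)
    # Multiples of a, plus multiples of b, minus multiples of both.
    answer = limit // a + limit // b - limit // lcm
    statement = (
        f"How many positive integers not exceeding {limit} "
        f"are divisible by {a} or {b}?"
    )
    return Variant(statement, answer)


# The "published" instance and a re-parameterized variant of the same template.
original = make_variant(100, 2, 3)
variant = make_variant(150, 3, 5)
```

A model that truly solves the template should stay accurate across parameter draws, while one that recalls the published answer would be expected to fail once the constants change.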

The usual remedy now is to evaluate on AIME 2025 in addition to AIME 2024, since the 2025 contest was administered after the training cutoff of many current models. When a model that scored 90% on AIME 2024 drops to 75% on AIME 2025, that gap is a useful contamination signal even if neither score is perfectly clean.

Statistical noise

Thirty problems is a small sample. A single problem worth 1 of 30 is 3.3% of the score. A model that solves 24 problems correctly scores 80%, but its true skill could plausibly produce anywhere between 22 and 26 on a different draw of similarly hard problems. Confidence intervals on AIME 2024 scores are wide, which is part of why pass@1 averaged over 64 samples and cons@64 are now the standard reporting modes.
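
One common way to put numbers on that width is a Wilson score interval for a binomial proportion. The sketch below applies it to 24 correct out of 30. It treats the 30 problems as independent draws from a larger pool of similar problems, which is an assumption rather than a property of the benchmark, and it uses the conventional z = 1.96 for a 95% interval.

```python
from math import sqrt


def wilson_interval(correct: int, total: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (default about 95%)."""
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half_width = (
        z * sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    )
    return center - half_width, center + half_width


# 24 of 30 correct, i.e. a reported score of 80%.
low, high = wilson_interval(24, 30)
```

For 24 of 30 the interval spans roughly 63% to 90%, which is wide enough that most adjacent rows in a typical AIME 2024 leaderboard cannot be separated on a single run.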

Limited coverage

AIME 2024 only tests one style of math: short-answer, integer-output, contest-flavored problems. It says nothing about whether a model can write a real proof, formalize an argument in Lean, or do open-ended exploration of a research-style question. Benchmarks like the USAMO, Putnam competitions, FrontierMath, and Humanity's Last Exam are designed to fill those gaps.

Related benchmarks

AIME 2024 sits inside a broader ecosystem of mathematical reasoning benchmarks that AI researchers run alongside it:

AIME 2025: the follow-up benchmark using the 2025 contest problems, less affected by training data contamination.
MATH: the original 12,500-problem dataset of high school competition math, now considered partially saturated by frontier reasoning models.
MATH-500: a 500-problem subset of MATH commonly reported alongside AIME.
GSM8K: roughly 8,500 grade school word problems, long since saturated by capable LLMs.
GPQA Diamond: graduate-level science questions including mathematics.
HMMT: the Harvard-MIT Mathematics Tournament, also used in MathArena.
FrontierMath: research-level math problems designed to be much harder and more contamination resistant.
Putnam: undergraduate competition mathematics, used for college-level evaluations.
USAMO 2025: proof-based competition for the very top US math students, now a fresh evaluation target as well.

Impact and significance

AIME 2024 is probably the single benchmark most responsible for the 2024-2025 reasoning model wave entering public consciousness. The September 2024 OpenAI o1 announcement leaned on AIME 2024 as its headline reasoning result, which set the framing that DeepSeek directly attacked four months later when R1 matched o1's score at a fraction of the inference cost. Anthropic, Google, and xAI followed with their own reasoning launches, and AIME 2024 was on every comparison chart.

The benchmark also accelerated two reporting habits. First, test-time compute became a first-class axis: almost every 2025 reasoning model release included a chart of accuracy against thinking budget, and AIME 2024 was the most common dataset for that chart. Second, cons@N and best-of-N now appear alongside pass@1 in most releases, since 30 problems and large test-time budgets together make any single number too noisy. For anyone interpreting these results, contamination concerns mean current AIME 2024 scores probably overstate how well frontier models reason on truly novel contest problems, and the more sobering AIME 2025 numbers tend to support that.

References

Mathematical Association of America. "MAA Invitational Competitions." https://maa.org/maa-invitational-competitions/
Wikipedia. "American Invitational Mathematics Examination." https://en.wikipedia.org/wiki/American_Invitational_Mathematics_Examination
OpenAI. "Learning to reason with LLMs." September 12, 2024. https://openai.com/index/learning-to-reason-with-llms/
OpenAI. "Introducing OpenAI o3 and o4-mini." April 16, 2025. https://openai.com/index/introducing-o3-and-o4-mini/
OpenAI. "OpenAI o3-mini." January 31, 2025. https://openai.com/index/openai-o3-mini/
DeepSeek-AI et al. "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning." arXiv:2501.12948. January 2025. https://arxiv.org/html/2501.12948v1
Anthropic. "Claude 3.7 Sonnet and Claude Code." February 24, 2025. https://www.anthropic.com/news/claude-3-7-sonnet
Anthropic. "Claude's extended thinking." 2025. https://www.anthropic.com/news/visible-extended-thinking
xAI. "Grok 3 Beta: The Age of Reasoning Agents." February 17, 2025. https://x.ai/news/grok-3
Google DeepMind. "Gemini 2.5: Our newest Gemini model with thinking." March 25, 2025. https://blog.google/innovation-and-ai/models-and-research/google-deepmind/gemini-model-thinking-updates-march-2025/
Maxwell-Jia. "AIME 2024 dataset." Hugging Face. https://huggingface.co/datasets/Maxwell-Jia/AIME_2024
Balunović et al. "MathArena: Evaluating LLMs on Uncontaminated Math Competitions." arXiv:2505.23281. https://arxiv.org/html/2505.23281v2
Vellum AI. "Analysis: OpenAI o1 vs DeepSeek R1." https://www.vellum.ai/blog/analysis-openai-o1-vs-deepseek-r1
Vellum AI. "Claude 3.7 Sonnet vs OpenAI o1 vs DeepSeek R1." https://www.vellum.ai/blog/claude-3-7-sonnet-vs-openai-o1-vs-deepseek-r1
llm-stats.com. "AIME 2024 Benchmark Leaderboard." https://llm-stats.com/benchmarks/aime-2024
Alibaba Cloud. "Alibaba Cloud Unveils QwQ-32B." https://www.alibabacloud.com/blog/alibaba-cloud-unveils-qwq-32b-a-compact-reasoning-model-with-cutting-edge-performance_602039
UK AI Security Institute. "Inspect Evals: AIME 2024." https://ukgovernmentbeis.github.io/inspect_evals/evals/mathematics/aime2024/index.html
DataCamp. "Gemini 2.0 Flash Thinking Experimental: A Guide With Examples." https://www.datacamp.com/blog/gemini-2-0-flash-experimental
