AIME 2024
Last reviewed
May 10, 2026
Sources
18 citations
Review status
Source-backed
Revision
v2 · 2,836 words
| AIME 2024 | |
|---|---|
| Overview | |
| Full name | American Invitational Mathematics Examination 2024 |
| Abbreviation | AIME 2024 |
| Description | A challenging mathematical reasoning benchmark based on the American Invitational Mathematics Examination 2024 problems, designed to evaluate AI models' ability to solve complex high school mathematics problems requiring multi-step reasoning |
| Release date | 2024-02-01 |
| Latest version | 1.0 |
| Benchmark updated | 2024-02-07 |
| Authors | Mathematical Association of America |
| Organization | Mathematical Association of America (MAA), Art of Problem Solving (AoPS) |
| Technical Details | |
| Type | Mathematical Reasoning, Problem Solving |
| Modality | Text |
| Task format | Open-ended problem solving |
| Number of tasks | 30 (15 from AIME I + 15 from AIME II) |
| Total examples | 30 |
| Evaluation metric | Exact Match, Pass@1, Cons@N |
| Domains | Algebra, Geometry, Number Theory, Combinatorics, Probability |
| Languages | English |
| Performance | |
| Human performance | ~27% (median high-scoring qualifier solves 4 of 15) |
| Baseline | ~12% (GPT-4o pass@1) |
| SOTA score | 95.8% (Grok 3 Mini, cons@64) |
| SOTA model | Grok 3 Mini (xAI) |
| SOTA date | 2025-02-17 |
| Saturated | Yes (top reasoning models exceed 90%) |
| Resources | |
| Website | Official website |
| GitHub | Repository |
| Dataset | Download |
| Predecessor | AIME 2023 |
| Successor | AIME 2025 |
AIME 2024 is an AI benchmark that evaluates large language models on the 2024 American Invitational Mathematics Examination, a competition written by the Mathematical Association of America for top high school mathematicians in the United States. The benchmark consists of 30 problems (15 from AIME I and 15 from AIME II), each with an integer answer from 0 to 999. Because answers can be graded automatically by exact match, the 1,000 possible answers make guessing ineffective, and the problems demand layered reasoning, AIME 2024 became one of the most cited tests of mathematical reasoning for reasoning models released in late 2024 and 2025.
The AIME 2024 benchmark is built on the 2024 cycle of the American Invitational Mathematics Examination, the invitational round of the AMC competition series. AIME is administered to students who score in roughly the top 2.5% of the AMC 10 or the top 5% of the AMC 12. Strong AIME performance is the gateway to the USA Mathematical Olympiad (USAMO) and USA Junior Mathematical Olympiad (USAJMO), and the index used to select students for those competitions combines AMC and AIME scores.
The AIME 2024 problems were administered in two sittings: AIME I on January 31 to February 1, 2024, and AIME II on February 7, 2024. Each sitting contains 15 problems and runs for three hours. After the contest closed, the problems and full solutions were posted publicly through the Art of Problem Solving wiki and other community archives, which is exactly why AIME 2024 became a popular target for AI researchers, and also why later contamination concerns emerged.
The machine-readable benchmark most often used in papers is the Hugging Face dataset published by Maxwell Jia, which packages the 30 official problems with reference answers in JSON Lines format. Some early evaluations, including OpenAI's blog post for o1, restricted the benchmark to the 15 AIME I problems only, which is one source of confusion when comparing scores across papers.
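A file in this format can be read with a few lines of standard-library Python. The sketch below is illustrative only: the field names `problem` and `answer` and the file path are assumptions, not confirmed details of the published dataset, and should be adjusted to match the actual records.

```python
import json
from pathlib import Path


def load_problems(path):
    """Read a JSON Lines file: one JSON object per non-empty line."""
    problems = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            record = json.loads(line)
            # "problem" and "answer" are assumed field names (hypothetical).
            problems.append(
                {"problem": record["problem"], "answer": int(record["answer"])}
            )
    return problems
```

Converting the answer with `int` means a reference stored as the string "073" and one stored as the number 73 compare equal, which matters because AIME answers are often written with leading zeros.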
A few features made AIME 2024 the right shape for evaluating reasoning models:
Each AIME 2024 problem requires the model to output a single integer between 000 and 999. Problems are presented as plain text and may include LaTeX. Common task patterns include:
The difficulty curve is steep. Problems 1 to 5 are typically tractable for an experienced AMC solver. Problems 11 to 15 frequently require non-obvious constructions or clever invariants and are roughly at USAMO entry difficulty.
The benchmark is run in several ways depending on the paper:
| Evaluation method | Description | Implementation |
|---|---|---|
| Exact match | Model output must equal the ground truth integer | Final answer extracted from \boxed{} or last line of model output |
| Pass@1 (greedy) | Single deterministic attempt | Temperature 0, no sampling |
| Pass@1 (averaged) | Average correctness across many samples | Common setup is 64 samples per problem with temperature 0.6 and top-p 0.95 |
| Cons@N (consensus) | Majority vote over N samples | Often N=64; reduces variance from sampling |
| Best-of-N | Re-rank N samples with a learned scorer | Used by OpenAI for the 93% o1 result |
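
The averaged pass@1 and cons@N rows in the table above can be sketched as follows. This is a minimal illustration rather than any lab's evaluation harness; the function names and the convention that an unparseable sample is represented by `None` are assumptions.

```python
from collections import Counter


def pass_at_1_averaged(samples, truth):
    """Mean per-sample accuracy, averaged over problems.

    samples: list of lists; samples[i] holds the extracted answers for problem i.
    truth:   list of reference answers in the same order.
    """
    per_problem = [
        sum(ans == ref for ans in answers) / len(answers)
        for answers, ref in zip(samples, truth)
    ]
    return sum(per_problem) / len(per_problem)


def cons_at_n(samples, truth):
    """Fraction of problems whose most common sampled answer is correct."""
    correct = 0
    for answers, ref in zip(samples, truth):
        # Samples with no extractable answer (None) cast no vote.
        votes = Counter(a for a in answers if a is not None)
        # Ties go to the answer seen first, an arbitrary choice in this sketch.
        if votes and votes.most_common(1)[0][0] == ref:
            correct += 1
    return correct / len(truth)


# Two toy problems with four samples each.
samples = [[5, 5, 7, 5], [1, 2, 2, 3]]
truth = [5, 3]
print(pass_at_1_averaged(samples, truth))  # 0.5
print(cons_at_n(samples, truth))           # 0.5
```

In the toy data the two metrics happen to agree, but they measure different things: averaged pass@1 rewards every correct sample, while cons@N only asks whether the majority answer is right.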
DeepSeek's R1 paper, for example, fixes temperature to 0.6, top-p to 0.95, and reports pass@1 averaged over 64 sampled responses, which is now a common reference setup. Models are usually prompted with something close to: "Please reason step by step, and put your final answer within \boxed{}."
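Answer extraction under this prompt can be sketched with a regular expression. The sketch assumes the final answer appears as a one- to three-digit integer inside `\boxed{}` and takes the last such occurrence. It does not implement the last-line fallback that some harnesses use, and the function names are hypothetical.

```python
import re

PROMPT_SUFFIX = "Please reason step by step, and put your final answer within \\boxed{}."

# Matches \boxed{...} containing one to three digits, allowing inner spaces.
BOXED = re.compile(r"\\boxed\{\s*(\d{1,3})\s*\}")


def extract_answer(output):
    """Return the last boxed integer in the model output, or None if absent."""
    matches = BOXED.findall(output)
    if not matches:
        return None
    return int(matches[-1])


def exact_match(output, truth):
    """True when the extracted integer equals the reference answer."""
    return extract_answer(output) == truth


print(extract_answer("So the answer is \\boxed{073}."))  # 73
print(exact_match("No final answer given.", 73))         # False
```

Taking the last match is a design choice: models often box intermediate results while reasoning, and the final boxed value is the one they commit to.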
Because there are only 30 problems, single-run scores have high variance. A model that gets 24 right in one run and 21 right in the next has shifted by 10 percentage points without any change in capability. That is why most credible scores either average over many seeds or report cons@N.
The table below collects widely-cited AIME 2024 scores from primary sources. Where scores were originally reported on the 15-problem AIME I subset (as in OpenAI's September 2024 blog post), that is noted in the methodology column.
| Model | AIME 2024 score | Methodology | Source / date |
|---|---|---|---|
| Grok 3 Mini (Think, high) | 95.8% | cons@64, test-time compute scaling | xAI, February 2025 |
| OpenAI o4-mini | 93.4% | pass@1, no tools | OpenAI, April 2025 |
| Grok 3 (Think) | 93.3% | cons@64 | xAI, February 2025 |
| OpenAI o1 (re-ranked) | 93% (13.9/15) | re-ranking 1000 samples on AIME I | OpenAI, September 2024 |
| Gemini 2.5 Pro | 92.0% | pass@1 | Google DeepMind, March 2025 |
| OpenAI o3 | 91.6% | pass@1 | OpenAI, April 2025 |
| OpenAI o3-mini (high) | 87.3% | pass@1 | OpenAI, January 2025 |
| DeepSeek R1-Zero (cons@64) | 86.7% | majority vote over 64 samples | DeepSeek, January 2025 |
| OpenAI o1 (cons@64) | 83% (12.5/15) | consensus on AIME I | OpenAI, September 2024 |
| Claude 3.7 Sonnet (extended thinking) | 80.0% | parallel extended thinking, 64K token budget | Anthropic, February 2025 |
| DeepSeek R1 | 79.8% | pass@1 averaged over 64 samples | DeepSeek paper, January 2025 |
| QwQ-32B (Alibaba) | 79.5% | pass@1 averaged | Alibaba, March 2025 |
| OpenAI o1 (single sample) | 74.4% | pass@1 on AIME I (11.1/15) | OpenAI, September 2024 |
| Gemini 2.0 Flash Thinking | 73.3% | pass@1 | Google, December 2024 |
| OpenAI o1-mini | 63.6% | pass@1 averaged | DeepSeek paper / OpenAI |
| Grok 3 (base, non-reasoning) | 52.2% | pass@1 | xAI, February 2025 |
| Gemini 2.0 Flash (experimental) | 35.5% | pass@1 | Google, December 2024 |
| Claude 3.7 Sonnet (standard mode) | 23.3% | pass@1, no extended thinking | Anthropic, February 2025 |
| Gemini 1.5 Pro | 19.3% | pass@1 | Google, 2024 |
| GPT-4o | ~12% (1.8/15) | pass@1 on AIME I | OpenAI, September 2024 |
| Claude 3.5 Sonnet | ~10% | pass@1 | Anthropic, 2024 |
A few notes on this table:
The most striking pattern in the table is the gap between reasoning-trained models and conventional chat models. GPT-4o and the original Claude 3.5 Sonnet both sit around 10 to 13% on AIME 2024. Models trained with reinforcement learning on chains of thought, including OpenAI o1, DeepSeek R1, and Claude 3.7 Sonnet with extended thinking, jumped to between 70% and more than 90%. The same Claude 3.7 Sonnet weights score 23.3% in standard mode and 80.0% with parallel extended thinking enabled, which is the cleanest demonstration of how much of the gain comes from inference-time reasoning rather than raw capability. OpenAI's o1 blog also reports a roughly log-linear relationship between accuracy and test-time compute on AIME, a relationship most labs now chart in their own evaluation reports.
Comparing model scores with human performance is messier than it is often presented. AIME is taken only by AMC qualifiers, so the typical AIME taker is already strong:
When OpenAI claimed that o1's 93% placed it "among the top 500 students in the United States," the comparison was to AIME I only, not to the combined AIME I and II dataset that most papers use today.
The AIME 2024 benchmark tests proficiency across the standard secondary math contest domains:
| Domain | Example topics | Approximate share of problems |
|---|---|---|
| Algebra | Polynomial equations, systems, inequalities, sequences | ~27% |
| Geometry | Euclidean geometry, coordinate geometry, transformations | ~27% |
| Number theory | Divisibility, modular arithmetic, primes, Diophantine | ~20% |
| Combinatorics | Counting, probability, recursion | ~20% |
| Complex numbers | Roots of unity, complex algebra | ~7% |
Topic boundaries are fuzzy. A typical late AIME problem (problem 12, say) might mix coordinate geometry with number theory and a touch of combinatorics, and many problems are deliberately built so that the obvious approach is intractable and a clever observation cuts the work down.
This is the largest caveat hanging over AIME 2024 as a benchmark. The 2024 problems were posted in full, with detailed solutions, on the Art of Problem Solving wiki and elsewhere within hours of the contest. By the time models were trained or fine-tuned in late 2024 and 2025, those pages were almost certainly part of the public web crawls feeding pretraining and instruction tuning datasets.
Researchers building MathArena, a contamination-resistant evaluation framework, found strong signs of contamination on AIME 2024. Several models scored 10 to 20 points above what their performance on freshly released, uncontaminated competitions would predict, and one model (QwQ-32B-Preview) was estimated to score around 60% above the human-aligned expectation. The same project released VAR-AIME24, which substitutes symbolic parameters for the fixed numeric constants in each AIME 2024 problem to test whether models actually solve the problem or recall the answer.
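The idea of replacing fixed constants with parameters can be illustrated with a toy template. This is not an actual AIME 2024 problem and not the VAR-AIME24 implementation; the template text, parameter ranges, and solver below are all hypothetical.

```python
import random

# Hypothetical template: constants become named placeholders.
TEMPLATE = "Find the remainder when {a}^{b} is divided by {m}."


def solve(a, b, m):
    """Reference solver: the answer is recomputed for every parameter draw."""
    return pow(a, b, m)


def make_variant(rng):
    """Draw fresh constants so a memorized answer to one instance does not transfer."""
    params = {
        "a": rng.randint(2, 50),
        "b": rng.randint(100, 999),
        "m": rng.randint(100, 1000),  # keeps the answer in the 0 to 999 range
    }
    return TEMPLATE.format(**params), solve(**params)


rng = random.Random(2024)  # fixed seed for reproducible variants
question, answer = make_variant(rng)
print(question)
print(answer)
```

A model that has only memorized the published answer to the original instance should fail on freshly drawn constants, while a model that can actually carry out the computation should not.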
The usual remedy now is to evaluate on AIME 2025, which was administered after the training data cutoff of most current models, in addition to AIME 2024. When a model that scored 90% on AIME 2024 drops to 75% on AIME 2025, that gap is a useful contamination signal even if neither score is perfectly clean.
Thirty problems is a small sample. A single problem worth 1 of 30 is 3.3% of the score. A model that solves 24 problems correctly scores 80%, but its true skill could plausibly produce anywhere between 22 and 26 on a different draw of similarly hard problems. Confidence intervals on AIME 2024 scores are wide, which is part of why pass@1 averaged over 64 samples and cons@64 are now the standard reporting modes.
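The arithmetic behind that range can be checked with a binomial standard error. This is a back-of-the-envelope sketch that treats the 30 problems as independent draws with the same success probability, which is a simplification since AIME problems vary widely in difficulty.

```python
import math


def binomial_standard_error(correct, total):
    """Standard error of an accuracy estimate under a simple binomial model."""
    p = correct / total
    return math.sqrt(p * (1 - p) / total)


total = 30
correct = 24
se = binomial_standard_error(correct, total)
print(f"accuracy = {correct / total:.1%}")                                 # 80.0%
print(f"one problem = {1 / total:.1%} of the score")                       # 3.3%
print(f"standard error = {se:.1%}, or about {se * total:.1f} problems")    # 7.3%, about 2.2
```

One standard error of roughly 2.2 problems on either side of 24 matches the 22 to 26 range quoted above. A conventional 95% interval would be about twice as wide.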
AIME 2024 only tests one style of math: short-answer, integer-output, contest-flavored problems. It says nothing about whether a model can write a real proof, formalize an argument in Lean, or do open-ended exploration of a research-style question. Benchmarks like the USAMO, Putnam competitions, FrontierMath, and Humanity's Last Exam are designed to fill those gaps.
AIME 2024 sits inside a broader ecosystem of mathematical reasoning benchmarks that AI researchers run alongside it:
AIME 2024 is probably the single benchmark most responsible for the 2024-2025 reasoning model wave entering public consciousness. The September 2024 OpenAI o1 announcement leaned on AIME 2024 as its headline reasoning result, which set the framing that DeepSeek directly attacked four months later when R1 matched o1's score at a fraction of the inference cost. Anthropic, Google, and xAI followed with their own reasoning launches, and AIME 2024 was on every comparison chart.
The benchmark also accelerated two reporting habits. First, test-time compute became a first-class axis: almost every 2025 reasoning model release included a chart of accuracy against thinking budget, with AIME 2024 the most common dataset for that chart. Second, cons@N and best-of-N now appear alongside pass@1 in most releases, since 30 problems and large test-time budgets together make any single number too noisy. For education, contamination concerns mean current AIME 2024 scores probably overstate how well frontier models reason on truly novel contest problems, and the more sobering AIME 2025 numbers tend to support that.