AIME 2025
Last reviewed
May 10, 2026
Sources
20 citations
Review status
Source-backed
Revision
v2 · 2,800 words
| AIME 2025 | |
|---|---|
| Overview | |
| Full name | American Invitational Mathematics Examination 2025 |
| Abbreviation | AIME 2025 |
| Description | A challenging mathematical reasoning benchmark based on the American Invitational Mathematics Examination 2025 problems, testing olympiad-level mathematical reasoning with complex multi-step problem solving |
| AIME I date | 2025-02-06 |
| AIME II date | 2025-02-12 |
| Latest version | 1.0 |
| Authors | Mathematical Association of America |
| Organization | Mathematical Association of America (MAA), Art of Problem Solving (AoPS) |
| Technical Details | |
| Type | Mathematical Reasoning, Olympiad Mathematics |
| Modality | Text |
| Task format | Open-ended problem solving (integer answer 000 to 999) |
| Number of tasks | 30 (15 from AIME I, 15 from AIME II) |
| Total examples | 30 |
| Evaluation metric | Exact match, pass@1, cons@8, cons@64 |
| Domains | Algebra, Geometry, Number Theory, Combinatorics, Probability |
| Languages | English |
| Performance | |
| Human performance | 26.67% to 40% (4 to 6 problems correct per 15) |
| Baseline | ~20% (non-reasoning models) |
| SOTA score (no tools) | 94.6% (GPT-5, August 2025) |
| SOTA score (with Python) | 100% (GPT-5 Pro, Claude Opus 4.5, others) |
| SOTA model | GPT-5 / GPT-5 Pro |
| SOTA date | 2025-08 |
| Saturated | Approaching saturation for frontier models |
| Resources | |
| Website | Official MAA AIME page |
| AoPS Wiki (AIME I) | 2025 AIME I problems |
| AoPS Wiki (AIME II) | 2025 AIME II problems |
| Dataset (HF) | MathArena/aime_2025 |
| Live leaderboard | MathArena, Artificial Analysis |
| Predecessor | AIME 2024 |
AIME 2025 is an AI benchmark drawn from the 2025 American Invitational Mathematics Examination, a high-school, olympiad-track contest run by the Mathematical Association of America. The 2025 edition was administered in two sittings: AIME I on Thursday, February 6, 2025, and AIME II on Wednesday, February 12, 2025. Each paper contains 15 problems with integer answers in the range 000 to 999, giving 30 problems in total. Within hours of each test window closing, research labs and independent evaluators began running frontier language models on the fresh problems, turning the contest into one of the most closely watched mathematical reasoning benchmarks of 2025.
The benchmark gained traction quickly because it offered something rare: a hard, well-calibrated set of problems that nearly every major model released before February 2025 had a strong claim to never having seen. That made it a useful counterpoint to AIME 2024, where contamination of pretraining corpora became a serious concern after researchers showed that model scores on the 2024 problems ran 10 to 20 points above scores on the held-out 2025 problems.
AIME 2025 evaluates how well a large language model can carry out structured multi-step mathematical reasoning under tight constraints. Problems require pre-calculus mathematics across algebra, geometry, number theory, combinatorics, and probability, and increase in difficulty across each 15-problem set. There is no partial credit; the model either produces the correct integer or it does not.
For humans, AIME is the gateway between the AMC 10 / AMC 12 and the USA Math Olympiad. Top high-school competitors typically solve 4 to 6 of the 15 problems per paper, putting human performance between 26.67% and 40% on a single sitting. A non-reasoning model that knows textbook techniques but cannot reason carefully tends to land around 20%.
For most of 2024, the math benchmark of choice was AIME 2024 paired with MATH 500 and GSM8K. By late 2024, the top reasoning models from OpenAI and DeepSeek were posting scores above 80% on AIME 2024 and above 95% on MATH 500, and the field needed something that was not yet partly memorized. AIME 2025 fit the brief: it was hard, it shared the same answer format that researchers had already built tooling around, and the training cutoffs of leading models (including DeepSeek R1, o3-mini, and Claude 3.7 Sonnet) all sat before February 6, 2025.
The other useful thing about AIME 2025 was political. The contest is run by the MAA, an independent organization, rather than by any of the labs being evaluated, which made it harder for any one company to cherry-pick the questions where its model did well.
The 30-problem dataset combines both AIME papers and is the version most leaderboards report against. Some early evaluations reported only the 15 problems from AIME I because they were run on February 7, before AIME II was administered.
| Item | AIME I | AIME II |
|---|---|---|
| Date administered | February 6, 2025 | February 12, 2025 |
| Number of problems | 15 | 15 |
| Time limit (humans) | 3 hours | 3 hours |
| Answer format | Integer 000 to 999 | Integer 000 to 999 |
| Calculator policy | None permitted | None permitted |
The MAA publishes the official problems and answer keys after each sitting. The Art of Problem Solving community then writes up multiple solutions per problem, which is one of the ways the problems eventually leak into web crawls.
| Domain | Example topics |
|---|---|
| Algebra | Polynomial and functional equations, inequalities, sequences |
| Geometry | Euclidean and coordinate geometry, transformations, 3D solids |
| Number theory | Divisibility, modular arithmetic, Diophantine equations |
| Combinatorics and probability | Counting, expected values, generating functions |
| Trigonometry / complex numbers | Identities, roots of unity |
Different evaluators have settled on different protocols, which is part of why scores in different press releases sometimes look inconsistent.
| Metric | What it measures | Typical use |
|---|---|---|
| pass@1 | Single sample exact match accuracy | Default leaderboards |
| cons@8 | Majority vote across 8 samples | Reduces variance on a 30 problem set |
| cons@64 | Majority vote across 64 samples | Used in o1 and Grok 3 announcements |
| pass@k | At least one of k samples is correct | Ablations, not headline numbers |
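The metrics in the table can be made concrete with a short sketch. It assumes each problem comes with a list of sampled integer answers and one reference integer; the function names are illustrative and not taken from any evaluator's harness. The `pass_at_k` here is the naive "any of the first k" check, not the unbiased estimator some papers use.

```python
from collections import Counter

def pass_at_1(samples, reference):
    """Mean exact-match accuracy over independent single samples."""
    return sum(s == reference for s in samples) / len(samples)

def cons_at_k(samples, reference, k):
    """1.0 if the majority-vote answer among the first k samples is correct.

    Ties are broken in favor of the answer seen first, which is how
    Counter.most_common orders equal counts.
    """
    top_answer, _ = Counter(samples[:k]).most_common(1)[0]
    return float(top_answer == reference)

def pass_at_k(samples, reference, k):
    """1.0 if at least one of the first k samples is correct (naive version)."""
    return float(reference in samples[:k])

# Hypothetical samples for one problem whose reference answer is 70.
samples = [70, 70, 588, 70, 16, 588, 70, 70]
print(pass_at_1(samples, 70))     # 0.625
print(cons_at_k(samples, 70, 8))  # 1.0
print(pass_at_k(samples, 16, 4))  # 0.0 (16 first appears at index 4)
```

A benchmark score is the mean of the per-problem value over all 30 problems. The example shows why cons@k can sit well above pass@1 for the same model: a 62.5% per-sample hit rate still wins the vote.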
Because the dataset has only 30 problems, single-run pass@1 numbers are noisy. Two independent averages of 5 runs each can differ by 5 to 10 percentage points on the same model, which has pushed evaluators toward consistency metrics or pass@1 averaged over many runs (often 16, 32, or 50). It also means that when frontier models bunch within a percentage point or two of each other, the gap is comparable to the noise and should not be read as a ranking.
The headline AIME 2025 number is usually reported closed-book: the model produces a chain of thought and a final integer using only its own weights. A second mode, called tool-augmented or code-interpreter mode, allows the model to call out to a Python interpreter during reasoning. Tool-augmented numbers tend to be 5 to 7 points higher and have pushed several frontier models to 100% on AIME 2025.
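A back-of-the-envelope estimate shows where noise of this size comes from. The sketch below treats every problem attempt as an independent Bernoulli trial with the same success probability `p`, which is a simplifying assumption (real problems differ in difficulty), and the function name is illustrative.

```python
import math

def pass1_standard_error(p, n_problems=30, n_runs=1):
    """Binomial standard error of mean accuracy over n_problems * n_runs attempts."""
    return math.sqrt(p * (1 - p) / (n_problems * n_runs))

p = 0.8  # hypothetical true accuracy, chosen only for illustration
single_run = pass1_standard_error(p)              # about 0.073
five_runs = pass1_standard_error(p, n_runs=5)     # about 0.033
# Standard deviation of the difference between two independent 5-run averages.
gap_between_averages = math.sqrt(2) * five_runs   # about 0.046

print(round(single_run, 3), round(five_runs, 3), round(gap_between_averages, 3))
```

Under these assumptions, one standard deviation of the difference between two 5-run averages is roughly 4.6 points, so differences of 5 to 10 points are unsurprising. Averaging over 32 runs brings the per-average standard error down to about 1.3 points.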
| Parameter | Common setting |
|---|---|
| Temperature | 0.0 to 0.6 (often averaged) |
| Samples per question | 8, 16, 32, or 64 |
| Maximum tokens | 16K to 64K |
| Top p | 0.95 |
| Prompt format | "Please reason step by step, and put your final answer within \boxed{}" |
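With the `\boxed{}` prompt convention, exact-match grading reduces to pulling the last boxed integer out of the response and comparing it with the answer key. The following is a minimal sketch, assuming the model writes a plain integer of one to three digits inside the box; the function names are hypothetical, and real harnesses typically handle more cases (nested braces, `\text{}` wrappers, stray units).

```python
import re

# Matches \boxed{...} containing only a 1-3 digit integer, e.g. \boxed{070}.
BOXED_INTEGER = re.compile(r"\\boxed\{\s*(\d{1,3})\s*\}")

def extract_answer(response):
    """Return the last boxed integer in the response, or None if there is none."""
    matches = BOXED_INTEGER.findall(response)
    if not matches:
        return None
    return int(matches[-1])  # int("070") == 70, so leading zeros are harmless

def is_correct(response, reference):
    """Exact match with no partial credit; a missing answer counts as wrong."""
    return extract_answer(response) == reference

print(extract_answer(r"Trying \boxed{12} first, but the final answer is \boxed{070}."))  # 70
print(is_correct("I could not finish the problem.", 70))  # False
```

Taking the last match rather than the first is a design choice: long reasoning traces often box intermediate guesses before the final answer.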
The tables below collect publicly reported AIME 2025 results from model release blogs, third-party leaderboards (Artificial Analysis, MathArena, Vellum, llm-stats), and arXiv reports. Because methodology varies across sources, scores within a few points should be treated as tied.
| Model | Score (%) | Method | Organization | Released |
|---|---|---|---|---|
| Gemini 3 Pro | ~100 | pass@1 | Google DeepMind | 2025 |
| Grok-4 Heavy | ~100 | pass@1 | xAI | 2025 |
| GPT-5 (thinking) | 99.6 | pass@1 | OpenAI | 2025-08 |
| GPT-5 (default) | 94.6 | pass@1 | OpenAI | 2025-08 |
| Grok 3 (Think, cons@64) | 93.3 | cons@64 | xAI | 2025-02 |
| o4-mini | 92.7 | pass@1 | OpenAI | 2025-04 |
| Claude Opus 4.5 | 92.77 | pass@1 | Anthropic | 2025-11 |
| Qwen3 235B (thinking) | 92.3 | pass@1 | Alibaba | 2025 |
| o3 | 88.9 | pass@1 | OpenAI | 2025-04 |
| DeepSeek R1 0528 | 87.5 | pass@1 | DeepSeek | 2025-05 |
| Claude Sonnet 4.5 | 87 | pass@1 | Anthropic | 2025-09 |
| Gemini 2.5 Pro | 86.7 | pass@1 | Google DeepMind | 2025-03 |
| o3-mini (high) | 86.5 | pass@1 | OpenAI | 2025-01 |
| Claude Opus 4 | 75.5 | pass@1 | Anthropic | 2025-05 |
| DeepSeek R1 (original) | 74.0 | pass@1 | DeepSeek | 2025-01 |
| Claude 3.7 Sonnet (ext. thinking) | 61.3 | pass@1 | Anthropic | 2025-02 |
| o1 | ~60 | pass@1 | OpenAI | 2024-12 |
| Claude 3.7 Sonnet (standard) | 52.7 | pass@1 | Anthropic | 2025-02 |
| Gemini 2.0 Flash | ~45 | pass@1 | Google DeepMind | 2025-01 |
| Non reasoning baseline | ~20 | pass@1 | Various | n/a |
With access to a Python interpreter during reasoning, several models reach the ceiling of the dataset:
| Model | Score (%) | Source |
|---|---|---|
| GPT-5 Pro with Python | 100.0 | OpenAI launch blog |
| Claude Opus 4.5 with Python | 100.0 | Anthropic system card |
| Claude Sonnet 4.5 with Python | 100.0 | Anthropic system card |
| Gemini 3 Pro with code execution | 100.0 | Google launch material |
| o4-mini with Python | 99.5 | OpenAI o3 / o4-mini blog |
| o3 with Python | 98.4 | OpenAI o3 / o4-mini blog |
The Grok 3 launch was the first instance where AIME 2025 scores caused a public dispute over evaluation honesty. xAI headlined 93.3% for Grok 3 (Think), but that figure used cons@64, while OpenAI's chart for o3-mini reported pass@1. On an apples-to-apples pass@1 comparison, Grok 3 Reasoning Beta sat below o3-mini-high. The episode was a useful reminder that the headline number depends on which metric a lab chooses.
DeepSeek's path was its own story. The original R1 from January 2025 posted 74.0% pass@1, well below o3-mini. The R1-0528 update from May 2025 jumped to 87.5%; by then average response length on hard problems had nearly doubled, from about 12K reasoning tokens to about 23K, suggesting that more test-time compute, not a different recipe, was doing the work.
Claude 3.7 Sonnet, released the same month as AIME 2025, came in low at 52.7% standard and 61.3% with extended thinking. Anthropic's strategy at the time emphasized coding and agentic tasks. The gap closed with Claude Opus 4 (75.5%), Claude Sonnet 4.5 (87%), and Claude Opus 4.5 (92.77%) later in 2025.
Researchers use the gap between AIME 2024 and AIME 2025 scores as a rough contamination indicator: a model that scores noticeably higher on the older paper is suspected of having seen those problems during training. MathArena ran this comparison across more than 50 LLMs and reported that for several open-weight models the AIME 2024 score sat 10 to 20 points above AIME 2025, despite roughly equivalent difficulty.
| Aspect | AIME 2024 | AIME 2025 |
|---|---|---|
| Frontier score (early 2025) | High 80s to mid 90s | Mid 70s to high 80s |
| Estimated contamination | Substantial for many models | Limited, some leakage |
| Saturation (late 2025) | Effectively saturated | Approaching saturation |
Even AIME 2025 has not been completely clean. Researchers found that 8 of the 30 problems had near-identical analogues already on the public web (Quora, math.stackexchange, and similar archives), including a Question 1 whose essentially identical version had been posted years earlier. That left some risk that pre-February 2025 training data still contained partial solutions or similar formulations.
The response from the evaluation community has been to move toward live benchmarks: instead of evaluating on a fixed test set indefinitely, evaluators score models only on competitions held after the model's training cutoff. The MathArena project, run by researchers at ETH Zurich's SRI Lab, formalized this approach with continual evaluation across AIME, HMMT, the Putnam, and the IMO. A separate effort at vals.ai ran the same idea with a public leaderboard.
The practical workflow most labs adopted for AIME 2025 was to fetch the official problem PDFs the morning after each sitting, run frozen pre-announcement checkpoints, and publish results within 48 to 72 hours, before the problems could plausibly enter any retraining cycle.
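The live-benchmark rule amounts to a date filter: a contest counts for a model only if it was held after that model's training cutoff. The sketch below illustrates the idea using the two AIME 2025 sitting dates; the function name and the cutoff dates in the usage lines are hypothetical and do not describe any particular model or MathArena's actual code.

```python
from datetime import date

# Sitting dates for the two AIME 2025 papers.
CONTESTS = {
    "AIME 2025 I": date(2025, 2, 6),
    "AIME 2025 II": date(2025, 2, 12),
}

def eligible_contests(training_cutoff, contests=CONTESTS):
    """Names of contests held strictly after the cutoff, in date order."""
    by_date = sorted(contests.items(), key=lambda item: item[1])
    return [name for name, held_on in by_date if held_on > training_cutoff]

# Hypothetical cutoffs, for illustration only.
print(eligible_contests(date(2025, 1, 31)))  # ['AIME 2025 I', 'AIME 2025 II']
print(eligible_contests(date(2025, 2, 10)))  # ['AIME 2025 II']
print(eligible_contests(date(2025, 3, 1)))   # []
```

A model with a cutoff after both sittings has no eligible AIME 2025 problems under this rule and would need to be scored on a later contest instead.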
AIME 2025 is useful, not perfect. Several limitations are worth keeping in mind:
AIME 2025 has shaped how frontier labs talk about reasoning. By mid-2025, the AIME 2025 number had become a near-required disclosure in any major reasoning model release, alongside GPQA Diamond, HumanEval, and MMLU. The reproducible lift from longer reasoning chains helped popularize the test-time compute paradigm that defines o3, DeepSeek R1, and their successors. Metrics such as cons@k pushed multi-sample voting into product features in Claude, Gemini, and ChatGPT. The contamination story reinforced the case for held-out evaluation sets and motivated continual benchmarks like MathArena. Open-source projects (DeepSeek R1 Distill, OpenThinker, AM Thinking, and various Qwen- and Llama-based distillations) have used AIME 2025 as their primary external yardstick for reasoning capability transfer.