AIME 2025

Overview
Full name	American Invitational Mathematics Examination 2025
Abbreviation	AIME 2025
Description	A challenging mathematical reasoning benchmark based on the American Invitational Mathematics Examination 2025 problems, testing olympiad-level mathematical reasoning with complex multi-step problem solving
AIME I date	2025-02-06
AIME II date	2025-02-12
Latest version	1.0
Authors	Mathematical Association of America
Organization	Mathematical Association of America (MAA), Art of Problem Solving (AoPS)
Technical Details
Type	Mathematical Reasoning, Olympiad Mathematics
Modality	Text
Task format	Open-ended problem solving (integer answer 000 to 999)
Number of tasks	30 (15 from AIME I, 15 from AIME II)
Total examples	30
Evaluation metric	Exact match, pass@1, cons@8, cons@64
Domains	Algebra, Geometry, Number Theory, Combinatorics, Probability
Languages	English
Performance
Human performance	26.67% to 40% (4 to 6 problems correct per 15)
Baseline	~20% (non-reasoning models)
SOTA score (no tools)	94.6% (GPT-5, August 2025)
SOTA score (with Python)	100% (GPT-5 Pro, Claude Opus 4.5, others)
SOTA model	GPT-5 / GPT-5 Pro
SOTA date	2025-08
Saturated	Approaching saturation for frontier models
Resources
Website	Official MAA AIME page
AoPS Wiki (AIME I)	2025 AIME I problems
AoPS Wiki (AIME II)	2025 AIME II problems
Dataset (HF)	MathArena/aime_2025
Live leaderboard	MathArena, Artificial Analysis
Predecessor	AIME 2024

AIME 2025 is an AI benchmark drawn from the 2025 American Invitational Mathematics Examination, a high-school, olympiad-track contest run by the Mathematical Association of America. The 2025 edition was administered in two sittings: AIME I on Thursday, February 6, 2025, and AIME II on Wednesday, February 12, 2025. Each paper contains 15 problems with integer answers in the range 000 to 999, giving 30 problems in total. Within hours of each test window closing, research labs and independent evaluators began running frontier language models on the fresh problems, turning the contest into one of the most closely watched mathematical reasoning benchmarks of 2025.

The benchmark gained traction quickly because it offered something rare: a hard, well-calibrated set of problems that nearly every major model released before February 2025 had a strong claim to never having seen. That made it a useful counterpoint to AIME 2024, where contamination of pretraining corpora became a serious concern after researchers showed that some models scored 10 to 20 points higher on AIME 2024 than on the held-out 2025 problems.

Overview

AIME 2025 evaluates how well a large language model can carry out structured multi-step mathematical reasoning under tight constraints. Problems require precalculus mathematics across algebra, geometry, number theory, combinatorics, and probability, and increase in difficulty across each 15-problem set. There is no partial credit; the model either produces the correct integer or it does not.

For humans, AIME is the gateway between the AMC 10 / AMC 12 and the USA Mathematical Olympiad (USAMO). Top high school competitors typically solve 4 to 6 of the 15 problems per paper, putting human performance between 26.67% and 40% on a single sitting. A non-reasoning model that knows textbook techniques but cannot reason carefully tends to land around 20%.

Why it became the math benchmark of 2025

For most of 2024, the math benchmark of choice was AIME 2024 paired with MATH-500 and GSM8K. By late 2024, the top reasoning models from OpenAI and DeepSeek were posting scores above 80% on AIME 2024 and above 95% on MATH-500, and the field needed something that was not yet partly memorized. AIME 2025 fit the brief: it was hard, it shared the same answer format that researchers had already built tooling around, and the training cutoffs of leading models (including DeepSeek R1, o3-mini, and Claude 3.7 Sonnet) all sat before February 6, 2025.

The other advantage of AIME 2025 was institutional. The contest is run by the MAA, an independent organization, rather than by any of the labs being evaluated, which made it harder for any one company to cherry-pick the questions where its model did well.

Exam structure and dataset

The 30-problem dataset combines both AIME papers and is the version most leaderboards report against. Some early evaluations reported only the 15 problems from AIME I because they were run on February 7, before AIME II was administered.

Item	AIME I	AIME II
Date administered	February 6, 2025	February 12, 2025
Number of problems	15	15
Time limit (humans)	3 hours	3 hours
Answer format	Integer 000 to 999	Integer 000 to 999
Calculator policy	None permitted	None permitted

The MAA publishes the official problems and answer keys after each sitting. The Art of Problem Solving community then writes up multiple solutions per problem, which is one of the ways the problems eventually leak into web crawls.

Coverage by mathematical domain

Domain	Example topics
Algebra	Polynomial and functional equations, inequalities, sequences
Geometry	Euclidean and coordinate geometry, transformations, 3D solids
Number theory	Divisibility, modular arithmetic, Diophantine equations
Combinatorics and probability	Counting, expected values, generating functions
Trigonometry / complex numbers	Identities, roots of unity

Evaluation methodology

Different evaluators have settled on different protocols, which is part of why scores in different press releases sometimes look inconsistent.

Common metrics

Metric	What it measures	Typical use
pass@1	Single-sample exact-match accuracy	Default leaderboards
cons@8	Majority vote across 8 samples	Reduces variance on a 30-problem set
cons@64	Majority vote across 64 samples	Used in o1 and Grok 3 announcements
pass@k	At least one of k samples is correct	Ablations, not headline numbers
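
The metrics in the table can be sketched for a single problem as follows. This is a minimal illustration, not any evaluator's actual harness: the function names are made up for this example, ties in the majority vote are broken by whichever answer appeared first (an assumption; real harnesses may differ), and pass@k uses the standard unbiased combinatorial estimator computed from n samples of which c are correct.

```python
from collections import Counter
from math import comb


def pass_at_1(samples, answer):
    """Mean exact-match accuracy over all samples for one problem."""
    return sum(s == answer for s in samples) / len(samples)


def cons_at_k(samples, answer, k):
    """1.0 if the most common answer among the first k samples is correct."""
    votes = Counter(samples[:k])
    # most_common breaks ties by first appearance (assumed tie rule).
    top_answer, _ = votes.most_common(1)[0]
    return float(top_answer == answer)


def pass_at_k(samples, answer, k):
    """Unbiased estimate of P(at least one of k samples is correct)."""
    n = len(samples)
    c = sum(s == answer for s in samples)
    if n - c < k:
        # Fewer than k wrong samples: every k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Eight hypothetical samples for a problem whose answer is 70.
samples = [70, 70, 588, 70, 16, 588, 70, 70]
print(pass_at_1(samples, 70))      # 0.625
print(cons_at_k(samples, 70, 8))   # 1.0
print(pass_at_k(samples, 70, 2))   # 1 - 3/28
```

A benchmark score is then the mean of the chosen per-problem metric over all 30 problems. The example shows why cons@k and pass@1 are not comparable: the same eight samples give 62.5% under pass@1 and 100% under cons@8.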

Because the dataset has only 30 problems, single-run pass@1 numbers are noisy. Two independent averages of 5 runs can differ by 5 to 10 percentage points on the same model, which has pushed evaluators toward consistency metrics or pass@1 averaged over many runs (often 16, 32, or 50). It also explains why frontier models tend to bunch within a percentage point or two of each other; the noise is comparable to the gap.
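
A back-of-the-envelope calculation shows where this noise comes from. The sketch below assumes each problem is an independent Bernoulli trial with a fixed per-sample solve probability; the probabilities used (0.8 for every problem) are hypothetical, chosen only to illustrate the scale, and `run_to_run_std` is a name invented for this example.

```python
from math import sqrt


def run_to_run_std(solve_probs, runs=1):
    """Std. dev. of mean accuracy when averaging `runs` independent runs.

    solve_probs[i] is the (hypothetical) chance that one sample solves
    problem i; problems and runs are assumed independent.
    """
    n = len(solve_probs)
    # Variance of one run's accuracy: sum of Bernoulli variances over n^2.
    variance_one_run = sum(p * (1 - p) for p in solve_probs) / n**2
    return sqrt(variance_one_run / runs)


probs = [0.8] * 30                            # assumed, not measured
single = run_to_run_std(probs)                # one run of 30 problems
averaged = run_to_run_std(probs, runs=16)     # mean of 16 runs
print(f"{100 * single:.1f} points, {100 * averaged:.1f} points")
```

Under these assumptions a single run has a standard deviation of roughly 7 percentage points, and averaging 16 runs brings it below 2. Real models have uneven per-problem probabilities, so the exact figures will differ.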

Closed book versus tool use

The headline AIME 2025 number is usually reported closed-book: the model produces a chain of thought and a final integer using only its own weights. A second mode, called tool-augmented or code-interpreter mode, allows the model to call out to a Python interpreter during reasoning. Tool-augmented numbers tend to be 5 to 7 points higher and have pushed several frontier models to 100% on AIME 2025.

Sampling settings used in research papers

Parameter	Common setting
Temperature	0.0 to 0.6 (often averaged)
Samples per question	8, 16, 32, or 64
Maximum tokens	16K to 64K
Top p	0.95
Prompt format	"Please reason step by step, and put your final answer within \boxed{}"
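
With this prompt format, grading reduces to pulling the boxed integer out of the response and comparing it with the answer key. A minimal sketch follows; the helper names and the choice to take the last boxed value are assumptions for illustration, and production graders typically handle more formatting variants.

```python
import re

# Matches \boxed{...} containing one to three digits, e.g. \boxed{073}.
BOXED = re.compile(r"\\boxed\{\s*(\d{1,3})\s*\}")


def extract_answer(response):
    """Return the last boxed integer in the response, or None if absent."""
    matches = BOXED.findall(response)
    return int(matches[-1]) if matches else None


def is_correct(response, answer):
    """Exact match after normalizing both sides to int (so 073 == 73)."""
    return extract_answer(response) == int(answer)


reply = r"First try \boxed{12}. Rechecking the cases gives \boxed{073}."
print(extract_answer(reply))    # 73
print(is_correct(reply, "073")) # True
```

Converting to `int` means leading zeros in the 000 to 999 format do not affect the comparison, and a response with no boxed integer is simply scored as wrong.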

Model performance

The tables below collect publicly reported AIME 2025 results from model release blogs, third-party leaderboards (Artificial Analysis, MathArena, Vellum, llm-stats), and arXiv reports. Because methodology varies across sources, scores within a few points should be treated as tied.

Closed book (no tool use) leaderboard

Model	Score (%)	Method	Organization	Released
Gemini 3 Pro	~100	pass@1	Google DeepMind	2025
Grok-4 Heavy	~100	pass@1	xAI	2025
GPT-5 (thinking)	99.6	pass@1	OpenAI	2025-08
GPT-5 (default)	94.6	pass@1	OpenAI	2025-08
Grok 3 (Think)	93.3	cons@64	xAI	2025-02
Claude Opus 4.5	92.77	pass@1	Anthropic	2025-11
o4-mini	92.7	pass@1	OpenAI	2025-04
Qwen3 235B (thinking)	92.3	pass@1	Alibaba	2025
o3	88.9	pass@1	OpenAI	2025-04
DeepSeek R1 0528	87.5	pass@1	DeepSeek	2025-05
Claude Sonnet 4.5	87	pass@1	Anthropic	2025-09
Gemini 2.5 Pro	86.7	pass@1	Google DeepMind	2025-03
o3-mini (high)	86.5	pass@1	OpenAI	2025-01
Claude Opus 4	75.5	pass@1	Anthropic	2025-05
DeepSeek R1 (original)	74.0	pass@1	DeepSeek	2025-01
Claude 3.7 Sonnet (ext. thinking)	61.3	pass@1	Anthropic	2025-02
o1	~60	pass@1	OpenAI	2024-12
Claude 3.7 Sonnet (standard)	52.7	pass@1	Anthropic	2025-02
Gemini 2.0 Flash	~45	pass@1	Google DeepMind	2025-01
Non reasoning baseline	~20	pass@1	Various	n/a

Tool augmented (Python interpreter) leaderboard

With access to a Python interpreter during reasoning, several models reach the ceiling of the dataset:

Model	Score (%)	Source
GPT-5 Pro with Python	100.0	OpenAI launch blog
Claude Opus 4.5 with Python	100.0	Anthropic system card
Claude Sonnet 4.5 with Python	100.0	Anthropic system card
Gemini 3 Pro with code execution	100.0	Google launch material
o4-mini with Python	99.5	OpenAI o3 / o4-mini blog
o3 with Python	98.4	OpenAI o3 / o4-mini blog

Specific stories worth knowing

The Grok 3 launch was the first instance where AIME 2025 scores caused a public dispute over evaluation honesty. xAI headlined 93.3% for Grok 3 (Think), but that figure used cons@64, while OpenAI's chart for o3-mini reported pass@1. On a like-for-like pass@1 comparison, Grok 3 Reasoning Beta sat below o3-mini-high. The episode was a useful reminder that the headline number depends on which metric a lab chooses.

DeepSeek's path was its own story. The original R1 from January 2025 posted 74.0% pass@1, well below o3-mini. The R1-0528 update from May 2025 jumped to 87.5%; by then average response length on hard problems had nearly doubled, from about 12K reasoning tokens to about 23K, suggesting that more test-time compute, not a different recipe, was doing the work.

Claude 3.7 Sonnet, released the same month as AIME 2025, came in low at 52.7% standard and 61.3% with extended thinking. Anthropic's strategy at the time emphasized coding and agentic tasks. The gap closed with Claude Opus 4 (75.5%), Claude Sonnet 4.5 (87%), and Claude Opus 4.5 (92.77%) later in 2025.

Comparison with AIME 2024

Researchers use the gap between AIME 2024 and AIME 2025 scores as a rough contamination indicator: a model that scores noticeably higher on the older paper is suspected of having seen those problems during training. MathArena ran this comparison across more than 50 LLMs and reported that for several open-weight models the AIME 2024 score sat 10 to 20 points above AIME 2025, despite roughly equivalent difficulty.

Aspect	AIME 2024	AIME 2025
Frontier score (early 2025)	High 80s to mid 90s	Mid 70s to high 80s
Estimated contamination	Substantial for many models	Limited, some leakage
Saturation (late 2025)	Effectively saturated	Approaching saturation
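
The gap heuristic described above can be expressed in a few lines. This is only a sketch: the function name, the 10-point default threshold (taken from the low end of the 10 to 20 point range quoted above), and the model names and scores in the example are all hypothetical placeholders, not reported results.

```python
def contamination_gap(scores_2024, scores_2025, threshold=10.0):
    """Return {model: gap} where AIME 2024 exceeds AIME 2025 by >= threshold.

    Both arguments map model name -> score in percent. A large positive gap
    is a warning sign, not proof, since the two years are only roughly
    equal in difficulty and 30-problem scores are noisy.
    """
    flagged = {}
    for model, old_score in scores_2024.items():
        if model in scores_2025:
            gap = old_score - scores_2025[model]
            if gap >= threshold:
                flagged[model] = gap
    return flagged


# Placeholder numbers for illustration only.
aime_2024 = {"model-a": 80.0, "model-b": 70.0}
aime_2025 = {"model-a": 65.0, "model-b": 68.0}
print(contamination_gap(aime_2024, aime_2025))  # {'model-a': 15.0}
```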

Contamination concerns

Even AIME 2025 has not been completely clean. Researchers found that 8 of the 30 problems had near-identical analogues already on the public web (Quora, math.stackexchange, and similar archives), and that one of them, a Question 1, had an essentially identical version posted years earlier. That left some risk that pre-February 2025 training data still contained partial solutions or similar formulations.

The response from the evaluation community has been to move toward live benchmarks: instead of evaluating on a fixed test set indefinitely, evaluators score models only on competitions held after the model's training cutoff. The MathArena project, run by researchers at the SRI Lab at ETH Zurich, formalized this approach with continual evaluation across AIME, HMMT, the Putnam, and the IMO. A separate effort at vals.ai applied the same idea with a public leaderboard.
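
The live-benchmark rule amounts to a date filter. The sketch below uses the two AIME 2025 sitting dates given earlier in this article; the function name and the example cutoff dates are hypothetical, and a real pipeline would also need a trustworthy source for each model's training cutoff.

```python
from datetime import date

# Sitting dates as stated in this article.
COMPETITIONS = {
    "AIME I 2025": date(2025, 2, 6),
    "AIME II 2025": date(2025, 2, 12),
}


def eligible_competitions(training_cutoff, competitions=COMPETITIONS):
    """Names of competitions held strictly after the model's training cutoff."""
    return sorted(
        name for name, held_on in competitions.items()
        if held_on > training_cutoff
    )


# Hypothetical cutoffs for illustration.
print(eligible_competitions(date(2024, 12, 31)))  # both papers
print(eligible_competitions(date(2025, 2, 10)))   # AIME II only
```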

The practical workflow most labs adopted for AIME 2025 was to fetch the official problem PDFs the morning after each sitting, run frozen, pre-announcement checkpoints, and publish results within 48 to 72 hours, before the problems could plausibly enter any retraining cycle.

Limitations

AIME 2025 is useful, not perfect. Several limitations are worth keeping in mind:

Thirty problems is enough to separate weak from strong models, but not enough to rank similar frontier models with confidence; pass@1 on a single run can swing 5 to 10 points.
Answer-only scoring means a model can guess the correct integer for the wrong reasons, especially in combinatorics where the answer space is small.
All problems and reasoning are in English, limiting usefulness for multilingual evaluation.
Problems do not require calculus or graduate-level mathematics, so scores say almost nothing about research math. For that, evaluators look at FrontierMath, the USAMO, or the IMO grand challenge.
Test-time compute is a confounder: two models with the same final score can be using radically different amounts of compute per problem.

Impact on AI development

AIME 2025 has shaped how frontier labs talk about reasoning. By mid-2025, the AIME 2025 number had become a near-required disclosure in any major reasoning model release, alongside GPQA Diamond, HumanEval, and MMLU. The reproducible lift from longer reasoning chains helped popularize the test-time compute paradigm that defines o3, DeepSeek R1, and their successors. Consensus metrics such as cons@k pushed multi-sample voting into product features in Claude, Gemini, and ChatGPT. The contamination story reinforced the case for held-out evaluation sets and motivated continual benchmarks like MathArena. Open source projects (DeepSeek R1 Distill, OpenThinker, AM Thinking, and various Qwen- and Llama-based distillations) have used AIME 2025 as their primary external yardstick for reasoning capability transfer.

Related benchmarks

AIME 2024: predecessor, now widely considered contaminated for most pre-2025 models.
MATH: 12,500-problem dataset spanning multiple difficulty levels.
GSM8K: grade-school math word problems, effectively saturated.
GPQA Diamond: PhD-level science multiple-choice questions.
HMMT: Harvard-MIT Mathematics Tournament problems, run alongside AIME 2025 in MathArena.
USAMO: USA Mathematical Olympiad, proof-based and substantially harder than AIME.
FrontierMath: research-level mathematics designed to remain unsaturated.

References

Mathematical Association of America. "AIME (American Invitational Mathematics Examination)." maa.org/maa-invitational-competitions.
Art of Problem Solving. "2025 AIME I." artofproblemsolving.com/wiki/index.php/2025_AIME_I.
Art of Problem Solving. "2025 AIME II." artofproblemsolving.com/wiki/index.php/2025_AIME_II.
OpenAI. "Introducing OpenAI o3 and o4-mini." April 2025. openai.com/index/introducing-o3-and-o4-mini.
OpenAI. "Introducing GPT-5." August 2025. openai.com/index/introducing-gpt-5.
DeepSeek AI. "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning." arXiv:2501.12948, January 2025.
DeepSeek AI. "DeepSeek-R1-0528 model card." Hugging Face, May 2025.
Anthropic. "Claude Sonnet 4.5 system card." September 2025.
Anthropic. "Claude Opus 4.5 system card." November 2025.
xAI. "Grok 3 Beta: The Age of Reasoning Agents." February 2025.
Google DeepMind. "Gemini 2.5: Our newest Gemini model with thinking." March 2025.
Artificial Analysis. "AIME 2025 Benchmark Leaderboard." artificialanalysis.ai/evaluations/aime-2025.
MathArena. "AIME 2025 dataset and leaderboard." matharena.ai.
Balunovic, M., Jovanovic, N., et al. "MathArena: Evaluating LLMs on Uncontaminated Math Competitions." arXiv:2505.23281, 2025.
Vellum AI. "Claude 3.7 Sonnet vs OpenAI o1 vs DeepSeek R1." vellum.ai/blog.
Vellum AI. "GPT-5 Benchmarks." vellum.ai/blog/gpt-5-benchmarks.
IntuitionLabs. "AIME 2025 Benchmark: An Analysis of AI Math Reasoning." intuitionlabs.ai.
llm-stats.com. "AIME 2025 Benchmark Leaderboard." llm-stats.com/benchmarks/aime-2025.
Vals AI. "AIME Benchmark." vals.ai/benchmarks/aime.
Papailiopoulos, D. "AIME I 2025: A Cautionary Tale About Math Benchmarks and Data Contamination." Twitter / X, February 2025.
