The American Invitational Mathematics Examination (AIME) is a 15-question, 3-hour mathematics competition for high school students in the United States and Canada, administered by the Mathematical Association of America (MAA). Established in 1983, it serves as the intermediate round in the USAMO qualification pipeline, sitting between the entry-level AMC 10/AMC 12 examinations and the proof-based USA Mathematical Olympiad (USAMO).[1][2]
Beginning in 2024, the AIME emerged as a widely cited benchmark for evaluating mathematical reasoning in large language models, particularly the new generation of reasoning models such as OpenAI o1, OpenAI o3, DeepSeek-R1, Gemini 2.5 Pro, Grok 3, Grok 4, and the Claude Opus 4.5 and Claude Sonnet 4.6 series. Its integer-only answer format, multi-step reasoning requirements, and clean automatic grading made it a natural fit for AI evaluation, although by 2025 it had largely saturated at the frontier and prompted the introduction of harder successor benchmarks.[3][4]
History
The AIME was first administered in 1983, replacing an earlier pipeline that ran directly from the American High School Mathematics Examination (AHSME) to the USAMO. The MAA introduced the AIME as a middle round to better distinguish among top AHSME scorers and to provide a less abrupt transition between multiple-choice contests and the proof-based USAMO. The AHSME itself was rebranded as the AMC 12 in 2000, with the AMC 10 added the same year to broaden participation; the AIME pipeline was extended to AMC 10 qualifiers in 2010.[1][2]
From 1983 until 1999, only a single annual AIME was held in late March or early April. Starting in 2000, the MAA began offering a second sitting, the AIME II, as an alternate test for students who could not take the AIME I due to scheduling conflicts, illness, or international time zones. Students may sit only one of the two; attempting both results in disqualification.[1][2]
Format and scoring
The AIME consists of 15 free-response problems administered in a single 3-hour session. Each answer is an integer from 000 to 999 inclusive, entered on an optical-mark-recognition (OMR) bubble sheet. This format eliminates guessing on multiple-choice options while retaining ease of automated grading.[1][2]
Scoring is exactly one point per correct answer with no penalty for incorrect answers and no partial credit. The maximum score is therefore 15 out of 15. Calculators, mobile phones, and other computing aids are prohibited; only pencils, erasers, rulers, and compasses are permitted.[1][2]
Problems are arranged in roughly increasing order of difficulty. Topics covered include algebra, combinatorics, number theory, probability, geometry, and trigonometry, but no calculus. Solutions usually require multi-step deductions, careful case analysis, or clever transformations, with the integer answer often appearing as the result of a non-trivial computation rather than a memorized formula.[1]
AIME I and AIME II
Both versions of the AIME are administered annually:
- AIME I, the primary test, is typically held on a Tuesday or Thursday in early February. AIME I 2025 was given on Thursday, February 6, 2025.[5]
- AIME II, the alternate, is held roughly a week later. AIME II 2025 was given on Wednesday, February 12, 2025.[5]
The MAA explicitly states that "Students may only take the AIME once," and that taking both versions results in disqualification.[2]
Qualification
Invitation to the AIME is automatic for top scorers on the AMC 10 and AMC 12 contests. According to current MAA policy, AIME invitations are extended to "at least the top 5% of all scorers" on the AMC 12 and "at least the top 2.5% of all scorers" on the AMC 10. In practice, qualifying cutoffs vary slightly year to year and have been somewhat relaxed in recent cycles. A third pathway exists through the USA Mathematical Talent Search (USAMTS), which awards AIME invitations to students achieving a sufficiently high cumulative score (historically 68 of 75) across its three rounds.[2]
Path to the IMO
The AIME is the second step in a five-stage selection pipeline used by the MAA to choose the U.S. team for the International Mathematical Olympiad (IMO):
- AMC 10 or AMC 12: open multiple-choice contests with roughly 300,000 combined participants annually.
- AIME: invitational round for top AMC scorers.
- USAMO / USAJMO: proof-based national olympiad. Qualification is determined by a weighted index:
  - USAMO Index = AMC 12 score + 20 × AIME score
  - USAJMO Index = AMC 10 score + 20 × AIME score
- Mathematical Olympiad Program (MOP): three-week residential summer training program for the top USAMO/USAJMO scorers, historically held at Carnegie Mellon University and formerly known as the Mathematical Olympiad Summer Program (MOSP).
- Team Selection Test (TST): used together with USAMO results, weighted equally, to choose the six members of the U.S. IMO team.[6][7]
Approximately 300 students qualify for the USAMO/USAJMO each year, around 60 attend MOP, about 24 compete in the Team Selection Test process, and six are selected for the IMO team.[6]
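The qualification indices listed above are simple weighted sums. The following minimal sketch computes them for a hypothetical student; the function name, its parameters, and the example scores are illustrative assumptions, and the weight of 20 is taken from the formulas as stated in this article.

```python
def qualification_index(amc_score: float, aime_score: int, aime_weight: int = 20) -> float:
    """Return AMC score plus weighted AIME score (the index formula stated above)."""
    if not 0 <= aime_score <= 15:
        raise ValueError("AIME score must be between 0 and 15")
    return amc_score + aime_weight * aime_score


# Hypothetical student: 120.0 on the AMC 12 and 10 correct answers on the AIME.
print(qualification_index(120.0, 10))  # 320.0
```

The same function covers both the USAMO and USAJMO indices, since the two differ only in which AMC score is supplied.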
Use as an AI benchmark
Why AIME suits AI evaluation
The AIME became one of the most prominent benchmarks for evaluating mathematical reasoning in large language models because its design happens to satisfy many practical requirements for automatic scoring:
- Integer-only answers. Each problem has a unique integer answer from 0 to 999, allowing exact-match grading without complex symbolic-equivalence checks. This contrasts with the MATH dataset, where answers can be algebraic expressions.
- Non-trivial reasoning. Unlike multiple-choice tests, the AIME cannot be solved by elimination or lucky guesses. Problems typically require several pages of work for a strong human solver.
- Standardized difficulty. The MAA has administered the contest yearly for over four decades, producing a long history of problems with calibrated difficulty.
- Public availability. Problems and solutions for AIME 1983 through the most recent year are catalogued on the Art of Problem Solving (AoPS) wiki and elsewhere, making the benchmark trivially accessible for evaluation pipelines.[8]
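Because every answer is an integer from 0 to 999, grading reduces to an exact integer comparison. The sketch below shows one possible grader; taking the last standalone integer of one to three digits in the response as the final answer is an assumption made for illustration, not a documented convention of any lab, and the function names are hypothetical.

```python
import re
from typing import Optional


def extract_answer(response: str) -> Optional[int]:
    """Return the last standalone integer of 1-3 digits in the response, if any."""
    # Lookarounds skip longer digit runs such as years (e.g. "2024").
    matches = re.findall(r"(?<!\d)\d{1,3}(?!\d)", response)
    return int(matches[-1]) if matches else None


def is_correct(response: str, reference: int) -> bool:
    """Exact-match grading: leading zeros such as '073' compare equal to 73."""
    return extract_answer(response) == reference


print(is_correct("So the requested value is 073.", 73))  # True
print(is_correct("I could not finish the problem.", 73))  # False
```

A production pipeline would more likely require the model to emit its answer in a fixed, delimited format and parse only that field, which avoids picking up stray integers from the reasoning text.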
The standard convention in AI evaluations is to use AIME I and AIME II from a given year (a total of 30 problems) and report accuracy as the percentage of problems solved. Common protocols include pass@1 (single sample), cons@64 or majority@64 (majority vote across 64 samples), and tool-augmented variants in which the model may call a Python interpreter.[3][9]
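The difference between pass@1 and consensus scoring can be made concrete with a short sketch. It assumes each problem has several sampled integer answers; the function names, the data layout, and the example values are hypothetical, and pass@1 is estimated here as the mean accuracy over samples.

```python
from collections import Counter
from typing import Dict, List


def pass_at_1(samples: Dict[str, List[int]], answers: Dict[str, int]) -> float:
    """Mean per-sample accuracy, averaged over problems."""
    per_problem = [
        sum(s == answers[pid] for s in sampled) / len(sampled)
        for pid, sampled in samples.items()
    ]
    return sum(per_problem) / len(per_problem)


def cons_at_k(samples: Dict[str, List[int]], answers: Dict[str, int]) -> float:
    """Fraction of problems whose most common sampled answer is correct."""
    hits = 0
    for pid, sampled in samples.items():
        # Ties are broken in favor of the answer that was sampled first.
        majority, _ = Counter(sampled).most_common(1)[0]
        hits += majority == answers[pid]
    return hits / len(samples)


# Hypothetical data: two problems, four samples each.
answers = {"p1": 73, "p2": 204}
samples = {"p1": [73, 73, 10, 73], "p2": [204, 5, 204, 204]}
print(pass_at_1(samples, answers))  # 0.75
print(cons_at_k(samples, answers))  # 1.0
```

In this example the majority vote recovers the correct answer on both problems even though a quarter of the individual samples are wrong, which is why consensus scores are typically higher than pass@1 for the same model.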
Reported scores
The following AIME scores have been reported by AI laboratories in official announcements, system cards, or technical reports. Protocols vary substantially between labs and even between models from the same lab, so direct comparisons should be treated with caution.
OpenAI o-series
In its September 2024 "Learning to reason with LLMs" announcement, OpenAI reported that on AIME 2024, GPT-4o solved "on average 12% (1.8/15) of problems," while o1 averaged "74% (11.1/15) with a single sample per problem, 83% (12.5/15) with consensus among 64 samples, and 93% (13.9/15) when re-ranking 1000 samples with a learned scoring function." OpenAI noted that 13.9/15 "places it among the top 500 students nationally and above the cutoff for the USA Mathematical Olympiad."[3]
OpenAI's December 20, 2024 announcement of o3 reported a score of 96.7% on AIME 2024, with the model "missing just one question," compared with o1's 83.3% (the cons@64 result).[10]
The o3-mini system card (January 2025) reported that o3-mini (high) achieved 87.3% on AIME 2024, with OpenAI noting that "adjusting reasoning effort significantly affects performance, especially for STEM tasks," and that moving from low to high reasoning effort typically raises AIME 2024 accuracy "by 10–30%."[11]
The April 16, 2025 o3 and o4-mini system card reported the following pass@1 scores (no tools):
| Model | AIME 2024 | AIME 2025 |
|---|---|---|
| o1 | 74.3% | 79.2% |
| o3-mini | 87.3% | 86.5% |
| o3 | 91.6% | 88.9% |
| o4-mini | 93.4% | 92.7% |
With access to a Python interpreter, o4-mini reached 99.5% pass@1 and 100% consensus@8 on AIME 2025.[12]
DeepSeek
The DeepSeek-R1 technical report (January 2025) reported the following AIME 2024 scores:[9]
| Model | pass@1 | cons@64 |
|---|---|---|
| DeepSeek-R1 | 79.8% | N/A |
| DeepSeek-R1-Zero | 71.0% | 86.7% |
| DeepSeek-R1-Distill-Qwen-7B | 55.5% | 83.3% |
| DeepSeek-R1-Distill-Qwen-14B | 69.7% | 80.0% |
| DeepSeek-R1-Distill-Qwen-32B | 72.6% | 83.3% |
| DeepSeek-R1-Distill-Llama-70B | 70.0% | 86.7% |
All evaluations used a sampling temperature of 0.6, a top-p value of 0.95, and a maximum generation length of 32,768 tokens.[9]
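For reproducibility, the decoding settings quoted above can be recorded alongside each reported score. The sketch below captures them in a small immutable configuration object; the class and field names are illustrative assumptions rather than identifiers from the DeepSeek-R1 report or any particular inference library.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SamplingConfig:
    """Decoding settings as stated in the text; field names are hypothetical."""
    temperature: float = 0.6
    top_p: float = 0.95
    max_tokens: int = 32_768


config = SamplingConfig()
print(asdict(config))
# {'temperature': 0.6, 'top_p': 0.95, 'max_tokens': 32768}
```

Freezing the dataclass prevents the settings from being changed partway through an evaluation run.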
xAI Grok
xAI's February 17, 2025 release of Grok 3 reported that with the highest level of test-time compute (cons@64), Grok 3 (Think) achieved 93.3% on AIME 2025, which had been administered on February 6 and 12, just days before the announcement. The smaller Grok 3 mini reached 95.8% on AIME 2024 in cons@64 mode.[13]
Grok 4, unveiled on July 9, 2025, was reported by xAI to score 100% on AIME 2025 in its Heavy configuration, with the base Grok 4 reaching around 91.7% without tools.[14]
Google DeepMind Gemini
Gemini 2.5 Pro was reported by Google DeepMind to score approximately 92.0% on AIME 2024 and 86.7% on AIME 2025 in single-attempt evaluations, placing it among the top single-shot performers on the AIME 2025 leaderboard at the time of release in early 2025.[15]
Anthropic Claude
For the Claude 4 family, Claude Opus 4 was reported by Anthropic to score around 75.5% on AIME, rising to approximately 90.0% in high-compute settings. Claude Sonnet 4.5 was reported to reach roughly 87% without tools on AIME 2025 and 100% with Python tool access. Claude Opus 4.5, released in November 2025, was reported to achieve 92.77% without tools and 100% with Python tools on AIME 2025; Anthropic noted concerns that "contamination may have inflated this score."[16][17]
Other models
- GPT-4o (non-reasoning baseline, AIME 2024): ~12% (1.8/15), per OpenAI's o1 announcement.[3]
- QwQ-32B-Preview (Alibaba, late 2024): reported by Alibaba to surpass o1-preview on AIME and MATH-500; the specific score is given in the QwQ-32B announcement.[18]
- Qwen3 (April 2025): the technical report describes the Qwen3 series as surpassing both QwQ in thinking mode and Qwen2.5 in non-thinking mode on AIME and other math benchmarks.[19]
Saturation and the move to harder benchmarks
By mid-2025, frontier models were routinely scoring above 90% on AIME 2024 and AIME 2025 in their default configurations and 100% with Python tool access, prompting widespread discussion that the benchmark had effectively saturated.[4][20] Several methodological concerns were also raised:
- Data contamination. AIME problems and solutions are widely available on the AoPS wiki and dozens of solution repositories. Independent analyses suggest that AIME 2024 problems may have appeared in pretraining corpora for some models, inflating reported scores relative to genuinely held-out contests. Researchers have responded by adopting timeline-locked evaluation protocols in which models are tested only on problems released after the model's pretraining cutoff.[4]
- Inconsistent protocols. Reported scores variously use pass@1, cons@k (for k = 8, 16, 32, 64), or pass@k, and differ in sampling temperature and in whether Python tool access is allowed. A model claiming "99% on AIME" with Python tools is not directly comparable to one claiming "75% pass@1" without tools.
- Sample-size variance. The AIME has only 15 problems per paper and 30 per year. Small differences in absolute score (e.g., 86.7% vs 90.0%) correspond to a single problem and can fluctuate substantially across sampling runs.
- Limited difficulty range. AIME problems are calibrated for talented high school students. Researchers have noted that performance on the AIME is no longer predictive of performance on genuinely hard mathematics tasks such as research-level problems or the proof-based USAMO and IMO.
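The sample-size concern above can be quantified with a rough calculation. The sketch below treats the 30 problems of one year as independent trials with equal success probability, which is a simplifying assumption (real problems differ in difficulty), and computes the binomial standard error of a measured accuracy.

```python
import math


def binomial_standard_error(accuracy: float, n_problems: int) -> float:
    """Standard error of an accuracy estimate under an i.i.d. binomial model."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_problems)


n = 30  # AIME I plus AIME II for one year
one_problem = 100.0 / n  # percentage points contributed by a single problem
se = 100.0 * binomial_standard_error(0.90, n)
print(f"{one_problem:.1f} points per problem, standard error {se:.1f} points")
# 3.3 points per problem, standard error 5.5 points
```

Under this model, a gap of one problem (about 3.3 percentage points) is smaller than the standard error of a 90% score, so a single-problem difference between two models on one year's papers is weak evidence of a real difference.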
In response, the AI evaluation community has shifted attention to harder mathematics benchmarks, including:
- USAMO and IMO problems, which require full written proofs rather than integer answers.
- FrontierMath, a benchmark of unpublished research-level mathematics problems designed by working mathematicians.
- MathArena and other live, timeline-locked evaluation suites that test models on each year's AIME, HMMT, and IMO papers immediately after release to minimize contamination risk.
- Humanity's Last Exam, a broader expert-curated benchmark that includes advanced mathematics among other disciplines.
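A timeline-locked protocol of the kind mentioned above amounts to filtering problems by release date. The following minimal sketch illustrates the idea; the `Problem` record, the `held_out` function, and the training cutoff are hypothetical, while the two contest dates are the 2025 administration dates given earlier in this article.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass(frozen=True)
class Problem:
    contest: str
    released: date


def held_out(problems: List[Problem], training_cutoff: date) -> List[Problem]:
    """Keep only problems released strictly after the model's training cutoff."""
    return [p for p in problems if p.released > training_cutoff]


problems = [
    Problem("AIME I 2025", date(2025, 2, 6)),
    Problem("AIME II 2025", date(2025, 2, 12)),
]
cutoff = date(2025, 2, 10)  # hypothetical training cutoff
print([p.contest for p in held_out(problems, cutoff)])  # ['AIME II 2025']
```

The filter is only as reliable as the stated cutoff: a model that was fine-tuned after its nominal pretraining cutoff could still have seen problems that pass this check.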
For specific year-by-year results, see AIME 2024 and AIME 2025.
Significance for human competition
Despite saturation as an AI benchmark, the AIME remains an important fixture of secondary mathematics education in the United States. Roughly 3,000 to 5,000 students qualify each year, and the exam continues to serve as the principal gateway to the USAMO and, ultimately, the IMO. The combination of historical longevity, well-calibrated difficulty, and a clean integer answer format that originally made the AIME attractive to AI researchers continues to make it a respected human competition independent of its role in evaluating language models.
References
1. Wikipedia, "American Invitational Mathematics Examination." https://en.wikipedia.org/wiki/American_Invitational_Mathematics_Examination
2. Mathematical Association of America, "MAA Invitational Competitions." https://maa.org/maa-invitational-competitions/
3. OpenAI, "Learning to reason with LLMs," September 12, 2024. https://openai.com/index/learning-to-reason-with-llms/
4. IntuitionLabs, "AIME 2025 Benchmark: An Analysis of AI Math Reasoning." https://intuitionlabs.ai/articles/aime-2025-ai-benchmark-explained
5. Mathematical Association of America, "2024-25 AIME Thresholds Are Available." https://maa.org/news/aime-thresholds-are-available/
6. Wikipedia, "Mathematical Olympiad Program." https://en.wikipedia.org/wiki/Mathematical_Olympiad_Program
7. Art of Problem Solving Wiki, "Mathematical Olympiad Summer Program." https://artofproblemsolving.com/wiki/index.php/Mathematical_Olympiad_Summer_Program
8. Art of Problem Solving Wiki, "AIME Problems and Solutions." https://artofproblemsolving.com/wiki/index.php/AIME_Problems_and_Solutions
9. DeepSeek-AI, "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning," arXiv:2501.12948, January 2025. https://arxiv.org/html/2501.12948v1
10. OpenAI, o3 announcement, December 20, 2024 (via TechCrunch and InfoQ coverage). https://techcrunch.com/2024/12/20/openai-announces-new-o3-model/ ; https://www.infoq.com/news/2024/12/openai-announces-o3/
11. OpenAI, "OpenAI o3-mini" announcement and system card, January 2025. https://openai.com/index/openai-o3-mini/
12. OpenAI, "OpenAI o3 and o4-mini System Card," April 16, 2025. https://cdn.openai.com/pdf/2221c875-02dc-4789-800b-e7758f3722c1/o3-and-o4-mini-system-card.pdf
13. xAI, "Grok 3 Beta: The Age of Reasoning Agents," February 17, 2025. https://x.ai/news/grok-3
14. xAI, Grok 4 launch coverage, July 9, 2025. https://epoch.ai/blog/grok-4-math
15. Google DeepMind, Gemini 2.5 Pro benchmark documentation (2025). https://artificialanalysis.ai/evaluations/aime-2025
16. Anthropic, "Claude Opus 4.5 System Card," November 2025. https://assets.anthropic.com/m/64823ba7485345a7/Claude-Opus-4-5-System-Card.pdf
17. Anthropic, "Introducing Claude Sonnet 4.5." https://www.anthropic.com/news/claude-sonnet-4-5
18. Alibaba Cloud, "Alibaba Cloud Unveils QwQ-32B: A Compact Reasoning Model with Cutting-Edge Performance." https://www.alibabacloud.com/blog/alibaba-cloud-unveils-qwq-32b-a-compact-reasoning-model-with-cutting-edge-performance_602039
19. Qwen Team, "Qwen3 Technical Report," May 2025. https://arxiv.org/pdf/2505.09388
20. Artificial Analysis, "AIME 2025 Benchmark Leaderboard." https://artificialanalysis.ai/evaluations/aime-2025