AIME (American Invitational Mathematics Examination)
Last reviewed
Sources
No citations yet
Review status
Needs citations
Revision
v6 · 2,458 words
The American Invitational Mathematics Examination (AIME) is a 15-question, 3-hour mathematics competition for high school students in the United States and Canada, administered by the Mathematical Association of America (MAA) since 1983. Each answer is an integer from 000 to 999, which makes the contest cleanly auto-gradable, and beginning in 2024 the AIME became one of the most widely cited benchmarks for evaluating mathematical reasoning in large language models. Frontier reasoning models now routinely score above 90% on AIME 2024 and AIME 2025 and reach 100% with Python tool access, so the benchmark is widely regarded as saturated at the frontier.[1][3][4]
In the human pipeline, the AIME serves as the intermediate round in the USAMO qualification process, sitting between the entry-level AMC 10/AMC 12 examinations and the proof-based USA Mathematical Olympiad (USAMO).[1][2] As an AI benchmark it is used across the new generation of reasoning systems, including OpenAI o1, OpenAI o3, DeepSeek-R1, Gemini 2.5 Pro, Grok 3, Grok 4, and the Claude Opus 4.5 and Claude Sonnet 4.6 series. Its integer-only answer format, multi-step reasoning requirements, and clean automatic grading made it a natural fit for AI evaluation, although by 2025 it had largely saturated at the frontier and prompted the introduction of harder successor benchmarks.[3][4]
The AIME was first administered in 1983, replacing an earlier pipeline that ran directly from the American High School Mathematics Examination (AHSME) to the USAMO. The MAA introduced the AIME as a middle round to better distinguish among top AHSME scorers and to provide a less abrupt transition between multiple-choice contests and the proof-based USAMO. The AHSME itself was rebranded as the AMC 12 in 2000, with the AMC 10 added the same year to broaden participation; the AIME pipeline was extended to AMC 10 qualifiers in 2010.[1][2]
From 1983 until 1999, only a single annual AIME was held in late March or early April. Starting in 2000, the MAA began offering a second sitting, the AIME II, as an alternate test for students who could not take the AIME I due to scheduling conflicts, illness, or international time zones. Students may sit only one of the two; attempting both results in disqualification.[1][2]
The AIME consists of 15 free-response problems administered in a single 3-hour session. Each answer is an integer from 000 to 999 inclusive, entered on an optical-mark-recognition (OMR) bubble sheet. This format eliminates guessing on multiple-choice options while retaining ease of automated grading.[1][2]
Scoring is exactly one point per correct answer with no penalty for incorrect answers and no partial credit. The maximum score is therefore 15 out of 15. Calculators, mobile phones, and other computing aids are prohibited; only pencils, erasers, rulers, and compasses are permitted.[1][2]
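The answer format and scoring rule described above are simple enough to express in a few lines. The following is a minimal sketch, not the MAA's actual grading software; the function names `parse_aime_answer` and `score_aime` are hypothetical and chosen only for illustration.

```python
import re
from typing import List, Optional


def parse_aime_answer(text: str) -> Optional[int]:
    """Return the answer as an int if it is one to three digits, else None."""
    stripped = text.strip()
    # Valid answers are integers from 000 to 999; leading zeros are allowed.
    if re.fullmatch(r"[0-9]{1,3}", stripped) is None:
        return None
    return int(stripped)


def score_aime(responses: List[str], answer_key: List[int]) -> int:
    """One point per correct answer, no penalty, no partial credit."""
    if len(responses) != len(answer_key):
        raise ValueError("responses and answer_key must have the same length")
    return sum(
        1
        for given, correct in zip(responses, answer_key)
        if parse_aime_answer(given) == correct
    )


# Toy three-question key (assumed values, not real AIME answers).
toy_key = [25, 7, 999]
toy_score = score_aime(["025", "7", "1000"], toy_key)  # "1000" is out of range
```

Because a blank or malformed response parses to `None`, it simply earns no point, which mirrors the no-penalty rule.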
Problems are arranged in roughly increasing order of difficulty. Topics covered include algebra, combinatorics, number theory, probability, geometry, and trigonometry, but no calculus. Solutions usually require multi-step deductions, careful case analysis, or clever transformations, with the integer answer often appearing as the result of a non-trivial computation rather than a memorized formula.[1]
Both versions of the AIME are administered annually:
The MAA explicitly states that "Students may only take the AIME once," and that taking both versions results in disqualification.[2]
Invitation to the AIME is automatic for top scorers on the AMC 10 and AMC 12 contests. According to current MAA policy, AIME invitations are extended to "at least the top 5% of all scorers" on the AMC 12 and "at least the top 2.5% of all scorers" on the AMC 10. In practice, qualifying cutoffs vary slightly year to year and have been somewhat relaxed in recent cycles. A third pathway exists through the USA Mathematical Talent Search (USAMTS), which awards AIME invitations to students achieving a sufficiently high cumulative score (historically 68 of 75) across its three rounds.[2]
The AIME is the second step in a five-stage selection pipeline used by the MAA to choose the U.S. team for the International Mathematical Olympiad (IMO):
Approximately 300 students qualify for the USAMO/USAJMO each year, around 60 attend MOP, about 24 compete in the Team Selection Test process, and six are selected for the IMO team.[6]
The AIME became one of the most prominent benchmarks for evaluating mathematical reasoning in large language models because its design happens to satisfy many practical requirements for automatic scoring:
The standard convention in AI evaluations is to use AIME I and AIME II from a given year (a total of 30 problems) and report accuracy as the percentage of problems solved. Common protocols include pass@1 (single sample), cons@64 or majority@64 (majority vote across 64 samples), and tool-augmented variants in which the model may call a Python interpreter.[3][9]
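The difference between pass@1 and majority-vote protocols can be illustrated with a short sketch. It assumes that pass@1 is estimated as the mean accuracy over several independent samples per problem, which is one common approach but not the only one; laboratories differ in the details. The function names and the toy data are hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def pass_at_1(samples: Sequence[int], answer: int) -> float:
    """Estimate pass@1 for one problem as the fraction of correct samples."""
    return sum(s == answer for s in samples) / len(samples)


def majority_vote(samples: Sequence[int]) -> int:
    """Return the most frequent sampled answer (first seen wins on ties)."""
    return Counter(samples).most_common(1)[0][0]


def benchmark_scores(
    samples_per_problem: List[Sequence[int]], answers: List[int]
) -> Tuple[float, float]:
    """Average pass@1 and majority-vote (cons@k) accuracy over all problems."""
    n = len(answers)
    pairs = list(zip(samples_per_problem, answers))
    mean_pass = sum(pass_at_1(s, a) for s, a in pairs) / n
    consensus = sum(majority_vote(s) == a for s, a in pairs) / n
    return mean_pass, consensus


# Toy data: two problems, four samples each (assumed values for illustration).
toy_samples = [[42, 42, 17, 42], [100, 100, 7, 3]]
toy_answers = [42, 100]
toy_pass, toy_cons = benchmark_scores(toy_samples, toy_answers)
```

In this toy case the majority answer is correct on both problems even though individual samples are wrong some of the time, which is why consensus figures are usually higher than pass@1 figures for the same model.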
The following AIME scores have been reported by AI laboratories in official announcements, system cards, or technical reports. Protocols vary substantially between labs and even between models from the same lab, so direct comparisons should be treated with caution. To illustrate how quickly scores rose: OpenAI's non-reasoning GPT-4o solved roughly 12% (1.8/15) of AIME 2024 problems, whereas less than a year later xAI reported Grok 4 Heavy reaching 100% on AIME 2025.[3][14]
In its September 2024 "Learning to reason with LLMs" announcement, OpenAI reported that on AIME 2024, GPT-4o solved "on average 12% (1.8/15) of problems," while o1 averaged "74% (11.1/15) with a single sample per problem, 83% (12.5/15) with consensus among 64 samples, and 93% (13.9/15) when re-ranking 1000 samples with a learned scoring function." OpenAI noted that 13.9/15 "places it among the top 500 students nationally and above the cutoff for the USA Mathematical Olympiad."[3]
OpenAI's December 20, 2024 announcement of o3 reported a score of 96.7% on AIME 2024, with the model "missing just one question," compared with o1's 83.3% (the cons@64 result).[10]
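The percentages quoted for o1 and o3 can be checked against problem counts with simple arithmetic. The sketch below assumes the 30-problem denominator (AIME I plus AIME II) for the o3 comparison, which is consistent with "missing just one question" but is not stated explicitly in the announcement; the helper name `accuracy_percent` is hypothetical.

```python
def accuracy_percent(correct: float, total: int, digits: int = 0) -> float:
    """Convert a (possibly averaged) count of solved problems to a percentage."""
    return round(100 * correct / total, digits)


# o1 figures from the September 2024 announcement, on a 15-problem basis.
o1_single = accuracy_percent(11.1, 15)     # reported as 74%
o1_cons64 = accuracy_percent(12.5, 15)     # reported as 83%
o1_rerank = accuracy_percent(13.9, 15)     # reported as 93%

# Assumed 30-problem basis: missing one question leaves 29 of 30 correct.
o3_score = accuracy_percent(29, 30, digits=1)
o1_cons64_30 = accuracy_percent(25, 30, digits=1)
```

Under that assumption, 29 of 30 corresponds to the reported 96.7%, and o1's 83.3% corresponds to 25 of 30, the same rate as 12.5 of 15.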
The o3-mini system card (January 2025) reported that o3-mini (high) achieved 87.3% on AIME 2024, with OpenAI noting that "adjusting reasoning effort significantly affects performance, especially for STEM tasks," and that moving from low to high reasoning effort typically raises AIME 2024 accuracy "by 10-30%."[11]
The April 16, 2025 o3 and o4-mini system card reported the following pass@1 scores (no tools):
| Model | AIME 2024 | AIME 2025 |
|---|---|---|
| o1 | 74.3% | 79.2% |
| o3-mini | 87.3% | 86.5% |
| o3 | 91.6% | 88.9% |
| o4-mini | 93.4% | 92.7% |
With access to a Python interpreter, o4-mini reached 99.5% pass@1 and 100% consensus@8 on AIME 2025, and o3 reached 98.4% pass@1 and 100% consensus@8 on the same exam.[12]
The DeepSeek-R1 technical report (January 2025) reported that DeepSeek-R1 "achieves a score of 79.8% Pass@1 on AIME 2024, slightly surpassing OpenAI-o1-1217," along with the following AIME 2024 scores:[9]
| Model | pass@1 | cons@64 |
|---|---|---|
| DeepSeek-R1 | 79.8% | N/A |
| DeepSeek-R1-Zero | 71.0% | 86.7% |
| DeepSeek-R1-Distill-Qwen-7B | 55.5% | 83.3% |
| DeepSeek-R1-Distill-Qwen-14B | 69.7% | 80.0% |
| DeepSeek-R1-Distill-Qwen-32B | 72.6% | 83.3% |
| DeepSeek-R1-Distill-Llama-70B | 70.0% | 86.7% |
All evaluations used a sampling temperature of 0.6, a top-p value of 0.95, and a maximum generation length of 32,768 tokens.[9]
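To show what these decoding parameters control, the following toy sketch applies temperature scaling and top-p (nucleus) filtering to a small list of logits. It is an illustration of the general technique only, not DeepSeek's evaluation code; the function name, the constant name, and the example logits are assumptions.

```python
import math
from typing import Dict, List

# Values quoted in the report; the constant name is hypothetical.
MAX_GENERATION_TOKENS = 32_768


def top_p_distribution(
    logits: List[float], temperature: float = 0.6, top_p: float = 0.95
) -> Dict[int, float]:
    """Return the renormalized distribution over the tokens kept by top-p."""
    # Temperature below 1 sharpens the distribution before the softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract max for stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of most likely tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept: List[int] = []
    cumulative = 0.0
    for index in order:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break

    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


# Toy logits for a four-token vocabulary (assumed values).
toy_distribution = top_p_distribution([2.0, 1.0, 0.0, -3.0])
```

With these toy logits, only the two most likely tokens survive the filter; sampling then draws from the renormalized distribution, and generation stops once the token limit is reached.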
xAI's February 17, 2025 release of Grok 3 reported that with the highest level of test-time compute (cons@64), Grok 3 (Think) achieved 93.3% on AIME 2025, which had been administered on February 6 and 12, only days before the announcement. The smaller Grok 3 mini reached 95.8% on AIME 2024 in cons@64 mode.[13]
Grok 4, unveiled on July 9, 2025, was reported by xAI to score 100% on AIME 2025 in its Heavy configuration, with the base Grok 4 reaching around 91.7% without tools.[14]
Gemini 2.5 Pro was reported by Google DeepMind to score approximately 92.0% on AIME 2024 and 86.7% on AIME 2025 in single-attempt evaluations, placing it among the top single-shot performers on the AIME 2025 leaderboard at the time of release in early 2025.[15]
For the Claude 4 family, Claude Opus 4 was reported by Anthropic to score around 75.5% on AIME, rising to approximately 90.0% in high-compute settings. Claude Sonnet 4.5 was reported to reach roughly 87% without tools on AIME 2025 and 100% with Python tool access. Claude Opus 4.5, released in November 2025, was reported to achieve 92.77% without tools and 100% with Python tools on AIME 2025; Anthropic's system card cautioned that it has "some concerns that contamination may have inflated this score."[16][17]
By mid-2025, frontier models were routinely scoring above 90% on AIME 2024 and AIME 2025 in their default configurations and 100% with Python tool access, prompting widespread discussion that the benchmark had effectively saturated.[4][20] Several methodological concerns were also raised:
In response, the AI evaluation community has shifted attention to harder mathematics benchmarks, including:
For specific year-by-year results, see AIME 2024 and AIME 2025.
Despite saturation as an AI benchmark, the AIME remains an important fixture of secondary mathematics education in the United States. Roughly 3,000 to 5,000 students qualify each year, and the exam continues to serve as the principal gateway to the USAMO and, ultimately, the IMO. The combination of historical longevity, well-calibrated difficulty, and a clean integer answer format that originally made the AIME attractive to AI researchers continues to make it a respected human competition independent of its role in evaluating language models.