# AIME 2024

> Source: https://aiwiki.ai/wiki/aime_2024
> Updated: 2026-06-21
> Categories: AI Benchmarks
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

| AIME 2024 |  |
| --- | --- |
| Overview |  |
| Full name | American Invitational Mathematics Examination 2024 |
| Abbreviation | AIME 2024 |
| Description | A challenging mathematical reasoning benchmark based on the American Invitational Mathematics Examination 2024 problems, designed to evaluate AI models' ability to solve complex high school mathematics problems requiring multi-step reasoning |
| Release date | 2024-02-01 |
| Latest version | 1.0 |
| Benchmark updated | 2024-02-07 |
| Authors | Mathematical Association of America |
| Organization | Mathematical Association of America (MAA), Art of Problem Solving (AoPS) |
| Technical Details |  |
| Type | Mathematical Reasoning, Problem Solving |
| Modality | Text |
| Task format | Open-ended problem solving |
| Number of tasks | 30 (15 from AIME I + 15 from AIME II) |
| Total examples | 30 |
| Evaluation metric | Exact Match, Pass@1, Cons@N |
| Domains | Algebra, Geometry, Number Theory, Combinatorics, Probability |
| Languages | English |
| Performance |  |
| Human performance | ~26% (median high-scoring qualifier solves 4 of 15) |
| Baseline | ~12% (GPT-4o pass@1) |
| SOTA score | 95.8% (Grok 3 Mini, cons@64) |
| SOTA model | Grok 3 Mini (xAI) |
| SOTA date | 2025-02-17 |
| Saturated | Yes (top reasoning models exceed 90%) |
| Resources |  |
| Website | [Official website](https://maa.org/maa-invitational-competitions/) |
| GitHub | [Repository](https://github.com/Maxwell-Jia/AIME_2024) |
| Dataset | [Download](https://huggingface.co/datasets/Maxwell-Jia/AIME_2024) |
| Predecessor | AIME 2023 |
| Successor | [AIME 2025](/wiki/aime_2025) |

**AIME 2024** is an [AI benchmark](/wiki/ai_benchmark) of 30 problems drawn from the 2024 American Invitational Mathematics Examination that became a standard yardstick for measuring the mathematical reasoning of [large language models](/wiki/large_language_model) such as [OpenAI o1](/wiki/o1), [o3](/wiki/o3), and [DeepSeek R1](/wiki/deepseek-r1).[^1][^3] Each problem has an integer answer between 0 and 999. The competition was originally written by the [Mathematical Association of America](https://maa.org/maa-invitational-competitions/) for top high school mathematicians in the United States, and the 30 questions combine the 15 problems of AIME I with the 15 of AIME II.[^1][^2] Because integer answers are cheap to check exactly, the 1,000 possible values make guessing ineffective, and the problems demand layered reasoning, AIME 2024 became one of the most cited tests of mathematical reasoning ability for [reasoning models](/wiki/reasoning_model) released in late 2024 and 2025.[^3] By 2026 the benchmark is widely considered saturated: most frontier reasoning models score above 90% and several public leaderboards have stopped weighting it in their composite rankings.[^15][^19]

## What is AIME 2024 and what is it used for?

The AIME 2024 benchmark is built on the 2024 cycle of the American Invitational Mathematics Examination, an invitational round of the [AMC](/wiki/amc) competition series. AIME is administered to students who score in roughly the top 2.5% of the AMC 10 or the top 5% of the AMC 12.[^2] Strong AIME performance is the gateway to the USA Mathematical Olympiad (USAMO) and USA Junior Mathematical Olympiad (USAJMO), and the index used for those competitions combines AMC and AIME scores.

The AIME 2024 problems were administered in two sittings: AIME I on January 31 to February 1, 2024, and AIME II on February 7, 2024.[^2] Each sitting contains 15 problems and runs for three hours, scored 1 point per correct answer with no penalty for wrong or blank answers, so a perfect score is 15 per form and 30 across both.[^2] After the contest closed, the problems and full solutions were posted publicly through the Art of Problem Solving wiki and other community archives, which is exactly why AIME 2024 became a popular target for AI researchers, and also why later contamination concerns emerged.[^12]

The machine-readable benchmark most often used in papers is the Hugging Face dataset published by Maxwell Jia, which packages the 30 official problems with reference answers in JSON Lines format.[^11] Several alternative packagings exist, including `HuggingFaceH4/aime_2024` and `math-ai/aime24`, but they all wrap the same 30 MAA problems. Some early evaluations, including OpenAI's blog post for [o1](/wiki/o1), restricted the benchmark to the 15 AIME I problems only, which is one source of confusion when comparing scores across papers.[^3]
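
As a rough illustration of the packaging described above, the sketch below parses JSON Lines records and filters out the AIME I subset. The field names `ID`, `Problem`, and `Answer`, the `2024-I-` ID prefix, and the two sample records are assumptions made for this example, not a verified copy of any dataset's schema; check the dataset card before relying on them.

```python
import json

# Two placeholder records in the ASSUMED schema; a real file would have 30 lines.
SAMPLE_JSONL = """\
{"ID": "2024-I-1", "Problem": "Placeholder problem text.", "Answer": 123}
{"ID": "2024-II-1", "Problem": "Another placeholder.", "Answer": "045"}
"""

def load_problems(jsonl_text):
    """Parse JSON Lines text into a list of dicts with integer answers."""
    problems = []
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Answers may be stored as ints or zero-padded strings; normalize to int.
        record["Answer"] = int(record["Answer"])
        problems.append(record)
    return problems

def aime_i_subset(problems):
    """Keep only the AIME I items, as some early evaluations did."""
    return [p for p in problems if p["ID"].startswith("2024-I-")]

problems = load_problems(SAMPLE_JSONL)
print(len(problems), len(aime_i_subset(problems)))  # prints: 2 1
```

Normalizing answers to integers at load time avoids spurious mismatches between `45` and `"045"` later in the pipeline.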

### A note on the problem 12 erratum

The official MAA answer for AIME I 2024 Problem 12 was initially published as 384 but was corrected to 385 within days of the contest.[^20] The Maxwell-Jia dataset and most modern evaluation harnesses use the corrected 385; some early scripts that copied the original answer key produce systematically lower scores for any model that arrives at the correct 385. This is a small but real source of cross-paper discrepancy for late-2024 results.

### Why did AIME 2024 become an LLM benchmark?

A few features made AIME 2024 the right shape for evaluating reasoning models:

- **Integer answers from 0 to 999**: the model either writes the correct integer or it does not, which makes grading cheap and reproducible.
- **No partial credit, no negative marking**: a clean exact-match metric.
- **Hard but bounded math**: every problem is solvable from a high school curriculum, but the later items demand creative combinations of algebra, number theory, geometry, and combinatorics. That is the level where non-reasoning chat models like [GPT-4o](/wiki/gpt-4o) consistently failed and where chain-of-thought training started to pay off.[^3]
- **Small, well-known dataset**: only 30 problems, easy to run and inspect.
- **Public reference solutions**: makes detailed error analysis straightforward, not just final accuracy.

## Technical specifications

### Problem format

Each AIME 2024 problem requires the model to output a single integer between 000 and 999. Problems are presented as plain text and may include LaTeX. Common task patterns include:

- Counting and probability questions where the answer is the numerator plus denominator of a reduced fraction.
- Geometry problems where the answer is some integer length, area, or sum of unknowns.
- Number theory problems asking for a specific residue, sum of digits, or count of solutions.
- Algebra problems where the answer is a coefficient, a polynomial value at a point, or the sum m+n where the original answer is m/n.
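
The "m+n" convention in the patterns above can be sketched in a few lines. The helper names `fraction_answer` and `format_answer` are hypothetical, and the fraction 6/8 is a made-up example rather than an actual AIME answer.

```python
from fractions import Fraction

def fraction_answer(numerator, denominator):
    """Return m + n for the fraction reduced to lowest terms m/n."""
    reduced = Fraction(numerator, denominator)  # Fraction reduces automatically
    return reduced.numerator + reduced.denominator

def format_answer(value):
    """Render an answer as a three-digit string from 000 to 999."""
    if not 0 <= value <= 999:
        raise ValueError("AIME answers must lie between 0 and 999")
    return f"{value:03d}"

# 6/8 reduces to 3/4, so the reported answer is 3 + 4 = 7.
print(format_answer(fraction_answer(6, 8)))  # prints 007
```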

The difficulty curve is steep. Problems 1 to 5 are typically tractable for an experienced AMC solver. Problems 11 to 15 frequently require nonobvious construction or clever invariants and are roughly USAMO entry difficulty. Problem 15 on AIME I 2024, for example, required setting up surface-area equations with Vieta's relations on a degree-three polynomial.[^20]

### How is AIME 2024 scored for LLMs?

The benchmark is run in several ways depending on the paper:

| Evaluation method | Description | Implementation |
| --- | --- | --- |
| Exact match | Model output must equal the ground truth integer | Final answer extracted from `\boxed{}` or last line of model output |
| Pass@1 (greedy) | Single deterministic attempt | Temperature 0, no sampling |
| Pass@1 (averaged) | Average correctness across many samples | Common setup is 64 samples per problem with temperature 0.6 and top-p 0.95 |
| Cons@N (consensus) | Majority vote over N samples | Often N=64; reduces variance from sampling |
| Best-of-N | Re-rank N samples with a learned scorer | Used by OpenAI for the 93% o1 result with N=1000 |
| Tool-augmented | Model can call a Python interpreter | Pushes scores to ~99-100% on frontier models |

[DeepSeek's R1 paper](/wiki/deepseek-r1), for example, fixes temperature to 0.6, top-p to 0.95, and reports pass@1 averaged over 64 sampled responses, which is now a common reference setup.[^6] Models are usually prompted with something close to: "Please reason step by step, and put your final answer within `\boxed{}`."
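
A minimal sketch of the exact-match step, assuming the model followed the `\boxed{}` instruction. The function names are hypothetical, and the regular expression only handles answers without nested braces; real harnesses also implement the last-line fallback and other normalizations that are omitted here.

```python
import re

def extract_boxed(text):
    """Return the contents of the last \\boxed{...} in text, or None.

    Only handles contents without nested braces, which is enough for integers.
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

def is_correct(model_output, reference):
    """Exact match on the integer value, so '073' and '73' both equal 73."""
    candidate = extract_boxed(model_output)
    if candidate is None or not re.fullmatch(r"[0-9]{1,3}", candidate):
        return False
    return int(candidate) == int(reference)

output = "Adding the cases gives 70 + 3, so the answer is \\boxed{073}."
print(is_correct(output, 73))  # prints True
```

Comparing integer values rather than raw strings keeps zero-padded answers such as `073` from being marked wrong.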

Because there are only 30 problems, single-run scores have high variance. A model that gets 24 right one run and 21 right the next has shifted by 10 percentage points without any change in capability. That is why most credible scores either average over many seeds or report cons@N.
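
The two variance-reducing modes can be sketched as follows, assuming each problem's sampled answers have already been extracted as integers. The function names and toy data are illustrative only; real runs often use 64 samples per problem.

```python
from collections import Counter

def pass_at_1_averaged(samples, reference):
    """Fraction of sampled answers that equal the reference."""
    return sum(answer == reference for answer in samples) / len(samples)

def cons_at_n(samples, reference):
    """1.0 if the most common sampled answer equals the reference, else 0.0.

    Ties go to the answer that appeared first among the samples.
    """
    majority_answer, _ = Counter(samples).most_common(1)[0]
    return float(majority_answer == reference)

def benchmark_score(per_problem_samples, references, metric):
    """Average a per-problem metric over all problems."""
    scores = [metric(s, r) for s, r in zip(per_problem_samples, references)]
    return sum(scores) / len(scores)

# Toy data: 2 problems with 4 samples each.
samples = [[73, 73, 12, 73], [5, 8, 5, 9]]
references = [73, 5]
print(benchmark_score(samples, references, pass_at_1_averaged))  # prints 0.625
print(benchmark_score(samples, references, cons_at_n))           # prints 1.0
```

On the same samples the consensus score is higher than the averaged pass@1, which is why the two numbers should not be compared directly across papers.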

### Tool-augmented evaluations

By 2025 it became standard to publish two AIME numbers for any model that supports tool use: pure reasoning (no tools) and with a Python interpreter. Giving a frontier reasoning model a Python sandbox typically pushes AIME 2024 accuracy to ~99-100%, with errors confined to a small number of problems whose translation to code is itself the hard step.[^17][^21] At OpenAI's GPT-5 launch, the "Pro" variant with parallel test-time compute and tool use reportedly solved 100% of the [AIME 2025](/wiki/aime_2025) set, and similar numbers are routine on the older 2024 problems. These open-book results have shifted attention away from AIME as a pure reasoning yardstick and toward benchmarks like [FrontierMath](/wiki/frontiermath), [HMMT](/wiki/hmmt), and [USAMO 2025](/wiki/usamo_2025) where tool use is either disallowed or not enough.[^15][^16]

## Performance analysis

### Which models report AIME 2024 scores?

The table below collects widely-cited AIME 2024 scores from primary sources. Where scores were originally reported on the 15-problem AIME I subset (as in OpenAI's September 2024 blog post), that is noted in the methodology column.

| Model | AIME 2024 score | Methodology | Source / date |
| --- | --- | --- | --- |
| Grok 3 Mini (Think, high) | 95.8% | cons@64, test-time compute scaling | xAI, February 2025[^9] |
| GPT-5 | ~95.7% | pass@1 | OpenAI / aggregator, 2026[^16][^22] |
| Grok 4 | ~94.0% | pass@1 | xAI, 2026[^22] |
| OpenAI o4-mini | 93.4% | pass@1, no tools | OpenAI, April 2025[^4] |
| Grok 3 (Think) | 93.3% | cons@64 | xAI, February 2025[^9] |
| LongCat-Flash-Thinking (Meituan) | 93.3% | self-reported | Meituan, 2025[^15] |
| OpenAI o1 (re-ranked) | 93% (13.9/15) | re-ranking 1000 samples on AIME I | OpenAI, September 2024[^3] |
| Gemini 2.5 Pro | 92.0% | pass@1 | Google DeepMind, March 2025[^10] |
| OpenAI o3 | 91.6% (96.7% later report) | pass@1 | OpenAI, April 2025[^4][^16] |
| DeepSeek-R1-0528 | 91.4% | pass@1 | DeepSeek, May 2025[^23] |
| GLM-4.5 (Zhipu AI) | 91.0% | self-reported | Zhipu AI, 2025[^15] |
| Ministral 3 (14B Reasoning) | 89.8% | self-reported | Mistral AI, 2025[^15] |
| Gemini 2.5 Flash | 88.0% | pass@1 | Google, 2025[^15] |
| OpenAI o3-mini (high) | 87.3% | pass@1 | OpenAI, January 2025[^5] |
| DeepSeek R1-Zero (cons@64) | 86.7% | majority vote over 64 samples | DeepSeek, January 2025[^6] |
| DeepSeek R1 Distill Llama 70B | 86.7% | self-reported | DeepSeek, 2025[^15] |
| OpenAI o1-pro | 86.0% | pass@1 | OpenAI, 2025[^15] |
| Qwen3-235B-A22B | 85.7% | self-reported | Alibaba, 2025[^15] |
| OpenAI o1 (cons@64) | 83% (12.5/15) | consensus on AIME I | OpenAI, September 2024[^3] |
| Claude 3.7 Sonnet (extended thinking) | 80.0% | parallel extended thinking, 64K token budget | Anthropic, February 2025[^7] |
| DeepSeek R1 | 79.8% | pass@1 averaged over 64 samples | DeepSeek paper, January 2025[^6] |
| QwQ-32B (Alibaba) | 79.5% | pass@1 averaged | Alibaba, March 2025[^16] |
| OpenAI o1 (single sample) | 74.4% | pass@1 on AIME I (11.1/15) | OpenAI, September 2024[^3] |
| Gemini 2.0 Flash Thinking | 73.3% | pass@1 | Google, December 2024[^18] |
| OpenAI o1-mini | 63.6% | pass@1 averaged | DeepSeek paper / OpenAI[^6] |
| Grok 3 (base, non-reasoning) | 52.2% | pass@1 | xAI, February 2025[^9] |
| Gemini 2.0 Flash (experimental) | 35.5% | pass@1 | Google, December 2024[^18] |
| Claude 3.7 Sonnet (standard mode) | 23.3% | pass@1, no extended thinking | Anthropic, February 2025[^7] |
| Gemini 1.5 Pro | 19.3% | pass@1 | Google, 2024[^10] |
| GPT-4o | ~12% (1.8/15) | pass@1 on AIME I | OpenAI, September 2024[^3] |
| Claude 3.5 Sonnet | ~10% | pass@1 | Anthropic, 2024[^14] |

A few notes on this table:

- The 93% o1 number that OpenAI publicized in September 2024 used best-of-1000 with a learned re-ranker, not a single sample. OpenAI's own report states that "o1 averaged 74% (11.1/15) with a single sample per problem, 83% (12.5/15) with consensus among 64 samples, and 93% (13.9/15) when re-ranking 1000 samples with a learned scoring function" on the 15-problem AIME I exam, so press coverage that cited 93% as a single-shot score conflated three very different methodologies.[^3]
- The [DeepSeek R1](/wiki/deepseek-r1) paper reports 79.8% pass@1 averaged over 64 samples, slightly above OpenAI o1-1217's 79.2% on the same setup, which is the headline that pushed R1 into the news cycle in January 2025.[^6]
- Grok 3 Mini's 95.8% sits at the top of public AIME 2024 leaderboards through 2025, but it relies on cons@64 with very large test-time compute. At pass@1 the figures from xAI's own announcement are noticeably lower.[^9]
- 2026 frontier models including [GPT-5](/wiki/gpt-5), Grok 4, and o3 in its production configuration are clustered in a tight 94-97% band on pass@1, which is one of the strongest single signals that the benchmark has effectively saturated.[^16][^22]
- With a Python interpreter enabled, [o4-mini](/wiki/o4-mini) and [Claude Opus 4.5](/wiki/claude_opus_4_5) routinely reach 99-100% on AIME 2024, so the table above intentionally excludes tool-augmented runs to keep comparisons apples-to-apples.[^21]

### How do reasoning models differ from base models on AIME 2024?

The most striking pattern in the table is the gap between reasoning-trained models and conventional chat models. [GPT-4o](/wiki/gpt-4o) and the original [Claude 3.5 Sonnet](/wiki/claude-3-5-sonnet) both sit around 10 to 13% on AIME 2024. Models trained with reinforcement learning on chains of thought, including [OpenAI o1](/wiki/o1), [DeepSeek R1](/wiki/deepseek-r1), and [Claude 3.7 Sonnet](/wiki/claude_3_7_sonnet) with extended thinking, jumped to 70-90%+. The same Claude 3.7 Sonnet weights score 23.3% in standard mode and 80.0% with extended thinking enabled, which is the cleanest demonstration of how much of the gain comes from inference-time reasoning rather than raw capability.[^7][^8] OpenAI's o1 blog also reports a roughly log-linear scaling relationship between accuracy and test-time compute on AIME, a pattern most labs have since adopted in their evaluation reports.[^3] In follow-on analyses, scaling test-time compute by 100x typically lifts AIME 2024 accuracy from around 20% to around 80%, while 100x more RL training only moves it from about 33% to 66%, indicating inference scaling is the larger lever on this dataset.[^16]
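
To make the comparison concrete, the sketch below fits a straight line in log10(compute) through the two endpoint pairs quoted above. The log-linear form, the relative compute units, and the function name are assumptions for illustration; such a line cannot hold indefinitely because accuracy is capped at 100%.

```python
import math

def fit_log_linear(c1, acc1, c2, acc2):
    """Fit acc = a + b * log10(compute) through two (compute, accuracy) points."""
    b = (acc2 - acc1) / (math.log10(c2) - math.log10(c1))
    a = acc1 - b * math.log10(c1)
    return a, b

# Test-time compute: about 20% to 80% over a 100x increase (relative units).
a_infer, b_infer = fit_log_linear(1, 20.0, 100, 80.0)
# RL training compute: about 33% to 66% over a 100x increase.
a_train, b_train = fit_log_linear(1, 33.0, 100, 66.0)

print(b_infer)  # prints 30.0 (points per 10x of test-time compute)
print(b_train)  # prints 16.5 (points per 10x of RL training compute)
print(a_infer + b_infer * math.log10(10))  # prints 50.0 (interpolated at 10x)
```

Under these assumptions each 10x of inference compute buys roughly twice as many points as each 10x of RL training compute.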

### Open-source progress

AIME 2024 has been the headline number for nearly every open-weight reasoning model release in 2025-2026. The fast-moving open-weight stack of late 2025 - [DeepSeek-R1-0528](/wiki/deepseek-r1-0528) at 91.4%, [Qwen3](/wiki/qwen3)-235B-A22B at 85.7%, GLM-4.5 at 91.0%, MiniMax M1 at 86.0%, LongCat-Flash-Thinking at 93.3%, and the Ministral 3 reasoning family - has effectively closed the AIME 2024 gap with proprietary frontier models, even if it has not closed the gap on more contamination-resistant benchmarks like AIME 2025 or [FrontierMath](/wiki/frontiermath).[^15][^23] DeepSeek's R1-0528 release notes specifically called out AIME 2024 as moving from 79.8% to 91.4% as the chain-of-thought training matured, with average reasoning trace length nearly doubling from roughly 12K to 23K tokens on harder problems.[^23]

### Human comparison

Comparing model scores with human performance is messier than it is often presented. AIME is taken only by AMC qualifiers, so the typical AIME taker is already strong:

- The official MAA statistics for 2024 give an AIME I mean score of 5.89 (median 5) and an AIME II mean of 5.45 (median 5) out of 15 problems.[^2]
- Both forms therefore have a median of 5 of 15 problems (about 33%), with means of roughly 36-39%.
- USAMO qualification typically requires a USAMO index where the AIME contribution is 9 or higher.
- The very top scorers, perfect or near-perfect, are a tiny tail of the distribution, on the order of a few hundred students nationally per year.

OpenAI's claim that a 93% score placed o1 "among the top 500 students in the United States" was based on AIME I only, not the combined AIME I and II dataset most papers use today.[^3]

## Mathematical domains covered

The AIME 2024 benchmark tests proficiency across the standard secondary math contest domains:

| Domain | Example topics | Approximate share of problems |
| --- | --- | --- |
| Algebra | Polynomial equations, systems, inequalities, sequences | ~27% |
| Geometry | Euclidean geometry, coordinate geometry, transformations | ~27% |
| Number theory | Divisibility, modular arithmetic, primes, Diophantine | ~20% |
| Combinatorics | Counting, probability, recursion | ~20% |
| Complex numbers | Roots of unity, complex algebra | ~6% |

Topic boundaries are fuzzy. A typical AIME Problem 12 might mix coordinate geometry with number theory and a touch of combinatorics, and many problems are deliberately built so that the obvious approach is intractable and a clever observation cuts the work down. The published Areteem and AoPS solution sets for AIME I 2024 cover precisely these blended patterns: AIME I Problem 3 is a game-theory/modular arithmetic mash-up, Problem 10 mixes circle geometry with Stewart's theorem, Problem 14 is a 3D volume calculation, and Problem 15 reduces to applying Vieta's formulas to an unknown polynomial.[^20]

## Limitations and considerations

### Is AIME 2024 contaminated?

Data contamination is the largest caveat hanging over AIME 2024 as a benchmark. The 2024 problems were posted in full, with detailed solutions, on the Art of Problem Solving wiki and elsewhere within hours of the contest. By the time models were trained or fine-tuned in late 2024 and 2025, those pages were almost certainly part of the public web crawls feeding pretraining and instruction tuning datasets.

Researchers building [MathArena](/wiki/matharena), a contamination-resistant evaluation framework, found strong signs of contamination on AIME 2024.[^12] They report that "widely available problems from AIME 2024 have contaminated the pretraining data of multiple LLMs": most leading models scored 10 to 20 points above what their performance on freshly released contests ([AIME 2025](/wiki/aime_2025), BRUMO 2025) would predict.[^12] QwQ-32B-Preview was estimated to score around 60 percentage points above its human-aligned expectation, the most extreme contamination signal in the paper.[^12]

The same project released VAR-AIME24, which substitutes symbolic parameters for the fixed numeric constants in each AIME 2024 problem to test whether models actually solve the problem or recall the answer. Frontier models drop 7-18 percentage points under this perturbation, while smaller RL-tuned models lose 40-75% of their symbolic consistency, suggesting they were memorizing surface form more than reasoning to a solution.[^12]
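
The idea behind this kind of perturbation test can be sketched with a toy parameterized problem. Everything below is hypothetical: the problem template is not an AIME problem, the two solvers are stand-ins for a memorizing and a reasoning model, and none of it is taken from the VAR-AIME24 code.

```python
import re

def make_variant(n, k):
    """Build one instance of a toy parameterized problem and its answer."""
    problem = (
        f"How many positive integers less than or equal to {n * k} "
        f"are divisible by {k}?"
    )
    return problem, n  # the multiples are k, 2k, ..., n*k, so there are n

def memorizing_solver(problem):
    """Hypothetical solver that recalls the answer to one published instance."""
    return 20

def reasoning_solver(problem):
    """Hypothetical solver that reads the parameters and computes the answer."""
    limit, k = (int(token) for token in re.findall(r"\d+", problem))
    return limit // k

def accuracy(solver, variants):
    """Share of variants the solver answers correctly."""
    return sum(solver(p) == a for p, a in variants) / len(variants)

# The first variant plays the role of the published problem; the rest are perturbed.
variants = [make_variant(n, k) for n, k in [(20, 3), (17, 5), (31, 2), (12, 7)]]
print(accuracy(memorizing_solver, variants))  # prints 0.25
print(accuracy(reasoning_solver, variants))   # prints 1.0
```

A solver that only recalls the published answer is right on the original instance and wrong on every perturbed one, which is the signature the perturbation is designed to expose.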

The usual remedy now is to evaluate on [AIME 2025](/wiki/aime_2025), which was released after the training cutoff of many models, in addition to AIME 2024. When a model that scored 90% on AIME 2024 drops to 75% on AIME 2025, that gap is a useful contamination signal even if neither score is perfectly clean.[^12]

### Statistical noise

Thirty problems is a small sample. A single problem worth 1 of 30 is 3.3% of the score. A model that solves 24 problems correctly scores 80%, but its true skill could plausibly produce anywhere between 22 and 26 on a different draw of similarly hard problems. Confidence intervals on AIME 2024 scores are wide, which is part of why pass@1 averaged over 64 samples and cons@64 are now the standard reporting modes.[^6][^17] Some recent reports, including Microsoft's inference-time scaling analyses, show that averaging across diverse temperature settings adds another ~7 points on top of conventional same-temperature test-time scaling, providing yet another non-capability lever that can shift headline AIME numbers.[^24]
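
The width of that uncertainty can be estimated with a short calculation. This sketch treats the 30 problems as independent trials with a common success probability, which is a simplifying assumption because AIME problems vary widely in difficulty. The Wilson score interval is one standard choice of interval, not something the cited reports prescribe.

```python
import math

def wilson_interval(correct, total, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half

def binomial_std(total, p):
    """Standard deviation of the number of correct answers."""
    return math.sqrt(total * p * (1 - p))

low, high = wilson_interval(24, 30)
print(f"{low:.2f} to {high:.2f}")       # prints 0.63 to 0.90
print(round(binomial_std(30, 0.8), 1))  # prints 2.2
```

A standard deviation of about 2.2 problems matches the 22-to-26 range quoted above, and the 95% interval spans more than 25 percentage points.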

### Limited coverage

AIME 2024 only tests one style of math: short-answer, integer-output, contest-flavored problems. It says nothing about whether a model can write a real proof, formalize an argument in [Lean](/wiki/lean), or do open-ended exploration of a research-style question. Benchmarks like the [USAMO](/wiki/usamo), [Putnam](/wiki/putnam) competitions, [FrontierMath](/wiki/frontiermath), and [Humanity's Last Exam](/wiki/humanitys_last_exam) are designed to fill those gaps.

### Saturation and benchmark deprecation

By May 2026 several public leaderboards have stopped weighting AIME 2024 in their composite scores. The CodeSOTA and BenchLM math leaderboards continue to display AIME 2024 numbers for reference but explicitly mark the benchmark as saturated and exclude it from headline rankings, redirecting attention to BRUMO 2025, [USAMO 2025](/wiki/usamo_2025), and [HMMT](/wiki/hmmt) 2025.[^15][^19] Anthropic, Google, and OpenAI have all moved their flagship math results in 2026 launches to a mix of AIME 2025 with tools, [FrontierMath](/wiki/frontiermath), and Putnam, often relegating AIME 2024 to an appendix or omitting it entirely.[^21]

## How does AIME 2024 differ from AIME 2025?

AIME 2024 and [AIME 2025](/wiki/aime_2025) share an identical structure, 30 integer-answer problems split across two forms, but they differ in the one dimension that matters most for benchmarking: training-data exposure. The 2024 problems had been on the public web for more than a year before most 2025-2026 models finished pretraining, while the 2025 problems were released in February 2025, after the knowledge cutoff of many of those models.[^12] In practice this means AIME 2024 scores tend to run higher and be less reliable, and a large drop from AIME 2024 to AIME 2025 for the same model is read as a contamination signal rather than a genuine difference in problem difficulty.[^12] Most 2025-2026 model cards now report both numbers side by side for this reason.

## Related benchmarks

AIME 2024 sits inside a broader ecosystem of mathematical reasoning benchmarks that AI researchers run alongside it:

- **[AIME 2025](/wiki/aime_2025)**: the follow-up benchmark using the 2025 contest problems, less affected by training data contamination.
- **[MATH](/wiki/math)**: the original 12,500-problem dataset of high school competition math, now considered partially saturated by frontier reasoning models.
- **[MATH-500](/wiki/math-500)**: a 500-problem subset of MATH commonly reported alongside AIME.
- **[GSM8K](/wiki/gsm8k)**: 8,000 grade school word problems, long since saturated by capable LLMs.
- **[GPQA Diamond](/wiki/gpqa_diamond)**: graduate-level science questions including mathematics.
- **[HMMT](/wiki/hmmt)**: the Harvard-MIT Mathematics Tournament, also used in MathArena.
- **[FrontierMath](/wiki/frontiermath)**: research-level math problems designed to be much harder and more contamination resistant.
- **[Putnam](/wiki/putnam)**: undergraduate competition mathematics, used for college-level evaluations.
- **[USAMO 2025](/wiki/usamo_2025)**: proof-based competition for the very top US math students, now a fresh evaluation target as well.
- **[MathArena](/wiki/matharena)**: an evaluation framework that uses competitions held after each model's release to control for contamination.

## Impact and significance

AIME 2024 is probably the single benchmark most responsible for the 2024-2025 reasoning model wave entering public consciousness. The September 2024 OpenAI o1 announcement leaned on AIME 2024 as its headline reasoning result, which set the framing that DeepSeek directly attacked four months later when [R1](/wiki/deepseek-r1) matched o1's score at a fraction of the inference cost.[^3][^6] Anthropic, Google, and xAI followed with their own reasoning launches, and AIME 2024 was on every comparison chart.[^7][^9][^10]

The benchmark also accelerated two reporting habits. First, test-time compute became a first-class axis: almost every 2025 reasoning model release included a chart of accuracy against thinking budget, with AIME 2024 the most common dataset behind those curves.[^3][^16] Second, cons@N and best-of-N now appear alongside pass@1 in most releases, since 30 problems and large test-time budgets together make any single number too noisy. Contamination concerns also mean that current AIME 2024 scores probably overstate how well frontier models reason on truly novel contest problems, and the more sobering [AIME 2025](/wiki/aime_2025) numbers tend to support that.[^12] In 2026 the benchmark functions less as a frontier yardstick than as a baseline floor: any new reasoning model expected to be taken seriously is implicitly required to clear 90% on it before its other benchmark numbers will be read at all.[^15][^22]

## See also

- [Mathematical reasoning](/wiki/mathematical_reasoning) in AI
- [AI benchmarks](/wiki/ai_benchmark)
- [OpenAI o-series models](/wiki/openai_o_series)
- [Chain-of-thought prompting](/wiki/chain_of_thought)
- [Test-time compute scaling](/wiki/test_time_compute)
- [USAMO](/wiki/usamo)
- [American Mathematics Competitions](/wiki/amc)
- [MathArena](/wiki/matharena)

## References

[^1]: Mathematical Association of America. "MAA Invitational Competitions." https://maa.org/maa-invitational-competitions/
[^2]: Wikipedia. "American Invitational Mathematics Examination." https://en.wikipedia.org/wiki/American_Invitational_Mathematics_Examination
[^3]: OpenAI. "Learning to reason with LLMs." September 12, 2024. https://openai.com/index/learning-to-reason-with-llms/
[^4]: OpenAI. "Introducing OpenAI o3 and o4-mini." April 16, 2025. https://openai.com/index/introducing-o3-and-o4-mini/
[^5]: OpenAI. "OpenAI o3-mini." January 31, 2025. https://openai.com/index/openai-o3-mini/
[^6]: DeepSeek-AI et al. "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning." arXiv:2501.12948. January 2025. https://arxiv.org/html/2501.12948v1
[^7]: Anthropic. "Claude 3.7 Sonnet and Claude Code." February 24, 2025. https://www.anthropic.com/news/claude-3-7-sonnet
[^8]: Anthropic. "Claude's extended thinking." 2025. https://www.anthropic.com/news/visible-extended-thinking
[^9]: xAI. "Grok 3 Beta: The Age of Reasoning Agents." February 17, 2025. https://x.ai/news/grok-3
[^10]: Google DeepMind. "Gemini 2.5: Our newest Gemini model with thinking." March 25, 2025. https://blog.google/innovation-and-ai/models-and-research/google-deepmind/gemini-model-thinking-updates-march-2025/
[^11]: Maxwell-Jia. "AIME 2024 dataset." Hugging Face. https://huggingface.co/datasets/Maxwell-Jia/AIME_2024
[^12]: Balunović et al. "MathArena: Evaluating LLMs on Uncontaminated Math Competitions." arXiv:2505.23281. https://arxiv.org/html/2505.23281v2
[^13]: Vellum AI. "Analysis: OpenAI o1 vs DeepSeek R1." https://www.vellum.ai/blog/analysis-openai-o1-vs-deepseek-r1
[^14]: Vellum AI. "Claude 3.7 Sonnet vs OpenAI o1 vs DeepSeek R1." https://www.vellum.ai/blog/claude-3-7-sonnet-vs-openai-o1-vs-deepseek-r1
[^15]: llm-stats.com. "AIME 2024 Benchmark Leaderboard." https://llm-stats.com/benchmarks/aime-2024
[^16]: Alibaba Cloud. "Alibaba Cloud Unveils QwQ-32B." https://www.alibabacloud.com/blog/alibaba-cloud-unveils-qwq-32b-a-compact-reasoning-model-with-cutting-edge-performance_602039
[^17]: UK AI Security Institute. "Inspect Evals: AIME 2024." https://ukgovernmentbeis.github.io/inspect_evals/evals/mathematics/aime2024/index.html
[^18]: DataCamp. "Gemini 2.0 Flash Thinking Experimental: A Guide With Examples." https://www.datacamp.com/blog/gemini-2-0-flash-experimental
[^19]: CodeSOTA. "AIME 2024 Leaderboard." https://www.codesota.com/benchmark/aime-2024
[^20]: Areteem Institute. "2024 AIME I Answer Key Released." February 2024. https://areteem.org/blog/2024-aime-i-answer-key-released/
[^21]: Anthropic. "Introducing Claude Opus 4.5." November 2025. https://www.anthropic.com/news/claude-opus-4-5
[^22]: LM Council. "AI Model Benchmarks May 2026." https://lmcouncil.ai/benchmarks
[^23]: DeepSeek-AI. "DeepSeek-R1-0528 release notes." Hugging Face, May 2025. https://huggingface.co/deepseek-ai/DeepSeek-R1-0528
[^24]: Microsoft Research. "Inference-Time Scaling for Complex Tasks: Where We Stand and What Lies Ahead." March 2025. https://www.microsoft.com/en-us/research/wp-content/uploads/2025/03/Inference-Time-Scaling-for-Complex-Tasks-Where-We-Stand-and-What-Lies-Ahead-2.pdf

