MathArena
Last reviewed
May 16, 2026
Sources
15 citations
Review status
Source-backed
Revision
v2 · 5,646 words
| MathArena | |
|---|---|
| Overview | |
| Full name | MathArena: Evaluating LLMs on Uncontaminated Math Competitions |
| Description | A continuously updated public leaderboard that evaluates large language models on freshly released mathematics competitions including AIME, USAMO, IMO, HMMT, BRUMO, SMT, and Putnam, with grading for both final answers and full natural-language proofs |
| Initial release | March 2025 (USAMO 2025 evaluation), May 2025 (paper) |
| Latest expansion | MathArena Apex (August 2025), ArXivMath and ArXivLean (early 2026) |
| Authors | Mislav Balunović, Jasper Dekoninck, Ivo Petrov, Nikola Jovanović, Martin Vechev |
| Organization | SRI Lab, ETH Zurich; INSAIT (Sofia) |
| Venue | NeurIPS Datasets and Benchmarks 2025 |
| Technical Details | |
| Type | Mathematical reasoning, proof writing, research-level mathematics, formal verification |
| Modality | Text, LaTeX, visual (Kangaroo), Lean (ArXivLean) |
| Task format | Final-answer (numerical or symbolic), natural-language proofs, code-based solutions, formal proofs |
| Number of problems | 162 in the original 2025 paper; over 400 problems by 2026 across all tracks |
| Evaluation metric | Pass@1 average over 4 runs (16 for Apex), human-graded proof scores out of 7, item-response-theory (IRT) expected performance |
| Domains | Algebra, number theory, combinatorics, geometry, analysis, research-level mathematics |
| Languages | English with LaTeX notation |
| Performance | |
| Human reference | USAMO 2025 human median: 35.7%; IMO 2025 bronze threshold: 19/42 (45.2%) |
| Top final-answer score | 91.25% (GPT-5 high on aggregated 130 problems, 2025 paper) |
| Top USAMO 2025 proof score | 24.4% (Gemini 2.5 Pro), all other models <5% |
| Top IMO 2025 score | 31% / 13 of 42 points (Gemini 2.5 Pro, no medal) |
| Saturated | Final-answer competitions trending toward saturation by 2026; proof tracks remain open |
| Resources | |
| Website | https://matharena.ai/ |
| Paper (main) | https://arxiv.org/abs/2505.23281 |
| Paper (USAMO) | https://arxiv.org/abs/2503.21934 |
| Paper (Open Proof Corpus) | https://arxiv.org/abs/2506.21621 |
| Code | https://github.com/eth-sri/matharena |
| Datasets | https://huggingface.co/MathArena |
| License | MIT |
MathArena is a public, continuously updated leaderboard and evaluation platform that measures the performance of large language models on mathematics competition problems released after each model's training cutoff. The project is maintained by the Secure, Reliable, and Intelligent Systems (SRI) Lab at ETH Zurich, in collaboration with the Institute for Computer Science, Artificial Intelligence and Technology (INSAIT) in Sofia, Bulgaria. By scraping problems within hours of their publication and running every candidate model immediately, MathArena turns each fresh contest into an uncontaminated test set, addressing the well-documented problem that benchmarks like AIME 2024 and earlier static datasets had leaked into widely shared training corpora. The platform was introduced in March 2025 with a USAMO 2025 evaluation, formalized in a NeurIPS Datasets and Benchmarks 2025 paper led by Mislav Balunović and Jasper Dekoninck, and has since expanded to cover proof grading, research-level problems pulled from arXiv preprints, and formalization in Lean.
The leaderboard is best known for showing that strong models that score above 85% on final-answer contests routinely collapse below 25% when forced to produce a complete proof graded by IMO-level judges. That gap, between knowing the answer and being able to justify it, is the central empirical contribution of the project and has reshaped how labs report olympiad results.
Before MathArena, the standard practice in evaluating mathematical reasoning was to report scores on fixed test sets that had been published months or years earlier. The MATH dataset, GSM8K, OlympiadBench, and even AIME 2024 all became part of the public web shortly after release, which meant that any frontier model trained on a recent Common Crawl snapshot had likely seen many of the problems and at least some of the official solutions. Researchers had long suspected that this exposure inflated reported scores. The MathArena team made the case quantitatively by comparing model accuracy on AIME 2024 (potentially contaminated) against AIME 2025 (released only after every evaluated model's training cutoff). Top systems scored 10 to 20 percentage points higher on the older contest than on the newer one even though the difficulty profiles were similar, and QwQ-32B-Preview showed an inflation of close to 60 percentage points. That measurement, more than any theoretical argument, made the contamination problem concrete for the broader community.
The second motivation was that no public benchmark systematically evaluated proof writing. Mathematical olympiads are graded on the rigor and correctness of natural-language proofs, not just on whether the final boxed number is right. A model that emits "the answer is 42" with a hand-wavy paragraph behind it can score full marks on AIME but zero on USAMO. The team behind MathArena argued that without grading proofs, the field had no reliable way to measure mathematical understanding beyond pattern-matched answer extraction.
The third motivation was timeliness. Competitions like AIME I, AIME II, HMMT February, USAMO, and IMO follow a public schedule. By instrumenting a pipeline that scrapes problem statements within hours, queries all participating model providers, parses the responses, and lines up expert graders, MathArena turns the global competition calendar into a rolling benchmark refresh that no single lab can pre-train against.
MathArena is a project of the SRI Lab at ETH Zurich, led by Professor Martin Vechev. The lab is part of the Department of Computer Science and is known for prior work on neural network verification (ERAN, AI2), program synthesis, and a wider research program on the reliability and safety of machine learning systems. Vechev received an ERC Consolidator Grant in 2021 and has co-founded several ETH spin-offs, including LatticeFlow and Invariant Labs, that commercialize trustworthy AI research.
The core MathArena team listed on the project paper and website includes:
| Researcher | Affiliation | Primary role |
|---|---|---|
| Jasper Dekoninck | SRI Lab, ETH Zurich | Lead author and primary maintainer; main contact for the leaderboard |
| Mislav Balunović | SRI Lab, ETH Zurich; LatticeFlow | First-listed paper author; methodology |
| Ivo Petrov | INSAIT, Sofia | USAMO grading lead; co-author on Proof or Bluff paper |
| Nikola Jovanović | SRI Lab, ETH Zurich | Evaluation infrastructure |
| Martin Vechev | SRI Lab, ETH Zurich | Senior author and group lead |
The USAMO 2025 report ("Proof or Bluff?") additionally credits Lyuben Baltadzhiev, Maria Drencheva, and Kristian Minchev as student graders. INSAIT, founded in 2022 in partnership with ETH Zurich, EPFL, and the Bulgarian government, supplies a sizable share of the olympiad-level judges given its base of former IMO participants in the Sofia mathematics community. Funding for the platform comes from ERC grants held by the SRI Lab and from INSAIT operating support.
MathArena groups its tracked contests into five families. The set has grown steadily since launch, and the platform's competitions page is the authoritative reference for what is live at any given time.
| Family | Representative contests | Format | Grading |
|---|---|---|---|
| Final answer (high school) | AIME I and II 2024 and 2025, AIME 2026, HMMT February 2025 and 2026, BRUMO 2025, SMT 2025, CMIMC 2025 | Integer or short symbolic answer | Automated SymPy equivalence; secondary LLM check |
| Proof based | USAMO 2025 and 2026, IMO 2025, Putnam, IMC (International Mathematics Competition), Miklós Schweitzer | Full natural-language proofs | Human or LLM jury graders, 7-point IMO-style rubric |
| Apex | MathArena Apex 2025 (12 problems) | Hardest final-answer problems curated from 100+ contests | Pass@1 over 16 runs |
| Research level | ArXivMath (monthly batches), BrokenArxiv (false-statement detection), Project Euler | Numerical or short answer with reasoning | Mixed automated and human |
| Formal and visual | ArXivLean (Lean formalization), Math Kangaroo grades 1 to 12 | Lean proofs or multiple choice | Lean compiler or automated check |
The original 2025 paper covered 162 problems across seven 2025 contests. By early 2026 the platform had grown to include several monthly arXiv-sourced sets, the AIME 2026 and USAMO 2026 evaluations, and the dedicated AlephProver evaluation for the Lean track. The team has been explicit that final-answer competitions are saturating and that the platform's future centers on proof writing, research-level problems, and formal verification.
The high school final-answer track historically formed the visual centerpiece of the leaderboard. AIME, the American Invitational Mathematics Examination, contributes two 15-problem papers each year (AIME I and AIME II) with integer answers from 0 to 999. The Harvard-MIT Mathematics Tournament (HMMT) runs in February and November; MathArena evaluates the February edition's individual rounds. The Brown University Mathematics Olympiad (BRUMO), Stanford Math Tournament (SMT), and Carnegie Mellon Informatics and Mathematics Competition (CMIMC) round out the family.
Proof grading is the platform's flagship contribution. The United States of America Mathematical Olympiad (USAMO) is a six-problem, two-day exam graded on a 7-point scale per problem for 42 total. The International Mathematical Olympiad (IMO) uses the same 42-point structure. The Putnam Competition awards up to 120 points across 12 problems. The IMC, held in Bulgaria, and the Miklós Schweitzer Memorial Competition in Hungary extend the platform into university-level problems.
Introduced in August 2025, MathArena Apex collects the very hardest final-answer problems from roughly 100 contests reviewed by the team. Twelve problems survived the filter: six were drawn directly from final-answer events such as SMT, EMCC, and CMM, and six were proof-based problems from IMO-style competitions and team selection tests that the team rewrote into final-answer form. Frontier models were run four times on each candidate problem, and a problem was admitted to Apex only if zero attempts succeeded across the chosen models. Final evaluation runs at 16 attempts per model per problem, and four of the 12 problems remain unsolved by any model in the public leaderboard.
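The admission rule described above can be sketched in a few lines. This is an illustrative sketch rather than the team's code: the function name `admit_to_apex`, the model names, and the data layout (a mapping from model name to per-attempt correctness flags) are all assumptions made for the example.

```python
from typing import Dict, List


def admit_to_apex(attempts: Dict[str, List[bool]]) -> bool:
    """Return True only if no reference model solved the problem in any attempt.

    `attempts` maps a (hypothetical) model name to per-attempt correctness flags.
    """
    return not any(any(runs) for runs in attempts.values())


# Hypothetical screening data: four attempts per reference model.
candidate_a = {"model_x": [False] * 4, "model_y": [False] * 4}
candidate_b = {"model_x": [False, True, False, False], "model_y": [False] * 4}

print(admit_to_apex(candidate_a))  # True: zero successes, so the problem is kept
print(admit_to_apex(candidate_b))  # False: a single success rejects the problem
```

A filter of this shape is deliberately strict: one lucky success among the screening attempts is enough to exclude a candidate, which is why so few problems survive.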
The ArXivMath, BrokenArxiv, and ArXivLean tracks were added in early 2026. ArXivMath uses problems drawn or adapted from recent arXiv preprints; BrokenArxiv tests whether models will refuse to prove deliberately false statements rather than producing plausible-looking but incorrect derivations; and ArXivLean evaluates Lean 4 formalization of theorem statements from research papers. The May 2026 release of AlephProver, an SRI Lab system, reportedly doubled the prior best ArXivLean score on initial evaluation.
MathArena's evaluation pipeline has four stages: problem ingestion, response generation, grading, and statistical reporting.
For each tracked competition the team monitors the official organizing body and well-known mirrors. As soon as the problems are posted, a maintainer (most often Dekoninck) scrapes them, transcribes them into LaTeX with one human pass for accuracy, and writes a problem JSON entry with fields for the statement, the reference answer (for final-answer problems), the maximum point value, a sample solution where available, and a grading scheme for proof problems. Problems are kept private until every candidate model has finished its scheduled runs, which the team treats as an embargo to avoid leaking data back into web indices that future models might scrape.
Every model is run four times per problem under standard evaluation conditions, increased to 16 runs for Apex problems where statistical significance matters more. Hyperparameters follow the provider's recommended defaults rather than being tuned by the MathArena team: this means OpenAI's reasoning effort level, Anthropic's extended thinking budget, and similar provider-specific knobs are set to the values the model card recommends for difficult reasoning tasks. The token budget defaults to 64,000 with selective overrides up to 128,000 for models that benefit from longer contexts, such as Grok 4. Costs are tracked in US dollars and published alongside accuracy so that readers can compare price-performance, not just raw accuracy.
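As a rough illustration of what averaging pass@1 over repeated runs means, the sketch below computes the per-problem success rate and then averages across problems. The function name `pass_at_1` and the sample outcomes are hypothetical; the source states only that each model is run four times per problem (16 for Apex).

```python
from statistics import mean
from typing import List


def pass_at_1(results: List[List[bool]]) -> float:
    """Average per-problem success rate; results[i][j] is run j on problem i."""
    return mean(mean(1.0 if ok else 0.0 for ok in runs) for runs in results)


# Hypothetical outcomes: three problems, four runs each.
runs = [
    [True, True, True, True],      # solved every time
    [True, False, True, False],    # solved half the time
    [False, False, False, False],  # never solved
]
print(pass_at_1(runs))  # 0.5
```

Averaging over several runs reduces the variance that a single sampled response would introduce, which matters most on short contests.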
Grading splits cleanly between the two main task types.
For final-answer problems a custom rule-based LaTeX parser extracts the answer from \boxed{} notation, normalizes it into a SymPy expression, and tests symbolic equivalence against the ground truth. When the parser is uncertain (for example when the model emits a fraction in a different form), an LLM judge provides a secondary check. The team publishes the parsing code in the GitHub repository so that any disagreement can be reproduced.
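The extract-normalize-compare flow can be sketched with the standard library alone. This is a simplified stand-in, not the project's parser: exact rational comparison with `fractions.Fraction` replaces the SymPy equivalence check described above, only plain numbers, `a/b`, and `\frac{a}{b}` are handled, and all function names are hypothetical.

```python
from fractions import Fraction
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...}, matching nested braces."""
    start = text.rfind(r"\boxed{")
    if start == -1:
        return None
    i = start + len(r"\boxed{")
    depth, out = 1, []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out)
        out.append(ch)
        i += 1
    return None  # unbalanced braces


def to_fraction(answer: str) -> Fraction:
    """Normalize a plain number, 'a/b', or \\frac{a}{b} into an exact rational."""
    s = answer.strip().replace(" ", "")
    if s.startswith(r"\frac{") and s.endswith("}"):
        num, den = s[len(r"\frac{"):-1].split("}{")
        return Fraction(int(num), int(den))
    return Fraction(s)


def is_correct(response: str, reference: str) -> bool:
    """Compare the boxed answer in a response against the reference answer."""
    boxed = extract_boxed(response)
    if boxed is None:
        return False  # no boxed answer means no credit
    try:
        return to_fraction(boxed) == to_fraction(reference)
    except (ValueError, ZeroDivisionError):
        return False  # unparseable answers would go to a secondary check


print(is_correct(r"So the answer is \boxed{\frac{1}{2}}.", "0.5"))  # True
print(is_correct(r"\boxed{41}", "42"))                              # False
```

Answers that fail to parse return `False` here; in the pipeline described above, that is the point at which an LLM judge would provide the secondary check.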
For proof-based problems the grading is done by humans, by an LLM jury, or by both depending on the contest and year. The IMO 2025 and USAMO 2025 evaluations relied on four expert human judges, each with IMO-level competition experience. Each submission was anonymized so that the judge did not know which model produced it, and two judges scored every proof independently. Scores that differed by more than two points were reconciled in discussion or sent to a third judge. The grading scheme followed the official IMO 7-point rubric: a partition into logically independent checkpoints, with at least 4 points reserved for the main idea or critical steps and at most 3 for routine work. Restating the problem, conjecturing without proof, and providing only a final answer earned zero credit, while logic gaps, invalid claims, and contradictions triggered deductions.
In the 2026 USAMO grading round the team experimented with a semi-automated LLM jury consisting of GPT-5.4, Gemini 3.1 Pro, and Claude Opus 4.6 acting as judges, with a final human pass over the jury outputs. The pipeline first uses a strong model to generate a detailed rubric from the official reference solution, then standardizes the format of every candidate proof, then collects three independent jury scores. When the jury's scores diverge by more than two points, judges reconcile and award the minimum. Human reviewers stepped in to adjust three solutions by at most two points each, which suggests that the jury approach matches expert grading closely on contemporary models.
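The jury's disagreement rule can be sketched as follows. The two-point threshold and the minimum-score fallback come from the description above; averaging the scores when the judges agree is an assumption, since the source does not say how agreeing scores are combined, and the reconciliation discussion itself is not modeled. The name `aggregate_jury` is hypothetical.

```python
from statistics import mean
from typing import Sequence

MAX_SPREAD = 2  # disagreement threshold from the text, in points


def aggregate_jury(scores: Sequence[int]) -> float:
    """Combine per-judge scores (0 to 7) into one score for a proof."""
    if not scores or any(not 0 <= s <= 7 for s in scores):
        raise ValueError("scores must be non-empty and lie in 0..7")
    if max(scores) - min(scores) > MAX_SPREAD:
        # Diverging jury: fall back to the most conservative score.
        return float(min(scores))
    # Assumption: agreeing judges are averaged; the source does not specify this.
    return float(mean(scores))


print(aggregate_jury([7, 3, 6]))  # 3.0 (spread of 4 exceeds the threshold)
print(aggregate_jury([5, 6, 7]))  # 6.0 (spread of 2 is within the threshold)
```

Taking the minimum on disagreement biases the jury toward under-crediting, which is the safer direction when the main risk is rewarding a plausible-looking but flawed proof.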
For each model and contest, MathArena publishes the mean of the four (or 16) run scores together with a 95% confidence interval derived from a paired permutation test against every other model. Cross-contest rankings are weighted inversely to the number of problems per contest to avoid overweighting the longest events. As of the 2026 paper update the platform also exposes an item-response-theory (IRT) parameterization that fits each problem with a difficulty parameter and each model with an ability parameter, producing an "expected performance" score that smooths over event-specific noise.
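Two of the statistical ideas above can be illustrated with a short sketch. The source does not specify the test statistic or the IRT variant, so both choices here are assumptions: the first function is an exact sign-flip paired permutation test on per-problem score differences (practical only for small problem counts, since it enumerates all 2^n sign patterns), and the second computes expected performance under a one-parameter logistic (Rasch) model. Function names are hypothetical.

```python
import math
from itertools import product
from typing import Sequence


def paired_sign_flip_pvalue(a: Sequence[float], b: Sequence[float]) -> float:
    """Exact two-sided paired permutation test on per-problem score differences."""
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    hits = total = 0
    for signs in product((1, -1), repeat=len(diffs)):  # all 2**n sign patterns
        total += 1
        if abs(sum(s * d for s, d in zip(signs, diffs))) >= observed - 1e-12:
            hits += 1
    return hits / total


def expected_performance(ability: float, difficulties: Sequence[float]) -> float:
    """Mean success probability under a one-parameter logistic (Rasch) model."""
    probs = [1.0 / (1.0 + math.exp(-(ability - d))) for d in difficulties]
    return sum(probs) / len(probs)


# Hypothetical per-problem accuracies for two models on ten problems.
print(paired_sign_flip_pvalue([1] * 10, [0] * 10))  # 0.001953125
print(expected_performance(0.0, [0.0, 0.0]))        # 0.5
```

With only six problems, as in a single olympiad, the smallest achievable two-sided p-value under this test is 2/64, which is one way to see why short proof contests yield wide uncertainty.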
The single most cited MathArena result is the USAMO 2025 evaluation published in the "Proof or Bluff?" report in March 2025. The six 2025 problems were graded out of 42 total points, and only Gemini 2.5 Pro produced anything resembling a competitive performance.
| Rank | Model | Provider | USAMO 2025 score | Percentage |
|---|---|---|---|---|
| 1 | Gemini 2.5 Pro | Google | 10.1 / 42 | 24.4% |
| 2 | OpenAI o3 | OpenAI | 9.2 / 42 | 21.9% |
| 3 | o4-mini high | OpenAI | 8.1 / 42 | 19.3% |
| 4 | GPT-5 | OpenAI | 7.5 / 42 | 17.9% |
| 5 | Claude 4 Opus | Anthropic | 6.8 / 42 | 16.2% |
| - | DeepSeek-R1 | DeepSeek | < 2 / 42 | < 5% |
| - | Grok 3 mini | xAI | < 2 / 42 | < 5% |
| - | QwQ-32B-Preview | Alibaba | < 2 / 42 | < 5% |
The human median for USAMO 2025 participants, themselves a highly selected group of high school mathematics olympians, was 35.7%, comfortably above every model evaluated. The team's qualitative analysis identified four recurring failure modes in the model outputs: flawed logical inferences buried inside long reasoning traces, unjustified assumptions used to short-circuit case analysis, lack of creative problem-solving (models repeatedly tried the same approach despite getting stuck), and confident self-assessment in which models claimed to have completed proofs that human judges scored at zero. Grok variants in particular tended to produce extremely short outputs, sometimes consisting only of a final number with no derivation. Gemini 2.5 Pro, while the strongest model, frequently cited theorems that do not exist in the literature when it could not find a valid proof.
A few months later the same protocol was applied to the IMO 2025 problems within hours of the contest in Sunshine Coast, Australia. At IMO 2025, bronze medals went to contestants scoring at least 19 of 42, silver to those scoring at least 28, and gold to those scoring at least 35. No publicly released model came close.
| Rank | Model | Provider | IMO 2025 score | Percentage | Medal? |
|---|---|---|---|---|---|
| 1 | Gemini 2.5 Pro | Google | 13 / 42 | 31.0% | No |
| 2 | Grok 4 (updated prompt, 128k tokens) | xAI | 9 / 42 | 21.4% | No |
| 3 | OpenAI o3 | OpenAI | 8 / 42 | 19.0% | No |
| 4 | GPT-5 | OpenAI | 7 / 42 | 16.7% | No |
| 5 | Claude 4 Sonnet | Anthropic | 6 / 42 | 14.3% | No |
The team noted that 13 points represents real partial credit on multiple problems, which is more than most models would have earned the prior year, but also that even a single problem fully solved on IMO is worth 7 points and none of the public models managed it. The headline result that no off-the-shelf model achieved bronze contrasted sharply with the simultaneous announcement that Google DeepMind's experimental Deep Think system, run in a special evaluation conducted by IMO graders themselves rather than by MathArena, achieved a gold-medal-level score of 35 of 42. ByteDance's Seed-Prover, a Lean-based system, was certified at silver level around the same time. The gap between public model performance on MathArena and the internal results from lab-only systems became one of the most discussed findings in the 2025 AI mathematics literature.
On the AIME, HMMT, BRUMO, and SMT events combined (130 problems in the 2025 paper), top models scored well above the top 1% human percentile.
| Model | Aggregate accuracy on 2025 final-answer contests | Top 1% human reference |
|---|---|---|
| GPT-5 (high) | 91.25% | 84.35% (AIME) |
| Grok 4 Fast | 90.57% | 66.79% (HMMT) |
| Grok 4 | 90.36% | |
| o4-mini (high) | 86% | |
| Gemini 2.5 Pro | 86% | |
| Claude 4 Opus | 79% | |
| DeepSeek-R1 | 79.8% (AIME 2024 reported by DeepSeek) | |
The team observed that aggregate accuracy on final-answer contests has trended steadily upward through 2025 and into 2026, to the point where the May 2026 saturation blog post highlighted GPT-5.5 hitting 98% on the AIME-style portion of the 2026 USAMO and 95% on the USAMO 2026 proof problems. The MathArena Apex track was created in part to push beyond that ceiling.
The August 2025 launch of Apex revealed a sharp performance drop on problems specifically curated to be hard for frontier systems.
| Rank | Model | Apex 2025 accuracy | Cost per evaluation (USD) |
|---|---|---|---|
| 1 | Qwen3-A22B-2507-Think | 5.21% | $9.89 |
| 2 | Grok 4 | 2.08% | $99.39 |
| 3 | GPT-5 high (agent scaffolding) | 2.08% | $183.79 |
| 4 | GPT-5-mini high | 1.04% | $13.42 |
| 5 | GLM 4.5 | 1.04% | $14.50 |
The team explicitly cautioned readers not to interpret these rankings as definitive overall capability rankings: by construction the problems were selected because the reference frontier models failed on every screening attempt, so any model that solves even one extra problem moves several places. Of the 12 Apex problems, problems 9 through 12 remained unsolved across all evaluated models even at pass@k with large k. Problem 1, by contrast, was solved within a small number of attempts by at least one model under each evaluation perspective. The most common wrong answer often appeared in over 50% of attempts, which the team interprets as evidence that models converge on a single confidently incorrect reasoning path rather than exploring alternatives.
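A quick calculation shows how coarse the Apex scale is. With 12 problems and 16 attempts each, a model makes 192 attempts in total, so each successful attempt is worth about 0.52 percentage points. The success counts below are inferred from the reported percentages, not published figures.

```python
PROBLEMS = 12  # Apex problem count, from the text
RUNS = 16      # attempts per model per problem, from the text
total_attempts = PROBLEMS * RUNS  # 192


def accuracy_pct(successes: int) -> float:
    """Accuracy in percent for a given number of successful attempts."""
    return round(100 * successes / total_attempts, 2)


for successes in (2, 4, 10):
    print(successes, accuracy_pct(successes))
# 2 1.04
# 4 2.08
# 10 5.21

# One additional problem solved on all 16 attempts:
print(accuracy_pct(RUNS))  # 8.33
```

The reported accuracies of 1.04%, 2.08%, and 5.21% are consistent with 2, 4, and 10 successful attempts out of 192. A single extra problem solved on every attempt would add 8.33 points, more than the entire spread of the table, which is why small differences in rank carry little information.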
Across the broader leaderboard, models show consistent patterns by mathematical subfield. The 2025 paper reports the following best-in-class accuracies for GPT-5 (high):
| Domain | Best accuracy | Notes |
|---|---|---|
| Algebra | ~100% | Foundational manipulation; strongest area for every model |
| Number theory | ~94% | Standard divisibility, modular, and Diophantine techniques |
| Combinatorics | ~91% | Creativity-intensive; performance drops on harder problems |
| Geometry | ~81% | Weakest area; models rely on coordinate methods and struggle with synthetic arguments |
Geometry is the consistent weak spot, attributed in qualitative grading to a reliance on coordinate bash methods rather than synthetic insight, plus a general inability to use diagrams effectively from text-only inputs.
The contamination findings reported in the 2025 paper are some of the most rigorous public measurements of how much benchmark inflation existed in widely cited prior results.
On AIME 2024 most evaluated models scored 10 to 20 percentage points higher than their performance on AIME 2025, despite the two contests having comparable human difficulty distributions. The largest gap belonged to QwQ-32B-Preview, which performed nearly 60 percentage points above its AIME 2025 baseline, a difference too large to attribute to year-to-year problem variation. The paper concludes that AIME 2024 should no longer be treated as a clean benchmark for any model trained on data up to 2024 or later.
A control comparison on HMMT 2024 versus 2025 found much smaller performance differences, which the team attributes to HMMT's lower online prominence and smaller volume of associated student writeups. Even there, the team identified eight AIME 2025 problems and one HMMT 2025 problem that appeared in similar forms online prior to model evaluation, although these were mostly easier problems and did not materially shift the leaderboard.
To prevent future contamination of its own data, MathArena uses an embargo period: evaluations begin within hours of public problem release, but solutions and model outputs are not made fully public until enough time has passed that any prior model has clearly been evaluated. The team also monitors known contamination indicators, such as test-set string matches in publicly indexed corpora, and applies anomalous-performance detection on outlier scores.
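One simple form of string-match contamination check is word n-gram overlap between a problem statement and a candidate web document. The sketch below is an assumption about what such an indicator could look like, not a description of the team's tooling; the function names, the choice of 8-grams, and the sample texts are all hypothetical.

```python
import re
from typing import Set, Tuple


def ngrams(text: str, n: int = 8) -> Set[Tuple[str, ...]]:
    """Lowercased word n-grams of a text, ignoring punctuation."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_ratio(problem: str, document: str, n: int = 8) -> float:
    """Fraction of the problem's n-grams that also occur in the document."""
    grams = ngrams(problem, n)
    if not grams:
        return 0.0  # problem shorter than n words: nothing to match
    return len(grams & ngrams(document, n)) / len(grams)


# Hypothetical problem text and documents.
problem = ("Find the number of positive integers n less than 1000 "
           "such that n is divisible by 7")
leaked = "Forum post: " + problem + " Any hints?"
unrelated = "A recipe for sourdough bread with a long overnight fermentation step"

print(overlap_ratio(problem, leaked))     # 1.0
print(overlap_ratio(problem, unrelated))  # 0.0
```

Exact n-gram matching catches verbatim copies but misses paraphrased or translated problems, so a check like this can only serve as one indicator among several.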
MathArena occupies a distinct position in the mathematics benchmarking landscape. The table below compares its design choices against the most cited alternatives.
| Benchmark | Source | Contamination strategy | Proof grading | Status |
|---|---|---|---|---|
| MathArena | Live olympiads and contests | Real-time evaluation post-release | Yes, human and LLM jury | Active, expanding |
| FrontierMath (Epoch AI, 2024) | Original problems from professional mathematicians | New unpublished problems, automated verification | No (final answer only) | Active, mostly private |
| PutnamBench | Putnam problems with Lean formalization | Static, formal verification mitigates leakage | Yes (Lean proofs) | Active |
| OlympiadBench (2024) | 8,476 olympiad and physics problems | None (static historical data) | Partial | Static |
| OmniMATH | 4,428 competition problems | None | No | Static |
| GSM8K | Grade-school word problems | None | No | Static, widely contaminated |
| MATH (Hendrycks et al.) | Competition problems from web | None | No | Static, contamination identified |
| Minerva eval suite | Mixed competition and STEM | None | No | Static |
FrontierMath, run by Epoch AI, is MathArena's closest peer in spirit: both projects target the contamination problem head-on, but they take opposite approaches. FrontierMath commissions original problems from professional mathematicians and keeps them private, evaluating models in a controlled environment. MathArena relies on the public competition calendar and the embargo period to keep its test set fresh. The two are complementary; many AI labs run both. PutnamBench is the proof-writing analog in the formal verification space, requiring models to produce Lean proofs that compile against the Putnam corpus, and shares the SRI Lab's interest in formalization through the ArXivLean track.
The team has been explicit that MathArena does not aim to replace synthetic original-problem benchmarks, which provide much larger samples, but instead serves as a high-signal, low-volume measurement of how much performance is real reasoning versus pattern recall.
MathArena is the most public face of a broader SRI Lab research program on mathematical reasoning. Several connected projects extend the platform's findings.
| Project | Year | Focus | MathArena connection |
|---|---|---|---|
| Proof or Bluff? | March 2025 | USAMO 2025 evaluation report | Pilot paper that led directly to MathArena |
| MathArena main paper | May 2025 (NeurIPS D&B 2025) | Platform description and 2025 results | Core publication |
| Open Proof Corpus | June 2025 | 5,000+ human-annotated LLM proofs | Uses MathArena problems as one of its five splits |
| MathConstruct | 2025 | Constructive proof reasoning benchmark | Companion proof-writing project |
| BrokenMath | 2025 | Sycophancy detection in theorem proving | Tests model refusal of false statements |
| QED-Nano | 2025 | Distilling proof ability into smaller models | Trained on data partly derived from MathArena |
| IMProofBench | 2025 | Research-level proof generation | Higher difficulty tier than MathArena |
| AlephProver | May 2026 | Lean formalization for arXiv statements | Headlines the ArXivLean leaderboard |
The Open Proof Corpus, jointly released with INSAIT, is the largest public collection of expert-annotated AI-generated mathematical proofs. Constructed over four weeks by 13 expert judges, it includes more than 5,000 proofs generated by o4-mini, o3, Gemini 2.5 Pro, Grok 3 mini, Qwen3-235B-A22B, and DeepSeek-R1, with one of the five splits drawn directly from MathArena problems so that final-answer correctness and proof validity can be compared for the same questions. The corpus is intended as training data for future proof-grading systems and as a reference for studying how LLM-generated proofs differ from human-written ones.
The code base lives at https://github.com/eth-sri/matharena under an MIT license. The project uses Python 3.12 with the UV package manager for dependency management and supports four backends for inference: direct API calls to OpenAI, Anthropic, Google, DeepSeek, xAI, and OpenRouter; a unified OpenRouter path for less common open-weight models; local serving through vLLM; and provider-specific integration for OpenAI's reasoning effort levels and Anthropic's extended thinking. Configurable parameters include temperature, sampling top-p, token limits, retry logic, and per-run cost tracking. The repository has been actively developed throughout 2025 and 2026 with around 50 commits on the main branch as of mid-2026.
Problem and solution data lives on Hugging Face under the MathArena organization at huggingface.co/MathArena. Each contest is its own dataset: notable splits include apex_2025, apex_2025_outputs, apex-shortlist, usamo_2025, aime_2026, and final_answer_comps. The README in the GitHub repository documents the required schema for new contributions, which uses a problem_idx field as a stable identifier, a problem field with the LaTeX statement, an answer field with the ground truth for final-answer problems, and optional fields for the maximum points, sample solution, grading scheme, and difficulty rating.
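The record layout described above might be modeled as follows. Only `problem_idx`, `problem`, and `answer` are field names taken from the text; the names chosen for the optional fields, the field types, and the validation rules are hypothetical, and the repository README remains the authoritative schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProblemRecord:
    problem_idx: int              # stable identifier (field named in the text)
    problem: str                  # LaTeX statement (field named in the text)
    answer: Optional[str] = None  # ground truth; None for proof problems
    # The names below are hypothetical stand-ins for the optional fields.
    points: Optional[int] = None
    sample_solution: Optional[str] = None
    grading_scheme: Optional[str] = None
    difficulty: Optional[float] = None


def validate(record: ProblemRecord) -> None:
    """Raise ValueError for records that are obviously malformed (illustrative rules)."""
    if not record.problem.strip():
        raise ValueError("problem statement must not be empty")
    if record.points is not None and record.points <= 0:
        raise ValueError("points must be positive when given")


# Hypothetical final-answer record.
rec = ProblemRecord(problem_idx=1, problem=r"Compute $1 + 1$.", answer="2")
validate(rec)  # passes silently
```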
A separate README in the repository describes the judging workflow for human graders, including how to anonymize outputs, how to record judge scores, how to flag inter-judge disagreement, and how to handle appeals. The grading infrastructure was significantly expanded in the second half of 2025 to support the LLM-jury pipeline used in USAMO 2026.
MathArena has had outsized influence relative to its problem volume. Major labs cite it in model release announcements, often alongside FrontierMath, as evidence that their models are not simply memorizing competition problems. The contamination finding for AIME 2024 prompted a quiet shift in industry practice away from reporting AIME 2024 numbers alone and toward also reporting AIME 2025 or AIME 2026 results.
In the academic community, the proof versus final-answer gap reported in "Proof or Bluff?" became a reference point for discussions of mathematical reasoning capability. The finding that all models, including OpenAI o3, confidently claimed to have solved problems they had not actually solved triggered a wave of follow-up work on calibration, sycophancy, and meta-cognitive evaluation in LLMs. The MathArena team's BrokenMath project extends this line by directly testing whether models will refuse to prove statements that are mathematically false.
The platform has also influenced how olympiad organizations think about AI participation. Both the IMO and USAMO official bodies have engaged with the MathArena team about test protocols, and several private evaluations conducted by lab-internal teams (Google DeepMind's Deep Think IMO 2025 result, ByteDance's Seed-Prover) were structured to be directly comparable to MathArena numbers.
For educators and competition organizers, MathArena provides a public record of how well models can serve as practice partners or check-graders for student work. The Open Proof Corpus, with its more than 5,000 annotated proofs, has begun to be used by AI tutoring startups to fine-tune grading assistants and to identify the most common LLM errors that human teachers should expect to see when students rely on chatbots for olympiad preparation.
The MathArena team is candid about the platform's constraints, and the 2025 paper devotes a section to them.
First, the problem volume is small. Even with seven contests in 2025, the total of 162 problems leaves wide confidence intervals when comparing models that differ by only a few points. The team mitigates this with cross-contest aggregation and IRT-based smoothing, but the noise is real, especially for the proof-based events with only six problems each.
Second, proof grading is expensive. A panel of four expert judges, two independent scores per proof, and reconciliation discussions cost both money and the time of mathematicians whose availability does not scale. The 2026 transition to a semi-automated LLM jury was driven in part by this constraint. While the LLM jury matches expert grading on contemporary frontier models, the team explicitly does not claim it will continue to work as models become stronger or weirder.
Third, the platform is English-only. Most major mathematical olympiads publish problems in multiple languages, but MathArena evaluates the English versions, which may not be the version a model trained primarily on, say, Mandarin or Russian would handle best.
Fourth, the competition-mathematics focus is itself a narrow slice of mathematical reasoning. Real mathematical research involves much longer time horizons, much larger context, and creative problem formulation, none of which a contest setting captures. The ArXivMath and ArXivLean tracks are explicit attempts to extend the platform in this direction, but they are early and still small.
Fifth, final-answer contests are saturating. The May 2026 blog post acknowledges that GPT-5.5 reaching 98% on the 2026 USAMO final-answer subset and 95% on the USAMO 2026 proof problems essentially closes those tracks as useful discrimination tools for frontier models. The platform's future as a high-signal benchmark depends on the proof tracks and the research-level extensions.
The roadmap outlined in the team's 2026 paper update (a follow-up to the 2025 paper, listed on arXiv as 2605.00674) focuses on five priorities. First, expanding the proof-writing evaluation to include longer and harder competitions, with international olympiads from outside the United States and university-level events such as the Putnam taking on greater weight. Second, integrating formal proof verification through Lean, with ArXivLean and AlephProver as the lead vehicles. Third, scaling up the LLM jury so that human grading is required only for adjudication, reducing the cost of evaluation. Fourth, broadening into multilingual problem sets, beginning with Chinese, Russian, and Bulgarian. Fifth, expanding the research-level mathematics tracks to provide a successor benchmark for the saturated final-answer track.
The long-term goal stated by the team is to provide a continuously refreshed measurement of mathematical reasoning that is robust to the most common failure modes of static benchmarks: training data contamination, narrow problem distributions, and inability to distinguish answer recall from proof construction. As of mid-2026 MathArena is one of the few benchmarks where the gap between human experts and frontier models is visibly closing in some tracks while remaining stubbornly open in others, which is exactly the dynamic range that benchmarks of this kind are meant to provide.