MathArena

Overview
Full name	MathArena: Evaluating LLMs on Uncontaminated Math Competitions
Description	A continuously updated public leaderboard that evaluates large language models on freshly released mathematics competitions including AIME, USAMO, IMO, HMMT, BRUMO, SMT, and Putnam, with grading for both final answers and full natural-language proofs
Initial release	March 2025 (USAMO 2025 evaluation), May 2025 (paper)
Latest expansion	MathArena Apex (August 2025), ArXivMath and ArXivLean (early 2026)
Authors	Mislav Balunović, Jasper Dekoninck, Ivo Petrov, Nikola Jovanović, Martin Vechev
Organization	SRI Lab, ETH Zurich; INSAIT (Sofia)
Venue	NeurIPS Datasets and Benchmarks 2025
Technical Details
Type	Mathematical reasoning, proof writing, research-level mathematics, formal verification
Modality	Text, LaTeX, visual (Kangaroo), Lean (ArXivLean)
Task format	Final-answer (numerical or symbolic), natural-language proofs, code-based solutions, formal proofs
Number of problems	162 in the original 2025 paper; over 400 problems by 2026 across all tracks
Evaluation metric	Pass@1 average over 4 runs (16 for Apex), human-graded proof scores out of 7, item-response-theory (IRT) expected performance
Domains	Algebra, number theory, combinatorics, geometry, analysis, research-level mathematics
Languages	English with LaTeX notation
Performance
Human reference	USAMO 2025 human median: 35.7%; IMO 2025 bronze threshold: 19/42 (45.2%)
Top final-answer score	91.25% (GPT-5 high on aggregated 130 problems, 2025 paper)
Top USAMO 2025 proof score	24.4% (Gemini 2.5 Pro), all other models <5%
Top IMO 2025 score	31% / 13 of 42 points (Gemini 2.5 Pro, no medal)
Saturated	Final-answer competitions trending toward saturation by 2026; proof tracks remain open
Resources
Website	https://matharena.ai/
Paper (main)	https://arxiv.org/abs/2505.23281
Paper (USAMO)	https://arxiv.org/abs/2503.21934
Paper (Open Proof Corpus)	https://arxiv.org/abs/2506.21621
Code	https://github.com/eth-sri/matharena
Datasets	https://huggingface.co/MathArena
License	MIT

MathArena is a public, continuously updated leaderboard and evaluation platform that measures the performance of large language models on mathematics competition problems released after each model's training cutoff. The project is maintained by the Secure, Reliable, and Intelligent Systems (SRI) Lab at ETH Zurich, in collaboration with the Institute for Computer Science, Artificial Intelligence and Technology (INSAIT) in Sofia, Bulgaria. By scraping problems within hours of their publication and running every candidate model immediately, MathArena turns each fresh contest into an uncontaminated test set, addressing the well-documented problem that benchmarks such as AIME 2024 and earlier static datasets had leaked into widely shared training corpora. The platform was introduced in March 2025 with a USAMO 2025 evaluation, formalized in a NeurIPS Datasets and Benchmarks 2025 paper led by Mislav Balunović and Jasper Dekoninck, and has since expanded to cover proof grading, research-level problems pulled from arXiv preprints, and formalization in Lean.

The leaderboard is best known for showing that strong models that score above 85% on final-answer contests routinely collapse below 25% when forced to produce a complete proof graded by IMO-level judges. That gap, between knowing the answer and being able to justify it, is the central empirical contribution of the project and has reshaped how labs report olympiad results.

Background and motivation

Before MathArena, the standard practice in evaluating mathematical reasoning was to report scores on fixed test sets that had been published months or years earlier. The MATH dataset, GSM8K, OlympiadBench, and even AIME 2024 all became part of the public web shortly after release, which meant that any frontier model trained on a recent Common Crawl snapshot had likely seen many of the problems and at least some of the official solutions. Researchers had long suspected that this exposure inflated reported scores. The MathArena team made the case quantitatively by comparing model accuracy on AIME 2024 (potentially contaminated) against AIME 2025 (released only after every evaluated model's training cutoff). Top systems scored 10 to 20 percentage points higher on the older contest than on the newer one despite similar difficulty profiles, and QwQ-32B-Preview showed an inflation of close to 60 percentage points. That measurement, more than any theoretical argument, made the contamination problem concrete for the broader community.

The second motivation was that no public benchmark systematically evaluated proof writing. Mathematical olympiads are graded on the rigor and correctness of natural-language proofs, not just on whether the final boxed number is right. A model that emits "the answer is 42" with a hand-wavy paragraph behind it can score full marks on AIME but zero on USAMO. The team behind MathArena argued that without grading proofs, the field had no reliable way to measure mathematical understanding beyond pattern-matched answer extraction.

The third motivation was timeliness. Competitions like AIME I, AIME II, HMMT February, USAMO, and IMO follow a public schedule. By instrumenting a pipeline that scrapes problem statements within hours, queries all participating model providers, parses the responses, and lines up expert graders, MathArena turns the global competition calendar into a rolling benchmark refresh that no single lab can pre-train against.

Creators and host institution

MathArena is a project of the SRI Lab at ETH Zurich, led by Professor Martin Vechev. The lab is part of the Department of Computer Science and is known for prior work on neural network verification (ERAN, AI2), program synthesis, and a wider research program on the reliability and safety of machine learning systems. Vechev received an ERC Consolidator Grant in 2021 and has co-founded several ETH spin-offs, including LatticeFlow and InvariantLabs, that commercialize trustworthy AI research.

The core MathArena team listed on the project paper and website includes:

Researcher	Affiliation	Primary role
Jasper Dekoninck	SRI Lab, ETH Zurich	Lead author and primary maintainer; main contact for the leaderboard
Mislav Balunović	SRI Lab, ETH Zurich; LatticeFlow	First-listed paper author; methodology
Ivo Petrov	INSAIT, Sofia	USAMO grading lead; co-author on Proof or Bluff paper
Nikola Jovanović	SRI Lab, ETH Zurich	Evaluation infrastructure
Martin Vechev	SRI Lab, ETH Zurich	Senior author and group lead

The USAMO 2025 report ("Proof or Bluff?") additionally credits Lyuben Baltadzhiev, Maria Drencheva, and Kristian Minchev as student graders. INSAIT, founded in 2022 in partnership with ETH Zurich, EPFL, and the Bulgarian government, supplies a sizable share of the olympiad-level judges, drawing on the former IMO participants in the Sofia mathematics community. Funding for the platform comes from ERC grants held by the SRI Lab and from INSAIT operating support.

Competitions tracked

MathArena groups its tracked contests into five families. The set has grown steadily since launch, and the platform's competitions page is the authoritative reference for what is live at any given time.

Family	Representative contests	Format	Grading
Final answer (high school)	AIME I and II 2024 and 2025, AIME 2026, HMMT February 2025 and 2026, BRUMO 2025, SMT 2025, CMIMC 2025	Integer or short symbolic answer	Automated SymPy equivalence; secondary LLM check
Proof based	USAMO 2025 and 2026, IMO 2025, Putnam, IMC (International Mathematics Competition), Miklós Schweitzer	Full natural-language proofs	Human or LLM jury graders, 7-point IMO-style rubric
Apex	MathArena Apex 2025 (12 problems)	Hardest final-answer problems curated from 100+ contests	Pass@1 over 16 runs
Research level	ArXivMath (monthly batches), BrokenArxiv (false-statement detection), Project Euler	Numerical or short answer with reasoning	Mixed automated and human
Formal and visual	ArXivLean (Lean formalization), Math Kangaroo grades 1 to 12	Lean proofs or multiple choice	Lean compiler or automated check

The original 2025 paper covered 162 problems across seven 2025 contests. By early 2026 the platform had grown to include several monthly arXiv-sourced sets, the AIME 2026 and USAMO 2026 evaluations, and the dedicated AlephProver evaluation for the Lean track. The team has been explicit that final-answer competitions are saturating and that the platform's future centers on proof writing, research-level problems, and formal verification.

Final-answer contests

The high school final-answer track historically formed the visual centerpiece of the leaderboard. AIME, the American Invitational Mathematics Examination, contributes two 15-problem papers each year (AIME I and AIME II) with integer answers from 0 to 999. The Harvard-MIT Mathematics Tournament (HMMT) runs in February and November; MathArena evaluates the February edition's individual rounds. The Brown University Mathematics Olympiad (BRUMO), Stanford Math Tournament (SMT), and Carnegie Mellon Informatics and Mathematics Competition (CMIMC) round out the family.

Proof-based contests

Proof grading is the platform's flagship contribution. The United States of America Mathematical Olympiad (USAMO) is a six-problem, two-day exam graded on a 7-point scale per problem for 42 total. The International Mathematical Olympiad (IMO) uses the same 42-point structure. The Putnam Competition awards up to 120 points across 12 problems. The IMC, held in Bulgaria, and the Miklós Schweitzer Memorial Competition in Hungary extend the platform into university-level problems.

Apex

Introduced in August 2025, MathArena Apex collects the very hardest final-answer problems from roughly 100 contests reviewed by the team. Twelve problems survived the filter: six were drawn directly from final-answer events such as SMT, EMCC, and CMM, and six were proof-based problems from IMO-style competitions and team selection tests that the team rewrote into final-answer form. Frontier models were run four times on each candidate problem, and a problem was admitted to Apex only if zero attempts succeeded across the chosen models. Final evaluation runs at 16 attempts per model per problem, and four of the 12 problems remain unsolved by any model in the public leaderboard.
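The admission rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the team's code: the data layout, the function name `admit_to_apex`, and the model and problem identifiers are all hypothetical.

```python
from typing import Dict, List

# attempts[problem_id][model_name] -> one boolean per screening attempt
Attempts = Dict[str, Dict[str, List[bool]]]


def admit_to_apex(attempts: Attempts) -> List[str]:
    """Return the problems that no screening model solved in any attempt."""
    admitted = []
    for problem_id, per_model in attempts.items():
        solved_anywhere = any(any(runs) for runs in per_model.values())
        if not solved_anywhere:
            admitted.append(problem_id)
    return admitted


# Hypothetical screening results: two models, four attempts each.
screening = {
    "candidate-a": {"model-x": [False] * 4, "model-y": [False] * 4},
    "candidate-b": {"model-x": [False, True, False, False], "model-y": [False] * 4},
}
admitted = admit_to_apex(screening)
print(admitted)  # ['candidate-a']
```

A single success by any model is enough to exclude a candidate, which is why the resulting set is hard by construction.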

Research level and formal

The ArXivMath, BrokenArxiv, and ArXivLean tracks were added in early 2026. ArXivMath uses problems drawn or adapted from recent arXiv preprints; BrokenArxiv tests whether models will refuse to prove deliberately false statements rather than producing plausible-looking but incorrect derivations; and ArXivLean evaluates Lean 4 formalization of theorem statements from research papers. The May 2026 release of AlephProver, an SRI Lab system, reportedly doubled the prior best ArXivLean score on initial evaluation.

Methodology

MathArena's evaluation pipeline has four stages: problem ingestion, response generation, grading, and statistical reporting.

Problem ingestion

For each tracked competition the team monitors the official organizing body and well-known mirrors. As soon as the problems are posted, a maintainer (most often Dekoninck) scrapes them, transcribes them into LaTeX with one human pass for accuracy, and writes a problem JSON entry with fields for the statement, the reference answer (for final-answer problems), the maximum point value, a sample solution where available, and a grading scheme for proof problems. Problems are kept private until every candidate model has finished its scheduled runs, which the team treats as an embargo to avoid leaking data back into web indices that future models might scrape.

Response generation

Every model is run four times per problem under standard evaluation conditions, increased to 16 runs for Apex problems where statistical significance matters more. Hyperparameters follow the provider's recommended defaults rather than being tuned by the MathArena team: this means OpenAI's reasoning effort level, Anthropic's extended thinking budget, and similar provider-specific knobs are set to the values the model card recommends for difficult reasoning tasks. The token budget defaults to 64,000 with selective overrides up to 128,000 for models that benefit from longer contexts, such as Grok 4. Costs are tracked in US dollars and published alongside accuracy so that readers can compare price-performance, not just raw accuracy.
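As a rough illustration of the reported metric, the sketch below averages per-run accuracy over independent runs. It assumes results are stored as one list of booleans per run; the function name `pass_at_1` and the sample data are hypothetical and not taken from the MathArena code base.

```python
from statistics import mean
from typing import List


def pass_at_1(run_results: List[List[bool]]) -> float:
    """Average accuracy over independent runs.

    run_results[r][p] is True when run r solved problem p.
    """
    per_run_accuracy = [
        mean(1.0 if solved else 0.0 for solved in run) for run in run_results
    ]
    return mean(per_run_accuracy)


# Hypothetical outcome of four runs on three problems.
runs = [
    [True, True, False],
    [True, False, False],
    [True, True, False],
    [True, True, True],
]
score = pass_at_1(runs)
print(f"{score:.4f}")
```

Averaging over runs rather than taking the best run keeps the number comparable to a single-shot score while reducing sampling noise.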

Grading

Grading splits cleanly between the two main task types.

For final-answer problems a custom rule-based LaTeX parser extracts the answer from \boxed{} notation, normalizes it into a SymPy expression, and tests symbolic equivalence against the ground truth. When the parser is uncertain (for example when the model emits a fraction in a different form), an LLM judge provides a secondary check. The team publishes the parsing code in the GitHub repository so that any disagreement can be reproduced.
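The extract-then-compare flow can be sketched as follows. This is a simplified stand-in and not the project's parser: to stay dependency-free it compares exact rationals with Python's `fractions` module instead of SymPy expressions, and it only understands plain numbers and simple `\frac{a}{b}` forms. The function names are hypothetical.

```python
import re
from fractions import Fraction
from typing import Optional

BOXED = r"\boxed{"
# Matches \frac{a}{b} and \dfrac{a}{b} with integer parts only.
FRAC = re.compile(r"\\d?frac\{(-?\d+)\}\{(\d+)\}")


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last boxed expression, honoring nested braces."""
    start = text.rfind(BOXED)
    if start == -1:
        return None
    depth, out = 1, []
    for ch in text[start + len(BOXED):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out)
        out.append(ch)
    return None  # unbalanced braces


def normalize(answer: str) -> Fraction:
    """Turn a simple numeric LaTeX answer into an exact rational."""
    cleaned = FRAC.sub(r"\1/\2", answer.strip().replace(" ", ""))
    return Fraction(cleaned)


def is_equivalent(model_output: str, reference: str) -> bool:
    boxed = extract_boxed(model_output)
    if boxed is None:
        return False
    try:
        return normalize(boxed) == normalize(reference)
    except (ValueError, ZeroDivisionError):
        # A fuller pipeline would hand unparseable answers to a secondary check.
        return False


print(is_equivalent(r"So the answer is \boxed{\frac{3}{4}}.", "0.75"))  # True
```

Comparing normalized values rather than strings is what lets `\frac{3}{4}` and `0.75` count as the same answer.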

For proof-based problems the grading is done by humans, by an LLM jury, or by both depending on the contest and year. The IMO 2025 and USAMO 2025 evaluations relied on four expert human judges, each with IMO-level competition experience. Each submission was anonymized so that the judge did not know which model produced it, and two judges scored every proof independently. Scores that differed by more than two points were reconciled in discussion or sent to a third judge. The grading scheme followed the official IMO 7-point rubric: a partition into logically independent checkpoints, with at least 4 points reserved for the main idea or critical steps and at most 3 for routine work. Restating the problem, conjecturing without proof, and providing only a final answer earned zero credit, while logic gaps, invalid claims, and contradictions triggered deductions.
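The rubric constraints and the disagreement rule lend themselves to a small consistency check. The sketch below is illustrative only: the `Checkpoint` type, the function names, and the sample rubric are hypothetical, and the numeric limits are simply the ones stated in the paragraph above.

```python
from typing import List, NamedTuple


class Checkpoint(NamedTuple):
    description: str
    points: int
    is_main_idea: bool  # True for the main idea or a critical step


def rubric_problems(rubric: List[Checkpoint]) -> List[str]:
    """List violations of the 7-point structure described above."""
    issues = []
    total = sum(c.points for c in rubric)
    main = sum(c.points for c in rubric if c.is_main_idea)
    if total != 7:
        issues.append(f"checkpoints sum to {total}, expected 7")
    if main < 4:
        issues.append(f"only {main} points on main ideas, expected at least 4")
    if total - main > 3:
        issues.append(f"{total - main} points on routine work, expected at most 3")
    return issues


def needs_reconciliation(score_a: int, score_b: int) -> bool:
    """True when two independent scores differ by more than two points."""
    return abs(score_a - score_b) > 2


# Hypothetical rubric, for illustration only.
rubric = [
    Checkpoint("Key invariant identified and proved", 4, True),
    Checkpoint("Reduction to the invariant", 1, True),
    Checkpoint("Base cases and bookkeeping", 2, False),
]
print(rubric_problems(rubric))      # []
print(needs_reconciliation(6, 3))   # True
print(needs_reconciliation(5, 3))   # False
```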

In the 2026 USAMO grading round the team experimented with a semi-automated LLM jury consisting of GPT-5.4, Gemini 3.1 Pro, and Claude Opus 4.6 acting as judges, with a final human pass over the jury outputs. The pipeline first uses a strong model to generate a detailed rubric from the official reference solution, then standardizes the format of every candidate proof, then collects three independent jury scores. When the jury's scores diverge by more than two points, judges reconcile and award the minimum. Human reviewers stepped in to adjust three solutions by at most two points each, which suggests that the jury approach matches expert grading closely on contemporary models.
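The jury aggregation step can be sketched as below. Only the divergence rule (a spread of more than two points leads to the minimum) comes from the text; the source does not say how agreeing scores are combined, so taking their mean is an assumption made here for illustration, as is the function name.

```python
from statistics import mean
from typing import Sequence


def jury_score(scores: Sequence[int], max_spread: int = 2) -> float:
    """Combine three jury scores on the 0 to 7 scale."""
    if len(scores) != 3:
        raise ValueError("expected exactly three jury scores")
    if any(s < 0 or s > 7 for s in scores):
        raise ValueError("scores must lie between 0 and 7")
    if max(scores) - min(scores) > max_spread:
        # Divergent jury: award the minimum, as described in the text.
        return float(min(scores))
    # ASSUMPTION: agreeing scores are averaged; the source does not specify this.
    return float(mean(scores))


print(jury_score([7, 3, 6]))  # 3.0
```

Taking the minimum on disagreement is a conservative choice: a proof earns a high score only when all three judges accept it.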

Statistical reporting

For each model and contest, MathArena publishes the mean of the four (or 16) run scores together with a 95% confidence interval derived from a paired permutation test against every other model. Cross-contest rankings are weighted inversely to the number of problems per contest to avoid overweighting the longest events. As of the 2026 paper update the platform also exposes an item-response-theory (IRT) parameterization that fits each problem with a difficulty parameter and each model with an ability parameter, producing an "expected performance" score that smooths over event-specific noise.
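For readers unfamiliar with the technique, here is a minimal sketch of a paired permutation (sign-flip) test on per-problem scores of two models. It returns a Monte Carlo p-value; how MathArena turns such tests into the published intervals is not described here, so this only illustrates the underlying idea. The function name and parameters are hypothetical.

```python
import random
from typing import Sequence


def paired_permutation_pvalue(
    a: Sequence[float],
    b: Sequence[float],
    n_resamples: int = 10_000,
    seed: int = 0,
) -> float:
    """Two-sided p-value for the mean difference between paired scores.

    a[i] and b[i] are the scores of two models on the same problem i.
    """
    if len(a) != len(b):
        raise ValueError("score lists must be paired problem by problem")
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the sign of each paired difference is arbitrary.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_resamples + 1)


same = paired_permutation_pvalue([0.5] * 12, [0.5] * 12)
apart = paired_permutation_pvalue([1.0] * 12, [0.0] * 12)
print(same)
print(apart < 0.05)
```

Pairing by problem matters because problem difficulty varies far more than model-to-model differences on a 15-problem contest.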

Key results

USAMO 2025: the proof gap

The single most cited MathArena result is the USAMO 2025 evaluation published in the "Proof or Bluff?" report in March 2025. The six 2025 problems were graded out of 42 total points, and only Gemini 2.5 Pro produced anything resembling a competitive performance.

Rank	Model	Provider	USAMO 2025 score	Percentage
1	Gemini 2.5 Pro	Google	10.1 / 42	24.4%
2	OpenAI o3	OpenAI	9.2 / 42	21.9%
3	o4-mini high	OpenAI	8.1 / 42	19.3%
4	GPT-5	OpenAI	7.5 / 42	17.9%
5	Claude 4 Opus	Anthropic	6.8 / 42	16.2%
-	DeepSeek-R1	DeepSeek	< 2 / 42	< 5%
-	Grok 3 mini	xAI	< 2 / 42	< 5%
-	QwQ-32B-Preview	Alibaba	< 2 / 42	< 5%

The human median for USAMO 2025 participants, themselves a highly selected group of high school mathematics olympians, was 35.7%, comfortably above every model evaluated. The team's qualitative analysis identified four recurring failure modes in the model outputs: flawed logical inferences buried inside long reasoning traces, unjustified assumptions used to short-circuit case analysis, lack of creative problem-solving (models repeatedly tried the same approach despite getting stuck), and confident self-assessment in which models claimed to have completed proofs that human judges scored at zero. Grok variants in particular tended to produce extremely short outputs, sometimes consisting only of a final number with no derivation. Gemini 2.5 Pro, while the strongest model, frequently cited theorems that do not exist in the literature when it could not find a valid proof.

IMO 2025: no model earns a medal

A few months later the same protocol was applied to the IMO 2025 problems within hours of the contest in Sunshine Coast, Australia. At IMO 2025 the medal cutoffs were 19 of 42 points for bronze, 28 for silver, and 35 for gold. No publicly released model reached even the bronze cutoff.

Rank	Model	Provider	IMO 2025 score	Percentage	Medal?
1	Gemini 2.5 Pro	Google	13 / 42	31.0%	No
2	Grok 4 (updated prompt, 128k tokens)	xAI	9 / 42	21.4%	No
3	OpenAI o3	OpenAI	8 / 42	19.0%	No
4	GPT-5	OpenAI	7 / 42	16.7%	No
5	Claude 4 Sonnet	Anthropic	6 / 42	14.3%	No

The team noted that 13 points represents real partial credit on multiple problems, which is more than most models would have earned the prior year, but also that even a single problem fully solved on IMO is worth 7 points and none of the public models managed it. The headline result that no off-the-shelf model achieved bronze contrasted sharply with the simultaneous announcement that Google DeepMind's experimental Deep Think system, run in a special evaluation conducted by IMO graders themselves rather than by MathArena, achieved a gold-medal-level score of 35 of 42. ByteDance's Seed-Prover, a Lean-based system, was certified at silver level around the same time. The gap between public model performance on MathArena and the internal results from lab-only systems became one of the most discussed findings in the 2025 AI mathematics literature.

Final-answer aggregates: high scores, smaller gaps

On the AIME, HMMT, BRUMO, and SMT events combined (130 problems in the 2025 paper), top models scored well above the top 1% human percentile.

Model	Aggregate accuracy on 2025 final-answer contests
GPT-5 (high)	91.25%
Grok 4 Fast	90.57%
Grok 4	90.36%
o4-mini (high)	86%
Gemini 2.5 Pro	86%
Claude 4 Opus	79%
DeepSeek-R1	79.8% (AIME 2024 reported by DeepSeek)

Top 1% human reference: 84.35% (AIME), 66.79% (HMMT).

The team observed that aggregate accuracy on final-answer contests has trended steadily upward through 2025 and into 2026, to the point where the May 2026 saturation blog post highlighted GPT-5.5 hitting 98% on the AIME-style portion of the 2026 USAMO and 95% on the USAMO 2026 proof problems. The MathArena Apex track was created in part to push beyond that ceiling.

Apex: where models fail

The August 2025 launch of Apex revealed a sharp performance drop on problems specifically curated to be hard for frontier systems.

Rank	Model	Apex 2025 accuracy	Cost per evaluation (USD)
1	Qwen3-A22B-2507-Think	5.21%	$9.89
2	Grok 4	2.08%	$99.39
3	GPT-5 high (agent scaffolding)	2.08%	$183.79
4	GPT-5-mini high	1.04%	$13.42
5	GLM 4.5	1.04%	$14.50

The team explicitly cautioned readers not to interpret these rankings as definitive overall capability rankings: by construction the problems were selected because the four reference frontier models failed on them, so any model that solves even one extra problem moves several places. Of the 12 Apex problems, the team identified that problems 9 through 12 remained unsolved across all evaluated models even at pass@k with large k. Problem 1, by contrast, was solved within a small number of attempts by at least one model under each evaluation perspective. The most common wrong answer often appeared in over 50% of attempts, which the team interprets as evidence that models converge on a single confidently incorrect reasoning path rather than exploring alternatives.
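The coarse spacing of the percentages in the table follows from the evaluation size. The short calculation below assumes accuracy is the share of successful attempts over all 12 problems times 16 runs; under that assumption the reported values are consistent with 10, 4, and 2 successful attempts, which is an inference rather than a figure published by the team.

```python
PROBLEMS = 12
RUNS_PER_PROBLEM = 16
TOTAL_ATTEMPTS = PROBLEMS * RUNS_PER_PROBLEM  # 192 attempts per model


def accuracy_percent(successes: int) -> float:
    """Accuracy in percent, rounded to two decimals as in the table."""
    return round(100 * successes / TOTAL_ATTEMPTS, 2)


def implied_successes(percent: float) -> int:
    """Number of successful attempts implied by a reported percentage."""
    return round(percent / 100 * TOTAL_ATTEMPTS)


for reported in (5.21, 2.08, 1.04):
    k = implied_successes(reported)
    print(reported, k, accuracy_percent(k))
```

One successful attempt is worth about 0.52 percentage points, so a single lucky run can reorder the lower part of the table.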

Domain breakdown

Across the broader leaderboard, models show consistent patterns by mathematical subfield. The 2025 paper reports the following best-in-class accuracies for GPT-5 (high):

Domain	Best accuracy	Notes
Algebra	~100%	Foundational manipulation; strongest area for every model
Number theory	~94%	Standard divisibility, modular, and Diophantine techniques
Combinatorics	~91%	Creativity-intensive; performance drops on harder problems
Geometry	~81%	Weakest area; models rely on coordinate methods and struggle with synthetic arguments

Geometry is the consistent weak spot, attributed in qualitative grading to a reliance on coordinate bash methods rather than synthetic insight, plus a general inability to use diagrams effectively from text-only inputs.

Contamination analysis

The contamination findings reported in the 2025 paper are some of the most rigorous public measurements of how much benchmark inflation existed in widely cited prior results.

AIME 2024 versus AIME 2025

On AIME 2024 most evaluated models scored 10 to 20 percentage points higher than on AIME 2025, despite the two contests having comparable human difficulty distributions. The largest gap belonged to QwQ-32B-Preview, which performed nearly 60 percentage points above its AIME 2025 baseline, a difference too large to attribute to year-to-year problem variation. The paper concludes that AIME 2024 should no longer be treated as a clean benchmark for any model trained on data up to 2024 or later.

HMMT shows smaller gaps

A control comparison on HMMT 2024 versus 2025 found much smaller performance differences, which the team attributes to HMMT's lower online prominence and smaller volume of associated student writeups. The new contests were not perfectly clean either: the team identified eight AIME 2025 problems and one HMMT 2025 problem that had appeared in similar forms online prior to model evaluation, although these were mostly easier problems and did not materially shift the leaderboard.

Embargo and version tracking

To prevent future contamination of its own data, MathArena uses an embargo period: evaluations begin within hours of public problem release, but solutions and model outputs are not made fully public until every model scheduled for that contest has been evaluated. The team also monitors known contamination indicators, such as test-set string matches in publicly indexed corpora, and applies anomalous-performance detection on outlier scores.
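A string-match indicator of the kind mentioned above can be approximated with word n-gram overlap. The sketch below is a generic illustration and not MathArena's detector: the 8-gram window, the function names, and the sample texts are assumptions chosen for the example.

```python
import re
from typing import Set, Tuple


def ngrams(text: str, n: int = 8) -> Set[Tuple[str, ...]]:
    """Set of lowercase word n-grams; empty when the text is shorter than n words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(problem: str, document: str, n: int = 8) -> float:
    """Share of the problem's n-grams that also occur in the document."""
    problem_grams = ngrams(problem, n)
    if not problem_grams:
        return 0.0
    return len(problem_grams & ngrams(document, n)) / len(problem_grams)


# Hypothetical texts, for illustration only.
problem = "Find the number of ordered pairs of positive integers whose product is 2025"
leaked = "Forum post: " + problem + " Any hints?"
unrelated = "The weather in Zurich is pleasant in late spring and early summer most years"

print(overlap_ratio(problem, leaked))     # 1.0
print(overlap_ratio(problem, unrelated))  # 0.0
```

Exact n-gram matching misses paraphrased leaks, which is one reason such indicators are paired with outlier detection on scores.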

Comparison with other benchmarks

MathArena occupies a distinct position in the mathematics benchmarking landscape. The table below compares its design choices against the most cited alternatives.

Benchmark	Source	Contamination strategy	Proof grading	Status
MathArena	Live olympiads and contests	Real-time evaluation post-release	Yes, human and LLM jury	Active, expanding
FrontierMath (Epoch AI, 2024)	Original problems from professional mathematicians	New unpublished problems, automated verification	No (final answer only)	Active, mostly private
PutnamBench	Putnam problems with Lean formalization	Static, formal verification mitigates leakage	Yes (Lean proofs)	Active
OlympiadBench (2024)	8,476 olympiad and physics problems	None (static historical data)	Partial	Static
OmniMATH	4,428 competition problems	None	No	Static
GSM8K	Grade-school word problems	None	No	Static, widely contaminated
MATH (Hendrycks et al.)	Competition problems from web	None	No	Static, contamination identified
Minerva eval suite	Mixed competition and STEM	None	No	Static

FrontierMath, run by Epoch AI, is MathArena's closest peer in spirit: both projects target the contamination problem head-on, but they take opposite approaches. FrontierMath commissions original problems from professional mathematicians and keeps them private, evaluating models in a controlled environment. MathArena relies on the public competition calendar and the embargo period to keep its test set fresh. The two are complementary; many AI labs run both. PutnamBench is the proof-writing analog in the formal verification space: it requires models to produce Lean proofs that compile against the Putnam corpus, an interest in formalization that the SRI Lab shares through its ArXivLean track.

The team has been explicit that MathArena does not aim to replace synthetic original-problem benchmarks, which provide much larger samples, but instead serves as a high-signal, low-volume measurement of how much performance is real reasoning versus pattern recall.

Related projects

MathArena is the most public face of a broader SRI Lab research program on mathematical reasoning. Several connected projects extend the platform's findings.

Project	Year	Focus	MathArena connection
Proof or Bluff?	March 2025	USAMO 2025 evaluation report	Pilot paper that led directly to MathArena
MathArena main paper	May 2025 (NeurIPS D&B 2025)	Platform description and 2025 results	Core publication
Open Proof Corpus	June 2025	5,000+ human-annotated LLM proofs	Uses MathArena problems as one of its five splits
MathConstruct	2025	Constructive proof reasoning benchmark	Companion proof-writing project
BrokenMath	2025	Sycophancy detection in theorem proving	Tests model refusal of false statements
QED-Nano	2025	Distilling proof ability into smaller models	Trained on data partly derived from MathArena
IMProofBench	2025	Research-level proof generation	Higher difficulty tier than MathArena
AlephProver	May 2026	Lean formalization for arXiv statements	Headlines the ArXivLean leaderboard

The Open Proof Corpus, jointly released with INSAIT, is the largest public collection of expert-annotated AI-generated mathematical proofs. Constructed over four weeks by 13 expert judges, it includes more than 5,000 proofs generated by o4-mini, o3, Gemini 2.5 Pro, Grok 3 mini, Qwen3-235B-A22B, and DeepSeek-R1, with one of the five splits drawn directly from MathArena problems so that final-answer correctness and proof validity can be compared for the same questions. The corpus is intended as training data for future proof-grading systems and as a reference for studying how LLM-generated proofs differ from human-written ones.

Technical implementation

The code base lives at https://github.com/eth-sri/matharena under an MIT license. The project uses Python 3.12 with the uv package manager for dependency management and supports four backends for inference: direct API calls to OpenAI, Anthropic, Google, DeepSeek, and xAI; a unified OpenRouter path for less common open-weight models; local serving through vLLM; and provider-specific integration for OpenAI's reasoning effort levels and Anthropic's extended thinking. Configurable parameters include temperature, top-p sampling, token limits, retry logic, and per-run cost tracking. The repository has been actively developed throughout 2025 and 2026 with around 50 commits on the main branch as of mid-2026.

Problem and solution data lives on Hugging Face under the MathArena organization at huggingface.co/MathArena. Each contest is its own dataset: notable splits include apex_2025, apex_2025_outputs, apex-shortlist, usamo_2025, aime_2026, and final_answer_comps. The README in the GitHub repository documents the required schema for new contributions, which uses a problem_idx field as a stable identifier, a problem field with the LaTeX statement, an answer field with the ground truth for final-answer problems, and optional fields for the maximum points, sample solution, grading scheme, and difficulty rating.
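To make the schema concrete, the sketch below validates a single problem entry. The field names `problem_idx`, `problem`, and `answer` come from the description above; the key names used for the optional fields are hypothetical placeholders, since the text only describes them in words, and the sample entry is invented for illustration.

```python
from typing import Any, Dict, List

REQUIRED = ("problem_idx", "problem")  # named in the text
# HYPOTHETICAL key names for the optional fields described in the text.
OPTIONAL = ("points", "sample_solution", "grading_scheme", "difficulty")


def validate_entry(entry: Dict[str, Any], final_answer: bool = True) -> List[str]:
    """Return a list of schema problems; an empty list means the entry looks valid."""
    errors = [f"missing field: {key}" for key in REQUIRED if key not in entry]
    if final_answer and "answer" not in entry:
        errors.append("missing field: answer")
    known = set(REQUIRED) | {"answer"} | set(OPTIONAL)
    errors += [f"unknown field: {key}" for key in entry if key not in known]
    return errors


# Invented sample entry, not taken from any MathArena dataset.
example = {
    "problem_idx": 1,
    "problem": r"Compute $1 + 1$.",
    "answer": "2",
    "points": 7,
}
print(validate_entry(example))             # []
print(validate_entry({"problem": "x"}))    # two missing-field errors
```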

A separate README in the repository describes the judging workflow for human graders, including how to anonymize outputs, how to record judge scores, how to flag inter-judge disagreement, and how to handle appeals. The grading infrastructure was significantly expanded in the second half of 2025 to support the LLM-jury pipeline used in USAMO 2026.

Impact and reception

MathArena has had outsized influence relative to its problem volume. Major labs cite it in model release announcements, often alongside FrontierMath, as evidence that their models are not simply memorizing competition problems. The contamination finding for AIME 2024 forced a quiet shift in industry practice away from reporting AIME 2024 numbers alone and toward reporting AIME 2025 or AIME 2026 results alongside them.

In the academic community, the proof versus final-answer gap reported in "Proof or Bluff?" became a reference point for discussions of mathematical reasoning capability. The finding that all models including OpenAI o3 confidently claimed to have solved problems they had not actually solved triggered a wave of follow-up work on calibration, sycophancy, and meta-cognitive evaluation in LLMs. The MathArena team's BrokenMath project extends this line by directly testing whether models will refuse to prove statements that are mathematically false.

The platform has also influenced how olympiad organizations think about AI participation. Both the IMO and USAMO official bodies have engaged with the MathArena team about test protocols, and several private evaluations conducted by lab-internal teams (Google DeepMind's Deep Think IMO 2025 result, ByteDance's Seed-Prover) were structured to be directly comparable to MathArena numbers.

For educators and competition organizers, MathArena provides a public record of how well models can serve as practice partners or check-graders for student work. The Open Proof Corpus, with its more than 5,000 annotated proofs, has begun to be used by AI tutoring startups to fine-tune grading assistants and to identify the most common LLM errors that human teachers should expect to see when students rely on chatbots for olympiad preparation.

Limitations

The MathArena team is candid about the platform's constraints, and the 2025 paper devotes a section to them.

First, the problem volume is small. Even with seven contests in 2025, the total of 162 problems leaves wide confidence intervals when comparing models that differ by only a few points. The team mitigates this with cross-contest aggregation and IRT-based smoothing, but the noise is real, especially for the proof-based events with only six problems each.

Second, proof grading is expensive. A pool of four expert judges, two independent gradings per proof, and reconciliation discussions cost both money and the time of mathematicians whose availability does not scale. The 2026 transition to a semi-automated LLM jury was driven in part by this constraint. While the LLM jury matches expert grading on contemporary frontier models, the team explicitly does not claim it will continue to work as models become stronger or weirder.

Third, the platform is English-only. Most major mathematical olympiads publish problems in multiple languages, but MathArena evaluates the English versions, which may not be the version a model trained primarily on, say, Mandarin or Russian would handle best.

Fourth, the competition-mathematics focus is itself a narrow slice of mathematical reasoning. Real mathematical research involves much longer time horizons, much larger context, and creative problem formulation, none of which a contest setting captures. The ArXivMath and ArXivLean tracks are explicit attempts to extend the platform in this direction, but they are early and still small.

Fifth, final-answer contests are saturating. The May 2026 blog post acknowledges that GPT-5.5 reaching 98% on the 2026 USAMO final-answer subset and 95% on the USAMO 2026 proof problems essentially closes those tracks as useful discrimination tools for frontier models. The platform's future as a high-signal benchmark depends on the proof tracks and the research-level extensions.

Future directions

The roadmap outlined in the team's 2026 paper update (a follow-up to the 2025 paper, listed on arXiv as 2605.00674) focuses on five priorities. First, expanding the proof-writing evaluation to include longer and harder competitions, with international olympiads from outside the United States and university-level events such as the Putnam taking on greater weight. Second, integrating formal proof verification through Lean, with ArXivLean and AlephProver as the lead vehicles. Third, scaling up the LLM jury so that human grading is required only for adjudication, reducing the cost of evaluation. Fourth, broadening into multilingual problem sets, beginning with Chinese, Russian, and Bulgarian. Fifth, expanding the research-level mathematics tracks to provide a successor benchmark for the saturated final-answer track.

The long-term goal stated by the team is to provide a continuously refreshed measurement of mathematical reasoning that is robust to the most common failure modes of static benchmarks: training data contamination, narrow problem distributions, and inability to distinguish answer recall from proof construction. As of mid-2026 MathArena is one of the few benchmarks where the gap between human experts and frontier models is visibly closing in some tracks while remaining stubbornly open in others, which is exactly the dynamic range that benchmarks of this kind are meant to provide.

References

Balunović, M., Dekoninck, J., Petrov, I., Jovanović, N., Vechev, M. (2025). MathArena: Evaluating LLMs on Uncontaminated Math Competitions. NeurIPS Datasets and Benchmarks 2025. https://arxiv.org/abs/2505.23281
Petrov, I., Dekoninck, J., Baltadzhiev, L., Drencheva, M., Minchev, K., Balunović, M., Jovanović, N., Vechev, M. (2025). Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad. https://arxiv.org/abs/2503.21934
Dekoninck, J. et al. (2025). The Open Proof Corpus: A Large-Scale Study of LLM-Generated Mathematical Proofs. https://arxiv.org/abs/2506.21621
SRI Lab, ETH Zurich. LLMs for Mathematical Reasoning research page. https://www.sri.inf.ethz.ch/research/mathllm
MathArena project website. https://matharena.ai/
MathArena GitHub repository. https://github.com/eth-sri/matharena
MathArena Hugging Face organization (datasets). https://huggingface.co/MathArena
MathArena IMO 2025 blog post. https://matharena.ai/imo/
MathArena Apex blog post. https://matharena.ai/apex/
MathArena USAMO 2026 blog post. https://matharena.ai/usamo/
MathArena competitions index. https://matharena.ai/competitions
INSAIT. Open Proof Corpus release announcement. https://insait.ai/insait-releases-open-proof-corpus-the-largest-public-collection-of-expert-annotated-ai-generated-mathematical-proofs/
Martin Vechev profile, SRI Lab. https://www.sri.inf.ethz.ch/people/martin
Huang, Y., Yang, L. F. (2025). Winning Gold at IMO 2025 with a Model-Agnostic Verification-and-Refinement Pipeline. https://arxiv.org/abs/2507.15855
Balunović, M. et al. (2026). Beyond Benchmarks: MathArena as an Evaluation Platform for Mathematics with LLMs. https://arxiv.org/abs/2605.00674
