PutnamBench

AI Benchmarks Model Evaluation

7 min read

Updated Jun 8, 2026

Suggest edit History Talk

RawGraph

Last edited

Jun 8, 2026

Fact-checked

In review queue

Sources

8 citations

Revision

v1 · 1,322 words

Fact-checks are independent of edits: a reviewer re-verifies the article against its sources and stamps the date. How we verify

Overview

PutnamBench is a benchmark for evaluating automated theorem proving systems, especially neural and large-language-model-based provers, on competition mathematics. It consists of problems from the William Lowell Putnam Mathematical Competition, the premier undergraduate mathematics competition in North America, with each problem manually rewritten as a formal theorem statement in one or more interactive proof assistants. Because a candidate proof must be checked by the proof assistant's kernel before it counts as a success, PutnamBench scores are objective and not vulnerable to the grading ambiguity or memorization concerns that affect natural-language math benchmarks ^[1]^[2].

The benchmark was introduced in 2024 by a group at the University of Texas at Austin and accepted at the NeurIPS 2024 Datasets and Benchmarks Track ^[1]. At release it was extremely hard: the strongest neural and symbolic provers could solve only a handful of problems out of hundreds ^[1]. By late 2025 and into 2026 it had become a central yardstick in the formal-mathematics and autoformalization race, with specialized provers climbing from single digits to roughly 88 percent of the Lean problems solved ^[3]^[4].

What PutnamBench is

The benchmark accompanies the paper "PutnamBench: Evaluating Neural Theorem-Provers on the Putnam Mathematical Competition" by George Tsoukalas, Jasper Lee, John Jennings, Jimmy Xin, Michelle Ding, Michael Jennings, Amitayush Thakur, and Swarat Chaudhuri ^[1]. The first version was posted to arXiv on 15 July 2024 and revised in November 2024 ^[1].

Each entry is a hand-constructed formal statement of a Putnam problem: the theorem to be proved is written out in the formal language, and the AI system's job is to supply a machine-checkable proof. The competition itself, held annually since 1938, poses twelve problems graded out of ten points each; it is notoriously difficult, with a median score of zero in many years. This makes it a strong stress test for formal reasoning, because solving even one problem requires genuine mathematical insight rather than pattern matching ^[1]^[2].

A notable design feature is "factored" formalization. Many Putnam problems ask the solver to both determine a numerical or closed-form answer and then justify it. PutnamBench separates these two tasks: the expected answer is encoded as a separate definition, so the benchmark can distinguish answer synthesis from proof justification and evaluate each precisely ^[2]^[5].

Structure (Lean, Isabelle, Coq)

PutnamBench is multilingual, which the authors note makes it the first undergraduate-competition benchmark for each of these proof assistants and the first Putnam-style evaluation set for Coq (now also known as Rocq) ^[1]^[5]. The original release contained 1,692 formalizations of 640 theorems ^[1]. The repository has since grown to cover Putnam years 1962 to 2025, and as of early 2026 reports the following totals ^[5]:

Proof assistant	Formalizations
Lean 4 (primary)	672
Isabelle	640
Coq / Rocq	412
Total (all languages)	1,724

Lean 4 is the primary language and the one the public leaderboard tracks most closely; every problem has a Lean and an Isabelle formalization, while a large subset is also formalized in Coq ^[1]^[5]. Problems span a broad range of undergraduate topics, including algebra, analysis, number theory, combinatorics, geometry, linear algebra, and probability ^[5].

Verification is what makes the benchmark trustworthy: a submission is counted as solved only if the proof assistant's trusted kernel accepts the proof term for the exact stated theorem. There is no partial credit and no human judgment in scoring, so reported numbers reflect fully verified proofs ^[2]^[5].

Results and difficulty

At release, PutnamBench was meant to be nearly out of reach for existing systems, and it was. The authors evaluated several families of methods: large language models prompted in context such as GPT-4 and GPT-4o; agentic proof-search systems such as COPRA and ReProver; the hybrid Draft-Sketch-Prove (DSP) approach; and symbolic automation such as Sledgehammer (for Isabelle) and CoqHammer (for Coq) ^[1]^[2]. The results were strikingly low. Few-shot GPT-4o solved only a single problem in each language when given ten attempts, and across all methods and languages only a small handful of the 640 problems were proved ^[1]^[2]. The paper concluded that PutnamBench "is an exceptionally difficult open challenge" and was not solvable through memorization ^[2].

This difficulty contrasted sharply with miniF2F, a widely used formal-math benchmark of high-school olympiad and competition problems where systems had already reached high pass rates. PutnamBench raised the ceiling to undergraduate level, exposing how far automated provers were from genuine mathematical competence ^[1]^[2].

Progress and leaderboard

Progress after release was rapid, driven by a new generation of reinforcement-learning-trained, Lean-native provers. The project maintains a public leaderboard for the Lean track (out of 672 problems) ^[5]. Reported milestones include:

System	Putnam problems solved (Lean)	Notes
Few-shot GPT-4o (2024)	1 per language	Original baseline, 10 attempts ^[1]^[2]
Goedel-Prover (Feb 2025)	7	Open-source, pass@512 ^[6]
DeepSeek-Prover-V2-671B	about 47 to 49	671B-parameter prover ^[3]^[4]
Goedel-Prover-V2-32B	86	About 2x prior SOTA at 1/20 the size ^[3]
Seed-Prover 1.5 (ByteDance)	about 580 (roughly 88 percent)	Agentic Lean prover ^[4]

The jump from Goedel-Prover's 7 in February 2025 to DeepSeek-Prover-V2's high-40s reflected the shift to very large, chain-of-thought provers trained with reinforcement learning; DeepSeek-Prover-V2 reached roughly 89 percent on miniF2F but only about 7 percent on PutnamBench, underscoring the gap between the two benchmarks ^[3]^[6]. DeepSeek-Prover was open-source, and Goedel-Prover-V2-32B then topped the open-source leaderboard with 86 problems while using a far smaller 32-billion-parameter model and fewer sampling passes ^[3].

By late 2025 and early 2026, agentic systems pushed the frontier dramatically higher. ByteDance's Seed-Prover and its successor Seed-Prover 1.5 reported solving about 87.9 percent of PutnamBench (roughly 580 of 672 Lean problems), and the same system produced verified Lean proofs for 11 of the 12 problems at the 2025 Putnam Competition and achieved gold-medal-equivalent performance on IMO 2025 problems in the formal setting ^[4]. Other agentic provers, such as the Hilbert system, reported high pass rates in the same period, and natural-language-assisted pipelines pushed verified solve rates further still ^[4]^[7]. These figures come from system developers' own reports and the project's leaderboard; readers should treat them as self-reported until independently reproduced.

Relationship to other formal-math benchmarks

PutnamBench sits within a family of formal-mathematics benchmarks and milestones. miniF2F covers high-school and olympiad-level problems and has been largely saturated by strong provers, which is why PutnamBench was created to test harder, undergraduate-level reasoning ^[1]^[2]. Both benchmarks rely on the same core principle as autoformalization research: translating informal mathematics into machine-checkable statements, then producing kernel-verified proofs.

The benchmark is closely tied to the wider formal-math effort epitomized by AlphaProof, Google DeepMind's Lean-based system that reached silver-medal-equivalent performance at the 2024 International Mathematical Olympiad and was reported to solve a majority of PutnamBench while saturating miniF2F ^[8]. Subsequent gold-medal results on IMO problems by systems such as Gemini Deep Think and ByteDance's Seed-Prover have kept Putnam-level and olympiad-level formal benchmarks as the leading measures of progress ^[4]^[8]. As Putnam approaches saturation, the community has begun moving toward still-harder, graduate- and research-level formal benchmarks to maintain headroom ^[4].

References

Tsoukalas, G., Lee, J., Jennings, J., Xin, J., Ding, M., Jennings, M., Thakur, A., Chaudhuri, S. "PutnamBench: Evaluating Neural Theorem-Provers on the Putnam Mathematical Competition." arXiv:2407.11214 (2024); NeurIPS 2024 Datasets and Benchmarks Track. https://arxiv.org/abs/2407.11214 ↩
PutnamBench paper, NeurIPS 2024 Datasets and Benchmarks proceedings. https://proceedings.neurips.cc/paper_files/paper/2024/hash/1582eaf9e0cf349e1e5a6ee453100aa1-Abstract-Datasets_and_Benchmarks_Track.html ↩
Goedel-Prover-V2 technical report and announcement. arXiv:2508.03613; https://blog.goedel-prover.com/ ↩
"Seed-Prover 1.5: Mastering Undergraduate-Level Theorem Proving via Learning from Experience." arXiv:2512.17260 (2025). https://arxiv.org/abs/2512.17260 ↩
PutnamBench project page and repository, trishullab. https://trishullab.github.io/PutnamBench/ ; https://github.com/trishullab/PutnamBench ↩
"Goedel-Prover: A Frontier Model for Open-Source Automated Theorem Proving." arXiv:2502.07640 (2025). https://goedel-lm.github.io/ ↩
"Hilbert: Recursively Building Formal Proofs with Informal Reasoning." arXiv:2509.22819 (2025). https://arxiv.org/pdf/2509.22819 ↩
Google DeepMind. "AI achieves silver-medal standard solving International Mathematical Olympiad problems." 25 July 2024. https://deepmind.google/blog/ai-solves-imo-problems-at-silver-medal-level/ ↩

Improve this article

Add missing citations, update stale details, or suggest a clearer explanation. Every suggestion is reviewed for sourcing before it goes live.

Suggest edit

What links here

SWE-bench Verified Terminal-Bench

Overview

What PutnamBench is

Structure (Lean, Isabelle, Coq)

Results and difficulty

Progress and leaderboard

Relationship to other formal-math benchmarks

References

Improve this article

Related Articles

Benchmark (AI)

MATH

SWE-bench Verified

WebArena

Agent evaluation

Terminal-Bench

What links here

Related Articles

Benchmark (AI)

MATH

SWE-bench Verified

WebArena

Agent evaluation

Terminal-Bench

What links here