Goedel-Prover
Last reviewed
May 18, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 3,728 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 18, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 3,728 words
Add missing citations, update stale details, or suggest a clearer explanation.
Goedel-Prover is an open-source large language model designed for automated formal theorem proving in lean 4. Developed by researchers at Princeton Language and Intelligence in collaboration with Tsinghua University, Amazon, Meta FAIR, and Numina, the first version was released in February 2025 and achieved state-of-the-art performance among open-source theorem provers at the time of release.[^1] The system attained a 57.6% success rate at Pass@32 on the miniF2F benchmark, surpassing the previously leading open-source model DeepSeek-Prover-V1.5-RL by 7.6 percentage points, and ranked first on the PutnamBench leaderboard while contributing 29.7K new formal proofs to the Lean Workbook dataset.[^2]
A subsequent release, Goedel-Prover-V2, appeared in August 2025 with substantially improved performance. The flagship 32-billion parameter model achieved 88.1% Pass@32 on miniF2F in standard mode and 90.4% in self-correction mode, outperforming the much larger DeepSeek-Prover-V2-671B and the concurrent Kimina-Prover-72B despite its smaller size.[^3] The Goedel-Prover project is notable for demonstrating that fully open-source systems, with code, model weights, and training data publicly released, can match or exceed the performance of proprietary closed-source counterparts in a domain previously dominated by industrial laboratories.
The lead authors, Yong Lin and Shange Tang, were affiliated with Princeton University at the time of publication, with senior collaborators including Sanjeev Arora, Chi Jin, Danqi Chen, and Mengzhou Xia, among others spanning multiple institutions.[^1]
Automated theorem proving sits at the intersection of artificial intelligence, mathematical logic, and software verification. The objective is to produce machine-checkable proofs of mathematical statements expressed in a formal language such as Lean, Isabelle, or Coq. Unlike informal mathematical reasoning, which can contain gaps or rely on tacit assumptions, a formal proof must satisfy a proof assistant's verification algorithm before it is accepted. This rigor is what makes formal proving attractive for high-stakes applications such as verifying cryptographic protocols, ensuring software correctness, and confirming results in pure mathematics.
Prior to the Goedel-Prover release, the leading open-source theorem prover on competition-style benchmarks was DeepSeek-Prover-V1.5, developed by DeepSeek-AI. That system combined supervised fine-tuning on Lean 4 proofs, reinforcement learning from proof assistant feedback (RLPAF), and a Monte-Carlo tree search variant called RMaxTS. DeepSeek-Prover-V1.5-RL reached 50.0% Pass@32 on miniF2F.[^4] Other notable systems included InternLM2.5-StepProver, which used a critic-guided search strategy and reached 65.9% Pass on miniF2F-test, and earlier work such as InternLM-Math-Plus.[^5]
The proprietary frontier was held by Google DeepMind's alphaproof, an AlphaZero-style reinforcement learning system that, combined with alphageometry 2, achieved silver-medal performance at the 2024 International Mathematical Olympiad in a controlled evaluation. AlphaProof's neural architecture, training data, and model weights were not released, leaving a substantial capability gap between published proprietary results and what the research community could reproduce or build upon.
A persistent challenge in this field is the scarcity of formalized mathematics. While natural-language mathematical content exists in enormous quantities, the corpus of statements and proofs written in Lean 4 or other proof assistants is comparatively tiny. The Lean community's mathlib library represents an enormous human effort but covers a limited subset of mathematical territory, mostly oriented toward research mathematics rather than competition-style problems. Datasets such as Lean Workbook, released in mid-2024 with roughly 140K autoformalized problems drawn from Art of Problem Solving and Compfiles material, attempted to expand the available training material, but the gap between informal and formal data remained vast.[^6]
The Numina dataset, numinamath, offered approximately 860K natural-language mathematical problems with solutions, but these were not formalized. Translating such problems into Lean 4 statements at scale was an open problem known as autoformalization. Earlier work had explored autoformalization, but reliably producing correct, faithful formal statements at the scale of hundreds of thousands of problems remained difficult.
Goedel-Prover entered this landscape with a clear thesis: by combining large-scale autoformalization with iterative expert training on Lean compiler feedback, an open-source 7-billion-parameter model could rival or surpass the best published open-source results. The name pays homage to the Austrian-American logician Kurt Gödel, whose incompleteness theorems are foundational to the formal logic tradition that proof assistants embody. The project should not be confused with Jürgen Schmidhuber's earlier theoretical proposal of Gödel Machines, which referred to a self-improving artificial general intelligence framework and is unrelated to Goedel-Prover's neural theorem-proving objectives.
Goedel-Prover combines several techniques into a coherent pipeline: autoformalization of natural-language mathematical problems into Lean 4 statements, quality filtering of those statements, and expert iteration to train successive prover models on the growing dataset of verified proofs.
To translate the abundant supply of natural-language problems into Lean 4 statements, the Goedel-Prover team trained two distinct formalizer models, referred to as Formalizer A and Formalizer B. The use of two formalizers was a deliberate design choice intended to increase the diversity of generated formalizations and to hedge against the failure modes of any single formalizer.[^1]
Formalizer A was constructed by supervised fine-tuning of Qwen2.5-Coder-32B on informal-formal (I-F) pairs derived from the Lean Workbook. Formalizer B used Claude-sonnet-3.5 to produce initial formalizations for roughly 230K Numina statements; about 170K of these passed Lean compilation. That filtered set then served as the training data for a second Qwen2.5-Coder-32B model fine-tuned to imitate Claude's formalization style.
The two formalizers exhibit complementary strengths. Formalizer A tended to produce more direct, literal translations, while Formalizer B leaned toward more abstract Lean idioms. Empirically, examples exist where Formalizer A succeeded while Formalizer B failed, and vice versa, validating the dual-formalizer strategy. In an ablation, training the final prover on data from both formalizers yielded 57.6% Pass@32 on miniF2F, compared with 56.5% using Formalizer A alone and 56.2% using Formalizer B alone.[^1]
Raw outputs from the formalizers were not directly usable as training data. The team imposed two filters:
The Compiling Correctness (CC) Test required that a generated Lean 4 statement compile as syntactically valid Lean code when augmented with := by sorry (the standard placeholder for an unproven theorem). Formalizer A achieved 76.74% CC Pass@1, while Formalizer B achieved 88.48% CC Pass@1.[^1]
The Faithfulness and Completeness (FC) Test used Qwen2.5-72B-Instruct as an automated judge, evaluating with four independent judgments whether the Lean statement captured all hypotheses, conclusions, and quantifiers from the original natural-language problem. Statements with an FC score below 0.5 were discarded. Formalizer A achieved 48.06% FC Pass@1, while Formalizer B reached 80.42% FC Pass@1.[^1]
Starting from 928K informal statements (860K from Numina plus 68K from Art of Problem Solving), the filtering process yielded 1.78 million formal statements: 760K with valid pairs from both formalizers and 123K with a single valid formalization. After further deduplication and quality control, the released Goedel-Pset-v1 dataset contained 1.64 million formal statements.[^2]
The training of Goedel-Prover-SFT followed an expert iteration loop, an established technique in which a prover model is used to generate candidate proofs for a pool of unsolved theorems, the candidates are checked against the Lean compiler, and the validated proofs are appended to the training set for the next iteration. The procedure was repeated for nine iterations.[^1]
At each iteration, the current prover generated 16 candidate proofs per statement using whole-proof generation. Unlike step-wise approaches that interleave model output with proof-assistant interaction, Goedel-Prover produced complete proofs without reading intermediate tactic states from Lean during generation, with the Lean compiler used only as a final verifier of candidate proofs. Successful proofs were collected as training data, and the next prover was obtained by supervised fine-tuning of DeepSeek-Prover-V1.5-Base on the cumulative dataset.
The composition of the training pool evolved across iterations. In iterations 0 through 3, the pool was seeded with 140K Lean Workbook statements plus the Claude-formalized statements from Formalizer B's training data. In iterations 4 through 6, the team incorporated outputs from Formalizer B, scaling the formalized pool to 882K statements with 425.8K solved by iteration 4. In iteration 6 and beyond, Mathlib4 data (roughly 104K statements) was added to improve ProofNet performance, although this slightly reduced miniF2F scores. In iterations 7 through 9, Formalizer A outputs were incorporated, expanding the dataset to the full 1.64 million formalized statements and 928.2K solved statements by iteration 9.[^1]
Training used learning rates of 1e-4 or 5e-5 for 1 to 2 epochs with batch size 8, running roughly 12 hours per epoch on 4 H100 GPUs.
A distinctive design choice in Goedel-Prover was the use of whole-proof generation rather than interactive, step-by-step proof construction. In step-wise approaches such as DeepSeek-Prover-V1.5 (which used Monte-Carlo tree search) or InternLM2.5-StepProver, the model generates one or a few tactics at a time, observing the resulting proof state from Lean before producing the next step. Goedel-Prover instead samples complete proofs in one pass and submits them to Lean for verification. This simplifies the inference pipeline at the cost of forgoing potentially useful intermediate feedback during generation. The team's results suggest that, at the scale of training data they assembled, whole-proof generation can be competitive with or superior to step-wise methods.
miniF2F is a cross-system benchmark for formal Olympiad-level mathematics consisting of 488 problem statements drawn from the American Invitational Mathematics Examination (AIME), the American Mathematics Competitions (AMC), the International Mathematical Olympiad (IMO), and high-school and undergraduate mathematics curricula.[^7] The benchmark was introduced in 2021 and has become the standard yardstick for neural theorem provers in Lean and other proof assistants.
On miniF2F-test (the standard 244-problem test split), Goedel-Prover-SFT achieved the following results:[^1]
| Model | Pass@32 | Pass@3200 | Pass@25600 |
|---|---|---|---|
| DeepSeek-Prover-V1.5-RL | 50.0% | 54.9% | 58.5% |
| Goedel-Prover-SFT | 57.6% | 62.7% | 64.7% |
The 7.6-percentage-point gain at Pass@32 was the headline result. Crucially, the improvement was consistent across sampling budgets ranging from Pass@32 up through Pass@25,600, suggesting that Goedel-Prover did not merely shift the distribution of easy problems but genuinely extended the frontier of solvable problems. A subsequent reinforcement-learning extension reported in the same paper raised performance above 60% Pass@32 on miniF2F.[^1]
These figures should be read with attention to the Pass@K methodology: a Pass@K score of P means that, given K independent samples from the model, the proportion of test problems for which at least one sample yields a Lean-verified proof is P. Higher K thus generally yields higher Pass@K, and direct comparisons require matching K. Goedel-Prover's lead over DeepSeek-Prover-V1.5-RL held at every K value reported.
Beyond miniF2F, Goedel-Prover was evaluated on two additional benchmarks that probe different aspects of formal theorem proving.
The Lean Workbook, introduced in 2024, contains roughly 140K autoformalized statements drawn from Art of Problem Solving forum posts and Compfiles material. It serves both as training data and as a measure of how many problems a prover can successfully discharge. Prior work (InternLM2.5-StepProver together with InternLM-Math-Plus) had collectively solved approximately 15.7K of these problems. Goedel-Prover-SFT increased the cumulative number of solved Lean Workbook problems to 29.7K, nearly doubling the prior count.[^1] The newly generated proofs were released publicly as part of the Lean-workbook-proofs dataset on Hugging Face, providing the community with a substantial corpus of verified Lean 4 proofs for training subsequent models.
PutnamBench is a benchmark consisting of hand-constructed formalizations of problems from the William Lowell Putnam Mathematical Competition, an annual undergraduate-level mathematics competition in North America known for its difficulty. The benchmark provides formal statements in Lean 4 and Isabelle, with a substantial subset also available in Coq.[^8] Putnam problems are considerably more difficult than typical miniF2F problems and demand creative mathematical insight beyond routine application of techniques.
On PutnamBench, Goedel-Prover-SFT solved 7 problems at Pass@512, claiming the top position on the public leaderboard at the time of release. This narrowly edged out previously listed systems including ABEL (which had solved 7 problems but at Pass@596, requiring more samples per problem) and InternLM2.5-StepProver (6 problems). While 7 out of more than 640 formalized theorems is an absolute success rate of approximately 1%, the result reflected the genuine difficulty of Putnam problems for current automated theorem provers and established a new state-of-the-art for open-source systems.[^1]
A further benchmark, ProofNet, contains 371 undergraduate-level problems drawn from analysis, topology, and abstract algebra textbooks; these problems lean heavily on Mathlib4 idioms rather than the elementary-school olympiad reasoning typical of miniF2F. On ProofNet, Goedel-Prover-SFT reached 15.2%, a modest score that the paper attributed to a distribution mismatch between competition mathematics and undergraduate analysis or topology.[^1]
deepseek-Prover-V1.5, released by DeepSeek-AI in August 2024, was the immediate predecessor on the open-source frontier of Lean 4 theorem proving.[^4] It combined pre-training of DeepSeekMath-Base, supervised fine-tuning on an enhanced theorem-proving corpus, reinforcement learning from proof-assistant feedback (RLPAF), and a Monte-Carlo tree search variant called RMaxTS. The RL-enhanced final variant, DeepSeek-Prover-V1.5-RL, achieved 63.5% Pass@N on miniF2F at high sample budgets and 25.3% on ProofNet.
Goedel-Prover-SFT, despite using only supervised fine-tuning without any reinforcement learning step in its main configuration, surpassed DeepSeek-Prover-V1.5-RL on miniF2F. The paper highlighted that 57.6% Pass@32 represented a 7.6-point lead over DeepSeek-Prover-V1.5-RL's 50.0% Pass@32, and this gap was consistent across larger sampling budgets. The fact that supervised fine-tuning on a sufficiently large synthetic dataset could outperform a system built with both reinforcement learning and proof-assistant interaction suggested that data quality and quantity, rather than algorithmic sophistication, was the binding constraint on then-current open-source theorem provers.
Goedel-Prover used DeepSeek-Prover-V1.5-Base as its initialization, taking the base model from the same lineage and demonstrating an improvement attributable to its data and iteration strategy rather than to architectural changes. This intellectual debt was acknowledged in the project's MIT-licensed code release, which noted that certain portions derive from DeepSeek-Prover-V1.5 and retain that project's copyright.
In April 2025, DeepSeek released DeepSeek-Prover-V2, a much larger 671-billion-parameter mixture-of-experts model that improved performance on miniF2F. That release would in turn be eclipsed by Goedel-Prover-V2 a few months later, as discussed below.
alphaproof is Google DeepMind's neural theorem prover, introduced in mid-2024 as a key component of DeepMind's IMO 2024 effort. It combined a pre-trained language model with reinforcement learning over a corpus of formal problems, using AlphaZero-style self-play to bootstrap proof-finding ability. AlphaProof, working alongside alphageometry 2, was reported by DeepMind to have achieved a score equivalent to a silver medal at the 2024 International Mathematical Olympiad in a controlled evaluation.
The comparison between Goedel-Prover and AlphaProof differs in kind from the comparison with DeepSeek-Prover, because AlphaProof's model weights, code, and training data have not been publicly released. Direct benchmark-to-benchmark comparison is therefore difficult: DeepMind's reported IMO performance was evaluated under specific conditions (per-problem time budgets in the hours, access to substantial compute) that differ from standard miniF2F evaluation. There is no public Pass@K number for AlphaProof on miniF2F.
What can be said is that the two efforts represented opposite poles of the AI research culture in formal mathematics. AlphaProof was a high-budget, closed-source project from a major industrial laboratory, demonstrating that AI could plausibly approach Olympiad-level mathematical reasoning given sufficient resources. Goedel-Prover was an open-source academic effort, modest in scale, designed to provide the research community with a strong baseline, public model weights, and a reproducible training pipeline. Where AlphaProof relied on extensive proprietary reinforcement learning, Goedel-Prover demonstrated that careful data curation and expert iteration could substantially close the gap from openly available resources.
The two systems are thus complementary in the public conversation about AI mathematical reasoning. AlphaProof showed what was possible at the proprietary frontier, while Goedel-Prover demonstrated how far the open-source community could push with the data and compute available to academic groups.
A defining feature of Goedel-Prover was its commitment to open release. The team published the model weights of Goedel-Prover-SFT on Hugging Face under the Goedel-LM organization, released the source code on GitHub under an MIT license (with the exception of portions inherited from DeepSeek-Prover-V1.5, which retain their original copyright), and published the Lean-workbook-proofs dataset containing the 29.7K newly discovered proofs.[^2][^9]
This release made several contributions to the research ecosystem:
The 7-billion-parameter Goedel-Prover-SFT could be run for inference on a single high-end GPU, putting state-of-the-art performance within reach of academic groups and individual researchers without access to large industrial clusters.
The 1.64-million-statement Goedel-Pset-v1 dataset and the 29.7K solved-proof dataset gave subsequent projects substantial training material for further fine-tuning, distillation, or analysis.
The expert iteration pipeline itself, including the dual-formalizer design and the FC test using an LLM judge, became a reference implementation that other groups could adapt for their own systems.
The release validated a methodology that subsequent open-source projects could improve upon, accelerating progress across the field.
By the time Goedel-Prover-V2 followed in mid-2025, the broader Lean 4 theorem-proving community had visibly expanded, with multiple open-source efforts and a steady stream of new models targeting miniF2F and PutnamBench.
Goedel-Prover-V2 was released in August 2025 with a corresponding arXiv preprint (arXiv:2508.03613) submitted on August 5, 2025.[^3] The new version was a substantial advance over the original, both in absolute performance and in the breadth of its methodological innovations. The lead authors remained Yong Lin, Shange Tang, and Bohan Lyu, joined by an expanded set of collaborators, with affiliations now including Princeton, Tsinghua, NVIDIA, Stanford, and Meta FAIR.
V2 introduced three principal innovations on top of the V1 pipeline:
Scaffolded data synthesis. Where V1 relied on autoformalization of existing problems, V2 generated synthetic intermediate-difficulty problems that bridged the gap between easy solved problems and hard unsolved ones. By creating a smoother curriculum, the team enabled the prover to gradually master more difficult theorems, mitigating the brittleness that can occur when training jumps directly from elementary problems to research-level ones.
Verifier-guided self-correction. Rather than relying solely on whole-proof generation followed by single-shot verification, V2 taught the model to iteratively revise its own proofs based on Lean compiler feedback. Two rounds of self-correction were performed within a modest additional token budget (token counts rising from roughly 32K to 40K), enabling the model to recover from local errors that would have caused a single-shot proof to fail. This represented a partial integration of step-wise feedback into what remained largely a whole-proof generation framework.
Model averaging. Late-stage training on highly curated data tends to reduce output diversity, which in turn hurts Pass@K performance. V2 addressed this by averaging checkpoints with the base model to maintain diversity while preserving the gains from late-stage training.
The training pipeline was further upgraded by moving from DeepSeek-Prover-V1.5-Base to Qwen3 base models in both 8-billion and 32-billion parameter sizes.
Two model variants were released:[^3]
Goedel-Prover-V2-8B, an 8-billion-parameter lightweight version achieving 83.3% (later improved to 84.6% in reported figures) Pass@32 on miniF2F, matching DeepSeek-Prover-V2-671B while being nearly 100 times smaller.
Goedel-Prover-V2-32B, the flagship 32-billion-parameter model reaching 88.1% Pass@32 on miniF2F in standard mode and 90.4% Pass@32 in self-correction mode, outperforming both DeepSeek-Prover-V2-671B (reported at 82.4%) and the concurrent Kimina-Prover-72B (reported at 84.0%).
On PutnamBench, the flagship 32B model solved 64 problems at Pass@64, taking the top open-source ranking and surpassing DeepSeek-Prover-V2's 47 problems solved at Pass@1024. The team also released a new evaluation suite called MathOlympiadBench focused on Olympiad-style problems.
V2 model weights were released under the Apache 2.0 license, a change from the MIT license used for V1's released code. The Princeton AI news service highlighted that the V2 release improved miniF2F performance from roughly 60% to roughly 90% in the span of six months between releases, a substantial pace of improvement for the field.[^10]
Despite the rapid progress represented by Goedel-Prover and its successor, several limitations and caveats deserve note.
First, the strong miniF2F scores reflect competition-style problems drawn from AIME, AMC, IMO, and high-school or undergraduate sources. Performance on ProofNet, which targets undergraduate analysis and topology with heavy reliance on Mathlib4, was much weaker for V1 (15.2%). The distribution of problems on which a model performs well depends sensitively on its training distribution, and benchmark scores should not be over-generalized to the full breadth of mathematics.
Second, formal proofs found by Goedel-Prover are accepted by Lean's verifier, which guarantees logical correctness with respect to the formal statement, but the autoformalization step that produces the formal statement from a natural-language problem is itself fallible. The Faithfulness and Completeness test catches many errors but cannot eliminate the possibility that a generated formal statement subtly misrepresents the original problem. The community's interpretation of such mismatches has been an ongoing topic of discussion.
Third, the Pass@K methodology rewards models that can produce one correct proof among many attempts, which is appropriate for some use cases (a single verified proof is sufficient) but does not directly measure efficiency. Pass@32 and Pass@512 numbers should be interpreted in light of the substantial inference compute they imply.
Fourth, while Goedel-Prover is open-source and runnable on modest hardware, its dependencies on the Lean compiler, Mathlib, and proper environment configuration mean that reproducing results requires care. The Lean ecosystem itself, including the parallel evolution of Lean 4 and Mathlib, can affect proof verification over time.
Finally, the open-source theorem-proving landscape continues to evolve rapidly. Models that were state-of-the-art at one moment can be eclipsed within months. By late 2025 and into 2026, the field continued to see frequent releases of new provers building on the lessons of Goedel-Prover and its peers, and any claim of state-of-the-art performance is best read as a snapshot at the time of writing.