DeepSeek-Prover
Last reviewed
May 21, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 ยท 4,096 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 21, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 ยท 4,096 words
Add missing citations, update stale details, or suggest a clearer explanation.
DeepSeek-Prover is a family of open-weight large language models developed by Chinese AI laboratory DeepSeek for formal theorem proving in the Lean 4 proof assistant. The series began in May 2024 with DeepSeek-Prover-V1, which fine-tuned a 7B base model on roughly 8 million synthetic Lean 4 statement-proof pairs derived from competition mathematics.[^1] DeepSeek-Prover-V1.5 (August 2024) added reinforcement learning from proof assistant feedback (RLPAF) and a Monte Carlo tree search variant called RMaxTS, reaching state-of-the-art pass rates on the miniF2F and ProofNet benchmarks.[^2] DeepSeek-Prover-V2 (April 2025) further combined recursive subgoal decomposition by DeepSeek-V3 with GRPO reinforcement learning to deliver a 671B-parameter model that solves 88.9% of MiniF2F-test and 49 of 658 PutnamBench problems.[^3] The line of work is among the most cited in open-source neural theorem proving, alongside DeepMind's AlphaProof and EleutherAI's Llemma.[^4][^5]
| Attribute | Detail |
|---|---|
| Developer | DeepSeek (DeepSeek-AI) |
| Domain | Formal theorem proving in Lean 4 |
| First release | DeepSeek-Prover-V1 (23 May 2024)[^1] |
| Latest release | DeepSeek-Prover-V2 (30 April 2025)[^3][^6] |
| Architecture (V1, V1.5, V2-7B) | Dense 7B transformer initialized from DeepSeekMath-Base or DeepSeek-Prover-V1.5-Base[^2][^7] |
| Architecture (V2-671B) | Mixture-of-Experts transformer initialized from DeepSeek-V3-Base[^7] |
| Context length (V2) | 32,768 tokens[^7] |
| Notable result | 88.9% MiniF2F-test pass ratio with V2-671B + chain of thought + Pass@8192[^3] |
| Code license | MIT (V1.5 repository)[^8] |
| Weights | Available on Hugging Face under the DeepSeek License (commercial use permitted)[^9][^7] |
Interactive theorem provers such as Lean 4 require proofs to be expressed as machine-verifiable tactic sequences over a formal type theory. Although LLMs had shown strong performance on natural language math benchmarks, transferring that capability to formal proof generation was bottlenecked by the small size of human-written corpora such as the Lean standard library Mathlib. By contrast, training corpora for informal mathematics, like the MATH benchmark and online competition archives, were orders of magnitude larger.[^1]
Earlier neural provers, including OpenAI's GPT-f, Meta's HyperTree Proof Search, and the ReProver retrieval-augmented system, established a baseline on the miniF2F dataset of 488 Olympiad and competition problems formalized for Lean, Metamath, Isabelle, and HOL Light by Kunhao Zheng, Jesse Michael Han, and Stanislas Polu in 2021.[^10] DeepSeek-Prover-V1 was the first model to publish a clear lead on Lean 4 miniF2F by means of massive synthetic data rather than tree search alone.[^1]
The series is closely tied to DeepSeek's broader math-reasoning stack: DeepSeekMath-Base provides the pre-training initialization for V1 and V1.5, and DeepSeek-V3 supplies the natural-language reasoning teacher for V2's subgoal decomposition.[^1][^2][^3]
Lean 4 functions as an interactive theorem prover where each step of a proof is checked against a strict dependent type theory; an LLM-generated proof either type-checks or it does not, making proof verification a clean, automatable reward signal for reinforcement learning. This contrasts with informal math reasoning, where evaluating partial credit on a chain of thought requires a learned reward model or a heuristic grader. The DeepSeek-Prover line explicitly treats the Lean kernel as a frozen environment whose binary verdict drives policy updates, similar in spirit to how the AlphaZero family used a fixed game rule set as ground truth.[^2][^3] This framing also explains why the team emphasizes whole-proof generation with truncate-and-resume rather than purely step-level tactic prediction: the LLM is allowed to write long Lean programs and let the kernel cut them at the first error.[^2]
DeepSeek had already shipped DeepSeekMath, a math-focused 7B model, before starting on Prover. That model contributed both the DeepSeekMath-Base initialization and the GRPO algorithm later used in Prover's RL stage. The continuity matters because V1 and V1.5 inherit the math-token pre-training and tokenizer from DeepSeekMath, while V2 swaps in DeepSeek-V3 as a more capable backbone and informal reasoner.[^1][^2][^3]
DeepSeek-Prover-V1, posted to arXiv on 23 May 2024 by Huajian Xin, Daya Guo, Zhihong Shao, Zhizhou Ren, Qihao Zhu, Bo Liu, Chong Ruan, Wenda Li, and Xiaodan Liang, focused on closing the data gap for Lean 4.[^1]
The team scraped 869,659 high-school and undergraduate-level natural-language competition problems, with an emphasis on algebra and number theory but also covering combinatorics, geometry, and statistics.[^11] A DeepSeekMath-Base 7B model fine-tuned on the MMA dataset of Lean mathlib statements was used to translate each natural-language problem into a candidate Lean 4 core declaration.[^11]
To remove noisy or vacuous statements, the pipeline applied:
False from each statement's hypotheses; success indicated inconsistent assumptions and the statement was removed. The paper illustrates this with a contradictory complex-number example.[^11]After filtering, 712,073 high-quality formal statements remained, from which roughly 8 million statement-proof pairs were generated using proof search and self-distillation.[^1][^11]
DeepSeekMath-Base 7B was supervised-fine-tuned on this synthetic corpus.[^1] On Lean 4 miniF2F-test (244 problems), DeepSeek-Prover-V1 achieved 46.3% pass@64 and 52.0% cumulative accuracy, substantially exceeding the GPT-4-turbo 0409 baseline of 23.0% pass@64 and the prior HyperTree Proof Search tree-search baseline of 41.0%.[^1] On the Formalized International Mathematical Olympiad (FIMO) benchmark of 148 problems, DeepSeek-Prover-V1 proved 5 problems with 4,096 attempts, while GPT-4 proved none.[^1]
A central empirical claim of the paper is that performance on miniF2F improves roughly proportionally to exponential increases in the synthetic dataset size, which the team treats as evidence that data scaling is the dominant bottleneck in neural theorem proving at this stage.[^1][^11] Models trained on autoformalized competition data also outperform models trained only on Mathlib by a wide margin, suggesting that the topical distribution of formal corpora matters as much as raw size.[^11]
The supervised model and the V1 dataset were released on Hugging Face as deepseek-ai/DeepSeek-Prover-V1, with the dataset itself becoming a reusable resource for follow-up theorem-proving research.[^12]
DeepSeek-Prover-V1.5, posted to arXiv on 15 August 2024, kept the same 7B parameter budget but introduced major changes in fine-tuning, reinforcement learning, and inference-time search.[^2]
Pre-training continued from DeepSeekMath-Base on a corpus combining code and natural-language math, with extra weighting on formal languages including Lean, Isabelle, and Metamath.[^2] The supervised fine-tuning dataset was expanded to 9,645k sequences, with two key augmentations relative to V1:[^2]
/- tactic state: ... -/ comments. During training, loss was computed only on tokens following the state markers, teaching the model to plan around verifier feedback.[^2]The RL phase, dubbed RLPAF, used the Group Relative Policy Optimization (GRPO) algorithm introduced in DeepSeekMath.[^2] Approximately 4.5k unique theorem statements where the SFT model achieved moderate success were used as prompts. For each prompt, 32 candidate proofs of up to 2,048 tokens were sampled, with a binary reward (1 if verified by Lean, 0 otherwise), a learning rate of 5e-6, and a KL penalty of 0.02 against the SFT reference policy.[^2]
DeepSeek-Prover-V1.5 introduced RMaxTS, a Monte Carlo tree search variant designed to encourage diverse proof exploration:[^2]
The tree policy and intrinsic-reward design tie RMaxTS to the lineage of Monte Carlo Tree Search methods that drove AlphaZero and AlphaGo, adapted to the formal-proof setting where leaf evaluation is binary verifier output.[^2]
The motivation for RMax-style intrinsic rewards in this setting is that the extrinsic reward (Lean verifying a complete proof) is extremely sparse, especially for harder problems where the proof tree may need to explore thousands of nodes before any leaf succeeds. A pure success-only reward provides no signal in the meantime, so the agent has no incentive to visit new states. By rewarding tree-expansion novelty, RMaxTS produces densely supervised tree growth even when no proof has been found, which the authors argue is critical for diversifying proof attempts beyond the most likely tactic at each step.[^2] The discount factor of 0.99 ensures that as the agent saturates the tree, the value of further exploration decays smoothly rather than collapsing.[^2]
DeepSeek-Prover-V1.5 set state-of-the-art numbers on both Olympiad-level miniF2F and undergraduate ProofNet at the time of release.[^2][^9] The progression of variants on the official model card is summarized below.[^9]
| Model | miniF2F-test | ProofNet test |
|---|---|---|
| ReProver | 26.5% | 13.8% |
| GPT-f | 36.6% | not reported |
| HyperTree Proof Search | 41.0% | not reported |
| InternLM2-StepProver | 54.5% | 18.1% |
| DeepSeek-Prover-V1 | 50.0% | not reported |
| DeepSeek-Prover-V1.5-Base | 42.2% | 13.2% |
| DeepSeek-Prover-V1.5-SFT | 57.4% | 22.9% |
| DeepSeek-Prover-V1.5-RL | 60.2% | 22.6% |
| DeepSeek-Prover-V1.5-RL + RMaxTS | 63.5% | 25.3% |
All three weight checkpoints (Base, SFT, RL) were released on Hugging Face with code under the MIT license and weights under the DeepSeek model license.[^9][^8]
DeepSeek-Prover-V2 was released on 30 April 2025, with the corresponding arXiv preprint "DeepSeek-Prover-V2: Advancing Formal Mathematical Reasoning via Reinforcement Learning for Subgoal Decomposition" (arXiv:2504.21801) co-authored by 18 DeepSeek researchers led by Z. Z. Ren and Zhihong Shao.[^3][^6][^13]
Two checkpoints were released:[^7][^14]
The training pipeline starts with DeepSeek-V3 acting as both a natural-language decomposer and a formal-statement writer.[^3] Given a target theorem, V3 is prompted to:
have block, with the rest of the proof deferred via sorry.[^3]A specialized 7B prover (the DeepSeek-Prover-V1.5-style model) then attempts to close each subgoal independently. Once all subgoals are resolved, the verified Lean 4 fragments are composed into a complete formal proof, paired with V3's chain of thought to form a single training example combining informal and formal reasoning.[^3]
This division of labor mirrors a classic human strategy: a senior mathematician sketches the high-level structure of a proof while a junior collaborator or student fills in the routine algebraic and combinatorial details. Here the "senior" role is played by a 671B mixture-of-experts model that has been trained broadly on math and code, while the "junior" role is filled by a tightly specialized 7B Lean tactician. Because the subgoals are independent, the 7B prover can be parallelized across subgoals, and successful closures can be cached and re-used in future training rounds.[^3] When all subgoals close, the proof is recomposed; when some do not, the pipeline can retry or escalate to a different decomposition.[^3]
The cold-start corpus produced by recursive decomposition initializes the unified V2 model so that it can interleave natural-language reasoning with formal Lean steps in a single output.[^3] Reinforcement learning then uses Group Relative Policy Optimization (GRPO) with binary correctness rewards from the Lean kernel as the primary supervision, sampling groups of candidate proofs and optimizing on relative reward differences without a separate critic.[^3] The CoT mode systematically articulates intermediate reasoning steps before constructing the final formal proof.[^3]
A practical contribution of V2 is the explicit separation of a chain-of-thought mode, which interleaves natural-language reasoning with Lean tactics, from a non-CoT mode that emits Lean directly. The paper reports a substantial performance gap in favor of CoT, supporting the broader hypothesis that informal scratch reasoning helps an LLM plan formal proofs even when the final artifact must be machine-verified.[^3] The 671B model in particular benefits from CoT at very high sample budgets, which is one of the reasons the headline 88.9% miniF2F number is reported at Pass@8192 with CoT enabled.[^3]
On MiniF2F-test, DeepSeek-Prover-V2-671B with chain of thought and Pass@8192 sampling reaches an 88.9% pass ratio, the highest reported result for a neural prover at the time.[^3][^14] On the historically very difficult PutnamBench (William Lowell Putnam Mathematical Competition problems formalized in Lean), V2-671B solves 49 out of 658 problems, with the 7B model contributing an additional 13 problems using different proving techniques for a combined total of 62.[^3][^14] On ProofNet-test, the 671B CoT model achieves a 37.1% pass rate at Pass@1024.[^3]
Alongside the models, DeepSeek introduced ProverBench, a benchmark of 325 problems including 15 formalized AIME problems from the 2024 and 2025 competitions, plus 310 textbook problems spanning number theory, elementary algebra, linear algebra, abstract algebra, calculus, real analysis, complex analysis, functional analysis, and probability.[^7] On the 15 AIME problems, V2 solved 6 in proof-search mode, while general-purpose DeepSeek-V3 with majority voting solved 8 when asked for natural-language answers, illustrating that formal proving remains harder than answer extraction.[^3][^15]
The model card lists tensor types BF16, F8_E4M3, and F32 for V2-671B, with deployment support documented for vLLM, SGLang, Docker Model Runner, and llama.cpp/Ollama/LM Studio quantizations.[^14]
A recurring theme across V1 through V2 is autoformalization: translating natural-language theorem statements into Lean 4. DeepSeek's pipeline pairs a fine-tuned DeepSeekMath translator with model-scored quality filters and hypothesis-rejection checks, then refines the translator using accepted statements.[^1][^11] The same iterative formalization-then-proving loop underlies the Lean Workbook project, a large-scale Lean problem set formalized from natural-language math problems published by Huajian Xin and collaborators at NeurIPS 2024 (arXiv:2406.03847), some of whose statements feed into the DeepSeek-Prover pipelines.[^16]
The miniF2F benchmark serves as the canonical comparison point. It consists of 488 problem statements drawn from the AIME, AMC, and IMO competitions, plus high-school and undergraduate course problems, formalized across multiple proof systems by Zheng, Han, and Polu in 2021 and updated since.[^10] The undergraduate ProofNet benchmark and the harder PutnamBench complement miniF2F at higher levels of difficulty.[^2][^3]
A subtle issue with autoformalized statements is that an inconsistent set of hypotheses makes any conclusion trivially provable in Lean (exfalso plus a derivation of False). Without filtering, large fractions of a synthetic corpus would be vacuously satisfiable theorems that teach the model nothing useful. DeepSeek-Prover-V1's hypothesis-rejection step attempts to prove False from each candidate statement's hypotheses; success indicates inconsistency and the statement is discarded.[^11] The paper illustrates this with a complex-number example that combines mutually exclusive constraints, which the system correctly identifies as impossible.[^11] This kind of corpus hygiene is one of the practical contributions that the V1 paper highlights as essential to making synthetic data actually useful at scale.[^1][^11]
The autoformalizer is itself improved across iterations: statements that pass quality filtering and are proved successfully become training data for a refined version of the translator, which then generates higher-quality candidates in the next round. The V1 paper reports that this iterative loop is what allows performance on miniF2F to scale roughly log-linearly with the size of the curated corpus, rather than plateauing once initial low-hanging-fruit statements are exhausted.[^1][^11]
DeepSeek-Prover occupies a specific niche among neural theorem provers: open weights, Lean 4 native, whole-proof or recursive-subgoal generation, and an explicit reinforcement-learning loop against the proof assistant.
| System | Lab | Year | Backbone | Formal target | Approach | Best public result |
|---|---|---|---|---|---|---|
| DeepSeek-Prover-V1 | DeepSeek | 2024 | DeepSeekMath 7B | Lean 4 | Large-scale synthetic SFT[^1] | 50.0% miniF2F (Lean 4)[^9] |
| DeepSeek-Prover-V1.5 | DeepSeek | 2024 | DeepSeekMath-Base 7B | Lean 4 | SFT + RLPAF + RMaxTS[^2] | 63.5% miniF2F[^2] |
| DeepSeek-Prover-V2 | DeepSeek | 2025 | DeepSeek-V3-Base (671B MoE) | Lean 4 | Recursive decomposition + GRPO[^3] | 88.9% miniF2F[^3] |
| AlphaProof | Google DeepMind | 2024 | Gemini + AlphaZero-style RL | Lean | Test-time RL on millions of problem variants[^4][^17] | Silver-medal score 28/42 at 2024 IMO (with AlphaGeometry 2)[^4] |
| AlphaGeometry / AlphaGeometry 2 | Google DeepMind | 2024 | Gemini + symbolic engine | Bespoke geometry DSL | Neuro-symbolic hybrid[^4] | Solved IMO 2024 problem 4 in 19 seconds[^4] |
| Llemma (7B, 34B) | Princeton, EleutherAI, et al. | 2023 | Code Llama + Proof-Pile-2 | Tool use, Isabelle, Lean (no fine-tuning) | Continued pre-training | Open base-model SOTA on MATH at time of release[^5] |
AlphaProof couples a Gemini language model with formal Lean verification and uses an AlphaZero-style reinforcement learning loop, including test-time RL that generates problem variants for each Olympiad problem. Combined with AlphaGeometry 2 on the 2024 International Mathematical Olympiad, the pair scored 28 of 42 points, equivalent to a silver medal.[^4][^17] AlphaProof's weights and code are not public; DeepSeek-Prover by contrast releases all checkpoints under permissive terms.[^7][^14]
Llemma (arXiv:2310.10631, Azerbayev et al., ICLR 2024) is an open language model for mathematics trained by continued pre-training of Code Llama on the Proof-Pile-2 corpus, available in 7B and 34B variants.[^5] Llemma is a general math model with strong tool use and basic formal-prover capability without fine-tuning, while DeepSeek-Prover specializes specifically in Lean 4 proof generation.[^5][^1]
A useful way to characterize the design space is along two axes: (1) how strongly the system is specialized to a particular proof assistant, and (2) how heavily it relies on autoregressive whole-proof generation versus search. DeepSeek-Prover-V1 sits at the high-specialization, low-search corner (synthetic SFT + single-pass sampling). DeepSeek-Prover-V1.5 keeps high specialization but adds explicit RMaxTS search. DeepSeek-Prover-V2 sits at high specialization plus explicit recursive subgoal decomposition by a separate teacher model. AlphaProof occupies a similar corner with more aggressive test-time RL. Llemma, by contrast, is closer to a general-purpose math model that can be used for formal proving but is not optimized for it.[^1][^2][^3][^4][^5]
DeepSeek-Prover models have been used or proposed for:
Despite strong benchmark numbers, several limitations of the series are acknowledged or evident from the papers and model cards:
have blocks. Errors in either step propagate into the synthetic corpus and ultimately into the trained model, making the quality of the teacher a hard upper bound on V2.[^3]DeepSeek-Prover popularized two ideas that have become standard in open-source formal reasoning research: (1) using LLMs to autoformalize large competition-math corpora into Lean and then training on the resulting synthetic data, and (2) closing the loop with reinforcement learning against the proof assistant itself rather than against a learned reward model.[^1][^2] V2's recursive subgoal decomposition further showed that a strong informal reasoner like DeepSeek-V3 can be paired with a smaller specialized prover to produce training data and final proofs that combine natural-language planning with formal verification.[^3]
The series has been positioned in the press as China's most visible open-source counterpart to AlphaProof, with secondary coverage emphasizing the gap in openness between the two efforts and the practical implications for the research community.[^15][^17]
A second strand of significance is methodological. By demonstrating that GRPO works well in a setting where rewards are exactly binary and verifier-supplied, the V1.5 and V2 papers helped consolidate GRPO as a standard recipe for RL on tasks with cheap, deterministic, correct/incorrect oracles. This pattern was subsequently picked up in other DeepSeek work and in the broader open-source post-training literature, where verifier-style rewards are increasingly preferred over learned reward models for tasks that admit them.[^2][^3] The Prover series also provides a clean illustration of how a smaller specialized expert (the 7B prover) and a much larger generalist reasoner (DeepSeek-V3) can compose into a pipeline that outperforms either individually, a pattern related to but distinct from classical Mixture-of-Experts routing.[^3]