# DeepSeek-Prover

> Source: https://aiwiki.ai/wiki/deepseek_prover
> Updated: 2026-06-09
> Categories: Chinese AI, Large Language Models, Reasoning Models
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

# DeepSeek-Prover

**DeepSeek-Prover** is a family of open-weight large language models developed by Chinese AI laboratory [DeepSeek](/wiki/deepseek) for formal theorem proving in the Lean 4 proof assistant. The series began in May 2024 with DeepSeek-Prover-V1, which fine-tuned a 7B base model on roughly 8 million synthetic Lean 4 statement-proof pairs derived from competition mathematics.[^1] DeepSeek-Prover-V1.5 (August 2024) added reinforcement learning from proof assistant feedback (RLPAF) and a Monte Carlo tree search variant called RMaxTS, reaching state-of-the-art pass rates on the miniF2F and ProofNet benchmarks.[^2] DeepSeek-Prover-V2 (April 2025) further combined recursive subgoal decomposition by [DeepSeek-V3](/wiki/deepseek_v3) with [GRPO](/wiki/grpo) reinforcement learning to deliver a 671B-parameter model that solves 88.9% of MiniF2F-test and 49 of 658 PutnamBench problems.[^3] The line of work is among the most cited in open-source neural theorem proving, alongside DeepMind's AlphaProof and EleutherAI's Llemma.[^4][^5]

## Infobox

| Attribute | Detail |
|---|---|
| Developer | [DeepSeek](/wiki/deepseek) (DeepSeek-AI) |
| Domain | Formal theorem proving in Lean 4 |
| First release | DeepSeek-Prover-V1 (23 May 2024)[^1] |
| Latest release | DeepSeek-Prover-V2 (30 April 2025)[^3][^6] |
| Architecture (V1, V1.5, V2-7B) | Dense 7B transformer initialized from DeepSeekMath-Base or DeepSeek-Prover-V1.5-Base[^2][^7] |
| Architecture (V2-671B) | Mixture-of-Experts transformer initialized from DeepSeek-V3-Base[^7] |
| Context length (V2) | 32,768 tokens[^7] |
| Notable result | 88.9% MiniF2F-test pass ratio with V2-671B + chain of thought + Pass@8192[^3] |
| Code license | MIT (V1.5 repository)[^8] |
| Weights | Available on Hugging Face under the DeepSeek License (commercial use permitted)[^9][^7] |

## Background

Interactive theorem provers such as Lean 4 require proofs to be expressed as machine-verifiable tactic sequences over a formal type theory. Although [LLMs](/wiki/llm) had shown strong performance on natural language [math](/wiki/math) benchmarks, transferring that capability to formal proof generation was bottlenecked by the small size of human-written corpora such as the Lean standard library [Mathlib](/wiki/mathlib). By contrast, training corpora for informal mathematics, like the MATH benchmark and online competition archives, were orders of magnitude larger.[^1]

Earlier neural provers, including OpenAI's GPT-f, Meta's HyperTree Proof Search, and the ReProver retrieval-augmented system, established a baseline on the miniF2F dataset of 488 Olympiad and competition problems formalized for Lean, Metamath, Isabelle, and HOL Light by Kunhao Zheng, Jesse Michael Han, and [Stanislas Polu](/wiki/stanislas_polu) in 2021.[^10] DeepSeek-Prover-V1 was the first model to publish a clear lead on Lean 4 miniF2F by means of massive synthetic data rather than tree search alone.[^1]

The series is closely tied to DeepSeek's broader [math](/wiki/math)-reasoning stack: DeepSeekMath-Base provides the pre-training initialization for V1 and V1.5, and [DeepSeek-V3](/wiki/deepseek_v3) supplies the natural-language reasoning teacher for V2's subgoal decomposition.[^1][^2][^3]

### Lean 4 and the prover-as-environment

Lean 4 functions as an interactive theorem prover where each step of a proof is checked against a strict dependent type theory; an LLM-generated proof either type-checks or it does not, making proof verification a clean, automatable reward signal for [reinforcement learning](/wiki/reinforcement_learning). This contrasts with informal math reasoning, where evaluating partial credit on a chain of thought requires a learned reward model or a heuristic grader. The DeepSeek-Prover line explicitly treats the Lean kernel as a frozen environment whose binary verdict drives policy updates, similar in spirit to how the AlphaZero family used a fixed game rule set as ground truth.[^2][^3] This framing also explains why the team emphasizes whole-proof generation with truncate-and-resume rather than purely step-level tactic prediction: the LLM is allowed to write long Lean programs and let the kernel cut them at the first error.[^2]

### Pre-Prover work at DeepSeek

DeepSeek had already shipped DeepSeekMath, a math-focused 7B model, before starting on Prover. That model contributed both the DeepSeekMath-Base initialization and the GRPO algorithm later used in Prover's RL stage. The continuity matters because V1 and V1.5 inherit the math-token pre-training and tokenizer from DeepSeekMath, while V2 swaps in [DeepSeek-V3](/wiki/deepseek_v3) as a more capable backbone and informal reasoner.[^1][^2][^3]

## DeepSeek-Prover-V1

DeepSeek-Prover-V1, posted to arXiv on 23 May 2024 by Huajian Xin, Daya Guo, Zhihong Shao, Zhizhou Ren, Qihao Zhu, Bo Liu, Chong Ruan, Wenda Li, and Xiaodan Liang, focused on closing the data gap for Lean 4.[^1]

### Autoformalization pipeline

The team scraped 869,659 high-school and undergraduate-level natural-language competition problems, with an emphasis on algebra and number theory but also covering combinatorics, geometry, and statistics.[^11] A DeepSeekMath-Base 7B model fine-tuned on the MMA dataset of Lean mathlib statements was used to translate each natural-language problem into a candidate Lean 4 `core declaration`.[^11]

To remove noisy or vacuous statements, the pipeline applied:

- **Model scoring with chain-of-thought**: each candidate statement was rated "excellent," "good," "above average," "fair," or "poor," with "fair" and "poor" statements discarded.[^11]
- **Hypothesis rejection**: the system attempted to prove `False` from each statement's hypotheses; success indicated inconsistent assumptions and the statement was removed. The paper illustrates this with a contradictory complex-number example.[^11]
- **Iterative refinement**: refined models generated higher-quality statements in subsequent passes.[^11]

After filtering, 712,073 high-quality formal statements remained, from which roughly 8 million statement-proof pairs were generated using proof search and self-distillation.[^1][^11]

### Training and results

DeepSeekMath-Base 7B was supervised-fine-tuned on this synthetic corpus.[^1] On Lean 4 miniF2F-test (244 problems), DeepSeek-Prover-V1 achieved 46.3% pass@64 and 52.0% cumulative accuracy, substantially exceeding the GPT-4-turbo 0409 baseline of 23.0% pass@64 and the prior HyperTree Proof Search tree-search baseline of 41.0%.[^1] On the Formalized International Mathematical Olympiad (FIMO) benchmark of 148 problems, DeepSeek-Prover-V1 proved 5 problems with 4,096 attempts, while [GPT-4](/wiki/gpt-4) proved none.[^1]

A central empirical claim of the paper is that performance on miniF2F improves roughly proportionally to exponential increases in the synthetic dataset size, which the team treats as evidence that data scaling is the dominant bottleneck in neural theorem proving at this stage.[^1][^11] Models trained on autoformalized competition data also outperform models trained only on Mathlib by a wide margin, suggesting that the topical distribution of formal corpora matters as much as raw size.[^11]

The supervised model and the V1 dataset were released on Hugging Face as `deepseek-ai/DeepSeek-Prover-V1`, with the dataset itself becoming a reusable resource for follow-up theorem-proving research.[^12]

## DeepSeek-Prover-V1.5

DeepSeek-Prover-V1.5, posted to arXiv on 15 August 2024, kept the same 7B parameter budget but introduced major changes in fine-tuning, reinforcement learning, and inference-time search.[^2]

### Pre-training and supervised fine-tuning

Pre-training continued from DeepSeekMath-Base on a corpus combining code and natural-language math, with extra weighting on formal languages including Lean, Isabelle, and Metamath.[^2] The supervised fine-tuning dataset was expanded to 9,645k sequences, with two key augmentations relative to V1:[^2]

- **Chain-of-thought annotations**: DeepSeek-Coder V2 236B was prompted to insert natural-language reasoning comments adjacent to Lean 4 tactics, including a high-level proof sketch at the start of each proof block.[^2]
- **Tactic-state integration**: intermediate proof states returned by the Lean kernel were inlined as `/- tactic state: ... -/` comments. During training, loss was computed only on tokens following the state markers, teaching the model to plan around verifier feedback.[^2]

### Reinforcement learning from proof assistant feedback

The RL phase, dubbed RLPAF, used the Group Relative Policy Optimization ([GRPO](/wiki/grpo)) algorithm introduced in DeepSeekMath.[^2] Approximately 4.5k unique theorem statements where the SFT model achieved moderate success were used as prompts. For each prompt, 32 candidate proofs of up to 2,048 tokens were sampled, with a binary reward (1 if verified by Lean, 0 otherwise), a learning rate of 5e-6, and a KL penalty of 0.02 against the SFT reference policy.[^2]

### RMaxTS Monte Carlo tree search

DeepSeek-Prover-V1.5 introduced RMaxTS, a Monte Carlo tree search variant designed to encourage diverse proof exploration:[^2]

- **Truncate-and-resume**: a whole-proof generation is submitted to Lean, then truncated at the first compiler error. The verified prefix becomes the new prompt, with the latest tactic state appended as a comment, and generation resumes.[^2]
- **Intrinsic RMax rewards**: following the RMax algorithm family, the agent receives a reward of 1 whenever expansion adds at least one new node to the search tree, encouraging exploration of novel proof states. There is no extrinsic shaping beyond verifier success.[^2]
- **Discounted UCB**: a discounted Upper Confidence Bound tree policy with discount factor gamma = 0.99 balances exploitation and exploration, accounting for non-stationary intrinsic rewards.[^2]

The tree policy and intrinsic-reward design tie RMaxTS to the lineage of [Monte Carlo Tree Search](/wiki/monte_carlo_tree_search) methods that drove [AlphaZero](/wiki/alphazero) and [AlphaGo](/wiki/alphago), adapted to the formal-proof setting where leaf evaluation is binary verifier output.[^2]

### Why intrinsic rewards

The motivation for RMax-style intrinsic rewards in this setting is that the extrinsic reward (Lean verifying a complete proof) is extremely sparse, especially for harder problems where the proof tree may need to explore thousands of nodes before any leaf succeeds. A pure success-only reward provides no signal in the meantime, so the agent has no incentive to visit new states. By rewarding tree-expansion novelty, RMaxTS produces densely supervised tree growth even when no proof has been found, which the authors argue is critical for diversifying proof attempts beyond the most likely tactic at each step.[^2] The discount factor of 0.99 ensures that as the agent saturates the tree, the value of further exploration decays smoothly rather than collapsing.[^2]

### Benchmarks

DeepSeek-Prover-V1.5 set state-of-the-art numbers on both Olympiad-level miniF2F and undergraduate ProofNet at the time of release.[^2][^9] The progression of variants on the official model card is summarized below.[^9]

| Model | miniF2F-test | ProofNet test |
|---|---|---|
| ReProver | 26.5% | 13.8% |
| GPT-f | 36.6% | not reported |
| HyperTree Proof Search | 41.0% | not reported |
| InternLM2-StepProver | 54.5% | 18.1% |
| DeepSeek-Prover-V1 | 50.0% | not reported |
| DeepSeek-Prover-V1.5-Base | 42.2% | 13.2% |
| DeepSeek-Prover-V1.5-SFT | 57.4% | 22.9% |
| DeepSeek-Prover-V1.5-RL | 60.2% | 22.6% |
| DeepSeek-Prover-V1.5-RL + RMaxTS | 63.5% | 25.3% |

All three weight checkpoints (Base, SFT, RL) were released on Hugging Face with code under the MIT license and weights under the DeepSeek model license.[^9][^8]

## DeepSeek-Prover-V2

DeepSeek-Prover-V2 was released on 30 April 2025, with the corresponding arXiv preprint "DeepSeek-Prover-V2: Advancing Formal Mathematical Reasoning via Reinforcement Learning for Subgoal Decomposition" (arXiv:2504.21801) co-authored by 18 DeepSeek researchers led by Z. Z. Ren and Zhihong Shao.[^3][^6][^13]

### Architecture

Two checkpoints were released:[^7][^14]

- **DeepSeek-Prover-V2-7B**: a dense 7B model initialized from DeepSeek-Prover-V1.5-Base with the context length extended from 4,096 to 32,768 tokens.[^7]
- **DeepSeek-Prover-V2-671B**: a [Mixture of Experts](/wiki/mixture_of_experts) model that shares the [DeepSeek-V3](/wiki/deepseek_v3) architecture, listed on the Hugging Face card with 685B total parameters and 671B in the model name.[^14]

### Recursive subgoal decomposition

The training pipeline starts with [DeepSeek-V3](/wiki/deepseek_v3) acting as both a natural-language decomposer and a formal-statement writer.[^3] Given a target theorem, V3 is prompted to:

1. Produce a high-level proof sketch in natural language.[^3]
2. Translate each sketch step into a corresponding Lean 4 subgoal `have` block, with the rest of the proof deferred via `sorry`.[^3]

A specialized 7B prover (the DeepSeek-Prover-V1.5-style model) then attempts to close each subgoal independently. Once all subgoals are resolved, the verified Lean 4 fragments are composed into a complete formal proof, paired with V3's chain of thought to form a single training example combining informal and formal reasoning.[^3]

This division of labor mirrors a classic human strategy: a senior mathematician sketches the high-level structure of a proof while a junior collaborator or student fills in the routine algebraic and combinatorial details. Here the "senior" role is played by a 671B mixture-of-experts model that has been trained broadly on math and code, while the "junior" role is filled by a tightly specialized 7B Lean tactician. Because the subgoals are independent, the 7B prover can be parallelized across subgoals, and successful closures can be cached and re-used in future training rounds.[^3] When all subgoals close, the proof is recomposed; when some do not, the pipeline can retry or escalate to a different decomposition.[^3]

### Cold start and reinforcement learning

The cold-start corpus produced by recursive decomposition initializes the unified V2 model so that it can interleave natural-language reasoning with formal Lean steps in a single output.[^3] Reinforcement learning then uses Group Relative Policy Optimization ([GRPO](/wiki/grpo)) with binary correctness rewards from the Lean kernel as the primary supervision, sampling groups of candidate proofs and optimizing on relative reward differences without a separate critic.[^3] The CoT mode systematically articulates intermediate reasoning steps before constructing the final formal proof.[^3]

### CoT and non-CoT modes

A practical contribution of V2 is the explicit separation of a chain-of-thought mode, which interleaves natural-language reasoning with Lean tactics, from a non-CoT mode that emits Lean directly. The paper reports a substantial performance gap in favor of CoT, supporting the broader hypothesis that informal scratch reasoning helps an LLM plan formal proofs even when the final artifact must be machine-verified.[^3] The 671B model in particular benefits from CoT at very high sample budgets, which is one of the reasons the headline 88.9% miniF2F number is reported at Pass@8192 with CoT enabled.[^3]

### Benchmarks and ProverBench

On MiniF2F-test, DeepSeek-Prover-V2-671B with chain of thought and Pass@8192 sampling reaches an 88.9% pass ratio, the highest reported result for a neural prover at the time.[^3][^14] On the historically very difficult PutnamBench (William Lowell Putnam Mathematical Competition problems formalized in Lean), V2-671B solves 49 out of 658 problems, with the 7B model contributing an additional 13 problems using different proving techniques for a combined total of 62.[^3][^14] On ProofNet-test, the 671B CoT model achieves a 37.1% pass rate at Pass@1024.[^3]

Alongside the models, DeepSeek introduced **ProverBench**, a benchmark of 325 problems including 15 formalized [AIME](/wiki/aime) problems from the [2024](/wiki/aime_2024) and [2025](/wiki/aime_2025) competitions, plus 310 textbook problems spanning number theory, elementary algebra, linear algebra, abstract algebra, calculus, real analysis, complex analysis, functional analysis, and probability.[^7] On the 15 AIME problems, V2 solved 6 in proof-search mode, while general-purpose DeepSeek-V3 with majority voting solved 8 when asked for natural-language answers, illustrating that formal proving remains harder than answer extraction.[^3][^15]

The model card lists tensor types BF16, F8_E4M3, and F32 for V2-671B, with deployment support documented for vLLM, SGLang, Docker Model Runner, and llama.cpp/Ollama/LM Studio quantizations.[^14]

## Autoformalization, Lean Workbook, and benchmark relationships

A recurring theme across V1 through V2 is autoformalization: translating natural-language theorem statements into Lean 4. DeepSeek's pipeline pairs a fine-tuned DeepSeekMath translator with model-scored quality filters and hypothesis-rejection checks, then refines the translator using accepted statements.[^1][^11] The same iterative formalization-then-proving loop underlies the Lean Workbook project, a large-scale Lean problem set formalized from natural-language math problems published by Huajian Xin and collaborators at NeurIPS 2024 (arXiv:2406.03847), some of whose statements feed into the DeepSeek-Prover pipelines.[^16]

The miniF2F benchmark serves as the canonical comparison point. It consists of 488 problem statements drawn from the AIME, AMC, and IMO competitions, plus high-school and undergraduate course problems, formalized across multiple proof systems by Zheng, Han, and Polu in 2021 and updated since.[^10] The undergraduate ProofNet benchmark and the harder PutnamBench complement miniF2F at higher levels of difficulty.[^2][^3]

### Why hypothesis rejection matters

A subtle issue with autoformalized statements is that an inconsistent set of hypotheses makes any conclusion trivially provable in Lean (`exfalso` plus a derivation of `False`). Without filtering, large fractions of a synthetic corpus would be vacuously satisfiable theorems that teach the model nothing useful. DeepSeek-Prover-V1's hypothesis-rejection step attempts to prove `False` from each candidate statement's hypotheses; success indicates inconsistency and the statement is discarded.[^11] The paper illustrates this with a complex-number example that combines mutually exclusive constraints, which the system correctly identifies as impossible.[^11] This kind of corpus hygiene is one of the practical contributions that the V1 paper highlights as essential to making synthetic data actually useful at scale.[^1][^11]

### Iterative improvement of the translator

The autoformalizer is itself improved across iterations: statements that pass quality filtering and are proved successfully become training data for a refined version of the translator, which then generates higher-quality candidates in the next round. The V1 paper reports that this iterative loop is what allows performance on miniF2F to scale roughly log-linearly with the size of the curated corpus, rather than plateauing once initial low-hanging-fruit statements are exhausted.[^1][^11]

## Comparison with related systems

DeepSeek-Prover occupies a specific niche among neural theorem provers: open weights, Lean 4 native, whole-proof or recursive-subgoal generation, and an explicit reinforcement-learning loop against the proof assistant.

| System | Lab | Year | Backbone | Formal target | Approach | Best public result |
|---|---|---|---|---|---|---|
| DeepSeek-Prover-V1 | [DeepSeek](/wiki/deepseek) | 2024 | DeepSeekMath 7B | Lean 4 | Large-scale synthetic SFT[^1] | 50.0% miniF2F (Lean 4)[^9] |
| DeepSeek-Prover-V1.5 | [DeepSeek](/wiki/deepseek) | 2024 | DeepSeekMath-Base 7B | Lean 4 | SFT + RLPAF + RMaxTS[^2] | 63.5% miniF2F[^2] |
| DeepSeek-Prover-V2 | [DeepSeek](/wiki/deepseek) | 2025 | DeepSeek-V3-Base (671B MoE) | Lean 4 | Recursive decomposition + GRPO[^3] | 88.9% miniF2F[^3] |
| AlphaProof | [Google DeepMind](/wiki/google_deepmind) | 2024 | Gemini + AlphaZero-style RL | Lean | Test-time RL on millions of problem variants[^4][^17] | Silver-medal score 28/42 at 2024 IMO (with [AlphaGeometry 2](/wiki/alphageometry_2))[^4] |
| [AlphaGeometry](/wiki/alphageometry) / [AlphaGeometry 2](/wiki/alphageometry_2) | [Google DeepMind](/wiki/google_deepmind) | 2024 | Gemini + symbolic engine | Bespoke geometry DSL | Neuro-symbolic hybrid[^4] | Solved IMO 2024 problem 4 in 19 seconds[^4] |
| Llemma (7B, 34B) | Princeton, EleutherAI, et al. | 2023 | Code Llama + Proof-Pile-2 | Tool use, Isabelle, Lean (no fine-tuning) | Continued pre-training | Open base-model SOTA on MATH at time of release[^5] |

[AlphaProof](/wiki/alphaproof) couples a Gemini language model with formal Lean verification and uses an AlphaZero-style reinforcement learning loop, including test-time RL that generates problem variants for each Olympiad problem. Combined with [AlphaGeometry 2](/wiki/alphageometry_2) on the 2024 International Mathematical Olympiad, the pair scored 28 of 42 points, equivalent to a silver medal.[^4][^17] AlphaProof's weights and code are not public; DeepSeek-Prover by contrast releases all checkpoints under permissive terms.[^7][^14]

Llemma (arXiv:2310.10631, Azerbayev et al., ICLR 2024) is an open language model for mathematics trained by continued pre-training of Code Llama on the Proof-Pile-2 corpus, available in 7B and 34B variants.[^5] Llemma is a general math model with strong tool use and basic formal-prover capability without fine-tuning, while DeepSeek-Prover specializes specifically in Lean 4 proof generation.[^5][^1]

A useful way to characterize the design space is along two axes: (1) how strongly the system is specialized to a particular proof assistant, and (2) how heavily it relies on autoregressive whole-proof generation versus search. DeepSeek-Prover-V1 sits at the high-specialization, low-search corner (synthetic SFT + single-pass sampling). DeepSeek-Prover-V1.5 keeps high specialization but adds explicit RMaxTS search. DeepSeek-Prover-V2 sits at high specialization plus explicit recursive subgoal decomposition by a separate teacher model. AlphaProof occupies a similar corner with more aggressive test-time RL. Llemma, by contrast, is closer to a general-purpose math model that can be used for formal proving but is not optimized for it.[^1][^2][^3][^4][^5]

## Applications

DeepSeek-Prover models have been used or proposed for:

- Generating Lean 4 proofs for textbook problems and Olympiad statements in Mathlib-compatible form.[^7][^14]
- Producing synthetic data that other groups use to fine-tune their own theorem provers, since the V1 statement-proof corpus is published on Hugging Face.[^12]
- Serving as a research baseline in subsequent neural-prover papers, including studies revisiting miniF2F's limitations.[^18]
- Pedagogical demonstrations of how reinforcement learning interacts with formal verifiers, often cited alongside [GRPO](/wiki/grpo) and AlphaZero-style RL in the [reinforcement learning](/wiki/reinforcement_learning) literature.[^2][^3]
- Educational tooling: the published V2 ZIP archive of generated miniF2F solutions allows students and researchers to study machine-found Lean proofs directly, without re-running expensive inference.[^14]
- Deployment in standard inference stacks: the V2 model card documents vLLM and SGLang serving, plus llama.cpp quantizations, making 7B Prover usage feasible on a single high-end consumer GPU.[^14]

## Limitations

Despite strong benchmark numbers, several limitations of the series are acknowledged or evident from the papers and model cards:

- **Benchmark saturation and contamination concerns**: a 2025 study, "miniF2F-Lean Revisited" (arXiv:2511.03108), audits the benchmark for ambiguities, broken statements, and possible training-data leakage, suggesting headline numbers should be interpreted with care.[^18]
- **Compute-heavy inference**: top-line miniF2F numbers require very large sample budgets (Pass@8192 for V2-671B; RMaxTS expansions for V1.5) rather than single-pass generation.[^3][^2]
- **PutnamBench remains hard**: solving 49 of 658 problems (about 7.4%) shows that competition mathematics at the undergraduate-research boundary is still well outside the model's reliable range.[^3]
- **Closed parts of the pipeline**: while weights are open, some V2 training-data ingredients depend on the much larger DeepSeek-V3 model and proprietary internal infrastructure.[^3]
- **Lean-specific scope**: unlike Llemma's broader math capability, DeepSeek-Prover is narrowly tuned for Lean 4 proof tokens and is not designed as a general math reasoner.[^5][^2]
- **Formal vs. informal asymmetry**: on the 15 AIME problems formalized for ProverBench, V2-671B in proof mode solved 6, while the general-purpose [DeepSeek-V3](/wiki/deepseek_v3) with majority voting on the natural-language version solved 8. This gap illustrates that even with a strong specialist prover, producing a fully verified Lean proof remains harder than producing an unverified correct natural-language answer.[^3][^15]
- **Reliance on a teacher model**: the recursive decomposition pipeline depends on DeepSeek-V3's ability to write coherent proof sketches and well-typed Lean `have` blocks. Errors in either step propagate into the synthetic corpus and ultimately into the trained model, making the quality of the teacher a hard upper bound on V2.[^3]

## Significance

DeepSeek-Prover popularized two ideas that have become standard in open-source formal reasoning research: (1) using LLMs to autoformalize large competition-math corpora into Lean and then training on the resulting [synthetic data](/wiki/synthetic_data), and (2) closing the loop with reinforcement learning against the proof assistant itself rather than against a learned reward model.[^1][^2] V2's recursive subgoal decomposition further showed that a strong informal reasoner like [DeepSeek-V3](/wiki/deepseek_v3) can be paired with a smaller specialized prover to produce training data and final proofs that combine natural-language planning with formal verification.[^3]

The series has been positioned in the press as China's most visible open-source counterpart to [AlphaProof](/wiki/alphaproof), with secondary coverage emphasizing the gap in openness between the two efforts and the practical implications for the research community.[^15][^17]

A second strand of significance is methodological. By demonstrating that GRPO works well in a setting where rewards are exactly binary and verifier-supplied, the V1.5 and V2 papers helped consolidate GRPO as a standard recipe for RL on tasks with cheap, deterministic, correct/incorrect oracles. This pattern was subsequently picked up in other DeepSeek work and in the broader open-source post-training literature, where verifier-style rewards are increasingly preferred over learned reward models for tasks that admit them.[^2][^3] The Prover series also provides a clean illustration of how a smaller specialized expert (the 7B prover) and a much larger generalist reasoner (DeepSeek-V3) can compose into a pipeline that outperforms either individually, a pattern related to but distinct from classical Mixture-of-Experts routing.[^3]

## See also

- [DeepSeek](/wiki/deepseek)
- [DeepSeek V3](/wiki/deepseek_v3)
- [Lean (Theorem Prover)](/wiki/lean)
- [Mathlib](/wiki/mathlib)
- [AlphaProof](/wiki/alphaproof)
- [AlphaGeometry](/wiki/alphageometry)
- [AlphaGeometry 2](/wiki/alphageometry_2)
- [AlphaZero](/wiki/alphazero)
- [GRPO](/wiki/grpo)
- [Monte Carlo Tree Search](/wiki/monte_carlo_tree_search)
- [Mixture of Experts (MoE)](/wiki/mixture_of_experts)
- [Synthetic data](/wiki/synthetic_data)
- [Reinforcement learning](/wiki/reinforcement_learning)
- [AIME](/wiki/aime)
- [AIME 2024](/wiki/aime_2024)
- [AIME 2025](/wiki/aime_2025)
- [MATH](/wiki/math)
- [GPT-4](/wiki/gpt-4)
- [Google DeepMind](/wiki/google_deepmind)

## References

[^1]: Huajian Xin, Daya Guo, Zhihong Shao, Zhizhou Ren, Qihao Zhu, Bo Liu, Chong Ruan, Wenda Li, Xiaodan Liang, "DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data", arXiv, 2024-05-23. https://arxiv.org/abs/2405.14333. Accessed 2026-05-21.
[^2]: Huajian Xin et al., "DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search", arXiv, 2024-08-15. https://arxiv.org/abs/2408.08152. Accessed 2026-05-21.
[^3]: Z. Z. Ren, Zhihong Shao et al., "DeepSeek-Prover-V2: Advancing Formal Mathematical Reasoning via Reinforcement Learning for Subgoal Decomposition", arXiv, 2025-04-30. https://arxiv.org/abs/2504.21801. Accessed 2026-05-21.
[^4]: Google DeepMind, "AI achieves silver-medal standard solving International Mathematical Olympiad problems", DeepMind Blog, 2024-07-25. https://deepmind.google/blog/ai-solves-imo-problems-at-silver-medal-level/. Accessed 2026-05-21.
[^5]: Zhangir Azerbayev et al., "Llemma: An Open Language Model for Mathematics", arXiv, 2023-10-16. https://arxiv.org/abs/2310.10631. Accessed 2026-05-21.
[^6]: DeepSeek-AI, "DeepSeek-Prover-V2 GitHub repository", GitHub, 2025-04-30. https://github.com/deepseek-ai/DeepSeek-Prover-V2. Accessed 2026-05-21.
[^7]: DeepSeek-AI, "DeepSeek-Prover-V2-7B model card", Hugging Face, 2025-04-30. https://huggingface.co/deepseek-ai/DeepSeek-Prover-V2-7B. Accessed 2026-05-21.
[^8]: DeepSeek-AI, "DeepSeek-Prover-V1.5 GitHub repository", GitHub, 2024-08-15. https://github.com/deepseek-ai/DeepSeek-Prover-V1.5. Accessed 2026-05-21.
[^9]: DeepSeek-AI, "DeepSeek-Prover-V1.5-RL model card", Hugging Face, 2024-08-15. https://huggingface.co/deepseek-ai/DeepSeek-Prover-V1.5-RL. Accessed 2026-05-21.
[^10]: Kunhao Zheng, Jesse Michael Han, Stanislas Polu, "MiniF2F: a cross-system benchmark for formal Olympiad-level mathematics", arXiv, 2021-09-01. https://arxiv.org/abs/2109.00110. Accessed 2026-05-21.
[^11]: Huajian Xin et al., "DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data (HTML)", arXiv, 2024-05-23. https://arxiv.org/html/2405.14333v1. Accessed 2026-05-21.
[^12]: DeepSeek-AI, "DeepSeek-Prover-V1 dataset", Hugging Face, 2024-05. https://huggingface.co/datasets/deepseek-ai/DeepSeek-Prover-V1. Accessed 2026-05-21.
[^13]: DeepSeek-AI, "DeepSeek-Prover-V2: Advancing Formal Mathematical Reasoning via Reinforcement Learning for Subgoal Decomposition (HTML)", arXiv, 2025-04-30. https://arxiv.org/html/2504.21801v1. Accessed 2026-05-21.
[^14]: DeepSeek-AI, "DeepSeek-Prover-V2-671B model card", Hugging Face, 2025-04-30. https://huggingface.co/deepseek-ai/DeepSeek-Prover-V2-671B. Accessed 2026-05-21.
[^15]: Anthony Alford, "DeepSeek Launches Prover-V2 Open-Source LLM for Formal Math Proofs", InfoQ, 2025-05-12. https://www.infoq.com/news/2025/05/deepseek-prover-v2-formal-proof/. Accessed 2026-05-21.
[^16]: Huajian Xin et al., "Lean Workbook: A large-scale Lean problem set formalized from natural language math problems", arXiv, 2024-06-06. https://arxiv.org/abs/2406.03847. Accessed 2026-05-21.
[^17]: Google DeepMind, "Olympiad-level formal mathematical reasoning with reinforcement learning", Nature, 2025-11-12. https://www.nature.com/articles/s41586-025-09833-y. Accessed 2026-05-21.
[^18]: Anonymous authors, "miniF2F-Lean Revisited: Reviewing Limitations and Charting a Path Forward", arXiv, 2025-11. https://arxiv.org/html/2511.03108v1. Accessed 2026-05-21.

