DeepSeekMath
Last reviewed
May 21, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 3,794 words
DeepSeekMath is a family of open-weight large language models specialized for mathematical reasoning, released by Chinese AI laboratory DeepSeek in February 2024. The 7-billion-parameter model, continue-pretrained from DeepSeek-Coder-Base-v1.5 7B on a 120 B-token math corpus mined from Common Crawl, reached 51.7 % accuracy on the competition-level MATH benchmark without using external tools, approaching the performance of much larger closed models such as GPT-4 and Gemini Ultra.[^1] The paper that introduced DeepSeekMath, "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" (arXiv:2402.03300), is also the original source of Group Relative Policy Optimization (GRPO), a critic-free alternative to Proximal Policy Optimization that later became the reinforcement-learning algorithm underlying DeepSeek-R1.[^1][^2] Alongside the paper, DeepSeek released the weights of the Base, Instruct, and RL variants on Hugging Face under a permissive license that allows commercial use.[^3][^4]
| Field | Value |
|---|---|
| Developer | DeepSeek (with collaborators at Peking University)[^5] |
| Family | DeepSeekMath 7B (Base / Instruct / RL) |
| Released | 5 February 2024 (arXiv v1)[^1] |
| Base model | DeepSeek-Coder-Base-v1.5 7B[^1][^6] |
| Parameters | 7 B (dense) |
| Context length | 4,096 tokens[^6] |
| Tensor type | BF16[^4] |
| Math pre-training corpus | DeepSeekMath-Corpus, 120 B tokens, 35.5 M web pages[^1] |
| Key benchmark | 51.7 % on MATH (no tools, no voting)[^1] |
| Algorithmic contribution | Group Relative Policy Optimization (GRPO)[^1] |
| License | MIT for code; DeepSeek model license for weights (commercial use permitted)[^3] |
| Paper | arXiv:2402.03300[^1] |
| Repository | github.com/deepseek-ai/DeepSeek-Math[^3] |
By late 2023, mathematical reasoning had become a stress test for large language models. Closed models set the pace on competition-level math: Google's Minerva (a continued-pretrained variant of PaLM 540 B[^7]) reached 33.6 % on MATH and OpenAI's GPT-4 was reported at 52.9 %, while open-weight efforts lagged behind. The strongest open math models at the time were Princeton and EleutherAI's Code Llama-based Llemma family (7 B and 34 B variants pretrained on the Proof-Pile-2 corpus, released October 2023)[^8] and the InternLM-Math series from Shanghai AI Laboratory, also a continued-pretrain on a general base.[^9] Both projects were open in data and weights but trailed closed models by 15 to 20 percentage points on the MATH benchmark.[^1][^8]
A key part of the gap was data. Minerva had been trained on a curated 118 GB mixture of arXiv preprints and math-rich web pages from Google's index, neither of which was released. Llemma's Proof-Pile-2 corpus was open but only on the order of 55 B tokens, with a substantial fraction drawn from arXiv and code rather than the open math web. The DeepSeekMath authors framed their primary research question as whether a more aggressive use of public Common Crawl data, filtered by a learned classifier rather than rule-based heuristics, could close that gap at a small parameter scale.[^1]
DeepSeek had already released DeepSeek-LLM (a 7 B and 67 B family) and a code-specialized DeepSeek-Coder series in late 2023; the latter included an updated DeepSeek-Coder-Base-v1.5 7B, which was a continued-pretrain of DeepSeek-LLM 7B on roughly 2 trillion tokens of code-heavy data with a 4 K context window.[^6] The DeepSeekMath authors chose that code-focused checkpoint as a starting point, citing the empirical observation that code pre-training transfers positively to mathematical reasoning. Code is dense in structured symbolic manipulation (variable substitution, algorithmic reasoning, type-aware composition) that overlaps with the skills required for chain-of-thought math, even though the surface form differs.[^1][^11]
The first version of the DeepSeekMath preprint was uploaded to arXiv on 5 February 2024 (v1), revised the next day (v2), and updated again on 27 April 2024 (v3).[^1] Authors include Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo, working under DeepSeek with contributions from Peking University collaborators.[^5] Lead author Zhihong Shao went on to be a primary contributor on subsequent DeepSeek post-training work, including the GRPO portions of the DeepSeek-R1 pipeline.
The single largest contribution of the project, in the authors' framing, is the data pipeline that produced the 120 B-token DeepSeekMath-Corpus from Common Crawl.[^1] The pipeline relies on an iterative fastText classifier rather than rule-based filtering, and it is structured to expand coverage of math-heavy domains across multiple rounds.
The team began with OpenWebMath, a publicly released 14.7 B-token corpus of math-bearing web pages, as positive seed data.[^1] They sampled 500,000 positive examples from this seed and 500,000 negative examples from non-mathematical web pages, then trained a fastText classifier producing 256-dimensional multi-gram embeddings.[^1][^10] fastText was chosen because its sub-word features tolerate the heterogeneous Unicode and LaTeX of math web pages, and because the model is fast enough to score billions of Common Crawl documents.
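The seed-classifier step can be sketched as follows. This is a minimal illustration rather than DeepSeek's code: the label names and helper functions are hypothetical, and apart from `dim=256` (stated above) the training hyperparameters are assumptions. The `fasttext` package is imported lazily, so the formatting helpers run without it.

```python
from pathlib import Path


def to_fasttext_line(text: str, is_math: bool) -> str:
    """Format one document as a fastText supervised-training line."""
    # Label names are hypothetical; fastText only requires the __label__ prefix.
    label = "__label__math" if is_math else "__label__other"
    # fastText reads one example per line, so collapse all whitespace runs.
    return f"{label} {' '.join(text.split())}"


def write_training_file(positives, negatives, path: Path) -> int:
    """Write positive and negative examples to `path`; return the line count."""
    lines = [to_fasttext_line(t, True) for t in positives]
    lines += [to_fasttext_line(t, False) for t in negatives]
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)


def train_classifier(path: Path):
    """Train the math/non-math classifier (requires `pip install fasttext`)."""
    import fasttext  # lazy import: the helpers above do not need it

    # dim=256 follows the article; lr, epoch, wordNgrams and minCount
    # are assumed values chosen for illustration.
    return fasttext.train_supervised(
        input=str(path), dim=256, lr=0.1, epoch=3, wordNgrams=3, minCount=3
    )
```

A trained model would then score each Common Crawl document with `model.predict(text)`, and the top-scoring pages would become candidates for the next round.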
The classifier was run over Common Crawl to harvest additional math-bearing URLs. Because positive coverage clusters by domain (for example math.stackexchange.com or planetmath.org), the team grouped Common Crawl pages by domain and measured what fraction of each domain's pages the classifier had collected. Domains where that fraction was high were treated as math-related; URL patterns for their mathematical content were annotated manually, and pages under those patterns that the classifier had missed were added to the positive training set for the next round.[^1][^10]
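One way to run the per-domain statistics is to measure, for each domain, what fraction of its pages the classifier collected, and to flag domains above a cut-off for manual annotation. The sketch below does that; the function names, the score threshold, and the 10 % cut-off are assumptions made for illustration.

```python
from collections import Counter
from urllib.parse import urlparse


def collected_fraction_by_domain(scored_pages, threshold=0.5):
    """Map each domain to the fraction of its pages scoring >= threshold.

    `scored_pages` is an iterable of (url, classifier_score) pairs.
    """
    total = Counter()
    collected = Counter()
    for url, score in scored_pages:
        domain = urlparse(url).netloc
        total[domain] += 1
        if score >= threshold:
            collected[domain] += 1
    return {d: collected[d] / total[d] for d in total}


def math_related_domains(scored_pages, threshold=0.5, min_fraction=0.10):
    """Return the domains to hand to annotators, sorted by name."""
    fractions = collected_fraction_by_domain(scored_pages, threshold)
    return sorted(d for d, f in fractions.items() if f >= min_fraction)


pages = [
    ("https://math.stackexchange.com/q/1", 0.9),
    ("https://math.stackexchange.com/q/2", 0.8),
    ("https://example.com/a", 0.1),
    ("https://example.com/b", 0.2),
]
flagged = math_related_domains(pages)
```

Here `flagged` contains only `math.stackexchange.com`, since none of the `example.com` pages clears the score threshold.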
After four iterations of harvesting, classifier retraining, and domain-level annotation, the final corpus contained 35.5 million mathematical web pages totalling 120 B tokens.[^1][^3] The authors describe the corpus as roughly seven times larger than the math web pages used in Minerva and around nine times the size of OpenWebMath.[^1][^10] Additional steps included deduplication and decontamination against benchmark test sets (MATH, GSM8K, and others) to reduce the risk of evaluation leakage.[^1]
DeepSeekMath-Base was produced by continued pre-training on roughly 500 B tokens drawn from a mixed distribution: 56 % DeepSeekMath-Corpus, 20 % GitHub code, 10 % arXiv papers, 4 % AlgebraicStack, and 10 % natural-language data from the general DeepSeek-LLM pre-training mix.[^11] The authors report an ablation finding that surprised some readers: arXiv papers had minimal or even slightly negative impact on math benchmark accuracy in their setup, despite being a common ingredient in prior math models.[^11] Code data, in contrast, transferred positively, supporting the choice to initialize from DeepSeek-Coder-Base-v1.5 7B rather than a general LLM checkpoint.
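The mixture percentages translate into per-source token budgets by simple arithmetic. The sketch below assumes the stated proportions apply uniformly to the full 500 B tokens; the variable names are illustrative.

```python
# Mixture shares as stated in the article (they sum to 100 %).
MIXTURE = {
    "DeepSeekMath-Corpus": 0.56,
    "GitHub code": 0.20,
    "arXiv papers": 0.10,
    "AlgebraicStack": 0.04,
    "natural language": 0.10,
}
TOTAL_TOKENS_B = 500  # continued pre-training budget, in billions of tokens
CORPUS_SIZE_B = 120   # size of DeepSeekMath-Corpus, in billions of tokens


def token_budget(mixture, total_b):
    """Return the token budget (in billions) for each data source."""
    if abs(sum(mixture.values()) - 1.0) > 1e-9:
        raise ValueError("mixture shares must sum to 1")
    return {name: round(share * total_b, 1) for name, share in mixture.items()}


budget = token_budget(MIXTURE, TOTAL_TOKENS_B)
# Implied number of passes over the math corpus under this assumption.
passes_over_math_corpus = budget["DeepSeekMath-Corpus"] / CORPUS_SIZE_B
```

Under that assumption the math corpus contributes 280 B tokens, which corresponds to a little over two passes through the 120 B-token corpus.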
The paper's broader empirical conclusions on data are widely cited. First, the DeepSeekMath-Corpus alone provides most of the math benefit; mixing in arXiv or competition problem sets does not help once the corpus is large enough. Second, training on code in the immediate pre-training mix continues to help math, not merely as a base model prior. Third, decontamination is a measurable lever: applying 10-gram and 13-gram exact-match filters against benchmark test sets reduced inflated MATH and GSM8K numbers by several points on smaller ablation runs.[^11]
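An exact-match n-gram filter of the kind described above can be sketched in a few lines. Whitespace tokenisation and lowercasing are simplifying assumptions here, and the function names are hypothetical; a production pipeline would hash the n-grams and stream the corpus rather than hold everything in memory.

```python
def ngrams(text, n):
    """Return the set of word n-grams in `text` (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_benchmark_index(benchmark_texts, n=10):
    """Collect every n-gram that appears in any benchmark problem."""
    index = set()
    for text in benchmark_texts:
        index |= ngrams(text, n)
    return index


def is_contaminated(document, index, n=10):
    """True if the document shares at least one n-gram with the benchmark."""
    return not ngrams(document, n).isdisjoint(index)


benchmark = ["If 3x + 5 = 20 , what is the value of x squared ?"]
index = build_benchmark_index(benchmark)

leaked = "Problem : If 3x + 5 = 20 , what is the value of x squared ? Answer below"
clean = "The derivative of x squared with respect to x is 2x"
```

A document such as `leaked` would be dropped from the training corpus, while `clean` would be kept.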
DeepSeek released three checkpoints, each derived from the previous one.
DeepSeekMath-Base 7B is the continued-pretrain on the 500 B-token math mix described above. On few-shot chain-of-thought evaluation, Base reached 64.2 % on GSM8K and 36.2 % on MATH, beating Minerva 540 B (58.8 % / 33.6 %) despite being roughly 77 times smaller.[^2][^11] It also substantially outperformed Mistral 7B (40.3 % / 14.3 %) and Llemma 34B (54.0 % / 25.3 %).[^2][^11]
DeepSeekMath-Instruct 7B adds a supervised-fine-tuning (SFT) stage on roughly 776 K math-related chain-of-thought and program-of-thought instances curated by the team.[^11] Instruct reached 82.9 % on GSM8K and 46.8 % on MATH under chain-of-thought prompting.[^2][^11]
DeepSeekMath-RL 7B is the headline model. It is initialized from Instruct and further trained with Group Relative Policy Optimization using a reward model trained on math problems with verifiable final answers.[^1] RL pushed MATH from 46.8 % to 51.7 % and GSM8K from 82.9 % to 88.2 %.[^2][^11] With self-consistency over 64 samples, MATH accuracy climbs to 60.9 %.[^1] With Python-based tool use, the same model reaches 58.8 % on MATH and 86.7 % on GSM8K.[^2]
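Self-consistency amounts to sampling many solutions for the same problem and returning the most common final answer. A minimal sketch, assuming the final answers have already been extracted from each sampled solution (with `None` marking samples where extraction failed):

```python
from collections import Counter


def majority_vote(final_answers):
    """Return the most frequent non-missing answer, or None if there is none."""
    votes = Counter(a for a in final_answers if a is not None)
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Six sampled solutions to one problem; one sample produced no parsable answer.
samples = ["12", "15", "12", None, "12", "15"]
voted = majority_vote(samples)
```

In the setting reported above the list would hold 64 answers per problem instead of six.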
All three checkpoints are released on Hugging Face as deepseek-ai/deepseek-math-7b-base, -instruct, and -rl. The RL checkpoint is published in BF16 with Safetensors weights.[^4]
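Loading one of the checkpoints with the Hugging Face `transformers` library might look like the sketch below. The helper names are hypothetical, and the heavy imports are done lazily so that the repository-name logic can be used without `torch` or `transformers` installed. Actually calling `load_checkpoint` downloads several gigabytes of weights.

```python
REPO_PREFIX = "deepseek-ai/deepseek-math-7b-"
VARIANTS = ("base", "instruct", "rl")


def repo_id(variant):
    """Build the Hugging Face repository name for a checkpoint variant."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant {variant!r}; expected one of {VARIANTS}")
    return REPO_PREFIX + variant


def load_checkpoint(variant="rl"):
    """Return (tokenizer, model) for the chosen variant, loaded in BF16."""
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = repo_id(variant)
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16)
    return tokenizer, model
```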
Proximal Policy Optimization (PPO) is the standard policy-gradient algorithm used for reinforcement learning from human feedback in language models. PPO maintains both a policy network and a separate value (critic) network of comparable size, used to estimate the baseline that defines the advantage in the policy-gradient update.[^12] For 7 B-and-larger LLMs, training a same-sized critic doubles the memory footprint and significantly increases compute.
GRPO was designed to remove the critic. The DeepSeekMath authors observed that, in tasks where many candidate solutions can be sampled cheaply per prompt (math problems being a clean example), the empirical reward distribution within a sampled group already provides a useful baseline. They proposed estimating the advantage of each sample by comparing its reward to the mean and standard deviation of rewards in the same group, instead of using a learned value function.[^1][^13]
For each input prompt q, GRPO samples a group of G outputs {o_1, ..., o_G} from the current policy. Each output o_i receives a scalar reward r_i, either from a learned reward model or from a rule-based verifier (such as comparing a final boxed answer to a known solution). The group advantage for output o_i is defined as the z-score within the group:
A_i = (r_i - mean(r_1, ..., r_G)) / std(r_1, ..., r_G)
The policy is then updated with a PPO-style clipped surrogate objective using A_i in place of an advantage estimated by a critic. A KL-divergence penalty against a reference policy (typically the SFT checkpoint) prevents distributional drift.[^1][^13]
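The group advantage and the clipped update can be sketched numerically as follows. This is a simplified, sequence-level illustration rather than the paper's implementation, which applies the objective per token. The population standard deviation, the `eps` guard against zero variance, and the values of `clip_eps` and `beta` are all assumptions.

```python
from statistics import fmean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Z-score each reward against the other samples for the same prompt."""
    mean = fmean(rewards)
    std = pstdev(rewards)  # population std; eps avoids division by zero
    return [(r - mean) / (std + eps) for r in rewards]


def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """PPO-style clipped term for one sample.

    `ratio` is the new-policy probability divided by the old-policy probability.
    """
    clipped_ratio = min(max(ratio, 1 - clip_eps), 1 + clip_eps)
    return min(ratio * advantage, clipped_ratio * advantage)


def grpo_objective(ratios, rewards, kl_estimates, beta=0.04, clip_eps=0.2):
    """Average clipped surrogate minus a KL penalty (quantity to maximise)."""
    advantages = group_advantages(rewards)
    terms = [
        clipped_surrogate(rho, adv, clip_eps) - beta * kl
        for rho, adv, kl in zip(ratios, advantages, kl_estimates)
    ]
    return fmean(terms)


mixed = group_advantages([1.0, 0.0, 0.0, 1.0])   # two correct, two wrong
all_correct = group_advantages([1.0, 1.0, 1.0])  # every sample succeeds
```

With binary rewards, `mixed` is approximately `[1, -1, -1, 1]`, while `all_correct` is all zeros, so a prompt that every sample solves adds nothing to the advantage term.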
Conceptually, GRPO collapses two ideas into one. The group baseline replaces the value function used in advantage estimation, removing the need to train a critic. The within-group standardisation replaces the role normally played by Generalized Advantage Estimation (GAE) in stabilising the gradient by normalising scales. Because both the baseline and the normalisation are computed only from samples drawn for the same prompt, each advantage is measured relative to how hard that prompt is for the current policy: a prompt where every sample succeeds (or every sample fails) yields zero advantage for all samples and so contributes no policy-gradient signal in either direction, whereas in PPO that prompt would still produce a non-zero (and arguably misleading) advantage estimate based on the critic's prediction.[^13][^14]
Because no value network is trained, GRPO is simpler to implement, uses roughly half the memory of PPO at the same model size, and (according to the DeepSeekMath authors) yields competitive or better reward optimization on math tasks.[^1][^14] In practice, the group size G is typically chosen between 4 and 64 depending on compute budget; the DeepSeekMath paper reports good results with G = 64.[^1]
The DeepSeekMath paper analyses both outcome supervision (a single reward per full completion, typically tied to final answer correctness) and process supervision (per-step rewards from a process reward model). GRPO is presented as a unified framework supporting both, since the advantage estimator is computed group-wise regardless of where in the trajectory the reward is assigned.[^1]
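The difference between the two supervision modes can be sketched at the level of reasoning steps. The process-supervision variant below normalises step rewards across the whole group and gives each step the sum of the normalised rewards from that step onward. This is an interpretation of the paper's description, not a quotation of its code, and the function names are hypothetical.

```python
from statistics import fmean, pstdev


def outcome_advantages(rewards, eps=1e-8):
    """One advantage per completion, shared by all of its steps."""
    mean, std = fmean(rewards), pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def process_advantages(step_rewards, eps=1e-8):
    """One advantage per step.

    `step_rewards[i][k]` is the reward of step k in completion i. Rewards are
    normalised over every step in the group, then accumulated from the end of
    each completion so that a step is credited with everything that follows it.
    """
    flat = [r for steps in step_rewards for r in steps]
    mean, std = fmean(flat), pstdev(flat)
    result = []
    for steps in step_rewards:
        normalised = [(r - mean) / (std + eps) for r in steps]
        running, suffix_sums = 0.0, []
        for r in reversed(normalised):
            running += r
            suffix_sums.append(running)
        result.append(suffix_sums[::-1])
    return result


# Two completions of two steps each; the second goes wrong at its last step.
adv = process_advantages([[1.0, 1.0], [1.0, 0.0]])
```

In this example both steps of the first completion receive positive advantages, and both steps of the second receive negative ones, with the failing final step penalised most.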
GRPO is the RL algorithm used for DeepSeek-R1 and DeepSeek-R1-Zero. The R1 paper, "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning" (arXiv:2501.12948, January 2025), starts from DeepSeek-V3-Base and applies GRPO at scale, with rule-based rewards for math and code and a format reward for structured reasoning.[^15] DeepSeek-R1-Zero, a model trained without supervised cold-start data, achieves a pass@1 of 71.0 % on AIME 2024 (rising to 86.7 % with majority voting), demonstrating that GRPO-only training can elicit chain-of-thought reasoning without prior SFT.[^15]
The R1 paper is also where GRPO's most well-known emergent properties were documented: as training progresses, the policy learns to allocate more tokens to intermediate reasoning, periodically pauses for self-verification, and (in R1-Zero) occasionally exhibits an "aha moment" where it explicitly notices and corrects an earlier mistake.[^15] These behaviours arise without any specific reward shaping; only the binary correctness reward on final answers and the format reward are applied. The authors interpret this as evidence that GRPO's group-based advantage, applied to a sufficiently capable base model with a verifiable reward, is sufficient to surface latent reasoning capability.
GRPO is also now widely re-implemented outside DeepSeek; the Hugging Face TRL library, OpenRLHF, and several research codebases ship GRPO trainers, often using vLLM as the sampling backend, and the algorithm is the focus of follow-up theoretical work analysing its bias relative to PPO and its behaviour in non-LLM RL settings.[^14]
The headline results from the DeepSeekMath paper, with citations to the paper page and arXiv preprint, are summarised below.
| Model | Size | GSM8K | MATH (no tools) | MATH + Python |
|---|---|---|---|---|
| Minerva 7B[^11] | 7 B | 16.2 % | 14.1 % | --- |
| Minerva 62B[^11] | 62 B | 52.4 % | 27.6 % | --- |
| Minerva 540B[^11] | 540 B | 58.8 % | 33.6 % | --- |
| Llemma 7B[^11] | 7 B | 37.4 % | 18.1 % | --- |
| Llemma 34B[^11] | 34 B | 54.0 % | 25.3 % | 26.3 % |
| Mistral 7B (base)[^11] | 7 B | 40.3 % | 14.3 % | --- |
| DeepSeekMath-Base 7B[^2][^11] | 7 B | 64.2 % | 36.2 % | --- |
| DeepSeekMath-Instruct 7B[^2][^11] | 7 B | 82.9 % | 46.8 % | --- |
| DeepSeekMath-RL 7B[^2] | 7 B | 88.2 % | 51.7 % | 58.8 % |
| GPT-4 (cited in paper)[^2] | --- | 92.0 % | 52.9 % | 69.7 % (Code Interpreter) |
| Gemini Ultra (cited in paper)[^2] | --- | 94.4 % | 53.2 % | --- |
DeepSeekMath-RL 7B was the first openly released 7 B model to break 50 % on MATH without tools or self-consistency.[^1][^2] On Chinese math benchmarks, it reached 79.6 % on MGSM-zh and 88.8 % on CMATH.[^1][^2] Self-consistency over 64 samples raises MATH accuracy further, to 60.9 %.[^1]
| Model | Lab | Year | Base model | Math pre-training tokens | MATH best |
|---|---|---|---|---|---|
| Minerva 540B[^7] | Google Research | 2022 | PaLM 540 B | ~17.5 B math web + arXiv | 33.6 % |
| Llemma 34B[^8] | Princeton / EleutherAI | 2023 | Code Llama 34B | ~55 B (Proof-Pile-2) | 25.3 % |
| InternLM-Math 20B[^9] | Shanghai AI Lab | Feb 2024 | InternLM2 | not public (continued pre-train) | reported on GSM8K, MATH, MathBench-ZH, MiniF2F[^9] |
| DeepSeekMath 7B (RL)[^1] | DeepSeek | Feb 2024 | DeepSeek-Coder-Base-v1.5 7B | 120 B (DeepSeekMath-Corpus) | 51.7 % |
| Qwen2-Math 72B-Instruct[^16] | Alibaba Qwen | Aug 2024 | Qwen2 72B | math-specific corpus (size not disclosed) | 84 % (reported by Alibaba) |
| Qwen2.5-Math 72B[^17] | Alibaba Qwen | Sep 2024 | Qwen2.5 72B | proprietary | 66.8 % base (reported in tech report) |
Two patterns stand out. First, the DeepSeekMath team's recipe (initialise from a code-pretrained base, continue pre-training on a very large fastText-mined math web corpus, then apply SFT and GRPO with correctness-based math rewards) was reproduced in spirit by later projects, including Qwen2-Math and InternLM-Math 2. Second, scale eventually catches up: Qwen2-Math-72B-Instruct and Qwen2.5-Math-72B-Instruct surpassed DeepSeekMath-RL 7B on MATH in the second half of 2024, but at roughly ten times the parameter count.[^16][^17]
DeepSeekMath is widely cited as a pivotal data point in three concurrent developments:
The DeepSeekMath-Corpus itself, while not publicly released as a downloadable archive, established a methodology (fastText classifier on Common Crawl, iterative domain-targeted expansion, large-scale deduplication and decontamination) that other groups have since reproduced and open-sourced in projects such as MathPile, FineMath, and open re-implementations of the DeepSeekMath pipeline.[^10]
A further indirect significance is institutional. DeepSeekMath was, alongside DeepSeek-Coder, one of the first DeepSeek papers to receive heavy attention from the broader machine-learning research community in the West. Many of the techniques first surfaced in this paper (GRPO, the fastText-classifier data pipeline, code-then-math continued pre-training, and the use of correctness-based rewards on math problems) were carried into DeepSeek-V2, DeepSeek-V3, and DeepSeek-R1, and the lab's open-weight releases of the math and reasoning models played a substantial role in the broader public discussion of open-source LLMs in 2024 and 2025.[^15]
Because the DeepSeekMath weights are released under a license that permits commercial use[^3], the models have been adopted in a variety of downstream contexts:
Several limitations are noted in the paper or in subsequent commentary:
DeepSeekMath sits within a broader ecosystem of math-specialised LLMs: