# DeepSeekMath

> Source: https://aiwiki.ai/wiki/deepseek_math
> Updated: 2026-06-07
> Categories: Chinese AI, Large Language Models, Reasoning Models
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**DeepSeekMath** is a family of open-weight large language models specialized for mathematical reasoning, released by Chinese AI laboratory [DeepSeek](/wiki/deepseek) in February 2024. The 7-billion-parameter model was continue-pretrained from DeepSeek-Coder-Base-v1.5 7B on a data mix built around a 120 B-token math corpus mined from [Common Crawl](/wiki/common_crawl). Its RL-tuned variant reached 51.7 % accuracy on the competition-level MATH benchmark without using external tools, approaching the performance of much larger closed models such as [GPT-4](/wiki/gpt-4) and Gemini Ultra.[^1] The paper that introduced DeepSeekMath, "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" (arXiv:2402.03300), is also the original source of **[Group Relative Policy Optimization](/wiki/grpo)** (GRPO), a critic-free alternative to [Proximal Policy Optimization](/wiki/ppo) that later became the reinforcement-learning algorithm underlying [DeepSeek-R1](/wiki/deepseek_r1).[^1][^2] Alongside the paper, DeepSeek released the weights for the Base, Instruct, and RL variants on Hugging Face under a permissive license that allows commercial use.[^3][^4]

## Infobox

| Field | Value |
|---|---|
| Developer | DeepSeek (with collaborators at Peking University)[^5] |
| Family | DeepSeekMath 7B (Base / Instruct / RL) |
| Released | 5 February 2024 (arXiv v1)[^1] |
| Base model | DeepSeek-Coder-Base-v1.5 7B[^1][^6] |
| Parameters | 7 B (dense) |
| Context length | 4,096 tokens[^6] |
| Tensor type | BF16[^4] |
| Math pre-training corpus | DeepSeekMath-Corpus, 120 B tokens, 35.5 M web pages[^1] |
| Key benchmark | 51.7 % on MATH (no tools, no voting)[^1] |
| Algorithmic contribution | [Group Relative Policy Optimization](/wiki/grpo) (GRPO)[^1] |
| License | MIT for code; DeepSeek model license for weights (commercial use permitted)[^3] |
| Paper | arXiv:2402.03300[^1] |
| Repository | github.com/deepseek-ai/DeepSeek-Math[^3] |

## Background

By late 2023, mathematical reasoning had become a stress test for large language models. Closed models such as Google's Minerva (a continued pre-training variant of [PaLM](/wiki/palm) 540 B[^7]) and OpenAI's [GPT-4](/wiki/gpt-4) set the pace on competition-level math, with the DeepSeekMath paper citing 52.9 % on MATH for GPT-4, while open-weight efforts lagged behind.[^2] The strongest open math models at the time were Princeton and EleutherAI's [Code Llama](/wiki/code_llama)-based Llemma family (7 B and 34 B variants pretrained on the Proof-Pile-2 corpus, released October 2023)[^8] and the InternLM-Math series from Shanghai AI Laboratory, likewise built by continued pre-training on a general base.[^9] Both projects released open weights but trailed the strongest closed models by a wide margin on the MATH benchmark: Llemma 34B scored 25.3 %, less than half of the figure cited for GPT-4.[^1][^8]

A key part of the gap was data. Minerva had been trained on a curated 118 GB mixture of arXiv preprints and math-rich web pages from Google's index, neither of which was released. Llemma's Proof-Pile-2 corpus was open but only on the order of 55 B tokens, with a substantial fraction drawn from arXiv and code rather than the open math web. The DeepSeekMath authors framed their primary research question as whether a more aggressive use of public Common Crawl data, filtered by a learned classifier rather than rule-based heuristics, could close that gap at a small parameter scale.[^1]

DeepSeek had already released [DeepSeek-LLM](/wiki/deepseek) (a 7 B and 67 B family) and a code-specialized [DeepSeek-Coder](/wiki/deepseek) series in late 2023; the latter included an updated DeepSeek-Coder-Base-v1.5 7B, which was a continued pre-train of DeepSeek-LLM 7B on roughly 2 trillion tokens of code-heavy data with a 4 K context window.[^6] The DeepSeekMath authors chose that code-focused checkpoint as a starting point, citing the empirical observation that code pre-training transfers positively to mathematical reasoning. Code is dense in structured symbolic manipulation (variable substitution, algorithmic reasoning, type-aware composition) that overlaps with the skills required for chain-of-thought math, even though the surface form differs.[^1][^11]

The first version of the DeepSeekMath preprint was uploaded to arXiv on 5 February 2024 (v1), revised the next day (v2), and updated again on 27 April 2024 (v3).[^1] Authors include Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo, working under DeepSeek with contributions from Peking University collaborators.[^5] Lead author Zhihong Shao went on to be a primary contributor on subsequent DeepSeek post-training work, including the GRPO portions of the [DeepSeek-R1](/wiki/deepseek_r1) pipeline.

## The DeepSeekMath-Corpus data pipeline

The single largest contribution of the project, in the authors' framing, is the data pipeline that produced the 120 B-token DeepSeekMath-Corpus from [Common Crawl](/wiki/common_crawl).[^1] The pipeline relies on an iterative [fastText](/wiki/fasttext) classifier rather than rule-based filtering, and it is structured to expand coverage of math-heavy domains across multiple rounds.

### Seed and classifier

The team began with OpenWebMath, a publicly released 14.7 B-token corpus of math-bearing web pages, as positive seed data.[^1] They sampled 500,000 positive examples from this seed and 500,000 negative examples from non-mathematical web pages, then trained a fastText classifier producing 256-dimensional multi-gram embeddings.[^1][^10] fastText was chosen because its sub-word features tolerate the heterogeneous Unicode and LaTeX of math web pages, and because the model is fast enough to score billions of Common Crawl documents.
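The seed step can be sketched as follows. This is a minimal illustration, not the authors' code: the label names and helper functions are hypothetical, and every hyperparameter other than the 256-dimensional vectors is an assumption that should be checked against the paper. Training itself needs the third-party `fasttext` package, so it is isolated in one function.

```python
import random

# dim=256 comes from the text above; the other values are assumptions
# (illustrative settings, to be verified against the paper).
FASTTEXT_HPARAMS = {"dim": 256, "lr": 0.1, "wordNgrams": 3, "minCount": 3, "epoch": 3}


def to_fasttext_line(text: str, is_math: bool) -> str:
    """Format one document as a fastText supervised example: '__label__X text'."""
    label = "__label__math" if is_math else "__label__other"  # hypothetical label names
    # fastText reads one example per line, so collapse newlines and extra spaces.
    return f"{label} {' '.join(text.split())}"


def build_training_lines(positives, negatives, per_class, seed=0):
    """Sample up to `per_class` documents from each class and shuffle them together."""
    rng = random.Random(seed)
    pos = rng.sample(positives, min(per_class, len(positives)))
    neg = rng.sample(negatives, min(per_class, len(negatives)))
    lines = [to_fasttext_line(t, True) for t in pos]
    lines += [to_fasttext_line(t, False) for t in neg]
    rng.shuffle(lines)
    return lines


def train_classifier(train_path: str):
    """Train on a file written from build_training_lines (requires `fasttext`)."""
    import fasttext  # imported lazily so the helpers above run without it

    return fasttext.train_supervised(input=train_path, **FASTTEXT_HPARAMS)
```

With `per_class=500_000` the helper mirrors the balanced 500,000 / 500,000 split described above; the resulting lines would be written to a text file and passed to `train_classifier`.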

### Four iterations of expansion

The classifier was run over Common Crawl to harvest additional math-bearing URLs. Because positive coverage clusters by domain (for example math.stackexchange.com or planetmath.org), the team ran statistical analyses of which domains produced the most high-scoring pages. Domains whose mathematical pages constituted a significant share of all retrieved math content were flagged for manual annotation, and additional positive examples from those domains were added back to the classifier's training set for the next round.[^1][^10]
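One way to read the domain-level statistics is as a per-domain collection rate: of all crawled pages from a domain, what fraction did the classifier keep? The sketch below follows that reading. The function name, the 10 % threshold, and the example URLs are assumptions for illustration, not details taken from the text above.

```python
from collections import Counter
from urllib.parse import urlparse


def flag_math_domains(all_urls, collected_urls, min_share=0.10):
    """Return {domain: share} for domains whose collected share is >= min_share.

    all_urls:       every crawled URL that was scored by the classifier
    collected_urls: the subset the classifier kept as mathematical
    min_share:      assumed threshold; tune it for the crawl at hand
    """
    total = Counter(urlparse(u).netloc for u in all_urls)
    hits = Counter(urlparse(u).netloc for u in collected_urls)
    flagged = {}
    for domain, n_hit in hits.items():
        n_total = total.get(domain, 0)
        if n_total == 0:
            continue  # collected URL missing from the crawl list; skip it
        share = n_hit / n_total
        if share >= min_share:
            flagged[domain] = share
    return flagged


crawl = [f"https://math.example/p{i}" for i in range(10)]
crawl += [f"https://news.example/p{i}" for i in range(100)]
kept = crawl[:5] + ["https://news.example/p0"]
candidates = flag_math_domains(crawl, kept)  # domains to hand to annotators
```

Flagged domains would then go to manual annotation, and the newly labelled pages would be added to the positive set before the classifier is retrained.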

After four iterations of harvesting, classifier retraining, and domain-level annotation, the final corpus contained 35.5 million mathematical web pages totalling 120 B tokens.[^1][^3] The authors describe the corpus as roughly seven times larger than the math web pages used in Minerva and around nine times the size of OpenWebMath.[^1][^10] Additional steps included deduplication and decontamination against benchmark test sets (MATH, GSM8K, and others) to reduce the risk of evaluation leakage.[^1]

### Pre-training mix

DeepSeekMath-Base was produced by continued pre-training on roughly 500 B tokens drawn from a mixed distribution: 56 % DeepSeekMath-Corpus, 20 % GitHub code, 10 % arXiv papers, 4 % AlgebraicStack, and 10 % natural-language data from the general DeepSeek-LLM pre-training mix.[^11] The authors report an ablation finding that surprised some readers: arXiv papers had minimal or even slightly negative impact on math benchmark accuracy in their setup, despite being a common ingredient in prior math models.[^11] Code data, in contrast, transferred positively, supporting the choice to initialize from DeepSeek-Coder-Base-v1.5 7B rather than a general LLM checkpoint.
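The mixture percentages translate into a per-source token budget. The short calculation below is back-of-the-envelope arithmetic on the figures quoted above, not a number reported by the authors; it also shows that 56 % of 500 B tokens implies a little more than two passes over the 120 B-token corpus, assuming the math share is sampled only from that corpus.

```python
# Shares and totals as quoted in the paragraph above.
MIX = {
    "DeepSeekMath-Corpus": 0.56,
    "GitHub code": 0.20,
    "arXiv": 0.10,
    "AlgebraicStack": 0.04,
    "natural language": 0.10,
}
TOTAL_TOKENS = 500e9
MATH_CORPUS_TOKENS = 120e9


def token_budget(mix, total_tokens):
    """Split a total token budget across sources according to their shares."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("mixture shares must sum to 1")
    return {name: share * total_tokens for name, share in mix.items()}


budget = token_budget(MIX, TOTAL_TOKENS)
# Implied number of passes over the math corpus (assumption: no other
# source contributes to the 56 % math share).
math_epochs = budget["DeepSeekMath-Corpus"] / MATH_CORPUS_TOKENS

for name, tokens in budget.items():
    print(f"{name:>20}: {tokens / 1e9:6.1f} B tokens")
print(f"passes over the math corpus: {math_epochs:.2f}")
```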

The paper's broader empirical conclusions on data are widely cited. First, the DeepSeekMath-Corpus alone provides most of the math benefit; mixing in arXiv or competition problem sets does not help once the corpus is large enough. Second, training on code in the immediate pre-training mix continues to help math, not merely as a base model prior. Third, decontamination is a measurable lever: applying 10-gram and 13-gram exact-match filters against benchmark test sets reduced inflated MATH and GSM8K numbers by several points on smaller ablation runs.[^11]
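An exact-match n-gram filter of the kind described can be sketched in a few lines. This is an illustrative version under stated assumptions: tokens are lower-cased whitespace-separated words, whole documents are dropped on any match, and the function names are hypothetical. The paper's tokenisation and handling of short benchmark items may differ.

```python
def ngrams(text, n):
    """Return the set of word n-grams in `text` (lower-cased, whitespace tokens)."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_benchmark_index(benchmark_texts, n=10):
    """Collect every n-gram that appears in any benchmark problem or solution."""
    index = set()
    for text in benchmark_texts:
        index |= ngrams(text, n)
    return index


def is_contaminated(document, index, n=10):
    """True if the document shares at least one n-gram with the benchmark index."""
    return not ngrams(document, n).isdisjoint(index)


def decontaminate(documents, benchmark_texts, n=10):
    """Keep only documents with no n-gram overlap against the benchmarks."""
    index = build_benchmark_index(benchmark_texts, n)
    return [doc for doc in documents if not is_contaminated(doc, index, n)]
```

Note that a benchmark item shorter than `n` tokens contributes no n-grams in this sketch, so short problems would need a separate rule.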

## Model variants

DeepSeek released three checkpoints, each derived from the previous one.

### DeepSeekMath-Base 7B

DeepSeekMath-Base 7B is the result of continued pre-training on the 500 B-token math mix described above. On few-shot chain-of-thought evaluation, Base reached 64.2 % on GSM8K and 36.2 % on MATH, beating Minerva 540 B (58.8 % / 33.6 %) despite being roughly 77 times smaller.[^2][^11] It also substantially outperformed [Mistral](/wiki/mistral_7b) 7B (40.3 % / 14.3 %) and Llemma 34B (54.0 % / 25.3 %).[^2][^11]

### DeepSeekMath-Instruct 7B

DeepSeekMath-Instruct 7B adds a supervised fine-tuning (SFT) stage on roughly 776 K math-related chain-of-thought and program-of-thought instances curated by the team.[^11] Instruct reached 82.9 % on GSM8K and 46.8 % on MATH under chain-of-thought prompting.[^2][^11]

### DeepSeekMath-RL 7B

The headline model. It is initialized from Instruct and further trained with [Group Relative Policy Optimization](/wiki/grpo) using a reward model trained on math problems with verifiable final answers.[^1] RL pushed MATH from 46.8 % to 51.7 % and GSM8K from 82.9 % to 88.2 %.[^2][^11] With self-consistency over 64 samples, MATH accuracy climbs to 60.9 %.[^1] With Python-based tool use, the same model reaches 58.8 % on MATH and 86.7 % on GSM8K.[^2]
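Self-consistency here means sampling many solutions and taking the most frequent final answer. A minimal sketch follows, assuming the final answers have already been extracted from each sampled solution as strings; the function name is hypothetical, and answer normalisation (for example treating `1/2` and `0.5` as equal) is left out.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most common non-empty answer, or None if there is none.

    answers: final answers extracted from k sampled solutions; entries may be
             None when no answer could be parsed from a sample.
    """
    cleaned = [a.strip() for a in answers if a is not None and a.strip()]
    if not cleaned:
        return None
    # Counter.most_common breaks ties in favour of the answer seen first.
    return Counter(cleaned).most_common(1)[0][0]


samples = ["42", "41", " 42", None, "7", "42"]
voted = majority_vote(samples)
```

In the setting described above, `answers` would hold 64 entries per problem.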

All three checkpoints are released on Hugging Face as `deepseek-ai/deepseek-math-7b-base`, `-instruct`, and `-rl`. The RL checkpoint is published in BF16 with Safetensors weights.[^4]

## Group Relative Policy Optimization (GRPO)

### Motivation

[Proximal Policy Optimization](/wiki/ppo) (PPO) is the standard policy-gradient algorithm used for [reinforcement learning from human feedback](/wiki/rlhf) in language models. PPO maintains both a policy network and a separate value (critic) network of comparable size, used to estimate the baseline that defines the advantage in the policy-gradient update.[^12] For 7 B-and-larger LLMs, training a same-sized critic doubles the memory footprint and significantly increases compute.

GRPO was designed to remove the critic. The DeepSeekMath authors observed that, in tasks where many candidate solutions can be sampled cheaply per prompt (math problems being a clean example), the empirical reward distribution within a sampled group already provides a useful baseline. They proposed estimating the advantage of each sample by comparing its reward to the mean and standard deviation of rewards in the same group, instead of using a learned value function.[^1][^13]

### Algorithm

For each input prompt q, GRPO samples a group of G outputs {o_1, ..., o_G} from the current policy. Each output o_i receives a scalar reward r_i, either from a learned reward model or from a rule-based verifier (such as comparing a final boxed answer to a known solution). The group advantage for output o_i is defined as the z-score within the group:

A_i = (r_i - mean(r_1, ..., r_G)) / std(r_1, ..., r_G)
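The formula can be written directly in Python. This is a minimal sketch: the small `eps` in the denominator and the use of the population standard deviation are implementation assumptions, added so that a group with identical rewards yields zero advantages instead of a division by zero.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Z-score each reward against the other rewards sampled for the same prompt.

    rewards: r_1..r_G for one prompt.
    eps:     assumed stabiliser, not part of the formula above.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; a sample std is an equally valid choice
    return [(r - mu) / (sigma + eps) for r in rewards]


mixed = group_advantages([1.0, 0.0, 0.0, 1.0])   # two correct, two wrong
all_right = group_advantages([1.0, 1.0, 1.0])    # no spread, so no signal
```

Correct samples in the mixed group receive advantages near +1 and incorrect ones near -1, while the all-correct group receives exactly zero for every sample.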

The policy is then updated with a PPO-style clipped surrogate objective using A_i in place of an advantage estimated by a critic. A KL-divergence penalty against a reference policy (typically the SFT checkpoint) prevents distributional drift.[^1][^13]
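The update rule can be sketched per token as below. This is an illustrative scalar version under several assumptions: the clipping range `clip_eps=0.2` and KL weight `beta=0.04` are placeholder values, the KL term uses the non-negative estimator `ref/new - log(ref/new) - 1`, and each output's objective is averaged over its tokens before averaging over the group. A real trainer would compute the same quantities on tensors and maximise the result by gradient ascent.

```python
import math


def grpo_token_objective(logp_new, logp_old, logp_ref, advantage,
                         clip_eps=0.2, beta=0.04):
    """Clipped surrogate minus a KL penalty for a single token.

    logp_new: log-prob of the token under the policy being optimised
    logp_old: log-prob under the policy that generated the sample
    logp_ref: log-prob under the frozen reference policy
    """
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    surrogate = min(ratio * advantage, clipped * advantage)
    log_ref_ratio = logp_ref - logp_new
    kl = math.exp(log_ref_ratio) - log_ref_ratio - 1.0  # always >= 0
    return surrogate - beta * kl


def grpo_objective(group, clip_eps=0.2, beta=0.04):
    """Average objective over a group.

    group: list of (tokens, advantage) pairs, one per sampled output, where
           tokens is a non-empty list of (logp_new, logp_old, logp_ref).
    """
    per_output = []
    for tokens, advantage in group:
        values = [grpo_token_objective(n, o, r, advantage, clip_eps, beta)
                  for n, o, r in tokens]
        per_output.append(sum(values) / len(values))
    return sum(per_output) / len(per_output)
```

The `min` with the clipped term means a positive-advantage token stops gaining once its probability ratio exceeds `1 + clip_eps`, while a negative-advantage token is still penalised in full if the ratio grows.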

Conceptually, GRPO collapses two ideas into one. The group baseline replaces the value function used in advantage estimation, removing the need to train a critic. The within-group standardisation keeps the scale of the advantages comparable across prompts, a stabilising role that critic-based pipelines usually leave to Generalized Advantage Estimation (GAE) combined with advantage normalisation. Because both the baseline and the normalisation are computed only from samples drawn for the same prompt, each advantage is measured relative to that prompt's own difficulty: a prompt where every sample succeeds (or every sample fails) has no reward spread and contributes no advantage signal in either direction, whereas in PPO that prompt would still produce a non-zero (and arguably misleading) advantage estimate based on the critic's prediction.[^13][^14]

Because no value network is trained, GRPO is simpler to implement, uses roughly half the memory of PPO at the same model size, and (according to the DeepSeekMath authors) yields competitive or better reward optimization on math tasks.[^1][^14] In practice, the group size G is typically chosen between 4 and 64 depending on compute budget; the DeepSeekMath paper reports good results with G = 64.[^1]
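The "roughly half" figure can be sanity-checked with a rough accounting of model state. The sketch below is an estimate under loud assumptions, not a measurement: trained models are charged 16 bytes per parameter (BF16 weights and gradients plus FP32 master weights and two Adam moments), frozen models 2 bytes per parameter, the critic and reward model are taken to be the same size as the policy, and activations and sampling caches are ignored.

```python
BYTES_PER_PARAM_TRAINED = 16  # assumed: 2 (bf16 weights) + 2 (bf16 grads) + 12 (fp32 master, Adam m, Adam v)
BYTES_PER_PARAM_FROZEN = 2    # assumed: bf16 weights only


def model_state_gb(n_params, n_trained, n_frozen):
    """Approximate model-state memory in GB for a set of same-sized models."""
    per_param = n_trained * BYTES_PER_PARAM_TRAINED + n_frozen * BYTES_PER_PARAM_FROZEN
    return n_params * per_param / 1e9


N_PARAMS = 7e9
# PPO: policy and critic are trained; reference and reward model are frozen.
ppo_gb = model_state_gb(N_PARAMS, n_trained=2, n_frozen=2)
# GRPO: only the policy is trained; reference and reward model are frozen.
grpo_gb = model_state_gb(N_PARAMS, n_trained=1, n_frozen=2)

print(f"PPO  ~{ppo_gb:.0f} GB, GRPO ~{grpo_gb:.0f} GB, ratio {grpo_gb / ppo_gb:.2f}")
```

Under these assumptions GRPO needs a little over half of PPO's model-state memory; with a rule-based verifier in place of a reward model, the frozen count drops to one and the ratio moves closer to one half.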

### Outcome vs. process supervision

The DeepSeekMath paper analyses both outcome supervision (a single reward per full completion, typically tied to final answer correctness) and process supervision (per-step rewards from a process reward model). GRPO is presented as a unified framework supporting both, since the advantage estimator is computed group-wise regardless of where in the trajectory the reward is assigned.[^1]
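The difference between the two modes can be sketched as follows. This follows one reading of the process-supervision variant: step rewards are standardised across every step in the group, and each step's advantage is the sum of the standardised rewards from that step onward. The reward-to-go summation, the `eps` term, and the function names are assumptions to verify against the paper.

```python
from statistics import mean, pstdev


def outcome_advantages(final_rewards, eps=1e-8):
    """One advantage per output, shared by every token of that output."""
    mu, sigma = mean(final_rewards), pstdev(final_rewards)
    return [(r - mu) / (sigma + eps) for r in final_rewards]


def process_advantages(step_rewards, eps=1e-8):
    """One advantage per reasoning step.

    step_rewards: list over outputs; each entry lists that output's step rewards.
    """
    flat = [r for steps in step_rewards for r in steps]
    mu, sigma = mean(flat), pstdev(flat)
    advantages = []
    for steps in step_rewards:
        normed = [(r - mu) / (sigma + eps) for r in steps]
        reward_to_go, running = [], 0.0
        for r in reversed(normed):  # accumulate from the last step backwards
            running += r
            reward_to_go.append(running)
        advantages.append(list(reversed(reward_to_go)))
    return advantages


per_output = outcome_advantages([1.0, 0.0])
per_step = process_advantages([[1.0, 1.0], [0.0, 0.0]])
```

In the outcome case every token of an output carries the same advantage; in the process case earlier steps accumulate the signal of all later steps in the same output.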

### Adoption in DeepSeek-R1

GRPO is the RL algorithm used for [DeepSeek-R1](/wiki/deepseek_r1) and DeepSeek-R1-Zero. The R1 paper, "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning" (arXiv:2501.12948, January 2025), starts from [DeepSeek-V3-Base](/wiki/deepseek_v3) and applies GRPO at scale, with rule-based rewards for math and code and a format reward for structured reasoning.[^15] DeepSeek-R1-Zero, a model trained without supervised cold-start data, achieves a pass@1 of 71.0 % on AIME 2024 (rising to 86.7 % with majority voting), demonstrating that GRPO-only training can elicit chain-of-thought reasoning without prior SFT.[^15]

The R1 paper is also where GRPO's most well-known emergent properties were documented: as training progresses, the policy learns to allocate more tokens to intermediate reasoning, periodically pauses for self-verification, and (in R1-Zero) occasionally exhibits an "aha moment" where it explicitly notices and corrects an earlier mistake.[^15] These behaviours arise without any specific reward shaping; only the binary correctness reward on final answers and the format reward are applied. The authors interpret this as evidence that GRPO's group-based advantage, applied to a sufficiently capable base model with a verifiable reward, is sufficient to surface latent reasoning capability.
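A rule-based reward of this kind can be sketched with two regular expressions. The details here are assumptions for illustration: answers are expected in a `\boxed{...}` expression without nested braces, the format check looks for a leading `<think>...</think>` block followed by some text, and the two rewards are simply added with equal weight.

```python
import re

# Assumed answer convention: the last \boxed{...} (no nested braces) is the answer.
BOXED = re.compile(r"\\boxed\{([^{}]*)\}")
# Assumed format: a <think>...</think> block first, then a non-empty answer part.
THINK_THEN_ANSWER = re.compile(r"^<think>.+?</think>\s*\S", re.DOTALL)


def accuracy_reward(completion: str, reference: str) -> float:
    """1.0 if the last boxed answer equals the reference string, else 0.0."""
    matches = BOXED.findall(completion)
    if not matches:
        return 0.0
    return 1.0 if matches[-1].strip() == reference.strip() else 0.0


def format_reward(completion: str) -> float:
    """1.0 if the completion follows the assumed think-then-answer layout."""
    return 1.0 if THINK_THEN_ANSWER.match(completion.strip()) else 0.0


def total_reward(completion: str, reference: str) -> float:
    """Equal-weight sum of the two rewards (the weighting is an assumption)."""
    return accuracy_reward(completion, reference) + format_reward(completion)
```

String equality is a deliberately crude verifier; a practical one would normalise equivalent forms such as fractions and decimals before comparing.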

GRPO is also now widely re-implemented outside DeepSeek; the Hugging Face TRL library, OpenRLHF, veRL, and several research codebases ship GRPO trainers, and the algorithm is the focus of follow-up theoretical work analysing its bias relative to PPO and its behaviour in non-LLM RL settings.[^14]

## Benchmark results

The headline results from the DeepSeekMath paper, with citations to the paper page and arXiv preprint, are summarised below.

| Model | Size | GSM8K | MATH (no tools) | MATH + Python |
|---|---|---|---|---|
| Minerva 7B[^11] | 7 B | 16.2 % | 14.1 % | --- |
| Minerva 62B[^11] | 62 B | 52.4 % | 27.6 % | --- |
| Minerva 540B[^11] | 540 B | 58.8 % | 33.6 % | --- |
| Llemma 7B[^11] | 7 B | 37.4 % | 18.1 % | --- |
| Llemma 34B[^11] | 34 B | 54.0 % | 25.3 % | 26.3 % |
| [Mistral 7B](/wiki/mistral_7b) (base)[^11] | 7 B | 40.3 % | 14.3 % | --- |
| DeepSeekMath-Base 7B[^2][^11] | 7 B | 64.2 % | 36.2 % | --- |
| DeepSeekMath-Instruct 7B[^2][^11] | 7 B | 82.9 % | 46.8 % | --- |
| DeepSeekMath-RL 7B[^2] | 7 B | 88.2 % | 51.7 % | 58.8 % |
| GPT-4 (cited in paper)[^2] | --- | 92.0 % | 52.9 % | 69.7 % (Code Interpreter) |
| Gemini Ultra (cited in paper)[^2] | --- | 94.4 % | 53.2 % | --- |

DeepSeekMath-RL 7B was the first openly released 7 B model to break 50 % on MATH without tools or self-consistency.[^1][^2] On Chinese math benchmarks, it reached 79.6 % on MGSM-zh and 88.8 % on CMATH.[^1][^2] Separately, self-consistency over 64 samples raises its MATH accuracy to 60.9 %.[^1]

## Comparison with related math LLMs

| Model | Lab | Year | Base model | Math pre-training tokens | MATH best |
|---|---|---|---|---|---|
| Minerva 540B[^7] | Google Research | 2022 | [PaLM](/wiki/palm) 540 B | ~17.5 B math web + arXiv | 33.6 % |
| Llemma 34B[^8] | Princeton / EleutherAI | 2023 | [Code Llama](/wiki/code_llama) 34B | ~55 B (Proof-Pile-2) | 25.3 % |
| InternLM-Math 20B[^9] | Shanghai AI Lab | Feb 2024 | InternLM2 | not public (continued pre-train) | reported on GSM8K, MATH, MathBench-ZH, MiniF2F[^9] |
| DeepSeekMath 7B (RL)[^1] | [DeepSeek](/wiki/deepseek) | Feb 2024 | DeepSeek-Coder-Base-v1.5 7B | 120 B (DeepSeekMath-Corpus) | 51.7 % |
| Qwen2-Math 72B-Instruct[^16] | [Alibaba Qwen](/wiki/qwen) | Aug 2024 | Qwen2 72B | math-specific corpus (size not disclosed) | 84 % (reported by Alibaba) |
| Qwen2.5-Math 72B[^17] | [Alibaba Qwen](/wiki/qwen) | Sep 2024 | Qwen2.5 72B | proprietary | 66.8 % base (reported in tech report) |

Two patterns stand out. First, the DeepSeekMath team's recipe (initialise from a code-pretrained base, then continue pre-training on a very large fastText-mined math web corpus, then SFT, then GRPO with rule-based math rewards) was reproduced in spirit by later projects, including Qwen2-Math and InternLM-Math 2. Second, scale eventually catches up: Qwen2-Math-72B-Instruct and Qwen2.5-Math-72B-Instruct surpassed DeepSeekMath-RL 7B on MATH in August and September 2024, but at roughly ten times the parameter count.[^16][^17]

## Significance

DeepSeekMath is widely cited as a pivotal data point in three concurrent developments:

1. **Open math LLMs approaching closed-model parity.** DeepSeekMath-RL 7B was the first openly released 7 B model to exceed 50 % on MATH without tools, demonstrating that a focused data pipeline could close most of the gap to [GPT-4](/wiki/gpt-4) without scale.[^1][^11]
2. **The rise of GRPO.** The introduction of GRPO in this paper directly seeded the RL pipeline of [DeepSeek-R1](/wiki/deepseek_r1) one year later and propagated into many open-source RL frameworks.[^15] GRPO is now one of the dominant algorithms for RL on LLMs, alongside PPO and DPO-style methods.[^14]
3. **Code pre-training as a math prior.** The team's ablation showing that code data transfers positively to math reasoning (and arXiv-only pre-training transfers less than expected) informed subsequent design choices across DeepSeek's later models, including DeepSeek-V2 and DeepSeek-V3, where strong code and math performance are central marketing claims.[^11]

The DeepSeekMath-Corpus itself, while not publicly released as a downloadable archive, established a methodology (fastText classifier on Common Crawl, iterative domain-targeted expansion, large-scale deduplication and decontamination) that other groups have since reproduced and open-sourced in projects such as FineMath and open re-implementations of the DeepSeekMath pipeline.[^10]

A further indirect significance is institutional. DeepSeekMath was, alongside [DeepSeek-Coder](/wiki/deepseek), one of the first DeepSeek papers to receive heavy attention from the broader machine-learning research community in the West. Many of the techniques first surfaced in this paper (GRPO, the fastText-classifier data pipeline, code-then-math continued pre-training, and the use of rule-based correctness rewards on math problems) were carried into DeepSeek-V2, DeepSeek-V3, and DeepSeek-R1, and the lab's open-weight releases of the math and reasoning models played a substantial role in the broader public discussion of open-source LLMs in 2024 and 2025.[^15]

## Applications and downstream use

Because the DeepSeekMath weights are released under a license that permits commercial use[^3], the models have been adopted in a variety of downstream contexts:

- **Math tutoring and grading prototypes.** Educational AI projects have used DeepSeekMath-Instruct and -RL as a low-cost backbone for step-by-step solution generation, often paired with a separate verifier to filter incorrect derivations. The model's strong performance on Chinese math benchmarks (MGSM-zh and CMATH) has made it particularly attractive for Chinese-language tutoring use cases.[^2]
- **Synthetic data generation.** DeepSeekMath-RL is widely used to generate synthetic chain-of-thought solutions for fine-tuning smaller student models or for augmenting open math datasets. Several MetaMath- and OpenMathInstruct-style follow-up datasets credit DeepSeekMath as one of their teacher models.[^14]
- **Process reward model training.** Because GRPO directly supports per-step reward signals, several research groups have used DeepSeekMath's pipeline as a substrate for training process reward models for math, where each line of a solution is scored separately.[^1]
- **RL algorithm research.** The clear, small-scale, mathematically verifiable nature of the math task has made DeepSeekMath a popular benchmark for testing new RL algorithms. Many follow-up papers on GRPO variants and related critic-free methods (Critic-augmented GRPO, TIC-GRPO, RLOO and others) report their first results on DeepSeekMath-RL's training setup before scaling to larger models.[^14]
- **DeepSeek-Prover.** DeepSeek's parallel theorem-proving project, DeepSeek-Prover, draws on the DeepSeekMath data and infrastructure for its informal math reasoning components, although the prover targets a different (formal) output format in Lean.[^3]

## Limitations

Several limitations are noted in the paper or in subsequent commentary:

- **No release of the corpus.** Although the model weights are open, the 120 B-token DeepSeekMath-Corpus itself was not published. Researchers wishing to reproduce the pipeline must re-derive their own fastText classifier and re-scrape Common Crawl, which the paper acknowledges is non-trivial.[^1][^10]
- **Benchmark scope.** DeepSeekMath-RL's evaluation focuses on chain-of-thought math (GSM8K, MATH, MGSM-zh, CMATH) and tool-use math (Python). It does not address formal theorem proving (a sister effort, DeepSeek-Prover, targets that domain with Lean).[^3]
- **GRPO instability.** Subsequent analyses (and the DeepSeek-R1 paper itself) have noted that GRPO can be unstable when group variance is small, when reward signals are sparse, or when the policy drifts far from the reference policy, motivating later refinements.[^14][^15]
- **Closed-source comparisons.** Comparisons to GPT-4 and Gemini Ultra in the paper rely on publicly reported numbers from those vendors; closed models are typically evaluated with different prompts and decoding settings, so direct head-to-head claims should be read with caution.[^2]
- **Contamination risk.** Math web pages on the open internet substantially overlap with the sources from which test sets such as MATH were built. The authors describe decontamination steps, but a residual risk of leakage is acknowledged.[^1]

## Related work

DeepSeekMath sits within a broader ecosystem of math-specialised LLMs:

- [PaLM](/wiki/palm) and Minerva (Google, 2022): the first language model widely demonstrated to do competition math, via continued pre-training on arXiv and math web pages.[^7]
- Llemma (Princeton, EleutherAI, Toronto, Cambridge, CMU, Washington, October 2023): first major open continued pre-train for math, based on [Code Llama](/wiki/code_llama) and the Proof-Pile-2 corpus.[^8]
- InternLM-Math (Shanghai AI Lab, February 2024): bilingual open math LLM unifying CoT reasoning, reward modelling, and formal theorem proving in a single seq2seq format.[^9]
- Qwen2-Math and Qwen2.5-Math ([Alibaba](/wiki/qwen), 2024): math-specialised continuations of the Qwen2 / Qwen2.5 family, the first open math models to match or surpass DeepSeekMath-RL on MATH while using considerably more parameters.[^16][^17]
- [DeepSeek-R1](/wiki/deepseek_r1) (January 2025): the direct downstream beneficiary of GRPO, applying the algorithm at scale on a much larger base model to elicit general reasoning, not just math.[^15]

## See also

- [DeepSeek](/wiki/deepseek)
- [DeepSeek-R1](/wiki/deepseek_r1)
- [DeepSeek V3](/wiki/deepseek_v3)
- [GRPO](/wiki/grpo)
- [Proximal Policy Optimization](/wiki/ppo)
- [Reinforcement Learning from Human Feedback](/wiki/rlhf)
- [Supervised fine-tuning](/wiki/supervised_fine-tuning)
- [GSM8K](/wiki/gsm8k)
- [MMLU](/wiki/mmlu)
- [Code Llama](/wiki/code_llama)
- [PaLM](/wiki/palm)
- [Qwen](/wiki/qwen)
- [Mistral 7B](/wiki/mistral_7b)
- [Mixtral](/wiki/mixtral)
- [fastText](/wiki/fasttext)
- [Common Crawl](/wiki/common_crawl)
- [GPT-4](/wiki/gpt-4)
- [Gemini](/wiki/gemini)
- [Reinforcement learning](/wiki/reinforcement_learning)

## References

[^1]: Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, Daya Guo, "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models", arXiv, 2024-02-05 (v1, latest revision v3 2024-04-27). https://arxiv.org/abs/2402.03300. Accessed 2026-05-21.
[^2]: Hugging Face Papers, "Paper page: DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models", Hugging Face, 2024-02-05. https://huggingface.co/papers/2402.03300. Accessed 2026-05-21.
[^3]: DeepSeek AI, "DeepSeek-Math (GitHub repository README)", GitHub, 2024-02. https://github.com/deepseek-ai/DeepSeek-Math. Accessed 2026-05-21.
[^4]: DeepSeek AI, "deepseek-ai/deepseek-math-7b-rl (model card)", Hugging Face, 2024-02-05. https://huggingface.co/deepseek-ai/deepseek-math-7b-rl. Accessed 2026-05-21.
[^5]: arXiv, "DeepSeekMath author list (arXiv:2402.03300v3)", arXiv, 2024-04-27. https://arxiv.org/abs/2402.03300v3. Accessed 2026-05-21.
[^6]: DeepSeek AI, "deepseek-ai/deepseek-coder-7b-base-v1.5 (model card)", Hugging Face, 2024-01. https://huggingface.co/deepseek-ai/deepseek-coder-7b-base-v1.5. Accessed 2026-05-21.
[^7]: Aitor Lewkowycz, Anders Andreassen, David Dohan et al., "Solving Quantitative Reasoning Problems with Language Models", arXiv:2206.14858, 2022-06-29. https://arxiv.org/abs/2206.14858. Accessed 2026-05-21.
[^8]: Zhangir Azerbayev, Hailey Schoelkopf, Keiran Paster, Marco Dos Santos, Stephen McAleer, Albert Q. Jiang, Jia Deng, Stella Biderman, Sean Welleck, "Llemma: An Open Language Model For Mathematics", arXiv:2310.10631, 2023-10-16. https://arxiv.org/abs/2310.10631. Accessed 2026-05-21.
[^9]: Huaiyuan Ying et al., "InternLM-Math: Open Math Large Language Models Toward Verifiable Reasoning", arXiv:2402.06332, 2024-02-09. https://arxiv.org/abs/2402.06332. Accessed 2026-05-21.
[^10]: Kevork Sulahian, "Demystifying DeepSeekMath's Data Pipeline: A FastText-Based Reproduction and Analysis", Hugging Face Blog, 2024. https://huggingface.co/blog/herooooooooo/demystifying-deepseekmaths-data-pipeline-a-fasttex. Accessed 2026-05-21.
[^11]: Shao et al., "DeepSeekMath (full HTML version, including ablation tables)", arXiv HTML, 2024-04-27. https://arxiv.org/html/2402.03300v3. Accessed 2026-05-21.
[^12]: John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov, "Proximal Policy Optimization Algorithms", arXiv:1707.06347, 2017-07-20. https://arxiv.org/abs/1707.06347. Accessed 2026-05-21.
[^13]: Cameron R. Wolfe, "Group Relative Policy Optimization (GRPO)", Substack, 2024. https://cameronrwolfe.substack.com/p/grpo. Accessed 2026-05-21.
[^14]: Hugging Face LLM Course, "Advanced Understanding of Group Relative Policy Optimization (GRPO) in DeepSeekMath", Hugging Face, 2025. https://huggingface.co/learn/llm-course/chapter12/3b. Accessed 2026-05-21.
[^15]: DeepSeek-AI, "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning", arXiv:2501.12948, 2025-01-22. https://arxiv.org/abs/2501.12948. Accessed 2026-05-21.
[^16]: Carl Franzen, "Alibaba claims no. 1 spot in AI math models with Qwen2-Math", VentureBeat, 2024-08-09. https://venturebeat.com/ai/alibaba-claims-no-1-spot-in-ai-math-models-with-qwen2-math. Accessed 2026-05-21.
[^17]: Qwen Team, "Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement", arXiv:2409.12122, 2024-09-18. https://arxiv.org/abs/2409.12122. Accessed 2026-05-21.

