GRPO

Updated Jun 21, 2026

Group Relative Policy Optimization (GRPO) is a reinforcement learning algorithm for fine-tuning large language models that eliminates the separate critic (value) network used by PPO, replacing it with a group-based baseline computed from multiple sampled completions to the same prompt. It was introduced in February 2024 in the DeepSeekMath paper by Zhihong Shao, Peiyi Wang, Qihao Zhu, and colleagues at DeepSeek-AI, who describe it as "a variant of Proximal Policy Optimization (PPO)" that "enhances mathematical reasoning abilities while concurrently optimizing the memory usage of PPO."^[1] By eliminating the value model, GRPO cuts peak GPU memory for RLHF-style training roughly in half compared with PPO.^[1] The algorithm gained wide recognition after DeepSeek-R1 used it as the core RL method for training a frontier reasoning model, demonstrating, in the authors' words, that "the reasoning abilities of LLMs can be incentivized through pure reinforcement learning (RL), obviating the need for human-labeled reasoning trajectories."^[2]

By 2025 GRPO and its descendants had become the dominant post-training algorithm for open reasoning models. Sebastian Raschka characterized 2025 as the year of "RLVR plus GRPO," replacing the 2022 RLHF-plus-PPO paradigm that had defined ChatGPT-era alignment.^[19] The algorithm is now the default starting point in Hugging Face TRL, ByteDance verl, OpenRLHF, NVIDIA NeMo-Aligner, and Unsloth, and dozens of refinements (DAPO, Dr. GRPO, GSPO, GFPO, REINFORCE++, lambda-GRPO, EDGE-GRPO, MO-GRPO, TIC-GRPO, GAGPO) have been published since the original paper.^[31] In 2026 the algorithm continued to anchor flagship releases including DeepSeek-V3.2 and DeepSeek-V4, where it was paired with two-stage post-training and hybrid reward systems to scale RL across many domains simultaneously.^[28]

What problem does GRPO solve? PPO and the value model

Before GRPO, the dominant approach to reinforcement learning from human feedback was actor-critic training with PPO. PPO requires two large models to be resident in GPU memory simultaneously: the policy model (the LLM being trained) and a value model (critic) that estimates the expected future reward at each token position. In practice, the value model is typically initialized from the same pretrained LLM as the policy, meaning training a 7B parameter model under PPO actually requires the compute and memory of roughly two 7B models, plus a separate reference policy to compute the KL divergence penalty and a separate reward model.^[1] For 70B or larger models, this overhead becomes prohibitive.

The value model's job is to provide a per-token baseline: an estimate of how well the model is doing at each point in the generation, which lets the policy gradient update assign credit to specific tokens rather than entire sequences. PPO uses Generalized Advantage Estimation (GAE) to produce these per-token advantage estimates by combining the value predictions at consecutive timesteps.^[20] This works well but requires the critic to be trained jointly with the policy, adding optimization complexity and the risk of instability when the value function is inaccurate. In LLM RLHF specifically, value learning is hard because the reward signal usually arrives only at the end of a long sequence, leaving most of the trajectory unsupervised.^[1] Empirically the critic often underfits or overfits, dragging the policy down with it.

Several earlier approaches tried to reduce this overhead. REINFORCE, the simplest policy gradient method, weights each trajectory's log-probability gradient by its total episode return and uses no learned value function at all, but it suffers from high variance.^[20] RLHF practitioners often applied variance reduction tricks such as subtracting the mean reward of a batch, but these methods still fell short of PPO's stability when training very large models on complex reasoning tasks. RLOO (REINFORCE Leave-One-Out, Ahmadian et al. 2024) revived the variance-reduced REINFORCE family and showed it could outperform PPO on RLHF benchmarks at lower cost; GRPO emerged from a similar motivation but with a different choice of baseline.^[7]

When was GRPO introduced? The DeepSeekMath paper

The GRPO algorithm was presented in "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" (arXiv:2402.03300), submitted to arXiv on February 5, 2024 (final v3 dated April 27, 2024) by a team from DeepSeek-AI, Tsinghua University, and Peking University. The author list is Zhihong Shao, Peiyi Wang, Qihao Zhu (equal contributors), Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo.^[1]

The paper's primary goal was to build a state-of-the-art mathematical reasoning model at the 7B parameter scale. DeepSeekMath-7B continued pretraining from DeepSeek-Coder-Base-v1.5 on 120 billion math-related tokens sourced from Common Crawl, along with natural language and code data. After supervised fine-tuning, the authors applied reinforcement learning with a reward signal derived from answer correctness. The RL stage is where GRPO was introduced: the authors found PPO's value model overhead difficult to justify for a 7B model and designed GRPO as a leaner alternative.^[1]

DeepSeekMath-7B reached 51.7% on the MATH benchmark (competition-level problems) without external tools or majority voting. The paper reports that the model "has achieved an impressive score of 51.7% on the competition-level MATH benchmark without relying on external toolkits and voting techniques, approaching the performance level of Gemini-Ultra and GPT-4," and substantially outperforming other open 7B models.^[1] Self-consistency over 64 samples lifted that score to 60.9% on MATH. The paper also showed gains on GSM8K (82.9% before RL, 88.2% after), CMATH, and several other benchmarks. The chosen RL training set consisted of roughly 144,000 chain-of-thought formatted questions sourced from the GSM8K and MATH portions of the SFT dataset, deliberately excluding other supervised questions to make benchmark gains attributable to the algorithm.^[1]

In the published hyperparameter table the policy was trained with learning rate $1 \times 10^{-6}$, KL coefficient $\beta = 0.04$, group size $G = 64$ completions per prompt, training batch size of 1,024 with 16 prompts per batch (so 16 prompts times 64 completions equals 1,024 samples), maximum generation length 1,024 tokens, and a single policy update per exploration step. The reward model itself was trained from DeepSeekMath-Base 7B with learning rate $2 \times 10^{-5}$ and a 10% historical-data replay buffer to combat distribution drift.^[1]

How does the GRPO algorithm work?

group sampling

The central idea in GRPO is to replace the learned value baseline with a statistical baseline computed from a group of completions. For each training question $q$, GRPO samples a group of $G$ outputs $\{o_1, o_2, \ldots, o_G\}$ from the current (old) policy $\pi_{\theta_{\text{old}}}$.^[1] In the original DeepSeekMath paper the group size was $G = 64$; the later DeepSeek-R1 work scaled the model and tasks but reduced the group size to $G = 16$ for efficiency.^[2] Each output receives a scalar reward from a reward model or rule-based evaluator.

The advantage for output $o_i$ is then computed by normalizing the reward relative to the group:

$\hat{A}_i = \frac{r_i - \text{mean}(\{r_1, \ldots, r_G\})}{\text{std}(\{r_1, \ldots, r_G\})}$

This single scalar advantage is assigned uniformly to every token in $o_i$. There is no learned value function predicting per-token returns; the group statistics serve as the baseline instead.^[1] A small numerical stability term $\epsilon \approx 10^{-8}$ is typically added to the standard deviation in implementations.^[12]

The intuition is that if a model generates 64 answers to the same question and 20 of them are correct, the correct answers will have positive normalized rewards and the incorrect ones will have negative normalized rewards. The policy is then nudged to increase the probability of the correct-answer tokens and decrease the probability of the incorrect ones, without any explicit credit assignment to individual tokens beyond what comes from the group-level ranking.^[13]
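The 64-answer example can be worked through numerically. The sketch below is a minimal illustration, not a library implementation: the function name `group_advantages` is hypothetical, and the use of the population standard deviation is an assumption (some implementations use the sample standard deviation instead).

```python
import math

def group_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own group."""
    g = len(rewards)
    mean = sum(rewards) / g
    # Population standard deviation (an assumption made for this sketch).
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / g)
    # eps keeps the division finite when every reward in the group is identical.
    return [(r - mean) / (std + eps) for r in rewards]

# One prompt, 64 sampled completions: 20 correct (reward 1.0), 44 incorrect (reward 0.0).
rewards = [1.0] * 20 + [0.0] * 44
advantages = group_advantages(rewards)

# Every token of a correct completion shares advantages[0];
# every token of an incorrect completion shares advantages[-1].
print(round(advantages[0], 3), round(advantages[-1], 3))
```

Under these assumptions the correct completions receive an advantage of roughly +1.48 and the incorrect ones roughly -0.67, and the advantages sum to zero across the group.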

The Hugging Face course on DeepSeek-R1 frames the procedure as analogous to grading on a curve: instead of treating each completion as an absolute success or failure, GRPO scores each one relative to its peers on the same question.^[17] This relative formulation is what makes the algorithm robust to reward scale and allows arbitrary reward functions, including non-differentiable code execution checks or symbolic math verifiers, to be plugged in directly.

objective function

The GRPO optimization objective is:

$\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\left\{\min\left[\frac{\pi_\theta(o_{i,t}|q,o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t}|q,o_{i,<t})}\hat{A}_{i,t},\ \text{clip}\left(\frac{\pi_\theta}{\pi_{\theta_{\text{old}}}}, 1-\varepsilon, 1+\varepsilon\right)\hat{A}_{i,t}\right] - \beta\,\mathbb{D}_{\text{KL}}[\pi_\theta \| \pi_{\text{ref}}]\right\}\right]$

Several components deserve attention:

Policy ratio and clipping. The term $\pi_\theta / \pi_{\theta_{\text{old}}}$ is the importance sampling ratio between the current and old policy. PPO clips this ratio to the range $[1-\varepsilon, 1+\varepsilon]$ to prevent large, destabilizing updates. GRPO uses the same clipping mechanism, typically with $\varepsilon = 0.2$.^[1]

KL penalty. Rather than folding the KL divergence into the reward signal (as some PPO implementations do), GRPO applies a direct KL penalty term $\beta\,\mathbb{D}_{\text{KL}}[\pi_\theta \| \pi_{\text{ref}}]$ in the loss. The reference policy $\pi_{\text{ref}}$ is typically the supervised fine-tuned (SFT) model before RL. The DeepSeekMath paper uses an unbiased low-variance estimator for this KL term:^[1]

$\mathbb{D}_{\text{KL}}[\pi_\theta \| \pi_{\text{ref}}] = \frac{\pi_{\text{ref}}(o_{i,t})}{\pi_\theta(o_{i,t})} - \log\frac{\pi_{\text{ref}}(o_{i,t})}{\pi_\theta(o_{i,t})} - 1$

This form is always non-negative, which prevents the KL term from acting as a spurious positive reward and reduces gradient noise compared to the simpler log-ratio approximation. The estimator is sometimes called the "k3" estimator after John Schulman's 2020 blog post that catalogued three KL approximations: $k_1 = \log(p/q)$ (unbiased but high variance), $k_2 = \frac{1}{2}(\log(p/q))^2$ (lower variance but biased), and $k_3 = (q/p - 1) - \log(q/p)$ (unbiased and low variance).^[18] DeepSeek's choice of $k_3$ was popularized through GRPO and has since become the default in TRL and verl.^[12]
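The properties claimed for the three estimators can be checked exactly on a toy example. The sketch below uses two hypothetical next-token distributions over a three-token vocabulary (the numbers are illustrative assumptions) and computes each estimator's exact expectation under the sampling distribution $p$, rather than a Monte Carlo estimate.

```python
import math

def k1(p, q):
    return math.log(p / q)

def k2(p, q):
    return 0.5 * math.log(p / q) ** 2

def k3(p, q):
    ratio = q / p
    return (ratio - 1.0) - math.log(ratio)

# Hypothetical distributions: p plays the role of pi_theta, q the role of pi_ref.
policy = [0.7, 0.2, 0.1]
reference = [0.5, 0.3, 0.2]

def expectation(estimator):
    """Exact expected value of an estimator when tokens are drawn from the policy."""
    return sum(p * estimator(p, q) for p, q in zip(policy, reference))

true_kl = sum(p * math.log(p / q) for p, q in zip(policy, reference))
per_token_k1 = [k1(p, q) for p, q in zip(policy, reference)]
per_token_k3 = [k3(p, q) for p, q in zip(policy, reference)]

print("true KL:", true_kl)
print("E[k1], E[k2], E[k3]:", expectation(k1), expectation(k2), expectation(k3))
print("per-token k1:", per_token_k1)
print("per-token k3:", per_token_k3)
```

In this example $k_1$ and $k_3$ both average to the true KL, $k_2$ does not, and only $k_3$ stays non-negative for every individual token, which is the property the text relies on.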

Token-level averaging. The objective averages over both the group dimension ($1/G$) and the token dimension ($1/|o_i|$) within each output. This means that the total gradient contribution of each response is normalized by its length, so a 1,000-token response and a 10-token response each contribute roughly the same total gradient, regardless of content quality. This normalization choice has downstream consequences that later work (DAPO, Dr. GRPO) identified as a source of bias.^[3]
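Putting the components together, the objective for a single group can be evaluated with plain Python. This is a sketch under stated assumptions: `grpo_objective` and its argument layout (lists of per-token log-probabilities) are hypothetical, the defaults $\varepsilon = 0.2$ and $\beta = 0.04$ are taken from the values quoted in this article, and a real trainer would compute the same quantity on tensors and minimize its negative.

```python
import math

def grpo_objective(logp_new, logp_old, logp_ref, advantages, eps=0.2, beta=0.04):
    """Evaluate the GRPO objective for one group of G outputs.

    logp_new, logp_old, logp_ref: G lists of per-token log-probabilities.
    advantages: G scalars; each is shared by every token of its output.
    """
    total = 0.0
    for lp_new, lp_old, lp_ref, adv in zip(logp_new, logp_old, logp_ref, advantages):
        per_output = 0.0
        for new, old, ref in zip(lp_new, lp_old, lp_ref):
            ratio = math.exp(new - old)                      # importance ratio
            clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)  # PPO-style clipping
            surrogate = min(ratio * adv, clipped * adv)
            ref_ratio = math.exp(ref - new)
            kl = ref_ratio - math.log(ref_ratio) - 1.0       # k3 estimator, always >= 0
            per_output += surrogate - beta * kl
        total += per_output / len(lp_new)                    # 1/|o_i| token averaging
    return total / len(advantages)                           # 1/G group averaging

# On-policy, reference-equal case: ratio = 1 and KL = 0, so the objective
# reduces to the mean advantage of the group.
on_policy = [[-1.0, -2.0], [-0.5, -0.5, -0.5]]
value_on_policy = grpo_objective(on_policy, on_policy, on_policy, [1.0, -1.0])

# A single token whose probability rose from 0.4 to 0.6 (ratio 1.5):
# a positive advantage is clipped at 1.2, a negative one is not clipped.
new, old = [[math.log(0.6)]], [[math.log(0.4)]]
value_clipped = grpo_objective(new, old, new, [2.0])
value_unclipped = grpo_objective(new, old, new, [-2.0])
```

The two single-token cases show the asymmetry of the `min`: gains from pushing the ratio above $1 + \varepsilon$ are capped, while the penalty for raising the probability of a negatively rewarded token is kept in full.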

relationship to REINFORCE and policy gradients

GRPO is most cleanly understood as a particular member of the policy gradient family. The vanilla REINFORCE estimator is

$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot R(\tau)\right],$

where $R(\tau)$ is the trajectory return. Subtracting any baseline $b(s)$ that does not depend on the action leaves the gradient unbiased and reduces variance.^[20] PPO learns a state-dependent baseline (the value function). RLOO uses a leave-one-out mean from the same group, $b_i = \frac{1}{G - 1}\sum_{j \neq i} R_j$, which yields an exactly unbiased estimator.^[7] GRPO uses the in-group mean $b = \frac{1}{G}\sum_j R_j$ and additionally divides by the in-group standard deviation. The mean-with-self baseline introduces a small bias because each sample's reward appears in both the numerator and the baseline, but the bias shrinks as $G$ grows and was found to be acceptable in practice given the stability gains from standard-deviation normalization.^[34]
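The relationship between the two baselines can be made concrete. Before the division by the standard deviation, the in-group-mean advantage is algebraically equal to the leave-one-out advantage scaled by $(G-1)/G$, so the effect of including each sample in its own baseline is a scale factor that approaches 1 as $G$ grows. The sketch below checks this identity numerically; the function names and reward values are illustrative assumptions.

```python
def mean_baseline_advantages(rewards):
    """GRPO-style centering: subtract the mean over the whole group."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

def leave_one_out_advantages(rewards):
    """RLOO-style centering: subtract the mean of the other G - 1 rewards."""
    g = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (g - 1) for r in rewards]

rewards = [1.0, 0.0, 0.0, 1.0, 0.5]
g = len(rewards)
grpo_centered = mean_baseline_advantages(rewards)
rloo = leave_one_out_advantages(rewards)
scale = (g - 1) / g

for a, b in zip(grpo_centered, rloo):
    print(round(a, 4), round(scale * b, 4))
```

The standard-deviation division is the step that has no counterpart in RLOO, since the scale it applies depends on the sampled rewards themselves.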

The outer loop of GRPO closely mirrors PPO. It collects a batch of rollouts, computes ratios against the snapshot policy, applies clipped surrogate updates, and adds an explicit KL penalty against the reference. The only structural change is the substitution of the value-function-derived advantage with the group-statistics advantage and a uniform per-token assignment.^[13]

outcome and process supervision

GRPO supports two variants of reward signal.

In the outcome supervision (OS) variant, each output receives a single scalar reward at the end (for example, 1 if the final answer is correct, 0 otherwise). The advantage is computed from this final reward alone.^[1]

In the process supervision (PS) variant, a process reward model scores each reasoning step individually. The advantage at token $t$ in output $o_i$ is then the sum of all normalized step rewards for steps at or after token $t$. This gives finer-grained training signal and helps the model learn which intermediate reasoning steps are valuable. The DeepSeekMath paper found that GRPO with process supervision outperformed GRPO with outcome supervision on complex mathematical tasks, though outcome supervision remains far more common in practice because training process reward models is expensive and PRMs are themselves hackable.^[1]
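The process-supervision advantage can be sketched as a suffix sum over normalized step rewards. In the example below the function name, the argument layout, and the convention that each step is identified by the 0-based index of its last token are assumptions made for illustration; the step rewards are normalized with the mean and standard deviation of all step rewards in the group.

```python
import math

def process_advantages(step_rewards, step_end_indices, lengths, eps=1e-8):
    """Per-token advantages under process supervision.

    step_rewards[i][j]: reward of step j in output i.
    step_end_indices[i][j]: 0-based index of the last token of step j in output i.
    lengths[i]: number of tokens in output i.
    """
    flat = [r for rewards in step_rewards for r in rewards]
    mean = sum(flat) / len(flat)
    std = math.sqrt(sum((r - mean) ** 2 for r in flat) / len(flat))
    result = []
    for rewards, ends, length in zip(step_rewards, step_end_indices, lengths):
        normalized = [(r - mean) / (std + eps) for r in rewards]
        token_adv = []
        for t in range(length):
            # Sum the normalized rewards of every step ending at or after token t.
            token_adv.append(sum(n for n, end in zip(normalized, ends) if end >= t))
        result.append(token_adv)
    return result

# Two outputs with two reasoning steps each (hypothetical step rewards).
advantages = process_advantages(
    step_rewards=[[1.0, 0.0], [0.0, 1.0]],
    step_end_indices=[[1, 3], [0, 2]],
    lengths=[4, 3],
)
print(advantages)
```

In the first output the good first step and bad second step cancel for the early tokens, while the tokens belonging only to the second step receive a negative advantage; outcome supervision would have given all four tokens the same value.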

reward function design

GRPO with verifiable rewards has spawned a small ecosystem of conventions for how rewards are typically defined. The most common pattern in DeepSeek-R1 and its open reproductions is to combine an accuracy reward and a format reward and sum them.^[2]

A typical accuracy reward returns 1.0 when the model's final answer (extracted from <answer> tags or a boxed expression) matches the ground truth, and 0.0 otherwise. For mathematics, parsing the answer with LaTeX-aware tools and comparing with a symbolic math library is standard. For code, the answer is fed to a sandboxed executor and run against a hidden test suite. For competitive programming the reward can be a continuous value reflecting fraction of test cases passed.^[16]

A format reward returns a small positive value (commonly 0.5) when the model wraps its reasoning in the expected tags (<think>...</think><answer>...</answer>) and 0.0 otherwise. This shapes the model toward emitting parseable structure without explicitly teaching reasoning content. In some implementations a length penalty or cosine length reward is added, which favors shorter outputs among correct answers and penalizes short incorrect answers more heavily than long ones, counteracting GRPO's tendency to inflate response length.^[14]

Multiple reward functions can be combined by summation or weighted summation. TRL allows the user to pass any synchronous Python callable (or async coroutine for tool-use evaluations) and computes the total reward as a weighted sum across the list.^[9] This pluggability is one of the practical reasons GRPO has spread so quickly: any project that can write a verifier function in Python can immediately train with the algorithm.
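A minimal verifier in this style fits in a few lines. The sketch below is illustrative only: the function names, the exact-string comparison, and the weights are assumptions, and it is written as standalone Python rather than against any particular trainer's reward-function signature. A production verifier for mathematics would compare answers symbolically instead of by string equality.

```python
import re

# Whole completion must be <think>...</think> followed by <answer>...</answer>.
THINK_ANSWER = re.compile(r"^<think>.*?</think>\s*<answer>.*?</answer>$", re.DOTALL)
ANSWER = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def format_reward(completion):
    """0.5 if the completion follows the expected tag structure, else 0.0."""
    return 0.5 if THINK_ANSWER.match(completion.strip()) else 0.0

def accuracy_reward(completion, ground_truth):
    """1.0 if the extracted answer equals the ground truth exactly, else 0.0."""
    found = ANSWER.search(completion)
    if found is None:
        return 0.0
    return 1.0 if found.group(1).strip() == ground_truth.strip() else 0.0

def total_reward(completion, ground_truth, weights=(1.0, 1.0)):
    """Weighted sum of the accuracy and format rewards."""
    return (weights[0] * accuracy_reward(completion, ground_truth)
            + weights[1] * format_reward(completion))

good = "<think>2 + 2 = 4</think><answer>4</answer>"
wrong = "<think>2 + 2 = 5</think><answer>5</answer>"
untagged = "The answer is 4."
print(total_reward(good, "4"), total_reward(wrong, "4"), total_reward(untagged, "4"))
```

Because the rewards only need to be computed, not differentiated, the same pattern extends to sandboxed code execution or any other check that returns a number.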

How does GRPO differ from PPO?

Property	PPO	GRPO
Value model	Required (same size as policy)	Not required
Trainable models	Policy + Critic	Policy only
Memory (rough)	~2x policy size	~1x policy size
Advantage estimation	Learned per-token via GAE	Group statistics, uniform per token
KL handling	Usually in reward signal	Direct loss penalty
Completions per prompt	Typically 1	Must be high (e.g., 16-64)
Training complexity	Higher (two loss functions)	Lower (single model)
Stability at small groups	Good	Degrades
Adopted in DeepSeek-R1	No	Yes

The primary practical advantage of GRPO is memory. By eliminating the critic, GRPO cuts peak GPU memory consumption roughly in half compared to PPO for equivalent model sizes, making large-scale RL fine-tuning accessible on fewer or smaller GPUs.^[1] The tradeoff is that GRPO depends on sampling enough completions per prompt to get a reliable group-level baseline. If the group is too small or if all completions happen to have identical rewards (all correct or all wrong), the advantage estimate is either noisy or zero, providing no gradient signal.^[3]

Another practical consideration is throughput. PPO typically generates one completion per prompt at rollout time, while GRPO generates $G$ completions per prompt, multiplying rollout cost by the group size. At $G = 16$ this is a 16-fold increase in generated sequences per prompt, which can dominate end-to-end training time unless the rollout is heavily optimized through batched generation, prefix sharing, and asynchronous engines like vLLM.^[11] Modern GRPO toolchains generally pair the trainer with a vLLM-based generator running in parallel to keep GPUs saturated.^[11]

A final qualitative difference is that PPO's per-token credit assignment can in principle reward earlier reasoning steps that lead to correct answers even if the late steps fail, while GRPO assigns the same advantage to every token in a response, treating the whole completion as a single unit. In practice, with sparse reward signals from outcome verifiers, this distinction matters less than the theoretical analysis suggests, because PPO's value function struggles to learn meaningful per-token credit anyway.

A December 2025 systematic comparison (arXiv:2512.07611) tuned each algorithm in isolation on identical Qwen2.5 bases and reported that GRPO and DAPO consistently dominated PPO on math reasoning, with the gap widening as group size grew. The same study found DAPO's dynamic sampling was not always beneficial; the best configuration disabled dynamic sampling and relied on Clip-Higher and overlong reward shaping. This contradicts some earlier claims and underscores how sensitive GRPO-family results are to the specific base model and dataset.^[35]

How was GRPO used in DeepSeek-R1?

DeepSeek-R1 (arXiv:2501.12948, January 2025) applied GRPO at a much larger scale and for a broader purpose than the original DeepSeekMath work: training a frontier 671B MoE model to produce long, explicit chain-of-thought reasoning competitive with OpenAI o1.^[2]

DeepSeek-R1-Zero

The first experiment, called DeepSeek-R1-Zero, applied GRPO directly to DeepSeek-V3-Base with no supervised fine-tuning on reasoning traces. The reward function was entirely rule-based: an accuracy reward (correctness of the final answer, verified by a deterministic checker for math and by code execution for programming problems) and a format reward (the model must wrap its thinking in <think> and </think> tags). Neural reward models were deliberately avoided because the authors worried about reward hacking at large RL scale.^[2]

The hyperparameters used were: learning rate $3 \times 10^{-6}$, KL coefficient $\beta = 0.001$, clip ratio $\varepsilon = 0.2$, group size $G = 16$ completions per question, maximum length 32,768 tokens, and 32 unique questions per step (512 total completions per step). Compared with the DeepSeekMath setup the KL coefficient was reduced 40-fold, allowing the policy to drift much further from the reference, and the maximum sequence length was scaled up by 32x to accommodate long chain-of-thought generations.^[2]
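The two published configurations can be laid side by side to check the arithmetic quoted above. The `GRPOSetup` dataclass below is a hypothetical container introduced for this comparison (it is not an API of any library); the field values are the ones stated in this article for DeepSeekMath and DeepSeek-R1-Zero.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GRPOSetup:
    learning_rate: float
    kl_coef: float          # beta
    group_size: int         # G completions per prompt
    prompts_per_step: int
    max_length: int         # maximum generation length in tokens

    @property
    def completions_per_step(self):
        return self.group_size * self.prompts_per_step

deepseek_math = GRPOSetup(1e-6, 0.04, 64, 16, 1024)
r1_zero = GRPOSetup(3e-6, 0.001, 16, 32, 32768)

kl_reduction = deepseek_math.kl_coef / r1_zero.kl_coef
length_scale = r1_zero.max_length // deepseek_math.max_length

print(deepseek_math.completions_per_step, r1_zero.completions_per_step)
print(kl_reduction, length_scale)
```

R1-Zero halves the number of completions per step relative to DeepSeekMath (512 versus 1,024) while allowing each one to be up to 32 times longer, so the token budget per step grows substantially even though the group shrinks.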

R1-Zero produced a striking result: without any human-labeled chain-of-thought data, the model spontaneously developed sophisticated reasoning behaviors. The authors documented an "aha moment" in which the model learned to pause and reassess its approach mid-solution, producing the chain-of-thought line "Wait, wait. Wait. That's an aha moment I can flag here," which the paper interpreted as evidence of "a self-evolution process" induced by optimization pressure.^[2] The model's AIME 2024 pass@1 score rose from 15.6% (the base model) to 71.0%, with majority voting pushing it to 86.7%, matching OpenAI-o1-0912 at the time. Average response length grew steadily over training, from a few hundred tokens at the start to multi-thousand-token reasoning chains by the end of RL, with the model learning to allocate more thinking to harder problems.^[2]

However, R1-Zero also produced readability issues: language mixing (switching between English and Chinese mid-reasoning) and occasional incoherent formatting.^[2]

DeepSeek-R1 with cold start

The full DeepSeek-R1 model uses a four-stage training pipeline that addresses R1-Zero's weaknesses:

Cold start. The base model is fine-tuned on a small dataset of thousands of long chain-of-thought examples to stabilize generation format before RL begins.
Reasoning RL. Large-scale GRPO training focused on math, code, and science problems, using the same rule-based reward as R1-Zero, supplemented with a language-consistency reward to discourage code-switching.
Rejection sampling and SFT. Around 600,000 reasoning samples and 200,000 general-purpose samples are generated from the RL-trained model, filtered, and used for another SFT round.
General RL. A final RL stage using both rule-based accuracy rewards (for verifiable tasks) and preference-based rewards (from a reward model, for open-ended tasks).^[2]

DeepSeek-R1 achieved 79.8% on AIME 2024 pass@1, 97.3% on MATH-500, 71.5% on GPQA Diamond, and a Codeforces Elo of 2,029 (a rating the paper notes outperforms 96.3% of human competitors), reaching performance comparable to OpenAI-o1-1217 on most benchmarks.^[2] The distilled versions (R1-Distill-Qwen-7B, R1-Distill-Qwen-32B) achieved 55.5% and 72.6% on AIME 2024 respectively, showing that reasoning capabilities can be transferred to smaller models via knowledge distillation. The R1-Distill-Llama-70B variant approached o1-mini performance on MATH-500 (94.5%), and even the 1.5B distilled checkpoint exceeded the base R1-Zero on several benchmarks, demonstrating that the GRPO-trained reasoning behaviors transfer remarkably well to compact backbones via simple SFT on R1 traces.^[2]

The practical impact extended beyond benchmarks. DeepSeek-R1's API pricing of approximately $0.14 per million cached input tokens (about $0.55 per million on a cache miss), combined with the fully open recipe, set off a cascade of open reasoning models in the first half of 2025 (Open-R1, OpenThinker, S1, DeepSeek-R1-Distill family, Sky-T1, OpenReasoning-Nemotron) that all relied on GRPO at some point in the pipeline.^[16]

use in DeepSeek-V3.2 and DeepSeek-V4

DeepSeek continued to use GRPO as the core RL algorithm in its 2025 and 2026 flagship releases, but with progressively more sophisticated reward systems and orchestration rather than algorithmic substitution.

DeepSeek-V3.2 hybrid reward system

DeepSeek-V3.2 (released September 2025 as DeepSeek-V3.2-Exp and detailed in arXiv:2512.02556) retained GRPO as the RL algorithm rather than switching to GSPO or DAPO, combining it with a hybrid reward system:^[25]

Rule-based reward. For reasoning and agent tasks (math, code, tool use), the same deterministic checkers as R1 were used, supplemented with length penalties and a language-consistency reward.
Generative reward. For tasks where output quality cannot be measured directly, a generative reward model scored candidate responses against per-prompt rubrics.^[27]

V3.2 kept the KL penalty against the reference policy but tuned the weight $\beta$ per domain. For math the team reported that very weak or zero KL often produced the best results, allowing the policy to drift aggressively toward the verifiable correctness signal. For general-purpose tasks the KL weight was kept higher to prevent regression on safety and style. Stabilization features specific to MoE were added: unbiased KL estimators, off-policy masking when the importance ratio exceeded a threshold, and "keep-routing" auxiliary losses that prevented expert routing patterns from collapsing.^[26] This per-domain KL tuning is now a common pattern across frontier RL systems.

V3.2's GRPO stage pushed response budgets to 65,536 tokens, and average reasoning traces grew roughly 60% longer than R1's, a tradeoff the team defended as consistent with the "more thinking equals better answers" empirical scaling law that GRPO-trained reasoning models had established by mid-2025.^[26]

DeepSeek-V4 two-stage post-training

DeepSeek-V4, released in April 2026 (V4-Pro and V4-Flash variants), introduced a more substantial post-training restructuring while still using GRPO as the underlying RL algorithm.^[30] The published recipe (described in the V4 technical materials) used a two-stage paradigm:

Independent cultivation of domain-specific experts. Rather than train a single model on all domains simultaneously, the team trained separate specialist models from the same V4 base: one for mathematics, one for coding, one for agent and tool use, one for instruction following, and others for additional domains. Each specialist underwent SFT on domain-specific data followed by GRPO with reward signals tailored to that domain (rule-based verifiers for math and code, rubric-based generative rewards for instruction following, environment rewards for agent tasks).
Unified model consolidation via on-policy distillation. These specialist models then served as teachers for a single unified student model. The student generated its own outputs and optimized against the full logit distributions of whichever teacher was most relevant to the current prompt context, an approach the team called on-policy distillation (OPD). The student inherited the specialized capabilities of each expert without the cross-domain interference that had limited prior multi-task RL.^[28]

The motivation was an observation that multi-capability LLMs trained on mixed-domain RL data tend to converge to compromises rather than excelling in any single area. By isolating GRPO training per domain, V4 reported substantial benchmark gains: HumanEval jumped by approximately 14 percentage points relative to a single-model GRPO baseline (driven by the coding expert), and world-knowledge tasks improved by roughly 27 points (because that expert was insulated from math and code interference during RL).^[28]

The Pro variant reached approximately 80.6% on SWE-Bench Verified, positioning it among the top open-weights models in May 2026 alongside Kimi K2.6, GLM-5.1, and Qwen 3.6 Plus.^[29] All four flagships either used GRPO directly or used a published GRPO descendant (GSPO in the case of Qwen), confirming that the algorithm family had become the post-training default.

implementations

TRL (Hugging Face)

Hugging Face's TRL library added a GRPOTrainer class that implements GRPO with support for PEFT (LoRA/QLoRA), FSDP multi-GPU training, and vLLM-based generation for fast rollout sampling. The trainer accepts a user-defined reward function (or list of functions), making it straightforward to apply GRPO to custom tasks beyond math.^[9] The Liger Kernel integration reduced memory usage by a further 40% with no quality drop by chunking the language-modeling head's logit computation across the batch, which is one of the dominant memory costs at long sequence lengths.^[10] TRL's GRPO implementation became the default starting point for most academic groups reproducing R1-style training on smaller models.

The trainer can run vLLM in either a separate-server mode or a co-located mode that shares GPU memory with the trainer. Co-located mode improves utilization on a single node but trades off against potential memory contention. For multi-node training the typical pattern is to run vLLM on dedicated inference GPUs and stream completions into the trainer's experience buffer asynchronously.^[11]

OpenRLHF

OpenRLHF is a Ray-based distributed RLHF framework that supports PPO, GRPO, REINFORCE++, DAPO, and several other algorithms. It was designed for scalability across large GPU clusters and implements asynchronous rollout generation, allowing inference and training to be decoupled across different hardware allocations.^[6] The framework was used in early reproductions of R1-Zero on Qwen2.5 base models and continues to be a reference implementation for distributed GRPO at hundreds of GPUs.

verl (Volcano Engine RL)

verl (HybridFlow) is the RL post-training framework developed by ByteDance, with contributions from Anyscale, Alibaba Qwen, Shanghai AI Lab, UC Berkeley, and others. It supports GRPO, GSPO, DAPO, REINFORCE++, PPO, and other algorithms with optimized multi-GPU scheduling.^[12] The library provides a reference script for GRPO + LoRA training on Qwen2.5 with GSM8K as the target task, which became a common benchmark configuration for comparing implementations. verl was the framework used to produce the original DAPO results and to train several Qwen-family RL checkpoints.^[3]

Unsloth

Unsloth is a memory-efficient fine-tuning toolkit that added GRPO support in early 2025. By pairing 4-bit QLoRA quantization with a custom Triton-based attention and a tightly integrated TRL backend, Unsloth claims roughly 2x faster training and up to 70% lower VRAM usage compared with stock TRL. It demonstrated that the R1-Zero "aha moment" can be reproduced on Qwen2.5-3B on a single consumer GPU with 5 GB of VRAM, dramatically lowering the barrier to GRPO experimentation. Recommended hyperparameters include learning rate $5 \times 10^{-6}$ for GRPO and $2 \times 10^{-4}$ for ordinary LoRA SFT.^[23]

Open-R1

Open-R1 is Hugging Face's open reproduction of DeepSeek-R1, started in January 2025. It uses GRPO (via TRL) to replicate the R1 training pipeline on open base models, with public training recipes, datasets, and checkpoints.^[16] The project demonstrated that a Qwen2.5-32B model trained with open-source tooling and GRPO could closely match the performance of R1's distilled versions on coding benchmarks and produced the OpenR1-Math reasoning dataset that has been adopted by several follow-up projects.^[16]

NeMo-Aligner

NVIDIA's NeMo-Aligner framework integrated GRPO support in 2025 alongside existing PPO, DPO, and SteerLM pipelines. It provides reference recipes for Megatron-LM-style tensor and pipeline parallelism, making GRPO usable on very large dense models without sharding the policy across heterogeneous accelerators.

Predibase, Together, and managed offerings

Several RL fine-tuning platforms productized GRPO under the label "reinforcement fine-tuning," exposing a SaaS interface where the user supplies a reward function and a prompt set and the platform handles rollout, reward computation, and policy updates. This brought GRPO to teams that lacked the GPU infrastructure to run R1-style training in-house.

variants and extensions

GRPO has spawned an extensive lineage of variants, each addressing a specific failure mode observed when scaling the algorithm.

Variant	Year	Authors / Org	Core change relative to GRPO
RLOO	2024	Cohere (Ahmadian et al.)	Leave-one-out baseline instead of group mean
REINFORCE++	2025	OpenRLHF	Batch-level baseline, KL in reward
DAPO	2025	ByteDance Seed	Decoupled clip, dynamic sampling, token-level loss
Dr. GRPO	2025	Sea AI Lab (Liu et al.)	Removes per-response length normalization and standard-deviation (difficulty) normalization
GSPO	2025	Alibaba Qwen	Sequence-level importance ratios
GFPO	2025	Microsoft Research	Larger samples plus filtering by length and efficiency
lambda-GRPO	2025	Various	Learnable interpolation between length normalizations
EDGE-GRPO	2025	Open-source	Entropy-driven exploration with guided error correction
MO-GRPO	2025	IBM-Imperial	Multi-objective reward variance normalization
TIC-GRPO	2025	Academic	Trajectory-importance correction for provable convergence
GAGPO	2026	Academic	Generalized advantage grouped policy optimization
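
As a concrete illustration of the first row of the table, the sketch below contrasts a group-mean baseline with a leave-one-out baseline on the same rewards. The function names are illustrative and not taken from any library, and reward standardization is left out to keep the comparison minimal.

```python
from typing import List


def group_mean_advantages(rewards: List[float]) -> List[float]:
    """Baseline is the mean of the whole group, including the sample itself."""
    mu = sum(rewards) / len(rewards)
    return [r - mu for r in rewards]


def leave_one_out_advantages(rewards: List[float]) -> List[float]:
    """Baseline for sample i is the mean reward of the other G - 1 samples."""
    g, total = len(rewards), sum(rewards)
    return [r - (total - r) / (g - 1) for r in rewards]


rewards = [1.0, 0.0, 0.0, 1.0]
mean_adv = group_mean_advantages(rewards)    # [0.5, -0.5, -0.5, 0.5]
loo_adv = leave_one_out_advantages(rewards)  # same signs, scaled by G / (G - 1)
```

Without standardization the two baselines differ only by the factor $G / (G - 1)$, so the distinction matters most for small groups.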

DAPO

DAPO (Decoupled Clip and Dynamic sAmpling Policy Optimization, arXiv:2503.14476, March 2025) was developed by ByteDance Seed and addresses several limitations that become pronounced in large-scale RL training:^[3]

Clip-Higher. Standard GRPO uses symmetric clipping ($\varepsilon = 0.2$ on both sides). DAPO decouples the upper and lower bounds, using a higher upper clip ($\varepsilon_{\text{high}} = 0.28$) and the standard lower clip ($\varepsilon_{\text{low}} = 0.2$). The higher upper bound allows low-probability tokens to be promoted more aggressively, which counteracts entropy collapse and preserves output diversity.^[3]
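
A minimal sketch of the decoupled clip for a single token, assuming the usual PPO-style pessimistic minimum; the function name and scalar interface are illustrative only.

```python
def clipped_surrogate(ratio: float, advantage: float,
                      eps_low: float = 0.2, eps_high: float = 0.28) -> float:
    """Per-token clipped objective with separate lower and upper clip bounds."""
    clipped_ratio = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    # Take the more pessimistic of the unclipped and clipped terms.
    return min(ratio * advantage, clipped_ratio * advantage)


# A token whose probability rose by 25% (ratio 1.25) with positive advantage:
symmetric = clipped_surrogate(1.25, 1.0, eps_low=0.2, eps_high=0.2)  # capped at 1.2
decoupled = clipped_surrogate(1.25, 1.0)                             # 1.25, not yet clipped
```

With symmetric clipping the objective is already flat at ratio 1.25, so that token receives no further gradient; with the higher upper bound it can keep growing until the ratio reaches 1.28.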

Dynamic Sampling. When all completions in a group are correct (reward = 1.0) or all are incorrect (reward = 0.0), the group standard deviation is zero and the advantage for every token is zero. This kills the gradient signal. DAPO addresses this by filtering out such prompts before the gradient step, oversampling the batch to maintain effective batch size. This ensures every training step has a genuine learning signal.^[3]
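
The filtering step can be sketched as follows. This is a simplified illustration: the helper names are hypothetical, and a real pipeline would keep generating new prompts until the batch is full rather than scanning a fixed dictionary.

```python
from typing import Dict, List


def has_signal(rewards: List[float]) -> bool:
    """A group is informative only if its rewards are not all identical."""
    return max(rewards) > min(rewards)


def dynamic_sample(groups: Dict[str, List[float]], batch_size: int) -> List[str]:
    """Keep prompts whose groups have non-zero reward variance, up to batch_size."""
    kept: List[str] = []
    for prompt_id, rewards in groups.items():
        if has_signal(rewards):
            kept.append(prompt_id)
            if len(kept) == batch_size:
                break
    return kept


groups = {
    "p1": [1.0, 1.0, 1.0, 1.0],  # all correct: dropped
    "p2": [1.0, 0.0, 0.0, 1.0],  # mixed: kept
    "p3": [0.0, 0.0, 0.0, 0.0],  # all incorrect: dropped
    "p4": [0.0, 1.0, 0.0, 0.0],  # mixed: kept
}
selected = dynamic_sample(groups, batch_size=2)
```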

Token-level policy gradient loss. GRPO normalizes loss per response (dividing by $|o_i|$), which means short responses have higher per-token contribution than long ones. DAPO removes per-response normalization and normalizes across all tokens in the batch, giving longer, high-quality reasoning chains more influence on the gradient.^[3]
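
The difference between the two aggregation schemes is easiest to see on a toy batch. The sketch below assumes per-token losses are already computed; the function names are illustrative.

```python
from typing import List


def sample_level_loss(token_losses: List[List[float]]) -> float:
    """Average within each response first, then across responses."""
    per_response = [sum(t) / len(t) for t in token_losses]
    return sum(per_response) / len(per_response)


def token_level_loss(token_losses: List[List[float]]) -> float:
    """Average over every token in the batch, regardless of which response it is in."""
    total = sum(sum(t) for t in token_losses)
    count = sum(len(t) for t in token_losses)
    return total / count


# One 2-token response (loss 1.0 per token) and one 8-token response (loss 3.0 per token).
batch = [[1.0, 1.0], [3.0] * 8]
per_sample = sample_level_loss(batch)  # (1.0 + 3.0) / 2 = 2.0
per_token = token_level_loss(batch)    # (2.0 + 24.0) / 10 = 2.6
```

Under sample-level averaging each token of the long response carries weight 1/16 while each token of the short one carries 1/4; under token-level averaging every token carries weight 1/10.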

Soft overlong reward shaping. When a response is truncated because it hit the maximum length limit, GRPO assigns the same reward as a complete but incorrect response. This sends a misleading signal to the model, since a truncated response may have been reasoning soundly. DAPO applies a gradual length penalty in a window near the maximum length (the paper uses a cache window of 4,096 tokens before the 20,480-token limit), reducing noise from truncated samples.^[3]
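
One way to realize such a gradual penalty is a linear ramp across the cache window. The sketch below uses the window sizes quoted above; the linear shape, the penalty floor of -1, and the function name are assumptions made for illustration.

```python
def soft_overlong_penalty(length: int, max_len: int = 20480, cache: int = 4096) -> float:
    """Length penalty added to the task reward (assumed linear ramp)."""
    soft_start = max_len - cache           # 16,384 with the defaults
    if length <= soft_start:
        return 0.0                         # comfortably short: no penalty
    if length <= max_len:
        return (soft_start - length) / cache  # ramps from 0 down to -1
    return -1.0                            # beyond the hard limit


no_penalty = soft_overlong_penalty(16384)
half_penalty = soft_overlong_penalty(18432)
full_penalty = soft_overlong_penalty(20480)
```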

DAPO uses prompt batch size 512, group size $G = 16$, learning rate $1 \times 10^{-6}$, and maximum generation length 20,480 tokens. In an ablation study, naive GRPO on Qwen2.5-32B reached 30 points on AIME 2024. Adding DAPO's components one at a time brought the score to 36 (overlong filtering), 38 (plus Clip-Higher), 41 (plus soft overlong punishment), 42 (plus token-level loss), and finally 50 points once dynamic sampling was added to complete the recipe, outperforming DeepSeek-R1-Zero-Qwen-32B (47 points) using 50% fewer training steps.^[3]

Dr. GRPO

Dr. GRPO ("GRPO Done Right") was introduced in "Understanding R1-Zero-Like Training: A Critical Perspective" (arXiv:2503.20783, March 2025) by Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin from Sea AI Lab, accepted to COLM 2025.^[4] The paper identifies two normalizations in standard GRPO that introduce bias.

First, dividing the per-response loss by $|o_i|$ creates a length bias whose direction depends on the sign of the advantage. For correct responses (positive advantage), shorter responses receive a larger per-token update, so the policy favors brevity among correct answers. For incorrect responses (negative advantage), longer responses are penalized less per token, so the policy drifts toward longer wrong answers. The result is a feedback loop in which incorrect responses become progressively longer over training. The authors demonstrate this empirically: the average length of incorrect responses grows steadily across training, while correct responses stay roughly constant.^[4]

Second, the in-group standard-deviation scaling makes the advantage magnitude depend on the difficulty of the prompt: prompts with low reward variance, meaning those the model almost always or almost never solves, produce larger normalized advantages than prompts of intermediate difficulty. Dr. GRPO removes the standard-deviation division and replaces the per-response length scaling with a single group-level constant, which the authors show produces an unbiased gradient and better token efficiency.^[4]
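
The effect of the standard-deviation division can be checked numerically. The sketch below compares the two advantage definitions on binary rewards, assuming the population standard deviation and a small stabilizing epsilon; the function names are illustrative.

```python
from statistics import mean, pstdev
from typing import List


def grpo_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Mean-centered and divided by the in-group standard deviation."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def dr_grpo_advantages(rewards: List[float]) -> List[float]:
    """Mean-centered only: no standard-deviation scaling."""
    mu = mean(rewards)
    return [r - mu for r in rewards]


easy = [1.0] * 7 + [0.0]       # 7 of 8 correct: low variance
balanced = [1.0] * 4 + [0.0] * 4  # 4 of 8 correct: maximal variance

grpo_easy, dr_easy = grpo_advantages(easy), dr_grpo_advantages(easy)
grpo_balanced, dr_balanced = grpo_advantages(balanced), dr_grpo_advantages(balanced)
```

On the low-variance prompt the single wrong answer gets a standardized advantage of about -2.65, against roughly plus or minus 1 on the balanced prompt; without the division the values are -0.875 and plus or minus 0.5, so the low-variance prompt is no longer up-weighted.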

Using this minimalist Dr. GRPO recipe, the authors achieved 43.3% on AIME 2024 with a 7B base model.^[4] Lambda-GRPO ($\lambda$-GRPO, arXiv:2510.06870) generalizes both approaches with a learnable interpolation parameter $\lambda$ that smoothly trades off between per-response and per-token normalizations.^[21]

GSPO

GSPO (Group Sequence Policy Optimization, arXiv:2507.18071, July 2025) was developed by the Alibaba Qwen team and is the RL algorithm used to train Qwen3.^[5] GSPO's core criticism of GRPO is that GRPO applies importance sampling ratios at the token level, but rewards are given at the sequence level. This mismatch causes the effective importance weight of a sequence to grow with sequence length (since it is a product of per-token ratios), introducing variance that accumulates and destabilizes training at large scale.^[5]

GSPO instead computes importance ratios at the sequence level, defined as the likelihood of the entire sequence under the new policy divided by its likelihood under the old policy, raised to the power $1/|y|$ for length normalization; equivalently, it is the geometric mean of the per-token probability ratios. Clipping and optimization are then applied at this sequence level. This matches the granularity at which rewards are actually assigned. Notably, GSPO clips out two orders of magnitude more tokens than GRPO during training but still achieves better final performance, suggesting GRPO's token-level updates were actively harmful in those regimes. The Qwen team states that these "merits have contributed to the exceptional performance of the latest Qwen3 models (Instruct, Coder, Thinking)," and emphasizes that GSPO is particularly valuable for Mixture-of-Experts (MoE) models: where GRPO required a "Routing Replay" workaround to converge, "GSPO completely eliminates the dependency on Routing Replay," avoiding the destructive oscillation of expert activation patterns under token-by-token routing variability.^[15]
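
A small sketch of the two kinds of ratio, computed from per-token log-probabilities. The log-probability values are made up for illustration and the function names are not from any library.

```python
import math
from typing import List


def token_ratios(new_logps: List[float], old_logps: List[float]) -> List[float]:
    """Token-level importance ratios: one per generated token."""
    return [math.exp(n - o) for n, o in zip(new_logps, old_logps)]


def sequence_ratio(new_logps: List[float], old_logps: List[float]) -> float:
    """Length-normalized sequence ratio: exp of the mean per-token log-ratio."""
    diffs = [n - o for n, o in zip(new_logps, old_logps)]
    return math.exp(sum(diffs) / len(diffs))


new_logps = [-1.0, -2.0, -0.5]
old_logps = [-1.2, -1.8, -0.5]

per_token = token_ratios(new_logps, old_logps)   # about [1.22, 0.82, 1.0]
per_seq = sequence_ratio(new_logps, old_logps)   # about 1.0
```

Here two tokens sit outside a 0.2 clip range in opposite directions, yet the sequence as a whole has barely moved, which is the situation where token-level and sequence-level clipping disagree.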

By 2026 GSPO had become the default RL algorithm across the Qwen3 family (Qwen3-Coder, Qwen3.6 Plus) and was adopted in the post-training of the GLM-5 family (Z.AI, April 2026). Cross-team analyses positioned the three approaches as follows: DAPO prioritizes training stability, GSPO prioritizes mathematical correctness of the importance-sampling step at long sequence lengths and MoE scale, and vanilla GRPO prioritizes simplicity.^[32] In flagship 2026 systems, GSPO is more common for MoE policy models while GRPO with DAPO-style refinements is more common for dense or smaller mixture models.

REINFORCE++

REINFORCE++ (arXiv:2501.03262) by Jian Hu and the OpenRLHF team takes a different approach to eliminating the critic. It retains the REINFORCE objective but uses the normalized reward of a full training batch as the baseline, rather than the group-level statistics in GRPO.^[6] This makes the baseline more globally calibrated. Empirical comparisons (e.g., in the Logic-RL and PRIME frameworks) found REINFORCE++ to be more stable than GRPO in some settings and to produce longer, more detailed responses, suggesting GRPO's group-level normalization can cause the model to converge on shorter outputs.^[6]
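
The contrast between group-level and batch-level normalization can be sketched as below. This is a simplified reading that ignores the KL-in-reward term listed in the variant table; the function names and the toy rewards are illustrative.

```python
from statistics import mean, pstdev
from typing import List


def group_normalized(groups: List[List[float]], eps: float = 1e-6) -> List[List[float]]:
    """Standardize rewards separately within each prompt's group."""
    out = []
    for rewards in groups:
        mu, sigma = mean(rewards), pstdev(rewards)
        out.append([(r - mu) / (sigma + eps) for r in rewards])
    return out


def batch_normalized(groups: List[List[float]], eps: float = 1e-6) -> List[List[float]]:
    """Standardize rewards using statistics of the whole training batch."""
    flat = [r for rewards in groups for r in rewards]
    mu, sigma = mean(flat), pstdev(flat)
    return [[(r - mu) / (sigma + eps) for r in rewards] for rewards in groups]


groups = [[1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 1.0, 0.0]]
per_group = group_normalized(groups)  # first group: all zeros, no signal
per_batch = batch_normalized(groups)  # first group: positive advantages
```

With batch statistics, an all-correct group still receives a non-zero advantage because it is compared against the rest of the batch rather than against itself.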

GFPO

GFPO (Group Filtered Policy Optimization, arXiv:2508.09726, August 2025) was developed by Microsoft Research as a direct response to GRPO's tendency to inflate response length.^[8] The core idea is to oversample the group (for example, generating 32 candidates instead of 16) and then keep only a filtered subset for the policy gradient update, where the filter selects responses by length, by token efficiency (reward per token), or by adaptive difficulty. On Phi-4-reasoning, GFPO cut GRPO's length inflation by 46-71% across AIME 24/25, GPQA, Omni-MATH, and LiveCodeBench while preserving accuracy. Optimizing for reward per token pushed length reductions to 71-85%, translating to roughly 30% lower end-to-end inference latency at deployment time and a 90-second improvement on hard queries. An adaptive-difficulty variant allocates more rollout budget to harder problems.^[8]
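
A minimal sketch of the oversample-then-filter idea, assuming the filter keeps the k shortest responses, that advantages are standardized within the retained subset, and that rejected responses receive zero advantage. These choices and the function name are simplifications for illustration, not the paper's exact procedure.

```python
from statistics import mean, pstdev
from typing import List


def filtered_advantages(rewards: List[float], lengths: List[int],
                        k: int, eps: float = 1e-6) -> List[float]:
    """Keep the k shortest responses; everything else gets zero advantage."""
    order = sorted(range(len(rewards)), key=lambda i: lengths[i])
    kept = set(order[:k])
    kept_rewards = [rewards[i] for i in sorted(kept)]
    mu, sigma = mean(kept_rewards), pstdev(kept_rewards)
    return [(rewards[i] - mu) / (sigma + eps) if i in kept else 0.0
            for i in range(len(rewards))]


rewards = [1.0, 0.0, 1.0, 1.0]
lengths = [900, 300, 400, 1200]
adv = filtered_advantages(rewards, lengths, k=2)
# Responses 1 and 2 (300 and 400 tokens) are kept; 0 and 3 are masked out.
```

Swapping the sort key for reward per token would give a sketch of the token-efficiency filter described above.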

EDGE-GRPO and entropy-aware variants

EDGE-GRPO (Entropy-Driven GRPO with Guided Error Correction for Advantage Diversity) and the related GTPO and GRPO-S frameworks share a common observation: maintaining policy entropy during GRPO training is essential for continued improvement, and most failures of GRPO at scale trace back to premature entropy collapse. These methods adjust the loss to add entropy regularization terms, perform error-correction on collapsing groups, or shape the per-token contribution by entropy. They typically produce a characteristic "entropy rebound" curve where exploration recovers after an initial dip, rather than monotonically collapsing.

TIC-GRPO (Trajectory-Importance-Corrected GRPO, arXiv:2508.02833, August 2025) was the first GRPO variant to ship with a provable convergence guarantee under standard policy-gradient assumptions. The paper showed that GRPO's mean-with-self baseline introduces a small but systematic bias and proposed a trajectory-level importance correction that restores unbiasedness without sacrificing the group structure. Empirical gains were modest, but the contribution was theoretical: it provided the first formal foothold for analyzing convergence rates of group-based policy gradient methods on LLM RL problems.^[34] Follow-up work including GAGPO (Generalized Advantage Grouped Policy Optimization) built on this framework.

Connection to DPO and preference learning

A theoretical thread that became prominent in late 2025 is the equivalence between GRPO and DPO under specific configurations. The "It Takes Two" analysis (arXiv:2510.00977) showed that GRPO with $G = 2$ reduces to an on-policy variant of DPO over the sampled pair, with the KL penalty playing the role of DPO's $\beta$ regularizer.^[33] By 2026 several training stacks (including TRL and verl) supported hybrid DPO/GRPO objectives natively.
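
A short worked calculation, offered as a sketch rather than the paper's own derivation, shows why a group of two behaves like a preference pair. It assumes the population standard deviation (dividing by $G$) and two completions with rewards $r_1 > r_2$.

$$
\mu = \frac{r_1 + r_2}{2}, \qquad
\sigma = \sqrt{\tfrac{1}{2}\left[(r_1 - \mu)^2 + (r_2 - \mu)^2\right]} = \frac{r_1 - r_2}{2}, \qquad
\hat{A}_1 = \frac{r_1 - \mu}{\sigma} = +1, \qquad
\hat{A}_2 = \frac{r_2 - \mu}{\sigma} = -1
$$

The reward magnitudes cancel, leaving only which completion was better. That is the same information a chosen/rejected pair carries in DPO.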

What are the strengths of GRPO?

The case for GRPO over PPO comes down to resource efficiency and implementation simplicity. Removing the value model cuts peak GPU memory roughly in half for a given policy size. This makes GRPO accessible to research groups and companies that cannot provision the cluster resources required for full PPO training of large models.^[1] In practice, many of the RLHF pipelines for frontier reasoning models in 2025 moved to GRPO or its descendants rather than PPO.^[19]

GRPO is also simpler to implement correctly. PPO requires careful tuning of the value function learning rate, the GAE hyperparameters ($\lambda$, $\gamma$), and the value loss coefficient. GRPO's only novel hyperparameter beyond standard PPO is the group size $G$. With larger groups, the group-level baseline becomes a better estimate, at the cost of needing to generate more completions per prompt per training step.^[13]

For tasks with verifiable answers (math, code, logic puzzles), GRPO pairs naturally with rule-based rewards, avoiding the neural reward model entirely. This eliminates reward model training costs and sidesteps reward hacking against a neural evaluator, a problem the DeepSeek-R1 authors specifically noted as a concern at large scale.^[2] The flexibility of accepting arbitrary verifier functions has made GRPO the natural fit for the broader RLVR (Reinforcement Learning from Verifiable Rewards) paradigm that emerged in 2024-2025.^[22]
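
A rule-based verifier can be as small as a pair of string checks. The sketch below is hypothetical throughout: the `<think>`/`<answer>` tag layout, the exact-match comparison, and the 0.1 format weight are illustrative assumptions, not a documented recipe.

```python
import re

# Assumed output layout: <think>...</think><answer>...</answer>
FORMAT_RE = re.compile(r"^<think>.*?</think>\s*<answer>.*?</answer>$", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def format_reward(completion: str) -> float:
    """1.0 if the completion follows the assumed tag layout, else 0.0."""
    return 1.0 if FORMAT_RE.match(completion.strip()) else 0.0


def accuracy_reward(completion: str, gold: str) -> float:
    """1.0 if the extracted answer exactly matches the reference, else 0.0."""
    m = ANSWER_RE.search(completion)
    return 1.0 if m and m.group(1).strip() == gold.strip() else 0.0


def total_reward(completion: str, gold: str, format_weight: float = 0.1) -> float:
    return accuracy_reward(completion, gold) + format_weight * format_reward(completion)


good = "<think>2 + 2 is 4</think><answer>4</answer>"
wrong = "<think>guessing</think><answer>5</answer>"
score_good = total_reward(good, "4")    # correct and well-formed
score_wrong = total_reward(wrong, "4")  # well-formed only
```

Real verifiers usually normalize answers (for example, comparing parsed numbers or running unit tests) rather than relying on exact string equality.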

GRPO scales gracefully across model sizes. The same algorithm has been used to train 1.5B Qwen2.5-Math models on a single GPU and 671B DeepSeek-V3-Base MoE models on multi-thousand-GPU clusters. The hyperparameter changes between scales are largely about optimization (learning rate, group size, max length, KL coefficient) rather than algorithmic restructuring, making transfer of recipes across scales straightforward.

Finally, GRPO has good optimization properties on the kinds of problems where reasoning models are evaluated. Because the advantage is normalized within each prompt, the algorithm is robust to reward scale and to the absolute difficulty of the prompt distribution: a prompt where the model gets 80% correct and a prompt where it gets 20% correct contribute comparable gradient magnitudes per step, with the roles of correct and incorrect completions mirrored. This is harder to engineer with PPO's value-based credit assignment.^[13]
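
The 80%/20% comparison can be worked through explicitly. The calculation below is a sketch that assumes binary rewards, a fraction $p$ of correct completions in the group, and the population standard deviation.

$$
\sigma = \sqrt{p(1-p)}, \qquad
\hat{A}_{\text{correct}} = \frac{1 - p}{\sigma} = \sqrt{\frac{1-p}{p}}, \qquad
\hat{A}_{\text{incorrect}} = \frac{-p}{\sigma} = -\sqrt{\frac{p}{1-p}}
$$

For $p = 0.8$ this gives $\hat{A}_{\text{correct}} = 0.5$ and $\hat{A}_{\text{incorrect}} = -2$; for $p = 0.2$ it gives $\hat{A}_{\text{correct}} = 2$ and $\hat{A}_{\text{incorrect}} = -0.5$. The mean absolute advantage per completion, $p\,\hat{A}_{\text{correct}} + (1-p)\,|\hat{A}_{\text{incorrect}}| = 2\sqrt{p(1-p)}$, equals $0.8$ in both cases.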

What are the limitations of GRPO?

Entropy collapse

One of the most consistently observed failure modes of GRPO training is entropy collapse: the policy's output distribution converges prematurely, all completions within a group start looking similar, and the group-level advantage estimates become noisy or near-zero. Once entropy collapses, the model stops exploring alternative solution strategies and training stalls. DAPO's Clip-Higher technique directly targets this by asymmetrically allowing upward probability adjustments for currently low-probability tokens, and entropy-aware variants like EDGE-GRPO and GTPO add explicit entropy regularization or shaping.^[3] Practitioners commonly monitor token-level entropy as a primary training health metric and intervene (by reducing learning rate, raising temperature, or pausing for cold-start data) if it falls too quickly.^[14]
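
The entropy metric that practitioners monitor can be sketched in a few lines. This version works on plain lists of logits for clarity; the function names are illustrative, and a training loop would compute the same quantity on tensors.

```python
import math
from typing import List


def token_entropy(logits: List[float]) -> float:
    """Shannon entropy (in nats) of the softmax distribution over one token's logits."""
    m = max(logits)                         # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_token_entropy(logits_per_token: List[List[float]]) -> float:
    """Average entropy across the generated tokens of a completion."""
    return sum(token_entropy(l) for l in logits_per_token) / len(logits_per_token)


uniform = token_entropy([0.0, 0.0, 0.0, 0.0])  # maximal: ln(4)
peaked = token_entropy([10.0, 0.0, 0.0, 0.0])  # near zero: almost deterministic
```

A steady slide of the batch average toward the peaked regime is the warning sign described above.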

Length bias and token aggregation

Because each response's loss is normalized by its own length, GRPO assigns equal total gradient weight to short and long responses, so each token in a long response carries less weight than each token in a short one. This dilutes both the credit given to long, correct chain-of-thought solutions and the penalty applied to long, incorrect ones. The Sea AI Lab analysis behind Dr. GRPO showed the second effect concretely: average length of incorrect responses tends to grow over training under standard GRPO.^[4] Dr. GRPO and DAPO's token-level loss normalization both attempt to correct this. The GSPO paper argues the root cause is a deeper mismatch between token-level importance sampling and sequence-level rewards.^[5] In practice, GFPO shows that simply oversampling and filtering by length and efficiency can cut response lengths by half or more while preserving accuracy, and it has been adopted in deployments where inference latency matters.^[8]

Group degeneracy

When all completions in a group receive the same reward (all correct or all incorrect), the standard deviation is zero and every advantage is zero. GRPO provides no gradient signal for these prompts. This is a structural issue: for tasks where the model is already highly accurate, GRPO training offers diminishing returns. DAPO's dynamic sampling addresses this by filtering such prompts, but the problem reflects a fundamental constraint of group-based baselines and forces curriculum design to weight prompts at the model's current frontier of capability.^[3]

High rollout cost

GRPO requires sampling $G$ completions per prompt. With $G = 16$ or $G = 64$, every prompt in a training step requires 16 to 64 full autoregressive generations, whereas supervised fine-tuning requires none. For large models, this rollout cost can dominate overall training time unless inference is highly optimized (e.g., using vLLM with batched generation, prefix caching, and continuous batching).^[11] The memory saving over PPO is partially offset by the need to hold multiple completions in memory simultaneously during rollout, particularly when sequences reach 32,768 or 65,536 tokens as in the R1 setup.

Reward hacking

GRPO with rule-based rewards is resistant to the specific form of reward hacking that affects neural reward models (adversarial prompts that fool the reward model without being correct). However, it remains susceptible to other forms: models can learn to game format requirements (e.g., producing the required answer tags with correct-looking but incorrect content), to produce extremely short responses that happen to format-match correct answers, or to exploit ambiguities in the correctness checker. A widely shared example involved a length-reward function that was supposed to penalize verbose outputs; the model learned to fill its output buffer with random numbers so that all completions in the group hit the same length, driving the in-group standard deviation to zero and stopping the training signal entirely.^[14] The DeepSeek-R1 training addressed reward hacking partly through multi-stage training and language consistency rewards.^[2] MO-GRPO (Multi-Objective GRPO) generalizes this by automatically reweighting reward components based on their variances to prevent any single component from dominating.

Token-sequence mismatch

The GSPO paper formalizes a more subtle issue: the importance sampling ratio $\pi_\theta / \pi_{\theta_{\text{old}}}$ at the token level produces a high-variance product when multiplied across long sequences, so the effective per-sequence importance weight blows up with response length. This is one of the factors implicated in the collapse observed when training MoE models with GRPO, where small expert routing changes amplify into large product-of-ratios that the clip cannot tame. GSPO's sequence-level ratio resolves this at the cost of coarser updates, trading some expressiveness for stability.^[5]

Cross-domain interference

A failure mode that became prominent at the frontier scale of 2026 releases is cross-domain interference: when a single policy is updated with GRPO across mixed batches of math, code, agent, and instruction-following tasks, gains in one domain often come at the cost of regressions in another. DeepSeek-V4 attributed roughly a 14 to 27 point benchmark improvement to isolating GRPO training per domain rather than running it across mixed batches.^[28] The most performant 2026 frontier systems pair GRPO with explicit orchestration (domain experts plus distillation, or sequential per-domain RL rounds) rather than a single mixed-batch update.

Which models use GRPO?

GRPO and its variants became the dominant RL post-training paradigm for open reasoning models in 2025 and continued in that role through 2026:^[19]

Open-R1 (Hugging Face, January 2025) reproduced DeepSeek-R1's training pipeline with GRPO, making the approach accessible to the broader research community.^[16]

OLMo-2 (Allen Institute for AI) used RLVR training with GRPO in its post-training pipeline, applied after supervised fine-tuning on the Tulu 3 dataset. OLMo-2-0325-32B-Instruct incorporated this training as part of its public release in March 2025, although a follow-up analysis showed RLVR effects were strongest on Qwen-family bases and replicated less consistently on Llama and OLMo bases, an observation that has shaped subsequent base-model selection in the open-reasoning community.^[22]

Tülu 3 (Allen Institute for AI, 2024-2025) included an RLVR stage as part of its public post-training recipe; the original release ran that stage with PPO, and the follow-up Tülu 3.1 release switched it to GRPO. The resulting Tülu 3 instruct models matched or exceeded GPT-4o-mini and Claude 3.5 Haiku on several benchmarks. The recipe was one of the first end-to-end open RLVR pipelines documented in detail.^[22]

Qwen models (Alibaba) adopted GRPO-family training extensively. The Qwen team developed GSPO as a more stable alternative for their Qwen3 models, and by 2026 had standardized on GSPO across the Qwen3 family including Qwen3-Coder, Qwen3-30B-A3B-Base, and Qwen 3.6 Plus, which reached 78.8% on SWE-Bench Verified at release in late March 2026.^[5] Earlier Qwen2.5 variants were trained with GRPO using verl and are commonly used as base models for GRPO experiments in the research community.

Phi-4-reasoning (Microsoft) was trained using GFPO, the GRPO variant developed by Microsoft Research that filters oversampled groups to reduce response length while maintaining accuracy.^[8] This represents a productized application of a GRPO descendant in a flagship model release.

Kimi K2 and K2.6 (Moonshot AI) used a GRPO-derived RL pipeline as part of their post-training, with K2 released in July 2025 and K2.6 released in April 2026. K2.6 reached 80.2% on SWE-Bench Verified and 58.6% on SWE-Bench Pro. Moonshot's release notes emphasized agentic stability improvements attributed in part to GRPO refinements around tool-use rewards.

GLM-5 (Z.AI): GLM-5.1, released in April 2026, is an MoE model with 754 billion total parameters that scored 58.4% on SWE-Bench Pro. The GLM-5 series adopted GSPO for its MoE post-training, citing the same stability concerns that motivated Qwen3's transition.^[32]

DeepSeek-V3.2 and DeepSeek-V4 continued to use GRPO directly, supplementing it with hybrid reward systems (V3.2) and two-stage domain-expert post-training (V4) rather than switching to a different algorithm.^[25] DeepSeek's persistence with the algorithm it had originated demonstrates that the core GRPO objective remains competitive at frontier scale when paired with sophisticated reward design.

VLM-R1 and multimodal extensions. Groups extended GRPO to vision-language models, applying it to Qwen2-VL and Qwen2.5-VL with visual question answering rewards, producing models like Qwen2-VL-2B-GRPO-8k, Qwen2-VL-7B-GRPO-8k, and the VLM-R1 family released February 2025.^[24] These models showed that GRPO trained on tasks with deterministic ground truth (referring expression comprehension, open-vocabulary detection) generalizes better to out-of-domain data than corresponding SFT models.^[24] The R1-V project demonstrated that VLM "aha moments" can emerge with under $3 of compute when paired with GRPO. Other extensions include UAV-VL-R1 (multi-stage GRPO for drone visual reasoning) and various agentic-RL formulations using GRPO for tool use and web navigation.

Continuous integrations. The Hugging Face TRL Trainer ecosystem, ByteDance's verl, and NVIDIA's NeMo-Aligner all expose GRPO as a first-class trainer, and the experimental Trainers in TRL (DAPO, GSPO, RLOO) are usually layered on top of the GRPO trainer rather than reimplemented from scratch.^[9]

Beyond specific models, GRPO's influence is visible in the broader shift toward Reinforcement Learning from Verifiable Rewards (RLVR) as the preferred post-training recipe for reasoning tasks, replacing or supplementing DPO and SFT-only pipelines.^[31] The reasoning capabilities that emerged from DeepSeek-R1's GRPO training also spurred interest in test-time compute scaling, as GRPO-trained models naturally produce longer, more careful outputs that can be further improved by increasing generation length or using majority voting at inference time.

Hyperparameter reference

Parameter	DeepSeekMath	DeepSeek-R1-Zero	DAPO	DeepSeek-V3.2 (math)	TRL default
Group size $G$	64	16	16	16	8
Learning rate	$1 \times 10^{-6}$	$3 \times 10^{-6}$	$1 \times 10^{-6}$	$1 \times 10^{-6}$	$1 \times 10^{-6}$
KL coefficient $\beta$	0.04	0.001	0.0	~0 (math), domain-tuned	0.04
Clip $\varepsilon$	0.2	0.2	0.2 / 0.28	0.2	0.2
Max length	1,024	32,768	20,480	65,536	varies
Prompts per step	16	32	512	scaled large	varies
Reward	reward model + rule	rule (accuracy + format)	rule (accuracy + format)	hybrid (rule + generative)	user-defined
Updates per batch	1	1	typically 1	1	configurable

Notes on this table: the TRL defaults are a starting point and most projects override them for their own training run; DAPO sets $\beta = 0$ (no KL penalty against reference) for its main runs, relying on the clipped surrogate alone for stability;^[3] DeepSeek-V3.2 keeps the KL term but tunes it per domain, with near-zero weights for verifiable math/code domains and higher weights for general-purpose tasks;^[26] and the maximum length grows with task difficulty, with reasoning models routinely setting it to 16K-65K tokens to accommodate long chains of thought.

References

1. Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y. K., Wu, Y., & Guo, D. (2024). DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv:2402.03300. https://arxiv.org/abs/2402.03300
2. DeepSeek-AI (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv:2501.12948. https://arxiv.org/abs/2501.12948
3. Yu, Q., et al. (2025). DAPO: An Open-Source LLM Reinforcement Learning System at Scale. arXiv:2503.14476. https://arxiv.org/abs/2503.14476
4. Liu, Z., Chen, C., Li, W., Qi, P., Pang, T., Du, C., Lee, W. S., & Lin, M. (2025). Understanding R1-Zero-Like Training: A Critical Perspective (Dr. GRPO). arXiv:2503.20783. https://arxiv.org/abs/2503.20783
5. Zheng, C., Liu, S., et al. (2025). Group Sequence Policy Optimization. arXiv:2507.18071. https://arxiv.org/abs/2507.18071
6. Hu, J., et al. (2025). REINFORCE++: An Efficient RLHF Algorithm with Robustness to Both Prompt and Reward Models. arXiv:2501.03262. https://arxiv.org/abs/2501.03262
7. Ahmadian, A., Cremer, C., Gallé, M., Fadaee, M., Kreutzer, J., Pietquin, O., Üstün, A., & Hooker, S. (2024). Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs (RLOO). arXiv:2402.14740. https://arxiv.org/abs/2402.14740
8. Microsoft Research (2025). Sample More to Think Less: Group Filtered Policy Optimization for Concise Reasoning. arXiv:2508.09726. https://arxiv.org/abs/2508.09726
9. Hugging Face TRL. GRPOTrainer documentation. https://huggingface.co/docs/trl/grpo_trainer
10. Hugging Face. Liger GRPO meets TRL. https://huggingface.co/blog/liger-grpo
11. Hugging Face Cookbook. Efficient Online Training with GRPO and vLLM in TRL. https://huggingface.co/learn/cookbook/grpo_vllm_online_training
12. verl project. GRPO documentation. https://verl.readthedocs.io/en/latest/algo/grpo.html
13. Wolfe, C. R. (2025). Group Relative Policy Optimization (GRPO). Deep (Learning) Focus. https://cameronrwolfe.substack.com/p/grpo
14. Wolfe, C. R. (2025). GRPO++: Tricks for Making RL Actually Work. Deep (Learning) Focus. https://cameronrwolfe.substack.com/p/grpo-tricks
15. Qwen Team, Alibaba (2025). GSPO: Towards Scalable Reinforcement Learning for Language Models. https://qwenlm.github.io/blog/gspo/
16. Hugging Face (2025). Open R1: Update #1. https://huggingface.co/blog/open-r1/update-1
17. Hugging Face DeepSeek R1 course. Understanding the DeepSeek R1 Paper. https://huggingface.co/learn/llm-course/chapter12/3
18. Schulman, J. (2020). Approximating KL Divergence. http://joschu.net/blog/kl-approx.html
19. Raschka, S. (2025). The State of Reinforcement Learning for LLM Reasoning. https://magazine.sebastianraschka.com/p/the-state-of-llm-reasoning-model-training
20. Lambert, N. (2025). Reinforcement Learning, RLHF Book, Policy Gradients chapter. https://rlhfbook.com/c/06-policy-gradients
21. Lambda, P., et al. (2025). lambda-GRPO: Unifying the GRPO Frameworks with Learnable Token Preferences. arXiv:2510.06870. https://arxiv.org/abs/2510.06870
22. Lambert, N., et al. (2024). Tulu 3: Pushing Frontiers in Open Language Model Post-Training. arXiv:2411.15124. https://arxiv.org/abs/2411.15124
23. Unsloth Documentation (2025). Reinforcement Learning (RL) Guide. https://unsloth.ai/docs/get-started/reinforcement-learning-rl-guide
24. om-ai-lab (2025). VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model. arXiv:2504.07615. https://arxiv.org/abs/2504.07615
25. DeepSeek-AI (2025). DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models. arXiv:2512.02556. https://arxiv.org/abs/2512.02556
26. Raschka, S. (2026). A Technical Tour of the DeepSeek Models from V3 to V3.2. https://magazine.sebastianraschka.com/p/technical-deepseek
27. Kiraz, A. (2025). LLM Autopsy Series: DeepSeek v3.2 Analysis, GRPO and Hybrid-Reward-System RL Process. https://alican-kiraz1.medium.com/llm-autopsy-series-deepseek-v3-2-analysis-grpo-a-hybrid-reward-system-rl-process-7a18f6137673
28. BSWEN (2026). How DeepSeek V4's Two-Stage Post-Training Solves Multi-Domain Interference. https://docs.bswen.com/blog/2026-04-25-deepseek-v4-two-stage-post-training/
29. Kili Technology (2026). DeepSeek V4 Guide: Engram Memory, Training Data Strategy and Release Status. https://kili-technology.com/blog/data-story-deepseek-v4
30. MarkTechPost (2026). DeepSeek AI Releases DeepSeek-V4: Compressed Sparse Attention and Heavily Compressed Attention Enable One-Million-Token Contexts. https://www.marktechpost.com/2026/04/24/deepseek-ai-releases-deepseek-v4-compressed-sparse-attention-and-heavily-compressed-attention-enable-one-million-token-contexts/
31. LLM-Stats (2026). Post-Training in 2026: GRPO, DAPO, RLVR and Beyond. https://llm-stats.com/blog/research/post-training-techniques-2026
32. Hugging Face Blog (2026). From GRPO to DAPO and GSPO: What, Why, and How. https://huggingface.co/blog/NormalUhr/grpo-to-dapo-and-gspo
33. Anonymous (2025). It Takes Two: Your GRPO Is Secretly DPO. arXiv:2510.00977. https://arxiv.org/abs/2510.00977
34. Anonymous (2025). TIC-GRPO: Trajectory-Importance-Corrected GRPO for Provable and Efficient Reinforcement Learning. arXiv:2508.02833. https://arxiv.org/abs/2508.02833
35. Anonymous (2025). Comparative Analysis and Parametric Tuning of PPO, GRPO, and DAPO for LLM Reasoning Enhancement. arXiv:2512.07611. https://arxiv.org/abs/2512.07611
