DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization)

Updated Jun 8, 2026

DAPO, short for Decoupled Clip and Dynamic sAmpling Policy Optimization, is an open-source reinforcement learning algorithm and training system for large language models, introduced in March 2025 by researchers from ByteDance Seed and the Institute for AI Industry Research (AIR) at Tsinghua University. It was presented in the paper "DAPO: An Open-Source LLM Reinforcement Learning System at Scale," whose lead author is Qiying Yu, and was released through SIA-Lab, a joint lab of Tsinghua AIR and ByteDance Seed, with additional contributors from the University of Hong Kong. ^[1]^[2] DAPO is designed for training reasoning models that produce long chain-of-thought traces. It modifies Group Relative Policy Optimization (GRPO) with four techniques that target instabilities observed in large-scale reasoning RL, including entropy collapse, reward noise, and inefficient gradient signals. Using the Qwen2.5-32B base model, DAPO reaches 50 points on the AIME 2024 mathematics competition benchmark, surpassing the 47 points reported for DeepSeek-R1-Zero-Qwen-32B while using roughly half the training steps. ^[1]^[2]

The "decoupled clip" and "dynamic sampling" in the name refer to two of its four components. A defining feature of the release is that the authors published the full recipe rather than a headline result: the algorithm, the training code built on the verl framework, a curated dataset called DAPO-Math-17K, and trained model weights. The team framed this transparency as a response to the limited disclosure around contemporary reasoning systems such as OpenAI o1 and the DeepSeek-R1 technical report, where key training details were withheld and direct reproduction was difficult. ^[1]^[3]

Background

Reasoning-oriented language models are commonly trained with reinforcement learning from verifiable rewards (RLVR). A policy model generates a long chain of thought for a prompt with a known answer, an automatic verifier checks the final answer, and the model receives a scalar reward. DAPO uses a rule-based reward of +1 when the extracted answer is equivalent to the reference answer and -1 otherwise, which avoids a learned reward model and the reward hacking that can accompany one. ^[1]
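
The reward rule described above can be sketched in a few lines. This is a minimal illustration, not the released verifier: the function name rule_based_reward is hypothetical, and "equivalent" is approximated here by an integer comparison with a normalized-string fallback, which is an assumption.

```python
def rule_based_reward(predicted: str, reference: str) -> float:
    """Return +1.0 if the extracted answer matches the reference, else -1.0.

    Assumption: equivalence is approximated by comparing integers when both
    sides parse as integers, and by comparing stripped strings otherwise.
    """
    pred, ref = predicted.strip(), reference.strip()
    try:
        return 1.0 if int(pred) == int(ref) else -1.0
    except ValueError:
        # At least one side is not an integer literal: fall back to text.
        return 1.0 if pred == ref else -1.0


if __name__ == "__main__":
    print(rule_based_reward("42", "42"))   # matching answers
    print(rule_based_reward("41", "42"))   # mismatching answers
```

Because the reward depends only on the final extracted answer, no learned reward model is involved at any point.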

DAPO builds on GRPO, a critic-free variant of Proximal Policy Optimization (PPO) introduced by DeepSeek. Instead of training a separate value network to provide a baseline, GRPO samples a group of G responses for each prompt and standardizes each response reward against the group mean and standard deviation to produce a relative advantage that is shared across the tokens of a response. This makes training cheaper and simpler, since no value model is required. ^[1]^[4]
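
The group-relative advantage can be illustrated with a short sketch. The function name and the small eps term are assumptions made for this example, and the population standard deviation is used for simplicity; real implementations may use a different estimator.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each reward against its group's mean and standard deviation.

    Assumption: population standard deviation plus a small eps to avoid
    division by zero. The returned value for a response is shared by every
    token of that response.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # A group of G = 4 responses, two correct (+1) and two wrong (-1).
    print(group_relative_advantages([1.0, -1.0, -1.0, 1.0]))
    # A group where every response earns the same reward.
    print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))
```

When every reward in the group is identical, the numerator is zero for each response, so all advantages are zero.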

The DAPO authors report that a direct GRPO implementation on the Qwen2.5-32B base model is unstable and plateaus well below the achievable score, reaching only about 30 points on AIME 2024. They attribute this to several concrete failure modes:

Entropy collapse. The policy's output entropy falls quickly, the model becomes overly deterministic, exploration of alternative reasoning paths stops, and accuracy saturates early.
Zero-gradient prompts and a shrinking effective batch. When all G samples for a prompt are correct, or all are wrong, the group-standardized advantage is zero for every token, so the prompt contributes no gradient. As the policy improves, more prompts become fully solved, so an increasing share of each batch is wasted.
Length and reward noise. Long chains of thought are frequently truncated at a maximum length. Assigning a punitive reward to a truncated but possibly sound response injects noise, and GRPO's sample-level loss averaging under-weights the many tokens inside long responses.

The four techniques

DAPO retains GRPO's group-relative advantage estimator and its clipped surrogate objective but removes the Kullback-Leibler (KL) divergence penalty against a reference policy. The authors argue that in long-chain-of-thought RL the policy is expected to diverge substantially from the initial model, so constraining it toward a fixed reference is unnecessary. On top of this base, DAPO adds four techniques, summarized below and then described in detail.

Technique	Problem it addresses	Mechanism
Clip-Higher	Entropy collapse, too little exploration	Decouple the PPO clip range into a lower bound and a higher upper bound, using e_low = 0.2 and e_high = 0.28
Dynamic Sampling	Zero-gradient prompts, shrinking effective batch	Oversample and discard prompts whose group accuracy is exactly 0 or 1, keeping the batch full of informative prompts
Token-level Policy Gradient Loss	Long responses under-weighted, garbled long outputs	Normalize the loss by total token count across the batch rather than averaging per sample
Overlong Reward Shaping	Reward noise from truncated responses	Mask truncated samples and apply a soft, length-aware penalty near the length limit
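
Taken together, the components in the table can be written as a single objective. The form below is a reconstruction from the descriptions in this article, with notation assumed for illustration: o_i is the i-th sampled response to question q with reference answer a, R_i is its reward, and \varepsilon_{\mathrm{low}}, \varepsilon_{\mathrm{high}} correspond to e_low and e_high.

```latex
\mathcal{J}_{\mathrm{DAPO}}(\theta)
= \mathbb{E}_{(q,a)\sim\mathcal{D},\ \{o_i\}_{i=1}^{G}\sim\pi_{\theta_{\mathrm{old}}}(\cdot\mid q)}
\left[
  \frac{1}{\sum_{i=1}^{G}|o_i|}
  \sum_{i=1}^{G}\sum_{t=1}^{|o_i|}
  \min\!\Big(
    r_{i,t}(\theta)\,\hat{A}_{i,t},\
    \mathrm{clip}\big(r_{i,t}(\theta),\,1-\varepsilon_{\mathrm{low}},\,1+\varepsilon_{\mathrm{high}}\big)\,\hat{A}_{i,t}
  \Big)
\right]
\quad\text{s.t.}\quad
0 < \big|\{\,o_i : \text{is\_equivalent}(a, o_i)\,\}\big| < G,

\qquad
r_{i,t}(\theta) = \frac{\pi_{\theta}(o_{i,t}\mid q, o_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{i,t}\mid q, o_{i,<t})},
\qquad
\hat{A}_{i,t} = \frac{R_i - \mathrm{mean}\big(\{R_j\}_{j=1}^{G}\big)}{\mathrm{std}\big(\{R_j\}_{j=1}^{G}\big)}.
```

The asymmetric clip bounds express Clip-Higher, the constraint on the number of correct responses expresses Dynamic Sampling, and the normalization by the total token count expresses the token-level loss. Overlong Reward Shaping enters through the reward R_i rather than through the objective itself.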

Clip-Higher

In PPO and GRPO, the per-token objective multiplies the advantage by an importance ratio r between the new and old policy probabilities, and clips that ratio to the interval [1 - e, 1 + e] to limit how far a single update can move the policy. With a single symmetric e, every token is allowed the same relative increase, so the upper clip severely caps the absolute increase available to low-probability tokens. For example, a token with probability 0.01 can rise at most to about 0.012 under e = 0.2, a gain of only 0.002, while a high-probability token has much more room to grow in absolute terms. This asymmetry suppresses the exploratory tokens that drive diversity, accelerating entropy collapse. DAPO decouples the bound into a lower clip e_low and a higher clip e_high, raising e_high to 0.28 while keeping e_low at 0.2. The looser upper bound gives low-probability tokens room to increase, which raises policy entropy and yields more diverse samples, while the tighter lower bound still guards against probabilities collapsing toward zero. ^[1]
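
A small sketch makes the decoupled clip concrete. The function names are hypothetical, and the scalar form is a simplification of what a training framework would compute over tensors.

```python
def clipped_token_objective(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """Per-token clipped surrogate with separate lower and upper clip ranges."""
    clipped_ratio = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    # Take the more pessimistic of the unclipped and clipped terms.
    return min(ratio * advantage, clipped_ratio * advantage)


def max_reachable_probability(old_prob, eps_high):
    """Largest new probability whose ratio still lies inside the upper clip."""
    return min(1.0, old_prob * (1.0 + eps_high))


if __name__ == "__main__":
    # Ceiling for a rare token under the symmetric and the raised upper clip.
    print(max_reachable_probability(0.01, 0.20))
    print(max_reachable_probability(0.01, 0.28))
    # A ratio of 1.5 with positive advantage is cut back to 1 + eps_high.
    print(clipped_token_objective(1.5, 1.0))
```

With a positive advantage, a ratio above 1 + e_high contributes no additional objective value, so the raised upper bound directly widens the range over which a rare token can still be pushed up.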

Dynamic Sampling

Because group-standardized advantages vanish when every sample in a group is correct or every sample is wrong, such prompts produce no learning signal and dilute the batch. DAPO oversamples and then filters: it keeps drawing prompts and discards those with group accuracy of exactly 0 or 1 until the batch contains only prompts where the number of correct outputs is strictly between 0 and G. This keeps the gradient signal consistent and the effective batch size stable across training. The authors note that the extra sampling does not necessarily slow training, because generation time is typically dominated by a few long, slow responses rather than by the count of prompts attempted. ^[1]
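
The oversample-and-filter loop can be sketched as follows. All names are hypothetical, and sample_group stands in for rollout generation plus verification, which a real system would perform with the policy model.

```python
def fill_batch_with_informative_prompts(prompt_stream, sample_group, batch_size):
    """Collect prompts whose group has at least one correct and one wrong answer.

    prompt_stream: iterable of prompts.
    sample_group:  callable mapping a prompt to a list of G booleans, one per
                   sampled response (True means the response was correct).
    """
    batch = []
    for prompt in prompt_stream:
        correct_flags = sample_group(prompt)
        num_correct = sum(correct_flags)
        # Keep only groups with 0 < num_correct < G; the rest give zero advantage.
        if 0 < num_correct < len(correct_flags):
            batch.append((prompt, correct_flags))
        if len(batch) == batch_size:
            break
    return batch


if __name__ == "__main__":
    outcomes = {
        "easy": [True, True, True, True],
        "mixed-1": [True, False, False, True],
        "hard": [False, False, False, False],
        "mixed-2": [False, True, False, False],
    }
    batch = fill_batch_with_informative_prompts(outcomes, outcomes.get, 2)
    print([prompt for prompt, _ in batch])
```

If the stream is exhausted before the batch fills, this sketch returns a short batch; how a production system handles that case is not specified here.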

Token-level Policy Gradient Loss

GRPO computes a loss for each sample by averaging over the tokens of that sample, then averages those per-sample losses across the group. This gives every response equal weight regardless of length, so a token in a short response influences the gradient far more than a token in a long one. In long-chain-of-thought training that imbalance is harmful: undesirable patterns such as repetitive or rambling text appear disproportionately in long responses, yet receive little gradient pressure. DAPO instead sums the per-token loss over all tokens in the batch and divides by the total number of tokens, so every token contributes equally. Longer reasoning traces therefore exert proportionate influence, and high-quality or low-quality patterns are reinforced or suppressed regardless of where they fall in the length distribution. ^[1]
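
The difference between the two normalizations shows up in a toy example. The function names and the per-token loss values are illustrative assumptions.

```python
def sample_level_loss(per_token_losses):
    """Average within each response first, then across responses."""
    per_sample = [sum(tokens) / len(tokens) for tokens in per_token_losses]
    return sum(per_sample) / len(per_sample)


def token_level_loss(per_token_losses):
    """Sum every token's loss in the batch and divide by the total token count."""
    total = sum(sum(tokens) for tokens in per_token_losses)
    count = sum(len(tokens) for tokens in per_token_losses)
    return total / count


if __name__ == "__main__":
    short_response = [0.0, 0.0]   # 2 tokens, each with loss 0
    long_response = [1.0] * 8     # 8 tokens, each with loss 1
    batch = [short_response, long_response]
    print(sample_level_loss(batch))  # each response weighted equally
    print(token_level_loss(batch))   # each token weighted equally
```

Under sample-level averaging each of the 8 long-response tokens carries a weight of 1/16, while each short-response token carries 1/4. Under token-level averaging every token carries 1/10.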

Overlong Reward Shaping

Responses that hit the maximum generation length are truncated, and naively assigning them a negative reward penalizes the model for content it was not allowed to finish, adding noise. DAPO addresses this in two parts. Overlong Filtering masks the loss for truncated samples so they neither help nor hurt the update. Soft Overlong Punishment then applies a graduated, length-aware penalty: responses no longer than L_max - L_cache incur no length penalty, responses whose length falls in the final L_cache tokens before L_max receive a penalty that grows linearly from 0 to -1, and responses exceeding L_max receive the full -1. In the reported configuration the maximum length L_max is 20,480 tokens and the cache length L_cache is 4,096 tokens, so the penalty begins at 16,384 tokens. This shaping reduces reward noise and stabilizes training while still discouraging runaway generation. ^[1]
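
The soft penalty is a simple piecewise-linear function of response length. The sketch below uses the reported values as defaults; the function name is hypothetical, and how the penalty is combined with the correctness reward is left out.

```python
def soft_overlong_penalty(length, max_len=20480, cache_len=4096):
    """Length penalty: 0 below the buffer, linear inside it, -1 beyond max_len."""
    threshold = max_len - cache_len  # 16,384 tokens with the defaults
    if length <= threshold:
        return 0.0
    if length <= max_len:
        # Falls linearly from 0 at the threshold to -1 at max_len.
        return (threshold - length) / cache_len
    return -1.0


if __name__ == "__main__":
    for n in (10000, 16384, 18432, 20480, 25000):
        print(n, soft_overlong_penalty(n))
```

A response of 18,432 tokens sits halfway through the buffer and therefore receives half of the maximum penalty.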

Results

On AIME 2024, evaluated as avg@32 (the average over 32 independent attempts per problem to reduce variance), DAPO trains the Qwen2.5-32B base model to 50 points. This exceeds DeepSeek-R1-Zero-Qwen-32B, which reached 47 on the same benchmark, and DAPO reaches its result using about 50 percent of the training steps. ^[1]^[3] The paper presents a progressive ablation that isolates the contribution of each technique, starting from a naive GRPO baseline and adding components one at a time.

Configuration	AIME 2024 (avg@32)
DeepSeek-R1-Zero-Qwen-32B (reference)	47
Naive GRPO	30
+ Overlong Filtering	36
+ Clip-Higher	38
+ Soft Overlong Punishment	41
+ Token-level Loss	42
+ Dynamic Sampling (full DAPO)	50
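
The avg@32 metric used in the table can be sketched as below. This assumes the reported points are the mean per-attempt accuracy expressed as a percentage; the function name and data layout are hypothetical.

```python
def avg_at_k(attempt_results):
    """Mean accuracy over k attempts per problem, averaged over problems, in percent.

    attempt_results: one list per problem, each holding k booleans
                     (True means that attempt was correct).
    """
    per_problem = [sum(attempts) / len(attempts) for attempts in attempt_results]
    return 100.0 * sum(per_problem) / len(per_problem)


if __name__ == "__main__":
    # Two toy problems with k = 4 attempts each.
    results = [
        [True, True, False, False],   # solved in half of the attempts
        [True, True, True, True],     # solved in every attempt
    ]
    print(avg_at_k(results))
```

Averaging over many attempts smooths out sampling noise, which matters on a benchmark with few problems.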

The ablation shows that the gains accumulate rather than coming from any one technique alone. The two largest steps are Overlong Filtering, which lifts the naive GRPO baseline from 30 to 36, and Dynamic Sampling, which provides the final jump from 42 to 50. Alongside the score, the team released the DAPO-Math-17K dataset of roughly 17,000 mathematics problems whose answers were transformed into integers to make rule-based verification reliable, the trained DAPO-Qwen-32B weights, and the training code on the verl framework, together with training logs. ^[1]^[2]^[3]

Relationship to other methods

DAPO sits within a lineage of policy-gradient algorithms for language-model reasoning. PPO provides the clipped surrogate objective that all of these methods share. GRPO removes PPO's value network and replaces the learned baseline with a group-relative one, trading some variance reduction for simplicity and lower cost. DAPO keeps GRPO's critic-free structure and its clipping mechanism but reworks the clip bounds, the sampling, the loss normalization, and the length handling, and it drops the KL penalty entirely. In effect, DAPO can be read as a set of engineering corrections that make GRPO stable and effective at the 32B scale on long-chain-of-thought tasks. ^[1]^[4]

A closely related successor from the same ByteDance Seed group is VAPO (Value-based Augmented PPO), released in April 2025. VAPO moves in the opposite direction from GRPO and DAPO by reintroducing a trained value model, arguing that a well-built critic can outperform value-free methods on long reasoning. VAPO reports a state-of-the-art AIME 2024 score of 60.4 on the Qwen2.5-32B base model and states that it surpasses both DeepSeek-R1-Zero-Qwen-32B and DAPO by more than 10 points under matched settings. Notably, VAPO adopts several DAPO ideas, including the decoupled clipping, length-aware reward shaping, and token-level loss, layering them on a value-based foundation. ^[5] All of these systems operate in the RLVR paradigm popularized by DeepSeek-R1, in which a verifiable, rule-based reward replaces a learned reward model. ^[3]

Significance

DAPO's main contribution is less a single novel objective than a fully reproducible, openly documented recipe for scaling RL on reasoning models. At the time of release, several leading systems reported strong reasoning results without disclosing the algorithmic and data details needed to reproduce them, which made it hard for the wider community to study why large-scale reasoning RL succeeds or fails. By open-sourcing the algorithm, the verl-based code, the DAPO-Math-17K dataset, and the model weights, the authors lowered the barrier to entry and gave researchers a concrete, working baseline to build on and ablate. ^[1]^[3] The four techniques have since been widely cited and reused: the decoupled clip, dynamic sampling, token-level loss, and overlong reward shaping appear individually in subsequent RLVR work, and comparative studies routinely benchmark new methods against DAPO as a standard point of reference. DAPO has accordingly become one of the most influential open RL recipes for training long-chain-of-thought reasoning models. ^[4]^[5]

References

1. Yu, Qiying et al. "DAPO: An Open-Source LLM Reinforcement Learning System at Scale." arXiv:2503.14476, March 2025. https://arxiv.org/abs/2503.14476
2. DAPO project page, SIA-Lab of Tsinghua AIR and ByteDance Seed. https://dapo-sia.github.io/
3. DeepSeek-AI. "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning." arXiv:2501.12948, January 2025. https://arxiv.org/abs/2501.12948
4. Shao, Zhihong et al. "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" (introducing GRPO). arXiv:2402.03300, February 2024. https://arxiv.org/abs/2402.03300
5. Yue, Yu et al. "VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks." arXiv:2504.05118, April 2025. https://arxiv.org/abs/2504.05118
