VAPO (Value-based Augmented PPO)


Updated Jun 8, 2026


Overview

VAPO (Value-based Augmented Proximal Policy Optimization) is a reinforcement learning framework for training large language models on long chain-of-thought reasoning tasks. It was introduced by ByteDance Seed, the foundation-model research group at ByteDance, in a paper titled "VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks," posted to arXiv in April 2025. ^[1] Secondary coverage sometimes renders it "Value-Augmented PPO." ^[1]^[8]

VAPO's defining choice is that it is value-based: it trains and relies on a learned value model (a critic). This contrasts with the value-free methods that dominated reasoning RL in early 2025, namely GRPO and DAPO, which discard the critic and estimate advantages from groups of sampled responses. The VAPO authors argue that a well-trained value model delivers finer-grained, per-token credit assignment and lower-variance advantage estimates, and therefore offers a higher theoretical performance ceiling than value-free approaches. The catch, which prior work had struggled to overcome, is that value models are notoriously difficult to train reliably on long reasoning traces. VAPO is presented as the first value-based framework to significantly outperform value-free methods on long chain-of-thought tasks. ^[1]

Trained from the Qwen2.5-32B base model with no supervised fine-tuning, VAPO reaches a score of 60.4 on the AIME 2024 mathematics competition benchmark. This surpasses the GRPO-trained DeepSeek-R1-Zero-Qwen-32B baseline (47) and DAPO (50) by more than 10 points. The authors report that VAPO reaches this score within 5,000 gradient steps, matches DAPO's result using roughly 60 percent of the steps DAPO required, and shows no training crashes across multiple independent runs. ^[1]^[2]

Background: value-free versus value-based RL

Proximal Policy Optimization (PPO) is the standard policy-gradient algorithm behind much of RLHF and reasoning RL. PPO is an actor-critic method: it trains a policy (the actor) alongside a value function (the critic) that estimates the expected future reward from a given state. The critic is used to compute advantages, typically through generalized advantage estimation (GAE), which controls the bias-variance tradeoff with a decay parameter lambda. Advantages tell the optimizer how much better than expected each token was, providing token-level credit assignment.
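As a minimal sketch of how GAE turns critic estimates into per-token advantages, the Python below computes advantages for a single trajectory. The function name, the toy rewards and values, and the choice of gamma = 1.0 are illustrative assumptions rather than details from the PPO or VAPO papers.

```python
def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Generalized advantage estimation for one finished trajectory.

    rewards[t] is the reward received after step t and values[t] is the
    critic's estimate at step t. The value after the last step is taken
    to be 0 because the episode has ended (an assumption of this sketch).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # one-step TD error
        running = delta + gamma * lam * running  # discounted sum of later TD errors
        advantages[t] = running
    return advantages


# Toy 4-token response with a single terminal reward of 1 (hypothetical numbers).
rewards = [0.0, 0.0, 0.0, 1.0]
values = [0.5, 0.5, 0.5, 0.5]
adv = gae_advantages(rewards, values, gamma=1.0, lam=0.95)
```

With lam = 1.0 and gamma = 1.0, each advantage equals the Monte Carlo return minus the critic's estimate at that token. With lam below 1, the terminal reward's contribution to earlier tokens shrinks geometrically.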

Value-free methods remove the critic to avoid the cost and instability of training it. GRPO (Group Relative Policy Optimization), introduced with DeepSeekMath and used to train DeepSeek-R1, replaces the value model with a baseline computed across a group of responses sampled for the same prompt: a response's advantage is its reward normalized by the group's mean and standard deviation. ^[4]^[5] DAPO, also from ByteDance Seed, builds on GRPO by adding Clip-Higher, dynamic sampling, a token-level loss, and overlong-reward shaping, and by removing the KL penalty entirely. ^[2] These methods are robust and simple, but they broadcast the same scalar advantage to every token in a response, which gives coarse credit assignment and comparatively high variance.
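The group-relative baseline described above can be sketched in a few lines of Python. The function names, the small epsilon added for numerical safety, and the use of the population standard deviation are assumptions of this sketch, not details confirmed by the GRPO paper.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each response's reward by its group's mean and std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption here
    return [(r - mean) / (std + eps) for r in rewards]


def broadcast_to_tokens(advantage, num_tokens):
    """Value-free methods give every token the same response-level advantage."""
    return [advantage] * num_tokens


# Four sampled responses to one prompt, scored 1 (correct) or 0 (incorrect).
group_rewards = [1.0, 0.0, 0.0, 1.0]
advs = group_relative_advantages(group_rewards)
token_advs = broadcast_to_tokens(advs[0], num_tokens=3)
```

Every token of a given response receives the same number, which is the coarse credit assignment that value-based methods aim to improve on.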

Value-based methods keep the critic and, in principle, assign a distinct advantage to each token. For long chain-of-thought, where a single verifiable reward arrives only at the end of a trace thousands of tokens long, the critic becomes hard to learn for two reasons: initialization bias compounds over the long horizon, and the exponential decay in GAE drives the terminal reward signal toward zero for early tokens. As a result, PPO frequently collapses. The immediate predecessor to VAPO, VC-PPO (Value-Calibrated PPO, ByteDance, March 2025), diagnosed these two failure modes and proposed value-pretraining and decoupled GAE to address them. ^[3] VAPO extends this line of work, combining the value-model fixes from VC-PPO with several techniques adapted from the value-free DAPO.

How VAPO works

VAPO frames value-based RL for long reasoning around three challenges and layers seven techniques on top of PPO to address them. ^[1]

The three challenges are:

Value-model bias. A critic bootstrapped from a reward model carries initialization bias that accumulates across long sequences and can destabilize training.
Heterogeneous sequence lengths. Reasoning traces vary enormously in length, so a single fixed GAE lambda can suit short responses or long ones, but not both.
Sparse reward signals. With one verifiable reward per response, learning signal is scarce, and naive optimization tends toward entropy collapse and weak exploration.

The seven techniques and the challenge each targets are summarized below.

Technique	Challenge addressed	Origin
Value-Pretraining	Value-model bias	VC-PPO
Decoupled GAE	Value-model bias	VC-PPO
Length-Adaptive GAE	Heterogeneous lengths	VAPO (new)
Token-Level Policy-Gradient Loss	Heterogeneous lengths	DAPO
Clip-Higher	Sparse reward / exploration	DAPO
Positive-Example LM Loss	Sparse reward	VAPO
Group-Sampling	Sparse reward	VAPO

Value-Pretraining. Before joint training begins, the value model is fit offline to rollouts from a fixed policy, using Monte Carlo returns as targets. Starting the critic from an informed state rather than a biased initialization prevents the bias from propagating once policy updates start. ^[1]^[3]
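A minimal sketch of the regression targets used in this step, assuming a single terminal reward and gamma = 1.0. The function names and the toy critic predictions are hypothetical; a real implementation would fit a neural value head to these targets over many rollouts.

```python
def monte_carlo_returns(rewards, gamma=1.0):
    """Return-to-go at every token: the target the critic is regressed onto."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def value_mse(predicted_values, targets):
    """Mean squared error between critic predictions and Monte Carlo targets."""
    return sum((v - g) ** 2 for v, g in zip(predicted_values, targets)) / len(targets)


# One rollout from the frozen policy, rewarded only at the final token.
rewards = [0.0, 0.0, 1.0]
targets = monte_carlo_returns(rewards)
loss = value_mse([0.2, 0.5, 0.9], targets)  # hypothetical critic outputs
```

Because the policy is held fixed during this phase, the targets do not move while the critic is being fit.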

Decoupled GAE. Inherited from VC-PPO, this uses different lambda values for the critic and the policy. The critic is updated with lambda equal to 1.0, which corresponds to an unbiased Monte Carlo target and lets the value model learn accurate long-range expectations. The policy uses a smaller lambda for variance reduction. ^[1]^[3]
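The decoupling can be sketched as two GAE passes over the same trajectory, one per lambda. The helper names, gamma = 1.0, and the policy lambda of 0.95 are illustrative assumptions of this sketch.

```python
def gae(rewards, values, lam, gamma=1.0):
    """Plain GAE for one finished trajectory (value after the end is 0)."""
    adv, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(rewards) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv


def decoupled_gae(rewards, values, lam_critic=1.0, lam_policy=0.95):
    """Use one lambda for the critic's targets and another for the policy."""
    critic_adv = gae(rewards, values, lam_critic)
    # Advantage plus value gives the return target the critic regresses onto.
    value_targets = [a + v for a, v in zip(critic_adv, values)]
    policy_adv = gae(rewards, values, lam_policy)
    return value_targets, policy_adv


rewards = [0.0, 0.0, 0.0, 1.0]
values = [0.5, 0.5, 0.5, 0.5]
value_targets, policy_adv = decoupled_gae(rewards, values)
```

With lam_critic = 1.0 the value targets reduce to the Monte Carlo return at every token, while the policy advantages still decay toward the start of the response.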

Length-Adaptive GAE. A fixed policy lambda such as 0.95 is unsuitable when responses span very different lengths, because the coefficient on the reward decays geometrically; for a 100-step horizon, 0.95 to the 100th power is about 0.006, effectively erasing the signal. VAPO instead sets the policy lambda as a function of the response length l: lambda = 1 - 1/(alpha * l), with alpha = 0.05. Because the geometric GAE weights sum to 1/(1 - lambda) = alpha * l, the effective credit-assignment horizon grows in proportion to length, so a short response gets a shorter horizon and a long response gets a longer one, distributing TD-errors more uniformly across both. ^[1]
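The arithmetic above can be checked with a short Python sketch. The function names are hypothetical, and clamping lambda at 0 for very short responses is an assumption of this sketch (the formula goes negative when alpha * l is below 1).

```python
def length_adaptive_lambda(length, alpha=0.05):
    """Policy lambda as a function of response length: 1 - 1 / (alpha * length)."""
    # Clamp at 0 so very short responses do not get a negative lambda (assumption).
    return max(0.0, 1.0 - 1.0 / (alpha * length))


def effective_horizon(lam):
    """Sum of the geometric weights 1 + lam + lam**2 + ... = 1 / (1 - lam)."""
    return 1.0 / (1.0 - lam)


# Weight left on a reward 100 steps away under a fixed lambda of 0.95.
fixed_decay = 0.95 ** 100

# Under the length-adaptive rule the horizon scales with the response length.
horizons = {
    length: effective_horizon(length_adaptive_lambda(length))
    for length in (100, 1000, 10000)
}
```

For responses long enough that the clamp is inactive, the effective horizon equals alpha times the length, so a response ten times longer gets a horizon ten times longer.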

Token-Level Policy-Gradient Loss. Borrowed from DAPO, this weights every token equally rather than first averaging within each sequence. Without it, long responses, which contain most of the reasoning, are systematically under-weighted relative to short ones. ^[1]^[2]
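The difference between the two aggregation schemes shows up in a toy batch with one short and one long response. The function names and loss values below are hypothetical.

```python
def sequence_level_loss(per_token_losses):
    """Average within each response first, then across responses."""
    per_seq = [sum(seq) / len(seq) for seq in per_token_losses]
    return sum(per_seq) / len(per_seq)


def token_level_loss(per_token_losses):
    """Average over all tokens in the batch, so every token counts equally."""
    total = sum(sum(seq) for seq in per_token_losses)
    count = sum(len(seq) for seq in per_token_losses)
    return total / count


# A 2-token response with per-token loss 1.0 and an 8-token response with loss 3.0.
batch = [[1.0] * 2, [3.0] * 8]
seq_loss = sequence_level_loss(batch)
tok_loss = token_level_loss(batch)
```

Under sequence-level averaging each token of the 8-token response carries weight 1/16, while under token-level averaging it carries weight 1/10, so the long response contributes more to the batch loss.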

Clip-Higher. Also from DAPO, this uses asymmetric clipping in the PPO objective, raising the upper clipping bound epsilon_high relative to epsilon_low. The extra headroom lets low-probability tokens increase, preserving exploration and countering the entropy collapse that plagues naive PPO and GRPO. ^[1]^[2]
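A per-token sketch of the asymmetric clipped surrogate follows. The function name is hypothetical, and the default bounds of 0.2 and 0.28 are illustrative values assumed for this sketch; the text above does not state the exact settings.

```python
def clipped_surrogate(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """PPO clipped objective for one token with separate lower and upper bounds.

    ratio is pi_new(token) / pi_old(token). The objective is maximized.
    """
    clipped_ratio = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped_ratio * advantage)


# A token whose probability has risen by 25 percent, with positive advantage.
symmetric = clipped_surrogate(1.25, 1.0, eps_low=0.2, eps_high=0.2)
asymmetric = clipped_surrogate(1.25, 1.0, eps_low=0.2, eps_high=0.28)
```

With symmetric bounds the term is capped at 1.2 and stops contributing gradient, whereas the wider upper bound leaves it unclipped, so the token's probability can keep rising.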

Positive-Example LM Loss. VAPO adds an auxiliary negative-log-likelihood (language-modeling) loss over correct sampled responses, weighted by a coefficient mu. This is a form of self-imitation that squeezes more learning out of the rare positive rewards available under sparse, verifier-based training. ^[1]
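A minimal sketch of the auxiliary loss, assuming the policy's per-token log-probabilities are already available. The function names, the normalization by the number of tokens in correct responses, and the default mu = 0.1 are assumptions of this sketch.

```python
def positive_example_nll(token_logprobs, is_correct):
    """Average negative log-likelihood over tokens of correct responses only."""
    total, count = 0.0, 0
    for logprobs, correct in zip(token_logprobs, is_correct):
        if correct:
            total -= sum(logprobs)
            count += len(logprobs)
    return total / count if count else 0.0


def combined_loss(policy_loss, nll_loss, mu=0.1):
    """Policy-gradient loss plus the weighted self-imitation term."""
    return policy_loss + mu * nll_loss


# Two sampled responses: the first was verified correct, the second was not.
token_logprobs = [[-0.5, -1.5], [-2.0]]
is_correct = [True, False]
nll = positive_example_nll(token_logprobs, is_correct)
total_loss = combined_loss(policy_loss=0.3, nll_loss=nll, mu=0.1)
```

Incorrect responses are skipped entirely, so the extra term only pushes probability mass toward sampled answers that passed verification.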

Group-Sampling. Rather than sampling many distinct prompts once each, VAPO samples fewer prompts with more repetitions per prompt, producing richer contrastive signal within each group and sharper advantage estimates. ^[1]
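The trade-off can be illustrated with a sketch that spends the same number of rollouts in two ways. Holding the total rollout budget fixed is an assumption of this sketch, and the function name, prompt labels, and budget are hypothetical.

```python
def plan_rollouts(prompts, rollout_budget, samples_per_prompt):
    """Return (prompt, sample_index) pairs that spend a fixed rollout budget."""
    num_prompts = rollout_budget // samples_per_prompt
    if num_prompts > len(prompts):
        raise ValueError("not enough prompts for this budget")
    return [(p, k) for p in prompts[:num_prompts] for k in range(samples_per_prompt)]


prompts = [f"prompt-{i}" for i in range(8)]

# Same budget of 8 rollouts: 8 prompts sampled once, or 2 prompts sampled 4 times.
one_each = plan_rollouts(prompts, rollout_budget=8, samples_per_prompt=1)
grouped = plan_rollouts(prompts, rollout_budget=8, samples_per_prompt=4)
```

The grouped plan covers fewer distinct prompts but yields several responses per prompt, which is what allows correct and incorrect attempts at the same problem to be contrasted.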

Results

All headline numbers use the Qwen2.5-32B base model, trained with reinforcement learning directly from the base checkpoint with no SFT data, and evaluated on AIME 2024. Vanilla PPO on the same setup scores about 5, illustrating how badly unmodified value-based RL fares on long chain-of-thought. VAPO lifts this to 60.4. ^[1]

Method	Type	AIME 2024 (Qwen2.5-32B base)
Vanilla PPO	value-based	~5
DeepSeek-R1-Zero-Qwen-32B (GRPO)	value-free	47
DAPO	value-free	50
VAPO	value-based	60.4

Beyond the final score, the authors report smoother and more stable training curves, faster score growth from the granular value signal, and better length scaling. Reaching state-of-the-art within 5,000 steps, and matching DAPO's score with roughly 60 percent of DAPO's update steps, is offered as evidence of efficiency, and the absence of crashes across runs as evidence of reliability. ^[1]^[8]

Ablations indicate the techniques contribute very unequally. Removing value-pretraining is by far the most damaging, collapsing the score from 60 to 11, which underscores how central a well-initialized critic is to the whole approach. Removing decoupled GAE drops it to 33 and removing length-adaptive GAE to 45. Among the DAPO-derived pieces, removing clip-higher costs about as much as removing length-adaptive GAE (46 versus 45), while removing the token-level loss costs less (53). The sparse-reward aids contribute the smallest individual gains, with the model scoring 54 without the positive-example LM loss and 55 without group-sampling. ^[1]
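To make the relative contributions easier to compare, the sketch below converts the ablation scores quoted above into point drops from the full 60-point result and ranks them. Only the scores come from the text; the variable names are illustrative.

```python
full_score = 60

# AIME 2024 score when each technique is removed, as reported above.
ablation_scores = {
    "Value-Pretraining": 11,
    "Decoupled GAE": 33,
    "Length-Adaptive GAE": 45,
    "Clip-Higher": 46,
    "Token-Level Policy-Gradient Loss": 53,
    "Positive-Example LM Loss": 54,
    "Group-Sampling": 55,
}

# Points lost relative to the full recipe, largest drop first.
drops = {name: full_score - score for name, score in ablation_scores.items()}
ranked = sorted(drops.items(), key=lambda item: item[1], reverse=True)

for name, drop in ranked:
    print(f"{name}: -{drop} points")
```

The drops range from 49 points for value-pretraining down to 5 points for group-sampling. Because each ablation removes one technique at a time, the drops are not additive.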

Relationship to GRPO and DAPO

VAPO, GRPO, and DAPO all descend from PPO and all target long chain-of-thought reasoning with verifiable rewards (RLVR), but they split on the value question. GRPO and DAPO are value-free: they delete the critic and derive advantages from group statistics, trading fine-grained credit assignment for simplicity and stability. VAPO is value-based: it restores the critic and invests heavily in making it trainable on long sequences. ^[1]^[2]^[5]

The relationship between VAPO and DAPO is collaborative rather than purely competitive, since both come from the same ByteDance Seed research line and share authors; GRPO, by contrast, originated at DeepSeek. VAPO explicitly reuses DAPO's Clip-Higher and token-level loss, and it inherits value-pretraining and decoupled GAE from VC-PPO, whose author list overlaps with both DAPO and VAPO. In effect, VAPO grafts the engineering tricks that made value-free DAPO work onto a properly stabilized value model, then shows the value-based result wins. The central empirical claim is direct: on the same base model and benchmark, the value-based framework beats the best value-free systems by more than 10 points. The authors interpret this as support for their thesis that value-based RL has a higher ceiling because the value model traces each action's effect on later returns, which matters when a single subtle error can derail a long proof. ^[1]^[8]

Significance

VAPO is notable less for any single mechanism than for reopening a strategic question in reasoning RL. Through early 2025, the field had largely converged on value-free methods, with DeepSeek-R1's GRPO and ByteDance's own DAPO as the reference points, partly because value-based PPO was seen as too unstable for long chain-of-thought. By assembling value-pretraining, decoupled and length-adaptive GAE, and several exploration and reward-exploitation aids into one recipe, VAPO demonstrated that the value-based route is not a dead end and can in fact set the state of the art on AIME 2024 for its model size. ^[1]

The result should be read with its scope in mind. The reported gains are on a single 32B base model and a math-competition benchmark, the framework is more complex than value-free alternatives, and the heavy dependence on value-pretraining shown in the ablations means the approach demands care to reproduce. Independent third-party replications and analyses were still limited as of mid-2026. Even so, VAPO, together with VC-PPO and DAPO, forms an influential cluster of 2025 ByteDance Seed work that maps both sides of the value-based versus value-free tradeoff and gives practitioners a concrete template for value-model-based RL on reasoning tasks. ^[1]^[2]^[3]

References

1. Yu Yue, Yufeng Yuan, Qiying Yu, Xiaochen Zuo, Ruofei Zhu, et al. "VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks." arXiv:2504.05118, April 2025. https://arxiv.org/abs/2504.05118
2. Qiying Yu, et al. "DAPO: An Open-Source LLM Reinforcement Learning System at Scale." arXiv:2503.14476, March 2025. https://arxiv.org/abs/2503.14476
3. Yufeng Yuan, Yu Yue, Ruofei Zhu, Tiantian Fan, Lin Yan. "What's Behind PPO's Collapse in Long-CoT? Value Optimization Holds the Secret" (VC-PPO). arXiv:2503.01491, March 2025. https://arxiv.org/abs/2503.01491
4. DeepSeek-AI. "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning." arXiv:2501.12948, January 2025. https://arxiv.org/abs/2501.12948
5. Zhihong Shao, et al. "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" (introduces GRPO). arXiv:2402.03300, 2024. https://arxiv.org/abs/2402.03300
6. John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov. "Proximal Policy Optimization Algorithms." arXiv:1707.06347, 2017. https://arxiv.org/abs/1707.06347
7. John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, Pieter Abbeel. "High-Dimensional Continuous Control Using Generalized Advantage Estimation." arXiv:1506.02438, 2015. https://arxiv.org/abs/1506.02438
8. Asif Razzaq. "ByteDance Introduces VAPO: A Novel Reinforcement Learning Framework for Advanced Reasoning Tasks." MarkTechPost, April 10, 2025. https://www.marktechpost.com/2025/04/10/bytedance-introduces-vapo-a-novel-reinforcement-learning-framework-for-advanced-reasoning-tasks/
