VAPO (Value-based Augmented PPO)
Last reviewed
Jun 8, 2026
Sources
8 citations
Review status
Source-backed
Revision
v1 · 1,902 words
VAPO (Value-based Augmented Proximal Policy Optimization) is a reinforcement learning framework for training large language models on long chain-of-thought reasoning tasks. It was introduced by ByteDance Seed, the foundation-model research group at ByteDance, in a paper titled "VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks," posted to arXiv in April 2025. [1] Secondary coverage sometimes renders it "Value-Augmented PPO." [1][8]
VAPO's defining choice is that it is value-based: it trains and relies on a learned value model (a critic). This contrasts with the value-free methods that dominated reasoning RL in early 2025, namely GRPO and DAPO, which discard the critic and estimate advantages from groups of sampled responses. The VAPO authors argue that a well-trained value model delivers finer-grained, per-token credit assignment and lower-variance advantage estimates, and therefore offers a higher theoretical performance ceiling than value-free approaches. The catch, which prior work had struggled to overcome, is that value models are notoriously difficult to train reliably on long reasoning traces. VAPO is presented as the first value-based framework to significantly outperform value-free methods on long chain-of-thought tasks. [1]
Trained from the Qwen2.5-32B base model with no supervised fine-tuning, VAPO reaches a score of 60.4 on the AIME 2024 mathematics competition benchmark. This surpasses the GRPO-trained DeepSeek-R1-Zero-Qwen-32B baseline (47) and DAPO (50) by more than 10 points, and it does so within 5,000 gradient steps, roughly 60 percent of the steps DAPO required, with no training crashes across multiple independent runs. [1][2]
Proximal Policy Optimization (PPO) is the standard policy-gradient algorithm behind much of RLHF and reasoning RL. PPO is an actor-critic method: it trains a policy (the actor) alongside a value function (the critic) that estimates the expected future reward from a given state. The critic is used to compute advantages, typically through generalized advantage estimation (GAE), which controls the bias-variance tradeoff with a decay parameter lambda. Advantages tell the optimizer how much better than expected each token was, providing token-level credit assignment.
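The GAE recursion described above can be sketched in a few lines of plain Python. This is a minimal illustration, not code from the VAPO paper: the function name `gae_advantages`, the toy rewards, and the constant critic values are all assumptions chosen to show how lambda moves the estimate between a one-step TD error and a full Monte Carlo return.

```python
def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Generalized advantage estimation for a single trajectory.

    rewards[t] is the reward received after step t and values[t] is the
    critic's estimate at step t. The value after the last step is taken as 0.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # one-step TD error
        running = delta + gamma * lam * running              # discounted sum of TD errors
        advantages[t] = running
    return advantages


# Toy trajectory with a single terminal reward (hypothetical numbers).
rewards = [0.0, 0.0, 0.0, 1.0]
values = [0.5, 0.5, 0.5, 0.5]

adv_mc = gae_advantages(rewards, values, lam=1.0)  # Monte Carlo limit
adv_td = gae_advantages(rewards, values, lam=0.0)  # one-step TD limit
print(adv_mc)
print(adv_td)
```

With lambda = 1 every token receives the full return minus the critic's estimate; with lambda = 0 only the final token sees the reward, which is the low-variance, high-bias end of the tradeoff.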
Value-free methods remove the critic to avoid the cost and instability of training it. GRPO (Group Relative Policy Optimization), introduced with DeepSeekMath and used to train DeepSeek-R1, replaces the value model with a baseline computed across a group of responses sampled for the same prompt: a response's advantage is its reward normalized by the group's mean and standard deviation. [4][5] DAPO, also from ByteDance Seed, builds on GRPO and adds Clip-Higher, dynamic sampling, a token-level loss, and overlong-reward shaping, removing the KL penalty entirely. [2] These methods are robust and simple, but the same scalar advantage is broadcast to every token in a response, which gives coarse credit assignment and comparatively high variance.
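As a rough illustration of the group-relative baseline, the sketch below normalizes each response's reward by its group's mean and standard deviation and then broadcasts the resulting scalar to every token. The function name, the `eps` stabilizer, the choice of population standard deviation, and the toy rewards and lengths are assumptions of this example; real implementations differ in these details.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward by the mean and standard deviation of its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]


# Four responses sampled for the same prompt, scored 1 (correct) or 0 (incorrect).
group_rewards = [1.0, 0.0, 0.0, 1.0]
advs = group_relative_advantages(group_rewards)

# The same scalar is repeated for every token of a response,
# regardless of which tokens actually mattered.
response_lengths = [3, 5, 2, 4]
token_advs = [[a] * n for a, n in zip(advs, response_lengths)]
print(token_advs)
```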
Value-based methods keep the critic and, in principle, assign a distinct advantage to each token. For long chain-of-thought, where a single verifiable reward arrives only at the end of a trace thousands of tokens long, the critic becomes hard to learn: initialization bias compounds over the long horizon, the exponential decay in GAE drives the terminal reward signal toward zero, and as a result PPO frequently collapses. The immediate predecessor to VAPO, VC-PPO (Value-Calibrated PPO, ByteDance, March 2025), diagnosed these two failure modes (initialization bias and reward-signal decay) and proposed value-pretraining and decoupled GAE to address them. [3] VAPO extends this line of work, combining the value-model fixes from VC-PPO with several techniques adapted from the value-free DAPO.
VAPO frames value-based RL for long reasoning around three challenges and layers seven techniques on top of PPO to address them. [1]
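The three challenges are value-model bias over long sequences, heterogeneous response lengths during training, and sparse reward signals from verifier-based tasks.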
The seven techniques and the challenge each targets are summarized below.
| Technique | Challenge addressed | Origin |
|---|---|---|
| Value-Pretraining | Value-model bias | VC-PPO |
| Decoupled GAE | Value-model bias | VC-PPO |
| Length-Adaptive GAE | Heterogeneous lengths | VAPO (new) |
| Token-Level Policy-Gradient Loss | Heterogeneous lengths | DAPO |
| Clip-Higher | Sparse reward / exploration | DAPO |
| Positive-Example LM Loss | Sparse reward | VAPO |
| Group-Sampling | Sparse reward | VAPO |
Value-Pretraining. Before joint training begins, the value model is fit offline to rollouts from a fixed policy, using Monte Carlo returns as targets. Starting the critic from an informed state rather than a biased initialization prevents the bias from propagating once policy updates start. [1][3]
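A minimal sketch of the regression targets used in this kind of warm-up, assuming a single terminal reward and no discounting: every token's target is the Monte Carlo return from that point on. The helper names and toy numbers are hypothetical, and a real critic would be a neural network trained by gradient descent on this loss rather than a fixed list of predictions.

```python
def monte_carlo_returns(rewards, gamma=1.0):
    """Return-to-go for every step of one rollout."""
    returns, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def value_mse(predicted, targets):
    """Mean squared error between critic predictions and return targets."""
    return sum((p - y) ** 2 for p, y in zip(predicted, targets)) / len(targets)


# One rollout from the fixed policy; the verifier pays out only at the end.
rewards = [0.0, 0.0, 0.0, 1.0]
targets = monte_carlo_returns(rewards)

untrained = [0.0, 0.0, 0.0, 0.0]  # stand-in for a poorly initialized critic
loss_before = value_mse(untrained, targets)
print(targets, loss_before)
```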
Decoupled GAE. Inherited from VC-PPO, this uses different lambda values for the critic and the policy. The critic is updated with lambda equal to 1.0, which corresponds to an unbiased Monte Carlo target and lets the value model learn accurate long-range expectations. The policy uses a smaller lambda for variance reduction. [1][3]
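The decoupling can be sketched by running the same GAE recursion twice with different lambda values: once to build the critic's regression targets and once to build the policy's advantages. The function names, the policy lambda of 0.95, and the toy rewards and values are assumptions for illustration only.

```python
def gae(rewards, values, lam, gamma=1.0):
    """GAE for one trajectory; the value after the last step is taken as 0."""
    adv, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        next_v = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_v - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv


def decoupled_gae(rewards, values, lam_critic=1.0, lam_policy=0.95):
    """Use one lambda for the critic's targets and another for the policy."""
    critic_adv = gae(rewards, values, lam_critic)
    # With lam_critic = 1, advantage + value equals the Monte Carlo return.
    value_targets = [a + v for a, v in zip(critic_adv, values)]
    policy_adv = gae(rewards, values, lam_policy)
    return value_targets, policy_adv


rewards = [0.0, 0.0, 0.0, 1.0]
values = [0.2, 0.4, 0.6, 0.8]
value_targets, policy_adv = decoupled_gae(rewards, values)
print(value_targets)
print(policy_adv)
```

In this toy case the critic's targets all equal the final reward of 1.0, while the policy's advantage at the first token is shrunk below the Monte Carlo value of 0.8 by the smaller lambda.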
Length-Adaptive GAE. A fixed policy lambda such as 0.95 is unsuitable when responses span very different lengths, because the weight on the terminal reward decays geometrically with distance from the end of the response; 100 steps before the end, 0.95 to the 100th power is about 0.006, effectively erasing the signal. VAPO instead sets the policy lambda as a function of the response length l, approximately lambda = 1 - 1/(alpha times l) with alpha = 0.05. The effective credit-assignment horizon, 1/(1 - lambda) = alpha times l, then grows in proportion to length, so a short response gets a shorter horizon and a long response gets a longer one, distributing TD-errors more uniformly across both. [1]
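The arithmetic is easy to check directly. The sketch below evaluates the fixed-lambda decay and the length-adaptive formula for a few response lengths. The function name is hypothetical, and the clamp at zero for very short responses (where alpha times l falls below 1) is an assumption of this example rather than something stated in the source.

```python
def length_adaptive_lambda(length, alpha=0.05):
    """Policy lambda as a function of response length: 1 - 1/(alpha * length)."""
    lam = 1.0 - 1.0 / (alpha * length)
    return max(0.0, lam)  # clamp is an assumption of this sketch


# Weight on a reward 100 steps away under a fixed lambda of 0.95.
fixed_coeff = 0.95 ** 100
print(f"0.95^100 = {fixed_coeff:.5f}")

horizons = {}
for length in (100, 1000, 10000):
    lam = length_adaptive_lambda(length)
    horizons[length] = 1.0 / (1.0 - lam)  # effective horizon, equal to alpha * length
    print(f"length={length:>5}  lambda={lam:.4f}  horizon={horizons[length]:.1f}")
```

The horizon comes out as a constant 5 percent of the response length, which is what lets short and long responses share one setting.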
Token-Level Policy-Gradient Loss. Borrowed from DAPO, this weights every token equally rather than first averaging within each sequence. Without it, long responses, which contain most of the reasoning, are systematically under-weighted relative to short ones. [1][2]
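The difference between the two reductions shows up even on a two-response batch. In this sketch the per-token loss values are made up, and the function names are hypothetical; the point is only how the averaging order changes each response's weight.

```python
def sequence_level_loss(per_token_losses):
    """Average within each sequence first, then across sequences."""
    per_seq = [sum(seq) / len(seq) for seq in per_token_losses]
    return sum(per_seq) / len(per_seq)


def token_level_loss(per_token_losses):
    """Average over all tokens in the batch, so every token counts equally."""
    total = sum(sum(seq) for seq in per_token_losses)
    count = sum(len(seq) for seq in per_token_losses)
    return total / count


# A short response (2 tokens) and a long one (8 tokens), with toy loss values.
batch = [[1.0] * 2, [3.0] * 8]
seq_loss = sequence_level_loss(batch)
tok_loss = token_level_loss(batch)
print(seq_loss, tok_loss)
```

Under sequence-level averaging the long response's eight tokens share the same total weight as the short response's two; under token-level averaging the long response contributes 80 percent of the loss.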
Clip-Higher. Also from DAPO, this uses asymmetric clipping in the PPO objective, raising the upper clipping bound epsilon_high relative to epsilon_low. The extra headroom lets low-probability tokens increase, preserving exploration and countering the entropy collapse that plagues naive PPO and GRPO. [1][2]
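A per-token sketch of the asymmetric clipped surrogate follows. The default bounds of 0.2 and 0.28 are illustrative values assumed for this example, and the function name is hypothetical; the structure is the standard PPO clipped objective with separate lower and upper bounds.

```python
def clipped_objective(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """PPO clipped surrogate for one token, with separate lower and upper bounds.

    ratio is pi_new(token) / pi_old(token). The objective is to be maximized.
    """
    clipped_ratio = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped_ratio * advantage)


# A token whose probability rose by 25 percent and which had positive advantage.
symmetric = clipped_objective(1.25, 1.0, eps_high=0.2)  # capped at 1.2
clip_higher = clipped_objective(1.25, 1.0)              # still inside the wider bound
print(symmetric, clip_higher)
```

With the symmetric bound the update is already clipped, so the gradient through that token vanishes; with the raised upper bound the token can keep gaining probability.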
Positive-Example LM Loss. VAPO adds an auxiliary negative-log-likelihood (language-modeling) loss over correct sampled responses, weighted by a coefficient mu. This is a form of self-imitation that squeezes more learning out of the rare positive rewards available under sparse, verifier-based training. [1]
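One way to picture the auxiliary term is as a masked negative log-likelihood added to the policy loss. In the sketch below, the per-token normalization, the value mu = 0.1, the function names, and the toy log-probabilities are all assumptions for illustration.

```python
def positive_example_nll(token_logprobs, is_correct):
    """Average negative log-likelihood over tokens of correct responses only."""
    total, count = 0.0, 0
    for logps, ok in zip(token_logprobs, is_correct):
        if ok:
            total -= sum(logps)
            count += len(logps)
    return total / count if count else 0.0


def combined_loss(policy_loss, nll_loss, mu=0.1):
    """Policy loss plus the weighted imitation term on positive examples."""
    return policy_loss + mu * nll_loss


# Log-probabilities of the sampled tokens under the current policy (toy values).
token_logprobs = [[-0.5, -1.5], [-2.0], [-1.0, -1.0, -1.0]]
is_correct = [True, False, True]  # verifier outcome per response

nll = positive_example_nll(token_logprobs, is_correct)
total = combined_loss(policy_loss=0.3, nll_loss=nll)
print(nll, total)
```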
Group-Sampling. Rather than sampling many distinct prompts once each, VAPO samples fewer prompts with more repetitions per prompt, producing richer contrastive signal within each group and sharper advantage estimates. [1]
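The tradeoff is between prompt coverage and samples per prompt under a fixed rollout budget. The sketch below builds both kinds of plan. The budget, group size, and prompt names are hypothetical and do not reflect the paper's actual batch configuration.

```python
def rollout_plan(prompts, budget, samples_per_prompt):
    """Spend a fixed rollout budget on fewer prompts, each sampled several times."""
    n_prompts = budget // samples_per_prompt
    return [(p, k) for p in prompts[:n_prompts] for k in range(samples_per_prompt)]


prompts = [f"prompt_{i}" for i in range(8)]

one_each = rollout_plan(prompts, budget=8, samples_per_prompt=1)  # 8 prompts x 1
grouped = rollout_plan(prompts, budget=8, samples_per_prompt=4)   # 2 prompts x 4
print(len({p for p, _ in one_each}), len({p for p, _ in grouped}))
```

Both plans generate eight responses, but only the grouped plan puts several responses to the same prompt side by side, which is what makes correct and incorrect attempts directly comparable.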
All headline numbers use the Qwen2.5-32B base model, trained with reinforcement learning directly from the pretrained checkpoint with no SFT data, and evaluated on AIME 2024. Vanilla PPO on the same setup scores about 5, illustrating how badly unmodified value-based RL fares on long chain-of-thought. VAPO lifts this to 60.4. [1]
| Method | Type | AIME 2024 (Qwen2.5-32B base) |
|---|---|---|
| Vanilla PPO | value-based | ~5 |
| DeepSeek-R1-Zero-Qwen-32B (GRPO) | value-free | 47 |
| DAPO | value-free | 50 |
| VAPO | value-based | 60.4 |
Beyond the final score, the authors report smoother and more stable training curves, faster score growth from the granular value signal, and better length scaling. Reaching state-of-the-art within 5,000 steps, roughly 60 percent of DAPO's update budget, is offered as evidence of efficiency, and the absence of crashes across runs as evidence of reliability. [1][8]
Ablations indicate the techniques contribute very unequally. Removing value-pretraining is by far the most damaging, collapsing the score from 60 to 11, which underscores how central a well-initialized critic is to the whole approach. Removing decoupled GAE drops it to 33 and removing length-adaptive GAE to 45. The DAPO-derived pieces matter less in isolation: without clip-higher the score is 46, without the token-level loss 53. The sparse-reward aids contribute the smallest individual gains, with the model scoring 54 without the positive-example LM loss and 55 without group-sampling. [1]
VAPO, GRPO, and DAPO all descend from PPO and all target long chain-of-thought reasoning with verifiable rewards (RLVR), but they split on the value question. GRPO and DAPO are value-free: they delete the critic and derive advantages from group statistics, trading fine-grained credit assignment for simplicity and stability. VAPO is value-based: it restores the critic and invests heavily in making it trainable on long sequences. [1][2][5]
The relationship between VAPO and DAPO is collaborative rather than purely competitive, since VAPO, DAPO, and VC-PPO all come from the same ByteDance Seed research line and share authors (GRPO, by contrast, originated at DeepSeek). VAPO explicitly reuses DAPO's Clip-Higher and token-level loss, and it inherits value-pretraining and decoupled GAE from VC-PPO, whose author list overlaps with both DAPO and VAPO. In effect, VAPO grafts the engineering tricks that made value-free DAPO work onto a properly stabilized value model, then shows the value-based result wins. The central empirical claim is direct: on the same base model and benchmark, the value-based framework beats the best value-free systems by more than 10 points. The authors interpret this as support for their thesis that value-based RL has a higher ceiling because the value model traces each action's effect on later returns, which matters when a single subtle error can derail a long proof. [1][8]
VAPO is notable less for any single mechanism than for reopening a strategic question in reasoning RL. Through early 2025, the field had largely converged on value-free methods, with DeepSeek-R1's GRPO and ByteDance's own DAPO as the reference points, partly because value-based PPO was seen as too unstable for long chain-of-thought. By assembling value-pretraining, decoupled and length-adaptive GAE, and several exploration and reward-exploitation aids into one recipe, VAPO demonstrated that the value-based route is not a dead end and can in fact set the state of the art on AIME 2024 for its model size. [1]
The result should be read with its scope in mind. The reported gains are on a single 32B base model and a math-competition benchmark, the framework is more complex than value-free alternatives, and the heavy dependence on value-pretraining shown in the ablations means the approach demands care to reproduce. Independent third-party replications and analyses were still limited as of mid-2026. Even so, VAPO, with VC-PPO and DAPO, forms an influential cluster of 2025 ByteDance Seed work that maps both sides of the value-based versus value-free tradeoff and gives practitioners a concrete template for value-model-based RL on reasoning tasks. [1][2][3]