Reinforcement Learning with Verifiable Rewards (RLVR) is a post-training paradigm for large language models in which the reward signal comes from a deterministic, rule-based verification function rather than from a learned reward model trained on human preferences. The verifier checks whether a model's output is objectively correct, typically returning a binary signal: 1 for a correct answer, 0 for an incorrect one. Because the verification function is external, auditable, and does not require human labelers at training time, RLVR avoids many of the instabilities and costs associated with RLHF. The paradigm has become the dominant training method for reasoning-focused language models, popularized first by the Tulu 3 paper from the Allen Institute for AI in November 2024 and then brought to global attention by DeepSeek-R1 in January 2025.
RLVR works best in domains where correctness can be determined without human judgment: mathematics, competitive programming, formal logic, and structured instruction following. In these domains, a test harness, a symbolic solver, or a string-matching function can replace the expensive neural reward model that RLHF requires, enabling training runs that scale to hundreds of billions of parameters while remaining computationally tractable.
The term "Reinforcement Learning with Verifiable Rewards" and its acronym RLVR were introduced in the Tulu 3 paper (Lambert et al., Allen Institute for AI, arXiv:2411.15124, November 2024). The authors described RLVR as "a novel method" and positioned it as one of three post-training algorithms alongside supervised fine-tuning (SFT) and DPO. They defined it as using the existing RLHF objective but replacing the learned reward model with a deterministic verification function that checks objective correctness.
The underlying idea predates the name. Researchers at OpenAI and elsewhere had used execution-based feedback to train code generation models for several years before 2024, and some academic work had applied rule-based rewards to mathematical reasoning. The Tulu 3 paper was the first to give this pattern a distinct name, frame it as a standalone training paradigm, and release an open-source implementation at the 8B, 70B, and 405B scales.
Conceptually, RLVR sits at the intersection of outcome-supervised reinforcement learning and symbolic verification. Unlike process reward models that score intermediate reasoning steps, standard RLVR provides a reward only at the end of a generation, once the final answer has been verified. This outcome-only design keeps the reward signal clean and avoids the annotation cost of step-level labeling, though it also introduces a sparse reward problem that later variants have tried to address.
The Tulu 3 paper (arXiv:2411.15124) introduced RLVR as part of a comprehensive open post-training recipe for the Llama 3 family of base models. The release included 8B, 70B, and 405B variants and was published to coincide with a blog post at the Allen Institute for AI.
In the Tulu 3 training pipeline, RLVR is applied after SFT and DPO. The system generates multiple candidate responses to a given prompt, verifies each response using a domain-specific function, assigns rewards based on correctness, and updates the policy using a KL-regularized RL objective. For mathematics, the verifier checks whether the model's numerical answer matches the ground truth after normalization. For instruction-following tasks such as IFEval, the verifier applies a set of rule-based constraints (format checks, length constraints, keyword requirements) to the model's output.
The Tulu 3 results showed that adding RLVR on top of a DPO checkpoint produced gains of up to 1.7 points on MATH, 3.3 points on GSM8K, and 1.3 points on IFEval compared to the DPO-only checkpoint. Improvements also transferred to tasks outside the RLVR training distribution, including BigBenchHard and DROP, suggesting some degree of generalization. Gains were larger at the 405B scale than at 8B, indicating that RLVR interacts favorably with model capacity.
The Tulu 3 paper also noted that verifiable rewards improved results consistently across a range of KL penalty coefficients (β values from 0.01 to 0.1), but that allowing the policy a larger KL budget did not always translate into better performance on the target benchmarks, a pattern that would later receive theoretical scrutiny.
The release was significant not only for naming the paradigm but also for publishing code, data, and training details, making RLVR reproducible for the broader research community.
While Tulu 3 coined the term and established RLVR as a named paradigm, DeepSeek-R1 (arXiv:2501.12948, January 2025) brought it to global attention and demonstrated what pure RL with verifiable rewards could achieve at scale.
The DeepSeek-R1 paper introduced two systems. DeepSeek-R1-Zero was trained from the DeepSeek-V3-Base model using RL with verifiable rewards and no supervised fine-tuning whatsoever. DeepSeek-R1 used a four-stage pipeline: a cold-start SFT phase on a small set of human-readable chain-of-thought examples, a primary RLVR phase, a rejection-sampling SFT phase, and a secondary RL phase covering both reasoning and general capabilities.
The reward design in DeepSeek-R1 had two components. An accuracy reward checked answer correctness using rule-based verification: math answers were compared against ground truth after format normalization, and coding answers were verified by running test cases through a compiler. A format reward enforced structural compliance, requiring the model to wrap its reasoning inside <think> and </think> tags. The authors explicitly chose not to use a neural reward model, citing concerns about reward hacking in large-scale RL training.
DeepSeek-R1-Zero achieved 71.0% pass@1 on AIME 2024, up from the base model's 15.6%, and improved to 86.7% with majority voting across 64 samples. DeepSeek-R1 reached 79.8% on AIME 2024, edging out OpenAI-o1-1217 at 79.2%. On MATH-500, DeepSeek-R1 scored 97.3% against o1-1217's 96.4%. On the Codeforces competitive programming benchmark, DeepSeek-R1-Zero reached a rating of 1444, and the full DeepSeek-R1 model reached 2029, compared to o1-1217's 2061.
The most discussed finding from the DeepSeek-R1-Zero experiments was the spontaneous emergence of self-reflection and self-correction during RL training. Without any explicit instruction to verify its work, the model began producing reasoning traces where it would stop mid-solution, reconsider an earlier step, and revise its approach. The authors called this the "aha moment." Response length also grew substantially during training, from hundreds of tokens in early iterations to thousands of tokens in later stages, as the model learned that longer deliberation correlated with higher rewards on difficult problems.
DeepSeek-R1's open release under a permissive license, combined with its performance numbers, triggered a wave of reproduction efforts and derivative research across the community.
The standard RLVR training loop follows four steps.
First, for each training prompt, the policy model generates a group of K candidate responses, typically 4 to 16.
Second, each response is passed to a domain-specific verifier function V. The verifier returns a scalar reward r (usually 0 or 1, though partial scores are possible).
Third, advantages are computed by comparing each response's reward to the group average. A response with reward above the group mean receives a positive advantage; one below receives a negative advantage. This relative comparison is the core idea of GRPO (Group Relative Policy Optimization), the RL algorithm most commonly paired with RLVR.
Fourth, the policy is updated using a policy gradient objective with a KL divergence penalty that prevents the model from drifting too far from a reference policy (usually the SFT checkpoint). The objective can be written as maximizing the expected reward minus a KL regularization term:
L = E[A_i * log π_θ(o_i | q)] - β * KL(π_θ || π_ref)
where A_i is the advantage estimate for output o_i given query q, β is the regularization coefficient, and π_ref is the reference policy.
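Condensed into code, the loop looks like the following sketch. It is illustrative rather than any framework's actual API: it assumes the per-response summed log-probabilities have already been gathered, uses a simple per-sequence Monte Carlo estimate of the KL term, and adds a small epsilon so that groups with identical rewards do not divide by zero.

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative advantages: (reward - group mean) / group std.

    rewards: shape [K], verifier scores (0 or 1) for K responses to one prompt.
    """
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

def rlvr_loss(logp: torch.Tensor, logp_ref: torch.Tensor,
              rewards: torch.Tensor, beta: float = 0.04) -> torch.Tensor:
    """KL-regularized policy-gradient surrogate for one group of K responses.

    logp:     [K] summed token log-probs of each response under the policy.
    logp_ref: [K] the same quantity under the frozen reference policy.
    """
    adv = grpo_advantages(rewards)
    pg_term = (adv * logp).mean()       # reward-weighted log-likelihood
    kl_term = (logp - logp_ref).mean()  # crude per-sequence KL estimate
    return -(pg_term - beta * kl_term)  # negated because optimizers minimize

# Toy group of K=4 responses: two pass the verifier, two fail.
logp = torch.tensor([-40.0, -55.0, -38.0, -60.0], requires_grad=True)
logp_ref = torch.tensor([-42.0, -50.0, -39.0, -58.0])
rewards = torch.tensor([1.0, 0.0, 1.0, 0.0])
rlvr_loss(logp, logp_ref, rewards).backward()
# Gradients now push probability mass toward the two verified responses.
```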
GRPO was the algorithm used in DeepSeek-R1. It eliminates the separate critic (value function) network that PPO requires by using the group of sampled responses as a self-contained baseline. This reduces memory consumption and simplifies the training infrastructure, which matters at the scales where RLVR is typically applied. Tulu 3, by contrast, ran its RLVR stage with PPO and a learned value model; group-relative advantage estimation became the default pairing for RLVR largely in DeepSeek-R1's wake.
The verifier function V is the central component that distinguishes RLVR from RLHF. In RLHF, V is a neural network trained on human preference data and subject to reward hacking, distributional shift, and annotation noise. In RLVR, V is a deterministic program that returns the same output for the same input every time.
The range of verifiable domains has expanded considerably since 2024. The most established verifier categories are the following.
Mathematical correctness. The verifier normalizes the model's answer (stripping LaTeX delimiters, expanding fractions, converting between equivalent forms) and checks whether it matches the ground-truth answer exactly. For problems with numerical answers, floating-point comparison with a tolerance threshold handles rounding. This verifier covers benchmarks such as GSM8K, MATH-500, AIME, and AMC.
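A minimal sketch of such a verifier is shown below. Real normalizers used by evaluation harnesses handle far more notation (units, intervals, symbolic expressions); this toy version only strips common LaTeX wrappers and falls back to string equality when numeric parsing fails.

```python
from fractions import Fraction

def verify_math(model_answer: str, ground_truth: str, tol: float = 1e-6) -> int:
    """Toy math verifier: normalize both answers, then compare. Returns 0 or 1."""
    def normalize(s: str) -> str:
        s = s.strip().strip("$")
        for wrapper in (r"\boxed{", r"\text{"):
            if s.startswith(wrapper) and s.endswith("}"):
                s = s[len(wrapper):-1]
        return s.replace(",", "").strip()

    a, b = normalize(model_answer), normalize(ground_truth)
    try:  # numeric comparison with tolerance when both sides parse as numbers
        return int(abs(float(Fraction(a)) - float(Fraction(b))) < tol)
    except (ValueError, ZeroDivisionError):
        return int(a == b)  # otherwise exact string match

assert verify_math(r"\boxed{1/2}", "0.5") == 1   # equivalent forms match
assert verify_math("3,600", "3600") == 1         # thousands separator stripped
assert verify_math("42", "41") == 0
```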
Code execution. The verifier runs the model's code submission against a set of hidden test cases in a sandboxed execution environment. A pass is awarded if all test cases pass; failure on any test case gives a zero reward. This verifier covers competitive programming benchmarks (Codeforces, LeetCode) and software engineering tasks (SWE-Bench). Security sandboxing is essential here, as generated code may contain adversarial patterns.
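The sketch below illustrates the all-or-nothing reward computation. A bare subprocess is emphatically not a sandbox; production verifiers isolate execution (containers, seccomp, or similar) before running model-written code, and the timeout value here is an arbitrary placeholder.

```python
import subprocess
import sys
import tempfile

def verify_code(solution_src: str, test_cases: list[tuple[str, str]],
                timeout_s: float = 2.0) -> int:
    """Toy execution verifier: run the program on each stdin, compare stdout.

    Reward is 1 only if every test case passes; any failure or timeout gives 0.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(solution_src)
        path = f.name
    for stdin_data, expected in test_cases:
        try:
            result = subprocess.run([sys.executable, path], input=stdin_data,
                                    capture_output=True, text=True,
                                    timeout=timeout_s)
        except subprocess.TimeoutExpired:
            return 0  # timeouts count as failures, a known source of reward noise
        if result.returncode != 0 or result.stdout.strip() != expected.strip():
            return 0
    return 1

solution = "a, b = map(int, input().split())\nprint(a + b)\n"
print(verify_code(solution, [("1 2", "3"), ("10 -4", "6")]))  # -> 1
```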
Format and instruction compliance. For structured instruction-following tasks, the verifier applies a set of rule-based constraints to the model's output: presence of specific keywords, adherence to word count limits, use of specific formats such as JSON or numbered lists. The IFEval benchmark is designed around this type of verifier.
Formal proof verification. In theorem proving, interactive proof assistants such as Lean 4 serve as verifiers. The model generates a formal proof, the proof assistant checks each step for logical validity, and the reward is 1 only if the entire proof checks. This is one of the most demanding uses of RLVR, as even a single incorrect step causes the verifier to reject the full proof.
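To make the all-or-nothing property concrete, here is a tiny Lean 4 example of the kind of artifact such a verifier checks. The kernel either accepts the whole proof (reward 1) or rejects it (reward 0); there is no partial credit for a mostly-correct derivation.

```lean
-- A complete proof the Lean 4 kernel accepts: reward 1.
theorem sum_comm (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- Replacing the proof term with `sorry` (or any invalid step) makes the
-- kernel reject the file, so the RLVR reward for the generation would be 0.
```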
Text-to-SQL execution. For database query tasks, the verifier executes both the model's SQL query and the reference query against a database and compares result sets. This handles cases where multiple syntactically different queries produce equivalent results.
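A minimal sketch, using sqlite3 and sorted result sets as the equivalence check (real harnesses also have to respect ORDER BY clauses, duplicates, and schema differences):

```python
import sqlite3

def verify_sql(model_query: str, reference_query: str,
               con: sqlite3.Connection) -> int:
    """Toy text-to-SQL verifier: execute both queries, compare result sets."""
    try:
        got = sorted(con.execute(model_query).fetchall())
        want = sorted(con.execute(reference_query).fetchall())
    except sqlite3.Error:
        return 0  # any execution error counts as incorrect
    return int(got == want)

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE emp (name TEXT, dept TEXT, salary INT)")
con.executemany("INSERT INTO emp VALUES (?, ?, ?)",
                [("ada", "eng", 120), ("bob", "ops", 90)])

# Syntactically different queries with identical result sets earn reward 1.
print(verify_sql("SELECT name FROM emp WHERE salary > 100",
                 "SELECT name FROM emp WHERE dept = 'eng'", con))  # -> 1
```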
Beyond these established types, 2025 saw active research into extending RLVR to domains where clean binary verifiers are harder to construct, including medical multiple-choice question answering, scientific reasoning, and legal reasoning. These extensions often require more elaborate verification pipelines or accept noisier reward signals.
RLVR and GRPO have become closely associated because they solve complementary problems in large-scale reasoning model training. RLVR provides the reward signal; GRPO provides an RL algorithm that can use that signal without a learned critic model.
Standard PPO requires maintaining a critic network that estimates the expected future reward from each state, which approximately doubles the memory footprint of the training run. At 70B or 405B parameter scales, this becomes prohibitive. GRPO replaces the critic with a batch-relative baseline: within each group of K responses to the same prompt, the advantage of each response is its reward minus the group mean reward, divided by the group standard deviation.
This design has a mathematical interpretation: GRPO with binary rewards is equivalent to a KL-regularized contrastive loss where the contrastive samples are drawn from the policy itself. The winning responses in each group act as positive examples and the losing responses act as negative examples, similar in spirit to the contrastive objective used in DPO but computed online rather than from a static preference dataset.
Several modifications to GRPO have emerged to address known failure modes. DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization, ByteDance and Tsinghua, arXiv:2503.14476) added four techniques: Clip-Higher to promote diversity and reduce entropy collapse, Dynamic Sampling to filter prompts where all responses receive the same reward (which provide no gradient signal), Token-Level Policy Gradient Loss to stabilize training with long chain-of-thought outputs, and Overlong Reward Shaping to reduce noise from excessively long responses. DAPO achieved 50 points on AIME 2024 using the Qwen2.5-32B base model, outperforming DeepSeek-R1-Zero-Qwen-32B while using only 50% of the training compute.
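Two of these techniques fit in a few lines each. The sketch below illustrates the ideas rather than reproducing DAPO's reference implementation; the epsilon values follow the decoupled-clipping scheme reported in the paper.

```python
import torch

def clip_higher_surrogate(ratio: torch.Tensor, adv: torch.Tensor,
                          eps_low: float = 0.2,
                          eps_high: float = 0.28) -> torch.Tensor:
    """PPO-style clipped surrogate with decoupled bounds (Clip-Higher).

    A larger upper bound (eps_high > eps_low) lets low-probability tokens be
    up-weighted more aggressively, counteracting entropy collapse.
    """
    clipped = torch.clamp(ratio, 1 - eps_low, 1 + eps_high)
    return torch.minimum(ratio * adv, clipped * adv).mean()

def dynamic_sampling_filter(groups: list[dict]) -> list[dict]:
    """Drop prompts whose K responses all pass or all fail the verifier:
    their group-relative advantages are zero, so they carry no gradient."""
    return [g for g in groups if 0 < sum(g["rewards"]) < len(g["rewards"])]

groups = [{"rewards": [1, 1, 1, 1]}, {"rewards": [1, 0, 1, 0]}]
print(len(dynamic_sampling_filter(groups)))  # -> 1: uniform group is dropped
```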
NGRPO (Negative-enhanced GRPO) addresses the situation where all responses in a group fail verification, which normally produces a zero gradient. By converting uniform failure into structured negative learning signals, NGRPO extracts useful updates from all-wrong batches.
OpenAI has not published training details for OpenAI o1, OpenAI o3, or subsequent o-series models. The system cards and technical reports describe them as trained with "large-scale reinforcement learning" on reasoning traces, but do not specify reward sources, algorithms, or training pipelines.
Several technical signals suggest these models use approaches closely related to RLVR. The o-series models perform best on tasks with objective verifiable answers: mathematics, competitive programming, formal logic, and structured problem solving. Their performance profile matches what RLVR produces: strong pass@1 accuracy on tasks where a verifier can check correctness, with less differentiation on open-ended generative tasks. The emergent reasoning behaviors documented for DeepSeek-R1-Zero (extended chain-of-thought, self-reflection, self-correction) also appear in o1 and o3 outputs.
o3 is reported to learn test-time strategies autonomously, including writing brute-force solution prototypes and using them to verify the outputs of more optimized solutions. This kind of self-verification behavior is consistent with what pure RLVR training produces, as models are incentivized to find any strategy that increases the probability of passing the verifier.
It is possible that OpenAI uses process reward models in addition to outcome-based RLVR, which would represent a hybrid approach. The o1 technical primer on LessWrong (2024) speculated that o1 combines outcome verification with dense process supervision, but this has not been confirmed. The DeepSeek-R1 authors explicitly opted against neural process rewards due to reward hacking concerns; OpenAI may have taken the opposite view.
Across published benchmarks, RLVR consistently improves pass@1 performance over both base models and SFT-only checkpoints on verifiable tasks.
On mathematics, the most documented domain, the gains are substantial. DeepSeek-R1-Zero improved from 15.6% to 71.0% on AIME 2024 (pass@1) by applying RLVR to the DeepSeek-V3-Base model. Qwen2.5-Math variants showed a jump from approximately 36% to over 73% on MATH-500 under comparable RLVR training. Tulu 3 added 3.3 points on GSM8K over a strong DPO baseline. These gains are large in absolute terms, though the Limits-of-RLVR research (discussed below) argues that they reflect redistribution of probability mass rather than new capabilities.
On coding, DeepSeek-R1 reached a Codeforces rating of 2029, which the DeepSeek-R1 paper reports as outperforming 96.3% of rated human competitors on that platform. RLVR applied to competitive programming benchmarks has produced similar improvements across model families.
On instruction following, Tulu 3 showed a 1.3-point gain on IFEval from RLVR, and the gains transferred to tasks outside the training distribution.
One particularly striking result came from research showing that using a single well-chosen mathematical problem as the entire training dataset, rather than thousands of examples, can still move Qwen2.5-Math-1.5B from 36% to 73.6% on MATH-500. This suggests that the shape of the training signal matters more than its volume, consistent with the hypothesis that RLVR is reshaping the model's sampling distribution rather than teaching it new facts.
Shortly after DeepSeek-R1's release triggered widespread adoption of RLVR, researchers at Tsinghua University published a series of papers questioning whether RLVR actually expands a model's reasoning capability or merely improves sampling efficiency on capabilities the base model already possesses.
The core argument uses the pass@k metric. Pass@k measures whether a model produces at least one correct solution in k attempts. At k=1, RL-trained models consistently outperform their base model counterparts. However, as k increases, the base model catches up and, at sufficiently high k (such as k=256 on AIME or LiveCodeBench), the base model surpasses the RL-trained version.
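Pass@k is conventionally computed with the unbiased estimator introduced alongside Codex (Chen et al., 2021): draw n samples per problem, count the c that verify as correct, and estimate the probability that a random subset of k samples contains at least one success.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# 64/256 correct samples: strong pass@1. 2/256 correct: weak pass@1,
# but a large k budget still finds the rare correct solutions.
print(round(pass_at_k(256, 64, 1), 3))   # 0.25
print(round(pass_at_k(256, 2, 1), 3))    # 0.008
print(round(pass_at_k(256, 2, 128), 3))  # 0.751
```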
This inversion has a straightforward interpretation: RLVR concentrates probability mass on high-reward outputs, improving the odds that the first sample is correct, but simultaneously reduces the diversity of the model's output distribution. The base model, with its broader distribution, eventually finds a correct solution given enough attempts; the RL-trained model has better single-shot precision but a smaller effective solution space.
A key empirical claim in the Limits-of-RLVR work (arXiv:2510.27044 and the limit-of-rlvr.github.io project) is that all correct solutions produced by RL-trained models are already present in the base model's distribution. The researchers verified this by checking whether correct solutions from RL models could be sampled from the base model, and found that they could. The conclusion drawn is that RLVR does not teach new reasoning strategies but rather biases the model toward strategies it already knew.
A complementary paper (arXiv:2510.04028) offered a two-stage view: RLVR initially shrinks the capability boundary by narrowing output diversity (a contraction phase), but continued training can subsequently expand beyond the base model's capabilities (an expansion phase) as the model discovers new strategies through exploration. This suggests the dichotomy between "sampling efficiency" and "capability expansion" may depend on training duration and the difficulty of the task distribution.
The debate has practical implications for practitioners. If RLVR only reshapes existing capabilities, then the base model quality is the binding constraint and investing in better pretraining may yield more durable gains than extended RLVR training. The Limits-of-RLVR authors also note that distillation (training a smaller model on outputs from a larger RL-trained model) can introduce genuinely new capabilities that the student's base model did not have, because the distillation teacher provides reasoning trajectories that were never in the student's pretraining distribution.
A separate line of evidence raised concerns about reward signal validity. Research found that applying random noise as the reward signal, rather than a genuine correctness verifier, still improved Qwen2.5-Math-7B by 21.4 points on math benchmarks, nearly matching the gains from ground-truth rewards. This suggests that some of the benefit attributed to RLVR may come from the training dynamics of RL itself (entropy regularization, distribution sharpening, increased average response length) rather than from the specific reward signal.
Noisy verifiers. The assumption of perfect binary verification breaks down in practice. Regex-based math verifiers fail on unusual answer formats; code execution verifiers may time out on correct solutions; SQL verifiers depend on database state. Research on RLVR under imperfect verifiers (arXiv:2510.00915) proposed two corrections to the policy gradient estimator: a backward correction that produces an unbiased surrogate reward in expectation, and a forward correction that reweights gradient terms using only an estimate of the false-negative rate. Both corrections demonstrated improved performance under synthetic and real verifier noise, with the forward correction being more stable under heavy noise. A companion paper (arXiv:2604.07666) showed that noise rates up to 15% yield peak validation accuracy within 2 percentage points of the clean baseline across three model families (Qwen3, GLM4, Llama 3.1) and model sizes from 4B to 9B.
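As a flavor of what such corrections look like, the sketch below debiases rewards under one deliberately simple noise model: the verifier misses truly correct answers with a known false-negative rate and never produces false positives. This is an illustration of the principle, not either paper's actual estimator.

```python
def debias_reward(observed: float, fn_rate: float) -> float:
    """Unbiased surrogate reward under one-sided verifier noise.

    ASSUMED noise model (illustrative only): a correct answer is marked wrong
    with probability fn_rate; wrong answers are never marked correct. Then
    E[observed | correct] = 1 - fn_rate, so dividing by (1 - fn_rate) makes
    the surrogate unbiased in expectation.
    """
    return observed / (1.0 - fn_rate)

print(debias_reward(1.0, 0.15))  # ~1.176: observed passes are scaled up
print(debias_reward(0.0, 0.15))  # 0.0: observed failures are left alone
```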
Process reward models combined with RLVR. Standard RLVR provides a reward only at the end of a generation. Process reward models (PRMs) provide step-level rewards during the reasoning chain, giving the optimizer denser gradient signal for long chain-of-thought problems. PRIME (Process Reinforcement through IMplicit rEwards) and TANGO are frameworks that co-train the process reward model alongside the policy, using RLVR-style outcome verification to calibrate the process rewards. Verifiable Process Reward Models (VPRMs) extend this idea by applying rule-based deterministic verification to intermediate reasoning steps where possible, combining the auditability of RLVR with the density of process supervision.
DAPO. DAPO (arXiv:2503.14476) is a variant of GRPO that adds four modifications to improve training stability with long reasoning traces: Clip-Higher entropy control, Dynamic Sampling to skip uniformly-rewarded batches, Token-Level Policy Gradient Loss, and Overlong Reward Shaping. It is built on the veRL framework and achieved 50 points on AIME 2024 with Qwen2.5-32B.
RLVR for non-math domains. Med-RLVR applied RLVR to medical reasoning using a 3B base model, achieving 8-point accuracy gains on out-of-distribution medical tasks compared to SFT. Open-Medical-R1 investigated dataset design for RLVR in the medical domain. K2V (Knowledge-to-Verification) proposed a method for constructing verifiable rewards in knowledge-intensive domains where ground-truth answers exist but standard string matching fails. Research at NeurIPS 2025 also explored rubric-based reward functions that extend RLVR-style training to domains without purely binary correct-wrong verifiers.
RLPR. RLPR (arXiv:2506.18254) extends RLVR to general domains without any explicit verifier by using perplexity under a reference model as a proxy reward signal, trading the clean binary signal for broader domain coverage.
The RLVR ecosystem has produced several mature open source frameworks.
veRL (HybridFlow, from ByteDance) is a production-ready RL training library that supports GRPO, PPO, DAPO, and other RL algorithms at scale. It supports multi-node training and has been used to train models as large as DeepSeek-671B and Qwen3-235B. The DAPO recipe and reproduction code are included in the veRL codebase.
OpenRLHF is one of the earliest open RL libraries for LLMs, built on Ray for distributed orchestration. It supports both RLHF and RLVR workflows and achieves training efficiency improvements of 1.22x to 1.68x over comparable frameworks across model sizes.
TinyZero is a minimal reproduction of the DeepSeek-R1-Zero recipe, designed for researchers who want to understand the core RLVR training loop without the complexity of production infrastructure. It is widely used in academic settings.
open-r1 (from Hugging Face) is a community-driven effort to reproduce the full DeepSeek-R1 training pipeline, including data preparation, RLVR training, and evaluation.
AReaL (Ant Reasoning RL for LLMs) is a framework from Ant Group focused on efficient scaling of reasoning RL training.
TRL (Transformers Reinforcement Learning, from Hugging Face) added GRPO support in early 2025 and is accessible to researchers who want to run smaller-scale RLVR experiments within the familiar Transformers ecosystem.
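At that smaller scale, a verifiable reward plugs into TRL as a plain Python function. The following is a template rather than a pinned recipe: it assumes a recent TRL version with GRPOTrainer, where extra dataset columns (here "answer") are forwarded to the reward function as keyword arguments, and the model name is a placeholder.

```python
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

def exact_match_reward(completions, answer, **kwargs):
    """Verifiable reward: 1.0 if the ground-truth answer appears in the
    completion, else 0.0. A real math verifier would normalize first."""
    return [1.0 if gt.strip() in out else 0.0
            for out, gt in zip(completions, answer)]

train_dataset = Dataset.from_list([
    {"prompt": "What is 17 + 25? Answer with just the number.", "answer": "42"},
    # ... more verifiable prompts ...
])

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # placeholder small model
    reward_funcs=exact_match_reward,
    args=GRPOConfig(output_dir="rlvr-demo", num_generations=8),
    train_dataset=train_dataset,
)
trainer.train()
```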
The opendilab/awesome-RLVR repository on GitHub maintains a curated list of papers, implementations, and datasets, updated continuously.
RLVR has a narrow domain scope. It works well only where a reliable verifier exists. Tasks involving nuanced judgment, creative writing, long-form analysis, or any output that cannot be checked by a deterministic program are outside the paradigm's natural reach. Extending RLVR to these domains requires either constructing approximate verifiers (rubric-based, model-based) or accepting noisy rewards, both of which reduce the core advantage of deterministic feedback.
Reward hacking is a persistent concern. Verifiers are programs and programs have edge cases. Models trained with code execution verifiers have been observed generating solutions that exploit test case implementation weaknesses rather than solving the underlying problem. Math verifiers can be gamed by producing answers in formats the normalizer handles incorrectly. The risk grows as training continues and the model's outputs move further from the distribution the verifier was designed for.
The sparse reward problem makes training unstable on hard problems. If a model can never generate a correct solution (and thus never receives a reward of 1), the RL optimizer receives zero gradient from that problem and cannot make progress. This is the fundamental constraint identified by the Limits-of-RLVR work: RLVR cannot teach a model to solve problems it could not solve at all before training.
RLVR amplifies biases in the training data. If the set of problems used for RL training over-represents certain problem types or solution strategies, the model's distribution narrows toward those strategies and away from others. This is beneficial for the targeted benchmark but can harm generalization.
The binary reward signal provides no credit assignment for partial progress. A proof attempt that is 95% correct and fails on the last step receives the same reward as a response that is entirely wrong. This makes it difficult to learn from near-misses on hard problems.
Finally, compute requirements are substantial. RLVR requires generating multiple responses per prompt, running the verifier on each, and performing gradient updates that involve both the policy model and (in some configurations) a reference model. Training runs for state-of-the-art reasoning models using RLVR have required thousands of GPU-hours.