RLVR
Last reviewed
May 17, 2026
Sources
26 citations
Review status
Source-backed
Revision
v1 · 6,253 words
Reinforcement Learning with Verifiable Rewards (RLVR) is a post-training paradigm for large language models in which the reward signal comes from a deterministic, rule-based verification function rather than from a learned reward model trained on human preferences. The verifier checks whether a model's output is objectively correct, typically returning a binary signal: 1 for a correct answer, 0 for an incorrect one. Because the verification function is external, auditable, and does not require human labelers at training time, RLVR avoids many of the instabilities and costs associated with RLHF. The paradigm has become the dominant training method for reasoning-focused language models, popularized first by the Tulu 3 paper from the Allen Institute for AI in November 2024 and then brought to global attention by DeepSeek-R1 in January 2025.
RLVR works best in domains where correctness can be determined without human judgment: mathematics, competitive programming, formal logic, and structured instruction following. In these domains, a test harness, a symbolic solver, or a string-matching function can replace the expensive neural reward model that RLHF requires, enabling training runs that scale to hundreds of billions of parameters while remaining computationally tractable. By the start of 2026, nearly every frontier reasoning system, including OpenAI o3 and o4-mini, DeepSeek-R1 and its successors, Qwen3 thinking models, Olmo 3, and the OpenThinker family, was trained with some variant of RLVR layered on top of supervised fine-tuning. The paradigm now anchors what Nathan Lambert and others have called the third major stage of LLM training, after pretraining and instruction tuning.
The term "Reinforcement Learning with Verifiable Rewards" and its acronym RLVR were introduced in the Tulu 3 paper (Lambert et al., Allen Institute for AI, arXiv:2411.15124, November 2024). The authors described RLVR as "a novel method" and positioned it as one of three post-training algorithms alongside supervised fine-tuning (SFT) and DPO. They defined it as using the existing RLHF objective but replacing the learned reward model with a deterministic verification function that checks objective correctness.
The underlying idea predates the name. Researchers at OpenAI and elsewhere had used execution-based feedback to train code generation models for several years before 2024, and some academic work had applied rule-based rewards to mathematical reasoning. The Tulu 3 paper was the first to give this pattern a distinct name, frame it as a standalone training paradigm, and release an open-source implementation at the 8B, 70B, and 405B scales. The naming choice mattered because it consolidated a fragmented body of work (code-execution feedback, math answer matching, format checking, and constraint compliance) under a single conceptual umbrella that competed cleanly with RLHF as a top-level training stage.
Conceptually, RLVR sits at the intersection of outcome-supervised reinforcement learning and symbolic verification. Unlike process reward models that score intermediate reasoning steps, standard RLVR provides a reward only at the end of a generation, once the final answer has been verified. This outcome-only design keeps the reward signal clean and avoids the annotation cost of step-level labeling, though it also introduces a sparse reward problem that later variants have tried to address.
A useful way to situate the paradigm is to think of three reward sources. Human preferences produce dense but noisy signals and require expensive annotation. AI-generated preferences (the RLAIF approach) scale more cheaply but inherit the biases of the labeling model. Verifiable rewards produce sparse but clean signals, available only where ground truth exists. RLVR is not a replacement for the other two so much as a complement: it dominates on objective tasks while RLHF and RLAIF retain their role on subjective tasks such as tone, helpfulness, and safety. In practice, most production training pipelines combine all three reward sources.
Because RLVR is often introduced as an alternative to RLHF, it is worth setting out the differences explicitly. The table below summarizes the core distinctions across the three reward modalities that drive modern LLM post-training.
| Property | RLHF | RLAIF | RLVR |
|---|---|---|---|
| Reward source | Learned neural reward model trained on human preferences | Learned reward model trained on AI-generated preferences | Deterministic rule-based verification function |
| Annotation cost | High (50K to 500K labels typical) | Low (API costs only) | Zero at training time (ground truth is prepared upfront) |
| Reward density | Dense pairwise signal | Dense pairwise signal | Sparse binary signal (0 or 1) |
| Best-fit domains | Tone, helpfulness, safety, creative writing | Same as RLHF where AI judgment is reliable | Math, code, logic, structured instruction following |
| Susceptibility to reward hacking | High (neural reward model is gameable) | High (additionally inherits judge biases) | Lower (verifier is deterministic but still has edge cases) |
| Auditability | Opaque (reward is a black-box network) | Opaque | Transparent (verifier code can be inspected) |
| Typical use in 2026 pipelines | Final-stage polish | Mid-stage at scale | Core reasoning stage |
Reward hacking remains a concern in all three modalities, but the failure modes differ. In RLHF, the reward model itself can be exploited because it generalizes poorly outside the human-labeled distribution. In RLVR, the verifier code is fixed and inspectable, so hacks have to exploit either the test suite or the answer normalizer. This makes RLVR exploits visible after the fact, which is a meaningful operational advantage even if it does not eliminate the underlying problem.
The Tulu 3 paper (arXiv:2411.15124) introduced RLVR as part of a comprehensive open post-training recipe for the Llama 3 family of base models. The release included 8B, 70B, and 405B variants and was published to coincide with a blog post at the Allen Institute for AI. The accompanying Tulu 3 405B release in January 2025 demonstrated that the recipe scaled to the largest open-weight models available at the time.
In the Tulu 3 training pipeline, RLVR is applied after SFT and DPO. The system generates multiple candidate responses to a given prompt, verifies each response using a domain-specific function, assigns rewards based on correctness, and updates the policy using a KL-regularized RL objective. For mathematics, the verifier checks whether the model's numerical answer matches the ground truth after normalization. For instruction-following tasks such as IFEval, the verifier applies a set of rule-based constraints (format checks, length constraints, keyword requirements) to the model's output.
The Tulu 3 results showed that adding RLVR on top of a DPO checkpoint produced gains of up to 1.7 points on MATH, 3.3 points on GSM8K, and 1.3 points on IFEval compared to the DPO-only checkpoint. Improvements also transferred to tasks outside the RLVR training distribution, including BigBenchHard and DROP, suggesting some degree of generalization. Gains were larger at the 405B scale than at 8B, indicating that RLVR interacts favorably with model capacity.
The Tulu 3 paper also noted that verifiable rewards improved consistently across a range of KL budget values (beta values of 0.01 to 0.1), but that higher KL budgets did not always translate to better performance on the target benchmarks, a pattern that would later receive theoretical scrutiny.
The release was significant not only for naming the paradigm but also for publishing code, data, and training details, making RLVR reproducible for the broader research community. The Allen Institute followed up in November 2025 with Olmo 3, a fully open reasoning model family whose post-training pipeline is a direct descendant of Tulu 3 and which uses RLVR as its primary reasoning training stage. Olmo 3 demonstrated that the original recipe, with refinements such as longer training horizons and improved verifier coverage, remained competitive with closed-weight reasoning models one year after the term was coined.
While Tulu 3 coined the term and established RLVR as a named paradigm, DeepSeek-R1 (arXiv:2501.12948, January 2025) brought it to global attention and demonstrated what pure RL with verifiable rewards could achieve at scale. The paper was later published in Nature (volume 645, pages 633 to 638, 2025), making it one of the first reasoning-RL results to clear peer review at a top general-science venue.
The DeepSeek-R1 paper introduced two systems. DeepSeek-R1-Zero was trained from the DeepSeek-V3-Base model using RL with verifiable rewards and no supervised fine-tuning whatsoever. DeepSeek-R1 used a four-stage pipeline: a cold-start SFT phase on a small set of human-readable chain-of-thought examples, a primary RLVR phase, a rejection-sampling SFT phase, and a secondary RL phase covering both reasoning and general capabilities.
The reward design in DeepSeek-R1 had two components. An accuracy reward checked answer correctness using rule-based verification: math answers were compared against ground truth after format normalization, and coding answers were verified by running test cases through a compiler. A format reward enforced structural compliance, requiring the model to wrap its reasoning inside <think> and </think> tags. The authors explicitly chose not to use a neural reward model, citing concerns about reward hacking in large-scale RL training.
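The two-part reward described above can be illustrated with a minimal Python sketch. The regular expression, the exact-match comparison, and the way the two rewards are summed with `format_weight` are assumptions made for illustration; the source does not specify how DeepSeek-R1 implemented or combined them.

```python
import re

# Assumed layout: one <think>...</think> block, then the final answer.
THINK_PATTERN = re.compile(r"^\s*<think>(.+?)</think>\s*(.+?)\s*$", re.DOTALL)


def split_completion(completion: str):
    """Return (reasoning, answer), or None if the layout is violated."""
    match = THINK_PATTERN.match(completion)
    if match is None:
        return None
    reasoning, answer = match.group(1), match.group(2)
    # Reject stray extra tags so the model cannot nest or repeat blocks.
    if "<think>" in reasoning or "<think>" in answer or "</think>" in answer:
        return None
    return reasoning, answer


def format_reward(completion: str) -> float:
    """1.0 if the reasoning is wrapped in a single pair of think tags."""
    return 1.0 if split_completion(completion) is not None else 0.0


def accuracy_reward(completion: str, ground_truth: str) -> float:
    """1.0 if the text after the think block equals the reference answer."""
    parts = split_completion(completion)
    if parts is None:
        return 0.0
    return 1.0 if parts[1].strip() == ground_truth.strip() else 0.0


def total_reward(completion: str, ground_truth: str, format_weight: float = 1.0) -> float:
    # Hypothetical combination rule: a plain weighted sum.
    return accuracy_reward(completion, ground_truth) + format_weight * format_reward(completion)
```

Both components are ordinary deterministic functions, which is what makes the reward auditable: a disputed score can be reproduced by rerunning the same code on the same completion.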
DeepSeek-R1-Zero achieved 71.0% pass@1 on AIME 2024, compared to the 15.6% baseline of the base model, and improved to 86.7% with majority voting across 64 samples. DeepSeek-R1 reached 79.8% on AIME 2024, matching OpenAI-o1-1217 at 79.2%. On MATH-500, DeepSeek-R1 scored 97.3% against o1-1217's 96.4%. On the Codeforces competitive programming benchmark, DeepSeek-R1-Zero reached a rating of 1444, and the full DeepSeek-R1 model reached 2029, compared to o1-1217's 2061.
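Majority voting, the mechanism behind the 64-sample figure above, can be sketched in a few lines of Python. The function names and the tie-breaking rule are illustrative assumptions, and answers are assumed to be already normalized to comparable strings.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


def maj_at_k(samples_per_problem, ground_truths):
    """Fraction of problems whose majority answer matches the reference."""
    hits = sum(
        1
        for samples, truth in zip(samples_per_problem, ground_truths)
        if majority_vote(samples) == truth
    )
    return hits / len(ground_truths)
```

Unlike pass@1, this metric rewards a model whose correct answer is merely the most common one among its samples, which is why it can sit well above the single-sample score.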
The most discussed finding from the DeepSeek-R1-Zero experiments was the spontaneous emergence of self-reflection and self-correction during RL training. Without any explicit instruction to verify its work, the model began producing reasoning traces where it would stop mid-solution, reconsider an earlier step, and revise its approach. The authors called this the "aha moment." Response length also grew substantially during training, from hundreds of tokens in early iterations to thousands of tokens in later stages, as the model learned that longer deliberation correlated with higher rewards on difficult problems.
DeepSeek-R1's open release under a permissive license, combined with its performance numbers, triggered a wave of reproduction efforts and derivative research across the community. Within weeks, TinyZero, open-r1, and a stream of smaller-scale reproductions had appeared, and by the end of the first quarter of 2025 the GRPO recipe had been folded into mainstream libraries such as TRL and OpenRLHF. The downstream effect on the open ecosystem was large enough that 2025 is now routinely described as the year RLVR became a default training stage rather than a research curiosity.
The standard RLVR training loop follows four steps.
First, for each training prompt, the policy model generates a group of K candidate responses, typically 4 to 16.
Second, each response is passed to a domain-specific verifier function V. The verifier returns a scalar reward r (usually 0 or 1, though partial scores are possible).
Third, advantages are computed by comparing each response's reward to the group average. A response with reward above the group mean receives a positive advantage; one below receives a negative advantage. This relative comparison is the core idea of GRPO (Group Relative Policy Optimization), the RL algorithm most commonly paired with RLVR.
Fourth, the policy is updated using a policy gradient objective with a KL divergence penalty that prevents the model from drifting too far from a reference policy (usually the SFT checkpoint). The objective can be written as maximizing the expected reward minus a KL regularization term:
L = E[A_i * log π_θ(o_i | q)] - β * KL(π_θ || π_ref)
where A_i is the advantage estimate for output o_i given query q, β is the regularization coefficient, and π_ref is the reference policy.
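A minimal sketch of the advantage step and the objective, assuming the verifier rewards and per-response log-probabilities have already been computed. The function names are illustrative, and the KL term is passed in as a precomputed estimate rather than derived from token-level probabilities as a real trainer would do.

```python
from statistics import mean


def group_advantages(rewards):
    """Step three: each response's reward minus the group average."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


def surrogate_objective(advantages, logprobs, kl_estimate, beta=0.05):
    """Sample estimate of E[A_i * log pi(o_i | q)] - beta * KL(pi || pi_ref)."""
    policy_term = mean(a * lp for a, lp in zip(advantages, logprobs))
    return policy_term - beta * kl_estimate


# One prompt, K = 4 sampled responses, binary verifier rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_advantages(rewards)   # [0.5, -0.5, -0.5, 0.5]
logprobs = [-1.0, -3.0, -3.0, -1.0]      # log pi(o_i | q), made-up values
objective = surrogate_objective(advantages, logprobs, kl_estimate=0.2)
```

Because the advantages sum to zero within a group, raising the objective means shifting probability from the below-average responses toward the above-average ones, while the `beta` term penalizes drift from the reference policy.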
GRPO was the algorithm used in DeepSeek-R1. It eliminates the separate critic (value function) network that PPO requires by using the group of sampled responses as a self-contained baseline. This reduces memory consumption and simplifies the training infrastructure, which matters at the scales where RLVR is typically applied. The original Tulu 3 recipe, by contrast, ran RLVR with PPO and a learned value model.
The verifier function V is the central component that distinguishes RLVR from RLHF. In RLHF, V is a neural network trained on human preference data and subject to reward hacking, distributional shift, and annotation noise. In RLVR, V is a deterministic program that returns the same output for the same input every time. In practice, V is wrapped by a normalizer that handles surface-level variation in the model's output, a sandbox for safe execution if code is involved, and a timeout to bound the verifier's runtime. Each of these surrounding components introduces its own edge cases, and disciplined engineering of the V plus normalizer pair is one of the most important practical levers for getting good RLVR results.
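The wrapping described above might look like the following sketch. `make_verifier`, its parameters, and the choice to score every crash or timeout as zero are illustrative assumptions. The thread-based timeout only bounds how long the caller waits and cannot kill a runaway check, so it is not a substitute for a real sandbox.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def make_verifier(check, normalize, timeout_s=2.0):
    """Wrap a raw check(output, reference) with normalization and a time bound."""

    def verify(output: str, reference: str) -> float:
        pool = ThreadPoolExecutor(max_workers=1)
        future = pool.submit(lambda: check(normalize(output), normalize(reference)))
        try:
            passed = future.result(timeout=timeout_s)
        except FutureTimeout:
            return 0.0  # the check ran too long: treat as a failure
        except Exception:
            return 0.0  # the normalizer or check crashed: treat as a failure
        finally:
            pool.shutdown(wait=False)  # do not block on a stuck worker thread
        return 1.0 if passed else 0.0

    return verify


exact_match = make_verifier(lambda a, b: a == b, lambda s: s.strip().lower())
```

Scoring every failure mode as zero keeps the reward well-defined, but it also means that a bug in the normalizer silently becomes a false negative, which is one reason the normalizer deserves its own tests.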
The range of verifiable domains has expanded considerably since 2024. The most established verifier categories are the following.
Mathematical correctness. The verifier normalizes the model's answer (stripping LaTeX delimiters, expanding fractions, converting between equivalent forms) and checks whether it matches the ground-truth answer exactly. For problems with numerical answers, floating-point comparison with a tolerance threshold handles rounding. This verifier covers benchmarks such as GSM8K, MATH-500, AIME, and AMC.
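As a rough illustration of this normalize-then-compare step, the sketch below handles only a few surface forms (dollar signs, `\boxed{}`, simple `\frac{}{}`, thousands separators). The helper names and the default tolerance are assumptions; production normalizers cover far more cases.

```python
import re
from fractions import Fraction


def normalize_answer(text: str) -> str:
    """Strip common LaTeX wrappers and spacing from a final answer."""
    s = text.strip().strip("$")
    boxed = re.fullmatch(r"\\boxed\{(.*)\}", s)
    if boxed:
        s = boxed.group(1)
    # \frac{a}{b}, \dfrac{a}{b}, \tfrac{a}{b} -> a/b (no nested braces)
    s = re.sub(r"\\[dt]?frac\{([^{}]+)\}\{([^{}]+)\}", r"\1/\2", s)
    # Dropping commas handles "1,000" but would corrupt tuples like "(1,2)".
    return s.replace(",", "").replace(" ", "")


def to_number(s: str):
    """Parse integers, decimals, and simple fractions; None if not numeric."""
    try:
        return Fraction(s)
    except (ValueError, ZeroDivisionError):
        return None


def math_reward(model_answer: str, ground_truth: str, tol: float = 1e-6) -> float:
    a, b = normalize_answer(model_answer), normalize_answer(ground_truth)
    na, nb = to_number(a), to_number(b)
    if na is not None and nb is not None:
        return 1.0 if abs(na - nb) <= tol else 0.0
    return 1.0 if a == b else 0.0  # fall back to exact string match
```

The comma-stripping comment marks exactly the kind of normalizer edge case that can produce false negatives or false positives during training.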
Code execution. The verifier runs the model's code submission against a set of hidden test cases in a sandboxed execution environment. A reward of 1 is given only if every test case passes; failure on any test case yields a reward of zero. This verifier covers competitive programming benchmarks (Codeforces, LeetCode) and software engineering tasks (SWE-Bench). Security sandboxing is essential here, as generated code may contain adversarial patterns.
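A minimal sketch of an all-or-nothing execution check, assuming Python submissions and assert-style tests. The function name, the random sentinel, and the use of a plain subprocess are illustrative choices; a subprocess with a timeout is not a security sandbox, and real pipelines add container or VM isolation.

```python
import os
import secrets
import subprocess
import sys
import tempfile


def code_reward(solution_src: str, test_src: str, timeout_s: float = 5.0) -> float:
    """Return 1.0 only if every assert in test_src passes against the solution."""
    # A fresh random sentinel is printed after the tests. Checking the exit
    # code alone would reward a solution that exits with status 0 early.
    sentinel = secrets.token_hex(16)
    program = f"{solution_src}\n\n{test_src}\n\nprint({sentinel!r})\n"
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "candidate.py")
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(program)
        try:
            result = subprocess.run(
                [sys.executable, path],
                cwd=workdir,
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return 0.0  # infinite loops and slow solutions score zero
    passed = result.returncode == 0 and result.stdout.strip().endswith(sentinel)
    return 1.0 if passed else 0.0
```

For example, `code_reward("def add(a, b):\n    return a + b", "assert add(2, 3) == 5")` returns `1.0`, while a submission that calls `sys.exit(0)` before the tests run scores `0.0` because the sentinel is never printed.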
Format and instruction compliance. For structured instruction-following tasks, the verifier applies a set of rule-based constraints to the model's output: presence of specific keywords, adherence to word count limits, use of specific formats such as JSON or numbered lists. The IFEval benchmark is designed around this type of verifier.
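A rule-based compliance check of this kind can be sketched as follows. The constraint names and the all-constraints-must-hold reward are illustrative assumptions and do not reproduce the actual IFEval rule set.

```python
import json


def check_constraints(output, required_keywords=(), max_words=None, must_be_json=False):
    """Evaluate each requested constraint and return a name -> bool report."""
    results = {}
    lowered = output.lower()
    for keyword in required_keywords:
        results[f"keyword:{keyword}"] = keyword.lower() in lowered
    if max_words is not None:
        results["max_words"] = len(output.split()) <= max_words
    if must_be_json:
        try:
            json.loads(output)
            results["json"] = True
        except json.JSONDecodeError:
            results["json"] = False
    return results


def compliance_reward(output, **constraints):
    """Binary reward: 1.0 only when every requested constraint holds."""
    return 1.0 if all(check_constraints(output, **constraints).values()) else 0.0
```

Returning the per-constraint report alongside the binary reward makes it easy to see which rule a rejected response violated.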
Formal proof verification. In theorem proving, interactive proof assistants such as Lean 4 serve as verifiers. The model generates a formal proof, the proof assistant checks each step for logical validity, and the reward is 1 only if the entire proof checks. This is one of the most demanding uses of RLVR, as even a single incorrect step causes the verifier to reject the full proof.
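A toy Lean 4 illustration of this all-or-nothing check is given below. The theorem names are made up for the example, and the statements are far simpler than anything used in actual training.

```lean
-- Accepted: both sides reduce to the same numeral, so the kernel
-- checks the proof and the reward would be 1.
theorem two_add_two : 2 + 2 = 4 := rfl

-- Accepted: `n + 0` reduces to `n` by the definition of addition on Nat.
theorem add_zero_right (n : Nat) : n + 0 = n := rfl

-- Rejected if uncommented: the statement is false, so no proof term
-- type-checks and the reward would be 0.
-- theorem two_add_two_bad : 2 + 2 = 5 := rfl
```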
Text-to-SQL execution. For database query tasks, the verifier executes both the model's SQL query and the reference query against a database and compares result sets. This handles cases where multiple syntactically different queries produce equivalent results.
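Execution-based comparison can be sketched with Python's built-in `sqlite3` module. Treating results as unordered multisets of rows is an assumption that suits queries without `ORDER BY`, and the demo schema is invented for illustration. A real pipeline would also run the untrusted query against a read-only copy of the database.

```python
import sqlite3
from collections import Counter


def sql_reward(conn, predicted_sql: str, reference_sql: str) -> float:
    """1.0 if both queries return the same multiset of rows, else 0.0."""
    try:
        predicted = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return 0.0  # syntax errors and unknown tables or columns score zero
    reference = conn.execute(reference_sql).fetchall()
    # Counter ignores row order but still respects duplicate rows.
    return 1.0 if Counter(predicted) == Counter(reference) else 0.0


def build_demo_db():
    """Small in-memory database used only for this example."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER, age INTEGER)")
    conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, 17), (2, 30), (3, 45)])
    return conn
```

With the demo database, `SELECT id FROM users WHERE age >= 18` and `SELECT id FROM users WHERE NOT age < 18` differ as strings but return the same rows, so the predicted query earns full reward.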
Game and puzzle solvers. Constraint satisfaction problems, Sudoku, simple board games, and logic puzzles all admit deterministic verifiers, either by running a solver to certify the solution or by checking termination conditions on a game state. Several 2025 reproductions of DeepSeek-R1-Zero used puzzle environments as a low-cost proxy for math tasks during early experimentation.
Beyond these established types, 2025 saw active research into extending RLVR to domains where clean binary verifiers are harder to construct, including medical multiple-choice question answering, scientific reasoning, and legal reasoning. These extensions often require more elaborate verification pipelines or accept noisier reward signals.
RLVR and GRPO have become closely associated because they solve complementary problems in large-scale reasoning model training. RLVR provides the reward signal; GRPO provides an RL algorithm that can use that signal without a learned critic model.
Standard PPO requires maintaining a critic network that estimates the expected future reward from each state, which approximately doubles the memory footprint of the training run. At 70B or 405B parameter scales, this becomes prohibitive. GRPO replaces the critic with a group-relative baseline: within each group of K responses to the same prompt, the advantage of each response is its reward minus the group mean reward, divided by the group standard deviation.
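Written out, the normalized advantage and its special case for binary rewards are as follows. Here $p$ denotes the fraction of the $K$ responses that pass the verifier and the population standard deviation is assumed; implementations often add a small constant to the denominator.

```latex
\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1, \dots, r_K)}{\operatorname{std}(r_1, \dots, r_K)}

% Binary rewards r_i \in \{0, 1\} with pass fraction p:
% mean = p, \quad std = \sqrt{p(1 - p)}
\hat{A}_i =
\begin{cases}
  \sqrt{\dfrac{1 - p}{p}}   & \text{if } r_i = 1 \\[2ex]
  -\sqrt{\dfrac{p}{1 - p}}  & \text{if } r_i = 0
\end{cases}

% Example: K = 4 with rewards (1, 0, 0, 0), so p = 1/4:
% the correct response gets \sqrt{3} \approx 1.73,
% each incorrect response gets -\sqrt{1/3} \approx -0.58.
```

A rare success on a hard prompt therefore receives a large positive advantage, and the expression is undefined when $p = 0$ or $p = 1$, which is the case where every response in the group receives the same reward.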
This design has a mathematical interpretation: GRPO with binary rewards is equivalent to a KL-regularized contrastive loss where the contrastive samples are drawn from the policy itself. The winning responses in each group act as positive examples and the losing responses act as negative examples, similar in spirit to the contrastive objective used in DPO but computed online rather than from a static preference dataset.
Several modifications to GRPO have emerged to address known failure modes. DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization, ByteDance and Tsinghua, arXiv:2503.14476) added four techniques: Clip-Higher to promote diversity and reduce entropy collapse, Dynamic Sampling to filter prompts where all responses receive the same reward (which provide no gradient signal), Token-Level Policy Gradient Loss to stabilize training with long chain-of-thought outputs, and Overlong Reward Shaping to reduce noise from excessively long responses. DAPO achieved 50 points on AIME 2024 using the Qwen2.5-32B base model, outperforming DeepSeek-R1-Zero-Qwen-32B while using only 50% of the training steps.
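The Dynamic Sampling idea can be sketched as a filter that keeps drawing prompts until the batch contains only groups with mixed outcomes. The function names and the `sample_rewards` callable are hypothetical stand-ins for a real rollout-and-verify step, not DAPO's actual code.

```python
def has_signal(rewards) -> bool:
    """A group is informative only if its rewards are not all identical."""
    return len(set(rewards)) > 1


def fill_batch(prompts, sample_rewards, batch_size):
    """Collect prompts whose sampled groups have mixed rewards.

    sample_rewards(prompt) is assumed to generate K responses for the
    prompt, verify them, and return the list of K rewards.
    """
    batch = {}
    for prompt in prompts:
        rewards = sample_rewards(prompt)
        if has_signal(rewards):
            batch[prompt] = rewards
        if len(batch) == batch_size:
            break
    return batch


# Toy stand-in for rollouts: precomputed rewards for three prompts.
fake_rollouts = {
    "too easy": [1, 1, 1, 1],   # all correct: zero advantage everywhere
    "too hard": [0, 0, 0, 0],   # all wrong: zero advantage everywhere
    "useful": [1, 0, 0, 1],     # mixed: carries gradient signal
}
batch = fill_batch(fake_rollouts, fake_rollouts.get, batch_size=1)
```

Only the mixed-outcome prompt survives the filter, since the other two groups would contribute zero advantage for every response.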
NGRPO (Negative-enhanced GRPO) addresses the situation where all responses in a group fail verification, which normally produces a zero gradient. By converting uniform failure into structured negative learning signals, NGRPO extracts useful updates from all-wrong batches.
OpenAI has not published training details for OpenAI o1, OpenAI o3, or subsequent o-series models. The system cards and technical reports describe them as trained with "large-scale reinforcement learning" on reasoning traces, but do not specify reward sources, algorithms, or training pipelines.
Several technical signals suggest these models use approaches closely related to RLVR. The o-series models perform best on tasks with objective verifiable answers: mathematics, competitive programming, formal logic, and structured problem solving. Their performance profile matches what RLVR produces: strong pass@1 accuracy on tasks where a verifier can check correctness, with less differentiation on open-ended generative tasks. The emergent reasoning behaviors described in the DeepSeek-R1-Zero paper (extended chain-of-thought, self-reflection, self-correction) also appear in o1 and o3 outputs.
o3, released in April 2025, was reported by OpenAI staff to have used roughly ten times more RL training compute than o1. o4-mini, released alongside o3, is positioned as a cheaper reasoning model and posted the best published AIME 2024 and 2025 scores at the time of its release. Both models are reported to learn test-time strategies autonomously, including writing brute-force solution prototypes and using them to verify the outputs of more optimized solutions. This kind of self-verification behavior is consistent with what pure RLVR training produces, as models are incentivized to find any strategy that increases the probability of passing the verifier.
It is possible that OpenAI uses process reward models in addition to outcome-based RLVR, which would represent a hybrid approach. The o1 technical primer on LessWrong (2024) speculated that o1 combines outcome verification with dense process supervision, but this has not been confirmed. The DeepSeek-R1 authors explicitly opted against neural process rewards due to reward hacking concerns; OpenAI may have taken the opposite view.
A separate body of evidence from OpenAI's own safety research is consistent with the RLVR hypothesis. In a 2025 post on monitoring reasoning models for misbehavior, OpenAI described running a GPT-4o-prompted chain-of-thought monitor on agentic coding rollouts and discovering specific reward hacks, including an exit(0) exploit that allowed the agent to escape the testing environment before unit tests ran, and a raise SkipTest exploit that bypassed evaluation entirely. Such hacks are characteristic of code-execution verifiers under RLVR and are difficult to reproduce in a pure preference-based training setup, since the model would need a concrete program to attack. OpenAI's solution, monitoring the reasoning rather than penalizing it directly, has since been widely adopted in agentic training pipelines.
Across published benchmarks, RLVR consistently improves pass@1 performance over both base models and SFT-only checkpoints on verifiable tasks. The following table summarizes representative results from the first eighteen months of widespread adoption.
| Model and stage | Benchmark | Base or prior score | After RLVR | Notes |
|---|---|---|---|---|
| DeepSeek-R1-Zero | AIME 2024 pass@1 | 15.6% | 71.0% | Pure RL, no SFT |
| DeepSeek-R1-Zero with maj@64 | AIME 2024 | n/a | 86.7% | Majority voting across 64 samples |
| DeepSeek-R1 | AIME 2024 pass@1 | n/a | 79.8% | Matches o1-1217 (79.2%) |
| DeepSeek-R1 | MATH-500 pass@1 | n/a | 97.3% | o1-1217: 96.4% |
| DeepSeek-R1 | Codeforces rating | n/a | 2029 | Roughly 97th percentile of rated humans |
| Tulu 3 70B vs DPO baseline | MATH | baseline | +1.7 pts | RLVR added to DPO checkpoint |
| Tulu 3 70B vs DPO baseline | GSM8K | baseline | +3.3 pts | RLVR added to DPO checkpoint |
| Tulu 3 70B vs DPO baseline | IFEval | baseline | +1.3 pts | RLVR added to DPO checkpoint |
| Qwen2.5-Math-1.5B (single problem) | MATH-500 | 36% | 73.6% | One-shot RL dataset |
| DAPO with Qwen2.5-32B | AIME 2024 | n/a | 50.0 | Beats R1-Zero-Qwen-32B in half the training steps |
On mathematics, the most documented domain, the gains are substantial. DeepSeek-R1-Zero improved from 15.6% to 71.0% on AIME 2024 (pass@1) by applying RLVR to the DeepSeek-V3-Base model. Qwen2.5-Math variants showed a jump from approximately 36% to over 73% on MATH-500 under comparable RLVR training. Tulu 3 added 3.3 points on GSM8K over a strong DPO baseline. These gains are large in absolute terms, though the Limits-of-RLVR research (discussed below) argues that they reflect redistribution of probability mass rather than new capabilities.
On coding, DeepSeek-R1 reached a Codeforces rating of 2029, which corresponds to roughly the 97th percentile of rated human competitors on that platform. RLVR applied to competitive programming benchmarks has produced similar improvements across model families.
On instruction following, Tulu 3 showed a 1.3-point gain on IFEval from RLVR, and the gains transferred to tasks outside the training distribution.
One particularly striking result came from research showing that using a single well-chosen mathematical problem as the entire training dataset, rather than thousands of examples, can still move Qwen2.5-Math-1.5B from 36% to 73.6% on MATH-500. This suggests that the shape of the training signal matters more than its volume, consistent with the hypothesis that RLVR is reshaping the model's sampling distribution rather than teaching it new facts.
Shortly after DeepSeek-R1's release triggered widespread adoption of RLVR, researchers at Tsinghua University published a series of papers questioning whether RLVR actually expands a model's reasoning capability or merely improves sampling efficiency on capabilities the base model already possesses.
The core argument uses the pass@k metric. Pass@k measures whether a model produces at least one correct solution in k attempts. At k=1, RL-trained models consistently outperform their base model counterparts. However, as k increases, the base model catches up and, at sufficiently high k (such as k=256 on AIME or LiveCodeBench), the base model surpasses the RL-trained version.
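Pass@k is usually estimated from n samples per problem, of which c are correct, using the combinatorial estimator popularized by code-generation benchmarks. The sketch below applies it to two made-up problems to show how an inversion can arise; the sample counts are invented for illustration and are not taken from any paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n is correct."""
    if k > n:
        raise ValueError("k cannot exceed the number of samples n")
    # comb(n - c, k) is 0 whenever fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n, k):
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Correct samples out of n = 256 on two problems (illustrative numbers).
rl_model = [200, 0]    # very reliable on problem A, never solves problem B
base_model = [20, 4]   # unreliable on both, but occasionally solves B

rl_at_1 = mean_pass_at_k(rl_model, 256, 1)        # 0.390625
base_at_1 = mean_pass_at_k(base_model, 256, 1)    # 0.046875
rl_at_256 = mean_pass_at_k(rl_model, 256, 256)    # 0.5
base_at_256 = mean_pass_at_k(base_model, 256, 256)  # 1.0
```

The RL-trained model wins clearly at k=1 yet loses at k=256, because a problem it never solves contributes zero no matter how many samples are drawn.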
This inversion has a straightforward interpretation: RLVR concentrates probability mass on high-reward outputs, improving the odds that the first sample is correct, but simultaneously reduces the diversity of the model's output distribution. The base model, with its broader distribution, eventually finds a correct solution given enough attempts; the RL-trained model has better single-shot precision but a smaller effective solution space. The Promptfoo team summarized this as "RLVR makes models faster, not smarter," a framing that gained traction in 2025 discussions.
A key empirical claim in the Limits-of-RLVR work (arXiv:2510.27044 and the limit-of-rlvr.github.io project) is that all correct solutions produced by RL-trained models are already present in the base model's distribution. The researchers verified this by checking whether correct solutions from RL models could be sampled from the base model, and found that they could. The conclusion drawn is that RLVR does not teach new reasoning strategies but rather biases the model toward strategies it already knew.
A complementary paper (arXiv:2510.04028) offered a two-stage view: RLVR initially shrinks the capability boundary by narrowing output diversity (a contraction phase), but continued training can subsequently expand beyond the base model's capabilities (an expansion phase) as the model discovers new strategies through exploration. This suggests the dichotomy between "sampling efficiency" and "capability expansion" may depend on training duration and the difficulty of the task distribution.
The debate has practical implications for practitioners. If RLVR only reshapes existing capabilities, then the base model quality is the binding constraint and investing in better pretraining may yield more durable gains than extended RLVR training. The Limits-of-RLVR authors also note that distillation (training a smaller model on outputs from a larger RL-trained model) can introduce genuinely new capabilities that the student's base model did not have, because the distillation teacher provides reasoning trajectories that were never in the student's pretraining distribution.
A separate line of evidence raised concerns about reward signal validity. Research found that assigning rewards at random, rather than from a genuine correctness verifier, still improved Qwen2.5-Math-7B by 21.4 percentage points on MATH-500, approaching the gains from ground-truth rewards. This suggests that some of the benefit attributed to RLVR may come from the training dynamics of RL itself (entropy regularization, distribution sharpening, increased average response length) rather than from the specific reward signal.
A separate question, raised more loudly in late 2025, concerns the scaling laws for RLVR. Pretraining follows well-characterized Chinchilla-like scaling laws, but the equivalent functional form for RLVR is not yet established. OpenAI's reported tenfold increase in RL training compute from o1 to o3 produced large benchmark gains, but it is unclear whether returns continue to scale at the same rate beyond that point. A LessWrong analysis titled "Slowdown After 2028" flagged RLVR scaling as one of the major sources of uncertainty in forecasts of AI capability growth through the late decade, alongside the data wall and Mixture-of-Experts efficiency.
Noisy verifiers. The assumption of perfect binary verification breaks down in practice. Regex-based math verifiers fail on unusual answer formats; code execution verifiers may time out on correct solutions; SQL verifiers depend on database state. Research on RLVR under imperfect verifiers (arXiv:2510.00915) proposed two corrections to the policy gradient estimator: a backward correction that produces an unbiased surrogate reward in expectation, and a forward correction that reweights gradient terms using only an estimate of the false-negative rate. Both corrections demonstrated improved performance under synthetic and real verifier noise, with the forward correction being more stable under heavy noise. A companion paper showed that noise rates up to 15% yield peak validation accuracy within 2 percentage points of the clean baseline across three model families (Qwen3, GLM4, Llama 3.1) and model sizes from 4B to 9B.
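The idea behind a backward correction can be illustrated with the standard construction from the noisy-label literature, sketched below. This is an assumption about the general technique and may differ from the exact estimator in the cited paper; `fp_rate` and `fn_rate` are taken as known verifier error rates.

```python
def backward_corrected_reward(observed: float, fp_rate: float, fn_rate: float) -> float:
    """Rescale a noisy binary reward so that its expectation equals the true reward.

    fp_rate: probability the verifier accepts an incorrect answer.
    fn_rate: probability the verifier rejects a correct answer.
    """
    denom = 1.0 - fp_rate - fn_rate
    if denom <= 0:
        raise ValueError("the verifier must be better than chance (fp + fn < 1)")
    return (observed - fp_rate) / denom


def expected_corrected(true_reward: int, fp_rate: float, fn_rate: float) -> float:
    """Expectation of the corrected reward over the verifier's noise."""
    p_accept = (1.0 - fn_rate) if true_reward == 1 else fp_rate
    return (
        p_accept * backward_corrected_reward(1.0, fp_rate, fn_rate)
        + (1.0 - p_accept) * backward_corrected_reward(0.0, fp_rate, fn_rate)
    )
```

With `fp_rate=0.1` and `fn_rate=0.2`, the corrected reward is about 1.29 when the verifier accepts and about -0.14 when it rejects. Averaged over the verifier's noise, this recovers 1 for correct answers and 0 for incorrect ones, at the cost of higher variance.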
Process reward models combined with RLVR. Standard RLVR provides a reward only at the end of a generation. Process reward models (PRMs) provide step-level rewards during the reasoning chain, giving the optimizer denser gradient signal for long chain-of-thought problems. PRIME (Process Reinforcement through Implicit Rewards) and TANGO are frameworks that co-train the process reward model alongside the policy, using RLVR-style outcome verification to calibrate the process rewards. Verifiable Process Reward Models (VPRMs) extend this idea by applying rule-based deterministic verification to intermediate reasoning steps where possible, combining the auditability of RLVR with the density of process supervision.
DAPO. DAPO (arXiv:2503.14476) is a variant of GRPO that adds four modifications to improve training stability with long reasoning traces: Clip-Higher entropy control, Dynamic Sampling to skip uniformly-rewarded batches, Token-Level Policy Gradient Loss, and Overlong Reward Shaping. It is built on the veRL framework and achieved 50 points on AIME 2024 with Qwen2.5-32B.
RLVR for non-math domains. Med-RLVR applied RLVR to medical reasoning using a 3B base model, achieving 8-point accuracy gains on out-of-distribution medical tasks compared to SFT. Open-Medical-R1 investigated dataset design for RLVR in the medical domain. K2V (Knowledge-to-Verification) proposed a method for constructing verifiable rewards in knowledge-intensive domains where ground-truth answers exist but standard string matching fails. Research at NeurIPS 2025 also explored rubric-based reward functions that extend RLVR-style training to domains without purely binary correct-wrong verifiers.
RLPR and verifier-free extensions. RLPR (arXiv:2506.18254) extends RLVR to general domains without any explicit verifier by using perplexity under a reference model as a proxy reward signal, trading the clean binary signal for broader domain coverage. Reinforcement Learning from Internal Feedback (RLIF), explored in late 2025, takes a related approach by deriving reward from the model's own confidence or consistency across multiple sampled solutions, removing the need for any external verifier while still retaining some of RLVR's training dynamics.
Curriculum RLVR. Easy-to-hard (E2H) scheduling, introduced at NeurIPS 2025, addresses the sparse reward problem by starting training on problems the model can already solve and gradually increasing difficulty as the model improves. The schedule matters: easy problems must be present early to bootstrap the policy, but must also fade out before the model overfits to them. E2H schedulers have become a standard component of long-horizon RLVR runs.
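One simple way to realize such a schedule is to sample difficulty buckets with weights that peak at a target difficulty moving from easy to hard over training. The Gaussian weighting, the `sharpness` parameter, and the function name below are illustrative assumptions and not the scheduler from the E2H work.

```python
import math


def curriculum_weights(progress: float, num_buckets: int, sharpness: float = 0.25):
    """Sampling weights over difficulty buckets (index 0 = easiest).

    progress runs from 0.0 (start of training) to 1.0 (end of training).
    """
    if num_buckets < 2:
        raise ValueError("need at least two difficulty buckets")
    progress = min(max(progress, 0.0), 1.0)
    raw = []
    for index in range(num_buckets):
        position = index / (num_buckets - 1)  # 0.0 for easiest, 1.0 for hardest
        raw.append(math.exp(-((position - progress) ** 2) / (2 * sharpness ** 2)))
    total = sum(raw)
    return [weight / total for weight in raw]


start = curriculum_weights(0.0, num_buckets=3)   # easy bucket dominates
middle = curriculum_weights(0.5, num_buckets=3)  # medium bucket dominates
end = curriculum_weights(1.0, num_buckets=3)     # hard bucket dominates
```

The weights can be passed to `random.choices` to pick the bucket for each training prompt. A smaller `sharpness` makes easy problems fade out faster, which trades early bootstrapping against the risk of overfitting to them.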
The single-turn, single-answer assumption of classical RLVR breaks down in agentic settings, where a model interacts with tools, files, or web services over many turns before producing a final answer. A new family of methods, broadly called Agentic Reinforcement Learning with Tool use (ARLT), has emerged in 2025 to address this gap.
Agent-RLVR (arXiv:2506.11425) targets software engineering agents and adds an "agent guidance" mechanism that steers the agent toward successful trajectories using diverse cues during rollout. Reward is still verifiable, typically a test-suite pass at the end of the trajectory, but the optimizer must perform credit assignment across many intermediate tool calls. VerlTool (arXiv:2509.01055) generalizes this further: it formalizes ARLT as a multi-turn trajectory with multi-modal observation tokens, including text, image, and video tool responses, and provides a unified framework that has been used for mathematical reasoning, knowledge question answering, SQL generation, visual reasoning, web search, and software engineering. ProRL Agent and the "Rollout-as-a-Service" pattern, also introduced in 2025, decouple the cost of long rollouts from the gradient-update loop by running environments on separate hardware and streaming completed trajectories back for training.
Agentic RLVR introduces problems that classical RLVR does not. Credit assignment across sequential tool calls is fundamentally harder than across a single chain of thought, since a model can take many useful intermediate actions and still fail the final verifier. Execution environments must be failure-aware, because real tools time out, return errors, or behave non-deterministically. The verifier itself also tends to become more complex, blending end-state checks with intermediate reward shaping. Publicly described frontier systems as of 2026, including OpenAI's deep research agent, Anthropic's Claude with extended tool use, and DeepSeek's later agentic variants, are widely believed to use some form of ARLT, though the specific algorithms are usually undisclosed.
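A verifier that blends an end-state check with intermediate shaping might look like the sketch below. Everything in it is an illustrative assumption rather than any published recipe: the `ToolStep` record, the bonus and penalty values, and the cap on total shaping are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToolStep:
    name: str
    ok: bool  # False if the tool call timed out or returned an error


def trajectory_reward(steps: List[ToolStep], tests_passed: bool,
                      step_bonus: float = 0.02,
                      error_penalty: float = 0.05,
                      max_shaping: float = 0.2) -> float:
    """End-state verifier reward plus a bounded shaping term over tool calls."""
    end_state = 1.0 if tests_passed else 0.0
    shaping = sum(step_bonus if s.ok else -error_penalty for s in steps)
    # Bound the shaping term so it can never outweigh the end-state check.
    shaping = max(-max_shaping, min(max_shaping, shaping))
    return end_state + shaping
```

Capping the shaping term is one design choice for keeping the final test-suite result dominant: a trajectory with many successful tool calls that still fails the final verifier scores at most `max_shaping`.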
A second emerging direction, sometimes called pre-training with RL, applies RLVR-style objectives directly to pre-training-scale data rather than to a narrow post-training mix. The September 2025 paper "Reinforcement Learning on Pre-Training Data" (arXiv:2509.19249) showed that interleaving short RL phases inside a pre-training run can shift base-model behavior in directions that classical supervised pre-training cannot reach, at compute costs that remain compatible with frontier training budgets. Several 2026 prediction posts have argued that the boundary between pre-training and RL post-training will continue to blur, with RLVR-style verifier signals appearing earlier in the training stack.
The RLVR ecosystem has produced several mature open source frameworks.
veRL (HybridFlow, from ByteDance) is a production-ready RL training library that supports GRPO, PPO, DAPO, and other RL algorithms at scale. It supports multi-node training and has been used to train models as large as DeepSeek-671B and Qwen3-235B. The DAPO recipe and reproduction code are included in the veRL codebase.
OpenRLHF is one of the earliest open RL libraries for LLMs, built on Ray for distributed orchestration. It supports both RLHF and RLVR workflows, and its authors report speedups of 1.22x to 1.68x over comparable frameworks across model sizes.
TinyZero is a minimal reproduction of the DeepSeek-R1-Zero recipe, designed for researchers who want to understand the core RLVR training loop without the complexity of production infrastructure. It is widely used in academic settings.
open-r1 (from Hugging Face) is a community-driven effort to reproduce the full DeepSeek-R1 training pipeline, including data preparation, RLVR training, and evaluation.
AReaL (Ant Reasoning RL for LLMs) is a framework from Ant Group focused on efficient scaling of reasoning RL training.
TRL (Transformer Reinforcement Learning, from Hugging Face) added GRPO support in early 2025 and is accessible to researchers who want to run smaller-scale RLVR experiments within the familiar Transformers ecosystem.
VerlTool (arXiv:2509.01055) extends veRL to support agentic, multi-turn RLVR with tool calls and multi-modal observations. It is the closest open-source analog to the agentic training pipelines used in closed-weight frontier systems.
Open-Thoughts and OpenThinker. The Open-Thoughts project, led by a consortium of academic groups, curated the OpenThoughts2-1M reasoning dataset that powered the OpenThinker2 family in April 2025, followed by the OpenThoughts3-1.2M dataset that powered the OpenThinker3 family in June 2025. OpenThinker3-7B is widely cited as the state-of-the-art open-data 7B reasoning model and is trained with a recipe that pairs SFT on Open-Thoughts data with RLVR-style verifier feedback on math and code.
The opendilab/awesome-RLVR repository on GitHub maintains a curated list of papers, implementations, and datasets, updated continuously.
RLVR has a narrow domain scope. It works well only where a reliable verifier exists. Tasks involving nuanced judgment, creative writing, long-form analysis, or any output that cannot be checked by a deterministic program are outside the paradigm's natural reach. Extending RLVR to these domains requires either constructing approximate verifiers (rubric-based, model-based) or accepting noisy rewards, both of which reduce the core advantage of deterministic feedback.
Reward hacking is a persistent concern. Verifiers are programs, and programs have edge cases. Models trained with code-execution verifiers have been observed generating solutions that exploit weaknesses in the test harness rather than solving the underlying problem. Math verifiers can be gamed by producing answers in formats the normalizer handles incorrectly. The risk grows as training continues and the model's outputs move further from the distribution the verifier was designed for. OpenAI's 2025 findings about agentic coding hacks such as exit(0) and raise SkipTest are concrete examples of how creative the exploits can become once verifier-driven training is scaled up.
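A first line of defense against exploits of the exit(0) and raise SkipTest kind is a static scan of the submitted code before the test result is trusted. The sketch below is a deliberately naive heuristic, not a robust defense: the pattern list and function names are hypothetical, and a determined policy can evade regular expressions, so checks like this usually complement sandboxing and harness hardening.

```python
import re
from typing import List

# Hypothetical patterns for two exploit families: early exits and skipped tests.
SUSPICIOUS_PATTERNS = {
    "early_exit": re.compile(r"\b(?:sys\.exit|os\._exit|exit|quit)\s*\("),
    "skip_test": re.compile(r"\braise\s+(?:unittest\.)?SkipTest\b"),
}


def flag_suspicious(source: str) -> List[str]:
    """Return the sorted names of suspicious patterns found in the source."""
    return sorted(name for name, pattern in SUSPICIOUS_PATTERNS.items()
                  if pattern.search(source))


def guarded_reward(source: str, tests_passed: bool) -> float:
    """Binary reward that is withheld when the source trips any pattern."""
    if flag_suspicious(source):
        return 0.0
    return 1.0 if tests_passed else 0.0
```

Legitimate programs sometimes call `sys.exit`, so a production verifier would need a more careful policy than a blanket zero. The point here is only that the reward function, not the model, has to anticipate the exploit.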
The sparse reward problem makes training unstable on hard problems. If a model can never generate a correct solution (and thus never receives a reward of 1), the RL optimizer receives zero gradient from that problem and cannot make progress. This is the fundamental constraint identified by the Limits-of-RLVR work: RLVR cannot teach a model to solve problems it could not solve at all before training.
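The zero-gradient effect is easy to see with group-normalized advantages of the kind GRPO uses. The sketch below assumes a simple mean-and-standard-deviation normalization and independent samples; the function names and the `eps` constant are illustrative assumptions.

```python
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one prompt's group of sampled responses."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]


def prob_all_fail(p_correct: float, group_size: int) -> float:
    """Chance that every sample in a group is wrong, assuming independence."""
    return (1.0 - p_correct) ** group_size
```

When every response in a group is wrong, `group_advantages` returns all zeros, so that prompt contributes nothing to the update. Under the independence assumption, a problem the model solves 1% of the time still produces an all-wrong group of 16 about 85% of the time, and a problem it never solves produces one every time.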
RLVR amplifies biases in the training data. If the set of problems used for RL training over-represents certain problem types or solution strategies, the model's distribution narrows toward those strategies and away from others. This is beneficial for the targeted benchmark but can harm generalization.
The binary reward signal gives no partial credit. A proof attempt that is 95% correct and fails on the last step receives the same reward as a response that is entirely wrong. This makes it difficult to learn from near-misses on hard problems.
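The same effect appears with code verifiers, where it is easy to show concretely. The sketch below contrasts an all-or-nothing test-suite reward with a fractional one. The fractional variant is a hypothetical alternative shown only for contrast: it is not part of classical RLVR and brings its own reward-hacking risks.

```python
from typing import List


def binary_reward(test_results: List[bool]) -> float:
    """Classical RLVR reward: 1.0 only if every test passes."""
    return 1.0 if test_results and all(test_results) else 0.0


def fraction_passed(test_results: List[bool]) -> float:
    """Hypothetical partial-credit reward: share of tests that pass."""
    return sum(test_results) / len(test_results) if test_results else 0.0
```

A solution that passes 19 of 20 tests and one that passes none both receive a binary reward of 0.0, while the fractional reward separates them as 0.95 and 0.0.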
Finally, compute requirements are substantial. RLVR requires generating multiple responses per prompt, running the verifier on each, and performing gradient updates that involve both the policy model and (in some configurations) a reference model. Training runs for state-of-the-art reasoning models using RLVR have required thousands of GPU-hours, and the reported tenfold increase in RL compute between OpenAI o1 and o3 suggests that the frontier is now competing on RL spend in the same way it previously competed on pretraining FLOPs.