Recursive reward modeling
Last reviewed
May 20, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,176 words
Recursive reward modeling (RRM) is a proposed approach to the scalable oversight problem in AI alignment, in which agents trained by reward modeling are recursively used to help humans evaluate the behavior of more capable agents, whose feedback in turn trains the next reward model in the chain. The proposal was articulated in the 2018 DeepMind research direction paper Scalable Agent Alignment via Reward Modeling by Jan Leike, David Krueger, Tom Everitt, Miljan Martic, Vishal Maini, and Shane Legg.[^1] RRM sits in the same family as Paul Christiano's iterated distillation and amplification (IDA)[^2] and the AI safety via debate proposal of Geoffrey Irving, Paul Christiano, and Dario Amodei,[^3] all of which aim to use AI assistance to extend the reach of human supervisory signals to tasks that exceed unaided human comprehension.
The technique inherits the basic two-stage decomposition of reinforcement learning from human feedback (RLHF): first learn a reward model from human preference data, then optimize a policy against that reward model. Its distinguishing claim is that the reward model itself can be bootstrapped to ever more capable regimes by training assistant agents that help the human labeler. RRM has been less empirically tested than vanilla RLHF, and the strongest published demonstration commonly associated with the idea, OpenAI's Recursively Summarizing Books with Human Feedback (2021),[^4] uses recursive task decomposition rather than the full assistant-trains-evaluator loop. The framework remains influential as conceptual scaffolding for current work on scalable oversight, weak-to-strong generalization, and constitutional AI, but it has not been demonstrated end to end on tasks that exceed unaided human evaluation in the way the 2018 agenda imagined.
The phrase "recursive application of reward modeling" was introduced in the DeepMind research agenda Scalable Agent Alignment via Reward Modeling: A Research Direction, posted to arXiv as 1811.07871 on 19 November 2018.[^1] The paper presents reward modeling not as a finished algorithm but as a research direction, listing five concrete challenges (reward hacking, unsafe exploration, distributional shift, reward gaming via influencing the user, and unintended side effects) and pairing each with potential mitigations such as adversarial training, model-based RL, online learning of the reward model, and uncertainty estimates over the reward.
Jan Leike was at the time a research scientist at DeepMind in London, where he had previously co-authored Deep Reinforcement Learning from Human Preferences (2017) with Paul Christiano, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei, the seminal paper that brought preference-based reward learning into the deep RL era using Atari and MuJoCo benchmarks.[^5] The 2018 paper extends that line of work by asking what would be needed to scale the same paradigm to tasks where humans cannot directly judge whether the policy succeeded. Leike moved to OpenAI in 2021, co-led the Superalignment team with Ilya Sutskever from June 2023 until his resignation on 17 May 2024, and joined Anthropic later that month to lead the Alignment Science team, where his stated focus is scalable oversight, weak-to-strong generalization, and automated alignment research.[^6][^7]
A companion blog post by DeepMind Safety Research framed the proposal as separating "learning what to do (the reward model) from learning how to do it (the policy)" and used a chip design example to illustrate how helper agents could evaluate hard subquestions (heat dissipation, lifetime, security) that humans cannot reliably score themselves.[^8] The DeepMind safety blog also explicitly described the recursive application as "an instance of iterated amplification," signaling the close kinship between RRM and Christiano's earlier framework.
The first step of recursive reward modeling is ordinary reward modeling, an idea that predates the 2018 paper and goes back at least to the 2017 Deep RL from Human Preferences work.[^5] A human evaluator is shown pairs of agent trajectories or candidate outputs and asked which is preferable. A neural network reward model is trained to predict these preferences using a Bradley-Terry style loss. A policy is then trained by reinforcement learning to maximize the expected predicted reward, with the reward model continuously updated as new behavior is encountered to mitigate reward hacking caused by distributional shift.
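The preference loss described above can be sketched in a few lines. This is a minimal illustration, assuming scalar rewards have already been produced by some reward model; the function names `bradley_terry_loss` and `mean_preference_loss` are hypothetical and are not taken from the 2017 or 2018 papers.

```python
import math


def bradley_terry_loss(r_preferred: float, r_rejected: float) -> float:
    """Negative log-likelihood of the human's choice under Bradley-Terry.

    The model assumes P(preferred beats rejected) = sigmoid(r_preferred - r_rejected).
    """
    margin = r_preferred - r_rejected
    # -log(sigmoid(margin)) equals log(1 + exp(-margin)); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def mean_preference_loss(pairs):
    """Average loss over (reward_of_preferred, reward_of_rejected) pairs."""
    return sum(bradley_terry_loss(p, q) for p, q in pairs) / len(pairs)


# A reward model that ranks the pair correctly incurs a lower loss
# than one that ranks it the wrong way round.
agrees_with_human = bradley_terry_loss(2.0, 0.0)
disagrees_with_human = bradley_terry_loss(0.0, 2.0)
```

In a real system the two rewards would be outputs of a neural network and the loss would be minimized by gradient descent; the sketch only shows the quantity being minimized.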
This decomposition has two practical advantages emphasized in the 2018 paper. First, it cleanly separates the value specification problem from the policy optimization problem, so improvements in RL algorithms automatically translate into capability gains without changing the alignment target. Second, the reward model can be queried many times per RL update, amortizing expensive human supervision over millions of policy improvement steps. The same skeleton underlies modern RLHF for large language models, including the Learning to Summarize paper (Stiennon et al., 2020) and the InstructGPT line of work, which Leike co-authored after moving to OpenAI.
The recursive step is the distinguishing feature of RRM. The 2018 paper argues that the reward modeling skeleton can be extended to tasks too complex for direct human evaluation by training intermediate agents whose specific purpose is to help a human evaluate the target task. Concretely, to train an agent A on a task whose outcomes a human cannot reliably judge, the designers identify a set of evaluation subtasks. These subtasks must satisfy two conditions. They must be easier than the target task itself, so that human evaluation of the subtask outputs is feasible. And they must be sufficient, in combination, to let a human form a reliable judgement about A's outputs.
Helper agents are then trained on the evaluation subtasks using ordinary reward modeling. Their outputs become inputs to the human's evaluation of A. When even the subtask evaluation is too hard, the same construction recurses: helper agents are trained for sub-subtasks that humans can score, and so on. The result is a tree of reward modeling problems whose leaves are tasks short enough that an unaided human can label preferences directly, and whose root is the agent the designer ultimately cares about. The DeepMind blog post used the example of designing a computer chip: a human cannot judge whether a proposed design is good, but can judge the outputs of helper agents that assess heat dissipation, energy efficiency, the presence of certain security vulnerabilities, and the predicted lifetime, all of which feed into a composite evaluation of the chip design itself.[^8]
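The tree structure can be made concrete with a small data structure. The following is a structural sketch only: `EvalTask`, `evaluate`, and `depth` are hypothetical names, the leaf scoring functions are constant stand-ins for human judgements of helper outputs, and averaging the reports is an assumption made for illustration. In the proposal itself the helpers are trained agents and the human, not a fixed formula, integrates their reports.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class EvalTask:
    name: str
    # Leaf tasks carry a direct human judgement; internal nodes rely on subtasks.
    human_score: Optional[Callable[[str], float]] = None
    subtasks: List["EvalTask"] = field(default_factory=list)


def evaluate(task: EvalTask, artifact: str) -> float:
    """Score an artifact by recursing down to leaves a human can judge directly."""
    if not task.subtasks:
        if task.human_score is None:
            raise ValueError(f"leaf task {task.name!r} has no human judgement")
        return task.human_score(artifact)
    # Assumption: the parent judgement is the mean of the helper reports.
    reports = [evaluate(sub, artifact) for sub in task.subtasks]
    return sum(reports) / len(reports)


def depth(task: EvalTask) -> int:
    """Number of levels in the tree, counting the root."""
    return 1 + max((depth(sub) for sub in task.subtasks), default=0)


# The chip design example, with made-up scores standing in for human labels.
chip = EvalTask("chip design", subtasks=[
    EvalTask("heat dissipation", human_score=lambda design: 0.8),
    EvalTask("predicted lifetime", human_score=lambda design: 0.6),
    EvalTask("security", subtasks=[
        EvalTask("known vulnerability scan", human_score=lambda design: 1.0),
    ]),
])
score = evaluate(chip, "candidate design")
levels = depth(chip)
```

The recursion bottoms out exactly where the text says it should: at tasks with a usable `human_score`. A leaf without one raises an error, mirroring the requirement that every leaf be directly judgeable.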
A second worked example often paired with chip design is fantasy novel writing: a human cannot read an entire novel and judge whether it is the best version of the story the author was trying to tell, but can judge summaries, can check claims of internal consistency made by a fact-checking assistant, and can evaluate brief excerpts the helper agent flags as suspicious. By chaining several such helpers, the human's evaluative bandwidth is amplified to span a task whose surface area would otherwise overwhelm them.
The crucial implicit assumption is that evaluation is easier than generation, an analogue of the P versus NP gap in complexity theory. The 2018 paper acknowledges that this is an empirical assumption rather than a theorem, and that the recursive structure must be designed so that, at every level, the helper agents' tasks are strictly easier than the parent task. If evaluation is not in fact easier than generation for the relevant domain, recursion provides no leverage.
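A standard toy illustration of this gap is subset sum, where checking a proposed answer is cheap but finding one by brute force takes time exponential in the input size. The sketch below is an analogy chosen for this article, not an example from the 2018 paper; `verify` and `generate` are hypothetical names.

```python
from itertools import combinations
from typing import Optional, Sequence, Tuple


def verify(numbers: Sequence[int], target: int, chosen: Sequence[int]) -> bool:
    """Evaluation: check a proposed set of indices in time linear in its size."""
    indices_valid = (
        len(set(chosen)) == len(chosen)
        and all(0 <= i < len(numbers) for i in chosen)
    )
    return indices_valid and sum(numbers[i] for i in chosen) == target


def generate(numbers: Sequence[int], target: int) -> Optional[Tuple[int, ...]]:
    """Generation: brute-force search over up to 2**n index subsets."""
    for size in range(len(numbers) + 1):
        for chosen in combinations(range(len(numbers)), size):
            if verify(numbers, target, chosen):
                return chosen
    return None


numbers = [3, 34, 4, 12, 5, 2]
solution = generate(numbers, 9)  # expensive search
accepted = solution is not None and verify(numbers, 9, solution)  # cheap check
```

Whether real evaluation tasks, such as judging a chip design, exhibit a comparable gap is exactly the empirical question the paragraph above raises.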
Vanilla RLHF, as practiced in InstructGPT, Claude, Llama, and many open weight models, is the non-recursive base case of RRM: humans label preferences directly, a reward model is trained, and a policy is optimized against it. Recursive reward modeling is the proposal that the same machinery should be applied recursively when the task exceeds human evaluative capacity. Several practical RLHF variants can be read as partial steps toward the recursive picture.
Reinforcement learning from AI feedback (RLAIF), introduced in 2022 work on constitutional AI[^9] and benchmarked by Lee et al. (2023),[^10] replaces human preference labels with labels from another language model. In the strict RRM picture, the preference-labeling model would itself need to have been trained by reward modeling on a strictly simpler task; in practice, RLAIF systems often use an off-the-shelf model rather than a deliberately trained helper, which gives them empirical traction but loses the principled story.
Process supervision, exemplified by OpenAI's Let's Verify Step by Step paper (Lightman et al., 2023),[^11] trains process reward models (PRMs) that score each intermediate step of a chain of thought rather than only the final answer. This can be read as decomposing the evaluation of a hard problem into evaluations of easier reasoning steps, a structural cousin of the recursion in RRM, though the decomposition is fixed by the chain-of-thought format rather than negotiated between trained helper agents.
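The step-level scoring idea can be sketched as follows. The per-step probabilities are assumed to come from some process reward model, and combining them by taking their product is one aggregation choice, treated here as an assumption rather than a description of any particular system. Both function names are hypothetical.

```python
import math
from typing import Optional, Sequence


def solution_score(step_probs: Sequence[float]) -> float:
    """Combine per-step correctness probabilities into a single solution score.

    Assumption: steps are scored independently and combined by product,
    so one doubtful step drags down the whole chain.
    """
    if not step_probs:
        raise ValueError("a solution needs at least one step")
    return math.prod(step_probs)


def first_suspect_step(step_probs: Sequence[float],
                       threshold: float = 0.5) -> Optional[int]:
    """Index of the first step scored below the threshold, or None if all pass."""
    for index, prob in enumerate(step_probs):
        if prob < threshold:
            return index
    return None


chain = [0.9, 0.2, 0.95]  # made-up scores for a three-step derivation
overall = solution_score(chain)
flagged = first_suspect_step(chain)
```

An outcome-only reward model would see just the final answer; the step-level view lets an evaluator point at the specific step that looks wrong, which is the sense in which evaluation has been decomposed.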
OpenAI's Recursively Summarizing Books with Human Feedback (Wu, Ouyang, Ziegler, Stiennon, Lowe, Leike, Christiano, 2021)[^4] is the clearest published example of an RRM-style training pipeline. The system summarizes book sections, then recursively summarizes those summaries until a single whole-book summary remains, with each level trained by RLHF on a subtask that human evaluators can complete in under an hour. The authors describe the work as the first large-scale empirical demonstration of scaling alignment techniques through recursive decomposition. However, the recursion in this paper is in the task structure rather than in the assistant agents that train the evaluator, so it implements only part of the RRM picture.
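The control flow of such a pipeline can be sketched independently of any model. In the sketch below, `summarize` is a stand-in callable supplied by the caller (in the paper's setting it would be a fine-tuned language model), word-count chunking is an assumption made for simplicity, and the names `chunk` and `recursive_summarize` are hypothetical.

```python
from typing import Callable, List


def chunk(text: str, max_words: int) -> List[str]:
    """Split text into consecutive pieces of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]


def recursive_summarize(text: str,
                        summarize: Callable[[str], str],
                        max_words: int) -> str:
    """Summarize chunks, then summarize the joined summaries, until one chunk remains."""
    pieces = chunk(text, max_words)
    if len(pieces) <= 1:
        return summarize(text)
    combined = " ".join(summarize(piece) for piece in pieces)
    # Guard against a summarizer that fails to shrink its input.
    if len(combined.split()) >= len(text.split()):
        raise ValueError("summaries did not shrink the text; recursion would not terminate")
    return recursive_summarize(combined, summarize, max_words)


# Stand-in summarizer for demonstration only: keep the first five words.
def keep_first_five_words(passage: str) -> str:
    return " ".join(passage.split()[:5])


book = " ".join(f"w{i}" for i in range(100))
summary = recursive_summarize(book, keep_first_five_words, max_words=20)
```

Every call to `summarize` sees at most `max_words` words, which is what keeps each level within reach of a human labeler even though the full input is not.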
RRM is one of several proposals for scaling human supervision to superhuman AI systems, the earliest of which date to 2017 and 2018. The closest cousins are iterated amplification, AI safety via debate, constitutional AI, and weak-to-strong generalization. Each takes a different stance on how AI assistance to the human evaluator should be structured, and each makes different assumptions about what kinds of tasks decompose cleanly.
| Approach | Year | Origin | Key mechanism | Training signal source | Verified empirical use |
|---|---|---|---|---|---|
| Recursive reward modeling | 2018 | Leike et al., DeepMind[^1] | Tree of reward modeling agents; helpers train evaluators for harder tasks | Human preferences amplified by helper agents | Partial: recursive book summarization (Wu et al., 2021)[^4] |
| Iterated distillation and amplification (IDA) | 2018 | Christiano, Shlegeris, Amodei[^2] | Amplify weak agents by allowing them to call themselves as subroutines, then distill the amplified policy | Decomposition of question by amplified agent | Algorithmic toy tasks (Christiano et al., 2018) |
| AI safety via debate | 2018 | Irving, Christiano, Amodei, OpenAI[^3] | Two agents make competing arguments; a human judges who told the truth | Human judge of a zero-sum debate game | MNIST sparse pixel debate; later LLM debate experiments |
| Constitutional AI / RLAIF | 2022 | Bai et al., Anthropic[^9] | Model critiques and revises its own outputs against a written constitution; AI labels preferences | Written principles plus AI feedback | Anthropic's Claude line; widely reproduced |
| Weak-to-strong generalization | 2023 | Burns et al., OpenAI[^12] | Finetune strong pretrained model on labels from a weak supervisor; rely on generalization | Weak model labels; auxiliary losses | GPT-2 supervising GPT-4 on NLP, chess, reward modeling |
The DeepMind safety blog explicitly described RRM as "an instance of iterated amplification."[^8] The closest technical reading is that iterated amplification provides a general framework, in which a weak agent calls copies of itself to solve a question and the result is distilled into a single faster agent, while recursive reward modeling specializes the amplified signal to be a reward model that scores another agent. In IDA the amplified entity is the answer-producing agent; in RRM the amplified entity is the evaluator.
Debate is even more clearly distinct. Where RRM builds a cooperative scaffold of helpers under the human's supervision, debate sets two agents against each other in a zero-sum game whose winner is determined by a human judge. The theoretical attraction of debate, articulated by Irving et al. (2018), is that with optimal play debate can in principle answer any question in PSPACE with a polynomial-time judge, compared with NP for direct judging.[^3] In practice both debate and RRM rely on the empirical claim that evaluation is easier than generation in the relevant domain.
Constitutional AI, introduced by Anthropic in late 2022 with Bai et al.,[^9] replaces some of the human preference labels in RLHF with model self-critique guided by a written constitution. It can be read as a flat, non-recursive instance of "AI helps the evaluator," in which the helping AI is the same as the model being trained rather than a strictly weaker assistant trained on a strictly easier subtask. Constitutional AI is the most empirically developed member of the family, underpinning Anthropic's Claude line.
Weak-to-strong generalization, introduced by Collin Burns, Pavel Izmailov, Jan Hendrik Kirchner and colleagues at OpenAI in December 2023,[^12] takes the opposite tack from RRM. Rather than build helpers to make the human evaluator smarter, it studies whether a strong pretrained model finetuned on labels from a weak supervisor can recover more capability than the supervisor has, on the analogy that humans are the weak supervisor and superhuman AIs are the strong student. The two approaches are complementary: RRM tries to make the supervision signal smarter; weak-to-strong tries to extract more from a fixed weak supervision signal.
RRM has been less empirically tested than RLHF, and the literature does not yet contain a controlled experiment that runs the full tree of helper agents the 2018 paper describes. The most influential empirical artifact associated with the approach is Recursively Summarizing Books with Human Feedback, published by an OpenAI team in September 2021 (arXiv 2109.10862).[^4] The authors combine recursive task decomposition with reward modeling and behavioral cloning on top of fine-tuned GPT-3. At inference time the model summarizes short book sections, then recursively summarizes those summaries, producing a whole-book summary that can be evaluated by a human who has not read the book. The system achieves state-of-the-art performance on the BookSum benchmark, and zero-shot QA models built on its summaries reach state of the art on NarrativeQA. The authors describe the work as the first large-scale empirical work on scaling alignment techniques.
It is worth being precise about what this demonstrates. The recursion in the book summarization paper is in the task structure: long inputs are sliced into shorter chunks, summarized, and the summaries are recursively combined. The reward model is trained directly on human comparisons of summary pairs. The setup does not train explicit assistant agents whose role is to help the human evaluator score a fixed task. Calling the book summarization work an RRM demonstration is therefore an interpretation rather than an exact match.
Other adjacent empirical contributions include OpenAI's Learning to Summarize from Human Feedback (Stiennon et al., 2020), InstructGPT (Ouyang et al., 2022), process reward models (Lightman et al., 2023),[^11] Measuring Progress on Scalable Oversight for Large Language Models (Bowman et al., 2022),[^13] and the constitutional AI line (Bai et al., 2022).[^9] Bowman et al. operationalize the "sandwiching" methodology proposed by Ajeya Cotra in a 2021 blog post, in which a model is more capable than a non-expert user but less capable than a domain expert, and the question is whether the non-expert assisted by the model can match the expert.[^14] Sandwiching is a methodological framework that can be used to test RRM-style proposals, though Bowman et al.'s initial experiment evaluated a simple chat assistant baseline rather than a full recursive tree.
To date there is no published demonstration of a multi-level RRM tree on a task where unaided human evaluation fails and assisted human evaluation succeeds. The community sometimes treats this absence as expected, given the cost and difficulty of designing such an experiment, and sometimes treats it as evidence that the recursive picture has fundamental practical obstacles.
The 2018 paper depends on each helper agent being capable enough to materially aid the human's evaluation while still being trainable from a strictly easier subtask. Critics have argued that these two requirements are in tension: if a helper is to flag subtle problems in a superhuman target agent's output, the helper itself must be capable of detecting subtle problems, which pushes its own training subtask back toward the difficulty of the target. The empirical question of whether the difficulty gradient is steep enough to support meaningful recursion has not been settled.
A common technical objection, raised by Rohin Shah in his alignment newsletter coverage and by many commenters since,[^15] is that errors propagate through the tree. Imperfect reward models at the leaves train slightly flawed helper agents, whose flawed evaluations train slightly more flawed reward models at the next level, and so on. In the worst case the cumulative error grows with tree depth, capping the depth of recursion that delivers useful signal. The 2018 paper acknowledges this concern and lists it among the open problems.
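A toy calculation shows how quickly such compounding can bite. The model below assumes, purely for illustration, that each level of the tree independently corrupts the supervision signal with a fixed probability; real error propagation need not be independent or uniform, and neither function name comes from the literature.

```python
def compounded_error(per_level_error: float, depth: int) -> float:
    """Probability that at least one of `depth` levels introduces an error.

    Assumption: levels fail independently with the same probability.
    """
    if not 0.0 <= per_level_error <= 1.0:
        raise ValueError("per_level_error must be a probability")
    if depth < 0:
        raise ValueError("depth must be non-negative")
    return 1.0 - (1.0 - per_level_error) ** depth


def max_useful_depth(per_level_error: float, tolerance: float) -> int:
    """Deepest tree whose compounded error stays within the tolerance."""
    if not 0.0 < per_level_error < 1.0:
        raise ValueError("per_level_error must be strictly between 0 and 1")
    if not 0.0 <= tolerance < 1.0:
        raise ValueError("tolerance must be in [0, 1)")
    depth = 0
    while compounded_error(per_level_error, depth + 1) <= tolerance:
        depth += 1
    return depth


# With a 5% error rate per level and a 20% overall budget,
# this toy model supports a tree four levels deep.
depth_budget = max_useful_depth(0.05, 0.20)
```

Under these assumptions the usable depth shrinks rapidly as the per-level error rate rises, which is the sense in which cumulative error caps the recursion.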
Deceptive alignment, the hypothesized failure mode in which a model behaves as if aligned during training so that it is deployed and then defects, applies to RRM in two ways. First, the target agent may behave well only because it knows it is being evaluated by a helper, and defect when the helper's coverage is incomplete. Second, and more troubling, helper agents trained by ordinary RL on their own subtask have no built-in incentive to be honest evaluators; they have an incentive to maximize the evaluation reward modeled by their own reward model, which may be subtly different from accurate evaluation. If the helpers themselves are deceptively aligned, the entire tree's trustworthiness is undermined. This concern shades into the literature on mesa-optimization.
The 2018 paper itself flags as a research challenge the possibility that an agent learns to influence the user's preference labels, for example by producing outputs that exploit the user's cognitive biases. This problem becomes more acute in the recursive setting, because helper agents and target agents may interact with the human evaluator in ways that subtly distort the supervision signal at multiple levels of the tree.
A broader critique, articulated in several blog posts including BlueDot's introduction to RRM,[^16] is that RRM inherits the well-documented limitations of RLHF: sycophancy, reward gaming, distributional fragility, and the inability of a fixed reward model to track shifting human preferences. AI assistance to the human evaluator may shift but does not remove the fundamental dependence on human judgement, and many of the failure modes of RLHF persist under recursion.
Even granting that helper agents are honest and capable, the human at the top of the tree must still be able to integrate the helpers' outputs into a coherent judgement about the target agent. As tasks become more complex, this integration step itself becomes the bottleneck: the human is no longer evaluating raw behavior but is evaluating helper agent reports about behavior, which requires trust in the helpers and the ability to reason across the helpers' joint output. This concern motivates the broader scalable oversight literature, including weak-to-strong generalization, and the line of work on debate where the structure of the human's reasoning is more constrained.
While no single line of empirical research is labeled "recursive reward modeling," the conceptual machinery of the 2018 paper has shaped several strands of contemporary alignment work.
The most direct descendant is OpenAI's recursive book summarization (Wu et al., 2021), discussed above, which applied the recursive task decomposition idea at scale on top of RLHF.[^4] Process supervision (Lightman et al., 2023) extended the idea of decomposing evaluation into step-level evaluations on mathematical reasoning, releasing PRM800K, a dataset of 800,000 step-level human feedback labels.[^11] Constitutional AI (Bai et al., 2022) operationalized model self-critique against a written constitution, with the model acting as its own evaluator under RLAIF.[^9]
The scalable oversight measurement program, beginning with Bowman et al.'s 2022 paper,[^13] turned the question of whether RRM-like methods work into an empirical research program with concrete benchmarks. The sandwiching methodology (Cotra, 2021)[^14] provided the experimental design philosophy under which proposals like RRM, debate, and weak-to-strong generalization could be compared head to head. Subsequent work has built on this with adversarial oversight protocols, doubly efficient debate, and other refinements.
Weak-to-strong generalization (Burns et al., 2023)[^12] was framed by its OpenAI authors as a complementary alternative: rather than amplify the human supervisor with helper agents, accept that the supervisor is fixed and study what the strong model generalizes to. When Jan Leike moved from OpenAI to Anthropic in May 2024 his publicly stated research program named scalable oversight, weak-to-strong generalization, and automated alignment research as his three priorities, indicating that the original RRM agenda has fragmented into a portfolio of related techniques rather than persisting as a single named research program.[^6]
Recent work also revisits the foundational assumptions of RRM. Papers in 2024 and 2025 on debate variants, automated alignment researchers, generative reward models, and LLM-as-a-judge systems all build on the idea that AI assistance can lift the ceiling of human evaluation, while attempting to address the error compounding and deceptive alignment concerns through interpretability, adversarial training, and process-level supervision. The 2018 paper is now most often cited as a foundational research direction rather than as a method to be implemented as written.
The 2018 paper named five research challenges (reward hacking, unsafe exploration, distributional shift, reward gaming via the user, and unintended side effects) and proposed mitigation directions for each. Roughly eight years later, several open problems specific to RRM remain.