Recursive reward modeling

Updated Jul 11, 2026

Recursive reward modeling (RRM) is a proposed approach to the scalable oversight problem in AI alignment, in which agents trained by reward modeling are recursively used to help humans evaluate the behavior of more capable agents, whose feedback in turn trains the next reward model in the chain. The proposal was articulated in the DeepMind research direction paper Scalable agent alignment via reward modeling: a research direction, submitted to arXiv on 19 November 2018 by Jan Leike, David Krueger, Tom Everitt, Miljan Martic, Vishal Maini, and Shane Legg.^[1] The paper defines its central method as "learning a reward function from interaction with the user and optimizing the learned reward function with reinforcement learning," then asks how to scale that recipe to tasks humans cannot directly judge.^[1] RRM sits in the same family as Paul Christiano's iterated distillation and amplification (IDA)^[2] and the AI safety via debate proposal of Geoffrey Irving, Paul Christiano, and Dario Amodei,^[3] all of which aim to use AI assistance to extend the reach of human supervisory signals to tasks that exceed unaided human comprehension.

The technique inherits the basic two-stage decomposition of reinforcement learning from human feedback (RLHF): first learn a reward model from human preference data, then optimize a policy against that reward model. Its distinguishing claim is that the reward model itself can be bootstrapped to ever more capable regimes by training assistant agents that help the human labeler. RRM has been less empirically tested than vanilla RLHF, and the strongest published demonstration commonly associated with the idea, OpenAI's Recursively Summarizing Books with Human Feedback (2021),^[4] uses recursive task decomposition rather than the full assistant-trains-evaluator loop. The framework remains influential as conceptual scaffolding for current work on scalable oversight, weak-to-strong generalization, and constitutional AI, but it has not been demonstrated end to end on tasks that exceed unaided human evaluation in the way the 2018 agenda imagined.

Who proposed recursive reward modeling, and when?

The phrase "recursive application of reward modeling" was introduced in the DeepMind research agenda Scalable agent alignment via reward modeling: a research direction, posted to arXiv as 1811.07871 on 19 November 2018.^[1] The paper presents reward modeling not as a finished algorithm but as a research direction, listing five concrete challenges (reward hacking, unsafe exploration, distributional shift, reward gaming via influencing the user, and unintended side effects) and pairing each with potential mitigations such as adversarial training, model-based reinforcement learning, online learning of the reward model, and uncertainty estimates over the reward. The abstract frames the objective as identifying "concrete approaches to mitigate these challenges, and ways to establish trust in the resulting agents."^[1]

Jan Leike was at the time a research scientist at DeepMind in London, where he had previously co-authored Deep reinforcement learning from human preferences (2017) with Paul Christiano, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei, the seminal paper that brought preference-based reward learning into the deep RL era using Atari and MuJoCo benchmarks.^[5] That work defined goals "in terms of (non-expert) human preferences between pairs of trajectory segments" and showed the approach could train complex behaviors while "providing feedback on less than one percent of our agent's interactions with the environment," at a cost of roughly an hour of human time per task.^[5] The 2018 paper extends that line of work by asking what would be needed to scale the same paradigm to tasks where humans cannot directly judge whether the policy succeeded. Leike moved to OpenAI in 2021 and later co-led its Superalignment team with Ilya Sutskever. The team was announced on 5 July 2023 with a commitment of 20 percent of OpenAI's secured compute over four years toward aligning superintelligent systems.^[18] He resigned on 17 May 2024 and joined Anthropic later that month to lead its Alignment Science team, stating that his new team would work on scalable oversight, weak-to-strong generalization, and automated alignment research.^[6]^[7]

A companion blog post by DeepMind Safety Research framed the proposal as a way to "separate learning what to do (the reward model) from learning how to do it (the policy)," and used a chip design example to illustrate how helper agents could evaluate hard subquestions such as heat dissipation, lifetime, and security vulnerabilities that humans cannot reliably score themselves.^[8] The same post explicitly described the recursive application as "an instance of iterated amplification," signaling the close kinship between RRM and Christiano's earlier framework.^[8]

How does recursive reward modeling work?

Step 1: reward modeling

The first step of recursive reward modeling is ordinary reward modeling, an idea that predates the 2018 paper and goes back at least to the 2017 Deep reinforcement learning from human preferences work.^[5] A human evaluator is shown pairs of agent trajectories or candidate outputs and asked which is preferable. A neural network reward model is trained to predict these preferences using a Bradley-Terry style loss. A policy is then trained by reinforcement learning to maximize the expected predicted reward, with the reward model continuously updated as new behavior is encountered to mitigate reward hacking caused by distributional shift.
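The preference-fitting step can be sketched in a few lines. The block below is a minimal illustration, not the 2017 or 2018 papers' implementation: the linear reward model, the two-dimensional features, the simulated labeler, and all function names are assumptions chosen to keep the example self-contained. It fits reward weights by gradient descent on the Bradley-Terry negative log-likelihood of each pairwise comparison.

```python
import math
import random


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def reward(weights: list[float], features: list[float]) -> float:
    """Hypothetical linear reward model: r(x) = w . x."""
    return sum(w * f for w, f in zip(weights, features))


def bradley_terry_loss(weights, preferred, rejected) -> float:
    """Negative log-probability that `preferred` beats `rejected`."""
    margin = reward(weights, preferred) - reward(weights, rejected)
    # -log(sigmoid(margin)) written in a form that avoids overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def train_reward_model(comparisons, dim, lr=0.1, epochs=200):
    """Fit weights by gradient descent on the mean Bradley-Terry loss."""
    weights = [0.0] * dim
    for _ in range(epochs):
        grad = [0.0] * dim
        for preferred, rejected in comparisons:
            margin = reward(weights, preferred) - reward(weights, rejected)
            # Gradient of -log(sigmoid(margin)) with respect to the weights.
            coeff = -(1.0 - sigmoid(margin))
            for i in range(dim):
                grad[i] += coeff * (preferred[i] - rejected[i])
        weights = [w - lr * g / len(comparisons) for w, g in zip(weights, grad)]
    return weights


def make_synthetic_comparisons(n=200, seed=0):
    """Simulated labeler (an assumption): prefers the larger first feature."""
    rng = random.Random(seed)
    comparisons = []
    for _ in range(n):
        a = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        b = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        comparisons.append((a, b) if a[0] > b[0] else (b, a))
    return comparisons


comparisons = make_synthetic_comparisons()
weights = train_reward_model(comparisons, dim=2)
print("learned reward weights:", weights)
```

In a real system the linear model would be a neural network and the comparisons would come from human labelers, but the loss has the same shape: the reward model is pushed to assign a higher score to whichever item the labeler preferred.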

This decomposition has two practical advantages emphasized in the 2018 paper. First, it cleanly separates the value specification problem from the policy optimization problem, so improvements in RL algorithms automatically translate into capability gains without changing the alignment target. Second, the reward model can be queried many times per RL update, amortizing expensive human supervision over millions of policy improvement steps. The same skeleton underlies modern RLHF for large language models, including OpenAI's Learning to summarize from human feedback (Stiennon et al., 2020)^[17] and the InstructGPT line of work (Ouyang et al., 2022).

Step 2: recursive application

The recursive step is the distinguishing feature of RRM. The 2018 paper argues that the reward modeling skeleton can be extended to tasks too complex for direct human evaluation by training intermediate agents whose specific purpose is to help a human evaluate the target task. Concretely, to train an agent A on a task whose outcomes a human cannot reliably judge, the designers identify a set of evaluation subtasks. These subtasks must satisfy two conditions. They must be easier than the target task itself, so that human evaluation of the subtask outputs is feasible. And they must be sufficient, in combination, to let a human form a reliable judgement about A's outputs.

Helper agents are then trained on the evaluation subtasks using ordinary reward modeling. Their outputs become inputs to the human's evaluation of A. When even the subtask evaluation is too hard, the same construction recurses: helper agents are trained for sub-subtasks that humans can score, and so on. The result is a tree of reward modeling problems whose leaves are tasks short enough that an unaided human can label preferences directly, and whose root is the agent the designer ultimately cares about. The DeepMind blog post used the example of designing a computer chip. As the post put it, "to evaluate a proposed chip design, we train other 'helper' agents with reward modeling to benchmark the chip's performance in simulation, calculate heat dissipation, estimate the chip's lifetime, try to find security vulnerabilities, and so on."^[8] A human cannot judge whether a proposed design is good in isolation, but can judge the outputs of those helper agents, all of which feed into a composite evaluation of the chip design itself.
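The tree structure described above can be written down as a small data structure. This is a sketch under stated assumptions: the `EvalTask` class, its field names, and the sub-subtask under "find security vulnerabilities" are hypothetical, and which leaves a human can judge directly is asserted here for illustration rather than taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class EvalTask:
    """One node in a hypothetical RRM task tree."""
    name: str
    human_can_judge_directly: bool = False
    subtasks: list["EvalTask"] = field(default_factory=list)


def training_order(task: EvalTask) -> list[str]:
    """Post-order traversal: helpers are trained before the agent they help evaluate."""
    order = []
    for sub in task.subtasks:
        order.extend(training_order(sub))
    order.append(task.name)
    return order


def is_well_founded(task: EvalTask) -> bool:
    """The recursion bottoms out only if every leaf is directly human-judgeable."""
    if not task.subtasks:
        return task.human_can_judge_directly
    return all(is_well_founded(sub) for sub in task.subtasks)


def depth(task: EvalTask) -> int:
    """Number of helper levels below this task."""
    if not task.subtasks:
        return 0
    return 1 + max(depth(sub) for sub in task.subtasks)


# Chip design example from the blog post; the nested leaf is an illustrative assumption.
chip = EvalTask("design chip", subtasks=[
    EvalTask("benchmark performance in simulation", human_can_judge_directly=True),
    EvalTask("calculate heat dissipation", human_can_judge_directly=True),
    EvalTask("estimate lifetime", human_can_judge_directly=True),
    EvalTask("find security vulnerabilities", subtasks=[
        EvalTask("explain one candidate exploit", human_can_judge_directly=True),
    ]),
])

print(training_order(chip))
print("well founded:", is_well_founded(chip), "depth:", depth(chip))
```

The post-order traversal makes the dependency explicit: no agent can be trained until every helper beneath it exists, and a single leaf that humans cannot score leaves the whole tree without a grounded training signal.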

A second example often paired with chip design is writing a fantasy novel: a human cannot read an entire novel and judge whether it is the best version of the story the author was trying to tell, but can judge summaries, check claims of internal consistency made by a fact-checking assistant, and evaluate brief excerpts that a helper agent flags as suspicious. By chaining several such helpers, the human's evaluative bandwidth is amplified to span a task whose surface area would otherwise overwhelm them.

The crucial implicit assumption is that evaluation is easier than generation, an analogue of the P versus NP gap in complexity theory. The 2018 paper acknowledges that this is an empirical assumption rather than a theorem, and that the recursive structure must be designed so that, at every level, the helper agents' tasks are strictly easier than the parent task. If evaluation is not in fact easier than generation for the relevant domain, recursion provides no leverage.
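The gap between checking and producing an answer can be made concrete with a classic toy problem. The sketch below uses subset sum purely as an analogy chosen for this illustration (it does not appear in the 2018 paper): verifying a proposed subset takes a handful of operations, while the naive search may examine up to 2^n subsets.

```python
from itertools import combinations


def verify(numbers, target, candidate) -> bool:
    """Evaluation: check that a proposed subset is drawn from `numbers` and hits the target."""
    pool = list(numbers)
    for x in candidate:
        if x not in pool:
            return False
        pool.remove(x)
    return sum(candidate) == target


def generate(numbers, target):
    """Generation: brute-force search, returning a solution and how many subsets were tried."""
    checked = 0
    for size in range(len(numbers) + 1):
        for subset in combinations(numbers, size):
            checked += 1
            if sum(subset) == target:
                return list(subset), checked
    return None, checked


numbers = [3, 34, 4, 12, 5, 2]
solution, checked = generate(numbers, target=9)
print("found", solution, "after trying", checked, "subsets")
print("verified:", verify(numbers, 9, solution))
```

RRM needs the analogous property to hold for the evaluation subtasks it assigns to helpers. Where checking an output is as hard as producing it, the tree offers no shortcut.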

How does recursive reward modeling relate to RLHF?

Vanilla RLHF, as practiced in InstructGPT, Claude, Llama, and many open weight models, is the non-recursive base case of RRM: humans label preferences directly, a reward model is trained, and a policy is optimized against it. Recursive reward modeling is the proposal that the same machinery should be applied recursively when the task exceeds human evaluative capacity. Several practical RLHF variants can be read as partial steps toward the recursive picture.

Reinforcement learning from AI feedback (RLAIF), introduced in 2022 work on constitutional AI^[9] and benchmarked by Lee et al. (2023),^[10] replaces human preference labels with labels from another language model. In the strict RRM picture, the preference-labeling model would itself need to have been trained by reward modeling on a strictly simpler task; in practice, RLAIF systems often use an off-the-shelf model rather than a deliberately trained helper, which gives them empirical traction but loses the principled story.

Process supervision, exemplified by OpenAI's Let's Verify Step by Step paper (Lightman et al., 2023),^[11] trains process reward models (PRMs) that score each intermediate step of a chain of thought rather than only the final answer. That paper released PRM800K, a dataset of roughly 800,000 step-level human feedback labels, and reported that its process-supervised model "solves 78% of problems from a representative subset of the MATH test set," with process supervision significantly outperforming outcome supervision.^[11] This can be read as decomposing the evaluation of a hard problem into evaluations of easier reasoning steps, a structural cousin of the recursion in RRM, though the decomposition is fixed by the chain-of-thought format rather than negotiated between trained helper agents.
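A small sketch can show how step-level scores differ from a single final-answer score. The aggregation rule used here, multiplying per-step correctness probabilities, is one simple choice assumed for illustration; the function names, the threshold, and the example probabilities are all hypothetical and are not taken from PRM800K.

```python
from math import prod


def solution_score(step_probs: list[float]) -> float:
    """Probability that every step is correct, treating steps as independent (an assumption)."""
    return prod(step_probs)


def first_suspect_step(step_probs: list[float], threshold: float = 0.5):
    """Index of the first step the scorer distrusts, or None if all steps pass."""
    for index, p in enumerate(step_probs):
        if p < threshold:
            return index
    return None


def best_of_n(candidates: dict[str, list[float]]) -> str:
    """Pick the candidate solution whose steps are jointly most trusted."""
    return max(candidates, key=lambda name: solution_score(candidates[name]))


# Hypothetical per-step scores for two candidate solutions to the same problem.
candidates = {
    "A": [0.9, 0.9, 0.9],     # uniformly solid reasoning
    "B": [0.99, 0.3, 0.99],   # confident ending, but a doubtful middle step
}

print("chosen:", best_of_n(candidates))
print("first suspect step in B:", first_suspect_step(candidates["B"]))
```

A scorer that looked only at the last step would prefer B, whereas the step-level view both rejects B and points to where the reasoning went wrong. That localized signal is what makes the per-step evaluation an easier task for a human or helper to check.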

OpenAI's Recursively Summarizing Books with Human Feedback (Wu, Ouyang, Ziegler, Stiennon, Lowe, Leike, Christiano, 2021)^[4] is the clearest published example of an RRM-style training pipeline. The system summarizes book sections, then summarizes those summaries, and repeats until a single whole-book summary remains, with each level trained by RLHF on a subtask that human evaluators can complete in under an hour. As the authors describe it, the method "combines learning from human feedback with recursive task decomposition: we use models trained on smaller parts of the task to assist humans in giving feedback on the broader task."^[4] The recursion here is in the task structure rather than in the assistant agents that train the evaluator, so it implements only part of the RRM picture.
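The control flow of recursive task decomposition is simple to sketch. In the block below, `toy_summarize` is a placeholder standing in for a learned summarizer, and `fan_in`, `chunk`, and `recursive_summary` are hypothetical names; nothing here reproduces the paper's models or data.

```python
def chunk(items: list[str], size: int) -> list[list[str]]:
    """Split a list into consecutive groups of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def toy_summarize(passages: list[str], words_each: int = 3) -> str:
    """Placeholder summarizer: keep the first few words of each passage."""
    return " / ".join(" ".join(p.split()[:words_each]) for p in passages)


def recursive_summary(passages: list[str], summarize, fan_in: int = 3):
    """Summarize groups of passages, then recurse on the summaries until one remains."""
    if not passages:
        raise ValueError("need at least one passage")
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2 so each level shrinks")
    level = list(passages)
    depth = 0
    while len(level) > 1:
        # Each call sees at most `fan_in` short texts, which keeps it human-checkable.
        level = [summarize(group) for group in chunk(level, fan_in)]
        depth += 1
    return level[0], depth


sections = [f"Section {i} describes event number {i} in detail." for i in range(1, 10)]
summary, depth = recursive_summary(sections, toy_summarize, fan_in=3)
print("levels of recursion:", depth)
print(summary)
```

Because every call to the summarizer receives only a few short inputs, a labeler can compare two candidate outputs for any single call without reading the whole book, which is the property the paper relies on.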

How does RRM compare to debate, IDA, and constitutional AI?

RRM is one of several proposals for scaling human supervision to superhuman AI systems. Its closest contemporaries, both published in 2018, are iterated amplification and AI safety via debate; constitutional AI (2022) and weak-to-strong generalization (2023) are later relatives. Each takes a different stance on how AI assistance to the human evaluator should be structured, and each makes different assumptions about what kinds of tasks decompose cleanly.

Approach	Year	Origin	Key mechanism	Training signal source	Verified empirical use
Recursive reward modeling	2018	Leike et al., DeepMind^[1]	Tree of reward modeling agents; helpers train evaluators for harder tasks	Human preferences amplified by helper agents	Partial: recursive book summarization (Wu et al., 2021)^[4]
Iterated distillation and amplification (IDA)	2018	Christiano, Shlegeris, Amodei^[2]	Amplify weak agents by allowing them to call themselves as subroutines, then distill the amplified policy	Decomposition of question by amplified agent	Algorithmic toy tasks (Christiano et al., 2018)
AI safety via debate	2018	Irving, Christiano, Amodei, OpenAI^[3]	Two agents make competing arguments; a human judges which agent told the truth	Human judge of a zero-sum debate game	MNIST sparse pixel debate; later LLM debate experiments
Constitutional AI / RLAIF	2022	Bai et al., Anthropic^[9]	Model critiques and revises its own outputs against a written constitution; AI labels preferences	Written principles plus AI feedback	Anthropic's Claude line; widely reproduced
Weak-to-strong generalization	2023	Burns et al., OpenAI^[12]	Finetune strong pretrained model on labels from a weak supervisor; rely on generalization	Weak model labels; auxiliary losses	GPT-2 supervising GPT-4 on NLP, chess, reward modeling

The DeepMind safety blog explicitly described RRM as "an instance of iterated amplification."^[8] Iterated amplification, which Christiano, Shlegeris, and Amodei describe as a method that "progressively builds up a training signal for difficult problems by combining solutions to easier subproblems,"^[2] provides the general framework: a weak agent calls copies of itself to solve a question and the result is distilled into a single faster agent, while recursive reward modeling specializes the amplified signal to be a reward model that scores another agent. In IDA the amplified entity is the answer-producing agent; in RRM the amplified entity is the evaluator.

Debate differs more sharply. Where RRM builds a cooperative scaffolding of helpers under the human's supervision, debate sets two agents against each other in a zero-sum game whose winner is determined by a human judge. Irving et al. (2018) argue that "debate with optimal play can answer any question in PSPACE given polynomial time judges (direct judging answers only NP questions)."^[3] In practice both debate and RRM rely on the empirical claim that evaluation is easier than generation in the relevant domain.

Constitutional AI, introduced by Anthropic in late 2022 with Bai et al.,^[9] replaces some of the human preference labels in RLHF with model self-critique guided by a written constitution. It can be read as a flat, non-recursive instance of "AI helps the evaluator," in which the helping AI is the same as the model being trained rather than a strictly weaker assistant trained on a strictly easier subtask. Constitutional AI is the most empirically developed member of the family, underpinning Anthropic's Claude line.

Weak-to-strong generalization, introduced by Collin Burns, Pavel Izmailov, Jan Hendrik Kirchner and colleagues at OpenAI on 14 December 2023,^[12] takes the opposite tack from RRM. Rather than build helpers to make the human evaluator smarter, it studies whether a strong pretrained model finetuned on labels from a weak supervisor can recover more capability than the supervisor has, on the analogy that humans are the weak supervisor and superhuman AIs are the strong student. The authors report that "when we naively finetune strong pretrained models on labels generated by a weak model, they consistently perform better than their weak supervisors," and that finetuning GPT-4 with a GPT-2-level supervisor plus an auxiliary confidence loss recovered close to GPT-3.5-level performance on natural language tasks.^[12] The two approaches are complementary: RRM tries to make the supervision signal smarter; weak-to-strong tries to extract more from a fixed weak supervision signal.

Has recursive reward modeling been tested empirically?

RRM has been less empirically tested than RLHF, and the literature does not yet contain a controlled experiment that runs the full tree of helper agents the 2018 paper describes. The most influential empirical artifact associated with the approach is Recursively Summarizing Books with Human Feedback, submitted by an OpenAI team on 22 September 2021 (arXiv 2109.10862).^[4] The authors combine recursive task decomposition with reward modeling and behavioral cloning on top of fine-tuned GPT-3. At inference time the model summarizes short book sections, then recursively summarizes those summaries, producing a whole-book summary that can be evaluated by a human who has not read the book. The system achieves state of the art results on the BookSum benchmark for book-length summarization, and a zero-shot question-answering model built on its summaries reaches state of the art on the challenging NarrativeQA benchmark; the model even matched the quality of human-written summaries in roughly 5 percent of books.^[4] The authors emphasize that their "human labelers are able to supervise and evaluate the models quickly, despite not having read the entire books themselves."^[4]

It is worth being precise about what this demonstrates. The recursion in the book summarization paper is in the task structure: long inputs are sliced into shorter chunks, summarized, and the summaries are recursively combined. The reward model is trained directly on human comparisons of summary pairs. The setup does not train explicit assistant agents whose role is to help the human evaluator score a fixed task. Calling the book summarization work an RRM demonstration is therefore an interpretation rather than an exact match.

Other adjacent empirical contributions include OpenAI's Learning to summarize from human feedback (Stiennon et al., 2020),^[17] InstructGPT (Ouyang et al., 2022), process reward models (Lightman et al., 2023),^[11] Measuring Progress on Scalable Oversight for Large Language Models (Bowman et al., 2022),^[13] and the constitutional AI line (Bai et al., 2022).^[9] Bowman et al. define scalable oversight as "the problem of supervising systems that potentially outperform us on most skills relevant to the task," and found that human participants interacting with an unreliable model assistant "substantially outperform both the model alone and their own unaided performance."^[13] Their study operationalizes the "sandwiching" methodology proposed by Ajeya Cotra in a 2021 blog post, in which a model is more capable than a non-expert user but less capable than a domain expert, and the question is whether the non-expert assisted by the model can match the expert.^[14] Sandwiching is a methodological framework that can be used to test RRM-style proposals, though Bowman et al.'s initial experiment evaluated a simple chat assistant baseline rather than a full recursive tree.

To date there is no published demonstration of a multi-level RRM tree on a task where unaided human evaluation fails and assisted human evaluation succeeds. The community sometimes treats this absence as expected, given the cost and difficulty of designing such an experiment, and sometimes treats it as evidence that the recursive picture has fundamental practical obstacles.

What are the main criticisms of recursive reward modeling?

Capability elicitation in helper agents

The 2018 paper depends on each helper agent being capable enough to materially aid the human's evaluation while still being trainable from a strictly easier subtask. Critics have argued that this assumption is in tension: if a helper is to flag subtle problems in a superhuman target agent's output, the helper itself must be capable of detecting subtle problems, which pushes its own training subtask back toward the difficulty of the target. The empirical question of whether the difficulty gradient is steep enough to support meaningful recursion has not been settled.

Error compounding across the tree

A common technical objection, raised by Rohin Shah in his alignment newsletter coverage and by many commenters since,^[15] is that errors propagate through the tree. Imperfect reward models at the leaves train slightly flawed helper agents, whose flawed evaluations train slightly more flawed reward models at the next level, and so on. In the worst case the cumulative error grows with tree depth, capping the depth of recursion that delivers useful signal. The 2018 paper acknowledges this concern and lists it among the open problems.
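A toy calculation shows why depth is capped. The model below is an assumption made for illustration, not an analysis from the 2018 paper: each level of the tree is taken to preserve a correct judgement independently with probability 1 - ε, so reliability after d levels is (1 - ε)^d. Real errors may be correlated or partly corrected by the human, in which case the numbers would differ.

```python
def reliability(per_level_error: float, depth: int) -> float:
    """Chance a judgement survives `depth` levels, assuming independent errors."""
    return (1.0 - per_level_error) ** depth


def max_useful_depth(per_level_error: float, min_reliability: float) -> int:
    """Deepest tree whose end-to-end reliability stays at or above the floor."""
    if not 0.0 < per_level_error < 1.0:
        raise ValueError("per_level_error must be strictly between 0 and 1")
    if not 0.0 < min_reliability <= 1.0:
        raise ValueError("min_reliability must be in (0, 1]")
    depth = 0
    while reliability(per_level_error, depth + 1) >= min_reliability:
        depth += 1
    return depth


for error in (0.01, 0.05, 0.10):
    print(f"per-level error {error:.2f}: usable depth at 80% reliability =",
          max_useful_depth(error, 0.8))
```

Under this simplified model, halving the per-level error roughly doubles the usable depth, which is why proposals for bounding error at each level matter so much to the recursive picture.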

Deceptive alignment

Deceptive alignment, the hypothesized failure mode in which a model behaves as if aligned during training in order to be deployed and then defects, applies to RRM in two ways. First, the target agent may behave well only because it knows it is being evaluated by a helper, and defect when the helper's coverage is incomplete. Second, and more troubling, helper agents trained by ordinary RL on their own subtask have no built-in incentive to be honest evaluators; they have an incentive to maximize the evaluation reward modeled by their own reward model, which may be subtly different from accurate evaluation. If the helpers themselves are deceptively aligned, the entire tree's trustworthiness is undermined. This concern shades into the literature on mesa optimization.

Reward hacking by influencing the user

The 2018 paper itself flags as a research challenge the possibility that an agent learns to influence the user's preference labels, for example by producing outputs that exploit the user's cognitive biases. This problem becomes more acute in the recursive setting, because helper agents and target agents may interact with the human evaluator in ways that subtly distort the supervision signal at multiple levels of the tree.

Fundamental dependence on RLHF's limits

A broader critique, articulated in several blog posts including BlueDot's introduction to RRM,^[16] is that RRM inherits the well-documented limitations of RLHF: sycophancy, reward gaming, distributional fragility, and the inability of a fixed reward model to track shifting human preferences. AI assistance to the human evaluator may shift but does not remove the fundamental dependence on human judgement, and many of the failure modes of RLHF persist under recursion.

Capability-elicitation problem at the top level

Even granting that helper agents are honest and capable, the human at the top of the tree must still be able to integrate the helpers' outputs into a coherent judgement about the target agent. As tasks become more complex, this integration step itself becomes the bottleneck: the human is no longer evaluating raw behavior but helper agent reports about behavior, which requires trust in the helpers and the ability to reason across their joint output. This concern motivates the broader scalable oversight literature, including weak-to-strong generalization, and the line of work on debate where the structure of the human's reasoning is more constrained.

How has recursive reward modeling influenced later alignment work?

While no single line of empirical research is labeled "recursive reward modeling," the conceptual machinery of the 2018 paper has shaped several strands of contemporary alignment work.

The most direct descendant is OpenAI's recursive book summarization (Wu et al., 2021), discussed above, which applied the recursive task decomposition idea at scale on top of RLHF.^[4] Process supervision (Lightman et al., 2023) extended the idea of decomposing evaluation into step-level evaluations on mathematical reasoning, releasing PRM800K, a dataset of roughly 800,000 step-level human feedback labels.^[11] Constitutional AI (Bai et al., 2022) operationalized model self-critique against a written constitution, with the model acting as its own evaluator under RLAIF.^[9]

The scalable oversight measurement program, beginning with Bowman et al.'s 2022 paper,^[13] turned the question of whether RRM-like methods work into an empirical research program with concrete benchmarks. The sandwiching methodology (Cotra, 2021)^[14] provided the experimental design philosophy under which proposals like RRM, debate, and weak-to-strong generalization could be compared head to head. Subsequent work has built on this with adversarial oversight protocols, doubly efficient debate, and other refinements.

Weak-to-strong generalization (Burns et al., 2023)^[12] was framed by its OpenAI authors as a complementary alternative: rather than amplify the human supervisor with helper agents, accept that the supervisor is fixed and study what the strong model generalizes to. When Jan Leike moved from OpenAI to Anthropic in May 2024, his publicly stated research program named scalable oversight, weak-to-strong generalization, and automated alignment research as its three priorities, indicating that the original RRM agenda has fragmented into a portfolio of related techniques rather than persisting as a single named research program.^[6]

Recent work also revisits the foundational assumptions of RRM. Papers in 2024 and 2025 on debate variants, automated alignment researchers, generative reward models, and LLM-as-a-judge systems all build on the idea that AI assistance can lift the ceiling of human evaluation, while attempting to address the error compounding and deceptive alignment concerns through interpretability, adversarial training, and process-level supervision. The 2018 paper is now most often cited as a foundational research direction rather than as a method to be implemented as written.

What open problems remain?

The 2018 paper named five research challenges (reward hacking, unsafe exploration, distributional shift, reward gaming via the user, and unintended side effects) and proposed mitigation directions for each. Nearly eight years later, several open problems specific to RRM remain:

Whether evaluation is in fact strictly easier than generation across the domains where superhuman AI matters most, including scientific research, software engineering at the system level, and long-horizon agentic tasks.
How to bound error compounding across multiple levels of recursion, including whether interpretability or formal verification of helper agents can be used to certify trust at each level.
How to design helper agents that are honest evaluators rather than reward maximizers on their own training objective, given the close link to deceptive alignment and mesa optimization.
How to test the full RRM tree empirically. The community has not yet produced a clean sandwiching experiment that runs the recursion to two or more levels on a task where unaided human evaluation fails.
How to integrate RRM with complementary approaches such as weak-to-strong generalization, debate, constitutional AI, and process supervision, none of which alone suffices for the superhuman regime.
Whether the assumption of a fixed task hierarchy is compatible with the open-ended tasks that future AI systems will face, including tasks where decomposing into subtasks itself requires superhuman judgement.

References

1. Leike, J., Krueger, D., Everitt, T., Martic, M., Maini, V., Legg, S. (2018). *Scalable agent alignment via reward modeling: a research direction*. arXiv:1811.07871. https://arxiv.org/abs/1811.07871. Accessed 2026-05-20.
2. Christiano, P., Shlegeris, B., Amodei, D. (2018). *Supervising strong learners by amplifying weak experts*. arXiv:1810.08575. https://arxiv.org/abs/1810.08575. Accessed 2026-05-20.
3. Irving, G., Christiano, P., Amodei, D. (2018). *AI safety via debate*. arXiv:1805.00899. https://arxiv.org/abs/1805.00899. Accessed 2026-05-20.
4. Wu, J., Ouyang, L., Ziegler, D. M., Stiennon, N., Lowe, R., Leike, J., Christiano, P. (2021). *Recursively Summarizing Books with Human Feedback*. arXiv:2109.10862. https://arxiv.org/abs/2109.10862. Accessed 2026-05-20.
5. Christiano, P., Leike, J., Brown, T., Martic, M., Legg, S., Amodei, D. (2017). *Deep Reinforcement Learning from Human Preferences*. NeurIPS 2017. arXiv:1706.03741. https://arxiv.org/abs/1706.03741. Accessed 2026-05-20.
6. CNBC (2024). "OpenAI safety leader Jan Leike joins rival AI startup Anthropic." 28 May 2024. https://www.cnbc.com/2024/05/28/openai-safety-leader-jan-leike-joins-amazon-backed-anthropic.html. Accessed 2026-05-20.
7. Wikipedia contributors. "Jan Leike." https://en.wikipedia.org/wiki/Jan_Leike. Accessed 2026-05-20.
8. DeepMind Safety Research (2018). "Scalable agent alignment via reward modeling." Medium. https://deepmindsafetyresearch.medium.com/scalable-agent-alignment-via-reward-modeling-bf4ab06dfd84. Accessed 2026-05-20.
9. Bai, Y., Kadavath, S., Kundu, S., et al. (2022). *Constitutional AI: Harmlessness from AI Feedback*. arXiv:2212.08073. https://arxiv.org/abs/2212.08073. Accessed 2026-05-20.
10. Lee, H., Phatale, S., Mansoor, H., et al. (2023). *RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback*. arXiv:2309.00267. https://arxiv.org/abs/2309.00267. Accessed 2026-05-20.
11. Lightman, H., Kosaraju, V., Burda, Y., Edwards, H., Baker, B., Lee, T., Leike, J., Schulman, J., Sutskever, I., Cobbe, K. (2023). *Let's Verify Step by Step*. arXiv:2305.20050. https://arxiv.org/abs/2305.20050. Accessed 2026-05-20.
12. Burns, C., Izmailov, P., Kirchner, J. H., Baker, B., Gao, L., Aschenbrenner, L., Chen, Y., Ecoffet, A., Joglekar, M., Leike, J., Sutskever, I., Wu, J. (2023). *Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision*. arXiv:2312.09390. https://arxiv.org/abs/2312.09390. Accessed 2026-05-20.
13. Bowman, S. R., Hyun, J., Perez, E., et al. (2022). *Measuring Progress on Scalable Oversight for Large Language Models*. arXiv:2211.03540. https://arxiv.org/abs/2211.03540. Accessed 2026-05-20.
14. Cotra, A. (2021). "The case for aligning narrowly superhuman models." AI Alignment Forum. Referenced in Bowman et al. (2022). https://arxiv.org/abs/2211.03540. Accessed 2026-05-20.
15. Shah, R. (2019). "AN #79: Recursive reward modeling as an alignment technique integrated with deep RL." LessWrong. https://www.lesswrong.com/posts/EoY6P6mpz7ZozhAxm/an-79-recursive-reward-modeling-as-an-alignment-technique. Accessed 2026-05-20.
16. BlueDot Impact (2024). "What is Recursive Reward Modelling?" https://blog.bluedot.org/p/what-is-recursive-reward-modelling. Accessed 2026-05-20.
17. Stiennon, N., Ouyang, L., Wu, J., Ziegler, D. M., Lowe, R., Voss, C., Radford, A., Amodei, D., Christiano, P. (2020). *Learning to summarize from human feedback*. arXiv:2009.01325. https://arxiv.org/abs/2009.01325. Accessed 2026-07-12.
18. TechCrunch (2023). "OpenAI is forming a new team to bring 'superintelligent' AI under control." 5 July 2023. https://techcrunch.com/2023/07/05/openai-is-forming-a-new-team-to-bring-superintelligent-ai-under-control/. Accessed 2026-07-12.
