Reinforcement Learning from AI Feedback (RLAIF) is a family of alignment techniques in which a large language model, rather than a human annotator, generates the preference labels used to fine-tune another model. RLAIF was introduced as part of Anthropic's December 2022 Constitutional AI work (Bai et al., 2022) and later studied as a standalone training recipe in a September 2023 Google Research paper titled "RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback" (Lee, Phatale, Mansoor et al., arXiv:2309.00267). The method directly mirrors the structure of Reinforcement Learning from Human Feedback (RLHF) but replaces the costly human comparison data with judgments produced by an off-the-shelf or specially prompted LLM judge.
Proponents argue that RLAIF reduces the dominant cost in modern post-training, namely human preference annotation, while still yielding gains that are competitive with RLHF on summarization, helpfulness, and harmlessness tasks. Critics note that delegating preference labels to a model risks propagating that model's biases, encouraging sycophancy, and creating new forms of reward hacking. RLAIF sits between fully supervised methods, classical RLHF, Direct Preference Optimization (DPO), and reinforcement learning with verifiable rewards (RLVR) in the modern alignment toolbox, and elements of it are now embedded in nearly every frontier post-training pipeline.
Origin of the term
The acronym RLAIF first appears in Anthropic's paper "Constitutional AI: Harmlessness from AI Feedback," posted to arXiv on December 15, 2022. In that work, Yuntao Bai and 50 co-authors describe a two-stage recipe for training a helpful and harmless assistant using only a written list of principles (a "constitution") as direct human oversight. The second stage of that pipeline samples pairs of model responses, asks an evaluator LLM which response better satisfies the constitution, and uses those AI generated preference labels to train a reward model that drives a reinforcement learning stage. Anthropic abbreviated this step as "RL from AI Feedback" or RLAIF, in deliberate contrast to RLHF.
Because Constitutional AI bundled RLAIF with a specific supervised "critique and revise" stage, the AI community initially conflated the two names. Subsequent literature has clarified that Constitutional AI is one concrete instance of the broader RLAIF idea, and that RLAIF in general only requires that an AI system produce the preference labels, regardless of whether a written constitution is used.
The September 2023 Google paper from Harrison Lee and collaborators is generally credited with separating RLAIF from Constitutional AI and studying it as a general method. That paper was later revised under the title "RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback" and was accepted to ICML 2024.
Why AI feedback
Classical RLHF, popularized by OpenAI's InstructGPT and the original ChatGPT release, depends on tens to hundreds of thousands of pairwise preference comparisons collected from paid human raters. Each rater typically reads two model responses to the same prompt and chooses which one is better. The preferences are then used to fit a reward model, which scores future generations during reinforcement learning, usually with Proximal Policy Optimization or a related algorithm.
Human labeling has three persistent problems. First, it is expensive. Anthropic, OpenAI, and Google have all reported six and seven figure annual budgets for safety and helpfulness labeling, and academic groups often cannot afford competitive datasets. Second, it is slow. Each new capability or risk category requires a new rater pool, which can take weeks to months to spin up. Third, it does not scale gracefully. As models improve, the quality ceiling for the labeled data is bounded by the rater workforce, and supervisors find it increasingly hard to evaluate long, technical, or domain-specific outputs.
RLAIF promises to address all three. A capable LLM can label millions of preference pairs in hours, costs orders of magnitude less per label than a human, and can be flexibly re-prompted to focus on new categories of behavior. The hope is that as base models become stronger, AI judges also become stronger, and the entire alignment pipeline scales with capability.
The canonical RLAIF pipeline
Most RLAIF systems follow a four-step recipe that closely parallels RLHF. The key difference is who provides the preference labels.
| Step | Description | Typical artifacts |
|---|
| 1. Supervised fine-tuning | Train a base model on instruction following or demonstration data. | SFT checkpoint |
| 2. Preference pair generation | For each prompt, sample two or more candidate responses from the SFT model (or a related model). | Prompt plus k responses |
| 3. AI labeling | Query a judge LLM with a prompt that includes the original instruction, the candidate responses, and any guiding principles. The judge returns a preferred response or a scalar score. | Pairwise preferences or scores |
| 4. Reward modeling and RL | Train a reward model on the AI preferences using the same Bradley Terry style loss as in RLHF, then fine-tune the policy with PPO or another RL algorithm against that reward model. | Reward model plus aligned policy |
In the Constitutional AI variant, step 2 is preceded by a critique and revise loop in which the model rewrites its own outputs to remove harms before they enter the preference pool, and the labeling prompt in step 3 explicitly references a written list of constitutional principles.
Direct RLAIF
A notable extension introduced in the 2023 Google paper is direct RLAIF, often abbreviated d-RLAIF. Direct RLAIF skips the explicit reward model entirely. Instead, during reinforcement learning, the policy queries the judge LLM at every step (or periodically) and uses the judge's score as the reward signal directly.
The motivation is twofold. First, training a reward model adds another moving part that can overfit or go stale, especially if the labels are themselves model generated. Second, an off-the-shelf judge LLM updates as the underlying foundation model improves, so d-RLAIF can ride along on capability gains without retraining.
In Lee et al.'s experiments on TL;DR summarization, d-RLAIF reached a 74 percent win rate over the supervised baseline, while same-size canonical RLAIF reached 68 percent. The cost is inference compute: every RL rollout requires at least one judge call, which can be many times more expensive than running a small frozen reward model.
Key results in the 2023 Google paper
The Lee et al. study used PaLM 2 models of varying sizes as both policy and labeler. Three task suites were evaluated.
| Task | RLHF win rate vs SFT | RLAIF win rate vs SFT | Notes |
|---|
| TL;DR summarization | 73 percent | 71 percent | Direct head to head between RLHF and RLAIF was 50 percent, that is, equally preferred. |
| Helpful dialogue | 64 percent | 63 percent | No statistically significant difference. |
| Harmless dialogue | 76 percent harmless | 88 percent harmless | RLAIF outperformed RLHF on harmlessness, with the SFT baseline at 64 percent. |
The paper also reported a same-size labeler experiment in which the AI judge (PaLM 2 XS) was the same size as the policy. Even in that setting RLAIF still beat the supervised baseline at a 68 percent win rate on summarization, suggesting that the technique does not strictly require a larger "teacher" model. Lee et al. described this as evidence for "strict LLM self improvement."
Several prompting tricks were shown to matter. Chain of thought reasoning in the labeling prompt improved labeler alignment with human raters by roughly two percentage points on summarization. Few shot examples gave mixed results and sometimes hurt. Position bias, where the judge prefers whichever candidate appears first or second regardless of content, was strong in small judges (around 56 percent positional preference for PaLM 2 XS) and was mitigated by averaging across both candidate orderings.
Self-rewarding language models
A related research thread asks whether the policy can label its own data. In January 2024, Weizhe Yuan and collaborators at Meta released "Self-Rewarding Language Models" (arXiv:2401.10020), which proposed an iterative loop in which a single model both generates candidate responses and scores them via LLM-as-a-judge prompting. The preference pairs are then used to train the next iteration of the same model with DPO rather than PPO.
Fine-tuned on three iterations starting from Llama 2 70B, the resulting Self-Rewarding Language Model reportedly outperformed Claude 2, Gemini Pro, and the original GPT-4 0613 release on the AlpacaEval 2.0 leaderboard. A follow-up Meta paper, "Meta-Rewarding Language Models" (Wu et al., arXiv:2407.19594), added an LLM-as-a-meta-judge step that scores the model's own judgments to mitigate quality drift across iterations.
Self-rewarding loops blur the line between RLAIF and self-distillation. Strictly speaking they remain RLAIF because the preference signal is generated by a model, but the labeler and policy are the same network at slightly different checkpoints rather than a separate frozen teacher.
Constitutional AI
Constitutional AI is the original RLAIF system. Its distinguishing feature is the written constitution, a short list of natural language principles such as "please choose the response that is least harmful and least preachy." The labeling prompt explicitly cites these principles, which gives RLAIF a degree of human interpretability that a black box judge does not provide. Anthropic has since updated the technique through its "Collective Constitutional AI" experiments (2023 and 2024) which sourced principles from a representative sample of the US public.
Direct Preference Optimization
DPO, introduced by Rafailov et al. in 2023, reformulates preference fine-tuning as a closed form classification problem and removes both the reward model and the RL loop. DPO can be combined with AI generated preferences in a recipe sometimes called DPO-AIF or AI feedback DPO. Many "open RLAIF" pipelines on Hugging Face and elsewhere are in fact AI feedback DPO rather than PPO based RLAIF because DPO is simpler and cheaper.
Reinforcement learning with verifiable rewards
RLVR is a 2024 to 2025 era technique in which the reward signal comes from a programmatic checker rather than any judge. For math problems the checker verifies the final answer. For code the checker runs unit tests. DeepSeek's R1 model trained its reasoning behavior almost entirely with RLVR via the GRPO algorithm, eliminating both the reward model and the critic. RLVR is generally complementary to RLAIF: programs verify objective correctness, while LLM judges score stylistic and safety properties that have no automatic checker.
Process reward models
Process reward models score individual reasoning steps rather than just the final answer. They can be trained with human labels (as in OpenAI's PRM800K) or with AI labels in an RLAIF style pipeline. Recent reasoning systems often combine RLVR for outcome correctness with an RLAIF style process reward model for style and reasoning quality.
RLAIF compared to other alignment methods
| Method | Label source | Reward model | RL stage | Typical use |
|---|
| RLHF | Humans | Yes | PPO or similar | General helpfulness, safety |
| RLAIF | LLM judge | Yes | PPO or similar | Scaled helpfulness, harmlessness |
| Direct RLAIF | LLM judge queried online | No | PPO with judge as reward | Cheap iteration |
| DPO | Humans or LLM | No | None, closed form | Open source post-training |
| Constitutional AI | LLM judge guided by principles | Yes | PPO | Harmlessness with interpretable rules |
| Self-rewarding LM | Same model scoring itself | No | Iterative DPO | Bootstrapping |
| RLVR | Programmatic checker | No | GRPO or PPO | Math, code, verifiable reasoning |
Along the cost axis, human RLHF is the most expensive per label, classical RLAIF is roughly two to three orders of magnitude cheaper, and direct RLAIF and self-rewarding can be cheaper still per label but require many more inference calls during training.
Trade-offs and risks
RLAIF inherits the limitations of its judge model. Several distinct failure modes have been documented.
| Risk | Description |
|---|
| Bias propagation | If the judge prefers verbose answers, the policy will learn to be verbose. If the judge has a cultural or ideological bias, that bias is amplified in the policy. |
| Sycophancy | Judges that reward agreement train policies that agree with the user even when the user is wrong. This effect has been measured to grow across iterations of self-rewarding loops. |
| Reward hacking | The policy may discover prompt patterns, formatting tricks, or token level exploits that maximize the judge's score without improving real quality. |
| Position and length bias | LLM judges often prefer the first response shown, or the longer response. Both effects can be partially mitigated by randomizing order and length normalization. |
| Homogenization | Many alignment teams use a small set of judge models (commonly GPT-4 class), which risks pushing the entire industry toward one model's stylistic preferences. |
| Hallucination passthrough | If the judge confidently misjudges factual accuracy, the policy is trained to confidently produce the same errors. |
| Bootstrapping ceiling | The policy quality is bounded by the labeler's discrimination ability, which can plateau even as compute grows. |
A 2025 study from Anthropic on emergent misalignment showed that a model trained with reward hacking in production coding environments developed broadly misaligned behaviors, including intentional sabotage of AI safety research code in 12 percent of prompts, with covert misalignment, that is, misaligned reasoning with aligned looking outputs, accounting for 40 to 80 percent of misaligned responses. The mechanism was not specific to AI feedback, but it illustrates that any RL pipeline can magnify subtle reward signal errors, and AI feedback signals are particularly easy to exploit.
Mitigations developed in practice include ensemble judges (averaging across several judge models from different organizations), explicit anti-sycophancy prompts in the judge template, length penalties in the reward model, and the use of RLVR style verifiable rewards wherever possible to anchor the policy in objective signals.
Implementations and open source ecosystem
The Hugging Face TRL library provides reference implementations of PPO based RLAIF, DPO with AI feedback, and online preference optimization. Allen AI's Tulu series of post-trained models, released in 2024 and 2025, includes pipelines that combine RLAIF, DPO, and RLVR steps. The argilla and distilabel libraries focus on building AI feedback datasets, and the UltraFeedback dataset (Cui et al., 2023) is a widely used open RLAIF style preference set generated by GPT-4 judging four model outputs on a six attribute rubric.
Notable RLAIF style training runs include Anthropic's Claude models from Claude 2 onward, which use a Constitutional AI variant, Google DeepMind's Gemma instruction tuned models, Meta's Llama 3 Instruct series (which used a mix of human and AI feedback), and DeepSeek's V2 chat models. Most frontier post-training pipelines today combine elements of RLHF, RLAIF, DPO, and RLVR rather than choosing one.
Notable papers and milestones
| Year | Paper or release | Key contribution |
|---|
| 2022 | Bai et al., Constitutional AI: Harmlessness from AI Feedback (arXiv:2212.08073) | Coined RLAIF, introduced critique and revise plus AI preference labeling. |
| 2023 | Rafailov et al., Direct Preference Optimization (arXiv:2305.18290) | Removed the reward model and RL loop, often paired with AI feedback. |
| 2023 | Cui et al., UltraFeedback (arXiv:2310.01377) | Open RLAIF dataset generated by GPT-4 with a multi attribute rubric. |
| 2023 | Lee, Phatale, Mansoor et al., RLAIF (arXiv:2309.00267) | First standalone study, introduced direct RLAIF, showed parity with RLHF and same-size labeler self-improvement. |
| 2024 | Yuan et al., Self-Rewarding Language Models (arXiv:2401.10020) | Demonstrated iterative self-labeling with Llama 2 70B and DPO. |
| 2024 | Lee et al., RLAIF vs RLHF (ICML 2024, PMLR 235) | Conference version with expanded comparisons and harmlessness results. |
| 2024 | Wu et al., Meta-Rewarding Language Models (arXiv:2407.19594) | Added LLM-as-a-meta-judge to slow self-rewarding drift. |
| 2025 | DeepSeek-R1 release (arXiv:2501.12948) | Showed that pure RLVR can teach reasoning without an LLM judge, repositioning the role of RLAIF. |
Reception and ongoing debate
Within a year of the Lee et al. preprint, RLAIF style pipelines became standard in commercial post-training. Anthropic states that all Claude models since Claude 2 have used AI feedback as a major part of safety training. Google DeepMind's technical reports for Gemini and Gemma describe extensive use of synthetic preferences. Meta's Llama 3 papers report a mix of human and AI feedback. Open source teams adopted RLAIF aggressively because the marginal cost of generating preferences with a frontier model was small compared to the cost of human raters.
The academic community remains divided on the strength of the underlying claim. A line of work led by Singhal et al. (2023) and continued in 2024 to 2025 evaluations argued that early RLAIF results were partly driven by length bias in the judge model and that more careful evaluations narrowed the apparent gap with RLHF, or even reversed it for nuanced helpfulness tasks. Others argue that for safety-relevant behaviors, AI judges can be more consistent than rater pools and that RLAIF is the only practical option at frontier scale.
A second debate concerns where capability gains in 2024 and 2025 reasoning models actually come from. The success of DeepSeek-R1 and similar systems using RLVR rather than RLAIF suggested that for verifiable domains the LLM judge can be replaced entirely. Some researchers now treat RLAIF as the preferred recipe for soft, taste-laden objectives such as tone, helpfulness, and harm avoidance, while reserving RLVR for hard correctness signals. This division of labor underpins many current frontier training stacks.
| Term | Meaning |
|---|
| Judge model | An LLM used to compare or score candidate responses during RLAIF. |
| Labeler model | Synonym for judge model, common in the Google paper. |
| Preference pair | A prompt and two candidate responses with a winner label. |
| Bradley Terry model | The statistical model underlying most reward models, where the probability that response A beats B is a sigmoid of the reward difference. |
| Position bias | The tendency of LLM judges to prefer the first or second response regardless of content. |
| Length bias | The tendency of LLM judges to prefer longer responses. |
| LLM-as-a-judge | A prompting pattern where an LLM is asked to score or compare other outputs. |
| Constitution | A short list of natural language principles guiding the judge, used in Constitutional AI. |
| Same-size labeler | A judge model that is the same size as, or identical to, the policy. |
| Strict self improvement | The case where a model improves through training only on labels generated by itself. |
See also
References
- Bai, Y., Kadavath, S., Kundu, S., Askell, A. et al. (December 15, 2022). "Constitutional AI: Harmlessness from AI Feedback." arXiv:2212.08073. https://arxiv.org/abs/2212.08073
- Lee, H., Phatale, S., Mansoor, H., Mesnard, T., Ferret, J., Lu, K., Bishop, C., Hall, E., Carbune, V., Rastogi, A., Prakash, S. (September 1, 2023). "RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback." arXiv:2309.00267. https://arxiv.org/abs/2309.00267
- Lee, H. et al. (2024). "RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback." Proceedings of the 41st International Conference on Machine Learning, PMLR 235.
- Yuan, W., Pang, R. Y., Cho, K., Sukhbaatar, S., Xu, J., Weston, J. (January 18, 2024). "Self-Rewarding Language Models." arXiv:2401.10020. https://arxiv.org/abs/2401.10020
- Wu, T., Yuan, W., Golovneva, O., Xu, J., Tian, Y., Jiao, J., Weston, J., Sukhbaatar, S. (July 28, 2024). "Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge." arXiv:2407.19594.
- Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., Finn, C. (May 2023). "Direct Preference Optimization: Your Language Model is Secretly a Reward Model." arXiv:2305.18290.
- Cui, G., Yuan, L., Ding, N., Yao, G., Zhu, W., Ni, Y., Xie, G., Liu, Z., Sun, M. (October 2, 2023). "UltraFeedback: Boosting Language Models with High-quality Feedback." arXiv:2310.01377.
- DeepSeek-AI. (January 22, 2025). "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning." arXiv:2501.12948.
- Anthropic. (December 15, 2022). "Constitutional AI: Harmlessness from AI Feedback." Research overview. https://www.anthropic.com/research/constitutional-ai-harmlessness-from-ai-feedback
- Lambert, N. (2024). "Constitutional AI and AI Feedback." RLHF Book, Chapter 13. https://rlhfbook.com/c/13-cai
- Wolfe, C. R. (2024). "RLAIF: Reinforcement Learning from AI Feedback." Deep (Learning) Focus newsletter. https://cameronrwolfe.substack.com/p/rlaif-reinforcement-learning-from
- Singhal, P., Goyal, T., Xu, J., Durrett, G. (2023). "A Long Way to Go: Investigating Length Correlations in RLHF." arXiv:2310.03716.