RLAIF

Updated Jun 22, 2026

Reinforcement Learning from AI Feedback (RLAIF) is a family of alignment techniques for large language models in which the preference labels used to fine-tune a model are produced by another AI system, typically a strong language model acting as a judge, rather than by paid human annotators.^[1]^[2] RLAIF directly mirrors the structure of Reinforcement Learning from Human Feedback (RLHF), but replaces the costly human comparison data with judgments generated by an off-the-shelf or specially prompted LLM. The acronym was coined by Anthropic in its December 2022 Constitutional AI paper,^[1] and the technique was studied as a standalone training recipe in a September 2023 Google Research paper titled RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback by Lee, Phatale, Mansoor and collaborators.^[2] That paper concluded that "RLAIF can achieve performance on-par with using human feedback, offering a potential solution to the scalability limitations of RLHF."^[2]

Proponents argue that RLAIF removes the dominant cost in modern post-training, the human preference annotation step, while yielding gains that are competitive with RLHF on summarization, helpfulness, and harmlessness tasks.^[2] Critics note that delegating preference labels to a model risks propagating that model's biases, encouraging sycophancy, and opening new attack surfaces for reward hacking.^[3] In the modern alignment toolbox, RLAIF sits alongside supervised fine-tuning, classical RLHF, Direct Preference Optimization (DPO), and reinforcement learning with verifiable rewards (RLVR), and elements of it are now embedded in nearly every frontier post-training pipeline, including those used to train Claude, Gemini, Llama 3, and DeepSeek models.^[1]^[4]^[5]

It is important to distinguish RLAIF from the broader idea of synthetic data. Both rely on AI-generated artifacts, but synthetic data refers to AI-authored training inputs or demonstrations, while RLAIF specifically describes AI-generated preferences over candidate model outputs that drive a reward model or reinforcement learning objective.

What problem does RLAIF solve?

Modern instruction-following LLMs are typically built in three stages: large-scale pretraining, supervised fine-tuning (often called instruction tuning) on demonstrations of desired behavior, and a preference-based stage in which the model learns to produce responses that humans (or AI judges) prefer.^[6] The third stage was popularized by OpenAI's InstructGPT release^[6] and the launch of ChatGPT in November 2022. Classical RLHF in this setting depends on tens to hundreds of thousands of pairwise comparisons collected from paid human raters. Each rater reads two model responses to the same prompt and chooses which is better. The preferences are then used to fit a reward model, which scores future generations during reinforcement learning, usually with Proximal Policy Optimization (PPO) or a closely related algorithm.^[6]^[7]

Human labeling has three persistent problems. First, it is expensive: gathering high-quality human preference labels is, in the words of the RLAIF authors, "a time-consuming and expensive endeavor," and academic groups often cannot afford competitive datasets.^[2]^[6] Second, it is slow: each new capability or risk category requires a new rater pool, which can take weeks to months to assemble and train. Third, it does not scale gracefully. As models improve, the quality ceiling for labeled data is bounded by the rater workforce, and supervisors find it increasingly hard to evaluate long, technical, or domain-specific outputs.^[2]

RLAIF promises to address all three pain points. A capable LLM can label millions of preference pairs in hours, costs roughly an order of magnitude less per label than a human annotator by Lee et al.'s estimate,^[2] and can be flexibly re-prompted to focus on new categories of behavior. Crucially, the hope is that as base models become stronger, AI judges become stronger as well, and the entire alignment pipeline scales with capability instead of being bottlenecked by a fixed human workforce. This is the scalable-oversight motivation captured by the Constitutional AI paper's opening line: "As AI systems become more capable, we would like to enlist their help to supervise other AIs."^[1]

When was RLAIF invented?

Anthropic Constitutional AI as proto-RLAIF (December 2022)

The acronym RLAIF first appears in Anthropic's paper Constitutional AI: Harmlessness from AI Feedback, posted to arXiv on December 15, 2022.^[1] In that work, Yuntao Bai and dozens of co-authors describe a two-stage recipe for training a helpful and harmless assistant using only a written list of principles, the "constitution," as direct human oversight. The pipeline has a supervised phase, in which the model self-critiques and revises its own responses, and a reinforcement-learning phase, in which the system samples pairs of model responses, asks an evaluator LLM which response better satisfies the constitution, and uses those AI-generated preference labels to train a reward model that drives the RL stage.^[1] Anthropic abbreviated this RL stage as "RL from AI Feedback" or RLAIF, explicitly framed in contrast to RLHF. The paper states the goal plainly: to train "a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs."^[1]

Anthropic's earlier April 2022 paper Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback (Bai et al., arXiv:2204.05862) had laid the groundwork for the helpfulness/harmlessness framing and the RLHF training stack.^[8] The Constitutional AI work essentially replaced the human harmlessness labels in that pipeline with AI-generated labels guided by written principles. Because the resulting system was bundled with a specific supervised "critique and revise" stage, the AI community initially conflated Constitutional AI and RLAIF. Subsequent literature has clarified that Constitutional AI is one concrete instance of the broader RLAIF idea, and that RLAIF in general only requires that an AI system produce the preference labels, regardless of whether a written constitution is used.

Google Lee et al. and the canonical RLAIF paper (September 2023)

The September 2023 Google Research paper RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback by Harrison Lee, Samrat Phatale, Hassan Mansoor, Thomas Mesnard, Johan Ferret, Kellie Lu, Colton Bishop, Ethan Hall, Victor Carbune, Abhinav Rastogi, and Sushant Prakash is generally credited with separating RLAIF from Constitutional AI and studying it as a general method.^[2] The paper was later revised and accepted to ICML 2024 under the expanded title RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback.^[9]

Lee et al. used PaLM 2 models of varying sizes as both policy and labeler, and tested the recipe on three task suites: TL;DR summarization, helpful dialogue, and harmless dialogue.^[2] Their headline result was that RLAIF can match the performance of RLHF on summarization and helpfulness, and can outperform it on harmlessness. Separately, RLAIF still improved over the supervised baseline even when the AI labeler was the same size as the policy model. The paper also introduced direct RLAIF (d-RLAIF), in which the judge LLM is queried online during RL rather than via a separately trained reward model. The combination of strong empirical results and a clean recipe is what made RLAIF a standard term in the alignment literature.

Modern hybrid approaches (2024 to 2026)

By 2024, the academic literature had largely moved past "RLHF vs. RLAIF" as a binary choice. Frontier post-training pipelines documented by Meta for Llama 3,^[10] Google DeepMind for Gemini and Gemma, Anthropic for Claude, and DeepSeek for V2 and V3 all combine human labels, AI labels, and verifiable rewards in some mix. New methods such as iterative DPO with AI feedback,^[11] self-rewarding language models,^[12] meta-rewarding language models,^[13] and process-reward-modeling pipelines blur the boundary between RLAIF, distillation, and self-training. Open-source ecosystems built around tools like Hugging Face's TRL library, Argilla's distilabel, and the UltraFeedback dataset^[14] made AI-feedback-style training widely accessible.

A second shift came from reasoning models such as DeepSeek-R1^[15] and OpenAI's o1/o3 family, which lean heavily on RLVR (rewards from programmatic checkers, code unit tests, and math graders) rather than LLM judges. This split the post-training landscape into a "soft objectives, judge-based" lane (helpfulness, harmlessness, tone) and a "hard objectives, verifier-based" lane (math, code, formal reasoning), with RLAIF dominating the former.

How does RLAIF work?

Most RLAIF systems follow a four-step recipe that closely parallels RLHF. The key difference is who provides the preference labels.

Step	Description	Typical artifacts
1. Supervised fine-tuning	Train a base model on instruction-following or demonstration data.	SFT checkpoint
2. Preference pair generation	For each prompt, sample two or more candidate responses from the SFT model.	Prompt plus k responses
3. AI labeling	Query a judge LLM with the original instruction, the candidate responses, and any guiding principles. The judge returns a preferred response or a scalar score.	Pairwise preferences or scores
4. Reward modeling and RL	Train a reward model on the AI preferences using the same Bradley-Terry-style loss as in RLHF, then fine-tune the policy with PPO or another RL algorithm against that reward model.	Reward model plus aligned policy

In the Constitutional AI variant, step 2 is preceded by a critique-and-revise loop in which the model rewrites its own outputs to remove harms before they enter the preference pool, and the labeling prompt in step 3 explicitly references a written list of constitutional principles.^[1]
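Steps 2 and 3 of the recipe can be sketched as a small data-collection loop. This is a minimal illustration, not any lab's implementation: `sample` and `judge` are hypothetical callables standing in for the SFT policy and the judge LLM, and the `PreferencePair` record is an assumed data layout.

```python
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str


def build_preference_pairs(
    prompts: List[str],
    sample: Callable[[str], str],
    judge: Callable[[str, str, str], int],
) -> List[PreferencePair]:
    """Sample two candidates per prompt, then ask the judge to pick one.

    sample(prompt) returns one response from the SFT policy (hypothetical).
    judge(prompt, a, b) returns 0 if `a` is preferred and 1 if `b` is (hypothetical).
    """
    pairs = []
    for prompt in prompts:
        a, b = sample(prompt), sample(prompt)
        if a == b:
            continue  # identical candidates carry no preference signal
        winner = judge(prompt, a, b)
        chosen, rejected = (a, b) if winner == 0 else (b, a)
        pairs.append(PreferencePair(prompt, chosen, rejected))
    return pairs


if __name__ == "__main__":
    rng = random.Random(0)
    toy_sample = lambda p: f"{p} -> draft {rng.randint(0, 999)}"
    toy_judge = lambda p, a, b: 0 if len(a) <= len(b) else 1  # toy rule: prefer shorter
    for pair in build_preference_pairs(["Summarize X"], toy_sample, toy_judge):
        print(pair)
```

The resulting list of (prompt, chosen, rejected) records is the input to step 4, whether that step trains a reward model or runs a direct preference method.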

The strong LLM as preference judge

The core design decision in RLAIF is the LLM-as-judge prompt. A typical labeling prompt presents the original instruction, the two candidate responses (often labeled "Response A" and "Response B"), and asks the judge to pick the better one according to one or more criteria. Lee et al. and Zheng et al. report several prompting techniques that materially affect judge quality:^[2]^[16]

Chain-of-thought reasoning in the labeling prompt improved labeler alignment with human raters by roughly two percentage points on summarization in the Google study.^[2]
Order randomization mitigates position bias, the tendency of LLM judges to prefer whichever response appears first or second regardless of content. The bias was strong in small judges (around 56 percent positional preference for PaLM 2 XS) in Lee et al.'s experiments.^[2]
Length normalization is a partial mitigation for length bias, the tendency to prefer longer responses; Singhal et al. argued that early RLAIF gains were partly driven by uncorrected length bias.^[17]
Few-shot examples in the judge prompt give mixed results in the Google study, sometimes hurting alignment with humans.^[2]
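One common way to counter position bias is to query the judge with both orderings and average the result. The sketch below assumes a hypothetical wrapper `judge_prob_first` that returns the judge's probability of preferring whichever response is shown first; the function names are illustrative.

```python
from typing import Callable


def debiased_preference(
    prompt: str,
    response_1: str,
    response_2: str,
    judge_prob_first: Callable[[str, str, str], float],
) -> float:
    """Return P(response_1 is better), averaged over both presentation orders.

    judge_prob_first(prompt, shown_first, shown_second) is a hypothetical judge
    wrapper returning P(the response shown first is preferred).
    """
    # response_1 shown first: the judge's output is already P(response_1 wins).
    p_forward = judge_prob_first(prompt, response_1, response_2)
    # response_1 shown second: flip the judge's output to get P(response_1 wins).
    p_reverse = 1.0 - judge_prob_first(prompt, response_2, response_1)
    return 0.5 * (p_forward + p_reverse)
```

A judge that always favors the first slot regardless of content yields 0.5 under this scheme, so pure positional preference contributes no training signal.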

Zheng et al. found that strong judges such as GPT-4 can reach over 80 percent agreement with human raters on MT-Bench and Chatbot Arena conversations, the same level of agreement observed between human raters themselves. This is the central empirical claim that makes RLAIF viable.^[16]

Reward model training on AI preferences

The labels from the judge model are typically used to train a reward model with a Bradley-Terry loss: the probability that response A beats response B is modeled as a sigmoid of the reward difference. Once trained, the reward model is frozen and used to score new generations from the policy during RL.^[7] The reward model architecture is usually a copy of the policy's transformer with a single scalar head, sometimes initialized from the SFT checkpoint.
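The Bradley-Terry objective described above can be written out for a single preference pair. This is a scalar sketch using only the standard library; in practice the rewards come from the reward model's scalar head and the loss is averaged over a batch. The function names are illustrative.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def bradley_terry_prob(reward_a: float, reward_b: float) -> float:
    """P(response A beats response B) as a sigmoid of the reward difference."""
    return sigmoid(reward_a - reward_b)


def bradley_terry_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood of the judge's label under the Bradley-Terry model."""
    return -math.log(bradley_terry_prob(reward_chosen, reward_rejected))
```

Minimizing this loss pushes the reward of the judge-preferred response above the reward of the rejected one; equal rewards give a loss of log 2.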

What is direct RLAIF (d-RLAIF)?

A notable extension introduced in the 2023 Google paper is direct RLAIF, often abbreviated d-RLAIF. Direct RLAIF skips the explicit reward model entirely. Instead, during reinforcement learning, the training loop queries the judge LLM for each sampled response (or periodically) and uses the judge's score directly as the reward signal.^[2] Lee et al. describe d-RLAIF as "a technique that circumvents RM training by obtaining rewards directly from an off-the-shelf LLM during RL, which achieves superior performance to canonical RLAIF."^[2]

The motivation is twofold. First, training a reward model adds another moving part that can overfit or go stale, especially if the labels are themselves model-generated. Second, an off-the-shelf judge LLM updates as the underlying foundation model improves, so d-RLAIF can ride along on capability gains without retraining. In Lee et al.'s TL;DR summarization experiments, d-RLAIF reached a 74 percent win rate over the supervised baseline, while same-size canonical RLAIF reached 68 percent.^[2] The cost is inference compute: every RL rollout requires at least one judge call, which can be many times more expensive than running a small frozen reward model.
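One way a judge's rating could be turned into a scalar reward is sketched below. It assumes the judge exposes log-probabilities over integer rating tokens; the 1 to 10 scale, the rescaling to [-1, 1], and the name `judge_reward` are illustrative assumptions rather than details stated above.

```python
import math
from typing import Dict


def judge_reward(score_logprobs: Dict[int, float], low: int = 1, high: int = 10) -> float:
    """Map a judge's log-probabilities over integer ratings to a reward in [-1, 1]."""
    # Softmax over the rating tokens only, so the probabilities sum to one.
    max_lp = max(score_logprobs.values())
    weights = {s: math.exp(lp - max_lp) for s, lp in score_logprobs.items()}
    total = sum(weights.values())
    # Expected rating under the normalized distribution.
    expected = sum(s * w / total for s, w in weights.items())
    # Rescale the expected rating from [low, high] to [-1, 1].
    return 2.0 * (expected - low) / (high - low) - 1.0
```

Using the expected rating rather than the single most likely rating gives a smoother reward, which tends to matter when many responses would otherwise receive the same integer score.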

PPO, DPO, and online preference optimization

RLAIF preferences can drive any of several optimization algorithms. The original Anthropic and Google RLAIF systems used PPO against a learned reward model. DPO, introduced by Rafailov et al. in 2023, reformulates preference fine-tuning as a classification-style loss computed directly on preference pairs, removing both the explicit reward model and the RL loop.^[18] DPO can be combined with AI-generated preferences in a recipe sometimes called DPO-AIF or AI-feedback DPO. Many "open RLAIF" pipelines on Hugging Face and elsewhere are in fact AI-feedback DPO rather than PPO-based RLAIF because DPO is simpler and cheaper. Online variants such as Online DPO, iterative DPO, and Online Iterative Preference Optimization further reduce the gap between DPO and full RL by recomputing preferences as the policy drifts during training.
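The DPO objective for a single AI-labeled pair can be sketched as follows. The inputs are summed token log-probabilities of the chosen and rejected responses under the policy and under a frozen reference model; the argument names and the default `beta` are illustrative assumptions.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one preference pair: -log sigmoid(beta * log-ratio margin)."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) equals softplus(-margin); guard against overflow
    # for very negative margins, where softplus(-margin) is approximately -margin.
    if margin < -30.0:
        return -margin
    return math.log1p(math.exp(-margin))
```

When the policy equals the reference model the margin is zero and the loss is log 2; raising the policy's log-probability of the chosen response relative to the rejected one lowers the loss.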

How does RLAIF compare to RLHF empirically?

The Lee et al. study reported the following head-to-head numbers across three task suites using PaLM 2 models as both policies and labelers.^[2]

Task	RLHF (win rate vs. SFT unless noted)	RLAIF (win rate vs. SFT unless noted)	Notes
TL;DR summarization	73 percent	71 percent	Direct head-to-head between RLHF and RLAIF was 50 percent, equally preferred.
Helpful dialogue	64 percent	63 percent	No statistically significant difference.
Harmless dialogue	76 percent harmless	88 percent harmless	RLAIF outperformed RLHF on harmlessness, SFT baseline 64 percent.

The paper also reported a same-size labeler experiment in which the AI judge (PaLM 2 XS) was the same size as the policy. Even in that setting, RLAIF beat the supervised baseline at a 68 percent win rate on summarization,^[2] which Lee et al. framed as a step toward "self-improvement," a scenario in which a model improves through training only on labels generated by a same-size, or even the exact same, checkpoint.^[2]

Subsequent reanalyses qualified these results. Singhal et al.'s 2023 study A Long Way to Go: Investigating Length Correlations in RLHF argued that early RLAIF gains were partly driven by length bias in the judge model, and that more careful length-controlled evaluations narrowed the apparent gap with RLHF or even reversed it for nuanced helpfulness tasks.^[17] On the other hand, several 2024 to 2026 industrial reports argued that for safety-relevant behaviors AI judges can be more consistent than human rater pools, and that RLAIF is the only practical option at frontier scale.

The takeaway from the literature is roughly: for tasks where humans can be replaced by a strong judge with reasonable confidence (summarization, harmlessness, broad helpfulness), RLAIF reaches RLHF parity at a fraction of the cost; for tasks with subtle taste judgments or specialized domains, careful human oversight still matters.

How does Constitutional AI relate to RLAIF?

Constitutional AI is the original RLAIF system and remains the most influential concrete implementation. Its distinguishing feature is the constitution: a short list of natural-language principles such as "please choose the response that is least harmful and least preachy."^[1] The labeling prompt explicitly cites these principles, which gives the RLAIF pipeline a degree of human interpretability that a black-box judge does not provide; a researcher can read the constitution to understand what the model was optimized for, and edit the constitution to change behavior.

The Constitutional AI pipeline has two distinctive stages on top of the generic RLAIF recipe:

Critique-and-revise supervised stage. The model is asked to critique its own potentially harmful responses against the constitution and then rewrite them. The revised responses become supervised training data.^[1]
AI preference labeling stage. Pairs of model responses are then ranked by an evaluator LLM guided by the constitution. These preferences train a reward model used in standard RLHF-style PPO training, but with no human harmlessness labels.^[1]
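The critique-and-revise stage can be sketched as a loop over principles. This is a simplified illustration: `generate` is a hypothetical text-completion callable, the prompt templates are invented for the example, and iterating over every principle in order is an assumption made for brevity.

```python
from typing import Callable, Dict, List


def critique_and_revise(
    prompt: str,
    principles: List[str],
    generate: Callable[[str], str],
) -> Dict[str, str]:
    """Produce one supervised training example via self-critique and revision.

    generate(text) is a hypothetical callable that returns a model completion.
    """
    response = generate(f"Human: {prompt}\n\nAssistant:")
    for principle in principles:
        # Ask the model to critique its own response against one principle.
        critique = generate(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique the response according to this principle: {principle}"
        )
        # Ask the model to rewrite the response in light of the critique.
        response = generate(
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    # The final revision becomes the supervised fine-tuning target.
    return {"prompt": prompt, "completion": response}
```

Only the original prompt and the final revision are kept as training data; the intermediate critiques serve to steer the rewrite.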

Anthropic has used Constitutional AI as a core safety training method for the Claude model family since Claude 2.^[19] The company has also explored Collective Constitutional AI in 2023 and 2024 experiments that sourced principles from a representative sample of the US public, broadening the source of the values encoded in the constitution.

Which frontier labs use RLAIF?

By the 2024 to 2026 period, virtually every major frontier post-training stack documented some form of RLAIF or AI-feedback-driven preference labeling. Concrete reports include:

Anthropic's Claude. Anthropic states that all Claude models from Claude 2 onward have used a Constitutional AI variant as a major component of safety training, in combination with human feedback for helpfulness.^[19]
Google DeepMind's Gemini and Gemma. Google's technical reports describe extensive use of synthetic preference data and AI judges in alignment training, building directly on the Lee et al. RLAIF line of work.^[2]
Meta's Llama 3 Instruct series. The Llama 3 technical report describes a mix of human preference labels and AI-generated preferences in its post-training pipeline, with synthetic preference data used to scale safety and code training in particular.^[10]
DeepSeek V2 and V3 chat. DeepSeek's instruction-tuned chat models used AI-feedback-style preference data, while DeepSeek-R1 introduced a heavy RLVR component for reasoning and used the GRPO algorithm instead of PPO.^[15]
Open-source ecosystem. Models in Allen AI's Tulu series, Hugging Face's Zephyr line, and many open community models use UltraFeedback^[14] or similar GPT-4-judged preference datasets, sometimes combined with DPO rather than PPO.

The dominant pattern at frontier scale is hybridization: human raters define and audit guidelines, AI judges scale those guidelines to large preference datasets, programmatic verifiers anchor objective tasks, and the policy is trained against a combination of all three signals.

What are the limitations and risks of RLAIF?

RLAIF inherits the limitations of its judge model. Several distinct failure modes have been documented.

Risk	Description
Bias propagation	If the judge prefers verbose answers, the policy will learn to be verbose. Cultural or ideological biases in the judge propagate into the policy.^[16]
Sycophancy	Judges that reward agreement train policies that agree with the user even when the user is wrong. The effect has been measured to grow across iterations of self-rewarding loops.^[3]
Reward hacking	The policy may discover prompt patterns, formatting tricks, or token-level exploits that maximize the judge's score without improving real quality.
Position and length bias	LLM judges often prefer the first response shown or the longer response. Both effects can be partially mitigated by order randomization and length normalization.^[2]^[17]
Homogenization	Many teams use a small set of judge models (commonly GPT-4 class), risking convergence on one model's stylistic preferences across the industry.
Hallucination passthrough	If the judge confidently misjudges factual accuracy, the policy is trained to confidently reproduce the same errors.
Bootstrapping ceiling	Policy quality is bounded by the labeler's discrimination ability, which can plateau even as compute grows.^[2]
Loss of human values	Replacing human labels reduces the channel through which human values enter the model, which some researchers consider a fundamental safety concern, particularly as AI labelers themselves become more capable than typical human raters.

A 2024 to 2025 line of work from Anthropic and others examined emergent misalignment in models trained with reward-hackable signals, finding that policies optimized against weak reward signals can develop broadly misaligned behaviors that generalize far beyond the original training task. The mechanism is not specific to AI feedback, but AI feedback signals are particularly easy to exploit because the judge is itself a fallible language model.

A related concern is model collapse. If AI labels are produced by a narrow distribution of judges, training on those labels can shrink the policy's diversity over successive generations. The risk grows in iterated self-rewarding loops, where the same model both produces and scores responses. Wu et al.'s 2024 meta-rewarding work explicitly addresses this by adding an LLM-as-a-meta-judge step that scores the model's own judgments to slow quality drift.^[13]

Mitigations developed in practice include ensemble judges (averaging across several judge models from different organizations), explicit anti-sycophancy prompts in the judge template, length penalties in the reward model, the use of RLVR-style verifiable rewards wherever possible to anchor the policy in objective signals, and human "red team" audits of the resulting model behavior.
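Two of these mitigations, ensemble judging and a length penalty, can be combined in a single reward function. The sketch below is illustrative: the judges are hypothetical callables returning a score, and the word-count threshold and penalty coefficient are arbitrary assumed values.

```python
from statistics import mean
from typing import Callable, Sequence


def ensemble_reward(
    prompt: str,
    response: str,
    judges: Sequence[Callable[[str, str], float]],
    length_penalty: float = 0.001,
    target_words: int = 200,
) -> float:
    """Average several judges' scores, then subtract a penalty for excess length.

    Each judge(prompt, response) is a hypothetical callable returning a float score.
    """
    base = mean(judge(prompt, response) for judge in judges)
    # Penalize only the words beyond the target length.
    excess_words = max(0, len(response.split()) - target_words)
    return base - length_penalty * excess_words
```

Averaging dilutes any single judge's idiosyncratic preferences, and the penalty term removes some of the incentive to pad responses.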

Related techniques

Direct Preference Optimization

DPO, introduced by Rafailov et al. in 2023, replaces the reward model and RL loop with a classification-style loss computed directly on preference pairs.^[18] It is often paired with AI-generated preferences (sometimes labeled DPO-AIF or AI-feedback DPO) because it is simpler and cheaper than PPO-based RLAIF, and the combination dominates open-source post-training pipelines.

LLM-as-a-judge and MT-Bench

LLM-as-a-judge is the prompting pattern that underlies RLAIF labeling. The seminal 2023 paper Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena by Zheng et al. introduced MT-Bench (a multi-turn evaluation set), studied pairwise-comparison and single-answer grading procedures, and demonstrated that strong LLM judges such as GPT-4 reach over 80 percent agreement with human raters, matching human-to-human agreement.^[16] MT-Bench, alongside the separately developed AlpacaEval, became a standard public benchmark for AI-feedback-driven post-training.

Self-rewarding language models

A related research thread asks whether the policy can label its own data. In January 2024, Weizhe Yuan and collaborators at Meta released Self-Rewarding Language Models, which proposed an iterative loop in which a single model both generates candidate responses and scores them via LLM-as-a-judge prompting.^[12] The preference pairs are then used to train the next iteration of the same model with DPO. After three iterations of this loop starting from Llama 2 70B, the resulting model reportedly outperformed Claude 2, Gemini Pro, and GPT-4 0613 on the AlpacaEval 2.0 leaderboard. A follow-up Meta paper, Meta-Rewarding Language Models (Wu et al., 2024), added an LLM-as-a-meta-judge step that scores the model's own judgments to mitigate quality drift across iterations.^[13]

Self-rewarding loops blur the line between RLAIF and self-distillation. Strictly speaking they remain RLAIF because the preference signal is generated by a model, but the labeler and policy are the same network at slightly different checkpoints rather than a separate frozen teacher.
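The pair-construction step of a self-rewarding loop can be sketched as follows. It assumes the model has already generated several candidates and scored each one itself; picking the highest- and lowest-scored candidates and skipping ties is one plausible selection rule, and `select_pair` is an illustrative name.

```python
from typing import List, Optional, Sequence, Tuple


def select_pair(
    candidates: List[str],
    self_scores: Sequence[float],
) -> Optional[Tuple[str, str]]:
    """Turn self-assigned scores into one (chosen, rejected) pair, or None on a tie."""
    if len(candidates) != len(self_scores) or len(candidates) < 2:
        raise ValueError("need at least two candidates with one score each")
    indices = range(len(candidates))
    best = max(indices, key=lambda i: self_scores[i])
    worst = min(indices, key=lambda i: self_scores[i])
    if self_scores[best] == self_scores[worst]:
        return None  # all scores equal: no preference signal to train on
    return candidates[best], candidates[worst]
```

The pairs returned here would feed a DPO update, after which the updated model generates and scores the next round of candidates.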

Reinforcement learning with verifiable rewards

RLVR is a 2024 to 2025 era technique in which the reward signal comes from a programmatic checker rather than any judge. For math problems the checker verifies the final answer; for code the checker runs unit tests. DeepSeek's R1 model trained its reasoning behavior almost entirely with RLVR via the GRPO algorithm, eliminating both the reward model and the critic.^[15] RLVR is generally complementary to RLAIF: programs verify objective correctness, while LLM judges score stylistic and safety properties that have no automatic checker.
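A verifiable reward and the group-relative advantage used by GRPO-style training can be sketched in a few lines. The exact-match checker is a deliberately simple stand-in for a real math grader, and the use of the population standard deviation plus a small `eps` is an assumption of this sketch.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def exact_match_reward(model_answer: str, reference: str) -> float:
    """Programmatic checker: 1.0 if the final answer matches the reference, else 0.0."""
    return 1.0 if model_answer.strip() == reference.strip() else 0.0


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against the other samples drawn for the same prompt.

    No learned critic is needed: the group mean serves as the baseline.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

If every sample in a group receives the same reward, all advantages are zero and that prompt contributes no gradient, which is why prompts of intermediate difficulty carry the most signal.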

Process reward models

Process reward models score individual reasoning steps rather than just the final answer. They can be trained with human labels (as in OpenAI's PRM800K) or with AI labels in an RLAIF-style pipeline. Recent reasoning systems often combine RLVR for outcome correctness with an RLAIF-style process reward model for style and intermediate-step quality.
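Step-level scores must be aggregated before a whole solution can be ranked. The sketch below shows two commonly used aggregations, the product and the minimum of per-step correctness probabilities; the function name and the `how` parameter are illustrative, and which aggregation works best is an empirical question.

```python
import math
from typing import Sequence


def solution_score(step_probs: Sequence[float], how: str = "product") -> float:
    """Aggregate per-step correctness probabilities into one solution-level score."""
    if not step_probs:
        raise ValueError("need at least one step score")
    if how == "product":
        # Treats steps as independent: every step must be correct.
        return math.prod(step_probs)
    if how == "min":
        # The solution is only as good as its weakest step.
        return min(step_probs)
    raise ValueError(f"unknown aggregation: {how}")
```

Both aggregations let a single weak step pull down the whole solution, which is the behavior that distinguishes process rewards from outcome-only rewards.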

Synthetic data vs. RLAIF

A common point of confusion is the relationship between RLAIF and synthetic data. Synthetic data refers to AI-generated training inputs or demonstrations (for example, AI-authored instruction-response pairs used in supervised fine-tuning). RLAIF refers specifically to AI-generated preferences over candidate model outputs. Both rely on AI as a source of training signal, but they enter the model at different stages (SFT vs. preference/RL) and have different failure modes. Many production pipelines use both.

How does RLAIF compare to other alignment methods?

Method	Label source	Reward model	RL stage	Typical use
RLHF	Humans	Yes	PPO or similar	General helpfulness, safety
RLAIF	LLM judge	Yes	PPO or similar	Scaled helpfulness, harmlessness
Direct RLAIF	LLM judge queried online	No	PPO with judge as reward	Cheap iteration
DPO	Humans or LLM	No	None, closed form	Open-source post-training
Constitutional AI	LLM judge guided by principles	Yes	PPO	Harmlessness with interpretable rules
Self-rewarding LM	Same model scoring itself	No	Iterative DPO	Bootstrapping
RLVR	Programmatic checker	No	GRPO or PPO	Math, code, verifiable reasoning

Along the cost axis, human RLHF is the most expensive per label. Classical RLAIF is substantially cheaper: Lee et al. estimated AI labeling to be more than ten times cheaper than human annotation.^[2] Direct RLAIF and self-rewarding can be cheaper still per label, though they require many more inference calls during training.

Notable papers and milestones

Year	Paper or release	Key contribution
2022	Bai et al., Constitutional AI: Harmlessness from AI Feedback (arXiv:2212.08073)^[1]	Coined RLAIF; introduced critique-and-revise plus AI preference labeling.
2023	Rafailov et al., Direct Preference Optimization (arXiv:2305.18290)^[18]	Removed the reward model and RL loop; often paired with AI feedback.
2023	Zheng et al., Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena (arXiv:2306.05685)^[16]	Established LLM-as-judge methodology; introduced MT-Bench.
2023	Lee et al., RLAIF: Scaling RLHF with AI Feedback (arXiv:2309.00267)^[2]	First standalone RLAIF study; introduced d-RLAIF and same-size labeler self-improvement.
2023	Cui et al., UltraFeedback (arXiv:2310.01377)^[14]	Open RLAIF dataset of one million GPT-4 feedback annotations on 250k conversations.
2024	Yuan et al., Self-Rewarding Language Models (arXiv:2401.10020)^[12]	Iterative self-labeling with Llama 2 70B and DPO.
2024	Lee et al., RLAIF vs RLHF (ICML 2024, PMLR 235)^[9]	Conference version with expanded harmlessness results.
2024	Wu et al., Meta-Rewarding Language Models (arXiv:2407.19594)^[13]	Added LLM-as-a-meta-judge to slow self-rewarding drift.
2025	DeepSeek-R1 release (arXiv:2501.12948)^[15]	Showed that pure RLVR can teach reasoning without an LLM judge, repositioning the role of RLAIF.

Glossary of related terms

Term	Meaning
Judge model	An LLM used to compare or score candidate responses during RLAIF.
Labeler model	Synonym for judge model, common in the Google paper.
Preference pair	A prompt and two candidate responses with a winner label.
Bradley-Terry model	The statistical model underlying most reward models; the probability that response A beats B is a sigmoid of the reward difference.
Position bias	The tendency of LLM judges to prefer the first or second response regardless of content.
Length bias	The tendency of LLM judges to prefer longer responses.
LLM-as-a-judge	The prompting pattern where an LLM is asked to score or compare other outputs.
Constitution	A short list of natural-language principles guiding the judge, used in Constitutional AI.
Same-size labeler	A judge model that is the same size as, or identical to, the policy.
Strict self-improvement	The case where a model improves through training only on labels generated by itself.

References

1. Bai, Y., Kadavath, S., Kundu, S., Askell, A., et al. (December 15, 2022). "Constitutional AI: Harmlessness from AI Feedback." arXiv:2212.08073. https://arxiv.org/abs/2212.08073
2. Lee, H., Phatale, S., Mansoor, H., Mesnard, T., Ferret, J., Lu, K., Bishop, C., Hall, E., Carbune, V., Rastogi, A., Prakash, S. (September 1, 2023). "RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback." arXiv:2309.00267. https://arxiv.org/abs/2309.00267
3. Sharma, M., et al. (2023). "Towards Understanding Sycophancy in Language Models." Anthropic. arXiv:2310.13548. https://arxiv.org/abs/2310.13548
4. Anthropic. (May 9, 2023). "Claude's Constitution." https://www.anthropic.com/news/claudes-constitution
5. DeepSeek-AI. (December 2024). "DeepSeek-V3 Technical Report." arXiv:2412.19437. https://arxiv.org/abs/2412.19437
6. Ouyang, L., et al. (March 4, 2022). "Training language models to follow instructions with human feedback." arXiv:2203.02155 (InstructGPT). https://arxiv.org/abs/2203.02155
7. Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., Amodei, D. (June 12, 2017). "Deep reinforcement learning from human preferences." arXiv:1706.03741. https://arxiv.org/abs/1706.03741
8. Bai, Y., et al. (April 12, 2022). "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback." arXiv:2204.05862. https://arxiv.org/abs/2204.05862
9. Lee, H., et al. (2024). "RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback." Proceedings of the 41st International Conference on Machine Learning, PMLR 235. https://proceedings.mlr.press/v235/lee24t.html
10. Grattafiori, A., et al. (Meta AI). (July 31, 2024). "The Llama 3 Herd of Models." arXiv:2407.21783. https://arxiv.org/abs/2407.21783
11. Hugging Face TRL library. "Online DPO Trainer." https://huggingface.co/docs/trl/main/en/online_dpo_trainer
12. Yuan, W., Pang, R. Y., Cho, K., Li, X., Sukhbaatar, S., Xu, J., Weston, J. (January 18, 2024). "Self-Rewarding Language Models." arXiv:2401.10020. https://arxiv.org/abs/2401.10020
13. Wu, T., Yuan, W., Golovneva, O., Xu, J., Tian, Y., Jiao, J., Weston, J., Sukhbaatar, S. (July 28, 2024). "Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge." arXiv:2407.19594. https://arxiv.org/abs/2407.19594
14. Cui, G., Yuan, L., Ding, N., Yao, G., Zhu, W., Ni, Y., Xie, G., Liu, Z., Sun, M. (October 2, 2023). "UltraFeedback: Boosting Language Models with High-quality Feedback." arXiv:2310.01377. https://arxiv.org/abs/2310.01377
15. DeepSeek-AI. (January 22, 2025). "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning." arXiv:2501.12948. https://arxiv.org/abs/2501.12948
16. Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E. P., Zhang, H., Gonzalez, J. E., Stoica, I. (June 9, 2023). "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena." arXiv:2306.05685. https://arxiv.org/abs/2306.05685
17. Singhal, P., Goyal, T., Xu, J., Durrett, G. (October 5, 2023). "A Long Way to Go: Investigating Length Correlations in RLHF." arXiv:2310.03716. https://arxiv.org/abs/2310.03716
18. Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., Finn, C. (May 29, 2023). "Direct Preference Optimization: Your Language Model is Secretly a Reward Model." arXiv:2305.18290. https://arxiv.org/abs/2305.18290
19. Anthropic. (December 15, 2022). "Constitutional AI: Harmlessness from AI Feedback." Research overview. https://www.anthropic.com/research/constitutional-ai-harmlessness-from-ai-feedback
