Reinforcement Learning from Human Feedback (RLHF) is a machine learning technique that trains artificial intelligence systems to behave according to human preferences by learning reward functions from human feedback rather than hand-coded rules.[1] The technique combines supervised learning, reward modeling from human preferences, and reinforcement learning optimization to teach AI systems complex behaviors that are difficult to specify explicitly.
RLHF has become the industry-standard method for aligning large language models (LLMs) with human values, enabling systems like ChatGPT, Claude, and GPT-4 to follow instructions, provide helpful responses, and avoid harmful outputs.[2] Instead of using a hand-crafted reward function to specify the task in a reinforcement learning setup, RLHF involves learning a reward model directly from human feedback, and then optimizing the agent's policy using this learned reward signal.[3] RLHF is particularly useful for tasks where the ideal behavior is easy for humans to recognize but difficult to program explicitly, such as judging whether an answer is helpful or whether a joke is funny.
Explain like I'm 5 (ELI5)
Imagine you're teaching a dog a new trick, but you can't tell it exactly what to do in words. Instead, every time the dog tries something, you say "good dog!" or "bad dog!" based on whether it did what you wanted. Over time, the dog figures out what makes you happy and keeps doing that.
RLHF works the same way with computers. A computer writes many different answers to a question. Then people look at pairs of answers and say which one they like better. A second computer program learns what makes people happy based on all these choices. Then the first computer practices writing answers, trying to make the second program (which learned what people like) give it a high score. After lots of practice, the computer gets really good at writing answers that people find helpful and safe.
This is how ChatGPT and Claude learned to be so good at answering questions in a way that feels natural and useful.
Overview
The core innovation of RLHF lies in learning what humans want rather than explicitly programming it. Humans provide comparative judgments between AI outputs, a reward model learns to predict these preferences, and reinforcement learning optimizes the AI to maximize predicted rewards while maintaining fluency and coherence.[4]
The standard RLHF pipeline consists of three distinct stages:
- Supervised fine-tuning (SFT): trains a pretrained language model on high-quality human demonstrations to establish basic instruction-following capabilities
- Reward model training: collects human preferences by showing annotators multiple AI-generated responses and training a model to predict these preferences
- Reinforcement learning optimization: uses the reward model to fine-tune the AI policy with algorithms like Proximal Policy Optimization (PPO), incorporating a KL divergence penalty to prevent drift
This approach addresses the reward specification problem: for complex tasks like writing helpful responses or generating creative content, it is nearly impossible to write explicit rules capturing what makes outputs good. RLHF leverages humans' ability to judge quality when comparing examples, even if they cannot articulate precise criteria.[3]
History and development
Early foundations (2008-2011)
Training AI systems from human feedback has long been explored as a way to handle objectives that are hard to formally specify. The intellectual foundations of RLHF trace back to research on learning from human feedback in the late 2000s. The TAMER framework (Training an Agent Manually via Evaluative Reinforcement), introduced by Knox and Stone in 2008, allowed humans to guide an RL agent by giving scalar feedback signals, effectively shaping the agent's policy via human reinforcement instead of a predefined reward function.[5] These efforts demonstrated the feasibility of learning from human preferences but were limited to relatively simple environments.
The conceptual leap to preference-based reinforcement learning occurred in 2011 when two independent research teams simultaneously published foundational work. Akrour et al. introduced preference-based policy learning, demonstrating that agents could learn directly from expert rankings of policies without simulator access or explicit rewards.[6]
The breakthrough: deep RLHF (2017)
Modern RLHF took shape with the landmark 2017 paper "Deep Reinforcement Learning from Human Preferences" by researchers from OpenAI and DeepMind, led by Paul Christiano and Jan Leike.[1] This work used deep neural networks to scale preference-based learning to complex, high-dimensional tasks. The paper demonstrated that agents could master challenging behaviors from remarkably little human feedback: the algorithm learned a difficult maneuver (a backflip for a simulated robot) using about 900 bits of human feedback, which amounted to less than an hour of a human trainer's time.[1] A relatively small amount of well-placed human feedback (on less than 1% of the agent's interactions) was sufficient to significantly outperform baselines and even achieve superhuman scores in some tasks, without the agent ever seeing the true programmed rewards of the environment.
The key innovations included:
- Training deep neural network reward models from pairwise comparisons of trajectory segments
- Optimizing the policy against the learned reward with deep RL algorithms (A2C for the Atari games and TRPO for the simulated robotics tasks)
- Employing active query selection strategies to identify the most informative trajectory pairs
The success of this 2017 work established RLHF as a promising technique for aligning AI behavior with human-desired outcomes, sparking broader research into human-in-the-loop learning for AI safety and alignment.
Evolution to language models (2019-2020)
Following the initial breakthrough, subsequent research expanded RLHF into new domains. Applying RLHF to natural language presented new challenges: language generation involves discrete tokens, massive action spaces, and subtle quality distinctions. OpenAI's 2019 paper "Fine-Tuning Language Models from Human Preferences" made the first major application of RLHF to language models, fine-tuning GPT-2 on sentiment control, descriptiveness, and summarization using only 5,000 to 60,000 human comparisons.[7] The work built on advances in generative pretraining and applied reward learning to four natural language tasks, demonstrating that RLHF could effectively steer language model behavior.
Building on this foundation, in 2020 OpenAI applied RLHF to text summarization in the paper "Learning to Summarize from Human Feedback." A reward model learned to predict which summaries people preferred, and an RL policy (initialized from a pretrained GPT-3-style model) was optimized to maximize this learned reward.[8] Models trained with roughly 60,000 human preference labels significantly outperformed much larger supervised models, and their summaries were preferred by human evaluators over those from the original model, from supervised baselines, and even over the human-written reference summaries. This provided one of the first demonstrations that RLHF can successfully guide large-scale natural language processing (NLP) models on real-world tasks.
Mainstream deployment (2022-present)
RLHF gained widespread attention with the development of InstructGPT and ChatGPT by OpenAI in 2022. InstructGPT is a family of GPT-3 based models fine-tuned using human feedback to better follow user instructions. The March 2022 InstructGPT paper represented RLHF's transition from research to industry-standard practice.[2] In the InstructGPT work, human annotators first provided demonstration answers and ranking comparisons for model outputs. Then a reward model was trained on these rankings, and finally the base model was further optimized via PPO to produce answers that maximize the reward model's score.
This process led to dramatic improvements in alignment with user intent. According to OpenAI, testers "significantly preferred" the outputs of a 1.3 billion-parameter InstructGPT model over those of the original 175 billion-parameter GPT-3 on a wide range of user prompts, even though the InstructGPT model had more than 100x fewer parameters. RLHF fine-tuning improved truthfulness on the TruthfulQA benchmark and reduced generation of toxic outputs, while maintaining performance on most academic NLP benchmarks (an effect Ouyang et al. attributed in part to mixing pretraining gradients into the RL update).[2]
These RLHF-trained InstructGPT models were deployed as the default models in OpenAI's API in 2022, and the approach paved the way for ChatGPT, a conversational AI launched in November 2022 that was built by fine-tuning GPT-3.5 with human feedback and that OpenAI explicitly credited to RLHF.[9] ChatGPT reached an estimated 100 million users within two months of launch, catalyzing industry-wide adoption.
Concurrently, the RLHF paradigm has been adopted by other leading AI labs. Anthropic developed Constitutional AI, extending RLHF with AI-generated feedback.[10] Their December 2022 paper introduced "RLAIF" (Reinforcement Learning from AI Feedback), where AI models evaluate responses according to written constitutional principles. Anthropic built its assistant Claude, refining it using RLHF and related techniques.
DeepMind developed a dialogue agent called Sparrow in 2022, which was trained via reinforcement learning on feedback from human reviewers to make its answers more correct and safer.[11] Sparrow uses human preference modeling to learn to avoid unsafe or misleading responses while engaging in helpful conversation. DeepMind also applied RLHF in models like Gopher and developed several notable applications including GopherCite, which trained models to cite evidence from the web.[12]
By 2023-2025, RLHF became ubiquitous across the AI industry and a standard part of the training pipeline for state-of-the-art large language models. Meta AI's Llama 2 documentation provided comprehensive public implementation details,[13] Llama 3.1 used an iterative SFT + rejection sampling + DPO recipe,[14] and Google's Gemini[15] and virtually all major deployed language models now use RLHF or variants as standard practice. Starting with OpenAI o1 in late 2024 and DeepSeek-R1 in early 2025, large-scale reinforcement learning was also extended to train reasoning models that learn to produce extended chains of thought.[16][17]
| Year | Milestone | Key reference |
|---|---|---|
| 2008 | TAMER Framework: first demonstration of learning from human evaluative feedback | Knox & Stone[5] |
| 2011 | Preference-based RL: established preference learning foundations | Akrour et al.[6] |
| 2017 | Deep RLHF: introduction of deep RL from human preferences for games and robotics | Christiano et al.[1] |
| 2017 | PPO: introduced dominant RL algorithm for RLHF | Schulman et al.[18] |
| 2019 | Fine-tuning GPT-2: first major RLHF application to language models | Ziegler et al.[7] |
| 2020 | Learning to Summarize: demonstrated RLHF superiority over larger supervised models | Stiennon et al.[8] |
| 2022 | InstructGPT: established RLHF as viable for general-purpose alignment | Ouyang et al.[2] |
| 2022 | HH-RLHF: training a helpful and harmless assistant with RLHF | Bai et al.[19] |
| 2022 | Constitutional AI: introduced RLAIF and explicit value specification | Bai et al.[10] |
| 2023 | "Let's Verify Step by Step": process supervision and PRM800K | Lightman et al.[20] |
| 2023 | DPO: simpler alternative to RL-based training | Rafailov et al.[21] |
| 2023 | Open problems in RLHF: examined fundamental limitations | Casper et al.[22] |
| 2023 | Llama 2: most detailed public RLHF documentation | Touvron et al.[13] |
| 2024 | KTO: alignment via binary feedback using prospect theory | Ethayarajh et al.[23] |
| 2024 | GRPO: critic-free RL for reasoning using group-relative scoring | Shao et al.[24] |
| 2024 | Llama 3.1: iterative SFT + rejection sampling + DPO at frontier scale | Dubey et al.[14] |
| 2024 | Tulu 3: introduced RLVR (RL with Verifiable Rewards) | Lambert et al.[25] |
| 2024 | "Is DPO Superior to PPO?": revisits PPO scaling for alignment | Xu et al.[26] |
| 2025 | DeepSeek-R1: large-scale RL for reasoning (Nature) | Guo et al.[17] |
Technical methodology
RLHF typically proceeds in three main steps, starting from a pretrained model: supervised fine-tuning (an intermediate step that is sometimes omitted), training a reward model, and optimizing the policy with reinforcement learning.[4]
Stage 1: supervised fine-tuning (SFT)
RLHF usually starts with a pretrained model or agent that has learned broadly from a large dataset or through conventional RL. In NLP applications, this is a large language model trained on vast text corpora (e.g., GPT). In control tasks or games, this could be an agent with some prior knowledge. The pre-training provides a foundation of general capabilities, which RLHF will then refine. Notably, the computational cost of the initial pre-training far exceeds that of the subsequent RLHF fine-tuning; for example, the RLHF phase for InstructGPT consumed less than 2% of the compute used to pre-train GPT-3.
The RLHF pipeline begins with supervised fine-tuning (SFT), which transforms a pretrained language model into an instruction-following system. Pretrained models excel at pattern completion but don't naturally follow explicit instructions. SFT bridges this gap by training on high-quality human demonstrations, where humans write desired outputs for given prompts. The model is fine-tuned on these prompt-response pairs to ensure it can follow instructions and generate desirable outputs.[2]
The process uses standard supervised learning with the causal language modeling objective:
L_SFT(theta) = -E_(x,y)~D [ sum_t log p_theta(y_t | x, y_<t) ]
where theta represents model parameters, x is the prompt/instruction, y is the desired response, and D is the demonstration dataset.[4]
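As a minimal sketch of this objective, the snippet below computes the SFT loss for a single prompt-response pair from per-token log-probabilities. The function name `sft_loss` and the toy probabilities are illustrative assumptions; in practice the log-probabilities would come from the language model itself.

```python
import math

def sft_loss(token_log_probs):
    """Negative log-likelihood of one response: -sum_t log p_theta(y_t | x, y_<t).

    token_log_probs is a hypothetical list holding the model's log-probability
    for each target token of the response y, given the prompt x and earlier tokens.
    """
    return -sum(token_log_probs)

# Toy example: the model assigns probabilities 0.5, 0.25 and 0.8
# to the three tokens of the desired response.
log_probs = [math.log(0.5), math.log(0.25), math.log(0.8)]
loss = sft_loss(log_probs)  # equals -log(0.5 * 0.25 * 0.8)
```

Averaging this quantity over the demonstration dataset D gives the expectation in the formula; lower values mean the model assigns higher probability to the human-written responses.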
Training data typically consists of 10,000 to 100,000 demonstrations sourced from human labelers, API usage, or carefully curated existing data. For example, in InstructGPT, approximately 13,000 prompts were used for SFT.[2] The SFT model becomes the reference policy (pi_ref) used during RL optimization to compute KL divergence penalties. This supervised learning step primes the model to output generally acceptable answers, simplifying the subsequent reinforcement learning stage.
Stage 2: reward model training
A dataset is constructed by having human evaluators judge or rank outputs of the model-in-training. Typically, the current model (or a set of candidate models) generates several answers for each of a variety of prompts, and human annotators compare them according to some criteria (e.g., which answer is more helpful or accurate). For instance, to train a chatbot, labelers might be shown two possible replies to a user query and asked which reply they prefer. These comparison judgments (or sometimes scalar ratings) constitute the feedback data.
Using pairwise comparisons helps because humans often find it easier to say "output A is better than output B" than to assign absolute scores, and it reduces variance between different annotators' scales. The collection process may be iterative: as the policy improves, new data is sampled in areas where the model is still uncertain, and humans provide feedback on those outputs, continually expanding the feedback dataset.
The second stage trains a reward model (RM, also called a preference model) that predicts human preferences between AI-generated outputs. The reward model takes as input an agent's output (and sometimes the initial prompt or state) and outputs a scalar reward value that should correlate with how humans would rate that output. It is typically a neural network; in NLP tasks, it is often a copy of the language model (commonly the SFT model) that is fine-tuned to predict preference scores using a loss function based on pairwise comparisons.
The reward model is trained using the Bradley-Terry model of pairwise preferences. This model assumes that for responses y_w (winner/chosen) and y_l (loser/rejected) with latent qualities r(y_w) and r(y_l), the probability that y_w is preferred follows:
P(y_w > y_l | x) = sigma(r_theta(x, y_w) - r_theta(x, y_l))
where sigma is the sigmoid function (logistic function) and r_theta is the reward model.[27] The Bradley-Terry model assumes each item has a latent strength, and observed preferences are a noisy reflection of these underlying strengths. Only differences in reward scores matter; adding the same constant to all scores leaves the preference probabilities unchanged.
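The following sketch evaluates the Bradley-Terry preference probability for a pair of scalar rewards and checks the shift-invariance property numerically. The function names and reward values are illustrative assumptions.

```python
import math

def sigmoid(z):
    """Logistic function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def preference_probability(reward_chosen, reward_rejected):
    """P(y_w > y_l | x) under the Bradley-Terry model."""
    return sigmoid(reward_chosen - reward_rejected)

# A reward gap of 1.0 gives a preference probability of about 0.73.
p = preference_probability(2.0, 1.0)

# Adding the same constant to both rewards leaves the probability unchanged.
p_shifted = preference_probability(2.0 + 5.0, 1.0 + 5.0)
```

Because only the difference enters the sigmoid, the absolute scale of a trained reward model's scores carries no meaning on its own.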
The reward model is trained by maximizing the log-likelihood of observed preferences using cross-entropy loss:
L_RM(theta) = -E_(x, y_w, y_l) ~ D [ log sigma(r_theta(x, y_w) - r_theta(x, y_l)) ]
where y_w is the preferred response over y_l.
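A minimal sketch of this loss over a small batch of comparisons is shown below. It assumes the reward model's scalar scores have already been computed; `reward_model_loss` and the sample scores are hypothetical names and values used only for illustration.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigma(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def reward_model_loss(pairs):
    """Mean of -log sigma(r_chosen - r_rejected) over (r_chosen, r_rejected) pairs."""
    return -sum(log_sigmoid(r_w - r_l) for r_w, r_l in pairs) / len(pairs)

# Scores for three comparisons: (score of preferred response, score of rejected response).
batch = [(1.2, 0.3), (0.1, 0.4), (2.0, -1.0)]
loss = reward_model_loss(batch)
```

The loss falls when the preferred response is scored above the rejected one and rises when the ordering is reversed, which is what pushes the model toward agreeing with the annotators.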
Architecturally, reward models typically start from the SFT model checkpoint, replacing the final token prediction layer with a linear projection to a single scalar value.[2] The scalar reward is read from the last token position, representing the quality of the entire sequence. Datasets for RM training can include 100,000 to 1 million comparisons. After training, the reward model serves as a stand-in for human judgment, evaluating any new output and estimating how well a human would like it. This allows the next phase of training to proceed without a human in the loop for every single evaluation.
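The scalar head described above can be sketched in a few lines. This is an illustration under simplifying assumptions: `hidden_states` stands in for the transformer's final-layer activations, and `head_weights` for the linear projection that replaces the token-prediction layer; real implementations operate on batched tensors.

```python
def scalar_reward(hidden_states, head_weights, head_bias=0.0):
    """Project the last token's hidden state to a single scalar reward.

    hidden_states: one hidden vector per token (hypothetical stand-in for
    the transformer's final-layer activations).
    head_weights: weights of the linear layer that replaces the
    token-prediction head.
    """
    last_hidden = hidden_states[-1]  # read the reward at the last token position
    return sum(w * h for w, h in zip(head_weights, last_hidden)) + head_bias

# Toy sequence of three tokens with two-dimensional hidden states.
hidden = [[0.1, 0.3], [0.5, -0.2], [1.0, 2.0]]
reward = scalar_reward(hidden, head_weights=[0.5, 0.25])
```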
Stage 3: reinforcement learning optimization
In the final phase, the policy model is fine-tuned using a reinforcement learning algorithm, with the reward model providing the reward signal. This stage formulates text generation as a sequential decision problem. Instead of a manual reward function defined by engineers, the agent now treats the learned reward model as its objective to maximize.
The complete RLHF objective combines reward maximization with a KL divergence penalty:
J_RLHF(theta) = E_(x ~ D, y ~ pi_theta) [ r_phi(x, y) - beta * D_KL(pi_theta(y|x) || pi_ref(y|x)) ]
Or equivalently, writing the KL divergence as the expected log-ratio under the policy:
J_RLHF(theta) = E_(x ~ D, y ~ pi_theta) [ r_phi(x, y) - beta * log(pi_theta(y|x) / pi_ref(y|x)) ]
where pi_theta is the current policy being optimized, pi_ref is the frozen reference policy (SFT model), r_phi is the frozen reward model, and beta is the KL penalty coefficient typically ranging from 0.01 to 0.1.[4]
An additional pretraining term, weighted by a coefficient gamma, may be added to avoid catastrophic forgetting:
J(theta) = E_(x ~ D, y ~ pi_theta) [ r_phi(x, y) - beta * log(pi_theta(y|x) / pi_ref(y|x)) ] + gamma * E_(x ~ D_pretrain) [ log pi_theta(x) ]
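For a single sampled response, the quantity inside the expectation can be sketched as follows. The function name and the toy log-probabilities are assumptions for illustration; many implementations apply the penalty per token rather than once per sequence.

```python
def kl_penalized_reward(reward, policy_log_probs, ref_log_probs, beta=0.05):
    """Reward-model score minus beta times log(pi_theta(y|x) / pi_ref(y|x)).

    policy_log_probs and ref_log_probs are per-token log-probabilities of the
    same sampled response under the current policy and the frozen reference.
    """
    log_ratio = sum(policy_log_probs) - sum(ref_log_probs)
    return reward - beta * log_ratio

# The policy assigns this response more probability than the reference does,
# so part of the reward-model score is taken back by the penalty.
shaped = kl_penalized_reward(
    reward=1.0,
    policy_log_probs=[-1.0, -1.0],
    ref_log_probs=[-1.5, -1.5],
    beta=0.1,
)
```

A larger beta keeps the policy closer to the reference at the cost of a smaller achievable reward.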
A common choice of RL algorithm for this stage is Proximal Policy Optimization (PPO), a stable policy-gradient method, though other policy-gradient algorithms can be used. Using PPO, the model generates an output (action) for a given input (state or prompt), the reward model scores this output, and the PPO update adjusts the model's parameters to increase the probability of outputs that lead to higher reward scores. PPO includes a clipping mechanism that prevents the policy from moving too far, in a single update, from the policy that generated the samples. This is important because the reward model is an imperfect proxy; if the policy changes too drastically, it might exploit quirks of the reward model (a form of reward hacking), producing gibberish or undesired outputs that nevertheless score high.
The KL divergence term prevents distribution collapse where the model assigns all probability to narrow sequences, maintains language fluency by staying close to the well-trained SFT model, and prevents reward hacking where the model exploits reward model weaknesses. Over many training iterations, this RL process tunes the agent to generate outputs that align with the learned human preferences.
After RLHF fine-tuning, InstructGPT and similar models began following user instructions more reliably, avoided certain classes of undesirable content, and showed improved factual accuracy in open-ended question answering.
| Step | Description | Key components |
|---|---|---|
| Pretraining/SFT | Fine-tune base model on human demonstrations | Prompt-response pairs, supervised learning, 10,000-100,000 demonstrations |
| Reward modeling | Train RM on ranked preferences | Pairwise comparisons, Bradley-Terry model, 100,000-1M comparisons |
| Policy optimization | Use RL to maximize RM rewards | PPO, KL penalty, optional pretraining mix |
Key components and algorithms
Proximal Policy Optimization (PPO)
Proximal Policy Optimization (PPO) is the dominant reinforcement learning algorithm used in RLHF, introduced by Schulman et al. in 2017.[18] PPO addresses how to update policies using sampled data without taking destructively large steps that degrade performance. The original 2017 RLHF paper used other policy-gradient methods (A2C and TRPO); PPO's stable updates later made it practical to apply RLHF to large neural networks such as language models.
The core innovation is the clipped surrogate objective. PPO defines a probability ratio:
r_t(theta) = pi_theta(a_t | s_t) / pi_theta_old(a_t | s_t)
The objective becomes:
L^CLIP(theta) = E_t [ min(r_t(theta) * A_hat_t, clip(r_t(theta), 1 - epsilon, 1 + epsilon) * A_hat_t) ]
where A_hat_t is the advantage estimate, epsilon is the clipping parameter (typically 0.1 to 0.2), and the minimum creates a pessimistic lower bound.[18]
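The clipped surrogate can be sketched for a single action as below. The function name and test values are illustrative; a real implementation averages this term over a batch of timesteps and maximizes it by gradient ascent.

```python
import math

def ppo_clip_objective(new_log_prob, old_log_prob, advantage, epsilon=0.2):
    """Per-sample clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    ratio = math.exp(new_log_prob - old_log_prob)  # r_t(theta)
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)

# Positive advantage, ratio 1.5: the gain is capped at (1 + epsilon) * A.
capped = ppo_clip_objective(math.log(1.5), 0.0, advantage=1.0)

# Negative advantage, ratio 1.5: the unclipped (more pessimistic) term is kept.
pessimistic = ppo_clip_objective(math.log(1.5), 0.0, advantage=-1.0)
```

Taking the minimum means the objective never rewards moving the ratio outside the clipping range, but it still penalizes moves that make things worse.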
PPO's selection for RLHF stems from multiple advantages:
- Stability through trust region constraints that prevent excessively large policy updates
- Sample efficiency by reusing data through multiple epochs of minibatch updates
- Simplicity requiring only standard backpropagation
- Scalability to large models with distributed training
However, PPO-based RLHF requires maintaining four large models simultaneously: the policy model, reference model, reward model, and value/critic model. For models exceeding 70 billion parameters, this creates substantial memory challenges.
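A rough back-of-the-envelope estimate illustrates the scale of the problem. The sketch below counts only model weights and assumes 16-bit parameters and four models of equal size; these are simplifying assumptions, and optimizer state, gradients, and activations would add substantially more.

```python
def rlhf_weight_memory_gb(num_params, bytes_per_param=2, num_models=4):
    """Memory for weights alone, in gigabytes (1 GB = 1e9 bytes).

    Assumes every model (policy, reference, reward, critic) has num_params
    parameters stored at bytes_per_param bytes each.
    """
    return num_params * bytes_per_param * num_models / 1e9

# Four 70-billion-parameter models in 16-bit precision.
weights_gb = rlhf_weight_memory_gb(70e9)
```

Under these assumptions the weights alone occupy 560 GB, which is why PPO-based RLHF at this scale is typically sharded across many accelerators.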
Reward models and preference learning
Reward models translate human preferences into scalar signals for reinforcement learning. These models face a challenging task: predicting subtle, context-dependent human judgments about text quality from limited training data while generalizing to novel outputs.
Architecturally, reward models typically mirror the language model being trained, often initialized from the same SFT checkpoint to leverage linguistic knowledge.[2] The final layer is modified to output a single scalar value rather than a probability distribution over tokens.
Reward model quality critically determines RLHF success and failure modes. A weak reward model gets exploited: during RL optimization, the policy discovers outputs that achieve high predicted rewards without actually satisfying human preferences, a phenomenon called reward hacking or reward overoptimization.[28]
Mitigating reward model limitations involves:
- Ensemble methods using multiple reward models to reduce variance and detect exploitation
- Adversarial training exposing models to challenging examples that probe for weaknesses
- Process reward models training on intermediate reasoning steps rather than only final answers
- Uncertainty quantification recognizing when the model has low confidence in its predictions
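One simple way to combine the ensemble and uncertainty ideas above is to penalize disagreement between reward models. The sketch below is an assumption-laden illustration rather than a method taken from the cited works: `conservative_reward` and the `penalty` weight are hypothetical.

```python
import statistics

def conservative_reward(ensemble_scores, penalty=1.0):
    """Mean ensemble score minus a penalty proportional to the disagreement.

    ensemble_scores: scalar scores that several independently trained reward
    models assign to the same response.
    """
    mean = statistics.mean(ensemble_scores)
    spread = statistics.pstdev(ensemble_scores)  # population standard deviation
    return mean - penalty * spread

agreed = conservative_reward([1.0, 1.0, 1.0])    # models agree: no penalty
disputed = conservative_reward([0.0, 2.0])       # same mean, large disagreement
```

Responses on which the reward models disagree receive a lower effective reward, which reduces the incentive for the policy to exploit the quirks of any single model.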
To support systematic evaluation of reward models, the Allen Institute for AI released RewardBench in March 2024, the first benchmark explicitly targeting reward models used in RLHF (covering chat, chat-hard, safety, and reasoning subsets).[29] RewardBench 2, released in 2025, extends evaluation to six domains including factuality, precise instruction following, and ties.
KL divergence penalty
The KL divergence penalty is a fundamental component preventing RLHF collapse. The Kullback-Leibler divergence measures the difference between two probability distributions:
D_KL(P || Q) = E_(x ~ P) [ log(P(x) / Q(x)) ]
In RLHF, the KL penalty constrains how much the optimized policy pi_theta can diverge from the reference policy pi_ref during RL training. The penalty coefficient beta controls the trade-off between reward maximization and staying close to the reference.[7]
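For intuition, the KL divergence between two small discrete distributions can be computed directly. This is a toy sketch; for language models the divergence is usually estimated from sampled responses rather than summed over all sequences.

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) for two discrete distributions given as probability lists."""
    return sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)

policy = [0.5, 0.5]
reference = [0.25, 0.75]

forward = kl_divergence(policy, reference)
reverse = kl_divergence(reference, policy)
```

The divergence is zero only when the two distributions match, and it is not symmetric: `forward` and `reverse` differ, which is why the direction D_KL(pi_theta || pi_ref) is stated explicitly.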
The penalty serves multiple functions:
- Prevents distribution collapse where the model assigns all probability mass to a few high-reward sequences
- Maintains language fluency by anchoring the policy to the well-trained SFT model
- Prevents reward hacking by limiting how far the policy can optimize against the imperfect reward model
- Acts as Bayesian regularization, implementing a form of variational inference
From a Bayesian perspective, KL-regularized RL implements variational inference approximating a target posterior distribution, which provides theoretical grounding for why the method avoids distribution collapse.
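The target distribution in this view has a known closed form: the policy that maximizes expected reward minus beta times the KL divergence to pi_ref is proportional to pi_ref(y|x) * exp(r(x, y) / beta). The sketch below evaluates it over a toy set of three candidate responses; the function name, probabilities, and rewards are illustrative assumptions.

```python
import math

def kl_regularized_optimum(ref_probs, rewards, beta):
    """Return pi*(y) proportional to pi_ref(y) * exp(r(y) / beta)."""
    max_r = max(rewards)  # subtract the max reward for numerical stability
    weights = [p * math.exp((r - max_r) / beta) for p, r in zip(ref_probs, rewards)]
    total = sum(weights)
    return [w / total for w in weights]

ref = [0.7, 0.2, 0.1]        # reference policy over three candidate responses
rewards = [0.0, 1.0, 2.0]    # reward-model scores for the same responses

mild = kl_regularized_optimum(ref, rewards, beta=10.0)   # stays close to ref
sharp = kl_regularized_optimum(ref, rewards, beta=0.1)   # concentrates on the best response
```

With a large beta the optimum barely moves from the reference, while a small beta concentrates almost all probability on the highest-reward response, which is the collapse the penalty is meant to temper.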
PPO itself comes in two variants, which differ in how they limit each policy update. PPO-Penalty approximately solves a KL-constrained update by penalizing the KL divergence between the new and old policies in the objective function and automatically adjusting the penalty coefficient during training. PPO-Clip relies instead on clipping in the objective function to remove incentives for the new policy to deviate too far from the old policy, without an explicit KL term. In RLHF, either variant is combined with the separate KL penalty against the reference policy.
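Automatic adjustment of a KL penalty coefficient can be sketched with a simple doubling and halving rule, along the lines of the adaptive-KL heuristic described in the PPO paper.[18] The function name, the factor of 1.5, and the target value below are illustrative assumptions rather than settings taken from any particular RLHF system.

```python
def update_kl_coefficient(beta, observed_kl, target_kl):
    """Raise beta when the measured KL overshoots the target, lower it when it undershoots."""
    if observed_kl > 1.5 * target_kl:
        return beta * 2.0   # policy moved too far: penalize KL more strongly
    if observed_kl < target_kl / 1.5:
        return beta / 2.0   # policy barely moved: relax the penalty
    return beta             # within the tolerated band: leave beta unchanged

beta = 1.0
beta = update_kl_coefficient(beta, observed_kl=0.05, target_kl=0.01)   # doubles
beta = update_kl_coefficient(beta, observed_kl=0.001, target_kl=0.01)  # halves back
```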
Applications
RLHF has been applied across a range of AI domains, from game-playing agents to large-scale text generation, to align models with human values.[3]
Natural language processing
One of the most prominent uses of RLHF is in natural language processing, where it has become a key technique for aligning language models with human expectations. RLHF improves conversational agents, text summarization, and instruction-following. Instruct-tuned models like InstructGPT and conversational agents like ChatGPT rely on RLHF to produce answers that users find helpful and safe. The technique has been used for tasks such as text summarization (models that generate summaries preferred by readers), open-ended question answering, translation, and dialogue.
Notable systems include OpenAI's ChatGPT and InstructGPT, Anthropic's Claude, DeepMind's Sparrow, and Google's Gemini. By training on human feedback, these models can handle subjective or nuanced criteria (e.g., writing style, humor, avoiding offensive language) that are not captured by likelihood alone. It helps reduce toxicity and bias in LLM outputs.
ChatGPT and InstructGPT
ChatGPT represents RLHF's most visible and impactful application, transforming GPT-3.5's raw capabilities into a helpful, harmless conversational assistant.[9] OpenAI's announcement was explicit about the role of RLHF: "We trained this model using Reinforcement Learning from Human Feedback (RLHF), using the same methods as InstructGPT, but with slight differences in the data collection setup." The system builds directly on InstructGPT's methodology, with key adaptations for dialogue:
- Human AI trainers played both user and assistant roles
- Model-written suggestions helped trainers compose responses
- Rejection sampling selected better outputs during inference
- Multiple rounds of iterative RLHF refinement
InstructGPT's results demonstrated RLHF's power: human labelers preferred the 1.3B parameter InstructGPT outputs to the 175B GPT-3 base model on the OpenAI API prompt distribution, with substantial gains on truthfulness and reductions in toxicity at fixed prompt difficulty.[2]
Claude and Constitutional AI
Anthropic's Claude models pioneered Constitutional AI (CAI), which reduces human annotation requirements while making value alignment more transparent.[10] The approach uses AI-generated feedback (RLAIF) guided by explicit constitutional principles rather than relying entirely on human preference comparisons.
Constitutional AI operates in two phases:
- Supervised learning phase: the model critiques and revises its own responses using constitutional principles
- RL phase: AI evaluators assess responses according to the constitution, generating preference data for reward model training
Claude's constitution draws from diverse sources including the UN Universal Declaration of Human Rights, trust and safety best practices, and Anthropic's "helpful, honest, harmless" (HHH) criteria.[30] Constitutional AI's results showed that models trained with RL from AI feedback could be both more helpful and less harmful than RLHF-only baselines, while requiring substantially less human labeling effort on the harmlessness dimension.[10]
Separately, Anthropic's earlier 2022 paper on training a helpful and harmless assistant (the HH-RLHF paper) provided one of the most detailed public studies of applying RLHF to a dialogue agent.[19] That work explored an iterated online mode of training where preference models and RL policies were updated on a weekly cadence with fresh human feedback data, demonstrating that alignment training improves performance on almost all NLP evaluations and is fully compatible with training for specialized skills such as Python coding and summarization.
Meta's Llama models
Meta AI's Llama 2 paper provides one of the most comprehensive public expositions of RLHF implementation details to date.[13] The documentation reveals:
- Starting from a 2 trillion token pretrained base
- Initial SFT on 27,540 high-quality annotations
- Training separate reward models for helpfulness and safety
- Five sequential RLHF versions (V1-V5) over the project lifetime, alternating between rejection-sampling fine-tuning and PPO
Technical innovations include:
- Rejection sampling for the largest 70B model, generating multiple candidates and selecting the highest-scoring under the reward model
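- Ghost Attention (GAtt): a technique to keep the model consistent with an initial instruction across the turns of a multi-turn conversation
- Context distillation: fine-tuning the model so that it internalizes lengthy instructions without needing them in the prompt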
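The rejection sampling step listed above amounts to best-of-N selection under the reward model. The sketch below is a toy illustration: `toy_reward` is a hypothetical stand-in for a learned reward model, and the candidate strings stand in for sampled model outputs.

```python
def best_of_n(candidates, reward_fn):
    """Return the candidate response with the highest reward-model score."""
    return max(candidates, key=reward_fn)

def toy_reward(response):
    # Hypothetical stand-in for a learned reward model: longer answers score higher.
    return float(len(response.split()))

candidates = [
    "I don't know.",
    "Paris.",
    "The capital of France is Paris.",
]
best = best_of_n(candidates, toy_reward)
```

In rejection-sampling fine-tuning, the selected responses are then used as targets for a further round of supervised training.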
The follow-up Llama 3.1 paper (2024) documents a notable shift in Meta's recipe: post-training proceeds through six rounds of supervised fine-tuning, rejection sampling, and DPO rather than PPO, with the Llama 3 team reporting that DPO required less compute and "performed better, especially on instruction following benchmarks" in their setup.[14] This stands as one of the highest-profile production endorsements of DPO at frontier scale, while also illustrating that the choice is recipe-dependent rather than universal.
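For reference, the per-pair DPO loss of Rafailov et al.[21] can be sketched as follows. This is the loss as commonly stated, not a reproduction of Meta's training setup; the function name, argument names, and beta value are illustrative assumptions.

```python
import math

def dpo_loss(policy_lp_chosen, policy_lp_rejected,
             ref_lp_chosen, ref_lp_rejected, beta=0.1):
    """-log sigma(beta * [(log-ratio of chosen) - (log-ratio of rejected)]).

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference model.
    """
    chosen_log_ratio = policy_lp_chosen - ref_lp_chosen
    rejected_log_ratio = policy_lp_rejected - ref_lp_rejected
    margin = beta * (chosen_log_ratio - rejected_log_ratio)
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))

# Policy identical to the reference: the loss is log(2).
baseline = dpo_loss(-5.0, -7.0, -5.0, -7.0)

# Policy has shifted probability toward the chosen response: the loss drops.
improved = dpo_loss(-4.0, -8.0, -5.0, -7.0)
```

No separate reward model or sampling loop appears in this loss, which is the source of the compute savings relative to PPO.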
Google Gemini and other systems
Google's Gemini models represent large-scale application of RLHF to multimodal systems.[15] The natively multimodal pretraining architecture processes text, images, audio, and video together, with RLHF refinement applied across modalities.
DeepMind developed several notable RLHF applications:
- GopherCite: trained models to cite evidence from the web[12]
- Sparrow: combined RLHF with rule-based alignment[11]
Reasoning models (2024-2025)
A new family of applications emerged in late 2024 with the release of OpenAI o1, a model trained with large-scale reinforcement learning to produce extended chains of thought before answering.[16] OpenAI reported that o1's performance "consistently improves with more reinforcement learning (train-time compute) and with more time spent thinking (test-time compute)," establishing a new scaling axis distinct from pretraining.
DeepSeek-R1, released as an open-weight model in January 2025 and later published in Nature, used GRPO with verifiable rewards (correctness of math/code answers) as its core RL algorithm and demonstrated that reasoning abilities can be incentivized through pure RL without supervised reasoning trajectories.[17] The DeepSeek-R1 release significantly influenced industry practice and demonstrated that frontier-level reasoning could be reproduced by an open team.
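The group-relative scoring at the heart of GRPO can be sketched as below: several answers are sampled for the same prompt, each receives a verifiable reward, and each advantage is the reward standardized within the group. The exact-match checker, the sample answers, and the small `eps` constant are illustrative assumptions, not details taken from the DeepSeek-R1 recipe.

```python
import statistics

def verifiable_reward(model_answer, reference_answer):
    """1.0 if the final answer matches the reference exactly, else 0.0."""
    return 1.0 if model_answer.strip() == reference_answer.strip() else 0.0

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples for the same prompt."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled answers to one math prompt whose reference answer is "42".
answers = ["42", "41", "42 ", "7"]
rewards = [verifiable_reward(a, "42") for a in answers]
advantages = group_relative_advantages(rewards)
```

Because the baseline is the group mean rather than a learned value function, no critic model is needed; when every answer in a group earns the same reward, all advantages are zero and the group contributes no gradient.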
Robotics and games
In robotics and control, RLHF offers a way to teach robots complex behaviors that are hard to specify with a reward function. The original 2017 "Deep Reinforcement Learning from Human Preferences" paper demonstrated RLHF in simulated robotics and Atari games.[1] A human can watch videos of two attempts (e.g., a robot stacking blocks) and indicate which attempt was better, and from this the robot eventually learns a policy that accomplishes the task as the human intends. This approach bypasses manually engineering a reward (which might accidentally encourage wrong behaviors) and instead leverages human intuition about what the correct outcome looks like. The original experiments showed a simulated robot learning to do a backflip and a drive-and-park task solely from human preference judgments.
For Atari games, humans viewed two clips of agent gameplay and indicated which looked better. Agents learned to play successfully from preference data alone, and human preferences sometimes contained more useful information than performance-based metrics.
Other domains
Beyond NLP and games, RLHF has been explored in domains like image generation and computer vision. Text-to-image models can produce outputs ranked by humans for quality and alignment with the prompt; a reward model trained on these rankings can fine-tune the image generator accordingly. Black et al. (2023) introduced Denoising Diffusion Policy Optimization (DDPO) to apply policy-gradient RL to diffusion models, fine-tuning on objectives including aesthetic quality and prompt adherence.[31]
RLHF has also been explored in healthcare (for patient education), in recommender systems, and in multimodal alignment tasks such as text-to-image and video generation.
| Domain | Example systems | Benefits |
|---|---|---|
| LLMs | ChatGPT, Claude, InstructGPT | Improved helpfulness, safety, and instruction-following |
| Games | Atari agents (Christiano et al.) | Better exploration, robustness, and human-like behavior |
| Robotics | Simulated tasks, manipulation, navigation | Alignment with human demonstrations and preferences |
| Image generation | Text-to-image diffusion models, DDPO | Better alignment with prompt and aesthetic preferences |
| Reasoning | OpenAI o1, DeepSeek-R1 | Extended chain-of-thought, verifiable rewards |
| Recommender systems | Content recommendation | Optimization beyond simple click signals |
Public preference datasets and open-source RLHF frameworks have been central to RLHF's diffusion beyond well-resourced labs.
Preference datasets
- Anthropic HH-RLHF: roughly 170,000 helpful-and-harmless comparisons used in Bai et al. (2022), one of the earliest large-scale public preference datasets[19]
- OpenAI summarization: the TL;DR comparison data released with Stiennon et al. (2020)[8]
- PRM800K: 800,000 step-level correctness labels on MATH solutions, released by OpenAI with "Let's Verify Step by Step"[20]
- UltraFeedback (Cui et al. 2023): ~64,000 prompts and ~256,000 responses annotated by GPT-4 across instruction-following, truthfulness, honesty, and helpfulness[32]
- HelpSteer / HelpSteer2 (NVIDIA 2023-2024): multi-attribute human ratings (helpfulness, correctness, coherence, complexity, verbosity); HelpSteer2 contains ~10,000 response pairs and was used to train reward models reaching state-of-the-art RewardBench scores at the time of release[33]
- Skywork-Reward-Preference (2024-2025): curated preference data accompanying the Skywork-Reward reward-model series, which topped RewardBench shortly after release[34]
Open-source RLHF frameworks
- TRL (Transformer Reinforcement Learning) by Hugging Face: widely used PPO/DPO/KTO/ORPO/IPO/SimPO/RLOO trainers integrated with the Transformers stack
- trlx (CarperAI): an earlier large-model RLHF library
- DeepSpeed-Chat (Microsoft): the first end-to-end open implementation of a full three-stage RLHF pipeline at scale
- OpenRLHF: a Ray-based, vLLM-accelerated framework supporting PPO, REINFORCE++, RLOO, GRPO, and RLVR, designed for high-throughput multi-node training[35]
- NeMo-Aligner (NVIDIA): an alignment toolkit for the NeMo ecosystem covering RLHF, DPO, and SteerLM workflows
- verl (used by ByteDance/Tsinghua for DAPO): a flexible RL framework targeting reasoning and tool-use post-training
Advantages and benefits
RLHF offers several benefits over conventional training approaches that explain its rapid adoption.
Flexibility in capturing complex preferences. RLHF can align models with subtle, context-dependent human judgments that are nearly impossible to specify through hand-coded rules. This helps tackle the long-standing problem of value misalignment, where an agent achieves the literal objective it was given but not the outcome humans actually wanted. By defining the objective through human feedback, RLHF directly optimizes what humans care about (at least to the extent that the human feedback is representative of true preferences). This makes it a powerful tool for AI alignment, especially in scenarios involving ethics, safety, or complex social values.
Data efficiency in fine-tuning. In many cases, a relatively small amount of human feedback can substantially improve a large model's performance on subjective tasks. InstructGPT's results, where labelers preferred a 1.3B model with RLHF over the 175B base model, exemplify how human feedback can unlock latent capabilities more effectively than scaling up model size or unsupervised data.
Implicit reward shaping. RLHF learns reward functions from data rather than requiring hand-specified proxies, making it applicable to tasks where defining the reward function manually would be intractable.
Practical effectiveness. The technique has been validated across a wide range of tasks (summarization, dialogue, coding, instruction following, reasoning) and is now used in essentially every frontier deployed LLM.
Safety improvements. RLHF-trained models are better at refusing inappropriate requests and explaining their refusals. OpenAI reported that RLHF reduced toxic generations and improved truthfulness on TruthfulQA at fixed prompt difficulty,[2] and DeepMind's Sparrow saw improvements in following conversation rules and providing evidence when available.[11]
The technique successfully handles tasks spanning summarization, dialogue, question answering, code generation, creative writing, and instruction following.[3]
Challenges and limitations
Despite its effectiveness, RLHF faces significant challenges that remain active areas of research.
Cost and scalability
Obtaining high-quality human preference data can be expensive and time-consuming, since it often requires skilled annotators to carefully compare outputs. As models become more capable and are applied to broader tasks, the amount of feedback required to cover diverse scenarios increases. Relying heavily on human-in-the-loop training can become a bottleneck for scaling up AI systems.
Computational requirements pose another barrier. PPO-based RLHF requires maintaining four large models simultaneously: the policy model, reference model, reward model, and value/critic model. For models exceeding 70 billion parameters, this creates substantial memory challenges.
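The scale of this memory problem can be illustrated with back-of-the-envelope arithmetic. The sketch below counts weights only and assumes that all models are the same size and stored in 16-bit precision; the function name and defaults are illustrative. Optimizer states, activations, and generation caches add considerably more, and in practice reward and critic models are often smaller than the policy.

```python
def weight_memory_gb(n_params_billion: float,
                     n_models: int = 4,
                     bytes_per_param: int = 2) -> float:
    """Rough weight-only memory in GB for n_models same-size model copies.

    Assumption: 1e9 parameters at `bytes_per_param` bytes each occupy
    `bytes_per_param` GB (using 1 GB = 1e9 bytes).
    """
    return n_params_billion * bytes_per_param * n_models


# Policy, reference, reward, and critic models, each 70B, in 16-bit weights.
ppo_gb = weight_memory_gb(70, n_models=4)
# A pipeline that keeps only a policy and a frozen reference model.
two_model_gb = weight_memory_gb(70, n_models=2)
print(f"four models: {ppo_gb:.0f} GB, two models: {two_model_gb:.0f} GB")
```

Under these assumptions the four-model setup needs 560 GB for weights alone, which already exceeds a single accelerator node in many deployments and is why sharding and offloading are common.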
Human feedback quality and bias
Human feedback is noisy and potentially biased. Different annotators may have inconsistent preferences or make mistakes, and even a single person's judgment can vary over time or context. If the feedback dataset is not carefully curated, the reward model may learn a skewed version of human preferences. Human preferences are subjective and can introduce biases, leading to models that favor majority opinions and disadvantage underrepresented groups.[22] A model tuned with feedback from a specific demographic might not perform well for users from a different background, or it might unintentionally amplify the biases present in the trainers' judgments.
As AI systems become more capable, human oversight becomes increasingly difficult. The Bradley-Terry model assumes preferences are transitive and consistent, but real human preferences often violate these assumptions.
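To make the Bradley-Terry assumption concrete, the following minimal sketch assigns each response a single scalar score and models the probability that one response is preferred as a sigmoid of the score difference. The scores are hypothetical. Because every comparison is driven by one scalar per response, a cyclic preference pattern (A over B, B over C, C over A) cannot be represented.

```python
import math


def bt_preference_prob(score_a: float, score_b: float) -> float:
    """Bradley-Terry probability that response A is preferred over response B."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))


# Hypothetical scalar scores for three responses.
scores = {"A": 1.0, "B": 0.4, "C": -0.3}

p_ab = bt_preference_prob(scores["A"], scores["B"])
p_bc = bt_preference_prob(scores["B"], scores["C"])
p_ac = bt_preference_prob(scores["A"], scores["C"])

# With scalar scores, "A beats B" and "B beats C" force "A beats C".
print(p_ab > 0.5, p_bc > 0.5, p_ac > 0.5)
```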
Reward hacking and overoptimization
Reward hacking represents a fundamental problem where models exploit weaknesses in reward models to achieve high scores without genuinely improving quality.[28] Since the reward model is an imperfect proxy for what humans truly want, a powerful agent may find ways to game this proxy. Research by Gao et al. (2022) demonstrated reward overoptimization: as policies optimize proxy rewards, true reward initially increases but then declines after reaching a peak, with the relationship following predictable scaling laws as a function of the policy-reference KL distance and reward-model size.[36]
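For reference, the functional forms that Gao et al. fit to this rise-then-fall curve can be written as follows. They are recalled here from the paper rather than derived; $d$ is the square root of the KL divergence between the optimized policy and its initialization, and the $\alpha$ and $\beta$ coefficients are fitted parameters that depend on reward-model size.

$$d = \sqrt{D_{\mathrm{KL}}(\pi \,\|\, \pi_{\mathrm{init}})}, \qquad R_{\mathrm{BoN}}(d) = d\,(\alpha_{\mathrm{BoN}} - \beta_{\mathrm{BoN}}\, d), \qquad R_{\mathrm{RL}}(d) = d\,(\alpha_{\mathrm{RL}} - \beta_{\mathrm{RL}} \log d)$$

In the best-of-$n$ form, for example, the true reward is a downward parabola in $d$ that peaks at $d = \alpha_{\mathrm{BoN}} / (2\beta_{\mathrm{BoN}})$; optimizing further against the proxy past that point lowers the true reward.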
Manifestations include:
- Sycophancy: agreeing with stated user beliefs or repeating user errors rather than providing truthful information; Perez et al. (2022) documented this as a systematic effect of RLHF training[37]
- Sophistical reasoning: generating convincing but incorrect arguments
- Length bias exploitation: producing unnecessarily verbose responses because reward models often correlate length with quality
- Formatting tricks: using layouts or bullet points that disproportionately impress reward models
- Mode collapse: reducing output diversity by converging on a narrow set of high-reward patterns
The problem is closely related to Goodhart's law: "when a measure becomes a target, it ceases to be a good measure." Techniques like regularizing policy updates (PPO's KL penalty), continually refreshing reward-model data with newly-generated outputs, ensembling reward models, and length-debiasing during reward modeling can mitigate but not fully eliminate the issue.
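The KL penalty mentioned above is commonly applied by subtracting a scaled log-probability ratio from the reward-model score. The sketch below assumes per-token log-probabilities of the sampled response are already available from the policy and the frozen reference model; the function name and the default `beta` are illustrative and not taken from any particular library.

```python
from typing import Sequence


def kl_penalized_reward(rm_score: float,
                        policy_logprobs: Sequence[float],
                        ref_logprobs: Sequence[float],
                        beta: float = 0.1) -> float:
    """Reward-model score minus beta times a sampled KL estimate.

    The KL estimate is the sum over tokens of
    log pi_policy(token) - log pi_ref(token) for the sampled response.
    """
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("log-probability sequences must have equal length")
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return rm_score - beta * kl_estimate


# The policy assigns higher probability than the reference to both tokens,
# so the penalty reduces the reward.
reward = kl_penalized_reward(1.0, [-1.0, -2.0], [-1.5, -2.5], beta=0.1)
print(reward)
```

A larger `beta` keeps the policy closer to the reference and limits how far it can drift toward reward-model exploits, at the cost of smaller improvements in the proxy reward.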
Training instability and complexity
PPO is notoriously sensitive to hyperparameter choices. The multi-stage pipeline creates intricate dependencies; poor SFT models produce poor reward model training data, leading to ineffective RL optimization. RLHF can also impose an "alignment tax," degrading performance on general capabilities not targeted during alignment, though InstructGPT showed that mixing pretraining gradients into the PPO updates (the PPO-ptx variant) mitigates much of this effect.[2]
Evaluation difficulties
Judging how well an RLHF-trained model actually aligns with human values is inherently difficult. The model is optimized to do well on the feedback it was given, but verifying it will generalize to new situations requires thorough testing with adversarial or diverse queries. For example, after training Sparrow, DeepMind had participants try to trick the agent into breaking rules to see how often it fails. These evaluations help quantify progress but are inherently limited by the creativity and perspectives of the evaluators.
The objective optimized by RL (maximizing reward model scores) does not perfectly align with true human preferences.[22] A single reward model cannot faithfully represent diverse human values, so the learning problem itself is misspecified. Critics argue that RLHF alone is insufficient for aligning superintelligent AI due to these limitations.
Alternative and complementary approaches
As RLHF becomes a standard tool, researchers are exploring extensions and alternatives to address its limitations. The 2023-2025 period has produced a particularly large family of "direct" preference-optimization methods that bypass the explicit reward model and the PPO inner loop.
Direct Preference Optimization (DPO)
Direct Preference Optimization (DPO), introduced by Rafailov et al. in 2023, aims to eliminate the reinforcement learning step altogether and emerged as a simpler, more stable alternative to RLHF.[21] DPO takes the preference data (pairwise comparisons) and directly fine-tunes the main model to satisfy those preferences using a simple supervised objective, rather than training a separate reward model and doing RL.
DPO's key insight is that the optimal policy for the KL-regularized RLHF objective can be expressed in closed form in terms of the reward, which can in turn be re-parameterized as a log-ratio of the policy and reference. Substituting into the Bradley-Terry objective yields a supervised classification loss directly on the policy:
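$$\mathcal{L}_{\mathrm{DPO}}(\theta) = -\mathbb{E}_{(x, y_w, y_l)}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]$$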
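The loss above can be sketched for a single preference pair as follows, where `y_w` is the preferred response and `y_l` the dispreferred one. This is a minimal illustration that assumes the summed sequence log-probabilities under the policy and the frozen reference model have already been computed; the function names and the default `beta` are illustrative.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(policy_logp_w: float, policy_logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss from sequence log-probabilities.

    *_w: log-probability of the preferred response y_w given x.
    *_l: log-probability of the dispreferred response y_l given x.
    """
    chosen_logratio = policy_logp_w - ref_logp_w      # log pi_theta/pi_ref on y_w
    rejected_logratio = policy_logp_l - ref_logp_l    # log pi_theta/pi_ref on y_l
    margin = beta * chosen_logratio - beta * rejected_logratio
    return -log_sigmoid(margin)


# Policy identical to the reference: zero margin.
loss_at_init = dpo_loss(-5.0, -7.0, -5.0, -7.0)
# Policy has shifted probability toward the preferred response.
loss_after_shift = dpo_loss(-4.0, -8.0, -5.0, -7.0)
print(loss_at_init, loss_after_shift)
```

When the policy equals the reference, the margin is zero and the loss is log 2 (about 0.693); raising the policy's relative log-probability on the preferred response lowers it. In training, the expectation is approximated by averaging this quantity over a batch of pairs.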
DPO's reported advantages include:
- Simplicity: requires only the policy model and frozen reference model (no separate reward model or critic)
- Stability: supervised learning avoids the instabilities of online PPO
- Computational efficiency: roughly halves the memory footprint of a PPO-based pipeline
- Performance: matches or exceeds PPO-based RLHF on summarization and dialogue benchmarks in Rafailov et al.'s experiments[21]
The method has seen rapid adoption: open-source models such as Zephyr-7B used DPO on UltraFeedback as a primary post-training recipe, and Meta's Llama 3.1 paper documents DPO as the preference-optimization step of choice over PPO in their iterative recipe.[14] However, DPO still relies on the quality of human preference data and can suffer from the same bias issues if the data is unrepresentative. See the dedicated DPO article for a detailed treatment.
IPO, KTO, SimPO, ORPO, and the family of "X-PO" methods
Following DPO, a wave of papers introduced alternative loss functions in the same family, each addressing a perceived shortcoming of the DPO objective.
- Identity Preference Optimization (IPO) (Azar et al. 2024) generalizes DPO under a framework the authors call PsiPO. By replacing DPO's log-sigmoid with a squared-loss on implicit-reward differences, IPO removes a particular form of overfitting where DPO can drive the policy ratio arbitrarily large on noisy preferences.[38]
- Kahneman-Tversky Optimization (KTO) (Ethayarajh et al. 2024) requires only a binary signal (output desirable / undesirable) rather than paired comparisons, and frames the loss using prospect theory's value function so that gains and losses are weighted asymmetrically. KTO matched or exceeded DPO at 1B-30B parameter scales in the original experiments.[23]
- Odds Ratio Preference Optimization (ORPO) (Hong et al. 2024) folds preference optimization directly into the SFT stage by adding an odds-ratio term to the negative log-likelihood loss, removing the need for a separate reference model and a second-stage preference pass.[39]
- SimPO (Meng, Xia & Chen 2024) replaces DPO's log-ratio with the average log-probability of a sequence as the implicit reward, eliminating the reference model entirely and adding a target reward margin. SimPO reported gains over DPO of up to 6.4 points on AlpacaEval 2 and 7.5 points on Arena-Hard, and a Gemma-2-9B SimPO model briefly ranked first among sub-10B models on Chatbot Arena.[40]
| Method | Year | Reference model | Data | Loss type |
|---|---|---|---|---|
| DPO | 2023 | Yes (frozen) | Pairwise | Bradley-Terry / log-sigmoid |
| IPO | 2024 | Yes | Pairwise | Squared error |
| KTO | 2024 | Yes | Binary good/bad | Prospect-theoretic |
| ORPO | 2024 | No (folded into SFT) | Pairwise | NLL + log odds ratio |
| SimPO | 2024 | No | Pairwise | Length-normalized margin |
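To show how two of these objectives differ from DPO in practice, the sketch below gives per-pair versions of the IPO and SimPO losses as they are usually stated: IPO regresses the log-ratio difference toward a fixed target of 1/(2·tau), while SimPO uses length-averaged log-probabilities with a target margin `gamma` and no reference model. Function names and default hyperparameters are illustrative assumptions, not recommended settings.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def ipo_loss(policy_logp_w: float, policy_logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             tau: float = 0.1) -> float:
    """Squared-error loss pulling the log-ratio gap toward 1 / (2 * tau)."""
    gap = (policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l)
    return (gap - 1.0 / (2.0 * tau)) ** 2


def simpo_loss(policy_logp_w: float, policy_logp_l: float,
               len_w: int, len_l: int,
               beta: float = 2.0, gamma: float = 1.0) -> float:
    """Reference-free loss on length-normalized log-probabilities with margin gamma."""
    reward_w = beta * policy_logp_w / len_w   # average log-prob of preferred response
    reward_l = beta * policy_logp_l / len_l   # average log-prob of dispreferred response
    return -log_sigmoid(reward_w - reward_l - gamma)


# IPO: the loss is zero exactly when the gap hits the target 1 / (2 * tau).
print(ipo_loss(-4.0, -7.0, -5.0, -7.0, tau=0.5))
# SimPO: a longer response is not favored merely for having more tokens.
print(simpo_loss(-10.0, -30.0, len_w=5, len_l=10))
```

Unlike the DPO log-sigmoid, the IPO squared error penalizes overshooting the target gap, which is the mechanism that bounds the policy ratio on noisy preferences.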
RLAIF (Reinforcement Learning from AI Feedback)
RLAIF replaces human preference labels with AI-generated evaluations, addressing RLHF's scalability bottleneck.[41] The approach follows the standard RLHF structure but uses an off-the-shelf LLM (or the trained policy itself) to generate preference labels rather than requiring human annotators. Lee et al. (2023) reported that RLAIF achieves performance comparable to RLHF on summarization, helpful dialogue, and harmless dialogue, with same-model RLAIF (the same LLM used as both policy and labeler) avoiding distillation effects.
Advantages include dramatic cost reduction, vastly greater scalability, more consistency than human evaluators on well-specified criteria, and faster iteration cycles. However, these methods depend on the AI feedback being reliable; the feedback model's alignment becomes the critical bottleneck. RLAIF is seen as a promising direction to scale alignment techniques to very powerful models but is not a complete replacement for human judgment, especially on questions of contested ethics and values.
Constitutional AI and hybrid approaches
Constitutional AI (CAI) is more than RLAIF alone; it is an approach emphasizing explicit, transparent value specification.[30] Rather than learning implicit preferences, CAI encodes values in written principles drawn from established sources.
The methodology combines self-improvement through critique-and-revision with RLAIF supervision. Collective Constitutional AI extends this by incorporating democratic input, with experiments involving approximately 1,000 participants voting on AI principles.[42]
Hybrid approaches combining multiple techniques show promise:
- Using RLHF for helpfulness while Constitutional AI handles harmlessness
- Integrating DPO or SimPO with process supervision for complex reasoning
- Mixing human and AI feedback strategically
- Multi-stage training alternating between SFT, rejection sampling, DPO, and PPO
Group Relative Policy Optimization (GRPO)
Group Relative Policy Optimization (GRPO), introduced by DeepSeek in the DeepSeekMath paper (Feb 2024), is a variant of PPO designed to reduce the computational cost of RLHF by eliminating the need for a separate critic (value) model.[24] In standard PPO-based RLHF, a value network estimates baselines for advantage computation; because it is typically comparable in size to the policy, it adds a substantial memory and compute burden. GRPO instead estimates baselines from group scores: for each prompt, the model generates a group of responses, scores them with the reward model (or a verifier), and computes each response's advantage relative to the group mean, normalized by the group standard deviation.
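The group-based baseline can be sketched in a few lines. This minimal example covers only the advantage computation for one prompt, using outcome-level rewards; it follows the DeepSeekMath formulation in dividing by the group standard deviation, with the population standard deviation and the small `eps` term being assumptions made here for simplicity. The clipped policy-gradient update and KL term are omitted.

```python
import statistics
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float],
                              eps: float = 1e-8) -> List[float]:
    """Advantages for one prompt's group of sampled responses.

    Each response's reward is centered on the group mean and scaled by the
    group standard deviation, so no learned value network is needed.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled responses to one prompt, scored 1.0 if correct and 0.0 otherwise.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Correct responses receive positive advantages and incorrect ones negative advantages; if every response in the group gets the same reward, all advantages are zero and the prompt contributes no gradient signal.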
GRPO was subsequently used as the central RL algorithm for DeepSeek-R1, the first major open-weight model to demonstrate that competitive long-chain-of-thought reasoning could be elicited via pure RL with rule-based rewards.[17] DeepSeek-R1 was published in Nature in September 2025 as the first frontier-class model to undergo peer review, raising the profile of GRPO substantially. See GRPO for further detail.
Several refinements followed:
- Dr. GRPO (Sea AI Lab, 2025) removes length and standard-deviation normalization terms from GRPO that were shown to bias the model toward longer (especially incorrect) responses.
- DAPO (ByteDance Seed / Tsinghua, 2025) introduces Clip-Higher, Dynamic Sampling, token-level policy-gradient loss, and Overlong Reward Shaping; DAPO reportedly matches DeepSeek-R1-Zero-Qwen-32B accuracy on AIME 2024 with roughly half the training steps.[43]
Other RL algorithms in use for RLHF
PPO is not the only RL algorithm used for language model alignment, and several simpler or more targeted alternatives have been proposed.
- RLOO (REINFORCE Leave-One-Out) (Ahmadian et al. 2024) revisits classic REINFORCE-style updates and uses k-1 sibling samples per prompt as a baseline. The paper argues that many PPO components (clipping, GAE, value networks) are not necessary at LLM scale; RLOO can outperform PPO with substantially less memory and wall-clock time.[44]
- REINFORCE++ is a stabilized REINFORCE variant implemented in OpenRLHF that combines a fixed-baseline REINFORCE update with PPO-style stabilizers; it has become a common baseline for open RL recipes.
- VinePPO (Kazemnejad et al., ICML 2025) replaces the learned value network with Monte-Carlo rollouts from intermediate states ("vine" estimates), arguing that PPO's value network is a poor credit-assignment tool on long reasoning trajectories.[45]
- VAPO (ByteDance Seed, 2025) is a value-augmented PPO variant tailored for reasoning models; on AIME 2024 with a Qwen 32B backbone it reports state-of-the-art results within ~5,000 steps.[46]
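The leave-one-out baseline used by RLOO, the first method in the list above, is simple enough to show directly. In this minimal sketch, each of the k samples for a prompt is baselined against the mean reward of its k-1 siblings; the function name is illustrative, and the REINFORCE gradient step that would consume these advantages is not shown.

```python
from typing import List, Sequence


def rloo_advantages(rewards: Sequence[float]) -> List[float]:
    """Leave-one-out advantages for k sampled responses to one prompt.

    The baseline for sample i is the mean reward of the other k-1 samples,
    so the baseline never depends on the sample it is applied to.
    """
    k = len(rewards)
    if k < 2:
        raise ValueError("RLOO needs at least two samples per prompt")
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]


# Three samples for one prompt; only the first earned a reward.
advantages = rloo_advantages([1.0, 0.0, 0.0])
print(advantages)
```

For the rewards above, the first sample is compared against a sibling mean of 0.0 and the other two against a sibling mean of 0.5, so no learned value network is involved.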
The DPO vs. PPO debate
A common narrative through 2023-2024 treated DPO as a strict simplification and replacement for PPO-based RLHF. The empirical picture has turned out to be more nuanced. Xu et al. (2024, ICML), "Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study," report theoretical limitations of DPO (its solution set is strictly larger than that of the constrained RL problem) and empirically find that a carefully-tuned PPO outperforms DPO on harder tasks such as code competitions.[26] In practice, frontier teams have used both: OpenAI continues to use online RL methods for o-series reasoning models; Meta's Llama 3.1 paper chose DPO over PPO for general post-training; and DeepSeek and others use GRPO for reasoning training. The choice is recipe- and task-dependent rather than universal.
Process supervision and process reward models
Lightman et al.'s "Let's Verify Step by Step" (OpenAI, 2023) introduced process supervision for mathematical reasoning: instead of rewarding only final-answer correctness ("outcome supervision"), human labelers score each step of a chain of thought.[20] The paper showed that a process-supervised reward model trained on the released PRM800K dataset (800,000 step-level labels) substantially outperformed outcome-supervised models on the MATH benchmark, solving 78% of problems from a representative subset of the MATH test set. This distinction between process reward models (PRMs) and outcome reward models (ORMs) is now standard terminology in the reasoning-models literature; see process reward model.
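A small sketch can illustrate how step-level scores are turned into a solution-level score for reranking. It assumes a PRM that outputs a correctness probability per step and aggregates them by taking their product, one common choice; the candidate solutions and probabilities are made up for illustration.

```python
import math
from typing import List, Sequence, Tuple


def prm_solution_score(step_probs: Sequence[float]) -> float:
    """Solution score as the product of per-step correctness probabilities."""
    return math.prod(step_probs)


def best_of_n(candidates: List[Tuple[str, Sequence[float]]]) -> str:
    """Return the candidate solution with the highest PRM score."""
    return max(candidates, key=lambda c: prm_solution_score(c[1]))[0]


# Hypothetical candidates: (solution label, per-step probabilities from a PRM).
candidates = [
    ("solution A", [0.9, 0.9]),      # every step looks plausible
    ("solution B", [0.99, 0.2]),     # one step is likely wrong
]
print(best_of_n(candidates))
```

An outcome reward model would assign a single score to each full solution; the step-level view lets one weak step pull down a solution that otherwise looks convincing.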
Reinforcement Learning from Verifiable Rewards (RLVR)
Reinforcement Learning from Verifiable Rewards (RLVR) is a term that emerged in 2024-2025 for the special case in which the reward model is replaced by an automatic verifier that returns a binary correctness signal (mathematical answer matches the ground truth, code passes unit tests, instruction-following constraint is satisfied, etc.).[25] AI2's Tulu 3 paper introduced the term in its current sense, applying RLVR on top of SFT and DPO to push targeted gains on math (GSM8K, MATH) and verifiable instruction-following without degrading general performance. RLVR is also the operative paradigm behind DeepSeek-R1-Zero's pure-RL reasoning training, where rewards come from a rule-based grader rather than a learned reward model.
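A verifier of this kind can be very small. The sketch below is a hypothetical rule-based grader for math-style answers: it assumes the model is prompted to end its response with a line of the form `Answer: <value>` and compares that value to the ground truth by exact string match. Real graders typically normalize expressions or run unit tests instead.

```python
import re
from typing import Optional


def extract_final_answer(response: str) -> Optional[str]:
    """Return the value after the last 'Answer:' marker, or None if absent."""
    matches = re.findall(r"Answer:\s*(.+)", response)
    return matches[-1].strip() if matches else None


def verifiable_reward(response: str, ground_truth: str) -> float:
    """Binary reward: 1.0 if the extracted answer matches exactly, else 0.0."""
    answer = extract_final_answer(response)
    if answer is not None and answer == ground_truth.strip():
        return 1.0
    return 0.0


print(verifiable_reward("2 + 2 is four.\nAnswer: 4", "4"))
print(verifiable_reward("I think it is five.\nAnswer: 5", "4"))
```

Because the reward comes from a fixed rule rather than a learned model, there is no reward model for the policy to exploit, although the policy can still exploit weaknesses in the rule itself, such as overly lenient answer extraction.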
Iterative and online DPO; SPPO
Offline DPO trains once on a fixed preference dataset. Iterative DPO (sometimes called online DPO in practice, although true online DPO is a separate concept) interleaves DPO training with new sampling from the current policy and re-labeling by a fixed preference model or stronger judge model, allowing the preference distribution to track the policy. The "RLHF Workflow" paper (Dong et al. 2024) provides a public iterative DPO recipe.[47]
Self-Play Preference Optimization (SPPO) (Wu et al. 2024) treats alignment as a constant-sum two-player game and uses iterative self-play updates to provably approximate a Nash equilibrium of the preference game. On AlpacaEval 2, SPPO with only 60K UltraFeedback prompts reached a 28.5% length-controlled win rate against GPT-4-Turbo using Mistral-7B-Instruct-v0.2, and it outperformed iterative DPO and IPO on Arena-Hard and MT-Bench in their setup.[48]
Other improvements
Researchers are also investigating ways to make human feedback more efficient through active learning strategies (selectively querying humans on the most informative comparisons rather than random samples), semi-automated feedback (using heuristic or model-based pre-screening to reduce trivial queries), and combining demonstrations and preferences (using a few high-quality human demonstrations to bootstrap the policy and then preferences for further refinement).
| Method | Year | Requires RM | Requires RL loop | Data format | Key advantage |
|---|---|---|---|---|---|
| RLHF (PPO) | 2017/2022 | Yes | Yes | Pairwise preferences | Well-tested, flexible |
| DPO | 2023 | No | No | Pairwise | Simpler, often more stable |
| IPO | 2024 | No | No | Pairwise | Less overfitting on noisy data |
| KTO | 2024 | No | No | Binary good/bad | No paired data needed |
| ORPO | 2024 | No | No | Pairwise (folded into SFT) | Single-stage training |
| SimPO | 2024 | No | No | Pairwise | Reference-free, length-normalized |
| RLAIF | 2022-2023 | Yes | Yes | AI-generated preferences | Scalable, lower cost |
| Constitutional AI | 2022 | Yes | Yes | Principles + AI feedback | Transparent values |
| GRPO | 2024 | Yes (or verifier) | Yes (critic-free) | Reward-model or rule-based scores | No critic model; lower memory than PPO |
| Dr. GRPO / DAPO | 2025 | Yes | Yes (critic-free) | Verifiable + length controls | Bias-corrected, reasoning-oriented |
| RLOO | 2024 | Yes | Yes (critic-free) | Group-of-k samples | REINFORCE-style, low memory |
| RLVR | 2024-2025 | No (verifier) | Yes | Verifiable answers | No reward-model exploitation |
Future directions and open problems
The RLHF research community faces numerous open challenges, surveyed comprehensively by Casper et al. (2023), who identified problems across three categories: challenges with feedback, challenges with the reward model, and challenges with the policy.[22]
Improving reward models
- Adversarial training for robustness against exploitation
- Ensemble methods aggregating multiple models to reduce variance
- Uncertainty quantification for confidence assessment
- Process reward models that evaluate intermediate reasoning steps rather than only final outputs
- Better generalization to out-of-distribution prompts, evaluated systematically on benchmarks like RewardBench
Algorithmic improvements
- Developing RL algorithms specifically designed for LLM alignment (GRPO, DAPO, RLOO, VinePPO, VAPO illustrate this trend)
- More efficient optimization with parameter reallocation and offloaded inference (vLLM-style)
- Better parallelization strategies for multi-model RLHF pipelines
- Reduced memory footprint approaches that eliminate the need for multiple full-size models
Addressing scalability
- Hybrid human-AI feedback approaches that allocate human effort where it matters most
- Recursive supervision with hierarchical structures for overseeing increasingly capable systems
- Debate and verification systems where models argue for and against claims
- Constitutional approaches enabling customizable values without per-instance human labeling
- Verifiable-reward training (RLVR) for domains where ground truth is available
Tackling fundamental limitations
- Better ways to capture and aggregate diverse human preferences across demographics and cultures
- Dynamic preference learning that adapts to changing values over time
- Detection methods for subtle reward hacking that is difficult for humans to notice
- Causality-based approaches to reward modeling that distinguish correlation from genuine quality
- Addressing sycophancy and deceptive alignment where models learn to appear aligned without genuine value internalization
Theoretical understanding
- Formalizing why RLHF works as well as it does in practice
- Characterizing the relationship between proxy and true rewards (building on Gao et al.'s scaling laws)
- Identifying tasks where RLHF fundamentally cannot work
- Understanding the limits of preference-based learning and the Bradley-Terry assumption
References
- Christiano, P., Leike, J., Brown, T.B., Martic, M., Legg, S., & Amodei, D. (2017). "Deep Reinforcement Learning from Human Preferences." Advances in Neural Information Processing Systems 30 (NeurIPS 2017). https://arxiv.org/abs/1706.03741
- Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. (2022). "Training language models to follow instructions with human feedback." Advances in Neural Information Processing Systems 35 (NeurIPS 2022). https://arxiv.org/abs/2203.02155
- Wikipedia contributors. "Reinforcement learning from human feedback." Wikipedia, The Free Encyclopedia. https://en.wikipedia.org/wiki/Reinforcement_learning_from_human_feedback
- Lambert, N. (2024). *RLHF Book*. https://rlhfbook.com/
- Knox, W.B. & Stone, P. (2008). "TAMER: Training an Agent Manually via Evaluative Reinforcement." 2008 7th IEEE International Conference on Development and Learning. https://www.cs.utexas.edu/~bradknox/papers/icdl08-knox.pdf
- Akrour, R., Schoenauer, M., & Sebag, M. (2011). "Preference-Based Policy Learning." ECML PKDD 2011. https://link.springer.com/chapter/10.1007/978-3-642-23780-5_11
- Ziegler, D.M., Stiennon, N., Wu, J., Brown, T.B., Radford, A., Amodei, D., Christiano, P., & Irving, G. (2019). "Fine-Tuning Language Models from Human Preferences." https://arxiv.org/abs/1909.08593
- Stiennon, N., Ouyang, L., Wu, J., Ziegler, D., Lowe, R., Voss, C., Radford, A., Amodei, D., & Christiano, P.F. (2020). "Learning to summarize from human feedback." NeurIPS 2020. https://arxiv.org/abs/2009.01325
- OpenAI. (2022). "Introducing ChatGPT." https://openai.com/index/chatgpt/
- Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C., et al. (2022). "Constitutional AI: Harmlessness from AI Feedback." https://arxiv.org/abs/2212.08073
- Glaese, A., McAleese, N., Trebacz, M., Aslanides, J., Firoiu, V., Ewalds, T., Rauh, M., Weidinger, L., Chadwick, M., Thacker, P., et al. (2022). "Improving alignment of dialogue agents via targeted human judgements." https://arxiv.org/abs/2209.14375
- Menick, J., Trebacz, M., Mikulik, V., Aslanides, J., Song, F., Chadwick, M., Glaese, M., Young, S., Campbell-Gillingham, L., Irving, G., & McAleese, N. (2022). "Teaching language models to support answers with verified quotes." https://arxiv.org/abs/2203.11147
- Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. (2023). "Llama 2: Open Foundation and Fine-Tuned Chat Models." https://arxiv.org/abs/2307.09288
- Dubey, A., Jauhri, A., Pandey, A., Kadian, A., et al. (2024). "The Llama 3 Herd of Models." https://arxiv.org/abs/2407.21783
- Gemini Team, Google. (2023). "Gemini: A Family of Highly Capable Multimodal Models." https://arxiv.org/abs/2312.11805
- OpenAI. (2024). "Learning to Reason with LLMs." https://openai.com/index/learning-to-reason-with-llms/
- Guo, D., Yang, D., Zhang, H., Song, J., et al. (2025). "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning." *Nature* (Vol. 645, Issue 8081). https://www.nature.com/articles/s41586-025-09422-z (arXiv: https://arxiv.org/abs/2501.12948)
- Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). "Proximal Policy Optimization Algorithms." https://arxiv.org/abs/1707.06347
- Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., Drain, D., Fort, S., Ganguli, D., Henighan, T., et al. (2022). "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback." https://arxiv.org/abs/2204.05862
- Lightman, H., Kosaraju, V., Burda, Y., Edwards, H., Baker, B., Lee, T., Leike, J., Schulman, J., Sutskever, I., & Cobbe, K. (2023). "Let's Verify Step by Step." https://arxiv.org/abs/2305.20050
- Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C.D., & Finn, C. (2023). "Direct Preference Optimization: Your Language Model is Secretly a Reward Model." NeurIPS 2023. https://arxiv.org/abs/2305.18290
- Casper, S., Davies, X., Shi, C., Gilbert, T.K., Scheurer, J., Rando, J., Freedman, R., Korbak, T., Lindner, D., Freire, P., et al. (2023). "Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback." Transactions on Machine Learning Research (TMLR). https://arxiv.org/abs/2307.15217
- Ethayarajh, K., Xu, W., Muennighoff, N., Jurafsky, D., & Kiela, D. (2024). "KTO: Model Alignment as Prospect Theoretic Optimization." https://arxiv.org/abs/2402.01306
- Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Zhang, M., Li, Y.K., Wu, Y., & Guo, D. (2024). "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models." https://arxiv.org/abs/2402.03300
- Lambert, N., Morrison, J., Pyatkin, V., Huang, S., Ivison, H., et al. (2024). "Tulu 3: Pushing Frontiers in Open Language Model Post-Training." https://arxiv.org/abs/2411.15124
- Xu, S., Fu, W., Gao, J., Ye, W., Liu, W., Mei, Z., Wang, G., Yu, C., & Wu, Y. (2024). "Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study." ICML 2024. https://arxiv.org/abs/2404.10719
- Bradley, R.A. & Terry, M.E. (1952). "Rank Analysis of Incomplete Block Designs: I. The Method of Paired Comparisons." *Biometrika*, 39(3/4), 324-345. https://www.jstor.org/stable/2334029
- Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J., & Mané, D. (2016). "Concrete Problems in AI Safety." https://arxiv.org/abs/1606.06565
- Lambert, N., Pyatkin, V., Morrison, J., Miranda, L.J., Lin, B.Y., Chandu, K., Dziri, N., Kumar, S., Zick, T., Choi, Y., Smith, N.A., & Hajishirzi, H. (2024). "RewardBench: Evaluating Reward Models for Language Modeling." https://arxiv.org/abs/2403.13787
- Anthropic. (2023). "Claude's Constitution." https://www.anthropic.com/news/claudes-constitution
- Black, K., Janner, M., Du, Y., Kostrikov, I., & Levine, S. (2023). "Training Diffusion Models with Reinforcement Learning." https://arxiv.org/abs/2305.13301
- Cui, G., Yuan, L., Ding, N., Yao, G., Zhu, W., Ni, Y., Xie, G., Liu, Z., & Sun, M. (2023). "UltraFeedback: Boosting Language Models with High-quality Feedback." https://arxiv.org/abs/2310.01377
- Wang, Z., Dong, Y., Delalleau, O., Zeng, J., Shen, G., Egert, D., Zhang, J.J., Sreedhar, M.N., & Kuchaiev, O. (2024). "HelpSteer2: Open-source dataset for training top-performing reward models." NeurIPS 2024 Datasets and Benchmarks Track. https://arxiv.org/abs/2406.08673
- Liu, C.Y., Zeng, L., Liu, J., Yan, R., He, J., Wang, C., Yan, S., Liu, Y., & Zhou, Y. (2024). "Skywork-Reward: Bag of Tricks for Reward Modeling in LLMs." https://arxiv.org/abs/2410.18451
- Hu, J., Wu, X., Wang, W., Zhang, D., & Cao, Y. (2024). "OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework." https://arxiv.org/abs/2405.11143
- Gao, L., Schulman, J., & Hilton, J. (2022). "Scaling Laws for Reward Model Overoptimization." ICML 2023. https://arxiv.org/abs/2210.10760
- Perez, E., Ringer, S., Lukosiute, K., Nguyen, K., Chen, E., Heiner, S., Pettit, C., Olsson, C., Kundu, S., Kadavath, S., et al. (2022). "Discovering Language Model Behaviors with Model-Written Evaluations." https://arxiv.org/abs/2212.09251
- Azar, M.G., Rowland, M., Piot, B., Guo, D., Calandriello, D., Valko, M., & Munos, R. (2024). "A General Theoretical Paradigm to Understand Learning from Human Preferences." AISTATS 2024. https://arxiv.org/abs/2310.12036
- Hong, J., Lee, N., & Thorne, J. (2024). "ORPO: Monolithic Preference Optimization without Reference Model." EMNLP 2024. https://arxiv.org/abs/2403.07691
- Meng, Y., Xia, M., & Chen, D. (2024). "SimPO: Simple Preference Optimization with a Reference-Free Reward." NeurIPS 2024. https://arxiv.org/abs/2405.14734
- Lee, H., Phatale, S., Mansoor, H., Lu, K., Mesnard, T., Bishop, C., Carbune, V., & Rastogi, A. (2023). "RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback." https://arxiv.org/abs/2309.00267
- Anthropic. (2023). "Collective Constitutional AI: Aligning a Language Model with Public Input." https://www.anthropic.com/news/collective-constitutional-ai-aligning-a-language-model-with-public-input
- Yu, Q., Zhang, Z., Zhu, R., Yuan, Y., Zuo, X., et al. (2025). "DAPO: An Open-Source LLM Reinforcement Learning System at Scale." https://arxiv.org/abs/2503.14476
- Ahmadian, A., Cremer, C., Gallé, M., Fadaee, M., Kreutzer, J., Pietquin, O., Üstün, A., & Hooker, S. (2024). "Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs." ACL 2024. https://arxiv.org/abs/2402.14740
- Kazemnejad, A., Aghajohari, M., Portelance, E., Sordoni, A., Reddy, S., Courville, A., & Le Roux, N. (2024). "VinePPO: Unlocking RL Potential For LLM Reasoning Through Refined Credit Assignment." ICML 2025. https://arxiv.org/abs/2410.01679
- Yue, Y., Yuan, Y., Yu, Q., Zuo, X., Zhu, R., Xu, W., Chen, J., et al. (2025). "VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks." https://arxiv.org/abs/2504.05118
- Dong, H., Xiong, W., Pang, B., Wang, H., Zhao, H., Zhou, Y., Jiang, N., Sahoo, D., Xiong, C., & Zhang, T. (2024). "RLHF Workflow: From Reward Modeling to Online RLHF." https://arxiv.org/abs/2405.07863
- Wu, Y., Sun, Z., Yuan, H., Ji, K., Yang, Y., & Gu, Q. (2024). "Self-Play Preference Optimization for Language Model Alignment." https://arxiv.org/abs/2405.00675
External links