Reinforcement Learning from Human Feedback (RLHF) is a machine learning technique that trains artificial intelligence systems to behave according to human preferences by learning reward functions from human feedback rather than hand-coded rules.[1] The technique combines supervised learning, reward modeling from human preferences, and reinforcement learning optimization to teach AI systems complex behaviors that are difficult to specify explicitly.
RLHF has become the industry-standard method for aligning large language models (LLMs) with human values, enabling systems like ChatGPT, Claude, and GPT-4 to follow instructions, provide helpful responses, and avoid harmful outputs.[2] Instead of using a hand-crafted reward function to specify the task in a reinforcement learning setup, RLHF involves learning a reward model directly from human feedback, and then optimizing the agent's policy using this learned reward signal.[3] RLHF is particularly useful for tasks where the ideal behavior is easy for humans to recognize but difficult to program explicitly, such as judging whether an answer is helpful or whether a joke is funny.
Imagine you're teaching a dog a new trick, but you can't tell it exactly what to do in words. Instead, every time the dog tries something, you say "good dog!" or "bad dog!" based on whether it did what you wanted. Over time, the dog figures out what makes you happy and keeps doing that.
RLHF works the same way with computers. A computer writes many different answers to a question. Then people look at pairs of answers and say which one they like better. A second computer program learns what makes people happy based on all these choices. Then the first computer practices writing answers, trying to make the second program (which learned what people like) give it a high score. After lots of practice, the computer gets really good at writing answers that people find helpful and safe.
This is how ChatGPT and Claude learned to be so good at answering questions in a way that feels natural and useful.
The core innovation of RLHF lies in learning what humans want rather than explicitly programming it. Humans provide comparative judgments between AI outputs, a reward model learns to predict these preferences, and reinforcement learning optimizes the AI to maximize predicted rewards while maintaining fluency and coherence.[4]
The standard RLHF pipeline consists of three distinct stages:
1. Supervised fine-tuning (SFT): trains a pretrained language model on high-quality human demonstrations to establish basic instruction-following capabilities
2. Reward model training: collects human preferences by showing annotators multiple AI-generated responses and trains a model to predict these preferences
3. Reinforcement learning optimization: uses the reward model to fine-tune the AI policy with algorithms like Proximal Policy Optimization (PPO), incorporating a KL divergence penalty to prevent drift
This approach addresses the reward specification problem: for complex tasks like writing helpful responses or generating creative content, it is nearly impossible to write explicit rules capturing what makes outputs good. RLHF leverages humans' ability to judge quality when comparing examples, even if they cannot articulate precise criteria.[3]
Training AI systems from human feedback has long been explored as a way to handle objectives that are hard to formally specify. The intellectual foundations of RLHF trace back to research on learning from human feedback in the late 2000s. The TAMER framework (Training an Agent Manually via Evaluative Reinforcement), introduced by Knox and Stone in 2008, allowed humans to guide an RL agent by giving scalar feedback signals, effectively shaping the agent's policy via human reinforcement instead of a predefined reward function.[5] These efforts demonstrated the feasibility of learning from human preferences but were limited to relatively simple environments.
The conceptual leap to preference-based reinforcement learning occurred in 2011, when independent research teams published foundational work on learning from preferences. Among them, Akrour et al. introduced preference-based policy learning, demonstrating that agents could learn directly from expert rankings of policies without simulator access or explicit rewards.[6]
Modern RLHF took shape with the landmark 2017 paper "Deep Reinforcement Learning from Human Preferences" by researchers from OpenAI and DeepMind, led by Paul Christiano and Jan Leike.[1] This work used deep neural networks to scale preference-based learning into a general method for complex, high-dimensional tasks. The paper demonstrated that agents could master challenging behaviors from remarkably little human feedback: their algorithm learned a difficult maneuver (a backflip for a simulated humanoid) from about 900 bits of human feedback, amounting to roughly an hour of a human trainer's time.[1] A relatively small amount of well-placed human feedback (less than 1% of the agent's interactions) was sufficient to significantly outperform baselines and even achieve superhuman scores in some tasks, without the agent ever seeing the true programmed rewards of the environment.
The key innovations included learning the reward function online from human comparisons of short clips of trajectory segments, training this reward predictor asynchronously alongside the policy, and selecting which clip pairs to show annotators so that each query was as informative as possible.
The success of this 2017 work established RLHF as a promising technique for aligning AI behavior with human-desired outcomes, sparking broader research into human-in-the-loop learning for AI safety and alignment.
Following the initial breakthrough, subsequent research expanded RLHF into new domains. Applying RLHF to natural language presented new challenges: language generation involves discrete tokens, massive action spaces, and subtle quality distinctions. OpenAI's 2019 paper "Fine-Tuning Language Models from Human Preferences" marked the first major application of RLHF to language models, fine-tuning GPT-2 on four tasks (sentiment control, descriptiveness, and two summarization tasks) using only 5,000 to 60,000 human comparisons.[7] The work built on advances in generative pretraining and demonstrated that reward learning from human preferences could effectively steer language model behavior.
Building on this foundation, in 2020 OpenAI applied RLHF to text summarization in the paper "Learning to Summarize from Human Feedback." A reward model learned to predict which summaries people preferred, and an RL policy (initialized from a pretrained GPT-3 model) was optimized to maximize this learned reward.[8] Models trained with roughly 60,000 human preference labels significantly outperformed much larger supervised models, and their summaries were preferred in evaluations over those of the original model, supervised baselines, and even the human-written reference summaries. This provided one of the first demonstrations that RLHF can successfully guide large-scale natural language processing (NLP) models on real-world tasks.
RLHF gained widespread attention with the development of InstructGPT and ChatGPT by OpenAI in 2022. InstructGPT is a family of GPT-3 based models fine-tuned using human feedback to better follow user instructions. The March 2022 InstructGPT paper represented RLHF's transition from research to industry-standard practice.[2] In the InstructGPT work, human annotators first provided demonstration answers and ranking comparisons for model outputs. Then a reward model was trained on these rankings, and finally the base model was further optimized via PPO to produce answers that maximize the reward model's score.
This process led to dramatic improvements in alignment with user intent. According to OpenAI, testers "significantly preferred" the outputs of a 1.3 billion-parameter InstructGPT model over those of the original 175 billion-parameter GPT-3 on a wide range of user prompts, despite the former having over 100x fewer parameters. RLHF fine-tuning also made the model's outputs more factual and less toxic than base GPT-3, showing an 82% reduction in harmful content generation and 29% better adherence to safety policies, all while maintaining performance on academic NLP benchmarks.[2]
These RLHF-trained InstructGPT models were deployed as the default models in OpenAI's API in 2022, and the approach paved the way for ChatGPT, a conversational AI launched in late 2022 that was built by fine-tuning GPT-3.5 with human feedback. InstructGPT's methodology directly enabled ChatGPT's November 2022 launch, which OpenAI explicitly credited to RLHF technology.[9] ChatGPT rapidly reached 100 million users and catalyzed industry-wide adoption.
Concurrently, the RLHF paradigm has been adopted by other leading AI labs. Anthropic developed Constitutional AI, extending RLHF with AI-generated feedback.[10] Their December 2022 paper introduced "RLAIF" (Reinforcement Learning from AI Feedback), where AI models evaluate responses according to written constitutional principles. Anthropic built its assistant Claude, refining it using RLHF and related techniques.
DeepMind developed a dialogue agent called Sparrow in 2022, which was trained via reinforcement learning on feedback from human reviewers to make its answers more correct and safer.[11] Sparrow uses human preference modeling to learn to avoid unsafe or misleading responses while engaging in helpful conversation. DeepMind also applied RLHF in models like Gopher and developed several notable applications including GopherCite, which trained models to cite evidence from the web.[12]
By 2023-2025, RLHF became ubiquitous across the AI industry and a standard part of the training pipeline for state-of-the-art large language models. Meta AI's Llama 2 documentation provided comprehensive public implementation details,[13] while Google's Gemini[14] and virtually all major deployed language models now use RLHF or variants as standard practice.
| Year | Milestone | Key reference |
|---|---|---|
| 2008 | TAMER Framework: first demonstration of learning from human evaluative feedback | Knox & Stone[5] |
| 2011 | Preference-based RL: established preference learning foundations | Akrour et al.[6] |
| 2017 | Deep RLHF: introduction of deep RL from human preferences for games and robotics | Christiano et al.[1] |
| 2017 | PPO: introduced dominant RL algorithm for RLHF | Schulman et al.[15] |
| 2019 | Fine-tuning GPT-2: first major RLHF application to language models | Ziegler et al.[7] |
| 2020 | Learning to Summarize: demonstrated RLHF superiority over larger supervised models | Stiennon et al.[8] |
| 2022 | InstructGPT: established RLHF as viable for general-purpose alignment | Ouyang et al.[2] |
| 2022 | HH-RLHF: training a helpful and harmless assistant with RLHF | Bai et al.[16] |
| 2022 | Constitutional AI: introduced RLAIF and explicit value specification | Bai et al.[10] |
| 2023 | DPO: simpler alternative to RL-based training | Rafailov et al.[17] |
| 2023 | Open problems in RLHF: examined fundamental limitations | Casper et al.[18] |
| 2023 | Llama 2: most detailed public RLHF documentation | Touvron et al.[13] |
| 2024 | KTO: alignment via binary feedback using prospect theory | Ethayarajh et al.[19] |
| 2024 | GRPO: critic-free RL for reasoning using group-relative scoring | Shao et al.[20] |
The process of reinforcement learning from human feedback typically follows a sequence of three main stages: pre-training (usually followed by an intermediate supervised fine-tuning step), training a reward model, and optimizing the policy with reinforcement learning.[4]
RLHF usually starts with a pretrained model or agent that has learned broadly from a large dataset or through conventional RL. In NLP applications, this is a large language model trained on vast text corpora (e.g., GPT). In control tasks or games, this could be an agent with some prior knowledge. The pre-training provides a foundation of general capabilities, which RLHF will then refine. Notably, the computational cost of the initial pre-training far exceeds that of the subsequent RLHF fine-tuning; for example, the RLHF phase for InstructGPT consumed less than 2% of the compute used to pre-train GPT-3.
The RLHF pipeline begins with supervised fine-tuning (SFT), which transforms a pretrained language model into an instruction-following system. Pretrained models excel at pattern completion but don't naturally follow explicit instructions. SFT bridges this gap by training on high-quality human demonstrations, where humans write desired outputs for given prompts. The model is fine-tuned on these prompt-response pairs to ensure it can follow instructions and generate desirable outputs.[2]
The process uses standard supervised learning with the causal language modeling objective:
L_SFT(theta) = -E_(x,y)~D [ sum_t log p_theta(y_t | x, y_<t) ]
where theta represents model parameters, x is the prompt/instruction, y is the desired response, and D is the demonstration dataset.[4]
Training data typically consists of 10,000 to 100,000 demonstrations sourced from human labelers, API usage, or carefully curated existing data. For example, in InstructGPT, approximately 13,000 prompts were used for SFT.[2] The SFT model becomes the reference policy (pi_ref) used during RL optimization to compute KL divergence penalties. This supervised learning step primes the model to output generally acceptable answers, simplifying the subsequent reinforcement learning stage.
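As a concrete illustration, the SFT objective above can be implemented in a few lines. The sketch below assumes a Hugging Face-style causal language model whose forward pass returns logits; the function name and calling convention are illustrative, not from any particular library:

```python
import torch
import torch.nn.functional as F

def sft_loss(model, prompt_ids, response_ids):
    """Causal LM cross-entropy on response tokens only (prompt tokens masked out)."""
    input_ids = torch.cat([prompt_ids, response_ids], dim=1)  # (batch, T)
    logits = model(input_ids).logits                          # (batch, T, vocab)
    # Position t predicts token t+1: shift logits left and targets right.
    shift_logits = logits[:, :-1, :]
    shift_targets = input_ids[:, 1:].clone()
    # Mask prompt positions so the loss is taken only over the response y.
    prompt_len = prompt_ids.shape[1]
    shift_targets[:, : prompt_len - 1] = -100  # -100 is ignored by cross_entropy
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_targets.reshape(-1),
        ignore_index=-100,
    )
```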
A dataset is constructed by having human evaluators judge or rank outputs of the model-in-training. Typically, the current model (or a set of candidate models) is used to generate answers for a variety of prompts, and human annotators are asked to compare which output is better according to some criteria (e.g., which answer is more helpful or accurate). Human annotators provide feedback by ranking multiple model-generated responses to a given prompt. For instance, to train a chatbot, labelers might be shown two possible replies to a user query and asked which reply they prefer. These comparison judgments (or sometimes scalar ratings) constitute the feedback data.
Using pairwise comparisons helps because humans often find it easier to say "output A is better than output B" than to assign absolute scores, and it reduces variance between different annotators' scales. The collection process may be iterative: as the policy improves, new data is sampled in areas where the model is still uncertain, and humans provide feedback on those outputs, continually expanding the feedback dataset.
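Conceptually, each element of the resulting feedback dataset is just a prompt with a preferred and a rejected response. A minimal sketch of one record (all field names are illustrative, not from any specific dataset):

```python
preference_record = {
    "prompt": "Explain photosynthesis to a ten-year-old.",
    "chosen": "Plants are like little chefs that use sunlight...",        # y_w, preferred
    "rejected": "Photosynthesis is the process of C3 carbon fixation...",  # y_l
    "annotator_id": "labeler_042",  # retained to audit inter-annotator agreement
}
```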
The second stage trains a reward model (also called a preference model) that predicts human preferences between AI-generated outputs. The reward model takes as input an agent's output (and usually the initial prompt or state) and produces a scalar reward value that should correlate with how humans would rate that output. The RM is typically a neural network initialized from the SFT model: in NLP tasks, one often takes a copy of the language model and fine-tunes it to predict preference scores using a loss function based on pairwise comparisons.
The reward model is trained using the Bradley-Terry model of pairwise preferences. This model assumes that for responses y_w (winner/chosen) and y_l (loser/rejected) with latent qualities r(y_w) and r(y_l), the probability that y_w is preferred follows:
P(y_w > y_l | x) = sigma(r_theta(x, y_w) - r_theta(x, y_l))
where sigma is the sigmoid function (logistic function) and r_theta is the reward model.[21] The Bradley-Terry model assumes each item has a latent strength, and observed preferences are a noisy reflection of these underlying strengths. Only differences in reward scores matter; adding the same constant to all scores leaves the preference probabilities unchanged.
The reward model is trained by maximizing the log-likelihood of observed preferences using cross-entropy loss:
L_RM(theta) = -E_(x, y_w, y_l) ~ D [ log sigma(r_theta(x, y_w) - r_theta(x, y_l)) ]
where y_w is the preferred response over y_l.
Architecturally, reward models typically start from the SFT model checkpoint, replacing the final token prediction layer with a linear projection to a single scalar value.[2] The scalar reward is read from the last token position, representing the quality of the entire sequence. Datasets for RM training can include 100,000 to 1 million comparisons. After training, the reward model serves as a stand-in for human judgment, evaluating any new output and estimating how well a human would like it. This allows the next phase of training to proceed without a human in the loop for every single evaluation.
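A minimal sketch of both pieces, the scalar value head and the Bradley-Terry training loss, is shown below. It assumes a transformer backbone that returns hidden states of shape (batch, sequence, hidden); all names are illustrative:

```python
import torch
import torch.nn.functional as F

class RewardModel(torch.nn.Module):
    def __init__(self, backbone, hidden_size):
        super().__init__()
        self.backbone = backbone  # e.g., the SFT transformer trunk, minus its LM head
        self.value_head = torch.nn.Linear(hidden_size, 1)  # projection to a scalar

    def forward(self, input_ids):
        hidden = self.backbone(input_ids)                      # (batch, T, hidden)
        return self.value_head(hidden[:, -1, :]).squeeze(-1)   # score read at last token

def reward_model_loss(reward_model, chosen_ids, rejected_ids):
    """Pairwise Bradley-Terry loss: -log sigma(r(x, y_w) - r(x, y_l))."""
    r_chosen = reward_model(chosen_ids)      # (batch,)
    r_rejected = reward_model(rejected_ids)  # (batch,)
    # logsigmoid is numerically safer than log(sigmoid(.))
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```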
In the final phase, the policy model is fine-tuned using a reinforcement learning algorithm, with the reward model providing the reward signal. This stage formulates text generation as a sequential decision problem. Instead of a manual reward function defined by engineers, the agent now treats the learned reward model as its objective to maximize.
The complete RLHF objective combines reward maximization with a KL divergence penalty:
J_RLHF(theta) = E_(x ~ D, y ~ pi_theta) [ r_phi(x, y) - beta * D_KL(pi_theta(y|x) || pi_ref(y|x)) ]
Or equivalently, expanding the KL divergence as an expected log-ratio:
J_RLHF(theta) = E_(x ~ D, y ~ pi_theta) [ r_phi(x, y) - beta * log(pi_theta(y|x) / pi_ref(y|x)) ]
where pi_theta is the current policy being optimized, pi_ref is the frozen reference policy (SFT model), r_phi is the frozen reward model, and beta is the KL penalty coefficient, typically ranging from 0.01 to 0.1.[4]
An additional pretraining term (with mixing coefficient gamma, as in InstructGPT) may be added to avoid catastrophic forgetting:
L(theta) = J_RLHF(theta) + gamma * E_(x ~ D_pretrain) [ log pi_theta(x) ]
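In implementations, the KL term is typically estimated per token from the log-probabilities of the tokens actually sampled, and folded into the reward before the policy update. A minimal sketch under that assumption (all names illustrative):

```python
import torch

def shaped_reward(rm_score, logp_policy, logp_ref, beta=0.05):
    """RLHF reward for one sampled response: r_phi(x, y) - beta * KL estimate.

    logp_policy, logp_ref: (T,) tensors of log pi_theta(y_t | x, y_<t) and
    log pi_ref(y_t | x, y_<t) for the sampled tokens. Their summed difference
    is a Monte Carlo estimate of the sequence-level KL divergence.
    """
    kl_estimate = (logp_policy - logp_ref).sum()
    return rm_score - beta * kl_estimate
```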
A common choice of RL algorithm for this stage is Proximal Policy Optimization (PPO), a stable policy-gradient method, though other policy-gradient and actor-critic algorithms can be used. Using PPO, the model generates an output (action) for a given input (state or prompt), the reward model scores this output, and the PPO update adjusts the model's parameters to increase the probability of outputs that lead to higher reward scores. PPO includes a mechanism, its clipped objective, that prevents the policy from straying too far from the previous policy in a single update. This is important because the reward model is an imperfect proxy; if the policy changes too drastically, it might exploit quirks of the reward model (a form of reward hacking), producing gibberish or undesired outputs that nevertheless score high.
The KL divergence term prevents distribution collapse where the model assigns all probability to narrow sequences, maintains language fluency by staying close to the well-trained SFT model, and prevents reward hacking where the model exploits reward model weaknesses. Over many training iterations, this RL process tunes the agent to generate outputs that align with the learned human preferences.
After RLHF fine-tuning, InstructGPT and similar models began following user instructions more reliably, avoided certain classes of undesirable content, and showed improved factual accuracy in open-ended question answering.
| Step | Description | Key components |
|---|---|---|
| Pretraining/SFT | Fine-tune base model on human demonstrations | Prompt-response pairs, supervised learning, 10,000-100,000 demonstrations |
| Reward modeling | Train RM on ranked preferences | Pairwise comparisons, Bradley-Terry model, 100,000-1M comparisons |
| Policy optimization | Use RL to maximize RM rewards | PPO, KL penalty, optional pretraining mix |
Proximal Policy Optimization (PPO) is the dominant reinforcement learning algorithm used in RLHF, introduced by Schulman et al. in 2017.[15] PPO addresses how to update policies using sampled data without taking destructively large steps that degrade performance. Its stable updates made it practical to apply RLHF to high-dimensional neural network policies and helped reduce the amount of human feedback needed for training to succeed.
The core innovation is the clipped surrogate objective. PPO defines a probability ratio:
r_t(theta) = pi_theta(a_t | s_t) / pi_theta_old(a_t | s_t)
The objective becomes:
L^CLIP(theta) = E_t [ min(r_t(theta) * A_hat_t, clip(r_t(theta), 1 - epsilon, 1 + epsilon) * A_hat_t) ]
where A_hat_t is the advantage estimate, epsilon is the clipping parameter (typically 0.1 to 0.2), and the minimum creates a pessimistic lower bound.[22]
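A minimal sketch of this objective as a loss to minimize (hence the sign flip), assuming per-token log-probabilities and advantage estimates have already been computed (names illustrative):

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, epsilon=0.2):
    """Clipped surrogate objective, negated for gradient descent."""
    ratio = torch.exp(logp_new - logp_old)                          # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    # min(.) makes the bound pessimistic: large ratios earn no extra credit.
    return -torch.min(unclipped, clipped).mean()
```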
PPO's selection for RLHF stems from multiple advantages: its clipped objective yields stable updates without the second-order machinery of trust-region methods, it is relatively simple to implement on top of standard gradient-based training, and it had already been validated at scale on large neural network policies.
However, PPO-based RLHF requires maintaining four large models simultaneously: the policy model, reference model, reward model, and value/critic model. For models exceeding 70 billion parameters, this creates substantial memory challenges.
Reward models translate human preferences into scalar signals for reinforcement learning. These models face a challenging task: predicting subtle, context-dependent human judgments about text quality from limited training data while generalizing to novel outputs.
Architecturally, reward models typically mirror the language model being trained, often initialized from the same SFT checkpoint to leverage linguistic knowledge.[2] The final layer is modified to output a single scalar value rather than a probability distribution over tokens.
Reward model quality critically determines RLHF success and failure modes. A weak reward model gets exploited. During RL optimization, the policy discovers inputs that achieve high predicted rewards without actually satisfying human preferences, a phenomenon called reward hacking or reward overoptimization.[23]
Mitigating reward model limitations involves constraining optimization with the KL penalty, periodically collecting fresh human preference data so the reward model is retrained on the current policy's outputs, using larger reward models or ensembles of them, and stopping RL optimization before the proxy reward diverges too far from true human preferences.
The KL divergence penalty is a fundamental component preventing RLHF collapse. The Kullback-Leibler divergence measures the difference between two probability distributions:
D_KL(P || Q) = E_(x ~ P) [ log(P(x) / Q(x)) ]
In RLHF, the KL penalty constrains how much the optimized policy pi_theta can diverge from the reference policy pi_ref during RL training. The penalty coefficient beta controls the trade-off between reward maximization and staying close to the reference.[24]
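For intuition, the definition can be evaluated exactly on small discrete distributions; a toy example:

```python
import torch

p = torch.tensor([0.7, 0.2, 0.1])  # P: e.g., the optimized policy's next-token distribution
q = torch.tensor([0.5, 0.3, 0.2])  # Q: e.g., the reference policy's distribution
kl = (p * (p / q).log()).sum()     # D_KL(P || Q) = sum_x P(x) log(P(x)/Q(x))
print(round(kl.item(), 4))         # ~0.0851 nats; zero if and only if P equals Q
```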
The penalty serves multiple functions: it prevents distribution collapse, where the model would concentrate all probability on a few high-reward sequences; it maintains language fluency by keeping the policy close to the well-trained SFT model; and it limits reward hacking by restricting how far the policy can move in search of reward model weaknesses.
From a Bayesian perspective, KL-regularized RL implements variational inference approximating a target posterior distribution, which provides theoretical grounding for why the method avoids distribution collapse.
There are two main approaches to implementing the KL constraint in PPO for RLHF. PPO-Penalty approximately solves a KL-constrained update by penalizing the KL divergence in the objective function and automatically adjusting the penalty coefficient during training. PPO-Clip relies instead on specialized clipping in the objective function to remove incentives for the new policy to deviate too far from the old policy, without an explicit KL term.
RLHF has been applied across a range of AI domains, from game-playing agents to large-scale text generation, to align models with human values.[3]
One of the most prominent uses of RLHF is in natural language processing, where it has become a key technique for aligning language models with human expectations. RLHF improves conversational agents, text summarization, and instruction-following. Instruct-tuned models like InstructGPT and conversational agents like ChatGPT rely on RLHF to produce answers that users find helpful and safe. The technique has been used for tasks such as text summarization (models that generate summaries preferred by readers), open-ended question answering, translation, and dialogue.
Notable systems include OpenAI's ChatGPT and InstructGPT, Anthropic's Claude, DeepMind's Sparrow, and Google's Gemini. By training on human feedback, these models can handle subjective or nuanced criteria (e.g., writing style, humor, avoiding offensive language) that are not captured by likelihood alone. RLHF also helps reduce toxicity and bias in LLM outputs.
ChatGPT represents RLHF's most visible and impactful application, transforming GPT-3.5's raw capabilities into a helpful, harmless conversational assistant.[9] OpenAI's announcement was explicit about the role of RLHF: "We trained this model using Reinforcement Learning from Human Feedback (RLHF), using the same methods as InstructGPT, but with slight differences in the data collection setup." The system builds directly on InstructGPT's methodology, with key adaptations for dialogue: human trainers wrote example conversations playing both sides (the user and the AI assistant), this dialogue data was mixed with the InstructGPT dataset transformed into a conversational format, and comparison data was collected by having trainers rank alternative completions of sampled conversations.
InstructGPT's results demonstrated RLHF's power: the 1.3 billion parameter model was preferred over 100x larger GPT-3 outputs, showing 82% reduction in harmful content generation and 29% better adherence to safety policies.[2]
Anthropic's Claude models pioneered Constitutional AI (CAI), which reduces human annotation requirements while making value alignment more transparent.[10] The approach uses AI-generated feedback (RLAIF) guided by explicit constitutional principles rather than relying entirely on human preference comparisons.
Constitutional AI operates in two phases: a supervised phase, in which the model critiques and revises its own responses according to the constitutional principles and is fine-tuned on the revisions, and a reinforcement learning phase, in which an AI feedback model compares response pairs against the constitution to produce the preference labels used for reward model training (RLAIF).
Claude's constitution draws from diverse sources including the UN Universal Declaration of Human Rights, trust and safety best practices, and Anthropic's "helpful, honest, harmless" (HHH) criteria.[25] Results demonstrate Constitutional AI's effectiveness: models achieved Pareto improvement, being both more helpful and more harmless than standard RLHF, with 70-85% preference rates in human evaluations. Anthropic reported that RL from AI feedback achieved results comparable to standard RLHF in making an assistant harmless, while using approximately 80% less human feedback on harm.[10]
Separately, Anthropic's earlier 2022 paper on training a helpful and harmless assistant (the HH-RLHF paper) provided one of the most detailed public studies of applying RLHF to a dialogue agent.[16] That work explored an iterated online mode of training where preference models and RL policies were updated on a weekly cadence with fresh human feedback data, demonstrating that alignment training improves performance on almost all NLP evaluations and is fully compatible with training for specialized skills such as Python coding and summarization.
Meta AI's Llama 2 paper provides the most comprehensive public documentation of RLHF implementation details.[13] The documentation reveals a preference dataset of roughly 1.4 million binary comparisons collected in weekly batches, two separate reward models trained for helpfulness and for safety, and five successive iterations of RLHF fine-tuning (RLHF-V1 through V5).
Technical innovations include rejection sampling fine-tuning (training on the highest-reward sample among several generations) combined with PPO, a margin term in the reward model loss reflecting how strongly annotators preferred one response, and Ghost Attention (GAtt) to keep multi-turn dialogues consistent with system instructions.
Results showed Llama 2-Chat models outperforming most open-source competitors on helpfulness and safety benchmarks.
Google's Gemini models represent large-scale application of RLHF to multimodal systems.[14] The natively multimodal pretraining architecture processes text, images, audio, and video together, with RLHF refinement applied across modalities.
DeepMind developed several notable RLHF applications: Sparrow (2022), a dialogue agent trained with human preference feedback to give more correct answers and follow safety rules, and GopherCite, which trained models to support their answers with citations of evidence from the web.[11][12]
In robotics and control, RLHF offers a way to teach robots complex behaviors that are hard to specify with a reward function. The original 2017 "Deep Reinforcement Learning from Human Preferences" paper demonstrated RLHF in simulated robotics and Atari games.[1] A human can watch video clips of two attempts (e.g., a robot stacking blocks) and indicate which attempt was better; from this feedback the robot eventually learns a policy that accomplishes the task as the human intends. This approach bypasses manually engineering a reward (which might accidentally encourage wrong behaviors) and instead leverages human intuition about what the correct outcome looks like. The original experiments showed a simulated robot learning to do a backflip and a drive-and-park task solely from human preference judgments.
For Atari games, humans viewed two clips of agent gameplay and indicated which looked better. Agents learned to play successfully from preference data alone, and human preferences sometimes contained more useful information than performance-based metrics.
Another area is video game agents and other interactive environments. RLHF can train game-playing bots not just to win in terms of game score, but to behave in ways that human players or spectators prefer. Large-scale agents such as OpenAI Five for Dota 2 and DeepMind's AlphaStar for StarCraft II, which reached professional-level play, relied primarily on self-play and learning from human demonstrations rather than the full RLHF pipeline, but they illustrate the broader trend of incorporating human data into agent training.
Beyond NLP and games, RLHF has been explored in domains like image generation and computer vision. Text-to-image models can produce outputs ranked by humans for quality and alignment with the prompt; a reward model trained on these rankings can fine-tune the image generator accordingly. RLHF has found applications in text-to-image generation using Denoising Diffusion Policy Optimization (DDPO) to train on aesthetic preferences.[26]
RLHF is also used in healthcare for patient education, recommender systems for engagement (optimizing recommendation policies based on feedback beyond simple clicks), and multimodal tasks like text-to-image alignment.
| Domain | Example systems | Benefits |
|---|---|---|
| LLMs | ChatGPT, Claude, InstructGPT | Improved helpfulness, safety, and instruction-following |
| Games | Atari agents, OpenAI Five, AlphaStar | Better exploration, robustness, and human-like behavior |
| Robotics | Simulated tasks, manipulation, navigation | Alignment with human demonstrations and preferences |
| Image generation | Text-to-image models, DDPO | Reduced overfitting, better quality, aesthetic alignment |
| Healthcare | Patient education systems | Accurate, ethical responses |
| Recommender systems | Content recommendation | Better user satisfaction beyond clicks |
RLHF offers several benefits over conventional training approaches that explain its rapid adoption.
Flexibility in capturing complex preferences. RLHF can align models with subtle, context-dependent human judgments nearly impossible to specify through hand-coded rules. This helps tackle the long-standing problem of value misalignment, where an agent achieves the literal objective it was given but not the outcome humans actually wanted. By defining the objective through human feedback, RLHF directly optimizes what humans care about (at least to the extent that the human feedback is representative of true preferences). This makes it a powerful tool for AI alignment, especially in scenarios involving ethics, safety, or complex social values.
Data efficiency in fine-tuning. In many cases, a relatively small amount of human feedback can substantially improve a large model's performance on subjective tasks. InstructGPT's results, where a 1.3B model with RLHF surpassed a 175B model, exemplify how human feedback can unlock latent capabilities more effectively than scaling up model size or unsupervised data.
Implicit reward shaping. RLHF learns reward functions from data rather than requiring hand-specified proxies, making it applicable to tasks where defining the reward function manually would be intractable.
Practical effectiveness. The technique has been proven across diverse tasks and scales, with 15-85% improvements across various metrics in production systems. Users generally prefer the outputs of models tuned with human feedback, finding them more helpful, correct, and aligned with what was asked.
Safety improvements. RLHF-trained models are better at refusing inappropriate requests and explaining their refusals. OpenAI reported that RLHF dramatically reduced the frequency of hallucinations and toxic content in their model outputs. DeepMind's Sparrow saw improvements in following conversation rules and providing evidence when available.
The technique successfully handles tasks spanning summarization, dialogue, question answering, code generation, creative writing, and instruction following.[3]
Despite its effectiveness, RLHF faces significant challenges that remain active areas of research.
Obtaining high-quality human preference data can be expensive and time-consuming, since it often requires skilled annotators to carefully compare outputs. At scale, human preference data costs roughly $1-10+ per preference pair, and useful datasets often require many thousands of annotations; the creation of InstructGPT's alignment dataset has been estimated to cost approximately $1 million. As models become more capable and are applied to broader tasks, the amount of feedback required to cover diverse scenarios increases. Relying heavily on human-in-the-loop training can become a bottleneck for scaling up AI systems.
Computational requirements pose another barrier. As noted above, PPO-based RLHF maintains four large models simultaneously (policy, reference, reward, and value/critic), which creates enormous memory demands for models exceeding 70 billion parameters. Research also suggests RLHF does not scale as effectively as pretraining, with larger policy models benefiting less and improvements saturating quickly.[27]
Human feedback is noisy and potentially biased. Different annotators may have inconsistent preferences or make mistakes, and even a single person's judgment can vary over time or context. If the feedback dataset is not carefully curated, the reward model may learn a skewed version of human preferences. Human preferences are subjective and can introduce biases, leading to models that favor majority opinions and disadvantage underrepresented groups.[18] A model tuned with feedback from a specific demographic might not perform well for users from a different background, or it might unintentionally amplify the biases present in the trainers' judgments.
As AI systems become more capable, human oversight becomes increasingly difficult. The Bradley-Terry model assumes preferences are transitive and consistent, but real human preferences often violate these assumptions.
Reward hacking represents a fundamental problem where models exploit weaknesses in reward models to achieve high scores without genuinely improving quality.[28] Since the reward model is an imperfect proxy for what humans truly want, a powerful agent may find ways to game this proxy. Research by Gao et al. (2022) demonstrated "reward overoptimization": as policies optimize proxy rewards, true reward initially increases but then declines after reaching a peak.[23] The relationship between proxy reward and true reward follows predictable scaling laws, with the coefficients scaling smoothly with the number of reward model parameters.
Manifestations include excessive verbosity that exploits length biases in the reward model, sycophancy (telling users what they appear to want to hear), confidently worded but inaccurate answers, and in extreme cases degenerate or repetitive text that nevertheless scores highly.
Techniques like regularizing the policy updates (as PPO does) and continually updating the reward model with fresh human data can mitigate this, but not fully eliminate it. The problem is an instance of Goodhart's law: "when a measure becomes a target, it ceases to be a good measure."
PPO is notoriously sensitive to hyperparameter choices. The multi-stage pipeline creates intricate dependencies; poor SFT models produce poor reward model training data, leading to ineffective RL optimization. The alignment tax means RLHF can degrade performance on general capabilities not targeted during alignment, though InstructGPT showed this effect can be mitigated through careful pretraining data mixing.
Judging how well an RLHF-trained model actually aligns with human values is inherently difficult. The model is optimized to do well on the feedback it was given, but verifying it will generalize to new situations requires thorough testing with adversarial or diverse queries. For example, after training Sparrow, DeepMind had participants try to trick the agent into breaking rules to see how often it fails. These evaluations help quantify progress but are inherently limited by the creativity and perspectives of the evaluators.
The objective optimized by RL (maximizing reward model scores) does not perfectly align with true human preferences.[18] Creating a single reward model for diverse human values is fundamentally misspecified. Critics argue that RLHF alone is insufficient for aligning superintelligent AI due to these limitations.
As RLHF becomes a standard tool, researchers are exploring extensions and alternatives to address its limitations.
Direct Preference Optimization (DPO), introduced by Rafailov et al. in 2023, eliminates the reinforcement learning step altogether and has emerged as a simpler, more stable alternative to RLHF.[17] DPO takes the preference data (pairwise comparisons) and directly fine-tunes the main model to satisfy those preferences using a simple supervised objective, rather than training a separate reward model and doing RL.
DPO's key insight is that the optimal policy for the RLHF objective can be expressed analytically, enabling direct optimization from preference data using supervised learning. The method derives a loss function for the policy such that optimizing this loss is theoretically equivalent to optimizing the expected reward as in RLHF, under certain assumptions. The benefit is that it forgoes the complexity of RL training (which can be unstable and sensitive to hyperparameters) and instead uses standard gradient descent on a tailored classification loss constructed from the preference pairs.
The DPO loss is:
L_DPO(theta) = -E_(x, y_w, y_l) [ log sigma( beta * log(pi_theta(y_w|x) / pi_ref(y_w|x)) - beta * log(pi_theta(y_l|x) / pi_ref(y_l|x)) ) ]
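A minimal sketch of this loss, assuming the summed token log-probabilities of each response under the policy and the frozen reference model have already been gathered (names illustrative):

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for a batch of preference pairs.

    Each argument is a (batch,) tensor of summed log-probs of the chosen (w)
    or rejected (l) response under the policy (logp_*) or reference (ref_logp_*).
    """
    # Implicit reward margin: beta * [log(pi/pi_ref) for chosen minus rejected]
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -F.logsigmoid(beta * margin).mean()
```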
DPO's advantages include eliminating the separate reward model and the RL sampling loop, substantially lower memory and compute requirements (only the policy and a frozen reference model are needed), greater training stability than PPO, and far fewer hyperparameters to tune.
In experiments, DPO achieved performance on par with or exceeding PPO-based RLHF in controlling sentiment, improving dialogue responses, and summarization. The method has seen rapid adoption with models like Zephyr, Mixtral, and Intel's NeuralChat. However, DPO still relies on the quality of human preference data and can suffer from the same bias issues if the data is unrepresentative.
Reinforcement Learning from AI Feedback (RLAIF) replaces human preference labels with AI-generated evaluations, addressing RLHF's scalability bottleneck.[30] The approach follows the standard RLHF structure but uses an off-the-shelf LLM to generate preference labels rather than requiring human annotators.
Research demonstrates RLAIF achieves performance on par with RLHF across summarization, helpful dialogue, and harmless dialogue tasks. RLAIF approaches can dramatically cut down on cost, since once a robust AI evaluator is in place, it can label large amounts of data quickly. Advantages include dramatic cost reduction, vastly greater scalability, more consistency than human evaluators, and faster iteration cycles.
However, these methods depend on the AI feedback being reliable; the feedback model's alignment becomes the critical bottleneck. RLAIF is seen as a promising direction to scale alignment techniques to very powerful models but is not a complete replacement for human judgment, especially on questions of ethics and values.
Constitutional AI represents more than just RLAIF; it is a philosophical approach emphasizing explicit, transparent value specification.[25] Rather than learning implicit preferences, CAI encodes values in written principles drawn from established sources.
The methodology combines self-improvement through critique-and-revision with RLAIF supervision. Collective Constitutional AI extends this by incorporating democratic input, with experiments involving approximately 1,000 participants voting on AI principles.[31]
Hybrid approaches that combine several of these techniques, for example using AI feedback for some objectives (such as harmlessness) while retaining human feedback for others, also show promise.
Group Relative Policy Optimization (GRPO), introduced by DeepSeek in 2024, is a variant of PPO designed to reduce the computational cost of RLHF by eliminating the need for a separate critic (value) model.[20] In standard PPO-based RLHF, a value network estimates baselines for advantage computation, which roughly doubles memory requirements. GRPO instead estimates baselines from group scores: for each prompt, the model generates a group of responses, scores them with the reward model, and computes advantages relative to the group mean.
This approach roughly halves memory and compute requirements compared to standard PPO-based RLHF. GRPO was first introduced in the DeepSeekMath paper and subsequently used in the post-training of DeepSeek-R1, where it proved particularly effective at enhancing mathematical reasoning capabilities. On GSM8K, GRPO improved accuracy from 82.9% to 88.2%, and on MATH from 46.8% to 51.7%. Recent extensions include DAPO and Dr. GRPO, which further refine the approach.
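A minimal sketch of the group-relative advantage computation at the heart of GRPO (names illustrative; assumes more than one response per group):

```python
import torch

def grpo_advantages(group_rewards, eps=1e-8):
    """Advantages for one prompt's group of sampled responses.

    group_rewards: (G,) reward scores for G responses to the same prompt.
    Each response's advantage is its reward standardized against the group,
    replacing the learned value baseline that PPO would use.
    """
    mean, std = group_rewards.mean(), group_rewards.std()
    return (group_rewards - mean) / (std + eps)
```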
Kahneman-Tversky Optimization (KTO), introduced by Ethayarajh et al. in 2024, frames model alignment through the lens of prospect theory from behavioral economics.[19] While DPO requires paired preference data (chosen vs. rejected responses), KTO only needs a binary signal indicating whether an output is desirable or undesirable for a given input. This is a significant practical advantage because binary feedback is far cheaper and more abundant than pairwise comparisons.
KTO's theoretical foundation draws on Kahneman and Tversky's prospect theory, which describes how humans perceive value and make decisions under uncertainty, incorporating cognitive biases like loss aversion. The paper demonstrates that existing alignment objectives (including DPO) implicitly incorporate many of these biases, and the success of these objectives over simple cross-entropy minimization can partly be attributed to them belonging to a family of loss functions called human-aware losses (HALOs).
KTO matches or exceeds the performance of preference-based methods at scales from 1B to 30B parameters.
Researchers are also investigating ways to make human feedback more efficient through active learning strategies (selectively querying humans on the most informative comparisons rather than random samples), semi-automated feedback (using heuristic or model-based pre-screening to reduce trivial queries), and combining demonstrations and preferences (using a few high-quality human demonstrations to bootstrap the policy and then preferences for further refinement).
| Method | Year | Requires reward model | Requires RL | Data format | Key advantage |
|---|---|---|---|---|---|
| RLHF (PPO) | 2017 | Yes | Yes | Pairwise preferences | Well-tested, flexible |
| DPO | 2023 | No | No | Pairwise preferences | Simpler, more stable |
| RLAIF | 2022 | Yes | Yes | AI-generated preferences | Scalable, lower cost |
| Constitutional AI | 2022 | Yes | Yes | Principles + AI feedback | Transparent values |
| GRPO | 2024 | Yes | Yes (critic-free) | Pairwise or rule-based | Half the memory of PPO |
| KTO | 2024 | No | No | Binary (good/bad) | No paired data needed |
The RLHF research community faces numerous open challenges, surveyed comprehensively by Casper et al. (2023), who identified problems across three categories: challenges with feedback, challenges with the reward model, and challenges with the policy.[18]