Reinforcement Learning from Human Feedback (RLHF) is a machine learning technique that trains artificial intelligence systems to behave according to human preferences by learning reward functions from human feedback rather than hand-coded rules.[1] The technique combines supervised learning, reward modeling from human preferences, and reinforcement learning optimization to teach AI systems complex behaviors that are difficult to specify explicitly.
RLHF has become the industry-standard method for aligning large language models (LLMs) with human values, enabling systems like ChatGPT, Claude, and GPT-4 to follow instructions, provide helpful responses, and avoid harmful outputs.[2] Instead of using a hand-crafted reward function to specify the task in a reinforcement learning setup, RLHF involves learning a reward model directly from human feedback, and then optimizing the agent's policy using this learned reward signal.[3] RLHF is particularly useful for tasks where the ideal behavior is easy for humans to recognize but difficult to program explicitly, such as judging whether an answer is helpful or whether a joke is funny.
Imagine you're teaching a dog a new trick, but you can't tell it exactly what to do in words. Instead, every time the dog tries something, you say "good dog!" or "bad dog!" based on whether it did what you wanted. Over time, the dog figures out what makes you happy and keeps doing that.
RLHF works the same way with computers. A computer writes many different answers to a question. Then people look at pairs of answers and say which one they like better. A second computer program learns what makes people happy based on all these choices. Then the first computer practices writing answers, trying to make the second program (which learned what people like) give it a high score. After lots of practice, the computer gets really good at writing answers that people find helpful and safe.
This is how ChatGPT and Claude learned to be so good at answering questions in a way that feels natural and useful.
The core innovation of RLHF lies in learning what humans want rather than explicitly programming it. Humans provide comparative judgments between AI outputs, a reward model learns to predict these preferences, and reinforcement learning optimizes the AI to maximize predicted rewards while maintaining fluency and coherence.[4]
The standard RLHF pipeline consists of three distinct stages:
1. Supervised fine-tuning (SFT): trains a pretrained language model on high-quality human demonstrations to establish basic instruction-following capabilities
2. Reward model training: collects human preferences by showing annotators multiple AI-generated responses and trains a model to predict these preferences
3. Reinforcement learning optimization: uses the reward model to fine-tune the AI policy with algorithms like Proximal Policy Optimization (PPO), incorporating a KL divergence penalty to prevent drift
This approach addresses the reward specification problem: for complex tasks like writing helpful responses or generating creative content, it is nearly impossible to write explicit rules capturing what makes outputs good. RLHF leverages humans' ability to judge quality when comparing examples, even if they cannot articulate precise criteria.[3]
Training AI systems from human feedback has long been explored as a way to handle objectives that are hard to formally specify. The intellectual foundations of RLHF trace back to research on learning from human feedback in the late 2000s. The TAMER framework (Training an Agent Manually via Evaluative Reinforcement), introduced by Knox and Stone in 2008, allowed humans to guide an RL agent by giving scalar feedback signals, effectively shaping the agent's policy via human reinforcement instead of a predefined reward function.[5] These efforts demonstrated the feasibility of learning from human preferences but were limited to relatively simple environments.
The conceptual leap to preference-based reinforcement learning occurred in 2011, when independent research teams published foundational work on learning from preferences. Among them, Akrour et al. introduced preference-based policy learning, demonstrating that agents could learn directly from expert rankings of policies without simulator access or explicit rewards.[6]
Modern RLHF took shape with the landmark 2017 paper "Deep Reinforcement Learning from Human Preferences" by researchers from OpenAI and DeepMind, led by Paul Christiano and Jan Leike.[1] This work used deep neural networks to scale preference-based learning into a general method for complex, high-dimensional tasks. The paper demonstrated that agents could master challenging behaviors from remarkably little human feedback: their algorithm learned a difficult maneuver (a backflip for a simulated humanoid) from about 900 bits of human feedback, amounting to roughly an hour of a human trainer's time.[1] A relatively small amount of well-placed human feedback (less than 1% of the agent's interactions) was sufficient to significantly outperform baselines and even achieve superhuman scores in some tasks, without the agent ever seeing the true programmed rewards of the environment.
The key innovations included learning the reward function online from human comparisons of short clips of trajectory segments, training this reward predictor asynchronously alongside the policy, and selecting which clip pairs to show annotators so that each query was as informative as possible.
The success of this 2017 work established RLHF as a promising technique for aligning AI behavior with human-desired outcomes, sparking broader research into human-in-the-loop learning for AI safety and alignment.
Following the initial breakthrough, subsequent research expanded RLHF into new domains. Applying RLHF to natural language presented new challenges: language generation involves discrete tokens, massive action spaces, and subtle quality distinctions. OpenAI's 2019 paper "Fine-Tuning Language Models from Human Preferences" marked the first major application of RLHF to language models, fine-tuning GPT-2 on four tasks (sentiment control, descriptiveness, and two summarization tasks) using only 5,000 to 60,000 human comparisons.[7] The work built on advances in generative pretraining and demonstrated that reward learning from human preferences could effectively steer language model behavior.
Building on this foundation, in 2020 OpenAI applied RLHF to text summarization in the paper "Learning to Summarize from Human Feedback." A reward model learned to predict which summaries people preferred, and an RL policy (initialized from a pretrained GPT-3 model) was optimized to maximize this learned reward.[8] Models trained with roughly 60,000 human preference labels significantly outperformed much larger supervised models, and their summaries were preferred in evaluations over those of the original model, supervised baselines, and even the human-written reference summaries. This provided one of the first demonstrations that RLHF can successfully guide large-scale natural language processing (NLP) models on real-world tasks.
RLHF gained widespread attention with the development of InstructGPT and ChatGPT by OpenAI in 2022. InstructGPT is a family of GPT-3 based models fine-tuned using human feedback to better follow user instructions. The March 2022 InstructGPT paper represented RLHF's transition from research to industry-standard practice.[2] In the InstructGPT work, human annotators first provided demonstration answers and ranking comparisons for model outputs. Then a reward model was trained on these rankings, and finally the base model was further optimized via PPO to produce answers that maximize the reward model's score.
This process led to dramatic improvements in alignment with user intent. According to OpenAI, testers "significantly preferred" the outputs of a 1.3 billion-parameter InstructGPT model over those of the original 175 billion-parameter GPT-3 on a wide range of user prompts, despite the former having over 100x fewer parameters. RLHF fine-tuning also made the model's outputs more factual and less toxic than base GPT-3, showing an 82% reduction in harmful content generation and 29% better adherence to safety policies, all while maintaining performance on academic NLP benchmarks.[2]
These RLHF-trained InstructGPT models were deployed as the default models in OpenAI's API in 2022, and the approach paved the way for ChatGPT, a conversational AI launched in late 2022 that was built by fine-tuning GPT-3.5 with human feedback. InstructGPT's methodology directly enabled ChatGPT's November 2022 launch, which OpenAI explicitly credited to RLHF technology.[9] ChatGPT rapidly reached 100 million users and catalyzed industry-wide adoption.
Concurrently, the RLHF paradigm has been adopted by other leading AI labs. Anthropic developed Constitutional AI, extending RLHF with AI-generated feedback.[10] Their December 2022 paper introduced "RLAIF" (Reinforcement Learning from AI Feedback), where AI models evaluate responses according to written constitutional principles. Anthropic built its assistant Claude, refining it using RLHF and related techniques.
DeepMind developed a dialogue agent called Sparrow in 2022, which was trained via reinforcement learning on feedback from human reviewers to make its answers more correct and safer.[11] Sparrow uses human preference modeling to learn to avoid unsafe or misleading responses while engaging in helpful conversation. DeepMind also applied RLHF in models like Gopher and developed several notable applications including GopherCite, which trained models to cite evidence from the web.[12]
By 2023-2025, RLHF became ubiquitous across the AI industry and a standard part of the training pipeline for state-of-the-art large language models. Meta AI's Llama 2 documentation provided comprehensive public implementation details,[13] while Google's Gemini[14] and virtually all major deployed language models now use RLHF or variants as standard practice.
| Year | Milestone | Key reference |
|---|---|---|
| 2008 | TAMER Framework: first demonstration of learning from human evaluative feedback | Knox & Stone[5] |
| 2011 | Preference-based RL: established preference learning foundations | Akrour et al.[6] |
| 2017 | Deep RLHF: introduction of deep RL from human preferences for games and robotics | Christiano et al.[1] |
| 2017 | PPO: introduced dominant RL algorithm for RLHF | Schulman et al.[15] |
| 2019 | Fine-tuning GPT-2: first major RLHF application to language models | Ziegler et al.[7] |
| 2020 | Learning to Summarize: demonstrated RLHF superiority over larger supervised models | Stiennon et al.[8] |
| 2022 | InstructGPT: established RLHF as viable for general-purpose alignment | Ouyang et al.[2] |
| 2022 | HH-RLHF: training a helpful and harmless assistant with RLHF | Bai et al.[16] |
| 2022 | Constitutional AI: introduced RLAIF and explicit value specification | Bai et al.[10] |
| 2023 | DPO: simpler alternative to RL-based training | Rafailov et al.[17] |
| 2023 | Open problems in RLHF: examined fundamental limitations | Casper et al.[18] |
| 2023 | Llama 2: most detailed public RLHF documentation | Touvron et al.[13] |
| 2024 | KTO: alignment via binary feedback using prospect theory | Ethayarajh et al.[19] |
| 2024 | GRPO: critic-free RL for reasoning using group-relative scoring | Shao et al.[20] |
The process of reinforcement learning from human feedback typically follows a sequence of three main stages: pre-training (usually followed by an intermediate supervised fine-tuning step), training a reward model, and optimizing the policy with reinforcement learning.[4]
RLHF usually starts with a pretrained model or agent that has learned broadly from a large dataset or through conventional RL. In NLP applications, this is a large language model trained on vast text corpora (e.g., GPT). In control tasks or games, this could be an agent with some prior knowledge. The pre-training provides a foundation of general capabilities, which RLHF will then refine. Notably, the computational cost of the initial pre-training far exceeds that of the subsequent RLHF fine-tuning; for example, the RLHF phase for InstructGPT consumed less than 2% of the compute used to pre-train GPT-3.
The RLHF pipeline begins with supervised fine-tuning (SFT), which transforms a pretrained language model into an instruction-following system. Pretrained models excel at pattern completion but don't naturally follow explicit instructions. SFT bridges this gap by training on high-quality human demonstrations, where humans write desired outputs for given prompts. The model is fine-tuned on these prompt-response pairs to ensure it can follow instructions and generate desirable outputs.[2]
The process uses standard supervised learning with the causal language modeling objective:
L_SFT(theta) = -E_(x,y)~D [ sum_t log p_theta(y_t | x, y_<t) ]
where theta represents model parameters, x is the prompt/instruction, y is the desired response, and D is the demonstration dataset.[4]
Training data typically consists of 10,000 to 100,000 demonstrations sourced from human labelers, API usage, or carefully curated existing data. For example, in InstructGPT, approximately 13,000 prompts were used for SFT.[2] The SFT model becomes the reference policy (pi_ref) used during RL optimization to compute KL divergence penalties. This supervised learning step primes the model to output generally acceptable answers, simplifying the subsequent reinforcement learning stage.
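As a concrete illustration, the SFT objective above can be implemented in a few lines. The sketch below assumes a Hugging Face-style causal language model whose forward pass returns logits; the function name and calling convention are illustrative, not from any particular library:

```python
import torch
import torch.nn.functional as F

def sft_loss(model, prompt_ids, response_ids):
    """Causal LM cross-entropy on response tokens only (prompt tokens masked out)."""
    input_ids = torch.cat([prompt_ids, response_ids], dim=1)  # (batch, T)
    logits = model(input_ids).logits                          # (batch, T, vocab)
    # Position t predicts token t+1: shift logits left and targets right.
    shift_logits = logits[:, :-1, :]
    shift_targets = input_ids[:, 1:].clone()
    # Mask prompt positions so the loss is taken only over the response y.
    prompt_len = prompt_ids.shape[1]
    shift_targets[:, : prompt_len - 1] = -100  # -100 is ignored by cross_entropy
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_targets.reshape(-1),
        ignore_index=-100,
    )
```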
A dataset is constructed by having human evaluators judge or rank outputs of the model-in-training. Typically, the current model (or a set of candidate models) is used to generate answers for a variety of prompts, and human annotators are asked to compare which output is better according to some criteria (e.g., which answer is more helpful or accurate). Human annotators provide feedback by ranking multiple model-generated responses to a given prompt. For instance, to train a chatbot, labelers might be shown two possible replies to a user query and asked which reply they prefer. These comparison judgments (or sometimes scalar ratings) constitute the feedback data.
Using pairwise comparisons helps because humans often find it easier to say "output A is better than output B" than to assign absolute scores, and it reduces variance between different annotators' scales. The collection process may be iterative: as the policy improves, new data is sampled in areas where the model is still uncertain, and humans provide feedback on those outputs, continually expanding the feedback dataset.
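Conceptually, each element of the resulting feedback dataset is just a prompt with a preferred and a rejected response. A minimal sketch of one record (all field names are illustrative, not from any specific dataset):

```python
preference_record = {
    "prompt": "Explain photosynthesis to a ten-year-old.",
    "chosen": "Plants are like little chefs that use sunlight...",        # y_w, preferred
    "rejected": "Photosynthesis is the process of C3 carbon fixation...",  # y_l
    "annotator_id": "labeler_042",  # retained to audit inter-annotator agreement
}
```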
The second stage trains a reward model (also called a preference model) that predicts human preferences between AI-generated outputs. The reward model takes as input an agent's output (and usually the initial prompt or state) and produces a scalar reward value that should correlate with how humans would rate that output. The RM is typically a neural network initialized from the SFT model: in NLP tasks, one often takes a copy of the language model and fine-tunes it to predict preference scores using a loss function based on pairwise comparisons.
The reward model is trained using the Bradley-Terry model of pairwise preferences. This model assumes that for responses y_w (winner/chosen) and y_l (loser/rejected) with latent qualities r(y_w) and r(y_l), the probability that y_w is preferred follows:
P(y_w > y_l | x) = sigma(r_theta(x, y_w) - r_theta(x, y_l))
where sigma is the sigmoid function (logistic function) and r_theta is the reward model.[21] The Bradley-Terry model assumes each item has a latent strength, and observed preferences are a noisy reflection of these underlying strengths. Only differences in reward scores matter; adding the same constant to all scores leaves the preference probabilities unchanged.
The reward model is trained by maximizing the log-likelihood of observed preferences using cross-entropy loss:
L_RM(theta) = -E_(x, y_w, y_l) ~ D [ log sigma(r_theta(x, y_w) - r_theta(x, y_l)) ]
where y_w is the preferred response over y_l.
Architecturally, reward models typically start from the SFT model checkpoint, replacing the final token prediction layer with a linear projection to a single scalar value.[2] The scalar reward is read from the last token position, representing the quality of the entire sequence. Datasets for RM training can include 100,000 to 1 million comparisons. After training, the reward model serves as a stand-in for human judgment, evaluating any new output and estimating how well a human would like it. This allows the next phase of training to proceed without a human in the loop for every single evaluation.
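A minimal sketch of both pieces, the scalar value head and the Bradley-Terry training loss, is shown below. It assumes a transformer backbone that returns hidden states of shape (batch, sequence, hidden); all names are illustrative:

```python
import torch
import torch.nn.functional as F

class RewardModel(torch.nn.Module):
    def __init__(self, backbone, hidden_size):
        super().__init__()
        self.backbone = backbone  # e.g., the SFT transformer trunk, minus its LM head
        self.value_head = torch.nn.Linear(hidden_size, 1)  # projection to a scalar

    def forward(self, input_ids):
        hidden = self.backbone(input_ids)                      # (batch, T, hidden)
        return self.value_head(hidden[:, -1, :]).squeeze(-1)   # score read at last token

def reward_model_loss(reward_model, chosen_ids, rejected_ids):
    """Pairwise Bradley-Terry loss: -log sigma(r(x, y_w) - r(x, y_l))."""
    r_chosen = reward_model(chosen_ids)      # (batch,)
    r_rejected = reward_model(rejected_ids)  # (batch,)
    # logsigmoid is numerically safer than log(sigmoid(.))
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```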
In the final phase, the policy model is fine-tuned using a reinforcement learning algorithm, with the reward model providing the reward signal. This stage formulates text generation as a sequential decision problem. Instead of a manual reward function defined by engineers, the agent now treats the learned reward model as its objective to maximize.
The complete RLHF objective combines reward maximization with a KL divergence penalty:
J_RLHF(theta) = E_(x ~ D, y ~ pi_theta) [ r_phi(x, y) - beta * D_KL(pi_theta(y|x) || pi_ref(y|x)) ]
Or equivalently, expanding the KL divergence as an expected log-ratio:
J_RLHF(theta) = E_(x ~ D, y ~ pi_theta) [ r_phi(x, y) - beta * log(pi_theta(y|x) / pi_ref(y|x)) ]
where pi_theta is the current policy being optimized, pi_ref is the frozen reference policy (SFT model), r_phi is the frozen reward model, and beta is the KL penalty coefficient, typically ranging from 0.01 to 0.1.[4]
An additional pretraining term (with mixing coefficient gamma, as in InstructGPT) may be added to avoid catastrophic forgetting:
L(theta) = J_RLHF(theta) + gamma * E_(x ~ D_pretrain) [ log pi_theta(x) ]
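In implementations, the KL term is typically estimated per token from the log-probabilities of the tokens actually sampled, and folded into the reward before the policy update. A minimal sketch under that assumption (all names illustrative):

```python
import torch

def shaped_reward(rm_score, logp_policy, logp_ref, beta=0.05):
    """RLHF reward for one sampled response: r_phi(x, y) - beta * KL estimate.

    logp_policy, logp_ref: (T,) tensors of log pi_theta(y_t | x, y_<t) and
    log pi_ref(y_t | x, y_<t) for the sampled tokens. Their summed difference
    is a Monte Carlo estimate of the sequence-level KL divergence.
    """
    kl_estimate = (logp_policy - logp_ref).sum()
    return rm_score - beta * kl_estimate
```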
A common choice of RL algorithm for this stage is Proximal Policy Optimization (PPO), a stable policy-gradient method, though other policy-gradient and actor-critic algorithms can be used. Using PPO, the model generates an output (action) for a given input (state or prompt), the reward model scores this output, and the PPO update adjusts the model's parameters to increase the probability of outputs that lead to higher reward scores. PPO includes a mechanism, its clipped objective, that prevents the policy from straying too far from the previous policy in a single update. This is important because the reward model is an imperfect proxy; if the policy changes too drastically, it might exploit quirks of the reward model (a form of reward hacking), producing gibberish or undesired outputs that nevertheless score high.
The KL divergence term prevents distribution collapse where the model assigns all probability to narrow sequences, maintains language fluency by staying close to the well-trained SFT model, and prevents reward hacking where the model exploits reward model weaknesses. Over many training iterations, this RL process tunes the agent to generate outputs that align with the learned human preferences.
After RLHF fine-tuning, InstructGPT and similar models began following user instructions more reliably, avoided certain classes of undesirable content, and showed improved factual accuracy in open-ended question answering.
| Step | Description | Key components |
|---|---|---|
| Pretraining/SFT | Fine-tune base model on human demonstrations | Prompt-response pairs, supervised learning, 10,000-100,000 demonstrations |
| Reward modeling | Train RM on ranked preferences | Pairwise comparisons, Bradley-Terry model, 100,000-1M comparisons |
| Policy optimization | Use RL to maximize RM rewards | PPO, KL penalty, optional pretraining mix |
Proximal Policy Optimization (PPO) is the dominant reinforcement learning algorithm used in RLHF, introduced by Schulman et al. in 2017.[15] PPO addresses how to update policies using sampled data without taking destructively large steps that degrade performance. Its stable updates made it practical to apply RLHF to high-dimensional neural network policies and helped reduce the amount of human feedback needed for training to succeed.
The core innovation is the clipped surrogate objective. PPO defines a probability ratio:
r_t(theta) = pi_theta(a_t | s_t) / pi_theta_old(a_t | s_t)
The objective becomes:
L^CLIP(theta) = E_t [ min(r_t(theta) * A_hat_t, clip(r_t(theta), 1 - epsilon, 1 + epsilon) * A_hat_t) ]
where A_hat_t is the advantage estimate, epsilon is the clipping parameter (typically 0.1 to 0.2), and the minimum creates a pessimistic lower bound.[22]
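A minimal sketch of this objective as a loss to minimize (hence the sign flip), assuming per-token log-probabilities and advantage estimates have already been computed (names illustrative):

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, epsilon=0.2):
    """Clipped surrogate objective, negated for gradient descent."""
    ratio = torch.exp(logp_new - logp_old)                          # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    # min(.) makes the bound pessimistic: large ratios earn no extra credit.
    return -torch.min(unclipped, clipped).mean()
```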
PPO's selection for RLHF stems from multiple advantages: its clipped objective yields stable updates without the second-order machinery of trust-region methods, it is relatively simple to implement on top of standard gradient-based training, and it had already been validated at scale on large neural network policies.
However, PPO-based RLHF requires maintaining four large models simultaneously: the policy model, reference model, reward model, and value/critic model. For models exceeding 70 billion parameters, this creates substantial memory challenges.
Reward models translate human preferences into scalar signals for reinforcement learning. These models face a challenging task: predicting subtle, context-dependent human judgments about text quality from limited training data while generalizing to novel outputs.
Architecturally, reward models typically mirror the language model being trained, often initialized from the same SFT checkpoint to leverage linguistic knowledge.[2] The final layer is modified to output a single scalar value rather than a probability distribution over tokens.
Reward model quality critically determines RLHF success and failure modes. A weak reward model gets exploited. During RL optimization, the policy discovers inputs that achieve high predicted rewards without actually satisfying human preferences, a phenomenon called reward hacking or reward overoptimization.[23]
Mitigating reward model limitations involves constraining optimization with the KL penalty, periodically collecting fresh human preference data so the reward model is retrained on the current policy's outputs, using larger reward models or ensembles of them, and stopping RL optimization before the proxy reward diverges too far from true human preferences.
The KL divergence penalty is a fundamental component preventing RLHF collapse. The Kullback-Leibler divergence measures the difference between two probability distributions:
D_KL(P || Q) = E_(x ~ P) [ log(P(x) / Q(x)) ]
In RLHF, the KL penalty constrains how much the optimized policy pi_theta can diverge from the reference policy pi_ref during RL training. The penalty coefficient beta controls the trade-off between reward maximization and staying close to the reference.[24]
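For intuition, the definition can be evaluated exactly on small discrete distributions; a toy example:

```python
import torch

p = torch.tensor([0.7, 0.2, 0.1])  # P: e.g., the optimized policy's next-token distribution
q = torch.tensor([0.5, 0.3, 0.2])  # Q: e.g., the reference policy's distribution
kl = (p * (p / q).log()).sum()     # D_KL(P || Q) = sum_x P(x) log(P(x)/Q(x))
print(round(kl.item(), 4))         # ~0.0851 nats; zero if and only if P equals Q
```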
The penalty serves multiple functions: it prevents distribution collapse, where the model would concentrate all probability on a few high-reward sequences; it maintains language fluency by keeping the policy close to the well-trained SFT model; and it limits reward hacking by restricting how far the policy can move in search of reward model weaknesses.
From a Bayesian perspective, KL-regularized RL implements variational inference approximating a target posterior distribution, which provides theoretical grounding for why the method avoids distribution collapse.
There are two main approaches to implementing the KL constraint in PPO for RLHF. PPO-Penalty approximately solves a KL-constrained update by penalizing the KL divergence in the objective function and automatically adjusting the penalty coefficient during training. PPO-Clip relies instead on specialized clipping in the objective function to remove incentives for the new policy to deviate too far from the old policy, without an explicit KL term.
RLHF has been applied across a range of AI domains, from game-playing agents to large-scale text generation, to align models with human values.[3]
One of the most prominent uses of RLHF is in natural language processing, where it has become a key technique for aligning language models with human expectations. RLHF improves conversational agents, text summarization, and instruction-following. Instruct-tuned models like InstructGPT and conversational agents like ChatGPT rely on RLHF to produce answers that users find helpful and safe. The technique has been used for tasks such as text summarization (models that generate summaries preferred by readers), open-ended question answering, translation, and dialogue.
Notable systems include OpenAI's ChatGPT and InstructGPT, Anthropic's Claude, DeepMind's Sparrow, and Google's Gemini. By training on human feedback, these models can handle subjective or nuanced criteria (e.g., writing style, humor, avoiding offensive language) that are not captured by likelihood alone. RLHF also helps reduce toxicity and bias in LLM outputs.
ChatGPT represents RLHF's most visible and impactful application, transforming GPT-3.5's raw capabilities into a helpful, harmless conversational assistant.[9] OpenAI's announcement was explicit about the role of RLHF: "We trained this model using Reinforcement Learning from Human Feedback (RLHF), using the same methods as InstructGPT, but with slight differences in the data collection setup." The system builds directly on InstructGPT's methodology, with key adaptations for dialogue: human trainers wrote example conversations playing both sides (the user and the AI assistant), this dialogue data was mixed with the InstructGPT dataset transformed into a conversational format, and comparison data was collected by having trainers rank alternative completions of sampled conversations.
InstructGPT's results demonstrated RLHF's power: the 1.3 billion parameter model was preferred over 100x larger GPT-3 outputs, showing 82% reduction in harmful content generation and 29% better adherence to safety policies.[2]
Anthropic's Claude models pioneered Constitutional AI (CAI), which reduces human annotation requirements while making value alignment more transparent.[10] The approach uses AI-generated feedback (RLAIF) guided by explicit constitutional principles rather than relying entirely on human preference comparisons.
Constitutional AI operates in two phases: a supervised phase, in which the model critiques and revises its own responses according to the constitutional principles and is fine-tuned on the revisions, and a reinforcement learning phase, in which an AI feedback model compares response pairs against the constitution to produce the preference labels used for reward model training (RLAIF).
Claude's constitution draws from diverse sources including the UN Universal Declaration of Human Rights, trust and safety best practices, and Anthropic's "helpful, honest, harmless" (HHH) criteria.[25] Results demonstrate Constitutional AI's effectiveness: models achieved Pareto improvement, being both more helpful and more harmless than standard RLHF, with 70-85% preference rates in human evaluations. Anthropic reported that RL from AI feedback achieved results comparable to standard RLHF in making an assistant harmless, while using approximately 80% less human feedback on harm.[10]
Separately, Anthropic's earlier 2022 paper on training a helpful and harmless assistant (the HH-RLHF paper) provided one of the most detailed public studies of applying RLHF to a dialogue agent.[16] That work explored an iterated online mode of training where preference models and RL policies were updated on a weekly cadence with fresh human feedback data, demonstrating that alignment training improves performance on almost all NLP evaluations and is fully compatible with training for specialized skills such as Python coding and summarization.
Meta AI's Llama 2 paper provides the most comprehensive public documentation of RLHF implementation details.[13] The documentation reveals a preference dataset of roughly 1.4 million binary comparisons collected in weekly batches, two separate reward models trained for helpfulness and for safety, and five successive iterations of RLHF fine-tuning (RLHF-V1 through V5).
Technical innovations include rejection sampling fine-tuning (training on the highest-reward sample among several generations) combined with PPO, a margin term in the reward model loss reflecting how strongly annotators preferred one response, and Ghost Attention (GAtt) to keep multi-turn dialogues consistent with system instructions.
Results showed Llama 2-Chat models outperforming most open-source competitors on helpfulness and safety benchmarks.
Google's Gemini models represent large-scale application of RLHF to multimodal systems.[14] The natively multimodal pretraining architecture processes text, images, audio, and video together, with RLHF refinement applied across modalities.
DeepMind developed several notable RLHF applications: Sparrow (2022), a dialogue agent trained with human preference feedback to give more correct answers and follow safety rules, and GopherCite, which trained models to support their answers with citations of evidence from the web.[11][12]
In robotics and control, RLHF offers a way to teach robots complex behaviors that are hard to specify with a reward function. The original 2017 "Deep Reinforcement Learning from Human Preferences" paper demonstrated RLHF in simulated robotics and Atari games.[1] A human can watch video clips of two attempts (e.g., a robot stacking blocks) and indicate which attempt was better; from this feedback the robot eventually learns a policy that accomplishes the task as the human intends. This approach bypasses manually engineering a reward (which might accidentally encourage wrong behaviors) and instead leverages human intuition about what the correct outcome looks like. The original experiments showed a simulated robot learning to do a backflip and a drive-and-park task solely from human preference judgments.
For Atari games, humans viewed two clips of agent gameplay and indicated which looked better. Agents learned to play successfully from preference data alone, and human preferences sometimes contained more useful information than performance-based metrics.
Another area is video game agents and other interactive environments. RLHF can train game-playing bots not just to win in terms of game score, but to behave in ways that human players or spectators prefer. Large-scale agents such as OpenAI Five for Dota 2 and DeepMind's AlphaStar for StarCraft II, which reached professional-level play, relied primarily on self-play and learning from human demonstrations rather than the full RLHF pipeline, but they illustrate the broader trend of incorporating human data into agent training.
Beyond NLP and games, RLHF has been explored in domains like image generation and computer vision. Text-to-image models can produce outputs ranked by humans for quality and alignment with the prompt; a reward model trained on these rankings can fine-tune the image generator accordingly. RLHF has found applications in text-to-image generation using Denoising Diffusion Policy Optimization (DDPO) to train on aesthetic preferences.[26]
RLHF is also used in healthcare for patient education, recommender systems for engagement (optimizing recommendation policies based on feedback beyond simple clicks), and multimodal tasks like text-to-image alignment.
| Domain | Example systems | Benefits |
|---|---|---|
| LLMs | ChatGPT, Claude, InstructGPT | Improved helpfulness, safety, and instruction-following |
| Games | Atari agents, OpenAI Five, AlphaStar | Better exploration, robustness, and human-like behavior |
| Robotics | Simulated tasks, manipulation, navigation | Alignment with human demonstrations and preferences |
| Image generation | Text-to-image models, DDPO | Reduced overfitting, better quality, aesthetic alignment |
| Healthcare | Patient education systems | Accurate, ethical responses |
| Recommender systems | Content recommendation | Better user satisfaction beyond clicks |
RLHF offers several benefits over conventional training approaches that explain its rapid adoption.
Flexibility in capturing complex preferences. RLHF can align models with subtle, context-dependent human judgments nearly impossible to specify through hand-coded rules. This helps tackle the long-standing problem of value misalignment, where an agent achieves the literal objective it was given but not the outcome humans actually wanted. By defining the objective through human feedback, RLHF directly optimizes what humans care about (at least to the extent that the human feedback is representative of true preferences). This makes it a powerful tool for AI alignment, especially in scenarios involving ethics, safety, or complex social values.
Data efficiency in fine-tuning. In many cases, a relatively small amount of human feedback can substantially improve a large model's performance on subjective tasks. InstructGPT's results, where a 1.3B model with RLHF surpassed a 175B model, exemplify how human feedback can unlock latent capabilities more effectively than scaling up model size or unsupervised data.
Implicit reward shaping. RLHF learns reward functions from data rather than requiring hand-specified proxies, making it applicable to tasks where defining the reward function manually would be intractable.
Practical effectiveness. The technique has been proven across diverse tasks and scales, with 15-85% improvements across various metrics in production systems. Users generally prefer the outputs of models tuned with human feedback, finding them more helpful, correct, and aligned with what was asked.
Safety improvements. RLHF-trained models are better at refusing inappropriate requests and explaining their refusals. OpenAI reported that RLHF dramatically reduced the frequency of hallucinations and toxic content in their model outputs. DeepMind's Sparrow saw improvements in following conversation rules and providing evidence when available.
The technique successfully handles tasks spanning summarization, dialogue, question answering, code generation, creative writing, and instruction following.[3]
Despite its effectiveness, RLHF faces significant challenges that remain active areas of research.
Obtaining high-quality human preference data can be expensive and time-consuming, since it often requires skilled annotators to carefully compare outputs. At scale, human preference data costs roughly $1-10+ per preference pair, and useful datasets often require many thousands of annotations; the creation of InstructGPT's alignment dataset has been estimated to cost approximately $1 million. As models become more capable and are applied to broader tasks, the amount of feedback required to cover diverse scenarios increases. Relying heavily on human-in-the-loop training can become a bottleneck for scaling up AI systems.
Computational requirements pose another barrier. As noted above, PPO-based RLHF maintains four large models simultaneously (policy, reference, reward, and value/critic), which creates enormous memory demands for models exceeding 70 billion parameters. Research also suggests RLHF does not scale as effectively as pretraining, with larger policy models benefiting less and improvements saturating quickly.[27]
Human feedback is noisy and potentially biased. Different annotators may have inconsistent preferences or make mistakes, and even a single person's judgment can vary over time or context. If the feedback dataset is not carefully curated, the reward model may learn a skewed version of human preferences. Human preferences are subjective and can introduce biases, leading to models that favor majority opinions and disadvantage underrepresented groups.[18] A model tuned with feedback from a specific demographic might not perform well for users from a different background, or it might unintentionally amplify the biases present in the trainers' judgments.
As AI systems become more capable, human oversight becomes increasingly difficult. The Bradley-Terry model assumes preferences are transitive and consistent, but real human preferences often violate these assumptions.
Reward hacking represents a fundamental problem where models exploit weaknesses in reward models to achieve high scores without genuinely improving quality.[28] Since the reward model is an imperfect proxy for what humans truly want, a powerful agent may find ways to game this proxy. Research by Gao et al. (2022) demonstrated "reward overoptimization": as policies optimize proxy rewards, true reward initially increases but then declines after reaching a peak.[23] The relationship between proxy reward and true reward follows predictable scaling laws, with the coefficients scaling smoothly with the number of reward model parameters.
Manifestations include excessive verbosity that exploits length biases in the reward model, sycophancy (telling users what they appear to want to hear), confidently worded but inaccurate answers, and in extreme cases degenerate or repetitive text that nevertheless scores highly.
Techniques like regularizing the policy updates (as PPO does) and continually updating the reward model with fresh human data can mitigate this, but not fully eliminate it. The problem is an instance of Goodhart's law: "when a measure becomes a target, it ceases to be a good measure."
PPO is notoriously sensitive to hyperparameter choices. The multi-stage pipeline creates intricate dependencies; poor SFT models produce poor reward model training data, leading to ineffective RL optimization. The alignment tax means RLHF can degrade performance on general capabilities not targeted during alignment, though InstructGPT showed this effect can be mitigated through careful pretraining data mixing.
Judging how well an RLHF-trained model actually aligns with human values is inherently difficult. The model is optimized to do well on the feedback it was given, but verifying it will generalize to new situations requires thorough testing with adversarial or diverse queries. For example, after training Sparrow, DeepMind had participants try to trick the agent into breaking rules to see how often it fails. These evaluations help quantify progress but are inherently limited by the creativity and perspectives of the evaluators.
The objective optimized by RL (maximizing reward model scores) does not perfectly align with true human preferences.[18] Creating a single reward model for diverse human values is fundamentally misspecified. Critics argue that RLHF alone is insufficient for aligning superintelligent AI due to these limitations.
As RLHF becomes a standard tool, researchers are exploring extensions and alternatives to address its limitations.
Direct Preference Optimization (DPO), introduced by Rafailov et al. in 2023, eliminates the reinforcement learning step altogether and has emerged as a simpler, more stable alternative to RLHF.[17] DPO takes the preference data (pairwise comparisons) and directly fine-tunes the main model to satisfy those preferences using a simple supervised objective, rather than training a separate reward model and doing RL.
DPO's key insight is that the optimal policy for the RLHF objective can be expressed analytically, enabling direct optimization from preference data using supervised learning. The method derives a loss function for the policy such that optimizing this loss is theoretically equivalent to optimizing the expected reward as in RLHF, under certain assumptions. The benefit is that it forgoes the complexity of RL training (which can be unstable and sensitive to hyperparameters) and instead uses standard gradient descent on a tailored classification loss constructed from the preference pairs.
The DPO loss is:
L_DPO(theta) = -E_(x, y_w, y_l) [ log sigma( beta * log(pi_theta(y_w|x) / pi_ref(y_w|x)) - beta * log(pi_theta(y_l|x) / pi_ref(y_l|x)) ) ]
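A minimal sketch of this loss, assuming the summed token log-probabilities of each response under the policy and the frozen reference model have already been gathered (names illustrative):

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for a batch of preference pairs.

    Each argument is a (batch,) tensor of summed log-probs of the chosen (w)
    or rejected (l) response under the policy (logp_*) or reference (ref_logp_*).
    """
    # Implicit reward margin: beta * [log(pi/pi_ref) for chosen minus rejected]
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -F.logsigmoid(beta * margin).mean()
```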
DPO's advantages include eliminating the separate reward model and the RL sampling loop, substantially lower memory and compute requirements (only the policy and a frozen reference model are needed), greater training stability than PPO, and far fewer hyperparameters to tune.
In experiments, DPO achieved performance on par with or exceeding PPO-based RLHF in controlling sentiment, improving dialogue responses, and summarization. The method has seen rapid adoption with models like Zephyr, Mixtral, and Intel's NeuralChat. However, DPO still relies on the quality of human preference data and can suffer from the same bias issues if the data is unrepresentative.
Reinforcement Learning from AI Feedback (RLAIF) replaces human preference labels with AI-generated evaluations, addressing RLHF's scalability bottleneck.[30] The approach follows the standard RLHF structure but uses an off-the-shelf LLM to generate preference labels rather than requiring human annotators.
Research demonstrates RLAIF achieves performance on par with RLHF across summarization, helpful dialogue, and harmless dialogue tasks. RLAIF approaches can dramatically cut down on cost, since once a robust AI evaluator is in place, it can label large amounts of data quickly. Advantages include dramatic cost reduction, vastly greater scalability, more consistency than human evaluators, and faster iteration cycles.
However, these methods depend on the AI feedback being reliable; the feedback model's alignment becomes the critical bottleneck. RLAIF is seen as a promising direction to scale alignment techniques to very powerful models but is not a complete replacement for human judgment, especially on questions of ethics and values.
Constitutional AI represents more than just RLAIF; it is a philosophical approach emphasizing explicit, transparent value specification.[25] Rather than learning implicit preferences, CAI encodes values in written principles drawn from established sources.
The methodology combines self-improvement through critique-and-revision with RLAIF supervision. Collective Constitutional AI extends this by incorporating democratic input, with experiments involving approximately 1,000 participants voting on AI principles.[31]
Hybrid approaches that combine several of these techniques, for example using AI feedback for some objectives (such as harmlessness) while retaining human feedback for others, also show promise.
Group Relative Policy Optimization (GRPO), introduced by DeepSeek in 2024, is a variant of PPO designed to reduce the computational cost of RLHF by eliminating the need for a separate critic (value) model.[20] In standard PPO-based RLHF, a value network estimates baselines for advantage computation, which roughly doubles memory requirements. GRPO instead estimates baselines from group scores: for each prompt, the model generates a group of responses, scores them with the reward model, and computes advantages relative to the group mean.
This approach roughly halves memory and compute requirements compared to standard PPO-based RLHF. GRPO was first introduced in the DeepSeekMath paper and subsequently used in the post-training of DeepSeek-R1, where it proved particularly effective at enhancing mathematical reasoning capabilities. On GSM8K, GRPO improved accuracy from 82.9% to 88.2%, and on MATH from 46.8% to 51.7%. Recent extensions include DAPO and Dr. GRPO, which further refine the approach.
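A minimal sketch of the group-relative advantage computation at the heart of GRPO (names illustrative; assumes more than one response per group):

```python
import torch

def grpo_advantages(group_rewards, eps=1e-8):
    """Advantages for one prompt's group of sampled responses.

    group_rewards: (G,) reward scores for G responses to the same prompt.
    Each response's advantage is its reward standardized against the group,
    replacing the learned value baseline that PPO would use.
    """
    mean, std = group_rewards.mean(), group_rewards.std()
    return (group_rewards - mean) / (std + eps)
```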
Kahneman-Tversky Optimization (KTO), introduced by Ethayarajh et al. in 2024, frames model alignment through the lens of prospect theory from behavioral economics.[19] While DPO requires paired preference data (chosen vs. rejected responses), KTO only needs a binary signal indicating whether an output is desirable or undesirable for a given input. This is a significant practical advantage because binary feedback is far cheaper and more abundant than pairwise comparisons.
KTO's theoretical foundation draws on Kahneman and Tversky's prospect theory, which describes how humans perceive value and make decisions under uncertainty, incorporating cognitive biases like loss aversion. The paper demonstrates that existing alignment objectives (including DPO) implicitly incorporate many of these biases, and the success of these objectives over simple cross-entropy minimization can partly be attributed to them belonging to a family of loss functions called human-aware losses (HALOs).
KTO matches or exceeds the performance of preference-based methods at scales from 1B to 30B parameters.
Researchers are also investigating ways to make human feedback more efficient through active learning strategies (selectively querying humans on the most informative comparisons rather than random samples), semi-automated feedback (using heuristic or model-based pre-screening to reduce trivial queries), and combining demonstrations and preferences (using a few high-quality human demonstrations to bootstrap the policy and then preferences for further refinement).
| Method | Year | Requires reward model | Requires RL | Data format | Key advantage |
|---|---|---|---|---|---|
| RLHF (PPO) | 2017 | Yes | Yes | Pairwise preferences | Well-tested, flexible |
| DPO | 2023 | No | No | Pairwise preferences | Simpler, more stable |
| RLAIF | 2022 | Yes | Yes | AI-generated preferences | Scalable, lower cost |
| Constitutional AI | 2022 | Yes | Yes | Principles + AI feedback | Transparent values |
| GRPO | 2024 | Yes | Yes (critic-free) | Pairwise or rule-based | Half the memory of PPO |
| KTO | 2024 | No | No | Binary (good/bad) | No paired data needed |
The RLHF research community faces numerous open challenges, surveyed comprehensively by Casper et al. (2023), who identified problems across three categories: challenges with feedback, challenges with the reward model, and challenges with the policy.[18]