Reinforcement learning (RL) is an area of machine learning concerned with how intelligent agents ought to take actions in an environment in order to maximize a cumulative reward signal.[1] Reinforcement learning is one of three basic machine learning paradigms, alongside supervised learning and unsupervised learning.[2]
Unlike supervised learning, which requires labeled input/output pairs, and unlike unsupervised learning, which focuses on finding hidden structure in unlabeled data, reinforcement learning focuses on finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge) through trial-and-error interaction with an environment.[3] The environment is typically formulated as a Markov decision process (MDP), as many reinforcement learning algorithms utilize dynamic programming techniques.[4] RL has driven some of the most visible achievements in modern artificial intelligence, from defeating world champions at board games and video games to aligning large language models with human preferences.
Reinforcement learning gained widespread recognition through a series of landmark results. In 2016, DeepMind's AlphaGo defeated world champion Lee Sedol at the complex game of Go[5], a feat previously thought to be decades away. In 2019, OpenAI Five defeated the reigning world champion team at Dota 2[6], demonstrating RL's ability to handle complex team-based strategy games. By 2022, RL had become central to the training of large language models like ChatGPT and Claude through a technique called Reinforcement Learning from Human Feedback (RLHF).
The field emerged from the convergence of multiple intellectual traditions. The psychology of animal learning, beginning with Edward Thorndike's Law of Effect in 1911, established that behaviors followed by satisfying consequences tend to be repeated. The mathematical framework came from optimal control theory and Richard Bellman's development of dynamic programming in the 1950s. These threads were unified in the modern field through the work of Richard Sutton and Andrew Barto, who received the 2024 Turing Award for their foundational contributions.[7]
The history of reinforcement learning spans over a century, drawing from psychology, control theory, and computer science. Three distinct intellectual threads developed independently before merging into the unified field recognized today.
The earliest roots of reinforcement learning lie in experimental psychology. Ivan Pavlov's work on classical conditioning in the 1890s and 1900s demonstrated that animals could learn to associate stimuli with rewards, forming the basis for understanding learned behavior.[8] Edward Thorndike formalized this in 1911 with his Law of Effect, which states that responses followed by satisfying outcomes become more firmly associated with the situation, while responses followed by discomfort become less likely.[9] B.F. Skinner extended these ideas in the 1930s through operant conditioning, which studied how rewards and punishments shape voluntary behavior. These psychological principles directly inspired the reward-based learning framework that RL uses today.
The second thread came from applied mathematics. In the 1950s, Richard Bellman developed dynamic programming as a method for solving multi-stage decision problems. His key insight, formalized in the Bellman equation (1957), was that an optimal policy can be decomposed into an immediate decision plus the optimal policy from the resulting state onward.[10] This recursive formulation became the mathematical backbone of virtually all RL algorithms. Marvin Minsky's 1961 survey "Steps Toward Artificial Intelligence" brought the psychological concept of reinforcement into the engineering literature, connecting it to computational decision-making.
The third thread, temporal difference (TD) learning, bridged the gap between the other two. In 1988, Richard Sutton introduced TD learning as a class of model-free methods that learn by bootstrapping from current value estimates rather than waiting for final outcomes.[11] TD methods combined the sampling approach of Monte Carlo methods with the bootstrapping of dynamic programming, creating a practical algorithm for environments where the full model is unknown. Sutton and Barto's 1998 textbook, Reinforcement Learning: An Introduction, synthesized these three threads into a coherent framework and became the standard reference for the field.[12]
| Year | Development | Key contributor(s) | Significance |
|---|---|---|---|
| 1890s | Classical conditioning experiments | Ivan Pavlov | Showed animals learn stimulus-reward associations |
| 1911 | Law of Effect | Edward Thorndike | Established that rewarded actions are reinforced |
| 1930s | Operant conditioning | B.F. Skinner | Formalized how rewards and punishments shape behavior |
| 1950s | Dynamic programming, Bellman equation | Richard Bellman | Mathematical framework for sequential decision-making |
| 1959 | Checkers program | Arthur Samuel | First self-learning game program; coined "machine learning" |
| 1961 | "Steps toward artificial intelligence" | Marvin Minsky | Used term "reinforcement" in engineering context |
| 1963 | MENACE | Donald Michie | Matchbox machine that learned tic-tac-toe |
| 1972 | Heterostatic theory of adaptive systems | A. Harry Klopf | "Hedonistic neuron" ideas that directly motivated Sutton and Barto's early work |
| 1988 | TD(lambda) | Richard Sutton | Unified Monte Carlo and dynamic programming approaches |
| 1989 | Q-learning | Christopher Watkins | Model-free off-policy control algorithm |
| 1992 | TD-Gammon | Gerald Tesauro | First RL system to achieve world-class game performance |
| 1994 | SARSA | Gavin Rummery, Mahesan Niranjan | On-policy temporal difference control |
| 1998 | Reinforcement Learning: An Introduction | Sutton, Barto | Seminal textbook that defined the field |
| 2013 | Deep Q-Network (DQN) on Atari | DeepMind (Mnih et al.) | First deep RL breakthrough using raw pixel input |
| 2015 | DQN published in Nature | DeepMind | Tested on 49 Atari games; human-level play on 29 |
| 2016 | AlphaGo defeats Lee Sedol | DeepMind | First AI to beat a world champion at Go |
| 2017 | AlphaGo Zero, AlphaZero | DeepMind | Learned Go, chess, and shogi from self-play alone |
| 2017 | PPO published | OpenAI (Schulman et al.) | Became the default on-policy RL algorithm |
| 2018 | SAC published | Haarnoja et al. (UC Berkeley) | Maximum entropy framework for continuous control |
| 2019 | OpenAI Five defeats OG at Dota 2 | OpenAI | RL conquers a complex multi-agent real-time game |
| 2019 | AlphaStar reaches Grandmaster | DeepMind | Grandmaster-level play in StarCraft II |
| 2020 | MuZero | DeepMind | Learned to plan without knowing environment rules |
| 2022 | RLHF used to train ChatGPT | OpenAI | RL becomes central to LLM alignment |
| 2024 | Turing Award | Richard Sutton, Andrew Barto | Recognition for foundational RL contributions |
| 2025 | DeepSeek-R1 with GRPO | DeepSeek | RL trains reasoning capabilities in LLMs without a supervised fine-tuning stage |
Reinforcement learning problems involve an agent interacting with an environment through a cycle of observation, action, and reward.[3] At each discrete time step t:
- the agent observes the current state s_t of the environment,
- the agent selects an action a_t according to its policy,
- the environment transitions to a new state s_{t+1}, and
- the agent receives a scalar reward r_{t+1} indicating how good the immediate outcome was.
A minimal version of this loop is sketched below.
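The interaction loop can be made concrete in a few lines of Python. The sketch below uses the Gymnasium API with the CartPole-v1 environment purely as an illustration; any environment exposing reset() and step() would work, and the random action stands in for a learned policy.

```python
import gymnasium as gym

# Illustrative only: CartPole-v1 is a standard Gymnasium environment.
env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)

total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()      # placeholder for a learned policy pi(a|s)
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward                   # accumulate the reward signal R_{t+1}
    done = terminated or truncated           # episode ends on success/failure or time limit

env.close()
print(f"Episode return: {total_reward}")
```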
The agent's objective is to learn a policy that maximizes the expected return (cumulative discounted reward):[1]
G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ... = Sum_{k=0}^{infinity} gamma^k * R_{t+k+1}
The discount factor gamma (where 0 <= gamma <= 1) controls how much the agent values future rewards relative to immediate ones. A gamma close to 0 makes the agent short-sighted, prioritizing immediate reward. A gamma close to 1 makes the agent far-sighted, weighting future rewards almost as heavily as immediate ones. Choosing the right discount factor is problem-dependent: a robot navigating a maze might use gamma = 0.99, while a day-trading algorithm might use a lower value.
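As a minimal illustration of the return defined above, the following function computes G_t for every step of an episode from a list of rewards; the backward recursion G_t = R_{t+1} + gamma * G_{t+1} is just the definition rearranged, and the sample rewards are illustrative.

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute G_t for each step of an episode given rewards [R_1, R_2, ..., R_T]."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Work backwards: G_t = R_{t+1} + gamma * G_{t+1}
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Example: three steps of reward 1 with gamma = 0.9 gives G_0 = 1 + 0.9 + 0.81 = 2.71
print(discounted_returns([1.0, 1.0, 1.0], gamma=0.9))  # [2.71, 1.9, 1.0]
```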
| Component | Description | Example |
|---|---|---|
| Agent | The learner and decision-maker | Robot, game-playing AI, trading algorithm |
| Environment | External system the agent interacts with | Maze, chess board, stock market |
| State (s) | Description of the environment's current configuration | Board position in chess, joint angles of a robot |
| Action (a) | Choice available to the agent at a given state | Move a piece, buy/sell stock, turn left |
| Reward (r) | Immediate scalar feedback signal | Points scored, profit earned, distance to goal |
| Policy (pi) | Agent's strategy mapping states to actions | "If in state X, take action Y" |
| Value function V(s) | Expected long-term return from a state under a policy | Position evaluation in chess |
| Action-value function Q(s,a) | Expected return from taking action a in state s, then following the policy | Estimated value of moving a specific piece |
| Model | Agent's learned representation of environment dynamics | Predicted next state and reward given current state and action |
Value functions are central to reinforcement learning, estimating how good it is for an agent to be in a particular state or to take a particular action in a state:[1]
- The state-value function V^pi(s) is the expected return when starting in state s and following policy pi thereafter.
- The action-value function Q^pi(s,a) is the expected return when taking action a in state s and then following policy pi.
The optimal value functions satisfy the Bellman optimality equations:[4]
V*(s) = max_a Sum_{s'} P(s'|s,a) [R(s,a,s') + gamma * V*(s')]
Q*(s,a) = Sum_{s'} P(s'|s,a) [R(s,a,s') + gamma * max_{a'} Q*(s',a')]
These equations express the key recursive insight: the optimal value of a state equals the expected immediate reward of the best action plus the discounted optimal value of the state that action leads to.
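The optimality equation translates directly into value iteration, the classic dynamic-programming algorithm. The sketch below assumes a small tabular MDP given as explicit transition and reward arrays (a hypothetical two-state, two-action example) and repeatedly applies the max-form backup until the values stop changing.

```python
import numpy as np

# Hypothetical toy MDP: 2 states, 2 actions.
# P[s, a, s'] = transition probability, R[s, a, s'] = reward for that transition.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])
R = np.array([[[0.0, 1.0], [0.0, 1.0]],
              [[0.0, 0.0], [1.0, 0.0]]])
gamma = 0.9

V = np.zeros(2)
for _ in range(1000):
    # Bellman optimality backup: V(s) = max_a sum_s' P(s'|s,a) [R + gamma * V(s')]
    Q = np.einsum("sap,sap->sa", P, R + gamma * V)   # Q[s, a]
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

policy = Q.argmax(axis=1)   # greedy policy with respect to the converged values
print(V, policy)
```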
One fundamental challenge in reinforcement learning is the exploration-exploitation tradeoff.[2] The agent must balance:
- exploration: trying actions whose outcomes are uncertain in order to discover potentially better strategies, and
- exploitation: choosing the actions currently believed to yield the highest return.
An agent that only exploits may get stuck in a suboptimal policy, never discovering better options. An agent that only explores wastes time on actions it already knows are bad. Common strategies for managing this tradeoff include:
| Strategy | Description | Tradeoff |
|---|---|---|
| Epsilon-greedy | Acts randomly with probability epsilon, greedily otherwise | Simple but uniform random exploration is inefficient |
| Epsilon decay | Decreases epsilon over time, exploring more early on | Balances early exploration with later exploitation |
| Upper Confidence Bound (UCB) | Selects actions that have high uncertainty or high estimated value | Principled, based on confidence intervals |
| Thompson sampling | Samples from posterior distribution of action values | Bayesian approach, naturally balances exploration |
| Boltzmann (softmax) exploration | Selects actions proportional to exponentiated Q-values | Temperature parameter controls exploration degree |
| Curiosity-driven exploration | Rewards agent for visiting novel states | Effective in sparse-reward environments |
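As a concrete example of the first two strategies in the table, the sketch below implements epsilon-greedy action selection with a simple exponential decay schedule; the Q-value array and the decay constants are illustrative placeholders rather than values from any particular system.

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon):
    """Pick a random action with probability epsilon, otherwise the greedy action."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))      # explore
    return int(np.argmax(q_values))                  # exploit

# Illustrative decay schedule: start exploratory, end mostly greedy.
epsilon, eps_min, eps_decay = 1.0, 0.05, 0.995
q = np.array([0.1, 0.5, 0.2])                        # placeholder Q-values for one state

for step in range(1000):
    action = epsilon_greedy(q, epsilon)
    epsilon = max(eps_min, epsilon * eps_decay)      # anneal exploration over time
```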
Reinforcement learning problems are formally modeled as Markov decision processes (MDPs), defined by the tuple (S, A, P, R, gamma):[13]
- S, the set of states;
- A, the set of actions;
- P(s'|s,a), the probability of transitioning to state s' when taking action a in state s;
- R(s,a,s'), the expected immediate reward for that transition; and
- gamma, the discount factor.
The Markov property states that the future depends only on the current state, not on the sequence of events that preceded it: P(s_{t+1} | s_t, a_t, s_{t-1}, ..., s_0) = P(s_{t+1} | s_t, a_t). This memoryless property is what makes MDPs tractable. In practice, many real-world problems violate the Markov property (the current observation does not fully capture the state), leading to partially observable MDPs (POMDPs), which are substantially harder to solve.
The Bellman equations, named after Richard Bellman, provide the recursive decomposition that underpins nearly all RL algorithms. For a given policy pi:
V^pi(s) = Sum_a pi(a|s) Sum_{s'} P(s'|s,a) [R(s,a,s') + gamma * V^pi(s')]
This equation says that the value of a state under policy pi equals the expected immediate reward plus the discounted value of the next state, averaged over all possible actions and transitions. The Bellman optimality equation replaces the policy average with a maximum, defining what the best possible policy would achieve.
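Solving this equation by repeated substitution is known as iterative policy evaluation. The sketch below evaluates an arbitrary fixed policy on the same hypothetical two-state MDP used in the value-iteration sketch above; the policy probabilities are illustrative.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (same arrays as the value-iteration sketch).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])      # P[s, a, s']
R = np.array([[[0.0, 1.0], [0.0, 1.0]],
              [[0.0, 0.0], [1.0, 0.0]]])      # R[s, a, s']
pi = np.array([[0.5, 0.5], [0.9, 0.1]])       # pi[s, a], an arbitrary fixed policy
gamma = 0.9

V = np.zeros(2)
for _ in range(1000):
    # Bellman expectation backup: average over actions under pi and next states under P.
    Q = np.einsum("sap,sap->sa", P, R + gamma * V)
    V_new = np.einsum("sa,sa->s", pi, Q)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

print(V)   # V^pi(s) for the two states
```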
Temporal difference (TD) learning, introduced by Sutton in 1988, is a core method in RL that combines ideas from Monte Carlo methods and dynamic programming.[11] Instead of waiting until the end of an episode to update value estimates (as Monte Carlo methods do), TD methods update estimates after each step using the observed reward and the current estimate of the next state's value:
V(s_t) <- V(s_t) + alpha [r_{t+1} + gamma * V(s_{t+1}) - V(s_t)]
The term in brackets, r_{t+1} + gamma * V(s_{t+1}) - V(s_t), is called the TD error. It measures the difference between the estimated value and a better estimate derived from the actual reward received plus the next state's estimated value. TD learning converges to the true value function under certain conditions and forms the basis of algorithms like Q-learning and SARSA.
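The TD(0) update above is a one-liner in code. The sketch below applies it to a short stream of (state, reward, next state) transitions; the transition data, step size, and number of states are illustrative.

```python
import numpy as np

n_states, alpha, gamma = 5, 0.1, 0.9
V = np.zeros(n_states)

# Illustrative transitions (s, r, s_next); in practice these come from acting in the environment.
transitions = [(0, 0.0, 1), (1, 0.0, 2), (2, 1.0, 3), (3, 0.0, 4)]

for s, r, s_next in transitions:
    td_error = r + gamma * V[s_next] - V[s]   # reward plus bootstrapped next value minus current estimate
    V[s] += alpha * td_error                  # move V(s) a small step toward the TD target
```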
RL algorithms can be classified along several axes. Understanding these distinctions is essential for choosing the right algorithm for a given problem.
Model-free algorithms learn a policy or value function directly from experience without building an explicit model of how the environment works. Q-learning and PPO are model-free. They are simpler to implement but often require many more interactions with the environment.
Model-based algorithms learn or are given a model of the environment's dynamics (transition probabilities and rewards) and use it for planning. Dyna-Q, introduced by Sutton in 1990, was an early approach that combined real experience with simulated experience generated from a learned model.[14] More recent model-based methods include MuZero, which learns a latent dynamics model focused on predicting rewards and values rather than raw observations, and Dreamer, which learns a world model in latent space and uses it to train a policy entirely through imagined rollouts.[15]
Model-based methods tend to be more sample-efficient because they can generate synthetic training data through mental simulation. However, if the learned model is inaccurate, compounding errors can lead to poor policies.
Value-based methods (Q-learning, DQN) learn a value function and derive a policy from it (e.g., always choose the action with the highest Q-value). They work well for discrete action spaces but struggle with continuous actions.
Policy-based methods (REINFORCE, PPO) directly parameterize and optimize the policy without necessarily learning a value function. They handle continuous action spaces naturally and can learn stochastic policies, but tend to have higher variance in gradient estimates.
Actor-critic methods combine both: an actor (policy network) selects actions while a critic (value network) evaluates them. This reduces variance compared to pure policy gradient methods while retaining the ability to handle continuous actions.
On-policy algorithms (SARSA, PPO, A2C) learn about the policy currently being executed. They use data generated by the current policy to update that same policy. This can be more stable but is less sample-efficient because old data cannot be reused after a policy update.
Off-policy algorithms (Q-learning, DQN, SAC) can learn from data generated by any policy, including old versions of the agent or even random exploration. This allows experience replay, where past transitions are stored in a buffer and sampled repeatedly, greatly improving sample efficiency.
| Classification axis | Category A | Category B |
|---|---|---|
| Environment model | Model-free: Q-learning, PPO, SAC | Model-based: Dyna-Q, MuZero, Dreamer |
| What is learned | Value-based: Q-learning, DQN | Policy-based: REINFORCE, PPO |
| Data source | On-policy: SARSA, A2C, PPO | Off-policy: Q-learning, DQN, SAC |
| State representation | Tabular: classic Q-learning | Function approximation: deep learning-based RL |
Q-learning, introduced by Christopher Watkins in 1989, is a model-free, off-policy algorithm that learns the optimal action-value function directly.[16] The update rule is:
Q(s,a) <- Q(s,a) + alpha [r + gamma * max_{a'} Q(s',a') - Q(s,a)]
where alpha is the learning rate. The key insight is that the update uses the maximum Q-value over the next state's actions regardless of which action the agent actually took. This "off-policy" property means Q-learning can learn about the optimal policy while following an exploratory one. Watkins proved that Q-learning converges to the optimal Q-function with probability 1, given sufficient exploration and decreasing learning rates.
Q-learning is simple and effective for problems with small, discrete state and action spaces. For larger problems, function approximation (such as neural networks) is needed.
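A tabular Q-learning agent fits in a few dozen lines. The sketch below uses a hypothetical corridor environment (states 0 to 5, actions left and right, reward 1 for reaching the right end) rather than any standard benchmark, with epsilon-greedy exploration as described earlier.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 6, 2              # corridor states 0..5; action 0 = left, 1 = right
alpha, gamma, epsilon = 0.1, 0.95, 0.1
Q = np.zeros((n_states, n_actions))

def greedy(q_row):
    """Greedy action with random tie-breaking (important while all values are still zero)."""
    best = np.flatnonzero(q_row == q_row.max())
    return int(rng.choice(best))

def step(s, a):
    """Hypothetical corridor dynamics: reaching state 5 gives reward 1 and ends the episode."""
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    reward = 1.0 if s_next == n_states - 1 else 0.0
    return s_next, reward, s_next == n_states - 1

for episode in range(500):
    s, done = 0, False
    while not done:
        # Epsilon-greedy behaviour policy
        a = int(rng.integers(n_actions)) if rng.random() < epsilon else greedy(Q[s])
        s_next, r, done = step(s, a)
        # Off-policy target: bootstrap from the best next action, whatever is actually taken next
        target = r + gamma * np.max(Q[s_next]) * (not done)
        Q[s, a] += alpha * (target - Q[s, a])
        s = s_next

print(np.argmax(Q, axis=1))   # learned greedy policy: "move right" in every non-terminal state
```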
SARSA (State-Action-Reward-State-Action), introduced by Rummery and Niranjan in 1994, is an on-policy variant of Q-learning.[17] Its update rule uses the action actually taken in the next state rather than the maximum:
Q(s,a) <- Q(s,a) + alpha [r + gamma * Q(s',a') - Q(s,a)]
Because SARSA evaluates the policy it is actually following, it tends to learn safer policies than Q-learning. In a cliff-walking problem, for example, Q-learning learns the optimal path along the cliff edge, while SARSA learns a safer path further from the edge, because it accounts for the possibility of exploratory actions leading to a fall.
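The only difference from Q-learning is the bootstrap target. The self-contained fragment below contrasts the two targets for a single illustrative transition, assuming a Q-table of the same shape as the corridor sketch above and an epsilon-greedy behaviour policy.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, gamma, epsilon = 0.1, 0.95, 0.1
Q = np.zeros((6, 2))                       # same corridor shape as the Q-learning sketch

def epsilon_greedy(q_row):
    return int(rng.integers(len(q_row))) if rng.random() < epsilon else int(np.argmax(q_row))

# One illustrative transition (s, a, r, s'):
s, a, r, s_next = 3, 1, 0.0, 4

# Q-learning target: bootstrap from the best next action, whatever the agent does next.
target_q = r + gamma * np.max(Q[s_next])

# SARSA target: bootstrap from the action a' the behaviour policy actually selects,
# so the cost of exploratory mistakes is baked into the learned values.
a_next = epsilon_greedy(Q[s_next])
target_sarsa = r + gamma * Q[s_next, a_next]

Q[s, a] += alpha * (target_sarsa - Q[s, a])
```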
Deep Q-Networks (DQN), published by Mnih et al. at DeepMind in 2013 and in Nature in 2015, revolutionized RL by using deep convolutional neural networks to approximate Q-values for high-dimensional state spaces.[18] DQN took raw pixel inputs from Atari 2600 games and learned to play 49 different games using the same architecture and hyperparameters, achieving human-level performance on 29 of them.
Two innovations made this possible:
- Experience replay: transitions are stored in a buffer and sampled at random for training, breaking the temporal correlations between consecutive samples and allowing each experience to be reused many times.
- Target networks: a periodically updated copy of the Q-network computes the bootstrap targets, preventing the instability that arises when the targets shift with every update.
DQN was the first demonstration that a single RL agent could learn complex behaviors directly from sensory input across many different tasks, and it sparked the deep reinforcement learning revolution.
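Both stabilizing ideas can be sketched in a few lines of PyTorch. The replay buffer stores transitions for uniform resampling, and the target network is a periodically synchronized copy of the online network used to compute bootstrap targets; the network sizes and the sync interval below are illustrative, not the published Atari settings.

```python
import random
from collections import deque

import torch.nn as nn

# Experience replay: store transitions, sample them uniformly to break temporal correlation.
class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, transition):              # transition = (s, a, r, s_next, done)
        self.buffer.append(transition)

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

# Illustrative Q-network for a low-dimensional state (the Atari version used convolutions).
def make_q_net(obs_dim=4, n_actions=2):
    return nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))

online_net = make_q_net()
target_net = make_q_net()
target_net.load_state_dict(online_net.state_dict())   # start as an exact copy

SYNC_EVERY = 1_000                                     # illustrative sync interval
for step in range(10_000):
    # ... act, store transitions, sample a batch, and update online_net here ...
    if step % SYNC_EVERY == 0:
        # Target network: freeze the bootstrap targets between periodic syncs.
        target_net.load_state_dict(online_net.state_dict())
```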
Policy gradient methods directly optimize a parameterized policy by estimating the gradient of expected return with respect to policy parameters.[19] The foundational algorithm is REINFORCE (Williams, 1992), which updates policy parameters theta using:
nabla_theta J(theta) ~ Sum_t G_t * nabla_theta log pi_theta(a_t | s_t)
where G_t is the return from time step t. The intuition is straightforward: increase the probability of actions that led to high returns, decrease the probability of actions that led to low returns.
REINFORCE is simple but suffers from high variance in gradient estimates. Adding a baseline (typically the state value function) reduces variance without introducing bias:
nabla_theta J(theta) ~ Sum_t (G_t - V(s_t)) * nabla_theta log pi_theta(a_t | s_t)
The term (G_t - V(s_t)) is called the advantage, and this leads to the family of advantage actor-critic methods.
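In an automatic-differentiation framework, the gradient estimator above corresponds to a simple surrogate loss. The PyTorch sketch below uses a tiny illustrative policy network, made-up episode data, and the mean return as a crude baseline; a learned V(s_t) would normally play that role.

```python
import torch
import torch.nn as nn

# Tiny illustrative policy: maps a 4-dimensional state to probabilities over 2 actions.
policy = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2), nn.Softmax(dim=-1))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

# Assumed to have been collected by running the policy for one episode (illustrative data):
states = torch.randn(3, 4)                     # s_0, s_1, s_2
actions = torch.tensor([0, 1, 0])              # a_t actually taken
returns = torch.tensor([2.71, 1.9, 1.0])       # G_t, computed as in the return example above

probs = policy(states)                                                    # pi_theta(.|s_t)
log_probs = torch.log(probs.gather(1, actions.unsqueeze(1)).squeeze(1))   # log pi_theta(a_t|s_t)

baseline = returns.mean()                      # crude stand-in for a learned V(s_t)
advantages = returns - baseline

loss = -(advantages * log_probs).sum()         # minimizing this ascends the policy gradient
optimizer.zero_grad()
loss.backward()
optimizer.step()
```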
Actor-critic algorithms combine policy-based and value-based learning:[20]
- The actor is a parameterized policy that selects actions.
- The critic is a learned value function that evaluates the actions the actor takes.
The critic reduces variance in the policy gradient estimate by providing a learned baseline. Several important variants exist:
A2C (Advantage Actor-Critic) uses the advantage function A(s,a) = Q(s,a) - V(s) to update the actor. A3C (Asynchronous Advantage Actor-Critic), introduced by Mnih et al. in 2016, runs multiple agents in parallel on separate copies of the environment, each contributing gradients asynchronously to a shared model.[21] This was one of the first methods to effectively scale RL training across many CPU cores.
DDPG (Deep Deterministic Policy Gradient), introduced by Lillicrap et al. in 2015, extends DQN to continuous action spaces by learning a deterministic policy alongside a Q-function.[22] It uses experience replay and target networks, similar to DQN.
TD3 (Twin Delayed DDPG), published by Fujimoto et al. in 2018, addresses overestimation bias in DDPG by maintaining two critic networks and taking the minimum of their estimates, delaying policy updates, and adding noise to target actions.[23]
Proximal Policy Optimization (PPO), introduced by Schulman et al. at OpenAI in 2017, constrains policy updates to prevent destructively large changes.[24] PPO optimizes a clipped surrogate objective:
L^CLIP(theta) = E[min(r_t(theta) * A_t, clip(r_t(theta), 1 - epsilon, 1 + epsilon) * A_t)]
where r_t(theta) = pi_theta(a_t | s_t) / pi_{theta_old}(a_t | s_t) is the probability ratio and epsilon is typically 0.2. The clipping prevents the new policy from deviating too far from the old one in a single update.
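The clipped objective is only a few lines in a framework with automatic differentiation. The sketch below assumes per-timestep advantages and the log-probabilities of the same actions under the old and new policies have already been computed; the tensors are illustrative placeholders, and epsilon = 0.2 follows the paper's default.

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped surrogate loss from the PPO paper (negated, so it can be minimized)."""
    ratio = torch.exp(logp_new - logp_old)                        # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()                  # negative of L^CLIP

# Illustrative tensors; in practice these come from a rollout and the policy network.
logp_old = torch.tensor([-0.9, -1.2, -0.3])
logp_new = torch.tensor([-0.7, -1.5, -0.4], requires_grad=True)
advantages = torch.tensor([1.0, -0.5, 2.0])

loss = ppo_clip_loss(logp_new, logp_old, advantages)
loss.backward()   # gradients flow into whatever network produced logp_new
```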
PPO has become one of the most widely used RL algorithms due to its simplicity, stability, and strong empirical performance. OpenAI used it to train OpenAI Five (Dota 2), and it was the original RL algorithm used in RLHF for ChatGPT.
Soft Actor-Critic (SAC), introduced by Haarnoja et al. in 2018, augments the standard RL objective with an entropy term that encourages exploration:[25]
J(pi) = Sum_t E[r(s_t, a_t) + alpha * H(pi(.|s_t))]
where H is the entropy of the policy and alpha is a temperature parameter controlling the tradeoff between reward maximization and entropy (exploration). SAC is off-policy, uses experience replay, and automatically tunes the temperature parameter. It achieves strong performance on continuous control benchmarks with better sample efficiency than on-policy methods like PPO.
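The entropy term shows up most clearly in the critic's bootstrap target. The fragment below sketches the soft Q-target used in SAC, assuming twin critics, a batch of rewards and done flags, and log-probabilities of next actions sampled from the current policy; all tensors and the fixed temperature are illustrative placeholders.

```python
import torch

gamma, alpha = 0.99, 0.2            # discount and (fixed, illustrative) temperature

# Illustrative batch: rewards, done flags, twin critic estimates at (s', a' ~ pi), and log pi(a'|s').
rewards = torch.tensor([1.0, 0.0, 0.5])
dones = torch.tensor([0.0, 0.0, 1.0])
q1_next = torch.tensor([4.0, 3.5, 2.0])
q2_next = torch.tensor([3.8, 3.9, 2.2])
logp_next = torch.tensor([-1.2, -0.8, -1.5])

# Soft target: take the smaller critic (to curb overestimation) and subtract alpha * log pi,
# which is the per-sample way of adding the policy-entropy bonus to the value.
soft_value_next = torch.min(q1_next, q2_next) - alpha * logp_next
q_target = rewards + gamma * (1.0 - dones) * soft_value_next
```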
| Algorithm | Type | Year | Key innovation | Best suited for | Sample efficiency |
|---|---|---|---|---|---|
| Q-learning | Value, off-policy | 1989 | Model-free optimal control | Small discrete problems | Low |
| SARSA | Value, on-policy | 1994 | On-policy TD control | Safe learning scenarios | Low |
| DQN | Value, off-policy | 2013 | Deep RL with experience replay | Discrete actions, visual input | Medium |
| DDPG | Actor-critic, off-policy | 2015 | Continuous action DQN | Continuous control | Medium |
| TRPO | Policy, on-policy | 2015 | Trust region constraints | Stable policy optimization | Low |
| A3C | Actor-critic, on-policy | 2016 | Asynchronous parallel training | CPU-based distributed training | Low |
| PPO | Policy, on-policy | 2017 | Clipped surrogate objective | General purpose, RLHF | Low |
| SAC | Actor-critic, off-policy | 2018 | Maximum entropy RL | Continuous control, robotics | High |
| TD3 | Actor-critic, off-policy | 2018 | Twin critics, delayed updates | Continuous control | High |
| AlphaZero | Model-based, self-play | 2017 | Self-play with MCTS | Perfect information games | Very high |
| MuZero | Model-based, learned model | 2020 | Learned latent dynamics | Games without known rules | Very high |
| GRPO | Policy, on-policy | 2024 | Group relative advantage estimation | LLM reasoning training | Medium |
Deep reinforcement learning (deep RL) combines RL algorithms with deep neural networks as function approximators, enabling agents to handle high-dimensional state and action spaces that are intractable for tabular methods.
Classic RL algorithms like tabular Q-learning maintain a table of values for every state-action pair. This works for problems with small state spaces (a few hundred or thousand states) but fails completely when states are described by images, continuous variables, or other high-dimensional inputs. A single Atari game frame has 210 x 160 pixels with 128 possible colors per pixel, making the raw state space astronomically large.
Neural networks solve this by learning compact, generalizable representations of value functions or policies. A convolutional neural network can process raw pixels and output Q-values or action probabilities, automatically learning relevant features like object positions, velocities, and spatial relationships.
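As a sketch of how function approximation replaces the table, the PyTorch module below maps a stack of four 84x84 grayscale frames to one Q-value per action. The layer sizes mirror the commonly used DQN-style convolutional architecture but should be read as illustrative rather than as the exact published network.

```python
import torch
import torch.nn as nn

class PixelQNetwork(nn.Module):
    """Maps a stack of 4 grayscale 84x84 frames to one Q-value per action."""

    def __init__(self, n_actions):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.Sequential(nn.Linear(64 * 7 * 7, 512), nn.ReLU(), nn.Linear(512, n_actions))

    def forward(self, frames):            # frames: (batch, 4, 84, 84), values in [0, 1]
        return self.head(self.features(frames))

q_net = PixelQNetwork(n_actions=6)
q_values = q_net(torch.rand(1, 4, 84, 84))   # one Q-value per action for a single state
```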
Deep RL relies on several recurring architectural patterns: convolutional encoders for image observations, recurrent or transformer layers that summarize history when the environment is only partially observable, separate or partially shared networks for the actor and the critic, and duplicated networks (target networks or twin critics) used to stabilize bootstrapped targets.
Combining neural networks with RL introduces several instability issues that do not arise in supervised learning. The training data distribution changes as the policy improves (non-stationarity). Small changes in the value function can cause large changes in the policy, which in turn changes the data distribution. Experience replay, target networks, gradient clipping, and entropy regularization are common techniques for addressing these issues.
TD-Gammon, developed by Gerald Tesauro at IBM's Thomas J. Watson Research Center, was one of the earliest demonstrations that RL combined with neural networks could achieve expert-level performance.[26] The system used a three-layer neural network with 198 input features, 80 hidden units, and one output unit to evaluate backgammon positions. It learned entirely through self-play using TD(lambda), playing approximately 1.5 million games against itself. By version 2.1, TD-Gammon played at a level just slightly below the world's top human players. The program is commonly cited as a precursor to the deep RL breakthroughs that followed two decades later.
DeepMind's DQN was the first system to learn successful control policies directly from raw pixel inputs across a diverse set of tasks.[18] The 2013 paper demonstrated strong performance on seven Atari games; the 2015 Nature paper extended this to 49 games, achieving human-level performance on 29 of them using identical architecture and hyperparameters for every game. This result demonstrated that a single deep RL architecture could generalize across very different tasks.
AlphaGo defeated 18-time world Go champion Lee Sedol 4-1 in March 2016, an event watched by over 200 million people.[5] AlphaGo combined supervised learning from human expert games with RL through self-play, using Monte Carlo tree search (MCTS) guided by a policy network and a value network.
AlphaGo Zero, published later in 2017, eliminated the need for human data entirely, learning exclusively through self-play starting from random play.[27] It surpassed the original AlphaGo within 40 hours of training. AlphaZero generalized this approach to chess and shogi as well, defeating the strongest existing programs in all three games within 24 hours of training from scratch.[28]
OpenAI Five tackled Dota 2, a game with far greater complexity than Go: imperfect information, real-time decision-making, long time horizons (roughly 20,000 decision steps per game), a massive action space, and five-player teamwork.[6] The system used PPO with self-play across 128,000 CPU cores and 256 GPUs, accumulating the equivalent of 45,000 years of gameplay experience. In April 2019, it defeated OG, the reigning human world champions, 2-0. OpenAI Five demonstrated that PPO and massive-scale self-play could handle multi-agent coordination in complex real-time environments.
DeepMind's AlphaStar reached Grandmaster level in StarCraft II, placing in the top 0.2% of human players on the official European ladder.[29] StarCraft II presents challenges beyond Go: imperfect information (fog of war), real-time actions, long-term strategic planning, and a combinatorial action space. AlphaStar combined imitation learning from human replays with multi-agent reinforcement learning, training a league of agents that competed against one another to develop diverse strategies.
RLHF has become one of the most consequential applications of reinforcement learning. It is the technique that transforms a pre-trained language model into a conversational assistant that follows instructions, refuses harmful requests, and generally behaves in ways humans find helpful.[30]
The RLHF process typically involves three stages:
1. Supervised fine-tuning: the pre-trained model is fine-tuned on human-written demonstrations of the desired behavior.
2. Reward model training: human labelers rank pairs (or sets) of model outputs, and a reward model is trained to predict those preferences (sketched below).
3. RL optimization: the language model is optimized against the reward model with an RL algorithm (originally PPO), usually with a KL penalty that keeps it close to the supervised model.
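The second stage is usually trained with a pairwise (Bradley-Terry style) loss on human comparisons. The fragment below shows that loss in PyTorch, assuming the reward model has already produced scalar scores for a chosen and a rejected response; the scores are illustrative placeholders.

```python
import torch
import torch.nn.functional as F

# Scalar reward-model scores for preference pairs (chosen vs. rejected responses); illustrative values.
reward_chosen = torch.tensor([1.3, 0.2, 2.1], requires_grad=True)
reward_rejected = torch.tensor([0.4, 0.9, 1.0])

# Bradley-Terry style objective: push the chosen response to score higher than the rejected one.
loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()
loss.backward()   # gradients flow into the reward model that produced the scores
```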
OpenAI's InstructGPT (2022) was one of the first published demonstrations of this approach,[31] and the same methodology was used for ChatGPT. Anthropic applied a variant called Constitutional AI (CAI) to train Claude, where AI-generated feedback partially replaces human labeling.[32]
The RL component of RLHF has evolved rapidly:
| Method | Year | Description |
|---|---|---|
| PPO-based RLHF | 2022 | Original approach used for InstructGPT and ChatGPT |
| Direct Preference Optimization (DPO) | 2023 | Eliminates separate reward model and RL step; directly optimizes on preference pairs |
| Kahneman-Tversky Optimization (KTO) | 2024 | Works with binary (good/bad) labels instead of pairwise preferences |
| Group Relative Policy Optimization (GRPO) | 2024 | Eliminates value network; estimates advantages from group reward distribution |
| Reinforcement Learning from AI Feedback (RLAIF) | 2023+ | Uses AI-generated preferences to scale alignment |
DeepSeek-R1, released in January 2025, demonstrated that RL training with GRPO and verifiable rewards can produce strong reasoning capabilities in LLMs; its companion model, DeepSeek-R1-Zero, was trained this way without any supervised fine-tuning step at all.[33] The models learned behaviors like self-reflection, verification, and chain-of-thought reasoning largely through RL, achieving performance comparable to OpenAI's o1 on mathematical reasoning benchmarks.
RLVR is a training paradigm where rewards come from deterministic, rule-based verifiers rather than learned reward models.[33] For mathematical problems, the verifier checks whether the model's final answer matches the correct solution. For code generation, automated tests serve as the verifier. RLVR avoids the reward hacking problems inherent in learned reward models and has become the standard approach for training reasoning-focused LLMs as of 2025. GRPO is the most common RL optimizer used with RLVR in open-source reasoning models.
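The core of GRPO's advantage estimate can be sketched without a value network: sample a group of responses to the same prompt, score each with a verifiable reward, and normalize within the group. The numbers below are illustrative; in practice the rewards come from an answer checker or a test suite.

```python
import numpy as np

# Verifiable rewards for a group of G responses sampled for the same prompt
# (e.g. 1.0 if the final answer is correct, 0.0 otherwise); illustrative values.
group_rewards = np.array([1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0])

# Group-relative advantage: no learned critic, just normalization within the group.
advantages = (group_rewards - group_rewards.mean()) / (group_rewards.std() + 1e-8)

# Responses that beat the group average get positive advantage and are reinforced;
# the rest get negative advantage, exactly as in a policy gradient update.
print(advantages)
```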
Reinforcement learning has achieved world-class or superhuman performance in numerous games, including backgammon (TD-Gammon), dozens of Atari 2600 titles (DQN and its successors), Go, chess, and shogi (AlphaGo and AlphaZero), Dota 2 (OpenAI Five), and StarCraft II (AlphaStar).
RL enables robots to acquire motor skills through trial and error rather than manual programming. Notable examples include legged locomotion for quadruped and bipedal robots, dexterous in-hand manipulation such as OpenAI's robotic hand solving a Rubik's Cube, and learned grasping of diverse objects, with policies typically trained in simulation and transferred to hardware.
Self-driving systems employ RL for several aspects of driving, including behavior and motion planning, lane-change and merging decisions, and negotiating interactions with other road users in simulation.
Waymo, Tesla, and other companies use RL as one component of their autonomous driving stacks, though most production systems combine RL with rule-based safety constraints and imitation learning from human drivers.
RL applications in medicine include learning dynamic treatment regimes that adapt therapy to a patient's evolving condition, optimizing dosing policies (for example in sepsis management and insulin control), and adaptive clinical trial design.
Financial applications include portfolio optimization, optimal trade execution, market making, and dynamic hedging, typically trained and evaluated on historical market data because live exploration is costly.
RL is used in recommendation systems where the goal is to maximize long-term user engagement rather than immediate click-through rates. Platforms like YouTube, Netflix, and Spotify use RL-inspired approaches to balance exploration (showing new content) with exploitation (recommending proven favorites), account for the sequential nature of user interactions, and optimize for long-term metrics like retention rather than short-term clicks.
Multi-agent reinforcement learning (MARL) extends RL to settings where multiple agents interact within a shared environment.[39] This introduces challenges absent in single-agent RL: agents must account for the behavior of other learning agents, which makes the environment non-stationary from each agent's perspective.
| Setting | Description | Examples |
|---|---|---|
| Fully cooperative | All agents share a common reward | Robot swarm coordination, team-based games |
| Fully competitive | One agent's gain is another's loss (zero-sum) | Board games, competitive video games |
| Mixed (general-sum) | Agents have partially aligned, partially conflicting goals | Autonomous driving, economic markets, negotiation |
MARL has been applied to autonomous driving (multiple vehicles negotiating at intersections), robotic swarms (coordinated exploration and task allocation), traffic signal control (city-wide optimization of traffic flow), multiplayer games (Dota 2, StarCraft II), and resource allocation in smart grids and communication networks. A comprehensive MIT Press textbook on MARL was published in December 2024, reflecting the field's maturity.[40]
RL algorithms often require enormous amounts of interaction data to learn effective policies:[41] DQN needed tens of millions of Atari frames per game, AlphaGo Zero played millions of games of self-play, and OpenAI Five accumulated the equivalent of roughly 45,000 years of Dota 2 experience.
This makes direct training on physical systems (robots, real vehicles) impractical for most current algorithms. Solutions include model-based RL (generating synthetic data from learned models), transfer learning (reusing knowledge from related tasks), curriculum learning (gradually increasing task difficulty), and offline RL (learning from fixed datasets without further interaction).
Effective exploration becomes extremely difficult in environments with sparse rewards (feedback arrives only after long sequences of correct actions), very long horizons, large or continuous state and action spaces, or deceptive local optima that trap greedy behavior.
Approaches to these challenges include intrinsic motivation and curiosity-driven exploration (rewarding the agent for visiting novel states), hierarchical RL (decomposing problems into subgoals), and safe exploration methods with constraints.
Designing reward functions that capture the true objective is notoriously difficult:[42] agents frequently discover reward hacking (also called specification gaming), exploiting loopholes in the reward signal rather than solving the intended task; hand-crafted reward shaping can introduce unintended incentives; and proxy metrics often diverge from what the designer actually cares about.
In RLHF for LLMs, reward hacking manifests as models producing verbose, sycophantic responses that score highly with the reward model but are not actually more helpful. Mitigation strategies include inverse RL (learning rewards from demonstrations), reward model ensembles, and the Preference As Reward (PAR) approach introduced in 2025.
Policies trained in simulation often fail when deployed on physical hardware due to the "sim-to-real gap": differences in physics, sensor noise, actuator dynamics, and visual appearance between simulator and reality.[43] Research has shown that physics-based dynamics models can achieve up to 50% real-world success under strict precision constraints where simplified models fail entirely. Domain randomization (varying simulation parameters during training), system identification (calibrating simulation to match reality), and progressive domain adaptation help bridge this gap.
RL agents often fail to generalize beyond their training environment. A policy trained in one version of a video game may fail on a slightly different version. When learning multiple tasks sequentially, neural networks suffer from catastrophic forgetting, where learning a new task overwrites the weights needed for previously learned tasks. Meta-learning, domain randomization, and continual learning are active research areas addressing these issues.
Neural network policies are black boxes; it is difficult to understand why an agent takes a particular action. This creates problems for safety-critical deployment, debugging unexpected behavior, regulatory approval and certification, and user trust.
Offline RL (also called batch RL) learns from fixed datasets of previously collected transitions without any further environment interaction.[44] This is valuable in domains where online exploration is expensive or dangerous (healthcare, autonomous driving, industrial control). Key methods include batch-constrained Q-learning (BCQ), conservative Q-learning (CQL), and implicit Q-learning (IQL), which constrain the learned policy or its value estimates to stay close to the data distribution, as well as sequence-modeling approaches such as Decision Transformer.
The intersection of foundation models and RL is one of the most active research areas. Several directions have emerged: RLHF and RLAIF for aligning model behavior with human intent, RL with verifiable rewards for training reasoning models, large language models acting as agents that plan and use tools, and foundation models serving as reward models or world models for embodied agents.
Learned world models allow agents to plan and imagine future scenarios without interacting with the real environment: MuZero plans through a latent dynamics model trained to predict rewards and values, while the Dreamer family trains its policy almost entirely on imagined rollouts inside a learned latent world model.
Hierarchical RL decomposes complex, long-horizon tasks into manageable subtasks: a high-level policy sets subgoals or selects temporally extended actions (options), while low-level policies learn to achieve them, as in the options framework and feudal (manager-worker) architectures.
This is particularly relevant for robotics and navigation tasks where planning over hundreds or thousands of steps is needed.
Safe RL develops algorithms that satisfy safety constraints during both training and deployment. Constrained MDPs formalize safety requirements as constraints on expected costs. Shielding approaches use formal verification to block unsafe actions. This is a growing area as RL moves into safety-critical applications like autonomous driving and medical treatment optimization.
| Framework | Language | Maintained by | Best suited for |
|---|---|---|---|
| Gymnasium (formerly OpenAI Gym) | Python | Farama Foundation | Environment standard and benchmarking |
| Stable-Baselines3 | Python | Community | Reliable algorithm implementations (PPO, SAC, DQN) |
| Ray RLlib | Python | Anyscale | Production-scale distributed training |
| CleanRL | Python | Community | Single-file, readable algorithm implementations |
| TorchRL | Python | Meta (PyTorch) | Research flexibility and modularity |
| Unity ML-Agents | C#/Python | Unity Technologies | 3D simulation and game environments |
| TF-Agents | Python | Google (TensorFlow team) | TensorFlow ecosystem integration |
| Tianshou | Python | Community | Modular research framework |
| ACME | Python | DeepMind | JAX-based research at scale |
| Environment | Domain | Description |
|---|---|---|
| MuJoCo | Physics/robotics | High-fidelity physics simulation for continuous control |
| Isaac Gym | Robotics | GPU-accelerated physics for massively parallel training |
| Arcade Learning Environment (ALE) | Atari games | Standard benchmark for discrete control from pixels |
| PettingZoo | Multi-agent | Standard API for multi-agent environments |
| CARLA | Autonomous driving | Open-source urban driving simulator |
| MineRL | Minecraft | Hierarchical tasks in a complex open-world game |
| Meta-World | Robotic manipulation | 50 distinct manipulation tasks for meta-learning research |
| RoboSuite | Robotic manipulation | Standardized benchmarks for robot learning |