Reinforcement learning (RL) is an area of machine learning concerned with how intelligent agents ought to take actions in an environment in order to maximize a cumulative reward signal.[1] Reinforcement learning is one of three basic machine learning paradigms, alongside supervised learning and unsupervised learning.[2]
Unlike supervised learning, which requires labeled input/output pairs, and unlike unsupervised learning, which focuses on finding hidden structure in unlabeled data, reinforcement learning focuses on finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge) through trial-and-error interaction with an environment.[3] The environment is typically formulated as a Markov decision process (MDP), as many reinforcement learning algorithms utilize dynamic programming techniques.[4] RL has driven some of the most visible achievements in modern artificial intelligence, from defeating world champions at board games and video games to aligning large language models with human preferences.
Reinforcement learning gained widespread recognition through a series of landmark results. In 2016, DeepMind's AlphaGo defeated world champion Lee Sedol at the complex game of Go[5], a feat previously thought to be decades away. In 2019, OpenAI Five defeated the reigning world champion team at Dota 2[6], demonstrating RL's ability to handle complex team-based strategy games. By 2022, RL had become central to the training of large language models like ChatGPT and Claude through a technique called Reinforcement Learning from Human Feedback (RLHF).
The field emerged from the convergence of multiple intellectual traditions. The psychology of animal learning, beginning with Edward Thorndike's Law of Effect in 1911, established that behaviors followed by satisfying consequences tend to be repeated. The mathematical framework came from optimal control theory and Richard Bellman's development of dynamic programming in the 1950s. These threads were unified in the modern field through the work of Richard Sutton and Andrew Barto, who received the 2024 Turing Award for their foundational contributions.[7]
The history of reinforcement learning spans over a century, drawing from psychology, control theory, and computer science. Three distinct intellectual threads developed independently before merging into the unified field recognized today.
The earliest roots of reinforcement learning lie in experimental psychology. Ivan Pavlov's work on classical conditioning in the 1890s and 1900s demonstrated that animals could learn to associate stimuli with rewards, forming the basis for understanding learned behavior.[8] Edward Thorndike formalized this in 1911 with his Law of Effect, which states that responses followed by satisfying outcomes become more firmly associated with the situation, while responses followed by discomfort become less likely.[9] B.F. Skinner extended these ideas in the 1930s through operant conditioning, which studied how rewards and punishments shape voluntary behavior. These psychological principles directly inspired the reward-based learning framework that RL uses today.
The second thread came from applied mathematics. In the 1950s, Richard Bellman developed dynamic programming as a method for solving multi-stage decision problems. His key insight, formalized in the Bellman equation (1957), was that an optimal policy can be decomposed into an immediate decision plus the optimal policy from the resulting state onward.[10] This recursive formulation became the mathematical backbone of virtually all RL algorithms. Marvin Minsky's 1961 survey "Steps Toward Artificial Intelligence" brought the psychological concept of reinforcement into the engineering literature, connecting it to computational decision-making.
The third thread, temporal difference (TD) learning, bridged the gap between the other two. In 1988, Richard Sutton introduced TD learning as a class of model-free methods that learn by bootstrapping from current value estimates rather than waiting for final outcomes.[11] TD methods combined the sampling approach of Monte Carlo methods with the bootstrapping of dynamic programming, creating a practical algorithm for environments where the full model is unknown. Sutton and Barto's 1998 textbook, Reinforcement Learning: An Introduction, synthesized these three threads into a coherent framework and became the standard reference for the field.[12]
| Year | Development | Key contributor(s) | Significance |
|---|---|---|---|
| 1890s | Classical conditioning experiments | Ivan Pavlov | Showed animals learn stimulus-reward associations |
| 1911 | Law of Effect | Edward Thorndike | Established that rewarded actions are reinforced |
| 1930s | Operant conditioning | B.F. Skinner | Formalized how rewards and punishments shape behavior |
| 1950s | Dynamic programming, Bellman equation | Richard Bellman | Mathematical framework for sequential decision-making |
| 1959 | Checkers program | Arthur Samuel | First self-learning game program; coined "machine learning" |
| 1961 | "Steps toward artificial intelligence" | Marvin Minsky | Used term "reinforcement" in engineering context |
| 1963 | MENACE | Donald Michie | Matchbox machine that learned tic-tac-toe |
| 1972 | Heterostatic theory of adaptive systems | A. Harry Klopf | "Hedonistic neuron" ideas that directly motivated Sutton and Barto's early work |
| 1988 | TD(lambda) | Richard Sutton | Unified Monte Carlo and dynamic programming approaches |
| 1989 | Q-learning | Christopher Watkins | Model-free off-policy control algorithm |
| 1992 | TD-Gammon | Gerald Tesauro | First RL system to achieve world-class game performance |
| 1994 | SARSA | Gavin Rummery, Mahesan Niranjan | On-policy temporal difference control |
| 1998 | Reinforcement Learning: An Introduction | Sutton, Barto | Seminal textbook that defined the field |
| 2013 | Deep Q-Network (DQN) on Atari | DeepMind (Mnih et al.) | First deep RL breakthrough using raw pixel input |
| 2015 | DQN published in Nature | DeepMind | Tested on 49 Atari games; human-level play on 29 |
| 2016 | AlphaGo defeats Lee Sedol | DeepMind | First AI to beat a world champion at Go |
| 2017 | AlphaGo Zero, AlphaZero | DeepMind | Learned Go, chess, and shogi from self-play alone |
| 2017 | PPO published | OpenAI (Schulman et al.) | Became the default on-policy RL algorithm |
| 2018 | SAC published | Haarnoja et al. (UC Berkeley) | Maximum entropy framework for continuous control |
| 2019 | OpenAI Five defeats OG at Dota 2 | OpenAI | RL conquers a complex multi-agent real-time game |
| 2019 | AlphaStar reaches Grandmaster | DeepMind | Grandmaster-level play in StarCraft II |
| 2020 | MuZero | DeepMind | Learned to plan without knowing environment rules |
| 2022 | RLHF used to train ChatGPT | OpenAI | RL becomes central to LLM alignment |
| 2024 | Turing Award | Richard Sutton, Andrew Barto | Recognition for foundational RL contributions |
| 2025 | DeepSeek-R1 with GRPO | DeepSeek | RL trains reasoning capabilities in LLMs without a supervised fine-tuning stage |
Reinforcement learning problems involve an agent interacting with an environment through a cycle of observation, action, and reward.[3] At each discrete time step t:
- the agent observes the current state s_t of the environment,
- the agent selects an action a_t according to its policy,
- the environment transitions to a new state s_{t+1}, and
- the agent receives a scalar reward r_{t+1} indicating how good the immediate outcome was.
A minimal version of this loop is sketched below.
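The interaction loop can be made concrete in a few lines of Python. The sketch below uses the Gymnasium API with the CartPole-v1 environment purely as an illustration; any environment exposing reset() and step() would work, and the random action stands in for a learned policy.

```python
import gymnasium as gym

# Illustrative only: CartPole-v1 is a standard Gymnasium environment.
env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)

total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()      # placeholder for a learned policy pi(a|s)
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward                   # accumulate the reward signal R_{t+1}
    done = terminated or truncated           # episode ends on success/failure or time limit

env.close()
print(f"Episode return: {total_reward}")
```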
The agent's objective is to learn a policy that maximizes the expected return (cumulative discounted reward):[1]
G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ... = Sum_{k=0}^{infinity} gamma^k * R_{t+k+1}
The discount factor gamma (where 0 <= gamma <= 1) controls how much the agent values future rewards relative to immediate ones. A gamma close to 0 makes the agent short-sighted, prioritizing immediate reward. A gamma close to 1 makes the agent far-sighted, weighting future rewards almost as heavily as immediate ones. Choosing the right discount factor is problem-dependent: a robot navigating a maze might use gamma = 0.99, while a day-trading algorithm might use a lower value.
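As a minimal illustration of the return defined above, the following function computes G_t for every step of an episode from a list of rewards; the backward recursion G_t = R_{t+1} + gamma * G_{t+1} is just the definition rearranged, and the sample rewards are illustrative.

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute G_t for each step of an episode given rewards [R_1, R_2, ..., R_T]."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Work backwards: G_t = R_{t+1} + gamma * G_{t+1}
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Example: three steps of reward 1 with gamma = 0.9 gives G_0 = 1 + 0.9 + 0.81 = 2.71
print(discounted_returns([1.0, 1.0, 1.0], gamma=0.9))  # [2.71, 1.9, 1.0]
```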
| Component | Description | Example |
|---|---|---|
| Agent | The learner and decision-maker | Robot, game-playing AI, trading algorithm |
| Environment | External system the agent interacts with | Maze, chess board, stock market |
| State (s) | Description of the environment's current configuration | Board position in chess, joint angles of a robot |
| Action (a) | Choice available to the agent at a given state | Move a piece, buy/sell stock, turn left |
| Reward (r) | Immediate scalar feedback signal | Points scored, profit earned, distance to goal |
| Policy (pi) | Agent's strategy mapping states to actions | "If in state X, take action Y" |
| Value function V(s) | Expected long-term return from a state under a policy | Position evaluation in chess |
| Action-value function Q(s,a) | Expected return from taking action a in state s, then following the policy | Estimated value of moving a specific piece |
| Model | Agent's learned representation of environment dynamics | Predicted next state and reward given current state and action |
Value functions are central to reinforcement learning, estimating how good it is for an agent to be in a particular state or to take a particular action in a state:[1]
- The state-value function V^pi(s) is the expected return when starting in state s and following policy pi thereafter.
- The action-value function Q^pi(s,a) is the expected return when taking action a in state s and then following policy pi.
The optimal value functions satisfy the Bellman optimality equations:[4]
V*(s) = max_a Sum_{s'} P(s'|s,a) [R(s,a,s') + gamma * V*(s')]
Q*(s,a) = Sum_{s'} P(s'|s,a) [R(s,a,s') + gamma * max_{a'} Q*(s',a')]
These equations express the key recursive insight: the optimal value of a state equals the expected immediate reward of the best action plus the discounted optimal value of the state that action leads to.
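The optimality equation translates directly into value iteration, the classic dynamic-programming algorithm. The sketch below assumes a small tabular MDP given as explicit transition and reward arrays (a hypothetical two-state, two-action example) and repeatedly applies the max-form backup until the values stop changing.

```python
import numpy as np

# Hypothetical toy MDP: 2 states, 2 actions.
# P[s, a, s'] = transition probability, R[s, a, s'] = reward for that transition.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])
R = np.array([[[0.0, 1.0], [0.0, 1.0]],
              [[0.0, 0.0], [1.0, 0.0]]])
gamma = 0.9

V = np.zeros(2)
for _ in range(1000):
    # Bellman optimality backup: V(s) = max_a sum_s' P(s'|s,a) [R + gamma * V(s')]
    Q = np.einsum("sap,sap->sa", P, R + gamma * V)   # Q[s, a]
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

policy = Q.argmax(axis=1)   # greedy policy with respect to the converged values
print(V, policy)
```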
One fundamental challenge in reinforcement learning is the exploration-exploitation tradeoff.[2] The agent must balance:
- exploration: trying actions whose outcomes are uncertain in order to discover potentially better strategies, and
- exploitation: choosing the actions currently believed to yield the highest return.
An agent that only exploits may get stuck in a suboptimal policy, never discovering better options. An agent that only explores wastes time on actions it already knows are bad. Common strategies for managing this tradeoff include:
| Strategy | Description | Tradeoff |
|---|---|---|
| Epsilon-greedy | Acts randomly with probability epsilon, greedily otherwise | Simple but uniform random exploration is inefficient |
| Epsilon decay | Decreases epsilon over time, exploring more early on | Balances early exploration with later exploitation |
| Upper Confidence Bound (UCB) | Selects actions that have high uncertainty or high estimated value | Principled, based on confidence intervals |
| Thompson sampling | Samples from posterior distribution of action values | Bayesian approach, naturally balances exploration |
| Boltzmann (softmax) exploration | Selects actions proportional to exponentiated Q-values | Temperature parameter controls exploration degree |
| Curiosity-driven exploration | Rewards agent for visiting novel states | Effective in sparse-reward environments |
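As a concrete example of the first two strategies in the table, the sketch below implements epsilon-greedy action selection with a simple exponential decay schedule; the Q-value array and the decay constants are illustrative placeholders rather than values from any particular system.

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon):
    """Pick a random action with probability epsilon, otherwise the greedy action."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))      # explore
    return int(np.argmax(q_values))                  # exploit

# Illustrative decay schedule: start exploratory, end mostly greedy.
epsilon, eps_min, eps_decay = 1.0, 0.05, 0.995
q = np.array([0.1, 0.5, 0.2])                        # placeholder Q-values for one state

for step in range(1000):
    action = epsilon_greedy(q, epsilon)
    epsilon = max(eps_min, epsilon * eps_decay)      # anneal exploration over time
```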
Reinforcement learning problems are formally modeled as Markov decision processes (MDPs), defined by the tuple (S, A, P, R, gamma):[13]
- S, the set of states;
- A, the set of actions;
- P(s'|s,a), the probability of transitioning to state s' when taking action a in state s;
- R(s,a,s'), the expected immediate reward for that transition; and
- gamma, the discount factor.
The Markov property states that the future depends only on the current state, not on the sequence of events that preceded it: P(s_{t+1} | s_t, a_t, s_{t-1}, ..., s_0) = P(s_{t+1} | s_t, a_t). This memoryless property is what makes MDPs tractable. In practice, many real-world problems violate the Markov property (the current observation does not fully capture the state), leading to partially observable MDPs (POMDPs), which are substantially harder to solve.
The Bellman equations, named after Richard Bellman, provide the recursive decomposition that underpins nearly all RL algorithms. For a given policy pi:
V^pi(s) = Sum_a pi(a|s) Sum_{s'} P(s'|s,a) [R(s,a,s') + gamma * V^pi(s')]
This equation says that the value of a state under policy pi equals the expected immediate reward plus the discounted value of the next state, averaged over all possible actions and transitions. The Bellman optimality equation replaces the policy average with a maximum, defining what the best possible policy would achieve.
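Solving this equation by repeated substitution is known as iterative policy evaluation. The sketch below evaluates an arbitrary fixed policy on the same hypothetical two-state MDP used in the value-iteration sketch above; the policy probabilities are illustrative.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (same arrays as the value-iteration sketch).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])      # P[s, a, s']
R = np.array([[[0.0, 1.0], [0.0, 1.0]],
              [[0.0, 0.0], [1.0, 0.0]]])      # R[s, a, s']
pi = np.array([[0.5, 0.5], [0.9, 0.1]])       # pi[s, a], an arbitrary fixed policy
gamma = 0.9

V = np.zeros(2)
for _ in range(1000):
    # Bellman expectation backup: average over actions under pi and next states under P.
    Q = np.einsum("sap,sap->sa", P, R + gamma * V)
    V_new = np.einsum("sa,sa->s", pi, Q)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

print(V)   # V^pi(s) for the two states
```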
Temporal difference (TD) learning, introduced by Sutton in 1988, is a core method in RL that combines ideas from Monte Carlo methods and dynamic programming.[11] Instead of waiting until the end of an episode to update value estimates (as Monte Carlo methods do), TD methods update estimates after each step using the observed reward and the current estimate of the next state's value:
V(s_t) <- V(s_t) + alpha [r_{t+1} + gamma * V(s_{t+1}) - V(s_t)]
The term in brackets, r_{t+1} + gamma * V(s_{t+1}) - V(s_t), is called the TD error. It measures the difference between the estimated value and a better estimate derived from the actual reward received plus the next state's estimated value. TD learning converges to the true value function under certain conditions and forms the basis of algorithms like Q-learning and SARSA.
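The TD(0) update above is a one-liner in code. The sketch below applies it to a short stream of (state, reward, next state) transitions; the transition data, step size, and number of states are illustrative.

```python
import numpy as np

n_states, alpha, gamma = 5, 0.1, 0.9
V = np.zeros(n_states)

# Illustrative transitions (s, r, s_next); in practice these come from acting in the environment.
transitions = [(0, 0.0, 1), (1, 0.0, 2), (2, 1.0, 3), (3, 0.0, 4)]

for s, r, s_next in transitions:
    td_error = r + gamma * V[s_next] - V[s]   # reward plus bootstrapped next value minus current estimate
    V[s] += alpha * td_error                  # move V(s) a small step toward the TD target
```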
RL algorithms can be classified along several axes. Understanding these distinctions is essential for choosing the right algorithm for a given problem.
Model-free algorithms learn a policy or value function directly from experience without building an explicit model of how the environment works. Q-learning and PPO are model-free. They are simpler to implement but often require many more interactions with the environment.
Model-based algorithms learn or are given a model of the environment's dynamics (transition probabilities and rewards) and use it for planning. Dyna-Q, introduced by Sutton in 1990, was an early approach that combined real experience with simulated experience generated from a learned model.[14] More recent model-based methods include MuZero, which learns a latent dynamics model focused on predicting rewards and values rather than raw observations, and Dreamer, which learns a world model in latent space and uses it to train a policy entirely through imagined rollouts.[15]
Model-based methods tend to be more sample-efficient because they can generate synthetic training data through mental simulation. However, if the learned model is inaccurate, compounding errors can lead to poor policies.
Value-based methods (Q-learning, DQN) learn a value function and derive a policy from it (e.g., always choose the action with the highest Q-value). They work well for discrete action spaces but struggle with continuous actions.
Policy-based methods (REINFORCE, PPO) directly parameterize and optimize the policy without necessarily learning a value function. They handle continuous action spaces naturally and can learn stochastic policies, but tend to have higher variance in gradient estimates.
Actor-critic methods combine both: an actor (policy network) selects actions while a critic (value network) evaluates them. This reduces variance compared to pure policy gradient methods while retaining the ability to handle continuous actions.
On-policy algorithms (SARSA, PPO, A2C) learn about the policy currently being executed. They use data generated by the current policy to update that same policy. This can be more stable but is less sample-efficient because old data cannot be reused after a policy update.
Off-policy algorithms (Q-learning, DQN, SAC) can learn from data generated by any policy, including old versions of the agent or even random exploration. This allows experience replay, where past transitions are stored in a buffer and sampled repeatedly, greatly improving sample efficiency.
| Classification axis | Category A | Category B |
|---|---|---|
| Environment model | Model-free: Q-learning, PPO, SAC | Model-based: Dyna-Q, MuZero, Dreamer |
| What is learned | Value-based: Q-learning, DQN | Policy-based: REINFORCE, PPO |
| Data source | On-policy: SARSA, A2C, PPO | Off-policy: Q-learning, DQN, SAC |
| State representation | Tabular: classic Q-learning | Function approximation: deep learning-based RL |
Q-learning, introduced by Christopher Watkins in 1989, is a model-free, off-policy algorithm that learns the optimal action-value function directly.[16] The update rule is:
Q(s,a) <- Q(s,a) + alpha [r + gamma * max_{a'} Q(s',a') - Q(s,a)]
where alpha is the learning rate. The key insight is that the update uses the maximum Q-value over the next state's actions regardless of which action the agent actually took. This "off-policy" property means Q-learning can learn about the optimal policy while following an exploratory one. Watkins proved that Q-learning converges to the optimal Q-function with probability 1, given sufficient exploration and decreasing learning rates.
Q-learning is simple and effective for problems with small, discrete state and action spaces. For larger problems, function approximation (such as neural networks) is needed.
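A tabular Q-learning agent fits in a few dozen lines. The sketch below uses a hypothetical corridor environment (states 0 to 5, actions left and right, reward 1 for reaching the right end) rather than any standard benchmark, with epsilon-greedy exploration as described earlier.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 6, 2              # corridor states 0..5; action 0 = left, 1 = right
alpha, gamma, epsilon = 0.1, 0.95, 0.1
Q = np.zeros((n_states, n_actions))

def greedy(q_row):
    """Greedy action with random tie-breaking (important while all values are still zero)."""
    best = np.flatnonzero(q_row == q_row.max())
    return int(rng.choice(best))

def step(s, a):
    """Hypothetical corridor dynamics: reaching state 5 gives reward 1 and ends the episode."""
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    reward = 1.0 if s_next == n_states - 1 else 0.0
    return s_next, reward, s_next == n_states - 1

for episode in range(500):
    s, done = 0, False
    while not done:
        # Epsilon-greedy behaviour policy
        a = int(rng.integers(n_actions)) if rng.random() < epsilon else greedy(Q[s])
        s_next, r, done = step(s, a)
        # Off-policy target: bootstrap from the best next action, whatever is actually taken next
        target = r + gamma * np.max(Q[s_next]) * (not done)
        Q[s, a] += alpha * (target - Q[s, a])
        s = s_next

print(np.argmax(Q, axis=1))   # learned greedy policy: "move right" in every non-terminal state
```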
SARSA (State-Action-Reward-State-Action), introduced by Rummery and Niranjan in 1994, is an on-policy variant of Q-learning.[17] Its update rule uses the action actually taken in the next state rather than the maximum:
Q(s,a) <- Q(s,a) + alpha [r + gamma * Q(s',a') - Q(s,a)]
Because SARSA evaluates the policy it is actually following, it tends to learn safer policies than Q-learning. In a cliff-walking problem, for example, Q-learning learns the optimal path along the cliff edge, while SARSA learns a safer path further from the edge, because it accounts for the possibility of exploratory actions leading to a fall.
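The only difference from Q-learning is the bootstrap target. The self-contained fragment below contrasts the two targets for a single illustrative transition, assuming a Q-table of the same shape as the corridor sketch above and an epsilon-greedy behaviour policy.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, gamma, epsilon = 0.1, 0.95, 0.1
Q = np.zeros((6, 2))                       # same corridor shape as the Q-learning sketch

def epsilon_greedy(q_row):
    return int(rng.integers(len(q_row))) if rng.random() < epsilon else int(np.argmax(q_row))

# One illustrative transition (s, a, r, s'):
s, a, r, s_next = 3, 1, 0.0, 4

# Q-learning target: bootstrap from the best next action, whatever the agent does next.
target_q = r + gamma * np.max(Q[s_next])

# SARSA target: bootstrap from the action a' the behaviour policy actually selects,
# so the cost of exploratory mistakes is baked into the learned values.
a_next = epsilon_greedy(Q[s_next])
target_sarsa = r + gamma * Q[s_next, a_next]

Q[s, a] += alpha * (target_sarsa - Q[s, a])
```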
Deep Q-Networks (DQN), published by Mnih et al. at DeepMind in 2013 and in Nature in 2015, revolutionized RL by using deep convolutional neural networks to approximate Q-values for high-dimensional state spaces.[18] DQN took raw pixel inputs from Atari 2600 games and learned to play 49 different games using the same architecture and hyperparameters, achieving human-level performance on 29 of them.
Two innovations made this possible:
- Experience replay: transitions are stored in a buffer and sampled at random for training, breaking the temporal correlations between consecutive samples and allowing each experience to be reused many times.
- Target networks: a periodically updated copy of the Q-network computes the bootstrap targets, preventing the instability that arises when the targets shift with every update.
DQN was the first demonstration that a single RL agent could learn complex behaviors directly from sensory input across many different tasks, and it sparked the deep reinforcement learning revolution.
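Both stabilizing ideas can be sketched in a few lines of PyTorch. The replay buffer stores transitions for uniform resampling, and the target network is a periodically synchronized copy of the online network used to compute bootstrap targets; the network sizes and the sync interval below are illustrative, not the published Atari settings.

```python
import random
from collections import deque

import torch.nn as nn

# Experience replay: store transitions, sample them uniformly to break temporal correlation.
class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, transition):              # transition = (s, a, r, s_next, done)
        self.buffer.append(transition)

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

# Illustrative Q-network for a low-dimensional state (the Atari version used convolutions).
def make_q_net(obs_dim=4, n_actions=2):
    return nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))

online_net = make_q_net()
target_net = make_q_net()
target_net.load_state_dict(online_net.state_dict())   # start as an exact copy

SYNC_EVERY = 1_000                                     # illustrative sync interval
for step in range(10_000):
    # ... act, store transitions, sample a batch, and update online_net here ...
    if step % SYNC_EVERY == 0:
        # Target network: freeze the bootstrap targets between periodic syncs.
        target_net.load_state_dict(online_net.state_dict())
```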
Policy gradient methods directly optimize a parameterized policy by estimating the gradient of expected return with respect to policy parameters.[19] The foundational algorithm is REINFORCE (Williams, 1992), which updates policy parameters theta using:
nabla_theta J(theta) ~ Sum_t G_t * nabla_theta log pi_theta(a_t | s_t)
where G_t is the return from time step t. The intuition is straightforward: increase the probability of actions that led to high returns, decrease the probability of actions that led to low returns.
REINFORCE is simple but suffers from high variance in gradient estimates. Adding a baseline (typically the state value function) reduces variance without introducing bias:
nabla_theta J(theta) ~ Sum_t (G_t - V(s_t)) * nabla_theta log pi_theta(a_t | s_t)
The term (G_t - V(s_t)) is called the advantage, and this leads to the family of advantage actor-critic methods.
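In an automatic-differentiation framework, the gradient estimator above corresponds to a simple surrogate loss. The PyTorch sketch below uses a tiny illustrative policy network, made-up episode data, and the mean return as a crude baseline; a learned V(s_t) would normally play that role.

```python
import torch
import torch.nn as nn

# Tiny illustrative policy: maps a 4-dimensional state to probabilities over 2 actions.
policy = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2), nn.Softmax(dim=-1))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

# Assumed to have been collected by running the policy for one episode (illustrative data):
states = torch.randn(3, 4)                     # s_0, s_1, s_2
actions = torch.tensor([0, 1, 0])              # a_t actually taken
returns = torch.tensor([2.71, 1.9, 1.0])       # G_t, computed as in the return example above

probs = policy(states)                                                    # pi_theta(.|s_t)
log_probs = torch.log(probs.gather(1, actions.unsqueeze(1)).squeeze(1))   # log pi_theta(a_t|s_t)

baseline = returns.mean()                      # crude stand-in for a learned V(s_t)
advantages = returns - baseline

loss = -(advantages * log_probs).sum()         # minimizing this ascends the policy gradient
optimizer.zero_grad()
loss.backward()
optimizer.step()
```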
Actor-critic algorithms combine policy-based and value-based learning:[20]
- The actor is a parameterized policy that selects actions.
- The critic is a learned value function that evaluates the actions the actor takes.
The critic reduces variance in the policy gradient estimate by providing a learned baseline. Several important variants exist:
A2C (Advantage Actor-Critic) uses the advantage function A(s,a) = Q(s,a) - V(s) to update the actor. A3C (Asynchronous Advantage Actor-Critic), introduced by Mnih et al. in 2016, runs multiple agents in parallel on separate copies of the environment, each contributing gradients asynchronously to a shared model.[21] This was one of the first methods to effectively scale RL training across many CPU cores.
DDPG (Deep Deterministic Policy Gradient), introduced by Lillicrap et al. in 2015, extends DQN to continuous action spaces by learning a deterministic policy alongside a Q-function.[22] It uses experience replay and target networks, similar to DQN.
TD3 (Twin Delayed DDPG), published by Fujimoto et al. in 2018, addresses overestimation bias in DDPG by maintaining two critic networks and taking the minimum of their estimates, delaying policy updates, and adding noise to target actions.[23]
Proximal Policy Optimization (PPO), introduced by Schulman et al. at OpenAI in 2017, constrains policy updates to prevent destructively large changes.[24] PPO optimizes a clipped surrogate objective:
L^CLIP(theta) = E[min(r_t(theta) * A_t, clip(r_t(theta), 1 - epsilon, 1 + epsilon) * A_t)]
where r_t(theta) = pi_theta(a_t | s_t) / pi_{theta_old}(a_t | s_t) is the probability ratio and epsilon is typically 0.2. The clipping prevents the new policy from deviating too far from the old one in a single update.
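The clipped objective is only a few lines in a framework with automatic differentiation. The sketch below assumes per-timestep advantages and the log-probabilities of the same actions under the old and new policies have already been computed; the tensors are illustrative placeholders, and epsilon = 0.2 follows the paper's default.

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped surrogate loss from the PPO paper (negated, so it can be minimized)."""
    ratio = torch.exp(logp_new - logp_old)                        # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()                  # negative of L^CLIP

# Illustrative tensors; in practice these come from a rollout and the policy network.
logp_old = torch.tensor([-0.9, -1.2, -0.3])
logp_new = torch.tensor([-0.7, -1.5, -0.4], requires_grad=True)
advantages = torch.tensor([1.0, -0.5, 2.0])

loss = ppo_clip_loss(logp_new, logp_old, advantages)
loss.backward()   # gradients flow into whatever network produced logp_new
```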
PPO has become one of the most widely used RL algorithms due to its simplicity, stability, and strong empirical performance. OpenAI used it to train OpenAI Five (Dota 2), and it was the original RL algorithm used in RLHF for ChatGPT.
Soft Actor-Critic (SAC), introduced by Haarnoja et al. in 2018, augments the standard RL objective with an entropy term that encourages exploration:[25]
J(pi) = Sum_t E[r(s_t, a_t) + alpha * H(pi(.|s_t))]
where H is the entropy of the policy and alpha is a temperature parameter controlling the tradeoff between reward maximization and entropy (exploration). SAC is off-policy, uses experience replay, and automatically tunes the temperature parameter. It achieves strong performance on continuous control benchmarks with better sample efficiency than on-policy methods like PPO.
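The entropy term shows up most clearly in the critic's bootstrap target. The fragment below sketches the soft Q-target used in SAC, assuming twin critics, a batch of rewards and done flags, and log-probabilities of next actions sampled from the current policy; all tensors and the fixed temperature are illustrative placeholders.

```python
import torch

gamma, alpha = 0.99, 0.2            # discount and (fixed, illustrative) temperature

# Illustrative batch: rewards, done flags, twin critic estimates at (s', a' ~ pi), and log pi(a'|s').
rewards = torch.tensor([1.0, 0.0, 0.5])
dones = torch.tensor([0.0, 0.0, 1.0])
q1_next = torch.tensor([4.0, 3.5, 2.0])
q2_next = torch.tensor([3.8, 3.9, 2.2])
logp_next = torch.tensor([-1.2, -0.8, -1.5])

# Soft target: take the smaller critic (to curb overestimation) and subtract alpha * log pi,
# which is the per-sample way of adding the policy-entropy bonus to the value.
soft_value_next = torch.min(q1_next, q2_next) - alpha * logp_next
q_target = rewards + gamma * (1.0 - dones) * soft_value_next
```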
| Algorithm | Type | Year | Key innovation | Best suited for | Sample efficiency |
|---|---|---|---|---|---|
| Q-learning | Value, off-policy | 1989 | Model-free optimal control | Small discrete problems | Low |
| SARSA | Value, on-policy | 1994 | On-policy TD control | Safe learning scenarios | Low |
| DQN | Value, off-policy | 2013 | Deep RL with experience replay | Discrete actions, visual input | Medium |
| DDPG | Actor-critic, off-policy | 2015 | Continuous action DQN | Continuous control | Medium |
| TRPO | Policy, on-policy | 2015 | Trust region constraints | Stable policy optimization | Low |
| A3C | Actor-critic, on-policy | 2016 | Asynchronous parallel training | CPU-based distributed training | Low |
| PPO | Policy, on-policy | 2017 | Clipped surrogate objective | General purpose, RLHF | Low |
| SAC | Actor-critic, off-policy | 2018 | Maximum entropy RL | Continuous control, robotics | High |
| TD3 | Actor-critic, off-policy | 2018 | Twin critics, delayed updates | Continuous control | High |
| AlphaZero | Model-based, self-play | 2017 | Self-play with MCTS | Perfect information games | Very high |
| MuZero | Model-based, learned model | 2020 | Learned latent dynamics | Games without known rules | Very high |
| GRPO | Policy, on-policy | 2024 | Group relative advantage estimation | LLM reasoning training | Medium |
Deep reinforcement learning (deep RL) combines RL algorithms with deep neural networks as function approximators, enabling agents to handle high-dimensional state and action spaces that are intractable for tabular methods.
Classic RL algorithms like tabular Q-learning maintain a table of values for every state-action pair. This works for problems with small state spaces (a few hundred or thousand states) but fails completely when states are described by images, continuous variables, or other high-dimensional inputs. A single Atari game frame has 210 x 160 pixels with 128 possible colors per pixel, making the raw state space astronomically large.
Neural networks solve this by learning compact, generalizable representations of value functions or policies. A convolutional neural network can process raw pixels and output Q-values or action probabilities, automatically learning relevant features like object positions, velocities, and spatial relationships.
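As a sketch of how function approximation replaces the table, the PyTorch module below maps a stack of four 84x84 grayscale frames to one Q-value per action. The layer sizes mirror the commonly used DQN-style convolutional architecture but should be read as illustrative rather than as the exact published network.

```python
import torch
import torch.nn as nn

class PixelQNetwork(nn.Module):
    """Maps a stack of 4 grayscale 84x84 frames to one Q-value per action."""

    def __init__(self, n_actions):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.Sequential(nn.Linear(64 * 7 * 7, 512), nn.ReLU(), nn.Linear(512, n_actions))

    def forward(self, frames):            # frames: (batch, 4, 84, 84), values in [0, 1]
        return self.head(self.features(frames))

q_net = PixelQNetwork(n_actions=6)
q_values = q_net(torch.rand(1, 4, 84, 84))   # one Q-value per action for a single state
```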
Deep RL relies on several recurring architectural patterns: convolutional encoders for image observations, recurrent or transformer layers that summarize history when the environment is only partially observable, separate or partially shared networks for the actor and the critic, and duplicated networks (target networks or twin critics) used to stabilize bootstrapped targets.
Combining neural networks with RL introduces several instability issues that do not arise in supervised learning. The training data distribution changes as the policy improves (non-stationarity). Small changes in the value function can cause large changes in the policy, which in turn changes the data distribution. Experience replay, target networks, gradient clipping, and entropy regularization are common techniques for addressing these issues.
TD-Gammon, developed by Gerald Tesauro at IBM's Thomas J. Watson Research Center, was one of the earliest demonstrations that RL combined with neural networks could achieve expert-level performance.[26] The system used a three-layer neural network with 198 input features, 80 hidden units, and one output unit to evaluate backgammon positions. It learned entirely through self-play using TD(lambda), playing approximately 1.5 million games against itself. By version 2.1, TD-Gammon played at a level just slightly below the world's top human players. The program is commonly cited as a precursor to the deep RL breakthroughs that followed two decades later.
DeepMind's DQN was the first system to learn successful control policies directly from raw pixel inputs across a diverse set of tasks.[18] The 2013 paper demonstrated strong performance on seven Atari games; the 2015 Nature paper extended this to 49 games, achieving human-level performance on 29 of them using identical architecture and hyperparameters for every game. This result demonstrated that a single deep RL architecture could generalize across very different tasks.
AlphaGo defeated 18-time world Go champion Lee Sedol 4-1 in March 2016, an event watched by over 200 million people.[5] AlphaGo combined supervised learning from human expert games with RL through self-play, using Monte Carlo tree search (MCTS) guided by a policy network and a value network.
AlphaGo Zero, published later in 2017, eliminated the need for human data entirely, learning exclusively through self-play starting from random play.[27] It surpassed the original AlphaGo within 40 hours of training. AlphaZero generalized this approach to chess and shogi as well, defeating the strongest existing programs in all three games within 24 hours of training from scratch.[28]
OpenAI Five tackled Dota 2, a game with far greater complexity than Go: imperfect information, real-time decision-making, long time horizons (roughly 20,000 decision steps per game), a massive action space, and five-player teamwork.[6] The system used PPO with self-play across 128,000 CPU cores and 256 GPUs, accumulating the equivalent of 45,000 years of gameplay experience. In April 2019, it defeated OG, the reigning human world champions, 2-0. OpenAI Five demonstrated that PPO and massive-scale self-play could handle multi-agent coordination in complex real-time environments.
DeepMind's AlphaStar reached Grandmaster level in StarCraft II, placing in the top 0.2% of human players on the official European ladder.[29] StarCraft II presents challenges beyond Go: imperfect information (fog of war), real-time actions, long-term strategic planning, and a combinatorial action space. AlphaStar combined imitation learning from human replays with multi-agent reinforcement learning, training a league of agents that competed against one another to develop diverse strategies.
RLHF has become one of the most consequential applications of reinforcement learning. It is the technique that transforms a pre-trained language model into a conversational assistant that follows instructions, refuses harmful requests, and generally behaves in ways humans find helpful.[30]
The RLHF process typically involves three stages:
1. Supervised fine-tuning: the pre-trained model is fine-tuned on human-written demonstrations of the desired behavior.
2. Reward model training: human labelers rank pairs (or sets) of model outputs, and a reward model is trained to predict those preferences (sketched below).
3. RL optimization: the language model is optimized against the reward model with an RL algorithm (originally PPO), usually with a KL penalty that keeps it close to the supervised model.
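The second stage is usually trained with a pairwise (Bradley-Terry style) loss on human comparisons. The fragment below shows that loss in PyTorch, assuming the reward model has already produced scalar scores for a chosen and a rejected response; the scores are illustrative placeholders.

```python
import torch
import torch.nn.functional as F

# Scalar reward-model scores for preference pairs (chosen vs. rejected responses); illustrative values.
reward_chosen = torch.tensor([1.3, 0.2, 2.1], requires_grad=True)
reward_rejected = torch.tensor([0.4, 0.9, 1.0])

# Bradley-Terry style objective: push the chosen response to score higher than the rejected one.
loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()
loss.backward()   # gradients flow into the reward model that produced the scores
```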
OpenAI's InstructGPT (2022) was one of the first published demonstrations of this approach,[31] and the same methodology was used for ChatGPT. Anthropic applied a variant called Constitutional AI (CAI) to train Claude, where AI-generated feedback partially replaces human labeling.[32]
The RL component of RLHF has evolved rapidly:
| Method | Year | Description |
|---|---|---|
| PPO-based RLHF | 2022 | Original approach used for InstructGPT and ChatGPT |
| Direct Preference Optimization (DPO) | 2023 | Eliminates separate reward model and RL step; directly optimizes on preference pairs |
| Kahneman-Tversky Optimization (KTO) | 2024 | Works with binary (good/bad) labels instead of pairwise preferences |
| Group Relative Policy Optimization (GRPO) | 2024 | Eliminates value network; estimates advantages from group reward distribution |
| Reinforcement Learning from AI Feedback (RLAIF) | 2023+ | Uses AI-generated preferences to scale alignment |
DeepSeek-R1, released in January 2025, demonstrated that RL training with GRPO and verifiable rewards can produce strong reasoning capabilities in LLMs; its companion model, DeepSeek-R1-Zero, was trained this way without any supervised fine-tuning step at all.[33] The models learned behaviors like self-reflection, verification, and chain-of-thought reasoning largely through RL, achieving performance comparable to OpenAI's o1 on mathematical reasoning benchmarks.
RLVR is a training paradigm where rewards come from deterministic, rule-based verifiers rather than learned reward models.[33] For mathematical problems, the verifier checks whether the model's final answer matches the correct solution. For code generation, automated tests serve as the verifier. RLVR avoids the reward hacking problems inherent in learned reward models and has become the standard approach for training reasoning-focused LLMs as of 2025. GRPO is the most common RL optimizer used with RLVR in open-source reasoning models.
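The core of GRPO's advantage estimate can be sketched without a value network: sample a group of responses to the same prompt, score each with a verifiable reward, and normalize within the group. The numbers below are illustrative; in practice the rewards come from an answer checker or a test suite.

```python
import numpy as np

# Verifiable rewards for a group of G responses sampled for the same prompt
# (e.g. 1.0 if the final answer is correct, 0.0 otherwise); illustrative values.
group_rewards = np.array([1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0])

# Group-relative advantage: no learned critic, just normalization within the group.
advantages = (group_rewards - group_rewards.mean()) / (group_rewards.std() + 1e-8)

# Responses that beat the group average get positive advantage and are reinforced;
# the rest get negative advantage, exactly as in a policy gradient update.
print(advantages)
```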
Reinforcement learning has achieved world-class or superhuman performance in numerous games, including backgammon (TD-Gammon), dozens of Atari 2600 titles (DQN and its successors), Go, chess, and shogi (AlphaGo and AlphaZero), Dota 2 (OpenAI Five), and StarCraft II (AlphaStar).
RL enables robots to acquire motor skills through trial and error rather than manual programming. Notable examples include legged locomotion for quadruped and bipedal robots, dexterous in-hand manipulation such as OpenAI's robotic hand solving a Rubik's Cube, and learned grasping of diverse objects, with policies typically trained in simulation and transferred to hardware.
Self-driving systems employ RL for several aspects of driving, including behavior and motion planning, lane-change and merging decisions, and negotiating interactions with other road users in simulation.
Waymo, Tesla, and other companies use RL as one component of their autonomous driving stacks, though most production systems combine RL with rule-based safety constraints and imitation learning from human drivers.
RL applications in medicine include learning dynamic treatment regimes that adapt therapy to a patient's evolving condition, optimizing dosing policies (for example in sepsis management and insulin control), and adaptive clinical trial design.
Financial applications include portfolio optimization, optimal trade execution, market making, and dynamic hedging, typically trained and evaluated on historical market data because live exploration is costly.
RL is used in recommendation systems where the goal is to maximize long-term user engagement rather than immediate click-through rates. Platforms like YouTube, Netflix, and Spotify use RL-inspired approaches to balance exploration (showing new content) with exploitation (recommending proven favorites), account for the sequential nature of user interactions, and optimize for long-term metrics like retention rather than short-term clicks.
Multi-agent reinforcement learning (MARL) extends RL to settings where multiple agents interact within a shared environment.[39] This introduces challenges absent in single-agent RL: agents must account for the behavior of other learning agents, which makes the environment non-stationary from each agent's perspective.
| Setting | Description | Examples |
|---|---|---|
| Fully cooperative | All agents share a common reward | Robot swarm coordination, team-based games |
| Fully competitive | One agent's gain is another's loss (zero-sum) | Board games, competitive video games |
| Mixed (general-sum) | Agents have partially aligned, partially conflicting goals | Autonomous driving, economic markets, negotiation |
MARL has been applied to autonomous driving (multiple vehicles negotiating at intersections), robotic swarms (coordinated exploration and task allocation), traffic signal control (city-wide optimization of traffic flow), multiplayer games (Dota 2, StarCraft II), and resource allocation in smart grids and communication networks. A comprehensive MIT Press textbook on MARL was published in December 2024, reflecting the field's maturity.[40]
RL algorithms often require enormous amounts of interaction data to learn effective policies:[41] DQN needed tens of millions of Atari frames per game, AlphaGo Zero played millions of games of self-play, and OpenAI Five accumulated the equivalent of roughly 45,000 years of Dota 2 experience.
This makes direct training on physical systems (robots, real vehicles) impractical for most current algorithms. Solutions include model-based RL (generating synthetic data from learned models), transfer learning (reusing knowledge from related tasks), curriculum learning (gradually increasing task difficulty), and offline RL (learning from fixed datasets without further interaction).
Effective exploration becomes extremely difficult in environments with sparse rewards (feedback arrives only after long sequences of correct actions), very long horizons, large or continuous state and action spaces, or deceptive local optima that trap greedy behavior.
Approaches to these challenges include intrinsic motivation and curiosity-driven exploration (rewarding the agent for visiting novel states), hierarchical RL (decomposing problems into subgoals), and safe exploration methods with constraints.
Designing reward functions that capture the true objective is notoriously difficult:[42] agents frequently discover reward hacking (also called specification gaming), exploiting loopholes in the reward signal rather than solving the intended task; hand-crafted reward shaping can introduce unintended incentives; and proxy metrics often diverge from what the designer actually cares about.
In RLHF for LLMs, reward hacking manifests as models producing verbose, sycophantic responses that score highly with the reward model but are not actually more helpful. Mitigation strategies include inverse RL (learning rewards from demonstrations), reward model ensembles, and the Preference As Reward (PAR) approach introduced in 2025.
Policies trained in simulation often fail when deployed on physical hardware due to the "sim-to-real gap": differences in physics, sensor noise, actuator dynamics, and visual appearance between simulator and reality.[43] Research has shown that physics-based dynamics models can achieve up to 50% real-world success under strict precision constraints where simplified models fail entirely. Domain randomization (varying simulation parameters during training), system identification (calibrating simulation to match reality), and progressive domain adaptation help bridge this gap.
RL agents often fail to generalize beyond their training environment. A policy trained in one version of a video game may fail on a slightly different version. When learning multiple tasks sequentially, neural networks suffer from catastrophic forgetting, where learning a new task overwrites the weights needed for previously learned tasks. Meta-learning, domain randomization, and continual learning are active research areas addressing these issues.
Neural network policies are black boxes; it is difficult to understand why an agent takes a particular action. This creates problems for safety-critical deployment, debugging unexpected behavior, regulatory approval and certification, and user trust.
Offline RL (also called batch RL) learns from fixed datasets of previously collected transitions without any further environment interaction.[44] This is valuable in domains where online exploration is expensive or dangerous (healthcare, autonomous driving, industrial control). Key methods include batch-constrained Q-learning (BCQ), conservative Q-learning (CQL), and implicit Q-learning (IQL), which constrain the learned policy or its value estimates to stay close to the data distribution, as well as sequence-modeling approaches such as Decision Transformer.
The intersection of foundation models and RL is one of the most active research areas. Several directions have emerged: RLHF and RLAIF for aligning model behavior with human intent, RL with verifiable rewards for training reasoning models, large language models acting as agents that plan and use tools, and foundation models serving as reward models or world models for embodied agents.
Learned world models allow agents to plan and imagine future scenarios without interacting with the real environment: MuZero plans through a latent dynamics model trained to predict rewards and values, while the Dreamer family trains its policy almost entirely on imagined rollouts inside a learned latent world model.
Hierarchical RL decomposes complex, long-horizon tasks into manageable subtasks: a high-level policy sets subgoals or selects temporally extended actions (options), while low-level policies learn to achieve them, as in the options framework and feudal (manager-worker) architectures.
This is particularly relevant for robotics and navigation tasks where planning over hundreds or thousands of steps is needed.
Safe RL develops algorithms that satisfy safety constraints during both training and deployment. Constrained MDPs formalize safety requirements as constraints on expected costs. Shielding approaches use formal verification to block unsafe actions. This is a growing area as RL moves into safety-critical applications like autonomous driving and medical treatment optimization.
| Framework | Language | Maintained by | Best suited for |
|---|---|---|---|
| Gymnasium (formerly OpenAI Gym) | Python | Farama Foundation | Environment standard and benchmarking |
| Stable-Baselines3 | Python | Community | Reliable algorithm implementations (PPO, SAC, DQN) |
| Ray RLlib | Python | Anyscale | Production-scale distributed training |
| CleanRL | Python | Community | Single-file, readable algorithm implementations |
| TorchRL | Python | Meta (PyTorch) | Research flexibility and modularity |
| Unity ML-Agents | C#/Python | Unity Technologies | 3D simulation and game environments |
| TF-Agents | Python | Google (TensorFlow team) | TensorFlow ecosystem integration |
| Tianshou | Python | Community | Modular research framework |
| ACME | Python | DeepMind | JAX-based research at scale |
| Environment | Domain | Description |
|---|---|---|
| MuJoCo | Physics/robotics | High-fidelity physics simulation for continuous control |
| Isaac Gym | Robotics | GPU-accelerated physics for massively parallel training |
| Arcade Learning Environment (ALE) | Atari games | Standard benchmark for discrete control from pixels |
| PettingZoo | Multi-agent | Standard API for multi-agent environments |
| CARLA | Autonomous driving | Open-source urban driving simulator |
| MineRL | Minecraft | Hierarchical tasks in a complex open-world game |
| Meta-World | Robotic manipulation | 50 distinct manipulation tasks for meta-learning research |
| RoboSuite | Robotic manipulation | Standardized benchmarks for robot learning |