# Reinforcement learning

> Source: https://aiwiki.ai/wiki/reinforcement_learning
> Updated: 2026-07-10
> Categories: Artificial Intelligence, Deep Learning, Machine Learning, Reinforcement Learning
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**Reinforcement learning** (**RL**) is a branch of [machine learning](/wiki/machine_learning) in which an agent learns to make decisions by taking actions in an environment to maximize a cumulative reward signal, discovering good behavior through trial and error rather than from labeled examples.[1] In the words of Richard Sutton and Andrew Barto, whose textbook defined the modern field, "Reinforcement learning is learning what to do, how to map situations to actions, so as to maximize a numerical reward signal."[1] It is one of three basic machine learning paradigms, alongside [supervised learning](/wiki/supervised_learning) and [unsupervised learning](/wiki/unsupervised_learning).[2] Sutton and Barto received the 2024 [Turing Award](/wiki/turing_award), often called the Nobel Prize of computing, for developing the conceptual and algorithmic foundations of reinforcement learning.[7]

Unlike supervised learning, which requires labeled input/output pairs, and unlike unsupervised learning, which focuses on finding hidden structure in unlabeled data, reinforcement learning focuses on finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge) through trial-and-error interaction with an environment.[3] The environment is typically formulated as a [Markov decision process](/wiki/markov_decision_process_mdp) (MDP), as many reinforcement learning algorithms utilize [dynamic programming](/wiki/dynamic_programming) techniques.[4] RL has driven some of the most visible achievements in modern artificial intelligence, from defeating world champions at board games and video games to aligning [large language models](/wiki/large_language_model) with human preferences.

## Overview

Reinforcement learning achieved widespread recognition through several landmark achievements. In 2016, [DeepMind](/wiki/deepmind)'s [AlphaGo](/wiki/alphago) defeated world champion Lee Sedol in the complex game of [Go](/wiki/go_(game))[5], a feat previously thought to be decades away. In 2019, [OpenAI Five](/wiki/openai_five) defeated the reigning world champion team in Dota 2[6], demonstrating RL's ability to handle complex team-based strategy games. By 2022, RL had become central to the training of large language models like [ChatGPT](/wiki/chatgpt) and [Claude](/wiki/claude) through a technique called [Reinforcement Learning from Human Feedback](/wiki/rlhf) (RLHF).

The field emerged from the convergence of multiple intellectual traditions. The psychology of animal learning, beginning with Edward Thorndike's Law of Effect in 1911, established that behaviors followed by satisfying consequences tend to be repeated. The mathematical framework came from optimal control theory and Richard Bellman's development of dynamic programming in the 1950s. These threads were unified in the modern field through the work of Richard Sutton and Andrew Barto, who received the 2024 [Turing Award](/wiki/turing_award) for their foundational contributions.[7] The Turing Award, presented by the Association for Computing Machinery (ACM), carries a 1 million US dollar prize funded by Google. In its announcement, the ACM said that in a series of papers beginning in the 1980s, Barto and Sutton "introduced the main ideas, constructed the mathematical foundations, and developed important algorithms for reinforcement learning."[51] Their textbook, *Reinforcement Learning: An Introduction*, has been cited more than 75,000 times.[51]

## History

The history of reinforcement learning spans over a century, drawing from psychology, control theory, and computer science. Three distinct intellectual threads developed independently before merging into the unified field recognized today.

### Behavioral psychology and trial-and-error learning

The earliest roots of reinforcement learning lie in experimental psychology. Ivan Pavlov's work on classical conditioning in the 1890s and 1900s demonstrated that animals could learn to associate stimuli with rewards, forming the basis for understanding learned behavior.[8] Edward Thorndike formalized this in 1911 with his Law of Effect, which states that responses followed by satisfying outcomes become more firmly associated with the situation, while responses followed by discomfort become less likely.[9] B.F. Skinner extended these ideas in the 1930s through operant conditioning, which studied how rewards and punishments shape voluntary behavior. These psychological principles directly inspired the reward-based learning framework that RL uses today.

### Optimal control and dynamic programming

The second thread came from applied mathematics. In the 1950s, Richard Bellman developed [dynamic programming](/wiki/dynamic_programming) as a method for solving multi-stage decision problems. His key insight, formalized in the [Bellman equation](/wiki/bellman_equation) (1957), was that an optimal policy can be decomposed into an immediate decision plus the optimal policy from the resulting state onward.[10] This recursive formulation became the mathematical backbone of virtually all RL algorithms. The term "reinforcement" entered the engineering literature in the early 1960s, notably in Marvin Minsky's 1961 paper "Steps Toward Artificial Intelligence", which connected the psychological concept of reinforcement to computational decision-making.

### Temporal difference learning and unification

The third thread, temporal difference (TD) learning, bridged the gap between the other two. In 1988, Richard Sutton introduced TD learning as a class of model-free methods that learn by bootstrapping from current value estimates rather than waiting for final outcomes.[11] TD methods combined the sampling approach of Monte Carlo methods with the bootstrapping of dynamic programming, creating a practical algorithm for environments where the full model is unknown. Sutton and Barto's 1998 textbook, *Reinforcement Learning: An Introduction*, synthesized these three threads into a coherent framework and became the standard reference for the field.[12]

### Timeline of major developments

| Year | Development | Key contributor(s) | Significance |
| --- | --- | --- | --- |
| 1890s | Classical conditioning experiments | Ivan Pavlov | Showed animals learn stimulus-reward associations |
| 1911 | Law of Effect | Edward Thorndike | Established that rewarded actions are reinforced |
| 1930s | Operant conditioning | B.F. Skinner | Formalized how rewards and punishments shape behavior |
| 1950s | Dynamic programming, Bellman equation | Richard Bellman | Mathematical framework for sequential decision-making |
| 1959 | Checkers program | Arthur Samuel | First self-learning game program; coined "machine learning" |
| 1961 | "Steps toward artificial intelligence" | Marvin Minsky | Used term "reinforcement" in engineering context |
| 1963 | MENACE | Donald Michie | Matchbox machine that learned tic-tac-toe |
| 1988 | TD(lambda) | Richard Sutton | Unified Monte Carlo and dynamic programming approaches |
| 1989 | Q-learning | Christopher Watkins | Model-free off-policy control algorithm |
| 1992 | REINFORCE policy gradient algorithm | Ronald Williams | Foundational algorithm for policy gradient methods |
| 1992 | TD-Gammon | Gerald Tesauro | First RL system to achieve world-class game performance |
| 1994 | SARSA | Gavin Rummery, Mahesan Niranjan | On-policy temporal difference control |
| 1998 | *Reinforcement Learning: An Introduction* | Sutton, Barto | Seminal textbook that defined the field |
| 2013 | Deep Q-Network (DQN) on Atari | DeepMind (Mnih et al.) | First deep RL breakthrough using raw pixel input |
| 2015 | DQN published in Nature | DeepMind | Evaluated on 49 Atari games; human-level performance on 29 |
| 2016 | [AlphaGo](/wiki/alphago) defeats Lee Sedol | [DeepMind](/wiki/deepmind) | First AI to beat a world champion at Go |
| 2017 | AlphaGo Zero, AlphaZero | DeepMind | Learned Go, chess, and shogi from self-play alone |
| 2017 | PPO published | [OpenAI](/wiki/openai) (Schulman et al.) | Became the default on-policy RL algorithm |
| 2018 | SAC published | Haarnoja et al. (UC Berkeley) | Maximum entropy framework for continuous control |
| 2019 | OpenAI Five defeats OG at Dota 2 | OpenAI | RL conquers a complex multi-agent real-time game |
| 2019 | AlphaStar reaches Grandmaster | DeepMind | Grandmaster-level play in StarCraft II |
| 2020 | MuZero | DeepMind | Learned to plan without knowing environment rules |
| 2022 | RLHF used to train ChatGPT | OpenAI | RL becomes central to LLM alignment |
| 2024 | Turing Award | Richard Sutton, Andrew Barto | Recognition for foundational RL contributions |
| 2025 | DeepSeek-R1 and R1-Zero with GRPO | [DeepSeek](/wiki/deepseek) | RL trains reasoning capabilities in LLMs; R1-Zero used no supervised fine-tuning |

## What is reinforcement learning, in plain terms?

Reinforcement learning teaches an agent through consequences. The agent is never told the correct action; instead it tries actions, observes the resulting reward, and gradually shifts its behavior toward whatever earns the most reward over time. Sutton and Barto identify two features that, together, distinguish RL from all other forms of learning: "trial-and-error search and delayed reward."[1] The learner must discover which actions are valuable by trying them, and because a single action can affect not only the immediate reward but also the next situation and therefore all subsequent rewards, the agent must reason about long-term consequences rather than instantaneous payoff.[1]

## Core concepts

### How does reinforcement learning work? Agent-environment interaction

Reinforcement learning problems involve an [agent](/wiki/intelligent_agent) interacting with an [environment](/wiki/environment) through a cycle of observation, action, and reward.[3] At each discrete time step *t*:

1. The agent observes the current **state** *s_t* of the environment.
2. Based on its **policy** *pi*, the agent selects an **action** *a_t*.
3. The environment transitions to a new state *s_{t+1}* according to transition probabilities *P(s' | s, a)*.
4. The agent receives a scalar **reward** *r_{t+1}* indicating the immediate benefit of that action.
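The four steps above can be sketched as a short loop. This is a minimal illustration, not a standard API: the `CoinFlipEnv` class, its transition probabilities, and the fixed `policy` function are all hypothetical and chosen only to make the cycle concrete.

```python
import random


class CoinFlipEnv:
    """Hypothetical two-state environment used only for illustration."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.state = 0

    def step(self, action):
        # Transition P(s' | s, a): action 1 reaches state 1 with
        # probability 0.8; any other outcome leads to state 0.
        if action == 1 and self.rng.random() < 0.8:
            next_state = 1
        else:
            next_state = 0
        # Reward: 1 for landing in state 1, otherwise 0.
        reward = 1.0 if next_state == 1 else 0.0
        self.state = next_state
        return next_state, reward


def policy(state):
    """A fixed (non-learning) policy pi: always choose action 1."""
    return 1


env = CoinFlipEnv(seed=0)
state = env.state                            # step 1: observe s_t
trajectory = []
for t in range(5):
    action = policy(state)                   # step 2: select a_t
    next_state, reward = env.step(action)    # steps 3 and 4: s_{t+1}, r_{t+1}
    trajectory.append((state, action, reward, next_state))
    state = next_state

for transition in trajectory:
    print(transition)
```

A learning agent would replace the fixed `policy` with one that is updated from the recorded transitions.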

The agent's objective is to learn a policy that maximizes the **expected return** (cumulative discounted reward):[1]

*G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ... = Sum_{k=0}^{infinity} gamma^k * R_{t+k+1}*

The **discount factor** *gamma* (where 0 <= gamma <= 1) controls how much the agent values future rewards relative to immediate ones. A gamma close to 0 makes the agent short-sighted, prioritizing immediate reward. A gamma close to 1 makes the agent far-sighted, weighting future rewards almost as heavily as immediate ones. Setting gamma = 1 is only well defined when episodes terminate, since otherwise the infinite sum can diverge. Choosing the right discount factor is problem-dependent: a robot navigating a maze might use gamma = 0.99, while a day-trading algorithm might use a lower value.
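As a worked example of the return formula, the sketch below computes *G_t* for a short, finite reward sequence. The function name `discounted_return` and the reward values are illustrative assumptions.

```python
def discounted_return(rewards, gamma):
    """Return G_0 for rewards [R_1, R_2, ...] and discount factor gamma."""
    g = 0.0
    # Work backwards so each step applies G_t = R_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [1.0, 0.0, 2.0]
print(discounted_return(rewards, 0.5))  # 1 + 0.5*0 + 0.25*2 = 1.5
print(discounted_return(rewards, 0.0))  # only the first reward counts: 1.0
print(discounted_return(rewards, 1.0))  # undiscounted sum: 3.0
```

The backward recursion avoids computing powers of gamma explicitly and mirrors the recursive structure that the Bellman equations exploit.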

### Key components

| Component | Description | Example |
| --- | --- | --- |
| **Agent** | The learner and decision-maker | Robot, game-playing AI, trading algorithm |
| **Environment** | External system the agent interacts with | Maze, chess board, stock market |
| **State (s)** | Description of the environment's current configuration | Board position in chess, joint angles of a robot |
| **Action (a)** | Choice available to the agent at a given state | Move a piece, buy/sell stock, turn left |
| **Reward (r)** | Immediate scalar feedback signal | Points scored, profit earned, distance to goal |
| **Policy (pi)** | Agent's strategy mapping states to actions | "If in state X, take action Y" |
| **Value function V(s)** | Expected long-term return from a state under a policy | Position evaluation in chess |
| **Action-value function Q(s,a)** | Expected return from taking action *a* in state *s*, then following the policy | Estimated value of moving a specific piece |
| **Model** | Agent's learned representation of environment dynamics | Predicted next state and reward given current state and action |

### Value functions

Value functions are central to reinforcement learning, estimating how good it is for an agent to be in a particular state or to take a particular action in a state:[1]

- **State-value function V^pi(s)**: the expected return starting from state *s* and following policy *pi*.
- **Action-value function Q^pi(s,a)**: the expected return from taking action *a* in state *s*, then following policy *pi*.

The optimal value functions satisfy the [Bellman optimality equations](/wiki/bellman_equation):[4]

- *V\*(s) = max_a Sum_{s'} P(s'|s,a) [R(s,a,s') + gamma * V\*(s')]*
- *Q\*(s,a) = Sum_{s'} P(s'|s,a) [R(s,a,s') + gamma * max_{a'} Q\*(s',a')]*

These equations express the key recursive insight: the optimal value of a state is obtained by choosing the action that maximizes the expected immediate reward plus the discounted optimal value of the next state.
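When the transition probabilities and rewards are known, the Bellman optimality equation can be applied repeatedly as an update until the values stop changing, a procedure known as value iteration. The sketch below assumes a hypothetical two-state, two-action MDP; the dictionary `P`, its numbers, and the helper names are made up for illustration.

```python
GAMMA = 0.9

# Hypothetical MDP: P[s][a] is a list of (probability, next_state, reward).
P = {
    0: {0: [(1.0, 0, 0.0)],    # stay in state 0, no reward
        1: [(1.0, 1, 1.0)]},   # move to state 1, reward 1
    1: {0: [(1.0, 0, 0.0)],    # move back to state 0, no reward
        1: [(1.0, 1, 2.0)]},   # stay in state 1, reward 2
}


def backup(V, s, a):
    """Expected one-step return: Sum_{s'} P(s'|s,a) [R + gamma * V(s')]."""
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])


def value_iteration(tol=1e-10):
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            new_v = max(backup(V, s, a) for a in P[s])  # max over actions
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:
            return V


V_star = value_iteration()
Q_star = {(s, a): backup(V_star, s, a) for s in P for a in P[s]}
print(V_star)  # approximately {0: 19.0, 1: 20.0}
```

In this toy problem the best behavior is to reach state 1 and stay there, so *V\*(1) = 2 / (1 - 0.9) = 20* and *V\*(0) = 1 + 0.9 * 20 = 19*.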

### Exploration vs. exploitation

One fundamental challenge in reinforcement learning is the **exploration-exploitation tradeoff**.[2] The agent must balance:

- **Exploration**: trying new, untested actions to discover potentially better strategies.
- **Exploitation**: using what the agent already knows to maximize immediate rewards.

An agent that only exploits may get stuck in a suboptimal policy, never discovering better options. An agent that only explores wastes time on actions it already knows are bad. Common strategies for managing this tradeoff include:

| Strategy | Description | Tradeoff |
| --- | --- | --- |
| Epsilon-greedy | Acts randomly with probability epsilon, greedily otherwise | Simple but uniform random exploration is inefficient |
| Epsilon decay | Decreases epsilon over time, exploring more early on | Balances early exploration with later exploitation |
| Upper Confidence Bound (UCB) | Selects the action with the highest estimated value plus an uncertainty bonus that shrinks as the action is tried more often | Principled, based on confidence intervals |
| Thompson sampling | Samples from posterior distribution of action values | Bayesian approach, naturally balances exploration |
| Boltzmann (softmax) exploration | Selects each action with probability proportional to its exponentiated Q-value | Temperature parameter controls exploration degree |
| Curiosity-driven exploration | Rewards agent for visiting novel states | Effective in sparse-reward environments |
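
Two of the strategies in the table, epsilon-greedy and Boltzmann exploration, are simple enough to sketch directly. The function names and the example Q-values below are illustrative assumptions.

```python
import math
import random


def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def softmax_probs(q_values, temperature):
    """Action probabilities proportional to exp(Q / temperature)."""
    m = max(q_values)  # subtract the max for numerical stability
    exps = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


def boltzmann(q_values, temperature, rng):
    """Sample an action from the softmax distribution over Q-values."""
    probs = softmax_probs(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]


rng = random.Random(0)
q = [0.1, 0.9, 0.3]
print(epsilon_greedy(q, epsilon=0.0, rng=rng))  # always the greedy action: 1
print(softmax_probs(q, temperature=1.0))
print(boltzmann(q, temperature=0.5, rng=rng))
```

A high temperature pushes the softmax distribution toward uniform (more exploration), while a low temperature concentrates it on the greedy action.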

## What is the difference between reinforcement learning and supervised learning?

Reinforcement learning and supervised learning are both branches of machine learning, but they differ in where the training signal comes from and what the model is asked to do. Supervised learning is trained on a fixed dataset of correct input/output pairs and learns to imitate those labels; reinforcement learning has no correct answers, only a scalar reward, and must generate its own data by acting in an environment.[1][2] In supervised learning each example is independent, whereas in RL an action changes the agent's future situation, so decisions are sequential and rewards can be delayed. The table below summarizes the main differences.

| Dimension | Reinforcement learning | Supervised learning |
| --- | --- | --- |
| Training signal | Scalar reward, often delayed | Labeled correct output for each input |
| Data source | Generated by the agent through interaction | Fixed, pre-collected dataset |
| Feedback | Evaluative (how good was the action) | Instructive (what the right answer was) |
| Example independence | Sequential; actions affect future states | Independent and identically distributed |
| Core goal | Maximize cumulative reward over time | Minimize prediction error on labels |
| Typical use | Control, games, robotics, LLM alignment | Classification, regression, perception |

In practice the two paradigms are often combined: many systems are first trained with supervised learning and then refined with reinforcement learning. The training of [ChatGPT](/wiki/chatgpt) is a prominent example, beginning with [supervised fine-tuning](/wiki/supervised_fine-tuning) on human demonstrations and then applying RL to optimize a reward model that captures human preferences.[31]

## Mathematical foundations

### Markov decision processes

Reinforcement learning problems are formally modeled as **[Markov decision processes](/wiki/markov_decision_process_mdp)** (MDPs), defined by the tuple *(S, A, P, R, gamma)*:[13]

- **S**: set of states (state space)
- **A**: set of actions (action space)
- **P(s'|s,a)**: state transition probability function
- **R(s,a,s')**: reward function
- **gamma**: discount factor (0 <= gamma <= 1, with gamma < 1 required for continuing tasks so that the return stays finite)

The **[Markov property](/wiki/markov_property)** states that the future depends only on the current state, not on the sequence of events that preceded it: *P(s_{t+1} | s_t, a_t, s_{t-1}, ..., s_0) = P(s_{t+1} | s_t, a_t)*. This memoryless property is what makes MDPs tractable. In practice, many real-world problems violate the Markov property (the current observation does not fully capture the state), leading to **partially observable MDPs** (POMDPs), which are substantially harder to solve.

### Bellman equations

The Bellman equations, named after Richard Bellman, provide the recursive decomposition that underpins nearly all RL algorithms. For a given policy *pi*:

*V^pi(s) = Sum_a pi(a|s) Sum_{s'} P(s'|s,a) [R(s,a,s') + gamma * V^pi(s')]*

This equation says that the value of a state under policy *pi* equals the expected immediate reward plus the discounted value of the next state, averaged over all possible actions and transitions. The Bellman optimality equation replaces the policy average with a maximum, defining what the best possible policy would achieve.
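
The Bellman equation for a fixed policy can be turned into an update rule and iterated to a fixed point, which is known as iterative policy evaluation. The sketch below assumes a hypothetical two-state MDP and a uniform random policy; all numbers and names are illustrative.

```python
GAMMA = 0.9

# Hypothetical MDP: P[s][a] is a list of (probability, next_state, reward).
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}

# Uniform random policy: pi[s][a] is the probability of action a in state s.
pi = {s: {0: 0.5, 1: 0.5} for s in P}


def policy_evaluation(tol=1e-10):
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # Average over actions (policy) and next states (transitions).
            new_v = sum(
                pi[s][a] * sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])
                for a in P[s]
            )
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:
            return V


V_pi = policy_evaluation()
print(V_pi)  # approximately {0: 7.25, 1: 7.75}
```

Replacing the sum over actions weighted by `pi` with a maximum over actions would turn the same loop into an evaluation of the Bellman optimality equation.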

### Temporal difference learning

Temporal difference (TD) learning, introduced by Sutton in 1988, is a core method in RL that combines ideas from Monte Carlo methods and dynamic programming.[11] Instead of waiting until the end of an episode to update value estimates (as Monte Carlo methods do), TD methods update estimates after each step using the observed reward and the current estimate of the next state's value:

*V(s_t) <- V(s_t) + alpha [r_{t+1} + gamma * V(s_{t+1}) - V(s_t)]*

The term in brackets, *r_{t+1} + gamma * V(s_{t+1}) - V(s_t)*, is called the **TD error**. It measures the difference between the current estimate *V(s_t)* and a better-informed estimate built from the reward actually received plus the next state's estimated value. For a fixed policy and a tabular value function, TD learning converges to the true value function provided the step size *alpha* is decayed appropriately and every state continues to be visited. TD learning forms the basis of algorithms like Q-learning and SARSA.
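
A single TD update is only a few lines of code. In the sketch below the state names, the starting values, and the `terminal` flag (which drops the bootstrap term at the end of an episode) are assumptions added for illustration.

```python
def td0_update(V, s, r, s_next, alpha, gamma, terminal=False):
    """Apply one TD(0) update to V[s] in place and return the TD error."""
    # Bootstrapped target: observed reward plus discounted next-state estimate.
    target = r + (0.0 if terminal else gamma * V[s_next])
    td_error = target - V[s]
    V[s] += alpha * td_error
    return td_error


V = {"A": 0.0, "B": 1.0}
delta = td0_update(V, "A", r=0.5, s_next="B", alpha=0.1, gamma=0.9)
print(delta)   # 0.5 + 0.9 * 1.0 - 0.0 = 1.4
print(V["A"])  # 0.0 + 0.1 * 1.4, approximately 0.14
```

Because the update needs only one transition, it can be applied online after every step rather than at the end of an episode.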

## Algorithm taxonomy

RL algorithms can be classified along several axes. Understanding these distinctions is essential for choosing the right algorithm for a given problem.

### Model-based vs. model-free

**Model-free** algorithms learn a policy or value function directly from experience without building an explicit model of how the environment works. Q-learning and PPO are model-free. They are simpler to implement but often require many more interactions with the environment.

**Model-based** algorithms learn or are given a model of the environment's dynamics (transition probabilities and rewards) and use it for planning. Dyna-Q, introduced by Sutton in 1990, was an early approach that combined real experience with simulated experience generated from a learned model.[14] More recent model-based methods include MuZero, which learns a latent dynamics model focused on predicting rewards and values rather than raw observations, and Dreamer, which learns a world model in latent space and uses it to train a policy entirely through imagined rollouts.[15]

Model-based methods tend to be more sample-efficient because they can generate synthetic training data by simulating rollouts with the learned model. However, if the learned model is inaccurate, its errors compound over long rollouts and can lead to poor policies.

### Value-based vs. policy-based

**Value-based** methods (Q-learning, DQN) learn a value function and derive a policy from it (e.g., always choose the action with the highest Q-value). They work well for discrete action spaces but struggle with continuous actions, because picking the greedy action requires maximizing the Q-function over an infinite set of actions.

**Policy-based** methods (REINFORCE, PPO) directly parameterize and optimize the policy without necessarily learning a value function. They handle continuous action spaces naturally and can learn stochastic policies, but tend to have higher variance in gradient estimates.

**Actor-critic** methods combine both: an actor (policy network) selects actions while a critic (value network) evaluates them. This reduces variance compared to pure policy gradient methods while retaining the ability to handle continuous actions.

### On-policy vs. off-policy

**On-policy** algorithms (SARSA, PPO, A2C) learn about the policy currently being executed. They use data generated by the current policy to update that same policy. This can be more stable but is less sample-efficient because old data cannot be reused after a policy update.

**Off-policy** algorithms (Q-learning, DQN, SAC) can learn from data generated by any policy, including old versions of the agent or even random exploration. This allows experience replay, where past transitions are stored in a buffer and sampled repeatedly, greatly improving sample efficiency.

| Classification axis | Category A | Category B |
| --- | --- | --- |
| Environment model | Model-free: Q-learning, PPO, SAC | Model-based: Dyna-Q, MuZero, Dreamer |
| What is learned | Value-based: Q-learning, DQN | Policy-based: REINFORCE, PPO |
| Data source | On-policy: SARSA, A2C, PPO | Off-policy: Q-learning, DQN, SAC |
| State representation | Tabular: classic Q-learning | Function approximation: [deep learning](/wiki/deep_learning)-based RL |

## Key algorithms

### Q-learning

[Q-learning](/wiki/q-learning), introduced by Christopher Watkins in his 1989 Cambridge PhD thesis *Learning from Delayed Rewards*, is a model-free, off-policy algorithm that learns the optimal action-value function directly.[16] The update rule is:

*Q(s,a) <- Q(s,a) + alpha [r + gamma * max_{a'} Q(s',a') - Q(s,a)]*

where *alpha* is the learning rate. The key insight is that the update uses the maximum Q-value over the next state's actions regardless of which action the agent actually took. This "off-policy" property means Q-learning can learn about the optimal policy while following an exploratory one. Watkins and Dayan (1992) provided the first rigorous proof that Q-learning converges to the optimal Q-function with probability 1, given sufficient exploration and decreasing learning rates.[16]

Q-learning is simple and effective for problems with small, discrete state and action spaces. For larger problems, function approximation (such as neural networks) is needed.
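
The following sketch runs tabular Q-learning on a hypothetical five-state corridor in which only reaching the rightmost state is rewarded. The environment, the hyperparameters, and the purely random behavior policy are illustrative assumptions; the random behavior is chosen deliberately to show the off-policy property.

```python
import random

N_STATES, GOAL = 5, 4          # states 0..4, state 4 is terminal
LEFT, RIGHT = 0, 1
ALPHA, GAMMA = 0.5, 0.9


def step(state, action):
    """Deterministic corridor: reward 1 only on reaching the goal."""
    next_state = max(state - 1, 0) if action == LEFT else state + 1
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL


def train(episodes=500, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(N_STATES) for a in (LEFT, RIGHT)}
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            a = rng.choice((LEFT, RIGHT))  # behavior policy: uniformly random
            s2, r, done = step(s, a)
            # Off-policy target: best next action, not the one taken next.
            best_next = 0.0 if done else max(Q[(s2, LEFT)], Q[(s2, RIGHT)])
            Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
            s = s2
    return Q


Q = train()
greedy = [max((LEFT, RIGHT), key=lambda a: Q[(s, a)]) for s in range(GOAL)]
print(greedy)  # greedy action in each non-terminal state
```

Although the agent never acts greedily during training, the greedy policy read off the learned table moves right in every state, which is optimal for this corridor.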

### SARSA

SARSA (State-Action-Reward-State-Action), introduced by Rummery and Niranjan in 1994, is an on-policy variant of Q-learning.[17] Its update rule uses the action actually taken in the next state rather than the maximum:

*Q(s,a) <- Q(s,a) + alpha [r + gamma * Q(s',a') - Q(s,a)]*

Because SARSA evaluates the policy it is actually following, it tends to learn safer policies than Q-learning. In a cliff-walking problem, for example, Q-learning learns the optimal path along the cliff edge, while SARSA learns a safer path further from the edge, because it accounts for the possibility of exploratory actions leading to a fall.
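
The difference between the two update targets can be seen on a single transition. In this sketch the state and action names and the Q-values are hypothetical, loosely modeled on a step next to a cliff edge.

```python
def q_learning_target(Q, r, s_next, actions, gamma):
    """Off-policy target: bootstrap from the best action in the next state."""
    return r + gamma * max(Q[(s_next, a)] for a in actions)


def sarsa_target(Q, r, s_next, a_next, gamma):
    """On-policy target: bootstrap from the action actually taken next."""
    return r + gamma * Q[(s_next, a_next)]


ACTIONS = ("forward", "toward_cliff")
Q = {("edge", "forward"): 5.0, ("edge", "toward_cliff"): -100.0}

# Suppose the exploratory policy happens to pick "toward_cliff" next.
print(q_learning_target(Q, r=-1.0, s_next="edge", actions=ACTIONS, gamma=1.0))   # 4.0
print(sarsa_target(Q, r=-1.0, s_next="edge", a_next="toward_cliff", gamma=1.0))  # -101.0
```

Q-learning ignores the exploratory misstep, while SARSA folds its cost into the value of the preceding state-action pair, which is why SARSA's estimates near the edge are more pessimistic.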

### Deep Q-Networks (DQN)

[Deep Q-Networks](/wiki/deep_q-network_dqn) (DQN), published by Mnih et al. at [DeepMind](/wiki/deepmind) in 2013 and in *Nature* in 2015, revolutionized RL by using deep [convolutional neural networks](/wiki/convolutional_neural_network) to approximate Q-values for high-dimensional state spaces.[18] DQN took raw pixel inputs from Atari 2600 games and learned to play 49 different games using the same architecture and hyperparameters, achieving human-level performance on 29 of them. The Nature paper reported that "the deep Q-network agent, receiving only the pixels and the game score as inputs, was able to surpass the performance of all previous algorithms and achieve a level comparable to that of a professional human games tester across a set of 49 games, using the same algorithm, network architecture and hyperparameters."[18]

Two innovations made this possible:

- **Experience replay**: the agent stores transitions *(s, a, r, s')* in a replay buffer and samples random mini-batches for training. This breaks correlations between consecutive samples and improves data efficiency.
- **Target network**: a separate, periodically updated copy of the Q-network computes target values. This stabilizes training by preventing the target from shifting with every update.
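
Both mechanisms can be sketched without a deep learning library. In the simplified example below a plain dictionary stands in for the Q-network, and the class name, the `done` flag, and the synchronization interval are assumptions rather than details taken from the DQN paper.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity buffer; the oldest transitions are evicted first."""

    def __init__(self, capacity, seed=0):
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Uniform sampling breaks the correlation between consecutive steps.
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


def dqn_targets(batch, target_q, n_actions, gamma):
    """Compute r + gamma * max_a' Q_target(s', a') for each transition."""
    targets = []
    for s, a, r, s_next, done in batch:
        bootstrap = 0.0 if done else max(
            target_q.get((s_next, b), 0.0) for b in range(n_actions)
        )
        targets.append(r + gamma * bootstrap)
    return targets


def maybe_sync(step, online_q, target_q, every):
    """Copy the online parameters into the target copy every `every` steps."""
    return dict(online_q) if step % every == 0 else target_q


buffer = ReplayBuffer(capacity=3)
for i in range(5):
    buffer.push(s=i, a=0, r=1.0, s_next=i + 1, done=False)
print(len(buffer))       # capacity is 3, so only the last 3 transitions remain
print(buffer.sample(2))  # a random mini-batch of 2 stored transitions
```

In a real DQN the dictionary lookups would be forward passes through two copies of the same neural network, and the targets would feed a regression loss on the online network.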

DQN was the first demonstration that a single RL agent could learn complex behaviors directly from sensory input across many different tasks, and it sparked the deep reinforcement learning revolution.

### Policy gradient methods

Policy gradient methods directly optimize a parameterized policy by estimating the gradient of expected return with respect to policy parameters.[19] The foundational algorithm is REINFORCE (Williams, 1992), which updates policy parameters *theta* using:

*nabla_theta J(theta) ~ Sum_t G_t * nabla_theta log pi_theta(a_t | s_t)*

where *G_t* is the return from time step *t*. The intuition is straightforward: increase the probability of actions that led to high returns, decrease the probability of actions that led to low returns.

REINFORCE is simple but suffers from high variance in gradient estimates. Adding a **baseline** (typically the state value function) reduces variance without introducing bias:

*nabla_theta J(theta) ~ Sum_t (G_t - V(s_t)) * nabla_theta log pi_theta(a_t | s_t)*

The term *(G_t - V(s_t))* is an estimate of the **advantage**, and this leads to the family of advantage actor-critic methods.
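
A minimal sketch of REINFORCE with a baseline follows, assuming a hypothetical one-state problem (a two-armed bandit) with a softmax policy. Because there is only one state, the baseline reduces to a running average of observed returns; the learning rate and step count are arbitrary choices.

```python
import math
import random


def softmax(theta):
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]


def train(steps=2000, lr=0.5, seed=0):
    rng = random.Random(seed)
    theta = [0.0, 0.0]   # one policy parameter (logit) per action
    baseline = 0.0       # running estimate of the expected return
    for _ in range(steps):
        probs = softmax(theta)
        a = rng.choices([0, 1], weights=probs, k=1)[0]
        G = 1.0 if a == 1 else 0.0   # hypothetical bandit: action 1 pays 1
        advantage = G - baseline
        for k in range(2):
            # Gradient of log softmax: indicator(k == a) - pi(k).
            grad_log = (1.0 if k == a else 0.0) - probs[k]
            theta[k] += lr * advantage * grad_log
        baseline += 0.1 * (G - baseline)   # update baseline after the step
    return softmax(theta)


final_probs = train()
print(final_probs)  # probability mass shifts toward action 1
```

Each update raises the log-probability of the chosen action when its return beats the baseline and lowers it otherwise, which is the intuition stated above in code form.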

### Actor-critic methods

Actor-critic algorithms combine policy-based and value-based learning:[20]

- The **actor** is a policy network that selects actions.
- The **critic** is a value network that estimates the value of states or state-action pairs.

The critic reduces variance in the policy gradient estimate by providing a learned baseline. Several important variants exist:

**A2C (Advantage Actor-Critic)** uses the advantage function *A(s,a) = Q(s,a) - V(s)* to update the actor. **A3C (Asynchronous Advantage Actor-Critic)**, introduced by Mnih et al. in 2016, runs multiple agents in parallel on separate copies of the environment, each contributing gradients asynchronously to a shared model.[21] This was one of the first methods to effectively scale RL training across many CPU cores.

**DDPG (Deep Deterministic Policy Gradient)**, introduced by Lillicrap et al. in 2015, extends DQN to continuous action spaces by learning a deterministic policy alongside a Q-function.[22] It uses experience replay and target networks, similar to DQN.

**TD3 (Twin Delayed DDPG)**, published by Fujimoto et al. in 2018, addresses overestimation bias in DDPG by maintaining two critic networks and taking the minimum of their estimates, delaying policy updates, and adding noise to target actions.[23]

### Proximal Policy Optimization (PPO)

[Proximal Policy Optimization](/wiki/reinforcement_learning) (PPO), introduced by Schulman et al. at OpenAI in 2017, constrains policy updates to prevent destructively large changes.[24] PPO optimizes a clipped surrogate objective:

*L^CLIP(theta) = E[min(r_t(theta) * A_t, clip(r_t(theta), 1 - epsilon, 1 + epsilon) * A_t)]*

where *r_t(theta) = pi_theta(a_t | s_t) / pi_{theta_old}(a_t | s_t)* is the probability ratio and *epsilon* is typically 0.2. The clipping prevents the new policy from deviating too far from the old one in a single update.
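
The effect of clipping is easiest to see numerically. The sketch below evaluates the clipped surrogate on hand-picked probability ratios and advantages; the function name and the sample values are illustrative, and a real implementation would compute the ratios from policy network outputs.

```python
def ppo_clip_objective(ratios, advantages, epsilon=0.2):
    """Mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A) over samples."""
    terms = []
    for r, adv in zip(ratios, advantages):
        clipped = min(max(r, 1.0 - epsilon), 1.0 + epsilon)
        terms.append(min(r * adv, clipped * adv))
    return sum(terms) / len(terms)


# Positive advantage: gains from raising the ratio are capped at 1 + epsilon.
print(ppo_clip_objective([1.5], [1.0]))   # min(1.5, 1.2) = 1.2
# Negative advantage: the unclipped, more pessimistic term is kept.
print(ppo_clip_objective([1.5], [-1.0]))  # min(-1.5, -1.2) = -1.5
# Negative advantage, ratio already small: no further gain from shrinking it.
print(ppo_clip_objective([0.5], [-1.0]))  # min(-0.5, -0.8) = -0.8
```

Taking the minimum makes the objective a pessimistic bound: the policy is never rewarded for moving the ratio outside the clipping range, but it is still penalized when such a move makes things worse.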

PPO has become one of the most widely used RL algorithms due to its simplicity, stability, and strong empirical performance. OpenAI used it to train OpenAI Five (Dota 2), and it was the original RL algorithm used in RLHF for ChatGPT.

### Soft Actor-Critic (SAC)

[Soft Actor-Critic](/wiki/soft_actor_critic) (SAC), introduced by Haarnoja et al. in 2018, augments the standard RL objective with an entropy term that encourages exploration:[25]

*J(pi) = Sum_t E[r(s_t, a_t) + alpha * H(pi(.|s_t))]*

where *H* is the entropy of the policy and *alpha* is a temperature parameter controlling the tradeoff between reward maximization and entropy (exploration). SAC is off-policy, uses experience replay, and automatically tunes the temperature parameter. It achieves strong performance on continuous control benchmarks with better sample efficiency than on-policy methods like PPO.
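
To make the entropy term concrete, the sketch below evaluates the per-step quantity *r + alpha * H(pi)* for two policies. It uses a discrete action distribution as a simplifying assumption (SAC is usually applied with continuous actions), and the function names and numbers are illustrative.

```python
import math


def entropy(probs):
    """Shannon entropy H = -Sum_a p(a) log p(a), in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def soft_reward(reward, probs, alpha):
    """Per-step term of the maximum entropy objective: r + alpha * H(pi)."""
    return reward + alpha * entropy(probs)


uniform = [0.5, 0.5]
deterministic = [1.0, 0.0]
print(entropy(uniform))        # log(2), approximately 0.693
print(entropy(deterministic))  # 0.0
print(soft_reward(1.0, uniform, alpha=0.2))
print(soft_reward(1.0, deterministic, alpha=0.2))
```

For the same reward, the more random policy scores higher, so the agent gives up entropy only where doing so earns enough additional reward; a larger *alpha* makes that trade more expensive.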

### Algorithm comparison

| Algorithm | Type | Year | Key innovation | Best suited for | Sample efficiency |
| --- | --- | --- | --- | --- | --- |
| [Q-learning](/wiki/q-learning) | Value, off-policy | 1989 | Model-free optimal control | Small discrete problems | Low |
| SARSA | Value, on-policy | 1994 | On-policy TD control | Safe learning scenarios | Low |
| [DQN](/wiki/deep_q-network_dqn) | Value, off-policy | 2013 | Deep RL with experience replay | Discrete actions, visual input | Medium |
| DDPG | Actor-critic, off-policy | 2015 | Continuous action DQN | Continuous control | Medium |
| TRPO | Policy, on-policy | 2015 | Trust region constraints | Stable policy optimization | Low |
| A3C | Actor-critic, on-policy | 2016 | Asynchronous parallel training | CPU-based distributed training | Low |
| [PPO](/wiki/reinforcement_learning) | Policy, on-policy | 2017 | Clipped surrogate objective | General purpose, RLHF | Low |
| [SAC](/wiki/soft_actor_critic) | Actor-critic, off-policy | 2018 | Maximum entropy RL | Continuous control, robotics | High |
| TD3 | Actor-critic, off-policy | 2018 | Twin critics, delayed updates | Continuous control | High |
| [AlphaZero](/wiki/alphazero) | Model-based, self-play | 2017 | Self-play with MCTS | Perfect information games | Very high |
| [MuZero](/wiki/muzero) | Model-based, learned model | 2020 | Learned latent dynamics | Games without known rules | Very high |
| GRPO | Policy, on-policy | 2024 | Group relative advantage estimation | LLM reasoning training | Medium |

## Deep reinforcement learning

Deep reinforcement learning (deep RL) combines RL algorithms with [deep neural networks](/wiki/deep_learning) as function approximators, enabling agents to handle high-dimensional state and action spaces that are intractable for tabular methods.

### Why deep learning transformed RL

Classic RL algorithms like tabular Q-learning maintain a table of values for every state-action pair. This works for problems with small state spaces (a few hundred or thousand states) but fails completely when states are described by images, continuous variables, or other high-dimensional inputs. A single Atari game frame has 210 x 160 pixels with 128 possible colors per pixel, making the raw state space astronomically large.
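
A rough calculation shows the scale involved. The sketch below counts the decimal digits in 128^(210 x 160), an upper bound on the number of distinct frames that takes the figures above at face value and ignores that most pixel combinations never occur in an actual game.

```python
import math

pixels = 210 * 160   # 33,600 pixels per frame
colors = 128         # possible colors per pixel

# Number of decimal digits in colors ** pixels, computed via logarithms
# so that the huge integer never has to be converted to a string.
digits = math.floor(pixels * math.log10(colors)) + 1
print(pixels)  # 33600
print(digits)  # 70803
```

A table with one entry per possible frame would therefore need on the order of 10^70802 rows, which is why tabular methods are not an option for pixel inputs.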

[Neural networks](/wiki/neural_network) solve this by learning compact, generalizable representations of value functions or policies. A convolutional neural network can process raw pixels and output Q-values or action probabilities, automatically learning relevant features like object positions, velocities, and spatial relationships.

### Key architectural patterns

Deep RL uses several recurring architectural patterns:

- **Convolutional networks** for processing visual observations (DQN, AlphaGo).
- **Recurrent networks** (LSTMs, GRUs) for handling partial observability and sequential dependencies.
- **[Transformers](/wiki/transformer)** for sequence modeling and attention over long histories (Decision Transformer, Gato).
- **Residual networks** and **normalization layers** for training stability in deep value and policy networks.

### Stability challenges

Combining neural networks with RL introduces several instability issues that do not arise in supervised learning. The training data distribution changes as the policy improves (non-stationarity). Small changes in the value function can cause large changes in the policy, which in turn changes the data distribution. Experience replay, target networks, gradient clipping, and entropy regularization are common techniques for addressing these issues.

## Major milestones

### TD-Gammon (1992)

TD-Gammon, developed by Gerald Tesauro at IBM's Thomas J. Watson Research Center, was one of the earliest demonstrations that RL combined with neural networks could achieve expert-level performance.[26] The system used a three-layer neural network with 198 input features, 80 hidden units, and one output unit to evaluate backgammon positions. It learned entirely through self-play using TD(lambda), playing approximately 1.5 million games against itself. By version 2.1, TD-Gammon played at a level just slightly below the world's top human players. The program is commonly cited as a precursor to the deep RL breakthroughs that followed two decades later.

### DQN and Atari (2013, 2015)

DeepMind's DQN was the first system to learn successful control policies directly from raw pixel inputs across a diverse set of tasks.[18] The 2013 paper demonstrated strong performance on seven Atari games; the 2015 *Nature* paper extended this to 49 games, achieving human-level performance on 29 of them using the same architecture and hyperparameters for every game. Because a separate network was trained for each game, the result showed that one deep RL recipe could learn very different tasks without game-specific tuning, not that a single trained agent could play them all.

### AlphaGo and AlphaZero (2016, 2017)

[AlphaGo](/wiki/alphago) defeated 18-time world Go champion Lee Sedol 4-1 in March 2016, an event watched by over 200 million people.[5] AlphaGo combined supervised learning from human expert games with RL through self-play, using Monte Carlo tree search (MCTS) guided by a policy network and a value network.
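
The way the two networks guide the search can be illustrated with the PUCT-style selection rule used inside the tree: each candidate move is scored by its current value estimate plus an exploration bonus that is proportional to the policy network's prior and shrinks as the move is visited. The function below is a minimal sketch of that rule for a single tree node; the name `puct_select` and the default `c_puct` value are assumptions made for illustration.

```python
import math


def puct_select(priors, visit_counts, q_values, c_puct=1.0):
    """Pick the child index maximizing Q + U at one search-tree node.

    priors:       policy-network probabilities for each move
    visit_counts: how often each move has been explored so far
    q_values:     mean value estimate of each move from earlier simulations
    """
    total_visits = sum(visit_counts)
    best_action, best_score = None, -math.inf
    for action, (p, n, q) in enumerate(zip(priors, visit_counts, q_values)):
        # Exploration bonus: large for high-prior, rarely visited moves.
        u = c_puct * p * math.sqrt(total_visits) / (1 + n)
        if q + u > best_score:
            best_action, best_score = action, q + u
    return best_action
```

With `priors=[0.6, 0.4]`, `visit_counts=[10, 0]`, and `q_values=[0.1, 0.0]`, the unvisited second move is selected, because its exploration bonus outweighs the first move's small value advantage.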

[AlphaGo Zero](/wiki/alphago_zero), published in October 2017, eliminated the need for human data entirely, learning exclusively through self-play starting from random play.[27] It surpassed the version of AlphaGo that defeated Lee Sedol after about 36 hours of training. [AlphaZero](/wiki/alphazero) generalized this approach to chess and shogi as well, defeating the strongest existing programs in all three games within 24 hours of training from scratch.[28]

### OpenAI Five (2019)

[OpenAI Five](/wiki/openai_five) tackled Dota 2, a game with far greater complexity than Go: imperfect information, real-time decision-making, long time horizons (approximately 20,000 frames per game), a massive action space, and five-player teamwork.[6] The system used PPO with self-play across 128,000 CPU cores and 256 GPUs, accumulating the equivalent of 45,000 years of gameplay experience. In April 2019, it defeated OG, the reigning human world champions, 2-0. OpenAI Five demonstrated that PPO and massive-scale self-play could handle multi-agent coordination in complex real-time environments.

### AlphaStar (2019)

DeepMind's [AlphaStar](/wiki/alphastar) reached Grandmaster level in StarCraft II, placing in the top 0.2% of human players on the official European ladder.[29] StarCraft II presents challenges beyond Go: imperfect information (fog of war), real-time actions, long-term strategic planning, and a combinatorial action space. AlphaStar combined imitation learning from human replays with multi-agent reinforcement learning, training a league of agents that competed against one another to develop diverse strategies.

## Reinforcement Learning from Human Feedback (RLHF)

[RLHF](/wiki/rlhf) has become one of the most consequential applications of reinforcement learning. It is the technique that transforms a pre-trained language model into a conversational assistant that follows instructions, refuses harmful requests, and generally behaves in ways humans find helpful.[30]

### How RLHF works

The RLHF process typically involves three stages:

1. **[Supervised fine-tuning](/wiki/supervised_fine-tuning) (SFT)**: a pre-trained language model is fine-tuned on a dataset of human-written demonstrations of desired behavior.
2. **Reward model training**: human labelers compare pairs of model outputs and indicate which they prefer. These preference labels train a reward model that predicts a scalar score for any given output.
3. **RL optimization**: the SFT model is further trained using RL (typically PPO) to maximize the reward model's score, with a KL-divergence penalty to prevent the model from deviating too far from the SFT model.
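
Stages 2 and 3 each reduce to a short formula. The sketch below shows the pairwise preference loss commonly used to train the reward model, and the KL-penalized reward that the RL stage maximizes. It operates on plain floats rather than on a real language model; the function names and the default `beta` are illustrative assumptions.

```python
import math


def pairwise_preference_loss(score_chosen, score_rejected):
    """Reward-model loss for one comparison: -log sigmoid(chosen - rejected)."""
    margin = score_chosen - score_rejected
    # Numerically stable form of log(1 + exp(-margin)).
    return math.log1p(math.exp(-abs(margin))) + max(-margin, 0.0)


def kl_penalized_reward(reward_model_score, logp_policy, logp_sft, beta=0.1):
    """Reward seen by the RL stage: score minus a penalty for drifting from SFT.

    logp_policy and logp_sft are the log-probabilities that the current policy
    and the frozen SFT model assign to the same sampled response.
    """
    return reward_model_score - beta * (logp_policy - logp_sft)
```

The loss is small when the reward model already scores the preferred output higher, and the penalty term grows as the policy assigns its samples much more probability than the SFT model does.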

OpenAI's InstructGPT (2022) was one of the first published demonstrations of this approach,[31] and the same methodology was used for [ChatGPT](/wiki/chatgpt). [Anthropic](/wiki/anthropic) applied a variant called [Constitutional AI](/wiki/constitutional_ai) (CAI) to train [Claude](/wiki/claude), where AI-generated feedback partially replaces human labeling.[32]

### Evolution beyond PPO

The RL component of RLHF has evolved rapidly:

| Method | Year | Description |
| --- | --- | --- |
| PPO-based RLHF | 2022 | Original approach used for InstructGPT and ChatGPT |
| [Direct Preference Optimization](/wiki/direct_preference_optimization_dpo) (DPO) | 2023 | Eliminates separate reward model and RL step; directly optimizes on preference pairs |
| Kahneman-Tversky Optimization (KTO) | 2024 | Works with binary (good/bad) labels instead of pairwise preferences |
| Group Relative Policy Optimization (GRPO) | 2024 | Eliminates value network; estimates advantages from group reward distribution |
| Reinforcement Learning from AI Feedback (RLAIF) | 2023+ | Uses AI-generated preferences to scale alignment |
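
To make the DPO row concrete, the sketch below computes the DPO loss for a single preference pair from four sequence log-probabilities. No reward model or sampling loop appears: the loss depends only on how much more the policy favors the chosen response than a frozen reference model does. The function name and the default `beta` are assumptions for illustration.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of responses to the same prompt."""
    # How far the policy has moved from the reference model on each response.
    chosen_logratio = logp_chosen - ref_logp_chosen
    rejected_logratio = logp_rejected - ref_logp_rejected
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log sigmoid(logits), written in a numerically stable form.
    return math.log1p(math.exp(-abs(logits))) + max(-logits, 0.0)
```

When the policy equals the reference model the loss is `log 2`; it falls as the policy shifts probability toward the chosen response and away from the rejected one.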

[DeepSeek](/wiki/deepseek)-R1, released in January 2025, demonstrated that RL using GRPO with verifiable rewards can produce strong reasoning capabilities in LLMs.[33] Its companion model, DeepSeek-R1-Zero, was trained without any supervised fine-tuning step and learned behaviors like self-reflection, verification, and chain-of-thought reasoning purely through RL. The released DeepSeek-R1 model itself incorporated cold-start supervised fine-tuning stages before its RL training and achieved performance comparable to OpenAI's o1 on mathematical reasoning benchmarks.[33]
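
The core of GRPO's advantage estimate fits in a few lines: several responses are sampled for the same prompt, and each response's reward is standardized against the group's mean and standard deviation, so no value network is needed. The sketch below assumes the population standard deviation and a small `eps` for numerical safety; implementations differ on both details.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against the group sampled for one prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population standard deviation
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled answers to one prompt, scored 1.0 if correct and 0.0 otherwise.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Correct answers receive positive advantages and incorrect ones negative advantages. If every response in a group earns the same reward, all advantages are zero and that prompt contributes no learning signal.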

### Reinforcement Learning with Verifiable Rewards (RLVR)

RLVR is a training paradigm where rewards come from deterministic, rule-based verifiers rather than learned reward models.[33] For mathematical problems, the verifier checks whether the model's final answer matches the correct solution. For code generation, automated tests serve as the verifier. Because the reward is computed by a fixed rule, RLVR largely avoids the reward hacking that arises when a policy exploits errors in a learned reward model, although a loosely specified verifier can still be gamed. It has become the standard approach for training reasoning-focused LLMs as of 2025. GRPO is the most common RL optimizer used with RLVR in open-source reasoning models.
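
A verifier for mathematical answers can be as small as a string comparison. The sketch below assumes, purely for illustration, that the model is instructed to wrap its final answer in `\boxed{...}` and that exact string equality is an adequate check; production verifiers usually normalize expressions before comparing. The function names are hypothetical.

```python
import re


def extract_final_answer(completion):
    """Return the contents of the last \\boxed{...} in the text, or None."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    return matches[-1].strip() if matches else None


def math_reward(completion, reference):
    """Binary verifiable reward: 1.0 for a matching final answer, else 0.0."""
    answer = extract_final_answer(completion)
    return 1.0 if answer is not None and answer == reference.strip() else 0.0
```

A completion that reaches the right number but omits the required format scores 0.0 under this rule, which is why format instructions are typically part of the prompt.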

## What is reinforcement learning used for?

Reinforcement learning is used wherever a system must learn a sequence of decisions to optimize a long-term outcome rather than predict a single label. Its highest-profile successes are in game playing, but the same machinery now trains industrial control systems, robots, recommendation engines, and the alignment of large language models. The sections below survey the major application areas.

### Game playing

Reinforcement learning has achieved superhuman performance in numerous games:

- **Board games**: [AlphaGo](/wiki/alphago), [AlphaZero](/wiki/alphazero), and [MuZero](/wiki/muzero) mastered Go, chess, and shogi through self-play.[28]
- **Video games**: DQN mastered 49 Atari games; OpenAI Five conquered Dota 2; AlphaStar reached Grandmaster in StarCraft II.[29]
- **Poker**: Pluribus (2019) defeated professional players in six-player no-limit Texas Hold'em, the first AI to beat humans in a major multiplayer poker format.[34]
- **Diplomacy**: Meta's Cicero (2022) achieved human-level performance in the board game Diplomacy, combining RL with natural language generation for negotiation.[35]

### Robotics

RL enables robots to acquire motor skills through trial and error rather than manual programming:

- **Locomotion**: policies trained in simulation transfer to physical robots for walking, running, and navigating uneven terrain. [Boston Dynamics](/wiki/boston_dynamics) uses RL for some aspects of its robots' behavior.
- **Manipulation**: OpenAI demonstrated a robotic hand (Dactyl) solving a Rubik's Cube using RL with sim-to-real transfer and domain randomization (2019).[36]
- **Assembly and manufacturing**: industrial robots learn assembly sequences, welding paths, and pick-and-place operations.
- **Sim-to-real transfer**: training in physics simulators (MuJoCo, Isaac Gym) and transferring to physical hardware remains a major research area. Domain randomization, where simulation parameters are varied during training, helps bridge the gap between simulated and real environments.
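
Domain randomization amounts to drawing a fresh set of simulator parameters before every training episode. The sketch below shows the sampling step only; the parameter names and ranges are hypothetical, and a real pipeline would pass the sampled values to the simulator's configuration before each rollout.

```python
import random

# Hypothetical ranges; real values depend on the robot and the simulator.
PARAM_RANGES = {
    "friction": (0.5, 1.5),
    "mass_scale": (0.8, 1.2),
    "motor_delay_s": (0.0, 0.04),
}


def sample_sim_params(rng):
    """Draw one value per physical parameter, uniformly within its range."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in PARAM_RANGES.items()}


rng = random.Random(0)
for episode in range(3):
    params = sample_sim_params(rng)
    # The simulator would be reconfigured with `params` here, before the rollout.
    print(episode, {name: round(value, 3) for name, value in params.items()})
```

A policy that performs well across all sampled variations is more likely to tolerate the one set of parameters it meets on real hardware.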

### Autonomous vehicles

Self-driving systems employ RL for several aspects of driving:

- Path planning and trajectory optimization
- Lane changing and merging decisions
- Adaptive cruise control and following distance
- Intersection and traffic light negotiation

[Waymo](/wiki/waymo), [Tesla](/wiki/tesla), and other companies use RL as one component of their autonomous driving stacks, though most production systems combine RL with rule-based safety constraints and imitation learning from human drivers.

### Healthcare

RL applications in medicine include:

- **Treatment optimization**: dynamic treatment regimes for chronic diseases such as sepsis management in ICUs, where RL agents recommend drug dosages and ventilator settings.[37]
- **Drug discovery**: molecular design and optimization of chemical structures.
- **Personalized medicine**: adaptive clinical trial designs that allocate patients to treatments based on observed responses.
- **Medical imaging**: RL-guided strategies for anatomical landmark detection and image acquisition optimization.

### Finance

Financial applications include:

- **Algorithmic trading**: automated strategies that learn to execute trades, manage inventory, and time entries and exits.
- **Portfolio management**: dynamic asset allocation that adapts to changing market conditions.
- **Risk management**: credit scoring models and fraud detection systems.
- **Market making**: RL agents that provide liquidity and manage bid-ask spreads.

### Energy and sustainability

- **Data center cooling**: [Google DeepMind](/wiki/google_deepmind) reported a 40% reduction in the energy used to cool Google's data centers after applying learned control of the cooling and HVAC settings (2016).[38]
- **Smart grids**: RL for load balancing, demand response, and renewable energy integration.
- **Wind farms**: optimizing turbine yaw angles and blade pitch to maximize energy output.
- **Building management**: HVAC and lighting optimization in commercial buildings.

### Natural language processing and LLM alignment

- **RLHF and RLVR**: training [ChatGPT](/wiki/chatgpt), [Claude](/wiki/claude), [GPT-4](/wiki/gpt-4), [Gemini](/wiki/gemini), [Llama](/wiki/llama), and [DeepSeek](/wiki/deepseek) to follow instructions and align with human values.
- **Dialogue systems**: optimizing conversational agents for engagement and task completion.
- **[Machine translation](/wiki/machine_translation)**: improving translation quality through reward signals based on [BLEU](/wiki/bleu_bilingual_evaluation_understudy) scores or human preferences.
- **[Text summarization](/wiki/text_summarization)**: generating concise, informative summaries optimized by RL-based reward signals.

### Recommendation systems

RL is used in recommendation systems where the goal is to maximize long-term user engagement rather than immediate click-through rates. Platforms like YouTube, Netflix, and Spotify use RL-inspired approaches to balance exploration (showing new content) with exploitation (recommending proven favorites), account for the sequential nature of user interactions, and optimize for long-term metrics like retention rather than short-term clicks.
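
The exploration and exploitation trade-off described here is often introduced through an epsilon-greedy bandit, which is a deliberate simplification: it ignores the sequential structure of user sessions and treats each recommendation as an independent choice. The sketch below is such a toy and does not describe any platform's production system; all names are illustrative.

```python
import random


def epsilon_greedy(estimates, epsilon, rng):
    """With probability epsilon show a random item, otherwise the best-known one."""
    if rng.random() < epsilon:
        return rng.randrange(len(estimates))  # explore
    return max(range(len(estimates)), key=lambda i: estimates[i])  # exploit


def update_estimate(estimates, counts, item, reward):
    """Update the running mean of observed reward for the item that was shown."""
    counts[item] += 1
    estimates[item] += (reward - estimates[item]) / counts[item]


estimates, counts = [0.0, 0.0, 0.0], [0, 0, 0]
rng = random.Random(0)
item = epsilon_greedy(estimates, epsilon=0.1, rng=rng)
update_estimate(estimates, counts, item, reward=1.0)  # e.g. the user clicked
```

Setting `epsilon` to zero never shows unproven content, while a large value wastes impressions on items already known to perform poorly.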

## Multi-agent reinforcement learning

Multi-agent reinforcement learning (MARL) extends RL to settings where multiple agents interact within a shared environment.[39] This introduces challenges absent in single-agent RL: agents must account for the behavior of other learning agents, which makes the environment non-stationary from each agent's perspective.

### Types of multi-agent settings

| Setting | Description | Examples |
| --- | --- | --- |
| Fully cooperative | All agents share a common reward | Robot swarm coordination, team-based games |
| Fully competitive | One agent's gain is another's loss (zero-sum) | Board games, competitive video games |
| Mixed (general-sum) | Agents have partially aligned, partially conflicting goals | Autonomous driving, economic markets, negotiation |
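
For two agents, the three settings in the table can be told apart by inspecting the reward pairs alone. The sketch below classifies a small matrix game given as a dictionary from joint actions to reward pairs. It treats any constant-sum game as fully competitive, which is a simplifying assumption, and the function name is hypothetical.

```python
def classify_game(payoffs):
    """payoffs maps each joint action to (reward_agent_1, reward_agent_2)."""
    pairs = list(payoffs.values())
    if all(r1 == r2 for r1, r2 in pairs):
        return "fully cooperative"  # both agents always receive the same reward
    total = pairs[0][0] + pairs[0][1]
    if all(r1 + r2 == total for r1, r2 in pairs):
        return "fully competitive"  # one agent's gain is exactly the other's loss
    return "mixed (general-sum)"


matching_pennies = {("H", "H"): (1, -1), ("H", "T"): (-1, 1),
                    ("T", "H"): (-1, 1), ("T", "T"): (1, -1)}
prisoners_dilemma = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
                     ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

print(classify_game(matching_pennies))
print(classify_game(prisoners_dilemma))
```

Matching pennies is zero-sum, whereas the prisoner's dilemma is general-sum: both players prefer mutual cooperation to mutual defection, yet each is tempted to defect.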

### Key approaches

- **Independent learners**: each agent runs its own RL algorithm, treating other agents as part of the environment. Simple but ignores the non-stationarity caused by other agents learning simultaneously.
- **Centralized training, decentralized execution (CTDE)**: agents share information during training (e.g., a shared critic with global state) but act based only on local observations during deployment. QMIX and MAPPO are popular CTDE algorithms.
- **Communication learning**: agents learn to communicate through discrete or continuous messages, enabling coordination in partially observable settings.
- **Population-based training**: a population of agents with different strategies co-evolve, as used in AlphaStar's league training.

### Applications of MARL

MARL has been applied to autonomous driving (multiple vehicles negotiating at intersections), robotic swarms (coordinated exploration and task allocation), traffic signal control (city-wide optimization of traffic flow), multiplayer games (Dota 2, StarCraft II), and resource allocation in smart grids and communication networks. A comprehensive MIT Press textbook on MARL was published in December 2024, reflecting the field's maturity.[40]

## Challenges and limitations

### Sample inefficiency

RL algorithms often require enormous amounts of interaction data to learn effective policies:[41]

- DQN: 200 million frames per Atari game (roughly 926 hours, or about 38 days, of play at 60 frames per second)
- OpenAI Five: the equivalent of 45,000 years of Dota 2 gameplay
- AlphaGo Zero: 4.9 million self-play games over 3 days for its 20-block version; the larger 40-block version trained for 40 days and generated 29 million games.[27]

This makes direct training on physical systems (robots, real vehicles) impractical for most current algorithms. Solutions include model-based RL (generating synthetic data from learned models), transfer learning (reusing knowledge from related tasks), curriculum learning (gradually increasing task difficulty), and offline RL (learning from fixed datasets without further interaction).
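
The play-time equivalent of DQN's frame budget follows from simple arithmetic. The sketch below assumes the emulator runs at 60 frames per second, which is the only figure not taken from the list above.

```python
FRAMES = 200_000_000  # DQN training frames per game, from the list above
FPS = 60              # assumption: emulator frame rate in frames per second

seconds = FRAMES / FPS
hours = seconds / 3600
days = hours / 24

print(f"{hours:.0f} hours, or {days:.1f} days of continuous play")
```

Collecting a comparable amount of experience on a physical robot would mean weeks of uninterrupted operation for a single task.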

### Exploration in large state spaces

Effective exploration becomes extremely difficult in environments with:

- **Sparse rewards**: where the agent receives no feedback until it reaches a rare goal state. A robot learning to stack blocks might receive a reward only upon successful completion, with no signal during the thousands of intermediate steps.
- **Large or continuous state spaces**: where the number of possible configurations is astronomical.
- **Safety-critical domains**: where exploration risks catastrophic failures. A self-driving car cannot explore bad driving strategies.

Approaches to these challenges include intrinsic motivation and curiosity-driven exploration (rewarding the agent for visiting novel states), hierarchical RL (decomposing problems into subgoals), and safe exploration methods with constraints.
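
One of the simplest forms of intrinsic motivation is a count-based bonus, in which the agent receives extra reward for reaching states it has rarely visited. The sketch below assumes hashable, discrete states and a bonus of `beta / sqrt(N(s))`; the class name and the default `beta` are illustrative, and methods for large state spaces replace the exact counts with learned novelty estimates.

```python
import math
from collections import Counter


class CountBonus:
    """Adds an exploration bonus that decays with the visit count of a state."""

    def __init__(self, beta=0.1):
        self.beta = beta
        self.counts = Counter()

    def shaped_reward(self, state, extrinsic_reward):
        self.counts[state] += 1
        bonus = self.beta / math.sqrt(self.counts[state])
        return extrinsic_reward + bonus
```

In a sparse-reward task the extrinsic reward is zero almost everywhere, so the bonus is initially the only signal and steers the agent toward unexplored regions.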

### Reward specification and reward hacking

Designing reward functions that capture the true objective is notoriously difficult:[42]

- **[Reward hacking](/wiki/reward_hacking)**: agents exploit unintended shortcuts in the reward function. A boat-racing agent famously learned to drive in circles collecting bonus items instead of finishing the race, because the bonus items gave more reward than race completion.
- **Reward shaping**: manually engineering intermediate rewards to guide learning is error-prone and can introduce biases.
- **Specification gaming**: agents find unexpected strategies that satisfy the letter of the reward function but not its intent.

In RLHF for LLMs, reward hacking manifests as models producing verbose, sycophantic responses that score highly with the reward model but are not actually more helpful. Mitigation strategies include inverse RL (learning rewards from demonstrations), reward model ensembles, and the Preference As Reward (PAR) approach introduced in 2025.

### Sim-to-real transfer

Policies trained in simulation often fail when deployed on physical hardware due to the "sim-to-real gap": differences in physics, sensor noise, actuator dynamics, and visual appearance between simulator and reality.[43] Research has shown that physics-based dynamics models can achieve up to 50% real-world success under strict precision constraints where simplified models fail entirely. Domain randomization (varying simulation parameters during training), system identification (calibrating simulation to match reality), and progressive domain adaptation help bridge this gap.

### Generalization and catastrophic forgetting

RL agents often fail to generalize beyond their training environment. A policy trained in one version of a video game may fail on a slightly different version. When learning multiple tasks sequentially, neural networks suffer from catastrophic forgetting, where learning a new task overwrites the weights needed for previously learned tasks. [Meta-learning](/wiki/meta-learning), domain randomization, and continual learning are active research areas addressing these issues.

### Interpretability and safety

Neural network policies are black boxes; it is difficult to understand why an agent takes a particular action. This creates problems for:

- **Verification**: proving that an RL system will behave safely in all possible situations.
- **Debugging**: identifying why an agent fails in specific scenarios.
- **Regulation**: deploying RL in safety-critical domains like healthcare or autonomous driving requires explainable decision-making.
- **[AI alignment](/wiki/ai_alignment)**: ensuring that RL agents' learned objectives align with human values and intentions.

## Current research directions (2025-2026)

### Offline reinforcement learning

Offline RL (also called batch RL) learns from fixed datasets of previously collected transitions without any further environment interaction.[44] This is valuable in domains where online exploration is expensive or dangerous (healthcare, autonomous driving, industrial control). Key methods include:

- **Conservative Q-Learning (CQL)**: penalizes Q-values for out-of-distribution actions to prevent overestimation.
- **Implicit Q-Learning (IQL)**: avoids querying out-of-distribution actions entirely.
- **Decision Transformer**: frames RL as a sequence modeling problem, using a [transformer](/wiki/transformer) to predict actions conditioned on desired returns.
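
The "desired return" that a Decision Transformer conditions on is the return-to-go: the sum of rewards from the current step to the end of the trajectory. The sketch below computes it for one logged trajectory and interleaves it with states and actions in the order the model consumes them. The helper names and the toy trajectory are illustrative.

```python
def returns_to_go(rewards):
    """Undiscounted sum of rewards from each time step to the end of the episode."""
    result, running = [], 0.0
    for reward in reversed(rewards):
        running += reward
        result.append(running)
    return result[::-1]


def to_sequence(states, actions, rewards):
    """Interleave (return-to-go, state, action) triples for sequence modeling."""
    return list(zip(returns_to_go(rewards), states, actions))


print(to_sequence(states=["s0", "s1", "s2"],
                  actions=["a0", "a1", "a2"],
                  rewards=[1.0, 0.0, 2.0]))
```

At deployment the first return-to-go is set to a target value chosen by the user and is reduced by each reward the agent actually receives.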

### Foundation models meet RL

The intersection of foundation models and RL is one of the most active research areas. Several directions have emerged:

- **RL for foundation model training**: RLHF, RLVR, and GRPO for aligning and improving LLMs.
- **[Foundation models](/wiki/foundation_models) for RL**: using pre-trained language and vision models to provide representations, world knowledge, or reward signals for RL agents.
- **Generalist agents**: systems like DeepMind's Gato (2022) and Google's RT-2 (2023) combine large pre-trained models with RL to create agents that operate across multiple domains (text, images, robotic control).[45]
- **Vision-Language-Action models**: RT-1, RT-2, and similar systems use transformer architectures to map visual observations and language instructions directly to robot actions.

### World models and model-based deep RL

Learned world models allow agents to plan and imagine future scenarios without interacting with the real environment:

- **Dreamer** (v1, v2, v3): learns a latent dynamics model and trains a policy entirely through imagined rollouts, achieving competitive performance with far less real-world data.[15]
- **RLVR-World** (2025): a framework that uses reinforcement learning with verifiable rewards to directly optimize world models across domains including text games, web navigation, and robot manipulation.
- **Differentiable physics simulators**: enable gradient-based optimization through simulated physics for robotics applications.

### Hierarchical reinforcement learning

Hierarchical RL decomposes complex, long-horizon tasks into manageable subtasks:

- **Options framework**: temporal abstraction through "options" (sub-policies that execute over multiple time steps).
- **Goal-conditioned policies**: high-level policies set subgoals; low-level policies achieve them.
- **Feudal networks**: hierarchical architectures where a manager sets goals for a worker.

This is particularly relevant for robotics and navigation tasks where planning over hundreds or thousands of steps is needed.

### Safe reinforcement learning

Safe RL develops algorithms that satisfy safety constraints during both training and deployment. Constrained MDPs formalize safety requirements as constraints on expected costs. Shielding approaches use formal verification to block unsafe actions. This is a growing area as RL moves into safety-critical applications like autonomous driving and medical treatment optimization.
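
The shielding idea can be sketched as a filter placed between the policy and the environment. In the toy below, a hand-written `is_safe` predicate stands in for the formally verified safety check, and the corridor environment is hypothetical; the point is only that the policy's preferences are respected unless they violate the constraint.

```python
def shield(state, ranked_actions, is_safe, fallback):
    """Return the policy's most preferred action that passes the safety check."""
    for action in ranked_actions:
        if is_safe(state, action):
            return action
    return fallback  # used only if every proposed action is unsafe


# Hypothetical 1-D corridor with positions 0..4, where position 4 is a hazard.
def is_safe(position, step):
    return 0 <= position + step < 4


# The policy prefers moving right, but from position 3 that would hit the hazard.
print(shield(3, ranked_actions=[+1, 0, -1], is_safe=is_safe, fallback=0))
```

Because unsafe actions are blocked before execution, the constraint holds during training as well as deployment, at the cost of requiring a trustworthy safety model.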

### Recent developments (2026)

The momentum behind RL for reasoning and agentic models continued through 2026. In September 2025, the [DeepSeek](/wiki/deepseek)-R1 work was published in *Nature* (vol. 645, pp. 633-638) after independent peer review, making it the first major open-weight LLM to clear that bar and formalizing the claim that reasoning can be incentivized through pure RL with GRPO.[46] On April 24, 2026, DeepSeek released its next-generation open-weight model, DeepSeek-V4, in two MIT-licensed variants: V4-Pro (a 1.6-trillion-parameter mixture-of-experts model with roughly 49 billion active parameters) and the smaller V4-Flash, both with a one-million-token context window. Its model card describes a two-stage post-training recipe that cultivates domain experts through supervised fine-tuning and RL with GRPO before consolidating them via on-policy distillation.[47]

A distinct research direction, "agentic RL," consolidated during 2026. Surveys recast LLM training as a temporally extended, partially observable MDP (rather than a single-step bandit), with RL endowing models with planning, tool use, memory, and self-reflection over long horizons.[48] Work on reward design and credit assignment also advanced: subproblem curriculum RL (SCRL), introduced in May 2026, converts partial progress on hard problems into verifiable learning signals, reporting gains of up to 4.1 points over GRPO on Qwen3-Base models.[49] In robotics, sim-to-real RL grew markedly cheaper, with a December 2025 method from a Berkeley-led team learning humanoid locomotion transferable to hardware in roughly 15 minutes of training.[50]

## Development tools and frameworks

| Framework | Language | Maintained by | Best suited for |
| --- | --- | --- | --- |
| [Gymnasium](/wiki/openai_gym) (formerly OpenAI Gym) | Python | Farama Foundation | Environment standard and benchmarking |
| Stable-Baselines3 | Python | Community | Reliable algorithm implementations (PPO, SAC, DQN) |
| Ray RLlib | Python | Anyscale | Production-scale distributed training |
| CleanRL | Python | Community | Single-file, readable algorithm implementations |
| TorchRL | Python | Meta (PyTorch) | Research flexibility and modularity |
| Unity ML-Agents | C#/Python | Unity Technologies | 3D simulation and game environments |
| TF-Agents | Python | Google | TensorFlow ecosystem integration |
| Tianshou | Python | Community | Modular research framework |
| ACME | Python | [DeepMind](/wiki/deepmind) | JAX-based research at scale |
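
Most of these libraries interact with environments through the Gymnasium call pattern: `reset` returns an observation and an info dictionary, and `step` returns the next observation, a reward, separate `terminated` and `truncated` flags, and an info dictionary. The sketch below mimics that pattern in plain Python without importing Gymnasium, using a hypothetical toy environment named `CoinFlipEnv`.

```python
import random


class CoinFlipEnv:
    """Toy environment that follows the Gymnasium-style reset/step call shape."""

    def __init__(self, horizon=10):
        self.horizon = horizon

    def reset(self, seed=None):
        self.rng = random.Random(seed)
        self.t = 0
        self.coin = self.rng.randint(0, 1)
        return self.coin, {}  # (observation, info)

    def step(self, action):
        reward = 1.0 if action == self.coin else 0.0
        self.t += 1
        self.coin = self.rng.randint(0, 1)
        terminated = False                  # this task has no terminal state
        truncated = self.t >= self.horizon  # time limit reached
        return self.coin, reward, terminated, truncated, {}


env = CoinFlipEnv(horizon=5)
obs, info = env.reset(seed=0)
total, done = 0.0, False
while not done:
    action = obs  # the observation reveals the coin, so copying it is optimal
    obs, reward, terminated, truncated, info = env.step(action)
    total += reward
    done = terminated or truncated
print(total)
```

Keeping `terminated` and `truncated` separate lets a learning algorithm distinguish a true end of the task from an episode that was merely cut off by a time limit.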

### Simulation environments

| Environment | Domain | Description |
| --- | --- | --- |
| MuJoCo | Physics/robotics | High-fidelity physics simulation for continuous control |
| Isaac Gym | Robotics | GPU-accelerated physics for massively parallel training |
| Arcade Learning Environment (ALE) | Atari games | Standard benchmark for discrete control from pixels |
| PettingZoo | Multi-agent | Standard API for multi-agent environments |
| CARLA | Autonomous driving | Open-source urban driving simulator |
| MineRL | Minecraft | Hierarchical tasks in a complex open-world game |
| Meta-World | Robotic manipulation | 50 distinct manipulation tasks for meta-learning research |
| RoboSuite | Robotic manipulation | Standardized benchmarks for robot learning |

## Frequently asked questions

**Who invented reinforcement learning?** No single person invented RL; it grew from three traditions (animal-learning psychology, optimal control, and temporal difference learning) that Richard Sutton and Andrew Barto unified in their 1998 textbook. The two shared the 2024 Turing Award for these foundations.[7][51] Christopher Watkins introduced Q-learning in 1989,[16] and Richard Bellman supplied the dynamic-programming mathematics in the 1950s.[10]

**Is reinforcement learning supervised or unsupervised?** Neither. RL is its own paradigm, the third alongside supervised and unsupervised learning, because it learns from an evaluative reward signal rather than from labels or from unlabeled structure.[1][2]

**Is reinforcement learning used in ChatGPT?** Yes. ChatGPT and other modern assistants are aligned using Reinforcement Learning from Human Feedback (RLHF), and reasoning models such as DeepSeek-R1 are trained with RL using verifiable rewards.[31][33]

**What is the most popular RL algorithm?** Proximal Policy Optimization (PPO), published by OpenAI in 2017, is the most widely used general-purpose algorithm and was the original optimizer for RLHF; for LLM reasoning, Group Relative Policy Optimization (GRPO) has become common.[24][33]

## See also

- [Group Sequence Policy Optimization (GSPO)](/wiki/gspo)
- [Instruction backtranslation (Humpback)](/wiki/instruction_backtranslation)
- [Depth up-scaling (DUS)](/wiki/depth_upscaling)
- [Selective Language Modeling (Rho-1)](/wiki/selective_language_modeling)
- [Machine learning](/wiki/machine_learning)
- [Deep learning](/wiki/deep_learning)
- [Supervised learning](/wiki/supervised_learning)
- [Unsupervised learning](/wiki/unsupervised_learning)
- [Markov decision process](/wiki/markov_decision_process_mdp)
- [Q-learning](/wiki/q-learning)
- [Deep Q-Network](/wiki/deep_q-network_dqn)
- [AlphaGo](/wiki/alphago)
- [OpenAI](/wiki/openai)
- [DeepMind](/wiki/deepmind)
- [RLHF](/wiki/rlhf)
- [Neural network](/wiki/neural_network)
- [AI alignment](/wiki/ai_alignment)

## References

[1] Sutton, R. S., & Barto, A. G. (2018). *Reinforcement Learning: An Introduction* (2nd ed.). MIT Press.

[2] Kaelbling, L. P., Littman, M. L., & Moore, A. W. (1996). "Reinforcement learning: A survey." *Journal of Artificial Intelligence Research*, 4, 237-285.

[3] Sutton, R. S., & Barto, A. G. (1998). *Reinforcement Learning: An Introduction* (1st ed.). MIT Press.

[4] Bellman, R. (1957). *Dynamic Programming*. Princeton University Press.

[5] Silver, D., et al. (2016). "Mastering the game of Go with deep neural networks and tree search." *Nature*, 529(7587), 484-489.

[6] OpenAI. (2019). "OpenAI Five defeats Dota 2 world champions." OpenAI Blog.

[7] ACM. (2024). "ACM A.M. Turing Award recognizes pioneers of reinforcement learning."

[8] Pavlov, I. P. (1927). *Conditioned Reflexes: An Investigation of the Physiological Activity of the Cerebral Cortex*. Oxford University Press.

[9] Thorndike, E. L. (1911). *Animal Intelligence: Experimental Studies*. Macmillan.

[10] Bellman, R. (1957). "A Markovian decision process." *Journal of Mathematics and Mechanics*, 6(5), 679-684.

[11] Sutton, R. S. (1988). "Learning to predict by the methods of temporal differences." *Machine Learning*, 3(1), 9-44.

[12] Sutton, R. S., & Barto, A. G. (1998). *Reinforcement Learning: An Introduction*. MIT Press.

[13] Puterman, M. L. (1994). *Markov Decision Processes: Discrete Stochastic Dynamic Programming*. Wiley.

[14] Sutton, R. S. (1990). "Integrated architectures for learning, planning, and reacting based on approximating dynamic programming." *Proceedings of the 7th International Conference on Machine Learning*.

[15] Hafner, D., et al. (2023). "Mastering diverse domains through world models." *arXiv:2301.04104*.

[16] Watkins, C. J. C. H. (1989). *Learning from Delayed Rewards*. PhD thesis, University of Cambridge. Convergence proof in Watkins, C. J. C. H., & Dayan, P. (1992). "Q-learning." *Machine Learning*, 8(3), 279-292.

[17] Rummery, G. A., & Niranjan, M. (1994). "On-line Q-learning using connectionist systems." *Technical Report CUED/F-INFENG/TR 166*, Cambridge University.

[18] Mnih, V., et al. (2015). "Human-level control through deep reinforcement learning." *Nature*, 518(7540), 529-533.

[19] Williams, R. J. (1992). "Simple statistical gradient-following algorithms for connectionist reinforcement learning." *Machine Learning*, 8(3), 229-256.

[20] Konda, V. R., & Tsitsiklis, J. N. (2000). "Actor-critic algorithms." *Advances in Neural Information Processing Systems*, 12.

[21] Mnih, V., et al. (2016). "Asynchronous methods for deep reinforcement learning." *Proceedings of the 33rd International Conference on Machine Learning*.

[22] Lillicrap, T. P., et al. (2015). "Continuous control with deep reinforcement learning." *arXiv:1509.02971*.

[23] Fujimoto, S., Hoof, H., & Meger, D. (2018). "Addressing function approximation error in actor-critic methods." *Proceedings of the 35th International Conference on Machine Learning*.

[24] Schulman, J., et al. (2017). "Proximal policy optimization algorithms." *arXiv:1707.06347*.

[25] Haarnoja, T., et al. (2018). "Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor." *Proceedings of the 35th International Conference on Machine Learning*.

[26] Tesauro, G. (1995). "Temporal difference learning and TD-Gammon." *Communications of the ACM*, 38(3), 58-68.

[27] Silver, D., et al. (2017). "Mastering the game of Go without human knowledge." *Nature*, 550(7676), 354-359.

[28] Silver, D., et al. (2018). "A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play." *Science*, 362(6419), 1140-1144.

[29] Vinyals, O., et al. (2019). "Grandmaster level in StarCraft II using multi-agent reinforcement learning." *Nature*, 575(7782), 350-354.

[30] Christiano, P. F., et al. (2017). "Deep reinforcement learning from human preferences." *Advances in Neural Information Processing Systems*, 30.

[31] Ouyang, L., et al. (2022). "Training language models to follow instructions with human feedback." *Advances in Neural Information Processing Systems*, 35.

[32] Bai, Y., et al. (2022). "Constitutional AI: Harmlessness from AI feedback." *arXiv:2212.08073*.

[33] DeepSeek-AI. (2025). "[DeepSeek-R1](/wiki/deepseek_r1): Incentivizing reasoning capability in LLMs via reinforcement learning." *arXiv:2501.12948*.

[34] Brown, N., & Sandholm, T. (2019). "Superhuman AI for multiplayer poker." *Science*, 365(6456), 885-890.

[35] FAIR et al. (2022). "Human-level play in the game of Diplomacy by combining language models with strategic reasoning." *Science*, 378(6624), 1067-1074.

[36] OpenAI et al. (2019). "Solving Rubik's Cube with a robot hand." *arXiv:1910.07113*.

[37] Komorowski, M., et al. (2018). "The Artificial Intelligence Clinician learns optimal treatment strategies for sepsis in intensive care." *Nature Medicine*, 24(11), 1716-1720.

[38] Evans, R., & Gao, J. (2016). "DeepMind AI reduces Google data centre cooling bill by 40%." DeepMind Blog.

[39] Busoniu, L., Babuska, R., & De Schutter, B. (2008). "A comprehensive survey of multiagent reinforcement learning." *IEEE Transactions on Systems, Man, and Cybernetics*, 38(2), 156-172.

[40] Albrecht, S. V., Christianos, F., & Schafer, L. (2024). *Multi-Agent Reinforcement Learning: Foundations and Modern Approaches*. MIT Press.

[41] Dulac-Arnold, G., et al. (2019). "Challenges of real-world reinforcement learning." *arXiv:1904.12901*.

[42] Amodei, D., et al. (2016). "Concrete problems in [AI safety](/wiki/ai_safety)." *arXiv:1606.06565*.

[43] Zhao, W., et al. (2020). "Sim-to-real transfer in deep reinforcement learning for robotics: A survey." *arXiv:2009.13303*.

[44] Levine, S., et al. (2020). "Offline reinforcement learning: Tutorial, review, and perspectives on open problems." *arXiv:2005.01643*.

[45] Reed, S., et al. (2022). "A generalist agent." *arXiv:2205.06175*.

[46] DeepSeek-AI. (2025). "DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning." *Nature*, 645(8081), 633-638. https://www.nature.com/articles/s41586-025-09422-z

[47] DeepSeek-AI. (2026). "DeepSeek-V4-Pro model card." Hugging Face. https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro

[48] Zhang, G., et al. (2025). "The landscape of agentic reinforcement learning for LLMs: A survey." *arXiv:2509.02547*. https://arxiv.org/abs/2509.02547

[49] "From reasoning chains to verifiable subproblems: Curriculum reinforcement learning enables credit assignment for LLM reasoning." (2026). *arXiv:2605.22074*. https://arxiv.org/abs/2605.22074

[50] Seo, Y., Sferrazza, C., Chen, J., Shi, G., Duan, R., & Abbeel, P. (2025). "Learning sim-to-real humanoid locomotion in 15 minutes." *arXiv:2512.01996*. https://arxiv.org/abs/2512.01996

[51] ACM. (2025). "ACM Announces 2024 A.M. Turing Award Recipients: Andrew Barto and Richard Sutton." Association for Computing Machinery. The award carries a US$1 million prize sponsored by Google. https://www.acm.org/articles/bulletins/2025/march/turing-award-2024

## External links

- [Reinforcement Learning: An Introduction](http://incompleteideas.net/book/the-book.html) (Sutton and Barto's textbook, free online)
- [OpenAI Spinning Up](https://spinningup.openai.com/) (educational resource for deep RL)
- [Gymnasium](https://gymnasium.farama.org/) (standard RL environment library)
- [Stable-Baselines3 Documentation](https://stable-baselines3.readthedocs.io/)
- [Ray RLlib Documentation](https://docs.ray.io/en/latest/rllib/index.html)
- [DeepMind Research](https://deepmind.google/research/)
- [RLHF Book by Nathan Lambert](https://rlhfbook.com/) (comprehensive guide to RLHF)