See also: Machine learning terms
Reinforcement learning (RL) is a branch of machine learning in which an agent learns to make sequential decisions by interacting with an environment and receiving feedback in the form of rewards or penalties. Unlike supervised learning, where a model is trained on labeled examples, an RL agent must discover good behavior through trial and error, balancing the need to gather new information (exploration) with the need to use what it already knows (exploitation). RL has produced many landmark results in artificial intelligence, including DeepMind's Atari-playing networks, AlphaGo, AlphaZero, MuZero, OpenAI Five, and the RL-based post-training of modern large language models: reinforcement learning from human feedback (RLHF) in systems such as InstructGPT, ChatGPT, and Claude, and RL with verifiable rewards in reasoning models such as DeepSeek-R1.
This page provides a comprehensive overview of reinforcement learning concepts, methods, and applications, and serves as a glossary index linking to detailed pages on individual RL terms.
core concepts
The standard reinforcement learning loop is built around an interaction between an agent and an environment. At each discrete time step, the agent observes a state, chooses an action according to its policy, and the environment responds with a new state and a scalar reward. The agent's goal is to maximize the expected cumulative reward, often called the return, over time.
| Concept | Symbol | Description |
|---|---|---|
| Agent | | The learner or decision maker that chooses actions. |
| Environment | | Everything outside the agent that responds to actions and produces states and rewards. |
| State | s | A representation of the current situation that the agent observes. |
| Action | a | A choice the agent makes at a given state. |
| Reward | r | A scalar signal indicating how good the most recent transition was. |
| Policy | π(a\|s) | A mapping from states to actions, possibly stochastic. |
| Return | G | The total discounted future reward from a given time step. |
| Value function | V(s) | Expected return starting from state s under a policy. |
| Action-value function | Q(s, a) | Expected return after taking action a in state s and then following the policy. |
| Discount factor | γ | A number in [0, 1] that reduces the weight of distant rewards. |
| Trajectory | τ | A sequence of states, actions, and rewards. |
| Episode | | A complete trajectory from an initial state to a terminal state. |
| Termination condition | | A rule that ends an episode, for example reaching a goal or running out of time. |
A policy can be deterministic, choosing a single action per state, or stochastic, defining a probability distribution over actions. The optimal policy, usually written π*, is one that achieves the highest possible expected return from every state.
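As a small worked example of the return defined in the table above, the sketch below computes the discounted return for every step of one episode. It assumes the common convention G_t = r_t + γ r_{t+1} + γ² r_{t+2} + ..., where r_t is the reward received after the action at step t. The function name `discounted_returns` and the three-step episode are illustrative and not taken from any particular library.

```python
def discounted_returns(rewards, gamma):
    """Return G_t for every time step t of one episode.

    Uses the backward recursion G_t = r_t + gamma * G_{t+1},
    with the return after the final step taken to be zero.
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Hypothetical three-step episode with a single reward at the end.
print(discounted_returns([0.0, 0.0, 1.0], gamma=0.5))  # [0.25, 0.5, 1.0]
```

With γ = 0.5 the final reward of 1 is worth 0.5 one step earlier and 0.25 two steps earlier, which is how the discount factor reduces the weight of distant rewards.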
Markov decision processes
Most RL problems are modeled as a Markov decision process (MDP), defined by a tuple (S, A, P, R, γ) where S is the set of states, A is the set of actions, P(s'|s, a) is the transition probability, R(s, a) is the reward function, and γ is the discount factor. The defining feature is the Markov property: the next state depends only on the current state and action, not on the history of how the agent arrived there. When the agent cannot directly observe the full state, the problem is a partially observable MDP (POMDP), which often requires memory-based policies built from recurrent neural networks or transformers.
The Bellman equation expresses the value of a state as the expected immediate reward plus the discounted value of the next state. For the optimal action-value function, the Bellman optimality equation is:
Q*(s, a) = E[r + γ max_{a'} Q*(s', a')]
Most RL algorithms can be viewed as approximate ways of solving this equation. Classical methods such as dynamic programming, value iteration, and policy iteration require a known model of the environment and are described in Sutton and Barto's textbook Reinforcement Learning: An Introduction.
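Value iteration can be sketched in a few lines when the model is known. The example below repeatedly applies the Bellman optimality backup to a state-value table until the values stop changing. The two-state MDP, the dictionary layout `P[s][a] = [(probability, next_state, reward), ...]`, and the tolerance are assumptions made for illustration. The sketch also assumes every state has at least one action, so it does not handle terminal states.

```python
def value_iteration(P, gamma, tol=1e-8):
    """Compute optimal state values for a small, fully known MDP.

    P[s][a] is a list of (probability, next_state, reward) tuples.
    """
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # Bellman optimality backup: best expected one-step return.
            best = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in P[s]
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V


# Hypothetical continuing MDP with two states and deterministic transitions.
toy_mdp = {
    "A": {"stay": [(1.0, "A", 0.0)], "go": [(1.0, "B", 1.0)]},
    "B": {"stay": [(1.0, "B", 2.0)], "back": [(1.0, "A", 0.0)]},
}
values = value_iteration(toy_mdp, gamma=0.5)
print({s: round(v, 3) for s, v in values.items()})
```

For this toy problem the fixed point can be checked by hand: staying in B forever is worth 2 / (1 - 0.5) = 4, and moving from A to B is worth 1 + 0.5 × 4 = 3.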
tabular methods
When the state and action spaces are small, RL can be solved with tabular methods that store one value per state or state-action pair.
- Q-learning, introduced by Christopher Watkins in his 1989 PhD thesis, is an off-policy temporal difference algorithm. The agent updates Q(s, a) toward r + γ max_{a'} Q(s', a'). Tabular Q-learning converges to the optimal action-value function Q*, and hence to an optimal greedy policy, provided every state-action pair is visited infinitely often and the learning rate decays appropriately.
- SARSA (state, action, reward, state, action), described by Rummery and Niranjan in 1994, is an on-policy variant that updates toward r + γ Q(s', a') using the action actually taken under the current policy.
- Monte Carlo methods estimate value functions by averaging returns from complete episodes.
- Dyna-Q, proposed by Richard Sutton in 1990, blends real experience with simulated experience from a learned model, which is one of the earliest examples of model-based RL.
These algorithms typically use an epsilon-greedy policy for exploration: with probability ε the agent picks an action uniformly at random, and otherwise it picks the action with the highest estimated value. A random policy selects actions uniformly at random and is often used as a baseline.
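A minimal sketch of tabular Q-learning with epsilon-greedy exploration follows. The environment is a hypothetical five-state chain in which the agent starts at the left end and receives a reward of 1 on reaching the right end. The function names, the constant learning rate, and the hyperparameter values are illustrative assumptions, not a reference implementation.

```python
import random

N_STATES = 5      # states 0..4; state 4 is terminal (hypothetical toy chain)
ACTIONS = (0, 1)  # 0 = move left, 1 = move right


def step(state, action):
    """Deterministic chain dynamics: reward 1 only when the goal is reached."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def epsilon_greedy(Q, state, epsilon, rng):
    """Random action with probability epsilon, otherwise a greedy one."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(Q[state])
    # Break ties at random so the untrained agent does not favor one action.
    return rng.choice([a for a in ACTIONS if Q[state][a] == best])


def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = epsilon_greedy(Q, state, epsilon, rng)
            next_state, reward, done = step(state, action)
            # Off-policy target: bootstrap from the greedy value of s'.
            target = reward + (0.0 if done else gamma * max(Q[next_state]))
            Q[state][action] += alpha * (target - Q[state][action])
            state = next_state
    return Q


Q = q_learning()
greedy_actions = [max(ACTIONS, key=lambda a: Q[s][a]) for s in range(N_STATES - 1)]
print(greedy_actions)  # greedy action per non-terminal state
```

After training, the greedy action in every non-terminal state is to move right, and Q[3][1] approaches 1, the reward for stepping into the goal.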
value-based deep reinforcement learning
For large or continuous state spaces, tabular storage is infeasible and value functions must be approximated, usually with neural networks. The combination of deep learning with RL is known as deep reinforcement learning.
- Deep Q-Network (DQN), introduced by Mnih and colleagues at DeepMind in the 2013 arXiv paper Playing Atari with Deep Reinforcement Learning and the 2015 Nature paper Human-level control through deep reinforcement learning, parameterizes the Q-function with a convolutional neural network. In the Nature paper, DQN was evaluated on 49 Atari 2600 games with the same architecture and hyperparameters for every game, and reached performance comparable to a professional human games tester on a majority of them.
- Two key stabilization tricks made DQN work. The replay buffer, also called experience replay, stores past transitions and samples minibatches uniformly to break the correlations between consecutive samples. A separate target network copies the online weights periodically and provides stable bootstrap targets.
- Double DQN (van Hasselt and colleagues, 2016) decouples action selection from action evaluation to reduce the systematic overestimation bias of standard Q-learning.
- Dueling DQN (Wang and colleagues, 2016) splits the network into a state-value stream and an advantage stream, then recombines them, which improves learning when many actions yield similar values.
- Prioritized experience replay (Schaul and colleagues, 2016) samples transitions with high temporal difference error more often.
- Rainbow DQN (Hessel and colleagues, 2018) combines six DQN improvements, namely double Q-learning, prioritized replay, dueling networks, multi-step targets, distributional RL, and noisy networks, to set new benchmark scores on Atari.
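The replay buffer and the difference between the DQN and Double DQN targets can be illustrated without a neural network. In the sketch below, plain lists of Q-values stand in for the outputs of the online and target networks at the next state. The class and function names are hypothetical, and a real implementation would compute these targets on batched tensors.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of transitions with uniform sampling."""

    def __init__(self, capacity, seed=0):
        self.buffer = deque(maxlen=capacity)  # oldest transitions drop out first
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling breaks the correlation between consecutive steps.
        return self.rng.sample(list(self.buffer), batch_size)


def dqn_target(reward, done, q_target_next, gamma):
    """Standard DQN: the target network both selects and evaluates a'."""
    return reward + (0.0 if done else gamma * max(q_target_next))


def double_dqn_target(reward, done, q_online_next, q_target_next, gamma):
    """Double DQN: the online network selects a', the target network evaluates it."""
    if done:
        return reward
    best_action = max(range(len(q_online_next)), key=q_online_next.__getitem__)
    return reward + gamma * q_target_next[best_action]


# Made-up Q-values for one next state with two actions.
q_online_next = [1.0, 2.0]
q_target_next = [3.0, 0.5]
print(dqn_target(1.0, False, q_target_next, gamma=0.9))
print(double_dqn_target(1.0, False, q_online_next, q_target_next, gamma=0.9))
```

With these numbers the standard target bootstraps from the largest target-network value (3.0), while the Double DQN target uses the target network's value for the action the online network prefers (0.5), which gives a smaller and less optimistic target.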
policy gradient methods
Policy gradient methods directly parameterize the policy π_θ(a|s) and update θ to increase expected return using the policy gradient theorem (Sutton, McAllester, Singh, and Mansour, 2000).
- REINFORCE, introduced by Ronald Williams in 1992, computes a Monte Carlo estimate of the policy gradient using complete episode returns.
- Actor-critic methods combine a policy network (the actor) with a value network (the critic) that estimates baselines, reducing variance.
- Asynchronous Advantage Actor-Critic (A3C), introduced by Mnih and colleagues in 2016, runs many parallel actors to decorrelate experience without a replay buffer. Advantage Actor-Critic (A2C) is its synchronous variant, in which the parallel actors' updates are applied together in a batch.
- Trust Region Policy Optimization (TRPO), proposed by Schulman and colleagues in 2015, constrains each policy update by a KL-divergence trust region for monotonic improvement.
- Proximal Policy Optimization (PPO), introduced by Schulman and colleagues in 2017, replaces TRPO's hard constraint with a clipped surrogate objective. PPO is widely used because it is simple to implement, comparatively stable, and works well on many tasks, although as an on-policy method it is generally less sample efficient than off-policy algorithms. It became the default RL backbone for OpenAI Five and for the RLHF stage of InstructGPT and ChatGPT.
- Deep Deterministic Policy Gradient (DDPG), proposed by Lillicrap and colleagues in 2016, extends actor-critic to continuous action spaces using off-policy data.
- Twin Delayed DDPG (TD3), introduced by Fujimoto, van Hoof, and Meger in 2018, fixes the overestimation bias of DDPG with twin critics, delayed policy updates, and target policy smoothing.
- Soft Actor-Critic (SAC), introduced by Haarnoja and colleagues in 2018, adds an entropy bonus to the objective so that the policy is as random as possible while still maximizing return. SAC is a leading off-policy method for continuous control.
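The clipped surrogate objective used by PPO can be sketched for a single sample as follows. The function takes the log-probability of the sampled action under the new and old policies and an advantage estimate. The name `ppo_clip_objective` and the clipping range of 0.2 are illustrative assumptions, and a full implementation would average this quantity over a batch and add value and entropy terms.

```python
import math


def ppo_clip_objective(logp_new, logp_old, advantage, clip_eps=0.2):
    """Per-sample PPO objective: min(ratio * A, clip(ratio, 1-eps, 1+eps) * A)."""
    ratio = math.exp(logp_new - logp_old)  # pi_new(a|s) / pi_old(a|s)
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    # Taking the minimum makes the objective a pessimistic bound, so the
    # policy gains nothing from pushing the ratio far outside the clip range.
    return min(ratio * advantage, clipped_ratio * advantage)


# Three hypothetical samples: (probability ratio, advantage).
for ratio, advantage in [(1.5, 1.0), (0.5, -1.0), (1.5, -1.0)]:
    value = ppo_clip_objective(math.log(ratio), 0.0, advantage)
    print(ratio, advantage, round(value, 3))
```

In the first two cases the objective is clipped (to 1.2 and -0.8), which removes the incentive to move the ratio further. In the third case the unclipped value of -1.5 is kept, because the update would make a bad action more likely and should be penalized in full.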
value-based, policy-based, and actor-critic
RL algorithms are often grouped into three families.
| Family | Learns | Typical algorithms | Strengths |
|---|---|---|---|
| Value based | Q(s, a) | Q-learning, SARSA, DQN, Rainbow | Sample efficient, easy to use with discrete actions |
| Policy based | π_θ(a\|s) | REINFORCE, TRPO, PPO | Handles continuous and stochastic actions, smooth policy improvement |
| Actor-critic | both | A2C, A3C, DDPG, TD3, SAC | Combines variance reduction of values with flexibility of policies |
Value-based methods are usually off-policy, which means they can learn from data collected by a different policy, while pure policy gradient methods are on-policy. Off-policy actor-critic methods such as DDPG, TD3, and SAC try to combine the best of both worlds.
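The on-policy versus off-policy distinction shows up directly in the bootstrap target. The sketch below contrasts the Q-learning target, which uses the best next action regardless of what the agent actually does, with the SARSA target, which uses the next action chosen by the current policy. The function names and the numbers are illustrative.

```python
def q_learning_target(reward, gamma, q_next):
    """Off-policy: bootstrap from the greedy action in the next state."""
    return reward + gamma * max(q_next)


def sarsa_target(reward, gamma, q_next, next_action):
    """On-policy: bootstrap from the action the behavior policy actually took."""
    return reward + gamma * q_next[next_action]


# Made-up next-state values; suppose exploration picked action 0.
q_next = [1.0, 3.0]
print(q_learning_target(0.0, 0.5, q_next))             # 1.5
print(sarsa_target(0.0, 0.5, q_next, next_action=0))   # 0.5
```

Because the Q-learning target does not depend on the action taken next, the same transition can be reused after the policy has changed, which is what makes replay buffers compatible with off-policy methods.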
model-based reinforcement learning
Model-based RL learns or uses a model of the environment to plan or to generate synthetic experience. This often improves sample efficiency at the cost of additional complexity.
- AlphaZero (Silver and colleagues, 2017) combines Monte Carlo tree search with a deep network that predicts moves and values, and learns purely from self play. It mastered Go, chess, and shogi from scratch.
- MuZero (Schrittwieser and colleagues, 2020) extends AlphaZero by learning the dynamics of the environment in a latent space, so it does not need a known set of rules. It matches or surpasses AlphaZero on board games and DQN on Atari.
- World models (Ha and Schmidhuber, 2018) train a generative model of pixels and learn policies inside the imagined environment.
- Dreamer, DreamerV2, and DreamerV3 (Hafner and colleagues, 2020 to 2023) learn a recurrent latent world model and train an actor-critic by backpropagating through imagined trajectories. DreamerV3 is notable for solving a wide range of tasks with one set of hyperparameters and was the first method to collect diamonds in Minecraft from scratch, without human data or curricula.
- Other notable model-based methods include PETS, PlaNet, and SimPLe.
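The idea of generating synthetic experience from a learned model can be illustrated with a Dyna-style update in the tabular setting. In the sketch below, each real transition is used once for a Q-learning backup, stored in a deterministic model, and then followed by a few planning backups on transitions replayed from that model. The function name, the dictionary-based model, and the hyperparameters are assumptions for illustration, and terminal states are not handled.

```python
import random


def dyna_q_update(Q, model, transition, actions,
                  alpha=0.5, gamma=0.9, planning_steps=5, rng=None):
    """One Dyna-style step: real backup, model update, then planning backups."""
    rng = rng or random.Random(0)

    def backup(s, a, r, s2):
        best_next = max(Q.get((s2, b), 0.0) for b in actions)
        old = Q.get((s, a), 0.0)
        Q[(s, a)] = old + alpha * (r + gamma * best_next - old)

    s, a, r, s2 = transition
    backup(s, a, r, s2)       # learn from the real transition
    model[(s, a)] = (r, s2)   # remember it in a deterministic learned model
    for _ in range(planning_steps):
        # Replay a previously seen state-action pair from the model.
        (ps, pa), (pr, ps2) = rng.choice(list(model.items()))
        backup(ps, pa, pr, ps2)


Q, model = {}, {}
dyna_q_update(Q, model, ("s0", "right", 1.0, "s1"),
              actions=("left", "right"), planning_steps=3)
print(Q[("s0", "right")])  # 0.9375 after one real and three simulated backups
```

A single real transition moves the estimate to 0.5. The three planning backups move it to 0.9375 without any further interaction with the environment, which is the sample-efficiency benefit described above.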
exploration
Exploration is the problem of trying actions whose value is uncertain in order to discover better strategies. Naive random exploration scales poorly in large or sparse reward problems.
- Epsilon greedy and Boltzmann (softmax) exploration are simple and widely used.
- Upper Confidence Bound (UCB) methods, formalized by Auer, Cesa-Bianchi, and Fischer in 2002, choose the action with the highest optimistic upper bound on its value.
- Thompson sampling samples a model from a posterior over environments and acts greedily with respect to it. The idea goes back to William Thompson's 1933 paper.
- Intrinsic motivation rewards the agent for visiting novel states. Examples include count based bonuses, Random Network Distillation (Burda and colleagues, 2018), and curiosity driven exploration via prediction error (Pathak and colleagues, 2017).
- Maximum entropy methods such as SAC encourage exploration by rewarding randomness in the policy.
- Go-Explore (Ecoffet and colleagues, 2021) explicitly remembers promising states and returns to them before exploring further.
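As a concrete instance of optimism under uncertainty, the sketch below runs the UCB1 rule on a hypothetical two-armed Bernoulli bandit. Each arm's score is its empirical mean plus a bonus of sqrt(2 ln t / n) that shrinks as the arm is pulled more often. The arm probabilities, the step count, and the function name are illustrative assumptions.

```python
import math
import random


def ucb1(arm_means, steps=2000, seed=0):
    """Run UCB1 on a Bernoulli bandit and return how often each arm was pulled."""
    rng = random.Random(seed)
    n_arms = len(arm_means)
    counts = [0] * n_arms
    totals = [0.0] * n_arms

    def upper_bound(arm, t):
        mean = totals[arm] / counts[arm]
        return mean + math.sqrt(2.0 * math.log(t) / counts[arm])

    for t in range(1, steps + 1):
        if t <= n_arms:
            arm = t - 1  # pull every arm once so the bounds are defined
        else:
            arm = max(range(n_arms), key=lambda a: upper_bound(a, t))
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        totals[arm] += reward
    return counts


counts = ucb1([0.2, 0.8])
print(counts)  # pulls per arm
```

The worse arm keeps being tried occasionally because its bonus grows slowly with t, but most pulls go to the arm with the higher success probability.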
multi-agent reinforcement learning
Multi-agent RL studies environments where two or more agents interact, possibly cooperating, competing, or both. Classic ideas come from game theory, including Nash equilibria and self play. Notable systems include OpenAI Five for Dota 2, AlphaStar for StarCraft II (Vinyals and colleagues, Nature 2019), and CICERO for the language game Diplomacy (Meta AI, 2022). Algorithms include independent Q-learning, MADDPG (Lowe and colleagues, 2017), QMIX (Rashid and colleagues, 2018), and population based training.
reinforcement learning from human feedback
Reinforcement learning from human feedback (RLHF) trains a model using a learned reward model fitted to human preferences. The standard recipe was popularized by Christiano and colleagues in the 2017 paper Deep reinforcement learning from human preferences. It uses three steps:
- Collect pairs of model outputs and ask humans which one they prefer.
- Train a reward model to predict these preferences.
- Fine-tune the policy with an RL algorithm, usually PPO, to maximize the reward model under a KL penalty against a reference policy.
RLHF is the central alignment step in InstructGPT (Ouyang and colleagues, 2022) and ChatGPT, and similar techniques underlie Claude, Gemini, and many open source instruction tuned models. Variants and successors include Direct Preference Optimization (DPO) by Rafailov and colleagues in 2023, which removes the explicit reward model, reinforcement learning from AI feedback (RLAIF), Constitutional AI (Bai and colleagues, Anthropic 2022), and Identity Preference Optimization (IPO).
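Two pieces of the RLHF recipe can be written down compactly. The sketch below assumes the reward model is trained with a Bradley-Terry style pairwise loss, -log σ(r_chosen - r_rejected), which is a common choice but only one possible way to fit preferences. It also assumes the KL penalty is applied per sample as β times the log-probability ratio between the policy and the reference policy. The function names and the value of β are hypothetical.

```python
import math


def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss -log(sigmoid(r_chosen - r_rejected)), computed stably."""
    margin = reward_chosen - reward_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def kl_penalized_reward(rm_score, logp_policy, logp_reference, beta=0.1):
    """Reward-model score minus a per-sample KL penalty against the reference."""
    return rm_score - beta * (logp_policy - logp_reference)


# The loss is small when the reward model ranks the preferred output higher.
print(round(preference_loss(2.0, 0.0), 4))
print(round(preference_loss(0.0, 2.0), 4))
# A response the policy finds more likely than the reference does is penalized.
print(kl_penalized_reward(1.0, logp_policy=-1.0, logp_reference=-2.0))
```

The penalty term discourages the fine-tuned policy from drifting far from the reference model in pursuit of reward-model score.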
reinforcement learning for reasoning models
A new wave of large language model training uses RL with verifiable rewards (often grading code or math answers automatically) to elicit long chain of thought reasoning.
- OpenAI o1 (2024) was the first widely deployed reasoning model trained with large scale RL on chains of thought.
- DeepSeek-R1 (DeepSeek AI, January 2025) was trained with Group Relative Policy Optimization (GRPO), a critic-free RL algorithm that replaces PPO's learned value function with a group baseline computed from multiple sampled responses to the same prompt. GRPO was originally described in the DeepSeekMath paper (Shao and colleagues, 2024), and DeepSeek-R1 brought it to wide attention.
- DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization), released by ByteDance Seed and Tsinghua in 2025, builds on GRPO with decoupled clipping ranges, dynamic sampling, and a token-level loss to stabilize long chain-of-thought training.
- Tülu 3 (Allen Institute for AI, 2024) is a fully open post-training recipe that combines supervised fine-tuning, DPO, and RL with verifiable rewards (RLVR) to reach state-of-the-art results among open-weight models.
- Other recent methods include ReMax, RLOO, and various length normalized policy optimization variants.
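The group baseline used by GRPO can be sketched as a normalization of rewards within the set of responses sampled for one prompt. The example below subtracts the group mean and divides by the group standard deviation. The use of the population standard deviation, the small `eps` constant, and the function name are assumptions made here, and implementations differ in these details.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Advantage of each response relative to others sampled for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical verifiable rewards: 1 if the sampled answer was correct, else 0.
print([round(a, 3) for a in group_relative_advantages([1.0, 0.0, 0.0, 1.0])])
```

Correct responses receive a positive advantage and incorrect ones a negative advantage, with no learned value function involved. If every response in a group gets the same reward, all advantages are zero and the group contributes no gradient signal.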
frameworks and benchmarks
Researchers and engineers usually rely on standard libraries and benchmark suites.
| Project | Maintainer | Description |
|---|---|---|
| OpenAI Gym | OpenAI (now Farama Foundation as Gymnasium) | Standard environment API including Atari, classic control, and MuJoCo tasks. |
| Gymnasium | Farama Foundation | Maintained fork of Gym used by most current research code. |
| DeepMind Control Suite | DeepMind | Continuous control benchmarks built on MuJoCo. |
| MuJoCo | DeepMind (open source) | Physics simulator widely used for continuous control. |
| Stable Baselines3 | DLR-RM | PyTorch implementations of PPO, SAC, TD3, DQN, and others. |
| RLlib | Anyscale (Ray) | Scalable distributed RL library. |
| Dopamine | Google | Research framework focused on Atari and reproducibility. |
| CleanRL | Costa Huang and contributors | Single file implementations of RL algorithms for clarity. |
| Tianshou | Tsinghua TSAIL | PyTorch RL library with broad algorithm coverage. |
| Acme | DeepMind | Distributed agents library. |
| TRL | Hugging Face | RLHF, DPO, GRPO, and PPO for transformer language models. |
| verl | ByteDance Seed | Volcano Engine reinforcement learning library used for large scale LLM RL. |
| Procgen, MiniGrid, NetHack, Crafter | various | Generalization benchmarks. |
notable milestones
| Year | Milestone |
|---|---|
| 1959 | Arthur Samuel's checkers program uses temporal difference style learning, an early precursor to modern RL. |
| 1989 | Christopher Watkins introduces Q-learning in his Cambridge PhD thesis. |
| 1992 | Gerald Tesauro's TD-Gammon learns to play backgammon at a world class level using temporal difference learning with a neural network. |
| 1998 | First edition of Reinforcement Learning: An Introduction by Sutton and Barto. |
| 2013 | Mnih and colleagues at DeepMind release the original DQN paper on arXiv, learning Atari games from raw pixels. |
| 2015 | DQN paper appears in Nature, reporting performance comparable to a professional human tester across a suite of 49 Atari games. |
| 2016 | DeepMind's AlphaGo defeats Lee Sedol four games to one. |
| 2017 | Christiano and colleagues publish Deep RL from human preferences, laying the foundation for RLHF. |
| 2017 | Schulman and colleagues introduce PPO. |
| 2017 | AlphaZero masters Go, chess, and shogi entirely through self play. |
| 2018 | OpenAI Five beats a team of former professionals and casters in restricted Dota 2, and in 2019 goes on to defeat the reigning world champions, OG. |
| 2019 | DeepMind's AlphaStar reaches grandmaster level in StarCraft II. |
| 2019 | OpenAI's robot hand solves a Rubik's cube using PPO with domain randomization. |
| 2020 | MuZero matches AlphaZero without knowing the rules. |
| 2022 | OpenAI releases InstructGPT and then ChatGPT, both trained with PPO based RLHF. |
| 2023 | Hafner and colleagues publish DreamerV3, a world-model agent that handles diverse domains with a single set of hyperparameters. |
| 2024 | OpenAI launches the o1 reasoning model trained with large scale RL on chains of thought. |
| 2025 | DeepSeek releases DeepSeek-R1, trained with GRPO, which kicks off widespread adoption of RL with verifiable rewards in open source LLM training. |
limitations and challenges
Reinforcement learning is powerful but notoriously difficult to use in practice. Common challenges include sample inefficiency (deep RL often needs millions of frames), unstable training due to bootstrapping with function approximation and off-policy data, sensitivity to hyperparameters, reward hacking where the agent finds unintended ways to maximize the reward, credit assignment over long horizons, and a sim-to-real gap that limits transfer from simulation to physical robots. Safety, interpretability, and alignment with human intent are active research areas, particularly for RL fine tuned language models.
RL connects to many other fields. Imitation learning and behavior cloning train a policy directly from expert demonstrations. Inverse reinforcement learning recovers a reward function from observed behavior. Offline reinforcement learning, also known as batch RL, learns from a fixed dataset without further interaction. Meta reinforcement learning learns algorithms that adapt quickly to new tasks. Hierarchical reinforcement learning decomposes long horizon problems into reusable sub policies, often using the options framework of Sutton, Precup, and Singh.
index of reinforcement learning terms
references
- Sutton, R. S., and Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press.
- Watkins, C. J. C. H. (1989). Learning from delayed rewards. PhD thesis, Cambridge University.
- Williams, R. J. (1992). Simple statistical gradient following algorithms for connectionist reinforcement learning. Machine Learning, 8.
- Tesauro, G. (1995). Temporal difference learning and TD-Gammon. Communications of the ACM, 38(3).
- Mnih, V., et al. (2013). Playing Atari with Deep Reinforcement Learning. arXiv:1312.5602.
- Mnih, V., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518.
- van Hasselt, H., Guez, A., and Silver, D. (2016). Deep reinforcement learning with double Q-learning. AAAI.
- Wang, Z., et al. (2016). Dueling network architectures for deep reinforcement learning. ICML.
- Schaul, T., et al. (2016). Prioritized experience replay. ICLR.
- Hessel, M., et al. (2018). Rainbow: Combining improvements in deep reinforcement learning. AAAI.
- Schulman, J., et al. (2015). Trust Region Policy Optimization. ICML.
- Schulman, J., et al. (2017). Proximal Policy Optimization Algorithms. arXiv:1707.06347.
- Lillicrap, T. P., et al. (2016). Continuous control with deep reinforcement learning. ICLR.
- Fujimoto, S., van Hoof, H., and Meger, D. (2018). Addressing function approximation error in actor-critic methods. ICML.
- Haarnoja, T., et al. (2018). Soft actor-critic. ICML.
- Silver, D., et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529.
- Silver, D., et al. (2017). Mastering the game of Go without human knowledge. Nature, 550.
- Silver, D., et al. (2018). A general reinforcement learning algorithm that masters chess, shogi, and Go through self play. Science, 362.
- Schrittwieser, J., et al. (2020). Mastering Atari, Go, chess, and shogi by planning with a learned model. Nature, 588.
- Vinyals, O., et al. (2019). Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575.
- Hafner, D., et al. (2023). Mastering diverse domains through world models (DreamerV3). arXiv:2301.04104.
- Christiano, P., et al. (2017). Deep reinforcement learning from human preferences. NeurIPS.
- Ouyang, L., et al. (2022). Training language models to follow instructions with human feedback (InstructGPT). NeurIPS.
- Bai, Y., et al. (2022). Constitutional AI: harmlessness from AI feedback. arXiv:2212.08073.
- Rafailov, R., et al. (2023). Direct Preference Optimization: your language model is secretly a reward model. NeurIPS.
- Shao, Z., et al. (2024). DeepSeekMath: pushing the limits of mathematical reasoning in open language models. arXiv:2402.03300.
- DeepSeek-AI (2025). DeepSeek-R1: incentivizing reasoning capability in LLMs via reinforcement learning. arXiv:2501.12948.
- Yu, Q., et al. (2025). DAPO: an open source LLM reinforcement learning system at scale. arXiv:2503.14476.
- Lambert, N., et al. (2024). Tülu 3: pushing frontiers in open language model post training. arXiv:2411.15124.