Machine learning terms/Reinforcement Learning
Last reviewed
Sources
30 citations
Review status
Source-backed
Revision
v3 · 3,774 words
See also: Machine learning terms
Reinforcement learning (RL) is a branch of machine learning in which an agent learns to make sequential decisions by interacting with an environment and receiving feedback in the form of rewards or penalties. In the standard reference text, Richard Sutton and Andrew Barto define it as "learning what to do, how to map situations to actions, so as to maximize a numerical reward signal" [1]. Unlike supervised learning, where a model is trained on labeled examples, an RL agent must discover good behavior through trial and error, balancing the need to gather new information (exploration) with the need to use what it already knows (exploitation). This page is a glossary index of the key machine-learning terms used in reinforcement learning: the core elements (agent, environment, state, action, reward, policy, value function, return, discount factor), the Markov decision process framework, the main algorithm families (value-based, policy-gradient, actor-critic, and model-based), the exploration-exploitation tradeoff, and modern uses such as reinforcement learning from human feedback (RLHF).
RL has produced many landmark results in artificial intelligence, including DeepMind's Atari-playing networks, AlphaGo, AlphaZero, MuZero, and OpenAI Five. It also underlies the post-training of modern large language models: RLHF for systems such as InstructGPT, ChatGPT, and Claude, and RL with verifiable rewards for reasoning models such as DeepSeek-R1. This page also serves as a glossary index linking to detailed pages on individual RL terms (see the index of reinforcement learning terms below).
The standard reinforcement learning loop is built around an interaction between an agent and an environment. At each discrete time step, the agent observes a state, chooses an action according to its policy, and the environment responds with a new state and a scalar reward. The agent's goal is to maximize the expected cumulative reward, often called the return, over time. Sutton and Barto identify trial-and-error search and delayed reward as the two most important distinguishing features of reinforcement learning [1].
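The interaction loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `CorridorEnv` class, its reward values, and the `run_episode` helper are hypothetical names invented for this example, and the `reset`/`step` interface only loosely mirrors the style of common RL libraries.

```python
import random

class CorridorEnv:
    """Hypothetical toy environment: a corridor of cells with the goal at the right end."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        # Start every episode in the leftmost cell and return the initial state.
        self.position = 0
        return self.position

    def step(self, action):
        # action 0 moves left, action 1 moves right; walls clamp the position.
        move = 1 if action == 1 else -1
        self.position = min(max(self.position + move, 0), self.length - 1)
        done = self.position == self.length - 1
        reward = 1.0 if done else -0.1  # assumed reward scheme for illustration
        return self.position, reward, done

def random_policy(state, rng):
    # Ignores the state and picks an action uniformly at random.
    return rng.choice([0, 1])

def run_episode(env, policy, rng, max_steps=100):
    """Run one episode and return (undiscounted return, number of steps taken)."""
    state = env.reset()
    total_reward = 0.0
    steps = 0
    for _ in range(max_steps):
        action = policy(state, rng)             # agent chooses an action
        state, reward, done = env.step(action)  # environment responds
        total_reward += reward
        steps += 1
        if done:
            break
    return total_reward, steps

rng = random.Random(0)
total, steps = run_episode(CorridorEnv(), random_policy, rng)
print(f"undiscounted return {total:.1f} after {steps} steps")
```

Swapping `random_policy` for a learned policy is the only change needed to evaluate a trained agent in the same loop.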
| Concept | Symbol | Description |
|---|---|---|
| Agent | | The learner or decision maker that chooses actions. |
| Environment | | Everything outside the agent that responds to actions and produces states and rewards. |
| State | s | A representation of the current situation that the agent observes. |
| Action | a | A choice the agent makes at a given state. |
| Reward | r | A scalar signal indicating how good the most recent transition was. |
| Policy | π(a\|s) | A mapping from states to actions, possibly stochastic. |
| Return | G | The total discounted future reward from a given time step. |
| Value function | V(s) | Expected return starting from state s under a policy. |
| Action-value function | Q(s, a) | Expected return after taking action a in state s and then following the policy. |
| Discount factor | γ | A number in [0, 1] that reduces the weight of distant rewards. |
| Trajectory | τ | A sequence of states, actions, and rewards. |
| Episode | | A complete trajectory from an initial state to a terminal state. |
| Termination condition | | A rule that ends an episode, for example reaching a goal or running out of time. |
A policy can be deterministic, choosing a single action per state, or stochastic, defining a probability distribution over actions. The optimal policy, usually written π*, is one that achieves the highest possible expected return from every state.
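The return G and discount factor γ from the table above can be made concrete with a short calculation. The sketch below assumes the usual definition of the discounted return, G = r_1 + γ r_2 + γ² r_3 + ..., and computes it backwards over a finished episode. The function name and the example rewards are illustrative only.

```python
def discounted_return(rewards, gamma):
    """Discounted sum of a reward sequence, accumulated from the last step backwards."""
    g = 0.0
    for r in reversed(rewards):
        # Each step's return is its reward plus the discounted return of what follows.
        g = r + gamma * g
    return g

rewards = [1.0, 0.0, 2.0]
print(discounted_return(rewards, gamma=0.5))  # 1.0 + 0.5*0.0 + 0.25*2.0 = 1.5
```

With γ = 0 only the immediate reward counts, and with γ = 1 the return is the plain sum of rewards.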
Most RL problems are modeled as a Markov decision process (MDP), defined by a tuple (S, A, P, R, γ) where S is the set of states, A is the set of actions, P(s'|s, a) is the transition probability, R(s, a) is the reward function, and γ is the discount factor [1]. The defining feature is the Markov property: the next state depends only on the current state and action, not on the history of how the agent arrived there. When the agent cannot directly observe the full state, the problem is a partially observable MDP (POMDP), which often requires memory-based policies built from recurrent neural networks or transformers.
The Bellman equation expresses the value of a state as the expected immediate reward plus the discounted value of the next state. For the optimal action-value function, the Bellman optimality equation is:
Q*(s, a) = E[r + γ max_{a'} Q*(s', a')]
Most RL algorithms can be viewed as approximate ways of solving this equation. Classical methods such as dynamic programming, value iteration, and policy iteration require a known model of the environment and are described in Sutton and Barto's textbook Reinforcement Learning: An Introduction [1].
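As an illustration of solving the Bellman optimality equation when the model is known, the following sketch runs value iteration on a tiny, made-up MDP. It works with the state-value form V*(s) = max_a Q*(s, a) rather than Q* directly. The three-state chain, the action names, and the dictionary layout of `P` are assumptions chosen for brevity.

```python
# Hypothetical 3-state chain MDP.
# P[state][action] is a list of (probability, next_state, reward) outcomes.
P = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(1.0, 1, 0.0)]},
    1: {"stay": [(1.0, 1, 0.0)], "go": [(1.0, 2, 1.0)]},
    2: {"stay": [(1.0, 2, 0.0)], "go": [(1.0, 2, 0.0)]},  # absorbing state
}

def backup(V, outcomes, gamma):
    # Expected immediate reward plus discounted value of the next state.
    return sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)

def value_iteration(P, gamma=0.9, tol=1e-8):
    """Repeat Bellman optimality backups until the values stop changing."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            best = max(backup(V, outcomes, gamma) for outcomes in P[s].values())
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V

def greedy_policy(P, V, gamma=0.9):
    # Pick, in every state, the action with the highest one-step backup.
    return {s: max(P[s], key=lambda a: backup(V, P[s][a], gamma)) for s in P}

V = value_iteration(P)
policy = greedy_policy(P, V)
print(V, policy)
```

In this example the only reward is earned by moving from state 1 to state 2, so the optimal values are 1.0 for state 1 and γ × 1.0 = 0.9 for state 0.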
When the state and action spaces are small, RL can be solved with tabular methods that store one value per state or state-action pair.
These algorithms typically use an epsilon-greedy policy for exploration: with probability ε the agent picks a random action, and otherwise it picks the greedy action, the one with the highest estimated value. A random policy selects actions uniformly at random and is often used as a baseline.
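A tabular method with epsilon-greedy exploration might look like the sketch below, which uses Q-learning as the example algorithm. The corridor dynamics in `corridor_step`, the hyperparameter values, and the random tie-breaking rule are assumptions made for illustration.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, epsilon, rng):
    """With probability epsilon explore; otherwise take a greedy action (ties broken randomly)."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    best = max(Q[(state, a)] for a in actions)
    return rng.choice([a for a in actions if Q[(state, a)] == best])

def q_learning_update(Q, s, a, r, s_next, done, actions, alpha=0.1, gamma=0.9):
    # Move Q(s, a) toward the one-step target r + gamma * max_a' Q(s', a').
    target = r if done else r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def corridor_step(state, action, length=5):
    """Hypothetical toy dynamics: action 0 moves left, 1 moves right; the last cell is the goal."""
    nxt = min(max(state + (1 if action == 1 else -1), 0), length - 1)
    done = nxt == length - 1
    return nxt, (1.0 if done else 0.0), done

rng = random.Random(0)
Q = defaultdict(float)  # one table entry per (state, action) pair
actions = [0, 1]
for episode in range(500):
    s, done = 0, False
    while not done:
        a = epsilon_greedy(Q, s, actions, epsilon=0.1, rng=rng)
        s_next, r, done = corridor_step(s, a)
        q_learning_update(Q, s, a, r, s_next, done, actions)
        s = s_next

greedy = [max(actions, key=lambda a: Q[(s, a)]) for s in range(4)]
print(greedy)  # the learned greedy action in each non-terminal cell
```

Because the table stores one number per state-action pair, this approach only works when both spaces are small enough to enumerate.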
For large or continuous state spaces, tabular storage is infeasible and value functions must be approximated, usually with neural networks. The combination of deep learning with RL is known as deep reinforcement learning.
Policy gradient methods directly parameterize the policy π_θ(a|s) and update θ to increase expected return using the policy gradient theorem (Sutton, McAllester, Singh, and Mansour, 2000).
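To make the idea concrete, here is a minimal sketch of a policy gradient update in the simplest possible setting: a multi-armed bandit with a softmax policy over arm preferences θ. It follows the REINFORCE-style rule of scaling the gradient of log π_θ(a) by the reward minus a baseline. The arm means, noise level, learning rate, and running-average baseline are all illustrative assumptions.

```python
import math
import random

def softmax(prefs):
    # Subtract the max before exponentiating for numerical stability.
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_bandit(true_means, steps=3000, lr=0.1, noise=0.5, seed=0):
    """Learn softmax action preferences with a score-function gradient estimate."""
    rng = random.Random(seed)
    theta = [0.0] * len(true_means)
    baseline = 0.0
    for t in range(1, steps + 1):
        probs = softmax(theta)
        a = rng.choices(range(len(theta)), weights=probs)[0]  # sample from the policy
        reward = rng.gauss(true_means[a], noise)
        baseline += (reward - baseline) / t                    # running average reward
        advantage = reward - baseline
        for i in range(len(theta)):
            # d/d theta_i of log softmax(theta)[a] is 1[i == a] - probs[i].
            grad_log = (1.0 if i == a else 0.0) - probs[i]
            theta[i] += lr * advantage * grad_log
    return softmax(theta)

probs = reinforce_bandit([0.0, 2.0, 0.5])
print([round(p, 3) for p in probs])
```

The baseline does not change the expected gradient but reduces its variance, which is the same motivation behind the critic in actor-critic methods.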
Model-free RL algorithms are often grouped into three families.
| Family | Learns | Typical algorithms | Strengths |
|---|---|---|---|
| Value-based | Q(s, a) | Q-learning, SARSA, DQN, Rainbow | Sample efficient, easy to use with discrete actions |
| Policy-based | π_θ(a\|s) | REINFORCE, TRPO, PPO | Handles continuous and stochastic actions, smooth policy improvement |
| Actor-critic | both | A2C, A3C, DDPG, TD3, SAC | Combines variance reduction of values with flexibility of policies |
Value-based methods are usually off-policy, which means they can learn from data collected by a different policy, while pure policy gradient methods are on-policy. Off-policy actor-critic methods such as DDPG, TD3, and SAC try to combine the best of both worlds.
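The on-policy versus off-policy distinction shows up clearly in the bootstrap target of two tabular value-based algorithms. The sketch below contrasts SARSA, which is on-policy, with Q-learning, which is off-policy. The function names and the example values are illustrative.

```python
def sarsa_target(reward, q_next, next_action, gamma):
    """On-policy target: bootstrap from the action the behavior policy actually takes next."""
    return reward + gamma * q_next[next_action]

def q_learning_target(reward, q_next, gamma):
    """Off-policy target: bootstrap from the greedy action, whatever was actually taken."""
    return reward + gamma * max(q_next.values())

# Estimated action values in the next state (made-up numbers).
q_next = {"left": 0.5, "right": 1.0}

# Suppose the exploring behavior policy happened to pick "left" in the next state.
print(sarsa_target(0.0, q_next, "left", gamma=0.5))  # 0.25
print(q_learning_target(0.0, q_next, gamma=0.5))     # 0.5
```

Because the Q-learning target does not depend on the action that was actually taken next, the same update can be applied to transitions replayed from a buffer or gathered by another policy.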
Model-based RL learns or uses a model of the environment to plan or to generate synthetic experience. This often improves sample efficiency at the cost of additional complexity.
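One simple way to generate synthetic experience is to store observed transitions in a table and replay them as extra value updates, in the spirit of Dyna-style planning. The sketch below assumes a deterministic environment so that a single stored outcome per state-action pair is an adequate model. The names `model` and `planning_updates` are hypothetical.

```python
import random
from collections import defaultdict

def planning_updates(Q, model, actions, n_updates, rng, alpha=0.5, gamma=0.9):
    """Apply Q-learning updates to transitions sampled from a learned tabular model."""
    seen = list(model)  # state-action pairs observed so far
    for _ in range(n_updates):
        s, a = rng.choice(seen)
        r, s_next, done = model[(s, a)]  # simulated outcome, no real interaction
        target = r if done else r + gamma * max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])

Q = defaultdict(float)
model = {}

# Record one real transition: in state 0, action 1 gave reward 1.0 and ended the episode.
model[(0, 1)] = (1.0, 1, True)

# Reuse that single experience three times without touching the environment again.
planning_updates(Q, model, actions=[0, 1], n_updates=3, rng=random.Random(0))
print(Q[(0, 1)])  # 0.5, then 0.75, then 0.875
```

Each real transition is reused several times, which is where the sample-efficiency gain comes from. The added complexity is that errors in the learned model propagate into the value estimates.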
Exploration is the problem of trying actions whose value is uncertain in order to discover better strategies, while exploitation means using current knowledge to maximize reward. Balancing the two is one of the central challenges of RL [1]. Naive random exploration scales poorly in large or sparse-reward problems.
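A classic alternative to purely random exploration is to add an optimism bonus to actions that have been tried less often. The sketch below uses the UCB1 rule on a Bernoulli bandit as one example of this idea. The arm probabilities, step count, and exploration constant are assumptions chosen for illustration.

```python
import math
import random

def ucb1(true_means, steps=2000, c=2.0, seed=0):
    """Pull the arm with the highest estimate plus an uncertainty bonus."""
    rng = random.Random(seed)
    k = len(true_means)
    counts = [0] * k
    estimates = [0.0] * k
    for t in range(1, steps + 1):
        if t <= k:
            a = t - 1  # pull every arm once so each count is nonzero
        else:
            a = max(
                range(k),
                key=lambda i: estimates[i] + math.sqrt(c * math.log(t) / counts[i]),
            )
        reward = 1.0 if rng.random() < true_means[a] else 0.0  # Bernoulli reward
        counts[a] += 1
        estimates[a] += (reward - estimates[a]) / counts[a]    # incremental mean
    return counts, estimates

counts, estimates = ucb1([0.2, 0.5, 0.8])
print(counts)
```

The bonus shrinks as an arm is pulled more often, so exploration is directed toward uncertain actions and fades once the estimates become reliable.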
Multi-agent RL studies environments where two or more agents interact, possibly cooperating, competing, or both. Classic ideas come from game theory, including Nash equilibria and self-play. Notable systems include OpenAI Five for Dota 2, AlphaStar for StarCraft II (Vinyals and colleagues, Nature 2019) [19], and CICERO for the language game Diplomacy (Meta AI, 2022). Algorithms include independent Q-learning, MADDPG (Lowe and colleagues, 2017), QMIX (Rashid and colleagues, 2018), and population-based training.
Reinforcement learning from human feedback (RLHF) trains a model using a learned reward model fitted to human preferences. The standard recipe was popularized by Christiano and colleagues in the 2017 paper Deep reinforcement learning from human preferences [20]. It uses three steps:
RLHF is the central alignment step in InstructGPT (Ouyang and colleagues, 2022) [21] and ChatGPT, and similar techniques underlie Claude, Gemini, and many open-source instruction-tuned models. Variants and successors include Direct Preference Optimization (DPO) by Rafailov and colleagues in 2023, which removes the explicit reward model [22]; reinforcement learning from AI feedback (RLAIF); Constitutional AI (Bai and colleagues, Anthropic 2022) [23]; and Identity Preference Optimization (IPO).
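The preference-fitting step can be illustrated with the pairwise logistic (Bradley-Terry style) loss that is commonly used to train reward models, together with the DPO loss, which applies the same form directly to policy log-probabilities. This is a scalar sketch under stated assumptions: real systems compute these quantities over batches of model outputs, and the function names and the `beta` default are illustrative.

```python
import math

def neg_log_sigmoid(x):
    # -log(sigmoid(x)) written as softplus(-x) in a numerically stable form.
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))

def reward_model_loss(reward_chosen, reward_rejected):
    """Pairwise preference loss: low when the chosen response scores higher."""
    return neg_log_sigmoid(reward_chosen - reward_rejected)

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss: the implicit reward is beta times the log-ratio to a reference policy."""
    chosen_ratio = logp_chosen - ref_logp_chosen
    rejected_ratio = logp_rejected - ref_logp_rejected
    return neg_log_sigmoid(beta * (chosen_ratio - rejected_ratio))

print(reward_model_loss(2.0, 0.0))  # small: the reward model agrees with the human label
print(reward_model_loss(0.0, 2.0))  # large: the reward model disagrees
print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

When the two scores are equal, both losses reduce to log 2, the loss of a coin flip between the two responses.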
A new wave of large language model training uses RL with verifiable rewards (often grading code or math answers automatically) to elicit long chain-of-thought reasoning.
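A verifiable reward can be as simple as an exact-match check against a reference answer. The sketch below pairs such a checker with group-normalized advantages in the style of GRPO, where several sampled answers to the same prompt are scored relative to one another. The string-matching rule, the function names, and the `eps` constant are simplifying assumptions, not a description of any particular production system.

```python
import statistics

def math_reward(model_answer: str, reference: str) -> float:
    """Hypothetical verifier: 1.0 for an exact match after trimming whitespace, else 0.0."""
    return 1.0 if model_answer.strip() == reference.strip() else 0.0

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within a group of samples drawn for the same prompt."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    # eps avoids division by zero when every sample gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled answers to the same question, whose reference answer is "42".
samples = ["42", "41", " 42 ", "forty-two"]
rewards = [math_reward(s, "42") for s in samples]
advantages = group_relative_advantages(rewards)
print(rewards)
print([round(a, 3) for a in advantages])
```

Correct answers receive positive advantages and incorrect ones negative, so the policy update favors the sampled reasoning that led to a verified result. If all samples in a group earn the same reward, every advantage is zero and the group contributes no learning signal.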
Researchers and engineers usually rely on standard libraries and benchmark suites.
| Project | Maintainer | Description |
|---|---|---|
| OpenAI Gym | OpenAI (now Farama Foundation as Gymnasium) | Standard environment API including Atari, classic control, and MuJoCo tasks. |
| Gymnasium | Farama Foundation | Maintained fork of Gym used by most current research code. |
| DeepMind Control Suite | DeepMind | Continuous control benchmarks built on MuJoCo. |
| MuJoCo | DeepMind (open source) | Physics simulator widely used for continuous control. |
| Stable Baselines3 | DLR-RM | PyTorch implementations of PPO, SAC, TD3, DQN, and others. |
| RLlib | Anyscale (Ray) | Scalable distributed RL library. |
| Dopamine | | Research framework focused on Atari and reproducibility. |
| CleanRL | Costa Huang and contributors | Single file implementations of RL algorithms for clarity. |
| Tianshou | Tsinghua TSAIL | PyTorch RL library with broad algorithm coverage. |
| Acme | DeepMind | Distributed agents library. |
| TRL | Hugging Face | RLHF, DPO, GRPO, and PPO for transformer language models. |
| verl | ByteDance Seed | Volcano Engine reinforcement learning library used for large scale LLM RL. |
| Procgen, MiniGrid, NetHack, Crafter | various | Generalization benchmarks. |
| Year | Milestone |
|---|---|
| 1959 | Arthur Samuel's checkers program uses temporal-difference-style learning, an early precursor to modern RL. |
| 1989 | Christopher Watkins introduces Q-learning in his Cambridge PhD thesis [2]. |
| 1992 | Gerald Tesauro's TD-Gammon learns to play backgammon at a world-class level using temporal difference learning with a neural network. |
| 1998 | First edition of Reinforcement Learning: An Introduction by Sutton and Barto [1]. |
| 2013 | Mnih and colleagues at DeepMind release the original DQN paper on arXiv, learning Atari games from raw pixels [3]. |
| 2015 | DQN paper appears in Nature, achieving human-level scores on 49 Atari games [4]. |
| 2016 | DeepMind's AlphaGo defeats Lee Sedol four games to one in a five-game match in Seoul, 9-15 March [30]. |
| 2017 | Christiano and colleagues publish Deep RL from human preferences, laying the foundation for RLHF [20]. |
| 2017 | Schulman and colleagues introduce PPO [11]. |
| 2017 | AlphaZero masters Go, chess, and shogi entirely through self-play [15]. |
| 2018 | OpenAI Five defeats strong amateur and semi-professional human teams in restricted Dota 2; it goes on to beat the world champion team OG in 2019. |
| 2019 | DeepMind's AlphaStar reaches grandmaster level in StarCraft II [19]. |
| 2019 | OpenAI's robot hand solves a Rubik's cube using PPO with domain randomization. |
| 2020 | MuZero matches AlphaZero without being given the rules of the games [16]. |
| 2022 | OpenAI releases InstructGPT and then ChatGPT, both trained with PPO-based RLHF [21]. |
| 2023 | DeepMind publishes DreamerV3 and a series of generally capable RL agents [17]. |
| 2024 | OpenAI launches the o1 reasoning model trained with large-scale RL on chains of thought [24]. |
| 2025 | DeepSeek releases DeepSeek-R1, trained with the GRPO recipe, which kicks off widespread adoption of RL with verifiable rewards in open-source LLM training [25]. |
Reinforcement learning is powerful but notoriously difficult to use in practice. Common challenges include sample inefficiency (deep RL often needs millions of frames), unstable training due to bootstrapping with function approximation and off-policy data, sensitivity to hyperparameters, reward hacking (where the agent finds unintended ways to maximize the reward), credit assignment over long horizons, and a sim-to-real gap that limits transfer from simulation to physical robots. Safety, interpretability, and alignment with human intent are active research areas, particularly for RL fine-tuned language models.
RL connects to many other fields. Imitation learning and behavior cloning train a policy directly from expert demonstrations. Inverse reinforcement learning recovers a reward function from observed behavior. Offline reinforcement learning, also known as batch RL, learns from a fixed dataset without further interaction. Meta reinforcement learning learns algorithms that adapt quickly to new tasks. Hierarchical reinforcement learning decomposes long-horizon problems into reusable sub-policies, often using the options framework of Sutton, Precup, and Singh.