Machine learning terms/Reinforcement Learning

Machine Learning Reinforcement Learning

Updated Jul 11, 2026

See also: Machine learning terms

Reinforcement learning (RL) is a branch of machine learning in which an agent learns to make sequential decisions by interacting with an environment and receiving feedback in the form of rewards or penalties. In the standard reference text, Richard Sutton and Andrew Barto define it as "learning what to do, how to map situations to actions, so as to maximize a numerical reward signal" ^[1]. Unlike supervised learning, where a model is trained on labeled examples, an RL agent must discover good behavior through trial and error, balancing the need to gather new information (exploration) with the need to use what it already knows (exploitation). This page is a glossary index of the key machine-learning terms used in reinforcement learning: the core elements (agent, environment, state, action, reward, policy, value function, return, discount factor), the Markov decision process framework, the main algorithm families (value-based, policy-gradient, actor-critic, and model-based), the exploration-exploitation tradeoff, and modern uses such as reinforcement learning from human feedback (RLHF).

RL has produced many landmark results in artificial intelligence, including DeepMind's Atari-playing networks, AlphaGo, AlphaZero, MuZero, OpenAI Five, the RLHF systems used to align large language models such as InstructGPT, ChatGPT, and Claude, and RL-trained reasoning models such as DeepSeek-R1. This page also serves as a glossary index linking to detailed pages on individual RL terms (see the index of reinforcement learning terms below).

What are the core elements of reinforcement learning?

The standard reinforcement learning loop is built around an interaction between an agent and an environment. At each discrete time step, the agent observes a state, chooses an action according to its policy, and the environment responds with a new state and a scalar reward. The agent's goal is to maximize the expected cumulative reward, often called the return, over time. Sutton and Barto identify trial-and-error search and delayed reward as the two most important distinguishing features of reinforcement learning ^[1].

Concept	Symbol	Description
Agent		The learner or decision maker that chooses actions.
Environment		Everything outside the agent that responds to actions and produces states and rewards.
State	s	A representation of the current situation that the agent observes.
Action	a	A choice the agent makes at a given state.
Reward	r	A scalar signal indicating how good the most recent transition was.
Policy	$\pi(a \mid s)$	A mapping from states to actions, possibly stochastic.
Return	G	The total discounted future reward from a given time step.
Value function	$V(s)$	Expected return starting from state s under a policy.
Action-value function	$Q(s, a)$	Expected return after taking action a in state s and then following the policy.
Discount factor	γ	A number in $[0, 1]$ that reduces the weight of distant rewards.
Trajectory	τ	A sequence of states, actions, and rewards.
Episode		A complete trajectory from an initial state to a terminal state.
Termination condition		A rule that ends an episode, for example reaching a goal or running out of time.

A policy can be deterministic, choosing a single action per state, or stochastic, defining a probability distribution over actions. The optimal policy, usually written $\pi^*$, is one that achieves the highest possible expected return from every state.
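As a small illustration of the return G and the discount factor γ from the table above, the sketch below computes the discounted return of one finite episode. The function name `discounted_return` and the sample rewards are illustrative choices, not part of any library.

```python
def discounted_return(rewards, gamma):
    """Return G_0 = sum_k gamma**k * rewards[k] for one finite episode."""
    g = 0.0
    # Work backwards so that G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


if __name__ == "__main__":
    # Three rewards of 1 with gamma = 0.5: 1 + 0.5 + 0.25
    print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1.75
```

With γ close to 1 the agent weighs distant rewards almost as much as immediate ones; with γ close to 0 it becomes short-sighted.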

What is a Markov decision process?

Most RL problems are modeled as a Markov decision process (MDP), defined by a tuple $(S, A, P, R, \gamma)$ where S is the set of states, A is the set of actions, $P(s' \mid s, a)$ is the transition probability, $R(s, a)$ is the reward function, and $\gamma$ is the discount factor ^[1]. The defining feature is the Markov property: the next state depends only on the current state and action, not on the history of how the agent arrived there. When the agent cannot directly observe the full state, the problem is a partially observable MDP (POMDP), which often requires memory-based policies built from recurrent neural networks or transformers.

The Bellman equation expresses the value of a state as the expected immediate reward plus the discounted value of the next state. For the optimal action-value function, the Bellman optimality equation is:

$$Q^*(s, a) = \mathbb{E}\left[ r + \gamma \max_{a'} Q^*(s', a') \right]$$

Most RL algorithms can be viewed as approximate ways of solving this equation. Classical methods such as dynamic programming, value iteration, and policy iteration require a known model of the environment and are described in Sutton and Barto's textbook Reinforcement Learning: An Introduction ^[1].
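When the model is known, the Bellman optimality equation can be solved by repeated backups. The following is a minimal sketch of Q-value iteration on a made-up two-state MDP; the transition table, its rewards, and the function name `q_value_iteration` are all assumptions for illustration. Rewards are attached to each outcome tuple, which is equivalent to $R(s, a)$ here because every transition is deterministic.

```python
def q_value_iteration(transitions, gamma, tol=1e-10, max_iters=10_000):
    """Iterate the Bellman optimality backup until Q stops changing.

    transitions[s][a] is a list of (probability, next_state, reward) tuples.
    """
    q = [[0.0 for _ in actions] for actions in transitions]
    for _ in range(max_iters):
        delta = 0.0
        new_q = [[0.0 for _ in actions] for actions in transitions]
        for s, actions in enumerate(transitions):
            for a, outcomes in enumerate(actions):
                # Expected reward plus discounted value of the best next action.
                backup = sum(p * (r + gamma * max(q[s2])) for p, s2, r in outcomes)
                new_q[s][a] = backup
                delta = max(delta, abs(backup - q[s][a]))
        q = new_q
        if delta < tol:
            break
    return q


# Hypothetical MDP: action 0 = stay, action 1 = move to the other state.
TOY_MDP = [
    [[(1.0, 0, 0.0)], [(1.0, 1, 1.0)]],  # state 0
    [[(1.0, 1, 2.0)], [(1.0, 0, 0.0)]],  # state 1
]

if __name__ == "__main__":
    print(q_value_iteration(TOY_MDP, gamma=0.5))  # close to [[1.5, 3.0], [4.0, 1.5]]
```

In this toy problem the greedy policy moves from state 0 to state 1 and then stays there, collecting a reward of 2 per step.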

What are tabular reinforcement learning methods?

When the state and action spaces are small, RL can be solved with tabular methods that store one value per state or state-action pair.

Q-learning, introduced by Christopher Watkins in his 1989 PhD thesis, is an off-policy temporal difference algorithm ^[2]. The agent updates $Q(s, a)$ toward $r + \gamma \max_{a'} Q(s', a')$. Tabular Q-learning converges to the optimal action-value function, and hence to an optimal greedy policy, provided every state-action pair is visited infinitely often and the learning rate is decayed appropriately.
SARSA (state, action, reward, state, action), described by Rummery and Niranjan in 1994, is an on-policy variant that updates toward $r + \gamma Q(s', a')$ using the action actually taken under the current policy.
Monte Carlo methods estimate value functions by averaging returns from complete episodes.
Dyna-Q, proposed by Richard Sutton in 1990, blends real experience with simulated experience from a learned model, which is one of the earliest examples of model-based RL.

These algorithms typically use an epsilon-greedy policy for exploration: with probability ε the agent picks a random action, and otherwise it picks the greedy action, the one with the highest current value estimate. A random policy selects actions uniformly at random and is often used as a baseline.
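The Q-learning update and epsilon-greedy exploration described above can be sketched on a tiny problem. The five-state corridor below, its reward of 1 at the goal, and all hyperparameter values are hypothetical choices made for this example only.

```python
import random

N_STATES, GOAL = 5, 4          # hypothetical corridor: states 0..4, goal at 4
LEFT, RIGHT = 0, 1


def step(state, action):
    """Deterministic corridor dynamics: returns (next_state, reward, done)."""
    nxt = max(0, state - 1) if action == LEFT else state + 1
    done = nxt == GOAL
    return nxt, (1.0 if done else 0.0), done


def epsilon_greedy(q_row, epsilon, rng):
    """Random action with probability epsilon, else a greedy one (ties broken at random)."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    best = max(q_row)
    return rng.choice([a for a, v in enumerate(q_row) if v == best])


def train_q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        s = 0
        for _ in range(100):  # step cap so every episode ends
            a = epsilon_greedy(q[s], epsilon, rng)
            s2, r, done = step(s, a)
            # Off-policy target: bootstrap from the best next action.
            # SARSA would instead use the value of the action actually taken next.
            target = r if done else r + gamma * max(q[s2])
            q[s][a] += alpha * (target - q[s][a])
            s = s2
            if done:
                break
    return q


if __name__ == "__main__":
    q_table = train_q_learning()
    # Greedy action index for each non-terminal state.
    print([row.index(max(row)) for row in q_table[:GOAL]])
```

Ties are broken at random in `epsilon_greedy` so that the untrained agent, whose estimates are all zero, does not get stuck repeating the first action.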

How does value-based deep reinforcement learning work?

For large or continuous state spaces, tabular storage is infeasible and value functions must be approximated, usually with neural networks. The combination of deep learning with RL is known as deep reinforcement learning.

Deep Q-Network (DQN), introduced by Mnih and colleagues at DeepMind in the 2013 arXiv paper Playing Atari with Deep Reinforcement Learning ^[3] and the 2015 Nature paper Human-level control through deep reinforcement learning ^[4], parameterizes the Q-function with a convolutional neural network. In the Nature paper, DQN was evaluated on 49 Atari 2600 games using the same algorithm, network architecture, and hyperparameters for every game; it reached a level comparable to a professional human games tester across the set and scored above 75% of the human score on more than half of the games ^[4].
Two key stabilization techniques made DQN work. The replay buffer, also called experience replay, stores past transitions and samples minibatches uniformly at random, which breaks the correlations between consecutive samples. A separate target network, refreshed periodically with a copy of the online weights, provides stable bootstrap targets ^[4].
Double DQN (van Hasselt and colleagues, 2016) decouples action selection from action evaluation to reduce the systematic overestimation bias of standard Q-learning ^[5].
Dueling DQN (Wang and colleagues, 2016) splits the network into a state-value stream and an advantage stream, then recombines them, which improves learning when many actions yield similar values ^[6].
Prioritized experience replay (Schaul and colleagues, 2016) samples transitions with high temporal difference error more often ^[7].
Rainbow DQN (Hessel and colleagues, 2018) combines six DQN improvements, namely double Q-learning, prioritized replay, dueling networks, multi-step targets, distributional RL, and noisy networks, to set new benchmark scores on Atari ^[8].
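Two of the ideas in this list are easy to show without a neural network: uniform experience replay and the difference between the standard DQN target and the Double DQN target. The sketch below uses plain Python lists in place of network outputs; the class and function names are illustrative and not taken from any particular library.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size buffer that drops the oldest transition when full."""

    def __init__(self, capacity, seed=0):
        self.storage = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks temporal correlation.
        return self.rng.sample(list(self.storage), batch_size)

    def __len__(self):
        return len(self.storage)


def dqn_target(reward, done, next_q_target, gamma):
    """Standard DQN: the target network both selects and evaluates the next action."""
    return reward if done else reward + gamma * max(next_q_target)


def double_dqn_target(reward, done, next_q_online, next_q_target, gamma):
    """Double DQN: the online network selects, the target network evaluates."""
    if done:
        return reward
    best_action = max(range(len(next_q_online)), key=next_q_online.__getitem__)
    return reward + gamma * next_q_target[best_action]


if __name__ == "__main__":
    online, target = [1.0, 2.0], [3.0, 0.5]   # made-up Q-values for the next state
    print(dqn_target(1.0, False, target, 0.9))                 # 1 + 0.9 * 3.0
    print(double_dqn_target(1.0, False, online, target, 0.9))  # 1 + 0.9 * 0.5
```

In the made-up numbers above the two networks disagree about which action is best, so the Double DQN target is lower; this is the mechanism by which it counteracts overestimation.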

What are policy gradient methods?

Policy gradient methods directly parameterize the policy $\pi_\theta(a \mid s)$ and update $\theta$ to increase expected return using the policy gradient theorem (Sutton, McAllester, Singh, and Mansour, 2000).

REINFORCE, introduced by Ronald Williams in 1992, computes a Monte Carlo estimate of the policy gradient using complete episode returns ^[9].
Actor-critic methods combine a policy network (the actor) with a value network (the critic) that estimates baselines, reducing variance.
Asynchronous Advantage Actor-Critic (A3C), introduced by Mnih and colleagues in 2016, runs many parallel actors to decorrelate experience without a replay buffer. Advantage Actor-Critic (A2C) is its synchronous variant, which waits for all actors to finish before each update.
Trust Region Policy Optimization (TRPO), proposed by Schulman and colleagues in 2015, constrains each policy update by a KL-divergence trust region for monotonic improvement ^[10].
Proximal Policy Optimization (PPO), introduced by Schulman and colleagues in 2017, replaces TRPO's hard constraint with a clipped surrogate objective ^[11]. PPO is widely used because it is relatively simple to implement, stable, and works well on many tasks. It became the RL algorithm behind OpenAI Five and the RLHF stage of InstructGPT and ChatGPT.
Deep Deterministic Policy Gradient (DDPG), proposed by Lillicrap and colleagues in 2016, extends actor-critic to continuous action spaces using off-policy data ^[12].
Twin Delayed DDPG (TD3), introduced by Fujimoto, van Hoof, and Meger in 2018, fixes the overestimation bias of DDPG with twin critics, delayed policy updates, and target policy smoothing ^[13].
Soft Actor-Critic (SAC), introduced by Haarnoja and colleagues in 2018, adds an entropy bonus to the objective so that the policy is as random as possible while still maximizing return ^[14]. SAC is a leading off-policy method for continuous control.
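To make PPO's clipped surrogate objective concrete, here is a minimal per-sample sketch. It assumes the log-probabilities of the taken action under the new and old policies and an advantage estimate are already available; the function name and the default clip range of 0.2 are choices made for this example.

```python
import math


def ppo_clip_objective(logp_new, logp_old, advantage, clip_eps=0.2):
    """Per-sample clipped surrogate: min(ratio * A, clip(ratio, 1-eps, 1+eps) * A)."""
    ratio = math.exp(logp_new - logp_old)  # pi_new(a|s) / pi_old(a|s)
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    # Taking the minimum removes any incentive to move the ratio
    # outside the clip range in the direction that raises the objective.
    return min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    up = math.log(1.5)  # new policy makes the action 1.5x more likely
    print(ppo_clip_objective(up, 0.0, advantage=1.0))   # capped near 1.2
    print(ppo_clip_objective(up, 0.0, advantage=-1.0))  # not capped, near -1.5
```

In training, this quantity is averaged over a batch and maximized by gradient ascent on the policy parameters; full implementations usually add a value loss and an entropy term.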

What is the difference between value-based, policy-based, and actor-critic methods?

RL algorithms are often grouped into three families.

Family	Learns	Typical algorithms	Strengths
Value based	$Q(s, a)$	Q-learning, SARSA, DQN, Rainbow	Sample efficient, easy to use with discrete actions
Policy based	$\pi_\theta(a \mid s)$	REINFORCE, TRPO, PPO	Handles continuous and stochastic actions, smooth policy improvement
Actor-critic	both	A2C, A3C, DDPG, TD3, SAC	Combines variance reduction of values with flexibility of policies

Value-based methods are usually off-policy, which means they can learn from data collected by a different policy, while pure policy gradient methods are on-policy. Off-policy actor-critic methods such as DDPG, TD3, and SAC try to combine the best of both worlds.

What is model-based reinforcement learning?

Model-based RL learns or uses a model of the environment to plan or to generate synthetic experience. This often improves sample efficiency at the cost of additional complexity.

AlphaZero (Silver and colleagues, 2018) combines Monte Carlo tree search with a deep network that predicts moves and values, and learns purely from self play ^[15]. It mastered Go, chess, and shogi from scratch.
MuZero (Schrittwieser and colleagues, 2020) extends AlphaZero by learning the dynamics of the environment in a latent space, so it does not need a known set of rules ^[16]. It matches AlphaZero on board games and, at the time of publication, set a new state of the art on the Atari benchmark.
World models (Ha and Schmidhuber, 2018) train a generative model of pixels and learn policies inside the imagined environment.
Dreamer, DreamerV2, and DreamerV3 (Hafner and colleagues, 2020 to 2023) learn a recurrent latent world model and train an actor-critic on imagined trajectories ^[17]. DreamerV3 is notable for solving a wide range of tasks with one set of hyperparameters and was the first method to collect diamonds in Minecraft from scratch, without human data or curricula ^[17].
Other notable model-based methods include PETS, PlaNet, and SimPLe.
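The idea of generating synthetic experience from a learned model can be sketched in the style of Dyna-Q. The example below assumes a deterministic tabular model stored as a dictionary from (state, action) to (reward, next state, done), filled in from real transitions; the function name and the two-step chain are hypothetical.

```python
import random


def dyna_planning(q, model, n_updates, alpha, gamma, rng):
    """Apply Q-learning updates to transitions replayed from a learned tabular model."""
    seen = list(model)  # (state, action) pairs observed in real experience
    for _ in range(n_updates):
        s, a = rng.choice(seen)
        r, s2, done = model[(s, a)]
        target = r if done else r + gamma * max(q[s2])
        q[s][a] += alpha * (target - q[s][a])
    return q


if __name__ == "__main__":
    # Hypothetical chain 0 -> 1 -> 2 with a single action and reward 1 at the end.
    learned_model = {(0, 0): (0.0, 1, False), (1, 0): (1.0, 2, True)}
    q_table = [[0.0], [0.0], [0.0]]
    dyna_planning(q_table, learned_model, n_updates=50, alpha=1.0,
                  gamma=0.9, rng=random.Random(0))
    print(q_table)
```

Because the updates come from the model rather than the environment, the reward at the end of the chain propagates back to state 0 without any further real interaction, which is where the sample-efficiency gain comes from.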

What is the exploration-exploitation tradeoff?

Exploration is the problem of trying actions whose value is uncertain in order to discover better strategies, while exploitation means using current knowledge to maximize reward. Balancing the two is one of the central challenges of RL ^[1]. Naive random exploration scales poorly in large or sparse reward problems.

Epsilon greedy and Boltzmann (softmax) exploration are simple and widely used.
Upper Confidence Bound (UCB) methods, formalized by Auer, Cesa-Bianchi, and Fischer in 2002, choose the action with the highest optimistic upper bound on its value ^[18].
Thompson sampling samples a model from a posterior over environments and acts greedily with respect to it. The idea goes back to William Thompson's 1933 paper.
Intrinsic motivation rewards the agent for visiting novel states. Examples include count based bonuses, Random Network Distillation (Burda and colleagues, 2018), and curiosity driven exploration via prediction error (Pathak and colleagues, 2017).
Maximum entropy methods such as SAC encourage exploration by rewarding randomness in the policy ^[14].
Go-Explore (Ecoffet and colleagues, 2021) explicitly remembers promising states and returns to them before exploring further.
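As an illustration of optimism under uncertainty, the sketch below implements the UCB1 selection rule for a multi-armed bandit: pick the arm with the highest empirical mean plus an exploration bonus of $\sqrt{2 \ln n / n_j}$. The function name and argument layout are assumptions made for this example.

```python
import math


def ucb1_select(counts, value_sums, total_pulls):
    """Return the arm index maximizing mean reward plus the UCB1 exploration bonus.

    counts[j] is how often arm j was pulled, value_sums[j] its summed reward,
    and total_pulls the number of pulls so far across all arms.
    """
    # Pull every arm once before the bonus formula is defined.
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    scores = [
        value_sums[j] / counts[j] + math.sqrt(2.0 * math.log(total_pulls) / counts[j])
        for j in range(len(counts))
    ]
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    # Arm 1 has a poor record but was tried only once, so its bonus dominates.
    print(ucb1_select(counts=[10, 1], value_sums=[9.0, 0.0], total_pulls=11))  # 1
```

As an arm is pulled more often its bonus shrinks, so the rule gradually shifts from exploring rarely tried arms to exploiting the best-looking one.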

What is multi-agent reinforcement learning?

Multi-agent RL studies environments where two or more agents interact, possibly cooperating, competing, or both. Classic ideas come from game theory, including Nash equilibria and self play. Notable systems include OpenAI Five for Dota 2, AlphaStar for StarCraft II (Vinyals and colleagues, Nature 2019) ^[19], and CICERO for the language game Diplomacy (Meta AI, 2022). Algorithms include independent Q-learning, MADDPG (Lowe and colleagues, 2017), QMIX (Rashid and colleagues, 2018), and population based training.

What is reinforcement learning from human feedback (RLHF)?

Reinforcement learning from human feedback (RLHF) trains a model using a learned reward model fitted to human preferences. The standard recipe was popularized by Christiano and colleagues in the 2017 paper Deep reinforcement learning from human preferences ^[20]. It uses three steps:

1. Collect pairs of model outputs and ask humans which one they prefer.
2. Train a reward model to predict these preferences.
3. Fine-tune the policy with an RL algorithm, usually PPO, to maximize the reward model under a KL penalty against a reference policy.
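Steps 2 and 3 can be sketched numerically. The example below assumes a Bradley-Terry style pairwise loss for the reward model, a common choice but not the only one, and a per-sample KL penalty estimated from the log-probability gap between the policy and the reference. The function names and the coefficient `beta` are illustrative.

```python
import math


def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss -log sigmoid(r_chosen - r_rejected), written stably."""
    margin = reward_chosen - reward_rejected
    # softplus(-margin) = log(1 + exp(-margin)), split to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def kl_shaped_reward(reward_model_score, logp_policy, logp_reference, beta=0.1):
    """Reward-model score minus beta times a per-sample estimate of the KL term."""
    return reward_model_score - beta * (logp_policy - logp_reference)


if __name__ == "__main__":
    print(preference_loss(2.0, 0.0))  # small: the model already ranks the pair correctly
    print(preference_loss(0.0, 2.0))  # large: the model ranks the pair the wrong way round
    print(kl_shaped_reward(1.0, logp_policy=-1.0, logp_reference=-2.0))  # 1.0 - 0.1 * 1.0
```

The penalty lowers the reward whenever the fine-tuned policy assigns much higher probability to an output than the reference policy does, which discourages drifting far from the starting model.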

RLHF is the central alignment step in InstructGPT (Ouyang and colleagues, 2022) ^[21] and ChatGPT, and similar techniques underlie Claude, Gemini, and many open source instruction tuned models. Variants and successors include Direct Preference Optimization (DPO) by Rafailov and colleagues in 2023, which removes the explicit reward model ^[22], reinforcement learning from AI feedback (RLAIF), Constitutional AI (Bai and colleagues, Anthropic 2022) ^[23], and Identity Preference Optimization (IPO).

What is reinforcement learning for reasoning models?

A new wave of large language model training uses RL with verifiable rewards (often grading code or math answers automatically) to elicit long chain of thought reasoning.

OpenAI o1, previewed on 12 September 2024, was the first widely deployed reasoning model trained with large scale RL on chains of thought ^[24].
DeepSeek-R1 (DeepSeek AI, January 2025) was trained with Group Relative Policy Optimization (GRPO), an actor-only RL algorithm that replaces PPO's value function with a group baseline computed from multiple sampled responses to the same prompt ^[25]. On the AIME 2024 math competition, RL raised DeepSeek-R1-Zero's pass@1 score from 15.6% to 71.0%, rising to 86.7% with majority voting ^[25]. GRPO was originally described in the DeepSeekMath paper (Shao and colleagues, 2024) ^[26], and the DeepSeek-R1 work was published in Nature in September 2025 with a reported training cost of about $294,000 ^[27].
DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization), released by ByteDance Seed and Tsinghua in 2025, builds on GRPO with decoupled clipping ranges, dynamic sampling, and token level loss for long chain of thought stability ^[28].
Tülu 3 (Allen Institute for AI, 2024) is a fully open post training recipe that combines supervised fine tuning, DPO, and RL with verifiable rewards (RLVR) to reach state of the art results among open weight models ^[29].
Other recent methods include ReMax, RLOO, and various length normalized policy optimization variants.
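The group baseline used by GRPO can be illustrated in a few lines: each sampled response to a prompt is scored, and its advantage is its reward standardized against the other responses in the same group. The sketch below assumes the population standard deviation and a small `eps` for numerical safety; both are choices made for this example rather than details taken from the cited papers.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against the mean and spread of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population standard deviation
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled answers to one prompt, graded 1 (correct) or 0 (incorrect).
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

When every response in a group receives the same reward, all advantages are zero and that prompt contributes no gradient, which is one reason later methods such as DAPO resample such groups.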

What frameworks and benchmarks are used for reinforcement learning?

Researchers and engineers usually rely on standard libraries and benchmark suites.

Project	Maintainer	Description
OpenAI Gym	OpenAI (now Farama Foundation as Gymnasium)	Standard environment API including Atari, classic control, and MuJoCo tasks.
Gymnasium	Farama Foundation	Maintained fork of Gym used by most current research code.
DeepMind Control Suite	DeepMind	Continuous control benchmarks built on MuJoCo.
MuJoCo	DeepMind (open source)	Physics simulator widely used for continuous control.
Stable Baselines3	DLR-RM	PyTorch implementations of PPO, SAC, TD3, DQN, and others.
RLlib	Anyscale (Ray)	Scalable distributed RL library.
Dopamine	Google	Research framework focused on Atari and reproducibility.
CleanRL	Costa Huang and contributors	Single file implementations of RL algorithms for clarity.
Tianshou	Tsinghua TSAIL	PyTorch RL library with broad algorithm coverage.
Acme	DeepMind	Distributed agents library.
TRL	Hugging Face	RLHF, DPO, GRPO, and PPO for transformer language models.
verl	ByteDance Seed	Volcano Engine reinforcement learning library used for large scale LLM RL.
Procgen, MiniGrid, NetHack, Crafter	various	Generalization benchmarks.

What are the notable milestones in reinforcement learning?

Year	Milestone
1959	Arthur Samuel's checkers program uses temporal difference style learning, an early precursor to modern RL.
1989	Christopher Watkins introduces Q-learning in his Cambridge PhD thesis ^[2].
1992	Gerald Tesauro's TD-Gammon learns to play backgammon at a world class level using temporal difference learning with a neural network.
1998	First edition of Reinforcement Learning: An Introduction by Sutton and Barto ^[1].
2013	Mnih and colleagues at DeepMind release the original DQN paper on arXiv, learning Atari games from raw pixels ^[3].
2015	DQN paper appears in Nature, reporting performance comparable to a professional human games tester across 49 Atari games ^[4].
2016	DeepMind's AlphaGo defeats Lee Sedol four games to one in a five-game match in Seoul, 9-15 March ^[30].
2017	Christiano and colleagues publish Deep RL from human preferences, laying the foundation for RLHF ^[20].
2017	Schulman and colleagues introduce PPO ^[11].
2017	AlphaZero masters Go, chess, and shogi entirely through self play ^[15].
2018	OpenAI Five beats teams of amateur and semi-professional players in restricted Dota 2; in April 2019 it defeats the world champion team OG.
2019	DeepMind's AlphaStar reaches grandmaster level in StarCraft II ^[19].
2019	OpenAI's robot hand solves a Rubik's cube using PPO with domain randomization.
2020	MuZero matches AlphaZero without knowing the rules ^[16].
2022	OpenAI releases InstructGPT and then ChatGPT, both trained with PPO based RLHF ^[21].
2023	DeepMind publishes DreamerV3 and a series of generally capable RL agents ^[17].
2024	OpenAI launches the o1 reasoning model trained with large scale RL on chains of thought ^[24].
2025	DeepSeek releases DeepSeek-R1 and the GRPO recipe, which kicks off widespread adoption of RL with verifiable rewards in open source LLM training ^[25].

What are the limitations and challenges of reinforcement learning?

Reinforcement learning is powerful but notoriously difficult to use in practice. Common challenges include sample inefficiency (deep RL often needs millions of frames), unstable training due to bootstrapping with function approximation and off-policy data, sensitivity to hyperparameters, reward hacking where the agent finds unintended ways to maximize the reward, credit assignment over long horizons, and a sim-to-real gap that limits transfer from simulation to physical robots. Safety, interpretability, and alignment with human intent are active research areas, particularly for RL fine tuned language models.

What fields are related to reinforcement learning?

RL connects to many other fields. Imitation learning and behavior cloning train a policy directly from expert demonstrations. Inverse reinforcement learning recovers a reward function from observed behavior. Offline reinforcement learning, also known as batch RL, learns from a fixed dataset without further interaction. Meta reinforcement learning learns algorithms that adapt quickly to new tasks. Hierarchical reinforcement learning decomposes long-horizon problems into reusable sub-policies, often using the options framework of Sutton, Precup, and Singh.

Index of reinforcement learning terms

References

1. Sutton, R. S., and Barto, A. G. (2018). *Reinforcement Learning: An Introduction* (2nd ed.). MIT Press. http://incompleteideas.net/book/the-book-2nd.html
2. Watkins, C. J. C. H. (1989). *Learning from delayed rewards*. PhD thesis, Cambridge University.
3. Mnih, V., et al. (2013). Playing Atari with Deep Reinforcement Learning. arXiv:1312.5602. https://arxiv.org/abs/1312.5602
4. Mnih, V., et al. (2015). Human-level control through deep reinforcement learning. *Nature*, 518(7540), 529-533. https://www.nature.com/articles/nature14236
5. van Hasselt, H., Guez, A., and Silver, D. (2016). Deep reinforcement learning with double Q-learning. AAAI.
6. Wang, Z., et al. (2016). Dueling network architectures for deep reinforcement learning. ICML.
7. Schaul, T., et al. (2016). Prioritized experience replay. ICLR.
8. Hessel, M., et al. (2018). Rainbow: Combining improvements in deep reinforcement learning. AAAI.
9. Williams, R. J. (1992). Simple statistical gradient following algorithms for connectionist reinforcement learning. *Machine Learning*, 8.
10. Schulman, J., et al. (2015). Trust Region Policy Optimization. ICML.
11. Schulman, J., et al. (2017). Proximal Policy Optimization Algorithms. arXiv:1707.06347. https://arxiv.org/abs/1707.06347
12. Lillicrap, T. P., et al. (2016). Continuous control with deep reinforcement learning. ICLR.
13. Fujimoto, S., van Hoof, H., and Meger, D. (2018). Addressing function approximation error in actor-critic methods. ICML.
14. Haarnoja, T., et al. (2018). Soft actor-critic. ICML.
15. Silver, D., et al. (2018). A general reinforcement learning algorithm that masters chess, shogi, and Go through self play. *Science*, 362(6419), 1140-1144.
16. Schrittwieser, J., et al. (2020). Mastering Atari, Go, chess, and shogi by planning with a learned model. *Nature*, 588.
17. Hafner, D., et al. (2023). Mastering diverse domains through world models (DreamerV3). arXiv:2301.04104. https://arxiv.org/abs/2301.04104
18. Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. *Machine Learning*, 47.
19. Vinyals, O., et al. (2019). Grandmaster level in StarCraft II using multi-agent reinforcement learning. *Nature*, 575.
20. Christiano, P., et al. (2017). Deep reinforcement learning from human preferences. NeurIPS.
21. Ouyang, L., et al. (2022). Training language models to follow instructions with human feedback (InstructGPT). NeurIPS. https://arxiv.org/abs/2203.02155
22. Rafailov, R., et al. (2023). Direct Preference Optimization: your language model is secretly a reward model. NeurIPS.
23. Bai, Y., et al. (2022). Constitutional AI: harmlessness from AI feedback. arXiv:2212.08073.
24. OpenAI (2024). Introducing OpenAI o1-preview. https://openai.com/index/introducing-openai-o1-preview/
25. DeepSeek-AI (2025). DeepSeek-R1: incentivizing reasoning capability in LLMs via reinforcement learning. arXiv:2501.12948. https://arxiv.org/abs/2501.12948
26. Shao, Z., et al. (2024). DeepSeekMath: pushing the limits of mathematical reasoning in open language models. arXiv:2402.03300.
27. DeepSeek-AI (2025). DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. *Nature*, 645. https://www.nature.com/articles/s41586-025-09422-z
28. Yu, Q., et al. (2025). DAPO: an open source LLM reinforcement learning system at scale. arXiv:2503.14476.
29. Lambert, N., et al. (2024). Tülu 3: pushing frontiers in open language model post training. arXiv:2411.15124.
30. AlphaGo versus Lee Sedol. Wikipedia. https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol
