Reinforcement learning (RL) is an area of machine learning concerned with how intelligent agents ought to take actions in an environment in order to maximize cumulative reward.[1] Reinforcement learning is one of three basic machine learning paradigms, alongside supervised learning and unsupervised learning.[2]
Unlike supervised learning which requires labeled input/output pairs, and unlike unsupervised learning which focuses on finding hidden structure in unlabeled data, reinforcement learning focuses on finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge) through trial-and-error interaction with an environment.[3] The environment is typically formulated as a Markov decision process (MDP), as many reinforcement learning algorithms utilize dynamic programming techniques.[4]
Overview
Reinforcement learning achieved widespread recognition through several landmark achievements. In 2016, DeepMind's AlphaGo defeated world champion Lee Sedol in the complex game of Go[5], a feat previously thought to be decades away. In 2019, OpenAI Five defeated the reigning world champion team in Dota 2[6], demonstrating RL's ability to handle complex team-based strategy games.
The field emerged from the convergence of multiple intellectual traditions. The psychology of animal learning, beginning with Edward Thorndike's Law of Effect in 1911, established that behaviors followed by satisfying consequences tend to be repeated. The mathematical framework came from optimal control theory and Richard Bellman's development of dynamic programming in the 1950s. These threads were unified in the modern field through the work of Richard Sutton and Andrew Barto, who received the 2024 Turing Award for their foundational contributions.[7]
Core Concepts
Agent-Environment Interaction
Reinforcement learning problems involve an agent interacting with an environment through a cycle of observation, action, and reward.[3] At each discrete time step t:
The agent observes the current state s_t of the environment
Based on its policy π, the agent selects an action a_t
The environment transitions to a new state s_{t+1} according to the transition probabilities P(s'|s,a)
The agent receives a scalar reward r_{t+1} indicating the immediate benefit of that action
The agent's objective is to learn a policy that maximizes the expected return (cumulative reward), typically discounted by factor γ (gamma) where 0 ≤ γ ≤ 1:[1]
G_t = R_{t+1} + γR_{t+2} + γ²R_{t+3} + ... = Σ_{k=0}^{∞} γ^k R_{t+k+1}
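The discounted return can be accumulated from the end of a reward sequence backwards, so that each pass applies one more factor of γ; a minimal sketch (the function name and sample rewards are illustrative, not from the source):

```python
def discounted_return(rewards, gamma=0.99):
    """Compute G_t = R_{t+1} + gamma*R_{t+2} + ... by folding from the last reward back."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g   # each pass applies one more factor of gamma
    return g

print(discounted_return([1.0, 0.0, 2.0], gamma=0.5))  # 1 + 0.5*0 + 0.25*2 = 1.5
```

Folding backwards avoids recomputing powers of γ for every time step.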
Key Components
Core Components of Reinforcement Learning Systems:

Agent: The learner and decision-maker (e.g., a robot, a game-playing AI, a trading algorithm)
Environment: The external world the agent interacts with (e.g., a maze, a chess board, the stock market)
State (s): A complete description of the environment's configuration (e.g., a board position in chess)
Action (a): A choice available to the agent (e.g., move a piece, buy/sell a stock)
Reward (r): The immediate feedback signal (e.g., points scored, profit/loss)
Policy (π): The agent's strategy mapping states to actions (e.g., "if in state X, take action Y")
Value Function: The expected long-term reward from a state (e.g., position evaluation in chess)
Model: The agent's representation of the environment's dynamics (e.g., the predicted next state and reward)
Value Functions
Value functions are central to reinforcement learning, estimating the expected return from states or state-action pairs:[1]
State-value function V^π(s): Expected return starting from state s and following policy π
Action-value function Q^π(s,a): Expected return from taking action a in state s, then following policy π
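The two functions are linked: the state value is the policy-weighted average of the action values, V^π(s) = Σ_a π(a|s) Q^π(s,a). A tiny numeric sketch (the action names, probabilities, and Q estimates are made up for illustration):

```python
# For one state s: V_pi(s) = sum_a pi(a|s) * Q_pi(s, a)
policy = {"left": 0.3, "right": 0.7}    # pi(a|s), made-up probabilities
q_values = {"left": 1.0, "right": 2.0}  # Q_pi(s, a), made-up estimates

v = sum(policy[a] * q_values[a] for a in policy)
print(round(v, 6))  # 1.7
```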
One fundamental challenge in reinforcement learning is the exploration-exploitation tradeoff.[2] The agent must balance:
Exploration: Trying new actions to discover potentially better strategies
Exploitation: Using current knowledge to maximize immediate rewards
Common strategies include ε-greedy (acting randomly with probability ε), upper confidence bound (UCB), and Thompson sampling.
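Of these, ε-greedy is the simplest to state in code; a minimal sketch (the function name and sample Q-values are illustrative):

```python
import random

def epsilon_greedy(q_values, epsilon=0.1, rng=random):
    """With probability epsilon take a uniformly random action; otherwise be greedy."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                      # explore
    return max(range(len(q_values)), key=lambda i: q_values[i])  # exploit

q = [0.2, 0.9, 0.5]                    # made-up action-value estimates for one state
print(epsilon_greedy(q, epsilon=0.0))  # 1 (pure exploitation picks the argmax)
```

In practice ε is often annealed toward zero so that exploration dominates early training and exploitation dominates later.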
Mathematical Foundations
Markov Decision Processes
Reinforcement learning problems are formally modeled as Markov Decision Processes (MDPs), defined by the tuple (S, A, P, R, γ):[8]
S: Finite set of states (state space)
A: Finite set of actions (action space)
P(s'|s,a): State transition probability function
R(s,a,s'): Reward function
γ: Discount factor (0 ≤ γ < 1)
The Markov property states that the future depends only on the current state, not on the sequence of events that preceded it: P(s_{t+1} | s_t, a_t, s_{t-1}, ..., s_0) = P(s_{t+1} | s_t, a_t)
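The (S, A, P, R, γ) tuple maps directly onto simple data structures, and the dynamic-programming connection mentioned earlier can be made concrete with value iteration, which repeatedly applies the Bellman optimality backup. A toy sketch (the two-state MDP and all its numbers are invented for illustration):

```python
# Toy MDP: P maps (state, action) to a list of (probability, next_state, reward).
P = {
    ("s0", "a0"): [(1.0, "s1", 1.0)],
    ("s0", "a1"): [(1.0, "s0", 0.0)],
    ("s1", "a0"): [(1.0, "s0", 0.0)],
    ("s1", "a1"): [(1.0, "s1", 2.0)],
}
states, actions, gamma = ["s0", "s1"], ["a0", "a1"], 0.9

# Value iteration: V(s) = max_a sum_{s'} p * (r + gamma * V(s')), applied to convergence.
V = {s: 0.0 for s in states}
for _ in range(500):
    V = {
        s: max(
            sum(p * (r + gamma * V[s2]) for p, s2, r in P[(s, a)])
            for a in actions
        )
        for s in states
    }
print({s: round(v, 2) for s, v in V.items()})  # {'s0': 19.0, 's1': 20.0}
```

Here s1's self-loop pays 2 per step, so V(s1) converges to 2/(1−γ) = 20, and s0 is worth one step's reward plus the discounted value of reaching s1.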
Q-Learning

Q-learning is a model-free, off-policy algorithm that learns the optimal action-value function directly from experience.[9] Its update rule is:

Q(s_t, a_t) ← Q(s_t, a_t) + α[r_{t+1} + γ max_{a'} Q(s_{t+1}, a') − Q(s_t, a_t)]

where α is the learning rate. Q-learning converges to the optimal Q-function with probability 1 under certain conditions.
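The tabular Q-learning update can be exercised on a toy problem; a minimal sketch (the two-step deterministic chain, state names, and hyperparameters are illustrative):

```python
def q_learning_step(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One tabular update toward the TD target r + gamma * max_a' Q(s', a')."""
    td_target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])

# Tiny deterministic chain: "go" moves s0 -> s1 (reward 0), then s1 -> goal (reward 1).
actions = ["go"]
Q = {(s, "go"): 0.0 for s in ["s0", "s1", "goal"]}
for _ in range(300):
    q_learning_step(Q, "s0", "go", 0.0, "s1", actions)
    q_learning_step(Q, "s1", "go", 1.0, "goal", actions)
print(round(Q[("s0", "go")], 2), round(Q[("s1", "go")], 2))  # 0.9 1.0
```

Note how the value of the terminal reward propagates one step backwards per update: Q(s1) approaches 1, and Q(s0) approaches γ times that.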
Deep Q-Networks (DQN)
Deep Q-Networks revolutionized RL by using deep neural networks to approximate Q-values for high-dimensional state spaces.[10] Key innovations include:
Experience replay: Stores transitions in buffer and samples randomly for training
Target network: Separate network for computing target values, updated periodically
DQN achieved human-level performance on 29 of 49 Atari games using only raw pixel inputs.
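Experience replay can be sketched as a fixed-capacity buffer sampled uniformly at random (the class name, field layout, and capacity are illustrative, not the DQN authors' implementation):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity transition store; old transitions are evicted FIFO."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks the correlation between consecutive steps.
        return random.sample(self.buffer, batch_size)

buf = ReplayBuffer(capacity=3)
for i in range(5):          # push 5 transitions; only the newest 3 are kept
    buf.push(i, 0, 1.0, i + 1, False)
print(len(buf.buffer))      # 3
```

Training on randomly drawn minibatches rather than the latest trajectory is what makes the gradient updates closer to i.i.d., which stabilizes learning.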
Policy Gradient Methods
Policy gradient methods directly optimize parameterized policies by gradient ascent on expected return.[11] The REINFORCE algorithm updates policy parameters θ using:
∇_θ J(θ) ≈ Σ_t G_t ∇_θ log π_θ(a_t|s_t)
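For a softmax policy over discrete actions, the score ∇_θ log π_θ(a|s) with respect to the logits has the closed form (one-hot of the taken action) minus the action probabilities; a sketch with made-up numbers:

```python
import math

def softmax(logits):
    m = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_grad(logits, action, G):
    """Gradient of G * log pi(action) w.r.t. the softmax logits: G * (one_hot - pi)."""
    pi = softmax(logits)
    return [G * ((1.0 if i == action else 0.0) - p) for i, p in enumerate(pi)]

print(reinforce_grad(logits=[0.0, 0.0], action=1, G=2.0))  # [-1.0, 1.0]
```

A positive return G pushes probability toward the taken action; a negative return pushes it away, which is exactly the trial-and-error credit assignment REINFORCE performs.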
Proximal Policy Optimization (PPO)
Proximal Policy Optimization constrains policy updates to prevent catastrophic performance drops.[12] PPO optimizes a clipped surrogate objective:

L^CLIP(θ) = E_t[ min( r_t(θ) Â_t, clip(r_t(θ), 1 − ε, 1 + ε) Â_t ) ]

where r_t(θ) = π_θ(a_t|s_t) / π_θold(a_t|s_t) is the probability ratio between the new and old policies and Â_t is an estimate of the advantage at time t.
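The per-step clipping term of PPO's surrogate objective is simple enough to sketch directly (the function name and sample numbers are illustrative):

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Per-step clipped surrogate: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped * advantage)

# A large policy change earns no extra credit when the advantage is positive...
print(ppo_clip_objective(ratio=1.5, advantage=1.0))   # 1.2 (capped at 1 + eps)
# ...but the objective stays pessimistic when the advantage is negative:
print(ppo_clip_objective(ratio=1.5, advantage=-1.0))  # -1.5
```

The outer min makes the objective a pessimistic bound: the ratio is clipped only in the direction that would otherwise inflate the objective, removing the incentive to move the policy far from the data-collecting policy in a single update.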
RT-1/RT-2: Vision-language-action models for robotics
Pre-trained world models from internet-scale data
References
Sutton, R. S., & Barto, A. G. (2018). *Reinforcement Learning: An Introduction* (2nd ed.). MIT Press.
Kaelbling, L. P., Littman, M. L., & Moore, A. W. (1996). "Reinforcement learning: A survey." *Journal of Artificial Intelligence Research*, 4, 237-285.
Sutton, R. S., & Barto, A. G. (1998). *Reinforcement Learning: An Introduction* (1st ed.). MIT Press.
Bellman, R. (1957). *Dynamic Programming*. Princeton University Press.
Silver, D., et al. (2016). "Mastering the game of Go with deep neural networks and tree search." *Nature*, 529(7587), 484-489.
OpenAI. (2019). "OpenAI Five defeats Dota 2 world champions." OpenAI Blog.
ACM. (2024). "ACM A.M. Turing Award recognizes pioneers of reinforcement learning."
Pavlov, I. P. (1927). *Conditioned Reflexes: An Investigation of the Physiological Activity of the Cerebral Cortex*. Oxford University Press.
Thorndike, E. L. (1911). *Animal Intelligence: Experimental Studies*. Macmillan.
Bellman, R. (1957). "A Markovian decision process." *Journal of Mathematics and Mechanics*, 6(5), 679-684.
Sutton, R. S. (1988). "Learning to predict by the methods of temporal differences." *Machine Learning*, 3(1), 9-44.
Sutton, R. S., & Barto, A. G. (1998). *Reinforcement Learning: An Introduction*. MIT Press.
Puterman, M. L. (1994). *Markov Decision Processes: Discrete Stochastic Dynamic Programming*. Wiley.
Sutton, R. S. (1990). "Integrated architectures for learning, planning, and reacting based on approximating dynamic programming." *Proceedings of the 7th International Conference on Machine Learning*.
Hafner, D., et al. (2023). "Mastering diverse domains through world models." *arXiv:2301.04104*.
Watkins, C. J. C. H., & Dayan, P. (1992). "Q-learning." *Machine Learning*, 8(3), 279-292.
Rummery, G. A., & Niranjan, M. (1994). "On-line Q-learning using connectionist systems." *Technical Report CUED/F-INFENG/TR 166*, Cambridge University.
Mnih, V., et al. (2015). "Human-level control through deep reinforcement learning." *Nature*, 518(7540), 529-533.
Williams, R. J. (1992). "Simple statistical gradient-following algorithms for connectionist reinforcement learning." *Machine Learning*, 8(3), 229-256.
Konda, V. R., & Tsitsiklis, J. N. (2000). "Actor-critic algorithms." *Advances in Neural Information Processing Systems*, 12.
Mnih, V., et al. (2016). "Asynchronous methods for deep reinforcement learning." *Proceedings of the 33rd International Conference on Machine Learning*.
Lillicrap, T. P., et al. (2015). "Continuous control with deep reinforcement learning." *arXiv:1509.02971*.
Fujimoto, S., Hoof, H., & Meger, D. (2018). "Addressing function approximation error in actor-critic methods." *Proceedings of the 35th International Conference on Machine Learning*.
Schulman, J., et al. (2017). "Proximal policy optimization algorithms." *arXiv:1707.06347*.
Haarnoja, T., et al. (2018). "Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor." *Proceedings of the 35th International Conference on Machine Learning*.
Tesauro, G. (1995). "Temporal difference learning and TD-Gammon." *Communications of the ACM*, 38(3), 58-68.
Silver, D., et al. (2017). "Mastering the game of Go without human knowledge." *Nature*, 550(7676), 354-359.
Silver, D., et al. (2018). "A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play." *Science*, 362(6419), 1140-1144.
Vinyals, O., et al. (2019). "Grandmaster level in StarCraft II using multi-agent reinforcement learning." *Nature*, 575(7782), 350-354.
Christiano, P. F., et al. (2017). "Deep reinforcement learning from human preferences." *Advances in Neural Information Processing Systems*, 30.
Ouyang, L., et al. (2022). "Training language models to follow instructions with human feedback." *Advances in Neural Information Processing Systems*, 35.
Bai, Y., et al. (2022). "Constitutional AI: Harmlessness from AI feedback." *arXiv:2212.08073*.
DeepSeek-AI. (2025). "DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning." *arXiv:2501.12948*.
Brown, N., & Sandholm, T. (2019). "Superhuman AI for multiplayer poker." *Science*, 365(6456), 885-890.
FAIR et al. (2022). "Human-level play in the game of Diplomacy by combining language models with strategic reasoning." *Science*, 378(6624), 1067-1074.
OpenAI et al. (2019). "Solving Rubik's Cube with a robot hand." *arXiv:1910.07113*.
Komorowski, M., et al. (2018). "The Artificial Intelligence Clinician learns optimal treatment strategies for sepsis in intensive care." *Nature Medicine*, 24(11), 1716-1720.
Evans, R., & Gao, J. (2016). "DeepMind AI reduces Google data centre cooling bill by 40%." DeepMind Blog.
Busoniu, L., Babuska, R., & De Schutter, B. (2008). "A comprehensive survey of multiagent reinforcement learning." *IEEE Transactions on Systems, Man, and Cybernetics*, 38(2), 156-172.
Albrecht, S. V., Christianos, F., & Schafer, L. (2024). *Multi-Agent Reinforcement Learning: Foundations and Modern Approaches*. MIT Press.
Dulac-Arnold, G., et al. (2019). "Challenges of real-world reinforcement learning." *arXiv:1904.12901*.
Amodei, D., et al. (2016). "Concrete problems in AI safety." *arXiv:1606.06565*.
Zhao, W., et al. (2020). "Sim-to-real transfer in deep reinforcement learning for robotics: A survey." *arXiv:2009.13303*.
Levine, S., et al. (2020). "Offline reinforcement learning: Tutorial, review, and perspectives on open problems." *arXiv:2005.01643*.
Reed, S., et al. (2022). "A generalist agent." *arXiv:2205.06175*.