Reinforcement learning


Reinforcement learning (RL) is an area of machine learning concerned with how intelligent agents ought to take actions in an environment in order to maximize cumulative reward.[1] Reinforcement learning is one of the three basic machine learning paradigms, alongside supervised learning and unsupervised learning.[2]

Unlike supervised learning, which requires labeled input/output pairs, and unlike unsupervised learning, which focuses on finding hidden structure in unlabeled data, reinforcement learning focuses on balancing exploration (of uncharted territory) and exploitation (of current knowledge) through trial-and-error interaction with an environment.[3] The environment is typically formulated as a Markov decision process (MDP), because many reinforcement learning algorithms use dynamic programming techniques.[4]

Overview

Reinforcement learning achieved widespread recognition through several landmark achievements. In 2016, DeepMind's AlphaGo defeated world champion Lee Sedol in the complex game of Go[5], a feat previously thought to be decades away. In 2019, OpenAI Five defeated the reigning world champion team in Dota 2[6], demonstrating RL's ability to handle complex team-based strategy games.

The field emerged from the convergence of multiple intellectual traditions. The psychology of animal learning, beginning with Edward Thorndike's Law of Effect in 1911, established that behaviors followed by satisfying consequences tend to be repeated. The mathematical framework came from optimal control theory and Richard Bellman's development of dynamic programming in the 1950s. These threads were unified in the modern field through the work of Richard Sutton and Andrew Barto, who received the 2024 Turing Award for their foundational contributions.[7]

Core Concepts

Agent-Environment Interaction

Reinforcement learning problems involve an agent interacting with an environment through a cycle of observation, action, and reward.[3] At each discrete time step t:

  1. The agent observes the current state s_t of the environment
  2. Based on its policy π, the agent selects an action a_t
  3. The environment transitions to a new state s_{t+1} according to the transition probabilities P(s'|s,a)
  4. The agent receives a scalar reward r_{t+1} indicating the immediate benefit of that action

The agent's objective is to learn a policy that maximizes the expected return (cumulative reward), typically discounted by factor γ (gamma) where 0 ≤ γ ≤ 1:[1]

G_t = R_{t+1} + γR_{t+2} + γ²R_{t+3} + ... = Σ_{k=0}^{∞} γ^k R_{t+k+1}
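This interaction loop maps directly onto the Gymnasium API (env.reset, env.step), the de facto environment standard listed in the frameworks section below. The following minimal sketch rolls out one episode with a random stand-in policy and accumulates the discounted return G_0; the environment name and the discount factor are illustrative placeholders, not part of the formal definition.

  import gymnasium as gym

  env = gym.make("CartPole-v1")        # placeholder environment
  obs, info = env.reset(seed=0)        # agent observes the initial state s_0

  gamma, ret, discount = 0.99, 0.0, 1.0
  terminated = truncated = False
  while not (terminated or truncated):
      action = env.action_space.sample()                            # stand-in for a learned policy pi(a|s)
      obs, reward, terminated, truncated, info = env.step(action)   # environment returns s_{t+1}, r_{t+1}
      ret += discount * reward                                      # accumulate G_0 = sum_k gamma^k r_{k+1}
      discount *= gamma
  env.close()
  print(f"discounted return G_0 = {ret:.2f}")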

Key Components

Core Components of Reinforcement Learning Systems
Component | Description | Example
Agent | The learner and decision-maker | Robot, game-playing AI, trading algorithm
Environment | External world the agent interacts with | Maze, chess board, stock market
State (s) | Complete description of the environment's configuration | Board position in chess
Action (a) | Choice available to the agent | Move a piece, buy/sell a stock
Reward (r) | Immediate feedback signal | Points scored, profit/loss
Policy (π) | Agent's strategy mapping states to actions | "If in state X, take action Y"
Value function | Expected long-term reward from a state | Position evaluation in chess
Model | Agent's representation of environment dynamics | Predicted next state and reward

Value Functions

Value functions are central to reinforcement learning, estimating the expected return from states or state-action pairs:[1]

  • State-value function V^π(s): Expected return starting from state s and following policy π
  • Action-value function Q^π(s,a): Expected return from taking action a in state s, then following policy π

The optimal value functions satisfy the Bellman optimality equations:[4]

  • V*(s) = max_a Σ_{s'} P(s'|s,a) [R(s,a,s') + γ V*(s')]
  • Q*(s,a) = Σ_{s'} P(s'|s,a) [R(s,a,s') + γ max_{a'} Q*(s',a')]
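For a small MDP whose transition and reward functions are fully known, these equations can be solved by value iteration, i.e. by applying the Bellman optimality backup until the value estimates stop changing. The NumPy sketch below uses a made-up two-state, two-action MDP purely for illustration.

  import numpy as np

  # Toy MDP (illustrative values only): P[s, a, s'] is a transition probability,
  # R[s, a, s'] the corresponding reward.
  P = np.array([[[0.8, 0.2], [0.1, 0.9]],
                [[0.5, 0.5], [0.0, 1.0]]])
  R = np.array([[[1.0, 0.0], [0.0, 2.0]],
                [[0.0, 0.0], [0.0, 1.0]]])
  gamma = 0.9

  V = np.zeros(2)
  for _ in range(1000):
      # Q(s,a) = sum_{s'} P(s'|s,a) [R(s,a,s') + gamma * V(s')]
      Q = np.einsum("sat,sat->sa", P, R + gamma * V)   # V broadcasts over s'
      V_new = Q.max(axis=1)                            # V*(s) = max_a Q*(s,a)
      if np.max(np.abs(V_new - V)) < 1e-8:             # stop when the backup no longer changes V
          break
      V = V_new

  print("V* ≈", V, " greedy policy:", Q.argmax(axis=1))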

Exploration vs. Exploitation

One fundamental challenge in reinforcement learning is the exploration-exploitation tradeoff.[2] The agent must balance:

  • Exploration: Trying new actions to discover potentially better strategies
  • Exploitation: Using current knowledge to maximize immediate rewards

Common strategies include ε-greedy (acting randomly with probability ε), upper confidence bound (UCB), and Thompson sampling.
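As a concrete illustration, ε-greedy needs only a few lines once action-value estimates are available; the Q-values below are hypothetical placeholders.

  import numpy as np

  rng = np.random.default_rng(0)

  def epsilon_greedy(q_row, epsilon=0.1):
      """Pick a random action with probability epsilon, otherwise the greedy one.

      q_row: 1-D array of action-value estimates for the current state."""
      if rng.random() < epsilon:
          return int(rng.integers(len(q_row)))   # explore: uniform random action
      return int(np.argmax(q_row))               # exploit: current best estimate

  Q = np.array([0.5, 1.2, 0.3])                  # hypothetical Q-values for one state
  action = epsilon_greedy(Q, epsilon=0.1)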

Mathematical Foundations

Markov Decision Processes

Reinforcement learning problems are formally modeled as Markov Decision Processes (MDPs), defined by the tuple (S, A, P, R, γ):[8]

  • S: Finite set of states (state space)
  • A: Finite set of actions (action space)
  • P(s'|s,a): State transition probability function
  • R(s,a,s'): Reward function
  • γ: Discount factor (0 ≤ γ < 1)

The Markov property states that the future depends only on the current state, not on the sequence of events that preceded it: P(s_{t+1} | s_t, a_t, s_{t-1}, ..., s_0) = P(s_{t+1} | s_t, a_t)

Algorithm Taxonomy

Taxonomy of Reinforcement Learning Algorithms
Category | Description | Examples
Model-based vs. model-free | Whether the agent learns the environment dynamics | Model-based: Dyna-Q, AlphaZero; Model-free: Q-learning, PPO
Value-based vs. policy-based | What the agent learns | Value-based: Q-learning, DQN; Policy-based: REINFORCE, PPO
On-policy vs. off-policy | Source of the learning data | On-policy: SARSA, A2C; Off-policy: Q-learning, DQN
Tabular vs. function approximation | State representation method | Tabular: classic Q-learning; Function approximation: deep RL

Key Algorithms

Q-Learning

Q-learning is a model-free, off-policy algorithm that learns the optimal action-value function.[9] The update rule is:

Q(s,a) ← Q(s,a) + α [r + γ max_{a'} Q(s',a') − Q(s,a)]

where α is the learning rate. Q-learning converges to the optimal Q-function with probability 1, provided every state-action pair is visited infinitely often and the learning rates satisfy the standard stochastic-approximation conditions.
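A minimal tabular implementation of this update, written against the Gymnasium API with illustrative hyperparameters, could look like the following sketch (the ε-greedy behaviour policy mirrors the exploration strategy described earlier):

  import numpy as np
  import gymnasium as gym

  env = gym.make("FrozenLake-v1")                      # small discrete state and action spaces
  Q = np.zeros((env.observation_space.n, env.action_space.n))
  alpha, gamma, epsilon = 0.1, 0.99, 0.1               # illustrative hyperparameters
  rng = np.random.default_rng(0)

  for episode in range(5000):
      s, _ = env.reset()
      done = False
      while not done:
          # epsilon-greedy behaviour policy
          a = env.action_space.sample() if rng.random() < epsilon else int(np.argmax(Q[s]))
          s_next, r, terminated, truncated, _ = env.step(a)
          done = terminated or truncated
          # off-policy target: max over next actions, regardless of the action taken next
          target = r + gamma * (0.0 if terminated else np.max(Q[s_next]))
          Q[s, a] += alpha * (target - Q[s, a])
          s = s_next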

Deep Q-Networks (DQN)

Deep Q-Networks revolutionized RL by using deep neural networks to approximate Q-values for high-dimensional state spaces.[10] Key innovations include:

  • Experience replay: Stores transitions in buffer and samples randomly for training
  • Target network: Separate network for computing target values, updated periodically

DQN achieved human-level performance on 29 of 49 Atari games using only raw pixel inputs.
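Both innovations can be sketched in a few dozen lines independently of the original convolutional architecture. The PyTorch fragment below uses a deliberately tiny fully connected network as a stand-in; all sizes and hyperparameters are illustrative, not those of the published DQN.

  import random
  from collections import deque

  import torch
  import torch.nn as nn

  obs_dim, n_actions, gamma = 4, 2, 0.99                 # illustrative sizes
  q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
  target_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
  target_net.load_state_dict(q_net.state_dict())         # target network starts as a copy
  optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

  replay = deque(maxlen=100_000)                          # experience replay buffer

  def store(s, a, r, s_next, done):
      replay.append((s, a, r, s_next, done))

  def train_step(batch_size=32):
      batch = random.sample(replay, batch_size)           # random sampling breaks temporal correlations
      s, a, r, s_next, done = map(torch.tensor, zip(*batch))
      s, s_next = s.float(), s_next.float()
      q = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)   # Q(s, a) for the actions taken
      with torch.no_grad():                                # targets come from the frozen target network
          target = r.float() + gamma * (1 - done.float()) * target_net(s_next).max(dim=1).values
      loss = nn.functional.mse_loss(q, target)
      optimizer.zero_grad()
      loss.backward()
      optimizer.step()

  def sync_target():                                       # called every N training steps
      target_net.load_state_dict(q_net.state_dict())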

Policy Gradient Methods

Policy gradient methods directly optimize parameterized policies by gradient ascent on expected return.[11] The REINFORCE algorithm updates policy parameters θ using:

∇_θ J(θ) ≈ Σ_t G_t ∇_θ log π_θ(a_t|s_t)
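A compact PyTorch rendering of this update for a discrete-action policy is sketched below; the returns-to-go computation and the single gradient step per episode follow the formula directly, and the network size is illustrative.

  import torch
  import torch.nn as nn
  from torch.distributions import Categorical

  obs_dim, n_actions, gamma = 4, 2, 0.99
  policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
  optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

  def reinforce_update(states, actions, rewards):
      """One REINFORCE step from a single episode (three lists of equal length)."""
      # Returns-to-go: G_t = r_{t+1} + gamma * G_{t+1}, computed backwards
      returns, G = [], 0.0
      for r in reversed(rewards):
          G = r + gamma * G
          returns.insert(0, G)
      states = torch.as_tensor(states, dtype=torch.float32)
      actions = torch.as_tensor(actions)
      returns = torch.as_tensor(returns, dtype=torch.float32)

      log_probs = Categorical(logits=policy(states)).log_prob(actions)
      loss = -(returns * log_probs).sum()     # ascent on sum_t G_t * grad log pi(a_t|s_t)
      optimizer.zero_grad()
      loss.backward()
      optimizer.step()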

Proximal Policy Optimization (PPO)

Proximal Policy Optimization constrains policy updates to prevent catastrophic performance drops.[12] PPO optimizes a clipped surrogate objective:

L^CLIP(θ) = E_t[ min( r_t(θ) Â_t, clip(r_t(θ), 1−ε, 1+ε) Â_t ) ]

where r_t(θ) = π_θ(a_t|s_t) / π_{θ_old}(a_t|s_t) and ε is typically 0.2.
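The clipped objective itself reduces to a few lines of tensor code. The sketch below assumes that log-probabilities under the old policy and advantage estimates Â_t have already been computed for a batch of transitions; the function name and arguments are illustrative.

  import torch

  def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
      """Negative clipped surrogate objective (to be minimized by gradient descent)."""
      ratio = torch.exp(new_log_probs - old_log_probs)          # r_t(theta) = pi_theta / pi_theta_old
      unclipped = ratio * advantages
      clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
      return -torch.min(unclipped, clipped).mean()               # maximize L^CLIP => minimize its negative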

Actor-Critic Methods

Actor-critic algorithms combine value-based and policy-based approaches:[13]

  • Actor: Policy network that selects actions
  • Critic: Value network that evaluates actions

Examples include A2C, A3C, SAC, and DDPG.
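In code, the two roles typically become two networks (or two heads) and two loss terms. The sketch below performs a single advantage-actor-critic-style update from one transition, using the one-step TD error as the advantage estimate; sizes and hyperparameters are illustrative.

  import torch
  import torch.nn as nn
  from torch.distributions import Categorical

  obs_dim, n_actions, gamma = 4, 2, 0.99
  actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
  critic = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))
  optimizer = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=1e-3)

  def actor_critic_step(s, a, r, s_next, done):
      """One-step update from a single transition (s, s_next: length-obs_dim sequences)."""
      s = torch.as_tensor(s, dtype=torch.float32)
      s_next = torch.as_tensor(s_next, dtype=torch.float32)
      v, v_next = critic(s).squeeze(), critic(s_next).squeeze()
      td_target = r + gamma * (1.0 - float(done)) * v_next.detach()
      advantage = td_target - v                                   # TD error as advantage estimate
      critic_loss = advantage.pow(2)                              # critic: regress V(s) toward the TD target
      log_prob = Categorical(logits=actor(s)).log_prob(torch.as_tensor(a))
      actor_loss = -log_prob * advantage.detach()                 # actor: policy gradient weighted by advantage
      optimizer.zero_grad()
      (actor_loss + critic_loss).backward()
      optimizer.step()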

Algorithm Comparison

Comparison of Major RL Algorithms
Algorithm | Type | Year | Key Innovation | Best For | Sample Efficiency
Q-Learning | Value, off-policy | 1989 | Model-free optimal control | Tabular tasks | Low
SARSA | Value, on-policy | 1994 | On-policy TD control | Safe learning | Low
DQN | Value, off-policy | 2013 | Deep RL with replay buffer | Discrete actions, visual input | Medium
DDPG | Actor-critic, off-policy | 2015 | Continuous-action DQN | Continuous control | Medium
TRPO | Policy, on-policy | 2015 | Trust region constraints | Stable learning | Low
A3C | Actor-critic, on-policy | 2016 | Asynchronous parallel training | CPU-based training | Low
PPO | Policy, on-policy | 2017 | Clipped surrogate objective | General purpose | Low
SAC | Actor-critic, off-policy | 2018 | Maximum entropy RL | Continuous control | High
TD3 | Actor-critic, off-policy | 2018 | Twin critics, delayed updates | Continuous control | High
AlphaZero | Model-based | 2017 | Self-play with MCTS | Perfect-information games | Very High
MuZero | Model-based | 2020 | Learned latent dynamics | Games without known rules | Very High

Historical Milestones

Key Milestones in Reinforcement Learning History
Year | Milestone | Key Contributor(s) | Significance
1911 | Law of Effect | Edward Thorndike | Established the principle that rewarded actions are reinforced
1950s | Dynamic programming and the Bellman equation | Richard Bellman | Mathematical framework for sequential decision-making
1959 | Checkers program | Arthur Samuel | First self-learning game program; coined the term "machine learning"
1963 | MENACE | Donald Michie | Matchbox machine that learned tic-tac-toe
1988 | Temporal-difference learning | Richard Sutton | Unified Monte Carlo and dynamic programming methods
1989 | Q-learning | Christopher Watkins | Model-free off-policy control algorithm
1992 | TD-Gammon | Gerald Tesauro | Achieved world-class backgammon performance
1992 | REINFORCE algorithm | Ronald Williams | Fundamental policy gradient algorithm
1998 | "Reinforcement Learning: An Introduction" | Richard Sutton, Andrew Barto | Seminal textbook defining the field
2013-2015 | Deep Q-Networks | DeepMind | Deep RL breakthrough on Atari games
2016 | AlphaGo defeats Lee Sedol | DeepMind | First AI to defeat a world champion at Go
2017 | AlphaGo Zero | DeepMind | Learned Go from scratch through self-play
2018 | AlphaZero | DeepMind | Mastered chess, shogi, and Go within 24 hours of self-play training
2019 | OpenAI Five | OpenAI | Defeated the Dota 2 world champions
2019 | AlphaStar | DeepMind | Achieved Grandmaster level in StarCraft II
2024 | Turing Award | Richard Sutton, Andrew Barto | Recognition for laying the foundations of RL

Applications

Game Playing

Reinforcement learning has achieved superhuman performance in numerous games:

  • Backgammon: TD-Gammon reached world-class play through self-play (1992)
  • Atari: Deep Q-Networks learned dozens of games directly from raw pixels
  • Go: AlphaGo and AlphaGo Zero defeated the world's strongest human players
  • Chess, shogi, and Go: AlphaZero mastered all three through self-play[14]
  • StarCraft II: AlphaStar reached Grandmaster level[15]
  • Dota 2: OpenAI Five defeated the world champion team

Robotics

RL enables robots to learn complex motor skills through trial and error:

  • Locomotion: Boston Dynamics robots use RL for walking and navigation[16]
  • Manipulation: Robotic hands solving Rubik's Cube, grasping diverse objects
  • Assembly: Industrial robots learning assembly sequences
  • Sim-to-Real Transfer: Training in simulation before real-world deployment

Autonomous Vehicles

Self-driving cars employ RL for:

  • Path planning and trajectory optimization
  • Lane changing and merging decisions
  • Adaptive cruise control
  • Traffic light negotiation

Waymo reports more than 20 million rider-only autonomous miles driven.[17]

Healthcare

RL applications in medicine include:

  • Treatment Optimization: Dynamic treatment regimes for chronic diseases[18]
  • Drug Discovery: Molecular design and optimization
  • Personalized Medicine: Adaptive clinical trials
  • Resource Allocation: ICU bed management, staff scheduling

Finance and Trading

Financial applications include:

  • Algorithmic Trading: Automated trading strategies[19]
  • Portfolio Management: Dynamic asset allocation
  • Risk Management: Credit scoring, fraud detection
  • Market Making: Liquidity provision strategies

Energy and Sustainability

  • Data Center Cooling: DeepMind's RL-based system reduced the energy used to cool Google's data centers by up to 40%[20]
  • Smart Grids: Load balancing and demand response
  • Wind Farms: Turbine control optimization
  • Building Management: HVAC system optimization

Natural Language Processing

Reinforcement learning from human feedback (RLHF) has become a standard technique for fine-tuning large language models to follow instructions and reflect human preferences, as in InstructGPT.[21] See the section on RLHF under Current Research Directions below.

Development Tools and Frameworks

Popular RL Development Frameworks
Framework | Language | GitHub Stars | Backend | Best For
OpenAI Gym / Gymnasium | Python | 35,000+ | Agnostic | Environment standard
Ray RLlib | Python | 33,000+ | Multiple | Production, distributed training
Stable-Baselines3 | Python | 9,000+ | PyTorch | Reliable implementations
Unity ML-Agents | C#/Python | 17,000+ | PyTorch | 3D/VR/AR simulation
TorchRL | Python | 2,300+ | PyTorch | Research flexibility
TF-Agents | Python | 2,800+ | TensorFlow | TensorFlow ecosystem
Tianshou | Python | 7,800+ | PyTorch | Modular design
ACME | Python | 3,400+ | JAX/TF | DeepMind research
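As a brief illustration of how such libraries are used in practice, the snippet below trains PPO on a classic control task with Stable-Baselines3 and Gymnasium (both assumed to be installed; the environment and step budget are arbitrary):

  import gymnasium as gym
  from stable_baselines3 import PPO

  env = gym.make("CartPole-v1")
  model = PPO("MlpPolicy", env, verbose=1)     # clipped-objective PPO with a small MLP policy
  model.learn(total_timesteps=50_000)          # arbitrary training budget

  obs, _ = env.reset()
  for _ in range(200):                         # run the trained policy for a few steps
      action, _ = model.predict(obs, deterministic=True)
      obs, reward, terminated, truncated, _ = env.step(action)
      if terminated or truncated:
          obs, _ = env.reset()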

Simulation Environments

Most RL agents are trained in simulation before any real-world deployment. The Gymnasium interface listed above serves as the common standard for benchmark environments, while toolkits such as Unity ML-Agents provide richer 3D simulation; in robotics, this simulation-first workflow underlies the sim-to-real transfer approach mentioned earlier.

Challenges and Limitations

Sample Inefficiency

RL algorithms often require millions of interactions to learn:[22]

  • DQN: 200 million frames for Atari (equivalent to 924 hours of human play)
  • OpenAI Five: 45,000 years of Dota 2 gameplay
  • AlphaGo Zero: 4.9 million self-play games

Solutions include model-based RL, transfer learning, and curriculum learning.

Exploration Challenges

Effective exploration remains difficult in:

  • Sparse Reward Environments: Where rewards are rare
  • Large State Spaces: Exponential growth of possibilities
  • Safety-Critical Domains: Where exploration risks catastrophic failure

Approaches include curiosity-driven learning, intrinsic motivation, and safe exploration.

Reward Specification

Designing appropriate reward functions is challenging:[23]

  • Reward Hacking: Agents exploit unintended loopholes
  • Reward Shaping: Manual engineering is difficult and error-prone
  • Multi-Objective Optimization: Balancing competing goals

Solutions include inverse reinforcement learning, preference learning, and reward modeling.

Generalization and Transfer

RL agents often fail to generalize:

  • Domain Shift: Performance degrades in new environments
  • Sim-to-Real Gap: Policies trained in simulation fail in reality
  • Catastrophic Forgetting: Learning new tasks overwrites old knowledge

Research areas include meta-learning, domain randomization, and continual learning.

Interpretability and Safety

  • Black Box Policies: Neural networks lack interpretability
  • Verification Challenges: Difficult to prove safety guarantees
  • Adversarial Vulnerabilities: Susceptible to adversarial attacks
  • Alignment Problem: Ensuring AI goals align with human values

Current Research Directions

Offline Reinforcement Learning

Offline RL learns from fixed datasets without environment interaction:[24]

  • Conservative Q-Learning (CQL)
  • Implicit Q-Learning (IQL)
  • Decision Transformers
  • Applications in healthcare and robotics where exploration is expensive

Multi-Agent Reinforcement Learning

Multi-agent RL addresses scenarios with multiple learning agents:[25]

  • Cooperative: Team coordination and communication
  • Competitive: Game theory and Nash equilibria
  • Mixed: Social dilemmas and negotiation
  • Applications in autonomous driving, robotics swarms

Hierarchical Reinforcement Learning

Hierarchical RL decomposes complex tasks into subtasks:

  • Options framework for temporal abstraction
  • Goal-conditioned policies
  • Feudal networks
  • Applications in long-horizon planning

Model-Based Reinforcement Learning

Leveraging learned environment models for planning:[26]

  • World models and imagination
  • MuZero: Planning without knowing rules
  • Dreamer: Visual model-based RL
  • Differentiable physics simulators

Reinforcement Learning from Human Feedback (RLHF)

Aligning AI systems with human preferences:[27]

  • Preference modeling from comparisons
  • Constitutional AI for value alignment
  • AI feedback (RLAIF) for scalability
  • Applications in large language models
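As a concrete example of the first bullet, reward models for RLHF are commonly trained with a pairwise (Bradley-Terry style) loss over human comparisons; the sketch below shows only that loss term, with tensor names chosen for illustration.

  import torch.nn.functional as F

  def preference_loss(reward_chosen, reward_rejected):
      """Pairwise preference loss for reward-model training from human comparisons.

      reward_chosen / reward_rejected: reward-model scores for the preferred and
      dispreferred responses in a batch of comparisons (1-D tensors)."""
      # Maximize the log-probability that the preferred response receives the higher score
      return -F.logsigmoid(reward_chosen - reward_rejected).mean()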

Foundation Models and Transformers

Integration with large-scale pre-trained models:

  • Decision Transformers: RL as sequence modeling
  • Gato: Generalist agent for multiple domains
  • RT-1/RT-2: Vision-language-action models for robotics
  • Pre-trained world models from internet-scale data


References

  1. Sutton, R.S. and Barto, A.G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press.
  2. IBM. "What is reinforcement learning?" IBM Think Blog.
  3. Amazon Web Services. "What is Reinforcement Learning?" AWS Machine Learning Guide.
  4. Bellman, R. (1957). Dynamic Programming. Princeton University Press.
  5. Silver, D. et al. (2016). "Mastering the game of Go with deep neural networks and tree search." Nature, 529(7587):484–489.
  6. OpenAI. "OpenAI Five defeats Dota 2 world champions" (blog post, April 15, 2019).
  7. ACM. "Andrew Barto and Richard Sutton are the recipients of the 2024 ACM A.M. Turing Award".
  8. Puterman, M.L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley.
  9. Watkins, C.J.C.H. (1989). Learning from Delayed Rewards. Ph.D. thesis, University of Cambridge.
  10. Mnih, V. et al. (2015). "Human-level control through deep reinforcement learning." Nature, 518(7540):529–533.
  11. Williams, R.J. (1992). "Simple statistical gradient-following algorithms for connectionist reinforcement learning." Machine Learning, 8(3–4):229–256.
  12. Schulman, J. et al. (2017). "Proximal Policy Optimization Algorithms." arXiv preprint.
  13. Konda, V.R. and Tsitsiklis, J.N. (2000). "Actor-critic algorithms." Advances in Neural Information Processing Systems.
  14. Silver, D. et al. (2018). "A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play." Science, 362(6419):1140–1144.
  15. Vinyals, O. et al. (2019). "Grandmaster level in StarCraft II using multi-agent reinforcement learning." Nature, 575:350–354.
  16. Boston Dynamics. "Starting on the Right Foot with Reinforcement Learning".
  17. Waymo. "Waymo One surpasses 20 million rider-only miles".
  18. Liu, Y. et al. (2023). "Reinforcement Learning for Clinical Decision Support: A Comprehensive Survey." Medical Image Analysis.
  19. Hambly, B. et al. (2023). "Recent Advances in Reinforcement Learning in Finance." Mathematical Finance.
  20. DeepMind. "DeepMind AI Reduces Google Data Centre Cooling Bill by 40%".
  21. Ouyang, L. et al. (2022). "Training language models to follow instructions with human feedback." NeurIPS.
  22. Kaiser, L. et al. (2020). "Model-Based Reinforcement Learning for Atari." ICLR.
  23. Hadfield-Menell, D. et al. (2017). "The Off-Switch Game." IJCAI.
  24. Levine, S. et al. (2020). "Offline Reinforcement Learning: Tutorial, Review, and Perspectives." arXiv.
  25. Zhang, K. et al. (2021). "Multi-Agent Reinforcement Learning: A Selective Overview." Foundations and Trends in Machine Learning.
  26. Moerland, T.M. et al. (2023). "Model-based Reinforcement Learning: A Survey." Foundations and Trends in Machine Learning.
  27. Christiano, P. et al. (2017). "Deep Reinforcement Learning from Human Preferences." NeurIPS.
