Reinforcement learning (RL) is an area of machine learning in which an agent learns to make decisions by interacting with an environment, receiving numerical reward signals that indicate how well it is performing. Rather than being told the correct action (as in supervised learning) or discovering hidden patterns in unlabeled data (as in unsupervised learning), an RL agent improves its behavior through trial and error over time.[1] The agent's goal is to discover a policy that maximizes its cumulative reward over the long run.
Reinforcement learning is rooted in decades of research spanning psychology, control theory, and computer science. It is formalized using the Markov decision process (MDP) framework and has produced some of the most visible accomplishments in modern artificial intelligence, from defeating world champions at Go and StarCraft II to aligning large language models with human preferences through reinforcement learning from human feedback (RLHF).
Imagine you are training a dog. When the dog sits on command, you give it a treat. When it ignores you, it gets no treat. Over many repetitions, the dog figures out that sitting when told earns rewards, so it starts doing it more often. Reinforcement learning works the same way: a computer program (the "agent") tries different actions in some situation (the "environment"), gets points ("rewards") for good actions, and gradually learns which actions lead to the most points. Nobody tells the agent exactly what to do; it discovers the best strategy by experimenting and keeping track of what works.
Reinforcement learning problems are typically modeled as a Markov decision process (MDP). An MDP is defined by the tuple (S, A, P, R, gamma), where:[2]

- S is the set of states the environment can be in,
- A is the set of actions available to the agent,
- P(s' | s, a) is the probability of transitioning to state s' after taking action a in state s,
- R(s, a, s') is the reward received for that transition, and
- gamma in [0, 1] is the discount factor weighting future rewards.
The Markov property states that the future state depends only on the current state and action, not on the history of prior states: P(s_{t+1} | s_t, a_t, s_{t-1}, ..., s_0) = P(s_{t+1} | s_t, a_t). This memoryless property is what makes MDPs mathematically tractable. When the full state is not observable to the agent, the problem becomes a partially observable MDP (POMDP), which is significantly harder to solve.
At each time step t, the agent observes the current state s_t, selects an action a_t according to its policy, transitions to a new state s_{t+1}, and receives a reward r_{t+1}. The agent's objective is to maximize the expected return, defined as the discounted sum of future rewards:
G_t = r_{t+1} + gamma * r_{t+2} + gamma^2 * r_{t+3} + ... = Sum_{k=0}^{infinity} gamma^k * r_{t+k+1}
A discount factor close to 1 makes the agent far-sighted (it cares about rewards far in the future), while a discount factor close to 0 makes it myopic (it mostly cares about the next reward).
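Because the return obeys the recursion G_t = r_{t+1} + gamma * G_{t+1}, it can be computed in a single backward pass over a reward sequence. A minimal Python sketch (the function name and sample rewards are illustrative):

```python
def discounted_return(rewards, gamma):
    """Compute G_0 for a sequence of rewards r_1, r_2, ... r_T."""
    g = 0.0
    # Work backwards: each step applies one more factor of gamma.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [1.0, 1.0, 1.0]
print(discounted_return(rewards, 0.9))  # 1 + 0.9 + 0.81 = 2.71
print(discounted_return(rewards, 0.0))  # myopic: only the first reward, 1.0
```

With gamma = 0.9 each future reward is shrunk by a further factor of 0.9, while gamma = 0 discards everything after the first reward, matching the far-sighted/myopic distinction above.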
A policy, denoted pi, is the agent's strategy for choosing actions. It can be deterministic (pi(s) = a) or stochastic (pi(a | s) gives a probability distribution over actions for each state). The central goal of RL is to find an optimal policy pi* that maximizes the expected return from every state.
The state-value function V^pi(s) estimates the expected return when starting from state s and following policy pi thereafter. It answers the question: "How good is it to be in this state?"[1]
V^pi(s) = E[G_t | s_t = s, pi]
The action-value function Q^pi(s, a) estimates the expected return when starting from state s, taking action a, and then following policy pi. It answers: "How good is it to take this action in this state?"
Q^pi(s, a) = E[G_t | s_t = s, a_t = a, pi]
The optimal value functions, V*(s) and Q*(s, a), satisfy the Bellman equation, which provides a recursive decomposition:[2]
V*(s) = max_a Sum_{s'} P(s' | s, a) [R(s, a, s') + gamma * V*(s')]
Q*(s, a) = Sum_{s'} P(s' | s, a) [R(s, a, s') + gamma * max_{a'} Q*(s', a')]
These equations state that the value of a state (or state-action pair) equals the best available immediate reward plus the discounted value of the best reachable next state. They form the mathematical backbone of nearly all RL algorithms.
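The Bellman optimality equation suggests a simple algorithm: apply it repeatedly as an update until the values stop changing (value iteration). A sketch on a hypothetical two-state deterministic MDP, where state 1 is a goal that pays reward 1 per step (the environment is invented for illustration):

```python
gamma = 0.9
# Deterministic dynamics: dynamics[s][a] = (next_state, reward).
dynamics = {
    0: {0: (0, 0.0), 1: (1, 1.0)},  # from state 0: stay, or move to the goal
    1: {0: (1, 1.0)},               # the goal loops onto itself with reward 1
}

V = {0: 0.0, 1: 0.0}
for _ in range(200):  # iterate the Bellman optimality backup to a fixed point
    V = {s: max(r + gamma * V[s2] for (s2, r) in dynamics[s].values())
         for s in dynamics}

print(round(V[1], 3))  # -> 10.0, i.e. 1 / (1 - 0.9)
print(round(V[0], 3))  # -> 10.0, i.e. 1 + 0.9 * V(1)
```

The fixed point reproduces the geometric series from the return definition: an infinite stream of reward 1 discounted by gamma = 0.9 is worth 1 / (1 - gamma) = 10.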
The return G_t is the total discounted reward from time step t onward. Maximizing the expected return is the agent's objective. The discount factor gamma controls the horizon: with gamma = 0 the agent is purely greedy, while with gamma approaching 1 it plans far into the future.
One of the fundamental challenges in reinforcement learning is the exploration-exploitation tradeoff.[3] The agent must balance two competing goals:

- Exploitation: choosing the actions currently estimated to yield the highest reward.
- Exploration: trying other actions to gather information that may reveal better options.
An agent that only exploits risks getting stuck in a suboptimal policy, never discovering better options. An agent that only explores wastes time on actions it has already learned are poor. Several strategies have been developed to manage this tradeoff:
| Strategy | How it works | Strengths | Weaknesses |
|---|---|---|---|
| Epsilon-greedy | With probability epsilon, choose a random action; otherwise choose the greedy action | Simple to implement | Uniform random exploration is inefficient in large spaces |
| Epsilon decay | Start with a high epsilon and decrease it over time | Explores broadly early, exploits later | Requires tuning the decay schedule |
| Upper Confidence Bound (UCB) | Select the action with the highest upper confidence bound on its estimated value | Principled; balances uncertainty and value mathematically | Assumes bounded rewards; can be expensive for large action spaces |
| Thompson sampling | Sample action values from a posterior distribution and act greedily on the sample | Bayesian; naturally balances exploration without manual tuning | Computationally intensive for complex reward structures |
| Boltzmann (softmax) | Choose actions with probability proportional to exponentiated Q-values | Smooth exploration controlled by a temperature parameter | Temperature must be tuned; can be slow to converge |
| Curiosity-driven exploration | Provide intrinsic reward for visiting novel or surprising states | Effective in sparse-reward environments | Intrinsic reward design can introduce its own biases |
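The first two strategies in the table are simple enough to show directly. A minimal epsilon-greedy selector with epsilon decay (the Q-values, decay rate, and floor are illustrative):

```python
import random

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon pick a random action, else the greedy one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Epsilon decay: explore broadly early, exploit later.
epsilon, eps_min, decay = 1.0, 0.05, 0.995
for step in range(1000):
    action = epsilon_greedy([0.1, 0.5, 0.2], epsilon)
    epsilon = max(eps_min, epsilon * decay)

print(round(epsilon, 3))  # -> 0.05, decayed down to the floor
```

As the table notes, the decay schedule (here multiplicative with a floor) is a tuning choice; too fast and the agent stops exploring before it has seen enough of the environment.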
RL algorithms are broadly divided into model-free and model-based categories based on whether the agent learns or uses a model of the environment's dynamics.[4]
Model-free algorithms learn a policy or value function directly from experience, without building an explicit model of how the environment works. Q-learning, SARSA, DQN, and PPO are all model-free. These methods are conceptually simpler and do not suffer from model errors, but they typically require many more interactions with the environment to learn effectively.
Model-based algorithms learn (or are given) a model of the environment's transition dynamics and reward function, then use that model for planning. By simulating experience internally, model-based methods can be much more sample-efficient.
Key model-based approaches include:

- Dyna-style algorithms, which interleave learning from real experience with learning from experience simulated by the model.
- Model predictive control (MPC), which uses the model to plan a short sequence of actions and replans at every step.
- MuZero-style methods, which learn a latent dynamics model and plan with Monte Carlo tree search.
The main risk of model-based methods is that inaccurate models can lead to compounding errors in planning, causing the agent to learn poor policies. Techniques like ensemble models and uncertainty estimation help mitigate this issue.
Q-learning, introduced by Christopher Watkins in 1989, is one of the most foundational RL algorithms.[8] It is model-free and off-policy, meaning it can learn about the optimal policy while following an exploratory one. The update rule is:
Q(s, a) <- Q(s, a) + alpha * [r + gamma * max_{a'} Q(s', a') - Q(s, a)]
The key insight is that the update always uses the maximum Q-value over next-state actions, regardless of which action the agent actually took. Watkins proved that Q-learning converges to the optimal Q-function given sufficient exploration and decreasing learning rates. Q-learning works well for small, discrete state and action spaces but requires function approximation for larger problems.
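A tabular Q-learning sketch on a hypothetical five-state corridor, where the agent must walk right to reach a goal that pays reward 1 (the environment and hyperparameters are invented for illustration):

```python
import random
from collections import defaultdict

N, GOAL, ACTIONS = 5, 4, (+1, -1)  # actions: step right or left along the corridor
alpha, gamma, epsilon = 0.1, 0.9, 0.1
Q = defaultdict(float)             # Q[(state, action)], initialized to 0

random.seed(0)
for episode in range(2000):
    s = 0
    while s != GOAL:
        # Epsilon-greedy behavior policy.
        a = random.choice(ACTIONS) if random.random() < epsilon else \
            max(ACTIONS, key=lambda b: Q[(s, b)])
        s2 = min(max(s + a, 0), N - 1)
        r = 1.0 if s2 == GOAL else 0.0
        # Off-policy target: bootstrap from the BEST next action (zero at terminal).
        target = r + gamma * max(Q[(s2, b)] for b in ACTIONS) * (s2 != GOAL)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s2

print(max(ACTIONS, key=lambda b: Q[(0, b)]))  # -> 1: the greedy policy moves right
```

After training, Q[(0, +1)] approaches gamma^3 = 0.729, the discounted value of the reward three steps ahead, and the greedy policy at every state points toward the goal.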
SARSA (State-Action-Reward-State-Action), introduced by Rummery and Niranjan in 1994, is an on-policy variant of Q-learning.[9] Its update rule uses the action the agent actually took in the next state:
Q(s, a) <- Q(s, a) + alpha * [r + gamma * Q(s', a') - Q(s, a)]
Because SARSA evaluates the policy it is currently following (including exploratory actions), it tends to learn safer, more conservative policies. In the classic cliff-walking problem, Q-learning learns the shortest path along the dangerous cliff edge (the optimal path), while SARSA learns a longer but safer path further from the edge.
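The difference between the two update rules comes down to the bootstrap target, which a single concrete update makes visible (the Q-values below are made up for illustration):

```python
alpha, gamma = 0.5, 0.9
# Hypothetical Q-values for the two actions available in the next state s'.
Q_next = {"left": 0.4, "right": 1.0}
r = 0.0

# Q-learning bootstraps from the best next action, regardless of behavior...
q_target = r + gamma * max(Q_next.values())
# ...while SARSA bootstraps from the action actually taken, which may be
# an exploratory one -- say the agent happened to pick "left".
sarsa_target = r + gamma * Q_next["left"]

print(q_target)                 # 0.9
print(round(sarsa_target, 2))   # 0.36
```

Because exploratory actions drag SARSA's targets down, states where exploration is costly (like the cliff edge) look worse to SARSA than to Q-learning, which is why it learns the safer path.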
Deep Q-Networks (DQN), published by Mnih et al. at DeepMind in 2013 and later in Nature in 2015, represented a breakthrough by combining Q-learning with deep neural networks.[10] DQN used deep convolutional neural networks to approximate Q-values for high-dimensional state spaces, taking raw pixel inputs from Atari 2600 games and learning to play 49 different games with a single architecture. It achieved human-level performance on 29 of them.
Two key innovations made DQN work:

- Experience replay: transitions are stored in a buffer and sampled in random minibatches, breaking the correlations between consecutive samples.
- Target networks: update targets are computed with a separate, periodically synchronized copy of the Q-network, stabilizing an otherwise constantly moving learning target.
DQN was the first demonstration that a single RL agent could learn complex behaviors directly from sensory input across many different tasks. It sparked the deep reinforcement learning revolution.
REINFORCE, introduced by Ronald Williams in 1992, was the first policy gradient algorithm.[11] Instead of learning a value function, REINFORCE directly parameterizes the policy and optimizes it by estimating the gradient of expected return with respect to the policy parameters theta:
nabla_theta J(theta) ~ Sum_t G_t * nabla_theta log pi_theta(a_t | s_t)
The intuition is simple: increase the probability of actions that led to high returns, and decrease the probability of actions that led to low returns. REINFORCE is elegant but suffers from high variance in gradient estimates because returns can vary widely across episodes.
The policy gradient theorem, formalized by Sutton et al. in 1999, provides a general expression for the gradient of expected return with respect to policy parameters.[12] It shows that the gradient can be expressed as an expectation under the current policy, enabling practical gradient estimation from sampled trajectories. This theorem underpins virtually all modern policy optimization algorithms, including PPO and SAC.
Adding a baseline (typically an estimate of the state value V(s)) to the gradient estimate reduces variance without introducing bias:
nabla_theta J(theta) ~ Sum_t (G_t - V(s_t)) * nabla_theta log pi_theta(a_t | s_t)
The term (G_t - V(s_t)) is called the advantage, and this insight leads directly to actor-critic methods.
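A minimal sketch of REINFORCE with a baseline on a hypothetical three-armed bandit (a one-step problem, so the return is just the immediate reward; here a simple running-average baseline stands in for a learned V(s), and all rewards and hyperparameters are illustrative):

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def grad_log_pi(theta, a):
    """Gradient of log pi_theta(a) for a softmax over per-action logits."""
    g = -softmax(theta)
    g[a] += 1.0
    return g

theta = np.zeros(3)                 # one logit per action
rng = np.random.default_rng(0)
returns = []
for step in range(2000):
    a = rng.choice(3, p=softmax(theta))
    G = [0.0, 0.2, 1.0][a] + rng.normal(0, 0.1)      # noisy return; action 2 is best
    baseline = np.mean(returns) if returns else 0.0  # variance-reducing baseline
    theta += 0.1 * (G - baseline) * grad_log_pi(theta, a)
    returns.append(G)

print(int(np.argmax(softmax(theta))))  # action 2 should come to dominate
```

Subtracting the baseline leaves the gradient's expectation unchanged but shrinks its variance: actions are reinforced only to the extent they beat the average outcome.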
Actor-critic algorithms combine policy-based and value-based learning into a single framework.[13] The system has two components:

- The actor: a parameterized policy that selects actions.
- The critic: a learned value function that evaluates the states (or state-action pairs) the actor encounters.
The critic provides a learned baseline that reduces the variance of policy gradient estimates, making training more stable than pure policy gradient methods while retaining the ability to handle continuous action spaces.
A3C (Asynchronous Advantage Actor-Critic), introduced by Mnih et al. in 2016, runs multiple agents in parallel on separate copies of the environment, each contributing gradients asynchronously to a shared model.[14] This was one of the first methods to effectively parallelize RL training across many CPU cores.
A2C (Advantage Actor-Critic) is a synchronous variant where all parallel workers complete their rollouts before a single centralized update. A2C typically makes more effective use of GPUs and often achieves similar or better results than A3C while being simpler to implement and debug.
Proximal Policy Optimization (PPO), introduced by Schulman et al. at OpenAI in 2017, constrains policy updates to prevent destructively large changes.[15] It optimizes a clipped surrogate objective:
L^CLIP(theta) = E[min(r_t(theta) * A_t, clip(r_t(theta), 1 - epsilon, 1 + epsilon) * A_t)]
where r_t(theta) = pi_theta(a_t | s_t) / pi_{theta_old}(a_t | s_t) is the probability ratio and epsilon is typically 0.2. The clipping mechanism prevents the new policy from changing too much in a single update, which is key to training stability.
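The clipped objective is straightforward to compute from logged log-probabilities and advantages. A NumPy sketch (the function name and sample values are illustrative):

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Clipped surrogate objective L^CLIP (to be maximized), per the PPO paper."""
    ratio = np.exp(logp_new - logp_old)                  # r_t(theta)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantages
    # The min makes the objective a pessimistic bound: large ratio changes
    # can never increase it, so there is no incentive to overshoot.
    return np.mean(np.minimum(unclipped, clipped))

# A ratio of 2.0 with a positive advantage is clipped back to 1.2:
logp_old = np.log(np.array([0.5]))
logp_new = np.log(np.array([1.0]))
adv = np.array([1.0])
print(ppo_clip_objective(logp_new, logp_old, adv))  # -> 1.2, not 2.0
```

Once the ratio leaves the [1 - eps, 1 + eps] band in the direction the advantage favors, the gradient of the objective with respect to that sample vanishes, which is what keeps a single update from moving the policy too far.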
PPO has become one of the most widely used RL algorithms due to its simplicity, reliability, and strong empirical performance. OpenAI used PPO to train OpenAI Five (Dota 2), and it was the RL algorithm originally used in RLHF for ChatGPT and InstructGPT.
Soft Actor-Critic (SAC), introduced by Haarnoja et al. at UC Berkeley in 2018, augments the standard RL objective with an entropy term that encourages exploration:[16]
J(pi) = Sum_t E[r(s_t, a_t) + alpha * H(pi(. | s_t))]
where H is the entropy of the policy and alpha is a temperature parameter. By maximizing both reward and entropy, SAC avoids premature convergence to deterministic policies. SAC is off-policy, uses experience replay, and automatically tunes the temperature parameter. It achieves strong performance on continuous control benchmarks with better sample efficiency than on-policy methods.
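The effect of the entropy bonus is easy to see for a discrete policy: a deterministic distribution contributes nothing, while a uniform one over two actions adds alpha * ln(2) per step (the reward and temperature values are illustrative):

```python
import numpy as np

def entropy(probs):
    """Shannon entropy H(pi(.|s)) of a discrete action distribution."""
    probs = np.asarray(probs)
    return -np.sum(probs * np.log(probs + 1e-12))  # epsilon avoids log(0)

alpha = 0.1   # temperature: weight of the entropy bonus
reward = 1.0

# A deterministic policy earns no entropy bonus...
print(reward + alpha * entropy([1.0, 0.0]))  # ~1.0
# ...while a uniform policy earns alpha * ln(2) extra.
print(reward + alpha * entropy([0.5, 0.5]))  # ~1.069
```

Because collapsing to a deterministic policy forfeits the bonus, the agent keeps acting stochastically until the reward signal clearly favors one action, which is the exploration mechanism SAC relies on.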
| Algorithm | Type | Year | Key innovation | Best suited for |
|---|---|---|---|---|
| Q-learning | Value, off-policy | 1989 | Model-free optimal control | Small discrete problems |
| SARSA | Value, on-policy | 1994 | On-policy TD control | Safe learning scenarios |
| DQN | Value, off-policy | 2013/2015 | Deep RL with experience replay and target networks | Discrete actions, visual input |
| DDPG | Actor-critic, off-policy | 2015 | Continuous-action DQN | Continuous control |
| A3C | Actor-critic, on-policy | 2016 | Asynchronous parallel training | CPU-based distributed training |
| PPO | Policy, on-policy | 2017 | Clipped surrogate objective | General purpose, RLHF |
| SAC | Actor-critic, off-policy | 2018 | Maximum entropy RL | Continuous control, robotics |
| TD3 | Actor-critic, off-policy | 2018 | Twin critics, delayed updates | Continuous control |
| MuZero | Model-based | 2020 | Learned latent dynamics for planning | Games without known rules |
Deep reinforcement learning (deep RL) combines RL algorithms with deep learning as function approximators, enabling agents to handle high-dimensional state and action spaces that are intractable for tabular methods.[10]
Classic RL algorithms like tabular Q-learning maintain a table of values for every state-action pair. This works for small state spaces but fails when states are described by images or continuous variables. A single Atari game frame has 210 x 160 pixels with 128 possible colors per pixel, making the raw state space astronomically large. Neural networks solve this by learning compact, generalizable representations.
Deep RL uses several recurring architectural patterns:

- Convolutional encoders that compress visual observations into compact feature vectors (as in DQN).
- Recurrent or transformer layers that summarize history when the environment is only partially observable.
- Actor-critic architectures with separate or partially shared policy and value heads.
Combining neural networks with RL introduces stability challenges absent in supervised learning. The training data distribution changes as the policy improves (non-stationarity), and small changes in the value function can cause large policy shifts. Experience replay, target networks, gradient clipping, and entropy regularization are standard techniques for mitigating these issues.
Reinforcement learning has produced several landmark results that captured worldwide attention and advanced the field significantly.
TD-Gammon, developed by Gerald Tesauro at IBM, used a neural network trained through self-play with TD(lambda) to achieve expert-level backgammon play.[17] It played approximately 1.5 million games against itself during training. TD-Gammon is widely cited as a precursor to the deep RL breakthroughs that followed two decades later.
DeepMind's DQN was the first system to learn successful control policies directly from raw pixel inputs across a diverse set of tasks.[10] The 2013 paper demonstrated strong performance on seven Atari games. The 2015 Nature paper extended this to 49 games, achieving human-level performance on 29 of them using identical architecture and hyperparameters. This showed that a single deep RL agent could generalize across very different tasks.
DeepMind's AlphaGo defeated 18-time world Go champion Lee Sedol 4 games to 1 in March 2016, an event watched by over 200 million people.[18] AlphaGo combined supervised learning from human expert games with RL through self-play, using Monte Carlo tree search (MCTS) guided by a policy network and a value network. The game of Go has roughly 10^170 possible board positions, making brute-force search impossible. Before the match, such a victory was widely expected to be at least a decade away.
AlphaGo Zero (2017) eliminated the need for human data entirely, learning exclusively through self-play and surpassing the original AlphaGo within 40 hours of training.[19] AlphaZero (2017) further generalized this approach to chess and shogi, defeating the strongest existing programs in all three games within 24 hours of training from scratch.[20]
OpenAI Five tackled Dota 2, a game with far greater complexity than Go: imperfect information, real-time decision-making, long time horizons (roughly 20,000 frames per game), a massive action space, and five-player teamwork.[21] The system used PPO with self-play across 128,000 CPU cores and 256 GPUs, accumulating the equivalent of 45,000 years of gameplay. In April 2019, OpenAI Five defeated OG, the reigning human world champions, 2-0.
DeepMind's AlphaStar reached Grandmaster level in StarCraft II in August 2019, placing in the top 0.2% of human players on the official European ladder.[22] StarCraft II poses challenges beyond Go: imperfect information (fog of war), real-time actions, long-term strategic planning, and a combinatorial action space. AlphaStar combined imitation learning from human replays with multi-agent reinforcement learning, training a league of agents that competed against one another to develop diverse strategies.
| Achievement | Year | System | Developer | Significance |
|---|---|---|---|---|
| Expert backgammon | 1992 | TD-Gammon | IBM (Tesauro) | First neural-network RL agent at expert game level |
| 49 Atari games at human level | 2015 | DQN | DeepMind (Mnih et al.) | First deep RL from raw pixels across diverse tasks |
| World champion at Go | 2016 | AlphaGo | DeepMind (Silver et al.) | Defeated Lee Sedol 4-1; previously thought decades away |
| Superhuman Go from self-play only | 2017 | AlphaGo Zero | DeepMind | No human data needed; surpassed AlphaGo in 40 hours |
| Chess, shogi, Go from scratch | 2017 | AlphaZero | DeepMind | Single algorithm mastered three games in 24 hours |
| Dota 2 world champions defeated | 2019 | OpenAI Five | OpenAI | Complex multi-agent real-time game with imperfect info |
| StarCraft II Grandmaster | 2019 | AlphaStar | DeepMind | Multi-agent RL in real-time strategy with fog of war |
| Multiplayer poker | 2019 | Pluribus | CMU / Facebook AI | First AI to beat pros in 6-player no-limit Hold'em |
| Data center cooling optimization | 2016 | DeepMind AI | Google DeepMind | 40% reduction in cooling energy consumption |
| LLM alignment via RLHF | 2022 | InstructGPT / ChatGPT | OpenAI | RL became central to training conversational AI |
RLHF has become one of the most consequential applications of reinforcement learning. It is the technique that transforms a pre-trained language model into a conversational assistant that follows instructions, avoids harmful outputs, and behaves in ways humans find helpful.[23]
The RLHF process typically involves three stages:

1. Supervised fine-tuning (SFT): the pre-trained model is fine-tuned on human-written demonstrations of the desired behavior.
2. Reward model training: human labelers rank alternative model outputs, and a reward model is trained to predict those preferences.
3. RL optimization: the language model is optimized against the reward model, typically with PPO, while a KL penalty keeps it close to the SFT model.
OpenAI's InstructGPT (2022) was one of the first published demonstrations of this approach.[24] A 1.3 billion parameter InstructGPT model was preferred by human evaluators over the much larger 175 billion parameter GPT-3, while producing fewer factual errors and toxic responses. The same methodology was used for ChatGPT, released in November 2022. Anthropic applied a variant called Constitutional AI (CAI) to train Claude, where AI-generated feedback partially replaces human labeling.[25]
| Method | Year | Key idea |
|---|---|---|
| PPO-based RLHF | 2022 | Original approach used for InstructGPT and ChatGPT |
| Direct Preference Optimization (DPO) | 2023 | Eliminates the separate reward model and RL step; optimizes directly on preference pairs |
| Kahneman-Tversky Optimization (KTO) | 2024 | Works with binary (good/bad) labels instead of pairwise preferences |
| Group Relative Policy Optimization (GRPO) | 2024 | Eliminates the value network; estimates advantages from group reward distributions |
| Reinforcement Learning from AI Feedback (RLAIF) | 2023+ | Uses AI-generated preferences to scale alignment efforts |
By 2025, RLHF and its variants had become the default alignment strategy for LLMs, with approximately 70% of enterprises adopting methods like RLHF or DPO to align AI outputs.[26] DeepSeek-R1, released in January 2025, demonstrated that pure RL training using GRPO with verifiable rewards can produce strong reasoning capabilities in LLMs without any supervised fine-tuning step.
Multi-agent reinforcement learning (MARL) extends RL to settings where multiple agents interact within a shared environment.[27] This introduces challenges absent in single-agent RL: the environment is non-stationary from each agent's perspective because other agents are simultaneously learning and changing their behavior.
| Setting | Description | Examples |
|---|---|---|
| Fully cooperative | All agents share a common reward | Robot swarm coordination, team games |
| Fully competitive | Zero-sum; one agent's gain is another's loss | Board games, competitive video games |
| Mixed (general-sum) | Agents have partially aligned, partially conflicting goals | Autonomous driving, economic markets, negotiation |
MARL has been applied to autonomous driving (vehicles negotiating at intersections), robotic swarms, traffic signal control, multiplayer games (Dota 2, StarCraft II), and resource allocation in smart grids.
Policies trained in simulation often fail when deployed on physical hardware because of the "sim-to-real gap": differences in physics, sensor noise, actuator dynamics, and visual appearance between the simulator and the real world.[28] Since RL algorithms typically require millions of interactions to learn, training directly on physical systems is often impractical.
Several techniques help bridge this gap:

- Domain randomization: randomizing simulator parameters (physics, textures, lighting, latency) during training so that the real world appears to the policy as just another variation.
- System identification: calibrating the simulator's parameters to match measurements from the physical system.
- Real-world fine-tuning: continuing training on the physical system after pre-training in simulation.
OpenAI demonstrated sim-to-real transfer with a robotic hand (Dactyl) that learned to solve a Rubik's Cube in simulation and then transferred the skill to a physical robot using extensive domain randomization (2019).
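Domain randomization of the kind Dactyl used can be sketched as sampling a fresh set of simulator parameters for every training episode; the parameter names and ranges below are purely illustrative:

```python
import random

def randomized_physics(rng):
    """Sample simulator parameters for one episode (ranges are illustrative)."""
    return {
        "friction": rng.uniform(0.5, 1.5),          # surface friction coefficient
        "mass_scale": rng.uniform(0.8, 1.2),        # multiplier on object masses
        "sensor_noise_std": rng.uniform(0.0, 0.05), # observation noise level
        "latency_ms": rng.uniform(0, 40),           # actuation delay
    }

rng = random.Random(0)
for episode in range(3):
    params = randomized_physics(rng)
    # A real training loop would configure the simulator with these parameters
    # before the episode, e.g. env.reset() on a perturbed physics model.
    print({k: round(v, 3) for k, v in params.items()})
```

A policy that performs well across all sampled variations has no way to overfit to any single simulator configuration, which is what makes transfer to the physical system plausible.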
Inverse reinforcement learning (IRL) is the problem of inferring the reward function that an agent is optimizing, given observations of its behavior.[29] While standard RL takes a reward function as input and outputs a policy, IRL works in the opposite direction: given a policy (or demonstrations), it recovers the underlying reward function.
IRL is useful when the reward function is difficult to specify manually but expert demonstrations are available. Applications include:

- Autonomous driving, where acceptable driving behavior is easier to demonstrate than to write down as a reward function.
- Robot learning from demonstration, where manipulation or locomotion skills are recovered from human examples.
- Modeling human and animal decision-making in cognitive science and behavioral ecology.
Key IRL algorithms include Maximum Entropy IRL (Ziebart et al., 2008), which models behavior as approximately optimal under a reward function while maximizing entropy, and Generative Adversarial Imitation Learning (GAIL), which combines IRL with generative adversarial training.
RL has achieved superhuman performance in numerous games. Beyond the milestone achievements discussed above, notable examples include Pluribus (2019), which defeated professional players in six-player no-limit Texas Hold'em, and Meta's Cicero (2022), which achieved human-level performance in Diplomacy by combining RL with natural language generation for negotiation.
RL enables robots to acquire motor skills through trial and error rather than manual programming. Applications include locomotion (walking and running on uneven terrain), manipulation (grasping and assembling objects), and navigation. Training typically occurs in physics simulators like MuJoCo or NVIDIA Isaac Gym before transferring policies to physical hardware via sim-to-real techniques.
Self-driving systems employ RL for path planning, lane-change decisions, intersection negotiation, and adaptive cruise control. Research has shown that PPO outperforms DQN in terms of stability and scalability in highway driving simulations.[30] Companies like Waymo and Tesla use RL as one component of their autonomous driving stacks, typically combined with rule-based safety constraints and imitation learning.
RL is used in recommendation engines where the goal is to maximize long-term user engagement rather than immediate click-through rates. Platforms like YouTube, Netflix, and Spotify use RL-inspired approaches to balance exploration (showing new content) with exploitation (recommending proven favorites) and to optimize for retention over short-term clicks.
RL has been applied to dynamic treatment regimes for chronic diseases, including sepsis management in intensive care units where RL agents recommend drug dosages and ventilator settings.[31] Other applications include drug discovery (molecular design and optimization), personalized medicine (adaptive clinical trial designs), and medical imaging (RL-guided landmark detection).
Google DeepMind achieved a 40% reduction in data center cooling energy consumption by using RL to optimize HVAC settings (2016).[32] Other energy applications include smart grid load balancing, wind farm optimization (turbine yaw angles and blade pitch), and building management systems.
RLHF and its successors (DPO, GRPO, RLVR) have become central to training large language models like ChatGPT, Claude, GPT-4, Gemini, Llama, and DeepSeek to follow instructions and align with human values. This is arguably the highest-impact application of RL as of 2025.
The field of reinforcement learning emerged from the convergence of three intellectual traditions.[1]
The earliest roots lie in experimental psychology. Ivan Pavlov's work on classical conditioning in the 1890s demonstrated that animals learn to associate stimuli with rewards. Edward Thorndike formalized this in 1911 with his Law of Effect: responses followed by satisfying outcomes become more firmly associated with the situation. B.F. Skinner extended these ideas in the 1930s through operant conditioning. These principles directly inspired RL's reward-based learning framework.
Richard Bellman developed dynamic programming in the 1950s as a method for solving sequential decision problems. His key insight, formalized in the Bellman equation (1957), was that an optimal policy can be decomposed into an immediate decision plus the optimal policy from the resulting state onward.[2] This recursive formulation became the mathematical backbone of RL.
Richard Sutton introduced temporal difference (TD) learning in 1988, bridging Monte Carlo methods and dynamic programming by allowing agents to update value estimates after each step rather than waiting for an episode to end.[33] Christopher Watkins introduced Q-learning in 1989.[8] Sutton and Barto's 1998 textbook, Reinforcement Learning: An Introduction, unified these threads into a coherent framework and became the field's standard reference. The second edition was published in 2018. Sutton and Barto received the 2024 Turing Award for their foundational contributions to the field.
| Year | Development | Key contributor(s) |
|---|---|---|
| 1890s | Classical conditioning | Ivan Pavlov |
| 1911 | Law of Effect | Edward Thorndike |
| 1950s | Dynamic programming, Bellman equation | Richard Bellman |
| 1988 | TD(lambda) | Richard Sutton |
| 1989 | Q-learning | Christopher Watkins |
| 1992 | REINFORCE | Ronald Williams |
| 1992 | TD-Gammon | Gerald Tesauro |
| 1998 | Reinforcement Learning: An Introduction | Sutton, Barto |
| 2013 | DQN on Atari | DeepMind (Mnih et al.) |
| 2015 | DQN published in Nature | DeepMind |
| 2016 | AlphaGo defeats Lee Sedol | DeepMind (Silver et al.) |
| 2017 | AlphaZero; PPO published | DeepMind; OpenAI (Schulman et al.) |
| 2018 | SAC published | Haarnoja et al. (UC Berkeley) |
| 2019 | OpenAI Five; AlphaStar | OpenAI; DeepMind |
| 2020 | MuZero | DeepMind |
| 2022 | RLHF used for ChatGPT | OpenAI |
| 2024 | Turing Award for RL | Richard Sutton, Andrew Barto |
| 2025 | DeepSeek-R1 with GRPO | DeepSeek |
RL algorithms often require enormous amounts of interaction data. DQN needed 200 million frames for Atari (roughly 924 hours of play), and OpenAI Five accumulated the equivalent of 45,000 years of Dota 2 gameplay. This makes training on physical systems impractical for most current algorithms. Approaches to improve sample efficiency include model-based RL, transfer learning, curriculum learning, and offline RL.
Designing reward functions that truly capture the intended objective is difficult. Agents frequently exploit unintended shortcuts in the reward function. A boat-racing agent famously learned to drive in circles collecting bonus items instead of finishing the race because the bonuses gave more reward than completing it. In RLHF for LLMs, reward hacking manifests as verbose, sycophantic responses that score well with the reward model but are not genuinely helpful.
RL agents often fail to generalize beyond their training environment. A policy trained in one version of a game may fail on a slightly different version. When learning multiple tasks sequentially, neural networks suffer from catastrophic forgetting, where learning a new task overwrites weights needed for prior tasks.
Neural network policies are opaque; understanding why an agent takes a particular action is difficult. This creates barriers for deploying RL in safety-critical domains like healthcare and autonomous driving, where explainable decision-making and formal safety guarantees are essential.