Reinforcement Learning Models
Last reviewed
May 13, 2026
Sources
60 citations
Review status
Source-backed
Revision
v2 · 6,588 words
See also: Reinforcement Learning and Models
Reinforcement learning models are computational systems that learn to make sequential decisions by interacting with an environment and receiving scalar reward signals. They are the algorithmic core of many of the most visible breakthroughs in modern artificial intelligence: the DQN agent that played 49 Atari games from raw pixels (Mnih et al., 2015), AlphaGo and AlphaZero at Go, chess, and shogi, MuZero playing the same games without knowing their rules, OpenAI Five at Dota 2, AlphaStar at StarCraft II, and the RLHF pipelines that turned base language models into ChatGPT, Claude, and DeepSeek-R1. This article surveys the algorithm families, milestone systems, benchmarks, and software libraries that define the field as of 2025.
The term covers a wide range of approaches, including tabular Q-Learning (Watkins, 1989), policy gradient methods such as REINFORCE (Williams, 1992), deep value-based methods like Rainbow DQN, actor-critic methods such as PPO and SAC, model-based agents such as Dreamer and MuZero, offline algorithms such as CQL and IQL, and vision-language-action robot policies such as RT-2, OpenVLA, and Pi-0.
Reinforcement learning is formalized as a Markov Decision Process (MDP). An MDP is a tuple (S, A, P, R, gamma) where S is the set of states, A the set of actions, P(s' | s, a) the transition probability, R(s, a) the reward function, and gamma in [0, 1) the discount factor. An agent observes state s, picks an action a according to a policy pi(a | s), receives a reward r, and transitions to a new state s'. The objective is to maximise the expected discounted return E[sum_t gamma^t r_t].
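As a minimal sketch of the objective above, the discounted return of a finite reward sequence can be accumulated backwards. The function name `discounted_return` is illustrative and not taken from any library.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma^t * r_t for a finite list of rewards."""
    g = 0.0
    # Work backwards: G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

With gamma strictly below 1, the same recursion keeps the return bounded on arbitrarily long episodes whenever rewards are bounded.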
Three quantities sit at the centre of most RL algorithms. The state value V^pi(s) is the expected return starting in state s under policy pi. The action value Q^pi(s, a) is the expected return after taking action a in state s and then following pi. The advantage A^pi(s, a) = Q^pi(s, a) - V^pi(s) measures how much better an action is than the average behaviour of pi. The Bellman equations connect these values across consecutive time steps: V^pi(s) = E_a[R(s, a) + gamma * E_s'[V^pi(s')]], and the corresponding optimal Bellman equations characterise the optimal policy pi*.
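For reference, the optimal Bellman equations mentioned above can be written in the same notation. This sketch assumes a finite state space, so the expectation over s' expands into a sum over P.

```latex
\begin{aligned}
V^{*}(s)   &= \max_{a}\Big[ R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V^{*}(s') \Big] \\
Q^{*}(s,a) &= R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, \max_{a'} Q^{*}(s',a') \\
\pi^{*}(s) &\in \arg\max_{a} Q^{*}(s,a)
\end{aligned}
```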
When the model P and R are known, dynamic programming methods such as value iteration and policy iteration find pi*. When the model is unknown, temporal difference learning replaces the expectation with sampled transitions, leading to algorithms such as Q-learning, SARSA, and TD(lambda) (Sutton, 1988). Deep RL replaces tabular value functions with deep neural networks, and policy gradient methods optimise pi_theta directly through stochastic gradient ascent on the expected return.
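The contrast between planning with a known model and learning from samples can be sketched in a few lines. The data layout (`P[s][a]` as a list of `(probability, next_state)` pairs, `R[s][a]` as a scalar) and both function names are assumptions made for this illustration.

```python
def value_iteration(P, R, gamma, tol=1e-8):
    """Known model: sweep the optimal Bellman backup until values stop changing."""
    V = [0.0] * len(P)
    while True:
        delta = 0.0
        for s in range(len(P)):
            best = max(
                R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                for a in range(len(P[s]))
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V


def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    """Unknown model: one sampled TD update toward r + gamma * max_a' Q(s', a')."""
    target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])
    return Q[s][a]
```

Value iteration needs the full transition table, while the Q-learning update only needs a single observed transition (s, a, r, s').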
The field is usually carved up along five axes: whether the agent learns a value function, a policy, or both; whether it learns a model of the world; whether it learns on-policy or off-policy; whether data is gathered online or supplied offline; and whether one agent or several are learning.
| Axis | Options | Example algorithms |
|---|---|---|
| Value representation | Value-based, policy-gradient, actor-critic | DQN, REINFORCE, A3C |
| World model | Model-free, model-based | PPO (free), Dreamer, MuZero (model) |
| Policy sample source | On-policy, off-policy | PPO (on), DQN, SAC (off) |
| Data regime | Online, offline | SAC (online), CQL, IQL (offline) |
| Agent count | Single-agent, multi-agent | DQN (single), MADDPG, QMIX, MAPPO (multi) |
Value-based methods learn Q(s, a) and act greedily with respect to it; they handle discrete actions naturally but require a max over actions, which is awkward in continuous spaces. Policy-gradient methods parameterise pi_theta directly and use the policy gradient theorem (Sutton et al., 2000) to compute grad J(theta). Actor-critic methods combine the two: an actor learns pi_theta and a critic learns V or Q to lower the variance of policy-gradient estimates. Model-based methods additionally learn a transition model and use it for planning or for generating synthetic experience.
The table below highlights landmark deep RL systems with confirmed authorship, publication venue, and headline result. All dates refer to the original peer-reviewed publication or first arXiv preprint.
| Year | System | Group | Domain | Headline result |
|---|---|---|---|---|
| 2013 | DQN preprint | DeepMind | Atari 2600 | First deep network trained with Q-learning on raw pixels (Mnih et al., NIPS 2013 workshop). |
| 2015 | DQN (Nature) | DeepMind | Atari (49 games) | Human-comparable scores on 29 of 49 Atari games using one network and one set of hyperparameters. |
| 2015 | DDPG | DeepMind | Continuous control | Off-policy actor-critic that learns from pixels on more than 20 simulated physics tasks. |
| 2016 | AlphaGo | DeepMind | Go | Defeated European champion Fan Hui 5 to 0 (Oct 2015, published Jan 2016) and Lee Sedol 4 to 1 (Mar 2016). |
| 2017 | AlphaGo Zero | DeepMind | Go | Beat the published AlphaGo 100 to 0 starting from random weights and self-play only. |
| 2017 | PPO | OpenAI | Continuous and discrete control | Simpler and more sample-efficient than TRPO; became the default RLHF optimiser. |
| 2018 | AlphaZero | DeepMind | Go, chess, shogi | One algorithm reached superhuman play in three games (Science 362, 2018). |
| 2018 | OpenAI Five (TI8) | OpenAI | Dota 2 | Lost two matches against pros at The International 2018. |
| 2019 | OpenAI Five (Finals) | OpenAI | Dota 2 | Beat reigning world champions OG 2 to 0 in San Francisco. |
| 2019 | Pluribus | CMU and Facebook AI | 6-player no-limit Texas hold'em | Beat top human pros over 10,000 hands (Science, 2019). |
| 2019 | AlphaStar | DeepMind | StarCraft II | Reached Grandmaster on all three races on the European ladder (Nature 575, 2019). |
| 2020 | MuZero | DeepMind | Atari, Go, chess, shogi | Matched AlphaZero without being told the rules (Nature 588, 2020). |
| 2022 | CICERO | Meta AI | Diplomacy | Top 10% in online Diplomacy with open-domain natural language negotiation (Science, 2022). |
| 2022 | InstructGPT | OpenAI | Language modelling | First widely deployed RLHF pipeline; turned GPT-3 into a usable assistant. |
| 2023 | DreamerV3 | DeepMind and Toronto | Atari, DMC, Minecraft | First single algorithm to collect diamonds in Minecraft from scratch. |
| 2023 | RT-2 | Google DeepMind | Robotics | First vision-language-action model fine-tuned from a web-scale VLM. |
| 2024 | OpenVLA | Stanford, UC Berkeley, others | Robotics | 7B open-source VLA trained on 970,000 demonstrations from Open X-Embodiment. |
| 2024 | Pi-0 | Physical Intelligence | Robotics | Foundation flow-matching VLA running on multiple robot embodiments. |
| 2025 | DeepSeek-R1 | DeepSeek AI | Reasoning LLMs | Reasoning model trained with GRPO on verifiable rewards (the R1-Zero variant used RL alone); matched OpenAI o1 on math and code. |
| 2025 | Gemini Robotics | Google DeepMind | Robotics | VLA built on Gemini 2.0 doubled prior generalisation scores. |
Several of these systems are described in more detail below.
Value-based methods estimate Q(s, a) and choose actions greedily, often combined with epsilon-greedy exploration. Their modern era starts with the Deep Q-Network.
DQN (Mnih et al., 2015). The 2015 Nature paper "Human-level control through deep reinforcement learning" trained a convolutional network with Q-learning on raw 84 by 84 grayscale Atari frames. Two design choices made training stable: an experience replay buffer that decorrelates consecutive transitions, and a target network whose weights are updated less frequently than the online network. DQN reached human-level scores on 29 of 49 games using the same architecture and hyperparameters across all games.
Double DQN (van Hasselt, Guez and Silver, AAAI 2016). Standard Q-learning systematically overestimates action values because the max in the Bellman target couples action selection and evaluation. Double DQN uses the online network to pick the next action and the target network to evaluate it, which removes most of the bias and improves median scores on the Atari benchmark.
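The difference between the two bootstrap targets can be shown on plain lists of next-state Q-values. This is a sketch only; the function names and the list-based interface are assumptions, not code from the paper.

```python
def dqn_target(r, gamma, q_target_next, done):
    """Standard DQN: the target network both selects and evaluates the action."""
    return r if done else r + gamma * max(q_target_next)


def double_dqn_target(r, gamma, q_online_next, q_target_next, done):
    """Double DQN: the online network selects, the target network evaluates."""
    if done:
        return r
    a_star = max(range(len(q_online_next)), key=q_online_next.__getitem__)
    return r + gamma * q_target_next[a_star]
```

When the two networks disagree about the best next action, the Double DQN target is never larger than the standard one, which is where the reduction in overestimation comes from.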
Prioritized Experience Replay (Schaul, Quan, Antonoglou and Silver, ICLR 2016). Transitions with large temporal-difference error are sampled more often, which accelerates training. Adding prioritised replay to DQN improved performance on 41 of 49 Atari games.
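A sketch of the proportional variant of prioritised sampling follows. The exponents `alpha` and `beta`, the small constant `eps`, and the default values are assumptions chosen for illustration; a practical buffer would also use a sum-tree rather than recomputing the list.

```python
def per_probabilities(td_errors, alpha=0.6, eps=1e-6):
    """Sampling probability proportional to (|TD error| + eps) ** alpha."""
    priorities = [(abs(d) + eps) ** alpha for d in td_errors]
    total = sum(priorities)
    return [p / total for p in priorities]


def importance_weights(probs, beta=0.4):
    """Weights (N * P(i)) ** -beta, scaled so the largest weight is 1."""
    n = len(probs)
    weights = [(n * p) ** (-beta) for p in probs]
    w_max = max(weights)
    return [w / w_max for w in weights]
```

Setting `alpha=0` recovers uniform replay, and the importance weights compensate for the bias that non-uniform sampling introduces into the gradient.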
Dueling DQN (Wang et al., ICML 2016). The network is split into separate streams for the state value V(s) and the action advantage A(s, a), with a re-combination layer that enforces zero mean over the advantage. The architecture took the best paper award at ICML 2016.
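The re-combination step described above amounts to a one-line formula, sketched here on plain Python lists (the function name is illustrative):

```python
def dueling_q(value, advantages):
    """Combine V(s) and A(s, .) as Q = V + (A - mean(A))."""
    mean_adv = sum(advantages) / len(advantages)
    return [value + a - mean_adv for a in advantages]
```

Subtracting the mean makes the decomposition identifiable: the average of the resulting Q-values equals V(s), so the two streams cannot trade off a constant against each other.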
Distributional RL (C51, Bellemare, Dabney and Munos, ICML 2017). Instead of estimating the expected return, the agent learns its full distribution over returns, represented by 51 atoms on a fixed support. C51 was the first algorithm to outperform DQN by a wide margin on Atari and inspired follow-up work on QR-DQN (Dabney et al., AAAI 2018) and Implicit Quantile Networks (Dabney et al., ICML 2018).
Rainbow DQN (Hessel et al., AAAI 2018). Combines six DQN improvements: double Q-learning, prioritised replay, dueling architecture, multi-step learning, distributional C51, and Noisy Nets for exploration (Fortunato et al., 2018). Rainbow reached state-of-the-art Atari scores with substantially better sample efficiency than each component on its own. Ablations showed prioritised replay and multi-step returns contributed the most across games.
Policy-gradient methods originate with Williams' REINFORCE algorithm (Machine Learning, 1992), which estimated the policy gradient using likelihood-ratio sampling. Modern variants reduce variance with a learned baseline (actor-critic) and constrain the size of each update.
A3C and A2C (Mnih et al., ICML 2016). Asynchronous Advantage Actor-Critic uses multiple CPU workers running in parallel to gather diverse trajectories and update a shared network. A3C surpassed DQN on Atari while training on a single multi-core CPU instead of a GPU; A2C is its synchronous variant, which waits for all workers before each update.
TRPO (Schulman, Levine, Moritz, Jordan and Abbeel, ICML 2015). Trust Region Policy Optimization formalises the intuition that each policy update should not drift too far from the previous policy. TRPO enforces a KL-divergence constraint between successive policies and produces near-monotonic improvement on simulated robotics tasks and Atari.
PPO (Schulman, Wolski, Dhariwal, Radford and Klimov, arXiv 2017). Proximal Policy Optimization replaces TRPO's hard KL constraint with a clipped surrogate objective. It is far simpler to implement, scales well with parallel workers, and has become the default optimizer for a wide range of tasks. PPO was the algorithm behind OpenAI Five in Dota 2 and the standard policy optimizer in InstructGPT-style RLHF pipelines.
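The clipped surrogate can be sketched on per-sample log-probabilities and advantage estimates. The function name and the default `clip_eps=0.2` are assumptions for illustration; real implementations add value and entropy terms and operate on tensors.

```python
import math


def ppo_clip_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean of min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A); to be maximised."""
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)  # pi_new(a|s) / pi_old(a|s)
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

Taking the minimum removes any incentive to push the probability ratio outside the clip range in the direction the advantage favours, while still penalising moves that make things worse.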
DDPG (Lillicrap et al., ICLR 2016). Deep Deterministic Policy Gradient combines the deterministic policy gradient theorem (Silver et al., ICML 2014) with DQN tricks (target networks and replay) to handle continuous action spaces. The paper showed end-to-end pixel control on over 20 simulated physics tasks.
TD3 (Fujimoto, van Hoof and Meger, ICML 2018). Twin Delayed DDPG addresses three pathologies of DDPG: it learns two critics and uses the minimum (clipped double Q-learning) to avoid value overestimation, delays policy updates so the critic stabilises first, and adds noise to the target action (target policy smoothing) to regularise the value function.
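Two of the three fixes, clipped double Q-learning and target policy smoothing, show up in how the critic target is built. The sketch below uses a scalar action and treats the target critics `q1` and `q2` as plain callables of the action; those interfaces, the function name, and the default noise settings are assumptions for illustration.

```python
import random


def td3_target(r, gamma, next_action, q1, q2, done, rng,
               sigma=0.2, noise_clip=0.5, act_limit=1.0):
    """Critic target: r + gamma * min(Q1', Q2') at a noise-smoothed target action."""
    # Target policy smoothing: add clipped Gaussian noise to the target actor's action.
    noise = max(-noise_clip, min(noise_clip, rng.gauss(0.0, sigma)))
    a = max(-act_limit, min(act_limit, next_action + noise))
    # Clipped double Q-learning: bootstrap from the smaller of the two target critics.
    q_min = min(q1(a), q2(a))
    return r + (0.0 if done else gamma * q_min)
```

The third fix, delayed policy updates, lives in the training loop rather than in the target: the actor and target networks are updated only once every few critic updates.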
SAC (Haarnoja, Zhou, Abbeel and Levine, ICML 2018). Soft Actor-Critic uses a maximum-entropy objective J = E[sum_t (r_t + alpha * H(pi(. | s_t)))], in which the temperature alpha trades off reward against policy entropy; the entropy bonus encourages exploration and improves robustness. SAC and its automatic temperature-tuning extension (Haarnoja et al., arXiv 2018) became standard baselines for continuous control and real-robot learning at Berkeley.
IMPALA (Espeholt et al., ICML 2018). Importance-Weighted Actor-Learner Architectures introduced the V-trace off-policy correction, which lets large fleets of actors generate trajectories that are slightly stale by the time the centralised learner processes them. IMPALA scaled to thousands of machines without losing data efficiency and powered multi-task agents on DMLab-30 and Atari-57.
Model-based RL learns or is given a model of the environment dynamics. The model can be used for planning at decision time or to generate synthetic experience for a model-free learner.
World Models (Ha and Schmidhuber, NeurIPS 2018). A variational autoencoder (V) compresses each frame into a latent code; an MDN-RNN (M) predicts the next latent given the current latent and action; a small controller (C) is optimised with CMA-ES to maximise return. The system learned to drive in CarRacing and to survive in VizDoom; in the VizDoom experiment the controller was trained entirely inside the learned dream.
PlaNet (Hafner et al., ICML 2019). PlaNet learns a Recurrent State-Space Model with deterministic and stochastic components and plans in latent space with the cross-entropy method, a sampling-based optimiser that repeatedly refits a distribution over action sequences to the best candidates. It matched model-free agents while using roughly 50 times less data on DeepMind Control Suite tasks from pixels.
Dreamer (Hafner et al., ICLR 2020). Replaces planning with an actor-critic trained inside the learned model. Imagined trajectories let the policy learn from far more data than real interactions provide.
DreamerV2 (Hafner, Lillicrap, Norouzi and Ba, ICLR 2021). Switched the world-model latents from Gaussian to discrete categorical variables and introduced KL balancing. DreamerV2 was the first agent to reach human-level performance on the Atari benchmark of 55 games while learning behaviours entirely inside a separately trained world model.
DreamerV3 (Hafner, Pasukonis, Ba and Lillicrap, arXiv 2023; published in Nature 2025). Uses fixed hyperparameters across more than 150 tasks spanning domains such as DM Control, Atari, ProcGen, and Crafter. DreamerV3 was the first algorithm to collect diamonds in Minecraft from scratch without human data, demonstrations, or hand-designed curricula.
MuZero (Schrittwieser et al., Nature 588, 2020). MuZero learns a latent dynamics model along with policy and value heads, then plans with Monte Carlo Tree Search over imagined rollouts. Without being told the rules of the games, MuZero matched AlphaZero in Go, chess, and shogi, and set new state-of-the-art Atari scores. Follow-up work includes EfficientZero (Ye et al., NeurIPS 2021), Sampled MuZero, and Stochastic MuZero (Antonoglou et al., ICLR 2022).
Genie (Bruce et al., 2024). Google DeepMind trained an 11B-parameter generative model on unlabelled internet videos to produce action-controllable 2D environments. Genie infers latent actions between frames using an unsupervised inverse dynamics model and lets users explore generated worlds frame by frame. Genie 2 and Genie 3 extended the approach to 3D worlds and longer horizons.
Self-play, in which an agent learns by playing against copies of itself, has produced most of the field's most striking demonstrations.
AlphaGo (Silver et al., Nature 529, 2016). Combined supervised learning from 30 million expert moves with self-play reinforcement learning. Policy and value networks guided a Monte Carlo Tree Search. AlphaGo defeated European champion Fan Hui 5 to 0 in October 2015 (published Jan 2016) and 18-time world champion Lee Sedol 4 to 1 in March 2016. AlphaGo Master, a refined version, beat world number one Ke Jie 3 to 0 in May 2017.
AlphaGo Zero (Silver et al., Nature 550, 2017). Removed all human game data and relied solely on self-play with a single neural network producing both policy and value. Starting from random weights, AlphaGo Zero exceeded AlphaGo Lee in three days and beat the published AlphaGo 100 to 0.
AlphaZero (Silver et al., Science 362, 2018). Generalised AlphaGo Zero to chess and shogi using the same algorithm and hyperparameters. After nine hours of training on TPU, AlphaZero defeated Stockfish 8 in chess with 28 wins, 0 losses, and 72 draws over 100 games. In shogi, AlphaZero defeated the 2017 CSA champion Elmo, winning 91.2% of games. In Go, it beat AlphaGo Zero in 61% of games.
OpenAI Five (OpenAI, 2018-2019). Trained five LSTM policies with PPO at large scale. At The International 2018, OpenAI Five lost two matches against pro teams paiN and Big God. After 8x more training compute (about 45,000 self-play years over 10 months), the system beat the reigning world champions OG 2 to 0 in April 2019 at the OpenAI Five Finals, the first time an AI defeated the human world champions of an esport. A public arena ran 42,729 games against humans, winning 99.4%.
AlphaStar (Vinyals et al., Nature 575, 2019). Played all three StarCraft II races. Trained with a league of agents that included main agents (trained against the whole league), main exploiters (trained to exploit current main agents), and league exploiters (trained to exploit the league as a whole), preventing the kind of intransitive cycles that plague pure self-play. AlphaStar reached Grandmaster on the European ranked ladder for Protoss, Terran, and Zerg, above 99.8% of officially ranked players.
Pluribus (Brown and Sandholm, Science 365, 2019). First AI to beat top humans at six-player no-limit Texas hold'em. Pluribus learned a coarse blueprint strategy with a variant of Monte Carlo counterfactual regret minimization and refined it at play time with depth-limited subgame search. Over 10,000 hands against 13 elite pros it won an average of about 48 milli-big-blinds per hand (roughly 5 big blinds per 100 hands), a statistically significant margin.
CICERO (Bakhtin et al., Science 378, 2022). Played Diplomacy, a seven-player negotiation game that requires both cooperative dialogue and tactical reasoning. CICERO combined a 2.7B-parameter dialogue model with the piKL planning algorithm, which iteratively updates each player's predicted policy toward higher expected value. Over 40 online games on webDiplomacy.net, CICERO finished in the top 10% with more than double the average human score.
Offline RL learns a policy from a fixed dataset of pre-recorded transitions without further environment interaction. The main challenge is distributional shift: bootstrapping a Q-function with out-of-distribution actions leads to wildly optimistic estimates.
BCQ (Fujimoto, Meger and Precup, ICML 2019). Batch-Constrained Q-learning learns a conditional variational autoencoder over actions seen in the dataset and only evaluates Q on actions sampled from that generator. This was one of the first deep batch RL methods to clearly outperform behavioural cloning on continuous control.
CQL (Kumar, Zhou, Tucker and Levine, NeurIPS 2020). Conservative Q-Learning adds a regularizer to the Bellman loss that pushes down Q-values on actions absent from the dataset and pulls them up on observed actions. The resulting Q-function is a provable lower bound on the true value. CQL outperformed prior offline RL methods by 2 to 5x on D4RL.
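For discrete actions, one common form of the regularizer is a log-sum-exp over all actions minus the Q-value of the action actually in the dataset. The sketch below shows only that penalty term for a single state; the function name is illustrative, and the weighting against the Bellman loss is left out.

```python
import math


def cql_penalty(q_values, dataset_action):
    """logsumexp_a Q(s, a) - Q(s, a_data); non-negative by construction."""
    m = max(q_values)  # subtract the max for numerical stability
    logsumexp = m + math.log(sum(math.exp(q - m) for q in q_values))
    return logsumexp - q_values[dataset_action]
```

Minimising the first term pushes down whichever actions currently have large Q-values, while the second term pulls up the action that was observed, so the penalty is smallest when the dataset action already dominates.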
IQL (Kostrikov, Nair and Levine, ICLR 2022). Implicit Q-Learning never queries Q on unseen actions. It fits an expectile regression to the state-conditional distribution of action values, treats that as a value function, and extracts a policy via advantage-weighted behavioural cloning. IQL is simple, efficient, and achieved state of the art on D4RL.
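The expectile regression at the heart of the method is an asymmetric squared loss. The sketch below shows the loss and a toy gradient-descent fit over a fixed set of Q-value samples; the function names, the default `tau`, and the learning-rate settings are assumptions for illustration.

```python
def expectile_loss(q_target, v_pred, tau=0.7):
    """Asymmetric squared error: weight tau when Q > V, (1 - tau) otherwise."""
    u = q_target - v_pred
    weight = tau if u > 0 else 1.0 - tau
    return weight * u * u


def fit_expectile(samples, tau, steps=2000, lr=0.05):
    """Fit a scalar V to Q-value samples by gradient descent on the expectile loss."""
    v = sum(samples) / len(samples)
    for _ in range(steps):
        grad = sum(
            -2.0 * (tau if q > v else 1.0 - tau) * (q - v) for q in samples
        ) / len(samples)
        v -= lr * grad
    return v
```

With `tau = 0.5` the fit is the ordinary mean; as `tau` approaches 1 the fitted value moves toward the largest Q-value among actions in the data, which approximates a max without evaluating unseen actions.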
Decision Transformer (Chen et al., NeurIPS 2021). Recast offline RL as sequence modelling. A causal Transformer is trained to predict actions conditioned on the desired return, past states, and past actions. At test time the user specifies a target return and the model autoregressively generates an action sequence. Decision Transformer matched or exceeded CQL on Atari, OpenAI Gym, and the Key-to-Door long-horizon tasks. The contemporaneous Trajectory Transformer (Janner, Li and Levine, NeurIPS 2021) treated planning as beam search over a sequence model.
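The return conditioning can be made concrete with the returns-to-go that label each training timestep and the bookkeeping used at test time. This is a sketch with illustrative function names; the Transformer itself is omitted.

```python
def returns_to_go(rewards):
    """Undiscounted sum of rewards from each timestep to the end of the episode."""
    rtg, running = [], 0.0
    for r in reversed(rewards):
        running += r
        rtg.append(running)
    return rtg[::-1]


def next_target_return(target_return, reward):
    """At test time, subtract each observed reward from the remaining target."""
    return target_return - reward
```

During training each (return-to-go, state, action) triple becomes three tokens; at test time the first return-to-go is the user's target and is decremented as rewards arrive.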
Reinforcement learning has become a standard ingredient in post-training of large language models. The dominant pipeline is reinforcement learning from human feedback (RLHF).
Learning to summarise from human feedback (Stiennon et al., NeurIPS 2020). OpenAI's summarisation paper introduced the modern three-stage RLHF recipe: collect preference comparisons between pairs of summaries; train a reward model to predict which summary humans prefer; fine-tune the policy with PPO against that reward model with a KL penalty to the supervised initial policy. A 1.3B-parameter model fine-tuned this way produced summaries that humans preferred over the supervised 12B baseline.
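Stages two and three of the recipe each reduce to a short formula: a pairwise logistic loss for the reward model, and a reward-model score minus a KL penalty for the policy. The sketch below assumes scalar scores and per-sample log-probabilities; the function names and the coefficient `beta` are illustrative.

```python
import math


def preference_loss(r_chosen, r_rejected):
    """Reward-model loss -log(sigmoid(r_chosen - r_rejected)), computed stably."""
    diff = r_chosen - r_rejected
    if diff > 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


def kl_shaped_reward(rm_score, logp_policy, logp_ref, beta=0.1):
    """Reward passed to PPO: reward-model score minus a per-sample KL estimate."""
    return rm_score - beta * (logp_policy - logp_ref)
```

The KL term discourages the policy from drifting into regions where the reward model was never trained and its scores are unreliable.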
InstructGPT (Ouyang et al., NeurIPS 2022). Applied the same pipeline to GPT-3 across a broad set of instruction-following tasks. Human labellers preferred outputs from the 1.3B InstructGPT model over those from the 175B GPT-3, and preferred 175B InstructGPT over few-shot 175B GPT-3 in 71% of comparisons. InstructGPT became the immediate predecessor to ChatGPT.
Constitutional AI / RLAIF (Bai et al., Anthropic, arXiv 2212.08073, 2022). Replaces human preference labels with AI-generated labels guided by a written set of principles (the constitution). A first phase samples model responses, critiques them against the constitution, and revises them; a second phase trains a preference model from AI comparisons and uses RL to align the assistant. This is the methodology behind Claude.
DPO (Rafailov, Sharma, Mitchell, Ermon, Manning and Finn, NeurIPS 2023). Direct Preference Optimization removes the explicit reward model entirely. By exploiting the closed-form relationship between the optimal RLHF policy and the reward function, the paper derives a simple classification loss on preference pairs that targets the same KL-regularised objective as PPO-based RLHF under a Bradley-Terry preference model. DPO is far easier to implement than PPO RLHF and has become the standard preference-optimization recipe in many open-source pipelines.
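The classification loss can be sketched for a single preference pair, given summed log-probabilities of the chosen (`w`) and rejected (`l`) responses under the policy and under a frozen reference model. The function names and the default `beta` are assumptions for illustration.

```python
import math


def softplus(x):
    """log(1 + exp(x)), computed stably."""
    return x + math.log1p(math.exp(-x)) if x > 0 else math.log1p(math.exp(x))


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigmoid(beta * [(logp_w - ref_w) - (logp_l - ref_l)])."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return softplus(-margin)
```

When the policy equals the reference the margin is zero and the loss is log 2; raising the chosen response's likelihood relative to the reference lowers it.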
GRPO (Shao et al., DeepSeekMath, arXiv 2402.03300, Feb 2024). Group Relative Policy Optimization is a PPO variant that drops the learned critic. For each prompt the model samples a group of completions, scores them with a reward model or a rule-based checker, and computes advantages by z-scoring the rewards within the group. Because no critic of comparable size to the policy has to be trained, memory consumption is substantially lower. GRPO became widely known as the training method behind DeepSeek-R1 (Jan 2025); its R1-Zero precursor applied GRPO directly to a base model with verifiable math and code rewards and elicited chain-of-thought reasoning, and the full R1 pipeline rivalled OpenAI's o1.
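The group-relative advantage is a z-score over the rewards of completions sampled for the same prompt. In this sketch the function name, the use of the population standard deviation, and the small `eps` are assumptions; implementations differ on these details.

```python
import statistics


def grpo_advantages(rewards, eps=1e-8):
    """Z-score each completion's reward within its prompt group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

Every token of a completion then shares that completion's advantage in the clipped policy-gradient update. A group whose completions all receive the same reward yields zero advantages and therefore no gradient signal.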
Process reward models (PRM). Outcome reward models score only the final answer; PRMs score every intermediate reasoning step. OpenAI's "Let's verify step by step" (Lightman et al., 2023) showed PRMs trained on the PRM800K dataset outperformed outcome supervision on MATH benchmark problems. Some open reasoning efforts, including the DeepSeek-R1 report, describe setting PRMs aside because of reward-hacking issues at scale, but PRMs remain an active area of research.
Other RL-for-LLM variants include REINFORCE++ (an open-source RLHF baseline), self-play preference optimization (SPIN, SPPO), and online RLHF with rejection sampling. The common pattern is to maximise a reward signal that captures the desired behaviour while a KL penalty keeps the policy close to a supervised baseline.
Reinforcement learning for robotics combines simulation-based training, demonstration data, and increasingly large vision-language backbones.
QT-Opt (Kalashnikov et al., CoRL 2018). Trained a Q-function on 580,000 real grasp attempts from a fleet of seven robot arms. The result was 96% grasp success on previously unseen objects with emergent regrasping, pre-grasp manipulation, and disturbance handling.
RT-1 (Brohan et al., Google, Dec 2022). A 35M-parameter Transformer that tokenised images and instructions into a fixed sequence and produced tokenised actions. RT-1 was trained on 130,000 episodes covering 700+ tasks gathered over 17 months with 13 Everyday Robots arms. Code was open-sourced.
RT-2 (Brohan et al., Google DeepMind, July 2023). First vision-language-action (VLA) model fine-tuned from a web-scale VLM (PaLI-X or PaLM-E). RT-2 encodes actions as text tokens, lets the VLM produce them, and decodes them back into motor commands. The model showed strong generalisation to new objects, novel commands, and chain-of-thought-style reasoning about which everyday object to use as an improvised tool.
Open X-Embodiment / RT-X (Open X-Embodiment Collaboration, Oct 2023). Twenty-one institutions pooled 60 robot datasets covering 22 robot embodiments, 527 skills, and over 160,000 tasks. Training RT-1 and RT-2 on this mixture produced RT-1-X and RT-2-X; RT-1-X achieved a roughly 50% higher mean success rate than single-robot baselines across five commonly used robots.
OpenVLA (Kim et al., Stanford, UC Berkeley and others, June 2024). Open-source 7B-parameter VLA built on SigLIP, DinoV2, and Llama 2 7B, trained on 970,000 demonstrations from Open X-Embodiment. OpenVLA outperformed the proprietary 55B RT-2-X by 16.5 percentage points on a benchmark of 29 multi-embodiment tasks while using 7 times fewer parameters. Released weights, code, and LoRA fine-tuning notebooks.
Pi-0 (Physical Intelligence, Nov 2024). A flow-matching VLA built on a pre-trained VLM with a separate action expert. Pi-0 was trained on data from seven robot configurations and 68 tasks and could fold laundry, assemble cardboard boxes, and bus tables. Physical Intelligence open-sourced the model and weights via its openpi repository.
Gemini Robotics (Google DeepMind, March 2025). VLA built on Gemini 2.0 with physical actions as a new output modality. On the same generalisation benchmark used in the RT-2 paper, Gemini Robotics more than doubled the score of prior VLAs. DeepMind also released Gemini Robotics-ER, an embodied-reasoning variant focused on spatial planning, and follow-up versions Gemini Robotics 1.5 and Gemini Robotics On-Device.
Most current robot foundation models combine three ingredients: a pre-trained VLM backbone, a large mixture of demonstrations, and either standard behavioural cloning or RL fine-tuning on top of that. Pure online RL in the real world remains rare because of safety and data-collection costs.
Imitation learning is closely related to RL and often used in combination with it.
Behavioural cloning. Supervised learning on (state, action) pairs from an expert. Simple but suffers from compounding error: small mistakes push the learner into states that the expert data never covered, where its actions are unreliable and errors grow.
DAgger (Ross, Gordon and Bagnell, AISTATS 2011). Dataset Aggregation iteratively rolls out the current learner, queries the expert for the correct action at each visited state, aggregates the new data, and retrains. DAgger gives a regret bound under no-regret online learning and outperforms naive behavioural cloning in theory and practice.
GAIL (Ho and Ermon, NeurIPS 2016). Generative Adversarial Imitation Learning matches the state-action distribution of the imitator to that of the expert. A discriminator network learns to distinguish expert from imitator transitions; the imitator is updated with a policy-gradient method using the discriminator log probability as reward. GAIL is more sample-efficient in expert demonstrations than inverse RL methods that first recover a reward function and then plan.
MADDPG (Lowe, Wu, Tamar, Harb, Abbeel and Mordatch, NeurIPS 2017). Multi-Agent Deep Deterministic Policy Gradient uses a centralised critic for each agent that conditions on the joint state and joint action during training, while each agent's actor only uses its own observation at execution. The framework handles cooperative, competitive, and mixed settings and addresses the non-stationarity that arises when many agents update their policies simultaneously.
QMIX (Rashid et al., ICML 2018). A value-based method for cooperative multi-agent RL. A central mixing network combines per-agent utilities into a joint action value, subject to a monotonicity constraint that lets each agent act greedily on its own utility. QMIX strongly outperformed prior value-decomposition methods on the StarCraft Multi-Agent Challenge.
MAPPO (Yu et al., NeurIPS 2022). Multi-Agent PPO showed that, with shared parameters and a centralised value function, vanilla PPO matches or exceeds specialised multi-agent algorithms on StarCraft Multi-Agent Challenge, Multi-Particle Environments, Hanabi, and Google Research Football, despite being conceptually much simpler.
Other notable multi-agent algorithms include COMA (Foerster et al., AAAI 2018), VDN (Sunehag et al., AAMAS 2018), and TRPO-derived methods such as HATRPO and HAPPO (Kuba et al., ICLR 2022).
Reinforcement learning research relies on a shared set of benchmarks that span pixels, continuous control, generalisation, and embodied agents.
| Benchmark | Introduced | Authors or maintainer | Description |
|---|---|---|---|
| Arcade Learning Environment (Atari) | 2013 | Bellemare, Naddaf, Veness, Bowling, JAIR | Interface to hundreds of Atari 2600 games; the canonical pixel-based RL benchmark. |
| OpenAI Gym | 2016 | Brockman et al., OpenAI | Standard Python interface for RL environments; succeeded by Gymnasium (Farama). |
| MuJoCo Locomotion | 2012 | Todorov, Erez and Tassa | Continuous-control locomotion tasks (HalfCheetah, Ant, Humanoid) commonly run via Gym. |
| DeepMind Control Suite | 2018 | Tassa et al., DeepMind | Continuous-control tasks with standardised rewards on MuJoCo. |
| DM Lab | 2016 | Beattie et al., DeepMind | 3D first-person learning environment built on Quake III Arena. |
| ProcGen Benchmark | 2019 to 2020 | Cobbe, Hesse, Hilton and Schulman, OpenAI (ICML 2020) | 16 procedurally generated game environments to measure generalisation; used in the NeurIPS 2020 Procgen Competition. |
| MiniGrid and BabyAI | 2018, 2023 | Chevalier-Boisvert et al., Farama | Modular grid worlds with goal-oriented tasks and language instructions. |
| Crafter | 2021 | Hafner, ICLR 2022 | 2D Minecraft-like survival benchmark with 22 achievements; expert humans score 50.5%. |
| NetHack Learning Environment | 2020 | Küttler et al., NeurIPS 2020 | RL interface to NetHack 3.6.6; one of the hardest roguelikes ever built. |
| Habitat | 2019 | Savva et al., ICCV 2019 | Photorealistic 3D simulator for embodied navigation and manipulation. |
| BSuite | 2019 to 2020 | Osband et al., ICLR 2020 | Targeted experiments isolating exploration, credit assignment, and scale. |
| D4RL | 2020 | Fu et al., arXiv 2004.07219 | Standard datasets for offline RL across MuJoCo, AntMaze, Adroit and Franka Kitchen. |
| Open X-Embodiment | 2023 | Open X-Embodiment Collaboration | Pooled robot manipulation data covering 22 embodiments and 527 skills. |
| StarCraft Multi-Agent Challenge | 2019 | Samvelyan et al., AAMAS 2019 | Cooperative micromanagement scenarios in StarCraft II. |
Atari (the 57-game subset known as Atari-57) remains the most-cited single benchmark in deep RL despite its age, partly because it spans dense and sparse rewards, partial observability, and long horizons within a uniform interface.
| Library | Group | Notes |
|---|---|---|
| Stable Baselines3 | Open-source community | PyTorch reimplementations of PPO, DQN, SAC, TD3 and others with a consistent API. Successor to Stable Baselines (TensorFlow). |
| Ray RLlib | Anyscale | Scalable distributed RL on the Ray framework; broad algorithm coverage including offline RL and multi-agent. |
| CleanRL | Costa Huang et al., JMLR 2022 | High-quality single-file implementations of PPO, DQN, SAC, TD3, DDPG and others; easy to read and modify. |
| TorchRL | PyTorch Foundation | Modular RL library tightly integrated with PyTorch; offers data primitives, replay buffers, and algorithm building blocks. |
| Acme | DeepMind | Research-oriented framework for distributed RL, built around the Reverb replay system. |
| Dopamine | Google | TensorFlow- and JAX-based research framework focused on value-based methods. |
| Tianshou | Tsinghua and community | PyTorch RL library with modular API and broad algorithm support. |
| OpenSpiel | DeepMind | Framework for research in games (board games, card games, RTS, etc.) with implementations of CFR, MCTS, and many RL algorithms. |
| TRL | Hugging Face | Library focused on RL fine-tuning of language models (PPO RLHF, DPO, GRPO, RLOO). |
Other widely used systems include the JAX-based DQN Zoo (DeepMind) and Coax, Pearl by Meta AI, and trlX by CarperAI for large-scale RLHF.
Reinforcement learning models share a set of well-known practical limitations.
Sample efficiency is the biggest. Deep RL agents often need tens of millions of environment steps to match what a human achieves in minutes. Model-based methods narrow the gap considerably but rarely close it, and Dreamer-class agents still need millions of steps on harder benchmarks like Crafter and Minecraft.
Reward specification is harder than it looks. A naively specified reward typically produces reward hacking: the agent finds an unintended shortcut that scores high on the reward function but violates the user's intent. OpenAI's CoastRunners boat-racing agent, which circled to collect respawning targets instead of finishing the race, is the canonical example, and agents trained on CoinRun illustrate the related problem of learning a goal that differs from the intended one. The problem becomes more acute when reward models are themselves learned from preferences, as in RLHF, where models can drift toward sycophancy or verbosity that humans subtly favour.
Generalisation is shaky. Most Atari and MuJoCo experiments evaluate agents on the same environments they were trained on. ProcGen and Crafter were designed to expose how brittle these policies are when the level layout changes. Even large-scale agents like AlphaStar required curated leagues to avoid catastrophic forgetting and intransitive cycles.
Exploration in sparse-reward, long-horizon tasks is unsolved. Methods like curiosity-driven exploration (Pathak et al., ICML 2017), Go-Explore (Ecoffet et al., Nature 590, 2021), and NGU/Agent57 (Badia et al., 2020) make progress in specific settings but no method works robustly across domains.
Safety and the offline-to-online gap remain open problems in robotics. Offline RL methods can fail silently when the deployment distribution drifts; online RL on physical hardware risks damage and is expensive to instrument.
Finally, evaluating RL is itself difficult. Reproducibility studies (Henderson et al., AAAI 2018) showed that reported scores depend heavily on random seeds, hyperparameter sweeps, and implementation details. The rliable evaluation library (Agarwal et al., NeurIPS 2021) and its interquartile mean (IQM) metric became standard responses, but careful evaluation remains comparatively rare in published deep RL.
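The interquartile mean is simple to compute by hand: discard the lowest and highest quarter of the scores and average the rest. The function below is a standalone sketch and not the rliable API; it floors the number of scores trimmed from each end and assumes a non-empty input.

```python
def interquartile_mean(scores):
    """Mean of the middle 50% of scores (floor(n/4) values trimmed from each end)."""
    xs = sorted(scores)
    n = len(xs)
    cut = n // 4
    middle = xs[cut:n - cut]
    return sum(middle) / len(middle)
```

Compared with the plain mean, the IQM is far less sensitive to a single outlier run; compared with the median, it uses half of the data rather than one or two points.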