DQN
Last reviewed
Apr 28, 2026
Sources
30 citations
Review status
Source-backed
Revision
v4 · 7,018 words
The Deep Q-Network (DQN) is a model-free, off-policy reinforcement learning algorithm that combines Q-learning with a deep neural network function approximator. DQN was introduced by Volodymyr Mnih and colleagues at DeepMind in a 2013 NeurIPS Deep Learning Workshop paper, "Playing Atari with Deep Reinforcement Learning" [1], and was extended into the landmark Nature paper "Human-level control through deep reinforcement learning", published in February 2015 [2]. The Nature paper showed that a single neural network architecture, trained from raw pixels and the game score alone, could reach a level of play comparable to that of a professional human games tester across a set of 49 Atari 2600 games in the Arcade Learning Environment (ALE) [2][3].
DQN is widely credited with launching the modern era of deep reinforcement learning. It demonstrated that the long-standing instability problems of combining Q-learning with non-linear function approximation could be tamed using two simple ideas, experience replay and a separate target network, both running on top of a convolutional neural network trained with stochastic gradient descent. The Nature paper has accumulated tens of thousands of citations and motivated a long line of variants including Double DQN, Dueling DQN, Prioritized Experience Replay, Distributional DQN (C51), Noisy DQN, Rainbow, R2D2, NGU, Agent57, and the planning-based MuZero [4][5][6][7][8][9][10][11][12].
DeepMind, the London-based research lab founded in 2010 by Demis Hassabis, Shane Legg, and Mustafa Suleyman, was acquired by Google in January 2014 for a reported sum of around 400 to 500 million USD [13]. The Atari result, originally posted as an arXiv preprint in December 2013, played a substantial role in convincing Google that the company had something singular [13]. After the Nature paper, DQN became one of the most studied algorithms in machine learning, and its experience replay buffer, target network, and CNN-on-pixels recipe became the standard template for value-based deep RL.
DQN solves the value-based reinforcement learning problem of estimating the optimal action-value function Q*(s, a), which gives the maximum expected discounted return obtainable from state s by taking action a and following an optimal policy thereafter. In classical tabular Q-learning, the function Q(s, a) is stored in a table with one entry per state-action pair, which is intractable for state spaces with millions or billions of distinct states. DQN replaces the table with a parameterized neural network Q(s, a; theta) that maps a state vector to a vector of Q-values, one per discrete action. The network is trained to minimize the squared temporal difference (TD) error between the predicted Q-value and a bootstrapped target derived from the Bellman equation.
The original Atari setup feeds the network the last four grayscale game frames preprocessed to 84x84 pixels, runs them through three convolutional layers and one fully connected layer, and outputs the Q-values for the 4 to 18 discrete actions available on a given game [2]. The agent acts according to an epsilon-greedy policy on these Q-values, and at each environment step it stores the transition (s, a, r, s') in a replay buffer. Mini-batches sampled uniformly from the buffer are used to update the network with RMSprop, with the bootstrapped target computed from a periodically synchronized copy of the network called the target network.
The combination of these ingredients (a deep convolutional Q-network, a large experience replay buffer, a slowly updated target network, reward clipping, frame skipping, and frame stacking) was sufficient to reach or exceed the performance of a professional human games tester on 22 of 49 tested Atari games and to surpass all previous reinforcement learning algorithms on 43 of them [2]. The same network architecture and the same hyperparameters were used for all games, with no game-specific tuning.
Reinforcement learning studies how an agent should act in a sequential decision problem to maximize cumulative reward. The standard formalism is the Markov decision process (MDP) defined by a state space S, an action space A, a transition kernel P(s' | s, a), a reward function R(s, a), and a discount factor gamma in [0, 1] [14]. A policy pi maps states to (distributions over) actions, and the goal is to find a policy that maximizes the expected discounted return E[sum_{t=0}^infty gamma^t r_t].
The optimal action-value function Q*(s, a) satisfies the Bellman optimality equation
Q*(s, a) = E[r + gamma * max_{a'} Q*(s', a') | s, a].
Given Q*, an optimal policy is to act greedily with respect to it: pi*(s) = argmax_a Q*(s, a). Solving the Bellman equation directly is feasible only for small tabular MDPs, so practical methods rely on approximate dynamic programming or sampling-based estimation [14].
Q-learning, introduced by Chris Watkins in his 1989 PhD thesis [15] and analyzed by Watkins and Dayan in 1992 [16], is an off-policy temporal-difference algorithm that learns Q* directly from sampled transitions. After observing a transition (s, a, r, s'), the tabular update is
Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_{a'} Q(s', a') - Q(s, a)).
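The update can be sketched in a few lines of Python. This is a minimal illustration, assuming a dictionary-backed table; the names `q_learning_update`, `Q`, and `n_actions`, the default step size, and the `done` flag for terminal transitions are choices made for the example and do not come from the source.

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, done, n_actions, alpha=0.1, gamma=0.99):
    """Apply one tabular Q-learning update in place and return the TD error."""
    # Bootstrap from the best next action unless the episode has ended
    # (the terminal case is an assumption added for the example).
    best_next = 0.0 if done else max(Q[(s_next, b)] for b in range(n_actions))
    td_error = r + gamma * best_next - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
    return td_error

Q = defaultdict(float)      # unseen state-action pairs default to 0
Q[("s1", 0)] = 2.0          # pretend an earlier update produced this value
delta = q_learning_update(Q, "s0", 1, r=1.0, s_next="s1", done=False, n_actions=2)
```

Here the TD error is r + gamma * max Q(s', .) - Q(s, a), and the table entry moves a fraction alpha of the way toward the bootstrapped target.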
Under mild conditions on the learning rate schedule and infinite visitation of all state-action pairs, tabular Q-learning converges to Q* with probability one [16]. The off-policy property is important: the algorithm can use data collected by any behavior policy, including a fully random one, while still learning the optimal greedy policy.
Replacing the table with a neural network parameterized by theta gives the natural objective
L(theta) = E[(r + gamma * max_{a'} Q(s', a'; theta) - Q(s, a; theta))^2].
In principle this can be optimized by stochastic gradient descent. In practice, doing so naively was known to be unstable or divergent for years before DQN [17][18]. The combination of function approximation, bootstrapping (using current estimates inside the target), and off-policy training forms what Sutton and Barto call "the deadly triad" of value-based RL [14], and the strongly correlated samples produced by an agent acting in sequence make matters worse. The pathologies include violent oscillations of the predicted Q-values, divergence of the loss to infinity, and catastrophic forgetting of states the agent has not visited recently.
DQN's central contribution was to neutralize the deadly triad enough to make deep value learning practical, primarily through experience replay and target networks.
The full DQN algorithm as described in the Nature paper proceeds as follows [2]:
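The control flow can be sketched as a toy Python loop. This is a deliberately reduced illustration, not the paper's implementation: a small table stands in for the convolutional network so that the example runs with the standard library alone, the chain environment `env_step` is hypothetical, and all constants are chosen for the toy problem rather than taken from the Nature hyperparameters.

```python
import random
from collections import deque

random.seed(0)
N_STATES, N_ACTIONS = 6, 2            # toy chain: action 0 = left, 1 = right
GAMMA, LR, BATCH, SYNC_EVERY = 0.9, 0.1, 8, 50

def env_step(s, a):
    """Hypothetical toy environment: reward 1 for reaching the right end."""
    s_next = max(0, s - 1) if a == 0 else s + 1
    done = s_next == N_STATES - 1
    return s_next, (1.0 if done else 0.0), done

q_online = [[0.0] * N_ACTIONS for _ in range(N_STATES)]  # stands in for Q(s, a; theta)
q_target = [row[:] for row in q_online]                  # stands in for Q(s, a; theta^-)
replay = deque(maxlen=10_000)                            # replay buffer D

step = 0
for episode in range(300):
    s, done = 0, False
    for _ in range(100):                                 # cap the episode length
        # Epsilon-greedy action selection, annealed linearly from 1.0 to 0.1.
        epsilon = max(0.1, 1.0 - step / 2_000)
        if random.random() < epsilon:
            a = random.randrange(N_ACTIONS)
        else:
            a = max(range(N_ACTIONS), key=lambda b: q_online[s][b])
        s_next, r, done = env_step(s, a)
        replay.append((s, a, r, s_next, done))           # store the transition
        s = s_next
        step += 1

        if len(replay) >= BATCH:
            # Sample a uniform mini-batch; move each prediction toward a target
            # computed from the frozen copy (a table analogue of a gradient step).
            for bs, ba, br, bs_next, bdone in random.sample(replay, BATCH):
                y = br if bdone else br + GAMMA * max(q_target[bs_next])
                q_online[bs][ba] += LR * (y - q_online[bs][ba])
        if step % SYNC_EVERY == 0:
            q_target = [row[:] for row in q_online]      # periodic hard copy
        if done:
            break

# Greedy action in every non-terminal state after training.
greedy = [max(range(N_ACTIONS), key=lambda b: q_online[s][b]) for s in range(N_STATES - 1)]
```

The four moving parts are the ones the article describes: epsilon-greedy acting, storing transitions in a buffer, uniform mini-batch updates against a frozen target, and a periodic copy of the online parameters into the target.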
The full pseudocode in Algorithm 1 of the Nature paper is essentially this with the explicit preprocessing and Huber loss clipping written out [2].
Experience replay stores each transition (s_t, a_t, r_t, s_{t+1}) the agent observes in a buffer D, and the training updates sample mini-batches uniformly at random from D rather than using only the most recent experience [2][19]. The technique was introduced by Long-Ji Lin in his 1992 PhD work for reinforcement learning with neural networks [19]. DQN scaled it up dramatically: the original buffer holds the most recent one million transitions, and each gradient step trains on a mini-batch of 32 samples drawn from this large buffer.
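Replay serves three purposes [2]: it improves data efficiency, because each stored transition can be reused in many weight updates; it breaks the strong correlation between consecutive samples, which reduces the variance of the updates; and it averages the training distribution over many past versions of the policy, which smooths learning and helps avoid feedback loops in which the current parameters determine the next data they are trained on.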
The replay buffer also makes DQN inherently off-policy: the data in D was generated by older versions of the policy, but Q-learning's bootstrap update is valid as long as transitions are drawn from any reasonable behavior distribution.
The target network is a periodic snapshot of the online Q-network, parameterized by theta^- and used only to compute the bootstrap target r + gamma * max_{a'} Q_hat(s', a'; theta^-). Every C gradient steps (10,000 in the Nature paper) the target weights are overwritten with the current online weights [2].
The motivation is to break a feedback loop. Without a target network, every gradient update to Q(s, a) also shifts the bootstrap targets of nearby states, because the same parameters produce both the prediction and the target; the network then chases a moving target, and errors can amplify into oscillation or divergence. Holding the target fixed for thousands of steps gives the online network a stationary objective long enough to make stable progress before the target moves [2]. The ablation reported in the Nature paper found that removing either the target network or experience replay lowered scores on the games tested, with the loss of replay generally being the more damaging of the two [2].
A common variant, popularized by DDPG and later picked up by some DQN-style agents, replaces the periodic hard copy with a slow Polyak average theta^- <- tau * theta + (1 - tau) * theta^- with tau small (e.g., 0.005) [20]. Both schemes accomplish the same stabilizing effect.
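The two update rules differ only in how much of the online parameters is copied at a time. The sketch below uses flat Python lists as stand-ins for parameter tensors; the function names and the example values are illustrative assumptions.

```python
def hard_update(target, online):
    """Periodic copy used by DQN: theta^- <- theta."""
    target[:] = online

def polyak_update(target, online, tau=0.005):
    """Soft update: theta^- <- tau * theta + (1 - tau) * theta^-."""
    for i, w in enumerate(online):
        target[i] = tau * w + (1.0 - tau) * target[i]

online = [1.0, -2.0, 0.5]       # illustrative flat parameter vector
target = [0.0, 0.0, 0.0]
polyak_update(target, online)   # target moves 0.5% of the way toward online
soft_snapshot = list(target)
hard_update(target, online)     # target now equals online exactly
```

With the hard copy the target is exactly stationary between synchronizations; with the Polyak average it drifts continuously but slowly.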
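DQN uses several preprocessing tricks to standardize the Atari domain [2]: each frame is converted to grayscale and downsampled to 84x84 pixels, the last four processed frames are stacked to form the network input, each selected action is repeated for four emulator frames (frame skipping), rewards are clipped to [-1, 1] so that one set of hyperparameters works across games with very different score scales, and episodes begin with up to 30 random no-op actions.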
The Nature paper's hyperparameters became reference defaults for value-based deep RL. The most important ones are summarized below [2].
| Hyperparameter | Value | Role |
|---|---|---|
| Discount factor gamma | 0.99 | Weighting of future rewards |
| Replay buffer size | 1,000,000 transitions | Decorrelation and sample reuse |
| Mini-batch size | 32 | Stochastic gradient batch |
| Target network update period C | 10,000 steps | Frequency of theta^- <- theta |
| Initial epsilon | 1.0 | Starting exploration rate |
| Final epsilon | 0.1 | Long-term exploration rate (0.05 at evaluation) |
| Epsilon anneal length | 1,000,000 frames | Linear decay schedule |
| Replay start size | 50,000 transitions | Random play before learning starts |
| Optimizer | RMSprop | Adaptive per-parameter learning rate |
| Learning rate | 0.00025 | RMSprop step size |
| Squared gradient momentum | 0.95 | RMSprop second-moment decay |
| Min squared gradient | 0.01 | RMSprop denominator floor |
| Frame skip k | 4 (3 on Space Invaders) | Action repeat frequency |
| Agent history length | 4 frames | Stacked input |
| Action repeat | 4 | Same as frame skip |
| No-op max | 30 | Random initial no-ops at episode start |
| Reward clipping | [-1, 1] | Cross-game scale normalization |
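Two of the rows above, the epsilon schedule and reward clipping, are simple enough to write out directly. This is a hedged sketch: the function names are hypothetical, and the schedule is assumed to start at frame zero, which the table does not specify.

```python
def epsilon_at(frame, start=1.0, end=0.1, anneal_frames=1_000_000):
    """Linearly anneal epsilon from `start` to `end`, then hold it at `end`."""
    fraction = min(max(frame / anneal_frames, 0.0), 1.0)
    return start + fraction * (end - start)

def clip_reward(reward):
    """Clip a raw game reward into [-1, 1]."""
    return max(-1.0, min(1.0, reward))

eps_start = epsilon_at(0)
eps_mid = epsilon_at(500_000)
eps_late = epsilon_at(5_000_000)
clipped = [clip_reward(r) for r in (-200.0, 0.0, 0.5, 1000.0)]
```

Halfway through the anneal the exploration rate sits at the midpoint of the two endpoints, and any reward larger in magnitude than 1 collapses to plus or minus 1.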
The DQN convolutional architecture used in the Nature paper is a feed-forward network mapping a stack of four 84x84 grayscale frames to a 4 to 18 dimensional Q-value vector. There is no recurrent state, no batch normalization, and no skip connection [2].
| Layer | Type | Filters / Units | Kernel | Stride | Activation | Output shape |
|---|---|---|---|---|---|---|
| Input | Frame stack | 4 channels | n/a | n/a | n/a | 84 x 84 x 4 |
| Conv1 | Convolution | 32 | 8 x 8 | 4 | ReLU | 20 x 20 x 32 |
| Conv2 | Convolution | 64 | 4 x 4 | 2 | ReLU | 9 x 9 x 64 |
| Conv3 | Convolution | 64 | 3 x 3 | 1 | ReLU | 7 x 7 x 64 |
| Flatten | Reshape | n/a | n/a | n/a | n/a | 3,136 |
| FC1 | Fully connected | 512 | n/a | n/a | ReLU | 512 |
| Output | Fully connected | num_actions | n/a | n/a | Linear | num_actions |
A crucial architectural choice is that the network outputs Q-values for all actions in a single forward pass, given a state input alone [2]. This is more efficient than the alternative of taking the (state, action) pair as input and producing a single Q-value, because the inner max_{a'} Q(s', a'; theta^-) can be computed from one forward pass rather than one per action. The total parameter count is about 1.7 million.
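The output shapes in the table and the parameter count can be checked with a little arithmetic. The sketch below assumes unpadded convolutions, which is consistent with the listed output shapes, and counts one bias per filter or unit; the function names are illustrative.

```python
def conv_out(size, kernel, stride):
    """Output width of a convolution with no padding."""
    return (size - kernel) // stride + 1

def dqn_param_count(num_actions, in_channels=4, size=84):
    """Count weights and biases layer by layer for the architecture in the table."""
    total, channels = 0, in_channels
    for filters, kernel, stride in [(32, 8, 4), (64, 4, 2), (64, 3, 1)]:
        total += kernel * kernel * channels * filters + filters
        size, channels = conv_out(size, kernel, stride), filters
    flat = size * size * channels
    total += flat * 512 + 512                    # fully connected hidden layer
    total += 512 * num_actions + num_actions     # linear output layer
    return flat, total

flat, params_18 = dqn_param_count(num_actions=18)
_, params_4 = dqn_param_count(num_actions=4)
```

The spatial size shrinks 84 to 20 to 9 to 7, giving 7 x 7 x 64 = 3,136 flattened features, and the first fully connected layer accounts for the large majority of the roughly 1.7 million parameters.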
The original 2013 NeurIPS workshop version of DQN used a slightly smaller network with two convolutional layers (16 and 32 filters) and a 256-unit fully connected layer [1]. The 2015 Nature version increased depth and width and added the third convolutional layer, the larger 512-unit fully connected layer, and the explicit target network [2]. The smaller 2013 architecture is sometimes called "DQN-2013" and the Nature one "DQN-2015" or simply DQN.
The Nature paper evaluated DQN on 49 Atari 2600 games using the Arcade Learning Environment (ALE) [2][3]. Each game was played from raw pixel input with the same network, the same hyperparameters, and the same training budget of 50 million frames per game, which the paper equates to about 38 days of game experience (the figure matches 60 frames per second if each of the 50 million agent steps is counted as four emulator frames under frame skipping). After training, the agent was evaluated for 30 episodes per game with an epsilon-greedy policy at epsilon = 0.05.
The paper reports normalized scores of the form 100% * (DQN_score - random_score) / (human_score - random_score), where the human score is the average achieved by a professional human games tester after about two hours of practice on each game. A normalized score of 100% therefore means human-level, and 0% means no better than random play [2].
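As a quick worked example of the normalization, the function below applies the formula to the Space Invaders and Montezuma's Revenge rows of the results table. The function name is an assumption made for the example.

```python
def human_normalized(agent, random_score, human):
    """Return the human-normalized score as a percentage."""
    return 100.0 * (agent - random_score) / (human - random_score)

# (DQN, random, human) scores taken from the results table in this article.
space_invaders = human_normalized(1976, 148, 1652)
montezuma = human_normalized(0, 0, 4367)
```

Space Invaders comes out at roughly 121.5%, in line with the 121% listed in the table, while Montezuma's Revenge is exactly 0% because the agent scored no better than random play.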
DQN reached above 75% of human-level performance on 29 of 49 games, and exceeded human-level on 22 of them. It surpassed all previous learning algorithms on 43 of 49 games and was within 5% of the best previous result on the remaining 6 [2]. The most striking results were the games where the agent learned counter-intuitive strategies that human players had not used before, such as digging a tunnel along the left wall in Breakout to bounce the ball into the brick layer from above [2].
The raw and normalized scores for the 49 games, drawn from Extended Data Table 2 of the Nature paper, are shown below [2].
| Game | Random | Human | DQN | DQN normalized |
|---|---|---|---|---|
| Video Pinball | 16,257 | 17,298 | 42,684 | 2,539% |
| Boxing | 0 | 4 | 71.8 | 1,707% |
| Breakout | 1 | 31 | 401 | 1,327% |
| Star Gunner | 664 | 10,250 | 57,997 | 598% |
| Robotank | 2 | 12 | 51.6 | 509% |
| Atlantis | 12,850 | 29,028 | 85,641 | 449% |
| Crazy Climber | 10,781 | 35,411 | 114,103 | 419% |
| Gopher | 257 | 2,321 | 8,520 | 400% |
| Demon Attack | 152 | 3,401 | 9,711 | 294% |
| Name This Game | 2,250 | 4,076 | 7,257 | 278% |
| Krull | 1,151 | 2,395 | 3,805 | 213% |
| Assault | 222 | 1,496 | 3,359 | 246% |
| Road Runner | 200 | 7,845 | 18,257 | 232% |
| Kangaroo | 52 | 3,035 | 6,740 | 224% |
| James Bond | 29 | 303 | 576.7 | 200% |
| Tennis | -24 | -8 | -2.5 | 143% |
| Pong | -21 | 10 | 18.9 | 132% |
| Space Invaders | 148 | 1,652 | 1,976 | 121% |
| Beam Rider | 364 | 5,775 | 6,846 | 119% |
| Tutankham | 11 | 167 | 186.7 | 112% |
| Kung-Fu Master | 258 | 22,736 | 23,270 | 102% |
| Freeway | 0 | 30 | 30.3 | 102% |
| Time Pilot | 3,568 | 5,925 | 5,947 | 100% |
| Enduro | 0 | 309 | 301.8 | 97% |
| Fishing Derby | -91 | 6 | -0.8 | 93% |
| Up and Down | 533 | 9,082 | 8,456 | 92% |
| Ice Hockey | -11 | 1 | -1.6 | 79% |
| Q*bert | 164 | 13,455 | 10,596 | 78% |
| H.E.R.O. | 1,027 | 25,763 | 19,950 | 76% |
| Asterix | 210 | 8,503 | 6,012 | 70% |
| Battle Zone | 2,360 | 37,800 | 26,300 | 67% |
| Wizard of Wor | 564 | 4,757 | 3,393 | 67% |
| Chopper Command | 811 | 9,882 | 6,687 | 65% |
| Centipede | 2,091 | 11,963 | 8,309 | 63% |
| Bank Heist | 14 | 753 | 429.7 | 56% |
| River Raid | 1,338 | 13,513 | 8,316 | 57% |
| Zaxxon | 32 | 9,173 | 4,977 | 54% |
| Amidar | 6 | 1,676 | 740 | 44% |
| Alien | 227 | 6,875 | 3,069 | 43% |
| Venture | 0 | 1,188 | 380 | 32% |
| Seaquest | 68 | 20,182 | 5,286 | 26% |
| Frostbite | 65 | 4,335 | 328.3 | 6% |
| Asteroids | 719 | 13,157 | 1,629.3 | 7% |
| Private Eye | 25 | 69,571 | 1,788 | 3% |
| Gravitar | 173 | 2,672 | 306.7 | 5% |
| Ms. Pacman | 307 | 15,693 | 2,311 | 13% |
| Bowling | 23 | 154 | 42.4 | 14% |
| Double Dunk | -19 | -16 | -18.1 | 17% |
| Montezuma's Revenge | 0 | 4,367 | 0 | 0% |
The games where DQN failed badly, in particular Montezuma's Revenge, Private Eye, Gravitar, and Frostbite, all involve very sparse rewards or long-horizon planning that epsilon-greedy exploration with one-step bootstrapping cannot solve on its own. These hard-exploration games, together with others such as Pitfall from the later 57-game suite, became a benchmark in their own right and motivated later work on intrinsic motivation, hierarchical RL, and learned exploration policies, culminating in Agent57 [11].
The Nature DQN paper inspired a long line of follow-up algorithms, each addressing one of its known weaknesses. The most influential are summarized below.
| Variant | Authors | Year / Venue | Key idea |
|---|---|---|---|
| Double DQN (DDQN) | van Hasselt, Guez, Silver | AAAI 2016 [4] | Decouple action selection from action evaluation in the bootstrap target to reduce overestimation bias |
| Prioritized Experience Replay (PER) | Schaul, Quan, Antonoglou, Silver | ICLR 2016 [6] | Sample replay transitions in proportion to their TD error magnitude |
| Dueling DQN | Wang, Schaul, Hessel, van Hasselt, Lanctot, de Freitas | ICML 2016 [5] | Decompose Q(s, a) into a state value V(s) and an advantage A(s, a) with a shared trunk |
| Bootstrapped DQN | Osband, Blundell, Pritzel, Van Roy | NeurIPS 2016 [21] | Train an ensemble of Q-heads for deep exploration via Thompson-style sampling |
| Distributional DQN (C51) | Bellemare, Dabney, Munos | ICML 2017 [7] | Learn the full return distribution Z(s, a) over a fixed support of 51 atoms instead of just its mean |
| Multi-step DQN (n-step) | Sutton (concept), used in Rainbow | 1988 / 2017 | Bootstrap from the n-step return r_t + ... + gamma^{n-1} r_{t+n-1} + gamma^n max Q |
| Noisy DQN | Fortunato et al. | ICLR 2018 [8] | Add learnable parameter noise to the network to drive exploration without epsilon-greedy |
| Quantile Regression DQN (QR-DQN) | Dabney, Rowland, Bellemare, Munos | AAAI 2018 [22] | Distributional DQN with quantile regression rather than fixed support |
| Implicit Quantile Networks (IQN) | Dabney, Ostrovski, Silver, Munos | ICML 2018 [23] | Sampled-quantile distributional DQN with continuous quantile inputs |
| Rainbow | Hessel et al. | AAAI 2018 [9] | Combine Double, Dueling, PER, n-step, C51, and Noisy nets in one agent |
| Ape-X DQN | Horgan, Quan, Budden, Barth-Maron, Hessel, van Hasselt, Silver | ICLR 2018 [24] | Distributed actor-learner with shared prioritized replay across hundreds of CPU actors |
| R2D2 | Kapturowski, Ostrovski, Quan, Munos, Dabney | ICLR 2019 [10] | Recurrent Replay Distributed DQN with LSTM and stored hidden state |
| Never Give Up (NGU) | Badia et al. | ICLR 2020 [25] | Episodic + lifelong intrinsic rewards added to a recurrent DQN-style agent |
| Agent57 | Badia et al. | ICML 2020 [11] | Adaptive mixture of explorative and exploitative policies; first to beat the human baseline on all 57 ALE games |
| MuZero | Schrittwieser et al. | Nature 2020 [12] | Combines a learned model with Monte Carlo tree search; subsumes many DQN-era results |
Double DQN, introduced by Hado van Hasselt, Arthur Guez, and David Silver [4], addresses Q-learning's well-known overestimation bias. In standard DQN, the bootstrap uses max_{a'} Q_hat(s', a'; theta^-), which both selects and evaluates the next action with the same network. Because the max operator picks the largest of several noisy estimates, the resulting target is biased upward. Van Hasselt's earlier 2010 paper had introduced "Double Q-learning" as a tabular fix, and the 2016 paper carries the idea over to deep networks [4]. The Double DQN target is
y = r + gamma * Q_hat(s', argmax_{a'} Q(s', a'; theta); theta^-).
The online network theta picks the action and the target network theta^- evaluates it. This requires no extra parameters, since DQN already has both networks. Double DQN reduced the overestimation on most ALE games and improved both the mean and median normalized scores noticeably [4].
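The difference between the two targets is easiest to see on a single transition. The sketch below uses plain lists of Q-values; the function names and the numbers are illustrative, and terminal-state handling is omitted for brevity.

```python
def dqn_target(r, q_target_next, gamma=0.99):
    """Standard DQN: the target network both selects and evaluates."""
    return r + gamma * max(q_target_next)

def double_dqn_target(r, q_online_next, q_target_next, gamma=0.99):
    """Double DQN: the online network selects, the target network evaluates."""
    a_star = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    return r + gamma * q_target_next[a_star]

q_online_next = [1.0, 2.5, 2.0]   # illustrative Q(s', .; theta)
q_target_next = [1.2, 1.8, 3.0]   # illustrative Q(s', .; theta^-)
y_dqn = dqn_target(0.0, q_target_next)
y_double = double_dqn_target(0.0, q_online_next, q_target_next)
```

In this example the target network's largest entry (3.0) belongs to an action the online network does not prefer, so the Double DQN target is lower than the standard one.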
Dueling DQN, by Ziyu Wang and colleagues [5], rearchitects the network to share a convolutional trunk and split into two heads: one estimating the state value V(s; theta_V) and the other estimating the advantage A(s, a; theta_A) for each action. The Q-value is recombined as
Q(s, a) = V(s) + (A(s, a) - mean_{a'} A(s, a')).
Subtracting the mean advantage enforces identifiability since V and A are otherwise underdetermined. The intuition is that on many states, the choice of action does not change the value much (e.g., when the agent is far from any obstacle), and forcing the network to estimate V separately allows information about state value to be shared across all actions. The Dueling architecture combined with Double DQN and Prioritized Experience Replay set new state of the art on ALE in 2016 [5].
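The recombination step can be written out directly. This is a sketch of the aggregation alone, with scalar and list inputs standing in for the outputs of the two heads; the function name is an assumption.

```python
def dueling_q(value, advantages):
    """Combine V(s) and A(s, .) into Q(s, .) with the mean-advantage baseline."""
    mean_adv = sum(advantages) / len(advantages)
    return [value + (a - mean_adv) for a in advantages]

q = dueling_q(2.0, [0.5, -0.5, 1.5])
q_shifted = dueling_q(2.0, [10.5, 9.5, 11.5])   # same advantages plus a constant
```

Adding a constant to every advantage leaves the Q-values unchanged, which is the identifiability property the mean subtraction provides; the mean of the resulting Q-values equals V(s).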
Prioritized Experience Replay (PER), by Tom Schaul and colleagues [6], replaces the uniform sampling of replay transitions with sampling proportional to the most recent absolute TD error |delta_i| of each transition. Transitions where the network was "surprised" are sampled more often, so the network spends its updates on the experiences with the most to learn from. To correct for the bias of non-uniform sampling, the gradient is reweighted by an importance-sampling factor (1 / (N * P(i)))^beta, where beta is annealed from a small value to 1 over training. PER consistently improved sample efficiency and final scores [6].
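A small sketch of proportional prioritization follows. It assumes the common form in which priorities are raised to an exponent `alpha` and offset by a small constant `eps` so that no transition has zero probability, and it scales the importance weights by their maximum within the batch; these details and all names are assumptions beyond what the paragraph above states. A production implementation would also use a sum-tree rather than a linear scan.

```python
import random

def per_probabilities(td_errors, alpha=0.6, eps=1e-6):
    """P(i) proportional to (|delta_i| + eps) ** alpha."""
    priorities = [(abs(d) + eps) ** alpha for d in td_errors]
    total = sum(priorities)
    return [p / total for p in priorities]

def importance_weights(probs, indices, beta):
    """w_i = (1 / (N * P(i))) ** beta, scaled so the largest weight is 1."""
    n = len(probs)
    weights = [(1.0 / (n * probs[i])) ** beta for i in indices]
    top = max(weights)
    return [w / top for w in weights]

random.seed(0)
td_errors = [0.1, 2.0, 0.5, 0.0]
probs = per_probabilities(td_errors)
batch = random.choices(range(len(probs)), weights=probs, k=2)
weights = importance_weights(probs, batch, beta=0.4)
```

Transitions with large TD error are drawn more often but receive smaller importance weights, which is how the correction offsets the sampling bias as beta approaches 1.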
Distributional DQN, introduced by Marc Bellemare, Will Dabney, and Remi Munos [7], replaces the scalar Q-value Q(s, a) with a discrete distribution Z(s, a) over the possible returns. C51 represents the return distribution as 51 atoms uniformly spaced between V_min and V_max (typically -10 and 10 after reward clipping), and trains the network to match the projected Bellman target distribution under a cross-entropy loss. Despite using the same expected return for action selection, learning the full distribution gave a substantial gain over scalar DQN on ALE [7]. The follow-ups QR-DQN [22] and IQN [23] extended the idea to continuous quantile representations.
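The fixed support and the way a scalar Q-value is recovered from it can be sketched as follows. Only the atom grid and the expectation are shown; the projection of the Bellman target onto the support is omitted, and the function names are illustrative.

```python
def c51_support(v_min=-10.0, v_max=10.0, n_atoms=51):
    """Return n_atoms return values evenly spaced between v_min and v_max."""
    step = (v_max - v_min) / (n_atoms - 1)
    return [v_min + i * step for i in range(n_atoms)]

def expected_q(probs, atoms):
    """Q(s, a) is the mean of the categorical return distribution."""
    return sum(p * z for p, z in zip(probs, atoms))

atoms = c51_support()
uniform = [1.0 / len(atoms)] * len(atoms)
q_uniform = expected_q(uniform, atoms)
```

With the default range the atoms are spaced 0.4 apart, and action selection still uses the expectation, so a uniform distribution over a symmetric support yields a Q-value of zero.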
Noisy DQN, by Meire Fortunato and colleagues [8], replaces the linear weights of selected layers with noisy weights of the form w_mu + w_sigma * epsilon, where epsilon is a sampled noise vector and w_mu, w_sigma are learned. This makes the policy stochastic and lets the network learn how much exploration noise to inject in different parts of state space. Compared to epsilon-greedy, noisy nets often explored more strategically and removed the need to manually tune an exploration schedule [8].
Multi-step (or n-step) returns replace the one-step bootstrap with the n-step return r_t + gamma * r_{t+1} + ... + gamma^{n-1} * r_{t+n-1} + gamma^n * max_{a'} Q_hat(s_{t+n}, a'). With n in the range 3 to 5, this propagates reward information faster while still using bootstrapping to control variance. Multi-step DQN is technically off-policy biased when the behavior policy and target policy differ over the n steps, but in practice this bias is small enough that the method is widely used and forms one of the six ingredients in Rainbow [9].
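The n-step target can be accumulated backward from the bootstrap value. In this sketch the caller is assumed to supply the bootstrap term max_{a'} Q_hat(s_{t+n}, a') as a number; the function name and the example values are illustrative.

```python
def n_step_target(rewards, bootstrap_value, gamma=0.99):
    """Discounted sum of n rewards plus gamma**n times the bootstrap value."""
    target = bootstrap_value
    for r in reversed(rewards):
        target = r + gamma * target
    return target

# Three rewards and a bootstrap of 5.0, with gamma = 0.5 for easy arithmetic:
# 1 + 0.5 * 0 + 0.25 * 2 + 0.125 * 5 = 2.125
y = n_step_target([1.0, 0.0, 2.0], bootstrap_value=5.0, gamma=0.5)
y_one_step = n_step_target([1.0], bootstrap_value=5.0, gamma=0.5)
```

With a single reward the function reduces to the ordinary one-step target r + gamma * max Q.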
Rainbow, by Matteo Hessel and colleagues at DeepMind [9], asks a deceptively simple question: which of the many DQN improvements actually matter, and do they compose? The paper combines Double DQN, Dueling DQN, Prioritized Experience Replay, Multi-step learning, Distributional RL (C51), and Noisy nets into a single agent and runs an ablation removing each component in turn.
Rainbow set a new state of the art on the 57-game Atari benchmark, exceeding the previous best agent (Distributional DQN with multi-step bootstrapping) by a wide margin in both data efficiency and final performance, and matching the performance of much more compute-hungry distributed agents at a fraction of the wall-clock budget [9]. The ablations showed that Prioritized Replay, Multi-step learning, and the Distributional component were the largest contributors, while Double DQN had a relatively small effect (in part because the Distributional formulation already mitigates overestimation), and Dueling and Noisy nets had moderate effects that varied by game.
Rainbow is widely considered the strongest single-actor DQN-family agent and has become the standard baseline in subsequent value-based RL papers. The full set of Rainbow ingredients along with their original citations are summarized below.
| Component | Source | Effect when removed |
|---|---|---|
| Double Q-learning | van Hasselt et al. 2016 [4] | Small drop in median performance |
| Prioritized Experience Replay | Schaul et al. 2016 [6] | Largest drop in early-training data efficiency |
| Dueling networks | Wang et al. 2016 [5] | Moderate drop, varies by game |
| Multi-step learning | Sutton 1988 (concept) | Large drop in median performance |
| Distributional RL (C51) | Bellemare et al. 2017 [7] | Large drop in final performance |
| Noisy nets | Fortunato et al. 2018 [8] | Moderate drop, mostly on hard-exploration games |
Ape-X DQN, by Dan Horgan and colleagues [24], decouples acting from learning. Hundreds of CPU actor processes generate experience and write into a single shared prioritized replay buffer, while a single GPU learner samples from the buffer and pushes updated weights back to the actors. The actors run different epsilon values for built-in exploration diversity. With this architecture, Ape-X consumed roughly 22 billion environment frames per agent on Atari and reached substantially higher scores than Rainbow, at the cost of far more data and compute [24].
R2D2 (Recurrent Replay Distributed DQN), by Steven Kapturowski and colleagues [10], extends Ape-X with an LSTM core and adopts a careful protocol for training the recurrent state from replay. It stores fixed-length sequences in the replay buffer along with the recurrent hidden state at the start of each sequence, and uses a "burn-in" period to refresh the LSTM state before computing the loss. R2D2 roughly quadrupled the previous best median human-normalized score on the 57-game benchmark, then held by Ape-X, and reached super-human performance on 52 of the 57 standard Atari games [10].
A persistent gap remained on hard-exploration games, in particular Montezuma's Revenge, Pitfall, Private Eye, Solaris, Skiing, and Venture. Never Give Up (NGU), by Adria Puigdomenech Badia and colleagues [25], augments R2D2 with two intrinsic reward signals: an episodic novelty bonus computed from the distance to the agent's recent state memory, and a lifelong novelty bonus computed from a Random Network Distillation predictor. NGU was the first agent to obtain non-zero reward on Pitfall without demonstrations or hand-crafted features [25].
Agent57, also by Badia and colleagues [11], built on NGU by parameterizing a family of policies indexed by an exploration coefficient and a discount factor, and using a meta-controller to choose which policy to act with at each episode. The meta-controller is itself a bandit that learns to allocate behavior across exploitative and explorative policies. Agent57 became the first agent to exceed the human baseline on all 57 standard Atari games in 2020 [11].
MuZero, by Julian Schrittwieser and colleagues [12], goes a step further by combining a learned environment model with Monte Carlo tree search (MCTS) in the style of AlphaZero. It does not assume access to the simulator dynamics; instead it learns three networks, a representation function, a dynamics function, and a prediction function, that together let it plan in latent space. MuZero matched AlphaZero's superhuman play on Go, chess, and shogi while also matching or exceeding R2D2 on Atari, all with the same algorithm [12]. While MuZero is not literally a DQN variant, it descends directly from the DQN-on-Atari research program and shares the use of a value head trained by bootstrapping from future returns.
Because DQN is so widely studied, well-tested implementations exist in every major reinforcement learning library.
| Library | Maintainer | Implementation notes |
|---|---|---|
| Stable-Baselines3 | Antonin Raffin and contributors | PyTorch implementation of vanilla DQN without Double, Dueling, or prioritized replay extensions; QR-DQN is provided in the companion SB3-Contrib package |
| Dopamine | Google Research | TensorFlow / JAX baselines including DQN, C51, Rainbow, IQN, QR-DQN |
| Acme | DeepMind | JAX agents for DQN, Rainbow, R2D2, Ape-X, MuZero, IMPALA |
| RLlib | Anyscale (Ray) | Distributed DQN, Ape-X, R2D2 with TensorFlow and PyTorch backends |
| CleanRL | Costa Huang and contributors | Single-file PyTorch reference implementations of DQN and many variants |
| Tianshou | Tsinghua University | PyTorch DQN, Double DQN, Dueling, Rainbow, C51, QR-DQN |
| OpenAI Baselines | OpenAI | Original TensorFlow reference for DQN, used for many follow-up papers |
| Coach | Intel | Multi-framework RL library including DQN family |
Dopamine [26] was released by Google in 2018 as a small, reproducible TensorFlow research codebase aimed specifically at value-based deep RL on Atari, and it has been the reference implementation for many subsequent benchmark papers. Acme [27] is the more recent DeepMind framework that contains modular implementations of most algorithms in the DQN family, including R2D2 and MuZero, along with shared replay buffers and learner / actor abstractions.
A minimal PyTorch DQN training step looks roughly like this:
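```python
import torch
import torch.nn.functional as F

# Assumes q_net, target_net, optimizer, gamma, step, and target_update_period
# are defined elsewhere, and that states, actions (int64), rewards, next_states,
# and dones (float, 1.0 at episode end) were sampled from the replay buffer.
q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
with torch.no_grad():
    # Bootstrap from the frozen target network; no bootstrap past terminal states.
    next_q = target_net(next_states).max(dim=1).values
    target = rewards + gamma * next_q * (1.0 - dones)
loss = F.smooth_l1_loss(q_values, target)  # Huber loss
optimizer.zero_grad()
loss.backward()
# Clip each gradient element to [-1, 1].
torch.nn.utils.clip_grad_value_(q_net.parameters(), 1.0)
optimizer.step()
if step % target_update_period == 0:
    # Hard update: copy the online weights into the target network.
    target_net.load_state_dict(q_net.state_dict())
```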
The Huber loss (smooth_l1_loss in PyTorch) reproduces the Nature paper's clipping of the TD error term to [-1, 1]; the element-wise gradient clipping is an additional safeguard found in many implementations. Stable-Baselines3 wraps this entire loop, the replay buffer, the target update schedule, and the epsilon-greedy exploration into a few lines of user code [28].
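Despite its impact, DQN has substantial limitations as a general RL algorithm: it handles only discrete action sets, because the max over actions is taken by enumerating the network's outputs; it needs tens of millions of frames per game; its max-based bootstrap target tends to overestimate action values; and epsilon-greedy exploration fails on sparse-reward games such as Montezuma's Revenge.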
DQN's broader impact on machine learning is hard to overstate. Before December 2013, deep learning's most visible successes were in supervised settings such as ImageNet image classification and speech recognition. Reinforcement learning mostly relied on tabular methods or linear function approximation over hand-crafted features. DQN demonstrated that the same convolutional networks that had transformed perception could learn to act, given a working scheme for stabilizing value-based training [1][2].
The paper's influence shows up in three connected ways. First, it cemented DeepMind's reputation as the leading deep RL lab, contributing materially to Google's January 2014 acquisition for around 400 to 500 million USD [13]. Second, it spawned a research program at DeepMind, OpenAI, Berkeley, Google Brain, and many other labs that produced a flood of value-based and actor-critic deep RL algorithms during 2015 to 2020. Third, it set the template, raw observations into a deep network with experience replay and a target network, that nearly all subsequent value-based deep RL agents have followed.
The Atari result also laid the groundwork for DeepMind's later milestones. AlphaGo's value network [29] applies the same idea of training a deep network to predict expected outcomes, in its case from self-play games rather than by one-step bootstrapping. AlphaZero [30] and MuZero [12] continue along the same line by combining learned value functions with planning. The Nature paper's authors went on to lead substantial parts of those projects.
Finally, DQN gave the field its first widely accepted benchmark for deep RL: Atari 2600 played from pixels. The 57-game ALE benchmark, with the now-standard "human-normalized score" reporting, became the field's MNIST for sequential decision making, and the long arc from DQN (22 of 49 games above human) through Rainbow, Ape-X, R2D2, NGU, and Agent57 (57 of 57 games above human) tracks roughly five years of rapid algorithmic progress.