# DQN

> Source: https://aiwiki.ai/wiki/dqn
> Updated: 2026-06-23
> Categories: Deep Learning, Google DeepMind, Reinforcement Learning
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

The **Deep Q-Network** (**DQN**) is a model-free, off-policy [reinforcement learning](/wiki/reinforcement_learning) algorithm that combines [Q-learning](/wiki/q-learning) with a [deep neural network](/wiki/deep_learning) function approximator, learning to act directly from raw pixels. DQN was introduced by Volodymyr Mnih and colleagues at [DeepMind](/wiki/deepmind) in a 2013 NeurIPS Deep Learning Workshop paper, "Playing Atari with Deep Reinforcement Learning" [1], and was extended into the landmark Nature paper "Human-level control through deep reinforcement learning" published on 25 February 2015 (Nature volume 518, pages 529 to 533) [2]. The Nature paper showed that a single neural network architecture, trained from raw pixels and the game score alone, could learn to play 49 different Atari 2600 games in the Arcade Learning Environment (ALE), reaching or exceeding professional human-level play on roughly half of them [2][3]. The authors reported that the agent surpassed all previous algorithms and "achieved a level comparable to that of a professional human games tester across a set of 49 games, using the same algorithm, network architecture and hyperparameters." [2]

A longer companion overview of this algorithm is maintained at [Deep Q-Network (DQN)](/wiki/deep_q-network_dqn); this page is the in-depth technical reference for the same method.

DQN is widely credited with launching the modern era of [deep reinforcement learning](/wiki/reinforcement_learning). It demonstrated that the long-standing instability problems of combining [Q-learning](/wiki/q-learning) with non-linear function approximation could be tamed using two simple ideas, **experience replay** and a separate **target network**, both running on top of a [convolutional neural network](/wiki/convolutional_neural_network) trained with stochastic [gradient descent](/wiki/gradient_descent). The Nature paper has accumulated tens of thousands of citations and motivated a long line of variants including Double DQN, Dueling DQN, Prioritized Experience Replay, Distributional DQN (C51), Noisy DQN, Rainbow, R2D2, NGU, [Agent57](/wiki/agent57), and the planning-based [MuZero](/wiki/muzero) [4][5][6][7][8][9][10][11][12].

DeepMind, the London-based research lab founded in 2010 by Demis Hassabis, Shane Legg, and Mustafa Suleyman, was acquired by [Google](/wiki/google) in January 2014 for a reported sum of around 400 to 500 million USD [13]. The Atari result, originally posted as an arXiv preprint in December 2013, played a substantial role in convincing Google that the company had something singular [13]. After the Nature paper, DQN became one of the most studied algorithms in machine learning, and its experience replay buffer, target network, and CNN-on-pixels recipe became the standard template for value-based deep RL.

## Overview

DQN solves the value-based reinforcement learning problem of estimating the optimal action-value function Q*(s, a), which gives the maximum expected discounted return obtainable from state s by taking action a and following an optimal policy thereafter. In classical [tabular Q-learning](/wiki/q-learning), the function Q(s, a) is stored in a table with one entry per state-action pair, which is intractable for state spaces with millions or billions of distinct states. DQN replaces the table with a parameterized neural network Q(s, a; theta) that maps a state vector to a vector of Q-values, one per discrete action. The network is trained to minimize the squared temporal difference (TD) error between the predicted Q-value and a bootstrapped target derived from the [Bellman equation](/wiki/bellman_equation).

The original Atari setup feeds the network the last four grayscale game frames preprocessed to 84x84 pixels, runs them through three convolutional layers and one fully connected hidden layer, and outputs the Q-values for the 4 to 18 discrete actions available on a given game [2]. The agent acts according to an [epsilon-greedy](/wiki/epsilon_greedy) policy on these Q-values, and at each environment step it stores the transition (s, a, r, s') in a replay buffer. Mini-batches sampled uniformly from the buffer are used to update the network with [RMSprop](/wiki/rmsprop), with the bootstrapped target computed from a periodically synchronized copy of the network called the target network.

The combination of these ingredients (a deep convolutional Q-network, a large experience replay buffer, a slowly updated target network, reward clipping, frame skipping, and frame stacking) was sufficient to reach or exceed the performance of a professional human games tester on 22 of 49 tested Atari games and to surpass all previous reinforcement learning algorithms on 43 of them [2]. The same network architecture and the same hyperparameters were used for all games, with no game-specific tuning.

## Background

### Reinforcement learning and the Bellman equation

[Reinforcement learning](/wiki/reinforcement_learning) studies how an agent should act in a sequential decision problem to maximize cumulative reward. The standard formalism is the [Markov decision process](/wiki/markov_decision_process_mdp) (MDP) defined by a state space S, an action space A, a transition kernel P(s' | s, a), a reward function R(s, a), and a discount factor gamma in [0, 1) (gamma = 1 is sometimes allowed for episodic tasks) [14]. A policy pi maps states to (distributions over) actions, and the goal is to find a policy that maximizes the expected discounted return E[sum_{t=0}^infty gamma^t r_t].

The optimal action-value function Q*(s, a) satisfies the Bellman optimality equation

Q*(s, a) = E[r + gamma * max_{a'} Q*(s', a') | s, a].

Given Q*, an optimal policy is to act greedily with respect to it: pi*(s) = argmax_a Q*(s, a). Solving the Bellman equation directly is feasible only for small tabular MDPs, so practical methods rely on approximate dynamic programming or sampling-based estimation [14].

### Q-learning

[Q-learning](/wiki/q-learning), introduced by Chris Watkins in his 1989 PhD thesis [15] and analyzed by Watkins and Dayan in 1992 [16], is an off-policy temporal-difference algorithm that learns Q* directly from sampled transitions. After observing a transition (s, a, r, s'), the tabular update is

Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_{a'} Q(s', a') - Q(s, a)).

Under mild conditions on the learning rate schedule and infinite visitation of all state-action pairs, tabular Q-learning converges to Q* with probability one [16]. The off-policy property is important: the algorithm can use data collected by any behavior policy, including a fully random one, while still learning the optimal greedy policy.
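
The tabular update can be illustrated with a minimal sketch on a toy problem. The chain environment, the function name `q_learning_chain`, and all parameter values below are assumptions made for illustration, not part of any cited paper. The behavior policy is fully random, which shows the off-policy property: the greedy policy read off the table still moves right toward the reward.

```python
import random


def q_learning_chain(n_states=5, episodes=500, alpha=0.1, gamma=0.9, seed=0):
    """Tabular Q-learning on a toy chain with actions 0 (left) and 1 (right).

    Reaching the right-most state gives reward 1 and ends the episode.
    """
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        s = 0
        while s != n_states - 1:
            a = rng.randrange(2)  # fully random behavior policy
            s_next = max(0, s - 1) if a == 0 else s + 1
            done = s_next == n_states - 1
            r = 1.0 if done else 0.0
            bootstrap = 0.0 if done else max(q[s_next])
            # Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
            q[s][a] += alpha * (r + gamma * bootstrap - q[s][a])
            s = s_next
    return q


q_table = q_learning_chain()
# Greedy action in each non-terminal state (1 means "move right").
greedy_policy = [max((0, 1), key=lambda a: q_table[s][a]) for s in range(4)]
```

In this toy setting the learned values approach gamma raised to the number of remaining steps, for example about 0.9^3 for moving right from the left-most state.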

### Why naive deep Q-learning is unstable

Replacing the table with a neural network parameterized by theta gives the natural objective

L(theta) = E[(r + gamma * max_{a'} Q(s', a'; theta) - Q(s, a; theta))^2].

In principle this can be optimized by stochastic gradient descent. In practice, doing so naively was known to be unstable or divergent for years before DQN [17][18]. The combination of function approximation, bootstrapping (using current estimates inside the target), and off-policy training forms what Sutton and Barto call "the deadly triad" of value-based RL [14]; strongly correlated sequential training data makes matters worse. The pathologies include violent oscillations of the predicted Q-values, divergence of the loss to infinity, and catastrophic forgetting of states the agent has not visited recently.

DQN's central contribution was to neutralize the deadly triad enough to make deep value learning practical, primarily through experience replay and target networks.

## Algorithm

### High-level loop

The full DQN algorithm as described in the Nature paper proceeds as follows [2]:

1. Initialize the action-value network Q with random weights theta and a replay buffer D of fixed capacity (typically 1 million transitions).
2. Initialize the target network Q_hat with weights theta^- equal to theta.
3. For each episode, observe the initial state s_0 and preprocess it into phi_0.
4. For each step t in the episode:
   a. With probability epsilon select a random action a_t, otherwise select a_t = argmax_a Q(phi_t, a; theta).
   b. Execute a_t in the emulator, receive reward r_t and the next frame, and form phi_{t+1}.
   c. Store the transition (phi_t, a_t, r_t, phi_{t+1}) in D.
   d. Sample a uniform mini-batch of transitions (phi_j, a_j, r_j, phi_{j+1}) from D.
   e. Compute the target y_j = r_j if the next state is terminal, otherwise y_j = r_j + gamma * max_{a'} Q_hat(phi_{j+1}, a'; theta^-).
   f. Take a gradient step on (y_j - Q(phi_j, a_j; theta))^2 with respect to theta.
   g. Every C steps, set theta^- <- theta.
5. Anneal epsilon from 1.0 to 0.1 over the first million frames, then keep it fixed.

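The full pseudocode in Algorithm 1 of the Nature paper is essentially this, with the preprocessing map phi written out explicitly; the error clipping that makes the loss behave like a Huber loss is described in the paper's Methods text rather than in the pseudocode [2].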
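
Steps 4a and 5 of the loop can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function names `linear_epsilon` and `epsilon_greedy` are hypothetical, and the schedule simply interpolates between the start and end values listed above.

```python
import random


def linear_epsilon(frame, eps_start=1.0, eps_end=0.1, anneal_frames=1_000_000):
    """Linearly anneal epsilon from eps_start to eps_end, then hold it fixed."""
    fraction = min(frame / anneal_frames, 1.0)
    return eps_start + fraction * (eps_end - eps_start)


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the argmax of q_values."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Halfway through the anneal, epsilon is midway between 1.0 and 0.1.
eps_mid = linear_epsilon(500_000)
# With epsilon = 0 the choice is purely greedy.
greedy_action = epsilon_greedy([0.1, 0.7, 0.3], epsilon=0.0)
```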

### Experience replay

[Experience replay](/wiki/experience_replay) stores each transition (s_t, a_t, r_t, s_{t+1}) the agent observes in a buffer D, and the training updates sample mini-batches uniformly at random from D rather than using only the most recent experience [2][19]. DeepMind described it as the more important of the two stabilizers, writing that "the incorporation of experience replay was critical to the success of DQN." [31] The technique was introduced by Long-Ji Lin in his 1992 PhD work for reinforcement learning with neural networks [19]. DQN scaled it up dramatically: the original buffer holds the most recent one million transitions, and each gradient step trains on a mini-batch of 32 samples drawn from this large buffer.

Replay serves three purposes [2]:

- **Decorrelation**: Consecutive frames in an Atari game are highly correlated, and SGD assumes roughly independent and identically distributed samples. Sampling random transitions from a large buffer breaks these correlations and gives gradient estimates that look much more like i.i.d. samples.
- **Sample efficiency**: Each transition can be reused many times instead of being thrown away after a single update. This is especially valuable when interacting with the environment is much slower than running a gradient step, which is true in robotics, ALE, and most simulators.
- **Stabilization**: Averaging over many past behavior policies smooths the data distribution that the network is trained on, preventing the parameters from oscillating in lockstep with the most recent policy change.

The replay buffer also makes DQN inherently off-policy: the data in D was generated by older versions of the policy. This is acceptable because the Q-learning target takes a max over next actions and therefore does not depend on which behavior policy produced a transition, provided the buffer still covers the relevant state-action pairs.
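
A uniform replay buffer of the kind described above can be sketched as a ring buffer. The class name `ReplayBuffer` and its methods are illustrative assumptions, not the DeepMind implementation; a practical Atari buffer would also store frames compactly rather than as Python tuples.

```python
import random


class ReplayBuffer:
    """Fixed-capacity ring buffer with uniform sampling (illustrative sketch)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.storage = []
        self.next_index = 0  # slot that the next transition will occupy

    def add(self, transition):
        """Store (state, action, reward, next_state, done), evicting the oldest."""
        if len(self.storage) < self.capacity:
            self.storage.append(transition)
        else:
            self.storage[self.next_index] = transition
        self.next_index = (self.next_index + 1) % self.capacity

    def sample(self, batch_size, rng=random):
        """Draw batch_size distinct transitions uniformly at random."""
        return rng.sample(self.storage, batch_size)

    def __len__(self):
        return len(self.storage)


buffer = ReplayBuffer(capacity=3)
for t in range(5):
    # Dummy transitions: the state is just the step index.
    buffer.add((t, 0, 0.0, t + 1, False))
batch = buffer.sample(2, random.Random(0))
```

After five insertions into a buffer of capacity three, only the three most recent transitions remain, which mirrors how the one-million-transition buffer keeps only the most recent experience.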

### Target network

The target network is a periodic snapshot of the online Q-network, parameterized by theta^- and used only to compute the bootstrap target r + gamma * max_{a'} Q_hat(s', a'; theta^-). Every C gradient steps (10,000 in the Nature paper) the target weights are overwritten with the current online weights [2].

The motivation is to break a feedback loop. Without a target network, every gradient update to Q(s, a; theta) also shifts the bootstrap targets computed from the same parameters, including those of nearby states, so the network ends up chasing a moving target and errors can amplify into oscillation or divergence. Holding the target fixed for thousands of steps gives the online network a stationary objective long enough to make stable progress before the target moves [2]. The Nature paper's ablation on five games (Extended Data Table 3) found that disabling either component lowered scores, with the removal of experience replay being the more damaging of the two [2].

A common variant, popularized by DDPG and later picked up by some DQN-style agents, replaces the periodic hard copy with a slow Polyak average theta^- <- tau * theta + (1 - tau) * theta^- with tau small (e.g., 0.005) [20]. Both schemes accomplish the same stabilizing effect.
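
The two update rules can be compared with a small sketch that treats the parameters as flat lists of floats. The function names `hard_update` and `polyak_update` are hypothetical; real implementations operate on network parameter tensors.

```python
def hard_update(online, target, step, period=10_000):
    """Copy the online weights into the target every `period` steps."""
    if step % period == 0:
        return list(online)
    return target


def polyak_update(online, target, tau=0.005):
    """Soft update: theta^- <- tau * theta + (1 - tau) * theta^-."""
    return [tau * w + (1.0 - tau) * w_t for w, w_t in zip(online, target)]


theta = [1.0, 2.0]
theta_target = [0.0, 0.0]
# The hard copy leaves the target untouched between synchronization points.
unchanged = hard_update(theta, theta_target, step=9_999)
synced = hard_update(theta, theta_target, step=10_000)
# The soft update moves the target a small fraction of the way at every step.
softened = polyak_update(theta, theta_target, tau=0.5)
```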

### Reward clipping, frame skipping, and frame stacking

DQN uses several preprocessing and training tricks to standardize the Atari domain [2]:

- **Reward clipping**: All positive rewards are clipped to +1 and all negative rewards to -1, so the same learning rate works across games whose raw scores differ by orders of magnitude. This loses some information about reward magnitudes but lets a single set of hyperparameters apply across all 49 games.
- **Frame skipping**: The agent selects an action every k frames (k = 4 for most games, 3 for Space Invaders to avoid lasers becoming invisible), and the chosen action is repeated for the skipped frames. This effectively cuts the decision frequency by 4x, speeding up learning without losing important detail.
- **Frame stacking**: The state phi_t is the concatenation of the most recent 4 preprocessed frames, giving the network access to short-term motion. Without frame stacking the policy would be Markov-limited to a single static image, which would make many Atari games (e.g., Pong, Breakout) effectively partially observable.
- **Grayscale 84x84 preprocessing**: Each 210x160 RGB frame is converted to grayscale, then downsampled and cropped to 84x84 to reduce the input dimensionality.
- **Huber loss**: Although the Nature paper describes the loss as squared TD error, the implementation clips the TD error term of the update to the range [-1, 1]. This corresponds to a squared loss inside that interval and an absolute-value loss outside it, which is the Huber loss with threshold 1. It prevents very large TD errors from producing oversized parameter updates.
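
Reward clipping and the Huber loss from the list above can be made concrete with a short sketch. The function names are hypothetical, and the loss uses the conventional factor of 1/2 so that its derivative with respect to the TD error is exactly the error clipped to [-1, 1].

```python
def clip_reward(reward):
    """Map positive rewards to +1, negative rewards to -1, and keep 0 as 0."""
    return (reward > 0) - (reward < 0)


def huber_loss(td_error, delta=1.0):
    """Quadratic for |td_error| <= delta, linear beyond that."""
    abs_error = abs(td_error)
    if abs_error <= delta:
        return 0.5 * td_error ** 2
    return delta * (abs_error - 0.5 * delta)


def huber_grad(td_error, delta=1.0):
    """Derivative of huber_loss w.r.t. td_error: the error clipped to [-delta, delta]."""
    return max(-delta, min(delta, td_error))


# A finite-difference check that the clipped error really is the Huber derivative.
h = 1e-6
numeric_grad = (huber_loss(3.0 + h) - huber_loss(3.0 - h)) / (2 * h)
```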

### Hyperparameters

The Nature paper's hyperparameters became reference defaults for value-based deep RL. The most important ones are summarized below [2].

| Hyperparameter | Value | Role |
|---|---|---|
| Discount factor gamma | 0.99 | Weighting of future rewards |
| Replay buffer size | 1,000,000 transitions | Decorrelation and sample reuse |
| Mini-batch size | 32 | Stochastic gradient batch |
| Target network update period C | 10,000 steps | Frequency of theta^- <- theta |
| Initial epsilon | 1.0 | Starting exploration rate |
| Final epsilon | 0.1 | Long-term exploration rate (0.05 at evaluation) |
| Epsilon anneal length | 1,000,000 frames | Linear decay schedule |
| Replay start size | 50,000 transitions | Random play before learning starts |
| Optimizer | RMSprop | Adaptive per-parameter learning rate |
| Learning rate | 0.00025 | RMSprop step size |
| Squared gradient momentum | 0.95 | RMSprop second-moment decay |
| Min squared gradient | 0.01 | RMSprop denominator floor |
| Frame skip k | 4 (3 on Space Invaders) | Action repeat frequency |
| Agent history length | 4 frames | Stacked input |
| Action repeat | 4 | Same as frame skip |
| No-op max | 30 | Random initial no-ops at episode start |
| Reward clipping | [-1, 1] | Cross-game scale normalization |

## Architecture

The DQN convolutional architecture used in the Nature paper is a feed-forward network mapping a stack of four 84x84 grayscale frames to a 4 to 18 dimensional Q-value vector. There is no recurrent state, no batch normalization, and no skip connection [2].

| Layer | Type | Filters / Units | Kernel | Stride | Activation | Output shape |
|---|---|---|---|---|---|---|
| Input | Frame stack | 4 channels | n/a | n/a | n/a | 84 x 84 x 4 |
| Conv1 | Convolution | 32 | 8 x 8 | 4 | ReLU | 20 x 20 x 32 |
| Conv2 | Convolution | 64 | 4 x 4 | 2 | ReLU | 9 x 9 x 64 |
| Conv3 | Convolution | 64 | 3 x 3 | 1 | ReLU | 7 x 7 x 64 |
| Flatten | Reshape | n/a | n/a | n/a | n/a | 3,136 |
| FC1 | Fully connected | 512 | n/a | n/a | ReLU | 512 |
| Output | Fully connected | num_actions | n/a | n/a | Linear | num_actions |

A crucial architectural choice is that the network outputs Q-values for all actions in a single forward pass, given a state input alone [2]. This is more efficient than the alternative of taking the (state, action) pair as input and producing a single Q-value, because the inner max_{a'} Q(s', a'; theta^-) can be computed from one forward pass rather than one per action. The total parameter count is about 1.7 million.
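
The output shapes and the parameter count quoted above can be checked with simple arithmetic. The sketch below assumes unpadded convolutions and one bias per filter or unit, which is consistent with the shapes in the table; the function names are illustrative.

```python
def conv_out(size, kernel, stride):
    """Output width/height of an unpadded convolution."""
    return (size - kernel) // stride + 1


def dqn_parameter_count(num_actions, in_channels=4, input_size=84):
    """Return (flattened feature size, total weights and biases)."""
    conv_layers = [(32, 8, 4), (64, 4, 2), (64, 3, 1)]  # (filters, kernel, stride)
    size, channels, total = input_size, in_channels, 0
    for filters, kernel, stride in conv_layers:
        total += kernel * kernel * channels * filters + filters
        size, channels = conv_out(size, kernel, stride), filters
    flat = size * size * channels
    total += flat * 512 + 512                  # FC1
    total += 512 * num_actions + num_actions   # linear output layer
    return flat, total


flat_features, params_18_actions = dqn_parameter_count(num_actions=18)
```

With 18 actions this gives 3,136 flattened features and 1,693,362 parameters, in line with the figure of about 1.7 million; almost all of them sit in the first fully connected layer.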

The original 2013 NeurIPS workshop version of DQN used a slightly smaller network with two convolutional layers (16 and 32 filters) and a 256-unit fully connected layer [1]. The 2015 Nature version increased depth and width and added the third convolutional layer, the larger 512-unit fully connected layer, and the explicit target network [2]. The smaller 2013 architecture is sometimes called "DQN-2013" and the Nature one "DQN-2015" or simply DQN.

## Original Atari results

The Nature paper evaluated DQN on 49 Atari 2600 games using the Arcade Learning Environment (ALE) [2][3]. Each game was played from raw pixel input with the same network, the same hyperparameters, and the same training budget of 50 million frames per game, which the paper equates to about 38 days of game experience. (At 60 frames per second, 38 days corresponds to roughly 200 million emulator frames, that is, 50 million agent decisions with a frame skip of 4.) After training, the agent was evaluated for 30 episodes per game using an epsilon-greedy policy with epsilon = 0.05.

The paper reports normalized scores of the form 100% * (DQN_score - random_score) / (human_score - random_score), where the human score is the average achieved by a professional human games tester after about two hours of practice on each game. A normalized score of 100% therefore means human-level, and 0% means no better than random play [2].
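
The normalization is straightforward to compute. The helper name `human_normalized_score` is hypothetical, and the example uses the Star Gunner and Kangaroo values quoted in this article.

```python
def human_normalized_score(agent, random_score, human):
    """100% * (agent - random) / (human - random)."""
    return 100.0 * (agent - random_score) / (human - random_score)


# Star Gunner: random 664, human 10,250, DQN 57,997.
star_gunner = human_normalized_score(57_997, 664, 10_250)
# Kangaroo: random 52, human 3,035, DQN 6,740.
kangaroo = human_normalized_score(6_740, 52, 3_035)
```

These evaluate to roughly 598% and 224% respectively. An agent that only matches the random baseline scores 0%, and one that matches the human tester scores 100%.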

DQN reached above 75% of human-level performance on 29 of 49 games, and exceeded human-level on 22 of them. It surpassed all previous learning algorithms on 43 of 49 games and was within 5% of the best previous result on the remaining 6 [2]. The most striking results were games where the agent discovered long-horizon strategies on its own, such as digging a tunnel along the side of the wall in Breakout so that the ball travels behind the bricks and clears them from above [2].

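The per-game scores from Extended Data Table 2 of the Nature paper are shown below [2]. Random and human baselines are rounded here, so recomputing the normalized score from the displayed values can differ slightly from the listed percentage.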

| Game | Random | Human | DQN | DQN normalized |
|---|---|---|---|---|
| Video Pinball | 16,257 | 17,298 | 42,684 | 2,539% |
| Boxing | 0 | 4 | 71.8 | 1,707% |
| Breakout | 1 | 31 | 401 | 1,327% |
| Star Gunner | 664 | 10,250 | 57,997 | 598% |
| Robotank | 2 | 12 | 51.6 | 509% |
| Atlantis | 12,850 | 29,028 | 85,641 | 449% |
| Crazy Climber | 10,781 | 35,411 | 114,103 | 419% |
| Gopher | 257 | 2,321 | 8,520 | 400% |
| Demon Attack | 152 | 3,401 | 9,711 | 294% |
| Name This Game | 2,250 | 4,076 | 7,257 | 278% |
| Krull | 1,598 | 2,395 | 3,805 | 277% |
| Assault | 222 | 1,496 | 3,359 | 246% |
| Road Runner | 200 | 7,845 | 18,257 | 232% |
| Kangaroo | 52 | 3,035 | 6,740 | 224% |
| James Bond | 29 | 407 | 576.7 | 145% |
| Tennis | -24 | -8 | -2.5 | 143% |
| Pong | -21 | 10 | 18.9 | 132% |
| Space Invaders | 148 | 1,652 | 1,976 | 121% |
| Beam Rider | 364 | 5,775 | 6,846 | 119% |
| Tutankham | 11 | 167 | 186.7 | 112% |
| Kung-Fu Master | 258 | 22,736 | 23,270 | 102% |
| Freeway | 0 | 30 | 30.3 | 102% |
| Time Pilot | 3,568 | 5,925 | 5,947 | 100% |
| Enduro | 0 | 309 | 301.8 | 97% |
| Fishing Derby | -91 | 6 | -0.8 | 93% |
| Up and Down | 533 | 9,082 | 8,456 | 92% |
| Ice Hockey | -11 | 1 | -1.6 | 79% |
| Q*bert | 164 | 13,455 | 10,596 | 78% |
| H.E.R.O. | 1,027 | 25,763 | 19,950 | 76% |
| Asterix | 210 | 8,503 | 6,012 | 70% |
| Battle Zone | 2,360 | 37,800 | 26,300 | 67% |
| Wizard of Wor | 564 | 4,757 | 3,393 | 67% |
| Chopper Command | 811 | 9,882 | 6,687 | 65% |
| Centipede | 2,091 | 11,963 | 8,309 | 63% |
| Bank Heist | 14 | 753 | 429.7 | 56% |
| River Raid | 1,338 | 13,513 | 8,316 | 57% |
| Zaxxon | 32 | 9,173 | 4,977 | 54% |
| Amidar | 6 | 1,676 | 740 | 44% |
| Alien | 227 | 6,875 | 3,069 | 43% |
| Venture | 0 | 1,188 | 380 | 32% |
| Seaquest | 68 | 20,182 | 5,286 | 26% |
| Frostbite | 65 | 4,335 | 328.3 | 6% |
| Asteroids | 719 | 13,157 | 1,629.3 | 7% |
| Private Eye | 25 | 69,571 | 1,788 | 3% |
| Gravitar | 173 | 2,672 | 306.7 | 5% |
| Ms. Pacman | 307 | 15,693 | 2,311 | 13% |
| Bowling | 23 | 154 | 42.4 | 14% |
| Double Dunk | -19 | -16 | -18.1 | 17% |
| Montezuma's Revenge | 0 | 4,367 | 0 | 0% |

The games where DQN failed badly, in particular Montezuma's Revenge, Private Eye, Gravitar, and Frostbite, all involve very sparse rewards or long-horizon planning that epsilon-greedy exploration with one-step bootstrapped updates cannot solve on its own. These hard-exploration games, together with others outside the original 49-game set such as Pitfall, became a benchmark in their own right and motivated later work on intrinsic motivation, hierarchical RL, and learned exploration policies, culminating in [Agent57](/wiki/agent57) [11].

## Variants

The Nature DQN paper inspired a long line of follow-up algorithms, each addressing one of its known weaknesses. The most influential are summarized below.

| Variant | Authors | Year / Venue | Key idea |
|---|---|---|---|
| Double DQN (DDQN) | van Hasselt, Guez, Silver | AAAI 2016 [4] | Decouple action selection from action evaluation in the bootstrap target to reduce overestimation bias |
| Prioritized Experience Replay (PER) | Schaul, Quan, Antonoglou, Silver | ICLR 2016 [6] | Sample replay transitions in proportion to their TD error magnitude |
| Dueling DQN | Wang, Schaul, Hessel, van Hasselt, Lanctot, de Freitas | ICML 2016 [5] | Decompose Q(s, a) into a state value V(s) and an advantage A(s, a) with a shared trunk |
| Bootstrapped DQN | Osband, Blundell, Pritzel, Van Roy | NeurIPS 2016 [21] | Train an ensemble of Q-heads for deep exploration via Thompson-style sampling |
| Distributional DQN (C51) | Bellemare, Dabney, Munos | ICML 2017 [7] | Learn the full return distribution Z(s, a) over a fixed support of 51 atoms instead of just its mean |
| Multi-step DQN (n-step) | Sutton (concept), used in Rainbow | 1988 / 2017 | Bootstrap from the n-step return r_t + ... + gamma^{n-1} r_{t+n-1} + gamma^n max Q |
| Noisy DQN | Fortunato et al. | ICLR 2018 [8] | Add learnable parameter noise to the network to drive exploration without epsilon-greedy |
| Quantile Regression DQN (QR-DQN) | Dabney, Rowland, Bellemare, Munos | AAAI 2018 [22] | Distributional DQN with quantile regression rather than fixed support |
| Implicit Quantile Networks (IQN) | Dabney, Ostrovski, Silver, Munos | ICML 2018 [23] | Sampled-quantile distributional DQN with continuous quantile inputs |
| Rainbow | Hessel et al. | AAAI 2018 [9] | Combine Double, Dueling, PER, n-step, C51, and Noisy nets in one agent |
| Ape-X DQN | Horgan, Quan, Budden, Barth-Maron, Hessel, van Hasselt, Silver | ICLR 2018 [24] | Distributed actor-learner with shared prioritized replay across hundreds of CPU actors |
| R2D2 | Kapturowski, Ostrovski, Quan, Munos, Dabney | ICLR 2019 [10] | Recurrent Replay Distributed DQN with LSTM and stored hidden state |
| Never Give Up (NGU) | Badia et al. | ICLR 2020 [25] | Episodic + lifelong intrinsic rewards added to a recurrent DQN-style agent |
| Agent57 | Badia et al. | ICML 2020 [11] | Adaptive mixture of explorative and exploitative policies; first to beat the human baseline on all 57 ALE games |
| MuZero | Schrittwieser et al. | Nature 2020 [12] | Combines a learned model with Monte Carlo tree search; subsumes many DQN-era results |

### Double DQN

Double DQN, introduced by Hado van Hasselt, Arthur Guez, and David Silver [4], addresses Q-learning's well-known overestimation bias. In standard DQN, the bootstrap uses max_{a'} Q_hat(s', a'; theta^-), which both selects and evaluates the next action with the same network. Because the max operator picks the largest of several noisy estimates, the resulting target is biased upward. Van Hasselt's earlier 2010 paper had introduced "Double Q-learning" as a tabular fix, and the 2016 paper carries the idea over to deep networks [4]. The Double DQN target is

y = r + gamma * Q_hat(s', argmax_{a'} Q(s', a'; theta); theta^-).

The online network theta picks the action and the target network theta^- evaluates it. This requires no extra parameters, since DQN already has both networks. Double DQN reduced the overestimation on most ALE games and improved both the mean and median normalized scores noticeably [4].
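
The difference between the two targets shows up on a single transition. The sketch below uses plain lists of Q-values in place of network outputs; the function names and the numbers are illustrative assumptions.

```python
def dqn_target(reward, gamma, next_q_target, done):
    """Standard DQN: the target network both selects and evaluates a'."""
    if done:
        return reward
    return reward + gamma * max(next_q_target)


def double_dqn_target(reward, gamma, next_q_online, next_q_target, done):
    """Double DQN: the online network selects a', the target network evaluates it."""
    if done:
        return reward
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    return reward + gamma * next_q_target[best_action]


next_q_online = [1.0, 2.0, 0.5]   # Q(s', . ; theta)
next_q_target = [1.5, 1.0, 3.0]   # Q_hat(s', . ; theta^-)
y_dqn = dqn_target(1.0, 0.5, next_q_target, done=False)
y_double = double_dqn_target(1.0, 0.5, next_q_online, next_q_target, done=False)
```

Here the standard target bootstraps from the largest target-network value (3.0), while the Double DQN target evaluates the action preferred by the online network (index 1, value 1.0), so a possibly overestimated entry is not automatically selected.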

### Dueling DQN

Dueling DQN, by Ziyu Wang and colleagues [5], rearchitects the network to share a convolutional trunk and split into two heads: one estimating the state value V(s; theta_V) and the other estimating the advantage A(s, a; theta_A) for each action. The Q-value is recombined as

Q(s, a) = V(s) + (A(s, a) - mean_{a'} A(s, a')).

Subtracting the mean advantage enforces identifiability since V and A are otherwise underdetermined. The intuition is that on many states, the choice of action does not change the value much (e.g., when the agent is far from any obstacle), and forcing the network to estimate V separately allows information about state value to be shared across all actions. The Dueling architecture combined with Double DQN and Prioritized Experience Replay set new state of the art on ALE in 2016 [5].
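
The recombination step and the identifiability argument can be checked numerically. The helper `dueling_q_values` is a hypothetical name, and the inputs stand in for the outputs of the two heads.

```python
def dueling_q_values(state_value, advantages):
    """Q(s, a) = V(s) + (A(s, a) - mean_a' A(s, a'))."""
    mean_advantage = sum(advantages) / len(advantages)
    return [state_value + (a - mean_advantage) for a in advantages]


q_a = dueling_q_values(2.0, [1.0, 0.0, -1.0])
# Adding a constant to every advantage leaves the Q-values unchanged,
# so the constant cannot leak between the value and advantage heads.
q_b = dueling_q_values(2.0, [6.0, 5.0, 4.0])
```

Both calls return the same Q-values, and the mean of the Q-values equals V(s), which is what the mean subtraction is designed to guarantee.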

### Prioritized Experience Replay

Prioritized Experience Replay (PER), by Tom Schaul and colleagues [6], replaces the uniform sampling of replay transitions with sampling according to a priority p_i derived from the most recent absolute TD error |delta_i| of each transition. The sampling probability is P(i) = p_i^alpha / sum_k p_k^alpha, where the exponent alpha controls how strongly prioritization is applied and alpha = 0 recovers uniform sampling. Transitions where the network was "surprised" are sampled more often, so the network spends its updates on the experiences with the most to learn from. To correct for the bias of non-uniform sampling, the gradient is reweighted by an importance-sampling factor (1 / (N * P(i)))^beta, where beta is annealed from an initial value below 1 up to 1 over training. PER consistently improved sample efficiency and final scores [6].
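
A sketch of proportional prioritization follows. The function names are hypothetical, the small constant `eps` keeps zero-error transitions sampleable, and the default alpha of 0.6 is an illustrative assumption. Weights are divided by their maximum so that updates are only ever scaled down.

```python
def per_probabilities(td_errors, alpha=0.6, eps=1e-6):
    """Sampling probabilities P(i) proportional to (|delta_i| + eps) ** alpha."""
    priorities = [(abs(d) + eps) ** alpha for d in td_errors]
    total = sum(priorities)
    return [p / total for p in priorities]


def importance_weights(probs, beta):
    """Weights (1 / (N * P(i))) ** beta, normalized by the largest weight."""
    n = len(probs)
    weights = [(1.0 / (n * p)) ** beta for p in probs]
    max_weight = max(weights)
    return [w / max_weight for w in weights]


probs = per_probabilities([2.0, 1.0, 0.5])
weights_full = importance_weights(probs, beta=1.0)   # full bias correction
weights_none = importance_weights(probs, beta=0.0)   # no correction
```

Transitions with larger TD error are sampled more often but receive smaller importance weights, so their more frequent updates are individually scaled down.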

### Distributional DQN (C51)

Distributional DQN, introduced by Marc Bellemare, Will Dabney, and Remi Munos [7], replaces the scalar Q-value Q(s, a) with a discrete distribution Z(s, a) over the possible returns. C51 represents the return distribution as 51 atoms uniformly spaced between V_min and V_max (typically -10 and 10 after reward clipping), and trains the network to match the projected Bellman target distribution under a cross-entropy loss. Although actions are still selected by the expected return, learning the full distribution gave a substantial gain over scalar DQN on ALE [7]. The follow-ups QR-DQN [22] and IQN [23] replaced the fixed support with learned quantile locations and with continuously sampled quantile fractions, respectively.
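
The projection step is the least obvious part of C51, so a minimal sketch for a single transition is given here. The function names are hypothetical and the code works on plain lists; it shifts every atom by the Bellman update, clips it to the support, and splits its probability mass between the two nearest atoms.

```python
import math


def project_distribution(next_probs, reward, gamma, done, v_min=-10.0, v_max=10.0):
    """Project the Bellman-shifted return distribution onto the fixed atoms."""
    n_atoms = len(next_probs)
    dz = (v_max - v_min) / (n_atoms - 1)              # spacing between atoms
    projected = [0.0] * n_atoms
    for i, p in enumerate(next_probs):
        z = v_min + i * dz                            # location of atom i
        tz = reward if done else reward + gamma * z   # shifted atom
        tz = min(v_max, max(v_min, tz))               # clip to the support
        b = min(max((tz - v_min) / dz, 0.0), n_atoms - 1.0)  # fractional index
        lower, upper = math.floor(b), math.ceil(b)
        if lower == upper:                            # lands exactly on an atom
            projected[lower] += p
        else:                                         # split between neighbours
            projected[lower] += p * (upper - b)
            projected[upper] += p * (b - lower)
    return projected


def expected_value(probs, v_min=-10.0, v_max=10.0):
    """Mean of a distribution over the atoms; used for greedy action selection."""
    dz = (v_max - v_min) / (len(probs) - 1)
    return sum(p * (v_min + i * dz) for i, p in enumerate(probs))


# Five atoms at 0, 1, 2, 3, 4 with all mass on return 2; reward 0.5, gamma 1.
small = project_distribution([0.0, 0.0, 1.0, 0.0, 0.0], 0.5, 1.0, False, 0.0, 4.0)
```

In the small example the shifted return 2.5 falls halfway between atoms 2 and 3, so each receives half of the probability mass and the expected value is preserved at 2.5.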

### Noisy DQN

Noisy DQN, by Meire Fortunato and colleagues [8], replaces the linear weights of selected layers with noisy weights of the form w_mu + w_sigma * epsilon, where epsilon is a sampled noise vector and w_mu, w_sigma are learned. This makes the policy stochastic and lets the network learn how much exploration noise to inject in different parts of state space. Compared to epsilon-greedy, noisy nets often explored more strategically and removed the need to manually tune an exploration schedule [8].
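
A noisy linear layer can be sketched in pure Python as follows. This version draws independent Gaussian noise for every weight and bias, which is a simplifying assumption; the function name and argument layout are hypothetical, and learning w_sigma by gradient descent is not shown.

```python
import random


def noisy_linear(x, w_mu, w_sigma, b_mu, b_sigma, rng):
    """Compute y = (w_mu + w_sigma * eps_w) x + (b_mu + b_sigma * eps_b).

    Fresh standard-normal noise eps is drawn on every call.
    """
    outputs = []
    for row_mu, row_sigma, bias_mu, bias_sigma in zip(w_mu, w_sigma, b_mu, b_sigma):
        y = bias_mu + bias_sigma * rng.gauss(0.0, 1.0)
        for x_i, mu, sigma in zip(x, row_mu, row_sigma):
            y += (mu + sigma * rng.gauss(0.0, 1.0)) * x_i
        outputs.append(y)
    return outputs


rng = random.Random(0)
x = [1.0, 2.0]
w_mu = [[1.0, 0.0], [0.0, 1.0]]
b_mu = [0.5, -0.5]
zeros_w, zeros_b = [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0]
ones_w, ones_b = [[1.0, 1.0], [1.0, 1.0]], [1.0, 1.0]

deterministic = noisy_linear(x, w_mu, zeros_w, b_mu, zeros_b, rng)
sample_a = noisy_linear(x, w_mu, ones_w, b_mu, ones_b, rng)
sample_b = noisy_linear(x, w_mu, ones_w, b_mu, ones_b, rng)
```

When every sigma is zero the layer reduces to an ordinary linear layer, which is how the network can switch exploration off where it has learned that noise is unhelpful.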

### Multi-step returns

Multi-step (or n-step) returns replace the one-step bootstrap with the n-step return r_t + gamma * r_{t+1} + ... + gamma^{n-1} * r_{t+n-1} + gamma^n * max_{a'} Q_hat(s_{t+n}, a'). With n in the range 3 to 5, this propagates reward information faster while still using bootstrapping to control variance. Multi-step DQN is technically off-policy biased when the behavior policy and target policy differ over the n steps, but in practice this bias is small enough that the method is widely used and forms one of the six ingredients in Rainbow [9].
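
The n-step target is a short computation once the rewards along the trajectory segment are available. The helper name `n_step_target` is hypothetical, and `bootstrap_value` stands for max_{a'} Q_hat(s_{t+n}, a').

```python
def n_step_target(rewards, gamma, bootstrap_value, done):
    """Sum of n discounted rewards plus a discounted bootstrap from step t + n.

    rewards holds r_t, ..., r_{t+n-1}; done means the episode ended within them.
    """
    n = len(rewards)
    discounted = sum(gamma ** k * r for k, r in enumerate(rewards))
    if done:
        return discounted
    return discounted + gamma ** n * bootstrap_value


# Three-step return with rewards 1, 0, 2, gamma = 0.5 and bootstrap value 4.
y_three_step = n_step_target([1.0, 0.0, 2.0], 0.5, 4.0, done=False)
y_terminal = n_step_target([1.0, 0.0, 2.0], 0.5, 4.0, done=True)
```

With n = 1 this reduces to the standard DQN target r + gamma * max_{a'} Q_hat(s', a').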

## Rainbow

Rainbow, by Matteo Hessel and colleagues at DeepMind [9], asks a deceptively simple question: which of the many DQN improvements actually matter, and do they compose? The paper combines Double DQN, Dueling DQN, Prioritized Experience Replay, Multi-step learning, Distributional RL (C51), and Noisy nets into a single agent and runs an ablation removing each component in turn.

Rainbow set a new state of the art on the 57 game Atari benchmark, exceeding the previous best agent (Distributional DQN with multi-step bootstrapping) by a wide margin in both data efficiency and final performance, and matching the performance of much more compute-hungry distributed agents at a fraction of the wall-clock budget [9]. The ablations showed that Prioritized Replay, Multi-step learning, and the Distributional component were the largest contributors, while Double DQN had a relatively small effect (in part because the Distributional formulation already mitigates overestimation), and Dueling and Noisy nets had moderate effects that varied by game.

Rainbow is widely considered the strongest single-actor DQN-family agent and has become the standard baseline in subsequent value-based RL papers. The full set of Rainbow ingredients, along with their original citations, is summarized below.

| Component | Source | Effect when removed |
|---|---|---|
| Double Q-learning | van Hasselt et al. 2016 [4] | Small drop in median performance |
| Prioritized Experience Replay | Schaul et al. 2016 [6] | Largest drop in early-training data efficiency |
| Dueling networks | Wang et al. 2016 [5] | Moderate drop, varies by game |
| Multi-step learning | Sutton 1988 (concept) | Large drop in median performance |
| Distributional RL (C51) | Bellemare et al. 2017 [7] | Large drop in final performance |
| Noisy nets | Fortunato et al. 2018 [8] | Moderate drop, mostly on hard-exploration games |

## Successors and beyond

### Distributed DQN: Ape-X and R2D2

Ape-X DQN, by Dan Horgan and colleagues [24], decouples acting from learning. Hundreds of CPU actor processes generate experience and write into a single shared prioritized replay buffer, while a single GPU learner samples from the buffer and pushes updated weights back to the actors. The actors run different epsilon values for built-in exploration diversity. With this architecture, Ape-X consumed roughly 22 billion environment frames per agent on Atari and substantially exceeded Rainbow's scores, at the cost of far more environment data [24].

[R2D2](/wiki/r2d2) (Recurrent Replay Distributed DQN), by Steven Kapturowski and colleagues [10], extends Ape-X with an LSTM core and adopts a careful protocol for training the recurrent state from replay. It stores fixed-length sequences in the replay buffer along with the recurrent hidden state at the start of each sequence, and uses a "burn-in" period to refresh the LSTM state before computing the loss. R2D2 roughly quadrupled the previous state-of-the-art median human-normalized score on the 57-game benchmark, held by Ape-X, and exceeded human-level performance on 52 of the 57 standard Atari games [10].

### Never Give Up and Agent57

A persistent gap remained on hard-exploration games, in particular Montezuma's Revenge, Pitfall, Private Eye, Solaris, Skiing, and Venture. Never Give Up (NGU), by Adria Puigdomenech Badia and colleagues [25], augments R2D2 with two intrinsic reward signals: an episodic novelty bonus computed from the distance to the agent's recent state memory, and a lifelong novelty bonus computed from a Random Network Distillation predictor. The authors reported NGU as the first agent to obtain non-zero reward on Pitfall without demonstrations or hand-crafted features [25].

[Agent57](/wiki/agent57), also by Badia and colleagues [11], built on NGU by parameterizing a family of policies indexed by an exploration coefficient and a discount factor, and using a meta-controller to choose which policy to act with at each episode. The meta-controller is itself a [bandit](/wiki/bandit) that learns to allocate behavior across exploitative and explorative policies. Agent57 became the first agent to exceed the human baseline on all 57 standard Atari games in 2020 [11].

### MuZero

[MuZero](/wiki/muzero), by Julian Schrittwieser and colleagues [12], goes a step further by combining a learned environment model with [Monte Carlo tree search](/wiki/monte_carlo_tree_search) (MCTS) in the style of [AlphaZero](/wiki/alphazero). It does not assume access to the simulator dynamics; instead it learns three networks, a representation function, a dynamics function, and a prediction function, that together let it plan in latent space. MuZero matched AlphaZero's superhuman play on Go, chess, and shogi while also matching or exceeding R2D2 on Atari, all with the same algorithm [12]. While MuZero is not literally a DQN variant, it descends directly from the DQN-on-Atari research program and shares the use of a value head trained by bootstrapping from future returns.

## Implementations

Because DQN is so widely studied, well-tested implementations exist in most major reinforcement learning libraries.

| Library | Maintainer | Implementation notes |
|---|---|---|
| [Stable-Baselines3](/wiki/stable_baselines) | Antonin Raffin and contributors | PyTorch implementation of vanilla DQN; extensions such as QR-DQN are provided in the companion SB3-Contrib package |
| Dopamine | Google Research | TensorFlow / JAX baselines including DQN, C51, Rainbow, IQN, QR-DQN |
| Acme | DeepMind | JAX agents for DQN, Rainbow, R2D2, Ape-X, MuZero, IMPALA |
| RLlib | Anyscale (Ray) | Distributed DQN, Ape-X, R2D2 with TensorFlow and PyTorch backends |
| CleanRL | Costa Huang and contributors | Single-file PyTorch reference implementations of DQN and many variants |
| Tianshou | Tsinghua University | PyTorch DQN, Double DQN, Dueling, Rainbow, C51, QR-DQN |
| OpenAI Baselines | OpenAI | Early TensorFlow reference implementation of DQN, used for many follow-up papers |
| Coach | Intel | Multi-framework RL library including DQN family |

Dopamine [26] was released by Google in 2018 as a small, reproducible TensorFlow research codebase aimed specifically at value-based deep RL on Atari, and it has been the reference implementation for many subsequent benchmark papers. Acme [27] is the more recent DeepMind framework that contains modular implementations of most algorithms in the DQN family, including R2D2 and MuZero, along with shared replay buffers and learner / actor abstractions.

A minimal PyTorch DQN training step looks roughly like this:

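```python
import torch
import torch.nn.functional as F

def dqn_train_step(q_net, target_net, optimizer, batch, gamma, step, target_update_period):
    # batch is sampled from the replay buffer;
    # actions are int64 indices, dones are 0/1 floats
    states, actions, rewards, next_states, dones = batch
    # Q(s, a) for the actions that were actually taken
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # bootstrap from the frozen target network; no bootstrap on terminal states
        next_q = target_net(next_states).max(dim=1).values
        target = rewards + gamma * next_q * (1.0 - dones)
    loss = F.smooth_l1_loss(q_values, target)  # Huber loss
    optimizer.zero_grad()
    loss.backward()
    # clamp every gradient element to [-1, 1]; parameters without gradients are skipped
    torch.nn.utils.clip_grad_value_(q_net.parameters(), 1.0)
    optimizer.step()

    # periodically copy the online weights into the target network
    if step % target_update_period == 0:
        target_net.load_state_dict(q_net.state_dict())
    return loss.item()
```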

The Huber loss (`smooth_l1_loss` in PyTorch) corresponds to the Nature paper's clipping of the error term to [-1, 1]; the element-wise gradient clipping is an extra safeguard that many implementations add rather than something the paper specifies. Stable-Baselines3 wraps this entire loop, the replay buffer, the target update schedule, and the epsilon-greedy exploration into a few lines of user code [28].
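
The supporting pieces that such a library bundles around the training step can be sketched in plain Python. The names below (`ReplayBuffer`, `linear_epsilon`, `epsilon_greedy`) are hypothetical and are not the Stable-Baselines3 API; the default schedule values are assumptions chosen to resemble a linear decay from 1.0 to 0.1 over the first million steps.

```python
import random
from collections import deque
from typing import List, Sequence, Tuple

Transition = Tuple  # (state, action, reward, next_state, done)


class ReplayBuffer:
    """Fixed-capacity FIFO store with uniform random sampling."""

    def __init__(self, capacity: int):
        self.storage = deque(maxlen=capacity)  # oldest transitions are dropped first

    def add(self, transition: Transition) -> None:
        self.storage.append(transition)

    def sample(self, batch_size: int) -> List[Transition]:
        return random.sample(list(self.storage), batch_size)

    def __len__(self) -> int:
        return len(self.storage)


def linear_epsilon(
    step: int, start: float = 1.0, end: float = 0.1, decay_steps: int = 1_000_000
) -> float:
    """Linearly anneal epsilon from `start` to `end`, then hold it at `end`."""
    fraction = min(step / decay_steps, 1.0)
    return start + fraction * (end - start)


def epsilon_greedy(q_values: Sequence[float], epsilon: float) -> int:
    """Pick a random action with probability epsilon, else the greedy one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)


buffer = ReplayBuffer(capacity=3)
for i in range(5):
    buffer.add((i, 0, 0.0, i + 1, False))  # only the last three are kept
batch = buffer.sample(2)
```

A training loop would call `epsilon_greedy` with `linear_epsilon(step)` to act, push each transition into the buffer, and pass sampled batches (converted to tensors) to the training step.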

## Limitations

Despite its impact, DQN has substantial limitations as a general RL algorithm.

- **Discrete action space only**: The maximization over actions in the bootstrap target requires enumerable actions, so DQN does not directly apply to continuous control. [DDPG](/wiki/ddpg), [TD3](/wiki/td3), and [SAC](/wiki/sac) carry the replay-buffer and target-network ideas into actor-critic methods for continuous action spaces [20].
- **Sample inefficiency**: The Nature paper trained on 50 million frames per game, which it equates to roughly 38 days of game experience; because the agent acts only on every fourth emulator frame, this corresponds to about 200 million emulator frames at 60 Hz. Even Rainbow needs tens of millions of frames to reach near-human performance. Modern model-based methods like [DreamerV3](/wiki/dreamer) reach competitive scores with one to two orders of magnitude fewer frames.
- **Hard-exploration games**: With epsilon-greedy alone, DQN makes essentially no progress on Montezuma's Revenge, Pitfall, and similar games whose rewards are extremely sparse. Solving these required intrinsic motivation (RND, NGU, Agent57), expert demonstrations, or both [11][25].
- **Overestimation bias**: Plain DQN systematically overestimates Q-values because of the maximization in the target. Double DQN reduces but does not fully remove the bias.
- **Catastrophic forgetting**: As the policy improves and stops visiting earlier-stage states, the network can forget how to play those states well. The replay buffer mitigates but does not eliminate this.
- **Partial observability**: A four-frame stack is only a crude approximation to a Markov state for most games. Recurrent variants like R2D2 are needed for genuinely partially observable problems.
- **Reward clipping discards information**: Clipping to [-1, +1] makes hyperparameters portable across games but loses the magnitude information that would let the agent prefer big rewards over small ones. Distributional methods can side-step this with appropriate value supports.
- **Hyperparameter sensitivity in some regimes**: Although DQN used a single set of hyperparameters across all Atari games, transferring it to a new domain often requires significant adjustment of replay capacity, target update frequency, learning rate, and reward scaling.
- **Compute cost**: Reproducing the Nature paper takes roughly 8 GPU-days per game with the original code, and Rainbow takes around 10 GPU-days per game. Bigger distributed variants like Ape-X and R2D2 multiply this further.
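
The overestimation bias in the list above can be reproduced with a small simulation. The sketch below is an illustration under simple assumptions, not a DQN experiment: every action has a true value of zero, estimates are corrupted by independent Gaussian noise, and the function names are hypothetical. It compares the single-estimator target (maximum of the noisy estimates) with a double-estimator target that selects the action with one set of estimates and evaluates it with an independent set.

```python
import random


def mean_max_of_noisy_estimates(
    num_actions: int, noise_std: float, trials: int, seed: int = 0
) -> float:
    """Average of max_a Q_hat(a) when every true Q(a) is 0."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        estimates = [rng.gauss(0.0, noise_std) for _ in range(num_actions)]
        total += max(estimates)  # the max picks up positive noise
    return total / trials


def mean_double_estimate(
    num_actions: int, noise_std: float, trials: int, seed: int = 0
) -> float:
    """Select the action with one noisy estimate, evaluate it with another."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        select = [rng.gauss(0.0, noise_std) for _ in range(num_actions)]
        evaluate = [rng.gauss(0.0, noise_std) for _ in range(num_actions)]
        best = max(range(num_actions), key=select.__getitem__)
        total += evaluate[best]  # independent noise, so no upward bias
    return total / trials


single = mean_max_of_noisy_estimates(num_actions=4, noise_std=1.0, trials=10_000)
double = mean_double_estimate(num_actions=4, noise_std=1.0, trials=10_000)
```

The single-estimator average comes out clearly positive even though the true maximum is zero, while the double-estimator average stays near zero. In Double DQN the online and target networks are correlated rather than independent, which is consistent with the bias being reduced but not fully removed.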

## Impact

DQN's broader impact on machine learning is hard to overstate. Before December 2013, deep learning's most visible successes were in supervised settings such as ImageNet image classification and speech recognition. Reinforcement learning relied mostly on tabular or linear methods over hand-crafted features. DQN demonstrated that the same convolutional networks that had transformed perception could learn to act, given a working scheme for stabilizing value-based training [1][2]. DeepMind characterized the result as "the first demonstration of a general purpose learning agent that can be trained end-to-end to handle a wide variety of challenging tasks." [31]

The paper's influence shows up in three connected ways. First, it cemented [DeepMind](/wiki/deepmind)'s reputation as the leading deep RL lab, contributing materially to Google's January 2014 acquisition for around 400 to 500 million USD [13]. Second, it spawned a research program at DeepMind, [OpenAI](/wiki/openai), Berkeley, Google Brain, and many other labs that produced a flood of value-based and actor-critic deep RL algorithms during 2015 to 2020. Third, it set the template, raw observations into a deep network with experience replay and a target network, that nearly all subsequent value-based deep RL agents have followed.

The Atari result also laid the groundwork for DeepMind's later milestones. [AlphaGo](/wiki/alphago)'s value network [29] shares the idea of training a deep network to predict expected outcomes from experience, although it is trained by regression on the final results of self-play games rather than by temporal-difference bootstrapping. [AlphaZero](/wiki/alphazero) [30] and [MuZero](/wiki/muzero) [12] continue along the same line by combining learned value functions with planning. Several of the Nature paper's authors went on to lead substantial parts of those projects.

Finally, DQN gave the field its first widely accepted benchmark for deep RL: Atari 2600 played from pixels. The 57-game ALE benchmark, with the now-standard "human-normalized score" reporting, became the field's MNIST for sequential decision making, and the long arc from DQN (22 of 49 games above human) through Rainbow, Ape-X, R2D2, NGU, and Agent57 (57 of 57 games above human) tracks roughly five years of rapid algorithmic progress.

## See also

- [Deep Q-Network (DQN)](/wiki/deep_q-network_dqn)
- [Reinforcement learning](/wiki/reinforcement_learning)
- [Q-learning](/wiki/q-learning)
- [Experience replay](/wiki/experience_replay)
- [Deep reinforcement learning](/wiki/reinforcement_learning)
- [Bellman equation](/wiki/bellman_equation)
- [Markov decision process](/wiki/markov_decision_process_mdp)
- [Convolutional neural network](/wiki/convolutional_neural_network)
- [DeepMind](/wiki/deepmind)
- [Arcade Learning Environment](/wiki/ale)
- [Agent57](/wiki/agent57)
- [MuZero](/wiki/muzero)
- [AlphaGo](/wiki/alphago)
- [Policy gradient](/wiki/policy_gradient)
- [Actor-critic](/wiki/actor_critic)

## References

1. Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. (2013). "Playing Atari with Deep Reinforcement Learning." NIPS Deep Learning Workshop. [https://arxiv.org/abs/1312.5602](https://arxiv.org/abs/1312.5602)
2. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. (2015). "Human-level control through deep reinforcement learning." Nature, 518(7540), 529-533. [https://www.nature.com/articles/nature14236](https://www.nature.com/articles/nature14236)
3. Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. (2013). "The Arcade Learning Environment: An Evaluation Platform for General Agents." Journal of Artificial Intelligence Research, 47, 253-279. [https://arxiv.org/abs/1207.4708](https://arxiv.org/abs/1207.4708)
4. van Hasselt, H., Guez, A., and Silver, D. (2016). "Deep Reinforcement Learning with Double Q-learning." Proceedings of the AAAI Conference on Artificial Intelligence, 30(1). [https://arxiv.org/abs/1509.06461](https://arxiv.org/abs/1509.06461)
5. Wang, Z., Schaul, T., Hessel, M., van Hasselt, H., Lanctot, M., and de Freitas, N. (2016). "Dueling Network Architectures for Deep Reinforcement Learning." Proceedings of ICML 2016. [https://arxiv.org/abs/1511.06581](https://arxiv.org/abs/1511.06581)
6. Schaul, T., Quan, J., Antonoglou, I., and Silver, D. (2016). "Prioritized Experience Replay." Proceedings of ICLR 2016. [https://arxiv.org/abs/1511.05952](https://arxiv.org/abs/1511.05952)
7. Bellemare, M. G., Dabney, W., and Munos, R. (2017). "A Distributional Perspective on Reinforcement Learning." Proceedings of ICML 2017. [https://arxiv.org/abs/1707.06887](https://arxiv.org/abs/1707.06887)
8. Fortunato, M., Azar, M. G., Piot, B., Menick, J., Osband, I., Graves, A., Mnih, V., Munos, R., Hassabis, D., Pietquin, O., Blundell, C., and Legg, S. (2018). "Noisy Networks for Exploration." Proceedings of ICLR 2018. [https://arxiv.org/abs/1706.10295](https://arxiv.org/abs/1706.10295)
9. Hessel, M., Modayil, J., van Hasselt, H., Schaul, T., Ostrovski, G., Dabney, W., Horgan, D., Piot, B., Azar, M., and Silver, D. (2018). "Rainbow: Combining Improvements in Deep Reinforcement Learning." Proceedings of AAAI 2018. [https://arxiv.org/abs/1710.02298](https://arxiv.org/abs/1710.02298)
10. Kapturowski, S., Ostrovski, G., Quan, J., Munos, R., and Dabney, W. (2019). "Recurrent Experience Replay in Distributed Reinforcement Learning." Proceedings of ICLR 2019. [https://openreview.net/forum?id=r1lyTjAqYX](https://openreview.net/forum?id=r1lyTjAqYX)
11. Badia, A. P., Piot, B., Kapturowski, S., Sprechmann, P., Vitvitskyi, A., Guo, D., and Blundell, C. (2020). "Agent57: Outperforming the Atari Human Benchmark." Proceedings of ICML 2020. [https://arxiv.org/abs/2003.13350](https://arxiv.org/abs/2003.13350)
12. Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T., and Silver, D. (2020). "Mastering Atari, Go, chess and shogi by planning with a learned model." Nature, 588, 604-609. [https://www.nature.com/articles/s41586-020-03051-4](https://www.nature.com/articles/s41586-020-03051-4)
13. Shu, C. (2014). "Google Acquires Artificial Intelligence Startup DeepMind For More Than $500M." TechCrunch, January 26, 2014. [https://techcrunch.com/2014/01/26/google-deepmind/](https://techcrunch.com/2014/01/26/google-deepmind/)
14. Sutton, R. S. and Barto, A. G. (2018). "Reinforcement Learning: An Introduction" (2nd ed.). MIT Press. [http://incompleteideas.net/book/the-book.html](http://incompleteideas.net/book/the-book.html)
15. Watkins, C. J. C. H. (1989). "Learning from Delayed Rewards." PhD thesis, King's College, Cambridge. [https://www.cs.rhul.ac.uk/~chrisw/new_thesis.pdf](https://www.cs.rhul.ac.uk/~chrisw/new_thesis.pdf)
16. Watkins, C. J. C. H. and Dayan, P. (1992). "Q-learning." Machine Learning, 8(3), 279-292. [https://link.springer.com/article/10.1007/BF00992698](https://link.springer.com/article/10.1007/BF00992698)
17. Tsitsiklis, J. N. and Van Roy, B. (1997). "An Analysis of Temporal-Difference Learning with Function Approximation." IEEE Transactions on Automatic Control, 42(5), 674-690. [https://web.mit.edu/jnt/www/Papers/J063-97-bvr-td.pdf](https://web.mit.edu/jnt/www/Papers/J063-97-bvr-td.pdf)
18. Baird, L. (1995). "Residual Algorithms: Reinforcement Learning with Function Approximation." Proceedings of ICML 1995. [https://www.leemon.com/papers/1995b.pdf](https://www.leemon.com/papers/1995b.pdf)
19. Lin, L.-J. (1992). "Self-Improving Reactive Agents Based on Reinforcement Learning, Planning and Teaching." Machine Learning, 8, 293-321. [https://link.springer.com/article/10.1007/BF00992699](https://link.springer.com/article/10.1007/BF00992699)
20. Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. (2016). "Continuous control with deep reinforcement learning." Proceedings of ICLR 2016 (DDPG). [https://arxiv.org/abs/1509.02971](https://arxiv.org/abs/1509.02971)
21. Osband, I., Blundell, C., Pritzel, A., and Van Roy, B. (2016). "Deep Exploration via Bootstrapped DQN." Proceedings of NeurIPS 2016. [https://arxiv.org/abs/1602.04621](https://arxiv.org/abs/1602.04621)
22. Dabney, W., Rowland, M., Bellemare, M. G., and Munos, R. (2018). "Distributional Reinforcement Learning with Quantile Regression." Proceedings of AAAI 2018. [https://arxiv.org/abs/1710.10044](https://arxiv.org/abs/1710.10044)
23. Dabney, W., Ostrovski, G., Silver, D., and Munos, R. (2018). "Implicit Quantile Networks for Distributional Reinforcement Learning." Proceedings of ICML 2018. [https://arxiv.org/abs/1806.06923](https://arxiv.org/abs/1806.06923)
24. Horgan, D., Quan, J., Budden, D., Barth-Maron, G., Hessel, M., van Hasselt, H., and Silver, D. (2018). "Distributed Prioritized Experience Replay." Proceedings of ICLR 2018. [https://arxiv.org/abs/1803.00933](https://arxiv.org/abs/1803.00933)
25. Badia, A. P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., and Blundell, C. (2020). "Never Give Up: Learning Directed Exploration Strategies." Proceedings of ICLR 2020. [https://arxiv.org/abs/2002.06038](https://arxiv.org/abs/2002.06038)
26. Castro, P. S., Moitra, S., Gelada, C., Kumar, S., and Bellemare, M. G. (2018). "Dopamine: A Research Framework for Deep Reinforcement Learning." [https://arxiv.org/abs/1812.06110](https://arxiv.org/abs/1812.06110)
27. Hoffman, M., Shahriari, B., Aslanides, J., Barth-Maron, G., et al. (2020). "Acme: A Research Framework for Distributed Reinforcement Learning." [https://arxiv.org/abs/2006.00979](https://arxiv.org/abs/2006.00979)
28. Raffin, A., Hill, A., Gleave, A., Kanervisto, A., Ernestus, M., and Dormann, N. (2021). "Stable-Baselines3: Reliable Reinforcement Learning Implementations." Journal of Machine Learning Research, 22(268), 1-8. [https://jmlr.org/papers/v22/20-1364.html](https://jmlr.org/papers/v22/20-1364.html)
29. Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., et al. (2016). "Mastering the game of Go with deep neural networks and tree search." Nature, 529, 484-489. [https://www.nature.com/articles/nature16961](https://www.nature.com/articles/nature16961)
30. Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., Lillicrap, T., Simonyan, K., and Hassabis, D. (2018). "A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play." Science, 362(6419), 1140-1144. [https://www.science.org/doi/10.1126/science.aar6404](https://www.science.org/doi/10.1126/science.aar6404)
31. DeepMind / Google Research. (2015). "From Pixels to Actions: Human-level control through Deep Reinforcement Learning." Google Research Blog, February 25, 2015. [https://research.google/blog/from-pixels-to-actions-human-level-control-through-deep-reinforcement-learning/](https://research.google/blog/from-pixels-to-actions-human-level-control-through-deep-reinforcement-learning/)
