# Deep Q-Network (DQN)

> Source: https://aiwiki.ai/wiki/deep_q-network_dqn
> Updated: 2026-06-21
> Categories: Deep Learning, Machine Learning, Reinforcement Learning
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**Deep Q-Network (DQN)** is a [reinforcement learning](/wiki/reinforcement_learning_rl) algorithm that uses a deep [neural network](/wiki/neural_network) to approximate the optimal action-value function (Q-function). It was first demonstrated by training a single agent to play Atari 2600 video games directly from raw screen pixels. Introduced by researchers at DeepMind in 2013 and published in *Nature* in 2015, DQN was the first algorithm to combine deep learning with reinforcement learning at scale, outperforming the best previous methods on 43 of 49 Atari games and reaching performance comparable to a professional human games tester while using the same network architecture and hyperparameters across every game.[1][2] DeepMind described it as "the first demonstration of a general purpose learning agent that can be trained end-to-end to handle a wide variety of challenging tasks."[10] The result catalyzed the modern field of deep reinforcement learning. The *Nature* publication followed Google's 2014 acquisition of DeepMind for a reported sum of more than $500 million.[11]

## Background

### Reinforcement Learning and Q-Learning

[Reinforcement learning](/wiki/reinforcement_learning_rl) (RL) is a branch of machine learning in which an agent learns to make decisions by interacting with an environment and receiving reward signals. The agent's goal is to discover a policy that maximizes the cumulative reward over time. [Q-learning](/wiki/q-learning), introduced by Christopher Watkins in 1989, is a foundational RL algorithm that learns an action-value function Q(s, a), representing the expected cumulative discounted reward of taking action *a* in state *s* and then following the optimal policy thereafter.[9]

In classical Q-learning, the Q-function is stored in a table with one entry for every state-action pair. This works well for small, discrete state spaces but becomes intractable when the state space is large or continuous, such as when the input is a raw image with thousands of pixels.
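
As a minimal sketch of the tabular case, the snippet below stores Q(s, a) in a dictionary and applies the standard one-step Q-learning update. The learning rate `alpha`, the discount `gamma`, the `done` flag, and the toy state and action labels are illustrative assumptions, not values taken from the cited sources.

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99, done=False):
    """Apply one tabular Q-learning update in place and return the new Q(s, a)."""
    # Bootstrap from the best action in the next state unless the episode ended.
    best_next = 0.0 if done else max(Q[(s_next, b)] for b in actions)
    td_target = r + gamma * best_next
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])
    return Q[(s, a)]

# One table entry per (state, action) pair; unseen pairs default to 0.
Q = defaultdict(float)
actions = ["left", "right"]
q_learning_update(Q, s=0, a="right", r=1.0, s_next=1, actions=actions)
print(Q[(0, "right")])  # 0.1
```

The table grows with the number of distinct states, which is why this representation stops being practical for image inputs.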

### Why did Q-learning need function approximation?

To handle high-dimensional state spaces, researchers explored using function approximators (such as neural networks) in place of Q-tables. However, combining nonlinear function approximators with Q-learning had long been considered unstable and prone to divergence. Prior attempts often failed due to correlated training samples and the fact that small updates to the Q-function could drastically shift the policy, leading to oscillations or catastrophic forgetting. DQN introduced two key innovations that overcame these stability problems: [experience replay](/wiki/experience_replay) and a separate target network.[2]

## History

### When was DQN first published?

The original DQN paper, "Playing Atari with Deep Reinforcement Learning," was authored by Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. It was presented at the NIPS Deep Learning Workshop in December 2013.[1] This paper demonstrated that a [convolutional neural network](/wiki/convolutional_neural_network) trained with a variant of Q-learning could learn to play seven Atari 2600 games directly from raw pixel input, outperforming all previous approaches on six of the seven games and surpassing a human expert on three of them.[1]

### The 2015 Nature Paper

The follow-up paper, "Human-level control through deep reinforcement learning," appeared in *Nature* on 25 February 2015 (volume 518, pages 529-533).[2] The author list expanded to include Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. This version tested DQN on 49 Atari games using the same architecture, hyperparameters, and algorithm across all games, with no game-specific tuning. The agent received only raw pixels and the game score as input. The authors reported that the agent achieved a level "comparable to that of a professional human games tester across a set of 49 games," outperforming the best existing reinforcement learning methods on 43 of them.[2]

## Architecture

The DQN architecture uses a [convolutional neural network](/wiki/convolutional_neural_network) (CNN) that takes preprocessed game frames as input and outputs a Q-value for each possible action.

### Input Preprocessing

Raw Atari frames (210 x 160 pixels, RGB) undergo several preprocessing steps before being fed to the network:

| Step | Description |
|---|---|
| Grayscale conversion | RGB frames are converted to single-channel grayscale images |
| Frame resizing | Images are downscaled to 84 x 84 pixels |
| Frame stacking | Four consecutive preprocessed frames are stacked together, producing an input tensor of shape (4, 84, 84) to capture temporal information such as motion and velocity |
| Max-pooling across frames | A component-wise maximum is taken over two consecutive raw frames before preprocessing, to handle sprite flickering in certain Atari games |
| Frame skipping | The agent selects an action every 4th frame; the chosen action is repeated for the skipped frames, reducing computation while preserving important dynamics |
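
The steps in the table can be sketched in plain Python as follows. This is an illustration written for readability, not the original implementation: the luminance weights and the nearest-neighbour resize are assumptions, and the function names are hypothetical.

```python
from collections import deque

def max_frames(f1, f2):
    """Component-wise maximum of two RGB frames (reduces sprite flicker)."""
    return [[tuple(max(c1, c2) for c1, c2 in zip(p1, p2)) for p1, p2 in zip(r1, r2)]
            for r1, r2 in zip(f1, f2)]

def to_gray(frame):
    """Convert rows of (r, g, b) pixels to luminance (assumed BT.601 weights)."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row] for row in frame]

def resize_nearest(img, out_h=84, out_w=84):
    """Nearest-neighbour downscale; the interpolation method is an assumption."""
    in_h, in_w = len(img), len(img[0])
    return [[img[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
            for y in range(out_h)]

def preprocess(prev_raw, curr_raw):
    """Max over two raw frames, then grayscale, then resize to 84 x 84."""
    return resize_nearest(to_gray(max_frames(prev_raw, curr_raw)))

# Keep the 4 most recent processed frames as the network input.
stack = deque(maxlen=4)
blank = [[(0, 0, 0)] * 160 for _ in range(210)]
bright = [[(255, 255, 255)] * 160 for _ in range(210)]
for _ in range(4):
    stack.append(preprocess(blank, bright))
state = list(stack)
print(len(state), len(state[0]), len(state[0][0]))  # 4 84 84
```

A practical implementation would use an array library and perform frame skipping inside the environment loop; the nested lists here only make the data flow explicit.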

### What does the DQN network look like?

The CNN architecture from the 2015 Nature paper consists of three convolutional layers followed by two fully connected layers:[2]

| Layer | Type | Filters / Units | Kernel Size | Stride | Activation |
|---|---|---|---|---|---|
| 1 | Convolutional | 32 | 8 x 8 | 4 | ReLU |
| 2 | Convolutional | 64 | 4 x 4 | 2 | ReLU |
| 3 | Convolutional | 64 | 3 x 3 | 1 | ReLU |
| 4 | Fully connected | 512 | - | - | ReLU |
| 5 | Fully connected (output) | Number of actions | - | - | Linear |

The output layer has one neuron per valid action in the game (typically between 4 and 18 for Atari games). Each output neuron produces the estimated Q-value for the corresponding action given the current state. The network uses no pooling layers; the convolutional layers alone reduce the spatial dimensions.
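
To see how the convolutions alone shrink the 84 x 84 input, the following sketch computes each layer's output size and the total parameter count from the table above. It assumes unpadded ("valid") convolutions and one bias per filter or unit; the helper names are hypothetical.

```python
def conv_out(size, kernel, stride):
    """Output width/height of an unpadded ("valid") convolution."""
    return (size - kernel) // stride + 1

def dqn_shapes_and_params(num_actions, in_channels=4, in_size=84):
    layers = [(32, 8, 4), (64, 4, 2), (64, 3, 1)]  # (filters, kernel, stride)
    size, channels, params = in_size, in_channels, 0
    sizes = []
    for filters, kernel, stride in layers:
        size = conv_out(size, kernel, stride)
        params += kernel * kernel * channels * filters + filters  # weights + biases
        channels = filters
        sizes.append(size)
    flat = size * size * channels                 # features entering the dense layer
    params += flat * 512 + 512                    # hidden fully connected layer
    params += 512 * num_actions + num_actions     # linear output layer
    return sizes, flat, params

sizes, flat, params = dqn_shapes_and_params(num_actions=18)
print(sizes, flat, params)  # [20, 9, 7] 3136 1693362
```

Under these assumptions the spatial size goes 84 to 20 to 9 to 7, and most of the roughly 1.7 million parameters sit in the first fully connected layer.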

## Key Innovations

### How does experience replay work?

[Experience replay](/wiki/experience_replay) is a technique in which the agent stores its interactions with the environment as tuples (s, a, r, s') in a fixed-size replay buffer. During training, the network is updated using mini-batches sampled uniformly at random from this buffer, rather than using the most recent consecutive experiences. In the Nature paper's ablation study, disabling experience replay "caused a severe deterioration in performance," confirming it as one of the two innovations most responsible for DQN's stability.[2][10]

Experience replay provides three major benefits:

1. **Breaking temporal correlations.** Consecutive frames in a game are highly correlated. Training on sequential data can cause the network to overfit to recent patterns or oscillate. Random sampling breaks these correlations and produces a more stable gradient signal.
2. **Data efficiency.** Each experience can potentially be used in many weight updates, allowing the agent to learn more from fewer environment interactions.
3. **Smoothing the data distribution.** By mixing experiences from many different time steps and policies, the training distribution becomes more stationary, which helps gradient-based optimization converge.

In the Nature DQN implementation, the replay buffer holds up to 1,000,000 transitions.[2]
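
A replay buffer of this kind can be sketched as a ring buffer with uniform sampling. The class name, the `done` flag stored alongside each transition, and the `seed` parameter are assumptions made for this example.

```python
import random

class ReplayBuffer:
    """Fixed-size store of transitions with uniform random sampling."""

    def __init__(self, capacity=1_000_000, seed=None):
        self.capacity = capacity
        self.storage = []
        self.next_index = 0
        self.rng = random.Random(seed)

    def add(self, s, a, r, s_next, done):
        transition = (s, a, r, s_next, done)
        if len(self.storage) < self.capacity:
            self.storage.append(transition)
        else:
            self.storage[self.next_index] = transition  # overwrite the oldest entry
        self.next_index = (self.next_index + 1) % self.capacity

    def sample(self, batch_size=32):
        # Uniform sampling without replacement decorrelates the minibatch.
        return self.rng.sample(self.storage, batch_size)

    def __len__(self):
        return len(self.storage)

buffer = ReplayBuffer(capacity=3, seed=0)
for i in range(5):
    buffer.add(s=i, a=0, r=0.0, s_next=i + 1, done=False)
print(sorted(t[0] for t in buffer.storage))  # [2, 3, 4]
```

Once the buffer is full, each new transition replaces the oldest one, so the agent always trains on its most recent `capacity` transitions.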

### What is the target network?

The target network is a separate copy of the Q-network whose parameters are frozen and only updated periodically (every C steps) by copying the weights from the online (main) network. During training, the target Q-values used to compute the loss are generated by this frozen network rather than the constantly changing online network.

Without a target network, the Q-value targets shift with every gradient step, because the same network that is being updated is also used to generate the targets. This creates a moving-target problem that leads to oscillations, divergence, or slow convergence. By holding the target network fixed for many steps, DQN stabilizes training significantly.[2]
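
The periodic copy can be illustrated with a small helper that tracks how many updates have been applied. The parameter dictionary stands in for real network weights, and the class and method names are hypothetical.

```python
import copy

class TargetSync:
    """Keeps a frozen copy of the online parameters, refreshed every C updates."""

    def __init__(self, online_params, sync_every=10_000):
        self.online = online_params
        self.target = copy.deepcopy(online_params)
        self.sync_every = sync_every
        self.updates = 0

    def after_gradient_step(self):
        self.updates += 1
        if self.updates % self.sync_every == 0:
            # Hard copy of the weights, not a moving average.
            self.target = copy.deepcopy(self.online)

params = {"w": [0.0, 0.0]}
sync = TargetSync(params, sync_every=3)
for step in range(1, 5):
    params["w"][0] += 1.0  # stand-in for a gradient update to the online network
    sync.after_gradient_step()
    print(step, params["w"][0], sync.target["w"][0])
```

In this run the target value stays at 0.0 for the first two steps, jumps to 3.0 at step 3, and remains 3.0 at step 4 while the online value moves on to 4.0.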

### Reward Clipping

DQN clips all rewards to the set {-1, 0, +1}: positive rewards become +1, negative rewards become -1, and zero rewards remain 0. This keeps the scale of the error derivatives comparable across games, so the same learning rate and other hyperparameters can be used for games with very different score scales. The cost is that the agent can no longer distinguish rewards of different magnitudes.[2]
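
For integer-valued game scores, the clipping rule reduces to taking the sign of the reward, as in this small sketch (the function name is hypothetical):

```python
def clip_reward(r):
    """Map a raw game reward to -1, 0, or +1 according to its sign."""
    return (r > 0) - (r < 0)

print([clip_reward(r) for r in (100, 1, 0, -5)])  # [1, 1, 0, -1]
```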

## Training Procedure

### Loss Function

DQN minimizes the mean squared error between the predicted Q-value and the target Q-value. The loss at iteration *i* is:

L_i(theta_i) = E[(r + gamma * max_a' Q(s', a'; theta_i^-) - Q(s, a; theta_i))^2]

Here, theta_i are the parameters of the online Q-network, theta_i^- are the parameters of the target network (frozen), gamma is the discount factor, and the expectation is taken over mini-batches sampled uniformly from the replay buffer.
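
The loss can be written out for a small minibatch as follows. The two lambda functions are hypothetical stand-ins for the online and target networks, each returning a list of Q-values for a state, and the `done` flag marks terminal transitions where no bootstrapping is applied.

```python
def dqn_loss(batch, q_online, q_target, gamma=0.99):
    """Mean squared TD error over a minibatch of (s, a, r, s_next, done) tuples."""
    total = 0.0
    for s, a, r, s_next, done in batch:
        # The target uses the frozen network; terminal transitions do not bootstrap.
        y = r if done else r + gamma * max(q_target(s_next))
        total += (y - q_online(s)[a]) ** 2
    return total / len(batch)

# Hypothetical stand-ins for the two networks (state -> list of Q-values).
q_online = lambda s: [0.5, 1.0]
q_target = lambda s: [2.0, 1.0]
batch = [(0, 1, 1.0, 1, False), (1, 0, -1.0, 2, True)]
print(dqn_loss(batch, q_online, q_target))
```

For the first transition the target is 1 + 0.99 * 2 = 2.98 against a prediction of 1.0; for the second, terminal transition the target is just the reward, -1.0, against a prediction of 0.5. The mean of the two squared errors is about 3.085.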

### Epsilon-Greedy Exploration

DQN uses an [epsilon-greedy policy](/wiki/epsilon_greedy_policy) for action selection during training. With probability epsilon, the agent selects a random action (exploration); with probability 1 - epsilon, it selects the action with the highest Q-value (exploitation). The exploration rate epsilon is annealed linearly from 1.0 to 0.1 over the first 1,000,000 frames, after which it remains fixed at 0.1 for the remainder of training.[2] This schedule allows extensive exploration early in training and shifts toward exploitation as the Q-function improves.
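
The schedule and the action choice can be sketched in a few lines. The function names and the `rng` parameter are assumptions for this example; the default values follow the schedule described above.

```python
import random

def epsilon_at(frame, start=1.0, end=0.1, anneal_frames=1_000_000):
    """Linearly anneal epsilon from start to end, then hold it at end."""
    fraction = min(frame / anneal_frames, 1.0)
    return start + fraction * (end - start)

def select_action(q_values, epsilon, rng=random):
    """Epsilon-greedy choice over a list of Q-values."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                      # explore
    return max(range(len(q_values)), key=q_values.__getitem__)  # exploit

print(epsilon_at(0), epsilon_at(500_000), epsilon_at(2_000_000))
print(select_action([0.1, 0.9, 0.3], epsilon=0.0))  # 1
```

Halfway through the annealing period epsilon is about 0.55, and any frame count past 1,000,000 gives about 0.1.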

### Hyperparameters

The following table lists the key hyperparameters used in the Nature DQN paper:[2]

| Hyperparameter | Value |
|---|---|
| Discount factor (gamma) | 0.99 |
| Minibatch size | 32 |
| Replay buffer size | 1,000,000 transitions |
| Target network update frequency (C) | Every 10,000 steps |
| Learning rate (RMSProp) | 0.00025 |
| Initial epsilon | 1.0 |
| Final epsilon | 0.1 |
| Epsilon annealing period | 1,000,000 frames |
| Replay start size | 50,000 transitions |
| Training frames total | 50,000,000 frames |
| Optimizer | RMSProp (momentum 0.95) |
| Action repeat (frame skip) | 4 |

### Algorithm Pseudocode

The DQN training algorithm proceeds as follows:

1. Initialize the replay buffer D with capacity N.
2. Initialize the Q-network with random weights theta.
3. Initialize the target network with weights theta^- = theta.
4. For each episode:
   - Observe the initial state s (stack of 4 preprocessed frames).
   - For each time step:
     - With probability epsilon, select a random action a; otherwise select a = argmax_a Q(s, a; theta).
     - Execute action a in the environment, observe reward r and next state s'.
     - Clip the reward to {-1, 0, +1}.
     - Store the transition (s, a, r, s') in D.
     - Sample a random minibatch of transitions from D.
     - Compute the target: y = r + gamma * max_a' Q(s', a'; theta^-) (or y = r if the episode has ended).
     - Perform a gradient descent step on (y - Q(s, a; theta))^2 with respect to theta.
     - Every C steps, update the target network: theta^- = theta.
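
The structure of this loop can be exercised end to end on a toy problem. In the sketch below a small table stands in for the Q-network, and a five-state chain stands in for an Atari game, so it shows only the control flow (replay, frozen targets, periodic sync), not deep learning itself. The environment, all hyperparameter values, and sampling the minibatch with replacement are assumptions made for this example. Rewards are already 0 or 1, so clipping is a no-op here.

```python
import random

N_STATES, ACTIONS, GOAL = 5, (0, 1), 4  # actions: 0 = left, 1 = right

def step(s, a):
    """Toy chain: reaching the right end gives reward 1 and ends the episode."""
    s_next = max(0, s - 1) if a == 0 else s + 1
    done = s_next == GOAL
    return s_next, (1.0 if done else 0.0), done

def train(total_steps=3000, gamma=0.9, lr=0.5, batch_size=8,
          sync_every=20, capacity=2000, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]   # online "network" (a table here)
    q_target = [row[:] for row in q]              # frozen copy used for targets
    buffer, s = [], 0
    for t in range(1, total_steps + 1):
        eps = max(0.1, 1.0 - 0.9 * t / 1000)      # linear annealing, then fixed
        if rng.random() < eps:
            a = rng.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda b: q[s][b])
        s_next, r, done = step(s, a)
        buffer.append((s, a, r, s_next, done))
        buffer = buffer[-capacity:]               # drop the oldest transitions
        # Minibatch drawn uniformly (with replacement, for simplicity).
        for bs, ba, br, bn, bd in rng.choices(buffer, k=batch_size):
            y = br if bd else br + gamma * max(q_target[bn])
            q[bs][ba] += lr * (y - q[bs][ba])     # table analogue of the gradient step
        if t % sync_every == 0:
            q_target = [row[:] for row in q]      # theta^- = theta
        s = 0 if done else s_next
    return q

q = train()
print([max(ACTIONS, key=lambda b: q[s][b]) for s in range(GOAL)])  # [1, 1, 1, 1]
```

The learned greedy policy moves right in every non-terminal state, which is the optimal behaviour on this chain.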

## Atari Game Results

DQN was evaluated on 49 Atari 2600 games from the Arcade Learning Environment (ALE). The same network architecture, hyperparameters, and algorithm were used for all games, with no game-specific engineering. The agent received only the raw screen pixels and game score as input.

Key results from the 2015 Nature paper include:

- DQN outperformed all previous machine learning methods on 43 of the 49 games tested.[2]
- On more than half of the 49 games, DQN achieved at least 75% of the score of a professional human game tester.[2]
- In several games (including Video Pinball, Boxing, and Breakout), DQN exceeded human-level performance by a wide margin.[2]
- In Breakout specifically, DQN discovered a strategy of tunneling through the wall to bounce the ball behind the bricks, a tactic that maximizes score efficiently.[2]

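The following table shows selected Atari game results comparing DQN to human performance. The final column is the simple ratio of the two listed scores (DQN score divided by human score); it is not the random-baseline-adjusted normalized score reported in the Nature paper.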

| Game | DQN Score | Human Score | DQN vs. Human (%) |
|---|---|---|---|
| Breakout | 401.2 | 31.8 | 1,262% |
| Video Pinball | 42,684.4 | 17,297.6 | 247% |
| Boxing | 91.6 | 12.1 | 757% |
| Pong | 20.9 | 9.3 | 225% |
| Space Invaders | 1,976.0 | 1,668.7 | 118% |
| Seaquest | 5,286.8 | 42,054.7 | 13% |
| Montezuma's Revenge | 0.0 | 4,753.3 | 0% |
| Private Eye | 1,788.0 | 69,571.3 | 3% |
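
The percentage column can be reproduced from the two score columns with a one-line ratio. The snippet below checks a few rows; the function name is hypothetical and the scores are copied from the table.

```python
def percent_of_human(dqn_score, human_score):
    """DQN score as a rounded percentage of the human score."""
    return round(100 * dqn_score / human_score)

scores = {  # game: (DQN score, human score)
    "Breakout": (401.2, 31.8),
    "Pong": (20.9, 9.3),
    "Seaquest": (5286.8, 42054.7),
    "Montezuma's Revenge": (0.0, 4753.3),
}
for game, (dqn, human) in scores.items():
    print(f"{game}: {percent_of_human(dqn, human)}%")
```

This prints 1262%, 225%, 13%, and 0% for the four games, matching the table.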

DQN performed poorly on games requiring long-term planning and exploration, such as Montezuma's Revenge and Private Eye, where it reached 0% and about 3% of the human score, respectively.[2] These games have sparse rewards and require systematic exploration of large environments, which undirected epsilon-greedy exploration rarely achieves.

## Extensions and Variants

Since the original DQN publication, researchers have proposed numerous improvements. The most significant extensions are summarized below.

### Double DQN (DDQN)

Proposed by Hado van Hasselt, Arthur Guez, and David Silver (AAAI 2016), Double DQN addresses the overestimation bias inherent in standard Q-learning.[3] In regular DQN, the target network both selects and evaluates the best next action through a single max operator. When the value estimates contain noise, taking the maximum tends to pick actions whose values happen to be overestimated, so the target is biased upward.

Double DQN decouples action selection from action evaluation. The online network selects the best action for the next state, but the target network evaluates the Q-value of that action:

y = r + gamma * Q(s', argmax_a' Q(s', a'; theta); theta^-)

This simple change reduces overestimation and often improves performance significantly.[3]
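
The difference between the two targets is easiest to see with concrete numbers. In this sketch the Q-value lists are made-up values for a single next state, and the function names are hypothetical.

```python
def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

def dqn_target(r, q_target_next, gamma=0.99, done=False):
    """Standard DQN: the target network both selects and evaluates the action."""
    return r if done else r + gamma * max(q_target_next)

def double_dqn_target(r, q_online_next, q_target_next, gamma=0.99, done=False):
    """Double DQN: the online network selects, the target network evaluates."""
    if done:
        return r
    return r + gamma * q_target_next[argmax(q_online_next)]

q_online_next = [1.0, 3.0, 2.0]  # hypothetical Q(s', .; theta)
q_target_next = [2.5, 1.5, 2.0]  # hypothetical Q(s', .; theta^-)
print(dqn_target(1.0, q_target_next))                        # about 3.475
print(double_dqn_target(1.0, q_online_next, q_target_next))  # about 2.485
```

Here the online network prefers action 1, which the target network values at 1.5, so the Double DQN target is lower than the standard target built from the target network's own maximum of 2.5.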

### Prioritized Experience Replay

Proposed by Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver (ICLR 2016), Prioritized Experience Replay replaces uniform random sampling from the replay buffer with a prioritized sampling strategy.[4] Transitions with larger temporal-difference (TD) errors are sampled more frequently, as they represent experiences from which the network can learn the most.

Two variants were proposed:

| Variant | Prioritization Method |
|---|---|
| Proportional | Priority proportional to the absolute TD error plus a small constant epsilon |
| Rank-based | Priority based on the rank of each transition when sorted by absolute TD error |

Prioritized replay improved DQN performance on 41 out of 49 Atari games compared to uniform replay.[4]
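
A minimal sketch of the proportional variant follows. The exponent `alpha`, the importance-sampling exponent `beta`, and the corresponding weight formula are not described in this article; they follow the general formulation of prioritized replay, and the specific values used here are illustrative assumptions.

```python
import random

def proportional_probs(td_errors, alpha=0.6, eps=1e-6):
    """Sampling probabilities P(i) proportional to (|delta_i| + eps) ** alpha."""
    priorities = [(abs(d) + eps) ** alpha for d in td_errors]
    total = sum(priorities)
    return [p / total for p in priorities]

def importance_weights(probs, beta=0.4):
    """Importance-sampling weights (N * P(i)) ** -beta, scaled so the max is 1."""
    n = len(probs)
    weights = [(n * p) ** -beta for p in probs]
    top = max(weights)
    return [w / top for w in weights]

td_errors = [0.1, 2.0, 0.5, 0.0]
probs = proportional_probs(td_errors)
weights = importance_weights(probs)
# Draw a minibatch of indices according to the priorities.
batch_indices = random.Random(0).choices(range(len(td_errors)), weights=probs, k=2)
print(probs, weights, batch_indices)
```

The transition with the largest TD error gets the highest sampling probability and the smallest importance weight, which compensates for it being replayed more often. The small constant `eps` keeps zero-error transitions from never being sampled.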

### Dueling DQN

Proposed by Ziyu Wang and colleagues (ICML 2016, Best Paper), the Dueling Network Architecture modifies the network structure by splitting the final layers into two separate streams:[5]

1. A **value stream** V(s) that estimates the value of being in a given state regardless of the action taken.
2. An **advantage stream** A(s, a) that estimates how much better each action is compared to the average action in that state.

The Q-value is then reconstructed as Q(s, a) = V(s) + A(s, a) - mean(A(s, .)). This decomposition allows the network to learn which states are valuable without having to evaluate every action independently, which is particularly beneficial in states where the choice of action does not matter much. The Dueling architecture outperformed Double DQN on 50 of 57 Atari games.[5]
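
The aggregation step is a single line of arithmetic, shown here with made-up stream outputs (the function name is hypothetical):

```python
def dueling_q(value, advantages):
    """Combine V(s) and A(s, .) into Q(s, .) using the mean-advantage baseline."""
    mean_adv = sum(advantages) / len(advantages)
    return [value + a - mean_adv for a in advantages]

print(dueling_q(2.0, [1.0, 2.0, 3.0]))  # [1.0, 2.0, 3.0]
```

Subtracting the mean advantage makes the decomposition identifiable: the average of the resulting Q-values always equals V(s), so a constant cannot shift freely between the two streams.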

### Noisy Networks (NoisyNet)

Proposed by Meire Fortunato, Mohammad Gheshlaghi Azar, and colleagues (ICLR 2018), Noisy Networks replace epsilon-greedy exploration with learned parametric noise added directly to the network weights.[7] The noise parameters are trained alongside the regular weights using gradient descent.

This approach provides state-dependent exploration: the network can learn to explore more in unfamiliar states and less in well-understood ones. NoisyNet eliminates the need for manually tuning the epsilon schedule and achieved improved scores on many Atari games.[7]
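
The core idea, weights of the form mu + sigma * noise, can be sketched as a small layer. This example uses independent Gaussian noise per weight and omits the bias term for brevity; the initial values and the class name are assumptions, and no training is shown.

```python
import random

class NoisyLinear:
    """Linear layer whose weights are mu + sigma * noise, resampled on demand."""

    def __init__(self, n_in, n_out, sigma_init=0.017, seed=None):
        self.rng = random.Random(seed)
        bound = n_in ** -0.5
        # In a real agent both mu and sigma are trained by gradient descent.
        self.w_mu = [[self.rng.uniform(-bound, bound) for _ in range(n_in)]
                     for _ in range(n_out)]
        self.w_sigma = [[sigma_init] * n_in for _ in range(n_out)]
        self.resample()

    def resample(self):
        """Draw fresh independent Gaussian noise for every weight."""
        self.w_eps = [[self.rng.gauss(0.0, 1.0) for _ in row] for row in self.w_mu]

    def forward(self, x):
        return [sum((mu + sg * ep) * xi for mu, sg, ep, xi in zip(m, s, e, x))
                for m, s, e in zip(self.w_mu, self.w_sigma, self.w_eps)]

layer = NoisyLinear(n_in=3, n_out=2, seed=0)
x = [1.0, 2.0, 3.0]
first = layer.forward(x)
layer.resample()
second = layer.forward(x)
print(first, second)  # the two outputs differ because the noise was resampled
```

If training drives a weight's sigma toward zero, that weight becomes effectively deterministic, which is how the amount of exploration can shrink where the network no longer needs it.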

### Distributional DQN (C51)

Proposed by Marc G. Bellemare, Will Dabney, and Remi Munos (ICML 2017), C51 moves beyond estimating the expected Q-value to learning the full probability distribution of returns.[6] Instead of outputting a single Q-value per action, the network outputs a discrete probability distribution over 51 equally spaced "atoms" spanning a range [V_min, V_max]. The number 51 was found experimentally to offer the best tradeoff between performance and computation.[6]

Learning the full distribution captures the inherent uncertainty and multimodality of returns, which provides richer gradient signals during training. C51 substantially outperformed all prior DQN variants at the time of publication.[6]
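
The fixed support and the way a scalar Q-value is recovered from the predicted distribution can be sketched as follows. The range of -10 to 10 is an illustrative assumption, and the distributional Bellman update (including the projection back onto the atoms) is not shown.

```python
def make_support(v_min=-10.0, v_max=10.0, n_atoms=51):
    """Equally spaced return values z_0 .. z_{N-1} spanning [v_min, v_max]."""
    delta = (v_max - v_min) / (n_atoms - 1)
    return [v_min + i * delta for i in range(n_atoms)]

def expected_q(probs, support):
    """Q(s, a) is the mean of the categorical return distribution."""
    return sum(p * z for p, z in zip(probs, support))

support = make_support()
uniform = [1.0 / len(support)] * len(support)
peaked = [1.0 if i == len(support) - 1 else 0.0 for i in range(len(support))]
print(len(support), support[1] - support[0])
print(expected_q(uniform, support), expected_q(peaked, support))
```

With 51 atoms on this range the spacing is about 0.4. A uniform distribution has an expected value near 0, while a distribution concentrated on the last atom has an expected value of about 10. Actions are still chosen greedily with respect to these expected values.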

### Multi-Step Learning

Standard DQN uses single-step TD targets (bootstrapping from the very next state). Multi-step returns instead accumulate rewards over *n* consecutive steps before bootstrapping:

y = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... + gamma^{n-1} * r_{t+n-1} + gamma^n * max_a Q(s_{t+n}, a; theta^-)

Multi-step returns propagate reward information more quickly and can accelerate learning, though they introduce more variance and a different bias-variance tradeoff.
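
The n-step target is a discounted sum followed by one bootstrap term, as in this sketch (the function name and the numbers are illustrative):

```python
def n_step_target(rewards, bootstrap_value, gamma=0.99):
    """Discounted sum of n rewards plus gamma**n times the bootstrap value."""
    n = len(rewards)
    discounted = sum(gamma ** k * r for k, r in enumerate(rewards))
    return discounted + gamma ** n * bootstrap_value

# bootstrap_value plays the role of max_a Q(s_{t+n}, a; theta^-).
print(n_step_target([1.0, 0.0, 2.0], bootstrap_value=5.0, gamma=0.9))  # about 6.265
```

With n = 3 and gamma = 0.9 the target is 1 + 0 + 0.81 * 2 + 0.729 * 5. Passing a single reward recovers the ordinary one-step target.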

### Rainbow DQN

Proposed by Matteo Hessel, Joseph Modayil, Hado van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver (AAAI 2018), Rainbow combines six improvements to DQN into a single integrated agent:[8]

| Component | Key Idea |
|---|---|
| Double Q-learning | Reduces overestimation bias |
| Prioritized Experience Replay | Focuses training on high-error transitions |
| Dueling Networks | Separates state value from action advantage |
| Multi-step Learning (n=3) | Faster reward propagation |
| Distributional RL (C51) | Learns full return distribution |
| Noisy Networks | Learned exploration replacing epsilon-greedy |

Rainbow reached a median human-normalized score of roughly 223% across the 57-game Atari benchmark after 200 million frames of training, far above the comparable score of vanilla DQN.[8] An ablation study found that prioritized replay and multi-step learning provided the largest individual performance gains, while distributional RL and noisy networks were also important.[8] Rainbow also demonstrated significantly improved data efficiency, matching DQN's best performance after just 7 million frames and surpassing every baseline within 44 million frames, compared with the 50 million frames used to train the original DQN.[8]

## Limitations

Despite its groundbreaking success, DQN has several notable limitations:

1. **Discrete action spaces only.** DQN outputs a Q-value for each possible action, which requires enumerating all actions. This approach does not scale to continuous action spaces (such as robotic control with continuous joint torques). Algorithms like Deep Deterministic Policy Gradient (DDPG) and Soft Actor-Critic (SAC) were developed to address this gap.

2. **Overestimation bias.** Standard DQN systematically overestimates Q-values due to the use of the max operator in the Bellman target. The severity of overestimation grows with the number of actions: if Q-value estimates contain random errors uniformly distributed in [-epsilon, epsilon], the overestimation can be as large as gamma * epsilon * (m-1) / (m+1), where m is the number of actions.[3] Double DQN was specifically designed to mitigate this problem.[3]

3. **Poor exploration.** The epsilon-greedy exploration strategy explores randomly without considering the structure of the environment. This leads to extremely poor performance on games with sparse or delayed rewards (such as Montezuma's Revenge), where systematic exploration is required.

4. **Sample inefficiency.** Although experience replay improves data reuse, DQN still requires tens of millions of frames (equivalent to hundreds of hours of gameplay) to learn a single game. This makes it impractical for real-world applications where data collection is expensive.

5. **Reward clipping loses information.** Clipping all rewards to {-1, 0, +1} discards information about reward magnitudes. A reward of +100 is treated the same as a reward of +1, which can lead to suboptimal policies in environments where reward scale matters.
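
The overestimation effect described in item 2 can be checked numerically. The sketch below assumes every action has a true value of 0 and adds independent uniform noise in [-eps, eps] to each estimate; the average of the maximum then approaches eps * (m-1) / (m+1), which the discount factor gamma scales in the bootstrap target. The function name, trial count, and seed are assumptions.

```python
import random

def mean_max_noise(m, eps=1.0, trials=50_000, seed=0):
    """Average of the max over m noisy estimates when every true Q-value is 0."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += max(rng.uniform(-eps, eps) for _ in range(m))
    return total / trials

for m in (2, 4, 18):
    simulated = mean_max_noise(m)
    predicted = (m - 1) / (m + 1)
    print(m, round(simulated, 3), round(predicted, 3))
```

Although each individual estimate is unbiased, the maximum is biased upward, and the bias grows toward eps as the number of actions increases.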

## How does DQN differ from policy gradient methods?

DQN and policy gradient methods represent two fundamental approaches to deep reinforcement learning:

| Aspect | DQN (Value-Based) | Policy Gradient Methods |
|---|---|---|
| What is learned | Q-value function Q(s, a) | Policy pi(a given s) directly |
| Action space | Discrete only | Discrete and continuous |
| Exploration | Epsilon-greedy (random) | Stochastic policy (natural) |
| Training data | Off-policy (replay buffer) | Typically on-policy |
| Sample efficiency | Higher (reuses past data) | Lower (requires fresh data) |
| Variance | Lower (value estimates) | Higher (reward-based gradients) |
| Convergence | Can diverge with function approximation | Converges to local optima |

Policy gradient methods such as REINFORCE, Proximal Policy Optimization (PPO), and Trust Region Policy Optimization (TRPO) optimize the policy directly by estimating the gradient of expected return with respect to policy parameters. They handle continuous actions naturally but tend to have high variance in their gradient estimates and are typically on-policy, meaning they generally cannot reuse experience collected under older policies.

Actor-critic methods, including Advantage Actor-Critic (A2C), Asynchronous Advantage Actor-Critic (A3C), and Soft Actor-Critic (SAC), combine elements of both approaches by maintaining both a policy (actor) and a value function (critic). The critic's value estimates reduce the variance of the policy gradient, while the explicit actor keeps the ability to handle continuous actions.

## Why was DQN important for deep reinforcement learning?

DQN's publication in *Nature* in 2015 is widely regarded as the event that launched the modern era of deep reinforcement learning. Its impact extends across several dimensions:

- **Proof of concept.** DQN demonstrated that a single neural network agent could learn diverse tasks from raw sensory input without task-specific feature engineering, using the same algorithm and architecture across all tasks.[2]
- **Algorithmic foundation.** Experience replay and target networks, as combined in DQN, became standard components of many subsequent off-policy deep RL algorithms, including DDPG and SAC.
- **Industry investment.** The *Nature* results were published about a year after Google's acquisition of DeepMind for a reported sum of more than $500 million, signaling to the broader technology industry that deep RL was a commercially viable research direction.[11]
- **Subsequent breakthroughs.** DQN laid the groundwork for later achievements including AlphaGo (2016), AlphaZero (2017), OpenAI Five (2018), and AlphaStar (2019), all of which built upon the deep RL paradigm that DQN helped establish.

## Explain Like I'm 5 (ELI5)

Imagine you are learning to play a new video game. At first, you press buttons randomly and sometimes you score points. Over time, you start remembering which buttons worked well in different situations. That is basically what DQN does, except instead of a human brain, it uses a computer program called a [neural network](/wiki/neural_network).

The neural network looks at what is happening on the game screen (the pixels) and tries to predict how many points it will eventually get for each button it could press. It picks the button that it thinks will lead to the most points.

To learn faster, the agent keeps a "scrapbook" of past moments from the game (this is called [experience replay](/wiki/experience_replay)). Instead of only learning from what just happened, it flips through its scrapbook and studies random old moments too. This helps it learn more steadily.

There is also a trick called the "target network." Think of it like having a teacher who gives you answers to check your work against. The teacher does not change their answers every second; they only update their answer key once in a while. This keeps things stable so the agent does not get confused by constantly shifting goals.

Using these ideas together, DQN was able to learn to play dozens of Atari video games (like Breakout and Pong) just by watching the screen, with no one telling it the rules. In many games, it played better than human experts.

## References

1. Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., & Riedmiller, M. (2013). "Playing Atari with Deep Reinforcement Learning." *NIPS Deep Learning Workshop*. [arXiv:1312.5602](https://arxiv.org/abs/1312.5602)
2. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., Graves, A., Riedmiller, M., Fidjeland, A.K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., & Hassabis, D. (2015). "Human-level control through deep reinforcement learning." *Nature*, 518(7540), 529-533. [doi:10.1038/nature14236](https://www.nature.com/articles/nature14236)
3. Van Hasselt, H., Guez, A., & Silver, D. (2016). "Deep Reinforcement Learning with Double Q-learning." *Proceedings of the AAAI Conference on Artificial Intelligence*, 30, 2094-2100. [arXiv:1509.06461](https://arxiv.org/abs/1509.06461)
4. Schaul, T., Quan, J., Antonoglou, I., & Silver, D. (2016). "Prioritized Experience Replay." *Proceedings of ICLR 2016*. [arXiv:1511.05952](https://arxiv.org/abs/1511.05952)
5. Wang, Z., Schaul, T., Hessel, M., Van Hasselt, H., Lanctot, M., & De Freitas, N. (2016). "Dueling Network Architectures for Deep Reinforcement Learning." *Proceedings of ICML 2016*. [arXiv:1511.06581](https://arxiv.org/abs/1511.06581)
6. Bellemare, M.G., Dabney, W., & Munos, R. (2017). "A Distributional Perspective on Reinforcement Learning." *Proceedings of ICML 2017*. [arXiv:1707.06887](https://arxiv.org/abs/1707.06887)
7. Fortunato, M., Azar, M.G., Piot, B., Menick, J., Osband, I., Graves, A., Mnih, V., Munos, R., Hassabis, D., Pietquin, O., Blundell, C., & Legg, S. (2018). "Noisy Networks for Exploration." *Proceedings of ICLR 2018*. [arXiv:1706.10295](https://arxiv.org/abs/1706.10295)
8. Hessel, M., Modayil, J., Van Hasselt, H., Schaul, T., Ostrovski, G., Dabney, W., Horgan, D., Piot, B., Azar, M., & Silver, D. (2018). "Rainbow: Combining Improvements in Deep Reinforcement Learning." *Proceedings of AAAI 2018*. [arXiv:1710.02298](https://arxiv.org/abs/1710.02298)
9. Watkins, C.J.C.H., & Dayan, P. (1992). "Q-learning." *Machine Learning*, 8(3-4), 279-292.
10. DeepMind. (2015). "From Pixels to Actions: Human-level control through Deep Reinforcement Learning." *Google Research Blog*. [Link](https://research.google/blog/from-pixels-to-actions-human-level-control-through-deep-reinforcement-learning/)
11. TechCrunch. (2014). "Google Acquires Artificial Intelligence Startup DeepMind For More Than $500M." [Link](https://techcrunch.com/2014/01/26/google-deepmind/)

