A target network is a separate copy of a neural network used in deep reinforcement learning algorithms to stabilize the training process. Instead of computing learning targets from the same network whose parameters are being updated, a target network provides a fixed or slowly changing reference point for computing target values in the Bellman equation. Target networks were introduced as part of the Deep Q-Network (DQN) algorithm by Mnih et al.: the 2013 workshop paper combined Q-learning with deep networks and experience replay, and the 2015 Nature paper added the separate target network. They have since become a standard component in many off-policy deep reinforcement learning methods, including Deep Deterministic Policy Gradient (DDPG), Twin Delayed DDPG (TD3), and Soft Actor-Critic (SAC).
Imagine you are learning to throw a basketball into a hoop. Every time you throw, someone tells you how far off your throw was so you can adjust. But what if the hoop kept moving every time you threw? It would be really hard to get better because the target keeps changing.
In reinforcement learning, a computer agent has the same problem. It is trying to learn how good different actions are, but the "answer key" it uses to check itself keeps changing as it learns. A target network is like having a second, frozen hoop that stays in one place for a while. The agent practices throwing at that frozen hoop, and only after many practice throws does someone move the frozen hoop to match the real one. This makes learning much easier because the agent has a steady goal to aim for.
In Q-learning, an agent learns to estimate the expected cumulative reward for taking an action in a given state. When a neural network is used to approximate the Q-function (as in DQN), the network's parameters appear on both sides of the update equation. The Q-value prediction for the current state-action pair is compared against a target that itself depends on the Q-values of the next state, computed by the same network.
This creates a feedback loop: each time the network's weights are updated to reduce the error on one sample, the target values for all other samples also shift. The result is a "moving target" problem where the optimization objective changes with every parameter update. In practice, this can cause oscillations, divergence, or slow convergence during training.
The instability problem is closely related to what Sutton and Barto (2018) called the deadly triad in reinforcement learning. The deadly triad occurs when three elements are combined simultaneously:

- Function approximation (for example, a neural network rather than a lookup table)
- Bootstrapping (updating value estimates from other value estimates, as in temporal-difference learning)
- Off-policy training (learning about a policy different from the one generating the data)
When all three are present, learning can become unstable and value estimates may diverge to infinity. The target network was one of the key techniques introduced to mitigate this instability. Zhang, Yao, and Whiteson (2021) provided formal theoretical results showing that target networks, combined with ridge regularization, can break the deadly triad and guarantee convergence for linear Q-learning algorithms under non-restrictive conditions.
A target network is architecturally identical to the online network (also called the main network or policy network) but maintains a separate set of parameters. During training:

- The online network (parameters θ) selects actions and is updated by gradient descent at every training step.
- The target network (parameters θ⁻) is used only to compute the learning targets and receives no gradient updates.
- The target network's parameters are synchronized with the online network's parameters either periodically (a hard update) or gradually (a soft update).
By decoupling the parameters used for the prediction from those used for the target computation, the learning targets remain stable over many update steps. This reduces the feedback loop and makes the optimization problem more similar to standard supervised learning, where the targets are fixed.
In standard DQN, the loss function for a single transition (s, a, r, s') is:
L(θ) = (r + γ max_a' Q(s', a'; θ⁻) - Q(s, a; θ))²
where:

- Q(s, a; θ) is the online network's estimate of the value of taking action a in state s,
- θ are the parameters of the online network and θ⁻ are the parameters of the target network,
- r is the reward received after taking action a in state s, and s' is the resulting next state,
- γ is the discount factor, and
- max_a' denotes the maximum over the actions available in state s'.
The term r + γ max_a' Q(s', a'; θ⁻) is called the TD target. Because it uses the target network's fixed parameters θ⁻ rather than the online network's changing parameters θ, the TD target does not shift with every gradient update.
In practice, the loss is computed over mini-batches sampled from an experience replay buffer:
L(θ) = (1/N) Σᵢ (rᵢ + γ max_a' Q(s'ᵢ, a'; θ⁻) - Q(sᵢ, aᵢ; θ))²
Some implementations use the Huber loss instead of mean squared error to make training more robust to outliers in the TD error.
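This loss computation can be sketched in PyTorch as follows (a minimal illustration, assuming a discrete-action Q-network and batch tensors named as shown; this is not a fixed API):

import torch
import torch.nn.functional as F

def dqn_loss(online_net, target_net, batch, gamma=0.99):
    # batch: tensors of states, actions (int64), rewards, next_states, dones (0/1 floats)
    states, actions, rewards, next_states, dones = batch

    # Q(s, a; θ): predictions from the online network for the actions actually taken
    q_pred = online_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # TD target uses the frozen target network parameters θ⁻
    with torch.no_grad():
        next_q_max = target_net(next_states).max(dim=1).values
        td_target = rewards + gamma * (1.0 - dones) * next_q_max

    # Huber (smooth L1) loss is often preferred over MSE for robustness to large TD errors
    return F.smooth_l1_loss(q_pred, td_target)

Note that the target is computed inside torch.no_grad(), so no gradients flow through the target network.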
There are two main strategies for updating the target network's parameters to eventually reflect the online network's learned values.
In the original DQN algorithm, the target network's parameters are periodically replaced with the online network's parameters every C steps:
θ⁻ ← θ (every C steps)
Between updates, the target network's parameters remain completely frozen. This is the simplest update strategy and is the one used in the 2015 Nature paper by Mnih et al.
Introduced in the DDPG algorithm by Lillicrap et al. (2015), soft updates apply a weighted average at every training step:
θ⁻ ← τθ + (1 - τ)θ⁻
where τ (tau) is a small positive number called the soft update coefficient, typically in the range 0.001 to 0.01. This formula is also known as Polyak averaging (or exponential moving average of parameters). At each step, the target network moves a tiny fraction toward the online network, resulting in a smooth and continuous update rather than an abrupt replacement.
The name "Polyak averaging" comes from the work of Boris Polyak and Anatoli Juditsky (1992), who showed that averaging the iterates of stochastic optimization algorithms can improve convergence rates.
| Property | Hard update | Soft update (Polyak averaging) |
|---|---|---|
| Update formula | θ⁻ ← θ every C steps | θ⁻ ← τθ + (1 - τ)θ⁻ every step |
| Update frequency | Every C steps (e.g., 1,000 or 10,000) | Every training step |
| Key hyperparameter | C (update period) | τ (interpolation coefficient) |
| Typical hyperparameter values | C = 1,000 to 10,000 | τ = 0.001 to 0.01 |
| Target stability | Very stable between updates; abrupt jumps at update points | Continuously changing but very slowly |
| Smoothness | Discontinuous; target values can shift suddenly | Smooth; target values change gradually |
| Used in | DQN, Double DQN | DDPG, TD3, SAC |
| Sensitivity | Sensitive to choice of C; too large causes staleness, too small reduces stability | Sensitive to choice of τ; too large reduces stability, too small causes staleness |
Target networks appear in many deep reinforcement learning algorithms. The following table summarizes how different algorithms use them.
| Algorithm | Year | Target network usage | Update method | Networks with targets | Key innovation |
|---|---|---|---|---|---|
| DQN | 2013/2015 | Target Q-network for computing TD targets | Hard update every C steps | Q-network | Introduced target networks with experience replay |
| Double DQN | 2016 | Target Q-network for value evaluation; online network for action selection | Hard update every C steps | Q-network | Decoupled action selection from value evaluation to reduce overestimation |
| DDPG | 2015 | Separate target networks for both actor and critic | Soft update (τ = 0.001) | Actor, Critic | Extended target networks to continuous action spaces with soft updates |
| TD3 | 2018 | Target networks for twin critics and actor | Soft update with delayed policy updates | Actor, two Critics | Added delayed updates and target policy smoothing |
| SAC | 2018 | Target Q-networks (two) with Polyak averaging | Soft update (τ = 0.005) | Two Q-networks | Combined target networks with maximum entropy framework |
The DQN algorithm, published by Mnih et al. at DeepMind, was the first to combine deep neural networks with Q-learning for high-dimensional state spaces such as Atari game frames. DQN introduced two stabilization techniques that work together:

- Experience replay, which stores transitions in a buffer and samples random mini-batches from it for training
- A target network, which provides stable TD targets for the Q-learning update
Experience replay addresses the correlation between consecutive training samples, while the target network addresses the non-stationarity of the learning targets. Together, they make deep Q-learning practical for complex tasks. In the original DQN experiments on Atari games, the target network was updated every 10,000 steps.
Van Hasselt, Guez, and Silver (2016) identified that the max operator in DQN's target computation leads to systematic overestimation of Q-values. Standard DQN uses the same (target) network both to select the best action and to evaluate it:
y = r + γ max_a' Q(s', a'; θ⁻)
Double DQN decouples these two steps. It uses the online network to select the action and the target network to evaluate that action's value:
y = r + γ Q(s', argmax_a' Q(s', a'; θ); θ⁻)
This small change significantly reduces overestimation bias and leads to improved performance across many Atari games.
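The difference is easiest to see side by side. The following PyTorch sketch (network and tensor names are illustrative assumptions) computes the standard DQN target and the Double DQN target for a batch of transitions:

import torch

def dqn_target(target_net, rewards, next_states, dones, gamma=0.99):
    # Standard DQN: the target network both selects and evaluates the best next action
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
    return rewards + gamma * (1.0 - dones) * next_q

def double_dqn_target(online_net, target_net, rewards, next_states, dones, gamma=0.99):
    # Double DQN: the online network selects the action, the target network evaluates it
    with torch.no_grad():
        best_actions = online_net(next_states).argmax(dim=1, keepdim=True)
        next_q = target_net(next_states).gather(1, best_actions).squeeze(1)
    return rewards + gamma * (1.0 - dones) * next_q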
Lillicrap et al. (2015) adapted DQN concepts to continuous action spaces using an actor-critic architecture. DDPG maintains four networks:

- An online actor (policy) network that maps states to continuous actions
- An online critic network that estimates Q-values for state-action pairs
- A target actor network, a slowly updated copy of the actor
- A target critic network, a slowly updated copy of the critic
DDPG was the first algorithm to use soft (Polyak averaging) updates for target networks, with the target parameters constrained to change slowly at each step. The DDPG paper used τ = 0.001.
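As a minimal sketch (names are illustrative assumptions, not the paper's code, and the critic is assumed to take a state-action pair), the DDPG critic target combines the target actor and target critic:

import torch

def ddpg_critic_target(target_actor, target_critic, rewards, next_states, dones, gamma=0.99):
    # Both the next action and its value come from the slowly updated target networks
    with torch.no_grad():
        next_actions = target_actor(next_states)
        next_q = target_critic(next_states, next_actions).squeeze(-1)
    return rewards + gamma * (1.0 - dones) * next_q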
Fujimoto, van Hoof, and Meger (2018) identified overestimation bias as a problem in actor-critic methods and introduced three improvements on top of DDPG:
Clipped double Q-learning: Two critic networks are maintained, and the smaller of the two Q-value estimates is used in the target. This prevents the policy from exploiting overestimated Q-values.
Delayed policy updates: The actor (and target networks) are updated less frequently than the critics, typically once for every two critic updates.
Target policy smoothing: Noise is added to the actions selected by the target actor to smooth the Q-value estimates:
a'(s') = clip(μ_θ_targ(s') + clip(ε, -c, c), a_low, a_high), where ε ~ N(0, σ)
These three techniques work together to produce more stable and accurate learning.
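A sketch of the resulting target computation, assuming critics that take (state, action) pairs and actions bounded in [act_low, act_high] (all names here are illustrative assumptions):

import torch

def td3_target(target_actor, target_critic1, target_critic2,
               rewards, next_states, dones,
               gamma=0.99, sigma=0.2, noise_clip=0.5, act_low=-1.0, act_high=1.0):
    with torch.no_grad():
        # Target policy smoothing: add clipped Gaussian noise to the target actor's action
        next_actions = target_actor(next_states)
        noise = (torch.randn_like(next_actions) * sigma).clamp(-noise_clip, noise_clip)
        next_actions = (next_actions + noise).clamp(act_low, act_high)

        # Clipped double Q-learning: take the smaller of the two target critics' estimates
        q1 = target_critic1(next_states, next_actions).squeeze(-1)
        q2 = target_critic2(next_states, next_actions).squeeze(-1)
        next_q = torch.min(q1, q2)
    return rewards + gamma * (1.0 - dones) * next_q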
Haarnoja et al. (2018) combined target networks with an entropy-regularized objective. SAC uses two target Q-networks updated via Polyak averaging and takes the minimum of their predictions (similar to TD3's clipped double Q-learning). The target computation in SAC incorporates an entropy term:
y = r + γ (min(Q_targ,1(s', a'), Q_targ,2(s', a')) - α log π(a'|s'))
where a' is an action sampled from the current policy π(·|s') and α is a temperature parameter controlling the trade-off between reward maximization and entropy (exploration).
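A minimal sketch of this target computation, assuming a policy object with a sample method returning actions and their log-probabilities (an illustrative convention, not a fixed API), and target critics taking (state, action) pairs:

import torch

def sac_target(policy, target_q1, target_q2, rewards, next_states, dones,
               gamma=0.99, alpha=0.2):
    with torch.no_grad():
        # Sample the next action and its log-probability from the current policy
        next_actions, log_probs = policy.sample(next_states)
        # Minimum over the two target Q-networks, minus the weighted log-probability (entropy bonus)
        q_min = torch.min(
            target_q1(next_states, next_actions),
            target_q2(next_states, next_actions),
        ).squeeze(-1)
        next_value = q_min - alpha * log_probs
    return rewards + gamma * (1.0 - dones) * next_value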
The following pseudocode shows how a target network is used in the DQN algorithm:
Initialize online Q-network with random parameters θ
Initialize target Q-network with parameters θ⁻ = θ
Initialize replay buffer D
For each episode:
    Observe initial state s
    For each step:
        Select action a using ε-greedy policy based on Q(s, ·; θ)
        Execute action a, observe reward r and next state s'
        Store transition (s, a, r, s') in replay buffer D
        Sample random mini-batch of transitions from D
        For each transition (sᵢ, aᵢ, rᵢ, s'ᵢ):
            If s'ᵢ is terminal:
                yᵢ = rᵢ
            Else:
                yᵢ = rᵢ + γ max_a' Q(s'ᵢ, a'; θ⁻)    // Use target network
        Update θ by minimizing loss: L = (1/N) Σ (yᵢ - Q(sᵢ, aᵢ; θ))²
        Every C steps: θ⁻ ← θ    // Update target network
        s ← s'
In frameworks like PyTorch, a soft update is typically implemented by iterating over the parameters of both networks:
def soft_update(online_net, target_net, tau):
    for target_param, online_param in zip(
        target_net.parameters(), online_net.parameters()
    ):
        target_param.data.copy_(
            tau * online_param.data + (1.0 - tau) * target_param.data
        )
A hard update is simply the special case where τ = 1:
def hard_update(online_net, target_net):
    target_net.load_state_dict(online_net.state_dict())
The performance of target networks depends on the careful selection of their associated hyperparameters.
The update frequency C controls how often the target network is synchronized with the online network. The following guidelines are commonly observed:
| Update frequency (C) | Effect | When to use |
|---|---|---|
| Very small (e.g., 100) | Target changes frequently; less stability | Simple environments with fast learning |
| Moderate (e.g., 1,000-5,000) | Balanced stability and freshness | Most standard tasks |
| Large (e.g., 10,000+) | Very stable but potentially stale targets | Complex environments; original DQN Atari setting |
If C is too small, the target network changes too quickly and the stabilization benefit is lost. If C is too large, the target network becomes stale, meaning its predictions diverge from the online network's current understanding, which can slow down learning.
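In code, the choice of C translates into a simple step-counter check in the training loop. A toy sketch using the hard_update helper defined earlier (the small networks and the omitted gradient step are placeholders, not a real agent):

import torch.nn as nn

# Toy illustration of the update period C
online_net = nn.Linear(4, 2)
target_net = nn.Linear(4, 2)
hard_update(online_net, target_net)  # initial synchronization: θ⁻ ← θ

C = 1_000  # target network update period
for global_step in range(1, 10_001):
    # ... one gradient step on online_net would happen here ...
    if global_step % C == 0:
        hard_update(online_net, target_net)  # periodic hard update: θ⁻ ← θ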
The coefficient τ determines how quickly the target network tracks the online network:
| τ value | Effect | Typical usage |
|---|---|---|
| 0.001 | Very slow tracking; high stability | DDPG (original paper) |
| 0.005 | Moderate tracking speed | SAC, many modern implementations |
| 0.01 | Faster tracking; less stability | Some TD3 implementations |
| 1.0 | Equivalent to hard update (full copy) | DQN-style periodic updates |
In Stable Baselines3, the default τ for DQN is 1.0 (hard update) with an update period of 10,000 steps, while the default τ for DDPG is 0.005 with updates at every step.
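These defaults can also be set explicitly when constructing the models, as in this minimal sketch (assuming Stable Baselines3 and the standard Gymnasium environments are installed):

from stable_baselines3 import DQN, DDPG

# DQN: hard updates — full copy (tau=1.0) every 10,000 steps (library defaults shown explicitly)
dqn_model = DQN("MlpPolicy", "CartPole-v1", tau=1.0, target_update_interval=10_000)

# DDPG: soft updates — Polyak averaging with tau=0.005 at every training step
ddpg_model = DDPG("MlpPolicy", "Pendulum-v1", tau=0.005)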
Target networks and experience replay are complementary techniques that address different sources of instability in deep reinforcement learning.
| Source of instability | Solution | Mechanism |
|---|---|---|
| Correlated training samples | Experience replay | Randomly samples from a buffer to break temporal correlations |
| Non-stationary learning targets | Target network | Uses a separate, slowly changing network for target computation |
| Combined effect | Both together | Training resembles supervised learning with i.i.d. data and fixed targets |
When both techniques are used together, the training process more closely resembles supervised machine learning: the training data is approximately independent and identically distributed (due to replay), and the targets are approximately fixed (due to the target network). This combination was essential to the success of the original DQN algorithm.
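To make the replay side of this combination concrete, a minimal uniform replay buffer might look like the following sketch (an illustration, not a reference implementation):

import random
from collections import deque

class ReplayBuffer:
    # Minimal replay buffer: stores transitions and samples them uniformly at random
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks temporal correlations between consecutive transitions
        return random.sample(self.buffer, batch_size)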
The theoretical understanding of why target networks help has developed over time. Early justifications were largely empirical, but more recent analyses, such as that of Zhang, Yao, and Whiteson (2021) discussed above, have established convergence guarantees for Q-learning with target networks under additional conditions such as ridge regularization.
These theoretical results confirm the practical observation that target networks are effective at stabilizing training, while also highlighting that the standard Polyak averaging may not always suffice; additional modifications (such as projections or regularization) may be necessary for guaranteed convergence.
Target networks are effective but come with trade-offs:

- Memory and compute overhead: a full second copy of the network's parameters must be stored and kept synchronized.
- Slower propagation of value information: because targets are computed from older parameters, new information takes longer to spread through the value estimates (target staleness).
- Additional hyperparameters: the update period C or the coefficient τ must be tuned, and poor choices can either reintroduce instability or slow down learning.
Several approaches have been proposed to address the limitations of standard target networks:
| Approach | Description | Reference |
|---|---|---|
| DeepMellow | Replaces the max operator with the Mellowmax operator, eliminating the need for a target network entirely | Kim et al. (2019) |
| Gradient target tracking | Uses the gradient of the target computation to adaptively adjust the target, reducing staleness | Yang et al. (2025) |
| t-Soft update | Generalizes the soft update rule using a Student-t distribution to allow adaptive blending between hard and soft updates | Xu et al. (2020) |
| Iterated Q-learning | Bridges the performance gap between target-free and target-based methods | Multiple authors (2025) |
| Functional regularization | Views the target network as an implicit regularizer and replaces it with explicit functional regularization | Tang et al. (2021) |
The DeepMellow approach is particularly notable. Kim, Asadi, Littman, and Konidaris (2019) showed that replacing the max operator in Q-learning with the Mellowmax operator (a smooth approximation of the max) can stabilize learning without a target network, achieving competitive or superior performance in Atari games when the temperature parameter is properly tuned.
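As an illustration of the idea (not the authors' implementation), the mellowmax operator mm_ω(x) = log((1/n) Σᵢ exp(ω xᵢ)) / ω can stand in for the hard max over next-state Q-values, with the bootstrapped values taken from the online network itself:

import math
import torch

def mellowmax(q_values, omega=10.0, dim=-1):
    # Numerically stable mellowmax: log-mean-exp of omega-scaled values, divided by omega
    n = q_values.shape[dim]
    return (torch.logsumexp(omega * q_values, dim=dim) - math.log(n)) / omega

def mellowmax_target(online_net, rewards, next_states, dones, gamma=0.99, omega=10.0):
    # No target network: next-state values come from the online network,
    # with mellowmax standing in for the hard max
    with torch.no_grad():
        next_v = mellowmax(online_net(next_states), omega=omega)
    return rewards + gamma * (1.0 - dones) * next_v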
Target networks are primarily used in off-policy deep reinforcement learning algorithms that combine neural network function approximation with bootstrapped value estimates. On-policy methods like A3C and PPO do not typically use target networks because they do not suffer from the same degree of instability.
General guidelines:

- Use a target network whenever a bootstrapped value estimate (such as a TD target) is computed by the same neural network that is being trained, particularly in off-policy settings.
- Prefer hard updates (with C on the order of 1,000 to 10,000 steps) for DQN-style discrete-action algorithms, and soft updates (with τ around 0.001 to 0.01) for actor-critic methods such as DDPG, TD3, and SAC.
- On-policy algorithms such as A3C and PPO generally do not require a target network.
The development of target networks follows the broader trajectory of stabilizing deep reinforcement learning:
| Year | Development | Authors |
|---|---|---|
| 1992 | Polyak averaging for stochastic optimization | Polyak and Juditsky |
| 2013 | DQN with experience replay (workshop paper) | Mnih et al. |
| 2015 | DQN published in Nature, adding the target network; human-level Atari performance | Mnih et al. |
| 2015 | DDPG introduces soft updates for target networks | Lillicrap et al. |
| 2016 | Double DQN uses target network to reduce overestimation | van Hasselt, Guez, Silver |
| 2018 | TD3 adds delayed updates and target policy smoothing | Fujimoto, van Hoof, Meger |
| 2018 | SAC combines target networks with entropy regularization | Haarnoja et al. |
| 2019 | DeepMellow removes the need for target networks | Kim et al. |
| 2021 | Theoretical proof that target networks break the deadly triad | Zhang, Yao, Whiteson |