Target Network
Last reviewed
Sources
14 citations
Review status
Source-backed
Revision
v4 · 4,124 words
A target network is a separate, slowly updated copy of a neural network used in deep reinforcement learning to compute stable learning targets, decoupling the bootstrap target from the rapidly changing network being optimized. It was introduced in the 2015 Nature paper that defined the modern Deep Q-Network (DQN) algorithm: every C updates the online network Q is cloned to produce a target network, and that frozen copy generates the temporal difference (TD) targets for the next C updates. In the original DQN experiments, C was set to 10,000 steps.[1] Mnih et al. wrote that "generating the targets using an older set of parameters adds a delay between the time an update to Q is made and the time the update affects the targets y_j, making divergence or oscillations much more unlikely."[1] Target networks have since become a standard component of off-policy deep reinforcement learning methods, including Deep Deterministic Policy Gradient (DDPG), Twin Delayed DDPG (TD3), and Soft Actor-Critic (SAC).
Imagine you are learning to throw a basketball into a hoop. Every time you throw, someone tells you how far off your throw was so you can adjust. But what if the hoop kept moving every time you threw? It would be really hard to get better because the target keeps changing.
In reinforcement learning, a computer agent has the same problem. It is trying to learn how good different actions are, but the "answer key" it uses to check itself keeps changing as it learns. A target network is like having a second, frozen hoop that stays in one place for a while. The agent practices throwing at that frozen hoop, and only after many practice throws does someone move the frozen hoop to match the real one. This makes learning much easier because the agent has a steady goal to aim for.
In Q-learning, an agent learns to estimate the expected cumulative reward for taking an action in a given state. When a neural network is used to approximate the Q-function (as in DQN), the network's parameters appear on both sides of the update equation. The Q-value prediction for the current state-action pair is compared against a target that itself depends on the Q-values of the next state, computed by the same network.
This creates a feedback loop: each time the network's weights are updated to reduce the error on one sample, the target values for all other samples also shift. The result is a "moving target" problem where the optimization objective changes with every parameter update. The DQN authors described the underlying mechanism precisely: in standard online Q-learning "an update that increases Q(s_t, a_t) often also increases Q(s_{t+1}, a) for all a and hence also increases the target y_j, possibly leading to oscillations or divergence of the policy."[1] In practice, this can cause oscillations, divergence, or slow convergence during training.
The instability problem is closely related to what Sutton and Barto (2018) called the deadly triad in reinforcement learning.[9] The deadly triad arises when three elements are combined: function approximation, bootstrapping, and off-policy training.
When all three are present, learning can become unstable and value estimates may diverge to infinity. The target network was one of the key techniques introduced to mitigate this instability. Zhang, Yao, and Whiteson (2021) provided formal theoretical results showing that a target network, combined with two projections added to the Polyak-averaging update, can break the deadly triad and guarantee convergence for linear off-policy algorithms with bootstrapping under non-restrictive conditions.[7]
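The interaction can be seen in a deliberately tiny example. The sketch below is a hypothetical two-state setup, not an experiment from the cited papers: both states share a single weight `w` (their values are `w` and `2w`), every update uses the transition from the first state to the second with zero reward, and only the first state is ever updated. The function name and all constants are illustrative assumptions.

```python
def run_semi_gradient_td(steps, alpha=0.1, gamma=0.9, sync_every=None):
    """Semi-gradient TD(0) on one transition with a shared weight.

    Features: phi(s1) = 1, phi(s2) = 2, reward = 0, so the true value is 0.
    If sync_every is None, the target bootstraps from the live weight.
    Otherwise it bootstraps from a copy refreshed every sync_every steps.
    """
    w = 1.0
    w_target = w
    for t in range(1, steps + 1):
        bootstrap_w = w if sync_every is None else w_target
        td_error = 0.0 + gamma * (2.0 * bootstrap_w) - 1.0 * w
        w += alpha * td_error * 1.0  # gradient of w * phi(s1) w.r.t. w is 1
        if sync_every is not None and t % sync_every == 0:
            w_target = w  # hard update of the frozen copy
    return w

print(run_semi_gradient_td(100))                 # bootstraps from the live weight
print(run_semi_gradient_td(100, sync_every=50))  # bootstraps from a frozen copy
```

With these settings the live-bootstrapped run multiplies `w` by 1.08 at every update, while the frozen copy limits growth to at most a factor of 1.8 per 50-step period. In this toy case the target copy slows the runaway considerably but does not remove it, which fits the observation that convergence guarantees need additional conditions.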
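A target network is architecturally identical to the online network (also called the main network or policy network) but maintains a separate set of parameters. During training, the online network is updated by gradient descent at every step, while the target network receives no gradient updates: it is used only to compute the learning targets, and its parameters are refreshed from the online network either periodically or gradually.
By decoupling the parameters used for the prediction from those used for the target computation, the learning targets remain stable over many update steps. This weakens the feedback loop and makes the optimization problem more similar to standard supervised learning, where the targets are fixed. The Methods section of the DQN paper uses related language for experience replay, which it credits with "smoothing out learning and avoiding oscillations or divergence in the parameters"; the target network is then presented as a second modification aimed at further improving stability.[1]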
In standard DQN, the loss function for a single transition (s, a, r, s') is:
L(theta) = (r + gamma max_a' Q(s', a'; theta-minus) - Q(s, a; theta))^2
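where theta denotes the parameters of the online network, theta-minus the parameters of the target network, gamma the discount factor, r the reward, and s' the next state.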
The term r + gamma max_a' Q(s', a'; theta-minus) is called the TD target. Because it uses the target network's fixed parameters theta-minus rather than the online network's changing parameters theta, the TD target does not shift with every gradient update.
In practice, the loss is computed over mini-batches sampled from an experience replay buffer (the original DQN used a minibatch size of 32):[1]
L(theta) = (1/N) Sum_i (r_i + gamma max_a' Q(s'_i, a'; theta-minus) - Q(s_i, a_i; theta))^2
Some implementations use the Huber loss instead of mean squared error to make training more robust to outliers in the TD error.
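The mini-batch target and the two loss choices can be sketched in a few lines of NumPy. This is an illustrative example under stated assumptions: the arrays stand in for network outputs, the `dones` mask for terminal transitions and the Huber threshold `delta = 1.0` are choices made here, and the function names are hypothetical.

```python
import numpy as np

def dqn_targets(rewards, next_q_target, dones, gamma=0.99):
    """y_i = r_i + gamma * max_a' Q(s'_i, a'; theta-minus), or r_i if terminal."""
    bootstrap = next_q_target.max(axis=1)
    return rewards + gamma * (1.0 - dones) * bootstrap

def huber(td_error, delta=1.0):
    """Quadratic for |error| <= delta, linear beyond it."""
    abs_err = np.abs(td_error)
    quadratic = np.minimum(abs_err, delta)
    return 0.5 * quadratic**2 + delta * (abs_err - quadratic)

rewards = np.array([1.0, 0.0, -3.0])
dones = np.array([0.0, 0.0, 1.0])            # last transition is terminal
next_q_target = np.array([[0.5, 2.0],        # Q(s', .; theta-minus)
                          [1.0, 0.0],
                          [3.0, 4.0]])
q_taken = np.array([2.0, 1.5, 0.0])          # Q(s_i, a_i; theta)

y = dqn_targets(rewards, next_q_target, dones, gamma=0.9)
td_error = y - q_taken
mse_loss = np.mean(td_error**2)
huber_loss = np.mean(huber(td_error))
print(y, mse_loss, huber_loss)
```

The third transition has a TD error of -3, which dominates the squared loss but contributes only linearly to the Huber loss. In a real implementation, gradients would flow only through `q_taken`; the targets `y` are treated as constants.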
There are two main strategies for updating the target network's parameters to eventually reflect the online network's learned values.
In the original DQN algorithm, the target network's parameters are periodically replaced with the online network's parameters every C steps:
theta-minus <- theta (every C steps)
Between updates, the target network's parameters remain completely frozen. The DQN paper describes the rule in one sentence: "every C updates we clone the network Q to obtain a target network Q-hat and use Q-hat for generating the Q-learning targets y_j for the following C updates to Q."[1] This is the simplest update strategy. The 2013 workshop paper that first paired deep networks with Q-learning used only experience replay and did not include a target network; the target network was added in the 2015 Nature paper, where C was set to 10,000 steps.[1][2]
Introduced in the DDPG algorithm by Lillicrap et al. (2015), soft updates apply a weighted average at every training step:
theta-minus <- tau * theta + (1 - tau) * theta-minus
where tau is a small positive number called the soft update coefficient, typically in the range 0.001 to 0.005. This formula is also known as Polyak averaging (or an exponential moving average of parameters). At each step, the target network moves a tiny fraction toward the online network, resulting in a smooth and continuous update rather than an abrupt replacement. Lillicrap et al. explained the benefit directly: "This means that the target values are constrained to change slowly, greatly improving the stability of learning."[3] The DDPG experiments used tau = 0.001.[3]
The name "Polyak averaging" comes from the work of Boris Polyak and Anatoli Juditsky (1992), who showed that averaging the iterates of stochastic optimization algorithms can improve convergence rates.[10]
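To get a feel for how slowly a soft-updated target moves, consider the idealized case where the online parameter stays fixed. Each soft update shrinks the gap between target and online parameter by a factor of (1 - tau), so after n updates a fraction (1 - tau)^n of the original gap remains. The sketch below checks this on a single scalar parameter; the function names are hypothetical and the fixed online parameter is a simplifying assumption.

```python
import math

def soft_update(theta_target, theta_online, tau):
    """One Polyak step on a scalar parameter."""
    return tau * theta_online + (1.0 - tau) * theta_target

def updates_to_close(tau, fraction=0.5):
    """Smallest n such that 1 - (1 - tau)**n >= fraction."""
    return math.ceil(math.log(1.0 - fraction) / math.log(1.0 - tau))

def simulate(tau, n_updates, theta_online=1.0, theta_target=0.0):
    """Apply n soft updates while the online parameter stays fixed."""
    for _ in range(n_updates):
        theta_target = soft_update(theta_target, theta_online, tau)
    return theta_target

for tau in (0.001, 0.005):
    n = updates_to_close(tau)
    print(tau, n, round(simulate(tau, n), 4))
```

Under this assumption, closing half of the gap takes 139 updates at tau = 0.005 and 693 updates at tau = 0.001, so the lag of the target behind the online network scales roughly with 1/tau.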
| Property | Hard update | Soft update (Polyak averaging) |
|---|---|---|
| Update formula | theta-minus <- theta every C steps | theta-minus <- tau*theta + (1-tau)*theta-minus every step |
| Update frequency | Every C steps (e.g., 1,000 or 10,000) | Every training step |
| Key hyperparameter | C (update period) | tau (interpolation coefficient) |
| Typical hyperparameter values | C = 1,000 to 10,000 | tau = 0.001 to 0.005 |
| Target stability | Very stable between updates; abrupt jumps at update points | Continuously changing but very slowly |
| Smoothness | Discontinuous; target values can shift suddenly | Smooth; target values change gradually |
| Used in | DQN, Double DQN | DDPG, TD3, SAC |
| Sensitivity | Sensitive to choice of C; too large causes staleness, too small reduces stability | Sensitive to choice of tau; too large reduces stability, too small causes staleness |
Target networks appear in many deep reinforcement learning algorithms. The following table summarizes how different algorithms use them.
| Algorithm | Year | Target network usage | Update method | Networks with targets | Key innovation |
|---|---|---|---|---|---|
| DQN | 2015 | Target Q-network for computing TD targets | Hard update every C = 10,000 steps | Q-network | Introduced target networks with experience replay |
| Double DQN | 2016 | Target Q-network for value evaluation; online network for action selection | Hard update every C steps | Q-network | Decoupled action selection from value evaluation to reduce overestimation |
| DDPG | 2015/2016 | Separate target networks for both actor and critic | Soft update (tau = 0.001) | Actor, Critic | Extended target networks to continuous action spaces with soft updates |
| TD3 | 2018 | Target networks for twin critics and actor | Soft update (tau = 0.005), delayed every d = 2 steps | Actor, two Critics | Added delayed updates and target policy smoothing |
| SAC | 2018 | Target Q-networks (two) with Polyak averaging | Soft update (tau = 0.005) | Two Q-networks | Combined target networks with maximum entropy framework |
The DQN algorithm, published by Mnih et al. at Google DeepMind, was the first to combine deep neural networks with Q-learning for high-dimensional state spaces such as Atari game frames. It was evaluated on 49 Atari 2600 games and was, in the authors' words, "able to surpass the performance of all previous algorithms and achieve a level comparable to that of a professional human games tester across a set of 49 games."[1] DQN combined two stabilization techniques that work together: experience replay and the target network.
Experience replay addresses the correlation between consecutive training samples, while the target network addresses the non-stationarity of the learning targets. Together with a discount factor of gamma = 0.99 and a learning rate of 0.00025, they made deep Q-learning practical for complex tasks.[1]
Van Hasselt, Guez, and Silver (2016) identified that the max operator in DQN's target computation leads to systematic overestimation of Q-values. As they put it, the max operator "uses the same values both to select and to evaluate an action," which "makes it more likely to select overestimated values, resulting in overoptimistic value estimates."[4] Standard DQN uses the same (target) network both to select the best action and to evaluate it:
y = r + gamma max_a' Q(s', a'; theta-minus)
Double DQN decouples these two steps. It uses the online network to select the action and the target network to evaluate that action's value:
y = r + gamma Q(s', argmax_a' Q(s', a'; theta); theta-minus)
This small change significantly reduces overestimation bias and leads to improved performance across many Atari games, raising the median normalized score across 49 games to 114.7% of a human games tester versus 93.5% for DQN.[4]
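The difference between the two targets is easiest to see side by side. The following NumPy sketch uses hand-written arrays in place of network outputs and assumes non-terminal transitions; the function names and numbers are illustrative, not taken from the paper.

```python
import numpy as np

def dqn_target(rewards, q_next_target, gamma):
    """Select and evaluate the next action with the target network."""
    return rewards + gamma * q_next_target.max(axis=1)

def double_dqn_target(rewards, q_next_online, q_next_target, gamma):
    """Select with the online network (theta), evaluate with the target (theta-minus)."""
    best_actions = q_next_online.argmax(axis=1)
    evaluated = q_next_target[np.arange(len(rewards)), best_actions]
    return rewards + gamma * evaluated

rewards = np.array([0.0, 1.0])
q_next_online = np.array([[1.0, 2.0, 0.5],    # Q(s', .; theta)
                          [0.2, 0.1, 0.9]])
q_next_target = np.array([[3.0, 1.5, 0.2],    # Q(s', .; theta-minus)
                          [0.4, 0.3, 1.0]])

y_dqn = dqn_target(rewards, q_next_target, gamma=0.9)
y_double = double_dqn_target(rewards, q_next_online, q_next_target, gamma=0.9)
print(y_dqn, y_double)
```

Because the evaluated value can never exceed the maximum over actions of the same target network, the Double DQN target is always less than or equal to the DQN target for the same inputs. In the first row the two networks disagree about the best action, so the targets differ; in the second row they agree and the targets coincide.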
Lillicrap et al. (2015) adapted DQN concepts to continuous action spaces using an actor-critic architecture; it "robustly solves more than 20 simulated physics tasks" using the same learning algorithm, network architecture, and hyperparameters.[3] DDPG maintains four networks: an actor, a critic, and a target copy of each.
DDPG was the first algorithm to use soft (Polyak averaging) updates for target networks, with the target parameters constrained to change slowly at each step. The DDPG paper used tau = 0.001.[3]
Fujimoto, van Hoof, and Meger (2018) showed that "function approximation errors are known to lead to overestimated value estimates and suboptimal policies" and that "this problem persists in an actor-critic setting," then introduced three improvements on top of DDPG:[5]
Clipped double Q-learning: Two critic networks are maintained, and the smaller of the two Q-value estimates is used in the target. This prevents the policy from exploiting overestimated Q-values.
Delayed policy updates: The actor (and target networks) are updated less frequently than the critics, once every d = 2 critic updates.[5]
Target policy smoothing: Noise is added to the actions selected by the target actor to smooth the Q-value estimates:
a'(s') = clip(mu_theta_targ(s') + clip(epsilon, -c, c), a_low, a_high), where epsilon ~ N(0, sigma)
All three target networks in TD3 (the target actor and the two target critics) are updated with tau = 0.005.[5] These techniques work together to produce more stable and accurate learning.
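Target policy smoothing and clipped double Q-learning both act on the target computation, so they can be sketched together. The example below is a minimal NumPy illustration: the toy functions stand in for the target actor and target critics, and the defaults `sigma = 0.2`, `c = 0.5`, and the action bounds are assumptions chosen for illustration rather than values stated in this article.

```python
import numpy as np

def td3_target(r, done, next_state, target_actor, target_q1, target_q2, rng,
               gamma=0.99, sigma=0.2, c=0.5, a_low=-1.0, a_high=1.0):
    """y = r + gamma * min(Q1_targ, Q2_targ)(s', a'), with a' from the smoothed target actor."""
    mu = target_actor(next_state)
    # Target policy smoothing: clipped Gaussian noise on the target action.
    noise = np.clip(rng.normal(0.0, sigma, size=mu.shape), -c, c)
    next_action = np.clip(mu + noise, a_low, a_high)
    # Clipped double Q-learning: take the smaller of the two target critics.
    q_min = np.minimum(target_q1(next_state, next_action),
                       target_q2(next_state, next_action))
    return r + gamma * (1.0 - done) * q_min

# Hypothetical stand-ins for the three target networks.
def toy_actor(s):
    return np.tanh(s.sum(axis=1, keepdims=True))

def toy_q1(s, a):
    return s.sum(axis=1) + a[:, 0]

def toy_q2(s, a):
    return toy_q1(s, a) - 0.1

next_state = np.array([[0.5, -0.5], [1.0, 1.0]])
r = np.array([1.0, 2.0])
done = np.array([0.0, 1.0])   # second transition is terminal
y = td3_target(r, done, next_state, toy_actor, toy_q1, toy_q2,
               np.random.default_rng(0))
print(y)
```

The terminal transition keeps its plain reward as the target, while the non-terminal one bootstraps from the smaller critic evaluated at a slightly perturbed action.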
Haarnoja et al. (2018) described Soft Actor-Critic as "an off-policy actor-critic deep RL algorithm based on the maximum entropy reinforcement learning framework" in which "the actor aims to maximize expected reward while also maximizing entropy."[6] SAC uses target Q-networks updated via Polyak averaging (with tau = 0.005) and takes the minimum of two Q-estimates (similar to TD3's clipped double Q-learning).[6] The target computation in SAC incorporates an entropy term:
y = r + gamma (min(Q_targ,1(s', a'), Q_targ,2(s', a')) - alpha log pi(a'|s'))
where alpha is a temperature parameter controlling the trade-off between reward maximization and entropy (exploration).
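The SAC target can be written as a small function of the two target Q-values and the log-probability of the sampled next action. This NumPy sketch uses made-up numbers; the `done` mask and the value `alpha = 0.2` are assumptions for illustration, and the function name is hypothetical.

```python
import numpy as np

def sac_target(r, done, q1_targ, q2_targ, log_prob, gamma=0.99, alpha=0.2):
    """y = r + gamma * (min(Q_targ,1, Q_targ,2) - alpha * log pi(a'|s'))."""
    soft_value = np.minimum(q1_targ, q2_targ) - alpha * log_prob
    return r + gamma * (1.0 - done) * soft_value

r = np.array([1.0, 0.5])
done = np.array([0.0, 1.0])          # second transition is terminal
q1_targ = np.array([2.0, 3.0])       # Q_targ,1(s', a') with a' sampled from pi
q2_targ = np.array([1.5, 3.5])       # Q_targ,2(s', a')
log_prob = np.array([-1.0, -0.5])    # log pi(a'|s')

y = sac_target(r, done, q1_targ, q2_targ, log_prob)
print(y)
```

Since log pi(a'|s') is negative for unlikely actions, the entropy term raises the target for states where the policy is still spread out, which is how the target rewards continued exploration.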
The following pseudocode shows how a target network is used in the DQN algorithm:
    Initialize online Q-network with random parameters theta
    Initialize target Q-network with parameters theta-minus = theta
    Initialize replay buffer D
    For each episode:
        Observe initial state s
        For each step:
            Select action a using epsilon-greedy policy based on Q(s, .; theta)
            Execute action a, observe reward r and next state s'
            Store transition (s, a, r, s') in replay buffer D
            Sample random mini-batch of transitions from D
            For each transition (s_i, a_i, r_i, s'_i):
                If s'_i is terminal:
                    y_i = r_i
                Else:
                    y_i = r_i + gamma max_a' Q(s'_i, a'; theta-minus)  // Use target network
            Update theta by minimizing loss: L = (1/N) Sum (y_i - Q(s_i, a_i; theta))^2
            Every C steps: theta-minus <- theta  // Update target network
            s <- s'
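The same loop can be run end to end on a toy problem. The sketch below is an assumption-heavy stand-in, not the DQN implementation: a lookup table replaces the neural network, a hypothetical five-state chain replaces Atari, per-sample table updates replace the mini-batch gradient step, and all constants are chosen for illustration. It keeps the three structural pieces of the pseudocode: a replay buffer, targets computed from a frozen copy, and a hard update every C steps.

```python
import random
from collections import deque

N_STATES, GAMMA = 5, 0.9                 # states 0..4; state 4 is terminal
LR, EPSILON, BATCH, C = 0.5, 0.5, 32, 50

def env_step(state, action):
    """Deterministic chain: action 1 moves right, action 0 moves left."""
    nxt = state + 1 if action == 1 else max(state - 1, 0)
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done

def train(total_steps=5000, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]   # online table (role of theta)
    q_target = [row[:] for row in q]            # frozen copy (role of theta-minus)
    replay = deque(maxlen=1000)
    state = 0
    for t in range(1, total_steps + 1):
        # Epsilon-greedy action selection uses the online values.
        if rng.random() < EPSILON:
            action = rng.randrange(2)
        else:
            action = max((0, 1), key=lambda a: q[state][a])
        nxt, reward, done = env_step(state, action)
        replay.append((state, action, reward, nxt, done))
        state = 0 if done else nxt

        # Replayed updates: targets come from the frozen copy only.
        for _ in range(BATCH):
            s, a, r, s2, d = rng.choice(replay)
            y = r if d else r + GAMMA * max(q_target[s2])
            q[s][a] += LR * (y - q[s][a])

        # Hard update every C steps: theta-minus <- theta.
        if t % C == 0:
            q_target = [row[:] for row in q]
    return q

q = train()
greedy = [max((0, 1), key=lambda a: q[s][a]) for s in range(N_STATES - 1)]
print(greedy)
```

On this chain the optimal policy always moves right, and the value of moving right from state s is GAMMA raised to the number of remaining non-final steps, so the learned table should approach 1.0 for state 3 and 0.9^3 for state 0. Each period between hard updates behaves like one round of value iteration against a fixed set of targets.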
In frameworks like PyTorch, a soft update is typically implemented by iterating over the parameters of both networks:
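    import torch

    def soft_update(online_net, target_net, tau):
        # Polyak averaging: theta-minus <- tau * theta + (1 - tau) * theta-minus
        with torch.no_grad():  # the target network is never trained by gradients
            for target_param, online_param in zip(
                target_net.parameters(), online_net.parameters()
            ):
                target_param.copy_(tau * online_param + (1.0 - tau) * target_param)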
A hard update is simply the special case where tau = 1:
    def hard_update(online_net, target_net):
        # Copy every parameter and buffer from the online network
        target_net.load_state_dict(online_net.state_dict())
The performance of target networks depends on the careful selection of their associated hyperparameters.
The update frequency C controls how often the target network is synchronized with the online network. The following guidelines are commonly observed:
| Update frequency (C) | Effect | When to use |
|---|---|---|
| Very small (e.g., 100) | Target changes frequently; less stability | Simple environments with fast learning |
| Moderate (e.g., 1,000-5,000) | Balanced stability and freshness | Most standard tasks |
| Large (e.g., 10,000+) | Very stable but potentially stale targets | Complex environments; original DQN Atari setting |
If C is too small, the target network changes too quickly and the stabilization benefit is lost. If C is too large, the target network becomes stale, meaning its predictions diverge from the online network's current understanding, which can slow down learning.
The coefficient tau determines how quickly the target network tracks the online network:
| tau value | Effect | Typical usage |
|---|---|---|
| 0.001 | Very slow tracking; high stability | DDPG (original paper) |
| 0.005 | Moderate tracking speed | TD3, SAC, many modern implementations |
| 0.01 | Faster tracking; less stability | Some implementations |
| 1.0 | Equivalent to hard update (full copy) | DQN-style periodic updates |
In Stable Baselines3, the default tau for DQN is 1.0 (hard update) with an update period of 10,000 steps, while the default tau for DDPG, TD3, and SAC is 0.005, applied at every gradient step (for TD3, at every delayed policy update, which by default is every second gradient step).
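These defaults can be written down as keyword arguments. The snippet below is a configuration sketch only: it does not import Stable Baselines3, the dictionary and helper function are hypothetical, and the parameter names (`tau`, `target_update_interval`, `policy_delay`) are assumed to match the constructors of the installed version, which should be checked against its documentation.

```python
# Keyword arguments mirroring the defaults described in the text
# (assumed parameter names; verify against your installed version).
TARGET_KWARGS = {
    "DQN": {"tau": 1.0, "target_update_interval": 10_000},
    "DDPG": {"tau": 0.005},
    "TD3": {"tau": 0.005, "policy_delay": 2},
    "SAC": {"tau": 0.005, "target_update_interval": 1},
}

def describe(algo):
    """Summarize whether an algorithm's settings amount to a hard or soft update."""
    kwargs = TARGET_KWARGS[algo]
    kind = "hard" if kwargs["tau"] == 1.0 else "soft"
    return f"{algo}: {kind} update, tau={kwargs['tau']}"

for name in TARGET_KWARGS:
    print(describe(name))

# Hypothetical usage, requires stable-baselines3 and an environment:
# from stable_baselines3 import DQN
# model = DQN("MlpPolicy", "CartPole-v1", **TARGET_KWARGS["DQN"])
```

Expressing the hard update as tau = 1.0 plus an update interval lets one code path serve both strategies, which is the same equivalence noted for the `soft_update` and `hard_update` functions.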
Target networks and experience replay are complementary techniques that address different sources of instability in deep reinforcement learning.
| Source of instability | Solution | Mechanism |
|---|---|---|
| Correlated training samples | Experience replay | Randomly samples from a buffer to break temporal correlations |
| Non-stationary learning targets | Target network | Uses a separate, slowly changing network for target computation |
| Combined effect | Both together | Training resembles supervised learning with i.i.d. data and fixed targets |
When both techniques are used together, the training process more closely resembles supervised machine learning: the training data is approximately independent and identically distributed (due to replay), and the targets are approximately fixed (due to the target network). This combination was essential to the success of the original DQN algorithm.
The theoretical understanding of why target networks help has developed over time:
These theoretical results confirm the practical observation that target networks are effective at stabilizing training, while also highlighting that the standard Polyak averaging may not always suffice; additional modifications (such as projections or regularization) may be necessary for guaranteed convergence.
Target networks are effective but come with trade-offs:
Several approaches have been proposed to address the limitations of standard target networks:
| Approach | Description | Reference |
|---|---|---|
| DeepMellow | Replaces the max operator with the Mellowmax operator, eliminating the need for a target network entirely | Kim et al. (2019)[8] |
| t-Soft update | Generalizes the soft update rule using a Student-t distribution to allow adaptive blending between hard and soft updates | Kobayashi and Ilboudo (2021)[13] |
| Functional regularization | Views the target network as an implicit regularizer and replaces it with explicit functional regularization | Piche et al. (2021)[14] |
The DeepMellow approach is particularly notable. Kim, Asadi, Littman, and Konidaris (2019) showed that replacing the max operator in Q-learning with the Mellowmax operator (a smooth approximation of the max) can stabilize learning without a target network, achieving competitive or superior performance in Atari games when the temperature parameter is properly tuned.[8]
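For reference, the Mellowmax operator over n values x_1, ..., x_n is mm_omega(x) = log((1/n) Sum_i exp(omega * x_i)) / omega. The sketch below computes it with the usual log-sum-exp shift for numerical stability; the symbol omega for the temperature-like parameter and the function name are notational assumptions, and the code assumes omega > 0.

```python
import math

def mellowmax(values, omega):
    """log(mean(exp(omega * x_i))) / omega, computed with a stability shift."""
    shift = max(omega * v for v in values)
    mean_exp = sum(math.exp(omega * v - shift) for v in values) / len(values)
    return (shift + math.log(mean_exp)) / omega

q_values = [1.0, 2.0, 3.0]
for omega in (0.001, 1.0, 100.0):
    print(omega, mellowmax(q_values, omega))
```

For large omega the result approaches the maximum of the values, and for omega close to zero it approaches their mean, so the parameter interpolates between averaging and maximizing.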
Target networks are primarily used in off-policy deep reinforcement learning algorithms that combine neural network function approximation with bootstrapped value estimates. On-policy methods like A3C and PPO do not typically use target networks: they train on freshly collected on-policy data rather than replayed off-policy data, which removes one element of the deadly triad and makes them less prone to this kind of instability.
General guidelines:
The development of target networks follows the broader trajectory of stabilizing deep reinforcement learning:
| Year | Development | Authors |
|---|---|---|
| 1992 | Polyak averaging for stochastic optimization | Polyak and Juditsky |
| 2013 | Deep Q-learning with experience replay only (no target network), workshop paper | Mnih et al. |
| 2015 | DQN published in Nature; target network introduced (C = 10,000); human-level Atari performance | Mnih et al. |
| 2015 | DDPG introduces soft updates for target networks (tau = 0.001) | Lillicrap et al. |
| 2016 | Double DQN uses target network to reduce overestimation | van Hasselt, Guez, Silver |
| 2018 | TD3 adds delayed updates and target policy smoothing | Fujimoto, van Hoof, Meger |
| 2018 | SAC combines target networks with entropy regularization | Haarnoja et al. |
| 2019 | DeepMellow removes the need for target networks | Kim et al. |
| 2021 | Convergence results showing that target networks, combined with projections, can break the deadly triad for linear off-policy algorithms | Zhang, Yao, Whiteson |