Target Network

A target network is a separate copy of a neural network used in deep reinforcement learning algorithms to stabilize the training process. Instead of computing learning targets from the same network whose parameters are being updated, a target network provides a fixed or slowly changing reference point for computing target values in the Bellman equation. Target networks were introduced as part of the Deep Q-Network (DQN) algorithm in the 2015 Nature paper by Mnih et al.; the earlier 2013 workshop version of DQN used experience replay but no separate target network. They have since become a standard component in many off-policy deep reinforcement learning methods, including Deep Deterministic Policy Gradient (DDPG), Twin Delayed DDPG (TD3), and Soft Actor-Critic (SAC).

Explain like I'm 5 (ELI5)

Imagine you are learning to throw a basketball into a hoop. Every time you throw, someone tells you how far off your throw was so you can adjust. But what if the hoop kept moving every time you threw? It would be really hard to get better because the target keeps changing.

In reinforcement learning, a computer agent has the same problem. It is trying to learn how good different actions are, but the "answer key" it uses to check itself keeps changing as it learns. A target network is like having a second, frozen hoop that stays in one place for a while. The agent practices throwing at that frozen hoop, and only after many practice throws does someone move the frozen hoop to match the real one. This makes learning much easier because the agent has a steady goal to aim for.

Background and motivation

The moving target problem

In Q-learning, an agent learns to estimate the expected cumulative reward for taking an action in a given state. When a neural network is used to approximate the Q-function (as in DQN), the network's parameters appear on both sides of the update equation. The Q-value prediction for the current state-action pair is compared against a target that itself depends on the Q-values of the next state, computed by the same network.

This creates a feedback loop: each time the network's weights are updated to reduce the error on one sample, the target values for all other samples also shift. The result is a "moving target" problem where the optimization objective changes with every parameter update. In practice, this can cause oscillations, divergence, or slow convergence during training.
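The feedback loop can be seen in a minimal sketch. It assumes a hypothetical one-weight linear value function, q(s) = w · φ(s), with made-up features and rewards; none of these numbers come from a real task.

```python
# Toy illustration of the moving-target effect with a shared weight.
# Assumption: a one-parameter linear value function q(s) = w * phi[s].
GAMMA = 0.9
LEARNING_RATE = 0.1

phi = {"A": 1.0, "B": 2.0, "C": 1.5}   # hypothetical state features
w = 0.5                                  # single shared parameter


def q(state, weight):
    return weight * phi[state]


def td_target(reward, next_state, weight):
    # The target is computed with the same weight that is being trained.
    return reward + GAMMA * q(next_state, weight)


# Target for an unrelated transition (B -> C, reward 0) before any update.
target_before = td_target(0.0, "C", w)

# One semi-gradient step on a different transition (A -> B, reward 1).
error = td_target(1.0, "B", w) - q("A", w)
w += LEARNING_RATE * error * phi["A"]

# The target for B -> C has moved although that transition was never used.
target_after = td_target(0.0, "C", w)
print(target_before, target_after)
```

Because every state shares the weight w, a single update on one transition changes the target of every other transition.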

The deadly triad

The instability problem is closely related to what Sutton and Barto (2018) called the deadly triad in reinforcement learning. The deadly triad occurs when three elements are combined simultaneously:

Function approximation (such as neural networks) to represent value functions
Bootstrapping (using estimated values to update other estimated values, as in temporal difference learning)
Off-policy learning (learning about a policy different from the one generating the data)

When all three are present, learning can become unstable and value estimates may diverge to infinity. The target network was one of the key techniques introduced to mitigate this instability. Zhang, Yao, and Whiteson (2021) provided formal theoretical results showing that target networks, combined with ridge regularization, can break the deadly triad and guarantee convergence for linear Q-learning algorithms under non-restrictive conditions.

How target networks work

Basic mechanism

A target network is architecturally identical to the online network (also called the main network or policy network) but maintains a separate set of parameters. During training:

The online network's parameters (denoted θ) are updated at every training step using gradient descent on the loss function.
The target network's parameters (denoted θ⁻) remain fixed for a period of time or are updated very slowly.
When computing the temporal difference (TD) target for the loss function, the target network's parameters are used instead of the online network's parameters.

By decoupling the parameters used for the prediction from those used for the target computation, the learning targets remain stable over many update steps. This reduces the feedback loop and makes the optimization problem more similar to standard supervised learning, where the targets are fixed.

Loss function with target network

In standard DQN, the loss function for a single transition (s, a, r, s') is:

L(θ) = (r + γ max_a' Q(s', a'; θ⁻) - Q(s, a; θ))²

where:

Q(s, a; θ) is the Q-value predicted by the online network for state s and action a
Q(s', a'; θ⁻) is the Q-value predicted by the target network for the next state s'
γ is the discount factor
r is the observed reward
θ represents the online network's parameters
θ⁻ represents the target network's parameters

The term r + γ max_a' Q(s', a'; θ⁻) is called the TD target. Because it uses the target network's fixed parameters θ⁻ rather than the online network's changing parameters θ, the TD target does not shift with every gradient update.
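As a worked example of the TD target and the single-transition loss, the sketch below uses hypothetical Q-values and a discount factor of 0.99. The function names are illustrative, and the terminal-state case follows the usual convention of dropping the bootstrap term.

```python
GAMMA = 0.99


def td_target(reward, next_q_values_target, done):
    """Return r + gamma * max_a' Q(s', a'; theta_minus), or r if terminal."""
    if done:
        return reward
    return reward + GAMMA * max(next_q_values_target)


def squared_td_loss(q_online_sa, target):
    # The target is treated as a constant; only q_online_sa depends on theta.
    return (target - q_online_sa) ** 2


# Hypothetical numbers: the target network's Q-values for the next state,
# and the online network's Q-value for the action actually taken.
next_q_target_net = [0.5, 2.0, 1.0]
q_online = 2.5

y = td_target(reward=1.0, next_q_values_target=next_q_target_net, done=False)
loss = squared_td_loss(q_online, y)
print(y, loss)
```

Here the target is 1 + 0.99 × 2.0 = 2.98, and the squared error against the online estimate of 2.5 is about 0.2304.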

In practice, the loss is computed over mini-batches sampled from an experience replay buffer:

L(θ) = (1/N) Σᵢ (rᵢ + γ max_a' Q(s'ᵢ, a'; θ⁻) - Q(sᵢ, aᵢ; θ))²

Some implementations use the Huber loss instead of mean squared error to make training more robust to outliers in the TD error.
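The difference between the two losses can be illustrated with a small sketch. The threshold delta = 1.0 and the sample values are assumptions chosen so that one TD error is an outlier.

```python
def huber(td_error, delta=1.0):
    """Quadratic for small errors, linear for large ones."""
    abs_error = abs(td_error)
    if abs_error <= delta:
        return 0.5 * td_error ** 2
    return delta * (abs_error - 0.5 * delta)


def batch_loss(targets, predictions, per_sample_loss):
    errors = [y - q for y, q in zip(targets, predictions)]
    return sum(per_sample_loss(e) for e in errors) / len(errors)


targets = [1.0, 2.0, 10.0]        # hypothetical TD targets; the last is an outlier
predictions = [0.8, 2.5, 1.0]     # hypothetical online-network Q-values

mse = batch_loss(targets, predictions, lambda e: e ** 2)
hub = batch_loss(targets, predictions, huber)
print(mse, hub)
```

The outlier dominates the mean squared error, while under the Huber loss it contributes only linearly.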

Update strategies

There are two main strategies for updating the target network's parameters to eventually reflect the online network's learned values.

Hard updates (periodic replacement)

In the original DQN algorithm, the target network's parameters are periodically replaced with the online network's parameters every C steps:

θ⁻ ← θ (every C steps)

Between updates, the target network's parameters remain completely frozen. This is the simplest update strategy and is the one used in the 2015 Nature paper by Mnih et al.; the 2013 version of DQN did not keep a separate target network.
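A minimal sketch of the periodic copy is shown below. Plain dictionaries stand in for network parameters, the "gradient step" is a placeholder increment, and the period of 4 is chosen only to keep the example short.

```python
import copy

TARGET_UPDATE_PERIOD = 4   # "C" in the text; tiny value chosen for illustration

online_params = {"w": [0.0, 0.0]}             # stand-in for theta
target_params = copy.deepcopy(online_params)  # theta_minus starts equal to theta

sync_steps = []
for step in range(1, 11):
    # Stand-in for a gradient step: the online parameters change every step.
    online_params["w"] = [value + 0.1 for value in online_params["w"]]

    # Every C steps, overwrite the target parameters with a full copy.
    if step % TARGET_UPDATE_PERIOD == 0:
        target_params = copy.deepcopy(online_params)
        sync_steps.append(step)

print(sync_steps)             # steps at which the copy happened
print(target_params["w"])     # frozen since the last sync
```

A deep copy is used so that later changes to the online parameters cannot leak into the target through shared references.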

Soft updates (Polyak averaging)

Introduced in the DDPG algorithm by Lillicrap et al. (2015), soft updates apply a weighted average at every training step:

θ⁻ ← τθ + (1 - τ)θ⁻

where τ (tau) is a small positive number called the soft update coefficient, typically in the range 0.001 to 0.01. This formula is also known as Polyak averaging (or exponential moving average of parameters). At each step, the target network moves a tiny fraction toward the online network, resulting in a smooth and continuous update rather than an abrupt replacement.
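The update rule can be sketched on plain Python lists. In this illustration the online parameters are held fixed, which is an assumption made only so that the tracking behaviour is easy to check.

```python
def soft_update(online, target, tau):
    """Return tau * online + (1 - tau) * target, element by element."""
    return [tau * o + (1.0 - tau) * t for o, t in zip(online, target)]


TAU = 0.01
online = [1.0, -2.0]     # held fixed here so the tracking behaviour is visible
target = [0.0, 0.0]

initial_gap = abs(online[0] - target[0])
for _ in range(100):
    target = soft_update(online, target, TAU)

# With a fixed online network the gap shrinks by (1 - tau) per step,
# so after n steps it equals initial_gap * (1 - tau) ** n.
gap = abs(online[0] - target[0])
print(gap, initial_gap * (1.0 - TAU) ** 100)
```

In real training the online parameters keep moving, so the target follows them with a lag rather than converging to a fixed point.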

The name "Polyak averaging" comes from the work of Boris Polyak and Anatoli Juditsky (1992), who showed that averaging the iterates of stochastic optimization algorithms can improve convergence rates.

Comparison of update strategies

Property	Hard update	Soft update (Polyak averaging)
Update formula	θ⁻ ← θ every C steps	θ⁻ ← τθ + (1 - τ)θ⁻ every step
Update frequency	Every C steps (e.g., 1,000 or 10,000)	Every training step
Key hyperparameter	C (update period)	τ (interpolation coefficient)
Typical hyperparameter values	C = 1,000 to 10,000	τ = 0.001 to 0.01
Target stability	Very stable between updates; abrupt jumps at update points	Continuously changing but very slowly
Smoothness	Discontinuous; target values can shift suddenly	Smooth; target values change gradually
Used in	DQN, Double DQN	DDPG, TD3, SAC
Sensitivity	Sensitive to choice of C; too large causes staleness, too small reduces stability	Sensitive to choice of τ; too large reduces stability, too small causes staleness

Role in specific algorithms

Target networks appear in many deep reinforcement learning algorithms. The following table summarizes how different algorithms use them.

Algorithm	Year	Target network usage	Update method	Networks with targets	Key innovation
DQN	2015 (Nature version)	Target Q-network for computing TD targets	Hard update every C steps	Q-network	Introduced target networks alongside experience replay
Double DQN	2016	Target Q-network for value evaluation; online network for action selection	Hard update every C steps	Q-network	Decoupled action selection from value evaluation to reduce overestimation
DDPG	2015	Separate target networks for both actor and critic	Soft update (τ = 0.001)	Actor, Critic	Extended target networks to continuous action spaces with soft updates
TD3	2018	Target networks for twin critics and actor	Soft update with delayed policy updates	Actor, two Critics	Added delayed updates and target policy smoothing
SAC	2018	Target Q-networks (two) with Polyak averaging	Soft update (τ = 0.005)	Two Q-networks	Combined target networks with maximum entropy framework

DQN (Deep Q-Network)

The DQN algorithm, published by Mnih et al. at DeepMind, was the first to combine deep neural networks with Q-learning for high-dimensional state spaces such as Atari game frames. In its 2015 Nature form, DQN relies on two stabilization techniques that work together:

Experience replay: Transitions are stored in a replay buffer and sampled randomly for training, breaking the temporal correlations between consecutive experiences.
Target network: A separate network with frozen parameters provides stable TD targets.

Experience replay addresses the correlation between consecutive training samples, while the target network addresses the non-stationarity of the learning targets. Together, they make deep Q-learning practical for complex tasks. In the original DQN experiments on Atari games, the target network was updated every 10,000 steps.

Double DQN

Van Hasselt, Guez, and Silver (2016) identified that the max operator in DQN's target computation leads to systematic overestimation of Q-values. Standard DQN uses the same (target) network both to select the best action and to evaluate it:

y = r + γ max_a' Q(s', a'; θ⁻)

Double DQN decouples these two steps. It uses the online network to select the action and the target network to evaluate that action's value:

y = r + γ Q(s', argmax_a' Q(s', a'; θ); θ⁻)

This small change significantly reduces overestimation bias and leads to improved performance across many Atari games.
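The two target computations can be compared side by side. The Q-values below are hypothetical and are chosen so that the online and target networks disagree about the best next action.

```python
GAMMA = 0.99


def dqn_target(reward, q_next_target):
    # The target network both selects and evaluates the action.
    return reward + GAMMA * max(q_next_target)


def double_dqn_target(reward, q_next_online, q_next_target):
    # The online network selects the action...
    best_action = max(range(len(q_next_online)), key=lambda a: q_next_online[a])
    # ...and the target network evaluates it.
    return reward + GAMMA * q_next_target[best_action]


# Hypothetical Q-values for the next state s' under each network.
q_next_online = [1.0, 3.0, 2.0]
q_next_target = [4.0, 2.5, 1.0]

y_dqn = dqn_target(0.0, q_next_target)
y_double = double_dqn_target(0.0, q_next_online, q_next_target)
print(y_dqn, y_double)
```

Since the Double DQN target evaluates a single action under the target network, it can never exceed the DQN target, which takes the maximum over all actions.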

DDPG (Deep Deterministic Policy Gradient)

Lillicrap et al. (2015) adapted DQN concepts to continuous action spaces using an actor-critic architecture. DDPG maintains four networks:

Online actor (policy network): Selects actions
Online critic (Q-network): Evaluates state-action pairs
Target actor: Provides stable action predictions for target computation
Target critic: Provides stable Q-value targets

DDPG was the first algorithm to use soft (Polyak averaging) updates for target networks, with the target parameters constrained to change slowly at each step. The DDPG paper used τ = 0.001.
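A sketch of how the two target networks combine into a critic target follows. The linear actor and critic are hypothetical stand-ins for neural networks, and the parameter names are illustrative only.

```python
GAMMA = 0.99


def actor(state, params):
    # Hypothetical deterministic policy: a = k * s.
    return params["k"] * state


def critic(state, action, params):
    # Hypothetical Q-function: Q(s, a) = c_s * s + c_a * a.
    return params["c_s"] * state + params["c_a"] * action


def ddpg_target(reward, next_state, done, target_actor, target_critic):
    """y = r + gamma * Q_targ(s', mu_targ(s')), using only target parameters."""
    if done:
        return reward
    next_action = actor(next_state, target_actor)
    return reward + GAMMA * critic(next_state, next_action, target_critic)


target_actor_params = {"k": 0.5}
target_critic_params = {"c_s": 1.0, "c_a": 2.0}

y = ddpg_target(1.0, next_state=2.0, done=False,
                target_actor=target_actor_params,
                target_critic=target_critic_params)
print(y)
```

Neither online network appears in the target: the next action comes from the target actor and its value from the target critic.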

TD3 (Twin Delayed DDPG)

Fujimoto, van Hoof, and Meger (2018) identified overestimation bias as a problem in actor-critic methods and introduced three improvements on top of DDPG:

Clipped double Q-learning: Two critic networks are maintained, and the smaller of the two Q-value estimates is used in the target. This prevents the policy from exploiting overestimated Q-values.
Delayed policy updates: The actor (and target networks) are updated less frequently than the critics, typically once for every two critic updates.
Target policy smoothing: Noise is added to the actions selected by the target actor to smooth the Q-value estimates:

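a'(s') = clip(μ_θ_targ(s') + clip(ε, -c, c), a_low, a_high), where ε ~ N(0, σ), c bounds the magnitude of the noise, and a_low and a_high are the limits of the valid action range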

These three techniques work together to produce more stable and accurate learning.
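Target policy smoothing and clipped double Q-learning can be sketched for a scalar action as follows. The noise scale, clip bound, and action limits are illustrative assumptions, and the two critic values are assumed to have already been evaluated at the smoothed action.

```python
import random

GAMMA = 0.99


def clip(value, low, high):
    return max(low, min(high, value))


def smoothed_target_action(target_action, sigma, noise_clip, a_low, a_high, rng):
    # Clipped Gaussian noise is added to the target actor's action,
    # then the result is clipped to the valid action range.
    noise = clip(rng.gauss(0.0, sigma), -noise_clip, noise_clip)
    return clip(target_action + noise, a_low, a_high)


def td3_target(reward, done, q1_target_value, q2_target_value):
    # Clipped double Q-learning: use the smaller of the two target critics.
    if done:
        return reward
    return reward + GAMMA * min(q1_target_value, q2_target_value)


rng = random.Random(0)
# Hypothetical settings: sigma, noise_clip and the action bounds are illustrative.
action = smoothed_target_action(0.9, sigma=0.2, noise_clip=0.5,
                                a_low=-1.0, a_high=1.0, rng=rng)
y = td3_target(reward=1.0, done=False, q1_target_value=5.0, q2_target_value=4.0)
print(action, y)
```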

SAC (Soft Actor-Critic)

Haarnoja et al. (2018) combined target networks with an entropy-regularized objective. SAC uses two target Q-networks updated via Polyak averaging and takes the minimum of their predictions (similar to TD3's clipped double Q-learning). The target computation in SAC incorporates an entropy term:

y = r + γ (min(Q_targ,1(s', a'), Q_targ,2(s', a')) - α log π(a'|s'))

where a' is sampled from the current policy π(·|s') rather than taken from the replay buffer, and α is a temperature parameter controlling the trade-off between reward maximization and entropy (exploration).
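The target above can be computed for a single transition as in the sketch below. All numeric values are hypothetical, and the next action and its log-probability are assumed to come from a fresh sample of the current policy.

```python
GAMMA = 0.99


def sac_target(reward, done, q1_target_value, q2_target_value, log_prob, alpha):
    """y = r + gamma * (min(Q1, Q2) - alpha * log pi(a'|s'))."""
    if done:
        return reward
    soft_value = min(q1_target_value, q2_target_value) - alpha * log_prob
    return reward + GAMMA * soft_value


# Hypothetical values for one transition; a' is assumed to have been sampled
# from the current policy, and log_prob is log pi(a'|s') for that sample.
y = sac_target(reward=1.0, done=False,
               q1_target_value=3.0, q2_target_value=3.5,
               log_prob=-1.2, alpha=0.2)
print(y)
```

A negative log-probability (an unlikely action) raises the target, which is how the entropy bonus rewards exploratory behaviour.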

Implementation

Pseudocode for DQN with target network

The following pseudocode shows how a target network is used in the DQN algorithm:

Initialize online Q-network with random parameters θ
Initialize target Q-network with parameters θ⁻ = θ
Initialize replay buffer D

For each episode:
    Observe initial state s
    For each step:
        Select action a using ε-greedy policy based on Q(s, ·; θ)
        Execute action a, observe reward r and next state s'
        Store transition (s, a, r, s') in replay buffer D
        
        Sample random mini-batch of transitions from D
        For each transition (sᵢ, aᵢ, rᵢ, s'ᵢ):
            If s'ᵢ is terminal:
                yᵢ = rᵢ
            Else:
                yᵢ = rᵢ + γ max_a' Q(s'ᵢ, a'; θ⁻)  // Use target network
        
        Update θ by minimizing loss: L = (1/N) Σ (yᵢ - Q(sᵢ, aᵢ; θ))²
        
        Every C steps: θ⁻ ← θ  // Update target network
        
        s ← s'
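The pseudocode can be turned into a small runnable sketch. To stay dependency-free, it replaces the neural network with a table and uses a hypothetical four-state chain environment; the environment, the hyperparameters, and the tabular "gradient step" are all assumptions made for illustration, not part of DQN itself.

```python
import random

N_STATES, GOAL = 4, 3          # states 0..3 on a line; state 3 is terminal
GAMMA, EPSILON, LEARNING_RATE = 0.9, 0.5, 0.5
BATCH_SIZE, TARGET_UPDATE_PERIOD = 8, 20   # TARGET_UPDATE_PERIOD plays the role of C


def env_step(state, action):
    """Action 0 moves left, action 1 moves right; reaching GOAL pays 1."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == GOAL
    return (1.0 if done else 0.0), next_state, done


def train(episodes=300, seed=0):
    rng = random.Random(seed)
    q_online = [[0.0, 0.0] for _ in range(N_STATES)]   # stand-in for Q(.; theta)
    q_target = [row[:] for row in q_online]            # theta_minus = theta
    replay = []
    total_steps = 0

    for _ in range(episodes):
        state = 0
        for _ in range(100):                           # cap on episode length
            # Epsilon-greedy action selection from the online table.
            if rng.random() < EPSILON:
                action = rng.randrange(2)
            else:
                action = max(range(2), key=lambda a: q_online[state][a])
            reward, next_state, done = env_step(state, action)
            replay.append((state, action, reward, next_state, done))

            # Sample a mini-batch and move each Q(s, a) toward its TD target.
            batch = rng.sample(replay, min(BATCH_SIZE, len(replay)))
            for s, a, r, s2, d in batch:
                y = r if d else r + GAMMA * max(q_target[s2])   # target table
                q_online[s][a] += LEARNING_RATE * (y - q_online[s][a])

            total_steps += 1
            if total_steps % TARGET_UPDATE_PERIOD == 0:
                q_target = [row[:] for row in q_online]         # hard update

            state = next_state
            if done:
                break
    return q_online


q = train()
print([max(range(2), key=lambda a: q[s][a]) for s in range(GOAL)])
```

The printed list is the greedy action in each non-terminal state; on this chain, moving right (action 1) is optimal everywhere.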

Soft update implementation

In frameworks like PyTorch, a soft update is typically implemented by iterating over the parameters of both networks:

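import torch

@torch.no_grad()  # the blend must not be tracked by autograd
def soft_update(online_net, target_net, tau):
    # theta_target <- tau * theta_online + (1 - tau) * theta_target
    # Note: parameters() does not include buffers such as BatchNorm statistics.
    for target_param, online_param in zip(
        target_net.parameters(), online_net.parameters()
    ):
        target_param.mul_(1.0 - tau).add_(online_param, alpha=tau)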

A hard update is simply the special case where τ = 1:

def hard_update(online_net, target_net):
    target_net.load_state_dict(online_net.state_dict())

Hyperparameter tuning

The performance of target networks depends on the careful selection of their associated hyperparameters.

Hard update frequency (C)

The update frequency C controls how often the target network is synchronized with the online network. The following guidelines are commonly observed:

Update frequency (C)	Effect	When to use
Very small (e.g., 100)	Target changes frequently; less stability	Simple environments with fast learning
Moderate (e.g., 1,000-5,000)	Balanced stability and freshness	Most standard tasks
Large (e.g., 10,000+)	Very stable but potentially stale targets	Complex environments; original DQN Atari setting

If C is too small, the target network changes too quickly and the stabilization benefit is lost. If C is too large, the target network becomes stale, meaning its predictions diverge from the online network's current understanding, which can slow down learning.

Soft update coefficient (τ)

The coefficient τ determines how quickly the target network tracks the online network:

τ value	Effect	Typical usage
0.001	Very slow tracking; high stability	DDPG (original paper)
0.005	Moderate tracking speed	SAC, many modern implementations
0.01	Faster tracking; less stability	Some TD3 implementations
1.0	Equivalent to hard update (full copy)	DQN-style periodic updates

In Stable Baselines3, the default τ for DQN is 1.0 (hard update) with an update period of 10,000 steps, while the default τ for DDPG is 0.005 with updates at every step.
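The tracking speeds in the table can be made concrete by asking how many soft updates the target needs to close half of its gap to the online network. The sketch below assumes the online parameters stay fixed, which never holds exactly during training, so the numbers are only a rough guide.

```python
import math


def half_life_steps(tau):
    """Updates needed for the target to close half of its gap to a fixed online network."""
    return math.log(0.5) / math.log(1.0 - tau)


for tau in (0.001, 0.005, 0.01):
    print(tau, round(half_life_steps(tau)))
```

Under this assumption the half-life is roughly 693 updates for τ = 0.001, 138 for τ = 0.005, and 69 for τ = 0.01. The formula is undefined for τ = 1, where a single update closes the gap completely.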

Interaction with experience replay

Target networks and experience replay are complementary techniques that address different sources of instability in deep reinforcement learning.

Source of instability	Solution	Mechanism
Correlated training samples	Experience replay	Randomly samples from a buffer to break temporal correlations
Non-stationary learning targets	Target network	Uses a separate, slowly changing network for target computation
Combined effect	Both together	Training resembles supervised learning with i.i.d. data and fixed targets

When both techniques are used together, the training process more closely resembles supervised machine learning: the training data is approximately independent and identically distributed (due to replay), and the targets are approximately fixed (due to the target network). This combination was essential to the success of the original DQN algorithm.
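The replay side of this pairing can be sketched as a small buffer class. The class name, capacity, and integer-labelled transitions are hypothetical; practical implementations usually store arrays or tensors rather than tuples.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of transitions with uniform random sampling."""

    def __init__(self, capacity, seed=None):
        self._storage = deque(maxlen=capacity)   # oldest transitions are evicted first
        self._rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self._storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks up consecutive transitions.
        return self._rng.sample(list(self._storage), batch_size)

    def __len__(self):
        return len(self._storage)


buffer = ReplayBuffer(capacity=5, seed=0)
for t in range(8):                       # hypothetical integer-labelled transitions
    buffer.add(t, 0, 0.0, t + 1, False)

batch = buffer.sample(3)
print(len(buffer), sorted(tr[0] for tr in batch))
```

Each sampled transition would then be scored against a TD target computed with the target network, so the two mechanisms meet in the loss computation.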

Theoretical analysis

The theoretical understanding of why target networks help has developed over time:

Mnih et al. (2015) provided empirical evidence that target networks stabilize DQN training but did not offer formal convergence guarantees.
Fan et al. (2020) analyzed deep Q-learning theoretically and showed that the target network plays a role in controlling the approximation error and ensuring convergence under certain conditions.
Zhang, Yao, and Whiteson (2021) proved that a target network with a modified Polyak averaging update rule (augmented with two projections) can break the deadly triad. They showed convergence for linear Q-learning algorithms with off-policy data and function approximation, providing the first such result without requiring restrictive assumptions on the behavior policy.

These theoretical results confirm the practical observation that target networks are effective at stabilizing training, while also highlighting that the standard Polyak averaging may not always suffice; additional modifications (such as projections or regularization) may be necessary for guaranteed convergence.

Limitations and alternatives

Limitations

Target networks are effective but come with trade-offs:

Increased memory usage: Maintaining a second copy of the network doubles the memory required for network parameters. In algorithms like TD3 and SAC, which use multiple target networks, the overhead is even higher.
Staleness: The target network's predictions can become outdated relative to the online network's current knowledge. This lag slows the propagation of value estimates through the Bellman backup chain, potentially reducing learning speed.
Additional hyperparameters: The update frequency C or coefficient τ introduces another hyperparameter that requires tuning and can affect performance.
Poor fit for fully online learning: A lagging target delays the agent's response to new experience, which is at odds with settings where the agent must learn and adapt in real time without a replay buffer.

Alternatives and extensions

Several approaches have been proposed to address the limitations of standard target networks:

Approach	Description	Reference
DeepMellow	Replaces the max operator with the Mellowmax operator, eliminating the need for a target network entirely	Kim et al. (2019)
Gradient target tracking	Uses the gradient of the target computation to adaptively adjust the target, reducing staleness	Yang et al. (2025)
t-Soft update	Generalizes the soft update rule using a Student-t distribution to allow adaptive blending between hard and soft updates	Kobayashi and Ilboudo (2021)
Iterated Q-learning	Bridges the performance gap between target-free and target-based methods	Multiple authors (2025)
Functional regularization	Views the target network as an implicit regularizer and replaces it with explicit functional regularization	Piché et al. (2021)

The DeepMellow approach is particularly notable. Kim, Asadi, Littman, and Konidaris (2019) showed that replacing the max operator in Q-learning with the Mellowmax operator (a smooth approximation of the max) can stabilize learning without a target network, achieving competitive or superior performance in Atari games when the temperature parameter is properly tuned.
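The Mellowmax operator itself is short enough to sketch. The version below uses the standard definition, mm_ω(x) = log(mean(exp(ω·xᵢ))) / ω, where ω is the tuning parameter the text calls the temperature; the Q-values are hypothetical and the max-shift is only a numerical-stability measure.

```python
import math


def mellowmax(values, omega):
    """mm_omega(x) = log(mean(exp(omega * x_i))) / omega, computed stably."""
    shift = max(values)
    mean_exp = sum(math.exp(omega * (v - shift)) for v in values) / len(values)
    return shift + math.log(mean_exp) / omega


q_values = [1.0, 2.0, 3.0]   # hypothetical next-state Q-values
for omega in (0.1, 1.0, 10.0, 100.0):
    print(omega, mellowmax(q_values, omega))
```

For small ω the result stays near the mean of the values, and for large ω it approaches their maximum, which is why the operator can act as a smooth replacement for max in the TD target.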

When to use target networks

Target networks are primarily used in off-policy deep reinforcement learning algorithms that combine neural network function approximation with bootstrapped value estimates. On-policy methods like A3C and PPO do not typically use target networks because they do not suffer from the same degree of instability.

General guidelines:

Use a target network when training off-policy algorithms with value function bootstrapping (DQN, DDPG, TD3, SAC).
Use hard updates for discrete action spaces with DQN-style algorithms.
Use soft updates for continuous control with actor-critic methods.
Consider alternatives when memory is limited, when faster learning speed is needed, or when operating in online learning settings.

Historical context

The development of target networks follows the broader trajectory of stabilizing deep reinforcement learning:

Year	Development	Authors
1992	Polyak averaging for stochastic optimization	Polyak and Juditsky
2013	DQN with experience replay (workshop paper; no separate target network yet)	Mnih et al.
2015	DQN published in Nature; introduces the target network; human-level Atari performance	Mnih et al.
2015	DDPG introduces soft updates for target networks	Lillicrap et al.
2016	Double DQN uses target network to reduce overestimation	van Hasselt, Guez, Silver
2018	TD3 adds delayed updates and target policy smoothing	Fujimoto, van Hoof, Meger
2018	SAC combines target networks with entropy regularization	Haarnoja et al.
2019	DeepMellow removes the need for target networks	Kim et al.
2021	Theoretical proof that target networks break the deadly triad	Zhang, Yao, Whiteson

References

Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. (2013). "Playing Atari with Deep Reinforcement Learning." *arXiv preprint arXiv:1312.5602*.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., Graves, A., Riedmiller, M., Fidjeland, A.K., Ostrovski, G., et al. (2015). "Human-level control through deep reinforcement learning." *Nature*, 518(7540), 529-533.
Lillicrap, T.P., Hunt, J.J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. (2015). "Continuous control with deep reinforcement learning." *arXiv preprint arXiv:1509.02971*.
van Hasselt, H., Guez, A., and Silver, D. (2016). "Deep Reinforcement Learning with Double Q-learning." *Proceedings of the AAAI Conference on Artificial Intelligence*, 30(1).
Fujimoto, S., van Hoof, H., and Meger, D. (2018). "Addressing Function Approximation Error in Actor-Critic Methods." *Proceedings of the 35th International Conference on Machine Learning (ICML)*.
Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. (2018). "Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor." *Proceedings of the 35th International Conference on Machine Learning (ICML)*.
Zhang, S., Yao, H., and Whiteson, S. (2021). "Breaking the Deadly Triad with a Target Network." *Proceedings of the 38th International Conference on Machine Learning (ICML)*, 139, 12621-12631.
Kim, S., Asadi, K., Littman, M.L., and Konidaris, G. (2019). "DeepMellow: Removing the Need for a Target Network in Deep Q-Learning." *Proceedings of the 28th International Joint Conference on Artificial Intelligence (IJCAI)*.
Sutton, R.S. and Barto, A.G. (2018). *Reinforcement Learning: An Introduction*. 2nd edition. MIT Press.
Polyak, B.T. and Juditsky, A.B. (1992). "Acceleration of Stochastic Approximation by Averaging." *SIAM Journal on Control and Optimization*, 30(4), 838-855.
Fan, J., Wang, Z., Xie, Y., and Yang, Z. (2020). "A Theoretical Analysis of Deep Q-Learning." *Proceedings of the 2nd Annual Conference on Learning for Dynamics and Control (L4DC)*.
van Hasselt, H. (2010). "Double Q-learning." *Advances in Neural Information Processing Systems (NeurIPS)*, 23.
Kobayashi, T. and Ilboudo, W.E.L. (2021). "t-Soft Update of Target Network for Deep Reinforcement Learning." *Neural Networks*, 136, 63-71.
Piché, A., Thomas, V., Pardinas, R., Marino, J., Marconi, G.M., Pal, C., and Khan, M.E. (2021). "Bridging the Gap Between Target Networks and Functional Regularization." *arXiv preprint arXiv:2106.02613*.
