# Target Network

> Source: https://aiwiki.ai/wiki/target_network
> Updated: 2026-07-11
> Categories: Deep Learning, Machine Learning, Reinforcement Learning
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

A **target network** is a separate, slowly updated copy of a [neural network](/wiki/neural_network) used in [deep reinforcement learning](/wiki/reinforcement_learning) to compute stable learning targets, decoupling the bootstrap target from the rapidly changing network being optimized. It was introduced in the 2015 Nature paper that defined the modern Deep Q-Network ([DQN](/wiki/dqn)) algorithm: every C updates the online network Q is cloned to produce a target network, and that frozen copy generates the [temporal difference](/wiki/temporal_difference_learning) (TD) targets for the next C updates. In the original DQN experiments, C was set to 10,000 steps.[1] Mnih et al. wrote that "generating the targets using an older set of parameters adds a delay between the time an update to Q is made and the time the update affects the targets $$y_j$$, making divergence or oscillations much more unlikely."[1] Target networks have since become a standard component of off-policy deep reinforcement learning methods, including Deep Deterministic Policy Gradient (DDPG), Twin Delayed DDPG (TD3), and Soft Actor-Critic (SAC).

## Explain like I'm 5 (ELI5)

Imagine you are learning to throw a basketball into a hoop. Every time you throw, someone tells you how far off your throw was so you can adjust. But what if the hoop kept moving every time you threw? It would be really hard to get better because the target keeps changing.

In [reinforcement learning](/wiki/reinforcement_learning), a computer agent has the same problem. It is trying to learn how good different actions are, but the "answer key" it uses to check itself keeps changing as it learns. A target network is like having a second, frozen hoop that stays in one place for a while. The agent practices throwing at that frozen hoop, and only after many practice throws does someone move the frozen hoop to match the real one. This makes learning much easier because the agent has a steady goal to aim for.

## What problem does a target network solve?

### The moving target problem

In [Q-learning](/wiki/q-learning), an agent learns to estimate the expected cumulative [reward](/wiki/reward) for taking an action in a given state. When a [neural network](/wiki/neural_network) is used to approximate the Q-function (as in DQN), the network's parameters appear on both sides of the update equation. The Q-value prediction for the current state-action pair is compared against a target that itself depends on the Q-values of the next state, computed by the same network.

This creates a feedback loop: each time the network's weights are updated to reduce the error on one sample, the target values for all other samples also shift. The result is a "moving target" problem where the optimization objective changes with every parameter update. The DQN authors described the underlying mechanism precisely: in standard online Q-learning "an update that increases $$Q(s_t, a_t)$$ often also increases $$Q(s_{t+1}, a)$$ for all a and hence also increases the target $$y_j$$, possibly leading to oscillations or divergence of the policy."[1] In practice, this can cause oscillations, divergence, or slow convergence during training.

### The deadly triad

The instability problem is closely related to what Sutton and Barto (2018) called the **deadly triad** in reinforcement learning.[9] The deadly triad occurs when three elements are combined simultaneously:

1. **Function approximation** (such as neural networks) to represent value functions
2. **Bootstrapping** (using estimated values to update other estimated values, as in temporal difference learning)
3. **Off-policy learning** (learning about a policy different from the one generating the data)

When all three are present, learning can become unstable and value estimates may diverge to infinity. The target network was one of the key techniques introduced to mitigate this instability. Zhang, Yao, and Whiteson (2021) provided formal theoretical results showing that a target network, combined with two projections added to the Polyak-averaging update, can break the deadly triad and guarantee [convergence](/wiki/convergence) for linear off-policy algorithms with bootstrapping under non-restrictive conditions.[7]
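A minimal sketch of how the three ingredients can combine into divergence, loosely following the two-state "w to 2w" example from Sutton and Barto's treatment of off-policy learning with approximation. The function name, step size, and discount factor are illustrative assumptions, not values from any cited paper.

```python
def run_w_to_2w(gamma=0.99, alpha=0.1, steps=100, w=1.0):
    """Semi-gradient TD(0) on one repeatedly sampled transition.

    Two states share a single weight w: the first state's value estimate is w
    (feature 1) and the second state's estimate is 2w (feature 2). The reward
    is always 0. Updating only on the first -> second transition plays the
    role of the off-policy data distribution.
    """
    history = [w]
    for _ in range(steps):
        td_error = 0.0 + gamma * (2.0 * w) - w   # bootstrapped target minus prediction
        w += alpha * td_error * 1.0              # gradient of the prediction w.r.t. w is 1
        history.append(w)
    return history


history = run_w_to_2w()
print(f"w after {len(history) - 1} updates: {history[-1]:.1f}")
```

Each update multiplies w by 1 + alpha * (2 * gamma - 1), so under these assumptions the weight grows without bound whenever gamma exceeds 0.5, even though every reward is zero.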

## How does a target network work?

### Basic mechanism

A target network is architecturally identical to the **online network** (also called the main network or policy network) but maintains a separate set of parameters. During training:

1. The online network's parameters (denoted $$\theta$$) are updated at every training step using [gradient descent](/wiki/gradient_descent) on the [loss function](/wiki/loss_function).
2. The target network's parameters (denoted $$\theta^-$$) remain fixed for a period of time or are updated very slowly.
3. When computing the temporal difference (TD) target for the loss function, the target network's parameters are used instead of the online network's parameters.

By decoupling the parameters used for the prediction from those used for the target computation, the learning targets remain stable over many update steps. This weakens the feedback loop and makes the optimization problem more similar to standard supervised learning, where the targets are fixed. The DQN authors attribute the benefit to the delay this introduces between an update to Q and its effect on the targets, which makes "divergence or oscillations much more unlikely."[1]
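The decoupling can be illustrated with a tiny linear Q-function. This is a sketch under assumed values: the weights, the transition, and the learning rate are all hypothetical, and a linear model stands in for the neural network.

```python
import copy


def q_values(weights, features):
    """Linear Q-function: one weight vector per action."""
    return [sum(w * x for w, x in zip(w_a, features)) for w_a in weights]


gamma = 0.99
online = [[0.5, -0.2], [0.1, 0.4]]   # theta: updated at every training step
target = copy.deepcopy(online)        # theta^-: frozen copy

# Hypothetical transition (s, a, r, s') with two-dimensional state features.
s, a, r, s_next = [1.0, 0.0], 0, 1.0, [0.8, 0.6]

y_before = r + gamma * max(q_values(target, s_next))

# One semi-gradient step on the online weights only.
lr = 0.1
td_error = y_before - q_values(online, s)[a]
online[a] = [w + lr * td_error * x for w, x in zip(online[a], s)]

y_frozen = r + gamma * max(q_values(target, s_next))   # computed from theta^-
y_online = r + gamma * max(q_values(online, s_next))   # computed from theta

print(y_before, y_frozen, y_online)
```

In this toy setting the update that raises Q(s, a) also raises the target computed from the online weights, while the target computed from the frozen copy is unchanged.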

### Loss function with target network

In standard DQN, the loss function for a single transition (s, a, r, s') is:

$$
L(\theta) = \left(r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta)\right)^2
$$

where:

- $$Q(s, a; \theta)$$ is the Q-value predicted by the online network for state *s* and action *a*
- $$Q(s', a'; \theta^-)$$ is the Q-value predicted by the target network for the next state *s'*
- $$\gamma$$ is the discount factor
- $$r$$ is the observed reward
- $$\theta$$ represents the online network's parameters
- $$\theta^-$$ represents the target network's parameters

The term $$r + \gamma \max_{a'} Q(s', a'; \theta^-)$$ is called the **TD target**. Because it uses the target network's fixed parameters $$\theta^-$$ rather than the online network's changing parameters $$\theta$$, the TD target does not shift with every gradient update.

In practice, the loss is computed over mini-batches sampled from an [experience replay](/wiki/experience_replay) buffer (the original DQN used a minibatch size of 32):[1]

$$
L(\theta) = \frac{1}{N} \sum_i \left(r_i + \gamma \max_{a'} Q(s'_i, a'; \theta^-) - Q(s_i, a_i; \theta)\right)^2
$$

Some implementations use the Huber loss instead of mean squared error to make training more robust to outliers in the TD error.
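The mini-batch loss above can be sketched in plain Python. The function names, the `(s, a, r, s_next, done)` tuple layout, and the toy lookup-table Q-functions are assumptions made for illustration; a real implementation would use tensors and would block gradients from flowing through the target term.

```python
def huber(x, delta=1.0):
    """Quadratic near zero, linear in the tails."""
    return 0.5 * x * x if abs(x) <= delta else delta * (abs(x) - 0.5 * delta)


def dqn_loss(batch, q_online, q_target, gamma=0.99, use_huber=False):
    """Mean loss over a mini-batch of (s, a, r, s_next, done) tuples.

    q_online(s) and q_target(s) are assumed to return a list of Q-values,
    one entry per action.
    """
    total = 0.0
    for s, a, r, s_next, done in batch:
        # Terminal transitions do not bootstrap from the next state.
        y = r if done else r + gamma * max(q_target(s_next))
        td_error = y - q_online(s)[a]
        total += huber(td_error) if use_huber else td_error ** 2
    return total / len(batch)


# Toy Q-functions over two states and two actions (hypothetical values).
q_online = lambda s: {0: [0.0, 1.0], 1: [2.0, 0.0]}[s]
q_target = lambda s: {0: [0.0, 0.5], 1: [1.0, 0.0]}[s]
batch = [(0, 1, 1.0, 1, False), (1, 0, 5.0, 0, True)]

mse = dqn_loss(batch, q_online, q_target, gamma=0.9)
robust = dqn_loss(batch, q_online, q_target, gamma=0.9, use_huber=True)
print(mse, robust)
```

The second transition has a TD error of 3, which dominates the squared loss but contributes only linearly under the Huber loss.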

## How is the target network updated?

There are two main strategies for updating the target network's parameters to eventually reflect the online network's learned values.

### Hard updates (periodic replacement)

In the original DQN algorithm, the target network's parameters are periodically replaced with the online network's parameters every $$C$$ steps:

$$
\theta^- \leftarrow \theta \quad \text{(every C steps)}
$$

Between updates, the target network's parameters remain completely frozen. The DQN paper describes the rule in one sentence: "every C updates we clone the network Q to obtain a target network Q-hat and use Q-hat for generating the Q-learning targets $$y_j$$ for the following C updates to Q."[1] This is the simplest update strategy. The 2013 workshop paper that first paired deep networks with Q-learning used only experience replay and did not include a target network; the target network was added in the 2015 Nature paper, where C was set to 10,000 steps.[1][2]
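The periodic schedule can be sketched with a step counter. The dictionary standing in for network parameters and the small period `C = 4` are assumptions chosen to keep the trace short.

```python
import copy

C = 4                         # sync period (10,000 in the Nature DQN setup)
online = {"w": 0.0}           # stand-in for the online network's parameters
target = copy.deepcopy(online)

sync_steps = []
for step in range(1, 13):
    online["w"] += 1.0        # placeholder for one gradient update
    if step % C == 0:
        target = copy.deepcopy(online)   # theta^- <- theta
        sync_steps.append(step)
    # At the end of each iteration the target lags by at most C - 1 updates.

print(sync_steps, target)
```

A deep copy matters here: assigning `target = online` would alias the two objects, so the "target" would silently follow every online update.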

### Soft updates (Polyak averaging)

Introduced in the DDPG algorithm by Lillicrap et al. (2015), soft updates apply a weighted average at every training step:

$$
\theta^- \leftarrow \tau \theta + (1 - \tau) \theta^-
$$

where $$\tau$$ is a small positive number called the soft update coefficient, typically in the range 0.001 to 0.005. This formula is also known as **Polyak averaging** (or an exponential moving average of parameters). At each step, the target network moves a tiny fraction toward the online network, resulting in a smooth and continuous update rather than an abrupt replacement. Lillicrap et al. explained the benefit directly: "This means that the target values are constrained to change slowly, greatly improving the stability of learning."[3] The DDPG experiments used $$\tau = 0.001$$.[3]

The name "Polyak averaging" comes from the work of Boris Polyak and Anatoli Juditsky (1992), who showed that averaging the iterates of stochastic optimization algorithms can improve convergence rates.[10]
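To get a feel for the timescale that $$\tau$$ implies, note that if the online parameters were held fixed, each soft update would shrink the gap between target and online parameters by a factor of $$(1 - \tau)$$. The sketch below computes the resulting half-life; the helper names are hypothetical, and the fixed-online-network assumption is a simplification.

```python
import math


def remaining_gap(tau, n):
    """Fraction of the initial target-online gap left after n soft updates,
    assuming the online parameters do not change in the meantime."""
    return (1.0 - tau) ** n


def soft_update_half_life(tau):
    """Number of soft updates needed to close half of the gap."""
    return math.log(0.5) / math.log(1.0 - tau)


for tau in (0.001, 0.005):
    print(tau, round(soft_update_half_life(tau), 1))
```

Under this simplification the half-life is roughly 693 updates for $$\tau = 0.001$$ and roughly 138 updates for $$\tau = 0.005$$, which gives a rough basis for comparing a soft-update coefficient with a hard-update period.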

### How do hard and soft updates differ?

| Property | Hard update | Soft update (Polyak averaging) |
|---|---|---|
| Update formula | $$\theta^- \leftarrow \theta$$ every C steps | $$\theta^- \leftarrow \tau\theta + (1-\tau)\theta^-$$ every step |
| Update frequency | Every C steps (e.g., 1,000 or 10,000) | Every training step |
| Key hyperparameter | $$C$$ (update period) | $$\tau$$ (interpolation coefficient) |
| Typical hyperparameter values | $$C = 1{,}000$$ to $$10{,}000$$ | $$\tau = 0.001$$ to $$0.005$$ |
| Target stability | Very stable between updates; abrupt jumps at update points | Continuously changing but very slowly |
| Smoothness | Discontinuous; target values can shift suddenly | Smooth; target values change gradually |
| Used in | DQN, Double DQN | DDPG, TD3, SAC |
| Sensitivity | Sensitive to choice of $$C$$; too large causes staleness, too small reduces stability | Sensitive to choice of $$\tau$$; too large reduces stability, too small causes staleness |

## How is the target network used in specific algorithms?

Target networks appear in many deep reinforcement learning algorithms. The following table summarizes how different algorithms use them.

| Algorithm | Year | Target network usage | Update method | Networks with targets | Key innovation |
|---|---|---|---|---|---|
| [DQN](/wiki/dqn) | 2015 | Target Q-network for computing TD targets | Hard update every $$C = 10{,}000$$ steps | Q-network | Introduced target networks with [experience replay](/wiki/experience_replay) |
| Double DQN | 2016 | Target Q-network for value evaluation; online network for action selection | Hard update every C steps | Q-network | Decoupled action selection from value evaluation to reduce overestimation |
| DDPG | 2015/2016 | Separate target networks for both actor and critic | Soft update ($$\tau = 0.001$$) | Actor, Critic | Extended target networks to continuous action spaces with soft updates |
| TD3 | 2018 | Target networks for twin critics and actor | Soft update ($$\tau = 0.005$$), delayed every $$d = 2$$ steps | Actor, two Critics | Added delayed updates and target policy smoothing |
| SAC | 2018 | Target Q-networks (two) with Polyak averaging | Soft update ($$\tau = 0.005$$) | Two Q-networks | Combined target networks with maximum entropy framework |

### DQN (Deep Q-Network)

The [DQN](/wiki/dqn) algorithm, published by Mnih et al. at [Google DeepMind](/wiki/google_deepmind), was the first to combine deep [neural networks](/wiki/neural_network) with Q-learning for high-dimensional state spaces such as Atari game frames. It was evaluated on 49 Atari 2600 games and was, in the authors' words, "able to surpass the performance of all previous algorithms and achieve a level comparable to that of a professional human games tester across a set of 49 games."[1] The Nature version of DQN relies on two stabilization techniques that work together:

1. **[Experience replay](/wiki/experience_replay):** Transitions are stored in a [replay buffer](/wiki/replay_buffer) (capacity 1,000,000 in the original paper) and sampled randomly for training, breaking the temporal correlations between consecutive experiences.[1]
2. **Target network:** A separate network with frozen parameters provides stable TD targets, refreshed every 10,000 steps.[1]

Experience replay addresses the correlation between consecutive training samples, while the target network addresses the non-stationarity of the learning targets. Together with a discount factor of $$\gamma = 0.99$$ and a learning rate of 0.00025, they made deep Q-learning practical for complex tasks.[1]

### Double DQN

Van Hasselt, Guez, and Silver (2016) identified that the max operator in DQN's target computation leads to systematic overestimation of Q-values. As they put it, the max operator "uses the same values both to select and to evaluate an action," which "makes it more likely to select overestimated values, resulting in overoptimistic value estimates."[4] Standard DQN uses the same (target) network both to select the best action and to evaluate it:

$$
y = r + \gamma \max_{a'} Q(s', a'; \theta^-)
$$

Double DQN decouples these two steps. It uses the **online network** to select the action and the **target network** to evaluate that action's value:

$$
y = r + \gamma Q(s', \arg\max_{a'} Q(s', a'; \theta); \theta^-)
$$

This small change significantly reduces overestimation bias and leads to improved performance across many Atari games, raising the median normalized score across 49 games to 114.7% of a human games tester versus 93.5% for DQN.[4]
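The difference between the two targets is easiest to see on concrete numbers. The Q-values below are hypothetical and the function names are illustrative.

```python
def dqn_target(r, q_target_next, gamma):
    """Select and evaluate the next action with the target network."""
    return r + gamma * max(q_target_next)


def double_dqn_target(r, q_online_next, q_target_next, gamma):
    """Select with the online network, evaluate with the target network."""
    best_action = max(range(len(q_online_next)), key=q_online_next.__getitem__)
    return r + gamma * q_target_next[best_action]


# Hypothetical Q-values for the next state s' (three actions).
q_online_next = [1.0, 3.0, 2.0]
q_target_next = [2.5, 1.5, 2.0]

y_dqn = dqn_target(1.0, q_target_next, gamma=0.9)
y_double = double_dqn_target(1.0, q_online_next, q_target_next, gamma=0.9)
print(y_dqn, y_double)
```

Because the Double DQN target evaluates one specific action under the target network, it can never exceed the DQN target, which takes the maximum over all actions under that same network.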

### DDPG (Deep Deterministic Policy Gradient)

Lillicrap et al. (2015) adapted DQN concepts to continuous action spaces using an actor-critic architecture. The resulting algorithm "robustly solves more than 20 simulated physics tasks" using the same learning algorithm, network architecture, and hyperparameters.[3] DDPG maintains four networks:

1. **Online actor** (policy network): Selects actions
2. **Online critic** (Q-network): Evaluates state-action pairs
3. **Target actor:** Provides stable action predictions for target computation
4. **Target critic:** Provides stable Q-value targets

DDPG was the first algorithm to use soft (Polyak averaging) updates for target networks, with the target parameters constrained to change slowly at each step. The DDPG paper used $$\tau = 0.001$$.[3]
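A minimal sketch of how the two target networks enter the critic's TD target, and of the soft update applied to a flat list of parameters. The callables standing in for the target actor and target critic are hypothetical toy functions, not trained networks.

```python
def ddpg_target(r, s_next, done, target_actor, target_critic, gamma=0.99):
    """y = r + gamma * Q'(s', mu'(s')), with no bootstrap on terminal states."""
    if done:
        return r
    a_next = target_actor(s_next)          # action proposed by the target actor
    return r + gamma * target_critic(s_next, a_next)


def polyak(target_params, online_params, tau=0.001):
    """Element-wise theta^- <- tau * theta + (1 - tau) * theta^-."""
    return [tau * w + (1.0 - tau) * w_t
            for w_t, w in zip(target_params, online_params)]


# Toy stand-ins for the target actor and target critic.
target_actor = lambda s: 0.5 * s
target_critic = lambda s, a: s + a

y = ddpg_target(1.0, 2.0, False, target_actor, target_critic)
new_target = polyak([0.0, 1.0], [1.0, 1.0])
print(y, new_target)
```

In a full implementation the same soft update would be applied separately to the actor's and the critic's parameters after each training step.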

### TD3 (Twin Delayed DDPG)

Fujimoto, van Hoof, and Meger (2018) observed that in value-based methods "function approximation errors are known to lead to overestimated value estimates and suboptimal policies," showed that "this problem persists in an actor-critic setting," and introduced three improvements on top of DDPG:[5]

1. **Clipped double Q-learning:** Two critic networks are maintained, and the smaller of the two Q-value estimates is used in the target. This prevents the policy from exploiting overestimated Q-values.
2. **Delayed policy updates:** The actor and the target networks are updated less frequently than the critics, once every d = 2 critic updates.[5]
3. **Target policy smoothing:** Noise is added to the actions selected by the target actor to smooth the Q-value estimates:

    $$a'(s') = \mathrm{clip}(\mu_{\theta_{\mathrm{targ}}}(s') + \mathrm{clip}(\epsilon, -c, c), a_{\mathrm{low}}, a_{\mathrm{high}})$$, where $$\epsilon \sim \mathcal{N}(0, \sigma)$$

All of TD3's target networks (the target actor and both target critics) are updated with $$\tau = 0.005$$.[5] These techniques work together to produce more stable learning and more accurate value estimates.
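The clipped double Q-learning and target policy smoothing steps can be combined into a single target computation. This is a sketch for a scalar action; the default noise scale `sigma`, the clip range `c`, and the action bounds are assumptions for illustration, and $$\sigma$$ is treated as a standard deviation.

```python
import random


def clip(x, low, high):
    return max(low, min(high, x))


def td3_target(r, s_next, done, target_actor, target_q1, target_q2,
               gamma=0.99, sigma=0.2, c=0.5, a_low=-1.0, a_high=1.0, rng=random):
    """TD target with target policy smoothing and clipped double Q-learning."""
    if done:
        return r
    # Target policy smoothing: clipped Gaussian noise on the target action.
    noise = clip(rng.gauss(0.0, sigma), -c, c)
    a_next = clip(target_actor(s_next) + noise, a_low, a_high)
    # Clipped double Q-learning: use the smaller of the two target critics.
    q_min = min(target_q1(s_next, a_next), target_q2(s_next, a_next))
    return r + gamma * q_min


# Toy target networks; sigma=0 removes the noise so the result is deterministic.
y = td3_target(
    0.0, 0.0, False,
    target_actor=lambda s: 2.0,          # out of bounds, clipped to a_high
    target_q1=lambda s, a: a + 1.0,
    target_q2=lambda s, a: 3.0 * a,
    gamma=0.5, sigma=0.0,
)
print(y)
```

With the noise disabled, the action is clipped to 1.0, the two critics return 2.0 and 3.0, and the target bootstraps from the smaller value.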

### SAC (Soft Actor-Critic)

Haarnoja et al. (2018) described [Soft Actor-Critic](/wiki/soft_actor_critic) as "an off-policy actor-critic deep RL algorithm based on the maximum entropy reinforcement learning framework" in which "the actor aims to maximize expected reward while also maximizing entropy."[6] SAC uses target Q-networks updated via Polyak averaging (with $$\tau = 0.005$$) and takes the minimum of two Q-estimates (similar to TD3's clipped double Q-learning).[6] The target computation in SAC incorporates an entropy term:

$$
y = r + \gamma \left(\min(Q_{\mathrm{targ},1}(s', a'), Q_{\mathrm{targ},2}(s', a')) - \alpha \log \pi(a' \mid s')\right)
$$

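where $$a'$$ is sampled from the current policy $$\pi(\cdot \mid s')$$ rather than taken from the replay buffer, and $$\alpha$$ is a temperature parameter controlling the trade-off between reward maximization and entropy (exploration).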
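The SAC target can be sketched as a small function that takes the two target Q-values and the log-probability of the sampled next action as inputs. The function name and the default temperature are assumptions for illustration.

```python
def sac_target(r, done, q1_next, q2_next, log_prob_next, gamma=0.99, alpha=0.2):
    """y = r + gamma * (min(Q1', Q2') - alpha * log pi(a'|s')).

    q1_next and q2_next are the target critics' values at (s', a'), and
    log_prob_next is log pi(a'|s') for the next action a'.
    """
    if done:
        return r
    soft_value = min(q1_next, q2_next) - alpha * log_prob_next
    return r + gamma * soft_value


y = sac_target(r=1.0, done=False, q1_next=2.0, q2_next=3.0,
               log_prob_next=-1.0, gamma=0.5, alpha=0.2)
print(y)
```

A negative log-probability raises the target, so less likely actions under the current policy earn an entropy bonus scaled by the temperature.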

## Implementation

### Pseudocode for DQN with target network

The following pseudocode shows how a target network is used in the DQN algorithm:

```
Initialize online Q-network with random parameters theta
Initialize target Q-network with parameters theta-minus = theta
Initialize replay buffer D

For each episode:
    Observe initial state s
    For each step:
        Select action a using epsilon-greedy policy based on Q(s, .; theta)
        Execute action a, observe reward r and next state s'
        Store transition (s, a, r, s') in replay buffer D
        
        Sample random mini-batch of transitions from D
        For each transition (s_i, a_i, r_i, s'_i):
            If s'_i is terminal:
                y_i = r_i
            Else:
                y_i = r_i + gamma max_a' Q(s'_i, a'; theta-minus)  // Use target network
        
        Update theta by minimizing loss: L = (1/N) Sum (y_i - Q(s_i, a_i; theta))^2
        
        Every C steps: theta-minus <- theta  // Update target network
        
        s <- s'
```
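The pseudocode can be turned into a small runnable sketch by swapping the neural network for a Q-table. Everything environment-specific here is an assumption made for illustration: a hypothetical four-state chain, mini-batch sampling with replacement, per-sample tabular updates in place of a gradient step, and hand-picked hyperparameters.

```python
import random
from collections import deque

N_STATES, GOAL = 4, 3          # chain 0-1-2-3; reaching state 3 ends the episode
GAMMA, LR, EPSILON = 0.9, 0.5, 0.5
C, BATCH = 20, 8               # target sync period and mini-batch size


def env_step(s, a):
    """Action 1 moves right, action 0 moves left (floor at state 0)."""
    s_next = min(s + 1, GOAL) if a == 1 else max(s - 1, 0)
    done = s_next == GOAL
    return (1.0 if done else 0.0), s_next, done


def train(episodes=300, max_steps=50, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]      # online table (theta)
    q_target = [row[:] for row in q]               # frozen copy (theta^-)
    buffer = deque(maxlen=1000)
    step_count = 0
    for _ in range(episodes):
        s = 0
        for _ in range(max_steps):
            # Epsilon-greedy action selection from the online table.
            if rng.random() < EPSILON:
                a = rng.randrange(2)
            else:
                a = max((0, 1), key=lambda act: q[s][act])
            r, s_next, done = env_step(s, a)
            buffer.append((s, a, r, s_next, done))

            # Sample a mini-batch and move the online table toward the targets.
            for bs, ba, br, bs_next, bdone in rng.choices(buffer, k=BATCH):
                y = br if bdone else br + GAMMA * max(q_target[bs_next])
                q[bs][ba] += LR * (y - q[bs][ba])

            step_count += 1
            if step_count % C == 0:
                q_target = [row[:] for row in q]   # hard update: theta^- <- theta
            s = s_next
            if done:
                break
    return q


q = train()
print([[round(v, 2) for v in row] for row in q])
```

On this chain the learned values for moving right should approach 0.81, 0.9, and 1.0 in states 0, 1, and 2, since each step away from the goal discounts the reward by another factor of gamma.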

### Soft update implementation

In frameworks like PyTorch, a soft update is typically implemented by iterating over the parameters of both networks:

```python
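import torch


@torch.no_grad()  # the update is bookkeeping, not part of the autograd graph
def soft_update(online_net, target_net, tau):
    """Polyak update: theta^- <- tau * theta + (1 - tau) * theta^-."""
    for target_param, online_param in zip(
        target_net.parameters(), online_net.parameters()
    ):
        # In place: scale the target by (1 - tau), then add tau * online.
        target_param.mul_(1.0 - tau).add_(online_param, alpha=tau)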
```

A hard update is simply the special case where $$\tau = 1$$:

```python
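def hard_update(online_net, target_net):
    # Copies every entry in the state dict, which includes buffers
    # (for example normalization statistics) as well as parameters.
    target_net.load_state_dict(online_net.state_dict())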
```

## How are target network hyperparameters tuned?

The performance of target networks depends on the careful selection of their associated [hyperparameters](/wiki/hyperparameter).

### Hard update frequency (C)

The update frequency $$C$$ controls how often the target network is synchronized with the online network. The following guidelines are commonly observed:

| Update frequency (C) | Effect | When to use |
|---|---|---|
| Very small (e.g., 100) | Target changes frequently; less stability | Simple environments with fast learning |
| Moderate (e.g., 1,000-5,000) | Balanced stability and freshness | Most standard tasks |
| Large (e.g., 10,000+) | Very stable but potentially stale targets | Complex environments; original DQN Atari setting |

If $$C$$ is too small, the target network changes too quickly and the stabilization benefit is lost. If $$C$$ is too large, the target network becomes stale, meaning its predictions lag far behind the online network's current estimates, which can slow down learning.

### Soft update coefficient (tau)

The coefficient $$\tau$$ determines how quickly the target network tracks the online network:

| tau value | Effect | Typical usage |
|---|---|---|
| 0.001 | Very slow tracking; high stability | DDPG (original paper) |
| 0.005 | Moderate tracking speed | TD3, SAC, many modern implementations |
| 0.01 | Faster tracking; less stability | Some implementations |
| 1.0 | Equivalent to hard update (full copy) | DQN-style periodic updates |

In Stable Baselines3, the default $$\tau$$ for DQN is 1.0 (hard update) with an update period of 10,000 steps, while the default $$\tau$$ for DDPG, TD3, and SAC is 0.005 with updates at every step.
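The defaults described above can be written down as keyword arguments. This is a configuration sketch only: the `tau` and `target_update_interval` names reflect my understanding of the Stable Baselines3 constructors and should be checked against the installed version, and the dictionary name is hypothetical.

```python
# Target-network settings per algorithm, mirroring the defaults described above.
# Keyword names are assumed to match the Stable Baselines3 constructors.
TARGET_UPDATE_DEFAULTS = {
    "DQN": {"tau": 1.0, "target_update_interval": 10_000},   # hard update
    "DDPG": {"tau": 0.005},                                   # soft update
    "TD3": {"tau": 0.005},
    "SAC": {"tau": 0.005},
}

# Hypothetical usage, assuming stable_baselines3 is installed:
#   from stable_baselines3 import DQN
#   model = DQN("MlpPolicy", "CartPole-v1", **TARGET_UPDATE_DEFAULTS["DQN"])

for name, kwargs in TARGET_UPDATE_DEFAULTS.items():
    print(name, kwargs)
```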

## How do target networks and experience replay work together?

Target networks and [experience replay](/wiki/experience_replay) are complementary techniques that address different sources of instability in deep reinforcement learning.

| Source of instability | Solution | Mechanism |
|---|---|---|
| Correlated training samples | Experience replay | Randomly samples from a buffer to break temporal correlations |
| Non-stationary learning targets | Target network | Uses a separate, slowly changing network for target computation |
| Combined effect | Both together | Training resembles supervised learning with i.i.d. data and fixed targets |

When both techniques are used together, the training process more closely resembles supervised [machine learning](/wiki/machine_learning): the training data is approximately independent and identically distributed (due to replay), and the targets are approximately fixed (due to the target network). This combination was essential to the success of the original DQN algorithm.
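For reference, the replay side of this pairing can be sketched as a fixed-capacity buffer with uniform sampling. The class and method names are illustrative assumptions rather than any particular library's API.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity FIFO store of transitions with uniform sampling."""

    def __init__(self, capacity):
        self.storage = deque(maxlen=capacity)   # oldest transitions are evicted first

    def add(self, s, a, r, s_next, done):
        self.storage.append((s, a, r, s_next, done))

    def sample(self, batch_size, rng=random):
        # Uniform sampling without replacement ignores the order of collection.
        return rng.sample(list(self.storage), batch_size)

    def __len__(self):
        return len(self.storage)


buffer = ReplayBuffer(capacity=3)
for t in range(5):
    buffer.add(s=t, a=0, r=0.0, s_next=t + 1, done=False)

batch = buffer.sample(2)
print(len(buffer), batch)
```

The sampled mini-batch would then be passed to a loss that computes its TD targets from the target network, so that both stabilizers act on the same update.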

## Theoretical analysis

The theoretical understanding of why target networks help has developed over time:

- **Mnih et al. (2015)** provided empirical evidence that target networks stabilize DQN training but did not offer formal convergence guarantees.[1]
- **Fan et al. (2020)** analyzed deep Q-learning theoretically and showed that the target network plays a role in controlling the approximation error and ensuring convergence under certain conditions.[11]
- **Zhang, Yao, and Whiteson (2021)** proposed a target network update rule that augments the commonly used Polyak-averaging update "with two projections," providing what they called "theoretical support for the conventional wisdom that a target network stabilizes training."[7] They proved convergence for off-policy algorithms with linear function approximation and bootstrapping, spanning both policy evaluation and control, without requiring restrictive assumptions on the behavior policy.[7]

These theoretical results support the practical observation that target networks stabilize training, while also highlighting that standard Polyak averaging may not always suffice; additional modifications (such as projections or regularization) may be necessary for guaranteed convergence.

## Limitations and alternatives

### Limitations

Target networks are effective but come with trade-offs:

1. **Increased memory usage:** Maintaining a second copy of the network doubles the memory required for network parameters. In algorithms like TD3 and SAC, which use multiple target networks, the overhead is even higher.
2. **Staleness:** The target network's predictions can become outdated relative to the online network's current knowledge. This lag slows the propagation of value estimates through the Bellman backup chain, potentially reducing learning speed.
3. **Additional hyperparameters:** The update frequency $$C$$ or coefficient $$\tau$$ introduces another hyperparameter that requires tuning and can affect performance.
4. **Incompatibility with online learning:** The need for a fixed target is at odds with purely online learning settings where the agent must learn and adapt in real time without a replay buffer.

### Alternatives and extensions

Several approaches have been proposed to address the limitations of standard target networks:

| Approach | Description | Reference |
|---|---|---|
| DeepMellow | Replaces the max operator with the Mellowmax operator, eliminating the need for a target network entirely | Kim et al. (2019)[8] |
| t-Soft update | Generalizes the soft update rule using a Student-t distribution to allow adaptive blending between hard and soft updates | Kobayashi and Ilboudo (2021)[13] |
| Functional regularization | Views the target network as an implicit regularizer and replaces it with explicit functional regularization | Piche et al. (2021)[14] |

The DeepMellow approach is particularly notable. Kim, Asadi, Littman, and Konidaris (2019) showed that replacing the max operator in Q-learning with the Mellowmax operator (a smooth approximation of the max) can stabilize learning without a target network, achieving competitive or superior performance in Atari games when the temperature parameter is properly tuned.[8]
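The Mellowmax operator itself is short enough to sketch. It is the logarithm of the mean of exponentiated values, scaled by a parameter written here as `omega`, with a positive value assumed. The implementation subtracts the maximum before exponentiating for numerical stability, which is an implementation choice of this sketch.

```python
import math


def mellowmax(values, omega):
    """mm_omega(x) = log(mean(exp(omega * x_i))) / omega, for omega > 0."""
    m = max(values)
    # Shift by the maximum so that every exponent is at most zero.
    mean_exp = sum(math.exp(omega * (v - m)) for v in values) / len(values)
    return m + math.log(mean_exp) / omega


q_next = [1.0, 2.0, 3.0]   # hypothetical Q-values for the next state
for omega in (0.001, 1.0, 100.0):
    print(omega, mellowmax(q_next, omega))
```

For small `omega` the operator behaves like the mean of the values, and for large `omega` it approaches the max, which is what makes it a smooth stand-in for the max in the TD target.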

## When should you use a target network?

Target networks are primarily used in **off-policy** deep reinforcement learning algorithms that combine neural network function approximation with bootstrapped value estimates. On-policy methods like A3C and PPO do not typically use target networks because they do not suffer from the same degree of instability.

General guidelines:

- **Use a target network** when training off-policy algorithms with value function bootstrapping (DQN, DDPG, TD3, SAC).
- **Use hard updates** for discrete action spaces with DQN-style algorithms.
- **Use soft updates** for continuous control with actor-critic methods.
- **Consider alternatives** when memory is limited, when faster learning speed is needed, or when operating in online learning settings.

## Historical context

The development of target networks follows the broader trajectory of stabilizing deep reinforcement learning:

| Year | Development | Authors |
|---|---|---|
| 1992 | Polyak averaging for stochastic optimization | Polyak and Juditsky |
| 2013 | Deep Q-learning with experience replay only (no target network), workshop paper | Mnih et al. |
| 2015 | DQN published in Nature; target network introduced ($$C = 10{,}000$$); human-level Atari performance | Mnih et al. |
| 2015 | DDPG introduces soft updates for target networks ($$\tau = 0.001$$) | Lillicrap et al. |
| 2016 | Double DQN uses target network to reduce overestimation | van Hasselt, Guez, Silver |
| 2018 | TD3 adds delayed updates and target policy smoothing | Fujimoto, van Hoof, Meger |
| 2018 | SAC combines target networks with entropy regularization | Haarnoja et al. |
| 2019 | DeepMellow removes the need for target networks | Kim et al. |
| 2021 | Convergence proofs showing that a target network with projected Polyak updates can break the deadly triad in the linear setting | Zhang, Yao, Whiteson |

## See also

- [DQN](/wiki/dqn)
- [Experience Replay](/wiki/experience_replay)
- [Q-Learning](/wiki/q-learning)
- [Reinforcement Learning](/wiki/reinforcement_learning)
- [Temporal-difference learning](/wiki/temporal_difference_learning)
- [Soft Actor-Critic](/wiki/soft_actor_critic)
- [Bellman Equation](/wiki/bellman_equation)
- [Replay Buffer](/wiki/replay_buffer)

## References

1. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., Graves, A., Riedmiller, M., Fidjeland, A.K., Ostrovski, G., et al. (2015). "Human-level control through deep reinforcement learning." *Nature*, 518(7540), 529-533.
2. Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. (2013). "Playing Atari with Deep Reinforcement Learning." *arXiv preprint arXiv:1312.5602* (NIPS Deep Learning Workshop).
3. Lillicrap, T.P., Hunt, J.J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. (2015). "Continuous control with deep reinforcement learning." *arXiv preprint arXiv:1509.02971* (ICLR 2016).
4. van Hasselt, H., Guez, A., and Silver, D. (2016). "Deep Reinforcement Learning with Double Q-learning." *Proceedings of the AAAI Conference on Artificial Intelligence*, 30(1).
5. Fujimoto, S., van Hoof, H., and Meger, D. (2018). "Addressing Function Approximation Error in Actor-Critic Methods." *Proceedings of the 35th International Conference on Machine Learning (ICML)*.
6. Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. (2018). "Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor." *Proceedings of the 35th International Conference on Machine Learning (ICML)*.
7. Zhang, S., Yao, H., and Whiteson, S. (2021). "Breaking the Deadly Triad with a Target Network." *Proceedings of the 38th International Conference on Machine Learning (ICML)*, 139, 12621-12631.
8. Kim, S., Asadi, K., Littman, M.L., and Konidaris, G. (2019). "DeepMellow: Removing the Need for a Target Network in Deep Q-Learning." *Proceedings of the 28th International Joint Conference on Artificial Intelligence (IJCAI)*.
9. Sutton, R.S. and Barto, A.G. (2018). *Reinforcement Learning: An Introduction*. 2nd edition. MIT Press.
10. Polyak, B.T. and Juditsky, A.B. (1992). "Acceleration of Stochastic Approximation by Averaging." *SIAM Journal on Control and Optimization*, 30(4), 838-855.
11. Fan, J., Wang, Z., Xie, Y., and Yang, Z. (2020). "A Theoretical Analysis of Deep Q-Learning." *Proceedings of the 2nd Annual Conference on Learning for Dynamics and Control (L4DC)*.
12. van Hasselt, H. (2010). "Double Q-learning." *Advances in Neural Information Processing Systems (NeurIPS)*, 23.
13. Kobayashi, T. and Ilboudo, W.E.L. (2021). "t-Soft Update of Target Network for Deep Reinforcement Learning." *Neural Networks*, 136, 63-71.
14. Piche, A., Thomas, V., Pardinas, R., and Pal, C. (2021). "Bridging the Gap Between Target Networks and Functional Regularization." *arXiv preprint arXiv:2106.02613*.