# Twin Delayed DDPG

> Source: https://aiwiki.ai/wiki/td3
> Updated: 2026-06-27
> Categories: Algorithms, Deep Learning, Reinforcement Learning
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**Twin Delayed Deep Deterministic Policy Gradient** (**TD3**) is an off-policy [actor-critic](/wiki/actor_critic) [reinforcement learning](/wiki/reinforcement_learning) algorithm for continuous action spaces, introduced by Scott Fujimoto, Herke van Hoof, and David Meger at [ICML](/wiki/icml) 2018 [1]. It builds directly on [DDPG](/wiki/ddpg) and was designed to fix that algorithm's well-known tendency to overestimate Q-values, which often led to unstable learning and brittle policies. The paper, "Addressing Function Approximation Error in Actor-Critic Methods" (arXiv:1802.09477), introduced three changes to DDPG: clipped double Q-learning, delayed policy updates, and target policy smoothing [1]. Together they turned DDPG from a finicky algorithm into one of the standard baselines for continuous control benchmarks.

The paper's own one-line summary of the result is direct: TD3 was evaluated "on the suite of OpenAI gym tasks, outperforming the state of the art in every environment tested" [1]. In numbers, that meant a maximum average return (over 10 trials) of 9636.95 on HalfCheetah-v1 versus 3305.60 for the standard DDPG baseline, and the top or tied-top score on six of the seven [MuJoCo](/wiki/mujoco) tasks reported at release [1].

TD3 has remained a default reference algorithm in continuous-control [deep reinforcement learning](/wiki/deep_reinforcement_learning) for nearly a decade since its release. It is the algorithm most newer methods compare against on [MuJoCo](/wiki/mujoco) tasks, the foundation for several offline RL methods, and a frequent first choice for robotics simulation work in [Isaac Lab](/wiki/isaac_lab) and similar platforms. Its three modifications are conceptually small but each one targets a concrete failure mode of [DDPG](/wiki/ddpg), and the combination is what made the difference.

## Infobox

| Field | Value |
|---|---|
| Full name | Twin Delayed Deep Deterministic Policy Gradient |
| Type | Off-policy actor-critic, model-free |
| Action space | Continuous |
| Policy | Deterministic |
| Authors | [Scott Fujimoto](/wiki/scott_fujimoto), Herke van Hoof, David Meger |
| Affiliations | [McGill University](/wiki/mcgill_university), University of Amsterdam |
| First released | February 2018 (arXiv) |
| Conference | [ICML](/wiki/icml) 2018 (PMLR 80) |
| Paper | arXiv:1802.09477 |
| Reference code | [github.com/sfujim/TD3](https://github.com/sfujim/TD3) (PyTorch) |
| Direct predecessor | [DDPG](/wiki/ddpg) |
| Sibling algorithm | [Soft Actor-Critic](/wiki/soft_actor_critic) (SAC) |
| License of reference code | MIT |
| Common framework | [PyTorch](/wiki/pytorch) |

## What is TD3?

TD3 is a model-free, off-policy actor-critic method that learns a single deterministic policy for continuous control while taming the value-overestimation problem that destabilizes its predecessor, DDPG. Mechanically, it keeps DDPG's structure (a deterministic actor, Polyak-averaged target networks for actor and critic, off-policy training from a [replay buffer](/wiki/replay_buffer)) and adds exactly three ingredients: a pair of critics whose minimum forms the Bellman target (clipped double Q-learning), an actor that is updated less often than the critics (delayed policy updates), and clipped Gaussian noise added to the target action (target policy smoothing) [1]. The abstract frames the contribution as building "on Double Q-learning, by taking the minimum value between a pair of critics to limit overestimation" [1].

## Why did DDPG need fixing?

[DDPG](/wiki/ddpg) (Lillicrap et al., 2015) was, at the time, the standard recipe for continuous control with deep networks [2]. It maintained a deterministic policy and a single Q-network, both with [target network](/wiki/target_network) copies updated by Polyak averaging, and trained off policy from a [replay buffer](/wiki/replay_buffer). It worked, sometimes spectacularly, but it was notorious for being seed-sensitive and unstable. A run that hit 6,000 reward on HalfCheetah could be followed by another run on the same code that flatlined.

Fujimoto and colleagues traced much of the trouble back to a problem already familiar from discrete-action [Q-learning](/wiki/q_learning): overestimation bias [1]. When you take a maximum over noisy value estimates, the result is biased upward, because the maximum operation systematically picks out the actions whose value happened to be overestimated. In the discrete setting [Double Q-learning](/wiki/double_q_learning) (van Hasselt, 2010) and Double DQN (van Hasselt et al., 2016) had been the standard fixes [3][4]. The TD3 paper proved that the same kind of bias also appears in deterministic policy gradients, even though there is no explicit max operator in the actor update. The policy improvement step implicitly maximizes the critic, and that is enough to introduce bias [1]. As OpenAI's Spinning Up tutorial puts it, in DDPG "the learned Q-function begins to dramatically overestimate Q-values, which then leads to the policy breaking" [12].

Worse, the paper showed that the natural Double DQN port to actor-critic does not really help, because the policy changes too slowly for the current and target value estimates to be independent [1]. Something else was needed.

### How does overestimation arise in deterministic policy gradients?

The deterministic policy gradient theorem (Silver et al., 2014) writes the policy update as [5]:

```
grad_phi J(phi) = E_s [ grad_a Q(s, a)|_{a=pi_phi(s)} * grad_phi pi_phi(s) ]
```

The gradient of Q with respect to a tells the policy which direction increases value. If the critic systematically rates some actions higher than they really are, the policy will be pushed toward those actions even though the true environment return is lower. The next round of TD updates uses transitions from this slightly worse policy, the critic refits to overoptimistic targets, and the cycle compounds. Fujimoto et al. show this empirically by tracking the average critic prediction against a Monte Carlo estimate of the true return on standard MuJoCo tasks; for vanilla DDPG, the gap grows steadily over training [1].

The gap is not just a curiosity. A critic that drifts away from the true value function can mislead the actor into reward-free regions of state space, and once a deterministic policy collapses onto a bad action it can be slow to recover, since exploration noise in DDPG is small relative to the action range.
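The feedback between critic error and actor update can be seen in a one-dimensional toy problem. The sketch below is illustrative only: the quadratic critic, the linear policy `a = w * s`, and the error term `err * a * s` are assumptions chosen so the gradient can be written by hand, and none of them come from the paper.

```python
# Toy deterministic policy gradient. The policy is a = w * s and the "true"
# critic is Q(s, a) = -(a - 2s)^2, so the best weight is w = 2.
# The optional term err * a * s stands in for critic approximation error.

def dq_da(s, a, err=0.0):
    """Gradient of the (possibly biased) critic with respect to the action."""
    return -2.0 * (a - 2.0 * s) + err * s

def train_actor(states, err=0.0, lr=0.1, steps=200):
    w = 0.0
    for _ in range(steps):
        # Chain rule: grad_w J = mean over states of dQ/da * da/dw, with da/dw = s.
        grad = sum(dq_da(s, w * s, err) * s for s in states) / len(states)
        w += lr * grad  # gradient ascent on the critic's value surface
    return w

states = [-1.0, 0.5, 1.0]
w_exact = train_actor(states)             # critic is exact
w_biased = train_actor(states, err=1.0)   # critic overrates large a * s
print(w_exact, w_biased)
```

With the exact critic the weight settles at 2. With the error term it settles at 2.5, which maximizes the biased critic but gives a lower true value: the actor has followed the critic's error rather than the environment.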

### Why does function approximation make the bias real?

In tabular settings, [Q-learning](/wiki/q_learning) overestimation comes from the max operator over noisy value estimates: `E[max_a Q_hat(s, a)] >= max_a E[Q_hat(s, a)]` by Jensen's inequality, since the max is a convex function [19]. Deterministic policy gradients do not have an explicit max, but the actor update is essentially climbing the critic's value surface. If the critic has approximation error, the actor learns to exploit it. Fujimoto et al. quantify this with a theorem (their Theorem 1) showing that under standard assumptions, the actor-critic value estimate `Q(s, pi(s))` exceeds the true value `Q^pi(s, pi(s))` in expectation when both networks are trained on the same replay data [1].

This is more than a tabular curiosity, because in practice the critic is a neural network fit to a finite set of replayed transitions. Approximation noise is unavoidable, and it does not cancel out.
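The upward bias of a max over noisy estimates is easy to check numerically. The following is a minimal simulation, assuming every true Q-value is exactly zero and each estimate carries independent zero-mean Gaussian noise; the function name and trial count are illustrative choices.

```python
import random

def mean_of_max(num_actions, noise_std, trials=20000, seed=0):
    """Average of max_a Q_hat(s, a) when every true Q-value is exactly 0."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Each estimate is the true value (0) plus zero-mean Gaussian noise.
        estimates = [rng.gauss(0.0, noise_std) for _ in range(num_actions)]
        total += max(estimates)
    return total / trials

bias_1 = mean_of_max(num_actions=1, noise_std=1.0)
bias_10 = mean_of_max(num_actions=10, noise_std=1.0)
print(bias_1, bias_10)
```

With a single action the average stays close to zero, because there is nothing to maximize over. With ten actions it is clearly positive (about 1.5 noise standard deviations), even though no action is actually better than any other.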

## What are the three tricks of TD3?

TD3 inherits the entire DDPG skeleton (deterministic actor, replay buffer, off-policy training, target networks updated with Polyak averaging) and changes three things [1][12]. The short version, in the words of OpenAI's Spinning Up: "TD3 learns two Q-functions instead of one (hence twin), and uses the smaller of the two Q-values to form the targets"; "TD3 updates the policy (and target networks) less frequently than the Q-function"; and "TD3 adds noise to the target action, to make it harder for the policy to exploit Q-function errors" [12].

### Trick 1: clipped double Q-learning

TD3 trains two independent critics, Q_theta1 and Q_theta2, each with their own target network. The Bellman target shared by both critics is the minimum of the two target Q-values evaluated at the next state and the target policy's action [1]:

```
y = r + gamma * min(Q_theta1'(s', pi_phi'(s')), Q_theta2'(s', pi_phi'(s')))
```

Taking the minimum is the "clipped" part. Plain Double Q-learning would use one critic to select an action and the other to evaluate it; here both critics are evaluated and the smaller value wins. The trick may bias estimates downward, but the paper argues this is the lesser evil. Underestimated actions are not propagated through the policy update, because the actor avoids low-value actions, while overestimated actions actively poison the policy [1]. As a side effect, the min operator favors states with low-variance value estimates, which steers the policy away from regions where the critic is uncertain.

A convergence proof for the finite MDP case appears in the paper's supplementary material. The intuition is that if the two estimates have independent noise with mean zero, the min has a negative bias whose magnitude is bounded by the standard deviation of the noise. So overestimation turns into a controlled, small underestimation, which the actor can compensate for through more exploration.
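The bounded negative bias of the min can be checked the same way. The sketch below assumes two independent Gaussian estimates of a true value of zero, and also shows the target computation from the formula above as a plain function; the function names and the `not_done` mask argument are illustrative.

```python
import random

def mean_of_min(noise_std, trials=20000, seed=0):
    """Average of min(Q1, Q2) when the true value is 0 and noise is independent."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        q1 = rng.gauss(0.0, noise_std)
        q2 = rng.gauss(0.0, noise_std)
        total += min(q1, q2)
    return total / trials

def clipped_double_q_target(reward, gamma, q1_next, q2_next, not_done=1.0):
    """y = r + gamma * min(Q1', Q2'), with bootstrapping masked on terminal steps."""
    return reward + not_done * gamma * min(q1_next, q2_next)

sigma = 1.0
bias = mean_of_min(sigma)
print(bias)                                            # negative, smaller than sigma in magnitude
print(clipped_double_q_target(1.0, 0.99, 10.0, 8.0))   # bootstraps from the smaller value, 8.0
```

For two independent Gaussian estimates the expected minimum is `-sigma / sqrt(pi)`, roughly `-0.56 * sigma`, so the simulated bias is negative but well inside the one-standard-deviation bound.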

Why not three or more critics? The paper tests an ablation with three critics and reports diminishing returns. Two critics are cheap (the second critic adds roughly 30% to backward pass cost since most layers are not shared) and capture most of the benefit. Later work, especially [REDQ](/wiki/redq) and TQC, revisits this question and shows that larger ensembles can pay off, but at a different cost trade-off [10][11].

### Trick 2: delayed policy updates

The second change is also simple. The actor and the target networks are updated less often than the critics, typically once for every two critic updates [1][12]. The justification is that policy improvement on a noisy critic produces a noisy gradient, which then makes the next critic update worse, and the cycle compounds. By letting the value estimate settle for a few steps before nudging the policy, TD3 reduces the variance of the policy update.

The practical recommendation in the paper is d = 2, meaning the actor and target networks update every other gradient step; Spinning Up summarizes this as "one policy update for every two Q-function updates" [12]. The authors note that a larger d would yield a larger benefit in terms of accumulated error, but training the actor too rarely cripples learning, so 2 is the safe default [1].

In the paper's ablation, removing the delay costs little on HalfCheetah (9412.35 versus 9532.99 for full TD3) but more on Hopper and Walker2d, as the table under "Ablation breakdown" shows [1]. The delay also tends to produce smoother learning curves, which helps debugging.
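The update schedule itself is only a modulo check. A minimal sketch, assuming the step counter starts at 1 as in the paper's pseudocode; `count_updates` is a hypothetical helper that only counts how often each part would be updated.

```python
def count_updates(total_steps, policy_delay=2):
    """Count critic updates and actor/target updates under a TD3-style delay."""
    critic_updates = 0
    actor_updates = 0
    for t in range(1, total_steps + 1):
        critic_updates += 1          # critics are updated at every step
        if t % policy_delay == 0:    # actor and targets only every d-th step
            actor_updates += 1
    return critic_updates, actor_updates

print(count_updates(1_000_000, policy_delay=2))  # (1000000, 500000)
```

With `d = 2` the actor and the target networks receive half as many updates as the critics over the same training run.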

### Trick 3: target policy smoothing

Deterministic policies tend to overfit narrow peaks in the value function. Pick a slightly different action and the critic might tell you it is much worse, even though in reality the values should be similar. Target policy smoothing is a regularization that adds clipped Gaussian noise to the target action before evaluating the next-state Q-value [1]:

```
a_tilde = pi_phi'(s') + clip(N(0, sigma), -c, c)
y      = r + gamma * min_i Q_theta_i'(s', a_tilde)
```

The noise forces the critic to fit a small region around the target action rather than a single point, which the paper notes is similar in spirit to a SARSA update [1]. Defaults are sigma = 0.2 with the noise clipped to the interval [-0.5, 0.5] (assuming actions are scaled to [-1, 1]).

After the clipped noise is added, the resulting action is itself clipped to the valid action range, which matters for environments that reject out-of-range actions or saturate them silently. The smoothing noise is independent of the exploration noise added during data collection: smoothing happens only in the Bellman target computation.

A useful way to think about smoothing: the critic is being asked to predict the value of an expanded action distribution rather than a delta function. This makes the value function locally smoother, which is exactly what the policy gradient needs in order to produce stable updates.
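The two clipping steps can be written out in a few lines. This is a plain-Python sketch using the default values quoted above (sigma = 0.2, c = 0.5, actions in [-1, 1]); the helper names and the example action vector are hypothetical.

```python
import random

def clip(x, low, high):
    return max(low, min(high, x))

def smoothed_target_action(policy_action, sigma=0.2, noise_clip=0.5,
                           max_action=1.0, rng=random):
    """Add clipped Gaussian noise per action dimension, then clip to the valid range."""
    smoothed = []
    for a in policy_action:
        noise = clip(rng.gauss(0.0, sigma), -noise_clip, noise_clip)
        smoothed.append(clip(a + noise, -max_action, max_action))
    return smoothed

rng = random.Random(0)
target_action = [0.9, -0.3, 0.0]   # hypothetical output of the target actor
samples = [smoothed_target_action(target_action, rng=rng) for _ in range(1000)]
print(samples[0])
```

Every sample stays within 0.5 of the target action and inside [-1, 1], so the critic target is averaged over a small neighbourhood rather than a single point.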

## What does the TD3 algorithm look like?

The full algorithm as it appears in the paper [1]:

```
Initialize critic networks Q_theta1, Q_theta2 and actor network pi_phi
  with random parameters theta1, theta2, phi
Initialize target networks: theta1' <- theta1, theta2' <- theta2, phi' <- phi
Initialize replay buffer B

for t = 1 to T:
    Select action with exploration noise:
        a ~ pi_phi(s) + epsilon,  epsilon ~ N(0, sigma)
    Execute a, observe reward r and new state s'
    Store transition (s, a, r, s') in B

    Sample mini-batch of N transitions (s, a, r, s') from B

    a_tilde <- pi_phi'(s') + epsilon,  epsilon ~ clip(N(0, sigma_tilde), -c, c)
    y       <- r + gamma * min_{i=1,2} Q_theta_i'(s', a_tilde)

    Update critics:
        theta_i <- argmin_{theta_i} (1/N) * sum (y - Q_theta_i(s, a))^2

    if t mod d == 0:
        Update phi by the deterministic policy gradient:
            grad_phi J(phi) = (1/N) * sum grad_a Q_theta1(s, a)|_{a=pi_phi(s)} * grad_phi pi_phi(s)
        Update target networks:
            theta_i' <- tau * theta_i + (1 - tau) * theta_i'
            phi'     <- tau * phi    + (1 - tau) * phi'
end for
```

A few details are worth noting. The actor is trained only against Q_theta1, not against the minimum of the two critics, which keeps the policy gradient less conservative. All three target networks (the two critic targets and the actor target) are updated on the delayed schedule along with the actor. Exploration noise during data collection is independent of the smoothing noise added inside the target.

In most implementations, the loop also includes a warmup phase: for an initial stretch of training (from 1,000 steps on the easier tasks in the paper up to about 25,000 steps in some implementations), actions are sampled uniformly from the action space rather than from the policy. This produces a more diverse initial replay buffer and avoids early collapse onto a poorly initialized policy.
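A warmup phase of this kind amounts to a branch in the action-selection step. The sketch below is one plausible way to write it, not the code of any particular library: `select_action`, `warmup_steps`, and `expl_sigma` are illustrative names, and actions are assumed to lie in `[-max_action, max_action]`.

```python
import random

def select_action(policy_action, step, warmup_steps=10_000,
                  expl_sigma=0.1, max_action=1.0, rng=random):
    """Uniform random actions during warmup, noisy policy actions afterwards."""
    if step < warmup_steps:
        # Pure exploration: ignore the policy and sample the action space uniformly.
        return [rng.uniform(-max_action, max_action) for _ in policy_action]
    # After warmup: policy output plus Gaussian exploration noise, clipped to range.
    noisy = [a + rng.gauss(0.0, expl_sigma) for a in policy_action]
    return [max(-max_action, min(max_action, a)) for a in noisy]

rng = random.Random(0)
warm = select_action([0.0, 0.0], step=0, rng=rng)
later = select_action([0.5, -0.5], step=20_000, rng=rng)
print(warm, later)
```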

## Reference PyTorch implementation

The central update step in the author's reference implementation looks like the following PyTorch sketch [8]. The state, action, next-state, reward, and not-done tensors come from a sampled minibatch, and `actor`, `actor_target`, `critic`, and `critic_target` are the four networks.

```python
import torch
import torch.nn.functional as F

def td3_update(self, batch):
    # Unpack a minibatch of transitions sampled from the replay buffer.
    state, action, next_state, reward, not_done = batch

    with torch.no_grad():
        # Target policy smoothing: clipped Gaussian noise on the target action,
        # followed by a clip to the valid action range.
        noise = (
            torch.randn_like(action) * self.policy_noise
        ).clamp(-self.noise_clip, self.noise_clip)
        next_action = (
            self.actor_target(next_state) + noise
        ).clamp(-self.max_action, self.max_action)

        # Clipped double Q-learning: bootstrap from the smaller target critic.
        target_Q1, target_Q2 = self.critic_target(next_state, next_action)
        target_Q = torch.min(target_Q1, target_Q2)
        target_Q = reward + not_done * self.discount * target_Q

    # Regress both critics onto the same target.
    current_Q1, current_Q2 = self.critic(state, action)
    critic_loss = F.mse_loss(current_Q1, target_Q) + F.mse_loss(current_Q2, target_Q)

    self.critic_optimizer.zero_grad()
    critic_loss.backward()
    self.critic_optimizer.step()

    # Delayed policy update: actor and targets change every policy_freq-th call.
    if self.total_it % self.policy_freq == 0:
        # Deterministic policy gradient through the first critic only.
        actor_loss = -self.critic.Q1(state, self.actor(state)).mean()

        self.actor_optimizer.zero_grad()
        actor_loss.backward()
        self.actor_optimizer.step()

        # Polyak averaging: target <- tau * online + (1 - tau) * target.
        for p, p_target in zip(self.critic.parameters(), self.critic_target.parameters()):
            p_target.data.mul_(1 - self.tau)
            p_target.data.add_(self.tau * p.data)
        for p, p_target in zip(self.actor.parameters(), self.actor_target.parameters()):
            p_target.data.mul_(1 - self.tau)
            p_target.data.add_(self.tau * p.data)

    self.total_it += 1
```

The critic class wraps two Q-networks and has a `Q1` method that returns only the first head, used inside the actor loss. Using `min(Q1, Q2)` in the actor loss instead would make the policy update more pessimistic; the sketch follows the paper's pseudocode, which optimizes the policy against the first critic only [1].

## What network architecture does TD3 use?

The reference implementation uses small multi-layer perceptrons, the same shape for both actor and critics. In the paper, both use two hidden layers with 400 and 300 units, ReLU activations, and a tanh on the actor output to bound actions [1]. The critics take the state and action concatenated as input to the first layer (unlike the original DDPG paper, which fed the action only into the second layer). The current public reference repository uses 256-256 hidden layers instead, the change being one of the "minor adjustments to hyperparameters" the README mentions [8].

Both networks are optimized with Adam.

A simplified PyTorch definition of the actor and the twin critic block:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Actor(nn.Module):
    def __init__(self, state_dim, action_dim, max_action, hidden=256):
        super().__init__()
        self.l1 = nn.Linear(state_dim, hidden)
        self.l2 = nn.Linear(hidden, hidden)
        self.l3 = nn.Linear(hidden, action_dim)
        self.max_action = max_action

    def forward(self, state):
        a = F.relu(self.l1(state))
        a = F.relu(self.l2(a))
        return self.max_action * torch.tanh(self.l3(a))

class Critic(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.l1 = nn.Linear(state_dim + action_dim, hidden)
        self.l2 = nn.Linear(hidden, hidden)
        self.l3 = nn.Linear(hidden, 1)
        self.l4 = nn.Linear(state_dim + action_dim, hidden)
        self.l5 = nn.Linear(hidden, hidden)
        self.l6 = nn.Linear(hidden, 1)

    def forward(self, state, action):
        sa = torch.cat([state, action], 1)
        q1 = F.relu(self.l1(sa)); q1 = F.relu(self.l2(q1)); q1 = self.l3(q1)
        q2 = F.relu(self.l4(sa)); q2 = F.relu(self.l5(q2)); q2 = self.l6(q2)
        return q1, q2

    def Q1(self, state, action):
        sa = torch.cat([state, action], 1)
        q1 = F.relu(self.l1(sa)); q1 = F.relu(self.l2(q1)); q1 = self.l3(q1)
        return q1
```

Image-based observations swap the MLP for a small CNN trunk (typically the [Nature DQN](/wiki/dqn) architecture), but the rest of the algorithm is unchanged.
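To give a sense of scale for the layer widths mentioned above, the parameter counts can be worked out directly. The sketch below assumes a 17-dimensional observation and a 6-dimensional action (HalfCheetah-sized inputs, chosen here only as an example); `mlp_param_count` is a hypothetical helper.

```python
def mlp_param_count(layer_sizes):
    """Weights plus biases of a fully connected network with these layer widths."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

state_dim, action_dim = 17, 6   # assumed example dimensions

# Paper sizes: hidden layers of 400 and 300 units.
actor_paper = mlp_param_count([state_dim, 400, 300, action_dim])
critic_paper = mlp_param_count([state_dim + action_dim, 400, 300, 1])

# Current reference repository: two hidden layers of 256 units.
actor_repo = mlp_param_count([state_dim, 256, 256, action_dim])
critic_repo = mlp_param_count([state_dim + action_dim, 256, 256, 1])

print(actor_paper, critic_paper)   # 129306 130201
print(actor_repo, critic_repo)     # 71942 72193
```

The critic figures are for a single Q-function; the twin critic holds two of them, so its total is twice that number.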

## What are the default hyperparameters for TD3?

The defaults below match the values used in the original paper and the author's PyTorch reference [1][8]. Some downstream libraries differ on minor points (most often layer width, batch size, and the warmup phase).

| Hyperparameter | Symbol | Paper default | Notes |
|---|---|---|---|
| Discount factor | gamma | 0.99 | Standard for MuJoCo |
| Soft target update rate | tau | 0.005 | Polyak averaging coefficient |
| Policy update delay | d | 2 | One actor update per two critic updates |
| Target policy noise std | sigma_tilde | 0.2 | Clipped Gaussian on target action |
| Target noise clip | c | 0.5 | Clip range [-c, c] |
| Exploration noise std | sigma | 0.1 | Gaussian, added to actor output during data collection |
| Replay buffer size | | 1,000,000 | Full history of the agent |
| Mini-batch size | N | 100 | Reference repo and SB3 use 256 |
| Optimizer | | Adam | Both actor and critics |
| Learning rate | | 1e-3 | Same for actor and critics; reference repo uses 3e-4 |
| Hidden layers (actor and critic) | | (400, 300) | Reference repo uses (256, 256) |
| Activations | | ReLU + tanh on actor output | |
| Random data collection | | 10,000 steps for HalfCheetah and Ant; 1,000 steps for the rest | Pure exploration warmup |

### Hyperparameter sensitivity in practice

The headline numbers depend on a handful of choices that are easy to miss. Discount gamma at 0.99 is standard; pushing to 0.995 or above can help on long-horizon tasks but tends to amplify Q-function instability. The target update rate tau of 0.005 is a conservative Polyak factor; values around 0.01 train slightly faster but make the critic more reactive to noise. Increasing the policy delay d above 2 sometimes helps on easy tasks but starves the actor on harder ones.

Exploration noise sigma at 0.1 is small relative to the [-1, 1] action range, which is fine when the policy is initialized near zero and starts moving meaningfully early in training. For long-horizon sparse reward tasks, replacing Gaussian exploration with [Ornstein-Uhlenbeck](/wiki/ornstein_uhlenbeck) noise (as in the original DDPG paper) or with parameter-space noise can help, though TD3 itself does not require either.

Replay buffer size of 1 million transitions is enough for the standard 1 million step training budget but should grow proportionally for longer runs. Smaller buffers (200,000 or so) can lead to overfitting on recent transitions, especially when combined with a high gradient-update-per-environment-step ratio.
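The effect of tau is easier to judge as a half-life: the number of soft updates after which less than half of a target network's weights trace back to its old values. A small sketch of that arithmetic, with `target_half_life` as a hypothetical helper name:

```python
def target_half_life(tau):
    """Soft updates needed until the old target weights' share falls below 50%."""
    share, updates = 1.0, 0
    while share >= 0.5:
        share *= (1.0 - tau)   # theta' <- tau * theta + (1 - tau) * theta'
        updates += 1
    return updates

print(target_half_life(0.005))  # 139
print(target_half_life(0.01))   # 69
```

Because target updates happen only on delayed steps, 139 target updates at `d = 2` correspond to 278 critic gradient steps. Doubling tau to 0.01 roughly halves that lag, which is why it reacts faster to both signal and noise in the critic.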

## How well does TD3 do on MuJoCo benchmarks?

The paper reports the maximum average return over 10 trials of 1 million environment steps. Results are on the original v1 [MuJoCo](/wiki/mujoco) tasks from [OpenAI Gym](/wiki/openai_gym), evaluated every 5,000 steps with 10 noise-free episodes per evaluation [1]. The headline claim in the abstract is that TD3 outperforms "the state of the art in every environment tested" [1].

| Environment | TD3 | DDPG (baselines) | DDPG (authors' re-tune) | [PPO](/wiki/ppo) | TRPO | ACKTR | SAC |
|---|---|---|---|---|---|---|---|
| HalfCheetah-v1 | **9636.95 +/- 859.07** | 3305.60 | 8577.29 | 1795.43 | -15.57 | 1450.46 | 2347.19 |
| Hopper-v1 | **3564.07 +/- 114.74** | 2020.46 | 1860.02 | 2164.70 | 2471.30 | 2428.39 | 2996.66 |
| Walker2d-v1 | **4682.82 +/- 539.64** | 1843.85 | 3098.11 | 3317.69 | 2321.47 | 1216.70 | 1283.67 |
| Ant-v1 | **4372.44 +/- 1000.33** | 1005.30 | 888.77 | 1083.20 | -75.85 | 1821.94 | 655.35 |
| Reacher-v1 | **-3.60 +/- 0.56** | -6.51 | -4.01 | -6.18 | -111.43 | -4.26 | -4.44 |
| InvertedPendulum-v1 | **1000.00 +/- 0.00** | 1000.00 | 1000.00 | 1000.00 | 985.40 | 1000.00 | 1000.00 |
| InvertedDoublePendulum-v1 | **9337.47 +/- 14.96** | 9355.52 | 8369.95 | 8977.94 | 205.85 | 9081.92 | 8487.15 |

TD3 had the highest score on five of the seven tasks, tied the maximum on InvertedPendulum, where 1000 is the environment's reward ceiling, and finished about 18 points behind the baseline DDPG on InvertedDoublePendulum (9337.47 versus 9355.52) [1]. The SAC numbers in the original Table 1 reflect a now-superseded implementation; later tuned SAC code closes much of the gap, particularly on the harder tasks [6]. The paper acknowledges this in a footnote and provides comparison numbers in its supplementary material.

Later third-party benchmarks on newer MuJoCo versions tell roughly the same story. CleanRL's TD3 implementation reaches around 9,583 on HalfCheetah-v4, 4,058 on Walker2d-v4, 3,135 on Hopper-v4, and 5,035 on Humanoid-v4 over three seeds [14].

### Ablation breakdown

The paper's Table 2 ablates each TD3 modification on HalfCheetah, Hopper, Walker2d, and Ant. The numbers below are the 10-seed average of the maximum return over 1 million steps [1].

| Variant | HalfCheetah | Hopper | Walker2d | Ant |
|---|---|---|---|---|
| TD3 (full) | 9532.99 | 3304.18 | 4565.24 | 4185.06 |
| TD3 minus delayed policy | 9412.35 | 2790.66 | 3853.34 | 4040.34 |
| TD3 minus target smoothing | 8775.91 | 1939.12 | 2952.46 | 4097.39 |
| TD3 minus clipped double Q | 7894.97 | 2266.36 | 4046.67 | 4063.07 |
| TD3 with single Q (DDPG-style) | 8538.56 | 2253.23 | 3522.74 | 3538.46 |
| AHE (DDPG with TD3's architecture, hyperparameters, and exploration) | 8401.30 | 1652.65 | 4130.09 | 1944.61 |

No single component carries the result; the combination is what closes the gap. In this table, removing target smoothing hurts most on Hopper and Walker2d, both of which have brittle dynamics that punish extreme actions. Removing clipped double Q hurts most on HalfCheetah, while on Ant none of the three single-component removals changes the score by more than about 150 points [1].

### Critic value tracking

Figure 1 of the paper plots the average critic prediction (`Q(s, pi(s))`) against the true return measured by Monte Carlo rollouts. For DDPG, the predicted value sits well above the true return, and the gap persists through 1 million training steps. For TD3, the predicted value tracks the true return much more closely throughout training [1]. This is the diagnostic the paper uses to argue that the algorithm actually fixes the bias rather than masking it.
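The diagnostic compares two quantities that are simple to compute once rollouts are available. A minimal sketch, assuming the critic predictions for the start states and the per-step rewards of the matching rollouts have already been collected; the function names and the example numbers are illustrative.

```python
def discounted_return(rewards, gamma=0.99):
    """Monte Carlo return: r_0 + gamma * r_1 + gamma^2 * r_2 + ..."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def value_estimate_bias(critic_predictions, episodes, gamma=0.99):
    """Mean critic prediction minus mean Monte Carlo return (positive = overestimate)."""
    mc_returns = [discounted_return(ep, gamma) for ep in episodes]
    mean_pred = sum(critic_predictions) / len(critic_predictions)
    mean_true = sum(mc_returns) / len(mc_returns)
    return mean_pred - mean_true

# Made-up numbers purely to exercise the functions.
episodes = [[1.0, 1.0, 1.0], [1.0]]
predictions = [3.0, 2.0]
print(value_estimate_bias(predictions, episodes, gamma=0.5))  # 1.125
```

A positive result means the critic is overestimating on average; tracking this number over training gives a curve in the spirit of the paper's figure.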

## How does TD3 compare with related algorithms?

| Algorithm | On/Off policy | Policy type | Action space | Sample efficiency | Key idea |
|---|---|---|---|---|---|
| TD3 | [Off-policy](/wiki/off_policy) | Deterministic | Continuous | High | Two critics with min target, delayed policy updates, target action smoothing |
| [DDPG](/wiki/ddpg) | Off-policy | Deterministic | Continuous | High but unstable | Deterministic actor with single critic and replay buffer |
| [SAC](/wiki/soft_actor_critic) | Off-policy | Stochastic | Continuous | High | Maximum-entropy objective with twin critics and reparameterized Gaussian policy |
| [PPO](/wiki/ppo) | On-policy | Stochastic | Continuous and discrete | Lower per sample, very stable | Clipped surrogate objective, multiple epochs over each rollout |
| [A3C](/wiki/a3c) | On-policy | Stochastic | Continuous and discrete | Low per sample | Asynchronous advantage actor-critic with parallel workers |
| [DQN](/wiki/dqn) | Off-policy | Greedy over Q-values (epsilon-greedy exploration) | Discrete | High but bounded | Q-learning with replay buffer and target network |

TD3 and SAC came out within a few months of each other in 2018 and tend to perform similarly on standard MuJoCo benchmarks, with SAC often having an edge on the harder tasks (Humanoid in particular) thanks to its entropy regularization [6]. People still argue about which is the better default. PPO is the comparison algorithm everyone reaches for when they want stability or when the environment is cheap to simulate, since on-policy methods burn through far more samples but rarely diverge [7].

### How does TD3 differ from SAC?

The two algorithms are often discussed as siblings. Both were published in 2018, both use twin critics with a min target, both are off policy with replay, and both are designed for continuous action spaces [1][6]. The differences are also instructive.

| Aspect | TD3 | SAC |
|---|---|---|
| Policy class | Deterministic, tanh-bounded | Stochastic Gaussian, tanh-squashed |
| Exploration | External Gaussian noise on actor output | Built-in policy entropy |
| Loss | Standard deterministic policy gradient | Soft policy gradient with entropy term |
| Critic targets | min(Q1, Q2) | min(Q1, Q2) minus entropy bonus |
| Tunables | Exploration sigma, smoothing sigma, delay d | Entropy temperature alpha (often auto-tuned) |
| Strengths | Simple, fast, very predictable on standard tasks | Robust to hyperparameters, often best on hard tasks |
| Weaknesses | Brittle exploration on sparse reward, no entropy | Slightly more compute per step, slower to debug |

In standard MuJoCo, SAC tends to outperform TD3 on Humanoid by a wide margin (roughly 9,000 versus 5,000 over 3 million steps), match it on Walker2d and Ant, and slightly underperform it on HalfCheetah [6]. The auto-tuned entropy in modern SAC removes one of TD3's old advantages in setup simplicity, but TD3 is still typically a few percent faster per gradient step because its policy does not require sampling.

### How does TD3 differ from PPO?

TD3 is sample efficient where [PPO](/wiki/ppo) is wall-clock efficient [7]. On a single MuJoCo environment with one CPU and one GPU, TD3 reaches a target score in roughly 1 million environment steps; PPO needs around 5 to 10 million for the same target. PPO catches up if you can run 16 or 32 environments in parallel, since it scales near-linearly with parallelism, while TD3 with a single replay buffer and one learning thread does not. For real robots and any setting where each environment step is expensive, TD3 is the better fit. For massively parallel simulation (Isaac Lab, [Brax](/wiki/brax), [EnvPool](/wiki/envpool)), PPO often wins on wall clock.

## What libraries implement TD3?

| Library | URL | Notes |
|---|---|---|
| Author's reference (PyTorch) | [github.com/sfujim/TD3](https://github.com/sfujim/TD3) | The canonical reference; README warns the current code differs slightly from the paper |
| [Stable-Baselines3](/wiki/stable_baselines3) | [stable-baselines3.readthedocs.io](https://stable-baselines3.readthedocs.io/en/master/modules/td3.html) | PyTorch; uses ReLU MlpPolicy to match the paper, batch size 256 |
| OpenAI Spinning Up | [spinningup.openai.com/algorithms/td3](https://spinningup.openai.com/en/latest/algorithms/td3.html) | PyTorch and TensorFlow versions; tutorial-style explanations |
| CleanRL | [docs.cleanrl.dev/rl-algorithms/td3](https://docs.cleanrl.dev/rl-algorithms/td3/) | Single-file PyTorch implementations, reproducible benchmarks |
| Tianshou | [github.com/thu-ml/tianshou](https://github.com/thu-ml/tianshou) | Modular PyTorch RL library, MuJoCo benchmarks at parity with the original |
| RLlib (Ray) | [docs.ray.io/en/latest/rllib](https://docs.ray.io/en/latest/rllib/index.html) | Distributed RL library with TD3 in its catalog |
| Acme (DeepMind) | [github.com/google-deepmind/acme](https://github.com/google-deepmind/acme) | JAX and TensorFlow versions; modular agent components |
| Sample Factory | [github.com/alex-petrenko/sample-factory](https://github.com/alex-petrenko/sample-factory) | High-throughput PyTorch RL library, supports TD3 baseline runs |

For most users picking up TD3 for a project, [Stable-Baselines3](/wiki/stable_baselines3) or CleanRL are the easiest entry points [13][14]. The author's reference is short enough to read end-to-end and is still the cleanest match to the paper's pseudocode [8].

### Stable-Baselines3 minimal example

A minimal TD3 training loop in [Stable-Baselines3](/wiki/stable_baselines3) on the standard Pendulum environment [13]:

```python
import gymnasium as gym
import numpy as np
from stable_baselines3 import TD3
from stable_baselines3.common.noise import NormalActionNoise

env = gym.make("Pendulum-v1")

n_actions = env.action_space.shape[-1]
action_noise = NormalActionNoise(
    mean=np.zeros(n_actions), sigma=0.1 * np.ones(n_actions)
)

model = TD3(
    "MlpPolicy",
    env,
    action_noise=action_noise,
    learning_rate=3e-4,
    buffer_size=1_000_000,
    batch_size=256,
    tau=0.005,
    gamma=0.99,
    policy_delay=2,
    target_policy_noise=0.2,
    target_noise_clip=0.5,
    verbose=1,
)

model.learn(total_timesteps=200_000)
model.save("td3_pendulum")
```

Replacing `Pendulum-v1` with any continuous-action [Gymnasium](/wiki/gymnasium) environment usually works without further changes, although harder tasks need longer training and sometimes wider networks (`policy_kwargs=dict(net_arch=[400, 300])`).

## What variants and extensions build on TD3?

TD3 has been a launching pad for a handful of follow-up algorithms.

* **TD3+BC** (Fujimoto and Gu, NeurIPS 2021) adapts TD3 to the offline RL setting by adding a behavior cloning regularizer to the policy loss and normalizing observations [9]. The change is described as "a few lines of code" and matches the performance of much more elaborate offline methods on the D4RL benchmark, which has made it a standard baseline.
* **TQC** (Truncated Quantile Critics, Kuznetsov et al., ICML 2020) replaces the min over two critics with quantile regression over a larger ensemble, dropping the top quantiles to control overestimation more flexibly [10]. It builds on the same intuition as TD3's clipped double Q trick.
* **REDQ** (Randomized Ensemble Double Q-Learning, Chen et al., ICLR 2021) keeps an ensemble of Q-networks (typically 10), samples a small subset to compute the target, and runs many gradient updates per environment step [11]. The result is sample efficiency that approaches model-based methods while staying model-free.
* **DroQ** and various dropout-regularized critic variants take the ensemble idea further while keeping a TD3-like backbone.
* **TD3+HER** combines TD3 with [Hindsight Experience Replay](/wiki/hindsight_experience_replay) (Andrychowicz et al., 2017) for sparse-reward goal-conditioned tasks, especially in robotics manipulation [16]. HER relabels stored transitions with goals that were actually reached, so failed episodes still yield rewarded examples that the TD3 critic can learn from.
* **D4PG** (Distributed Distributional DDPG, Barth-Maron et al., 2018) is a parallel cousin from DeepMind that uses distributional critics and multiple actors but does not include the clipped double Q trick [17]. Several later papers combine D4PG actors with TD3 targets.
* **CrossQ** (Bhatt et al., 2024) drops target networks entirely in favor of batch-renormalized critics, recovering most of TD3's stability with simpler bookkeeping [18].

The core moves of TD3 (twin critics with min target, target action regularization) show up in nearly every modern off-policy continuous control algorithm in some form.
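The two moves named above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the reference implementation: the function names are hypothetical, the next-state Q-values are passed in as plain arrays instead of coming from target networks, and the defaults (`sigma=0.2`, `noise_clip=0.5`) simply mirror the hyperparameters used in the SB3 example earlier in this article.

```python
import numpy as np


def smoothed_target_action(target_action, rng, sigma=0.2, noise_clip=0.5, max_action=1.0):
    """Target policy smoothing: add clipped Gaussian noise to the target action."""
    noise = rng.normal(0.0, sigma, size=target_action.shape)
    noise = np.clip(noise, -noise_clip, noise_clip)
    # Keep the perturbed action inside the valid action range.
    return np.clip(target_action + noise, -max_action, max_action)


def td3_target(reward, done, q1_next, q2_next, gamma=0.99):
    """Clipped double-Q target: r + gamma * (1 - done) * min(Q1', Q2')."""
    min_q = np.minimum(q1_next, q2_next)
    return reward + gamma * (1.0 - done) * min_q


rng = np.random.default_rng(0)

# Hypothetical batch of two transitions; the second one is terminal.
reward = np.array([1.0, 0.5])
done = np.array([0.0, 1.0])
q1_next = np.array([10.0, 4.0])  # stand-in for target critic 1 at (s', a')
q2_next = np.array([8.0, 6.0])   # stand-in for target critic 2 at (s', a')

y = td3_target(reward, done, q1_next, q2_next)
smoothed = smoothed_target_action(np.array([[0.9, -0.95]]), rng)
print(y)         # first entry uses min(10, 8) = 8; second ignores the bootstrap term
print(smoothed)  # stays within [-1, 1]
```

In a full agent, `q1_next` and `q2_next` would be the two target critics evaluated at the smoothed target action, and both critics would regress toward the same target `y`.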

### TD3 in offline reinforcement learning

The transition from online to offline RL has been one of the more consequential changes to the field since 2020. TD3+BC is often the first thing tried on a new offline benchmark because the implementation is short [9]. The policy loss in TD3+BC is:

```
actor_loss = -lambda * Q(s, pi(s)) + ||pi(s) - a_dataset||^2
```

where `a_dataset` is the action recorded in the offline dataset and lambda balances the value-maximization term against the cloning term. Fujimoto and Gu set lambda = 2.5 / mean(|Q|), with the mean taken over the batch, so that it scales with the magnitude of the value function, removing one tunable [9]. With this single change plus state normalization, TD3+BC matches or beats CQL, BCQ, and BRAC on most D4RL MuJoCo subsets.
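As a rough numerical sketch of that loss, the snippet below evaluates it on a tiny hand-made batch. The function name and the sample values are illustrative assumptions. It follows the formula as written here (squared norm per sample, averaged over the batch) and is not taken from the authors' code; in a real implementation lambda would be computed without gradient flow and the Q-values would come from the critic.

```python
import numpy as np


def td3_bc_actor_loss(q_values, policy_actions, dataset_actions, alpha=2.5):
    """actor_loss = -lambda * mean(Q) + mean(||pi(s) - a_dataset||^2),
    with lambda = alpha / mean(|Q|) computed over the batch."""
    lam = alpha / np.mean(np.abs(q_values))
    # Squared L2 distance per sample, then averaged over the batch.
    bc_term = np.mean(np.sum((policy_actions - dataset_actions) ** 2, axis=1))
    return -lam * np.mean(q_values) + bc_term


# Hypothetical batch of three one-dimensional actions.
q_values = np.array([2.0, -4.0, 6.0])             # Q(s, pi(s)) for each sample
policy_actions = np.array([[0.5], [0.0], [-0.5]])
dataset_actions = np.array([[0.0], [0.0], [0.5]])

loss = td3_bc_actor_loss(q_values, policy_actions, dataset_actions)
print(loss)
```

Because lambda divides by the mean absolute Q-value, rescaling every reward (and hence every Q-value) by a constant leaves the balance between the two terms unchanged, which is the point of the normalization.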

## What is TD3 used for?

TD3 has been used or evaluated in a range of continuous-control problems beyond MuJoCo benchmarks.

* **Robotic manipulation in simulation:** TD3 is one of the standard baselines in [Isaac Lab](/wiki/isaac_lab), [robosuite](/wiki/robosuite), and the [Robotics Gymnasium](/wiki/robotics_gymnasium) suite. Tasks include reaching, pushing, grasping, and door opening, often combined with HER.
* **Locomotion:** Legged-robot controllers in simulation, including quadruped gait synthesis and bipedal humanoid balance and walking. TD3 has been used as a comparison baseline in several papers on Cassie and ANYmal locomotion.
* **Sim-to-real robot control:** TD3 has been applied to real robotic arms (notably the Franka Emika Panda and the Sawyer) when combined with domain randomization. Real-robot training is rare because of sample requirements, but TD3 fine-tuning on top of behavior-cloned policies is common.
* **Autonomous driving simulation:** [CARLA](/wiki/carla) and [LGSVL](/wiki/lgsvl) studies use TD3 for steering and throttle control in lane-keeping and intersection navigation tasks. Performance gains over PPO are mixed and depend heavily on reward shaping.
* **Power grid and energy management:** TD3 has been applied to building HVAC control, electric vehicle charging coordination, and microgrid energy dispatch, where the action space is naturally continuous.
* **Network resource allocation:** Continuous bandwidth and power allocation problems in wireless networks, including beamforming for 5G and edge offloading.
* **Drone control:** Quadrotor stabilization and trajectory tracking, both in [Gazebo](/wiki/gazebo) simulation and on real hardware after sim-to-real transfer.
* **Finance:** Portfolio allocation and trade execution as continuous-action MDPs, although academic results on real markets remain controversial.

In each of these domains, TD3 is rarely the state of the art on its own. It is more often used as a starting point that gets extended with HER, distributional critics, or domain-specific reward shaping.

### Notable benchmark suites where TD3 appears

| Suite | Maintainer | Notes |
|---|---|---|
| MuJoCo Gym | Farama Foundation | Standard physics tasks; TD3 is a default baseline |
| DMControl Suite | DeepMind | DM-style task and observation specs; TD3 trained with image inputs |
| Meta-World | Stanford | 50 manipulation tasks; TD3 used as a non-meta baseline |
| D4RL | UC Berkeley | Offline benchmark; TD3+BC is among the standard baselines |
| Isaac Lab | NVIDIA | Massively parallel GPU simulation; TD3 available through the skrl integration |
| Robotics Gymnasium | Farama Foundation | Goal-conditioned manipulation; TD3 typically combined with HER |

## Practical guidance

A few patterns recur in production use of TD3.

* **Normalize observations.** TD3 is sensitive to scale. Either standardize observations to zero mean and unit variance with running statistics, or use layer normalization in the critic. Without normalization the critic can blow up early in training on environments with unbounded observation values.
* **Clip rewards or scale them.** Outsized rewards (one shot of +1000 in a sea of small rewards) destabilize the value function. Reward scaling or clipping to [-1, 1] is a common preprocessing step.
* **Set the random seed deliberately.** TD3 is more reproducible than DDPG but still benefits from setting numpy, torch, and gym seeds explicitly. Ten seeds is the de facto reporting standard for paper results [1][15].
* **Watch the critic.** Plot `mean(Q(s, a))` over training. If it grows unboundedly, smoothing noise is too small or the discount is too high. If it sits near zero forever, the actor is not getting useful gradients; check exploration noise.
* **Use a longer warmup on hard tasks.** 25,000 steps of pure-random data collection helps on Humanoid and Ant. Skipping the warmup can cause the policy to collapse onto a single bad mode.
* **Use SB3 for production, the reference repo for research.** SB3 has fewer footguns; the reference repo is closer to the paper math.
* **Do not increase the learning rate without a reason.** TD3 is fine at 3e-4 across most tasks. Going higher tends to win on easy tasks and lose on hard ones.
* **Replay buffer size grows with training budget.** A 1 million transition buffer is right for 1 million environment steps. Beyond that, scale linearly or you start training the critic on stale transitions.
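The observation-normalization advice at the top of this list can be sketched with a small running-statistics helper. This is an illustrative, hypothetical class rather than any library's API (SB3 users would more likely reach for its `VecNormalize` wrapper); it merges per-batch mean and variance into running estimates using the standard parallel-variance update.

```python
import numpy as np


class RunningNormalizer:
    """Tracks a running mean and variance and standardizes observations."""

    def __init__(self, shape, eps=1e-8):
        self.mean = np.zeros(shape)
        self.var = np.ones(shape)
        self.count = 0
        self.eps = eps

    def update(self, batch):
        """Merge a batch of observations (rows) into the running statistics."""
        batch = np.atleast_2d(batch)
        b_mean = batch.mean(axis=0)
        b_var = batch.var(axis=0)
        b_count = batch.shape[0]

        delta = b_mean - self.mean
        total = self.count + b_count
        new_mean = self.mean + delta * b_count / total
        # Combine the two sums of squared deviations plus a correction term.
        m2 = (self.var * self.count + b_var * b_count
              + delta ** 2 * self.count * b_count / total)
        self.mean, self.var, self.count = new_mean, m2 / total, total

    def normalize(self, obs):
        return (obs - self.mean) / np.sqrt(self.var + self.eps)


# Synthetic observations with a large offset and scale.
rng = np.random.default_rng(0)
observations = rng.normal(loc=5.0, scale=3.0, size=(1000, 3))

normalizer = RunningNormalizer(shape=(3,))
for chunk in np.array_split(observations, 7):
    normalizer.update(chunk)

normalized = normalizer.normalize(observations)
print(normalized.mean(axis=0), normalized.std(axis=0))  # close to 0 and 1
```

The same statistics should be applied at evaluation time, which usually means saving them alongside the policy weights.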

### Common failure modes

* **Critic divergence.** Q-values explode toward infinity over training. Typical causes are a gamma that is too high, target smoothing noise that is too small, or actor updates that are too frequent relative to critic updates. Lower gamma to 0.95, raise the target smoothing standard deviation (sigma_tilde) to 0.3, or move to SAC if the problem persists.
* **Policy collapse.** Actor outputs the same near-saturated action regardless of state. Caused by insufficient exploration, narrow value-function gradients, or warmup that is too short. Increase exploration noise, restart with a larger warmup, or check that the tanh output is not stuck at +/-1.
* **Reward hacking.** Policy finds a degenerate behavior with high reward but no useful skill. Not a TD3-specific issue but it shows up especially in sparse reward and shaped reward tasks.
* **Slow convergence on Humanoid.** TD3 is generally weaker than SAC on Humanoid because it lacks entropy regularization [6]. Either switch to SAC or add an entropy bonus on top of TD3, which moves the method toward SAC-style maximum-entropy variants.
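Two of these failure modes, critic divergence and policy collapse, are easy to watch for with simple logged statistics. The helpers below are a hypothetical sketch: the function names, the 1% saturation tolerance, and the window size are arbitrary choices for illustration, not established thresholds.

```python
import numpy as np


def saturation_fraction(actions, max_action=1.0, tol=0.01):
    """Fraction of action components within `tol` of the +/- max_action bound."""
    return float(np.mean(np.abs(actions) >= max_action * (1.0 - tol)))


def q_growth_ratio(q_history, window=100):
    """Mean |Q| over the latest window divided by the mean over the window before it.

    Returns None until at least two full windows have been logged.
    """
    q = np.abs(np.asarray(q_history, dtype=float))
    if q.size < 2 * window:
        return None
    recent = q[-window:].mean()
    previous = q[-2 * window:-window].mean()
    return float(recent / max(previous, 1e-8))


# Hypothetical logged data: a batch of actions and a history of mean Q-values.
actions = np.array([[1.0, -0.995], [0.2, 0.0]])
q_history = [1.0] * 100 + [3.0] * 100

print(saturation_fraction(actions))  # half of the components sit at the bound
print(q_growth_ratio(q_history))     # mean |Q| tripled between windows
```

A saturation fraction that stays near 1.0 across many states suggests a stuck tanh output, and a growth ratio that stays well above 1.0 for many consecutive windows is a hint that the critic may be diverging.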

## What are the limitations of TD3?

TD3 does not solve every continuous control problem. Its main limitations:

* **Deterministic policies are bad at multimodal tasks.** If the optimal behavior involves randomization (for instance, defensive driving or game-theoretic tasks), TD3's deterministic actor will collapse to one mode. SAC and other stochastic-policy methods handle this naturally [6].
* **Exploration is fragile.** Gaussian noise on the actor output is the simplest possible exploration scheme. On sparse-reward tasks it usually fails, which is why HER and curiosity-based methods are paired with TD3 in robotics [16].
* **Sensitive to reward scale.** Without reward normalization, the critic can blow up. The paper does not normalize rewards, but most downstream libraries do.
* **Not great in high-dimensional action spaces.** TD3 has been tested mostly with action dimension under 30. Beyond that, the deterministic policy gradient becomes brittle, partly because the smoothing noise covers a smaller fraction of the action space.
* **Struggles with sparse, long-horizon credit assignment.** The clipped double Q trick reduces overestimation bias but does not help with credit assignment over thousands of steps. Hierarchical RL methods or n-step returns are typical add-ons.
* **Single-task by design.** Multi-task and meta-RL settings need additional machinery. TD3 alone does not transfer.

## Reception and impact

TD3 became one of the most cited reinforcement learning papers of 2018 and quickly settled into the role of a default baseline for continuous control. Most papers that propose a new off-policy continuous control algorithm benchmark against either TD3, SAC, or both. The clipped double Q trick in particular has been adopted across the field, and even SAC implementations now use it by default [6].

In applications, TD3 and its descendants are widely used in robotics, including manipulation, mobile robot navigation, and path planning, where deterministic policies and continuous control fit naturally. Surveys of deep RL in robotics consistently list TD3 alongside SAC and PPO as the algorithms most commonly tried first.

The paper's broader contribution was probably methodological as much as algorithmic. It pushed the field to evaluate over more seeds, to take ablations seriously, and to be honest about the variance of deep RL results. The reproducibility complaints raised by Henderson et al. (2018), which the TD3 authors cite, were taken to heart [15]. The 10-seed evaluation protocol used in the paper is close to what later work treats as the bare minimum [1].

Google Scholar lists TD3 with more than 7,000 citations as of 2025, putting it in the same range as SAC and DDPG and in the top tier of post-2017 RL papers. The reference repository has been forked thousands of times and is one of the most copied teaching examples for continuous-control RL alongside [OpenAI](/wiki/openai) Spinning Up [12].

### Influence on later work

The clipped double Q trick is now the default in SAC, REDQ, TQC, DroQ, and most modern off-policy continuous control algorithms [6][10][11]. Target action smoothing is less universal but has become standard in robotics-oriented codebases. The paper is also frequently cited in offline RL work, where overestimation under distribution shift is even more acute. CQL, BCQ, and IQL cite TD3 directly when motivating their approach to value pessimism [9].

## Theoretical context

TD3 sits in the lineage of deterministic policy gradient methods that began with the deterministic policy gradient theorem (Silver et al., 2014) [5]. For a deterministic policy `pi_phi`, the policy gradient is:

```
grad_phi J = E_{s ~ rho^pi} [ grad_a Q^pi(s, a)|_{a=pi_phi(s)} * grad_phi pi_phi(s) ]
```

where `rho^pi` is the discounted state visitation distribution. DPG was the theoretical foundation for DDPG, which in turn was the practical predecessor to TD3 [2][5].
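The chain-rule structure of this gradient can be checked on a toy problem. The sketch below makes several simplifying assumptions that are not part of the theorem: a scalar linear policy `pi_phi(s) = phi * s`, a known quadratic `Q(s, a) = -(a - 2s)^2` in place of a learned critic, and a fixed set of sample states standing in for the state distribution. It compares the analytic product `grad_a Q * grad_phi pi` with a finite-difference estimate.

```python
import numpy as np


def policy(phi, s):
    """Toy deterministic policy: a = phi * s."""
    return phi * s


def q_value(s, a):
    """Toy critic whose best action in state s is a = 2 * s."""
    return -(a - 2.0 * s) ** 2


def dpg_gradient(phi, states):
    """Average of grad_a Q(s, a) at a = pi(s), times grad_phi pi(s)."""
    a = policy(phi, states)
    dq_da = -2.0 * (a - 2.0 * states)  # derivative of Q with respect to the action
    dpi_dphi = states                  # derivative of the policy with respect to phi
    return float(np.mean(dq_da * dpi_dphi))


def objective(phi, states):
    """J(phi) under the fixed set of sample states."""
    return float(np.mean(q_value(states, policy(phi, states))))


states = np.array([-1.0, 0.5, 2.0])
phi = 0.5

analytic = dpg_gradient(phi, states)
h = 1e-5
numeric = (objective(phi + h, states) - objective(phi - h, states)) / (2 * h)
print(analytic, numeric)  # the two estimates should agree closely
```

The gradient is positive here because `phi = 0.5` is below the optimum `phi = 2`, so a gradient ascent step moves the policy toward the action the critic prefers.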

The overestimation analysis builds on the theory of Q-learning with function approximation (Thrun and Schwartz 1993; van Hasselt 2010) [19][3]. The TD3 paper extends this analysis to actor-critic by showing that even without an explicit max operator, the policy update implicitly takes a max-like step that is biased upward [1]. Clipped double Q-learning is related to the broader technique of pessimism in value estimation, which appears in offline RL (CQL), exploration, and safe RL.
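The upward bias is easy to reproduce in a toy setting. The simulation below is an illustrative sketch, not an experiment from the paper: it assumes a discrete set of 10 actions whose true value is exactly zero and two independent critics whose estimates carry zero-mean Gaussian noise. It then compares a greedy max over one critic, the Double Q-learning estimate, and a clipped (minimum) estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials, n_actions, noise_std = 100_000, 10, 1.0

# True Q is 0 for every action; each critic sees truth plus independent noise.
q1 = rng.normal(0.0, noise_std, size=(n_trials, n_actions))
q2 = rng.normal(0.0, noise_std, size=(n_trials, n_actions))

rows = np.arange(n_trials)
greedy = q1.argmax(axis=1)  # action selected using critic 1

# Max over a single noisy critic: biased upward even though the noise is zero-mean.
single_max = q1.max(axis=1).mean()

# Double Q-learning: select with critic 1, evaluate with the independent critic 2.
double_estimate = q2[rows, greedy].mean()

# Clipped double Q: take the smaller of the two critics at the selected action.
clipped_estimate = np.minimum(q1, q2)[rows, greedy].mean()

print(f"single critic max : {single_max:.3f}")
print(f"double Q estimate : {double_estimate:.3f}")
print(f"clipped estimate  : {clipped_estimate:.3f}")
```

The single-critic maximum lands well above the true value of zero, the double estimate stays close to zero, and the clipped estimate errs slightly on the low side, which matches the pessimism the paragraph above describes.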

## ELI5: TD3 in plain language

Imagine you are coaching a robot to walk, and you have an assistant whose only job is to guess how good each move will be. DDPG used one assistant, but that assistant tended to be an optimist: it kept overrating moves, so the robot chased moves that looked great on paper and fell over in practice. TD3 fixes this with three common-sense rules. First, hire two assistants and always trust the more pessimistic one (clipped double Q), so the robot stops believing inflated promises. Second, do not change the robot's strategy after every single guess; wait a couple of rounds for the guesses to settle down (delayed policy updates). Third, when judging a planned move, also check a few moves that are slightly different, so the robot does not bet everything on one razor-thin sweet spot that might not really be there (target policy smoothing). None of these ideas is fancy on its own, but together they make a jittery learner into a dependable one.

## See also

* [Reinforcement learning](/wiki/reinforcement_learning)
* [Deep reinforcement learning](/wiki/deep_reinforcement_learning)
* [Actor-critic](/wiki/actor_critic) methods
* [DDPG](/wiki/ddpg)
* [Soft Actor-Critic](/wiki/soft_actor_critic)
* [PPO](/wiki/ppo)
* [Q-learning](/wiki/q_learning)
* [Double Q-learning](/wiki/double_q_learning)
* [DQN](/wiki/dqn)
* [Off-policy](/wiki/off_policy) learning
* [Replay buffer](/wiki/replay_buffer)
* [Experience replay](/wiki/experience_replay)
* [Target network](/wiki/target_network)
* [MuJoCo](/wiki/mujoco)
* [OpenAI Gym](/wiki/openai_gym)
* [Stable Baselines3](/wiki/stable_baselines3)
* [Hindsight Experience Replay](/wiki/hindsight_experience_replay)
* [Isaac Lab](/wiki/isaac_lab)

## References

1. Fujimoto, S., van Hoof, H., and Meger, D. (2018). "Addressing Function Approximation Error in Actor-Critic Methods." *Proceedings of the 35th International Conference on Machine Learning (ICML)*, PMLR 80. arXiv:[1802.09477](https://arxiv.org/abs/1802.09477).
2. Lillicrap, T. P. et al. (2015). "Continuous control with deep reinforcement learning." arXiv:[1509.02971](https://arxiv.org/abs/1509.02971). The DDPG paper.
3. van Hasselt, H. (2010). "Double Q-learning." *Advances in Neural Information Processing Systems (NeurIPS)*.
4. van Hasselt, H., Guez, A., and Silver, D. (2016). "Deep Reinforcement Learning with Double Q-Learning." *AAAI*.
5. Silver, D. et al. (2014). "Deterministic Policy Gradient Algorithms." *ICML*.
6. Haarnoja, T. et al. (2018). "Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor." *ICML*. arXiv:[1801.01290](https://arxiv.org/abs/1801.01290).
7. Schulman, J. et al. (2017). "Proximal Policy Optimization Algorithms." arXiv:[1707.06347](https://arxiv.org/abs/1707.06347).
8. Fujimoto, S. (2018). "TD3 reference implementation." GitHub: [sfujim/TD3](https://github.com/sfujim/TD3).
9. Fujimoto, S. and Gu, S. S. (2021). "A Minimalist Approach to Offline Reinforcement Learning." *NeurIPS*. arXiv:[2106.06860](https://arxiv.org/abs/2106.06860). The TD3+BC paper.
10. Kuznetsov, A. et al. (2020). "Controlling Overestimation Bias with Truncated Mixture of Continuous Distributional Quantile Critics." *ICML*. arXiv:[2005.04269](https://arxiv.org/abs/2005.04269).
11. Chen, X. et al. (2021). "Randomized Ensembled Double Q-Learning: Learning Fast Without a Model." *ICLR*. arXiv:[2101.05982](https://arxiv.org/abs/2101.05982).
12. OpenAI. "Twin Delayed DDPG." *Spinning Up in Deep RL* documentation. [spinningup.openai.com](https://spinningup.openai.com/en/latest/algorithms/td3.html).
13. Raffin, A. et al. (2021). "Stable-Baselines3: Reliable Reinforcement Learning Implementations." *Journal of Machine Learning Research*. Documentation: [stable-baselines3.readthedocs.io](https://stable-baselines3.readthedocs.io/en/master/modules/td3.html).
14. Huang, S. et al. (2022). "CleanRL: High-quality Single-file Implementations of Deep Reinforcement Learning Algorithms." *JMLR*. Documentation: [docs.cleanrl.dev](https://docs.cleanrl.dev/rl-algorithms/td3/).
15. Henderson, P. et al. (2018). "Deep Reinforcement Learning that Matters." *AAAI*. arXiv:[1709.06560](https://arxiv.org/abs/1709.06560).
16. Andrychowicz, M. et al. (2017). "Hindsight Experience Replay." *NeurIPS*. arXiv:[1707.01495](https://arxiv.org/abs/1707.01495).
17. Barth-Maron, G. et al. (2018). "Distributed Distributional Deterministic Policy Gradients." *ICLR*. arXiv:[1804.08617](https://arxiv.org/abs/1804.08617).
18. Bhatt, A. et al. (2024). "CrossQ: Batch Normalization in Deep Reinforcement Learning for Greater Sample Efficiency and Simplicity." *ICLR*. arXiv:[1902.05605](https://arxiv.org/abs/1902.05605).
19. Thrun, S. and Schwartz, A. (1993). "Issues in Using Function Approximation for Reinforcement Learning." *Proceedings of the 1993 Connectionist Models Summer School*.
20. Sutton, R. S. and Barto, A. G. (2018). *Reinforcement Learning: An Introduction*, second edition. MIT Press.
21. Levine, S. (2024). "CS285: Deep Reinforcement Learning." Course materials, UC Berkeley. [rail.eecs.berkeley.edu/deeprlcourse](https://rail.eecs.berkeley.edu/deeprlcourse/).

