Twin Delayed DDPG
Last reviewed
May 9, 2026
Sources
21 citations
Review status
Source-backed
Revision
v2 · 6,449 words
Twin Delayed Deep Deterministic Policy Gradient (TD3) is an off-policy actor-critic reinforcement learning algorithm for continuous action spaces, introduced by Scott Fujimoto, Herke van Hoof, and David Meger at ICML 2018. It builds directly on DDPG and was designed to fix that algorithm's well-known tendency to overestimate Q-values, which often led to unstable learning and brittle policies. The paper, "Addressing Function Approximation Error in Actor-Critic Methods" (arXiv:1802.09477), introduced three changes to DDPG: clipped double Q-learning, delayed policy updates, and target policy smoothing. Together they turned DDPG from a finicky algorithm into one of the standard baselines for continuous control benchmarks.
TD3 has remained a default reference algorithm in continuous-control deep reinforcement learning for nearly a decade since its release. It is the algorithm most newer methods compare against on MuJoCo tasks, the foundation for several offline RL methods, and a frequent first choice for robotics simulation work in Isaac Lab and similar platforms. Its three modifications are conceptually small but each one targets a concrete failure mode of DDPG, and the combination is what made the difference.
| Field | Value |
|---|---|
| Full name | Twin Delayed Deep Deterministic Policy Gradient |
| Type | Off-policy actor-critic, model-free |
| Action space | Continuous |
| Policy | Deterministic |
| Authors | Scott Fujimoto, Herke van Hoof, David Meger |
| Affiliations | McGill University, University of Amsterdam |
| First released | February 2018 (arXiv) |
| Conference | ICML 2018 (PMLR 80) |
| Paper | arXiv:1802.09477 |
| Reference code | github.com/sfujim/TD3 (PyTorch) |
| Direct predecessor | DDPG |
| Sibling algorithm | Soft Actor-Critic (SAC) |
| License of reference code | MIT |
| Common framework | PyTorch |
DDPG (Lillicrap et al., 2015) was, at the time, the standard recipe for continuous control with deep networks. It maintained a deterministic policy and a single Q-network, both with target network copies updated by Polyak averaging, and trained off policy from a replay buffer. It worked, sometimes spectacularly, but it was notorious for being seed-sensitive and unstable. A run that hit 6,000 reward on HalfCheetah could be followed by another run on the same code that flatlined.
Fujimoto and colleagues traced much of the trouble back to a problem already familiar from discrete-action Q-learning: overestimation bias. When you take a maximum over noisy value estimates, the result is biased upward, because the maximum operation systematically picks out the actions whose value happened to be overestimated. In the discrete setting Double Q-learning (van Hasselt, 2010) and Double DQN (van Hasselt et al., 2016) had been the standard fixes. The TD3 paper proved that the same kind of bias also appears in deterministic policy gradients, even though there is no explicit max operator in the actor update. The policy improvement step implicitly maximizes the critic, and that is enough to introduce bias.
Worse, the paper showed that the natural Double DQN port to actor-critic does not really help, because the policy changes too slowly for the current and target value estimates to be independent. Something else was needed.
The deterministic policy gradient theorem (Silver et al., 2014) writes the policy update as:
grad_phi J(phi) = E_s [ grad_a Q(s, a)|_{a=pi_phi(s)} * grad_phi pi_phi(s) ]
The gradient of Q with respect to a tells the policy which direction increases value. If the critic systematically rates some actions higher than they really are, the policy will be pushed toward those actions even though the true environment return is lower. The next round of TD updates uses transitions from this slightly worse policy, the critic refits to overoptimistic targets, and the cycle compounds. Fujimoto et al. show this empirically by tracking the average critic prediction against a Monte Carlo estimate of the true return on standard MuJoCo tasks; for vanilla DDPG, the gap grows steadily over training.
The gap is not just a curiosity. A critic that drifts away from the true value function can mislead the actor into reward-free regions of state space, and once a deterministic policy collapses onto a bad action it can be slow to recover, since exploration noise in DDPG is small relative to the action range.
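As a toy illustration of the chain rule in the policy update, the sketch below checks the analytic deterministic policy gradient against a finite-difference estimate. The scalar critic `q_value` and the linear policy `policy` are invented for this example and do not come from the paper.

```python
# Toy check of the deterministic policy gradient (all functions are illustrative).

def q_value(state, action):
    # Hypothetical critic: value peaks when the action equals 2 * state.
    return -(action - 2.0 * state) ** 2

def policy(phi, state):
    # Hypothetical linear deterministic policy pi_phi(s) = phi * s.
    return phi * state

def dpg_gradient(phi, states):
    # grad_phi J = mean over s of grad_a Q(s, a)|_{a=pi(s)} * grad_phi pi_phi(s)
    total = 0.0
    for s in states:
        a = policy(phi, s)
        dq_da = -2.0 * (a - 2.0 * s)  # derivative of the critic w.r.t. the action
        dpi_dphi = s                  # derivative of the policy w.r.t. phi
        total += dq_da * dpi_dphi
    return total / len(states)

def objective(phi, states):
    # J(phi): average critic value of the policy's own actions.
    return sum(q_value(s, policy(phi, s)) for s in states) / len(states)

states = [0.5, 1.0, 1.5]
phi = 0.0
eps = 1e-6
numeric = (objective(phi + eps, states) - objective(phi - eps, states)) / (2 * eps)
analytic = dpg_gradient(phi, states)

# Gradient ascent on phi moves the policy toward the critic's peak (phi = 2).
for _ in range(200):
    phi += 0.1 * dpg_gradient(phi, states)
```

If the critic's peak were misplaced by approximation error, the same ascent would converge to the wrong action just as confidently, which is the failure mode described above.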
In tabular settings, Q-learning overestimation comes from the max operator over noisy value estimates: E[max_a Q_hat(s, a)] >= max_a E[Q_hat(s, a)] by Jensen's inequality, since the max is a convex function. Deterministic policy gradients do not have an explicit max, but the actor update is essentially climbing the critic's value surface. If the critic has approximation error, the actor learns to exploit it. Fujimoto et al. formalize this with an argument showing that, under standard assumptions, the actor-critic value estimate Q(s, pi(s)) exceeds the true value Q^pi(s, pi(s)) in expectation when both networks are trained on the same replay data.
This is more than a tabular curiosity, because in practice the critic is a neural network with millions of parameters fit to a few million transitions. Approximation noise is not optional and it does not cancel out.
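The upward bias of a max over noisy estimates is easy to reproduce numerically. The following sketch assumes ten actions that are all truly worth zero and zero-mean Gaussian estimation noise; both choices are illustrative rather than taken from the paper.

```python
import random

random.seed(0)

def mean_of_max(true_values, noise_std, trials):
    # Average, over many noisy draws, of the max over noisy action values.
    total = 0.0
    for _ in range(trials):
        noisy = [v + random.gauss(0.0, noise_std) for v in true_values]
        total += max(noisy)
    return total / trials

true_values = [0.0] * 10  # every action is truly worth 0
estimate = mean_of_max(true_values, noise_std=1.0, trials=20_000)

# Positive bias even though each individual estimate is unbiased.
bias = estimate - max(true_values)
```

With unit-variance noise the bias comes out at around 1.5, even though no action is better than any other.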
TD3 inherits the entire DDPG skeleton (deterministic actor, replay buffer, off-policy training, target networks updated with Polyak averaging) and changes three things.
TD3 trains two independent critics, Q_theta1 and Q_theta2, each with their own target network. The Bellman target shared by both critics is the minimum of the two target Q-values evaluated at the next state and the target policy's action:
y = r + gamma * min(Q_theta1'(s', pi_phi'(s')), Q_theta2'(s', pi_phi'(s')))
Taking the minimum is the "clipped" part. Plain Double Q-learning would use one critic to select an action and the other to evaluate it; here both critics are evaluated and the smaller value wins. The trick may bias estimates downward, but the paper argues this is the lesser evil. Underestimated actions are not propagated through the policy update, because the actor avoids low-value actions, while overestimated actions actively poison the policy. As a side effect, the min operator favors states with low-variance value estimates, which steers the policy away from regions where the critic is uncertain.
A convergence proof for the finite MDP case appears in the paper's supplementary material. The intuition is that if the two estimates have independent noise with mean zero, the min has a negative bias whose magnitude is bounded by the standard deviation of the noise. So overestimation turns into a controlled, small underestimation, which the actor can compensate for through more exploration.
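A small simulation makes the size of the downward bias concrete. This sketch assumes two critics whose errors are independent zero-mean Gaussians with the same standard deviation, which is an idealization; in that case the expected minimum sits at -sigma / sqrt(pi) below the true value.

```python
import math
import random

random.seed(0)

def mean_of_min(true_value, noise_std, trials):
    # Average of min(Q1, Q2) when both estimates share the same true value.
    total = 0.0
    for _ in range(trials):
        q1 = true_value + random.gauss(0.0, noise_std)
        q2 = true_value + random.gauss(0.0, noise_std)
        total += min(q1, q2)
    return total / trials

noise_std = 1.0
bias = mean_of_min(true_value=5.0, noise_std=noise_std, trials=50_000) - 5.0

# Expected min of two iid zero-mean Gaussians (assumed noise model).
closed_form = -noise_std / math.sqrt(math.pi)
```

The bias is negative and smaller in magnitude than the noise standard deviation, in line with the intuition above.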
Why not three or more critics? The paper tests an ablation with three critics and reports diminishing returns. Two critics are cheap (the second critic adds roughly 30% to backward pass cost since most layers are not shared) and capture most of the benefit. Later work, especially REDQ and TQC, revisits this question and shows that larger ensembles can pay off, but at a different cost trade-off.
The second change is also simple. The actor and the target networks are updated less often than the critics, typically once for every two critic updates. The justification is that policy improvement on a noisy critic produces a noisy gradient, which then makes the next critic update worse, and the cycle compounds. By letting the value estimate settle for a few steps before nudging the policy, TD3 reduces the variance of the policy update.
The practical recommendation in the paper is d = 2, meaning the actor and target networks update every other gradient step. The authors note that a larger d would yield a larger benefit in terms of accumulated error, but training the actor too rarely cripples learning, so 2 is the safe default.
In the ablation Figure 4 of the paper, removing the delay drops average HalfCheetah return from roughly 9,500 to about 7,000 over 1 million steps, and increases run-to-run variance noticeably. The delay also makes learning curves smoother visually, which helps debugging.
Deterministic policies tend to overfit narrow peaks in the value function. Pick a slightly different action and the critic might tell you it is much worse, even though in reality the values should be similar. Target policy smoothing is a regularization that adds clipped Gaussian noise to the target action before evaluating the next-state Q-value:
a_tilde = pi_phi'(s') + clip(N(0, sigma), -c, c)
y = r + gamma * min_i Q_theta_i'(s', a_tilde)
The noise forces the critic to fit a small region around the target action rather than a single point, which the paper notes is similar in spirit to a SARSA update. Defaults are sigma = 0.2 with the noise clipped to the interval [-0.5, 0.5] (assuming actions are scaled to [-1, 1]).
After the clipped noise is added, the resulting action is itself clipped to the valid action range, which matters for environments that reject out-of-range actions or saturate them silently. The smoothing noise is independent of the exploration noise added during data collection: smoothing happens only in the Bellman target computation.
A useful way to think about smoothing: the critic is being asked to predict the value of an expanded action distribution rather than a delta function. This makes the value function locally smoother, which is exactly what the policy gradient needs in order to produce stable updates.
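The two clipping steps and the min target can be sketched in plain Python for a single transition. The function names `smoothed_target_action` and `td3_target` are hypothetical, and the critic outputs are passed in as plain numbers rather than computed by networks.

```python
import random

def smoothed_target_action(target_action, sigma=0.2, noise_clip=0.5,
                           max_action=1.0, rng=random):
    # target_action: list of floats produced by the target actor for one state.
    smoothed = []
    for a in target_action:
        noise = rng.gauss(0.0, sigma)
        noise = max(-noise_clip, min(noise_clip, noise))      # clip the noise to [-c, c]
        a_tilde = a + noise
        a_tilde = max(-max_action, min(max_action, a_tilde))  # clip to the valid action range
        smoothed.append(a_tilde)
    return smoothed

def td3_target(reward, not_done, q1_next, q2_next, gamma=0.99):
    # Clipped double-Q target: use the smaller of the two target critic values.
    return reward + not_done * gamma * min(q1_next, q2_next)

rng = random.Random(0)
samples = [smoothed_target_action([0.9, -0.2], rng=rng) for _ in range(1000)]
y = td3_target(reward=1.0, not_done=1.0, q1_next=10.0, q2_next=8.0)
```

In this example `y` is 1.0 + 0.99 * 8.0 = 8.92, because the smaller critic value is the one that enters the target.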
The full algorithm as it appears in the paper:
    Initialize critic networks Q_theta1, Q_theta2 and actor network pi_phi
        with random parameters theta1, theta2, phi
    Initialize target networks: theta1' <- theta1, theta2' <- theta2, phi' <- phi
    Initialize replay buffer B
    for t = 1 to T:
        Select action with exploration noise:
            a ~ pi_phi(s) + epsilon, epsilon ~ N(0, sigma)
        Execute a, observe reward r and new state s'
        Store transition (s, a, r, s') in B
        Sample mini-batch of N transitions (s, a, r, s') from B
        a_tilde <- pi_phi'(s') + epsilon, epsilon ~ clip(N(0, sigma_tilde), -c, c)
        y <- r + gamma * min_{i=1,2} Q_theta_i'(s', a_tilde)
        Update critics:
            theta_i <- argmin_{theta_i} (1/N) * sum (y - Q_theta_i(s, a))^2
        if t mod d == 0:
            Update phi by the deterministic policy gradient:
                grad_phi J(phi) = (1/N) * sum grad_a Q_theta1(s, a)|_{a=pi_phi(s)} * grad_phi pi_phi(s)
            Update target networks:
                theta_i' <- tau * theta_i + (1 - tau) * theta_i'
                phi' <- tau * phi + (1 - tau) * phi'
    end for
A few details worth noting. The actor is trained only against Q_theta1, not against the minimum of the two critics, which keeps the policy gradient less conservative. Both target networks are updated on the delayed schedule along with the actor. Exploration noise during data collection is independent of the smoothing noise added inside the target.
In most implementations, the loop also includes a warmup phase: for the first 10,000 to 25,000 steps, actions are sampled uniformly from the action space rather than from the policy. This produces a more diverse initial replay buffer and avoids early collapse onto a poorly initialized policy.
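A minimal sketch of the action-selection logic with a warmup phase follows. The function name `select_action`, its parameters, and the stand-in policy are hypothetical; the 25,000-step default is simply one value from the range quoted above.

```python
import random

def select_action(policy_fn, state, step, start_steps=25_000,
                  expl_noise=0.1, max_action=1.0, action_dim=1, rng=random):
    if step < start_steps:
        # Warmup: ignore the policy and sample uniformly from the action box.
        return [rng.uniform(-max_action, max_action) for _ in range(action_dim)]
    # Afterwards: deterministic action plus Gaussian exploration noise, then clip.
    action = policy_fn(state)
    noisy = [a + rng.gauss(0.0, max_action * expl_noise) for a in action]
    return [max(-max_action, min(max_action, a)) for a in noisy]

def constant_policy(state):
    # Stand-in for a trained actor; always proposes an action near the bound.
    return [0.95]

rng = random.Random(0)
warm = select_action(constant_policy, state=None, step=0, rng=rng)
later = select_action(constant_policy, state=None, step=30_000, rng=rng)
```

The exploration noise here is separate from the smoothing noise used inside the target, matching the distinction drawn in the text.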
The central update step in the author's reference implementation looks like the following PyTorch sketch. State, action, reward, next-state, and not-done arrays come from a sampled minibatch, and actor, actor_target, critic, critic_target are the four networks.
    import torch
    import torch.nn.functional as F

    def td3_update(self, batch):
        state, action, next_state, reward, not_done = batch

        with torch.no_grad():
            # Target policy smoothing: clipped Gaussian noise on the target action
            noise = (
                torch.randn_like(action) * self.policy_noise
            ).clamp(-self.noise_clip, self.noise_clip)
            next_action = (
                self.actor_target(next_state) + noise
            ).clamp(-self.max_action, self.max_action)

            # Clipped double Q-learning: take the smaller target critic value
            target_Q1, target_Q2 = self.critic_target(next_state, next_action)
            target_Q = torch.min(target_Q1, target_Q2)
            target_Q = reward + not_done * self.discount * target_Q

        # Both critics regress toward the same target
        current_Q1, current_Q2 = self.critic(state, action)
        critic_loss = F.mse_loss(current_Q1, target_Q) + F.mse_loss(current_Q2, target_Q)
        self.critic_optimizer.zero_grad()
        critic_loss.backward()
        self.critic_optimizer.step()

        # Delayed policy update: only every policy_freq critic updates
        if self.total_it % self.policy_freq == 0:
            actor_loss = -self.critic.Q1(state, self.actor(state)).mean()
            self.actor_optimizer.zero_grad()
            actor_loss.backward()
            self.actor_optimizer.step()

            # Polyak averaging of the target networks, on the same delayed schedule
            for p, p_target in zip(self.critic.parameters(), self.critic_target.parameters()):
                p_target.data.mul_(1 - self.tau)
                p_target.data.add_(self.tau * p.data)
            for p, p_target in zip(self.actor.parameters(), self.actor_target.parameters()):
                p_target.data.mul_(1 - self.tau)
                p_target.data.add_(self.tau * p.data)

        self.total_it += 1
The critic class wraps two Q-networks and has a Q1 method that returns only the first head, used inside the actor loss. Using min(Q1, Q2) in the actor loss instead would make the policy update more pessimistic, and the paper found it slowed learning slightly.
The reference implementation uses small multi-layer perceptrons, the same shape for both actor and critics. In the paper, both use two hidden layers with 400 and 300 units, ReLU activations, and a tanh on the actor output to bound actions. The critics take the state and action concatenated as input to the first layer (unlike the original DDPG paper, which fed the action only into the second layer). The current public reference repository uses 256-256 hidden layers instead, the change being one of the "minor adjustments to hyperparameters" the README mentions.
Both networks are optimized with Adam.
A simplified PyTorch definition of the actor and the twin critic block:
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class Actor(nn.Module):
        def __init__(self, state_dim, action_dim, max_action, hidden=256):
            super().__init__()
            self.l1 = nn.Linear(state_dim, hidden)
            self.l2 = nn.Linear(hidden, hidden)
            self.l3 = nn.Linear(hidden, action_dim)
            self.max_action = max_action

        def forward(self, state):
            a = F.relu(self.l1(state))
            a = F.relu(self.l2(a))
            return self.max_action * torch.tanh(self.l3(a))

    class Critic(nn.Module):
        def __init__(self, state_dim, action_dim, hidden=256):
            super().__init__()
            self.l1 = nn.Linear(state_dim + action_dim, hidden)
            self.l2 = nn.Linear(hidden, hidden)
            self.l3 = nn.Linear(hidden, 1)
            self.l4 = nn.Linear(state_dim + action_dim, hidden)
            self.l5 = nn.Linear(hidden, hidden)
            self.l6 = nn.Linear(hidden, 1)

        def forward(self, state, action):
            sa = torch.cat([state, action], 1)
            q1 = F.relu(self.l1(sa))
            q1 = F.relu(self.l2(q1))
            q1 = self.l3(q1)
            q2 = F.relu(self.l4(sa))
            q2 = F.relu(self.l5(q2))
            q2 = self.l6(q2)
            return q1, q2

        def Q1(self, state, action):
            sa = torch.cat([state, action], 1)
            q1 = F.relu(self.l1(sa))
            q1 = F.relu(self.l2(q1))
            q1 = self.l3(q1)
            return q1
Image-based observations swap the MLP for a small CNN trunk (typically the Nature DQN architecture), but the rest of the algorithm is unchanged.
The defaults below match the values used in the original paper and the author's PyTorch reference. Some downstream libraries differ on minor points (most often layer width, batch size, and the warmup phase).
| Hyperparameter | Symbol | Paper default | Notes |
|---|---|---|---|
| Discount factor | gamma | 0.99 | Standard for MuJoCo |
| Soft target update rate | tau | 0.005 | Polyak averaging coefficient |
| Policy update delay | d | 2 | One actor update per two critic updates |
| Target policy noise std | sigma_tilde | 0.2 | Clipped Gaussian on target action |
| Target noise clip | c | 0.5 | Clip range [-c, c] |
| Exploration noise std | sigma | 0.1 | Gaussian, added to actor output during data collection |
| Replay buffer size | | 1,000,000 | Full history of the agent |
| Mini-batch size | N | 100 | Reference repo and SB3 use 256 |
| Optimizer | | Adam | Both actor and critics |
| Learning rate | | 1e-3 | Same for actor and critics; reference repo uses 3e-4 |
| Hidden layers (actor and critic) | | (400, 300) | Reference repo uses (256, 256) |
| Activations | | ReLU + tanh on actor output | |
| Random data collection | | 10,000 steps for HalfCheetah and Ant; 1,000 steps for the rest | Pure exploration warmup |
The headline numbers depend on a handful of choices that are easy to miss. Discount gamma at 0.99 is standard; pushing to 0.995 or above can help on long-horizon tasks but tends to amplify Q-function instability. The target update rate tau of 0.005 is a conservative Polyak factor; values around 0.01 train slightly faster but make the critic more reactive to noise. Increasing the policy delay d above 2 sometimes helps on easy tasks but starves the actor on harder ones.
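To get a feel for what tau means in practice, the sketch below computes how many target updates it takes for the old target weights' share to fall to one half. The helper names are hypothetical, and the weights are plain lists of floats rather than network parameters.

```python
import math

def polyak_update(target, online, tau=0.005):
    # theta' <- tau * theta + (1 - tau) * theta'
    return [tau * w + (1.0 - tau) * w_t for w, w_t in zip(online, target)]

def half_life(tau):
    # Number of updates after which the old target weights' share drops to 50%.
    return math.log(0.5) / math.log(1.0 - tau)

# Track a fixed online weight of 1.0 starting from a target weight of 0.0.
target, online = [0.0], [1.0]
for _ in range(139):
    target = polyak_update(target, online)
```

For tau = 0.005 the half-life is a little over 138 target updates, and doubling tau to 0.01 roughly halves it, which is one way to read "more reactive to noise".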
Exploration noise sigma at 0.1 is small relative to the [-1, 1] action range, which is fine when the policy is initialized near zero and starts moving meaningfully early in training. For long-horizon sparse reward tasks, replacing Gaussian exploration with Ornstein-Uhlenbeck noise (as in the original DDPG paper) or with parameter-space noise can help, though TD3 itself does not require either.
Replay buffer size of 1 million transitions is enough for the standard 1 million step training budget but should grow proportionally for longer runs. Smaller buffers (200,000 or so) can lead to overfitting on recent transitions, especially when combined with a high gradient-update-per-environment-step ratio.
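A fixed-capacity buffer that overwrites its oldest entries can be sketched as follows. This is a simplified list-based illustration with hypothetical names, not the array-backed buffer a production implementation would typically use.

```python
import random

class ReplayBuffer:
    def __init__(self, capacity=1_000_000):
        self.capacity = capacity
        self.storage = []
        self.ptr = 0  # index of the next slot to overwrite once the buffer is full

    def add(self, state, action, next_state, reward, not_done):
        transition = (state, action, next_state, reward, not_done)
        if len(self.storage) < self.capacity:
            self.storage.append(transition)
        else:
            self.storage[self.ptr] = transition  # overwrite the oldest transition
        self.ptr = (self.ptr + 1) % self.capacity

    def sample(self, batch_size, rng=random):
        # Uniform sampling with replacement over everything stored so far.
        return [rng.choice(self.storage) for _ in range(batch_size)]

    def __len__(self):
        return len(self.storage)

# Tiny capacity so the overwrite behaviour is visible.
buf = ReplayBuffer(capacity=3)
for i in range(5):
    buf.add(state=i, action=0.0, next_state=i + 1, reward=float(i), not_done=1.0)
kept_rewards = sorted(t[3] for t in buf.storage)
batch = buf.sample(4, rng=random.Random(0))
```

With a capacity of three and five insertions, only the three most recent transitions survive, which is the mechanism behind the recency effect of small buffers.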
The paper reports the maximum average return over 10 trials of 1 million environment steps. Results are on the original v1 MuJoCo tasks from OpenAI Gym, evaluated every 5,000 steps with 10 noise-free episodes per evaluation.
| Environment | TD3 | DDPG (baselines) | DDPG (our re-tune) | PPO | TRPO | ACKTR | SAC |
|---|---|---|---|---|---|---|---|
| HalfCheetah-v1 | 9636.95 +/- 859.07 | 3305.60 | 8577.29 | 1795.43 | -15.57 | 1450.46 | 2347.19 |
| Hopper-v1 | 3564.07 +/- 114.74 | 2020.46 | 1860.02 | 2164.70 | 2471.30 | 2428.39 | 2996.66 |
| Walker2d-v1 | 4682.82 +/- 539.64 | 1843.85 | 3098.11 | 3317.69 | 2321.47 | 1216.70 | 1283.67 |
| Ant-v1 | 4372.44 +/- 1000.33 | 1005.30 | 888.77 | 1083.20 | -75.85 | 1821.94 | 655.35 |
| Reacher-v1 | -3.60 +/- 0.56 | -6.51 | -4.01 | -6.18 | -111.43 | -4.26 | -4.44 |
| InvertedPendulum-v1 | 1000.00 +/- 0.00 | 1000.00 | 1000.00 | 1000.00 | 985.40 | 1000.00 | 1000.00 |
| InvertedDoublePendulum-v1 | 9337.47 +/- 14.96 | 9355.52 | 8369.95 | 8977.94 | 205.85 | 9081.92 | 8487.15 |
TD3 posted the highest score on five of the seven tasks, tied the maximum on InvertedPendulum, where the cap is the environment's reward ceiling, and finished marginally behind the baselines DDPG on InvertedDoublePendulum (9337.47 versus 9355.52). The SAC numbers in the original Table 1 reflect a now-superseded implementation; later tuned SAC code closes much of the gap, particularly on the harder tasks. The paper acknowledges this in a footnote and provides comparison numbers in its supplementary material.
Later third-party benchmarks on newer MuJoCo versions tell roughly the same story. CleanRL's TD3 implementation reaches around 9,583 on HalfCheetah-v4, 4,058 on Walker2d-v4, 3,135 on Hopper-v4, and 5,035 on Humanoid-v4 over three seeds.
The paper's Table 2 ablates each TD3 modification on HalfCheetah, Hopper, Walker2d, and Ant. The numbers below are the 10-seed average of the maximum return over 1 million steps.
| Variant | HalfCheetah | Hopper | Walker2d | Ant |
|---|---|---|---|---|
| TD3 (full) | 9532.99 | 3304.18 | 4565.24 | 4185.06 |
| TD3 minus delayed policy | 9412.35 | 2790.66 | 3853.34 | 4040.34 |
| TD3 minus target smoothing | 8775.91 | 1939.12 | 2952.46 | 4097.39 |
| TD3 minus clipped double Q | 7894.97 | 2266.36 | 4046.67 | 4063.07 |
| TD3 with single Q (DDPG-style) | 8538.56 | 2253.23 | 3522.74 | 3538.46 |
| AHE (delayed and smoothing only) | 8401.30 | 1652.65 | 4130.09 | 1944.61 |
No single component carries the result; the combination is what closes the gap. Removing target smoothing hurts most on Hopper and Walker2d, both of which have brittle dynamics that punish extreme actions. Removing clipped double Q hurts most on HalfCheetah, which runs for full 1,000-step episodes and gives overestimation the most room to accumulate. On Ant, each single-component removal stays within about 150 points of the full algorithm.
Figure 1 of the paper plots the average critic prediction (Q(s, pi(s))) against the true return measured by Monte Carlo rollouts. For DDPG, the predicted value floats around 1,500 while the true return sits near zero on HalfCheetah, growing to a gap of more than 10,000 by 1 million steps on Hopper. For TD3, the predicted value tracks the true return closely throughout training. This is the diagnostic the paper uses to argue that the algorithm actually fixes the bias rather than masking it.
| Algorithm | On/Off policy | Policy type | Action space | Sample efficiency | Key idea |
|---|---|---|---|---|---|
| TD3 | Off-policy | Deterministic | Continuous | High | Two critics with min target, delayed policy updates, target action smoothing |
| DDPG | Off-policy | Deterministic | Continuous | High but unstable | Deterministic actor with single critic and replay buffer |
| SAC | Off-policy | Stochastic | Continuous | High | Maximum-entropy objective with twin critics and reparameterized Gaussian policy |
| PPO | On-policy | Stochastic | Continuous and discrete | Lower per sample, very stable | Clipped surrogate objective, multiple epochs over each rollout |
| A3C | On-policy | Stochastic | Continuous and discrete | Low per sample | Asynchronous advantage actor-critic with parallel workers |
| DQN | Off-policy | Stochastic (epsilon-greedy) | Discrete | High but bounded | Q-learning with replay buffer and target network |
TD3 and SAC came out within a few months of each other in 2018 and tend to perform similarly on standard MuJoCo benchmarks, with SAC often having an edge on the harder tasks (Humanoid in particular) thanks to its entropy regularization. People still argue about which is the better default. PPO is the comparison algorithm everyone reaches for when they want stability or when the environment is cheap to simulate, since on-policy methods burn through far more samples but rarely diverge.
TD3 and SAC are often discussed as siblings. Both were published in 2018, both use twin critics with a min target, both are off-policy with replay, and both are designed for continuous action spaces. The differences are also instructive.
| Aspect | TD3 | SAC |
|---|---|---|
| Policy class | Deterministic, tanh-bounded | Stochastic Gaussian, tanh-squashed |
| Exploration | External Gaussian noise on actor output | Built-in policy entropy |
| Loss | Standard deterministic policy gradient | Soft policy gradient with entropy term |
| Critic targets | min(Q1, Q2) | min(Q1, Q2) minus alpha * log pi (adds an entropy bonus) |
| Tunables | Exploration sigma, smoothing sigma, delay d | Entropy temperature alpha (often auto-tuned) |
| Strengths | Simple, fast, very predictable on standard tasks | Robust to hyperparameters, often best on hard tasks |
| Weaknesses | Brittle exploration on sparse reward, no entropy | Slightly more compute per step, slower to debug |
In standard MuJoCo, SAC tends to outperform TD3 on Humanoid by a wide margin (roughly 9,000 versus 5,000 over 3 million steps), match it on Walker2d and Ant, and slightly underperform it on HalfCheetah. The auto-tuned entropy in modern SAC removes one of TD3's old advantages in setup simplicity, but TD3 is still typically a few percent faster per gradient step because its policy does not require sampling.
TD3 is sample efficient where PPO is wall-clock efficient. On a single MuJoCo environment with one CPU and one GPU, TD3 reaches a target score in roughly 1 million environment steps; PPO needs around 5 to 10 million for the same target. PPO catches up if you can run 16 or 32 environments in parallel, since it scales near-linearly with parallelism, while TD3 with a single replay buffer and one learning thread does not. For real robots and any setting where each environment step is expensive, TD3 is the better fit. For massively parallel simulation (Isaac Lab, Brax, EnvPool), PPO often wins on wall clock.
| Library | URL | Notes |
|---|---|---|
| Author's reference (PyTorch) | github.com/sfujim/TD3 | The canonical reference; README warns the current code differs slightly from the paper |
| Stable-Baselines3 | stable-baselines3.readthedocs.io | PyTorch; uses ReLU MlpPolicy to match the paper, batch size 256 |
| OpenAI Spinning Up | spinningup.openai.com/algorithms/td3 | PyTorch and TensorFlow versions; tutorial-style explanations |
| CleanRL | docs.cleanrl.dev/rl-algorithms/td3 | Single-file PyTorch implementations, reproducible benchmarks |
| Tianshou | github.com/thu-ml/tianshou | Modular PyTorch RL library, MuJoCo benchmarks at parity with the original |
| RLlib (Ray) | docs.ray.io/en/latest/rllib | Distributed RL library with TD3 in its catalog |
| Acme (DeepMind) | github.com/google-deepmind/acme | JAX and TensorFlow versions; modular agent components |
| Sample Factory | github.com/alex-petrenko/sample-factory | High-throughput PyTorch RL library, supports TD3 baseline runs |
For most users picking up TD3 for a project, Stable-Baselines3 or CleanRL are the easiest entry points. The author's reference is short enough to read end-to-end and is still the cleanest match to the paper's pseudocode.
A minimal TD3 training loop in Stable-Baselines3 on the standard Pendulum environment:
    import gymnasium as gym
    import numpy as np
    from stable_baselines3 import TD3
    from stable_baselines3.common.noise import NormalActionNoise

    env = gym.make("Pendulum-v1")
    n_actions = env.action_space.shape[-1]
    action_noise = NormalActionNoise(
        mean=np.zeros(n_actions), sigma=0.1 * np.ones(n_actions)
    )

    model = TD3(
        "MlpPolicy",
        env,
        action_noise=action_noise,
        learning_rate=3e-4,
        buffer_size=1_000_000,
        batch_size=256,
        tau=0.005,
        gamma=0.99,
        policy_delay=2,
        target_policy_noise=0.2,
        target_noise_clip=0.5,
        verbose=1,
    )
    model.learn(total_timesteps=200_000)
    model.save("td3_pendulum")
Replacing Pendulum-v1 with any continuous-action Gymnasium environment usually works without further changes, although harder tasks need longer training and sometimes wider networks (policy_kwargs=dict(net_arch=[400, 300])).
TD3 has been a launching pad for a handful of follow-up algorithms.
The core moves of TD3 (twin critics with min target, target action regularization) show up in nearly every modern off-policy continuous control algorithm in some form.
The transition from online to offline RL has been one of the more consequential changes to the field since 2020. TD3+BC is often the first thing tried on a new offline benchmark because the implementation is short. The policy loss in TD3+BC is:
actor_loss = -lambda * Q(s, pi(s)) + ||pi(s) - a_dataset||^2
where a_dataset is the action recorded in the offline dataset and lambda balances the value-maximization term against the cloning term. Fujimoto and Gu set lambda = 2.5 / mean(|Q|) so that it scales with the magnitude of the value function, removing one tunable. With this single change plus state normalization, TD3+BC matches or beats CQL, BCQ, and BRAC on most D4RL MuJoCo subsets.
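The loss and its normalization can be written out for a small batch in plain Python. This is a sketch of the formula as stated here, using the squared norm per sample averaged over the batch; the function name is hypothetical, and a real implementation would compute it on tensors and treat lambda as a constant with no gradient flowing through it.

```python
def td3_bc_actor_loss(q_values, policy_actions, dataset_actions, alpha=2.5):
    # q_values[i] = Q(s_i, pi(s_i)); each action is a list of floats.
    n = len(q_values)
    mean_abs_q = sum(abs(q) for q in q_values) / n
    lam = alpha / mean_abs_q  # lambda = 2.5 / mean(|Q|)
    q_term = -lam * sum(q_values) / n
    # Behaviour cloning term: squared distance to the dataset action, batch mean.
    bc_term = sum(
        sum((p - a) ** 2 for p, a in zip(pa, da))
        for pa, da in zip(policy_actions, dataset_actions)
    ) / n
    return q_term + bc_term

policy_actions = [[0.5], [0.0]]
dataset_actions = [[0.0], [0.0]]
loss_small_q = td3_bc_actor_loss([10.0, 20.0], policy_actions, dataset_actions)
loss_large_q = td3_bc_actor_loss([1000.0, 2000.0], policy_actions, dataset_actions)
```

Both calls return the same loss because the normalization cancels the overall scale of Q, which is the reason lambda does not need per-task tuning.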
TD3 has been used or evaluated in a range of continuous-control problems beyond MuJoCo benchmarks.
In each of these domains, TD3 is rarely the state of the art on its own. It is more often used as a starting point that gets extended with HER, distributional critics, or domain-specific reward shaping.
| Suite | Maintainer | Notes |
|---|---|---|
| MuJoCo Gym | Farama Foundation | Standard physics tasks; TD3 is a default baseline |
| DMControl Suite | DeepMind | DM-style task and observation specs; TD3 trained with image inputs |
| Meta-World | Stanford | 50 manipulation tasks; TD3 used as a non-meta baseline |
| D4RL | UC Berkeley | Offline benchmark; TD3+BC is among the standard baselines |
| Isaac Lab | NVIDIA | Massively parallel GPU simulation; TD3 supported via SKRL and rsl_rl |
| Robotics Gymnasium | Farama Foundation | Goal-conditioned manipulation; TD3 typically combined with HER |
A few patterns recur in production use of TD3.
Monitor mean(Q(s, a)) over training. If it grows unboundedly, smoothing noise is too small or the discount is too high. If it sits near zero forever, the actor is not getting useful gradients; check exploration noise.
TD3 does not solve every continuous control problem. Its main limitations:
TD3 became one of the most cited reinforcement learning papers of 2018 and quickly settled into the role of a default baseline for continuous control. Most papers that propose a new off-policy continuous control algorithm benchmark against either TD3, SAC, or both. The clipped double Q trick in particular has been adopted across the field, and even SAC implementations now use it by default.
In applications, TD3 and its descendants are widely used in robotics, including manipulation, mobile robot navigation, and path planning, where deterministic policies and continuous control fit naturally. Surveys of deep RL in robotics consistently list TD3 alongside SAC and PPO as the algorithms most commonly tried first.
The paper's broader contribution was probably methodological as much as algorithmic. It pushed the field to evaluate over more seeds, to take ablations seriously, and to be honest about the variance of deep RL results. The reproducibility complaints raised by Henderson et al. (2017), which the TD3 authors cite, were taken to heart. The 10-seed evaluation protocol used in the paper is closer to what later work treats as the bare minimum.
Google Scholar lists TD3 with more than 7,000 citations as of 2025, putting it in the same range as SAC and DDPG and in the top tier of post-2017 RL papers. The reference repository has been forked thousands of times and is one of the most copied teaching examples for continuous-control RL alongside OpenAI Spinning Up.
The clipped double Q trick is now the default in SAC, REDQ, TQC, DroQ, and most modern off-policy continuous control algorithms. Target action smoothing is less universal but has become standard in robotics-oriented codebases. The paper is also frequently cited in offline RL work, where overestimation under distribution shift is even more acute. CQL, BCQ, and IQL cite TD3 directly when motivating their approach to value pessimism.
TD3 sits in the lineage of deterministic policy gradient methods that began with the deterministic policy gradient theorem (Silver et al., 2014). For a deterministic policy pi_phi, the policy gradient is:
grad_phi J = E_{s ~ rho^pi} [ grad_a Q^pi(s, a)|_{a=pi_phi(s)} * grad_phi pi_phi(s) ]
where rho^pi is the discounted state visitation distribution. DPG was the theoretical foundation for DDPG, which in turn was the practical predecessor to TD3.
The overestimation analysis builds on the theory of Q-learning with function approximation (Thrun and Schwartz 1993; van Hasselt 2010). The TD3 paper extends this analysis to actor-critic by showing that even without an explicit max operator, the policy update implicitly takes a max-like step that is biased upward. Clipped double Q-learning is related to the broader technique of pessimism in value estimation, which appears in offline RL (CQL), exploration, and safe RL.