Twin Delayed DDPG

Twin Delayed Deep Deterministic Policy Gradient (TD3) is an off-policy actor-critic reinforcement learning algorithm for continuous action spaces, introduced by Scott Fujimoto, Herke van Hoof, and David Meger at ICML 2018. It builds directly on DDPG and was designed to fix that algorithm's well-known tendency to overestimate Q-values, which often led to unstable learning and brittle policies. The paper, "Addressing Function Approximation Error in Actor-Critic Methods" (arXiv:1802.09477), introduced three changes to DDPG: clipped double Q-learning, delayed policy updates, and target policy smoothing. Together they turned DDPG from a finicky algorithm into one of the standard baselines for continuous control benchmarks.

TD3 has remained a default reference algorithm in continuous-control deep reinforcement learning for nearly a decade since its release. It is the algorithm most newer methods compare against on MuJoCo tasks, the foundation for several offline RL methods, and a frequent first choice for robotics simulation work in Isaac Lab and similar platforms. Its three modifications are conceptually small but each one targets a concrete failure mode of DDPG, and the combination is what made the difference.

Infobox

Field	Value
Full name	Twin Delayed Deep Deterministic Policy Gradient
Type	Off-policy actor-critic, model-free
Action space	Continuous
Policy	Deterministic
Authors	Scott Fujimoto, Herke van Hoof, David Meger
Affiliations	McGill University, University of Amsterdam
First released	February 2018 (arXiv)
Conference	ICML 2018 (PMLR 80)
Paper	arXiv:1802.09477
Reference code	github.com/sfujim/TD3 (PyTorch)
Direct predecessor	DDPG
Sibling algorithm	Soft Actor-Critic (SAC)
License of reference code	MIT
Common framework	PyTorch

Background: why DDPG needed fixing

DDPG (Lillicrap et al., 2015) was, at the time, the standard recipe for continuous control with deep networks. It maintained a deterministic policy and a single Q-network, both with target network copies updated by Polyak averaging, and trained off policy from a replay buffer. It worked, sometimes spectacularly, but it was notorious for being seed-sensitive and unstable. A run that hit 6,000 reward on HalfCheetah could be followed by another run on the same code that flatlined.

Fujimoto and colleagues traced much of the trouble back to a problem already familiar from discrete-action Q-learning: overestimation bias. When you take a maximum over noisy value estimates, the result is biased upward, because the maximum operation systematically picks out the actions whose value happened to be overestimated. In the discrete setting Double Q-learning (van Hasselt, 2010) and Double DQN (van Hasselt et al., 2016) had been the standard fixes. The TD3 paper proved that the same kind of bias also appears in deterministic policy gradients, even though there is no explicit max operator in the actor update. The policy improvement step implicitly maximizes the critic, and that is enough to introduce bias.

Worse, the paper showed that the natural Double DQN port to actor-critic does not really help, because the policy changes too slowly for the current and target value estimates to be independent. Something else was needed.

How overestimation arises in deterministic policy gradients

The deterministic policy gradient theorem (Silver et al., 2014) writes the policy update as:

grad_phi J(phi) = E_s [ grad_a Q(s, a)|_{a=pi_phi(s)} * grad_phi pi_phi(s) ]

The gradient of Q with respect to a tells the policy which direction increases value. If the critic systematically rates some actions higher than they really are, the policy will be pushed toward those actions even though the true environment return is lower. The next round of TD updates uses transitions from this slightly worse policy, the critic refits to overoptimistic targets, and the cycle compounds. Fujimoto et al. show this empirically by tracking the average critic prediction against a Monte Carlo estimate of the true return on standard MuJoCo tasks; for vanilla DDPG, the gap grows steadily over training.

The gap is not just a curiosity. A critic that drifts away from the true value function can mislead the actor into reward-free regions of state space, and once a deterministic policy collapses onto a bad action it can be slow to recover, since exploration noise in DDPG is small relative to the action range.
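The chain rule in the policy gradient formula above can be traced on a toy problem. The sketch below is illustrative only: the quadratic critic, the one-parameter linear policy, the learning rate, and all function names are assumptions made for the example, not anything taken from the paper.

```python
# Toy 1-D problem (hypothetical): the critic is Q(s, a) = -(a - 2s)^2, so the
# best action in state s is a = 2s. The policy is pi_phi(s) = phi * s.

def dq_da(s, a):
    """Gradient of the toy critic with respect to the action."""
    return -2.0 * (a - 2.0 * s)

def dpi_dphi(s):
    """Gradient of the linear policy pi_phi(s) = phi * s with respect to phi."""
    return s

def dpg_step(phi, states, lr=0.1):
    """One gradient-ascent step: mean over states of dQ/da * dpi/dphi."""
    grad = sum(dq_da(s, phi * s) * dpi_dphi(s) for s in states) / len(states)
    return phi + lr * grad

def train(phi=0.0, steps=200):
    """Repeat the update on a fixed batch of states and return the final phi."""
    states = [-1.0, -0.5, 0.5, 1.0]
    for _ in range(steps):
        phi = dpg_step(phi, states)
    return phi

print(train())  # converges toward 2.0, the maximizer of the toy critic
```

The policy ends up wherever the critic's action gradient points, which is why any systematic error in the critic transfers directly into the policy.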

Why function approximation makes the bias real

In tabular settings, Q-learning overestimation comes from the max operator over noisy value estimates: E[max_a Q_hat(s, a)] >= max_a E[Q_hat(s, a)], by Jensen's inequality, because the max is a convex function. Deterministic policy gradients do not have an explicit max, but the actor update is essentially climbing the critic's value surface. If the critic has approximation error, the actor learns to exploit it. Fujimoto et al. formalize this with an argument showing that, under mild assumptions, the value estimate Q(s, pi(s)) exceeds the true value Q^pi(s, pi(s)) in expectation once the actor has been updated against the approximate critic.

This is more than a tabular curiosity, because in practice the critic is a neural network, with on the order of a hundred thousand parameters in the standard TD3 architectures, fit to a replay buffer of at most about a million transitions. Approximation noise is unavoidable, and it does not cancel out.
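The tabular inequality is easy to check numerically. The following sketch assumes a single state in which every action has a true value of zero and each estimate carries independent zero-mean Gaussian noise; the function name and default values are chosen for illustration.

```python
import random
import statistics

def max_bias(num_actions=10, noise_std=1.0, trials=20_000, seed=0):
    """Monte Carlo estimate of E[max_a Q_hat(s, a)] when every true Q-value is 0.

    Each Q_hat(s, a) is pure zero-mean noise, so max_a E[Q_hat(s, a)] = 0 and
    any positive result is overestimation introduced by the max alone.
    """
    rng = random.Random(seed)
    maxima = [
        max(rng.gauss(0.0, noise_std) for _ in range(num_actions))
        for _ in range(trials)
    ]
    return statistics.fmean(maxima)

print(max_bias(num_actions=1))   # close to 0: no max, no bias
print(max_bias(num_actions=10))  # clearly positive: the max picks lucky noise
```

The bias grows with both the number of actions and the noise level, even though each individual estimate is unbiased.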

The three modifications

TD3 inherits the entire DDPG skeleton (deterministic actor, replay buffer, off-policy training, target networks updated with Polyak averaging) and changes three things.

Clipped double Q-learning

TD3 trains two independent critics, Q_theta1 and Q_theta2, each with their own target network. The Bellman target shared by both critics is the minimum of the two target Q-values evaluated at the next state and the target policy's action:

y = r + gamma * min(Q_theta1'(s', pi_phi'(s')), Q_theta2'(s', pi_phi'(s')))

Taking the minimum is the "clipped" part. Plain Double Q-learning would use one critic to select an action and the other to evaluate it; here both critics are evaluated and the smaller value wins. The trick may bias estimates downward, but the paper argues this is the lesser evil. Underestimated actions are not propagated through the policy update, because the actor avoids low-value actions, while overestimated actions actively poison the policy. As a side effect, the min operator favors states with low-variance value estimates, which steers the policy away from regions where the critic is uncertain.

A convergence proof for the finite MDP case appears in the paper's supplementary material. The intuition is that if the two estimates have independent noise with mean zero, the min has a negative bias whose magnitude is bounded by the standard deviation of the noise. So overestimation turns into a controlled, small underestimation, which the actor can compensate for through more exploration.
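The size of the negative bias can be estimated with a short simulation. The sketch below assumes the idealized case described above (two estimates of a true value of zero with independent zero-mean Gaussian noise); for that case the expected minimum is known analytically to be -sigma / sqrt(pi). The function names are illustrative.

```python
import math
import random
import statistics

def clipped_target(reward, gamma, q1_next, q2_next):
    """Clipped double-Q target: bootstrap from the smaller target estimate."""
    return reward + gamma * min(q1_next, q2_next)

def min_bias(noise_std=1.0, trials=50_000, seed=0):
    """Monte Carlo estimate of E[min(Q1_hat, Q2_hat)] when the true value is 0."""
    rng = random.Random(seed)
    minima = [
        min(rng.gauss(0.0, noise_std), rng.gauss(0.0, noise_std))
        for _ in range(trials)
    ]
    return statistics.fmean(minima)

print(min_bias())                   # simulated bias for sigma = 1
print(-1.0 / math.sqrt(math.pi))    # analytic value, about -0.564
print(clipped_target(1.0, 0.99, q1_next=10.0, q2_next=8.0))  # uses 8.0
```

Under these assumptions the underestimate is a little over half a standard deviation of the noise, which is the sense in which it is bounded.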

Why not three or more critics? The paper tests an ablation with three critics and reports diminishing returns. Two critics are cheap (the second critic adds roughly 30% to backward pass cost since most layers are not shared) and capture most of the benefit. Later work, especially REDQ and TQC, revisits this question and shows that larger ensembles can pay off, but at a different cost trade-off.

Delayed policy updates

The second change is also simple. The actor and the target networks are updated less often than the critics, typically once for every two critic updates. The justification is that policy improvement on a noisy critic produces a noisy gradient, which then makes the next critic update worse, and the cycle compounds. By letting the value estimate settle for a few steps before nudging the policy, TD3 reduces the variance of the policy update.

The practical recommendation in the paper is d = 2, meaning the actor and target networks update every other gradient step. The authors note that a larger d would yield a larger benefit in terms of accumulated error, but training the actor too rarely cripples learning, so 2 is the safe default.

In the paper's ablation (Table 2), removing the delay changes the HalfCheetah result only marginally over 1 million steps, while Hopper drops by several hundred points. The delay also tends to make learning curves visibly smoother, which helps debugging.

Target policy smoothing

Deterministic policies tend to overfit narrow peaks in the value function. Pick a slightly different action and the critic might tell you it is much worse, even though in reality the values should be similar. Target policy smoothing is a regularization that adds clipped Gaussian noise to the target action before evaluating the next-state Q-value:

a_tilde = pi_phi'(s') + clip(N(0, sigma), -c, c)
y      = r + gamma * min_i Q_theta_i'(s', a_tilde)

The noise forces the critic to fit a small region around the target action rather than a single point, an idea the paper describes as resembling Expected SARSA. The defaults are sigma = 0.2, with the noise clipped to the interval [-0.5, 0.5] (assuming actions are scaled to [-1, 1]).

After the clipped noise is added, the perturbed action is itself clipped to the valid action range, which matters for environments that reject out-of-range actions or saturate them silently. The smoothing noise is independent of the exploration noise added during data collection: smoothing happens only in the Bellman target computation.

A useful way to think about smoothing: the critic is being asked to predict the value of an expanded action distribution rather than a delta function. This makes the value function locally smoother, which is exactly what the policy gradient needs in order to produce stable updates.
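The two clipping steps can be written out in a few lines. This is a minimal sketch in plain Python rather than the tensor code an implementation would use; the helper names are illustrative, and the defaults follow the values quoted above for actions scaled to [-1, 1].

```python
import random

def clip(x, low, high):
    """Clamp a scalar to the interval [low, high]."""
    return max(low, min(high, x))

def smoothed_target_action(target_action, sigma=0.2, noise_clip=0.5,
                           max_action=1.0, rng=random):
    """Perturb each dimension of the target policy's action.

    Step 1: draw Gaussian noise and clip it to [-noise_clip, noise_clip].
    Step 2: add it to the action and clip the sum to the valid action range.
    """
    smoothed = []
    for a in target_action:
        noise = clip(rng.gauss(0.0, sigma), -noise_clip, noise_clip)
        smoothed.append(clip(a + noise, -max_action, max_action))
    return smoothed

rng = random.Random(0)
print(smoothed_target_action([0.0, 0.9, -1.0], rng=rng))
```

Because of the second clip, actions that already sit at the boundary can only be perturbed inward.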

Algorithm

The full algorithm as it appears in the paper:

Initialize critic networks Q_theta1, Q_theta2 and actor network pi_phi
  with random parameters theta1, theta2, phi
Initialize target networks: theta1' <- theta1, theta2' <- theta2, phi' <- phi
Initialize replay buffer B

for t = 1 to T:
    Select action with exploration noise:
        a ~ pi_phi(s) + epsilon,  epsilon ~ N(0, sigma)
    Execute a, observe reward r and new state s'
    Store transition (s, a, r, s') in B

    Sample mini-batch of N transitions (s, a, r, s') from B

    a_tilde <- pi_phi'(s') + epsilon,  epsilon ~ clip(N(0, sigma_tilde), -c, c)
    y       <- r + gamma * min_{i=1,2} Q_theta_i'(s', a_tilde)

    Update critics:
        theta_i <- argmin_{theta_i} (1/N) * sum (y - Q_theta_i(s, a))^2

    if t mod d == 0:
        Update phi by the deterministic policy gradient:
            grad_phi J(phi) = (1/N) * sum grad_a Q_theta1(s, a)|_{a=pi_phi(s)} * grad_phi pi_phi(s)
        Update target networks:
            theta_i' <- tau * theta_i + (1 - tau) * theta_i'
            phi'     <- tau * phi    + (1 - tau) * phi'
end for

A few details are worth noting. The actor is trained only against Q_theta1, not against the minimum of the two critics, which keeps the policy gradient less conservative. All three target networks (both critic targets and the actor target) are updated on the delayed schedule along with the actor. Exploration noise during data collection is independent of the smoothing noise added inside the target.

In most implementations, the loop also includes a warmup phase: for the first 10,000 to 25,000 steps, actions are sampled uniformly from the action space rather than from the policy. This produces a more diverse initial replay buffer and avoids early collapse onto a poorly initialized policy.
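A warmup phase of this kind might be combined with exploration noise as in the sketch below. The function name, the signature, and the use of a plain callable as the policy are assumptions made for illustration; scaling the noise by max_action is one common convention, not a requirement of the algorithm.

```python
import random

def select_action(policy, state, step, action_dim, warmup_steps=25_000,
                  expl_sigma=0.1, max_action=1.0, rng=random):
    """Choose an action for data collection.

    During warmup the action is uniform over the action range. Afterwards it
    is the deterministic policy output plus Gaussian exploration noise,
    clipped back into the valid range.
    """
    if step < warmup_steps:
        return [rng.uniform(-max_action, max_action) for _ in range(action_dim)]
    noisy = []
    for a in policy(state):
        a = a + rng.gauss(0.0, expl_sigma * max_action)
        noisy.append(max(-max_action, min(max_action, a)))
    return noisy

# Usage with a stand-in policy that always outputs 0.5 in one dimension.
constant_policy = lambda state: [0.5]
rng = random.Random(0)
print(select_action(constant_policy, state=None, step=0, action_dim=1, rng=rng))
print(select_action(constant_policy, state=None, step=30_000, action_dim=1, rng=rng))
```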

Reference PyTorch implementation

The central update step in the authors' reference implementation looks like the following PyTorch sketch. The state, action, next-state, reward, and not-done tensors come from a sampled minibatch, and actor, actor_target, critic, and critic_target are the networks, with critic and critic_target each holding both Q-functions.

import torch
import torch.nn.functional as F

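def td3_update(self, batch):
    # Count critic updates first so total_it plays the role of t in the pseudocode.
    self.total_it += 1
    state, action, next_state, reward, not_done = batch

    with torch.no_grad():
        # Target policy smoothing: clipped Gaussian noise on the target action,
        # followed by a clip to the valid action range.
        noise = (
            torch.randn_like(action) * self.policy_noise
        ).clamp(-self.noise_clip, self.noise_clip)
        next_action = (
            self.actor_target(next_state) + noise
        ).clamp(-self.max_action, self.max_action)

        # Clipped double Q-learning: bootstrap from the smaller target estimate.
        target_Q1, target_Q2 = self.critic_target(next_state, next_action)
        target_Q = torch.min(target_Q1, target_Q2)
        # not_done is 0 at terminal transitions, which removes the bootstrap term.
        target_Q = reward + not_done * self.discount * target_Q

    # Both critics regress toward the same target.
    current_Q1, current_Q2 = self.critic(state, action)
    critic_loss = F.mse_loss(current_Q1, target_Q) + F.mse_loss(current_Q2, target_Q)

    self.critic_optimizer.zero_grad()
    critic_loss.backward()
    self.critic_optimizer.step()

    # Delayed policy update: the actor and the targets move once per
    # policy_freq critic updates.
    if self.total_it % self.policy_freq == 0:
        # Deterministic policy gradient through the first critic only.
        actor_loss = -self.critic.Q1(state, self.actor(state)).mean()

        self.actor_optimizer.zero_grad()
        actor_loss.backward()
        self.actor_optimizer.step()

        # Polyak (soft) update of the critic and actor target networks.
        for p, p_target in zip(self.critic.parameters(), self.critic_target.parameters()):
            p_target.data.mul_(1 - self.tau)
            p_target.data.add_(self.tau * p.data)
        for p, p_target in zip(self.actor.parameters(), self.actor_target.parameters()):
            p_target.data.mul_(1 - self.tau)
            p_target.data.add_(self.tau * p.data)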

The critic class wraps two Q-networks and has a Q1 method that returns only the first head, used inside the actor loss. Using min(Q1, Q2) in the actor loss would be more conservative; the paper's pseudocode and the reference implementation use Q1 alone.

Network architecture

The reference implementation uses small multi-layer perceptrons, the same shape for both actor and critics. In the paper, both use two hidden layers with 400 and 300 units, ReLU activations, and a tanh on the actor output to bound actions. The critics take the state and action concatenated as input to the first layer (unlike the original DDPG paper, which fed the action only into the second layer). The current public reference repository uses 256-256 hidden layers instead, the change being one of the "minor adjustments to hyperparameters" the README mentions.

Both networks are optimized with Adam.
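To give a sense of scale for the two layer configurations mentioned above, the parameter counts can be worked out directly. The sketch assumes HalfCheetah's dimensions (17-dimensional observation, 6-dimensional action) purely as an example; the helper names are illustrative.

```python
def mlp_param_count(sizes):
    """Weights plus biases of a fully connected network with these layer sizes."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))

def td3_param_counts(state_dim, action_dim, hidden):
    """Parameter counts for one actor and a pair of unshared critics."""
    actor = mlp_param_count([state_dim, *hidden, action_dim])
    one_critic = mlp_param_count([state_dim + action_dim, *hidden, 1])
    return {"actor": actor, "twin_critics": 2 * one_critic}

# Assumed example dimensions: HalfCheetah, state_dim=17, action_dim=6.
print(td3_param_counts(17, 6, hidden=(400, 300)))
# -> {'actor': 129306, 'twin_critics': 260402}
print(td3_param_counts(17, 6, hidden=(256, 256)))
# -> {'actor': 71942, 'twin_critics': 144386}
```

Either way the networks are small by deep learning standards, which is part of why TD3 trains quickly on a single GPU or even a CPU.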

A simplified PyTorch definition of the actor and the twin critic block:

import torch
import torch.nn as nn
import torch.nn.functional as F

class Actor(nn.Module):
    def __init__(self, state_dim, action_dim, max_action, hidden=256):
        super().__init__()
        self.l1 = nn.Linear(state_dim, hidden)
        self.l2 = nn.Linear(hidden, hidden)
        self.l3 = nn.Linear(hidden, action_dim)
        self.max_action = max_action

    def forward(self, state):
        a = F.relu(self.l1(state))
        a = F.relu(self.l2(a))
        return self.max_action * torch.tanh(self.l3(a))

class Critic(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.l1 = nn.Linear(state_dim + action_dim, hidden)
        self.l2 = nn.Linear(hidden, hidden)
        self.l3 = nn.Linear(hidden, 1)
        self.l4 = nn.Linear(state_dim + action_dim, hidden)
        self.l5 = nn.Linear(hidden, hidden)
        self.l6 = nn.Linear(hidden, 1)

    def forward(self, state, action):
        sa = torch.cat([state, action], 1)
        q1 = F.relu(self.l1(sa)); q1 = F.relu(self.l2(q1)); q1 = self.l3(q1)
        q2 = F.relu(self.l4(sa)); q2 = F.relu(self.l5(q2)); q2 = self.l6(q2)
        return q1, q2

    def Q1(self, state, action):
        sa = torch.cat([state, action], 1)
        q1 = F.relu(self.l1(sa)); q1 = F.relu(self.l2(q1)); q1 = self.l3(q1)
        return q1

Image-based observations swap the MLP for a small CNN trunk (typically the Nature DQN architecture), but the rest of the algorithm is unchanged.

Default hyperparameters

The defaults below are the values used in the original paper; the Notes column flags where the authors' current PyTorch reference differs. Some downstream libraries also differ on minor points (most often layer width, batch size, and the warmup phase).

Hyperparameter	Symbol	Paper default	Notes
Discount factor	gamma	0.99	Standard for MuJoCo
Soft target update rate	tau	0.005	Polyak averaging coefficient
Policy update delay	d	2	One actor update per two critic updates
Target policy noise std	sigma_tilde	0.2	Clipped Gaussian on target action
Target noise clip	c	0.5	Clip range [-c, c]
Exploration noise std	sigma	0.1	Gaussian, added to actor output during data collection
Replay buffer size		1,000,000	Full history of the agent
Mini-batch size	N	100	Reference repo and SB3 use 256
Optimizer		Adam	Both actor and critics
Learning rate		1e-3	Same for actor and critics; reference repo uses 3e-4
Hidden layers (actor and critic)		(400, 300)	Reference repo uses (256, 256)
Activations		ReLU + tanh on actor output
Random data collection		10,000 steps for HalfCheetah and Ant; 1,000 steps for the rest	Pure exploration warmup

Hyperparameter sensitivity in practice

The headline numbers depend on a handful of choices that are easy to miss. Discount gamma at 0.99 is standard; pushing to 0.995 or above can help on long-horizon tasks but tends to amplify Q-function instability. The target update rate tau of 0.005 is a conservative Polyak factor; values around 0.01 train slightly faster but make the critic more reactive to noise. Increasing the policy delay d above 2 sometimes helps on easy tasks but starves the actor on harder ones.
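The effect of tau and d on how quickly the targets follow the online networks can be made concrete with a little arithmetic. The sketch below assumes a scalar stand-in for a network weight and a frozen online value, which is a simplification; the function names are illustrative.

```python
def gap_closed(num_target_updates, tau=0.005):
    """Fraction of the gap to a fixed online value closed after k soft updates."""
    return 1.0 - (1.0 - tau) ** num_target_updates

def simulate_target(online_value=1.0, target_value=0.0, gradient_steps=400,
                    tau=0.005, policy_delay=2):
    """Apply the Polyak update on every policy_delay-th gradient step."""
    for t in range(1, gradient_steps + 1):
        if t % policy_delay == 0:
            target_value = tau * online_value + (1.0 - tau) * target_value
    return target_value

# With tau = 0.005 and d = 2, 400 gradient steps give 200 target updates,
# which close about 63% of the gap to the online value.
print(gap_closed(200))
print(simulate_target())
```

Doubling tau or halving d roughly halves the number of gradient steps the targets need to catch up, which is why both settings trade stability against reactivity.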

Exploration noise sigma at 0.1 is small relative to the [-1, 1] action range, which is fine when the policy is initialized near zero and starts moving meaningfully early in training. For long-horizon sparse reward tasks, replacing Gaussian exploration with Ornstein-Uhlenbeck noise (as in the original DDPG paper) or with parameter-space noise can help, though TD3 itself does not require either.

Replay buffer size of 1 million transitions is enough for the standard 1 million step training budget but should grow proportionally for longer runs. Smaller buffers (200,000 or so) can lead to overfitting on recent transitions, especially when combined with a high gradient-update-per-environment-step ratio.
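A fixed-capacity buffer that overwrites its oldest entries can be sketched as follows. This is a simplified illustration using Python lists rather than the preallocated arrays most libraries use; the class name and method names are assumptions, and the tuple order mirrors the (state, action, next_state, reward, not_done) batch layout used in the update code earlier in the article.

```python
import random

class ReplayBuffer:
    """Ring buffer: once full, each new transition overwrites the oldest one."""

    def __init__(self, capacity=1_000_000, seed=None):
        self.capacity = capacity
        self.storage = []
        self.position = 0  # index that the next write goes to
        self.rng = random.Random(seed)

    def add(self, state, action, next_state, reward, done):
        # Store not_done = 1 - done so the target can mask the bootstrap term.
        item = (state, action, next_state, reward, 1.0 - float(done))
        if len(self.storage) < self.capacity:
            self.storage.append(item)
        else:
            self.storage[self.position] = item
        self.position = (self.position + 1) % self.capacity

    def sample(self, batch_size):
        """Uniform sampling with replacement; returns five parallel tuples."""
        batch = self.rng.choices(self.storage, k=batch_size)
        return tuple(zip(*batch))

    def __len__(self):
        return len(self.storage)

buffer = ReplayBuffer(capacity=3, seed=0)
for i in range(5):
    buffer.add(state=i, action=0.0, next_state=i + 1, reward=1.0, done=(i == 4))
print(len(buffer))  # 3: the two oldest transitions were overwritten
```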

MuJoCo benchmark results

The paper reports the maximum average return over 10 trials of 1 million environment steps. Results are on the original v1 MuJoCo tasks from OpenAI Gym, evaluated every 5,000 steps with 10 noise-free episodes per evaluation.

Environment	TD3	DDPG (baselines)	DDPG (our re-tune)	PPO	TRPO	ACKTR	SAC
HalfCheetah-v1	9636.95 +/- 859.07	3305.60	8577.29	1795.43	-15.57	1450.46	2347.19
Hopper-v1	3564.07 +/- 114.74	2020.46	1860.02	2164.70	2471.30	2428.39	2996.66
Walker2d-v1	4682.82 +/- 539.64	1843.85	3098.11	3317.69	2321.47	1216.70	1283.67
Ant-v1	4372.44 +/- 1000.33	1005.30	888.77	1083.20	-75.85	1821.94	655.35
Reacher-v1	-3.60 +/- 0.56	-6.51	-4.01	-6.18	-111.43	-4.26	-4.44
InvertedPendulum-v1	1000.00 +/- 0.00	1000.00	1000.00	1000.00	985.40	1000.00	1000.00
InvertedDoublePendulum-v1	9337.47 +/- 14.96	9355.52	8369.95	8977.94	205.85	9081.92	8487.15

TD3 had the highest score on five of the seven tasks, tied the maximum on InvertedPendulum, where 1,000 is the environment's reward ceiling, and finished marginally behind the baseline DDPG on InvertedDoublePendulum (9337.47 versus 9355.52). The SAC numbers in the original Table 1 reflect a now-superseded implementation; later tuned SAC code closes much of the gap, particularly on the harder tasks. The paper acknowledges this in a footnote and provides comparison numbers in its supplementary material.

Later third-party benchmarks on newer MuJoCo versions tell roughly the same story. CleanRL's TD3 implementation reaches around 9,583 on HalfCheetah-v4, 4,058 on Walker2d-v4, 3,135 on Hopper-v4, and 5,035 on Humanoid-v4 over three seeds.

Ablation breakdown

The paper's Table 2 ablates each TD3 modification on HalfCheetah, Hopper, Walker2d, and Ant. The numbers below are the 10-seed average of the maximum return over 1 million steps.

Variant	HalfCheetah	Hopper	Walker2d	Ant
TD3 (full)	9532.99	3304.18	4565.24	4185.06
TD3 minus delayed policy	9412.35	2790.66	3853.34	4040.34
TD3 minus target smoothing	8775.91	1939.12	2952.46	4097.39
TD3 minus clipped double Q	7894.97	2266.36	4046.67	4063.07
TD3 with single Q (DDPG-style)	8538.56	2253.23	3522.74	3538.46
AHE (DDPG with TD3's architecture, hyperparameters, and exploration)	8401.30	1652.65	4130.09	1944.61

No single component carries the result; the combination is what closes the gap. Removing target smoothing hurts most on Hopper and Walker2d, both of which have brittle dynamics that punish extreme actions. Removing clipped double Q hurts most on HalfCheetah and Ant, which run for full 1,000-step episodes and accumulate the most overestimation.

Critic value tracking

Figure 1 of the paper plots the average critic prediction (Q(s, pi(s))) against the true return measured by Monte Carlo rollouts. For DDPG, the predicted value floats around 1,500 while the true return sits near zero on HalfCheetah, growing to a gap of more than 10,000 by 1 million steps on Hopper. For TD3, the predicted value tracks the true return closely throughout training. This is the diagnostic the paper uses to argue that the algorithm actually fixes the bias rather than masking it.

Comparison with related algorithms

Algorithm	On/Off policy	Policy type	Action space	Sample efficiency	Key idea
TD3	Off-policy	Deterministic	Continuous	High	Two critics with min target, delayed policy updates, target action smoothing
DDPG	Off-policy	Deterministic	Continuous	High but unstable	Deterministic actor with single critic and replay buffer
SAC	Off-policy	Stochastic	Continuous	High	Maximum-entropy objective with twin critics and reparameterized Gaussian policy
PPO	On-policy	Stochastic	Continuous and discrete	Lower per sample, very stable	Clipped surrogate objective, multiple epochs over each rollout
A3C	On-policy	Stochastic	Continuous and discrete	Low per sample	Asynchronous advantage actor-critic with parallel workers
DQN	Off-policy	Greedy over Q-values (epsilon-greedy exploration)	Discrete	High but bounded	Q-learning with replay buffer and target network

TD3 and SAC came out within a few months of each other in 2018 and tend to perform similarly on standard MuJoCo benchmarks, with SAC often having an edge on the harder tasks (Humanoid in particular) thanks to its entropy regularization. People still argue about which is the better default. PPO is the comparison algorithm everyone reaches for when they want stability or when the environment is cheap to simulate, since on-policy methods burn through far more samples but rarely diverge.

TD3 vs SAC: a closer look

The two algorithms are often discussed as siblings. Both were published in 2018, both use twin critics with a min target, both are off policy with replay, and both are designed for continuous action spaces. The differences are also instructive.

Aspect	TD3	SAC
Policy class	Deterministic, tanh-bounded	Stochastic Gaussian, tanh-squashed
Exploration	External Gaussian noise on actor output	Built-in policy entropy
Loss	Standard deterministic policy gradient	Soft policy gradient with entropy term
Critic targets	min(Q1, Q2)	min(Q1, Q2) plus entropy bonus (minus alpha * log pi)
Tunables	Exploration sigma, smoothing sigma, delay d	Entropy temperature alpha (often auto-tuned)
Strengths	Simple, fast, very predictable on standard tasks	Robust to hyperparameters, often best on hard tasks
Weaknesses	Brittle exploration on sparse reward, no entropy	Slightly more compute per step, slower to debug

In standard MuJoCo, SAC tends to outperform TD3 on Humanoid by a wide margin (roughly 9,000 versus 5,000 over 3 million steps), match it on Walker2d and Ant, and slightly underperform it on HalfCheetah. The auto-tuned entropy in modern SAC removes one of TD3's old advantages in setup simplicity, but TD3 is still typically a few percent faster per gradient step because its policy does not require sampling.

TD3 vs PPO

TD3 is sample efficient where PPO is wall-clock efficient. On a single MuJoCo environment with one CPU and one GPU, TD3 reaches a target score in roughly 1 million environment steps; PPO needs around 5 to 10 million for the same target. PPO catches up if you can run 16 or 32 environments in parallel, since it scales near-linearly with parallelism, while TD3 with a single replay buffer and one learning thread does not. For real robots and any setting where each environment step is expensive, TD3 is the better fit. For massively parallel simulation (Isaac Lab, Brax, EnvPool), PPO often wins on wall clock.

Implementations

Library	URL	Notes
Author's reference (PyTorch)	github.com/sfujim/TD3	The canonical reference; README warns the current code differs slightly from the paper
Stable-Baselines3	stable-baselines3.readthedocs.io	PyTorch; uses ReLU MlpPolicy to match the paper, batch size 256
OpenAI Spinning Up	spinningup.openai.com/algorithms/td3	PyTorch and TensorFlow versions; tutorial-style explanations
CleanRL	docs.cleanrl.dev/rl-algorithms/td3	Single-file PyTorch implementations, reproducible benchmarks
Tianshou	github.com/thu-ml/tianshou	Modular PyTorch RL library, MuJoCo benchmarks at parity with the original
RLlib (Ray)	docs.ray.io/en/latest/rllib	Distributed RL library with TD3 in its catalog
Acme (DeepMind)	github.com/google-deepmind/acme	JAX and TensorFlow versions; modular agent components
Sample Factory	github.com/alex-petrenko/sample-factory	High-throughput PyTorch RL library, supports TD3 baseline runs

For most users picking up TD3 for a project, Stable-Baselines3 or CleanRL is the easiest entry point. The authors' reference is short enough to read end-to-end and is still the cleanest match to the paper's pseudocode.

Stable-Baselines3 minimal example

A minimal TD3 training loop in Stable-Baselines3 on the standard Pendulum environment:

import gymnasium as gym
import numpy as np
from stable_baselines3 import TD3
from stable_baselines3.common.noise import NormalActionNoise

env = gym.make("Pendulum-v1")

n_actions = env.action_space.shape[-1]
action_noise = NormalActionNoise(
    mean=np.zeros(n_actions), sigma=0.1 * np.ones(n_actions)
)

model = TD3(
    "MlpPolicy",
    env,
    action_noise=action_noise,
    learning_rate=3e-4,
    buffer_size=1_000_000,
    batch_size=256,
    tau=0.005,
    gamma=0.99,
    policy_delay=2,
    target_policy_noise=0.2,
    target_noise_clip=0.5,
    verbose=1,
)

model.learn(total_timesteps=200_000)
model.save("td3_pendulum")

Replacing Pendulum-v1 with any continuous-action Gymnasium environment usually works without further changes, although harder tasks need longer training and sometimes wider networks (policy_kwargs=dict(net_arch=[400, 300])).

Variants and extensions

TD3 has been a launching pad for a handful of follow-up algorithms.

TD3+BC (Fujimoto and Gu, NeurIPS 2021) adapts TD3 to the offline RL setting by adding a behavior cloning regularizer to the policy loss and normalizing observations. The change is described as "a few lines of code" and matches the performance of much more elaborate offline methods on the D4RL benchmark, which has made it a standard baseline.
TQC (Truncated Quantile Critics, Kuznetsov et al., ICML 2020) replaces the min over two critics with quantile regression over a larger ensemble, dropping the top quantiles to control overestimation more flexibly. It builds on the same intuition as TD3's clipped double Q trick.
REDQ (Randomized Ensemble Double Q-Learning, Chen et al., ICLR 2021) keeps an ensemble of Q-networks (typically 10), samples a small subset to compute the target, and runs many gradient updates per environment step. The result is sample efficiency that approaches model-based methods while staying model-free.
DroQ and various dropout-regularized critic variants take the ensemble idea further while keeping a TD3-like backbone.
TD3+HER combines TD3 with Hindsight Experience Replay (Andrychowicz et al., 2017) for sparse-reward goal-conditioned tasks, especially in robotics manipulation. HER relabels stored transitions with goals that were actually reached, so the TD3 critic sees rewarded transitions even when the original goal was never achieved.
D4PG (Distributed Distributional DDPG, Barth-Maron et al., 2018) is a parallel cousin from DeepMind that uses distributional critics and multiple actors but does not include the clipped double Q trick. Several later papers combine D4PG actors with TD3 targets.
CrossQ (Bhatt et al., 2024) drops target networks entirely in favor of batch-renormalized critics, recovering most of TD3's stability with simpler bookkeeping.

The core moves of TD3 (twin critics with min target, target action regularization) show up in nearly every modern off-policy continuous control algorithm in some form.

TD3 in offline reinforcement learning

The transition from online to offline RL has been one of the more consequential changes to the field since 2020. TD3+BC is often the first thing tried on a new offline benchmark because the implementation is short. The policy loss in TD3+BC is:

actor_loss = -lambda * Q(s, pi(s)) + ||pi(s) - a_dataset||^2

where a_dataset is the action recorded in the offline dataset and lambda balances the value-maximization term against the cloning term. Fujimoto and Gu set lambda = 2.5 / mean(|Q|) so that it scales with the magnitude of the value function, removing one tunable. With this single change plus state normalization, TD3+BC matches or beats CQL, BCQ, and BRAC on most D4RL MuJoCo subsets.
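The loss above can be evaluated by hand on a small batch. The sketch below uses plain Python lists in place of tensors and follows the formula as written in this article (squared norm per sample, averaged over the batch); the function name is illustrative, and lambda is treated as a constant, as it would be when no gradient is passed through the normalizer.

```python
def td3_bc_actor_loss(q_values, policy_actions, dataset_actions, alpha=2.5):
    """TD3+BC actor loss on one batch.

    q_values:        Q(s_i, pi(s_i)) for each sample i
    policy_actions:  pi(s_i) as a list of action vectors
    dataset_actions: the recorded action for each sample
    Assumes mean(|Q|) is nonzero; real code would guard this division.
    """
    n = len(q_values)
    lam = alpha / (sum(abs(q) for q in q_values) / n)
    value_term = -lam * sum(q_values) / n
    bc_term = sum(
        sum((p - a) ** 2 for p, a in zip(pi_s, a_data))
        for pi_s, a_data in zip(policy_actions, dataset_actions)
    ) / n
    return value_term + bc_term

# Two samples with Q = 5 each: lambda = 2.5 / 5 = 0.5, value term = -2.5.
# The policy matches the dataset actions exactly, so the cloning term is 0.
print(td3_bc_actor_loss([5.0, 5.0], [[0.1], [0.2]], [[0.1], [0.2]]))  # -2.5
```

Because lambda divides by the mean absolute Q-value, the value term stays on roughly the same scale as alpha regardless of the reward scale of the dataset.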

Applications

TD3 has been used or evaluated in a range of continuous-control problems beyond MuJoCo benchmarks.

Robotic manipulation in simulation: TD3 is one of the standard baselines in Isaac Lab, robosuite, and the Gymnasium-Robotics suite. Tasks include reaching, pushing, grasping, and door opening, often combined with HER.
Locomotion: Legged-robot controllers in simulation, including quadruped gaits and bipedal humanoid balance and walking. TD3 has been used as a comparison baseline in several papers on Cassie and ANYmal locomotion.
Sim-to-real robot control: TD3 has been applied to real robotic arms (notably the Franka Emika Panda and the Sawyer) when combined with domain randomization. Real-robot training is rare because of sample requirements, but TD3 fine-tuning on top of behavior-cloned policies is common.
Autonomous driving simulation: CARLA and LGSVL studies use TD3 for steering and throttle control in lane-keeping and intersection navigation tasks. Performance gains over PPO are mixed and depend heavily on reward shaping.
Power grid and energy management: TD3 has been applied to building HVAC control, electric vehicle charging coordination, and microgrid energy dispatch, where the action space is naturally continuous.
Network resource allocation: Continuous bandwidth and power allocation problems in wireless networks, including beamforming for 5G and edge offloading.
Drone control: Quadrotor stabilization and trajectory tracking, both in Gazebo simulation and on real hardware after sim-to-real transfer.
Finance: Portfolio allocation and trade execution as continuous-action MDPs, although academic results on real markets remain controversial.

In each of these domains, TD3 is rarely the state of the art on its own. It is more often used as a starting point that gets extended with HER, distributional critics, or domain-specific reward shaping.

Notable benchmark suites where TD3 appears

Suite	Maintainer	Notes
MuJoCo Gym	Farama Foundation	Standard physics tasks; TD3 is a default baseline
DMControl Suite	DeepMind	DM-style task and observation specs; TD3 trained with image inputs
Meta-World	Stanford	50 manipulation tasks; TD3 used as a non-meta baseline
D4RL	UC Berkeley	Offline benchmark; TD3+BC is among the standard baselines
Isaac Lab	NVIDIA	Massively parallel GPU simulation; TD3 available through the skrl integration
Gymnasium-Robotics	Farama Foundation	Goal-conditioned manipulation; TD3 typically combined with HER

Practical guidance

A few patterns recur in production use of TD3.

Normalize observations. TD3 is sensitive to scale. Either standardize observations to zero mean and unit variance with running statistics, or use layer normalization in the critic. Without normalization the critic can blow up early in training on environments with unbounded observation values.
Clip rewards or scale them. Outsized rewards (one shot of +1000 in a sea of small rewards) destabilize the value function. Reward scaling or clipping to [-1, 1] is a common preprocessing step.
Set the random seed deliberately. TD3 is more reproducible than DDPG but still benefits from setting numpy, torch, and gym seeds explicitly. Ten seeds is the de facto reporting standard for paper results.
Watch the critic. Plot mean(Q(s, a)) over training. If it grows unboundedly, smoothing noise is too small or the discount is too high. If it sits near zero forever, the actor is not getting useful gradients; check exploration noise.
Use a longer warmup on hard tasks. 25,000 steps of pure-random data collection helps on Humanoid and Ant. Skipping the warmup can cause the policy to collapse onto a single bad mode.
Use SB3 for production, the reference repo for research. SB3 has fewer footguns; the reference repo is closer to the paper math.
Do not increase the learning rate without a reason. TD3 is fine at 3e-4 across most tasks. Going higher tends to win on easy tasks and lose on hard ones.
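Scale the replay buffer with the training budget. A 1 million transition buffer is right for 1 million environment steps. Beyond that, scale the buffer roughly linearly with the number of environment steps; a fixed-size buffer on a longer run evicts its oldest transitions first, so the critic gradually loses coverage of early experience.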
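
The first two items above can be sketched in a few lines of plain Python. This is a minimal illustration, not the preprocessing used by any particular library: the class name RunningNormalizer, the clip value of 5.0, and the helper clip_reward are all hypothetical names chosen for this example. The running statistics use Welford's online algorithm.

```python
import math


class RunningNormalizer:
    """Tracks per-dimension running mean and variance (Welford's algorithm).

    Hypothetical helper for illustration; not taken from the TD3 reference code.
    """

    def __init__(self, dim, eps=1e-8, clip=5.0):
        self.count = 0
        self.mean = [0.0] * dim
        self.m2 = [0.0] * dim  # running sum of squared deviations from the mean
        self.eps = eps
        self.clip = clip

    def update(self, obs):
        """Fold one observation into the running statistics."""
        self.count += 1
        for i, x in enumerate(obs):
            delta = x - self.mean[i]
            self.mean[i] += delta / self.count
            self.m2[i] += delta * (x - self.mean[i])

    def normalize(self, obs):
        """Standardize to roughly zero mean and unit variance, then clip."""
        out = []
        for i, x in enumerate(obs):
            # Population variance; fall back to 1.0 until two samples are seen.
            var = self.m2[i] / self.count if self.count > 1 else 1.0
            z = (x - self.mean[i]) / math.sqrt(var + self.eps)
            out.append(max(-self.clip, min(self.clip, z)))
        return out


def clip_reward(r, low=-1.0, high=1.0):
    """Clip a scalar reward into [low, high]."""
    return max(low, min(high, r))


normalizer = RunningNormalizer(dim=2)
for observation in ([0.0, 100.0], [2.0, 300.0], [4.0, 500.0]):
    normalizer.update(observation)
scaled = normalizer.normalize([4.0, 500.0])  # both dimensions land on the same scale
```

In practice the normalizer would be updated on every observation collected, and the same statistics applied both when storing transitions and when querying the actor, so that training and acting see identically scaled inputs.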

Common failure modes

Critic divergence. Q-values explode toward infinity over training. Caused by too-high gamma, too-small target smoothing, or too-frequent actor updates (a policy delay that is too short). Lower gamma to 0.95, raise sigma_tilde to 0.3, or move to SAC if the problem persists.
Policy collapse. Actor outputs the same near-saturated action regardless of state. Caused by insufficient exploration, narrow value-function gradients, or warmup that is too short. Increase exploration noise, restart with a larger warmup, or check that the tanh output is not stuck at +/-1.
Reward hacking. Policy finds a degenerate behavior with high reward but no useful skill. Not a TD3-specific issue but it shows up especially in sparse reward and shaped reward tasks.
Slow convergence on Humanoid. TD3 is generally weaker than SAC on Humanoid because it lacks entropy regularization. Either switch to SAC or add adaptive entropy bonuses on top of TD3 (the result is essentially TADD or similar variants).
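
The first two failure modes can be caught early with cheap checks on logged values. The sketch below is one possible way to do that; the function names, the window size, and the tolerance are assumptions made for this example rather than settings from the paper. Note that Q-values grow legitimately as the policy improves, so a large growth ratio is only a prompt to inspect the run, not proof of divergence.

```python
def q_growth_ratio(q_means, window=3):
    """Ratio of mean |Q| over the last `window` logs to the first `window` logs."""
    if len(q_means) < 2 * window:
        raise ValueError("need at least 2 * window logged values")
    early = sum(abs(q) for q in q_means[:window]) / window
    late = sum(abs(q) for q in q_means[-window:]) / window
    return late / max(early, 1e-8)


def saturation_fraction(actions, max_action=1.0, tol=0.01):
    """Fraction of action components within `tol` of the +/- max_action bound."""
    total = 0
    saturated = 0
    for action in actions:
        for component in action:
            total += 1
            if abs(component) >= max_action - tol:
                saturated += 1
    return saturated / total if total else 0.0


# Hypothetical logged values: mean critic output per evaluation, and a batch of actions.
logged_q = [1.0, 1.0, 1.0, 10.0, 100.0, 1000.0]
batch_actions = [[1.0, -1.0], [0.2, 0.995]]

growth = q_growth_ratio(logged_q)               # large value suggests runaway Q
saturated = saturation_fraction(batch_actions)  # near 1.0 suggests a stuck tanh
```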

Limitations and known weaknesses

TD3 does not solve every continuous control problem. Its main limitations:

Deterministic policies are bad at multimodal tasks. If the optimal behavior involves randomization (for instance, defensive driving or game-theoretic tasks), TD3's deterministic actor will collapse to one mode. SAC and other stochastic-policy methods handle this naturally.
Exploration is fragile. Gaussian noise on the actor output is the simplest possible exploration scheme. On sparse-reward tasks it usually fails, which is why HER and curiosity-based methods are paired with TD3 in robotics.
Sensitive to reward scale. Without reward normalization, the critic can blow up. The paper does not normalize rewards, but most downstream libraries do.
Not great in high-dimensional action spaces. TD3 has been tested mostly with action dimension under 30. Beyond that, the deterministic policy gradient becomes brittle, partly because the smoothing noise covers a smaller fraction of the action space.
Struggles with sparse, long-horizon credit assignment. The clipped double Q trick reduces variance but does not help with credit assignment over thousands of steps. Hierarchical RL methods or n-step returns are typical add-ons.
Single-task by design. Multi-task and meta-RL settings need additional machinery. TD3 alone does not transfer.
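
As an illustration of the n-step add-on mentioned above, the following sketch computes an n-step bootstrapped target. The function name and arguments are hypothetical. In a TD3 setting, bootstrap_value would be the minimum of the two target critics at the state reached after n steps, evaluated at the smoothed target action; uncorrected n-step returns on off-policy data are biased, which this sketch does not address.

```python
def n_step_target(rewards, bootstrap_value, gamma=0.99, done=False):
    """Return sum_{k<n} gamma^k * r_k + gamma^n * bootstrap_value.

    The bootstrap term is dropped when the episode terminated within the n steps.
    """
    target = 0.0
    for k, reward in enumerate(rewards):
        target += (gamma ** k) * reward
    if not done:
        target += (gamma ** len(rewards)) * bootstrap_value
    return target


# Three rewards of 1.0, bootstrap value 10.0, gamma 0.5:
# 1 + 0.5 + 0.25 + 0.125 * 10 = 3.0
example = n_step_target([1.0, 1.0, 1.0], bootstrap_value=10.0, gamma=0.5)
```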

Reception and impact

TD3 became one of the most cited reinforcement learning papers of 2018 and quickly settled into the role of a default baseline for continuous control. Most papers that propose a new off-policy continuous control algorithm benchmark against either TD3, SAC, or both. The clipped double Q trick in particular has been adopted across the field, and even SAC implementations now use it by default.

In applications, TD3 and its descendants are widely used in robotics, including manipulation, mobile robot navigation, and path planning, where deterministic policies and continuous control fit naturally. Surveys of deep RL in robotics consistently list TD3 alongside SAC and PPO as the algorithms most commonly tried first.

The paper's broader contribution was probably methodological as much as algorithmic. It pushed the field to evaluate over more seeds, to take ablations seriously, and to be honest about the variance of deep RL results. The reproducibility complaints raised by Henderson et al. (2018; first posted to arXiv in 2017), which the TD3 authors cite, were taken to heart. The 10-seed evaluation protocol used in the paper is close to what later work treats as the bare minimum.

Google Scholar lists TD3 with more than 7,000 citations as of 2025, putting it in the same range as SAC and DDPG and in the top tier of post-2017 RL papers. The reference repository has been forked thousands of times and is one of the most copied teaching examples for continuous-control RL alongside OpenAI Spinning Up.

Influence on later work

The clipped double Q trick is now the default in SAC, REDQ, TQC, DroQ, and most modern off-policy continuous control algorithms. Target action smoothing is less universal but has become standard in robotics-oriented codebases. The paper is also frequently cited in offline RL work, where overestimation under distribution shift is even more acute. CQL, BCQ, and IQL cite TD3 directly when motivating their approach to value pessimism.

Theoretical context

TD3 sits in the lineage of deterministic policy gradient methods that began with the deterministic policy gradient theorem (Silver et al., 2014). For a deterministic policy pi_phi, the policy gradient is:

grad_phi J = E_{s ~ rho^pi} [ grad_a Q^pi(s, a)|_{a=pi_phi(s)} * grad_phi pi_phi(s) ]

where rho^pi is the discounted state visitation distribution. DPG was the theoretical foundation for DDPG, which in turn was the practical predecessor to TD3.
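
The chain-rule structure of this gradient can be checked numerically on a toy problem. The sketch below assumes a one-step setting with a fixed set of states (so the state distribution does not depend on the policy), a scalar linear policy pi_phi(s) = phi * s, and a known quadratic critic Q(s, a) = -(a - 2s)^2. All of these are illustrative assumptions, not part of the TD3 paper. The product grad_a Q * grad_phi pi is compared against a finite-difference estimate of the objective.

```python
def q_value(s, a):
    """Toy critic: reward is highest when a == 2 * s."""
    return -(a - 2.0 * s) ** 2


def dq_da(s, a):
    """Analytic gradient of the toy critic with respect to the action."""
    return -2.0 * (a - 2.0 * s)


def policy(phi, s):
    """Deterministic linear policy with a single parameter phi."""
    return phi * s


def dpi_dphi(phi, s):
    """Gradient of the policy output with respect to phi."""
    return s


def dpg_gradient(phi, states):
    """Average of grad_a Q(s, a)|_{a=pi(s)} * grad_phi pi(s) over the states."""
    total = 0.0
    for s in states:
        a = policy(phi, s)
        total += dq_da(s, a) * dpi_dphi(phi, s)
    return total / len(states)


def objective(phi, states):
    """J(phi) = mean over states of Q(s, pi_phi(s))."""
    return sum(q_value(s, policy(phi, s)) for s in states) / len(states)


def finite_difference(phi, states, h=1e-5):
    """Central-difference estimate of dJ/dphi, used only as a sanity check."""
    return (objective(phi + h, states) - objective(phi - h, states)) / (2.0 * h)


states = [-1.0, 0.5, 2.0]
phi = 0.5
grad = dpg_gradient(phi, states)
check = finite_difference(phi, states)
```

Both estimates agree on this problem, and following the gradient upward moves phi toward 2, where the policy output matches the action the toy critic prefers in every state.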

The overestimation analysis builds on the theory of Q-learning with function approximation (Thrun and Schwartz 1993; van Hasselt 2010). The TD3 paper extends this analysis to actor-critic by showing that even without an explicit max operator, the policy update implicitly takes a max-like step that is biased upward. Clipped double Q-learning is related to the broader technique of pessimism in value estimation, which appears in offline RL (CQL), exploration, and safe RL.
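
The upward bias and the effect of the clipped estimate can be reproduced with a small Monte Carlo experiment. The setup below is a deliberately simplified stand-in for the actor-critic case: every action has a true value of zero, two critics return independent Gaussian-noised estimates, the action is picked greedily with respect to the first critic, and the target is either that critic's own value or the minimum of the two. The function name and the parameter values are assumptions for this illustration.

```python
import random


def simulate_bias(num_actions=10, noise_std=1.0, trials=20000, seed=0):
    """Return (single-critic bias, clipped double-Q bias) when the true Q is 0."""
    rng = random.Random(seed)
    single_total = 0.0
    clipped_total = 0.0
    for _ in range(trials):
        # True Q is 0 for every action, so any nonzero average is pure bias.
        q1 = [rng.gauss(0.0, noise_std) for _ in range(num_actions)]
        q2 = [rng.gauss(0.0, noise_std) for _ in range(num_actions)]
        # Greedy action under the first critic (the role the actor plays in TD3).
        best = max(range(num_actions), key=lambda a: q1[a])
        single_total += q1[best]
        clipped_total += min(q1[best], q2[best])
    return single_total / trials, clipped_total / trials


single_bias, clipped_bias = simulate_bias()
```

In this setting the single-critic estimate is clearly positive, while the clipped estimate is much closer to zero and leans slightly negative, which matches the paper's argument that clipped double Q-learning trades overestimation for a milder underestimation.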

References

Fujimoto, S., van Hoof, H., and Meger, D. (2018). "Addressing Function Approximation Error in Actor-Critic Methods." *Proceedings of the 35th International Conference on Machine Learning (ICML)*, PMLR 80. arXiv:1802.09477.
Lillicrap, T. P. et al. (2015). "Continuous control with deep reinforcement learning." arXiv:1509.02971. The DDPG paper.
van Hasselt, H. (2010). "Double Q-learning." *Advances in Neural Information Processing Systems (NeurIPS)*.
van Hasselt, H., Guez, A., and Silver, D. (2016). "Deep Reinforcement Learning with Double Q-Learning." *AAAI*.
Silver, D. et al. (2014). "Deterministic Policy Gradient Algorithms." *ICML*.
Haarnoja, T. et al. (2018). "Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor." *ICML*. arXiv:1801.01290.
Schulman, J. et al. (2017). "Proximal Policy Optimization Algorithms." arXiv:1707.06347.
Fujimoto, S. (2018). "TD3 reference implementation." GitHub: sfujim/TD3.
Fujimoto, S. and Gu, S. S. (2021). "A Minimalist Approach to Offline Reinforcement Learning." *NeurIPS*. arXiv:2106.06860. The TD3+BC paper.
Kuznetsov, A. et al. (2020). "Controlling Overestimation Bias with Truncated Mixture of Continuous Distributional Quantile Critics." *ICML*. arXiv:2005.04269.
Chen, X. et al. (2021). "Randomized Ensembled Double Q-Learning: Learning Fast Without a Model." *ICLR*. arXiv:2101.05982.
OpenAI. "Twin Delayed DDPG." *Spinning Up in Deep RL* documentation. spinningup.openai.com.
Raffin, A. et al. (2021). "Stable-Baselines3: Reliable Reinforcement Learning Implementations." *Journal of Machine Learning Research*. Documentation: stable-baselines3.readthedocs.io.
Huang, S. et al. (2022). "CleanRL: High-quality Single-file Implementations of Deep Reinforcement Learning Algorithms." *JMLR*. Documentation: docs.cleanrl.dev.
Henderson, P. et al. (2018). "Deep Reinforcement Learning that Matters." *AAAI*. arXiv:1709.06560.
Andrychowicz, M. et al. (2017). "Hindsight Experience Replay." *NeurIPS*. arXiv:1707.01495.
Barth-Maron, G. et al. (2018). "Distributed Distributional Deterministic Policy Gradients." *ICLR*. arXiv:1804.08617.
Bhatt, A. et al. (2024). "CrossQ: Batch Normalization in Deep Reinforcement Learning for Greater Sample Efficiency and Simplicity." *ICLR*. arXiv:1902.05605.
Thrun, S. and Schwartz, A. (1993). "Issues in Using Function Approximation for Reinforcement Learning." *Proceedings of the 1993 Connectionist Models Summer School*.
Sutton, R. S. and Barto, A. G. (2018). *Reinforcement Learning: An Introduction*, second edition. MIT Press.
Levine, S. (2024). "CS285: Deep Reinforcement Learning." Course materials, UC Berkeley. rail.eecs.berkeley.edu/deeprlcourse.
