DDPG (Deep Deterministic Policy Gradient)
Last reviewed
May 2, 2026
Sources
15 citations
Review status
Source-backed
Revision
v1 · 3,601 words
DDPG (Deep Deterministic Policy Gradient) is an off-policy, model-free actor-critic algorithm in deep reinforcement learning for environments with continuous action spaces. It was introduced by Timothy Lillicrap and colleagues at DeepMind in the paper Continuous control with deep reinforcement learning, posted to arXiv in September 2015 and presented at ICLR 2016. DDPG combined the Deterministic Policy Gradient (DPG) theorem of David Silver et al. (ICML 2014) with the engineering tricks that made DQN work on Atari games, namely a replay buffer and slowly-updated target networks. The result was among the first deep RL methods that could learn end-to-end control policies in continuous action spaces, including from raw pixels, without discretizing the action space.
The algorithm trains two neural networks at the same time. A deterministic actor network maps states directly to actions, and a critic network estimates the action-value function. The actor is updated by following the gradient of the critic with respect to actions, an idea borrowed directly from the DPG theorem. Off-policy data sampled from a replay buffer is used to train both networks, while exploration is injected by adding noise (typically Ornstein-Uhlenbeck or Gaussian) to the deterministic actor's output during data collection.
DDPG dominated continuous-control benchmarks for a brief period and shaped a whole family of off-policy actor-critic algorithms, including TD3 (Twin Delayed DDPG), D4PG, and DDPG-from-demonstrations, and it influenced related stochastic-policy methods such as MPO. Its weaknesses (overestimation bias in the critic, brittle hyperparameters, and well-documented seed sensitivity) drove a wave of follow-up research. Soft Actor-Critic eventually displaced it as the default off-policy continuous-control algorithm, but DDPG is still taught as the canonical bridge between DPG and modern deep RL, and it remains a useful baseline in robotics, simulation, and energy management research.
Before DDPG, deep RL had two reasonably strong stories. On the value-based side, DQN showed that you could fit a Q-function with a neural network on raw Atari pixels if you stabilized training with experience replay and a slowly updated target network. On the policy-gradient side, methods like REINFORCE, TRPO, and natural policy gradient could handle continuous actions but were on-policy, sample-hungry, and (in TRPO's case) computationally heavy.
The gap was obvious. DQN was off-policy and data-efficient but only worked for discrete actions, because picking the greedy action requires argmax_a Q(s,a), which is intractable when a is a real-valued vector in, say, twenty dimensions. Policy-gradient methods worked for continuous actions but needed enormous amounts of fresh on-policy data and tended to thrash on tasks like locomotion.
DDPG was an attempt to get the best of both. Use a deterministic policy that you can train with the DPG gradient, learn the Q-function the way DQN does, and replace the argmax with the action that the policy network already produces. The paper makes this lineage explicit: the abstract calls the method "an actor-critic, model-free algorithm based on the deterministic policy gradient that can operate over continuous action spaces."
The theoretical groundwork was laid by David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller in Deterministic Policy Gradient Algorithms, presented at ICML 2014. Until that paper, the conventional wisdom in policy-gradient RL was that the policy had to be stochastic, because the standard policy gradient theorem (Sutton et al., 1999) integrates over the action distribution. The Silver et al. paper showed that a deterministic policy μ(s) has a well-defined gradient too, given by
∇θ J(μθ) = E_{s ~ ρ^μ} [ ∇θ μθ(s) · ∇a Q^μ(s,a) | a = μθ(s) ]
The expectation is over the state distribution ρ^μ induced by the policy μ itself (the off-policy variant replaces it with the state distribution of a behavior policy), and the action gradient ∇a Q^μ(s,a) is evaluated at the action the deterministic policy would currently choose. The proof relies on a regularity argument that connects the stochastic policy gradient to its limit as policy variance goes to zero. The practical consequence is large: the gradient no longer requires an integral over actions, so it can be estimated from sampled states alone, and the actor supplies the greedy action directly, which avoids the argmax over actions that makes DQN-style methods impractical in continuous spaces.
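The chain-rule structure of the theorem can be checked numerically in a deliberately simplified setting. The sketch below assumes a one-step problem with a fixed set of states, a scalar linear policy `mu(theta, s) = theta * s`, and a known function `q(s, a) = -(a - 2*s)**2`; none of these come from the DPG paper, and in the full theorem Q^μ and the state distribution also depend on the policy. Under those assumptions the deterministic policy gradient should match a finite-difference estimate of the objective.

```python
# Toy check of the deterministic policy gradient in a one-step problem.
# Assumptions (illustrative, not from the paper): scalar state and action,
# a linear policy mu_theta(s) = theta * s, and a known Q(s, a) = -(a - 2*s)**2.

def mu(theta, s):
    return theta * s

def q(s, a):
    return -(a - 2.0 * s) ** 2

def dq_da(s, a):
    # Analytic action gradient of Q.
    return -2.0 * (a - 2.0 * s)

def objective(theta, states):
    # J(theta) = mean over the fixed states of Q(s, mu_theta(s)).
    return sum(q(s, mu(theta, s)) for s in states) / len(states)

def dpg_gradient(theta, states):
    # Mean of (d mu / d theta) * (dQ / da), with dQ/da evaluated at a = mu_theta(s).
    # For the linear policy, d mu / d theta = s.
    return sum(s * dq_da(s, mu(theta, s)) for s in states) / len(states)

states = [-1.0, -0.5, 0.25, 1.0, 1.5]
theta = 0.5
eps = 1e-6

analytic = dpg_gradient(theta, states)
finite_diff = (objective(theta + eps, states) - objective(theta - eps, states)) / (2 * eps)
print(analytic, finite_diff)
```

The two printed numbers should agree closely, which is all the check is meant to show: the gradient flows from Q through the action into the policy parameter without any sum or integral over actions.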
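The ICML 2014 paper introduced an off-policy deterministic actor-critic (OPDAC) that learns from data generated by a separate behavior policy. Because the deterministic gradient removes the integral over actions and the critic is trained with Q-learning, the method needs no importance sampling in either the actor or the critic. The paper reported strong empirical results on low-dimensional continuous-control problems, including continuous versions of mountain car, pendulum, and puddle world, and on a simulated octopus arm, all with small function approximators. DDPG took the same theorem and pushed it through deep neural networks, which is what made the method famous.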
DDPG maintains four networks: an actor μ(s|θ^μ) and a critic Q(s,a|θ^Q), plus their target copies μ'(s|θ^μ') and Q'(s,a|θ^Q'). Only the actor and critic are trained with a gradient-based optimizer (the original paper used Adam); the target copies are never trained directly and instead track the online weights through soft updates.
| Component | Symbol | Role |
|---|---|---|
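| Actor network | `μ(s \| θ^μ)` | Maps a state to a single deterministic action |
| Critic network | `Q(s,a \| θ^Q)` | Estimates the action-value of a state-action pair |
| Target actor | `μ'(s \| θ^μ')` | Slowly updated copy of the actor; supplies the next-state action in the Bellman target |
| Target critic | `Q'(s,a \| θ^Q')` | Slowly updated copy of the critic; evaluates the next-state action in the Bellman target |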
| Replay buffer | R | Stores transitions (s_t, a_t, r_t, s_{t+1}) for off-policy updates |
| Exploration noise | N_t | Added to actor output during rollouts, typically Ornstein-Uhlenbeck |
| Batch normalization | (in the original paper) | Normalizes per-feature inputs to handle low-dimensional states across different physical units |
The critic is fit to a one-step Bellman target using off-policy samples from the replay buffer. For a minibatch of N transitions (s_i, a_i, r_i, s_{i+1}), the target is
y_i = r_i + γ · Q'(s_{i+1}, μ'(s_{i+1}|θ^μ') | θ^Q')
and the critic loss is the mean-squared TD error
L(θ^Q) = (1/N) Σ_i ( y_i - Q(s_i, a_i | θ^Q) )^2.
This is essentially the DQN update except that the next-state action is supplied by the target actor instead of by an argmax.
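As a rough illustration of the target and loss above, the sketch below computes y_i and the mean-squared TD error for a two-transition minibatch. The callables `target_actor`, `target_critic`, and `critic` are hypothetical scalar stand-ins for the networks, chosen only so the arithmetic is easy to follow.

```python
# Sketch of the one-step critic target and mean-squared TD loss.
# The three callables below are hypothetical stand-ins for neural networks.

GAMMA = 0.99

def critic_targets(batch, target_actor, target_critic, gamma=GAMMA):
    # y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1}))
    return [r + gamma * target_critic(s_next, target_actor(s_next))
            for (_s, _a, r, s_next) in batch]

def critic_loss(batch, targets, critic):
    # (1/N) * sum_i (y_i - Q(s_i, a_i))^2
    errors = [y - critic(s, a) for (s, a, _r, _s_next), y in zip(batch, targets)]
    return sum(e * e for e in errors) / len(errors)

# Toy stand-ins (assumptions for illustration only).
target_actor = lambda s: 0.5 * s
target_critic = lambda s, a: s + a
critic = lambda s, a: 2.0 * s - a

# Each transition is (s, a, r, s_next).
batch = [(1.0, 0.0, 1.0, 2.0), (0.0, 1.0, 0.0, -2.0)]
ys = critic_targets(batch, target_actor, target_critic)
loss = critic_loss(batch, ys, critic)
print(ys, loss)
```

The target formula as written here has no special case for terminal transitions. Implementations commonly multiply the bootstrap term by `(1 - done)` so that no value is propagated past the end of an episode; that flag is omitted from this sketch to stay close to the equations in the text.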
The actor is updated by gradient ascent on the critic's estimate of expected return, applied through the deterministic policy gradient:
∇θ^μ J ≈ (1/N) Σ_i ∇a Q(s, a | θ^Q)|_{s=s_i, a=μ(s_i)} · ∇θ^μ μ(s | θ^μ)|_{s=s_i}.
In code this is typically implemented as loss = -mean(Q(s, μ(s))), which is then backpropagated. Because the action is a differentiable input to the critic, the chain rule routes the gradient from the Q-value through the action and into the actor parameters. Only the actor's parameters are updated in this step; the critic's weights are held fixed.
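A minimal sketch of that actor step, with the gradient written out by hand instead of relying on an autodiff library. It assumes a linear actor `theta * s` and a fixed critic `Q(s, a) = -(a - 2*s)**2`, both invented for illustration, so the best actor parameter is known to be 2.

```python
# Toy actor update: gradient descent on loss = -mean(Q(s, mu(s))).
# Assumptions (illustrative): linear actor mu(s) = theta * s and a fixed,
# known critic Q(s, a) = -(a - 2*s)**2, whose maximizing action is a = 2*s.

def actor_loss_grad(theta, states):
    # d/dtheta of -mean(Q(s, theta * s)) via the chain rule:
    # dQ/da = -2 * (a - 2*s) and da/dtheta = s.
    total = 0.0
    for s in states:
        a = theta * s
        dq_da = -2.0 * (a - 2.0 * s)
        total += -dq_da * s
    return total / len(states)

def train_actor(theta, states, lr=0.1, steps=200):
    for _ in range(steps):
        # Only the actor parameter moves; the critic is never modified here.
        theta -= lr * actor_loss_grad(theta, states)
    return theta

states = [-1.0, -0.5, 0.25, 1.0, 1.5]
theta_final = train_actor(0.0, states)
print(theta_final)
```

In a real implementation the critic is being trained at the same time, so the actor is chasing a moving and imperfect estimate; this sketch isolates the gradient path only.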
Unlike DQN, which periodically copies the online weights into the target network, DDPG uses soft updates after every gradient step:
θ' ← τ θ + (1 - τ) θ',
with a small τ (the paper uses 0.001). This Polyak averaging gives the target networks a much slower effective learning rate than the online networks and was found to be essential for stability. The paper notes that without target networks the critic frequently diverges.
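The soft update is a one-line exponential moving average. The sketch below applies it to flat lists of floats standing in for network weights (an assumption made to keep the example dependency-free) and shows how slowly the target follows when τ = 0.001.

```python
def soft_update(target_params, online_params, tau=0.001):
    # theta' <- tau * theta + (1 - tau) * theta', element by element.
    return [tau * w + (1.0 - tau) * w_t
            for w, w_t in zip(online_params, target_params)]

online = [1.0, -2.0]   # stand-in for the online network weights, held fixed here
target = [0.0, 0.0]    # stand-in for the target network weights

# With frozen online weights, the remaining gap shrinks by (1 - tau) per update.
for _ in range(1000):
    target = soft_update(target, online)

print(target)
```

With τ = 0.001 the target has closed only about 63% of the gap after 1,000 updates, which gives a feel for the time scale on which the Bellman targets move.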
Because the policy is deterministic, all exploration must come from outside. The original paper adds an Ornstein-Uhlenbeck (OU) process to the actor's output:
a_t = μ(s_t | θ^μ) + N_t,
where N_t is sampled from an OU process with mean-reversion parameter θ = 0.15 and volatility σ = 0.2. The OU noise was chosen because it is temporally correlated, which the authors hypothesized would help on physical control tasks with momentum. Later work, notably the TD3 paper, reported that uncorrelated Gaussian noise works just as well in practice on standard MuJoCo tasks, and SAC sidesteps the question by sampling actions from a stochastic policy, so most modern implementations skip the OU process.
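One common way to simulate the OU process is an Euler-Maruyama step, sketched below. The step size `dt`, the zero long-run mean `mu`, the reset-to-mean behavior, and the clipping range in `noisy_action` are assumptions of this sketch; the text above only fixes θ = 0.15 and σ = 0.2.

```python
import math
import random

class OUNoise:
    """Discretized Ornstein-Uhlenbeck process (Euler-Maruyama step)."""

    def __init__(self, size, theta=0.15, sigma=0.2, mu=0.0, dt=1.0, seed=None):
        self.size = size
        self.theta = theta
        self.sigma = sigma
        self.mu = mu
        self.dt = dt            # assumed step size; not specified in the text
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        # Restart the process at its mean, e.g. at the start of an episode.
        self.x = [self.mu] * self.size

    def sample(self):
        # x <- x + theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1)
        self.x = [
            x + self.theta * (self.mu - x) * self.dt
            + self.sigma * math.sqrt(self.dt) * self.rng.gauss(0.0, 1.0)
            for x in self.x
        ]
        return list(self.x)

def noisy_action(action, noise, low=-1.0, high=1.0):
    # a_t = mu(s_t) + N_t, clipped to an assumed valid action range.
    return [min(high, max(low, a + n)) for a, n in zip(action, noise.sample())]

noise = OUNoise(size=2, seed=0)
print(noisy_action([0.3, -0.7], noise))
```

Because each sample depends on the previous one, consecutive perturbations tend to push in the same direction for several steps, which is the temporal correlation the authors were after. Swapping `OUNoise.sample` for independent Gaussian draws gives the uncorrelated variant.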

    Initialize critic Q(s,a|θ^Q) and actor μ(s|θ^μ) with random weights.
    Initialize target networks θ^Q' ← θ^Q, θ^μ' ← θ^μ.
    Initialize replay buffer R.
    for episode = 1 to M:
        Initialize a random process N for exploration.
        Receive initial observation s_1.
        for t = 1 to T:
            Select action a_t = μ(s_t | θ^μ) + N_t.
            Execute a_t, observe r_t and s_{t+1}.
            Store transition (s_t, a_t, r_t, s_{t+1}) in R.
            Sample minibatch of N transitions from R.
            Compute target y_i = r_i + γ Q'(s_{i+1}, μ'(s_{i+1}|θ^μ') | θ^Q').
            Update critic by minimizing (1/N) Σ_i (y_i - Q(s_i, a_i | θ^Q))^2.
            Update actor by sampled DPG:
                ∇θ^μ J ≈ (1/N) Σ_i ∇a Q(s, a|θ^Q)|_{s=s_i, a=μ(s_i)} · ∇θ^μ μ(s|θ^μ)|_{s=s_i}.
            Soft-update target networks:
                θ^Q' ← τ θ^Q + (1-τ) θ^Q'
                θ^μ' ← τ θ^μ + (1-τ) θ^μ'.
        end for
    end for

This is essentially Algorithm 1 of Lillicrap et al. (2016), modulo notation.
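The "store transition" and "sample minibatch" steps of the loop can be sketched with a small ring buffer. The class name `ReplayBuffer`, its method names, and the use of Python lists are choices made for this illustration; production implementations usually preallocate arrays instead.

```python
import random

class ReplayBuffer:
    """Fixed-capacity FIFO store of (s, a, r, s_next) transitions."""

    def __init__(self, capacity=1_000_000, seed=None):
        self.capacity = capacity
        self.storage = []
        self.position = 0          # index of the next slot to write
        self.rng = random.Random(seed)

    def add(self, s, a, r, s_next):
        transition = (s, a, r, s_next)
        if len(self.storage) < self.capacity:
            self.storage.append(transition)
        else:
            # Buffer is full: overwrite the oldest transition.
            self.storage[self.position] = transition
        self.position = (self.position + 1) % self.capacity

    def sample(self, batch_size=64):
        # Uniform sampling without replacement within one minibatch.
        return self.rng.sample(self.storage, batch_size)

    def __len__(self):
        return len(self.storage)

buffer = ReplayBuffer(capacity=3, seed=0)
for step in range(5):
    buffer.add(s=step, a=0.0, r=float(step), s_next=step + 1)
print(len(buffer), sorted(t[2] for t in buffer.storage))
```

Sampling uniformly from a large buffer breaks the correlation between consecutive transitions, which is what lets the critic be trained with ordinary minibatch gradient descent.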
The paper reports a single hyperparameter setting that worked across all tested environments without per-task tuning, which was the headline result at the time. These values still appear as the defaults in most reimplementations.
| Hyperparameter | Value | Notes |
|---|---|---|
| Actor learning rate | 1e-4 | Adam |
| Critic learning rate | 1e-3 | Adam, with L2 weight decay 1e-2 |
| Discount factor γ | 0.99 | |
| Soft update rate τ | 0.001 | Polyak averaging |
| Replay buffer size | 1e6 | Stores transitions FIFO |
| Minibatch size | 64 | |
| Hidden layer sizes | 400, 300 | Two fully connected layers; actor has tanh output |
| Action input layer | After first hidden layer (in critic) | The action is concatenated with the first-layer state activations |
| Final-layer init | Uniform [-3e-3, 3e-3] | To keep initial actions and Q-values near zero |
| Other layers init | Uniform [-1/√f, 1/√f] | Where f is fan-in |
| Exploration noise | OU process with θ = 0.15, σ = 0.2 | Added to actor output |
| Batch normalization | Yes, on every layer of the actor and on state path of the critic | Critical for low-dim states with mixed units |
| Reward scaling | None for low-dim, 0.1 for pixels | Pixel agents had different reward scales |
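For reference, the tabulated values can be collected into a single configuration object, together with a sketch of the two initialization rules. The names `DDPGConfig` and `init_layer` are hypothetical, and weights are plain nested lists so that the example needs no numerical library.

```python
import math
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class DDPGConfig:
    # Values copied from the table above; field names are illustrative.
    actor_lr: float = 1e-4
    critic_lr: float = 1e-3
    critic_weight_decay: float = 1e-2
    gamma: float = 0.99
    tau: float = 0.001
    buffer_size: int = 1_000_000
    batch_size: int = 64
    hidden_sizes: tuple = (400, 300)
    ou_theta: float = 0.15
    ou_sigma: float = 0.2
    final_init: float = 3e-3

def init_layer(fan_in, fan_out, rng, final=False, final_init=3e-3):
    # Hidden layers: uniform in [-1/sqrt(fan_in), 1/sqrt(fan_in)].
    # Final layer:   uniform in [-final_init, final_init].
    bound = final_init if final else 1.0 / math.sqrt(fan_in)
    return [[rng.uniform(-bound, bound) for _ in range(fan_in)]
            for _ in range(fan_out)]

cfg = DDPGConfig()
rng = random.Random(0)
hidden = init_layer(cfg.hidden_sizes[0], cfg.hidden_sizes[1], rng)
output = init_layer(cfg.hidden_sizes[1], 1, rng, final=True, final_init=cfg.final_init)
print(len(hidden), len(hidden[0]), len(output), len(output[0]))
```

The narrow range on the final layer keeps the initial actions and Q-values close to zero, so early Bellman targets start small and the tanh output of the actor starts in its linear region.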
The paper actually describes two main architectures: a low-dimensional state version and a pixel version. The pixel agent uses three convolutional layers (32 filters, 3 by 3, no pooling) before the fully connected stack, and a stack of three frames as the input.
DDPG was evaluated on more than 20 simulated continuous-control tasks, mostly built in MuJoCo. The authors compared a low-dimensional version (state vector input) with a pixel version (raw 64 by 64 RGB frames) and a planning baseline (iLQG with full access to the simulator's dynamics).
| Domain | Description | Action dim | Result |
|---|---|---|---|
| Cartpole swing-up | Swing up and balance an underactuated pole | 1 | Solved from low-dim and pixels |
| Pendulum | Classic swing-up | 1 | Solved |
| Reacher | 2-link arm reaching a random target | 2 | Solved |
| Cheetah | Planar half-cheetah running | 6 | Strong policies, comparable to iLQG with planning |
| Walker2d | Bipedal walking | 6 | Learned forward locomotion |
| Hopper | One-legged hopping | 3 | Learned hopping gait |
| Ant | Quadrupedal locomotion | 8 | Learned forward gait |
| Humanoid | High-dim humanoid | 17 | Limited progress; later improved by D4PG and TD3 |
| Gripper | Robotic gripper grasping | 5 | Learned grasping |
| Torcs | Driving simulator | 3 (steering, throttle, brake) | Lapped tracks; included a pixel-only version |
The authors reported that, on most tasks, the low-dimensional and pixel agents reached comparable performance, which was the most impressive part of the result at the time. The Humanoid task already hinted at DDPG's instability on very high-dimensional control, a weakness that later motivated TD3 and D4PG.
DDPG sits at the head of a family tree of off-policy actor-critic methods. Each successor was designed to fix a specific failure mode in DDPG.
| Algorithm | Year | Authors | Key change vs. DDPG |
|---|---|---|---|
| TD3 (Twin Delayed DDPG) | 2018 | Fujimoto, van Hoof, Meger (ICML 2018) | Two critics with min to mitigate Q overestimation; delayed actor updates; target policy smoothing noise |
| SAC (Soft Actor-Critic) | 2018 | Haarnoja et al. (ICML 2018, plus 2018 "Algorithms and Applications" follow-up) | Stochastic Gaussian actor, maximum-entropy objective with learned temperature, two critics like TD3 |
| D4PG (Distributed Distributional DDPG) | 2018 | Barth-Maron et al. (ICLR 2018) | Distributional critic (C51-style), N-step returns, prioritized experience replay, distributed actors |
| MPO (Maximum a Posteriori Policy Optimization) | 2018 | Abdolmaleki et al. | Reframes actor update as expectation-maximization with KL constraints; closely related family but with stochastic policies |
| DDPG-from-demonstrations | 2017 | Vecerik et al. | Adds a demonstration buffer with prioritized sampling for sparse-reward robotics |
The TD3 paper is particularly important for understanding DDPG's reputation. Fujimoto et al. showed that DDPG's critic systematically overestimates Q-values, that the deterministic actor exploits these overestimations, and that a single change (taking the minimum of two independently trained critics for the Bellman target) closes most of the gap to better-tuned methods. They also showed that adding clipped Gaussian noise to the target action during the Bellman backup ("target policy smoothing") reduces overfitting to narrow peaks in the critic's value estimate.
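The two changes to the Bellman target described above can be sketched for a scalar action as follows. The function name `td3_target`, the stand-in callables, and the action bounds are assumptions of this illustration; the noise scale 0.2 and clip 0.5 follow the defaults reported in the TD3 paper.

```python
import random

def clip(x, low, high):
    return max(low, min(high, x))

def td3_target(r, s_next, target_actor, target_q1, target_q2, rng,
               gamma=0.99, noise_std=0.2, noise_clip=0.5,
               a_low=-1.0, a_high=1.0):
    # Target policy smoothing: perturb the target action with clipped Gaussian noise.
    eps = clip(rng.gauss(0.0, noise_std), -noise_clip, noise_clip)
    a_next = clip(target_actor(s_next) + eps, a_low, a_high)
    # Clipped double-Q: bootstrap from the smaller of the two target critics.
    q_next = min(target_q1(s_next, a_next), target_q2(s_next, a_next))
    return r + gamma * q_next

# Hypothetical stand-ins: the two critics disagree, and the target uses the lower one.
rng = random.Random(0)
y = td3_target(
    r=1.0, s_next=0.0,
    target_actor=lambda s: 0.5,
    target_q1=lambda s, a: 10.0,
    target_q2=lambda s, a: 4.0,
    rng=rng,
)
print(y)
```

Compared with the DDPG target, the only differences are the noise added to the target action and the `min` over two critics; everything else in the critic update is unchanged.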
SAC went further by replacing the deterministic actor with a stochastic Gaussian and adding an entropy-bonus term to the reward, which made the algorithm both more robust to hyperparameters and less seed-sensitive. By 2019 SAC had largely replaced DDPG as the default off-policy choice for continuous control.
For a side-by-side comparison of the three methods most often confused with each other:
| Property | DDPG | TD3 | SAC |
|---|---|---|---|
| Policy | Deterministic | Deterministic | Stochastic Gaussian |
| Critics | 1 | 2 (twin, take min) | 2 (twin, take min) |
| Actor update frequency | Every step | Every d critic steps (default 2) | Every step |
| Exploration | OU or Gaussian noise added externally | Gaussian noise added externally | Stochastic policy + entropy bonus |
| Target smoothing | No | Yes | Implicit via stochastic policy |
| Entropy term | No | No | Yes, with learnable temperature |
| Reproducibility | Notoriously sensitive | Better | Best of the three |
DDPG is included in essentially every modern RL library. Common implementations include:
| Library | DDPG implementation |
|---|---|
| OpenAI Spinning Up | Reference PyTorch and TF1 implementations with paper-faithful defaults; the docs explicitly walk through DDPG, TD3, and SAC together |
| Stable Baselines3 | `stable_baselines3.DDPG`, with TD3 as the recommended successor |
| Ray RLlib | `ray.rllib.algorithms.ddpg.DDPG`, supports multi-GPU and distributed training |
| CleanRL | Single-file `ddpg_continuous_action.py`; widely used for teaching and reproducibility |
| TF-Agents | `tf_agents.agents.ddpg.ddpg_agent.DdpgAgent` |
| Acme | `acme.agents.tf.ddpg`, the DeepMind in-house framework |
| MushroomRL, Tianshou, Garage | All include DDPG, mostly for completeness |
Most of these libraries default to Gaussian exploration noise (rather than OU) and use somewhat larger replay buffers and minibatches than the original paper. Modern reimplementations also tend to drop batch normalization on the critic, since later work found it to be more trouble than it was worth on standard benchmarks.
DDPG developed a reputation for being temperamental almost as soon as it was released. The Henderson, Islam, Bachman, Pineau, Precup, and Meger paper Deep Reinforcement Learning that Matters (AAAI 2018) is the standard citation here. The authors compared DDPG implementations from different codebases on the same MuJoCo tasks and found that reported performance varied substantially with the implementation, the hyperparameters, and the random seed.
Later work explained part of this: the deterministic actor combined with a single critic gives the policy a strong incentive to drive into regions where the critic over-estimates Q, and these regions are sensitive to initialization. TD3's twin critics and SAC's entropy bonus both help here, which is one reason both methods are noticeably less seed-sensitive than DDPG.
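Other failure modes that show up in practice include Q-value estimates that blow up during training and strong sensitivity to the exploration-noise scale and the learning rates.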
The practical advice that emerged is roughly: if you can use SAC or TD3, do; if you must use DDPG, run at least 5 seeds, watch for Q-value blowup, and tune the noise and learning rates carefully on a small task before scaling up.
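One simple way to "watch for Q-value blowup" is to compare the critic's outputs against the largest return that is possible at all. If rewards are bounded by r_max, any true discounted return satisfies |Q| ≤ r_max / (1 - γ). The helper below is a sketch of that check; the function names and the `slack` factor are assumptions, not part of any library.

```python
def q_value_bound(max_abs_reward, gamma):
    # With |r| <= r_max, every discounted return satisfies |Q| <= r_max / (1 - gamma).
    return max_abs_reward / (1.0 - gamma)

def q_blowup(q_values, max_abs_reward, gamma=0.99, slack=1.5):
    # Flag a minibatch whose largest |Q| exceeds the bound by the given slack factor.
    bound = q_value_bound(max_abs_reward, gamma)
    return max(abs(q) for q in q_values) > slack * bound

print(q_value_bound(1.0, 0.99))
print(q_blowup([5.0, -20.0, 80.0], max_abs_reward=1.0))
print(q_blowup([5.0, 400.0], max_abs_reward=1.0))
```

A critic whose predictions sit well above this bound cannot be estimating a real return, so a flag from this check is a cheap early signal that a run (or a seed) has diverged.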
Despite its limitations, DDPG and its descendants have been used in a wide range of continuous-control settings, including robotics, simulation, and energy management.
In most of these areas TD3 or SAC are now the default choice in published baselines. DDPG is still the algorithm people start with when explaining the method to a class.
In modern RL practice, DDPG is mostly a teaching algorithm and a baseline. The current default for off-policy continuous control is SAC, often with implementation details borrowed from TD3 (twin critics, target smoothing). The combination of "deterministic actor + single critic + replay" that defines DDPG has been almost entirely replaced by "stochastic actor + entropy + twin critics + replay."
What keeps DDPG relevant is its pedagogical role. It is the smallest deep RL algorithm that exposes all the moving parts at once: an actor, a critic, a replay buffer, target networks, and an exploration scheme. Read the DDPG paper, then the TD3 paper, then the SAC paper, and you have a tour of what off-policy actor-critic deep RL learned between 2015 and 2018. The lineage from Silver et al. (2014) through Lillicrap et al. (2016) to Fujimoto et al. (2018) and Haarnoja et al. (2018) is the cleanest progression in deep RL: each step fixes a specific, identifiable problem with the previous one.
The deterministic-policy idea itself has aged better than DDPG-the-algorithm. Off-policy deterministic actors still appear in robotics-scale work where stochastic exploration is impractical, in offline RL methods that need a target policy with a defined argmax_a Q(s,a), and in distillation pipelines that compress stochastic teachers into deterministic students. The Silver et al. theorem that powered DDPG continues to be cited as the basis for these methods.