# DDPG (Deep Deterministic Policy Gradient)

> Source: https://aiwiki.ai/wiki/ddpg
> Updated: 2026-06-24
> Categories: Deep Learning, Reinforcement Learning
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**DDPG** (Deep Deterministic Policy Gradient) is an off-policy, model-free actor-critic algorithm in [deep reinforcement learning](/wiki/reinforcement_learning) that learns continuous-control policies by combining a deterministic actor with a Q-value critic. It was introduced by Timothy Lillicrap, Jonathan Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra at [DeepMind](/wiki/deepmind) in the paper *Continuous control with deep reinforcement learning*, posted to arXiv on September 9, 2015 and presented at ICLR 2016 in San Juan, Puerto Rico.[1] Using a single set of hyperparameters, DDPG "robustly solves more than 20 simulated physics tasks, including classic problems such as cartpole swing-up, dexterous manipulation, legged locomotion and car driving," and for many of those tasks it learned policies "end-to-end: directly from raw pixel inputs."[1] It combined the [Deterministic Policy Gradient](/wiki/policy_gradient) (DPG) theorem of David Silver et al. (ICML 2014) with the engineering tricks that made [DQN](/wiki/dqn) work on [Atari](/wiki/atari) games, namely a [replay buffer](/wiki/replay_buffer) and slowly-updated [target networks](/wiki/target_network).[1][2][5]

The algorithm trains two [neural networks](/wiki/neural_network) at the same time. A *deterministic* actor network maps states directly to actions, and a [critic](/wiki/critic) network estimates the action-value function. The actor is updated by following the gradient of the critic with respect to actions, an idea borrowed directly from the DPG theorem.[2] Off-policy data sampled from a [replay buffer](/wiki/experience_replay) is used to train both networks, while exploration is injected by adding noise (typically Ornstein-Uhlenbeck or Gaussian) to the deterministic actor's output during data collection.[1] DDPG is an [actor-critic](/wiki/actor_critic) method with a deterministic actor: there is no policy distribution to sample from, so all exploration has to be added externally.

DDPG dominated continuous-control benchmarks for a brief period and shaped a whole family of off-policy actor-critic algorithms, including the deterministic [TD3](/wiki/td3) (Twin Delayed DDPG), D4PG, and DDPG-from-demonstrations, as well as the closely related stochastic-policy method MPO.[6][9] Its weaknesses (overestimation bias in the critic, brittle hyperparameters, and well-documented seed sensitivity) drove a wave of follow-up research.[6][10] [Soft Actor-Critic](/wiki/soft_actor_critic) eventually displaced it as the default off-policy continuous-control algorithm, but DDPG is still taught as the canonical bridge between DPG and modern deep RL, and it remains a useful baseline in robotics, simulation, and energy management research.[7]

## What problem was DDPG designed to solve?

Before DDPG, deep RL had two reasonably strong stories. On the value-based side, [DQN](/wiki/dqn) showed that you could fit a Q-function with a [neural network](/wiki/neural_network) on raw Atari pixels if you stabilized training with [experience replay](/wiki/experience_replay) and a slowly updated [target network](/wiki/target_network).[5] On the policy-gradient side, methods like REINFORCE, TRPO, and natural policy gradient could handle continuous actions but were on-policy, sample-hungry, and (in TRPO's case) computationally heavy.

The gap was obvious. DQN was off-policy and data-efficient but only worked for discrete actions, because picking the greedy action requires `argmax_a Q(s,a)`, which is intractable when `a` is a real-valued vector in, say, twenty dimensions. Policy-gradient methods worked for continuous actions but needed enormous amounts of fresh on-policy data and tended to thrash on tasks like locomotion.
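To make the scaling concrete, here is a minimal sketch of what happens if one tries to rescue the `argmax` by discretizing each action dimension. The helper name `num_discrete_actions` and the choice of three bins per dimension are illustrative assumptions, not something prescribed by DQN or DDPG.

```python
def num_discrete_actions(bins_per_dim: int, action_dim: int) -> int:
    """Size of the joint action grid when every dimension gets the same number of bins."""
    return bins_per_dim ** action_dim


# Assumed coarse grid: 3 bins per dimension (for example low / zero / high).
for dim in (1, 7, 20):
    print(dim, num_discrete_actions(3, dim))
```

Even this coarse grid grows exponentially with the action dimension, so a DQN-style network with one output per discrete action quickly becomes impractical.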

DDPG was an attempt to get the best of both. Use a deterministic policy that you can train with the DPG gradient, learn the Q-function the way DQN does, and replace the `argmax` with the action that the policy network already produces. The paper makes this lineage explicit: the abstract states that the authors "adapt the ideas underlying the success of Deep Q-Learning to the continuous action domain" and "present an actor-critic, model-free algorithm based on the deterministic policy gradient that can operate over continuous action spaces."[1]

## Predecessor: the deterministic policy gradient theorem

The theoretical groundwork was laid by David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller in *Deterministic Policy Gradient Algorithms*, presented at ICML 2014.[2] Until that paper, the conventional wisdom in policy-gradient RL was that the policy had to be stochastic, because the standard policy gradient theorem (Sutton et al., 1999) integrates over the action distribution.[4] The Silver et al. paper showed that a deterministic policy `μ(s)` has a well-defined gradient too, given by

```
∇θ J(μθ) = E_{s ~ ρ^μ} [ ∇θ μθ(s) · ∇a Q^μ(s,a) | a = μθ(s) ]
```

The expectation is over the discounted state visitation distribution `ρ^μ` induced by the deterministic policy itself (the off-policy variant replaces it with the distribution induced by a behavior policy), and the action gradient `∇a Q^μ(s,a)` is evaluated at the action the deterministic policy would currently choose. The proof relies on a regularity argument that connects the stochastic policy gradient to its limit as policy variance goes to zero, and the practical consequence is enormous: the deterministic gradient takes the form of an expected gradient of the action-value function, so it integrates only over states rather than over both states and actions, which makes it much cheaper to estimate when the action space is high-dimensional.[2]
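As a sanity check on the formula, the sketch below evaluates the sampled deterministic policy gradient on a toy problem and compares it with a finite-difference estimate. Everything here is an illustrative assumption: a scalar linear policy `μθ(s) = θ·s`, a hand-written quadratic `Q`, and a fixed set of states that does not depend on the policy (a simplification that real control tasks do not satisfy).

```python
def mu(theta, s):
    """Toy deterministic policy: a = theta * s."""
    return theta * s


def q(s, a):
    """Toy action-value function, maximized at a = 2 * s."""
    return -(a - 2.0 * s) ** 2


def dq_da(s, a):
    """Analytic gradient of the toy Q with respect to the action."""
    return -2.0 * (a - 2.0 * s)


def dpg_gradient(theta, states):
    """(1/N) * sum_i  dmu/dtheta(s_i) * dQ/da(s_i, a) evaluated at a = mu(s_i)."""
    return sum(s * dq_da(s, mu(theta, s)) for s in states) / len(states)


def objective(theta, states):
    """Average Q-value of the policy's own actions over the fixed states."""
    return sum(q(s, mu(theta, s)) for s in states) / len(states)


def finite_difference(theta, states, eps=1e-6):
    """Central finite-difference estimate of d objective / d theta."""
    return (objective(theta + eps, states) - objective(theta - eps, states)) / (2 * eps)


states = [-1.0, 0.5, 2.0]
theta = 0.3
print(dpg_gradient(theta, states), finite_difference(theta, states))
```

The two numbers agree up to floating-point error, which is the chain rule at work: the policy's parameter gradient is multiplied by the critic's action gradient, with no integral over actions.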

The ICML 2014 paper introduced an off-policy deterministic actor-critic (OPDAC) that learns from trajectories generated by a stochastic behavior policy. Because the deterministic gradient has no integral over actions and the critic is trained with Q-learning, OPDAC needs no importance sampling in either the actor or the critic. The paper also showed that deterministic policy gradient algorithms can significantly outperform their stochastic counterparts in high-dimensional action spaces.[2] DDPG took the same theorem and pushed it through deep neural networks, which is what made the method famous.

## How does the DDPG algorithm work?

DDPG maintains four networks at once: an actor `μ(s|θ^μ)` and a critic `Q(s,a|θ^Q)`, plus their target copies `μ'(s|θ^μ')` and `Q'(s,a|θ^Q')`. All four are deep [neural networks](/wiki/neural_network), but only the actor and critic are trained with a gradient-based optimizer (the original paper used [Adam](/wiki/adam_optimizer)); the target copies are not trained directly and instead track the online weights through soft updates.[1]

### Components

| Component | Symbol | Role |
|---|---|---|
| Actor network | `μ(s|θ^μ)` | Deterministic policy, maps state to action |
| Critic network | `Q(s,a|θ^Q)` | Action-value function, estimates expected return |
| Target actor | `μ'(s|θ^μ')` | Slow-moving copy of the actor for stable Q targets |
| Target critic | `Q'(s,a|θ^Q')` | Slow-moving copy of the critic for stable Q targets |
| Replay buffer | `R` | Stores transitions `(s_t, a_t, r_t, s_{t+1})` for off-policy updates |
| Exploration noise | `N_t` | Added to actor output during rollouts, typically Ornstein-Uhlenbeck |
| Batch normalization | (in the original paper) | Normalizes per-feature inputs to handle low-dimensional states across different physical units |

### Critic update

The critic is fit to a one-step Bellman target using off-policy samples from the replay buffer. For a minibatch of `N` transitions `(s_i, a_i, r_i, s_{i+1})`, the target is

```
y_i = r_i + γ · Q'(s_{i+1}, μ'(s_{i+1}|θ^μ') | θ^Q')
```

and the critic loss is the mean-squared TD error

```
L(θ^Q) = (1/N) Σ_i ( y_i - Q(s_i, a_i | θ^Q) )^2.
```

This is essentially the [DQN](/wiki/dqn) update except that the next-state action is supplied by the target actor instead of by an `argmax`.[1][5]
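The target and loss above can be sketched in a few lines of plain Python. The function names `td_targets` and `critic_loss`, and the toy lambdas standing in for the target networks, are assumptions made for illustration; a real implementation would use batched tensors and an autodiff library.

```python
def td_targets(rewards, next_states, target_actor, target_critic, gamma=0.99):
    """y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1})) for each transition in the minibatch."""
    return [
        r + gamma * target_critic(s_next, target_actor(s_next))
        for r, s_next in zip(rewards, next_states)
    ]


def critic_loss(q_values, targets):
    """Mean-squared TD error between the critic's predictions and the targets."""
    return sum((y - q) ** 2 for q, y in zip(q_values, targets)) / len(targets)


# Hypothetical stand-ins for the target networks (scalar state and action).
target_actor = lambda s: 0.5 * s
target_critic = lambda s, a: s + a

rewards = [1.0, 0.0]
next_states = [2.0, -1.0]
targets = td_targets(rewards, next_states, target_actor, target_critic)

# Pretend the online critic predicted these values for (s_i, a_i).
q_values = [3.0, -1.0]
print(targets, critic_loss(q_values, targets))
```

The sketch follows the formula as written, without special handling of terminal states. Implementations commonly multiply the bootstrap term by `(1 - done)` so that no value is bootstrapped past the end of an episode.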

### Actor update

The actor is updated by gradient ascent on the critic's estimate of expected return, applied through the deterministic policy gradient:

```
∇θ^μ J ≈ (1/N) Σ_i ∇a Q(s, a | θ^Q) | s = s_i, a = μ(s_i) · ∇θ^μ μ(s_i | θ^μ).
```

In code this is typically implemented as `loss = -mean(Q(s, μ(s)))` and then backpropagated. Because the critic is differentiable with respect to its action input, the chain rule routes the gradient from the Q-value through the action and into the actor parameters. Only the actor's parameters are stepped in this update; the critic's weights are left unchanged.[1][2]
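A minimal sketch of this update with the chain rule written out by hand, rather than delegated to an autodiff library, may help. The two-parameter `tanh` actor, the frozen toy critic `Q(s, a) = s·a − a²`, and the learning rate are all illustrative assumptions.

```python
import math


def actor(params, s):
    """Toy actor with a tanh output, mu(s) = tanh(w * s + b)."""
    w, b = params
    return math.tanh(w * s + b)


def critic(s, a):
    """Frozen toy critic; for a given s it is maximized at a = s / 2."""
    return s * a - a * a


def critic_action_grad(s, a):
    """dQ/da for the toy critic."""
    return s - 2.0 * a


def actor_loss(params, states):
    """loss = -mean(Q(s, mu(s)))."""
    return -sum(critic(s, actor(params, s)) for s in states) / len(states)


def actor_step(params, states, lr=0.1):
    """One gradient-ascent step on mean Q, using dQ/da * dmu/dparams."""
    w, b = params
    grad_w = grad_b = 0.0
    for s in states:
        a = actor(params, s)
        dq_da = critic_action_grad(s, a)
        dtanh = 1.0 - a * a  # derivative of tanh at the pre-activation
        grad_w += dq_da * dtanh * s
        grad_b += dq_da * dtanh
    n = len(states)
    return (w + lr * grad_w / n, b + lr * grad_b / n)


states = [-1.0, -0.5, 0.5, 1.0]
params = (0.0, 0.0)
initial_loss = actor_loss(params, states)
for _ in range(200):
    params = actor_step(params, states)
final_loss = actor_loss(params, states)
print(initial_loss, final_loss, params)
```

The critic is never modified inside `actor_step`; it only supplies the direction in action space that increases the estimated return.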

### Soft target updates

Unlike DQN, which periodically copies the online weights into the target network, DDPG uses *soft* updates after every gradient step:

```
θ' ← τ θ + (1 - τ) θ',
```

with a small `τ` (the paper uses 0.001). This Polyak averaging gives the target networks a much slower effective learning rate than the online networks and was found to be essential for stability. The paper notes that without target networks the critic frequently diverges.[1]
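The soft update is a one-line exponential moving average. The sketch below uses flat lists of floats as stand-ins for network parameters (an assumption for illustration) and holds the online weights fixed to show how slowly the target catches up.

```python
def soft_update(target, online, tau=0.001):
    """Return tau * online + (1 - tau) * target, element-wise."""
    return [tau * w + (1.0 - tau) * w_t for w, w_t in zip(online, target)]


online = [1.0, -2.0]
target = [0.0, 0.0]
for _ in range(1000):
    target = soft_update(target, online)
print(target)
```

With the online weights frozen, the remaining gap shrinks by a factor of `(1 - τ)` per step, so with `τ = 0.001` the target has covered only about 63% of the distance after 1000 updates.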

### Exploration

Because the policy is deterministic, all exploration must come from outside. The original paper adds an Ornstein-Uhlenbeck (OU) process to the actor's output:

```
a_t = μ(s_t | θ^μ) + N_t,
```

where `N_t` is sampled from an OU process with mean-reversion parameter `θ = 0.15` and volatility `σ = 0.2`.[1] The OU noise was chosen because it is temporally correlated, which the authors hypothesized would help on physical control tasks with momentum. Later work (notably the TD3 paper) reported that uncorrelated Gaussian noise works just as well in practice on standard MuJoCo tasks, so most modern implementations skip the OU process.[6]
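For reference, a discretized OU process can be sketched as follows. The class name `OUNoise`, the long-run mean of zero, and the step size `dt = 1.0` are assumptions; implementations differ on the step size, so treat this as one plausible discretization rather than the exact one used in the paper.

```python
import math
import random


class OUNoise:
    """Euler-discretized Ornstein-Uhlenbeck process, one independent copy per action dimension."""

    def __init__(self, dim, theta=0.15, sigma=0.2, mu=0.0, dt=1.0, seed=None):
        self.dim = dim
        self.theta = theta  # mean-reversion rate
        self.sigma = sigma  # volatility
        self.mu = mu        # long-run mean (assumed zero)
        self.dt = dt        # step size (assumed)
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        """Restart the process at its mean, typically at the start of each episode."""
        self.state = [self.mu] * self.dim

    def sample(self):
        """Advance one step: x += theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1)."""
        self.state = [
            x
            + self.theta * (self.mu - x) * self.dt
            + self.sigma * math.sqrt(self.dt) * self.rng.gauss(0.0, 1.0)
            for x in self.state
        ]
        return list(self.state)


noise = OUNoise(dim=2, seed=0)
trajectory = [noise.sample() for _ in range(5)]
print(trajectory)
```

Successive samples are correlated because each one starts from the previous state, which is the property the authors wanted for systems with momentum.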

### Pseudo-code

```
Initialize critic Q(s,a|θ^Q) and actor μ(s|θ^μ) with random weights.
Initialize target networks θ^Q' ← θ^Q, θ^μ' ← θ^μ.
Initialize replay buffer R.

for episode = 1 to M:
    Initialize a random process N for exploration.
    Receive initial observation s_1.
    for t = 1 to T:
        Select action a_t = μ(s_t | θ^μ) + N_t.
        Execute a_t, observe r_t and s_{t+1}.
        Store transition (s_t, a_t, r_t, s_{t+1}) in R.

        Sample minibatch of N transitions from R.
        Compute target y_i = r_i + γ Q'(s_{i+1}, μ'(s_{i+1}|θ^μ') | θ^Q').
        Update critic by minimizing (1/N) Σ_i (y_i - Q(s_i, a_i | θ^Q))^2.

        Update actor by sampled DPG:
            ∇θ^μ J ≈ (1/N) Σ_i ∇a Q(s, a|θ^Q)|_{s=s_i, a=μ(s_i)} · ∇θ^μ μ(s|θ^μ)|_{s_i}.

        Soft-update target networks:
            θ^Q' ← τ θ^Q + (1-τ) θ^Q'
            θ^μ' ← τ θ^μ + (1-τ) θ^μ'.
    end for
end for
```

This is essentially Algorithm 1 of Lillicrap et al. (2016), modulo notation.[1]
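The replay buffer `R` used in the loop can be sketched as a fixed-capacity ring buffer with uniform sampling. The class name and method signatures are illustrative assumptions, and the transitions are stored as plain tuples rather than tensors.

```python
import random


class ReplayBuffer:
    """Fixed-capacity FIFO store of (s, a, r, s_next) transitions with uniform sampling."""

    def __init__(self, capacity=1_000_000, seed=None):
        self.capacity = capacity
        self.storage = []
        self.next_index = 0  # slot that the next write overwrites once the buffer is full
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state):
        transition = (state, action, reward, next_state)
        if len(self.storage) < self.capacity:
            self.storage.append(transition)
        else:
            self.storage[self.next_index] = transition  # overwrite the oldest entry
        self.next_index = (self.next_index + 1) % self.capacity

    def sample(self, batch_size=64):
        """Draw a minibatch uniformly without replacement and split it into columns."""
        batch = self.rng.sample(self.storage, batch_size)
        states, actions, rewards, next_states = zip(*batch)
        return states, actions, rewards, next_states

    def __len__(self):
        return len(self.storage)


buffer = ReplayBuffer(capacity=3, seed=0)
for i in range(5):
    buffer.add(state=float(i), action=0.0, reward=float(i), next_state=float(i + 1))
print(len(buffer), buffer.sample(batch_size=2))
```

Because the buffer decouples data collection from learning, the same transition can be reused in many minibatches, which is where the off-policy sample efficiency comes from.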

## Default hyperparameters

The paper reports a single hyperparameter setting that worked across all tested environments without per-task tuning, which was the headline result at the time.[1] These values still appear as the defaults in most reimplementations.

| Hyperparameter | Value | Notes |
|---|---|---|
| Actor learning rate | 1e-4 | [Adam](/wiki/adam_optimizer) |
| Critic learning rate | 1e-3 | Adam, with L2 weight decay 1e-2 |
| Discount factor `γ` | 0.99 | |
| Soft update rate `τ` | 0.001 | Polyak averaging |
| Replay buffer size | 1e6 | Stores transitions FIFO |
| Minibatch size | 64 | 16 for the pixel agents |
| Hidden layer sizes | 400, 300 | Two fully connected layers; actor has tanh output |
| Action input layer | After first hidden layer (in critic) | The action is concatenated with the first-layer state activations |
| Final-layer init | Uniform `[-3e-3, 3e-3]` | To keep initial actions and Q-values near zero |
| Other layers init | Uniform `[-1/√f, 1/√f]` | Where `f` is fan-in |
| Exploration noise | OU process with `θ = 0.15`, `σ = 0.2` | Added to actor output |
| [Batch normalization](/wiki/batch_normalization) | Yes, on every layer of the actor and on state path of the critic | Critical for low-dim states with mixed units |
| Reward scaling | None for low-dim, 0.1 for pixels | Pixel agents had different reward scales |

The paper actually describes two main architectures: a low-dimensional state version and a pixel version. The pixel agent uses three convolutional layers (32 filters, 3 by 3, no pooling) before the fully connected stack, and a stack of three frames as the input.[1]
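For convenience, the low-dimensional defaults from the table can be collected in a small configuration object, together with the fan-in initializer. The names `DDPGConfig` and `fan_in_uniform` are hypothetical, and the field layout is only one way to organize these values.

```python
import math
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class DDPGConfig:
    """Low-dimensional defaults as listed in the hyperparameter table."""
    actor_lr: float = 1e-4
    critic_lr: float = 1e-3
    critic_weight_decay: float = 1e-2
    gamma: float = 0.99
    tau: float = 0.001
    buffer_size: int = 1_000_000
    batch_size: int = 64
    hidden_sizes: tuple = (400, 300)
    final_init_bound: float = 3e-3
    ou_theta: float = 0.15
    ou_sigma: float = 0.2


def fan_in_uniform(fan_in, fan_out, rng):
    """Weight matrix (fan_out rows, fan_in columns) drawn from U[-1/sqrt(f), 1/sqrt(f)], f = fan_in."""
    bound = 1.0 / math.sqrt(fan_in)
    return [[rng.uniform(-bound, bound) for _ in range(fan_in)] for _ in range(fan_out)]


config = DDPGConfig()
weights = fan_in_uniform(fan_in=400, fan_out=300, rng=random.Random(0))
print(config)
print(len(weights), len(weights[0]))
```

Freezing the dataclass makes accidental mutation of a shared default raise an error, which is helpful when the same configuration is reused across several seeds.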

## What tasks did the original DDPG paper solve?

DDPG was evaluated on more than 20 simulated continuous-control tasks, mostly built in [MuJoCo](/wiki/mujoco).[1] The authors compared a low-dimensional version (state vector input) with a pixel version (raw 64 by 64 RGB frames) and a planning baseline (iLQG with full access to the simulator's dynamics).

| Domain | Description | Action dim | Result |
|---|---|---|---|
| Cartpole swing-up | Swing up and balance an underactuated pole | 1 | Solved from low-dim and pixels |
| Pendulum | Classic swing-up | 1 | Solved |
| Reacher | 2-link arm reaching a random target | 2 | Solved |
| Cheetah | Planar half-cheetah running | 6 | Strong policies, comparable to iLQG with planning |
| Walker2d | Bipedal walking | 6 | Learned forward locomotion |
| Hopper | One-legged hopping | 3 | Learned hopping gait |
| Ant | Quadrupedal locomotion | 8 | Learned forward gait |
| Humanoid | High-dim humanoid | 17 | Limited progress; later improved by D4PG and TD3 |
| Gripper | Robotic gripper grasping | 5 | Learned grasping |
| Torcs | Driving simulator | 3 (steering, throttle, brake) | Lapped tracks; included a pixel-only version |

The authors reported that policies found by DDPG were "competitive with those found by a planning algorithm with full access to the dynamics of the domain and its derivatives," and that on most tasks the low-dimensional and pixel agents reached comparable performance, which was the most impressive part of the result at the time.[1] The Humanoid task already hinted at DDPG's instability on very high-dimensional control, a weakness that later motivated TD3 and D4PG.[6][9]

## Successors and related algorithms

DDPG sits at the head of a family tree of off-policy actor-critic methods. Each successor was designed to fix a specific failure mode in DDPG.

| Algorithm | Year | Authors | Key change vs. DDPG |
|---|---|---|---|
| **[TD3](/wiki/td3)** (Twin Delayed DDPG) | 2018 | Fujimoto, van Hoof, Meger (ICML 2018) | Two critics with `min` to mitigate Q overestimation; delayed actor updates; target policy smoothing noise |
| **[SAC](/wiki/soft_actor_critic)** (Soft Actor-Critic) | 2018 | Haarnoja et al. (ICML 2018, plus 2018 "Algorithms and Applications" follow-up) | Stochastic Gaussian actor, maximum-entropy objective with learned temperature, two critics like TD3 |
| **D4PG** (Distributed Distributional DDPG) | 2018 | Barth-Maron et al. (ICLR 2018) | Distributional critic (C51-style), N-step returns, prioritized experience replay, distributed actors |
| **MPO** (Maximum a Posteriori Policy Optimization) | 2018 | Abdolmaleki et al. | Reframes actor update as expectation-maximization with KL constraints; closely related family but with stochastic policies |
| **DDPG-from-demonstrations** | 2017 | Vecerik et al. | Adds a demonstration buffer with prioritized sampling for sparse-reward robotics |

The TD3 paper is particularly important for understanding DDPG's reputation. Fujimoto et al. showed that DDPG's critic systematically overestimates Q-values, that the deterministic actor exploits these overestimations, and that taking the minimum of two independently trained critics for the Bellman target closes most of the gap to better-tuned methods.[6] They also showed that adding clipped Gaussian noise to the target action during the Bellman backup ("target policy smoothing") reduces overfitting to narrow action peaks.[6]

SAC went further by replacing the deterministic actor with a stochastic Gaussian and adding an entropy-bonus term to the objective, so the actor "aims to maximize expected reward while also maximizing entropy," which made the algorithm both more robust to hyperparameters and less seed-sensitive.[7] By 2019 SAC had largely replaced DDPG as the default off-policy choice for continuous control.

## How do DDPG, TD3, and SAC differ?

The following table compares the three methods side by side, since they are often confused with each other:

| Property | DDPG | TD3 | SAC |
|---|---|---|---|
| Policy | Deterministic | Deterministic | Stochastic Gaussian |
| Critics | 1 | 2 (twin, take `min`) | 2 (twin, take `min`) |
| Actor update frequency | Every step | Every `d` critic steps (default 2) | Every step |
| Exploration | OU or Gaussian noise added externally | Gaussian noise added externally | Stochastic policy + entropy bonus |
| Target smoothing | No | Yes | Implicit via stochastic policy |
| Entropy term | No | No | Yes, with learnable temperature |
| Reproducibility | Notoriously sensitive | Better | Best of the three |

The shared lineage is direct: TD3 keeps DDPG's deterministic actor and replay buffer but adds twin critics and delayed updates, while SAC keeps the off-policy twin-critic structure but swaps the deterministic actor for an entropy-regularized stochastic one.[6][7]
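The difference between the DDPG and TD3 Bellman targets can be sketched directly. The function names and toy critics are illustrative assumptions; the smoothing defaults (`noise_std = 0.2`, `noise_clip = 0.5`) follow the values commonly associated with TD3, and the action bounds of `[-1, 1]` are assumed.

```python
import random


def clip(x, low, high):
    return max(low, min(high, x))


def ddpg_target(r, s_next, target_actor, target_critic, gamma=0.99):
    """Single-critic target: y = r + gamma * Q'(s', mu'(s'))."""
    return r + gamma * target_critic(s_next, target_actor(s_next))


def td3_target(r, s_next, target_actor, target_critic_1, target_critic_2, rng,
               gamma=0.99, noise_std=0.2, noise_clip=0.5,
               action_low=-1.0, action_high=1.0):
    """Twin-critic target with target policy smoothing."""
    # Smoothing: perturb the target action with clipped Gaussian noise.
    noise = clip(rng.gauss(0.0, noise_std), -noise_clip, noise_clip)
    a_next = clip(target_actor(s_next) + noise, action_low, action_high)
    # Clipped double-Q: bootstrap from the more pessimistic of the two critics.
    q_min = min(target_critic_1(s_next, a_next), target_critic_2(s_next, a_next))
    return r + gamma * q_min


# Hypothetical target networks: the second critic overestimates by a constant.
target_actor = lambda s: 0.5
critic_1 = lambda s, a: s + a
critic_2 = lambda s, a: s + a + 1.0

rng = random.Random(0)
y_ddpg = ddpg_target(1.0, 1.0, target_actor, critic_2)
y_td3 = td3_target(1.0, 1.0, target_actor, critic_1, critic_2, rng, noise_std=0.0)
print(y_ddpg, y_td3)
```

When one critic is optimistic, the `min` keeps the target anchored to the lower estimate, which is the mechanism TD3 uses against overestimation. SAC's target is not shown here; it additionally subtracts an entropy term scaled by the temperature.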

## Implementation libraries

DDPG is included in essentially every modern RL library. Common implementations include:

| Library | DDPG implementation |
|---|---|
| OpenAI Spinning Up | Reference PyTorch and TF1 implementations with paper-faithful defaults; the docs explicitly walk through DDPG, TD3, and SAC together [12] |
| Stable Baselines3 | `stable_baselines3.DDPG`, with TD3 as the recommended successor [13] |
| Ray RLlib | `ray.rllib.algorithms.ddpg.DDPG`, supports multi-GPU and distributed training [14] |
| CleanRL | Single-file `ddpg_continuous_action.py`; widely used for teaching and reproducibility [15] |
| TF-Agents | `tf_agents.agents.ddpg.ddpg_agent.DdpgAgent` |
| Acme | `acme.agents.tf.ddpg`, the DeepMind in-house framework |
| MushroomRL, Tianshou, Garage | All include DDPG, mostly for completeness |

Most of these libraries default to Gaussian exploration noise (rather than OU) and use somewhat larger replay buffers and minibatches than the original paper. Modern reimplementations also tend to drop [batch normalization](/wiki/batch_normalization) on the critic, since later work found it to be more trouble than it was worth on standard benchmarks.[12]

## Why is DDPG considered unstable and hard to reproduce?

DDPG developed a reputation for being temperamental almost as soon as it was released. The Henderson, Islam, Bachman, Pineau, Precup, and Meger paper *Deep Reinforcement Learning that Matters* (AAAI 2018) is the standard citation here.[10] The authors compared DDPG implementations across libraries on the same MuJoCo tasks and found that:

- Performance varied dramatically across implementations of the "same" algorithm, even with matched hyperparameters.[10]
- Different random seeds, on the same code, produced very different learning curves; in some cases the median return across one set of five seeds differed by roughly a factor of two from the median across a different five.[10]
- Network architectures, reward scaling, and choice of exploration noise all materially affected results, often more than the choice of algorithm.[10]

Later work explained part of this: the deterministic actor combined with a single critic gives the policy a strong incentive to drive into regions where the critic over-estimates Q, and these regions are sensitive to initialization. TD3's twin critics and SAC's entropy bonus both help here, which is one reason both methods are noticeably less seed-sensitive than DDPG.[6][7]

Other failure modes that show up in practice:

- The critic can diverge if the Q-target is not stabilized by target networks; the original paper reports this as the motivation for soft updates.[1]
- L2 weight decay on the critic was important in the original code but is sometimes silently dropped in reimplementations, which can change the picture.[1]
- Reward scaling matters; the paper used 0.1 reward scaling for pixel agents but not for low-dim agents, and reimplementations that pick the wrong default tend to underperform.[1]

The practical advice that emerged is roughly: if you can use SAC or TD3, do; if you must use DDPG, run at least 5 seeds, watch for Q-value blowup, and tune the noise and learning rates carefully on a small task before scaling up.[10]
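A minimal sketch of that workflow: run the same training function over several seeds, report the spread rather than a single curve, and compare the critic's outputs against a simple bound. The names `summarize_seeds`, `q_value_bound`, and `fake_train` are hypothetical, and `fake_train` is only a random stand-in for a real DDPG run.

```python
import random
import statistics


def summarize_seeds(train_fn, seeds):
    """Run train_fn once per seed and summarize the final returns."""
    returns = [train_fn(seed) for seed in seeds]
    return {
        "returns": returns,
        "mean": statistics.mean(returns),
        "stdev": statistics.stdev(returns),
        "min": min(returns),
        "max": max(returns),
    }


def q_value_bound(max_abs_reward, gamma=0.99):
    """With rewards bounded by max_abs_reward, true Q-values satisfy |Q| <= r_max / (1 - gamma)."""
    return max_abs_reward / (1.0 - gamma)


def fake_train(seed):
    """Hypothetical stand-in for a full training run; returns a noisy final score."""
    rng = random.Random(seed)
    return 1000.0 + rng.gauss(0.0, 300.0)


summary = summarize_seeds(fake_train, seeds=[0, 1, 2, 3, 4])
print(summary["mean"], summary["stdev"])
print(q_value_bound(max_abs_reward=1.0))
```

Critic outputs that drift far beyond `q_value_bound` for the task's reward scale are a practical warning sign of the Q-value blowup mentioned above.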

## What is DDPG used for?

Despite its limitations, DDPG and its descendants have been used in a wide range of continuous-control settings.

- **[Robotics](/wiki/robotics)**: simulated and real-robot manipulation, especially for grasping, pushing, and reaching. The DDPG-from-demonstrations work above came directly out of attempts to apply DDPG on real arms with sparse rewards.[11]
- **Locomotion**: bipedal and quadrupedal locomotion in MuJoCo, PyBullet, and Isaac Gym. Most modern locomotion work uses PPO or SAC instead, but DDPG was among the first deep RL methods to learn such gaits off-policy and end-to-end from low-dim states.[1]
- **Autonomous driving research**: lane following and speed control in TORCS and CARLA-style simulators, often with image input and a discretized critic.
- **Energy management and grid control**: building HVAC control, microgrid dispatch, and demand response, where the action is a continuous setpoint.
- **Quantitative finance**: portfolio rebalancing and execution, sometimes as a baseline against PPO/SAC.
- **Process control**: chemical process control and tuning of PID-style controllers.
- **Game environments**: any continuous-action game or simulator, including TORCS in the original paper and many follow-ups in DeepMind Control Suite, RLBench, and Meta-World.[1]

In most of these areas TD3 or SAC are now the default choice in published baselines. DDPG is still the algorithm people start with when teaching off-policy actor-critic methods to a class.

## Where DDPG sits in modern reinforcement learning

In modern RL practice, DDPG is mostly a teaching algorithm and a baseline. The current default for off-policy continuous control is SAC, often with implementation details borrowed from TD3 (twin critics, target smoothing).[6][7] The combination of "deterministic actor + single critic + replay" that defines DDPG has been almost entirely replaced by "stochastic actor + entropy + twin critics + replay."

What keeps DDPG relevant is its pedagogical role. It is the smallest deep RL algorithm that exposes all the moving parts at once: an actor, a critic, a replay buffer, target networks, and an exploration scheme. Read the DDPG paper, then the TD3 paper, then the SAC paper, and you have a tour of what off-policy actor-critic deep RL learned between 2015 and 2018. The lineage from Silver et al. (2014) through Lillicrap et al. (2016) to Fujimoto et al. (2018) and Haarnoja et al. (2018) is the cleanest progression in deep RL: each step fixes a specific, identifiable problem with the previous one.[2][1][6][7]

The deterministic-policy idea itself has aged better than DDPG-the-algorithm. Off-policy deterministic actors still appear in robotics-scale work where stochastic exploration is impractical, in offline RL methods that need a target policy with a defined `argmax_a Q(s,a)`, and in distillation pipelines that compress stochastic teachers into deterministic students. The Silver et al. theorem that powered DDPG continues to be cited as the basis for these methods.[2]

## See also

- [Reinforcement learning](/wiki/reinforcement_learning)
- [Actor-critic](/wiki/actor_critic)
- [Policy gradient](/wiki/policy_gradient)
- [DQN](/wiki/dqn)
- [Experience replay](/wiki/experience_replay)
- [Target network](/wiki/target_network)
- [Critic](/wiki/critic)
- [Batch normalization](/wiki/batch_normalization)
- [Adam optimizer](/wiki/adam_optimizer)
- [Bellman equation](/wiki/bellman_equation)
- [DeepMind](/wiki/deepmind)
- [OpenAI Gym](/wiki/openai_gym)
- [MuJoCo](/wiki/mujoco)
- [Robotics](/wiki/robotics)
- [TD3](/wiki/td3)
- [Soft Actor-Critic](/wiki/soft_actor_critic)
- [PPO](/wiki/ppo)

## References

1. Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., & Wierstra, D. (2016). *Continuous control with deep reinforcement learning*. International Conference on Learning Representations (ICLR 2016). arXiv:1509.02971. https://arxiv.org/abs/1509.02971
2. Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., & Riedmiller, M. (2014). *Deterministic Policy Gradient Algorithms*. International Conference on Machine Learning (ICML 2014). https://proceedings.mlr.press/v32/silver14.html
3. Sutton, R. S., & Barto, A. G. (2018). *Reinforcement Learning: An Introduction* (2nd ed.). MIT Press.
4. Sutton, R. S., McAllester, D., Singh, S., & Mansour, Y. (1999). *Policy gradient methods for reinforcement learning with function approximation*. NeurIPS 1999.
5. Mnih, V., et al. (2015). *Human-level control through deep reinforcement learning*. Nature, 518(7540), 529-533.
6. Fujimoto, S., van Hoof, H., & Meger, D. (2018). *Addressing Function Approximation Error in Actor-Critic Methods*. ICML 2018. (TD3.) arXiv:1802.09477.
7. Haarnoja, T., Zhou, A., Abbeel, P., & Levine, S. (2018). *Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor*. ICML 2018. arXiv:1801.01290.
8. Haarnoja, T., et al. (2018). *Soft Actor-Critic Algorithms and Applications*. arXiv:1812.05905.
9. Barth-Maron, G., Hoffman, M., et al. (2018). *Distributed Distributional Deterministic Policy Gradients*. ICLR 2018. (D4PG.) arXiv:1804.08617.
10. Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D., & Meger, D. (2018). *Deep Reinforcement Learning that Matters*. AAAI 2018. arXiv:1709.06560.
11. Vecerik, M., et al. (2017). *Leveraging Demonstrations for Deep Reinforcement Learning on Robotics Problems with Sparse Rewards*. arXiv:1707.08817.
12. Achiam, J. (2018). *Spinning Up in Deep RL*. OpenAI documentation. https://spinningup.openai.com
13. Raffin, A., et al. (2021). *Stable-Baselines3: Reliable Reinforcement Learning Implementations*. JMLR 22(268).
14. Liang, E., et al. (2018). *RLlib: Abstractions for Distributed Reinforcement Learning*. ICML 2018.
15. Huang, S., et al. (2022). *CleanRL: High-quality Single-file Implementations of Deep Reinforcement Learning Algorithms*. JMLR.

