# Replay Buffer

> Source: https://aiwiki.ai/wiki/replay_buffer
> Updated: 2026-07-12
> Categories: Deep Learning, Machine Learning, Reinforcement Learning
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

A **replay buffer** (also called an **experience replay buffer** or **replay memory**) is a fixed-size memory that stores an off-policy reinforcement learning agent's past transitions, each a tuple of state, action, reward, and next state (s, a, r, s'), so the agent can sample and re-train on them many times instead of learning only from its most recent step. The technique was introduced by Long-Ji Lin in 1992 and became a core component of modern deep reinforcement learning after its integration into the [Deep Q-Network](/wiki/deep_q-network_dqn) (DQN) algorithm by Mnih et al. in 2013. [1][2] Its two central benefits are breaking the temporal correlation between consecutive samples and improving sample efficiency by reusing each experience for multiple gradient updates. The original DQN that achieved human-level play on Atari used a replay memory of 1,000,000 transitions sampled in minibatches of 32. [3]

The replay buffer is closely related to, and often used interchangeably with, the broader idea of [experience replay](/wiki/experience_replay): the buffer is the data structure, while experience replay is the training procedure that draws from it.

## ELI5

Imagine you are learning to play a video game. Every time you try something, you write down what happened on an index card: what you saw on screen, what button you pressed, whether you got points or lost a life, and what the screen looked like afterward. You toss each card into a big shoebox.

When it is time to study, instead of only looking at the very last card you wrote, you reach into the shoebox and pull out a random handful of cards. Some are from today, some from last week. By studying a mix of old and new experiences, you get a much better picture of the whole game rather than just the last thing that happened. That shoebox is the replay buffer.

Once the shoebox gets full, you throw away the oldest cards to make room for new ones. Some fancier versions of the shoebox let you mark certain cards as extra important so you pull them out more often, which helps you learn even faster.

## Who invented experience replay, and when?

The concept of experience replay was introduced by Long-Ji Lin in his 1992 paper "Self-improving reactive agents based on reinforcement learning, planning and teaching," published in the journal *Machine Learning* (volume 8, pages 293-321). [1] Lin proposed storing past experiences and replaying them during learning as a way to accelerate convergence of [reinforcement learning](/wiki/reinforcement_learning) algorithms. At the time, reinforcement learning methods converged slowly, and experience replay was one of three extensions Lin developed to speed up the process (the other two being learning action models for planning and teaching from external advice). [1]

The technique received relatively little attention for about two decades until Mnih et al. combined it with [deep neural networks](/wiki/deep_neural_network) in their 2013 paper "Playing Atari with Deep Reinforcement Learning." [2] This work introduced the DQN algorithm, which used a [convolutional neural network](/wiki/convolutional_neural_network) to learn control policies directly from raw pixel inputs, with experience replay as its main stabilization technique. The follow-up 2015 *Nature* paper "Human-level control through deep reinforcement learning" added a second stabilizer, a periodically updated [target network](/wiki/target_network), and demonstrated that a single architecture and set of hyperparameters could reach performance comparable to a professional human games tester across a set of 49 Atari 2600 games. [3]

Experience replay has a biological analogue in hippocampal replay, studied in neuroscience. During sleep and periods of rest, the hippocampus replays compressed sequences of neural activity patterns that correspond to previously experienced events. [11] This replay process is believed to support memory consolidation by strengthening synaptic connections in cortical circuits. Multiple researchers have noted the parallel between biological memory replay and computational experience replay, although the computational version was developed independently of the neuroscience findings.

## How does a replay buffer work?

### Transition tuple

At each timestep, the agent interacts with its [environment](/wiki/environment) and produces a transition tuple. The standard tuple contains five elements:

| Element | Symbol | Description |
|---|---|---|
| Current state | s | The observation of the environment before the agent acts |
| Action | a | The action selected by the agent's [policy](/wiki/policy) |
| Reward | r | The immediate [reward](/wiki/reward) signal received after taking the action |
| Next state | s' | The observation of the environment after the action is executed |
| Done flag | d | A boolean indicating whether the episode has terminated |

This tuple (s, a, r, s', d) is the fundamental unit of data stored in the replay buffer.

### Storage and eviction

The replay buffer is typically implemented as a fixed-size circular buffer (also called a ring buffer). New transitions are appended to the buffer sequentially. Once the buffer reaches its maximum capacity, the oldest transitions are overwritten in a first-in-first-out (FIFO) manner. The original DQN stored only the last N experiences, where N was set to one million, and sampled uniformly at random from that memory. [2][3] Common buffer sizes in practice range from 10,000 to 1,000,000 transitions, depending on the complexity of the environment and available memory.
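The circular-buffer behaviour described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used by DQN or any particular library; the names `Transition` and `RingReplayBuffer` are hypothetical.

```python
import random
from typing import List, NamedTuple, Optional


class Transition(NamedTuple):
    state: object
    action: int
    reward: float
    next_state: object
    done: bool


class RingReplayBuffer:
    """Fixed-capacity FIFO buffer: once full, the oldest slot is overwritten."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._storage: List[Transition] = []
        self._next = 0  # index of the slot that the next add() writes to

    def __len__(self) -> int:
        return len(self._storage)

    def add(self, transition: Transition) -> None:
        if len(self._storage) < self.capacity:
            self._storage.append(transition)
        else:
            # The write index has wrapped around, so it points at the oldest entry.
            self._storage[self._next] = transition
        self._next = (self._next + 1) % self.capacity

    def sample(self, batch_size: int,
               rng: Optional[random.Random] = None) -> List[Transition]:
        """Uniform sample without replacement from the stored transitions."""
        source = rng if rng is not None else random
        return source.sample(self._storage, batch_size)
```

Because the write index wraps around with a modulo, insertion stays O(1) and no elements are ever shifted, which is the main reason a ring buffer is preferred over deleting from the front of a list.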

An alternative to FIFO eviction is reservoir sampling. Once the buffer is full, the n-th transition observed so far is kept with probability N/n (where N is the buffer capacity), and if it is kept it replaces a uniformly chosen stored transition. This ensures that every transition seen so far has an equal probability of being in the buffer, providing a uniform sample of the agent's entire history rather than a recency-biased window.
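A minimal sketch of reservoir-style eviction follows, using the classic "Algorithm R" formulation. The class name `ReservoirBuffer` and its attributes are hypothetical and chosen only for illustration.

```python
import random
from typing import List, Optional


class ReservoirBuffer:
    """Keeps a uniform random subset of every item ever added (Algorithm R)."""

    def __init__(self, capacity: int, seed: Optional[int] = None) -> None:
        self.capacity = capacity
        self.items: List[object] = []
        self.num_seen = 0  # total number of items offered to the buffer
        self._rng = random.Random(seed)

    def add(self, item: object) -> None:
        self.num_seen += 1
        if len(self.items) < self.capacity:
            self.items.append(item)
            return
        # Draw a slot in [0, num_seen). The new item is kept only if the slot
        # lies inside the buffer, which happens with probability
        # capacity / num_seen; it then overwrites that slot.
        slot = self._rng.randrange(self.num_seen)
        if slot < self.capacity:
            self.items[slot] = item
```

Unlike the FIFO ring buffer, the chance that a new transition is stored shrinks as training goes on, so this scheme favours coverage of the whole history over recency.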

### Sampling and training

During training, instead of using the most recent transition for a gradient update, the algorithm samples a random minibatch of transitions from the buffer. A typical minibatch size is 32 or 64 transitions; the DQN papers used 32. [3] The sampled transitions are used to compute the [loss function](/wiki/loss_function) (for example, the temporal-difference error in [Q-learning](/wiki/q-learning)) and perform a gradient descent step to update the network parameters.

The following pseudocode outlines the basic experience replay loop:

```
Initialize replay buffer D with capacity N
Initialize Q-network Q with random weights
Initialize target network Q_target as a copy of Q

for each episode:
    Initialize state s
    for each step until the episode ends:
        Select action a using epsilon-greedy policy from Q
        Execute action a, observe reward r, next state s', and done flag
        Store transition (s, a, r, s', done) in D
        Sample random minibatch of transitions from D
        Compute target y = r                                      if done
                       y = r + gamma * max_a' Q_target(s', a')    otherwise
        Update Q-network by minimizing (y - Q(s, a))^2
        Every C steps, copy the weights of Q into Q_target
        s = s'
```
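The target computation in the loop above can be sketched as a small pure-Python function. It assumes the target network's outputs for each next state are already available as a list of per-action values; the names `td_targets` and `next_q_values` are hypothetical. The sketch masks the bootstrap term with the done flag so that no value is propagated past the end of an episode.

```python
from typing import List, Sequence


def td_targets(rewards: Sequence[float],
               next_q_values: Sequence[Sequence[float]],
               dones: Sequence[bool],
               gamma: float = 0.99) -> List[float]:
    """One-step Q-learning targets for a sampled minibatch.

    next_q_values[i] holds the target network's value estimates for every
    action in the i-th next state.
    """
    targets = []
    for reward, q_next, done in zip(rewards, next_q_values, dones):
        # A terminal transition has no successor, so nothing is bootstrapped.
        bootstrap = 0.0 if done else gamma * max(q_next)
        targets.append(reward + bootstrap)
    return targets
```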

## Why is experience replay needed?

Experience replay addresses several problems that arise when combining reinforcement learning with [neural networks](/wiki/neural_network). The DQN authors summarized the core mechanism plainly: "randomizing the samples breaks these correlations and therefore reduces the variance of the updates." [3]

### Temporal correlation

Consecutive transitions collected by an agent following a policy are highly correlated. For example, in an Atari game, consecutive frames differ by only a few pixels. Training a neural network on correlated sequential data violates the assumption of independent and identically distributed (i.i.d.) samples that underlies [stochastic gradient descent](/wiki/stochastic_gradient_descent), leading to unstable and inefficient learning. Random sampling from a replay buffer breaks these temporal correlations by presenting the network with a diverse, decorrelated set of transitions. [2][3]

### Sample efficiency

Without experience replay, each transition is used for exactly one gradient update and then discarded. This is wasteful, especially when environment interactions are expensive (for example, in robotics or complex simulations). As Mnih et al. noted, experience replay yields "greater data efficiency" because "each step of experience is potentially used in many weight updates." [3] With a replay buffer, each transition can be sampled and reused for learning many times, significantly improving sample efficiency.

### Catastrophic forgetting

[Neural networks](/wiki/neural_network) are prone to catastrophic forgetting, where learning new information causes the network to lose previously acquired knowledge. When training only on the most recent transitions, the network may forget how to handle situations it encountered earlier in training. The replay buffer mitigates this by continuously mixing old and new experiences during training.

### Non-stationarity

In reinforcement learning, the data distribution changes as the agent's policy improves. The states and rewards the agent encounters shift over time. A replay buffer smooths out these distributional shifts by maintaining a window of experiences collected under different versions of the policy.

## What are the main variants of experience replay?

### Uniform experience replay

The simplest form of experience replay samples transitions uniformly at random from the buffer. Each stored transition has an equal probability of being selected for training. This is the version used in the original DQN algorithm and remains a strong baseline due to its simplicity and lack of additional hyperparameters. [2][3]

### Prioritized experience replay

Prioritized experience replay (PER), introduced by Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver in 2015 (published at ICLR 2016, arXiv:1511.05952), assigns a priority to each transition and samples transitions with probability proportional to their priority. [4] The authors framed the goal as a way to "replay important transitions more frequently, and therefore learn more efficiently." [4] The key idea is that some transitions are more informative than others, and the agent should replay those transitions more frequently.

The priority of a transition is typically based on the magnitude of its temporal-difference (TD) error, which measures how "surprising" the transition is to the current model. A large TD error means the network's prediction was far from the observed outcome, suggesting the transition contains information the network has not yet learned.

PER defines two variants for computing sampling probabilities:

| Variant | Priority formula | Sampling probability | Characteristics |
|---|---|---|---|
| Proportional | $$p_i = \lvert \delta_i \rvert + \epsilon$$ | $$P(i) = \frac{p_i^\alpha}{\sum_k p_k^\alpha}$$ | Directly proportional to TD error magnitude; sensitive to outliers |
| Rank-based | $$p_i = 1 / \text{rank}(i)$$ | $$P(i) = \frac{p_i^\alpha}{\sum_k p_k^\alpha}$$ | Based on rank ordering of TD errors; more robust to outliers; produces a power-law distribution |

In both variants, $$\alpha$$ controls how much prioritization is applied. When $$\alpha = 0$$, sampling is uniform. When $$\alpha = 1$$, sampling is fully proportional to priorities. [4]
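The two priority definitions and the shared sampling formula can be written out directly. This is an O(N) illustration of the formulas in the table, not an efficient implementation; the function names are hypothetical, and the default value of `eps` is an arbitrary choice.

```python
from typing import List, Sequence


def proportional_priorities(td_errors: Sequence[float],
                            eps: float = 1e-6) -> List[float]:
    """p_i = |delta_i| + eps, so no transition has zero priority."""
    return [abs(d) + eps for d in td_errors]


def rank_based_priorities(td_errors: Sequence[float]) -> List[float]:
    """p_i = 1 / rank(i), where rank 1 is the largest |delta_i|."""
    order = sorted(range(len(td_errors)),
                   key=lambda i: abs(td_errors[i]), reverse=True)
    priorities = [0.0] * len(td_errors)
    for rank, i in enumerate(order, start=1):
        priorities[i] = 1.0 / rank
    return priorities


def sampling_probabilities(priorities: Sequence[float],
                           alpha: float) -> List[float]:
    """P(i) = p_i^alpha / sum_k p_k^alpha."""
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    return [s / total for s in scaled]
```

Setting `alpha=0.0` makes every scaled priority equal to 1, which recovers uniform sampling as stated above.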

#### Importance sampling correction

Prioritized sampling introduces bias because it changes the expected distribution of updates. To correct this bias, PER uses importance sampling weights:

$$
w_i = \left(\frac{1}{N \cdot P(i)}\right)^\beta
$$

The hyperparameter $$\beta$$ controls the degree of bias correction. It is typically annealed from a starting value (often 0.4 to 0.6) linearly to 1.0 over the course of training. At $$\beta = 1$$, the bias is fully corrected. The annealing schedule reflects the fact that unbiased updates matter more toward the end of training when the policy is converging. [4]
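A short sketch of the weight formula and a linear annealing schedule for beta is given below. The division by the largest weight is an extra step described in the PER paper for stability (it ensures the weights only scale updates downwards) and is not part of the formula shown above; the function names and the default `beta_start` are hypothetical.

```python
from typing import List, Sequence


def importance_weights(probs: Sequence[float], buffer_size: int,
                       beta: float) -> List[float]:
    """w_i = (1 / (N * P(i)))^beta, then normalised by the largest weight."""
    weights = [(1.0 / (buffer_size * p)) ** beta for p in probs]
    max_weight = max(weights)
    return [w / max_weight for w in weights]


def linear_beta(step: int, total_steps: int, beta_start: float = 0.4) -> float:
    """Anneal beta linearly from beta_start to 1.0 over total_steps."""
    fraction = min(1.0, step / total_steps)
    return beta_start + fraction * (1.0 - beta_start)
```

Each weight multiplies the loss (or TD error) of its transition, so frequently sampled transitions contribute proportionally less to each update.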

#### Implementation with sum trees

Efficient implementation of proportional PER uses a sum tree (a type of segment tree) data structure. Each leaf node stores the priority of one transition, and each internal node stores the sum of its children's priorities. The root node contains the total priority sum. This structure allows both sampling (proportional to priority) and priority updates in $$O(\log N)$$ time, compared to $$O(N)$$ for a naive implementation.
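The following is a minimal array-backed sum tree, written as a sketch of the data structure described above rather than a copy of any library's implementation. The class and method names are hypothetical. Priorities passed to `update` are assumed to be already raised to the power alpha.

```python
import random
from typing import List


class SumTree:
    """Binary tree in a flat array: leaves hold priorities, parents hold sums."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        # Node 1 is the root; leaves occupy indices [capacity, 2 * capacity).
        self.nodes: List[float] = [0.0] * (2 * capacity)

    def total(self) -> float:
        return self.nodes[1]

    def update(self, index: int, priority: float) -> None:
        """Set the priority of one leaf and refresh the sums above it."""
        pos = index + self.capacity
        self.nodes[pos] = priority
        pos //= 2
        while pos >= 1:
            self.nodes[pos] = self.nodes[2 * pos] + self.nodes[2 * pos + 1]
            pos //= 2

    def find(self, mass: float) -> int:
        """Return the leaf whose share of the total priority contains mass."""
        pos = 1
        while pos < self.capacity:
            left = 2 * pos
            if mass < self.nodes[left]:
                pos = left
            else:
                mass -= self.nodes[left]
                pos = left + 1
        return pos - self.capacity

    def sample(self, rng: random.Random) -> int:
        """Draw a leaf index with probability proportional to its priority."""
        return self.find(rng.random() * self.total())
```

Both `update` and `find` walk a single root-to-leaf path, which is where the logarithmic cost comes from.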

PER was shown to improve DQN performance on 41 out of 49 Atari 2600 games compared to uniform replay, setting a new state of the art at the time. [4] In the Rainbow DQN ablation study by Hessel et al. (2018), prioritized experience replay and multi-step learning were the two most important components: removing either one caused a large drop in overall performance. [8]

### Hindsight experience replay

Hindsight experience replay (HER), proposed by Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, and others at [OpenAI](/wiki/openai) in 2017 (NeurIPS 2017, arXiv:1707.01495), addresses the challenge of learning from sparse binary rewards in goal-conditioned reinforcement learning. [5] The authors describe its purpose as enabling "sample-efficient learning from rewards which are sparse and binary and therefore avoid the need for complicated reward engineering." [5]

In many robotic manipulation tasks, the reward function is binary: the agent receives a reward of 0 if it achieves the goal and -1 otherwise. With such sparse rewards, the agent almost never experiences a successful outcome during early training, so nearly every transition carries the same reward and learning becomes extremely difficult.

HER solves this by retroactively relabeling the goal in stored transitions. After an episode ends, the agent stores not only the original transitions (with the intended goal) but also additional copies of the transitions with substitute goals that were actually achieved during the episode. This way, even a "failed" episode produces transitions where the agent "succeeded" at reaching alternative goals.

HER defines four goal-relabeling strategies:

| Strategy | Description |
|---|---|
| Final | Replace the goal with the final state achieved in the episode |
| Future | Replace the goal with a randomly selected state from later in the same episode |
| Episode | Replace the goal with a randomly selected state from anywhere in the episode |
| Random | Replace the goal with a randomly selected state from the entire replay buffer |

The "future" strategy is generally the most effective, as it provides a natural curriculum: early in training when the agent accomplishes little, the substitute goals are close to the starting state, and as the agent improves, the substitute goals move further away.
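The "future" strategy can be sketched as follows. This is a simplified illustration that assumes goals are directly comparable values and that each transition records the goal actually achieved after the action; the names `GoalTransition`, `sparse_reward`, and `relabel_future`, and the parameter `k` (number of relabeled copies per transition), are hypothetical. The sketch also allows a transition's own achieved goal to be drawn as a substitute, which is an implementation choice.

```python
import random
from typing import List, NamedTuple, Sequence, Tuple


class GoalTransition(NamedTuple):
    state: int
    action: int
    achieved_goal: int  # what was actually reached after taking the action
    goal: int           # what the agent was trying to reach


def sparse_reward(achieved_goal: int, goal: int) -> float:
    """Binary reward: 0 on success, -1 otherwise."""
    return 0.0 if achieved_goal == goal else -1.0


def relabel_future(episode: Sequence[GoalTransition], k: int,
                   rng: random.Random) -> List[Tuple[GoalTransition, float]]:
    """Return original transitions plus k relabeled copies of each one."""
    out: List[Tuple[GoalTransition, float]] = []
    for t, tr in enumerate(episode):
        out.append((tr, sparse_reward(tr.achieved_goal, tr.goal)))
        for _ in range(k):
            # Pick a goal achieved at this step or later in the same episode.
            future = rng.randrange(t, len(episode))
            new_goal = episode[future].achieved_goal
            relabeled = tr._replace(goal=new_goal)
            out.append((relabeled, sparse_reward(tr.achieved_goal, new_goal)))
    return out
```

The reward is recomputed for every relabeled copy, which is why HER needs access to the reward function and not only to logged rewards.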

HER can be combined with any off-policy reinforcement learning algorithm, including DQN, [DDPG](/wiki/ddpg), and [SAC](/wiki/soft_actor_critic). It was demonstrated on robotic pushing, sliding, and pick-and-place tasks using only binary rewards, and policies trained in simulation using HER were successfully transferred to physical robots. [5]

### Combined experience replay

Combined experience replay (CER), proposed by Shangtong Zhang and Richard Sutton in 2017 (arXiv:1712.01275), is a simple modification that ensures the most recent transition is always included in the training minibatch. [6] Instead of sampling all transitions uniformly from the buffer, CER samples (batch_size - 1) transitions randomly and adds the latest transition to complete the batch. This addresses a potential problem with large replay buffers: when the buffer is very large, the most recent transition may not be sampled for a long time, delaying learning from the latest experience. CER adds only O(1) extra computation. [6]
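A minimal sketch of the CER sampling rule is shown below. It assumes the buffer is a non-empty sequence ordered from oldest to newest, so the latest transition is the last element; the function name `cer_sample` is hypothetical.

```python
import random
from typing import List, Sequence


def cer_sample(buffer: Sequence[object], batch_size: int,
               rng: random.Random) -> List[object]:
    """Sample batch_size - 1 items uniformly, then append the newest item."""
    num_random = min(batch_size - 1, len(buffer))
    batch = rng.sample(list(buffer), num_random)
    batch.append(buffer[-1])  # the most recent transition is always included
    return batch
```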

### N-step experience replay

Standard experience replay stores single-step transitions (s, a, r, s'). N-step experience replay extends this by storing multi-step returns:

$$
R_n = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots + \gamma^{n-1} r_{t+n-1}
$$

The stored tuple becomes $$(s_t, a_t, R_n, s_{t+n})$$. N-step returns reduce the bias of bootstrapped value estimates at the cost of increased variance. In practice, $$n = 3$$ or $$n = 5$$ is commonly used. N-step experience replay is one of the components of the Rainbow DQN algorithm, where Hessel et al. found that, alongside prioritization, multi-step learning was among the most impactful additions. [8]
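The conversion from single-step to n-step transitions can be sketched as below. It assumes an episode is a list of `(state, action, reward, next_state, done)` tuples and truncates the return at the end of the episode; the function name and the extra `discount` field (the factor to apply to the bootstrapped value at the final state) are hypothetical additions for illustration.

```python
from typing import List, Sequence, Tuple

Step = Tuple[object, object, float, object, bool]


def n_step_transitions(episode: Sequence[Step], n: int,
                       gamma: float) -> List[tuple]:
    """Build (s_t, a_t, R_n, s_{t+n}, done, discount) for every start index t."""
    out = []
    for t in range(len(episode)):
        ret, discount, last = 0.0, 1.0, t
        for k in range(t, min(t + n, len(episode))):
            ret += discount * episode[k][2]
            discount *= gamma
            last = k
            if episode[k][4]:  # stop accumulating at a terminal step
                break
        state, action = episode[t][0], episode[t][1]
        next_state, done = episode[last][3], episode[last][4]
        out.append((state, action, ret, next_state, done, discount))
    return out
```

The learning target is then `R_n + discount * max_a Q(next_state, a)` when `done` is false, and simply `R_n` otherwise.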

## How is experience replay scaled to distributed systems?

### Ape-X

Ape-X (Distributed Prioritized Experience Replay), introduced by Dan Horgan, John Quan, David Budden, Gabriel Barth-Maron, Matteo Hessel, Hado van Hasselt, and David Silver in 2018 (ICLR 2018, arXiv:1803.00933), scales experience replay to distributed settings. [7] The architecture decouples acting from learning:

- Multiple **actor** processes run in parallel, each interacting with its own copy of the environment. Each actor selects actions using a local copy of the shared neural network and computes initial priorities for the new transitions it generates.
- A single **learner** process samples prioritized minibatches from a shared, centralized replay buffer and performs gradient updates on the network.
- Actors periodically synchronize their network weights with the learner.

This design allows the system to generate experience much faster than a single agent could. In experiments, Ape-X used hundreds of CPU actors and a single GPU learner. The system achieved state-of-the-art results on the Arcade Learning Environment (Atari games), training on billions of frames in a fraction of the wall-clock time required by single-agent methods. [7]

### R2D2

R2D2 (Recurrent Replay Distributed DQN), proposed by Kapturowski et al. in 2019 (ICLR 2019), extends the distributed replay paradigm to handle partial observability using [recurrent neural networks](/wiki/recurrent_neural_network). [9] R2D2 uses an [LSTM](/wiki/long_short-term_memory_lstm) layer after the convolutional stack and stores fixed-length sequences of transitions (typically 80 steps with 40 steps of overlap between consecutive sequences) rather than individual transition tuples. [9]

Using recurrent networks with experience replay introduces a challenge called recurrent state staleness: the hidden states stored with old transitions were computed under an older version of the network and may no longer be accurate. R2D2 addresses this through two strategies:

- **Stored states**: Saving the LSTM hidden state at the beginning of each stored sequence and using it to initialize the recurrent network during replay.
- **Burn-in**: Using the first portion of each replayed sequence only to warm up the LSTM hidden state, without computing a loss on it, and then computing the loss on the remaining steps.
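The sequence layout and the burn-in split can be illustrated with simple index arithmetic. This sketch only shows how windows are cut and divided; it does not include the recurrent network itself. The function names are hypothetical, and the burn-in length is left as a parameter because the text above does not fix it.

```python
from typing import List, Sequence, Tuple


def sequence_windows(num_steps: int, seq_len: int = 80,
                     overlap: int = 40) -> List[Tuple[int, int]]:
    """Start/end indices of full-length, overlapping sequences in an episode."""
    stride = seq_len - overlap
    return [(start, start + seq_len)
            for start in range(0, num_steps - seq_len + 1, stride)]


def split_burn_in(sequence: Sequence[object],
                  burn_in: int) -> Tuple[Sequence[object], Sequence[object]]:
    """Split a replayed sequence into a warm-up prefix and a training suffix.

    The prefix is only unrolled through the LSTM to refresh its hidden state;
    the loss is computed on the suffix.
    """
    return sequence[:burn_in], sequence[burn_in:]
```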

Using a single network architecture and a fixed set of hyperparameters, R2D2 substantially improved on the previous state of the art on Atari-57 and matched the state of the art on DMLab-30. [9]

## How do continuous-control algorithms use replay buffers?

Experience replay is equally important for off-policy algorithms designed for continuous action spaces. The following table summarizes how several popular algorithms use replay buffers:

| Algorithm | Year | Action space | Replay buffer usage | Key details |
|---|---|---|---|---|
| [DDPG](/wiki/ddpg) | 2015 | Continuous | Uniform replay | Actor-critic; stores (s, a, r, s') tuples; typical buffer size of 1M transitions [13] |
| [TD3](/wiki/td3) | 2018 | Continuous | Uniform replay | Extends DDPG with clipped double Q-learning and delayed policy updates |
| [SAC](/wiki/soft_actor_critic) | 2018 | Continuous | Uniform replay | Maximum entropy framework; automatic temperature tuning; stochastic policy [15] |
| [DDPG](/wiki/ddpg) + PER | Various | Continuous | Prioritized replay | Combines DDPG with prioritized experience replay for potentially faster convergence |

All of these algorithms are off-policy, meaning they can learn from data collected by a different (older) version of their policy. This property is what makes experience replay possible: the stored transitions remain useful even as the policy changes.

## Implementation considerations

### Buffer size

The size of the replay buffer is an important [hyperparameter](/wiki/hyperparameter) that requires tuning for each problem domain. A buffer that is too small fails to break temporal correlations effectively and provides insufficient diversity. A buffer that is too large consumes excessive memory and may retain outdated transitions that are no longer relevant to the current policy, potentially slowing down learning. Common buffer sizes in the literature range from 10,000 to 1,000,000 transitions. Research by Zhang and Sutton (2017) showed that varying the size of the experience replay buffer can hurt performance even in very simple tasks, confirming that buffer size is a hyperparameter with no universally optimal value. [6]

### Memory efficiency

For environments with high-dimensional observations (such as image-based environments), storing millions of transitions can require substantial memory. Several techniques help reduce memory usage:

- **Observation compression**: Storing observations as uint8 (0-255) rather than float32 reduces memory usage by 4x for image observations.
- **Lazy frames**: Storing each frame only once and constructing stacked frame observations on-the-fly by referencing frame indices, avoiding redundant storage of overlapping frames.
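- **Next-state elimination**: Because the next state of one transition is the current state of the following transition, storing each state only once and reconstructing next states from adjacent entries (with care at episode boundaries, where adjacent entries belong to different episodes).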
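The effect of these techniques can be estimated with simple arithmetic. The sketch below counts observation bytes only (actions, rewards, and flags are ignored) and assumes 84x84 single-channel frames stacked four deep, as in Atari setups; the function name and parameters are hypothetical.

```python
def observation_bytes(num_transitions: int, height: int = 84, width: int = 84,
                      stack: int = 4, bytes_per_value: int = 1,
                      copies_per_transition: int = 2) -> int:
    """Bytes needed to store observations for a replay buffer.

    copies_per_transition is 2 when both state and next state are stored,
    and 1 when next states are reconstructed from adjacent entries.
    """
    return (num_transitions * copies_per_transition * stack
            * height * width * bytes_per_value)


N = 1_000_000
naive_float32 = observation_bytes(N, bytes_per_value=4)    # 225,792,000,000 bytes
uint8_both = observation_bytes(N)                          # 56,448,000,000 bytes
uint8_no_next = observation_bytes(N, copies_per_transition=1)  # 28,224,000,000 bytes
lazy_frames = observation_bytes(N, stack=1, copies_per_transition=1)  # 7,056,000,000 bytes
```

Under these assumptions, combining all three techniques shrinks a one-million-transition buffer from roughly 226 GB to roughly 7 GB of observation data.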

### Software frameworks

Several software libraries provide production-quality replay buffer implementations:

| Framework | Developer | Key features |
|---|---|---|
| [TensorFlow](/wiki/tensorflow) Agents | Google | TFUniformReplayBuffer, PyUniformReplayBuffer, ReverbReplayBuffer (backed by Reverb); supports distributed collection |
| Stable-Baselines3 | Community | ReplayBuffer, HerReplayBuffer, DictReplayBuffer; [PyTorch](/wiki/pytorch)-native |
| RLlib (Ray) | Anyscale | MultiAgentReplayBuffer; supports multiple prioritization modes and sampling strategies |
| Reverb | [DeepMind](/wiki/deepmind) | High-performance C++ server with gRPC interface; supports uniform, prioritized, FIFO, LIFO sampling; rate limiting; scales to thousands of concurrent clients |
| cpprb | Community | C++ backed Python library; supports PER, N-step, HER, and various buffer types |

Reverb, developed by DeepMind and described in a 2021 paper, is implemented in C++ and exposes a gRPC service for adding, sampling, and updating buffer contents. Its sampler manages long-lived streams at a flow-controlled rate, and the system is designed to support up to thousands of concurrent clients in distributed training. [10]

## What are the limitations of replay buffers?

### The deadly triad

The combination of function approximation (neural networks), bootstrapping (using estimated values to compute targets), and off-policy learning (training on data from older policies via replay) constitutes what Richard Sutton and Andrew Barto call the "deadly triad." [14] This combination can cause value estimates to diverge and become unbounded. Experience replay contributes to this problem by making learning more off-policy: transitions stored in the buffer may have been collected under a substantially different policy than the current one. A 2018 DeepMind study of the deadly triad in deep RL found that target networks and longer multi-step returns reduce this divergence but do not eliminate it entirely. [12]

### Buffer staleness

As the agent's policy improves over time, older transitions in the buffer become less representative of the current data distribution. This staleness can slow learning or cause instability. Smaller buffer sizes reduce staleness but at the cost of less diversity. Some approaches address staleness explicitly by weighting recent transitions more heavily or by periodically clearing the buffer.

### Restriction to off-policy algorithms

Experience replay can only be used with off-policy algorithms because the stored transitions were generated by a different (older) version of the policy. On-policy algorithms such as [PPO](/wiki/proximal_policy_optimization) and A2C require data collected under the current policy and cannot use a replay buffer directly. This limits the applicability of experience replay to algorithms like DQN, DDPG, TD3, and SAC.

### Memory overhead

Storing millions of transitions with high-dimensional observations (such as 84x84 pixel images with frame stacking) requires significant RAM. For very large-scale experiments, the replay buffer can become a memory bottleneck. Distributed replay systems like Reverb address this by running the buffer as a separate service, but this adds architectural complexity. [10]

## Comparison of replay buffer variants

| Variant | Sampling strategy | Key benefit | Main overhead | Introduced |
|---|---|---|---|---|
| Uniform replay | Random uniform | Simple; no extra hyperparameters | None | Lin, 1992 |
| Prioritized (PER) | Proportional to TD error | Faster learning on informative transitions | Sum tree; importance sampling weights; alpha and beta hyperparameters | Schaul et al., 2015 |
| Hindsight (HER) | Uniform with goal relabeling | Enables learning from sparse binary rewards | Goal relabeling computation; only applicable to goal-conditioned tasks | Andrychowicz et al., 2017 |
| Combined (CER) | Uniform + latest transition | Ensures recency; negligible overhead | Minimal | Zhang and Sutton, 2017 |
| N-step | Random uniform over N-step returns | Reduced bootstrap bias | Increased variance; N-step return computation | Various |
| Distributed (Ape-X) | Prioritized, centralized buffer | Scales to many actors; massive throughput | Distributed infrastructure; network synchronization | Horgan et al., 2018 |
| Recurrent (R2D2) | Prioritized over sequences | Handles partial observability with LSTM | Sequence storage; burn-in computation; recurrent state management | Kapturowski et al., 2019 |

## References

1. Lin, L.-J. (1992). "Self-improving reactive agents based on reinforcement learning, planning and teaching." *Machine Learning*, 8(3-4), 293-321.
2. Mnih, V., Kavukcuoglu, K., Silver, D., et al. (2013). "Playing Atari with Deep Reinforcement Learning." *arXiv preprint arXiv:1312.5602*. https://arxiv.org/abs/1312.5602
3. Mnih, V., Kavukcuoglu, K., Silver, D., et al. (2015). "Human-level control through deep reinforcement learning." *Nature*, 518(7540), 529-533. https://www.nature.com/articles/nature14236
4. Schaul, T., Quan, J., Antonoglou, I., & Silver, D. (2016). "Prioritized Experience Replay." *Proceedings of the International Conference on Learning Representations (ICLR)*. arXiv:1511.05952. https://arxiv.org/abs/1511.05952
5. Andrychowicz, M., Wolski, F., Ray, A., Schneider, J., et al. (2017). "Hindsight Experience Replay." *Advances in Neural Information Processing Systems (NeurIPS)*, 30. arXiv:1707.01495. https://arxiv.org/abs/1707.01495
6. Zhang, S., & Sutton, R. S. (2017). "A Deeper Look at Experience Replay." *arXiv preprint arXiv:1712.01275*. https://arxiv.org/abs/1712.01275
7. Horgan, D., Quan, J., Budden, D., et al. (2018). "Distributed Prioritized Experience Replay." *Proceedings of the International Conference on Learning Representations (ICLR)*. arXiv:1803.00933. https://arxiv.org/abs/1803.00933
8. Hessel, M., Modayil, J., van Hasselt, H., et al. (2018). "Rainbow: Combining Improvements in Deep Reinforcement Learning." *Proceedings of the AAAI Conference on Artificial Intelligence*, 32(1). arXiv:1710.02298. https://arxiv.org/abs/1710.02298
9. Kapturowski, S., Ostrovski, G., Quan, J., Munos, R., & Dabney, W. (2019). "Recurrent Experience Replay in Distributed Reinforcement Learning." *International Conference on Learning Representations (ICLR)*. https://openreview.net/forum?id=r1lyTjAqYX
10. Cassirer, A., Barth-Maron, G., Brevdo, E., et al. (2021). "Reverb: A Framework for Experience Replay." *arXiv preprint arXiv:2102.04736*. https://arxiv.org/abs/2102.04736
11. Carr, M. F., Jadhav, S. P., & Frank, L. M. (2011). "Hippocampal replay in the awake state: a potential substrate for memory consolidation and retrieval." *Nature Neuroscience*, 14(2), 147-153.
12. van Hasselt, H., Doron, Y., Strub, F., et al. (2018). "Deep Reinforcement Learning and the Deadly Triad." *arXiv preprint arXiv:1812.02648*. https://arxiv.org/abs/1812.02648
13. Lillicrap, T. P., Hunt, J. J., Pritzel, A., et al. (2016). "Continuous control with deep reinforcement learning." *Proceedings of the International Conference on Learning Representations (ICLR)*. arXiv:1509.02971. https://arxiv.org/abs/1509.02971
14. Sutton, R. S., & Barto, A. G. (2018). *Reinforcement Learning: An Introduction* (2nd ed.). MIT Press. Chapter 11 ("Off-policy Methods with Approximation"), section on the deadly triad.
15. Haarnoja, T., Zhou, A., Abbeel, P., & Levine, S. (2018). "Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor." *Proceedings of the 35th International Conference on Machine Learning (ICML)*. arXiv:1801.01290. https://arxiv.org/abs/1801.01290