A replay buffer (also called an experience replay buffer or replay memory) is a data structure used in reinforcement learning to store and reuse past experiences for training. First introduced by Long-Ji Lin in 1992, the technique became a core component of modern deep reinforcement learning after its integration into the Deep Q-Network (DQN) algorithm by Mnih et al. in 2013. The replay buffer allows an agent to learn from previously collected transitions rather than only from the most recent interaction, improving both sample efficiency and training stability.
Imagine you are learning to play a video game. Every time you try something, you write down what happened on an index card: what you saw on screen, what button you pressed, whether you got points or lost a life, and what the screen looked like afterward. You toss each card into a big shoebox.
When it is time to study, instead of only looking at the very last card you wrote, you reach into the shoebox and pull out a random handful of cards. Some are from today, some from last week. By studying a mix of old and new experiences, you get a much better picture of the whole game rather than just the last thing that happened. That shoebox is the replay buffer.
Once the shoebox gets full, you throw away the oldest cards to make room for new ones. Some fancier versions of the shoebox let you mark certain cards as extra important so you pull them out more often, which helps you learn even faster.
The concept of experience replay was introduced by Long-Ji Lin in his 1992 paper "Self-improving reactive agents based on reinforcement learning, planning and teaching." Lin proposed storing past experiences and replaying them during learning as a way to accelerate convergence of reinforcement learning algorithms. At the time, reinforcement learning methods converged slowly, and experience replay was one of three extensions Lin developed to speed up the process (the other two being learning action models for planning and teaching from external advice).
The technique received relatively little attention for about two decades until Mnih et al. combined it with deep neural networks in their 2013 paper "Playing Atari with Deep Reinforcement Learning." This work introduced the DQN algorithm, which used a convolutional neural network to learn control policies directly from raw pixel inputs and relied on experience replay as its key stabilization technique. The follow-up 2015 Nature paper "Human-level control through deep reinforcement learning" added a second stabilization technique, the target network, and demonstrated that the same architecture and hyperparameters could achieve human-level performance across 49 Atari 2600 games.
The biological inspiration for experience replay comes from neuroscience research on hippocampal replay. During sleep and periods of rest, the hippocampus replays compressed sequences of neural activity patterns that correspond to previously experienced events. This replay process is believed to support memory consolidation by strengthening synaptic connections in cortical circuits. The parallel between biological memory replay and computational experience replay has been noted by multiple researchers, though the computational version was developed independently of the neuroscience findings.
At each timestep, the agent interacts with its environment and produces a transition tuple. The standard tuple contains five elements:
| Element | Symbol | Description |
|---|---|---|
| Current state | s | The observation of the environment before the agent acts |
| Action | a | The action selected by the agent's policy |
| Reward | r | The immediate reward signal received after taking the action |
| Next state | s' | The observation of the environment after the action is executed |
| Done flag | d | A boolean indicating whether the episode has terminated |
This tuple (s, a, r, s', d) is the fundamental unit of data stored in the replay buffer.
The replay buffer is typically implemented as a fixed-size circular buffer (also called a ring buffer). New transitions are appended to the buffer sequentially. Once the buffer reaches its maximum capacity, the oldest transitions are overwritten in a first-in-first-out (FIFO) manner. Common buffer sizes in practice range from 10,000 to 1,000,000 transitions, depending on the complexity of the environment and available memory.
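A minimal sketch of such a buffer is shown below. It stores transitions as Python named tuples and samples uniformly with the standard library's `random` module; the class and method names are illustrative rather than taken from any particular library.

```python
import random
from collections import namedtuple

# Field layout follows the (s, a, r, s', d) tuple described above.
Transition = namedtuple("Transition", ["state", "action", "reward", "next_state", "done"])

class ReplayBuffer:
    """Fixed-capacity circular buffer with FIFO overwriting and uniform sampling."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.storage = []      # grows until capacity is reached
        self.position = 0      # index of the next slot to (over)write

    def add(self, state, action, reward, next_state, done):
        transition = Transition(state, action, reward, next_state, done)
        if len(self.storage) < self.capacity:
            self.storage.append(transition)
        else:
            self.storage[self.position] = transition   # overwrite the oldest entry
        self.position = (self.position + 1) % self.capacity

    def sample(self, batch_size):
        # Uniform random minibatch without replacement.
        return random.sample(self.storage, batch_size)

    def __len__(self):
        return len(self.storage)
```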
An alternative to FIFO eviction is reservoir sampling, where each incoming transition has a probability of replacing a randomly selected existing transition. This approach ensures that all previously seen transitions have an equal probability of being retained, providing a more uniform sample of the agent's entire history rather than a recency-biased window.
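As a sketch, reservoir-style insertion can replace the FIFO `add` logic above; the function below is illustrative and assumes the caller tracks how many transitions have been observed so far.

```python
import random

def reservoir_add(storage, capacity, transition, num_seen):
    """Reservoir-style insertion: every transition ever observed has an equal
    probability (capacity / num_seen) of currently residing in the buffer.

    num_seen is the 1-based count of transitions observed so far, including this one.
    """
    if len(storage) < capacity:
        storage.append(transition)
    else:
        # Keep the new transition with probability capacity / num_seen,
        # overwriting a uniformly chosen existing slot; otherwise discard it.
        slot = random.randint(0, num_seen - 1)
        if slot < capacity:
            storage[slot] = transition
```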
During training, instead of using the most recent transition for a gradient update, the algorithm samples a random minibatch of transitions from the buffer. A typical minibatch size is 32 or 64 transitions. The sampled transitions are used to compute the loss function (for example, the temporal-difference error in Q-learning) and perform a gradient descent step to update the network parameters.
The following pseudocode outlines the basic experience replay loop:
```
Initialize replay buffer D with capacity N
Initialize Q-network Q with random weights
Initialize target network Q_target with the same weights as Q
for each episode:
    Initialize state s
    for each step:
        Select action a using an epsilon-greedy policy derived from Q
        Execute action a; observe reward r, next state s', and done flag
        Store transition (s, a, r, s', done) in D
        Sample a random minibatch of transitions from D
        For each sampled transition, compute the target:
            y = r                                      if done
            y = r + gamma * max_a' Q_target(s', a')    otherwise
        Update Q by gradient descent on (y - Q(s, a))^2
        Every C steps, copy the weights of Q into Q_target
        s = s'
```
Experience replay addresses several problems that arise when combining reinforcement learning with neural networks.
Consecutive transitions collected by an agent following a policy are highly correlated. For example, in an Atari game, consecutive frames differ by only a few pixels. Training a neural network on correlated sequential data violates the assumption of independently and identically distributed (i.i.d.) samples that underlies stochastic gradient descent, leading to unstable and inefficient learning. Random sampling from a replay buffer breaks these temporal correlations by presenting the network with a diverse, decorrelated set of transitions.
Without experience replay, each transition is used for exactly one gradient update and then discarded. This is wasteful, especially when environment interactions are expensive (for example, in robotics or complex simulations). With a replay buffer, each transition can be sampled and used for learning multiple times, significantly improving sample efficiency.
Neural networks are prone to catastrophic forgetting, where learning new information causes the network to lose previously acquired knowledge. When training only on the most recent transitions, the network may forget how to handle situations it encountered earlier in training. The replay buffer mitigates this by continuously mixing old and new experiences during training.
In reinforcement learning, the data distribution changes as the agent's policy improves. The states and rewards the agent encounters shift over time. A replay buffer smooths out these distributional shifts by maintaining a window of experiences collected under different versions of the policy.
The simplest form of experience replay samples transitions uniformly at random from the buffer. Each stored transition has an equal probability of being selected for training. This is the version used in the original DQN algorithm and remains a strong baseline due to its simplicity and lack of additional hyperparameters.
Prioritized experience replay (PER), introduced by Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver in 2015 (published at ICLR 2016), assigns a priority to each transition and samples transitions with probability proportional to their priority. The key idea is that some transitions are more informative than others, and the agent should replay those transitions more frequently.
The priority of a transition is typically based on the magnitude of its temporal-difference (TD) error, which measures how "surprising" the transition is to the current model. A large TD error means the network's prediction was far from the observed outcome, suggesting the transition contains information the network has not yet learned.
PER defines two variants for computing sampling probabilities:
| Variant | Priority formula | Sampling probability | Characteristics |
|---|---|---|---|
| Proportional | p_i = \|delta_i\| + epsilon | P(i) = p_i^alpha / sum(p_k^alpha) | Directly proportional to TD error magnitude; sensitive to outliers |
| Rank-based | p_i = 1 / rank(i) | P(i) = p_i^alpha / sum(p_k^alpha) | Based on rank ordering of TD errors; more robust to outliers; produces a power-law distribution |
In both variants, alpha controls how much prioritization is applied. When alpha = 0, sampling is uniform. When alpha = 1, sampling is fully proportional to priorities.
Prioritized sampling introduces bias because it changes the expected distribution of updates. To correct this bias, PER uses importance sampling weights:
w_i = (1 / (N * P(i)))^beta
The hyperparameter beta controls the degree of bias correction. It is typically annealed from a starting value (often 0.4 to 0.6) linearly to 1.0 over the course of training. At beta = 1, the bias is fully corrected. The annealing schedule reflects the fact that unbiased updates matter more toward the end of training when the policy is converging.
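The sketch below illustrates the proportional variant's arithmetic with NumPy: absolute TD errors are raised to the power alpha and normalized into sampling probabilities, and the resulting importance weights are scaled by the maximum weight (a common stabilization choice) before weighting the per-sample losses. The function names and example values are illustrative.

```python
import numpy as np

def per_probabilities(td_errors, alpha=0.6, eps=1e-6):
    """Proportional variant: p_i = |delta_i| + eps, sharpened by the exponent alpha."""
    scaled = (np.abs(td_errors) + eps) ** alpha
    return scaled / scaled.sum()

def importance_weights(probabilities, buffer_size, beta=0.4):
    """w_i = (1 / (N * P(i)))^beta, normalized by the largest weight for stability."""
    weights = (buffer_size * probabilities) ** (-beta)
    return weights / weights.max()

# Illustrative usage with made-up TD errors for a tiny buffer of four transitions.
td_errors = np.array([0.5, 2.0, 0.1, 1.2])
probs = per_probabilities(td_errors)
batch = np.random.choice(len(td_errors), size=2, p=probs, replace=False)
w = importance_weights(probs, buffer_size=len(td_errors))[batch]
# Each sampled transition's squared TD error is multiplied by its weight w
# before the gradient step.
```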
Efficient implementation of proportional PER uses a sum tree (a type of segment tree) data structure. Each leaf node stores the priority of one transition, and each internal node stores the sum of its children's priorities. The root node contains the total priority sum. This structure allows both sampling (proportional to priority) and priority updates in O(log N) time, compared to O(N) for a naive implementation.
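A compact sum tree might look like the following sketch: the leaves sit in the second half of a flat array, so both `update` and prefix-sum `sample` walk a single root-to-leaf path. The class is illustrative and not drawn from any specific library.

```python
import random

class SumTree:
    """Binary sum tree over `capacity` leaves.

    Leaf i holds the (positive) priority of transition i; each internal node
    holds the sum of its children, so the root holds the total priority mass.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.nodes = [0.0] * (2 * capacity)   # nodes[capacity + i] is leaf i

    def update(self, index, priority):
        """Set leaf `index` to `priority` and refresh ancestor sums in O(log N)."""
        pos = index + self.capacity
        self.nodes[pos] = priority
        pos //= 2
        while pos >= 1:
            self.nodes[pos] = self.nodes[2 * pos] + self.nodes[2 * pos + 1]
            pos //= 2

    def total(self):
        return self.nodes[1]

    def sample(self):
        """Draw a leaf index with probability proportional to its priority."""
        mass = random.uniform(0.0, self.total())
        pos = 1
        while pos < self.capacity:            # descend until reaching a leaf
            left = 2 * pos
            if mass <= self.nodes[left]:
                pos = left
            else:
                mass -= self.nodes[left]
                pos = left + 1
        return pos - self.capacity
```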
PER was shown to improve DQN performance on 41 out of 49 Atari 2600 games compared to uniform replay. In the Rainbow DQN ablation study by Hessel et al. (2018), prioritized experience replay was, together with multi-step learning, one of the components whose removal caused the largest drop in overall performance.
Hindsight experience replay (HER), proposed by Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, and others at OpenAI in 2017 (NeurIPS 2017), addresses the challenge of learning from sparse binary rewards in goal-conditioned reinforcement learning.
In many robotic manipulation tasks, the reward function is binary: the agent receives a reward of 0 if it achieves the goal and -1 otherwise. With such sparse rewards, the agent almost never experiences positive outcomes during early training, making learning extremely difficult.
HER solves this by retroactively relabeling the goal in stored transitions. After an episode ends, the agent stores not only the original transitions (with the intended goal) but also additional copies of the transitions with substitute goals that were actually achieved during the episode. This way, even a "failed" episode produces transitions where the agent "succeeded" at reaching alternative goals.
HER defines four goal-relabeling strategies:
| Strategy | Description |
|---|---|
| Final | Replace the goal with the final state achieved in the episode |
| Future | Replace the goal with a randomly selected state from later in the same episode |
| Episode | Replace the goal with a randomly selected state from anywhere in the episode |
| Random | Replace the goal with a randomly selected state from the entire replay buffer |
The "future" strategy is generally the most effective, as it provides a natural curriculum: early in training when the agent accomplishes little, the substitute goals are close to the starting state, and as the agent improves, the substitute goals move further away.
HER can be combined with any off-policy reinforcement learning algorithm, including DQN, DDPG, and SAC. It was demonstrated on robotic pushing, sliding, and pick-and-place tasks, and policies trained in simulation using HER were successfully transferred to physical robots.
Combined experience replay (CER), proposed by Shangtong Zhang and Richard Sutton in 2017, is a simple modification that ensures the most recent transition is always included in the training minibatch. Instead of sampling all transitions uniformly from the buffer, CER samples (batch_size - 1) transitions randomly and adds the latest transition to complete the batch. This addresses a potential problem with large replay buffers: when the buffer is very large, the most recent transition may not be sampled for a long time, delaying learning from the latest experience. CER adds only O(1) extra computation.
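A CER minibatch can be assembled as in the following sketch, assuming `storage` is a Python list ordered by insertion time so that its last element is the newest transition.

```python
import random

def sample_cer_batch(storage, batch_size):
    """Combined experience replay: the most recent transition is always included;
    the remaining slots are filled by uniform random sampling from the buffer."""
    assert len(storage) >= batch_size
    random_part = random.sample(storage, batch_size - 1)
    return random_part + [storage[-1]]   # assumes storage[-1] is the newest transition
```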
Standard experience replay stores single-step transitions (s, a, r, s'). N-step experience replay extends this by storing multi-step returns:
R_n = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... + gamma^{n-1} * r_{t+n-1}
The stored tuple becomes (s_t, a_t, R_n, s_{t+n}). N-step returns reduce the bias of bootstrapped value estimates at the cost of increased variance. In practice, n = 3 or n = 5 is commonly used. N-step experience replay is one of the components of the Rainbow DQN algorithm.
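The sketch below builds one n-step transition from a stored trajectory, assuming a list of (s, a, r, s', done) tuples and, for brevity, that the episode does not terminate inside the n-step window.

```python
def n_step_transition(trajectory, t, n, gamma):
    """Collapse steps t .. t+n-1 of a trajectory into a single n-step transition.

    `trajectory` is assumed to be a list of (state, action, reward, next_state, done)
    tuples, and the episode is assumed not to terminate inside the window.
    """
    R_n = 0.0
    for k in range(n):
        R_n += (gamma ** k) * trajectory[t + k][2]     # accumulate discounted rewards
    state, action = trajectory[t][0], trajectory[t][1]
    bootstrap_state = trajectory[t + n - 1][3]         # s_{t+n}, the state n steps ahead
    return (state, action, R_n, bootstrap_state)
```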
Ape-X (Distributed Prioritized Experience Replay), introduced by Dan Horgan, John Quan, David Budden, Gabriel Barth-Maron, Matteo Hessel, Hado van Hasselt, and David Silver in 2018 (ICLR 2018), scales experience replay to distributed settings. The architecture decouples acting from learning: many actor processes, each with its own copy of the network, generate experience in parallel and add it to a shared, centralized prioritized replay buffer, while a single learner process samples from that buffer, performs the gradient updates, and periodically sends fresh network parameters back to the actors.
This design allows the system to generate experience much faster than a single agent could. In experiments, Ape-X used hundreds of CPU actors and a single GPU learner. The system achieved state-of-the-art results on the Arcade Learning Environment (Atari games), training on billions of frames in a fraction of the wall-clock time required by single-agent methods.
R2D2 (Recurrent Replay Distributed DQN), proposed by Kapturowski et al. in 2019, extends the distributed replay paradigm to handle partial observability using recurrent neural networks. R2D2 uses an LSTM layer after the convolutional stack and stores fixed-length sequences of transitions (typically 80 steps with 40 steps of overlap between consecutive sequences) rather than individual transition tuples.
Using recurrent networks with experience replay introduces a challenge called recurrent state staleness: the hidden states stored with old transitions were computed under an older version of the network and may no longer be accurate. R2D2 addresses this through two strategies: storing the recurrent state alongside each sequence and using it to initialize the network at training time, and applying a burn-in period in which the first portion of each sequence is unrolled only to warm up the hidden state before any loss is computed.
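The sketch below illustrates how an episode might be chopped into overlapping training sequences that carry a stored initial hidden state and a burn-in prefix; the data layout, the `hidden_states` argument, and the burn-in length are assumptions for illustration only.

```python
def make_sequences(episode, hidden_states, seq_len=80, overlap=40, burn_in=40):
    """Chop one episode into fixed-length, overlapping training sequences.

    `episode` is a list of transitions and `hidden_states[t]` is the recurrent
    state the online network produced just before step t (both hypothetical
    formats). Each stored item carries the hidden state used to initialize the
    LSTM, plus a burn-in prefix that is unrolled but not trained on.
    """
    stride = seq_len - overlap
    sequences = []
    for start in range(0, max(1, len(episode) - seq_len + 1), stride):
        chunk = episode[start:start + seq_len]
        sequences.append({
            "initial_hidden": hidden_states[start],  # replayed to warm-start the LSTM
            "burn_in": chunk[:burn_in],              # unrolled only to refresh the state
            "training": chunk[burn_in:],             # steps that contribute to the loss
        })
    return sequences
```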
R2D2 achieved state-of-the-art results on both Atari-57 and DMLab-30 benchmarks using a single set of hyperparameters.
Experience replay is equally important for off-policy algorithms designed for continuous action spaces. The following table summarizes how several popular algorithms use replay buffers:
| Algorithm | Year | Action space | Replay buffer usage | Key details |
|---|---|---|---|---|
| DDPG | 2015 | Continuous | Uniform replay | Actor-critic; stores (s, a, r, s') tuples; typical buffer size of 1M transitions |
| TD3 | 2018 | Continuous | Uniform replay | Extends DDPG with clipped double Q-learning and delayed policy updates |
| SAC | 2018 | Continuous | Uniform replay | Maximum entropy framework; automatic temperature tuning; stochastic policy |
| DDPG + PER | Various | Continuous | Prioritized replay | Combines DDPG with prioritized experience replay for potentially faster convergence |
All of these algorithms are off-policy, meaning they can learn from data collected by a different (older) version of their policy. This property is what makes experience replay possible: the stored transitions remain useful even as the policy changes.
The size of the replay buffer is an important hyperparameter that requires tuning for each problem domain. A buffer that is too small fails to break temporal correlations effectively and provides insufficient diversity. A buffer that is too large consumes excessive memory and may retain outdated transitions that are no longer relevant to the current policy, potentially slowing down learning. Common buffer sizes in the literature range from 10,000 to 1,000,000 transitions. Research by Zhang and Sutton (2017) showed that the effect of buffer size on performance is complex and domain-dependent, with no universally optimal value.
For environments with high-dimensional observations (such as image-based environments), storing millions of transitions can require substantial memory. Several techniques help reduce memory usage: storing image observations as 8-bit integers rather than floating-point arrays, storing each frame only once and reconstructing stacked and successor observations at sampling time, and compressing observations before they are written to the buffer.
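As a rough, back-of-the-envelope illustration (the sizes are assumptions, not measurements), the arithmetic below compares naive storage of stacked uint8 frames against storing each new frame only once:

```python
# Rough memory estimate for a 1M-transition image buffer (assumed sizes, not a benchmark).
capacity = 1_000_000
frame    = 84 * 84          # one grayscale frame stored as uint8 -> 7,056 bytes
stacked  = 4 * frame        # 4-frame stack per observation       -> 28,224 bytes

naive = capacity * 2 * stacked   # storing s and s' separately  ~= 56.4 GB
dedup = capacity * frame         # storing each new frame once  ~=  7.1 GB
print(f"naive: {naive / 1e9:.1f} GB, frame-deduplicated: {dedup / 1e9:.1f} GB")
```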
Several software libraries provide production-quality replay buffer implementations:
| Framework | Developer | Key features |
|---|---|---|
| TensorFlow Agents | Google | TFUniformReplayBuffer, TFPrioritizedReplayBuffer; supports distributed collection |
| Stable-Baselines3 | Community | ReplayBuffer, HerReplayBuffer, DictReplayBuffer; PyTorch-native |
| RLlib (Ray) | Anyscale | MultiAgentReplayBuffer; supports multiple prioritization modes and sampling strategies |
| Reverb | DeepMind | High-performance C++ server with gRPC interface; supports uniform, prioritized, FIFO, LIFO sampling; rate limiting; scales to thousands of concurrent clients |
| cpprb | Community | C++ backed Python library; supports PER, N-step, HER, and various buffer types |
The combination of function approximation (neural networks), bootstrapping (using estimated values to compute targets), and off-policy learning (training on data from older policies via replay) constitutes what Richard Sutton and Andrew Barto call the "deadly triad." This combination can cause value estimates to diverge and become unbounded. Experience replay contributes to this problem by increasing the degree of off-policyness: transitions stored in the buffer may have been collected under a substantially different policy than the current one. Target networks and various regularization techniques partially mitigate this issue but do not eliminate it entirely.
As the agent's policy improves over time, older transitions in the buffer become less representative of the current data distribution. This staleness can slow learning or cause instability. Smaller buffer sizes reduce staleness but at the cost of less diversity. Some approaches address staleness explicitly by weighting recent transitions more heavily or by periodically clearing the buffer.
Experience replay can only be used with off-policy algorithms because the stored transitions were generated by a different (older) version of the policy. On-policy algorithms such as PPO and A2C require data collected under the current policy and cannot use a replay buffer directly. This limits the applicability of experience replay to algorithms like DQN, DDPG, TD3, and SAC.
Storing millions of transitions with high-dimensional observations (such as 84x84 pixel images with frame stacking) requires significant RAM. For very large-scale experiments, the replay buffer can become a memory bottleneck. Distributed replay systems like Reverb address this by running the buffer as a separate service, but this adds architectural complexity.
| Variant | Sampling strategy | Key benefit | Main overhead | Introduced |
|---|---|---|---|---|
| Uniform replay | Random uniform | Simple; no extra hyperparameters | None | Lin, 1992 |
| Prioritized (PER) | Proportional to TD error | Faster learning on informative transitions | Sum tree; importance sampling weights; alpha and beta hyperparameters | Schaul et al., 2015 |
| Hindsight (HER) | Uniform with goal relabeling | Enables learning from sparse binary rewards | Goal relabeling computation; only applicable to goal-conditioned tasks | Andrychowicz et al., 2017 |
| Combined (CER) | Uniform + latest transition | Ensures recency; negligible overhead | Minimal | Zhang and Sutton, 2017 |
| N-step | Random uniform over N-step returns | Reduced bootstrap bias | Increased variance; N-step return computation | Various |
| Distributed (Ape-X) | Prioritized, centralized buffer | Scales to many actors; massive throughput | Distributed infrastructure; network synchronization | Horgan et al., 2018 |
| Recurrent (R2D2) | Prioritized over sequences | Handles partial observability with LSTM | Sequence storage; burn-in computation; recurrent state management | Kapturowski et al., 2019 |