A replay buffer (also called an experience replay buffer or replay memory) is a data structure used in reinforcement learning to store and reuse past experiences for training. First introduced by Long-Ji Lin in 1992, the technique became a core component of modern deep reinforcement learning after its integration into the Deep Q-Network (DQN) algorithm by Mnih et al. in 2013. The replay buffer allows an agent to learn from previously collected transitions rather than only from the most recent interaction, improving both sample efficiency and training stability.
Imagine you are learning to play a video game. Every time you try something, you write down what happened on an index card: what you saw on screen, what button you pressed, whether you got points or lost a life, and what the screen looked like afterward. You toss each card into a big shoebox.
When it is time to study, instead of only looking at the very last card you wrote, you reach into the shoebox and pull out a random handful of cards. Some are from today, some from last week. By studying a mix of old and new experiences, you get a much better picture of the whole game rather than just the last thing that happened. That shoebox is the replay buffer.
Once the shoebox gets full, you throw away the oldest cards to make room for new ones. Some fancier versions of the shoebox let you mark certain cards as extra important so you pull them out more often, which helps you learn even faster.
The concept of experience replay was introduced by Long-Ji Lin in his 1992 paper "Self-improving reactive agents based on reinforcement learning, planning and teaching." Lin proposed storing past experiences and replaying them during learning as a way to accelerate convergence of reinforcement learning algorithms. At the time, reinforcement learning methods converged slowly, and experience replay was one of three extensions Lin developed to speed up the process (the other two being learning action models for planning and teaching from external advice).
The technique received relatively little attention for about two decades until Mnih et al. combined it with deep neural networks in their 2013 paper "Playing Atari with Deep Reinforcement Learning." This work introduced the DQN algorithm, which used a convolutional neural network to learn control policies directly from raw pixel inputs and relied on experience replay as its key stabilization technique. The follow-up 2015 Nature paper "Human-level control through deep reinforcement learning" added a second stabilization technique, the target network, and demonstrated that the same architecture and hyperparameters could achieve human-level performance across 49 Atari 2600 games.
The biological inspiration for experience replay comes from neuroscience research on hippocampal replay. During sleep and periods of rest, the hippocampus replays compressed sequences of neural activity patterns that correspond to previously experienced events. This replay process is believed to support memory consolidation by strengthening synaptic connections in cortical circuits. The parallel between biological memory replay and computational experience replay has been noted by multiple researchers, though the computational version was developed independently of the neuroscience findings.
At each timestep, the agent interacts with its environment and produces a transition tuple. The standard tuple contains five elements:
| Element | Symbol | Description |
|---|---|---|
| Current state | s | The observation of the environment before the agent acts |
| Action | a | The action selected by the agent's policy |
| Reward | r | The immediate reward signal received after taking the action |
| Next state | s' | The observation of the environment after the action is executed |
| Done flag | d | A boolean indicating whether the episode has terminated |
This tuple (s, a, r, s', d) is the fundamental unit of data stored in the replay buffer.
The replay buffer is typically implemented as a fixed-size circular buffer (also called a ring buffer). New transitions are appended to the buffer sequentially. Once the buffer reaches its maximum capacity, the oldest transitions are overwritten in a first-in-first-out (FIFO) manner. Common buffer sizes in practice range from 10,000 to 1,000,000 transitions, depending on the complexity of the environment and available memory.
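A minimal sketch of such a buffer is shown below. It stores transitions as Python named tuples and samples uniformly with the standard library's `random` module; the class and method names are illustrative rather than taken from any particular library.

```python
import random
from collections import namedtuple

# Field layout follows the (s, a, r, s', d) tuple described above.
Transition = namedtuple("Transition", ["state", "action", "reward", "next_state", "done"])

class ReplayBuffer:
    """Fixed-capacity circular buffer with FIFO overwriting and uniform sampling."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.storage = []      # grows until capacity is reached
        self.position = 0      # index of the next slot to (over)write

    def add(self, state, action, reward, next_state, done):
        transition = Transition(state, action, reward, next_state, done)
        if len(self.storage) < self.capacity:
            self.storage.append(transition)
        else:
            self.storage[self.position] = transition   # overwrite the oldest entry
        self.position = (self.position + 1) % self.capacity

    def sample(self, batch_size):
        # Uniform random minibatch without replacement.
        return random.sample(self.storage, batch_size)

    def __len__(self):
        return len(self.storage)
```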
An alternative to FIFO eviction is reservoir sampling, where each incoming transition has a probability of replacing a randomly selected existing transition. This approach ensures that all previously seen transitions have an equal probability of being retained, providing a more uniform sample of the agent's entire history rather than a recency-biased window.
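As a sketch, reservoir-style insertion can replace the FIFO `add` logic above; the function below is illustrative and assumes the caller tracks how many transitions have been observed so far.

```python
import random

def reservoir_add(storage, capacity, transition, num_seen):
    """Reservoir-style insertion: every transition ever observed has an equal
    probability (capacity / num_seen) of currently residing in the buffer.

    num_seen is the 1-based count of transitions observed so far, including this one.
    """
    if len(storage) < capacity:
        storage.append(transition)
    else:
        # Keep the new transition with probability capacity / num_seen,
        # overwriting a uniformly chosen existing slot; otherwise discard it.
        slot = random.randint(0, num_seen - 1)
        if slot < capacity:
            storage[slot] = transition
```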
During training, instead of using the most recent transition for a gradient update, the algorithm samples a random minibatch of transitions from the buffer. A typical minibatch size is 32 or 64 transitions. The sampled transitions are used to compute the loss function (for example, the temporal-difference error in Q-learning) and perform a gradient descent step to update the network parameters.
The following pseudocode outlines the basic experience replay loop:
```
Initialize replay buffer D with capacity N
Initialize Q-network Q with random weights
Initialize target network Q_target with the same weights as Q
for each episode:
    Initialize state s
    for each step:
        Select action a using an epsilon-greedy policy derived from Q
        Execute action a; observe reward r, next state s', and done flag
        Store transition (s, a, r, s', done) in D
        Sample a random minibatch of transitions from D
        For each sampled transition, compute the target:
            y = r                                      if done
            y = r + gamma * max_a' Q_target(s', a')    otherwise
        Update Q by gradient descent on (y - Q(s, a))^2
        Every C steps, copy the weights of Q into Q_target
        s = s'
```
Experience replay addresses several problems that arise when combining reinforcement learning with neural networks.
Consecutive transitions collected by an agent following a policy are highly correlated. For example, in an Atari game, consecutive frames differ by only a few pixels. Training a neural network on correlated sequential data violates the assumption of independently and identically distributed (i.i.d.) samples that underlies stochastic gradient descent, leading to unstable and inefficient learning. Random sampling from a replay buffer breaks these temporal correlations by presenting the network with a diverse, decorrelated set of transitions.
Without experience replay, each transition is used for exactly one gradient update and then discarded. This is wasteful, especially when environment interactions are expensive (for example, in robotics or complex simulations). With a replay buffer, each transition can be sampled and used for learning multiple times, significantly improving sample efficiency.
Neural networks are prone to catastrophic forgetting, where learning new information causes the network to lose previously acquired knowledge. When training only on the most recent transitions, the network may forget how to handle situations it encountered earlier in training. The replay buffer mitigates this by continuously mixing old and new experiences during training.
In reinforcement learning, the data distribution changes as the agent's policy improves. The states and rewards the agent encounters shift over time. A replay buffer smooths out these distributional shifts by maintaining a window of experiences collected under different versions of the policy.
The simplest form of experience replay samples transitions uniformly at random from the buffer. Each stored transition has an equal probability of being selected for training. This is the version used in the original DQN algorithm and remains a strong baseline due to its simplicity and lack of additional hyperparameters.
Prioritized experience replay (PER), introduced by Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver in 2015 (published at ICLR 2016), assigns a priority to each transition and samples transitions with probability proportional to their priority. The key idea is that some transitions are more informative than others, and the agent should replay those transitions more frequently.
The priority of a transition is typically based on the magnitude of its temporal-difference (TD) error, which measures how "surprising" the transition is to the current model. A large TD error means the network's prediction was far from the observed outcome, suggesting the transition contains information the network has not yet learned.
PER defines two variants for computing sampling probabilities:
| Variant | Priority formula | Sampling probability | Characteristics |
|---|---|---|---|
| Proportional | p_i = \|delta_i\| + epsilon | P(i) = p_i^alpha / sum(p_k^alpha) | Directly proportional to TD error magnitude; sensitive to outliers |
| Rank-based | p_i = 1 / rank(i) | P(i) = p_i^alpha / sum(p_k^alpha) | Based on rank ordering of TD errors; more robust to outliers; produces a power-law distribution |
In both variants, alpha controls how much prioritization is applied. When alpha = 0, sampling is uniform. When alpha = 1, sampling is fully proportional to priorities.
Prioritized sampling introduces bias because it changes the expected distribution of updates. To correct this bias, PER uses importance sampling weights:
w_i = (1 / (N * P(i)))^beta
The hyperparameter beta controls the degree of bias correction. It is typically annealed from a starting value (often 0.4 to 0.6) linearly to 1.0 over the course of training. At beta = 1, the bias is fully corrected. The annealing schedule reflects the fact that unbiased updates matter more toward the end of training when the policy is converging.
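The sketch below illustrates the proportional variant's arithmetic with NumPy: absolute TD errors are raised to the power alpha and normalized into sampling probabilities, and the resulting importance weights are scaled by the maximum weight (a common stabilization choice) before weighting the per-sample losses. The function names and example values are illustrative.

```python
import numpy as np

def per_probabilities(td_errors, alpha=0.6, eps=1e-6):
    """Proportional variant: p_i = |delta_i| + eps, sharpened by the exponent alpha."""
    scaled = (np.abs(td_errors) + eps) ** alpha
    return scaled / scaled.sum()

def importance_weights(probabilities, buffer_size, beta=0.4):
    """w_i = (1 / (N * P(i)))^beta, normalized by the largest weight for stability."""
    weights = (buffer_size * probabilities) ** (-beta)
    return weights / weights.max()

# Illustrative usage with made-up TD errors for a tiny buffer of four transitions.
td_errors = np.array([0.5, 2.0, 0.1, 1.2])
probs = per_probabilities(td_errors)
batch = np.random.choice(len(td_errors), size=2, p=probs, replace=False)
w = importance_weights(probs, buffer_size=len(td_errors))[batch]
# Each sampled transition's squared TD error is multiplied by its weight w
# before the gradient step.
```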
Efficient implementation of proportional PER uses a sum tree (a type of segment tree) data structure. Each leaf node stores the priority of one transition, and each internal node stores the sum of its children's priorities. The root node contains the total priority sum. This structure allows both sampling (proportional to priority) and priority updates in O(log N) time, compared to O(N) for a naive implementation.
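A compact sum tree might look like the following sketch: the leaves sit in the second half of a flat array, so both `update` and prefix-sum `sample` walk a single root-to-leaf path. The class is illustrative and not drawn from any specific library.

```python
import random

class SumTree:
    """Binary sum tree over `capacity` leaves.

    Leaf i holds the (positive) priority of transition i; each internal node
    holds the sum of its children, so the root holds the total priority mass.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.nodes = [0.0] * (2 * capacity)   # nodes[capacity + i] is leaf i

    def update(self, index, priority):
        """Set leaf `index` to `priority` and refresh ancestor sums in O(log N)."""
        pos = index + self.capacity
        self.nodes[pos] = priority
        pos //= 2
        while pos >= 1:
            self.nodes[pos] = self.nodes[2 * pos] + self.nodes[2 * pos + 1]
            pos //= 2

    def total(self):
        return self.nodes[1]

    def sample(self):
        """Draw a leaf index with probability proportional to its priority."""
        mass = random.uniform(0.0, self.total())
        pos = 1
        while pos < self.capacity:            # descend until reaching a leaf
            left = 2 * pos
            if mass <= self.nodes[left]:
                pos = left
            else:
                mass -= self.nodes[left]
                pos = left + 1
        return pos - self.capacity
```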
PER was shown to improve DQN performance on 41 out of 49 Atari 2600 games compared to uniform replay. In the Rainbow DQN ablation study by Hessel et al. (2018), prioritized experience replay was, together with multi-step learning, one of the components whose removal caused the largest drop in overall performance.
Hindsight experience replay (HER), proposed by Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, and others at OpenAI in 2017 (NeurIPS 2017), addresses the challenge of learning from sparse binary rewards in goal-conditioned reinforcement learning.
In many robotic manipulation tasks, the reward function is binary: the agent receives a reward of 0 if it achieves the goal and -1 otherwise. With such sparse rewards, the agent almost never experiences positive outcomes during early training, making learning extremely difficult.
HER solves this by retroactively relabeling the goal in stored transitions. After an episode ends, the agent stores not only the original transitions (with the intended goal) but also additional copies of the transitions with substitute goals that were actually achieved during the episode. This way, even a "failed" episode produces transitions where the agent "succeeded" at reaching alternative goals.
HER defines four goal-relabeling strategies:
| Strategy | Description |
|---|---|
| Final | Replace the goal with the final state achieved in the episode |
| Future | Replace the goal with a randomly selected state from later in the same episode |
| Episode | Replace the goal with a randomly selected state from anywhere in the episode |
| Random | Replace the goal with a randomly selected state from the entire replay buffer |
The "future" strategy is generally the most effective, as it provides a natural curriculum: early in training when the agent accomplishes little, the substitute goals are close to the starting state, and as the agent improves, the substitute goals move further away.
HER can be combined with any off-policy reinforcement learning algorithm, including DQN, DDPG, and SAC. It was demonstrated on robotic pushing, sliding, and pick-and-place tasks, and policies trained in simulation using HER were successfully transferred to physical robots.
Combined experience replay (CER), proposed by Shangtong Zhang and Richard Sutton in 2017, is a simple modification that ensures the most recent transition is always included in the training minibatch. Instead of sampling all transitions uniformly from the buffer, CER samples (batch_size - 1) transitions randomly and adds the latest transition to complete the batch. This addresses a potential problem with large replay buffers: when the buffer is very large, the most recent transition may not be sampled for a long time, delaying learning from the latest experience. CER adds only O(1) extra computation.
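A CER minibatch can be assembled as in the following sketch, assuming `storage` is a Python list ordered by insertion time so that its last element is the newest transition.

```python
import random

def sample_cer_batch(storage, batch_size):
    """Combined experience replay: the most recent transition is always included;
    the remaining slots are filled by uniform random sampling from the buffer."""
    assert len(storage) >= batch_size
    random_part = random.sample(storage, batch_size - 1)
    return random_part + [storage[-1]]   # assumes storage[-1] is the newest transition
```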
Standard experience replay stores single-step transitions (s, a, r, s'). N-step experience replay extends this by storing multi-step returns:
R_n = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... + gamma^{n-1} * r_{t+n-1}
The stored tuple becomes (s_t, a_t, R_n, s_{t+n}). N-step returns reduce the bias of bootstrapped value estimates at the cost of increased variance. In practice, n = 3 or n = 5 is commonly used. N-step experience replay is one of the components of the Rainbow DQN algorithm.
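The sketch below builds one n-step transition from a stored trajectory, assuming a list of (s, a, r, s', done) tuples and, for brevity, that the episode does not terminate inside the n-step window.

```python
def n_step_transition(trajectory, t, n, gamma):
    """Collapse steps t .. t+n-1 of a trajectory into a single n-step transition.

    `trajectory` is assumed to be a list of (state, action, reward, next_state, done)
    tuples, and the episode is assumed not to terminate inside the window.
    """
    R_n = 0.0
    for k in range(n):
        R_n += (gamma ** k) * trajectory[t + k][2]     # accumulate discounted rewards
    state, action = trajectory[t][0], trajectory[t][1]
    bootstrap_state = trajectory[t + n - 1][3]         # s_{t+n}, the state n steps ahead
    return (state, action, R_n, bootstrap_state)
```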
Ape-X (Distributed Prioritized Experience Replay), introduced by Dan Horgan, John Quan, David Budden, Gabriel Barth-Maron, Matteo Hessel, Hado van Hasselt, and David Silver in 2018 (ICLR 2018), scales experience replay to distributed settings. The architecture decouples acting from learning: many actor processes, each with its own copy of the network, generate experience in parallel and add it to a shared, centralized prioritized replay buffer, while a single learner process samples from that buffer, performs the gradient updates, and periodically sends fresh network parameters back to the actors.
This design allows the system to generate experience much faster than a single agent could. In experiments, Ape-X used hundreds of CPU actors and a single GPU learner. The system achieved state-of-the-art results on the Arcade Learning Environment (Atari games), training on billions of frames in a fraction of the wall-clock time required by single-agent methods.
R2D2 (Recurrent Replay Distributed DQN), proposed by Kapturowski et al. in 2019, extends the distributed replay paradigm to handle partial observability using recurrent neural networks. R2D2 uses an LSTM layer after the convolutional stack and stores fixed-length sequences of transitions (typically 80 steps with 40 steps of overlap between consecutive sequences) rather than individual transition tuples.
Using recurrent networks with experience replay introduces a challenge called recurrent state staleness: the hidden states stored with old transitions were computed under an older version of the network and may no longer be accurate. R2D2 addresses this through two strategies: storing the recurrent state alongside each sequence and using it to initialize the network at training time, and applying a burn-in period in which the first portion of each sequence is unrolled only to warm up the hidden state before any loss is computed.
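The sketch below illustrates how an episode might be chopped into overlapping training sequences that carry a stored initial hidden state and a burn-in prefix; the data layout, the `hidden_states` argument, and the burn-in length are assumptions for illustration only.

```python
def make_sequences(episode, hidden_states, seq_len=80, overlap=40, burn_in=40):
    """Chop one episode into fixed-length, overlapping training sequences.

    `episode` is a list of transitions and `hidden_states[t]` is the recurrent
    state the online network produced just before step t (both hypothetical
    formats). Each stored item carries the hidden state used to initialize the
    LSTM, plus a burn-in prefix that is unrolled but not trained on.
    """
    stride = seq_len - overlap
    sequences = []
    for start in range(0, max(1, len(episode) - seq_len + 1), stride):
        chunk = episode[start:start + seq_len]
        sequences.append({
            "initial_hidden": hidden_states[start],  # replayed to warm-start the LSTM
            "burn_in": chunk[:burn_in],              # unrolled only to refresh the state
            "training": chunk[burn_in:],             # steps that contribute to the loss
        })
    return sequences
```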
R2D2 achieved state-of-the-art results on both Atari-57 and DMLab-30 benchmarks using a single set of hyperparameters.
Experience replay is equally important for off-policy algorithms designed for continuous action spaces. The following table summarizes how several popular algorithms use replay buffers:
| Algorithm | Year | Action space | Replay buffer usage | Key details |
|---|---|---|---|---|
| DDPG | 2015 | Continuous | Uniform replay | Actor-critic; stores (s, a, r, s') tuples; typical buffer size of 1M transitions |
| TD3 | 2018 | Continuous | Uniform replay | Extends DDPG with clipped double Q-learning and delayed policy updates |
| SAC | 2018 | Continuous | Uniform replay | Maximum entropy framework; automatic temperature tuning; stochastic policy |
| DDPG + PER | Various | Continuous | Prioritized replay | Combines DDPG with prioritized experience replay for potentially faster convergence |
All of these algorithms are off-policy, meaning they can learn from data collected by a different (older) version of their policy. This property is what makes experience replay possible: the stored transitions remain useful even as the policy changes.
The size of the replay buffer is an important hyperparameter that requires tuning for each problem domain. A buffer that is too small fails to break temporal correlations effectively and provides insufficient diversity. A buffer that is too large consumes excessive memory and may retain outdated transitions that are no longer relevant to the current policy, potentially slowing down learning. Common buffer sizes in the literature range from 10,000 to 1,000,000 transitions. Research by Zhang and Sutton (2017) showed that the effect of buffer size on performance is complex and domain-dependent, with no universally optimal value.
For environments with high-dimensional observations (such as image-based environments), storing millions of transitions can require substantial memory. Several techniques help reduce memory usage: storing image observations as 8-bit integers rather than floating-point arrays, storing each frame only once and reconstructing stacked and successor observations at sampling time, and compressing observations before they are written to the buffer.
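As a rough, back-of-the-envelope illustration (the sizes are assumptions, not measurements), the arithmetic below compares naive storage of stacked uint8 frames against storing each new frame only once:

```python
# Rough memory estimate for a 1M-transition image buffer (assumed sizes, not a benchmark).
capacity = 1_000_000
frame    = 84 * 84          # one grayscale frame stored as uint8 -> 7,056 bytes
stacked  = 4 * frame        # 4-frame stack per observation       -> 28,224 bytes

naive = capacity * 2 * stacked   # storing s and s' separately  ~= 56.4 GB
dedup = capacity * frame         # storing each new frame once  ~=  7.1 GB
print(f"naive: {naive / 1e9:.1f} GB, frame-deduplicated: {dedup / 1e9:.1f} GB")
```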
Several software libraries provide production-quality replay buffer implementations:
| Framework | Developer | Key features |
|---|---|---|
| TensorFlow Agents | Google | TFUniformReplayBuffer, TFPrioritizedReplayBuffer; supports distributed collection |
| Stable-Baselines3 | Community | ReplayBuffer, HerReplayBuffer, DictReplayBuffer; PyTorch-native |
| RLlib (Ray) | Anyscale | MultiAgentReplayBuffer; supports multiple prioritization modes and sampling strategies |
| Reverb | DeepMind | High-performance C++ server with gRPC interface; supports uniform, prioritized, FIFO, LIFO sampling; rate limiting; scales to thousands of concurrent clients |
| cpprb | Community | C++ backed Python library; supports PER, N-step, HER, and various buffer types |
The combination of function approximation (neural networks), bootstrapping (using estimated values to compute targets), and off-policy learning (training on data from older policies via replay) constitutes what Richard Sutton and Andrew Barto call the "deadly triad." This combination can cause value estimates to diverge and become unbounded. Experience replay contributes to this problem by increasing the degree of off-policyness: transitions stored in the buffer may have been collected under a substantially different policy than the current one. Target networks and various regularization techniques partially mitigate this issue but do not eliminate it entirely.
As the agent's policy improves over time, older transitions in the buffer become less representative of the current data distribution. This staleness can slow learning or cause instability. Smaller buffer sizes reduce staleness but at the cost of less diversity. Some approaches address staleness explicitly by weighting recent transitions more heavily or by periodically clearing the buffer.
Experience replay can only be used with off-policy algorithms because the stored transitions were generated by a different (older) version of the policy. On-policy algorithms such as PPO and A2C require data collected under the current policy and cannot use a replay buffer directly. This limits the applicability of experience replay to algorithms like DQN, DDPG, TD3, and SAC.
Storing millions of transitions with high-dimensional observations (such as 84x84 pixel images with frame stacking) requires significant RAM. For very large-scale experiments, the replay buffer can become a memory bottleneck. Distributed replay systems like Reverb address this by running the buffer as a separate service, but this adds architectural complexity.
| Variant | Sampling strategy | Key benefit | Main overhead | Introduced |
|---|---|---|---|---|
| Uniform replay | Random uniform | Simple; no extra hyperparameters | None | Lin, 1992 |
| Prioritized (PER) | Proportional to TD error | Faster learning on informative transitions | Sum tree; importance sampling weights; alpha and beta hyperparameters | Schaul et al., 2015 |
| Hindsight (HER) | Uniform with goal relabeling | Enables learning from sparse binary rewards | Goal relabeling computation; only applicable to goal-conditioned tasks | Andrychowicz et al., 2017 |
| Combined (CER) | Uniform + latest transition | Ensures recency; negligible overhead | Minimal | Zhang and Sutton, 2017 |
| N-step | Random uniform over N-step returns | Reduced bootstrap bias | Increased variance; N-step return computation | Various |
| Distributed (Ape-X) | Prioritized, centralized buffer | Scales to many actors; massive throughput | Distributed infrastructure; network synchronization | Horgan et al., 2018 |
| Recurrent (R2D2) | Prioritized over sequences | Handles partial observability with LSTM | Sequence storage; burn-in computation; recurrent state management | Kapturowski et al., 2019 |