Experience Replay

Introduction

Experience replay is a foundational technique in reinforcement learning that allows an agent to store past interactions with its environment in a memory structure called a replay buffer and later resample those interactions during training. Rather than learning exclusively from the most recent experience, the agent randomly draws mini-batches of earlier transitions from the buffer, breaking the temporal correlations present in sequential data and dramatically improving both the stability and sample efficiency of the learning process.

First proposed by Long-Ji Lin in 1992, the technique remained relatively niche until it became a critical component of the Deep Q-Network (DQN) architecture introduced by Mnih et al. in 2013 and 2015. Since then, experience replay has become a standard building block of virtually all off-policy deep reinforcement learning algorithms, including Deep Deterministic Policy Gradient (DDPG), Twin Delayed DDPG (TD3), and Soft Actor-Critic (SAC). It also underpins offline reinforcement learning, distributed actor-learner systems, and goal-conditioned algorithms such as Hindsight Experience Replay.

A single transition is stored as a tuple (s, a, r, s', d), where s is the state observed before acting, a is the action chosen, r is the scalar reward received, s' is the next state, and d is a flag indicating whether s' is terminal. The buffer is typically a fixed-capacity ring that overwrites its oldest entries once full, and gradient updates draw uniform or prioritized mini-batches from this pool.

Historical background

The concept of experience replay was introduced by Long-Ji Lin in his 1992 paper "Self-Improving Reactive Agents Based on Reinforcement Learning, Planning and Teaching," published in the journal Machine Learning. In that work, Lin compared eight reinforcement learning frameworks built around two base algorithms: the Adaptive Heuristic Critic (AHC) and Q-learning. He proposed three extensions to speed up learning: experience replay, learning action models for planning, and teaching. Among these, experience replay proved to be the simplest and most broadly applicable.

Lin's central insight was that an agent could store its past experiences and revisit them later instead of discarding each transition after a single learning update. The idea is often compared to hippocampal replay in mammals, where the brain replays neural activity patterns from prior waking experiences during sleep and rest, consolidating memories and improving future decision-making; the best-known recordings of that phenomenon, however, were published after Lin's paper. Lin's thesis at Carnegie Mellon University, completed the following year, expanded these ideas and remains one of the most-cited early works on memory-based reinforcement learning.

The technique gained mainstream prominence in 2013 when Mnih et al. at DeepMind combined experience replay with deep neural networks to create the DQN algorithm. The arXiv preprint "Playing Atari with Deep Reinforcement Learning" (1312.5602) was followed by the 2015 Nature paper "Human-Level Control Through Deep Reinforcement Learning," which used a replay buffer of one million transitions to train an agent that achieved human-level performance on dozens of Atari 2600 games. This result demonstrated that experience replay was essential for stabilizing the training of deep neural networks in reinforcement learning settings, and it spread the technique across the entire deep reinforcement learning literature in the years that followed.

Connection to neuroscience

Wilson and McNaughton (1994) recorded ensembles of hippocampal place cells in rats during spatial tasks and during the slow-wave sleep that followed those tasks. They found that pairs of cells which fired together during waking exploration showed an elevated tendency to fire together again during subsequent sleep, and the effect decayed gradually over the course of the rest period. This phenomenon, often called hippocampal replay, was interpreted as a substrate for memory consolidation.

Follow-up work by Lee and Wilson (2002), Foster and Wilson (2006), and many others extended these findings, documenting both forward and reverse replay sequences and connecting them explicitly to temporal-difference learning. The convergence between the algorithmic notion of resampling stored transitions and the neural notion of reactivating cell ensembles has become an active interface between machine learning and computational neuroscience. Modern research treats experience replay as a useful normative model of why brains might consolidate memories during rest, and conversely treats hippocampal replay as a source of inspiration for new replay-based algorithms.

How experience replay works

The replay buffer

The replay buffer (also called experience memory or replay memory) is typically implemented as a circular buffer with a fixed maximum capacity. Each entry in the buffer represents a single transition, stored as a tuple:

Element	Symbol	Description
State	s	The environment state observed by the agent before taking an action
Action	a	The action selected by the agent
Reward	r	The scalar reward signal received from the environment after taking the action
Next State	s'	The environment state observed after the action was executed
Done Flag	d	A boolean indicating whether s' is a terminal state

As the agent interacts with the environment, new transitions are appended to the buffer. When the buffer reaches its maximum capacity, the oldest transitions are overwritten, ensuring the buffer always contains the most recent experiences up to its size limit. Some implementations also store auxiliary fields such as the discount factor at the time of the transition, the policy log-probability of the action, the time step within the episode, or a goal vector for goal-conditioned tasks.

Sampling and training

During each training step, the agent draws a random mini-batch of transitions from the replay buffer (typically 32 to 256 transitions). These sampled transitions are then used to compute loss values and update the agent's parameters via gradient descent. In the case of Q-learning variants, the sampled transitions are used to compute temporal-difference (TD) targets and minimize the TD error. The standard one-step TD target for transition (s, a, r, s', d) is:

y = r + gamma * (1 - d) * max_a' Q_target(s', a')

where gamma is the discount factor and Q_target is a slowly-updated copy of the Q-network. The loss minimized over the mini-batch is then the mean squared error between Q(s, a) and y.
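As a minimal sketch of the target and loss above, the following pure-Python functions compute y and the mean squared TD error for a small mini-batch. The names td_targets and mse_loss, and the list-of-lists stand-in for the target network's outputs, are assumptions made for illustration and do not come from any particular library.

```python
def td_targets(rewards, dones, next_q_values, gamma=0.99):
    """One-step TD targets y = r + gamma * (1 - d) * max_a' Q_target(s', a').

    rewards:       list of floats, one per transition
    dones:         list of 0/1 flags, 1 if s' is terminal
    next_q_values: next_q_values[i][a] stands in for Q_target(s'_i, a)
                   (hypothetical precomputed values, not a real network)
    """
    targets = []
    for r, d, q_next in zip(rewards, dones, next_q_values):
        bootstrap = max(q_next)  # max over next actions
        # (1 - d) removes the bootstrap term when s' is terminal.
        targets.append(r + gamma * (1.0 - d) * bootstrap)
    return targets


def mse_loss(q_taken, targets):
    """Mean squared error between Q(s, a) and the targets y."""
    return sum((q - y) ** 2 for q, y in zip(q_taken, targets)) / len(targets)


# Two transitions; the second is terminal, so its target is just the reward.
y = td_targets(rewards=[1.0, 2.0], dones=[0, 1],
               next_q_values=[[0.5, 1.5], [3.0, 4.0]], gamma=0.9)
loss = mse_loss(q_taken=[2.35, 1.0], targets=y)
```

In a real agent the targets would be computed without tracking gradients through Q_target, and only Q(s, a) would be differentiated.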

The random sampling is the key mechanism that provides experience replay's benefits. Because the mini-batch is drawn uniformly at random from a large pool of transitions collected over many episodes, consecutive samples in the training batch are unlikely to be correlated with each other, and the gradient estimate is closer to that of supervised learning on independent samples.

The four basic operations

Most replay buffer implementations expose a small interface with four core operations:

Operation	Description	Typical complexity
add	Append a transition (s, a, r, s', d) to the buffer	O(1)
sample	Draw a uniform random mini-batch of size B	O(B)
update_priorities	Update priority values after a learning step (PER only)	O(B log N)
clear	Reset the buffer to empty	O(1) or O(N)

The agent typically calls add once per environment step and sample once per gradient update, with the ratio of the two operations controlled by a hyperparameter known as the replay ratio.

Why experience replay works

Experience replay addresses several fundamental challenges that arise when training neural networks with reinforcement learning data.

Breaking temporal correlations

When an agent learns online from a stream of consecutive experiences, successive transitions are highly correlated. For example, an agent navigating a maze will see a long sequence of spatially adjacent states. Training a neural network on such correlated data violates the independent and identically distributed (i.i.d.) assumption underlying stochastic gradient descent, which can cause the network to overfit to recent trajectories and produce unstable weight updates. By sampling randomly from a large buffer, experience replay decorrelates the training data and approximates the i.i.d. condition.

Improved sample efficiency

Without replay, each transition is used for exactly one parameter update and then discarded. This is extremely wasteful, especially in environments where collecting data is expensive or slow. Experience replay allows each transition to be reused across multiple training updates, extracting more learning signal from each interaction with the environment. A single rare but informative transition can contribute to learning dozens or hundreds of times before it is eventually overwritten. Real-world robotic systems, where each environment step may take seconds and risk hardware wear, benefit disproportionately from this reuse.

Stabilized learning

In deep reinforcement learning, the target values used for training depend on the agent's own parameters, creating a moving-target problem. Random sampling from a diverse buffer smooths out these fluctuations by ensuring that any single training batch reflects a broad distribution of experiences rather than the agent's current behavioral regime. This stabilization effect was one of the key reasons DQN succeeded where earlier attempts to combine neural networks with Q-learning had failed.

Reduced catastrophic forgetting

Neural networks are prone to catastrophic forgetting, where learning new information erases previously acquired knowledge. By continually revisiting older experiences stored in the buffer, the agent retains knowledge about earlier parts of the state space even as its policy evolves and explores new regions. This effect is particularly important in long training runs where the agent's behavior changes substantially over time and the most recent on-policy data alone would not suffice to maintain accurate Q-values for older state regions.

Off-policy and on-policy considerations

Experience replay is fundamentally an off-policy technique. The transitions in the buffer were collected by past versions of the agent's policy, while gradient updates are applied to the current policy. Off-policy algorithms such as Q-learning, DQN, DDPG, TD3, and SAC are mathematically able to learn from data generated by a different (behavior) policy than the one being optimized, which is what makes replay viable for them.

In contrast, on-policy algorithms such as REINFORCE, A2C, A3C, TRPO, and PPO require that the data used for each gradient step comes from the current policy. Reusing old data would introduce bias that policy gradient theorems do not account for, so these algorithms typically use a small rollout buffer that holds only the most recent batch of trajectories and is then discarded. Some on-policy methods use importance sampling to partially correct for off-policy data, but they generally do not maintain large persistent replay buffers in the way DQN and its descendants do.

This off-policy versus on-policy distinction has practical consequences. Algorithms with replay tend to be more sample-efficient because they can squeeze multiple gradient updates out of each interaction. On-policy algorithms tend to be more stable and have stronger convergence guarantees because their training data always reflects the current behavior, but they need more environment interactions to reach the same performance.

Uniform vs prioritized sampling

The original experience replay formulation uses uniform random sampling: every transition in the buffer has an equal probability of being selected. While simple and effective, uniform sampling treats all transitions as equally valuable for learning, which is not always the case.

Uniform sampling

Uniform sampling is straightforward to implement and introduces no bias into the learning process. Its main limitation is inefficiency: many sampled transitions may be easy examples that the agent already handles well, while rare or surprising transitions that could drive significant learning progress are sampled no more frequently than any other.

Prioritized Experience Replay (PER)

Schaul et al. introduced Prioritized Experience Replay (PER) in arXiv preprint 1511.05952 (2015), and the work was published at ICLR 2016. The core idea is that transitions should be replayed in proportion to how much the agent can learn from them. The authors used the magnitude of the TD error as a proxy for learning potential: transitions where the agent's prediction was far from the actual outcome are presumably more informative.

PER defines the sampling probability for transition i as:

P(i) = p_i^alpha / sum_k(p_k^alpha)

where alpha controls the degree of prioritization (alpha = 0 yields uniform sampling) and p_i is the priority of transition i. The paper presents two variants for computing p_i:

Variant	Priority formula	Description
Proportional	p_i = abs(delta_i) + epsilon	Priority is the absolute TD error plus a small constant epsilon, which ensures that transitions with zero TD error can still be replayed
Rank-based	p_i = 1 / rank(i)	Priority is inversely proportional to the rank of the transition when sorted by TD error magnitude

The rank-based variant is more robust to outliers because it depends only on the ordering of TD errors rather than their raw magnitudes. Its heavy-tail distribution also ensures diversity in the sampled mini-batches.
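The two priority variants and the sampling rule P(i) can be sketched in a few lines. The function names and the default alpha and epsilon values below are illustrative assumptions. The example includes one outlier TD error to show why the rank-based variant is less sensitive to raw magnitudes.

```python
def proportional_priorities(td_errors, epsilon=1e-6):
    """p_i = |delta_i| + epsilon."""
    return [abs(d) + epsilon for d in td_errors]


def rank_based_priorities(td_errors):
    """p_i = 1 / rank(i), where rank 1 is the largest |delta_i|."""
    order = sorted(range(len(td_errors)),
                   key=lambda i: abs(td_errors[i]), reverse=True)
    priorities = [0.0] * len(td_errors)
    for rank, i in enumerate(order, start=1):
        priorities[i] = 1.0 / rank
    return priorities


def sampling_probabilities(priorities, alpha=0.6):
    """P(i) = p_i^alpha / sum_k p_k^alpha; alpha = 0 gives uniform sampling."""
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    return [s / total for s in scaled]


td_errors = [0.1, -2.0, 0.5, 100.0]  # the last entry is an outlier
probs_prop = sampling_probabilities(proportional_priorities(td_errors), alpha=1.0)
probs_rank = sampling_probabilities(rank_based_priorities(td_errors), alpha=1.0)
probs_uniform = sampling_probabilities(proportional_priorities(td_errors), alpha=0.0)
```

With alpha = 1 the outlier takes almost all of the probability mass under the proportional rule, but a little under half of it under the rank-based rule.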

Importance sampling correction

Prioritized sampling introduces bias because transitions with high TD errors are overrepresented relative to the true data distribution. To correct this, PER applies importance sampling weights:

w_i = (1/N * 1/P(i))^beta

where N is the buffer size and beta controls the degree of correction. The weights are normalized by dividing by the maximum weight in the mini-batch. The parameter beta is annealed from a low initial value to 1 over the course of training, reflecting the fact that unbiased updates become most important near convergence when the policy is close to optimal.
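A short sketch of the correction follows. It computes the weights for a sampled mini-batch, normalizes them by the batch maximum, and anneals beta linearly. The function names, the linear schedule, and the starting value of 0.4 are assumptions for illustration rather than a prescribed configuration.

```python
def importance_weights(sample_probs, buffer_size, beta):
    """w_i = (1/N * 1/P(i))^beta, divided by the largest weight in the batch."""
    raw = [(1.0 / (buffer_size * p)) ** beta for p in sample_probs]
    max_w = max(raw)
    return [w / max_w for w in raw]


def linear_beta(step, total_steps, beta_start=0.4):
    """Anneal beta from beta_start to 1 over total_steps, then hold it at 1."""
    fraction = min(1.0, step / total_steps)
    return beta_start + fraction * (1.0 - beta_start)


# Sampling probabilities of three drawn transitions in a buffer of N = 4.
weights = importance_weights([0.5, 0.25, 0.125], buffer_size=4, beta=1.0)
# Frequently sampled transitions get the smallest weights: [0.25, 0.5, 1.0]
```

Each weight multiplies the corresponding transition's TD error in the loss, so overrepresented transitions contribute proportionally less to the gradient.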

Experiments showed that Double DQN combined with prioritized experience replay significantly outperformed the previous state-of-the-art results on the Arcade Learning Environment (Atari) benchmark. PER outperformed DQN with uniform replay on 41 of 49 Atari games tested by Schaul et al.

Sum tree data structure

A naive implementation of proportional PER would require O(N) work for each sample, since the cumulative distribution must be inverted. To make sampling efficient, PER is typically implemented with a binary tree called a sum tree (a kind of segment tree), which stores transition priorities in its leaf nodes and the sum of its children's priorities at every internal node. The root node holds the total priority. Adding a new transition, updating an existing priority, and drawing a single sample each cost O(log N), so every buffer operation scales logarithmically with the buffer size.

A mini-batch of size B is drawn by partitioning the interval [0, p_total] into B equal-width segments, sampling a uniform random number from each segment, and traversing the tree to find the leaf corresponding to that priority mass. This stratified sampling reduces variance compared to drawing all B samples from the full interval.
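The following array-backed sum tree is a minimal sketch of this idea and not the implementation from the PER paper. The class name, the method names, and the choice to round the number of leaves up to a power of two are assumptions. The tree stores priorities only, on the assumption that the transitions themselves live in a separate array indexed by leaf number.

```python
import random


class SumTree:
    """Binary tree in a flat list: node 1 is the root, node i has children 2i and 2i+1."""

    def __init__(self, capacity):
        # Round the leaf count up to a power of two so that the descent in
        # find() visits leaves in left-to-right order.
        self.n_leaves = 1
        while self.n_leaves < capacity:
            self.n_leaves *= 2
        self.capacity = capacity
        self.tree = [0.0] * (2 * self.n_leaves)

    def total(self):
        return self.tree[1]

    def update(self, index, priority):
        """Set the priority of leaf `index` and refresh its ancestors: O(log N)."""
        node = index + self.n_leaves
        self.tree[node] = priority
        node //= 2
        while node >= 1:
            self.tree[node] = self.tree[2 * node] + self.tree[2 * node + 1]
            node //= 2

    def find(self, mass):
        """Return the leaf whose cumulative-priority interval contains `mass`."""
        node = 1
        while node < self.n_leaves:
            left, right = 2 * node, 2 * node + 1
            # Never descend into an empty right subtree; this guards against
            # mass == total() and small floating-point drift.
            if mass < self.tree[left] or self.tree[right] == 0.0:
                node = left
            else:
                mass -= self.tree[left]
                node = right
        return node - self.n_leaves

    def sample_batch(self, batch_size, rng=random):
        """Stratified sampling: one draw from each of batch_size equal segments."""
        segment = self.total() / batch_size
        return [self.find(rng.uniform(k * segment, (k + 1) * segment))
                for k in range(batch_size)]


tree = SumTree(capacity=4)
for i, p in enumerate([1.0, 2.0, 3.0, 4.0]):
    tree.update(i, p)
batch = tree.sample_batch(4, rng=random.Random(0))
```

Because each draw comes from its own segment of the priority mass, the returned leaf indices are non-decreasing, and the highest-priority leaf here is always included.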

Buffer size considerations

The size of the replay buffer is an important hyperparameter that involves several tradeoffs.

Factor	Small buffer	Large buffer
Memory usage	Low	High
Data diversity	Limited; recent transitions only	Broad; spans many episodes and policies
Off-policyness	Low; data is close to current policy	High; oldest transitions may come from very different policies
Correlation breaking	Less effective	More effective
Staleness risk	Low	High; old transitions may mislead the agent

Mnih et al. (2015) set the DQN replay buffer to hold one million transitions, and this value became a widely adopted default in subsequent work. Lillicrap et al. (2015) used the same buffer size for DDPG in continuous control tasks. However, research by Fedus et al. (2020) in "Revisiting Fundamentals of Experience Replay" showed that performance consistently improves with increased replay capacity for certain algorithms, while other algorithms are unaffected, suggesting that the one-million default is a practical starting point rather than a universally optimal choice.

The replay ratio, defined as the number of gradient updates per environment step, interacts closely with buffer size. Increasing the buffer while holding the replay ratio constant means the oldest transitions in the buffer become more stale. Conversely, reducing the buffer while maintaining the replay ratio keeps the data more on-policy but limits diversity. Finding the right balance depends on the specific algorithm, environment complexity, and computational budget.

The Fedus 2020 findings

Fedus, Ramachandran, Agarwal, Bengio, Larochelle, Rowland, and Dabney published "Revisiting Fundamentals of Experience Replay" at ICML 2020. Their study isolated the effects of replay capacity, the age of the oldest data, and the replay ratio across DQN and Rainbow DQN. They reported several findings that have shaped subsequent practice:

The original DQN does not benefit much from increasing the replay buffer beyond the default one million transitions.
Rainbow DQN does benefit, and the gain comes mainly from the use of n-step returns.
Adding n-step returns to plain DQN makes it benefit from larger buffers; removing n-step returns from Rainbow eliminates the benefit.
The replay ratio interacts strongly with capacity, and the right ratio depends on the algorithm.

These findings suggested that buffer size cannot be tuned independently of other algorithmic choices and that n-step bootstrapping plays a quiet but central role in modern off-policy methods.
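For readers unfamiliar with n-step returns, the sketch below shows how a replayed transition can be turned into an n-step one under the usual definition: discounted rewards are summed over up to n consecutive steps, and the bootstrap is taken from the state reached at the end. The function name and the tuple layout are assumptions for illustration.

```python
def n_step_transition(trajectory, start, n, gamma=0.99):
    """Collapse up to n consecutive steps into a single n-step transition.

    trajectory: list of (s, a, r, s_next, done) tuples from one episode
    start:      index of the first step (must be < len(trajectory))
    Returns (s, a, n_step_return, s_n, done_n, steps), where steps <= n is the
    number of rewards actually summed (fewer if the episode ends early).
    """
    s, a = trajectory[start][0], trajectory[start][1]
    ret, discount, steps = 0.0, 1.0, 0
    for _, _, r, s_next, done in trajectory[start:start + n]:
        ret += discount * r
        discount *= gamma
        steps += 1
        last_next, last_done = s_next, done
        if done:
            break  # do not sum rewards across an episode boundary
    return s, a, ret, last_next, last_done, steps


traj = [("s0", "a0", 1.0, "s1", False),
        ("s1", "a1", 1.0, "s2", False),
        ("s2", "a2", 1.0, "s3", False)]
s, a, ret, s_n, done_n, steps = n_step_transition(traj, start=0, n=3, gamma=0.5)
# ret = 1 + 0.5 + 0.25 = 1.75; the TD target would then be
# ret + gamma**steps * (1 - done_n) * max_a' Q_target(s_n, a')
```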

Replay ratio and the update-to-data ratio

The replay ratio is also known as the update-to-data ratio (UTD), or gradient steps per environment step. SAC and DDPG often default to a UTD of 1, meaning one gradient update per environment step. Methods such as REDQ (Chen et al. 2021), DroQ, and the work "Sample-Efficient Reinforcement Learning by Breaking the Replay Ratio Barrier" by D'Oro et al. (ICLR 2023) increase UTD to 10, 20, or higher, often combined with periodic network resets to prevent the value function from collapsing into pathological regions. High-UTD methods can match the sample efficiency of model-based RL on continuous control benchmarks but spend significantly more compute per environment interaction.
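The loop below is a schematic sketch of where the ratio enters a training loop. The environment and the gradient step are replaced by placeholders, and the function name, warm-up length, and other defaults are assumptions chosen only to make the counting visible.

```python
import random
from collections import deque


def train(num_env_steps, utd_ratio, batch_size=4, warmup=8, capacity=1000, seed=0):
    """Count gradient updates for a given update-to-data ratio."""
    rng = random.Random(seed)
    buffer = deque(maxlen=capacity)
    gradient_updates = 0
    for step in range(num_env_steps):
        # Placeholder for one environment interaction (hypothetical data).
        transition = (step, rng.randrange(2), rng.random(), step + 1, False)
        buffer.append(transition)
        if len(buffer) < warmup:
            continue  # wait until the buffer holds enough data to sample from
        for _ in range(utd_ratio):
            batch = rng.sample(list(buffer), batch_size)
            # A real agent would take one gradient step on `batch` here.
            gradient_updates += 1
    return gradient_updates


updates_utd1 = train(num_env_steps=100, utd_ratio=1)
updates_utd20 = train(num_env_steps=100, utd_ratio=20)
```

With 100 environment steps and an 8-step warm-up, a UTD of 1 yields 93 updates and a UTD of 20 yields 1,860, which shows how compute per interaction grows linearly with the ratio.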

Hindsight Experience Replay (HER)

Andrychowicz et al. (2017) introduced Hindsight Experience Replay (HER) at NeurIPS 2017 to tackle a persistent challenge in goal-conditioned reinforcement learning: sparse rewards. In many real-world tasks, the agent receives a reward signal only upon successfully completing a goal (for example, placing an object at a target location) and receives zero reward otherwise. Under these conditions, standard experience replay struggles because almost every transition in the buffer carries no useful reward signal.

HER addresses this by replaying each episode with substitute goals. After the agent completes an episode (even if it failed to reach the intended goal), HER stores its transitions in the replay buffer at least twice: once with the original goal, and again with the goal replaced by a state the agent actually reached during the episode. This relabeling trick turns failed experiences into successful ones under the substitute goal, providing a much denser learning signal even in sparse-reward environments.

Because HER modifies only the goals and not the environment dynamics, it can be combined with any off-policy reinforcement learning algorithm such as DQN, DDPG, or SAC. The researchers demonstrated that policies trained with HER in physics simulations could be successfully transferred to physical robots performing pushing, sliding, and pick-and-place tasks using only binary success/failure rewards. HER can be viewed as a form of implicit curriculum learning, where the agent gradually learns to reach increasingly distant goals by first mastering nearby ones.

Goal selection strategies

The original HER paper compared four strategies for selecting substitute goals during relabeling:

Strategy	Description	Typical use
final	Use the state reached at the end of the episode as the substitute goal	Simple baseline; often outperformed by future
future	Sample k future states from the same trajectory as substitute goals	Recommended default; k=4 or k=8 worked best
episode	Sample k random states from the same episode as substitute goals	Useful when within-episode states are diverse
random	Sample k random states from the entire replay buffer as substitute goals	Most off-policy; can hurt learning

The future strategy with k=4 or k=8 was the best-performing variant in pushing, sliding, and pick-and-place tasks. The exact value of k controls the augmentation factor: each real transition produces k synthetic transitions with relabeled goals, so the effective amount of data the agent sees grows by a factor of k+1.
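The future strategy can be sketched as follows. The dictionary keys, the function names, and the 0 / -1 sparse reward are assumptions for illustration. Real implementations usually relabel lazily at sample time rather than materializing every relabeled transition.

```python
import random


def sparse_reward(achieved_goal, goal):
    """Binary reward: 0 when the goal is achieved, -1 otherwise (an assumed convention)."""
    return 0.0 if achieved_goal == goal else -1.0


def her_relabel_future(episode, k, compute_reward, rng=random):
    """Return the original transitions plus k relabeled copies of each.

    episode: list of dicts with keys obs, action, next_obs,
             achieved_goal (the outcome reached in next_obs), desired_goal
    """
    transitions = []
    horizon = len(episode)
    for t, step in enumerate(episode):
        # The original goal first, then k goals achieved at this or a later step.
        goals = [step["desired_goal"]]
        goals += [episode[rng.randrange(t, horizon)]["achieved_goal"]
                  for _ in range(k)]
        for goal in goals:
            # Rewards are recomputed for each goal without re-running the environment.
            reward = compute_reward(step["achieved_goal"], goal)
            transitions.append((step["obs"], step["action"], reward,
                                step["next_obs"], goal))
    return transitions


# A failed episode: the agent walks from 0 to 3 while the desired goal is 10.
episode = [{"obs": i, "action": +1, "next_obs": i + 1,
            "achieved_goal": i + 1, "desired_goal": 10} for i in range(3)]
relabeled = her_relabel_future(episode, k=4, compute_reward=sparse_reward,
                               rng=random.Random(0))
```

Every original transition here has reward -1, while some relabeled copies have reward 0, and the output contains k + 1 = 5 transitions per real step.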

Implementation in libraries

Most modern RL frameworks ship a HER implementation. In Stable Baselines3, HER is no longer a separate algorithm but a buffer class called HerReplayBuffer that extends DictReplayBuffer and is passed to off-policy algorithms such as SAC, TD3, or DQN through the replay_buffer_class argument. The user supplies n_sampled_goal (the k value above) and goal_selection_strategy (one of "future", "final", or "episode"). The environment must follow the GoalEnv interface with a dict observation space containing observation, achieved_goal, and desired_goal keys, and must expose a vectorized compute_reward method so that the buffer can recompute rewards for relabeled goals without re-running the simulator.

Experience replay in key algorithms

Deep Q-Network (DQN)

The Deep Q-Network algorithm, introduced by Mnih et al. (2013, 2015), was the first to demonstrate that experience replay could enable stable training of deep neural networks for reinforcement learning at scale. DQN stores transitions in a buffer of one million entries and samples uniform random mini-batches of 32 transitions for each training step. Together with a separate target network (updated every 10,000 steps in the Nature paper), experience replay was one of DQN's two key innovations for stabilizing training. The 2015 Nature implementation also reduced computation by performing a gradient update only once every 4 agent steps (action selections) rather than after every step.

DDPG, TD3, and SAC

Deep Deterministic Policy Gradient (DDPG), introduced by Lillicrap et al. (2015), extended the DQN approach to continuous action spaces by combining a Q-function critic with a deterministic policy actor, both trained using transitions sampled from a replay buffer. TD3 (Fujimoto et al., 2018) improved on DDPG by using twin critics to reduce overestimation bias and delayed policy updates to stabilize learning, while continuing to rely on experience replay for training data. Soft Actor-Critic (SAC) (Haarnoja et al., 2018) added entropy regularization to encourage exploration, and like DDPG and TD3, uses a replay buffer as a core component of its training pipeline.

Rainbow DQN

Rainbow (Hessel et al., AAAI 2018) combined six independent improvements to DQN into a single agent: double Q-learning, prioritized experience replay, dueling networks, multi-step (n-step) returns, distributional RL (C51), and noisy networks for exploration. Prioritized experience replay was a key contributor to Rainbow's gains. In the distributional setting, Rainbow uses the Kullback-Leibler divergence between the predicted return distribution and the target distribution as the priority signal, replacing the scalar TD error used in the original PER paper. Rainbow matched DQN's final performance after 7 million frames and surpassed every individual baseline by 44 million frames on the 57-game Atari benchmark.

Distributed Prioritized Experience Replay (Ape-X)

Horgan et al. (2018) scaled prioritized experience replay to distributed settings with the Ape-X architecture. In Ape-X, hundreds or thousands of parallel actors each interact with their own copy of the environment and add transitions with initial priorities to a shared centralized replay buffer. A single learner process samples prioritized mini-batches from this shared buffer and updates the network weights. Actors periodically synchronize their network parameters with the learner. This architecture reduced wall-clock training time by factors of two to four compared to single-actor baselines while also improving final performance. The same idea was extended to recurrent networks in R2D2 (Kapturowski et al., ICLR 2019), which stored sequences of transitions rather than individual transitions and addressed the recurrent-state staleness that arises when stored LSTM hidden states were produced by out-of-date network parameters.

Algorithms and their replay usage

The table below summarizes how prominent off-policy and offline algorithms use experience replay.

Algorithm	Action space	Replay type	Default buffer size	Notes
DQN (2015)	Discrete	Uniform	1,000,000	Original deep RL replay; trained on Atari
Double DQN	Discrete	Uniform	1,000,000	Decoupled action selection and evaluation
Prioritized DQN	Discrete	Proportional or rank PER	1,000,000	Sampling weighted by abs TD error
Rainbow	Discrete	Proportional PER	1,000,000	Combines six DQN extensions
DDPG	Continuous	Uniform	1,000,000	Off-policy actor-critic for continuous control
TD3	Continuous	Uniform	1,000,000	Twin critics, delayed policy updates
SAC	Continuous	Uniform	1,000,000	Maximum-entropy actor-critic
HER + DDPG/SAC	Continuous, goal-conditioned	Uniform with goal relabeling	Variable	Sparse-reward goal tasks
Ape-X	Discrete	Distributed PER	up to 2,000,000 or more	Many actors share one prioritized buffer
R2D2	Discrete (recurrent)	Distributed PER over sequences	Variable	Stores fixed-length sequences
BCQ	Continuous, offline	Fixed dataset	Static	Offline RL with behavior-constrained policy
CQL	Continuous, offline	Fixed dataset	Static	Conservative Q-learning regularizer

Offline reinforcement learning

Offline reinforcement learning (also called batch RL) takes the idea of experience replay to its extreme: the agent never collects new data and trains entirely from a fixed dataset of transitions. The dataset itself is treated as a static replay buffer that is never overwritten, and the agent must extract the best possible policy from whatever data is available.

The main difficulty in offline RL is distributional shift: the policy being learned may visit state-action pairs that the dataset's behavior policy never visited, which causes the value function to extrapolate wildly into out-of-distribution regions. Algorithms such as Batch-Constrained Q-learning (BCQ; Fujimoto et al. 2019), Conservative Q-Learning (CQL; Kumar et al. 2020), Implicit Q-Learning (IQL; Kostrikov et al. 2022), and Behavior-Regularized Actor-Critic (BRAC) address this by either constraining the policy to stay close to the data distribution or by penalizing Q-values for unseen actions. Offline RL has become a major research area because it enables learning from logged data such as medical records, robot demonstrations, and historical user interactions without further online interaction.

A hybrid setting called offline-to-online RL pretrains on a fixed dataset and then fine-tunes online, often using two replay buffers: one containing the offline data and one filling up with newly collected online transitions. Mixing samples from both buffers during fine-tuning helps the agent retain knowledge from offline pretraining while adapting to its own freshly generated data.
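One simple way to mix the two sources is to draw a fixed fraction of every mini-batch from each. The sketch below does this with plain Python lists. The function name, the 50/50 default split, and the fallback to offline data while the online buffer is still small are assumptions, not a prescribed recipe.

```python
import random


def sample_mixed(offline_data, online_buffer, batch_size,
                 offline_fraction=0.5, rng=random):
    """Draw one mini-batch containing both offline and online transitions."""
    n_offline = int(round(batch_size * offline_fraction))
    # Early in fine-tuning the online buffer may hold too few transitions;
    # top the batch up with offline data in that case.
    n_online = min(batch_size - n_offline, len(online_buffer))
    n_offline = batch_size - n_online
    batch = rng.sample(offline_data, n_offline) + rng.sample(online_buffer, n_online)
    rng.shuffle(batch)
    return batch


offline = [("offline", i) for i in range(100)]
early_online = [("online", i) for i in range(3)]
later_online = [("online", i) for i in range(50)]

early_batch = sample_mixed(offline, early_online, 8, rng=random.Random(0))
later_batch = sample_mixed(offline, later_online, 8, rng=random.Random(0))
```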

Variants and extensions

A range of variants have been proposed to improve on uniform random sampling.

Variant	Year	Sampling rule	Key idea
Uniform replay	1992 (Lin); 2013 (DQN)	Uniform random	Original formulation
Prioritized Experience Replay	2015	Proportional or rank-based on abs TD error	Replay informative transitions more often
Hindsight Experience Replay	2017	Uniform with goal relabeling	Synthetic successes for sparse-reward tasks
Combined Experience Replay	2017 (Zhang and Sutton)	Uniform plus latest transition	Always include the most recent transition in each batch
Distributed PER (Ape-X)	2018	Centralized prioritized	Scale across many parallel actors
Reverse experience replay	Various	Sample in reverse temporal order within an episode	Inspired by hippocampal reverse replay
Episodic memory replay	Various	Sample whole episodes or sequences	Used with recurrent agents and meta-RL
Experience Replay Optimization (ERO)	2019 (Zha et al.)	Learned sampling network	Replace heuristic priority with a meta-learned policy
Map-based experience replay	2023	Sample from clustered states	Reduce catastrophic forgetting in continual RL

Reverse replay is motivated by neuroscience findings that hippocampal place cells often replay sequences in reverse order at decision points. In RL, reverse-order replay can speed credit assignment because the value of the terminal transition is updated first and propagated backward through the sequence in subsequent updates.
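A toy tabular example makes the credit-assignment argument concrete. It uses a four-step chain with a single reward at the end and a learning rate of 1, both assumptions chosen so that the numbers are easy to follow. One forward sweep moves value only into the last state, whereas one reverse sweep propagates it all the way back to the start.

```python
def td_sweep(transitions, values, gamma=0.9, lr=1.0):
    """Apply one TD(0) update per transition, in the order given."""
    for s, r, s_next, done in transitions:
        target = r + gamma * (0.0 if done else values[s_next])
        values[s] += lr * (target - values[s])
    return values


# Chain 0 -> 1 -> 2 -> 3 -> 4 (terminal); the only reward is on the final step.
episode = [(0, 0.0, 1, False), (1, 0.0, 2, False),
           (2, 0.0, 3, False), (3, 1.0, 4, True)]

forward = td_sweep(episode, [0.0] * 5)
backward = td_sweep(list(reversed(episode)), [0.0] * 5)
# forward  -> only state 3 has a nonzero value
# backward -> states 3, 2, 1, 0 hold 1, gamma, gamma^2, gamma^3
```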

Episodic memory replay stores entire trajectories rather than individual transitions and is particularly useful for partially observable problems where temporal context matters. R2D2 and IMPALA-style architectures use this approach.

Implementation

Minimal Python implementation

A simple uniform replay buffer can be implemented in a few lines of Python using a deque from the collections module:

import random
from collections import deque

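class ReplayBuffer:
    """Fixed-capacity uniform replay buffer backed by a deque."""

    def __init__(self, capacity):
        # A deque with maxlen drops its oldest entry automatically once full.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        # Store one transition (s, a, r, s', d).
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Draw batch_size distinct transitions uniformly at random.
        batch = random.sample(self.buffer, batch_size)
        # Regroup the list of tuples into one tuple per field.
        state, action, reward, next_state, done = zip(*batch)
        return state, action, reward, next_state, done

    def __len__(self):
        return len(self.buffer)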

This implementation is concise and correct, but it is inefficient for large buffers: indexing into the middle of a deque takes time proportional to the distance from the nearer end, so each sample call can cost up to O(B * N) rather than O(B), and converting the per-field tuples produced by zip(*batch) into tensors becomes a further bottleneck.

Tensor-backed implementation

Production implementations typically allocate fixed-size NumPy arrays or PyTorch tensors at construction time and write into them by index, avoiding the overhead of Python-level deques and per-sample tensor creation:

import numpy as np
import torch

class TensorReplayBuffer:
    def __init__(self, capacity, state_dim, action_dim, device="cuda"):
        self.capacity = capacity
        self.device = device
        self.states = np.empty((capacity, state_dim), dtype=np.float32)
        self.actions = np.empty((capacity, action_dim), dtype=np.float32)
        self.rewards = np.empty((capacity,), dtype=np.float32)
        self.next_states = np.empty((capacity, state_dim), dtype=np.float32)
        self.dones = np.empty((capacity,), dtype=np.float32)
        self.idx = 0
        self.size = 0

    def push(self, s, a, r, s2, d):
        i = self.idx
        self.states[i] = s
        self.actions[i] = a
        self.rewards[i] = r
        self.next_states[i] = s2
        self.dones[i] = d
        self.idx = (self.idx + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)

    def sample(self, batch_size):
        # Uniform indices over the filled portion of the arrays, drawn with replacement.
        idx = np.random.randint(0, self.size, size=batch_size)
        return (
            torch.from_numpy(self.states[idx]).to(self.device),
            torch.from_numpy(self.actions[idx]).to(self.device),
            torch.from_numpy(self.rewards[idx]).to(self.device),
            torch.from_numpy(self.next_states[idx]).to(self.device),
            torch.from_numpy(self.dones[idx]).to(self.device),
        )

For image-based observations such as Atari frames, a common optimization is to store frames as 8-bit unsigned integers and to store each frame only once, rebuilding the frame stack at sample time instead of saving overlapping stacks. This reduces memory usage by a factor of four or more. The Dopamine library and the original DQN code both apply this trick: a one-million transition buffer of 84 by 84 grayscale frames takes about 7 GB instead of the 28 GB needed to store a four-frame stack per transition.
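The frame-sharing idea can be sketched as follows. This is a minimal illustration, not the Dopamine or DQN implementation: the class name FrameStore and its methods are hypothetical, and it assumes a single episode and a linear (non-wrapping) array, so episode boundaries and ring-buffer overwrites are deliberately left out.

```python
import numpy as np


class FrameStore:
    """Hypothetical sketch: keep each uint8 frame once, rebuild stacks on demand.

    Assumes one episode and no ring wrap-around; a real buffer must also
    handle episode boundaries and overwriting of old frames.
    """

    def __init__(self, capacity, frame_shape=(84, 84), stack=4):
        self.frames = np.zeros((capacity,) + tuple(frame_shape), dtype=np.uint8)
        self.stack = stack
        self.size = 0

    def push(self, frame):
        if self.size >= len(self.frames):
            raise IndexError("this sketch does not implement wrap-around")
        self.frames[self.size] = frame
        self.size += 1

    def stacked(self, t):
        # Frames t-stack+1 .. t; indices below 0 are clamped, so the first
        # frame is repeated at the start of the episode.
        idx = np.clip(np.arange(t - self.stack + 1, t + 1), 0, None)
        return self.frames[idx]

    def sample(self, batch_size, rng):
        # Transition t needs frame t+1 for its next state, so t <= size - 2.
        t = rng.integers(0, self.size - 1, size=batch_size)
        states = np.stack([self.stacked(i) for i in t])
        next_states = np.stack([self.stacked(i + 1) for i in t])
        return states, next_states
```

Because the state and next-state stacks are views onto the same stored frames shifted by one step, the last three frames of each state equal the first three frames of its next state, and nothing is stored twice.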

Library implementations

Several mature libraries provide replay buffer implementations:

Library	Replay class	Notes
Stable Baselines3	ReplayBuffer, DictReplayBuffer, HerReplayBuffer	Used by SB3's SAC, TD3, DQN
TorchRL	ReplayBuffer with Storage backends (List, Tensor, LazyTensor, LazyMemmap)	Composable, supports custom samplers
RLlib	Various replay actors via RLlib's execution plans	Distributed replay built on Ray
Dopamine	OutOfGraphReplayBuffer	Memory-efficient for Atari; supports n-step
TF-Agents	TFUniformReplayBuffer	TensorFlow-native
Acme	Reverb	Distributed replay built on the Reverb backend

DeepMind's Reverb, used by Acme and several other DeepMind libraries, deserves special mention. It is a high-performance, distributed replay system written in C++ with Python bindings, supporting prioritized sampling, sequential sampling, and arbitrary user-defined samplers, all over a network. Reverb scales to billions of transitions and is the engine behind many of DeepMind's distributed RL results.

Memory and engineering considerations

Memory footprint

A replay buffer's memory footprint is dominated by the state and next-state fields. For an Atari-style buffer of one million transitions with 84 by 84 grayscale uint8 frames stacked four deep, the naive cost of storing both state and next-state would be:

2 * 1,000,000 * 4 * 84 * 84 = 56.4 billion bytes (about 56 GB)

This is why production implementations store only the unique frames and reconstruct state stacks at sample time. With deduplication, a one-million transition buffer fits in roughly 7 GB. For continuous control with low-dimensional state vectors (for example, a 17-dimensional MuJoCo state), the same buffer might take only a few hundred megabytes and fits easily in CPU RAM.
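The figures above can be reproduced with a few lines of arithmetic. The helper names below are illustrative only, and the low-dimensional case assumes float32 storage, a 6-dimensional action, and one float each for reward and done flag, none of which is specified in the text.

```python
def naive_image_bytes(n, stack, height, width, bytes_per_pixel=1):
    """State and next-state each stored as a full frame stack."""
    return 2 * n * stack * height * width * bytes_per_pixel


def dedup_image_bytes(n, height, width, bytes_per_pixel=1):
    """Each frame stored exactly once; stacks are rebuilt at sample time."""
    return n * height * width * bytes_per_pixel


def vector_bytes(n, state_dim, action_dim, bytes_per_value=4):
    """State, next state, action, reward and done flag as float32 (assumed)."""
    return n * (2 * state_dim + action_dim + 2) * bytes_per_value


n = 1_000_000
naive = naive_image_bytes(n, stack=4, height=84, width=84)  # 56_448_000_000
dedup = dedup_image_bytes(n, height=84, width=84)           # 7_056_000_000
mujoco = vector_bytes(n, state_dim=17, action_dim=6)        # 168_000_000

print(f"naive: {naive / 1e9:.1f} GB, dedup: {dedup / 1e9:.1f} GB, "
      f"vector: {mujoco / 1e6:.0f} MB")
```

Under these assumptions deduplication gives an eightfold saving over the naive layout, and the 17-dimensional case stays well under a gigabyte.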

CPU vs GPU storage

Replay buffers are usually kept in CPU RAM because most agents have buffer sizes much larger than typical GPU memory. Mini-batches are transferred to the GPU at sample time, often with pinned memory and asynchronous data loaders to overlap data movement with computation. A few high-throughput systems (TorchRL's LazyTensorStorage on CUDA, JAX-based implementations on TPU) keep the entire buffer in accelerator memory when it fits, which removes host-to-device copy latency at the cost of using accelerator RAM that could otherwise hold a larger model or batch.

Pinned memory and prefetching

For heavy training loops, sample throughput becomes a bottleneck. Common optimizations include allocating buffer arrays in pinned (page-locked) host memory so that PyTorch can issue asynchronous host-to-device transfers, prefetching the next mini-batch in a background thread, and replacing per-element Python iteration with vectorized NumPy or Torch operations. Distributed setups go further by sharding the buffer across machines and using RPC or Reverb-style sampling.
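Background prefetching can be sketched with the standard library alone. The Prefetcher class below is a hypothetical helper, not part of any library named here; it assumes a zero-argument sample_fn that returns one mini-batch and that is safe to call from a second thread (a real buffer that is written concurrently would need a lock).

```python
import queue
import threading


class Prefetcher:
    """Hypothetical sketch: keep up to `depth` mini-batches ready in advance."""

    def __init__(self, sample_fn, depth=2):
        self._sample_fn = sample_fn
        self._queue = queue.Queue(maxsize=depth)
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._worker, daemon=True)
        self._thread.start()

    def _worker(self):
        while not self._stop.is_set():
            batch = self._sample_fn()
            # Retry with a timeout so a full queue cannot block shutdown.
            while not self._stop.is_set():
                try:
                    self._queue.put(batch, timeout=0.1)
                    break
                except queue.Full:
                    continue

    def next(self):
        # Blocks only if the worker has fallen behind the training loop.
        return self._queue.get()

    def close(self):
        self._stop.set()
        self._thread.join()
```

In a training loop, sample_fn would typically wrap a call such as buffer.sample(batch_size), so that sampling for step t+1 overlaps with the gradient update for step t. The depth value trades memory for tolerance to jitter in sampling time.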

Numerical considerations

A few subtle implementation issues recur in practice. Storing rewards as float32 is usually fine, but storing returns (cumulative rewards) in float16 can overflow, since float16 cannot represent values above roughly 65,500. Boolean done flags should be stored as floats so they can be multiplied directly into the bootstrap term as (1 - done). When computing TD targets for a mini-batch, the bootstrap value must be evaluated at the next state s' (for example, the maximum over actions of the target network's Q(s', a')), not at the current state s. Mixing up the two fields is an off-by-one error that has appeared in many beginner implementations.
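A one-step Q-learning target that respects both points might look like the sketch below. The function name td_targets and the default discount are assumptions for illustration; next_q_values stands for the target network's output at the next-state field of the batch.

```python
import numpy as np


def td_targets(rewards, next_q_values, dones, gamma=0.99):
    """One-step Q-learning targets for a mini-batch.

    rewards:       shape (B,)
    next_q_values: shape (B, A), Q-values evaluated at s', not at s
    dones:         shape (B,), stored as floats in {0.0, 1.0}
    """
    bootstrap = next_q_values.max(axis=1)
    # (1 - done) zeroes the bootstrap term for terminal transitions.
    return rewards + gamma * (1.0 - dones) * bootstrap
```

For a terminal transition the target reduces to the reward alone, which is exactly what multiplying by (1 - done) is meant to guarantee.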

Evaluation and benchmarks

Replay-based methods are most commonly evaluated on three families of benchmarks. The Arcade Learning Environment (Atari 2600) introduced by Bellemare et al. (2013) was the original deep RL benchmark and remains a standard testbed for replay-based discrete-action methods. The DeepMind Control Suite and the OpenAI Gym MuJoCo tasks are the standard continuous control benchmarks for DDPG, TD3, SAC, and their replay variants. The OpenAI Gym Robotics tasks (Fetch and HandManipulate) are the standard sparse-reward goal-conditioned benchmarks for HER. More recent benchmarks include the DeepMind Control Suite from pixels, ProcGen and Crafter for generalization, and the D4RL and Robomimic datasets for offline RL.

Common pitfalls

Several mistakes recur in practice when implementing or tuning experience replay.

Pitfall	Symptom	Fix
Sampling before warmup	Initial Q-values are extremely noisy	Wait until the buffer holds a minimum number of transitions before training
Using on-policy data only	Q-values overfit to the current policy	Use a sufficiently large buffer or add an off-policy correction
Ignoring done flags	Bootstrapping past terminal states corrupts targets	Multiply the next-state value by (1 - done)
Forgetting importance sampling weights in PER	Biased gradient estimates	Apply (N * P(i))^(-beta) weights and normalize
Storing full state stacks	Out-of-memory on Atari-scale problems	Deduplicate frames or use uint8 storage
Mixing rewards across episodes	Bootstrapping crosses episode boundaries	Reset done flags correctly, use per-episode storage if using sequence sampling
Stale priorities	New transitions never get replayed	Initialize new transitions with the maximum priority in the buffer
Replay ratio too high	Q-function diverges or value collapses to a constant	Lower the ratio or apply periodic network resets

Most mature libraries handle these pitfalls automatically, but custom implementations frequently fall into one or more of them.
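Two of the prioritized-replay fixes in the table can be sketched briefly. The function names are hypothetical, the exponent defaults are assumptions, and normalizing by the largest weight in the sampled batch (rather than over the whole buffer) is one common choice, not the only one.

```python
import numpy as np


def sampling_probabilities(priorities, alpha=0.6):
    """P(i) proportional to priority_i ** alpha."""
    scaled = np.asarray(priorities, dtype=np.float64) ** alpha
    return scaled / scaled.sum()


def importance_weights(probs, indices, beta=0.4):
    """(N * P(i)) ** (-beta), normalized so the largest batch weight is 1."""
    n = len(probs)
    weights = (n * probs[indices]) ** (-beta)
    return weights / weights.max()


def initial_priority(priorities, default=1.0):
    """New transitions get the current maximum priority so they are replayed."""
    return float(np.max(priorities)) if len(priorities) > 0 else default
```

The weights multiply each sample's loss term before averaging. With equal priorities every weight is 1 and the update reduces to uniform replay.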

Explain like I'm 5 (ELI5)

Imagine you are learning to ride a bike. Every time you try, you remember what happened: how you leaned, how you pedaled, and whether you fell or stayed upright. Now imagine you have a big scrapbook where you paste a picture and a note about every single attempt.

Without the scrapbook, you would only remember your very last try. Maybe that last try was a lucky one where you did not fall, so you would not learn much about what to avoid. Or maybe you fell in a weird way that does not happen often, and you would overreact to that one bad memory.

With the scrapbook, you can flip to any random page before your next attempt and review a handful of old memories. Some are from yesterday, some from last week. By studying a mix of different memories instead of just the latest one, you learn faster, you do not forget the lessons from earlier attempts, and you do not overreact to any single ride. That scrapbook is what computer scientists call a replay buffer, and the process of flipping back through it to learn is called experience replay.

A more clever version of the scrapbook puts colored sticky notes on the pages with the most surprising mistakes, so you spend extra time reviewing those. That is what prioritized experience replay does. And another clever version, called hindsight experience replay, lets you flip back to a failed attempt and pretend the place you ended up was actually the place you meant to go, so even your failures teach you something useful.

References

Lin, L.-J. (1992). "Self-Improving Reactive Agents Based on Reinforcement Learning, Planning and Teaching." *Machine Learning*, 8, 293-321.
Wilson, M.A., and McNaughton, B.L. (1994). "Reactivation of Hippocampal Ensemble Memories During Sleep." *Science*, 265(5172), 676-679.
Mnih, V., Kavukcuoglu, K., Silver, D., et al. (2013). "Playing Atari with Deep Reinforcement Learning." *arXiv preprint arXiv:1312.5602*.
Mnih, V., Kavukcuoglu, K., Silver, D., et al. (2015). "Human-Level Control Through Deep Reinforcement Learning." *Nature*, 518(7540), 529-533.
Schaul, T., Quan, J., Antonoglou, I., and Silver, D. (2016). "Prioritized Experience Replay." *Proceedings of the International Conference on Learning Representations (ICLR 2016)*. arXiv:1511.05952.
Lillicrap, T.P., Hunt, J.J., Pritzel, A., et al. (2016). "Continuous Control with Deep Reinforcement Learning." *Proceedings of ICLR 2016*.
Andrychowicz, M., Wolski, F., Ray, A., et al. (2017). "Hindsight Experience Replay." *Advances in Neural Information Processing Systems (NeurIPS 2017)*. arXiv:1707.01495.
Hessel, M., Modayil, J., van Hasselt, H., et al. (2018). "Rainbow: Combining Improvements in Deep Reinforcement Learning." *Proceedings of AAAI 2018*. arXiv:1710.02298.
Fujimoto, S., van Hoof, H., and Meger, D. (2018). "Addressing Function Approximation Error in Actor-Critic Methods." *Proceedings of ICML 2018*.
Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. (2018). "Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor." *Proceedings of ICML 2018*.
Horgan, D., Quan, J., Budden, D., et al. (2018). "Distributed Prioritized Experience Replay." *Proceedings of ICLR 2018*.
Kapturowski, S., Ostrovski, G., Quan, J., Munos, R., and Dabney, W. (2019). "Recurrent Experience Replay in Distributed Reinforcement Learning." *Proceedings of ICLR 2019*.
Fujimoto, S., Meger, D., and Precup, D. (2019). "Off-Policy Deep Reinforcement Learning Without Exploration." *Proceedings of ICML 2019*.
Fedus, W., Ramachandran, P., Agarwal, R., Bengio, Y., Larochelle, H., Rowland, M., and Dabney, W. (2020). "Revisiting Fundamentals of Experience Replay." *Proceedings of ICML 2020*. arXiv:2007.06700.
Kumar, A., Zhou, A., Tucker, G., and Levine, S. (2020). "Conservative Q-Learning for Offline Reinforcement Learning." *Advances in Neural Information Processing Systems (NeurIPS 2020)*.
Chen, X., Wang, C., Zhou, Z., and Ross, K. (2021). "Randomized Ensembled Double Q-Learning: Learning Fast Without a Model." *Proceedings of ICLR 2021*.
Kostrikov, I., Nair, A., and Levine, S. (2022). "Offline Reinforcement Learning with Implicit Q-Learning." *Proceedings of ICLR 2022*.
D'Oro, P., Schwarzer, M., Nikishin, E., Bacon, P.-L., Bellemare, M.G., and Courville, A. (2023). "Sample-Efficient Reinforcement Learning by Breaking the Replay Ratio Barrier." *Proceedings of ICLR 2023*.
Sutton, R.S., and Barto, A.G. (2018). *Reinforcement Learning: An Introduction* (2nd ed.). MIT Press.
Stable Baselines3 documentation. "HER and HerReplayBuffer." Accessed 2026.
