Experience Replay
Last reviewed
May 9, 2026
Sources
20 citations
Review status
Source-backed
Revision
v3 · 6,498 words
See also: Reinforcement Learning, Deep Q-Network (DQN), Q-Learning, Replay Buffer
Experience replay is a foundational technique in reinforcement learning that allows an agent to store past interactions with its environment in a memory structure called a replay buffer and later resample those interactions during training. Rather than learning exclusively from the most recent experience, the agent randomly draws mini-batches of earlier transitions from the buffer, breaking the temporal correlations present in sequential data and dramatically improving both the stability and sample efficiency of the learning process.
First proposed by Long-Ji Lin in 1992, the technique remained relatively niche until it became a critical component of the Deep Q-Network (DQN) architecture introduced by Mnih et al. in 2013 and 2015. Since then, experience replay has become a standard building block of virtually all off-policy deep reinforcement learning algorithms, including Deep Deterministic Policy Gradient (DDPG), Twin Delayed DDPG (TD3), and Soft Actor-Critic (SAC). It also underpins offline reinforcement learning, distributed actor-learner systems, and goal-conditioned algorithms such as Hindsight Experience Replay.
A single transition is stored as a tuple (s, a, r, s', d), where s is the state observed before acting, a is the action chosen, r is the scalar reward received, s' is the next state, and d is a flag indicating whether s' is terminal. The buffer is typically a fixed-capacity ring that overwrites its oldest entries once full, and gradient updates draw uniform or prioritized mini-batches from this pool.
The concept of experience replay was introduced by Long-Ji Lin in his 1992 paper "Self-Improving Reactive Agents Based on Reinforcement Learning, Planning and Teaching," published in the journal Machine Learning. In that work, Lin compared eight reinforcement learning frameworks built around two base algorithms: the Adaptive Heuristic Critic (AHC) and Q-learning. He proposed three extensions to speed up learning: experience replay, learning action models for planning, and teaching. Among these, experience replay proved to be the simplest and most broadly applicable.
Lin's central insight was that an agent could store its past experiences and revisit them later instead of discarding each transition after a single learning update. The idea has a loose biological parallel in hippocampal replay in mammals, where the brain replays neural activity patterns from prior waking experiences during sleep and rest, consolidating memories and improving future decision-making. Lin's thesis at Carnegie Mellon University, completed the following year, expanded these ideas and remains one of the most-cited early works on memory-based reinforcement learning.
The technique gained mainstream prominence in 2013 when Mnih et al. at DeepMind combined experience replay with deep neural networks to create the DQN algorithm. The arXiv preprint "Playing Atari with Deep Reinforcement Learning" (1312.5602) was followed by the 2015 Nature paper "Human-Level Control Through Deep Reinforcement Learning," which used a replay buffer of one million transitions to train an agent that achieved human-level performance on dozens of Atari 2600 games. This result demonstrated that experience replay was essential for stabilizing the training of deep neural networks in reinforcement learning settings, and it spread the technique across the entire deep reinforcement learning literature in the years that followed.
Wilson and McNaughton (1994) recorded ensembles of hippocampal place cells in rats during spatial tasks and during the slow-wave sleep that followed those tasks. They found that pairs of cells which fired together during waking exploration showed an elevated tendency to fire together again during subsequent sleep, and the effect decayed gradually over the course of the rest period. This phenomenon, often called hippocampal replay, was interpreted as a substrate for memory consolidation.
Follow-up work by Lee and Wilson (2002), Foster and Wilson (2006), and many others extended these findings, documenting both forward and reverse replay sequences and connecting them explicitly to temporal-difference learning. The convergence between the algorithmic notion of resampling stored transitions and the neural notion of reactivating cell ensembles has become an active interface between machine learning and computational neuroscience. Modern research treats experience replay as a useful normative model of why brains might consolidate memories during rest, and conversely treats hippocampal replay as a source of inspiration for new replay-based algorithms.
The replay buffer (also called experience memory or replay memory) is typically implemented as a circular buffer with a fixed maximum capacity. Each entry in the buffer represents a single transition, stored as a tuple:
| Element | Symbol | Description |
|---|---|---|
| State | s | The environment state observed by the agent before taking an action |
| Action | a | The action selected by the agent |
| Reward | r | The scalar reward signal received from the environment after taking the action |
| Next State | s' | The environment state observed after the action was executed |
| Done Flag | d | A boolean indicating whether s' is a terminal state |
As the agent interacts with the environment, new transitions are appended to the buffer. When the buffer reaches its maximum capacity, the oldest transitions are overwritten, ensuring the buffer always contains the most recent experiences up to its size limit. Some implementations also store auxiliary fields such as the discount factor at the time of the transition, the policy log-probability of the action, the time step within the episode, or a goal vector for goal-conditioned tasks.
During each training step, the agent draws a random mini-batch of transitions from the replay buffer (typically 32 to 256 transitions). These sampled transitions are then used to compute loss values and update the agent's parameters via gradient descent. In the case of Q-learning variants, the sampled transitions are used to compute temporal-difference (TD) targets and minimize the TD error. The standard one-step TD target for transition (s, a, r, s', d) is:
y = r + gamma * (1 - d) * max_a' Q_target(s', a')
where gamma is the discount factor and Q_target is a slowly-updated copy of the Q-network. The loss minimized over the mini-batch is then the mean squared error between Q(s, a) and y.
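The target and loss above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names `td_targets` and `mse_loss` and the toy batch values are assumptions made for the example, and a real agent would compute the same quantities on tensors.

```python
def td_targets(rewards, dones, next_q_values, gamma=0.99):
    """Compute y = r + gamma * (1 - d) * max_a' Q_target(s', a') per transition.

    next_q_values[i] holds the target network's Q-values for every action in s'_i.
    """
    return [
        r + gamma * (1.0 - d) * max(q_next)
        for r, d, q_next in zip(rewards, dones, next_q_values)
    ]


def mse_loss(q_taken, targets):
    """Mean squared error between Q(s, a) and the TD targets y."""
    return sum((q - y) ** 2 for q, y in zip(q_taken, targets)) / len(targets)


# Hypothetical mini-batch of two transitions; the second one is terminal.
rewards = [1.0, 0.0]
dones = [0.0, 1.0]
next_q = [[0.5, 2.0], [3.0, 4.0]]  # Q_target(s', .) for each transition
targets = td_targets(rewards, dones, next_q, gamma=0.9)
loss = mse_loss([2.0, 1.0], targets)  # [2.0, 1.0] stands in for Q(s, a)
```

Because the second transition is terminal, the factor (1 - d) removes the bootstrap term and its target is just the reward.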
The random sampling is the key mechanism that provides experience replay's benefits. Because the mini-batch is drawn uniformly at random from a large pool of transitions collected over many episodes, consecutive samples in the training batch are unlikely to be correlated with each other, and the gradient estimate is closer to that of supervised learning on independent samples.
Most replay buffer implementations expose a small interface with four core operations:
| Operation | Description | Typical complexity |
|---|---|---|
| add | Append a transition (s, a, r, s', d) to the buffer | O(1) |
| sample | Draw a uniform random mini-batch of size B | O(B) |
| update_priorities | Update priority values after a learning step (PER only) | O(B log N) |
| clear | Reset the buffer to empty | O(1) or O(N) |
The agent typically calls add once per environment step and sample once per gradient update, with the ratio of the two operations controlled by a hyperparameter known as the replay ratio.
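The interplay between add, sample, and the replay ratio can be sketched as a toy training loop. Everything here is illustrative: `run_training`, the `warmup` threshold, and the dummy transitions are assumptions, and no real environment or network is involved. The loop accumulates fractional "update credit" so that replay ratios below 1 (for example one update every four steps) and above 1 are handled the same way.

```python
import random
from collections import deque


def run_training(num_env_steps, replay_ratio, batch_size=4, capacity=100,
                 warmup=8, seed=0):
    """Return how many gradient updates a given replay ratio produces."""
    rng = random.Random(seed)
    buffer = deque(maxlen=capacity)
    gradient_updates = 0
    update_credit = 0.0
    for t in range(num_env_steps):
        # One `add` per environment step (a dummy transition stands in
        # for a real environment interaction).
        buffer.append((t, rng.randrange(2), rng.random(), t + 1, False))
        if len(buffer) < warmup:
            continue  # wait until the buffer holds enough data to sample
        update_credit += replay_ratio
        while update_credit >= 1.0:
            batch = rng.sample(list(buffer), batch_size)  # one `sample` per update
            # A real agent would compute a loss on `batch` and take a step here.
            gradient_updates += 1
            update_credit -= 1.0
    return gradient_updates


updates_low = run_training(20, replay_ratio=0.25)
updates_high = run_training(20, replay_ratio=2)
```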
Experience replay addresses several fundamental challenges that arise when training neural networks with reinforcement learning data.
When an agent learns online from a stream of consecutive experiences, successive transitions are highly correlated. For example, an agent navigating a maze will see a long sequence of spatially adjacent states. Training a neural network on such correlated data violates the independent and identically distributed (i.i.d.) assumption underlying stochastic gradient descent, which can cause the network to overfit to recent trajectories and produce unstable weight updates. By sampling randomly from a large buffer, experience replay decorrelates the training data and approximates the i.i.d. condition.
Without replay, each transition is used for exactly one parameter update and then discarded. This is extremely wasteful, especially in environments where collecting data is expensive or slow. Experience replay allows each transition to be reused across multiple training updates, extracting more learning signal from each interaction with the environment. A single rare but informative transition can contribute to learning dozens or hundreds of times before it is eventually overwritten. Real-world robotic systems, where each environment step may take seconds and risk hardware wear, benefit disproportionately from this reuse.
In deep reinforcement learning, the target values used for training depend on the agent's own parameters, creating a moving-target problem. Random sampling from a diverse buffer smooths out these fluctuations by ensuring that any single training batch reflects a broad distribution of experiences rather than the agent's current behavioral regime. This stabilization effect was one of the key reasons DQN succeeded where earlier attempts to combine neural networks with Q-learning had failed.
Neural networks are prone to catastrophic forgetting, where learning new information erases previously acquired knowledge. By continually revisiting older experiences stored in the buffer, the agent retains knowledge about earlier parts of the state space even as its policy evolves and explores new regions. This effect is particularly important in long training runs where the agent's behavior changes substantially over time and the most recent on-policy data alone would not suffice to maintain accurate Q-values for older state regions.
Experience replay is fundamentally an off-policy technique. The transitions in the buffer were collected by past versions of the agent's policy, while gradient updates are applied to the current policy. Off-policy algorithms such as Q-learning, DQN, DDPG, TD3, and SAC are mathematically able to learn from data generated by a different (behavior) policy than the one being optimized, which is what makes replay viable for them.
In contrast, on-policy algorithms such as REINFORCE, A2C, A3C, TRPO, and PPO require that the data used for each gradient step comes from the current policy. Reusing old data would introduce bias that policy gradient theorems do not account for, so these algorithms typically use a small rollout buffer that holds only the most recent batch of trajectories and is then discarded. Some on-policy methods use importance sampling to partially correct for off-policy data, but they generally do not maintain large persistent replay buffers in the way DQN and its descendants do.
This off-policy versus on-policy distinction has practical consequences. Algorithms with replay tend to be more sample-efficient because they can squeeze multiple gradient updates out of each interaction. On-policy algorithms tend to be more stable and have stronger convergence guarantees because their training data always reflects the current behavior, but they need more environment interactions to reach the same performance.
The original experience replay formulation uses uniform random sampling: every transition in the buffer has an equal probability of being selected. While simple and effective, uniform sampling treats all transitions as equally valuable for learning, which is not always the case.
Uniform sampling is straightforward to implement and introduces no bias into the learning process. Its main limitation is inefficiency: many sampled transitions may be easy examples that the agent already handles well, while rare or surprising transitions that could drive significant learning progress are sampled no more frequently than any other.
Schaul et al. introduced Prioritized Experience Replay (PER) in arXiv preprint 1511.05952 (2015), and the work was published at ICLR 2016. The core idea is that transitions should be replayed in proportion to how much the agent can learn from them. The authors used the magnitude of the TD error as a proxy for learning potential: transitions where the agent's prediction was far from the actual outcome are presumably more informative.
PER defines the sampling probability for transition i as:
P(i) = p_i^alpha / sum_k(p_k^alpha)
where alpha controls the degree of prioritization (alpha = 0 yields uniform sampling) and p_i is the priority of transition i. The paper presents two variants for computing p_i:
| Variant | Priority formula | Description |
|---|---|---|
| Proportional | p_i = abs(delta_i) + epsilon | Priority is proportional to the absolute TD error plus a small constant epsilon that prevents zero-priority transitions from never being replayed |
| Rank-based | p_i = 1 / rank(i) | Priority is inversely proportional to the rank of the transition when sorted by TD error magnitude |
The rank-based variant is more robust to outliers because it depends only on the ordering of TD errors rather than their raw magnitudes. Its heavy-tail distribution also ensures diversity in the sampled mini-batches.
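As a rough illustration of the two priority variants and the sampling formula P(i) = p_i^alpha / sum_k(p_k^alpha), the sketch below computes sampling probabilities for a small set of TD errors. The function names and the example TD errors are assumptions chosen for the demonstration; alpha = 0.6 is used only as an example value.

```python
def proportional_priorities(td_errors, epsilon=1e-6):
    """p_i = |delta_i| + epsilon."""
    return [abs(d) + epsilon for d in td_errors]


def rank_based_priorities(td_errors):
    """p_i = 1 / rank(i), where rank 1 is the largest |delta_i|."""
    order = sorted(range(len(td_errors)),
                   key=lambda i: abs(td_errors[i]), reverse=True)
    priorities = [0.0] * len(td_errors)
    for rank, i in enumerate(order, start=1):
        priorities[i] = 1.0 / rank
    return priorities


def sampling_probabilities(priorities, alpha):
    """P(i) = p_i^alpha / sum_k p_k^alpha; alpha = 0 gives uniform sampling."""
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    return [s / total for s in scaled]


td_errors = [0.1, -2.0, 0.5, 10.0]  # the last transition is an outlier
probs_prop = sampling_probabilities(proportional_priorities(td_errors), alpha=0.6)
probs_rank = sampling_probabilities(rank_based_priorities(td_errors), alpha=0.6)
probs_uniform = sampling_probabilities(proportional_priorities(td_errors), alpha=0.0)
```

With these inputs the outlier receives a noticeably larger share of the probability mass under the proportional variant than under the rank-based one, which is the robustness property described above.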
Prioritized sampling introduces bias because transitions with high TD errors are overrepresented relative to the true data distribution. To correct this, PER applies importance sampling weights:
w_i = (1/N * 1/P(i))^beta
where N is the buffer size and beta controls the degree of correction. The weights are normalized by dividing by the maximum weight in the mini-batch. The parameter beta is annealed from a low initial value to 1 over the course of training, reflecting the fact that unbiased updates become most important near convergence when the policy is close to optimal.
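The weight formula and the annealing of beta can be sketched as follows. The helper names are hypothetical, and the linear schedule and the starting value of 0.4 are assumptions made for illustration rather than required settings.

```python
def importance_weights(sample_probs, buffer_size, beta):
    """w_i = (1/N * 1/P(i))^beta, normalized by the largest weight in the batch."""
    raw = [(1.0 / (buffer_size * p)) ** beta for p in sample_probs]
    max_w = max(raw)
    return [w / max_w for w in raw]


def linear_beta(step, total_steps, beta_start=0.4):
    """Anneal beta linearly from beta_start to 1 (assumed schedule)."""
    frac = min(1.0, step / total_steps)
    return beta_start + frac * (1.0 - beta_start)


# Three sampled transitions from a buffer of N = 3; the first was over-sampled.
probs = [0.5, 0.25, 0.25]
full_correction = importance_weights(probs, buffer_size=3, beta=1.0)
no_correction = importance_weights(probs, buffer_size=3, beta=0.0)
```

Each weight multiplies the corresponding transition's loss term, so frequently sampled transitions contribute smaller updates, and normalizing by the maximum means the weights only ever scale updates downward.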
Experiments showed that Double DQN combined with prioritized experience replay significantly outperformed the previous state-of-the-art results on the Arcade Learning Environment benchmark. PER outperformed DQN with uniform replay on 41 of 49 Atari games tested by Schaul et al.
A naive implementation of proportional PER would require O(N) work for each sample, since the cumulative distribution must be inverted. To make sampling efficient, PER is typically implemented with a binary tree called a sum tree (a kind of segment tree), which stores each transition's priority in a leaf node and the sum of its children's priorities at every internal node. The root node holds the total priority. Both adding a new transition and updating an existing priority cost O(log N), and drawing a single sample is also O(log N), giving the buffer logarithmic asymptotic cost in the buffer size.
A mini-batch of size B is drawn by partitioning the interval [0, p_total] into B equal-width segments, sampling a uniform random number from each segment, and traversing the tree to find the leaf corresponding to that priority mass. This stratified sampling reduces variance compared to drawing all B samples from the full interval.
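A compact sum tree with stratified sampling might look like the sketch below. It is a minimal illustration under stated assumptions: the class and method names are invented for the example, the capacity is required to be a power of two so the tree fits a simple flat-array layout, and only priorities are stored (the transitions themselves would live in a parallel array indexed by the returned leaf index).

```python
import random


class SumTree:
    """Binary sum tree in a flat list: node i has children 2i and 2i + 1."""

    def __init__(self, capacity):
        assert capacity > 0 and (capacity & (capacity - 1)) == 0, \
            "this sketch assumes a power-of-two capacity"
        self.capacity = capacity
        self.tree = [0.0] * (2 * capacity)  # leaves live at [capacity, 2 * capacity)

    def update(self, index, priority):
        """Set a leaf priority and refresh the sums on the path to the root: O(log N)."""
        node = index + self.capacity
        self.tree[node] = priority
        node //= 2
        while node >= 1:
            self.tree[node] = self.tree[2 * node] + self.tree[2 * node + 1]
            node //= 2

    def total(self):
        return self.tree[1]  # the root holds the total priority

    def find(self, mass):
        """Return the leaf index whose cumulative priority range contains mass."""
        node = 1
        while node < self.capacity:
            left = 2 * node
            # Never descend into an empty right subtree (guards rounding at the edge).
            if mass < self.tree[left] or self.tree[left + 1] == 0.0:
                node = left
            else:
                mass -= self.tree[left]
                node = left + 1
        return node - self.capacity

    def sample_batch(self, batch_size, rng=random):
        """Stratified sampling: one draw from each of batch_size equal segments."""
        segment = self.total() / batch_size
        return [self.find(rng.uniform(k * segment, (k + 1) * segment))
                for k in range(batch_size)]


tree = SumTree(4)
for i, p in enumerate([1.0, 2.0, 3.0, 4.0]):
    tree.update(i, p)
batch = tree.sample_batch(4, rng=random.Random(0))
```

With priorities 1, 2, 3, 4 the cumulative ranges are [0, 1), [1, 3), [3, 6), and [6, 10), so the last of the four segments, [7.5, 10], always maps to the final leaf.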
The size of the replay buffer is an important hyperparameter that involves several tradeoffs.
| Factor | Small buffer | Large buffer |
|---|---|---|
| Memory usage | Low | High |
| Data diversity | Limited; recent transitions only | Broad; spans many episodes and policies |
| Off-policyness | Low; data is close to current policy | High; oldest transitions may come from very different policies |
| Correlation breaking | Less effective | More effective |
| Staleness risk | Low | High; old transitions may mislead the agent |
Mnih et al. (2015) set the DQN replay buffer to hold one million transitions, and this value became a widely adopted default in subsequent work. Lillicrap et al. (2015) used the same buffer size for DDPG in continuous control tasks. However, research by Fedus et al. (2020) in "Revisiting Fundamentals of Experience Replay" showed that performance consistently improves with increased replay capacity for certain algorithms, while other algorithms are unaffected, suggesting that the one-million default is a practical starting point rather than a universally optimal choice.
The replay ratio, defined as the number of gradient updates per environment step, interacts closely with buffer size. Increasing the buffer while holding the replay ratio constant means the oldest transitions in the buffer become more stale. Conversely, reducing the buffer while maintaining the replay ratio keeps the data more on-policy but limits diversity. Finding the right balance depends on the specific algorithm, environment complexity, and computational budget.
Fedus, Ramachandran, Agarwal, Bengio, Larochelle, Rowland, and Dabney published "Revisiting Fundamentals of Experience Replay" at ICML 2020. Their study isolated the effects of replay capacity, the age of the oldest data, and the replay ratio across DQN and Rainbow DQN, and its findings have shaped subsequent practice.
These findings suggested that buffer size cannot be tuned independently of other algorithmic choices and that n-step bootstrapping plays a quiet but central role in modern off-policy methods.
A closely related quantity is the update-to-data ratio (UTD), sometimes called the replay ratio or gradient steps per environment step. SAC and DDPG often default to a UTD of 1, meaning one gradient update per environment step. Methods such as REDQ (Chen et al. 2021), DroQ, and the work "Sample-Efficient Reinforcement Learning by Breaking the Replay Ratio Barrier" by D'Oro et al. (ICLR 2023) increase UTD to 10, 20, or higher, often combined with periodic network resets to prevent the value function from collapsing into pathological regions. High-UTD methods can match the sample efficiency of model-based RL on continuous control benchmarks but spend significantly more compute per environment interaction.
Andrychowicz et al. (2017) introduced Hindsight Experience Replay (HER) at NeurIPS 2017 to tackle a persistent challenge in goal-conditioned reinforcement learning: sparse rewards. In many real-world tasks, the agent receives a reward signal only upon successfully completing a goal (for example, placing an object at a target location) and receives zero reward otherwise. Under these conditions, standard experience replay struggles because almost every transition in the buffer carries no useful reward signal.
HER addresses this by replaying each episode with substitute goals. After the agent completes an episode (even if it failed to reach the intended goal), HER stores the episode's transitions in the replay buffer more than once: once with the original goal, and again with the goal replaced by a state the agent actually reached during the episode. This relabeling trick transforms failed experiences into successful ones under the substitute goal, providing a dense learning signal even in sparse-reward environments.
Because HER modifies only the goals and not the environment dynamics, it can be combined with any off-policy reinforcement learning algorithm such as DQN, DDPG, or SAC. The researchers demonstrated that policies trained with HER in physics simulations could be successfully transferred to physical robots performing pushing, sliding, and pick-and-place tasks using only binary success/failure rewards. HER can be viewed as a form of implicit curriculum learning, where the agent gradually learns to reach increasingly distant goals by first mastering nearby ones.
The original HER paper compared four strategies for selecting substitute goals during relabeling:
| Strategy | Description | Typical use |
|---|---|---|
| final | Use the state reached at the end of the episode as the substitute goal | Simple baseline; often outperformed by future |
| future | Sample k future states from the same trajectory as substitute goals | Recommended default; k=4 or k=8 worked best |
| episode | Sample k random states from the same episode as substitute goals | Useful when within-episode states are diverse |
| random | Sample k random states from the entire replay buffer as substitute goals | Most off-policy; can hurt learning |
The future strategy with k=4 or k=8 was the best-performing variant in pushing, sliding, and pick-and-place tasks. The exact value of k controls the augmentation factor: each real transition produces k synthetic transitions with relabeled goals, so the effective amount of data the agent sees grows by a factor of k+1.
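The future strategy can be sketched on a toy one-dimensional task. This is an illustrative sketch only: the dictionary layout, the helper names, and the binary reward (1.0 when the achieved goal equals the desired goal, 0.0 otherwise) are assumptions for the example, and `achieved_goal` is taken to be the goal state reached after the action.

```python
import random


def sparse_reward(achieved_goal, desired_goal):
    # Assumed binary reward: 1.0 on success, 0.0 otherwise.
    return 1.0 if achieved_goal == desired_goal else 0.0


def relabel_future(episode, k, rng=random):
    """Return the original transitions plus k relabeled copies of each one."""
    relabeled = []
    for t, step in enumerate(episode):
        # Keep the transition with its original goal.
        reward = sparse_reward(step["achieved_goal"], step["desired_goal"])
        relabeled.append({**step, "reward": reward})
        for _ in range(k):
            # Pick a goal that was achieved at this step or later in the episode.
            future = rng.randrange(t, len(episode))
            goal = episode[future]["achieved_goal"]
            reward = sparse_reward(step["achieved_goal"], goal)
            relabeled.append({**step, "desired_goal": goal, "reward": reward})
    return relabeled


# A failed episode: the agent walks from 0 to 3 but the goal was 10.
episode = [
    {"obs": 0, "action": 1, "next_obs": 1, "achieved_goal": 1, "desired_goal": 10},
    {"obs": 1, "action": 1, "next_obs": 2, "achieved_goal": 2, "desired_goal": 10},
    {"obs": 2, "action": 1, "next_obs": 3, "achieved_goal": 3, "desired_goal": 10},
]
augmented = relabel_future(episode, k=2, rng=random.Random(0))
```

Three real transitions with k = 2 yield nine stored transitions, matching the k+1 augmentation factor, and the relabeled copies of the final step always carry a success reward even though the original episode failed.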
Most modern RL frameworks ship a HER implementation. In Stable Baselines3, HER is no longer a separate algorithm but a buffer class called HerReplayBuffer that extends DictReplayBuffer and is passed to off-policy algorithms such as SAC, TD3, or DQN through the replay_buffer_class argument. The user supplies n_sampled_goal (the k value above) and goal_selection_strategy (one of "future", "final", or "episode"). The environment must follow the GoalEnv interface with a dict observation space containing observation, achieved_goal, and desired_goal keys, and must expose a vectorized compute_reward method so that the buffer can recompute rewards for relabeled goals without re-running the simulator.
The Deep Q-Network algorithm, introduced by Mnih et al. (2013, 2015), was the first to demonstrate that experience replay could enable stable training of deep neural networks for reinforcement learning at scale. DQN stores transitions in a buffer of one million entries and samples uniform random mini-batches of 32 transitions for each training step. Together with a separate target network (updated every 10,000 steps in the Nature paper), experience replay was one of DQN's two key innovations for stabilizing training. The 2015 Nature implementation also reduced computation by performing a gradient update only once every 4 agent steps (action selections) rather than after every step.
Deep Deterministic Policy Gradient (DDPG), introduced by Lillicrap et al. (2015), extended the DQN approach to continuous action spaces by combining a Q-function critic with a deterministic policy actor, both trained using transitions sampled from a replay buffer. TD3 (Fujimoto et al., 2018) improved on DDPG by using twin critics to reduce overestimation bias and delayed policy updates to stabilize learning, while continuing to rely on experience replay for training data. Soft Actor-Critic (SAC) (Haarnoja et al., 2018) added entropy regularization to encourage exploration, and like DDPG and TD3, uses a replay buffer as a core component of its training pipeline.
Rainbow (Hessel et al., AAAI 2018) combined six independent improvements to DQN into a single agent: double Q-learning, prioritized experience replay, dueling networks, multi-step (n-step) returns, distributional RL (C51), and noisy networks for exploration. Prioritized experience replay was a key contributor to Rainbow's gains. In the distributional setting, Rainbow uses the Kullback-Leibler divergence between the predicted return distribution and the target distribution as the priority signal, replacing the scalar TD error used in the original PER paper. Rainbow matched DQN's final performance after 7 million frames and surpassed every individual baseline by 44 million frames on the 57-game Atari benchmark.
Horgan et al. (2018) scaled prioritized experience replay to distributed settings with the Ape-X architecture. In Ape-X, hundreds or thousands of parallel actors each interact with their own copy of the environment and add transitions with initial priorities to a shared centralized replay buffer. A single learner process samples prioritized mini-batches from this shared buffer and updates the network weights. Actors periodically synchronize their network parameters with the learner. This architecture reduced wall-clock training time by factors of two to four compared to single-actor baselines while also improving final performance. The same idea was extended to recurrent networks in R2D2 (Kapturowski et al., ICLR 2019), which stored sequences of transitions rather than individual transitions and addressed the stale-recurrent-state problem that arises when LSTM hidden states stored in the buffer were produced by out-of-date network parameters.
The table below summarizes how prominent off-policy and offline algorithms use experience replay.
| Algorithm | Action space | Replay type | Default buffer size | Notes |
|---|---|---|---|---|
| DQN (2015) | Discrete | Uniform | 1,000,000 | Original deep RL replay; trained on Atari |
| Double DQN | Discrete | Uniform | 1,000,000 | Decoupled action selection and evaluation |
| Prioritized DQN | Discrete | Proportional or rank PER | 1,000,000 | Sampling weighted by abs TD error |
| Rainbow | Discrete | Proportional PER | 1,000,000 | Combines six DQN extensions |
| DDPG | Continuous | Uniform | 1,000,000 | Off-policy actor-critic for continuous control |
| TD3 | Continuous | Uniform | 1,000,000 | Twin critics, delayed policy updates |
| SAC | Continuous | Uniform | 1,000,000 | Maximum-entropy actor-critic |
| HER + DDPG/SAC | Continuous, goal-conditioned | Uniform with goal relabeling | Variable | Sparse-reward goal tasks |
| Ape-X | Discrete | Distributed PER | up to 2,000,000 or more | Many actors share one prioritized buffer |
| R2D2 | Discrete (recurrent) | Distributed PER over sequences | Variable | Stores fixed-length sequences |
| BCQ | Continuous, offline | Fixed dataset | Static | Offline RL with behavior-constrained policy |
| CQL | Continuous, offline | Fixed dataset | Static | Conservative Q-learning regularizer |
Offline reinforcement learning (also called batch RL) takes the idea of experience replay to its extreme: the agent never collects new data and trains entirely from a fixed dataset of transitions. The dataset itself is treated as a static replay buffer that is never overwritten, and the agent must extract the best possible policy from whatever data is available.
The main difficulty in offline RL is distributional shift: the policy being learned may visit state-action pairs that the dataset's behavior policy never visited, which causes the value function to extrapolate wildly into out-of-distribution regions. Algorithms such as Batch-Constrained Q-learning (BCQ; Fujimoto et al. 2019), Conservative Q-Learning (CQL; Kumar et al. 2020), Implicit Q-Learning (IQL; Kostrikov et al. 2022), and Behavior-Regularized Actor-Critic (BRAC) address this by either constraining the policy to stay close to the data distribution or by penalizing Q-values for unseen actions. Offline RL has become a major research area because it enables learning from logged data such as medical records, robot demonstrations, and historical user interactions without further online interaction.
A hybrid setting called offline-to-online RL pretrains on a fixed dataset and then fine-tunes online, often using two replay buffers: one containing the offline data and one filling up with newly collected online transitions. Mixing samples from both buffers during fine-tuning helps the agent retain knowledge from offline pretraining while adapting to its own freshly generated data.
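One simple way to mix the two buffers is to fix the fraction of each mini-batch that comes from the offline dataset. The sketch below assumes a 50/50 split by default and falls back to offline-only batches while the online buffer is still empty; the function name and the `offline_fraction` parameter are hypothetical, and other mixing schemes are possible.

```python
import random


def sample_mixed(offline_data, online_buffer, batch_size,
                 offline_fraction=0.5, rng=random):
    """Draw a mini-batch with a fixed share of offline and online transitions."""
    n_offline = round(batch_size * offline_fraction)
    if not online_buffer:
        n_offline = batch_size  # nothing collected online yet
    n_online = batch_size - n_offline
    # Sample with replacement from each source independently.
    batch = [rng.choice(offline_data) for _ in range(n_offline)]
    batch += [rng.choice(online_buffer) for _ in range(n_online)]
    rng.shuffle(batch)
    return batch


offline = [("offline", i) for i in range(100)]  # placeholder transitions
online = [("online", i) for i in range(10)]
mixed = sample_mixed(offline, online, batch_size=8, rng=random.Random(0))
cold_start = sample_mixed(offline, [], batch_size=8, rng=random.Random(0))
```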
A range of variants have been proposed to improve on uniform random sampling.
| Variant | Year | Sampling rule | Key idea |
|---|---|---|---|
| Uniform replay | 1992 (Lin); 2013 (DQN) | Uniform random | Original formulation |
| Prioritized Experience Replay | 2015 | Proportional or rank-based on abs TD error | Replay informative transitions more often |
| Hindsight Experience Replay | 2017 | Uniform with goal relabeling | Synthetic successes for sparse-reward tasks |
| Combined Experience Replay | 2017 (Zhang and Sutton) | Uniform plus latest transition | Always include the most recent transition in each batch |
| Distributed PER (Ape-X) | 2018 | Centralized prioritized | Scale across many parallel actors |
| Reverse experience replay | Various | Sample in reverse temporal order within an episode | Inspired by hippocampal reverse replay |
| Episodic memory replay | Various | Sample whole episodes or sequences | Used with recurrent agents and meta-RL |
| Experience Replay Optimization (ERO) | 2019 (Zha et al.) | Learned sampling network | Replace heuristic priority with a meta-learned policy |
| Map-based experience replay | 2023 | Sample from clustered states | Reduce catastrophic forgetting in continual RL |
Reverse replay is motivated by neuroscience findings that hippocampal place cells often replay sequences in reverse order at decision points. In RL, reverse-order replay can speed credit assignment because the value of the terminal transition is updated first and propagated backward through the sequence in subsequent updates.
Episodic memory replay stores entire trajectories rather than individual transitions and is particularly useful for partially observable problems where temporal context matters. R2D2 and IMPALA-style architectures use this approach.
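The credit-assignment benefit of reverse-order replay can be seen on a small tabular example. The sketch below is an illustration under simplifying assumptions (a deterministic four-state chain, a learning rate of 1, no discounting, and a hypothetical `sweep` helper); it applies one pass of one-step TD updates to the same episode in forward and in reverse order.

```python
def sweep(values, episode, order, gamma=1.0, lr=1.0):
    """Apply one pass of tabular TD(0) updates over the episode in the given order."""
    for t in order:
        state, reward, next_state, done = episode[t]
        target = reward + (0.0 if done else gamma * values[next_state])
        values[state] += lr * (target - values[state])
    return values


# Chain 0 -> 1 -> 2 -> 3 -> terminal, with reward 1 only on the last transition.
episode = [
    (0, 0.0, 1, False),
    (1, 0.0, 2, False),
    (2, 0.0, 3, False),
    (3, 1.0, 4, True),
]
steps = range(len(episode))
forward = sweep([0.0] * 5, episode, steps)
backward = sweep([0.0] * 5, episode, reversed(steps))
```

After a single pass, forward replay has only updated the value of the last state, while reverse replay has propagated the terminal reward all the way back to the start of the chain.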
A simple uniform replay buffer can be implemented in a few lines of Python using a deque from the collections module:
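```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity uniform replay buffer backed by a deque."""

    def __init__(self, capacity):
        # A deque with maxlen drops its oldest entry once capacity is reached.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement.
        batch = random.sample(self.buffer, batch_size)
        # Regroup the list of transitions into one tuple per field.
        state, action, reward, next_state, done = zip(*batch)
        return state, action, reward, next_state, done

    def __len__(self):
        return len(self.buffer)
```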
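This implementation is concise and correct, but it is inefficient for large buffers. random.sample indexes into the deque, and indexing a deque away from its ends takes time proportional to the distance from the nearest end, so drawing a batch of size B from a buffer of N entries can cost O(B * N) rather than O(B). The regrouping of transitions into per-field tuples in zip(*batch), followed by conversion to tensors, adds further per-sample overhead and becomes a bottleneck.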
Production implementations typically allocate fixed-size NumPy arrays or PyTorch tensors at construction time and write into them by index, avoiding the overhead of Python-level deques and per-sample tensor creation:
```python
import numpy as np
import torch


class TensorReplayBuffer:
    """Ring buffer over preallocated NumPy arrays; samples are returned as tensors."""

    def __init__(self, capacity, state_dim, action_dim, device="cuda"):
        self.capacity = capacity
        self.device = device
        # Allocate all storage once so that push() never allocates.
        self.states = np.empty((capacity, state_dim), dtype=np.float32)
        self.actions = np.empty((capacity, action_dim), dtype=np.float32)
        self.rewards = np.empty((capacity,), dtype=np.float32)
        self.next_states = np.empty((capacity, state_dim), dtype=np.float32)
        self.dones = np.empty((capacity,), dtype=np.float32)
        self.idx = 0   # next write position
        self.size = 0  # number of valid entries

    def push(self, s, a, r, s2, d):
        i = self.idx
        self.states[i] = s
        self.actions[i] = a
        self.rewards[i] = r
        self.next_states[i] = s2
        self.dones[i] = d
        # Wrap around so the oldest entry is overwritten once the buffer is full.
        self.idx = (self.idx + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)

    def sample(self, batch_size):
        # Uniform sampling with replacement over the filled part of the buffer.
        idx = np.random.randint(0, self.size, size=batch_size)
        return (
            torch.from_numpy(self.states[idx]).to(self.device),
            torch.from_numpy(self.actions[idx]).to(self.device),
            torch.from_numpy(self.rewards[idx]).to(self.device),
            torch.from_numpy(self.next_states[idx]).to(self.device),
            torch.from_numpy(self.dones[idx]).to(self.device),
        )
```
For image-based observations such as Atari frames, a common optimization is to store frames as 8-bit unsigned integers and to store each frame only once instead of duplicating it across the overlapping frame stacks of consecutive states, reducing memory usage by a factor of four or more. The Dopamine library and the original DQN code both apply this trick: a one-million transition buffer of 84 by 84 grayscale frames takes about 7 GB instead of 28 GB.
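The frame-sharing idea can be sketched as a store that keeps one copy of each frame and rebuilds a stack on demand. This is a simplified illustration, not the layout used by any particular library: the `FrameStore` class is hypothetical, it grows without bound instead of wrapping like a ring buffer, and it pads the start of an episode by repeating the first frame, which is only one of several padding conventions.

```python
class FrameStore:
    """Stores each frame once and reconstructs stacked states at sample time."""

    def __init__(self, stack=4):
        self.stack = stack
        self.frames = []         # one entry per environment frame
        self.episode_start = []  # True where a frame begins a new episode

    def add(self, frame, first_in_episode):
        self.frames.append(frame)
        self.episode_start.append(first_in_episode)

    def stacked(self, t):
        """Return the stack of frames ending at index t, oldest first."""
        indices = [t]
        while len(indices) < self.stack:
            prev = indices[-1]
            # Do not walk back across an episode boundary; repeat the first frame.
            if self.episode_start[prev] or prev == 0:
                indices.append(prev)
            else:
                indices.append(prev - 1)
        return [self.frames[i] for i in reversed(indices)]


store = FrameStore(stack=4)
for name, first in [("a0", True), ("a1", False), ("a2", False),
                    ("b0", True), ("b1", False)]:
    store.add(name, first)  # strings stand in for uint8 image arrays
mid_episode = store.stacked(2)
after_reset = store.stacked(4)
```

The next state for the transition at index t is then `stacked(t + 1)`, so neither the state nor the next state needs its own copy of the pixels.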
Several mature libraries provide replay buffer implementations:
| Library | Replay class | Notes |
|---|---|---|
| Stable Baselines3 | ReplayBuffer, DictReplayBuffer, HerReplayBuffer | Used by SB3's SAC, TD3, DQN |
| TorchRL | ReplayBuffer with Storage backends (List, Tensor, LazyTensor, LazyMemmap) | Composable, supports custom samplers |
| RLlib | Various replay actors via RLlib's execution plans | Distributed replay built on Ray |
| Dopamine | OutOfGraphReplayBuffer | Memory-efficient for Atari; supports n-step |
| TF-Agents | TFUniformReplayBuffer | TensorFlow-native |
| Acme | Reverb | Distributed replay built on the Reverb backend |
DeepMind's Reverb, used by Acme and several other DeepMind libraries, deserves special mention. It is a high-performance, distributed replay system written in C++ with Python bindings, supporting prioritized sampling, sequential sampling, and arbitrary user-defined samplers, all over a network. Reverb scales to billions of transitions and is the engine behind many of DeepMind's distributed RL results.
A replay buffer's memory footprint is dominated by the state and next-state fields. For an Atari-style buffer of one million transitions with 84 by 84 grayscale uint8 frames stacked four deep, the naive cost of storing both state and next-state would be:
2 * 1,000,000 * 4 * 84 * 84 = 56.4 billion bytes (about 56 GB)
This is why production implementations store only the unique frames and reconstruct state stacks at sample time. With deduplication, a one-million transition buffer fits in roughly 7 GB. For continuous control with low-dimensional state vectors (for example, a 17-dimensional MuJoCo state), the same buffer might take only a few hundred megabytes and fits easily in CPU RAM.
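The arithmetic for the image-based case can be checked with a few lines of Python. The helper name `replay_bytes` and its parameters are illustrative only; the function counts observation storage and ignores actions, rewards, and done flags, which are negligible by comparison.

```python
def replay_bytes(n_transitions, frame_shape=(84, 84), stack=4,
                 bytes_per_pixel=1, dedup=False):
    """Approximate bytes needed for the observation fields of a replay buffer."""
    frame_bytes = bytes_per_pixel * frame_shape[0] * frame_shape[1]
    if dedup:
        # One stored frame per transition; stacks are rebuilt at sample time.
        return n_transitions * frame_bytes
    # Naive layout: a full stack for both the state and the next state.
    return 2 * n_transitions * stack * frame_bytes


naive = replay_bytes(1_000_000)
deduplicated = replay_bytes(1_000_000, dedup=True)
print(f"naive: {naive / 1e9:.1f} GB, deduplicated: {deduplicated / 1e9:.1f} GB")
# prints "naive: 56.4 GB, deduplicated: 7.1 GB"
```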
Replay buffers are usually kept in CPU RAM because most agents have buffer sizes much larger than typical GPU memory. Mini-batches are transferred to the GPU at sample time, often with pinned memory and asynchronous data loaders to overlap data movement with computation. A few high-throughput systems (TorchRL's LazyTensorStorage on CUDA, JAX-based implementations on TPU) keep the entire buffer in accelerator memory when it fits, which removes host-to-device copy latency at the cost of using accelerator RAM that could otherwise hold a larger model or batch.
For heavy training loops, sample throughput becomes a bottleneck. Common optimizations include allocating buffer arrays in pinned (page-locked) host memory so that PyTorch can issue asynchronous host-to-device transfers, prefetching the next mini-batch in a background thread, and replacing per-element Python iteration with vectorized NumPy or Torch operations. Distributed setups go further by sharding the buffer across machines and using RPC or Reverb-style sampling.
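Background prefetching can be sketched with the standard library alone. The example below is an illustration under stated assumptions rather than a production design: `start_prefetcher` and `sample_states` are hypothetical names, the buffer is a static NumPy array, and nothing synchronizes the sampler with concurrent writes, which a real buffer would need. Pinned memory and host-to-device transfer are not shown.

```python
import queue
import threading

import numpy as np


def start_prefetcher(sample_fn, batch_size, depth=2):
    """Run sample_fn in a daemon thread, keeping up to `depth` batches queued."""
    batches = queue.Queue(maxsize=depth)
    stop = threading.Event()

    def worker():
        while not stop.is_set():
            batch = sample_fn(batch_size)
            # Retry with a timeout so the thread notices `stop` when the queue is full.
            while not stop.is_set():
                try:
                    batches.put(batch, timeout=0.1)
                    break
                except queue.Full:
                    continue

    threading.Thread(target=worker, daemon=True).start()
    return batches, stop


# Stand-in for a buffer: 1,000 stored 17-dimensional states.
rng = np.random.default_rng(0)
states = rng.standard_normal((1000, 17)).astype(np.float32)


def sample_states(batch_size):
    idx = rng.integers(0, len(states), size=batch_size)  # vectorized index draw
    return states[idx]                                    # one fancy-index gather


batches, stop = start_prefetcher(sample_states, batch_size=32)
batch = batches.get()  # blocks only if the worker has fallen behind
stop.set()             # ask the worker to exit
```

The bounded queue limits how far the worker can run ahead of the learner, which also bounds how stale a prefetched batch can be relative to the newest data.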
A few subtle implementation issues recur in practice. Storing rewards as float32 is usually fine, but storing returns (cumulative rewards) sometimes overflows float16. Boolean done flags should be stored as float so they can be multiplied directly into the bootstrap term (1 - done). When computing TD targets for a mini-batch, the bootstrap value must be evaluated at the next state s' (for example, the maximum over actions of the target Q-value at s'), not at the current state s. Confusing the two fields is an off-by-one error that has appeared in many beginner implementations.
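As a concrete illustration of the done mask and the next-state bootstrap, here is a minimal NumPy sketch of one-step Q-learning targets. The function name `td_targets` and the toy numbers are made up for the example, and `next_q_values` is assumed to hold the target network's outputs for the next-state field of the batch.

```python
import numpy as np


def td_targets(rewards, dones, next_q_values, gamma=0.99):
    """Compute r + gamma * (1 - done) * max_a' Q_target(s', a') per transition."""
    bootstrap = next_q_values.max(axis=1)  # greedy value at the NEXT state
    return rewards + gamma * (1.0 - dones) * bootstrap


rewards = np.array([1.0, 0.0, -1.0], dtype=np.float32)
dones = np.array([0.0, 1.0, 0.0], dtype=np.float32)   # second transition is terminal
next_q = np.array([[0.5, 2.0],
                   [3.0, 4.0],
                   [1.0, -1.0]], dtype=np.float32)     # Q_target(s', .) per row

targets = td_targets(rewards, dones, next_q, gamma=0.9)
# Non-terminal rows bootstrap from next_q; the terminal row keeps only its reward.
```

Because the done flag is stored as a float, the terminal transition drops its bootstrap term through plain multiplication, with no branching inside the batch.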
Replay-based methods are most commonly evaluated on three families of benchmarks. The Arcade Learning Environment (Atari 2600) introduced by Bellemare et al. (2013) was the original deep RL benchmark and remains a standard testbed for replay-based discrete-action methods. The DeepMind Control Suite and the OpenAI Gym MuJoCo tasks are the standard continuous control benchmarks for DDPG, TD3, SAC, and their replay variants. The OpenAI Gym Robotics tasks (Fetch and HandManipulate) are the standard sparse-reward goal-conditioned benchmarks for HER. More recent benchmarks include the DeepMind Control Suite from pixels, ProcGen and Crafter for generalization, and the D4RL and Robomimic datasets for offline RL.
Several mistakes recur in practice when implementing or tuning experience replay.
| Pitfall | Symptom | Fix |
|---|---|---|
| Sampling before warmup | Initial Q-values are extremely noisy | Wait until the buffer holds a minimum number of transitions before training |
| Using on-policy data only | Q-values overfit to the current policy | Use a sufficiently large buffer or add an off-policy correction |
| Ignoring done flags | Bootstrapping past terminal states corrupts targets | Multiply the next-state value by (1 - done) |
| Forgetting importance sampling weights in PER | Biased gradient estimates | Apply (N * P(i))^(-beta) weights and normalize |
| Storing full state stacks | Out-of-memory on Atari-scale problems | Deduplicate frames or use uint8 storage |
| Mixing rewards across episodes | Bootstrapping crosses episode boundaries | Reset done flags correctly, use per-episode storage if using sequence sampling |
| Stale priorities | New transitions never get replayed | Initialize new transitions with the maximum priority in the buffer |
| Replay ratio too high | Q-function diverges or value collapses to a constant | Lower the ratio or apply periodic network resets |
Most mature libraries handle these pitfalls automatically, but custom implementations frequently fall into one or more of them.
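Two of the prioritized-replay fixes in the table, importance-sampling weights and maximum-priority initialization, can be sketched as follows. The function name `per_sample` is hypothetical and the values of alpha and beta are only illustrative defaults. The sketch samples from a flat probability array rather than a sum tree and normalizes the weights by the largest weight in the batch, which is one common choice.

```python
import numpy as np


def per_sample(priorities, batch_size, alpha=0.6, beta=0.4, rng=None):
    """Draw indices proportionally to priority**alpha and return IS weights."""
    if rng is None:
        rng = np.random.default_rng()
    scaled = priorities ** alpha
    probs = scaled / scaled.sum()                     # P(i)
    idx = rng.choice(len(priorities), size=batch_size, p=probs)
    weights = (len(priorities) * probs[idx]) ** (-beta)  # (N * P(i))^(-beta)
    weights /= weights.max()                          # keep weights in (0, 1]
    return idx, weights


priorities = np.array([1.0, 2.0, 4.0, 8.0])
idx, weights = per_sample(priorities, batch_size=64, rng=np.random.default_rng(0))

# A newly stored transition starts at the current maximum priority,
# so it is likely to be replayed at least once before its priority is updated.
new_priority = priorities.max()
```

In a training loop the returned weights would multiply the per-sample loss before averaging, which offsets the bias introduced by non-uniform sampling.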
Imagine you are learning to ride a bike. Every time you try, you remember what happened: how you leaned, how you pedaled, and whether you fell or stayed upright. Now imagine you have a big scrapbook where you paste a picture and a note about every single attempt.
Without the scrapbook, you would only remember your very last try. Maybe that last try was a lucky one where you did not fall, so you would not learn much about what to avoid. Or maybe you fell in a weird way that does not happen often, and you would overreact to that one bad memory.
With the scrapbook, you can flip to any random page before your next attempt and review a handful of old memories. Some are from yesterday, some from last week. By studying a mix of different memories instead of just the latest one, you learn faster, you do not forget the lessons from earlier attempts, and you do not overreact to any single ride. That scrapbook is what computer scientists call a replay buffer, and the process of flipping back through it to learn is called experience replay.
A more clever version of the scrapbook puts colored sticky notes on the pages with the most surprising mistakes, so you spend extra time reviewing those. That is what prioritized experience replay does. And another clever version, called hindsight experience replay, lets you flip back to a failed attempt and pretend the place you ended up was actually the place you meant to go, so even your failures teach you something useful.