# Experience Replay

> Source: https://aiwiki.ai/wiki/experience_replay
> Updated: 2026-07-12
> Categories: Machine Learning, Reinforcement Learning
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

*See also: [Reinforcement Learning](/wiki/reinforcement_learning_rl), [Deep Q-Network (DQN)](/wiki/deep_q-network_dqn), [Q-Learning](/wiki/q-learning), [Replay Buffer](/wiki/replay_buffer)*

Experience replay is a [reinforcement learning](/wiki/reinforcement_learning_rl) technique in which an [agent](/wiki/agent) stores its past transitions in a memory called a [replay buffer](/wiki/replay_buffer) and samples random mini-batches of those stored transitions to train on, instead of learning only from the most recent experience. This random resampling breaks the temporal correlation between consecutive observations and lets each transition be reused many times, which improves both the stability and the sample efficiency of training. As the team that introduced Prioritized Experience Replay put it, "Experience replay lets online reinforcement learning agents remember and reuse experiences from the past." [5]

The technique was first proposed by Long-Ji Lin in 1992 [1] and became famous as one of the two key innovations of the [Deep Q-Network (DQN)](/wiki/deep_q-network_dqn) algorithm published by Mnih et al. at [DeepMind](/wiki/deepmind) in 2013 and 2015 [3][4], where a replay buffer of one million transitions helped a single network reach human-level play across dozens of Atari 2600 games [4]. Today experience replay is a standard building block of virtually all off-policy deep reinforcement learning algorithms, including [Deep Deterministic Policy Gradient (DDPG)](/wiki/ddpg), Twin Delayed DDPG ([TD3](/wiki/td3)), and [Soft Actor-Critic (SAC)](/wiki/soft_actor_critic). It also underpins offline reinforcement learning, distributed actor-learner systems, and goal-conditioned algorithms such as Hindsight Experience Replay.

A single transition is stored as a tuple (s, a, r, s', d), where s is the state observed before acting, a is the action chosen, r is the scalar reward received, s' is the next state, and d is a flag indicating whether s' is terminal. The buffer is typically a fixed-capacity ring that overwrites its oldest entries once full, and gradient updates draw uniform or prioritized mini-batches from this pool.

## Who invented experience replay?

The concept of experience replay was introduced by Long-Ji Lin in his 1992 paper "Self-Improving Reactive Agents Based on Reinforcement Learning, Planning and Teaching," published in the journal *Machine Learning*, volume 8, pages 293-321 [1]. In that work, Lin compared eight reinforcement learning frameworks built around two base algorithms: the Adaptive Heuristic Critic (AHC) and [Q-learning](/wiki/q-learning). He proposed three extensions to speed up learning: experience replay, learning action models for planning, and teaching [1]. Among these, experience replay proved to be the simplest and most broadly applicable.

Lin's central insight was that an agent could store its past experiences and revisit them later instead of discarding each transition after a single learning update. This idea drew loose inspiration from biological findings about hippocampal replay in mammals, where the brain replays neural activity patterns from prior waking experiences during sleep and rest, consolidating memories and improving future decision-making. Lin's thesis at Carnegie Mellon University, completed the following year, expanded these ideas and remains one of the most-cited early works on memory-based reinforcement learning.

The technique gained mainstream prominence in 2013 when Mnih et al. at [DeepMind](/wiki/deepmind) combined experience replay with deep [neural networks](/wiki/neural_network) to create the DQN algorithm. The arXiv preprint "Playing Atari with Deep Reinforcement Learning" (1312.5602) described "the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning," applied to seven Atari 2600 games with no adjustment of the architecture or learning algorithm between games [3]. It was followed by the 2015 *Nature* paper "Human-Level Control Through Deep Reinforcement Learning," which used a replay buffer of one million transitions to train an agent that achieved human-level performance on dozens of Atari 2600 games [4]. This result demonstrated that experience replay was essential for stabilizing the training of deep neural networks in reinforcement learning settings, and it spread the technique across the entire deep reinforcement learning literature in the years that followed.

### Connection to neuroscience

Wilson and McNaughton (1994) recorded ensembles of hippocampal place cells in rats during spatial tasks and during the slow-wave sleep that followed those tasks [2]. They found that pairs of cells which fired together during waking exploration showed an elevated tendency to fire together again during subsequent sleep, and the effect decayed gradually over the course of the rest period [2]. This phenomenon, often called hippocampal replay, was interpreted as a substrate for memory consolidation.

Follow-up work by Lee and Wilson (2002), Foster and Wilson (2006), and many others extended these findings, documenting both forward and reverse replay sequences and connecting them explicitly to temporal-difference learning. The convergence between the algorithmic notion of resampling stored transitions and the neural notion of reactivating cell ensembles has become an active interface between machine learning and computational neuroscience. Modern research treats experience replay as a useful normative model of why brains might consolidate memories during rest, and conversely treats hippocampal replay as a source of inspiration for new replay-based algorithms.

## How does experience replay work?

### The replay buffer

The replay buffer (also called experience memory or replay memory) is typically implemented as a circular buffer with a fixed maximum capacity. Each entry in the buffer represents a single transition, stored as a tuple:

| Element | Symbol | Description |
|---|---|---|
| State | s | The environment state observed by the [agent](/wiki/agent) before taking an action |
| Action | a | The action selected by the agent |
| Reward | r | The scalar reward signal received from the environment after taking the action |
| Next State | s' | The environment state observed after the action was executed |
| Done Flag | d | A boolean indicating whether s' is a terminal state |

As the agent interacts with the environment, new transitions are appended to the buffer. When the buffer reaches its maximum capacity, the oldest transitions are overwritten, ensuring the buffer always contains the most recent experiences up to its size limit. Some implementations also store auxiliary fields such as the discount factor at the time of the transition, the policy log-probability of the action, the time step within the episode, or a goal vector for goal-conditioned tasks.

### Sampling and training

During each training step, the agent draws a random mini-batch of transitions from the replay buffer (typically 32 to 256 transitions). These sampled transitions are then used to compute loss values and update the agent's parameters via [gradient descent](/wiki/gradient_descent). In the case of [Q-learning](/wiki/q-learning) variants, the sampled transitions are used to compute temporal-difference (TD) targets and minimize the TD error. The standard one-step TD target for transition (s, a, r, s', d) is:

$$
y = r + \gamma (1 - d) \max_{a'} Q_{\text{target}}(s', a')
$$

where $$\gamma$$ is the discount factor and $$Q_{\text{target}}$$ is a slowly-updated copy of the Q-network. The loss minimized over the mini-batch is then the mean squared error between $$Q(s, a)$$ and $$y$$.
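As a minimal sketch of the target and loss above, the following pure-Python functions compute $$y$$ and the mean squared error for a tiny batch. The function names and the list-of-lists layout for the target network's outputs are illustrative assumptions, not part of any library.

```python
def td_targets(rewards, dones, next_q_target, gamma=0.99):
    """One-step TD targets y = r + gamma * (1 - d) * max_a' Q_target(s', a').

    next_q_target[i] holds the target-network Q-values for every action
    in next state s'_i (a hypothetical layout chosen for this sketch).
    """
    return [
        r + gamma * (1.0 - d) * max(q_next)
        for r, d, q_next in zip(rewards, dones, next_q_target)
    ]


def mse_loss(q_taken, targets):
    """Mean squared error between Q(s, a) and the TD targets."""
    return sum((q - y) ** 2 for q, y in zip(q_taken, targets)) / len(targets)


# Two transitions; the second one ends the episode (d = 1).
rewards = [1.0, 2.0]
dones = [0.0, 1.0]
next_q_target = [[0.5, 2.0], [10.0, 3.0]]

targets = td_targets(rewards, dones, next_q_target, gamma=0.5)
loss = mse_loss([1.0, 2.0], targets)
print(targets, loss)
```

The done flag zeroes out the bootstrap term, so the second target equals the raw reward even though the target network assigns a large value to the terminal state.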

The random sampling is the key mechanism that provides experience replay's benefits. Because the mini-batch is drawn uniformly at random from a large pool of transitions collected over many episodes, consecutive samples in the training batch are unlikely to be correlated with each other, and the gradient estimate is closer to that of supervised learning on independent samples.

### The four basic operations

Most replay buffer implementations expose a small interface with four core operations:

| Operation | Description | Typical complexity |
|---|---|---|
| add | Append a transition (s, a, r, s', d) to the buffer | O(1) |
| sample | Draw a uniform random mini-batch of size B | O(B) |
| update_priorities | Update priority values after a learning step (PER only) | O(B log N) |
| clear | Reset the buffer to empty | O(1) or O(N) |

The agent typically calls add once per environment step and sample once per gradient update, with the ratio of the two operations controlled by a hyperparameter known as the replay ratio.
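The interleaving of add and sample can be sketched as a loop. This is an illustration only: the placeholder transitions, the `warmup` threshold, and the fractional `credit` counter are assumptions made for the example, and no learning takes place.

```python
import random
from collections import deque


def run_loop(env_steps, replay_ratio, batch_size, warmup, capacity, seed=0):
    """Interleave add and sample; return the number of gradient updates.

    replay_ratio is the number of updates per environment step; a value
    such as 0.25 means one update every fourth step.
    """
    rng = random.Random(seed)
    buffer = deque(maxlen=capacity)
    credit = 0.0      # fractional updates owed to the learner
    num_updates = 0
    for t in range(env_steps):
        # add: a placeholder transition (s, a, r, s', d); a real agent
        # would obtain these values from its environment.
        buffer.append((t, rng.randrange(2), 0.0, t + 1, False))
        if len(buffer) < max(warmup, batch_size):
            continue  # wait until enough transitions are stored
        credit += replay_ratio
        while credit >= 1.0:
            # sample: copying to a list keeps the sketch simple but is O(N).
            batch = rng.sample(list(buffer), batch_size)
            assert len(batch) == batch_size  # a real learner would train here
            num_updates += 1
            credit -= 1.0
    return num_updates


print(run_loop(env_steps=100, replay_ratio=0.25, batch_size=4, warmup=20, capacity=50))
```

In this sketch updates begin once 20 transitions are stored, so 81 of the 100 steps earn update credit.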

## Why does experience replay work?

Experience replay addresses several fundamental challenges that arise when training neural networks with reinforcement learning data.

### Breaking temporal correlations

When an agent learns online from a stream of consecutive experiences, successive transitions are highly correlated. For example, an agent navigating a maze will see a long sequence of spatially adjacent states. Training a neural network on such correlated data violates the independent and identically distributed (i.i.d.) assumption underlying stochastic gradient descent, which can cause the network to overfit to recent trajectories and produce unstable weight updates. By sampling randomly from a large buffer, experience replay decorrelates the training data and approximates the i.i.d. condition.

### Improved sample efficiency

Without replay, each transition is used for exactly one parameter update and then discarded. This is extremely wasteful, especially in environments where collecting data is expensive or slow. Experience replay allows each transition to be reused across multiple training updates, extracting more learning signal from each interaction with the environment. A single rare but informative transition can contribute to learning dozens or hundreds of times before it is eventually overwritten. Real-world robotic systems, where each environment step may take seconds and risk hardware wear, benefit disproportionately from this reuse.

### Stabilized learning

In deep reinforcement learning, the target values used for training depend on the agent's own parameters, creating a moving-target problem. Random sampling from a diverse buffer smooths out these fluctuations by ensuring that any single training batch reflects a broad distribution of experiences rather than the agent's current behavioral regime. This stabilization effect was one of the key reasons DQN succeeded where earlier attempts to combine neural networks with Q-learning had failed.

### Reduced catastrophic forgetting

Neural networks are prone to catastrophic forgetting, where learning new information erases previously acquired knowledge. By continually revisiting older experiences stored in the buffer, the agent retains knowledge about earlier parts of the state space even as its policy evolves and explores new regions. This effect is particularly important in long training runs where the agent's behavior changes substantially over time and the most recent on-policy data alone would not suffice to maintain accurate Q-values for older state regions.

## Why is experience replay an off-policy technique?

Experience replay is fundamentally an [off-policy](/wiki/q-learning) technique. The transitions in the buffer were collected by past versions of the agent's policy, while gradient updates are applied to the current policy. Off-policy algorithms such as Q-learning, DQN, DDPG, TD3, and SAC are mathematically able to learn from data generated by a different (behavior) policy than the one being optimized, which is what makes replay viable for them [19].

In contrast, on-policy algorithms such as REINFORCE, A2C, A3C, TRPO, and PPO require that the data used for each gradient step comes from the current policy. Reusing old data would introduce bias that policy gradient theorems do not account for, so these algorithms typically use a small rollout buffer that holds only the most recent batch of trajectories and is then discarded. Some on-policy methods use importance sampling to partially correct for off-policy data, but they generally do not maintain large persistent replay buffers in the way DQN and its descendants do.

This off-policy versus on-policy distinction has practical consequences. Algorithms with replay tend to be more sample-efficient because they can squeeze multiple gradient updates out of each interaction. On-policy algorithms tend to be more stable and have stronger convergence guarantees because their training data always reflects the current behavior, but they need more environment interactions to reach the same performance.

## Uniform vs prioritized sampling

The original experience replay formulation uses uniform random sampling: every transition in the buffer has an equal probability of being selected. While simple and effective, uniform sampling treats all transitions as equally valuable for learning, which is not always the case.

### Uniform sampling

Uniform sampling is straightforward to implement and introduces no bias into the learning process. Its main limitation is inefficiency: many sampled transitions may be easy examples that the agent already handles well, while rare or surprising transitions that could drive significant learning progress are sampled no more frequently than any other.

### What is Prioritized Experience Replay (PER)?

Schaul et al. introduced Prioritized Experience Replay (PER) in arXiv preprint 1511.05952 (2015), and the work was published at ICLR 2016 [5]. The paper proposed "a framework for prioritizing experience, so as to replay important transitions more frequently, and therefore learn more efficiently" [5]. The core idea is that transitions should be replayed in proportion to how much the agent can learn from them. The authors used the magnitude of the TD error as a proxy for learning potential: transitions where the agent's prediction was far from the actual outcome are presumably more informative.

PER defines the sampling probability for transition i as:

$$
P(i) = \frac{p_i^\alpha}{\sum_k p_k^\alpha}
$$

where $$\alpha$$ controls the degree of prioritization ($$\alpha = 0$$ yields uniform sampling) and $$p_i$$ is the priority of transition $$i$$. The paper presents two variants for computing $$p_i$$:

| Variant | Priority formula | Description |
|---|---|---|
| Proportional | $$p_i = \lvert \delta_i \rvert + \epsilon$$ | Priority is proportional to the absolute TD error plus a small constant epsilon that prevents zero-priority transitions from never being replayed |
| Rank-based | $$p_i = 1 / \mathrm{rank}(i)$$ | Priority is inversely proportional to the rank of the transition when sorted by TD error magnitude |

The rank-based variant is more robust to outliers because it depends only on the ordering of TD errors rather than their raw magnitudes. Its heavy-tail distribution also ensures diversity in the sampled mini-batches.
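The two priority definitions and the sampling distribution can be compared on a small example. The sketch below is a plain-Python illustration; the function names and the example TD errors are assumptions, and a real implementation would avoid re-sorting the whole buffer on every update.

```python
def proportional_priorities(td_errors, eps=1e-6):
    """p_i = |delta_i| + eps."""
    return [abs(d) + eps for d in td_errors]


def rank_based_priorities(td_errors):
    """p_i = 1 / rank(i), where rank 1 is the largest |delta_i|."""
    order = sorted(range(len(td_errors)),
                   key=lambda i: abs(td_errors[i]), reverse=True)
    priorities = [0.0] * len(td_errors)
    for rank, i in enumerate(order, start=1):
        priorities[i] = 1.0 / rank
    return priorities


def sampling_probabilities(priorities, alpha):
    """P(i) = p_i^alpha / sum_k p_k^alpha."""
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    return [s / total for s in scaled]


# The last TD error is a large outlier.
td_errors = [0.1, -2.0, 0.5, 100.0]
prop = sampling_probabilities(proportional_priorities(td_errors), alpha=1.0)
rank = sampling_probabilities(rank_based_priorities(td_errors), alpha=1.0)
print([round(p, 3) for p in prop])
print([round(p, 3) for p in rank])
```

With these inputs the proportional variant concentrates almost all of the probability mass on the outlier, while the rank-based variant still samples it most often but leaves substantial mass for the other transitions.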

### Importance sampling correction

Prioritized sampling introduces bias because transitions with high TD errors are overrepresented relative to the true data distribution. To correct this, PER applies importance sampling weights:

$$
w_i = \left(\frac{1}{N} \cdot \frac{1}{P(i)}\right)^\beta
$$

where $$N$$ is the buffer size and $$\beta$$ controls the degree of correction. The weights are normalized by dividing by the maximum weight in the mini-batch. $$\beta$$ is annealed from a low initial value to 1 over the course of training, reflecting the fact that unbiased updates become most important near convergence when the policy is close to optimal.
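A short sketch of the weight computation and a linear annealing schedule follows. The function names are hypothetical, and the starting value of 0.4 is only an illustrative default for this example.

```python
def importance_weights(probs, buffer_size, beta):
    """w_i = (1 / (N * P(i))) ** beta, divided by the largest weight.

    probs are the sampling probabilities P(i) of the transitions that
    ended up in the mini-batch.
    """
    raw = [(1.0 / (buffer_size * p)) ** beta for p in probs]
    max_w = max(raw)
    return [w / max_w for w in raw]


def linear_beta(step, total_steps, beta_start=0.4):
    """Anneal beta linearly from beta_start to 1 (illustrative schedule)."""
    frac = min(1.0, step / total_steps)
    return beta_start + frac * (1.0 - beta_start)


# A buffer of four transitions, all of which were sampled.
weights = importance_weights([0.5, 0.25, 0.125, 0.125], buffer_size=4, beta=1.0)
print(weights)
```

Each weight multiplies the corresponding transition's loss term, so frequently sampled transitions take proportionally smaller gradient steps; with $$\beta = 0$$ every weight is 1 and no correction is applied.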

Experiments showed that Double DQN combined with prioritized experience replay significantly outperformed the previous state-of-the-art results on the Arcade Learning Environment (Atari 2600) benchmark [5]. PER outperformed DQN with uniform replay on 41 of 49 Atari games tested by Schaul et al., and the proportional variant raised the mean human-normalized score of Double DQN across the suite from 418 percent to 551 percent [5].

### Sum tree data structure

A naive implementation of proportional PER would require O(N) work for each sample, since the cumulative distribution must be inverted. To make sampling efficient, PER is typically implemented with a binary tree called a sum tree (a kind of segment tree), which stores transitions in its leaf nodes and the sum of children's priorities at every internal node. The root node holds the total priority. Both adding a new transition and updating an existing priority cost O(log N), and drawing a single sample is also O(log N), giving the buffer logarithmic asymptotic cost in the buffer size.

A mini-batch of size B is drawn by partitioning the interval [0, p_total] into B equal-width segments, sampling a uniform random number from each segment, and traversing the tree to find the leaf corresponding to that priority mass. This stratified sampling reduces variance compared to drawing all B samples from the full interval.
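The sum tree and the stratified draw can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: the class name, the array layout (root at index 1, leaves padded up to a power of two), and the rule that skips zero-priority subtrees are all choices made for this example.

```python
import random


class SumTree:
    """Array-backed binary tree; internal nodes store sums of priorities."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.size = 1
        while self.size < capacity:
            self.size *= 2                    # pad leaves to a power of two
        self.tree = [0.0] * (2 * self.size)   # node i has children 2i, 2i + 1
        self.data = [None] * capacity
        self.next_slot = 0                    # ring-buffer write position

    def total(self):
        return self.tree[1]                   # the root holds the total priority

    def update(self, slot, priority):
        """Set one leaf's priority and refresh its ancestors: O(log N)."""
        node = slot + self.size
        self.tree[node] = priority
        node //= 2
        while node >= 1:
            self.tree[node] = self.tree[2 * node] + self.tree[2 * node + 1]
            node //= 2

    def add(self, transition, priority):
        """Overwrite the oldest slot with a new transition: O(log N)."""
        slot = self.next_slot
        self.data[slot] = transition
        self.update(slot, priority)
        self.next_slot = (slot + 1) % self.capacity
        return slot

    def find(self, mass):
        """Return the slot whose cumulative-priority interval contains mass."""
        node = 1
        while node < self.size:
            left = 2 * node
            # Go left if the mass falls there, or if the right subtree is
            # empty (this guards against rounding at the upper boundary).
            if mass < self.tree[left] or self.tree[left + 1] <= 0.0:
                node = left
            else:
                mass -= self.tree[left]
                node = left + 1
        return node - self.size

    def sample(self, batch_size, rng=random):
        """Stratified sampling: one draw from each of B equal segments."""
        segment = self.total() / batch_size
        return [self.find(rng.uniform(j * segment, (j + 1) * segment))
                for j in range(batch_size)]


tree = SumTree(4)
for name, p in zip("abcd", [1.0, 2.0, 3.0, 4.0]):
    tree.add(name, p)
print(tree.total(), tree.sample(2, random.Random(0)))
```

Recomputing each ancestor from its two children, rather than adding a delta, keeps the internal sums exactly consistent with the leaves after many updates.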

## How big should the replay buffer be?

The size of the replay buffer is an important hyperparameter that involves several tradeoffs.

| Factor | Small buffer | Large buffer |
|---|---|---|
| Memory usage | Low | High |
| Data diversity | Limited; recent transitions only | Broad; spans many episodes and policies |
| Off-policyness | Low; data is close to current policy | High; oldest transitions may come from very different policies |
| Correlation breaking | Less effective | More effective |
| Staleness risk | Low | High; old transitions may mislead the agent |

Mnih et al. (2015) set the DQN replay buffer to hold one million transitions, and this value became a widely adopted default in subsequent work [4]. Lillicrap et al. (2015) used the same buffer size for [DDPG](/wiki/ddpg) in continuous control tasks [6]. However, research by Fedus et al. (2020) in "Revisiting Fundamentals of Experience Replay" showed that performance consistently improves with increased replay capacity for certain algorithms, while other algorithms are unaffected, suggesting that the one-million default is a practical starting point rather than a universally optimal choice [14].

The replay ratio, defined as the number of gradient updates per environment step, interacts closely with buffer size. Increasing the buffer while holding the replay ratio constant means the oldest transitions in the buffer become more stale. Conversely, reducing the buffer while maintaining the replay ratio keeps the data more on-policy but limits diversity. Finding the right balance depends on the specific algorithm, environment complexity, and computational budget.
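A back-of-the-envelope calculation makes this interaction concrete. Assume a full first-in-first-out buffer of capacity $$N$$, uniform sampling with mini-batch size $$B$$, and a replay ratio of $$\rho$$ updates per environment step (symbols introduced here for illustration). A transition then stays in the buffer for $$N$$ environment steps, during which $$\rho N$$ updates occur, each including it with probability about $$B / N$$:

$$
\mathbb{E}[\text{replays per transition}] \approx \rho N \cdot \frac{B}{N} = \rho B
$$

Under these assumptions the expected reuse count does not depend on $$N$$. A larger buffer spreads the same number of replays over a longer lifetime, so sampled data is older on average. For example, $$\rho = 0.25$$ and $$B = 32$$ give about 8 replays per transition.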

### The Fedus 2020 findings

Fedus, Ramachandran, Agarwal, Bengio, Larochelle, Rowland, and Dabney published "Revisiting Fundamentals of Experience Replay" at ICML 2020 [14]. Their study isolated the effects of replay capacity, the age of the oldest data, and the replay ratio across DQN and Rainbow DQN [14]. They reported several findings that have shaped subsequent practice:

1. The original DQN does not benefit much from increasing the replay buffer beyond the default one million transitions.
2. Rainbow DQN does benefit, and the gain comes mainly from the use of n-step returns.
3. Adding n-step returns to plain DQN makes it benefit from larger buffers; removing n-step returns from Rainbow eliminates the benefit.
4. The replay ratio interacts strongly with capacity, and the right ratio depends on the algorithm.

These findings suggested that buffer size cannot be tuned independently of other algorithmic choices and that n-step bootstrapping plays a quiet but central role in modern off-policy methods.

### Replay ratio and the update-to-data ratio

The replay ratio is also known as the update-to-data (UTD) ratio, or gradient steps per environment step. SAC and DDPG often default to a UTD of 1, meaning one gradient update per environment step. Methods such as REDQ (Chen et al. 2021) [16], DroQ, and the work "Sample-Efficient Reinforcement Learning by Breaking the Replay Ratio Barrier" by D'Oro et al. (ICLR 2023) increase UTD to 10, 20, or higher, often combined with periodic network resets to prevent the value function from collapsing into pathological regions [18]. High-UTD methods can match the sample efficiency of model-based RL on continuous control benchmarks but spend significantly more compute per environment interaction.

## What is Hindsight Experience Replay (HER)?

Andrychowicz et al. (2017) introduced Hindsight Experience Replay (HER) at NeurIPS 2017 to tackle a persistent challenge in goal-conditioned reinforcement learning: sparse rewards [7]. In many real-world tasks, the agent receives a reward signal only upon successfully completing a goal (for example, placing an object at a target location) and receives zero reward otherwise. Under these conditions, standard experience replay struggles because almost every transition in the buffer carries no useful reward signal.

HER addresses this by replaying each episode with substitute goals. After the agent completes an episode (even if it failed to reach the intended goal), HER stores its transitions in the replay buffer more than once: once with the original goal, and again with the goal replaced by a state the agent actually reached during the episode (one extra copy per substitute goal). This relabeling trick transforms failed experiences into successful ones under the substitute goal, providing a dense learning signal even in sparse-reward environments.

Because HER modifies only the goals and not the environment dynamics, it can be combined with any off-policy reinforcement learning algorithm such as DQN, DDPG, or [SAC](/wiki/soft_actor_critic). The researchers demonstrated that policies trained with HER in physics simulations could be successfully transferred to physical robots performing pushing, sliding, and pick-and-place tasks using only binary success/failure rewards [7]. HER can be viewed as a form of implicit [curriculum learning](/wiki/curriculum_learning), where the agent gradually learns to reach increasingly distant goals by first mastering nearby ones.

### Goal selection strategies

The original HER paper compared four strategies for selecting substitute goals during relabeling:

| Strategy | Description | Typical use |
|---|---|---|
| final | Use the state reached at the end of the episode as the substitute goal | Simple baseline; often outperformed by future |
| future | Sample k future states from the same trajectory as substitute goals | Recommended default; k=4 or k=8 worked best |
| episode | Sample k random states from the same episode as substitute goals | Useful when within-episode states are diverse |
| random | Sample k random states from the entire replay buffer as substitute goals | Most off-policy; can hurt learning |

The future strategy with k=4 or k=8 was the best-performing variant in pushing, sliding, and pick-and-place tasks [7]. The exact value of k controls the augmentation factor: each real transition produces k synthetic transitions with relabeled goals, so the effective amount of data the agent sees grows by a factor of k+1.
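The future strategy can be sketched as a relabeling function over one finished episode. The dictionary layout, the function names, and the 1/0 success reward are assumptions made for this illustration; they do not reproduce the code of the original paper or of any library.

```python
import random


def relabel_future(episode, k, reward_fn, rng=random):
    """Return the original transitions plus k relabeled copies of each.

    episode: list of dicts with keys state, action, next_state,
             achieved_goal (the goal reached in next_state) and goal.
    reward_fn(achieved_goal, goal) recomputes the sparse reward.
    """
    out = []
    for t, tr in enumerate(episode):
        # Keep the real transition with its original goal.
        out.append({**tr, "reward": reward_fn(tr["achieved_goal"], tr["goal"])})
        for _ in range(k):
            # Substitute a goal achieved at this step or later in the episode.
            future = episode[rng.randrange(t, len(episode))]
            new_goal = future["achieved_goal"]
            out.append({**tr, "goal": new_goal,
                        "reward": reward_fn(tr["achieved_goal"], new_goal)})
    return out


def sparse_reward(achieved_goal, goal):
    """Binary success signal (hypothetical): 1 on success, 0 otherwise."""
    return 1.0 if achieved_goal == goal else 0.0


# A failed episode on a line: the agent walks 0 -> 3 but the goal was 10.
episode = [
    {"state": s, "action": 1, "next_state": s + 1,
     "achieved_goal": s + 1, "goal": 10}
    for s in range(3)
]
augmented = relabel_future(episode, k=2, reward_fn=sparse_reward,
                           rng=random.Random(0))
print(len(augmented), sum(tr["reward"] for tr in augmented))
```

Every original transition in this episode has zero reward, yet the relabeled copies of the final step are always successes, because the only future goal available to that step is the state it actually reached.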

### Implementation in libraries

Most modern RL frameworks ship a HER implementation. In Stable Baselines3, HER is no longer a separate algorithm but a buffer class called HerReplayBuffer that extends DictReplayBuffer and is passed to off-policy algorithms such as SAC, TD3, or DQN through the replay_buffer_class argument [20]. The user supplies n_sampled_goal (the k value above) and goal_selection_strategy (one of "future", "final", or "episode") [20]. The environment must follow the GoalEnv interface with a dict observation space containing observation, achieved_goal, and desired_goal keys, and must expose a vectorized compute_reward method so that the buffer can recompute rewards for relabeled goals without re-running the simulator.

## Experience replay in key algorithms

### Deep Q-Network (DQN)

The [Deep Q-Network](/wiki/deep_q-network_dqn) algorithm, introduced by Mnih et al. (2013, 2015), was the first to demonstrate that experience replay could enable stable training of deep neural networks for reinforcement learning at scale [3][4]. DQN stores transitions in a buffer of one million entries and samples uniform random mini-batches of 32 transitions for each training step [4]. Together with a separate target network (updated every 10,000 steps in the Nature paper), experience replay was one of DQN's two key innovations for stabilizing training [4]. The 2015 Nature implementation also reduced computation by performing a gradient update only once every 4 agent steps (action selections) rather than after every step [4].

### DDPG, TD3, and SAC

[Deep Deterministic Policy Gradient (DDPG)](/wiki/ddpg), introduced by Lillicrap et al. (2015), extended the DQN approach to continuous action spaces by combining a Q-function critic with a deterministic policy actor, both trained using transitions sampled from a replay buffer [6]. [TD3](/wiki/td3) (Fujimoto et al., 2018) improved on DDPG by using twin critics to reduce overestimation bias and delayed policy updates to stabilize learning, while continuing to rely on experience replay for training data [9]. [Soft Actor-Critic (SAC)](/wiki/soft_actor_critic) (Haarnoja et al., 2018) added entropy regularization to encourage exploration, and like DDPG and TD3, uses a replay buffer as a core component of its training pipeline [10].

### Rainbow DQN

Rainbow (Hessel et al., AAAI 2018) combined six independent improvements to DQN into a single agent: double Q-learning, prioritized experience replay, dueling networks, multi-step (n-step) returns, distributional RL (C51), and noisy networks for exploration [8]. Prioritized experience replay was a key contributor to Rainbow's gains [8]. In the distributional setting, Rainbow uses the Kullback-Leibler divergence between the predicted return distribution and the target distribution as the priority signal, replacing the scalar TD error used in the original PER paper. Rainbow matched DQN's final performance after 7 million frames and surpassed every individual baseline by 44 million frames on the 57-game Atari benchmark [8].

### Distributed Prioritized Experience Replay (Ape-X)

Horgan et al. (2018) scaled prioritized experience replay to distributed settings with the Ape-X architecture [11]. In Ape-X, hundreds or thousands of parallel actors each interact with their own copy of the environment and add transitions with initial priorities to a shared centralized replay buffer [11]. A single learner process samples prioritized mini-batches from this shared buffer and updates the network weights. Actors periodically synchronize their network parameters with the learner. This architecture reduced wall-clock training time by factors of two to four compared to single-actor baselines while also improving final performance. The same idea was extended to recurrent networks in R2D2 (Kapturowski et al., ICLR 2019), which stored sequences of transitions rather than individual transitions and addressed the stale recurrent-state problem that arises when stored LSTM hidden states were produced by out-of-date network parameters [12].

### Algorithms and their replay usage

The table below summarizes how prominent off-policy and offline algorithms use experience replay.

| Algorithm | Action space | Replay type | Default buffer size | Notes |
|---|---|---|---|---|
| DQN (2015) | Discrete | Uniform | 1,000,000 | Original deep RL replay; trained on Atari |
| Double DQN | Discrete | Uniform | 1,000,000 | Decoupled action selection and evaluation |
| Prioritized DQN | Discrete | Proportional or rank PER | 1,000,000 | Sampling weighted by abs TD error |
| Rainbow | Discrete | Proportional PER | 1,000,000 | Combines six DQN extensions |
| DDPG | Continuous | Uniform | 1,000,000 | Off-policy actor-critic for continuous control |
| TD3 | Continuous | Uniform | 1,000,000 | Twin critics, delayed policy updates |
| SAC | Continuous | Uniform | 1,000,000 | Maximum-entropy actor-critic |
| HER + DDPG/SAC | Continuous, goal-conditioned | Uniform with goal relabeling | Variable | Sparse-reward goal tasks |
| Ape-X | Discrete | Distributed PER | up to 2,000,000 or more | Many actors share one prioritized buffer |
| R2D2 | Discrete (recurrent) | Distributed PER over sequences | Variable | Stores fixed-length sequences |
| BCQ | Continuous, offline | Fixed dataset | Static | Offline RL with behavior-constrained policy |
| CQL | Continuous, offline | Fixed dataset | Static | Conservative Q-learning regularizer |

## What is offline reinforcement learning?

Offline reinforcement learning (also called batch RL) takes the idea of experience replay to its extreme: the agent never collects new data and trains entirely from a fixed dataset of transitions. The dataset itself is treated as a static replay buffer that is never overwritten, and the agent must extract the best possible policy from whatever data is available.

The main difficulty in offline RL is distributional shift: the policy being learned may visit state-action pairs that the dataset's behavior policy never visited, which causes the value function to extrapolate wildly into out-of-distribution regions. Algorithms such as Batch-Constrained Q-learning (BCQ; Fujimoto et al. 2019) [13], Conservative Q-Learning (CQL; Kumar et al. 2020) [15], Implicit Q-Learning (IQL; Kostrikov et al. 2022) [17], and Behavior-Regularized Actor-Critic (BRAC) address this by either constraining the policy to stay close to the data distribution or by penalizing Q-values for unseen actions. Offline RL has become a major research area because it enables learning from logged data such as medical records, robot demonstrations, and historical user interactions without further online interaction.

A hybrid setting called offline-to-online RL pretrains on a fixed dataset and then fine-tunes online, often using two replay buffers: one containing the offline data and one filling up with newly collected online transitions. Mixing samples from both buffers during fine-tuning helps the agent retain knowledge from offline pretraining while adapting to its own freshly generated data.
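Mixing the two buffers can be sketched as a single sampling function. The function name and the 50/50 split are illustrative assumptions; published methods choose and schedule this ratio in different ways.

```python
import random


def sample_mixed(offline_data, online_buffer, batch_size,
                 offline_fraction=0.5, rng=random):
    """Draw one mini-batch from an offline dataset and an online buffer.

    offline_fraction is the target share of offline transitions; 0.5 is
    an illustrative choice, not a recommended value.
    """
    n_online = min(round(batch_size * (1.0 - offline_fraction)),
                   len(online_buffer))
    # Top up from the offline data while the online buffer is still small.
    n_offline = min(batch_size - n_online, len(offline_data))
    batch = (rng.sample(offline_data, n_offline)
             + rng.sample(online_buffer, n_online))
    rng.shuffle(batch)  # avoid a fixed offline-then-online ordering
    return batch


offline = [("offline", i) for i in range(100)]
online = [("online", i) for i in range(3)]   # fine-tuning has only just begun
batch = sample_mixed(offline, online, batch_size=8, rng=random.Random(0))
print(sum(1 for source, _ in batch if source == "online"))
```

Early in fine-tuning the online buffer holds too few transitions to fill its share, so the sketch tops the batch up with offline data instead of sampling the same online transitions repeatedly.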

## Variants and extensions

A range of variants have been proposed to improve on uniform random sampling.

| Variant | Year | Sampling rule | Key idea |
|---|---|---|---|
| Uniform replay | 1992 (Lin); 2013 (DQN) | Uniform random | Original formulation |
| Prioritized Experience Replay | 2015 | Proportional or rank-based on abs TD error | Replay informative transitions more often |
| Hindsight Experience Replay | 2017 | Uniform with goal relabeling | Synthetic successes for sparse-reward tasks |
| Combined Experience Replay | 2017 (Zhang and Sutton) | Uniform plus latest transition | Always include the most recent transition in each batch |
| Distributed PER (Ape-X) | 2018 | Centralized prioritized | Scale across many parallel actors |
| Reverse experience replay | Various | Sample in reverse temporal order within an episode | Inspired by hippocampal reverse replay |
| Episodic memory replay | Various | Sample whole episodes or sequences | Used with recurrent agents and meta-RL |
| Experience Replay Optimization (ERO) | 2019 (Zha et al.) | Learned sampling network | Replace heuristic priority with a meta-learned policy |
| Map-based experience replay | 2023 | Sample from clustered states | Reduce catastrophic forgetting in continual RL |

Reverse replay is motivated by neuroscience findings that hippocampal place cells often replay sequences in reverse order at decision points. In RL, reverse-order replay can speed credit assignment because the value of the terminal transition is updated first and propagated backward through the sequence in subsequent updates.
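The credit-assignment effect is easy to see in a tabular toy problem. The sketch below assumes a four-state chain with a single reward at the end, a learning rate of 1, and no discounting; the names and the setup are illustrative only.

```python
def td_update(values, transition, gamma=1.0, lr=1.0):
    """One tabular TD(0) update of a state-value table."""
    s, r, s_next, done = transition
    target = r + (0.0 if done else gamma * values[s_next])
    values[s] += lr * (target - values[s])


def replay(episode, order):
    """Replay the episode's transitions once in the given index order."""
    values = [0.0] * (len(episode) + 1)   # one extra slot for the terminal state
    for i in order:
        td_update(values, episode[i])
    return values[:-1]


# Chain 0 -> 1 -> 2 -> 3 -> terminal, with reward 1 only on the last step.
episode = [(0, 0.0, 1, False), (1, 0.0, 2, False),
           (2, 0.0, 3, False), (3, 1.0, 4, True)]

forward = replay(episode, range(4))
backward = replay(episode, reversed(range(4)))
print(forward, backward)
```

One forward pass only teaches the value of the final state, whereas one reverse pass carries the reward all the way back to the start of the chain.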

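Episodic memory replay stores entire trajectories or contiguous sequences rather than individual transitions and is particularly useful for partially observable problems where temporal context matters. R2D2 [12], for example, replays fixed-length sequences of transitions to train a recurrent Q-network, and IMPALA-style actor-learner architectures similarly pass trajectory segments, rather than single transitions, from actors to the learner.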
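A sequence-oriented buffer can be sketched as follows. This is a simplified illustration, not the storage scheme of R2D2 or any specific library: the class name `EpisodeReplayBuffer`, its method names, and the choice to evict whole episodes are assumptions made for the example.

```python
import random
from collections import deque

class EpisodeReplayBuffer:
    """Stores complete episodes and samples fixed-length windows from them."""

    def __init__(self, capacity_episodes, rng=None):
        # Oldest episodes are evicted once the episode capacity is reached.
        self.episodes = deque(maxlen=capacity_episodes)
        self.current = []
        self.rng = rng or random.Random()

    def push(self, state, action, reward, next_state, done):
        self.current.append((state, action, reward, next_state, done))
        if done:
            # Close the episode so sampled windows never span two episodes.
            self.episodes.append(self.current)
            self.current = []

    def sample_sequences(self, batch_size, seq_len):
        eligible = [ep for ep in self.episodes if len(ep) >= seq_len]
        if not eligible:
            raise ValueError("no stored episode is long enough")
        batch = []
        for _ in range(batch_size):
            ep = self.rng.choice(eligible)
            start = self.rng.randrange(len(ep) - seq_len + 1)
            batch.append(ep[start:start + seq_len])
        return batch
```

Because each window is cut from a single stored episode, a recurrent network unrolled over it never sees a transition from a different episode.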

## How is a replay buffer implemented?

### Minimal Python implementation

A simple uniform replay buffer can be implemented in a few lines of Python using a deque from the collections module:

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity):
        # A bounded deque drops the oldest transition once capacity is reached.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement; requires len(self) >= batch_size.
        batch = random.sample(self.buffer, batch_size)
        # Regroup the list of transitions into one tuple per field.
        state, action, reward, next_state, done = zip(*batch)
        return state, action, reward, next_state, done

    def __len__(self):
        return len(self.buffer)
```

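This implementation is concise and correct, but it scales poorly to large buffers. Although deque is implemented in C, indexing into the middle of a deque takes time proportional to the distance from the nearest end, so drawing B random transitions from a buffer of N items can cost O(B·N) in the worst case rather than O(B). In addition, zip(*batch) only regroups the sampled tuples by field; converting each field into an array or tensor is a further per-batch cost that becomes a bottleneck.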

### Tensor-backed implementation

Production implementations typically allocate fixed-size NumPy arrays or PyTorch tensors at construction time and write into them by index, avoiding the overhead of Python-level deques and per-sample tensor creation:

```python
import numpy as np
import torch

class TensorReplayBuffer:
    def __init__(self, capacity, state_dim, action_dim, device="cuda"):
        self.capacity = capacity
        self.device = device
        self.states = np.empty((capacity, state_dim), dtype=np.float32)
        self.actions = np.empty((capacity, action_dim), dtype=np.float32)
        self.rewards = np.empty((capacity,), dtype=np.float32)
        self.next_states = np.empty((capacity, state_dim), dtype=np.float32)
        self.dones = np.empty((capacity,), dtype=np.float32)
        self.idx = 0
        self.size = 0

    def push(self, s, a, r, s2, d):
        i = self.idx
        self.states[i] = s
        self.actions[i] = a
        self.rewards[i] = r
        self.next_states[i] = s2
        self.dones[i] = d
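        # Advance the write cursor, wrapping around to overwrite the oldest entry.
        self.idx = (self.idx + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)

    def sample(self, batch_size):
        # Sample indices uniformly (with replacement) from the filled portion only.
        idx = np.random.randint(0, self.size, size=batch_size)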
        return (
            torch.from_numpy(self.states[idx]).to(self.device),
            torch.from_numpy(self.actions[idx]).to(self.device),
            torch.from_numpy(self.rewards[idx]).to(self.device),
            torch.from_numpy(self.next_states[idx]).to(self.device),
            torch.from_numpy(self.dones[idx]).to(self.device),
        )
```

For image-based observations such as Atari frames, a common optimization is to store frames as 8-bit unsigned integers and to store each frame only once, rather than once for every stacked observation it appears in, reducing memory usage by a factor of four or more. The Dopamine library and the original DQN code both apply this trick: a one-million transition buffer of 84 by 84 grayscale frames takes about 7 GB instead of the roughly 28 GB needed to store a four-frame stack per transition.
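The frame-deduplication idea can be sketched as below. This is a simplified illustration, not the Dopamine or DQN implementation: the class name `FrameStore` is hypothetical, the store is not circular (it assumes fewer than `capacity` frames are added), and padding the start of an episode by repeating its first frame is an assumption made for the example.

```python
import numpy as np

class FrameStore:
    """Stores each uint8 frame once and rebuilds frame stacks on demand."""

    def __init__(self, capacity, height=84, width=84, stack=4):
        self.frames = np.zeros((capacity, height, width), dtype=np.uint8)
        self.episode_start = np.zeros(capacity, dtype=bool)
        self.stack = stack
        self.size = 0

    def add(self, frame, is_first):
        """Append one frame and return its index."""
        self.frames[self.size] = frame
        self.episode_start[self.size] = is_first
        self.size += 1
        return self.size - 1

    def get_stack(self, i):
        """Return the `stack` most recent frames ending at index i, oldest first."""
        idx = []
        j = i
        for _ in range(self.stack):
            idx.append(j)
            # Never step back across an episode boundary: repeat the first frame.
            if not self.episode_start[j] and j > 0:
                j -= 1
        return self.frames[idx[::-1]]
```

Under this layout the state for a transition at index `i` is `get_stack(i)` and its next state is `get_stack(i + 1)`, so neither stack needs to be stored explicitly.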

### Library implementations

Several mature libraries provide replay buffer implementations:

| Library | Replay class | Notes |
|---|---|---|
| Stable Baselines3 | ReplayBuffer, DictReplayBuffer, HerReplayBuffer | Used by SB3's SAC, TD3, DQN |
| TorchRL | ReplayBuffer with Storage backends (List, Tensor, LazyTensor, LazyMemmap) | Composable, supports custom samplers |
| RLlib | Various replay actors via RLlib's execution plans | Distributed replay built on Ray |
| Dopamine | OutOfGraphReplayBuffer | Memory-efficient for Atari; supports n-step |
| TF-Agents | TFUniformReplayBuffer | TensorFlow-native |
| Acme | Reverb | Distributed replay built on the Reverb backend |

DeepMind's Reverb, used by Acme and several other DeepMind libraries, deserves special mention. It is a high-performance, distributed replay system written in C++ with Python bindings, supporting prioritized sampling, sequential sampling, and arbitrary user-defined samplers, all over a network. Reverb scales to billions of transitions and is the engine behind many of DeepMind's distributed RL results.

## Memory and engineering considerations

### Memory footprint

A replay buffer's memory footprint is dominated by the state and next-state fields. For an Atari-style buffer of one million transitions with 84 by 84 grayscale uint8 frames stacked four deep, the naive cost of storing both state and next-state would be:

$$2 \times 1{,}000{,}000 \times 4 \times 84 \times 84 \approx 5.64 \times 10^{10} \text{ bytes} \approx 56 \text{ GB}$$

This is why production implementations store only the unique frames and reconstruct state stacks at sample time. With deduplication, a one-million transition buffer fits in roughly 7 GB. For continuous control with low-dimensional state vectors (for example, a 17-dimensional MuJoCo state), the same buffer might take only a few hundred megabytes and fits easily in CPU RAM.
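The arithmetic behind these figures can be checked with a short script. The function name `atari_replay_bytes` is hypothetical, and the deduplicated figure assumes roughly one new frame is stored per transition.

```python
def atari_replay_bytes(num_transitions=1_000_000, height=84, width=84, stack=4):
    """Return (naive, deduplicated) storage cost in bytes for uint8 frames."""
    frame_bytes = height * width  # one grayscale uint8 frame
    # Naive layout: a full stack for both state and next_state per transition.
    naive = 2 * num_transitions * stack * frame_bytes
    # Deduplicated layout: about one newly stored frame per transition.
    deduplicated = num_transitions * frame_bytes
    return naive, deduplicated

naive, dedup = atari_replay_bytes()
print(f"naive: {naive / 1e9:.1f} GB, deduplicated: {dedup / 1e9:.1f} GB")
# prints "naive: 56.4 GB, deduplicated: 7.1 GB"
```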

### CPU vs GPU storage

Replay buffers are usually kept in CPU RAM because most agents have buffer sizes much larger than typical GPU memory. Mini-batches are transferred to the GPU at sample time, often with pinned memory and asynchronous data loaders to overlap data movement with computation. A few high-throughput systems (TorchRL's LazyTensorStorage on CUDA, JAX-based implementations on TPU) keep the entire buffer in accelerator memory when it fits, which removes host-to-device copy latency at the cost of using accelerator RAM that could otherwise hold a larger model or batch.

### Pinned memory and prefetching

For heavy training loops, sample throughput becomes a bottleneck. Common optimizations include allocating buffer arrays in pinned (page-locked) host memory so that PyTorch can issue asynchronous host-to-device transfers, prefetching the next mini-batch in a background thread, and replacing per-element Python iteration with vectorized NumPy or Torch operations. Distributed setups go further by sharding the buffer across machines and using RPC or Reverb-style sampling.

### Numerical considerations

A few subtle implementation issues recur in practice. Storing rewards as float32 is usually fine, but storing returns (cumulative rewards) in float16 can overflow, since float16 cannot represent values above 65,504. Boolean done flags should be stored as floats so they can be used directly in the bootstrap mask (1 - done). When computing TD targets for a mini-batch, the bootstrapped Q-value must be computed from the next-state field, not the current state; confusing the two is an off-by-one error that has appeared in many beginner implementations.
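The done mask and next-state bootstrap can be made concrete with a small NumPy sketch of a one-step Q-learning target. The function name `td_targets` and its argument layout are assumptions made for the example; `next_q_values` stands for whatever the target network returns when evaluated on the next-state field.

```python
import numpy as np

def td_targets(rewards, dones, next_q_values, gamma=0.99):
    """Compute one-step Q-learning targets for a mini-batch.

    next_q_values has shape (batch, num_actions) and must come from
    evaluating the network on next_states, not on states.
    """
    rewards = np.asarray(rewards, dtype=np.float32)
    dones = np.asarray(dones, dtype=np.float32)  # 1.0 where the episode ended
    best_next = np.max(next_q_values, axis=1)    # max over actions in s'
    # (1 - done) zeroes the bootstrap term for terminal transitions.
    return rewards + gamma * (1.0 - dones) * best_next
```

For a terminal transition the target reduces to the reward alone, which is exactly what the float-valued done flag is there to enforce.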

## Evaluation and benchmarks

Replay-based methods are most commonly evaluated on three families of benchmarks. The Arcade Learning Environment (Atari 2600) introduced by Bellemare et al. (2013) was the original deep RL benchmark and remains a standard testbed for replay-based discrete-action methods. The DeepMind Control Suite and the OpenAI Gym MuJoCo tasks are the standard continuous control benchmarks for DDPG, TD3, SAC, and their replay variants. The OpenAI Gym Robotics tasks (Fetch and HandManipulate) are the standard sparse-reward goal-conditioned benchmarks for HER. More recent benchmarks include the DeepMind Control Suite from pixels, ProcGen and Crafter for generalization, and the D4RL and Robomimic datasets for offline RL.

## Common pitfalls

Several mistakes recur in practice when implementing or tuning experience replay.

| Pitfall | Symptom | Fix |
|---|---|---|
| Sampling before warmup | Initial Q-values are extremely noisy | Wait until the buffer holds a minimum number of transitions before training |
| Using on-policy data only | Q-values overfit to the current policy | Use a sufficiently large buffer or add an off-policy correction |
| Ignoring done flags | Bootstrapping past terminal states corrupts targets | Multiply the next-state value by (1 - done) |
| Forgetting importance sampling weights in PER | Biased gradient estimates | Apply $$(N \cdot P(i))^{-\beta}$$ weights and normalize by the largest weight |
| Storing full state stacks | Out-of-memory on Atari-scale problems | Deduplicate frames or use uint8 storage |
| Mixing transitions across episodes | Bootstrapping or sampled sequences cross episode boundaries | Set done flags correctly and, when sampling sequences, store data per episode |
| Stale priorities | New transitions never get replayed | Initialize new transitions with the maximum priority in the buffer |
| Replay ratio too high | Q-function diverges or value collapses to a constant | Lower the ratio or apply periodic network resets |

Most mature libraries handle these pitfalls automatically, but custom implementations frequently fall into one or more of them.
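As one example, the importance-sampling correction from the table can be sketched in NumPy. The function name `per_importance_weights` is hypothetical, and normalizing by the largest weight in the sampled batch (rather than over the whole buffer) is a simplifying assumption.

```python
import numpy as np

def per_importance_weights(sample_probs, buffer_size, beta):
    """Importance-sampling weights for transitions drawn with prioritized replay.

    sample_probs[i] is P(i), the probability with which transition i was drawn;
    buffer_size is N; beta in [0, 1] controls how strongly the bias is corrected.
    """
    sample_probs = np.asarray(sample_probs, dtype=np.float64)
    weights = (buffer_size * sample_probs) ** (-beta)
    # Scale so the largest weight is 1, which only ever shrinks updates.
    return weights / weights.max()
```

Each weight multiplies the corresponding transition's TD loss, so transitions that were sampled more often than uniform sampling would allow contribute proportionally less to the gradient.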

## Explain like I'm 5 (ELI5)

Imagine you are learning to ride a bike. Every time you try, you remember what happened: how you leaned, how you pedaled, and whether you fell or stayed upright. Now imagine you have a big scrapbook where you paste a picture and a note about every single attempt.

Without the scrapbook, you would only remember your very last try. Maybe that last try was a lucky one where you did not fall, so you would not learn much about what to avoid. Or maybe you fell in a weird way that does not happen often, and you would overreact to that one bad memory.

With the scrapbook, you can flip to any random page before your next attempt and review a handful of old memories. Some are from yesterday, some from last week. By studying a mix of different memories instead of just the latest one, you learn faster, you do not forget the lessons from earlier attempts, and you do not overreact to any single ride. That scrapbook is what computer scientists call a replay buffer, and the process of flipping back through it to learn is called experience replay.

A more clever version of the scrapbook puts colored sticky notes on the pages with the most surprising mistakes, so you spend extra time reviewing those. That is what prioritized experience replay does. And another clever version, called hindsight experience replay, lets you flip back to a failed attempt and pretend the place you ended up was actually the place you meant to go, so even your failures teach you something useful.

## See also

- [Reinforcement Learning](/wiki/reinforcement_learning_rl)
- [Deep Q-Network (DQN)](/wiki/deep_q-network_dqn)
- [Q-Learning](/wiki/q-learning)
- [DDPG](/wiki/ddpg), [TD3](/wiki/td3), [Soft Actor-Critic (SAC)](/wiki/soft_actor_critic)
- [Replay Buffer](/wiki/replay_buffer)
- [DeepMind](/wiki/deepmind)

## References

1. Lin, L.-J. (1992). "Self-Improving Reactive Agents Based on Reinforcement Learning, Planning and Teaching." *Machine Learning*, 8, 293-321.
2. Wilson, M.A., and McNaughton, B.L. (1994). "Reactivation of Hippocampal Ensemble Memories During Sleep." *Science*, 265(5172), 676-679.
3. Mnih, V., Kavukcuoglu, K., Silver, D., et al. (2013). "Playing Atari with Deep Reinforcement Learning." *arXiv preprint arXiv:1312.5602*.
4. Mnih, V., Kavukcuoglu, K., Silver, D., et al. (2015). "Human-Level Control Through Deep Reinforcement Learning." *Nature*, 518(7540), 529-533.
5. Schaul, T., Quan, J., Antonoglou, I., and Silver, D. (2016). "Prioritized Experience Replay." *Proceedings of the International Conference on Learning Representations (ICLR 2016)*. arXiv:1511.05952.
6. Lillicrap, T.P., Hunt, J.J., Pritzel, A., et al. (2016). "Continuous Control with Deep Reinforcement Learning." *Proceedings of ICLR 2016*.
7. Andrychowicz, M., Wolski, F., Ray, A., et al. (2017). "Hindsight Experience Replay." *Advances in Neural Information Processing Systems (NeurIPS 2017)*. arXiv:1707.01495.
8. Hessel, M., Modayil, J., van Hasselt, H., et al. (2018). "Rainbow: Combining Improvements in Deep Reinforcement Learning." *Proceedings of AAAI 2018*. arXiv:1710.02298.
9. Fujimoto, S., van Hoof, H., and Meger, D. (2018). "Addressing Function Approximation Error in Actor-Critic Methods." *Proceedings of ICML 2018*.
10. Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. (2018). "Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor." *Proceedings of ICML 2018*.
11. Horgan, D., Quan, J., Budden, D., et al. (2018). "Distributed Prioritized Experience Replay." *Proceedings of ICLR 2018*.
12. Kapturowski, S., Ostrovski, G., Quan, J., Munos, R., and Dabney, W. (2019). "Recurrent Experience Replay in Distributed Reinforcement Learning." *Proceedings of ICLR 2019*.
13. Fujimoto, S., Meger, D., and Precup, D. (2019). "Off-Policy Deep Reinforcement Learning Without Exploration." *Proceedings of ICML 2019*.
14. Fedus, W., Ramachandran, P., Agarwal, R., Bengio, Y., Larochelle, H., Rowland, M., and Dabney, W. (2020). "Revisiting Fundamentals of Experience Replay." *Proceedings of ICML 2020*. arXiv:2007.06700.
15. Kumar, A., Zhou, A., Tucker, G., and Levine, S. (2020). "Conservative Q-Learning for Offline Reinforcement Learning." *Advances in Neural Information Processing Systems (NeurIPS 2020)*.
16. Chen, X., Wang, C., Zhou, Z., and Ross, K. (2021). "Randomized Ensembled Double Q-Learning: Learning Fast Without a Model." *Proceedings of ICLR 2021*.
17. Kostrikov, I., Nair, A., and Levine, S. (2022). "Offline Reinforcement Learning with Implicit Q-Learning." *Proceedings of ICLR 2022*.
18. D'Oro, P., Schwarzer, M., Nikishin, E., Bacon, P.-L., Bellemare, M.G., and Courville, A. (2023). "Sample-Efficient Reinforcement Learning by Breaking the Replay Ratio Barrier." *Proceedings of ICLR 2023*.
19. Sutton, R.S., and Barto, A.G. (2018). *Reinforcement Learning: An Introduction* (2nd ed.). MIT Press.
20. Stable Baselines3 documentation. "HER and HerReplayBuffer." Accessed 2026.