See also: Reinforcement Learning, Deep Q-Network (DQN), Q-Learning
Experience Replay is a foundational technique in reinforcement learning that allows an agent to store past interactions with its environment in a memory structure called a replay buffer and later resample those interactions during training. Rather than learning exclusively from the most recent experience, the agent randomly draws mini-batches of earlier transitions from the buffer, breaking the temporal correlations present in sequential data and dramatically improving both the stability and sample efficiency of the learning process.
First proposed by Long-Ji Lin in 1992, the technique remained relatively niche until it became a critical component of the Deep Q-Network (DQN) architecture introduced by Mnih et al. in 2013 and 2015. Since then, experience replay has become a standard building block of virtually all off-policy deep reinforcement learning algorithms, including Deep Deterministic Policy Gradient (DDPG), Twin Delayed DDPG (TD3), and Soft Actor-Critic (SAC).
The concept of experience replay was introduced by Long-Ji Lin in his 1992 paper "Self-Improving Reactive Agents Based on Reinforcement Learning, Planning and Teaching," published in the journal Machine Learning. In that work, Lin compared eight reinforcement learning frameworks built around two base algorithms: the Adaptive Heuristic Critic (AHC) and Q-learning. He proposed three extensions to speed up learning: experience replay, learning action models for planning, and teaching. Among these, experience replay proved to be the simplest and most broadly applicable.
Lin's central insight was that an agent could store its past experiences and revisit them later instead of discarding each transition after a single learning update. This idea drew loose inspiration from biological findings about hippocampal replay in mammals, where the brain replays neural activity patterns from prior waking experiences during sleep and rest, consolidating memories and improving future decision-making.
The technique gained mainstream prominence in 2013 when Mnih et al. at DeepMind combined experience replay with deep neural networks to create the DQN algorithm. DQN used a replay buffer of one million transitions to train an agent that achieved human-level performance on dozens of Atari 2600 games. This result demonstrated that experience replay was essential for stabilizing the training of deep neural networks in reinforcement learning settings.
The replay buffer (also called experience memory or replay memory) is typically implemented as a circular buffer with a fixed maximum capacity. Each entry in the buffer represents a single transition, stored as a tuple:
| Element | Symbol | Description |
|---|---|---|
| State | s | The environment state observed by the agent before taking an action |
| Action | a | The action selected by the agent |
| Reward | r | The scalar reward signal received from the environment after taking the action |
| Next State | s' | The environment state observed after the action was executed |
| Done Flag | d | A boolean indicating whether s' is a terminal state |
As the agent interacts with the environment, new transitions are appended to the buffer. When the buffer reaches its maximum capacity, the oldest transitions are overwritten, ensuring the buffer always contains the most recent experiences up to its size limit.
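A minimal sketch of such a buffer in Python, using `collections.deque` with a fixed `maxlen` so the oldest transitions are evicted automatically once capacity is reached (the class name and interface here are illustrative, not taken from any particular library):

```python
import random
from collections import deque, namedtuple

# One stored transition; the fields mirror the (s, a, r, s', d) tuple above.
Transition = namedtuple("Transition", ["state", "action", "reward", "next_state", "done"])

class ReplayBuffer:
    """Fixed-capacity circular buffer of transitions."""

    def __init__(self, capacity=1_000_000):
        # A deque with maxlen silently drops the oldest entry when full.
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append(Transition(state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform random mini-batch without replacement.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```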
During each training step, the agent draws a random mini-batch of transitions from the replay buffer (typically 32 to 256 transitions). These sampled transitions are then used to compute loss values and update the agent's parameters via gradient descent. In the case of Q-learning variants, the sampled transitions are used to compute temporal-difference (TD) targets and minimize the TD error.
This random sampling is the key mechanism behind experience replay's benefits. Because each mini-batch is drawn uniformly at random from a large pool of transitions collected over many episodes, the samples within a training batch are unlikely to be correlated with one another.
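As a concrete illustration, the sketch below performs one such training step for a Q-learning variant. It assumes PyTorch, the `ReplayBuffer` sketched above, and hypothetical `q_net` and `target_net` modules that map a batch of states to per-action Q-values (the role of the target network is discussed later in connection with DQN):

```python
import numpy as np
import torch
import torch.nn.functional as F

def td_update(q_net, target_net, optimizer, buffer, batch_size=32, gamma=0.99):
    """One gradient step on a uniformly sampled mini-batch of transitions."""
    batch = buffer.sample(batch_size)

    # Stack the tuple fields into batched tensors.
    states = torch.as_tensor(np.stack([t.state for t in batch]), dtype=torch.float32)
    actions = torch.as_tensor([t.action for t in batch], dtype=torch.int64)
    rewards = torch.as_tensor([t.reward for t in batch], dtype=torch.float32)
    next_states = torch.as_tensor(np.stack([t.next_state for t in batch]), dtype=torch.float32)
    dones = torch.as_tensor([float(t.done) for t in batch], dtype=torch.float32)

    # TD target: r + gamma * max_a' Q_target(s', a'), with no bootstrapping at terminal states.
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * next_q

    # Minimize the TD error between current estimates and targets.
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(q_values, targets)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```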
Experience replay addresses several fundamental challenges that arise when training neural networks with reinforcement learning data.
When an agent learns online from a stream of consecutive experiences, successive transitions are highly correlated. For example, an agent navigating a maze will see a long sequence of spatially adjacent states. Training a neural network on such correlated data violates the independent and identically distributed (i.i.d.) assumption underlying stochastic gradient descent, which can cause the network to overfit to recent trajectories and produce unstable weight updates. By sampling randomly from a large buffer, experience replay decorrelates the training data and approximates the i.i.d. condition.
Without replay, each transition is used for exactly one parameter update and then discarded. This is extremely wasteful, especially in environments where collecting data is expensive or slow. Experience replay allows each transition to be reused across multiple training updates, extracting more learning signal from each interaction with the environment. A single rare but informative transition can contribute to learning dozens or hundreds of times before it is eventually overwritten.
In deep reinforcement learning, the target values used for training depend on the agent's own parameters, creating a moving-target problem. Random sampling from a diverse buffer smooths out these fluctuations by ensuring that any single training batch reflects a broad distribution of experiences rather than the agent's current behavioral regime. This stabilization effect was one of the key reasons DQN succeeded where earlier attempts to combine neural networks with Q-learning had failed.
Neural networks are prone to catastrophic forgetting, where learning new information erases previously acquired knowledge. By continually revisiting older experiences stored in the buffer, the agent retains knowledge about earlier parts of the state space even as its policy evolves and explores new regions.
The original experience replay formulation uses uniform random sampling: every transition in the buffer has an equal probability of being selected. While simple and effective, uniform sampling treats all transitions as equally valuable for learning, which is not always the case.
Uniform sampling is straightforward to implement and introduces no bias into the learning process. Its main limitation is inefficiency: many sampled transitions may be "easy" examples that the agent already handles well, while rare or surprising transitions that could drive significant learning progress are sampled no more frequently than any other.
Schaul et al. introduced Prioritized Experience Replay (PER) at ICLR 2016 to address this inefficiency. The core idea is that transitions should be replayed in proportion to how much the agent can learn from them. The authors used the magnitude of the TD error as a proxy for learning potential: transitions where the agent's prediction was far from the actual outcome are presumably more informative.
PER defines the sampling probability for transition i as:
P(i) = p_i^alpha / sum_k(p_k^alpha)
where alpha controls the degree of prioritization (alpha = 0 yields uniform sampling) and p_i is the priority of transition i. The paper presents two variants for computing p_i:
| Variant | Priority Formula | Description |
|---|---|---|
| Proportional | p_i = abs(delta_i) + epsilon | Priority is proportional to the absolute TD error plus a small constant epsilon that prevents zero-priority transitions from never being replayed |
| Rank-Based | p_i = 1 / rank(i) | Priority is inversely proportional to the rank of the transition when sorted by TD error magnitude |
The rank-based variant is more robust to outliers because it depends only on the ordering of TD errors rather than their raw magnitudes. Its heavy-tail distribution also ensures diversity in the sampled mini-batches.
Prioritized sampling introduces bias because transitions with high TD errors are overrepresented relative to the true data distribution. To correct this, PER applies importance sampling weights:
w_i = (1/N * 1/P(i))^beta
where N is the buffer size and beta controls the degree of correction. The weights are normalized by dividing by the maximum weight in the mini-batch. The parameter beta is annealed from a low initial value to 1 over the course of training, reflecting the fact that unbiased updates become most important near convergence when the policy is close to optimal.
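The NumPy sketch below puts the proportional variant and the importance-sampling correction together. Production implementations store priorities in a sum-tree so that sampling costs O(log N); this linear version only illustrates the arithmetic:

```python
import numpy as np

def per_sample(td_errors, batch_size, alpha=0.6, beta=0.4, epsilon=1e-5):
    """Sample transition indices under proportional prioritization.

    td_errors: the latest TD error recorded for each stored transition.
    Returns sampled indices and their normalized importance-sampling weights.
    """
    # Proportional variant: p_i = |delta_i| + epsilon.
    priorities = np.abs(td_errors) + epsilon

    # P(i) = p_i^alpha / sum_k p_k^alpha; alpha = 0 recovers uniform sampling.
    probs = priorities ** alpha
    probs /= probs.sum()

    # PER samples with replacement, so high-priority transitions can repeat.
    indices = np.random.choice(len(probs), size=batch_size, p=probs)

    # Importance-sampling weights w_i = (1/N * 1/P(i))^beta, normalized by the max.
    n = len(probs)
    weights = (n * probs[indices]) ** (-beta)
    weights /= weights.max()
    return indices, weights
```

During training, each sampled transition's loss term is scaled by its weight, and beta is annealed toward 1 as described above.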
Experiments showed that Double DQN combined with prioritized experience replay significantly outperformed the previous state of the art on the Arcade Learning Environment (Atari) benchmark.
The size of the replay buffer is an important hyperparameter that involves several tradeoffs.
| Factor | Small Buffer | Large Buffer |
|---|---|---|
| Memory Usage | Low | High |
| Data Diversity | Limited; recent transitions only | Broad; spans many episodes and policies |
| Off-Policyness | Low; data is close to current policy | High; oldest transitions may come from very different policies |
| Correlation Breaking | Less effective | More effective |
| Staleness Risk | Low | High; old transitions may mislead the agent |
Mnih et al. (2015) set the DQN replay buffer to hold one million transitions, and this value became a widely adopted default in subsequent work. Lillicrap et al. (2015) used the same buffer size for DDPG in continuous control tasks. However, Fedus et al. (2020), in "Revisiting Fundamentals of Experience Replay," showed that increasing replay capacity consistently improves performance for some agents (notably those using n-step returns, such as Rainbow) while leaving others essentially unaffected, suggesting that the one-million default is a practical starting point rather than a universally optimal choice.
The replay ratio, defined as the number of gradient updates per environment step, interacts closely with buffer size. Increasing the buffer while holding the replay ratio constant means the oldest transitions in the buffer become more stale. Conversely, reducing the buffer while maintaining the replay ratio keeps the data more on-policy but limits diversity. Finding the right balance depends on the specific algorithm, environment complexity, and computational budget.
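As a rough sketch of how these two knobs appear in practice, the loop below performs a fixed number of gradient updates per environment step. It assumes a Gymnasium-style environment API and a hypothetical `agent` with `act` and `update` methods; none of these names come from a specific library:

```python
def train(env, agent, buffer, total_steps=100_000,
          replay_ratio=1, batch_size=32, min_buffer=1_000):
    """Interleave environment steps and gradient updates at a fixed replay ratio."""
    state, _ = env.reset()
    for step in range(total_steps):
        action = agent.act(state)
        next_state, reward, terminated, truncated, _ = env.step(action)
        buffer.add(state, action, reward, next_state, terminated)
        state = next_state if not (terminated or truncated) else env.reset()[0]

        # replay_ratio gradient updates per environment step; raising it reuses
        # stored data more aggressively, lowering it keeps training closer to on-policy.
        if len(buffer) >= min_buffer:
            for _ in range(replay_ratio):
                agent.update(buffer.sample(batch_size))
```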
Andrychowicz et al. (2017) introduced Hindsight Experience Replay (HER) at NeurIPS 2017 to tackle a persistent challenge in goal-conditioned reinforcement learning: sparse rewards. In many real-world tasks, the agent receives a reward signal only upon successfully completing a goal (for example, placing an object at a target location) and receives zero reward otherwise. Under these conditions, standard experience replay struggles because almost every transition in the buffer carries no useful reward signal.
HER addresses this by replaying each episode with substitute goals. After the agent completes an episode (even if it failed to reach the intended goal), HER stores the episode in the replay buffer more than once: once with the original goal, and again with the goal replaced by a state the agent actually reached during the episode (in the simplest strategy, the episode's final state). This relabeling trick transforms failed experiences into successful ones under the substitute goal, providing a dense learning signal even in sparse-reward environments.
Because HER modifies only the goals and not the environment dynamics, it can be combined with any off-policy reinforcement learning algorithm such as DQN, DDPG, or SAC. The researchers demonstrated that policies trained with HER in physics simulations could be successfully transferred to physical robots performing pushing, sliding, and pick-and-place tasks using only binary success/failure rewards. HER can be viewed as a form of implicit curriculum learning, where the agent gradually learns to reach increasingly distant goals by first mastering nearby ones.
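A simplified sketch of the relabeling step, using the paper's "final" goal-selection strategy (the dict-based transition format and the `compute_reward` function are illustrative stand-ins for the environment's own sparse reward computation):

```python
def her_relabel(episode, compute_reward):
    """Return a copy of an episode with the goal replaced by the final achieved state.

    episode: list of transition dicts with keys
             'state', 'action', 'next_state', 'achieved_goal', 'goal', 'reward'.
    compute_reward(achieved_goal, goal): sparse reward, e.g. 0.0 on success, -1.0 otherwise.
    """
    substitute_goal = episode[-1]["achieved_goal"]  # the "final" strategy
    relabeled = []
    for t in episode:
        new_t = dict(t)
        new_t["goal"] = substitute_goal
        # Under the substitute goal, steps that reached it now earn the success reward.
        new_t["reward"] = compute_reward(t["achieved_goal"], substitute_goal)
        relabeled.append(new_t)
    return relabeled
```

Both the original and the relabeled copies are then added to the replay buffer, so even a failed episode yields transitions that count as successes under the substitute goal.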
The Deep Q-Network algorithm, introduced by Mnih et al. (2013, 2015), was the first to demonstrate that experience replay could enable stable training of deep neural networks for reinforcement learning at scale. DQN stores transitions in a buffer of one million entries and samples uniform random mini-batches of 32 transitions for each training step. Together with a separate target network (updated periodically), experience replay was one of DQN's two key innovations for stabilizing training.
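The periodic target-network update mentioned above is commonly realized as a hard copy of the online network's weights every fixed number of steps; a minimal PyTorch sketch, reusing the hypothetical `q_net` and `target_net` from earlier:

```python
TARGET_SYNC_EVERY = 10_000  # steps between hard target updates (an illustrative value)

if step % TARGET_SYNC_EVERY == 0:
    # Freeze a snapshot of the online weights to serve as the TD-target network.
    target_net.load_state_dict(q_net.state_dict())
```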
Deep Deterministic Policy Gradient (DDPG), introduced by Lillicrap et al. (2015), extended the DQN approach to continuous action spaces by combining a Q-function critic with a deterministic policy actor, both trained using transitions sampled from a replay buffer. TD3 (Fujimoto et al., 2018) improved on DDPG by using twin critics to reduce overestimation bias and delayed policy updates to stabilize learning, while continuing to rely on experience replay for training data. Soft Actor-Critic (SAC) (Haarnoja et al., 2018) added entropy regularization to encourage exploration, and like DDPG and TD3, uses a replay buffer as a core component of its training pipeline.
Horgan et al. (2018) scaled prioritized experience replay to distributed settings with the Ape-X architecture. In Ape-X, hundreds or thousands of parallel actors each interact with their own copy of the environment and add transitions with initial priorities to a shared centralized replay buffer. A single learner process samples prioritized mini-batches from this shared buffer and updates the network weights. Actors periodically synchronize their network parameters with the learner. This architecture reduced wall-clock training time by factors of two to four compared to single-actor baselines while also improving final performance.
Imagine you are learning to ride a bike. Every time you try, you remember what happened: how you leaned, how you pedaled, and whether you fell or stayed upright. Now imagine you have a big scrapbook where you paste a picture and a note about every single attempt.
Without the scrapbook, you would only remember your very last try. Maybe that last try was a lucky one where you did not fall, so you would not learn much about what to avoid. Or maybe you fell in a weird way that does not happen often, and you would overreact to that one bad memory.
With the scrapbook, you can flip to any random page before your next attempt and review a handful of old memories. Some are from yesterday, some from last week. By studying a mix of different memories instead of just the latest one, you learn faster, you do not forget the lessons from earlier attempts, and you do not overreact to any single ride. That scrapbook is what computer scientists call a replay buffer, and the process of flipping back through it to learn is called experience replay.