An episode in reinforcement learning is a complete sequence of interactions between an agent and its environment, starting from an initial state and ending when a terminal condition is reached. Each episode produces a trajectory of states, actions, and rewards that the agent uses to learn and improve its policy. Episodes are a foundational concept in RL, providing the structure through which agents gain experience, evaluate outcomes, and refine their behavior over time.
Imagine you are playing a video game. You start the level, run around collecting coins, dodge some enemies, and eventually either win the level or lose all your lives. That whole attempt, from start to finish, is one episode. When the level ends, you get to try again from the beginning, but this time you remember what worked and what didn't. Each time you play through the level is a new episode, and you get a little better each time because you learn from your past attempts.
In reinforcement learning, a computer agent does the same thing. It starts in some situation, makes a bunch of decisions, gets feedback (rewards or penalties), and eventually the round ends. Then it starts over, using what it learned to make smarter decisions next time.
In the framework of Markov decision processes (MDPs), an episode is a finite sequence of transitions:
$$S_0, A_0, R_1, S_1, A_1, R_2, \ldots, S_{T-1}, A_{T-1}, R_T, S_T$$
where $S_t$ is the state at time step $t$, $A_t$ is the action the agent takes in that state, $R_{t+1}$ is the reward received as a result, and $T$ is the final time step, at which the terminal state $S_T$ is reached.
The complete sequence is also referred to as a trajectory, denoted $\tau = (S_0, A_0, S_1, A_1, \ldots, S_T)$. Sutton and Barto (1998, 2018) formalize this in their textbook by defining episodic tasks as problems where the agent-environment interaction breaks naturally into subsequences, each of which ends in a special terminal state.
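For illustration, a trajectory can be stored directly as an ordered list of per-step records. The sketch below is one convenient representation, not standard notation; the Transition container and its field names are chosen here purely for clarity.

```python
from typing import Any, List, NamedTuple

class Transition(NamedTuple):
    """One step of an episode: the state S_t, the action A_t, and the reward R_{t+1}."""
    state: Any
    action: Any
    reward: float

# A trajectory tau is the ordered list of transitions plus the terminal state S_T.
trajectory: List[Transition] = [
    Transition(state="S0", action="A0", reward=1.0),
    Transition(state="S1", action="A1", reward=0.0),
]
terminal_state = "S2"  # S_T: the episode ends here and the environment is reset
```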
Every episode consists of several core components that define the agent's interaction with the environment:
| Component | Description |
|---|---|
| Initial state ($S_0$) | The starting configuration of the environment, sampled from distribution $\rho_0$. May be fixed or randomized across episodes. |
| States ($S_t$) | Representations of the environment at each time step, encoding the information the agent needs to make decisions. |
| Actions ($A_t$) | Decisions made by the agent at each time step, selected according to the current policy $\pi$. |
| Rewards ($R_{t+1}$) | Scalar feedback signals received after each action, guiding the agent toward desired behavior. |
| Transitions | The movement from one state to the next, governed by the environment's dynamics $P(s' \mid s, a)$. |
| Terminal state ($S_T$) | A special state that marks the end of the episode. Reaching this state triggers a reset. |
The environment is typically reset after each episode, and the agent begins a new episode from a fresh initial state. This reset mechanism is what distinguishes episodic tasks from continuing tasks.
The return $G_t$ is the cumulative reward the agent seeks to maximize, calculated from time step $t$ to the end of the episode. There are two standard formulations:
Finite-horizon undiscounted return:
$$G_t = R_{t+1} + R_{t+2} + R_{t+3} + \cdots + R_T = \sum_{k=0}^{T-t-1} R_{t+k+1}$$
This simply sums all future rewards until the episode ends. It is well-defined in episodic tasks because $T$ is finite.
Discounted return:
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots + \gamma^{T-t-1} R_T = \sum_{k=0}^{T-t-1} \gamma^k R_{t+k+1}$$
where $\gamma \in [0, 1]$ is the discount factor ($\gamma = 1$ is admissible only when episodes are guaranteed to terminate). The discount factor controls how much the agent values future rewards relative to immediate ones:
| Discount factor ($\gamma$) | Agent behavior |
|---|---|
| $\gamma = 0$ | Myopic: the agent cares only about the immediate reward $R_{t+1}$. |
| $\gamma$ close to 0 (e.g., 0.1) | Short-sighted: the agent heavily discounts future rewards. |
| $\gamma$ close to 1 (e.g., 0.99) | Far-sighted: the agent values future rewards almost as much as immediate ones. |
| $\gamma = 1$ | No discounting: all rewards weighted equally. Only valid when $T$ is finite (episodic tasks). |
In episodic tasks, setting $\gamma = 1$ is sometimes appropriate because the finite episode length guarantees that the return is bounded. In continuing tasks, a discount factor strictly less than 1 is necessary to keep the infinite sum from diverging.
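Both returns can be computed from a recorded list of episode rewards with a single backward pass, as in the following sketch (the function name compute_return is illustrative):

```python
def compute_return(rewards, gamma=0.99):
    """Return G_0 for one episode, given rewards = [R_1, R_2, ..., R_T].

    With gamma = 1.0 this reduces to the undiscounted, finite-horizon return;
    with gamma < 1.0 it is the discounted return.
    """
    g = 0.0
    # Backward recursion: G_t = R_{t+1} + gamma * G_{t+1}
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [0.0, 0.0, 1.0]              # e.g., a single reward at the end of the episode
print(compute_return(rewards, 0.9))    # ~0.81  (= 0.9**2 * 1.0)
print(compute_return(rewards, 1.0))    # 1.0    (undiscounted)
```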
Reinforcement learning problems fall into two broad categories based on whether the interaction has natural endpoints.
| Property | Episodic tasks | Continuing tasks |
|---|---|---|
| Natural endpoint | Yes, episodes end at terminal states | No, interaction continues indefinitely |
| Return calculation | Can use undiscounted ($\gamma = 1$) or discounted returns | Must use discounted returns ($\gamma < 1$) |
| Environment reset | Environment resets between episodes | No resets |
| Examples | Board games, maze navigation, Atari games | Stock trading, process control, server management |
| Monte Carlo methods | Directly applicable | Not directly applicable |
| Temporal difference methods | Applicable | Applicable |
| Episode independence | Each episode is independent of others | No episode boundaries |
Sutton and Barto (2018) introduce a unified notation by defining a special absorbing state that transitions only to itself and generates zero reward. This allows the finite sum in episodic returns to be expressed as an infinite sum, unifying the notation for both task types.
An episode can end in two distinct ways, and the distinction has practical consequences for learning algorithms.
Termination occurs when the agent reaches a genuine terminal state as defined by the MDP. Examples include winning or losing a game, a robot completing its task, or reaching a goal location. When an episode terminates, the future value from that state is truly zero.
Truncation occurs when an episode is cut short by an external constraint that is not part of the MDP, such as a maximum time step limit imposed for practical reasons. In this case, the episode ends, but the agent has not reached a true terminal state; there would have been additional rewards to collect.
This distinction matters for value estimation. As Pardo et al. (2018) demonstrated in their paper "Time Limits in Reinforcement Learning," conflating truncation with termination leads to incorrect value estimates:
| Scenario | Correct handling |
|---|---|
| True termination | Set future value to zero: $Q_{\text{target}} = r_t$ |
| Truncation | Bootstrap from the next state's value: $Q_{\text{target}} = r_t + \gamma \cdot V(s_{t+1})$ |
The Gymnasium library (successor to OpenAI Gym) addressed this by replacing the single done flag with separate terminated and truncated boolean values in its step() API (version 0.26 onward). This allows algorithms to handle each case correctly.
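To make the distinction concrete, here is a minimal sketch of a one-step bootstrapped target that consumes Gymnasium's separate flags. The helper name td_target and the value function v are assumptions made for this example, not part of any library API.

```python
def td_target(reward, next_state, terminated, truncated, v, gamma=0.99):
    """One-step bootstrapped target that distinguishes termination from truncation."""
    if terminated:
        # True terminal state of the MDP: no future reward exists, so do not bootstrap.
        return reward
    if truncated:
        # Time limit hit: the task would have continued, so bootstrap from v(next_state).
        return reward + gamma * v(next_state)
    # Ordinary (non-final) step: standard one-step bootstrapped target.
    return reward + gamma * v(next_state)
```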
Several terms in reinforcement learning are closely related to the concept of an episode and are sometimes used interchangeably, though there are subtle distinctions.
| Term | Definition | Relationship to episode |
|---|---|---|
| Trajectory ($\tau$) | The ordered sequence of states, actions, and rewards: $(s_0, a_0, r_1, s_1, a_1, r_2, \ldots)$ | A trajectory is the data record produced by an episode. In many contexts, the two terms are used synonymously. |
| Rollout | The process of executing a policy in an environment to collect experience | A rollout produces a trajectory, which constitutes one episode. The term emphasizes the act of data collection. |
| Horizon ($H$) | The number of time steps in an episode or planning window | Defines the maximum length of an episode. A fixed horizon means all episodes have the same length. |
| Trial | An informal term for a single attempt at a task | Synonymous with episode in most practical contexts. |
| Time step | A single interaction within an episode where the agent observes, acts, and receives a reward | Episodes are composed of multiple time steps. |
| Epoch | One pass through the entire training dataset | Different from an episode; an epoch may encompass many episodes. |
Episodes play different roles depending on the class of reinforcement learning algorithm being used.
Monte Carlo methods have the strongest dependence on episodes. These methods estimate value functions by averaging the actual returns observed across many complete episodes. Because they use the full return $G_t$ rather than bootstrapping from intermediate estimates, Monte Carlo methods require episodes to terminate before any learning updates can occur.
The value of a state $s$ under policy $\pi$ is estimated as:
$$V(s) \approx \frac{1}{N(s)} \sum_{i=1}^{N(s)} G_t^{(i)}$$
where $N(s)$ is the number of times state $s$ was visited across all episodes, and $G_t^{(i)}$ is the return observed from the $i$-th visit to state $s$.
There are two variants: first-visit Monte Carlo, which counts only the first time $s$ appears in each episode, and every-visit Monte Carlo, which counts every occurrence.
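As an illustration, first-visit Monte Carlo evaluation can be sketched in a few lines, given a batch of completed episodes. The data layout and function name are assumptions made for the example.

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=1.0):
    """Estimate V(s) by averaging first-visit returns over complete episodes.

    episodes: list of (states, rewards) pairs per episode, with
              states = [S_0, ..., S_{T-1}] and rewards = [R_1, ..., R_T].
    """
    returns_sum = defaultdict(float)
    visit_count = defaultdict(int)
    for states, rewards in episodes:
        # Returns-to-go G_t for every step, computed with one backward pass.
        returns, g = [0.0] * len(rewards), 0.0
        for t in reversed(range(len(rewards))):
            g = rewards[t] + gamma * g
            returns[t] = g
        seen = set()
        for t, s in enumerate(states):
            if s in seen:        # first-visit: only the first occurrence of s counts
                continue
            seen.add(s)
            returns_sum[s] += returns[t]
            visit_count[s] += 1
    return {s: returns_sum[s] / visit_count[s] for s in returns_sum}
```

The every-visit variant simply drops the seen check and averages over every occurrence of $s$.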
Temporal difference (TD) learning methods can update value estimates at every time step within an episode, without waiting for the episode to end. TD methods use bootstrapping, updating estimates based on other estimates:
$$V(S_t) \leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right]$$
Because TD methods do not need to wait for episode completion, they can also be applied to continuing tasks. However, in episodic tasks, episode boundaries still serve an important role by defining when the environment resets and providing natural evaluation points.
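A minimal tabular sketch of the update above, applied after every step; the dictionary-based value table and function name are illustrative.

```python
def td0_update(V, s, r, s_next, terminated, alpha=0.1, gamma=0.99):
    """Tabular TD(0) update, applied after every environment step.

    V is a dict mapping states to value estimates. If the episode truly
    terminated, the next state's value is treated as zero.
    """
    v_next = 0.0 if terminated else V.get(s_next, 0.0)
    td_error = r + gamma * v_next - V.get(s, 0.0)
    V[s] = V.get(s, 0.0) + alpha * td_error
    return V
```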
The REINFORCE algorithm, introduced by Williams (1992), is a Monte Carlo policy gradient method that updates policy parameters after each complete episode. The policy gradient is estimated as:
$$\nabla_\theta J(\theta) \approx \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(A_t \mid S_t) \cdot G_t$$
Because REINFORCE uses the full episode return $G_t$, it must wait for the episode to finish before updating. This episode-level update leads to high variance in gradient estimates, which is a known limitation of the algorithm. Actor-critic methods address this by using TD-style bootstrapping to reduce variance while still operating within the episodic framework.
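As a concrete illustration, here is a minimal NumPy sketch of the per-episode REINFORCE gradient for a softmax policy with a linear feature mapping. The policy parameterization and data layout are assumptions made for this example, not part of Williams' original presentation.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # numerical stability
    e = np.exp(z)
    return e / e.sum()

def reinforce_gradient(theta, episode, gamma=1.0):
    """Estimate the policy gradient from one completed episode.

    theta:   array of shape (n_actions, n_features); the policy is
             pi(a | s) = softmax(theta @ phi_s).
    episode: list of (phi_s, a, r) tuples, where phi_s is the feature vector
             of S_t, a is the index of A_t, and r is R_{t+1}.
    """
    # Returns-to-go G_t, computed backwards over the episode.
    returns, g = [], 0.0
    for _, _, r in reversed(episode):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()

    grad = np.zeros_like(theta)
    for (phi_s, a, _), g_t in zip(episode, returns):
        probs = softmax(theta @ phi_s)
        one_hot = np.zeros_like(probs)
        one_hot[a] = 1.0
        # For a softmax policy, grad_theta log pi(a|s) = outer(one_hot(a) - pi(.|s), phi(s)).
        grad += np.outer(one_hot - probs, phi_s) * g_t
    return grad
```

The parameters are then nudged by gradient ascent after each episode, e.g. theta += alpha * reinforce_gradient(theta, episode).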
In Deep Q-Network (DQN) algorithms, individual transitions $(s_t, a_t, r_{t+1}, s_{t+1})$ from many different episodes are stored in a replay buffer. During training, random minibatches of transitions are sampled from this buffer, breaking the temporal correlations between consecutive time steps within an episode. This technique, called experience replay, was a key innovation in the original DQN paper by Mnih et al. (2015).
While experience replay decouples learning from the sequential structure of individual episodes, episodes still serve as the organizing unit for data collection: the agent runs episodes to fill the replay buffer.
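A minimal sketch of such a buffer (the class name, ring-buffer layout, and capacity are illustrative, not taken from any particular implementation):

```python
import random

class ReplayBuffer:
    """Pools individual transitions from many episodes and samples them uniformly."""

    def __init__(self, capacity=100_000):
        self.capacity = capacity
        self.buffer = []      # ring buffer of transitions
        self.position = 0

    def add(self, state, action, reward, next_state, terminated):
        # Transitions from different episodes are mixed together in one pool;
        # once the buffer is full, the oldest transition is overwritten.
        transition = (state, action, reward, next_state, terminated)
        if len(self.buffer) < self.capacity:
            self.buffer.append(transition)
        else:
            self.buffer[self.position] = transition
        self.position = (self.position + 1) % self.capacity

    def sample(self, batch_size):
        # Uniform random sampling breaks the temporal correlation between
        # consecutive steps of the same episode.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```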
The design of episodes, including their length, initial state distribution, and termination conditions, significantly affects learning performance.
Episode length influences the tradeoff between exploration and exploitation: longer episodes give the agent more opportunity to explore far from the initial state and to observe the delayed consequences of its actions, while shorter episodes provide more frequent resets and a faster stream of complete returns to learn from.
Many practical implementations use a maximum episode length (time limit) to prevent episodes from running indefinitely, especially in environments where the agent might get stuck.
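In Gymnasium, for example, such a limit can be imposed with the TimeLimit wrapper; reaching the cap ends the episode with truncated=True rather than terminated=True. A minimal sketch (CartPole-v1 is just an example environment):

```python
import gymnasium as gym
from gymnasium.wrappers import TimeLimit

# Cap every episode at 200 steps. Hitting the cap ends the episode with
# truncated=True, signaling a time limit rather than a true terminal state.
env = TimeLimit(gym.make("CartPole-v1"), max_episode_steps=200)
```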
Randomizing the initial state across episodes helps the agent learn a more general policy. If the agent always starts from the same state, it may overfit to that specific starting configuration. The Gymnasium reset() method supports seeded randomization of initial states for reproducibility.
In curriculum learning, the difficulty of episodes is gradually increased over training. For example, a robot learning to walk might begin with short, easy episodes on flat terrain and progress to longer episodes on rough terrain. This staged approach can accelerate convergence compared to training on the full-difficulty task from the start.
In modern RL frameworks such as Gymnasium (the maintained fork of OpenAI Gym), episodes are managed through a standardized API.
A typical episode loop follows this pattern (a runnable sketch appears after the metrics table below):

1. Call env.reset() to initialize the environment and get the initial observation
2. Select an action according to the current policy
3. Call env.step(action) to execute the action
4. Receive the next observation, the reward, the terminated flag, the truncated flag, and the info dict
5. If terminated or truncated, break out of the loop; otherwise return to step 2

During training, key per-episode metrics are typically logged:
| Metric | Description |
|---|---|
| Episode return | Total (possibly discounted) reward accumulated during the episode |
| Episode length | Number of time steps in the episode |
| Success rate | Whether the agent achieved its goal (for goal-conditioned tasks) |
| Average reward per step | Episode return divided by episode length |
These metrics, tracked over hundreds or thousands of episodes, form the learning curve that indicates whether the agent is improving.
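Putting the loop and the per-episode logging together, a minimal sketch using Gymnasium might look like the following; CartPole-v1 and the random action choice are placeholders for a real environment and a learned policy.

```python
import gymnasium as gym

env = gym.make("CartPole-v1")

for episode in range(10):
    obs, info = env.reset(seed=episode)       # seeded reset for reproducibility
    episode_return, episode_length = 0.0, 0
    terminated = truncated = False

    while not (terminated or truncated):
        action = env.action_space.sample()    # placeholder for a learned policy pi(a|s)
        obs, reward, terminated, truncated, info = env.step(action)
        episode_return += reward
        episode_length += 1

    print(f"episode {episode}: return={episode_return:.1f} length={episode_length}")

env.close()
```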
The concept of an episode manifests differently depending on the application domain.
| Domain | What constitutes one episode | Terminal condition | Typical episode length |
|---|---|---|---|
| Board games (chess, Go) | One complete game from start to finish | Win, loss, or draw | Variable (tens to hundreds of moves) |
| Atari games (DQN) | One game life or full game | Game over or life lost | Variable (hundreds to thousands of frames) |
| Robotic manipulation | One attempt to grasp or place an object | Object placed, dropped, or time limit reached | Fixed (typically 50-200 steps) |
| Autonomous driving | One driving scenario or route | Destination reached, collision, or time limit | Variable (seconds to minutes of simulated time) |
| Dialogue systems | One complete conversation | User ends conversation or turn limit reached | Variable (5-50 turns) |
| Navigation tasks | One attempt to reach a goal position | Goal reached or maximum steps exceeded | Variable |
The concept of an episode in reinforcement learning has roots in the early work on dynamic programming and optimal control theory. Richard Bellman's work on sequential decision processes in the 1950s laid the groundwork by formalizing finite-horizon decision problems. The term "episode" became standard in the RL literature through Sutton and Barto's influential textbook "Reinforcement Learning: An Introduction" (first edition 1998, second edition 2018), which systematically distinguished episodic tasks from continuing tasks.
The practical significance of episodes grew with the development of Monte Carlo methods for RL (first formalized by Sutton and Barto drawing on classical Monte Carlo simulation techniques) and later with policy gradient methods like REINFORCE (Williams, 1992). The advent of deep reinforcement learning, particularly DQN (Mnih et al., 2013, 2015) and AlphaGo (Silver et al., 2016), brought renewed attention to episode design and management as agents were trained over millions of episodes on complex tasks.