# Episode (Reinforcement Learning)

> Source: https://aiwiki.ai/wiki/episode
> Updated: 2026-04-26
> Categories: Machine Learning, Reinforcement Learning
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

An **episode** in [reinforcement learning](/wiki/reinforcement_learning) is a complete sequence of interactions between an [agent](/wiki/agent) and its [environment](/wiki/environment), starting from an initial state and ending when a terminal condition is reached. Each episode produces a trajectory of states, actions, and [rewards](/wiki/reward) that the agent uses to learn and improve its [policy](/wiki/policy). Episodes are a foundational concept in RL, providing the structure through which agents gain experience, evaluate outcomes, and refine their behavior over time.

## ELI5 (Explain like I'm 5)

Imagine you are playing a video game. You start the level, run around collecting coins, dodge some enemies, and eventually either win the level or lose all your lives. That whole attempt, from start to finish, is one episode. When the level ends, you get to try again from the beginning, but this time you remember what worked and what didn't. Each time you play through the level is a new episode, and you get a little better each time because you learn from your past attempts.

In reinforcement learning, a computer agent does the same thing. It starts in some situation, makes a bunch of decisions, gets feedback (rewards or penalties), and eventually the round ends. Then it starts over, using what it learned to make smarter decisions next time.

## Formal definition

In the framework of [Markov decision processes](/wiki/markov_decision_process_mdp) (MDPs), an episode is a finite sequence of transitions:

$$S_0, A_0, R_1, S_1, A_1, R_2, \ldots, S_{T-1}, A_{T-1}, R_T, S_T$$

where:

- $S_0$ is the initial state, sampled from a start-state distribution $\rho_0$
- $A_t$ is the action taken by the agent at time step $t$, chosen according to its policy $\pi(a|s)$
- $R_{t+1}$ is the reward received after taking action $A_t$ in state $S_t$
- $S_T$ is the terminal state, at which point the episode ends
- $T$ is the length of the episode (which may vary across episodes)

The complete sequence is also referred to as a **trajectory**, often written compactly as $\tau = (S_0, A_0, S_1, A_1, \ldots, S_T)$ with the rewards left implicit. Sutton and Barto (1998, 2018) formalize this in their textbook by defining episodic tasks as problems where the agent-environment interaction breaks naturally into subsequences, each of which ends in a special terminal state.

## Structure of an episode

Every episode consists of several core components that define the agent's interaction with the environment:

| Component | Description |
|---|---|
| Initial state ($S_0$) | The starting configuration of the environment, sampled from distribution $\rho_0$. May be fixed or randomized across episodes. |
| States ($S_t$) | Representations of the environment at each time step, encoding the information the agent needs to make decisions. |
| Actions ($A_t$) | Decisions made by the agent at each time step, selected according to the current policy $\pi$. |
| Rewards ($R_{t+1}$) | Scalar feedback signals received after each action, guiding the agent toward desired behavior. |
| Transitions | The movement from one state to the next, governed by the environment's dynamics $P(s'|s, a)$. |
| Terminal state ($S_T$) | A special state that marks the end of the episode. Reaching this state triggers a reset. |

The environment is typically reset after each episode, and the agent begins a new episode from a fresh initial state. This reset mechanism is what distinguishes episodic tasks from continuing tasks.

## Return and the discount factor

The **return** $G_t$ is the cumulative reward the agent seeks to maximize, calculated from time step $t$ to the end of the episode. There are two standard formulations:

**Finite-horizon undiscounted return:**

$$G_t = R_{t+1} + R_{t+2} + R_{t+3} + \cdots + R_T = \sum_{k=0}^{T-t-1} R_{t+k+1}$$

This simply sums all future rewards until the episode ends. It is well-defined in episodic tasks because $T$ is finite.

**Discounted return:**

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{T-t-1} \gamma^k R_{t+k+1}$$

where $\gamma \in [0, 1]$ is the [discount factor](/wiki/gamma), with $\gamma = 1$ permitted only when the episode length $T$ is finite. The discount factor controls how much the agent values future rewards relative to immediate ones:

| Discount factor ($\gamma$) | Agent behavior |
|---|---|
| $\gamma = 0$ | Myopic: the agent cares only about the immediate reward $R_{t+1}$. |
| $\gamma$ close to 0 (e.g., 0.1) | Short-sighted: the agent heavily discounts future rewards. |
| $\gamma$ close to 1 (e.g., 0.99) | Far-sighted: the agent values future rewards almost as much as immediate ones. |
| $\gamma = 1$ | No discounting: all rewards weighted equally. Only valid when $T$ is finite (episodic tasks). |

In episodic tasks, setting $\gamma = 1$ is sometimes appropriate because the finite episode length guarantees that the return is bounded. In continuing tasks, a discount factor strictly less than 1 is necessary to keep the infinite sum from diverging.
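Both return formulations can be computed in a single backward pass over an episode's rewards, using the recursion $G_t = R_{t+1} + \gamma G_{t+1}$. The following is a minimal sketch; the function name `discounted_returns` and the sample rewards are illustrative assumptions, not part of any library.

```python
from typing import List


def discounted_returns(rewards: List[float], gamma: float) -> List[float]:
    """Return [G_0, ..., G_{T-1}] for one finished episode.

    rewards[t] holds R_{t+1}, the reward that follows action A_t.
    """
    returns = [0.0] * len(rewards)
    running = 0.0  # nothing is collected after the terminal state
    # Work backwards using G_t = R_{t+1} + gamma * G_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Hypothetical three-step episode.
episode_rewards = [1.0, 0.0, 2.0]
undiscounted = discounted_returns(episode_rewards, gamma=1.0)  # [3.0, 2.0, 2.0]
discounted = discounted_returns(episode_rewards, gamma=0.5)    # [1.5, 1.0, 2.0]
```

With $\gamma = 1$ the function reduces to the undiscounted sum, which is safe here only because the reward list is finite.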

## Episodic vs. continuing tasks

Reinforcement learning problems fall into two broad categories based on whether the interaction has natural endpoints.

| Property | Episodic tasks | Continuing tasks |
|---|---|---|
| Natural endpoint | Yes, episodes end at terminal states | No, interaction continues indefinitely |
| Return calculation | Can use undiscounted ($\gamma = 1$) or discounted returns | Must use discounted returns ($\gamma < 1$) |
| Environment reset | Environment resets between episodes | No resets |
| Examples | Board games, maze navigation, Atari games | Stock trading, process control, server management |
| [Monte Carlo methods](/wiki/monte_carlo) | Directly applicable | Not directly applicable |
| [Temporal difference](/wiki/temporal_difference) methods | Applicable | Applicable |
| Episode independence | Given the policy and start-state distribution, each episode is independent of the others | No episode boundaries |

Sutton and Barto (2018) introduce a unified notation by defining a special **absorbing state** that transitions only to itself and generates zero reward. This allows the finite sum in episodic returns to be expressed as an infinite sum, unifying the notation for both task types.
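Written out in the notation of the return formulas above, and assuming every reward after time $T$ is zero because the agent sits in the absorbing state, the unified return can be sketched as:

$$G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}, \qquad R_{t+k+1} = 0 \text{ whenever } t + k + 1 > T$$

For an episodic task the terms beyond $k = T - t - 1$ vanish, recovering the finite sum. The expression is well-defined as long as either $T$ is finite or $\gamma < 1$; the case $T = \infty$ with $\gamma = 1$ is excluded.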

## Termination vs. truncation

An episode can end in two distinct ways, and the distinction has practical consequences for learning algorithms.

**Termination** occurs when the agent reaches a genuine terminal state as defined by the MDP. Examples include winning or losing a game, a robot completing its task, or reaching a goal location. When an episode terminates, the future value from that state is truly zero.

**Truncation** occurs when an episode is cut short by an external constraint that is not part of the MDP, such as a maximum time step limit imposed for practical reasons. In this case, the episode ends, but the agent has not reached a true terminal state; there would have been additional rewards to collect.

This distinction matters for value estimation. As Pardo et al. (2018) demonstrated in their paper "Time Limits in Reinforcement Learning," conflating truncation with termination leads to incorrect value estimates:

| Scenario | Correct handling |
|---|---|
| True termination | Set future value to zero: $Q_{\text{target}} = r_{t+1}$ |
| Truncation | Bootstrap from the next state's value: $Q_{\text{target}} = r_{t+1} + \gamma \cdot V(s_{t+1})$ |

The Gymnasium library (successor to OpenAI [Gym](/wiki/openai)) addressed this by replacing the single `done` flag with separate `terminated` and `truncated` boolean values in its `step()` API (version 0.26 onward). This allows algorithms to handle each case correctly.
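The two cases in the table can be captured in a small helper. This is a sketch under stated assumptions: `td_target` is a hypothetical function name, and `next_value` stands for whatever estimate of $V(s_{t+1})$ the algorithm maintains. Note that only the `terminated` flag affects the target; `truncated` merely signals that the environment must be reset.

```python
def td_target(reward: float, next_value: float, gamma: float, terminated: bool) -> float:
    """One-step bootstrapped target.

    Only true termination zeroes the bootstrap term. A truncated
    transition still bootstraps from the next state's value estimate.
    """
    if terminated:
        return reward
    return reward + gamma * next_value


# Same reward and next-state estimate, different episode endings.
target_terminated = td_target(1.0, next_value=5.0, gamma=0.5, terminated=True)   # 1.0
target_truncated = td_target(1.0, next_value=5.0, gamma=0.5, terminated=False)   # 3.5
```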

## Related terminology

Several terms in reinforcement learning are closely related to the concept of an episode and are sometimes used interchangeably, though there are subtle distinctions.

| Term | Definition | Relationship to episode |
|---|---|---|
| Trajectory ($\tau$) | The ordered sequence of states, actions, and rewards: $(s_0, a_0, r_1, s_1, a_1, r_2, \ldots)$ | A trajectory is the data record produced by an episode. In many contexts, the two terms are used synonymously. |
| Rollout | The process of executing a policy in an environment to collect experience | A rollout produces a trajectory, which constitutes one episode. The term emphasizes the act of data collection. |
| Horizon ($H$) | The number of time steps in an episode or planning window | Defines the maximum length of an episode. A fixed horizon means all episodes have the same length. |
| Trial | An informal term for a single attempt at a task | Synonymous with episode in most practical contexts. |
| Time step | A single interaction within an episode where the agent observes, acts, and receives a reward | Episodes are composed of multiple time steps. |
| [Epoch](/wiki/epoch) | One pass through the entire training dataset | Different from an episode; an epoch may encompass many episodes. |

## Role of episodes in learning algorithms

Episodes play different roles depending on the class of reinforcement learning algorithm being used.

### Monte Carlo methods

[Monte Carlo](/wiki/monte_carlo) methods have the strongest dependence on episodes. These methods estimate value functions by averaging the actual returns observed across many complete episodes. Because they use the full return $G_t$ rather than bootstrapping from intermediate estimates, Monte Carlo methods require episodes to terminate before any learning updates can occur.

The value of a state $s$ under policy $\pi$ is estimated as:

$$V(s) \approx \frac{1}{N(s)} \sum_{i=1}^{N(s)} G_t^{(i)}$$

where $N(s)$ is the number of counted visits to state $s$ across all episodes, and $G_t^{(i)}$ is the return observed from the $i$-th counted visit to state $s$.

There are two variants: **first-visit Monte Carlo**, which counts only the first time $s$ appears in each episode, and **every-visit Monte Carlo**, which counts every occurrence.
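A minimal sketch of first-visit Monte Carlo prediction is shown below. It assumes each episode has already finished and is stored as a list of `(state, reward)` pairs, where the reward is $R_{t+1}$; the names `first_visit_mc` and `Episode` are illustrative, not taken from a library.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Tuple

Episode = List[Tuple[Hashable, float]]  # (S_t, R_{t+1}) pairs


def first_visit_mc(episodes: List[Episode], gamma: float = 1.0) -> Dict[Hashable, float]:
    """Estimate V(s) by averaging returns from the first visit to s in each episode."""
    returns_sum = defaultdict(float)
    visit_count = defaultdict(int)
    for episode in episodes:
        # Compute G_t for every step by sweeping backwards.
        returns = [0.0] * len(episode)
        g = 0.0
        for t in reversed(range(len(episode))):
            g = episode[t][1] + gamma * g
            returns[t] = g
        # Count only the first occurrence of each state in this episode.
        seen = set()
        for t, (state, _) in enumerate(episode):
            if state in seen:
                continue
            seen.add(state)
            returns_sum[state] += returns[t]
            visit_count[state] += 1
    return {s: returns_sum[s] / visit_count[s] for s in returns_sum}


# Two hypothetical episodes over states "A" and "B".
episodes = [
    [("A", 1.0), ("B", 0.0), ("A", 2.0)],
    [("B", 0.0), ("A", 1.0)],
]
values = first_visit_mc(episodes)  # {"A": 2.0, "B": 1.5}
```

Removing the `seen` check would turn this into the every-visit variant, in which the second occurrence of `"A"` in the first episode would also contribute its return.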

### Temporal difference learning

[Temporal difference](/wiki/temporal_difference) (TD) learning methods can update value estimates at every time step within an episode, without waiting for the episode to end. TD methods use bootstrapping, updating estimates based on other estimates:

$$V(S_t) \leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right]$$

Because TD methods do not need to wait for episode completion, they can also be applied to continuing tasks. However, in episodic tasks, episode boundaries still serve an important role by defining when the environment resets and providing natural evaluation points.
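The update rule can be applied transition by transition, as the following sketch of tabular TD(0) illustrates. The function name `td0_update`, the dictionary-based value table, and the two-state episode are assumptions made for the example.

```python
from typing import Dict, Tuple

Transition = Tuple[str, float, str, bool]  # (S_t, R_{t+1}, S_{t+1}, terminated)


def td0_update(values: Dict[str, float], transition: Transition,
               alpha: float, gamma: float) -> float:
    """Apply one TD(0) update in place and return the TD error."""
    state, reward, next_state, terminated = transition
    # A terminal next state has value zero by definition.
    bootstrap = 0.0 if terminated else values.get(next_state, 0.0)
    td_error = reward + gamma * bootstrap - values.get(state, 0.0)
    values[state] = values.get(state, 0.0) + alpha * td_error
    return td_error


# Hypothetical episode A -> B -> END, replayed twice.
values: Dict[str, float] = {}
episode = [("A", 0.0, "B", False), ("B", 1.0, "END", True)]
for _ in range(2):
    for transition in episode:
        td0_update(values, transition, alpha=0.5, gamma=1.0)
# values is now {"A": 0.25, "B": 0.75}
```

Each update happens immediately after its transition, so the estimate for `"A"` starts to move on the second pass, as soon as `"B"` has a nonzero value to bootstrap from.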

### Policy gradient methods

The [REINFORCE](/wiki/reinforce) algorithm, introduced by Williams (1992), is a Monte Carlo policy gradient method that updates policy parameters after each complete episode. The policy gradient is estimated as:

$$\nabla J(\theta) \approx \sum_{t=0}^{T-1} \nabla \log \pi_\theta(A_t | S_t) \cdot G_t$$

Because REINFORCE uses the full episode return $G_t$, it must wait for the episode to finish before updating. Monte Carlo returns accumulate the randomness of every remaining action and transition in the episode, so the resulting gradient estimates have high variance, which is a known limitation of the algorithm. Actor-critic methods address this by using TD-style bootstrapping from a learned value function, which reduces variance at the cost of some bias, while still operating within the episodic framework.
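For a tabular softmax policy the gradient estimate has a simple closed form, since the derivative of $\log \pi_\theta(a|s)$ with respect to the logit of action $a'$ is $\mathbb{1}[a' = a] - \pi_\theta(a'|s)$. The sketch below assumes such a policy, stored as a dictionary of per-state logits; `reinforce_gradient` and the one-step episode are hypothetical and only illustrate the formula above.

```python
import math
from typing import Dict, List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_gradient(theta: Dict[str, List[float]],
                       episode: List[Tuple[str, int, float]],
                       gamma: float = 1.0) -> Dict[str, List[float]]:
    """Sum of grad log pi(A_t | S_t) * G_t over one finished episode.

    episode holds (S_t, A_t, R_{t+1}) triples.
    """
    grad = {s: [0.0] * len(logits) for s, logits in theta.items()}
    g = 0.0
    # Sweep backwards so G_t is available at each step.
    for state, action, reward in reversed(episode):
        g = reward + gamma * g
        probs = softmax(theta[state])
        for a, p in enumerate(probs):
            indicator = 1.0 if a == action else 0.0
            grad[state][a] += (indicator - p) * g
    return grad


# One state, two actions, uniform initial policy; action 1 earned reward 1.
theta = {"s0": [0.0, 0.0]}
grad = reinforce_gradient(theta, [("s0", 1, 1.0)])
# Gradient ascent step with a hypothetical learning rate of 0.1.
theta = {s: [w + 0.1 * d for w, d in zip(theta[s], grad[s])] for s in theta}
```

After the step, the logit of the rewarded action rises and the other falls, so the policy becomes more likely to repeat the action that led to a positive return.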

### Deep Q-Networks and experience replay

In [Deep Q-Network](/wiki/deep_q-network_dqn) (DQN) algorithms, individual transitions $(s_t, a_t, r_{t+1}, s_{t+1})$ from many different episodes are stored in a [replay buffer](/wiki/replay_buffer). During training, random minibatches of transitions are sampled from this buffer, breaking the temporal correlations between consecutive time steps within an episode. This technique, called [experience replay](/wiki/experience_replay), was a key ingredient of DQN (Mnih et al., 2013, 2015).

While experience replay decouples learning from the sequential structure of individual episodes, episodes still serve as the organizing unit for data collection: the agent runs episodes to fill the replay buffer.
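A replay buffer can be sketched as a fixed-capacity queue with uniform random sampling. The class below is an illustrative assumption rather than the implementation used in any particular DQN codebase; it also stores the `terminated` flag alongside each transition so that targets can be computed correctly later.

```python
import random
from collections import deque
from typing import Any, List, Optional, Tuple

Transition = Tuple[Any, Any, float, Any, bool]


class ReplayBuffer:
    """Fixed-size buffer that evicts the oldest transition when full."""

    def __init__(self, capacity: int, seed: Optional[int] = None) -> None:
        self._storage: deque = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def add(self, state: Any, action: Any, reward: float,
            next_state: Any, terminated: bool) -> None:
        self._storage.append((state, action, reward, next_state, terminated))

    def sample(self, batch_size: int) -> List[Transition]:
        # Uniform sampling without replacement breaks temporal ordering.
        return self._rng.sample(list(self._storage), batch_size)

    def __len__(self) -> int:
        return len(self._storage)


# Five transitions pushed into a buffer of capacity three.
buffer = ReplayBuffer(capacity=3, seed=0)
for step in range(5):
    buffer.add(state=step, action=0, reward=0.0,
               next_state=step + 1, terminated=(step == 4))
batch = buffer.sample(batch_size=2)
```

Only the three most recent transitions survive, and a sampled batch may mix transitions from different episodes once several episodes have been collected.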

## Episode design considerations

The design of episodes, including their length, initial state distribution, and termination conditions, significantly affects learning performance.

### Episode length

Episode length trades off how many episodes the agent can experience against how far into the task each episode reaches:

- **Short episodes** allow the agent to attempt more episodes within a fixed training budget, promoting broader exploration of different initial states and early trajectories. However, they may prevent the agent from learning long-horizon behaviors.
- **Long episodes** give the agent more time to discover long-term strategies but reduce the number of episodes the agent can experience, potentially slowing early learning.

Many practical implementations use a maximum episode length (time limit) to prevent episodes from running indefinitely, especially in environments where the agent might get stuck.

### Initial state distribution

Randomizing the initial state across episodes helps the agent learn a more general policy. If the agent always starts from the same state, it may overfit to that specific starting configuration. The Gymnasium `reset()` method supports seeded randomization of initial states for reproducibility.

### Curriculum learning

In [curriculum learning](/wiki/curriculum_learning), the difficulty of episodes is gradually increased over training. For example, a robot learning to walk might begin with short, easy episodes on flat terrain and progress to longer episodes on rough terrain. This staged approach can accelerate convergence compared to training on the full-difficulty task from the start.

## Practical implementation

In modern RL frameworks such as Gymnasium (the maintained fork of OpenAI Gym), episodes are managed through a standardized API.

### The episode loop

A typical episode loop follows this pattern:

1. Call `env.reset()` to initialize the environment and get the initial observation
2. Repeat:
   - Select an action based on the current observation and policy
   - Call `env.step(action)` to execute the action
   - Receive the next observation, reward, `terminated` flag, `truncated` flag, and info dict
   - Store the transition for learning
   - If `terminated` or `truncated`, break out of the loop
3. Update the agent's policy using the collected experience
4. Repeat from step 1 for the next episode
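The steps above can be sketched in Python. To keep the example self-contained, it uses a hypothetical toy environment, `CorridorEnv`, that only mimics the return shapes of the Gymnasium API (`reset()` returning an observation and info dict, `step()` returning a five-element tuple); it is not part of Gymnasium, and a real environment would be substituted in practice.

```python
from typing import Callable, List, Tuple


class CorridorEnv:
    """Hypothetical toy environment: walk right along a corridor to reach the goal."""

    def __init__(self, length: int = 5, max_steps: int = 20) -> None:
        self.length = length
        self.max_steps = max_steps
        self.position = 0
        self.steps = 0

    def reset(self) -> Tuple[int, dict]:
        self.position = 0
        self.steps = 0
        return self.position, {}

    def step(self, action: int) -> Tuple[int, float, bool, bool, dict]:
        # Action 1 moves right; any other action moves left (never below 0).
        self.position = max(0, self.position + (1 if action == 1 else -1))
        self.steps += 1
        terminated = self.position == self.length - 1        # goal reached
        truncated = not terminated and self.steps >= self.max_steps  # time limit
        reward = 1.0 if terminated else 0.0
        return self.position, reward, terminated, truncated, {}


def run_episode(env: CorridorEnv, policy: Callable[[int], int]) -> List[tuple]:
    """Run one episode and return the collected transitions."""
    observation, info = env.reset()
    transitions = []
    while True:
        action = policy(observation)
        next_observation, reward, terminated, truncated, info = env.step(action)
        transitions.append((observation, action, reward, next_observation, terminated))
        observation = next_observation
        if terminated or truncated:
            break
    return transitions


env = CorridorEnv()
for episode_index in range(3):
    transitions = run_episode(env, policy=lambda obs: 1)  # always move right
    episode_return = sum(t[2] for t in transitions)
    # A learning algorithm would update the policy from `transitions` here.
    print(f"episode {episode_index}: length={len(transitions)}, return={episode_return}")
```

Storing `terminated` rather than a combined done flag keeps the information needed to treat time-limit truncation differently from true termination when targets are computed.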

### Episode tracking and logging

During training, key per-episode metrics are typically logged:

| Metric | Description |
|---|---|
| Episode return | Total (possibly discounted) reward accumulated during the episode |
| Episode length | Number of time steps in the episode |
| Success rate | Fraction of recent episodes in which the agent achieved its goal (for goal-conditioned tasks) |
| Average reward per step | Episode return divided by episode length |

These metrics, tracked over hundreds or thousands of episodes, form the learning curve that indicates whether the agent is improving.
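One way to track these metrics is a small logger that keeps a sliding window of recent episodes. The class name `EpisodeLogger`, the window size, and the summary keys are assumptions chosen for this sketch; it also assumes every recorded episode contains at least one step.

```python
from collections import deque
from statistics import mean
from typing import Dict, List


class EpisodeLogger:
    """Keeps per-episode metrics for the most recent `window` episodes."""

    def __init__(self, window: int = 100) -> None:
        self.returns: deque = deque(maxlen=window)
        self.lengths: deque = deque(maxlen=window)
        self.successes: deque = deque(maxlen=window)
        self.rewards_per_step: deque = deque(maxlen=window)

    def record(self, rewards: List[float], success: bool) -> None:
        episode_return = sum(rewards)  # undiscounted return
        self.returns.append(episode_return)
        self.lengths.append(len(rewards))
        self.successes.append(1.0 if success else 0.0)
        self.rewards_per_step.append(episode_return / len(rewards))

    def summary(self) -> Dict[str, float]:
        """Window averages; plotted over training they form the learning curve."""
        return {
            "mean_return": mean(self.returns),
            "mean_length": mean(self.lengths),
            "success_rate": mean(self.successes),
            "mean_reward_per_step": mean(self.rewards_per_step),
        }


# Two hypothetical episodes: one success in 3 steps, one failure in 5 steps.
logger = EpisodeLogger(window=100)
logger.record([0.0, 0.0, 1.0], success=True)
logger.record([0.0, 0.0, 0.0, 0.0, 0.0], success=False)
stats = logger.summary()
```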

## Examples of episodes across domains

The concept of an episode manifests differently depending on the application domain.

| Domain | What constitutes one episode | Terminal condition | Typical episode length |
|---|---|---|---|
| Board games (chess, [Go](/wiki/alphazero)) | One complete game from start to finish | Win, loss, or draw | Variable (tens to hundreds of moves) |
| Atari games ([DQN](/wiki/deep_q-network_dqn)) | One game life or full game | Game over or life lost | Variable (hundreds to thousands of frames) |
| Robotic manipulation | One attempt to grasp or place an object | Object placed, dropped, or time limit reached | Fixed (typically 50-200 steps) |
| [Autonomous driving](/wiki/autonomous_driving) | One driving scenario or route | Destination reached, collision, or time limit | Variable (seconds to minutes of simulated time) |
| Dialogue systems | One complete conversation | User ends conversation or turn limit reached | Variable (5-50 turns) |
| [Navigation](/wiki/environment) tasks | One attempt to reach a goal position | Goal reached or maximum steps exceeded | Variable |

## Historical context

The concept of an episode in reinforcement learning has roots in the early work on dynamic programming and optimal control theory. Richard Bellman's work on sequential decision processes in the 1950s laid the groundwork by formalizing finite-horizon decision problems. The term "episode" became standard in the RL literature through Sutton and Barto's influential textbook "Reinforcement Learning: An Introduction" (first edition 1998, second edition 2018), which systematically distinguished episodic tasks from continuing tasks.

The practical significance of episodes grew with the development of Monte Carlo methods for RL (first formalized by Sutton and Barto drawing on classical Monte Carlo simulation techniques) and later with policy gradient methods like REINFORCE (Williams, 1992). The advent of [deep reinforcement learning](/wiki/deep_learning), particularly DQN (Mnih et al., 2013, 2015) and [AlphaGo](/wiki/alphazero) (Silver et al., 2016), brought renewed attention to episode design and management as agents were trained over millions of episodes on complex tasks.

## References

1. Sutton, R. S., & Barto, A. G. (2018). *Reinforcement Learning: An Introduction* (2nd ed.). MIT Press. http://incompleteideas.net/book/the-book-2nd.html
2. Mnih, V., Kavukcuoglu, K., Silver, D., et al. (2015). Human-level control through deep reinforcement learning. *Nature*, 518(7540), 529-533.
3. Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. *Machine Learning*, 8(3-4), 229-256.
4. Pardo, F., Tavakoli, A., Levdik, V., & Kormushev, P. (2018). Time limits in reinforcement learning. *Proceedings of the 35th International Conference on Machine Learning (ICML)*.
5. Silver, D., Huang, A., Maddison, C. J., et al. (2016). Mastering the game of Go with deep neural networks and tree search. *Nature*, 529(7587), 484-489.
6. Mnih, V., Kavukcuoglu, K., Silver, D., et al. (2013). Playing Atari with deep reinforcement learning. *arXiv preprint arXiv:1312.5602*.
7. Brockman, G., Cheung, V., Pettersson, L., et al. (2016). OpenAI Gym. *arXiv preprint arXiv:1606.01540*.
8. Towers, M., Terry, J. K., Kwiatkowski, A., et al. (2023). Gymnasium. https://gymnasium.farama.org/
9. Bellman, R. (1957). *Dynamic Programming*. Princeton University Press.
10. Puterman, M. L. (2014). *Markov Decision Processes: Discrete Stochastic Dynamic Programming*. John Wiley & Sons.
11. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal policy optimization algorithms. *arXiv preprint arXiv:1707.06347*.
12. Haarnoja, T., Zhou, A., Abbeel, P., & Levine, S. (2018). Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. *Proceedings of the 35th International Conference on Machine Learning (ICML)*.
