An episode in reinforcement learning is a complete sequence of interactions between an agent and its environment, starting from an initial state and ending when a terminal condition is reached. Each episode produces a trajectory of states, actions, and rewards that the agent uses to learn and improve its policy. Episodes are a foundational concept in RL, providing the structure through which agents gain experience, evaluate outcomes, and refine their behavior over time.
Imagine you are playing a video game. You start the level, run around collecting coins, dodge some enemies, and eventually either win the level or lose all your lives. That whole attempt, from start to finish, is one episode. When the level ends, you get to try again from the beginning, but this time you remember what worked and what didn't. Each time you play through the level is a new episode, and you get a little better each time because you learn from your past attempts.
In reinforcement learning, a computer agent does the same thing. It starts in some situation, makes a bunch of decisions, gets feedback (rewards or penalties), and eventually the round ends. Then it starts over, using what it learned to make smarter decisions next time.
In the framework of Markov decision processes (MDPs), an episode is a finite sequence of transitions:
$$S_0, A_0, R_1, S_1, A_1, R_2, \ldots, S_{T-1}, A_{T-1}, R_T, S_T$$
where $S_t$ is the state at time step $t$, $A_t$ is the action the agent takes in that state, $R_{t+1}$ is the reward received as a result, and $T$ is the final time step, at which the terminal state $S_T$ is reached.
The complete sequence is also referred to as a trajectory, denoted $\tau = (S_0, A_0, S_1, A_1, \ldots, S_T)$. Sutton and Barto (1998, 2018) formalize this in their textbook by defining episodic tasks as problems where the agent-environment interaction breaks naturally into subsequences, each of which ends in a special terminal state.
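For illustration, a trajectory can be stored directly as an ordered list of per-step records. The sketch below is one convenient representation, not standard notation; the Transition container and its field names are chosen here purely for clarity.

```python
from typing import Any, List, NamedTuple

class Transition(NamedTuple):
    """One step of an episode: the state S_t, the action A_t, and the reward R_{t+1}."""
    state: Any
    action: Any
    reward: float

# A trajectory tau is the ordered list of transitions plus the terminal state S_T.
trajectory: List[Transition] = [
    Transition(state="S0", action="A0", reward=1.0),
    Transition(state="S1", action="A1", reward=0.0),
]
terminal_state = "S2"  # S_T: the episode ends here and the environment is reset
```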
Every episode consists of several core components that define the agent's interaction with the environment:
| Component | Description |
|---|---|
| Initial state ($S_0$) | The starting configuration of the environment, sampled from distribution $\rho_0$. May be fixed or randomized across episodes. |
| States ($S_t$) | Representations of the environment at each time step, encoding the information the agent needs to make decisions. |
| Actions ($A_t$) | Decisions made by the agent at each time step, selected according to the current policy $\pi$. |
| Rewards ($R_{t+1}$) | Scalar feedback signals received after each action, guiding the agent toward desired behavior. |
| Transitions | The movement from one state to the next, governed by the environment's dynamics $P(s' \mid s, a)$. |
| Terminal state ($S_T$) | A special state that marks the end of the episode. Reaching this state triggers a reset. |
The environment is typically reset after each episode, and the agent begins a new episode from a fresh initial state. This reset mechanism is what distinguishes episodic tasks from continuing tasks.
The return $G_t$ is the cumulative reward the agent seeks to maximize, calculated from time step $t$ to the end of the episode. There are two standard formulations:
Finite-horizon undiscounted return:
$$G_t = R_{t+1} + R_{t+2} + R_{t+3} + \cdots + R_T = \sum_{k=0}^{T-t-1} R_{t+k+1}$$
This simply sums all future rewards until the episode ends. It is well-defined in episodic tasks because $T$ is finite.
Discounted return:
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots + \gamma^{T-t-1} R_T = \sum_{k=0}^{T-t-1} \gamma^k R_{t+k+1}$$
where $\gamma \in [0, 1]$ is the discount factor ($\gamma = 1$ is admissible only when episodes are guaranteed to terminate). The discount factor controls how much the agent values future rewards relative to immediate ones:
| Discount factor ($\gamma$) | Agent behavior |
|---|---|
| $\gamma = 0$ | Myopic: the agent cares only about the immediate reward $R_{t+1}$. |
| $\gamma$ close to 0 (e.g., 0.1) | Short-sighted: the agent heavily discounts future rewards. |
| $\gamma$ close to 1 (e.g., 0.99) | Far-sighted: the agent values future rewards almost as much as immediate ones. |
| $\gamma = 1$ | No discounting: all rewards weighted equally. Only valid when $T$ is finite (episodic tasks). |
In episodic tasks, setting $\gamma = 1$ is sometimes appropriate because the finite episode length guarantees that the return is bounded. In continuing tasks, a discount factor strictly less than 1 is necessary to keep the infinite sum from diverging.
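Both returns can be computed from a recorded list of episode rewards with a single backward pass, as in the following sketch (the function name compute_return is illustrative):

```python
def compute_return(rewards, gamma=0.99):
    """Return G_0 for one episode, given rewards = [R_1, R_2, ..., R_T].

    With gamma = 1.0 this reduces to the undiscounted, finite-horizon return;
    with gamma < 1.0 it is the discounted return.
    """
    g = 0.0
    # Backward recursion: G_t = R_{t+1} + gamma * G_{t+1}
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [0.0, 0.0, 1.0]              # e.g., a single reward at the end of the episode
print(compute_return(rewards, 0.9))    # ~0.81  (= 0.9**2 * 1.0)
print(compute_return(rewards, 1.0))    # 1.0    (undiscounted)
```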
Reinforcement learning problems fall into two broad categories based on whether the interaction has natural endpoints.
| Property | Episodic tasks | Continuing tasks |
|---|---|---|
| Natural endpoint | Yes, episodes end at terminal states | No, interaction continues indefinitely |
| Return calculation | Can use undiscounted ($\gamma = 1$) or discounted returns | Must use discounted returns ($\gamma < 1$) |
| Environment reset | Environment resets between episodes | No resets |
| Examples | Board games, maze navigation, Atari games | Stock trading, process control, server management |
| Monte Carlo methods | Directly applicable | Not directly applicable |
| Temporal difference methods | Applicable | Applicable |
| Episode independence | Each episode is independent of others | No episode boundaries |
Sutton and Barto (2018) introduce a unified notation by defining a special absorbing state that transitions only to itself and generates zero reward. This allows the finite sum in episodic returns to be expressed as an infinite sum, unifying the notation for both task types.
An episode can end in two distinct ways, and the distinction has practical consequences for learning algorithms.
Termination occurs when the agent reaches a genuine terminal state as defined by the MDP. Examples include winning or losing a game, a robot completing its task, or reaching a goal location. When an episode terminates, the future value from that state is truly zero.
Truncation occurs when an episode is cut short by an external constraint that is not part of the MDP, such as a maximum time step limit imposed for practical reasons. In this case, the episode ends, but the agent has not reached a true terminal state; there would have been additional rewards to collect.
This distinction matters for value estimation. As Pardo et al. (2018) demonstrated in their paper "Time Limits in Reinforcement Learning," conflating truncation with termination leads to incorrect value estimates:
| Scenario | Correct handling |
|---|---|
| True termination | Set future value to zero: $Q_{\text{target}} = r_t$ |
| Truncation | Bootstrap from the next state's value: $Q_{\text{target}} = r_t + \gamma \cdot V(s_{t+1})$ |
The Gymnasium library (successor to OpenAI Gym) addressed this by replacing the single done flag with separate terminated and truncated boolean values in its step() API (version 0.26 onward). This allows algorithms to handle each case correctly.
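To make the distinction concrete, here is a minimal sketch of a one-step bootstrapped target that consumes Gymnasium's separate flags. The helper name td_target and the value function v are assumptions made for this example, not part of any library API.

```python
def td_target(reward, next_state, terminated, truncated, v, gamma=0.99):
    """One-step bootstrapped target that distinguishes termination from truncation."""
    if terminated:
        # True terminal state of the MDP: no future reward exists, so do not bootstrap.
        return reward
    if truncated:
        # Time limit hit: the task would have continued, so bootstrap from v(next_state).
        return reward + gamma * v(next_state)
    # Ordinary (non-final) step: standard one-step bootstrapped target.
    return reward + gamma * v(next_state)
```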
Several terms in reinforcement learning are closely related to the concept of an episode and are sometimes used interchangeably, though there are subtle distinctions.
| Term | Definition | Relationship to episode |
|---|---|---|
| Trajectory ($\tau$) | The ordered sequence of states, actions, and rewards: $(s_0, a_0, r_1, s_1, a_1, r_2, \ldots)$ | A trajectory is the data record produced by an episode. In many contexts, the two terms are used synonymously. |
| Rollout | The process of executing a policy in an environment to collect experience | A rollout produces a trajectory, which constitutes one episode. The term emphasizes the act of data collection. |
| Horizon ($H$) | The number of time steps in an episode or planning window | Defines the maximum length of an episode. A fixed horizon means all episodes have the same length. |
| Trial | An informal term for a single attempt at a task | Synonymous with episode in most practical contexts. |
| Time step | A single interaction within an episode where the agent observes, acts, and receives a reward | Episodes are composed of multiple time steps. |
| Epoch | One pass through the entire training dataset | Different from an episode; an epoch may encompass many episodes. |
Episodes play different roles depending on the class of reinforcement learning algorithm being used.
Monte Carlo methods have the strongest dependence on episodes. These methods estimate value functions by averaging the actual returns observed across many complete episodes. Because they use the full return $G_t$ rather than bootstrapping from intermediate estimates, Monte Carlo methods require episodes to terminate before any learning updates can occur.
The value of a state $s$ under policy $\pi$ is estimated as:
$$V(s) \approx \frac{1}{N(s)} \sum_{i=1}^{N(s)} G_t^{(i)}$$
where $N(s)$ is the number of times state $s$ was visited across all episodes, and $G_t^{(i)}$ is the return observed from the $i$-th visit to state $s$.
There are two variants: first-visit Monte Carlo, which counts only the first time $s$ appears in each episode, and every-visit Monte Carlo, which counts every occurrence.
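As an illustration, first-visit Monte Carlo evaluation can be sketched in a few lines, given a batch of completed episodes. The data layout and function name are assumptions made for the example.

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=1.0):
    """Estimate V(s) by averaging first-visit returns over complete episodes.

    episodes: list of (states, rewards) pairs per episode, with
              states = [S_0, ..., S_{T-1}] and rewards = [R_1, ..., R_T].
    """
    returns_sum = defaultdict(float)
    visit_count = defaultdict(int)
    for states, rewards in episodes:
        # Returns-to-go G_t for every step, computed with one backward pass.
        returns, g = [0.0] * len(rewards), 0.0
        for t in reversed(range(len(rewards))):
            g = rewards[t] + gamma * g
            returns[t] = g
        seen = set()
        for t, s in enumerate(states):
            if s in seen:        # first-visit: only the first occurrence of s counts
                continue
            seen.add(s)
            returns_sum[s] += returns[t]
            visit_count[s] += 1
    return {s: returns_sum[s] / visit_count[s] for s in returns_sum}
```

The every-visit variant simply drops the seen check and averages over every occurrence of $s$.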
Temporal difference (TD) learning methods can update value estimates at every time step within an episode, without waiting for the episode to end. TD methods use bootstrapping, updating estimates based on other estimates:
$$V(S_t) \leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right]$$
Because TD methods do not need to wait for episode completion, they can also be applied to continuing tasks. However, in episodic tasks, episode boundaries still serve an important role by defining when the environment resets and providing natural evaluation points.
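A minimal tabular sketch of the update above, applied after every step; the dictionary-based value table and function name are illustrative.

```python
def td0_update(V, s, r, s_next, terminated, alpha=0.1, gamma=0.99):
    """Tabular TD(0) update, applied after every environment step.

    V is a dict mapping states to value estimates. If the episode truly
    terminated, the next state's value is treated as zero.
    """
    v_next = 0.0 if terminated else V.get(s_next, 0.0)
    td_error = r + gamma * v_next - V.get(s, 0.0)
    V[s] = V.get(s, 0.0) + alpha * td_error
    return V
```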
The REINFORCE algorithm, introduced by Williams (1992), is a Monte Carlo policy gradient method that updates policy parameters after each complete episode. The policy gradient is estimated as:
$$\nabla_\theta J(\theta) \approx \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(A_t \mid S_t) \cdot G_t$$
Because REINFORCE uses the full episode return $G_t$, it must wait for the episode to finish before updating. This episode-level update leads to high variance in gradient estimates, which is a known limitation of the algorithm. Actor-critic methods address this by using TD-style bootstrapping to reduce variance while still operating within the episodic framework.
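As a concrete illustration, here is a minimal NumPy sketch of the per-episode REINFORCE gradient for a softmax policy with a linear feature mapping. The policy parameterization and data layout are assumptions made for this example, not part of Williams' original presentation.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # numerical stability
    e = np.exp(z)
    return e / e.sum()

def reinforce_gradient(theta, episode, gamma=1.0):
    """Estimate the policy gradient from one completed episode.

    theta:   array of shape (n_actions, n_features); the policy is
             pi(a | s) = softmax(theta @ phi_s).
    episode: list of (phi_s, a, r) tuples, where phi_s is the feature vector
             of S_t, a is the index of A_t, and r is R_{t+1}.
    """
    # Returns-to-go G_t, computed backwards over the episode.
    returns, g = [], 0.0
    for _, _, r in reversed(episode):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()

    grad = np.zeros_like(theta)
    for (phi_s, a, _), g_t in zip(episode, returns):
        probs = softmax(theta @ phi_s)
        one_hot = np.zeros_like(probs)
        one_hot[a] = 1.0
        # For a softmax policy, grad_theta log pi(a|s) = outer(one_hot(a) - pi(.|s), phi(s)).
        grad += np.outer(one_hot - probs, phi_s) * g_t
    return grad
```

The parameters are then nudged by gradient ascent after each episode, e.g. theta += alpha * reinforce_gradient(theta, episode).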
In Deep Q-Network (DQN) algorithms, individual transitions $(s_t, a_t, r_{t+1}, s_{t+1})$ from many different episodes are stored in a replay buffer. During training, random minibatches of transitions are sampled from this buffer, breaking the temporal correlations between consecutive time steps within an episode. This technique, called experience replay, was a key innovation in the original DQN paper by Mnih et al. (2015).
While experience replay decouples learning from the sequential structure of individual episodes, episodes still serve as the organizing unit for data collection: the agent runs episodes to fill the replay buffer.
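A minimal sketch of such a buffer (the class name, ring-buffer layout, and capacity are illustrative, not taken from any particular implementation):

```python
import random

class ReplayBuffer:
    """Pools individual transitions from many episodes and samples them uniformly."""

    def __init__(self, capacity=100_000):
        self.capacity = capacity
        self.buffer = []      # ring buffer of transitions
        self.position = 0

    def add(self, state, action, reward, next_state, terminated):
        # Transitions from different episodes are mixed together in one pool;
        # once the buffer is full, the oldest transition is overwritten.
        transition = (state, action, reward, next_state, terminated)
        if len(self.buffer) < self.capacity:
            self.buffer.append(transition)
        else:
            self.buffer[self.position] = transition
        self.position = (self.position + 1) % self.capacity

    def sample(self, batch_size):
        # Uniform random sampling breaks the temporal correlation between
        # consecutive steps of the same episode.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```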
The design of episodes, including their length, initial state distribution, and termination conditions, significantly affects learning performance.
Episode length influences the tradeoff between exploration and exploitation: longer episodes give the agent more opportunity to explore far from the initial state and to observe the delayed consequences of its actions, while shorter episodes provide more frequent resets and a faster stream of complete returns to learn from.
Many practical implementations use a maximum episode length (time limit) to prevent episodes from running indefinitely, especially in environments where the agent might get stuck.
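In Gymnasium, for example, such a limit can be imposed with the TimeLimit wrapper; reaching the cap ends the episode with truncated=True rather than terminated=True. A minimal sketch (CartPole-v1 is just an example environment):

```python
import gymnasium as gym
from gymnasium.wrappers import TimeLimit

# Cap every episode at 200 steps. Hitting the cap ends the episode with
# truncated=True, signaling a time limit rather than a true terminal state.
env = TimeLimit(gym.make("CartPole-v1"), max_episode_steps=200)
```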
Randomizing the initial state across episodes helps the agent learn a more general policy. If the agent always starts from the same state, it may overfit to that specific starting configuration. The Gymnasium reset() method supports seeded randomization of initial states for reproducibility.
In curriculum learning, the difficulty of episodes is gradually increased over training. For example, a robot learning to walk might begin with short, easy episodes on flat terrain and progress to longer episodes on rough terrain. This staged approach can accelerate convergence compared to training on the full-difficulty task from the start.
In modern RL frameworks such as Gymnasium (the maintained fork of OpenAI Gym), episodes are managed through a standardized API.
A typical episode loop follows this pattern (a runnable sketch appears after the metrics table below):

1. Call env.reset() to initialize the environment and get the initial observation
2. Select an action according to the current policy
3. Call env.step(action) to execute the action
4. Receive the next observation, the reward, the terminated flag, the truncated flag, and the info dict
5. If terminated or truncated, break out of the loop; otherwise return to step 2

During training, key per-episode metrics are typically logged:
| Metric | Description |
|---|---|
| Episode return | Total (possibly discounted) reward accumulated during the episode |
| Episode length | Number of time steps in the episode |
| Success rate | Whether the agent achieved its goal (for goal-conditioned tasks) |
| Average reward per step | Episode return divided by episode length |
These metrics, tracked over hundreds or thousands of episodes, form the learning curve that indicates whether the agent is improving.
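Putting the loop and the per-episode logging together, a minimal sketch using Gymnasium might look like the following; CartPole-v1 and the random action choice are placeholders for a real environment and a learned policy.

```python
import gymnasium as gym

env = gym.make("CartPole-v1")

for episode in range(10):
    obs, info = env.reset(seed=episode)       # seeded reset for reproducibility
    episode_return, episode_length = 0.0, 0
    terminated = truncated = False

    while not (terminated or truncated):
        action = env.action_space.sample()    # placeholder for a learned policy pi(a|s)
        obs, reward, terminated, truncated, info = env.step(action)
        episode_return += reward
        episode_length += 1

    print(f"episode {episode}: return={episode_return:.1f} length={episode_length}")

env.close()
```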
The concept of an episode manifests differently depending on the application domain.
| Domain | What constitutes one episode | Terminal condition | Typical episode length |
|---|---|---|---|
| Board games (chess, Go) | One complete game from start to finish | Win, loss, or draw | Variable (tens to hundreds of moves) |
| Atari games (DQN) | One game life or full game | Game over or life lost | Variable (hundreds to thousands of frames) |
| Robotic manipulation | One attempt to grasp or place an object | Object placed, dropped, or time limit reached | Fixed (typically 50-200 steps) |
| Autonomous driving | One driving scenario or route | Destination reached, collision, or time limit | Variable (seconds to minutes of simulated time) |
| Dialogue systems | One complete conversation | User ends conversation or turn limit reached | Variable (5-50 turns) |
| Navigation tasks | One attempt to reach a goal position | Goal reached or maximum steps exceeded | Variable |
The concept of an episode in reinforcement learning has roots in the early work on dynamic programming and optimal control theory. Richard Bellman's work on sequential decision processes in the 1950s laid the groundwork by formalizing finite-horizon decision problems. The term "episode" became standard in the RL literature through Sutton and Barto's influential textbook "Reinforcement Learning: An Introduction" (first edition 1998, second edition 2018), which systematically distinguished episodic tasks from continuing tasks.
The practical significance of episodes grew with the development of Monte Carlo methods for RL (first formalized by Sutton and Barto drawing on classical Monte Carlo simulation techniques) and later with policy gradient methods like REINFORCE (Williams, 1992). The advent of deep reinforcement learning, particularly DQN (Mnih et al., 2013, 2015) and AlphaGo (Silver et al., 2016), brought renewed attention to episode design and management as agents were trained over millions of episodes on complex tasks.