A trajectory in reinforcement learning is a sequence of states, actions, and rewards that an agent experiences while interacting with an environment. Formally denoted by the Greek letter tau, a trajectory captures the full history of an agent's behavior over a period of time and serves as the fundamental unit of data used for learning, evaluation, and policy improvement.
Trajectories are sometimes called episodes or rollouts, though these terms carry slightly different connotations depending on context. Understanding trajectories is essential for nearly every area of reinforcement learning, from policy gradient methods and value estimation to imitation learning and offline reinforcement learning.
Imagine you are playing a video game. Every time you play, you start at the beginning, make choices (go left, go right, jump, etc.), see what happens after each choice, and collect or lose points along the way. When you finish the game (or stop playing), you can look back at everything that happened: where you started, what you did, what you saw, and how many points you got at each step.
That whole record of your playthrough is a trajectory. It is like a diary of one game session. If you play the game many times, you get many different trajectories, and by studying them you can figure out which choices tend to give you more points. That is basically how a computer learns in reinforcement learning: it plays many times, collects trajectories, and uses them to get better.
A trajectory is a finite or infinite sequence of states, actions, and rewards generated by an agent acting in an environment. The standard notation, following Sutton and Barto (2018) and OpenAI Spinning Up, is:
$$\tau = (s_0, a_0, r_1, s_1, a_1, r_2, s_2, a_2, r_3, \ldots)$$
where:

- $s_t$ is the state of the environment at time step $t$, with the initial state $s_0$ drawn from a start-state distribution $\rho_0$;
- $a_t$ is the action the agent selects in state $s_t$;
- $r_{t+1}$ is the reward the agent receives after taking action $a_t$ in state $s_t$.
Some formulations use a slightly different convention where the reward at time $t$ is written as $r_t = R(s_t, a_t)$ or $r_t = R(s_t, a_t, s_{t+1})$, depending on whether the reward function depends on the next state.
A trajectory is generated step by step according to the following process:

1. An initial state $s_0$ is sampled from the start-state distribution $\rho_0$.
2. At each time step $t$, the agent samples an action $a_t \sim \pi(\cdot | s_t)$ from its policy.
3. The environment transitions to the next state $s_{t+1} \sim P(\cdot | s_t, a_t)$ and emits a reward.
4. Steps 2 and 3 repeat until a terminal state is reached, a time limit expires, or, in continuing tasks, indefinitely.
The probability of a particular trajectory under a given policy $\pi$ is:
$$P(\tau | \pi) = \rho_0(s_0) \prod_{t=0}^{T-1} \pi(a_t | s_t) \cdot P(s_{t+1} | s_t, a_t)$$
This probability is central to policy gradient methods and importance sampling corrections.
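The generative process above maps directly onto the standard agent-environment loop. Below is a minimal sketch using the Gymnasium API with a uniform-random policy; the environment name and step limit are illustrative.

```python
import gymnasium as gym

def collect_trajectory(env, policy, max_steps=1000):
    """Roll out one trajectory tau = (s_0, a_0, r_1, s_1, ...) under `policy`."""
    states, actions, rewards = [], [], []
    state, _ = env.reset()                # s_0 ~ rho_0
    for _ in range(max_steps):
        action = policy(state)            # a_t ~ pi(. | s_t)
        next_state, reward, terminated, truncated, _ = env.step(action)
        states.append(state)
        actions.append(action)
        rewards.append(reward)            # reward emitted by the transition
        state = next_state
        if terminated or truncated:
            break
    return states, actions, rewards

# Example: a uniform-random policy on CartPole.
env = gym.make("CartPole-v1")
random_policy = lambda s: env.action_space.sample()
states, actions, rewards = collect_trajectory(env, random_policy)
print(f"trajectory length: {len(rewards)}, return: {sum(rewards):.1f}")
```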
The return is the cumulative reward collected along a trajectory. There are two standard formulations.
For tasks with a fixed time horizon $T$:
$$R(\tau) = \sum_{t=0}^{T} r_t$$
For continuing tasks or when future rewards should be weighted less than immediate ones:
$$R(\tau) = \sum_{t=0}^{\infty} \gamma^t r_t$$
where $\gamma \in (0, 1)$ is the discount factor. The discount factor serves two purposes: it makes the infinite sum mathematically convergent, and it encodes a preference for sooner rewards over later ones.
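The discounted return translates directly into code; setting $\gamma = 1$ recovers the finite-horizon sum. A minimal sketch, with an illustrative $\gamma$:

```python
def discounted_return(rewards, gamma=0.99):
    """R(tau) = sum over t of gamma^t * r_t for a finite reward sequence."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))  # 1 + 0.9 + 0.81 = 2.71
```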
The reward-to-go from time step $t$ is the return computed from that point forward:
$$\hat{R}_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}$$
This quantity is used in many policy gradient algorithms because it reduces variance compared to using the full trajectory return. Only rewards that come after an action should influence the credit assigned to that action; rewards collected before the action are not affected by it.
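Computing the reward-to-go at every time step naively costs $O(T^2)$; a single backward pass over the trajectory does it in $O(T)$ using the recursion $\hat{R}_t = r_t + \gamma \hat{R}_{t+1}$. A minimal sketch:

```python
import numpy as np

def rewards_to_go(rewards, gamma=0.99):
    """Compute R_hat_t for every t in one backward pass:
    R_hat_t = r_t + gamma * R_hat_{t+1}."""
    rtg = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

print(rewards_to_go([1.0, 1.0, 1.0], gamma=0.9))  # [2.71, 1.9, 1.0]
```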
The terms "trajectory," "episode," and "rollout" are often used interchangeably in the reinforcement learning literature, but they have subtly different connotations.
| Term | Definition | Typical usage |
|---|---|---|
| Trajectory | A sequence of states, actions, and rewards; may be partial or complete | General mathematical notation; used in policy gradient derivations and theoretical analysis |
| Episode | A complete trajectory from an initial state to a terminal state | Episodic tasks such as games, navigation problems, or any task with a defined end |
| Rollout | A trajectory generated by executing a policy in an environment (real or simulated) | Data collection phase; model-based RL where simulated trajectories are generated from a learned dynamics model |
A key distinction is that a trajectory can be a partial sequence (for example, a fixed-length segment used in a policy update), whereas an episode typically refers to a complete interaction from start to finish. Rollouts emphasize the act of generating data, particularly in simulation, and may be shorter than a full episode.
Trajectories appear in virtually every reinforcement learning algorithm, though the way they are collected, stored, and used varies.
On-policy algorithms learn from trajectories collected using the current policy. The agent generates trajectories, uses them to compute gradient estimates or value updates, and then discards them before collecting new data with the updated policy. Examples include REINFORCE, Proximal Policy Optimization (PPO), and Advantage Actor-Critic (A2C).
In the REINFORCE algorithm (Williams, 1992), the agent samples a complete trajectory, computes the return at each time step, and updates the policy parameters in the direction that increases the probability of high-return trajectories:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t | s_t) \cdot \hat{R}_t \right]$$
This gradient estimate is unbiased but can have high variance, which is why practical implementations add a baseline (often an estimate of the state value function) to reduce variance without introducing bias.
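A minimal sketch of the corresponding surrogate loss in PyTorch; the tensor names are illustrative, and the baseline is assumed to come from a separately trained value network:

```python
import torch

def reinforce_loss(log_probs, rewards_to_go, baseline=None):
    """Surrogate loss whose gradient matches the REINFORCE estimator.

    log_probs:      tensor of log pi_theta(a_t | s_t), shape (T,)
    rewards_to_go:  tensor of R_hat_t, shape (T,)
    baseline:       optional tensor of V(s_t) estimates, shape (T,)
    """
    advantages = rewards_to_go
    if baseline is not None:
        # Subtracting a state-dependent baseline reduces variance
        # without biasing the gradient estimate.
        advantages = rewards_to_go - baseline.detach()
    # Minimizing this loss ascends the policy gradient.
    return -(log_probs * advantages).mean()
```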
Off-policy algorithms can learn from trajectories generated by a different policy (the "behavior policy") than the one being optimized (the "target policy"). This makes it possible to reuse old trajectory data, improving sample efficiency. Examples include Q-learning, Deep Q-Networks (DQN), and Soft Actor-Critic (SAC).
When using trajectory data from a behavior policy $\beta$ to estimate quantities under a target policy $\pi$, importance sampling corrections are needed. The importance weight for a full trajectory is:
$$w(\tau) = \prod_{t=0}^{T-1} \frac{\pi(a_t | s_t)}{\beta(a_t | s_t)}$$
This product can become very large or very small as the trajectory length increases, leading to high variance. Per-decision importance sampling and techniques like weighted importance sampling help mitigate this problem.
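Because the weight is a long product of ratios, implementations typically work in log space, summing log-probability differences and exponentiating only at the end. A minimal sketch:

```python
import numpy as np

def trajectory_importance_weight(pi_log_probs, beta_log_probs):
    """w(tau) = prod_t pi(a_t|s_t) / beta(a_t|s_t), computed in log space
    to avoid numerical overflow/underflow on long trajectories."""
    log_w = np.sum(np.asarray(pi_log_probs) - np.asarray(beta_log_probs))
    return np.exp(log_w)
```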
Experience replay, introduced in the DQN architecture (Mnih et al., 2015), stores individual transitions $(s_t, a_t, r_{t+1}, s_{t+1})$ from trajectories in a replay buffer. During training, mini-batches of transitions are sampled from the buffer to break temporal correlations and improve learning stability. Prioritized experience replay (Schaul et al., 2016) samples transitions with higher temporal-difference errors more frequently, allowing the agent to learn more from surprising or informative experiences.
| Replay strategy | Selection method | Benefit |
|---|---|---|
| Uniform replay | Random sampling from buffer | Breaks temporal correlations; simple to implement |
| Prioritized replay | Weighted by TD error magnitude | Focuses learning on high-error (informative) transitions |
| Hindsight replay | Relabels goals after the fact | Enables learning from failed trajectories in goal-conditioned tasks |
| Trajectory replay | Samples contiguous trajectory segments | Preserves temporal structure for recurrent or multi-step methods |
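A uniform replay buffer (the first row above) needs only a bounded queue and random sampling. A minimal sketch, with an illustrative capacity:

```python
import random
from collections import deque

class ReplayBuffer:
    """Uniform experience replay: stores (s, a, r, s', done) transitions."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted first

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Draw a mini-batch uniformly at random, breaking temporal correlations."""
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```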
Model-based reinforcement learning algorithms learn a model of the environment's dynamics and then generate synthetic trajectories (rollouts) using the learned model for planning or policy improvement. Monte Carlo Tree Search (MCTS), as used in AlphaGo and AlphaZero, simulates many possible future trajectories from the current state to evaluate candidate actions. The quality of each action is estimated by averaging the returns across simulated rollouts.
Dyna-style algorithms (Sutton, 1991) interleave real trajectory collection with model-generated rollouts, using both to update value functions or policies. The length and accuracy of model-generated rollouts are practical concerns, since errors in the learned dynamics model compound over longer horizons.
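A sketch of generating such a synthetic rollout; `model.predict` is a hypothetical stand-in for a learned one-step dynamics model, and the short default horizon reflects the compounding-error concern:

```python
def model_rollout(model, policy, start_state, horizon=5):
    """Generate a synthetic trajectory from a learned dynamics model.
    Short horizons limit the compounding of model error."""
    state, trajectory = start_state, []
    for _ in range(horizon):
        action = policy(state)
        # `model.predict` is a hypothetical learned model, not the real environment.
        next_state, reward = model.predict(state, action)
        trajectory.append((state, action, reward))
        state = next_state
    return trajectory
```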
Policy gradient algorithms optimize a policy by estimating the gradient of an objective function with respect to the policy parameters. The objective is typically the expected return over trajectories:
$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} [R(\tau)]$$
The policy gradient theorem (Sutton et al., 2000) provides a way to compute the gradient of this objective:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t | s_t) \cdot A^{\pi}(s_t, a_t) \right]$$
where $A^{\pi}(s_t, a_t) = Q^{\pi}(s_t, a_t) - V^{\pi}(s_t)$ is the advantage function. In practice, the expectation is approximated by averaging over a batch of sampled trajectories.
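The advantage itself must be estimated from trajectory data. One widely used estimator is generalized advantage estimation (GAE; Schulman et al., 2016), sketched below under the simplifying assumption of a single non-terminating trajectory segment:

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over one trajectory segment.

    rewards: r_t for t = 0..T-1
    values:  V(s_t) for t = 0..T (includes a bootstrap value for the final state)
    Ignores episode termination for brevity.
    """
    T = len(rewards)
    advantages = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # one-step TD error
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```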
| Algorithm | Trajectory usage | Variance reduction technique |
|---|---|---|
| REINFORCE | Full episodes; uses Monte Carlo returns | Baseline subtraction |
| A2C / A3C | Fixed-length trajectory segments | Advantage function with learned value baseline |
| PPO | Fixed-length trajectory segments | Clipped surrogate objective; advantage normalization |
| TRPO | Full or partial trajectories | KL divergence constraint on policy updates |
| SAC | Individual transitions from replay buffer | Entropy regularization; off-policy learning |
Imitation learning methods learn policies from expert demonstrations, which take the form of trajectories. In behavioral cloning, the agent is trained via supervised learning to map states to actions using expert trajectory data. In inverse reinforcement learning (IRL), the goal is to recover the reward function that best explains a set of expert trajectories, and then use that inferred reward function to train a policy via standard RL.
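Behavioral cloning reduces to ordinary supervised learning over the state-action pairs contained in the expert trajectories. A minimal PyTorch sketch, assuming discrete actions and a policy network that outputs logits (all names illustrative):

```python
import torch.nn as nn

def behavioral_cloning_step(policy_net, optimizer, expert_states, expert_actions):
    """One supervised update: push the policy toward the expert's actions.

    expert_states:  float tensor of states from expert trajectories
    expert_actions: long tensor of the actions the expert took in those states
    """
    logits = policy_net(expert_states)
    loss = nn.functional.cross_entropy(logits, expert_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```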
Formally, given a set of expert trajectories $\{\tau_1^*, \tau_2^*, \ldots, \tau_N^*\}$ generated by an expert policy $\pi^*$, an IRL algorithm seeks a reward function $R$ such that the expert policy is optimal (or near-optimal) under $R$. Maximum entropy IRL (Ziebart et al., 2008) assumes the expert follows a distribution over trajectories given by:
$$P(\tau) \propto \exp(R(\tau))$$
This framework connects trajectory-level reasoning to probabilistic modeling and has been influential in robotics, autonomous driving, and other domains where reward functions are hard to specify by hand.
Recent work has reframed reinforcement learning as a sequence modeling problem, treating trajectories as sequences of tokens to be processed by transformer architectures.
Decision Transformer (Chen et al., 2021) conditions on a desired return, past states, and past actions to autoregressively predict future actions. Rather than computing value functions or policy gradients, it models the trajectory distribution directly. Given a trajectory represented as a sequence of return-to-go, state, and action tokens, the model learns to produce actions that achieve specified return levels.
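A sketch of how a raw trajectory is rearranged into this interleaved layout before tokenization; Decision Transformer uses undiscounted returns-to-go, and the representation here is deliberately schematic (real implementations also discretize or embed each element):

```python
def to_decision_transformer_sequence(states, actions, rewards):
    """Flatten a trajectory into the interleaved (return-to-go, state, action)
    layout used by Decision Transformer-style models."""
    rtg, running = [], 0.0
    for r in reversed(rewards):        # undiscounted return-to-go at each step
        running += r
        rtg.append(running)
    rtg.reverse()
    return [tok
            for g, s, a in zip(rtg, states, actions)
            for tok in (("rtg", g), ("state", s), ("action", a))]
```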
Trajectory Transformer (Janner et al., 2021) treats reinforcement learning as one big sequence modeling problem. It discretizes states, actions, and rewards and trains a transformer to predict the next token in a trajectory sequence. Planning is performed via beam search over predicted future trajectories.
These approaches are particularly well suited to offline reinforcement learning, where the agent must learn entirely from a fixed dataset of previously collected trajectories without any further environment interaction. Offline RL is important in settings where collecting new data is expensive or dangerous, such as healthcare, robotics, and autonomous driving.
| Method | Architecture | Planning approach | Key property |
|---|---|---|---|
| Decision Transformer | Causal transformer | Return-conditioned generation | No Bellman backups needed |
| Trajectory Transformer | Autoregressive transformer | Beam search over predicted sequences | Unified state, action, and reward modeling |
| Conservative Q-Learning (CQL) | Standard Q-network | Pessimistic value estimation | Penalizes out-of-distribution actions |
| Implicit Q-Learning (IQL) | Standard Q-network | Expectile regression | Avoids querying unseen actions |
Reinforcement learning from human feedback (RLHF) has become a standard technique for aligning large language models with human preferences. The concept of trajectories appears in RLHF in a specific way: human evaluators are presented with pairs of model-generated outputs (which can be viewed as trajectory segments in a text generation environment) and asked to indicate which output they prefer.
The original trajectory preference framework (Christiano et al., 2017) was developed for Atari game agents, where human raters compared short clips of agent behavior (trajectory segments) and indicated which behavior looked better. A reward model was then trained to predict these preferences, and the agent's policy was optimized to maximize the predicted reward.
In the language model setting, each generated response can be thought of as a trajectory through token space, where the states are partial sequences, the actions are token selections, and the reward comes from the learned preference model. This connection between trajectories and RLHF is one reason why reinforcement learning concepts have become relevant to the development of systems like ChatGPT and Claude.
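The reward model in this framework is trained with a Bradley-Terry style objective: the probability that segment A is preferred over segment B is a logistic function of the difference between their total predicted rewards. A minimal PyTorch sketch, with illustrative argument names:

```python
import torch

def preference_loss(reward_sum_a, reward_sum_b, prefer_a):
    """Pairwise preference loss in the style of Christiano et al. (2017).

    reward_sum_a/b: reward model's total predicted reward over each segment
    prefer_a:       tensor of labels, 1 if humans preferred segment A, else 0
    """
    logits = reward_sum_a - reward_sum_b  # P(A preferred) = sigmoid(logits)
    targets = prefer_a.float()
    return torch.nn.functional.binary_cross_entropy_with_logits(logits, targets)
```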
Trajectory optimization is a family of methods that directly optimize a sequence of actions (or states and actions) to maximize cumulative reward, subject to dynamics constraints. Unlike policy optimization, which learns a mapping from states to actions, trajectory optimization produces a plan: a specific sequence of actions for a specific initial condition.
| Method type | Optimization variables | Dynamics enforcement | Strengths |
|---|---|---|---|
| Single shooting | Actions only | Forward simulation | Simple; works with black-box simulators |
| Multiple shooting | Actions and states at segment boundaries | Continuity constraints between segments | More numerically stable than single shooting |
| Direct collocation | States and actions at all time steps | Dynamics as equality constraints | Handles complex constraints well; good convergence properties |
Trajectory optimization is widely used in robotics for motion planning, where a robot must find a collision-free path from a start configuration to a goal. It is also used in autonomous driving, spacecraft guidance, and other control applications.
Recent work (Amos et al., 2018; Jin et al., 2024) has made trajectory optimization differentiable, allowing it to be integrated into end-to-end learning pipelines. DiffTOP (Jin et al., 2024) uses differentiable trajectory optimization as a policy class, enabling gradients to flow through the optimization process so that cost functions and dynamics models can be learned from data.
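A sketch of the simplest variant, single shooting with gradient descent through a differentiable dynamics model; `dynamics` and `cost` are assumed stand-ins for differentiable torch functions, and the hyperparameters are illustrative:

```python
import torch

def single_shooting(dynamics, cost, s0, horizon=20, iters=100, lr=0.05):
    """Single-shooting trajectory optimization: optimize the action sequence
    directly by differentiating through forward simulation of the dynamics.

    dynamics(s, a) -> s_next and cost(s, a) -> scalar are assumed to be
    differentiable torch functions; s0 is the initial-state tensor.
    """
    actions = torch.zeros(horizon, requires_grad=True)
    opt = torch.optim.Adam([actions], lr=lr)
    for _ in range(iters):
        s, total_cost = s0, 0.0
        for t in range(horizon):
            total_cost = total_cost + cost(s, actions[t])
            s = dynamics(s, actions[t])   # forward simulation enforces dynamics
        opt.zero_grad()
        total_cost.backward()
        opt.step()
    return actions.detach()               # the optimized plan
```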
The quality and diversity of collected trajectories directly determine what an agent can learn. If an agent only visits a small region of the state space, it will have limited trajectory data and may learn a poor policy. This is the core of the exploration-exploitation tradeoff: the agent must balance collecting diverse trajectories (exploration) with acting according to its current best knowledge (exploitation).
Common exploration strategies that affect trajectory collection include:

- Epsilon-greedy action selection, which takes a random action with probability $\epsilon$ and the greedy action otherwise (see the sketch below);
- Entropy regularization, which rewards the policy for remaining stochastic, as in SAC;
- Action noise, such as Gaussian noise added to continuous actions in deterministic-policy methods;
- Intrinsic motivation or curiosity bonuses, which add reward for visiting novel or poorly predicted states;
- Optimistic initialization and count- or uncertainty-based bonuses, which bias the agent toward under-visited regions of the state space.
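A minimal sketch of epsilon-greedy selection over estimated action values; the $\epsilon$ value and seeded generator are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon=0.1):
    """Take a random action with probability epsilon, else the greedy action."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # explore
    return int(np.argmax(q_values))               # exploit
```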
Outside of reinforcement learning, the term "trajectory" is also used in optimization to describe the path that an algorithm traces through parameter space as it converges toward a local or global minimum. For example, the trajectory of gradient descent on a loss surface is the sequence of parameter values visited during training:
$$\theta_0, \theta_1, \theta_2, \ldots$$
where each update follows $\theta_{t+1} = \theta_t - \alpha \nabla_{\theta} L(\theta_t)$ and $\alpha$ is the learning rate.
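A sketch that records this optimization trajectory for an arbitrary gradient function; the quadratic example and learning rate are illustrative:

```python
import numpy as np

def gradient_descent_trajectory(grad_fn, theta0, alpha=0.1, steps=100):
    """Record the optimization trajectory theta_0, theta_1, ... of gradient descent."""
    trajectory = [np.asarray(theta0, dtype=float)]
    for _ in range(steps):
        theta = trajectory[-1]
        trajectory.append(theta - alpha * grad_fn(theta))  # theta_{t+1} = theta_t - alpha * grad
    return trajectory

# Example: descend L(theta) = theta^2, whose gradient is 2 * theta.
traj = gradient_descent_trajectory(lambda th: 2 * th, theta0=[5.0], alpha=0.1, steps=50)
print(traj[0], traj[-1])  # the trajectory converges toward the minimum at 0
```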
Analyzing optimization trajectories can reveal properties of the loss surface (e.g., presence of saddle points, flat regions, or sharp minima), the effect of hyperparameters like learning rate and momentum, and differences between optimizers such as SGD, Adam, and RMSProp.
Trajectory-based methods and analysis have been applied across many domains.
| Application domain | Role of trajectories | Example systems |
|---|---|---|
| Game playing | Agent play-throughs used for training and evaluation | AlphaGo, OpenAI Five, Atari agents |
| Robotics | Motion planning and manipulation trajectories | Dexterous manipulation, legged locomotion |
| Autonomous driving | Vehicle path planning and prediction of other agents' trajectories | Waymo, Tesla Autopilot |
| Healthcare | Treatment sequences modeled as trajectories through patient state space | Sepsis treatment optimization, dosing strategies |
| Language models | Token generation sequences treated as trajectories for RLHF alignment | ChatGPT, Claude, Gemini |
| Finance | Trading action sequences optimized via RL | Portfolio management, order execution |