A trajectory in reinforcement learning is a sequence of states, actions, and rewards that an agent experiences while interacting with an environment. Formally denoted by the Greek letter tau, a trajectory captures the full history of an agent's behavior over a period of time and serves as the fundamental unit of data used for learning, evaluation, and policy improvement.
Trajectories are sometimes called episodes or rollouts, though these terms carry slightly different connotations depending on context. Understanding trajectories is essential for nearly every area of reinforcement learning, from policy gradient methods and value estimation to imitation learning and offline reinforcement learning.
Imagine you are playing a video game. Every time you play, you start at the beginning, make choices (go left, go right, jump, etc.), see what happens after each choice, and collect or lose points along the way. When you finish the game (or stop playing), you can look back at everything that happened: where you started, what you did, what you saw, and how many points you got at each step.
That whole record of your playthrough is a trajectory. It is like a diary of one game session. If you play the game many times, you get many different trajectories, and by studying them you can figure out which choices tend to give you more points. That is basically how a computer learns in reinforcement learning: it plays many times, collects trajectories, and uses them to get better.
A trajectory is a finite or infinite sequence of states, actions, and rewards generated by an agent acting in an environment. The standard notation, following Sutton and Barto (2018) and OpenAI Spinning Up, is:
$$\tau = (s_0, a_0, r_1, s_1, a_1, r_2, s_2, a_2, r_3, \ldots)$$
where:

- $s_t$ is the state of the environment at time step $t$, with the initial state $s_0$ drawn from a start-state distribution $\rho_0$;
- $a_t$ is the action the agent selects in state $s_t$;
- $r_{t+1}$ is the reward the agent receives after taking action $a_t$ in state $s_t$.
Some formulations use a slightly different convention where the reward at time $t$ is written as $r_t = R(s_t, a_t)$ or $r_t = R(s_t, a_t, s_{t+1})$, depending on whether the reward function depends on the next state.
A trajectory is generated step by step according to the following process:

1. An initial state $s_0$ is sampled from the start-state distribution $\rho_0$.
2. At each time step $t$, the agent samples an action $a_t \sim \pi(\cdot | s_t)$ from its policy.
3. The environment transitions to the next state $s_{t+1} \sim P(\cdot | s_t, a_t)$ and emits a reward.
4. Steps 2 and 3 repeat until a terminal state is reached, a time limit expires, or, in continuing tasks, indefinitely.
The probability of a particular trajectory under a given policy $\pi$ is:
$$P(\tau | \pi) = \rho_0(s_0) \prod_{t=0}^{T-1} \pi(a_t | s_t) \cdot P(s_{t+1} | s_t, a_t)$$
This probability is central to policy gradient methods and importance sampling corrections.
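The generative process above maps directly onto the standard agent-environment loop. Below is a minimal sketch using the Gymnasium API with a uniform-random policy; the environment name and step limit are illustrative.

```python
import gymnasium as gym

def collect_trajectory(env, policy, max_steps=1000):
    """Roll out one trajectory tau = (s_0, a_0, r_1, s_1, ...) under `policy`."""
    states, actions, rewards = [], [], []
    state, _ = env.reset()                # s_0 ~ rho_0
    for _ in range(max_steps):
        action = policy(state)            # a_t ~ pi(. | s_t)
        next_state, reward, terminated, truncated, _ = env.step(action)
        states.append(state)
        actions.append(action)
        rewards.append(reward)            # reward emitted by the transition
        state = next_state
        if terminated or truncated:
            break
    return states, actions, rewards

# Example: a uniform-random policy on CartPole.
env = gym.make("CartPole-v1")
random_policy = lambda s: env.action_space.sample()
states, actions, rewards = collect_trajectory(env, random_policy)
print(f"trajectory length: {len(rewards)}, return: {sum(rewards):.1f}")
```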
The return is the cumulative reward collected along a trajectory. There are two standard formulations.
For tasks with a fixed time horizon $T$:
$$R(\tau) = \sum_{t=0}^{T} r_t$$
For continuing tasks or when future rewards should be weighted less than immediate ones:
$$R(\tau) = \sum_{t=0}^{\infty} \gamma^t r_t$$
where $\gamma \in (0, 1)$ is the discount factor. The discount factor serves two purposes: it makes the infinite sum mathematically convergent, and it encodes a preference for sooner rewards over later ones.
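The discounted return translates directly into code; setting $\gamma = 1$ recovers the finite-horizon sum. A minimal sketch, with an illustrative $\gamma$:

```python
def discounted_return(rewards, gamma=0.99):
    """R(tau) = sum over t of gamma^t * r_t for a finite reward sequence."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))  # 1 + 0.9 + 0.81 = 2.71
```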
The reward-to-go from time step $t$ is the return computed from that point forward:
$$\hat{R}_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}$$
This quantity is used in many policy gradient algorithms because it reduces variance compared to using the full trajectory return. Only rewards that come after an action should influence the credit assigned to that action; rewards collected before the action are not affected by it.
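Computing the reward-to-go at every time step naively costs $O(T^2)$; a single backward pass over the trajectory does it in $O(T)$ using the recursion $\hat{R}_t = r_t + \gamma \hat{R}_{t+1}$. A minimal sketch:

```python
import numpy as np

def rewards_to_go(rewards, gamma=0.99):
    """Compute R_hat_t for every t in one backward pass:
    R_hat_t = r_t + gamma * R_hat_{t+1}."""
    rtg = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

print(rewards_to_go([1.0, 1.0, 1.0], gamma=0.9))  # [2.71, 1.9, 1.0]
```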
The terms "trajectory," "episode," and "rollout" are often used interchangeably in the reinforcement learning literature, but they have subtly different connotations.
| Term | Definition | Typical usage |
|---|---|---|
| Trajectory | A sequence of states, actions, and rewards; may be partial or complete | General mathematical notation; used in policy gradient derivations and theoretical analysis |
| Episode | A complete trajectory from an initial state to a terminal state | Episodic tasks such as games, navigation problems, or any task with a defined end |
| Rollout | A trajectory generated by executing a policy in an environment (real or simulated) | Data collection phase; model-based RL where simulated trajectories are generated from a learned dynamics model |
A key distinction is that a trajectory can be a partial sequence (for example, a fixed-length segment used in a policy update), whereas an episode typically refers to a complete interaction from start to finish. Rollouts emphasize the act of generating data, particularly in simulation, and may be shorter than a full episode.
Trajectories appear in virtually every reinforcement learning algorithm, though the way they are collected, stored, and used varies.
On-policy algorithms learn from trajectories collected using the current policy. The agent generates trajectories, uses them to compute gradient estimates or value updates, and then discards them before collecting new data with the updated policy. Examples include REINFORCE, Proximal Policy Optimization (PPO), and Advantage Actor-Critic (A2C).
In the REINFORCE algorithm (Williams, 1992), the agent samples a complete trajectory, computes the return at each time step, and updates the policy parameters in the direction that increases the probability of high-return trajectories:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t | s_t) \cdot \hat{R}_t \right]$$
This gradient estimate is unbiased but can have high variance, which is why practical implementations add a baseline (often an estimate of the state value function) to reduce variance without introducing bias.
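A minimal sketch of the corresponding surrogate loss in PyTorch; the tensor names are illustrative, and the baseline is assumed to come from a separately trained value network:

```python
import torch

def reinforce_loss(log_probs, rewards_to_go, baseline=None):
    """Surrogate loss whose gradient matches the REINFORCE estimator.

    log_probs:      tensor of log pi_theta(a_t | s_t), shape (T,)
    rewards_to_go:  tensor of R_hat_t, shape (T,)
    baseline:       optional tensor of V(s_t) estimates, shape (T,)
    """
    advantages = rewards_to_go
    if baseline is not None:
        # Subtracting a state-dependent baseline reduces variance
        # without biasing the gradient estimate.
        advantages = rewards_to_go - baseline.detach()
    # Minimizing this loss ascends the policy gradient.
    return -(log_probs * advantages).mean()
```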
Off-policy algorithms can learn from trajectories generated by a different policy (the "behavior policy") than the one being optimized (the "target policy"). This makes it possible to reuse old trajectory data, improving sample efficiency. Examples include Q-learning, Deep Q-Networks (DQN), and Soft Actor-Critic (SAC).
When using trajectory data from a behavior policy $\beta$ to estimate quantities under a target policy $\pi$, importance sampling corrections are needed. The importance weight for a full trajectory is:
$$w(\tau) = \prod_{t=0}^{T-1} \frac{\pi(a_t | s_t)}{\beta(a_t | s_t)}$$
This product can become very large or very small as the trajectory length increases, leading to high variance. Per-decision importance sampling and techniques like weighted importance sampling help mitigate this problem.
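Because the weight is a long product of ratios, implementations typically work in log space, summing log-probability differences and exponentiating only at the end. A minimal sketch:

```python
import numpy as np

def trajectory_importance_weight(pi_log_probs, beta_log_probs):
    """w(tau) = prod_t pi(a_t|s_t) / beta(a_t|s_t), computed in log space
    to avoid numerical overflow/underflow on long trajectories."""
    log_w = np.sum(np.asarray(pi_log_probs) - np.asarray(beta_log_probs))
    return np.exp(log_w)
```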
Experience replay, introduced in the DQN architecture (Mnih et al., 2015), stores individual transitions $(s_t, a_t, r_{t+1}, s_{t+1})$ from trajectories in a replay buffer. During training, mini-batches of transitions are sampled from the buffer to break temporal correlations and improve learning stability. Prioritized experience replay (Schaul et al., 2016) samples transitions with higher temporal-difference errors more frequently, allowing the agent to learn more from surprising or informative experiences.
| Replay strategy | Selection method | Benefit |
|---|---|---|
| Uniform replay | Random sampling from buffer | Breaks temporal correlations; simple to implement |
| Prioritized replay | Weighted by TD error magnitude | Focuses learning on high-error (informative) transitions |
| Hindsight replay | Relabels goals after the fact | Enables learning from failed trajectories in goal-conditioned tasks |
| Trajectory replay | Samples contiguous trajectory segments | Preserves temporal structure for recurrent or multi-step methods |
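A uniform replay buffer (the first row above) needs only a bounded queue and random sampling. A minimal sketch, with an illustrative capacity:

```python
import random
from collections import deque

class ReplayBuffer:
    """Uniform experience replay: stores (s, a, r, s', done) transitions."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted first

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Draw a mini-batch uniformly at random, breaking temporal correlations."""
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```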
Model-based reinforcement learning algorithms learn a model of the environment's dynamics and then generate synthetic trajectories (rollouts) using the learned model for planning or policy improvement. Monte Carlo Tree Search (MCTS), as used in AlphaGo and AlphaZero, simulates many possible future trajectories from the current state to evaluate candidate actions. The quality of each action is estimated by averaging the returns across simulated rollouts.
Dyna-style algorithms (Sutton, 1991) interleave real trajectory collection with model-generated rollouts, using both to update value functions or policies. The length and accuracy of model-generated rollouts are practical concerns, since errors in the learned dynamics model compound over longer horizons.
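A sketch of generating such a synthetic rollout; `model.predict` is a hypothetical stand-in for a learned one-step dynamics model, and the short default horizon reflects the compounding-error concern:

```python
def model_rollout(model, policy, start_state, horizon=5):
    """Generate a synthetic trajectory from a learned dynamics model.
    Short horizons limit the compounding of model error."""
    state, trajectory = start_state, []
    for _ in range(horizon):
        action = policy(state)
        # `model.predict` is a hypothetical learned model, not the real environment.
        next_state, reward = model.predict(state, action)
        trajectory.append((state, action, reward))
        state = next_state
    return trajectory
```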
Policy gradient algorithms optimize a policy by estimating the gradient of an objective function with respect to the policy parameters. The objective is typically the expected return over trajectories:
$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} [R(\tau)]$$
The policy gradient theorem (Sutton et al., 2000) provides a way to compute the gradient of this objective:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t | s_t) \cdot A^{\pi}(s_t, a_t) \right]$$
where $A^{\pi}(s_t, a_t) = Q^{\pi}(s_t, a_t) - V^{\pi}(s_t)$ is the advantage function. In practice, the expectation is approximated by averaging over a batch of sampled trajectories.
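The advantage itself must be estimated from trajectory data. One widely used estimator is generalized advantage estimation (GAE; Schulman et al., 2016), sketched below under the simplifying assumption of a single non-terminating trajectory segment:

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over one trajectory segment.

    rewards: r_t for t = 0..T-1
    values:  V(s_t) for t = 0..T (includes a bootstrap value for the final state)
    Ignores episode termination for brevity.
    """
    T = len(rewards)
    advantages = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # one-step TD error
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```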
| Algorithm | Trajectory usage | Variance reduction technique |
|---|---|---|
| REINFORCE | Full episodes; uses Monte Carlo returns | Baseline subtraction |
| A2C / A3C | Fixed-length trajectory segments | Advantage function with learned value baseline |
| PPO | Fixed-length trajectory segments | Clipped surrogate objective; advantage normalization |
| TRPO | Full or partial trajectories | KL divergence constraint on policy updates |
| SAC | Individual transitions from replay buffer | Entropy regularization; off-policy learning |
Imitation learning methods learn policies from expert demonstrations, which take the form of trajectories. In behavioral cloning, the agent is trained via supervised learning to map states to actions using expert trajectory data. In inverse reinforcement learning (IRL), the goal is to recover the reward function that best explains a set of expert trajectories, and then use that inferred reward function to train a policy via standard RL.
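Behavioral cloning reduces to ordinary supervised learning over the state-action pairs contained in the expert trajectories. A minimal PyTorch sketch, assuming discrete actions and a policy network that outputs logits (all names illustrative):

```python
import torch.nn as nn

def behavioral_cloning_step(policy_net, optimizer, expert_states, expert_actions):
    """One supervised update: push the policy toward the expert's actions.

    expert_states:  float tensor of states from expert trajectories
    expert_actions: long tensor of the actions the expert took in those states
    """
    logits = policy_net(expert_states)
    loss = nn.functional.cross_entropy(logits, expert_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```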
Formally, given a set of expert trajectories $\{\tau_1^*, \tau_2^*, \ldots, \tau_N^*\}$ generated by an expert policy $\pi^*$, an IRL algorithm seeks a reward function $R$ such that the expert policy is optimal (or near-optimal) under $R$. Maximum entropy IRL (Ziebart et al., 2008) assumes the expert follows a distribution over trajectories given by:
$$P(\tau) \propto \exp(R(\tau))$$
This framework connects trajectory-level reasoning to probabilistic modeling and has been influential in robotics, autonomous driving, and other domains where reward functions are hard to specify by hand.
Recent work has reframed reinforcement learning as a sequence modeling problem, treating trajectories as sequences of tokens to be processed by transformer architectures.
Decision Transformer (Chen et al., 2021) conditions on a desired return, past states, and past actions to autoregressively predict future actions. Rather than computing value functions or policy gradients, it models the trajectory distribution directly. Given a trajectory represented as a sequence of return-to-go, state, and action tokens, the model learns to produce actions that achieve specified return levels.
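A sketch of how a raw trajectory is rearranged into this interleaved layout before tokenization; Decision Transformer uses undiscounted returns-to-go, and the representation here is deliberately schematic (real implementations also discretize or embed each element):

```python
def to_decision_transformer_sequence(states, actions, rewards):
    """Flatten a trajectory into the interleaved (return-to-go, state, action)
    layout used by Decision Transformer-style models."""
    rtg, running = [], 0.0
    for r in reversed(rewards):        # undiscounted return-to-go at each step
        running += r
        rtg.append(running)
    rtg.reverse()
    return [tok
            for g, s, a in zip(rtg, states, actions)
            for tok in (("rtg", g), ("state", s), ("action", a))]
```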
Trajectory Transformer (Janner et al., 2021) treats reinforcement learning as one big sequence modeling problem. It discretizes states, actions, and rewards and trains a transformer to predict the next token in a trajectory sequence. Planning is performed via beam search over predicted future trajectories.
These approaches are particularly well suited to offline reinforcement learning, where the agent must learn entirely from a fixed dataset of previously collected trajectories without any further environment interaction. Offline RL is important in settings where collecting new data is expensive or dangerous, such as healthcare, robotics, and autonomous driving.
| Method | Architecture | Planning approach | Key property |
|---|---|---|---|
| Decision Transformer | Causal transformer | Return-conditioned generation | No Bellman backups needed |
| Trajectory Transformer | Autoregressive transformer | Beam search over predicted sequences | Unified state, action, and reward modeling |
| Conservative Q-Learning (CQL) | Standard Q-network | Pessimistic value estimation | Penalizes out-of-distribution actions |
| Implicit Q-Learning (IQL) | Standard Q-network | Expectile regression | Avoids querying unseen actions |
Reinforcement learning from human feedback (RLHF) has become a standard technique for aligning large language models with human preferences. The concept of trajectories appears in RLHF in a specific way: human evaluators are presented with pairs of model-generated outputs (which can be viewed as trajectory segments in a text generation environment) and asked to indicate which output they prefer.
The original trajectory preference framework (Christiano et al., 2017) was developed for Atari game agents, where human raters compared short clips of agent behavior (trajectory segments) and indicated which behavior looked better. A reward model was then trained to predict these preferences, and the agent's policy was optimized to maximize the predicted reward.
In the language model setting, each generated response can be thought of as a trajectory through token space, where the states are partial sequences, the actions are token selections, and the reward comes from the learned preference model. This connection between trajectories and RLHF is one reason why reinforcement learning concepts have become relevant to the development of systems like ChatGPT and Claude.
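The reward model in this framework is trained with a Bradley-Terry style objective: the probability that segment A is preferred over segment B is a logistic function of the difference between their total predicted rewards. A minimal PyTorch sketch, with illustrative argument names:

```python
import torch

def preference_loss(reward_sum_a, reward_sum_b, prefer_a):
    """Pairwise preference loss in the style of Christiano et al. (2017).

    reward_sum_a/b: reward model's total predicted reward over each segment
    prefer_a:       tensor of labels, 1 if humans preferred segment A, else 0
    """
    logits = reward_sum_a - reward_sum_b  # P(A preferred) = sigmoid(logits)
    targets = prefer_a.float()
    return torch.nn.functional.binary_cross_entropy_with_logits(logits, targets)
```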
Trajectory optimization is a family of methods that directly optimize a sequence of actions (or states and actions) to maximize cumulative reward, subject to dynamics constraints. Unlike policy optimization, which learns a mapping from states to actions, trajectory optimization produces a plan: a specific sequence of actions for a specific initial condition.
| Method type | Optimization variables | Dynamics enforcement | Strengths |
|---|---|---|---|
| Single shooting | Actions only | Forward simulation | Simple; works with black-box simulators |
| Multiple shooting | Actions and states at segment boundaries | Continuity constraints between segments | More numerically stable than single shooting |
| Direct collocation | States and actions at all time steps | Dynamics as equality constraints | Handles complex constraints well; good convergence properties |
Trajectory optimization is widely used in robotics for motion planning, where a robot must find a collision-free path from a start configuration to a goal. It is also used in autonomous driving, spacecraft guidance, and other control applications.
Recent work (Amos et al., 2018; Jin et al., 2024) has made trajectory optimization differentiable, allowing it to be integrated into end-to-end learning pipelines. DiffTOP (Jin et al., 2024) uses differentiable trajectory optimization as a policy class, enabling gradients to flow through the optimization process so that cost functions and dynamics models can be learned from data.
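A sketch of the simplest variant, single shooting with gradient descent through a differentiable dynamics model; `dynamics` and `cost` are assumed stand-ins for differentiable torch functions, and the hyperparameters are illustrative:

```python
import torch

def single_shooting(dynamics, cost, s0, horizon=20, iters=100, lr=0.05):
    """Single-shooting trajectory optimization: optimize the action sequence
    directly by differentiating through forward simulation of the dynamics.

    dynamics(s, a) -> s_next and cost(s, a) -> scalar are assumed to be
    differentiable torch functions; s0 is the initial-state tensor.
    """
    actions = torch.zeros(horizon, requires_grad=True)
    opt = torch.optim.Adam([actions], lr=lr)
    for _ in range(iters):
        s, total_cost = s0, 0.0
        for t in range(horizon):
            total_cost = total_cost + cost(s, actions[t])
            s = dynamics(s, actions[t])   # forward simulation enforces dynamics
        opt.zero_grad()
        total_cost.backward()
        opt.step()
    return actions.detach()               # the optimized plan
```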
The quality and diversity of collected trajectories directly determine what an agent can learn. If an agent only visits a small region of the state space, it will have limited trajectory data and may learn a poor policy. This is the core of the exploration-exploitation tradeoff: the agent must balance collecting diverse trajectories (exploration) with acting according to its current best knowledge (exploitation).
Common exploration strategies that affect trajectory collection include:

- Epsilon-greedy action selection, which takes a random action with probability $\epsilon$ and the greedy action otherwise (see the sketch below);
- Entropy regularization, which rewards the policy for remaining stochastic, as in SAC;
- Action noise, such as Gaussian noise added to continuous actions in deterministic-policy methods;
- Intrinsic motivation or curiosity bonuses, which add reward for visiting novel or poorly predicted states;
- Optimistic initialization and count- or uncertainty-based bonuses, which bias the agent toward under-visited regions of the state space.
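A minimal sketch of epsilon-greedy selection over estimated action values; the $\epsilon$ value and seeded generator are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon=0.1):
    """Take a random action with probability epsilon, else the greedy action."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # explore
    return int(np.argmax(q_values))               # exploit
```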
Outside of reinforcement learning, the term "trajectory" is also used in optimization to describe the path that an algorithm traces through parameter space as it converges toward a local or global minimum. For example, the trajectory of gradient descent on a loss surface is the sequence of parameter values visited during training:
$$\theta_0, \theta_1, \theta_2, \ldots$$
where each update follows $\theta_{t+1} = \theta_t - \alpha \nabla_{\theta} L(\theta_t)$ and $\alpha$ is the learning rate.
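A sketch that records this optimization trajectory for an arbitrary gradient function; the quadratic example and learning rate are illustrative:

```python
import numpy as np

def gradient_descent_trajectory(grad_fn, theta0, alpha=0.1, steps=100):
    """Record the optimization trajectory theta_0, theta_1, ... of gradient descent."""
    trajectory = [np.asarray(theta0, dtype=float)]
    for _ in range(steps):
        theta = trajectory[-1]
        trajectory.append(theta - alpha * grad_fn(theta))  # theta_{t+1} = theta_t - alpha * grad
    return trajectory

# Example: descend L(theta) = theta^2, whose gradient is 2 * theta.
traj = gradient_descent_trajectory(lambda th: 2 * th, theta0=[5.0], alpha=0.1, steps=50)
print(traj[0], traj[-1])  # the trajectory converges toward the minimum at 0
```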
Analyzing optimization trajectories can reveal properties of the loss surface (e.g., presence of saddle points, flat regions, or sharp minima), the effect of hyperparameters like learning rate and momentum, and differences between optimizers such as SGD, Adam, and RMSProp.
Trajectory-based methods and analysis have been applied across many domains.
| Application domain | Role of trajectories | Example systems |
|---|---|---|
| Game playing | Agent play-throughs used for training and evaluation | AlphaGo, OpenAI Five, Atari agents |
| Robotics | Motion planning and manipulation trajectories | Dexterous manipulation, legged locomotion |
| Autonomous driving | Vehicle path planning and prediction of other agents' trajectories | Waymo, Tesla Autopilot |
| Healthcare | Treatment sequences modeled as trajectories through patient state space | Sepsis treatment optimization, dosing strategies |
| Language models | Token generation sequences treated as trajectories for RLHF alignment | ChatGPT, Claude, Gemini |
| Finance | Trading action sequences optimized via RL | Portfolio management, order execution |