# Trajectory (Reinforcement Learning)

> Source: https://aiwiki.ai/wiki/trajectory
> Updated: 2026-04-26
> Categories: Machine Learning, Reinforcement Learning
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

A **trajectory** in [reinforcement learning](/wiki/reinforcement_learning) is a sequence of states, actions, and rewards that an agent experiences while interacting with an environment. Usually denoted by the Greek letter tau ($\tau$), a trajectory records the history of an agent's behavior over a span of time steps and serves as the fundamental unit of data used for learning, evaluation, and policy improvement.

Trajectories are sometimes called **episodes** or **rollouts**, though these terms carry slightly different connotations depending on context. Understanding trajectories is essential for nearly every area of reinforcement learning, from [policy gradient](/wiki/policy_gradient) methods and value estimation to [imitation learning](/wiki/imitation_learning) and offline reinforcement learning.

## ELI5 (Explain Like I'm 5)

Imagine you are playing a video game. Every time you play, you start at the beginning, make choices (go left, go right, jump, etc.), see what happens after each choice, and collect or lose points along the way. When you finish the game (or stop playing), you can look back at everything that happened: where you started, what you did, what you saw, and how many points you got at each step.

That whole record of your playthrough is a **trajectory**. It is like a diary of one game session. If you play the game many times, you get many different trajectories, and by studying them you can figure out which choices tend to give you more points. That is basically how a computer learns in reinforcement learning: it plays many times, collects trajectories, and uses them to get better.

## Formal definition

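A trajectory is a finite or infinite sequence of states, actions, and rewards generated by an agent acting in an environment. The notation below follows the reward-indexing convention of Sutton and Barto (2018), in which the reward that results from action $a_t$ is written $r_{t+1}$ ([OpenAI](/wiki/openai) Spinning Up describes the same sequence but indexes that reward as $r_t$):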

$$\tau = (s_0, a_0, r_1, s_1, a_1, r_2, s_2, a_2, r_3, \ldots)$$

where:

- $s_t$ is the state of the environment at time step $t$
- $a_t$ is the action taken by the agent at time step $t$
- $r_{t+1}$ is the scalar reward received after taking action $a_t$ in state $s_t$

Some formulations use a slightly different convention where the reward at time $t$ is written as $r_t = R(s_t, a_t)$ or $r_t = R(s_t, a_t, s_{t+1})$, depending on whether the reward function depends on the next state.

### Generation process

A trajectory is generated step by step according to the following process:

1. The initial state $s_0$ is sampled from a start-state distribution: $s_0 \sim \rho_0(\cdot)$
2. At each time step $t$, the agent selects an action according to its policy: $a_t \sim \pi(\cdot | s_t)$ (stochastic policy) or $a_t = \pi(s_t)$ (deterministic policy)
3. The environment transitions to a new state according to its dynamics: $s_{t+1} \sim P(\cdot | s_t, a_t)$ (stochastic) or $s_{t+1} = f(s_t, a_t)$ (deterministic)
4. The agent receives a reward $r_{t+1} = R(s_t, a_t, s_{t+1})$
5. Steps 2 through 4 repeat until a terminal condition is met or indefinitely in continuing tasks
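
The generation process above can be sketched as a short rollout loop. This is a minimal illustration, not a reference implementation: the corridor environment, the `rollout` function, and the helper names `sample_start`, `step`, and `random_policy` are all hypothetical and chosen only to mirror the five steps.

```python
import random

def rollout(policy, step, sample_start, max_steps, rng):
    """Collect one trajectory as a list of (s_t, a_t, r_{t+1}, s_{t+1}) tuples."""
    trajectory = []
    state = sample_start(rng)                      # step 1: s_0 ~ rho_0
    for _ in range(max_steps):
        action = policy(state, rng)                # step 2: a_t ~ pi(. | s_t)
        next_state, reward, done = step(state, action, rng)  # steps 3 and 4
        trajectory.append((state, action, reward, next_state))
        state = next_state
        if done:                                   # step 5: stop at a terminal state
            break
    return trajectory

# Hypothetical toy environment: a corridor with positions 0..4 and a goal at 4.
def sample_start(rng):
    return 0

def step(state, action, rng):
    next_state = min(4, max(0, state + action))
    reward = 1.0 if next_state == 4 else 0.0
    return next_state, reward, next_state == 4

def random_policy(state, rng):
    return rng.choice([-1, 1])                     # move left or right uniformly

rng = random.Random(0)
tau = rollout(random_policy, step, sample_start, max_steps=50, rng=rng)
print(len(tau), tau[-1])
```

The `max_steps` cap plays the role of a time limit, so the loop also terminates when the random policy never reaches the goal.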

The probability of a particular trajectory under a given policy $\pi$ is:

$$P(\tau | \pi) = \rho_0(s_0) \prod_{t=0}^{T-1} \pi(a_t | s_t) \cdot P(s_{t+1} | s_t, a_t)$$

This probability is central to policy gradient methods and importance sampling corrections.
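
For a small tabular problem, the product above can be evaluated directly. The sketch below works in log space to avoid underflow on long trajectories; the two-state MDP and the dictionary layout for `rho0`, `policy`, and `dynamics` are assumptions made for the example.

```python
import math

def trajectory_log_prob(states, actions, rho0, policy, dynamics):
    """Return log P(tau | pi) for states s_0..s_T and actions a_0..a_{T-1}."""
    assert len(states) == len(actions) + 1
    log_p = math.log(rho0[states[0]])                 # log rho_0(s_0)
    for t, a in enumerate(actions):
        s, s_next = states[t], states[t + 1]
        log_p += math.log(policy[s][a])               # log pi(a_t | s_t)
        log_p += math.log(dynamics[(s, a)][s_next])   # log P(s_{t+1} | s_t, a_t)
    return log_p

# Hypothetical two-state MDP used only for illustration.
rho0 = {"A": 1.0}
policy = {"A": {"stay": 0.5, "go": 0.5}, "B": {"stay": 1.0}}
dynamics = {
    ("A", "stay"): {"A": 1.0},
    ("A", "go"): {"A": 0.2, "B": 0.8},
    ("B", "stay"): {"B": 1.0},
}

log_p = trajectory_log_prob(["A", "A", "B"], ["stay", "go"], rho0, policy, dynamics)
print(math.exp(log_p))  # 1.0 * (0.5 * 1.0) * (0.5 * 0.8), approximately 0.2
```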

## Return and cumulative reward

The **return** is the cumulative reward collected along a trajectory. There are two standard formulations.

### Finite-horizon undiscounted return

For tasks with a fixed time horizon $T$, the trajectory contains the rewards $r_1, \ldots, r_T$, and the return is their sum:

$$R(\tau) = \sum_{t=0}^{T-1} r_{t+1}$$

### Infinite-horizon discounted return

For continuing tasks or when future rewards should be weighted less than immediate ones:

$$R(\tau) = \sum_{t=0}^{\infty} \gamma^t r_{t+1}$$

where $\gamma \in (0, 1)$ is the [discount factor](/wiki/discount_factor). The discount factor serves two purposes: it guarantees that the infinite sum converges whenever the rewards are bounded, and it encodes a preference for sooner rewards over later ones.
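
As a small worked example, the discounted return of a recorded reward sequence can be computed as follows. The function name `discounted_return` is illustrative, and the list is assumed to hold rewards in the order they were received, so that the first entry is weighted by $\gamma^0$.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k times the k-th reward received (k starts at 0)."""
    total = 0.0
    for k, r in enumerate(rewards):
        total += (gamma ** k) * r
    return total

rewards = [1.0, 0.0, 2.0]
print(discounted_return(rewards, 0.5))   # 1.0 + 0.5 * 0.0 + 0.25 * 2.0 = 1.5
print(discounted_return(rewards, 1.0))   # undiscounted sum = 3.0

# With a constant reward of 1 the sum approaches 1 / (1 - gamma).
print(discounted_return([1.0] * 1000, 0.9))  # close to 10
```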

### Reward-to-go

The **reward-to-go** from time step $t$ is the return computed from that point forward:

$$\hat{R}_t = \sum_{t'=t}^{T-1} \gamma^{t'-t} r_{t'+1}$$

This quantity is used in many policy gradient algorithms because it reduces variance compared to using the full trajectory return. Only rewards that come after an action should influence the credit assigned to that action; rewards collected before the action are not affected by it.
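
Reward-to-go for every time step can be computed in a single backward pass, since each value equals the immediate reward plus $\gamma$ times the value one step later. A minimal sketch, assuming `rewards[t]` holds the reward received after the action at step `t` (the function name is illustrative):

```python
def rewards_to_go(rewards, gamma):
    """Return a list whose t-th entry is the discounted reward-to-go from step t."""
    out = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running   # R_t = r + gamma * R_{t+1}
        out[t] = running
    return out

print(rewards_to_go([1.0, 0.0, 2.0], 0.5))  # [1.5, 1.0, 2.0]
```

The first entry equals the discounted return of the whole trajectory, while later entries ignore rewards that were collected before the corresponding action.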

## Trajectory vs. episode vs. rollout

The terms "trajectory," "episode," and "rollout" are often used interchangeably in the reinforcement learning literature, but they have subtly different connotations.

| Term | Definition | Typical usage |
|------|-----------|---------------|
| Trajectory | A sequence of states, actions, and rewards; may be partial or complete | General mathematical notation; used in [policy gradient](/wiki/policy_gradient) derivations and theoretical analysis |
| Episode | A complete trajectory from an initial state to a terminal state | Episodic tasks such as games, navigation problems, or any task with a defined end |
| Rollout | A trajectory generated by executing a policy in an environment (real or simulated) | Data collection phase; model-based RL where simulated trajectories are generated from a learned dynamics model |

A key distinction is that a trajectory can be a partial sequence (for example, a fixed-length segment used in a policy update), whereas an episode typically refers to a complete interaction from start to finish. Rollouts emphasize the act of generating data, particularly in simulation, and may be shorter than a full episode.

## Role of trajectories in reinforcement learning algorithms

Trajectories appear in virtually every reinforcement learning algorithm, though the way they are collected, stored, and used varies.

### On-policy methods

On-policy algorithms learn from trajectories collected using the current policy. The agent generates trajectories, uses them to compute gradient estimates or value updates, and then discards them before collecting new data with the updated policy. Examples include REINFORCE, Proximal Policy Optimization (PPO), and Advantage Actor-Critic (A2C).

In the REINFORCE algorithm (Williams, 1992), the agent samples a complete trajectory, computes the return at each time step, and updates the policy parameters in the direction that increases the probability of high-return trajectories:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t | s_t) \cdot \hat{R}_t \right]$$

This gradient estimate is unbiased but can have high variance, which is why practical implementations add a baseline (often an estimate of the state value function) to reduce variance without introducing bias.
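
The following sketch computes a single-trajectory REINFORCE gradient estimate for a tabular softmax policy. It is an illustration under stated assumptions rather than a full training loop: the policy is parameterized by a table of logits `theta[s][a]`, the baseline is a constant, and all function names are hypothetical. It relies on the identity that, for a softmax policy, the derivative of $\log \pi(a | s)$ with respect to the logit of action $b$ is $1[b = a] - \pi(b | s)$.

```python
import math

def softmax(logits):
    m = max(logits)                                  # subtract max for stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_gradient(theta, states, actions, rewards, gamma=1.0, baseline=0.0):
    """Single-trajectory gradient estimate; returns a table shaped like theta."""
    # Reward-to-go for each visited step, computed backward.
    rtg, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running

    grad = [[0.0] * len(row) for row in theta]
    for s, a, g in zip(states, actions, rtg):
        probs = softmax(theta[s])
        for b in range(len(probs)):
            # d log pi(a|s) / d theta[s][b] = 1[b == a] - pi(b|s)
            indicator = 1.0 if b == a else 0.0
            grad[s][b] += (indicator - probs[b]) * (g - baseline)
    return grad

theta = [[0.0, 0.0], [0.0, 0.0]]                     # two states, two actions
grad = reinforce_gradient(theta, states=[0, 1], actions=[1, 0], rewards=[0.0, 1.0])
print(grad)  # [[-0.5, 0.5], [0.5, -0.5]]
```

A gradient-ascent step `theta[s][b] += lr * grad[s][b]` would then raise the probability of the actions that preceded the reward.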

### Off-policy methods

Off-policy algorithms can learn from trajectories generated by a different policy (the "behavior policy") than the one being optimized (the "target policy"). This makes it possible to reuse old trajectory data, improving sample efficiency. Examples include [Q-learning](/wiki/q-learning), Deep Q-Networks (DQN), and Soft Actor-Critic (SAC).

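When using trajectory data from a behavior policy $\beta$ to estimate quantities under a target policy $\pi$, importance sampling corrections are needed. Because both policies face the same start-state distribution and dynamics, those terms cancel in the ratio $P(\tau | \pi) / P(\tau | \beta)$, and the importance weight for a full trajectory reduces to a product of action-probability ratios: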

$$w(\tau) = \prod_{t=0}^{T-1} \frac{\pi(a_t | s_t)}{\beta(a_t | s_t)}$$

This product can become very large or very small as the trajectory length increases, leading to high variance. Per-decision importance sampling and techniques like weighted importance sampling help mitigate this problem.
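
A short sketch of the trajectory-level weight, accumulated in log space, makes the variance problem concrete. The dictionary-based policies and the function name are assumptions for the example, and the code assumes both policies assign nonzero probability to every action taken.

```python
import math

def trajectory_importance_weight(states, actions, target, behavior):
    """Product over t of target(a_t | s_t) / behavior(a_t | s_t)."""
    log_w = 0.0
    for s, a in zip(states, actions):
        # Assumes nonzero probabilities; math.log(0) would raise ValueError.
        log_w += math.log(target[s][a]) - math.log(behavior[s][a])
    return math.exp(log_w)

target = {"s": {"left": 0.9, "right": 0.1}}      # hypothetical target policy
behavior = {"s": {"left": 0.5, "right": 0.5}}    # hypothetical behavior policy

w_short = trajectory_importance_weight(
    ["s"] * 3, ["left", "left", "right"], target, behavior)
print(w_short)   # 1.8 * 1.8 * 0.2, approximately 0.648

# The weight grows geometrically with trajectory length.
w_long = trajectory_importance_weight(["s"] * 50, ["left"] * 50, target, behavior)
print(w_long)    # 1.8 ** 50, roughly 5.8e12
```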

### Experience replay

Experience replay, first proposed by Lin in the early 1990s and popularized in deep reinforcement learning by the DQN architecture (Mnih et al., 2015), stores individual transitions $(s_t, a_t, r_{t+1}, s_{t+1})$ from trajectories in a replay buffer. During training, mini-batches of transitions are sampled from the buffer to break temporal correlations and improve learning stability. Prioritized experience replay (Schaul et al., 2016) samples transitions with higher temporal-difference errors more frequently, allowing the agent to learn more from surprising or informative experiences.

| Replay strategy | Selection method | Benefit |
|----------------|-----------------|--------|
| Uniform replay | Random sampling from buffer | Breaks temporal correlations; simple to implement |
| Prioritized replay | Weighted by TD error magnitude | Focuses learning on high-error (informative) transitions |
| Hindsight replay | Relabels goals after the fact | Enables learning from failed trajectories in goal-conditioned tasks |
| Trajectory replay | Samples contiguous trajectory segments | Preserves temporal structure for recurrent or multi-step methods |
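
The uniform replay strategy in the first row can be sketched with a bounded queue. The `ReplayBuffer` class below is a hypothetical minimal version, not the implementation from any particular library; it stores a `done` flag alongside each transition, which is a common but assumed design choice.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity buffer with uniform random sampling."""

    def __init__(self, capacity, seed=None):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are evicted first
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Sampling without replacement breaks the temporal ordering of the data.
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer(capacity=3, seed=0)
for i in range(5):
    buf.add(i, 0, 0.0, i + 1, False)

batch = buf.sample(2)
print(len(buf), batch)   # 3 transitions remain; the two oldest were evicted
```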

### Model-based methods

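Model-based reinforcement learning algorithms learn a model of the environment's dynamics and then generate synthetic trajectories (rollouts) using the learned model for planning or policy improvement. A closely related planning technique is Monte Carlo Tree Search (MCTS), as used in [AlphaGo](/wiki/alphago) and AlphaZero, which simulates many possible future trajectories from the current state to evaluate candidate actions; these systems search with the known game rules rather than a learned dynamics model. The quality of each action is estimated by averaging the outcomes of the simulations that pass through it, where an outcome may be the return of a rollout played to the end of the game or, as in AlphaZero, a value-network estimate at the leaf.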

Dyna-style algorithms (Sutton, 1991) interleave real trajectory collection with model-generated rollouts, using both to update value functions or policies. The length and accuracy of model-generated rollouts are practical concerns, since errors in the learned dynamics model compound over longer horizons.

## Trajectories in policy gradient methods

Policy gradient algorithms optimize a policy by estimating the gradient of an objective function with respect to the policy parameters. The objective is typically the expected return over trajectories:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} [R(\tau)]$$

The policy gradient theorem (Sutton et al., 2000) provides a way to compute the gradient of this objective:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t | s_t) \cdot A^{\pi}(s_t, a_t) \right]$$

where $A^{\pi}(s_t, a_t) = Q^{\pi}(s_t, a_t) - V^{\pi}(s_t)$ is the advantage function. In practice, the expectation is approximated by averaging over a batch of sampled trajectories.
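
The step from the objective to a sum of per-step score terms can be sketched with the log-derivative trick. The outline below assumes that the start-state distribution $\rho_0$ and the dynamics $P$ do not depend on $\theta$, and it writes $P(\tau | \theta)$ for the trajectory probability under $\pi_\theta$:

$$\nabla_\theta J(\theta) = \nabla_\theta \int P(\tau | \theta) R(\tau) \, d\tau = \int P(\tau | \theta) \nabla_\theta \log P(\tau | \theta) R(\tau) \, d\tau = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \nabla_\theta \log P(\tau | \theta) R(\tau) \right]$$

$$\nabla_\theta \log P(\tau | \theta) = \nabla_\theta \left[ \log \rho_0(s_0) + \sum_{t=0}^{T-1} \left( \log \pi_\theta(a_t | s_t) + \log P(s_{t+1} | s_t, a_t) \right) \right] = \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t | s_t)$$

Only the policy terms survive the gradient, which is why the estimator can be computed from sampled trajectories without knowing the dynamics. Replacing $R(\tau)$ with the reward-to-go or the advantage leaves the expectation unchanged while lowering its variance.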

| Algorithm | Trajectory usage | Variance reduction technique |
|-----------|-----------------|-----------------------------|
| REINFORCE | Full episodes; uses Monte Carlo returns | Baseline subtraction |
| A2C / A3C | Fixed-length trajectory segments | Advantage function with learned value baseline |
| PPO | Fixed-length trajectory segments | Clipped surrogate objective; advantage normalization |
| TRPO | Full or partial trajectories | KL divergence constraint on policy updates |
| SAC | Individual transitions from replay buffer | Entropy regularization; off-policy learning |

## Trajectories in imitation and inverse reinforcement learning

[Imitation learning](/wiki/imitation_learning) methods learn policies from expert demonstrations, which take the form of trajectories. In behavioral cloning, the agent is trained via supervised learning to map states to actions using expert trajectory data. In inverse reinforcement learning (IRL), the goal is to recover the reward function that best explains a set of expert trajectories, and then use that inferred reward function to train a policy via standard RL.

Formally, given a set of expert trajectories $\{\tau_1^*, \tau_2^*, \ldots, \tau_N^*\}$ generated by an expert policy $\pi^*$, an IRL algorithm seeks a reward function $R$ such that the expert policy is optimal (or near-optimal) under $R$. Maximum entropy IRL (Ziebart et al., 2008) assumes the expert follows a distribution over trajectories given by:

$$P(\tau) \propto \exp(R(\tau))$$

This framework connects trajectory-level reasoning to probabilistic modeling and has been influential in robotics, autonomous driving, and other domains where reward functions are hard to specify by hand.
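
When the set of candidate trajectories is finite, the maximum entropy distribution can be normalized explicitly, which amounts to a softmax over trajectory returns. The sketch below assumes such a finite set and deterministic dynamics; the function name is illustrative.

```python
import math

def maxent_trajectory_probs(returns):
    """Normalize exp(R(tau)) over a finite list of candidate trajectory returns."""
    m = max(returns)                               # shift for numerical stability
    weights = [math.exp(r - m) for r in returns]
    z = sum(weights)
    return [w / z for w in weights]

probs = maxent_trajectory_probs([1.0, 2.0, 3.0])
print(probs)
```

Each additional unit of return multiplies a trajectory's probability by $e$, so higher-return trajectories are exponentially more likely while lower-return ones keep nonzero probability.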

## Trajectory transformers and offline reinforcement learning

Recent work has reframed reinforcement learning as a sequence modeling problem, treating trajectories as sequences of tokens to be processed by [transformer](/wiki/transformer) architectures.

**Decision Transformer** (Chen et al., 2021) conditions on a desired return, past states, and past actions to autoregressively predict future actions. Rather than computing value functions or policy gradients, it models the trajectory distribution directly. Given a trajectory represented as a sequence of return-to-go, state, and action tokens, the model learns to produce actions that achieve specified return levels.
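
The input layout described above can be sketched as follows: for each time step, an undiscounted return-to-go, the state, and the action are interleaved into one sequence. The tuple tags and the function name are hypothetical, and a real model would embed each modality into vectors rather than keep tagged tuples.

```python
def decision_transformer_sequence(states, actions, rewards):
    """Interleave (return-to-go, state, action) entries for one trajectory."""
    rtg, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]                  # undiscounted sum of future rewards
        rtg[t] = running

    tokens = []
    for g, s, a in zip(rtg, states, actions):
        tokens.extend([("rtg", g), ("state", s), ("action", a)])
    return tokens

tokens = decision_transformer_sequence(
    states=["s0", "s1", "s2"],
    actions=["a0", "a1", "a2"],
    rewards=[1.0, 0.0, 2.0],
)
print(tokens[:3])   # [('rtg', 3.0), ('state', 's0'), ('action', 'a0')]
```

At test time the first return-to-go entry is set to the desired return and reduced by each observed reward as the episode unfolds.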

**Trajectory Transformer** (Janner et al., 2021) treats reinforcement learning as one big sequence modeling problem. It discretizes states, actions, and rewards and trains a [transformer](/wiki/transformer) to predict the next token in a trajectory sequence. Planning is performed via beam search over predicted future trajectories.

These approaches are particularly well suited to **offline reinforcement learning**, where the agent must learn entirely from a fixed dataset of previously collected trajectories without any further environment interaction. Offline RL is important in settings where collecting new data is expensive or dangerous, such as healthcare, robotics, and autonomous driving.

| Method | Architecture | Planning approach | Key property |
|--------|-------------|------------------|--------------|
| Decision Transformer | Causal [transformer](/wiki/transformer) | Return-conditioned generation | No Bellman backups needed |
| Trajectory Transformer | Autoregressive [transformer](/wiki/transformer) | Beam search over predicted sequences | Unified state, action, and reward modeling |
| Conservative Q-Learning (CQL) | Standard Q-network | Pessimistic value estimation | Penalizes out-of-distribution actions |
| Implicit Q-Learning (IQL) | Standard Q-network | Expectile regression | Avoids querying unseen actions |

## Trajectories in RLHF

Reinforcement learning from human feedback (RLHF) has become a standard technique for aligning large language models with human preferences. The concept of trajectories appears in RLHF in a specific way: human evaluators are presented with pairs of model-generated outputs (which can be viewed as trajectory segments in a text generation environment) and asked to indicate which output they prefer.

The original trajectory preference framework (Christiano et al., 2017) was developed for Atari game agents and simulated robot locomotion tasks, where human raters compared short clips of agent behavior (trajectory segments) and indicated which behavior looked better. A reward model was then trained to predict these preferences, and the agent's policy was optimized to maximize the predicted reward.
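
In Christiano et al. (2017), the probability that a rater prefers one segment over another is modeled as a softmax over the summed predicted rewards of the two segments, and the reward model is fit with a cross-entropy loss. A minimal sketch of that preference model, with hypothetical function names and with per-step predicted rewards passed in as plain lists:

```python
import math

def preference_probability(rewards_a, rewards_b):
    """P(segment A preferred) = exp(sum_a) / (exp(sum_a) + exp(sum_b))."""
    sum_a, sum_b = sum(rewards_a), sum(rewards_b)
    # Equivalent logistic form; very large differences could overflow math.exp.
    return 1.0 / (1.0 + math.exp(sum_b - sum_a))

def preference_loss(rewards_a, rewards_b, a_preferred):
    """Cross-entropy loss for one human comparison."""
    p = preference_probability(rewards_a, rewards_b)
    return -math.log(p if a_preferred else 1.0 - p)

p = preference_probability([0.5, 0.5, 1.0], [0.2, 0.3, 0.5])
print(p)                                                           # above 0.5
print(preference_loss([0.5, 0.5, 1.0], [0.2, 0.3, 0.5], True))     # small loss
```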

In the language model setting, each generated response can be thought of as a trajectory through token space, where the states are partial sequences, the actions are token selections, and the reward comes from the learned preference model. This connection between trajectories and RLHF is one reason why reinforcement learning concepts have become relevant to the development of systems like [ChatGPT](/wiki/chatgpt) and Claude.

## Trajectory optimization

Trajectory optimization is a family of methods that directly optimize a sequence of actions (or states and actions) to maximize cumulative reward, subject to dynamics constraints. Unlike policy optimization, which learns a mapping from states to actions, trajectory optimization produces a plan: a specific sequence of actions for a specific initial condition.

### Shooting methods vs. collocation methods

| Method type | Optimization variables | Dynamics enforcement | Strengths |
|------------|----------------------|---------------------|-----------|
| Single shooting | Actions only | Forward simulation | Simple; works with black-box simulators |
| Multiple shooting | Actions and states at segment boundaries | Continuity constraints between segments | More numerically stable than single shooting |
| Direct collocation | States and actions at all time steps | Dynamics as equality constraints | Handles complex constraints well; good convergence properties |

Trajectory optimization is widely used in robotics for motion planning, where a robot must find a collision-free path from a start configuration to a goal. It is also used in autonomous driving, spacecraft guidance, and other control applications.
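
As an illustration of single shooting, the sketch below uses random shooting, a simple derivative-free variant: it samples candidate action sequences, forward-simulates each one, and keeps the best. The one-dimensional dynamics $x_{t+1} = x_t + a_t$, the reward function, and all names are assumptions chosen to keep the example short.

```python
import random

def simulate(x0, actions):
    """Forward-simulate the hypothetical dynamics x_{t+1} = x_t + a_t."""
    xs = [x0]
    for a in actions:
        xs.append(xs[-1] + a)
    return xs

def total_reward(x0, actions, goal):
    xs = simulate(x0, actions)
    # Negative squared distance to the goal at the final step, minus control effort.
    return -(xs[-1] - goal) ** 2 - 0.01 * sum(a * a for a in actions)

def random_shooting(x0, goal, horizon, num_candidates, rng):
    """Optimize over actions only; states follow from forward simulation."""
    best_actions, best_reward = None, float("-inf")
    for _ in range(num_candidates):
        actions = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        r = total_reward(x0, actions, goal)
        if r > best_reward:
            best_actions, best_reward = actions, r
    return best_actions, best_reward

rng = random.Random(0)
plan, reward = random_shooting(x0=0.0, goal=2.0, horizon=5, num_candidates=500, rng=rng)
print(simulate(0.0, plan)[-1], reward)
```

Because the dynamics are only ever called through `simulate`, the same loop works with a black-box simulator, which is the strength noted for single shooting in the table above.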

### Differentiable trajectory optimization

Recent work (Amos et al., 2018; Jin et al., 2024) has made trajectory optimization differentiable, allowing it to be integrated into end-to-end learning pipelines. DiffTORI (Jin et al., 2024) uses differentiable trajectory optimization as a policy class, enabling gradients to flow through the optimization process so that cost functions and dynamics models can be learned from data.

## Exploration and trajectory quality

The quality and diversity of collected trajectories directly determine what an agent can learn. If an agent only visits a small region of the state space, it will have limited trajectory data and may learn a poor policy. This is the core of the exploration-exploitation tradeoff: the agent must balance collecting diverse trajectories (exploration) with acting according to its current best knowledge (exploitation).

Common exploration strategies that affect trajectory collection include:

- **Epsilon-greedy**: With probability $\epsilon$, the agent takes a random action instead of the greedy action, producing more varied trajectories
- **Boltzmann (softmax) exploration**: Action selection probabilities are proportional to exponentiated Q-values scaled by a temperature parameter, leading to stochastic trajectories that favor higher-valued actions
- **Intrinsic motivation**: The agent receives bonus rewards for visiting novel states, encouraging trajectories that reach unexplored parts of the environment
- **Parameter space noise**: Noise is added to the policy parameters rather than to actions, producing coherent exploratory trajectories rather than random jittering
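
The first two strategies in the list can be sketched as action-selection functions over a list of Q-values for the current state. The function names and the `temperature` parameter are illustrative; a lower temperature makes Boltzmann selection closer to greedy.

```python
import math
import random

def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a uniformly random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def boltzmann_probs(q_values, temperature):
    """Softmax over Q-values divided by a temperature."""
    m = max(q_values)                                    # shift for stability
    exps = [math.exp((q - m) / temperature) for q in q_values]
    z = sum(exps)
    return [e / z for e in exps]

def boltzmann(q_values, temperature, rng):
    probs = boltzmann_probs(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs)[0]

rng = random.Random(0)
q = [0.1, 0.9, 0.3]
print(epsilon_greedy(q, 0.1, rng))
print(boltzmann_probs(q, 0.5))
print(boltzmann(q, 0.5, rng))
```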

## Trajectories in optimization (non-RL context)

Outside of reinforcement learning, the term "trajectory" is also used in optimization to describe the path that an algorithm traces through parameter space as it converges toward a local or global minimum. For example, the trajectory of [gradient descent](/wiki/gradient_descent) on a loss surface is the sequence of parameter values visited during training:

$$\theta_0, \theta_1, \theta_2, \ldots$$

where each update follows $\theta_{t+1} = \theta_t - \alpha \nabla_{\theta} L(\theta_t)$ and $\alpha$ is the learning rate.
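
A minimal sketch of recording such a trajectory, assuming a scalar parameter and the hypothetical loss $L(\theta) = \theta^2$ with gradient $2\theta$:

```python
def gradient_descent_trajectory(grad, theta0, alpha, num_steps):
    """Return the list of parameter values visited by gradient descent."""
    thetas = [theta0]
    for _ in range(num_steps):
        thetas.append(thetas[-1] - alpha * grad(thetas[-1]))
    return thetas

traj = gradient_descent_trajectory(lambda th: 2.0 * th, theta0=1.0, alpha=0.25, num_steps=3)
print(traj)  # [1.0, 0.5, 0.25, 0.125]
```

With this loss, each step multiplies the parameter by $1 - 2\alpha$, so the trajectory contracts toward the minimum at zero for $0 < \alpha < 1$ and diverges for $\alpha > 1$.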

Analyzing optimization trajectories can reveal properties of the loss surface (e.g., presence of saddle points, flat regions, or sharp minima), the effect of hyperparameters like learning rate and momentum, and differences between optimizers such as SGD, Adam, and RMSProp.

## Applications

Trajectory-based methods and analysis have been applied across many domains.

| Application domain | Role of trajectories | Example systems |
|-------------------|---------------------|----------------|
| Game playing | Agent play-throughs used for training and evaluation | [AlphaGo](/wiki/alphago), [OpenAI](/wiki/openai) Five, Atari agents |
| Robotics | Motion planning and manipulation trajectories | Dexterous manipulation, legged locomotion |
| Autonomous driving | Vehicle path planning and prediction of other agents' trajectories | Waymo, Tesla Autopilot |
| Healthcare | Treatment sequences modeled as trajectories through patient state space | Sepsis treatment optimization, dosing strategies |
| Language models | Token generation sequences treated as trajectories for RLHF alignment | [ChatGPT](/wiki/chatgpt), Claude, Gemini |
| Finance | Trading action sequences optimized via RL | Portfolio management, order execution |

## See also

- [Reinforcement learning](/wiki/reinforcement_learning)
- [Q-learning](/wiki/q-learning)
- [Policy gradient](/wiki/policy_gradient)
- [Discount factor](/wiki/discount_factor)
- [Imitation learning](/wiki/imitation_learning)
- [Markov decision process](/wiki/markov_decision_process_mdp)
- [Transformer](/wiki/transformer)

## References

1. Sutton, R. S., & Barto, A. G. (2018). *Reinforcement Learning: An Introduction* (2nd ed.). MIT Press. http://incompleteideas.net/book/the-book-2nd.html
2. Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. *Machine Learning*, 8(3-4), 229-256.
3. Sutton, R. S., McAllester, D., Singh, S., & Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. *Advances in Neural Information Processing Systems*, 12.
4. Mnih, V., Kavukcuoglu, K., Silver, D., et al. (2015). Human-level control through deep reinforcement learning. *Nature*, 518(7540), 529-533.
5. Schaul, T., Quan, J., Antonoglou, I., & Silver, D. (2016). Prioritized experience replay. *Proceedings of the International Conference on Learning Representations (ICLR)*.
6. Christiano, P. F., Leike, J., Brown, T., et al. (2017). Deep reinforcement learning from human preferences. *Advances in Neural Information Processing Systems*, 30.
7. Ziebart, B. D., Maas, A. L., Bagnell, J. A., & Dey, A. K. (2008). Maximum entropy inverse reinforcement learning. *Proceedings of the AAAI Conference on Artificial Intelligence*, 22(1).
8. Chen, L., Lu, K., Rajeswaran, A., et al. (2021). Decision Transformer: Reinforcement learning via sequence modeling. *Advances in Neural Information Processing Systems*, 34.
9. Janner, M., Li, Q., & Levine, S. (2021). Offline reinforcement learning as one big sequence modeling problem. *Advances in Neural Information Processing Systems*, 34.
10. Doerr, A., Volpp, M., Toussaint, M., Trimpe, S., & Daniel, C. (2019). Trajectory-based off-policy deep reinforcement learning. *Proceedings of the International Conference on Machine Learning (ICML)*, 97.
11. Achiam, J. (2018). Spinning Up in Deep Reinforcement Learning. OpenAI. https://spinningup.openai.com/
12. Levine, S. (2018). Reinforcement learning and control as probabilistic inference: Tutorial and review. *arXiv preprint arXiv:1805.00909*.
13. Jin, W., et al. (2024). DiffTORI: Differentiable trajectory optimization for deep reinforcement and imitation learning. *Advances in Neural Information Processing Systems*, 37.
