In reinforcement learning, the return (commonly denoted G_t) is the cumulative reward an agent receives starting from a given time step. It represents the total payoff the agent aims to maximize through its policy and serves as the foundational quantity from which value functions, advantage functions, and nearly all optimization objectives in reinforcement learning are derived. The return connects the immediate reward signal to the agent's long-term goals, making it one of the most important concepts in the field.
At each discrete time step t, the agent takes an action a_t in state s_t, transitions to a new state s_{t+1}, and receives a scalar reward r_{t+1}. The return G_t aggregates these future rewards into a single scalar value. There are two primary formulations depending on the nature of the task.
In tasks that end after a fixed number of steps T (episodic tasks), the return is the simple sum of rewards collected over the episode:
G_t = r_{t+1} + r_{t+2} + r_{t+3} + ... + r_T
This formulation is appropriate when each episode has a well-defined terminal state, such as a game of chess ending in checkmate or a robot completing an assembly task. Since the horizon is finite, the sum is always finite as long as the individual rewards are bounded.
In continuing tasks with no natural endpoint (or in episodic tasks where discounting is still desired), a discount factor γ (gamma) is applied to weight future rewards progressively less:
G_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + γ³ r_{t+4} + ...
Or equivalently:
G_t = Σ (k=0 to ∞) γ^k r_{t+k+1}
Here, γ ∈ [0, 1) is the discount factor. This geometric weighting ensures that the infinite sum converges to a finite value as long as rewards are bounded, because the series becomes a convergent geometric series.
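As a concrete illustration, the short Python sketch below computes the discounted return of a finite reward sequence directly from this definition (the function name `discounted_return` is ours, not taken from any library):

```python
def discounted_return(rewards, gamma=0.99):
    """Compute G_0 from a finite list of rewards [r_1, r_2, ..., r_T]
    using the definition G_0 = sum_k gamma^k * r_{k+1}."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Example: three rewards of 1.0 each with gamma = 0.9
print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))  # 1 + 0.9 + 0.81 = 2.71
```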
A useful property of the discounted return is that it can be expressed recursively:
G_t = r_{t+1} + γ G_{t+1}
This recursive relationship is the basis of the Bellman equation and underpins most reinforcement learning algorithms, from temporal difference learning to Q-learning.
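The recursion also gives an efficient way to compute the return at every time step of a finished episode in a single backward pass, as in this illustrative sketch (`returns_from_rewards` is an invented name):

```python
def returns_from_rewards(rewards, gamma=0.99):
    """Compute [G_0, G_1, ..., G_{T-1}] from [r_1, ..., r_T]
    using the recursion G_t = r_{t+1} + gamma * G_{t+1}."""
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g   # G_t = r_{t+1} + gamma * G_{t+1}
        returns[t] = g
    return returns

print(returns_from_rewards([1.0, 1.0, 1.0], gamma=0.9))  # [2.71, 1.9, 1.0]
```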
The discount factor γ is a hyperparameter that controls how the agent values future rewards relative to immediate ones. Its choice has both mathematical and behavioral implications.
| Gamma value | Behavior | Description |
|---|---|---|
| γ = 0 | Fully myopic | The agent only cares about the immediate next reward; G_t = r_{t+1}. |
| γ close to 0 | Short-sighted | The agent strongly prefers near-term rewards and largely ignores distant ones. |
| γ around 0.9 | Balanced | The agent values both near and moderately distant rewards. A common default in practice. |
| γ close to 1 | Far-sighted | The agent gives nearly equal weight to near and distant rewards. |
| γ = 1 | No discounting | All future rewards are weighted equally. Only valid in episodic tasks with guaranteed termination; otherwise the sum may diverge. |
Discounting is used in practice for several reasons. It guarantees that the return remains finite in continuing tasks, since the geometric weighting turns the infinite sum into a convergent series whenever rewards are bounded. It also reflects uncertainty about the future: distant rewards may never materialize because the episode may end or the environment may change. In some domains an immediate reward is genuinely worth more than a delayed one, as with interest in financial settings. Finally, a smaller discount factor shortens the effective horizon the agent must reason about, which often makes learning faster and more stable.
The return is the building block for all value functions in reinforcement learning. A value function expresses the expected return under a given policy.
The state-value function V^π(s) gives the expected return when starting in state s and following policy π thereafter:
V^π(s) = E_π [G_t | S_t = s]
This measures how "good" it is for the agent to be in state s under policy π.
The action-value function (or Q-function) Q^π(s, a) gives the expected return when starting in state s, taking action a, and then following policy π:
Q^π(s, a) = E_π [G_t | S_t = s, A_t = a]
This is used directly in algorithms like Q-learning and SARSA.
The advantage function measures how much better a specific action is compared to the average action under the current policy:
A^π(s, a) = Q^π(s, a) - V^π(s)
A positive advantage means the action yields a higher expected return than the policy's average for that state. The advantage function is used extensively in policy gradient methods.
The optimal state-value function V*(s) and optimal action-value function Q*(s, a) represent the maximum expected return achievable from each state or state-action pair across all possible policies:
V*(s) = max_π V^π(s)
Q*(s, a) = max_π Q^π(s, a)
Finding these optimal value functions is equivalent to solving the reinforcement learning problem.
The recursive property of the return (G_t = r_{t+1} + γ G_{t+1}) leads directly to the Bellman equations, which express the value of a state in terms of immediate reward plus the discounted value of successor states.
For a given policy π, the Bellman expectation equation for the state-value function is:
V^π(s) = Σ_a π(a|s) Σ_{s'} P(s'|s,a) [R(s,a,s') + γ V^π(s')]
This states that the value of a state equals the expected immediate reward plus the discounted expected value of the next state, averaged over actions (weighted by the policy) and transitions (weighted by environment dynamics).
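To make the equation concrete, the following sketch runs iterative policy evaluation on a tiny tabular MDP. The two-state, two-action MDP and every number in it are invented for illustration; only the update rule itself comes from the Bellman expectation equation above.

```python
import numpy as np

# Toy MDP with 2 states and 2 actions; all numbers are invented for illustration.
# P[s, a, s2] is the transition probability, R[s, a, s2] the reward for that transition.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.0, 1.0], [0.0, 0.5]]])
pi = np.full((2, 2), 0.5)                 # uniform random policy pi(a|s)
gamma = 0.9

V = np.zeros(2)
for _ in range(1000):
    V_new = np.zeros(2)
    for s in range(2):
        for a in range(2):
            for s2 in range(2):
                # Bellman expectation equation, accumulated term by term
                V_new[s] += pi[s, a] * P[s, a, s2] * (R[s, a, s2] + gamma * V[s2])
    delta, V = np.max(np.abs(V_new - V)), V_new
    if delta < 1e-8:
        break

print(V)                                   # approximate V^pi for the toy MDP
```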
For the optimal value function, the Bellman optimality equation replaces the policy-weighted average with a maximization:
V*(s) = max_a Σ_{s'} P(s'|s,a) [R(s,a,s') + γ V*(s')]
These equations form the theoretical backbone of algorithms like value iteration, policy iteration, and all forms of temporal difference learning.
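For comparison, here is a matching sketch of value iteration on the same kind of toy MDP (the numbers are again illustrative); the only change from policy evaluation is that the policy-weighted average over actions is replaced by a max.

```python
import numpy as np

# Same toy 2-state, 2-action MDP layout as the policy-evaluation sketch above.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.0, 1.0], [0.0, 0.5]]])
gamma = 0.9

V = np.zeros(2)
for _ in range(1000):
    Q = np.zeros((2, 2))
    for s in range(2):
        for a in range(2):
            for s2 in range(2):
                # Q(s,a) = sum_s' P(s'|s,a) [R(s,a,s') + gamma V(s')]
                Q[s, a] += P[s, a, s2] * (R[s, a, s2] + gamma * V[s2])
    V_new = Q.max(axis=1)                  # Bellman optimality backup: max over actions
    delta, V = np.max(np.abs(V_new - V)), V_new
    if delta < 1e-8:
        break

print(V, Q.argmax(axis=1))                 # optimal values and a greedy policy
```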
In practice, the true return is rarely known in advance. Reinforcement learning algorithms differ in how they estimate returns from experience.
Monte Carlo methods estimate the return by running complete episodes and computing the actual cumulative reward observed. The return for a visited state is the sum of all rewards received from that state until the end of the episode:
G_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... + γ^{T-t-1} r_T
The value estimate is then updated:
V(S_t) ← V(S_t) + α [G_t - V(S_t)]
where α is the learning rate. Monte Carlo estimates are unbiased (they use the true return) but have high variance because individual episodes can vary widely.
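A minimal every-visit Monte Carlo update for one finished episode might look like the following sketch, assuming the episode has already been collected as lists of states and rewards (the helper names are ours):

```python
def monte_carlo_update(V, states, rewards, gamma=0.99, alpha=0.1):
    """Every-visit Monte Carlo value update for one finished episode.

    states  = [S_0, S_1, ..., S_{T-1}]
    rewards = [r_1, r_2, ..., r_T] (reward received after each state)
    V       = dict mapping state -> value estimate, updated in place
    """
    g = 0.0
    # Walk backward so the discounted return G_t is available at every step.
    for t in reversed(range(len(states))):
        g = rewards[t] + gamma * g
        v = V.get(states[t], 0.0)
        V[states[t]] = v + alpha * (g - v)   # V(S_t) <- V(S_t) + alpha [G_t - V(S_t)]
    return V

V = {}
monte_carlo_update(V, states=['A', 'B', 'A'], rewards=[0.0, 0.0, 1.0], gamma=0.9)
print(V)
```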
Temporal difference (TD) methods do not wait for the episode to end. Instead, they use the immediate reward plus the current estimate of the next state's value as a stand-in for the full return. This estimate is called the TD target:
TD target = r_{t+1} + γ V(S_{t+1})
The value update becomes:
V(S_t) ← V(S_t) + α [r_{t+1} + γ V(S_{t+1}) - V(S_t)]
The quantity r_{t+1} + γ V(S_{t+1}) - V(S_t) is called the TD error (denoted δ_t). TD methods are biased (they rely on potentially inaccurate value estimates) but have lower variance than Monte Carlo methods and can learn online without waiting for episode completion.
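The same idea written as a tabular TD(0) step, applied after every single transition rather than at episode end (the function name and dictionary-based table are illustrative):

```python
def td0_update(V, s, r, s_next, done, gamma=0.99, alpha=0.1):
    """One tabular TD(0) update for the transition (s, r, s_next).

    The bootstrapped value of s_next is dropped when the episode has ended.
    """
    v_next = 0.0 if done else V.get(s_next, 0.0)
    td_target = r + gamma * v_next                 # r_{t+1} + gamma V(S_{t+1})
    td_error = td_target - V.get(s, 0.0)           # delta_t
    V[s] = V.get(s, 0.0) + alpha * td_error
    return td_error
```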
| Property | Monte Carlo | TD(0) |
|---|---|---|
| Return used | Actual return G_t | Estimated return r_{t+1} + γ V(S_{t+1}) |
| Update timing | End of episode | After each step |
| Bias | Unbiased | Biased (due to bootstrapping) |
| Variance | High | Low |
| Requires episode termination | Yes | No |
| Uses bootstrapping | No | Yes |
| Works in continuing tasks | No (needs termination) | Yes |
Monte Carlo and single-step TD represent two extremes on a spectrum. Intermediate methods blend both approaches by looking ahead multiple steps.
The n-step return combines n actual rewards with a bootstrapped value estimate:
G_t^{(n)} = r_{t+1} + γ r_{t+2} + ... + γ^{n-1} r_{t+n} + γ^n V(S_{t+n})
When n = 1, this reduces to the TD(0) target. When n = T - t (the remaining length of the episode), it becomes the full Monte Carlo return.
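A small sketch of the n-step return, assuming the n rewards and a value estimate for S_{t+n} are already available (all names are illustrative):

```python
def n_step_return(rewards, v_bootstrap, gamma=0.99):
    """n-step return G_t^(n) from n rewards [r_{t+1}, ..., r_{t+n}]
    and the bootstrapped value v_bootstrap = V(S_{t+n})."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r                      # gamma^k r_{t+k+1}
    g += (gamma ** len(rewards)) * v_bootstrap     # gamma^n V(S_{t+n})
    return g

# n = 1 reduces to the TD(0) target; passing a whole episode with
# v_bootstrap = 0 gives the Monte Carlo return.
print(n_step_return([1.0, 0.5], v_bootstrap=2.0, gamma=0.9))  # 1 + 0.45 + 1.62 = 3.07
```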
The lambda return, introduced by Sutton (1988), is a weighted average of all n-step returns, with weights that decay exponentially by a parameter λ ∈ [0, 1]:
G_t^λ = (1 - λ) Σ (n=1 to ∞) λ^{n-1} G_t^{(n)}
The parameter λ interpolates between TD(0) (when λ = 0) and Monte Carlo (when λ = 1). Eligibility traces provide an efficient online implementation of TD(λ), maintaining a decaying record of which states have been recently visited so that the TD error can be distributed backward in time.
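A compact sketch of tabular TD(λ) with accumulating eligibility traces, one common way to implement the lambda return online (the dictionary-based tables and names are ours):

```python
def td_lambda_episode(V, episode, gamma=0.99, lam=0.9, alpha=0.1):
    """Tabular TD(lambda) with accumulating eligibility traces.

    episode is a list of transitions (s, r, s_next, done).
    V is a dict of value estimates, updated in place.
    """
    e = {}                                            # eligibility trace per state
    for s, r, s_next, done in episode:
        v_next = 0.0 if done else V.get(s_next, 0.0)
        delta = r + gamma * v_next - V.get(s, 0.0)    # TD error
        e[s] = e.get(s, 0.0) + 1.0                    # accumulate trace for current state
        for state, trace in e.items():
            V[state] = V.get(state, 0.0) + alpha * delta * trace
            e[state] = gamma * lam * trace            # decay every trace
    return V
```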
Generalized advantage estimation, proposed by Schulman et al. (2016), applies the lambda-return idea to advantage estimation in policy gradient methods:
A_t^{GAE(γ,λ)} = Σ (l=0 to ∞) (γλ)^l δ_{t+l}
where δ_t = r_{t+1} + γ V(S_{t+1}) - V(S_t) is the TD error. GAE provides a smooth tradeoff between bias and variance in advantage estimates. A lower λ gives lower variance but higher bias; a higher λ gives lower bias but higher variance. GAE is used in many modern policy gradient algorithms including PPO and TRPO.
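GAE is typically computed with a backward recursion over TD errors, A_t = δ_t + γλ A_{t+1}. The following sketch assumes per-step rewards, value estimates, and termination flags have already been collected from a rollout (the variable names are ours):

```python
def gae_advantages(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over one trajectory segment.

    rewards[t] = r_{t+1}, values[t] = V(S_t), dones[t] = 1 if S_{t+1} is terminal.
    last_value = V(S_T), used to bootstrap the final step.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        next_value = last_value if t == len(rewards) - 1 else values[t + 1]
        next_nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * next_value * next_nonterminal - values[t]
        gae = delta + gamma * lam * next_nonterminal * gae   # A_t = delta_t + gamma*lam*A_{t+1}
        advantages[t] = gae
    return advantages
```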
The definition and handling of the return differ between episodic and continuing tasks.
| Aspect | Episodic tasks | Continuing tasks |
|---|---|---|
| Termination | Natural endpoint (terminal state) | No natural endpoint |
| Return formulation | Undiscounted or discounted, finite sum | Discounted or average reward, infinite sum |
| Examples | Board games, maze navigation, Atari games | Stock trading, robotic control, server management |
| Discount factor | Can use γ = 1 safely | Must use γ < 1 or average reward formulation |
In continuing tasks, using γ = 1 would make the return infinite (assuming nonzero rewards), so either the discounted formulation or an alternative objective such as the average reward must be used.
For continuing tasks, an alternative to discounting is the average reward setting, introduced by Schwartz (1993) and developed further by Mahadevan (1996). Rather than maximizing the discounted return, the agent maximizes the long-term average reward per time step:
r(π) = lim (T→∞) (1/T) Σ (t=1 to T) E[r_t | π]
The differential return measures the total excess reward above the average:
G_t = Σ (k=0 to ∞) (r_{t+k+1} - r(π))
This formulation avoids the need for a discount factor altogether and is well-suited to problems where the agent operates indefinitely and there is no meaningful way to prioritize near-term rewards over far-off ones.
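In this setting, a common learning rule is a differential TD update in which a running estimate of the average reward is subtracted from each observed reward. The sketch below is one minimal tabular version of such an update; the specific names and step sizes are assumptions, not a standard API:

```python
def differential_td0_update(V, avg_reward, s, r, s_next, alpha=0.1, beta=0.01):
    """One differential TD(0) update for a continuing task.

    delta = r - r_bar + V(s') - V(s); both the value table V and the
    average-reward estimate r_bar are nudged toward consistency with delta.
    Returns the updated average-reward estimate.
    """
    delta = r - avg_reward + V.get(s_next, 0.0) - V.get(s, 0.0)
    V[s] = V.get(s, 0.0) + alpha * delta
    avg_reward += beta * delta
    return avg_reward
```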
The concept of return is closely linked to the reward hypothesis, articulated by Richard Sutton (2004):
"All of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called reward)."
Under this hypothesis, every goal an intelligent agent might have can be expressed as maximizing the expected return. This idea is a foundational assumption in reinforcement learning, though it has been debated in the research community. Critics have pointed out that certain objectives, such as multi-objective optimization, risk-sensitive behavior, and tasks requiring non-Markovian reward structures, may not be naturally expressible through a single scalar return.
Standard reinforcement learning maximizes the expected return. However, in safety-critical applications, the agent may need to consider the distribution of returns rather than just the mean. Risk-sensitive approaches modify the objective to account for the variability or worst-case outcomes of the return. Common formulations include mean-variance objectives that penalize the variance of the return, conditional value at risk (CVaR) objectives that focus on the expected return over the worst fraction of outcomes, and distributional reinforcement learning, which models the full distribution of the return rather than only its mean.
These formulations are relevant in robotics, autonomous driving, and financial applications where catastrophic failures must be avoided.
In hindsight experience replay (HER), introduced by Andrychowicz et al. (2017), failed trajectories are reinterpreted by substituting the goal with a state that was actually reached. This changes the return retroactively, allowing the agent to learn from failures as if they were successes for an alternative goal.
The concept of return is central to many well-known applications of reinforcement learning, from game playing (Atari agents trained with deep Q-learning, Go programs such as AlphaGo) to robotic control and recommendation systems: in each case the agent is trained to maximize an expected return defined over the task's reward signal.
To build intuition, imagine you are playing a video game and collecting coins. Each coin is a "reward." The return is the total number of coins you collect from now until the game ends.
But here is the twist: coins you pick up right now are worth more than coins you might pick up later, because you are not sure you will get to those later coins. So each future coin is worth a little bit less. If the very next coin is worth 1 point, a coin two steps away might be worth 0.9 points, and one three steps away might be worth 0.81 points, and so on.
The return adds up all these coin values. The agent's whole job is to play the game in a way that makes this total as big as possible.