Return (Reinforcement Learning)
Last reviewed
Sources
14 citations
Review status
Source-backed
Revision
v5 · 3,050 words
In reinforcement learning, the return (commonly denoted G_t) is the cumulative reward an agent receives from time step t onward, usually with future rewards discounted by a factor gamma. Formally, G_t = r_{t+1} + gamma r_{t+2} + gamma^2 r_{t+3} + ..., and it obeys the recursive identity G_t = r_{t+1} + gamma G_{t+1} [1]. The expected return is the quantity that reinforcement learning maximizes: the agent's objective is to choose a policy that makes it as large as possible. Value functions, advantage functions, and nearly every optimization objective in the field are derived from the return, which connects the immediate reward signal to the agent's long-term goals.
A reward received k time steps in the future contributes only gamma^k times what the same reward would be worth if received immediately, so the discount factor gamma in [0, 1] sets how far-sighted the agent is [1]. With gamma = 0 the agent is purely myopic; as gamma approaches 1 it weighs distant rewards almost as heavily as immediate ones.
At each discrete time step t, the agent takes an action a_t in state s_t, transitions to a new state s_{t+1}, and receives a scalar reward r_{t+1}. The return G_t aggregates these future rewards into a single scalar value. There are two primary formulations depending on the nature of the task [1].
In tasks that end after a finite number of steps T (episodic tasks), the return is the simple undiscounted sum of rewards collected over the rest of the episode:
G_t = r_{t+1} + r_{t+2} + r_{t+3} + ... + r_T
This formulation is appropriate when each episode has a well-defined terminal state, such as a game of chess ending in checkmate or a robot completing an assembly task. Because the horizon is finite, the sum is finite whenever the individual rewards are bounded.
In continuing tasks with no natural endpoint (or in episodic tasks where discounting is still desired), a discount factor gamma is applied to weight future rewards progressively less:
G_t = r_{t+1} + gamma r_{t+2} + gamma^2 r_{t+3} + gamma^3 r_{t+4} + ...
Or equivalently:
G_t = Sum (k=0 to infinity) gamma^k r_{t+k+1}
Here, gamma in [0, 1) is the discount factor. This geometric weighting ensures that the infinite sum converges to a finite value as long as rewards are bounded, because the series becomes a convergent geometric series [1].
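The discounted sum above can be computed directly for a finite list of observed rewards. The following is a minimal sketch; the function name `discounted_return` and the example reward list are illustrative assumptions, not part of any library.

```python
def discounted_return(rewards, gamma):
    """Compute sum_k gamma**k * rewards[k].

    rewards[0] plays the role of r_{t+1}, rewards[1] of r_{t+2}, and so on.
    """
    total = 0.0
    for k, reward in enumerate(rewards):
        total += (gamma ** k) * reward
    return total


# Hypothetical example: four rewards of 1 with gamma = 0.5.
g = discounted_return([1.0, 1.0, 1.0, 1.0], gamma=0.5)  # 1 + 0.5 + 0.25 + 0.125

# With rewards bounded by r_max, the geometric series caps the return
# at r_max / (1 - gamma), which is 2.0 for this example.
bound = 1.0 / (1.0 - 0.5)
```

Here `g` stays below `bound` no matter how many further rewards of 1 are appended, which is the convergence property described above.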
A useful property of the discounted return is that it can be expressed recursively:
G_t = r_{t+1} + gamma G_{t+1}
This recursive relationship is the basis of the Bellman equation and underpins most reinforcement learning algorithms, from temporal difference learning to Q-learning [1][2].
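One practical consequence of the recursion is that the returns for every time step of a finished episode can be computed in a single backward pass. The sketch below assumes the episode's rewards are stored in a list where `rewards[t]` holds r_{t+1}; the function name is a hypothetical choice.

```python
def returns_from_rewards(rewards, gamma):
    """Return [G_0, G_1, ..., G_{T-1}] using G_t = r_{t+1} + gamma * G_{t+1}."""
    returns = [0.0] * len(rewards)
    g_next = 0.0  # the return after the terminal step is zero
    for t in reversed(range(len(rewards))):
        g_next = rewards[t] + gamma * g_next
        returns[t] = g_next
    return returns


# Hypothetical three-step episode with gamma = 0.5.
episode_returns = returns_from_rewards([1.0, 0.0, 2.0], gamma=0.5)
# G_2 = 2.0, G_1 = 0 + 0.5 * 2.0 = 1.0, G_0 = 1 + 0.5 * 1.0 = 1.5
```

The backward pass costs one multiplication and one addition per step, whereas summing each return independently would repeat work already done for later steps.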
The discount factor gamma is a hyperparameter that controls how the agent values future rewards relative to immediate ones. Sutton and Barto define it on the closed interval 0 <= gamma <= 1 and note that a reward arriving k steps later is worth gamma^k of its immediate value, so gamma doubles as the present-value discount rate familiar from economics [1]. Its choice has both mathematical and behavioral implications.
| Gamma value | Behavior | Description |
|---|---|---|
| gamma = 0 | Fully myopic | The agent only cares about the immediate next reward; G_t = r_{t+1}. |
| gamma close to 0 | Short-sighted | The agent strongly prefers near-term rewards and largely ignores distant ones. |
| gamma around 0.9 | Balanced | The agent values both near and moderately distant rewards. A common default in practice. |
| gamma close to 1 | Far-sighted | The agent gives nearly equal weight to near and distant rewards. |
| gamma = 1 | No discounting | All future rewards are weighted equally. Only valid in episodic tasks with guaranteed termination; otherwise the sum may diverge. |
Deep reinforcement learning systems typically use a discount factor close to but below 1. DeepMind's DQN agent, which reached human-level performance across 49 Atari 2600 games, used gamma = 0.99 [8], and the same value remains a standard default in modern policy gradient implementations such as PPO.
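To get a feel for how strongly the choice of gamma shapes the agent's horizon, it can help to look at the weight gamma^k directly. The helper names below are illustrative assumptions; the total weight 1 / (1 - gamma) is simply the sum of the geometric series and is sometimes read informally as an effective horizon.

```python
def discount_weight(gamma, k):
    """Weight gamma**k applied to a reward received k steps in the future."""
    return gamma ** k


def total_weight(gamma):
    """Sum of gamma**k over k = 0, 1, 2, ... for 0 <= gamma < 1."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("the geometric series only converges for 0 <= gamma < 1")
    return 1.0 / (1.0 - gamma)


# A reward 100 steps ahead keeps roughly a third of its value at gamma = 0.99
# but is almost entirely ignored at gamma = 0.9.
weight_far_sighted = discount_weight(0.99, 100)
weight_balanced = discount_weight(0.9, 100)
```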
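Discounting is used in practice for several reasons, among them that it keeps the infinite sum finite in continuing tasks, that it reflects uncertainty about whether distant rewards will actually be received, and that it mirrors the present-value reasoning familiar from economics.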
The return is the building block for all value functions in reinforcement learning. A value function expresses the expected return under a given policy [1].
The state-value function V^pi(s) gives the expected return when starting in state s and following policy pi thereafter:
V^pi(s) = E_pi [G_t | S_t = s]
This measures how "good" it is for the agent to be in state s under policy pi.
The action-value function (or Q-function) Q^pi(s, a) gives the expected return when starting in state s, taking action a, and then following policy pi:
Q^pi(s, a) = E_pi [G_t | S_t = s, A_t = a]
This is used directly in algorithms like Q-learning [3] and SARSA.
The advantage function measures how much better a specific action is compared to the average action under the current policy:
A^pi(s, a) = Q^pi(s, a) - V^pi(s)
A positive advantage means the action yields a higher expected return than the policy's average for that state. The advantage function is used extensively in policy gradient methods.
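As a small worked example, the advantage can be computed for a single state once the action values and the policy's action probabilities are known. The sketch below uses the identity V^pi(s) = Sum_a pi(a|s) Q^pi(s, a); the function name and the numbers are hypothetical.

```python
def state_value_and_advantages(q_values, policy_probs):
    """Return (V, [A(a) for each action]) for one state.

    q_values[a] is Q^pi(s, a) and policy_probs[a] is pi(a|s).
    """
    v = sum(p * q for p, q in zip(policy_probs, q_values))
    return v, [q - v for q in q_values]


# Two actions chosen with equal probability.
v, adv = state_value_and_advantages(q_values=[1.0, 3.0], policy_probs=[0.5, 0.5])
# v is 2.0; the first action is worse than average, the second is better.
```

By construction the policy-weighted average of the advantages is zero, which is why the advantage is often described as a baseline-corrected action value.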
The optimal state-value function V*(s) and optimal action-value function Q*(s, a) represent the maximum expected return achievable from each state or state-action pair across all possible policies:
V*(s) = max_pi V^pi(s)
Q*(s, a) = max_pi Q^pi(s, a)
Finding these optimal value functions is equivalent to solving the reinforcement learning problem [11].
The recursive property of the return (G_t = r_{t+1} + gamma G_{t+1}) leads directly to the Bellman equations, which express the value of a state in terms of immediate reward plus the discounted value of successor states [2].
For a given policy pi, the Bellman expectation equation for the state-value function is:
V^pi(s) = Sum_a pi(a|s) Sum_{s'} P(s'|s,a) [R(s,a,s') + gamma V^pi(s')]
This states that the value of a state equals the expected immediate reward plus the discounted expected value of the next state, averaged over actions (weighted by the policy) and transitions (weighted by environment dynamics).
For the optimal value function, the Bellman optimality equation replaces the policy-weighted average with a maximization:
V*(s) = max_a Sum_{s'} P(s'|s,a) [R(s,a,s') + gamma V*(s')]
These equations, formalized in the framework of Markov decision processes, form the theoretical backbone of algorithms like value iteration, policy iteration, and all forms of temporal difference learning [11][12].
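The Bellman optimality equation can be turned into an algorithm by applying its right-hand side repeatedly as an update rule, which is the idea behind value iteration. The following is a minimal sketch on a hypothetical two-state MDP; the dictionary layout, state labels, and rewards are assumptions made for illustration.

```python
def value_iteration(transitions, gamma, tol=1e-10, max_sweeps=10_000):
    """Approximate V* by repeated Bellman optimality backups.

    transitions[s][a] is a list of (probability, next_state, reward) tuples.
    """
    values = {s: 0.0 for s in transitions}
    for _ in range(max_sweeps):
        max_change = 0.0
        for s, actions in transitions.items():
            # max over actions of the expected reward plus discounted next value
            best = max(
                sum(p * (r + gamma * values[s2]) for p, s2, r in outcomes)
                for outcomes in actions.values()
            )
            max_change = max(max_change, abs(best - values[s]))
            values[s] = best
        if max_change < tol:
            break
    return values


# Hypothetical deterministic MDP: staying in state 1 pays 2 per step,
# staying in state 0 pays 0.5, and moving between states pays nothing.
mdp = {
    0: {"stay": [(1.0, 0, 0.5)], "go": [(1.0, 1, 0.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "go": [(1.0, 0, 0.0)]},
}
optimal_values = value_iteration(mdp, gamma=0.5)
```

For this example the fixed point can be checked by hand: V*(1) = 2 / (1 - 0.5) = 4, and from state 0 moving to state 1 yields 0.5 * 4 = 2, which beats staying (worth 1).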
In practice, the true return is rarely known in advance. Reinforcement learning algorithms differ in how they estimate returns from experience. The two classic extremes are Monte Carlo estimation, which uses the full observed return, and temporal difference learning, which bootstraps from an existing value estimate [1].
Monte Carlo methods estimate the return by running complete episodes and computing the actual cumulative reward observed. The return for a visited state is the discounted sum of all rewards received from that state until the end of the episode:
G_t = r_{t+1} + gamma r_{t+2} + gamma^2 r_{t+3} + ... + gamma^{T-t-1} r_T
The value estimate is then updated:
V(S_t) <- V(S_t) + alpha [G_t - V(S_t)]
where alpha is the learning rate. Monte Carlo estimates are unbiased (the target is an actual sampled return rather than an estimate) but have high variance because individual episodes can vary widely [1].
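The update rule can be sketched in a few lines. The version below is an every-visit variant with a constant step size, which is one of several possible choices; the function name and the episode encoding (a list of `(state, reward)` pairs, where the reward is the one received after leaving that state) are assumptions.

```python
def monte_carlo_update(values, episode, gamma, alpha):
    """Apply V(S_t) <- V(S_t) + alpha * (G_t - V(S_t)) for every step of an episode."""
    g = 0.0
    # Walk backward so each return follows from G_t = r_{t+1} + gamma * G_{t+1}.
    for state, reward in reversed(episode):
        g = reward + gamma * g
        values[state] += alpha * (g - values[state])
    return values


# Hypothetical two-step episode: A -> B -> terminal, reward 1 on the last step.
values = {"A": 0.0, "B": 0.0}
monte_carlo_update(values, [("A", 0.0), ("B", 1.0)], gamma=0.9, alpha=0.5)
# G for B is 1.0 and G for A is 0.9, so V(B) moves to 0.5 and V(A) to 0.45.
```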
Temporal difference (TD) methods do not wait for the episode to end. Instead, they use the immediate reward plus the current estimate of the next state's value as a stand-in for the full return. This estimate is called the TD target [4]:
TD target = r_{t+1} + gamma V(S_{t+1})
The value update becomes:
V(S_t) <- V(S_t) + alpha [r_{t+1} + gamma V(S_{t+1}) - V(S_t)]
The quantity r_{t+1} + gamma V(S_{t+1}) - V(S_t) is called the TD error (denoted delta_t). TD methods are biased (they rely on potentially inaccurate value estimates) but have lower variance than Monte Carlo methods and can learn online without waiting for episode completion [4].
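A single TD(0) update can be sketched as follows. The function name, the dictionary-based value table, and the `terminal` flag (which sets the bootstrapped value to zero at the end of an episode) are illustrative assumptions.

```python
def td0_update(values, state, reward, next_state, gamma, alpha, terminal=False):
    """Apply one TD(0) update in place and return the TD error delta_t."""
    next_value = 0.0 if terminal else values[next_state]
    td_error = reward + gamma * next_value - values[state]
    values[state] += alpha * td_error
    return td_error


# Hypothetical transition A -> B with reward 0 and a current estimate V(B) = 1.
values = {"A": 0.0, "B": 1.0}
delta = td0_update(values, "A", reward=0.0, next_state="B", gamma=0.9, alpha=0.5)
# TD target = 0 + 0.9 * 1.0 = 0.9, so delta = 0.9 and V(A) moves to 0.45.
```

Unlike the Monte Carlo update, this one needs only a single transition, so it can be applied after every environment step.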
| Property | Monte Carlo | TD(0) |
|---|---|---|
| Return used | Actual return G_t | Estimated return r_{t+1} + gamma V(S_{t+1}) |
| Update timing | End of episode | After each step |
| Bias | Unbiased | Biased (due to bootstrapping) |
| Variance | High | Low |
| Requires episode termination | Yes | No |
| Uses bootstrapping | No | Yes |
| Works in continuing tasks | No (needs termination) | Yes |
Monte Carlo and single-step TD represent two extremes on a spectrum. Intermediate methods blend both approaches by looking ahead multiple steps [1].
The n-step return combines n actual rewards with a bootstrapped value estimate:
G_t^{(n)} = r_{t+1} + gamma r_{t+2} + ... + gamma^{n-1} r_{t+n} + gamma^n V(S_{t+n})
When n = 1, this reduces to the TD(0) target. When n = T - t (the remaining length of the episode), it becomes the full Monte Carlo return.
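The n-step return can be sketched for a recorded episode as below. The indexing convention is an assumption: `rewards[k]` holds r_{k+1} and `values[k]` holds the current estimate V(S_k), with the terminal state's value taken to be zero.

```python
def n_step_return(rewards, values, t, n, gamma):
    """Compute G_t^(n): n discounted rewards plus a bootstrapped tail value."""
    horizon = len(rewards)          # T, the episode length
    end = min(t + n, horizon)       # do not look past the end of the episode
    g = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    if end < horizon:
        # Bootstrap from the value estimate only if the episode is still running.
        g += gamma ** (end - t) * values[end]
    return g


# Hypothetical three-step episode with gamma = 0.5.
rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.5, 0.5]
one_step = n_step_return(rewards, values, t=0, n=1, gamma=0.5)   # 1 + 0.5 * 0.5
two_step = n_step_return(rewards, values, t=0, n=2, gamma=0.5)   # 1 + 0.5 + 0.25 * 0.5
full = n_step_return(rewards, values, t=0, n=3, gamma=0.5)       # 1 + 0.5 + 0.25
```

With n = 1 the result is the TD(0) target, and with n = T - t it no longer depends on the value estimates at all, matching the two limiting cases described above.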
The lambda return, introduced by Sutton (1988), is a weighted average of all n-step returns, with weights that decay exponentially by a parameter lambda in [0, 1] [4]:
G_t^lambda = (1 - lambda) Sum (n=1 to infinity) lambda^{n-1} G_t^{(n)}
The parameter lambda interpolates between TD(0) (when lambda = 0) and Monte Carlo (when lambda = 1). Eligibility traces provide an efficient online implementation of TD(lambda), maintaining a decaying record of which states have been recently visited so that the TD error can be distributed backward in time [1][4].
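For an episode that terminates at step T, every n-step return with n >= T - t equals the full Monte Carlo return, so the infinite sum collapses to a finite one: G_t^lambda = (1 - lambda) Sum (n=1 to T-t-1) lambda^{n-1} G_t^{(n)} + lambda^{T-t-1} G_t. The sketch below computes this forward-view quantity directly; it is not the eligibility-trace implementation, and the function names and indexing convention (`rewards[k]` is r_{k+1}, `values[k]` is V(S_k)) are assumptions.

```python
def n_step_return(rewards, values, t, n, gamma):
    """G_t^(n), bootstrapping from values[] unless the episode has ended."""
    horizon = len(rewards)
    end = min(t + n, horizon)
    g = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    if end < horizon:
        g += gamma ** (end - t) * values[end]
    return g


def lambda_return(rewards, values, t, gamma, lam):
    """Forward-view lambda return for an episode of length len(rewards)."""
    steps_left = len(rewards) - t  # T - t
    weighted = sum(
        lam ** (n - 1) * n_step_return(rewards, values, t, n, gamma)
        for n in range(1, steps_left)
    )
    monte_carlo = n_step_return(rewards, values, t, steps_left, gamma)
    return (1.0 - lam) * weighted + lam ** (steps_left - 1) * monte_carlo


rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.5, 0.5]
td_like = lambda_return(rewards, values, t=0, gamma=0.5, lam=0.0)   # TD(0) target
mc_like = lambda_return(rewards, values, t=0, gamma=0.5, lam=1.0)   # Monte Carlo return
blended = lambda_return(rewards, values, t=0, gamma=0.5, lam=0.5)
```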
Generalized advantage estimation, proposed by Schulman et al. (2016), applies the lambda-return idea to advantage estimation in policy gradient methods [5]:
A_t^{GAE(gamma,lambda)} = Sum (l=0 to infinity) (gamma lambda)^l delta_{t+l}
where delta_t = r_{t+1} + gamma V(S_{t+1}) - V(S_t) is the TD error. GAE provides a smooth tradeoff between bias and variance in advantage estimates. A lower lambda gives lower variance but higher bias; a higher lambda gives lower bias but higher variance [5]. GAE is used in many modern policy gradient algorithms including PPO and TRPO.
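For a finite trajectory the GAE sum is usually accumulated backward, since A_t = delta_t + gamma lambda A_{t+1}. The following is a minimal sketch under the assumption that `values` has one more entry than `rewards`, with the last entry holding the value after the final step (zero if that step was terminal); the function name is hypothetical.

```python
def gae_advantages(rewards, values, gamma, lam):
    """Compute GAE advantages for one trajectory with a backward pass."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # TD error: delta_t = r_{t+1} + gamma * V(S_{t+1}) - V(S_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# Hypothetical two-step episode ending in a terminal state.
rewards = [1.0, 1.0]
values = [0.5, 0.5, 0.0]
adv = gae_advantages(rewards, values, gamma=0.5, lam=0.5)
```

Setting `lam=0.0` returns the one-step TD errors, while `lam=1.0` returns the Monte Carlo return minus the value estimate, which are the two ends of the bias-variance tradeoff described above.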
The definition and handling of the return differ between episodic and continuing tasks [1].
| Aspect | Episodic tasks | Continuing tasks |
|---|---|---|
| Termination | Natural endpoint (terminal state) | No natural endpoint |
| Return formulation | Undiscounted or discounted, finite sum | Discounted or average reward, infinite sum |
| Examples | Board games, maze navigation, Atari games | Stock trading, robotic control, server management |
| Discount factor | Can use gamma = 1 safely | Must use gamma < 1 or average reward formulation |
In continuing tasks, using gamma = 1 can make the return diverge (for example, whenever rewards stay bounded away from zero), so either the discounted formulation or an alternative objective such as the average reward must be used [1].
For continuing tasks, an alternative to discounting is the average reward setting, introduced by Schwartz (1993) and developed further by Mahadevan (1996). Rather than maximizing the discounted return, the agent maximizes the long-term average reward per time step [6][7]:
r(pi) = lim (T->infinity) (1/T) Sum (t=1 to T) E[r_t | pi]
The differential return measures the total excess reward above the average:
G_t = Sum (k=0 to infinity) (r_{t+k+1} - r(pi))
This formulation avoids the need for a discount factor altogether and is well-suited to problems where the agent operates indefinitely and there is no meaningful way to prioritize near-term rewards over far-off ones [7].
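The two quantities can be approximated from a finite window of observed rewards. The sketch below is only an empirical, truncated stand-in for the limits in the definitions above: it replaces r(pi) with a sample mean and cuts the differential sum off at the end of the window. The function names are assumptions.

```python
def average_reward(rewards):
    """Sample estimate of the long-run average reward per step."""
    return sum(rewards) / len(rewards)


def truncated_differential_return(rewards, t, avg):
    """Sum of (reward - average) from step t to the end of the window."""
    return sum(r - avg for r in rewards[t:])


# Hypothetical reward stream that alternates between 1 and 3.
window = [1.0, 3.0, 1.0, 3.0]
avg = average_reward(window)                                    # 2.0
from_start = truncated_differential_return(window, 0, avg)      # -1 + 1 - 1 + 1
from_second = truncated_differential_return(window, 1, avg)     # 1 - 1 + 1
```

Starting from a step that is about to pay an above-average reward gives a higher differential return, which is the sense in which it measures excess reward relative to the average.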
The concept of return is closely linked to the reward hypothesis, articulated by Richard Sutton (2004):
"All of what we mean by goals and purposes can be well thought of as maximization of the expected value of the cumulative sum of a received scalar signal (reward)." [1]
Under this hypothesis, every goal an intelligent agent might have can be expressed as maximizing the expected return. This idea is a foundational assumption in reinforcement learning, though it has been debated in the research community. Silver, Singh, Precup, and Sutton later pushed the claim further in their 2021 paper "Reward is enough," arguing that maximizing a scalar reward is sufficient to drive the emergence of abilities such as knowledge, perception, and language [13]. Critics have pointed out that certain objectives, such as multi-objective optimization, risk-sensitive behavior, and tasks requiring non-Markovian reward structures, may not be naturally expressible through a single scalar return.
Standard reinforcement learning maximizes the expected return. However, in safety-critical applications, the agent may need to consider the distribution of returns rather than just the mean. Risk-sensitive approaches modify the objective to account for the variability or worst-case outcomes of the return.
These formulations are relevant in robotics, autonomous driving, and financial applications where catastrophic failures must be avoided.
In hindsight experience replay (HER), introduced by Andrychowicz et al. (2017), failed trajectories are reinterpreted by substituting the goal with a state that was actually reached. This changes the return retroactively, allowing the agent to learn from failures as if they were successes for an alternative goal [10].
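The effect of relabeling on the return can be illustrated with a toy trajectory. The sketch below assumes a sparse reward of 0 when the goal is reached and -1 otherwise, and substitutes the final achieved state as the new goal; the function names, the integer states, and the undiscounted return are all simplifying assumptions for illustration.

```python
def sparse_reward(achieved_state, goal):
    """Assumed sparse reward: 0 on reaching the goal, -1 otherwise."""
    return 0.0 if achieved_state == goal else -1.0


def episode_rewards(transitions, goal):
    """Rewards for a list of (state, action, next_state) tuples under a given goal."""
    return [sparse_reward(next_state, goal) for _state, _action, next_state in transitions]


def hindsight_goal(transitions):
    """Pick the state actually reached at the end of the episode as the new goal."""
    return transitions[-1][2]


# Hypothetical 1-D trajectory that moves from 0 to 2 but was aiming for 5.
transitions = [(0, "right", 1), (1, "right", 2)]
original = episode_rewards(transitions, goal=5)                           # never reaches 5
relabeled = episode_rewards(transitions, hindsight_goal(transitions))     # reaches 2 at the end
original_return = sum(original)
relabeled_return = sum(relabeled)
```

The same transitions yield a higher return under the substituted goal, so the relabeled episode provides a learning signal that the original failure did not.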
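The concept of return is central to many well-known applications of reinforcement learning, such as the Atari-playing DQN agent, whose training objective is a discounted return with gamma = 0.99 [8].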
Imagine you are playing a video game and collecting coins. Each coin is a "reward." The return is the total number of coins you collect from now until the game ends.
But here is the twist: coins you pick up right now are worth more than coins you might pick up later, because you are not sure you will get to those later coins. So each future coin is worth a little bit less. If the very next coin is worth 1 point, a coin two steps away might be worth 0.9 points, and one three steps away might be worth 0.81 points, and so on.
The return adds up all these coin values. The agent's whole job is to play the game in a way that makes this total as big as possible.