# Return (Reinforcement Learning)

> Source: https://aiwiki.ai/wiki/return
> Updated: 2026-07-11
> Categories: Machine Learning, Reinforcement Learning
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

In [reinforcement learning](/wiki/reinforcement_learning), the **return** (commonly denoted $$G_t$$) is the total cumulative reward an [agent](/wiki/agent) receives from time step $$t$$ onward, usually with future rewards discounted by a factor $$\gamma$$. Formally, $$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots$$, and it obeys the recursive identity $$G_t = R_{t+1} + \gamma G_{t+1}$$ [1]. The return is the single quantity that reinforcement learning maximizes: the agent's entire objective is to choose a [policy](/wiki/policy) that makes the expected return as large as possible. It is the foundational quantity from which [value functions](/wiki/state), advantage functions, and nearly every optimization objective in the field are derived, connecting the immediate [reward](/wiki/reward) signal to the agent's long-term goals.

A reward received $$k$$ time steps in the future contributes only $$\gamma^k$$ times what the same reward would be worth if received immediately, so the discount factor $$\gamma$$ in [0, 1] sets how far-sighted the agent is [1]. With $$\gamma$$ = 0 the agent is purely myopic; as $$\gamma$$ approaches 1 it weighs distant rewards almost as heavily as immediate ones.

## What is the return in reinforcement learning?

At each discrete time step $$t$$, the agent takes an action $$a_t$$ in state $$s_t$$, transitions to a new state $$s_{t+1}$$, and receives a scalar reward $$r_{t+1}$$. The return $$G_t$$ aggregates these future rewards into a single scalar value. There are two primary formulations depending on the nature of the task [1].

### Finite-horizon undiscounted return

In episodic tasks, which end at a final time step $$T$$ (either fixed in advance or reached when the agent enters a terminal state), the return is the simple sum of rewards collected over the rest of the episode:

$$
G_t = r_{t+1} + r_{t+2} + r_{t+3} + \cdots + r_T
$$

This formulation is appropriate when each episode has a well-defined terminal state, such as a game of chess ending in checkmate or a robot completing an assembly task. Because the episode ends after finitely many steps, the sum is finite whenever the individual rewards are.

### Infinite-horizon discounted return

In continuing tasks with no natural endpoint (or in episodic tasks where discounting is still desired), a [discount factor](/wiki/discount_factor) $$\gamma$$ (gamma) is applied to weight future rewards progressively less:

$$
G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \gamma^3 r_{t+4} + \cdots
$$

Or equivalently:

$$
G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}
$$

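Here, $$\gamma$$ in [0, 1) is the discount factor; the boundary value $$\gamma = 1$$ is admissible only when the task is guaranteed to terminate. This geometric weighting ensures that the infinite sum converges to a finite value as long as rewards are bounded: if every reward satisfies $$|r| \le R_{\max}$$, then $$|G_t| \le R_{\max} / (1 - \gamma)$$, the sum of a convergent geometric series [1].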
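The discounted sum can be computed directly from a list of observed rewards. The following is a minimal sketch; the function name `discounted_return` and the reward values are illustrative assumptions, with `rewards[k]` standing for $$r_{t+k+1}$$.

```python
def discounted_return(rewards, gamma):
    """Compute sum_k gamma**k * rewards[k], where rewards[k] plays the role of r_{t+k+1}."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))


# Illustrative reward sequence (assumed values, not taken from any benchmark).
rewards = [1.0, 1.0, 1.0, 1.0]
gamma = 0.9

g = discounted_return(rewards, gamma)  # 1 + 0.9 + 0.81 + 0.729
bound = 1.0 / (1.0 - gamma)            # R_max / (1 - gamma) with R_max = 1

print(g, bound)
```

With $$\gamma = 1$$ the same function returns the plain undiscounted sum, and with $$\gamma = 0$$ it returns only the first reward.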

### Recursive formulation

A useful property of the discounted return is that it can be expressed recursively:

$$
G_t = r_{t+1} + \gamma G_{t+1}
$$

This recursive relationship is the basis of the [Bellman equation](/wiki/bellman_equation) and underpins most reinforcement learning algorithms, from [temporal difference learning](/wiki/temporal_difference_learning) to [Q-learning](/wiki/q-learning) [1][2].
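The recursion also gives an efficient way to compute the return at every time step of a recorded episode in one backward pass. This is a sketch under the assumption that `rewards[t]` stores $$r_{t+1}$$ and that the return after the final step is zero; the name `returns_to_go` is illustrative.

```python
def returns_to_go(rewards, gamma):
    """Return [G_0, G_1, ..., G_{T-1}] using G_t = r_{t+1} + gamma * G_{t+1}."""
    returns = [0.0] * len(rewards)
    g_next = 0.0  # return after the last step is taken to be zero
    for t in reversed(range(len(rewards))):
        g_next = rewards[t] + gamma * g_next
        returns[t] = g_next
    return returns


print(returns_to_go([1.0, 0.0, 2.0], gamma=0.5))  # [1.5, 1.0, 2.0]
```

The backward pass costs one multiplication and one addition per step, whereas evaluating the explicit sum separately for each $$t$$ would be quadratic in the episode length.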

## What is the discount factor and why is it used?

The discount factor $$\gamma$$ is a hyperparameter that controls how the agent values future rewards relative to immediate ones. Sutton and Barto define it on the closed interval $$0 \le \gamma \le 1$$ and note that a reward arriving $$k$$ steps later is worth $$\gamma^k$$ of its immediate value, so $$\gamma$$ doubles as the present-value discount rate familiar from economics [1]. Its choice has both mathematical and behavioral implications.

| Gamma value | Behavior | Description |
|---|---|---|
| $$\gamma = 0$$ | Fully myopic | The agent only cares about the immediate next reward; $$G_t = r_{t+1}$$. |
| $$\gamma$$ close to 0 | Short-sighted | The agent strongly prefers near-term rewards and largely ignores distant ones. |
| $$\gamma$$ around 0.9 | Balanced | The agent values both near and moderately distant rewards. A common default in practice. |
| $$\gamma$$ close to 1 | Far-sighted | The agent gives nearly equal weight to near and distant rewards. |
| $$\gamma = 1$$ | No discounting | All future rewards are weighted equally. Only valid in episodic tasks with guaranteed termination; otherwise the sum may diverge. |

Deep reinforcement learning systems typically use a discount factor close to but below 1. DeepMind's DQN agent, which performed at a level comparable to a professional human games tester across a set of 49 Atari 2600 games, used $$\gamma$$ = 0.99 [8], and the same value remains a standard default in modern policy gradient implementations such as [PPO](/wiki/ppo).
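One way to build intuition for a particular value of $$\gamma$$ is to ask how far into the future rewards still carry appreciable weight. The sketch below uses two common rules of thumb, neither of which comes from the cited sources: the sum of all discount weights, $$1 / (1 - \gamma)$$, and the number of steps until the weight $$\gamma^k$$ drops below a chosen threshold. Both function names are illustrative.

```python
def effective_horizon(gamma):
    """Sum of the weights 1 + gamma + gamma**2 + ... = 1 / (1 - gamma)."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    return 1.0 / (1.0 - gamma)


def steps_until_weight_below(gamma, threshold):
    """Smallest k such that gamma**k < threshold (requires 0 < threshold <= 1)."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    k, weight = 0, 1.0
    while weight >= threshold:
        weight *= gamma
        k += 1
    return k


for gamma in (0.9, 0.99):
    print(gamma, effective_horizon(gamma), steps_until_weight_below(gamma, 0.5))
```

Under these heuristics, moving from $$\gamma = 0.9$$ to $$\gamma = 0.99$$ stretches the horizon by roughly a factor of ten.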

### Why is discounting used?

There are several reasons why discounting is used in practice:

1. **Mathematical convergence.** For continuing tasks, summing an infinite series of undiscounted rewards can yield an infinite value, making the optimization problem ill-defined. Discounting with $$\gamma < 1$$ guarantees convergence [1].
2. **Modeling uncertainty.** Future rewards are inherently less certain than immediate ones. A lower discount factor reflects the idea that distant outcomes are harder to predict and less reliable.
3. **Time preference.** In economic and decision-theoretic settings, there is a natural preference for receiving rewards sooner rather than later. Discounting encodes this preference.
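4. **Computational tractability.** With $$\gamma < 1$$ the Bellman operator is a contraction with modulus $$\gamma$$, so value-based algorithms such as value iteration and TD learning have a well-defined fixed point. A smaller $$\gamma$$ also shortens the effective planning horizon, which tends to speed up convergence and reduce the variance of return estimates.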

## How does the return relate to value functions?

The return is the building block for all value functions in reinforcement learning. A value function expresses the *expected* return under a given policy [1].

### State-value function

The state-value function $$V^{\pi}(s)$$ gives the expected return when starting in state $$s$$ and following policy $$\pi$$ thereafter:

$$
V^{\pi}(s) = \mathbb{E}_{\pi}\left[G_t \mid S_t = s\right]
$$

This measures how "good" it is for the agent to be in state $$s$$ under policy $$\pi$$.

### Action-value function

The action-value function (or Q-function) $$Q^{\pi}(s, a)$$ gives the expected return when starting in state $$s$$, taking action $$a$$, and then following policy $$\pi$$:

$$
Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[G_t \mid S_t = s, A_t = a\right]
$$

This is used directly in algorithms like [Q-learning](/wiki/q-learning) [3] and [SARSA](/wiki/sarsa).

### Advantage function

The advantage function measures how much better a specific action is compared to the average action under the current policy:

$$
A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s)
$$

A positive advantage means the action yields a higher expected return than the policy's average for that state. The advantage function is used extensively in policy gradient methods.
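For a single state with a handful of discrete actions, the three quantities are linked by $$V^{\pi}(s) = \sum_a \pi(a \mid s) Q^{\pi}(s, a)$$, so advantages can be read off from the action values. The numbers below are made up for illustration, and the function names are assumptions.

```python
def state_value(q_values, policy_probs):
    """V(s) as the policy-weighted average of Q(s, a)."""
    return sum(p * q for p, q in zip(policy_probs, q_values))


def advantages(q_values, policy_probs):
    """A(s, a) = Q(s, a) - V(s) for every action."""
    v = state_value(q_values, policy_probs)
    return [q - v for q in q_values]


q_values = [1.0, 2.0, 4.0]        # hypothetical Q(s, a) for three actions
policy_probs = [0.5, 0.25, 0.25]  # hypothetical pi(a | s)

print(state_value(q_values, policy_probs))  # 2.0
print(advantages(q_values, policy_probs))   # [-1.0, 0.0, 2.0]
```

A consequence of the definition is that the policy-weighted average of the advantages is zero in every state.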

### Optimal value functions

The optimal state-value function $$V^*(s)$$ and optimal action-value function $$Q^*(s, a)$$ represent the maximum expected return achievable from each state or state-action pair across all possible policies:

$$
V^*(s) = \max_{\pi} V^{\pi}(s)
$$

$$
Q^*(s, a) = \max_{\pi} Q^{\pi}(s, a)
$$

Finding these optimal value functions is equivalent to solving the reinforcement learning problem [11].

## How does the return connect to the Bellman equation?

The recursive property of the return ($$G_t = r_{t+1} + \gamma G_{t+1}$$) leads directly to the Bellman equations, which express the value of a state in terms of immediate reward plus the discounted value of successor states [2].

### Bellman expectation equation

For a given policy $$\pi$$, the Bellman expectation equation for the state-value function is:

$$
V^{\pi}(s) = \sum_a \pi(a \mid s) \sum_{s'} P(s' \mid s, a)\left[R(s, a, s') + \gamma V^{\pi}(s')\right]
$$

This states that the value of a state equals the expected immediate reward plus the discounted expected value of the next state, averaged over actions (weighted by the policy) and transitions (weighted by environment dynamics).
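Applying the Bellman expectation equation repeatedly as an update rule yields iterative policy evaluation. The sketch below runs it on a hypothetical two-state MDP invented for this example: in state 0, `stay` pays 1 and remains in state 0, `go` pays 0 and moves to the absorbing, zero-reward state 1. The data layout (`P[s][a]` as a list of `(probability, next_state, reward)` triples) is an assumption, not a standard API.

```python
def policy_evaluation(P, policy, gamma, tol=1e-10):
    """Sweep the Bellman expectation backup until the largest change is below tol."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v_new = sum(
                policy[s][a] * sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in P[s]
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V


# Hypothetical MDP: P[state][action] = [(probability, next_state, reward), ...]
P = {
    0: {"stay": [(1.0, 0, 1.0)], "go": [(1.0, 1, 0.0)]},
    1: {"stay": [(1.0, 1, 0.0)], "go": [(1.0, 1, 0.0)]},
}
uniform = {s: {"stay": 0.5, "go": 0.5} for s in P}

V = policy_evaluation(P, uniform, gamma=0.9)
print(V)
```

For the uniform policy, the equation for state 0 reads $$V(0) = 0.5\,(1 + 0.9\,V(0))$$, so the iteration should settle near $$10/11 \approx 0.909$$.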

### Bellman optimality equation

For the optimal value function, the Bellman optimality equation replaces the policy-weighted average with a maximization:

$$
V^*(s) = \max_a \sum_{s'} P(s' \mid s, a)\left[R(s, a, s') + \gamma V^*(s')\right]
$$

These equations, formalized in the framework of Markov decision processes, form the theoretical backbone of algorithms like [value iteration](/wiki/value_iteration), [policy iteration](/wiki/policy_iteration), and all forms of temporal difference learning [11][12].
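Value iteration turns the Bellman optimality equation into an update rule. The following sketch uses a hypothetical two-state MDP made up for illustration (state 0 offers `stay`, which pays 1 and remains in place, and `go`, which pays 0 and leads to an absorbing zero-reward state 1); the dictionary layout and function names are assumptions.

```python
def q_value(P, V, s, a, gamma):
    """Expected immediate reward plus discounted value of the successor state."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])


def value_iteration(P, gamma, tol=1e-10):
    """Apply the Bellman optimality backup until convergence; return V and a greedy policy."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v_new = max(q_value(P, V, s, a, gamma) for a in P[s])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            break
    greedy = {s: max(P[s], key=lambda a: q_value(P, V, s, a, gamma)) for s in P}
    return V, greedy


# Hypothetical MDP: P[state][action] = [(probability, next_state, reward), ...]
P = {
    0: {"stay": [(1.0, 0, 1.0)], "go": [(1.0, 1, 0.0)]},
    1: {"stay": [(1.0, 1, 0.0)], "go": [(1.0, 1, 0.0)]},
}

V_star, greedy_policy = value_iteration(P, gamma=0.9)
print(V_star, greedy_policy)
```

Staying in state 0 forever earns $$1 + 0.9 + 0.9^2 + \cdots = 10$$, which is the value the iteration should approach for state 0.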

## How is the return estimated?

In practice, the true return is rarely known in advance. Reinforcement learning algorithms differ in how they estimate returns from experience. The two classic extremes are Monte Carlo estimation, which uses the full observed return, and temporal difference learning, which bootstraps from an existing value estimate [1].

### Monte Carlo estimation

Monte Carlo methods estimate the return by running complete episodes and computing the actual cumulative reward observed. The return for a visited state is the discounted sum of all rewards received from that state until the end of the episode at time $$T$$:

$$
G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots + \gamma^{T-t-1} r_T
$$

The value estimate is then updated:

$$
V(S_t) \leftarrow V(S_t) + \alpha\left[G_t - V(S_t)\right]
$$

where $$\alpha$$ is the learning rate. Monte Carlo estimates are **unbiased** (each target is an actual sampled return, whose expectation is the true value) but have **high variance** because the returns of individual episodes can vary widely [1].
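A constant-step-size, every-visit version of this update can be sketched as follows. The episode format (a list of `(state, reward)` pairs, where the reward is the one received after leaving that state) and the function name are assumptions made for this example.

```python
def mc_update(V, episode, gamma, alpha):
    """Every-visit Monte Carlo update: V(S_t) <- V(S_t) + alpha * (G_t - V(S_t))."""
    g = 0.0
    # Walk backward so that G_t = r_{t+1} + gamma * G_{t+1} is available at each step.
    for state, reward in reversed(episode):
        g = reward + gamma * g
        old = V.get(state, 0.0)
        V[state] = old + alpha * (g - old)
    return V


V = {}
episode = [("A", 0.0), ("B", 1.0)]  # hypothetical two-step episode
mc_update(V, episode, gamma=0.9, alpha=0.5)
print(V)  # {'B': 0.5, 'A': 0.45}
```

The update can only run once the episode has finished, because each target $$G_t$$ depends on every later reward.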

### Temporal difference estimation

Temporal difference (TD) methods do not wait for the episode to end. Instead, they use the immediate reward plus the current estimate of the next state's value as a stand-in for the full return. This estimate is called the **TD target** [4]:

$$
\text{TD target} = r_{t+1} + \gamma V(S_{t+1})
$$

The value update becomes:

$$
V(S_t) \leftarrow V(S_t) + \alpha\left[r_{t+1} + \gamma V(S_{t+1}) - V(S_t)\right]
$$

The quantity $$r_{t+1} + \gamma V(S_{t+1}) - V(S_t)$$ is called the **TD error** (denoted $$\delta_t$$). TD methods are **biased** (they rely on potentially inaccurate value estimates) but have **lower variance** than Monte Carlo methods and can learn online without waiting for episode completion [4].
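A single tabular TD(0) step can be sketched as below. The `terminal` flag, which treats the value of a terminal successor as zero, and the dictionary-based value table are assumptions of this example.

```python
def td0_update(V, s, r, s_next, gamma, alpha, terminal=False):
    """Apply one TD(0) update to V[s] in place and return the TD error."""
    v_next = 0.0 if terminal else V.get(s_next, 0.0)
    td_error = r + gamma * v_next - V.get(s, 0.0)
    V[s] = V.get(s, 0.0) + alpha * td_error
    return td_error


V = {"A": 0.0, "B": 1.0}  # hypothetical current estimates
delta = td0_update(V, s="A", r=0.5, s_next="B", gamma=0.9, alpha=0.1)
print(delta, V["A"])  # approximately 1.4 and 0.14
```

Unlike the Monte Carlo update, this step needs only one transition, so it can be applied online after every action.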

### Comparison of return estimation methods

| Property | Monte Carlo | TD(0) |
|---|---|---|
| Return used | Actual return $$G_t$$ | Estimated return $$r_{t+1} + \gamma V(S_{t+1})$$ |
| Update timing | End of episode | After each step |
| Bias | Unbiased | Biased (due to bootstrapping) |
| Variance | High | Low |
| Requires episode termination | Yes | No |
| Uses bootstrapping | No | Yes |
| Works in continuing tasks | No (needs termination) | Yes |

## Multi-step and lambda returns

Monte Carlo and single-step TD represent two extremes on a spectrum. Intermediate methods blend both approaches by looking ahead multiple steps [1].

### N-step return

The n-step return combines $$n$$ actual rewards with a bootstrapped value estimate:

$$
G_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n V(S_{t+n})
$$

When $$n = 1$$, this reduces to the TD(0) target. When $$n = T - t$$ (the remaining length of the episode), it becomes the full Monte Carlo return.
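The n-step return can be computed from a recorded episode and a table of value estimates. In this sketch `rewards[k]` stands for $$r_{k+1}$$ and `values[k]` for $$V(S_k)$$ at non-terminal steps; the names and the numbers are illustrative assumptions.

```python
def n_step_return(rewards, values, t, n, gamma):
    """n real rewards starting at step t, then a bootstrapped value if the episode is not over."""
    T = len(rewards)
    end = min(t + n, T)
    g = 0.0
    for k in range(t, end):
        g += gamma ** (k - t) * rewards[k]
    if end < T:
        # Episode continues past the lookahead window: bootstrap from V(S_{t+n}).
        g += gamma ** (end - t) * values[end]
    return g


rewards = [1.0, 2.0, 3.0]     # hypothetical episode of length T = 3
values = [10.0, 20.0, 30.0]   # hypothetical estimates V(S_0), V(S_1), V(S_2)

print(n_step_return(rewards, values, t=0, n=1, gamma=0.5))  # 11.0  (TD(0) target)
print(n_step_return(rewards, values, t=0, n=2, gamma=0.5))  # 9.5
print(n_step_return(rewards, values, t=0, n=3, gamma=0.5))  # 2.75  (Monte Carlo return)
```

Any $$n$$ that reaches or passes the end of the episode yields the same Monte Carlo return, since the terminal state contributes no further value.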

### Lambda return (TD(lambda))

The lambda return, introduced by Sutton (1988), is a weighted average of all n-step returns, with weights that decay exponentially by a parameter $$\lambda$$ in [0, 1] [4]:

$$
G_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}
$$

The parameter $$\lambda$$ interpolates between TD(0) (when $$\lambda = 0$$) and Monte Carlo (when $$\lambda = 1$$). Eligibility traces provide an efficient online implementation of TD(lambda), maintaining a decaying record of which states have been recently visited so that the TD error can be distributed backward in time [1][4].
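For an episode that ends at step $$T$$, all n-step returns with $$n \ge T - t$$ equal the Monte Carlo return, so the infinite sum collapses to a finite one in which the remaining weight $$\lambda^{T-t-1}$$ falls on the full return. The sketch below implements this episodic form directly (the forward view, not eligibility traces); function names and data are illustrative assumptions.

```python
def n_step_return(rewards, values, t, n, gamma):
    """n-step return from step t; rewards[k] is r_{k+1}, values[k] is V(S_k)."""
    T = len(rewards)
    end = min(t + n, T)
    g = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    if end < T:
        g += gamma ** (end - t) * values[end]
    return g


def lambda_return(rewards, values, t, lam, gamma):
    """Episodic lambda-return: geometric mixture of n-step returns."""
    steps_left = len(rewards) - t
    g = 0.0
    for n in range(1, steps_left):
        g += (1 - lam) * lam ** (n - 1) * n_step_return(rewards, values, t, n, gamma)
    # All remaining weight goes to the full Monte Carlo return.
    g += lam ** (steps_left - 1) * n_step_return(rewards, values, t, steps_left, gamma)
    return g


rewards = [1.0, 2.0, 3.0]     # hypothetical episode
values = [10.0, 20.0, 30.0]   # hypothetical value estimates

print(lambda_return(rewards, values, t=0, lam=0.0, gamma=0.5))  # 11.0, the TD(0) target
print(lambda_return(rewards, values, t=0, lam=1.0, gamma=0.5))  # 2.75, the Monte Carlo return
print(lambda_return(rewards, values, t=0, lam=0.5, gamma=0.5))  # 8.5625
```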

### Generalized advantage estimation (GAE)

Generalized advantage estimation, proposed by Schulman et al. (2016), applies the lambda-return idea to advantage estimation in policy gradient methods [5]:

$$
A_t^{\mathrm{GAE}(\gamma, \lambda)} = \sum_{l=0}^{\infty} (\gamma \lambda)^l \delta_{t+l}
$$

where $$\delta_t = r_{t+1} + \gamma V(S_{t+1}) - V(S_t)$$ is the TD error. GAE provides a smooth tradeoff between bias and variance in advantage estimates. A lower $$\lambda$$ gives lower variance but higher bias; a higher $$\lambda$$ gives lower bias but higher variance [5]. GAE is used in many modern policy gradient algorithms including [PPO](/wiki/ppo) and [TRPO](/wiki/trust_region_policy_optimization).
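Over a finite trajectory, the GAE sum satisfies the backward recursion $$A_t = \delta_t + \gamma \lambda A_{t+1}$$, which is how it is usually computed in practice. The sketch below assumes `values` holds one more entry than `rewards`, the last being the bootstrap value of the state after the final step (0.0 if that state is terminal); the function name and numbers are illustrative.

```python
def gae(rewards, values, gamma, lam):
    """Generalized advantage estimates via A_t = delta_t + gamma * lam * A_{t+1}."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


rewards = [1.0, 2.0, 3.0]          # hypothetical trajectory
values = [10.0, 20.0, 30.0, 0.0]   # hypothetical V(S_0..S_2) plus terminal value 0

print(gae(rewards, values, gamma=0.5, lam=0.0))  # [1.0, -3.0, -27.0]
print(gae(rewards, values, gamma=0.5, lam=1.0))  # [-7.25, -16.5, -27.0]
```

With $$\lambda = 0$$ each estimate is the one-step TD error, and with $$\lambda = 1$$ it equals the Monte Carlo return minus the value baseline, $$G_t - V(S_t)$$.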

## How does the return differ between episodic and continuing tasks?

The definition and handling of the return differ between episodic and continuing tasks [1].

| Aspect | Episodic tasks | Continuing tasks |
|---|---|---|
| Termination | Natural endpoint (terminal state) | No natural endpoint |
| Return formulation | Undiscounted or discounted, finite sum | Discounted or average reward, infinite sum |
| Examples | Board games, maze navigation, Atari games | Stock trading, robotic control, server management |
| Discount factor | Can use $$\gamma = 1$$ safely | Must use $$\gamma < 1$$ or average reward formulation |

In continuing tasks, using $$\gamma = 1$$ can make the return diverge (for example, a constant reward of +1 per step sums to infinity), so either the discounted formulation or an alternative objective such as the average reward must be used [1].

### Average reward formulation

For continuing tasks, an alternative to discounting is the **average reward** setting, brought into reinforcement learning by Schwartz (1993) and developed further by Mahadevan (1996). Rather than maximizing the discounted return, the agent maximizes the long-term average reward per time step [6][7]:

$$
r(\pi) = \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}[r_t \mid \pi]
$$

The **differential return** measures the total excess reward above the average:

$$
G_t = \sum_{k=0}^{\infty} (r_{t+k+1} - r(\pi))
$$

This formulation avoids the need for a discount factor altogether and is well-suited to problems where the agent operates indefinitely and there is no meaningful way to prioritize near-term rewards over far-off ones [7].
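A small numerical illustration may help. The reward stream below is invented for this example: two unusually large rewards followed by a long run at the steady-state level of 2 per step, standing in for a finite sample from a continuing task. The function names are assumptions, and the long-run average $$r(\pi) = 2$$ is supplied by hand, not estimated.

```python
def average_reward(rewards):
    """Empirical average reward per step over a finite sample."""
    return sum(rewards) / len(rewards)


def differential_return(rewards, avg_reward):
    """Sum of (reward - average reward) over the sample, a truncation of the infinite sum."""
    return sum(r - avg_reward for r in rewards)


# Hypothetical stream: transient rewards 5 and 3, then 998 steps at the steady level 2.
rewards = [5.0, 3.0] + [2.0] * 998

print(average_reward(rewards))            # close to 2
print(differential_return(rewards, 2.0))  # 4.0
```

The differential return stays finite without any discounting because, once the transient has passed, each step contributes zero excess over the average.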

## What is the reward hypothesis?

The concept of return is closely linked to the **reward hypothesis**, articulated by Richard Sutton (2004):

> "All of what we mean by goals and purposes can be well thought of as maximization of the expected value of the cumulative sum of a received scalar signal (reward)." [1]

Under this hypothesis, every goal an intelligent agent might have can be expressed as maximizing the expected return. This idea is a foundational assumption in reinforcement learning, though it has been debated in the research community. Silver, Singh, Precup, and Sutton later pushed the claim further in their 2021 paper "Reward is enough," arguing that maximizing a scalar reward is sufficient to drive the emergence of abilities such as knowledge, perception, and language [13]. Critics have pointed out that certain objectives, such as multi-objective optimization, risk-sensitive behavior, and tasks requiring non-Markovian reward structures, may not be naturally expressible through a single scalar return.

## Extensions and variants

### Risk-sensitive returns

Standard reinforcement learning maximizes the *expected* return. However, in safety-critical applications, the agent may need to consider the distribution of returns rather than just the mean. Risk-sensitive approaches modify the objective to account for the variability or worst-case outcomes of the return:

- **Conditional Value at Risk (CVaR):** optimizes the expected return in the worst $$\alpha$$ fraction of outcomes (here $$\alpha$$ is a tail probability, not the learning rate used earlier), producing more cautious policies [14].
- **Exponential utility:** applies an exponential transformation to the return, which, for a risk-averse choice of its parameter, penalizes high-variance outcomes.
- **Mean-variance optimization:** balances the expected return against its variance.

These formulations are relevant in robotics, autonomous driving, and financial applications where catastrophic failures must be avoided.
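To make the contrast between the mean and a tail-focused objective concrete, the sketch below computes an empirical CVaR from a batch of sampled returns as the mean of the worst $$\lceil \alpha N \rceil$$ samples. This simple estimator and the sample values are assumptions made for illustration; it is not the sampling-based optimization procedure of the cited work.

```python
import math


def empirical_cvar(returns, alpha):
    """Mean of the worst ceil(alpha * N) returns, with alpha in (0, 1]."""
    k = max(1, math.ceil(alpha * len(returns)))
    worst = sorted(returns)[:k]  # lowest returns first
    return sum(worst) / k


# Hypothetical returns from ten episodes under some policy.
sampled_returns = [10.0, -5.0, 3.0, 8.0, -20.0, 6.0, 7.0, 9.0, 4.0, 5.0]

mean_return = sum(sampled_returns) / len(sampled_returns)
cvar_20 = empirical_cvar(sampled_returns, alpha=0.2)

print(mean_return)  # 2.7
print(cvar_20)      # -12.5, the mean of the two worst episodes
```

A policy can look acceptable on the mean while its lower tail is poor, which is the situation risk-sensitive objectives are meant to expose.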

### Hindsight return

In hindsight experience replay (HER), introduced by Andrychowicz et al. (2017), failed trajectories are reinterpreted by substituting the goal with a state that was actually reached. This changes the return retroactively, allowing the agent to learn from failures as if they were successes for an alternative goal [10].
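The relabeling step can be sketched in a few lines. This example assumes a sparse goal-conditioned reward of 0 when the reached state equals the goal and -1 otherwise, and it substitutes the final state of the trajectory as the new goal; the transition format and function names are hypothetical.

```python
def sparse_reward(achieved, goal):
    """0 when the goal is reached, -1 otherwise (assumed reward scheme)."""
    return 0.0 if achieved == goal else -1.0


def relabel_with_final_state(trajectory):
    """Replace the goal with the last state actually reached and recompute rewards."""
    new_goal = trajectory[-1][2]
    return [
        (s, a, s_next, new_goal, sparse_reward(s_next, new_goal))
        for s, a, s_next in trajectory
    ]


# Hypothetical failed attempt to reach state 5: the agent only got as far as state 3.
trajectory = [(0, "right", 1), (1, "right", 2), (2, "right", 3)]
original_goal = 5

original_return = sum(sparse_reward(s_next, original_goal) for _, _, s_next in trajectory)
relabeled = relabel_with_final_state(trajectory)
relabeled_return = sum(reward for *_, reward in relabeled)

print(original_return)   # -3.0
print(relabeled_return)  # -2.0
```

Under the original goal every step is penalized, whereas the relabeled trajectory ends with a success signal that the agent can learn from.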

## What are real-world examples of the return?

The concept of return is central to many well-known applications of reinforcement learning:

- **Atari games.** In DeepMind's DQN agent, the return is the discounted sum of game scores. The agent learns a Q-function that predicts the expected return for each action given the current screen pixels, using $$\gamma$$ = 0.99, and achieved more than 75% of a professional human games tester's score on 29 of 49 Atari 2600 titles [8].
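- **Go.** In [AlphaGo](/wiki/alphago) and AlphaGo Zero, the only nonzero reward arrives at the end of a game, so the return is +1 for a win and -1 for a loss. The value network estimates the expected return from any board position, which is a linear rescaling of the win probability: an expected return of $$v$$ corresponds to a win probability of $$(1 + v)/2$$ [9].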
- **Robotics.** In simulated locomotion tasks (such as training a robot to walk), the return accumulates per-step rewards that combine forward velocity, energy efficiency, and penalties for falling. The discount factor balances making progress now against long-term stability.
- **Large language models.** In reinforcement learning from human feedback ([RLHF](/wiki/rlhf)) for [large language models](/wiki/large_language_model), the return is typically the scalar reward assigned by a [reward model](/wiki/reward_model) to a generated response, sometimes combined with a KL-divergence penalty to stay close to the base model.

## Explain like I'm 5 (ELI5)

Imagine you are playing a video game and collecting coins. Each coin is a "reward." The **return** is the total number of coins you collect from now until the game ends.

But here is the twist: coins you pick up right now are worth more than coins you might pick up later, because you are not sure you will get to those later coins. So each future coin is worth a little bit less. If the very next coin is worth 1 point, a coin two steps away might be worth 0.9 points, and one three steps away might be worth 0.81 points, and so on.

The return adds up all these coin values. The agent's whole job is to play the game in a way that makes this total as big as possible.

## References

1. Sutton, R. S. and Barto, A. G. (2018). *Reinforcement Learning: An Introduction* (2nd ed.). MIT Press. [http://incompleteideas.net/book/the-book-2nd.html](http://incompleteideas.net/book/the-book-2nd.html)
2. Bellman, R. (1957). *Dynamic Programming*. Princeton University Press.
3. Watkins, C. J. C. H. and Dayan, P. (1992). "Q-learning." *Machine Learning*, 8(3-4), 279-292.
4. Sutton, R. S. (1988). "Learning to predict by the methods of temporal differences." *Machine Learning*, 3(1), 9-44.
5. Schulman, J., Moritz, P., Levine, S., Jordan, M. I., and Abbeel, P. (2016). "High-dimensional continuous control using generalized advantage estimation." *Proceedings of the International Conference on Learning Representations (ICLR)*. [https://arxiv.org/abs/1506.02438](https://arxiv.org/abs/1506.02438)
6. Schwartz, A. (1993). "A reinforcement learning method for maximizing undiscounted rewards." *Proceedings of the 10th International Conference on Machine Learning (ICML)*, 298-305.
7. Mahadevan, S. (1996). "Average reward reinforcement learning: Foundations, algorithms, and empirical results." *Machine Learning*, 22(1-3), 159-195.
8. Mnih, V., Kavukcuoglu, K., Silver, D., et al. (2015). "Human-level control through deep reinforcement learning." *Nature*, 518(7540), 529-533. [https://www.nature.com/articles/nature14236](https://www.nature.com/articles/nature14236)
9. Silver, D., Schrittwieser, J., Simonyan, K., et al. (2017). "Mastering the game of Go without human knowledge." *Nature*, 550(7676), 354-359.
10. Andrychowicz, M., Wolski, F., Ray, A., et al. (2017). "Hindsight experience replay." *Advances in Neural Information Processing Systems (NeurIPS)*, 30.
11. Puterman, M. L. (1994). *Markov Decision Processes: Discrete Stochastic Dynamic Programming*. John Wiley & Sons.
12. Bertsekas, D. P. and Tsitsiklis, J. N. (1996). *Neuro-Dynamic Programming*. Athena Scientific.
13. Silver, D., Singh, S., Precup, D., and Sutton, R. S. (2021). "Reward is enough." *Artificial Intelligence*, 299, 103535.
14. Tamar, A., Glassner, Y., and Mannor, S. (2015). "Optimizing the CVaR via sampling." *Proceedings of the AAAI Conference on Artificial Intelligence*.