Return (Reinforcement Learning)

In reinforcement learning, the return (commonly denoted G_t) is the cumulative reward an agent receives starting from a given time step. It represents the total payoff the agent aims to maximize through its policy and serves as the foundational quantity from which value functions, advantage functions, and nearly all optimization objectives in reinforcement learning are derived. The return connects the immediate reward signal to the agent's long-term goals, making it one of the most important concepts in the field.

Formal definition

At each discrete time step t, the agent takes an action a_t in state s_t, transitions to a new state s_{t+1}, and receives a scalar reward r_{t+1}. The return G_t aggregates these future rewards into a single scalar value. There are two primary formulations depending on the nature of the task.

Finite-horizon undiscounted return

In episodic tasks, which end at a final time step T, the return is the simple sum of rewards collected from time t until the end of the episode:

G_t = r_{t+1} + r_{t+2} + r_{t+3} + ... + r_T

This formulation is appropriate when each episode has a well-defined terminal state, such as a game of chess ending in checkmate or a robot completing an assembly task. Because the sum has finitely many terms, it is finite whenever each individual reward is finite.

Infinite-horizon discounted return

In continuing tasks with no natural endpoint (or in episodic tasks where discounting is still desired), a discount factor γ (gamma) is applied to weight future rewards progressively less:

G_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + γ³ r_{t+4} + ...

Or equivalently:

G_t = Σ (k=0 to ∞) γ^k r_{t+k+1}

Here, γ ∈ [0, 1) is the discount factor. This geometric weighting ensures that the infinite sum converges to a finite value as long as rewards are bounded: if |r| ≤ R_max for every reward, then |G_t| ≤ R_max (1 + γ + γ² + ...) = R_max / (1 - γ).
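As a minimal sketch of the definition above, the following Python function computes a discounted return for a finite list of observed rewards. The function name and the convention that rewards[0] stands for r_{t+1} are assumptions made for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * rewards[k], where rewards[0] plays the role of r_{t+1}."""
    total = 0.0
    for k, reward in enumerate(rewards):
        total += (gamma ** k) * reward
    return total


rewards = [1.0, 0.0, 2.0]
print(discounted_return(rewards, 1.0))  # 3.0 (undiscounted sum)
print(discounted_return(rewards, 0.5))  # 1.5 (1 + 0.5*0 + 0.25*2)
```

Setting gamma to 1.0 recovers the undiscounted finite-horizon return for the same reward list.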

Recursive formulation

A useful property of the discounted return is that it can be expressed recursively:

G_t = r_{t+1} + γ G_{t+1}

This recursive relationship is the basis of the Bellman equation and underpins most reinforcement learning algorithms, from temporal difference learning to Q-learning.
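The recursion also gives a convenient way to compute the return at every time step of a recorded episode in a single backward pass. The sketch below assumes a finite episode stored as a list in which rewards[t] stands for r_{t+1}; the function name is hypothetical.

```python
def returns_from_rewards(rewards, gamma):
    """Compute G_t for every t using G_t = r_{t+1} + gamma * G_{t+1}.

    rewards[t] plays the role of r_{t+1}; the return after the last
    step is taken to be 0.
    """
    returns = [0.0] * len(rewards)
    next_return = 0.0
    for t in reversed(range(len(rewards))):
        next_return = rewards[t] + gamma * next_return
        returns[t] = next_return
    return returns


print(returns_from_rewards([1.0, 0.0, 2.0], 0.5))  # [1.5, 1.0, 2.0]
```

Working backward costs one multiplication and one addition per step, whereas summing the series separately for each t would repeat most of the work.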

The discount factor

The discount factor γ is a hyperparameter that controls how the agent values future rewards relative to immediate ones. Its choice has both mathematical and behavioral implications.

Gamma value	Behavior	Description
γ = 0	Fully myopic	The agent only cares about the immediate next reward; G_t = r_{t+1}.
γ close to 0	Short-sighted	The agent strongly prefers near-term rewards and largely ignores distant ones.
γ around 0.9	Balanced	The agent values both near and moderately distant rewards. A common default in practice.
γ close to 1	Far-sighted	The agent gives nearly equal weight to near and distant rewards.
γ = 1	No discounting	All future rewards are weighted equally. Only valid in episodic tasks with guaranteed termination; otherwise the sum may diverge.
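To make the rows of the table concrete, the sketch below prints the weight γ^k placed on a reward several steps ahead, together with the return of a hypothetical task that pays a constant reward of 1 forever (which equals 1 / (1 - γ) by the geometric series). The helper names and the chosen γ values are illustrative assumptions.

```python
def discount_weights(gamma, horizon):
    """Weights gamma**k applied to the rewards r_{t+1}, ..., r_{t+horizon}."""
    return [gamma ** k for k in range(horizon)]


def constant_reward_return(gamma, reward=1.0):
    """Closed-form discounted return when every reward equals `reward` (needs gamma < 1)."""
    return reward / (1.0 - gamma)


for gamma in (0.0, 0.5, 0.9, 0.99):
    weights = discount_weights(gamma, horizon=11)
    print(
        f"gamma={gamma}: weight on the 11th reward = {weights[-1]:.4f}, "
        f"return for constant reward 1 = {constant_reward_return(gamma):.1f}"
    )
```

With γ = 0 only the first reward counts, while with γ = 0.99 a reward ten steps later still carries about 90% of the weight of the first one.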

Reasons for discounting

There are several reasons why discounting is used in practice:

Mathematical convergence. For continuing tasks, summing an infinite series of undiscounted rewards can yield an infinite value, making the optimization problem ill-defined. Discounting with γ < 1 guarantees convergence.
Modeling uncertainty. Future rewards are inherently less certain than immediate ones. A lower discount factor reflects the idea that distant outcomes are harder to predict and less reliable.
Time preference. In economic and decision-theoretic settings, there is a natural preference for receiving rewards sooner rather than later. Discounting encodes this preference.
Computational tractability. Algorithms that rely on value function estimation (such as TD learning) benefit from discounting because a smaller γ shortens the effective horizon to roughly 1/(1 - γ) steps, which tends to reduce the variance of return estimates and speed up convergence.

Connection to value functions

The return is the building block for all value functions in reinforcement learning. A value function expresses the expected return under a given policy.

State-value function

The state-value function V^π(s) gives the expected return when starting in state s and following policy π thereafter:

V^π(s) = E_π [G_t | S_t = s]

This measures how "good" it is for the agent to be in state s under policy π.

Action-value function

The action-value function (or Q-function) Q^π(s, a) gives the expected return when starting in state s, taking action a, and then following policy π:

Q^π(s, a) = E_π [G_t | S_t = s, A_t = a]

This is used directly in algorithms like Q-learning and SARSA.

Advantage function

The advantage function measures how much better a specific action is compared to the average action under the current policy:

A^π(s, a) = Q^π(s, a) - V^π(s)

A positive advantage means the action yields a higher expected return than the policy's average for that state. The advantage function is used extensively in policy gradient methods.
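The following sketch computes advantages for a single state from a small table of action values. It relies on the standard identity V^π(s) = Σ_a π(a|s) Q^π(s, a), which is what "the average action under the current policy" refers to; the function names and the numbers are made up for illustration.

```python
def state_value(q_values, policy_probs):
    """V(s) as the policy-weighted average of Q(s, a) over actions."""
    return sum(p * q for p, q in zip(policy_probs, q_values))


def advantages(q_values, policy_probs):
    """A(s, a) = Q(s, a) - V(s) for every action in one state."""
    v = state_value(q_values, policy_probs)
    return [q - v for q in q_values]


q = [1.0, 3.0]    # hypothetical Q(s, a) for two actions
pi = [0.5, 0.5]   # hypothetical policy probabilities pi(a|s)
print(state_value(q, pi))  # 2.0
print(advantages(q, pi))   # [-1.0, 1.0]
```

Because V^π(s) is the policy-weighted mean of the action values, the advantages weighted by π always sum to zero in each state.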

Optimal value functions

The optimal state-value function V*(s) and optimal action-value function Q*(s, a) represent the maximum expected return achievable from each state or state-action pair across all possible policies:

V*(s) = max_π V^π(s)

Q*(s, a) = max_π Q^π(s, a)

Finding these optimal value functions is equivalent to solving the reinforcement learning problem.

The Bellman equation

The recursive property of the return (G_t = r_{t+1} + γ G_{t+1}) leads directly to the Bellman equations, which express the value of a state in terms of immediate reward plus the discounted value of successor states.

Bellman expectation equation

For a given policy π, the Bellman expectation equation for the state-value function is:

V^π(s) = Σ_a π(a|s) Σ_{s'} P(s'|s,a) [R(s,a,s') + γ V^π(s')]

This states that the value of a state equals the expected immediate reward plus the discounted expected value of the next state, averaged over actions (weighted by the policy) and transitions (weighted by environment dynamics).
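One way to use the Bellman expectation equation is to apply it repeatedly as an update rule until the values stop changing (iterative policy evaluation). The sketch below does this for a hypothetical two-state MDP; the dictionary layout, state and action names, and rewards are all assumptions chosen for illustration.

```python
def policy_evaluation(transitions, policy, gamma, tol=1e-10, max_sweeps=10_000):
    """Iterate V(s) <- sum_a pi(a|s) sum_s' P(s'|s,a) [R(s,a,s') + gamma V(s')].

    transitions[s][a] is a list of (probability, next_state, reward) triples.
    policy[s][a] is the probability pi(a|s).
    """
    values = {s: 0.0 for s in transitions}
    for _ in range(max_sweeps):
        max_change = 0.0
        for s in transitions:
            new_value = 0.0
            for a, action_prob in policy[s].items():
                for prob, next_s, reward in transitions[s][a]:
                    new_value += action_prob * prob * (reward + gamma * values[next_s])
            max_change = max(max_change, abs(new_value - values[s]))
            values[s] = new_value
        if max_change < tol:
            break
    return values


# Hypothetical deterministic MDP with two states and two actions.
transitions = {
    "A": {"stay": [(1.0, "A", 0.0)], "move": [(1.0, "B", 1.0)]},
    "B": {"stay": [(1.0, "B", 2.0)], "move": [(1.0, "A", 0.0)]},
}
uniform = {s: {"stay": 0.5, "move": 0.5} for s in transitions}
values = policy_evaluation(transitions, uniform, gamma=0.5)
print({s: round(v, 4) for s, v in values.items()})  # {'A': 1.25, 'B': 1.75}
```

The printed values satisfy the two linear equations that the Bellman expectation equation gives for this MDP, which can be checked by hand.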

Bellman optimality equation

For the optimal value function, the Bellman optimality equation replaces the policy-weighted average with a maximization:

V*(s) = max_a Σ_{s'} P(s'|s,a) [R(s,a,s') + γ V*(s')]

These equations form the theoretical backbone of algorithms like value iteration, policy iteration, and all forms of temporal difference learning.
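As an illustration of how the optimality equation becomes an algorithm, the sketch below runs value iteration on a hypothetical two-state MDP. The data layout, state names, and rewards are assumptions made for this example only.

```python
def value_iteration(transitions, gamma, tol=1e-10, max_sweeps=10_000):
    """Iterate V(s) <- max_a sum_s' P(s'|s,a) [R(s,a,s') + gamma V(s')].

    transitions[s][a] is a list of (probability, next_state, reward) triples.
    """
    values = {s: 0.0 for s in transitions}
    for _ in range(max_sweeps):
        max_change = 0.0
        for s, actions in transitions.items():
            best = max(
                sum(p * (r + gamma * values[s2]) for p, s2, r in outcomes)
                for outcomes in actions.values()
            )
            max_change = max(max_change, abs(best - values[s]))
            values[s] = best
        if max_change < tol:
            break
    return values


# Hypothetical deterministic MDP with two states and two actions.
transitions = {
    "A": {"stay": [(1.0, "A", 0.0)], "move": [(1.0, "B", 1.0)]},
    "B": {"stay": [(1.0, "B", 2.0)], "move": [(1.0, "A", 0.0)]},
}
optimal = value_iteration(transitions, gamma=0.5)
print({s: round(v, 4) for s, v in optimal.items()})  # {'A': 3.0, 'B': 4.0}
```

Here the best behavior is to move from A to B and then stay in B, so V*(B) = 2 / (1 - 0.5) = 4 and V*(A) = 1 + 0.5 × 4 = 3.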

Estimating the return

In practice, the true return is rarely known in advance. Reinforcement learning algorithms differ in how they estimate returns from experience.

Monte Carlo estimation

Monte Carlo methods estimate the return by running complete episodes and computing the actual cumulative reward observed. The return for a visited state is the discounted sum of all rewards received from that state until the end of the episode:

G_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... + γ^{T-t-1} r_T

The value estimate is then updated:

V(S_t) ← V(S_t) + α [G_t - V(S_t)]

where α is the learning rate. Monte Carlo estimates are unbiased (each G_t is an actual sampled return, so its expectation is V^π(S_t)) but have high variance because individual episodes can vary widely.
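A minimal sketch of this update, assuming an every-visit variant with a constant learning rate and an episode stored as (state, reward) pairs in which the reward is the one received after leaving that state. The function name and data layout are hypothetical.

```python
def mc_update(values, episode, gamma, alpha):
    """Every-visit Monte Carlo update with constant step size.

    episode is a list of (S_t, r_{t+1}) pairs for one complete episode.
    values maps each state to its current estimate V(S) and is modified in place.
    """
    g = 0.0
    # Walk backward so that G_t = r_{t+1} + gamma * G_{t+1} is available at each step.
    for state, reward in reversed(episode):
        g = reward + gamma * g
        values[state] += alpha * (g - values[state])
    return values


values = {"A": 0.0, "B": 0.0}
episode = [("A", 1.0), ("B", 2.0)]
mc_update(values, episode, gamma=0.5, alpha=0.1)
print(values)  # both states observed a return of 2.0, so both move 10% of the way from 0 to 2
```

The update can only run once the episode has finished, because G_t for early steps depends on every later reward.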

Temporal difference estimation

Temporal difference (TD) methods do not wait for the episode to end. Instead, they use the immediate reward plus the current estimate of the next state's value as a stand-in for the full return. This estimate is called the TD target:

TD target = r_{t+1} + γ V(S_{t+1})

The value update becomes:

V(S_t) ← V(S_t) + α [r_{t+1} + γ V(S_{t+1}) - V(S_t)]

The quantity r_{t+1} + γ V(S_{t+1}) - V(S_t) is called the TD error (denoted δ_t). TD methods are biased (they rely on potentially inaccurate value estimates) but have lower variance than Monte Carlo methods and can learn online without waiting for episode completion.
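The TD(0) update can be sketched as a function applied after every single transition. The names below, and the convention that a terminal next state has value 0, are assumptions for illustration.

```python
def td0_update(values, state, reward, next_state, gamma, alpha, terminal=False):
    """Apply one TD(0) update in place and return the TD error delta_t.

    values maps states to current estimates V(S). If `terminal` is True,
    the value of the next state is taken to be 0.
    """
    next_value = 0.0 if terminal else values[next_state]
    td_error = reward + gamma * next_value - values[state]
    values[state] += alpha * td_error
    return td_error


values = {"A": 0.0, "B": 1.0}
delta = td0_update(values, "A", reward=1.0, next_state="B", gamma=0.5, alpha=0.1)
print(delta)  # 1.5, since 1.0 + 0.5 * 1.0 - 0.0
```

Unlike the Monte Carlo update, this needs only the current transition, so it can be applied online while the episode is still in progress.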

Comparison of return estimation methods

Property	Monte Carlo	TD(0)
Return used	Actual return G_t	Estimated return r_{t+1} + γ V(S_{t+1})
Update timing	End of episode	After each step
Bias	Unbiased	Biased (due to bootstrapping)
Variance	High	Low
Requires episode termination	Yes	No
Uses bootstrapping	No	Yes
Works in continuing tasks	No (needs termination)	Yes

Multi-step and lambda returns

Monte Carlo and single-step TD represent two extremes on a spectrum. Intermediate methods blend both approaches by looking ahead multiple steps.

N-step return

The n-step return combines n actual rewards with a bootstrapped value estimate:

G_t^{(n)} = r_{t+1} + γ r_{t+2} + ... + γ^{n-1} r_{t+n} + γ^n V(S_{t+n})

When n = 1, this reduces to the TD(0) target. When n = T - t (the remaining length of the episode), it becomes the full Monte Carlo return.
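The n-step return can be computed from a recorded episode as sketched below. The layout is an assumption: rewards[k] stands for r_{k+1}, values[k] for V(S_k), and values has one more entry than rewards, with the terminal state's value set to 0.

```python
def n_step_return(rewards, values, t, n, gamma):
    """n-step return from time t, truncated at the end of the episode.

    rewards[k] plays the role of r_{k+1}; values[k] is V(S_k), and
    values[len(rewards)] is the terminal state's value (0 by convention).
    """
    steps = min(n, len(rewards) - t)  # cannot look past the end of the episode
    g = 0.0
    for k in range(steps):
        g += (gamma ** k) * rewards[t + k]
    g += (gamma ** steps) * values[t + steps]
    return g


rewards = [1.0, 0.0, 2.0]
values = [0.0, 2.0, 1.0, 0.0]  # hypothetical estimates; the last entry is the terminal state
print(n_step_return(rewards, values, t=0, n=1, gamma=0.5))  # 2.0  (TD(0) target)
print(n_step_return(rewards, values, t=0, n=2, gamma=0.5))  # 1.25
print(n_step_return(rewards, values, t=0, n=3, gamma=0.5))  # 1.5  (Monte Carlo return)
```

The three results differ because the hypothetical value estimates are inaccurate; with exact values every n would give the same expected target.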

Lambda return (TD(λ))

The lambda return, introduced by Sutton (1988), is a weighted average of all n-step returns, with weights that decay exponentially by a parameter λ ∈ [0, 1]:

G_t^λ = (1 - λ) Σ (n=1 to ∞) λ^{n-1} G_t^{(n)}

The parameter λ interpolates between TD(0) (when λ = 0) and Monte Carlo (when λ = 1). Eligibility traces provide an efficient online implementation of TD(λ), maintaining a decaying record of which states have been recently visited so that the TD error can be distributed backward in time.
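For a finite episode, the lambda return can be computed offline with a backward recursion, G_t^λ = r_{t+1} + γ [(1 - λ) V(S_{t+1}) + λ G_{t+1}^λ], which is equivalent to the weighted average above when all weight beyond the end of the episode is placed on the Monte Carlo return. The sketch below uses that recursion; the function name and the data layout (rewards[t] for r_{t+1}, values[t] for V(S_t), terminal value 0) are assumptions.

```python
def lambda_returns(rewards, values, gamma, lam):
    """Offline lambda returns for one finished episode.

    rewards[t] plays the role of r_{t+1}; values[t] is V(S_t), with
    len(values) == len(rewards) + 1 and the last entry equal to 0.
    """
    returns = [0.0] * len(rewards)
    next_return = 0.0  # lambda return after the terminal step
    for t in reversed(range(len(rewards))):
        blended = (1.0 - lam) * values[t + 1] + lam * next_return
        next_return = rewards[t] + gamma * blended
        returns[t] = next_return
    return returns


rewards = [1.0, 0.0, 2.0]
values = [0.0, 2.0, 1.0, 0.0]  # hypothetical value estimates
print(lambda_returns(rewards, values, gamma=0.5, lam=0.0))  # [2.0, 0.5, 2.0]   TD(0) targets
print(lambda_returns(rewards, values, gamma=0.5, lam=1.0))  # [1.5, 1.0, 2.0]   Monte Carlo returns
print(lambda_returns(rewards, values, gamma=0.5, lam=0.5))  # [1.6875, 0.75, 2.0]
```

This is the forward-view quantity computed after the fact; it is not an eligibility-trace implementation.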

Generalized advantage estimation (GAE)

Generalized advantage estimation, proposed by Schulman et al. (2016), applies the lambda-return idea to advantage estimation in policy gradient methods:

A_t^{GAE(γ,λ)} = Σ (l=0 to ∞) (γλ)^l δ_{t+l}

where δ_t = r_{t+1} + γ V(S_{t+1}) - V(S_t) is the TD error. GAE provides a smooth tradeoff between bias and variance in advantage estimates. A lower λ gives lower variance but higher bias; a higher λ gives lower bias but higher variance. GAE is used in many modern policy gradient algorithms including PPO and TRPO.
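Over a finite episode the GAE sum can be accumulated backward using A_t = δ_t + γλ A_{t+1}. The sketch below assumes rewards[t] stands for r_{t+1}, values[t] for V(S_t), and a terminal value of 0 as the final entry of values; it is an illustration rather than the implementation used by any particular library.

```python
def gae_advantages(rewards, values, gamma, lam):
    """Generalized advantage estimates for one finished episode.

    rewards[t] plays the role of r_{t+1}; values[t] is V(S_t), with
    len(values) == len(rewards) + 1 and the last entry equal to 0.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0  # advantage estimate after the terminal step
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


rewards = [1.0, 0.0, 2.0]
values = [0.0, 2.0, 1.0, 0.0]  # hypothetical value estimates
print(gae_advantages(rewards, values, gamma=0.5, lam=0.5))  # [1.6875, -1.25, 1.0]
print(gae_advantages(rewards, values, gamma=0.5, lam=0.0))  # [2.0, -1.5, 1.0]  (one-step TD errors)
```

With λ = 0 each estimate is just the one-step TD error, and with λ = 1 it becomes the Monte Carlo return minus V(S_t).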

Episodic versus continuing tasks

The definition and handling of the return differs between episodic and continuing tasks.

Aspect	Episodic tasks	Continuing tasks
Termination	Natural endpoint (terminal state)	No natural endpoint
Return formulation	Undiscounted or discounted, finite sum	Discounted or average reward, infinite sum
Examples	Board games, maze navigation, Atari games	Stock trading, robotic control, server management
Discount factor	Can use γ = 1 safely	Must use γ < 1 or average reward formulation

In continuing tasks, using γ = 1 can make the return diverge (for example, when every reward is at least some positive constant), so either the discounted formulation or an alternative objective such as the average reward must be used.

Average reward formulation

For continuing tasks, an alternative to discounting is the average reward setting, introduced by Schwartz (1993) and developed further by Mahadevan (1996). Rather than maximizing the discounted return, the agent maximizes the long-term average reward per time step:

r(π) = lim (T→∞) (1/T) Σ (t=1 to T) E[r_t | π]

The differential return measures the total excess reward above the average:

G_t = Σ (k=0 to ∞) (r_{t+k+1} - r(π))

This formulation avoids the need for a discount factor altogether and is well-suited to problems where the agent operates indefinitely and there is no meaningful way to prioritize near-term rewards over far-off ones.
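As a rough finite-sample illustration, the average reward can be approximated by the mean of a long reward sequence, and the differential return by summing the deviations from that mean over the remaining window. Both are truncations of the limits in the definitions above, and the function names and reward values are hypothetical.

```python
def average_reward(rewards):
    """Empirical estimate of r(pi) from a finite reward sequence."""
    return sum(rewards) / len(rewards)


def differential_return(rewards, t, avg):
    """Finite-window approximation of sum_k (r_{t+k+1} - r(pi)), starting at index t."""
    return sum(r - avg for r in rewards[t:])


rewards = [1.0, 3.0, 1.0, 3.0]  # a hypothetical repeating reward pattern
avg = average_reward(rewards)
print(avg)                                   # 2.0
print(differential_return(rewards, 0, avg))  # 0.0
print(differential_return(rewards, 1, avg))  # 1.0
```

Starting at index 1 gives a larger differential return because an above-average reward arrives first, even though the long-run average is the same from either starting point.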

The reward hypothesis

The concept of return is closely linked to the reward hypothesis, articulated by Richard Sutton (2004):

"All of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called reward)."

Under this hypothesis, every goal an intelligent agent might have can be expressed as maximizing the expected return. This idea is a foundational assumption in reinforcement learning, though it has been debated in the research community. Critics have pointed out that certain objectives, such as multi-objective optimization, risk-sensitive behavior, and tasks requiring non-Markovian reward structures, may not be naturally expressible through a single scalar return.

Extensions and variants

Risk-sensitive returns

Standard reinforcement learning maximizes the expected return. However, in safety-critical applications, the agent may need to consider the distribution of returns rather than just the mean. Risk-sensitive approaches modify the objective to account for the variability or worst-case outcomes of the return:

Conditional Value at Risk (CVaR): optimizes the expected return in the worst α fraction of outcomes, producing more cautious policies.
Exponential utility: applies an exponential transformation to the return, which penalizes high-variance outcomes.
Mean-variance optimization: balances the expected return against its variance.

These formulations are relevant in robotics, autonomous driving, and financial applications where catastrophic failures must be avoided.
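Two of these objectives can be estimated from a batch of sampled returns, as in the sketch below. It uses one simple empirical estimator of CVaR (the mean of the worst ⌈αN⌉ samples) and a mean-minus-weighted-variance score; the function names, the risk weight, and the sample returns are assumptions. Note that α here is the tail fraction, not the learning rate.

```python
import math
from statistics import mean, pvariance


def cvar(returns, alpha):
    """Mean of the worst ceil(alpha * N) sampled returns (lower return = worse)."""
    k = max(1, math.ceil(alpha * len(returns)))
    worst = sorted(returns)[:k]
    return mean(worst)


def mean_variance(returns, risk_weight):
    """Expected return penalized by risk_weight times the (population) variance."""
    return mean(returns) - risk_weight * pvariance(returns)


sampled = [10.0, -5.0, 3.0, 0.0, 8.0, -20.0, 4.0, 6.0, 1.0, 2.0]  # hypothetical returns
print(cvar(sampled, alpha=0.2))  # -12.5, the mean of the two worst returns (-20 and -5)
print(mean_variance(sampled, risk_weight=0.1))
```

A policy with a high mean return can still score poorly on either measure if a few of its episodes end very badly.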

Hindsight return

In hindsight experience replay (HER), introduced by Andrychowicz et al. (2017), failed trajectories are reinterpreted by substituting the goal with a state that was actually reached. This changes the return retroactively, allowing the agent to learn from failures as if they were successes for an alternative goal.
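The relabeling idea can be sketched as follows, assuming a sparse reward of 0 when the goal is reached and -1 otherwise, and assuming the substituted goal is the final state of the trajectory. The function names, the integer states, and the trajectory are hypothetical.

```python
def sparse_reward(achieved_state, goal):
    """Assumed reward: 0 when the goal is reached, -1 otherwise."""
    return 0.0 if achieved_state == goal else -1.0


def relabel_with_final_state(transitions):
    """Replace the goal by the last state actually reached and recompute rewards.

    transitions is a list of (state, action, next_state) triples.
    Returns (state, action, next_state, new_goal, new_reward) tuples.
    """
    new_goal = transitions[-1][2]
    return [
        (s, a, s_next, new_goal, sparse_reward(s_next, new_goal))
        for s, a, s_next in transitions
    ]


trajectory = [(0, "right", 1), (1, "right", 2)]  # the agent ends in state 2
original_goal = 5
original_rewards = [sparse_reward(s_next, original_goal) for _, _, s_next in trajectory]
relabeled_rewards = [step[4] for step in relabel_with_final_state(trajectory)]
print(sum(original_rewards))   # -2.0: the original goal was never reached
print(sum(relabeled_rewards))  # -1.0: the final step reaches the substituted goal
```

The same transitions therefore yield a higher undiscounted return under the substituted goal, which gives the agent a useful learning signal.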

Practical examples

The concept of return is central to many well-known applications of reinforcement learning:

Atari games. In DeepMind's DQN agent, the return is the discounted sum of per-step changes in game score (clipped to the range [-1, 1] during training). The agent learns a Q-function that predicts the expected return for each action given the current screen pixels, using γ = 0.99.
Go. In AlphaGo and AlphaGo Zero, the return at the end of a game is +1 for a win and -1 for a loss. The value network estimates the expected return from any board position, which is a linear function of the probability of winning.
Robotics. In simulated locomotion tasks (such as training a robot to walk), the return accumulates per-step rewards that combine forward velocity, energy efficiency, and penalties for falling. The discount factor balances making progress now against long-term stability.
Large language models. In reinforcement learning from human feedback (RLHF) for large language models, the return is typically the scalar reward assigned by a reward model to a generated response, sometimes combined with a KL-divergence penalty to stay close to the base model.

Explain like I'm 5 (ELI5)

Imagine you are playing a video game and collecting coins. Each coin is a "reward." The return is the total number of coins you collect from now until the game ends.

But here is the twist: coins you pick up right now are worth more than coins you might pick up later, because you are not sure you will get to those later coins. So each future coin is worth a little bit less. If the very next coin is worth 1 point, a coin two steps away might be worth 0.9 points, and one three steps away might be worth 0.81 points, and so on.

The return adds up all these coin values. The agent's whole job is to play the game in a way that makes this total as big as possible.

References

Sutton, R. S. and Barto, A. G. (2018). *Reinforcement Learning: An Introduction* (2nd ed.). MIT Press. http://incompleteideas.net/book/the-book-2nd.html
Bellman, R. (1957). *Dynamic Programming*. Princeton University Press.
Watkins, C. J. C. H. and Dayan, P. (1992). "Q-learning." *Machine Learning*, 8(3-4), 279-292.
Sutton, R. S. (1988). "Learning to predict by the methods of temporal differences." *Machine Learning*, 3(1), 9-44.
Schulman, J., Moritz, P., Levine, S., Jordan, M. I., and Abbeel, P. (2016). "High-dimensional continuous control using generalized advantage estimation." *Proceedings of the International Conference on Learning Representations (ICLR)*.
Schwartz, A. (1993). "A reinforcement learning method for maximizing undiscounted rewards." *Proceedings of the 10th International Conference on Machine Learning (ICML)*, 298-305.
Mahadevan, S. (1996). "Average reward reinforcement learning: Foundations, algorithms, and empirical results." *Machine Learning*, 22(1-3), 159-195.
Mnih, V., Kavukcuoglu, K., Silver, D., et al. (2015). "Human-level control through deep reinforcement learning." *Nature*, 518(7540), 529-533.
Silver, D., Schrittwieser, J., Simonyan, K., et al. (2017). "Mastering the game of Go without human knowledge." *Nature*, 550(7676), 354-359.
Andrychowicz, M., Wolski, F., Ray, A., et al. (2017). "Hindsight experience replay." *Advances in Neural Information Processing Systems (NeurIPS)*, 30.
Puterman, M. L. (1994). *Markov Decision Processes: Discrete Stochastic Dynamic Programming*. John Wiley & Sons.
Bertsekas, D. P. and Tsitsiklis, J. N. (1996). *Neuro-Dynamic Programming*. Athena Scientific.
Silver, D., Singh, S., Precup, D., and Sutton, R. S. (2021). "Reward is enough." *Artificial Intelligence*, 299, 103535.
Tamar, A., Glassner, Y., and Mannor, S. (2015). "Optimizing the CVaR via sampling." *Proceedings of the AAAI Conference on Artificial Intelligence*.
