The discount factor, almost always written as the Greek letter γ (gamma), is a scalar hyperparameter in reinforcement learning that controls how much an agent values future rewards relative to immediate ones. It takes values in the interval [0, 1] and appears as a geometric weight on each future reward when computing the return, the total quantity that an agent tries to maximize. When γ is close to 0 the agent is myopic and cares only about what happens next; when γ is close to 1 it is far-sighted and plans many steps ahead. The discount factor sits inside almost every equation in the field, including the Bellman equation, the recursive definition of the value function, and the update rules used by Q-learning, SARSA, DQN, and policy gradient methods.
Let R_{t+1}, R_{t+2}, R_{t+3}, ... denote the sequence of scalar rewards an agent receives from time step t onward while interacting with an environment. The discounted return from time t, written G_t, is defined as the weighted sum
G_t = R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + γ^3 R_{t+4} + ... = Σ_{k=0}^{∞} γ^k R_{t+k+1}
Here γ ∈ [0, 1] is the discount factor. A reward arriving k time steps after t is multiplied by γ^k, so the further into the future a reward lies, the less it contributes to G_t. When 0 ≤ γ < 1 and the rewards are bounded, this infinite sum converges. When γ = 1 the sum may not converge, so an undiscounted formulation only makes sense for episodic tasks or with additional structural assumptions.
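As a quick sanity check, the finite-horizon version of this sum can be computed directly. The reward sequence and γ values below are purely illustrative:

```python
def discounted_return(rewards, gamma):
    """G_t for a finite reward sequence R_{t+1}, R_{t+2}, ...:
    sum over k of gamma**k * R_{t+k+1}."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

rewards = [1.0, 1.0, 1.0, 1.0]
print(discounted_return(rewards, 0.5))  # 1 + 0.5 + 0.25 + 0.125 = 1.875
print(discounted_return(rewards, 0.0))  # only the first reward counts: 1.0
```

With γ = 0 every term after the first vanishes, matching the "myopic" limit described above.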
The definition above is the standard one used in Sutton and Barto's textbook Reinforcement Learning: An Introduction (second edition, 2018), and it generalizes the discounted utility model introduced by Paul Samuelson in 1937 in economics to the sequential decision setting formalized by Richard Bellman in the 1950s.
Imagine someone offers you a choice. You can have one cookie right now, or you can have two cookies tomorrow. Most children reach for the cookie they can eat today, because tomorrow feels far away and uncertain. The discount factor is a dial on how patient you are. If your dial is set to 0 you always grab the cookie in front of you. If your dial is set very close to 1 you are willing to wait for the bigger reward that comes later. Reinforcement learning agents carry this same dial, and the number written on it is gamma.
For an infinite-horizon problem the discounted return is the geometric series
G_t = Σ_{k=0}^{∞} γ^k R_{t+k+1}
Splitting off the first term and factoring γ out of the remaining sum, one can rewrite this as
G_t = R_{t+1} + γ G_{t+1}
This one-step recursion is the single most important consequence of discounting. It lets the problem of evaluating a whole trajectory be broken into a much smaller problem: evaluate the next reward, then evaluate the return from the next state. This recursion is what makes dynamic programming, temporal difference learning, and Q-learning possible.
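The recursion is also how returns are computed in practice: a single backward pass over a trajectory yields G_t for every step. A minimal sketch with made-up rewards:

```python
def returns_backward(rewards, gamma):
    """Compute G_t for every t in one backward pass,
    using the recursion G_t = R_{t+1} + gamma * G_{t+1}."""
    G = 0.0
    out = []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    return out[::-1]

rewards = [0.0, 0.0, 1.0]               # sparse reward at the end
print(returns_backward(rewards, 0.9))   # ≈ [0.81, 0.9, 1.0]
```

Note how the terminal reward is visible from the first step only because γ^2 has not yet decayed it to nothing; with γ = 0 the first two returns would be exactly zero.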
The state value function under a policy π is the expected discounted return obtained by starting in state s and following π thereafter:
v_π(s) = E_π [ G_t | S_t = s ] = E_π [ Σ_{k=0}^{∞} γ^k R_{t+k+1} | S_t = s ]
The action value function, also called the Q-function, conditions on both the current state and the current action:
q_π(s, a) = E_π [ G_t | S_t = s, A_t = a ]
Both definitions collapse to the immediate expected reward when γ = 0 and to the total sum of expected rewards along the trajectory when γ = 1.
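For intuition, v_π(s) can be estimated by plain Monte Carlo: sample episodes, compute each discounted return, and average. The sketch below uses a hypothetical one-step environment in which the reward is +1 or 0 with equal probability, so the true value is 0.5 regardless of γ:

```python
import random

def mc_value_estimate(sample_episode, gamma, n_episodes=10_000, seed=0):
    """Monte Carlo estimate of v_pi(s): average discounted return over episodes."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_episodes):
        rewards = sample_episode(rng)
        total += sum(gamma**k * r for k, r in enumerate(rewards))
    return total / n_episodes

# Hypothetical environment: a single coin-flip reward, then the episode ends.
est = mc_value_estimate(lambda rng: [float(rng.random() < 0.5)], gamma=0.99)
print(est)  # ≈ 0.5
```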
Substituting the recursive identity G_t = R_{t+1} + γ G_{t+1} into the definition of v_π(s) yields the Bellman expectation equation for the state value function
v_π(s) = Σ_a π(a|s) Σ_{s', r} p(s', r | s, a) [ r + γ v_π(s') ]
and the analogous Bellman expectation equation for the action value function
q_π(s, a) = Σ_{s', r} p(s', r | s, a) [ r + γ Σ_{a'} π(a' | s') q_π(s', a') ]
The Bellman optimality equations for the state and action value functions take the form

v_*(s) = max_a Σ_{s', r} p(s', r | s, a) [ r + γ v_*(s') ]

q_*(s, a) = Σ_{s', r} p(s', r | s, a) [ r + γ max_{a'} q_*(s', a') ]
When 0 ≤ γ < 1 the Bellman operator defined by the right-hand side of the optimality equation is a γ-contraction with respect to the maximum norm, with contraction factor exactly γ. This contraction property guarantees that value iteration and policy iteration converge to a unique fixed point, and it is the reason the discount factor is not merely cosmetic: it is the object that makes the entire dynamic programming machinery well-posed.
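The contraction can be watched in action on a toy problem. The sketch below runs value iteration on a hypothetical two-state deterministic MDP in which action 1 always moves to state 1 and pays 1, so the optimal value is 1/(1 − γ) = 10; each sweep shrinks the error by a factor of γ:

```python
# Hypothetical two-state MDP: next_state[a][s] and reward[a][s] for
# actions a in {0, 1} and states s in {0, 1}. Action 1 moves to state 1
# from anywhere and pays 1; action 0 moves to state 0 and pays nothing.
gamma = 0.9
next_state = [[0, 0], [1, 1]]
reward = [[0.0, 0.0], [1.0, 1.0]]

v = [0.0, 0.0]
for _ in range(200):
    # Bellman optimality backup: v(s) <- max_a [ r(s,a) + gamma * v(s'(s,a)) ]
    v = [max(reward[a][s] + gamma * v[next_state[a][s]] for a in range(2))
         for s in range(2)]

print(v)  # ≈ [10.0, 10.0], i.e. 1/(1 - gamma)
```

After 200 sweeps the remaining error is on the order of γ^200, which is numerically zero: the geometric convergence rate is exactly the γ-contraction at work.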
The TD(0) update for the state value function is
V(S_t) ← V(S_t) + α [ R_{t+1} + γ V(S_{t+1}) - V(S_t) ]
The bracketed quantity R_{t+1} + γ V(S_{t+1}) − V(S_t) is called the TD error. Tabular Q-learning, introduced by Chris Watkins in his 1989 Cambridge PhD thesis, uses the analogous update
Q(S_t, A_t) ← Q(S_t, A_t) + α [ R_{t+1} + γ max_a Q(S_{t+1}, a) − Q(S_t, A_t) ]
SARSA uses R_{t+1} + γ Q(S_{t+1}, A_{t+1}) instead of the max. In every one of these rules γ appears in exactly the same place: as the multiplier on the bootstrapped estimate of the next state's value.
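A single tabular Q-learning step, with γ in the place just described, might be sketched like this (the states, actions, and table values are invented for illustration):

```python
def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    """One tabular Q-learning step: move Q(s,a) toward the bootstrapped
    target r + gamma * max_a' Q(s', a')."""
    target = r + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (target - Q[s][a])

# Hypothetical two-state, two-action Q-table.
Q = {0: {"left": 0.0, "right": 0.0},
     1: {"left": 0.0, "right": 5.0}}
q_learning_update(Q, s=0, a="right", r=1.0, s_next=1, alpha=0.5, gamma=0.9)
print(Q[0]["right"])  # 0.5 * (1.0 + 0.9 * 5.0) = 2.75
```

Swapping `max(Q[s_next].values())` for `Q[s_next][a_next]`, with `a_next` the action actually taken, gives the SARSA update; γ sits in the same place in both.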
Discounting is so universal in reinforcement learning that it is worth asking explicitly why it is there. There are at least four distinct justifications, and practitioners rarely separate them.
For a continuing task (one that goes on forever with no terminal state), the undiscounted sum of rewards Σ R_t is in general infinite or undefined. Discounting with γ < 1 forces the geometric series to converge as long as the rewards are bounded. This is the most basic reason for introducing γ: it turns an otherwise ill-posed optimization into a well-posed one. Puterman's textbook Markov Decision Processes gives the rigorous treatment, and it is the reason discounted MDPs are the standard object of study in operations research and control.
Humans and institutions routinely prefer rewards sooner rather than later, a phenomenon economists call time preference. Samuelson's 1937 paper "A Note on Measurement of Utility" introduced the exponential discounted utility model that weights consumption at time t+k by δ^k. Reinforcement learning inherits this model essentially unchanged: γ in RL plays the same role as δ in the Samuelson utility function. Exponential discounting is the unique discounting scheme that gives time-consistent preferences, meaning that a plan made today that prefers reward A at time t+5 over reward B at time t+10 will still prefer A over B tomorrow. This property is what makes dynamic programming valid.
A second and quite different interpretation treats γ as a per-step survival probability. Imagine an environment where at every time step there is a fixed probability 1 − γ that the agent is terminated (the episode ends, the robot breaks, the customer churns, the market closes). Then the expected number of further rewards the agent will ever see is a discounted sum with factor γ, even if each reward is counted with weight 1 while the agent is alive. Under this interpretation discounting is not a statement about preferences but a statement about model uncertainty. This equivalence is mentioned in Sutton and Barto and is widely used to motivate γ in settings where modelers are reluctant to claim that future rewards are intrinsically less valuable.
For an episodic task that naturally terminates, such as a game of chess, an Atari episode, or a pick-and-place robot task, the undiscounted sum is well-defined because the number of rewards is finite. Even so, many practitioners still use γ < 1 because it produces smoother gradients, gives nearer rewards a slight credit assignment advantage, and reduces variance in Monte Carlo return estimates. In Sutton and Barto's unified treatment, episodic and continuing tasks are handled with a single formalism by introducing an absorbing state with zero reward; discounting then applies uniformly.
For a geometric series with ratio γ the sum 1 + γ + γ^2 + γ^3 + ... equals 1/(1 − γ). This quantity is called the effective planning horizon of the discount factor, and it is one of the most useful rules of thumb in the field. Rewards received much further in the future than 1/(1 − γ) steps are effectively invisible to the agent, because their γ^k weight has decayed to a small fraction of γ^0 = 1. Choosing γ therefore amounts to choosing roughly how many steps into the future the agent is willing to reason about.
| γ | Effective horizon 1/(1 − γ) | Typical use case |
|---|---|---|
| 0.0 | 1 step | Bandit problems, purely myopic greedy behavior |
| 0.5 | 2 steps | Very short-horizon tasks, toy gridworlds |
| 0.8 | 5 steps | Short games, simple navigation |
| 0.9 | 10 steps | Classic control (CartPole-scale tasks, small gridworlds) |
| 0.95 | 20 steps | Many MuJoCo locomotion tasks |
| 0.99 | 100 steps | Atari games, most deep RL benchmarks |
| 0.999 | 1,000 steps | Long-horizon strategy, long simulations, some robotics |
| 0.9999 | 10,000 steps | Very long-horizon domains, StarCraft-style tasks |
| 1.0 | ∞ (undefined) | Episodic undiscounted, stochastic shortest path |
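The horizon arithmetic in the table above is easy to reproduce; the particular γ here is just an example:

```python
def effective_horizon(gamma):
    """Effective planning horizon 1/(1 - gamma)."""
    return 1.0 / (1.0 - gamma)

gamma = 0.99
H = round(effective_horizon(gamma))  # 100 steps
print(H)
print(gamma ** H)        # weight on a reward H steps away: ~e^-1 ≈ 0.37
print(gamma ** (5 * H))  # five horizons away: ≈ 0.007, effectively invisible
```

A reward one horizon away still carries about a third of its face value; a few horizons out, its weight is negligible, which is the sense in which 1/(1 − γ) is the agent's sight distance.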
When the effective horizon is much shorter than the true task length, distant rewards become essentially invisible because their contribution to G_t has been geometrically shrunk. An agent facing a 300-step Atari episode with γ = 0.9 cannot see past the first ten steps in any meaningful sense, so sparse terminal rewards at the end of the level cannot propagate back to early actions. This is a frequent source of silent failure in practice.
The effective horizon has a precise interpretation. If all rewards are equal to a constant r, then the discounted return is r · Σ γ^k = r/(1 − γ), whereas the undiscounted sum over H steps is r · H. Solving r · H = r/(1 − γ) gives H = 1/(1 − γ), which is why the same quantity shows up as both the value of a unit reward stream and as the characteristic time scale of the geometric decay. It is also the effective sample size used in the variance analysis of temporal difference methods.
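This identity can be checked numerically; the reward value and γ below are arbitrary:

```python
gamma, r = 0.95, 2.0
H = 1.0 / (1.0 - gamma)           # effective horizon: 20 steps

closed_form = r / (1.0 - gamma)   # value of the infinite constant reward stream
undiscounted = r * H              # undiscounted sum of r over H steps
print(closed_form, undiscounted)  # both ≈ 40.0

# A long truncated discounted sum agrees with the closed form once k >> H.
approx = sum(gamma**k * r for k in range(1000))
print(approx)                     # ≈ 40.0
```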
The discount factor is essentially always one of the first hyperparameters listed in an RL paper. The table below lists the γ values reported in the original publications for several well-known algorithms and benchmarks. These are not recommendations for every task; they are the values the authors actually used.
| Algorithm / paper | γ | Domain |
|---|---|---|
| Watkins tabular Q-learning (1989) | variable | Tabular toy MDPs |
| TD-Gammon (Tesauro, 1992) | 1.0 | Backgammon, episodic |
| DQN Nature (Mnih et al., 2015) | 0.99 | Atari 2600 (Arcade Learning Env.) |
| Double DQN (van Hasselt et al., 2016) | 0.99 | Atari |
| Dueling DQN (Wang et al., 2016) | 0.99 | Atari |
| Rainbow (Hessel et al., 2018) | 0.99 | Atari |
| A3C (Mnih et al., 2016) | 0.99 | Atari, continuous control |
| TRPO (Schulman et al., 2015) | 0.995 | MuJoCo continuous control |
| PPO (Schulman et al., 2017) | 0.99 | MuJoCo, Atari |
| DDPG (Lillicrap et al., 2016) | 0.99 | Continuous control |
| SAC (Haarnoja et al., 2018) | 0.99 | MuJoCo |
| AlphaGo (Silver et al., 2016) | 1.0 | Go (episodic, undiscounted) |
| AlphaZero (Silver et al., 2017) | 1.0 | Go, chess, shogi (episodic) |
| OpenAI Five (Dota 2, 2018) | 0.998 (annealed toward 0.9997) | Dota 2 |
| R2D2 (Kapturowski et al., 2019) | 0.997 | Atari with long horizons |
| Agent57 (Badia et al., 2020) | 0.997 | Atari benchmark |
The striking observation is how uniformly γ = 0.99 has become the default for deep reinforcement learning. The choice is rarely justified beyond citing earlier work, but it tracks the effective horizon of 100 steps, which is a reasonable match for Atari frames and for many MuJoCo tasks. In long-horizon strategy games such as Dota 2, OpenAI Five used a much larger γ and in fact annealed γ upward during training, starting near 0.998 and ending closer to 0.9997, corresponding to an effective horizon of several thousand time steps.
In practice γ is treated as a hyperparameter and tuned like any other. A few heuristics are widely used.
Start from the task horizon. Estimate the length of a typical episode or the time scale over which rewards are delivered. Pick γ so that 1/(1 − γ) is at least as long as this scale. If episodes are 1,000 steps long, a γ that gives an effective horizon of 100 will truncate most of the planning problem.
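Inverting the effective-horizon formula gives a one-line starting point; `gamma_for_horizon` is a hypothetical helper name, not a library function:

```python
def gamma_for_horizon(horizon):
    """Pick gamma so the effective horizon 1/(1 - gamma) matches the task scale."""
    return 1.0 - 1.0 / horizon

print(gamma_for_horizon(100))   # 0.99
print(gamma_for_horizon(1000))  # 0.999
# For 1,000-step episodes, gamma = 0.99 (horizon 100) would truncate
# most of the planning problem.
```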
Use the reward scale as a tie-breaker. For dense reward signals, smaller γ often works because there is already a local signal guiding the agent. For sparse reward problems (a single +1 at the end of the level and zero everywhere else), γ has to be large enough that the terminal reward propagates back far enough to influence early decisions. This is why sparse-reward domains so often require γ ≥ 0.99.
Anneal γ during training. Several papers report benefits from starting training with a smaller γ, so that the agent first learns a short-horizon proxy, and annealing it upward toward a larger value as the policy improves. OpenAI Five is the most famous example. The trick is not a default in standard implementations, but it is well documented.
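One plausible way to sketch such a schedule is to anneal the effective horizon linearly rather than γ itself, since equal steps in γ correspond to wildly unequal steps in horizon. The schedule shape and endpoint values below are illustrative, not taken from any particular paper:

```python
def annealed_gamma(step, total_steps, gamma_start=0.99, gamma_end=0.999):
    """Anneal gamma by interpolating the effective horizon 1/(1 - gamma)
    linearly from its start value to its end value."""
    h_start = 1.0 / (1.0 - gamma_start)
    h_end = 1.0 / (1.0 - gamma_end)
    frac = min(step / total_steps, 1.0)
    h = h_start + frac * (h_end - h_start)
    return 1.0 - 1.0 / h

print(annealed_gamma(0, 1000))     # ≈ 0.99  (horizon 100)
print(annealed_gamma(1000, 1000))  # ≈ 0.999 (horizon 1000)
```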
Beware sensitivity. Empirical hyperparameter importance studies have consistently found that γ is among the most influential hyperparameters in deep RL, rivaling the learning rate. Small changes in γ can move an agent between "learns the task" and "never learns anything," particularly at the boundary where the effective horizon becomes shorter than the reward delay.
Do not use γ = 1 with bootstrapping on continuing tasks. With γ = 1 the Bellman operator is no longer a contraction in the maximum norm, the fixed point is no longer unique, and TD methods can diverge or oscillate. Using γ = 1 is safe only when the problem is naturally episodic with bounded episode length, or when the problem is formulated as a stochastic shortest path (see below) or an average-reward MDP.
When γ = 1 the discounted return becomes the plain sum of future rewards, which requires additional structure to be well-defined.
If every trajectory eventually terminates with probability 1 and the number of rewards per episode is bounded in expectation, the undiscounted return is finite and the theory carries over, although the Bellman operator is no longer a strict contraction. Monte Carlo methods naturally work in this setting because they only require the episode to end.
Bertsekas and Tsitsiklis's 1991 paper "An Analysis of Stochastic Shortest Path Problems" formalized an undiscounted framework in which the agent must reach a special absorbing goal state. Costs can be positive or negative, and the theory requires the existence of at least one proper policy, meaning a policy under which the goal is reached from every state with probability 1. Under the additional assumption that every improper policy has infinite expected cost, the optimal policy exists, is stationary and deterministic, and can be computed by value iteration even though γ = 1. This is the natural undiscounted model for shortest-path, navigation, and other goal-reaching tasks.
An alternative is to maximize the long-run average reward per time step, ρ_π = lim_{T→∞} (1/T) E_π [ Σ_{t=1}^{T} R_t ]. This is sometimes called the gain and corresponds to the n = −1 level of the n-discount optimality hierarchy introduced by Arthur Veinott in 1969. Average reward is more appropriate than discounted return for cyclic tasks where the notion of a "present" is ill-defined, such as queueing systems, server scheduling, and continuing control problems. The relationship to the discounted setting is that, under mild conditions, the average reward optimal policy is the limit of discounted optimal policies as γ → 1, with the difference of discounted values tending to the so-called bias or relative value function. Rich Sutton and others have long argued that average reward should play a larger role in RL research than it currently does. Puterman's textbook gives the definitive treatment.
The idea that future payoffs should be multiplied by a geometric factor predates reinforcement learning by many decades. Paul Samuelson's 1937 note introduced the discounted utility model that became standard in economics, and the discount factor δ in that model is mathematically identical to γ in RL. The interpretation is slightly different: in economics δ is usually derived from a subjective rate of time preference or an external interest rate, while in RL γ is chosen by the algorithm designer as a hyperparameter. Both interpretations share the same underlying justification, namely that exponential discounting is the unique form that yields time-consistent preferences.
Behavioral economics has documented that human discounting often departs from the exponential form, resembling hyperbolic or quasi-hyperbolic discounting instead. This observation has motivated a small literature on hyperbolic discounting in reinforcement learning, but standard RL algorithms stick with the exponential form because hyperbolic discounting destroys the Bellman recursion that makes dynamic programming tractable.
A somewhat subtler observation, explored in a line of work culminating in Amit, Meir, and Ciosek's 2020 ICML paper "Discount Factor as a Regularizer in Reinforcement Learning," is that lowering γ below the "true" value that the designer cares about can actually improve generalization and sample efficiency. A smaller γ shortens the effective horizon, which reduces the variance of return estimates, restricts the hypothesis space the agent searches over, and acts as a form of regularization analogous to early stopping. In practice this means the γ used to train an agent is sometimes deliberately smaller than the γ the designer would ideally want to evaluate under, especially in the low-data regime. This is one reason γ tuning is load-bearing: the optimal γ for learning is not necessarily the optimal γ for evaluation.
In tabular settings the discount factor is almost entirely a modeling choice. In deep RL, where the value function is represented by a neural network, γ also affects optimization in ways that are not fully understood. A larger γ makes the targets R_{t+1} + γ V(S_{t+1}) depend more strongly on the network's own predictions, which amplifies the moving-target problem and is the reason DQN introduced a separate target network that is updated only periodically. A larger γ also increases the variance of Monte Carlo return estimates used in policy gradient methods, which is why generalized advantage estimation (GAE) in PPO introduces a second parameter λ that trades bias against variance on top of γ. The combined choice of (γ, λ) is often more sensitive than either parameter alone.
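The way γ and λ interact is easiest to see in code. The sketch below follows the standard backward-pass formulation of generalized advantage estimation; the input rewards and values are invented:

```python
def gae_advantages(rewards, values, gamma, lam):
    """Generalized advantage estimation in one backward pass.

    `values` holds V(S_t) for t = 0..T, with values[T] the bootstrap value
    of the final state. delta_t = r_t + gamma * V(S_{t+1}) - V(S_t) is the
    TD error; the advantage is the (gamma * lam)-discounted sum of future
    TD errors.
    """
    T = len(rewards)
    adv = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

# lam = 0 recovers the one-step TD error; lam = 1 recovers the
# Monte Carlo advantage G_t - V(S_t).
adv = gae_advantages([1.0, 1.0], [0.0, 0.0, 0.0], gamma=0.9, lam=0.0)
print(adv)  # [1.0, 1.0] (pure TD errors, since all values are zero)
```

Note that γ appears twice: inside the TD error and as half of the γλ decay on the running sum, which is why (γ, λ) have to be tuned jointly.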
γ is too small. The agent never sees distant rewards. Classic symptom: the agent learns a locally good behavior (avoid falling off the cliff) but never learns to exploit a sparse terminal reward (reach the goal). The fix is usually to raise γ to at least match the reward delay.
γ is too large in tasks with noisy rewards. Large γ amplifies the variance of Monte Carlo return estimates because a single stochastic reward at the end of a long trajectory appears in the return of every prior time step with weight γ^k ≈ 1. This can drown the gradient signal in variance. Remedies include lowering γ, introducing a baseline or critic, or using GAE.
Using γ = 1 with bootstrapping on a continuing task. The Bellman operator is no longer a contraction in max norm, value estimates can drift, and Q-learning can diverge. Switch to an average-reward or stochastic shortest path formulation instead.
Mixing incompatible γ values across components. In actor-critic methods the same γ should appear in both the value target and the policy gradient. Using inconsistent γ values in the actor and the critic is a subtle source of bias.
Comparing algorithms trained with different γ. Since γ is part of the optimization objective, two agents trained with different γ are literally maximizing different quantities. Their total undiscounted returns on the environment are not a fair comparison unless γ is held fixed.
Ignoring γ when the reward scale changes. The magnitude of the optimal value function is roughly r_max/(1 − γ), which grows without bound as γ → 1. Network initialization, loss clipping, and target normalization all have to be adapted when γ is pushed close to 1.
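The scale argument is one line of arithmetic, but it is worth internalizing; `value_scale` is a hypothetical helper:

```python
def value_scale(r_max, gamma):
    """Rough bound on the optimal value magnitude: r_max / (1 - gamma)."""
    return r_max / (1.0 - gamma)

for g in (0.9, 0.99, 0.999):
    print(g, value_scale(1.0, g))
# Each extra nine in gamma multiplies the bound by ten (10, 100, 1000),
# so clipping thresholds and normalization must be re-tuned with gamma.
```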
The discount factor plays the same role in partially observable Markov decision processes (POMDPs) as it does in fully observable MDPs. The belief-state formulation turns a POMDP into a continuous-state MDP whose Bellman operator still contracts with rate γ. In practice deep RL on POMDPs (for example recurrent DQN variants such as DRQN and R2D2) uses the same default γ = 0.99 as the fully observable versions, although R2D2 and Agent57 push γ to 0.997 to extend the effective horizon for long-memory tasks.