The discount factor, almost always written as the Greek letter γ (gamma), is a scalar hyperparameter in reinforcement learning that controls how much an agent values future rewards relative to immediate ones. It takes values in the interval [0, 1] and appears as a geometric weight on each future reward when computing the return, the total quantity that an agent tries to maximize. When γ is close to 0 the agent is myopic and cares only about what happens next; when γ is close to 1 it is far-sighted and plans many steps ahead. The discount factor sits inside almost every equation in the field, including the Bellman equation, the recursive definition of the value function, and the update rules used by Q-learning, SARSA, DQN, and policy gradient methods.
Let R_{t+1}, R_{t+2}, R_{t+3}, ... denote the sequence of scalar rewards an agent receives from time step t onward while interacting with an environment. The discounted return from time t, written G_t, is defined as the weighted sum
G_t = R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + γ^3 R_{t+4} + ... = Σ_{k=0}^{∞} γ^k R_{t+k+1}
Here γ ∈ [0, 1] is the discount factor. A reward arriving k time steps after t is multiplied by γ^k, so the further into the future a reward lies, the less it contributes to G_t. When 0 ≤ γ < 1 and the rewards are bounded, this infinite sum converges. When γ = 1 the sum may not converge, so an undiscounted formulation only makes sense for episodic tasks or with additional structural assumptions.
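As a quick sanity check, the finite-horizon version of this sum can be computed directly. The reward sequence and γ values below are purely illustrative:

```python
def discounted_return(rewards, gamma):
    """G_t for a finite reward sequence R_{t+1}, R_{t+2}, ...:
    sum over k of gamma**k * R_{t+k+1}."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

rewards = [1.0, 1.0, 1.0, 1.0]
print(discounted_return(rewards, 0.5))  # 1 + 0.5 + 0.25 + 0.125 = 1.875
print(discounted_return(rewards, 0.0))  # only the first reward counts: 1.0
```

With γ = 0 every term after the first vanishes, matching the "myopic" limit described above.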
The definition above is the standard one used in Sutton and Barto's textbook Reinforcement Learning: An Introduction (second edition, 2018), and it generalizes the discounted utility model introduced by Paul Samuelson in 1937 in economics to the sequential decision setting formalized by Richard Bellman in the 1950s.
Imagine someone offers you a choice. You can have one cookie right now, or you can have two cookies tomorrow. Most children reach for the cookie they can eat today, because tomorrow feels far away and uncertain. The discount factor is a dial on how patient you are. If your dial is set to 0 you always grab the cookie in front of you. If your dial is set very close to 1 you are willing to wait for the bigger reward that comes later. Reinforcement learning agents carry this same dial, and the number written on it is gamma.
For an infinite-horizon problem the discounted return is the geometric series
G_t = Σ_{k=0}^{∞} γ^k R_{t+k+1}
Splitting off the first term and factoring γ out of the remaining sum, one can rewrite this as
G_t = R_{t+1} + γ G_{t+1}
This one-step recursion is the single most important consequence of discounting. It lets the problem of evaluating a whole trajectory be broken into a much smaller problem: evaluate the next reward, then evaluate the return from the next state. This recursion is what makes dynamic programming, temporal difference learning, and Q-learning possible.
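The recursion is also how returns are computed in practice: a single backward pass over a trajectory yields G_t for every step. A minimal sketch with made-up rewards:

```python
def returns_backward(rewards, gamma):
    """Compute G_t for every t in one backward pass,
    using the recursion G_t = R_{t+1} + gamma * G_{t+1}."""
    G = 0.0
    out = []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    return out[::-1]

rewards = [0.0, 0.0, 1.0]               # sparse reward at the end
print(returns_backward(rewards, 0.9))   # ≈ [0.81, 0.9, 1.0]
```

Note how the terminal reward is visible from the first step only because γ^2 has not yet decayed it to nothing; with γ = 0 the first two returns would be exactly zero.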
The state value function under a policy π is the expected discounted return obtained by starting in state s and following π thereafter:
v_π(s) = E_π [ G_t | S_t = s ] = E_π [ Σ_{k=0}^{∞} γ^k R_{t+k+1} | S_t = s ]
The action value function, also called the Q-function, conditions on both the current state and the current action:
q_π(s, a) = E_π [ G_t | S_t = s, A_t = a ]
Both definitions collapse to the immediate expected reward when γ = 0 and to the total sum of expected rewards along the trajectory when γ = 1.
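For intuition, v_π(s) can be estimated by plain Monte Carlo: sample episodes, compute each discounted return, and average. The sketch below uses a hypothetical one-step environment in which the reward is +1 or 0 with equal probability, so the true value is 0.5 regardless of γ:

```python
import random

def mc_value_estimate(sample_episode, gamma, n_episodes=10_000, seed=0):
    """Monte Carlo estimate of v_pi(s): average discounted return over episodes."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_episodes):
        rewards = sample_episode(rng)
        total += sum(gamma**k * r for k, r in enumerate(rewards))
    return total / n_episodes

# Hypothetical environment: a single coin-flip reward, then the episode ends.
est = mc_value_estimate(lambda rng: [float(rng.random() < 0.5)], gamma=0.99)
print(est)  # ≈ 0.5
```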
Substituting the recursive identity G_t = R_{t+1} + γ G_{t+1} into the definition of v_π(s) yields the Bellman expectation equation for the state value function
v_π(s) = Σ_a π(a|s) Σ_{s', r} p(s', r | s, a) [ r + γ v_π(s') ]
and the analogous Bellman expectation equation for the action value function
q_π(s, a) = Σ_{s', r} p(s', r | s, a) [ r + γ Σ_{a'} π(a' | s') q_π(s', a') ]
The Bellman optimality equations for the state and action value functions take the form

v_*(s) = max_a Σ_{s', r} p(s', r | s, a) [ r + γ v_*(s') ]

q_*(s, a) = Σ_{s', r} p(s', r | s, a) [ r + γ max_{a'} q_*(s', a') ]
When 0 ≤ γ < 1 the Bellman operator defined by the right-hand side of the optimality equation is a γ-contraction with respect to the maximum norm, with contraction factor exactly γ. This contraction property guarantees that value iteration and policy iteration converge to a unique fixed point, and it is the reason the discount factor is not merely cosmetic: it is the object that makes the entire dynamic programming machinery well-posed.
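The contraction can be watched in action on a toy problem. The sketch below runs value iteration on a hypothetical two-state deterministic MDP in which action 1 always moves to state 1 and pays 1, so the optimal value is 1/(1 − γ) = 10; each sweep shrinks the error by a factor of γ:

```python
# Hypothetical two-state MDP: next_state[a][s] and reward[a][s] for
# actions a in {0, 1} and states s in {0, 1}. Action 1 moves to state 1
# from anywhere and pays 1; action 0 moves to state 0 and pays nothing.
gamma = 0.9
next_state = [[0, 0], [1, 1]]
reward = [[0.0, 0.0], [1.0, 1.0]]

v = [0.0, 0.0]
for _ in range(200):
    # Bellman optimality backup: v(s) <- max_a [ r(s,a) + gamma * v(s'(s,a)) ]
    v = [max(reward[a][s] + gamma * v[next_state[a][s]] for a in range(2))
         for s in range(2)]

print(v)  # ≈ [10.0, 10.0], i.e. 1/(1 - gamma)
```

After 200 sweeps the remaining error is on the order of γ^200, which is numerically zero: the geometric convergence rate is exactly the γ-contraction at work.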
The TD(0) update for the state value function is
V(S_t) ← V(S_t) + α [ R_{t+1} + γ V(S_{t+1}) - V(S_t) ]
The bracketed quantity R_{t+1} + γ V(S_{t+1}) − V(S_t) is called the TD error. Tabular Q-learning, introduced by Chris Watkins in his 1989 Cambridge PhD thesis, uses the analogous update
Q(S_t, A_t) ← Q(S_t, A_t) + α [ R_{t+1} + γ max_a Q(S_{t+1}, a) − Q(S_t, A_t) ]
SARSA uses R_{t+1} + γ Q(S_{t+1}, A_{t+1}) instead of the max. In every one of these rules γ appears in exactly the same place: as the multiplier on the bootstrapped estimate of the next state's value.
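A single tabular Q-learning step, with γ in the place just described, might be sketched like this (the states, actions, and table values are invented for illustration):

```python
def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    """One tabular Q-learning step: move Q(s,a) toward the bootstrapped
    target r + gamma * max_a' Q(s', a')."""
    target = r + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (target - Q[s][a])

# Hypothetical two-state, two-action Q-table.
Q = {0: {"left": 0.0, "right": 0.0},
     1: {"left": 0.0, "right": 5.0}}
q_learning_update(Q, s=0, a="right", r=1.0, s_next=1, alpha=0.5, gamma=0.9)
print(Q[0]["right"])  # 0.5 * (1.0 + 0.9 * 5.0) = 2.75
```

Swapping `max(Q[s_next].values())` for `Q[s_next][a_next]`, with `a_next` the action actually taken, gives the SARSA update; γ sits in the same place in both.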
Discounting is so universal in reinforcement learning that it is worth asking explicitly why it is there. There are at least four distinct justifications, and practitioners rarely separate them.
For a continuing task (one that goes on forever with no terminal state), the undiscounted sum of rewards Σ R_t is in general infinite or undefined. Discounting with γ < 1 forces the geometric series to converge as long as the rewards are bounded. This is the most basic reason for introducing γ: it turns an otherwise ill-posed optimization into a well-posed one. Puterman's textbook Markov Decision Processes gives the rigorous treatment, and it is the reason discounted MDPs are the standard object of study in operations research and control.
Humans and institutions routinely prefer rewards sooner rather than later, a phenomenon economists call time preference. Samuelson's 1937 paper "A Note on Measurement of Utility" introduced the exponential discounted utility model that weights consumption at time t+k by δ^k. Reinforcement learning inherits this model essentially unchanged: γ in RL plays the same role as δ in the Samuelson utility function. Exponential discounting is the unique discounting scheme that gives time-consistent preferences, meaning that a plan made today that prefers reward A at time t+5 over reward B at time t+10 will still prefer A over B tomorrow. This property is what makes dynamic programming valid.
A second and quite different interpretation treats γ as a per-step survival probability. Imagine an environment where at every time step there is a fixed probability 1 − γ that the agent is terminated (the episode ends, the robot breaks, the customer churns, the market closes). Then the expected number of further rewards the agent will ever see is a discounted sum with factor γ, even if each reward is counted with weight 1 while the agent is alive. Under this interpretation discounting is not a statement about preferences but a statement about model uncertainty. This equivalence is mentioned in Sutton and Barto and is widely used to motivate γ in settings where modelers are reluctant to claim that future rewards are intrinsically less valuable.
For an episodic task that naturally terminates, such as a game of chess, an Atari episode, or a pick-and-place robot task, the undiscounted sum is well-defined because the number of rewards is finite. Even so, many practitioners still use γ < 1 because it produces smoother gradients, gives nearer rewards a slight credit assignment advantage, and reduces variance in Monte Carlo return estimates. In Sutton and Barto's unified treatment, episodic and continuing tasks are handled with a single formalism by introducing an absorbing state with zero reward; discounting then applies uniformly.
For a geometric series with ratio γ the sum 1 + γ + γ^2 + γ^3 + ... equals 1/(1 − γ). This quantity is called the effective planning horizon of the discount factor, and it is one of the most useful rules of thumb in the field. Rewards received much further in the future than 1/(1 − γ) steps are effectively invisible to the agent, because their γ^k weight has decayed to a small fraction of γ^0 = 1. Choosing γ therefore amounts to choosing roughly how many steps into the future the agent is willing to reason about.
| γ | Effective horizon 1/(1 − γ) | Typical use case |
|---|---|---|
| 0.0 | 1 step | Bandit problems, purely myopic greedy behavior |
| 0.5 | 2 steps | Very short-horizon tasks, toy gridworlds |
| 0.8 | 5 steps | Short games, simple navigation |
| 0.9 | 10 steps | Classic control (CartPole-scale tasks, small gridworlds) |
| 0.95 | 20 steps | Many MuJoCo locomotion tasks |
| 0.99 | 100 steps | Atari games, most deep RL benchmarks |
| 0.999 | 1,000 steps | Long-horizon strategy, long simulations, some robotics |
| 0.9999 | 10,000 steps | Very long-horizon domains, StarCraft-style tasks |
| 1.0 | ∞ (undefined) | Episodic undiscounted, stochastic shortest path |
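The horizon arithmetic in the table above is easy to reproduce; the particular γ here is just an example:

```python
def effective_horizon(gamma):
    """Effective planning horizon 1/(1 - gamma)."""
    return 1.0 / (1.0 - gamma)

gamma = 0.99
H = round(effective_horizon(gamma))  # 100 steps
print(H)
print(gamma ** H)        # weight on a reward H steps away: ~e^-1 ≈ 0.37
print(gamma ** (5 * H))  # five horizons away: ≈ 0.007, effectively invisible
```

A reward one horizon away still carries about a third of its face value; a few horizons out, its weight is negligible, which is the sense in which 1/(1 − γ) is the agent's sight distance.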
When the effective horizon is much shorter than the true task length, distant rewards become essentially invisible because their contribution to G_t has been geometrically shrunk. An agent facing a 300-step Atari episode with γ = 0.9 cannot see past the first ten steps in any meaningful sense, so sparse terminal rewards at the end of the level cannot propagate back to early actions. This is a frequent source of silent failure in practice.
The effective horizon has a precise interpretation. If all rewards are equal to a constant r, then the discounted return is r · Σ γ^k = r/(1 − γ), whereas the undiscounted sum over H steps is r · H. Solving r · H = r/(1 − γ) gives H = 1/(1 − γ), which is why the same quantity shows up as both the value of a unit reward stream and as the characteristic time scale of the geometric decay. It is also the effective sample size used in the variance analysis of temporal difference methods.
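This identity can be checked numerically; the reward value and γ below are arbitrary:

```python
gamma, r = 0.95, 2.0
H = 1.0 / (1.0 - gamma)           # effective horizon: 20 steps

closed_form = r / (1.0 - gamma)   # value of the infinite constant reward stream
undiscounted = r * H              # undiscounted sum of r over H steps
print(closed_form, undiscounted)  # both ≈ 40.0

# A long truncated discounted sum agrees with the closed form once k >> H.
approx = sum(gamma**k * r for k in range(1000))
print(approx)                     # ≈ 40.0
```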
The discount factor is essentially always one of the first hyperparameters listed in an RL paper. The table below lists the γ values reported in the original publications for several well-known algorithms and benchmarks. These are not recommendations for every task; they are the values the authors actually used.
| Algorithm / paper | γ | Domain |
|---|---|---|
| Watkins tabular Q-learning (1989) | variable | Tabular toy MDPs |
| TD-Gammon (Tesauro, 1992) | 1.0 | Backgammon, episodic |
| DQN Nature (Mnih et al., 2015) | 0.99 | Atari 2600 (Arcade Learning Env.) |
| Double DQN (van Hasselt et al., 2016) | 0.99 | Atari |
| Dueling DQN (Wang et al., 2016) | 0.99 | Atari |
| Rainbow (Hessel et al., 2018) | 0.99 | Atari |
| A3C (Mnih et al., 2016) | 0.99 | Atari, continuous control |
| TRPO (Schulman et al., 2015) | 0.995 | MuJoCo continuous control |
| PPO (Schulman et al., 2017) | 0.99 | MuJoCo, Atari |
| DDPG (Lillicrap et al., 2016) | 0.99 | Continuous control |
| SAC (Haarnoja et al., 2018) | 0.99 | MuJoCo |
| AlphaGo (Silver et al., 2016) | 1.0 | Go (episodic, undiscounted) |
| AlphaZero (Silver et al., 2017) | 1.0 | Go, chess, shogi (episodic) |
| OpenAI Five (Dota 2, 2018) | 0.998 (annealed toward 0.9997) | Dota 2 |
| R2D2 (Kapturowski et al., 2019) | 0.997 | Atari with long horizons |
| Agent57 (Badia et al., 2020) | 0.997 | Atari benchmark |
The striking observation is how uniformly γ = 0.99 has become the default for deep reinforcement learning. The choice is rarely justified beyond citing earlier work, but it tracks the effective horizon of 100 steps, which is a reasonable match for Atari frames and for many MuJoCo tasks. In long-horizon strategy games such as Dota 2, OpenAI Five used a much larger γ and in fact annealed γ upward during training, starting near 0.998 and ending closer to 0.9997, corresponding to an effective horizon of several thousand time steps.
In practice γ is treated as a hyperparameter and tuned like any other. A few heuristics are widely used.
Start from the task horizon. Estimate the length of a typical episode or the time scale over which rewards are delivered. Pick γ so that 1/(1 − γ) is at least as long as this scale. If episodes are 1,000 steps long, a γ that gives an effective horizon of 100 will truncate most of the planning problem.
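Inverting the effective-horizon formula gives a one-line starting point; `gamma_for_horizon` is a hypothetical helper name, not a library function:

```python
def gamma_for_horizon(horizon):
    """Pick gamma so the effective horizon 1/(1 - gamma) matches the task scale."""
    return 1.0 - 1.0 / horizon

print(gamma_for_horizon(100))   # 0.99
print(gamma_for_horizon(1000))  # 0.999
# For 1,000-step episodes, gamma = 0.99 (horizon 100) would truncate
# most of the planning problem.
```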
Use the reward scale as a tie-breaker. For dense reward signals, smaller γ often works because there is already a local signal guiding the agent. For sparse reward problems (a single +1 at the end of the level and zero everywhere else), γ has to be large enough that the terminal reward propagates back far enough to influence early decisions. This is why sparse-reward domains so often require γ ≥ 0.99.
Anneal γ during training. Several papers report benefits from starting training with a smaller γ, so that the agent first learns a short-horizon proxy, and annealing it upward toward a larger value as the policy improves. OpenAI Five is the most famous example. The trick is not a default in standard implementations, but it is well documented.
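One plausible way to sketch such a schedule is to anneal the effective horizon linearly rather than γ itself, since equal steps in γ correspond to wildly unequal steps in horizon. The schedule shape and endpoint values below are illustrative, not taken from any particular paper:

```python
def annealed_gamma(step, total_steps, gamma_start=0.99, gamma_end=0.999):
    """Anneal gamma by interpolating the effective horizon 1/(1 - gamma)
    linearly from its start value to its end value."""
    h_start = 1.0 / (1.0 - gamma_start)
    h_end = 1.0 / (1.0 - gamma_end)
    frac = min(step / total_steps, 1.0)
    h = h_start + frac * (h_end - h_start)
    return 1.0 - 1.0 / h

print(annealed_gamma(0, 1000))     # ≈ 0.99  (horizon 100)
print(annealed_gamma(1000, 1000))  # ≈ 0.999 (horizon 1000)
```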
Beware sensitivity. Empirical hyperparameter importance studies have consistently found that γ is among the most influential hyperparameters in deep RL, rivaling the learning rate. Small changes in γ can move an agent between "learns the task" and "never learns anything," particularly at the boundary where the effective horizon becomes shorter than the reward delay.
Do not use γ = 1 with bootstrapping on continuing tasks. With γ = 1 the Bellman operator is no longer a contraction in the maximum norm, the fixed point is no longer unique, and TD methods can diverge or oscillate. Using γ = 1 is safe only when the problem is naturally episodic with bounded episode length, or when the problem is formulated as a stochastic shortest path (see below) or an average-reward MDP.
When γ = 1 the discounted return becomes the plain sum of future rewards, which requires additional structure to be well-defined.
If every trajectory eventually terminates with probability 1 and the number of rewards per episode is bounded in expectation, the undiscounted return is finite and the theory carries over, although the Bellman operator is no longer a strict contraction. Monte Carlo methods naturally work in this setting because they only require the episode to end.
Bertsekas and Tsitsiklis's 1991 paper "An Analysis of Stochastic Shortest Path Problems" formalized an undiscounted framework in which the agent must reach a special absorbing goal state. Costs can be positive or negative, and the theory requires the existence of at least one proper policy, meaning a policy under which the goal is reached from every state with probability 1. Under the additional assumption that every improper policy has infinite expected cost, the optimal policy exists, is stationary and deterministic, and can be computed by value iteration even though γ = 1. This is the natural undiscounted model for shortest-path, navigation, and other goal-reaching tasks.
An alternative is to maximize the long-run average reward per time step, ρ_π = lim_{T→∞} (1/T) E_π [ Σ_{t=1}^{T} R_t ]. This is sometimes called the gain and corresponds to the n = −1 level of the n-discount optimality hierarchy introduced by Arthur Veinott in 1969. Average reward is more appropriate than discounted return for cyclic tasks where the notion of a "present" is ill-defined, such as queueing systems, server scheduling, and continuing control problems. The relationship to the discounted setting is that, under mild conditions, the average reward optimal policy is the limit of discounted optimal policies as γ → 1, with the difference of discounted values tending to the so-called bias or relative value function. Rich Sutton and others have long argued that average reward should play a larger role in RL research than it currently does. Puterman's textbook gives the definitive treatment.
The idea that future payoffs should be multiplied by a geometric factor predates reinforcement learning by many decades. Paul Samuelson's 1937 note introduced the discounted utility model that became standard in economics, and the discount factor δ in that model is mathematically identical to γ in RL. The interpretation is slightly different: in economics δ is usually derived from a subjective rate of time preference or an external interest rate, while in RL γ is chosen by the algorithm designer as a hyperparameter. Both interpretations share the same underlying justification, namely that exponential discounting is the unique form that yields time-consistent preferences.
Behavioral economics has documented that human discounting often departs from the exponential form, resembling hyperbolic or quasi-hyperbolic discounting instead. This observation has motivated a small literature on hyperbolic discounting in reinforcement learning, but standard RL algorithms stick with the exponential form because hyperbolic discounting destroys the Bellman recursion that makes dynamic programming tractable.
A somewhat subtler observation, explored in a line of work culminating in Amit, Meir, and Ciosek's 2020 ICML paper "Discount Factor as a Regularizer in Reinforcement Learning," is that lowering γ below the "true" value that the designer cares about can actually improve generalization and sample efficiency. A smaller γ shortens the effective horizon, which reduces the variance of return estimates, restricts the hypothesis space the agent searches over, and acts as a form of regularization analogous to early stopping. In practice this means the γ used to train an agent is sometimes deliberately smaller than the γ the designer would ideally want to evaluate under, especially in the low-data regime. This is one reason γ tuning is load-bearing: the optimal γ for learning is not necessarily the optimal γ for evaluation.
In tabular settings the discount factor is almost entirely a modeling choice. In deep RL, where the value function is represented by a neural network, γ also affects optimization in ways that are not fully understood. A larger γ makes the targets R_{t+1} + γ V(S_{t+1}) depend more strongly on the network's own predictions, which amplifies the moving-target problem and is the reason DQN introduced a separate target network that is updated only periodically. A larger γ also increases the variance of Monte Carlo return estimates used in policy gradient methods, which is why generalized advantage estimation (GAE) in PPO introduces a second parameter λ that trades bias against variance on top of γ. The combined choice of (γ, λ) is often more sensitive than either parameter alone.
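The way γ and λ interact is easiest to see in code. The sketch below follows the standard backward-pass formulation of generalized advantage estimation; the input rewards and values are invented:

```python
def gae_advantages(rewards, values, gamma, lam):
    """Generalized advantage estimation in one backward pass.

    `values` holds V(S_t) for t = 0..T, with values[T] the bootstrap value
    of the final state. delta_t = r_t + gamma * V(S_{t+1}) - V(S_t) is the
    TD error; the advantage is the (gamma * lam)-discounted sum of future
    TD errors.
    """
    T = len(rewards)
    adv = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

# lam = 0 recovers the one-step TD error; lam = 1 recovers the
# Monte Carlo advantage G_t - V(S_t).
adv = gae_advantages([1.0, 1.0], [0.0, 0.0, 0.0], gamma=0.9, lam=0.0)
print(adv)  # [1.0, 1.0] (pure TD errors, since all values are zero)
```

Note that γ appears twice: inside the TD error and as half of the γλ decay on the running sum, which is why (γ, λ) have to be tuned jointly.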
γ is too small. The agent never sees distant rewards. Classic symptom: the agent learns a locally good behavior (avoid falling off the cliff) but never learns to exploit a sparse terminal reward (reach the goal). The fix is usually to raise γ to at least match the reward delay.
γ is too large in tasks with noisy rewards. Large γ amplifies the variance of Monte Carlo return estimates because a single stochastic reward at the end of a long trajectory appears in the return of every prior time step with weight γ^k ≈ 1. This can drown the gradient signal in variance. Remedies include lowering γ, introducing a baseline or critic, or using GAE.
Using γ = 1 with bootstrapping on a continuing task. The Bellman operator is no longer a contraction in max norm, value estimates can drift, and Q-learning can diverge. Switch to an average-reward or stochastic shortest path formulation instead.
Mixing incompatible γ values across components. In actor-critic methods the same γ should appear in both the value target and the policy gradient. Using inconsistent γ values in the actor and the critic is a subtle source of bias.
Comparing algorithms trained with different γ. Since γ is part of the optimization objective, two agents trained with different γ are literally maximizing different quantities. Their total undiscounted returns on the environment are not a fair comparison unless γ is held fixed.
Ignoring γ when the reward scale changes. The magnitude of the optimal value function is roughly r_max/(1 − γ), which grows without bound as γ → 1. Network initialization, loss clipping, and target normalization all have to be adapted when γ is pushed close to 1.
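The scale argument is one line of arithmetic, but it is worth internalizing; `value_scale` is a hypothetical helper:

```python
def value_scale(r_max, gamma):
    """Rough bound on the optimal value magnitude: r_max / (1 - gamma)."""
    return r_max / (1.0 - gamma)

for g in (0.9, 0.99, 0.999):
    print(g, value_scale(1.0, g))
# Each extra nine in gamma multiplies the bound by ten (10, 100, 1000),
# so clipping thresholds and normalization must be re-tuned with gamma.
```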
The discount factor plays the same role in partially observable Markov decision processes (POMDPs) as it does in fully observable MDPs. The belief-state formulation turns a POMDP into a continuous-state MDP whose Bellman operator still contracts with rate γ. In practice deep RL on POMDPs (for example recurrent DQN variants such as DRQN and R2D2) uses the same default γ = 0.99 as the fully observable versions, although R2D2 and Agent57 push γ to 0.997 to extend the effective horizon for long-memory tasks.