Temporal-difference learning
Last reviewed
May 1, 2026
Sources
25 citations
Review status
Source-backed
Revision
v1 · 3,926 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 1, 2026
Sources
25 citations
Review status
Source-backed
Revision
v1 · 3,926 words
Add missing citations, update stale details, or suggest a clearer explanation.
Temporal-difference (TD) learning is a class of model-free reinforcement learning methods that learn value-function estimates by combining ideas from Monte Carlo simulation and dynamic programming. A TD agent samples transitions from experience, then updates its current estimate of how good a state is by comparing it to a target built from the immediate reward plus its own estimate of the next state's value. This blend of sampling (from real interaction) and bootstrapping (using existing estimates as the basis for new ones) gives TD methods their characteristic balance of low bias, moderate variance, and online operation without needing a model of the environment.
TD learning was introduced by Richard S. Sutton in his 1984 PhD thesis at the University of Massachusetts Amherst and described in detail in the 1988 Machine Learning paper "Learning to Predict by the Methods of Temporal Differences" (Sutton, 1988). It is the algorithmic backbone of most modern reinforcement learning, including SARSA, Q-learning, Deep Q-Networks (DQN), actor-critic methods, and the AlphaGo / AlphaZero / MuZero family. Andrew Barto and Richard Sutton received the 2024 ACM A.M. Turing Award for developing the conceptual and algorithmic foundations of reinforcement learning, with TD learning singled out by the citation as one of the most important advances.
The basic problem is prediction: estimate the expected discounted return from each state of a Markov reward process. The return from time step t is
G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ...,
where γ in [0,1] is a discount factor. The value function V^π(s) under a policy π is the expected return from state s:
V^π(s) = E_π [G_t | S_t = s].
Classical methods compute V^π in two extreme ways. Dynamic programming (DP) plugs V^π into the Bellman equation and iterates, requiring a full model of transition probabilities and rewards. Monte Carlo (MC) instead samples whole trajectories and averages observed returns, requiring no model but having to wait for an episode to finish before any update. TD learning sits between the two. It samples a single transition (S_t, R_{t+1}, S_{t+1}) and uses its own current estimate V(S_{t+1}) as a stand-in for the rest of the unseen return. This is bootstrapping: building one estimate from another estimate.
Why does this matter? Two reasons. First, it lets the agent learn online, from a stream of experience, without waiting for episodes to terminate. Second, it dramatically reduces variance compared with Monte Carlo, because instead of summing many random rewards it leans on a learned summary of the future. The cost is bias: if V is wrong, the bootstrapped target is wrong too. The art of TD learning is managing this trade-off.
The simplest TD method, TD(0), updates the value estimate of S_t after each transition:
V(S_t) <- V(S_t) + α [R_{t+1} + γ V(S_{t+1}) - V(S_t)].
The bracketed quantity is the TD error:
δ_t = R_{t+1} + γ V(S_{t+1}) - V(S_t).
The TD error is the difference between two predictions of the same return: the new prediction R_{t+1} + γ V(S_{t+1}), which incorporates one extra step of real reward, and the old prediction V(S_t). The step size α in (0,1] controls how aggressively the agent corrects toward the new target. Sutton (1988) proved convergence in the mean for TD methods on linear prediction problems, and Tsitsiklis and Van Roy (1997) later proved convergence with probability one for on-policy linear TD(λ) under standard step-size and ergodicity conditions.
TD(0) uses a one-step target. Monte Carlo uses an infinite-step (full return) target. n-step TD generalises across this spectrum:
G_{t:t+n} = R_{t+1} + γ R_{t+2} + ... + γ^{n-1} R_{t+n} + γ^n V(S_{t+n}),
V(S_t) <- V(S_t) + α [G_{t:t+n} - V(S_t)].
Larger n gives a target that depends more on real rewards and less on bootstrapped estimates, raising variance but lowering bias. Choosing the right n is a hyperparameter problem, and a fixed n is usually a compromise.
TD(λ) elegantly avoids picking a single n by averaging all n-step returns with exponentially decaying weights. The λ-return is
G_t^λ = (1 - λ) Σ_{n=1}^{∞} λ^{n-1} G_{t:t+n}.
With λ = 0 only the one-step return survives, recovering TD(0). With λ = 1 the weights collapse to the full Monte Carlo return. Intermediate λ values smoothly interpolate between TD and MC. This is the forward view.
The equivalent backward view, originally due to Sutton, uses eligibility traces e_t(s) to credit recent states for later TD errors. After every step the trace decays by γλ and is incremented by 1 for the visited state (accumulating traces) or set to 1 (replacing traces). Updates take the form
V(s) <- V(s) + α δ_t e_t(s) for all s.
In expectation the offline backward view matches the offline forward view exactly. Online they differ slightly, which Harm van Seijen and Richard Sutton corrected with true online TD(λ) (van Seijen and Sutton, 2014), introducing "dutch traces" that match the forward view at every step at the same computational cost as classical TD(λ).
| Method | Bootstraps? | Samples experience? | Needs model? | Bias | Variance | Online updates? |
|---|---|---|---|---|---|---|
| Dynamic programming | Yes | No (uses expectations) | Yes (full P, R) | None (in the limit) | None | Sweep-based |
| Monte Carlo | No | Yes (full returns) | No | Low | High | Episode-end only |
| TD(0) | Yes (1 step) | Yes | No | Moderate | Low to moderate | Yes |
| n-step TD | Yes (n steps) | Yes | No | Decreases with n | Increases with n | Yes (after n steps) |
| TD(λ) | Yes (mixed) | Yes | No | Tunable via λ | Tunable via λ | Yes |
The table makes the TD value proposition visible: it is the only family that can learn from a continuing stream of experience without a model and without waiting for episodes to end.
Predicting V is useful but the practical goal is usually control, choosing actions that maximise return. For that we estimate the action-value function Q^π(s, a) = E_π[G_t | S_t = s, A_t = a]. Several TD-based control algorithms exist, and they form the conceptual core of model-free RL.
SARSA (state-action-reward-state-action), introduced by Rummery and Niranjan in their 1994 Cambridge technical report "On-line Q-learning using connectionist systems" and later popularised under Sutton's snappier name, is the on-policy TD control method:
Q(S_t, A_t) <- Q(S_t, A_t) + α [R_{t+1} + γ Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)].
It evaluates and improves the same policy that is generating behaviour. With an epsilon-greedy policy, SARSA learns the value of the actually-followed exploratory policy.
Q-learning, introduced by Christopher Watkins in his 1989 Cambridge PhD thesis "Learning from Delayed Rewards" and proved convergent by Watkins and Dayan (1992), is the off-policy variant:
Q(S_t, A_t) <- Q(S_t, A_t) + α [R_{t+1} + γ max_a Q(S_{t+1}, a) - Q(S_t, A_t)].
The target uses the greedy action's value rather than the actually-chosen action's value, so Q-learning estimates the optimal action-value function regardless of how exploratory the behaviour policy is. This separation of behaviour and target policies is the defining feature of off-policy learning. The tabular Q-learning variant stores Q values in a lookup table; the deep variants replace the table with a neural network.
Expected SARSA (van Seijen, van Hasselt, Whiteson and Wiering, 2009) replaces the sampled next action with its expectation under the policy:
Q(S_t, A_t) <- Q + α [R_{t+1} + γ Σ_a π(a | S_{t+1}) Q(S_{t+1}, a) - Q(S_t, A_t)].
It has lower variance than SARSA at no extra bias and works for both on-policy and off-policy targets. Q-learning is the special case where the target policy is greedy.
Double Q-learning (van Hasselt, 2010) addresses the systematic positive bias of the max operator in Q-learning by maintaining two estimators and using one to select actions and the other to evaluate them. Deep Double Q-Networks (DDQN, van Hasselt, Guez and Silver, 2016) extend the same idea to neural networks and meaningfully improve DQN's stability.
Other members of the family include n-step SARSA, tree-backup (Precup, Sutton and Singh, 2000) for off-policy multi-step learning without importance sampling, Q(σ) which interpolates between sampling and expectation, and Retrace(λ) (Munos et al., 2016), which combines safe off-policy multi-step returns with bias control.
In the tabular setting TD methods enjoy strong convergence guarantees. Once we add function approximation, especially with neural networks, the picture darkens. Sutton and Barto (2018, chapter 11) call the combination of three innocent-looking ingredients the deadly triad:
When all three are present together, value estimates can diverge. Tsitsiklis and Van Roy (1997) constructed simple counterexamples that diverge under linear TD with off-policy data. The Gradient TD family (Sutton, Maei and Szepesvari, 2009; Sutton et al., 2009), including GTD, GTD2, and TDC (Temporal Difference with gradient Correction), provides the first off-policy linear methods with convergence guarantees by performing true stochastic gradient descent on a projected Bellman error. Emphatic TD (Sutton, Mahmood and White, 2016) restores stability via an emphasis weighting that re-balances state visitation. In deep RL, practical mitigations include target networks, experience replay, and clipped objectives. Hado van Hasselt and colleagues explicitly studied the triad in the deep setting in "Deep Reinforcement Learning and the Deadly Triad" (2018), finding that divergence is real but rarer in practice than the worst-case theory suggests.
One of the most striking results in computational neuroscience is the discovery that midbrain dopamine neurons appear to encode a TD error. In a sequence of single-unit recording experiments in monkeys, Wolfram Schultz showed that dopaminergic neurons in the ventral tegmental area and substantia nigra pars compacta fire above baseline at the time of an unexpected reward, fire above baseline at the time of a predictor of reward once that predictor has been learned, and fire below baseline at the time a predicted reward fails to arrive.
In the 1997 Science paper "A Neural Substrate of Prediction and Reward", Schultz, Peter Dayan, and P. Read Montague proposed that this firing pattern is exactly what a TD-learning algorithm computes: the discrepancy δ_t = R_{t+1} + γ V(S_{t+1}) - V(S_t). The hypothesis was so clean and predictive that it changed the language of reward neuroscience. Reward prediction error (RPE) is now a standard tool for thinking about dopamine, addiction, learning disorders, and computational psychiatry. Whether the mapping is literally correct in detail remains debated, but the qualitative parallel between dopamine bursts and TD errors is one of the strongest known links between a machine-learning algorithm and a biological mechanism.
The headline successes of modern reinforcement learning are, almost without exception, TD methods scaled up.
| Year | System | What TD did |
|---|---|---|
| 1995 | TD-Gammon (Tesauro) | TD(λ) trained a backgammon evaluation network through self-play to near-world-champion level. |
| 2013 | DQN preprint (Mnih et al.) | Q-learning with a deep convolutional network learning Atari from pixels. |
| 2015 | DQN, Nature (Mnih et al.) | Human-level performance on 49 Atari 2600 games using TD plus target networks and experience replay. |
| 2016 | AlphaGo (Silver et al.) | Value network trained with TD-style updates, combined with policy network and Monte Carlo tree search, defeated European champion Fan Hui then Lee Sedol. |
| 2016 | Double DQN (van Hasselt, Guez, Silver) | Reduced Q-learning's overestimation bias in the Atari benchmark. |
| 2017 | C51 / categorical DQN (Bellemare, Dabney, Munos) | Distributional TD learning, predicting a return distribution rather than its mean, with 51 atoms. |
| 2017 | AlphaGo Zero (Silver et al.) | Pure self-play with TD-trained value head, no human games. |
| 2018 | AlphaZero (Silver et al.) | Generalised AlphaGo Zero to chess and shogi from scratch. |
| 2018 | Rainbow DQN (Hessel et al.) | Combined six DQN improvements (double Q, prioritised replay, dueling, multi-step, distributional, noisy nets) into one state-of-the-art agent. |
| 2018 | QR-DQN (Dabney et al.) | Quantile regression distributional TD. |
| 2019 | SAC, TD3 (Haarnoja et al.; Fujimoto et al.) | Twin TD critics in continuous control. |
| 2020 | MuZero (Schrittwieser et al., Nature) | Learns its own model and uses TD-style value updates to plan, mastering Go, chess, shogi, and 57 Atari games. |
Gerald Tesauro's TD-Gammon (1995) deserves special mention. It used TD(λ) and a single-hidden-layer neural network trained by self-play, with no expert labels and no opening book, and reached a level commonly described as among the world's strongest. It was the first time TD with function approximation produced a champion-level player and is routinely cited in DQN, AlphaGo, and AlphaZero papers as inspiration.
The Mnih et al. (2015) Nature DQN paper showed that the same TD-control algorithm, augmented with a target network and a replay buffer, could learn to play 49 Atari games directly from raw pixels. The Silver et al. (2016) Nature AlphaGo paper combined a TD-trained value network with Monte Carlo tree search and a supervised-then-reinforcement-trained policy network. AlphaGo Zero (2017) and AlphaZero (2018) removed the supervised pre-training entirely. MuZero (Schrittwieser et al., 2020) added a learned model of the environment so the agent could plan in latent space, while still using TD-style value backups to ground the search.
| Variant | Year | What it adds |
|---|---|---|
| TD(0) | 1988 | Single-step bootstrapped value update. |
| TD(λ), eligibility traces | 1988 | Exponentially-weighted average of n-step returns; backward-view trace mechanism. |
| n-step TD / n-step SARSA | 1980s onward | Fixed-horizon multi-step targets. |
| SARSA | 1994 (Rummery & Niranjan) | On-policy TD control on Q. |
| Q-learning | 1989 (Watkins) | Off-policy TD control using max over next actions. |
| Tabular Q-learning | 1989 | The lookup-table form of Q-learning. |
| Expected SARSA | 2009 (van Seijen et al.) | Replaces next-action sample with expectation under the policy. |
| Tree-backup | 2000 (Precup, Sutton, Singh) | Multi-step off-policy learning without importance sampling. |
| Double Q-learning | 2010 (van Hasselt) | Two estimators to remove the max bias. |
| GTD, GTD2, TDC | 2009 (Sutton et al.) | Convergent off-policy linear TD via a projected Bellman objective. |
| Emphatic TD | 2016 (Sutton, Mahmood, White) | Stable off-policy TD via state-emphasis weighting. |
| True online TD(λ) | 2014 (van Seijen and Sutton) | Online algorithm exactly matching the forward view via dutch traces. |
| DQN | 2013 / 2015 (Mnih et al.) | Deep Q-learning with replay buffer and target network. |
| Double DQN | 2016 (van Hasselt, Guez, Silver) | Double-Q correction on top of DQN. |
| Prioritised experience replay | 2016 (Schaul et al.) | Sampling replays in proportion to TD error magnitude. |
| Dueling DQN | 2016 (Wang et al.) | Separates value and advantage streams. |
| C51 / categorical DQN | 2017 (Bellemare, Dabney, Munos) | Distributional TD with a fixed support. |
| QR-DQN | 2018 (Dabney et al.) | Distributional TD via quantile regression. |
| Retrace(λ) | 2016 (Munos et al.) | Safe off-policy multi-step returns. |
| Rainbow DQN | 2018 (Hessel et al.) | Six DQN extensions combined in one agent. |
| Q(σ) | 2017 (de Asis et al.) | Interpolates between full sampling and full expectation. |
With linear function approximation (V(s) ≈ θ^T φ(s) for some feature vector φ) on-policy TD(λ) is well-understood: convergence to the projected Bellman fixed point with probability one (Tsitsiklis and Van Roy, 1997). Eligibility traces, true online TD(λ), and Gradient TD all play nicely with linear features.
With neural networks, theory lags behind practice. Empirically, deep TD methods such as DQN and SAC work very well, but only with engineering tricks that mitigate the deadly triad: target networks (a slowly-updated copy of Q used in the bootstrap target), large replay buffers, normalised gradients, careful exploration, and reward clipping. Other approximators (kernel methods, decision trees, random projections) have all been studied but none has the empirical reach of neural nets.
TD learning is rarely the entire algorithm in modern systems. More often it provides the critic in a larger architecture.
Getting TD methods to work in practice is mostly about taming a few well-known knobs.