# Temporal-difference learning

> Source: https://aiwiki.ai/wiki/temporal_difference_learning
> Updated: 2026-06-23
> Categories: Algorithms, Machine Learning, Reinforcement Learning
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**Temporal-difference (TD) learning** is a class of model-free [reinforcement learning](/wiki/reinforcement_learning) methods that learn value-function estimates by bootstrapping: updating each estimate of how good a [state](/wiki/state) is toward a target built from the immediate reward plus the agent's own current estimate of the next state's value. The core TD(0) update is V(S_t) <- V(S_t) + α [R_{t+1} + γ V(S_{t+1}) - V(S_t)], and the bracketed quantity, the TD error, is the agent's reward-prediction error. TD methods combine the sampling of Monte Carlo simulation with the bootstrapping of dynamic programming, which lets a TD agent learn online from raw experience, without a model of the environment and without waiting for an episode to finish. [1][2]

TD learning was introduced by Richard S. Sutton in his 1984 PhD thesis at the University of Massachusetts Amherst ("Temporal Credit Assignment in Reinforcement Learning") and defined and analysed in detail in his 1988 *Machine Learning* paper "Learning to Predict by the Methods of Temporal Differences" (Sutton, 1988, vol. 3, pp. 9-44). [1] In their textbook, [Richard Sutton](/wiki/richard_sutton) and [Andrew Barto](/wiki/andrew_barto) write that "if one had to identify one idea as central and novel to reinforcement learning, it would be temporal-difference learning." [2] It is the algorithmic backbone of most modern reinforcement learning, including [SARSA](/wiki/sarsa), [Q-learning](/wiki/q_learning), Deep Q-Networks (DQN), actor-critic methods, and the AlphaGo / AlphaZero / MuZero family. Barto and Sutton received the 2024 ACM A.M. Turing Award (announced March 5, 2025) for developing the conceptual and algorithmic foundations of reinforcement learning, with TD learning singled out by the citation as one of the most important advances. [25]

## What is TD learning trying to do?

The basic problem is *prediction*: estimate the expected discounted return from each state of a Markov reward process. The return from time step t is

G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ...,

where γ in [0,1] is a discount factor. The value function V^π(s) under a policy π is the expected return from state s:

V^π(s) = E_π [G_t | S_t = s].
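As a small illustration of the return defined above, the following sketch computes G_t for a finite list of rewards. The function name `discounted_return` and the sample reward list are illustrative assumptions, not part of the original article.

```python
def discounted_return(rewards, gamma):
    """Return G_t = R_{t+1} + gamma*R_{t+2} + ... for a finite reward list."""
    g = 0.0
    # Work backwards so that each step applies G_t = R_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Hypothetical rewards R_{t+1}, R_{t+2}, R_{t+3}.
rewards = [1.0, 0.0, 2.0]
print(discounted_return(rewards, gamma=0.5))  # 1 + 0.5*0 + 0.25*2 = 1.5
```

The value function V^π(s) is the expectation of this quantity over trajectories that start in s and follow π.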

Classical methods compute V^π in two extreme ways. Dynamic programming (DP) plugs V^π into the Bellman equation and iterates, requiring a full model of transition probabilities and rewards. Monte Carlo (MC) instead samples whole trajectories and averages observed returns, requiring no model but having to wait for an episode to finish before any update. TD learning sits between the two. It samples a single transition (S_t, R_{t+1}, S_{t+1}) and uses its own current estimate V(S_{t+1}) as a stand-in for the rest of the unseen return. This is bootstrapping: building one estimate from another estimate, or as the idea is often described, learning a guess from a guess. [2]

Why does this matter? Two reasons. First, it lets the agent learn online, from a stream of experience, without waiting for episodes to terminate. Second, the TD target usually has much lower variance than a Monte Carlo return, because it depends on one random reward and one random transition rather than on the sum of every random reward until the episode ends. The cost is bias: if V is wrong, the bootstrapped target is wrong too. Much of the art of TD learning lies in managing this trade-off.

## What is the TD(0) update?

The simplest TD method, TD(0), updates the value estimate of S_t after each transition:

V(S_t) <- V(S_t) + α [R_{t+1} + γ V(S_{t+1}) - V(S_t)].

The bracketed quantity is the **TD error**:

δ_t = R_{t+1} + γ V(S_{t+1}) - V(S_t).

The TD error is the difference between two predictions of the same return: the new prediction R_{t+1} + γ V(S_{t+1}), which incorporates one extra step of real reward, and the old prediction V(S_t). The step size α in (0,1] controls how aggressively the agent corrects toward the new target. Sutton (1988) proved convergence in the mean for TD methods on linear prediction problems [1], and Tsitsiklis and Van Roy (1997) later proved convergence with probability one for on-policy linear TD(λ) under standard step-size and ergodicity conditions. [7]
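A minimal sketch of tabular TD(0) prediction follows. It assumes a five-state random walk (states 1 to 5, terminal states 0 and 6, reward +1 only when exiting on the right), in the style of the textbook's random-walk example; the environment, function name, and hyperparameters are illustrative choices, not something the article specifies. Under these assumptions the true values are 1/6, 2/6, ..., 5/6.

```python
import random


def td0_random_walk(episodes=5000, alpha=0.01, gamma=1.0, seed=0):
    """Estimate state values of a 5-state random walk with tabular TD(0)."""
    rng = random.Random(seed)
    n = 5
    # Indices 0 and n+1 are terminal states whose value stays at 0.
    V = [0.0] + [0.5] * n + [0.0]
    for _ in range(episodes):
        s = 3  # every episode starts in the centre state
        while 0 < s < n + 1:
            s_next = s + rng.choice((-1, 1))
            r = 1.0 if s_next == n + 1 else 0.0
            # TD error: one-step bootstrapped target minus the old prediction.
            delta = r + gamma * V[s_next] - V[s]
            V[s] += alpha * delta
            s = s_next
    return V[1:n + 1]


values = td0_random_walk()
print([round(v, 2) for v in values])  # should be close to 1/6, 2/6, ..., 5/6
```

Each update happens immediately after a single transition, so no episode has to finish before learning starts.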

## n-step TD and TD(λ)

TD(0) uses a one-step target. Monte Carlo uses an infinite-step (full return) target. n-step TD generalises across this spectrum:

G_{t:t+n} = R_{t+1} + γ R_{t+2} + ... + γ^{n-1} R_{t+n} + γ^n V(S_{t+n}),

V(S_t) <- V(S_t) + α [G_{t:t+n} - V(S_t)].

Larger n gives a target that depends more on real rewards and less on bootstrapped estimates, raising variance but lowering bias. Choosing the right n is a hyperparameter problem, and a fixed n is usually a compromise.
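The n-step target can be sketched for a recorded episode as follows. The indexing convention (`rewards[k]` holds R_{k+1}, `values[k]` holds V(S_k), and the terminal state has value 0) and the sample numbers are assumptions made for this illustration.

```python
def n_step_return(rewards, values, t, n, gamma):
    """Compute G_{t:t+n} for one recorded episode.

    rewards[k] is R_{k+1}; values[k] is the current estimate V(S_k),
    with values[T] = 0 for the terminal state (T = len(rewards)).
    """
    T = len(rewards)
    end = min(t + n, T)  # truncate at the end of the episode
    g = 0.0
    for k in range(t, end):
        g += gamma ** (k - t) * rewards[k]
    # Bootstrap from the estimate at the state reached after the last reward.
    g += gamma ** (end - t) * values[end]
    return g


rewards = [0.0, 0.0, 1.0]          # hypothetical episode of length 3
values = [0.2, 0.4, 0.6, 0.0]      # hypothetical estimates, terminal value 0
for n in (1, 2, 3):
    print(n, n_step_return(rewards, values, t=0, n=n, gamma=0.9))
# n=1 is the TD(0) target (about 0.36); n=3 reaches the end of the episode,
# so it equals the Monte Carlo return (about 0.81).
```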

**TD(λ)** elegantly avoids picking a single n by averaging all n-step returns with exponentially decaying weights. The λ-return is

G_t^λ = (1 - λ) Σ_{n=1}^{∞} λ^{n-1} G_{t:t+n}.

With λ = 0 only the one-step return survives, recovering TD(0). With λ = 1 the weights collapse to the full Monte Carlo return. Intermediate λ values smoothly interpolate between TD and MC. This is the *forward view*. [2]
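The forward view can be checked numerically. The sketch below assumes an episodic task, where every n-step return that reaches past the end of the episode equals the full return, so the infinite sum collapses to a finite one with the leftover weight λ^{T-t-1} placed on the Monte Carlo return. The helper names and sample numbers are illustrative.

```python
def n_step_return(rewards, values, t, n, gamma):
    """G_{t:t+n}; rewards[k] is R_{k+1}, values[T] = 0 at the terminal state."""
    T = len(rewards)
    end = min(t + n, T)
    g = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    return g + gamma ** (end - t) * values[end]


def lambda_return(rewards, values, t, gamma, lam):
    """Episodic lambda-return: weighted average of all n-step returns."""
    T = len(rewards)
    g = 0.0
    # n-step returns that stop before the end of the episode.
    for n in range(1, T - t):
        g += (1 - lam) * lam ** (n - 1) * n_step_return(rewards, values, t, n, gamma)
    # All longer returns equal the full return; their weights sum to lam**(T-t-1).
    g += lam ** (T - t - 1) * n_step_return(rewards, values, t, T - t, gamma)
    return g


rewards = [0.0, 0.0, 1.0]
values = [0.2, 0.4, 0.6, 0.0]
print(lambda_return(rewards, values, 0, 0.9, lam=0.0))  # one-step TD target
print(lambda_return(rewards, values, 0, 0.9, lam=1.0))  # Monte Carlo return
print(lambda_return(rewards, values, 0, 0.9, lam=0.5))  # in between
```

With these numbers the weights for λ = 0.5 are 0.5, 0.25, and 0.25, which sum to one as required.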

The equivalent *backward view*, originally due to Sutton, uses **eligibility traces** e_t(s) to credit recent states for later TD errors. After every step the trace decays by γλ and is incremented by 1 for the visited state (accumulating traces) or set to 1 (replacing traces). Updates take the form

V(s) <- V(s) + α δ_t e_t(s)   for all s.

With offline updating, where the increments are accumulated and applied only at the end of an episode, the backward view produces the same total update as the forward view. Online the two differ slightly, which Harm van Seijen and Richard Sutton corrected with **true online TD(λ)** (van Seijen and Sutton, 2014), introducing "dutch traces" that achieve exact step-by-step equivalence with an online forward view at the same computational cost as classical TD(λ). [13]
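The backward view with accumulating traces can be sketched on a small example. This is classical online TD(λ), not true online TD(λ), and it assumes a five-state random walk (terminals at 0 and 6, reward +1 on the right exit); the environment, function name, and hyperparameters are illustrative assumptions.

```python
import random


def td_lambda_random_walk(episodes=4000, alpha=0.005, gamma=1.0, lam=0.5, seed=0):
    """Tabular TD(lambda), backward view, accumulating eligibility traces."""
    rng = random.Random(seed)
    n = 5
    V = [0.0] + [0.5] * n + [0.0]   # terminal values fixed at 0
    for _ in range(episodes):
        e = [0.0] * (n + 2)          # traces are reset at the start of an episode
        s = 3
        while 0 < s < n + 1:
            s_next = s + rng.choice((-1, 1))
            r = 1.0 if s_next == n + 1 else 0.0
            delta = r + gamma * V[s_next] - V[s]
            # Decay every trace by gamma * lambda, then bump the visited state.
            for i in range(1, n + 1):
                e[i] *= gamma * lam
            e[s] += 1.0              # a replacing trace would set e[s] = 1.0
            # Every state is credited in proportion to its trace.
            for i in range(1, n + 1):
                V[i] += alpha * delta * e[i]
            s = s_next
    return V[1:n + 1]


print([round(v, 2) for v in td_lambda_random_walk()])  # near 1/6, ..., 5/6
```

A single TD error therefore updates all recently visited states at once, which is how the traces propagate credit backwards in time.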

## How does TD compare with dynamic programming and Monte Carlo?

| Method | Bootstraps? | Samples experience? | Needs model? | Bias | Variance | Online updates? |
|---|---|---|---|---|---|---|
| Dynamic programming | Yes | No (uses expectations) | Yes (full P, R) | None (in the limit) | None | Sweep-based |
| Monte Carlo | No | Yes (full returns) | No | Low | High | Episode-end only |
| TD(0) | Yes (1 step) | Yes | No | Moderate | Low to moderate | Yes |
| n-step TD | Yes (n steps) | Yes | No | Decreases with n | Increases with n | Yes (after n steps) |
| TD(λ) | Yes (mixed) | Yes | No | Tunable via λ | Tunable via λ | Yes |

The table makes the TD value proposition visible: TD methods are the only family shown that can learn from a continuing stream of experience without a model and without waiting for episodes to end. [2]

## TD for control: SARSA, Q-learning, and friends

Predicting V is useful but the practical goal is usually *control*, choosing actions that maximise return. For that we estimate the action-value function Q^π(s, a) = E_π[G_t | S_t = s, A_t = a]. Several TD-based control algorithms exist, and they form the conceptual core of model-free RL.

**SARSA (state-action-reward-state-action)**, introduced by Rummery and Niranjan in their 1994 Cambridge technical report "On-line Q-learning using connectionist systems" [8] and later popularised under Sutton's snappier name, is the on-policy TD control method:

Q(S_t, A_t) <- Q(S_t, A_t) + α [R_{t+1} + γ Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)].

It evaluates and improves the same policy that is generating behaviour. With an [epsilon-greedy policy](/wiki/epsilon_greedy_policy), SARSA learns the value of the actually-followed exploratory policy.
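A single SARSA update can be written as a short function. The dict-of-dicts table layout, the state and action labels, and the numbers below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma, terminal=False):
    """Apply one SARSA update in place to a table Q[state][action]."""
    # On-policy target: bootstrap from the action the agent actually takes next.
    bootstrap = 0.0 if terminal else Q[s_next][a_next]
    td_error = r + gamma * bootstrap - Q[s][a]
    Q[s][a] += alpha * td_error
    return td_error


Q = {
    "A": {"left": 0.0, "right": 1.0},
    "B": {"left": 2.0, "right": 4.0},
}
# The agent took "right" in A, received reward 1, and (exploring) chose "left" in B.
delta = sarsa_update(Q, "A", "right", r=1.0, s_next="B", a_next="left",
                     alpha=0.5, gamma=0.9)
print(delta, Q["A"]["right"])  # TD error about 1.8, new estimate about 1.9
```

Because the exploratory action "left" was the one actually chosen, its lower value of 2.0 enters the target rather than the greedy value of 4.0.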

**Q-learning**, introduced by Christopher Watkins in his 1989 Cambridge PhD thesis "Learning from Delayed Rewards" [3] and proved convergent by Watkins and Dayan (1992) [4], is the off-policy variant:

Q(S_t, A_t) <- Q(S_t, A_t) + α [R_{t+1} + γ max_a Q(S_{t+1}, a) - Q(S_t, A_t)].

The target uses the greedy action's value rather than the actually-chosen action's value, so Q-learning estimates the optimal action-value function regardless of how exploratory the behaviour policy is. This separation of behaviour and target policies is the defining feature of off-policy learning. The [tabular Q-learning](/wiki/tabular_q-learning) variant stores Q values in a lookup table; the deep variants replace the table with a neural network.
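The following sketch runs tabular Q-learning on a tiny deterministic corridor. The environment (states 0 to 4, goal at state 4, reward of -1 per move) and all hyperparameters are assumptions introduced for illustration; they do not come from the article.

```python
import random

N_STATES, GOAL = 5, 4      # hypothetical corridor: states 0..4, goal at 4
LEFT, RIGHT = 0, 1


def step(s, a):
    """Deterministic corridor dynamics with a reward of -1 per move."""
    s_next = max(0, s - 1) if a == LEFT else s + 1
    return s_next, -1.0, s_next == GOAL


def q_learning(episodes=500, alpha=0.5, gamma=1.0, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # Behaviour policy: epsilon-greedy with respect to the current Q.
            if rng.random() < epsilon:
                a = rng.choice((LEFT, RIGHT))
            else:
                a = max((LEFT, RIGHT), key=lambda act: Q[s][act])
            s_next, r, done = step(s, a)
            # Off-policy target: greedy value of the next state,
            # regardless of which action will actually be taken there.
            target = r if done else r + gamma * max(Q[s_next])
            Q[s][a] += alpha * (target - Q[s][a])
            s = s_next
    return Q


Q = q_learning()
print([round(Q[s][RIGHT], 2) for s in range(GOAL)])  # approaches -4, -3, -2, -1
```

Even though the behaviour policy keeps exploring, the learned values approach those of the optimal always-move-right policy.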

**Expected SARSA** (van Seijen, van Hasselt, Whiteson and Wiering, 2009) replaces the sampled next action with its expectation under the policy:

Q(S_t, A_t) <- Q(S_t, A_t) + α [R_{t+1} + γ Σ_a π(a | S_{t+1}) Q(S_{t+1}, a) - Q(S_t, A_t)].

It has lower variance than SARSA at no extra bias and works for both on-policy and off-policy targets. Q-learning is the special case where the target policy is greedy. [9]
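To make the expectation concrete, the sketch below computes the Expected SARSA target under an epsilon-greedy target policy. The choice of epsilon-greedy and the sample numbers are assumptions for illustration; any target policy π could be substituted.

```python
def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities of an epsilon-greedy policy over q_values."""
    n = len(q_values)
    greedy = max(range(n), key=lambda a: q_values[a])
    probs = [epsilon / n] * n
    probs[greedy] += 1.0 - epsilon
    return probs


def expected_sarsa_target(r, q_next, gamma, epsilon):
    """R + gamma * sum_a pi(a | S') Q(S', a) for an epsilon-greedy pi."""
    probs = epsilon_greedy_probs(q_next, epsilon)
    expected_q = sum(p * q for p, q in zip(probs, q_next))
    return r + gamma * expected_q


q_next = [2.0, 4.0]   # hypothetical Q(S', a) for two actions
print(expected_sarsa_target(1.0, q_next, gamma=0.9, epsilon=0.1))  # about 4.51
print(expected_sarsa_target(1.0, q_next, gamma=0.9, epsilon=0.0))  # about 4.6
```

Setting epsilon to zero makes the target policy greedy, and the target reduces to the Q-learning target R + γ max_a Q(S', a).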

**Double Q-learning** (van Hasselt, 2010) addresses the systematic positive bias of the max operator in Q-learning by maintaining two estimators and using one to select actions and the other to evaluate them. [10] Double DQN (DDQN; van Hasselt, Guez and Silver, 2016) extends the same idea to neural networks, reducing DQN's value overestimation and improving its Atari scores. [15]
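One tabular Double Q-learning update might look like the sketch below. The list-of-lists table layout, the function name, and the sample values are illustrative assumptions; the essential point is that the table being updated selects the action while the other table supplies its value.

```python
import random


def double_q_update(QA, QB, s, a, r, s_next, done, alpha, gamma, rng=random):
    """Update one of two Q tables in place, chosen by a fair coin flip."""
    if rng.random() < 0.5:
        learner, evaluator = QA, QB
    else:
        learner, evaluator = QB, QA
    if done:
        target = r
    else:
        # The learner selects the action; the other table evaluates it.
        n_actions = len(learner[s_next])
        best = max(range(n_actions), key=lambda b: learner[s_next][b])
        target = r + gamma * evaluator[s_next][best]
    learner[s][a] += alpha * (target - learner[s][a])


# Hypothetical tables with two states and two actions.
QA = [[0.0, 0.0], [1.0, 5.0]]
QB = [[0.0, 0.0], [3.0, 2.0]]
double_q_update(QA, QB, s=0, a=0, r=0.0, s_next=1, done=False,
                alpha=1.0, gamma=1.0, rng=random.Random(0))
print(QA[0][0], QB[0][0])
```

With these numbers the target is 2.0 if QA is updated and 1.0 if QB is updated, both lower than the single-table maxima of 5.0 and 3.0 that ordinary Q-learning would bootstrap from.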

Other members of the family include **n-step SARSA**, **tree-backup** (Precup, Sutton and Singh, 2000) for off-policy multi-step learning without importance sampling [23], **Q(σ)** which interpolates between sampling and expectation, and **Retrace(λ)** (Munos et al., 2016), which uses truncated importance weights to obtain safe, low-variance off-policy multi-step returns. [22]

## Why do neural networks make TD harder? The deadly triad

In the tabular setting TD methods enjoy strong convergence guarantees. Once we add function approximation, especially with neural networks, the picture darkens. Sutton and Barto (2018, chapter 11) call the combination of three innocent-looking ingredients the **deadly triad**: [2]

1. Function approximation (e.g. a neural network).
2. Bootstrapping (TD targets that use V or Q estimates).
3. Off-policy training (learning about a different policy than the one collecting data).

When all three are present together, value estimates can diverge. Tsitsiklis and Van Roy (1997) constructed simple counterexamples that diverge under linear TD with off-policy data. [7] The **Gradient TD** family (Sutton, Maei and Szepesvari, 2009; Sutton et al., 2009), including GTD, GTD2, and TDC (Temporal Difference with gradient Correction), provides the first off-policy linear methods with convergence guarantees by performing true stochastic gradient descent on a projected Bellman error. [11] **Emphatic TD** (Sutton, Mahmood and White, 2016) restores stability via an emphasis weighting that re-balances state visitation. [12] In deep RL, practical mitigations include target networks, experience replay, and clipped objectives. Hado van Hasselt and colleagues explicitly studied the triad in the deep setting in "Deep Reinforcement Learning and the Deadly Triad" (2018), finding that divergence is real but rarer in practice than the worst-case theory suggests. [24]
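A very small example shows how the three ingredients can interact. The sketch assumes a two-state fragment in the style of the textbook's illustration: one shared weight w, a state with feature 1 (value w) that always moves to a state with feature 2 (value 2w) with reward 0, and updates applied only to that one transition, which is not the distribution any on-policy agent would experience. The function name and parameters are illustrative.

```python
def repeated_off_policy_update(w=1.0, alpha=0.1, gamma=0.9, steps=100):
    """Semi-gradient TD(0) applied repeatedly to a single transition."""
    history = [w]
    for _ in range(steps):
        v_s = 1.0 * w        # linear value of the first state (feature 1)
        v_next = 2.0 * w     # linear value of the next state (feature 2)
        td_error = 0.0 + gamma * v_next - v_s
        # Semi-gradient step: the gradient of v_s with respect to w is 1.
        w += alpha * td_error * 1.0
        history.append(w)
    return history


print(repeated_off_policy_update(gamma=0.9)[-1])  # grows without bound
print(repeated_off_policy_update(gamma=0.4)[-1])  # shrinks toward 0
```

Each step multiplies w by 1 + α(2γ - 1), so the weight diverges whenever γ > 0.5 even though the reward is always zero and the true values are zero.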

## How does TD learning connect to dopamine in the brain?

One of the most striking results in computational neuroscience is the discovery that midbrain [dopamine](/wiki/dopamine) neurons appear to encode a TD error. In a sequence of single-unit recording experiments in monkeys, Wolfram Schultz showed that dopaminergic neurons in the ventral tegmental area and substantia nigra pars compacta fire above baseline at the time of an unexpected reward, fire above baseline at the time of a *predictor* of reward once that predictor has been learned, and fire *below* baseline at the time a predicted reward fails to arrive. [6]

In the 1997 *Science* paper "A Neural Substrate of Prediction and Reward" (vol. 275, pp. 1593-1599), Schultz, Peter Dayan, and P. Read Montague proposed that this firing pattern is exactly what a TD-learning algorithm computes: the discrepancy δ_t = R_{t+1} + γ V(S_{t+1}) - V(S_t). [6] The dopamine neuron, in their account, is activated by an unpredicted reward (a positive prediction error), shows no response to a fully predicted reward (no prediction error), and is depressed by the omission of a predicted reward (a negative prediction error). The hypothesis was so clean and predictive that it changed the language of reward neuroscience: the 1997 paper has since been cited more than 9,000 times. [6] Reward prediction error (RPE) is now a standard tool for thinking about dopamine, addiction, learning disorders, and computational psychiatry, and Schultz revisited the framework in his 2016 review "Dopamine reward prediction error coding". [26] Whether the mapping is literally correct in detail remains debated, but the qualitative parallel between dopamine bursts and TD errors is one of the strongest known links between a machine-learning algorithm and a biological mechanism.
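The qualitative cases described above follow directly from the TD error formula. The values below (a reward of 1, a learned prediction of 1, and γ = 1) are hypothetical numbers chosen for illustration, not data from the recordings.

```python
def td_error(reward, v_next, v_current, gamma=1.0):
    """delta = R + gamma * V(S') - V(S)."""
    return reward + gamma * v_next - v_current


# Unexpected reward: nothing was predicted, reward arrives.
unexpected_reward = td_error(reward=1.0, v_next=0.0, v_current=0.0)
# Learned cue appears: value jumps from 0 to 1 with no reward yet.
cue_onset = td_error(reward=0.0, v_next=1.0, v_current=0.0)
# Fully predicted reward: the prediction already accounts for it.
predicted_reward = td_error(reward=1.0, v_next=0.0, v_current=1.0)
# Predicted reward omitted: the prediction is violated.
omitted_reward = td_error(reward=0.0, v_next=0.0, v_current=1.0)

print(unexpected_reward, cue_onset, predicted_reward, omitted_reward)
# 1.0 1.0 0.0 -1.0
```

The positive, zero, and negative errors line up with the burst, baseline, and dip responses in the account above.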

## What are the famous applications of TD learning?

The headline successes of modern reinforcement learning are, almost without exception, TD methods scaled up.

| Year | System | What TD did |
|---|---|---|
| 1992-1995 | TD-Gammon (Tesauro) | TD(λ) trained a backgammon evaluation network through self-play to near-world-champion level. |
| 2013 | DQN preprint (Mnih et al.) | Q-learning with a deep convolutional network learning Atari from pixels. |
| 2015 | DQN, *Nature* (Mnih et al.) | Evaluated on 49 Atari 2600 games and reached human-level scores on the majority, using TD plus target networks and experience replay. |
| 2016 | AlphaGo (Silver et al.) | A value network fitted by regression to self-play game outcomes (a Monte Carlo-style target rather than a bootstrapped one), combined with a policy network and Monte Carlo tree search, defeated European champion Fan Hui then Lee Sedol. |
| 2016 | Double DQN (van Hasselt, Guez, Silver) | Reduced Q-learning's overestimation bias in the Atari benchmark. |
| 2017 | C51 / categorical DQN (Bellemare, Dabney, Munos) | Distributional TD learning, predicting a return distribution rather than its mean, with 51 atoms. [16] |
| 2017 | AlphaGo Zero (Silver et al.) | Pure self-play with a value head trained to predict game outcomes, no human games. |
| 2018 | AlphaZero (Silver et al.) | Generalised AlphaGo Zero to chess and shogi from scratch. |
| 2018 | Rainbow DQN (Hessel et al.) | Combined six DQN improvements (double Q, prioritised replay, dueling, multi-step, distributional, noisy nets) into one state-of-the-art agent. [17] |
| 2018 | QR-DQN (Dabney et al.) | Quantile regression distributional TD. |
| 2018 | SAC, TD3 (Haarnoja et al.; Fujimoto et al.) | Twin TD critics in continuous control. |
| 2020 | MuZero (Schrittwieser et al., *Nature*) | Learns its own model and uses TD-style value updates to plan, mastering Go, chess, shogi, and 57 Atari games. |

Gerald Tesauro's TD-Gammon deserves special mention. Trained from 1992 onward, it used TD(λ) and a single-hidden-layer neural network learning purely by self-play, with no expert labels and no opening book. Version 1.0 (300,000 self-play games) already beat Neurogammon and every prior backgammon program; by 1995 later versions reached a level commonly described as among the world's strongest, just below the top human players. [5] It was the first time TD with function approximation produced a champion-level player and is routinely cited in DQN, AlphaGo, and AlphaZero papers as inspiration.

The Mnih et al. (2015) *Nature* DQN paper showed that the same TD-control algorithm, augmented with a target network and a replay buffer, could learn to play 49 Atari games directly from raw pixels, reaching human-level scores on the majority. [14] The Silver et al. (2016) *Nature* AlphaGo paper combined a learned value network with [Monte Carlo tree search](/wiki/monte_carlo_tree_search) and a supervised-then-reinforcement-trained policy network; the value network was fitted by regression to self-play game outcomes, a Monte Carlo-style target rather than a one-step bootstrapped one. [18] AlphaGo Zero (2017) and AlphaZero (2018) removed the supervised pre-training entirely. [19][20] MuZero (Schrittwieser et al., 2020) added a learned model of the environment so the agent could plan in latent space, while still using TD-style value backups to ground the search. [21]

## Variants and extensions

| Variant | Year | What it adds |
|---|---|---|
| TD(0) | 1988 | Single-step bootstrapped value update. |
| TD(λ), eligibility traces | 1988 | Exponentially-weighted average of n-step returns; backward-view trace mechanism. |
| n-step TD / n-step SARSA | 1980s onward | Fixed-horizon multi-step targets. |
| SARSA | 1994 (Rummery & Niranjan) | On-policy TD control on Q. |
| Q-learning | 1989 (Watkins) | Off-policy TD control using max over next actions. |
| Tabular Q-learning | 1989 | The lookup-table form of Q-learning. |
| Expected SARSA | 2009 (van Seijen et al.) | Replaces next-action sample with expectation under the policy. |
| Tree-backup | 2000 (Precup, Sutton, Singh) | Multi-step off-policy learning without importance sampling. |
| Double Q-learning | 2010 (van Hasselt) | Two estimators to counter the maximisation (overestimation) bias. |
| GTD, GTD2, TDC | 2009 (Sutton et al.) | Convergent off-policy linear TD via a projected Bellman objective. |
| Emphatic TD | 2016 (Sutton, Mahmood, White) | Stable off-policy TD via state-emphasis weighting. |
| True online TD(λ) | 2014 (van Seijen and Sutton) | Online algorithm exactly matching the forward view via dutch traces. |
| DQN | 2013 / 2015 (Mnih et al.) | Deep Q-learning with replay buffer and target network. |
| Double DQN | 2016 (van Hasselt, Guez, Silver) | Double-Q correction on top of DQN. |
| Prioritised experience replay | 2016 (Schaul et al.) | Sampling replays in proportion to TD error magnitude. |
| Dueling DQN | 2016 (Wang et al.) | Separates value and advantage streams. |
| C51 / categorical DQN | 2017 (Bellemare, Dabney, Munos) | Distributional TD with a fixed support. |
| QR-DQN | 2018 (Dabney et al.) | Distributional TD via quantile regression. |
| Retrace(λ) | 2016 (Munos et al.) | Safe off-policy multi-step returns. |
| Rainbow DQN | 2018 (Hessel et al.) | Six DQN extensions combined in one agent. |
| Q(σ) | 2017 (de Asis et al.) | Interpolates between full sampling and full expectation. |

## Function approximation

With linear function approximation (V(s) ≈ θ^T φ(s) for some feature vector φ) on-policy TD(λ) is well-understood: convergence to the projected Bellman fixed point with probability one (Tsitsiklis and Van Roy, 1997). [7] Eligibility traces, true online TD(λ), and Gradient TD all play nicely with linear features.
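A minimal sketch of on-policy semi-gradient TD(0) with linear features follows. It assumes a five-state random walk (terminals at 0 and 6, reward +1 on the right exit) and a hypothetical two-feature encoding under which the true values s/6 are exactly representable with θ = (0.5, 0.5); the environment, features, and hyperparameters are illustrative choices.

```python
import random


def features(s):
    """Hypothetical encoding of states 1..5: a bias term and a centred position."""
    return [1.0, (s - 3) / 3.0]


def dot(theta, phi):
    return sum(t * f for t, f in zip(theta, phi))


def linear_td0(episodes=5000, alpha=0.005, gamma=1.0, seed=0):
    """On-policy semi-gradient TD(0) with V(s) = theta^T phi(s)."""
    rng = random.Random(seed)
    theta = [0.0, 0.0]
    for _ in range(episodes):
        s = 3
        while 1 <= s <= 5:
            s_next = s + rng.choice((-1, 1))
            terminal = s_next in (0, 6)
            r = 1.0 if s_next == 6 else 0.0
            phi = features(s)
            v_next = 0.0 if terminal else dot(theta, features(s_next))
            delta = r + gamma * v_next - dot(theta, phi)
            # The gradient of the linear value with respect to theta is phi.
            theta = [t + alpha * delta * f for t, f in zip(theta, phi)]
            s = s_next
    return theta


print([round(t, 2) for t in linear_td0()])  # should be near [0.5, 0.5]
```

Because the data are generated by the policy being evaluated, this is the on-policy setting covered by the convergence result cited above.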

With neural networks, theory lags behind practice. Empirically, deep TD methods such as DQN and SAC work very well, but only with engineering tricks that mitigate the deadly triad: target networks (a slowly-updated copy of Q used in the bootstrap target), large replay buffers, normalised gradients, careful exploration, and reward clipping. [14] Other approximators (kernel methods, decision trees, random projections) have all been studied but none has the empirical reach of neural nets.
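The role of a target network can be sketched without a deep-learning library. In the illustration below a plain dict of action values stands in for each network, and the periodic hard copy is one common synchronisation scheme; the function names, the `sync_every` parameter, and the sample batch are all assumptions made for this sketch.

```python
import copy


def td_targets(batch, target_q, gamma):
    """Bootstrapped targets computed from the frozen copy, not the online Q.

    batch holds (state, action, reward, next_state, done) tuples and
    target_q maps a state to a list of action values.
    """
    targets = []
    for _state, _action, reward, next_state, done in batch:
        bootstrap = 0.0 if done else max(target_q[next_state])
        targets.append(reward + gamma * bootstrap)
    return targets


def maybe_sync(step, online_q, target_q, sync_every):
    """Hard update: replace the target copy every sync_every steps."""
    if step % sync_every == 0:
        return copy.deepcopy(online_q)
    return target_q


online_q = {"s1": [1.0, 2.0]}    # changes at every gradient step
target_q = {"s1": [0.0, 0.5]}    # slowly-updated copy
batch = [("s0", 0, 1.0, "s1", False), ("s0", 1, 2.0, "s1", True)]
print(td_targets(batch, target_q, gamma=0.9))  # about [1.45, 2.0]
target_q = maybe_sync(step=100, online_q=online_q, target_q=target_q, sync_every=100)
```

Holding the target copy fixed between synchronisations keeps the regression target from shifting with every update of the online estimate.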

## How does TD learning show up across the RL landscape?

TD learning is rarely the entire algorithm in modern systems. More often it provides the *critic* in a larger architecture.

- **Actor-critic methods** (Barto, Sutton, Anderson, 1983) learn a parameterised policy (the actor) and a TD-trained value function (the critic). Modern variants include A2C/A3C (Mnih et al., 2016), DDPG (Lillicrap et al., 2016), TD3 (Fujimoto, van Hoof, Meger, 2018), SAC (Haarnoja et al., 2018), and [PPO](/wiki/proximal_policy_optimization) (Schulman et al., 2017). The critic in PPO is a value network trained with a TD-like objective on bootstrapped multi-step returns; the actor uses the resulting advantage estimates.
- **Model-based RL** with TD includes Dyna-Q (Sutton, 1990), which interleaves real and simulated experience, and MuZero (Schrittwieser et al., 2020), which learns a latent dynamics model and uses TD value updates inside its tree search. [21]
- **Offline (batch) RL** uses TD updates on a fixed dataset, with regularisers to keep the policy close to the data distribution. Examples include BCQ (Fujimoto, Meger, Precup, 2019), CQL (Kumar et al., 2020), and IQL (Kostrikov, Nair, Levine, 2022).
- **[Reinforcement learning from human feedback](/wiki/rlhf)** for large language models typically uses PPO with a value head, so the underlying advantage estimator is built on TD-style bootstrapping.
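As an illustration of how TD errors feed an actor, the sketch below computes advantages with generalised advantage estimation (GAE), the estimator commonly paired with PPO. The article does not name a specific estimator, so the use of GAE, the function name, and the sample numbers are assumptions.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Exponentially weighted sums of TD errors, computed backwards in time.

    values has len(rewards) + 1 entries; the last one is the bootstrap
    value of the final state (0 if the episode terminated).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error from the critic.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


rewards = [0.0, 0.0, 1.0]
values = [0.2, 0.4, 0.6, 0.0]
print(gae_advantages(rewards, values, gamma=1.0, lam=0.0))  # one-step TD errors
print(gae_advantages(rewards, values, gamma=1.0, lam=1.0))  # return minus V(S_t)
```

The parameter λ plays the same role here as in TD(λ): 0 gives low-variance one-step errors, 1 gives Monte Carlo advantages.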

## Practical considerations

Getting TD methods to work in practice is mostly about taming a few well-known knobs.

- **Step size α.** Too large and updates oscillate or diverge; too small and learning is glacial. Schedules (e.g. 1/n in tabular settings) can help, and for stochastic gradient TD with neural nets the usual deep-learning intuitions about Adam, RMSProp, and learning-rate warmup apply.
- **Discount γ.** Smaller γ makes the problem easier and more myopic; larger γ pushes the effective horizon out and amplifies bootstrapping errors.
- **Trace decay λ.** A common rule of thumb is to start in the 0.8 to 0.95 range, then tune. λ closer to 1 helps long-horizon problems; λ closer to 0 helps when the value function is unstable.
- **Initialisation.** Optimistic initial values can encourage exploration in tabular control; in deep RL initialisation interacts strongly with the target network update rate.
- **Exploration.** Epsilon-greedy, Boltzmann, parameter-space noise (NoisyNets), curiosity bonuses, and entropy regularisation are all in regular use.
- **Stability tricks.** Target networks, replay buffers (uniform or prioritised), reward clipping, gradient clipping, double-Q correction, and ensemble critics are essentially standard equipment in deep TD methods.
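The effect of the step-size schedule can be seen on a toy sequence of targets. The helper name and the numbers are illustrative: with α = 1/n the estimate is the plain sample mean of the targets seen so far, whereas a constant α weights recent targets more heavily.

```python
def running_estimate(targets, step_size):
    """Apply v <- v + alpha_n * (target - v) for each target in turn."""
    v = 0.0
    for n, target in enumerate(targets, start=1):
        v += step_size(n) * (target - v)
    return v


targets = [1.0, 3.0, 2.0, 6.0]   # hypothetical sequence of update targets
print(running_estimate(targets, lambda n: 1.0 / n))  # 3.0, the sample mean
print(running_estimate(targets, lambda n: 0.5))      # 3.9375, recency-weighted
```

A constant step size is usually preferred when the targets are non-stationary, as they are in bootstrapped control where the targets move as the estimates improve.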

## Open challenges

- **Stability with deep nets.** Even with target networks and replay, divergence and policy collapse remain real failure modes, especially in continuous control and offline settings.
- **Sample efficiency.** TD methods often need millions of interactions to reach human-level Atari play. Closing the gap with model-based methods, demonstrations, and pre-training is an active research frontier.
- **Off-policy evaluation.** Estimating the value of a policy from data collected by another policy is fundamental for safe deployment but has high variance with importance sampling and high bias with bootstrapping.
- **Long-horizon credit assignment.** When rewards are very sparse and far in the future, bootstrapping errors compound. Eligibility traces, hierarchical RL, and successor representations are partial answers.
- **Continual learning.** TD networks tend to forget earlier tasks when trained on a new one; the interaction between TD and catastrophic forgetting is poorly understood.
- **Plasticity loss.** Recent work, much of it from Sutton and collaborators, shows that deep TD agents lose the ability to learn from new data after long training runs, motivating new optimisation techniques.

## References

1. Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. *Machine Learning*, 3(1), 9-44. https://link.springer.com/article/10.1007/BF00115009
2. Sutton, R. S., and Barto, A. G. (2018). *Reinforcement Learning: An Introduction* (2nd ed.). MIT Press. Chapters 6, 7, 11, 12. http://incompleteideas.net/book/the-book-2nd.html
3. Watkins, C. J. C. H. (1989). *Learning from Delayed Rewards*. PhD thesis, King's College, University of Cambridge.
4. Watkins, C. J. C. H., and Dayan, P. (1992). Q-learning. *Machine Learning*, 8(3-4), 279-292. https://link.springer.com/article/10.1007/BF00992698
5. Tesauro, G. (1995). Temporal difference learning and TD-Gammon. *Communications of the ACM*, 38(3), 58-68. https://dl.acm.org/doi/10.1145/203330.203343
6. Schultz, W., Dayan, P., and Montague, P. R. (1997). A neural substrate of prediction and reward. *Science*, 275(5306), 1593-1599. https://www.science.org/doi/10.1126/science.275.5306.1593
7. Tsitsiklis, J. N., and Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. *IEEE Transactions on Automatic Control*, 42(5), 674-690. https://www.mit.edu/~jnt/Papers/J063-97-bvr-td.pdf
8. Rummery, G. A., and Niranjan, M. (1994). On-line Q-learning using connectionist systems. Technical Report CUED/F-INFENG/TR 166, Cambridge University Engineering Department.
9. van Seijen, H., van Hasselt, H., Whiteson, S., and Wiering, M. (2009). A theoretical and empirical analysis of expected SARSA. *IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL)*, 177-184.
10. van Hasselt, H. (2010). Double Q-learning. *Advances in Neural Information Processing Systems 23 (NeurIPS)*, 2613-2622.
11. Sutton, R. S., Maei, H. R., Precup, D., Bhatnagar, S., Silver, D., Szepesvari, C., and Wiewiora, E. (2009). Fast gradient-descent methods for temporal-difference learning with linear function approximation. *International Conference on Machine Learning (ICML)*.
12. Sutton, R. S., Mahmood, A. R., and White, M. (2016). An emphatic approach to the problem of off-policy temporal-difference learning. *Journal of Machine Learning Research*, 17(73), 1-29.
13. van Seijen, H., and Sutton, R. S. (2014). True online TD(λ). *International Conference on Machine Learning (ICML)*, 692-700. http://proceedings.mlr.press/v32/seijen14.pdf
14. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., et al. (2015). Human-level control through deep reinforcement learning. *Nature*, 518(7540), 529-533. https://www.nature.com/articles/nature14236
15. van Hasselt, H., Guez, A., and Silver, D. (2016). Deep reinforcement learning with double Q-learning. *AAAI Conference on Artificial Intelligence*.
16. Bellemare, M. G., Dabney, W., and Munos, R. (2017). A distributional perspective on reinforcement learning. *International Conference on Machine Learning (ICML)*. https://arxiv.org/abs/1707.06887
17. Hessel, M., Modayil, J., van Hasselt, H., Schaul, T., Ostrovski, G., Dabney, W., Horgan, D., Piot, B., Azar, M. G., and Silver, D. (2018). Rainbow: Combining improvements in deep reinforcement learning. *AAAI Conference on Artificial Intelligence*. https://arxiv.org/abs/1710.02298
18. Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., et al. (2016). Mastering the game of Go with deep neural networks and tree search. *Nature*, 529(7587), 484-489. https://www.nature.com/articles/nature16961
19. Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., et al. (2017). Mastering the game of Go without human knowledge. *Nature*, 550(7676), 354-359.
20. Silver, D., Hubert, T., Schrittwieser, J., et al. (2018). A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. *Science*, 362(6419), 1140-1144.
21. Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., et al. (2020). Mastering Atari, Go, chess and shogi by planning with a learned model. *Nature*, 588(7839), 604-609.
22. Munos, R., Stepleton, T., Harutyunyan, A., and Bellemare, M. G. (2016). Safe and efficient off-policy reinforcement learning. *Advances in Neural Information Processing Systems 29 (NeurIPS)*.
23. Precup, D., Sutton, R. S., and Singh, S. P. (2000). Eligibility traces for off-policy policy evaluation. *International Conference on Machine Learning (ICML)*.
24. van Hasselt, H., Doron, Y., Strub, F., Hessel, M., Sonnerat, N., and Modayil, J. (2018). Deep reinforcement learning and the deadly triad. *arXiv preprint arXiv:1812.02648*. https://arxiv.org/abs/1812.02648
25. ACM (2025). Andrew Barto and Richard Sutton receive the 2024 ACM A.M. Turing Award for developing the conceptual and algorithmic foundations of reinforcement learning. https://awards.acm.org/about/2024-turing
26. Schultz, W. (2016). Dopamine reward prediction error coding. *Dialogues in Clinical Neuroscience*, 18(1), 23-32. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4826767/

