Temporal-difference learning

Temporal-difference (TD) learning is a class of model-free reinforcement learning methods that learn value-function estimates by combining ideas from Monte Carlo simulation and dynamic programming. A TD agent samples transitions from experience, then updates its current estimate of how good a state is by comparing it to a target built from the immediate reward plus its own estimate of the next state's value. This blend of sampling (from real interaction) and bootstrapping (using existing estimates as the basis for new ones) gives TD methods their characteristic balance of low bias, moderate variance, and online operation without needing a model of the environment.

TD learning was introduced by Richard S. Sutton in his 1984 PhD thesis at the University of Massachusetts Amherst and described in detail in the 1988 Machine Learning paper "Learning to Predict by the Methods of Temporal Differences" (Sutton, 1988). It is the algorithmic backbone of most modern reinforcement learning, including SARSA, Q-learning, Deep Q-Networks (DQN), actor-critic methods, and the AlphaGo / AlphaZero / MuZero family. Andrew Barto and Richard Sutton received the 2024 ACM A.M. Turing Award for developing the conceptual and algorithmic foundations of reinforcement learning, with TD learning singled out by the citation as one of the most important advances.

What TD learning is trying to do

The basic problem is prediction: estimate the expected discounted return from each state of a Markov reward process. The return from time step t is

G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ...,

where γ in [0,1] is a discount factor. The value function V^π(s) under a policy π is the expected return from state s:

V^π(s) = E_π [G_t | S_t = s].

Classical methods compute V^π in two extreme ways. Dynamic programming (DP) plugs V^π into the Bellman equation and iterates, requiring a full model of transition probabilities and rewards. Monte Carlo (MC) instead samples whole trajectories and averages observed returns, requiring no model but having to wait for an episode to finish before any update. TD learning sits between the two. It samples a single transition (S_t, R_{t+1}, S_{t+1}) and uses its own current estimate V(S_{t+1}) as a stand-in for the rest of the unseen return. This is bootstrapping: building one estimate from another estimate.

Why does this matter? Two reasons. First, it lets the agent learn online, from a stream of experience, without waiting for episodes to terminate. Second, it dramatically reduces variance compared with Monte Carlo, because instead of summing many random rewards it leans on a learned summary of the future. The cost is bias: if V is wrong, the bootstrapped target is wrong too. The art of TD learning is managing this trade-off.

The TD(0) update

The simplest TD method, TD(0), updates the value estimate of S_t after each transition:

V(S_t) <- V(S_t) + α [R_{t+1} + γ V(S_{t+1}) - V(S_t)].

The bracketed quantity is the TD error:

δ_t = R_{t+1} + γ V(S_{t+1}) - V(S_t).

The TD error is the difference between two predictions of the same return: the new prediction R_{t+1} + γ V(S_{t+1}), which incorporates one extra step of real reward, and the old prediction V(S_t). The step size α in (0,1] controls how aggressively the agent corrects toward the new target. Sutton (1988) proved convergence in the mean for TD methods on linear prediction problems, and Tsitsiklis and Van Roy (1997) later proved convergence with probability one for on-policy linear TD(λ) under standard step-size and ergodicity conditions.

n-step TD and TD(λ)

TD(0) uses a one-step target. Monte Carlo uses an infinite-step (full return) target. n-step TD generalises across this spectrum:

G_{t:t+n} = R_{t+1} + γ R_{t+2} + ... + γ^{n-1} R_{t+n} + γ^n V(S_{t+n}),

V(S_t) <- V(S_t) + α [G_{t:t+n} - V(S_t)].

Larger n gives a target that depends more on real rewards and less on bootstrapped estimates, raising variance but lowering bias. Choosing the right n is a hyperparameter problem, and a fixed n is usually a compromise.

TD(λ) elegantly avoids picking a single n by averaging all n-step returns with exponentially decaying weights. The λ-return is

G_t^λ = (1 - λ) Σ_{n=1}^{∞} λ^{n-1} G_{t:t+n}.

With λ = 0 only the one-step return survives, recovering TD(0). With λ = 1 the weights collapse to the full Monte Carlo return. Intermediate λ values smoothly interpolate between TD and MC. This is the forward view.

The equivalent backward view, originally due to Sutton, uses eligibility traces e_t(s) to credit recent states for later TD errors. After every step the trace decays by γλ and is incremented by 1 for the visited state (accumulating traces) or set to 1 (replacing traces). Updates take the form

V(s) <- V(s) + α δ_t e_t(s) for all s.

In expectation the offline backward view matches the offline forward view exactly. Online they differ slightly, which Harm van Seijen and Richard Sutton corrected with true online TD(λ) (van Seijen and Sutton, 2014), introducing "dutch traces" that match the forward view at every step at the same computational cost as classical TD(λ).

Where TD sits among value-learning methods

Method	Bootstraps?	Samples experience?	Needs model?	Bias	Variance	Online updates?
Dynamic programming	Yes	No (uses expectations)	Yes (full P, R)	None (in the limit)	None	Sweep-based
Monte Carlo	No	Yes (full returns)	No	Low	High	Episode-end only
TD(0)	Yes (1 step)	Yes	No	Moderate	Low to moderate	Yes
n-step TD	Yes (n steps)	Yes	No	Decreases with n	Increases with n	Yes (after n steps)
TD(λ)	Yes (mixed)	Yes	No	Tunable via λ	Tunable via λ	Yes

The table makes the TD value proposition visible: it is the only family that can learn from a continuing stream of experience without a model and without waiting for episodes to end.

TD for control: SARSA, Q-learning, and friends

Predicting V is useful but the practical goal is usually control, choosing actions that maximise return. For that we estimate the action-value function Q^π(s, a) = E_π[G_t | S_t = s, A_t = a]. Several TD-based control algorithms exist, and they form the conceptual core of model-free RL.

SARSA (state-action-reward-state-action), introduced by Rummery and Niranjan in their 1994 Cambridge technical report "On-line Q-learning using connectionist systems" and later popularised under Sutton's snappier name, is the on-policy TD control method:

Q(S_t, A_t) <- Q(S_t, A_t) + α [R_{t+1} + γ Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)].

It evaluates and improves the same policy that is generating behaviour. With an epsilon-greedy policy, SARSA learns the value of the actually-followed exploratory policy.

Q-learning, introduced by Christopher Watkins in his 1989 Cambridge PhD thesis "Learning from Delayed Rewards" and proved convergent by Watkins and Dayan (1992), is the off-policy variant:

Q(S_t, A_t) <- Q(S_t, A_t) + α [R_{t+1} + γ max_a Q(S_{t+1}, a) - Q(S_t, A_t)].

The target uses the greedy action's value rather than the actually-chosen action's value, so Q-learning estimates the optimal action-value function regardless of how exploratory the behaviour policy is. This separation of behaviour and target policies is the defining feature of off-policy learning. The tabular Q-learning variant stores Q values in a lookup table; the deep variants replace the table with a neural network.

Expected SARSA (van Seijen, van Hasselt, Whiteson and Wiering, 2009) replaces the sampled next action with its expectation under the policy:

Q(S_t, A_t) <- Q + α [R_{t+1} + γ Σ_a π(a | S_{t+1}) Q(S_{t+1}, a) - Q(S_t, A_t)].

It has lower variance than SARSA at no extra bias and works for both on-policy and off-policy targets. Q-learning is the special case where the target policy is greedy.

Double Q-learning (van Hasselt, 2010) addresses the systematic positive bias of the max operator in Q-learning by maintaining two estimators and using one to select actions and the other to evaluate them. Deep Double Q-Networks (DDQN, van Hasselt, Guez and Silver, 2016) extend the same idea to neural networks and meaningfully improve DQN's stability.

Other members of the family include n-step SARSA, tree-backup (Precup, Sutton and Singh, 2000) for off-policy multi-step learning without importance sampling, Q(σ) which interpolates between sampling and expectation, and Retrace(λ) (Munos et al., 2016), which combines safe off-policy multi-step returns with bias control.

Why neural networks make TD harder: the deadly triad

In the tabular setting TD methods enjoy strong convergence guarantees. Once we add function approximation, especially with neural networks, the picture darkens. Sutton and Barto (2018, chapter 11) call the combination of three innocent-looking ingredients the deadly triad:

Function approximation (e.g. a neural network).
Bootstrapping (TD targets that use V or Q estimates).
Off-policy training (learning about a different policy than the one collecting data).

When all three are present together, value estimates can diverge. Tsitsiklis and Van Roy (1997) constructed simple counterexamples that diverge under linear TD with off-policy data. The Gradient TD family (Sutton, Maei and Szepesvari, 2009; Sutton et al., 2009), including GTD, GTD2, and TDC (Temporal Difference with gradient Correction), provides the first off-policy linear methods with convergence guarantees by performing true stochastic gradient descent on a projected Bellman error. Emphatic TD (Sutton, Mahmood and White, 2016) restores stability via an emphasis weighting that re-balances state visitation. In deep RL, practical mitigations include target networks, experience replay, and clipped objectives. Hado van Hasselt and colleagues explicitly studied the triad in the deep setting in "Deep Reinforcement Learning and the Deadly Triad" (2018), finding that divergence is real but rarer in practice than the worst-case theory suggests.

Connection to neuroscience: dopamine and reward prediction error

One of the most striking results in computational neuroscience is the discovery that midbrain dopamine neurons appear to encode a TD error. In a sequence of single-unit recording experiments in monkeys, Wolfram Schultz showed that dopaminergic neurons in the ventral tegmental area and substantia nigra pars compacta fire above baseline at the time of an unexpected reward, fire above baseline at the time of a predictor of reward once that predictor has been learned, and fire below baseline at the time a predicted reward fails to arrive.

In the 1997 Science paper "A Neural Substrate of Prediction and Reward", Schultz, Peter Dayan, and P. Read Montague proposed that this firing pattern is exactly what a TD-learning algorithm computes: the discrepancy δ_t = R_{t+1} + γ V(S_{t+1}) - V(S_t). The hypothesis was so clean and predictive that it changed the language of reward neuroscience. Reward prediction error (RPE) is now a standard tool for thinking about dopamine, addiction, learning disorders, and computational psychiatry. Whether the mapping is literally correct in detail remains debated, but the qualitative parallel between dopamine bursts and TD errors is one of the strongest known links between a machine-learning algorithm and a biological mechanism.

Famous applications

The headline successes of modern reinforcement learning are, almost without exception, TD methods scaled up.

Year	System	What TD did
1995	TD-Gammon (Tesauro)	TD(λ) trained a backgammon evaluation network through self-play to near-world-champion level.
2013	DQN preprint (Mnih et al.)	Q-learning with a deep convolutional network learning Atari from pixels.
2015	DQN, Nature (Mnih et al.)	Human-level performance on 49 Atari 2600 games using TD plus target networks and experience replay.
2016	AlphaGo (Silver et al.)	Value network trained with TD-style updates, combined with policy network and Monte Carlo tree search, defeated European champion Fan Hui then Lee Sedol.
2016	Double DQN (van Hasselt, Guez, Silver)	Reduced Q-learning's overestimation bias in the Atari benchmark.
2017	C51 / categorical DQN (Bellemare, Dabney, Munos)	Distributional TD learning, predicting a return distribution rather than its mean, with 51 atoms.
2017	AlphaGo Zero (Silver et al.)	Pure self-play with TD-trained value head, no human games.
2018	AlphaZero (Silver et al.)	Generalised AlphaGo Zero to chess and shogi from scratch.
2018	Rainbow DQN (Hessel et al.)	Combined six DQN improvements (double Q, prioritised replay, dueling, multi-step, distributional, noisy nets) into one state-of-the-art agent.
2018	QR-DQN (Dabney et al.)	Quantile regression distributional TD.
2019	SAC, TD3 (Haarnoja et al.; Fujimoto et al.)	Twin TD critics in continuous control.
2020	MuZero (Schrittwieser et al., Nature)	Learns its own model and uses TD-style value updates to plan, mastering Go, chess, shogi, and 57 Atari games.

Gerald Tesauro's TD-Gammon (1995) deserves special mention. It used TD(λ) and a single-hidden-layer neural network trained by self-play, with no expert labels and no opening book, and reached a level commonly described as among the world's strongest. It was the first time TD with function approximation produced a champion-level player and is routinely cited in DQN, AlphaGo, and AlphaZero papers as inspiration.

The Mnih et al. (2015) Nature DQN paper showed that the same TD-control algorithm, augmented with a target network and a replay buffer, could learn to play 49 Atari games directly from raw pixels. The Silver et al. (2016) Nature AlphaGo paper combined a TD-trained value network with Monte Carlo tree search and a supervised-then-reinforcement-trained policy network. AlphaGo Zero (2017) and AlphaZero (2018) removed the supervised pre-training entirely. MuZero (Schrittwieser et al., 2020) added a learned model of the environment so the agent could plan in latent space, while still using TD-style value backups to ground the search.

Variants and extensions

Variant	Year	What it adds
TD(0)	1988	Single-step bootstrapped value update.
TD(λ), eligibility traces	1988	Exponentially-weighted average of n-step returns; backward-view trace mechanism.
n-step TD / n-step SARSA	1980s onward	Fixed-horizon multi-step targets.
SARSA	1994 (Rummery & Niranjan)	On-policy TD control on Q.
Q-learning	1989 (Watkins)	Off-policy TD control using max over next actions.
Tabular Q-learning	1989	The lookup-table form of Q-learning.
Expected SARSA	2009 (van Seijen et al.)	Replaces next-action sample with expectation under the policy.
Tree-backup	2000 (Precup, Sutton, Singh)	Multi-step off-policy learning without importance sampling.
Double Q-learning	2010 (van Hasselt)	Two estimators to remove the max bias.
GTD, GTD2, TDC	2009 (Sutton et al.)	Convergent off-policy linear TD via a projected Bellman objective.
Emphatic TD	2016 (Sutton, Mahmood, White)	Stable off-policy TD via state-emphasis weighting.
True online TD(λ)	2014 (van Seijen and Sutton)	Online algorithm exactly matching the forward view via dutch traces.
DQN	2013 / 2015 (Mnih et al.)	Deep Q-learning with replay buffer and target network.
Double DQN	2016 (van Hasselt, Guez, Silver)	Double-Q correction on top of DQN.
Prioritised experience replay	2016 (Schaul et al.)	Sampling replays in proportion to TD error magnitude.
Dueling DQN	2016 (Wang et al.)	Separates value and advantage streams.
C51 / categorical DQN	2017 (Bellemare, Dabney, Munos)	Distributional TD with a fixed support.
QR-DQN	2018 (Dabney et al.)	Distributional TD via quantile regression.
Retrace(λ)	2016 (Munos et al.)	Safe off-policy multi-step returns.
Rainbow DQN	2018 (Hessel et al.)	Six DQN extensions combined in one agent.
Q(σ)	2017 (de Asis et al.)	Interpolates between full sampling and full expectation.

Function approximation

With linear function approximation (V(s) ≈ θ^T φ(s) for some feature vector φ) on-policy TD(λ) is well-understood: convergence to the projected Bellman fixed point with probability one (Tsitsiklis and Van Roy, 1997). Eligibility traces, true online TD(λ), and Gradient TD all play nicely with linear features.

With neural networks, theory lags behind practice. Empirically, deep TD methods such as DQN and SAC work very well, but only with engineering tricks that mitigate the deadly triad: target networks (a slowly-updated copy of Q used in the bootstrap target), large replay buffers, normalised gradients, careful exploration, and reward clipping. Other approximators (kernel methods, decision trees, random projections) have all been studied but none has the empirical reach of neural nets.

How TD learning shows up across the RL landscape

TD learning is rarely the entire algorithm in modern systems. More often it provides the critic in a larger architecture.

Actor-critic methods (Barto, Sutton, Anderson, 1983) learn a parameterised policy (the actor) and a TD-trained value function (the critic). Modern variants include A2C/A3C (Mnih et al., 2016), DDPG (Lillicrap et al., 2016), TD3 (Fujimoto, Hoof, Meger, 2018), SAC (Haarnoja et al., 2018), and PPO (Schulman et al., 2017). The critic in PPO is a value network trained with a TD-like objective on bootstrapped multi-step returns; the actor uses the resulting advantage estimates.
Model-based RL with TD includes Dyna-Q (Sutton, 1990), which interleaves real and simulated experience, and MuZero (Schrittwieser et al., 2020), which learns a latent dynamics model and uses TD value updates inside its tree search.
Offline (batch) RL uses TD updates on a fixed dataset, with regularisers to keep the policy close to the data distribution. Examples include BCQ (Fujimoto, Meger, Precup, 2019), CQL (Kumar et al., 2020), and IQL (Kostrikov, Nair, Levine, 2022).
Reinforcement learning from human feedback for large language models typically uses PPO with a value head, so the underlying advantage estimator is built on TD-style bootstrapping.

Practical considerations

Getting TD methods to work in practice is mostly about taming a few well-known knobs.

Step size α. Too large and updates oscillate or diverge; too small and learning is glacial. Schedules (e.g. 1/n in tabular settings) can help, and for stochastic gradient TD with neural nets the usual deep-learning intuitions about Adam, RMSProp, and learning-rate warmup apply.
Discount γ. Smaller γ makes the problem easier and more myopic; larger γ pushes the effective horizon out and amplifies bootstrapping errors.
Trace decay λ. A common rule of thumb is to start in the 0.8 to 0.95 range, then tune. λ closer to 1 helps long-horizon problems; λ closer to 0 helps when the value function is unstable.
Initialisation. Optimistic initial values can encourage exploration in tabular control; in deep RL initialisation interacts strongly with the target network update rate.
Exploration. Epsilon-greedy, Boltzmann, parameter-space noise (NoisyNets), curiosity bonuses, and entropy regularisation are all in regular use.
Stability tricks. Target networks, replay buffers (uniform or prioritised), reward clipping, gradient clipping, double-Q correction, and ensemble critics are essentially standard equipment in deep TD methods.

Open challenges

Stability with deep nets. Even with target networks and replay, divergence and policy collapse remain real failure modes, especially in continuous control and offline settings.
Sample efficiency. TD methods often need millions of interactions to reach human-level Atari play. Closing the gap with model-based methods, demonstrations, and pre-training is an active research frontier.
Off-policy evaluation. Estimating the value of a policy from data collected by another policy is fundamental for safe deployment but has high variance with importance sampling and high bias with bootstrapping.
Long-horizon credit assignment. When rewards are very sparse and far in the future, bootstrapping errors compound. Eligibility traces, hierarchical RL, and successor representations are partial answers.
Continual learning. TD networks tend to forget earlier tasks when trained on a new one; the interaction between TD and catastrophic forgetting is poorly understood.
Plasticity loss. Recent work, much of it from Sutton and collaborators, shows that deep TD agents lose the ability to learn from new data after long training runs, motivating new optimisation techniques.

References

Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. *Machine Learning*, 3(1), 9-44. https://link.springer.com/article/10.1007/BF00115009
Sutton, R. S., and Barto, A. G. (2018). *Reinforcement Learning: An Introduction* (2nd ed.). MIT Press. Chapters 6, 7, 11, 12. http://incompleteideas.net/book/the-book-2nd.html
Watkins, C. J. C. H. (1989). *Learning from Delayed Rewards*. PhD thesis, King's College, University of Cambridge.
Watkins, C. J. C. H., and Dayan, P. (1992). Q-learning. *Machine Learning*, 8(3-4), 279-292. https://link.springer.com/article/10.1007/BF00992698
Tesauro, G. (1995). Temporal difference learning and TD-Gammon. *Communications of the ACM*, 38(3), 58-68. https://dl.acm.org/doi/10.1145/203330.203343
Schultz, W., Dayan, P., and Montague, P. R. (1997). A neural substrate of prediction and reward. *Science*, 275(5306), 1593-1599. https://www.science.org/doi/10.1126/science.275.5306.1593
Tsitsiklis, J. N., and Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. *IEEE Transactions on Automatic Control*, 42(5), 674-690. https://www.mit.edu/~jnt/Papers/J063-97-bvr-td.pdf
Rummery, G. A., and Niranjan, M. (1994). On-line Q-learning using connectionist systems. Technical Report CUED/F-INFENG/TR 166, Cambridge University Engineering Department.
van Seijen, H., van Hasselt, H., Whiteson, S., and Wiering, M. (2009). A theoretical and empirical analysis of expected SARSA. *IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL)*, 177-184.
van Hasselt, H. (2010). Double Q-learning. *Advances in Neural Information Processing Systems 23 (NeurIPS)*, 2613-2622.
Sutton, R. S., Maei, H. R., Precup, D., Bhatnagar, S., Silver, D., Szepesvari, C., and Wiewiora, E. (2009). Fast gradient-descent methods for temporal-difference learning with linear function approximation. *International Conference on Machine Learning (ICML)*.
Sutton, R. S., Mahmood, A. R., and White, M. (2016). An emphatic approach to the problem of off-policy temporal-difference learning. *Journal of Machine Learning Research*, 17(73), 1-29.
van Seijen, H., and Sutton, R. S. (2014). True online TD(λ). *International Conference on Machine Learning (ICML)*. http://proceedings.mlr.press/v32/seijen14.pdf
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., et al. (2015). Human-level control through deep reinforcement learning. *Nature*, 518(7540), 529-533. https://www.nature.com/articles/nature14236
van Hasselt, H., Guez, A., and Silver, D. (2016). Deep reinforcement learning with double Q-learning. *AAAI Conference on Artificial Intelligence*.
Bellemare, M. G., Dabney, W., and Munos, R. (2017). A distributional perspective on reinforcement learning. *International Conference on Machine Learning (ICML)*. https://arxiv.org/abs/1707.06887
Hessel, M., Modayil, J., van Hasselt, H., Schaul, T., Ostrovski, G., Dabney, W., Horgan, D., Piot, B., Azar, M. G., and Silver, D. (2018). Rainbow: Combining improvements in deep reinforcement learning. *AAAI Conference on Artificial Intelligence*. https://arxiv.org/abs/1710.02298
Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., et al. (2016). Mastering the game of Go with deep neural networks and tree search. *Nature*, 529(7587), 484-489. https://www.nature.com/articles/nature16961
Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., et al. (2017). Mastering the game of Go without human knowledge. *Nature*, 550(7676), 354-359.
Silver, D., Hubert, T., Schrittwieser, J., et al. (2018). A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. *Science*, 362(6419), 1140-1144.
Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., et al. (2020). Mastering Atari, Go, chess and shogi by planning with a learned model. *Nature*, 588(7839), 604-609.
Munos, R., Stepleton, T., Harutyunyan, A., and Bellemare, M. G. (2016). Safe and efficient off-policy reinforcement learning. *Advances in Neural Information Processing Systems 29 (NeurIPS)*.
Precup, D., Sutton, R. S., and Singh, S. P. (2000). Eligibility traces for off-policy policy evaluation. *International Conference on Machine Learning (ICML)*.
van Hasselt, H., Doron, Y., Strub, F., Hessel, M., Sonnerat, N., and Modayil, J. (2018). Deep reinforcement learning and the deadly triad. *arXiv preprint arXiv:1812.02648*. https://arxiv.org/abs/1812.02648
ACM (2025). 2024 ACM A.M. Turing Award: Andrew Barto and Richard Sutton. https://awards.acm.org/about/2024-turing

Temporal-difference learning

What TD learning is trying to do

The TD(0) update

n-step TD and TD(λ)

Where TD sits among value-learning methods

TD for control: SARSA, Q-learning, and friends

Why neural networks make TD harder: the deadly triad

Connection to neuroscience: dopamine and reward prediction error

Famous applications

Variants and extensions

Function approximation

How TD learning shows up across the RL landscape

Practical considerations

Open challenges

References

Improve this article

Related Articles

Machine learning terms/Reinforcement Learning

ARC-AGI 2

AlphaGo

Monte Carlo Tree Search

Soft Actor-Critic

Twin Delayed DDPG

Temporal-difference learning

What TD learning is trying to do

The TD(0) update

n-step TD and TD(λ)

Where TD sits among value-learning methods

TD for control: SARSA, Q-learning, and friends

Why neural networks make TD harder: the deadly triad

Connection to neuroscience: dopamine and reward prediction error

Famous applications

Variants and extensions

Function approximation

How TD learning shows up across the RL landscape

Practical considerations

Open challenges

References

Related Articles

Machine learning terms/Reinforcement Learning

ARC-AGI 2

AlphaGo

Monte Carlo Tree Search

Soft Actor-Critic

Twin Delayed DDPG