SARSA (State-Action-Reward-State-Action)
Last reviewed
Apr 30, 2026
Sources
20 citations
Review status
Source-backed
Revision
v1 ยท 4,006 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
Apr 30, 2026
Sources
20 citations
Review status
Source-backed
Revision
v1 ยท 4,006 words
Add missing citations, update stale details, or suggest a clearer explanation.
SARSA is an on-policy temporal-difference (TD) reinforcement learning algorithm for estimating the action-value function Q^pi(s, a) of the policy the agent is currently following, and for improving that policy greedily over time. The name is an acronym for the quintuple (s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}) that drives a single update: the current state, the current action, the reward observed after taking that action, the next state, and the next action chosen by the policy. SARSA was first introduced by Gavin A. Rummery and Mahesan Niranjan at the University of Cambridge in 1994 under the name "Modified Connectionist Q-Learning" (MCQ-L). The acronym SARSA was suggested by Richard Sutton and appeared in a footnote, and it is the name that has stuck.
SARSA is one of the canonical algorithms in the model-free family of reinforcement learning (RL) methods, alongside Q-learning, Expected SARSA, and the actor-critic family. It is treated as a foundational example in Sutton and Barto's textbook Reinforcement Learning: An Introduction (1st edition 1998, 2nd edition 2018), where it appears in the chapter on temporal-difference control as the on-policy counterpart to Watkins's off-policy Q-learning. The two algorithms have nearly identical update structure but make a single deliberate change in the bootstrap target, and that change has consequences for safety, stability, and the kind of policy that is learned during exploration.
Consider an episode of interaction in a Markov decision process: the agent observes a state s_t, samples an action a_t from a behaviour policy pi (typically epsilon-greedy with respect to the current Q estimates), receives a reward r_{t+1}, and transitions to a new state s_{t+1}. To make a SARSA update the agent then chooses the next action a_{t+1} from the same policy pi, and uses the full quintuple (s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}) in the update.
The one-step SARSA update rule is:
Q(s_t, a_t) <- Q(s_t, a_t) + alpha * [ r_{t+1} + gamma * Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) ]
Here alpha in (0, 1] is the learning rate (step size) and gamma in [0, 1] is the discount factor that weighs future reward against immediate reward in the return. The bracketed term is the TD error: the difference between the current estimate Q(s_t, a_t) and the bootstrapped target r_{t+1} + gamma * Q(s_{t+1}, a_{t+1}). Because both Q values in the target are taken from the same Q table at the same time step, the update is incremental: only one entry of the table changes per step, and the change uses information already in the table.
If s_{t+1} is terminal, the convention is that Q(s_{t+1}, a_{t+1}) = 0, so the target reduces to r_{t+1}. The Sutton and Barto textbook (chapter 6.4 in the 1st edition) introduces this update as equation (6.5) and explains that the quintuple gives rise to the name SARSA.
The defining contrast in single-step temporal-difference control is the choice of bootstrap target. SARSA (on-policy) uses target = r_{t+1} + gamma * Q(s_{t+1}, a_{t+1}), where a_{t+1} is the action that the behaviour policy actually selects in s_{t+1}; SARSA learns Q^pi for the policy pi that the agent is following, including its exploration steps. Q-learning (off-policy) uses target = r_{t+1} + gamma * max_{a'} Q(s_{t+1}, a'); it learns Q*, the action-value function of the optimal greedy policy, regardless of the behaviour policy that generated the data. Watkins introduced Q-learning in his 1989 PhD thesis at the University of Cambridge.
The consequence is that SARSA tracks the value of the policy you are running, while Q-learning tracks the value of the policy you would run if you stopped exploring. When exploration is non-negligible those are two different things, which is why SARSA and Q-learning can converge to different action-value functions in stochastic or risk-laden environments even when both are run on the same data stream.
The earliest description of SARSA appears in the Cambridge University Engineering Department technical report CUED/F-INFENG/TR 166, On-Line Q-Learning Using Connectionist Systems, by Gavin A. Rummery and Mahesan Niranjan, dated September 1994. Rummery and Niranjan called the algorithm "Modified Connectionist Q-Learning" (MCQ-L) because they were studying TD control with neural-network function approximators rather than tables, and the modification (using the actually selected next action rather than the maximum) made the learning method consistent with the policy that was generating data. They reported more stable behaviour than Watkins's Q-learning under function approximation.
The acronym SARSA was suggested by Richard Sutton in correspondence with Rummery and was adopted in Rummery's 1995 Cambridge PhD thesis Problem Solving with Reinforcement Learning. Sutton and Barto popularised the name in their 1998 textbook, noting that SARSA "uses every element of the quintuple of events" (s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}). By the late 1990s the name MCQ-L had effectively disappeared from the literature in favour of SARSA.
In 1996, Richard Sutton presented "Generalization in Reinforcement Learning: Successful Examples Using Sparse Coarse Coding" at NIPS 8 (MIT Press, pp. 1038-1044), combining SARSA(lambda) with the CMAC tile-coding function approximator and reporting strong results on mountain-car, acrobot, and puddle-world tasks. This was an influential demonstration that on-policy TD control could scale to continuous spaces, at a time when Boyan and Moore (1995) had pointed out divergence problems for naive combinations of TD with function approximation.
The following is the canonical tabular SARSA control algorithm with epsilon-greedy exploration, following Figure 6.9 of Sutton and Barto (1998) and the equivalent presentation in the 2018 second edition.
Initialise Q(s, a) for all s in S, a in A(s), arbitrarily, except Q(terminal, .) = 0
For each episode:
Initialise s
Choose a from s using policy derived from Q (e.g. epsilon-greedy)
Repeat for each step of the episode:
Take action a, observe r and s'
Choose a' from s' using policy derived from Q (e.g. epsilon-greedy)
Q(s, a) <- Q(s, a) + alpha * [ r + gamma * Q(s', a') - Q(s, a) ]
s <- s'
a <- a'
Until s is terminal
Notice the key implementation detail that distinguishes SARSA from Q-learning: the next action a' is chosen before the update, and that same a' is both used in the update target Q(s', a') and executed on the next step. Choosing the action and using it in the bootstrap are linked, which is what makes the method on-policy.
On-policy means that the algorithm evaluates and improves the policy it is actually following. SARSA's update target uses Q(s_{t+1}, a_{t+1}) where a_{t+1} is sampled from the behaviour policy, so the value being estimated is Q^pi for that exploration policy, not Q* for the greedy policy. As exploration shrinks (epsilon goes to 0), the policy approaches the greedy policy and Q^pi approaches Q*. While exploration is non-trivial, however, SARSA and Q-learning estimate two different things from the same data.
The most-cited illustration is the cliff walking example in Sutton and Barto (1998), Example 6.6. The environment is a deterministic gridworld with a row of "cliff" cells along the bottom. Stepping into the cliff returns the agent to the start with reward -100. Every other step has reward -1. The optimal policy walks one cell above the cliff to reach the goal in the minimum number of steps. Under epsilon-greedy exploration with a moderate epsilon, the optimal greedy path occasionally falls off the cliff because of random exploratory actions next to the edge.
Q-learning, which bootstraps on the maximum next-state Q value, learns the optimal greedy values along the cliff edge, and as a result keeps walking next to the cliff and keeps falling in. Its on-line return is poor even though it learns the correct Q*. SARSA, which bootstraps on the actual next action, learns that cells near the cliff have low value because exploration pushes the agent off, so it learns a longer but safer path further from the edge. Sutton and Barto report that SARSA has higher on-line return than Q-learning throughout training on this task, and write that "if epsilon were gradually reduced, then both methods would asymptotically converge to the optimal policy." This is the canonical pedagogical example for the practical difference between on-policy and off-policy TD control.
Satinder Singh, Tommi Jaakkola, Michael L. Littman, and Csaba Szepesvari proved convergence of SARSA(0) to the optimal action-value function and policy in their 2000 Machine Learning paper "Convergence Results for Single-Step On-Policy Reinforcement-Learning Algorithms" (volume 38, pp. 287-308). The result requires a finite-state, finite-action MDP; the Robbins-Monro step-size conditions (sum of alpha_t = infinity and sum of alpha_t^2 < infinity); and a GLIE ("greedy in the limit with infinite exploration") behaviour policy in which every state-action pair is visited infinitely often, and in the limit the policy becomes greedy with respect to the current Q estimates. An epsilon-greedy policy with epsilon_t = 1/t satisfies GLIE. Under these conditions the iterates Q_t converge with probability 1 to Q*, and the learning policy pi_t converges to an optimal policy pi*. Sutton and Barto's first edition (1998), Section 6.4, cites a personal communication from Satinder Singh announcing the same result before it was published. For SARSA combined with linear function approximation, Gordon (2001) and Geramifard et al. (2013) establish convergence to a bounded region around the projected fixed point.
The family of SARSA-style algorithms has grown considerably since 1994. The major variants are summarised in the table below.
| Variant | Reference | Key idea |
|---|---|---|
| One-step SARSA | Rummery and Niranjan 1994 | The original. Bootstraps from the next state-action pair. |
| SARSA(lambda) with accumulating traces | Rummery and Niranjan 1994; Rummery 1995 | Adds eligibility traces e(s, a) and propagates a single TD error to all recently visited (s, a) pairs, weighted by lambda^k. Smoothly interpolates between one-step SARSA (lambda = 0) and Monte Carlo SARSA (lambda = 1). |
| Replacing-trace SARSA(lambda) | Singh and Sutton 1996 | Uses replacing traces (each visit resets the trace to 1), giving faster, less biased learning on tasks like mountain-car. |
| n-step SARSA | Sutton and Barto 1998, chapter 7 (1st ed.); chapter 7 (2nd ed.) | Bootstraps from Q(s_{t+n}, a_{t+n}) with the n-step return. Sits between one-step SARSA (n = 1) and Monte Carlo control (n = infinity). |
| Expected SARSA | van Seijen, van Hasselt, Whiteson, and Wiering 2009 | Replaces the sampled Q(s', a') with its expectation under the policy: target = r + gamma * sum_{a'} pi(a' |
| True online SARSA(lambda) | van Seijen and Sutton 2014; van Seijen et al. 2016 | An exact on-line implementation of the forward-view SARSA(lambda) update with linear function approximation. Maintains exact equivalence to the forward view at every step, not just in the limit of small step sizes. |
| Double SARSA / Double Expected SARSA | Ganger, Duryea, and Hu 2016; Hasselt 2010 (Double Q-learning analogue) | Maintains two estimators Q_A and Q_B and uses one to select the bootstrap action and the other to evaluate it, which reduces maximisation bias in stochastic environments. |
| Deep SARSA | Zhao et al. 2016; Elfwing, Uchibe, and Doya 2018 | Replaces the table with a deep neural network function approximator. Used as a stable on-policy alternative to DQN in some control benchmarks. |
The lambda extension multiplies the one-step TD error by an eligibility trace e_t(s, a), a fading memory of how recently the agent visited (s, a). After each step the algorithm updates every (s, a) by alpha * delta_t * e_t(s, a), where delta_t is the TD error. This propagates new information backwards along the trajectory much faster than one-step SARSA, which only updates the most recent (s, a). The hyperparameter lambda in [0, 1] interpolates between one-step TD (lambda = 0) and Monte Carlo (lambda = 1). Rummery and Niranjan (1994) introduced SARSA(lambda) with accumulating traces. Singh and Sutton's 1996 Machine Learning paper "Reinforcement Learning with Replacing Eligibility Traces" introduced the replacing-trace version, where each visit resets the trace to 1 rather than incrementing it. They showed that replacing traces are unbiased while accumulating traces are biased, and lead to faster, more reliable learning on the mountain-car task.
Harm van Seijen, Hado van Hasselt, Shimon Whiteson, and Marco Wiering presented Expected SARSA at IEEE ADPRL 2009. The update replaces the sampled bootstrap with an expectation under the current policy:
Q(s, a) <- Q(s, a) + alpha * [ r + gamma * sum_{a'} pi(a'|s') * Q(s', a') - Q(s, a) ]
With the same step size, Expected SARSA has the same bias as SARSA but lower variance, since the random next action has been integrated out. The lower variance allows higher learning rates: in deterministic environments, van Seijen et al. report that Expected SARSA can use alpha = 1. When the policy pi is greedy, Expected SARSA reduces to Q-learning.
Classical SARSA(lambda) only matches the forward-view lambda-return target in the limit of small step sizes. Harm van Seijen and Richard Sutton's 2014 ICML paper "True Online TD(lambda)" introduced a new form of trace that maintains exact equivalence to the forward view at every step. The 2016 Journal of Machine Learning Research paper "True Online Temporal-Difference Learning" by van Seijen, Mahmood, Pilarski, Machado, and Sutton extended the construction to control, giving true online SARSA(lambda). Across a range of domains, the true online methods are typically equal to or better than the classical accumulating- and replacing-trace versions.
With a neural network replacing the Q table, SARSA can be applied to high-dimensional perceptual problems in the same way that DQN extended Q-learning. Zhao, Wang, Shao, and Zhu's 2016 paper "Deep Reinforcement Learning with Experience Replay Based on SARSA" combined deep convolutional Q networks with the on-policy SARSA target and reported competitive performance against DQN on several Atari 2600 games. Elfwing, Uchibe, and Doya's 2018 Neural Networks paper used deep SARSA(lambda) with SiLU activation units and reported it outperforming DQN on certain Atari benchmarks. Deep SARSA tends to be more stable than off-policy methods because it avoids the bootstrapping issues that arise from replaying old transitions under a different policy.
| Property | SARSA | Q-learning |
|---|---|---|
| Update target | r + gamma * Q(s', a') with a' sampled from policy | r + gamma * max_{a'} Q(s', a') |
| Policy class | On-policy: learns Q^pi for behaviour policy | Off-policy: learns Q* regardless of behaviour policy |
| First published | Rummery and Niranjan 1994 (Cambridge tech report) | Watkins 1989 PhD thesis; Watkins and Dayan 1992 Machine Learning |
| Behaviour during exploration | Accounts for exploratory actions; learns safer path in cliff walking | Ignores exploratory actions in target; learns optimal greedy path even if it falls off the cliff during training |
| Variance of target | Includes randomness in next action | Lower (no randomness from action selection in target); but maximum operator introduces maximisation bias |
| Sample efficiency | Comparable on most tasks; sometimes faster on safety-critical tasks | Usually comparable; sometimes faster when risk does not matter |
| Convergence guarantees (tabular) | Singh, Jaakkola, Littman, Szepesvari 2000 under GLIE | Watkins and Dayan 1992; Tsitsiklis 1994 |
| Function approximation guarantees | Linear FA + epsilon-greedy converges to bounded region | Off-policy + linear FA can diverge (Baird 1995 counterexample); the "deadly triad" problem |
| When to use | Online learning while behaving; safety-critical exploration; risk-averse settings | Learning offline from logged data; learning the optimal policy regardless of exploration cost |
| Modern deep variants | Deep SARSA, true online SARSA(lambda) | DQN (Mnih et al. 2015), Double DQN, Rainbow, IQN, etc. |
A practical mnemonic: SARSA learns the value of "the policy I am, including my mistakes," while Q-learning learns the value of "the policy I would be if I committed to greedy."
The combination of (i) function approximation, (ii) bootstrapping, and (iii) off-policy training is known as the deadly triad. Sutton and Barto (2018, chapter 11) and van Hasselt, Doron, Strub, Hessel, Sonnerat, and Modayil ("Deep Reinforcement Learning and the Deadly Triad," arXiv:1812.02648, 2018) discuss how value estimates can become unbounded when all three are present. Baird (1995) provided the original counterexample for off-policy linear TD.
SARSA is on-policy by construction, so it does not satisfy the third leg of the triad. SARSA combined with linear function approximation and an epsilon-greedy behaviour policy converges to a bounded region around the projected Bellman fixed point, with error bound proportional to 1/(1 - gamma). This is one of the few large-scale RL settings with an actual convergence proof. By contrast, the divergence of off-policy TD methods like Q-learning under linear function approximation is a known phenomenon that has motivated many extensions (gradient TD, target networks, importance-corrected updates). This guarantee does not extend automatically to deep SARSA, because nonlinear function approximation breaks several assumptions used in linear convergence proofs. In practice, however, deep SARSA is observed to be more stable than DQN in some control benchmarks.
A common epsilon schedule is epsilon-greedy with decay: epsilon_t = max(epsilon_min, epsilon_0 * decay^t). To satisfy the GLIE condition for tabular convergence, one can use epsilon_t = c / N(s_t), where N(s_t) is the visit count, although in practice a fixed schedule is more common. Boltzmann (softmax) exploration is an alternative that produces a smoother on-policy distribution.
The Robbins-Monro conditions require alpha_t to decay to 0 slowly. In practice a constant alpha (0.1 to 0.5 tabular, 1e-4 to 1e-3 for neural networks) is often used in stationary tasks. SARSA(lambda) with replacing traces is a standard improvement for problems with delayed rewards, and in linear function approximation true online SARSA(lambda) is the recommended modern choice. With deep networks, replay buffers must be used carefully because experience replay reintroduces an off-policy element. Like Q-learning, SARSA is sensitive to reward scale and to gamma; standard practice is to clip rewards to [-1, 1] for deep RL and to choose gamma between 0.9 and 0.999.
SARSA is most appropriate when the agent has to learn while behaving and the cost of exploration matters. Robotics, process control, and autonomous driving have a real cost for risky exploration, and an on-policy method that takes the exploration policy into account when valuing states can produce safer training trajectories, as in the cliff-walking analogue. Risk-sensitive applications such as finance, healthcare, and energy management often want a policy that performs well under the same exploration policy that produced the data, not the policy that would perform well if exploration were turned off; SARSA's on-policy estimate of Q^pi is more useful for risk-sensitive policy improvement than Q-learning's optimistic Q*. In multi-agent and non-stationary settings, the optimal Q* is a moving target, and tracking the value of the current joint policy can be more useful than chasing the asymptotic optimum.
Q-learning, by contrast, tends to dominate when the agent is learning offline from a fixed dataset ("batch RL" or "offline RL") or when the goal is to recover the optimal greedy policy regardless of how exploration was done. The choice between the two is a function of how much you care about behaviour during training versus behaviour after training is over.
Most large-scale deep reinforcement learning today uses Q-learning variants such as DQN (Mnih et al. 2015, Nature), Double DQN, Dueling DQN, and Rainbow, or actor-critic methods such as A3C, PPO, SAC, and IMPALA. SARSA is not a workhorse in the way DQN-style methods or PPO are, but it remains important in a few roles. Tabular Q-learning and SARSA are nearly always presented together in the chapter on TD control because the comparison clarifies the on-policy / off-policy distinction, and the cliff walking example appears in essentially every reinforcement learning textbook. New TD methods are often evaluated against SARSA, Expected SARSA, and Q-learning on simple control benchmarks. The critic in many actor-critic algorithms is essentially a SARSA-style on-policy value estimator.
SARSA also appears in the reinforcement learning models literature as a baseline for new on-policy methods. Although SARSA itself is rarely the choice for a frontier system, the conceptual distinction it makes (between estimating Q^pi for the policy you are running and Q* for the policy you would run greedily) is one of the load-bearing ideas in the field, including in methods used for reinforcement learning from human feedback and RLHF, where on-policy versus off-policy distinctions also drive the choice of training algorithm.