Richard S. Sutton
Last reviewed
May 18, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,520 words
Richard Stuart Sutton (born 1957 or 1958) is a Canadian–American computer scientist widely regarded as one of the founders of modern computational reinforcement learning. He is a professor of computing science at the University of Alberta, a fellow and Chief Scientific Advisor at the Alberta Machine Intelligence Institute (Amii), and a research scientist at John Carmack's artificial-general-intelligence startup Keen Technologies. With his long-time collaborator and former doctoral advisor Andrew G. Barto, Sutton received the 2024 ACM A.M. Turing Award "for developing the conceptual and algorithmic foundations of reinforcement learning."[^1][^2]
Sutton's research career, which began with a Stanford undergraduate thesis on trial-and-error learning, has spanned more than four decades and shaped almost every major idea in present-day reinforcement learning. He introduced temporal-difference (TD) learning in 1988, co-developed the actor–critic family of algorithms, formulated the Dyna architecture that unifies model-free learning with planning, helped found policy-gradient theory with function approximation, and (with Doina Precup and Satinder Singh) developed the options framework for temporal abstraction.[^3][^4] His textbook with Barto, "Reinforcement Learning: An Introduction" (MIT Press; first edition 1998, second edition 2018), is the canonical reference for the field and has been cited well over 80,000 times.[^3][^5]
Beyond his technical work, Sutton is widely known for "The Bitter Lesson," a short 2019 essay arguing that general methods that scale with compute reliably outperform approaches built on hand-crafted human knowledge, an argument that has become a touchstone for debates over scaling, deep learning, and the future of AI research.[^6][^7] Since 2003 he has built one of the world's largest reinforcement-learning research groups at the University of Alberta in Edmonton, and from 2017 to 2023 he led Google DeepMind's first non-UK research office in the same city.[^8][^9] In September 2023 he announced a new partnership with John Carmack aimed at building a working artificial general intelligence prototype "by 2030."[^10]
| Field | Details |
| --- | --- |
| Born | 1957 or 1958, Toledo, Ohio, U.S.[^3] |
| Citizenship | Canadian (naturalized 2015); previously American[^3] |
| Education | BA Psychology, Stanford University (1978); MS Computer Science, University of Massachusetts Amherst (1980); PhD Computer Science, University of Massachusetts Amherst (1984)[^3][^11] |
| Doctoral advisor | Andrew G. Barto[^1][^3] |
| Doctoral thesis | "Temporal Credit Assignment in Reinforcement Learning" (1984)[^3] |
| Known for | Temporal-difference learning; actor–critic methods; Dyna architecture; options framework; policy-gradient methods; "Reinforcement Learning: An Introduction"; "The Bitter Lesson"[^1][^3][^6] |
| Doctoral students (selected) | David Silver, Doina Precup[^3] |
| Current positions | Professor of Computing Science, University of Alberta (since 2003); Fellow, Chief Scientific Advisor & Canada CIFAR AI Chair, Amii; Research Scientist, Keen Technologies (since 2023)[^3][^12] |
| Major award | 2024 ACM A.M. Turing Award, shared with Andrew G. Barto[^1][^2] |
Richard S. Sutton was born in 1957 (sources also give 1958) in Toledo, Ohio, in the United States. He later took Canadian citizenship in 2015 after moving permanently to Edmonton, Alberta, and renounced his U.S. citizenship in 2017, making him a Canadian citizen rather than an American at the time he received the Turing Award.[^3]
Sutton studied at Stanford University, where he received a Bachelor of Arts in psychology in 1978. The Stanford program shaped a lifelong interest in animal learning and the psychological roots of intelligent behavior; his earliest published work, completed before he had a graduate degree in computer science, applied ideas from operant conditioning to artificial learning systems. He has long pointed to psychology — and particularly to the literature on classical and operant conditioning of animals — as a primary inspiration for the reinforcement-learning paradigm, in which an agent learns by interacting with an environment and observing rewards.[^3][^11]
He then moved to the University of Massachusetts Amherst, where he was supervised by Andrew G. Barto, then a young computer-science professor with a similar interest in adaptive systems. Sutton was Barto's first doctoral student. Sutton earned his Master of Science in computer and information science in 1980 and a Ph.D. in 1984. His doctoral thesis, "Temporal Credit Assignment in Reinforcement Learning," was the first systematic treatment of the credit-assignment problem that lies at the heart of reinforcement learning: how an agent should attribute long-delayed rewards to the specific decisions that caused them. The thesis introduced what would later become known as temporal-difference (TD) learning and laid out an early version of the actor–critic architecture, in which two adaptive modules — an "actor" that selects actions and a "critic" that evaluates them — co-evolve.[^3][^11]
The collaboration between Barto and Sutton that began in graduate school continued throughout both of their careers, producing dozens of joint papers and the eventual textbook that defined the field. ACM, in its 2024 Turing Award announcement, would later describe the pair as having "introduced the main ideas, constructed the mathematical foundations, and developed important algorithms for reinforcement learning" over a continuous period of more than four decades.[^1][^11]
After a brief postdoctoral period at UMass Amherst in 1984, Sutton joined GTE Laboratories in Waltham, Massachusetts, in 1985 as a principal member of technical staff. He remained there until 1994, working on connectionist learning, control, and predictive learning systems.[^3] It was during the GTE years that he published "Learning to predict by the methods of temporal differences" (Machine Learning, 1988), the paper that formally established TD learning as a class of algorithms and proved its first convergence results. The paper is among the most influential in the history of machine learning; nearly every modern value-based RL algorithm (Q-learning, SARSA, deep Q-networks, the value heads of AlphaZero and successor systems) traces its lineage back to it.[^4][^5][^11]
Also at GTE, Sutton developed the Dyna architecture (published in 1990–1991), which became one of the first concrete proposals for integrating learning, planning, and acting within a single agent.[^3][^4] During this period he also began the extended series of collaborations with Barto that would eventually become "Reinforcement Learning: An Introduction." The two had circulated draft chapters of the textbook within the research community for several years before its formal publication.[^5]
Sutton then returned to the University of Massachusetts Amherst from 1995 to 1998 as a senior research scientist in Barto's group, before moving to industry again in 1998. From 1998 to 2002 he was a principal technical staff member at AT&T Labs' Shannon Laboratory in Florham Park, New Jersey, working alongside other notable AI and statistics researchers in the laboratory's "Artificial Intelligence Department." The Shannon Laboratory was at the time one of the most concentrated centers of machine-learning research in industry, and Sutton's tenure there overlapped with the heyday of statistical learning at Bell Labs and AT&T.[^3]
During the AT&T years, Sutton and Barto completed and published the first edition of their textbook, "Reinforcement Learning: An Introduction" (MIT Press, 1998), which would go on to define the field. He also co-authored, in 1999 and 2000, two of his most-cited papers: the options-framework paper in Artificial Intelligence and the policy-gradient theorem paper at NeurIPS.[^3][^5]
In 2003 Sutton accepted a professorship in the Department of Computing Science at the University of Alberta in Edmonton, Canada. He has been there ever since and has built the university into a global center for reinforcement-learning research.[^8][^12]
Sutton founded and continues to lead the Reinforcement Learning and Artificial Intelligence (RLAI) Lab at U of A. By the mid-2020s the lab had grown to roughly ten principal investigators and more than one hundred researchers, making it one of the largest reinforcement-learning groups in the world. Edmonton's reputation as a hub for RL (distinct, for example, from Mila in Montreal or the Vector Institute in Toronto) is largely the product of Sutton's lab, in combination with the broader work of Michael Bowling's computer-games and poker group, the contributions of Patrick Pilarski in rehabilitation robotics, and the long-running computer-checkers program of Jonathan Schaeffer.[^12][^13]
At U of A, Sutton initially held the AT&T-Alberta Research Chair, named in part for his former employer. He was later named a Canada CIFAR AI Chair under the Pan-Canadian AI Strategy.[^12] He is also a Fellow and the Chief Scientific Advisor of the Alberta Machine Intelligence Institute (Amii), an Edmonton-based AI institute that grew out of the Alberta Ingenuity Centre for Machine Learning (AICML), founded in 2002 jointly by the University of Alberta and the Government of Alberta. Amii was rebranded under its current name in 2017 when it was designated one of three national AI institutes under the CIFAR Pan-Canadian AI Strategy (alongside Mila in Quebec and the Vector Institute in Ontario).[^14][^12]
Through both Amii and his own lab, Sutton has supervised or mentored close to sixty graduate students and postdoctoral researchers. Two of the most prominent are David Silver, later the principal architect of AlphaGo and AlphaZero at Google DeepMind, and Doina Precup, who became a co-creator of the options framework, head of DeepMind's Montreal office, and a leading figure in hierarchical RL. Other prominent former students and postdoctoral collaborators include Michael Bowling (now also a professor at U of A and a key figure behind the DeepStack poker system), Csaba Szepesvári, Adam White, and Martha White. By 2025, Sutton's published work had been cited more than 140,000 times in Google Scholar.[^3][^12][^13]
Sutton's work at the University of Alberta has also fed directly into the AI-and-robotics community in Edmonton. His group's research has been applied in areas including assistive robotics (with Patrick Pilarski), continual learning, and life-long learning — fields that emphasize agents which keep learning after deployment, rather than being trained once and frozen.[^12]
On July 5, 2017, Google's DeepMind announced its first research office outside the United Kingdom, located in Edmonton and known as DeepMind Alberta. The announcement explicitly described the new lab as a deep partnership with the University of Alberta. Sutton (described in DeepMind's announcement as "the pioneer of reinforcement learning - and DeepMind's first ever advisor from back in 2010") co-founded and led the lab together with U of A colleagues Michael Bowling and Patrick Pilarski, all of whom retained their university professorships. He held the title of Distinguished Research Scientist at DeepMind.[^9]
DeepMind had been working closely with researchers in Edmonton for years before formally opening the office. Sutton had served as an advisor to the company since approximately 2010, shortly after DeepMind was founded by Demis Hassabis, Shane Legg, and Mustafa Suleyman in London. DeepMind committed to long-term funding for AI programs at the University of Alberta as part of the 2017 arrangement, and the lab grew to include more than a dozen researchers, several of whom were co-authors of the influential DeepStack poker paper that the Edmonton group had published earlier.[^9]
The Edmonton office was thus, by design, a tightly coupled academic–industrial partnership: the same researchers held both DeepMind and University of Alberta affiliations, and the lab's work fed directly into both DeepMind's broader research agenda and U of A's PhD programs. Sutton continued his teaching and graduate-student supervision throughout this period.[^9][^12]
In January 2023, after Google reorganized DeepMind as part of broader cost cuts at Alphabet and consolidated the company's research labs, the Edmonton office was closed. Sutton chose to remain in Alberta and continue his research at the University of Alberta and Amii. His DeepMind affiliation ended in 2023, though his collaborations with DeepMind researchers — most notably David Silver — continued (see "Welcome to the Era of Experience," 2025).[^3][^15][^16]
On September 25, 2023, programmer John Carmack (best known as the co-creator of the Doom and Quake game engines and a co-founder of id Software) and Sutton publicly announced a new partnership aimed at accelerating the development of artificial general intelligence (AGI). Sutton joined Carmack's startup Keen Technologies, a small Texas-based AGI research company that Carmack had funded out of his own resources and which had raised an initial seed round in 2022. He joined as a research scientist while retaining his positions at the University of Alberta and Amii.[^10][^15]
In the joint announcement, Sutton said: "I am excited to partner with John and the rest of the team at Keen Technologies. John is a powerful intellect and one of the world's greatest system engineers." Carmack, for his part, framed the collaboration as a deliberate alternative to the dominant approaches of the largest AI labs: "The AI space is awash in capital, compute, and data, but it is still dominated by fashions that may yet hinder important breakthroughs." The two stated that they were focused on developing "a genuine AI prototype by 2030," including establishing and documenting what they called "AGI signs of life."[^10]
The pairing was widely noted by AI researchers and the technology press because it joined one of reinforcement learning's most influential theorists with one of the games industry's most celebrated low-level programmers. Keen's approach, as described publicly by both founders, has emphasized small focused teams, on-device learning, and reinforcement-learning-style "experience" rather than the very large pretraining runs typical of the major AI labs — a perspective that lines up with both Sutton's "Bitter Lesson" and his more recent emphasis on continual, on-the-job learning.[^10][^16]
Sutton's body of research is unusually broad, but a small number of contributions stand out for their lasting impact on machine learning and AI research.
Sutton introduced temporal difference learning (TD learning) in his 1984 doctoral thesis and formalized it in the 1988 paper "Learning to predict by the methods of temporal differences," published in the journal Machine Learning. TD learning combines ideas from Monte Carlo methods and dynamic programming: it allows an agent to update its predictions about the long-run future at every time step, by comparing successive predictions ("bootstrapping"), rather than waiting for a final outcome. The 1988 paper proved convergence under certain conditions and demonstrated TD's effectiveness on prediction tasks such as random-walk Markov chains.[^4][^1]
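The bootstrapping idea can be illustrated with a minimal tabular TD(0) sketch on a small random walk of the kind mentioned above. This is not code from the 1988 paper; the function name, the five-state layout, the step size, and the reward scheme (1 for exiting on the right, 0 otherwise) are assumptions chosen for illustration.

```python
import random


def td0_random_walk(num_states=5, episodes=5000, alpha=0.05, gamma=1.0, seed=0):
    """Tabular TD(0) prediction on a symmetric random walk (illustrative setup).

    States 1..num_states are non-terminal; 0 and num_states + 1 are terminal.
    The reward is 1 when the walk exits on the right and 0 otherwise.
    """
    rng = random.Random(seed)
    values = [0.0] * (num_states + 2)  # terminal entries are never updated
    for _ in range(episodes):
        state = (num_states + 1) // 2  # start in the middle
        while 0 < state < num_states + 1:
            next_state = state + rng.choice((-1, 1))
            reward = 1.0 if next_state == num_states + 1 else 0.0
            # TD error: bootstrapped target minus the current prediction.
            td_error = reward + gamma * values[next_state] - values[state]
            values[state] += alpha * td_error  # update at every time step
            state = next_state
    return values[1:-1]


estimates = td0_random_walk()
# For this walk the true values are i/6 for states i = 1..5.
true_values = [i / 6 for i in range(1, 6)]
print([round(v, 2) for v in estimates])
```

Each update uses only the next state's current estimate, so learning proceeds during the episode rather than after the final outcome is known.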
TD learning underpins almost all modern value-based RL algorithms, including Q-learning (Watkins, 1989, building directly on Sutton's TD work), SARSA (named after the state–action–reward–state–action tuple used in its update), and the deep-RL methods used in systems such as DeepMind's deep Q-network, AlphaGo, AlphaZero, and AlphaProof.[^2][^11]
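For reference, the two tabular updates can be written side by side. The notation below follows common textbook usage and is an assumption, not taken from this article: $\alpha$ is a step size, $\gamma$ a discount factor, and $Q$ an action-value table. SARSA bootstraps from the action actually taken next:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma\, Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right]$$

Q-learning bootstraps from the best available next action instead:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t) \right]$$

In both cases the bracketed term is a TD error; the only difference is the bootstrapped target.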
A 1995 result by Sutton, Barto, and colleagues drew a striking link between TD learning and the firing patterns of dopamine neurons in the brain, helping launch the modern neuroscience of reward prediction and influencing the broader field of computational neuroscience. This convergence of evidence between behavior, neuroscience, and computer-science models is often cited as one of the success stories of computational cognitive science.[^11]
Building on his doctoral work with Barto, Sutton was a co-developer of the actor–critic family of reinforcement-learning algorithms, in which a "critic" learns a value function and an "actor" learns a policy guided by the critic's TD error. Modern descendants of actor–critic — including A3C, DDPG, SAC, and PPO — are workhorses of deep reinforcement learning and underpin many state-of-the-art results in robotics, game-playing, and large-model fine-tuning.[^1][^3]
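The division of labor between actor and critic can be sketched with a minimal one-step tabular actor–critic. Everything specific here is an illustrative assumption: the four-state corridor, the reward of -1 per step, the softmax policy over action preferences, and the step sizes. For brevity the sketch also omits the discount-power factor that some textbook episodic versions apply to the actor update.

```python
import math
import random


def softmax(prefs):
    """Convert action preferences into probabilities (numerically stable)."""
    top = max(prefs)
    exps = [math.exp(p - top) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def actor_critic_corridor(n_states=4, episodes=2000, alpha_actor=0.1,
                          alpha_critic=0.1, gamma=0.95, seed=0):
    """One-step actor-critic on a toy corridor; the goal is the last state."""
    rng = random.Random(seed)
    goal = n_states - 1
    values = [0.0] * n_states                       # critic: state values
    prefs = [[0.0, 0.0] for _ in range(n_states)]   # actor: [left, right]
    for _ in range(episodes):
        state = 0
        for _step in range(200):                    # safety cap per episode
            probs = softmax(prefs[state])
            action = 0 if rng.random() < probs[0] else 1
            next_state = max(0, state - 1) if action == 0 else state + 1
            reward = -1.0                           # every step costs 1
            done = next_state == goal
            target = reward + (0.0 if done else gamma * values[next_state])
            td_error = target - values[state]
            values[state] += alpha_critic * td_error            # critic update
            for a in (0, 1):                                    # actor update
                grad_log_pi = (1.0 if a == action else 0.0) - probs[a]
                prefs[state][a] += alpha_actor * td_error * grad_log_pi
            state = next_state
            if done:
                break
    return values, [softmax(p) for p in prefs]


values, policy = actor_critic_corridor()
print([round(p[1], 2) for p in policy[:-1]])  # probability of moving right
```

The critic's TD error is the only learning signal the actor receives: actions followed by better-than-predicted outcomes become more probable, and the rest become less so.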
In 1991, Sutton proposed the Dyna architecture, a framework that explicitly integrates learning from real experience, learning of a world model, and planning by simulated experience using that model — all within a single learning agent. Dyna is widely regarded as one of the earliest and most influential model-based reinforcement-learning architectures, and its ideas remain central to current model-based and world-model approaches in deep RL.[^3][^4]
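A minimal tabular sketch in the spirit of Dyna-Q shows the three interleaved components: direct learning, model learning, and planning. The deterministic six-state corridor, the epsilon-greedy rule, and all parameter values are assumptions made for illustration, not details from Sutton's papers.

```python
import random


def dyna_q(n_states=6, episodes=50, planning_steps=10, alpha=0.5,
           gamma=0.95, epsilon=0.1, seed=0):
    """Tabular Dyna-Q sketch on a toy deterministic corridor."""
    rng = random.Random(seed)
    goal = n_states - 1
    q = [[0.0, 0.0] for _ in range(n_states)]  # actions: 0 = left, 1 = right
    model = {}  # learned model: (state, action) -> (reward, next_state)

    def env_step(state, action):
        """The real environment: reward 1 on reaching the goal, else 0."""
        next_state = max(0, state - 1) if action == 0 else state + 1
        return (1.0 if next_state == goal else 0.0), next_state

    def q_update(s, a, r, s2):
        """One Q-learning backup, used for both real and simulated steps."""
        bootstrap = 0.0 if s2 == goal else max(q[s2])
        q[s][a] += alpha * (r + gamma * bootstrap - q[s][a])

    for _ in range(episodes):
        state = 0
        while state != goal:
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                action = rng.randrange(2)          # explore or break ties
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            reward, next_state = env_step(state, action)
            q_update(state, action, reward, next_state)     # direct RL
            model[(state, action)] = (reward, next_state)   # model learning
            for _plan in range(planning_steps):             # planning
                s, a = rng.choice(list(model))
                r, s2 = model[(s, a)]
                q_update(s, a, r, s2)
            state = next_state
    return q


q_table = dyna_q()
print([[round(x, 2) for x in row] for row in q_table[:-1]])
```

The same backup is applied to real and simulated transitions, so planning is simply additional learning from experience generated by the model.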
Together with Doina Precup and Satinder Singh, Sutton co-developed the options framework for temporal abstraction in reinforcement learning, published as "Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning" in the journal Artificial Intelligence in 1999. The options framework treats sub-tasks as temporally extended "options" within a semi-Markov decision process, providing one of the foundational formalisms for hierarchical reinforcement learning.[^3]
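An option is commonly described by three components: an initiation set, an internal policy, and a termination condition. The sketch below encodes that triple as a small data structure and executes one option in a toy corridor; the class, the function names, and the environment are hypothetical and chosen only for illustration.

```python
import random
from dataclasses import dataclass
from typing import Callable, FrozenSet


@dataclass(frozen=True)
class Option:
    initiation_set: FrozenSet[int]              # states where it may start
    policy: Callable[[int], int]                # state -> primitive action
    termination_prob: Callable[[int], float]    # state -> chance of stopping


def run_option(option, state, env_step, gamma=0.9, rng=None, max_steps=100):
    """Run an option to termination.

    Returns (discounted reward, final state, duration), the quantities a
    semi-MDP-style update would need.
    """
    if state not in option.initiation_set:
        raise ValueError("option cannot be initiated in this state")
    rng = rng or random.Random(0)
    total, discount, steps = 0.0, 1.0, 0
    while steps < max_steps:
        reward, state = env_step(state, option.policy(state))
        total += discount * reward
        discount *= gamma
        steps += 1
        if rng.random() < option.termination_prob(state):
            break
    return total, state, steps


def corridor_step(state, action):
    """Hypothetical toy environment: states 0..5, reward 1 on reaching 5."""
    next_state = max(0, min(5, state + action))
    return (1.0 if next_state == 5 else 0.0), next_state


go_right = Option(
    initiation_set=frozenset(range(5)),
    policy=lambda s: 1,                                 # always step right
    termination_prob=lambda s: 1.0 if s == 5 else 0.0,  # stop at the end
)

ret, final_state, duration = run_option(go_right, 2, corridor_step)
print(ret, final_state, duration)
```

From the agent's point of view the whole multi-step excursion behaves like a single temporally extended action with a variable duration.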
Sutton and his collaborators (David McAllester, Satinder Singh, and Yishay Mansour) authored "Policy Gradient Methods for Reinforcement Learning with Function Approximation" (Advances in Neural Information Processing Systems, 2000), which formally established the policy-gradient theorem for differentiable function approximators. The theorem provides a clean expression for the gradient of an agent's expected return with respect to the parameters of its policy, even when both the policy and the value function are represented by neural networks or other parameterized models. The result provides theoretical grounding for policy-gradient algorithms, from the earlier REINFORCE to A2C/A3C, TRPO, and PPO, that drive much of modern deep RL, including, most prominently, the reinforcement-learning-from-human-feedback (RLHF) and reinforcement-learning-from-AI-feedback (RLAIF) pipelines used to train and align large language models.[^3]
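In commonly used notation (the symbols here are assumptions, not taken from this article), the theorem can be stated as follows, where $J(\theta)$ is the performance of the policy $\pi_\theta$, $d^{\pi}(s)$ is a weighting over states induced by following the policy, and $Q^{\pi}(s,a)$ is the action-value function:

$$\nabla_\theta J(\theta) = \sum_{s} d^{\pi}(s) \sum_{a} \nabla_\theta \pi_\theta(a \mid s)\, Q^{\pi}(s, a)$$

The notable feature is that no derivative of the state weighting $d^{\pi}$ appears, so the gradient can be estimated from sampled experience. Using $\nabla_\theta \pi_\theta = \pi_\theta \nabla_\theta \log \pi_\theta$, the same expression becomes an expectation over states and actions visited under the policy, which is the form sampled by practical algorithms:

$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim d^{\pi},\, a \sim \pi_\theta} \left[ \nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi}(s, a) \right]$$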
Sutton has also done influential work on gradient and emphatic TD methods (giving the first sound off-policy TD algorithms with linear function approximation), on the Horde architecture for parallel general-value-function learning (in which a single agent learns a large number of value-like predictions in parallel), and on options discovery and intra-option learning. More recently he has focused on continual and "on-the-job" learning, the problem of building agents that keep improving from new experience without catastrophic forgetting and without a fresh round of expensive offline training, and on the related problem of "plasticity loss" in deep networks.[^12]
He has been a vocal critic of the view that pure scaling of supervised pretraining will be sufficient for general intelligence. In interviews and talks following his 2024 Turing Award, Sutton repeatedly argued that current large language models lack on-the-job learning and that fundamentally new architectures are required if AI systems are to keep learning after deployment. In 2025 he gave the keynote "The Era of Experience and The Age of Design" at the Upper Bound 2025 AI conference in Edmonton, in which he laid out the same case.[^16]
Sutton and Barto's textbook "Reinforcement Learning: An Introduction" is the standard reference and teaching text for the field. The first edition was published by MIT Press in 1998 and the second, substantially expanded edition appeared in 2018. The two editions are commonly referred to among RL researchers simply as "Sutton and Barto" or "the RL bible."[^3][^5]
The book systematically presents the formalisms of Markov decision processes, dynamic programming, Monte Carlo methods, TD learning, n-step methods, eligibility traces, policy-gradient methods, function approximation, and integrated planning. The second edition adds substantial new material on policy-gradient methods (reflecting the explosion of deep-RL work in the 2010s), off-policy methods, average-reward formulations, and case studies from games and robotics. It also includes a final part on the relationship between RL and topics such as psychology and neuroscience, reflecting Sutton's long-standing interest in those connections.[^5]
The textbook has been adopted in graduate AI and machine learning courses worldwide. As of 2025 it had been cited more than 80,000 times in Google Scholar, making it one of the most-cited works in machine learning. The free draft PDF of the second edition, which Sutton hosts on his personal website incompleteideas.net, has been downloaded by hundreds of thousands of students and practitioners.[^3][^5]
In March 2019 Sutton published a short essay on his personal website, incompleteideas.net, titled "The Bitter Lesson." The full essay is around 1,100 words.[^6] In it he argued that, looking across seven decades of AI research, "the biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin." Approaches that bake in human knowledge — opening-book chess heuristics, hand-engineered vision features, symbolic linguistic structure — repeatedly lose, in the long run, to simple methods (search and learning) that scale with available compute as Moore's law continues to compound.[^6][^7]
Sutton illustrated the argument with examples from computer chess (where Deep Blue's brute-force search beat carefully engineered "human-style" chess programs), computer Go (where AlphaGo and AlphaGo Zero out-scaled all prior knowledge-based programs), speech recognition (where statistical and then neural approaches displaced hand-built phonetic systems), and computer vision (where convolutional networks supplanted hand-crafted features).[^6][^7] The essay characterizes the conclusion as "bitter" because it is less anthropocentric than researchers had hoped, and it has remained influential through the rise of large-scale deep learning and large language models — frequently cited both in defense of scaling and as the target of subsequent critiques.[^7]
On March 5, 2025, the Association for Computing Machinery (ACM) announced that Andrew G. Barto and Richard S. Sutton would receive the 2024 ACM A.M. Turing Award "for developing the conceptual and algorithmic foundations of reinforcement learning."[^1][^2] The Turing Award is widely described as the "Nobel Prize of Computing" and carries a US$1 million prize, with financial support provided by Google.[^1][^17]
In its citation, ACM credited Barto and Sutton with having introduced the main ideas, constructed the mathematical foundations, and developed the most important algorithms for reinforcement learning over a series of papers beginning in the 1980s. The official ACM citation specifically highlights their textbook "Reinforcement Learning: An Introduction" (1998; 2nd ed. 2018) as the work that introduced these ideas to generations of researchers. The award page noted Sutton's positions as Professor of Computer Science at the University of Alberta, Research Scientist at Keen Technologies, and Fellow at the Alberta Machine Intelligence Institute (Amii).[^1][^2]
The 2024 award marked the first time the Turing Award was given primarily for reinforcement learning, recognizing a body of work that, by the time of the announcement, had become foundational to a wide range of modern AI systems, from DeepMind's game-playing programs (AlphaGo, AlphaZero, AlphaProof) and robotic-control systems to the reinforcement-learning-from-human-feedback (RLHF) procedures used to align large language models such as ChatGPT and Claude.[^17][^11]
Reaction to the award from the AI research community was broadly enthusiastic. Coverage by ACM, the U.S. National Science Foundation, the University of Alberta, Amii, the Heidelberg Laureate Foundation, and major technology outlets emphasized both the long arc of Barto and Sutton's work and its present-day applications. The University of Alberta described it as the university's first Turing Award and a milestone for Canadian AI research.[^11][^17][^13]
In post-award interviews, Sutton stressed that the award was as much about reinforcement learning as a field as it was about him and Barto personally, and used the platform to argue for continued investment in fundamental, "experience-based" research, as opposed to ever-larger pre-training runs alone. His remarks were consistent with the broader argument of "The Bitter Lesson" — that simple, scalable, search-and-learning-based methods are the durable winners — but added a new emphasis on continual learning from interaction with the world.[^13][^16]
Sutton has been an outspoken voice on the future direction of AI research. His public position, repeated in essays, talks, and post-Turing-Award interviews, can be summarized in three parts.
First, he argues — following "The Bitter Lesson" — that general methods based on search and learning, with compute as the main resource, are the durable winners over hand-engineered domain knowledge. He has repeatedly applied this argument to debates over symbolic AI, neuro-symbolic methods, and aggressive use of human feedback in language-model training.[^6][^7]
Second, he has argued that contemporary large language models, while impressive, lack continual learning from interactive experience — they are essentially trained once and frozen, with at best brief fine-tuning. In the 2025 essay "Welcome to the Era of Experience," co-authored with David Silver and circulated as part of an upcoming MIT Press book "Designing an Intelligence," Sutton and Silver argued that the next leap in AI capability will come from agents that learn predominantly from their own experience interacting with the world, rather than from human-generated text. They cited DeepMind's AlphaProof — which used reinforcement learning to generate millions of new mathematical proofs and reached the level of an IMO silver medalist — as an early instance of this "era of experience."[^16]
Third, he is publicly skeptical of doomer-style framings of AI risk, while supporting careful long-term safety research. He has emphasized the importance of building AI systems whose objectives are shaped through interaction with the world rather than imposed from above, and has framed AGI as a scientific goal as well as an engineering one.[^10][^16]
These views are central to the agenda of Keen Technologies, which Sutton and John Carmack describe as aimed at "real-time, on-device, on-the-job learning."[^10]