The Bitter Lesson
Last reviewed
May 20, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,146 words
The Bitter Lesson is a short essay by Canadian-American computer scientist Richard S. Sutton, published on his personal blog at incompleteideas.net on March 13, 2019.[^1][^2] Drawing on seven decades of artificial intelligence research, the essay argues that general methods which leverage computation, principally search and learning, ultimately and consistently outperform methods that encode human domain knowledge into a system, and that they do so by a large margin once compute is available at sufficient scale.[^1] Sutton calls the lesson "bitter" because each generation of AI researchers tends to repeat the pattern of investing heavily in handcrafted, knowledge-rich approaches only to be overtaken by more general methods once compute grows.[^1][^3] Despite its modest length of roughly 1,100 words, the essay has become one of the most frequently cited pieces of AI commentary of the late 2010s and early 2020s, particularly within debates over scaling laws, deep learning, and progress toward artificial general intelligence.[^3][^4]
Richard S. Sutton is a Canadian-American researcher widely regarded as a co-founder of modern computational reinforcement learning. He earned a BA in psychology from Stanford in 1978 and a PhD in computer science from the University of Massachusetts Amherst in 1984 under the supervision of Andrew Barto, with a dissertation on temporal credit assignment.[^5] After research positions at GTE Laboratories and AT&T Labs, he became a professor at the University of Alberta in 2003, served as a distinguished research scientist at Google DeepMind from 2017 to 2023, and joined John Carmack's Keen Technologies as a research scientist in 2023.[^5][^6] He is best known for the theory of temporal-difference learning, the actor-critic family of reinforcement learning algorithms, the Dyna architecture for integrated learning and planning, the options framework for temporal abstraction, and for co-authoring the widely used textbook Reinforcement Learning: An Introduction with Andrew Barto.[^5][^7] In March 2025, the Association for Computing Machinery announced that Sutton and Barto would share the 2024 ACM A.M. Turing Award "for developing the conceptual and algorithmic foundations of reinforcement learning," with a one million dollar prize funded by Google.[^8][^9]
The essay itself appears in the "Incomplete Ideas" section of Sutton's personal site, a section he has used since the early 2000s for short, opinionated pieces. The Bitter Lesson is not a peer-reviewed paper but a reflective commentary intended as a public summary of patterns Sutton perceived across his career and across the history of AI more broadly.[^1][^2] Although brief, it has accumulated hundreds of formal academic citations indexed by Google Scholar and many more informal references in blog posts, podcasts, and corporate technical talks.[^3]
Sutton opens with what he describes as the central claim of the essay: that "the biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin."[^1] He attributes this advantage to the steady, exponential decline in the cost of computation associated with Moore's Law and related hardware trends, which makes any approach that can absorb more compute eventually overtake any approach that cannot.[^1][^3] Although Moore's Law in its strict transistor-density formulation has slowed since the mid-2010s, the broader trend of cost-per-floating-point-operation continues to fall through specialization, parallelism, and dedicated AI accelerators, sustaining the dynamic Sutton describes.[^11][^22]
He identifies two general classes of methods that "seem to scale arbitrarily in this way": search and learning.[^1] By search, Sutton refers to algorithms that explore large spaces of possibilities, including Monte Carlo Tree Search and game-tree search such as alpha-beta. By learning, he refers to statistical and gradient-based optimization of parameters from data, the family of techniques that underlies modern machine learning and deep learning.[^1] The recurring failure mode, in Sutton's account, is the temptation for researchers to encode their own understanding of the problem domain into the system, because doing so "always helps in the short term, and is personally satisfying to the researcher, but in the long run it plateaus and even inhibits further progress," while "breakthrough progress eventually arrives by an opposing approach based on scaling computation by search and learning."[^1]
A useful way to read the essay is as a claim about ranking under different compute regimes. At low compute, knowledge-rich systems often dominate because they amortize human insight into a small parameter budget. As compute grows, systems whose performance curve continues to climb with additional data and parameters eventually pass through the plateau of the knowledge-rich system, and from that point onward general methods dominate. Sutton's claim is that this ranking reversal has now happened many times in the same way across very different domains, suggesting that it is structural rather than coincidental.[^1][^4]
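The ranking reversal can be made concrete with a toy calculation. The sketch below is purely illustrative: both error curves, their constants, and the function names are assumptions chosen for this example and do not come from Sutton's essay. It models a knowledge-rich system whose error falls quickly and then plateaus, and a general method whose error starts higher but keeps falling as a power law in compute, then locates the crossover point by bisection.

```python
import math


def knowledge_rich_error(compute: float) -> float:
    """Toy curve (assumed): low error early, then a plateau at 0.20."""
    return 0.20 + 0.30 / compute


def general_method_error(compute: float) -> float:
    """Toy curve (assumed): high error early, power-law decline with no floor."""
    return compute ** -0.25


def crossover_compute(lo: float = 1.0, hi: float = 1e12, iterations: int = 100) -> float:
    """Bisect in log space for the compute at which the general method pulls ahead.

    Assumes the knowledge-rich system is better at `lo` and worse at `hi`.
    """
    log_lo, log_hi = math.log(lo), math.log(hi)
    for _ in range(iterations):
        mid = 0.5 * (log_lo + log_hi)
        c = math.exp(mid)
        if general_method_error(c) > knowledge_rich_error(c):
            log_lo = mid  # general method still behind: crossover lies further right
        else:
            log_hi = mid  # general method already ahead: crossover lies further left
    return math.exp(0.5 * (log_lo + log_hi))


crossover = crossover_compute()
```

With these made-up constants the general method pulls ahead at a compute budget between 600 and 625 units. Changing the constants moves the crossover but not the qualitative picture, so long as one curve plateaus and the other does not.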
Sutton later condensed the message in a public post on X to twenty-six words: "Don't be distracted by human knowledge, as AI has been historically. Instead focus on methods for creating knowledge that scale with computation, like search and learning."[^10]
Most of the essay consists of four short case studies drawn from the history of AI, each used to illustrate the same pattern: handcrafted, knowledge-rich methods initially leading, then being overtaken by general methods at scale.
Sutton's first example is computer chess and the 1997 match between Deep Blue and world champion Garry Kasparov. He observes that the methods that ultimately defeated Kasparov relied on massive, deep search and specialized hardware rather than on a deep understanding of the game encoded by chess masters.[^1] Many researchers who had invested in knowledge-based chess programs reacted with dismay, viewing the result as "brute force" and contrary to the spirit of intelligent problem solving.[^1][^11] The Deep Blue system used a relatively standard alpha-beta search algorithm but was carried by specialized chess hardware that could evaluate up to roughly 200 million positions per second, supported by a 30-processor supercomputer and 480 custom chess chips.[^11] Critics such as physicist Michael Nielsen have noted that Deep Blue also incorporated thousands of hand-engineered evaluation features, complicating any narrative of pure brute force; nevertheless, Sutton's broader point is that the system's leverage came overwhelmingly from search depth at scale, not from chess-specific symbolic understanding.[^12]
Sutton's second example is the game of Go, where the AI community spent decades attempting to compensate for the game's enormous branching factor by encoding human positional intuition. He notes that "enormous initial efforts went into avoiding search by taking advantage of human knowledge, or of the special features of the game, but all those efforts proved irrelevant, or worse, once search was applied effectively at scale."[^1] The breakthrough came in 2016 when DeepMind's AlphaGo defeated professional Lee Sedol 4-1 by combining a deep policy network and a value network with Monte Carlo Tree Search, trained initially on human games and then by reinforcement learning from self-play.[^13] In 2017, the follow-up system AlphaGo Zero removed all human gameplay data, learning from scratch by self-play and surpassing the previous AlphaGo within days.[^14] AlphaZero, described later that year, generalized the same recipe to chess and shogi without any game-specific knowledge beyond the rules.[^14] For Sutton, this sequence illustrates the bitter lesson nearly in textbook form: decades of handcrafted Go knowledge were not merely surpassed but largely discarded.[^1][^14]
Sutton's third example concerns speech recognition. He recounts that in the 1970s the field invested heavily in approaches based on linguistic and phonetic knowledge, including explicit models of "words, of phonemes, of the human vocal tract," and that statistical methods, particularly Hidden Markov Models trained on large corpora, eventually displaced them.[^1] The deep learning era extended this trajectory: large neural acoustic models trained end-to-end on tens of thousands of hours of audio outperformed earlier hybrid systems that relied on hand-tuned linguistic modules, and modern speech recognition systems are now dominated by general sequence models with comparatively little embedded phonetic structure.[^1][^3]
The fourth example is computer vision, where Sutton notes that "early methods conceived of vision as searching for edges, or generalized cylinders, or in terms of SIFT features," but that modern convolutional neural networks outperformed those approaches while relying only on the very general inductive biases of convolution and certain invariances.[^1] He uses this case to argue that hand-engineered features such as edge detectors and SIFT descriptors were ultimately replaced by representations learned automatically from data.[^1] The 2012 AlexNet system, which dramatically outperformed prior handcrafted-feature pipelines on the ImageNet benchmark, is the canonical empirical event that anchors this narrative at the start of the modern deep learning era.[^4]
Sutton describes the lesson as bitter for two intertwined reasons. The first is sociological: AI researchers tend to be drawn to approaches that reflect human understanding of a problem, both because they are intellectually satisfying and because they produce short-term progress that is publishable.[^1] The second is structural: those same approaches plateau, and when general methods overtake them, the work invested in encoding human knowledge tends to be discarded rather than incorporated.[^1] In the closing paragraphs Sutton writes that "the actual contents of minds are tremendously, irredeemably complex; we should stop trying to find simple ways to think about the contents of minds, such as simple ways to think about space, objects, multiple agents, or symmetries. Instead we should build in only the meta-methods that can find and capture this arbitrary complexity," concluding that "we want AI agents that can discover like we can, not which contain what we have discovered."[^1][^15]
The bitterness is amplified by the historical pattern that researchers tend to interpret each defeat as an isolated incident rather than as confirmation of a general law. After Deep Blue, many argued that chess was special because of its tractable branching factor; after AlphaGo, that Go was special because perfect information made search effective; after AlexNet, that vision was special because of the regularity of natural images. Sutton's essay is partly a polemical response to this pattern of post hoc rationalization, presenting the cumulative pattern across decades and domains as evidence that the same dynamic will continue to recur.[^1][^12]
This framing has been read as a programmatic call to favor scalable, general architectures and search procedures over symbolic representations, hard-coded heuristics, and hand-engineered features. It has also been read more narrowly, as a warning about over-investment in domain knowledge rather than a categorical rejection of all human priors, a reading Sutton himself has endorsed in subsequent commentary.[^1][^10]
Although The Bitter Lesson predates the empirical scaling laws literature, it has been retrospectively framed as the philosophical companion to that work. In January 2020, OpenAI researchers led by Jared Kaplan published Scaling Laws for Neural Language Models, which empirically showed that test loss for transformer language models follows smooth power laws in model size, dataset size, and compute, spanning more than seven orders of magnitude.[^16] Nine of the ten authors of the Kaplan paper also appeared on the GPT-3 paper, Language Models are Few-Shot Learners, published four months later, which scaled a GPT-2-style transformer to 175 billion parameters and demonstrated strong in-context learning across many tasks.[^17][^18]
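A power law of the form L(x) = a · x^(-b) becomes a straight line when both axes are logarithmic, which is why such trends are usually fitted by linear regression in log-log space. The sketch below shows that procedure on synthetic data. The constants 5.0 and 0.05, the data points, and the function name are assumptions for illustration only and are not values reported by Kaplan and colleagues.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of log(y) = log(a) - b * log(x); returns (a, b)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic, noise-free data spanning seven orders of magnitude in x.
TRUE_A, TRUE_B = 5.0, 0.05
compute = [10.0 ** k for k in range(8)]
loss = [TRUE_A * c ** -TRUE_B for c in compute]

fitted_a, fitted_b = fit_power_law(compute, loss)
```

Because the synthetic data contain no noise, the fit recovers the generating constants up to floating-point error. Real loss measurements are noisy and may include an irreducible-loss term, which this two-parameter form does not model.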
The Bitter Lesson is frequently cited in the broader "scaling hypothesis" discourse, including in Gwern Branwen's influential essay on the topic, which treats GPT-3's success as a vindication of the principle that "most clever AI innovations are ultimately useless as they hamstring AI performance and are surpassed by methods that make fewer assumptions and use more compute and data."[^4]
In 2022, DeepMind researchers Jordan Hoffmann and colleagues published Training Compute-Optimal Large Language Models, the Chinchilla scaling laws paper. By training over 400 models with sizes between 70 million and 16 billion parameters on between 5 billion and 500 billion tokens, the team argued that Kaplan's earlier recipe had underweighted data: for a fixed compute budget, model size and training tokens should scale at roughly equal rates, implying that many previously trained large models were undertrained.[^19] Chinchilla itself, a 70 billion parameter model trained on 1.4 trillion tokens, outperformed much larger predecessors at comparable compute.[^19] Within the Bitter Lesson framework, Chinchilla is sometimes read as a refinement rather than a refutation: data, compute, and model size all scale, and the optimal allocation among them is itself an empirical question that yields to compute-driven experimentation.[^4][^19]
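The equal-rate rule can be turned into a back-of-the-envelope calculator. The sketch below rests on two assumptions that are not stated in this article: the commonly used approximation that training cost is about 6 FLOPs per parameter per token (C ≈ 6ND), and a fixed tokens-per-parameter ratio taken from the Chinchilla figures quoted above (1.4 trillion tokens over 70 billion parameters, or 20). The function name is hypothetical.

```python
import math

# Ratio implied by the Chinchilla figures quoted above: 1.4e12 tokens / 70e9 params.
TOKENS_PER_PARAM = 1.4e12 / 70e9


def compute_optimal_split(flops: float, tokens_per_param: float = TOKENS_PER_PARAM):
    """Split a FLOP budget into (parameters, tokens).

    Assumes C ~= 6 * N * D and D = tokens_per_param * N, so that
    N = sqrt(C / (6 * tokens_per_param)). Both N and D then grow as sqrt(C).
    """
    params = math.sqrt(flops / (6.0 * tokens_per_param))
    tokens = tokens_per_param * params
    return params, tokens


# Budget that reproduces Chinchilla's own shape under these assumptions.
chinchilla_flops = 6.0 * 70e9 * 1.4e12
params, tokens = compute_optimal_split(chinchilla_flops)

# Quadrupling the budget doubles both the model size and the token count.
params_4x, tokens_4x = compute_optimal_split(4.0 * chinchilla_flops)
```

Under these assumptions a fourfold increase in compute calls for a model twice as large trained on twice as many tokens, which is the sense in which parameters and data scale at equal rates.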
The essay has also been invoked in the post-training era. In their 2025 paper Welcome to the Era of Experience, David Silver and Richard Sutton argue that human-generated text data, the substrate of the large language model era, is approaching its useful limits in critical domains and that the next generation of agents will scale by learning from their own interaction with environments, a position that extends the Bitter Lesson framework to grounded, online experience rather than static datasets.[^20] In this account, the supply-side bottleneck on human text data is itself an instance of the bitter lesson: methods that depend on a fixed, finite human-curated dataset will be overtaken by methods that can generate their own data through interaction with rich environments at scale.[^20]
Test-time scaling is another arena in which the Bitter Lesson has been re-applied. Reasoning-oriented models such as the OpenAI o-series, released from late 2024 onward, allocate substantially more compute at inference time to chain-of-thought generation and verification rather than relying solely on training-time compute. Commentators have described this as a second wave of bitter-lesson-style scaling, where simple but compute-hungry search procedures over candidate reasoning traces outperform handcrafted reasoning heuristics.[^21][^4]
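One simple way to see why extra inference compute can help is best-of-N sampling with a verifier. The sketch below is an idealized model, not a description of how any particular reasoning model works: it assumes each sampled answer is correct independently with the same probability and that a perfect verifier can recognize a correct answer. The function name and the example probability are hypothetical.

```python
def best_of_n_success(p_single: float, n_samples: int) -> float:
    """Probability that at least one of n independent samples is correct.

    Assumes independent samples with success probability `p_single` each,
    and a perfect verifier that picks a correct sample whenever one exists.
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must lie in [0, 1]")
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    return 1.0 - (1.0 - p_single) ** n_samples


# Success rate as inference compute (number of samples) doubles repeatedly.
success_by_n = {n: best_of_n_success(0.1, n) for n in (1, 2, 4, 8, 16, 32)}
```

In this idealized setting a model that is right one time in ten succeeds more than 96 percent of the time with 32 samples. Real systems fall short of this bound because samples are correlated and verifiers are imperfect.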
The Bitter Lesson has become a touchstone in industrial AI labs, particularly those associated with the scaling hypothesis. The Wikipedia overview notes that the essay has accumulated hundreds of formal Google Scholar citations and is widely discussed in technical talks and blog posts, and OpenAI insiders have publicly described being asked to internalize its argument.[^3] Researchers and engineers at OpenAI, Google DeepMind, Anthropic, and elsewhere routinely invoke "the bitter lesson" as shorthand for the general claim that scaling deep learning systems on more data and compute outperforms attempts to introduce domain-specific structure.[^4][^21]
Within academic and industrial debate, the essay is used in several distinct ways:
| Use case | Typical claim |
|---|---|
| Defense of scale-driven research programs | Investment in larger models, more compute, and more data is justified by the historical pattern Sutton describes. |
| Skepticism toward symbolic or neuro-symbolic hybrids | Heavy injection of structured human knowledge will be outcompeted by general methods at scale. |
| Critique of narrow, knowledge-rich systems | Production systems that depend on extensive feature engineering are vulnerable to disruption by general models. |
| Framework for reinforcement learning research | Encourages methods that learn from interaction rather than from hand-tuned reward shaping or heuristics. |
The essay is also widely taught, including in graduate seminars on scaling and large language models, where it is paired with empirical scaling work to give students both a historical narrative and a quantitative picture of why scale matters.[^21]
A related secondary effect is on industrial strategy. The thesis that compute scaling beats domain engineering provides a structural argument for very large capital expenditures on data centers and accelerator hardware, since investments that purchase additional compute can be modeled as directly purchasing additional capability at a known marginal rate. This logic has underpinned multibillion-dollar buildouts of training clusters at major labs in the 2020s and is regularly invoked by both proponents and skeptics when justifying or critiquing the scale of those investments.[^16][^17]
The essay has attracted significant pushback, much of it focused on three threads: that Sutton overstates how knowledge-free successful systems actually are, that the lesson provides little practical guidance on shorter timescales, and that its framing risks discouraging research into structural priors that genuinely matter.
The most prominent direct response is Rodney Brooks's essay A Better Lesson, published on March 19, 2019.[^22] Brooks argues that Sutton's account ignores the human ingenuity baked into successful "compute-driven" systems: convolutional neural networks explicitly encode translational invariance through their architecture; transformer models embed strong assumptions about sequences and attention; massive training pipelines depend on carefully curated datasets, tokenizers, and optimization recipes. Brooks frames this as a relocation rather than an elimination of human work: "It is sleight of hand in moving the human intellectual work to somewhere else."[^22] He also points to economic and physical limits, including the slowing of Moore's Law and the power budgets of mobile and embodied systems, arguing that energy and capital costs of ever-larger compute undermine the assumption that compute will continue to fall in price indefinitely.[^22]
Michael Nielsen's Reflections on The Bitter Lesson makes a related point, emphasizing that the Deep Blue system used in 1997 contained roughly 8,000 hand-engineered evaluation features in addition to its alpha-beta search, and that real-world systems have historically been hybrids of knowledge and computation.[^12] Nielsen accepts that the long-run trajectory in chess and Go has favored more general methods, but notes that on the 5 to 10 year timescales over which research and engineering decisions are actually made, hybrid approaches often dominate; a pure 2017-style AlphaZero system, he argues, "would have been trounced by Deep Blue" in 1997 given the compute available then.[^12]
A second class of critique emphasizes the role of architectural inductive bias. Commentators including Felix Hill and others argue that successful "general" methods such as CNNs, transformers, and graph neural networks succeed precisely because they incorporate carefully chosen priors aligned with the structure of their data, and that the work of designing such priors is itself a form of human knowledge engineering rather than its opposite.[^23][^24] On this reading, the Bitter Lesson is true about narrow, domain-specific heuristics but understates the importance of broad, reusable structural biases.
A third critique, often associated with data-centric AI research, holds that the dominant lever for modern progress is not raw compute but data quality and curation. Researchers including Ari Morcos have publicly argued that after years of focusing on architecture they came to the conclusion that "all that really matters is the data," and the Chinchilla result is sometimes cited as evidence that data scaling has been systematically under-prioritized.[^19][^25]
Sutton himself has clarified in interviews and subsequent writing that the essay is not a claim that human knowledge is useless, but rather that knowledge-rich approaches tend to be outcompeted at scale, and that researchers should bias their effort toward methods that scale.[^10][^15] Many critics agree with this softer reading even while resisting stronger interpretations.
The Bitter Lesson can only be fully understood in the context of Sutton's wider research program. With Andrew Barto, Sutton helped establish reinforcement learning as a mathematical discipline in the 1980s and 1990s through a series of papers introducing temporal-difference learning, actor-critic algorithms, and the Dyna architecture for integrating model-free learning with planning.[^5][^7] His 1988 paper Learning to Predict by the Methods of Temporal Differences introduced the TD-lambda family of algorithms, which became foundational for later systems including Gerald Tesauro's TD-Gammon backgammon player and many modern value-based RL methods.[^5][^7]
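The core of temporal-difference learning is an update that moves each state's value estimate toward the reward plus the estimated value of the next state, with eligibility traces spreading each correction back over recently visited states. The sketch below is a minimal tabular TD(lambda) with accumulating traces on a hypothetical three-state chain. The environment, the hyperparameters, and the function name are assumptions for illustration and are not taken from Sutton's paper.

```python
def td_lambda(episode, num_states, alpha=0.1, gamma=1.0, lam=0.8, num_episodes=1000):
    """Tabular TD(lambda) with accumulating eligibility traces.

    `episode` is a list of (state, reward, next_state) transitions that is
    replayed `num_episodes` times; next_state is None at termination.
    """
    values = [0.0] * num_states
    for _ in range(num_episodes):
        traces = [0.0] * num_states  # traces are reset at the start of each episode
        for state, reward, next_state in episode:
            next_value = 0.0 if next_state is None else values[next_state]
            td_error = reward + gamma * next_value - values[state]
            traces[state] += 1.0
            for s in range(num_states):
                values[s] += alpha * td_error * traces[s]
                traces[s] *= gamma * lam  # decay credit for older states
    return values


# Hypothetical deterministic chain 0 -> 1 -> 2 -> terminal, reward 1 on the last step.
CHAIN = [(0, 0.0, 1), (1, 0.0, 2), (2, 1.0, None)]
chain_values = td_lambda(CHAIN, num_states=3)
```

With no discounting, every state in this chain is worth exactly 1, and the estimates converge to that value. Setting `lam=0.0` gives one-step TD, in which the reward information propagates back one state per episode rather than along the whole trace.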
The textbook Reinforcement Learning: An Introduction, co-authored with Barto and first published by MIT Press in 1998 with a substantially expanded second edition in 2018, is among the most widely cited textbooks in AI, with over 75,000 citations in Google Scholar.[^7][^26] The second edition added chapters on artificial neural networks, Monte Carlo Tree Search, average reward maximization, and several modern applications.[^26]
Sutton's 1999 paper with Doina Precup and Satinder Singh, Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning, introduced the options framework, formalizing how agents can choose among temporally extended courses of action rather than just primitive actions.[^7] His later work on the Horde architecture, gradient temporal-difference methods, and continual learning continues the broader theme that scalable, general learning procedures, rather than handcrafted modules, are the route to general intelligence.[^7]
In September 2023, Sutton publicly joined John Carmack's Dallas-based startup Keen Technologies, founded in 2022 to pursue artificial general intelligence, announcing a target of demonstrating "AGI signs of life" by 2030.[^6] His ongoing collaboration with David Silver, including the 2025 Era of Experience paper, frames the next phase of AI in terms of grounded, experiential learning rather than passive pretraining on human text, which can be seen as a continuation of the Bitter Lesson's emphasis on general, scalable learning procedures.[^20]
The Bitter Lesson is most often read alongside the empirical scaling laws literature. The two main reference points are Scaling Laws for Neural Language Models by Kaplan and colleagues, and the Chinchilla scaling laws paper by Hoffmann and colleagues, which together established a quantitative framework for how loss, parameters, data, and compute interact in large language model training.[^16][^19] The essay is also discussed in connection with the development of GPT-2, GPT-3, and successive generations of GPT-4-class systems, all of which embody the bet that scaling general transformer architectures yields broad capability gains.[^4][^17]
Within game-playing AI, the lesson is illustrated by the progression from Deep Blue through AlphaGo, AlphaZero, and MuZero, the last of which learned a model of game dynamics together with a policy and value, again with minimal game-specific knowledge beyond the rules and action set.[^14] Within reinforcement learning more broadly, methods such as reinforcement learning from human feedback are sometimes presented as a partial counterweight: they reintroduce structured human preferences into model training and have been crucial for aligning large language models with user intent.[^3][^21]
In adjacent intellectual traditions, the essay is often compared to earlier statements in the AI literature, including Frederick Jelinek's apocryphal observation about speech recognition that "every time I fire a linguist, the performance of the speech recognizer goes up," and Peter Norvig's On Chomsky and the Two Cultures of Statistical Learning, both of which articulate similar tensions between knowledge-rich and statistical approaches.[^3] Comparisons with broader debates over feature engineering and the move from handcrafted features to learned representations in modern machine learning are also routine.[^4]