Commonsense reasoning
Last reviewed
Apr 30, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 3,998 words
Commonsense reasoning is the kind of reasoning that ordinary humans rely on to make sense of everyday situations. It draws on broadly shared, mostly tacit knowledge about how the physical world behaves, how minds work, how time and causation flow, and how social interactions unfold. It is not the formal reasoning of mathematics, nor the specialised expertise of a doctor diagnosing a rare disease. It is the kind of thing a five-year-old already does effortlessly: knowing that water spilt on a table will eventually drip onto the floor, that a person who smiles at you is probably not angry, that you cannot push a rope, that birds usually fly but penguins do not, and that an elephant will not fit inside a teacup.
For more than six decades, commonsense reasoning has been considered one of the central, and unsolved, problems of artificial intelligence. John McCarthy first proposed it as a research goal in 1959. Marvin Minsky, Hector Levesque, Doug Lenat, Yejin Choi, Gary Marcus, Ernest Davis and many others have argued at various points that an AI without common sense is fundamentally brittle, no matter how impressive its narrow performance looks. The arrival of large pretrained transformer models in the late 2010s changed the empirical picture dramatically; many benchmarks that were considered hard in 2018 were saturated by 2023. Whether modern large language models actually do commonsense reasoning, or instead interpolate convincingly from massive training corpora, remains an open question.
Commonsense knowledge covers what most adults take for granted about everyday life. It is the substrate that lets a person understand a sentence like "the trophy did not fit in the suitcase because it was too big" without consciously deploying a rule. It is the reason a child knows that pulling on a string can move a kite but pushing on the string cannot. It is the reason a reader infers, from "Mary went to the restaurant. She left a generous tip," that Mary ate a meal there even though no sentence says so explicitly.
Davis and Marcus, in their 2015 review in Communications of the ACM, divide commonsense knowledge into a small number of recurring domains.
| Domain | Examples |
|---|---|
| Naive physics | Objects fall when unsupported. Hot things burn. Solid objects do not pass through one another. Water flows downhill and conforms to its container. Strings can pull but not push. |
| Naive psychology | People have beliefs, desires, and emotions. They act in pursuit of goals. They communicate, deceive, cooperate, and react to surprise. |
| Time and events | Time flows in one direction. Causes precede effects. Events have durations. Some processes are reversible, most are not. |
| Defaults and exceptions | Birds fly, except penguins, ostriches, and dead birds. Restaurants serve food, except when they are closed. |
| Numeric and quantitative | An elephant is bigger than a cat. A coffee cup holds less than a swimming pool. A human walks faster than a snail and slower than a car. |
| Social norms and pragmatics | People queue, take turns, say hello, hide embarrassment. Lying is usually disapproved of. Sarcasm reverses literal meaning. |
| Spatial | Containers have insides and outsides. You can be on, in, under, behind, or beside something. Two solid bodies do not occupy the same place. |
What ties these domains together is that they are normally invisible. Humans deploy this knowledge without rehearsing it, which is exactly why writing it down for a computer turned out to be so hard.
Marvin Minsky argued for decades that without common sense, AI systems would always be "idiot savants," excellent at narrow tasks but unable to handle anything outside their training. Hector Levesque made a similar case in his 2014 paper On Our Best Behaviour, arguing that statistical pattern matching could never substitute for a real model of the world. The brittleness of classical expert systems in the 1970s and 1980s was a direct symptom of the commonsense gap. MYCIN could diagnose blood infections better than many physicians, but had no way of knowing that a patient who claimed to be pregnant and male was probably joking.
There is a phenomenon sometimes called the AI effect: once a problem is solved, it stops looking like AI. Chess, optical character recognition, and speech recognition all went through this cycle. Commonsense reasoning has not. Each apparent advance, from Cyc to ConceptNet to ATOMIC to GPT-4, has been followed by new tests showing that the previous solution missed something important.
| Year | Event | Significance |
|---|---|---|
| 1959 | John McCarthy, Programs with Common Sense | First explicit proposal that AI should reason from a declarative store of commonsense facts. Introduced the Advice Taker. |
| 1968 | Quillian's semantic networks | Early attempt to represent meaning as a graph of concepts. |
| 1974 | Marvin Minsky, A Framework for Representing Knowledge | Introduced frames, structured stereotypes for everyday situations. |
| 1975 | Schank's MARGIE | Early conceptual dependency system attempting to represent the meaning of English sentences. |
| 1976 | SAM (Schank et al.) | Used scripts to summarise newspaper stories. |
| 1977 | Schank and Abelson, Scripts, Plans, Goals and Understanding | Foundational work on scripts (e.g. the restaurant script) for everyday situations. |
| 1978 | FRUMP | Faster script-based news skimmer following SAM. |
| 1984 | Doug Lenat begins Cyc at MCC | Decades-long effort to encode commonsense knowledge by hand. |
| 1995 | Miller, WordNet: A Lexical Database for English (CACM) | Describes Princeton's lexical database of English organised by sense, under development since the mid-1980s. |
| 1995 | Lenat, CYC: A Large-Scale Investment in Knowledge Infrastructure (CACM) | Reports a person-century of effort, around 100,000 concepts and a million handcrafted axioms. |
| 1999 | Open Mind Common Sense launched at MIT (Push Singh) | Crowdsourced commonsense knowledge from the general public. |
| 2004 | Liu and Singh, ConceptNet, A Practical Commonsense Reasoning Tool-Kit | Turned OMCS into a usable semantic network of 1.6 million assertions. |
| 2011 | Roemmele et al., COPA | First widely used multiple choice benchmark for commonsense causal reasoning. |
| 2012 | Levesque, Davis, Morgenstern, Winograd Schema Challenge | Pronoun disambiguation as a Turing-test alternative. |
| 2018 | ARC, SWAG | Larger benchmarks for science and grounded inference. |
| 2019 | ATOMIC, COMET, CommonsenseQA, HellaSwag, SocialIQA, WinoGrande | Wave of large crowdsourced commonsense datasets and the first transformer-based generative commonsense model. |
| 2020 | PIQA | Physical commonsense benchmark inspired by Instructables. |
| 2018 to present | BERT, GPT-2, GPT-3, GPT-4, Claude, Gemini | Implicit commonsense from large pretraining corpora; saturation of older benchmarks. |
| 2023 | Bubeck et al., Sparks of Artificial General Intelligence | Documents both impressive commonsense behaviour and surprising failures in early GPT-4. |
| 2023 to 2025 | RT-2, OpenVLA, GR00T | Vision language action models pushing toward grounded, embodied commonsense. |
The origin point of the field is John McCarthy's paper Programs with Common Sense, presented at the Teddington Conference on the Mechanization of Thought Processes. McCarthy proposed an Advice Taker, a hypothetical program that would store knowledge as logical sentences and draw new conclusions when given advice in the same logical language. His example involved deducing that one needed to walk to a car and drive to the airport. McCarthy's deeper claim was that any program with general competence would need a body of common sense, expressed declaratively, that it could reason over. The paper is credited with founding declarative, knowledge-based AI. McCarthy spent the next forty years working on formalisms like situation calculus and circumscription to capture commonsense reasoning rigorously.
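The core idea of the Advice Taker, storing knowledge as declarative sentences and deriving new conclusions from them, can be illustrated with a small forward-chaining loop. This is only a sketch under strong simplifying assumptions: it is propositional (each sentence is an opaque string), whereas McCarthy described a richer logical language, and the facts, rules, and predicate names below are hypothetical, loosely modelled on the airport example.

```python
# Hypothetical facts and rules, loosely modelled on the airport example.
# Each "sentence" is treated as an opaque string (propositional, not first-order).
FACTS = {"at(me, desk)", "walkable(desk, car)", "drivable(car, airport)"}

RULES = [
    # (set of premises, conclusion)
    ({"at(me, desk)", "walkable(desk, car)"}, "can_reach(me, car)"),
    ({"can_reach(me, car)", "drivable(car, airport)"}, "can_reach(me, airport)"),
]


def forward_chain(facts, rules):
    """Apply rules whose premises all hold until no new sentence is derived."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= known and conclusion not in known:
                known.add(conclusion)
                changed = True
    return known


derived = forward_chain(FACTS, RULES)
print(sorted(derived - FACTS))
# ['can_reach(me, airport)', 'can_reach(me, car)']
```

Adding "advice" in this toy setting just means adding another string to `FACTS` or another pair to `RULES`; the inference loop itself does not change, which is the declarative property the paper argued for.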
Through the 1970s, two camps emerged. The neats, led by McCarthy, wanted clean logical foundations. The scruffies, led by Marvin Minsky and Roger Schank, argued that human commonsense was too messy for first-order logic. Minsky's 1974 frames paper proposed structured prototypes with default values that could be overridden. Schank's scripts, developed with Robert Abelson in their 1977 book Scripts, Plans, Goals and Understanding, applied the same idea to event sequences. The restaurant script captured the standard sequence of being seated, ordering, eating, paying, and leaving, so a question-answering system could fill in details that were not stated. Projects like MARGIE, SAM, and FRUMP demonstrated impressive but narrow results; they worked when input matched a known script and broke down when it did not.
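Both ideas are easy to sketch in modern code. The example below is an illustration only, not Minsky's or Schank's actual formalism: the `Frame` class, the slot names, and the `fill_in` helper are hypothetical. It shows a frame whose default slot values are inherited unless a more specific frame overrides them, and a script used to infer unstated steps between two observed events.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Frame:
    """A stereotype with default slot values and an optional parent frame."""
    name: str
    parent: Optional[Frame] = None
    slots: dict[str, Any] = field(default_factory=dict)

    def get(self, slot: str) -> Any:
        # Look in this frame first, then fall back to inherited defaults.
        frame: Optional[Frame] = self
        while frame is not None:
            if slot in frame.slots:
                return frame.slots[slot]
            frame = frame.parent
        raise KeyError(slot)


bird = Frame("bird", slots={"can_fly": True, "has_feathers": True})
penguin = Frame("penguin", parent=bird, slots={"can_fly": False})

# The restaurant script as an ordered list of steps (taken from the text above).
RESTAURANT_SCRIPT = ["be seated", "order", "eat", "pay", "leave"]


def fill_in(observed: list[str]) -> list[str]:
    """Return script steps that were not stated but lie between observed steps."""
    positions = [RESTAURANT_SCRIPT.index(step) for step in observed]
    lo, hi = min(positions), max(positions)
    return [s for s in RESTAURANT_SCRIPT[lo:hi + 1] if s not in observed]


print(penguin.get("can_fly"))        # False (override)
print(penguin.get("has_feathers"))   # True (inherited default)
print(fill_in(["order", "leave"]))   # ['eat', 'pay']
```

The sketch also shows the failure mode described above: `fill_in` raises an error as soon as an observed event is not part of the known script.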
In 1984 Doug Lenat began Cyc at the Microelectronics and Computer Technology Corporation (MCC). The bet was that human commonsense could be encoded by a team of trained ontologists if they were given enough time. By 1995, when Lenat published the project's status report in Communications of the ACM, the project had committed about a person-century of effort and built up roughly 100,000 concepts with around a million axioms. By the 2010s the number of assertions had grown into the millions. See Cyc for the full history.
Cyc remains the most ambitious symbolic commonsense project ever attempted. Its ResearchCyc and OpenCyc releases influenced ontology engineering and were used in defence, intelligence, and biomedical applications. Critics, including Marcus and Davis, argued that the knowledge was difficult to keep consistent at scale, that hand authoring inevitably missed obvious facts, and that no formalism captured the full range of commonsense inference. Defenders responded that nothing else came close to the breadth and depth of Cyc's content.
A different approach started at MIT in 1999. Push Singh, with Marvin Minsky and Catherine Havasi, launched Open Mind Common Sense, a website that asked the general public to type in commonsense facts. By 2002 it had collected over 450,000 statements from more than 9,000 contributors. Hugo Liu and Push Singh's 2004 paper ConceptNet, A Practical Commonsense Reasoning Tool-Kit in BT Technology Journal turned that raw text into a semantic network of around 1.6 million assertions covering spatial, physical, social, temporal, and psychological knowledge. ConceptNet became one of the most cited commonsense resources of the 2000s and 2010s and was integrated into many later benchmarks.
From around 2018, a Seattle group around Yejin Choi at the Allen Institute for AI and the University of Washington produced a wave of new resources. ATOMIC, presented by Sap and colleagues at AAAI 2019, contained 877,000 if-then inferences about everyday events organised under nine relation types covering causes, effects, intents, and reactions. COMET, by Bosselut and colleagues at ACL 2019, fine-tuned a GPT model on ATOMIC and ConceptNet so it could generate new triples on demand, scoring up to 77.5% precision at top 1 on ATOMIC. The same group released SocialIQA for social reasoning, PIQA for physical reasoning, and WinoGrande, a 44,000-example adversarial scaling of the Winograd Schema Challenge.
In parallel, Talmor and colleagues at Tel Aviv University released CommonsenseQA at NAACL 2019, drawn from ConceptNet and aimed at multiple-choice question answering, and Zellers and colleagues released HellaSwag at ACL 2019, a sentence completion benchmark generated by adversarial filtering against early language models.
| Approach | Representative work | Strengths | Weaknesses |
|---|---|---|---|
| Knowledge based, symbolic | Cyc, WordNet, ConceptNet, ResearchCyc, OpenCyc | Interpretable; supports formal inference; precise on what it covers | Brittle; expensive to author; coverage gaps; consistency hard to maintain |
| Statistical, corpus based | Distributional semantics; PMI on web text | Cheap; broad coverage; reflects actual usage | Reflects co-occurrence, not necessarily truth; struggles with negation and rare events |
| Crowdsourced | Open Mind Common Sense, ATOMIC, SocialIQA training data | Captures naive human descriptions; broad informal coverage | Quality varies; biased to contributors; redundant or contradictory entries |
| Pretrained transformer | BERT, GPT-3, GPT-4, Claude, Gemini, Llama | Strong on many benchmarks; absorbs implicit world knowledge from pretraining | Hard to inspect; can hallucinate; dependence on training data raises contamination concerns |
| Hybrid neuro-symbolic | COMET, COMET-ATOMIC 2020, retrieval augmented LLMs | Combines structured knowledge with neural fluency | Engineering complexity; brittleness inherited from both sides |
| Embodied and grounded | RT-2, OpenVLA, GR00T, robotics foundation models | Forces models to reckon with real physics and social context | Data hungry; expensive to train; safety constraints; still early |
The symbolic approach treats commonsense as a database of facts and rules. Cyc is the canonical example. WordNet, developed by George Miller's group at Princeton from the mid-1980s and described in Miller's 1995 Communications of the ACM paper, organises English words into synsets linked by hypernymy, meronymy, and a few other relations. ConceptNet uses informal, crowdsourced relations like UsedFor, AtLocation, Causes, and MotivatedByGoal. These resources support transparent inference: given a query, you can show the chain of triples that produced the answer. Their weakness is coverage; no matter how many facts you encode, the next new domain demands more.
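The transparency claim can be made concrete with a toy triple store. The sketch below is not ConceptNet's API or data: the four assertions are invented for illustration, and `explain` is a hypothetical helper. It runs a breadth-first search over (head, relation, tail) triples and returns the chain that links two concepts, which is exactly the kind of trace a symbolic resource can show for an answer.

```python
from collections import deque

# Hypothetical toy assertions in a ConceptNet-like (head, relation, tail) form.
TRIPLES = [
    ("knife", "UsedFor", "cutting"),
    ("knife", "AtLocation", "kitchen"),
    ("cutting", "Causes", "pieces"),
    ("kitchen", "UsedFor", "cooking"),
]


def explain(start, goal, triples=TRIPLES):
    """Return the shortest chain of triples from start to goal, or None."""
    outgoing = {}
    for head, relation, tail in triples:
        outgoing.setdefault(head, []).append((head, relation, tail))

    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for triple in outgoing.get(node, []):
            tail = triple[2]
            if tail not in seen:
                seen.add(tail)
                queue.append((tail, path + [triple]))
    return None


print(explain("knife", "cooking"))
# [('knife', 'AtLocation', 'kitchen'), ('kitchen', 'UsedFor', 'cooking')]
print(explain("knife", "elephant"))
# None
```

The second call illustrates the coverage weakness: a concept that nobody encoded simply has no chain, and the system has nothing to say about it.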
A second strand learns commonsense indirectly from large text corpora. Pointwise mutual information, distributional embeddings like word2vec and GloVe, and topic models all capture some regularities. They are cheap and broad but make obvious mistakes; distributional models often treat antonyms as similar because they appear in similar contexts.
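As a rough illustration of the statistical strand, pointwise mutual information compares how often two words co-occur with how often they would co-occur if they were independent: `PMI(x, y) = log2(p(x, y) / (p(x) * p(y)))`. The sketch below estimates it from sentence-level co-occurrence. The four-sentence corpus is invented for illustration and `pmi_table` is a hypothetical helper; real systems use far larger corpora and usually some smoothing.

```python
import math
from collections import Counter
from itertools import combinations

# Toy corpus, invented for illustration; each sentence is one co-occurrence window.
CORPUS = [
    "the soup is hot",
    "the soup is cold",
    "the ice is cold",
    "the fire is hot",
]


def pmi_table(sentences):
    """Return base-2 PMI for every unordered word pair that shares a sentence."""
    n = len(sentences)
    word_counts = Counter()
    pair_counts = Counter()
    for sentence in sentences:
        words = sorted(set(sentence.split()))
        word_counts.update(words)
        # Pairs are stored in alphabetical order, e.g. ("fire", "hot").
        pair_counts.update(combinations(words, 2))

    table = {}
    for (a, b), joint in pair_counts.items():
        p_ab = joint / n
        p_a = word_counts[a] / n
        p_b = word_counts[b] / n
        table[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return table


pmi = pmi_table(CORPUS)
print(pmi[("fire", "hot")])    # 1.0: co-occur more often than chance
print(pmi[("hot", "soup")])    # 0.0: no association either way
print(pmi[("cold", "soup")])   # 0.0: same score as its antonym
```

In this toy corpus "hot" and "cold" have identical association with "soup", a small-scale version of the antonym problem: co-occurrence statistics record that both words appear in the same contexts, not that they mean opposite things.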
Open Mind Common Sense and ATOMIC ask people directly. The advantage is naturalness; humans state things no one would bother to write in an encyclopedia, like "if you sit on a chair it usually does not collapse." The disadvantage is variance; contributors write redundant trivia or contradictory entries, and the base is biased toward whoever shows up.
From 2018 onward the dominant approach has been to train very large neural networks on internet-scale text and let commonsense emerge implicitly. BERT in 2018 substantially raised scores on several benchmarks. GPT-3 in 2020 went further. GPT-4 in 2023 approached or matched reported human performance on several older commonsense benchmarks. The implicit story is that the internet contains so many throwaway commonsense statements that a large enough model can learn the regularities. Whether the resulting behaviour counts as reasoning, or as competent interpolation, remains contested.
COMET (Bosselut et al. 2019) is the prototypical hybrid. It fine-tunes a language model on ATOMIC tuples so that, given a head and a relation, it generates plausible tails. This combines the breadth of pretraining with the structured supervision of a knowledge graph. Later work like COMET-ATOMIC 2020 expanded the data, and retrieval-augmented LLMs that look up facts in ConceptNet or Wikidata follow the same pattern.
Most progress in commonsense reasoning since 2011 has been driven by benchmarks. The format is usually multiple choice or sentence completion, which makes evaluation simple but also makes it easier for models to exploit shortcuts.
| Benchmark | Year | Authors | What it tests | Approximate top accuracy by 2024 |
|---|---|---|---|---|
| COPA | 2011 | Roemmele, Bejan, Gordon | Causal reasoning, choose the more plausible cause or effect | Above 95% (saturated) |
| Winograd Schema Challenge | 2012 | Levesque, Davis, Morgenstern | Pronoun disambiguation requiring world knowledge | Above 90% (largely solved by 2019 to 2020) |
| Story Cloze Test, ROCStories | 2016 | Mostafazadeh et al. | Choosing the right ending for a four-sentence story context | Above 95% (saturated) |
| SWAG | 2018 | Zellers et al. | Grounded commonsense inference | Above 90% (saturated) |
| ARC Challenge | 2018 | Clark et al. | Grade school science questions requiring reasoning | Around 96% with leading LLMs |
| HellaSwag | 2019 | Zellers et al. | Sentence completion with adversarial distractors | GPT-4 about 95.3%, human about 95.6% |
| CommonsenseQA | 2019 | Talmor et al. | Multiple choice questions built from ConceptNet | Around 90% with leading LLMs (BERT-large baseline 56%) |
| SocialIQA | 2019 | Sap et al. | Reasoning about social interactions | Around 80 to 85% |
| WinoGrande | 2019 | Sakaguchi et al. | 44k adversarially filtered Winograd-style problems | Around 87 to 90% |
| PIQA | 2020 | Bisk et al. | Physical commonsense, pick the better solution | Around 90% (humans about 95%) |
| ATOMIC, ATOMIC 2020 | 2019, 2021 | Sap, Hwang et al. | Generative if-then inference over events | Used as training and evaluation, not pure leaderboard |
Performance numbers move fast and depend heavily on prompting, few-shot examples, and whether the test set was contaminated by pretraining data. Treating any single number as definitive is a mistake.
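To make the two quantities in this caveat concrete, the sketch below computes multiple-choice accuracy and a crude contamination signal based on word n-gram overlap between a test item and a body of training text. It is a minimal illustration under stated assumptions: the function names are hypothetical, the choice of `n=5` is arbitrary, and n-gram overlap is only one heuristic among several; it is not the procedure used by any particular benchmark or model report.

```python
def accuracy(predictions, gold):
    """Fraction of items whose predicted choice index matches the gold index."""
    if not gold or len(predictions) != len(gold):
        raise ValueError("predictions and gold must be non-empty and equal length")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def ngrams(text, n):
    """Set of lowercase word n-grams in text (whitespace tokenisation)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(test_item, corpus_text, n=5):
    """Share of the test item's n-grams that also occur in corpus_text."""
    item = ngrams(test_item, n)
    if not item:
        return 0.0
    return len(item & ngrams(corpus_text, n)) / len(item)


item = "the trophy did not fit in the suitcase because it was too big"
leaked = "a forum post quoting " + item + " verbatim"
clean = "water spilt on a table will eventually drip onto the floor"

print(accuracy([0, 1, 1, 0], [0, 1, 0, 0]))   # 0.75
print(overlap_fraction(item, leaked))         # 1.0
print(overlap_fraction(item, clean))          # 0.0
```

A high overlap does not prove a model memorised the item, and a low one does not rule out paraphrased leakage, which is part of why single benchmark numbers are hard to interpret.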
The most striking development of the past few years is that LLMs do remarkably well on most established commonsense benchmarks. GPT-4, on HellaSwag, scored about 95.3% in 2023, essentially matching the human ceiling of 95.6%. On WinoGrande, ARC Challenge, PIQA, and CommonsenseQA, frontier models in 2024 and 2025 reach the high eighties or nineties. The Bubeck et al. 2023 paper Sparks of Artificial General Intelligence documents wide ranging GPT-4 successes across mathematics, coding, vision, medicine, law, and psychology, and explicitly argues that GPT-4 shows commonsense competence well beyond previous models.
The same model, without retraining, handles physical, social, and causal questions across domains. It can produce plausible chains of thought when asked to explain its answer, and adapt to novel scenarios that almost certainly were not in its training data verbatim. There are also reasons for caution.
The live debate is whether large pretrained models do commonsense reasoning that genuinely generalises, or whether they perform very high quality interpolation from a training set so large that almost everything is in distribution. The honest answer is that the question is open, and the framing may be misleading; humans do something in between as well.
| Challenge | Why it matters |
|---|---|
| Counterfactual reasoning | Real understanding requires answering "what if" questions, not just typical-case ones. |
| Out of distribution generalisation | A model trained on internet text often fails when reasoning about genuinely novel situations. |
| Combining commonsense with formal verification | Safety critical applications need guaranteed properties, not just average-case competence. |
| Continual learning of new commonsense | The world keeps producing new objects, customs, and norms; static datasets go stale. |
| Cultural and contextual variation | What counts as common sense in Tokyo, Lagos, or rural Kansas differs; current models are skewed toward Western internet text. |
| Evaluation beyond multiple choice | Multiple choice formats are vulnerable to shortcut learning; open ended evaluation is harder to score. |
| Faithful explanations | Chain of thought explanations sometimes look right while masking incorrect underlying reasoning. |
| Embodied grounding | Reading about pouring water is not the same as having poured water; robotics suggests grounded commonsense may require interaction. |
Some kinds of commonsense, particularly intuitive physics and motor planning, may not be fully learnable from text. You can read every cookbook ever written and still not know how heavy a cast iron pan is when full of water. Robotics foundation models try to close this gap. Google DeepMind's RT-2 in 2023 fine-tuned vision-language models on robot demonstrations to combine internet-scale knowledge with motor control. OpenVLA, released in 2024 by a Stanford-led team, is an open seven-billion-parameter vision-language-action model trained on the Open X-Embodiment dataset, which collected over a million episodes from twenty-two embodiments across twenty-one institutions. NVIDIA's GR00T N1, released in March 2025, applies a dual-system architecture to humanoid robots, with a slow vision-language module that interprets the scene and a fast diffusion transformer that generates motor actions. Whether grounded training produces commonsense that generalises beyond the demonstration distribution is one of the more interesting open questions of the next few years.
Commonsense reasoning is closely tied to general reasoning and to question answering. It informs work on knowledge graphs and on knowledge editing, since updating an LLM's stored facts often runs into commonsense entailments. It has analogues in cognitive science and developmental psychology, where intuitive physics and theory of mind in infants give a reference for what AI systems should eventually match.