Commonsense reasoning

Updated Jun 24, 2026

Commonsense reasoning is the ability to make the everyday, mostly tacit assumptions that ordinary humans take for granted: the implicit knowledge about how the physical world behaves, how minds work, how time and causation flow, and how social interactions unfold. It is the kind of reasoning a five-year-old already does effortlessly: knowing that water spilt on a table will eventually drip onto the floor, that a person who smiles at you is probably not angry, that you cannot push a rope, that birds usually fly but penguins do not, and that an elephant will not fit inside a teacup. It is not the formal reasoning of mathematics, nor the specialised expertise of a doctor diagnosing a rare disease.

For more than six decades, commonsense reasoning has been considered one of the central, and long unsolved, problems of artificial intelligence. John McCarthy first proposed it as a research goal in his 1959 paper Programs with Common Sense, which is widely credited as the first work to argue that commonsense reasoning ability is the key to artificial intelligence and that commonsense knowledge could be expressed in formal logic.^[1] Marvin Minsky, Hector Levesque, Doug Lenat, Yejin Choi, Gary Marcus, Ernest Davis and many others have argued at various points that an AI without common sense is fundamentally brittle, no matter how impressive its narrow performance looks. The arrival of large pretrained transformer models in the late 2010s changed the empirical picture dramatically: many benchmarks that were considered hard in 2018 were saturated by 2023, with GPT-4 reaching about 95.3% on HellaSwag against a human ceiling of roughly 95.6%.^[2]^[3] Whether modern large language models actually do commonsense reasoning, or instead interpolate convincingly from massive training corpora, remains an open question.

What is commonsense knowledge?

Commonsense knowledge covers what most adults take for granted about everyday life. It is the substrate that lets a person understand a sentence like "the trophy did not fit in the suitcase because it was too big" without consciously deploying a rule. It is the reason a child knows that pulling on a string can move a kite but pushing on the string cannot. It is the reason a reader infers, from "Mary went to the restaurant. She left a generous tip," that Mary ate a meal there even though no sentence says so explicitly.

Davis and Marcus, in their 2015 review Commonsense Reasoning and Commonsense Knowledge in Artificial Intelligence in Communications of the ACM, divide commonsense knowledge into a small number of recurring domains.^[4]

Domain	Examples
Naive physics	Objects fall when unsupported. Hot things burn. Solid objects do not pass through one another. Water flows downhill and conforms to its container. Strings can pull but not push.
Naive psychology	People have beliefs, desires, and emotions. They act in pursuit of goals. They communicate, deceive, cooperate, and react to surprise.
Time and events	Time flows in one direction. Causes precede effects. Events have durations. Some processes are reversible, most are not.
Defaults and exceptions	Birds fly, except penguins, ostriches, and dead birds. Restaurants serve food, except when they are closed.
Numeric and quantitative	An elephant is bigger than a cat. A coffee cup holds less than a swimming pool. A human walks faster than a snail and slower than a car.
Social norms and pragmatics	People queue, take turns, say hello, hide embarrassment. Lying is usually disapproved of. Sarcasm reverses literal meaning.
Spatial	Containers have insides and outsides. You can be on, in, under, behind, or beside something. Two solid bodies do not occupy the same place.

What ties these domains together is that they are normally invisible. Humans deploy this knowledge without rehearsing it, which is exactly why writing it down for a computer turned out to be so hard.

Why has commonsense reasoning been so hard for AI?

There is a deep paradox at the centre of the field: tasks that humans find effortless have proven far harder to automate than tasks humans find difficult. Marvin Minsky observed that the first AI accomplishments were proofs in logic and calculus, yet there was no machine that could answer questions about a simple story in a first-grade reader. In The Society of Mind (1986) he put it directly: "Common sense is not a simple thing. Instead, it is an immense society of hard-earned practical ideas, of multitudes of life-learned rules and exceptions, dispositions and tendencies, balances and checks."^[5] Without that web of practical knowledge, AI systems remain excellent at narrow tasks but unable to handle anything outside their training.

Hector Levesque made a similar case in his 2013 IJCAI Research Excellence Award lecture On Our Best Behaviour, warning that good behavioural performance can be achieved with "cheap tricks" unrelated to genuine intelligence, and pressing the field to focus instead on what a system needs to know and how to represent it.^[6] The brittleness of classical expert systems in the 1970s and 1980s was a direct symptom of the commonsense gap. MYCIN could diagnose blood infections better than many physicians, but had no way of knowing that a patient who claimed to be pregnant and male was probably joking.

There is a phenomenon sometimes called the AI effect: once a problem is solved, it stops looking like AI. Chess, optical character recognition, and speech recognition all went through this cycle. Commonsense reasoning has not. Each apparent advance, from Cyc to ConceptNet to ATOMIC to GPT-4, has been followed by new tests showing the previous solution missed something important.

Historical timeline

Year	Event	Significance
1959	John McCarthy, Programs with Common Sense	First explicit proposal that AI should reason from a declarative store of commonsense facts. Introduced the Advice Taker.
1968	Quillian's semantic networks	Early attempt to represent meaning as a graph of concepts.
1974	Marvin Minsky, A Framework for Representing Knowledge	Introduced frames, structured stereotypes for everyday situations.
1975	Schank's MARGIE	Early conceptual dependency system attempting to represent the meaning of English sentences.
1976	SAM (Schank et al.)	Used scripts to summarise newspaper stories.
1977	Schank and Abelson, Scripts, Plans, Goals and Understanding	Foundational work on scripts (e.g. the restaurant script) for everyday situations.
1978	FRUMP	Faster script-based news skimmer following SAM.
1984	Doug Lenat begins Cyc at MCC	Decades-long effort to encode commonsense by hand.
1995	WordNet released by Princeton (Miller)	Lexical database of English organised by sense.
1995	Lenat, CYC: A Large-Scale Investment in Knowledge Infrastructure (CACM)	Reports a person-century of effort, around 100,000 concepts and a million handcrafted axioms.
1999	Open Mind Common Sense launched at MIT (Push Singh)	Crowdsourced commonsense knowledge from the general public.
2004	Liu and Singh, ConceptNet, A Practical Commonsense Reasoning Tool-Kit	Turned OMCS into a usable semantic network of 1.6 million assertions.
2011	Roemmele et al., COPA	First widely used multiple choice benchmark for commonsense causal reasoning.
2012	Levesque, Davis, Morgenstern, Winograd Schema Challenge	Pronoun disambiguation as a Turing-test alternative (273 problems in roughly 136 schema pairs).
2018	ARC, SWAG	Larger benchmarks for science and grounded inference.
2019	ATOMIC, COMET, CommonsenseQA, HellaSwag, SocialIQA, WinoGrande	Wave of large crowdsourced commonsense datasets and the first transformer-based generative commonsense model.
2020	PIQA	Physical commonsense benchmark inspired by Instructables.
2018 to present	BERT, GPT-2, GPT-3, GPT-4, Claude, Gemini	Implicit commonsense from large pretraining corpora; saturation of older benchmarks.
2023	Bubeck et al., Sparks of Artificial General Intelligence	Documents both impressive commonsense behaviour and surprising failures in early GPT-4.
2023 to 2025	RT-2, OpenVLA, GR00T	Vision language action models pushing toward grounded, embodied commonsense.

What did McCarthy's Advice Taker propose? (1959)

The origin point of the field is John McCarthy's paper Programs with Common Sense, presented at the Teddington Conference on the Mechanization of Thought Processes. McCarthy proposed an Advice Taker, a hypothetical program that would store knowledge as logical sentences and draw new conclusions when given advice in the same logical language. His example involved deducing that one needed to walk to a car and drive to the airport. It was probably the first proposal to use logic to represent information inside a computer rather than as the subject matter of another program.^[1] McCarthy's deeper claim was that any program with general competence would need a body of common sense, expressed declaratively, that it could reason over. The paper is credited with founding declarative, knowledge-based AI. McCarthy spent the next forty years working on formalisms like situation calculus and circumscription to capture commonsense reasoning rigorously.
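
The idea of storing knowledge as sentences and deriving new conclusions from them can be illustrated with a tiny forward-chaining loop. This is a minimal sketch and not McCarthy's formalism: the predicate strings, the two rules, and the `forward_chain` function are hypothetical, and the original proposal used logic with variables rather than flat strings.

```python
# Toy forward chaining over ground facts. An illustrative assumption,
# not the actual Advice Taker design.

# Each rule is (set of premises, conclusion). All names are hypothetical.
RULES = [
    ({"at(me, desk)", "walkable(desk, car)"}, "can_reach(me, car)"),
    ({"can_reach(me, car)", "drivable(car, airport)"}, "can_reach(me, airport)"),
]


def forward_chain(facts, rules):
    """Apply rules repeatedly until no new fact can be derived."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= known and conclusion not in known:
                known.add(conclusion)
                changed = True
    return known


initial = {"at(me, desk)", "walkable(desk, car)", "drivable(car, airport)"}
derived = forward_chain(initial, RULES)
print("can_reach(me, airport)" in derived)
```

The declarative point is that new "advice" is just another fact or rule added to the store; the inference loop itself does not change.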

Frames, scripts, and the scruffies

Through the 1970s, two camps emerged. The neats, led by McCarthy, wanted clean logical foundations. The scruffies, led by Marvin Minsky and Roger Schank, argued that human commonsense was too messy for first-order logic. Minsky's 1974 frames paper proposed structured prototypes with default values that could be overridden. Schank's scripts, developed with Robert Abelson in their 1977 book Scripts, Plans, Goals and Understanding, applied the same idea to event sequences. The restaurant script captured the standard sequence of being seated, ordering, eating, paying, and leaving, so a question-answering system could fill in details that were not stated. Projects like MARGIE, SAM, and FRUMP demonstrated impressive but narrow results; they worked when input matched a known script and broke down when it did not.
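
Default values that can be overridden are easy to picture in code. The sketch below is only an illustration of the general idea, not Minsky's notation: the `Frame` class, its `get` method, and the slot names are hypothetical.

```python
# Minimal frame sketch: slots with inherited defaults that a more
# specific frame can override. Class and slot names are hypothetical.

class Frame:
    def __init__(self, name, parent=None, **slots):
        self.name = name
        self.parent = parent
        self.slots = slots

    def get(self, slot):
        """Look up a slot locally, then fall back to ancestors' defaults."""
        frame = self
        while frame is not None:
            if slot in frame.slots:
                return frame.slots[slot]
            frame = frame.parent
        raise KeyError(slot)


bird = Frame("bird", can_fly=True, has_feathers=True)
penguin = Frame("penguin", parent=bird, can_fly=False)  # overrides the default
robin = Frame("robin", parent=bird)                     # inherits the default

print(robin.get("can_fly"), penguin.get("can_fly"), penguin.get("has_feathers"))
```

A script can be read the same way, with the slots holding the expected steps of an event sequence instead of the properties of an object.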

How much effort did Cyc take?

In 1984 Doug Lenat began Cyc at the Microelectronics and Computer Technology Corporation (MCC). The bet was that human commonsense could be encoded by a team of trained ontologists if they were given enough time. By 1995, when Lenat published the project's status report in Communications of the ACM, the team had committed about a person-century of effort and built a universal schema of roughly 100,000 (10^5) general concepts, with around a million (10^6) commonsense axioms handcrafted into the knowledge base and millions more inferred and cached.^[7] By the 2010s the count had grown into the millions. See Cyc for the full history.

Cyc remains the most ambitious symbolic commonsense project ever attempted. Its ResearchCyc and OpenCyc releases influenced ontology engineering, and the system was used in defence, intelligence, and biomedical applications. Critics, including Marcus and Davis, argued that the knowledge was difficult to keep consistent at scale, that hand authoring inevitably missed obvious facts, and that no formalism captured the full range of commonsense inference. Defenders responded that nothing else came close to the breadth and depth of Cyc's content.

From Open Mind Common Sense to ConceptNet

A different approach started at MIT in 1999. Push Singh, with Marvin Minsky and Catherine Havasi, launched Open Mind Common Sense, a website that asked the general public to type in commonsense facts. By 2002 it had collected over 450,000 statements from more than 9,000 contributors.^[8] Hugo Liu and Push Singh's 2004 paper ConceptNet, A Practical Commonsense Reasoning Tool-Kit in BT Technology Journal turned that raw text into a semantic network of over 1.6 million assertions covering the spatial, physical, social, temporal, and psychological aspects of everyday life.^[9] ConceptNet became one of the most cited commonsense resources of the 2000s and 2010s and was integrated into many later benchmarks.

The crowdsourced and neural era

From around 2018, a Seattle group around Yejin Choi at the Allen Institute for AI and the University of Washington produced a wave of new resources. ATOMIC, presented by Sap and colleagues at AAAI 2019, is a crowdsourced knowledge graph of about 300,000 event nodes and 877,000 if-then inferences about everyday events, organised under nine relation types that distinguish causes from effects, agents from themes, and actions from mental states.^[10] COMET, by Bosselut and colleagues at ACL 2019, fine-tuned a GPT model on ATOMIC and ConceptNet so it could generate new triples on demand, scoring up to 77.5% precision at 1 on ATOMIC, approaching human-judged quality.^[11] The same group released SocialIQA (over 38,000 social-reasoning questions),^[12] PIQA for physical reasoning, and WinoGrande, an adversarial scaling of the Winograd Schema Challenge to 43,985 examples.^[13]

In parallel, Talmor and colleagues at Tel Aviv University released CommonsenseQA at NAACL 2019, a set of 12,247 multiple-choice questions drawn from ConceptNet on which a BERT-large baseline scored 56% against 89% for humans,^[14] and Zellers and colleagues released HellaSwag at ACL 2019, a sentence completion benchmark generated by adversarial filtering against early language models.^[2]

Categories and approaches

Approach	Representative work	Strengths	Weaknesses
Knowledge based, symbolic	Cyc, WordNet, ConceptNet, ResearchCyc, OpenCyc	Interpretable; supports formal inference; precise on what it covers	Brittle; expensive to author; coverage gaps; consistency hard to maintain
Statistical, corpus based	Distributional semantics; PMI on web text	Cheap; broad coverage; reflects actual usage	Reflects co-occurrence, not necessarily truth; struggles with negation and rare events
Crowdsourced	Open Mind Common Sense, ATOMIC, SocialIQA training data	Captures naive human descriptions; broad informal coverage	Quality varies; biased to contributors; redundant or contradictory entries
Pretrained transformer	BERT, GPT-3, GPT-4, Claude, Gemini, Llama	Strong on many benchmarks; absorbs implicit world knowledge from pretraining	Hard to inspect; can hallucinate; dependence on training data raises contamination concerns
Hybrid neuro-symbolic	COMET, COMET-ATOMIC 2020, retrieval augmented LLMs	Combines structured knowledge with neural fluency	Engineering complexity; brittleness inherited from both sides
Embodied and grounded	RT-2, OpenVLA, GR00T, robotics foundation models	Forces models to reckon with real physics and social context	Data hungry; expensive to train; safety constraints; still early

Symbolic and knowledge based

The symbolic approach treats commonsense as a database of facts and rules. Cyc is the canonical example. WordNet, released by George Miller's group at Princeton in 1995, organises English words into synsets linked by hypernymy, meronymy, and a few other relations. ConceptNet uses informal, crowdsourced relations like UsedFor, AtLocation, Causes, and MotivatedByGoal. These resources support transparent inference: given a query, you can show the chain of triples that produced the answer. Their weakness is coverage; no matter how many facts you encode, the next new domain demands more.
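
Transparent inference of this kind amounts to a path search over triples. The following is a minimal sketch under stated assumptions: the triples are made up in the style of ConceptNet relations rather than copied from the real resource, and `explain` is a hypothetical helper, not part of any library.

```python
from collections import deque

# Hypothetical triples in the style of ConceptNet relations; they are
# illustrative and not copied from the real resource.
TRIPLES = [
    ("kettle", "UsedFor", "boil water"),
    ("boil water", "Causes", "steam"),
    ("kettle", "AtLocation", "kitchen"),
    ("steam", "Causes", "burn"),
]


def explain(start, goal, triples):
    """Breadth-first search for the shortest chain of triples from start to goal."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for head, rel, tail in triples:
            if head == node and tail not in seen:
                seen.add(tail)
                queue.append((tail, path + [(head, rel, tail)]))
    return None  # no chain found, i.e. a coverage gap


for triple in explain("kettle", "burn", TRIPLES) or []:
    print(triple)
```

The returned list is the explanation itself. When the search returns `None`, the system has hit exactly the coverage weakness described above.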

Statistical and corpus based

A second strand learns commonsense indirectly from large text corpora. Pointwise mutual information, distributional embeddings like word2vec and GloVe, and topic models all capture some regularities. They are cheap and broad but make obvious mistakes; distributional models often treat antonyms as similar because they appear in similar contexts.
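
Pointwise mutual information compares how often two words co-occur with how often they would co-occur if independent: PMI(x, y) = log2(p(x, y) / (p(x) p(y))). The sketch below computes it on a four-document toy corpus. The corpus, the document-level co-occurrence window, and the `pmi_table` function are assumptions made for illustration only.

```python
import math
from collections import Counter
from itertools import combinations

# Tiny made-up corpus; each "document" is a set of words.
CORPUS = [
    {"hot", "stove", "burn"},
    {"cold", "stove", "safe"},
    {"hot", "coffee", "burn"},
    {"cold", "coffee", "drink"},
]


def pmi_table(docs):
    """Return base-2 PMI for every word pair that co-occurs in some document."""
    n = len(docs)
    word_counts = Counter(w for doc in docs for w in doc)
    pair_counts = Counter(
        frozenset(pair) for doc in docs for pair in combinations(sorted(doc), 2)
    )
    return {
        pair: math.log2((count / n) / ((word_counts[a] / n) * (word_counts[b] / n)))
        for pair, count in pair_counts.items()
        for a, b in [tuple(sorted(pair))]
    }


scores = pmi_table(CORPUS)
print(scores[frozenset({"hot", "burn"})])
print(scores[frozenset({"hot", "stove"})], scores[frozenset({"cold", "stove"})])
```

In this toy corpus "hot" and "cold" have identical PMI with "stove" and with "coffee", so a model that only looks at shared contexts would place the two antonyms close together.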

Crowdsourced

Open Mind Common Sense and ATOMIC ask people directly. The advantage is naturalness; humans state things no one would bother to write in an encyclopedia, like "if you sit on a chair it usually does not collapse." The disadvantage is variance; contributors write redundant trivia or contradictory entries, and the base is biased toward whoever shows up.

Pretrained language models

From 2018 onward the dominant approach has been to train very large neural networks on internet-scale text and let commonsense emerge implicitly. BERT in 2018 raised scores substantially on several benchmarks. GPT-3 in 2020 went further. GPT-4 in 2023 approached or matched reported human performance on many older commonsense benchmarks. The underlying hypothesis is that the internet contains so many throwaway commonsense statements that a large enough model can learn the regularities. Whether the resulting behaviour counts as reasoning, or merely as competent interpolation, remains contested.

Hybrid neuro-symbolic

COMET (Bosselut et al. 2019) is the prototypical hybrid. It fine-tunes a language model on ATOMIC tuples so that, given a head and a relation, it generates plausible tails.^[11] This combines the breadth of pretraining with the structured supervision of a knowledge graph. Later work like COMET-ATOMIC 2020 expanded the data, and retrieval-augmented LLMs that look up facts in ConceptNet or Wikidata follow the same pattern.

Benchmarks

Most progress in commonsense reasoning since 2011 has been driven by benchmarks. The format is usually multiple choice or sentence completion, which makes evaluation simple but also makes it easier for models to exploit shortcuts.

Benchmark	Year	Authors	What it tests	Approximate top accuracy by 2024
COPA	2011	Roemmele, Bejan, Gordon	Causal reasoning, choose the more plausible cause or effect	Above 95% (saturated)
Winograd Schema Challenge	2012	Levesque, Davis, Morgenstern	Pronoun disambiguation requiring world knowledge	Above 90% (largely solved by 2019 to 2020)
Story Cloze Test, ROCStories	2016	Mostafazadeh et al.	Choosing the right ending to a four sentence story	Above 95% (saturated)
SWAG	2018	Zellers et al.	Grounded commonsense inference	Above 90% (saturated)
ARC Challenge	2018	Clark et al.	Grade school science questions requiring reasoning	Around 96% with leading LLMs
HellaSwag	2019	Zellers et al.	Sentence completion with adversarial distractors	GPT-4 about 95.3%, human about 95.6%
CommonsenseQA	2019	Talmor et al.	Multiple choice questions built from ConceptNet	Around 90% with leading LLMs (BERT-large baseline 56%)
SocialIQA	2019	Sap et al.	Reasoning about social interactions	Around 80 to 85%
WinoGrande	2019	Sakaguchi et al.	44k adversarially filtered Winograd-style problems	Around 87 to 90%
PIQA	2020	Bisk et al.	Physical commonsense, pick the better solution	Around 90% (humans about 95%)
ATOMIC, ATOMIC 2020	2019, 2021	Sap, Hwang et al.	Generative if-then inference over events	Used as training and evaluation, not pure leaderboard

Performance numbers move fast and depend heavily on prompting, few-shot examples, and whether the test set was contaminated by pretraining data. Treating any single number as definitive is a mistake.

What is the Winograd Schema Challenge?

Proposed by Hector Levesque in 2011 and formalised by Levesque, Davis, and Morgenstern at KR-2012, the Winograd Schema Challenge was conceived as an alternative to the Turing Test with clear pass/fail criteria that does not rely on deception.^[15] A Winograd schema is a pair of nearly identical sentences that differ in one or two words but resolve a pronoun in opposite ways, so that picking the right referent requires world knowledge rather than surface statistics. The canonical example: "The city councilmen refused the demonstrators a permit because they feared violence" (where "they" is the councilmen) versus "...because they advocated violence" (where "they" is the demonstrators). The original collection comprised 273 hand-crafted problems organised as roughly 136 schema pairs.^[15] By 2019 to 2020 large language models had pushed accuracy above 90%, and Sakaguchi and colleagues built WinoGrande precisely because the original set proved exploitable by statistical shortcuts.^[13]
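
The paired structure is what makes the test resistant to fixed guessing, and a small data structure shows why. This is a sketch only: the `WinogradItem` class, the baseline resolver, and the pair-level scoring are hypothetical choices for illustration, not the official evaluation protocol.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WinogradItem:
    sentence: str
    pronoun: str
    candidates: tuple
    answer: str


# The canonical schema, stored as its two halves.
SCHEMA = (
    WinogradItem(
        "The city councilmen refused the demonstrators a permit "
        "because they feared violence.",
        "they", ("councilmen", "demonstrators"), "councilmen"),
    WinogradItem(
        "The city councilmen refused the demonstrators a permit "
        "because they advocated violence.",
        "they", ("councilmen", "demonstrators"), "demonstrators"),
)


def first_candidate_baseline(item):
    """A shortcut resolver that ignores the sentence and picks the first candidate."""
    return item.candidates[0]


def score(resolver, schema):
    """Return (per-item accuracy, whether every half of the schema was resolved)."""
    hits = [resolver(item) == item.answer for item in schema]
    return sum(hits) / len(hits), all(hits)


print(score(first_candidate_baseline, SCHEMA))
```

Because the two halves have opposite answers, any resolver that ignores the changed word gets exactly one of them right, which is why per-pair results are more informative than per-item accuracy.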

How well do large language models handle commonsense?

The most striking development of the past few years is that LLMs do remarkably well on most established commonsense benchmarks. GPT-4, on HellaSwag, scored 95.3% in the 10-shot setting in 2023, essentially matching the human ceiling of about 95.6%.^[2]^[3] On WinoGrande, ARC Challenge, PIQA, and CommonsenseQA, frontier models in 2024 and 2025 reach the high eighties or nineties. The Bubeck et al. 2023 paper Sparks of Artificial General Intelligence documents wide ranging GPT-4 successes across mathematics, coding, vision, medicine, law, and psychology, and explicitly argues that GPT-4 shows commonsense competence well beyond previous models.^[16]

The same model, without retraining, handles physical, social, and causal questions across domains. It can produce plausible chains of thought when asked to explain its answer, and adapt to novel scenarios that almost certainly were not in its training data verbatim. There are also reasons for caution.

Benchmark contamination. Many commonsense datasets are published on the public web, and frontier LLMs are trained on most of the public web. A 2025 paper by Chizhov and colleagues titled What the HellaSwag? On the Validity of Common-Sense Reasoning Benchmarks finds that HellaSwag has severe construct-validity issues (ungrammaticality, typos, equally correct options), and that over 65% of model predictions stay the same even when the question is replaced with placeholder text.^[17] WinoGrande's own authors warned in 2019 that high scores might reflect dataset biases rather than real reasoning, which motivated their AFLITE adversarial filtering.^[13]
Brittleness on novel scenarios. The Bubeck paper itself documents failures on counterfactual variants of standard problems.^[16] Marcus and Davis, in Rebooting AI and later essays, catalogue cases where slight rewordings flip a model's answer.^[18]
Chain of thought is not a guarantee of reasoning. Faithfulness studies show that the explanation an LLM produces does not always reflect the computation that produced its answer.
Robustness to distribution shift. Out of distribution test sets and hand crafted edge cases continue to drop accuracy substantially.
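
The placeholder-question check mentioned under benchmark contamination reduces to an agreement rate between two prediction runs. The sketch below shows that arithmetic only. The function name and the eight predictions are made up, and it does not reproduce the cited paper's exact procedure.

```python
def unchanged_fraction(original_preds, ablated_preds):
    """Share of items whose predicted option is the same with and without the real question."""
    if len(original_preds) != len(ablated_preds):
        raise ValueError("prediction lists must align item by item")
    if not original_preds:
        raise ValueError("need at least one prediction")
    same = sum(a == b for a, b in zip(original_preds, ablated_preds))
    return same / len(original_preds)


# Made-up option indices for eight items; not real benchmark data.
with_question = [0, 2, 1, 3, 0, 1, 2, 2]
with_placeholder = [0, 2, 1, 0, 0, 1, 3, 2]
print(unchanged_fraction(with_question, with_placeholder))
```

A high value suggests that the answer options alone carry most of the signal, so the benchmark may be measuring something other than reasoning about the question.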

The live debate is whether large pretrained models do commonsense reasoning that genuinely generalises, or whether they perform very high quality interpolation from a training set so large that almost everything is in distribution. The honest answer is that the question is open, and the framing may be misleading; humans do something in between as well.

Open challenges

Challenge	Why it matters
Counterfactual reasoning	Real understanding requires answering "what if" questions, not just typical-case ones.
Out of distribution generalisation	A model trained on internet text often fails when reasoning about genuinely novel situations.
Combining commonsense with formal verification	Safety critical applications need guaranteed properties, not just average-case competence.
Continual learning of new commonsense	The world keeps producing new objects, customs, and norms; static datasets go stale.
Cultural and contextual variation	What counts as common sense in Tokyo, Lagos, or rural Kansas differs; current models are skewed toward Western internet text.
Evaluation beyond multiple choice	Multiple choice formats are vulnerable to shortcut learning; open ended evaluation is harder to score.
Faithful explanations	Chain of thought explanations sometimes look right while masking incorrect underlying reasoning.
Embodied grounding	Reading about pouring water is not the same as having poured water; robotics suggests grounded commonsense may require interaction.

Can commonsense be learned without a body?

Some kinds of commonsense, particularly intuitive physics and motor planning, may not be fully learnable from text. You can read every cookbook ever written and still not know how heavy a cast iron pan is when full of water. Robotics foundation models try to close this gap. Google DeepMind's RT-2 in 2023 fine-tuned vision-language models on robot demonstrations to combine internet-scale knowledge with motor control.^[19] OpenVLA, released in 2024 by a Stanford-led team, is an open seven-billion-parameter vision-language-action model trained on 970,000 robot manipulation episodes from the Open X-Embodiment dataset, which collected over a million episodes from twenty-two embodiments across twenty-one institutions.^[20] NVIDIA's Isaac GR00T N1, unveiled at GTC on March 18, 2025, applies a dual-system architecture to humanoid robots, with a slow "System 2" vision-language module that interprets the scene at around 10 Hz and a fast "System 1" diffusion transformer that generates motor actions at around 120 Hz.^[21] Whether grounded training produces commonsense that generalises beyond the demonstration distribution is one of the more interesting open questions of the next few years.
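
The two rates quoted for the dual-system design imply that each slow-module output is reused for about twelve fast action steps (120 / 10). The loop below sketches that timing relationship only. It is not NVIDIA's implementation: `run`, `plan_fn`, and `act_fn` are hypothetical stand-ins, and a real system would run the two modules asynchronously on hardware.

```python
SLOW_HZ = 10    # approximate rate quoted for the scene-interpreting module
FAST_HZ = 120   # approximate rate quoted for the action generator
STEPS_PER_PLAN = FAST_HZ // SLOW_HZ  # fast ticks that reuse one slow output


def run(duration_s, plan_fn, act_fn):
    """Simulate a two-rate loop and return the list of emitted actions."""
    actions = []
    plan = None
    for tick in range(int(duration_s * FAST_HZ)):
        if tick % STEPS_PER_PLAN == 0:
            plan = plan_fn(tick)            # slow path: refresh the plan
        actions.append(act_fn(plan, tick))  # fast path: act on the latest plan
    return actions


# Hypothetical stand-ins: the "plan" is just the tick at which it was made.
actions = run(1.0, plan_fn=lambda t: t, act_fn=lambda plan, t: (plan, t))
print(len(actions), len({plan for plan, _ in actions}))
```

Over one simulated second the loop emits 120 actions from 10 distinct plans, which is the division of labour the slow and fast modules are meant to achieve.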

Connections to other areas

Commonsense reasoning is closely tied to general reasoning and to question answering. It informs work on knowledge graphs and on knowledge editing, since updating an LLM's stored facts often runs into commonsense entailments. It has analogues in cognitive science and developmental psychology, where intuitive physics and theory of mind in infants give a reference for what AI systems should eventually match.

References

1. McCarthy, J. (1959). *Programs with Common Sense*. Proceedings of the Teddington Conference on the Mechanization of Thought Processes. https://www-formal.stanford.edu/jmc/mcc59.pdf
2. Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A., and Choi, Y. (2019). *HellaSwag: Can a Machine Really Finish Your Sentence?* ACL 2019. https://aclanthology.org/P19-1472/
3. OpenAI (2023). *GPT-4 Technical Report*. arXiv:2303.08774 (HellaSwag 95.3%, 10-shot). https://arxiv.org/abs/2303.08774
4. Davis, E., and Marcus, G. (2015). *Commonsense Reasoning and Commonsense Knowledge in Artificial Intelligence*. Communications of the ACM, 58(9), 92-103. https://dl.acm.org/doi/10.1145/2701413
5. Minsky, M. (1986). *The Society of Mind*. Simon and Schuster.
6. Levesque, H. (2013). *On Our Best Behaviour*. IJCAI Research Excellence Award Lecture, IJCAI-13. https://www.cs.toronto.edu/~hector/Papers/ijcai-13-paper.pdf
7. Lenat, D. B. (1995). *CYC: A Large-Scale Investment in Knowledge Infrastructure*. Communications of the ACM, 38(11), 33-38. https://dl.acm.org/doi/10.1145/219717.219745
8. Singh, P., Lin, T., Mueller, E. T., Lim, G., Perkins, T., and Zhu, W. L. (2002). *Open Mind Common Sense: Knowledge Acquisition from the General Public*. CoopIS/DOA/ODBASE.
9. Liu, H., and Singh, P. (2004). *ConceptNet, A Practical Commonsense Reasoning Tool-Kit*. BT Technology Journal, 22(4), 211-226. https://link.springer.com/article/10.1023/B:BTTJ.0000047600.45421.6d
10. Sap, M., Le Bras, R., Allaway, E., Bhagavatula, C., Lourie, N., Rashkin, H., Roof, B., Smith, N. A., and Choi, Y. (2019). *ATOMIC: An Atlas of Machine Commonsense for If-Then Reasoning*. AAAI 2019. https://arxiv.org/abs/1811.00146
11. Bosselut, A., Rashkin, H., Sap, M., Malaviya, C., Celikyilmaz, A., and Choi, Y. (2019). *COMET: Commonsense Transformers for Automatic Knowledge Graph Construction*. ACL 2019. https://aclanthology.org/P19-1470/
12. Sap, M., Rashkin, H., Chen, D., Le Bras, R., and Choi, Y. (2019). *Social IQa: Commonsense Reasoning about Social Interactions*. EMNLP 2019. https://aclanthology.org/D19-1454/
13. Sakaguchi, K., Le Bras, R., Bhagavatula, C., and Choi, Y. (2021). *WinoGrande: An Adversarial Winograd Schema Challenge at Scale*. Communications of the ACM, 64(9), 99-106. https://cacm.acm.org/magazines/2021/9/255048-winogrande/fulltext
14. Talmor, A., Herzig, J., Lourie, N., and Berant, J. (2019). *CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge*. NAACL-HLT 2019. https://aclanthology.org/N19-1421/
15. Levesque, H., Davis, E., and Morgenstern, L. (2012). *The Winograd Schema Challenge*. Proceedings of KR-2012. https://cs.nyu.edu/~davise/papers/WinogradSchemas/WS.html
16. Bubeck, S., et al. (2023). *Sparks of Artificial General Intelligence: Early Experiments with GPT-4*. arXiv:2303.12712. https://arxiv.org/abs/2303.12712
17. Chizhov, P., Nee, M., Langlais, P.-C., and Yamshchikov, I. P. (2025). *What the HellaSwag? On the Validity of Common-Sense Reasoning Benchmarks*. arXiv:2504.07825. https://arxiv.org/abs/2504.07825
18. Marcus, G., and Davis, E. (2019). *Rebooting AI: Building Artificial Intelligence We Can Trust*. Pantheon Books.
19. Brohan, A., et al. (2023). *RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control*. https://robotics-transformer2.github.io/
20. Kim, M. J., et al. (2024). *OpenVLA: An Open-Source Vision-Language-Action Model*. arXiv:2406.09246. https://arxiv.org/abs/2406.09246
21. NVIDIA (2025). *GR00T N1: An Open Foundation Model for Generalist Humanoid Robots*. arXiv:2503.14734. https://arxiv.org/abs/2503.14734
