CommonsenseQA is a multiple-choice question answering benchmark designed to evaluate the ability of artificial intelligence systems to perform commonsense reasoning. Introduced in 2019 by Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant, the dataset contains 12,247 questions that were crowdsourced using the ConceptNet knowledge graph as a structural foundation. Each question presents five answer choices (one correct, four distractors) and requires knowledge that most humans take for granted but that machines have historically struggled to acquire. CommonsenseQA was published as a long paper at the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT 2019) in Minneapolis, Minnesota. The dataset has since accumulated over 1,800 citations on Semantic Scholar, making it one of the most widely used benchmarks in natural language processing research focused on commonsense understanding.
Commonsense reasoning has long been recognized as a fundamental challenge in artificial intelligence. Humans effortlessly draw on a vast body of background knowledge when interpreting language, making inferences, and answering questions about the world. A person understands, for instance, that rivers flow downhill, that people celebrate birthdays, or that a cold glass of water will eventually reach room temperature. This kind of knowledge is so obvious that it is seldom stated explicitly in written text, a phenomenon known as reporting bias.
Before CommonsenseQA, several benchmarks existed for testing commonsense reasoning in NLP systems. The Winograd Schema Challenge (Levesque et al., 2012) offered carefully crafted pronoun resolution problems but contained only 273 examples, limiting its statistical reliability. The Choice of Plausible Alternatives (COPA) dataset (Roemmele et al., 2011) provided 1,000 causal reasoning questions. The Story Cloze Test (Mostafazadeh et al., 2016) asked models to pick the correct ending for short stories, but researchers discovered annotation artifacts that allowed models to score well without genuine understanding. Similarly, SWAG (Zellers et al., 2018) was created for grounded commonsense inference, yet pretrained models like BERT quickly achieved near-human performance through fine-tuning, suggesting the dataset did not fully capture the depth of commonsense reasoning.
The creators of CommonsenseQA observed two shortcomings in prior work. First, many existing datasets were either too small to support robust machine learning experiments or contained statistical artifacts that models could exploit. Second, most benchmarks provided a textual context alongside each question, meaning that models could sometimes identify the correct answer by pattern matching within the given passage rather than by reasoning about the world. CommonsenseQA addressed both problems by creating a larger dataset and by formulating questions that do not come with a supporting passage; answering them correctly demands genuine background knowledge.
The construction of CommonsenseQA relies on ConceptNet, a large-scale commonsense knowledge graph in which nodes represent natural language concepts (such as "river," "birthday," or "school") and edges represent semantic relations between them (such as AtLocation, Causes, CapableOf, and HasSubevent). The key insight behind the dataset design was that concepts sharing the same relation to a common source concept are semantically related yet distinct, making them ideal candidates for challenging multiple-choice distractors.
For example, the concept "river" might be connected via the AtLocation relation to several target concepts: "waterfall," "bridge," and "valley." All three of these targets share a plausible spatial relationship with rivers, so a question that asks about one of them while offering the others as distractors forces the respondent to think carefully about which specific commonsense fact is being tested.
The dataset was created through a multi-step crowdsourcing process on Amazon Mechanical Turk:
Subgraph extraction. The researchers extracted subgraphs from ConceptNet, each containing one source concept and three target concepts connected by the same semantic relation.
Question authoring. A crowd worker was shown the source concept and three target concepts. The worker then authored three separate questions (one per target concept), constructing each question so that only the designated target served as the correct answer while the other two targets functioned as plausible but incorrect distractors. This design naturally encouraged workers to embed commonsense knowledge into the question text to differentiate the correct answer from the alternatives.
Distractor generation. For each question, the worker selected one additional distractor from ConceptNet and manually wrote a fifth distractor. This brought the total number of answer choices to five per question (labeled A through E).
Quality verification. Workers verified the quality of each question to ensure clarity, correctness, and the absence of ambiguity.
The total cost per question was approximately $0.33. This crowdsourcing design proved effective at producing questions that require genuine commonsense reasoning rather than surface-level pattern matching.
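The subgraph-extraction step above can be illustrated with a short sketch. This is not the authors' code; the edge list and field names are invented for illustration, but the grouping logic mirrors the described procedure: collect ConceptNet-style edges by (source, relation) and keep groups with at least three target concepts.

```python
from collections import defaultdict

# Illustrative ConceptNet-style edges (source, relation, target).
# These triples are examples only, not actual ConceptNet contents.
edges = [
    ("river", "AtLocation", "waterfall"),
    ("river", "AtLocation", "bridge"),
    ("river", "AtLocation", "valley"),
    ("river", "AtLocation", "canyon"),
    ("birthday", "Causes", "celebration"),
]

# Group targets by (source, relation) so that each question subgraph
# pairs one source concept with targets sharing the same relation.
groups = defaultdict(list)
for source, relation, target in edges:
    groups[(source, relation)].append(target)

# Keep one source concept plus three targets per subgraph, as in the paper.
subgraphs = [
    {"source": src, "relation": rel, "targets": targets[:3]}
    for (src, rel), targets in groups.items()
    if len(targets) >= 3
]
print(subgraphs[0])
```

Each resulting subgraph supplies the correct answer and two semantically related distractors for a crowdsourced question.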
The final dataset comprises 12,247 questions (some sources report 12,102 after filtering) distributed across the following splits:
| Split | Number of Examples |
|---|---|
| Training | 9,741 |
| Validation | 1,221 |
| Test | 1,140 |
The dataset is provided under an MIT license and is hosted on Hugging Face (tau/commonsense_qa) as well as on the official website at www.tau-nlp.org/commonsenseqa.
The questions in CommonsenseQA draw on a variety of ConceptNet relations, with the following distribution among the most common:
| Relation | Percentage of Questions |
|---|---|
| Causes | 47.3% |
| CapableOf | 17.3% |
| Antonym | 9.4% |
| HasSubevent | 8.5% |
| Other relations | 17.5% |
The heavy representation of the "Causes" relation reflects its importance in everyday reasoning. Understanding causal relationships (for example, that heavy rain causes flooding) is central to how humans navigate and interpret the world.
Each question in CommonsenseQA is a standalone natural language question with five candidate answers. Unlike reading comprehension tasks such as SQuAD, there is no accompanying passage or context; the model must rely entirely on its internal knowledge or external knowledge sources.
A representative data instance has the following structure:
```json
{
  "id": "075e483d21c29a511267ef62bedc0461",
  "question": "The sanctions against the school were a punishing blow, and they seemed to what the efforts the school had made to change?",
  "question_concept": "punishing",
  "choices": {
    "label": ["A", "B", "C", "D", "E"],
    "text": ["ignore", "enforce", "authoritarian", "yell at", "avoid"]
  },
  "answerKey": "A"
}
```
The question_concept field records the ConceptNet concept from which the question was derived. In this example, the concept is "punishing," and the question requires understanding that sanctions, despite being a form of punishment, can simultaneously ignore or negate the positive efforts of the entity being sanctioned.
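Because the answer labels and answer texts are stored in parallel lists, resolving the answerKey of an instance to its answer text is a simple index lookup. A minimal sketch, using the instance shown above:

```python
# An instance in the CommonsenseQA format (fields as documented above).
example = {
    "question": "The sanctions against the school were a punishing blow, "
                "and they seemed to what the efforts the school had made to change?",
    "choices": {
        "label": ["A", "B", "C", "D", "E"],
        "text": ["ignore", "enforce", "authoritarian", "yell at", "avoid"],
    },
    "answerKey": "A",
}

# Find the position of the answerKey in the label list, then read the
# answer text at the same position.
idx = example["choices"]["label"].index(example["answerKey"])
answer_text = example["choices"]["text"][idx]
print(answer_text)  # -> ignore
```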
Analysis of the dataset reveals that questions draw on several categories of background knowledge:
| Knowledge Type | Description |
|---|---|
| Spatial reasoning | Understanding where objects and entities are located relative to one another |
| Cause and effect | Knowing what events or actions lead to particular outcomes |
| Scientific knowledge | Basic understanding of physical and biological phenomena |
| Social conventions | Knowledge of cultural norms, social behavior, and human interactions |
| Background world knowledge | General facts about how the world works that are rarely stated explicitly |
This diversity of required knowledge types makes CommonsenseQA a broad test of commonsense understanding rather than a narrow probe of any single reasoning skill.
CommonsenseQA uses accuracy as its primary evaluation metric. Because each question has exactly one correct answer among five choices, random guessing would yield an expected accuracy of 20%. The test set labels are withheld from public release; researchers must submit their predictions to the official leaderboard for evaluation.
Human performance on the dataset was measured at 88.9% accuracy, establishing an upper baseline for the benchmark. The gap between machine performance and human accuracy has served as the primary indicator of progress in commonsense reasoning.
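The accuracy metric itself is simply the fraction of questions answered correctly. A minimal sketch, with invented prediction and gold lists for illustration:

```python
def accuracy(predictions, gold):
    """Fraction of predicted labels that match the gold labels."""
    assert len(predictions) == len(gold)
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

# With five choices per question, uniform random guessing is correct
# with probability 1/5 = 0.2 in expectation.
preds = ["A", "C", "B", "E", "D"]
gold = ["A", "B", "B", "E", "C"]
print(accuracy(preds, gold))  # -> 0.6
```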
The original paper presented several baseline models and their performance on the dataset:
| Model | Accuracy |
|---|---|
| Random | 20.0% |
| TF-IDF retrieval | 42.0% |
| GloVe embeddings | 44.0% |
| BERT-base | 53.0% |
| BERT-large (fine-tuned) | 55.9% |
| Human | 88.9% |
The 33-percentage-point gap between the best model (BERT-large at 55.9%) and human performance (88.9%) demonstrated that CommonsenseQA posed a genuine challenge for the NLP systems of 2019. Even BERT-large, which had recently achieved state-of-the-art results on many NLP benchmarks, fell far short of human-level commonsense reasoning on this task.
Since the dataset's release, the CommonsenseQA leaderboard has tracked steady improvements as researchers developed increasingly sophisticated approaches. The following table summarizes notable milestones:
| Model | Dev Accuracy | Test Accuracy | Year | Notes |
|---|---|---|---|---|
| BERT-large | ~56% | 55.9% | 2019 | Original baseline |
| Fine-tuned GPT-3 | 73.0% | -- | 2021 | 175B parameter language model |
| RoBERTa-large | 76.7% | -- | 2020 | Improved pretraining over BERT |
| UnifiedQA (11B) | 79.1% | -- | 2020 | Multi-task QA model from Allen AI |
| ALBERT-xxlarge + HGN | 81.2% | 80.0% | 2020 | Knowledge graph integration |
| DeBERTa-xxlarge | 83.8% | -- | 2021 | Enhanced attention mechanism (1.5B parameters) |
| DeBERTaV3-large | 84.6% | -- | 2022 | Efficient DeBERTa variant (418M parameters) |
| KEAR (single model) | 90.8% | 86.1% | 2021 | External attention with knowledge retrieval |
| DeBERTaV3-large + KEAR | 91.2% | -- | 2022 | Best single-model dev accuracy |
| KEAR (39-model ensemble) | 93.4% | 89.4% | 2021 | First to surpass human performance |
| CPACE | -- | 89.8% | 2022 | 0.9% above human performance |
These results illustrate a clear trajectory: from models that barely outperformed simple baselines in 2019, through knowledge-augmented approaches in 2020 and 2021, to systems that eventually matched and surpassed human accuracy.
Microsoft's KEAR (Knowledgeable External Attention for commonsense Reasoning) system, announced on December 20, 2021, was the first model to surpass human performance on the CommonsenseQA leaderboard, achieving 89.4% test accuracy compared to the 88.9% human baseline. The achievement was published at IJCAI 2022 by Yichong Xu, Chenguang Zhu, Shuohang Wang, and colleagues.
KEAR introduced a mechanism called external attention, which complements the self-attention used in transformer architectures. Rather than relying solely on information within the input sequence, KEAR retrieves relevant knowledge from three external sources: the ConceptNet knowledge graph, a dictionary (Wiktionary definitions of concepts mentioned in the question), and the CommonsenseQA training data itself, from which related question-answer pairs are retrieved.
The retrieved knowledge is concatenated with the input question and candidate answer, then fed into a language model to produce a score. In its strongest configuration, the final prediction uses an ensemble of 39 language models (including DeBERTa and ELECTRA variants) with majority voting.
One significant aspect of KEAR's design is that it requires no changes to the underlying transformer architecture. The external knowledge is simply appended to the input text, meaning the approach can be applied to any pretrained language model. This demonstrated that moderately sized models, when equipped with the right external knowledge, could match or exceed the performance of much larger models like GPT-3 (175 billion parameters).
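The input-concatenation idea can be sketched as follows. This is a simplified illustration, not KEAR's actual implementation: the retrieval step is replaced by a hard-coded list of hypothetical snippets, and the function name is invented. The point is that the knowledge arrives as plain text appended to the input, so any pretrained transformer can score the result unchanged.

```python
def build_scoring_input(question, candidate, retrieved_knowledge):
    """Append retrieved knowledge snippets to a question/candidate pair.

    The combined string is what a standard transformer would score;
    no architectural change is needed, because the external knowledge
    is injected purely as additional input text.
    """
    knowledge_text = " ".join(retrieved_knowledge)
    return f"{question} {candidate} {knowledge_text}"

# Hypothetical retrieved snippets, for illustration only:
snippets = [
    "river AtLocation valley",           # knowledge-graph triple
    "valley: a low area between hills",  # dictionary gloss
]
model_input = build_scoring_input(
    "Where would you find a river running through a low area?",
    "valley",
    snippets,
)
print(model_input)
```

In the real system, one such input is built per candidate answer, and the candidate whose input receives the highest model score is predicted.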
The rise of large language models (LLMs) brought a new perspective to CommonsenseQA evaluation. Models such as GPT-4 have demonstrated strong commonsense reasoning capabilities, with reported accuracy in the range of 80 to 85% on CommonsenseQA using zero-shot or few-shot prompting, without any task-specific fine-tuning. While these numbers fall below the best fine-tuned and ensemble systems on the leaderboard, they highlight that scaling model size and pretraining data can implicitly capture substantial commonsense knowledge.
The continued relevance of CommonsenseQA as a benchmark is evidenced by its inclusion in major LLM evaluation suites. Platforms like the Open LLM Leaderboard and evaluation harnesses such as EleutherAI's lm-evaluation-harness include CommonsenseQA as a standard test of commonsense reasoning. Major AI companies, including OpenAI, Anthropic, Meta, Microsoft, and Google, reference CommonsenseQA results in their model evaluations.
In 2021, Alon Talmor and collaborators released CommonsenseQA 2.0 (CSQA2), a successor benchmark that addressed some limitations of the original dataset. The work was presented as an oral presentation at NeurIPS 2021 under the title "CommonsenseQA 2.0: Exposing the Limits of AI through Gamification." The authors include Alon Talmor, Ori Yoran, Ronan Le Bras, Chandra Bhagavatula, Yoav Goldberg, Yejin Choi, and Jonathan Berant, representing a collaboration between Tel Aviv University and the Allen Institute for AI.
The most distinctive feature of CommonsenseQA 2.0 is its data collection methodology: a gamified framework called "Teach-Your-AI." In this game, players compose yes/no questions with the goal of stumping a rival AI system, under a points-based incentive structure: players earn points when the model answers their question incorrectly (subject to validation that the question is well-formed and the player's own answer is correct), rewarding questions that expose genuine model failures.
This adversarial setup leverages the players' intuitive understanding of where AI systems are likely to fail, producing questions that target genuine gaps in machine commonsense reasoning.
Unlike the original CommonsenseQA, which uses five-choice multiple-choice questions, CommonsenseQA 2.0 uses a yes/no (binary) format. The dataset contains 14,343 questions in total (with the publicly available Hugging Face version containing approximately 11,805 examples split between training and validation).
| Split | Examples |
|---|---|
| Training | ~9,260 |
| Validation | ~2,545 |
| Test | ~2,538 (labels withheld) |
Each data instance includes the question text, the yes/no answer, a confidence score derived from human validation, and metadata about the prompts used during question creation. The dataset is released under a CC-BY-4.0 license.
The initial evaluation results on CommonsenseQA 2.0 revealed a substantial gap between machine and human performance:
| Model | Accuracy |
|---|---|
| GPT-3 (few-shot, 175B) | 52.9% |
| T5-based Unicorn (11B) | 70.2% |
| Human | 94.1% |
The 23.9-percentage-point gap between the best model (Unicorn at 70.2%) and human performance (94.1%) demonstrated that the gamification approach successfully produced a dataset that was considerably harder for AI systems than the original CommonsenseQA. Even GPT-3, despite its 175 billion parameters, performed only marginally above chance (50%) on yes/no questions, suggesting that scale alone was insufficient to solve the challenge.
CommonsenseQA has inspired and is closely related to several other benchmarks in the commonsense reasoning ecosystem:
| Dataset | Description | Year |
|---|---|---|
| ECQA | Explanations for CommonsenseQA; adds positive/negative property annotations and free-flow explanations for ~11,000 QA pairs from CommonsenseQA | 2021 |
| X-CSQA | Multilingual extension of CommonsenseQA; automatically translated into 15 languages beyond English (including Chinese, German, Spanish, French, Japanese, Arabic, Hindi, and others) | 2021 |
| OpenBookQA | Elementary science questions requiring both a provided "open book" of science facts and broad commonsense knowledge | 2018 |
| ARC | AI2 Reasoning Challenge; grade-school science questions split into Easy and Challenge sets | 2018 |
| WinoGrande | Large-scale Winograd Schema Challenge variant with 44,000 pronoun resolution problems | 2020 |
| PIQA | Physical Interaction QA; tests understanding of everyday physical interactions | 2020 |
| SocialIQA | Questions about social situations and emotional intelligence | 2019 |
| HellaSwag | Sentence completion benchmark testing grounded commonsense inference | 2019 |
| Logical-CommonsenseQA | Extension combining commonsense reasoning with logical reasoning | 2025 |
The Explanations for CommonsenseQA (ECQA) dataset, published at ACL 2021 by Aggarwal, Mandowara, Agrawal, Khandelwal, Singla, and Garg, extends the original CommonsenseQA with detailed explanations. For approximately 11,000 question-answer pairs from CommonsenseQA, annotators provided positive properties (reasons why the correct answer is right), negative properties (reasons why each incorrect answer is wrong), and free-flow natural language explanations. This resource supports research on explainable commonsense reasoning, enabling models not only to select the right answer but also to articulate why it is correct.
Recognizing that commonsense reasoning benchmarks had been predominantly English-centric, the X-CSQA dataset was created by automatically translating CommonsenseQA into 15 additional languages, forming development and test sets for cross-lingual evaluation. The 16 languages covered are: English, Chinese, German, Spanish, French, Italian, Japanese, Dutch, Polish, Portuguese, Russian, Arabic, Vietnamese, Hindi, Swahili, and Urdu. This resource enables researchers to study whether commonsense reasoning capabilities transfer across languages and to evaluate multilingual models in a zero-shot cross-lingual setting.
CommonsenseQA is available through multiple platforms. The most straightforward method is via the Hugging Face Datasets library:
```python
from datasets import load_dataset

dataset = load_dataset("tau/commonsense_qa")
```
The dataset is also available through the official website at www.tau-nlp.org/commonsenseqa and the GitHub repository at github.com/jonathanherzig/commonsenseqa.
Each example in the dataset contains the following fields:
| Field | Type | Description |
|---|---|---|
| id | String | Unique identifier for the question |
| question | String | The natural language question |
| question_concept | String | The ConceptNet concept associated with the question |
| choices.label | List of strings | Answer option labels (A, B, C, D, E) |
| choices.text | List of strings | Answer option text |
| answerKey | String | The label of the correct answer (A through E) |
The test set does not include the answerKey field, requiring submission to the official evaluation server.
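These fields map naturally onto a lettered multiple-choice prompt, which is how zero-shot and few-shot evaluations typically present the task. The sketch below shows one plausible template; the exact formatting varies by evaluation harness, and the example instance is invented for illustration.

```python
def format_prompt(example):
    """Render a CommonsenseQA example as a lettered multiple-choice prompt."""
    lines = [f"Question: {example['question']}"]
    for label, text in zip(example["choices"]["label"],
                           example["choices"]["text"]):
        lines.append(f"{label}. {text}")
    lines.append("Answer:")
    return "\n".join(lines)

# Invented example in the documented field format.
example = {
    "question": "Where would you expect to find a bridge?",
    "choices": {
        "label": ["A", "B", "C", "D", "E"],
        "text": ["river", "desk", "cloud", "mirror", "oven"],
    },
    "answerKey": "A",
}
print(format_prompt(example))
```

A model's completion (or its per-choice likelihoods) is then compared against the answerKey; on the hidden test split the key is absent, so predictions must go to the evaluation server.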
CommonsenseQA is included as a standard task in the EleutherAI lm-evaluation-harness, one of the most widely used frameworks for evaluating large language models. The task configuration supports both generation-based and likelihood-based evaluation modes, making it straightforward to benchmark new models against established baselines.
CommonsenseQA has had a substantial impact on the field of natural language processing and AI research more broadly. Several aspects of its influence deserve mention.
First, the dataset helped establish commonsense reasoning as a central evaluation axis for language models. Before CommonsenseQA, many NLP benchmarks focused on reading comprehension, textual entailment, or factual question answering where the answer could be found in a provided context. By removing the context and requiring models to draw on background knowledge, CommonsenseQA shifted attention toward a fundamental capability gap in NLP systems.
Second, the dataset catalyzed research into knowledge-augmented language models. The observation that large pretrained models like BERT could not score well on CommonsenseQA motivated researchers to explore methods for injecting external knowledge, whether from knowledge graphs, dictionaries, or other structured sources, into neural language models. This line of work ultimately produced systems like KEAR that achieved human parity by combining language models with targeted knowledge retrieval.
Third, CommonsenseQA has served as a proving ground for successive generations of language models. The progression from BERT's 55.9% to GPT-3's 73.0% to DeBERTa-based systems exceeding 90% provides a concrete measure of how rapidly NLP capabilities have advanced. The benchmark remains relevant even as models improve, because the transition from fine-tuned to zero-shot evaluation using LLMs creates a new dimension of comparison.
With over 1,800 citations, CommonsenseQA stands alongside GLUE, SuperGLUE, SQuAD, and MMLU as one of the defining benchmarks of the modern NLP era. Its influence extends beyond direct use, as the ConceptNet-based construction methodology has inspired the design of other benchmarks targeting specific reasoning capabilities.
Despite its contributions, CommonsenseQA has several recognized limitations.
The dataset is English-only (though X-CSQA addresses this for evaluation purposes), limiting its direct applicability to multilingual settings. The reliance on ConceptNet as a structural backbone means that the types of commonsense knowledge tested are partially constrained by ConceptNet's coverage and relation types. Some researchers have noted that certain questions can be answered through world knowledge rather than true commonsense reasoning, blurring the boundary between factual recall and commonsense inference.
The five-choice format, while more challenging than binary classification, still allows models to achieve above-chance performance through elimination strategies rather than genuine understanding. CommonsenseQA 2.0's switch to yes/no format was partly motivated by this concern, though binary questions introduce their own complications (such as sensitivity to negation and phrasing).
Finally, as with many benchmarks, there is a risk of overfitting to the specific distribution of CommonsenseQA questions. Models that achieve high accuracy on the leaderboard may not generalize to commonsense reasoning in more open-ended settings, a limitation that broader evaluation suites attempt to address.