# CommonsenseQA

> Source: https://aiwiki.ai/wiki/commonsenseqa
> Updated: 2026-06-23
> Categories: AI Benchmarks, Natural Language Processing
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

CommonsenseQA is a multiple-choice [question answering](/wiki/question_answering) [benchmark](/wiki/benchmark) of 12,247 questions, introduced in 2019 by Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant, that evaluates whether artificial intelligence systems can perform [commonsense reasoning](/wiki/commonsense_reasoning) without a supporting passage. Each question offers five answer choices (one correct, four distractors) and is derived from the [ConceptNet](/wiki/conceptnet) knowledge graph. In the original paper the best baseline, a fine-tuned [BERT](/wiki/bert)-large model, reached only 56% accuracy versus 89% for humans, a roughly 33-percentage-point gap that the authors summarized as "well below human performance." [1] CommonsenseQA was published as a long paper at the 2019 Conference of the North American Chapter of the Association for Computational Linguistics ([NAACL](/wiki/naacl)-HLT 2019) in Minneapolis, where it won the Best Resource Paper award. [1] The dataset has since accumulated over 1,800 citations on Semantic Scholar, making it one of the most widely used benchmarks in [natural language processing](/wiki/natural_language_processing) research focused on commonsense understanding.

## Background and Motivation

Commonsense reasoning has long been recognized as a fundamental challenge in [artificial intelligence](/wiki/artificial_intelligence). Humans effortlessly draw on a vast body of background knowledge when interpreting language, making inferences, and answering questions about the world. A person understands, for instance, that rivers flow downhill, that people celebrate birthdays, or that a cold glass of water will eventually reach room temperature. This kind of knowledge is so obvious that it is seldom stated explicitly in written text, a phenomenon known as reporting bias.

The authors frame the problem directly in the paper's abstract: "When answering a question, people often draw upon their rich world knowledge in addition to the particular context. Recent work has focused primarily on answering questions given some relevant document or context, and required very little general background." [1] CommonsenseQA was designed to probe exactly that missing general background.

Before CommonsenseQA, several benchmarks existed for testing commonsense reasoning in NLP systems. The Winograd Schema Challenge (Levesque et al., 2012) offered carefully crafted pronoun resolution problems but contained only 273 examples, limiting its statistical reliability. The Choice of Plausible Alternatives (COPA) dataset provided 1,000 causal reasoning questions. The Story Cloze Test (Mostafazadeh et al., 2016) asked models to pick the correct ending for short stories, but researchers discovered annotation artifacts that allowed models to score well without genuine understanding. Similarly, SWAG (Zellers et al., 2018) was created for grounded commonsense inference, yet pretrained models like [BERT](/wiki/bert) quickly achieved near-human performance through fine-tuning, suggesting the dataset did not fully capture the depth of commonsense reasoning.

The creators of CommonsenseQA observed two shortcomings in prior work. First, many existing datasets were either too small to support robust machine learning experiments or contained statistical artifacts that models could exploit. Second, most benchmarks provided a textual context alongside each question, meaning that models could sometimes identify the correct answer by pattern matching within the given passage rather than by reasoning about the world. CommonsenseQA addressed both problems by creating a larger dataset and by formulating questions that do not come with a supporting passage; answering them correctly demands genuine background knowledge.

## How was CommonsenseQA built from ConceptNet?

### ConceptNet as a Foundation

The construction of CommonsenseQA relies on [ConceptNet](/wiki/conceptnet), a large-scale commonsense knowledge graph in which nodes represent natural language concepts (such as "river," "birthday," or "school") and edges represent semantic relations between them (such as AtLocation, Causes, CapableOf, and HasSubevent). The key insight behind the dataset design was that concepts sharing the same relation to a common source concept are semantically related yet distinct, making them ideal candidates for challenging multiple-choice distractors. As the paper puts it, the team extracts "from ConceptNet multiple target concepts that have the same semantic relation to a single source concept," then asks crowd-workers to "author multiple-choice questions that mention the source concept and discriminate in turn between each of the target concepts." [1]

For example, the concept "river" might be connected via the AtLocation relation to several target concepts: "waterfall," "bridge," and "valley." All three of these targets share a plausible spatial relationship with rivers, so a question that asks about one of them while offering the others as distractors forces the respondent to think carefully about which specific commonsense fact is being tested.
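
The grouping idea can be sketched in a few lines. The triples below are toy data chosen to mirror the "river" example, not rows from ConceptNet, and the helper name `candidate_question_sets` is hypothetical rather than taken from the authors' code.

```python
from collections import defaultdict

# Toy (source, relation, target) triples; illustrative only, not real ConceptNet rows.
TRIPLES = [
    ("river", "AtLocation", "waterfall"),
    ("river", "AtLocation", "bridge"),
    ("river", "AtLocation", "valley"),
    ("river", "CapableOf", "flow"),
    ("birthday", "HasSubevent", "eat cake"),
]

def candidate_question_sets(triples, n_targets=3):
    """Group targets by (source, relation) and keep groups with enough targets."""
    groups = defaultdict(list)
    for source, relation, target in triples:
        groups[(source, relation)].append(target)
    return {
        key: targets[:n_targets]
        for key, targets in groups.items()
        if len(targets) >= n_targets
    }

print(candidate_question_sets(TRIPLES))
# {('river', 'AtLocation'): ['waterfall', 'bridge', 'valley']}
```

Only the `("river", "AtLocation")` group survives, because it is the only one with three targets that can serve as mutual distractors.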

### Crowdsourcing Procedure

The dataset was created through a multi-step crowdsourcing process on Amazon Mechanical Turk:

1. **Subgraph extraction.** The researchers extracted subgraphs from ConceptNet, each containing one source concept and three target concepts connected by the same semantic relation.

2. **Question authoring.** A crowd worker was shown the source concept and three target concepts. The worker then authored three separate questions (one per target concept), constructing each question so that only the designated target served as the correct answer while the other two targets functioned as plausible but incorrect distractors. This design naturally encouraged workers to embed commonsense knowledge into the question text to differentiate the correct answer from the alternatives.

3. **Distractor generation.** For each question, the worker selected one additional distractor from ConceptNet and wrote a fourth distractor by hand. This brought the total number of answer choices to five per question (labeled A through E): the correct answer, the two other target concepts, and the two added distractors.

4. **Quality verification.** Workers verified the quality of each question to ensure clarity, correctness, and the absence of ambiguity.

The total cost per question was approximately $0.33. This crowdsourcing design proved effective at producing questions that require genuine commonsense reasoning rather than surface-level pattern matching.
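
The way the five answer choices fit together can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `build_items`, the dictionary keys `kg_distractor` and `written_distractor`, the placeholder question strings, and the extra distractors are all made up for the example and do not come from the dataset or the authors' tooling.

```python
import random

LABELS = ["A", "B", "C", "D", "E"]

def build_items(source, authored, seed=0):
    """Turn one source concept and three authored entries into five-choice items.

    `authored` maps each target concept to the worker's question, one
    distractor picked from ConceptNet, and one distractor written by hand.
    """
    rng = random.Random(seed)
    targets = list(authored)
    items = []
    for answer, entry in authored.items():
        # The two remaining targets act as distractors for this question.
        other_targets = [t for t in targets if t != answer]
        texts = [answer, *other_targets,
                 entry["kg_distractor"], entry["written_distractor"]]
        rng.shuffle(texts)
        items.append({
            "question": entry["question"],
            "question_concept": source,
            "choices": {"label": list(LABELS), "text": texts},
            "answerKey": LABELS[texts.index(answer)],
        })
    return items

# Hypothetical worker output for the "river" subgraph (placeholders, not real data).
AUTHORED = {
    "waterfall": {"question": "<question answered by 'waterfall'>",
                  "kg_distractor": "pebble", "written_distractor": "mountain"},
    "bridge": {"question": "<question answered by 'bridge'>",
               "kg_distractor": "stream", "written_distractor": "boat"},
    "valley": {"question": "<question answered by 'valley'>",
               "kg_distractor": "pebble", "written_distractor": "canyon"},
}

items = build_items("river", AUTHORED)
print(items[0]["choices"]["text"], items[0]["answerKey"])
```

Each subgraph yields three items, and every item contains the correct target, the two sibling targets, and two added distractors.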

### How many questions does CommonsenseQA contain?

The paper reports 12,247 questions in total. The split sizes listed below sum to 12,102, the figure some sources report after filtering: [1]

| Split | Number of Examples |
|---|---|
| Training | 9,741 |
| Validation | 1,221 |
| Test | 1,140 |

The dataset is provided under an MIT license and is hosted on [Hugging Face](/wiki/hugging_face) (tau/commonsense_qa) as well as the official website at www.tau-nlp.org/commonsenseqa.
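
A quick arithmetic check on the split sizes in the table above shows how they relate to the two totals quoted for the dataset. The numbers are copied from this article; the variable names are arbitrary.

```python
# Split sizes as listed in the table above.
SPLITS = {"train": 9741, "validation": 1221, "test": 1140}
PAPER_TOTAL = 12247  # total reported in the paper

split_total = sum(SPLITS.values())
shares = {name: round(100 * n / split_total, 1) for name, n in SPLITS.items()}

print(split_total)                # 12102
print(PAPER_TOTAL - split_total)  # 145
print(shares)                     # percentage of examples in each split
```

The splits account for 12,102 questions, 145 fewer than the paper's total, and follow a roughly 80/10/10 division.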

### ConceptNet Relations

The questions in CommonsenseQA draw on a variety of ConceptNet relations, with the following distribution among the most common: [1]

| Relation | Percentage of Questions |
|---|---|
| AtLocation | 47.3% |
| Causes | 17.3% |
| CapableOf | 9.4% |
| Antonym | 8.5% |
| Other relations | 17.5% |

Nearly half of the questions come from the AtLocation relation, which reflects how much everyday knowledge concerns where things are found. Causes is the second most common relation: understanding causal relationships (for example, that heavy rain causes flooding) is central to how humans navigate and interpret the world.

## Question Format and Examples

Each question in CommonsenseQA is a standalone natural language question with five candidate answers. Unlike reading comprehension tasks such as [SQuAD](/wiki/squad), there is no accompanying passage or context; the model must rely entirely on its internal knowledge or external knowledge sources.

A representative data instance has the following structure:

```json
{
  "id": "075e483d21c29a511267ef62bedc0461",
  "question": "The sanctions against the school were a punishing blow, and they seemed to what the efforts the school had made to change?",
  "question_concept": "punishing",
  "choices": {
    "label": ["A", "B", "C", "D", "E"],
    "text": ["ignore", "enforce", "authoritarian", "yell at", "avoid"]
  },
  "answerKey": "A"
}
```

The `question_concept` field records the ConceptNet concept from which the question was derived. In this example, the concept is "punishing," and the question requires understanding that sanctions, despite being a form of punishment, can simultaneously ignore or negate the positive efforts of the entity being sanctioned.
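
Reading such an instance is mostly a matter of pairing the two parallel lists under `choices`. The sketch below parses the example shown above with the standard library; the helper name `answer_text` is hypothetical.

```python
import json

# The instance shown above, as a JSON string.
RAW = """
{
  "id": "075e483d21c29a511267ef62bedc0461",
  "question": "The sanctions against the school were a punishing blow, and they seemed to what the efforts the school had made to change?",
  "question_concept": "punishing",
  "choices": {
    "label": ["A", "B", "C", "D", "E"],
    "text": ["ignore", "enforce", "authoritarian", "yell at", "avoid"]
  },
  "answerKey": "A"
}
"""

def answer_text(example):
    """Return the text of the choice whose label matches answerKey."""
    labels = example["choices"]["label"]
    texts = example["choices"]["text"]
    return texts[labels.index(example["answerKey"])]

example = json.loads(RAW)
print(answer_text(example))  # ignore
```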

### Types of Commonsense Knowledge Required

Analysis of the dataset reveals that questions draw on several categories of background knowledge:

| Knowledge Type | Description |
|---|---|
| Spatial reasoning | Understanding where objects and entities are located relative to one another |
| Cause and effect | Knowing what events or actions lead to particular outcomes |
| Scientific knowledge | Basic understanding of physical and biological phenomena |
| Social conventions | Knowledge of cultural norms, social behavior, and human interactions |
| Background world knowledge | General facts about how the world works that are rarely stated explicitly |

This diversity of required knowledge types makes CommonsenseQA a broad test of commonsense understanding rather than a narrow probe of any single reasoning skill.

## How is CommonsenseQA scored?

CommonsenseQA uses accuracy as its primary evaluation metric. Because each question has exactly one correct answer among five choices, random guessing would yield an expected accuracy of 20%. The test set labels are withheld from public release; researchers must submit their predictions to the official leaderboard for evaluation.

Human performance on the dataset was measured at 88.9% accuracy (a majority vote among five individuals), which serves as the human reference point for the benchmark; the paper rounds this figure to 89%. [1][3] The gap between machine and human accuracy has served as the primary indicator of progress on the benchmark.
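
The metric itself is simple enough to state in code. The following is a minimal sketch of accuracy over answer labels together with the chance level for five choices; the gold labels and predictions are invented for illustration, and the official leaderboard's scoring script is not reproduced here.

```python
def accuracy(predictions, gold):
    """Fraction of questions where the predicted label equals the gold label."""
    if not gold or len(predictions) != len(gold):
        raise ValueError("predictions and gold must be non-empty and the same length")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

NUM_CHOICES = 5
chance_level = 1 / NUM_CHOICES  # expected accuracy of uniform random guessing

# Made-up labels for five questions.
gold = ["A", "C", "E", "B", "D"]
predictions = ["A", "C", "B", "B", "A"]

print(chance_level)                 # 0.2
print(accuracy(predictions, gold))  # 0.6
```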

## Baseline Results (2019)

The original paper presented several baseline models and their performance on the dataset, reporting that "Our best baseline is based on BERT-large (Devlin et al., 2018) and obtains 56% accuracy, well below human performance, which is 89%." [1]

| Model | Accuracy |
|---|---|
| Random | 20.0% |
| TF-IDF retrieval | 42.0% |
| GloVe embeddings | 44.0% |
| BERT-base | 53.0% |
| [BERT](/wiki/bert)-large (fine-tuned) | 55.9% |
| Human | 88.9% |

The 33-percentage-point gap between the best model (BERT-large at 55.9%) and human performance (88.9%) demonstrated that CommonsenseQA posed a genuine challenge for the NLP systems of 2019. [1] Even BERT-large, which had recently achieved state-of-the-art results on many NLP benchmarks, fell far short of human-level commonsense reasoning on this task.

## How have models progressed on the leaderboard?

Since the dataset's release, the CommonsenseQA leaderboard has tracked steady improvements as researchers developed increasingly sophisticated approaches. The following table summarizes notable milestones:

| Model | Dev Accuracy | Test Accuracy | Year | Notes |
|---|---|---|---|---|
| BERT-large | ~56% | 55.9% | 2019 | Original baseline |
| Fine-tuned GPT-3 | 73.0% | -- | 2021 | 175B parameter [language model](/wiki/large_language_model) |
| [RoBERTa](/wiki/roberta)-large | 76.7% | -- | 2020 | Improved pretraining over BERT |
| UnifiedQA (11B) | 79.1% | -- | 2020 | Multi-task QA model from [Allen AI](/wiki/allen_institute_for_ai) |
| [ALBERT](/wiki/albert)-xxlarge + HGN | 81.2% | 80.0% | 2020 | Knowledge graph integration |
| [DeBERTa](/wiki/deberta)-xxlarge | 83.8% | -- | 2021 | Enhanced attention mechanism (1.5B parameters) |
| DeBERTaV3-large | 84.6% | -- | 2022 | Efficient DeBERTa variant (418M parameters) |
| KEAR (single model) | 90.8% | 86.1% | 2021 | External attention with knowledge retrieval |
| DeBERTaV3-large + KEAR | 91.2% | -- | 2022 | Best single-model dev accuracy |
| KEAR (39-model ensemble) | 93.4% | 89.4% | 2021 | First to surpass human performance |
| CPACE | -- | 89.8% | 2022 | 0.9 percentage points above human performance |

These results illustrate a clear trajectory: from models that barely outperformed simple baselines in 2019, through knowledge-augmented approaches in 2020 and 2021, to systems that eventually matched and surpassed human accuracy.

### When did a model first beat humans on CommonsenseQA?

Microsoft's KEAR (Knowledgeable External Attention for commonsense Reasoning) system, announced on December 20, 2021, was the first model to surpass human performance on the CommonsenseQA leaderboard, achieving 89.4% test accuracy compared to the 88.9% human baseline. [3] Microsoft Research described the result plainly: "Our latest model, KEAR, Knowledgeable External Attention for commonsense Reasoning, performs better than people answering the same question." [3] The achievement was published at IJCAI 2022 by Yichong Xu, Chenguang Zhu, Shuohang Wang, and colleagues. [9]

KEAR introduced a mechanism called external attention, which complements the self-attention used in [transformer](/wiki/transformer) architectures. Rather than relying solely on information within the input sequence, KEAR retrieves relevant knowledge from three external sources: [3][9]

1. **Knowledge graph (ConceptNet).** The system performs entity linking to retrieve relation triples connecting concepts mentioned in the question and answer choices.
2. **Dictionary (Wiktionary).** Definitions of key concepts are retrieved through word matching.
3. **Training data from related QA datasets.** Using BM25 text retrieval, KEAR finds similar questions from 17 commonsense datasets.

The retrieved knowledge is concatenated with the input question and candidate answer, then fed into a language model to produce a score. The final prediction uses an ensemble of 39 language models (including [DeBERTa](/wiki/deberta) and ELECTRA variants) with majority voting.

One significant aspect of KEAR's design is that it requires no changes to the underlying transformer architecture. The external knowledge is simply appended to the input text, meaning the approach can be applied to any pretrained language model. This demonstrated that moderately sized models, when equipped with the right external knowledge, could match or exceed the performance of much larger models like [GPT-3](/wiki/gpt-3) (175 billion parameters). [9]
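
The two ideas described above, appending retrieved knowledge to the input text and combining model outputs by majority vote, can be sketched as follows. This is not KEAR's implementation: the separator string, the ordering of the pieces, the sample knowledge snippets, and the function names are assumptions made for illustration.

```python
from collections import Counter

def build_input(question, choice, knowledge_snippets, sep=" [SEP] "):
    """Concatenate the question, one candidate answer, and retrieved knowledge.

    The separator and field order are assumptions, not KEAR's exact format.
    """
    return sep.join([question, choice, *knowledge_snippets])

def majority_vote(predictions):
    """Return the most frequent label; ties go to the label seen first."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical retrieved knowledge for one (question, choice) pair.
snippets = [
    "river AtLocation valley",          # knowledge-graph triple
    "valley: a low area between hills", # dictionary-style definition
]
print(build_input("Where does a river usually run?", "valley", snippets))

# Hypothetical predictions from five ensemble members for one question.
print(majority_vote(["A", "B", "A", "C", "A"]))  # A
```

Because the knowledge arrives as plain text in the input, the same scoring model can be used with or without retrieval, which is the property the paragraph above highlights.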

### How do large language models do on CommonsenseQA?

The rise of [large language models](/wiki/large_language_model) (LLMs) brought a new perspective to CommonsenseQA evaluation. Models such as [GPT-4](/wiki/gpt-4) have demonstrated strong commonsense reasoning capabilities without any task-specific fine-tuning: one 2025 study reports GPT-4 reaching 79.5% with zero-shot chain-of-thought prompting and 84.8% with a structured input-action-output prompting method on CommonsenseQA. [10] These numbers fall below the best fine-tuned and ensemble systems on the leaderboard, but they were obtained without any training on the benchmark, which suggests that scaling model size and pretraining data implicitly captures substantial commonsense knowledge.
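
Zero-shot evaluation of this kind usually comes down to formatting each question as a prompt and extracting a letter from the model's reply. The sketch below shows one way to do that; the prompt wording, the `Answer:` convention, and the function names are assumptions and are not the prompts used in the cited study.

```python
import re

def format_prompt(example):
    """Render a CommonsenseQA example as a zero-shot chain-of-thought prompt."""
    lines = [f"Question: {example['question']}"]
    labels = example["choices"]["label"]
    texts = example["choices"]["text"]
    for label, text in zip(labels, texts):
        lines.append(f"{label}. {text}")
    # Illustrative instruction; the exact wording is an assumption.
    lines.append("Think step by step, then finish with 'Answer: <letter>'.")
    return "\n".join(lines)

def parse_answer(completion):
    """Extract the answer letter from a completion, or None if it is missing."""
    match = re.search(r"Answer:\s*([A-E])\b", completion)
    return match.group(1) if match else None

example = {
    "question": "Where would you find a waterfall?",  # made-up question
    "choices": {"label": ["A", "B", "C", "D", "E"],
                "text": ["river", "desk", "oven", "sky", "shoe"]},
}
print(format_prompt(example))
print(parse_answer("Waterfalls form along rivers. Answer: A"))  # A
```

Returning `None` for unparseable replies keeps formatting failures visible instead of silently counting them as a particular choice.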

The continued relevance of CommonsenseQA as a benchmark is evidenced by its inclusion in major LLM evaluation suites. Platforms like the Open LLM Leaderboard and evaluation harnesses such as EleutherAI's lm-evaluation-harness include CommonsenseQA as a standard test of commonsense reasoning. Major AI companies, including [OpenAI](/wiki/openai), [Anthropic](/wiki/anthropic), [Meta](/wiki/meta_ai), Microsoft, and [Google](/wiki/google_deepmind), reference CommonsenseQA results in their model evaluations.

## What is CommonsenseQA 2.0?

In 2021, Alon Talmor and collaborators released CommonsenseQA 2.0 (CSQA2), a successor benchmark that addressed some limitations of the original dataset. The work was presented as an oral presentation at [NeurIPS](/wiki/neurips) 2021 (Datasets and Benchmarks Track) under the title "CommonsenseQA 2.0: Exposing the Limits of AI through Gamification." [2] The authors include Alon Talmor, Ori Yoran, Ronan Le Bras, Chandra Bhagavatula, Yoav Goldberg, Yejin Choi, and Jonathan Berant, representing a collaboration between Tel Aviv University and the [Allen Institute for AI](/wiki/allen_institute_for_ai).

### Gamification Approach

The most distinctive feature of CommonsenseQA 2.0 is its data collection methodology: a gamified framework in which, as the authors describe it, "the goal of players in the game is to compose questions that mislead a rival AI while using specific phrases for extra points." [2] The game uses a points-based incentive structure:

- **Points for beating the AI.** Players earn points when they author questions that the AI answers incorrectly, encouraging them to probe the boundaries of AI understanding.
- **Points for phrase usage.** The game designers specify particular phrases or topics that award bonus points when incorporated into questions, allowing them to steer the distribution of questions toward specific areas of commonsense knowledge.
- **Penalties for bad questions.** Players lose points if their questions fail human validation (for example, if the question is ambiguous, nonsensical, or "sensitive"), ensuring broad human agreement on correct answers.

This adversarial setup leverages the players' intuitive understanding of where AI systems are likely to fail, producing questions that target genuine gaps in machine commonsense reasoning.
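
The incentive structure can be made concrete with a toy scoring function. Everything numeric here is hypothetical: the source does not state the actual point values, and the rule that a failed validation overrides any bonus is an assumption made to keep the sketch simple.

```python
def score_question(ai_was_fooled, used_bonus_phrase, passed_validation,
                   fool_points=5, phrase_points=2, penalty=4):
    """Toy version of the game's scoring; all point values are hypothetical."""
    if not passed_validation:
        # Assumption: a question rejected by human validators only costs points.
        return -penalty
    points = 0
    if ai_was_fooled:
        points += fool_points
    if used_bonus_phrase:
        points += phrase_points
    return points

print(score_question(True, True, True))    # 7
print(score_question(True, False, False))  # -4
print(score_question(False, False, True))  # 0
```

Under this kind of scheme a player gains the most by writing a valid question that both fools the model and uses a requested phrase, which matches the steering effect described in the list above.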

### Dataset Format and Statistics

Unlike the original CommonsenseQA, which uses five-choice multiple-choice questions, CommonsenseQA 2.0 uses a yes/no (binary) format. The dataset contains 14,343 yes/no questions in total (with the publicly available Hugging Face version containing approximately 11,805 examples split between training and validation). [2]

| Split | Examples |
|---|---|
| Training | ~9,260 |
| Validation | ~2,545 |
| Test | ~2,538 (labels withheld) |

Each data instance includes the question text, the yes/no answer, a confidence score derived from human validation, and metadata about the prompts used during question creation. The dataset is released under a CC-BY-4.0 license.

### Model Performance on CSQA2

The initial evaluation results on CommonsenseQA 2.0 revealed a substantial gap between machine and human performance. The authors report that "Our best baseline, the T5-based Unicorn with 11B parameters achieves an accuracy of 70.2%, substantially higher than GPT-3 (52.9%) in a few-shot inference setup. Both score well below human performance which is at 94.1%." [2]

| Model | Accuracy |
|---|---|
| GPT-3 (few-shot, 175B) | 52.9% |
| T5-based Unicorn (11B) | 70.2% |
| Human | 94.1% |

The 23.9-percentage-point gap between the best model (Unicorn at 70.2%) and human performance (94.1%) demonstrated that the gamification approach successfully produced a dataset that was considerably harder for AI systems than the original CommonsenseQA. [2] Even GPT-3, despite its 175 billion parameters, performed only marginally above chance (50%) on yes/no questions, suggesting that scale alone was insufficient to solve the challenge.

## Related Datasets and Extensions

CommonsenseQA has inspired and is closely related to several other benchmarks in the commonsense reasoning ecosystem:

| Dataset | Description | Year |
|---|---|---|
| [ECQA](/wiki/ecqa) | Explanations for CommonsenseQA; adds positive/negative property annotations and free-flow explanations for ~11,000 QA pairs from CommonsenseQA | 2021 |
| X-CSQA | Multilingual extension of CommonsenseQA; automatically translated into 15 languages beyond English (including Chinese, German, Spanish, French, Japanese, Arabic, Hindi, and others) | 2021 |
| [OpenBookQA](/wiki/openbookqa) | Elementary science questions requiring both a provided "open book" of science facts and broad commonsense knowledge | 2018 |
| [ARC](/wiki/ai2_reasoning_challenge) | AI2 Reasoning Challenge; grade-school science questions split into Easy and Challenge sets | 2018 |
| [WinoGrande](/wiki/winogrande) | Large-scale Winograd Schema Challenge variant with 44,000 pronoun resolution problems | 2020 |
| [PIQA](/wiki/piqa) | Physical Interaction: Question Answering; tests understanding of everyday physical interactions | 2020 |
| [SocialIQA](/wiki/socialiqa) | Questions about social situations and emotional intelligence | 2019 |
| [HellaSwag](/wiki/hellaswag) | Sentence completion benchmark testing grounded commonsense inference | 2019 |
| Logical-CommonsenseQA | Extension combining commonsense reasoning with logical reasoning | 2025 |

### ECQA: Explanations for CommonsenseQA

The Explanations for CommonsenseQA (ECQA) dataset, published at ACL 2021 by Aggarwal, Mandowara, Agrawal, Khandelwal, Singla, and Garg, extends the original CommonsenseQA with detailed explanations. [4] For approximately 11,000 question-answer pairs from CommonsenseQA, annotators provided positive properties (reasons why the correct answer is right), negative properties (reasons why each incorrect answer is wrong), and free-flow natural language explanations. This resource supports research on explainable commonsense reasoning, enabling models not only to select the right answer but also to articulate why it is correct.

### X-CSQA: Multilingual CommonsenseQA

Recognizing that commonsense reasoning benchmarks had been predominantly English-centric, the X-CSQA dataset was created by automatically translating CommonsenseQA into 15 additional languages, forming development and test sets for cross-lingual evaluation. [8] The 16 languages covered are: English, Chinese, German, Spanish, French, Italian, Japanese, Dutch, Polish, Portuguese, Russian, Arabic, Vietnamese, Hindi, Swahili, and Urdu. This resource enables researchers to study whether commonsense reasoning capabilities transfer across languages and to evaluate multilingual models in a zero-shot cross-lingual setting.

## Technical Details

### How do you access the CommonsenseQA dataset?

CommonsenseQA is available through multiple platforms. The most straightforward method is via the Hugging Face Datasets library:

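```python
from datasets import load_dataset

# Downloads the train/validation/test splits, or reuses the local cache.
dataset = load_dataset("tau/commonsense_qa")
print(dataset["train"][0]["question"])  # first training question
```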

The dataset is also available through the official website at www.tau-nlp.org/commonsenseqa and the GitHub repository at github.com/jonathanherzig/commonsenseqa.

### Data Fields

Each example in the dataset contains the following fields:

| Field | Type | Description |
|---|---|---|
| id | String | Unique identifier for the question |
| question | String | The natural language question |
| question_concept | String | The ConceptNet concept associated with the question |
| choices.label | List of strings | Answer option labels (A, B, C, D, E) |
| choices.text | List of strings | Answer option text |
| answerKey | String | The label of the correct answer (A through E) |

The test set does not include the answerKey field, requiring submission to the official evaluation server.
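
The field table can also be written down as a typed schema with a small validity check. This is a sketch under assumptions: the class and function names are invented, and treating a missing or empty `answerKey` as "unlabeled" is an assumption about how test examples are represented.

```python
from typing import List, TypedDict

class Choices(TypedDict):
    label: List[str]
    text: List[str]

class CsqaExample(TypedDict, total=False):
    id: str
    question: str
    question_concept: str
    choices: Choices
    answerKey: str  # assumed missing or empty for unlabeled test examples

EXPECTED_LABELS = ["A", "B", "C", "D", "E"]

def is_valid(example) -> bool:
    """Check the five-choice structure and, if present, the answer key."""
    choices = example.get("choices", {})
    labels = list(choices.get("label", []))
    texts = list(choices.get("text", []))
    if labels != EXPECTED_LABELS or len(texts) != len(labels):
        return False
    key = example.get("answerKey", "")
    return key == "" or key in labels

labeled: CsqaExample = {
    "id": "example-1",  # made-up identifier
    "question": "Where would you find a waterfall?",
    "question_concept": "waterfall",
    "choices": {"label": list(EXPECTED_LABELS),
                "text": ["river", "desk", "oven", "sky", "shoe"]},
    "answerKey": "A",
}
unlabeled = {k: v for k, v in labeled.items() if k != "answerKey"}

print(is_valid(labeled), is_valid(unlabeled))  # True True
```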

### Evaluation in LLM Harnesses

CommonsenseQA is included as a standard task in the EleutherAI lm-evaluation-harness, one of the most widely used frameworks for evaluating [large language models](/wiki/large_language_model). The task configuration supports both generation-based and likelihood-based evaluation modes, making it straightforward to benchmark new models against established baselines.
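
Likelihood-based evaluation scores each answer choice as a continuation of the question and picks the highest-scoring one. The sketch below shows that selection step only. It does not reproduce the harness's task configuration or API, and `toy_loglik` is a stand-in scorer based on word overlap so the example runs without a model; a real evaluation would use a language model's log-probabilities.

```python
from typing import Callable, Sequence

def pick_by_likelihood(question: str, choices: Sequence[str],
                       loglik: Callable[[str, str], float]) -> int:
    """Return the index of the choice with the highest score given the question."""
    scores = [loglik(question, choice) for choice in choices]
    return max(range(len(choices)), key=scores.__getitem__)

def toy_loglik(context: str, continuation: str) -> float:
    """Stand-in scorer: word overlap with a small length penalty (not a real model)."""
    context_words = set(context.lower().split())
    continuation_words = continuation.lower().split()
    overlap = len(context_words & set(continuation_words))
    return overlap - 0.1 * len(continuation_words)

question = "where does a river meet the sea"  # made-up question
choices = ["river mouth", "desk", "sky"]
best = pick_by_likelihood(question, choices, toy_loglik)
print(choices[best])  # river mouth
```

Generation-based evaluation differs in that the model writes free text and the answer letter is parsed from it, so it depends on output formatting in a way that likelihood scoring does not.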

## Significance and Impact

CommonsenseQA has had a substantial impact on the field of natural language processing and AI research more broadly. Several aspects of its influence deserve mention.

First, the dataset helped establish commonsense reasoning as a central evaluation axis for language models. Before CommonsenseQA, many NLP benchmarks focused on reading comprehension, textual entailment, or factual question answering where the answer could be found in a provided context. By removing the context and requiring models to draw on background knowledge, CommonsenseQA shifted attention toward a fundamental capability gap in NLP systems.

Second, the dataset catalyzed research into knowledge-augmented language models. The observation that large pretrained models like BERT could not score well on CommonsenseQA motivated researchers to explore methods for injecting external knowledge, whether from knowledge graphs, dictionaries, or other structured sources, into neural language models. This line of work ultimately produced systems like KEAR that achieved human parity by combining language models with targeted knowledge retrieval. [3]

Third, CommonsenseQA has served as a proving ground for successive generations of language models. The progression from BERT's 55.9% to GPT-3's 73.0% to DeBERTa-based systems exceeding 90% provides a concrete measure of how rapidly NLP capabilities have advanced. [1][9] The benchmark remains relevant even as models improve, because the transition from fine-tuned to zero-shot evaluation using LLMs creates a new dimension of comparison.

With over 1,800 citations, CommonsenseQA stands alongside [GLUE](/wiki/glue_benchmark), [SuperGLUE](/wiki/superglue), [SQuAD](/wiki/squad), and [MMLU](/wiki/mmlu) as one of the defining benchmarks of the modern NLP era. Its influence extends beyond direct use, as the ConceptNet-based construction methodology has inspired the design of other benchmarks targeting specific reasoning capabilities.

## Limitations

Despite its contributions, CommonsenseQA has several recognized limitations.

The dataset is English-only (though X-CSQA addresses this for evaluation purposes), limiting its direct applicability to multilingual settings. The reliance on ConceptNet as a structural backbone means that the types of commonsense knowledge tested are partially constrained by ConceptNet's coverage and relation types. Some researchers have noted that certain questions can be answered by recalling factual world knowledge rather than by commonsense reasoning, blurring the boundary between factual recall and commonsense inference.

The five-choice format, while more challenging than binary classification, still allows models to achieve above-chance performance through elimination strategies rather than genuine understanding. CommonsenseQA 2.0's switch to yes/no format was partly motivated by this concern, though binary questions introduce their own complications (such as sensitivity to negation and phrasing).

Finally, as with many benchmarks, there is a risk of overfitting to the specific distribution of CommonsenseQA questions. Models that achieve high accuracy on the leaderboard may not generalize to commonsense reasoning in more open-ended settings, a limitation that broader evaluation suites attempt to address.

## See Also

- [ConceptNet](/wiki/conceptnet)
- [BERT](/wiki/bert)
- [Natural Language Processing](/wiki/natural_language_processing)
- [Commonsense Reasoning](/wiki/commonsense_reasoning)
- [GLUE](/wiki/glue_benchmark)
- [SQuAD](/wiki/squad)
- [WinoGrande](/wiki/winogrande)
- [HellaSwag](/wiki/hellaswag)
- [MMLU](/wiki/mmlu)

## References

1. Talmor, A., Herzig, J., Lourie, N., & Berant, J. (2019). CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge. *Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019)*, pages 4149-4158. https://aclanthology.org/N19-1421/

2. Talmor, A., Yoran, O., Le Bras, R., Bhagavatula, C., Goldberg, Y., Choi, Y., & Berant, J. (2022). CommonsenseQA 2.0: Exposing the Limits of AI through Gamification. *Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks (NeurIPS 2021)*. https://arxiv.org/abs/2201.05320

3. Microsoft Research (2021). Azure AI milestone: Microsoft KEAR surpasses human performance on CommonsenseQA benchmark. https://www.microsoft.com/en-us/research/blog/azure-ai-milestone-microsoft-kear-surpasses-human-performance-on-commonsenseqa-benchmark/

4. Aggarwal, S., Mandowara, D., Agrawal, V., Khandelwal, D., Singla, P., & Garg, D. (2021). Explanations for CommonsenseQA: New Dataset and Models. *Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP 2021)*, pages 3050-3065. https://aclanthology.org/2021.acl-long.238/

5. Speer, R., Chin, J., & Havasi, C. (2017). ConceptNet 5.5: An Open Multilingual Graph of General Knowledge. *Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence*.

6. Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. *Proceedings of NAACL-HLT 2019*.

7. Khashabi, D., Min, S., Khot, T., Sabharwal, A., Tafjord, O., Clark, P., & Hajishirzi, H. (2020). UnifiedQA: Crossing Format Boundaries with a Single QA System. *Findings of the Association for Computational Linguistics: EMNLP 2020*.

8. Lin, B. Y., Lee, S., Qiao, X., & Ren, X. (2021). Common Sense Beyond English: Evaluating and Improving Multilingual Language Models for Commonsense Reasoning. *Proceedings of ACL-IJCNLP 2021*.

9. Xu, Y., Zhu, C., Wang, S., et al. (2022). Human Parity on CommonsenseQA: Augmenting Self-Attention with External Attention. *Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence (IJCAI 2022)*, pages 2762-2768. https://www.ijcai.org/proceedings/2022/383

10. Adekunle, B. et al. (2025). IAO Prompting: Making Knowledge Flow Explicit in LLMs through Structured Reasoning Templates. *arXiv preprint*. https://arxiv.org/abs/2502.03080