WinoGrande is a large-scale benchmark for commonsense reasoning, consisting of approximately 44,000 pronoun resolution problems modeled after the Winograd Schema Challenge. Introduced in 2019 by Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi from the Allen Institute for AI (AI2) and the University of Washington, WinoGrande was designed to test whether neural language models have genuinely acquired commonsense capabilities or are simply exploiting statistical biases in existing datasets. The paper was published at AAAI 2020, where it received the Outstanding Paper Award. It was later reprinted in Communications of the ACM in September 2021.
WinoGrande addresses a fundamental limitation of the original Winograd Schema Challenge (WSC): its small size of only 273 problems made it vulnerable to exploitation by modern transformer-based models. By scaling up the dataset through crowdsourcing and applying a novel algorithmic debiasing technique called AFLITE, WinoGrande provides a more reliable test of machine commonsense. At the time of publication, the best models achieved 79.1% accuracy, well below human performance of 94.0%.
The Winograd Schema Challenge (WSC) was proposed in 2012 by Hector Levesque at the University of Toronto as an alternative to the Turing test for measuring machine intelligence. The challenge is named after Terry Winograd, a computer science professor at Stanford University, whose 1972 work on natural language understanding included a famous example of pronoun ambiguity:
"The city councilmen refused the demonstrators a permit because they feared violence."
In this sentence, "they" refers to the city councilmen. However, if "feared" is replaced with "advocated," then "they" refers to the demonstrators instead. Resolving such ambiguities requires understanding the real world, not just pattern matching over surface-level text features.
Levesque, along with Ernest Davis and Leora Morgenstern, developed the original WSC dataset of 273 hand-crafted pronoun disambiguation problems organized as 136 schema pairs. Each pair consists of two nearly identical sentences where a single word change (the "trigger word") flips which entity a pronoun refers to. Human performance on these problems was approximately 96.5%, while early computational systems performed near chance (around 50%).
The WSC gained significant attention as a test of artificial intelligence, but its small size eventually became a liability. By 2019, several transformer-based language models, including BERT and its variants, achieved accuracies above 90% on WSC and related datasets. This raised the question of whether these models had truly learned commonsense reasoning or had simply memorized patterns in a small, expert-authored dataset.
The success of pre-trained language models on WSC-style benchmarks created a false impression that commonsense reasoning was largely solved. Sakaguchi et al. identified two core problems with existing Winograd-style datasets:
Small scale. The original WSC contained only 273 problems, and the largest related dataset (DPR, or Definite Pronoun Resolution) contained only 1,886 problems. These sizes were insufficient to draw statistically robust conclusions or to serve as effective training data for modern models.
Dataset biases. Even with careful expert authoring, subtle biases crept into the data. Models could exploit word associations, lexical patterns, and structural regularities to achieve high accuracy without genuine understanding. For example, the presence of a word with positive or negative sentiment near one answer option could serve as an unintentional signal.
WinoGrande was created to solve both problems simultaneously: scaling up the dataset by more than 100 times while systematically removing exploitable biases.
The WinoGrande dataset was collected through Amazon Mechanical Turk (AMT). The researchers designed a careful crowdsourcing pipeline with multiple stages.
Twin Sentence Design. Following the WSC format, each problem consists of a pair of "twin" sentences that differ by exactly one trigger word. The trigger word changes which of two entities is the correct answer. Workers were instructed to write sentence pairs between 15 and 30 words long, with at least 70% word overlap between the twins. This constraint ensures that the two sentences are nearly identical, differing only in the critical trigger word.
Two Problem Domains. Workers created problems in two categories: social commonsense, involving interactions between people (e.g., intentions, emotions, and social norms), and physical commonsense, involving the properties and behavior of physical objects and situations.
Task Format. Unlike the original WSC, which uses pronoun resolution (asking "what does 'it' refer to?"), WinoGrande reformulates the task as a fill-in-the-blank problem with binary options. A blank in the sentence replaces a mention of one of two entities, and the model must select which entity correctly fills the blank. This format is simpler to evaluate and avoids ambiguities in pronoun reference annotation.
For example:
"Robert woke up at 9:00am while Samuel woke up at 6:00am, so _ had less time to get ready for school."
Option 1: Robert | Option 2: Samuel
Correct answer: Robert
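In practice, this binary-choice format is typically scored by substituting each option into the blank and comparing a language model's scores for the two completed sentences. The sketch below illustrates the idea; `score_fn` is a hypothetical stand-in for a real model's log-likelihood, and the numeric scores are illustrative only:

```python
def fill_blank(sentence: str, option: str) -> str:
    """Substitute a candidate entity for the '_' placeholder."""
    return sentence.replace("_", option, 1)

def predict(sentence: str, option1: str, option2: str, score_fn) -> str:
    """Return the option whose completed sentence the model scores higher."""
    s1 = score_fn(fill_blank(sentence, option1))
    s2 = score_fn(fill_blank(sentence, option2))
    return option1 if s1 >= s2 else option2

example = ("Robert woke up at 9:00am while Samuel woke up at 6:00am, "
           "so _ had less time to get ready for school.")

# Illustrative log-likelihoods; a real score_fn would query a language model.
fake_scores = {
    fill_blank(example, "Robert"): -41.2,
    fill_blank(example, "Samuel"): -43.8,
}
print(predict(example, "Robert", "Samuel", fake_scores.get))  # prints "Robert"
```

Because the two options come from the problem itself, evaluation reduces to a single pairwise comparison, which is what makes the format simple to score automatically.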
Worker Qualifications and Compensation. Only workers with a 99% approval rate and at least 5,000 approved tasks were eligible. Workers were paid $0.40 per twin sentence pair and $0.03 per validation task.
Validation. Each collected question was validated by three additional crowd workers. A problem was retained only if the majority of validators agreed on the correct answer, the options were unambiguous, and the problem could not be solved through simple word association.
The initial collection yielded approximately 77,000 questions (38,500 twin pairs). After validation, roughly 68% were retained, producing about 53,000 valid problems.
The second stage of dataset construction was the application of AFLITE (Adversarial Filtering Lite), a novel algorithm for systematic bias reduction. AFLITE is a lightweight variant of Adversarial Filtering (AF) that identifies and removes instances whose labels can be predicted from superficial features alone.
The algorithm works as follows:
Embedding computation. Each problem instance is converted into a fixed-length vector representation using pre-computed neural embeddings.
Ensemble training. The dataset is randomly split into training and validation portions. A set of 64 linear classifiers is trained on the training portion.
Prediction scoring. Each instance in the validation set receives a score based on the fraction of classifiers that correctly predict its label. Instances with scores at or above a threshold (0.75) are considered "easy" for the ensemble, meaning their labels are predictable from surface features.
Removal. The top-scoring (most predictable) instances are removed. This process iterates until convergence.
Termination. The loop ends when fewer than a threshold number of instances are removed in a single iteration, or when the dataset shrinks below a minimum size.
AFLITE is more efficient than the original AF algorithm because it does not require retraining a full neural network at each iteration. Instead, it relies on pre-computed embeddings and trains only lightweight linear classifiers. This makes the process scalable to large datasets.
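The loop above can be sketched as follows. This is an illustrative reimplementation rather than the authors' code: it uses scikit-learn logistic regressions as the lightweight linear classifiers and assumes pre-computed embeddings `X` and labels `y` as NumPy arrays. The ensemble size (64) and predictability threshold (0.75) follow the description above; the per-iteration cutoff and minimum dataset size are placeholder values:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def aflite(X, y, n_classifiers=64, threshold=0.75,
           cut_per_iter=500, min_size=1000, seed=0):
    """Iteratively remove instances whose labels linear models predict easily."""
    rng = np.random.default_rng(seed)
    keep = np.arange(len(X))  # indices of instances still in the dataset
    while len(keep) > min_size:
        correct = np.zeros(len(keep))
        counts = np.zeros(len(keep))
        for _ in range(n_classifiers):
            # Random train/validation split; train one lightweight classifier.
            perm = rng.permutation(len(keep))
            train, val = perm[: len(keep) // 2], perm[len(keep) // 2:]
            clf = LogisticRegression(max_iter=200)
            clf.fit(X[keep[train]], y[keep[train]])
            correct[val] += clf.predict(X[keep[val]]) == y[keep[val]]
            counts[val] += 1
        # Predictability = fraction of ensemble members that were correct.
        predictability = correct / np.maximum(counts, 1)
        easy = np.argsort(-predictability)[:cut_per_iter]
        easy = easy[predictability[easy] >= threshold]
        if len(easy) == 0:  # converged: nothing left above the threshold
            break
        keep = np.delete(keep, easy)
    return keep  # indices of instances that survived filtering

# Demo on synthetic data whose labels are fully predictable from X[:, 0]:
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(1200, 8))
y_demo = (X_demo[:, 0] > 0).astype(int)
kept = aflite(X_demo, y_demo, n_classifiers=8, cut_per_iter=100, min_size=200)
print(f"{len(kept)} of 1200 instances survive filtering")
```

On this synthetic data the labels are a linear function of the embeddings, so the filter keeps discarding the most predictable instances until the minimum size is reached, mirroring how AFLITE strips away problems solvable from surface features alone.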
After applying AFLITE, about 14% of the collected data was discarded. The debiasing effect was substantial: the KL divergence between the embedding distributions of the two answer classes dropped from 2.53 (before filtering) to 0.12 (after filtering), indicating that surface-level features could no longer distinguish between the two answer options.
The final WinoGrande dataset contains 43,972 problems. The authors released two configurations:
| Configuration | Description | Train | Dev | Test | Total |
|---|---|---|---|---|---|
| WinoGrande_debiased | Only instances that passed AFLITE | 9,248 | 1,267 | 1,767 | 12,282 |
| WinoGrande_all (XL) | All valid instances (including filtered ones as extra training data) | 40,938 | 1,267 | 1,767 | 43,972 |
To study the effect of training data size, the authors also created subsampled training splits:
| Split | Training Instances | Percentage of Full Training Set |
|---|---|---|
| XS | 160 | ~0.4% |
| S | 640 | ~1.6% |
| M | 2,558 | ~6.3% |
| L | 10,234 | ~25% |
| XL | 40,938 | 100% |
The dataset is available on Hugging Face at the allenai/winogrande repository, where all configurations and splits can be downloaded in Parquet format.
WinoGrande is significantly larger and more carefully debiased than its predecessors:
| Dataset | Size | Avg. Sentence Length | Vocabulary Size | Authorship |
|---|---|---|---|---|
| WSC (Levesque et al., 2012) | 273 | 19.1 words | 919 | Expert-crafted |
| DPR (Rahman & Ng, 2012) | 1,886 | 15.9 words | 4,127 | Undergraduates |
| WinoGrande (Sakaguchi et al., 2020) | 43,972 | 20.6 words | 16,469 | Crowdworkers + AFLITE |
The larger vocabulary and more diverse authorship of WinoGrande make it harder for models to exploit surface-level patterns. The application of AFLITE further reduces systematic biases that are present in expert-crafted and student-authored datasets.
Sakaguchi et al. evaluated several models on WinoGrande_debiased. The results showed a significant gap between model accuracy and human performance:
| Model | Dev Accuracy | Test Accuracy |
|---|---|---|
| WKH (Winograd Knowledge Hunting baseline) | 49.4% | 49.6% |
| Ensemble LMs | 53.0% | 50.9% |
| BERT (fine-tuned) | 65.8% | 64.9% |
| RoBERTa (fine-tuned) | 79.3% | 79.1% |
| BERT (local context only) | 52.5% | 51.9% |
| RoBERTa (local context only) | 52.1% | 50.0% |
| BERT-DPR | 50.2% | 51.0% |
| RoBERTa-DPR | 59.4% | 58.9% |
| Human | 94.1% | 94.0% |
The WKH baseline (Winograd Knowledge Hunting) performed at chance level, confirming that AFLITE successfully removed exploitable biases. RoBERTa achieved the highest model accuracy at 79.1%, but this was still 14.9 percentage points below human performance. The "local context only" variants, which received only the words immediately surrounding the blank, performed near chance, confirming that solving WinoGrande requires understanding the full sentence context.
The authors studied how RoBERTa's performance scaled with training data size:
| Training Split | Training Instances | Dev Accuracy | Test Accuracy |
|---|---|---|---|
| XS | 160 | 51.5% | 50.4% |
| S | 640 | 58.6% | 58.6% |
| M | 2,558 | 66.9% | 67.6% |
| L | 10,234 | 75.8% | 74.7% |
| XL | 40,938 | 79.3% | 79.1% |
Extrapolating from this learning curve, the authors estimated that over 118,000 training instances would be needed for RoBERTa to reach human-level accuracy. This finding highlights both the difficulty of WinoGrande and the data-hungry nature of current approaches to commonsense reasoning.
Human accuracy on WinoGrande was measured at 94.0% on the test set, determined by majority vote among three crowd workers per question. This is slightly lower than human accuracy on the original WSC (96.5%), which may reflect the noisier language used by crowd workers compared to expert-crafted sentences. Nevertheless, humans found WinoGrande problems relatively straightforward, confirming that the task tests genuine commonsense knowledge rather than obscure trivia.
One of the key findings of the WinoGrande paper is that models fine-tuned on WinoGrande transfer effectively to other commonsense reasoning benchmarks. At the time of publication, a RoBERTa model fine-tuned on WinoGrande achieved new state-of-the-art results on five related benchmarks (WSC, DPR, COPA, KnowRef, and Winogender) and strong results on several others:
| Benchmark | RoBERTa-WinoGrande | Previous SOTA | Human Performance |
|---|---|---|---|
| WSC (original) | 90.1% | 83.1% | 96.5% |
| PDP (Pronoun Disambiguation Problems) | 87.5% | N/A | 92.5% |
| SuperGLUE WSC | 85.6% | N/A | 100% |
| DPR | 93.1% | 91.7% | 95.2% |
| KnowRef | 85.6% | N/A | 92.0% |
| COPA | 90.6% | N/A | 99.0% |
| Winogender | 97.1% | N/A | N/A |
These results demonstrated that WinoGrande serves as a valuable training resource for commonsense reasoning in general. The strong transfer performance also suggested that high scores on the original WSC and related benchmarks may have been inflated by dataset-specific biases, since models trained on the larger, debiased WinoGrande still fell short of human accuracy on those same benchmarks.
Since its release, WinoGrande has become a standard evaluation benchmark for large language models (LLMs). The following table summarizes reported scores for notable models:
| Model | WinoGrande Accuracy | Evaluation Setting | Source |
|---|---|---|---|
| GPT-3 (davinci, 175B) | 77.7% | Few-shot | Brown et al., 2020 |
| GPT-3.5-turbo | 65.8% | 1-shot | Zheng et al., 2023 |
| GPT-4 (0613) | 87.1% | 1-shot | Zheng et al., 2023 |
| PaLM 2-L | 83.0% | Reported | Google, 2023 |
| LLaMA 2-70B | 69.8% | 1-shot | Zheng et al., 2023 |
| Gemma 2 27B | 83.7% | 5-shot | Leaderboard |
| Qwen 2 72B Instruct | 85.1% | 5-shot | Leaderboard |
| Command R+ | 85.4% | 5-shot | Leaderboard |
| Llama 3.1 Nemotron 70B Instruct | 84.5% | 5-shot | Leaderboard |
| Human | 94.0% | Majority vote | Sakaguchi et al., 2020 |
Note that scores vary depending on the evaluation protocol (zero-shot, few-shot, or fine-tuned) and the specific prompt format used. The most common evaluation setting in the Open LLM Leaderboard is 5-shot.
While the largest and most capable models have narrowed the gap with human performance, no model has matched the 94.0% human baseline as of the latest available evaluations. This persistent gap underscores the continued difficulty of WinoGrande as a commonsense reasoning benchmark.
WinoGrande is one of several benchmarks derived from the Winograd Schema concept. Understanding the differences between these variants is important for interpreting evaluation results.
The original Winograd Schema Challenge dataset contains 273 expert-crafted pronoun resolution problems. Problems are presented as sentences with an ambiguous pronoun, and the system must identify which entity the pronoun refers to. The dataset was designed to be small but carefully curated, with each problem requiring genuine commonsense reasoning. By 2019, transformer-based models had surpassed 90% accuracy on the original WSC, leading researchers to question whether the benchmark still served its intended purpose.
The SuperGLUE benchmark suite includes a WSC task that combines the original 273 WSC problems with additional PDP-style examples. In SuperGLUE, the task is reformulated as a True/False binary classification problem. Given a sentence and a proposed pronoun resolution (e.g., "Does 'he' refer to Martin?"), the model must answer True or False. This formulation differs from both the original WSC (which asks the model to select a referent) and WinoGrande (which uses fill-in-the-blank). The SuperGLUE WSC contains 804 problems in total.
Models have achieved very high accuracy on SuperGLUE WSC, with some exceeding 97%, which further motivated the creation of WinoGrande as a more challenging alternative.
| Feature | Original WSC | SuperGLUE WSC | WinoGrande |
|---|---|---|---|
| Size | 273 | 804 | 43,972 |
| Task format | Pronoun resolution | True/False classification | Fill-in-the-blank (binary choice) |
| Authorship | Expert linguists | Expert linguists | Crowdworkers |
| Debiasing | None | None | AFLITE algorithmic filtering |
| Best model accuracy (circa 2020) | ~90% | ~97% | 79.1% |
| Human accuracy | 96.5% | 100% (reported) | 94.0% |
The substantially lower model accuracy on WinoGrande compared to WSC and SuperGLUE WSC demonstrates the effectiveness of the AFLITE debiasing process. Even though all three benchmarks test the same underlying capability (commonsense pronoun/entity resolution), WinoGrande's scale and debiasing make it a far more robust evaluation.
WinoGrande has been adopted as a standard component of several major LLM evaluation frameworks.
WinoGrande was one of the six benchmarks used in the Hugging Face Open LLM Leaderboard v1, which was the most widely used public leaderboard for comparing open-source language models. The six benchmarks were ARC (the AI2 Reasoning Challenge), HellaSwag, MMLU, TruthfulQA, WinoGrande, and GSM8K.
The leaderboard was visited by over 2 million unique users and received approximately 300,000 monthly community interactions before being archived in June 2024. In the successor Open LLM Leaderboard v2, the benchmarks were replaced with harder alternatives (including MMLU-Pro, GPQA, MuSR, MATH, and IFEval) as many models had begun to saturate the original set.
WinoGrande is included in the EleutherAI Language Model Evaluation Harness (lm-evaluation-harness), the standard open-source framework for evaluating language models. The harness implements WinoGrande evaluation using the partial evaluation method described by Trinh and Le (2018), and it is the evaluation backend used by the Hugging Face Open LLM Leaderboard.
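The partial-evaluation idea can be illustrated as follows: each option is substituted into the blank, and the model scores only the tokens that follow the blank (identical for both options), conditioned on the option-filled prefix. This is a simplified sketch, not the harness's actual implementation; `fake_llh` is a hypothetical stand-in for a real model's conditional log-likelihood, and tokenization details differ in practice:

```python
def partial_score(sentence: str, option: str, score_fn) -> float:
    """Score the shared continuation given the prefix with the option filled in."""
    prefix, continuation = sentence.split("_", 1)
    return score_fn(prefix + option, continuation)

def predict_partial(sentence: str, options, score_fn) -> str:
    """Return the option yielding the higher continuation score."""
    return max(options, key=lambda o: partial_score(sentence, o, score_fn))

example = ("Robert woke up at 9:00am while Samuel woke up at 6:00am, "
           "so _ had less time to get ready for school.")

def fake_llh(context: str, continuation: str) -> float:
    # Illustrative stand-in for log P(continuation | context).
    return -18.0 if context.endswith("Robert") else -22.0

print(predict_partial(example, ["Robert", "Samuel"], fake_llh))  # prints "Robert"
```

Scoring only the shared continuation avoids penalizing an option merely because the entity name itself is a rarer token sequence.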
WinoGrande is also commonly included in custom evaluation suites used by research labs and companies when reporting the capabilities of new models. It appears frequently in the technical reports of major model releases, including those from OpenAI, Google, Meta, and others.
WinoGrande has had a significant impact on the field of natural language processing (NLP) and AI evaluation.
The AFLITE algorithm introduced in the WinoGrande paper has proven influential beyond the specific context of Winograd schemas. The core idea, that datasets should be systematically filtered to remove instances whose labels are predictable from surface features, has been applied to other benchmark construction efforts. AFLITE demonstrated that even carefully designed datasets can contain subtle biases that inflate model performance, and it provided a practical tool for addressing this problem.
One of the paper's most important contributions was demonstrating that high performance on the original WSC did not necessarily indicate genuine commonsense reasoning. By showing that models performed much worse on the debiased WinoGrande dataset, Sakaguchi et al. highlighted the risk of overestimating machine intelligence based on biased benchmarks. This finding encouraged the NLP community to adopt more rigorous evaluation practices.
The WinoGrande paper received the Outstanding Paper Award at the 34th AAAI Conference on Artificial Intelligence (AAAI 2020), held in New York City. The award recognized both the dataset's practical value and the methodological innovation of the AFLITE algorithm. The paper was subsequently republished in Communications of the ACM (Volume 64, Number 9, September 2021, pages 99-106), accompanied by a technical perspective article discussing the importance of the work.
Despite the rapid progress in large language models since its introduction, WinoGrande continues to serve as a meaningful benchmark. The persistent gap between the best model scores and human performance (94.0%) suggests that commonsense reasoning remains a genuine challenge. WinoGrande's inclusion in major evaluation suites ensures that it will continue to be used as new models are developed and compared.
Like all benchmarks, WinoGrande has limitations that should be considered when interpreting results.
Crowdsourced language. Because the dataset was authored by crowd workers rather than linguistic experts, the language can be noisy and may contain grammatical errors or unnatural phrasings. This introduces some ambiguity that is absent from the original expert-crafted WSC.
Binary choice format. The fill-in-the-blank format with only two options means that random guessing yields 50% accuracy. This provides less discriminative power at the lower end of the performance scale, since weak models can still achieve scores well above zero.
English only. WinoGrande is available only in English, limiting its applicability for evaluating commonsense reasoning in other languages. Some research groups have begun creating translated versions (e.g., Estonian WinoGrande), but these efforts face additional challenges around cross-linguistic validity.
Benchmark saturation. As language models continue to improve, WinoGrande scores are gradually approaching human levels. While a gap remains, the benchmark may eventually lose its discriminative power for the most capable models, following the same trajectory as the original WSC.