WinoGrande is a large-scale benchmark for commonsense reasoning, consisting of approximately 44,000 pronoun resolution problems modeled after the Winograd Schema Challenge. Introduced in 2019 by Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi from the Allen Institute for AI (AI2) and the University of Washington, WinoGrande was designed to test whether neural language models have genuinely acquired commonsense capabilities or are simply exploiting statistical biases in existing datasets. The paper was published at AAAI 2020, where it received the Outstanding Paper Award. It was later reprinted in Communications of the ACM in September 2021.
WinoGrande addresses a fundamental limitation of the original Winograd Schema Challenge (WSC): its small size of only 273 problems made it vulnerable to exploitation by modern transformer-based models. By scaling up the dataset through crowdsourcing and applying a novel algorithmic debiasing technique called AFLITE, WinoGrande provides a more reliable test of machine commonsense. At the time of publication, the best models achieved 79.1% accuracy, well below human performance of 94.0%.
The Winograd Schema Challenge (WSC) was proposed in 2012 by Hector Levesque at the University of Toronto as an alternative to the Turing test for measuring machine intelligence. The challenge is named after Terry Winograd, a computer science professor at Stanford University, whose 1972 work on natural language understanding included a famous example of pronoun ambiguity:
"The city councilmen refused the demonstrators a permit because they feared violence."
In this sentence, "they" refers to the city councilmen. However, if "feared" is replaced with "advocated," then "they" refers to the demonstrators instead. Resolving such ambiguities requires understanding the real world, not just pattern matching over surface-level text features.
Levesque, along with Ernest Davis and Leora Morgenstern, developed the original WSC dataset of 273 hand-crafted pronoun disambiguation problems organized as 136 schema pairs. Each pair consists of two nearly identical sentences where a single word change (the "trigger word") flips which entity a pronoun refers to. Human performance on these problems was approximately 96.5%, while early computational systems performed near chance (around 50%).
The WSC gained significant attention as a test of artificial intelligence, but its small size eventually became a liability. By 2019, several transformer-based language models, including BERT and its variants, achieved accuracies above 90% on WSC and related datasets. This raised the question of whether these models had truly learned commonsense reasoning or had simply memorized patterns in a small, expert-authored dataset.
The success of pre-trained language models on WSC-style benchmarks created a false impression that commonsense reasoning was largely solved. Sakaguchi et al. identified two core problems with existing Winograd-style datasets:
Small scale. The original WSC contained only 273 problems, and the largest related dataset (DPR, or Definite Pronoun Resolution) contained only 1,886 problems. These sizes were insufficient to draw statistically robust conclusions or to serve as effective training data for modern models.
Dataset biases. Even with careful expert authoring, subtle biases crept into the data. Models could exploit word associations, lexical patterns, and structural regularities to achieve high accuracy without genuine understanding. For example, the presence of a word with positive or negative sentiment near one answer option could serve as an unintentional signal.
WinoGrande was created to solve both problems simultaneously: scaling up the dataset by more than 100 times while systematically removing exploitable biases.
The WinoGrande dataset was collected through Amazon Mechanical Turk (AMT). The researchers designed a careful crowdsourcing pipeline with multiple stages.
Twin Sentence Design. Following the WSC format, each problem consists of a pair of "twin" sentences that differ by exactly one trigger word. The trigger word changes which of two entities is the correct answer. Workers were instructed to write sentence pairs between 15 and 30 words long, with at least 70% word overlap between the twins. This constraint ensures that the two sentences are nearly identical, differing only in the critical trigger word.
Two Problem Domains. Workers created problems in two categories: social commonsense, involving interactions between people (e.g., intentions, emotions, and social norms), and physical commonsense, involving the properties and behavior of physical objects and situations.
Task Format. Unlike the original WSC, which uses pronoun resolution (asking "what does 'it' refer to?"), WinoGrande reformulates the task as a fill-in-the-blank problem with binary options. A blank in the sentence replaces a mention of one of two entities, and the model must select which entity correctly fills the blank. This format is simpler to evaluate and avoids ambiguities in pronoun reference annotation.
For example:
"Robert woke up at 9:00am while Samuel woke up at 6:00am, so _ had less time to get ready for school."
Option 1: Robert | Option 2: Samuel
Correct answer: Robert
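In practice, this binary-choice format is typically scored by substituting each option into the blank and comparing a language model's scores for the two completed sentences. The sketch below illustrates the idea; `score_fn` is a hypothetical stand-in for a real model's log-likelihood, and the numeric scores are illustrative only:

```python
def fill_blank(sentence: str, option: str) -> str:
    """Substitute a candidate entity for the '_' placeholder."""
    return sentence.replace("_", option, 1)

def predict(sentence: str, option1: str, option2: str, score_fn) -> str:
    """Return the option whose completed sentence the model scores higher."""
    s1 = score_fn(fill_blank(sentence, option1))
    s2 = score_fn(fill_blank(sentence, option2))
    return option1 if s1 >= s2 else option2

example = ("Robert woke up at 9:00am while Samuel woke up at 6:00am, "
           "so _ had less time to get ready for school.")

# Illustrative log-likelihoods; a real score_fn would query a language model.
fake_scores = {
    fill_blank(example, "Robert"): -41.2,
    fill_blank(example, "Samuel"): -43.8,
}
print(predict(example, "Robert", "Samuel", fake_scores.get))  # prints "Robert"
```

Because the two options come from the problem itself, evaluation reduces to a single pairwise comparison, which is what makes the format simple to score automatically.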
Worker Qualifications and Compensation. Only workers with a 99% approval rate and at least 5,000 approved tasks were eligible. Workers were paid $0.40 per twin sentence pair and $0.03 per validation task.
Validation. Each collected question was validated by three additional crowd workers. A problem was retained only if the majority of validators agreed on the correct answer, the options were unambiguous, and the problem could not be solved through simple word association.
The initial collection yielded approximately 77,000 questions (38,500 twin pairs). After validation, roughly 68% were retained, producing about 53,000 valid problems.
The second stage of dataset construction was the application of AFLITE (Adversarial Filtering Lite), a novel algorithm for systematic bias reduction. AFLITE is a lightweight variant of Adversarial Filtering (AF) that identifies and removes instances whose labels can be predicted from superficial features alone.
The algorithm works as follows:
Embedding computation. Each problem instance is converted into a fixed-length vector representation using pre-computed neural embeddings.
Ensemble training. The dataset is randomly split into training and validation portions. A set of 64 linear classifiers is trained on the training portion.
Prediction scoring. Each instance in the validation set receives a score based on the fraction of classifiers that correctly predict its label. Instances with scores at or above a threshold (0.75) are considered "easy" for the ensemble, meaning their labels are predictable from surface features.
Removal. The top-scoring (most predictable) instances are removed. This process iterates until convergence.
Termination. The loop ends when fewer than a threshold number of instances are removed in a single iteration, or when the dataset shrinks below a minimum size.
AFLITE is more efficient than the original AF algorithm because it does not require retraining a full neural network at each iteration. Instead, it relies on pre-computed embeddings and trains only lightweight linear classifiers. This makes the process scalable to large datasets.
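The loop above can be sketched as follows. This is an illustrative reimplementation rather than the authors' code: it uses scikit-learn logistic regressions as the lightweight linear classifiers and assumes pre-computed embeddings `X` and labels `y` as NumPy arrays. The ensemble size (64) and predictability threshold (0.75) follow the description above; the per-iteration cutoff and minimum dataset size are placeholder values:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def aflite(X, y, n_classifiers=64, threshold=0.75,
           cut_per_iter=500, min_size=1000, seed=0):
    """Iteratively remove instances whose labels linear models predict easily."""
    rng = np.random.default_rng(seed)
    keep = np.arange(len(X))  # indices of instances still in the dataset
    while len(keep) > min_size:
        correct = np.zeros(len(keep))
        counts = np.zeros(len(keep))
        for _ in range(n_classifiers):
            # Random train/validation split; train one lightweight classifier.
            perm = rng.permutation(len(keep))
            train, val = perm[: len(keep) // 2], perm[len(keep) // 2:]
            clf = LogisticRegression(max_iter=200)
            clf.fit(X[keep[train]], y[keep[train]])
            correct[val] += clf.predict(X[keep[val]]) == y[keep[val]]
            counts[val] += 1
        # Predictability = fraction of ensemble members that were correct.
        predictability = correct / np.maximum(counts, 1)
        easy = np.argsort(-predictability)[:cut_per_iter]
        easy = easy[predictability[easy] >= threshold]
        if len(easy) == 0:  # converged: nothing left above the threshold
            break
        keep = np.delete(keep, easy)
    return keep  # indices of instances that survived filtering

# Demo on synthetic data whose labels are fully predictable from X[:, 0]:
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(1200, 8))
y_demo = (X_demo[:, 0] > 0).astype(int)
kept = aflite(X_demo, y_demo, n_classifiers=8, cut_per_iter=100, min_size=200)
print(f"{len(kept)} of 1200 instances survive filtering")
```

On this synthetic data the labels are a linear function of the embeddings, so the filter keeps discarding the most predictable instances until the minimum size is reached, mirroring how AFLITE strips away problems solvable from surface features alone.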
After applying AFLITE, about 14% of the collected data was discarded. The debiasing effect was substantial: the KL divergence between the embedding distributions of the two answer classes dropped from 2.53 (before filtering) to 0.12 (after filtering), indicating that surface-level features could no longer distinguish between the two answer options.
The final WinoGrande dataset contains 43,972 problems. The authors released two configurations:
| Configuration | Description | Train | Dev | Test | Total |
|---|---|---|---|---|---|
| WinoGrande_debiased | Only instances that passed AFLITE | 9,248 | 1,267 | 1,767 | 12,282 |
| WinoGrande_all (XL) | All valid instances (including filtered ones as extra training data) | 40,938 | 1,267 | 1,767 | 43,972 |
To study the effect of training data size, the authors also created subsampled training splits:
| Split | Training Instances | Percentage of Full Training Set |
|---|---|---|
| XS | 160 | ~0.4% |
| S | 640 | ~1.6% |
| M | 2,558 | ~6.3% |
| L | 10,234 | ~25% |
| XL | 40,938 | 100% |
The dataset is available on Hugging Face at the allenai/winogrande repository, where all configurations and splits can be downloaded in Parquet format.
WinoGrande is significantly larger and more carefully debiased than its predecessors:
| Dataset | Size | Avg. Sentence Length | Vocabulary Size | Authorship |
|---|---|---|---|---|
| WSC (Levesque et al., 2012) | 273 | 19.1 words | 919 | Expert-crafted |
| DPR (Rahman & Ng, 2012) | 1,886 | 15.9 words | 4,127 | Undergraduates |
| WinoGrande (Sakaguchi et al., 2020) | 43,972 | 20.6 words | 16,469 | Crowdworkers + AFLITE |
The larger vocabulary and more diverse authorship of WinoGrande make it harder for models to exploit surface-level patterns. The application of AFLITE further reduces systematic biases that are present in expert-crafted and student-authored datasets.
Sakaguchi et al. evaluated several models on WinoGrande_debiased. The results showed a significant gap between model accuracy and human performance:
| Model | Dev Accuracy | Test Accuracy |
|---|---|---|
| WKH (Winograd Knowledge Hunting baseline) | 49.4% | 49.6% |
| Ensemble LMs | 53.0% | 50.9% |
| BERT (fine-tuned) | 65.8% | 64.9% |
| RoBERTa (fine-tuned) | 79.3% | 79.1% |
| BERT (local context only) | 52.5% | 51.9% |
| RoBERTa (local context only) | 52.1% | 50.0% |
| BERT-DPR | 50.2% | 51.0% |
| RoBERTa-DPR | 59.4% | 58.9% |
| Human | 94.1% | 94.0% |
The WKH baseline (Winograd Knowledge Hunting) performed at chance level, confirming that AFLITE successfully removed exploitable biases. RoBERTa achieved the highest model accuracy at 79.1%, but this was still 14.9 percentage points below human performance. The "local context only" variants, which received only the words immediately surrounding the blank, performed near chance, confirming that solving WinoGrande requires understanding the full sentence context.
The authors studied how RoBERTa's performance scaled with training data size:
| Training Split | Training Instances | Dev Accuracy | Test Accuracy |
|---|---|---|---|
| XS | 160 | 51.5% | 50.4% |
| S | 640 | 58.6% | 58.6% |
| M | 2,558 | 66.9% | 67.6% |
| L | 10,234 | 75.8% | 74.7% |
| XL | 40,938 | 79.3% | 79.1% |
Extrapolating from this learning curve, the authors estimated that over 118,000 training instances would be needed for RoBERTa to reach human-level accuracy. This finding highlights both the difficulty of WinoGrande and the data-hungry nature of current approaches to commonsense reasoning.
Human accuracy on WinoGrande was measured at 94.0% on the test set, determined by majority vote among three crowd workers per question. This is slightly lower than human accuracy on the original WSC (96.5%), which may reflect the noisier language used by crowd workers compared to expert-crafted sentences. Nevertheless, humans found WinoGrande problems relatively straightforward, confirming that the task tests genuine commonsense knowledge rather than obscure trivia.
One of the key findings of the WinoGrande paper is that models fine-tuned on WinoGrande transfer effectively to other commonsense reasoning benchmarks. At the time of publication, a RoBERTa model fine-tuned on WinoGrande achieved new state-of-the-art results on five related benchmarks (WSC, DPR, COPA, KnowRef, and Winogender) and strong results on several others:
| Benchmark | RoBERTa-WinoGrande | Previous SOTA | Human Performance |
|---|---|---|---|
| WSC (original) | 90.1% | 83.1% | 96.5% |
| PDP (Pronoun Disambiguation Problems) | 87.5% | N/A | 92.5% |
| SuperGLUE WSC | 85.6% | N/A | 100% |
| DPR | 93.1% | 91.7% | 95.2% |
| KnowRef | 85.6% | N/A | 92.0% |
| COPA | 90.6% | N/A | 99.0% |
| Winogender | 97.1% | N/A | N/A |
These results demonstrated that WinoGrande serves as a valuable training resource for commonsense reasoning in general. The strong transfer performance also suggested that high scores on the original WSC and related benchmarks may have been inflated by dataset-specific biases, since models trained on the larger, debiased WinoGrande still fell short of human accuracy on those same benchmarks.
Since its release, WinoGrande has become a standard evaluation benchmark for large language models (LLMs). The following table summarizes reported scores for notable models:
| Model | WinoGrande Accuracy | Evaluation Setting | Source |
|---|---|---|---|
| GPT-3 (davinci, 175B) | 77.7% | Few-shot | Brown et al., 2020 |
| GPT-3.5-turbo | 65.8% | 1-shot | Zheng et al., 2023 |
| GPT-4 (0613) | 87.1% | 1-shot | Zheng et al., 2023 |
| PaLM 2-L | 83.0% | Reported | Google, 2023 |
| LLaMA 2-70B | 69.8% | 1-shot | Zheng et al., 2023 |
| Gemma 2 27B | 83.7% | 5-shot | Leaderboard |
| Qwen 2 72B Instruct | 85.1% | 5-shot | Leaderboard |
| Command R+ | 85.4% | 5-shot | Leaderboard |
| Llama 3.1 Nemotron 70B Instruct | 84.5% | 5-shot | Leaderboard |
| Human | 94.0% | Majority vote | Sakaguchi et al., 2020 |
Note that scores vary depending on the evaluation protocol (zero-shot, few-shot, or fine-tuned) and the specific prompt format used. The most common evaluation setting in the Open LLM Leaderboard is 5-shot.
While the largest and most capable models have narrowed the gap with human performance, no model has matched the 94.0% human baseline as of the latest available evaluations. This persistent gap underscores the continued difficulty of WinoGrande as a commonsense reasoning benchmark.
WinoGrande is one of several benchmarks derived from the Winograd Schema concept. Understanding the differences between these variants is important for interpreting evaluation results.
The original Winograd Schema Challenge dataset contains 273 expert-crafted pronoun resolution problems. Problems are presented as sentences with an ambiguous pronoun, and the system must identify which entity the pronoun refers to. The dataset was designed to be small but carefully curated, with each problem requiring genuine commonsense reasoning. By 2019, transformer-based models had surpassed 90% accuracy on the original WSC, leading researchers to question whether the benchmark still served its intended purpose.
The SuperGLUE benchmark suite includes a WSC task that combines the original 273 WSC problems with additional PDP-style examples. In SuperGLUE, the task is reformulated as a True/False binary classification problem. Given a sentence and a proposed pronoun resolution (e.g., "Does 'he' refer to Martin?"), the model must answer True or False. This formulation differs from both the original WSC (which asks the model to select a referent) and WinoGrande (which uses fill-in-the-blank). The SuperGLUE WSC contains 804 problems in total.
Models have achieved very high accuracy on SuperGLUE WSC, with some exceeding 97%, which further motivated the creation of WinoGrande as a more challenging alternative.
| Feature | Original WSC | SuperGLUE WSC | WinoGrande |
|---|---|---|---|
| Size | 273 | 804 | 43,972 |
| Task format | Pronoun resolution | True/False classification | Fill-in-the-blank (binary choice) |
| Authorship | Expert linguists | Expert linguists | Crowdworkers |
| Debiasing | None | None | AFLITE algorithmic filtering |
| Best model accuracy (circa 2020) | ~90% | ~97% | 79.1% |
| Human accuracy | 96.5% | 100% (reported) | 94.0% |
The substantially lower model accuracy on WinoGrande compared to WSC and SuperGLUE WSC demonstrates the effectiveness of the AFLITE debiasing process. Even though all three benchmarks test the same underlying capability (commonsense pronoun/entity resolution), WinoGrande's scale and debiasing make it a far more robust evaluation.
WinoGrande has been adopted as a standard component of several major LLM evaluation frameworks.
WinoGrande was one of the six benchmarks used in the Hugging Face Open LLM Leaderboard v1, which was the most widely used public leaderboard for comparing open-source language models. The six benchmarks were ARC (the AI2 Reasoning Challenge), HellaSwag, MMLU, TruthfulQA, WinoGrande, and GSM8K.
The leaderboard was visited by over 2 million unique users and received approximately 300,000 monthly community interactions before being archived in June 2024. In the successor Open LLM Leaderboard v2, the benchmarks were replaced with harder alternatives (including MMLU-Pro, GPQA, MuSR, MATH, and IFEval) as many models had begun to saturate the original set.
WinoGrande is included in the EleutherAI Language Model Evaluation Harness (lm-evaluation-harness), the standard open-source framework for evaluating language models. The harness implements WinoGrande evaluation using the partial evaluation method described by Trinh and Le (2018), and it is the evaluation backend used by the Hugging Face Open LLM Leaderboard.
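The partial-evaluation idea can be illustrated as follows: each option is substituted into the blank, and the model scores only the tokens that follow the blank (identical for both options), conditioned on the option-filled prefix. This is a simplified sketch, not the harness's actual implementation; `fake_llh` is a hypothetical stand-in for a real model's conditional log-likelihood, and tokenization details differ in practice:

```python
def partial_score(sentence: str, option: str, score_fn) -> float:
    """Score the shared continuation given the prefix with the option filled in."""
    prefix, continuation = sentence.split("_", 1)
    return score_fn(prefix + option, continuation)

def predict_partial(sentence: str, options, score_fn) -> str:
    """Return the option yielding the higher continuation score."""
    return max(options, key=lambda o: partial_score(sentence, o, score_fn))

example = ("Robert woke up at 9:00am while Samuel woke up at 6:00am, "
           "so _ had less time to get ready for school.")

def fake_llh(context: str, continuation: str) -> float:
    # Illustrative stand-in for log P(continuation | context).
    return -18.0 if context.endswith("Robert") else -22.0

print(predict_partial(example, ["Robert", "Samuel"], fake_llh))  # prints "Robert"
```

Scoring only the shared continuation avoids penalizing an option merely because the entity name itself is a rarer token sequence.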
WinoGrande is also commonly included in custom evaluation suites used by research labs and companies when reporting the capabilities of new models. It appears frequently in the technical reports of major model releases, including those from OpenAI, Google, Meta, and others.
WinoGrande has had a significant impact on the field of natural language processing (NLP) and AI evaluation.
The AFLITE algorithm introduced in the WinoGrande paper has proven influential beyond the specific context of Winograd schemas. The core idea, that datasets should be systematically filtered to remove instances whose labels are predictable from surface features, has been applied to other benchmark construction efforts. AFLITE demonstrated that even carefully designed datasets can contain subtle biases that inflate model performance, and it provided a practical tool for addressing this problem.
One of the paper's most important contributions was demonstrating that high performance on the original WSC did not necessarily indicate genuine commonsense reasoning. By showing that models performed much worse on the debiased WinoGrande dataset, Sakaguchi et al. highlighted the risk of overestimating machine intelligence based on biased benchmarks. This finding encouraged the NLP community to adopt more rigorous evaluation practices.
The WinoGrande paper received the Outstanding Paper Award at the 34th AAAI Conference on Artificial Intelligence (AAAI 2020), held in New York City. The award recognized both the dataset's practical value and the methodological innovation of the AFLITE algorithm. The paper was subsequently republished in Communications of the ACM (Volume 64, Number 9, September 2021, pages 99-106), accompanied by a technical perspective article discussing the importance of the work.
Despite the rapid progress in large language models since its introduction, WinoGrande continues to serve as a meaningful benchmark. The persistent gap between the best model scores and human performance (94.0%) suggests that commonsense reasoning remains a genuine challenge. WinoGrande's inclusion in major evaluation suites ensures that it will continue to be used as new models are developed and compared.
Like all benchmarks, WinoGrande has limitations that should be considered when interpreting results.
Crowdsourced language. Because the dataset was authored by crowd workers rather than linguistic experts, the language can be noisy and may contain grammatical errors or unnatural phrasings. This introduces some ambiguity that is absent from the original expert-crafted WSC.
Binary choice format. The fill-in-the-blank format with only two options means that random guessing yields 50% accuracy. This provides less discriminative power at the lower end of the performance scale, since weak models can still achieve scores well above zero.
English only. WinoGrande is available only in English, limiting its applicability for evaluating commonsense reasoning in other languages. Some research groups have begun creating translated versions (e.g., Estonian WinoGrande), but these efforts face additional challenges around cross-linguistic validity.
Benchmark saturation. As language models continue to improve, WinoGrande scores are gradually approaching human levels. While a gap remains, the benchmark may eventually lose its discriminative power for the most capable models, following the same trajectory as the original WSC.