Winograd Schema Challenge


Updated Jun 28, 2026

The Winograd Schema Challenge (WSC) is a commonsense reasoning test in which a system must resolve an ambiguous pronoun in a short sentence where the correct answer flips when one or two words change, so that getting it right requires world knowledge rather than surface word patterns. It was proposed by Hector Levesque at the 2011 AAAI Spring Symposium and formalized in a 2012 paper co-authored with Ernest Davis (NYU) and Leora Morgenstern, as a more rigorous alternative to the Turing test.^[1]^[2] The original benchmark, known as WSC-273, contains 273 hand-built schemas; for most of the 2010s no system beat chance by much, but by 2019-2023 large language models had largely solved it, scoring above 90 percent and reaching human-level performance.^[3]^[6]

Levesque defined the format precisely. "A Winograd schema is a pair of sentences that differ in only one or two words and that contain an ambiguity that is resolved in opposite ways in the two sentences and requires the use of world knowledge and reasoning for its resolution."^[1] Each item is therefore a near-identical sentence pair, and the swap flips which entity the pronoun refers to. Levesque pitched it as a test that would resist the kind of statistical and lexical shortcuts a chatbot can use to fake conversation, arguing that a thoughtfully designed pronoun puzzle is harder to game than an open-ended dialogue.

The name honors a sentence pair Terry Winograd used in his 1972 book Understanding Natural Language, which grew out of his 1971 MIT PhD thesis: "The city councilmen refused the demonstrators a permit because they feared violence," versus the variant "because they advocated violence."^[2] The pronoun "they" points at the council in the first version and the demonstrators in the second, and the only thing that changed is one verb. There is no syntactic clue. A reader has to know something about who tends to fear violence and who tends to advocate it, and apply that knowledge on the fly.

The canonical illustration in the 2012 paper is the trophy and the suitcase: "The trophy doesn't fit in the brown suitcase because it's too big. What is too big?"^[1] Here "it" refers to the trophy, but in the variant "because it's too small," "it" refers to the suitcase. Nothing in the grammar settles the question. You have to know that when one thing does not fit inside another, either the contents are too big or the container is too small, never the reverse.

The story of the WSC is, in a sense, the story of a decade in natural language understanding. It was designed to be hard for systems that lean on surface patterns, and it became easy for reasons that still feel a little uncomfortable, since nobody really thinks GPT-3 understands violence the way a human does. What it understands is text.

What is the Winograd Schema Challenge?

The Winograd Schema Challenge is a binary pronoun-resolution test built from pairs of sentences that are identical except for one or two words. In each pair an ambiguous pronoun (such as "it," "they," or "she") has two grammatically possible antecedents, and the special word decides which antecedent is correct, in opposite directions across the two sentences.^[1] Because both candidate referents are syntactically and semantically plausible, the resolution cannot come from grammar or from a simple lexical rule. It has to come from knowledge about how the world works.

Levesque, Davis, and Morgenstern proposed the challenge as a replacement for the Turing test. The Turing test asks whether a machine can fool a human judge in conversation, which the authors regarded as a test of imitation and evasion rather than of understanding. The WSC, by contrast, requires no conversation: the subject is shown a sentence and asked which entity a pronoun refers to. As the authors put it, the goal was a test where "a subject with the requisite commonsense knowledge" succeeds and a system relying on surface tricks fails.^[2]

What is a Winograd schema?

A Winograd schema is a pair of sentences satisfying four conditions, which Levesque, Davis, and Morgenstern stated formally.^[1] A schema must:

Differ in only one or two words (the "special" word or words).
Contain an ambiguous pronoun whose two candidate referents are both grammatically and semantically plausible.
Resolve to different referents in the two variants.
Be "Google-proof," meaning, as Davis put it, that "there is no obvious statistical test over text corpora that will reliably disambiguate these correctly."^[2]

The last condition is the hard one. Levesque, Davis, and Morgenstern manually screened candidate schemas to remove items that could be solved by checking which noun phrase appears more often near the predicate on the open web. The trophy-and-suitcase schema passes the screen. A schema like "The man could not lift the boy because he was too weak/heavy" might fail if a corpus search reveals that "man" co-occurs with "weak" and "boy" with "heavy" often enough to give both answers away.
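The screening idea can be sketched as a small co-occurrence test. This is a minimal illustration, not the authors' procedure: the toy corpus, the function names, and the rule "reject a schema when the counting heuristic gets every variant right" are all assumptions made for the example.

```python
def cooccurrence_count(corpus, noun, special_word):
    """Count corpus sentences that mention both words (case-insensitive)."""
    count = 0
    for sentence in corpus:
        tokens = set(sentence.lower().replace(".", " ").split())
        if noun in tokens and special_word in tokens:
            count += 1
    return count


def statistical_guess(corpus, candidates, special_word):
    """Pick the candidate seen more often with the special word; None on a tie."""
    first, second = candidates
    first_count = cooccurrence_count(corpus, first, special_word)
    second_count = cooccurrence_count(corpus, second, special_word)
    if first_count == second_count:
        return None
    return first if first_count > second_count else second


def fails_google_proof_screen(corpus, candidates, variants):
    """variants is a list of (special_word, correct_answer) pairs.

    Hypothetical rule: the schema is rejected when counting alone
    resolves every variant correctly.
    """
    return all(
        statistical_guess(corpus, candidates, word) == answer
        for word, answer in variants
    )


# Invented toy corpus; a real screen would query a large text collection.
toy_corpus = [
    "The weak man sat down.",
    "The man felt weak today.",
    "The heavy boy was hard to carry.",
    "The boy is heavy.",
    "The big suitcase was full.",
    "The big trophy gleamed.",
    "A small trophy sat on the shelf.",
]

lift_rejected = fails_google_proof_screen(
    toy_corpus, ("man", "boy"), [("weak", "man"), ("heavy", "boy")]
)
trophy_rejected = fails_google_proof_screen(
    toy_corpus, ("trophy", "suitcase"), [("big", "trophy"), ("small", "suitcase")]
)
```

In this toy corpus the man/boy schema is rejected because the counts alone point to the right noun for both special words, while the trophy/suitcase schema survives because the counts tie or point the wrong way.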

The paired structure does the work. A model that gets one variant right by luck or by surface heuristic should get the other one wrong, since the two sentences look almost identical and the correct answer flips. That symmetry is what makes the schema, in principle, hard to game.
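A minimal sketch of why the pairing matters, using hypothetical class and field names rather than any official data format: a resolver with a fixed bias toward one candidate gets exactly one sentence of each well-formed pair right, so it lands on chance.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class WinogradItem:
    sentence: str
    pronoun: str
    candidates: Tuple[str, str]
    answer: str


@dataclass(frozen=True)
class WinogradSchema:
    first: WinogradItem
    second: WinogradItem

    def is_well_formed(self) -> bool:
        """Same two candidates in both variants, and the answer flips."""
        items = (self.first, self.second)
        same_candidates = self.first.candidates == self.second.candidates
        answers_valid = all(item.answer in item.candidates for item in items)
        answers_flip = self.first.answer != self.second.answer
        return same_candidates and answers_valid and answers_flip


def always_first(item: WinogradItem) -> str:
    """A fixed-bias resolver: ignore the sentence, pick the first candidate."""
    return item.candidates[0]


def accuracy(schemas, resolver) -> float:
    items = [item for schema in schemas for item in (schema.first, schema.second)]
    return sum(resolver(item) == item.answer for item in items) / len(items)


trophy = WinogradSchema(
    WinogradItem(
        "The trophy did not fit in the suitcase because it was too big.",
        "it", ("trophy", "suitcase"), "trophy",
    ),
    WinogradItem(
        "The trophy did not fit in the suitcase because it was too small.",
        "it", ("trophy", "suitcase"), "suitcase",
    ),
)

biased_accuracy = accuracy([trophy], always_first)
```

The same holds for any number of well-formed pairs: a resolver that never looks at the special word cannot do better than 50 percent.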

The Winograd sentence

Terry Winograd introduced the city-councilmen example in Understanding Natural Language (1972), the published version of his 1971 MIT PhD thesis and the same project that produced the SHRDLU dialogue system. The point of the example was not to be a benchmark. Winograd was illustrating the limits of purely syntactic parsing. A grammar-based parser cannot decide whether "they" refers to the council or the demonstrators because both are syntactically available. The disambiguation is semantic, and in 1972 that meant it required something close to general world knowledge.

The sentence sat in textbooks for almost forty years as a stock example of the difficulty of coreference resolution. Levesque's contribution was to take the form of the example seriously and turn it into a test set. Because the correct answer alternates between the two sentences of a pair, a system cannot lean on a fixed bias toward one noun phrase.^[1]

Why was the Winograd Schema Challenge proposed?

Levesque had two complaints about the existing landscape. The first was about the Turing test itself. He argued that the Loebner Prize and similar competitions had degenerated into exercises in misdirection, where chatbots won by deflecting questions and faking typos rather than by demonstrating any understanding. In his 2013 IJCAI Research Excellence Award lecture, "On Our Best Behaviour," he objected that the Turing test rewards "trickery" and "evasiveness" and that a machine could pass it by fooling a judge rather than by reasoning.^[4]

The second complaint was about benchmarks more generally. Many NLP datasets at the time had statistical regularities that a sufficiently large model could exploit without solving the underlying task. A reading comprehension dataset where the answer is always near the question keyword teaches a model to find keywords, not to read. Levesque wanted a test where the only way to do well, by construction, was to bring relevant world knowledge to bear on each item.

A third, quieter motivation ran underneath both. Levesque had spent his career on knowledge representation, the symbolic AI tradition that tries to encode facts and rules in formal logic. By 2012 that tradition was widely seen as a dead end, eclipsed by statistical and neural methods. The WSC was, in part, a wager that there were tasks where the symbolic intuition was right and the statistical approach would hit a wall. He was half right. The wall came later than expected, and the statistical approach went through it.

Levesque's 2012 proposal

Levesque's case for the Winograd Schema Challenge appeared in two related papers. He first presented the idea at the 2011 AAAI Spring Symposium on Logical Formalizations of Commonsense Reasoning; the 2012 paper "The Winograd Schema Challenge," co-authored with Davis and Morgenstern, appeared in the proceedings of the 13th International Conference on the Principles of Knowledge Representation and Reasoning (KR 2012).^[1] The companion essay "On Our Best Behaviour," delivered as Levesque's IJCAI 2013 Research Excellence Award lecture, made the philosophical argument: the field had spent decades training systems to mimic surface behavior, and the Turing test had become a contest in evasion rather than understanding.^[4]

The authors framed the eventual test set in Turing-test terms: "The set would then be presented as a challenge for AI programs, along the lines of the Turing test."^[2] The resulting set, WSC-273, was released as a flat list of 273 schemas, drawn mostly from collections compiled by Davis. Each schema was checked by multiple authors. A small extended set, WSC-285, added a handful more.

How do the schemas work?

A few examples from the original WSC-273 give the flavor.

Sentence	Question	Answer A	Answer B
"The trophy did not fit in the suitcase because it was too big."	What was too big?	trophy	suitcase
"The trophy did not fit in the suitcase because it was too small."	What was too small?	trophy	suitcase
"Joan made sure to thank Susan for all the help she had given."	Who had given the help?	Joan	Susan
"Joan made sure to thank Susan for all the help she had received."	Who had received the help?	Joan	Susan
"The man couldn't lift his son because he was so weak."	Who was weak?	the man	the son
"The man couldn't lift his son because he was so heavy."	Who was heavy?	the man	the son

The correct answers (in order) are: trophy, suitcase, Susan, Joan, the man, the son. To get any of these right with confidence, you need a small piece of world model. Big things do not fit inside smaller things. People who give help are typically thanked by people who receive it. Lifting fails when the lifter is weak or when the load is heavy. None of these facts is hidden, and none of them appears verbatim in the sentence. They have to come from somewhere else.
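To make the point concrete that the deciding facts live outside the sentence, here is a minimal sketch in which each fact is written down by hand as a rule. The rule table, the role names, and the `Item` structure are hypothetical and cover only the six example sentences; nothing here is how any published system works.

```python
from dataclasses import dataclass

# Hand-written world knowledge: (predicate, special word) -> the role
# that the pronoun must refer to. None of this is stated in the sentences.
RULES = {
    ("fit in", "big"): "contents",      # big things do not fit in smaller ones
    ("fit in", "small"): "container",
    ("thank", "given"): "helper",       # the helper is the one being thanked
    ("thank", "received"): "thanker",
    ("lift", "weak"): "lifter",         # lifting fails if the lifter is weak
    ("lift", "heavy"): "load",          # or if the load is heavy
}


@dataclass
class Item:
    predicate: str
    special_word: str
    roles: dict  # role name -> noun phrase filling that role


def resolve(item: Item) -> str:
    """Look up which role the special word selects, then return its filler."""
    role = RULES[(item.predicate, item.special_word)]
    return item.roles[role]


fit_roles = {"contents": "trophy", "container": "suitcase"}
thank_roles = {"thanker": "Joan", "helper": "Susan"}
lift_roles = {"lifter": "the man", "load": "the son"}

items = [
    Item("fit in", "big", fit_roles),
    Item("fit in", "small", fit_roles),
    Item("thank", "given", thank_roles),
    Item("thank", "received", thank_roles),
    Item("lift", "weak", lift_roles),
    Item("lift", "heavy", lift_roles),
]

answers = [resolve(item) for item in items]
```

The sketch resolves all six items, but only because someone supplied six rules and the role assignments in advance. A new schema with a new predicate has no entry in the table, which is the difficulty the challenge was built around.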

Levesque was deliberate about the shape. The schemas are short. The vocabulary is plain. There is no trick syntax. If a system fails, it fails on meaning.

Early results and the 2016 competition

For the first few years after the dataset's release, almost no system did well. Most published results clustered between chance (50 percent on the binary version) and the high 50s. Approaches based on parsing, knowledge bases, and selectional preferences could solve some schemas but stumbled on others. There was no obvious recipe for end-to-end progress.
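How far above chance is a score in the high 50s? A rough way to see it is the standard error of an accuracy estimate under pure guessing. The sketch below assumes every item is an independent coin flip and that a system was scored on all 273 items, which was not true of every published result.

```python
import math


def standard_error(p: float, n: int) -> float:
    """Standard error of an accuracy estimate over n independent binary items."""
    return math.sqrt(p * (1 - p) / n)


def z_score(observed: float, n: int, chance: float = 0.5) -> float:
    """How many standard errors the observed accuracy sits above chance."""
    return (observed - chance) / standard_error(chance, n)


se_273 = standard_error(0.5, 273)   # roughly 0.03, i.e. about 3 points
z_58 = z_score(0.58, 273)           # roughly 2.6 standard errors
```

On that assumption, guessing alone produces scores within about three points of 50 percent most of the time, so 58 percent is detectably above chance on 273 items but by a modest margin.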

In 2016, Nuance Communications sponsored the first Winograd Schema Challenge competition, held in conjunction with the IJCAI conference. The prize was 25,000 US dollars for any system that could match human performance on a held-out test set.^[5] The competition used a pronoun disambiguation set (PDP-60) for the qualifying round and Winograd schemas for the final. The top entry, from a team led by Quan Liu, scored 58 percent.^[5]^[6] Human performance on Winograd schemas hovers around 92 to 96 percent, depending on the study.^[3] No prize was awarded, and the competition was not run again in the same form.

The 2016 results felt like a confirmation of Levesque's bet. Existing methods, including the early neural networks of the time, were not making progress. The wall seemed real.

The BERT era

The wall came down faster than anyone expected. Two things changed. The first was pretraining on enormous amounts of text. The second was the transformer architecture, introduced in 2017, which let models efficiently capture long-range dependencies through attention.

BERT, released by Google in late 2018, was the first model to score reliably above 70 percent on WSC-273 after task-specific fine-tuning; a 2019 result using BERT reached 90.1 percent on the original dataset.^[6] RoBERTa, released by Facebook AI in 2019, pushed scores higher still on several reformulations. The performance came largely from the pretraining corpus. A model that has read enough English about trophies and suitcases will, in some statistical sense, have absorbed the constraint that a thing inside a container is smaller than the container. It does not need to be told.

This is exactly the outcome Levesque had been skeptical about. He had argued in 2012 that no statistical pattern of word co-occurrence could capture the relevant facts, because the schemas were screened to be Google-proof. What he had not anticipated, and what nobody quite anticipated, was the scale of the effect when a model is exposed to hundreds of billions of tokens. The schemas were Google-proof against simple search queries. They turned out to be much less proof against a model that effectively memorizes the joint distribution of every co-occurring word in a large corpus.
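One plausible way to apply a language model to a schema is to substitute each candidate for the pronoun and keep whichever sentence the model scores higher. The sketch below assumes that approach; `score_fn` is a hypothetical stand-in for a model's sentence score, and the numbers in `TOY_SCORES` are invented for illustration, not measured from any model.

```python
import re


def substitute(sentence: str, pronoun: str, candidate: str) -> str:
    """Replace the first standalone occurrence of the pronoun with a candidate."""
    pattern = r"\b" + re.escape(pronoun) + r"\b"
    return re.sub(pattern, lambda match: candidate, sentence, count=1)


def resolve_by_substitution(sentence, pronoun, candidates, score_fn):
    """Return the candidate whose substituted sentence scores highest."""
    return max(
        candidates,
        key=lambda candidate: score_fn(substitute(sentence, pronoun, candidate)),
    )


# Invented scores standing in for a language model's log-probabilities.
TOY_SCORES = {
    "The trophy did not fit in the suitcase because the trophy was too big.": -20.0,
    "The trophy did not fit in the suitcase because the suitcase was too big.": -24.0,
    "The trophy did not fit in the suitcase because the trophy was too small.": -25.0,
    "The trophy did not fit in the suitcase because the suitcase was too small.": -21.0,
}


def toy_score(sentence: str) -> float:
    return TOY_SCORES[sentence]


candidates = ("the trophy", "the suitcase")
big_answer = resolve_by_substitution(
    "The trophy did not fit in the suitcase because it was too big.",
    "it", candidates, toy_score,
)
small_answer = resolve_by_substitution(
    "The trophy did not fit in the suitcase because it was too small.",
    "it", candidates, toy_score,
)
```

Nothing in the procedure encodes a fact about containers. Whatever knowledge is used sits entirely inside the scoring function, which is why the quality of the pretrained model decides the result.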

Have AI models solved the Winograd Schema Challenge?

Yes. By the early 2020s large language models had effectively solved the original WSC-273, scoring above 90 percent and matching or exceeding human-level performance. GPT-3, released by OpenAI in 2020, scored 88.3 percent zero-shot, 89.7 percent one-shot, and 88.6 percent few-shot on the Winograd task in its 175-billion-parameter form, which the authors described as "just a few points below state-of-the-art and estimated human performance."^[7] PaLM, Google's 540-billion-parameter model from 2022, scored about 90 percent.^[8] By the time GPT-4 appeared in 2023, the original WSC was effectively saturated, with frontier models scoring in the mid-90s on the standard test set.

What the GPT-3 results did, more than confirm any particular hypothesis, was force a reckoning. If a model trained on raw text could solve the Winograd Schema Challenge at human level, then either (a) the WSC was not actually a clean test of commonsense reasoning, or (b) commonsense reasoning, or some useful proxy for it, can emerge from large-scale language modeling. Both views have defenders. The honest answer is probably both.

What is WinoGrande?

WinoGrande, introduced by Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi (AAAI 2020, with a 2019 arXiv preprint and a 2021 Communications of the ACM version), was the field's response to the saturation problem. The dataset contains 44,000 Winograd-style problems, gathered through a careful crowdsourcing pipeline at the Allen Institute for AI.^[3] The authors describe it as "a large-scale dataset of 44k problems, inspired by the original WSC design, but adjusted to improve both the scale and the hardness of the dataset."^[3] Workers wrote new schemas in a controlled template and were paid more for items that survived an adversarial filter.

The filter, called AFLite (Adversarial Filter, Lite), was the technical contribution. A standard concern with crowdsourced datasets is that workers, even when trying to write hard items, leave behind systematic lexical cues. A model trained on the data can learn to exploit those cues without solving the intended task. AFLite generalizes "human-detectable word associations to machine-detectable embedding associations," using an ensemble of small linear classifiers trained on precomputed embeddings of the items to identify those that are easy in this artifact-driven sense, and discards them.^[3] What remains is, in principle, harder for a model that relies on surface statistics.
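The filtering idea can be sketched in a few lines. This is a simplified, single-pass illustration and not the published algorithm: a nearest-centroid rule stands in for the linear classifiers, the two-dimensional "embeddings" are invented, and the threshold, round count, and split size are arbitrary assumptions.

```python
import random


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def predict(x, c0, c1):
    """Nearest-centroid rule, a simple stand-in for a linear classifier."""
    d0 = sum((a - b) ** 2 for a, b in zip(x, c0))
    d1 = sum((a - b) ** 2 for a, b in zip(x, c1))
    return 0 if d0 <= d1 else 1


def predictability_scores(embeddings, labels, rounds=50, train_frac=0.7, seed=0):
    """Fraction of random splits in which each held-out item is classified
    correctly by a classifier trained on the rest of that split."""
    rng = random.Random(seed)
    n = len(labels)
    correct, seen = [0] * n, [0] * n
    for _ in range(rounds):
        order = list(range(n))
        rng.shuffle(order)
        cut = int(train_frac * n)
        train, held_out = order[:cut], order[cut:]
        class0 = [embeddings[i] for i in train if labels[i] == 0]
        class1 = [embeddings[i] for i in train if labels[i] == 1]
        if not class0 or not class1:
            continue  # cannot fit a two-class rule on this split
        c0, c1 = centroid(class0), centroid(class1)
        for i in held_out:
            seen[i] += 1
            correct[i] += predict(embeddings[i], c0, c1) == labels[i]
    return [c / s if s else 0.0 for c, s in zip(correct, seen)]


def filter_once(embeddings, labels, threshold=0.75):
    """Keep only the items the ensemble cannot predict reliably."""
    scores = predictability_scores(embeddings, labels)
    return [i for i, score in enumerate(scores) if score < threshold]


# Invented data: in the first 16 items the embedding gives the label away;
# in the last 8 it points the wrong way, so surface cues do not help.
embeddings = [[-3.0, 1.0]] * 8 + [[3.0, 1.0]] * 8 + [[1.0, 1.0]] * 4 + [[-1.0, 1.0]] * 4
labels = [0] * 8 + [1] * 8 + [0] * 4 + [1] * 4

retained = filter_once(embeddings, labels)
```

Here `retained` holds the indices of the eight items whose embeddings do not betray the label. The sixteen giveaway items are predicted correctly whenever they are held out and are discarded.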

The result was a benchmark that initially proved much harder than WSC-273. State-of-the-art methods at launch scored between 59.4 and 79.1 percent, well below the 94.0 percent human ceiling.^[3] WinoGrande became a standard line item on the SuperGLUE successor leaderboards and on the suite of benchmarks reported in the GPT-3 paper, where it sat alongside HellaSwag, BoolQ, PIQA, and LAMBADA. It is now a routine reporting category for new language models. By 2022 the largest models were also scoring in the high 80s on WinoGrande, though it has held up better than its predecessor.

SuperGLUE inclusion

The original WSC, in a slightly modified form, was one of the eight tasks in SuperGLUE, the benchmark suite introduced by Wang and colleagues in 2019 as a successor to GLUE.^[9] The SuperGLUE version of the WSC reformulated the task as binary classification (is the marked pronoun coreferent with the marked noun phrase?) rather than the original multiple-choice question.
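The reformulation can be sketched as a conversion from one multiple-choice item to one yes/no instance per candidate. The class and field names are hypothetical and do not reproduce the SuperGLUE file format. Emitting both candidates for every item, as done here, is an assumption that yields balanced labels, unlike the imbalance in the released task.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class MultipleChoiceItem:
    sentence: str
    pronoun: str
    candidates: Tuple[str, str]
    answer: str


@dataclass(frozen=True)
class BinaryInstance:
    sentence: str
    pronoun: str
    noun_phrase: str
    label: bool  # True when the pronoun is coreferent with noun_phrase


def to_binary(item: MultipleChoiceItem) -> List[BinaryInstance]:
    """One coreference yes/no question per candidate noun phrase."""
    return [
        BinaryInstance(item.sentence, item.pronoun, candidate, candidate == item.answer)
        for candidate in item.candidates
    ]


item = MultipleChoiceItem(
    "The man couldn't lift his son because he was so weak.",
    "he", ("the man", "the son"), "the man",
)
instances = to_binary(item)
```

The change matters for scoring: in the multiple-choice form a system must commit to one of two candidates, while in the binary form it can be right or wrong about each candidate separately.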

In the SuperGLUE leaderboard, WSC was the task that lagged longest. Most other SuperGLUE tasks were essentially solved by 2020. The WSC subtask, partly because of its small size and partly because of class imbalance issues, took longer. By the time it was solved, the leaderboard itself had become less interesting, since the suite had run its course.

Critiques and dataset artifacts

The earliest serious critique of the WSC came from a 2018 paper by Trichelair, Emami, Trischler, Suleman, and Cheung at McGill, "On the Evaluation of Common-Sense Reasoning in Natural Language Understanding."^[10] They showed that the WSC-273 had detectable patterns. Some schemas were associative, meaning a strong language model could solve them by pure word-association. Others showed switchability problems, where flipping the order of the two candidate referents changed the difficulty in ways that should not have mattered. There were also gender artifacts, traceable to the original construction process. The set was small enough that these patterns were hard to fix without building a new dataset.
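A switchability check can be sketched as follows. It is an illustration of the idea, not the paper's protocol: in a schema where the two candidates can trade places, a resolver that actually uses the sentence should change its answer when they are swapped. The function names and both toy resolvers are assumptions.

```python
def swap_candidates(sentence: str, first: str, second: str) -> str:
    """Exchange the two candidate mentions, using a placeholder to avoid clobbering."""
    placeholder = "\0"
    return (
        sentence.replace(first, placeholder)
        .replace(second, first)
        .replace(placeholder, second)
    )


def is_consistent(resolver, sentence: str, candidates) -> bool:
    """True when the resolver's answer changes after the candidates are swapped."""
    first, second = candidates
    original_answer = resolver(sentence, candidates)
    swapped_answer = resolver(swap_candidates(sentence, first, second), candidates)
    return original_answer != swapped_answer


def last_mentioned(sentence: str, candidates) -> str:
    """Toy resolver: pick the candidate mentioned latest in the sentence."""
    return max(candidates, key=sentence.rfind)


def always_joan(sentence: str, candidates) -> str:
    """Toy resolver with a fixed preference, ignoring the sentence."""
    return "Joan"


sentence = "Joan made sure to thank Susan for all the help she had given."
names = ("Joan", "Susan")

swapped = swap_candidates(sentence, *names)
uses_sentence = is_consistent(last_mentioned, sentence, names)
ignores_sentence = is_consistent(always_joan, sentence, names)
```

Passing the check does not show that a resolver reasons correctly, only that it is not relying on which names appear. The check applies only to schemas where swapping leaves a sensible sentence.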

The Trichelair paper, along with parallel work on related benchmarks, helped establish the now-routine practice of probing for dataset artifacts before claiming a benchmark measures what it says it measures. The lesson generalized. A dataset is not commonsense reasoning. A dataset is a set of inputs and labels, and a model can sometimes match the labels for reasons unrelated to the intent of whoever wrote the inputs.

WinoGrande's AFLite filter was a direct response to this critique. So were other projects in the same family, including KnowRef (Emami et al. 2019) and Wino-X, a multilingual extension that recasts WSC-style schemas across translation pairs to test whether a model's pronoun choice is consistent across languages. The Definite Pronoun Resolution Dataset (Rahman and Ng 2012) belongs to the same family but predates the critique.

What are the variants and extensions of the Winograd Schema Challenge?

Dataset	Year	Size	Notes
Definite Pronoun Resolution	2012	1,886	Earlier dataset of WSC-style problems by Rahman and Ng.
WSC-273	2012	273	The original Levesque-Davis-Morgenstern set.
WSC-285	2016	285	Slight extension of WSC-273.
PDP-60	2016	60	Pronoun disambiguation problems used in the IJCAI competition.
KnowRef	2019	8,724	Naturalistic coreference items mined from text and adversarially filtered.
WinoGrande	2019	44,000	Crowdsourced and AFLite-filtered. The current standard.
Wino-X	2021	Multiple languages	Multilingual probe based on translation consistency.
WinoBias	2018	3,160	Diagnostic for gender bias in coreference resolution.
WinoGender	2018	720	Schema-style sentences for gender-bias probing.

The Wino- prefix has become a common naming convention for any Winograd-style coreference probe. WinoBias and WinoGender, both from 2018, applied the schema format specifically to study how coreference systems handle gendered pronouns in stereotypically gendered occupations. They are diagnostic rather than competitive benchmarks but share the methodological lineage.

Performance over time

Year	System	WSC-273 (%)	Notes
2012	Rule-based / KR systems	~50-58	Around chance to slightly above.
2016	Quan Liu et al. (IJCAI)	58	Best entry, 25,000 USD prize unclaimed.^[5]^[6]
2019	BERT (fine-tuned)	90.1	First strong neural result on the original set.^[6]
2020	GPT-3 (175B)	88.6	Few-shot prompting.^[7]
2022	PaLM (540B)	~90	Few-shot.^[8]
2023+	GPT-4 / frontier LLMs	~95+	Effectively saturated.
Human	Baseline	~92-96	Reported in multiple studies.^[3]

The WinoGrande leaderboard tells a similar but slower story. State-of-the-art methods sat between 59.4 and 79.1 percent at launch, below the 94.0 percent human ceiling.^[3] GPT-3 reported 70.2 percent zero-shot, climbing into the high 70s with few-shot prompting.^[7] By 2023 frontier models were in the low 90s, still short of the human ceiling but close.

Connection to Levesque's larger argument

In "On Our Best Behaviour," Levesque argued that AI research had drifted into producing systems that pass surface tests without meaningfully engaging with the underlying problem.^[4] The Winograd Schema Challenge was meant to be a test that could not be passed by surface tricks alone, because the test items had been designed, by hand, to defeat the obvious shortcuts.

The outcome is more nuanced than either side of the original debate would have predicted. Modern language models do solve the WSC at near-human level. They do not appear to do so by anything that looks, from the inside, like the kind of reasoning Levesque had in mind. A 175-billion-parameter network has internalized so much regularity in human-written text that it can predict the right pronoun referent without any explicit world model. Whether this counts as commonsense reasoning depends on what you think commonsense reasoning is.

Levesque retired from the University of Toronto, where he was a longtime faculty member, and is affiliated with the Vector Institute. He has continued to write about the relationship between learning and reasoning, with a generally skeptical line about whether scale alone can substitute for explicit knowledge representation. The fact that GPT-3 solves the Winograd Schema Challenge does not, on his view, mean GPT-3 understands. It means the WSC was a less perfect test than its designers hoped. That is a fair reading. It is also possible that scale is genuinely doing something interesting, and that the line between "absorbing the right regularities" and "having commonsense knowledge" is thinner than the symbolic tradition assumed. Both positions have honest defenders.

Is the Winograd Schema Challenge still used?

For practical purposes, the original WSC-273 is now retired. Frontier language models score above the human baseline, and the dataset's small size and known artifacts make it less useful for ranking the next generation of systems. It still appears in evaluation suites for smaller and older models, where it remains diagnostic. WinoGrande continues to appear in standard benchmark tables, though it too is approaching saturation. Evaluation of harder capabilities, including multi-step reasoning, mathematical inference, and tool use, has largely moved to benchmarks like MMLU, the BIG-Bench suite, and various agent-style evaluations.

The WSC's intellectual legacy is broader than any specific score. It established the practice of building benchmarks that target world knowledge rather than surface form, the practice of stress-testing benchmarks for artifacts, and the use of paired or contrastive items to control for spurious correlations. The Winograd-style format shows up in commonsense benchmarks well after the original test's competitive life, including CommonsenseQA, PIQA, and HellaSwag.

It also leaves a methodological lesson that has aged well. A benchmark designed so that only the intended kind of reasoning should solve it can still be solved by routes nobody had thought of. Every benchmark is a hypothesis about what counts as success. The WSC's hypothesis was that pronoun resolution under semantic ambiguity required general intelligence. The hypothesis turned out to be partially false in a way that was not obvious in 2012, and that fact is itself part of what we have learned.

ELI5: the trophy and the suitcase

Imagine someone tells you, "The trophy doesn't fit in the suitcase because it is too big." What is too big, the trophy or the suitcase? You instantly know it is the trophy, because if a thing does not fit inside a box, the thing is the bigger one. Now swap one word: "because it is too small." Now "it" means the suitcase, because a box that is too small cannot hold the trophy. The sentence barely changed, but the answer flipped. The Winograd Schema Challenge is a big collection of these flip puzzles. They were designed so that a computer could not cheat by counting which words usually appear together. For a long time computers were bad at them. Then very large AI language models, after reading enormous amounts of text, got good at them, scoring better than 90 out of 100. People still argue about whether that means the AI truly "understands" or just learned the patterns extremely well.

References

1. Levesque, H. J., Davis, E., and Morgenstern, L. (2012). "The Winograd Schema Challenge." *Proceedings of the 13th International Conference on Principles of Knowledge Representation and Reasoning (KR 2012)*, AAAI Press.
2. Davis, E., Morgenstern, L., and Ortiz, C. "The Winograd Schema Challenge." Project page, NYU Computer Science. https://cs.nyu.edu/~davise/papers/WinogradSchemas/WS.html
3. Sakaguchi, K., Le Bras, R., Bhagavatula, C., and Choi, Y. (2019/2020). "WinoGrande: An Adversarial Winograd Schema Challenge at Scale." arXiv:1907.10641; *Proceedings of AAAI 2020*; *Communications of the ACM* 64(9), 2021.
4. Levesque, H. J. (2014). "On Our Best Behaviour." *Artificial Intelligence*, vol. 212. Based on the IJCAI 2013 Research Excellence Award lecture.
5. Davis, E., Morgenstern, L., and Ortiz, C. L. (2017). "The First Winograd Schema Challenge at IJCAI-16." *AI Magazine* 38(4).
6. Wikipedia contributors. "Winograd Schema Challenge." *Wikipedia, The Free Encyclopedia*. https://en.wikipedia.org/wiki/Winograd_schema_challenge
7. Brown, T. B. et al. (2020). "Language Models are Few-Shot Learners." *Proceedings of NeurIPS 2020*. (The GPT-3 paper, including WSC and WinoGrande results.) arXiv:2005.14165.
8. Chowdhery, A. et al. (2022). "PaLM: Scaling Language Modeling with Pathways." arXiv:2204.02311.
9. Wang, A., Pruksachatkun, Y., Nangia, N., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. R. (2019). "SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems." *Proceedings of NeurIPS 2019*.
10. Trichelair, P., Emami, A., Trischler, A., Suleman, K., and Cheung, J. C. K. (2018). "On the Evaluation of Common-Sense Reasoning in Natural Language Understanding." arXiv:1811.01778.
11. Winograd, T. (1972). *Understanding Natural Language*. Academic Press. Based on Winograd's 1971 MIT PhD thesis.
12. Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." *Proceedings of NAACL-HLT 2019*.
13. Davis, E. (2017). "A Collection of Winograd Schemas." Maintained list of WSC items, NYU Computer Science.
