Winograd Schema Challenge
Last reviewed
May 2, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 3,720 words
The Winograd Schema Challenge (WSC) is a commonsense reasoning benchmark proposed by Hector Levesque at the AAAI Spring Symposium in 2011 and formalized in a 2012 paper with Ernest Davis (NYU) and Leora Morgenstern. The task asks a system to resolve an ambiguous pronoun in a short sentence where getting the answer right seems to require knowledge about how the world works, not just patterns of words. Each item is a pair of nearly identical sentences that differ in one or two words, and the swap flips which entity the pronoun refers to. Levesque pitched it as a more rigorous alternative to the Turing test, arguing that a thoughtfully designed pronoun puzzle would resist the kind of statistical and lexical shortcuts a chatbot can use to fake conversation.
The name honors a sentence Terry Winograd used in his 1972 MIT thesis on natural language understanding: "The city councilmen refused the demonstrators a permit because they feared violence," versus the variant "because they advocated violence." The pronoun "they" points at the council in the first version and the demonstrators in the second, and the only thing that changed is one verb. There is no syntactic clue. A reader has to know something about who tends to fear violence and who tends to advocate it, and apply that knowledge on the fly.
The original benchmark, often called WSC-273, contains 273 hand-crafted schemas. For most of the 2010s no system did much better than chance. Then large language models arrived. BERT and its descendants pushed scores into the 70s. GPT-3 and PaLM reached the high 80s and low 90s. By the early 2020s, the best models were within a few points of human performance, and the field had largely moved on to harder benchmarks, including WinoGrande, a 44,000-example successor designed by researchers at the Allen Institute for AI to be more resistant to dataset artifacts.
The story of the WSC is, in a sense, the story of a decade in natural language understanding. It was designed to be hard for systems that rely on shortcuts, and it became easy for reasons that still feel a little uncomfortable, since few would claim that GPT-3 understands violence the way a human does. What it understands is text.
Terry Winograd introduced the city-councilmen example in his 1972 MIT PhD thesis, Understanding Natural Language, the same project that produced the SHRDLU dialogue system. The point of the example was not to be a benchmark. Winograd was illustrating the limits of purely syntactic parsing. A grammar-based parser cannot decide whether "they" refers to the council or the demonstrators because both are syntactically available. The disambiguation is semantic, and in 1972 that meant it required something close to general world knowledge.
The sentence sat in textbooks for almost forty years as a stock example of the difficulty of coreference resolution. Levesque's contribution was to take the form of the example seriously and turn it into a test set. The paired structure does the work. A model that gets one variant right by luck or by surface heuristic should get the other one wrong, since the two sentences look almost identical and the correct answer flips. That symmetry is what makes the schema, in principle, hard to game.
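The effect of the paired design can be shown with a small sketch. The `Item` record, the `first_mentioned` heuristic, and the two scoring helpers are hypothetical names chosen for illustration and are not part of any WSC release. The data is the city-councilmen pair quoted above.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Item:
    sentence: str
    candidates: Tuple[str, str]
    answer: str  # the correct referent of the pronoun

# One schema is a pair of items that differ in a single word.
PAIR = (
    Item("The city councilmen refused the demonstrators a permit "
         "because they feared violence.",
         ("the city councilmen", "the demonstrators"),
         "the city councilmen"),
    Item("The city councilmen refused the demonstrators a permit "
         "because they advocated violence.",
         ("the city councilmen", "the demonstrators"),
         "the demonstrators"),
)

def first_mentioned(item: Item) -> str:
    """Surface heuristic: always pick the first candidate."""
    return item.candidates[0]

def item_accuracy(solver, pairs) -> float:
    """Fraction of individual sentences answered correctly."""
    items = [it for pair in pairs for it in pair]
    return sum(solver(it) == it.answer for it in items) / len(items)

def pair_accuracy(solver, pairs) -> float:
    """Fraction of schemas where BOTH variants are answered correctly."""
    return sum(all(solver(it) == it.answer for it in pair) for pair in pairs) / len(pairs)

print(item_accuracy(first_mentioned, [PAIR]))  # 0.5
print(pair_accuracy(first_mentioned, [PAIR]))  # 0.0
```

A heuristic that ignores the swapped word lands at exactly chance on individual sentences and at zero when credit requires both halves of a pair, which is the property the schema format relies on.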
Levesque's case for the Winograd Schema Challenge appeared in two related papers. "The Winograd Schema Challenge" was first presented at the 2011 AAAI Spring Symposium on Logical Formalizations of Commonsense Reasoning; the expanded 2012 version, co-authored with Davis and Morgenstern, appeared in the proceedings of the 13th International Conference on the Principles of Knowledge Representation and Reasoning (KR 2012). The companion essay "On Our Best Behaviour," delivered as Levesque's IJCAI 2013 Research Excellence Award lecture, made the philosophical argument: the field had spent decades training systems to mimic surface behavior, and the Turing test had become a contest in evasion rather than understanding.
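The formal definition has four conditions. A Winograd schema is a pair of sentences that:

1. differ in only one or two words;
2. contain a referential ambiguity that is resolved in opposite ways in the two sentences;
3. are easy for a human reader to disambiguate;
4. cannot be resolved by simple statistical techniques, such as checking corpus co-occurrence counts (the "Google-proof" requirement).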
The last condition is the hard one. Levesque, Davis, and Morgenstern manually screened candidate schemas to remove items that could be solved by checking which noun phrase appears more often near the predicate on the open web. A schema like "The trophy did not fit in the suitcase because it was too big/small. What was too big/small?" passes the screen. A schema like "The man could not lift the boy because he was too heavy/strong" might fail if a corpus search reveals that "man" co-occurs with "strong" more often than "boy" does.
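The kind of co-occurrence check described here can be approximated in a few lines. This is only a toy sketch: the authors screened candidates by hand, the corpus below is invented, and the function names and the `max_ratio` threshold are assumptions made for illustration.

```python
from collections import Counter

def cooccurrence_counts(corpus, candidates, special_word):
    """Count sentences in which each candidate appears alongside the special word."""
    counts = Counter({c: 0 for c in candidates})
    for sentence in corpus:
        tokens = set(sentence.lower().split())
        if special_word in tokens:
            for c in candidates:
                if c in tokens:
                    counts[c] += 1
    return counts

def passes_cooccurrence_screen(corpus, candidates, special_words, max_ratio=1.5):
    """Return False if any special word co-occurs much more with one candidate."""
    for word in special_words:
        counts = cooccurrence_counts(corpus, candidates, word)
        a, b = (counts[c] for c in candidates)
        low, high = min(a, b), max(a, b)
        if high > max_ratio * max(low, 1):
            return False  # a simple count would give the answer away
    return True

# Invented mini-corpus standing in for "the open web".
TOY_CORPUS = [
    "the strong man carried the boxes",
    "a strong man lifted the car",
    "that man is strong",
    "the boy is strong for his age",
    "the big trophy sat on the shelf",
    "a big suitcase blocked the aisle",
    "the small trophy was lost",
    "a small suitcase fits overhead",
]

print(passes_cooccurrence_screen(TOY_CORPUS, ("trophy", "suitcase"), ("big", "small")))  # True
print(passes_cooccurrence_screen(TOY_CORPUS, ("man", "boy"), ("heavy", "strong")))       # False
```

In this invented corpus "strong" appears with "man" three times and with "boy" once, so the second schema is flagged, while the trophy and suitcase counts are balanced.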
The resulting set, WSC-273, was released as a flat list of 273 schemas in 2014, drawn mostly from collections compiled by Davis. Each schema was checked by multiple authors. A small extended set, WSC-285, added a handful more.
Levesque had two complaints about the existing landscape. The first was about the Turing test itself. He argued that the Loebner Prize and similar competitions had degenerated into exercises in misdirection, where chatbots won by deflecting questions and faking typos rather than by demonstrating any understanding. "The trouble with the Turing test," he wrote, "is that it is a test of being convincing, not of being intelligent."
The second complaint was about benchmarks more generally. Many NLP datasets at the time had statistical regularities that a sufficiently large model could exploit without solving the underlying task. A reading comprehension dataset where the answer is always near the question keyword teaches a model to find keywords, not to read. Levesque wanted a test where the only way to do well, by construction, was to bring relevant world knowledge to bear on each item.
A third, quieter motivation ran underneath both. Levesque had spent his career on knowledge representation, the symbolic AI tradition that tries to encode facts and rules in formal logic. By 2012 that tradition was widely seen as a dead end, eclipsed by statistical and neural methods. The WSC was, in part, a wager that there were tasks where the symbolic intuition was right and the statistical approach would hit a wall. He was half right. The wall came later than expected, and the statistical approach went through it.
A few examples from the original WSC-273 give the flavor.
| Sentence | Question | Answer A | Answer B |
|---|---|---|---|
| "The trophy did not fit in the suitcase because it was too big." | What was too big? | trophy | suitcase |
| "The trophy did not fit in the suitcase because it was too small." | What was too small? | trophy | suitcase |
| "Joan made sure to thank Susan for all the help she had given." | Who had given the help? | Joan | Susan |
| "Joan made sure to thank Susan for all the help she had received." | Who had received the help? | Joan | Susan |
| "The man couldn't lift his son because he was so weak." | Who was weak? | the man | the son |
| "The man couldn't lift his son because he was so heavy." | Who was heavy? | the man | the son |
The correct answers (in order) are: trophy, suitcase, Susan, Joan, the man, the son. To get any of these right with confidence, you need a small piece of world model. Big things do not fit inside smaller things. People who give help are typically thanked by people who receive it. Lifting fails when the lifter is weak or when the load is heavy. None of these facts is hidden, and none of them appears verbatim in the sentence. They have to come from somewhere else.
Levesque was deliberate about the shape. The schemas are short. The vocabulary is plain. There is no trick syntax. If a system fails, it fails on meaning.
For the first few years after the dataset's release, almost no system did well. Most published results clustered between chance (50% on the binary version) and the high 50s. Approaches based on parsing, knowledge bases, and selectional preferences could solve some schemas but stumbled on others. There was no obvious recipe for end-to-end progress.
In 2016, Nuance Communications sponsored the first Winograd Schema Challenge competition, held in conjunction with the IJCAI conference in New York. The prize was $25,000 for any system that could match human performance on a held-out test set. The first round used a set of 60 pronoun disambiguation problems. The top entry, from Quan Liu and colleagues at the University of Science and Technology of China, scored 58%. The second-place entry scored 48%. Human performance on Winograd schemas hovers around 92 to 96%, depending on the study. No prize was awarded, and the competition was not run again in the same form.
The 2016 results felt like a confirmation of Levesque's bet. Existing methods, including the early neural networks of the time, were not making progress. The wall seemed real.
The wall came down faster than anyone expected. Two things changed. The first was pretraining on enormous amounts of text. The second was the transformer architecture, introduced in 2017, which let models efficiently capture long-range dependencies through attention.
BERT, released by Google in late 2018, was the first model to score reliably above 70% on WSC-273 after task-specific fine-tuning. RoBERTa, released by Facebook AI in 2019, pushed the number into the high 70s and low 80s on several reformulations. The performance came largely from the pretraining corpus. A model that has read enough English about trophies and suitcases will, in some statistical sense, have absorbed the constraint that a thing inside a container is smaller than the container. It does not need to be told.
This is exactly the outcome Levesque had been skeptical about. He had argued in 2012 that no statistical pattern of word co-occurrence could capture the relevant facts, because the schemas were screened to be Google-proof. What he had not anticipated, and what nobody quite anticipated, was the scale of the effect when a model is exposed to hundreds of billions of tokens. The schemas were Google-proof against simple search queries. They turned out to be much less proof against a model that effectively memorizes the joint distribution of every co-occurring word in a large corpus.
GPT-3, released by OpenAI in 2020, scored 88.6% on WSC-273 in its 175-billion-parameter form using a few-shot prompt format. PaLM, Google's 540-billion-parameter model from 2022, scored about 90%. By the time GPT-4 appeared in 2023, the original WSC was effectively saturated, with frontier models scoring in the mid-90s and matching or exceeding human-level performance on the standard test set.
What the GPT-3 results did, more than confirm any particular hypothesis, was force a reckoning. If a model trained on raw text could solve the Winograd Schema Challenge at human level, then either (a) the WSC was not actually a clean test of commonsense reasoning, or (b) commonsense reasoning, or some useful proxy for it, can emerge from large-scale language modeling. Both views have defenders. The honest answer is probably both.
WinoGrande, introduced by Sakaguchi, Le Bras, Bhagavatula, and Yejin Choi at AAAI 2020, was the field's response to the saturation problem. The dataset contains 44,000 Winograd-style problems, gathered through a careful crowdsourcing pipeline at the Allen Institute for AI. Workers wrote new schemas in a controlled template and were paid more for items that survived an adversarial filter.
The filter, called AFLite (Adversarial Filter, Lite), was the technical contribution. A standard concern with crowdsourced datasets is that workers, even when trying to write hard items, leave behind systematic lexical cues. A model trained on the data can learn to exploit those cues without solving the intended task. AFLite trains an ensemble of small linear classifiers on precomputed embeddings of the items, identifies the items those classifiers predict reliably (the ones that are easy in this artifact-driven sense), and discards them. What remains is, in principle, harder for a model that relies on surface statistics.
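The filtering idea can be sketched as follows. This is a simplified illustration, not the WinoGrande authors' implementation: it runs a single filtering round (the published procedure repeats on the surviving items), a nearest-centroid rule stands in for the linear classifiers, the feature vectors are invented two-dimensional stand-ins, and the names `aflite_round`, `tau`, and `k` and their default values are assumptions.

```python
import random

def fit_centroids(rows):
    """Train a nearest-centroid classifier (a linear decision rule) on (features, label) rows."""
    sums, counts = {}, {}
    for x, y in rows:
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def predict(centroids, x):
    """Return the label of the closest centroid (ties go to the smaller label)."""
    def sqdist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(sorted(centroids), key=lambda y: sqdist(centroids[y]))

def aflite_round(data, n_models=64, train_frac=0.5, tau=0.9, k=50, seed=0):
    """One round: drop up to k items that held-out classifiers predict too reliably."""
    rng = random.Random(seed)
    ids = list(range(len(data)))
    hits = [0] * len(data)
    tries = [0] * len(data)
    for _ in range(n_models):
        rng.shuffle(ids)  # a fresh random train / held-out partition
        cut = int(train_frac * len(ids))
        train, held_out = ids[:cut], ids[cut:]
        model = fit_centroids([data[i] for i in train])
        for i in held_out:
            x, y = data[i]
            tries[i] += 1
            hits[i] += predict(model, x) == y
    # Predictability score: how often an item is classified correctly when held out.
    score = [hits[i] / tries[i] if tries[i] else 0.0 for i in range(len(data))]
    ranked = sorted(range(len(data)), key=lambda i: -score[i])
    removed = {i for i in ranked[:k] if score[i] >= tau}
    kept = [i for i in range(len(data)) if i not in removed]
    return kept, removed, score

# Toy data: items 0-39 carry a cue in the first coordinate that gives away the label;
# items 40-79 have identical features, so no classifier can separate them.
DATA = ([((1.0, 0.0), 1)] * 20 + [((-1.0, 0.0), 0)] * 20
        + [((0.0, 0.0), 1)] * 20 + [((0.0, 0.0), 0)] * 20)

kept, removed, score = aflite_round(DATA)
print(len(removed), "removed;", len(kept), "kept")
```

In this toy setup the cue-bearing items are the ones flagged, because every held-out classifier predicts them correctly, while the cue-free items stay near chance and survive.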
The result was a benchmark that initially proved much harder than WSC-273. The original RoBERTa baseline scored about 79% on WinoGrande, well below the 94% human ceiling. WinoGrande became a standard line item on the SuperGLUE successor leaderboards and on the suite of benchmarks reported in the GPT-3 paper, where it sat alongside HellaSwag, BoolQ, PIQA, and LAMBADA. It is now a routine reporting category for new language models. By 2022 the largest models were also scoring in the high 80s on WinoGrande, though it has held up better than its predecessor.
The original WSC, in a slightly modified form, was one of the eight tasks in SuperGLUE, the benchmark suite introduced by Wang and colleagues in 2019 as a successor to GLUE. The SuperGLUE version of the WSC reformulated the task as binary classification (is the marked pronoun coreferent with the marked noun phrase?) rather than the original multiple-choice question, and used a sample of 554 items including the WSC-285 set plus additional schemas.
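The difference between the two formulations can be made concrete with a short sketch. The dictionary keys below are simplified, hypothetical field names rather than the actual SuperGLUE file layout, and the naive conversion shown (one example per candidate) is an assumption about how such a recasting could be done, not a description of how the SuperGLUE set was built.

```python
def to_binary_examples(sentence, pronoun, candidates, answer):
    """Turn one two-way multiple-choice item into one True/False example per candidate."""
    return [
        {"text": sentence, "pronoun": pronoun, "span": c, "label": c == answer}
        for c in candidates
    ]

examples = to_binary_examples(
    "Joan made sure to thank Susan for all the help she had given.",
    pronoun="she",
    candidates=("Joan", "Susan"),
    answer="Susan",
)
for ex in examples:
    print(ex["span"], ex["label"])
```

Under the multiple-choice framing a system compares the two candidates directly; under the binary framing it must judge each pronoun-candidate pairing on its own, without seeing the alternative.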
In the SuperGLUE leaderboard, WSC was the task that lagged longest. Most other SuperGLUE tasks were essentially solved by 2020. The WSC subtask, partly because of its small size and partly because of class imbalance issues, took longer. By the time it was solved, the leaderboard itself had become less interesting, since the suite had run its course.
The earliest serious critique of the WSC came from a 2018 paper by Trichelair, Emami, Trischler, Suleman, and Cheung at McGill, "On the Evaluation of Common-Sense Reasoning in Natural Language Understanding." They showed that the WSC-273 had detectable patterns. Some schemas were associative, meaning a strong language model could solve them by pure word-association. Others showed switchability problems, where flipping the order of the two candidate referents changed the difficulty in ways that should not have mattered. There were also gender artifacts, traceable to the original construction process. The set was small enough that these patterns were hard to fix without building a new dataset.
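One way to probe for the switchability issue is a consistency check: swap the two candidate names in the sentence and see whether a solver's answer swaps with them. The sketch below is an illustration of that idea under simple assumptions, not the procedure from the paper; `swap_candidates`, `is_consistent`, and both toy solvers are hypothetical names.

```python
def swap_candidates(sentence, a, b):
    """Exchange every occurrence of the two candidate names."""
    placeholder = "\x00"  # assumed not to occur in the sentence
    return sentence.replace(a, placeholder).replace(b, a).replace(placeholder, b)

def is_consistent(solver, sentence, a, b):
    """True if the solver's answer flips when the candidates trade places."""
    original = solver(sentence, (a, b))
    swapped = solver(swap_candidates(sentence, a, b), (a, b))
    # After the swap the correct referent is the other name,
    # so a consistent solver must give two different answers.
    return {original, swapped} == {a, b}

def always_joan(sentence, candidates):
    """Toy solver with a purely lexical preference for one name."""
    return "Joan"

def second_mentioned(sentence, candidates):
    """Toy solver that picks whichever candidate appears later in the sentence."""
    return max(candidates, key=sentence.index)

SENTENCE = "Joan made sure to thank Susan for all the help she had given."
print(swap_candidates(SENTENCE, "Joan", "Susan"))
print(is_consistent(always_joan, SENTENCE, "Joan", "Susan"))       # False
print(is_consistent(second_mentioned, SENTENCE, "Joan", "Susan"))  # True
```

The check only detects answers that are tied to a particular word; a purely positional heuristic such as `second_mentioned` passes it, which is why consistency is a diagnostic and not a substitute for accuracy.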
The Trichelair paper, along with parallel work on related benchmarks, helped establish the now-routine practice of probing for dataset artifacts before claiming a benchmark measures what it says it measures. The lesson generalized. A dataset is not commonsense reasoning. A dataset is a set of inputs and labels, and a model can sometimes match the labels for reasons unrelated to the intent of whoever wrote the inputs.
WinoGrande's AFLite filter was a direct response to this critique. Related projects in the same family include KnowRef (Emami et al. 2019), the earlier Definite Pronoun Resolution Dataset (Rahman and Ng 2012), and Wino-X, a multilingual extension that recasts WSC-style schemas across translation pairs to test whether a model's pronoun choice is consistent across languages.
| Dataset | Year | Size | Notes |
|---|---|---|---|
| Definite Pronoun Resolution | 2012 | 1,886 | Earlier dataset of WSC-style problems by Rahman and Ng. |
| WSC-273 | 2014 | 273 | The original Levesque-Davis-Morgenstern set. |
| WSC-285 | 2014 | 285 | Slight extension of WSC-273. |
| PDP-60 | 2016 | 60 | Pronoun disambiguation problems used in the IJCAI competition. |
| KnowRef | 2019 | 8,724 | Naturalistic coreference items mined from text and adversarially filtered. |
| WinoGrande | 2019 | 44,000 | Crowdsourced and AFLite-filtered. The current standard. |
| Wino-X | 2021 | Multiple languages | Multilingual probe based on translation consistency. |
| WinoBias | 2018 | 3,160 | Diagnostic for gender bias in coreference resolution. |
| WinoGender | 2018 | 720 | Schema-style sentences for gender-bias probing. |
The Wino- prefix has become a common naming convention for any Winograd-style coreference probe. WinoBias and WinoGender, both from 2018, applied the schema format specifically to study how coreference systems handle gendered pronouns in stereotypically gendered occupations. They are diagnostic rather than competitive benchmarks but share the methodological lineage.
| Year | System | WSC-273 (%) | Notes |
|---|---|---|---|
| 2012 | Rule-based / KR systems | ~50-58 | Around chance to slightly above. |
| 2016 | Top IJCAI competition entry | 58 | Best entry, $25K prize unclaimed. |
| 2018 | BERT-large fine-tuned | ~70-72 | First strong neural result. |
| 2019 | RoBERTa-large | ~79 | Improved pretraining, larger data. |
| 2020 | GPT-3 (175B) | 88.6 | Few-shot prompting. |
| 2022 | PaLM (540B) | ~90 | Few-shot. |
| 2023+ | GPT-4 / frontier LLMs | ~95+ | Effectively saturated. |
| n/a | Human baseline | ~92-96 | Reported in multiple studies. |
The WinoGrande leaderboard tells a similar but slower story. RoBERTa-large with full fine-tuning sat at 79.1% on launch. T5-11B reached 89.6%. GPT-3 reported 70.2% in zero-shot, climbing to 77.7% with few-shot prompting in the original paper. By 2023 frontier models were in the low 90s, still short of the 94% human ceiling but close.
In "On Our Best Behaviour," Levesque argued that AI research had drifted into what he called "behavioral conformity," producing systems that pass surface tests without meaningfully engaging with the underlying problem. The Winograd Schema Challenge was meant to be a behavioral test that could not be passed by behavioral conformity alone, because the test items had been designed, by hand, to defeat the obvious shortcuts.
The outcome is more nuanced than either side of the original debate would have predicted. Modern language models do solve the WSC at near-human level. They do not appear to do so by anything that looks, from the inside, like the kind of reasoning Levesque had in mind. A 175-billion-parameter network has internalized so much regularity in human-written text that it can predict the right pronoun referent without any explicit world model. Whether this counts as commonsense reasoning depends on what you think commonsense reasoning is.
Levesque retired from the University of Toronto, where he was a longtime faculty member, and is affiliated with the Vector Institute. He has continued to write about the relationship between learning and reasoning, with a generally skeptical line about whether scale alone can substitute for explicit knowledge representation. The fact that GPT-3 solves the Winograd Schema Challenge does not, on his view, mean GPT-3 understands. It means the WSC was a less perfect test than its designers hoped. That is a fair reading. It is also possible that scale is genuinely doing something interesting, and that the line between "absorbing the right regularities" and "having commonsense knowledge" is thinner than the symbolic tradition assumed. Both positions have honest defenders.
For practical purposes, the original WSC-273 is now retired. Frontier language models score above the human baseline, and the dataset's small size and known artifacts make it less useful for ranking the next generation of systems. It still appears in evaluation suites for smaller and older models, where it remains diagnostic. WinoGrande continues to appear in standard benchmark tables, though it too is approaching saturation. Evaluation of harder capabilities, including multi-step reasoning, mathematical inference, and tool use, has largely moved to benchmarks like MMLU, the BIG-Bench suite, and various agent-style evaluations.
The WSC's intellectual legacy is broader than any specific score. It established the practice of building benchmarks that target world knowledge rather than surface form, the practice of stress-testing benchmarks for artifacts, and the use of paired or contrastive items to control for spurious correlations. The Winograd-style format shows up in commonsense benchmarks well after the original test's competitive life, including CommonsenseQA, PIQA, and HellaSwag.
It also leaves a methodological lesson that has aged well. Benchmarks designed to be unsolvable for the wrong reasons can still be solved for the right reasons, but they can also be solved for reasons nobody had thought of. Every benchmark is a hypothesis about what counts as success. The WSC's hypothesis was that pronoun resolution under semantic ambiguity required general intelligence. The hypothesis turned out to be partially false in a way that was not obvious in 2012, and that fact is itself part of what we have learned.