SuperGLUE (a stickier successor to GLUE) is a public benchmark for evaluating general-purpose English language understanding in machine learning systems. It was introduced in May 2019 by Alex Wang and colleagues at New York University, Facebook AI Research, the University of Washington, and DeepMind, and was published at the 33rd Conference on Neural Information Processing Systems (NeurIPS 2019) in Vancouver. The benchmark consists of eight challenging language understanding tasks, two diagnostic datasets, a public leaderboard hosted at super.gluebenchmark.com, and a software toolkit for fair model comparison. [1]
SuperGLUE was designed as a direct response to the rapid saturation of the older GLUE benchmark. GLUE had been released by the same team only thirteen months earlier, in April 2018, and was meant to provide enough headroom to last several years. Instead, the arrival of pretrained transformer models, especially BERT in late 2018, pushed model performance past the non-expert human baseline within months. By July 2019 the top GLUE score (88.4) already exceeded the human number (87.1) by more than a point, and machines were beating humans on four of the nine constituent tasks. The authors concluded that GLUE was no longer useful as a measure of progress and built SuperGLUE to restore a meaningful margin between machines and people. [1] [2]
The benchmark held that meaningful margin for less than two years. In January 2021, Microsoft's DeBERTa ensemble crossed the human baseline of 89.8 with a score of 90.3, and Google's T5 plus Meena ensemble landed just behind at 90.2. By the early 2020s, large foundation models were saturating SuperGLUE in the same way they had saturated GLUE, and active leaderboard submissions effectively halted by late 2022. SuperGLUE is now widely considered solved and has been functionally replaced by harder evaluations such as MMLU, BIG-Bench, HELM, MMLU-Pro, GPQA, and ARC-AGI. Despite this short useful life, SuperGLUE shaped the trajectory of natural language processing research from 2019 to 2021, providing the target that motivated the scaling of T5, GPT-3, DeBERTa, and many of the architectural and training tricks that defined the early transformer era. [3] [4]
For most of the 2010s, evaluation in NLP was fragmented. Research papers picked a small set of tasks (often a single dataset such as the Stanford Question Answering Dataset, the Stanford Natural Language Inference corpus, or the CoNLL named entity recognition data) and reported state-of-the-art numbers in isolation. Cross-paper comparisons were difficult because preprocessing, evaluation scripts, and even data splits often differed silently between releases. The community had no shared yardstick that summarized broad linguistic competence in a single number.
The GLUE benchmark, released by Wang and colleagues in April 2018, was the first widely adopted attempt to fix this. It bundled nine sentence-level tasks (sentiment, paraphrase, entailment, similarity, acceptability, and others) behind a single submission server and reported a macro-average score. GLUE quickly became the de facto leaderboard for general-purpose natural language processing systems. Sentence-encoder papers competed directly on it, and the scoreboard tracked the rapid progression from BiLSTM baselines through ELMo, GPT-1, and finally BERT. [2]
The arrival of BERT-large in October 2018 changed everything. Within two months, the GLUE leaderboard moved by more than ten points. By mid-2019, ensembles built on BERT and its successors had pushed the macro-average above the human baseline. Several individual GLUE tasks were solved outright: the Multi-Genre Natural Language Inference (MNLI) task, the Quora Question Pairs (QQP) task, and the Stanford Sentiment Treebank (SST-2) all sat above human numbers. With most of the remaining headroom concentrated on the linguistically idiosyncratic Corpus of Linguistic Acceptability (CoLA) and the small Recognizing Textual Entailment (RTE) corpus, GLUE was no longer informative. [1]
The SuperGLUE team set out to design a successor that would be substantially harder, would emphasize reasoning over surface pattern matching, and would still be solvable by a college-educated English speaker. They explicitly reused the GLUE infrastructure (submission server, evaluation toolkit, public leaderboard) but replaced the task suite with a new set chosen for difficulty and headroom.
The authors solicited task proposals from the NLP community and considered roughly thirty candidate datasets. To be included, a task had to satisfy the following six criteria. [1]
| Criterion | Requirement |
|---|---|
| Substance | The task must test a system's ability to understand and reason about English text, not surface heuristics. |
| Difficulty | Top published systems must lag well behind a college-educated human baseline, leaving room for several years of progress. |
| Evaluability | Performance must be measurable with an automatic metric that correlates with human judgment. |
| Public data | The task must have publicly available training and development data, with private test labels held by the benchmark. |
| Format | The input and output must be expressible in a simple format that does not require task-specific architectures. |
| Licensing | The data must be available under terms that permit redistribution and research use. |
Many candidate tasks were rejected because they failed one of these criteria. Some had restrictive licenses, some had complex output formats (for example, structured prediction over dependency trees), and several were judged too easy because BERT-class models were already approaching human performance on them. The eight surviving tasks form the core of SuperGLUE.
SuperGLUE evaluates models on eight tasks drawn from existing public datasets. Each task tests a distinct aspect of language understanding, and most are framed as classification or short-answer problems so that a generic encoder followed by a simple output head can be applied uniformly across the suite. The macro-average of per-task scores produces the headline SuperGLUE number; a minimal sketch of this aggregation follows the table. [1] [5]
| Task | Type | Train / Dev / Test | Metric | Original Source |
|---|---|---|---|---|
| BoolQ | Yes/no question answering | 9,427 / 3,270 / 3,245 | Accuracy | Clark et al. 2019 |
| CB (CommitmentBank) | Three-way textual entailment | 250 / 57 / 250 | Accuracy / F1 | de Marneffe et al. 2019 |
| COPA | Causal reasoning, choice of alternatives | 400 / 100 / 500 | Accuracy | Roemmele et al. 2011 |
| MultiRC | Multi-sentence reading comprehension | 5,100 / 953 / 1,800 | F1a / Exact Match | Khashabi et al. 2018 |
| ReCoRD | Cloze-style reading comprehension | 100,730 / 10,000 / 10,000 | F1 / Exact Match | Zhang et al. 2018 |
| RTE | Two-way textual entailment | 2,500 / 278 / 300 | Accuracy | Combined RTE 1 to 5 |
| WiC | Word sense disambiguation in context | 6,000 / 638 / 1,400 | Accuracy | Pilehvar and Camacho-Collados 2019 |
| WSC | Pronoun coreference (Winograd schema) | 554 / 104 / 146 | Accuracy | Levesque et al. 2012 |
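Tasks that report two metrics (CB, MultiRC, ReCoRD) contribute the mean of those metrics to the macro-average, and the headline number is the unweighted mean across the eight tasks. A minimal sketch of this aggregation, using the BERT-large per-task scores from the baseline table later in this section:

```python
# Sketch of the SuperGLUE macro-average: tasks with two metrics
# contribute the mean of those metrics (scores here are BERT-large's).
task_scores = {
    "BoolQ":   [77.4],          # accuracy
    "CB":      [75.7, 83.6],    # accuracy, F1
    "COPA":    [70.6],          # accuracy
    "MultiRC": [70.0, 24.0],    # F1a, exact match
    "ReCoRD":  [72.0, 71.3],    # F1, exact match
    "RTE":     [71.6],          # accuracy
    "WiC":     [69.5],          # accuracy
    "WSC":     [64.3],          # accuracy
}

per_task = {t: sum(m) / len(m) for t, m in task_scores.items()}
superglue = sum(per_task.values()) / len(per_task)
print(round(superglue, 1))  # 69.0, BERT-large's published average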
BoolQ (Boolean Questions) is a yes-or-no question answering task. Each example consists of a short Wikipedia passage and a question whose answer is either yes or no. The questions are naturally occurring queries submitted by users to the Google search engine, which makes them less stilted than the synthetic questions found in many earlier QA datasets. The task was published by Christopher Clark and colleagues from the University of Washington and Google in 2019, and the public split contains 15,942 examples in total. Because the questions look superficially simple but often hinge on a single subtle fact in the passage, BoolQ remained a stable test of careful reading throughout the SuperGLUE era. [1]
CB is a corpus of short discourses, each containing an embedded clause whose truth the speaker is committed to in varying degrees. SuperGLUE reformulates the data as three-way natural language inference: given a premise containing the embedded clause and a hypothesis that extracts the clause, the model must predict entailment, contradiction, or neutral, and only examples on which at least 80 percent of annotators agreed on the label are kept. The texts come from the Wall Street Journal, fiction in the British National Corpus, and Switchboard transcripts. CB is unusually small (only 250 training examples) and unusually imbalanced toward entailment, so it doubles as a stress test for sample-efficient and robust learning.
COPA (Choice of Plausible Alternatives) was originally introduced by Melissa Roemmele, Cosmin Bejan, and Andrew Gordon at the University of Southern California in 2011. Each example provides a premise sentence and asks the model to choose between two alternative sentences, identifying which is the more plausible cause or effect of the premise. COPA is hand-constructed, small, and focused on commonsense causal reasoning of the kind that contemporary distributional models rarely capture. The premise might read "The man broke his toe." with two candidate causes such as "He got a hole in his sock." and "He dropped a hammer on his foot." Models must select the second.
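One common zero-shot approach to COPA-style choices (not the SuperGLUE baseline, which fine-tunes a classifier) is to score each alternative with a causal language model and pick the likelier continuation. A sketch using the Hugging Face transformers library and GPT-2:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")

def total_log_prob(text: str) -> float:
    """Sum of token log-probabilities under the language model."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = lm(ids, labels=ids)   # loss = mean NLL over predicted tokens
    return -out.loss.item() * (ids.shape[1] - 1)

premise = "The man broke his toe because"
alternatives = ["he got a hole in his sock.",
                "he dropped a hammer on his foot."]
scores = [total_log_prob(f"{premise} {alt}") for alt in alternatives]
print(alternatives[scores.index(max(scores))])
```

Comparing total log-probabilities penalizes longer alternatives, so length normalization (dividing by token count) is a common refinement.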
MultiRC (Multi-Sentence Reading Comprehension), released by Daniel Khashabi and colleagues in 2018, asks the model to read a paragraph and answer multiple-choice questions where any subset of the candidate answers may be correct. Solving an example often requires combining facts from several sentences in the paragraph. SuperGLUE evaluates with two metrics: F1 over the answer set (F1a) and exact match (EM) requiring all and only the correct answers to be selected. The combination penalizes models that hedge by selecting many answers and rewards genuine multi-hop comprehension.
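A simplified sketch of the two MultiRC metrics, treating each question's prediction as a set of selected option indices (an approximation of the official scorer, which computes F1 over all answer options):

```python
def multirc_metrics(preds, golds):
    """preds, golds: one set of correct-option indices per question."""
    tp = fp = fn = exact = 0
    for p, g in zip(preds, golds):
        tp += len(p & g)
        fp += len(p - g)
        fn += len(g - p)
        exact += int(p == g)        # all and only the correct answers
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1a = (2 * precision * recall / (precision + recall)
           if precision + recall else 0.0)
    return f1a, exact / len(golds)  # (F1a, exact match)

print(multirc_metrics([{0, 2}, {1}], [{0, 2}, {1, 3}]))  # (0.857..., 0.5)
```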
ReCoRD (Reading Comprehension with Commonsense Reasoning) is by far the largest task in SuperGLUE, with over 100,000 training examples drawn from CNN and Daily Mail news articles. Each example presents an article paired with a cloze-style query in which one named entity has been masked out. The model must select the masked entity from a list of candidate entities mentioned in the article. The dataset was constructed by Sheng Zhang and colleagues at Johns Hopkins University and Microsoft in 2018. Because the cloze question rarely makes sense without external knowledge of typical event sequences, ReCoRD specifically targets the integration of reading comprehension with commonsense reasoning.
RTE (Recognizing Textual Entailment) is a binary entailment classification task assembled from the data of the RTE-1, RTE-2, RTE-3, and RTE-5 challenges held between 2005 and 2009. Given a premise sentence and a hypothesis, the model decides whether the premise entails the hypothesis. RTE survived from the original GLUE suite precisely because models still struggled with it, and SuperGLUE retains it for the same reason: even after BERT, RTE held genuine headroom relative to other entailment tasks.
WiC (Word-in-Context), introduced by Mohammad Taher Pilehvar and Jose Camacho-Collados at NAACL 2019, frames lexical semantics as a binary classification problem. Each example provides a target word (a noun or verb) and two sentences in which that word appears. The model must determine whether the word carries the same sense in both sentences. WiC is an explicit probe for whether a contextual encoder produces representations that change with usage, which static word vectors such as word2vec and GloVe famously cannot.
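The contrast WiC targets is easy to demonstrate directly. The sketch below (an illustration, not the official WiC baseline) pulls contextual vectors for a polysemous word out of a BERT encoder and compares them; the 0.8 threshold and the choice of bert-base-uncased are arbitrary:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def word_vector(sentence: str, word: str) -> torch.Tensor:
    """Mean hidden state over the subword tokens that cover `word`."""
    enc = tokenizer(sentence, return_tensors="pt", return_offsets_mapping=True)
    offsets = enc.pop("offset_mapping")[0].tolist()
    start = sentence.lower().index(word.lower())
    end = start + len(word)
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]
    idx = [i for i, (s, e) in enumerate(offsets) if s < end and e > start]
    return hidden[idx].mean(dim=0)

v1 = word_vector("He sat on the bank of the river.", "bank")
v2 = word_vector("She deposited money at the bank.", "bank")
similarity = torch.cosine_similarity(v1, v2, dim=0).item()
print(similarity > 0.8)  # illustrative threshold for "same sense"
```

A static embedding would return the same vector for "bank" in both sentences, which is exactly the failure mode WiC is built to expose.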
WSC (Winograd Schema Challenge) is a coreference task originally proposed by Hector Levesque, Ernest Davis, and Leora Morgenstern in 2012 as a more robust replacement for the Turing test. Each example is a single sentence containing an ambiguous pronoun, and the model must identify the noun phrase to which the pronoun refers. The defining feature of a Winograd schema is that flipping a single word in the sentence flips the correct antecedent, which forces the model to use commonsense world knowledge rather than syntactic or distributional shortcuts. SuperGLUE recasts the original 273-item WSC into a binary classification format with additional training data. WSC was the most resistant task in early SuperGLUE evaluation, with BERT scoring only 64.3 percent against a perfect human score of 100. [6]
In addition to the eight scored tasks, SuperGLUE includes two diagnostic datasets that are not part of the macro-average but are intended to probe specific behaviors. AX-b is a broad-coverage diagnostic dataset of expert-constructed natural language inference examples that test specific linguistic phenomena (lexical semantics, predicate-argument structure, logic, knowledge). It is scored with the Matthews correlation coefficient. AX-g, also called Winogender, measures gender bias in coreference resolution by presenting minimal pairs of sentences that differ only in the gender of a referent. It is scored on accuracy and on a gender parity metric that tracks whether models change their predictions across gender pairs. Reported human performance is 88 percent accuracy and 0.77 MCC on AX-b, and 99.7 percent accuracy with 0.99 parity on AX-g.
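Both diagnostic metrics are simple to compute. A short sketch, using scikit-learn for MCC and reading the AX-g parity metric as the fraction of gender-minimal pairs that receive identical predictions (toy values throughout):

```python
from sklearn.metrics import matthews_corrcoef

# AX-b: binary entailment predictions scored with the Matthews
# correlation coefficient (1 = entailment, 0 = not entailment).
gold = [1, 0, 1, 1, 0, 0, 1, 0]
pred = [1, 0, 1, 0, 0, 1, 1, 0]
print(matthews_corrcoef(gold, pred))  # 0.5

# AX-g: parity as the fraction of gender-minimal pairs on which the
# model's two predictions agree (this reading is an assumption).
pair_preds = [(1, 1), (0, 0), (1, 0), (1, 1)]
parity = sum(a == b for a, b in pair_preds) / len(pair_preds)
print(parity)  # 0.75
```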
The SuperGLUE paper reported four model baselines at release in May 2019. A simple multi-task BiLSTM averaged 64.4. CBOW (continuous bag of words) averaged a much weaker 44.5. BERT-large averaged 69.0, roughly 25 points above the CBOW baseline. BERT++ (BERT-large fine-tuned with intermediate task pretraining: MultiNLI for the entailment tasks and SWAG for COPA) averaged 71.5. The human baseline was 89.8. Critically, BERT++ left a gap of more than 18 points relative to humans, exactly the kind of headroom that GLUE had lost. [1]
| System | SuperGLUE Average | BoolQ | CB Acc/F1 | COPA | MultiRC F1/EM | ReCoRD F1/EM | RTE | WiC | WSC |
|---|---|---|---|---|---|---|---|---|---|
| Most Frequent Class | 47.1 | 62.3 | 50.0/21.7 | 50.0 | 61.1/0.3 | 33.4/32.5 | 50.3 | 50.0 | 65.1 |
| CBOW | 44.5 | 62.1 | 49.0/71.2 | 51.6 | 0.0/0.4 | 14.0/13.6 | 49.7 | 53.0 | 65.1 |
| BiLSTM + Attn | 64.4 | 75.7 | 76.6/85.4 | 70.6 | 70.0/24.1 | 70.5/69.8 | 71.0 | 65.6 | 64.3 |
| BERT-large | 69.0 | 77.4 | 75.7/83.6 | 70.6 | 70.0/24.0 | 72.0/71.3 | 71.6 | 69.5 | 64.3 |
| BERT++ | 71.5 | 79.0 | 84.7/90.4 | 73.8 | 70.0/24.1 | 72.0/71.3 | 79.0 | 69.5 | 64.3 |
| Human | 89.8 | 89.0 | 95.8/98.9 | 100.0 | 81.8/51.9 | 91.7/91.3 | 93.6 | 80.0 | 100.0 |
The table makes the design intent visible. WSC and COPA showed the largest gaps, with BERT++ trailing humans by roughly 36 and 26 points respectively. WiC, with a gap of about 10 points, was smaller but stubborn for models descended from static embeddings. And while BoolQ, CB, and RTE sat closer to the human numbers, they still left meaningful headroom. The benchmark was widely judged to be exactly as challenging as advertised. [1]
The top of the SuperGLUE leaderboard moved with extraordinary speed in 2019 and 2020, and effectively flattened by 2021.
| Period | System | SuperGLUE Score | Notes |
|---|---|---|---|
| May 2019 | BERT-large baseline | 69.0 | Reported in the SuperGLUE paper. |
| May 2019 | BERT++ baseline | 71.5 | Intermediate task pretraining (MultiNLI, SWAG). |
| Late 2019 | RoBERTa | ~84.6 | Improved BERT pretraining recipe by Liu et al. at Facebook. |
| October 2019 | T5-11B (single model) | 89.3 | Text-to-text transfer transformer by Raffel et al. at Google. |
| 2020 | GPT-3 175B (few-shot) | ~71.8 | Few-shot prompting without task-specific fine-tuning. |
| January 2021 | DeBERTa ensemble | 90.3 | First model to surpass the human baseline. |
| January 2021 | T5 + Meena ensemble | 90.2 | Google submission, also above human. |
| 2021 to present | Foundation models | >90 | Benchmark treated as saturated; few formal submissions. |
RoBERTa, released by Yinhan Liu and colleagues at Facebook AI in mid-2019, was a careful re-tuning of BERT pretraining that demonstrated how much room had been left on the table by the original recipe. RoBERTa pushed SuperGLUE to the mid-80s and made it clear that the benchmark was beatable by patient engineering rather than fundamental architectural change. [1]
T5, released in October 2019 by Colin Raffel and colleagues at Google, framed every NLP task (classification, regression, generation) as a text-to-text problem and trained a single 11-billion-parameter encoder-decoder transformer on the C4 web corpus. T5-11B reported a SuperGLUE score of 89.3, within striking distance of the human baseline. T5 also confirmed the broader scaling story: bigger transformers, trained longer on more data, won decisively. [7]
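The text-to-text recipe amounts to serializing every task into an input string and a target string. A sketch for BoolQ in roughly T5's style (the exact field order and label strings of T5's real preprocessing may differ):

```python
def boolq_to_text(example: dict) -> tuple[str, str]:
    """Cast a BoolQ example to a (source, target) text pair, T5-style."""
    source = (f"boolq question: {example['question']} "
              f"passage: {example['passage']}")
    target = "True" if example["label"] == 1 else "False"
    return source, target

src, tgt = boolq_to_text({
    "question": "is the sky blue on a clear day",
    "passage": "On a clear day the sky appears blue because air "
               "scatters short wavelengths of sunlight more than long ones.",
    "label": 1,
})
print(src, "->", tgt)
```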
DeBERTa (Decoding-enhanced BERT with disentangled attention), developed by Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen at Microsoft Research, was the first model to cross the human baseline. The single 1.5-billion-parameter model scored 89.9 (just above the 89.8 human number) and the ensemble reached 90.3. The DeBERTa team posted these results to the leaderboard on January 6, 2021. The same day, Google's submission of an ensemble combining T5 and the Meena dialogue model reached 90.2. The two events were treated as a paired milestone: SuperGLUE had a human-surpassing score within twenty months of release. [3] [4]
GPT-3, the 175-billion-parameter autoregressive model released by OpenAI in mid-2020, took a different path. Rather than fine-tune on each SuperGLUE task, the GPT-3 team prompted the model with 32 examples per task in a few-shot setting. Average SuperGLUE under this protocol was 71.8, comparable to the original BERT++ baseline despite no parameter updates. The result was the strongest demonstration to that point that large language models could perform multi-task NLU from in-context examples alone, foreshadowing the prompt-based evaluation paradigm that would soon dominate the field. [8]
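In practice the few-shot protocol is just prompt construction: k labeled examples are concatenated ahead of the unlabeled query and the model's next tokens are read off as the answer, with no parameter updates. A sketch for BoolQ with an illustrative template (GPT-3's actual per-task templates differ):

```python
def few_shot_prompt(train_examples: list[dict], query: dict, k: int = 32) -> str:
    """Concatenate k labeled examples, then the unlabeled query."""
    blocks = [
        f"{ex['passage']}\nquestion: {ex['question']}\n"
        f"answer: {'yes' if ex['label'] else 'no'}"
        for ex in train_examples[:k]
    ]
    blocks.append(f"{query['passage']}\nquestion: {query['question']}\nanswer:")
    return "\n\n".join(blocks)
```

The model's completion ("yes" or "no") is then compared against the gold label to score the task.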
From 2022 onward, the SuperGLUE leaderboard quietly stopped moving. Frontier foundation models from OpenAI, Anthropic, Google, Meta, and others routinely score in the high 80s and low 90s without targeted optimization, and most labs have shifted formal evaluation to harder benchmarks. The official leaderboard has received essentially no significant submissions since late 2022, and SuperGLUE is generally treated as a solved benchmark of historical interest. [4]
During its 2019 to 2021 peak, SuperGLUE served as the primary public scoreboard for large-model research and shaped several enduring lessons.
Validating scale. The fastest path up the SuperGLUE leaderboard turned out to be more parameters and more pretraining data, not new architectures. T5-11B at 89.3, DeBERTa at 90.3, and the GPT-3 few-shot result at 71.8 all reinforced the hypothesis that emerged from BERT and RoBERTa: scale dominates. This finding fed directly into the scaling-laws research of 2020 and the subsequent push toward large language models. [7] [8]
Standardizing transfer learning. Because SuperGLUE tasks are heterogeneous (yes/no QA, three-way entailment, multi-choice causal reasoning, cloze, coreference, lexical semantics), a model that wins SuperGLUE must transfer well from a shared pretraining objective to many downstream formats. The benchmark normalized the pretrain-once, fine-tune-many-times pattern that defined the BERT era, and motivated cleaner output heads, multi-task learning approaches, and intermediate-task training recipes. [1]
Establishing few-shot prompting as a measurable paradigm. The GPT-3 SuperGLUE numbers in 2020 were the first widely cited demonstration that an autoregressive language model could perform on a multi-task NLU benchmark from in-context examples alone. Even though the score was well below the fine-tuned state of the art, the result reframed evaluation around prompted models and seeded the prompting and instruction-tuning research that followed. [8]
Driving methodological scrutiny. The rapid saturation of SuperGLUE prompted serious concern about benchmark contamination, dataset artifacts, and the possibility that models were exploiting spurious patterns rather than understanding language. Subsequent work showed measurable annotation artifacts in NLI datasets and motivated the design of dynamic and adversarial benchmarks (Adversarial NLI, DynaBench) as well as broader, more knowledge-intensive evaluations like MMLU and BIG-Bench.
SuperGLUE attracted several substantive critiques even during its active period.
The small size of CB (250 training examples) and WSC (554 training examples) made evaluation noisy: models could swing several points on individual tasks depending on the random seed. Papers reporting new state-of-the-art numbers often had to qualify gains that fell within this noise floor.
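The scale of that noise follows from binomial sampling alone. A back-of-the-envelope check for a 250-example test set (CB's size) at around 85 percent accuracy:

```python
import math

p, n = 0.85, 250                  # accuracy, CB test-set size
se = math.sqrt(p * (1 - p) / n)   # standard error of the accuracy estimate
print(round(100 * se, 1))         # ~2.3 points
print(round(100 * 1.96 * se, 1))  # ~4.4 points: half-width of a 95% interval
```

Swings of several points between runs are therefore consistent with sampling noise alone, before any training variance is counted.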
Several tasks contained known annotation artifacts that allowed shortcut learning. RTE in particular had been studied extensively for hypothesis-only baselines that perform well above chance, indicating that lexical and stylistic cues leaked information about the label. Some researchers argued the benchmark therefore overstated true language understanding.
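The artifact finding is straightforward to probe. A sketch of a hypothesis-only baseline on RTE, assuming the SuperGLUE data is available through the Hugging Face datasets hub; dev accuracy meaningfully above the 50 percent chance level indicates label leakage:

```python
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

rte = load_dataset("super_glue", "rte")
train, dev = rte["train"], rte["validation"]

# Fit on hypotheses alone; the premise never enters the model.
probe = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
probe.fit(train["hypothesis"], train["label"])
print(probe.score(dev["hypothesis"], dev["label"]))
```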
The English-only design left questions about whether the same trends held in other languages. Multilingual successors such as XTREME, XGLUE, and SuperGLUE-style benchmarks for Chinese, French, Russian, and other languages were proposed in the years that followed.
Finally, the speed of saturation underlined a deeper structural problem: any static benchmark risks being optimized away by ever-larger models trained on ever-larger corpora that may include some of the test data itself. SuperGLUE saturation drove the community toward dynamic, adversarial, and held-out evaluation formats. [4]
| Benchmark | Year | Tasks | Format | Human Baseline | Status |
|---|---|---|---|---|---|
| GLUE | 2018 | 9 sentence-level tasks | Single number macro-average | 87.1 | Human baseline surpassed by BERT-derived models in mid-2019; retired as a measure of progress. |
| SuperGLUE | 2019 | 8 reasoning and reading tasks plus 2 diagnostics | Single number macro-average | 89.8 | Surpassed by DeBERTa and T5 in January 2021. Effectively dormant. |
| BIG-Bench | 2021 | 200+ collaborative tasks across diverse domains | Per-task and aggregate | Variable | Active. Designed for breadth and probing emergent abilities. |
| MMLU | 2020 | 57 multiple-choice subjects (law, medicine, math, ethics, more) | Macro accuracy | 89.8 (expert) | Largely saturated by frontier models in 2024. |
| HELM | 2022 | Holistic evaluation across many scenarios and metrics | Multi-dimensional dashboard | n/a | Active framework rather than fixed leaderboard. |
| MMLU-Pro | 2024 | Harder, augmented MMLU with 10 answer choices | Macro accuracy | n/a | Active. Designed to restore headroom over MMLU. |
| GPQA | 2023 | Graduate-level science questions | Macro accuracy | ~65 (PhDs in domain) | Active. Heavy reasoning and adversarial. |
| ARC-AGI | 2019 to present | Visual abstract reasoning grids | Per-task | ~85 (humans) | Active. Probes sample-efficient generalization. |
SuperGLUE shares with GLUE a concrete philosophy that proved enduring: a single number aggregating diverse tasks, a private test server, public leaderboards, and easy submission. Its successors largely preserve those design choices but expand the range of capabilities tested, often by an order of magnitude in number of tasks or in domain breadth. The conceptual lineage from GLUE to SuperGLUE to MMLU and beyond is among the clearest threads in the modern history of NLP evaluation.
The SuperGLUE paper had outsized reach for a benchmark publication. By the mid-2020s it had accumulated several thousand citations and was a standard reference in nearly every paper that introduced a new pretrained transformer between 2019 and 2022. The benchmark was hosted by NYU with funding from DeepMind and GPU support from NVIDIA, and the leaderboard was treated as a primary venue for state-of-the-art claims by industry labs and academic groups alike during that window. [1]
The benchmark also influenced the design of later evaluations both technically and politically. Technical influences include the standard practice of bundling diagnostic datasets alongside scored tasks, the reliance on macro-averaged scores over heterogeneous metrics, and the convention of holding test labels server-side. Political and community influences include the expectation that major model releases will report on a shared benchmark and that the benchmark itself will be open and well documented.
The SuperGLUE leaderboard remains live at super.gluebenchmark.com but receives few new submissions. As of 2024 and 2025, frontier large language models (GPT-4 and successors, Claude 3 and successors, Gemini, Llama 3 and successors, Qwen, DeepSeek) all score well above the human baseline on most SuperGLUE tasks when evaluated in zero-shot or few-shot settings, but their developers rarely publish formal SuperGLUE numbers, preferring richer evaluations such as MMLU-Pro, GPQA, BIG-Bench Hard, AGIEval, and a growing set of agentic and tool-use benchmarks. Within research practice, SuperGLUE is now most commonly referenced for historical context, for ablation comparisons against BERT-era systems, or as a sanity check that a new training recipe has not regressed on classical NLU. [4]