The General Language Understanding Evaluation (GLUE) benchmark is a collection of nine natural language understanding (NLU) tasks designed to evaluate and compare the performance of language models across a diverse set of linguistic challenges. Introduced in 2018 by Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman, GLUE became one of the most widely used benchmarks in NLP research and played a central role in accelerating the development of pretrained language models such as BERT, RoBERTa, and ALBERT. The benchmark was published as a conference paper at ICLR 2019 and is hosted at gluebenchmark.com.
GLUE's design reflects a simple but powerful idea: a general-purpose language understanding system should perform well not on a single task but across a wide range of tasks that test different aspects of linguistic knowledge. By aggregating nine tasks into a single composite score, GLUE provided researchers with a standardized way to measure progress and compare models, which helped drive rapid improvements in the field between 2018 and 2020.
Before GLUE, NLP researchers typically evaluated models on individual tasks using separate leaderboards and datasets. This fragmented approach made it difficult to assess whether a model had genuinely improved in general language understanding or had simply been overfit to one particular task. The creators of GLUE argued that for language technology to achieve practical utility and scientific rigor, models must demonstrate the ability to process language in a way that generalizes across tasks rather than being tailored to a single dataset.
The GLUE benchmark drew inspiration from earlier efforts like SentEval and DecaNLP, but it was distinctive in several ways. First, it included a carefully curated set of tasks spanning different problem types: single-sentence classification, sentence-pair similarity, and natural language inference. Second, it provided a public leaderboard and standardized evaluation server, ensuring that all submissions were scored against held-out test sets under identical conditions. Third, it included a diagnostic dataset of 1,100 hand-crafted examples organized into four coarse linguistic categories (Lexical Semantics, Predicate-Argument Structure, Logic, and Knowledge) to enable fine-grained error analysis beyond aggregate scores.
The benchmark was also designed to incentivize multi-task learning and transfer learning. Several of the nine tasks have very limited training data, making it difficult for a model to learn those tasks in isolation. The hope was that models sharing representations across tasks would outperform models trained independently on each task. In the original paper, however, the authors found that existing multi-task and transfer learning approaches at the time provided "no substantial improvements" over per-task training, highlighting that general NLU was still an open challenge.
GLUE consists of nine tasks that fall into three broad categories: single-sentence tasks, similarity and paraphrase tasks, and inference tasks. Each task comes with a fixed training set, a development set for local evaluation, and a held-out test set scored by the official GLUE evaluation server.
CoLA (Corpus of Linguistic Acceptability): Created by Warstadt, Singh, and Bowman (2018), CoLA consists of 10,657 English sentences drawn from linguistics textbooks and journal articles. Each sentence is labeled as grammatically acceptable or unacceptable. The task tests whether a model can distinguish well-formed English sentences from ill-formed ones, requiring sensitivity to subtle syntactic phenomena such as island constraints, argument structure, and binding theory. CoLA uses Matthews correlation coefficient (MCC) as its evaluation metric because the dataset is imbalanced, and accuracy alone would be misleading.
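To see why MCC is preferred over accuracy on an imbalanced dataset like CoLA, consider a minimal sketch (the `matthews_corrcoef` helper and the toy labels below are illustrative, not part of the official GLUE scoring code):

```python
from math import sqrt

def matthews_corrcoef(y_true, y_pred):
    """Matthews correlation coefficient for binary 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is 0 when any marginal count is 0 (no correlation to measure).
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

# Imbalanced toy set: 8 acceptable (1) sentences, 2 unacceptable (0).
gold = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
always_acceptable = [1] * 10  # majority-class guesser
accuracy = sum(t == p for t, p in zip(gold, always_acceptable)) / len(gold)
print(accuracy)                                     # 0.8 — looks strong
print(matthews_corrcoef(gold, always_acceptable))   # 0.0 — no real skill
```

A degenerate classifier scores 80% accuracy here but an MCC of exactly zero, which is the behavior that motivated CoLA's choice of metric.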
SST-2 (Stanford Sentiment Treebank): Based on the sentiment annotations from the Stanford Sentiment Treebank (Socher et al., 2013), SST-2 is a binary sentiment analysis task. Each example is a sentence extracted from a movie review, and the model must classify it as expressing positive or negative sentiment. With approximately 67,000 training examples, SST-2 is one of the larger GLUE tasks. It uses accuracy as its evaluation metric.
MRPC (Microsoft Research Paraphrase Corpus): Introduced by Dolan and Brockett (2005), MRPC contains 5,800 sentence pairs drawn from online news articles. Each pair is annotated as semantically equivalent (paraphrase) or not. The dataset has approximately 3,700 training examples and is evaluated using both accuracy and F1 score.
STS-B (Semantic Textual Similarity Benchmark): STS-B, from the SemEval STS shared tasks (Cer et al., 2017), is a collection of sentence pairs annotated with similarity scores on a continuous scale from 0 to 5. Unlike the other GLUE tasks, STS-B is a regression task rather than a classification task. It contains approximately 7,000 training examples and is evaluated using Pearson and Spearman correlation coefficients.
QQP (Quora Question Pairs): This dataset, sourced from the community question-answering website Quora, consists of question pairs labeled as semantically equivalent (duplicates) or not. QQP is one of the largest GLUE tasks, with approximately 364,000 training examples. It is evaluated using both accuracy and F1 score.
MNLI (Multi-Genre Natural Language Inference): MNLI (Williams et al., 2018) is the largest and most diverse task in GLUE. It is a crowdsourced collection of 393,000 sentence pairs drawn from ten distinct genres of written and spoken English, including fiction, government documents, and telephone conversations. Given a premise sentence and a hypothesis sentence, the model must classify the relationship as entailment, contradiction, or neutral. MNLI is the only three-class classification task in GLUE. Performance is reported separately on a "matched" test set (same genres as training) and a "mismatched" test set (different genres), both evaluated with accuracy.
QNLI (Question Natural Language Inference): QNLI is derived from the Stanford Question Answering Dataset (SQuAD; Rajpurkar et al., 2016). Each example pairs a question with a sentence from a Wikipedia paragraph, and the model must determine whether the sentence contains the answer to the question. The original extractive QA task was converted into a sentence-pair classification format for inclusion in GLUE. QNLI has approximately 105,000 training examples and uses accuracy as its metric.
RTE (Recognizing Textual Entailment): RTE combines data from a series of annual textual entailment challenges (RTE1, RTE2, RTE3, and RTE5). The task is binary classification: given a premise and a hypothesis, determine whether the premise entails the hypothesis. With only about 2,500 training examples, RTE is one of the smaller GLUE tasks, making it a good test of a model's ability to transfer knowledge from other tasks or pretraining. It uses accuracy as its metric.
WNLI (Winograd Natural Language Inference): WNLI is derived from the Winograd Schema Challenge (Levesque et al., 2012). The original challenge presents sentences containing ambiguous pronouns that require world knowledge and commonsense reasoning to resolve. For GLUE, the examples were reformulated as sentence pairs: the original sentence and a version with the pronoun replaced by one of the candidate referents. The model must determine whether the sentence with the substituted referent is entailed by the original. WNLI is the smallest task in GLUE, with only 634 training examples, and uses accuracy as its metric. Many early systems, including BERT, effectively skipped WNLI due to training set construction issues and submitted majority-class predictions.
| Task | Type | Training Examples | Test Examples | Metric | Source |
|---|---|---|---|---|---|
| CoLA | Single sentence (acceptability) | 8,551 | 1,063 | Matthews corr. | Warstadt et al., 2018 |
| SST-2 | Single sentence (sentiment) | 67,349 | 1,821 | Accuracy | Socher et al., 2013 |
| MRPC | Sentence pair (paraphrase) | 3,668 | 1,725 | Accuracy / F1 | Dolan & Brockett, 2005 |
| STS-B | Sentence pair (similarity) | 5,749 | 1,379 | Pearson / Spearman corr. | Cer et al., 2017 |
| QQP | Sentence pair (paraphrase) | 363,849 | 390,965 | Accuracy / F1 | Quora |
| MNLI | Sentence pair (NLI, 3-class) | 392,702 | 9,815 + 9,832 | Accuracy (matched/mismatched) | Williams et al., 2018 |
| QNLI | Sentence pair (QA/NLI) | 104,743 | 5,463 | Accuracy | Rajpurkar et al., 2016 |
| RTE | Sentence pair (NLI) | 2,490 | 3,000 | Accuracy | Dagan et al., 2005, and later RTE challenges |
| WNLI | Sentence pair (NLI/coreference) | 634 | 146 | Accuracy | Levesque et al., 2012 |
The overall GLUE score is computed as a macro-average across all nine tasks. For tasks evaluated with multiple metrics (MRPC, STS-B, and QQP), an unweighted average of those metrics is first computed to produce a single per-task score. The nine per-task scores are then averaged to yield the final GLUE score. This design gives equal weight to each task regardless of dataset size, which means that small, difficult tasks like CoLA, RTE, and WNLI carry the same influence as large tasks like MNLI and QQP.
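The aggregation rule can be sketched in a few lines. The per-task numbers below are illustrative (several are borrowed from BERT-Large's reported scores later in this article, others are made up), and the official scoring has details this sketch glosses over, such as how the matched and mismatched MNLI sets enter the average:

```python
# Tasks with two metrics (MRPC, STS-B, QQP) are first collapsed
# to the unweighted mean of those metrics.
task_metrics = {
    "CoLA":  [60.5],          # Matthews corr.
    "SST-2": [94.9],          # accuracy
    "MRPC":  [85.4, 89.3],    # accuracy, F1
    "STS-B": [87.6, 86.5],    # Pearson, Spearman
    "QQP":   [89.3, 72.1],    # accuracy, F1
    "MNLI":  [86.7],          # matched accuracy (mismatched handled analogously)
    "QNLI":  [92.7],
    "RTE":   [70.1],
    "WNLI":  [65.1],
}

per_task = {t: sum(m) / len(m) for t, m in task_metrics.items()}
glue_score = sum(per_task.values()) / len(per_task)  # unweighted macro-average
print(round(glue_score, 1))
```

Because the macro-average weights all nine per-task scores equally, a 5-point gain on 634-example WNLI moves the overall score exactly as much as a 5-point gain on 393,000-example MNLI.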
All models must submit predictions on the held-out test sets to the GLUE evaluation server. Researchers cannot evaluate on the test data locally; this was done to prevent overfitting to test sets. The evaluation server produces per-task scores and the overall average, which are displayed on a public leaderboard.
In the original GLUE paper, the authors evaluated several baseline systems. The strongest baseline was a multi-task model with attention and ELMo embeddings, which achieved an overall GLUE score of 70.0. Other baselines included a simple BiLSTM model with GloVe embeddings and various multi-task training configurations. These baselines performed reasonably on tasks with large training sets (like SST-2 and MNLI) but struggled on smaller tasks (like CoLA and RTE), underscoring the need for better transfer learning approaches.
Human performance on GLUE was estimated by Nangia and Bowman (2019) in their paper "Human vs. Muppet: A Conservative Estimate of Human Performance on the GLUE Benchmark." They recruited crowd workers to annotate samples from each test set, producing the following estimates:
| Task | Human Performance | Metric |
|---|---|---|
| CoLA | 66.4 | Matthews corr. |
| SST-2 | 97.8 | Accuracy |
| MRPC | 86.3 | F1 |
| STS-B | 92.7 | Pearson corr. |
| QQP | 59.5 | F1 |
| MNLI-m | 92.0 | Accuracy |
| MNLI-mm | 92.8 | Accuracy |
| QNLI | 91.2 | Accuracy |
| RTE | 93.6 | Accuracy |
| WNLI | 95.9 | Accuracy |
| Overall | 87.1 | Average |
The relatively low human score on CoLA (66.4) reflects how difficult fine-grained grammaticality judgments are even for humans. The low F1 on QQP (59.5) is partly an artifact of the metric and the difficulty of the negative class. These human baseline numbers were described as "conservative" estimates because they were computed over limited samples and by crowd workers rather than trained linguists.
The GLUE benchmark's greatest impact was providing a unified stage on which the rapid progress of pretrained language models played out in 2018 and 2019. The timeline of improvement was remarkable:
Pre-BERT era (early 2018): When GLUE was released, the best systems scored around 63-70 on the benchmark. The strongest baseline in the original paper scored 70.0 using ELMo embeddings with multi-task training. OpenAI's GPT (Radford et al., 2018), one of the first Transformer-based pretrained models to be evaluated on GLUE, scored 72.8.
BERT (October 2018): BERT (Devlin et al., 2019) represented a dramatic leap. BERT-Large achieved an overall GLUE score of 80.5, a 7.7-point absolute improvement over GPT. BERT's bidirectional pretraining approach, using masked language modeling and next sentence prediction, proved highly effective across all GLUE tasks. The BERT-Large per-task scores included 60.5 on CoLA, 94.9 on SST-2, 89.3 (F1) on MRPC, 86.5 (Spearman) on STS-B, 72.1 (F1) on QQP, 86.7 on MNLI-m, 92.7 on QNLI, and 70.1 on RTE.
XLNet (June 2019): XLNet (Yang et al., 2019) introduced permutation-based language modeling and achieved a GLUE score of 88.4, surpassing the estimated human baseline of 87.1 for the first time. This marked a turning point: within roughly one year of the benchmark's creation, AI systems had caught up to human performance on the aggregate metric.
RoBERTa (July 2019): RoBERTa (Liu et al., 2019) showed that BERT's training procedure had been significantly under-optimized. By training longer, on more data, with larger batches, and removing the next sentence prediction objective, RoBERTa achieved 88.5 on GLUE, matching and slightly exceeding XLNet.
ALBERT (September 2019): ALBERT (Lan et al., 2019) introduced parameter-reduction techniques (factorized embedding parameters and cross-layer parameter sharing) while maintaining strong performance. ALBERT achieved competitive GLUE scores while using far fewer parameters than BERT-Large.
T5 (October 2019): Google's T5 (Raffel et al., 2019) framed all NLP tasks as text-to-text problems and achieved strong GLUE results through its unified framework and extensive pretraining on the C4 corpus.
DeBERTa and beyond (2020-2021): Microsoft's DeBERTa (He et al., 2020) introduced disentangled attention mechanisms and achieved some of the highest GLUE scores reported, pushing well above 90 on the overall benchmark.
By mid-2019, the GLUE leaderboard had become saturated, with multiple models surpassing the human baseline. The rapid pace of improvement, from 70.0 to above 88 in roughly one year, was one of the most visible demonstrations of how pretraining and fine-tuning had transformed NLP.
| Date | Model | GLUE Score | Notable Achievement |
|---|---|---|---|
| Apr 2018 | ELMo + Multi-Task | 70.0 | Original baseline |
| Jun 2018 | GPT | 72.8 | First Transformer-based submission |
| Oct 2018 | BERT-Large | 80.5 | 7.7-point jump over GPT |
| Jun 2019 | XLNet | 88.4 | First to surpass human baseline (87.1) |
| Jul 2019 | RoBERTa | 88.5 | Optimized BERT training |
| Late 2020 | DeBERTa | 90+ | Disentangled attention |
The rapid saturation of the GLUE leaderboard prompted the creation of a harder successor benchmark. SuperGLUE was introduced in May 2019 by many of the same researchers who created GLUE (Wang, Pruksachatkun, Nangia, Singh, Michael, Hill, Levy, and Bowman), along with additional collaborators. The paper was titled "SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems."
By early July 2019, the state-of-the-art GLUE score (88.4, from XLNet) had surpassed the estimated human performance of 87.1 by 1.3 points. Models exceeded human performance on four of the nine tasks. While this did not mean that machines had achieved human-level language understanding (the remaining gaps on tasks like WNLI and the diagnostic dataset told a different story), it did mean that the GLUE benchmark could no longer effectively discriminate between models. A harder benchmark was needed to continue driving and measuring progress.
The SuperGLUE designers applied several principles when selecting tasks. They prioritized tasks that (1) posed genuine language understanding challenges, (2) were difficult enough that college-educated English speakers could solve them reliably but current NLP systems could not, (3) had publicly available data, (4) supported reliable automatic evaluation, and (5) represented diverse task formats beyond sentence-pair classification. Only two tasks from the original GLUE benchmark (RTE and a reformulated version of WNLI, renamed WSC) were retained.
BoolQ (Boolean Questions): Clark et al. (2019) created BoolQ, a yes/no question answering task. Each example consists of a short passage from Wikipedia and a naturally occurring yes/no question about that passage. The questions were collected from anonymous Google search queries, making them representative of real user information needs. BoolQ has 9,427 training examples and is evaluated with accuracy.
CB (CommitmentBank): De Marneffe et al. (2019) developed the CommitmentBank, a corpus of short texts each containing at least one embedded clause. The task is to determine the degree of commitment the author of the text has to the truth of the embedded clause, classifying it as entailment, contradiction, or neutral. CB is very small, with only 250 training examples, and is evaluated using accuracy and the average of per-class F1 scores to account for label imbalance.
COPA (Choice of Plausible Alternatives): Roemmele et al. (2011) introduced COPA, a causal reasoning task. The model is given a premise sentence and must choose which of two alternatives is either the cause or the effect of the premise. COPA tests commonsense causal reasoning and has only 400 training examples. It is evaluated with accuracy.
MultiRC (Multi-Sentence Reading Comprehension): Khashabi et al. (2018) created MultiRC, a reading comprehension task where each question may have multiple correct answers. Given a paragraph and a question, the model must label each candidate answer as true or false. Correctly answering each question requires drawing on information from multiple sentences in the passage. MultiRC has approximately 5,100 training examples and is evaluated using per-question F1 and exact match (EM) scores.
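A rough sketch of how MultiRC-style per-question scoring works, assuming the formulation above: each question carries a set of candidate answers labeled true (1) or false (0), F1 is computed over all answer options pooled across questions, and exact match credits a question only when every option is labeled correctly. The function and the toy data are illustrative, not the official SuperGLUE evaluation code:

```python
def multirc_scores(questions):
    """questions: list of (gold_labels, pred_labels) pairs, one per question.
    Returns (pooled F1 over answer options, exact-match fraction)."""
    tp = fp = fn = 0
    exact = 0
    for gold, pred in questions:
        tp += sum(g == 1 and p == 1 for g, p in zip(gold, pred))
        fp += sum(g == 0 and p == 1 for g, p in zip(gold, pred))
        fn += sum(g == 1 and p == 0 for g, p in zip(gold, pred))
        exact += gold == pred  # all options correct for this question
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return f1, exact / len(questions)

qs = [
    ([1, 0, 1], [1, 0, 1]),  # every option right -> counts toward EM
    ([1, 1, 0], [1, 0, 0]),  # one true answer missed -> no EM credit
]
f1, em = multirc_scores(qs)
print(round(f1, 3), em)  # 0.857 0.5
```

Note how a single missed answer option costs a question all of its exact-match credit while only slightly denting F1, which is why the two metrics are reported together.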
ReCoRD (Reading Comprehension with Commonsense Reasoning Dataset): Zhang et al. (2018) developed ReCoRD, a cloze-style reading comprehension task. Each example presents a news article and a query with a masked entity. The model must predict the masked entity from a list of candidate entities in the passage, requiring both reading comprehension and commonsense reasoning. ReCoRD is the largest SuperGLUE task, with approximately 101,000 training examples, and is evaluated with token-level F1 and exact match.
RTE (Recognizing Textual Entailment): The same RTE task from GLUE was carried over to SuperGLUE. It remained one of the more challenging tasks for models due to its small training set (2,490 examples) and the difficulty of textual entailment reasoning.
WiC (Word-in-Context): Pilehvar and Camacho-Collados (2019) created WiC, a word sense disambiguation task presented in a binary classification format. Given two sentences that each contain the same polysemous word, the model must determine whether the word is used with the same sense in both sentences. WiC has 5,428 training examples and is evaluated with accuracy.
WSC (Winograd Schema Challenge): This is a reformulation of the Winograd Schema Challenge (Levesque et al., 2012) as a coreference resolution task. The model is presented with a sentence containing a pronoun and must determine whether a given noun phrase is the correct referent of that pronoun. WSC has 554 training examples and is evaluated with accuracy. It replaces WNLI from the original GLUE benchmark with a cleaner formulation.
| Task | Type | Training Examples | Metric | Source |
|---|---|---|---|---|
| BoolQ | QA (yes/no) | 9,427 | Accuracy | Clark et al., 2019 |
| CB | NLI (3-class) | 250 | Accuracy / F1 | De Marneffe et al., 2019 |
| COPA | Causal reasoning | 400 | Accuracy | Roemmele et al., 2011 |
| MultiRC | Reading comprehension | ~5,100 | F1 / EM | Khashabi et al., 2018 |
| ReCoRD | Reading comprehension (cloze) | ~101,000 | F1 / EM | Zhang et al., 2018 |
| RTE | NLI (2-class) | 2,490 | Accuracy | Dagan et al., 2005, and later RTE challenges |
| WiC | Word sense disambiguation | ~5,428 | Accuracy | Pilehvar & Camacho-Collados, 2019 |
| WSC | Coreference resolution | 554 | Accuracy | Levesque et al., 2012 |
Unlike GLUE, SuperGLUE was designed from the start with comprehensive human baselines. Human annotators achieved the following performance estimates:
| Task | Human Performance | BERT Baseline | BERT++ Baseline | Gap (Human - BERT++) |
|---|---|---|---|---|
| BoolQ | 89.0 | 60.5 | 63.4 | 25.6 |
| CB | 95.8 | 75.9 | 83.6 | 12.2 |
| COPA | 100.0 | 63.4 | 71.2 | 28.8 |
| MultiRC | 81.8 | 63.5 | 66.2 | 15.6 |
| ReCoRD | 91.7 | 72.0 | 72.4 | 19.3 |
| RTE | 93.6 | 68.7 | 72.9 | 20.7 |
| WiC | 80.0 | 64.4 | 64.4 | 15.6 |
| WSC | 100.0 | 64.4 | 65.4 | 34.6 |
| Average | 89.8 | 69.0 | 71.5 | 18.3 |
The BERT++ baseline refers to BERT with additional intermediate task training (using MNLI as an auxiliary task for some subtasks). Even with this enhancement, there was a nearly 20-point gap between BERT++ (71.5) and human performance (89.8). Human performance was perfect (100%) on COPA and WSC, and the largest gap between machines and humans was on WSC, with a difference of about 35 points.
This massive headroom, compared to the less than 1-point gap on GLUE at the time, confirmed that SuperGLUE would remain a meaningful benchmark for a longer period.
Progress on SuperGLUE was somewhat slower than on GLUE, but models eventually closed the gap.
The time from SuperGLUE's release (May 2019) to the first human-performance-surpassing result (January 2021) was roughly 20 months, compared to about 14 months for GLUE.
In addition to the nine main tasks, the GLUE benchmark includes a diagnostic dataset consisting of 1,100 hand-crafted natural language inference examples, organized into four coarse-grained linguistic categories: Lexical Semantics, Predicate-Argument Structure, Logic, and Knowledge.
The diagnostic dataset is not used in computing the GLUE score. Instead, it serves as an analysis tool for researchers to conduct detailed error analysis and identify systematic weaknesses in their models. In the original GLUE paper, baseline models performed well on examples requiring strong lexical cues but struggled with logical structure and compositional reasoning. Performance on the diagnostic dataset fell far below average human performance, with some categories showing near-chance or below-chance results.
SuperGLUE includes its own version of the Broad Coverage Diagnostics, along with a Winogender diagnostic dataset for evaluating gender bias in coreference resolution systems.
Despite their influence, GLUE and SuperGLUE have faced several criticisms:
Annotation artifacts and dataset biases. Research has shown that several GLUE datasets contain annotation artifacts, which are unintended patterns that allow models to achieve high accuracy without genuine language understanding. For example, in NLI datasets, certain words in the hypothesis (such as negation words) are strongly correlated with the "contradiction" label, enabling models to exploit shallow heuristics rather than reasoning about the relationship between premise and hypothesis.
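The negation artifact can be illustrated with a toy hypothesis-only "classifier" that never reads the premise at all. The examples and word list below are invented for illustration, not drawn from any GLUE dataset:

```python
# Shallow heuristic: predict "contradiction" whenever the hypothesis
# contains a negation word, ignoring the premise entirely.
NEGATION_WORDS = {"not", "no", "never", "nobody", "nothing"}

def hypothesis_only(hypothesis):
    tokens = hypothesis.lower().split()
    return "contradiction" if NEGATION_WORDS & set(tokens) else "entailment"

# (premise, hypothesis, gold label) — toy examples.
examples = [
    ("A man is sleeping.", "The man is not awake.", "entailment"),
    ("A dog runs in the park.", "No animal is moving.", "contradiction"),
    ("She bought a car.", "She never purchased a vehicle.", "contradiction"),
    ("Kids play soccer.", "Children are playing a game.", "entailment"),
]

correct = sum(hypothesis_only(h) == label for _, h, label in examples)
print(correct / len(examples))  # 0.75 without ever reading a premise
```

The heuristic scores well above chance on this toy set, yet the first example shows exactly how it fails: "not awake" is a negation that expresses entailment, not contradiction. Artifact studies found that real NLI models exploit correlations of this kind.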
Task saturation and limited discriminative power. Both GLUE and SuperGLUE were saturated relatively quickly. Once multiple models surpass the human baseline, the benchmark loses its ability to distinguish between models or measure meaningful progress. The aggregate score can obscure important differences in per-task performance.
Narrow task coverage. GLUE covers only a subset of language understanding capabilities. It does not test generation, dialogue, summarization, multilingual understanding, code understanding, mathematical reasoning, or many other abilities that are important for general-purpose language systems. SuperGLUE added question answering and coreference resolution but remained focused on relatively short English text.
Reliance on crowdsourced human baselines. The human performance estimates were obtained from crowd workers using limited samples, making them approximate. The Nangia and Bowman paper explicitly described their estimates as "conservative." Some tasks (notably QQP with an F1 of 59.5) have human baselines that reflect the difficulty of the metric and labeling scheme rather than the underlying task difficulty.
Social bias concerns. The datasets used in GLUE and SuperGLUE inherit biases from their source corpora. Models trained and evaluated on these benchmarks may learn and perpetuate associations related to gender, ethnicity, and other sensitive attributes. The benchmark creators acknowledged that the data should not be used for non-research applications due to these concerns.
Adversarial vulnerability. Adversarial GLUE (AdvGLUE), introduced by Wang et al. (2021), demonstrated that models achieving high GLUE scores are often vulnerable to adversarial perturbations. By applying 14 textual adversarial attack methods to GLUE tasks, the authors showed that state-of-the-art models experienced performance drops of up to 55% on some tasks, suggesting that high GLUE scores do not necessarily indicate robust language understanding.
Aggregation concerns. Using a single average score across tasks has been criticized because it can mask significant weaknesses. A model might achieve a high overall score by excelling on large, easy tasks while performing poorly on small, difficult ones. The equal weighting of all tasks regardless of size or difficulty is a design choice with both advantages and drawbacks.
The GLUE benchmark had a lasting impact on NLP research and the broader AI community in several ways.
Standardization of NLU evaluation. Before GLUE, researchers routinely reported results on different subsets of tasks using inconsistent evaluation protocols. GLUE established the norm of evaluating models on a standardized, multi-task benchmark with a public leaderboard. This pattern has been replicated in dozens of subsequent benchmarks across many areas of AI.
Acceleration of pretrained language models. GLUE served as the primary proving ground during a period of extraordinary progress in NLP. The competition to top the GLUE leaderboard motivated the development of BERT, RoBERTa, XLNet, ALBERT, ELECTRA, T5, and DeBERTa, among many others. These models, in turn, became the foundation for modern large language models.
Template for future benchmarks. GLUE's design, combining multiple tasks, a single aggregate score, a public leaderboard, and a held-out evaluation server, became the template for many subsequent AI benchmarks, including SuperGLUE itself and later multi-task suites such as MMLU, BIG-Bench, and HELM.
Multi-task learning and transfer learning research. GLUE's design explicitly encouraged research into multi-task and transfer learning. The finding that pretrained models with fine-tuning dramatically outperformed multi-task baselines helped establish the "pretrain then fine-tune" paradigm that dominated NLP from 2018 through the early 2020s.
As of 2026, GLUE and SuperGLUE are considered effectively solved by modern language models. State-of-the-art models routinely achieve scores well above human baselines on both benchmarks. However, the benchmarks remain important for several reasons.
First, GLUE tasks are still widely used as standard fine-tuning benchmarks when developing and evaluating new model architectures. A model that performs poorly on GLUE would raise questions about its basic NLU capabilities, even though high GLUE performance alone is no longer sufficient to establish a model as state-of-the-art.
Second, the individual GLUE datasets (particularly MNLI, QQP, SST-2, and CoLA) continue to serve as training data and evaluation benchmarks in their own right. MNLI, for example, is widely used as intermediate training data for other NLI tasks, and SST-2 remains a standard sentiment analysis benchmark.
Third, newer benchmarks like MMLU, BIG-Bench, and HELM can be understood as extensions of the GLUE philosophy to broader and harder evaluation suites. Where GLUE tested basic NLU across nine tasks, MMLU tests knowledge across 57 domains, and BIG-Bench includes over 200 tasks spanning reasoning, mathematics, common sense, translation, and more. The scale has grown dramatically, but the underlying principle of multi-task evaluation pioneered by GLUE remains the same.
The progression from GLUE to SuperGLUE to modern benchmarks illustrates a recurring pattern in AI research: benchmarks are proposed, rapidly saturated by improving models, and then replaced by harder successors. This cycle, while sometimes criticized for encouraging "teaching to the test," has also served as a powerful engine for progress, providing clear targets and enabling systematic comparison of approaches.