SQuAD (Stanford Question Answering Dataset) is a large-scale reading comprehension benchmark created at Stanford University that has played a central role in advancing natural language processing (NLP) research. First released in 2016, SQuAD established extractive question answering as one of the most important evaluation tasks in the field. The dataset consists of over 100,000 question-answer pairs derived from Wikipedia articles, with answers that correspond to spans of text within a given passage. SQuAD drove the development of numerous neural architectures and pre-trained language models, and its competitive leaderboard became a proving ground where models from major research labs raced toward and eventually surpassed human-level performance.
Before SQuAD, research on machine reading comprehension was hampered by a lack of large, high-quality datasets. Earlier benchmarks such as MCTest (Richardson et al., 2013) were limited in scale, containing only a few hundred passages. Cloze-style datasets like the CNN/Daily Mail corpus (Hermann et al., 2015) and the Children's Book Test (Hill et al., 2015) offered greater scale but relied on automatically generated fill-in-the-blank questions that did not fully capture the complexity of natural language understanding.
The Stanford NLP group, led by Percy Liang, set out to create a dataset that combined three properties: large scale, natural language questions written by humans, and answers grounded directly in source text. The result was SQuAD, which provided the community with a benchmark that was both challenging enough to differentiate model capabilities and large enough to train data-hungry neural network models.
SQuAD 1.1 was introduced in the paper "SQuAD: 100,000+ Questions for Machine Comprehension of Text" by Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. The paper was published at the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP 2016). It quickly became one of the most cited papers in NLP, accumulating over 7,500 citations according to Semantic Scholar.
SQuAD 1.1 contains 107,785 question-answer pairs spread across 536 Wikipedia articles and 23,215 paragraphs. The dataset was split by article into a training set (80%), a development set (10%), and a test set (10%). The test set was kept hidden, and researchers could only evaluate on it through the official leaderboard submission system.
| Statistic | Count |
|---|---|
| Training set | ~87,600 examples |
| Development set | ~10,600 examples |
| Test set | ~9,500 examples |
| Total question-answer pairs | 107,785 |
| Wikipedia articles | 536 |
| Paragraphs | 23,215 |
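The dataset is distributed as a single nested JSON file: articles contain paragraphs, paragraphs contain question-answer pairs, and each answer records its text plus a character offset into the paragraph. A minimal sketch of walking that structure, using the field names from the official v1.1 release and a hand-written single-record file standing in for the real one:

```python
def iter_examples(squad_json):
    """Yield flat (title, context, question, answer_text, answer_start)
    tuples from the nested layout: data -> paragraphs -> qas -> answers."""
    for article in squad_json["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                for answer in qa["answers"]:
                    yield (article["title"], context, qa["question"],
                           answer["text"], answer["answer_start"])

# A tiny hand-written record in the same shape as the real file
# (normally obtained via json.load on train-v1.1.json).
toy = {
    "version": "1.1",
    "data": [{
        "title": "University_of_Notre_Dame",
        "paragraphs": [{
            "context": "It is a replica of the grotto at Lourdes, France where "
                       "the Virgin Mary reputedly appeared to Saint Bernadette "
                       "Soubirous in 1858.",
            "qas": [{
                "id": "toy-0001",  # hypothetical id, real ids are long hashes
                "question": "To whom did the Virgin Mary allegedly appear "
                            "in 1858 in Lourdes, France?",
                "answers": [{"text": "Saint Bernadette Soubirous",
                             "answer_start": 93}],
            }],
        }],
    }],
}

examples = list(iter_examples(toy))
title, context, question, text, start = examples[0]
# The answer is guaranteed to be a contiguous span of the context:
print(context[start:start + len(text)])  # Saint Bernadette Soubirous
```

The `answer_start` character offset is what makes the extractive format checkable: a loader can verify that every gold answer really is a substring of its passage.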
The task is framed as extractive question answering. Given a passage (a paragraph from a Wikipedia article) and a question about that passage, a model must identify the span of text within the passage that answers the question. The answer is always a contiguous substring of the passage, which can range from a single word to an entire sentence. This format makes evaluation straightforward because both the predicted and gold-standard answers are character spans that can be compared directly.
For example, given a passage about the University of Notre Dame and the question "To whom did the Virgin Mary allegedly appear in 1858 in Lourdes, France?", the correct answer span would be "Saint Bernadette Soubirous."
SQuAD was built using crowdsourcing through Amazon Mechanical Turk (MTurk). The construction process involved several steps:

1. Passage curation: 536 articles were randomly sampled from the top 10,000 English Wikipedia articles, and paragraphs shorter than 500 characters (along with images, figures, and tables) were discarded, yielding 23,215 paragraphs.
2. Question-answer collection: crowdworkers were shown a paragraph and asked to write up to five questions about it, highlighting the answer span in the paragraph for each question.
3. Additional answer collection: for questions in the development and test sets, at least two additional answers were collected from independent workers to account for legitimate variation in span boundaries.
This multi-step process ensured that questions were diverse, naturally phrased, and answerable from the given passage.
The researchers analyzed the types of reasoning required to answer SQuAD questions. They found that the questions span a range of difficulty levels:
| Reasoning Type | Description |
|---|---|
| Lexical variation (synonymy) | The question uses synonyms of words in the passage |
| Lexical variation (world knowledge) | Answering requires basic world knowledge |
| Syntactic variation | The question rephrases the passage using different syntax |
| Multi-sentence reasoning | The answer requires combining information from multiple sentences |
| Ambiguous | The question or passage is ambiguous |
While many SQuAD questions can be answered through simple pattern matching or paraphrase detection, a meaningful fraction requires more sophisticated reasoning, including coreference resolution and multi-sentence inference.
SQuAD 2.0 was introduced in the paper "Know What You Don't Know: Unanswerable Questions for SQuAD" by Pranav Rajpurkar, Robin Jia, and Percy Liang. It was published at the 56th Annual Meeting of the Association for Computational Linguistics (ACL 2018) in Melbourne, Australia.
A key limitation of SQuAD 1.1 was that every question had an answer in the provided passage. This meant that models could always extract some plausible-looking span, even when a question could not actually be answered from the context. In real-world applications, a question answering system must be able to recognize when it does not have enough information to provide a correct answer and abstain rather than guess. SQuAD 2.0 was designed to address this gap.
SQuAD 2.0 combines the existing 100,000+ answerable questions from SQuAD 1.1 with 53,775 new unanswerable questions. The unanswerable questions were written adversarially by crowdworkers who were instructed to craft questions that:

- are relevant to the given paragraph, rather than being off-topic; and
- have a plausible answer in the paragraph, i.e., a span of the same type the question asks for, even though that span does not actually answer the question.
Workers were shown examples from SQuAD 1.1 to calibrate the style and difficulty of their unanswerable questions. This adversarial design made the unanswerable questions difficult for models to distinguish from answerable ones, since superficial cues like entity overlap and question style were not reliable indicators.
| Split | Total Examples | Unanswerable Examples | Ratio |
|---|---|---|---|
| Training | 130,319 | 43,498 | ~2:1 answerable-to-unanswerable |
| Development | 11,873 | 5,945 | ~1:1 |
| Test | 8,862 | 4,332 | ~1:1 |
The training set maintains roughly a 2:1 ratio of answerable to unanswerable questions, while the development and test sets are balanced at approximately 1:1. This design choice encourages models to learn to handle both types of questions during training while providing a fair evaluation at test time.
The introduction of unanswerable questions dramatically increased difficulty. A DocQA model enhanced with ELMo embeddings that achieved 85.8% F1 on SQuAD 1.1 dropped to just 66.3% F1 on SQuAD 2.0, a decline of nearly 20 points. This large gap demonstrated that the ability to abstain from answering was a genuinely difficult capability that existing models lacked.
SQuAD uses two evaluation metrics: Exact Match (EM) and F1 score.
The Exact Match metric measures the percentage of predictions that exactly match one of the ground-truth answers after normalization (lowercasing, removing articles, removing punctuation, and stripping whitespace). It is a strict binary metric: a prediction receives a score of 1 if it matches a reference answer character for character, and 0 otherwise. For questions with multiple valid reference answers, the maximum EM score across all reference answers is taken.
The F1 score provides partial credit by measuring the token-level overlap between the prediction and the ground-truth answer. Treating both answers as bags of tokens, precision is the fraction of predicted tokens that appear in the reference, recall is the fraction of reference tokens that appear in the prediction, and F1 is their harmonic mean: F1 = 2 × (precision × recall) / (precision + recall).
As with EM, the maximum F1 over all reference answers is used for each question, and the final score is the average across all questions in the evaluation set.
The F1 metric is especially valuable because it gives partial credit when a model identifies the correct region of the passage but selects a span that is slightly too long or too short. For example, if the gold answer is "Saint Bernadette Soubirous" and the model predicts "Bernadette Soubirous," the prediction receives a high F1 score despite not being an exact match.
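Both metrics fit in a few lines of Python. The normalization below (lowercasing, stripping punctuation and the articles a/an/the, collapsing whitespace) follows the description above; the official evaluation script performs the same steps, though this sketch is not a drop-in replacement for it:

```python
import re
import string
from collections import Counter

def normalize_answer(s):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(gold))

def f1_score(prediction, gold):
    """Token-overlap F1 between normalized prediction and reference."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# Partial credit for a slightly-too-short span:
print(round(f1_score("Bernadette Soubirous", "Saint Bernadette Soubirous"), 2))  # 0.8
# EM fails but F1 stays high when spans differ by a function word:
print(round(f1_score("in 1858", "1858"), 2))  # 0.67
```

Per question, the maximum score over all reference answers is taken; per dataset, scores are averaged over all questions.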
For SQuAD 2.0, a predicted "no answer" for an unanswerable question is treated as correct, receiving an EM and F1 of 1. Conversely, predicting "no answer" for an answerable question receives a score of 0. This design ensures that models are penalized both for hallucinating answers when none exist and for failing to answer when they should.
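One way to fold the no-answer convention into per-question scoring, sketched with the empty string representing "no answer" and a simplified token-overlap F1 (normalization omitted for brevity):

```python
from collections import Counter

def token_f1(prediction, gold):
    """Token-overlap F1; answer normalization omitted for brevity."""
    pred, ref = prediction.split(), gold.split()
    num_same = sum((Counter(pred) & Counter(ref)).values())
    if num_same == 0:
        return 0.0
    p, r = num_same / len(pred), num_same / len(ref)
    return 2 * p * r / (p + r)

def squad2_score(prediction, gold_answers):
    """Score one SQuAD 2.0 question. An unanswerable question has no
    gold spans; credit is given only when the model also abstains."""
    if not gold_answers:          # unanswerable question
        return float(prediction == "")
    if prediction == "":          # abstained on an answerable question
        return 0.0
    return max(token_f1(prediction, g) for g in gold_answers)

print(squad2_score("", []))                   # 1.0 - correctly abstains
print(squad2_score("Saint Bernadette", []))   # 0.0 - hallucinated answer
print(squad2_score("", ["1858"]))             # 0.0 - failed to answer
```

The asymmetry is the point: there is no partial credit on unanswerable questions, so a model must make a hard abstain-or-answer decision.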
Human performance was estimated by treating one annotator's answer as the "prediction" and the remaining annotators' answers as the ground truth. This approach measures inter-annotator agreement rather than an individual's raw accuracy.
| Benchmark | EM | F1 |
|---|---|---|
| SQuAD 1.1 (test set) | 82.3% | 91.2% |
| SQuAD 2.0 (test set) | 86.8% | 89.5% |
The gap between EM and F1 in human performance reflects the fact that different annotators often select slightly different but equally valid answer spans. Two humans might disagree on whether the answer to a question is "1858" or "in 1858," leading to an exact match failure but a high F1 score.
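The leave-one-out estimate can be sketched as follows; the normalization here is deliberately minimal, and the official protocol differs in its details:

```python
def leave_one_out_em(annotations):
    """Estimate 'human performance' on one question: treat each
    annotator's answer as a prediction scored against the remaining
    annotators' answers, then average the per-annotator EM scores."""
    def norm(s):
        return " ".join(s.lower().replace(".", "").split())
    scores = []
    for i, pred in enumerate(annotations):
        refs = annotations[:i] + annotations[i + 1:]
        scores.append(max(float(norm(pred) == norm(r)) for r in refs))
    return sum(scores) / len(scores)

# Three annotators, one of whom included the preposition:
print(round(leave_one_out_em(["1858", "in 1858", "1858"]), 3))  # 0.667
```

Even though all three answers are equally valid, the EM estimate falls below 1.0, which is exactly why human EM trails human F1 in the table above.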
The SQuAD leaderboard, hosted at Stanford, became one of the most closely watched benchmarks in machine learning. The competitive leaderboard format encouraged rapid iteration and drew submissions from major research labs worldwide, including Google, Microsoft, Facebook AI Research, Alibaba, and many academic groups.
Progress on SQuAD 1.1 was remarkably fast. The initial logistic regression baseline published alongside the dataset achieved just 40.4% EM and 51.0% F1. Within months, neural models pushed performance far beyond this baseline.
| Model | Year | EM (Test) | F1 (Test) | Key Innovation |
|---|---|---|---|---|
| Logistic Regression (Rajpurkar et al.) | 2016 | 40.4% | 51.0% | Feature-engineered baseline |
| Match-LSTM (Wang & Jiang) | 2016 | 64.7% | 73.7% | Pointer networks for span extraction |
| BiDAF (Seo et al.) | 2016 | 68.0% | 77.3% | Bidirectional attention flow |
| R-NET (Microsoft Research Asia) | 2017 | 72.3% | 80.7% | Self-matching attention |
| DrQA (Chen et al., Facebook) | 2017 | 69.5% | 78.8% | Document retriever + reader pipeline |
| QANet (Yu et al., Google) | 2018 | 76.2% | 84.6% | Convolution + self-attention, no recurrence |
| BERT-Large (Devlin et al., Google) | 2018 | 85.1% | 91.8% | Pre-trained transformer fine-tuning |
| Human Performance | - | 82.3% | 91.2% | Inter-annotator agreement |
The progression illustrates a shift from traditional feature engineering to increasingly sophisticated neural architectures, culminating in the pre-training and fine-tuning paradigm introduced by BERT.
In early January 2018, before BERT was published, models from Microsoft Research Asia and Alibaba became the first to surpass human performance on the EM metric of SQuAD 1.1, with submissions days apart. Microsoft's r-net+ ensemble scored 82.65 EM, and Alibaba's SLQA+ scored 82.44 EM, both exceeding the human baseline of 82.3 EM. This milestone received widespread media coverage, though researchers were careful to note that surpassing humans on a single benchmark did not imply general reading comprehension ability.
Later in 2018, BERT-Large surpassed human performance on both EM and F1, achieving 85.1% EM and 91.8% F1 compared to the human scores of 82.3% EM and 91.2% F1. Microsoft noted that it had already integrated earlier versions of its SQuAD model into its Bing search engine, demonstrating practical applications of the technology.
SQuAD 2.0 proved significantly harder, and progress was initially slower. The best baseline at launch (DocQA + ELMo) scored only 63.4% EM and 66.3% F1, far below the human performance of 86.8% EM and 89.5% F1. Over the following years, models gradually closed the gap and eventually surpassed human-level scores.
| Model | Year | EM (Test) | F1 (Test) | Notes |
|---|---|---|---|---|
| BiDAF-No-Answer (baseline) | 2018 | 59.2% | 62.1% | Threshold-based abstention |
| DocQA + ELMo (baseline) | 2018 | 63.4% | 66.3% | Strongest baseline at launch |
| BERT-Large (Google) | 2018 | ~80.0% | ~83.1% | First pre-trained transformer on SQuAD 2.0 |
| XLNet-Large (Yang et al.) | 2019 | 87.9% | 90.6% | Autoregressive pre-training |
| ALBERT-xxlarge (Lan et al.) | 2019 | 88.1% | 90.9% | Parameter-efficient pre-training |
| Retrospective Reader (Zhang et al.) | 2020 | 88.4% | 90.9% | First single model to beat human EM and F1 |
| DeBERTa (He et al., Microsoft) | 2020 | 88.4% | 90.7% | Disentangled attention |
| IE-Net (RICOH, ensemble) | 2020 | 90.9% | 93.2% | Top of leaderboard |
| Human Performance | - | 86.8% | 89.5% | Inter-annotator agreement |
By 2020, the top ensemble models on the SQuAD 2.0 leaderboard exceeded human performance by more than 3 points on both EM and F1, demonstrating that the combination of large-scale pre-training, careful fine-tuning, and model ensembling could yield superhuman extractive QA performance on this benchmark.
Bidirectional Attention Flow (BiDAF), developed by Minjoon Seo and colleagues at the University of Washington and the Allen Institute for AI, introduced a multi-level attention mechanism that computes attention in both directions: from context to query and from query to context. BiDAF demonstrated that carefully designed attention could substantially improve reading comprehension performance, achieving 77.3% F1 on the SQuAD 1.1 test set.
Developed at Facebook AI Research by Danqi Chen and colleagues, DrQA combined a document retriever (using TF-IDF) with a neural document reader trained on SQuAD. While its primary contribution was open-domain question answering (retrieving relevant documents from all of Wikipedia before extracting answers), the reader component achieved competitive single-passage results of 69.5% EM and 78.8% F1 on SQuAD 1.1.
QANet, developed by Adams Wei Yu and colleagues at Google and Carnegie Mellon University, replaced recurrent neural networks with a combination of local convolutions and global self-attention. This made training parallelizable and significantly faster. QANet achieved 84.6% F1 on SQuAD 1.1 and represented the state of the art immediately before the release of BERT.
BERT (Bidirectional Encoder Representations from Transformers), introduced by Jacob Devlin and colleagues at Google AI, transformed the SQuAD landscape. By pre-training a deep bidirectional transformer on large text corpora using masked language modeling and next sentence prediction, BERT learned rich contextual representations that could be fine-tuned for downstream tasks with minimal architectural changes. On SQuAD 1.1, BERT-Large achieved 91.8% F1, surpassing human performance. On SQuAD 2.0, it achieved approximately 83.1% F1, a massive improvement over prior models.
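Span-based readers such as BiDAF and BERT reduce answer extraction to scoring every token as a potential span start and as a potential span end, then taking the argmax over legal pairs (start ≤ end, bounded length). A minimal decoding sketch with toy logits (invented for illustration, not produced by a real model):

```python
def best_span(start_logits, end_logits, max_len=30):
    """Return the (start, end) pair maximizing start_logit + end_logit
    subject to start <= end and a maximum span length."""
    best, best_score = (0, 0), float("-inf")
    for s, s_logit in enumerate(start_logits):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = s_logit + end_logits[e]
            if score > best_score:
                best_score, best = score, (s, e)
    return best, best_score

tokens = ["appeared", "to", "Saint", "Bernadette", "Soubirous", "in", "1858"]
start = [0.1, 0.0, 4.0, 0.2, 0.1, 0.0, 0.5]   # toy start logits
end   = [0.0, 0.1, 0.2, 0.3, 3.5, 0.1, 0.8]   # toy end logits
(s, e), _ = best_span(start, end)
print(" ".join(tokens[s:e + 1]))  # Saint Bernadette Soubirous
```

In BERT's fine-tuning setup, the two logit vectors come from a single linear layer applied to the final hidden states; the decoding step above is essentially all the task-specific machinery the span format requires.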
XLNet (Yang et al., 2019) introduced permutation-based language modeling to capture bidirectional context without the masking artifacts of BERT. ALBERT (Lan et al., 2019) achieved comparable or better performance with substantially fewer parameters through cross-layer parameter sharing and factorized embedding parameterization. Both models achieved F1 scores above 90% on SQuAD 2.0, closing in on human performance.
DeBERTa (Decoding-enhanced BERT with Disentangled Attention) from Microsoft introduced disentangled attention, which uses separate vectors for content and position. DeBERTa-Large achieved 90.7% F1 on SQuAD 2.0, surpassing human performance.
SQuAD had a profound and lasting impact on NLP research in several ways:
Before SQuAD, there was no widely accepted benchmark for reading comprehension. SQuAD provided a common evaluation framework that allowed researchers to directly compare different approaches. The public leaderboard created a competitive dynamic that accelerated progress and attracted attention from industry research labs.
The SQuAD leaderboard pushed the development of key architectural innovations, including attention mechanisms designed for reading comprehension (such as BiDAF's bidirectional attention flow and R-NET's self-matching attention), pointer networks for span extraction, and ultimately the pre-training and fine-tuning paradigm popularized by BERT.
SQuAD also shaped multi-task evaluation suites: the GLUE benchmark's QNLI task was derived directly from SQuAD by pairing questions with candidate answer sentences. Performance on SQuAD itself was routinely reported when introducing new pre-trained language models, making it a de facto litmus test for a model's reading comprehension ability. Models such as RoBERTa, ELECTRA, T5, and many others used SQuAD as a primary evaluation benchmark.
The techniques developed for SQuAD found direct applications in commercial products. Microsoft integrated SQuAD-derived models into Bing search. Google used similar reading comprehension capabilities in Google Search featured snippets. Customer service chatbots, document analysis tools, and legal tech platforms all benefited from the reading comprehension advances that SQuAD helped catalyze.
SQuAD inspired and influenced the creation of numerous subsequent question answering datasets, each designed to address different aspects of language understanding.
| Dataset | Year | Size | Source | Key Difference from SQuAD |
|---|---|---|---|---|
| SQuAD 1.1 | 2016 | 107K QA pairs | Wikipedia | Extractive, all questions answerable |
| TriviaQA | 2017 | 95K QA pairs | Trivia websites | Questions written independently of evidence documents |
| Natural Questions (NQ) | 2019 | 300K+ questions | Google Search logs | Real user queries, long and short answers |
| QuAC | 2018 | 100K questions | Wikipedia | Conversational, multi-turn dialogue |
| CoQA | 2018 | 127K questions | Multiple domains | Conversational, free-form answers |
| HotpotQA | 2018 | 113K QA pairs | Wikipedia | Multi-hop reasoning across multiple documents |
| NewsQA | 2017 | 100K+ QA pairs | CNN news articles | Longer contexts, news domain |
TriviaQA differs from SQuAD in that questions were written by trivia enthusiasts without seeing the evidence documents. This means questions are more naturally phrased and less tied to specific passage wordings, creating a more realistic evaluation of open-domain question answering.
Natural Questions (Google, 2019) uses real queries from Google Search, making the questions reflect genuine information needs rather than questions crafted while reading a passage. NQ also distinguishes between long answers (a paragraph) and short answers (a span), adding complexity to the task.
QuAC and CoQA extend question answering into a conversational setting where questions are asked in sequence, and later questions may refer to earlier parts of the dialogue through coreference and ellipsis. This tests a model's ability to track conversational context, something SQuAD does not address.
HotpotQA requires models to reason across multiple documents to find an answer, testing multi-hop reasoning capabilities that go beyond the single-passage format of SQuAD.
Despite its enormous influence, SQuAD has several well-documented limitations:
All answers in SQuAD are spans extracted directly from the passage. This means the dataset cannot evaluate a model's ability to generate abstractive answers, perform numerical reasoning, or synthesize information from multiple sources into a novel response.
Because crowdworkers wrote questions while reading the passage, many SQuAD questions closely mirror the passage's phrasing. This creates an artificial scenario where lexical overlap between the question and the answer sentence is often a strong signal. Studies have shown that models can achieve high SQuAD scores partly through superficial pattern matching rather than deep language understanding.
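The lexical-overlap signal is easy to demonstrate: a baseline that simply returns the passage sentence sharing the most word types with the question already locates the answer sentence for many SQuAD questions. A toy sketch (with a naive period-based sentence split):

```python
def overlap_sentence(question, passage):
    """Return the passage sentence sharing the most word types with
    the question -- a deliberately shallow heuristic."""
    q_words = set(question.lower().split())
    sentences = [s.strip() for s in passage.split(".") if s.strip()]
    return max(sentences, key=lambda s: len(q_words & set(s.lower().split())))

passage = ("The university opened in 1842. The Main Building's gold dome "
           "was completed later. Enrollment grew steadily after the war")
print(overlap_sentence("When did the university open?", passage))
# The university opened in 1842
```

That a bag-of-words heuristic gets this far is precisely the concern: high SQuAD scores can partly reflect surface matching rather than comprehension.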
SQuAD covers only English text. While several community efforts have translated SQuAD into other languages (for example, SQuAD-it for Italian, KorQuAD for Korean, and SberQuAD for Russian), the official benchmark remains monolingual.
All passages come from Wikipedia, which has a distinctive encyclopedic writing style. Models trained on SQuAD may not generalize well to other domains such as biomedical literature, legal documents, technical manuals, or conversational text.
Research has identified annotation artifacts in SQuAD. For instance, models can sometimes answer questions correctly by exploiting superficial heuristics, such as matching the type of the expected answer (e.g., a date, a person's name) to entities in the passage. This means that high SQuAD scores do not always indicate genuine comprehension.
By 2020, top models had surpassed human performance on both SQuAD 1.1 and SQuAD 2.0 by substantial margins. Modern large language models such as GPT-4 achieve near-perfect scores on SQuAD, suggesting that the benchmark no longer differentiates among frontier models. This has shifted attention to more challenging benchmarks that test multi-hop reasoning, long-context understanding, and open-ended generation.
Despite being largely "solved" by current models, SQuAD remains relevant for several reasons:
Educational value: SQuAD is widely used in NLP courses and tutorials as an accessible introduction to question answering and reading comprehension tasks. Its clear task formulation and well-defined metrics make it an excellent teaching tool.
Baseline evaluation: New model architectures and training methods often report SQuAD performance as a sanity check, even if the primary evaluation targets more challenging benchmarks.
Fine-tuning resource: Many practitioners use SQuAD as training data when building extractive QA systems, since the dataset is freely available under a CC-BY-SA-4.0 license and works well as a starting point for domain-specific fine-tuning.
Historical significance: SQuAD's leaderboard documented one of the most dramatic periods of progress in AI history. The trajectory from 51% F1 in 2016 to superhuman performance in 2018 on SQuAD 1.1 vividly illustrates the pace of advancement in deep learning for NLP.
Methodological template: The SQuAD data collection methodology, evaluation protocol, and leaderboard format have been replicated by dozens of subsequent benchmarks across multiple languages and domains.
SQuAD is freely available through Hugging Face Datasets, TensorFlow Datasets, and the official SQuAD Explorer website maintained at Stanford. The dataset continues to serve as a foundation for question answering research and applications around the world.