# SQuAD

> Source: https://aiwiki.ai/wiki/squad
> Updated: 2026-06-21
> Categories: AI Benchmarks, Natural Language Processing
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**SQuAD** (the Stanford Question Answering Dataset) is a large-scale [reading comprehension](/wiki/reading_comprehension) benchmark from [Stanford University](/wiki/stanford_university) in which a model must answer a question by extracting the exact span of text that answers it from a given passage. Released in 2016 by Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and [Percy Liang](/wiki/percy_liang), the original SQuAD 1.1 contains 107,785 question-answer pairs written by crowdworkers over 536 Wikipedia articles, and the 2018 follow-up SQuAD 2.0 adds 53,775 unanswerable questions that systems must learn to decline rather than guess.[1][2] SQuAD established extractive [question answering](/wiki/question_answering) as one of the most important evaluation tasks in [natural language processing](/wiki/natural_language_processing) (NLP), and its public leaderboard became a proving ground where models from major research labs raced toward and eventually surpassed human-level performance, with [BERT](/wiki/bert) reaching 91.8% F1 on SQuAD 1.1 in 2018, above the 91.2% human score.[1][3]

The original paper defines the dataset as "a new reading comprehension dataset consisting of 100,000+ questions posed by crowdworkers on a set of Wikipedia articles, where the answer to each question is a segment of text from the corresponding reading passage."[1] SQuAD drove the development of numerous neural architectures and pre-trained language models, and remains one of the most cited resources in NLP, with the 2016 paper accumulating over 7,500 citations.[1]

## What problem was SQuAD created to solve?

Before SQuAD, research on machine reading comprehension was hampered by a lack of large, high-quality datasets. Earlier benchmarks such as MCTest (Richardson et al., 2013) were limited in scale, containing only a few hundred passages. Cloze-style datasets like the CNN/Daily Mail corpus (Hermann et al., 2015) and the Children's Book Test (Hill et al., 2015) offered greater scale but relied on automatically generated fill-in-the-blank questions that did not fully capture the complexity of natural language understanding.[1]

The Stanford NLP group, led by [Percy Liang](/wiki/percy_liang), set out to create a dataset that combined three properties: large scale, natural language questions written by humans, and answers grounded directly in source text. The result was SQuAD, which provided the community with a benchmark that was both challenging enough to differentiate model capabilities and large enough to train data-hungry [neural network](/wiki/neural_network) models.[1]

## SQuAD 1.1

### Paper and Authors

SQuAD 1.1 was introduced in the paper "SQuAD: 100,000+ Questions for Machine Comprehension of Text" by Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. The paper was published at the 2016 Conference on Empirical Methods in Natural Language Processing ([EMNLP](/wiki/emnlp) 2016).[1] It quickly became one of the most cited papers in NLP, accumulating over 7,500 citations according to Semantic Scholar.[1]

### Dataset Statistics

SQuAD 1.1 contains 107,785 question-answer pairs spread across 536 Wikipedia articles and 23,215 paragraphs.[1] The dataset was split by article into a training set (80%), a development set (10%), and a test set (10%). The test set was kept hidden, and researchers could only evaluate on it through the official leaderboard submission system.[1]

| Split | Approximate Size |
|---|---|
| Training set | ~87,600 examples |
| Development set | ~10,600 examples |
| Test set | ~9,500 examples |
| Total | 107,785 examples |
| Wikipedia articles | 536 |
| Paragraphs | 23,215 |
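
Splitting by article rather than by question keeps every paragraph of an article inside a single split. The following is a minimal sketch of such an 80/10/10 partition. The function name, the placeholder article IDs, and the fixed seed are assumptions made for illustration and are not taken from the original release.

```python
import random


def split_by_article(article_ids, seed=0, train_frac=0.8, dev_frac=0.1):
    """Partition article IDs so that no article appears in two splits."""
    ids = sorted(article_ids)            # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * train_frac)
    n_dev = int(len(ids) * dev_frac)
    return {
        "train": ids[:n_train],
        "dev": ids[n_train:n_train + n_dev],
        "test": ids[n_train + n_dev:],   # whatever remains goes to test
    }


# Hypothetical IDs standing in for the 536 Wikipedia articles.
splits = split_by_article([f"article_{i}" for i in range(536)])
print({name: len(ids) for name, ids in splits.items()})
```

Because questions are assigned through their article, the per-split question counts only approximate 80/10/10, which is consistent with the approximate sizes in the table.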

### Task Format

The task is framed as extractive question answering. Given a passage (a paragraph from a Wikipedia article) and a question about that passage, a model must identify the span of text within the passage that answers the question. The answer is always a contiguous substring of the passage, which can range from a single word to an entire sentence.[1] This format makes evaluation straightforward because the predicted and gold-standard answers are both short text spans that can be compared directly.

For example, given a passage about the University of Notre Dame and the question "To whom did the Virgin Mary allegedly appear in 1858 in Lourdes, France?", the correct answer span would be "Saint Bernadette Soubirous."[1]
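
To make the span constraint concrete, here is a small sketch of how one such example might be represented and checked. The field names and the shortened context sentence are assumptions for illustration. The character offset is computed with `str.find` rather than copied from the dataset.

```python
# Illustrative record; field names and the shortened context are assumptions.
context = (
    "It is a replica of the grotto at Lourdes, France where the Virgin Mary "
    "reputedly appeared to Saint Bernadette Soubirous in 1858."
)
answer_text = "Saint Bernadette Soubirous"

record = {
    "question": "To whom did the Virgin Mary allegedly appear in 1858 in Lourdes, France?",
    "context": context,
    "answers": [{"text": answer_text, "answer_start": context.find(answer_text)}],
}


def span_is_valid(rec):
    """Check that each gold answer is a contiguous substring at its stated offset."""
    ctx = rec["context"]
    return all(
        ctx[a["answer_start"]:a["answer_start"] + len(a["text"])] == a["text"]
        for a in rec["answers"]
    )


print(span_is_valid(record))
```

Storing a start offset alongside the answer text disambiguates cases where the same string occurs more than once in the passage.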

### How was SQuAD built?

SQuAD was built using [crowdsourcing](/wiki/crowdsourcing) through [Amazon Mechanical Turk](/wiki/amazon_mechanical_turk) (MTurk). The construction process involved several steps:

1. **Article selection**: The researchers sampled 536 articles at random from the top 10,000 English Wikipedia articles, ranked using Project Nayuki's Wikipedia internal PageRanks.
2. **Paragraph extraction**: Individual paragraphs were extracted from these articles and presented to crowdworkers.
3. **Question writing**: Workers were shown a paragraph and asked to write up to five questions about it. They were encouraged to use their own words rather than copying phrases directly from the passage. To enforce this, the copy-paste function on the paragraph text was disabled in the annotation interface. Workers were paid at a rate of $9 per hour and were asked to spend approximately four minutes per paragraph.
4. **Answer highlighting**: After writing a question, the worker highlighted the shortest span in the paragraph that answered the question.
5. **Additional answer collection**: To enable robust evaluation and estimate human performance, the researchers collected at least two additional answers for every question in the development and test sets. In this secondary annotation task, different workers were shown the question and passage and asked to highlight the shortest answer span.

This multi-step process ensured that questions were diverse, naturally phrased, and answerable from the given passage.[1]

### Question Types

The researchers analyzed the types of reasoning required to answer SQuAD questions. They found that the questions span a range of difficulty levels:

| Reasoning Type | Description |
|---|---|
| Lexical variation (synonymy) | The question uses synonyms of words in the passage |
| Lexical variation (world knowledge) | Answering requires basic world knowledge |
| Syntactic variation | The question rephrases the passage using different syntax |
| Multi-sentence reasoning | The answer requires combining information from multiple sentences |
| Ambiguous | The question or passage is ambiguous |

While many SQuAD questions can be answered through simple pattern matching or paraphrase detection, a meaningful fraction requires more sophisticated reasoning, including coreference resolution and multi-sentence inference.[1]

## SQuAD 2.0

### Paper and Authors

SQuAD 2.0 was introduced in the paper "Know What You Don't Know: Unanswerable Questions for SQuAD" by Pranav Rajpurkar, Robin Jia, and Percy Liang. It was published at the 56th Annual Meeting of the Association for Computational Linguistics (ACL 2018) in Melbourne, Australia.[2]

### What changed from SQuAD 1.1 to SQuAD 2.0?

A key limitation of SQuAD 1.1 was that every question had an answer in the provided passage. This meant that models could always extract some plausible-looking span, even when a question could not actually be answered from the context. In real-world applications, a question answering system must be able to recognize when it does not have enough information to provide a correct answer and abstain rather than guess. SQuAD 2.0 was designed to address this gap. As the authors put it, "To do well on SQuAD 2.0, systems must not only answer questions when possible, but also determine when no answer is supported by the paragraph and abstain from answering."[2]

### Unanswerable Questions

SQuAD 2.0 combines the existing 100,000+ answerable questions from SQuAD 1.1 with 53,775 new unanswerable questions.[2] The unanswerable questions were written adversarially by crowdworkers who were instructed to craft questions that:

- Appeared relevant to the paragraph
- Referenced entities mentioned in the paragraph
- Looked similar in style to the answerable SQuAD 1.1 questions
- Could not actually be answered based on the paragraph alone, even though a plausible answer might seem to exist

Workers were shown examples from SQuAD 1.1 to calibrate the style and difficulty of their unanswerable questions. This adversarial design made the unanswerable questions difficult for models to distinguish from answerable ones, since superficial cues like entity overlap and question style were not reliable indicators.[2]

### Dataset Composition

| Split | Total Examples | Unanswerable Examples | Ratio |
|---|---|---|---|
| Training | 130,319 | 43,498 | ~2:1 answerable-to-unanswerable |
| Development | 11,873 | 5,945 | ~1:1 |
| Test | 8,862 | 4,332 | ~1:1 |

The training set maintains roughly a 2:1 ratio of answerable to unanswerable questions, while the development and test sets are balanced at approximately 1:1.[2] This design choice encourages models to learn to handle both types of questions during training while providing a fair evaluation at test time.
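
The ratios can be checked directly from the table. This short sketch only re-derives figures already listed above; the variable names are illustrative.

```python
# (total examples, unanswerable examples) per split, copied from the table above.
splits = {
    "train": (130_319, 43_498),
    "dev": (11_873, 5_945),
    "test": (8_862, 4_332),
}

for name, (total, unanswerable) in splits.items():
    answerable = total - unanswerable
    print(f"{name}: {answerable} answerable, ratio {answerable / unanswerable:.2f}:1")

# Sum of the unanswerable questions across all three splits.
total_unanswerable = sum(unanswerable for _, unanswerable in splits.values())
print(total_unanswerable)
```

The three unanswerable counts sum to 53,775, matching the total quoted for SQuAD 2.0.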

### Performance Gap

The introduction of unanswerable questions dramatically increased difficulty. A DocQA model enhanced with [ELMo](/wiki/elmo) embeddings that achieved 85.8% F1 on SQuAD 1.1 dropped to just 66.3% F1 on SQuAD 2.0, a decline of nearly 20 points.[2] That left the strongest launch system roughly 23 points below the human F1 of 89.5%, and the SQuAD 2.0 paper highlighted this gap as evidence that the ability to abstain from answering was a genuinely difficult capability that existing models lacked.[2]

## How is SQuAD scored?

SQuAD uses two evaluation metrics: Exact Match (EM) and [F1 score](/wiki/f1_score).[1]

### Exact Match (EM)

The Exact Match metric measures the percentage of predictions that exactly match one of the ground-truth answers after normalization (lowercasing, removing articles, removing punctuation, and collapsing extra whitespace). It is a strict binary metric: a prediction receives a score of 1 if its normalized form is identical to a normalized reference answer, and 0 otherwise. For questions with multiple valid reference answers, the maximum EM score across all reference answers is taken.[1]
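
The normalization and comparison steps can be sketched as follows. This is an illustrative reimplementation of the steps described above, not the official evaluation script, and the function names are assumptions.

```python
import re
import string

_PUNCTUATION = set(string.punctuation)


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, references):
    """Return 1 if the normalized prediction equals any normalized reference."""
    pred = normalize_answer(prediction)
    return int(any(pred == normalize_answer(ref) for ref in references))


print(exact_match("The Virgin Mary.", ["virgin mary"]))                   # 1
print(exact_match("Bernadette Soubirous", ["Saint Bernadette Soubirous"]))  # 0
```

Taking `any` over the references implements the "maximum over reference answers" rule, since each individual comparison is either 0 or 1.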

### F1 Score

The F1 score provides partial credit by measuring the token-level overlap between the prediction and the ground-truth answer. It is computed as the harmonic mean of precision and recall:

- **[Precision](/wiki/precision)**: The fraction of tokens in the predicted answer that also appear in the ground-truth answer
- **[Recall](/wiki/recall)**: The fraction of tokens in the ground-truth answer that also appear in the predicted answer

As with EM, the maximum F1 over all reference answers is used for each question, and the final score is the average across all questions in the evaluation set.[1]

The F1 metric is especially valuable because it gives partial credit when a model identifies the correct region of the passage but selects a span that is slightly too long or too short. For example, if the gold answer is "Saint Bernadette Soubirous" and the model predicts "Bernadette Soubirous," the prediction receives a high F1 score despite not being an exact match.
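
The partial credit in this example can be worked through in a few lines. The sketch below uses simplified normalization (lowercasing and whitespace tokenization only), which is an assumption made to keep it short.

```python
from collections import Counter


def token_f1(prediction, reference):
    """Token-level F1 between two answer strings."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection counts each shared token at most as often as it
    # appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = token_f1("Bernadette Soubirous", "Saint Bernadette Soubirous")
print(round(score, 2))  # 0.8
```

Here both predicted tokens appear in the gold answer (precision 1.0), while two of the three gold tokens are recovered (recall about 0.67), giving an F1 of 0.8.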

### SQuAD 2.0 Scoring

For SQuAD 2.0, a predicted "no answer" for an unanswerable question is treated as correct, receiving an EM and F1 of 1, while any span predicted for an unanswerable question receives 0. Conversely, predicting "no answer" for an answerable question also receives a score of 0. This design ensures that models are penalized both for hallucinating answers when none exist and for failing to answer when they should.[2]
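
The four possible cases can be illustrated with a short sketch. It assumes that an empty string encodes "no answer" and that an unanswerable question has an empty list of references. Both conventions, and the simplified normalization, are assumptions for this example.

```python
def exact_match_v2(prediction, references):
    """EM where the empty string stands for a "no answer" prediction."""
    pred = prediction.strip().lower()
    # An unanswerable question has no references, so its only correct
    # "answer" is the empty string.
    golds = [ref.strip().lower() for ref in references] or [""]
    return int(pred in golds)


print(exact_match_v2("", []))              # abstain on unanswerable question
print(exact_match_v2("Paris", []))         # answer given where none exists
print(exact_match_v2("", ["Paris"]))       # abstain on answerable question
print(exact_match_v2("paris", ["Paris"]))  # correct answer
```

Only the first and last calls score 1. The two middle calls correspond to the two failure modes described above.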

## Human Performance

Human performance was estimated by treating one annotator's answer as the "prediction" and the remaining annotators' answers as the ground truth. This approach measures inter-annotator agreement rather than an individual's raw accuracy.[1]

| Benchmark | EM | F1 |
|---|---|---|
| SQuAD 1.1 (test set) | 82.3% | 91.2% |
| SQuAD 2.0 (test set) | 86.8% | 89.5% |

The gap between EM and F1 in human performance reflects the fact that different annotators often select slightly different but equally valid answer spans. Two humans might disagree on whether the answer to a question is "1858" or "in 1858," leading to an exact match failure but partial credit under F1.
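
The estimation procedure can be sketched by holding out one annotator's answer and scoring it against the others. The three answers below are hypothetical, and the choice of which annotator to hold out, along with the simplified normalization, is an assumption.

```python
from collections import Counter


def token_f1(pred, gold):
    """Token-level F1 with lowercasing and whitespace tokenization only."""
    p, g = pred.lower().split(), gold.lower().split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)


def human_scores(annotator_answers, held_out=1):
    """Score one annotator's answer against the remaining annotators' answers."""
    prediction = annotator_answers[held_out]
    references = [a for i, a in enumerate(annotator_answers) if i != held_out]
    em = max(int(prediction.lower() == ref.lower()) for ref in references)
    f1 = max(token_f1(prediction, ref) for ref in references)
    return em, f1


# Hypothetical answers from three annotators to the same question.
em, f1 = human_scores(["in 1858", "1858", "in 1858"], held_out=1)
print(em, round(f1, 2))
```

In this case the held-out answer "1858" never matches a reference exactly, so EM is 0, but it shares one of two reference tokens with full precision, giving an F1 of about 0.67.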

## Historical Leaderboard Progress

The SQuAD leaderboard, hosted at Stanford, became one of the most closely watched benchmarks in [machine learning](/wiki/machine_learning). The competitive leaderboard format encouraged rapid iteration and drew submissions from major research labs worldwide, including [Google](/wiki/google), [Microsoft](/wiki/microsoft), [Facebook AI Research](/wiki/meta_ai), [Alibaba](/wiki/alibaba_cloud), and many academic groups.

### SQuAD 1.1 Timeline

Progress on SQuAD 1.1 was remarkably fast. The initial logistic regression baseline published alongside the dataset achieved just 40.4% EM and 51.0% F1.[1] Within months, neural models pushed performance far beyond this baseline.

| Model | Year | EM (Test) | F1 (Test) | Key Innovation |
|---|---|---|---|---|
| Logistic Regression (Rajpurkar et al.) | 2016 | 40.4% | 51.0% | Feature-engineered baseline |
| [Match-LSTM](/wiki/match_lstm) (Wang & Jiang) | 2016 | 64.7% | 73.7% | Pointer networks for span extraction |
| [BiDAF](/wiki/bidaf) (Seo et al.) | 2016 | 68.0% | 77.3% | Bidirectional attention flow |
| [R-NET](/wiki/r_net) (Microsoft Research Asia) | 2017 | 72.3% | 80.7% | Self-matching attention |
| [DrQA](/wiki/drqa) (Chen et al., Facebook) | 2017 | 70.0% | 79.0% | Document retriever + reader pipeline |
| [QANet](/wiki/qanet) (Yu et al., Google) | 2018 | 76.2% | 84.6% | Convolution + self-attention, no recurrence |
| [BERT](/wiki/bert)-Large (Devlin et al., Google) | 2018 | 85.1% | 91.8% | Pre-trained [transformer](/wiki/transformer) fine-tuning |
| BERT-Large (ensemble) | 2018 | 87.4% | 93.2% | Ensemble of pre-trained transformers |
| Human Performance | - | 82.3% | 91.2% | Inter-annotator agreement |

The progression illustrates a shift from traditional feature engineering to increasingly sophisticated neural architectures, culminating in the [pre-training](/wiki/pre-training) and [fine-tuning](/wiki/fine_tuning) paradigm introduced by BERT.[3]

### Surpassing Human Performance on SQuAD 1.1

On January 3, 2018, before BERT was published, models from [Microsoft Research](/wiki/microsoft_research) Asia and Alibaba became the first to surpass human performance on the EM metric of SQuAD 1.1. Microsoft's r-net+ ensemble scored 82.65 EM, and Alibaba's SLQA+ scored 82.44 EM, both exceeding the human baseline of 82.3 EM. This milestone received widespread media coverage, though researchers were careful to note that surpassing humans on a single benchmark did not imply general reading comprehension ability.

Later in 2018, BERT-Large surpassed human performance on both EM and F1. Its single model scored 85.1% EM and 91.8% F1 and its ensemble scored 87.4% EM and 93.2% F1, compared to the human scores of 82.3% EM and 91.2% F1.[3] Microsoft noted that it had already integrated earlier versions of its SQuAD model into its Bing search engine, demonstrating practical applications of the technology.

### SQuAD 2.0 Leaderboard

SQuAD 2.0 proved significantly harder, and progress was initially slower. The best baseline at launch (DocQA + ELMo) scored only 63.4% EM and 66.3% F1, far below the human performance of 86.8% EM and 89.5% F1.[2] Over the following years, models gradually closed the gap and eventually surpassed human-level scores.

| Model | Year | EM (Test) | F1 (Test) | Notes |
|---|---|---|---|---|
| BiDAF-No-Answer (baseline) | 2018 | 59.2% | 62.1% | Threshold-based abstention |
| DocQA + [ELMo](/wiki/elmo) (baseline) | 2018 | 63.4% | 66.3% | Strongest baseline at launch |
| [BERT](/wiki/bert)-Large (Google) | 2018 | ~80.0% | ~83.1% | First pre-trained transformer on SQuAD 2.0 |
| [XLNet](/wiki/xlnet)-Large (Yang et al.) | 2019 | 87.9% | 90.6% | Autoregressive pre-training |
| [ALBERT](/wiki/albert)-xxlarge (Lan et al.) | 2019 | 88.1% | 90.9% | Parameter-efficient pre-training |
| Retrospective Reader (Zhang et al.) | 2020 | 88.4% | 90.9% | Sketchy reading followed by intensive reading and answer verification |
| [DeBERTa](/wiki/deberta) (He et al., Microsoft) | 2020 | 88.4% | 90.7% | Disentangled attention |
| IE-Net (RICOH, ensemble) | 2020 | 90.9% | 93.2% | Top of leaderboard |
| Human Performance | - | 86.8% | 89.5% | Inter-annotator agreement |

By 2020, the top ensemble models on the SQuAD 2.0 leaderboard exceeded human performance by more than 3 points on both EM and F1, demonstrating that the combination of large-scale pre-training, careful fine-tuning, and model ensembling could yield superhuman extractive QA performance on this benchmark.

## Key Models and Architectural Innovations

### BiDAF (2016)

Bidirectional [Attention](/wiki/attention) Flow (BiDAF), developed by Minjoon Seo and colleagues at the University of Washington and the [Allen Institute for AI](/wiki/ai2), introduced a multi-level [attention mechanism](/wiki/attention) that computes attention in both directions: from context to query and from query to context. BiDAF demonstrated that carefully designed attention could substantially improve reading comprehension performance, achieving 77.3% F1 on the SQuAD 1.1 test set.[4]
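
The two attention directions can be sketched in plain Python. This simplified version uses a dot product as the similarity function, which is an assumption standing in for a learned similarity. The function names and toy vectors are illustrative only.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def weighted_sum(weights, vectors):
    dim = len(vectors[0])
    return [sum(w * vec[d] for w, vec in zip(weights, vectors)) for d in range(dim)]


def bidirectional_attention(context, query):
    """Return (context-to-query vectors, query-to-context summary vector)."""
    # Similarity between every context token and every query token.
    sim = [[dot(h, u) for u in query] for h in context]

    # Context-to-query: each context token attends over the query tokens.
    c2q = [weighted_sum(softmax(row), query) for row in sim]

    # Query-to-context: weight context tokens by their best match to any
    # query token, giving one summary vector shared by all positions.
    q2c_weights = softmax([max(row) for row in sim])
    q2c = weighted_sum(q2c_weights, context)
    return c2q, q2c


context = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # three toy context tokens
query = [[1.0, 0.0], [0.0, 2.0]]                 # two toy query tokens
c2q, q2c = bidirectional_attention(context, query)
print(len(c2q), len(q2c))
```

The context-to-query output has one vector per context token, while the query-to-context output is a single vector that would be broadcast across all context positions before the two are combined.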

### DrQA (2017)

Developed at Facebook AI Research by Danqi Chen and colleagues, DrQA combined a document retriever (using TF-IDF) with a neural document reader trained on SQuAD. While its primary contribution was open-domain question answering (retrieving relevant documents from all of Wikipedia before extracting answers), the reader component achieved competitive single-passage results of 69.5% EM and 78.8% F1 on the SQuAD 1.1 development set (70.0% EM and 79.0% F1 on the test set).[5]

### QANet (2018)

QANet, developed by Adams Wei Yu and colleagues at Google and Carnegie Mellon University, replaced [recurrent neural networks](/wiki/recurrent_neural_network) with a combination of local convolutions and global [self-attention](/wiki/self_attention). This made training parallelizable and significantly faster. QANet achieved 84.6% F1 on SQuAD 1.1 and represented the state of the art immediately before the release of BERT.[6]

### BERT (2018)

[BERT](/wiki/bert) (Bidirectional Encoder Representations from Transformers), introduced by Jacob Devlin and colleagues at Google AI, transformed the SQuAD landscape. By pre-training a deep bidirectional transformer on large text corpora using masked language modeling and next sentence prediction, BERT learned rich contextual representations that could be fine-tuned for downstream tasks with minimal architectural changes. On SQuAD 1.1, BERT-Large achieved 91.8% F1, surpassing human performance. On SQuAD 2.0, it achieved approximately 83.1% F1, a massive improvement over prior models.[3]
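
For extractive QA, fine-tuning of this kind typically needs only a start score and an end score for each token. The sketch below shows how an answer span might be decoded from such scores, with a threshold for abstaining on SQuAD 2.0. The logits are made-up numbers. `max_answer_len`, `null_threshold`, and the use of position 0 as the "no answer" slot are assumptions made for this illustration.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (score, start, end) for the best span with start <= end."""
    best = (float("-inf"), 0, 0)
    for i, start_score in enumerate(start_logits):
        for j in range(i, min(i + max_answer_len, len(end_logits))):
            score = start_score + end_logits[j]
            if score > best[0]:
                best = (score, i, j)
    return best


def predict(start_logits, end_logits, null_threshold=0.0):
    """Pick a span, or return None to abstain.

    Position 0 is assumed to be a special token whose start + end score
    acts as the no-answer score; real spans are searched from position 1.
    """
    null_score = start_logits[0] + end_logits[0]
    score, start, end = best_span(start_logits[1:], end_logits[1:])
    if score > null_score + null_threshold:
        return (start + 1, end + 1)   # shift back to original positions
    return None                       # abstain: predict "no answer"


print(predict([0.1, 0.2, 3.0, 0.1], [0.2, 0.1, 0.5, 2.5]))  # (2, 3)
print(predict([5.0, 0.1, 0.2], [5.0, 0.3, 0.1]))            # None
```

Raising `null_threshold` makes the decoder abstain more often, which trades recall on answerable questions for fewer spurious answers on unanswerable ones.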

### XLNet and ALBERT (2019)

[XLNet](/wiki/xlnet) (Yang et al., 2019) introduced permutation-based language modeling to capture bidirectional context without the masking artifacts of BERT.[7] [ALBERT](/wiki/albert) (Lan et al., 2019) achieved comparable or better performance with substantially fewer parameters through cross-layer parameter sharing and factorized embedding parameterization.[8] Both models achieved F1 scores above 90% on SQuAD 2.0, exceeding the human F1 of 89.5%.
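
The effect of factorized embedding parameterization on parameter count can be seen with simple arithmetic. The vocabulary, hidden, and embedding sizes below are illustrative values assumed for this sketch and are not figures stated in this article.

```python
def embedding_params(vocab_size, hidden_size, embedding_size=None):
    """Count embedding parameters with or without a factorized projection."""
    if embedding_size is None:
        # One V x H matrix maps tokens straight into the hidden space.
        return vocab_size * hidden_size
    # Factorized: a small V x E lookup followed by an E x H projection.
    return vocab_size * embedding_size + embedding_size * hidden_size


V, H, E = 30_000, 4_096, 128   # assumed sizes for illustration
full = embedding_params(V, H)
factorized = embedding_params(V, H, E)
print(full, factorized, round(full / factorized, 1))
```

With these sizes the unfactorized table needs 122,880,000 parameters and the factorized version 4,364,288, because the large vocabulary dimension is only ever multiplied by the small embedding size.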

### DeBERTa (2020)

[DeBERTa](/wiki/deberta) (Decoding-enhanced BERT with Disentangled Attention) from Microsoft introduced disentangled attention, which uses separate vectors for content and position. DeBERTa-Large achieved 90.7% F1 on SQuAD 2.0, surpassing human performance.[9]

## Why was SQuAD so influential on NLP research?

SQuAD had a profound and lasting impact on NLP research in several ways:

### Standardized Evaluation

Before SQuAD, there was no widely accepted benchmark for reading comprehension. SQuAD provided a common evaluation framework that allowed researchers to directly compare different approaches. The public leaderboard created a competitive dynamic that accelerated progress and attracted attention from industry research labs.

### Driving Architectural Innovation

The SQuAD leaderboard pushed the development of key architectural innovations:

- **Attention mechanisms**: BiDAF, R-NET, and other models pioneered sophisticated attention designs specifically to perform well on SQuAD.[4]
- **Pre-training and fine-tuning**: BERT's success on SQuAD was one of the clearest demonstrations that pre-trained language models could achieve state-of-the-art results on downstream tasks, catalyzing the entire field toward the pre-train/fine-tune paradigm.[3]
- **Efficient architectures**: QANet showed that recurrence was not necessary for strong reading comprehension, foreshadowing the rise of transformer-only architectures.[6]

### Transfer Learning Benchmark

SQuAD also fed into multi-task evaluation suites: the QNLI task in [GLUE](/wiki/glue_benchmark) is derived from SQuAD question-paragraph pairs, although [SuperGLUE](/wiki/superglue) does not include SQuAD itself. Performance on SQuAD was routinely reported when introducing new [pre-trained language models](/wiki/pre-trained_models), making it a de facto litmus test for a model's reading comprehension ability. Models such as [RoBERTa](/wiki/roberta), [ELECTRA](/wiki/electra), [T5](/wiki/t5), and many others used SQuAD as a primary evaluation benchmark.

### Industry Applications

The techniques developed for SQuAD found direct applications in commercial products. Microsoft integrated SQuAD-derived models into Bing search. Google used similar reading comprehension capabilities in Google Search featured snippets. Customer service chatbots, document analysis tools, and legal tech platforms all benefited from the reading comprehension advances that SQuAD helped catalyze.

## How does SQuAD differ from other QA benchmarks?

SQuAD inspired and influenced the creation of numerous subsequent question answering datasets, each designed to address different aspects of language understanding.

| Dataset | Year | Size | Source | Key Difference from SQuAD |
|---|---|---|---|---|
| [SQuAD](/wiki/squad) 1.1 | 2016 | 107K QA pairs | Wikipedia | Extractive, all questions answerable |
| [TriviaQA](/wiki/triviaqa) | 2017 | 95K QA pairs | Trivia websites | Questions written independently of evidence documents |
| [Natural Questions](/wiki/natural_questions) (NQ) | 2019 | 300K+ questions | Google Search logs | Real user queries, long and short answers |
| [QuAC](/wiki/quac) | 2018 | 100K questions | Wikipedia | Conversational, multi-turn dialogue |
| [CoQA](/wiki/coqa) | 2018 | 127K questions | Multiple domains | Conversational, free-form answers |
| [HotpotQA](/wiki/hotpotqa) | 2018 | 113K QA pairs | Wikipedia | Multi-hop reasoning across multiple documents |
| [NewsQA](/wiki/newsqa) | 2017 | 100K+ QA pairs | CNN news articles | Longer contexts, news domain |

### Key Distinctions

**TriviaQA** differs from SQuAD in that questions were written by trivia enthusiasts without seeing the evidence documents. This means questions are more naturally phrased and less tied to specific passage wordings, creating a more realistic evaluation of open-domain question answering.

**Natural Questions** (Google, 2019) uses real queries from Google Search, making the questions reflect genuine information needs rather than questions crafted while reading a passage. NQ also distinguishes between long answers (a paragraph) and short answers (a span), adding complexity to the task.

**QuAC** and **CoQA** extend question answering into a conversational setting where questions are asked in sequence, and later questions may refer to earlier parts of the dialogue through coreference and ellipsis. This tests a model's ability to track conversational context, something SQuAD does not address.

**HotpotQA** requires models to reason across multiple documents to find an answer, testing multi-hop reasoning capabilities that go beyond the single-passage format of SQuAD.

## Limitations

Despite its enormous influence, SQuAD has several well-documented limitations:

### Extractive-Only Answers

All answers in SQuAD are spans extracted directly from the passage.[1] This means the dataset cannot evaluate a model's ability to generate abstractive answers, perform numerical reasoning, or synthesize information from multiple sources into a novel response.

### Passage-Dependent Question Generation

Because crowdworkers wrote questions while reading the passage, many SQuAD questions closely mirror the passage's phrasing. This creates an artificial scenario where lexical overlap between the question and the answer sentence is often a strong signal. Studies have shown that models can achieve high SQuAD scores partly through superficial pattern matching rather than deep language understanding.

### English-Only

SQuAD covers only English text. While several community efforts have translated SQuAD into other languages (for example, SQuAD-it for Italian, KorQuAD for Korean, and SberQuAD for Russian), the official benchmark remains monolingual.

### Wikipedia Domain

All passages come from Wikipedia, which has a distinctive encyclopedic writing style. Models trained on SQuAD may not generalize well to other domains such as biomedical literature, legal documents, technical manuals, or conversational text.

### Annotation Artifacts

Research has identified annotation artifacts in SQuAD. For instance, models can sometimes answer questions correctly by exploiting superficial heuristics, such as matching the type of the expected answer (e.g., a date, a person's name) to entities in the passage. This means that high SQuAD scores do not always indicate genuine comprehension.

### Solved Benchmark

By 2020, top models had surpassed human performance on both SQuAD 1.1 and SQuAD 2.0 by substantial margins. Modern [large language models](/wiki/large_language_model) such as [GPT-4](/wiki/gpt-4) achieve near-perfect scores on SQuAD, suggesting that the benchmark no longer differentiates among frontier models. This has shifted attention to more challenging benchmarks that test multi-hop reasoning, long-context understanding, and open-ended generation.

## Is SQuAD still relevant today?

Despite being largely "solved" by current models, SQuAD remains relevant for several reasons:

1. **Educational value**: SQuAD is widely used in NLP courses and tutorials as an accessible introduction to question answering and reading comprehension tasks. Its clear task formulation and well-defined metrics make it an excellent teaching tool.

2. **Baseline evaluation**: New model architectures and training methods often report SQuAD performance as a sanity check, even if the primary evaluation targets more challenging benchmarks.

3. **Fine-tuning resource**: Many practitioners use SQuAD as training data when building extractive QA systems, since the dataset is freely available under a CC-BY-SA-4.0 license and works well as a starting point for domain-specific fine-tuning.

4. **Historical significance**: SQuAD's leaderboard documented one of the most dramatic periods of progress in AI history. The trajectory from 51% F1 in 2016 to superhuman performance in 2018 on SQuAD 1.1 vividly illustrates the pace of advancement in deep learning for NLP.

5. **Methodological template**: The SQuAD data collection methodology, evaluation protocol, and leaderboard format have been replicated by dozens of subsequent benchmarks across multiple languages and domains.

SQuAD is freely available through [Hugging Face](/wiki/hugging_face) Datasets, TensorFlow Datasets, and the official SQuAD Explorer website maintained at Stanford. The dataset continues to serve as a foundation for question answering research and applications around the world.

## References

1. Rajpurkar, P., Zhang, J., Lopyrev, K., & Liang, P. (2016). "SQuAD: 100,000+ Questions for Machine Comprehension of Text." *Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP)*. https://arxiv.org/abs/1606.05250

2. Rajpurkar, P., Jia, R., & Liang, P. (2018). "Know What You Don't Know: Unanswerable Questions for SQuAD." *Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL)*. https://arxiv.org/abs/1806.03822

3. Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." *Proceedings of NAACL-HLT 2019*. https://arxiv.org/abs/1810.04805

4. Seo, M., Kembhavi, A., Farhadi, A., & Hajishirzi, H. (2017). "Bidirectional Attention Flow for Machine Comprehension." *Proceedings of ICLR 2017*. https://arxiv.org/abs/1611.01603

5. Chen, D., Fisch, A., Weston, J., & Bordes, A. (2017). "Reading Wikipedia to Answer Open-Domain Questions." *Proceedings of ACL 2017*. https://arxiv.org/abs/1704.00051

6. Yu, A. W., Dohan, D., Luong, M.-T., Zhao, R., Chen, K., Norouzi, M., & Le, Q. V. (2018). "QANet: Combining Local Convolution with Global Self-Attention for Reading Comprehension." *Proceedings of ICLR 2018*. https://arxiv.org/abs/1804.09541

7. Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R., & Le, Q. V. (2019). "XLNet: Generalized Autoregressive Pretraining for Language Understanding." *Advances in Neural Information Processing Systems 32 (NeurIPS 2019)*. https://arxiv.org/abs/1906.08237

8. Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., & Soricut, R. (2020). "ALBERT: A Lite BERT for Self-supervised Learning of Language Representations." *Proceedings of ICLR 2020*. https://arxiv.org/abs/1909.11942

9. He, P., Liu, X., Gao, J., & Chen, W. (2021). "DeBERTa: Decoding-enhanced BERT with Disentangled Attention." *Proceedings of ICLR 2021*. https://arxiv.org/abs/2006.03654

10. Wang, S. & Jiang, J. (2017). "Machine Comprehension Using Match-LSTM and Answer Pointer." *Proceedings of ICLR 2017*. https://arxiv.org/abs/1608.07905