SQuAD (Stanford Question Answering Dataset) is a large-scale reading comprehension benchmark created at Stanford University that has played a central role in advancing natural language processing (NLP) research. First released in 2016, SQuAD established extractive question answering as one of the most important evaluation tasks in the field. The dataset consists of over 100,000 question-answer pairs derived from Wikipedia articles, with answers that correspond to spans of text within a given passage. SQuAD drove the development of numerous neural architectures and pre-trained language models, and its competitive leaderboard became a proving ground where models from major research labs raced toward and eventually surpassed human-level performance.
Before SQuAD, research on machine reading comprehension was hampered by a lack of large, high-quality datasets. Earlier benchmarks such as MCTest (Richardson et al., 2013) were limited in scale, containing only a few hundred passages. Cloze-style datasets like the CNN/Daily Mail corpus (Hermann et al., 2015) and the Children's Book Test (Hill et al., 2015) offered greater scale but relied on automatically generated fill-in-the-blank questions that did not fully capture the complexity of natural language understanding.
The Stanford NLP group, led by Percy Liang, set out to create a dataset that combined three properties: large scale, natural language questions written by humans, and answers grounded directly in source text. The result was SQuAD, which provided the community with a benchmark that was both challenging enough to differentiate model capabilities and large enough to train data-hungry neural network models.
SQuAD 1.1 was introduced in the paper "SQuAD: 100,000+ Questions for Machine Comprehension of Text" by Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. The paper was published at the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP 2016). It quickly became one of the most cited papers in NLP, accumulating over 7,500 citations according to Semantic Scholar.
SQuAD 1.1 contains 107,785 question-answer pairs spread across 536 Wikipedia articles and 23,215 paragraphs. The dataset was split by article into a training set (80%), a development set (10%), and a test set (10%). The test set was kept hidden, and researchers could only evaluate on it through the official leaderboard submission system.
| Statistic | Count |
|---|---|
| Training set | ~87,600 examples |
| Development set | ~10,600 examples |
| Test set | ~9,500 examples |
| Total question-answer pairs | 107,785 |
| Wikipedia articles | 536 |
| Paragraphs | 23,215 |
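The dataset is distributed as a single nested JSON file: articles contain paragraphs, paragraphs contain question-answer pairs, and each answer records its text plus a character offset into the paragraph. A minimal sketch of walking that structure, using the field names from the official v1.1 release and a hand-written single-record file standing in for the real one:

```python
def iter_examples(squad_json):
    """Yield flat (title, context, question, answer_text, answer_start)
    tuples from the nested layout: data -> paragraphs -> qas -> answers."""
    for article in squad_json["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                for answer in qa["answers"]:
                    yield (article["title"], context, qa["question"],
                           answer["text"], answer["answer_start"])

# A tiny hand-written record in the same shape as the real file
# (normally obtained via json.load on train-v1.1.json).
toy = {
    "version": "1.1",
    "data": [{
        "title": "University_of_Notre_Dame",
        "paragraphs": [{
            "context": "It is a replica of the grotto at Lourdes, France where "
                       "the Virgin Mary reputedly appeared to Saint Bernadette "
                       "Soubirous in 1858.",
            "qas": [{
                "id": "toy-0001",  # hypothetical id, real ids are long hashes
                "question": "To whom did the Virgin Mary allegedly appear "
                            "in 1858 in Lourdes, France?",
                "answers": [{"text": "Saint Bernadette Soubirous",
                             "answer_start": 93}],
            }],
        }],
    }],
}

examples = list(iter_examples(toy))
title, context, question, text, start = examples[0]
# The answer is guaranteed to be a contiguous span of the context:
print(context[start:start + len(text)])  # Saint Bernadette Soubirous
```

The `answer_start` character offset is what makes the extractive format checkable: a loader can verify that every gold answer really is a substring of its passage.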
The task is framed as extractive question answering. Given a passage (a paragraph from a Wikipedia article) and a question about that passage, a model must identify the span of text within the passage that answers the question. The answer is always a contiguous substring of the passage, which can range from a single word to an entire sentence. This format makes evaluation straightforward because both the predicted and gold-standard answers are character spans that can be compared directly.
For example, given a passage about the University of Notre Dame and the question "To whom did the Virgin Mary allegedly appear in 1858 in Lourdes, France?", the correct answer span would be "Saint Bernadette Soubirous."
SQuAD was built using crowdsourcing through Amazon Mechanical Turk (MTurk). The construction process involved several steps:

1. Passage curation: 536 articles were randomly sampled from the top 10,000 English Wikipedia articles, and paragraphs shorter than 500 characters (along with images, figures, and tables) were discarded, yielding 23,215 paragraphs.
2. Question-answer collection: crowdworkers were shown a paragraph and asked to write up to five questions about it, highlighting the answer span in the paragraph for each question.
3. Additional answer collection: for questions in the development and test sets, at least two additional answers were collected from independent workers to account for legitimate variation in span boundaries.
This multi-step process ensured that questions were diverse, naturally phrased, and answerable from the given passage.
The researchers analyzed the types of reasoning required to answer SQuAD questions. They found that the questions span a range of difficulty levels:
| Reasoning Type | Description |
|---|---|
| Lexical variation (synonymy) | The question uses synonyms of words in the passage |
| Lexical variation (world knowledge) | Answering requires basic world knowledge |
| Syntactic variation | The question rephrases the passage using different syntax |
| Multi-sentence reasoning | The answer requires combining information from multiple sentences |
| Ambiguous | The question or passage is ambiguous |
While many SQuAD questions can be answered through simple pattern matching or paraphrase detection, a meaningful fraction requires more sophisticated reasoning, including coreference resolution and multi-sentence inference.
SQuAD 2.0 was introduced in the paper "Know What You Don't Know: Unanswerable Questions for SQuAD" by Pranav Rajpurkar, Robin Jia, and Percy Liang. It was published at the 56th Annual Meeting of the Association for Computational Linguistics (ACL 2018) in Melbourne, Australia.
A key limitation of SQuAD 1.1 was that every question had an answer in the provided passage. This meant that models could always extract some plausible-looking span, even when a question could not actually be answered from the context. In real-world applications, a question answering system must be able to recognize when it does not have enough information to provide a correct answer and abstain rather than guess. SQuAD 2.0 was designed to address this gap.
SQuAD 2.0 combines the existing 100,000+ answerable questions from SQuAD 1.1 with 53,775 new unanswerable questions. The unanswerable questions were written adversarially by crowdworkers who were instructed to craft questions that:

- are relevant to the given paragraph, rather than being off-topic; and
- have a plausible answer in the paragraph, i.e., a span of the same type the question asks for, even though that span does not actually answer the question.
Workers were shown examples from SQuAD 1.1 to calibrate the style and difficulty of their unanswerable questions. This adversarial design made the unanswerable questions difficult for models to distinguish from answerable ones, since superficial cues like entity overlap and question style were not reliable indicators.
| Split | Total Examples | Unanswerable Examples | Ratio |
|---|---|---|---|
| Training | 130,319 | 43,498 | ~2:1 answerable-to-unanswerable |
| Development | 11,873 | 5,945 | ~1:1 |
| Test | 8,862 | 4,332 | ~1:1 |
The training set maintains roughly a 2:1 ratio of answerable to unanswerable questions, while the development and test sets are balanced at approximately 1:1. This design choice encourages models to learn to handle both types of questions during training while providing a fair evaluation at test time.
The introduction of unanswerable questions dramatically increased difficulty. A DocQA model enhanced with ELMo embeddings that achieved 85.8% F1 on SQuAD 1.1 dropped to just 66.3% F1 on SQuAD 2.0, a decline of nearly 20 points. This large gap demonstrated that the ability to abstain from answering was a genuinely difficult capability that existing models lacked.
SQuAD uses two evaluation metrics: Exact Match (EM) and F1 score.
The Exact Match metric measures the percentage of predictions that exactly match one of the ground-truth answers after normalization (lowercasing, removing articles, removing punctuation, and stripping whitespace). It is a strict binary metric: a prediction receives a score of 1 if it matches a reference answer character for character, and 0 otherwise. For questions with multiple valid reference answers, the maximum EM score across all reference answers is taken.
The F1 score provides partial credit by measuring the token-level overlap between the prediction and the ground-truth answer. Treating both answers as bags of tokens, precision is the fraction of predicted tokens that appear in the reference, recall is the fraction of reference tokens that appear in the prediction, and F1 is their harmonic mean: F1 = 2 × (precision × recall) / (precision + recall).
As with EM, the maximum F1 over all reference answers is used for each question, and the final score is the average across all questions in the evaluation set.
The F1 metric is especially valuable because it gives partial credit when a model identifies the correct region of the passage but selects a span that is slightly too long or too short. For example, if the gold answer is "Saint Bernadette Soubirous" and the model predicts "Bernadette Soubirous," the prediction receives a high F1 score despite not being an exact match.
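Both metrics fit in a few lines of Python. The normalization below (lowercasing, stripping punctuation and the articles a/an/the, collapsing whitespace) follows the description above; the official evaluation script performs the same steps, though this sketch is not a drop-in replacement for it:

```python
import re
import string
from collections import Counter

def normalize_answer(s):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(gold))

def f1_score(prediction, gold):
    """Token-overlap F1 between normalized prediction and reference."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# Partial credit for a slightly-too-short span:
print(round(f1_score("Bernadette Soubirous", "Saint Bernadette Soubirous"), 2))  # 0.8
# EM fails but F1 stays high when spans differ by a function word:
print(round(f1_score("in 1858", "1858"), 2))  # 0.67
```

Per question, the maximum score over all reference answers is taken; per dataset, scores are averaged over all questions.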
For SQuAD 2.0, a predicted "no answer" for an unanswerable question is treated as correct, receiving an EM and F1 of 1. Conversely, predicting "no answer" for an answerable question receives a score of 0. This design ensures that models are penalized both for hallucinating answers when none exist and for failing to answer when they should.
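One way to fold the no-answer convention into per-question scoring, sketched with the empty string representing "no answer" and a simplified token-overlap F1 (normalization omitted for brevity):

```python
from collections import Counter

def token_f1(prediction, gold):
    """Token-overlap F1; answer normalization omitted for brevity."""
    pred, ref = prediction.split(), gold.split()
    num_same = sum((Counter(pred) & Counter(ref)).values())
    if num_same == 0:
        return 0.0
    p, r = num_same / len(pred), num_same / len(ref)
    return 2 * p * r / (p + r)

def squad2_score(prediction, gold_answers):
    """Score one SQuAD 2.0 question. An unanswerable question has no
    gold spans; credit is given only when the model also abstains."""
    if not gold_answers:          # unanswerable question
        return float(prediction == "")
    if prediction == "":          # abstained on an answerable question
        return 0.0
    return max(token_f1(prediction, g) for g in gold_answers)

print(squad2_score("", []))                   # 1.0 - correctly abstains
print(squad2_score("Saint Bernadette", []))   # 0.0 - hallucinated answer
print(squad2_score("", ["1858"]))             # 0.0 - failed to answer
```

The asymmetry is the point: there is no partial credit on unanswerable questions, so a model must make a hard abstain-or-answer decision.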
Human performance was estimated by treating one annotator's answer as the "prediction" and the remaining annotators' answers as the ground truth. This approach measures inter-annotator agreement rather than an individual's raw accuracy.
| Benchmark | EM | F1 |
|---|---|---|
| SQuAD 1.1 (test set) | 82.3% | 91.2% |
| SQuAD 2.0 (test set) | 86.8% | 89.5% |
The gap between EM and F1 in human performance reflects the fact that different annotators often select slightly different but equally valid answer spans. Two humans might disagree on whether the answer to a question is "1858" or "in 1858," leading to an exact match failure but a high F1 score.
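The leave-one-out estimate can be sketched as follows; the normalization here is deliberately minimal, and the official protocol differs in its details:

```python
def leave_one_out_em(annotations):
    """Estimate 'human performance' on one question: treat each
    annotator's answer as a prediction scored against the remaining
    annotators' answers, then average the per-annotator EM scores."""
    def norm(s):
        return " ".join(s.lower().replace(".", "").split())
    scores = []
    for i, pred in enumerate(annotations):
        refs = annotations[:i] + annotations[i + 1:]
        scores.append(max(float(norm(pred) == norm(r)) for r in refs))
    return sum(scores) / len(scores)

# Three annotators, one of whom included the preposition:
print(round(leave_one_out_em(["1858", "in 1858", "1858"]), 3))  # 0.667
```

Even though all three answers are equally valid, the EM estimate falls below 1.0, which is exactly why human EM trails human F1 in the table above.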
The SQuAD leaderboard, hosted at Stanford, became one of the most closely watched benchmarks in machine learning. The competitive leaderboard format encouraged rapid iteration and drew submissions from major research labs worldwide, including Google, Microsoft, Facebook AI Research, Alibaba, and many academic groups.
Progress on SQuAD 1.1 was remarkably fast. The initial logistic regression baseline published alongside the dataset achieved just 40.4% EM and 51.0% F1. Within months, neural models pushed performance far beyond this baseline.
| Model | Year | EM (Test) | F1 (Test) | Key Innovation |
|---|---|---|---|---|
| Logistic Regression (Rajpurkar et al.) | 2016 | 40.4% | 51.0% | Feature-engineered baseline |
| Match-LSTM (Wang & Jiang) | 2016 | 64.7% | 73.7% | Pointer networks for span extraction |
| BiDAF (Seo et al.) | 2016 | 68.0% | 77.3% | Bidirectional attention flow |
| R-NET (Microsoft Research Asia) | 2017 | 72.3% | 80.7% | Self-matching attention |
| DrQA (Chen et al., Facebook) | 2017 | 69.5% | 78.8% | Document retriever + reader pipeline |
| QANet (Yu et al., Google) | 2018 | 76.2% | 84.6% | Convolution + self-attention, no recurrence |
| BERT-Large (Devlin et al., Google) | 2018 | 85.1% | 91.8% | Pre-trained transformer fine-tuning |
| Human Performance | - | 82.3% | 91.2% | Inter-annotator agreement |
The progression illustrates a shift from traditional feature engineering to increasingly sophisticated neural architectures, culminating in the pre-training and fine-tuning paradigm introduced by BERT.
In early January 2018, before BERT was published, models from Microsoft Research Asia and Alibaba became the first to surpass human performance on the EM metric of SQuAD 1.1, with submissions days apart. Microsoft's r-net+ ensemble scored 82.65 EM, and Alibaba's SLQA+ scored 82.44 EM, both exceeding the human baseline of 82.3 EM. This milestone received widespread media coverage, though researchers were careful to note that surpassing humans on a single benchmark did not imply general reading comprehension ability.
Later in 2018, BERT-Large surpassed human performance on both EM and F1, achieving 85.1% EM and 91.8% F1 compared to the human scores of 82.3% EM and 91.2% F1. Microsoft noted that it had already integrated earlier versions of its SQuAD model into its Bing search engine, demonstrating practical applications of the technology.
SQuAD 2.0 proved significantly harder, and progress was initially slower. The best baseline at launch (DocQA + ELMo) scored only 63.4% EM and 66.3% F1, far below the human performance of 86.8% EM and 89.5% F1. Over the following years, models gradually closed the gap and eventually surpassed human-level scores.
| Model | Year | EM (Test) | F1 (Test) | Notes |
|---|---|---|---|---|
| BiDAF-No-Answer (baseline) | 2018 | 59.2% | 62.1% | Threshold-based abstention |
| DocQA + ELMo (baseline) | 2018 | 63.4% | 66.3% | Strongest baseline at launch |
| BERT-Large (Google) | 2018 | ~80.0% | ~83.1% | First pre-trained transformer on SQuAD 2.0 |
| XLNet-Large (Yang et al.) | 2019 | 87.9% | 90.6% | Autoregressive pre-training |
| ALBERT-xxlarge (Lan et al.) | 2019 | 88.1% | 90.9% | Parameter-efficient pre-training |
| Retrospective Reader (Zhang et al.) | 2020 | 88.4% | 90.9% | First single model to beat human EM and F1 |
| DeBERTa (He et al., Microsoft) | 2020 | 88.4% | 90.7% | Disentangled attention |
| IE-Net (RICOH, ensemble) | 2020 | 90.9% | 93.2% | Top of leaderboard |
| Human Performance | - | 86.8% | 89.5% | Inter-annotator agreement |
By 2020, the top ensemble models on the SQuAD 2.0 leaderboard exceeded human performance by more than 3 points on both EM and F1, demonstrating that the combination of large-scale pre-training, careful fine-tuning, and model ensembling could yield superhuman extractive QA performance on this benchmark.
Bidirectional Attention Flow (BiDAF), developed by Minjoon Seo and colleagues at the University of Washington and the Allen Institute for AI, introduced a multi-level attention mechanism that computes attention in both directions: from context to query and from query to context. BiDAF demonstrated that carefully designed attention could substantially improve reading comprehension performance, achieving 77.3% F1 on the SQuAD 1.1 test set.
Developed at Facebook AI Research by Danqi Chen and colleagues, DrQA combined a document retriever (using TF-IDF) with a neural document reader trained on SQuAD. While its primary contribution was open-domain question answering (retrieving relevant documents from all of Wikipedia before extracting answers), the reader component achieved competitive single-passage results of 69.5% EM and 78.8% F1 on SQuAD 1.1.
QANet, developed by Adams Wei Yu and colleagues at Google and Carnegie Mellon University, replaced recurrent neural networks with a combination of local convolutions and global self-attention. This made training parallelizable and significantly faster. QANet achieved 84.6% F1 on SQuAD 1.1 and represented the state of the art immediately before the release of BERT.
BERT (Bidirectional Encoder Representations from Transformers), introduced by Jacob Devlin and colleagues at Google AI, transformed the SQuAD landscape. By pre-training a deep bidirectional transformer on large text corpora using masked language modeling and next sentence prediction, BERT learned rich contextual representations that could be fine-tuned for downstream tasks with minimal architectural changes. On SQuAD 1.1, BERT-Large achieved 91.8% F1, surpassing human performance. On SQuAD 2.0, it achieved approximately 83.1% F1, a massive improvement over prior models.
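Span-based readers such as BiDAF and BERT reduce answer extraction to scoring every token as a potential span start and as a potential span end, then taking the argmax over legal pairs (start ≤ end, bounded length). A minimal decoding sketch with toy logits (invented for illustration, not produced by a real model):

```python
def best_span(start_logits, end_logits, max_len=30):
    """Return the (start, end) pair maximizing start_logit + end_logit
    subject to start <= end and a maximum span length."""
    best, best_score = (0, 0), float("-inf")
    for s, s_logit in enumerate(start_logits):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = s_logit + end_logits[e]
            if score > best_score:
                best_score, best = score, (s, e)
    return best, best_score

tokens = ["appeared", "to", "Saint", "Bernadette", "Soubirous", "in", "1858"]
start = [0.1, 0.0, 4.0, 0.2, 0.1, 0.0, 0.5]   # toy start logits
end   = [0.0, 0.1, 0.2, 0.3, 3.5, 0.1, 0.8]   # toy end logits
(s, e), _ = best_span(start, end)
print(" ".join(tokens[s:e + 1]))  # Saint Bernadette Soubirous
```

In BERT's fine-tuning setup, the two logit vectors come from a single linear layer applied to the final hidden states; the decoding step above is essentially all the task-specific machinery the span format requires.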
XLNet (Yang et al., 2019) introduced permutation-based language modeling to capture bidirectional context without the masking artifacts of BERT. ALBERT (Lan et al., 2019) achieved comparable or better performance with substantially fewer parameters through cross-layer parameter sharing and factorized embedding parameterization. Both models achieved F1 scores above 90% on SQuAD 2.0, closing in on human performance.
DeBERTa (Decoding-enhanced BERT with Disentangled Attention) from Microsoft introduced disentangled attention, which uses separate vectors for content and position. DeBERTa-Large achieved 90.7% F1 on SQuAD 2.0, surpassing human performance.
SQuAD had a profound and lasting impact on NLP research in several ways:
Before SQuAD, there was no widely accepted benchmark for reading comprehension. SQuAD provided a common evaluation framework that allowed researchers to directly compare different approaches. The public leaderboard created a competitive dynamic that accelerated progress and attracted attention from industry research labs.
The SQuAD leaderboard pushed the development of key architectural innovations, including attention mechanisms designed for reading comprehension (such as BiDAF's bidirectional attention flow and R-NET's self-matching attention), pointer networks for span extraction, and ultimately the pre-training and fine-tuning paradigm popularized by BERT.
SQuAD also shaped multi-task evaluation suites: the GLUE benchmark's QNLI task was derived directly from SQuAD by pairing questions with candidate answer sentences. Performance on SQuAD itself was routinely reported when introducing new pre-trained language models, making it a de facto litmus test for a model's reading comprehension ability. Models such as RoBERTa, ELECTRA, T5, and many others used SQuAD as a primary evaluation benchmark.
The techniques developed for SQuAD found direct applications in commercial products. Microsoft integrated SQuAD-derived models into Bing search. Google used similar reading comprehension capabilities in Google Search featured snippets. Customer service chatbots, document analysis tools, and legal tech platforms all benefited from the reading comprehension advances that SQuAD helped catalyze.
SQuAD inspired and influenced the creation of numerous subsequent question answering datasets, each designed to address different aspects of language understanding.
| Dataset | Year | Size | Source | Key Difference from SQuAD |
|---|---|---|---|---|
| SQuAD 1.1 | 2016 | 107K QA pairs | Wikipedia | Extractive, all questions answerable |
| TriviaQA | 2017 | 95K QA pairs | Trivia websites | Questions written independently of evidence documents |
| Natural Questions (NQ) | 2019 | 300K+ questions | Google Search logs | Real user queries, long and short answers |
| QuAC | 2018 | 100K questions | Wikipedia | Conversational, multi-turn dialogue |
| CoQA | 2018 | 127K questions | Multiple domains | Conversational, free-form answers |
| HotpotQA | 2018 | 113K QA pairs | Wikipedia | Multi-hop reasoning across multiple documents |
| NewsQA | 2017 | 100K+ QA pairs | CNN news articles | Longer contexts, news domain |
TriviaQA differs from SQuAD in that questions were written by trivia enthusiasts without seeing the evidence documents. This means questions are more naturally phrased and less tied to specific passage wordings, creating a more realistic evaluation of open-domain question answering.
Natural Questions (Google, 2019) uses real queries from Google Search, making the questions reflect genuine information needs rather than questions crafted while reading a passage. NQ also distinguishes between long answers (a paragraph) and short answers (a span), adding complexity to the task.
QuAC and CoQA extend question answering into a conversational setting where questions are asked in sequence, and later questions may refer to earlier parts of the dialogue through coreference and ellipsis. This tests a model's ability to track conversational context, something SQuAD does not address.
HotpotQA requires models to reason across multiple documents to find an answer, testing multi-hop reasoning capabilities that go beyond the single-passage format of SQuAD.
Despite its enormous influence, SQuAD has several well-documented limitations:
All answers in SQuAD are spans extracted directly from the passage. This means the dataset cannot evaluate a model's ability to generate abstractive answers, perform numerical reasoning, or synthesize information from multiple sources into a novel response.
Because crowdworkers wrote questions while reading the passage, many SQuAD questions closely mirror the passage's phrasing. This creates an artificial scenario where lexical overlap between the question and the answer sentence is often a strong signal. Studies have shown that models can achieve high SQuAD scores partly through superficial pattern matching rather than deep language understanding.
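The lexical-overlap signal is easy to demonstrate: a baseline that simply returns the passage sentence sharing the most word types with the question already locates the answer sentence for many SQuAD questions. A toy sketch (with a naive period-based sentence split):

```python
def overlap_sentence(question, passage):
    """Return the passage sentence sharing the most word types with
    the question -- a deliberately shallow heuristic."""
    q_words = set(question.lower().split())
    sentences = [s.strip() for s in passage.split(".") if s.strip()]
    return max(sentences, key=lambda s: len(q_words & set(s.lower().split())))

passage = ("The university opened in 1842. The Main Building's gold dome "
           "was completed later. Enrollment grew steadily after the war")
print(overlap_sentence("When did the university open?", passage))
# The university opened in 1842
```

That a bag-of-words heuristic gets this far is precisely the concern: high SQuAD scores can partly reflect surface matching rather than comprehension.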
SQuAD covers only English text. While several community efforts have translated SQuAD into other languages (for example, SQuAD-it for Italian, KorQuAD for Korean, and SberQuAD for Russian), the official benchmark remains monolingual.
All passages come from Wikipedia, which has a distinctive encyclopedic writing style. Models trained on SQuAD may not generalize well to other domains such as biomedical literature, legal documents, technical manuals, or conversational text.
Research has identified annotation artifacts in SQuAD. For instance, models can sometimes answer questions correctly by exploiting superficial heuristics, such as matching the type of the expected answer (e.g., a date, a person's name) to entities in the passage. This means that high SQuAD scores do not always indicate genuine comprehension.
By 2020, top models had surpassed human performance on both SQuAD 1.1 and SQuAD 2.0 by substantial margins. Modern large language models such as GPT-4 achieve near-perfect scores on SQuAD, suggesting that the benchmark no longer differentiates among frontier models. This has shifted attention to more challenging benchmarks that test multi-hop reasoning, long-context understanding, and open-ended generation.
Despite being largely "solved" by current models, SQuAD remains relevant for several reasons:
Educational value: SQuAD is widely used in NLP courses and tutorials as an accessible introduction to question answering and reading comprehension tasks. Its clear task formulation and well-defined metrics make it an excellent teaching tool.
Baseline evaluation: New model architectures and training methods often report SQuAD performance as a sanity check, even if the primary evaluation targets more challenging benchmarks.
Fine-tuning resource: Many practitioners use SQuAD as training data when building extractive QA systems, since the dataset is freely available under a CC-BY-SA-4.0 license and works well as a starting point for domain-specific fine-tuning.
Historical significance: SQuAD's leaderboard documented one of the most dramatic periods of progress in AI history. The trajectory from 51% F1 in 2016 to superhuman performance in 2018 on SQuAD 1.1 vividly illustrates the pace of advancement in deep learning for NLP.
Methodological template: The SQuAD data collection methodology, evaluation protocol, and leaderboard format have been replicated by dozens of subsequent benchmarks across multiple languages and domains.
SQuAD is freely available through Hugging Face Datasets, TensorFlow Datasets, and the official SQuAD Explorer website maintained at Stanford. The dataset continues to serve as a foundation for question answering research and applications around the world.