DROP (Discrete Reasoning Over Paragraphs)
Last reviewed
Jun 1, 2026
Sources
14 citations
Review status
Source-backed
Revision
v4 · 5,388 words
| DROP | |
|---|---|
| Overview | |
| Full name | Discrete Reasoning Over Paragraphs |
| Abbreviation | DROP |
| Description | A reading comprehension benchmark requiring discrete reasoning and mathematical operations over paragraphs |
| Release date | 2019-03-01 |
| Latest version | 1.0 |
| Benchmark updated | 2019-04 |
| Authors | Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, Matt Gardner |
| Organization | Allen Institute for AI, UC Irvine |
| Technical Details | |
| Type | Reading Comprehension, Discrete Reasoning, Mathematical Reasoning |
| Modality | Text |
| Task format | Extractive/Abstractive Question Answering |
| Number of tasks | 5 reasoning types |
| Total examples | 96,567 questions |
| Evaluation metric | F1 Score, Exact Match |
| Domains | Sports, History, Wikipedia articles |
| Languages | English |
| Performance | |
| Human performance | 96.42% F1 |
| Baseline | 32.7% F1 (2019) |
| SOTA score | 91.6% F1 |
| SOTA model | DeepSeek-V3 |
| SOTA date | 2024 |
| Saturated | No |
| Resources | |
| Website | Official website |
| Paper | Paper |
| GitHub | Repository |
| Dataset | Download |
| License | CC BY-SA 4.0 |
**DROP** (Discrete Reasoning Over Paragraphs) is a reading comprehension benchmark that requires artificial intelligence systems to perform discrete reasoning operations over textual content. Released in March 2019 by researchers from the Allen Institute for AI and the University of California, Irvine, DROP challenged the prevailing paradigm in reading comprehension by requiring systems to perform mathematical operations, counting, sorting, and comparison operations rather than simply extracting text spans[1]. The benchmark contains 96,567 questions created through an adversarial crowdsourcing process designed to expose weaknesses in contemporary reading comprehension systems.
DROP was presented at the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT) in Minneapolis, Minnesota[1]. At the time of its release, even the best existing models achieved only 32.7% F1, compared to human performance of 96.42% F1, highlighting a massive gap between machine and human reading comprehension when discrete reasoning is involved[1]. The dataset has since become one of the most widely cited benchmarks for evaluating numerical and logical reasoning over text, and it has driven the development of specialized architectures, neurosymbolic methods, and prompting strategies in natural language processing.
Before DROP, the dominant reading comprehension benchmarks, most notably the Stanford Question Answering Dataset (SQuAD), primarily evaluated a model's ability to locate and extract a span of text from a passage that answered a given question[2]. While SQuAD and similar datasets were valuable for advancing extractive question answering, researchers observed that models could achieve high scores by relying on superficial pattern matching, lexical overlap between the question and passage, and entity-type heuristics rather than genuinely understanding the text[1].
This observation raised a fundamental question: were reading comprehension models actually comprehending what they read, or were they exploiting statistical shortcuts? Several studies had shown that models with high SQuAD scores were easily misled when passages were adversarially modified, for example by inserting distracting sentences, suggesting that those scores did not require true understanding[3].
The authors of DROP identified a specific gap in existing benchmarks. Real-world reading comprehension often requires readers to perform operations over the information they extract. Consider a sports game summary: a human reader might need to count the number of touchdowns, subtract one team's score from another, determine which quarterback threw the longest pass, or compare statistics across multiple paragraphs. These operations require genuine understanding of the text's content and the ability to perform discrete reasoning steps that go beyond span extraction.
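DROP was designed to fill this gap with a benchmark in which a system must resolve the references in a question, possibly to several positions in the passage, and then perform discrete operations over them, such as addition, counting, or sorting[1].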
The passages in DROP are drawn from English Wikipedia, specifically from an October 2018 Wikipedia dump supplemented by contemporaneous web scraping[1]. The authors targeted passages that contained a "narrative sequence of events, particularly with a high proportion of numbers." Two categories of Wikipedia articles proved especially suitable:
| Source Category | Approximate Share | Content Characteristics |
|---|---|---|
| NFL game summaries | ~60% | Dense with statistics: yards gained, touchdowns scored, field goal distances, quarter-by-quarter scoring, and player performance figures |
| History articles | ~40% | Rich in dates, population figures, casualty counts, territorial measurements, and chronological sequences of events |
These source categories were selected because they naturally contain abundant quantitative information amenable to discrete reasoning. NFL game summaries, for instance, provide a structured narrative where multiple numerical facts (scores, yardage, time of possession) appear in close proximity, creating natural opportunities for arithmetic and comparison questions. History articles offer similar density of numerical data spread across longer temporal narratives[1].
The final dataset comprises approximately 6,700 passages (6,735 across the three splits), with an average passage length between 191 and 213 words.
DROP's question collection used an adversarial annotation protocol, a design choice that significantly improved the benchmark's difficulty and resistance to superficial shortcuts[1]. The process worked as follows:
Adversary selection. The authors trained a BiDAF (Bidirectional Attention Flow) model on SQuAD to serve as the automated adversary. BiDAF was chosen because it was a strong, well-understood reading comprehension baseline at the time.
Worker recruitment and training. Crowdworkers were recruited through Amazon Mechanical Turk, restricted to workers based in the United States. Workers were gradually filtered based on their annotation quality, retaining only high-performing annotators. Trained workers averaged roughly $10 per hour in compensation, with each HIT (Human Intelligence Task) paying $5 USD[1].
Question creation under adversarial constraint. Each worker received five random Wikipedia passages per task and was instructed to produce at least 12 question-answer pairs. The key constraint was that workers could only submit questions that the real-time BiDAF model failed to answer correctly. This adversarial setup ensured that the resulting questions could not be solved by simple span extraction, pushing workers to create questions requiring genuine reasoning.
Answer format restrictions. Workers could provide answers as text spans (from the passage or the question), dates, or numbers. Numeric answers were allowed only for questions that explicitly stated a specific unit of measurement, reducing ambiguity[1].
Validation. Additional annotators provided independent second answers for all development and test set questions. Inter-annotator agreement was measured using Cohen's kappa, yielding an overall score of 0.74. Agreement was highest for number answers (kappa = 0.81), followed by date answers (0.65) and span answers (0.62)[1].
Quality control. Both automatic checks and manual review were applied to ensure question quality and filter out ambiguous or poorly formed examples.
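The acceptance rule from the question creation step can be sketched as follows. This is a minimal illustration, not the authors' annotation code: `toy_adversary` is a hypothetical stand-in for the BiDAF model, and comparing answers by case-insensitive string equality is an assumption.

```python
from typing import Callable


def is_accepted(question: str, passage: str, gold_answer: str,
                adversary: Callable[[str, str], str]) -> bool:
    """Accept a question only if the adversary answers it incorrectly."""
    prediction = adversary(question, passage)
    return prediction.strip().lower() != gold_answer.strip().lower()


def toy_adversary(question: str, passage: str) -> str:
    """Hypothetical stand-in for BiDAF: returns the first number in the passage."""
    for token in passage.split():
        cleaned = token.strip(".,")
        if cleaned.isdigit():
            return cleaned
    return ""


passage = "Smith threw for 250 yards and Jones threw for 180 yards."

# A span-style question the adversary gets right is rejected.
easy = is_accepted("How many yards did Smith throw for?",
                   passage, "250", toy_adversary)

# A question that needs subtraction defeats the adversary and is accepted.
hard = is_accepted("How many more yards did Smith throw for than Jones?",
                   passage, "70", toy_adversary)
```

Here `easy` is `False` and `hard` is `True`, which mirrors how the constraint steers workers away from questions answerable by span extraction.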
The total annotation budget, including validation, was approximately $60,000 USD[1].
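The agreement figures reported for the validation step are Cohen's kappa values, defined as kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The sketch below computes it for two annotators over categorical labels. How the authors mapped free-form answers to comparable labels is not described here, so the label lists in the example are purely hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


annotator_1 = ["x", "x", "y", "y"]
annotator_2 = ["x", "x", "y", "x"]
kappa = cohens_kappa(annotator_1, annotator_2)  # observed 0.75, expected 0.5 -> 0.5
```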
The final dataset is divided into standard train, development, and test splits:
| Split | Questions | Passages | Purpose |
|---|---|---|---|
| Training | 77,409 | 5,565 | Model training |
| Development | 9,536 | 582 | Hyperparameter tuning and ablation studies |
| Test | 9,622 | 588 | Final evaluation (hidden answers) |
| Total | 96,567 | 6,735 | Complete dataset |
An analysis of 350 sampled questions from the development set revealed the following distribution of reasoning types[1]:
| Reasoning Type | Percentage | Description | Example |
|---|---|---|---|
| Subtraction | 28.8% | Computing the difference between two numbers | "How many more yards did Player A gain than Player B?" |
| Selection | 19.4% | Choosing a specific entity based on criteria | "Who scored the longest touchdown?" |
| Comparison | 18.2% | Comparing quantities across passage segments | "Which team had more first downs?" |
| Counting | 16.5% | Counting entities or events meeting criteria | "How many touchdowns were scored in the first half?" |
| Addition | 11.7% | Summing numerical values | "How many total yards did both teams gain?" |
| Sorting | 11.7% | Ordering items by an attribute | "What was the shortest field goal kicked?" |
| Coreference resolution | 3.7% | Resolving pronouns to answer questions | "Who did he throw the pass to?" |
| Other arithmetic | 3.2% | Non-standard mathematical operations | Various multi-step calculations |
| Set of spans | 6.0% | Identifying multiple answer spans | "Which players scored touchdowns?" |
| Other | 6.8% | Miscellaneous reasoning patterns | Various |
The percentages sum to more than 100% (126.0% across the ten categories) because some questions require multiple reasoning types.
The distribution of answer types reflects the benchmark's emphasis on numerical reasoning[1]:
| Answer Type | Percentage | Description | Example |
|---|---|---|---|
| Numbers | 66.1% | Numeric answers (counts, differences, sums) | "42", "7" |
| Person names | 12.2% | Names of individuals | "Tom Brady" |
| Other words/phrases | 9.4% | General text spans | "New England Patriots" |
| Entity names | 7.3% | Named entities other than persons | "Gillette Stadium" |
| Verb phrases | 3.5% | Action descriptions | "intercepted a pass" |
| Dates | 1.5% | Temporal information | "September 15, 2013" |
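Whatever its semantic category, each gold answer is stored in one of three forms: a number, a date, or one or more text spans. The sketch below turns such a record into comparable strings. The field names (`number`, `date` with `day`, `month`, `year`, and `spans`) are an assumption about the record layout, and the helper `answer_to_strings` is hypothetical.

```python
def answer_to_strings(answer: dict) -> tuple[str, list[str]]:
    """Return (answer_type, answer_strings) for one DROP-style answer record."""
    number = answer.get("number")
    if number not in (None, ""):
        return "number", [str(number)]
    spans = answer.get("spans") or []
    if spans:
        return "spans", list(spans)
    date = answer.get("date") or {}
    # Keep only the date fields that are actually filled in.
    parts = [date.get(key, "") for key in ("day", "month", "year")]
    parts = [part for part in parts if part]
    if parts:
        return "date", [" ".join(parts)]
    return "empty", []


record = {"number": "", "date": {"day": "15", "month": "September", "year": "2013"}, "spans": []}
kind, strings = answer_to_strings(record)  # ("date", ["15 September 2013"])
```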
DROP uses two primary evaluation metrics, both of which operate at the token level after answer normalization[1]:
| Metric | Description | Calculation |
|---|---|---|
| F1 Score | Token-level overlap between predicted and gold answers | Harmonic mean of precision and recall over the bag of words |
| Exact Match (EM) | Whether the predicted answer exactly matches the gold answer after normalization | Binary score (0 or 1) |
F1 is the primary metric because it provides partial credit for answers that are close but not identical to the gold standard. EM serves as a stricter secondary metric.
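The two metrics can be sketched as below. This is a simplified illustration rather than the official evaluation script: it only lowercases and splits on whitespace, and it does not cover multi-span answers or any special handling of numbers.

```python
from collections import Counter


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the token sequences are identical after lowercasing, else 0.0."""
    return float(prediction.lower().split() == gold.lower().split())


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall over bags of words."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most as often as it
    # appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# Partial credit: precision 2/3 and recall 1 give F1 = 0.8, while EM is 0.
f1 = token_f1("New England Patriots", "England Patriots")
em = exact_match("New England Patriots", "England Patriots")
```

For a single-token numeric answer such as "7" against a gold answer of "8", both metrics are 0, so partial credit mainly benefits span answers.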
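The evaluation script normalizes predicted and gold answers before comparing them[1][4]. It lowercases the text, removes punctuation and English articles, collapses whitespace, and converts numeric strings to a canonical form.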
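A normalization routine of this kind can be sketched as follows. It is an approximation written for illustration, not a copy of the official script: the regular expressions, the order of the steps, and the choice to render numbers through `float` are all assumptions.

```python
import re
import string

_ARTICLES = re.compile(r"\b(a|an|the)\b")
_NUMBER = re.compile(r"[-+]?\d[\d,]*(?:\.\d+)?")


def _normalize_token(token: str) -> str:
    """Canonicalize one token: numbers become floats, other text loses punctuation."""
    if _NUMBER.fullmatch(token):
        return str(float(token.replace(",", "")))
    stripped = "".join(ch for ch in token if ch not in string.punctuation)
    if _NUMBER.fullmatch(stripped):
        return str(float(stripped))
    return stripped


def normalize_answer(text: str) -> str:
    """Lowercase, canonicalize tokens, drop articles, and collapse whitespace."""
    tokens = [_normalize_token(token) for token in text.lower().split()]
    without_articles = _ARTICLES.sub(" ", " ".join(tokens))
    return " ".join(without_articles.split())


# "2,000" and "2000" normalize to the same string, so they compare as equal.
a = normalize_answer("The 2,000 yards")
b = normalize_answer("2000 yards")
```

Both `a` and `b` come out as `"2000.0 yards"`. Spelled-out numbers such as "two thousand" are not handled by this sketch.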
Human performance on DROP was established by having the paper's authors collectively answer 560 questions from the test set. These answers were then evaluated using the same metrics applied to machine predictions[1]. The results:
| Metric | Score |
|---|---|
| F1 | 96.42% |
| Exact Match | 94.09% |
This approach provides a more realistic human ceiling than simply holding out one crowdworker annotation, since the authors were highly familiar with the task requirements.
The original paper evaluated several existing reading comprehension and semantic parsing models on DROP, revealing dramatically poor performance across the board[1]:
| Model | Exact Match | F1 | Notes |
|---|---|---|---|
| Majority heuristic | 0.07 | 1.44 | Always predicts the most common answer |
| Semantic parsing (best) | 8.98 | 11.45 | Traditional structured reasoning |
| BiDAF | 24.75 | 27.49 | Span extraction baseline |
| QANet | 25.50 | 28.36 | Efficient attention-based model |
| QANet + ELMo | 27.08 | 29.67 | Contextualized embeddings |
| BERT | 29.45 | 32.70 | Best existing model at the time |
| NAQANet (proposed) | 44.07 | 47.01 | Numerically-aware model introduced with DROP |
| Human | 94.09 | 96.42 | Author-established ceiling |
The most striking result was BERT's performance. On SQuAD, BERT had achieved F1 scores above 90%. On DROP, that same model dropped by more than 50 absolute F1 points, falling to just 32.70%[1]. This dramatic decline confirmed the authors' hypothesis: existing reading comprehension models relied heavily on span extraction shortcuts and lacked genuine numerical reasoning ability.
DROP introduced the Numerically-Aware QANet (NAQANet) as a new model designed to handle discrete reasoning operations[1]. NAQANet extends the QANet architecture, which was the best-performing published model on SQuAD 1.1 at the time, with four specialized answer prediction heads:
Passage span head. The standard QANet encoder produces start and end position probabilities through feed-forward networks over encoder outputs. This handles traditional extractive questions.
Question span head. Computes a passage context vector using attention, then predicts span boundaries in the question conditioned on passage information. This allows the model to extract answers that appear in the question itself (for example, "Who scored more, Team A or Team B?" where the answer is "Team A").
Count head. A multi-class classifier over integers 0 through 9, implemented as a feed-forward network operating on the passage representation. This handles questions like "How many touchdowns were scored?"
Arithmetic expression head. Extracts all numbers from the passage and assigns a sign (+, -, or 0) to each number through a softmax classifier. The final answer is computed by summing all numbers multiplied by their assigned signs. This handles addition and subtraction questions.
Answer type selector. A classifier that chooses which of the four heads to use for a given question, based on concatenated passage and question representations.
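The final computation of the arithmetic expression head can be sketched as a signed sum. In the model the signs come from a per-number softmax classifier. Here they are supplied by hand, and the function name `signed_sum` is hypothetical.

```python
def signed_sum(numbers: list[float], signs: list[int]) -> float:
    """Sum passage numbers weighted by a sign in {+1, -1, 0} per number."""
    if len(numbers) != len(signs):
        raise ValueError("one sign is required per number")
    if any(sign not in (-1, 0, 1) for sign in signs):
        raise ValueError("signs must be +1, -1, or 0")
    return sum(number * sign for number, sign in zip(numbers, signs))


# Passage numbers: 250 yards, 180 yards, 3 touchdowns.
# "How many more yards ...?" selects +250, -180 and ignores the 3.
answer = signed_sum([250, 180, 3], [1, -1, 0])  # 70
```

Assigning sign 0 lets the head ignore numbers that are irrelevant to the question, so a single mechanism covers both addition and subtraction.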
NAQANet was trained using weak supervision through marginal likelihood maximization over all arithmetic expressions that evaluate to the correct answer. In practice, training was limited to addition and subtraction involving two numbers, which covered the majority of arithmetic questions in the dataset[1].
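The weak supervision can be illustrated by enumerating every two-number addition or subtraction that yields the gold answer and then summing the model's probability over those candidates. This is a sketch under stated assumptions: the function names are hypothetical, the sign probabilities are made-up values rather than model outputs, and treating the per-number signs as independent is an assumption.

```python
from itertools import combinations, product


def matching_sign_assignments(numbers: list[float], target: float) -> list[tuple[int, ...]]:
    """All sign vectors using exactly two numbers that evaluate to the target."""
    matches = []
    for i, j in combinations(range(len(numbers)), 2):
        for sign_i, sign_j in product((1, -1), repeat=2):
            if abs(sign_i * numbers[i] + sign_j * numbers[j] - target) < 1e-6:
                signs = [0] * len(numbers)
                signs[i], signs[j] = sign_i, sign_j
                matches.append(tuple(signs))
    return matches


def marginal_likelihood(sign_probs: list[dict[int, float]],
                        assignments: list[tuple[int, ...]]) -> float:
    """Sum, over candidate assignments, of the product of per-number sign probabilities."""
    total = 0.0
    for signs in assignments:
        probability = 1.0
        for probs, sign in zip(sign_probs, signs):
            probability *= probs[sign]
        total += probability
    return total


# Three different expressions reach 70: 250-180, 180-110, and 110-40.
candidates = matching_sign_assignments([250, 180, 110, 40], 70)
```

Because only the answer is annotated, the training signal cannot tell which of the three expressions is the intended one, which is why the likelihood is marginalized over all of them.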
NAQANet achieved 47.01% F1 on the test set, a substantial improvement over BERT's 32.70% but still far below human performance at 96.42%.
An analysis of 100 errors made by NAQANet revealed the following failure categories[1]:
| Error Category | Percentage | Description |
|---|---|---|
| Arithmetic operations | 51% | Incorrect computation or number selection |
| Counting | 30% | Failed to count entities accurately |
| Domain knowledge / common sense | 23% | Lacked sports or historical knowledge |
| Coreference | 6% | Failed to resolve pronoun references |
| Combined reasoning types | 40% | Questions requiring multiple reasoning steps |
The high rate of combined errors (40%) indicated that the most difficult questions were those requiring the integration of multiple reasoning skills within a single inference chain.
DROP spurred a wave of research into models capable of numerical and discrete reasoning over text. The progression of architectures and techniques can be organized into five broad phases.
The first generation of post-DROP models added purpose-built numerical reasoning components to existing encoders.
NumNet (Ran et al., 2019) introduced a numerically-aware graph neural network that constructed a graph connecting all numbers in the passage and question. Edges in this graph encoded numerical relationships (greater than, less than, equal), allowing the model to reason about relative magnitudes. NumNet achieved 64.56% EM on the DROP test set, a significant leap over NAQANet[5].
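The kind of comparison graph described above can be sketched as typed edge lists over number indices. This illustrates only the graph construction, with the three relation names taken from the description here. The function `build_comparison_edges` is hypothetical, and the message passing that a graph neural network would run over these edges is omitted.

```python
def build_comparison_edges(numbers: list[float]) -> dict[str, list[tuple[int, int]]]:
    """Directed edges (i, j) between number mentions, typed by how i compares to j."""
    edges: dict[str, list[tuple[int, int]]] = {"greater": [], "less": [], "equal": []}
    for i, a in enumerate(numbers):
        for j, b in enumerate(numbers):
            if i == j:
                continue
            if a > b:
                edges["greater"].append((i, j))
            elif a < b:
                edges["less"].append((i, j))
            else:
                edges["equal"].append((i, j))
    return edges


# Three field goal distances mentioned in a passage: 23, 41, and 23 yards.
graph = build_comparison_edges([23, 41, 23])
```

In this example the mention at index 1 has "greater" edges to both other mentions, which is the structural cue a model would need for a question about the longest field goal.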
NumNet+ extended this approach with a more expressive graph structure and improved training procedure, further improving accuracy on numerical questions.
GenBERT (Geva et al., 2020) took a different approach by augmenting BERT-Base with five additional feed-forward neural networks specialized for numerical operations. GenBERT achieved 72.4% F1 on DROP, demonstrating that injecting numerical priors into pretrained language models could yield strong gains[6].
The second phase saw the adoption of heterogeneous graph neural networks that could represent richer relationships between entities, numbers, and events in the passage.
QDGAT (Chen et al., 2020), the Question Directed Graph Attention Network, constructed a heterogeneous directed graph whose nodes included both numbers and entities from the passage. Two kinds of edges were defined: type-specific edges connecting numbers of the same category (for example, all yardage figures), and co-occurrence edges linking entities and numbers that appeared in the same sentence. QDGAT used question-directed attention to guide the graph reasoning process toward the information most relevant to the question. Published at EMNLP 2020, QDGAT with an ALBERT language module achieved state-of-the-art results on DROP at the time[7].
HGN (Heterogeneous Graph Network) models further explored multi-relational graph structures for reasoning, though their impact on DROP specifically was more incremental.
A third line of research combined neural networks with symbolic reasoning by generating executable programs as intermediate representations.
NeRd (Chen et al., 2020), the Neural Symbolic Reader, was presented at ICLR 2020. NeRd paired a BERT-based reader with an LSTM-based programmer that generated compositional programs (for example, "subtract(find_num('yards'), find_num('yards'))") which were then executed against the passage. This approach achieved 81.7% F1 and 78.3% EM on the DROP test set, representing a major advance[8]. The key insight was that explicit program generation forced the model to decompose complex questions into interpretable reasoning steps.
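Executing a compositional program of this kind can be sketched with a tiny interpreter. The nested-tuple program format and the operator names below form a hypothetical mini-language invented for illustration. It is not NeRd's actual domain-specific language, and the step of generating the program from the question is left out.

```python
def execute(program, passage_numbers: list[float]) -> float:
    """Recursively evaluate a nested-tuple program against the passage numbers."""
    if isinstance(program, (int, float)):
        return float(program)  # literal constant
    op, *args = program
    if op == "num":
        # Look up the i-th number mentioned in the passage.
        return float(passage_numbers[args[0]])
    values = [execute(arg, passage_numbers) for arg in args]
    if op == "add":
        return sum(values)
    if op == "subtract":
        return values[0] - values[1]
    if op == "count":
        return float(len(values))
    if op == "max":
        return max(values)
    if op == "min":
        return min(values)
    raise ValueError(f"unknown operation: {op!r}")


numbers = [250, 180, 3]
difference = execute(("subtract", ("num", 0), ("num", 1)), numbers)  # 70.0
```

Because the program is explicit, an incorrect answer can be traced to a specific step, such as the wrong number being selected, rather than to an opaque prediction.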
OPERA (Zhou et al., 2022) took an operation-pivoted approach, using lightweight symbolic operations as neural modules. OPERA inherited the interpretability advantages of semantic parsing methods while maintaining the flexibility of neural approaches. It was published at NAACL 2022[9].
The rise of large pretrained generative models, particularly T5 and BART, brought a paradigm shift. Rather than designing specialized architectures, researchers fine-tuned general-purpose sequence-to-sequence models on DROP.
NT5 (Yang et al., 2021) pre-trained T5 on synthetic numeracy corpora before fine-tuning on DROP. This two-stage training improved T5-Small's F1 from 45.90 to 70.83, demonstrating that numerical pre-training could substantially benefit downstream reasoning[10].
UnifiedQA (Khashabi et al., 2020) fine-tuned T5 across multiple QA formats simultaneously, achieving competitive DROP results while maintaining strong performance on other reading comprehension benchmarks.
These generative approaches showed that with sufficient scale and appropriate training data, general-purpose models could match or exceed specialized architectures, foreshadowing the dominance of large language models in Phase 5.
The introduction of GPT-4, Claude, and other frontier LLMs dramatically shifted the performance landscape on DROP. These models achieve strong results through few-shot prompting and chain-of-thought reasoning, without any task-specific fine-tuning.
GPT-4 (OpenAI, 2023) achieved 80.9% F1 on DROP with 3-shot prompting. Notably, the GPT-4 technical report showed DROP as a benchmark where GPT-4 did not surpass the existing state of the art, which had been set by a model trained specifically for the benchmark[11].
PaLM 2 (Google, 2023) reported scores around 83% F1 on DROP, slightly exceeding GPT-4 on this particular task.
As of late 2024, the following models represent the top performers on DROP, as tracked by independent benchmarking platforms[12]:
| Rank | Model | F1 Score | Organization | Key Approach |
|---|---|---|---|---|
| 1 | DeepSeek-V3 | 91.6 | DeepSeek | Mixture-of-experts, 671B parameters |
| 2 | Claude 3.5 Sonnet (June 2024) | 87.1 | Anthropic | Constitutional AI, advanced reasoning |
| 2 | Claude 3.5 Sonnet (October 2024) | 87.1 | Anthropic | Updated version, same DROP score |
| 4 | GPT-4 Turbo | 86.0 | OpenAI | Optimized GPT-4 variant |
| 5 | Amazon Nova Pro | 85.4 | Amazon | Enterprise-focused LLM |
| 6 | Llama 3.1 405B Instruct | 84.8 | Meta | Open-weight dense model |
| 7 | GPT-4o | 83.4 | OpenAI | Multimodal optimized model |
| 8 | Claude 3.5 Haiku | 83.1 | Anthropic | Lightweight Claude variant |
| 8 | Claude 3 Opus | 83.1 | Anthropic | Previous flagship model |
| 10 | GPT-4 | 80.9 | OpenAI | Original GPT-4 |
| - | Human performance | 96.42 | - | Author-established ceiling |
The gap between the best model (DeepSeek-V3 at 91.6% F1) and human performance (96.42% F1) has narrowed to about 4.8 points. The remaining gap likely reflects challenges in complex multi-step reasoning, coreference resolution, and questions requiring domain knowledge.
The trajectory from the original 2019 baselines to the 2024 state of the art illustrates the rapid pace of progress in NLP reasoning:
| Year | Best Model | F1 Score | EM Score | Improvement over Previous | Key Innovation |
|---|---|---|---|---|---|
| 2019 | BERT (baseline) | 32.70 | 29.45 | - | Pretrained language model |
| 2019 | NAQANet | 47.01 | 44.07 | +14.31 | Numerically-aware answer heads |
| 2019 | NumNet | ~68 | 64.56 | +21 | Graph neural network for numbers |
| 2020 | GenBERT | 72.4 | - | +4.4 | Numerical pre-training for BERT |
| 2020 | QDGAT | ~75 | ~71 | +2.6 | Question-directed graph attention |
| 2020 | NeRd | 81.7 | 78.3 | +6.7 | Neural symbolic program synthesis |
| 2021 | T5-large (fine-tuned) | ~80 | - | ~-1.7 | Generative pre-trained transformer |
| 2023 | GPT-4 | 80.9 | - | +0.9 | Large language model, few-shot |
| 2024 | Claude 3.5 Sonnet | 87.1 | ~85 | +6.2 | Advanced instruction-following |
| 2024 | DeepSeek-V3 | 91.6 | - | +4.5 | Mixture-of-experts at scale |
Research driven by DROP has produced several important technical innovations that extend well beyond this single benchmark:
NAQANet, NumNet, and their successors demonstrated that standard neural architectures lack built-in capacity for arithmetic. By adding explicit numerical reasoning modules (arithmetic heads, graph neural networks, and calculator integrations), these models could perform operations that would otherwise require symbolic computation. This insight has influenced the design of models for financial analysis, scientific reasoning, and data interpretation tasks[5].
QDGAT and related models showed that representing passage content as a graph, with numbers and entities as nodes and their relationships as edges, enables more structured reasoning than purely sequential processing. This approach has been adopted in multi-hop question answering and knowledge graph reasoning research[7].
NeRd and OPERA demonstrated the value of generating explicit programs as an intermediate reasoning step. Rather than directly predicting an answer, these models produce interpretable computation traces (for example, "subtract(find('yards', 'Player A'), find('yards', 'Player B'))") that can be executed and verified. This paradigm has influenced research on tool use in LLMs and interpretable AI[8][9].
The strong results of large language models on DROP with few-shot and chain-of-thought prompting supported the idea that eliciting step-by-step reasoning can unlock mathematical capabilities in large models. Chain-of-thought prompting, introduced by Wei et al. (2022), has become a foundational technique in modern LLM evaluation and deployment[13].
Some approaches to DROP integrate external tools such as calculators or symbolic math engines. Rather than relying on the neural network to perform arithmetic internally, these systems extract numbers and operations from the text, then delegate computation to a reliable external tool. This approach foreshadowed the broader "tool use" paradigm now common in LLM applications.
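The delegation step can be sketched with a small calculator that evaluates an arithmetic expression without running arbitrary code. This is an illustration only: it assumes the expression string has already been produced from the passage by some upstream model, and the function name `calculate` is hypothetical.

```python
import ast
import operator

_OPERATORS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def calculate(expression: str) -> float:
    """Evaluate +, -, *, / over numeric literals; reject everything else."""

    def _eval(node: ast.AST) -> float:
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return float(node.value)
        if isinstance(node, ast.BinOp) and type(node.op) in _OPERATORS:
            return _OPERATORS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -_eval(node.operand)
        raise ValueError(f"unsupported expression: {expression!r}")

    return _eval(ast.parse(expression, mode="eval"))


yard_difference = calculate("250 - 180")  # 70.0
```

Parsing into a syntax tree and whitelisting node types keeps the tool deterministic and avoids passing model-generated text to `eval`.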
In December 2023, the Hugging Face team discovered serious evaluation issues when they added DROP to the Open LLM Leaderboard. The investigation, conducted jointly with EleutherAI and Zeno, revealed that the overwhelming majority of models scored below 10% F1, a result that was clearly inconsistent with known model capabilities[4][14].
The root cause was traced to the evaluation script's use of "." (period) as the end-of-generation stop token. This created two cascading failures:
Floating-point answers were systematically broken. When the gold answer was a decimal number like 12.25, the model's generation was truncated at the period, producing "12" instead of "12.25". Approximately 10% of DROP questions have floating-point gold answers, and models achieved 0% accuracy on all of them[14].
High-quality models were penalized for generating context. Better models tended to generate answers in a natural format (for example, "42" followed by a line break and further text such as "The next passage discusses..."), mimicking the few-shot prompt format. The evaluation script included all generated tokens in the bag-of-words comparison, so the additional context words diluted the F1 score. Paradoxically, higher-quality models received lower scores[14].
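The truncation failure is easy to reproduce. The sketch below cuts a generation at the first occurrence of a stop string. The function `truncate_at_stop` is hypothetical, and the newline stop shown for contrast is only an illustration, not a statement of the fix that was adopted.

```python
def truncate_at_stop(generation: str, stop: str) -> str:
    """Return the text before the first occurrence of the stop string."""
    index = generation.find(stop)
    return generation if index == -1 else generation[:index]


generation = "12.25\n\nThe next passage discusses..."

# Stopping at "." cuts the decimal answer in half.
broken = truncate_at_stop(generation, ".")   # "12"

# Stopping at a newline keeps the full number and drops the trailing context.
intact = truncate_at_stop(generation, "\n")  # "12.25"
```

With "." as the stop string, no decimal gold answer can ever be matched, regardless of model quality.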
The Zeno team's analysis found that more than 50% of evaluation examples needed to be re-run to produce correct results. Fixing the issue would have required approximately 8 years of GPU time to re-evaluate all models. Rather than publishing unreliable scores, the Hugging Face team removed DROP from the Open LLM Leaderboard entirely[14].
This incident highlighted a broader lesson for the NLP community: benchmark evaluation scripts, not just the benchmarks themselves, require careful validation and ongoing maintenance.
Beyond the leaderboard incident, several other evaluation challenges have been documented:
| Issue | Description | Impact |
|---|---|---|
| Tokenization inconsistency | Different tokenizers split answers differently, affecting bag-of-words computation | Score variations of 1-3% F1 between evaluation frameworks |
| Number formatting ambiguity | Valid representations like "2,000" vs "2000" vs "two thousand" may not all be recognized | Occasional false negatives |
| Multiple valid answers | Some questions legitimately have more than one correct answer (for example, both "Player A" and "A") | Underestimated model performance |
| Date format variation | Dates can be expressed in multiple valid formats | Scoring inconsistencies for date-type answers |
Like many crowdsourced datasets, DROP contains annotation artifacts that models can potentially exploit.
These artifacts may allow models to achieve some performance gains through pattern matching rather than genuine reasoning, though the adversarial collection process mitigates this concern to a significant degree.
DROP has significantly influenced multiple research areas within NLP:
| Research Area | Impact | Notable Follow-up Work |
|---|---|---|
| Numerical reasoning in NLP | Catalyzed an entire subfield of mathematical reasoning over text | NumNet, GenBERT, NT5, MathQA |
| Multi-hop reasoning | Demonstrated the importance of combining information from multiple passage locations | HotpotQA extensions, multi-step QA |
| Neurosymbolic AI | Validated the value of combining neural perception with symbolic computation | NeRd, OPERA, program synthesis for QA |
| Benchmark design | Popularized adversarial annotation as a benchmark construction methodology | Adversarial NLI, Dynabench |
| LLM evaluation methodology | Exposed pitfalls in automated evaluation of generative models | Evaluation script reforms, leaderboard redesigns |
The original DROP paper has been cited over 1,000 times as of 2025, making it one of the most influential reading comprehension papers of the 2019-2020 era. It is routinely included in survey papers on question answering, numerical reasoning, and NLP benchmarking.
DROP's adversarial collection methodology influenced the design of subsequent benchmarks. The idea that crowdworkers should create examples that defeat a baseline model has been adopted by projects like Adversarial NLI (ANLI) and the Dynabench platform. This approach produces harder, more robust datasets that are less susceptible to superficial shortcuts.
DROP exists within a broader ecosystem of reading comprehension and reasoning benchmarks. The following table situates DROP relative to related datasets:
| Benchmark | Year | Focus | Relation to DROP |
|---|---|---|---|
| SQuAD | 2016 | Extractive span-based QA | Predecessor; DROP was designed to address SQuAD's limitations |
| SQuAD 2.0 | 2018 | Span extraction + unanswerable questions | Added unanswerable questions but still span-based |
| DROP | 2019 | Discrete reasoning over text | The benchmark described in this article |
| HotpotQA | 2018 | Multi-hop reasoning across documents | Complementary; focuses on cross-document reasoning |
| IIRC | 2020 | Incomplete information retrieval + reasoning | Extends multi-hop reasoning with retrieval |
| MathQA | 2019 | Math word problems with operation programs | More focused on pure mathematical problem-solving |
| NumGLUE | 2022 | Suite of eight numerical reasoning tasks | Uses DROP as one component; broader numerical evaluation |
| TAT-QA | 2021 | Table and text question answering | Extends numerical reasoning to hybrid tabular/text settings |
Technologies and techniques developed for DROP have found practical applications beyond academic benchmarking:
Financial analysis. Extracting numerical data from earnings reports, computing year-over-year growth, and comparing financial metrics across companies requires exactly the kind of discrete reasoning DROP evaluates.
Sports analytics. Automated analysis of game summaries, player statistics comparison, and season-level aggregation directly mirrors DROP's primary domain.
Legal document analysis. Contracts and legal texts contain numerical clauses (payment terms, deadlines, penalty amounts) that require arithmetic reasoning to interpret correctly.
Healthcare informatics. Clinical trial reports contain statistical results that must be compared, aggregated, and interpreted through discrete reasoning.
Educational assessment. Automated tutoring systems use reading comprehension with numerical reasoning to generate and evaluate math word problems embedded in textual contexts.
Several research directions continue to build on DROP's foundation:
Compositional reasoning. Breaking complex multi-step operations into primitive reasoning steps that can be individually verified and combined.
Robust evaluation. Developing evaluation frameworks that handle the normalization, formatting, and tokenization challenges exposed by the Open LLM Leaderboard incident.
Domain extension. Expanding discrete reasoning benchmarks beyond sports and history to domains like science, finance, and medicine, where numerical reasoning over text is equally critical.
Explainable reasoning. Generating not just correct answers but interpretable reasoning traces that humans can verify, building on the neurosymbolic approaches pioneered by NeRd and OPERA.
Multilingual discrete reasoning. Extending DROP-style evaluation to languages other than English, where numerical expressions and reasoning patterns differ.
Dynamic benchmarking. Creating continuously updated question sets that prevent overfitting to a fixed test distribution, addressing DROP's static nature.