| DROP | |
|---|---|
| Overview | |
| Full name | Discrete Reasoning Over Paragraphs |
| Abbreviation | DROP |
| Description | A reading comprehension benchmark requiring discrete reasoning and mathematical operations over paragraphs |
| Release date | 2019-03-01 |
| Latest version | 1.0 |
| Benchmark updated | 2019-04 |
| Authors | Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, Matt Gardner |
| Organization | Allen Institute for AI, UC Irvine |
| Technical Details | |
| Type | Reading Comprehension, Discrete Reasoning, Mathematical Reasoning |
| Modality | Text |
| Task format | Extractive/Abstractive Question Answering |
| Number of tasks | 5 reasoning types |
| Total examples | approximately 96,000 questions |
| Evaluation metric | F1 Score, Exact Match |
| Domains | Sports, History, Wikipedia articles |
| Languages | English |
| Performance | |
| Human performance | 96.0% F1 |
| Baseline | 32.7% F1 (2019) |
| SOTA score | 87.1% F1 |
| SOTA model | Claude 3.5 Sonnet |
| SOTA date | 2024 |
| Saturated | No |
| Resources | |
| Website | Official website |
| Paper | Paper |
| GitHub | Repository |
| Dataset | Download |
| License | CC BY-SA 4.0 |
DROP (Discrete Reasoning Over Paragraphs) is a reading comprehension benchmark that requires artificial intelligence systems to perform discrete reasoning operations over textual content. Released in March 2019 by researchers from the Allen Institute for AI and UC Irvine[1], DROP challenged the prevailing paradigm in reading comprehension by requiring systems to perform mathematical operations, counting, sorting, and comparison operations rather than simply extracting text spans. The benchmark contains approximately 96,000 questions created adversarially to expose weaknesses in contemporary reading comprehension systems.
DROP represents a significant advancement in evaluating genuine comprehension and reasoning capabilities in AI systems. Unlike traditional reading comprehension benchmarks that primarily test span extraction abilities, DROP requires models to understand paragraph content comprehensively and perform discrete operations including addition, subtraction, counting, sorting, and comparison. This design philosophy ensures that systems must develop true understanding rather than relying on superficial pattern matching[1].
The development of DROP addressed a critical limitation in existing reading comprehension evaluation: systems could score highly by matching surface patterns and extracting spans without performing any genuine reasoning over the text.
DROP organizes questions into five primary reasoning categories[1]:
| Category | Percentage | Description | Example Operation |
|---|---|---|---|
| **Addition/Subtraction** | ~25% | Mathematical operations on numbers in text | "How many more yards did X gain than Y?" |
| **Counting** | ~20% | Counting entities or events | "How many touchdowns were scored in the game?" |
| **Selection** | ~20% | Choosing specific items based on criteria | "Who scored the longest touchdown?" |
| **Minimum/Maximum** | ~18% | Finding extreme values | "What was the shortest field goal?" |
| **Comparison** | ~17% | Comparing quantities or attributes | "Which team had more first downs?" |
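To make these categories concrete, the sketch below shows, in miniature, the kind of discrete operation an addition/subtraction question demands. The passage, question, and number-extraction heuristic are invented for illustration and are far simpler than what real DROP systems require:

```python
import re

# Illustrative passage in the style of DROP's NFL game summaries (invented).
passage = ("Smith rushed for 112 yards in the first half, "
           "while Jones managed only 68 yards.")
question = "How many more yards did Smith gain than Jones?"

# A toy heuristic: pull every number out of the passage, then apply the
# discrete operation the question calls for. Real systems must also decide
# WHICH numbers are relevant, which is the hard part.
numbers = [int(n) for n in re.findall(r"\d+", passage)]  # [112, 68]

# The "Addition/Subtraction" category: the answer is not a span in the text.
answer = numbers[0] - numbers[1]
print(answer)  # 44 -- note that "44" never appears in the passage
```

The key property this illustrates is that the correct answer often cannot be extracted from the passage at all; it must be computed.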
DROP supports multiple answer formats to accommodate different reasoning operations:
| Answer Type | Frequency | Description | Example |
|---|---|---|---|
| **Number** | ~45% | Numeric values with optional units | "42 yards", "7 touchdowns" |
| **Text Span** | ~40% | Extracted from passage or question | "New England Patriots" |
| **Date** | ~15% | Temporal information | "September 15, 2018" |
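In the released JSON, each answer record carries slots for all three types, with one populated per question. A simplified sketch of the shape of one record (the passage id, text, and query id here are invented, and field details may differ slightly between releases):

```python
# A simplified sketch of one DROP record, reflecting the three answer types
# in the table above. One of "number", "spans", or "date" is populated
# per answer; everything shown here is illustrative, not real data.
example = {
    "nfl_0001": {
        "passage": "Hoping to rebound from their loss ...",
        "qa_pairs": [
            {
                "question": "How many more yards did X gain than Y?",
                "answer": {
                    "number": "42",  # number answers are stored as strings
                    "spans": [],     # text-span answers would go here
                    "date": {"day": "", "month": "", "year": ""},
                },
                "query_id": "...",
            }
        ],
    }
}
```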
The dataset is divided into standard splits:
| Split | Number of Questions | Number of Passages | Purpose |
|---|---|---|---|
| **Training** | ~77,400 | ~6,700 | Model training |
| **Development** | ~9,540 | ~950 | Hyperparameter tuning |
| **Test** | ~9,500+ | ~950 | Final evaluation (hidden) |
| **Total** | 96,567 | ~8,600 | Complete dataset |
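For experimentation, the train and development splits can be loaded through the Hugging Face `datasets` library (the test split is hidden, as noted above). A minimal loading sketch, assuming the `ucinlp/drop` dataset id and noting that field names can differ across versions:

```python
from datasets import load_dataset

# Load DROP from the Hugging Face hub (dataset id assumed; the hidden test
# split is not distributed, so only train/validation are available here).
drop = load_dataset("ucinlp/drop")

print(drop)                     # splits and their sizes
sample = drop["train"][0]
print(sample["passage"][:200])  # passage text
print(sample["question"])       # question text
# Gold answer field; the exact name may vary by dataset version.
print(sample["answers_spans"])
```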
DROP employed a sophisticated adversarial collection methodology[1]:
1. **Baseline System**: BiDAF model trained on SQuAD used as adversary
2. **Worker Instructions**: Crowdworkers instructed to create questions the baseline cannot answer
3. **Question Validation**: Multiple workers validate each question-answer pair
4. **Iterative Refinement**: Questions refined based on validation feedback
5. **Quality Control**: Automatic and manual checks ensure question quality
The passages in DROP are drawn from:
| Source | Percentage | Content Type | Characteristics |
|---|---|---|---|
| **Sports Game Summaries** | ~60% | NFL game descriptions | Rich in numbers, statistics, events |
| **Wikipedia History** | ~40% | Historical articles | Dates, quantities, comparisons |
These sources were chosen for their natural abundance of quantitative information and discrete facts amenable to mathematical reasoning.
DROP uses two primary evaluation metrics:
| Metric | Description | Calculation | Use Case |
|---|---|---|---|
| **F1 Score** | Token-level overlap | Harmonic mean of precision and recall | Primary metric |
| **Exact Match (EM)** | Complete answer match | Binary exact match after normalization | Secondary metric |
The evaluation script performs extensive normalization[2] before comparing predictions with gold answers, including lowercasing, removal of articles and punctuation, and whitespace cleanup.
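As a concrete illustration, here is a simplified sketch of that normalization together with the two metrics. It mirrors the spirit of the official evaluation, including taking the best score over multiple valid gold answers (the set-based mitigation noted in the table below), but omits details such as special numeric handling; it is not the official implementation:

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Simplified DROP-style answer normalization."""
    text = text.lower()
    text = re.sub(r"\b(a|an|the)\b", " ", text)  # drop articles
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())                # collapse whitespace

def exact_match(prediction: str, gold: str) -> float:
    """Binary exact match after normalization."""
    return float(normalize(prediction) == normalize(gold))

def token_f1(prediction: str, gold: str) -> float:
    """Token-level F1: harmonic mean of precision and recall."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def best_scores(prediction: str, golds: list[str]) -> tuple[float, float]:
    """Score against every valid gold answer and keep the best match."""
    return (max(exact_match(prediction, g) for g in golds),
            max(token_f1(prediction, g) for g in golds))

print(best_scores("42 yards", ["42 yards", "42"]))  # (1.0, 1.0)
```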
Several technical challenges have been identified:
| Challenge | Description | Impact | Mitigation |
|---|---|---|---|
| **Tokenization Issues** | Different tokenizers affect scoring | Score variations | Standardized preprocessing |
| **Number Formatting** | Various valid number representations | False negatives | Comprehensive normalization |
| **Multiple Valid Answers** | Some questions have multiple correct answers | Underestimated performance | Set-based evaluation |
| Rank | Model | F1 Score | EM Score | Key Innovation |
|---|---|---|---|---|
| 1 | Claude 3.5 Sonnet | 87.1 | ~85 | Advanced reasoning |
| 2 | Gemini 1.5 Pro | 78.9 | ~76 | Long context understanding |
| 3 | GPT-4 Turbo | ~77 | ~74 | Improved mathematical reasoning |
| 4 | GPT-4o | ~75 | ~72 | Optimized architecture |
| - | Human Performance | 96.0 | 94.1 | Expert baseline |
DROP performance has improved dramatically over time:
| Year | Best Model | F1 Score | Improvement | Key Development |
|---|---|---|---|---|
| 2019 | BiDAF (baseline) | 32.7 | - | Initial baseline |
| 2019 | NAQANet | 47.0 | +14.3 | Numerical reasoning modules |
| 2020 | QDGAT | 70.6 | +23.6 | Graph attention networks |
| 2021 | T5-large | 79.1 | +8.5 | Pretrained transformers |
| 2023 | GPT-4 | ~82 | +3 | Large language models |
| 2024 | Claude 3.5 Sonnet | 87.1 | +5.1 | Enhanced reasoning |
Research on DROP has led to several important technical innovations:
| Technique | Impact | Description | Representative Models |
|---|---|---|---|
| **Numerical Reasoning Modules** | High | Specialized components for mathematical operations | NAQANet, NumNet |
| **Graph Neural Networks** | Medium | Representing passage structure as graphs | QDGAT |
| **Program Synthesis** | High | Generating executable programs for reasoning | NeRd, OPERA |
| **Chain-of-Thought** | High | Step-by-step reasoning generation | GPT-4, Claude |
| **Tool Use** | Medium | Calculator and symbolic math integration | Various LLMs |
Architectures successful on DROP have progressed through several distinct phases:
1. **Early Neural Models** (2019): BiDAF with numerical heads
2. **Graph-Based Models** (2020): QDGAT, HGN
3. **Pretrained Transformers** (2021): T5, BART with task-specific fine-tuning
4. **Large Language Models** (2023+): GPT-4, Claude with few-shot prompting (see the prompt sketch below)
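As an illustration of the few-shot prompting stage, here is a minimal chain-of-thought prompt builder for a DROP-style question. The exemplar passage and wording are invented, and no particular model API is assumed:

```python
# A minimal chain-of-thought prompt for a DROP-style question. The exemplar
# demonstrates step-by-step arithmetic before the final answer; the model's
# completion would be parsed for the text after "Answer:".
EXEMPLAR = """\
Passage: The Broncos kicked field goals of 23 and 48 yards.
Question: How many more yards was the longest field goal than the shortest?
Reasoning: The field goals were 23 and 48 yards. The longest is 48 and the
shortest is 23. 48 - 23 = 25.
Answer: 25
"""

def build_prompt(passage: str, question: str) -> str:
    return (
        "Answer the question using the passage. Think step by step, "
        "then give the final answer on its own line.\n\n"
        f"{EXEMPLAR}\n"
        f"Passage: {passage}\n"
        f"Question: {question}\n"
        "Reasoning:"
    )

print(build_prompt(
    "Smith rushed for 112 yards, while Jones managed 68 yards.",
    "How many more yards did Smith gain than Jones?",
))
```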
| Limitation | Description | Impact |
|---|---|---|
| **Domain Specificity** | Heavy focus on sports and history | Limited generalization |
| **Annotation Artifacts** | Patterns in crowdsourced questions | Potential shortcuts |
| **Evaluation Issues** | Normalization and tokenization problems | Scoring inconsistencies |
| **Static Nature** | Fixed question set | Risk of overfitting |
Several of these technical issues, particularly the normalization and tokenization problems in the evaluation script noted above, have been documented by the research community.
DROP has significantly influenced reading comprehension research:
| Area | Impact | Development |
|---|---|---|
| **Numerical Reasoning** | Spurred research in mathematical NLP | NumNet, MathQA |
| **Multi-hop Reasoning** | Advanced multi-step reasoning research | HotpotQA follow-up |
| **Program Synthesis** | Integration of symbolic reasoning | Neural program synthesis |
| **Evaluation Methods** | More comprehensive evaluation metrics | Beyond span extraction |
| Benchmark | Year | Focus | Relation to DROP |
|---|---|---|---|
| SQuAD | 2016 | Span extraction | Predecessor, simpler |
| DROP | 2019 | Discrete reasoning | Current benchmark |
| IIRC | 2020 | Multi-hop reasoning | Complementary focus |
| NumGLUE | 2022 | Numerical reasoning | Specialized follow-up |
Technologies developed for DROP, such as numerical reasoning modules and program synthesis approaches, have found applications well beyond the benchmark itself.
Current research directions inspired by DROP include:
1. **Compositional Reasoning**: Breaking complex operations into primitive steps
2. **Symbolic Integration**: Combining neural and symbolic approaches
3. **Explainable Reasoning**: Generating interpretable reasoning chains
4. **Robust Evaluation**: Addressing normalization and scoring issues
5. **Domain Extension**: Expanding beyond sports and history
DROP has fundamentally advanced the field of reading comprehension by demonstrating that true understanding requires more than pattern matching and span extraction. The benchmark's requirement for discrete mathematical and logical operations over text pushed the development of models with genuine reasoning capabilities. The improvement from 32.7% to 87.1% F1 score represents not just numerical progress but qualitative advances in how AI systems process and reason about textual information.
While current models have made significant progress, the gap to human performance (96% F1) indicates remaining challenges in achieving robust, generalizable reasoning over text. DROP continues to serve as a valuable benchmark for developing and evaluating reading comprehension systems that can perform practical, real-world information processing tasks.