# BoolQ

> Source: https://aiwiki.ai/wiki/boolq
> Updated: 2026-06-25
> Categories: AI Benchmarks, Natural Language Processing
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

BoolQ (Boolean Questions) is a [natural language processing](/wiki/natural_language_processing) [benchmark](/wiki/benchmark) [dataset](/wiki/dataset) of 15,942 naturally occurring yes/no [question answering](/wiki/question_answering) examples, each pairing a real Google search query with a Wikipedia passage and a True/False answer [1]. It was introduced in the 2019 paper "BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions" by Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova of Google AI Language, published at the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2019) and released as arXiv:1905.10044 [1]. BoolQ is one of the eight tasks in the [SuperGLUE](/wiki/superglue) benchmark, and the original paper reported that the best model reached 80.4% accuracy against 90% for human annotators, leaving a clear gap that made the dataset a standard test of reading comprehension and inference for language models [1][2].

The paper's central, oft-quoted claim is that yes/no questions drawn from real users are deceptively hard. As the abstract states: "They often query for complex, non-factoid information, and require difficult entailment-like inference to solve" [1]. This is why the work is titled around the "surprising difficulty" of a task that sounds trivial.

## What is BoolQ?

BoolQ is a reading-comprehension dataset for yes/no questions. Each example gives a model a short Wikipedia paragraph (the evidence passage) and a single yes/no question, and the model must output a boolean answer (yes/True or no/False) supported by that passage [1]. Unlike extractive datasets that ask a model to highlight a text span, BoolQ requires the model to reduce the entire passage to one bit of information, which often demands inference rather than copying [1].

The dataset was built by [Google](/wiki/google) AI Language researchers and contains 15,942 examples in total [1]. Because the questions come from anonymized search logs rather than from annotators following instructions, they reflect genuine information needs and a wide spread of difficulty [1].

## Background and Motivation

Yes/no questions represent one of the most common forms of information-seeking behavior. When people search the web, a significant portion of their queries can be answered with a simple "yes" or "no." Despite this prevalence, yes/no question answering received relatively little attention in [NLP](/wiki/nlp) research prior to BoolQ. Most reading comprehension benchmarks, such as [SQuAD](/wiki/squad), focused on extractive question answering, where models must identify a text span that answers the question. Other benchmarks tested natural language inference through artificially constructed sentence pairs.

The authors of BoolQ observed that naturally occurring yes/no questions present a different and often more difficult challenge than extractive QA or synthetic inference tasks. When people ask yes/no questions in real search scenarios, the questions tend to be more complex, requiring reasoning beyond simple word matching or paraphrase detection [1]. This observation motivated the creation of a dedicated benchmark that could capture the true difficulty of boolean question answering in a naturalistic setting.

Before BoolQ, yes/no questions were sometimes treated as a byproduct of other tasks or filtered out entirely from QA datasets. The BoolQ paper demonstrated that these questions deserve focused study because they require a distinct combination of reading comprehension, inference, and world knowledge [1].

## How was BoolQ built?

### Data Collection Pipeline

BoolQ follows a data collection pipeline adapted from Google's Natural Questions (NQ) project [1][8]. The process begins with anonymized, aggregated queries submitted to the Google search engine. From this pool of real user queries, the researchers applied heuristic filters to identify queries that are likely to be yes/no questions. This approach differs from many other [NLP](/wiki/nlp) benchmarks where annotators are prompted to write questions, because BoolQ's questions reflect genuine information needs rather than researcher-designed tasks [1].

The key steps in the data collection process are:

1. **Query sampling**: Anonymized search queries were sampled from Google's search logs. These represent real questions that actual users typed into the search engine.

2. **Yes/no filtering**: Heuristic rules were applied to identify queries that can be interpreted as yes/no questions. Questions typically begin with words like "is," "are," "do," "does," "can," "was," "will," and similar auxiliaries.

3. **Manual filtering**: The candidate questions were manually reviewed to ensure they are comprehensible and unambiguous. Questions that were unclear, ill-formed, or could not reasonably be answered with "yes" or "no" were removed.

4. **Passage selection**: For each question, annotators identified a relevant Wikipedia article and selected a paragraph from that article containing enough information to determine the answer.

5. **Answer annotation**: Human annotators read the selected passage and provided a boolean answer (yes or no) to the question based on the passage content.
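
The yes/no filtering step can be illustrated with a small prefix check. This is only a sketch: the paper's actual heuristic rules are not reproduced here, and the `YES_NO_PREFIXES` tuple and the `looks_like_yes_no_question` helper are hypothetical names seeded with the auxiliaries listed in step 2.

```python
# Hypothetical indicator words, taken from the examples named in step 2.
# The real BoolQ filter may use a different or longer list.
YES_NO_PREFIXES = ("is", "are", "do", "does", "can", "was", "will")


def looks_like_yes_no_question(query: str) -> bool:
    """Return True if the first word of the query is an indicator word."""
    tokens = query.strip().lower().split()
    # Compare whole tokens so that "island ..." does not match "is".
    return bool(tokens) and tokens[0] in YES_NO_PREFIXES


queries = [
    "is france the same timezone as the uk",
    "who wrote harry potter",
    "does ethanol take more energy to make than it produces",
]
candidates = [q for q in queries if looks_like_yes_no_question(q)]
print(candidates)
```

A filter like this only produces candidates; the manual review in step 3 is what removes queries that start with an auxiliary but are still ambiguous or unanswerable.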

### Annotation Process

The annotation process involved human workers who reviewed each question-passage pair. Multiple annotators independently evaluated the same questions, and disagreements were resolved through consensus. Annotators received formal training to maintain consistency across the dataset. Each annotated example in BoolQ consists of four fields:

| Field | Type | Description |
|-------|------|-------------|
| question | String | A naturally occurring yes/no question (typically 20 to 100 characters) |
| passage | String | A paragraph from a Wikipedia article (typically 35 to 4,720 characters) |
| answer | Boolean | True (yes) or False (no) |
| title | String | The title of the Wikipedia article (optional additional context) |

The use of real search queries rather than researcher-prompted questions is a defining characteristic of BoolQ. By sampling from actual information-seeking behavior, the dataset captures a wider range of question complexity and topic diversity than datasets built through crowdsourcing prompts alone [1].

## Dataset Statistics

### How big is BoolQ and how is it split?

The full BoolQ dataset contains 15,942 examples divided into three splits [1]:

| Split | Examples | Labels |
|-------|----------|--------|
| Training | 9,427 | Provided |
| Development (validation) | 3,270 | Provided |
| Test | 3,245 | Withheld |

The training and development splits are publicly available with labeled answers. The test split answers are withheld and used for official evaluation through the SuperGLUE leaderboard [1][2].

### Answer Distribution

The dataset has a moderately imbalanced answer distribution. Approximately 62% of the answers are "yes" (True), while 38% are "no" (False) [1]. This means a naive baseline that always predicts "yes" would achieve 62% [accuracy](/wiki/accuracy), establishing the majority-class baseline for the task [1].
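
As a rough illustration of how the majority-class baseline follows from the label distribution, the sketch below computes it for a toy label list built to mirror the approximate 62/38 split. The `majority_baseline_accuracy` helper is a hypothetical name, and the toy labels are not real BoolQ data.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Return the most frequent label and the accuracy of always predicting it."""
    counts = Counter(labels)
    majority_label, majority_count = counts.most_common(1)[0]
    return majority_label, majority_count / len(labels)


# Toy labels mirroring the approximate 62% yes / 38% no split (not real data).
toy_labels = [True] * 62 + [False] * 38
label, baseline = majority_baseline_accuracy(toy_labels)
print(label, baseline)
```

The same function could be applied to the `answer` values of a labeled split to check the baseline on the actual data.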

### Question Characteristics

The questions in BoolQ cover a wide range of topics drawn from Wikipedia, including geography, history, science, entertainment, sports, law, and technology. Common question patterns include:

- "Is [X] the same as [Y]?"
- "Do [X] speak [language]?"
- "Can [X] do [Y]?"
- "Was [X] based on [Y]?"
- "Does [X] have [property]?"
- "Is [X] a type of [Y]?"

## Example Questions

The following table shows representative examples from the BoolQ dataset:

| Question | Passage (excerpt) | Answer |
|----------|--------------------|--------|
| Do Iran and Afghanistan speak the same language? | Persian, also known by its endonym Farsi, is a Western Iranian language... It is the official language of Iran, Afghanistan (officially known as Dari), and Tajikistan... | Yes |
| Is Harry Potter and the Escape from Gringotts a roller coaster ride? | Harry Potter and the Escape from Gringotts is an indoor steel roller coaster... at Universal Studios Florida... | Yes |
| Does ethanol take more energy to make than it produces? | According to a 2005 study... the energy balance is actually positive... corn ethanol produces 67% more energy than it takes to produce... | No |
| Is Elder Scrolls Online the same as Skyrim? | As with other games in The Elder Scrolls series, the game is set on the continent of Tamriel. The game takes place in... roughly 1,000 years before the events of The Elder Scrolls V: Skyrim... | No |
| Is France the same timezone as the UK? | France uses Central European Time (UTC+01:00)... the UK uses Greenwich Mean Time (UTC+00:00)... | No |

These examples illustrate several properties of BoolQ questions. Some can be answered by straightforward paraphrase detection (the roller coaster question). Others require multi-step reasoning, background knowledge, or careful interpretation of the passage (the ethanol question).

## Why is BoolQ surprisingly hard?

One of the central findings of the BoolQ paper is that naturally occurring yes/no questions are significantly more difficult than expected [1]. In the authors' words, the questions "often query for complex, non-factoid information, and require difficult entailment-like inference to solve" [1]. To quantify this, the authors performed a manual analysis of question types and found the following distribution of reasoning requirements:

| Reasoning Type | Percentage | Description |
|---------------|------------|-------------|
| Paraphrase | 38.7% | The answer can be determined by matching or rephrasing words between the question and passage |
| Inferential | ~30% | Requires drawing inferences from implicit information in the passage |
| World knowledge | ~15% | Requires background knowledge not explicitly stated in the passage |
| Complex/multi-step | ~16% | Requires combining multiple pieces of information or multi-step reasoning |

Only 38.7% of the questions can be answered through simple paraphrase matching between the question and passage [1]. The remaining 61.3% require more complex reasoning, including inferential reasoning from implicit information, integration of world knowledge, and multi-step logical deduction. This distribution explains why BoolQ is more challenging than many [natural language inference](/wiki/natural_language_understanding) benchmarks, where surface-level pattern matching can achieve higher accuracy.

The difficulty of BoolQ questions stems from their natural origin. When people ask yes/no questions in a search engine, they are not constrained by instructions from researchers. Their questions reflect genuine uncertainty and often involve nuanced relationships that cannot be resolved by simple keyword overlap.

## Evaluation

### Metric

BoolQ uses [accuracy](/wiki/accuracy) as its sole evaluation metric [1][2]. Each prediction is either correct (matching the gold-standard boolean answer) or incorrect. The overall accuracy is computed as the percentage of correctly answered questions in the evaluation set.
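
A minimal sketch of the metric, assuming predictions and gold answers are given as parallel lists of booleans (the `accuracy` function name and the toy values are illustrative, not part of any official evaluation script):

```python
def accuracy(predictions, gold):
    """Percentage of predictions that match the gold boolean answers."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty evaluation set")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return 100.0 * correct / len(gold)


# Toy example: two of four predictions match the gold answers.
score = accuracy([True, False, True, True], [True, True, True, False])
print(score)
```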

### What accuracy did the original BoolQ models reach?

The original BoolQ paper (Clark et al., 2019) established several baselines [1]:

| Model / Baseline | Dev Accuracy | Test Accuracy |
|-----------------|-------------|---------------|
| Majority class (always "yes") | 62.0% | 62.0% |
| [BERT](/wiki/bert)-large (no transfer) | ~71% | ~71% |
| [BERT](/wiki/bert)-large + MultiNLI transfer | 79.4% | 80.4% |
| Human annotators | -- | 90.0% |

The majority-class baseline of 62% reflects the answer imbalance in the dataset [1]. [BERT](/wiki/bert)-large [3], when trained only on BoolQ data, scored substantially above that baseline but still lagged well behind human performance. The best configuration in the original paper used a two-stage [transfer learning](/wiki/transfer_learning) approach: first [fine-tuning](/wiki/fine_tuning) [BERT](/wiki/bert)-large on the MultiNLI entailment dataset, then further [fine-tuning](/wiki/fine_tuning) on BoolQ training data. As the authors summarized, "Our best method trains BERT on MultiNLI and then re-trains it on our train set. It achieves 80.4% accuracy compared to 90% accuracy of human annotators (and 62% majority-baseline), leaving a significant gap for future work" [1]. That 80.4% test result sits 9.6 percentage points below the 90% human ceiling [1].

Human annotators achieved 90% accuracy on BoolQ, which the authors noted is lower than the near-perfect human agreement on many other reading comprehension benchmarks [1]. This lower agreement rate reflects the genuine ambiguity and difficulty present in naturally occurring yes/no questions.

## Transfer Learning Findings

A major contribution of the BoolQ paper is its systematic study of [transfer learning](/wiki/transfer_learning) for boolean question answering [1]. The authors experimented with transferring knowledge from several related NLP tasks before [fine-tuning](/wiki/fine_tuning) on BoolQ:

| Source Task | Dataset | Task Type | Accuracy Improvement |
|-------------|---------|-----------|---------------------|
| Entailment | MultiNLI | Natural language inference | Highest (+8-9 points) |
| Entailment | SNLI | Natural language inference | Moderate improvement |
| Extractive QA | [SQuAD](/wiki/squad) 2.0 | Span extraction | Smaller improvement |
| Extractive QA | QNLI | Sentence-pair classification | Smaller improvement |
| Multiple-choice QA | RACE | Reading comprehension | Moderate improvement |
| Paraphrase | MRPC/QQP | Paraphrase detection | Smallest improvement |

The key findings from these experiments include:

1. **Entailment transfer is most effective**: Fine-tuning on the MultiNLI entailment dataset before fine-tuning on BoolQ yielded the largest accuracy gains [1]. This suggests that yes/no question answering shares deeper structural similarities with natural language inference than with extractive question answering.

2. **Transfer from extractive QA helps less**: Despite the superficial similarity between reading comprehension tasks, transferring from [SQuAD](/wiki/squad) or QNLI provided smaller gains than transferring from entailment data [1]. This indicates that the reasoning processes involved in yes/no questions differ from those in span extraction.

3. **Transfer benefits persist with large models**: Even when starting from [BERT](/wiki/bert)-large, which already encodes substantial linguistic knowledge through [pre-training](/wiki/pre-training), additional task-specific transfer from MultiNLI continued to provide significant improvements [1]. The authors described this as "surprisingly" beneficial "even when starting from massive pre-trained language models such as BERT" [1].

4. **Paraphrase transfer is least effective**: Transferring from paraphrase detection tasks provided the smallest accuracy improvements, consistent with the finding that most BoolQ questions require reasoning beyond simple paraphrase matching [1].

These results have practical implications for building yes/no QA systems. They suggest that training pipelines for boolean question answering should include an intermediate stage of entailment training before task-specific fine-tuning.

## How does BoolQ fit into SuperGLUE?

BoolQ is one of eight tasks in the [SuperGLUE](/wiki/superglue) benchmark, which was introduced by Wang et al. (2019) as a successor to the [GLUE benchmark](/wiki/glue_benchmark) [2]. SuperGLUE was designed to present more difficult language understanding challenges after several models had surpassed human performance on the original GLUE benchmark [2].

### SuperGLUE Tasks

The eight tasks in SuperGLUE are [2]:

| Task | Type | Metric |
|------|------|--------|
| **BoolQ** | Yes/no question answering | Accuracy |
| CB (CommitmentBank) | Natural language inference (3-class) | F1 / Accuracy |
| COPA | Causal reasoning | Accuracy |
| MultiRC | Multi-sentence reading comprehension | F1 / Exact Match |
| ReCoRD | Reading comprehension with commonsense | F1 / Exact Match |
| RTE | Textual entailment | Accuracy |
| WiC | Word sense disambiguation | Accuracy |
| WSC (Winograd Schema Challenge) | Coreference resolution | Accuracy |

BoolQ was selected for SuperGLUE because it met several criteria: it presents a meaningful gap between model and human performance, it tests a distinct form of language understanding (boolean QA), and it provides enough training data to support supervised learning approaches while remaining difficult [2].

### SuperGLUE Baseline Results on BoolQ

In the original SuperGLUE paper, baseline models were evaluated on all eight tasks. The BoolQ results were [2]:

| Model | BoolQ Accuracy |
|-------|---------------|
| [BERT](/wiki/bert)-large | 77.4% |
| [BERT](/wiki/bert)-large++ (with MultiNLI) | 79.0% |
| Human baseline | 89.0% |

The gap between BERT-large++ and human performance on BoolQ was approximately 10 points (79.0% versus 89.0%), which was among the smaller gaps in SuperGLUE [2]. WSC, by comparison, showed a gap of roughly 35 points [2]. Still, the BoolQ gap represented a significant challenge that motivated years of subsequent research.

## Progress on BoolQ

### Model Performance Over Time

Since BoolQ's introduction in 2019, a series of increasingly powerful models have narrowed and eventually closed the gap with human performance. The following table summarizes notable results:

| Model | Year | BoolQ Accuracy | Parameters | Notes |
|-------|------|---------------|------------|-------|
| [BERT](/wiki/bert)-large + MultiNLI | 2019 | 80.4% | 340M | Original paper best result [1] |
| [RoBERTa](/wiki/roberta) | 2019 | ~86-87% | 355M | Improved pre-training approach [6] |
| [ALBERT](/wiki/albert) xxlarge | 2019 | ~89-90% | 235M | Parameter-efficient architecture [7] |
| [T5](/wiki/t5)-11B | 2020 | ~91% | 11B | Text-to-text framework [5] |
| [GPT-3](/wiki/gpt-3) (few-shot) | 2020 | ~60-76% | 175B | Without fine-tuning; varies by prompt |
| [DeBERTa](/wiki/deberta) (single model) | 2021 | 90.4% | 1.5B | Disentangled [attention](/wiki/attention) [4] |
| [DeBERTa](/wiki/deberta) (ensemble) | 2021 | ~91% | 1.5B x N | First to surpass human on SuperGLUE overall [4] |
| [GPT-4](/wiki/gpt-4) | 2023 | ~90%+ | Unknown | Near or above human level |

Several trends are apparent in this progression:

1. **Pre-training improvements matter**: Models like [RoBERTa](/wiki/roberta) and [ALBERT](/wiki/albert), which used improved pre-training procedures compared to [BERT](/wiki/bert), achieved large gains on BoolQ without changing the fundamental architecture [6][7].

2. **Scale helps but is not everything**: [T5](/wiki/t5)-11B with 11 billion parameters achieved approximately 91% accuracy, but [DeBERTa](/wiki/deberta) with 1.5 billion parameters reached comparable accuracy through architectural innovations like disentangled [attention](/wiki/attention) [4][5]. Meanwhile, [GPT-3](/wiki/gpt-3) with 175 billion parameters performed relatively poorly in zero-shot and few-shot settings without task-specific [fine-tuning](/wiki/fine_tuning).

3. **Human parity achieved**: By 2020-2021, the best fine-tuned models had matched or exceeded the 89-90% human accuracy baseline on BoolQ. In January 2021, Microsoft's [DeBERTa](/wiki/deberta) became the first model to surpass human performance on the overall SuperGLUE benchmark, with BoolQ being one of the contributing tasks [4].

### Open-Source Model Results

More recent evaluations of open-source models on BoolQ show continued strong performance [11]:

| Model | Developer | BoolQ Score |
|-------|-----------|-------------|
| Hermes 3 70B | Nous Research | 0.880 |
| Gemma 2 27B | Google | 0.848 |
| Phi-3.5-MoE-instruct | Microsoft | 0.846 |
| Gemma 2 9B | Google | 0.842 |
| Phi 4 Mini | Microsoft | 0.812 |
| Phi-3.5-mini-instruct | Microsoft | 0.780 |

These results represent zero-shot or few-shot evaluations rather than fine-tuned performance, which explains why scores are generally lower than the fine-tuned state-of-the-art results reported on the SuperGLUE leaderboard [11].

## Dataset Format and Access

### Data Format

BoolQ examples are stored in JSON Lines (JSONL) format. Each line contains a single JSON object with the following structure:

```json
{
  "question": "do iran and afghanistan speak the same language",
  "passage": "Persian (/ˈpɜːrʒən, -ʃən/), also known by its endonym Farsi...",
  "answer": true,
  "title": "Persian language"
}
```
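
The record structure above can be read with the standard `json` module. The following is a minimal sketch: the `BoolQExample` dataclass and the `parse_jsonl` helper are hypothetical names, and the answer is treated as optional on the assumption that unlabeled records (such as a withheld test split) may omit it.

```python
import json
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class BoolQExample:
    question: str
    passage: str
    answer: Optional[bool] = None  # assumed absent when labels are withheld
    title: str = ""


def parse_jsonl(lines: Iterable[str]) -> List[BoolQExample]:
    """Parse JSON Lines text into BoolQExample objects, skipping blank lines."""
    examples = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        examples.append(
            BoolQExample(
                question=record["question"],
                passage=record["passage"],
                answer=record.get("answer"),
                title=record.get("title", ""),
            )
        )
    return examples


sample_lines = [
    '{"question": "do iran and afghanistan speak the same language", '
    '"passage": "Persian, also known by its endonym Farsi...", '
    '"answer": true, "title": "Persian language"}',
    "",
]
examples = parse_jsonl(sample_lines)
print(examples[0].question, examples[0].answer)
```

For a file on disk, the open file handle can be passed directly as `lines`, since iterating over a text file yields one line at a time.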

### Loading with Hugging Face Datasets

The dataset can be loaded using the Hugging Face Datasets library:

```python
from datasets import load_dataset

dataset = load_dataset("google/boolq")

# Access splits
train_set = dataset["train"]      # 9,427 examples
val_set = dataset["validation"]   # 3,270 examples

# View an example
print(dataset["train"][0])
```

### Is BoolQ free to use? Availability and license

BoolQ is available through multiple channels:

| Source | URL |
|--------|-----|
| GitHub (official) | https://github.com/google-research-datasets/boolean-questions |
| Hugging Face | https://huggingface.co/datasets/google/boolq |
| TensorFlow Datasets | Available as `bool_q` in TFDS |
| SuperGLUE | Included in the SuperGLUE download |

The dataset is released under the Creative Commons Attribution-ShareAlike 3.0 (CC BY-SA 3.0) license, allowing free use with proper attribution and share-alike requirements [10].

## How does BoolQ compare to other benchmarks?

BoolQ occupies a specific niche among NLP benchmarks. The following table compares it with related datasets:

| Benchmark | Task Type | Question Source | Answer Format | Size |
|-----------|-----------|-----------------|---------------|------|
| BoolQ | Yes/no QA | Google search queries | Boolean (yes/no) | 15,942 |
| [SQuAD](/wiki/squad) | Extractive QA | Crowdsourced | Text span | 100,000+ |
| Natural Questions | Open-domain QA | Google search queries | Text span / yes-no / null | 300,000+ |
| MultiNLI | Natural language inference | Crowdsourced | Entailment / contradiction / neutral | 433,000 |
| [GLUE benchmark](/wiki/glue_benchmark) | Multi-task NLU | Various | Various | ~270,000 total |
| SuperGLUE | Multi-task NLU | Various | Various | ~33,000 total |

### Relationship to Natural Questions

BoolQ shares its data collection methodology with Google's Natural Questions (NQ) dataset [1][8]. Both datasets start from real Google search queries and pair them with Wikipedia content. The difference is that BoolQ specifically filters for yes/no questions, while NQ includes a broader range of question types with extractive answers [8]. The two datasets also overlap directly: the BoolQ paper reports combining roughly 13k questions gathered through its own pipeline with about 3k yes/no questions taken from the NQ training set [1].

### Comparison with NLI Tasks

BoolQ is often compared to [natural language inference](/wiki/natural_language_understanding) (NLI) tasks like MultiNLI and SNLI [9]. Like NLI, BoolQ involves determining whether a hypothesis (the question's implied statement) is supported by a premise (the passage). However, there are important differences:

- **Question format**: BoolQ uses natural yes/no questions rather than declarative sentence pairs.
- **Naturalness**: BoolQ questions come from real search behavior, while NLI datasets typically use crowdsourced sentence pairs.
- **Difficulty**: BoolQ questions tend to be more challenging because they were not designed to be easily answerable [1].

The strong transfer learning results from MultiNLI to BoolQ confirm the structural relationship between these tasks while also highlighting BoolQ's distinct challenges [1].

## What is BoolQ used for?

BoolQ is used in several contexts within the NLP community:

### Model Evaluation

BoolQ is commonly included in evaluation suites for [large language models](/wiki/large_language_model). It tests a model's ability to read a passage and determine whether a specific claim is supported. This makes it useful for assessing reading comprehension, factual reasoning, and inference capabilities.

### Transfer Learning Research

The transfer learning findings from the original BoolQ paper have influenced how researchers approach multi-task learning and sequential [fine-tuning](/wiki/fine_tuning) [1]. The finding that intermediate training on entailment data benefits boolean QA has been applied to other task combinations.

### Few-Shot and Zero-Shot Evaluation

With the rise of large language models like [GPT-3](/wiki/gpt-3) and [GPT-4](/wiki/gpt-4), BoolQ has been used to test few-shot and zero-shot capabilities. In these evaluations, models are given a few examples (or none) and must answer BoolQ questions without task-specific training. Performance in this setting is generally lower than fine-tuned performance, but it provides a measure of a model's general reasoning ability.

### Robustness and Bias Testing

Researchers have used BoolQ to study model robustness, including sensitivity to question phrasing, passage length, and answer distribution. The dataset's natural origin makes it useful for testing whether models can handle the kind of variation found in real-world queries.

## Limitations and Criticisms

While BoolQ has been widely adopted, it has several known limitations:

### Language and Source Constraints

BoolQ contains only English-language questions and passages sourced exclusively from English Wikipedia [1]. This limits its applicability to evaluating multilingual models or models intended for domains not well covered by Wikipedia.

### Answer Imbalance

The 62/38 split between "yes" and "no" answers introduces a class imbalance [1]. Models that learn to exploit this bias can achieve above-chance accuracy without genuine comprehension. Researchers must account for this imbalance when interpreting accuracy scores.

### Passage Dependence

Each question is paired with a single Wikipedia passage. In some cases, the passage may not contain all information needed to definitively answer the question, or additional context from other sources might change the answer. This single-passage setup does not capture the full complexity of real-world information retrieval, where a user might consult multiple sources.

### Human Ceiling

The 90% human accuracy is lower than the near-perfect agreement seen on other benchmarks [1]. While this partly reflects genuine question difficulty, it also suggests that some questions may be inherently ambiguous or that the passage-question pairing introduces occasional mismatches. For questions where annotators disagreed, the "correct" answer may not always be clear-cut.

### Benchmark Saturation

As of the early 2020s, state-of-the-art models have matched or surpassed human performance on BoolQ [4]. This "benchmark saturation" means BoolQ is no longer an effective discriminator among top-performing models, although it remains useful for evaluating mid-range models and for ablation studies.

### Limited Reasoning Annotations

BoolQ does not include annotations for the type of reasoning required to answer each question. While the paper provides aggregate statistics (38.7% paraphrase, etc.), individual question-level reasoning labels are not part of the dataset, making fine-grained error analysis more difficult [1].

## Influence and Legacy

BoolQ has had a substantial impact on the NLP research community since its introduction in 2019. As of 2025, the original paper has accumulated over 700 citations according to Semantic Scholar, reflecting its widespread use in model evaluation and benchmark design.

Several aspects of BoolQ's design have influenced subsequent work:

1. **Natural question sourcing**: BoolQ's use of real search queries rather than crowdsourced questions has been adopted by other benchmark designers who recognize that naturally occurring data produces more challenging and representative evaluations [1].

2. **Transfer learning analysis**: The systematic comparison of transfer sources (entailment vs. extractive QA vs. paraphrase) provided a template for studying how different pre-training tasks benefit downstream performance [1].

3. **SuperGLUE contribution**: As a component of SuperGLUE, BoolQ helped establish a new standard for NLU evaluation that lasted until models surpassed human performance on the benchmark in 2021 [2][4].

4. **Yes/no QA as a research focus**: BoolQ helped legitimize yes/no question answering as a distinct area of study within NLP, leading to follow-up work on boolean questions, unanswerable yes/no questions, and multi-domain yes/no QA [1].

## Technical Details for Practitioners

### Evaluation Setup

When evaluating models on BoolQ, the standard setup is:

1. **Input**: Concatenate the question and passage, separated by a special token (e.g., [SEP] for [BERT](/wiki/bert)-family models).
2. **Output**: A binary classification over {True, False} or equivalently {Yes, No}.
3. **Metric**: Report accuracy on the development set for analysis and on the test set for official comparison.

For [transformer](/wiki/transformer)-based models, the typical approach involves:

```
[CLS] question [SEP] passage [SEP]
```

A classification head (usually a linear layer) is applied to the [CLS] token representation to produce the binary prediction.
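
The input layout above can be sketched as plain string formatting. This is an illustration only: real tokenizers for BERT-family models insert the special tokens themselves and truncate on subword tokens rather than whitespace words, and the `build_pair_input` helper and its `max_passage_words` parameter are hypothetical.

```python
def build_pair_input(question, passage, max_passage_words=None,
                     cls_token="[CLS]", sep_token="[SEP]"):
    """Format a question/passage pair as '[CLS] question [SEP] passage [SEP]'."""
    words = passage.split()
    if max_passage_words is not None:
        # Truncate the passage, not the question, so the question stays intact.
        words = words[:max_passage_words]
    truncated_passage = " ".join(words)
    return f"{cls_token} {question} {sep_token} {truncated_passage} {sep_token}"


text = build_pair_input(
    "is france the same timezone as the uk",
    "France uses Central European Time while the UK uses Greenwich Mean Time",
    max_passage_words=5,
)
print(text)
```

Truncating only the passage is a design choice in this sketch: BoolQ passages can be long, while the question carries the information needed to decide which part of the passage matters.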

### Few-Shot Prompting

For [large language models](/wiki/large_language_model) evaluated in few-shot settings, a common prompt format is:

```
Passage: [passage text]
Question: [question text]
Answer (yes or no):
```

Providing 3 to 5 demonstrations before the target question typically improves answer formatting and accuracy. Additional demonstrations can also make the model more likely to produce its answer in exactly the expected format.
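
A minimal sketch of assembling such a prompt from the template above. The helper names `format_example` and `build_few_shot_prompt` are hypothetical, the demonstrations are toy text rather than real BoolQ examples, and separating examples with a blank line is an assumption rather than a fixed convention.

```python
def format_example(passage, question, answer=None):
    """Render one example; leave the answer slot open when answer is None."""
    prompt = f"Passage: {passage}\nQuestion: {question}\nAnswer (yes or no):"
    if answer is not None:
        prompt += " yes" if answer else " no"
    return prompt


def build_few_shot_prompt(demonstrations, passage, question):
    """Join answered demonstrations, then append the unanswered target."""
    blocks = [format_example(p, q, a) for p, q, a in demonstrations]
    blocks.append(format_example(passage, question))
    return "\n\n".join(blocks)


# Toy demonstrations as (passage, question, answer) tuples.
demos = [
    ("The ride is an indoor steel roller coaster.", "is the ride a roller coaster", True),
    ("The game is set 1,000 years before Skyrim.", "is the game the same as skyrim", False),
]
prompt = build_few_shot_prompt(
    demos,
    "France uses Central European Time; the UK uses Greenwich Mean Time.",
    "is france the same timezone as the uk",
)
print(prompt)
```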

### Common Pitfalls

Practitioners should be aware of several common issues when working with BoolQ:

- **Answer format sensitivity**: In generative evaluation, models may produce answers like "Yes, because..." or "No, it is not..." rather than a clean "yes" or "no." Strict exact matching would mark these as wrong, so evaluation scripts should normalize the output or extract the yes/no keyword before scoring.
- **Majority class exploitation**: Always compare model accuracy against the 62% majority baseline, not just against random chance (50%) [1].
- **Title information**: The page title field provides additional context that can improve accuracy. Some evaluation setups include the title while others omit it; results should specify which setup was used.
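
For the answer-format issue in the first bullet, one possible normalization step is a keyword extraction like the sketch below. The `parse_yes_no` helper is hypothetical, and taking the first matching keyword is an assumption that can misfire on outputs such as "There is no doubt that yes..."; stricter setups may prefer to count unparseable or ambiguous outputs as incorrect.

```python
import re


def parse_yes_no(generation):
    """Map free-form model output to True, False, or None if no keyword is found."""
    match = re.search(r"\b(yes|no|true|false)\b", generation.lower())
    if match is None:
        return None
    return match.group(1) in ("yes", "true")


print(parse_yes_no("Yes, because Persian is spoken in both countries."))
print(parse_yes_no("No, it is not the same game."))
print(parse_yes_no("I cannot tell from the passage."))
```

The word-boundary anchors keep words such as "not" or "know" from being read as "no".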

## See Also

- [BERT](/wiki/bert)
- [SuperGLUE](/wiki/superglue)
- [GLUE benchmark](/wiki/glue_benchmark)
- [SQuAD](/wiki/squad)
- [Natural language processing](/wiki/natural_language_processing)
- [Transfer learning](/wiki/transfer_learning)
- [Question answering](/wiki/question_answering)
- [DeBERTa](/wiki/deberta)
- [T5 (language model)](/wiki/t5)
- [Accuracy](/wiki/accuracy)
- [Large language model](/wiki/large_language_model)

## References

1. Clark, C., Lee, K., Chang, M.-W., Kwiatkowski, T., Collins, M., & Toutanova, K. (2019). "BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions." Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019), pp. 2924-2936. Minneapolis, Minnesota. arXiv:1905.10044. DOI: 10.18653/v1/N19-1300.

2. Wang, A., Pruksachatkun, Y., Nangia, N., Singh, A., Michael, J., Hill, F., Levy, O., & Bowman, S. (2019). "SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems." Advances in Neural Information Processing Systems 32 (NeurIPS 2019). arXiv:1905.00537.

3. Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." Proceedings of NAACL-HLT 2019, pp. 4171-4186.

4. He, P., Liu, X., Gao, J., & Chen, W. (2021). "DeBERTa: Decoding-enhanced BERT with Disentangled Attention." Proceedings of ICLR 2021.

5. Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., & Liu, P. J. (2020). "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer." Journal of Machine Learning Research, 21(140), pp. 1-67.

6. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., & Stoyanov, V. (2019). "RoBERTa: A Robustly Optimized BERT Pretraining Approach." arXiv:1907.11692.

7. Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., & Soricut, R. (2020). "ALBERT: A Lite BERT for Self-supervised Learning of Language Representations." Proceedings of ICLR 2020.

8. Kwiatkowski, T., Palomaki, J., Redfield, O., Collins, M., Parikh, A., Alberti, C., Epstein, D., Polosukhin, I., Devlin, J., & Lee, K. (2019). "Natural Questions: A Benchmark for Question Answering Research." Transactions of the Association for Computational Linguistics, 7, pp. 453-466.

9. Williams, A., Nangia, N., & Bowman, S. (2018). "A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference." Proceedings of NAACL-HLT 2018.

10. Google Research Datasets. "Boolean Questions." GitHub repository. https://github.com/google-research-datasets/boolean-questions. Accessed March 2026.

11. LLM Stats. "BoolQ Benchmark Leaderboard." https://llm-stats.com/benchmarks/boolq (accessed May 2026). Current top open-source models: Hermes 3 70B (0.880), Gemma 2 27B (0.848), Phi-3.5-MoE-instruct (0.846); average across all evaluated models: 0.817.

