BoolQ (Boolean Questions) is a natural language processing benchmark dataset designed for yes/no question answering. Created by researchers at Google, BoolQ contains 15,942 naturally occurring yes/no questions paired with Wikipedia passages and boolean answers. The dataset was introduced in the 2019 paper "BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions" by Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova, published at the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2019). BoolQ is one of eight tasks included in the SuperGLUE benchmark and has become a standard evaluation for measuring reading comprehension and inference capabilities in language models.
Yes/no questions represent one of the most common forms of information-seeking behavior. When people search the web, a significant portion of their queries can be answered with a simple "yes" or "no." Despite this prevalence, yes/no question answering received relatively little attention in NLP research prior to BoolQ. Most reading comprehension benchmarks, such as SQuAD, focused on extractive question answering, where models must identify a text span that answers the question. Other benchmarks tested natural language inference through artificially constructed sentence pairs.
The authors of BoolQ observed that naturally occurring yes/no questions present a different and often more difficult challenge than extractive QA or synthetic inference tasks. When people ask yes/no questions in real search scenarios, the questions tend to be more complex, requiring reasoning beyond simple word matching or paraphrase detection. This observation motivated the creation of a dedicated benchmark that could capture the true difficulty of boolean question answering in a naturalistic setting.
Before BoolQ, yes/no questions were sometimes treated as a byproduct of other tasks or filtered out entirely from QA datasets. The BoolQ paper demonstrated that these questions deserve focused study because they require a distinct combination of reading comprehension, inference, and world knowledge.
BoolQ follows a data collection pipeline adapted from Google's Natural Questions (NQ) project. The process begins with anonymized, aggregated queries submitted to the Google search engine. From this pool of real user queries, the researchers applied heuristic filters to identify queries that are likely to be yes/no questions. This approach differs from many other NLP benchmarks where annotators are prompted to write questions, because BoolQ's questions reflect genuine information needs rather than researcher-designed tasks.
The key steps in the data collection process are:
Query sampling: Anonymized search queries were sampled from Google's search logs. These represent real questions that actual users typed into the search engine.
Yes/no filtering: Heuristic rules were applied to identify queries that can be interpreted as yes/no questions. Questions typically begin with words like "is," "are," "do," "does," "can," "was," "will," and similar auxiliaries.
Manual filtering: The candidate questions were manually reviewed to ensure they are comprehensible and unambiguous. Questions that were unclear, ill-formed, or could not reasonably be answered with "yes" or "no" were removed.
Passage selection: For each question, annotators identified a relevant Wikipedia article and selected a paragraph from that article containing enough information to determine the answer.
Answer annotation: Human annotators read the selected passage and provided a boolean answer (yes or no) to the question based on the passage content.
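The yes/no filtering step above can be sketched in a few lines. This is a hypothetical illustration only; the actual heuristics used by the BoolQ authors are more elaborate and were applied to private query logs:

```python
# Hypothetical sketch of the yes/no filtering heuristic described above.
# The real production filters used for BoolQ are not published.

AUXILIARIES = {
    "is", "are", "was", "were", "do", "does", "did",
    "can", "could", "will", "would", "should", "has", "have", "had",
}

def looks_like_yes_no_question(query: str) -> bool:
    """Return True if the query plausibly reads as a yes/no question."""
    tokens = query.lower().split()
    # A yes/no question typically opens with an auxiliary verb and
    # needs at least a subject and predicate after it.
    return len(tokens) >= 3 and tokens[0] in AUXILIARIES

queries = [
    "do iran and afghanistan speak the same language",
    "capital of france",
    "is france the same timezone as the uk",
]
candidates = [q for q in queries if looks_like_yes_no_question(q)]
```

A filter this crude over-generates (e.g. "is" can also open a definition query), which is why the pipeline follows it with a manual review pass.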
The annotation process involved human workers who reviewed each question-passage pair. Multiple annotators independently evaluated the same questions, and disagreements were resolved through consensus. Annotators received formal training to maintain consistency across the dataset. Each annotated example in BoolQ consists of four fields:
| Field | Type | Description |
|---|---|---|
| question | String | A naturally occurring yes/no question (typically 20 to 100 characters) |
| passage | String | A paragraph from a Wikipedia article (typically 35 to 4,720 characters) |
| answer | Boolean | True (yes) or False (no) |
| title | String | The title of the Wikipedia article (optional additional context) |
The use of real search queries rather than researcher-prompted questions is a defining characteristic of BoolQ. By sampling from actual information-seeking behavior, the dataset captures a wider range of question complexity and topic diversity than datasets built through crowdsourcing prompts alone.
The full BoolQ dataset contains 15,942 examples divided into three splits:
| Split | Examples | Labels |
|---|---|---|
| Training | 9,427 | Provided |
| Development (validation) | 3,270 | Provided |
| Test | 3,245 | Withheld |
The training and development splits are publicly available with labeled answers. The test split answers are withheld and used for official evaluation through the SuperGLUE leaderboard.
The dataset has a moderately imbalanced answer distribution. Approximately 62% of the answers are "yes" (True), while 38% are "no" (False). This means a naive baseline that always predicts "yes" would achieve 62% accuracy, establishing the majority-class baseline for the task.
The questions in BoolQ cover a wide range of topics drawn from Wikipedia, including geography, history, science, entertainment, sports, law, and technology.
The following table shows representative examples from the BoolQ dataset:
| Question | Passage (excerpt) | Answer |
|---|---|---|
| Do Iran and Afghanistan speak the same language? | Persian, also known by its endonym Farsi, is a Western Iranian language... It is the official language of Iran, Afghanistan (officially known as Dari), and Tajikistan... | Yes |
| Is Harry Potter and the Escape from Gringotts a roller coaster ride? | Harry Potter and the Escape from Gringotts is an indoor steel roller coaster... at Universal Studios Florida... | Yes |
| Does ethanol take more energy to make than it produces? | According to a 2005 study... the energy balance is actually positive... corn ethanol produces 67% more energy than it takes to produce... | No |
| Is Elder Scrolls Online the same as Skyrim? | As with other games in The Elder Scrolls series, the game is set on the continent of Tamriel. The game takes place in... roughly 1,000 years before the events of The Elder Scrolls V: Skyrim... | No |
| Is France the same timezone as the UK? | France uses Central European Time (UTC+01:00)... the UK uses Greenwich Mean Time (UTC+00:00)... | No |
These examples illustrate several properties of BoolQ questions. Some can be answered by straightforward paraphrase detection (the roller coaster question). Others require multi-step reasoning, background knowledge, or careful interpretation of the passage (the ethanol question).
One of the central findings of the BoolQ paper is that naturally occurring yes/no questions are significantly more difficult than expected. The authors performed a manual analysis of question types and found the following distribution of reasoning requirements:
| Reasoning Type | Percentage | Description |
|---|---|---|
| Paraphrase | 38.7% | The answer can be determined by matching or rephrasing words between the question and passage |
| Inferential | ~30% | Requires drawing inferences from implicit information in the passage |
| World knowledge | ~15% | Requires background knowledge not explicitly stated in the passage |
| Complex/multi-step | ~16% | Requires combining multiple pieces of information or multi-step reasoning |
Only 38.7% of the questions can be answered through simple paraphrase matching between the question and passage. The remaining 61.3% require more complex reasoning, including inferential reasoning from implicit information, integration of world knowledge, and multi-step logical deduction. This distribution explains why BoolQ is more challenging than many natural language inference benchmarks, where surface-level pattern matching can achieve higher accuracy.
The difficulty of BoolQ questions stems from their natural origin. When people ask yes/no questions in a search engine, they are not constrained by instructions from researchers. Their questions reflect genuine uncertainty and often involve nuanced relationships that cannot be resolved by simple keyword overlap.
BoolQ uses accuracy as its sole evaluation metric. Each prediction is either correct (matching the gold-standard boolean answer) or incorrect. The overall accuracy is computed as the percentage of correctly answered questions in the evaluation set.
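The metric is straightforward to compute. A minimal sketch, using made-up predictions and gold labels for illustration:

```python
def boolq_accuracy(predictions, gold):
    """Fraction of predictions that match the gold boolean answers."""
    assert len(predictions) == len(gold)
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

# Toy example: five gold labels, one wrong prediction.
gold = [True, True, False, True, False]
preds = [True, True, False, False, False]
print(boolq_accuracy(preds, gold))  # 0.8

# The majority-class baseline always predicts the more common label.
# On the real dataset (~62% "yes") this yields ~62% accuracy.
majority = max(set(gold), key=gold.count)
baseline = boolq_accuracy([majority] * len(gold), gold)
```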
The original BoolQ paper (Clark et al., 2019) established several baselines:
| Model / Baseline | Dev Accuracy | Test Accuracy |
|---|---|---|
| Majority class (always "yes") | 62.0% | 62.0% |
| BERT-large (no transfer) | ~71% | ~71% |
| BERT-large + MultiNLI transfer | 79.4% | 80.4% |
| Human annotators | -- | 90.0% |
The majority-class baseline of 62% reflects the answer imbalance in the dataset. BERT-large, when trained only on BoolQ data, achieved substantially higher accuracy, but still lagged well behind human performance. The best configuration in the original paper used a two-stage transfer learning approach: first fine-tuning BERT-large on the MultiNLI entailment dataset, then further fine-tuning on BoolQ training data. This approach achieved 80.4% test accuracy, a 9.6 percentage point gap below human performance of 90%.
Human annotators achieved 90% accuracy on BoolQ, which the authors noted is lower than the near-perfect human agreement on many other reading comprehension benchmarks. This lower agreement rate reflects the genuine ambiguity and difficulty present in naturally occurring yes/no questions.
A major contribution of the BoolQ paper is its systematic study of transfer learning for boolean question answering. The authors experimented with transferring knowledge from several related NLP tasks before fine-tuning on BoolQ:
| Source Task | Dataset | Task Type | Accuracy Improvement |
|---|---|---|---|
| Entailment | MultiNLI | Natural language inference | Highest (+8-9 points) |
| Entailment | SNLI | Natural language inference | Moderate improvement |
| Extractive QA | SQuAD 2.0 | Span extraction | Smaller improvement |
| Extractive QA | QNLI | Sentence-pair classification | Smaller improvement |
| Multiple-choice QA | RACE | Reading comprehension | Moderate improvement |
| Paraphrase | MRPC/QQP | Paraphrase detection | Smallest improvement |
The key findings from these experiments include:
Entailment transfer is most effective: Pre-training on the MultiNLI entailment dataset before fine-tuning on BoolQ yielded the largest accuracy gains. This suggests that yes/no question answering shares deeper structural similarities with natural language inference than with extractive question answering.
Transfer from extractive QA helps less: Despite the superficial similarity between reading comprehension tasks, transferring from SQuAD or QNLI provided smaller gains than transferring from entailment data. This indicates that the reasoning processes involved in yes/no questions differ from those in span extraction.
Transfer benefits persist with large models: Even when starting from BERT-large, which already encodes substantial linguistic knowledge through pre-training, additional task-specific transfer from MultiNLI continued to provide significant improvements. This finding was somewhat surprising, as one might expect the massive pre-trained representations to already capture the knowledge provided by intermediate task training.
Paraphrase transfer is least effective: Transferring from paraphrase detection tasks provided the smallest accuracy improvements, consistent with the finding that most BoolQ questions require reasoning beyond simple paraphrase matching.
These results have practical implications for building yes/no QA systems. They suggest that training pipelines for boolean question answering should include an intermediate stage of entailment training before task-specific fine-tuning.
BoolQ is one of eight tasks in the SuperGLUE benchmark, which was introduced by Wang et al. (2019) as a successor to the GLUE benchmark. SuperGLUE was designed to present more difficult language understanding challenges after several models had surpassed human performance on the original GLUE benchmark.
The eight tasks in SuperGLUE are:
| Task | Type | Metric |
|---|---|---|
| BoolQ | Yes/no question answering | Accuracy |
| CB (CommitmentBank) | Natural language inference (3-class) | F1 / Accuracy |
| COPA | Causal reasoning | Accuracy |
| MultiRC | Multi-sentence reading comprehension | F1 / Exact Match |
| ReCoRD | Reading comprehension with commonsense | F1 / Exact Match |
| RTE | Textual entailment | Accuracy |
| WiC | Word sense disambiguation | Accuracy |
| WSC (Winograd Schema Challenge) | Coreference resolution | Accuracy |
BoolQ was selected for SuperGLUE because it met several criteria: it presents a meaningful gap between model and human performance, it tests a distinct form of language understanding (boolean QA), and it provides enough training data to support supervised learning approaches while remaining difficult.
In the original SuperGLUE paper, baseline models were evaluated on all eight tasks. The BoolQ results were:
| Model | BoolQ Accuracy |
|---|---|
| BERT-large | 77.4% |
| BERT-large++ (with MultiNLI) | 79.0% |
| Human baseline | 89.0% |
The gap between BERT-large++ and human performance on BoolQ was approximately 10 points, which was among the smaller gaps in SuperGLUE. Other tasks like WSC had gaps of 35 points. Still, the BoolQ gap represented a significant challenge that motivated years of subsequent research.
Since BoolQ's introduction in 2019, a series of increasingly powerful models have narrowed and eventually closed the gap with human performance. The following table summarizes notable results:
| Model | Year | BoolQ Accuracy | Parameters | Notes |
|---|---|---|---|---|
| BERT-large + MultiNLI | 2019 | 80.4% | 340M | Original paper best result |
| RoBERTa | 2019 | ~86-87% | 355M | Improved pre-training approach |
| ALBERT xxlarge | 2019 | ~89-90% | 235M | Parameter-efficient architecture |
| T5-11B | 2020 | ~91% | 11B | Text-to-text framework |
| DeBERTa (single model) | 2021 | 90.4% | 1.5B | Disentangled attention |
| DeBERTa (ensemble) | 2021 | ~91% | 1.5B x N | First to surpass human on SuperGLUE overall |
| GPT-3 (few-shot) | 2020 | ~60-76% | 175B | Without fine-tuning; varies by prompt |
| GPT-4 | 2023 | ~90%+ | Unknown | Near or above human level |
Several trends are apparent in this progression:
Pre-training improvements matter: Models like RoBERTa and ALBERT, which used improved pre-training procedures compared to BERT, achieved large gains on BoolQ without changing the fundamental architecture.
Scale helps but is not everything: T5-11B with 11 billion parameters achieved approximately 91% accuracy, but DeBERTa with 1.5 billion parameters reached comparable accuracy through architectural innovations like disentangled attention. Meanwhile, GPT-3 with 175 billion parameters performed relatively poorly in zero-shot and few-shot settings without task-specific fine-tuning.
Human parity achieved: By 2020-2021, the best fine-tuned models had matched or exceeded the 89-90% human accuracy baseline on BoolQ. In January 2021, Microsoft's DeBERTa became the first model to surpass human performance on the overall SuperGLUE benchmark, with BoolQ being one of the contributing tasks.
More recent evaluations of open-source models on BoolQ show continued strong performance:
| Model | Developer | BoolQ Score |
|---|---|---|
| Hermes 3 70B | Nous Research | 0.880 |
| Gemma 2 27B | Google | 0.848 |
| Phi-3.5-MoE-instruct | Microsoft | 0.846 |
| Gemma 2 9B | Google | 0.842 |
| Phi 4 Mini | Microsoft | 0.812 |
| Phi-3.5-mini-instruct | Microsoft | 0.780 |
These results represent zero-shot or few-shot evaluations rather than fine-tuned performance, which explains why scores are generally lower than the fine-tuned state-of-the-art results reported on the SuperGLUE leaderboard.
BoolQ examples are stored in JSON Lines (JSONL) format. Each line contains a single JSON object with the following structure:
```json
{
  "question": "do iran and afghanistan speak the same language",
  "passage": "Persian (/ˈpɜːrʒən, -ʃən/), also known by its endonym Farsi...",
  "answer": true,
  "title": "Persian language"
}
```
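Because each line is an independent JSON object, the raw files can be parsed with the standard library alone. A minimal sketch, assuming a local file such as `train.jsonl` downloaded from the official repository:

```python
import json

def load_boolq_jsonl(path):
    """Parse a BoolQ JSONL file into a list of example dicts."""
    examples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip any blank lines
                examples.append(json.loads(line))
    return examples

# Each returned dict has "question", "passage", "answer", and "title" keys.
```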
The dataset can be loaded using the Hugging Face Datasets library:
```python
from datasets import load_dataset

dataset = load_dataset("google/boolq")

# Access splits
train_set = dataset["train"]       # 9,427 examples
val_set = dataset["validation"]    # 3,270 examples

# View an example
print(dataset["train"][0])
```
BoolQ is available through multiple channels:
| Source | URL |
|---|---|
| GitHub (official) | https://github.com/google-research-datasets/boolean-questions |
| Hugging Face | https://huggingface.co/datasets/google/boolq |
| TensorFlow Datasets | Available as bool_q in TFDS |
| SuperGLUE | Included in the SuperGLUE download |
The dataset is released under the Creative Commons Attribution-ShareAlike 3.0 (CC BY-SA 3.0) license, allowing free use with proper attribution and share-alike requirements.
BoolQ occupies a specific niche among NLP benchmarks. The following table compares it with related datasets:
| Benchmark | Task Type | Question Source | Answer Format | Size |
|---|---|---|---|---|
| BoolQ | Yes/no QA | Google search queries | Boolean (yes/no) | 15,942 |
| SQuAD | Extractive QA | Crowdsourced | Text span | 100,000+ |
| Natural Questions | Open-domain QA | Google search queries | Text span / yes-no / null | 300,000+ |
| MultiNLI | Natural language inference | Crowdsourced | Entailment / contradiction / neutral | 433,000 |
| GLUE benchmark | Multi-task NLU | Various | Various | ~270,000 total |
| SuperGLUE | Multi-task NLU | Various | Various | ~33,000 total |
BoolQ shares its data collection methodology with Google's Natural Questions (NQ) dataset. Both datasets start from real Google search queries and pair them with Wikipedia content. The difference is that BoolQ specifically filters for yes/no questions, while NQ includes a broader range of question types with extractive answers. Some yes/no questions in NQ overlap conceptually with BoolQ, but the datasets were collected and annotated independently.
BoolQ is often compared to natural language inference (NLI) tasks like MultiNLI and SNLI. Like NLI, BoolQ involves determining whether a hypothesis (the question's implied statement) is supported by a premise (the passage). However, BoolQ differs in important ways: its premises are full paragraphs rather than single sentences, its hypotheses are naturally occurring questions rather than annotator-written statements, and its label space is binary (yes/no) rather than three-way (entailment/contradiction/neutral).
The strong transfer learning results from MultiNLI to BoolQ confirm the structural relationship between these tasks while also highlighting BoolQ's distinct challenges.
BoolQ is used in several contexts within the NLP community:
BoolQ is commonly included in evaluation suites for large language models. It tests a model's ability to read a passage and determine whether a specific claim is supported. This makes it useful for assessing reading comprehension, factual reasoning, and inference capabilities.
The transfer learning findings from the original BoolQ paper have influenced how researchers approach multi-task learning and sequential fine-tuning. The discovery that entailment pre-training benefits boolean QA has been applied to other task combinations.
With the rise of large language models like GPT-3 and GPT-4, BoolQ has been used to test few-shot and zero-shot capabilities. In these evaluations, models are given a few examples (or none) and must answer BoolQ questions without task-specific training. Performance in this setting is generally lower than fine-tuned performance, but it provides a measure of a model's general reasoning ability.
Researchers have used BoolQ to study model robustness, including sensitivity to question phrasing, passage length, and answer distribution. The dataset's natural origin makes it useful for testing whether models can handle the kind of variation found in real-world queries.
While BoolQ has been widely adopted, it has several known limitations:
BoolQ contains only English-language questions and passages sourced exclusively from English Wikipedia. This limits its applicability to evaluating multilingual models or models intended for domains not well covered by Wikipedia.
The 62/38 split between "yes" and "no" answers introduces a class imbalance. Models that learn to exploit this bias can achieve above-chance accuracy without genuine comprehension. Researchers must account for this imbalance when interpreting accuracy scores.
Each question is paired with a single Wikipedia passage. In some cases, the passage may not contain all information needed to definitively answer the question, or additional context from other sources might change the answer. This single-passage setup does not capture the full complexity of real-world information retrieval, where a user might consult multiple sources.
The 90% human accuracy is lower than the near-perfect agreement seen on other benchmarks. While this partly reflects genuine question difficulty, it also suggests that some questions may be inherently ambiguous or that the passage-question pairing introduces occasional mismatches. For questions where annotators disagreed, the "correct" answer may not always be clear-cut.
As of the early 2020s, state-of-the-art models have matched or surpassed human performance on BoolQ. This "benchmark saturation" means BoolQ is no longer an effective discriminator among top-performing models, although it remains useful for evaluating mid-range models and for ablation studies.
BoolQ does not include annotations for the type of reasoning required to answer each question. While the paper provides aggregate statistics (38.7% paraphrase, etc.), individual question-level reasoning labels are not part of the dataset, making fine-grained error analysis more difficult.
BoolQ has had a substantial impact on the NLP research community since its introduction in 2019. As of 2025, the original paper has accumulated over 700 citations according to Semantic Scholar, reflecting its widespread use in model evaluation and benchmark design.
Several aspects of BoolQ's design have influenced subsequent work:
Natural question sourcing: BoolQ's use of real search queries rather than crowdsourced questions has been adopted by other benchmark designers who recognize that naturally occurring data produces more challenging and representative evaluations.
Transfer learning analysis: The systematic comparison of transfer sources (entailment vs. extractive QA vs. paraphrase) provided a template for studying how different pre-training tasks benefit downstream performance.
SuperGLUE contribution: As a component of SuperGLUE, BoolQ helped establish a new standard for NLU evaluation that lasted until models surpassed human performance on the benchmark in 2021.
Yes/no QA as a research focus: BoolQ helped legitimize yes/no question answering as a distinct area of study within NLP, leading to follow-up work on boolean questions, unanswerable yes/no questions, and multi-domain yes/no QA.
When evaluating models on BoolQ, the standard setup is to fine-tune on the training split and report accuracy on the development split, or on the withheld test split via the SuperGLUE leaderboard.
For transformer-based models, the typical approach involves:
```
[CLS] question [SEP] passage [SEP]
```
A classification head (usually a linear layer) is applied to the [CLS] token representation to produce the binary prediction.
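The packing scheme can be illustrated in pure Python. This is a conceptual sketch only: the token strings below stand in for the integer IDs a real subword tokenizer would produce, and whitespace splitting replaces real tokenization:

```python
# Illustration of the standard sentence-pair packing for BoolQ.
# Token strings stand in for the IDs a real tokenizer would emit.

def pack_pair(question: str, passage: str, max_len: int = 512):
    """Join question and passage into one [CLS]...[SEP]...[SEP] sequence."""
    q_tokens = question.lower().split()
    p_tokens = passage.lower().split()
    tokens = ["[CLS]"] + q_tokens + ["[SEP]"] + p_tokens + ["[SEP]"]
    # Segment IDs distinguish the question (0) from the passage (1).
    segments = [0] * (len(q_tokens) + 2) + [1] * (len(p_tokens) + 1)
    return tokens[:max_len], segments[:max_len]

tokens, segments = pack_pair(
    "is france the same timezone as the uk",
    "France uses Central European Time ...",
)
# The binary classification head reads the representation at tokens[0],
# the [CLS] position, and produces two logits (yes / no).
```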
For large language models evaluated in few-shot settings, a common prompt format is:
```
Passage: [passage text]
Question: [question text]
Answer (yes or no):
```
Providing 3 to 5 labeled demonstrations before the target question typically improves accuracy and, just as importantly, anchors the model to emit a bare "yes" or "no" in the expected format rather than a longer explanation.
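Assembling this prompt is mechanical. A minimal sketch; the exact template wording is a common convention rather than a fixed standard, and the function names here are illustrative:

```python
def format_example(question, passage, answer=None):
    """Render one BoolQ example in the prompt template above."""
    text = f"Passage: {passage}\nQuestion: {question}\nAnswer (yes or no):"
    if answer is not None:
        # Labeled demonstration: append the gold answer.
        text += " yes" if answer else " no"
    return text

def build_prompt(demos, target_question, target_passage):
    """Join k labeled demonstrations with the unlabeled target example."""
    parts = [format_example(q, p, a) for q, p, a in demos]
    parts.append(format_example(target_question, target_passage))
    return "\n\n".join(parts)

demos = [
    ("is france the same timezone as the uk",
     "France uses Central European Time...", False),
]
prompt = build_prompt(
    demos,
    "do iran and afghanistan speak the same language",
    "Persian... is the official language of Iran, Afghanistan...",
)
```

The prompt ends with the open cue `Answer (yes or no):`, so the model's next token can be read off directly as the prediction.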
Practitioners should be aware of several common issues when working with BoolQ: