HotpotQA
Last reviewed
May 2, 2026
Sources
29 citations
Review status
Source-backed
Revision
v1 · 3,497 words
HotpotQA is a large-scale multi-hop question answering dataset and benchmark over English Wikipedia, containing about 113,000 crowd-authored question-and-answer pairs whose answers cannot be located in any single paragraph. The dataset was introduced in HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering by Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning, presented at EMNLP 2018 in Brussels and posted to arXiv on 25 September 2018 as 1809.09600. It was the first large multi-hop reading-comprehension dataset where annotators wrote the questions themselves and marked the sentence-level supporting facts, and for several years it was one of the standard testbeds for retrieval-augmented language models.
The dataset is released under the same Creative Commons Attribution-ShareAlike 4.0 license as Wikipedia. The official leaderboard, code, and downloads are hosted at hotpotqa.github.io.
| Field | Value |
|---|---|
| Released | 25 September 2018 (arXiv preprint); EMNLP, 31 October 2018 |
| Paper | Yang et al. 2018, arXiv:1809.09600 |
| Authors | Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, Christopher D. Manning |
| Affiliations | Carnegie Mellon University, Stanford NLP Group, Université de Montréal / Mila, Google AI |
| Total questions | 112,779 (90,564 train, 7,405 dev, 7,405 distractor test, 7,405 fullwiki test) |
| Domain | English Wikipedia (introduction paragraphs) |
| Question types | Bridge questions (~80%), comparison questions (~20%) |
| Settings | Distractor (10 paragraphs) and Fullwiki (open domain over ~5M articles) |
| Metrics | Exact Match (EM) and F1 on Answer and Supporting Facts; Joint EM/F1 |
| License | Creative Commons Attribution-ShareAlike 4.0 |
| Leaderboard | hotpotqa.github.io |
Before HotpotQA, the most influential reading-comprehension benchmark was the Stanford Question Answering Dataset, or SQuAD, released by Rajpurkar et al. in 2016. SQuAD 1.1 contained 100,000 questions written against single Wikipedia paragraphs, and SQuAD 2.0 (Rajpurkar, Jia, and Liang, 2018) added 50,000 unanswerable questions. By mid-2018 several systems had matched or exceeded the human F1 score on SQuAD 1.1, suggesting the format was close to saturated and was an incomplete proxy for reading comprehension. SQuAD answers are spans inside one paragraph, so a model can succeed by aligning the question to a single sentence and copying.
A second wave of datasets had begun to test reasoning across multiple documents. WikiHop, part of the QAngaroo collection (Welbl, Stenetorp, and Riedel, 2018), generated multi-hop questions automatically from Wikipedia and Wikidata triples and asked models to choose between candidate entities. ComplexWebQuestions (Talmor and Berant, 2018) transformed Freebase queries into natural-language questions whose answer required composing facts from several web pages. TriviaQA (Joshi et al. 2017) and SearchQA (Dunn et al. 2017) added long evidence chains, although their multi-hop content was incidental rather than required.
These earlier datasets had limitations. WikiHop questions were synthesized from knowledge-base templates, so the language was rigid and many questions could be answered by lexical overlap on a single document. None of them annotated which sentences a system needed to read. The HotpotQA authors set out to build a dataset that combined natural human-written language, true multi-hop dependency, sentence-level evidence supervision, and a freely licensed source corpus.
The HotpotQA team built the dataset on Amazon Mechanical Turk during 2017 and 2018. The pipeline started from a hyperlink graph over the introduction paragraphs of English Wikipedia. The authors used the October 2017 dump and kept only the lead paragraphs, since these contain a high concentration of factual sentences and are short enough for crowd workers to read quickly. They sampled pairs of articles connected by hyperlinks, treating one article as the bridge entity that the question would route through.
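The pair-sampling step can be illustrated with a minimal sketch. The toy graph, the article titles used as keys, and the function names `candidate_pairs` and `sample_pairs` are all hypothetical; the authors' actual sampling code and filters are not reproduced here.

```python
import random

# Hypothetical toy hyperlink graph: article title -> titles linked from its lead paragraph.
LINKS = {
    "Kiss and Tell (1945 film)": ["Shirley Temple"],
    "Shirley Temple": ["Chief of Protocol of the United States"],
    "Scott Derrickson": [],
}


def candidate_pairs(links):
    """List every (source, linked) article pair, ordered by source title."""
    return [(src, dst) for src, targets in sorted(links.items()) for dst in targets]


def sample_pairs(links, k, seed=0):
    """Draw up to k hyperlinked pairs without replacement (seeded for repeatability)."""
    pairs = candidate_pairs(links)
    rng = random.Random(seed)
    return rng.sample(pairs, min(k, len(pairs)))
```

Each sampled pair would then be shown to a crowd worker, with the linked article playing the role of the bridge entity.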
For each pair, a worker saw the two paragraphs and was asked to write a question whose answer required information from both, plus the answer itself and the specific supporting sentences. Automated checks filtered out questions whose answer span occurred in only one paragraph or whose required reasoning could be obtained without the second article. Workers were paid roughly two cents per accepted question and were given iterative feedback during a qualification round.
HotpotQA contains two question types. Bridge questions, about 80 percent of the corpus, ask about an entity that links the two paragraphs. An example given in the paper is "What government position was held by the woman who portrayed Corliss Archer in the film Kiss and Tell?" The first paragraph identifies the actress (Shirley Temple), and the second names her role as Chief of Protocol. Comparison questions, about 20 percent of the corpus, contrast two entities on a shared attribute, for example "Were Scott Derrickson and Ed Wood of the same nationality?" These often have yes or no answers and require the model to extract a property from each paragraph and compare it.
The authors also included roughly 6,000 yes-no questions and a small fraction of comparison questions whose answers are dates or numbers requiring arithmetic. All other answers are extractive spans. Each example is annotated with two to four supporting sentences, about 2.4 on average, drawn from the gold paragraphs. After collection, a hard subset of about 26,000 items was checked by a second worker. The final release contains 90,564 training questions, 7,405 development questions, and 7,405 hidden test questions for each of the two evaluation settings, for a total of 112,779 questions.
| Split | Questions | Notes |
|---|---|---|
| Train | 90,564 | Easy, medium, and hard examples; gold paragraphs and supporting facts visible |
| Dev | 7,405 | All hard, gold supervision visible |
| Distractor test | 7,405 | Hidden answers; 10 paragraphs per question (2 gold, 8 distractors) |
| Fullwiki test | 7,405 | Hidden answers; system must retrieve from full Wikipedia |
| Question type | Share | Example pattern |
|---|---|---|
| Bridge | ~80% | First paragraph identifies an entity, second paragraph supplies the answer |
| Comparison | ~20% | Compare two entities on a shared attribute (often yes-no, date, or number) |
HotpotQA evaluates models in two regimes that share the same question set but differ in retrieval difficulty.
The distractor setting hands the system 10 paragraphs per question. Two are the gold paragraphs containing the supporting facts; the other eight are distractors retrieved by a TF-IDF query over the bigram representation of the question against all Wikipedia introductions. The model must read those 10 paragraphs, return the answer span (or yes / no), and predict which sentences are supporting facts. This setting isolates multi-hop reasoning and explanation from the open-domain retrieval problem.
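A rough sketch of how such a 10-paragraph context could be assembled is shown below. It assumes whitespace tokenization, a smoothed IDF formula, and hypothetical function names (`bigrams`, `tfidf_scores`, `build_distractor_context`); the exact weighting and preprocessing used by the authors may differ.

```python
import math
import random
from collections import Counter


def bigrams(text):
    """Lowercased word bigrams; whitespace tokenization is an assumption."""
    tokens = text.lower().split()
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def tfidf_scores(question, paragraphs):
    """Score each paragraph by the summed TF-IDF weight of the question's bigrams."""
    docs = {title: Counter(bigrams(text)) for title, text in paragraphs.items()}
    n_docs = len(docs)
    scores = {title: 0.0 for title in docs}
    for gram in set(bigrams(question)):
        df = sum(1 for counts in docs.values() if gram in counts)
        if df == 0:
            continue
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0  # smoothed IDF (assumed form)
        for title, counts in docs.items():
            scores[title] += counts[gram] * idf
    return scores


def build_distractor_context(question, paragraphs, gold_titles, n_distractors=8, seed=0):
    """Return the gold titles plus the top-scoring non-gold titles, shuffled."""
    scores = tfidf_scores(question, paragraphs)
    ranked = sorted(
        (t for t in paragraphs if t not in gold_titles),
        key=lambda t: (-scores[t], t),  # highest score first, title as tie-break
    )
    context = list(gold_titles) + ranked[:n_distractors]
    random.Random(seed).shuffle(context)  # hide the position of the gold paragraphs
    return context
```

Shuffling matters in a setup like this: if the gold paragraphs always came first, a reader could learn their position instead of their content.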
The fullwiki setting is an open-domain task. The model receives only the question and must retrieve, read, and reason over the full corpus of about 5 million Wikipedia articles. The fullwiki setting is much harder because the gold paragraphs are buried in millions of distractors, and a retrieval miss on either of the two required paragraphs makes the question effectively unanswerable.
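The retrieval bottleneck can be made concrete with a small sketch that counts a question as answerable only when every gold paragraph appears in the top-k retrieved titles. The record layout (`retrieved`, `gold`) and the function names are hypothetical and chosen only for illustration.

```python
def all_gold_retrieved(retrieved_titles, gold_titles):
    """True only if every gold paragraph title is among the retrieved titles."""
    return set(gold_titles) <= set(retrieved_titles)


def paragraph_recall_at_k(examples, k):
    """Fraction of questions whose gold paragraphs are all in the top-k results."""
    if not examples:
        return 0.0
    hits = sum(all_gold_retrieved(ex["retrieved"][:k], ex["gold"]) for ex in examples)
    return hits / len(examples)
```

Under this measure a retriever that finds one of the two gold paragraphs for every question still scores zero, which mirrors why a single miss is so costly in fullwiki.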
Six metrics are reported in each setting. Answer EM and Answer F1 measure exact-match accuracy and token-level F1 against the gold answer. Supporting Facts EM and F1 measure how accurately the model identifies the gold supporting sentences. Joint EM is 1 only when both the answer and the full set of supporting facts are exactly right for the same example; Joint F1 is computed per example from the product of the answer and supporting-fact precisions and the product of their recalls. The leaderboard ranks systems primarily by Joint F1 inside each setting.
| Setting | Inputs | Difficulty | Metrics |
|---|---|---|---|
| Distractor | 10 paragraphs (2 gold + 8 IR distractors) | Reading and reasoning only | Ans EM/F1, Sup EM/F1, Joint EM/F1 |
| Fullwiki | Question + entire Wikipedia (~5M articles) | Retrieval and reasoning | Ans EM/F1, Sup EM/F1, Joint EM/F1 |
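The per-example metrics can be sketched as follows. This is a simplified illustration, not the official evaluation script: it assumes SQuAD-style answer normalization, represents supporting facts as `(title, sentence_index)` pairs, and takes joint precision and recall to be the products of the answer and supporting-fact values. The function names are hypothetical, and special cases in the official script (such as its handling of yes/no answers) are omitted.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def prf(pred_items, gold_items):
    """Precision, recall, and F1 between two lists treated as multisets."""
    overlap = sum((Counter(pred_items) & Counter(gold_items)).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / len(pred_items)
    recall = overlap / len(gold_items)
    return precision, recall, 2 * precision * recall / (precision + recall)


def score_example(pred_answer, gold_answer, pred_facts, gold_facts):
    """Answer, supporting-fact, and joint scores for one question."""
    ans_em = float(normalize(pred_answer) == normalize(gold_answer))
    ans_p, ans_r, ans_f1 = prf(normalize(pred_answer).split(), normalize(gold_answer).split())
    sup_em = float(set(pred_facts) == set(gold_facts))
    sup_p, sup_r, sup_f1 = prf(sorted(set(pred_facts)), sorted(set(gold_facts)))
    joint_p, joint_r = ans_p * sup_p, ans_r * sup_r
    joint_f1 = 2 * joint_p * joint_r / (joint_p + joint_r) if joint_p + joint_r > 0 else 0.0
    return {
        "ans_em": ans_em, "ans_f1": ans_f1,
        "sup_em": sup_em, "sup_f1": sup_f1,
        "joint_em": ans_em * sup_em, "joint_f1": joint_f1,
    }
```

With this sketch, a prediction that gets the answer exactly right but names only one of two gold supporting sentences scores 1.0 on Answer EM, 0.0 on Joint EM, and about 0.67 on Joint F1, which shows how the joint metrics penalize unexplained answers.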
The original paper trained a baseline reader combining a recurrent question encoder, character-level embeddings, a self-attention layer, and a span pointer, with an auxiliary loss for predicting supporting facts. This model was a modified version of Clark and Gardner's 2018 simple-and-effective SQuAD reader, extended with supporting-fact heads.
On the distractor setting, the baseline reached 58.3 Answer F1 and 17.85 Joint F1 on the development set, well below human performance. On fullwiki the baseline retrieved with TF-IDF over Wikipedia introductions and dropped to about 33 Answer F1 and below 10 Joint F1. The paper reported a human upper bound on a dev sample of about 96.4 Answer EM and 91.4 Supporting-Facts EM, with roughly 99 Answer F1 and 88.6 Joint F1.
The HotpotQA leaderboard tracked steady progress for several years. Reference points below are taken from the official leaderboard at hotpotqa.github.io and the corresponding papers.
In 2019 the first wave of strong systems appeared. DecompRC (Min, Zhong, Zettlemoyer, and Hajishirzi, 2019) decomposed each multi-hop question into single-hop sub-questions and answered each independently with a SQuAD-style reader, scoring 70.6 Answer F1 on distractor dev. GoldEn Retriever (Qi, Lin, Mehr, Wang, and Manning, 2019) interleaved retrieval and reading and reached about 37.9 Answer EM on fullwiki. Cognitive Graph QA (Ding, Zhou, Yang, and Tang, 2019) built a graph of entity nodes during reading and used a graph neural network to score answers, reaching 37.6 Answer EM and 22.6 Joint F1 on the fullwiki test set.
In 2020 systems built on BERT and its successors pushed numbers further. HGN, the Hierarchical Graph Network of Fang et al. (2020), tied paragraphs, sentences, entities, and answer candidates together with graph attention and ALBERT encoders, reaching 82.2 Answer F1 and 71.3 Joint F1 on the distractor test set. Asai et al. (2020) introduced a learned path retriever that walked the Wikipedia hyperlink graph and reached 65.4 Answer EM on fullwiki. Longformer (Beltagy, Peters, and Cohan, 2020) showed that a single sparse-attention transformer encoding all 10 distractor paragraphs at once could match graph-based pipelines.
Later systems combined dense retrieval with stronger readers. MDR, the Multi-hop Dense Retriever of Xiong et al. (2021), trained a dense encoder that issued each follow-up query conditioned on the evidence already retrieved, and reached about 62.3 Answer EM, 75.3 Answer F1, and 48.0 Joint F1 on the fullwiki test set. Beam Retrieval and similar designs through 2022 and 2023 reached mid-70s Joint F1 on distractor.
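The idea of conditioning the second retrieval step on the first result can be sketched with a toy example. Token overlap stands in for a learned dense encoder here, which is a deliberate simplification; the corpus text, the scoring function, and the names `retrieve` and `two_hop_retrieve` are hypothetical and do not reproduce MDR or any other published system.

```python
def tokens(text):
    """Lowercased whitespace tokens as a set (toy stand-in for an embedding)."""
    return set(text.lower().split())


def retrieve(query, corpus, exclude=()):
    """Return the title whose paragraph shares the most tokens with the query."""
    candidates = [title for title in sorted(corpus) if title not in exclude]
    return max(candidates, key=lambda title: len(tokens(query) & tokens(corpus[title])))


def two_hop_retrieve(question, corpus):
    """Hop 1 uses the question; hop 2 appends the hop-1 paragraph to the query."""
    first = retrieve(question, corpus)
    expanded_query = question + " " + corpus[first]
    second = retrieve(expanded_query, corpus, exclude=(first,))
    return [first, second]


# Toy corpus with paraphrased, illustrative text (not actual Wikipedia sentences).
CORPUS = {
    "Kiss and Tell": "kiss and tell is a 1945 film starring shirley temple as corliss archer",
    "Shirley Temple": "shirley temple was an actress who later served as chief of protocol",
    "Ed Wood": "ed wood was an american filmmaker",
}
QUESTION = ("what government position was held by the woman who portrayed "
            "corliss archer in the film kiss and tell")
```

In this toy corpus the question alone barely matches the second gold paragraph, while the expanded query picks up the bridge entity's name from the first paragraph, which is the effect iterative retrievers aim for.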
Large language models entered the leaderboard alongside fine-tuned systems. The GPT-3 paper (Brown et al. 2020) reported few-shot results on HotpotQA without retrieval at about 29.9 Answer EM and 41.5 Answer F1, well below specialized retrieval models. Press et al. (2022) introduced Self-Ask, in which GPT-3 asks itself sub-questions and answers them with a search tool, raising performance over standard chain-of-thought prompting. ReAct (Yao et al. 2022) interleaved reasoning traces with Wikipedia search actions; its best configuration, which falls back to chain-of-thought with self-consistency, reached 35.1 EM with PaLM-540B on a HotpotQA dev subset, ahead of plain chain-of-thought at 29.4. IRCoT (Trivedi et al. 2023) interleaved retrieval with chain-of-thought steps using GPT-3 and code-davinci-002, reporting about 48 Answer F1 on a dev subset.
GPT-4 and Claude evaluations have appeared in retrieval-augmented setups rather than as a single leaderboard number, since the questions and gold paragraphs are openly available and may overlap with training data. Several 2023 and 2024 RAG benchmarks (RAGAS, RGB, MultiHop-RAG) use HotpotQA as one of their evaluation sets.
| System (Year) | Setting | Answer EM | Answer F1 | Joint F1 |
|---|---|---|---|---|
| Yang et al. baseline (2018) | distractor dev | 44.4 | 58.3 | 17.9 |
| Yang et al. baseline (2018) | fullwiki dev | 23.9 | 32.9 | 7.3 |
| DecompRC (2019) | distractor dev | 55.2 | 70.6 | -- |
| GoldEn Retriever (2019) | fullwiki test | 37.9 | 49.8 | 16.0 |
| Cognitive Graph (2019) | fullwiki test | 37.6 | 49.4 | 22.6 |
| HGN (2020) | distractor test | 69.2 | 82.2 | 71.3 |
| Asai et al. path retriever (2020) | fullwiki test | 65.4 | 78.4 | 53.0 |
| Longformer (2020) | distractor dev | 70.6 | 81.0 | 64.4 |
| MDR (2021) | fullwiki test | 62.3 | 75.3 | 48.0 |
| ReAct PaLM-540B (2022) | open, no fine-tune | 35.1 | -- | -- |
| IRCoT GPT-3 (2023) | open, no fine-tune | -- | ~48 | -- |
| Human upper bound | distractor sample | 96.4 | 99.0 | 88.6 |
Numbers above are taken from the original papers and from the snapshot of the official leaderboard at hotpotqa.github.io as of 2024. Verification dates and exact ranks shift over time, so the leaderboard remains the authoritative source for current scores.
HotpotQA became a default evaluation dataset for retrieval-augmented generation and tool-using language models. The questions are short and factual, with crisp ground-truth answers and small evidence sets, so they are cheap to evaluate. Each question is designed to require two hops, which exposes whether a system can chain a second retrieval step rather than relying on a single best-paragraph match. The underlying corpus, Wikipedia, is the same corpus most retrieval systems use, so the dataset doubles as a test of multi-hop reasoning and a test of reading inside that corpus.
ReAct (Yao et al. 2022) used HotpotQA as its headline evaluation for tool-augmented prompting. Self-Ask (Press et al. 2022) and IRCoT (Trivedi et al. 2023) followed the same pattern. Toolformer (Schick et al. 2023) used HotpotQA in its evaluation of self-supervised tool learning. The Adaptive-RAG (Jeong et al. 2024) and Self-RAG papers used HotpotQA as their multi-hop test bed. The dataset has functioned as a standard reference point for multi-hop question answering, much as SQuAD did for single-hop reading.
HotpotQA has faced substantial scrutiny since its release.
The most cited critique is by Min, Wallace, Singh, Gardner, Hajishirzi, and Zettlemoyer (2019), Compositional Questions Do Not Necessitate Multi-hop Reasoning. The paper showed that for a large share of HotpotQA bridge questions, a single-paragraph BERT reader given only one of the two gold paragraphs could produce the correct answer with reasonable accuracy. Their analysis estimated that around half of bridge questions are effectively single-hop, often because the wording of the question contains enough surface clues to identify the bridge entity directly. Chen and Durrett (2019) reached a similar conclusion: they removed one of the two supporting paragraphs and found only a modest drop in F1. Trivedi et al. (2020) showed that adversarial distractors selected by a retriever rather than by TF-IDF degraded performance sharply, indicating the original distractor pool was too easy.
A second criticism is the gap between the distractor and fullwiki settings. The distractor setting is closer to a reading-comprehension task, while fullwiki is a retrieval-and-reading task whose primary bottleneck is recall over five million articles. Models that win the distractor leaderboard often do worse on fullwiki and vice versa, meaning the two settings measure different abilities under the same name.
A third issue is leaderboard saturation in the distractor setting. By 2021 several systems were within a few points of the human upper bound on Answer F1, and improvements concentrated on supporting-fact prediction, which depends heavily on tokenization and sentence segmentation choices. The fullwiki setting still has clear headroom against human performance.
Several later datasets were designed in part to address HotpotQA's weaknesses. 2WikiMultiHopQA (Ho, Nguyen, Sugawara, and Aizawa, 2020) generated 192,606 multi-hop questions from Wikipedia and Wikidata using templates and structured triples, with explicit reasoning paths on every example. MuSiQue (Trivedi et al. 2022) built about 25,000 multi-hop questions by composing single-hop questions from existing datasets and removing reasoning shortcuts; previous strong models on HotpotQA dropped sharply on MuSiQue, supporting the claim that HotpotQA had latent shortcuts.
IIRC (Ferguson et al. 2020) collected 13,441 questions where the supporting context is incomplete in the visible passage. StrategyQA (Geva et al. 2021) covered 2,780 yes-no questions whose decomposition is implicit. MultiRC (Khashabi et al. 2018), released a few months before HotpotQA, is a related multi-sentence reading-comprehension dataset of about 6,000 questions.
Dense-retrieval and retrieval-augmented benchmarks have included HotpotQA inside larger evaluation suites. BEIR (Thakur et al. 2021) used HotpotQA as a multi-hop entry in its zero-shot retrieval suite. KILT (Petroni et al. 2021) covered HotpotQA along with several other knowledge-intensive tasks under a unified Wikipedia snapshot. MultiHop-RAG (Tang and Yang, 2024) reused HotpotQA's design philosophy in a news-article benchmark for retrieval-augmented LLMs.
HotpotQA is distributed under the Creative Commons Attribution-ShareAlike 4.0 license, the same license as Wikipedia content. The training and development sets, including all answers and supporting-fact annotations, are publicly downloadable as JSON files from the project page. The test sets are released without answers; predictions must be submitted to the official evaluation server. The dataset, baseline models, evaluation script, and Wikipedia paragraph dump are hosted under the hotpotqa GitHub organization.
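A minimal sketch of reading one record is given below. The field names (`_id`, `question`, `answer`, `type`, `level`, `supporting_facts`, `context`) follow the layout commonly documented for the released JSON files and should be checked against the project page; the record itself is an invented toy example with paraphrased sentences, and `supporting_sentences` is a hypothetical helper.

```python
import json

# Invented toy record; the sentences are paraphrases, not actual dataset text.
RAW = '''
[{"_id": "toy-0",
  "question": "What government position was held by the woman who portrayed Corliss Archer in the film Kiss and Tell?",
  "answer": "Chief of Protocol",
  "type": "bridge",
  "level": "hard",
  "supporting_facts": [["Kiss and Tell (1945 film)", 0], ["Shirley Temple", 1]],
  "context": [
    ["Kiss and Tell (1945 film)",
     ["Kiss and Tell is a 1945 film starring Shirley Temple as Corliss Archer.",
      "Toy filler sentence."]],
    ["Shirley Temple",
     ["Shirley Temple was an American actress.",
      "She later served as Chief of Protocol of the United States."]]
  ]}]
'''


def supporting_sentences(example):
    """Resolve each (title, sentence_index) supporting fact to its sentence text."""
    paragraphs = {title: sentences for title, sentences in example["context"]}
    return [
        paragraphs[title][index]
        for title, index in example["supporting_facts"]
        if title in paragraphs and index < len(paragraphs[title])
    ]


examples = json.loads(RAW)
```

Supporting facts are stored as pointers rather than text, so a loader along these lines has to tolerate pointers that fall outside the provided context; the bounds check above is one cautious way to handle that.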
The first author, Zhilin Yang, was a PhD student at Carnegie Mellon University under Ruslan Salakhutdinov and William W. Cohen at the time of release. He later founded the Chinese AI lab Moonshot AI, known for the Kimi family of long-context models. Peng Qi was a PhD student in the Stanford NLP Group under Christopher D. Manning, who together with Yoshua Bengio and Salakhutdinov advised the project across institutions.