# HotpotQA

> Source: https://aiwiki.ai/wiki/hotpotqa
> Updated: 2026-06-23
> Categories: AI Benchmarks, Artificial Intelligence, Data & Datasets, Natural Language Processing
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**HotpotQA** is a large-scale, multi-hop [question answering](/wiki/question_answering) dataset of 112,779 crowd-authored question-and-answer pairs over English [Wikipedia](/wiki/wikipedia), written so that the answer cannot be found in any single paragraph and instead requires reasoning across two documents. Each example ships with sentence-level supporting facts that mark exactly which sentences justify the answer, which is why the project page describes it as "a question answering dataset featuring natural, multi-hop questions, with strong supervision for supporting facts to enable more explainable question answering systems." [1][2] It was introduced in *HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering* by Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning, presented at EMNLP 2018 in Brussels and posted to arXiv on 25 September 2018 as 1809.09600. [1] It was the first large multi-hop reading-comprehension dataset where annotators wrote the questions themselves and marked the supporting sentences, and it became one of the standard testbeds for [retrieval-augmented generation](/wiki/retrieval_augmented_generation) and tool-using language models.

The dataset is released under the same Creative Commons Attribution-ShareAlike 4.0 license as Wikipedia. [2] The official leaderboard, code, and downloads are hosted at hotpotqa.github.io.

## Quick facts

| Field | Value |
|---|---|
| Released | 25 September 2018 (arXiv preprint); EMNLP, 31 October 2018 |
| Paper | Yang et al. 2018, arXiv:1809.09600 |
| Authors | [Zhilin Yang](/wiki/zhilin_yang), Peng Qi, Saizheng Zhang, [Yoshua Bengio](/wiki/yoshua_bengio), William W. Cohen, [Ruslan Salakhutdinov](/wiki/ruslan_salakhutdinov), [Christopher D. Manning](/wiki/christopher_manning) |
| Affiliations | [Carnegie Mellon University](/wiki/cmu), [Stanford NLP Group](/wiki/stanford_nlp), Université de Montréal / Mila, Google AI |
| Total questions | 112,779 (90,564 train, 7,405 dev, 7,405 distractor test, 7,405 fullwiki test) |
| Domain | English Wikipedia (introduction paragraphs) |
| Question types | Bridge questions (~80%), comparison questions (~20%) |
| Settings | Distractor (10 paragraphs) and Fullwiki (open domain over ~5M articles) |
| Metrics | Exact Match (EM) and F1 on Answer and Supporting Facts; Joint EM/F1 |
| License | Creative Commons Attribution-ShareAlike 4.0 |
| Leaderboard | hotpotqa.github.io |

## What problem was HotpotQA built to solve?

Before HotpotQA, the most influential reading-comprehension benchmark was the Stanford Question Answering Dataset, or [SQuAD](/wiki/squad), released by Rajpurkar et al. in 2016. [3] SQuAD 1.1 contained 100,000 questions written against single Wikipedia paragraphs, and SQuAD 2.0 (Rajpurkar, Jia, and Liang, 2018) added 50,000 unanswerable questions. [3][4] By mid-2018 several systems had matched or exceeded the human F1 score on SQuAD 1.1, suggesting the format was close to saturated and was an incomplete proxy for [reading comprehension](/wiki/reading_comprehension). SQuAD answers are spans inside one paragraph, so a model can succeed by aligning the question to a single sentence and copying.

A second wave of datasets had begun to test reasoning across multiple documents. WikiHop, part of the QAngaroo collection (Welbl, Stenetorp, and Riedel, 2018), generated multi-hop questions automatically from Wikipedia and Wikidata triples and asked models to choose between candidate entities. [5] ComplexWebQuestions (Talmor and Berant, 2018) transformed Freebase queries into natural-language questions whose answer required composing facts from several web pages. [6] TriviaQA (Joshi et al. 2017) and SearchQA (Dunn et al. 2017) added long evidence chains, although their multi-hop content was incidental rather than required.

These earlier datasets had limitations. WikiHop questions were synthesized from knowledge-base templates, so the language was rigid and many questions could be answered by lexical overlap on a single document. None of them annotated which sentences a system needed to read. The HotpotQA authors set out to build a dataset that combined natural human-written language, true multi-hop dependency, sentence-level evidence supervision, and a freely licensed source corpus. The paper states its four design goals directly: the questions "require finding and reasoning over multiple supporting documents to answer"; they "are diverse and not constrained to any pre-existing knowledge bases or knowledge schemas"; the dataset provides "sentence-level supporting facts required for reasoning"; and it offers "a new type of factoid comparison questions." [1]

## How was HotpotQA built?

The HotpotQA team built the dataset on Amazon Mechanical Turk during 2017 and 2018. The pipeline started from a hyperlink graph over the introduction paragraphs of English Wikipedia. The authors used the October 2017 dump and kept only the lead paragraphs, since these contain a high concentration of factual sentences and are short enough for crowd workers to read quickly. [1] They sampled pairs of articles connected by hyperlinks, treating one article as the bridge entity that the question would route through.

For each pair, a worker saw the two paragraphs and was asked to write a question whose answer required information from both, plus the answer itself and the specific supporting sentences. Automated checks filtered out questions whose answer span occurred in only one paragraph or whose required reasoning could be obtained without the second article. Workers were paid roughly two cents per accepted question and were given iterative feedback during a qualification round.

HotpotQA contains two question types. Bridge questions, about 80 percent of the corpus, ask about an entity that links the two paragraphs. An example given in the paper is "What government position was held by the woman who portrayed Corliss Archer in the film Kiss and Tell?" The first paragraph identifies the actress (Shirley Temple), and the second names her role as Chief of Protocol. [1] Comparison questions, about 20 percent of the corpus, contrast two entities on a shared attribute, for example "Were Scott Derrickson and Ed Wood of the same nationality?" These often have yes or no answers and require the model to extract a property from each paragraph and compare it. [1]

The authors also included yes-no questions (about 6 percent of all answers) and a small fraction of comparison questions whose answers are dates or numbers requiring arithmetic. All other answers are extractive spans. Each example is annotated with a small set of supporting sentences, about 2.4 on average, drawn from the gold paragraphs. [1] The final release contains 90,564 training questions (split into 18,089 easy, 56,814 medium, and 15,661 hard examples), 7,405 development questions, and 7,405 hidden test questions for each of the two evaluation settings, for a total of 112,779 questions. [1]
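
To make the annotation concrete, the sketch below shows what a single record with supporting facts might look like and how the marked sentences can be resolved back to text. The field names (`question`, `answer`, `type`, `supporting_facts`, `context`) are assumed from the public JSON layout and should be checked against the downloaded files; the paragraph text is abbreviated and paraphrased for illustration, and `supporting_sentences` is a hypothetical helper.

```python
# Illustrative record: field names are assumed from the public JSON layout, and
# the paragraph text is abbreviated and paraphrased, not copied from the dataset.
example = {
    "question": "What government position was held by the woman who "
                "portrayed Corliss Archer in the film Kiss and Tell?",
    "answer": "Chief of Protocol",
    "type": "bridge",
    # Each supporting fact is a [paragraph title, sentence index] pair.
    "supporting_facts": [["Kiss and Tell (1945 film)", 0], ["Shirley Temple", 1]],
    # Each context entry is a [paragraph title, list of sentences] pair.
    "context": [
        ["Kiss and Tell (1945 film)", [
            "Kiss and Tell is a 1945 comedy film starring Shirley Temple as Corliss Archer.",
            "The film was based on a Broadway play.",
        ]],
        ["Shirley Temple", [
            "Shirley Temple was an American actress and diplomat.",
            "As an adult, she served as Chief of Protocol of the United States.",
        ]],
    ],
}


def supporting_sentences(record):
    """Resolve each (title, sentence_index) pair to the sentence text."""
    paragraphs = {title: sentences for title, sentences in record["context"]}
    resolved = []
    for title, index in record["supporting_facts"]:
        sentences = paragraphs.get(title, [])
        if 0 <= index < len(sentences):  # skip indices that point past the paragraph
            resolved.append((title, sentences[index]))
    return resolved


facts = supporting_sentences(example)
for title, sentence in facts:
    print(f"{title}: {sentence}")
```

In this bridge example the first supporting sentence names the actress and the second supplies the answer, which is the two-step pattern the bridge question type is meant to capture.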

## Splits and statistics

| Split | Questions | Notes |
|---|---|---|
| Train | 90,564 | Easy (18,089), medium (56,814), and hard (15,661); gold paragraphs and supporting facts visible |
| Dev | 7,405 | All hard, gold supervision visible |
| Distractor test | 7,405 | Hidden answers; 10 paragraphs per question (2 gold, 8 distractors) |
| Fullwiki test | 7,405 | Hidden answers; system must retrieve from full Wikipedia |

| Question type | Share | Example pattern |
|---|---|---|
| Bridge | ~80% | First paragraph identifies an entity, second paragraph supplies the answer |
| Comparison | ~20% | Compare two entities on a shared attribute (often yes-no, date, or number) |

## What are the two evaluation settings?

HotpotQA evaluates models in two regimes that share the same question set but differ in retrieval difficulty.

The **distractor** setting hands the system 10 paragraphs per question, and as the project page puts it, "a question-answering system reads 10 paragraphs to provide an answer (Ans) to a question." [2] Two of the 10 are the gold paragraphs containing the supporting facts; the other eight are distractors retrieved by a bigram TF-IDF query over the question against all Wikipedia introductions. [1] The model must read those 10 paragraphs, return the answer span (or yes / no), and predict which sentences are supporting facts. This setting isolates [multi-hop reasoning](/wiki/multi_hop_reasoning) and explanation from the open-domain retrieval problem.
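
The distractor selection described above can be approximated in a few lines. This is a minimal sketch, assuming a simple smoothed IDF and raw bigram counts; the authors' exact weighting, tokenization, and corpus are not specified here, and the function names and toy corpus are hypothetical.

```python
import math
import re
from collections import Counter


def bigrams(text):
    """Lowercase, tokenize on alphanumerics, and return adjacent word pairs."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return list(zip(tokens, tokens[1:]))


def rank_by_bigram_tfidf(question, paragraphs):
    """Rank titles by the summed TF-IDF weight of bigrams shared with the question."""
    doc_counts = {title: Counter(bigrams(text)) for title, text in paragraphs.items()}
    n_docs = len(doc_counts)
    doc_freq = Counter()
    for counts in doc_counts.values():
        doc_freq.update(counts.keys())  # one count per document containing the bigram

    query = set(bigrams(question))
    scores = {}
    for title, counts in doc_counts.items():
        score = 0.0
        for gram in query:
            if gram in counts:
                # Smoothed IDF (an assumption, not the paper's stated formula).
                idf = math.log((1 + n_docs) / (1 + doc_freq[gram])) + 1.0
                score += counts[gram] * idf
        scores[title] = score
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))


def select_distractors(question, paragraphs, gold_titles, k=8):
    """Keep the k highest-ranked paragraphs that are not gold."""
    ranked = rank_by_bigram_tfidf(question, paragraphs)
    return [t for t in ranked if t not in gold_titles][:k]


# Toy corpus standing in for the Wikipedia introductions.
corpus = {
    "Kiss and Tell (1945 film)": "Kiss and Tell is a 1945 comedy film starring Shirley Temple as Corliss Archer.",
    "Shirley Temple": "Shirley Temple was an actress and diplomat who served as Chief of Protocol.",
    "Meet Corliss Archer": "Meet Corliss Archer is a radio program about Corliss Archer and her family.",
    "Chief of Protocol": "The Chief of Protocol is an officer of the United States Department of State.",
    "Brussels": "Brussels is the capital of Belgium.",
}
question = ("What government position was held by the woman who portrayed "
            "Corliss Archer in the film Kiss and Tell?")
gold = {"Kiss and Tell (1945 film)", "Shirley Temple"}
distractors = select_distractors(question, corpus, gold, k=2)
print(distractors)
```

The paragraph that repeats the phrase "Corliss Archer" ranks first among the non-gold candidates, which illustrates why lexically selected distractors look topically similar to the question while lacking the answer.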

The **fullwiki** setting is an open-domain task in which a system "must find the answer to a question in the scope of the entire Wikipedia." [2] The model receives only the question and must retrieve, read, and reason over the full corpus of about 5 million Wikipedia articles. The fullwiki setting is much harder because the gold paragraphs are buried in millions of distractors, and a retrieval miss on either of the two required paragraphs makes the question effectively unanswerable.
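
Because a miss on either gold paragraph is fatal, retrieval for this setting is often judged by whether both gold paragraphs appear together in the top results. The sketch below shows one way to compute that; the function names and the `runs` structure are hypothetical and not part of the official evaluation.

```python
def both_gold_retrieved(retrieved_titles, gold_titles, k=10):
    """True only if every gold paragraph title appears in the top-k retrieved titles."""
    return set(gold_titles) <= set(retrieved_titles[:k])


def paired_recall_at_k(runs, k=10):
    """Fraction of questions for which all gold paragraphs are in the top k."""
    if not runs:
        return 0.0
    hits = sum(both_gold_retrieved(r["retrieved"], r["gold"], k) for r in runs)
    return hits / len(runs)


# Toy retrieval results with placeholder titles.
runs = [
    {"gold": ["A", "B"], "retrieved": ["A", "X", "B", "Y"]},  # both found within top 3
    {"gold": ["C", "D"], "retrieved": ["C", "X", "Y", "Z"]},  # second paragraph missed
    {"gold": ["E", "F"], "retrieved": ["X", "F", "E", "Y"]},  # both found within top 3
]
print(paired_recall_at_k(runs, k=3))
print(paired_recall_at_k(runs, k=2))
```

Counting a question as a hit only when all gold paragraphs are present is stricter than per-paragraph recall, and it matches the observation that finding one of the two paragraphs is not enough.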

Six metrics, in three pairs, are reported for each setting. Answer EM and Answer F1 measure exact-match accuracy and token-level F1 against the gold answer. Supporting Facts EM and F1 measure how accurately the model identifies the gold supporting sentences. Joint EM and Joint F1 combine the two, so an example scores well only when both the answer and the supporting facts are correct. The leaderboard ranks systems primarily by Joint F1 inside each setting.

| Setting | Inputs | Difficulty | Metrics |
|---|---|---|---|
| Distractor | 10 paragraphs (2 gold + 8 IR distractors) | Reading and reasoning only | Ans EM/F1, Sup EM/F1, Joint EM/F1 |
| Fullwiki | Question + entire Wikipedia (~5M articles) | Retrieval and reasoning | Ans EM/F1, Sup EM/F1, Joint EM/F1 |
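
The three metric pairs can be sketched as follows. This is an approximation under stated assumptions: the answer normalization follows the common SQuAD-style recipe, and the joint score multiplies answer and supporting-fact precision and recall before taking F1. The official evaluation script is authoritative and may differ in details; all function names here are illustrative.

```python
import re
import string
from collections import Counter

_PUNCT = set(string.punctuation)


def normalize(text):
    """Lowercase, drop punctuation and English articles, and collapse whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in _PUNCT)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def answer_scores(prediction, gold):
    """Return (exact_match, precision, recall, f1) for an answer string."""
    pred_norm, gold_norm = normalize(prediction), normalize(gold)
    em = float(pred_norm == gold_norm)
    pred_tokens, gold_tokens = pred_norm.split(), gold_norm.split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return em, 0.0, 0.0, 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return em, precision, recall, 2 * precision * recall / (precision + recall)


def support_scores(predicted, gold):
    """Return (exact_match, precision, recall, f1) over sets of (title, sentence_index)."""
    predicted, gold = set(predicted), set(gold)
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
    return float(predicted == gold), precision, recall, f1


def joint_scores(answer, support):
    """Combine answer and supporting-fact scores into (joint_em, joint_f1)."""
    a_em, a_p, a_r, _ = answer
    s_em, s_p, s_r, _ = support
    p, r = a_p * s_p, a_r * s_r  # assumed combination: products of precision and recall
    f1 = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return a_em * s_em, f1


answer = answer_scores("the Chief of Protocol of the United States", "Chief of Protocol")
support = support_scores(
    [("Kiss and Tell (1945 film)", 0), ("Shirley Temple", 1), ("Shirley Temple", 0)],
    [("Kiss and Tell (1945 film)", 0), ("Shirley Temple", 1)],
)
joint = joint_scores(answer, support)
print(answer, support, joint)
```

In this example the over-long answer and the one extra predicted sentence each cost precision, and the joint F1 ends up lower than either component F1 because the penalties compound.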

## How well did the original baseline and humans do?

The original paper trained a baseline reader combining a recurrent question encoder, character-level embeddings, a self-attention layer, and a span pointer, with an auxiliary loss for predicting supporting facts. This model was a modified version of Clark and Gardner's 2018 simple-and-effective SQuAD reader, extended with supporting-fact heads.

On the distractor setting, the baseline reached 58.28 Answer F1 and 40.86 Joint F1 on the development set (58.99 Answer F1 and 41.37 Joint F1 on test), well below human performance. [1] On fullwiki the baseline retrieved with TF-IDF over Wikipedia introductions and dropped to 34.36 Answer F1 and 17.73 Joint F1 on dev (34.40 and 17.85 on test). [1] Human annotators on a held-out sample scored 68.99 Answer F1 and 52.37 Joint F1, while the human upper bound (a second annotator who could also see the gold answer) reached 96.80 Answer EM, 98.77 Answer F1, 87.40 Supporting-Facts EM, and 96.37 Joint F1. [1] The large gap between the baseline and humans, especially on Joint F1, was the original motivation for the leaderboard.

## How well do models perform on HotpotQA?

The HotpotQA leaderboard tracked steady progress for several years. Reference points below are taken from the official leaderboard at hotpotqa.github.io and the corresponding papers. [2]

In 2019 the first wave of strong systems appeared. **DecompRC** (Min, Zhong, Zettlemoyer, and Hajishirzi, 2019) decomposed each multi-hop question into single-hop sub-questions and answered each independently with a SQuAD-style reader, scoring 70.6 Answer F1 on distractor dev. [7] **GoldEn Retriever** (Qi, Lin, Mehr, Wang, and Manning, 2019) interleaved retrieval and reading and reached about 37.9 Answer EM on fullwiki. [8] **Cognitive Graph QA** (Ding, Zhou, Chen, Yang, and Tang, 2019) built a graph of entity nodes during reading and used a graph neural network to score answers, reaching 37.6 Answer EM on fullwiki test. [9]

In 2020 systems built on [BERT](/wiki/bert) and its successors pushed numbers further. **HGN**, the Hierarchical Graph Network of Fang et al. (2020), tied paragraphs, sentences, entities, and answer candidates together with graph attention and ALBERT encoders, reaching 82.2 Answer F1 and 71.3 Joint F1 on distractor test. [10] **Asai et al. (2020)** introduced a learned path retriever that walked the Wikipedia hyperlink graph and reached 65.4 Answer EM in fullwiki. [11] **Longformer** (Beltagy, Peters, and Cohan, 2020) showed that a single sparse-attention transformer encoding all 10 distractor paragraphs at once could match graph-based pipelines. [12]

Later systems combined dense retrieval with stronger readers. **MDR**, the Multi-hop Dense Retriever of Xiong et al. (2021), trained a recurrent dense encoder that issued the next query conditioned on retrieved evidence and reached about 62.3 Answer EM, 75.3 Answer F1, and 48.0 Joint F1 on the fullwiki test set, a substantial gain for dense retrieval in the open-domain setting. [13] **Beam Retrieval** and similar designs through 2022 and 2023 reached mid-70s Joint F1 on distractor.
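
The idea of issuing a second query conditioned on first-hop evidence can be illustrated with a toy retriever. This is not MDR's model: a word-overlap score stands in for the learned dense encoder, and the corpus, stopword list, and function names are hypothetical.

```python
import re

# Small hand-picked stopword list (an assumption for this toy example).
STOPWORDS = {"a", "an", "and", "as", "by", "in", "is", "of", "the", "was", "what", "who"}


def content_tokens(text):
    """Lowercased alphanumeric tokens with stopwords removed."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS


def overlap_score(query, passage):
    """Stand-in relevance score: number of distinct shared content tokens."""
    return len(content_tokens(query) & content_tokens(passage))


def retrieve(query, corpus, exclude=()):
    """Return the best-scoring title that has not been retrieved yet."""
    candidates = sorted(t for t in corpus if t not in exclude)
    return max(candidates, key=lambda t: overlap_score(query, corpus[t]))


def two_hop_retrieve(question, corpus):
    """Retrieve one passage, then re-query with the question plus that passage."""
    first = retrieve(question, corpus)
    # The second query is conditioned on the first hop's evidence.
    second_query = question + " " + corpus[first]
    second = retrieve(second_query, corpus, exclude={first})
    return [first, second]


corpus = {
    "Kiss and Tell (1945 film)": "Kiss and Tell is a 1945 comedy film starring Shirley Temple as Corliss Archer.",
    "Shirley Temple": "Shirley Temple was an actress and diplomat who served as Chief of Protocol.",
    "Brussels": "Brussels is the capital of Belgium.",
    "Ed Wood": "Ed Wood was an American filmmaker.",
}
question = ("What government position was held by the woman who portrayed "
            "Corliss Archer in the film Kiss and Tell?")
path = two_hop_retrieve(question, corpus)
print(path)
```

With the question alone, the second gold passage shares no content words with the query; it becomes reachable only after the first passage contributes the bridge entity's name, which is the behavior iterative retrievers are built to exploit.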

Large language models entered the leaderboard alongside fine-tuned systems. The **GPT-3** paper (Brown et al. 2020) reported few-shot results on HotpotQA without retrieval at about 29.9 Answer EM and 41.5 Answer F1, well below specialized retrieval models. [22] Press et al. 2022 introduced **Self-Ask**, in which [GPT-3](/wiki/gpt_3) asks itself sub-questions and answers them with a search tool, raising performance over standard [chain-of-thought prompting](/wiki/chain_of_thought). [23] **ReAct** (Yao et al. 2022) interleaved reasoning traces with Wikipedia search actions, reaching about 35 EM with PaLM-540B on a HotpotQA dev subset, ahead of plain chain-of-thought. [24] **IRCoT** (Trivedi et al. 2023) interleaved retrieval with chain-of-thought steps using GPT-3 and code-davinci-002, reporting about 48 Answer F1 on a dev subset. [25]

[GPT-4](/wiki/gpt_4) and Claude evaluations have appeared in retrieval-augmented setups rather than as a single leaderboard number, since the questions and gold paragraphs are openly available and may overlap with training data. Several 2023 and 2024 RAG benchmarks (RAGAS, RGB, MultiHop-RAG) use HotpotQA as one of their evaluation sets.

| System (Year) | Setting | Answer EM | Answer F1 | Joint F1 |
|---|---|---|---|---|
| Yang et al. baseline (2018) | distractor dev | 44.4 | 58.3 | 40.9 |
| Yang et al. baseline (2018) | fullwiki dev | 24.7 | 34.4 | 17.7 |
| DecompRC (2019) | distractor dev | 55.2 | 70.6 | -- |
| GoldEn Retriever (2019) | fullwiki test | 37.9 | 49.8 | 16.0 |
| Cognitive Graph (2019) | fullwiki test | 37.6 | 49.4 | 22.6 |
| HGN (2020) | distractor test | 69.2 | 82.2 | 71.3 |
| Asai et al. path retriever (2020) | fullwiki test | 65.4 | 78.4 | 53.0 |
| Longformer (2020) | distractor dev | 70.6 | 81.0 | 64.4 |
| MDR (2021) | fullwiki test | 62.3 | 75.3 | 48.0 |
| ReAct PaLM-540B (2022) | open, no fine-tune | ~35 | -- | -- |
| IRCoT GPT-3 (2023) | open, no fine-tune | -- | ~48 | -- |
| Human (gold sample) | distractor sample | 60.9 | 69.0 | 52.4 |
| Human upper bound | distractor sample | 96.8 | 98.8 | 96.4 |

Numbers above are taken from the original papers and from the snapshot of the official leaderboard at hotpotqa.github.io as of 2024. Verification dates and exact ranks shift over time, so the leaderboard remains the authoritative source for current scores. [2]

## How is HotpotQA used for RAG and tool use?

HotpotQA became a default evaluation dataset for [retrieval-augmented generation](/wiki/retrieval_augmented_generation) and tool-using language models. The questions are short and factual, with crisp ground-truth answers and small evidence sets, so they are cheap to evaluate. Each question is designed to be two-hop, which exposes whether a system can chain a second retrieval step rather than relying on a single best-paragraph match. The underlying corpus, Wikipedia, is the same corpus most retrieval systems use, so the dataset doubles as a test of multi-hop reasoning and a test of reading inside that corpus.

ReAct (Yao et al. 2022) used HotpotQA as its headline evaluation for tool-augmented prompting. [24] Self-Ask (Press et al. 2022) and IRCoT (Trivedi et al. 2023) followed the same pattern. [23][25] Toolformer (Schick et al. 2023) used HotpotQA in its evaluation of self-supervised tool learning. [26] Adaptive-RAG (Jeong et al. 2024) and Self-RAG papers used HotpotQA as their multi-hop test bed. The dataset has functioned as a standard reference point, somewhat like SQuAD did for single-hop reading.

## What are the main criticisms of HotpotQA?

HotpotQA has faced substantial scrutiny since its release.

The most cited critique is by Min, Wallace, Singh, Gardner, Hajishirzi, and Zettlemoyer (2019), *Compositional Questions Do Not Necessitate Multi-hop Reasoning*. [14] The paper showed that for a large share of HotpotQA bridge questions, a single-paragraph BERT reader given only one of the two gold paragraphs could produce the correct answer with reasonable accuracy. Their analysis estimated that around half of bridge questions are effectively single-hop, often because the wording of the question contains enough surface clues to identify the bridge entity directly. Chen and Durrett (2019) reached a similar conclusion: they removed one of the two supporting paragraphs and found only a modest drop in F1. [15] Trivedi et al. (2020) showed that adversarial distractors selected by a retriever rather than by TF-IDF degraded performance sharply, indicating the original distractor pool was too easy. [16]

A second criticism is the gap between the distractor and fullwiki settings. The distractor setting is closer to a reading-comprehension task, while fullwiki is a retrieval-and-reading task whose primary bottleneck is recall over five million articles. Models that win the distractor leaderboard often do worse on fullwiki and vice versa, meaning the two settings measure different abilities under the same name.

A third issue is leaderboard saturation in the distractor setting. By 2021 several systems were within a few points of the human upper bound on Answer F1, and improvements concentrated on supporting-fact prediction, which depends heavily on tokenization and sentence segmentation choices. The fullwiki setting still has clear headroom against human performance.

## What datasets followed HotpotQA?

Several later datasets were designed in part to address HotpotQA's weaknesses. **2WikiMultiHopQA** (Ho, Nguyen, Sugawara, and Aizawa, 2020) generated 192,606 multi-hop questions from Wikipedia and Wikidata using templates and structured triples, with explicit reasoning paths on every example. [17] **MuSiQue** (Trivedi et al. 2022) built about 25,000 multi-hop questions by composing single-hop questions from existing datasets and removing reasoning shortcuts; previous strong models on HotpotQA dropped sharply on MuSiQue, supporting the claim that HotpotQA had latent shortcuts. [18]

**IIRC** (Ferguson et al. 2020) collected 13,441 questions where the supporting context is incomplete in the visible passage. [19] **StrategyQA** (Geva et al. 2021) covered 2,780 yes-no questions whose decomposition is implicit. [20] **MultiRC** (Khashabi et al. 2018), released a few months before HotpotQA, is a related multi-sentence reading-comprehension dataset of about 6,000 questions. [21]

Dense-retrieval and retrieval-augmented benchmarks have included HotpotQA inside larger evaluation suites. **BEIR** (Thakur et al. 2021) used HotpotQA as a multi-hop entry in its zero-shot retrieval suite. [27] **KILT** (Petroni et al. 2021) covered HotpotQA along with several other knowledge-intensive tasks under a unified Wikipedia snapshot. [28] **MultiHop-RAG** (Tang and Yang, 2024) reused HotpotQA's design philosophy in a news-article benchmark for retrieval-augmented LLMs. [29]

## Is HotpotQA open source?

Yes. HotpotQA is distributed under the Creative Commons Attribution-ShareAlike 4.0 license, the same license as Wikipedia content. [2] The training and development sets, including all answers and supporting-fact annotations, are publicly downloadable as JSON files from the project page. The test sets are released without answers; predictions must be submitted to the official evaluation server. The dataset, baseline models, evaluation script, and Wikipedia paragraph dump are hosted under the hotpotqa GitHub organization.

The primary author, Zhilin Yang, was a PhD student at [Carnegie Mellon University](/wiki/cmu) under [Ruslan Salakhutdinov](/wiki/ruslan_salakhutdinov) and William W. Cohen at the time of release. He later founded the Chinese AI lab [Moonshot AI](/wiki/moonshot_ai), known for the Kimi family of long-context models. Peng Qi was a PhD student in the [Stanford NLP Group](/wiki/stanford_nlp) under [Christopher D. Manning](/wiki/christopher_manning), who together with [Yoshua Bengio](/wiki/yoshua_bengio) and Salakhutdinov advised the project across institutions.

## References

1. Yang, Z., Qi, P., Zhang, S., Bengio, Y., Cohen, W. W., Salakhutdinov, R., and Manning, C. D. (2018). *HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering*. Proceedings of EMNLP 2018, pp. 2369-2380. arXiv:1809.09600. https://arxiv.org/abs/1809.09600
2. HotpotQA project page and leaderboard. https://hotpotqa.github.io.
3. Rajpurkar, P., Zhang, J., Lopyrev, K., and Liang, P. (2016). *SQuAD: 100,000+ Questions for Machine Comprehension of Text*. EMNLP.
4. Rajpurkar, P., Jia, R., and Liang, P. (2018). *Know What You Don't Know: Unanswerable Questions for SQuAD*. ACL.
5. Welbl, J., Stenetorp, P., and Riedel, S. (2018). *Constructing Datasets for Multi-hop Reading Comprehension Across Documents*. TACL.
6. Talmor, A., and Berant, J. (2018). *The Web as a Knowledge-Base for Answering Complex Questions*. NAACL.
7. Min, S., Zhong, V., Zettlemoyer, L., and Hajishirzi, H. (2019). *Multi-hop Reading Comprehension through Question Decomposition and Rescoring*. ACL. (DecompRC)
8. Qi, P., Lin, X. V., Mehr, L., Wang, Z., and Manning, C. D. (2019). *Answering Complex Open-domain Questions Through Iterative Query Generation*. EMNLP. (GoldEn Retriever)
9. Ding, M., Zhou, C., Chen, Q., Yang, H., and Tang, J. (2019). *Cognitive Graph for Multi-Hop Reading Comprehension at Scale*. ACL.
10. Fang, Y., Sun, S., Gan, Z., Pillai, R., Wang, S., and Liu, J. (2020). *Hierarchical Graph Network for Multi-hop Question Answering*. EMNLP.
11. Asai, A., Hashimoto, K., Hajishirzi, H., Socher, R., and Xiong, C. (2020). *Learning to Retrieve Reasoning Paths over Wikipedia Graph for Question Answering*. ICLR.
12. Beltagy, I., Peters, M. E., and Cohan, A. (2020). *Longformer: The Long-Document Transformer*. arXiv:2004.05150.
13. Xiong, W., Li, X. L., Iyer, S., Du, J., Lewis, P., Wang, W. Y., Mehdad, Y., Yih, W., Riedel, S., Kiela, D., and Oguz, B. (2021). *Answering Complex Open-Domain Questions with Multi-Hop Dense Retrieval*. ICLR. (MDR)
14. Min, S., Wallace, E., Singh, S., Gardner, M., Hajishirzi, H., and Zettlemoyer, L. (2019). *Compositional Questions Do Not Necessitate Multi-hop Reasoning*. ACL.
15. Chen, J., and Durrett, G. (2019). *Understanding Dataset Design Choices for Multi-hop Reasoning*. NAACL.
16. Trivedi, H., Balasubramanian, N., Khot, T., and Sabharwal, A. (2020). *Is Multihop QA in DiRe Condition? Measuring and Reducing Disconnected Reasoning*. EMNLP.
17. Ho, X., Nguyen, A., Sugawara, S., and Aizawa, A. (2020). *Constructing A Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps*. COLING. (2WikiMultiHopQA)
18. Trivedi, H., Balasubramanian, N., Khot, T., and Sabharwal, A. (2022). *MuSiQue: Multihop Questions via Single-hop Question Composition*. TACL.
19. Ferguson, J., Gardner, M., Hajishirzi, H., Khot, T., and Dasigi, P. (2020). *IIRC: A Dataset of Incomplete Information Reading Comprehension Questions*. EMNLP.
20. Geva, M., Khashabi, D., Segal, E., Khot, T., Roth, D., and Berant, J. (2021). *Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies*. TACL. (StrategyQA)
21. Khashabi, D., Chaturvedi, S., Roth, M., Upadhyay, S., and Roth, D. (2018). *Looking Beyond the Surface: A Challenge Set for Reading Comprehension over Multiple Sentences*. NAACL. (MultiRC)
22. Brown, T. B., et al. (2020). *Language Models are Few-Shot Learners*. NeurIPS. (GPT-3)
23. Press, O., Zhang, M., Min, S., Schmidt, L., Smith, N. A., and Lewis, M. (2022). *Measuring and Narrowing the Compositionality Gap in Language Models*. arXiv:2210.03350. (Self-Ask)
24. Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., and Cao, Y. (2022). *ReAct: Synergizing Reasoning and Acting in Language Models*. arXiv:2210.03629.
25. Trivedi, H., Balasubramanian, N., Khot, T., and Sabharwal, A. (2023). *Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions*. ACL. (IRCoT)
26. Schick, T., Dwivedi-Yu, J., Dessì, R., Raileanu, R., Lomeli, M., Hambro, E., Zettlemoyer, L., Cancedda, N., and Scialom, T. (2023). *Toolformer: Language Models Can Teach Themselves to Use Tools*. arXiv:2302.04761.
27. Thakur, N., Reimers, N., Rücklé, A., Srivastava, A., and Gurevych, I. (2021). *BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models*. NeurIPS Datasets and Benchmarks.
28. Petroni, F., et al. (2021). *KILT: a Benchmark for Knowledge Intensive Language Tasks*. NAACL.
29. Tang, Y., and Yang, Y. (2024). *MultiHop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries*. arXiv:2401.15391.

