Question Answering Models
Last reviewed
May 31, 2026
Sources
27 citations
Review status
Source-backed
Revision
v3 · 5,186 words
See also: Natural Language Processing Models and Tasks
Question answering (QA) models are natural language processing systems that take a natural-language question as input and return a natural-language answer, optionally grounded in a supplied passage, document collection, or knowledge base. QA has been a central benchmark for machine reading comprehension since the introduction of large-scale span-extraction datasets in 2016, and it remains a primary application of large language models (LLMs) and retrieval-augmented generation (RAG) systems in 2026.
QA systems are typically categorized along several axes. Extractive QA locates an answer as a contiguous span of text within a supplied passage, a formulation popularized by the SQuAD dataset (Rajpurkar et al., 2016). The model predicts a start token and an end token within the passage, and the text between those positions becomes the answer. This formulation is clean to evaluate because the prediction is directly comparable to a gold span.
Abstractive or generative QA produces a free-form answer that may not appear verbatim in any source passage and is closer to summarization in style. Generative models may rephrase, condense, or synthesize information from multiple sources, producing more natural responses at the cost of harder evaluation.
A second axis distinguishes closed-book QA, where the model relies only on knowledge encoded in its parameters during training, from open-domain QA (sometimes called open-book QA), where the system retrieves supporting documents from a corpus such as Wikipedia or the open web before producing an answer (Chen et al., 2017). Open-domain QA in turn decomposes into a retrieval step and a reader or generator step, an architecture often called retrieve-then-read.
Additional task variants include the following.
- Multi-hop QA requires combining information across multiple documents or passages, since no single passage contains a sufficient answer (Yang et al., 2018). This forces the model to chain inferences: finding an intermediate entity in one document and then using that entity to locate the final answer in another.
- Conversational QA presents questions in a dialogue history, requiring the system to resolve coreferences and track entities across turns; the CoQA and QuAC datasets formalize this setting.
- Knowledge base QA (KBQA) answers questions over a structured graph such as Wikidata or Freebase by generating a formal query, typically SPARQL, that retrieves the answer entity or literal directly.
- Community QA mines question-answer pairs from forums like Stack Exchange or Reddit.
- Long-form QA requires paragraph-length answers rather than short spans; the ELI5 dataset (Fan et al., 2019) and ASQA (Stelmakh et al., 2022) exemplify this setting.
- Visual QA grounds answers in images or document images and is treated under multimodal models; document VQA specifically involves understanding text rendered in scanned PDFs, invoices, forms, and presentations.
- Medical and scientific QA applies QA techniques in high-stakes domains requiring precise factual accuracy, with dedicated benchmarks such as BioASQ, PubMedQA, and MedQA (USMLE).
Factoid QA traces back to systems like the TREC QA Track in the late 1990s and IBM's Watson, which won the Jeopardy! challenge in 2011 using a combination of structured knowledge, information retrieval, and learned ranking. Watson's DeepQA architecture ran more than 100 natural language processing and evidence scoring techniques in parallel across over 2,500 compute cores, querying 200 million pages of locally stored text and producing an answer in 3 to 5 seconds (Ferrucci et al., 2010). The system was a remarkable engineering achievement but relied heavily on hand-crafted pipelines that did not generalize easily to new domains.
Modern neural QA effectively began with the release of large supervised reading-comprehension datasets. The Stanford Question Answering Dataset (SQuAD) introduced 100,000 crowd-written question-answer pairs over Wikipedia paragraphs, framing QA as span prediction (Rajpurkar et al., 2016). The clean format and large scale made SQuAD a standard benchmark that drove rapid model improvement over the following two years.
Early neural readers used attention-based recurrent networks. The Bidirectional Attention Flow (BiDAF) model of Seo et al. (2016) introduced a query-aware context representation using a memory-less, bi-directional attention mechanism and became a standard SQuAD baseline. R-NET from Microsoft Research and QANet from Google (Yu et al., 2018) further improved span-extraction accuracy using gated self-matching attention and convolution plus self-attention respectively. By 2018, recurrent-network ensembles reached F1 scores above 85 on SQuAD 1.1, approaching the reported human ceiling of approximately 91 F1.
For open-domain QA, Chen et al.'s DrQA system (2017) combined a TF-IDF retriever over all of Wikipedia with a recurrent-network span reader, establishing the retrieve-then-read template that later approaches refined.
The Transformer architecture and especially the bidirectional encoder BERT (Devlin et al., 2018) reshaped QA. Fine-tuned BERT-Large reached 93.2 F1 on SQuAD 1.1, surpassing reported human performance, and 83.1 F1 on SQuAD 2.0, which adds adversarially written unanswerable questions (Rajpurkar et al., 2018). BERT's fine-tuning recipe for extractive QA is straightforward: add a linear classification head that produces start-token and end-token logits over the passage, then fine-tune all parameters jointly. Most subsequent extractive systems are variants of pre-trained encoders such as RoBERTa, ALBERT, ELECTRA, and DeBERTa with this span-prediction head.
Open-domain retrieval shifted from sparse to dense methods in 2020. Dense Passage Retrieval (DPR) trained two BERT encoders, one for queries and one for passages, with in-batch negatives and outperformed BM25 on top-20 retrieval accuracy by 9 to 19 absolute points across several benchmarks (Karpukhin et al., 2020). ColBERT introduced a late-interaction architecture that scores documents through fine-grained per-token similarity while still allowing offline indexing (Khattab and Zaharia, 2020).
Two influential systems unified retrieval with end-to-end training. REALM jointly pre-trained a knowledge retriever and a masked language model and improved open-domain QA accuracy by 4 to 16 absolute points over prior methods (Guu et al., 2020). RAG by Lewis et al. (2020) coupled a DPR retriever with a BART generator and set the state of the art on three open-domain QA tasks while producing more specific and factual generated text than a parametric-only baseline.
Izacard and Grave's Fusion-in-Decoder (FiD) model fed many retrieved passages independently through a T5 encoder and concatenated their representations in the decoder, scaling cleanly with the number of retrieved passages and topping Natural Questions and TriviaQA leaderboards in 2020 (Izacard and Grave, 2020). Atlas, a follow-up from Meta, combined a Contriever retriever with FiD and reached over 42 percent accuracy on Natural Questions using only 64 training examples, beating a 540-billion-parameter model with 50 times fewer parameters (Izacard et al., 2022).
Closed-book QA was investigated by Roberts et al. (2020), who fine-tuned T5 models up to 11 billion parameters on QA tasks without any retrieval and achieved competitive results on TriviaQA and WebQuestions. The same family of experiments showed that knowledge stored in parameters scales with model size but lags retrieval-augmented systems on rare facts.
Multi-hop reasoning received dedicated attention during this period. The 2WikiMultiHopQA dataset (Ho et al., 2020) required chaining evidence across two Wikipedia articles through causal reasoning. MuSiQue (Trivedi et al., 2022) raised the difficulty further with construction constraints ensuring that models must integrate at least two hops; a single-paragraph baseline achieves approximately 32 F1 on MuSiQue compared to roughly 65 F1 on HotpotQA, reflecting a more genuine requirement for multi-step inference.
The arrival of GPT-4, the Claude series, Gemini, Llama 3, and similar systems reshaped QA again. Modern LLMs answer most factoid and reasoning questions zero-shot from parameters and reach or exceed human SQuAD scores without any task-specific fine-tuning. On the MMLU benchmark, which spans 57 subjects from elementary mathematics to professional law, GPT-4o achieved 88.7% accuracy and Claude 3 Opus achieved 86.8%, both well above the 52 to 56% range reported for initial MMLU results with earlier models. For knowledge-intensive or fast-changing queries, production systems wrap an LLM with retrieval over a private or web corpus, a pattern now ubiquitous in search products such as ChatGPT search, Perplexity, Google AI Overviews, and Microsoft Copilot. Tool use, code execution, and browsing further extend QA to questions that require live data or calculation.
Extractive QA is the most studied QA variant in the academic literature, partly because its clean formulation allows automatic evaluation with high inter-annotator agreement. The standard implementation fine-tunes a pre-trained encoder with two linear projection heads, one for start positions and one for end positions, over the context tokens. The model selects the span with the highest combined start-end probability. SQuAD 2.0 extended this by requiring the model to also predict whether a question is answerable given the passage; unanswerable questions were written adversarially to look plausible, making the task a realistic test of model confidence. ALBERT and DeBERTa variants dominate the SQuAD 2.0 leaderboard as of 2025.
Generative QA systems produce answers by decoding a sequence rather than selecting a span. Encoder-decoder models such as T5 and BART, and decoder-only models such as GPT, can be fine-tuned to generate answers conditioned on a question and optionally a passage. For many real-world questions the correct answer is not a verbatim phrase from a source, so generative models match human expectations better even when extractive metrics are not directly applicable. Natural Questions' long-answer track and the MS MARCO generative track are examples of this format.
Open-domain QA requires the system to answer without being given a pre-selected passage; instead it must identify the relevant documents from a large corpus (often all of Wikipedia or the web). The retrieve-then-read pipeline, introduced by DrQA and refined by DPR and RAG, remains the dominant architecture. A retriever returns a set of passages, and a reader (extractive or generative) produces the answer from those passages. End-to-end trained systems like REALM and RAG improve over fixed pipeline approaches by allowing gradients to flow through the retrieval decision during training, so the retriever is optimized for answer quality rather than trained in isolation.
Multi-hop questions require the system to locate an intermediate answer and use it as a stepping stone toward the final answer. For example, "Who is the spouse of the person who directed Interstellar?" requires first identifying Christopher Nolan as the director and then finding his spouse. HotpotQA (Yang et al., 2018) introduced 113,000 such questions with supporting fact annotations that make it possible to evaluate whether the model retrieves the right evidence. Approaches to multi-hop QA include iterative retrieval (issue a query, retrieve, issue a follow-up query based on the retrieved content), chain-of-thought prompting over retrieved passages, and graph-based reasoning that builds an entity graph from retrieved text.
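The iterative pattern described above can be sketched with a toy fact table standing in for a retriever. This is a minimal illustration, not a real retrieval system: `FACTS`, `retrieve`, and `answer_multi_hop` are hypothetical names, and the relation chain is supplied by hand, whereas a real system would have to derive each follow-up query from the retrieved text.

```python
# Toy corpus: each "document" is one fact keyed by (entity, relation).
# Hypothetical stand-in for a retriever over real passages.
FACTS = {
    ("Interstellar", "director"): "Christopher Nolan",
    ("Christopher Nolan", "spouse"): "Emma Thomas",
}


def retrieve(entity, relation):
    """Look up a single fact; returns None when no evidence is found."""
    return FACTS.get((entity, relation))


def answer_multi_hop(start_entity, relations):
    """Follow a chain of relations, feeding each answer into the next query.

    Returns (final_answer, trace), where trace lists the evidence used
    at each hop. final_answer is None if any hop finds no evidence.
    """
    entity = start_entity
    trace = []
    for relation in relations:
        result = retrieve(entity, relation)
        if result is None:
            return None, trace  # the evidence chain is broken
        trace.append((entity, relation, result))
        entity = result  # the intermediate answer becomes the next query
    return entity, trace


answer, evidence = answer_multi_hop("Interstellar", ["director", "spouse"])
```

The returned trace plays the role of the supporting-fact annotations in HotpotQA: it records which evidence justified each hop, so a wrong final answer can be traced to the hop that failed.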
CoQA (Reddy et al., 2018) paired a crowd-sourced student-teacher format with text passages from seven domains including news, literature, and Reddit. Questions often rely on pronoun resolution or pragmatic inference from previous turns, and the dataset reports human accuracy of about 88.8 F1. QuAC (Choi et al., 2018) uses a one-sided information-seeking format where only the teacher can see the passage, modeling realistic reference desk interactions. Both datasets reveal that tracking dialogue context adds substantial difficulty beyond single-turn reading comprehension.
KBQA maps natural language questions to formal queries over structured graphs. Early systems such as Berant et al.'s Semantic Parsing (2013) over Freebase and Bordes et al.'s work on embedding-based KBQA showed that structured reasoning over knowledge graphs is feasible but brittle. WebQuestions (Berant et al., 2013) with 5,810 question-answer pairs and SimpleQuestions (Bordes et al., 2015) with 108,000 questions became standard Freebase benchmarks. After Freebase's discontinuation, newer datasets migrated to Wikidata; WikiWebQuestions (WWQ) is one such port. Modern approaches use LLMs to generate SPARQL queries from natural language (ChatKBQA, 2023) or use tool-augmented LLMs to traverse knowledge graphs iteratively (ToG, 2024).
Long-form QA tasks ask for paragraph-length answers synthesizing information from multiple sources. ELI5 (Fan et al., 2019) drew from the "Explain Like I'm Five" subreddit and exposed a significant gap between retrieval-augmented models and human writers. ASQA (Stelmakh et al., 2022) focused on disambiguating factoid questions with multiple valid short answers by requiring a long-form synthesis. The ALCE benchmark (Gao et al., 2023) introduced attribution evaluation, scoring whether model-generated answers include accurate in-line citations to the sources used. Long-form QA evaluation remains difficult because reference-overlap metrics (ROUGE, BLEU) correlate weakly with human judgments of informativeness and faithfulness.
Visual QA (VQA) grounds answers in images and requires joint image-text reasoning. Models such as BLIP-2, LLaVA, and GPT-4V represent image regions as token embeddings and answer questions about scene content, relationships, and text within images. Document VQA specializes in text-dense document images such as invoices, receipts, scientific charts, and forms; DocVQA (Mathew et al., 2021) and InfographicVQA are standard benchmarks. Multimodal RAG systems for document QA (e.g., VisDoM, 2024) retrieve both textual and visual chunks from document collections before generating answers.
| Model | Year | Organization | Type | Notes |
|---|---|---|---|---|
| BiDAF | 2016 | AI2, UW | Extractive reader | Bi-directional attention flow over context and query; SQuAD baseline |
| DrQA | 2017 | Facebook AI, Stanford | Open-domain | TF-IDF retriever plus RNN span reader over Wikipedia |
| QANet | 2018 | Google Brain, CMU | Extractive reader | Convolutions plus self-attention; faster training than RNN readers |
| BERT | 2018 | Google | Extractive reader | 93.2 F1 on SQuAD 1.1; standard fine-tuning recipe |
| REALM | 2020 | Google | Retrieval-augmented LM | Jointly pre-trained retriever and masked LM |
| DPR | 2020 | Facebook AI | Dense retriever | Dual BERT encoders for open-domain retrieval |
| ColBERT | 2020 | Stanford | Late-interaction retriever | Per-token similarity with offline indexing |
| RAG | 2020 | Facebook AI | Retrieve-and-generate | DPR retriever plus BART generator |
| FiD | 2020 | Facebook AI | Generative reader | Fusion-in-Decoder over many retrieved passages |
| T5 closed-book | 2020 | Google | Closed-book generative | Up to 11B parameters; no retrieval |
| Atlas | 2022 | Meta | Few-shot RAG | Contriever plus FiD; strong few-shot QA |
| Galactica | 2022 | Meta | Scientific QA | 120B-parameter model trained on scientific text |
| GPT-4 | 2023 | OpenAI | LLM | Strong zero-shot QA across benchmarks |
| Claude 3 | 2024 | Anthropic | LLM | Long-context QA over hundreds of pages |
| Gemini | 2023-2025 | Google DeepMind | LLM | Multimodal QA and grounded search responses |
QA benchmarks vary by answer format, domain, and reasoning requirement. The table below lists the datasets most widely cited in the QA literature.
| Benchmark | Year | Authors | Size | Focus |
|---|---|---|---|---|
| SQuAD 1.1 | 2016 | Rajpurkar et al. | 100K QA pairs | Span extraction from Wikipedia paragraphs |
| MS MARCO | 2016 | Nguyen et al. (Microsoft) | 1M Bing queries | Real user queries, passage ranking and generation |
| TriviaQA | 2017 | Joshi et al. | 650K QA-evidence triples | Trivia questions with distantly supervised evidence |
| HotpotQA | 2018 | Yang et al. | 113K QA pairs | Multi-hop reasoning with supporting facts |
| SQuAD 2.0 | 2018 | Rajpurkar et al. | 150K QA pairs | Adds 50K adversarial unanswerable questions |
| CoQA | 2018 | Reddy et al. | 127K QA turns | Conversational QA with coreference |
| QuAC | 2018 | Choi et al. | 100K QA turns | Information-seeking dialogues |
| OpenBookQA | 2018 | Mihaylov et al. | 6K QA pairs | Elementary-science open-book reasoning |
| ARC | 2018 | Clark et al. (AI2) | 7.8K science questions | Multiple choice, easy and challenge splits |
| Natural Questions | 2019 | Kwiatkowski et al. (Google) | 307K training examples | Real Google queries, long and short answer spans |
| DROP | 2019 | Dua et al. (AI2) | 96K QA pairs | Discrete reasoning, arithmetic, counting |
| BoolQ | 2019 | Clark et al. | 16K yes/no questions | Naturally occurring binary questions |
| ELI5 | 2019 | Fan et al. | 270K QA pairs | Long-form answers from "Explain Like I'm Five" Reddit |
| BioASQ | 2013-ongoing | Tsatsaronis et al. | Several K expert QA pairs | Biomedical QA with expert annotations |
| MMLU | 2020 | Hendrycks et al. | 14K multiple-choice | 57 subject areas; few-shot LLM evaluation |
| 2WikiMultiHopQA | 2020 | Ho et al. | 167K QA pairs | Multi-hop reasoning across two Wikipedia articles |
| MuSiQue | 2022 | Trivedi et al. | 20K QA pairs | Compositional multi-hop with strict single-hop controls |
| ASQA | 2022 | Stelmakh et al. | 6.3K QA pairs | Ambiguous factoid questions requiring long-form answers |
The dominant evaluation metric for extractive QA is the macro-averaged F1 score over the token sets of predicted and reference answers, paired with exact match (EM), the fraction of predictions identical to a reference answer after normalization. Both metrics strip punctuation and articles before comparison. For abstractive QA, ROUGE and BLEU measure n-gram overlap with reference text, though they correlate weakly with factual correctness. Retrieval components are evaluated with recall@k (the fraction of questions whose gold answer appears in the top-k retrieved passages) and mean reciprocal rank (MRR). LLM-based generative QA is increasingly evaluated with LLM-as-judge protocols, human ratings, and faithfulness or citation accuracy measures, since open-ended answers often diverge in surface form from any single reference.
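The extractive and retrieval metrics above are simple enough to sketch directly. The following is a minimal version in the spirit of SQuAD-style scoring, not a copy of any official evaluation script. The function names are illustrative, tokenization is plain whitespace splitting, and the retrieval functions assume gold evidence is given as passage identifiers instead of answer strings.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, references):
    """1.0 if the normalized prediction equals any normalized reference."""
    return float(any(normalize(prediction) == normalize(r) for r in references))


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall against one reference."""
    pred = normalize(prediction).split()
    ref = normalize(reference).split()
    if not pred or not ref:
        return float(pred == ref)  # both empty counts as a match
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def f1(prediction, references):
    """Best token F1 over all reference answers for one question."""
    return max(token_f1(prediction, r) for r in references)


def recall_at_k(ranked_ids, gold_ids, k):
    """Per-question hit: 1.0 if any gold passage id is in the top k."""
    return float(any(pid in gold_ids for pid in ranked_ids[:k]))


def reciprocal_rank(ranked_ids, gold_ids):
    """1 / rank of the first gold passage, or 0.0 if none was retrieved."""
    for rank, pid in enumerate(ranked_ids, start=1):
        if pid in gold_ids:
            return 1.0 / rank
    return 0.0
```

Each function scores a single question. The dataset-level numbers (EM, F1, recall@k, MRR) are the means of these per-question values.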
For RAG pipelines specifically, the RAGAS framework (Es et al., 2023) provides four reference-free metrics: faithfulness (whether each answer claim is supported by the retrieved context), answer relevancy (whether the answer addresses the question), context precision (whether retrieved passages are relevant), and context recall (whether the retrieved passages contain the information needed to answer). RAGAS processes millions of evaluations monthly for enterprise users and has become the de facto standard for RAG pipeline evaluation. RAGBENCH (2024) extended this with 100,000 domain-specific examples and the TRACe framework for explainable RAG metrics. The KDD Cup 2024 CRAG Challenge introduced a comprehensive RAG benchmark spanning five domains and three task types.
For long-form QA, the ALCE benchmark evaluates attribution: whether claims in the generated answer are supported by the cited source documents. This metric addresses the problem that LLMs sometimes generate fluent but unsupported or wrongly attributed statements, a critical failure mode for high-stakes applications.
The dominant approach to extractive QA fine-tunes a bidirectional transformer encoder on labeled question-passage pairs. The input is formatted as [CLS] question [SEP] passage [SEP] and the encoder produces a contextual representation for each token. Two linear heads project the encoder output to start and end position logits; the highest-scoring valid span, meaning one whose start position does not come after its end position, becomes the predicted answer. RoBERTa and DeBERTa-v3 variants consistently achieve near-ceiling performance on SQuAD, while models like ELECTRA and ALBERT offer faster inference with competitive accuracy.
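The span-selection step can be illustrated without a model, given start and end logits for each token. The sketch below assumes the span score is the sum of the two logits and adds a maximum answer length, a common practical constraint that the description above does not mention. The logits and the `best_span` name are made up for illustration.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end, score) maximizing start_logits[s] + end_logits[e]
    subject to s <= e < s + max_answer_len."""
    best = (0, 0, float("-inf"))
    for s, s_score in enumerate(start_logits):
        last = min(len(end_logits), s + max_answer_len)
        for e in range(s, last):
            score = s_score + end_logits[e]
            if score > best[2]:
                best = (s, e, score)
    return best


# Hand-written logits for a ten-token input (illustrative values only).
tokens = ["[CLS]", "who", "wrote", "it", "[SEP]",
          "jane", "austen", "wrote", "emma", "[SEP]"]
start_logits = [0.1, 0.0, 0.0, 0.0, 0.0, 4.0, 1.0, 0.2, 0.5, 0.0]
end_logits = [0.1, 0.0, 0.0, 0.0, 0.0, 0.5, 3.5, 0.1, 0.3, 0.0]

s, e, score = best_span(start_logits, end_logits)
answer = " ".join(tokens[s:e + 1])  # "jane austen"
```

Taking the argmax of each head independently can yield an end position before the start position, which is why the search enumerates only pairs with `s <= e`. A production decoder would typically also restrict candidates to passage tokens so that the question or special tokens are never returned.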
A complication is that real documents are long and must be chunked into overlapping windows. Stride-based inference runs the model on each window independently and takes the best span across windows. Aggregation across strides is a significant engineering concern in production systems.
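A minimal sketch of the windowing and cross-window aggregation follows. Here `stride` means the step between window starts, so neighbouring windows overlap by `window_size - stride` tokens; some tokenizer libraries use the same word for the overlap length instead. Both function names are illustrative, and the per-window predictions are assumed to come from a reader model that is not shown.

```python
def sliding_windows(tokens, window_size, stride):
    """Split tokens into (offset, window) pairs of at most window_size tokens.

    Window starts advance by `stride`; the last window ends at the final token.
    """
    if not 0 < stride <= window_size:
        raise ValueError("stride must be in (0, window_size]")
    windows = []
    start = 0
    while True:
        end = min(start + window_size, len(tokens))
        windows.append((start, tokens[start:end]))
        if end == len(tokens):
            break
        start += stride
    return windows


def best_across_windows(window_predictions):
    """Pick the highest-scoring span among per-window predictions.

    Each prediction is (window_offset, local_start, local_end, score);
    the result is expressed in document-level token positions.
    """
    offset, s, e, score = max(window_predictions, key=lambda p: p[3])
    return offset + s, offset + e, score
```

Comparing raw scores across windows assumes they are on a common scale. That holds loosely for logits from the same model but is one of the aggregation details that needs care in practice.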
Dense retrieval encodes queries and passages into continuous vectors and uses approximate nearest-neighbor search to find the most relevant passages at query time. DPR trained separate question and passage encoders with a contrastive objective using in-batch negatives, demonstrating that dense representations substantially outperform BM25 on Natural Questions and TriviaQA when enough training pairs are available. Subsequent dense retrievers improved on DPR by using hard negatives (Xiong et al., 2021), better pre-training (Contriever, Izacard et al., 2022, which uses unsupervised contrastive learning over random document spans), and larger base models.
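The in-batch negatives objective can be written in a few lines once the encoders have produced vectors. The sketch below assumes dot-product similarity and a softmax over all passages in the batch, with passage i as the positive for query i. It operates on plain lists of floats, and the encoders themselves are out of scope.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def in_batch_negative_loss(query_vecs, passage_vecs):
    """Mean negative log-likelihood of the positive passage for each query.

    Passage i is the positive for query i; every other passage in the
    batch acts as a negative.
    """
    if len(query_vecs) != len(passage_vecs):
        raise ValueError("need one positive passage per query")
    total = 0.0
    for i, q in enumerate(query_vecs):
        scores = [dot(q, p) for p in passage_vecs]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[i]
    return total / len(query_vecs)


queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = in_batch_negative_loss(queries, [[5.0, 0.0], [0.0, 5.0]])
misaligned = in_batch_negative_loss(queries, [[0.0, 5.0], [5.0, 0.0]])
```

The loss is small when each query scores its own passage highest (`aligned`) and large when the pairing is swapped (`misaligned`), which is the signal that pulls matching query and passage vectors together during training.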
By 2024, state-of-the-art embedding models such as BGE-M3 from BAAI support multi-functionality (dense, sparse, and multi-vector retrieval in one model), multi-linguality (over 100 languages), and inputs up to 8,192 tokens, making them broadly applicable as drop-in retrievers in RAG systems.
ColBERT stores per-token embeddings for every passage in the index. At query time it scores a passage by taking, for each query token, the maximum inner product between that token's vector and any passage token vector, and then summing these maxima over the query tokens. This "late interaction" scores passages more precisely than bi-encoders but requires storing full token matrices for every passage, producing larger indexes. ColBERT v2 (Santhanam et al., 2022) reduced index size through residual compression while maintaining accuracy, and systems like PLAID (2022) further optimized retrieval latency. Late-interaction retrievers are particularly useful for high-precision tasks where bi-encoder approximations miss relevant passages.
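The late-interaction score is a sum of per-query-token maxima, which is short enough to write out. This sketch uses hand-made two-dimensional vectors in place of real token embeddings and omits the normalization, compression, and candidate-generation steps of an actual ColBERT index. `maxsim_score` is an illustrative name.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_token_vecs, passage_token_vecs):
    """Sum over query tokens of the best-matching passage token similarity."""
    score = 0.0
    for q in query_token_vecs:
        score += max(dot(q, p) for p in passage_token_vecs)
    return score


# Illustrative vectors: two query tokens, and two candidate passages.
query = [[1.0, 0.0], [0.0, 1.0]]
passage_a = [[0.9, 0.1], [0.2, 0.8]]  # one token matches each query token
passage_b = [[0.5, 0.5]]              # a single "averaged" token

score_a = maxsim_score(query, passage_a)
score_b = maxsim_score(query, passage_b)
```

Passage A scores higher because each query token finds its own close match, whereas passage B's single vector can only partially match both. Collapsing a passage to one vector, as a bi-encoder does, loses exactly this per-token information.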
The standard open-domain QA pipeline retrieves the top-k passages (often k = 5 to 100) and passes them concatenated or individually to a reader model. Extractive readers like BERT select a span from the concatenated passages. Generative readers like FiD encode all passages independently and fuse their representations in the decoder, allowing the model to synthesize across passages. Chain-of-thought prompting over retrieved passages encourages LLMs to reason through the evidence step by step before committing to an answer, which substantially improves multi-hop performance.
A common refinement is a reranker: after the retriever returns the top-100 passages, a cross-encoder (which attends jointly to query and passage rather than encoding them independently) re-scores and reorders them to produce the top-5 that the reader actually uses. Cross-encoders are much more accurate than bi-encoders but too slow to score the entire corpus, making the two-stage design necessary.
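The two-stage design can be sketched with any pair of scoring functions, one cheap and one expensive. In the toy version below, term overlap stands in for the first-stage retriever and Jaccard similarity stands in for the cross-encoder. Both are placeholders chosen so the example runs without a model, and neither is how a real reranker scores text.

```python
def term_overlap(query, doc):
    """Cheap first-stage score: number of shared lowercase terms."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


def jaccard(query, doc):
    """Placeholder for an expensive joint scorer such as a cross-encoder."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d) if q | d else 0.0


def retrieve_then_rerank(query, corpus, cheap_score, expensive_score,
                         n_candidates=100, n_final=5):
    """Score the whole corpus cheaply, then rerank only the top candidates."""
    candidates = sorted(corpus, key=lambda doc: cheap_score(query, doc),
                        reverse=True)[:n_candidates]
    reranked = sorted(candidates, key=lambda doc: expensive_score(query, doc),
                      reverse=True)
    return reranked[:n_final]


corpus = [
    "the capital of france is paris",
    "france is a country in europe with a capital city and many regions",
    "bananas are yellow",
]
top = retrieve_then_rerank("capital of france", corpus, term_overlap, jaccard,
                           n_candidates=2, n_final=1)
```

The expensive scorer runs only `n_candidates` times regardless of corpus size, which is the point of the design. The trade-off is that a relevant passage missed by the first stage can never be recovered by the reranker.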
A typical RAG system has three stages: index, retrieve, and generate. The corpus is split into chunks (often a few hundred tokens) and each chunk is embedded into a vector. At query time, the user question is embedded and a vector index (FAISS, ScaNN, or a managed vector database such as Pinecone or Weaviate) returns the top-k most similar chunks. A generator, typically an LLM, conditions on the question and retrieved chunks to produce an answer, often with inline citations to the retrieved sources.
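The three stages can be sketched end to end with toy components. The version below assumes a bag-of-words "embedding" and brute-force cosine search in place of a learned embedding model and a vector index, and it stops at building the prompt, since the generator call depends on whichever LLM is used. All function names and the prompt wording are illustrative.

```python
import math
from collections import Counter


def chunk(text, size):
    """Split text into consecutive chunks of `size` whitespace tokens."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text):
    """Toy embedding: term counts. A real system would call an embedding model."""
    return Counter(text.lower().split())


def cosine(u, v):
    num = sum(u[t] * v[t] for t in u if t in v)
    den = (math.sqrt(sum(x * x for x in u.values()))
           * math.sqrt(sum(x * x for x in v.values())))
    return num / den if den else 0.0


def build_index(documents, chunk_size=8):
    """Index stage: chunk every document and store (doc_id, chunk, vector)."""
    return [(doc_id, piece, embed(piece))
            for doc_id, text in documents.items()
            for piece in chunk(text, chunk_size)]


def retrieve(index, question, k=2):
    """Retrieve stage: brute-force top-k by cosine similarity."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[2]), reverse=True)
    return [(doc_id, piece) for doc_id, piece, _ in ranked[:k]]


def build_prompt(question, retrieved):
    """Generate stage input: numbered sources so the answer can cite them."""
    context = "\n".join(f"[{i}] ({doc_id}) {piece}"
                        for i, (doc_id, piece) in enumerate(retrieved, start=1))
    return ("Answer using only the sources below and cite them as [n].\n"
            f"{context}\nQuestion: {question}\nAnswer:")
```

Numbering the retrieved chunks in the prompt is one simple way to let the generator emit inline citations that can later be checked against the sources.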
Retrievers fall into three families. Sparse retrieval uses term-frequency methods such as BM25 and remains a strong baseline for many domains. Dense retrieval uses bi-encoders such as DPR, Contriever, or modern text-embedding models. Late-interaction retrieval like ColBERT keeps per-token vectors and computes maximum similarity at query time, trading index size for higher accuracy. Production systems often use hybrid retrieval that blends BM25 and dense scores, and many add a cross-encoder reranker on the top-100 results before passing the top-5 to the generator.
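One way to blend sparse and dense scores is to normalize each list and take a weighted sum. The sketch below implements a common BM25 variant (with the usual `k1` and `b` defaults and a non-negative IDF) and a min-max blend controlled by `alpha`. The blending rule and parameter values are assumptions for illustration, since production systems differ in how they combine the two signals, and the dense scores are expected to come from an embedding model that is not shown.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """BM25 score of each document for the query (whitespace tokenization).

    Assumes a non-empty document list.
    """
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    doc_freq = Counter(t for d in tokenized for t in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        s = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5)
                           / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            s += idf * tf[term] * (k1 + 1) / norm
        scores.append(s)
    return scores


def min_max(scores):
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def hybrid_scores(sparse, dense, alpha=0.5):
    """Weighted sum of normalized sparse and dense scores per document."""
    sparse_n, dense_n = min_max(sparse), min_max(dense)
    return [alpha * s + (1 - alpha) * d for s, d in zip(sparse_n, dense_n)]
```

Normalization matters because BM25 scores and embedding similarities live on different scales, so adding them raw would let one signal dominate. Rank-based fusion is a frequently used alternative that avoids the scale problem altogether.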
GraphRAG (Edge et al., Microsoft Research, 2024) augments chunk retrieval with a knowledge graph built over the corpus by extracting entities and relationships with an LLM and applying hierarchical community detection (the Leiden algorithm) to create community summaries. GraphRAG improves answer comprehensiveness by 50 to 70 percent over conventional vector RAG on global summarization questions by allowing the system to reason about cross-document entity relationships rather than only surfacing locally relevant passages.
Chunking strategy, embedding model choice, query rewriting, and prompt design materially affect answer quality, and the engineering of these pipelines is sometimes called retrieval engineering. Evaluation frameworks such as RAGAS and TruLens score answer faithfulness, context precision, and context recall.
Closed-book models answer entirely from parameters. T5 and its descendants achieve competitive accuracy on TriviaQA and WebQuestions by memorizing factual associations during pre-training, but accuracy degrades sharply for rare entities, recent events, and private knowledge that did not appear in the training corpus. The investigation by Roberts et al. (2020) established that knowledge density in LLMs scales with model size, which motivated very large models (GPT-3, PaLM, Llama) to be tested on closed-book benchmarks. Modern LLMs in the 70B to 405B parameter range achieve impressive closed-book performance on standard benchmarks but still benefit from retrieval for tail facts and grounded applications.
Prompting an LLM to produce a step-by-step reasoning chain before answering improves accuracy substantially on multi-step arithmetic and reasoning-intensive QA benchmarks. For multi-hop QA, chain-of-thought allows the model to identify intermediate entities, record partial answers, and condition subsequent reasoning steps on earlier results. Self-consistency (Wang et al., 2022) further improves accuracy by sampling multiple reasoning chains and taking a majority vote. These techniques, now standard in production QA systems, complement retrieval by allowing the model to reason over retrieved passages rather than merely extracting surface patterns.
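The voting step of self-consistency is independent of how the reasoning chains are produced, so it can be shown on its own. The sketch below assumes the final answers have already been extracted from each sampled chain and applies only light normalization (trimming and lowercasing). `self_consistent_answer` is an illustrative name.

```python
from collections import Counter


def self_consistent_answer(sampled_answers):
    """Return the most frequent final answer and its share of the votes."""
    normalized = [a.strip().lower() for a in sampled_answers]
    counts = Counter(normalized)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(normalized)


# Final answers extracted from five sampled reasoning chains (made-up values).
answer, agreement = self_consistent_answer(["18", "18 ", "20", "18", "17"])
```

The vote share doubles as a rough confidence signal: a low share means the sampled chains disagree, which can be used to trigger retrieval or abstention. Ties are broken arbitrarily in this sketch.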
QA models power a growing share of consumer and enterprise software. Web search engines use generative QA to produce direct answers above the traditional ten blue links, as in Google AI Overviews, Bing answers, and dedicated answer engines like Perplexity and You.com. Voice assistants including Siri, Alexa, and Google Assistant rely on QA to handle factoid queries. Customer-support chatbots use RAG over product documentation and ticket histories to answer user questions and reduce agent load.
Document-analysis tools target legal contract review, financial filings, and clinical notes, with offerings such as Harvey AI for legal QA and Glass Health for clinical decision support. Medical QA is a particularly active domain: systems fine-tuned on biomedical literature and clinical guidelines can answer practitioner questions grounded in PubMed abstracts, drug databases, and treatment protocols, with BioASQ and PubMedQA serving as standard evaluation suites. OpenMedLM (2024) established new open-source performance levels on MedQA and MedMCQA using large open-weight models, demonstrating that domain-specialized QA is accessible beyond frontier closed models.
Coding assistants like GitHub Copilot Chat and Cursor provide QA over private codebases. Academic and scientific search engines such as Elicit, Consensus, and Semantic Scholar's Ask answer questions grounded in research papers. Enterprise knowledge management platforms deploy RAG over proprietary document stores to let employees query internal policy documents, technical manuals, and databases in natural language.
Generative QA systems remain prone to hallucination, producing fluent but factually incorrect answers, especially on long-tail entities and recent events. Retrieval mitigates but does not eliminate hallucination, and faithfulness to retrieved context is itself an active research area. A 2024 study on semantic entropy as a hallucination signal (Farquhar et al., Nature, 2024) showed that uncertainty estimation over model outputs can detect confabulations across domains including QA, pointing toward better self-evaluation.
Multi-hop reasoning over several documents continues to challenge systems, with HotpotQA and the 2WikiMultiHopQA dataset showing large human-machine gaps when supporting facts are spread across passages. MuSiQue results reveal that many apparent multi-hop gains on HotpotQA arise from single-hop shortcuts rather than genuine multi-step inference.
Numerical and discrete reasoning as measured by DROP and the GSM8K math benchmark stresses both retrievers and readers; chain-of-thought prompting and tool use with code interpreters help close part of the gap. Knowledge base QA is limited by the freshness and completeness of the underlying graph; Wikidata, while more actively maintained than Freebase, still has significant coverage gaps for non-English entities and recent events.
The retrieval quality bottleneck means that even a perfect reader cannot compensate for missing or irrelevant passages, and lexical mismatch between questions and answers remains a problem for dense embedders trained on narrow domains. Domain adaptation to specialized corpora (legal, medical, financial) often requires domain-specific embeddings and reranking. A 2024 study found that even strong embedding models can fail on simple queries when there is a granularity mismatch between query specificity and document length.
Long-form QA is hard to evaluate because reference answers are subjective and reference-overlap metrics correlate poorly with human judgments; works like ELI5 and ASQA highlight the gap between ranking models on a leaderboard and producing answers users prefer.
Temporal and citation accuracy in LLM-grounded answers is improving, but production systems still frequently emit fabricated or mismatched citations, motivating continuing research into attributed QA and verifiable retrieval. The ALCE benchmark provides standardized attribution evaluation, and ongoing work on faithful fine-tuning (e.g., Faithful Finetuning, 2024) trains models to explicitly ground claims in retrieved passages before generating.
Benchmark saturation and contamination are growing concerns: models trained on web-scale corpora may have seen test questions from SQuAD, TriviaQA, and similar datasets, inflating reported performance. MMLU-CF (Contamination-Free, 2024) proposed methods to detect and exclude potentially contaminated test items, and the field is shifting toward held-out or dynamically generated evaluation sets to address the problem.