Question Answering Models
Last reviewed
May 11, 2026
Sources
17 citations
Review status
Source-backed
Revision
v2 · 2,486 words
See also: Natural Language Processing Models and Tasks
Question answering (QA) models are natural language processing systems that take a natural-language question as input and return a natural-language answer, optionally grounded in a supplied passage, document collection, or knowledge base. QA has been a central benchmark for machine reading comprehension since the introduction of large-scale span-extraction datasets in 2016, and it remains a primary application of large language models (LLMs) and retrieval-augmented generation (RAG) systems in 2026.
QA systems are typically categorized along several axes. Extractive QA locates an answer as a contiguous span of text within a supplied passage, a formulation popularized by the SQuAD dataset (Rajpurkar et al., 2016). Abstractive or generative QA produces a free-form answer that may not appear verbatim in any source passage and is closer to summarization in style.
A second axis distinguishes closed-book QA, where the model relies only on knowledge encoded in its parameters during training, from open-domain QA (sometimes called open-book QA), where the system retrieves supporting documents from a corpus such as Wikipedia or the open web before producing an answer (Chen et al., 2017). Open-domain QA in turn decomposes into a retrieval step and a reader or generator step, an architecture often called retrieve-then-read.
Additional task variants include multi-hop QA, which requires combining information across multiple documents (Yang et al., 2018); conversational QA, where questions occur in a dialogue history; knowledge base QA, which answers questions over a structured graph such as Wikidata or Freebase; community QA, which mines question-answer pairs from forums like Stack Exchange; and visual QA, which grounds answers in images and is treated under multimodal models.
Factoid QA traces back to evaluations such as the TREC QA Track in the late 1990s and to systems such as IBM's Watson, which won the Jeopardy! challenge in 2011 using a combination of structured knowledge, information retrieval, and learned ranking. Modern neural QA effectively began with the release of large supervised reading-comprehension datasets. The Stanford Question Answering Dataset (SQuAD) introduced more than 100,000 crowd-written question-answer pairs over Wikipedia paragraphs, framing QA as span prediction (Rajpurkar et al., 2016).
Early neural readers used attention-based recurrent networks. The Bidirectional Attention Flow (BiDAF) model of Seo et al. (2016) introduced a query-aware context representation using a memory-less, bi-directional attention mechanism and became a standard SQuAD baseline. R-NET from Microsoft Research and QANet from Google (Yu et al., 2018) further improved span-extraction accuracy, using gated self-matching attention and a combination of convolution and self-attention, respectively.
For open-domain QA, Chen et al.'s DrQA system (2017) combined a TF-IDF retriever over all of Wikipedia with a recurrent-network span reader, establishing the retrieve-then-read template that later approaches refined.
The Transformer architecture, and especially the bidirectional encoder BERT (Devlin et al., 2018), reshaped QA. An ensemble of fine-tuned BERT-Large models reached 93.2 F1 on the SQuAD 1.1 test set, surpassing reported human performance, and a single BERT-Large model reached 83.1 F1 on SQuAD 2.0, which adds adversarially written unanswerable questions (Rajpurkar et al., 2018). Most subsequent extractive systems pair a pre-trained encoder such as RoBERTa, ALBERT, ELECTRA, or DeBERTa with a span-prediction head.
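A span-prediction head of this kind typically emits one start score and one end score per token, and the answer is decoded as the highest-scoring valid span. The following is a minimal sketch of that decoding step, not the procedure of any particular system: the function name `best_span`, the `max_answer_len` cap, and the score values are all illustrative assumptions.

```python
from typing import List, Tuple


def best_span(start_scores: List[float],
              end_scores: List[float],
              max_answer_len: int = 30) -> Tuple[int, int, float]:
    """Return (start, end, score) for the highest-scoring valid span.

    A span is valid when start <= end and it covers at most
    max_answer_len tokens. Its score is start_scores[start] + end_scores[end].
    """
    best = (0, 0, float("-inf"))
    for i, s in enumerate(start_scores):
        # Only consider end positions at or after the start, within the length cap.
        for j in range(i, min(i + max_answer_len, len(end_scores))):
            score = s + end_scores[j]
            if score > best[2]:
                best = (i, j, score)
    return best


tokens = ["BERT", "was", "released", "by", "Google", "in", "2018", "."]
# Hypothetical per-token scores, standing in for the output of a trained head.
start = [0.1, 0.0, 0.2, 0.1, 3.0, 0.3, 1.5, 0.0]
end = [0.0, 0.1, 0.1, 0.0, 2.5, 0.2, 2.0, 0.1]

i, j, score = best_span(start, end, max_answer_len=3)
answer = " ".join(tokens[i:j + 1])
print(answer)
```

For SQuAD 2.0-style unanswerable questions, systems commonly compare the best span score against a "no answer" score and abstain when the latter is higher; that comparison is omitted here for brevity.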
Open-domain retrieval shifted from sparse to dense methods in 2020. Dense Passage Retrieval (DPR) trained two BERT encoders, one for queries and one for passages, with in-batch negatives and outperformed BM25 on top-20 retrieval accuracy by 9 to 19 absolute points across several benchmarks (Karpukhin et al., 2020). ColBERT introduced a late-interaction architecture that scores documents through fine-grained per-token similarity while still allowing offline indexing (Khattab and Zaharia, 2020).
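The two ideas in this paragraph, training a dual encoder with in-batch negatives and scoring with late interaction, can be illustrated on toy vectors. This is a rough sketch under simplifying assumptions: embeddings are plain Python lists rather than encoder outputs, similarity is an unnormalized dot product, and the function names are hypothetical rather than taken from the DPR or ColBERT codebases.

```python
import math
from typing import List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def in_batch_negative_loss(queries: List[Vector], passages: List[Vector]) -> float:
    """Mean negative log-likelihood where passages[i] is the positive for
    queries[i] and every other passage in the batch acts as a negative."""
    total = 0.0
    for i, q in enumerate(queries):
        scores = [dot(q, p) for p in passages]
        # Numerically stable log-sum-exp over all passages in the batch.
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[i]
    return total / len(queries)


def maxsim_score(query_tokens: List[Vector], doc_tokens: List[Vector]) -> float:
    """Late-interaction score: each query token vector is matched to its most
    similar document token vector, and the maxima are summed."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Toy batch: two queries with their matching passages.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = in_batch_negative_loss(queries, [[1.0, 0.0], [0.0, 1.0]])
misaligned = in_batch_negative_loss(queries, [[0.0, 1.0], [1.0, 0.0]])

# Toy per-token vectors for one query and two candidate documents.
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
doc_b = [[1.0, 0.0], [0.9, 0.1]]
score_a = maxsim_score(query_tokens, doc_a)
score_b = maxsim_score(query_tokens, doc_b)
```

The loss is lower when each query scores its own passage above the others in the batch, which is the signal that trains the two encoders. Because document token vectors do not depend on the query, they can be computed and indexed offline, leaving only the max-and-sum step for query time.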
Two influential systems unified retrieval with end-to-end training. REALM jointly pre-trained a knowledge retriever and a masked language model and improved open-domain QA accuracy by 4 to 16 absolute points over prior methods (Guu et al., 2020). RAG by Lewis et al. (2020) coupled a DPR retriever with a BART generator and set state-of-the-art on three open-domain QA tasks while producing more specific and factual generated text than a parametric-only baseline.
Izacard and Grave's Fusion-in-Decoder (FiD) model encodes each retrieved passage independently with a T5 encoder, concatenates the resulting representations, and lets the decoder attend over all of them jointly. This design scales well with the number of retrieved passages, and FiD topped the Natural Questions and TriviaQA leaderboards in 2020 (Izacard and Grave, 2020). Atlas, a follow-up from Meta, combined a Contriever retriever with FiD and reached over 42 percent accuracy on Natural Questions using only 64 training examples, outperforming a 540-billion-parameter model despite having 50 times fewer parameters (Izacard et al., 2022).
Closed-book QA was investigated by Roberts et al. (2020), who fine-tuned T5 models up to 11 billion parameters on QA tasks without any retrieval and achieved competitive results on TriviaQA and WebQuestions. The same family of experiments showed that knowledge stored in parameters scales with model size but lags retrieval-augmented systems on rare facts.
The arrival of GPT-4, the Claude series, Gemini, Llama 3, and similar systems reshaped QA again. Modern LLMs answer most factoid and reasoning questions zero-shot from parameters and reach or exceed human SQuAD scores without any task-specific fine-tuning. For knowledge-intensive or fast-changing queries, production systems wrap an LLM with retrieval over a private or web corpus, a pattern now ubiquitous in search products such as ChatGPT search, Perplexity, Google AI Overviews, and Microsoft Copilot. Tool use, code execution, and browsing further extend QA to questions that require live data or calculation.
| Model | Year | Organization | Type | Notes |
|---|---|---|---|---|
| BiDAF | 2016 | AI2, UW | Extractive reader | Bi-directional attention flow over context and query; SQuAD baseline |
| DrQA | 2017 | Facebook AI, Stanford | Open-domain | TF-IDF retriever plus RNN span reader over Wikipedia |
| QANet | 2018 | Google Brain, CMU | Extractive reader | Convolutions plus self-attention; faster training than RNN readers |
| BERT | 2018 | Google | Extractive reader | 93.2 F1 on SQuAD 1.1; standard fine-tuning recipe |
| REALM | 2020 | Google | Retrieval-augmented LM | Jointly pre-trained retriever and masked LM |
| DPR | 2020 | Facebook AI | Dense retriever | Dual BERT encoders for open-domain retrieval |
| ColBERT | 2020 | Stanford | Late-interaction retriever | Per-token similarity with offline indexing |
| RAG | 2020 | Facebook AI | Retrieve-and-generate | DPR retriever plus BART generator |
| FiD | 2020 | Facebook AI | Generative reader | Fusion-in-Decoder over many retrieved passages |
| T5 closed-book | 2020 | Google | Closed-book generative | Up to 11B parameters; no retrieval |
| Atlas | 2022 | Meta | Few-shot RAG | Contriever plus FiD; strong few-shot QA |
| Galactica | 2022 | Meta | Scientific QA | 120B-parameter model trained on scientific text |
| GPT-4 | 2023 | OpenAI | LLM | Strong zero-shot QA across benchmarks |
| Claude 3 | 2024 | Anthropic | LLM | Long-context QA over hundreds of pages |
| Gemini | 2023-2025 | Google DeepMind | LLM | Multimodal QA and grounded search responses |
QA benchmarks vary by answer format, domain, and reasoning requirement. The table below lists the datasets most widely cited in the QA literature.
| Benchmark | Year | Authors | Size | Focus |
|---|---|---|---|---|
| SQuAD 1.1 | 2016 | Rajpurkar et al. | 100K QA pairs | Span extraction from Wikipedia paragraphs |
| MS MARCO | 2016 | Nguyen et al. (Microsoft) | 1M Bing queries | Real user queries, passage ranking and generation |
| TriviaQA | 2017 | Joshi et al. | 650K QA-evidence triples | Trivia questions with distantly supervised evidence |
| HotpotQA | 2018 | Yang et al. | 113K QA pairs | Multi-hop reasoning with supporting facts |
| SQuAD 2.0 | 2018 | Rajpurkar et al. | 150K QA pairs | Adds 50K adversarial unanswerable questions |
| Natural Questions | 2019 | Kwiatkowski et al. (Google) | 307K training examples | Real Google queries, long and short answer spans |
| DROP | 2019 | Dua et al. (AI2) | 96K QA pairs | Discrete reasoning, arithmetic, counting |
| CoQA | 2018 | Reddy et al. | 127K QA turns | Conversational QA with coreference |
| QuAC | 2018 | Choi et al. | 100K QA turns | Information-seeking dialogues |
| BoolQ | 2019 | Clark et al. | 16K yes/no questions | Naturally occurring binary questions |
| OpenBookQA | 2018 | Mihaylov et al. | 6K QA pairs | Elementary-science open-book reasoning |
| ARC | 2018 | Clark et al. (AI2) | 7.8K science questions | Multiple choice, easy and challenge splits |
The dominant evaluation metric for extractive QA is token-level F1, computed for each question between the bags of tokens in the predicted and reference answers and then averaged over questions. It is paired with exact match (EM), the fraction of predictions identical to a reference answer after normalization. For abstractive QA, ROUGE and BLEU measure n-gram overlap with reference text, though they correlate weakly with factual correctness. Retrieval components are evaluated with recall@k (the fraction of questions whose gold answer appears in the top-k retrieved passages) and mean reciprocal rank (MRR). LLM-based generative QA is increasingly evaluated with LLM-as-judge protocols, human ratings, and faithfulness or citation accuracy measures, since open-ended answers often diverge in surface form from any single reference.
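The extractive and retrieval metrics above are simple enough to sketch directly. The code below is an illustrative approximation, assuming SQuAD-style normalization (lowercasing, stripping punctuation and English articles, collapsing whitespace); it is not the official evaluation script, and the function names and example strings are hypothetical.

```python
import re
import string
from collections import Counter
from typing import List

_PUNCT = set(string.punctuation)


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in _PUNCT)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, references: List[str]) -> float:
    """1.0 if the normalized prediction equals any normalized reference."""
    return float(any(normalize(prediction) == normalize(r) for r in references))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall over bags of tokens."""
    pred, ref = normalize(prediction).split(), normalize(reference).split()
    if not pred or not ref:
        # Two empty answers match; one empty answer is a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def recall_at_k(ranked_hits: List[List[bool]], k: int) -> float:
    """ranked_hits[q][i] is True when the i-th retrieved passage for
    question q contains a gold answer."""
    return sum(any(hits[:k]) for hits in ranked_hits) / len(ranked_hits)


def mean_reciprocal_rank(ranked_hits: List[List[bool]]) -> float:
    """Average of 1/rank of the first hit per question (0 when there is none)."""
    total = 0.0
    for hits in ranked_hits:
        for rank, hit in enumerate(hits, start=1):
            if hit:
                total += 1.0 / rank
                break
    return total / len(ranked_hits)


em = exact_match("The Eiffel Tower!", ["Eiffel Tower"])
f1 = token_f1("in Paris, France", "Paris")
hits = [[False, True, False], [False, False, False], [True, False, False]]
r1, r2, mrr = recall_at_k(hits, 1), recall_at_k(hits, 2), mean_reciprocal_rank(hits)
```

When a question has several reference answers, the per-question F1 is usually taken as the maximum over references before averaging across the dataset.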
A typical RAG system has three stages: index, retrieve, and generate. The corpus is split into chunks (often a few hundred tokens) and each chunk is embedded into a vector. At query time, the user question is embedded and a vector index (FAISS, ScaNN, or a managed vector database) returns the top-k most similar chunks. A generator, typically an LLM, conditions on the question and retrieved chunks to produce an answer, often with inline citations to the retrieved sources.
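The three stages can be sketched end to end with standard-library code. Everything here is a stand-in chosen for illustration: `embed` is a toy bag-of-words vector rather than a learned embedding model, `retrieve` is a brute-force scan rather than a FAISS or ScaNN index, and `build_prompt` only assembles the text a generator would receive, with no model call.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy embedding: bag-of-words counts (stand-in for a learned embedder)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(u: Counter, v: Counter) -> float:
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0


def build_index(chunks: List[str]) -> List[Tuple[str, Counter]]:
    """Index stage: embed every chunk once, ahead of query time."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(index: List[Tuple[str, Counter]], question: str, k: int = 2) -> List[str]:
    """Retrieve stage: brute-force top-k chunks by cosine similarity."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(question: str, passages: List[str]) -> str:
    """Generate stage, minus the model call: number the sources so that a
    generator could cite them inline as [1], [2], and so on."""
    sources = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return ("Answer using only the sources below and cite them.\n"
            f"{sources}\nQuestion: {question}\nAnswer:")


chunks = [
    "SQuAD frames question answering as span prediction over Wikipedia paragraphs.",
    "BM25 is a sparse retrieval method based on term frequency.",
    "Dense Passage Retrieval trains two BERT encoders for queries and passages.",
]
question = "Which method trains two encoders for queries and passages?"
index = build_index(chunks)
top = retrieve(index, question, k=2)
prompt = build_prompt(question, top)
print(prompt)
```

In a real deployment the brute-force scan would be replaced by an approximate nearest-neighbor index, and the prompt would be sent to an LLM; the separation into index-time and query-time work stays the same.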
Retrievers fall into three families. Sparse retrieval uses term-frequency methods such as BM25 and remains a strong baseline for many domains. Dense retrieval uses bi-encoders such as DPR, Contriever, or modern text-embedding models like OpenAI's text-embedding-3 and Cohere Embed. Late-interaction retrieval like ColBERT keeps per-token vectors and computes maximum similarity at query time, trading index size for higher accuracy. Production systems often use hybrid retrieval that blends BM25 and dense scores, and many add a cross-encoder reranker on the top-100 results before passing the top-5 to the generator. Recent work on GraphRAG (Microsoft Research, 2024) augments chunk retrieval with a knowledge graph built over the corpus, improving multi-hop and global summarization queries.
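One simple way to blend sparse and dense scores is to min-max normalize each score list and take a weighted sum, then rerank the blended candidates with a more expensive scorer. The sketch below assumes that approach; other fusion schemes exist, and the weight `alpha`, the document ids, the scores, and the `reranker_scores` lookup that stands in for a cross-encoder are all hypothetical.

```python
from typing import Callable, Dict, List


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1] so sparse and dense scales are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def hybrid_rank(sparse: Dict[str, float],
                dense: Dict[str, float],
                alpha: float = 0.5) -> List[str]:
    """Rank documents by alpha * sparse + (1 - alpha) * dense after
    normalization. Documents missing from one list contribute 0 for it."""
    s, d = min_max(sparse), min_max(dense)
    blended = {doc: alpha * s.get(doc, 0.0) + (1 - alpha) * d.get(doc, 0.0)
               for doc in set(s) | set(d)}
    # Sort by descending score, breaking ties by document id for determinism.
    return sorted(blended, key=lambda doc: (-blended[doc], doc))


def rerank(candidates: List[str],
           score_fn: Callable[[str], float],
           keep: int = 5) -> List[str]:
    """Re-score candidates with a slower, more accurate scorer and keep the best."""
    return sorted(candidates, key=score_fn, reverse=True)[:keep]


# Hypothetical raw scores from a BM25 retriever and a dense retriever.
bm25_scores = {"d1": 12.0, "d2": 8.0, "d3": 2.0}
dense_scores = {"d2": 0.9, "d3": 0.8, "d4": 0.1}
candidates = hybrid_rank(bm25_scores, dense_scores, alpha=0.5)

# Hypothetical reranker scores; a real system would run a cross-encoder
# on each (question, passage) pair here.
reranker_scores = {"d1": 0.2, "d2": 0.7, "d3": 0.9, "d4": 0.1}
final = rerank(candidates, lambda doc: reranker_scores[doc], keep=2)
```

Normalizing before blending matters because BM25 scores are unbounded while dense similarities usually fall in a narrow range; without it, one retriever would dominate the sum.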
Chunking strategy, embedding model choice, query rewriting, and prompt design materially affect answer quality, and the engineering of these pipelines is sometimes called retrieval engineering. Evaluation frameworks such as RAGAS and TruLens score answer faithfulness, context precision, and context recall.
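As one concrete example of a chunking strategy, a fixed-size sliding window with overlap keeps text that straddles a boundary available in two neighboring chunks. This is a minimal sketch assuming whitespace tokenization and illustrative defaults for `size` and `overlap`; production pipelines often split on sentence or section boundaries and count tokens with the embedding model's own tokenizer.

```python
from typing import List


def chunk_tokens(tokens: List[str], size: int = 200, overlap: int = 50) -> List[List[str]]:
    """Split tokens into windows of at most `size` tokens, where each window
    repeats the last `overlap` tokens of the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        # Stop once a window reaches the end, to avoid a redundant tail chunk.
        if start + size >= len(tokens):
            break
    return chunks


text = "t0 t1 t2 t3 t4 t5 t6 t7 t8 t9"
windows = chunk_tokens(text.split(), size=4, overlap=2)
for w in windows:
    print(" ".join(w))
```

Larger overlap reduces the chance of cutting an answer in half but increases index size and the amount of duplicated text the generator may see.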
QA models power a growing share of consumer and enterprise software. Web search engines use generative QA to produce direct answers above the traditional ten blue links, as in Google AI Overviews, Bing answers, and dedicated answer engines like Perplexity and You.com. Voice assistants including Siri, Alexa, and Google Assistant rely on QA to handle factoid queries. Customer-support chatbots use RAG over product documentation and ticket histories to answer user questions and reduce agent load. Document-analysis tools target legal contract review, financial filings, and clinical notes, with offerings such as Harvey AI for legal QA and Glass Health for clinical decision support. Coding assistants like GitHub Copilot Chat and Cursor provide QA over private codebases. Academic and scientific search engines such as Elicit, Consensus, and Semantic Scholar's Ask answer questions grounded in research papers.
Generative QA systems remain prone to hallucination, producing fluent but factually incorrect answers, especially on long-tail entities and recent events. Retrieval mitigates but does not eliminate hallucination, and faithfulness to retrieved context is itself an active research area. Multi-hop reasoning over several documents continues to challenge systems, with HotpotQA and the 2WikiMultiHopQA dataset (Ho et al., 2020) showing large human-machine gaps when supporting facts are spread across passages. Numerical and discrete reasoning as measured by DROP and the GSM8K math benchmark stresses both retrievers and readers; chain-of-thought prompting and tool use with code interpreters help close part of the gap.
The retrieval quality bottleneck means that even a perfect reader cannot compensate for missing or irrelevant passages, and lexical mismatch between questions and answers remains a problem for dense embedders trained on narrow domains. Domain adaptation to specialized corpora (legal, medical, financial) often requires domain-specific embeddings and reranking. Long-form QA is hard to evaluate because reference answers are subjective and reference-overlap metrics correlate poorly with human judgments; datasets such as ELI5 and ASQA highlight the gap between ranking models on a leaderboard and producing answers users prefer. Temporal and citation accuracy in LLM-grounded answers is improving, but systems still produce fabricated or mismatched citations in many production settings, motivating continuing research into attributed QA and verifiable retrieval.