Question answering (QA) is a subfield of natural language processing and information retrieval that focuses on automatically generating answers to questions posed in natural language. Rather than returning a ranked list of documents (as a traditional search engine would), a QA system produces a direct answer, often as a short text span, a generated sentence, or a structured response. Question answering has been one of the longest-studied problems in artificial intelligence, dating back to the 1960s, and remains a central benchmark for evaluating language understanding in modern machine learning systems.
The earliest QA systems were rule-based programs designed for narrow domains. BASEBALL (Green et al., 1961) answered questions about American League baseball games over a single season, drawing from a structured database. LUNAR (Woods, 1972) answered questions about the geological analysis of rock samples returned by the Apollo moon missions; when demonstrated at the Second Annual Lunar Science Conference in 1971, it correctly answered 78% of questions posed by scientists who had no prior training on the system, and about 90% when trivial errors are discounted. Around the same time, SHRDLU (Winograd, 1971) demonstrated natural language understanding in a simulated blocks-world environment.
These early systems relied on hand-crafted rules and small, curated knowledge bases, which limited them to closed domains. Throughout the 1980s and 1990s, progress was incremental, with systems built around template matching, information extraction pipelines, and shallow parsing. The annual Text REtrieval Conference (TREC) QA track, launched in 1999, helped standardize evaluation and spurred the development of more robust open-domain approaches.
The arrival of large-scale reading comprehension datasets such as SQuAD in 2016, along with advances in deep learning and transfer learning, transformed the field. Neural models rapidly surpassed traditional pipelines, and the introduction of BERT in 2018 set new performance records on nearly every QA benchmark. Today, large language models (LLMs) such as GPT-4 and Claude function as general-purpose QA engines, often augmented with retrieval components.
Question answering encompasses a broad family of tasks that differ along several dimensions: where the answer comes from, how the answer is produced, what modality the input takes, and whether the interaction is single-turn or multi-turn. The table below summarizes the major categories.
| QA Type | Answer Source | Answer Format | Representative Datasets / Systems | Key Characteristics |
|---|---|---|---|---|
| Extractive QA | Provided passage or document | Contiguous text span from the source | SQuAD, NewsQA, BiDAF, BERT-QA | Answer is always a substring of the context |
| Abstractive / Generative QA | Passage, knowledge base, or model parameters | Freely generated natural language | T5, BART, UnifiedQA, FiD | Answer may paraphrase or synthesize information |
| Open-domain QA | Large corpus (e.g., Wikipedia) | Span or generated text | DrQA, ORQA, DPR + reader, RAG | Retriever-reader pipeline; no pre-selected context |
| Closed-book QA | Model parameters only | Generated text | T5 (Roberts et al., 2020), GPT-3 | No retrieval step; relies on memorized knowledge |
| Multi-hop QA | Multiple documents or passages | Span or generated text | HotpotQA, MuSiQue, 2WikiMultiHopQA | Requires reasoning across two or more evidence sources |
| Conversational QA | Passage or knowledge source within a dialog | Span or free-form text | CoQA, QuAC, ChatQA | Questions depend on dialog history; coreference resolution needed |
| Table QA | Structured tables | Cell value, aggregation, or generated text | WikiTableQuestions, WikiSQL, SQA, TAPAS | Requires understanding rows, columns, and operations |
| Visual QA | Image (with optional text) | Short text answer | VQA, VQA v2, GQA, OK-VQA | Combines computer vision and language understanding |
| Knowledge-grounded QA | Knowledge graph | Entity or relation | WebQuestions, GrailQA, MetaQA | Queries resolved via graph traversal or SPARQL |
In extractive QA, the system receives a question and a context passage, and must identify the contiguous span of text within the passage that answers the question. Because the answer is always a substring of the provided context, the problem reduces to predicting start and end token positions.
The Stanford Question Answering Dataset (SQuAD), introduced by Rajpurkar et al. (2016), became the most widely used benchmark for extractive QA. SQuAD 1.1 contains 107,785 question-answer pairs derived from 536 Wikipedia articles. Crowdworkers read Wikipedia paragraphs and wrote questions whose answers are spans within those paragraphs. Human performance on SQuAD 1.1 was measured at an F1 score of 91.2 and an Exact Match (EM) score of 82.3.
SQuAD 2.0 (Rajpurkar et al., 2018) extended the dataset by adding over 50,000 adversarially crafted unanswerable questions, bringing the total to roughly 150,000 examples. To succeed on SQuAD 2.0, a system must not only extract correct answer spans but also recognize when the passage does not contain a valid answer and abstain from responding.
The Bi-Directional Attention Flow (BiDAF) model, introduced by Seo et al. (2017) at ICLR, was an influential early neural architecture for extractive QA. BiDAF processes the question and the context through multiple representation layers (character embeddings, word embeddings, and contextual LSTM encodings) and then applies a bidirectional attention mechanism that computes both context-to-query and query-to-context attention. A key design choice in BiDAF is that the attention layer does not collapse the context into a single fixed-size vector; instead, it preserves the full sequence of context representations, allowing downstream layers to retain fine-grained information. BiDAF topped the SQuAD leaderboard upon release and inspired subsequent architectures.
BERT (Bidirectional Encoder Representations from Transformers), introduced by Devlin et al. (2019), transformed extractive QA. BERT is pre-trained on large text corpora using masked language modeling and next sentence prediction, producing deep bidirectional representations. For QA, the question and passage are concatenated as input, and two learned vectors (a start vector and an end vector) are used to compute the probability of each token being the beginning or end of the answer span. The training objective maximizes the log-likelihood of the correct start and end positions.
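The span-prediction step can be sketched independently of the encoder. The function below is a minimal illustration in which hand-written scores stand in for model logits; it picks the highest-scoring valid (start, end) pair, enforcing that the end position does not precede the start and that spans stay within a maximum length:

```python
# Minimal sketch of extractive span selection, assuming per-token start and
# end scores have already been produced by an encoder such as BERT. The
# numbers below are illustrative, not real model outputs.

def best_span(start_scores, end_scores, max_len=10):
    """Return (start, end) maximizing start_scores[i] + end_scores[j]
    subject to i <= j < i + max_len."""
    best = (0, 0)
    best_score = float("-inf")
    for i, s in enumerate(start_scores):
        for j in range(i, min(i + max_len, len(end_scores))):
            score = s + end_scores[j]
            if score > best_score:
                best_score = score
                best = (i, j)
    return best

tokens = ["BERT", "was", "introduced", "by", "Devlin", "et", "al.", "in", "2019"]
start = [0.1, 0.0, 0.0, 0.0, 0.2, 0.0, 0.0, 0.0, 3.0]   # hypothetical logits
end   = [0.0, 0.0, 0.1, 0.0, 0.0, 0.0, 0.3, 0.0, 2.5]
i, j = best_span(start, end)
print(" ".join(tokens[i:j + 1]))  # -> 2019
```

In the full model, the start and end scores are dot products between each token representation and the learned start and end vectors, and the search above is applied to the resulting logits.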
BERT-Large achieved a Test F1 of 93.2 on SQuAD 1.1 and a Test F1 of 83.1 on SQuAD 2.0, surpassing human-level performance on SQuAD 1.1 and setting new state-of-the-art results at the time of publication. This demonstrated that large-scale pre-training followed by task-specific fine-tuning could produce highly effective QA models with minimal architectural changes.
Subsequent pre-trained models built on similar ideas. RoBERTa (Liu et al., 2019) optimized the pre-training procedure; ALBERT (Lan et al., 2020) reduced parameters through factorized embeddings; DeBERTa (He et al., 2021) introduced disentangled attention and achieved further gains on SQuAD and other benchmarks; and XLNet (Yang et al., 2019) used permutation-based language modeling to capture bidirectional context.
While extractive QA restricts answers to spans within a given passage, abstractive (or generative) QA produces answers in free-form natural language. A generative QA model takes a question (and optionally a context passage) as input and generates the answer token by token using a sequence-to-sequence architecture.
Encoder-decoder Transformers such as T5 (Raffel et al., 2020) and BART (Lewis et al., 2020) serve as the standard backbone for generative QA. T5 frames every NLP task as a text-to-text problem: the input is a string like "question: What is the capital of France? context: France is a country in Europe. Its capital is Paris." and the output is "Paris." This unified format allows a single model to handle extractive, abstractive, multiple-choice, and yes/no QA tasks.
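The text-to-text formatting itself is plain string construction. The helper below is a hypothetical sketch of such a template; the exact strings a given checkpoint expects may differ:

```python
# Sketch of T5-style text-to-text input formatting for QA. The field names
# ("question:", "context:", "options:") follow the convention described
# above and are illustrative, not a fixed API.

def to_text2text(question, context=None, choices=None):
    parts = [f"question: {question}"]
    if choices:
        letters = "ABCDEFGH"
        opts = " ".join(f"({letters[k]}) {c}" for k, c in enumerate(choices))
        parts.append(f"options: {opts}")
    if context:
        parts.append(f"context: {context}")
    return " ".join(parts)

print(to_text2text("What is the capital of France?",
                   context="France is a country in Europe. Its capital is Paris."))
```

The same function covers extractive, abstractive, and multiple-choice inputs, which is exactly the uniformity that makes the text-to-text framing attractive.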
UnifiedQA (Khashabi et al., 2020), developed at the Allen Institute for AI, demonstrated that a single T5-based model fine-tuned on eight QA datasets spanning four different formats (extractive, abstractive, multiple-choice, and yes/no) could match or outperform nine format-specific models across 17 QA benchmarks. This work showed that the boundaries between QA formats are largely artificial.
A key challenge for generative QA is hallucination: the model may produce fluent but factually incorrect answers that are not supported by any source. Grounding generation in retrieved evidence (as in retrieval-augmented generation) helps mitigate this problem.
Open-domain QA (ODQA) refers to answering factoid questions without a pre-specified context passage. The system must first find relevant information from a large corpus (such as all of English Wikipedia) and then extract or generate an answer from the retrieved documents.
DrQA (Chen et al., 2017), developed at Facebook AI Research, introduced the retriever-reader architecture that became the standard paradigm for ODQA. The system has two components:

- a Document Retriever, which ranks Wikipedia articles against the question using TF-IDF weighted bag-of-words vectors with hashed bigram features; and
- a Document Reader, a multi-layer recurrent neural network reading comprehension model that extracts an answer span from the top retrieved articles.
DrQA used the entirety of English Wikipedia (over 5 million articles) as its knowledge source. By combining document retrieval with machine comprehension, it demonstrated that existing reading comprehension models could scale to open-domain settings when paired with an effective retrieval step.
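The retrieval stage can be illustrated with a toy TF-IDF ranker. Note that DrQA's actual retriever uses hashed unigram and bigram features over all of Wikipedia, which this sketch omits:

```python
import math
from collections import Counter

# Toy unigram TF-IDF retriever in the spirit of DrQA's first stage, scoring
# a handful of in-memory documents rather than a Wikipedia-scale index.

def tokenize(text):
    return text.lower().split()

def tfidf_rank(query, docs):
    """Return document indices sorted by TF-IDF relevance to the query."""
    n = len(docs)
    doc_tokens = [tokenize(d) for d in docs]
    df = Counter(t for toks in doc_tokens for t in set(toks))
    idf = {t: math.log(n / df[t]) for t in df}
    scores = []
    for idx, toks in enumerate(doc_tokens):
        tf = Counter(toks)
        score = sum(tf[t] * idf.get(t, 0.0) for t in tokenize(query))
        scores.append((score, idx))
    return [i for _, i in sorted(scores, reverse=True)]

docs = [
    "Paris is the capital of France .",
    "Berlin is the capital of Germany .",
    "The Eiffel Tower is in Paris , France .",
]
print(tfidf_rank("capital of France", docs)[0])  # index of top document
```

A reader model would then be run over only the top-ranked documents, which is what makes the pipeline tractable at corpus scale.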
Subsequent work replaced the sparse TF-IDF retriever with dense neural retrievers. ORQA (Lee et al., 2019) jointly pre-trained the retriever and reader using an Inverse Cloze Task. Dense Passage Retrieval (DPR) by Karpukhin et al. (2020) trained a dual-encoder model with a BERT-based question encoder and a BERT-based passage encoder, using a contrastive learning objective. DPR significantly outperformed BM25 and TF-IDF baselines on multiple open-domain QA benchmarks.
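DPR's training signal can be sketched via its in-batch contrastive loss. In the sketch below, tiny hand-written vectors stand in for BERT encoder outputs; each question's positive passage sits at the same batch index, and the other passages in the batch serve as negatives:

```python
import math

# Sketch of DPR's in-batch contrastive objective, assuming question and
# passage embeddings have already been computed by the two encoders.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def in_batch_nll(q_vecs, p_vecs):
    """Average negative log-likelihood of each question's positive passage,
    with similarities given by dot products and a softmax over the batch."""
    loss = 0.0
    for i, q in enumerate(q_vecs):
        sims = [dot(q, p) for p in p_vecs]
        log_z = math.log(sum(math.exp(s) for s in sims))
        loss += log_z - sims[i]
    return loss / len(q_vecs)

q_vecs = [[1.0, 0.0], [0.0, 1.0]]
p_vecs = [[0.9, 0.1], [0.1, 0.9]]   # positives aligned on the diagonal
print(round(in_batch_nll(q_vecs, p_vecs), 4))
```

Minimizing this loss pulls each question embedding toward its positive passage and away from the in-batch negatives, which is what lets dense retrieval outperform sparse baselines like BM25.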
Izacard and Grave (2021) proposed Fusion-in-Decoder (FiD), a generative approach to open-domain QA. FiD encodes each retrieved passage independently with a T5 encoder, then concatenates all encoded representations and feeds them into a T5 decoder to generate the answer. By processing 100 retrieved passages, FiD achieved 51.4 EM on Natural Questions and 67.6 EM on TriviaQA, outperforming both extractive retriever-reader models and closed-book approaches while using far fewer parameters than comparable closed-book systems.
Closed-book QA tests whether a language model can answer factoid questions using only the knowledge stored in its parameters, without accessing any external documents at inference time. This setup is analogous to a student taking an exam without reference materials.
Roberts et al. (2020), in "How Much Knowledge Can You Pack Into the Parameters of a Language Model?", fine-tuned T5 models of varying sizes on open-domain QA datasets. They found that performance scaled consistently with model size: on the Natural Questions test set, T5-Base achieved 27.0 EM, T5-Large reached 29.8, T5-3B scored 32.1, and T5-11B achieved 34.5. With additional salient span masking (SSM) pre-training, T5-11B further improved to 36.6 on Natural Questions and 60.5 on TriviaQA. These results demonstrated that large language models can memorize a substantial amount of world knowledge in their parameters.
However, subsequent analysis by Lewis et al. (2021) found that much of the strong closed-book performance could be attributed to question memorization from the training set, raising concerns about whether these models truly generalize. Despite this caveat, closed-book QA remains an important paradigm for understanding what LLMs learn during pre-training.
Multi-hop QA requires reasoning over two or more pieces of evidence to arrive at an answer. Unlike single-hop questions that can be resolved from a single sentence or passage, multi-hop questions demand that the system retrieve multiple documents, identify relevant facts in each, and chain them together.
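The chaining itself can be shown with a deliberately tiny example, where two hypothetical fact tables stand in for two retrieved documents:

```python
# Minimal illustration of two-hop reasoning: answering "Where was the
# director of Inception born?" requires one fact from each of two sources.
# The dictionaries below are stand-ins for retrieved documents.

film_to_director = {"Inception": "Christopher Nolan"}
person_to_birthplace = {"Christopher Nolan": "London"}

def two_hop(film):
    director = film_to_director[film]      # hop 1: fact from first document
    return person_to_birthplace[director]  # hop 2: fact from second document

print(two_hop("Inception"))  # -> London
```

Neither document alone answers the question; the intermediate entity (the director) is what links the two hops, and failing to identify it breaks the chain.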
HotpotQA (Yang et al., 2018), presented at EMNLP 2018, is the most widely used multi-hop QA benchmark. It contains approximately 113,000 question-answer pairs based on Wikipedia, with four defining features:

1. Questions require finding and reasoning over multiple supporting documents.
2. Questions are diverse and not constrained to any pre-existing knowledge base or schema.
3. Sentence-level supporting facts are annotated, enabling strong supervision and explainable predictions.
4. A new class of comparison questions tests the ability to contrast two entities.
HotpotQA includes two evaluation settings: a distractor setting where the model receives 10 paragraphs (2 gold, 8 distractors) and a fullwiki setting where the model must retrieve evidence from all of Wikipedia. Models are evaluated on answer EM and F1, as well as supporting fact EM and F1.
Other multi-hop benchmarks include MuSiQue (Trivedi et al., 2022), which constructs questions requiring 2 to 4 reasoning hops, and 2WikiMultiHopQA (Ho et al., 2020), which focuses on questions requiring cross-document reasoning over two Wikipedia articles.
Conversational QA extends the standard QA task to multi-turn dialogs, where each question may depend on the conversation history. This introduces challenges such as coreference resolution (e.g., "When was he born?" following a question about a specific person), pragmatic reasoning, and topic shifts.
CoQA (Reddy et al., 2019), published in the Transactions of the Association for Computational Linguistics, contains 127,000 questions with answers collected from 8,000 conversations about text passages drawn from seven domains: children's stories, literature, middle and high school English exams, news articles, Wikipedia, Reddit, and science texts. Answers in CoQA are free-form text, and each answer is paired with a rationale (the span in the passage that supports it). The best system at the time of publication achieved an F1 of 65.4, compared to human performance of 88.8.
QuAC (Choi et al., 2018), presented at EMNLP 2018, contains 14,000 information-seeking QA dialogs with 100,000 questions in total. In QuAC, a "student" asks freeform questions to learn about a hidden Wikipedia section, and a "teacher" answers by selecting short excerpts from the text. Because the student cannot see the passage, questions tend to be more open-ended and exploratory than in standard extractive QA. The best model at the time of publication trailed human performance by 20 F1 points.
More recent work in conversational QA includes ChatQA (Liu et al., 2024) from NVIDIA, which demonstrated that a 70-billion-parameter model could match GPT-4-level accuracy across 10 conversational QA benchmarks.
Table QA involves answering natural language questions over structured or semi-structured tabular data. Unlike free-text QA systems, table QA systems must understand rows, columns, and headers, and perform operations such as counting, summing, averaging, sorting, and filtering.
Early table QA systems used semantic parsing to convert natural language questions into executable logical forms (e.g., SQL queries). WikiTableQuestions (Pasupat and Liang, 2015) introduced a benchmark of 22,033 questions over 2,108 HTML tables from Wikipedia. The original semantic parser achieved a test accuracy of 37.1%. WikiSQL (Zhong et al., 2017) provided a larger-scale benchmark with 80,654 hand-annotated pairs of questions and SQL queries spanning 24,241 tables.
TAPAS (Herzig et al., 2020), developed at Google Research, proposed a BERT-based model pre-trained directly on tables. TAPAS linearizes a table by flattening rows and columns into a token sequence, adding special position embeddings for row and column indices. Rather than generating SQL, TAPAS selects table cells and optionally applies aggregation operators (count, sum, average) to produce the answer. TAPAS improved state-of-the-art accuracy on the Sequential Question Answering (SQA) dataset from 55.1 to 67.2 and performed competitively on WikiTableQuestions and WikiSQL.
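The cell-selection-plus-aggregation idea can be sketched as follows. The cell selection itself, which the Transformer performs, is mocked out here; only the final aggregation step over the chosen cells is shown:

```python
# Sketch of TAPAS-style answer computation, assuming the model has already
# selected a set of cells and an aggregation operator. The selected cells
# below are hypothetical examples.

def apply_aggregation(op, cells):
    if op == "NONE":
        # plain cell selection: the answer is the cell content itself
        return cells[0] if len(cells) == 1 else cells
    if op == "COUNT":
        return len(cells)
    values = [float(c) for c in cells]
    if op == "SUM":
        return sum(values)
    if op == "AVERAGE":
        return sum(values) / len(values)
    raise ValueError(f"unknown operator: {op}")

# Suppose the model selected two population cells (in millions) and the
# AVERAGE operator for a question like "What is the average population ...?"
selected_cells = ["2.1", "0.5"]
print(apply_aggregation("AVERAGE", selected_cells))  # -> 1.3
```

Predicting an operator plus a cell set, rather than generating SQL, is what lets TAPAS train end to end with weak supervision from answer values alone.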
Visual question answering (VQA) requires a system to answer natural language questions about the content of an image. This task sits at the intersection of computer vision and NLP, demanding both visual perception and language understanding.
The VQA dataset (Antol et al., 2015), presented at ICCV, contains approximately 250,000 images from MS COCO, 760,000 questions, and 10 million answers. Questions range from simple object recognition ("What color is the dog?") to complex reasoning about spatial relationships, counting, and scene understanding. A second version, VQA v2 (Goyal et al., 2017), addressed biases in the original dataset by including pairs of similar images with different answers to the same question.
Modern VQA systems are typically built on vision-language models that combine a visual encoder (e.g., a Vision Transformer) with a language model. Models like BLIP-2 (Li et al., 2023), LLaVA (Liu et al., 2023), and GPT-4V (OpenAI, 2023) represent the current state of the art, handling open-ended visual questions with high accuracy.
Knowledge-grounded QA (KGQA) answers questions by querying structured knowledge graphs such as Freebase, Wikidata, or DBpedia. Instead of finding answer spans in text, the system translates a natural language question into a structured query (e.g., SPARQL) that traverses the knowledge graph to retrieve the answer entity or relation.
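The traversal idea can be illustrated with a toy triple store. Real systems would issue SPARQL against a graph like Wikidata, but the lookup logic is the same in miniature:

```python
# Toy knowledge-graph QA: the graph is a list of (subject, relation, object)
# triples, and an already-parsed question becomes a traversal over them.

triples = [
    ("Barack_Obama", "spouse", "Michelle_Obama"),
    ("Barack_Obama", "birthplace", "Honolulu"),
    ("Michelle_Obama", "birthplace", "Chicago"),
]

def query(subject, relation):
    """Return all objects linked to `subject` by `relation`."""
    return [o for s, r, o in triples if s == subject and r == relation]

# "Where was Obama's spouse born?" -> a two-hop traversal
spouse = query("Barack_Obama", "spouse")[0]
print(query(spouse, "birthplace")[0])  # -> Chicago
```

The hard part in practice is not the traversal but the semantic parsing step that maps the natural language question onto the right entities and relations in the graph.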
Key benchmarks include WebQuestions (Berant et al., 2013), grounded in Freebase, where questions are single-entity factoid queries sourced from the Google Suggest API. WebQuestionsSP (Yih et al., 2016) extended this with 4,737 questions requiring one or two hops in the knowledge graph. GrailQA (Gu et al., 2021) scaled up to 64,331 questions with three levels of generalization (i.i.d., compositional, and zero-shot) and complex logical forms.
KGQA systems face two main challenges: handling complex questions that require multiple hops across the graph, and dealing with incomplete knowledge graphs where the answer entity may not exist. Hybrid approaches that combine knowledge graph retrieval with text-based QA have shown promise in addressing these limitations.
Retrieval-augmented generation (RAG), introduced by Lewis et al. (2020) at NeurIPS, represents the modern synthesis of the retriever-reader paradigm with generative language models. RAG combines a pre-trained neural retriever (based on DPR) with a pre-trained sequence-to-sequence generator (based on BART) in an end-to-end architecture. Given a question, the retriever fetches relevant passages from a non-parametric memory (e.g., a Wikipedia index), and the generator produces an answer conditioned on both the question and the retrieved passages.
RAG set state-of-the-art results on three open-domain QA benchmarks at the time of publication and generated more factual and specific text than purely parametric models. Two variants were proposed: RAG-Sequence, which uses the same retrieved document to generate the entire answer, and RAG-Token, which can attend to different documents for each generated token.
In practice, RAG has become the dominant paradigm for building QA systems with LLMs. Organizations deploy RAG pipelines that retrieve relevant documents from proprietary databases, then feed those documents into an LLM to generate grounded answers. This approach combines the broad language capabilities of LLMs with up-to-date, verifiable information from external sources, reducing hallucination and enabling domain-specific applications without full model retraining.
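A minimal sketch of the prompt-assembly step in such a pipeline is shown below. The word-overlap retriever and the prompt template are illustrative placeholders, not a fixed standard; production systems substitute a dense retriever and an LLM call:

```python
# Sketch of a basic RAG pipeline's retrieve-then-prompt step. The retriever
# is a naive word-overlap ranker standing in for a dense or sparse retriever,
# and the final LLM call is omitted.

def retrieve(question, corpus, k=2):
    q = set(question.lower().split())
    scored = sorted(corpus, key=lambda d: -len(q & set(d.lower().split())))
    return scored[:k]

def build_prompt(question, passages):
    context = "\n\n".join(f"[{i+1}] {p}" for i, p in enumerate(passages))
    return (f"Answer the question using only the passages below.\n\n"
            f"{context}\n\nQuestion: {question}\nAnswer:")

corpus = [
    "The Eiffel Tower was completed in 1889.",
    "Mount Everest is the tallest mountain on Earth.",
    "The Eiffel Tower is located in Paris.",
]
question = "When was the Eiffel Tower completed?"
prompt = build_prompt(question, retrieve(question, corpus))
print(prompt)
```

Because the generator sees only the retrieved passages plus the question, swapping in an updated document index changes the system's knowledge without touching model weights.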
Modern RAG systems have evolved considerably from the original formulation. Advanced techniques include query rewriting, hypothetical document embeddings (HyDE), re-ranking retrieved passages with cross-encoders, and iterative retrieval where the model refines its search based on partial answers.
Several large-scale benchmarks have driven progress in question answering. The table below summarizes the most influential ones.
| Benchmark | Year | Size | Task Type | Source | Key Feature |
|---|---|---|---|---|---|
| SQuAD 1.1 | 2016 | 107,785 questions | Extractive | Wikipedia | Standard reading comprehension benchmark |
| SQuAD 2.0 | 2018 | ~150,000 questions | Extractive + unanswerable | Wikipedia | Includes adversarial unanswerable questions |
| Natural Questions | 2019 | 307,373 training examples | Long and short answer | Google Search + Wikipedia | Real user queries from Google |
| TriviaQA | 2017 | 95,000 question-answer pairs | Reading comprehension | Trivia websites + Wikipedia/Web | Complex compositional questions with distant supervision |
| HotpotQA | 2018 | 113,000 questions | Multi-hop | Wikipedia | Sentence-level supporting fact annotations |
| CoQA | 2019 | 127,000 questions | Conversational | Seven domains | Free-form answers with evidence rationales |
| QuAC | 2018 | 100,000 questions | Conversational | Wikipedia | Information-seeking dialog between student and teacher |
| WikiTableQuestions | 2015 | 22,033 questions | Table QA | Wikipedia tables | Compositional questions requiring aggregation |
| VQA v2 | 2017 | 1.1 million questions | Visual QA | MS COCO images | Balanced pairs to reduce language bias |
| WebQuestions | 2013 | 5,810 questions | Knowledge-grounded | Freebase | Single-entity factoid questions |
Natural Questions (Kwiatkowski et al., 2019), developed at Google, contains 307,373 training examples of real, anonymized queries issued to the Google search engine. For each question, an annotator is shown a Wikipedia article from the top search results and marks a long answer (usually a paragraph), a short answer (one or more entities), or indicates that the page does not contain the answer. Because the questions come from real users rather than crowdworkers reading a passage, Natural Questions tests a more realistic form of question understanding.
TriviaQA (Joshi et al., 2017) contains over 650,000 question-answer-evidence triples, with 95,000 question-answer pairs authored by trivia enthusiasts. Evidence documents are independently gathered (averaging six per question) and provide distant supervision. TriviaQA questions tend to be more compositional and require more cross-sentence reasoning than SQuAD, making it a challenging benchmark for both extractive and open-domain systems.
Two metrics dominate the evaluation of question answering systems, particularly for extractive QA: Exact Match and F1 score.
Exact Match measures the percentage of predictions that match the ground-truth answer exactly, after normalization (lowercasing, removing articles, punctuation, and extra whitespace). If the predicted answer string is identical to the reference answer string, the score is 1; otherwise, it is 0. EM is a strict metric: a prediction that is off by a single token receives no credit.
The F1 score treats both the prediction and the ground-truth answer as bags of tokens and computes the harmonic mean of precision and recall at the token level. Precision is the fraction of predicted tokens that appear in the reference answer; recall is the fraction of reference tokens that appear in the prediction. F1 provides partial credit when a predicted span overlaps with but does not exactly match the reference, making it a more forgiving and often more informative metric than EM.
For datasets with multiple reference answers (such as Natural Questions), the maximum F1 across all reference answers is typically reported.
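These metrics are straightforward to implement. The sketch below follows the normalization steps of the official SQuAD evaluation script (lowercase, strip punctuation, drop the articles a/an/the, collapse whitespace):

```python
import re
import string
from collections import Counter

# SQuAD-style answer normalization, Exact Match, and token-level F1.

def normalize(text):
    text = text.lower()
    text = "".join(c for c in text if c not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(pred, gold):
    return int(normalize(pred) == normalize(gold))

def f1_score(pred, gold):
    p_toks, g_toks = normalize(pred).split(), normalize(gold).split()
    common = Counter(p_toks) & Counter(g_toks)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p_toks)
    recall = overlap / len(g_toks)
    return 2 * precision * recall / (precision + recall)

def max_f1(pred, golds):
    """Score against multiple references, as in Natural Questions."""
    return max(f1_score(pred, g) for g in golds)

print(exact_match("The Eiffel Tower", "eiffel tower"))                  # -> 1
print(round(f1_score("Eiffel Tower in Paris", "the Eiffel Tower"), 2))  # -> 0.67
```

The second example shows why F1 is the more forgiving metric: the prediction contains extra tokens, so EM would score it 0, while token overlap still earns partial credit.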
Beyond EM and F1, other evaluation approaches are used depending on the QA variant:

- Generative QA is often scored with n-gram overlap metrics such as ROUGE and BLEU, or with human judgments, since free-form answers rarely match a reference exactly.
- Visual QA uses a consensus-based accuracy: an answer receives credit of min(#matching annotators / 3, 1), so an answer given by at least three of ten human annotators counts as fully correct.
- Knowledge-grounded QA is typically evaluated by accuracy or Hits@1 over the predicted entities.
- Benchmarks with unanswerable questions, such as SQuAD 2.0, additionally measure whether the system correctly abstains.
The emergence of large language models has blurred the line between dedicated QA systems and general-purpose language models. Models such as GPT-3 (Brown et al., 2020), GPT-4 (OpenAI, 2023), Claude (Anthropic, 2023), and Gemini (Google DeepMind, 2023) can answer questions directly through prompting without any task-specific fine-tuning.
LLMs can perform QA in a zero-shot setting (simply posing the question) or a few-shot setting (providing example question-answer pairs in the prompt). GPT-3 demonstrated competitive few-shot performance on TriviaQA and other benchmarks, and subsequent models have continued to improve. GPT-4 has shown strong performance across a wide range of QA tasks, including medical QA (approaching expert-level performance on MedQA benchmarks) and multi-hop reasoning.
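Few-shot prompting reduces to string construction: demonstration pairs precede the target question in the prompt. The template below is one common convention, not a requirement of any particular model:

```python
# Sketch of few-shot QA prompt construction. The Q:/A: format is a common
# convention for in-context learning; models complete the final "A:" line.

def few_shot_prompt(examples, question):
    demos = "\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return f"{demos}\nQ: {question}\nA:"

examples = [
    ("What is the capital of France?", "Paris"),
    ("Who wrote Hamlet?", "William Shakespeare"),
]
print(few_shot_prompt(examples, "What is the capital of Italy?"))
```

A zero-shot prompt is the degenerate case with an empty example list, leaving only the question itself.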
In practice, most production QA systems built on LLMs incorporate a retrieval step. The model receives retrieved context alongside the user question, combining the strengths of neural retrieval with the generation capabilities of the LLM. This RAG-based approach has become the standard architecture for enterprise QA applications, customer support chatbots, and knowledge management systems.
Despite their impressive capabilities, LLMs face several challenges as QA systems:

- Hallucination: models can generate fluent but factually incorrect or unsupported answers.
- Knowledge staleness: parametric knowledge is frozen at the training cutoff and cannot reflect later events without retrieval or retraining.
- Attribution: generated answers lack citations by default, making verification difficult.
- Prompt sensitivity: semantically equivalent phrasings of the same question can yield different answers.
Question answering technology powers a wide range of real-world applications:

- Web search, where engines display direct answers and featured snippets alongside ranked results.
- Virtual assistants such as Siri, Alexa, and Google Assistant, which answer spoken factoid questions.
- Customer support chatbots that answer product and account questions grounded in help-center documentation.
- Enterprise knowledge management, letting employees query internal wikis, policies, and documentation in natural language.
- Specialized domains such as medicine and law, where QA systems help professionals navigate large document collections.