Question answering (QA) is a subfield of natural language processing and information retrieval that focuses on automatically generating answers to questions posed in natural language. Rather than returning a ranked list of documents (as a traditional search engine would), a QA system produces a direct answer, often as a short text span, a generated sentence, or a structured response. Question answering has been one of the longest-studied problems in artificial intelligence, dating back to the 1960s, and remains a central benchmark for evaluating language understanding in modern machine learning systems.
The earliest QA systems were rule-based programs designed for narrow domains. BASEBALL (Green et al., 1961) answered questions about American League baseball games over a single season, drawing from a structured database. LUNAR (Woods, 1972) answered questions about the geological analysis of rock samples returned by the Apollo moon missions; when demonstrated at the Second Annual Lunar Science Conference in 1971, it correctly answered 78% of questions posed by scientists who had no prior training on the system, and about 90% when trivial errors are discounted. Around the same time, SHRDLU (Winograd, 1971) demonstrated natural language understanding in a simulated blocks-world environment.
These early systems relied on hand-crafted rules and small, curated knowledge bases, which limited them to closed domains. Throughout the 1980s and 1990s, progress was incremental, with systems built around template matching, information extraction pipelines, and shallow parsing. The annual Text REtrieval Conference (TREC) QA track, launched in 1999, helped standardize evaluation and spurred the development of more robust open-domain approaches.
The arrival of large-scale reading comprehension datasets such as SQuAD in 2016, along with advances in deep learning and transfer learning, transformed the field. Neural models rapidly surpassed traditional pipelines, and the introduction of BERT in 2018 set new performance records on nearly every QA benchmark. Today, large language models (LLMs) such as GPT-4 and Claude function as general-purpose QA engines, often augmented with retrieval components.
Question answering encompasses a broad family of tasks that differ along several dimensions: where the answer comes from, how the answer is produced, what modality the input takes, and whether the interaction is single-turn or multi-turn. The table below summarizes the major categories.
| QA Type | Answer Source | Answer Format | Representative Datasets / Systems | Key Characteristics |
|---|---|---|---|---|
| Extractive QA | Provided passage or document | Contiguous text span from the source | SQuAD, NewsQA, BiDAF, BERT-QA | Answer is always a substring of the context |
| Abstractive / Generative QA | Passage, knowledge base, or model parameters | Freely generated natural language | T5, BART, UnifiedQA, FiD | Answer may paraphrase or synthesize information |
| Open-domain QA | Large corpus (e.g., Wikipedia) | Span or generated text | DrQA, ORQA, DPR + reader, RAG | Retriever-reader pipeline; no pre-selected context |
| Closed-book QA | Model parameters only | Generated text | T5 (Roberts et al., 2020), GPT-3 | No retrieval step; relies on memorized knowledge |
| Multi-hop QA | Multiple documents or passages | Span or generated text | HotpotQA, MuSiQue, 2WikiMultiHopQA | Requires reasoning across two or more evidence sources |
| Conversational QA | Passage or knowledge source within a dialog | Span or free-form text | CoQA, QuAC, ChatQA | Questions depend on dialog history; coreference resolution needed |
| Table QA | Structured tables | Cell value, aggregation, or generated text | WikiTableQuestions, WikiSQL, SQA, TAPAS | Requires understanding rows, columns, and operations |
| Visual QA | Image (with optional text) | Short text answer | VQA, VQA v2, GQA, OK-VQA | Combines computer vision and language understanding |
| Knowledge-grounded QA | Knowledge graph | Entity or relation | WebQuestions, GrailQA, MetaQA | Queries resolved via graph traversal or SPARQL |
In extractive QA, the system receives a question and a context passage, and must identify the contiguous span of text within the passage that answers the question. Because the answer is always a substring of the provided context, the problem reduces to predicting start and end token positions.
The Stanford Question Answering Dataset (SQuAD), introduced by Rajpurkar et al. (2016), became the most widely used benchmark for extractive QA. SQuAD 1.1 contains 107,785 question-answer pairs derived from 536 Wikipedia articles. Crowdworkers read Wikipedia paragraphs and wrote questions whose answers are spans within those paragraphs. Human performance on SQuAD 1.1 was measured at an F1 score of 91.2 and an Exact Match (EM) score of 82.3.
SQuAD 2.0 (Rajpurkar et al., 2018) extended the dataset by adding over 50,000 adversarially crafted unanswerable questions, bringing the total to roughly 150,000 examples. To succeed on SQuAD 2.0, a system must not only extract correct answer spans but also recognize when the passage does not contain a valid answer and abstain from responding.
The Bi-Directional Attention Flow (BiDAF) model, introduced by Seo et al. (2017) at ICLR, was an influential early neural architecture for extractive QA. BiDAF processes the question and the context through multiple representation layers (character embeddings, word embeddings, and contextual LSTM encodings) and then applies a bidirectional attention mechanism that computes both context-to-query and query-to-context attention. A key design choice in BiDAF is that the attention layer does not collapse the context into a single fixed-size vector; instead, it preserves the full sequence of context representations, allowing downstream layers to retain fine-grained information. BiDAF topped the SQuAD leaderboard upon release and inspired subsequent architectures.
BERT (Bidirectional Encoder Representations from Transformers), introduced by Devlin et al. (2019), transformed extractive QA. BERT is pre-trained on large text corpora using masked language modeling and next sentence prediction, producing deep bidirectional representations. For QA, the question and passage are concatenated as input, and two learned vectors (a start vector and an end vector) are used to compute the probability of each token being the beginning or end of the answer span. The training objective maximizes the log-likelihood of the correct start and end positions.
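The span-prediction step can be sketched independently of the encoder. The function below is a minimal illustration in which hand-written scores stand in for model logits; it picks the highest-scoring valid (start, end) pair, enforcing that the end position does not precede the start and that spans stay within a maximum length:

```python
# Minimal sketch of extractive span selection, assuming per-token start and
# end scores have already been produced by an encoder such as BERT. The
# numbers below are illustrative, not real model outputs.

def best_span(start_scores, end_scores, max_len=10):
    """Return (start, end) maximizing start_scores[i] + end_scores[j]
    subject to i <= j < i + max_len."""
    best = (0, 0)
    best_score = float("-inf")
    for i, s in enumerate(start_scores):
        for j in range(i, min(i + max_len, len(end_scores))):
            score = s + end_scores[j]
            if score > best_score:
                best_score = score
                best = (i, j)
    return best

tokens = ["BERT", "was", "introduced", "by", "Devlin", "et", "al.", "in", "2019"]
start = [0.1, 0.0, 0.0, 0.0, 0.2, 0.0, 0.0, 0.0, 3.0]   # hypothetical logits
end   = [0.0, 0.0, 0.1, 0.0, 0.0, 0.0, 0.3, 0.0, 2.5]
i, j = best_span(start, end)
print(" ".join(tokens[i:j + 1]))  # -> 2019
```

In the full model, the start and end scores are dot products between each token representation and the learned start and end vectors, and the search above is applied to the resulting logits.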
BERT-Large achieved a Test F1 of 93.2 on SQuAD 1.1 and a Test F1 of 83.1 on SQuAD 2.0, surpassing human-level performance on SQuAD 1.1 and setting new state-of-the-art results at the time of publication. This demonstrated that large-scale pre-training followed by task-specific fine-tuning could produce highly effective QA models with minimal architectural changes.
Subsequent pre-trained models built on similar ideas. RoBERTa (Liu et al., 2019) optimized the pre-training procedure; ALBERT (Lan et al., 2020) reduced parameters through factorized embeddings; DeBERTa (He et al., 2021) introduced disentangled attention and achieved further gains on SQuAD and other benchmarks; and XLNet (Yang et al., 2019) used permutation-based language modeling to capture bidirectional context.
While extractive QA restricts answers to spans within a given passage, abstractive (or generative) QA produces answers in free-form natural language. A generative QA model takes a question (and optionally a context passage) as input and generates the answer token by token using a sequence-to-sequence architecture.
Encoder-decoder Transformers such as T5 (Raffel et al., 2020) and BART (Lewis et al., 2020) serve as the standard backbone for generative QA. T5 frames every NLP task as a text-to-text problem: the input is a string like "question: What is the capital of France? context: France is a country in Europe. Its capital is Paris." and the output is "Paris." This unified format allows a single model to handle extractive, abstractive, multiple-choice, and yes/no QA tasks.
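The text-to-text formatting itself is plain string construction. The helper below is a hypothetical sketch of such a template; the exact strings a given checkpoint expects may differ:

```python
# Sketch of T5-style text-to-text input formatting for QA. The field names
# ("question:", "context:", "options:") follow the convention described
# above and are illustrative, not a fixed API.

def to_text2text(question, context=None, choices=None):
    parts = [f"question: {question}"]
    if choices:
        letters = "ABCDEFGH"
        opts = " ".join(f"({letters[k]}) {c}" for k, c in enumerate(choices))
        parts.append(f"options: {opts}")
    if context:
        parts.append(f"context: {context}")
    return " ".join(parts)

print(to_text2text("What is the capital of France?",
                   context="France is a country in Europe. Its capital is Paris."))
```

The same function covers extractive, abstractive, and multiple-choice inputs, which is exactly the uniformity that makes the text-to-text framing attractive.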
UnifiedQA (Khashabi et al., 2020), developed at the Allen Institute for AI, demonstrated that a single T5-based model fine-tuned on eight QA datasets spanning four different formats (extractive, abstractive, multiple-choice, and yes/no) could match or outperform nine format-specific models across 17 QA benchmarks. This work showed that the boundaries between QA formats are largely artificial.
A key challenge for generative QA is hallucination: the model may produce fluent but factually incorrect answers that are not supported by any source. Grounding generation in retrieved evidence (as in retrieval-augmented generation) helps mitigate this problem.
Open-domain QA (ODQA) refers to answering factoid questions without a pre-specified context passage. The system must first find relevant information from a large corpus (such as all of English Wikipedia) and then extract or generate an answer from the retrieved documents.
DrQA (Chen et al., 2017), developed at Facebook AI Research, introduced the retriever-reader architecture that became the standard paradigm for ODQA. The system has two components:

- a Document Retriever, which ranks Wikipedia articles against the question using TF-IDF weighted bag-of-words vectors with hashed bigram features; and
- a Document Reader, a multi-layer recurrent neural network reading comprehension model that extracts an answer span from the top retrieved articles.
DrQA used the entirety of English Wikipedia (over 5 million articles) as its knowledge source. By combining document retrieval with machine comprehension, it demonstrated that existing reading comprehension models could scale to open-domain settings when paired with an effective retrieval step.
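The retrieval stage can be illustrated with a toy TF-IDF ranker. Note that DrQA's actual retriever uses hashed unigram and bigram features over all of Wikipedia, which this sketch omits:

```python
import math
from collections import Counter

# Toy unigram TF-IDF retriever in the spirit of DrQA's first stage, scoring
# a handful of in-memory documents rather than a Wikipedia-scale index.

def tokenize(text):
    return text.lower().split()

def tfidf_rank(query, docs):
    """Return document indices sorted by TF-IDF relevance to the query."""
    n = len(docs)
    doc_tokens = [tokenize(d) for d in docs]
    df = Counter(t for toks in doc_tokens for t in set(toks))
    idf = {t: math.log(n / df[t]) for t in df}
    scores = []
    for idx, toks in enumerate(doc_tokens):
        tf = Counter(toks)
        score = sum(tf[t] * idf.get(t, 0.0) for t in tokenize(query))
        scores.append((score, idx))
    return [i for _, i in sorted(scores, reverse=True)]

docs = [
    "Paris is the capital of France .",
    "Berlin is the capital of Germany .",
    "The Eiffel Tower is in Paris , France .",
]
print(tfidf_rank("capital of France", docs)[0])  # index of top document
```

A reader model would then be run over only the top-ranked documents, which is what makes the pipeline tractable at corpus scale.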
Subsequent work replaced the sparse TF-IDF retriever with dense neural retrievers. ORQA (Lee et al., 2019) jointly pre-trained the retriever and reader using an Inverse Cloze Task. Dense Passage Retrieval (DPR) by Karpukhin et al. (2020) trained a dual-encoder model with a BERT-based question encoder and a BERT-based passage encoder, using a contrastive learning objective. DPR significantly outperformed BM25 and TF-IDF baselines on multiple open-domain QA benchmarks.
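DPR's training signal can be sketched via its in-batch contrastive loss. In the sketch below, tiny hand-written vectors stand in for BERT encoder outputs; each question's positive passage sits at the same batch index, and the other passages in the batch serve as negatives:

```python
import math

# Sketch of DPR's in-batch contrastive objective, assuming question and
# passage embeddings have already been computed by the two encoders.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def in_batch_nll(q_vecs, p_vecs):
    """Average negative log-likelihood of each question's positive passage,
    with similarities given by dot products and a softmax over the batch."""
    loss = 0.0
    for i, q in enumerate(q_vecs):
        sims = [dot(q, p) for p in p_vecs]
        log_z = math.log(sum(math.exp(s) for s in sims))
        loss += log_z - sims[i]
    return loss / len(q_vecs)

q_vecs = [[1.0, 0.0], [0.0, 1.0]]
p_vecs = [[0.9, 0.1], [0.1, 0.9]]   # positives aligned on the diagonal
print(round(in_batch_nll(q_vecs, p_vecs), 4))
```

Minimizing this loss pulls each question embedding toward its positive passage and away from the in-batch negatives, which is what lets dense retrieval outperform sparse baselines like BM25.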
Izacard and Grave (2021) proposed Fusion-in-Decoder (FiD), a generative approach to open-domain QA. FiD encodes each retrieved passage independently with a T5 encoder, then concatenates all encoded representations and feeds them into a T5 decoder to generate the answer. By processing 100 retrieved passages, FiD achieved 51.4 EM on Natural Questions and 67.6 EM on TriviaQA, outperforming both extractive retriever-reader models and closed-book approaches while using far fewer parameters than comparable closed-book systems.
Closed-book QA tests whether a language model can answer factoid questions using only the knowledge stored in its parameters, without accessing any external documents at inference time. This setup is analogous to a student taking an exam without reference materials.
Roberts et al. (2020), in "How Much Knowledge Can You Pack Into the Parameters of a Language Model?", fine-tuned T5 models of varying sizes on open-domain QA datasets. They found that performance scaled consistently with model size: on the Natural Questions test set, T5-Base achieved 27.0 EM, T5-Large reached 29.8, T5-3B scored 32.1, and T5-11B achieved 34.5. With additional salient span masking (SSM) pre-training, T5-11B further improved to 36.6 on Natural Questions and 60.5 on TriviaQA. These results demonstrated that large language models can memorize a substantial amount of world knowledge in their parameters.
However, subsequent analysis by Lewis et al. (2021) found that much of the strong closed-book performance could be attributed to question memorization from the training set, raising concerns about whether these models truly generalize. Despite this caveat, closed-book QA remains an important paradigm for understanding what LLMs learn during pre-training.
Multi-hop QA requires reasoning over two or more pieces of evidence to arrive at an answer. Unlike single-hop questions that can be resolved from a single sentence or passage, multi-hop questions demand that the system retrieve multiple documents, identify relevant facts in each, and chain them together.
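The chaining itself can be shown with a deliberately tiny example, where two hypothetical fact tables stand in for two retrieved documents:

```python
# Minimal illustration of two-hop reasoning: answering "Where was the
# director of Inception born?" requires one fact from each of two sources.
# The dictionaries below are stand-ins for retrieved documents.

film_to_director = {"Inception": "Christopher Nolan"}
person_to_birthplace = {"Christopher Nolan": "London"}

def two_hop(film):
    director = film_to_director[film]      # hop 1: fact from first document
    return person_to_birthplace[director]  # hop 2: fact from second document

print(two_hop("Inception"))  # -> London
```

Neither document alone answers the question; the intermediate entity (the director) is what links the two hops, and failing to identify it breaks the chain.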
HotpotQA (Yang et al., 2018), presented at EMNLP 2018, is the most widely used multi-hop QA benchmark. It contains approximately 113,000 question-answer pairs based on Wikipedia, with four defining features:

1. Questions require finding and reasoning over multiple supporting documents.
2. Questions are diverse and not constrained to any pre-existing knowledge base or schema.
3. Sentence-level supporting facts are annotated, enabling strong supervision and explainable predictions.
4. A new class of comparison questions tests the ability to contrast two entities.
HotpotQA includes two evaluation settings: a distractor setting where the model receives 10 paragraphs (2 gold, 8 distractors) and a fullwiki setting where the model must retrieve evidence from all of Wikipedia. Models are evaluated on answer EM and F1, as well as supporting fact EM and F1.
Other multi-hop benchmarks include MuSiQue (Trivedi et al., 2022), which constructs questions requiring 2 to 4 reasoning hops, and 2WikiMultiHopQA (Ho et al., 2020), which focuses on questions requiring cross-document reasoning over two Wikipedia articles.
Conversational QA extends the standard QA task to multi-turn dialogs, where each question may depend on the conversation history. This introduces challenges such as coreference resolution (e.g., "When was he born?" following a question about a specific person), pragmatic reasoning, and topic shifts.
CoQA (Reddy et al., 2019), published in the Transactions of the Association for Computational Linguistics, contains 127,000 questions with answers collected from 8,000 conversations about text passages drawn from seven domains: children's stories, literature, middle and high school English exams, news articles, Wikipedia, Reddit, and science texts. Answers in CoQA are free-form text, and each answer is paired with a rationale (the span in the passage that supports it). The best system at the time of publication achieved an F1 of 65.4, compared to human performance of 88.8.
QuAC (Choi et al., 2018), presented at EMNLP 2018, contains 14,000 information-seeking QA dialogs with 100,000 questions in total. In QuAC, a "student" asks freeform questions to learn about a hidden Wikipedia section, and a "teacher" answers by selecting short excerpts from the text. Because the student cannot see the passage, questions tend to be more open-ended and exploratory than in standard extractive QA. The best model at the time of publication trailed human performance by 20 F1 points.
More recent work in conversational QA includes ChatQA (Liu et al., 2024) from NVIDIA, which demonstrated that a 70-billion-parameter model could match GPT-4-level accuracy across 10 conversational QA benchmarks.
Table QA involves answering natural language questions over structured or semi-structured tabular data. Unlike free-text QA systems, table QA systems must understand rows, columns, and headers, and perform operations such as counting, summing, averaging, sorting, and filtering.
Early table QA systems used semantic parsing to convert natural language questions into executable logical forms (e.g., SQL queries). WikiTableQuestions (Pasupat and Liang, 2015) introduced a benchmark of 22,033 questions over 2,108 HTML tables from Wikipedia. The original semantic parser achieved a test accuracy of 37.1%. WikiSQL (Zhong et al., 2017) provided a larger-scale benchmark with 80,654 hand-annotated pairs of questions and SQL queries spanning 24,241 tables.
TAPAS (Herzig et al., 2020), developed at Google Research, proposed a BERT-based model pre-trained directly on tables. TAPAS linearizes a table by flattening rows and columns into a token sequence, adding special position embeddings for row and column indices. Rather than generating SQL, TAPAS selects table cells and optionally applies aggregation operators (count, sum, average) to produce the answer. TAPAS improved state-of-the-art accuracy on the Sequential Question Answering (SQA) dataset from 55.1 to 67.2 and performed competitively on WikiTableQuestions and WikiSQL.
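The cell-selection-plus-aggregation idea can be sketched as follows. The cell selection itself, which the Transformer performs, is mocked out here; only the final aggregation step over the chosen cells is shown:

```python
# Sketch of TAPAS-style answer computation, assuming the model has already
# selected a set of cells and an aggregation operator. The selected cells
# below are hypothetical examples.

def apply_aggregation(op, cells):
    if op == "NONE":
        # plain cell selection: the answer is the cell content itself
        return cells[0] if len(cells) == 1 else cells
    if op == "COUNT":
        return len(cells)
    values = [float(c) for c in cells]
    if op == "SUM":
        return sum(values)
    if op == "AVERAGE":
        return sum(values) / len(values)
    raise ValueError(f"unknown operator: {op}")

# Suppose the model selected two population cells (in millions) and the
# AVERAGE operator for a question like "What is the average population ...?"
selected_cells = ["2.1", "0.5"]
print(apply_aggregation("AVERAGE", selected_cells))  # -> 1.3
```

Predicting an operator plus a cell set, rather than generating SQL, is what lets TAPAS train end to end with weak supervision from answer values alone.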
Visual question answering (VQA) requires a system to answer natural language questions about the content of an image. This task sits at the intersection of computer vision and NLP, demanding both visual perception and language understanding.
The VQA dataset (Antol et al., 2015), presented at ICCV, contains approximately 250,000 images from MS COCO, 760,000 questions, and 10 million answers. Questions range from simple object recognition ("What color is the dog?") to complex reasoning about spatial relationships, counting, and scene understanding. A second version, VQA v2 (Goyal et al., 2017), addressed biases in the original dataset by including pairs of similar images with different answers to the same question.
Modern VQA systems are typically built on vision-language models that combine a visual encoder (e.g., a Vision Transformer) with a language model. Models like BLIP-2 (Li et al., 2023), LLaVA (Liu et al., 2023), and GPT-4V (OpenAI, 2023) represent the current state of the art, handling open-ended visual questions with high accuracy.
Knowledge-grounded QA (KGQA) answers questions by querying structured knowledge graphs such as Freebase, Wikidata, or DBpedia. Instead of finding answer spans in text, the system translates a natural language question into a structured query (e.g., SPARQL) that traverses the knowledge graph to retrieve the answer entity or relation.
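The traversal idea can be illustrated with a toy triple store. Real systems would issue SPARQL against a graph like Wikidata, but the lookup logic is the same in miniature:

```python
# Toy knowledge-graph QA: the graph is a list of (subject, relation, object)
# triples, and an already-parsed question becomes a traversal over them.

triples = [
    ("Barack_Obama", "spouse", "Michelle_Obama"),
    ("Barack_Obama", "birthplace", "Honolulu"),
    ("Michelle_Obama", "birthplace", "Chicago"),
]

def query(subject, relation):
    """Return all objects linked to `subject` by `relation`."""
    return [o for s, r, o in triples if s == subject and r == relation]

# "Where was Obama's spouse born?" -> a two-hop traversal
spouse = query("Barack_Obama", "spouse")[0]
print(query(spouse, "birthplace")[0])  # -> Chicago
```

The hard part in practice is not the traversal but the semantic parsing step that maps the natural language question onto the right entities and relations in the graph.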
Key benchmarks include WebQuestions (Berant et al., 2013), grounded in Freebase, where questions are single-entity factoid queries sourced from the Google Suggest API. WebQuestionsSP (Yih et al., 2016) extended this with 4,737 questions requiring one or two hops in the knowledge graph. GrailQA (Gu et al., 2021) scaled up to 64,331 questions with three levels of generalization (i.i.d., compositional, and zero-shot) and complex logical forms.
KGQA systems face two main challenges: handling complex questions that require multiple hops across the graph, and dealing with incomplete knowledge graphs where the answer entity may not exist. Hybrid approaches that combine knowledge graph retrieval with text-based QA have shown promise in addressing these limitations.
Retrieval-augmented generation (RAG), introduced by Lewis et al. (2020) at NeurIPS, represents the modern synthesis of the retriever-reader paradigm with generative language models. RAG combines a pre-trained neural retriever (based on DPR) with a pre-trained sequence-to-sequence generator (based on BART) in an end-to-end architecture. Given a question, the retriever fetches relevant passages from a non-parametric memory (e.g., a Wikipedia index), and the generator produces an answer conditioned on both the question and the retrieved passages.
RAG set state-of-the-art results on three open-domain QA benchmarks at the time of publication and generated more factual and specific text than purely parametric models. Two variants were proposed: RAG-Sequence, which uses the same retrieved document to generate the entire answer, and RAG-Token, which can attend to different documents for each generated token.
In practice, RAG has become the dominant paradigm for building QA systems with LLMs. Organizations deploy RAG pipelines that retrieve relevant documents from proprietary databases, then feed those documents into an LLM to generate grounded answers. This approach combines the broad language capabilities of LLMs with up-to-date, verifiable information from external sources, reducing hallucination and enabling domain-specific applications without full model retraining.
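A minimal sketch of the prompt-assembly step in such a pipeline is shown below. The word-overlap retriever and the prompt template are illustrative placeholders, not a fixed standard; production systems substitute a dense retriever and an LLM call:

```python
# Sketch of a basic RAG pipeline's retrieve-then-prompt step. The retriever
# is a naive word-overlap ranker standing in for a dense or sparse retriever,
# and the final LLM call is omitted.

def retrieve(question, corpus, k=2):
    q = set(question.lower().split())
    scored = sorted(corpus, key=lambda d: -len(q & set(d.lower().split())))
    return scored[:k]

def build_prompt(question, passages):
    context = "\n\n".join(f"[{i+1}] {p}" for i, p in enumerate(passages))
    return (f"Answer the question using only the passages below.\n\n"
            f"{context}\n\nQuestion: {question}\nAnswer:")

corpus = [
    "The Eiffel Tower was completed in 1889.",
    "Mount Everest is the tallest mountain on Earth.",
    "The Eiffel Tower is located in Paris.",
]
question = "When was the Eiffel Tower completed?"
prompt = build_prompt(question, retrieve(question, corpus))
print(prompt)
```

Because the generator sees only the retrieved passages plus the question, swapping in an updated document index changes the system's knowledge without touching model weights.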
Modern RAG systems have evolved considerably from the original formulation. Advanced techniques include query rewriting, hypothetical document embeddings (HyDE), re-ranking retrieved passages with cross-encoders, and iterative retrieval where the model refines its search based on partial answers.
Several large-scale benchmarks have driven progress in question answering. The table below summarizes the most influential ones.
| Benchmark | Year | Size | Task Type | Source | Key Feature |
|---|---|---|---|---|---|
| SQuAD 1.1 | 2016 | 107,785 questions | Extractive | Wikipedia | Standard reading comprehension benchmark |
| SQuAD 2.0 | 2018 | ~150,000 questions | Extractive + unanswerable | Wikipedia | Includes adversarial unanswerable questions |
| Natural Questions | 2019 | 307,373 training examples | Long and short answer | Google Search + Wikipedia | Real user queries from Google |
| TriviaQA | 2017 | 95,000 question-answer pairs | Reading comprehension | Trivia websites + Wikipedia/Web | Complex compositional questions with distant supervision |
| HotpotQA | 2018 | 113,000 questions | Multi-hop | Wikipedia | Sentence-level supporting fact annotations |
| CoQA | 2019 | 127,000 questions | Conversational | Seven domains | Free-form answers with evidence rationales |
| QuAC | 2018 | 100,000 questions | Conversational | Wikipedia | Information-seeking dialog between student and teacher |
| WikiTableQuestions | 2015 | 22,033 questions | Table QA | Wikipedia tables | Compositional questions requiring aggregation |
| VQA v2 | 2017 | 1.1 million questions | Visual QA | MS COCO images | Balanced pairs to reduce language bias |
| WebQuestions | 2013 | 5,810 questions | Knowledge-grounded | Freebase | Single-entity factoid questions |
Natural Questions (Kwiatkowski et al., 2019), developed at Google, contains 307,373 training examples of real, anonymized queries issued to the Google search engine. For each question, an annotator is shown a Wikipedia article from the top search results and marks a long answer (usually a paragraph), a short answer (one or more entities), or indicates that the page does not contain the answer. Because the questions come from real users rather than crowdworkers reading a passage, Natural Questions tests a more realistic form of question understanding.
TriviaQA (Joshi et al., 2017) contains over 650,000 question-answer-evidence triples, with 95,000 question-answer pairs authored by trivia enthusiasts. Evidence documents are independently gathered (averaging six per question) and provide distant supervision. TriviaQA questions tend to be more compositional and require more cross-sentence reasoning than SQuAD, making it a challenging benchmark for both extractive and open-domain systems.
Two metrics dominate the evaluation of question answering systems, particularly for extractive QA: Exact Match and F1 score.
Exact Match measures the percentage of predictions that match the ground-truth answer exactly, after normalization (lowercasing, removing articles, punctuation, and extra whitespace). If the predicted answer string is identical to the reference answer string, the score is 1; otherwise, it is 0. EM is a strict metric: a prediction that is off by a single token receives no credit.
The F1 score treats both the prediction and the ground-truth answer as bags of tokens and computes the harmonic mean of precision and recall at the token level. Precision is the fraction of predicted tokens that appear in the reference answer; recall is the fraction of reference tokens that appear in the prediction. F1 provides partial credit when a predicted span overlaps with but does not exactly match the reference, making it a more forgiving and often more informative metric than EM.
For datasets with multiple reference answers (such as Natural Questions), the maximum F1 across all reference answers is typically reported.
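These metrics are straightforward to implement. The sketch below follows the normalization steps of the official SQuAD evaluation script (lowercase, strip punctuation, drop the articles a/an/the, collapse whitespace):

```python
import re
import string
from collections import Counter

# SQuAD-style answer normalization, Exact Match, and token-level F1.

def normalize(text):
    text = text.lower()
    text = "".join(c for c in text if c not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(pred, gold):
    return int(normalize(pred) == normalize(gold))

def f1_score(pred, gold):
    p_toks, g_toks = normalize(pred).split(), normalize(gold).split()
    common = Counter(p_toks) & Counter(g_toks)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p_toks)
    recall = overlap / len(g_toks)
    return 2 * precision * recall / (precision + recall)

def max_f1(pred, golds):
    """Score against multiple references, as in Natural Questions."""
    return max(f1_score(pred, g) for g in golds)

print(exact_match("The Eiffel Tower", "eiffel tower"))                  # -> 1
print(round(f1_score("Eiffel Tower in Paris", "the Eiffel Tower"), 2))  # -> 0.67
```

The second example shows why F1 is the more forgiving metric: the prediction contains extra tokens, so EM would score it 0, while token overlap still earns partial credit.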
Beyond EM and F1, other evaluation approaches are used depending on the QA variant:

- Generative QA is often scored with n-gram overlap metrics such as ROUGE and BLEU, or with human judgments, since free-form answers rarely match a reference exactly.
- Visual QA uses a consensus-based accuracy: an answer receives credit of min(#matching annotators / 3, 1), so an answer given by at least three of ten human annotators counts as fully correct.
- Knowledge-grounded QA is typically evaluated by accuracy or Hits@1 over the predicted entities.
- Benchmarks with unanswerable questions, such as SQuAD 2.0, additionally measure whether the system correctly abstains.
The emergence of large language models has blurred the line between dedicated QA systems and general-purpose language models. Models such as GPT-3 (Brown et al., 2020), GPT-4 (OpenAI, 2023), Claude (Anthropic, 2023), and Gemini (Google DeepMind, 2023) can answer questions directly through prompting without any task-specific fine-tuning.
LLMs can perform QA in a zero-shot setting (simply posing the question) or a few-shot setting (providing example question-answer pairs in the prompt). GPT-3 demonstrated competitive few-shot performance on TriviaQA and other benchmarks, and subsequent models have continued to improve. GPT-4 has shown strong performance across a wide range of QA tasks, including medical QA (approaching expert-level performance on MedQA benchmarks) and multi-hop reasoning.
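Few-shot prompting reduces to string construction: demonstration pairs precede the target question in the prompt. The template below is one common convention, not a requirement of any particular model:

```python
# Sketch of few-shot QA prompt construction. The Q:/A: format is a common
# convention for in-context learning; models complete the final "A:" line.

def few_shot_prompt(examples, question):
    demos = "\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return f"{demos}\nQ: {question}\nA:"

examples = [
    ("What is the capital of France?", "Paris"),
    ("Who wrote Hamlet?", "William Shakespeare"),
]
print(few_shot_prompt(examples, "What is the capital of Italy?"))
```

A zero-shot prompt is the degenerate case with an empty example list, leaving only the question itself.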
In practice, most production QA systems built on LLMs incorporate a retrieval step. The model receives retrieved context alongside the user question, combining the strengths of neural retrieval with the generation capabilities of the LLM. This RAG-based approach has become the standard architecture for enterprise QA applications, customer support chatbots, and knowledge management systems.
Despite their impressive capabilities, LLMs face several challenges as QA systems:

- Hallucination: models can generate fluent but factually incorrect or unsupported answers.
- Knowledge staleness: parametric knowledge is frozen at the training cutoff and cannot reflect later events without retrieval or retraining.
- Attribution: generated answers lack citations by default, making verification difficult.
- Prompt sensitivity: semantically equivalent phrasings of the same question can yield different answers.
Question answering technology powers a wide range of real-world applications:

- Web search, where engines display direct answers and featured snippets alongside ranked results.
- Virtual assistants such as Siri, Alexa, and Google Assistant, which answer spoken factoid questions.
- Customer support chatbots that answer product and account questions grounded in help-center documentation.
- Enterprise knowledge management, letting employees query internal wikis, policies, and documentation in natural language.
- Specialized domains such as medicine and law, where QA systems help professionals navigate large document collections.