TriviaQA is a large-scale reading comprehension and question answering dataset containing over 650,000 question-answer-evidence triples. Introduced in 2017 by Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer at the University of Washington, TriviaQA was designed to address limitations in existing reading comprehension benchmarks by collecting questions independently from evidence documents and requiring complex, multi-sentence reasoning to find answers. The dataset pairs 95,956 question-answer pairs authored by trivia enthusiasts with evidence gathered from both Wikipedia articles and web search results, providing an average of six evidence documents per question. TriviaQA has become one of the most widely used benchmarks in natural language processing, with the original paper accumulating thousands of citations since its publication at the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017) in Vancouver, Canada.
Before TriviaQA, the dominant reading comprehension benchmarks suffered from a fundamental design flaw: questions were typically written by people who had already read the evidence passage. In datasets like SQuAD (Stanford Question Answering Dataset), annotators viewed a Wikipedia paragraph and then composed questions about it. While this approach produced high-quality question-answer pairs, it also introduced biases. Questions tended to closely mirror the vocabulary and sentence structure of the source passage, making it possible for models to find answers through simple pattern matching or word overlap rather than genuine comprehension.
The researchers behind TriviaQA recognized that this coupling between question creation and evidence selection limited how well benchmarks could test true reading comprehension ability. They proposed a different approach: collect questions that were written completely independently of any evidence document, then retrospectively gather supporting documents from Wikipedia and the web. This decoupling meant that the syntactic and lexical overlap between a question and its answer-containing sentence would be naturally low, forcing models to perform genuine reasoning rather than surface-level matching.
The choice to source questions from trivia enthusiasts was deliberate. Trivia questions are inherently complex, often compositional, and cover a wide range of topics. Unlike questions generated by crowdworkers for a specific NLP task, trivia questions reflect genuine human curiosity and knowledge-testing patterns. They frequently require understanding of time frames, comparisons, fine-grained categories, and multi-step reasoning, all properties that make them effective probes of machine reading ability.
TriviaQA's questions were gathered from 14 trivia and quiz-league websites. The research team scraped question-answer pairs from these sites, which host questions written by trivia enthusiasts for pub quizzes, quiz leagues, and online trivia competitions. Questions with fewer than four tokens were removed, since short questions tended to be either trivially simple or too vague to answer reliably.
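The length filter described above can be sketched as a simple token-count check (a minimal illustration under the stated four-token threshold, not the authors' actual preprocessing code):

```python
def keep_question(question: str, min_tokens: int = 4) -> bool:
    """Retain only questions with at least `min_tokens` whitespace-separated tokens."""
    return len(question.split()) >= min_tokens

questions = [
    "Capital of France?",  # 3 tokens: too short, dropped
    "What fragrant essential oil is obtained from Damask Rose?",  # kept
]
filtered = [q for q in questions if keep_question(q)]
```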
The resulting collection contains 95,956 question-answer pairs covering a wide range of topics. Each question averages 14 tokens in length, reflecting the compositional nature of trivia questions. For instance, a typical question might read: "What fragrant essential oil is obtained from Damask Rose?" This level of specificity and compositionality is significantly more challenging than the simpler factoid questions found in many other QA datasets.
Answers in TriviaQA use a rich alias system. Since the same entity or concept can be referred to in multiple ways, the dataset includes normalized aliases for each answer. For questions whose answers correspond to Wikipedia entities, the dataset leverages Wikipedia redirect pages and disambiguation pages to compile comprehensive alias lists. This design ensures that models receive credit for producing any valid form of the correct answer.
The distribution of answer types breaks down as follows:
| Answer Type | Percentage |
|---|---|
| Wikipedia entity title | 92.85% |
| Numerical answer | 4.17% |
| Free-text answer | 2.98% |
Among the Wikipedia entity answers, the named entity categories are distributed across several types:
| Named Entity Category | Percentage of Wikipedia Entity Answers |
|---|---|
| Person | 32% |
| Location | 23% |
| Miscellaneous | 40% |
| Organization | 5% |
The heavy concentration of Wikipedia entity answers (92.85%) reflects the nature of trivia questions, which predominantly ask about notable people, places, events, and things that have their own Wikipedia pages.
A distinctive feature of TriviaQA is its use of two independent sources of evidence documents: Wikipedia articles and web search results. For each question, evidence was gathered through two complementary methods.
Wikipedia Evidence. The researchers applied TagMe, an off-the-shelf entity linking tool, to identify entities mentioned in each question. The corresponding Wikipedia articles for these entities were then collected as evidence documents. This approach captures the structured, encyclopedic knowledge most relevant to answering trivia questions.
Web Evidence. Each question was submitted as a search query to the Bing Web Search API. The top 50 search result URLs were collected, and the researchers crawled the top 10 web pages for each question. These web documents provide more diverse evidence, including blog posts, news articles, reference sites, and other sources that might contain the answer in different contexts and phrasings.
After gathering candidate evidence documents, the researchers filtered them to create the "reading comprehension" (RC) subset. In this filtered version, only documents that actually contain the answer string are retained. This filtering step ensures that every evidence document in the RC split provides at least one passage where the answer appears, creating a cleaner signal for training and evaluation. The unfiltered version retains all documents, including those that do not contain the answer, making it more suitable for information-retrieval-style question answering.
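The RC filtering step amounts to a substring check against the answer and its aliases. A simplified sketch (the actual pipeline also normalizes strings before matching):

```python
def contains_answer(document: str, aliases: list[str]) -> bool:
    """True if any answer alias appears as a substring of the document."""
    doc = document.lower()
    return any(alias.lower() in doc for alias in aliases)

def rc_filter(documents: list[str], aliases: list[str]) -> list[str]:
    """Keep only documents that contain the answer string (the RC subset);
    the unfiltered configuration would skip this step."""
    return [d for d in documents if contains_answer(d, aliases)]

docs = [
    "Attar of roses is distilled from the Damask rose.",
    "This page discusses rose gardening tips.",
]
kept = rc_filter(docs, ["Attar", "attar of roses"])
```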
The complete dataset comprises the following statistics:
| Statistic | Value |
|---|---|
| Total question-answer pairs | 95,956 |
| Unique answers | 40,478 |
| Total evidence documents | 662,659 |
| Average question length | 14 words |
| Average document length | 2,895 words |
| Average evidence documents per question | ~6 |
TriviaQA is organized into multiple splits and configurations to support different research scenarios. The two primary configurations are "RC" (reading comprehension) and "Unfiltered," each available with or without document context.
The RC configuration includes only question-document pairs where the evidence document contains the answer string. This filtered setup is the standard configuration for reading comprehension evaluation.
| Split | Number of Examples |
|---|---|
| Train | 138,384 |
| Validation | 18,669 |
| Test | 17,210 |
| Total | 174,263 |
Note that the number of examples exceeds the number of unique questions because each question is paired with multiple evidence documents.
Within the RC configuration, examples are further divided by evidence source:
Wikipedia Domain:
| Split | Questions | Documents |
|---|---|---|
| Train | 61,888 | 110,648 |
| Development | 7,993 | 14,229 |
| Test | 7,701 | 13,661 |
Web Domain:
| Split | Questions | Documents |
|---|---|---|
| Train | 76,496 | 528,979 |
| Development | 9,951 | 68,621 |
| Test | 9,509 | 65,059 |
The web domain contains significantly more documents per question than the Wikipedia domain, reflecting the redundancy of information across multiple web pages.
The unfiltered configuration includes approximately 110,000 question-answer pairs, including those whose evidence documents may not contain the answer string. This version is more appropriate for open-domain question answering and information retrieval research.
| Split | Number of Examples |
|---|---|
| Train | 87,622 |
| Validation | 11,313 |
| Test | 10,832 |
| Total | 109,767 |
To provide a cleaner evaluation signal, the researchers created human-verified subsets of the development and test data. In these verified subsets, human annotators confirmed that the evidence documents contain sufficient information to answer the question. This addresses the noise introduced by distant supervision, where automatically matched documents may contain the answer string in an irrelevant context.
| Verified Split | Wikipedia Questions | Web Questions |
|---|---|---|
| Development | 297 | 322 |
| Test | 584 | 733 |
The verified subsets contain 1,936 question-document-answer triples with documents certified to contain all necessary facts for answering the question.
Each configuration also has a "nocontext" variant that strips out the evidence documents, providing only the questions and answers. These variants are useful for evaluating closed-book question answering, where models must rely entirely on knowledge stored in their parameters rather than extracting answers from provided passages.
One of TriviaQA's distinguishing properties is the complexity and diversity of its questions. The researchers conducted a detailed analysis of question characteristics, revealing several dimensions of difficulty.
| Property | Example | Frequency / Value |
|---|---|---|
| Fine-grained answer type hint | "What fragrant essential oil is obtained from Damask Rose?" | 73.5% |
| Coarse-grained answer type hint | "Who won the Nobel Peace Prize in 2009?" | 15.5% |
| Time frame reference | "What was photographed for the first time in October 1959?" | 34% |
| Comparison question | "What is the largest type of frog?" | 9% |
| Average entities per question | "Which politician won the Nobel Peace Prize in 2009?" | 1.77 |
The high proportion of questions with fine-grained answer type hints (73.5%) indicates that most TriviaQA questions specify exactly what kind of answer is expected, pushing models to identify precise entities rather than broadly related text spans.
The research team analyzed what types of reasoning are needed to answer TriviaQA questions by examining how the answer-containing evidence sentences relate to the question. This analysis revealed that TriviaQA demands substantially more sophisticated reasoning than SQuAD.
| Reasoning Type | Wikipedia Domain | Web Domain |
|---|---|---|
| Syntactic variation | 69% | 65% |
| Lexical variation (synonyms) | 41% | 39% |
| Lexical + world knowledge | 17% | 17% |
| Multi-sentence reasoning | 40% | 35% |
| Lists/tables | N/A | 7% |
The syntactic variation rate of 69% in Wikipedia documents means that for the majority of questions, the sentence containing the answer uses a different syntactic structure than the question itself. The 40% multi-sentence reasoning requirement in the Wikipedia domain is particularly notable: over three times as many questions in TriviaQA require reasoning over multiple sentences compared to SQuAD.
The paper provides a direct comparison between TriviaQA and SQuAD across several dimensions:
| Property | TriviaQA | SQuAD |
|---|---|---|
| Large scale | Yes | Yes |
| Freeform answer | Yes | Yes |
| Well-formed questions | Yes | Yes |
| Questions independent of evidence | Yes | No |
| Varied evidence sources | Yes | No |
| Multi-sentence reasoning required | ~40% | ~13% |
The most critical distinction is that TriviaQA questions are written independently of the evidence documents, while SQuAD questions are composed by annotators who are looking at the evidence paragraph. This independence makes TriviaQA significantly more challenging because models cannot rely on lexical overlap between the question and the answer-containing passage.
TriviaQA uses two standard evaluation metrics, consistent with those used in SQuAD and other reading comprehension benchmarks:
Exact Match (EM): The percentage of predictions that exactly match one of the ground-truth answers (after normalization). A prediction receives a score of 1 if it matches any valid answer alias, and 0 otherwise.
F1 Score: The average token-level F1 score between the prediction and the best-matching ground-truth answer. This metric gives partial credit for predictions that overlap with the correct answer, even if they are not an exact match, computed over the bags of tokens in the normalized prediction and answer.
Both metrics apply standard normalization to predictions and ground-truth answers before comparison. This normalization includes lowercasing, removing articles (a, an, the), stripping punctuation, and collapsing whitespace.
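The normalization and metrics follow the SQuAD-style evaluation script; a minimal re-implementation (the official script additionally takes the maximum score over all answer aliases) looks like this:

```python
import re
import string
from collections import Counter

def normalize_answer(s: str) -> str:
    """Lowercase, strip punctuation, remove articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction: str, aliases: list[str]) -> int:
    """1 if the normalized prediction matches any normalized alias, else 0."""
    pred = normalize_answer(prediction)
    return int(any(pred == normalize_answer(a) for a in aliases))

def f1_score(prediction: str, ground_truth: str) -> float:
    """Token-level F1 between normalized prediction and ground truth."""
    pred_tokens = normalize_answer(prediction).split()
    gt_tokens = normalize_answer(ground_truth).split()
    common = Counter(pred_tokens) & Counter(gt_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gt_tokens)
    return 2 * precision * recall / (precision + recall)
```

Thanks to normalization, "The Attar of Roses" and "attar of roses" count as an exact match even though the surface strings differ.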
The evaluation protocol differs slightly between the Wikipedia and Web domains:
For the Wikipedia domain, evaluation is performed at the question level. Since the factual information needed to answer a question typically appears only once in a Wikipedia article, question-level accuracy is the natural unit of measurement.
For the Web domain, evaluation is performed at the document level. Because web documents exhibit high information redundancy (approximately six documents per question on average), the per-document accuracy captures how well models can extract answers from individual documents. The final score is computed as the average across all question-document pairs.
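The two aggregation schemes can be sketched as follows. This is a hypothetical illustration assuming per-document scores have already been computed, and treating a Wikipedia-domain question as correct when its best document yields the answer:

```python
def wiki_score(doc_scores_by_question: dict[str, list[float]]) -> float:
    """Question-level aggregation: take the best document score per question,
    then average over questions."""
    per_question = [max(scores) for scores in doc_scores_by_question.values()]
    return sum(per_question) / len(per_question)

def web_score(doc_scores_by_question: dict[str, list[float]]) -> float:
    """Document-level aggregation: average over all question-document pairs."""
    all_scores = [s for scores in doc_scores_by_question.values() for s in scores]
    return sum(all_scores) / len(all_scores)

scores = {"q1": [1.0, 0.0], "q2": [0.0, 0.0, 1.0]}
```

With the toy scores above, both questions are answered by at least one document, so the question-level score is higher than the per-document average; redundant web documents pull document-level averages down.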
The official TriviaQA leaderboard is hosted on CodaLab. Researchers submit their model predictions for the held-out test set, which is not publicly available, and receive automated evaluation scores. This setup prevents overfitting to the test data and ensures fair comparison across submissions.
The original TriviaQA paper evaluated three baseline systems to establish reference performance levels. The results demonstrated a substantial gap between machine performance and human ability.
Random Entity Baseline. This baseline selects a random named entity from the evidence document as its answer. It serves as a lower bound, showing the performance achievable by chance given the distribution of entity types in the documents.
Feature-Based Classifier. This system uses hand-engineered features including word overlap, named entity type matching, and distance-based features to score candidate answer spans. It represents a traditional, pre-neural approach to reading comprehension.
BiDAF (Bidirectional Attention Flow). BiDAF was the state-of-the-art neural reading comprehension model at the time of TriviaQA's release. It uses bidirectional attention mechanisms to model interactions between the question and the evidence passage, producing a probability distribution over possible answer spans.
Wikipedia Domain:

| Model | Dev EM | Dev F1 | Test EM | Test F1 |
|---|---|---|---|---|
| Random Entity | 12.72% | 22.91% | 12.74% | 22.35% |
| Classifier | 23.42% | 27.68% | 22.45% | 26.52% |
| BiDAF | 40.26% | 45.74% | 40.32% | 45.91% |
| Human | 79.7% | - | - | - |
Web Domain:

| Model | Dev EM | Dev F1 | Test EM | Test F1 |
|---|---|---|---|---|
| Random Entity | 12.72% | 22.91% | 12.74% | 22.35% |
| Classifier | 24.64% | 29.08% | 24.00% | 28.38% |
| BiDAF | 41.08% | 47.40% | 40.74% | 47.05% |
| Human | 75.4% | - | - | - |
Performance on the human-verified subsets was notably higher across all models, suggesting that some of the difficulty in the full dataset stems from noise in the distant supervision rather than question difficulty alone.
| Model | Wiki Dev EM (Verified) | Wiki Dev F1 (Verified) | Web Dev EM (Verified) | Web Dev F1 (Verified) |
|---|---|---|---|---|
| Random Entity | 14.81% | 23.31% | 15.41% | 25.44% |
| Classifier | 24.91% | 29.43% | 27.38% | 31.91% |
| BiDAF | 47.47% | 53.70% | 51.38% | 55.47% |
The performance gap between BiDAF and humans was approximately 40 percentage points in Exact Match, making TriviaQA one of the most challenging reading comprehension benchmarks at the time of its release. The researchers noted that BiDAF's accuracy dropped from roughly 50% for short questions (5 words or fewer) to about 32% for longer questions (20 words or more), confirming that the compositional complexity of trivia questions poses a significant challenge.
Since its release in 2017, TriviaQA has served as a proving ground for advances in natural language processing, particularly in reading comprehension and open-domain question answering. The performance gap between machines and humans has narrowed considerably with each generation of models.
The initial baselines used models like BiDAF, which achieved around 40% Exact Match on TriviaQA. Other attention-based models from this period, including Document Reader and Reinforced Mnemonic Reader, made incremental improvements but struggled with the dataset's requirement for multi-sentence reasoning and tolerance for lexical variation.
The introduction of BERT (Bidirectional Encoder Representations from Transformers) in 2018 marked a major leap in reading comprehension performance. BERT's pre-training on large text corpora gave it a much richer understanding of language than previous models. Fine-tuning BERT on TriviaQA yielded significant improvements over BiDAF-era baselines. Researchers found that data augmentation strategies, such as first fine-tuning on TriviaQA before fine-tuning on SQuAD, could boost performance on both benchmarks, demonstrating the complementary nature of these datasets.
The release of GPT-3 in 2020 introduced a new paradigm for TriviaQA evaluation: closed-book question answering, where the model must answer questions purely from knowledge stored in its parameters without access to any evidence documents. GPT-3's performance on TriviaQA in the closed-book setting was striking:
| Setting | Accuracy |
|---|---|
| Zero-shot | 64.3% |
| One-shot | 68.0% |
| Few-shot | 71.2% |
The few-shot result of 71.2% was state-of-the-art for the closed-book setting, matching or exceeding fine-tuned models that used retrieval systems with multiple BERT-scale components. This demonstrated that sufficiently large language models can internalize enormous amounts of factual knowledge during pre-training. Earlier, T5 (Text-to-Text Transfer Transformer) with 11 billion parameters had been shown to produce the exact answer text 34.5% of the time on TriviaQA in a generative closed-book setting, an early demonstration that parametric knowledge alone could answer a substantial fraction of trivia questions.
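Few-shot closed-book evaluation of the kind GPT-3 popularized amounts to prepending solved question-answer pairs to the test question. A minimal sketch of the prompt construction (the Q/A template is illustrative, not OpenAI's exact format):

```python
def build_few_shot_prompt(examples: list[tuple[str, str]], question: str) -> str:
    """Concatenate k solved Q/A demonstrations, then the unanswered test question."""
    lines = []
    for q, a in examples:
        lines.append(f"Q: {q}")
        lines.append(f"A: {a}")
    lines.append(f"Q: {question}")
    lines.append("A:")  # the model completes from here
    return "\n".join(lines)

demos = [
    ("What fragrant essential oil is obtained from Damask Rose?", "Attar"),
    ("Who won the Nobel Peace Prize in 2009?", "Barack Obama"),
]
prompt = build_few_shot_prompt(demos, "What is the largest type of frog?")
```

The zero-shot setting corresponds to an empty demonstration list, and one-shot to a single pair; the model's completion is then scored with the alias-aware EM metric.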
More recent large language models continue to push TriviaQA scores higher. Performance data from model evaluations shows that modern LLMs achieve strong results on TriviaQA, often evaluated as a general knowledge benchmark:
| Model | TriviaQA Score |
|---|---|
| Kimi K2 Base (Moonshot AI) | 85.1% |
| Gemma 2 27B (Google) | 83.7% |
| Mistral Small 3.1 24B (Mistral AI) | 80.5% |
| Mistral Small 3 24B (Mistral AI) | 80.3% |
| Granite 3.3 8B (IBM) | 78.2% |
| Gemma 2 9B (Google) | 76.6% |
| Mistral Large 3 (Mistral AI) | 74.9% |
| Mistral NeMo Instruct (Mistral AI) | 73.8% |
These results show that TriviaQA remains a discriminating benchmark even for state-of-the-art models, with the scores above ranging from roughly 74% to 85% depending on model size and training. The benchmark continues to differentiate between models of varying capability levels.
One of TriviaQA's key contributions to the field is its approach to distant supervision. Traditional reading comprehension datasets rely on direct supervision, where human annotators mark the exact answer span in a specific passage. TriviaQA instead uses distant supervision, automatically pairing questions with evidence documents that contain the answer string.
The distant supervision pipeline works as follows: question-answer pairs are collected independently of any document; candidate evidence is gathered automatically (Wikipedia articles via entity linking, web pages via search); and a document is paired with a question whenever it contains the answer string or one of its aliases, with no human marking of answer spans.
This approach has several advantages. It allows the dataset to scale to hundreds of thousands of examples without expensive human annotation of answer spans. It also produces more naturalistic training data, since the relationship between questions and documents mirrors real-world information-seeking scenarios where a user's question was not composed while looking at the answer source.
The distant supervision approach introduces some noise. A document may contain the answer string in an irrelevant context (for example, the answer "Paris" might appear in a document discussing a person named Paris rather than the city of Paris). The verified evaluation subsets were created specifically to measure the impact of this noise. Performance on verified subsets is consistently higher than on the full dataset, confirming that distant supervision noise accounts for a portion of the difficulty. However, even on verified subsets, models fall well short of human performance, demonstrating that the inherent complexity of TriviaQA questions is the primary challenge.
TriviaQA is one of the standard benchmarks used to evaluate reading comprehension and question answering systems. It is included in major evaluation frameworks like EleutherAI's Language Model Evaluation Harness, which standardizes the evaluation of large language models across dozens of benchmarks. Most major LLM releases from organizations like OpenAI, Google, Anthropic, and Mistral include TriviaQA scores in their technical reports.
The unfiltered configuration of TriviaQA has been particularly influential in open-domain QA research, where systems must both retrieve relevant documents and extract answers. The dataset's combination of naturally complex questions and diverse evidence sources makes it an ideal testbed for retrieval-augmented generation (RAG) systems and end-to-end QA pipelines.
With the rise of large language models, TriviaQA has found a new role as a probe of parametric knowledge. The no-context (closed-book) variant tests whether models have internalized factual knowledge during pre-training. This application has become increasingly important as researchers seek to understand what LLMs know, how reliably they can recall facts, and whether they confabulate plausible-sounding but incorrect answers.
Because TriviaQA has been publicly available since 2017, researchers have raised concerns about data contamination, where LLMs may have encountered TriviaQA questions during pre-training. Studies like RepLiQA (2024) found that models perform significantly better on TriviaQA than on freshly created QA benchmarks with similar difficulty levels, suggesting that data contamination may inflate closed-book performance scores. This makes TriviaQA particularly useful as a case study in understanding the difference between genuine comprehension and memorization.
TriviaQA's design principles have influenced several subsequent QA datasets:
Natural Questions (Kwiatkowski et al., 2019) adopted TriviaQA's approach of using naturally occurring questions (from Google Search) rather than crowdsourced questions. However, Natural Questions pairs each question with a single Wikipedia page rather than multiple evidence documents.
HotpotQA (Yang et al., 2018) extended the multi-sentence reasoning requirement that TriviaQA highlighted. HotpotQA explicitly requires reasoning across two Wikipedia paragraphs, with annotated supporting facts.
SearchQA (Dunn et al., 2017) took a similar approach to evidence collection, using web search results as evidence documents for Jeopardy! questions. However, SearchQA's questions are less naturalistic than TriviaQA's trivia questions.
MMLU and other multi-task benchmarks include question answering components that draw on the same principles of testing broad factual knowledge across diverse domains that TriviaQA pioneered at scale.
Each TriviaQA example contains the following fields (names as in the Hugging Face release): a `question` string, a unique `question_id`, the `question_source` website, an `answer` record (the canonical `value` plus `aliases` and `normalized_aliases`), and the evidence documents (`entity_pages` for Wikipedia articles and `search_results` for web pages, left empty in the no-context variants).
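A record can be navigated as a nested structure. The sketch below uses a mock example with the field names from the Hugging Face release; all values are illustrative, and real records carry full document text in the evidence fields:

```python
# Mock TriviaQA record mimicking the Hugging Face `mandarjoshi/trivia_qa` schema.
example = {
    "question": "What fragrant essential oil is obtained from Damask Rose?",
    "question_id": "tc_0",          # illustrative id
    "question_source": "example-quiz-site",  # illustrative source
    "answer": {
        "value": "Attar",
        "aliases": ["Attar", "Attar of Roses"],
        "normalized_aliases": ["attar", "attar of roses"],
    },
    # Evidence documents; empty lists in the no-context variants.
    "entity_pages": {"title": ["Rose oil"], "wiki_context": ["..."]},
    "search_results": {"title": [], "search_context": []},
}

# Alias-aware scoring checks a normalized prediction against the alias set.
gold = set(example["answer"]["normalized_aliases"])
is_correct = "attar of roses" in gold
```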
TriviaQA is freely available for research use. The dataset can be accessed through several channels:
- Hugging Face Datasets: `mandarjoshi/trivia_qa`, with multiple configurations
- TensorFlow Datasets: `trivia_qa`, with `rc`, `unfiltered`, and no-context variants

The dataset files are substantial in size. The RC configuration with full document context requires approximately 18.7 GB of disk space, while the unfiltered version requires about 32.5 GB. The no-context variants are much smaller, with the unfiltered no-context version requiring only about 707 MB.
| Configuration | Download Size | Generated Size | Total Disk |
|---|---|---|---|
| RC (with context) | 2.67 GB | 16.02 GB | 18.68 GB |
| RC (no context) | 2.67 GB | 126 MB | 2.79 GB |
| Unfiltered (with context) | 3.30 GB | 29.24 GB | 32.54 GB |
| Unfiltered (no context) | 633 MB | 75 MB | 707 MB |
While TriviaQA has been tremendously influential, several limitations have been identified over the years:
Domain bias. Trivia questions tend to focus on topics that are popular in Western pub quiz culture, including history, geography, entertainment, sports, and science. This means the dataset may not adequately test understanding of topics outside these domains.
Answer type concentration. With 92.85% of answers being Wikipedia entity titles, the dataset is heavily skewed toward entity-centric questions. Questions requiring numerical reasoning, multi-word descriptive answers, or yes/no judgments are underrepresented.
Distant supervision noise. Despite the verified subsets, the majority of the training data relies on distant supervision, which introduces noise. Documents may contain the answer string in misleading contexts, and some question-document pairings may not actually support answering the question.
Data contamination. As one of the oldest and most widely distributed QA benchmarks, TriviaQA is at high risk of appearing in the training data of modern LLMs. This makes it difficult to determine whether strong closed-book performance reflects genuine knowledge retrieval or memorization of specific question-answer pairs encountered during pre-training.
Static nature. The dataset was created in 2017 and has not been updated with new questions or evidence documents. Questions about events, records, or figures that have changed since 2017 may now have different correct answers.
TriviaQA was created by researchers at the Paul G. Allen School of Computer Science and Engineering at the University of Washington: Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer.