TriviaQA is a large-scale reading comprehension and question answering dataset containing over 650,000 question-answer-evidence triples. Introduced in 2017 by Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer at the University of Washington, TriviaQA was designed to address limitations in existing reading comprehension benchmarks by collecting questions independently from evidence documents and requiring complex, multi-sentence reasoning to find answers. The dataset pairs 95,956 question-answer pairs authored by trivia enthusiasts with evidence gathered from both Wikipedia articles and web search results, providing an average of six evidence documents per question. TriviaQA has become one of the most widely used benchmarks in natural language processing, with the original paper accumulating thousands of citations since its publication at the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017) in Vancouver, Canada.
Before TriviaQA, the dominant reading comprehension benchmarks suffered from a fundamental design flaw: questions were typically written by people who had already read the evidence passage. In datasets like SQuAD (Stanford Question Answering Dataset), annotators viewed a Wikipedia paragraph and then composed questions about it. While this approach produced high-quality question-answer pairs, it also introduced biases. Questions tended to closely mirror the vocabulary and sentence structure of the source passage, making it possible for models to find answers through simple pattern matching or word overlap rather than genuine comprehension.
The researchers behind TriviaQA recognized that this coupling between question creation and evidence selection limited how well benchmarks could test true reading comprehension ability. They proposed a different approach: collect questions that were written completely independently of any evidence document, then retrospectively gather supporting documents from Wikipedia and the web. This decoupling meant that the syntactic and lexical overlap between a question and its answer-containing sentence would be naturally low, forcing models to perform genuine reasoning rather than surface-level matching.
The choice to source questions from trivia enthusiasts was deliberate. Trivia questions are inherently complex, often compositional, and cover a wide range of topics. Unlike questions generated by crowdworkers for a specific NLP task, trivia questions reflect genuine human curiosity and knowledge-testing patterns. They frequently require understanding of time frames, comparisons, fine-grained categories, and multi-step reasoning, all properties that make them effective probes of machine reading ability.
TriviaQA's questions were gathered from 14 trivia and quiz-league websites. The research team scraped question-answer pairs from these sites, which host questions written by trivia enthusiasts for pub quizzes, quiz leagues, and online trivia competitions. Questions with fewer than four tokens were removed, since short questions tended to be either trivially simple or too vague to answer reliably.
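The length filter described above can be sketched as a simple token-count check (a minimal illustration under the stated four-token threshold, not the authors' actual preprocessing code):

```python
def keep_question(question: str, min_tokens: int = 4) -> bool:
    """Retain only questions with at least `min_tokens` whitespace-separated tokens."""
    return len(question.split()) >= min_tokens

questions = [
    "Capital of France?",  # 3 tokens: too short, dropped
    "What fragrant essential oil is obtained from Damask Rose?",  # kept
]
filtered = [q for q in questions if keep_question(q)]
```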
The resulting collection contains 95,956 question-answer pairs covering a wide range of topics. Each question averages 14 tokens in length, reflecting the compositional nature of trivia questions. For instance, a typical question might read: "What fragrant essential oil is obtained from Damask Rose?" This level of specificity and compositionality is significantly more challenging than the simpler factoid questions found in many other QA datasets.
Answers in TriviaQA use a rich alias system. Since the same entity or concept can be referred to in multiple ways, the dataset includes normalized aliases for each answer. For questions whose answers correspond to Wikipedia entities, the dataset leverages Wikipedia redirect pages and disambiguation pages to compile comprehensive alias lists. This design ensures that models receive credit for producing any valid form of the correct answer.
The distribution of answer types breaks down as follows:
| Answer Type | Percentage |
|---|---|
| Wikipedia entity title | 92.85% |
| Numerical answer | 4.17% |
| Free-text answer | 2.98% |
Among the Wikipedia entity answers, the named entity categories are distributed across several types:
| Named Entity Category | Percentage of Wikipedia Entity Answers |
|---|---|
| Person | 32% |
| Location | 23% |
| Miscellaneous | 40% |
| Organization | 5% |
The heavy concentration of Wikipedia entity answers (92.85%) reflects the nature of trivia questions, which predominantly ask about notable people, places, events, and things that have their own Wikipedia pages.
A distinctive feature of TriviaQA is its use of two independent sources of evidence documents: Wikipedia articles and web search results. For each question, evidence was gathered through two complementary methods.
Wikipedia Evidence. The researchers applied TagMe, an off-the-shelf entity linking tool, to identify entities mentioned in each question. The corresponding Wikipedia articles for these entities were then collected as evidence documents. This approach captures the structured, encyclopedic knowledge most relevant to answering trivia questions.
Web Evidence. Each question was submitted as a search query to the Bing Web Search API. The top 50 search result URLs were collected, and the researchers crawled the top 10 web pages for each question. These web documents provide more diverse evidence, including blog posts, news articles, reference sites, and other sources that might contain the answer in different contexts and phrasings.
After gathering candidate evidence documents, the researchers filtered them to create the "reading comprehension" (RC) subset. In this filtered version, only documents that actually contain the answer string are retained. This filtering step ensures that every evidence document in the RC split provides at least one passage where the answer appears, creating a cleaner signal for training and evaluation. The unfiltered version retains all documents, including those that do not contain the answer, making it more suitable for information-retrieval-style question answering.
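The RC filtering step amounts to a substring check against the answer and its aliases. A simplified sketch (the actual pipeline also normalizes strings before matching):

```python
def contains_answer(document: str, aliases: list[str]) -> bool:
    """True if any answer alias appears as a substring of the document."""
    doc = document.lower()
    return any(alias.lower() in doc for alias in aliases)

def rc_filter(documents: list[str], aliases: list[str]) -> list[str]:
    """Keep only documents that contain the answer string (the RC subset);
    the unfiltered configuration would skip this step."""
    return [d for d in documents if contains_answer(d, aliases)]

docs = [
    "Attar of roses is distilled from the Damask rose.",
    "This page discusses rose gardening tips.",
]
kept = rc_filter(docs, ["Attar", "attar of roses"])
```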
The complete dataset comprises the following statistics:
| Statistic | Value |
|---|---|
| Total question-answer pairs | 95,956 |
| Unique answers | 40,478 |
| Total evidence documents | 662,659 |
| Average question length | 14 words |
| Average document length | 2,895 words |
| Average evidence documents per question | ~6 |
TriviaQA is organized into multiple splits and configurations to support different research scenarios. The two primary configurations are "RC" (reading comprehension) and "Unfiltered," each available with or without document context.
The RC configuration includes only question-document pairs where the evidence document contains the answer string. This filtered setup is the standard configuration for reading comprehension evaluation.
| Split | Number of Examples |
|---|---|
| Train | 138,384 |
| Validation | 18,669 |
| Test | 17,210 |
| Total | 174,263 |
Note that the number of examples exceeds the number of unique questions because each question is paired with multiple evidence documents.
Within the RC configuration, examples are further divided by evidence source:
Wikipedia Domain:
| Split | Questions | Documents |
|---|---|---|
| Train | 61,888 | 110,648 |
| Development | 7,993 | 14,229 |
| Test | 7,701 | 13,661 |
Web Domain:
| Split | Questions | Documents |
|---|---|---|
| Train | 76,496 | 528,979 |
| Development | 9,951 | 68,621 |
| Test | 9,509 | 65,059 |
The web domain contains significantly more documents per question than the Wikipedia domain, reflecting the redundancy of information across multiple web pages.
The unfiltered configuration includes approximately 110,000 question-answer pairs, including those whose evidence documents may not contain the answer string. This version is more appropriate for open-domain question answering and information retrieval research.
| Split | Number of Examples |
|---|---|
| Train | 87,622 |
| Validation | 11,313 |
| Test | 10,832 |
| Total | 109,767 |
To provide a cleaner evaluation signal, the researchers created human-verified subsets of the development and test data. In these verified subsets, human annotators confirmed that the evidence documents contain sufficient information to answer the question. This addresses the noise introduced by distant supervision, where automatically matched documents may contain the answer string in an irrelevant context.
| Verified Split | Wikipedia Questions | Web Questions |
|---|---|---|
| Development | 297 | 322 |
| Test | 584 | 733 |
The verified subsets contain 1,936 question-document-answer triples with documents certified to contain all necessary facts for answering the question.
Each configuration also has a "nocontext" variant that strips out the evidence documents, providing only the questions and answers. These variants are useful for evaluating closed-book question answering, where models must rely entirely on knowledge stored in their parameters rather than extracting answers from provided passages.
One of TriviaQA's distinguishing properties is the complexity and diversity of its questions. The researchers conducted a detailed analysis of question characteristics, revealing several dimensions of difficulty.
| Property | Example | Frequency / Value |
|---|---|---|
| Fine-grained answer type hint | "What fragrant essential oil is obtained from Damask Rose?" | 73.5% |
| Coarse-grained answer type hint | "Who won the Nobel Peace Prize in 2009?" | 15.5% |
| Time frame reference | "What was photographed for the first time in October 1959?" | 34% |
| Comparison question | "What is the largest type of frog?" | 9% |
| Average entities per question | "Which politician won the Nobel Peace Prize in 2009?" | 1.77 |
The high proportion of questions with fine-grained answer type hints (73.5%) indicates that most TriviaQA questions specify exactly what kind of answer is expected, pushing models to identify precise entities rather than broadly related text spans.
The research team analyzed what types of reasoning are needed to answer TriviaQA questions by examining how the answer-containing evidence sentences relate to the question. This analysis revealed that TriviaQA demands substantially more sophisticated reasoning than SQuAD.
| Reasoning Type | Wikipedia Domain | Web Domain |
|---|---|---|
| Syntactic variation | 69% | 65% |
| Lexical variation (synonyms) | 41% | 39% |
| Lexical + world knowledge | 17% | 17% |
| Multi-sentence reasoning | 40% | 35% |
| Lists/tables | N/A | 7% |
The syntactic variation rate of 69% in Wikipedia documents means that for the majority of questions, the sentence containing the answer uses a different syntactic structure than the question itself. The 40% multi-sentence reasoning requirement in the Wikipedia domain is particularly notable: over three times as many questions in TriviaQA require reasoning over multiple sentences compared to SQuAD.
The paper provides a direct comparison between TriviaQA and SQuAD across several dimensions:
| Property | TriviaQA | SQuAD |
|---|---|---|
| Large scale | Yes | Yes |
| Freeform answer | Yes | Yes |
| Well-formed questions | Yes | Yes |
| Questions independent of evidence | Yes | No |
| Varied evidence sources | Yes | No |
| Multi-sentence reasoning required | ~40% | ~13% |
The most critical distinction is that TriviaQA questions are written independently of the evidence documents, while SQuAD questions are composed by annotators who are looking at the evidence paragraph. This independence makes TriviaQA significantly more challenging because models cannot rely on lexical overlap between the question and the answer-containing passage.
TriviaQA uses two standard evaluation metrics, consistent with those used in SQuAD and other reading comprehension benchmarks:
Exact Match (EM): The percentage of predictions that exactly match one of the ground-truth answers (after normalization). A prediction receives a score of 1 if it matches any valid answer alias, and 0 otherwise.
F1 Score: The average token-level F1 score between the prediction and the best-matching ground-truth answer. This metric gives partial credit for predictions that overlap with the correct answer, even if they are not an exact match, computed over the bags of tokens in the normalized prediction and answer.
Both metrics apply standard normalization to predictions and ground-truth answers before comparison. This normalization includes lowercasing, removing articles (a, an, the), stripping punctuation, and collapsing whitespace.
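The normalization and metrics follow the SQuAD-style evaluation script; a minimal re-implementation (the official script additionally takes the maximum score over all answer aliases) looks like this:

```python
import re
import string
from collections import Counter

def normalize_answer(s: str) -> str:
    """Lowercase, strip punctuation, remove articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction: str, aliases: list[str]) -> int:
    """1 if the normalized prediction matches any normalized alias, else 0."""
    pred = normalize_answer(prediction)
    return int(any(pred == normalize_answer(a) for a in aliases))

def f1_score(prediction: str, ground_truth: str) -> float:
    """Token-level F1 between normalized prediction and ground truth."""
    pred_tokens = normalize_answer(prediction).split()
    gt_tokens = normalize_answer(ground_truth).split()
    common = Counter(pred_tokens) & Counter(gt_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gt_tokens)
    return 2 * precision * recall / (precision + recall)
```

Thanks to normalization, "The Attar of Roses" and "attar of roses" count as an exact match even though the surface strings differ.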
The evaluation protocol differs slightly between the Wikipedia and Web domains:
For the Wikipedia domain, evaluation is performed at the question level. Since the factual information needed to answer a question typically appears only once in a Wikipedia article, question-level accuracy is the natural unit of measurement.
For the Web domain, evaluation is performed at the document level. Because web documents exhibit high information redundancy (approximately six documents per question on average), the per-document accuracy captures how well models can extract answers from individual documents. The final score is computed as the average across all question-document pairs.
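The two aggregation schemes can be sketched as follows. This is a hypothetical illustration assuming per-document scores have already been computed, and treating a Wikipedia-domain question as correct when its best document yields the answer:

```python
def wiki_score(doc_scores_by_question: dict[str, list[float]]) -> float:
    """Question-level aggregation: take the best document score per question,
    then average over questions."""
    per_question = [max(scores) for scores in doc_scores_by_question.values()]
    return sum(per_question) / len(per_question)

def web_score(doc_scores_by_question: dict[str, list[float]]) -> float:
    """Document-level aggregation: average over all question-document pairs."""
    all_scores = [s for scores in doc_scores_by_question.values() for s in scores]
    return sum(all_scores) / len(all_scores)

scores = {"q1": [1.0, 0.0], "q2": [0.0, 0.0, 1.0]}
```

With the toy scores above, both questions are answered by at least one document, so the question-level score is higher than the per-document average; redundant web documents pull document-level averages down.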
The official TriviaQA leaderboard is hosted on CodaLab. Researchers submit their model predictions for the held-out test set, which is not publicly available, and receive automated evaluation scores. This setup prevents overfitting to the test data and ensures fair comparison across submissions.
The original TriviaQA paper evaluated three baseline systems to establish reference performance levels. The results demonstrated a substantial gap between machine performance and human ability.
Random Entity Baseline. This baseline selects a random named entity from the evidence document as its answer. It serves as a lower bound, showing the performance achievable by chance given the distribution of entity types in the documents.
Feature-Based Classifier. This system uses hand-engineered features including word overlap, named entity type matching, and distance-based features to score candidate answer spans. It represents a traditional, pre-neural approach to reading comprehension.
BiDAF (Bidirectional Attention Flow). BiDAF was the state-of-the-art neural reading comprehension model at the time of TriviaQA's release. It uses bidirectional attention mechanisms to model interactions between the question and the evidence passage, producing a probability distribution over possible answer spans.
Wikipedia Domain:

| Model | Dev EM | Dev F1 | Test EM | Test F1 |
|---|---|---|---|---|
| Random Entity | 12.72% | 22.91% | 12.74% | 22.35% |
| Classifier | 23.42% | 27.68% | 22.45% | 26.52% |
| BiDAF | 40.26% | 45.74% | 40.32% | 45.91% |
| Human | 79.7% | - | - | - |
Web Domain:

| Model | Dev EM | Dev F1 | Test EM | Test F1 |
|---|---|---|---|---|
| Random Entity | 12.72% | 22.91% | 12.74% | 22.35% |
| Classifier | 24.64% | 29.08% | 24.00% | 28.38% |
| BiDAF | 41.08% | 47.40% | 40.74% | 47.05% |
| Human | 75.4% | - | - | - |
Performance on the human-verified subsets was notably higher across all models, suggesting that some of the difficulty in the full dataset stems from noise in the distant supervision rather than question difficulty alone.
| Model | Wiki Dev EM (Verified) | Wiki Dev F1 (Verified) | Web Dev EM (Verified) | Web Dev F1 (Verified) |
|---|---|---|---|---|
| Random Entity | 14.81% | 23.31% | 15.41% | 25.44% |
| Classifier | 24.91% | 29.43% | 27.38% | 31.91% |
| BiDAF | 47.47% | 53.70% | 51.38% | 55.47% |
The performance gap between BiDAF and humans was approximately 40 percentage points in Exact Match, making TriviaQA one of the most challenging reading comprehension benchmarks at the time of its release. The researchers noted that BiDAF's accuracy dropped from roughly 50% for short questions (5 words or fewer) to about 32% for longer questions (20 words or more), confirming that the compositional complexity of trivia questions poses a significant challenge.
Since its release in 2017, TriviaQA has served as a proving ground for advances in natural language processing, particularly in reading comprehension and open-domain question answering. The performance gap between machines and humans has narrowed considerably with each generation of models.
The initial baselines used models like BiDAF, which achieved around 40% Exact Match on TriviaQA. Other attention-based models from this period, including Document Reader and Reinforced Mnemonic Reader, made incremental improvements but struggled with the dataset's requirement for multi-sentence reasoning and tolerance for lexical variation.
The introduction of BERT (Bidirectional Encoder Representations from Transformers) in 2018 marked a major leap in reading comprehension performance. BERT's pre-training on large text corpora gave it a much richer understanding of language than previous models. Fine-tuning BERT on TriviaQA yielded significant improvements over BiDAF-era baselines. Researchers found that data augmentation strategies, such as first fine-tuning on TriviaQA before fine-tuning on SQuAD, could boost performance on both benchmarks, demonstrating the complementary nature of these datasets.
The release of GPT-3 in 2020 introduced a new paradigm for TriviaQA evaluation: closed-book question answering, where the model must answer questions purely from knowledge stored in its parameters without access to any evidence documents. GPT-3's performance on TriviaQA in the closed-book setting was striking:
| Setting | Accuracy |
|---|---|
| Zero-shot | 64.3% |
| One-shot | 68.0% |
| Few-shot | 71.2% |
The few-shot result of 71.2% was state-of-the-art for the closed-book setting, matching or exceeding fine-tuned models that used retrieval systems with multiple BERT-scale components. This demonstrated that sufficiently large language models can internalize enormous amounts of factual knowledge during pre-training. Earlier, T5 (Text-to-Text Transfer Transformer) with 11 billion parameters had been shown to produce the exact answer text 34.5% of the time on TriviaQA in a generative closed-book setting, an early demonstration that parametric knowledge alone could answer a substantial fraction of trivia questions.
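Few-shot closed-book evaluation of the kind GPT-3 popularized amounts to prepending solved question-answer pairs to the test question. A minimal sketch of the prompt construction (the Q/A template is illustrative, not OpenAI's exact format):

```python
def build_few_shot_prompt(examples: list[tuple[str, str]], question: str) -> str:
    """Concatenate k solved Q/A demonstrations, then the unanswered test question."""
    lines = []
    for q, a in examples:
        lines.append(f"Q: {q}")
        lines.append(f"A: {a}")
    lines.append(f"Q: {question}")
    lines.append("A:")  # the model completes from here
    return "\n".join(lines)

demos = [
    ("What fragrant essential oil is obtained from Damask Rose?", "Attar"),
    ("Who won the Nobel Peace Prize in 2009?", "Barack Obama"),
]
prompt = build_few_shot_prompt(demos, "What is the largest type of frog?")
```

The zero-shot setting corresponds to an empty demonstration list, and one-shot to a single pair; the model's completion is then scored with the alias-aware EM metric.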
More recent large language models continue to push TriviaQA scores higher. Performance data from model evaluations shows that modern LLMs achieve strong results on TriviaQA, often evaluated as a general knowledge benchmark:
| Model | TriviaQA Score |
|---|---|
| Kimi K2 Base (Moonshot AI) | 85.1% |
| Gemma 2 27B (Google) | 83.7% |
| Mistral Small 3.1 24B (Mistral AI) | 80.5% |
| Mistral Small 3 24B (Mistral AI) | 80.3% |
| Granite 3.3 8B (IBM) | 78.2% |
| Gemma 2 9B (Google) | 76.6% |
| Mistral Large 3 (Mistral AI) | 74.9% |
| Mistral NeMo Instruct (Mistral AI) | 73.8% |
These results show that TriviaQA remains a discriminating benchmark even for state-of-the-art models, with the scores above ranging from roughly 74% to 85% depending on model size and training. The benchmark continues to differentiate between models of varying capability levels.
One of TriviaQA's key contributions to the field is its approach to distant supervision. Traditional reading comprehension datasets rely on direct supervision, where human annotators mark the exact answer span in a specific passage. TriviaQA instead uses distant supervision, automatically pairing questions with evidence documents that contain the answer string.
The distant supervision pipeline works as follows: question-answer pairs are collected independently of any document; candidate evidence is gathered automatically (Wikipedia articles via entity linking, web pages via search); and a document is paired with a question whenever it contains the answer string or one of its aliases, with no human marking of answer spans.
This approach has several advantages. It allows the dataset to scale to hundreds of thousands of examples without expensive human annotation of answer spans. It also produces more naturalistic training data, since the relationship between questions and documents mirrors real-world information-seeking scenarios where a user's question was not composed while looking at the answer source.
The distant supervision approach introduces some noise. A document may contain the answer string in an irrelevant context (for example, the answer "Paris" might appear in a document discussing a person named Paris rather than the city of Paris). The verified evaluation subsets were created specifically to measure the impact of this noise. Performance on verified subsets is consistently higher than on the full dataset, confirming that distant supervision noise accounts for a portion of the difficulty. However, even on verified subsets, models fall well short of human performance, demonstrating that the inherent complexity of TriviaQA questions is the primary challenge.
TriviaQA is one of the standard benchmarks used to evaluate reading comprehension and question answering systems. It is included in major evaluation frameworks like EleutherAI's Language Model Evaluation Harness, which standardizes the evaluation of large language models across dozens of benchmarks. Most major LLM releases from organizations like OpenAI, Google, Anthropic, and Mistral include TriviaQA scores in their technical reports.
The unfiltered configuration of TriviaQA has been particularly influential in open-domain QA research, where systems must both retrieve relevant documents and extract answers. The dataset's combination of naturally complex questions and diverse evidence sources makes it an ideal testbed for retrieval-augmented generation (RAG) systems and end-to-end QA pipelines.
With the rise of large language models, TriviaQA has found a new role as a probe of parametric knowledge. The no-context (closed-book) variant tests whether models have internalized factual knowledge during pre-training. This application has become increasingly important as researchers seek to understand what LLMs know, how reliably they can recall facts, and whether they confabulate plausible-sounding but incorrect answers.
Because TriviaQA has been publicly available since 2017, researchers have raised concerns about data contamination, where LLMs may have encountered TriviaQA questions during pre-training. Studies like RepLiQA (2024) found that models perform significantly better on TriviaQA than on freshly created QA benchmarks with similar difficulty levels, suggesting that data contamination may inflate closed-book performance scores. This makes TriviaQA particularly useful as a case study in understanding the difference between genuine comprehension and memorization.
TriviaQA's design principles have influenced several subsequent QA datasets:
Natural Questions (Kwiatkowski et al., 2019) adopted TriviaQA's approach of using naturally occurring questions (from Google Search) rather than crowdsourced questions. However, Natural Questions pairs each question with a single Wikipedia page rather than multiple evidence documents.
HotpotQA (Yang et al., 2018) extended the multi-sentence reasoning requirement that TriviaQA highlighted. HotpotQA explicitly requires reasoning across two Wikipedia paragraphs, with annotated supporting facts.
SearchQA (Dunn et al., 2017) took a similar approach to evidence collection, using web search results as evidence documents for Jeopardy! questions. However, SearchQA's questions are less naturalistic than TriviaQA's trivia questions.
MMLU and other multi-task benchmarks include question answering components that draw on the same principles of testing broad factual knowledge across diverse domains that TriviaQA pioneered at scale.
Each TriviaQA example contains the following fields (names as in the Hugging Face release): a `question` string, a unique `question_id`, the `question_source` website, an `answer` record (the canonical `value` plus `aliases` and `normalized_aliases`), and the evidence documents (`entity_pages` for Wikipedia articles and `search_results` for web pages, left empty in the no-context variants).
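A record can be navigated as a nested structure. The sketch below uses a mock example with the field names from the Hugging Face release; all values are illustrative, and real records carry full document text in the evidence fields:

```python
# Mock TriviaQA record mimicking the Hugging Face `mandarjoshi/trivia_qa` schema.
example = {
    "question": "What fragrant essential oil is obtained from Damask Rose?",
    "question_id": "tc_0",          # illustrative id
    "question_source": "example-quiz-site",  # illustrative source
    "answer": {
        "value": "Attar",
        "aliases": ["Attar", "Attar of Roses"],
        "normalized_aliases": ["attar", "attar of roses"],
    },
    # Evidence documents; empty lists in the no-context variants.
    "entity_pages": {"title": ["Rose oil"], "wiki_context": ["..."]},
    "search_results": {"title": [], "search_context": []},
}

# Alias-aware scoring checks a normalized prediction against the alias set.
gold = set(example["answer"]["normalized_aliases"])
is_correct = "attar of roses" in gold
```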
TriviaQA is freely available for research use. The dataset can be accessed through several channels:
- Hugging Face Datasets: `mandarjoshi/trivia_qa`, with multiple configurations
- TensorFlow Datasets: `trivia_qa`, with `rc`, `unfiltered`, and no-context variants

The dataset files are substantial in size. The RC configuration with full document context requires approximately 18.7 GB of disk space, while the unfiltered version requires about 32.5 GB. The no-context variants are much smaller, with the unfiltered no-context version requiring only about 707 MB.
| Configuration | Download Size | Generated Size | Total Disk |
|---|---|---|---|
| RC (with context) | 2.67 GB | 16.02 GB | 18.68 GB |
| RC (no context) | 2.67 GB | 126 MB | 2.79 GB |
| Unfiltered (with context) | 3.30 GB | 29.24 GB | 32.54 GB |
| Unfiltered (no context) | 633 MB | 75 MB | 707 MB |
While TriviaQA has been tremendously influential, several limitations have been identified over the years:
Domain bias. Trivia questions tend to focus on topics that are popular in Western pub quiz culture, including history, geography, entertainment, sports, and science. This means the dataset may not adequately test understanding of topics outside these domains.
Answer type concentration. With 92.85% of answers being Wikipedia entity titles, the dataset is heavily skewed toward entity-centric questions. Questions requiring numerical reasoning, multi-word descriptive answers, or yes/no judgments are underrepresented.
Distant supervision noise. Despite the verified subsets, the majority of the training data relies on distant supervision, which introduces noise. Documents may contain the answer string in misleading contexts, and some question-document pairings may not actually support answering the question.
Data contamination. As one of the oldest and most widely distributed QA benchmarks, TriviaQA is at high risk of appearing in the training data of modern LLMs. This makes it difficult to determine whether strong closed-book performance reflects genuine knowledge retrieval or memorization of specific question-answer pairs encountered during pre-training.
Static nature. The dataset was created in 2017 and has not been updated with new questions or evidence documents. Questions about events, records, or figures that have changed since 2017 may now have different correct answers.
TriviaQA was created by researchers at the Paul G. Allen School of Computer Science and Engineering at the University of Washington: Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer.