FRAMES (benchmark)
Last reviewed
May 31, 2026
Sources
9 citations
Review status
Source-backed
Revision
v1 · 1,891 words
FRAMES is an evaluation dataset for retrieval-augmented generation that tests factual accuracy, retrieval, and reasoning together rather than one at a time. The name stands for Factuality, Retrieval, And reasoning MEasurement Set. It was introduced in 2024 by Satyapriya Krishna and colleagues from Google DeepMind and Harvard University in the paper "Fact, Fetch, and Reason: A Unified Evaluation of Retrieval-Augmented Generation" (arXiv:2409.12941) [1]. The dataset contains 824 multi-hop questions, each of which requires synthesizing information drawn from 2 to 15 Wikipedia articles, and it is distributed publicly on Hugging Face as google/frames-benchmark under an Apache 2.0 license [2].
The central idea is that a good answer in a RAG setting depends on three things happening correctly at once: the system has to fetch the right documents, it has to reason across them, and it has to state something true at the end. FRAMES scores all of that end to end, which makes it useful for measuring whether retrieval and reasoning components actually cooperate in practice.
Most benchmarks for large language models probe these abilities separately. Retrieval is often measured with information-retrieval metrics like recall@k on a fixed corpus. Factuality is checked with closed-book question answering or hallucination probes. Reasoning gets its own suites of math and logic problems. Each of those tells you something, but none of them tells you whether a full pipeline can take a hard question, find the evidence, chain several facts together, and produce a correct response.
That gap matters because the failure modes interact. A model can retrieve perfect documents and still botch the arithmetic that ties them together. It can reason flawlessly over the wrong passages and confidently return a false answer. Earlier multi-hop datasets such as HotpotQA pushed on the reasoning side, but the questions are often answerable by pattern matching over two paragraphs, and supporting-fact supervision can let systems shortcut the actual reasoning. The FRAMES authors set out to build something harder and more unified: questions where you cannot get the right answer without genuinely retrieving from several sources and then doing real work on what you found [1].
The authors first tried to generate questions synthetically by prompting an LLM, but they measured a hallucination rate above 30% in the generated items, so the output would have needed heavy manual cleanup before it could be used [1]. They switched to human annotation, using the synthetic-generation instructions as a guide for the annotators rather than as a source of finished questions.
Human experts wrote questions that deliberately span multiple Wikipedia articles and that need several reasoning steps to resolve. Each question is paired with a gold answer and the list of Wikipedia pages required to reach it. Roughly 36% of the questions need two articles and about 35% need three, with the remainder spreading out across four, five, and up to fifteen articles [1]. Questions with binary yes/no answers were excluded so that a coin flip could not inflate scores.
Factual drift is a real problem for anything built on Wikipedia, since the right answer to a question about, say, a current officeholder can change. To control for this, the annotators re-verified the answers about three months after the initial collection and removed 5.5% of the samples whose answers were no longer true [1]. The result is a compact set of 824 questions that the authors consider clean and current as of release.
Every question is labeled with one or more reasoning types, since a single hard question often demands more than one kind of operation. The five categories and their approximate share of the dataset are summarized below.
| Reasoning type | What it requires | Approx. share |
|---|---|---|
| Multiple constraints | Satisfy several conditions at once, narrowing candidates until one answer remains | ~36% |
| Numerical | Arithmetic, counting, comparison, or other computation over retrieved values | ~20% |
| Temporal | Reasoning about dates, durations, ordering, and time-based disambiguation | (remainder) |
| Tabular | Reading and combining values from tables embedded in articles | (remainder) |
| Post-processing | Transforming retrieved facts into the final form the question asks for | (remainder) |
Because a question can carry several labels, the categories overlap rather than partition the set cleanly. Temporal disambiguation shows up frequently, for example resolving which of several similarly named events or people a question refers to based on when something happened.
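The effect of overlapping labels can be illustrated with a small tally. The sketch below uses a handful of made-up records and a hypothetical `reasoning_types` field (not necessarily the dataset's actual schema) to show why per-label shares add up to more than 100% when a question can carry several labels.

```python
from collections import Counter

# Hypothetical toy records for illustration only; the real dataset has
# 824 questions and its own column names.
questions = [
    {"id": 1, "reasoning_types": ["Multiple constraints", "Temporal"]},
    {"id": 2, "reasoning_types": ["Numerical"]},
    {"id": 3, "reasoning_types": ["Tabular", "Numerical", "Post-processing"]},
    {"id": 4, "reasoning_types": ["Multiple constraints"]},
]


def label_shares(records):
    """Return the fraction of questions that carry each reasoning label."""
    counts = Counter(
        label for record in records for label in set(record["reasoning_types"])
    )
    total = len(records)
    return {label: n / total for label, n in counts.items()}


shares = label_shares(questions)
for label, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{label:22s} {share:.0%}")

# Labels overlap, so the shares do not sum to 1.
print("sum of shares:", sum(shares.values()))
```

Each share is computed against the number of questions, not the number of labels, which is why the categories describe overlapping slices of the set rather than a partition.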
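FRAMES is meant to be run under several conditions so that retrieval quality and reasoning quality can be teased apart. The paper reports four main settings: a naive prompt with no retrieval, BM25 retrieval, an oracle prompt that supplies every gold article, and a multi-step pipeline that searches iteratively. The first three are single-step (the model answers once, given whatever context it has).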
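The difference between answering once and retrieving over several steps can be pictured as a loop. The following is a minimal sketch, not the authors' implementation: the corpus, the word-overlap `search` function, and `scripted_query_generator` (a stand-in for a model that proposes the next query) are all hypothetical.

```python
import re
from typing import Callable, Dict, List, Optional

# Hypothetical toy corpus standing in for Wikipedia articles.
CORPUS: Dict[str, str] = {
    "Film A": "Film A was directed by Jane Doe and released in 1999.",
    "Jane Doe": "Jane Doe was born in Lyon.",
    "Lyon": "Lyon is a city in France.",
}
QUESTION = "In which city was the director of Film A born?"


def words(text: str) -> set:
    """Lowercase a string and split it into a set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def search(query: str, k: int = 1) -> List[str]:
    """Toy retriever: rank article titles by word overlap with the query."""
    q = words(query)
    ranked = sorted(
        CORPUS,
        key=lambda title: len(q & words(title + " " + CORPUS[title])),
        reverse=True,
    )
    return ranked[:k]


def scripted_query_generator(question: str, context: Dict[str, str]) -> Optional[str]:
    """Stand-in for a model call that proposes the next search query."""
    if "Film A" not in context:
        return "Film A directed"
    if "Jane Doe" not in context:
        return "Jane Doe born"
    return None  # nothing left to look up


def multi_step_retrieve(
    question: str,
    next_query: Callable[[str, Dict[str, str]], Optional[str]],
    max_steps: int = 3,
    k: int = 1,
) -> List[str]:
    """Alternate between proposing a query and adding its results to the context."""
    context: Dict[str, str] = {}
    for _ in range(max_steps):
        query = next_query(question, context)
        if query is None:
            break
        for title in search(query, k):
            context.setdefault(title, CORPUS[title])
    return list(context)


print("single-step:", search(QUESTION, k=1))
print("multi-step: ", multi_step_retrieve(QUESTION, scripted_query_generator))
```

In this toy case a single query built from the question retrieves only the film's article, while the loop also reaches the director's article, which is the one that contains the answer.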
Answers are graded by an automatic rater. The authors checked this rater against human judgments and reported 0.96 agreement accuracy with a Cohen's kappa of 0.889, which they take as evidence that the auto-rater tracks human grading closely enough to trust the headline numbers [1].
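For readers unfamiliar with the two statistics, the sketch below computes raw agreement and Cohen's kappa for a pair of binary graders. The label lists are invented for illustration and are unrelated to the paper's data; kappa corrects the raw agreement for the agreement expected by chance.

```python
from typing import Sequence, Tuple


def agreement_and_kappa(human: Sequence[int], auto: Sequence[int]) -> Tuple[float, float]:
    """Return (raw agreement, Cohen's kappa) for two binary label sequences."""
    if len(human) != len(auto) or len(human) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(human)
    # Observed agreement: fraction of items where both raters give the same label.
    p_o = sum(h == a for h, a in zip(human, auto)) / n
    # Chance agreement from each rater's marginal rate of label 1.
    p_h = sum(human) / n
    p_a = sum(auto) / n
    p_e = p_h * p_a + (1 - p_h) * (1 - p_a)
    if p_e == 1:
        # Both raters always give the same single label; kappa is undefined.
        return p_o, float("nan")
    return p_o, (p_o - p_e) / (1 - p_e)


# Made-up grades: 1 = answer judged correct, 0 = judged incorrect.
human_grades = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
auto_grades = [1, 1, 1, 0, 0, 1, 0, 1, 0, 0]

accuracy, kappa = agreement_and_kappa(human_grades, auto_grades)
print(f"agreement={accuracy:.2f} kappa={kappa:.2f}")
```

Kappa is lower than raw agreement whenever some agreement would be expected by chance, which is why the two figures are reported together.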
The top-line finding is that strong models do poorly without help and improve a lot with the right retrieval strategy. A state-of-the-art LLM reaches only 0.40 accuracy with no retrieval, and the proposed multi-step pipeline lifts that to 0.66, a relative improvement of more than 50% [1]. That jump is the single most cited result from the paper, and it is the reason FRAMES is read as evidence that retrieval strategy, not just raw model quality, drives RAG performance.
The single-step numbers fill in the picture. Using Gemini-Pro-1.5 as the reference model, the reported accuracies were 0.408 with a naive no-retrieval prompt, 0.452 with BM25 returning two documents, and 0.474 with BM25 returning four documents [1]. Handing the model all the gold articles (the oracle setting) raised accuracy to 0.729.
| Setting | Retrieval | Accuracy |
|---|---|---|
| Naive prompt | None | 0.408 |
| Single-step BM25 | 2 documents | 0.452 |
| Single-step BM25 | 4 documents | 0.474 |
| Single-step oracle | All gold articles | 0.729 |
| Multi-step pipeline | Iterative search | 0.66 |
Two things stand out. First, naive single-shot retrieval barely helps: going from no retrieval to four BM25 documents moves accuracy only from about 0.41 to about 0.47, because a single query rarely surfaces all of the articles a multi-hop question needs. Second, even with perfect retrieval the oracle setting tops out around 0.73, which means models still miss more than a quarter of questions when they are handed every relevant document. That residual error is reasoning and synthesis, not retrieval, and it is the part FRAMES is designed to expose [1].
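To make the single-query limitation concrete, here is a minimal sketch of BM25 ranking over a toy corpus. The parameter values (`k1=1.5`, `b=0.75`), the smoothed IDF variant, and the documents are assumptions chosen for illustration; they are not taken from the paper, which does not need to match this implementation.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase a string and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def bm25_top_k(query: str, docs: Dict[str, str], k: int = 2,
               k1: float = 1.5, b: float = 0.75) -> List[str]:
    """Return the titles of the k highest-scoring documents for the query."""
    tokenized = {title: tokenize(body) for title, body in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized.values()) / n_docs
    # Document frequency: how many documents contain each term.
    df = Counter(term for toks in tokenized.values() for term in set(toks))
    query_terms = set(tokenize(query))

    scores = {}
    for title, toks in tokenized.items():
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Smoothed IDF: rarer terms contribute more.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[title] = score
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical toy documents.
toy_docs = {
    "Eiffel Tower": "The Eiffel Tower is a wrought iron tower in Paris",
    "Louvre": "The Louvre is a museum in Paris",
    "Mount Fuji": "Mount Fuji is a volcano in Japan",
}
top_two = bm25_top_k("tower in Paris", toy_docs, k=2)
print(top_two)
```

The ranking depends entirely on which terms the query shares with each document, so an article that is needed for a later hop but shares no vocabulary with the original question has little chance of appearing in the top results.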
The multi-step pipeline closes most of the gap between naive retrieval and the oracle, which suggests that iterative query generation recovers documents a single query misses. Performance was weakest on numerical, post-processing, and tabular questions, the categories that lean most on computation and transformation rather than on lookup. Beyond Gemini-Pro-1.5, the authors evaluated several other systems, including Gemini-Flash-1.5, Gemma2-27B, Llama-3.2-3B-Instruct, and Qwen2.5-3B-Instruct, with smaller models generally trailing [1]. (Models such as Gemini, Gemma, LLaMA, and Qwen are covered in their own articles.)
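The comparisons between settings can be checked with a few lines of arithmetic. The sketch below uses only the accuracies reported above; the helper names `relative_gain` and `gap_closed` are introduced here for illustration.

```python
# Accuracies as reported in the text (Gemini-Pro-1.5 as the reference model).
naive = 0.408
bm25_4 = 0.474
oracle = 0.729
multi_step = 0.66


def relative_gain(before: float, after: float) -> float:
    """Improvement expressed as a fraction of the starting accuracy."""
    return (after - before) / before


def gap_closed(baseline: float, achieved: float, ceiling: float) -> float:
    """Fraction of the baseline-to-ceiling gap recovered by a method."""
    return (achieved - baseline) / (ceiling - baseline)


print(f"BM25 (4 docs) absolute gain over naive:   {bm25_4 - naive:.3f}")
print(f"Multi-step relative gain over naive:      {relative_gain(naive, multi_step):.0%}")
print(f"Oracle residual error:                    {1 - oracle:.3f}")
print(f"Naive-to-oracle gap closed by multi-step: {gap_closed(naive, multi_step, oracle):.0%}")
```

With these figures the multi-step pipeline improves on the naive prompt by roughly 62% in relative terms and recovers a little under 80% of the distance to the oracle, while the oracle itself still leaves about 27% of questions unanswered.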
FRAMES sits next to a family of question answering and RAG datasets but differs in what it forces a system to do at once. Multi-hop sets like HotpotQA and Multi-hop RAG emphasize chaining facts across passages, while single-hop open-domain sets like TriviaQA and SQuAD generally need one supporting document. What FRAMES adds is the insistence that retrieval, multi-step reasoning, and factuality all be exercised in a single end-to-end score, with reasoning-type labels that make it possible to see where a pipeline breaks. The table below sketches the contrast.
| Benchmark | Hops | Primary focus | Unified retrieval + reasoning + factuality |
|---|---|---|---|
| FRAMES | 2 to 15 articles | End-to-end RAG | Yes |
| HotpotQA | Typically 2 | Multi-hop reasoning | Partial |
| Multi-hop RAG | Multiple | Multi-hop RAG retrieval | Partial |
| TriviaQA | Usually 1 | Open-domain factual QA | No |
| SQuAD | 1 | Reading comprehension | No |
The iterative-retrieval result also connects FRAMES to work on chain-of-thought prompting and on reasoning models, since decomposing a question into sub-queries is itself a form of explicit reasoning over what to fetch next. Because it scores the whole pipeline end to end, it is used to compare RAG pipelines rather than to rank base models in isolation.
The authors are candid about several constraints. Because the questions are built on Wikipedia, and Wikipedia text is in most LLM pretraining corpora, some answers may be memorized rather than retrieved, which can flatter closed-book scores; the paper flags this pretraining contamination risk directly [1]. The factual-drift problem never fully goes away either, since answers verified at release can become stale later, which is part of why the re-verification step was needed. With 824 questions the set is deliberately small and curated, so it may not span the full range of real-world queries, and its English, Wikipedia-grounded scope leaves out other languages and document types. Finally, automatic grading, even at 0.96 agreement with humans, is not identical to human judgment, so reported accuracies carry some measurement noise. None of this undercuts the main takeaway, which is that single-shot RAG leaves a lot on the table and that the reasoning component remains a real bottleneck even when retrieval is perfect.