FRAMES (benchmark)

Updated May 31, 2026

FRAMES is an evaluation dataset for retrieval-augmented generation that tests factual accuracy, retrieval, and reasoning together rather than one at a time. The name stands for Factuality, Retrieval, And reasoning MEasurement Set. It was introduced in 2024 by Satyapriya Krishna and colleagues from Google DeepMind and Harvard University in the paper "Fact, Fetch, and Reason: A Unified Evaluation of Retrieval-Augmented Generation" (arXiv:2409.12941) ^[1]. The dataset contains 824 multi-hop questions, each of which requires synthesizing information drawn from 2 to 15 Wikipedia articles, and it is distributed publicly on Hugging Face as google/frames-benchmark under an Apache 2.0 license ^[2].

The central idea is that a good answer in a RAG setting depends on three things happening correctly at once: the system has to fetch the right documents, it has to reason across them, and it has to state something true at the end. FRAMES scores all of that end to end, which makes it useful for measuring whether retrieval and reasoning components actually cooperate in practice.

Motivation

Most benchmarks for large language models probe these abilities separately. Retrieval is often measured with information-retrieval metrics like recall@k on a fixed corpus. Factuality is checked with closed-book question answering or hallucination probes. Reasoning gets its own suites of math and logic problems. Each of those tells you something, but none of them tells you whether a full pipeline can take a hard question, find the evidence, chain several facts together, and produce a correct response.

That gap matters because the failure modes interact. A model can retrieve perfect documents and still botch the arithmetic that ties them together. It can reason flawlessly over the wrong passages and confidently return a false answer. Earlier multi-hop datasets such as HotpotQA pushed on the reasoning side, but the questions are often answerable by pattern matching over two paragraphs, and supporting-fact supervision can let systems shortcut the actual reasoning. The FRAMES authors set out to build something harder and more unified: questions where you cannot get the right answer without genuinely retrieving from several sources and then doing real work on what you found ^[1].

Dataset construction

The authors first tried to generate questions synthetically by prompting an LLM, but they measured a hallucination rate above 30% in the generated items, which meant heavy manual cleanup either way ^[1]. They switched to human annotation, using the synthetic-generation instructions as a guide for the annotators rather than as a source of finished questions.

Human experts wrote questions that deliberately span multiple Wikipedia articles and that need several reasoning steps to resolve. Each question is paired with a gold answer and the list of Wikipedia pages required to reach it. Roughly 36% of the questions need two articles and about 35% need three, with the remainder spreading out across four, five, and up to fifteen articles ^[1]. Questions with binary yes/no answers were excluded so that a coin flip could not inflate scores.

Factual drift is a real problem for anything built on Wikipedia, since the right answer to a question about, say, a current officeholder can change. To control for this, the annotators re-verified the answers about three months after the initial collection and removed 5.5% of the samples whose answers were no longer true ^[1]. The result is a compact set of 824 questions that the authors consider clean and current as of release.

Reasoning categories

Every question is labeled with one or more reasoning types, since a single hard question often demands more than one kind of operation. The five categories and their approximate share of the dataset are summarized below.

Reasoning type	What it requires	Approx. share
Multiple constraints	Satisfy several conditions at once, narrowing candidates until one answer remains	~36%
Numerical	Arithmetic, counting, comparison, or other computation over retrieved values	~20%
Temporal	Reasoning about dates, durations, ordering, and time-based disambiguation	(remainder)
Tabular	Reading and combining values from tables embedded in articles	(remainder)
Post-processing	Transforming retrieved facts into the final form the question asks for	(remainder)

Because a question can carry several labels, the categories overlap rather than partition the set cleanly. Temporal disambiguation shows up frequently, for example resolving which of several similarly named events or people a question refers to based on when something happened.
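The overlap between categories can be made concrete with a small sketch. The records below are hypothetical toy data, not rows from FRAMES, and the field name `reasoning_types` is an assumption chosen for illustration. The point is only that per-label shares computed over multi-label items can sum to more than 100%.

```python
from collections import Counter

# Hypothetical toy records. The label names mirror the five reasoning types,
# but the questions and their label assignments are invented for illustration.
toy_questions = [
    {"id": "q1", "reasoning_types": ["Multiple constraints", "Temporal"]},
    {"id": "q2", "reasoning_types": ["Numerical"]},
    {"id": "q3", "reasoning_types": ["Tabular", "Numerical", "Post-processing"]},
    {"id": "q4", "reasoning_types": ["Multiple constraints"]},
]


def label_shares(questions):
    """Return the fraction of questions that carry each label."""
    counts = Counter(
        label for q in questions for label in set(q["reasoning_types"])
    )
    total = len(questions)
    return {label: n / total for label, n in counts.items()}


shares = label_shares(toy_questions)

# Questions can carry several labels, so the shares overlap and
# their sum exceeds 1 in this toy set.
total_share = sum(shares.values())
print(shares)
print(total_share)
```

When reporting per-category accuracy on a multi-label set like this, each question contributes to every category it is tagged with, so category scores are not independent of one another.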

Evaluation protocols

FRAMES is meant to be run under several conditions so that retrieval quality and reasoning quality can be teased apart. The paper reports four main settings, the first three of which are single-step (the model answers once, given whatever context it has).

No retrieval. The model sees only the question and answers from its parameters. This isolates closed-book factual recall and reasoning, with no external evidence.
Retrieval with BM25. A sparse retriever pulls a fixed number of documents (the paper reports runs with 2 and with 4 documents) and the model answers conditioned on them. This reflects a basic single-shot RAG pipeline.
Oracle documents. The model is handed all of the gold Wikipedia articles for the question. Retrieval is effectively perfect here, so whatever errors remain are reasoning and synthesis errors rather than retrieval failures.
Multi-step retrieval. Instead of retrieving once, the system iterates. The paper's pipeline runs several rounds of search-and-read, generating multiple queries at each step and accumulating documents before producing a final answer. This lets the model decompose the question, gather evidence for each sub-part, and combine the results.
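The four settings above differ only in which documents reach the model. The sketch below illustrates that difference on a three-document toy corpus. Everything in it is an assumption made for illustration: the corpus, the question, the `TinyBM25` class (a minimal Okapi BM25 scorer), and `scripted_queries`, which stands in for the LLM that would generate follow-up queries in a real pipeline. It is not the paper's implementation.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


class TinyBM25:
    """Minimal Okapi BM25 over a small in-memory corpus (illustrative only)."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.tokens = {title: tokenize(body) for title, body in docs.items()}
        self.avg_len = sum(len(t) for t in self.tokens.values()) / len(self.tokens)
        # Document frequency: number of documents containing each term.
        self.df = Counter(term for t in self.tokens.values() for term in set(t))

    def score(self, query, title):
        toks = self.tokens[title]
        tf = Counter(toks)
        n_docs = len(self.tokens)
        length_norm = self.k1 * (1 - self.b + self.b * len(toks) / self.avg_len)
        total = 0.0
        for term in tokenize(query):
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - self.df[term] + 0.5) / (self.df[term] + 0.5))
            total += idf * tf[term] * (self.k1 + 1) / (tf[term] + length_norm)
        return total

    def top_k(self, query, k):
        scored = [(self.score(query, title), title) for title in self.tokens]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [title for s, title in scored[:k] if s > 0]


def multi_step_retrieve(question, retriever, propose_queries, steps=3, k=1):
    """Accumulate documents over several rounds of query generation."""
    collected = []
    for _ in range(steps):
        for query in propose_queries(question, collected):
            for title in retriever.top_k(query, k):
                if title not in collected:
                    collected.append(title)
    return collected


def scripted_queries(question, collected):
    # Hypothetical stand-in for an LLM query generator: one fixed
    # sub-query per round, chosen by how much evidence has been gathered.
    plan = [["alpha tower architect"], ["jane doe born"], ["exampleville population"]]
    return plan[min(len(collected), len(plan) - 1)]


# Invented toy corpus and question; none of this comes from FRAMES.
corpus = {
    "Alpha Tower": "alpha tower is a skyscraper designed by architect jane doe",
    "Jane Doe": "jane doe is an architect born in the city of exampleville",
    "Exampleville": "exampleville is a city with a population of 1000",
}
gold = ["Alpha Tower", "Jane Doe", "Exampleville"]
question = "what is the population of the birthplace of the architect of alpha tower"

retriever = TinyBM25(corpus)
contexts = {
    "no_retrieval": [],                                   # question only
    "bm25_single_step": retriever.top_k(question, k=2),   # one query, fixed k
    "oracle": list(gold),                                 # all gold articles
    "multi_step": multi_step_retrieve(question, retriever, scripted_queries),
}
for setting, titles in contexts.items():
    print(setting, titles)
```

In this toy run the single-step setting is capped at two documents while the question needs three, whereas the iterative loop gathers one article per round. Comparing accuracy under the oracle context against the other settings is what separates retrieval failures from reasoning failures.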

Answers are graded by an automatic rater. The authors checked this rater against human judgments and reported 0.96 agreement accuracy with a Cohen's kappa of 0.889, which they take as evidence that the auto-rater tracks human grading closely enough to trust the headline numbers ^[1].
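To make the two agreement figures concrete, the sketch below computes raw agreement and Cohen's kappa for a pair of graders. The label lists are invented toy data (1 for "correct", 0 for "incorrect") and do not reproduce the paper's numbers; kappa is computed from its standard definition, observed agreement corrected for the agreement expected by chance.

```python
def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    labels = set(rater_a) | set(rater_b)
    expected = sum(
        (rater_a.count(label) / n) * (rater_b.count(label) / n) for label in labels
    )
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Invented toy judgments: 1 = answer graded correct, 0 = graded incorrect.
human = [1, 1, 1, 0, 0, 0, 1, 0, 1, 1]
auto = [1, 1, 1, 0, 0, 1, 1, 0, 1, 1]

agreement = sum(h == a for h, a in zip(human, auto)) / len(human)
kappa = cohens_kappa(human, auto)
print(agreement)        # 0.9
print(round(kappa, 3))  # 0.783
```

Kappa is lower than raw agreement here because some matches would occur by chance given how often each grader says "correct". That is why the paper reports both figures rather than agreement alone.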

Headline results

The top-line finding is that strong models do poorly without help and improve a lot with the right retrieval strategy. A state-of-the-art LLM reaches only 0.40 accuracy with no retrieval, and the proposed multi-step pipeline lifts that to 0.66, a relative improvement of more than 50% ^[1]. That jump is the single most cited result from the paper, and it is the reason FRAMES is read as evidence that retrieval strategy, not just raw model quality, drives RAG performance.

The single-step numbers fill in the picture. Using Gemini-Pro-1.5 as the reference model, the reported accuracies were 0.408 with a naive no-retrieval prompt, 0.452 with BM25 returning two documents, and 0.474 with BM25 returning four documents ^[1]. Handing the model all the gold articles (the oracle setting) raised accuracy to 0.729.

Setting	Retrieval	Accuracy
Naive prompt	None	0.408
Single-step BM25	2 documents	0.452
Single-step BM25	4 documents	0.474
Single-step oracle	All gold articles	0.729
Multi-step pipeline	Iterative search	0.66

Two things stand out. First, naive single-shot retrieval barely helps: going from no retrieval to four BM25 documents moves accuracy only from about 0.41 to about 0.47, because a single query rarely surfaces all of the articles a multi-hop question needs. Second, even with perfect retrieval the oracle setting tops out around 0.73, which means models still miss more than a quarter of questions when they are handed every relevant document. That residual error is reasoning and synthesis, not retrieval, and it is the part FRAMES is designed to expose ^[1].

The multi-step pipeline closes most of the gap between naive retrieval and the oracle, which suggests that iterative query generation recovers documents a single query misses. Performance was weakest on numerical, post-processing, and tabular questions, the categories that lean most on computation and transformation rather than on lookup. Beyond Gemini-Pro-1.5, the authors evaluated several other systems, including Gemini-Flash-1.5, Gemma2-27B, Llama-3.2-3B-Instruct, and Qwen2.5-3B-Instruct, with smaller models generally trailing ^[1]. (Models such as Gemini, Gemma, LLaMA, and Qwen are covered in their own articles.)
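The comparisons drawn from the results table can be checked with a few lines of arithmetic. The accuracies below are copied from the table; the variable names and the choice of the four-document BM25 run as the single-step baseline for the "gap closed" figure are assumptions made for this sketch.

```python
# Accuracies as reported in the results table (Gemini-Pro-1.5).
no_retrieval = 0.408
bm25_four_docs = 0.474
oracle = 0.729
multi_step = 0.66

# Relative improvement of the multi-step pipeline over no retrieval.
relative_gain = (multi_step - no_retrieval) / no_retrieval

# Absolute gain from single-step BM25 with four documents.
bm25_gain = bm25_four_docs - no_retrieval

# Share of the BM25-to-oracle gap recovered by multi-step retrieval
# (baseline choice is an assumption of this sketch).
gap_closed = (multi_step - bm25_four_docs) / (oracle - bm25_four_docs)

# Error that remains even when every gold article is provided.
oracle_residual = 1 - oracle

print(f"relative gain over no retrieval: {relative_gain:.1%}")
print(f"absolute gain from BM25 (4 docs): {bm25_gain:.3f}")
print(f"gap closed vs. oracle: {gap_closed:.1%}")
print(f"residual error with oracle docs: {oracle_residual:.3f}")
```

Using the rounded headline figures instead (0.40 and 0.66) gives a relative gain of 65%, so either way the result is consistent with the "more than 50%" claim.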

Relation to other benchmarks

FRAMES sits next to a family of question answering and RAG datasets but differs in what it forces a system to do at once. Multi-hop sets like HotpotQA and Multi-hop RAG emphasize chaining facts across passages, while single-hop open-domain sets like TriviaQA and SQuAD generally need one supporting document. What FRAMES adds is the insistence that retrieval, multi-step reasoning, and factuality all be exercised in a single end-to-end score, with reasoning-type labels that make it possible to see where a pipeline breaks. The table below sketches the contrast.

Benchmark	Hops	Primary focus	Unified retrieval + reasoning + factuality
FRAMES	2 to 15 articles	End-to-end RAG	Yes
HotpotQA	Typically 2	Multi-hop reasoning	Partial
Multi-hop RAG	Multiple	Multi-hop RAG retrieval	Partial
TriviaQA	Usually 1	Open-domain factual QA	No
SQuAD	1	Reading comprehension	No

The iterative-retrieval result also connects FRAMES to work on chain-of-thought prompting and on reasoning models, since decomposing a question into sub-queries is itself a form of explicit reasoning over what to fetch next. In practice, the benchmark is used mainly to compare RAG pipelines rather than to rank base models in isolation.

Limitations

The authors are candid about several constraints. Because the questions are built on Wikipedia, and Wikipedia text is in most LLM pretraining corpora, some answers may be memorized rather than retrieved, which can flatter closed-book scores; the paper flags this pretraining contamination risk directly ^[1]. The factual-drift problem never fully goes away either, since answers verified at release can become stale later, which is part of why the re-verification step was needed. With 824 questions the set is deliberately small and curated, so it may not span the full range of real-world queries, and its English, Wikipedia-grounded scope leaves out other languages and document types. Finally, automatic grading, even at 0.96 agreement with humans, is not identical to human judgment, so reported accuracies carry some measurement noise. None of this undercuts the main takeaway, which is that single-shot RAG leaves a lot on the table and that the reasoning component remains a real bottleneck even when retrieval is perfect.

References

Krishna, Satyapriya; Krishna, Kalpesh; Mohananey, Anhad; Schwarcz, Steven; Stambler, Adam; Upadhyay, Shyam; Faruqui, Manaal. "Fact, Fetch, and Reason: A Unified Evaluation of Retrieval-Augmented Generation." arXiv preprint arXiv:2409.12941, 2024. https://arxiv.org/abs/2409.12941
Google. "frames-benchmark." Hugging Face dataset, 2024. https://huggingface.co/datasets/google/frames-benchmark
Krishna, Satyapriya; et al. "Fact, Fetch, and Reason: A Unified Evaluation of Retrieval-Augmented Generation." Hugging Face Papers, 2024. https://huggingface.co/papers/2409.12941
"FRAMES Leaderboard." LLM-Stats, 2024. https://llm-stats.com/benchmarks/frames
Lewis, Patrick; et al. "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks." Advances in Neural Information Processing Systems (NeurIPS), 2020. https://arxiv.org/abs/2005.11401
Yang, Zhilin; Qi, Peng; Zhang, Saizheng; Bengio, Yoshua; Cohen, William W.; Salakhutdinov, Ruslan; Manning, Christopher D. "HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering." Proceedings of EMNLP, 2018. https://arxiv.org/abs/1809.09600
Joshi, Mandar; Choi, Eunsol; Weld, Daniel S.; Zettlemoyer, Luke. "TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension." Proceedings of ACL, 2017. https://arxiv.org/abs/1705.03551
Robertson, Stephen; Zaragoza, Hugo. "The Probabilistic Relevance Framework: BM25 and Beyond." Foundations and Trends in Information Retrieval, 2009. https://www.staff.city.ac.uk/~sbrp622/papers/foundations_bm25_review.pdf
Trivedi, Harsh; Balasubramanian, Niranjan; Khot, Tushar; Sabharwal, Ashish. "MuSiQue: Multihop Questions via Single-hop Question Composition." Transactions of the Association for Computational Linguistics, 2022. https://arxiv.org/abs/2108.00573
