FACTS Grounding
Last reviewed
May 31, 2026
Sources
9 citations
Review status
Source-backed
Revision
v1 · 2,165 words
FACTS Grounding is a benchmark from Google DeepMind and Google Research that measures how well a large language model can answer a request using only the information in a long source document, without adding claims the document does not support. It was announced on December 17, 2024, alongside a public leaderboard hosted on Kaggle. The benchmark contains 1,719 examples split into a public set and a held-out private set, and it scores each model response with an ensemble of frontier LLM judges that first check whether the response answers the user and then check whether every claim in it is grounded in the provided text. [1][2][3]
The name is a backronym for Factuality, Accuracy, and Correctness in Text Summarization, though in practice the benchmark covers more than summarization. Its central idea is narrow on purpose: it does not ask whether a model knows true things about the world; it asks whether a model stays faithful to a document it was handed. That distinction matters for production systems that answer from supplied documents, and it is what separates FACTS Grounding from parametric factuality tests like SimpleQA. [1][2]
When people first worry about model factuality, they usually mean parametric factuality: does the model recall the right answer from what it learned during training? Benchmarks like SimpleQA and TruthfulQA probe that, asking short closed questions and comparing the answer against a known ground truth. The knowledge lives inside the model's weights, and the test checks whether the model can retrieve it accurately. [1]
Grounded factuality is a different question. Here the relevant facts are not supposed to come from the weights at all. They come from a document supplied at inference time, and the model's job is to read that document and respond in a way that is fully supported by it. This is the setting that matters for retrieval-augmented generation, document question answering, summarization of a contract or a medical record, and most enterprise assistants that sit on top of a company's own files. In those systems the source of truth is the attached text, not the model's memory, so the failure people fear is a model that invents a detail, misattributes a number, or quietly blends in something it remembers from pretraining. [1][2]
Hallucination in this context is specifically the production of content that the source does not back up, even when that content happens to be true in general. A summary that adds a plausible but unstated figure is ungrounded, and for a grounded system that is a real error regardless of whether the figure is correct elsewhere. The FACTS Grounding paper frames the field as having two factuality scenarios: factuality with respect to a given context such as a user request and grounding documents, and factuality with respect to external sources and general world knowledge. The benchmark deliberately targets the first scenario only. [2]
Each of the 1,719 examples has three parts: a context document, a system instruction telling the model to rely only on that document, and a user request to fulfill. A typical system instruction reads like "in your answer, refer only to the context document; do not employ any outside knowledge." The user request is a concrete task such as a question to answer, a summary to write, or a passage to rewrite. [1][2][3]
The documents are long. They run up to about 32,000 tokens, roughly 20,000 words, with a mean closer to 2,500 tokens, and they span five domains: finance, technology, retail, medical, and legal. Source documents include SEC filings, technical materials, and medical documents. The tasks are limited to question answering, summarization, and document rewriting. [2][3]
The data was built by third-party human raters, who wrote prompts requiring long-form input and long-form output. Those prompts were then validated and filtered. The annotation instructions were written to avoid prompts that need creative responses, expert-level domain knowledge, mathematical or logical reasoning, or meta-analysis, because those abilities are confounders: a model could fail such a prompt for reasons that have nothing to do with grounding. Items that depended on unreadable OCR-rendered PDFs were also removed. The point of all this filtering is to isolate grounding as cleanly as possible, so that a low score reflects unfaithfulness to the source rather than some unrelated weakness. [2]
The public split (860 examples) is released on Hugging Face under a CC-BY-4.0 license, with fields named system_instruction, user_request, context_document, and full_prompt. The private split (859 examples) is held out. [1][3]
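A record in the public split can be pictured as a small structure carrying those fields. The sketch below is illustrative only: the field names come from the dataset description above, but the `GroundingExample` class, the sample values, and the order in which `full_prompt` is assembled are assumptions made for this example, not the dataset's documented construction.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GroundingExample:
    """One benchmark item; field names mirror the public split."""

    system_instruction: str
    user_request: str
    context_document: str

    @property
    def full_prompt(self) -> str:
        # Assumption: the three parts are joined in this order. The real
        # dataset ships its own full_prompt field, which should be preferred.
        return "\n\n".join(
            [self.system_instruction, self.context_document, self.user_request]
        )


# Hypothetical sample values, far shorter than a real context document.
example = GroundingExample(
    system_instruction="Refer only to the context document; use no outside knowledge.",
    user_request="Summarize the refund policy.",
    context_document="Refunds are issued within 30 days of purchase.",
)
print(example.full_prompt)
```

When working with the released data, reading the stored `full_prompt` field directly avoids depending on any guess about how the parts were combined.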
FACTS Grounding does not use string matching or a single grader. It uses automated LLM judges in a two-stage pipeline, and the order of the two stages is the heart of the method. [1][2]
The first stage is eligibility, sometimes described as an instruction-following check. A judge decides whether the response actually addresses the user's request. Responses that dodge the task are marked ineligible and then treated as inaccurate. This stage exists to close an obvious loophole. If grounding were the only thing scored, a model could win by refusing to commit to anything: a vague, hedged, or near-empty answer contains no unsupported claims and would look perfectly grounded. By gating on eligibility first, the benchmark discourages those vacuous but technically grounded outputs and forces a model to answer before it can earn credit for being faithful. Applying the eligibility filter lowers reported factuality scores by roughly 1 to 5 percent and can change the ordering of the leaderboard. [1][2]
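The effect of the gate can be shown with a toy scorer. This is a minimal sketch, assuming each response has already been reduced to two booleans; the function names are hypothetical, and `grounding_only_verdict` exists only as a contrast case, not as anything the benchmark uses.

```python
def gated_verdict(eligible: bool, grounded: bool) -> bool:
    """Accurate only if the response answers the request AND is grounded."""
    # Ineligible responses are treated as inaccurate regardless of grounding.
    return eligible and grounded


def grounding_only_verdict(eligible: bool, grounded: bool) -> bool:
    """Hypothetical scorer with no eligibility gate, shown for contrast."""
    return grounded


# A vague non-answer: it makes no unsupported claims but dodges the task.
vacuous = {"eligible": False, "grounded": True}

print(grounding_only_verdict(**vacuous))  # True
print(gated_verdict(**vacuous))           # False
```

Without the gate the vacuous answer earns full credit; with it, the same answer is counted as inaccurate.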
The second stage is the factuality, or grounding, judgment. For responses that passed eligibility, a judge decides whether the response is fully grounded in the document. A response is rated accurate only if all of its claims are supported by the provided context or require no grounding (for example, a generic transition sentence). A single unsupported claim flips the rating to inaccurate. There is no partial credit at the level of a response; grounding is treated as all-or-nothing per response, and the score for a model is the percentage of its responses rated accurate. [1][2]
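The all-or-nothing rule is easy to state in code. The sketch below assumes a judge's output has been broken down into per-claim labels; that decomposition, the `ClaimLabel` names, and the sample data are illustrative assumptions, since a judge may also return a single response-level verdict.

```python
from enum import Enum


class ClaimLabel(Enum):
    SUPPORTED = "supported"
    NO_GROUNDING_NEEDED = "no_grounding_needed"  # e.g. a transition sentence
    UNSUPPORTED = "unsupported"


def response_is_accurate(claims: list[ClaimLabel]) -> bool:
    """All-or-nothing: one unsupported claim makes the response inaccurate."""
    return all(label is not ClaimLabel.UNSUPPORTED for label in claims)


def model_score(responses: list[list[ClaimLabel]]) -> float:
    """Percentage of responses rated accurate."""
    if not responses:
        raise ValueError("need at least one response to score")
    accurate = sum(response_is_accurate(claims) for claims in responses)
    return 100.0 * accurate / len(responses)


# Hypothetical labels for four responses from one model.
responses = [
    [ClaimLabel.SUPPORTED, ClaimLabel.NO_GROUNDING_NEEDED],
    [ClaimLabel.SUPPORTED, ClaimLabel.SUPPORTED, ClaimLabel.UNSUPPORTED],
    [ClaimLabel.SUPPORTED],
    [ClaimLabel.SUPPORTED, ClaimLabel.SUPPORTED],
]
print(model_score(responses))  # 75.0
```

The second response is two-thirds supported, yet it contributes nothing to the score, which is what "no partial credit" means in practice.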
The authors tested several prompt templates for the judges, including span-level, response-level, and JSON-formatted variants, and selected the best-performing template per judge using Macro-F1 against held-out labeled data. Different judges did best with different prompting styles. [2]
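Template selection by Macro-F1 can be sketched as follows for a single judge. The template names echo the variants mentioned above, but the gold labels and predictions are invented for illustration, and the helper functions are hypothetical rather than taken from the paper's code.

```python
def f1_for_class(gold: list[bool], pred: list[bool], positive: bool) -> float:
    """F1 for one class, treating `positive` as the target label."""
    pairs = list(zip(gold, pred))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(gold: list[bool], pred: list[bool]) -> float:
    """Unweighted mean of the F1 scores for the two classes."""
    return (f1_for_class(gold, pred, True) + f1_for_class(gold, pred, False)) / 2


def best_template(gold: list[bool], preds_by_template: dict[str, list[bool]]) -> str:
    """Pick the template whose verdicts agree best with the labeled data."""
    return max(preds_by_template, key=lambda name: macro_f1(gold, preds_by_template[name]))


# Invented human labels (True = grounded) and one judge's verdicts per template.
gold = [True, True, False, False, True, False]
preds_by_template = {
    "span_level": [True, True, False, True, True, False],
    "response_level": [True, False, False, False, True, True],
    "json": [True, True, True, True, True, False],
}
print(best_template(gold, preds_by_template))  # span_level
```

Running the same selection separately for each judge is what allows different judges to end up with different templates.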
A known hazard of using an LLM as a grader is that models tend to favor text that looks like their own. If Google graded Gemini submissions with a Gemini judge alone, the scores would be suspect. FACTS Grounding addresses this by running three frontier judges from three different developers: Gemini 1.5 Pro, OpenAI's GPT-4o, and Anthropic's Claude 3.5 Sonnet. [1][2]
The paper reports the bias it is guarding against directly. Models rate their own outputs higher than they rate other models' outputs, by an average of about +3.2 percent. Averaging across three judges from different families dilutes that effect, because no single family's preference dominates the final number. [2]
The two stages aggregate slightly differently. For eligibility, the three judges are combined by consensus: a response is ruled ineligible only if all three judges agree it is ineligible, which is a deliberately forgiving rule that avoids disqualifying answers over one judge's strictness. For factuality, each judge produces a per-example accuracy verdict, and the final factuality score is the average of the judges' scores across all examples. The published leaderboard number for a model is its aggregate score across both the public and private sets. [1][2]
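The two aggregation rules can be combined in a short sketch. The judge names and verdicts below are hypothetical, and counting an ineligible response as inaccurate for every judge is this example's reading of the description above, not a confirmed implementation detail.

```python
JUDGES = ("judge_a", "judge_b", "judge_c")  # hypothetical judge names


def is_ineligible(eligibility_votes: dict[str, bool]) -> bool:
    """Unanimity rule: ineligible only if every judge says ineligible."""
    return not any(eligibility_votes[judge] for judge in JUDGES)


def final_score(examples: list[dict]) -> float:
    """Mean over judges of each judge's share of accurate responses."""
    per_judge = []
    for judge in JUDGES:
        # Assumption: an ineligible response counts as inaccurate for all judges.
        verdicts = [
            (not is_ineligible(ex["eligible"])) and ex["accurate"][judge]
            for ex in examples
        ]
        per_judge.append(sum(verdicts) / len(verdicts))
    return 100.0 * sum(per_judge) / len(per_judge)


def votes(a: bool, b: bool, c: bool) -> dict[str, bool]:
    return dict(zip(JUDGES, (a, b, c)))


# Invented verdicts for four responses.
examples = [
    {"eligible": votes(True, True, True), "accurate": votes(True, True, True)},
    # One strict judge is not enough to disqualify this response.
    {"eligible": votes(False, True, True), "accurate": votes(True, True, False)},
    # All three agree, so this response is ineligible and scored inaccurate.
    {"eligible": votes(False, False, False), "accurate": votes(True, True, True)},
    {"eligible": votes(True, True, True), "accurate": votes(False, True, False)},
]
print(final_score(examples))  # 50.0
```

Here the per-judge scores are 50, 75, and 25 percent, and the reported number is their mean, so no single judge's leniency or strictness sets the result.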
The leaderboard lives on Kaggle, which holds the private set, runs the evaluations, and posts results publicly. Keeping the private split sealed follows standard practice for guarding against contamination and leaderboard hacking, and an ongoing public leaderboard lets new models be added over time rather than freezing the field at one snapshot. [1][2]
At launch the top entries were close together near the low 80s. The table below lists the final aggregate factuality scores reported in the technical report. [1][2]
| Model | Final factuality score |
|---|---|
| Gemini 2.0 Flash Experimental | 83.6% |
| Gemini 1.5 Flash | 82.9% |
| Gemini 1.5 Pro | 80.0% |
| Claude 3.5 Sonnet | 79.4% |
| GPT-4o | 78.8% |
These figures are averages over the public and private sets, which is why a model's headline number can differ from a score quoted for either split on its own. The clustering near 80 percent is itself a finding: even strong models produce a response that is ineligible or not fully grounded on roughly one in five prompts. [1][2]
The choice of evaluation rule changes the picture too. The authors note that the eligibility filter not only lowers scores but can reshuffle rankings, which is a reminder that a grounding score is a property of the whole pipeline (instruction-following gate plus grounding judgment plus ensemble), not of grounding alone. [2]
FACTS Grounding sits at the input-faithfulness end of factuality evaluation, complementing rather than replacing the parametric end. SimpleQA and TruthfulQA test whether a model's stored knowledge is correct; FACTS Grounding tests whether a model respects a document put in front of it. A model can score well on one and poorly on the other, because the skills are different: one is recall, the other is restraint. [1]
The benchmark is closely tied to retrieval-augmented generation. RAG systems retrieve passages and ask a model to answer from them, and their whole value proposition is that the answer should track the retrieved evidence. FACTS Grounding can be read as a clean measurement of the generation half of a RAG pipeline, holding retrieval fixed by simply handing the model the document. It also relates to attribution and citation-faithfulness benchmarks, and to natural-language-inference-style entailment checks, all of which ask whether output text is entailed by source text. [1][2]
On December 9, 2025, Google extended the work into the FACTS Benchmark Suite, which keeps an updated Grounding benchmark (v2) and adds three more: a Parametric benchmark for internal-knowledge factoid questions, a Search benchmark for retrieving and synthesizing from web search, and a Multimodal benchmark for image-based questions. The suite reports a combined FACTS Score averaged across public and private sets and is also run on Kaggle. This expansion is the clearest signal of where the original benchmark fit: it covered grounded factuality, and the suite added the parametric and tool-using scenarios it had explicitly left out. [4]
The benchmark is narrow by design, and that narrowness is also its main limitation. It says nothing about whether a model's world knowledge is correct, because every example supplies the facts. A model that grounds well here can still hallucinate freely when no document is attached. The task mix is restricted to question answering, summarization, and rewriting, and prompts needing reasoning, math, or domain expertise were filtered out, so the score does not speak to grounded performance on harder cognitive tasks. [2]
The scoring depends on LLM judges, which carries the usual caveats. The judges themselves can misjudge grounding, especially on subtle paraphrase or implicit support, and although the ensemble dampens self-preference it does not eliminate the shared blind spots that frontier models may have in common. Because the judges are specific model versions, the meaning of a score is tied to those versions and to the selected prompt templates, and it can drift as judges are updated. The all-or-nothing per-response rule is strict and can penalize a long, mostly faithful answer for one marginal claim, while the consensus rule for eligibility is lenient and may let weak answers through. The two-stage design closes the obvious gaming route of empty hedged answers, but no automated grounding metric is fully immune to gaming. Finally, the data was built by human raters whose vendor and count are not disclosed in the report, and the documents skew toward five business and professional domains, so coverage of other genres is limited. The authors themselves note that benchmarks like this can be overtaken quickly and that they intend to keep iterating. [1][2]