FACTS Grounding

Updated Jun 29, 2026

FACTS Grounding is a factuality benchmark from Google DeepMind and Google Research that measures whether a large language model answers a request using only the information in a provided source document, without adding claims the document does not support. Announced on December 17, 2024, it contains 1,719 examples (860 public and 859 held-out private). In each one, a model is given a system instruction, a long context document of up to about 32,000 tokens, and a user request. An ensemble of three frontier LLM judges then scores whether the response both answers the user and stays fully grounded in the document. The benchmark is published with a public leaderboard on Kaggle, and at launch the leading models clustered near 80 percent, meaning even strong models fail to stay fully grounded on roughly one in five long-context prompts. ^[1]^[2]^[3]

The name is a backronym for Factuality, Accuracy, and Correctness in Text Summarization, though in practice the benchmark covers more than summarization. Its central idea is narrow on purpose: it does not ask whether a model knows true things about the world; it asks whether a model stays faithful to a document it was handed. That distinction matters for production systems that answer from supplied documents, and it is what separates FACTS Grounding from parametric factuality tests like SimpleQA. ^[1]^[2]

Google DeepMind frames the motivation plainly: "Large language models (LLMs) are transforming how we access information, yet their grip on factual accuracy remains imperfect. They can 'hallucinate' false information, particularly when given complex inputs." FACTS Grounding targets exactly that failure mode in the grounded setting, where the model has been handed the facts and should not invent any. ^[1]

What is FACTS Grounding?

FACTS Grounding is a benchmark for grounded factuality: it tests whether an LLM's long-form answer is fully supported by a source document supplied at inference time. Each test item pairs a context document with a system instruction telling the model to rely only on that document and a user request to fulfill, and a model passes an item only if its response answers the request and contains no claim the document fails to back up. The benchmark was introduced by Google DeepMind and Google Research on December 17, 2024, with a Kaggle leaderboard and a public dataset split. ^[1]^[2]^[3]

The distinction it draws is deliberate. FACTS Grounding does not measure whether a model knows facts; it measures whether a model respects facts it has been given. That makes it a direct proxy for the reliability of retrieval-augmented generation, document question answering, and enterprise assistants that read a company's own files, where the attached text, not the model's memory, is the source of truth. ^[1]^[2]

Why is grounded factuality its own problem?

When people first worry about model factuality, they usually mean parametric factuality: does the model recall the right answer from what it learned during training? Benchmarks like SimpleQA and TruthfulQA probe that, asking short closed questions and comparing the answer against a known ground truth. The knowledge lives inside the model's weights, and the test checks whether the model can retrieve it accurately. ^[1]

Grounded factuality is a different question. Here the relevant facts are not supposed to come from the weights at all. They come from a document supplied at inference time, and the model's job is to read that document and respond in a way that is fully supported by it. This is the setting that matters for retrieval-augmented generation, document question answering, summarization of a contract or a medical record, and most enterprise assistants that sit on top of a company's own files. In those systems the source of truth is the attached text, not the model's memory, so the failure people fear is a model that invents a detail, misattributes a number, or quietly blends in something it remembers from pretraining. ^[1]^[2]

Hallucination in this context is specifically the production of content that the source does not back up, even when that content happens to be true in general. A summary that adds a plausible but unstated figure is ungrounded, and for a grounded system that is a real error regardless of whether the figure is correct elsewhere. The FACTS Grounding paper frames the field as having two factuality scenarios: factuality with respect to a given context such as a user request and grounding documents, and factuality with respect to external sources and general world knowledge. The benchmark deliberately targets the first scenario only. ^[2]

What is in the FACTS Grounding dataset?

Each of the 1,719 examples has three parts: a context document, a system instruction telling the model to rely only on that document, and a user request to fulfill. A typical system instruction reads like "in your answer, refer only to the context document; do not employ any outside knowledge." The user request is a concrete task such as a question to answer, a summary to write, or a passage to rewrite. ^[1]^[2]^[3]

The documents are long. They run up to about 32,000 tokens, roughly 20,000 words, with a mean closer to 2,500 tokens, and they span five domains: finance, technology, retail, medical, and legal. Examples include things like SEC filings, technical materials, and medical documents. The tasks are limited to question answering, summarization, and document rewriting. ^[2]^[3]

The data was built by third-party human raters who wrote prompts requiring long-form input and long-form output, then passed through validation and filtering. The annotation instructions were written to avoid prompts that need creative responses, expert-level domain knowledge, mathematical or logical reasoning, or meta-analysis, because those abilities are confounders: a model could fail such a prompt for reasons that have nothing to do with grounding. Items that depended on unreadable OCR-rendered PDFs were also removed. The point of all this filtering is to isolate grounding as cleanly as possible, so that a low score reflects unfaithfulness to the source rather than some unrelated weakness. ^[2]

The public split (860 examples) is released on Hugging Face under a CC-BY-4.0 license, with fields named system_instruction, user_request, context_document, and full_prompt. The private split (859 examples) is held out. The table below summarizes the dataset at a glance. ^[1]^[3]

Attribute	Value
Total examples	1,719
Public split	860 (Hugging Face, CC-BY-4.0)
Private split	859 (held out on Kaggle)
Document length	up to ~32,000 tokens (~20,000 words); mean ~2,500 tokens
Domains	finance, technology, retail, medical, legal
Task types	question answering, summarization, document rewriting
Announced	December 17, 2024
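
As a minimal sketch, one record from the public split can be modeled as a small data class. The four field names come from the dataset description above; the `FactsExample` class, the `make_example` helper, and the concatenation order used for `full_prompt` are assumptions made for illustration, since the actual prompt template is not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FactsExample:
    """One benchmark item. Field names mirror the public split's columns."""
    system_instruction: str
    user_request: str
    context_document: str
    full_prompt: str


def make_example(system_instruction: str, context_document: str,
                 user_request: str) -> FactsExample:
    # Assumption: full_prompt is a plain concatenation in this order.
    # The template used to build the real full_prompt field may differ.
    full_prompt = "\n\n".join([system_instruction, context_document, user_request])
    return FactsExample(
        system_instruction=system_instruction,
        user_request=user_request,
        context_document=context_document,
        full_prompt=full_prompt,
    )
```

In practice the rows would come from the Hugging Face dataset listed in the references. The loading call is left out here because the split and configuration names are not given in this article.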

How does FACTS Grounding score factuality?

FACTS Grounding does not use string matching or a single grader. It uses automated LLM judges in a two-stage pipeline, and the order of the two stages is the heart of the method. ^[1]^[2]

The first stage is eligibility, sometimes described as an instruction-following check. A judge decides whether the response actually addresses the user's request. Responses that dodge the task are marked ineligible and then treated as inaccurate. This stage exists to close an obvious loophole. If grounding were the only thing scored, a model could win by refusing to commit to anything: a vague, hedged, or near-empty answer contains no unsupported claims and would look perfectly grounded. By gating on eligibility first, the benchmark discourages those vacuous but technically grounded outputs and forces a model to answer before it can earn credit for being faithful. Applying the eligibility filter lowers reported factuality scores by roughly 1 to 5 percent and can change the ordering of the leaderboard. ^[1]^[2]

The second stage is the factuality, or grounding, judgment. As DeepMind puts it, "responses are judged as factually accurate if they are fully grounded in information contained in the provided document, with no hallucinations." For responses that passed eligibility, a judge decides whether the response is fully grounded in the document. A response is rated accurate only if all of its claims are supported by the provided context or require no grounding (for example, a generic transition sentence). A single unsupported claim flips the rating to inaccurate. There is no partial credit at the level of a response; grounding is treated as all-or-nothing per response, and the score for a model is the percentage of its responses rated accurate. ^[1]^[2]
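
The two stages can be sketched for a single judge as follows. This is an illustration of the rule described above, not the benchmark's code: the `JudgedResponse` type and the per-claim flags are hypothetical, and a real judge may return one response-level verdict rather than a flag per claim.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class JudgedResponse:
    eligible: bool                    # stage 1: did the response address the request?
    claim_supported: Sequence[bool]   # stage 2: one flag per claim that needs grounding


def is_accurate(response: JudgedResponse) -> bool:
    # Ineligible responses are treated as inaccurate before grounding is considered.
    if not response.eligible:
        return False
    # All-or-nothing: a single unsupported claim makes the response inaccurate.
    # An empty claim list (nothing needs grounding) counts as accurate.
    return all(response.claim_supported)


def factuality_score(responses: Sequence[JudgedResponse]) -> float:
    """Percentage of responses rated accurate."""
    if not responses:
        raise ValueError("no responses to score")
    return 100.0 * sum(is_accurate(r) for r in responses) / len(responses)
```

With four responses, of which one is fully supported, one has a single unsupported claim, one is ineligible, and one is eligible with nothing to ground, `factuality_score` returns 50.0: the partially supported response earns no credit.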

The authors tested several prompt templates for the judges, including span-level, response-level, and JSON-formatted variants, and selected the best-performing template per judge using Macro-F1 against held-out labeled data. Different judges did best with different prompting styles. ^[2]
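
Template selection by Macro-F1 can be illustrated with a short sketch. The function names and the dictionary of per-template predictions are hypothetical; only the idea of scoring each template against labeled data with Macro-F1 and keeping the best one per judge comes from the text above.

```python
from typing import Dict, Sequence


def f1_for_label(gold: Sequence[bool], pred: Sequence[bool], label: bool) -> float:
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold: Sequence[bool], pred: Sequence[bool]) -> float:
    """Unweighted mean of per-class F1 over every class seen in gold or pred."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    labels = sorted(set(gold) | set(pred))
    return sum(f1_for_label(gold, pred, label) for label in labels) / len(labels)


def best_template(gold: Sequence[bool],
                  preds_by_template: Dict[str, Sequence[bool]]) -> str:
    # Run once per judge: keep the template whose verdicts best match the labels.
    return max(preds_by_template,
               key=lambda name: macro_f1(gold, preds_by_template[name]))
```

Macro-F1 weights the "accurate" and "inaccurate" classes equally, so a template cannot score well just by predicting the more common class.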

Why use an ensemble of judges?

A known hazard of using an LLM as a grader is that models tend to favor text that looks like their own. If Google graded Gemini submissions with a Gemini judge alone, the scores would be suspect. FACTS Grounding addresses this by running three frontier judges from three different developers: Gemini 1.5 Pro, OpenAI's GPT-4o, and Anthropic's Claude 3.5 Sonnet. ^[1]^[2]

The paper reports the bias it is guarding against directly. Models rate their own outputs higher than they rate other models' outputs, by an average of about +3.2 percent. Averaging across three judges from different families dilutes that effect, because no single family's preference dominates the final number. ^[2]
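
One way such a self-preference gap could be computed is sketched below. The definition used here (a judge's score for its own family's model minus the mean score the other judges give that model, averaged over judges) is an assumption, and the judge and model names in the usage note are placeholders; the paper's exact procedure may differ.

```python
from typing import Dict

Scores = Dict[str, Dict[str, float]]  # judge -> model -> score in percent


def self_preference_gap(scores: Scores, own_model: Dict[str, str]) -> float:
    """Mean of (own-judge score minus mean other-judge score) across judges."""
    gaps = []
    for judge, model in own_model.items():
        own = scores[judge][model]
        others = [scores[other][model] for other in scores if other != judge]
        gaps.append(own - sum(others) / len(others))
    return sum(gaps) / len(gaps)
```

For example, if a placeholder `judge_a` gives its own `model_a` 84.0 while `judge_b` gives it 80.0, and `judge_b` gives its own `model_b` 82.0 while `judge_a` gives it 78.0, the gap is 4.0 points. These numbers are invented purely to show the arithmetic.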

The two stages aggregate slightly differently. For eligibility, the three judges are combined by consensus: a response is ruled ineligible only if all three judges agree it is ineligible, which is a deliberately forgiving rule that avoids disqualifying answers over one judge's strictness. For factuality, each judge produces a per-example accuracy verdict, and the final factuality score is the average of the judges' scores across all examples. As the technical report states, "the final score for the overall grounding task is the average of all judge models' scores across all examples." The published leaderboard number for a model is its aggregate score across both the public and private sets. ^[1]^[2]
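
The aggregation can be sketched as below. How the consensus eligibility gate is combined with each judge's grounding verdicts is an assumption here (a consensus-ineligible response is counted as inaccurate for every judge), and the `Verdicts` layout and function names are hypothetical.

```python
from typing import Dict, List

Verdicts = Dict[str, List[bool]]  # judge name -> one flag per example


def consensus_ineligible(ineligible: Verdicts) -> List[bool]:
    # A response is ruled ineligible only if every judge says so.
    return [all(flags) for flags in zip(*ineligible.values())]


def ensemble_score(grounded: Verdicts, ineligible: Verdicts) -> float:
    """Average of the per-judge accuracy rates, in percent."""
    dropped = consensus_ineligible(ineligible)
    judge_scores = []
    for flags in grounded.values():
        # Assumption: ineligible responses count as inaccurate for every judge.
        accurate = [g and not d for g, d in zip(flags, dropped)]
        judge_scores.append(sum(accurate) / len(accurate))
    return 100.0 * sum(judge_scores) / len(judge_scores)
```

Note the asymmetry: one strict judge cannot disqualify a response on its own, but one strict judge's grounding verdicts do pull the averaged score down.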

What were the FACTS Grounding launch results?

The leaderboard lives on Kaggle, which holds the private set, runs the evaluations, and posts results publicly. Keeping the private split sealed follows standard practice for guarding against contamination and leaderboard hacking, and an ongoing public leaderboard lets new models be added over time rather than freezing the field at one snapshot. ^[1]^[2]^[5]

At launch the top entries sat within about five points of each other, from 78.8 to 83.6 percent. The table below lists the final aggregate factuality scores reported in the technical report. ^[1]^[2]

Model	Final factuality score
Gemini 2.0 Flash Experimental	83.6%
Gemini 1.5 Flash	82.9%
Gemini 1.5 Pro	80.0%
Claude 3.5 Sonnet	79.4%
GPT-4o	78.8%

These figures are averages over the public and private sets, which is why a model's headline number can differ from a score quoted for either split on its own. The clustering near 80 percent is itself a finding: even strong models fail to stay fully grounded on roughly one in five long-context prompts. ^[1]^[2]
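
The arithmetic behind a headline number can be sketched as follows. The split sizes come from the dataset description; the example-weighted average is an assumption about how the two splits are combined (with 860 and 859 examples it is nearly identical to a simple mean), and the function names are hypothetical.

```python
PUBLIC_N = 860   # public split size
PRIVATE_N = 859  # private split size


def headline_score(public_pct: float, private_pct: float) -> float:
    """Example-weighted mean of the two split scores, in percent."""
    total = PUBLIC_N + PRIVATE_N
    return (public_pct * PUBLIC_N + private_pct * PRIVATE_N) / total


def ungrounded_rate(score_pct: float) -> float:
    """Fraction of responses not rated accurate."""
    return 1.0 - score_pct / 100.0
```

A score of 80.0 gives an ungrounded rate of 0.2, which is the "one in five" figure, or about 344 of the 1,719 prompts.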

The choice of evaluation rule changes the picture too. The authors note that the eligibility filter not only lowers scores but can reshuffle rankings, which is a reminder that a grounding score is a property of the whole pipeline (instruction-following gate plus grounding judgment plus ensemble), not of grounding alone. ^[2]

How does FACTS Grounding relate to other factuality work?

FACTS Grounding sits at the input-faithfulness end of factuality evaluation, complementing rather than replacing the parametric end. SimpleQA and TruthfulQA test whether a model's stored knowledge is correct; FACTS Grounding tests whether a model respects a document put in front of it. A model can score well on one and poorly on the other, because the skills are different: one is recall, the other is restraint. ^[1]

The benchmark is closely tied to retrieval-augmented generation. RAG systems retrieve passages and ask a model to answer from them, and their whole value proposition is that the answer should track the retrieved evidence. FACTS Grounding can be read as a clean measurement of the generation half of a RAG pipeline, holding retrieval fixed by simply handing the model the document. It also relates to attribution and citation-faithfulness benchmarks, and to natural-language-inference-style entailment checks, all of which ask whether output text is entailed by source text. ^[1]^[2]

What replaced FACTS Grounding?

On December 9, 2025, Google extended the work into the FACTS Benchmark Suite, which keeps an updated Grounding benchmark (v2) and adds three more: a Parametric benchmark for internal-knowledge factoid questions, a Search benchmark for retrieving and synthesizing from a standardized web search API, and a Multimodal benchmark for image-based questions. The suite spans 3,513 curated examples in total, reports a combined FACTS Score averaged across public and private sets, and is also run on Kaggle. At its launch Gemini 3 Pro led overall with a 68.8 percent FACTS Score, and multimodal factuality was the hardest dimension, with every evaluated model scoring under 70 percent. This expansion is the clearest signal of where the original benchmark fit: it covered grounded factuality, and the suite added the parametric and tool-using scenarios it had explicitly left out. Grounding v2 replaces the original FACTS Grounding. ^[4]^[10]

What are the limitations of FACTS Grounding?

The benchmark is narrow by design, and that narrowness is also its main limitation. It says nothing about whether a model's world knowledge is correct, because every example supplies the facts. A model that grounds well here can still hallucinate freely when no document is attached. The task mix is restricted to question answering, summarization, and rewriting, and prompts needing reasoning, math, or domain expertise were filtered out, so the score does not speak to grounded performance on harder cognitive tasks. ^[2]

The scoring depends on LLM judges, which carries the usual caveats. The judges themselves can misjudge grounding, especially on subtle paraphrase or implicit support, and although the ensemble dampens self-preference it does not eliminate the shared blind spots that frontier models may have in common. Because the judges are specific model versions, the meaning of a score is tied to those versions and to the selected prompt templates, and it can drift as judges are updated. The all-or-nothing per-response rule is strict and can penalize a long, mostly faithful answer for one marginal claim, while the consensus rule for eligibility is lenient and may let weak answers through. The two-stage design closes the obvious gaming route of empty hedged answers, but no automated grounding metric is fully immune to gaming. Finally, the data was built by human raters whose vendor and count are not disclosed in the report, and the documents skew toward five business and professional domains, so coverage of other genres is limited. The authors themselves note that benchmarks like this can be overtaken quickly and that they intend to keep iterating. ^[1]^[2]

References

1. Google DeepMind. "FACTS Grounding: A new benchmark for evaluating the factuality of large language models." December 17, 2024. https://deepmind.google/blog/facts-grounding-a-new-benchmark-for-evaluating-the-factuality-of-large-language-models/
2. Jacovi, Alon, et al. "The FACTS Grounding Leaderboard: Benchmarking LLMs' Ability to Ground Responses to Long-Form Input." arXiv:2501.03200, January 2025. https://arxiv.org/abs/2501.03200
3. Google. "google/FACTS-grounding-public." Hugging Face Datasets. https://huggingface.co/datasets/google/FACTS-grounding-public
4. Google DeepMind. "FACTS Benchmark Suite: a new way to systematically evaluate LLMs' factuality." December 9, 2025. https://deepmind.google/blog/facts-benchmark-suite-systematically-evaluating-the-factuality-of-large-language-models/
5. FACTS Grounding Leaderboard. Kaggle. https://www.kaggle.com/benchmarks/google/facts-grounding
6. MarkTechPost. "Google DeepMind Introduces FACTS Grounding: A New AI Benchmark for Evaluating Factuality in Long-Form LLM Response." December 20, 2024. https://www.marktechpost.com/2024/12/20/google-deepmind-introduces-facts-grounding-a-new-ai-benchmark-for-evaluating-factuality-in-long-form-llm-response/
7. Hackster.io. "Google's DeepMind Aims to Fix LLMs' Lying Ways, with the FACTS Grounding Benchmark." December 2024. https://www.hackster.io/news/google-s-deepmind-aims-to-fix-llms-lying-ways-with-the-facts-grounding-benchmark-897a215a836b
8. WinBuzzer. "Google's New FACTS Benchmark Measures Truthfulness of AI Models." December 18, 2024. https://winbuzzer.com/2024/12/18/googles-new-facts-benchmark-measures-truthfulness-of-ai-models-xcxwbn/
9. EmergentMind. "FACTS Grounding Benchmark Overview." https://www.emergentmind.com/topics/facts-grounding-benchmark
10. Jacovi, Alon, et al. "The FACTS Leaderboard: A Comprehensive Benchmark for Large Language Model Factuality." arXiv:2512.10791, December 2025. https://arxiv.org/abs/2512.10791
