LongFact / SAFE

Updated Jun 8, 2026

Overview

LongFact and SAFE are a paired benchmark and evaluation method for measuring the long-form factuality of large language models, introduced by researchers at Google DeepMind and Stanford University in the 2024 paper "Long-form factuality in large language models" ^[1]^[2]. LongFact is a set of 2,280 fact-seeking prompts spanning 38 topics, written to elicit detailed, paragraph-length responses ^[1]^[3]. SAFE (Search-Augmented Factuality Evaluator) is an automated pipeline in which a language model acts as an agent: it breaks a long response into individual atomic facts, checks whether each fact is relevant to the prompt, and for each relevant fact issues Google Search queries and reasons over the results to decide whether the fact is supported or not supported ^[1]^[3].

Alongside the data and evaluator, the authors propose a metric called F1@K that combines factual precision with a length-aware recall term, rewarding responses that are both accurate and sufficiently detailed ^[1]. The work generalizes earlier fine-grained factuality evaluation, most directly FActScore, from a single domain (biographies checked against Wikipedia) to an open-domain prompt set checked against live web search ^[1]. LongFact, SAFE, and the experiment code were released publicly ^[3], and the paper was published at NeurIPS 2024 ^[2]. The benchmark has become a widely cited reference point for evaluating hallucination in long-form generation.

Motivation

LLMs frequently produce fluent text that contains factual errors when asked open-ended, fact-seeking questions, a failure mode commonly called hallucination ^[1]. Evaluating this behavior is harder for long-form answers than for short ones. A multi-sentence response can mix true and false claims, contain irrelevant statements, and vary in length, so a single right-or-wrong judgment is too coarse. Human annotation of every claim in a long answer is accurate but slow and expensive, which limits its use at the scale of modern model evaluation ^[1].

Prior fine-grained work, especially FActScore (Min et al., 2023), addressed part of this by decomposing a response into atomic facts and scoring each one, but it focused on a narrow setting: biographies of people, with facts verified against a fixed Wikipedia knowledge source ^[1]. The authors of LongFact set out to extend that idea in two directions at once: to a broad, open-domain set of fact-seeking prompts that go well beyond biographies, and to a scalable automatic evaluator that does not depend on a curated reference corpus. The result is a benchmark intended to measure how factual a model's long-form output is across many subject areas, together with an evaluator cheap enough to run at scale ^[1].

LongFact (the prompt set)

LongFact is a set of 2,280 prompts that ask for long-form, fact-rich answers ^[1]^[3]. The prompts were generated using GPT-4, with the model instructed to write open-ended questions that would require a detailed factual response, after which the authors manually curated and deduplicated the set ^[1].

The benchmark covers 38 manually selected topics drawn from broad areas including STEM, the social sciences, the humanities, and other categories. Example topics include astronomy, biology, chemistry, computer science, machine learning, medicine, physics, economics, geography, history, philosophy, world religions, movies, music, sports, and global facts ^[1]. For each topic the authors generated 30 unique prompts, and the data is split into two parallel tasks ^[1]:

Task	What it asks about	Prompts per topic	Total prompts
LongFact-Objects	Specific entities (people, places, events, companies, and similar)	30	1,140
LongFact-Concepts	Abstract ideas, theories, and concepts	30	1,140
Combined	Both	60	2,280

The two tasks differ in the kind of question asked. LongFact-Objects prompts request information about a concrete object or entity, whereas LongFact-Concepts prompts ask about an abstract concept within a topic ^[1]. A small number of topics that do not naturally support concept-style questions were handled by adjusting the topic list for the Concepts task, so that each of the 38 topics still yields 30 prompts per task in the released, deduplicated data ^[1]^[3]. The prompts are distributed as JSONL files in the project repository ^[3].
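
As a rough illustration of working with prompts stored as JSONL, the sketch below reads one JSON object per line and pulls out the prompt text. The field name `prompt`, the helper `iter_prompts`, and the two sample records are assumptions made for this example, not taken from the released files; check the repository for the actual schema. The constants only restate the counts given above.

```python
import json
from typing import Iterable, Iterator

# Counts stated in the article: 38 topics, 30 prompts per topic per task, 2 tasks.
TOPICS = 38
PROMPTS_PER_TOPIC_PER_TASK = 30
TASKS = 2


def iter_prompts(lines: Iterable[str], field: str = "prompt") -> Iterator[str]:
    """Yield the prompt text from each non-empty JSONL line.

    `field` is an assumed key name; adjust it to match the real files.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        yield record[field]


# Two invented records standing in for a real file (not actual LongFact prompts).
sample_lines = [
    '{"prompt": "Who is Ada Lovelace?"}',
    "",
    '{"prompt": "What is the Doppler effect?"}',
]
prompts = list(iter_prompts(sample_lines))

per_task = TOPICS * PROMPTS_PER_TOPIC_PER_TASK
total = per_task * TASKS
print(prompts)
print(per_task, total)
```

With a real file, the same helper could be fed an open file handle, since iterating over a text file yields one line at a time.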

SAFE (the evaluator)

SAFE, the Search-Augmented Factuality Evaluator, is an automatic method for grading a long-form response without a human annotator and without a fixed reference document ^[1]^[3]. It uses a language model as an agent that runs a multi-step pipeline:

Decomposition. The model splits the long response into a list of individual atomic facts, each meant to convey a single piece of information ^[1].
Revision for self-containment. Each extracted fact is revised so that it is self-contained, for example by resolving pronouns and vague references using the surrounding context, so the fact can be checked on its own ^[1].
Relevance check. The model decides whether the fact is relevant to answering the original prompt. Facts judged not relevant are labeled irrelevant and excluded from the factuality score ^[1].
Search and reasoning. For each relevant fact, the agent issues Google Search queries, reads the returned snippets, and reasons over multiple steps, refining its queries as needed, to decide whether the search results support the fact ^[1].

Each atomic fact is ultimately placed into one of three categories: supported, not supported, or irrelevant ^[1]. Because SAFE drives a search engine rather than comparing against a single predetermined article, it can evaluate claims across the full breadth of LongFact's open-domain prompts. In the paper's implementation, the underlying language model for SAFE was GPT-3.5-Turbo, paired with Google Search calls made through the Serper API ^[1].
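
The control flow of the four steps can be sketched as follows. This is a minimal illustration, not the released SAFE code: the function `rate_response`, the `RatedFact` type, and the four callables are hypothetical names, and the toy stand-ins replace the language-model and search calls so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable, List

SUPPORTED = "supported"
NOT_SUPPORTED = "not supported"
IRRELEVANT = "irrelevant"


@dataclass
class RatedFact:
    text: str
    label: str


def rate_response(
    prompt: str,
    response: str,
    split_facts: Callable[[str], List[str]],
    make_self_contained: Callable[[str, str], str],
    is_relevant: Callable[[str, str], bool],
    is_supported_by_search: Callable[[str], bool],
) -> List[RatedFact]:
    """Label every atomic fact in `response` (all four callables are hypothetical)."""
    rated = []
    for fact in split_facts(response):                # step 1: decomposition
        fact = make_self_contained(fact, response)    # step 2: revision
        if not is_relevant(prompt, fact):             # step 3: relevance check
            rated.append(RatedFact(fact, IRRELEVANT))
            continue
        # Step 4: in SAFE this is a multi-step search-and-reason loop;
        # here it is collapsed into a single yes/no callable.
        label = SUPPORTED if is_supported_by_search(fact) else NOT_SUPPORTED
        rated.append(RatedFact(fact, label))
    return rated


# Toy stand-ins so the sketch runs without an LLM or a search API.
KNOWN_TRUE = {"Paris is the capital of France."}


def toy_split(response: str) -> List[str]:
    return [s.strip() + "." for s in response.split(".") if s.strip()]


def toy_revise(fact: str, context: str) -> str:
    return fact.replace("It is", "Paris is")  # crude pronoun resolution


def toy_relevant(prompt: str, fact: str) -> bool:
    return "Paris" in fact


def toy_search(fact: str) -> bool:
    return fact in KNOWN_TRUE


response = "It is the capital of France. Paris is in Spain. I like cheese."
rated = rate_response(
    "Tell me about Paris.", response,
    toy_split, toy_revise, toy_relevant, toy_search,
)
labels = [r.label for r in rated]
print(labels)
```

Passing the model and search behavior in as callables keeps the pipeline order visible and lets each step be swapped or tested in isolation.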

The F1@K metric

To turn SAFE's per-fact labels into a single response-level score, the authors define F1@K, a metric that balances precision against a length-sensitive notion of recall ^[1].

Precision is the fraction of a response's rated facts that are supported, computed as the number of supported facts divided by the total number of supported plus not-supported facts (irrelevant facts are excluded) ^[1].
Recall uses a target K, the number of supported facts a user would want before additional detail stops adding value. Recall is defined as min(S(y) / K, 1), where S(y) is the count of supported facts in the response. Recall therefore grows linearly until the response contains K supported facts and is capped at 1 beyond that point, so a response earns no extra recall for supported facts past K and is not penalized for stopping once it has provided "enough" correct information ^[1].
F1@K is the harmonic mean of this precision and recall when at least one fact is supported, and is 0 when no facts are supported ^[1].
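
Written out, the three definitions above take the following form. Here S(y) is the number of supported facts in response y, as in the text; N(y) for the number of not-supported facts and the names Prec and R_K are notation chosen for this summary.

```latex
\mathrm{Prec}(y) = \frac{S(y)}{S(y) + N(y)}, \qquad
R_K(y) = \min\!\left(\frac{S(y)}{K},\, 1\right)

F_1@K(y) =
\begin{cases}
\dfrac{2\,\mathrm{Prec}(y)\,R_K(y)}{\mathrm{Prec}(y) + R_K(y)} & \text{if } S(y) > 0,\\[4pt]
0 & \text{if } S(y) = 0.
\end{cases}
```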

The K parameter makes the metric adjustable to how much detail is expected. The authors report results at K = 64, the median number of facts in the responses they examined, and at K = 178, the maximum number of facts in any response in that set ^[1]. Without the recall term, a model could score perfectly by emitting a single trivially true sentence; F1@K rewards models that are both accurate and appropriately thorough ^[1].
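
A small worked example makes the effect of the recall term concrete. The function below is a sketch written directly from the definitions in this section, not the repository's implementation; the name `f1_at_k` and the two example fact counts are chosen for illustration.

```python
def f1_at_k(supported: int, not_supported: int, k: int) -> float:
    """F1@K from counts of supported and not-supported facts.

    Irrelevant facts are excluded before calling this function.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if supported == 0:
        return 0.0  # defined as 0 when no fact is supported
    precision = supported / (supported + not_supported)
    recall = min(supported / k, 1.0)
    return 2 * precision * recall / (precision + recall)


# A single trivially true sentence: perfect precision, almost no recall.
one_true_sentence = f1_at_k(supported=1, not_supported=0, k=64)

# A detailed answer with some errors: precision 0.8, recall capped at 1.
detailed = f1_at_k(supported=64, not_supported=16, k=64)

print(round(one_true_sentence, 3))  # 0.031
print(round(detailed, 3))           # 0.889
```

At K = 64 the one-sentence response scores 2/65, while the longer response with 16 unsupported facts scores 8/9, so precision alone would rank the two responses in the opposite order.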

Findings

The authors validated SAFE against human judgment and then used it to benchmark a range of models.

For validation, they compared SAFE's labels with crowdsourced human annotations on roughly 16,000 individual facts (16,011 facts drawn from 496 prompt-response pairs) ^[1]. SAFE agreed with the human annotators on 72.0 percent of these facts ^[1]. On a random subset of 100 cases where SAFE and the humans disagreed, the authors re-examined each case and found that SAFE's label was correct 76 percent of the time and the human annotation was correct 19 percent of the time, with neither correct in the remaining 5 percent ^[1]. The paper frames this as SAFE matching or exceeding human raters on this sample, at much lower cost: SAFE was reported to cost about 0.19 US dollars per model response versus about 4.00 US dollars for human annotation, more than 20 times cheaper ^[1].
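
The two headline numbers in this comparison are simple ratios. The sketch below shows how a per-fact agreement rate is computed and checks the cost ratio from the reported per-response figures. The helper `agreement_rate` and the five-item label lists are invented for illustration and are not data from the paper.

```python
from typing import Sequence


def agreement_rate(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Fraction of positions where two raters gave the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


# Invented toy labels (not from the paper), just to exercise the function.
safe_labels = ["supported", "supported", "not supported", "supported", "not supported"]
human_labels = ["supported", "not supported", "not supported", "supported", "supported"]
toy_agreement = agreement_rate(safe_labels, human_labels)

# Per-response costs as reported in the article, in US dollars.
SAFE_COST = 0.19
HUMAN_COST = 4.00
cost_ratio = HUMAN_COST / SAFE_COST

print(toy_agreement)
print(round(cost_ratio, 1))  # 21.1
```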

Using LongFact-Objects with SAFE and F1@K, the authors benchmarked 13 language models across four model families, Gemini, GPT, Claude, and PaLM-2 ^[1]. The headline finding was that larger models within a family generally achieve better long-form factuality, consistent with the broader trend that scale tends to improve factual reliability ^[1]. The released code lets others reproduce these evaluations and run SAFE on new models ^[3].

A practical caveat, noted by the authors and by later commentary, is that SAFE's judgments depend on the quality and coverage of Google Search results and on the reasoning of its underlying language model, so it inherits the limitations of both. The 72 percent agreement figure and the disagreement-case results are specific to the sampled facts and annotation setup used in the paper ^[1].

Relationship to FActScore

LongFact and SAFE build directly on FActScore, an earlier method that decomposes a generated response into atomic facts and scores the fraction that are supported ^[1]. The two share the atomic-fact decomposition idea, but they differ in scope and mechanism:

Domain. FActScore was designed for biographical generation, scoring facts about named individuals. LongFact spans 38 topics across STEM, the social sciences, the humanities, and other categories ^[1].
Knowledge source. FActScore verifies facts against a fixed reference corpus such as Wikipedia, whereas SAFE issues live Google Search queries, removing the dependence on a curated knowledge base and allowing open-domain coverage ^[1].
Scoring. FActScore reports a precision-style fraction of supported facts. F1@K adds a length-aware recall term so that brevity does not artificially inflate the score, balancing accuracy against sufficient detail ^[1].

In this sense LongFact and SAFE can be read as a generalization of FActScore to the open-domain, long-form setting, paired with a metric that captures both how accurate and how complete a response is. The benchmark is now commonly used as a reference for evaluating hallucination and factuality in long-form text generation ^[1]^[2].

References

1. Wei, Jerry; Yang, Chengrun; Song, Xinying; Lu, Yifeng; Hu, Nathan; Huang, Jie; Tran, Dustin; Peng, Daiyi; Liu, Ruibo; Huang, Da; Du, Cosmo; Le, Quoc V. "Long-form factuality in large language models." arXiv:2403.18802, March 2024. https://arxiv.org/abs/2403.18802
2. "Long-form factuality in large language models." Advances in Neural Information Processing Systems (NeurIPS) 2024. https://proceedings.neurips.cc/paper_files/paper/2024/hash/937ae0e83eb08d2cb8627fe1def8c751-Abstract-Conference.html
3. google-deepmind/long-form-factuality. GitHub repository (LongFact data, SAFE code, F1@K implementation). https://github.com/google-deepmind/long-form-factuality
