LongFact / SAFE
Last reviewed: Jun 8, 2026
Sources: 3 citations
Review status: Source-backed
Revision: v1 · 1,657 words
LongFact and SAFE are a paired benchmark and evaluation method for measuring the long-form factuality of large language models, introduced by researchers at Google DeepMind and Stanford University in the 2024 paper "Long-form factuality in large language models" [1][2]. LongFact is a prompt set of 2,280 fact-seeking prompts spanning 38 topics that elicit detailed, paragraph-length responses, and SAFE (Search-Augmented Factuality Evaluator) is an automated pipeline in which a language model acts as an agent: it breaks a long response into individual atomic facts and, for each one, issues Google Search queries and reasons over the results to decide whether the fact is supported, irrelevant, or not supported [1][3].
Alongside the data and evaluator, the authors propose a metric called F1@K that combines factual precision with a length-aware recall term, rewarding responses that are both accurate and sufficiently detailed [1]. The work generalizes earlier fine-grained factuality evaluation, most directly FActScore, from a single domain (biographies checked against Wikipedia) to an open-domain prompt set checked against live web search [1]. LongFact, SAFE, and the experiment code were released publicly [3], and the paper was published at NeurIPS 2024 [2]. The benchmark has become a widely cited reference point for evaluating hallucination in long-form generation.
Large language models (LLMs) frequently produce fluent text that contains factual errors when asked open-ended, fact-seeking questions, a failure mode commonly called hallucination [1]. Evaluating this behavior is harder for long-form answers than for short ones: a multi-sentence response can mix true and false claims, contain irrelevant statements, and vary in length, so a single right-or-wrong judgment is too coarse. Human annotation of every claim in a long answer is accurate but slow and expensive, which limits its use at the scale of modern model evaluation [1].
Prior fine-grained work, especially FActScore (Min et al., 2023), addressed part of this by decomposing a response into atomic facts and scoring each one, but it focused on a narrow setting: biographies of people, with facts verified against a fixed Wikipedia knowledge source [1]. The authors of LongFact set out to extend that idea in two directions at once: to a broad, open-domain set of fact-seeking prompts that go well beyond biographies, and to a scalable automatic evaluator that does not depend on a curated reference corpus. The result is a benchmark intended to measure how factual a model's long-form output is across many subject areas, together with an evaluator cheap enough to run at scale [1].
LongFact is a set of 2,280 prompts that ask for long-form, fact-rich answers [1][3]. The prompts were generated using GPT-4, with the model instructed to write open-ended questions that would require a detailed factual response, after which the authors manually curated and deduplicated the set [1].
The benchmark covers 38 manually selected topics drawn from broad areas including STEM, the social sciences, the humanities, and other categories. Example topics include astronomy, biology, chemistry, computer science, machine learning, medicine, physics, economics, geography, history, philosophy, world religions, movies, music, sports, and global facts [1]. The data is split into two parallel tasks, each with 30 unique prompts per topic [1]:
| Task | What it asks about | Prompts per topic | Total prompts |
|---|---|---|---|
| LongFact-Objects | Specific entities (people, places, events, companies, and similar) | 30 | 1,140 |
| LongFact-Concepts | Abstract ideas, theories, and concepts | 30 | 1,140 |
| Combined | Both | 60 | 2,280 |
The two tasks differ in the kind of question asked. LongFact-Objects prompts request information about a concrete object or entity, whereas LongFact-Concepts prompts ask about an abstract concept within a topic [1]. A small number of topics that do not naturally support concept-style questions were handled by adjusting the topic list for the Concepts task, so that each of the 38 topics still yields 30 prompts per task in the released, deduplicated data [1][3]. The prompts are distributed as JSONL files in the project repository [3].
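As a rough illustration of working with per-topic JSONL files, the sketch below reads one JSON object per line and collects the prompt text. The key name `prompt`, the file name, and the helper `load_prompts` are assumptions made for this example rather than details confirmed from the repository, so check the released files before relying on them. The final lines only re-derive the totals in the table above.

```python
import json
import tempfile
from pathlib import Path


def load_prompts(path, field="prompt"):
    """Return the text stored under `field` for each JSON line in `path`.

    `field="prompt"` is an assumed key name, not a confirmed schema.
    """
    prompts = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            if field in record:  # skip lines that carry no prompt text
                prompts.append(record[field])
    return prompts


# Tiny stand-in file so the sketch runs without the real dataset.
sample_records = [
    {"prompt": "Placeholder object-style question 1"},
    {"prompt": "Placeholder object-style question 2"},
]
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample_topic.jsonl"  # hypothetical file name
    sample_path.write_text(
        "\n".join(json.dumps(r) for r in sample_records), encoding="utf-8"
    )
    loaded = load_prompts(sample_path)

# Sizes implied by the table: 38 topics x 30 prompts per task, two tasks.
TOPICS, PROMPTS_PER_TOPIC, TASKS = 38, 30, 2
per_task_total = TOPICS * PROMPTS_PER_TOPIC  # 1,140
combined_total = per_task_total * TASKS      # 2,280
```

Comparing the length of the loaded list against these totals is a cheap sanity check after downloading the data.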
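SAFE, the Search-Augmented Factuality Evaluator, is an automatic method for grading a long-form response without a human annotator and without a fixed reference document [1][3]. It uses a language model as an agent that runs a multi-step pipeline: it splits the response into individual atomic facts, revises each fact so that it is self-contained, decides whether each fact is relevant to the prompt, and, for each relevant fact, issues Google Search queries and reasons over the results to decide whether the fact is supported [1].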
Each atomic fact is ultimately labeled supported, not supported, or irrelevant; only facts judged relevant to the prompt are checked against search results [1]. Because SAFE drives a search engine rather than comparing against a single predetermined article, it can evaluate claims across the full breadth of LongFact's open-domain prompts. In the paper's implementation, the underlying language model for SAFE was GPT-3.5-Turbo, paired with Google Search queries issued through the Serper API [1].
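The control flow of such a pipeline can be sketched in a few lines. This is a minimal illustration, not the released implementation: the function `rate_response` and the four callables passed to it are hypothetical names, and the toy stand-ins below replace what would be language-model calls and search queries in a real system.

```python
from collections import Counter

SUPPORTED = "supported"
NOT_SUPPORTED = "not supported"
IRRELEVANT = "irrelevant"


def rate_response(prompt, response, split_facts, make_self_contained,
                  is_relevant, is_supported_by_search):
    """Label each atomic fact in `response`.

    All four callables are hypothetical hooks. In a real evaluator each one
    would wrap a language-model call, and `is_supported_by_search` would also
    issue search queries and reason over the results.
    """
    labeled = []
    for raw_fact in split_facts(response):
        fact = make_self_contained(raw_fact, response)  # e.g. resolve pronouns
        if not is_relevant(fact, prompt):
            label = IRRELEVANT          # never sent to search
        elif is_supported_by_search(fact):
            label = SUPPORTED
        else:
            label = NOT_SUPPORTED
        labeled.append((fact, label))
    return labeled


# Toy stand-ins so the sketch runs offline; they do no real reasoning.
def toy_split(response):
    return [s.strip() for s in response.split(".") if s.strip()]


def toy_self_contained(fact, response):
    return "The Eiffel Tower " + fact[3:] if fact.startswith("It ") else fact


def toy_relevant(fact, prompt):
    return "Eiffel Tower" in fact


TOY_SEARCH_INDEX = {"The Eiffel Tower is in Paris"}  # pretend search evidence


def toy_supported(fact):
    return fact in TOY_SEARCH_INDEX


labeled = rate_response(
    "Tell me about the Eiffel Tower.",
    "The Eiffel Tower is in Paris. It was completed in 1789. I love croissants.",
    toy_split, toy_self_contained, toy_relevant, toy_supported,
)
counts = Counter(label for _, label in labeled)
```

With these stand-ins the three sentences receive one label each: the first is supported, the second (with its pronoun resolved) is not supported, and the third is irrelevant. Passing the steps in as callables keeps the sketch independent of any particular model or search provider.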
To turn SAFE's per-fact labels into a single response-level score, the authors define F1@K, a metric that balances precision against a length-sensitive notion of recall [1].
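Written out, the definition can be stated as follows. Here $S(y)$ is the number of supported facts in a response $y$ and $N(y)$ is the number of not-supported facts; irrelevant facts are excluded from both counts. The notation is a reconstruction for this article and may differ cosmetically from the paper's.

$$
\mathrm{Prec}(y) = \frac{S(y)}{S(y) + N(y)}, \qquad
R_K(y) = \min\!\left(\frac{S(y)}{K},\; 1\right)
$$

$$
F_1@K(y) =
\begin{cases}
\dfrac{2\,\mathrm{Prec}(y)\,R_K(y)}{\mathrm{Prec}(y) + R_K(y)} & \text{if } S(y) > 0,\\[2ex]
0 & \text{if } S(y) = 0.
\end{cases}
$$

Recall saturates at 1 once a response contains $K$ supported facts, so additional supported facts beyond $K$ no longer raise the score.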
The K parameter makes the metric adjustable to how much detail is expected. The authors report results at K = 64, the median number of facts in the responses they examined, and at K = 178, the maximum number of facts in any response in that set [1]. Without the recall term, a model could score perfectly by emitting a single trivially true sentence; F1@K rewards models that are both accurate and appropriately thorough [1].
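A small sketch makes the behavior concrete. It assumes F1@K is the harmonic mean of precision (supported facts over supported plus not-supported facts) and a recall term capped at 1 (supported facts over K), with a score of 0 when no fact is supported. The function name `f1_at_k` and the example counts are illustrative, not taken from the released code.

```python
def f1_at_k(supported: int, not_supported: int, k: int) -> float:
    """Combine precision with a length-aware recall capped at 1.

    Irrelevant facts are assumed to be excluded from both counts.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    if supported < 0 or not_supported < 0:
        raise ValueError("fact counts must be non-negative")
    if supported == 0:
        return 0.0  # no supported facts: score is defined as zero
    precision = supported / (supported + not_supported)
    recall = min(supported / k, 1.0)  # saturates once k facts are supported
    return 2 * precision * recall / (precision + recall)


one_true_sentence = f1_at_k(1, 0, k=64)    # perfect precision, tiny recall
balanced = f1_at_k(48, 16, k=64)           # precision 0.75, recall 0.75
long_and_accurate = f1_at_k(100, 0, k=64)  # recall capped at 1

print(f"{one_true_sentence:.4f} {balanced:.2f} {long_and_accurate:.2f}")
# prints "0.0308 0.75 1.00"
```

The first case shows why the recall term matters: a single correct fact has perfect precision but scores about 0.03 at K = 64.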
The authors validated SAFE against human judgment and then used it to benchmark a range of models.
For validation, they compared SAFE's labels with crowdsourced human annotations on 16,011 individual facts drawn from 496 prompt-response pairs [1]. SAFE agreed with the human annotators on 72.0 percent of these facts [1]. On a random subset of 100 cases where SAFE and the humans disagreed, the authors re-examined each case and found that SAFE's label was correct 76 percent of the time, whereas the human annotation was correct only 19 percent of the time [1]. The paper frames this as SAFE matching or exceeding human raters on this sample, at much lower cost: SAFE was reported to cost about 0.19 US dollars per model response versus about 4.00 US dollars for human annotation, more than 20 times cheaper [1].
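The headline figures can be checked with simple arithmetic. The sketch below uses only the numbers quoted above; the fact counts it derives are back-of-envelope estimates from a rounded percentage, so they may differ slightly from the exact counts in the paper.

```python
total_facts = 16_011       # facts with human annotations
agreement_rate = 0.72      # share of facts where SAFE and humans agreed
safe_cost_usd = 0.19       # reported cost per model response, SAFE
human_cost_usd = 4.00      # reported cost per model response, human annotation

# Approximate counts, derived from the rounded agreement rate.
approx_agreed = round(total_facts * agreement_rate)
approx_disagreed = total_facts - approx_agreed

cost_ratio = human_cost_usd / safe_cost_usd

print(f"agreed ~{approx_agreed}, disagreed ~{approx_disagreed}")
print(f"human annotation costs about {cost_ratio:.1f}x as much as SAFE")
```

The ratio comes out at roughly 21, consistent with the "more than 20 times cheaper" description. The 100 re-examined cases are a small sample of the roughly 4,500 disagreements.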
Using LongFact-Objects with SAFE and F1@K, the authors benchmarked 13 language models across four model families, Gemini, GPT, Claude, and PaLM-2 [1]. The headline finding was that larger models within a family generally achieve better long-form factuality, consistent with the broader trend that scale tends to improve factual reliability [1]. The released code lets others reproduce these evaluations and run SAFE on new models [3].
A practical caveat, noted by the authors and by later commentary, is that SAFE's judgments depend on the quality and coverage of Google Search results and on the reasoning of its underlying language model, so it inherits the limitations of both. The 72 percent agreement figure and the disagreement-case results are specific to the sampled facts and annotation setup used in the paper [1].
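LongFact and SAFE build directly on FActScore, an earlier method that decomposes a generated response into atomic facts and scores the fraction that are supported [1]. The two share the atomic-fact decomposition idea but differ in scope and mechanism. FActScore targets biographies of people and verifies facts against a fixed Wikipedia knowledge source, and its score is the fraction of supported facts, a precision-style measure. LongFact spans 38 open-domain topics, SAFE verifies facts with live Google Search queries rather than a curated corpus, and F1@K adds a length-aware recall term to precision [1].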
In this sense LongFact and SAFE can be read as a generalization of FActScore to the open-domain, long-form setting, paired with a metric that captures both how accurate and how complete a response is. The benchmark is now commonly used as a reference for evaluating hallucination and factuality in long-form text generation [1][2].