# LongFact / SAFE

> Source: https://aiwiki.ai/wiki/longfact
> Updated: 2026-06-08
> Categories: AI Benchmarks, AI Safety
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

## Overview

**LongFact** and **SAFE** are a paired benchmark and evaluation method for measuring the long-form factuality of [large language models](/wiki/large_language_model), introduced by researchers at [Google DeepMind](/wiki/google_deepmind) and Stanford University in the 2024 paper "Long-form factuality in large language models" [1][2]. LongFact is a set of 2,280 fact-seeking prompts spanning 38 topics, written to elicit detailed, paragraph-length responses. SAFE (Search-Augmented Factuality Evaluator) is an automated pipeline in which a language model acts as an agent: it breaks a long response into individual atomic facts, sets aside facts that are irrelevant to the prompt, and for each remaining fact issues [Google Search](/wiki/google_search) queries and reasons over the results to decide whether the fact is supported or not supported [1][3].

Alongside the data and evaluator, the authors propose a metric called **F1@K** that combines factual precision with a length-aware recall term, rewarding responses that are both accurate and sufficiently detailed [1]. The work generalizes earlier fine-grained factuality evaluation, most directly [FActScore](/wiki/factscore), from a single domain (biographies checked against Wikipedia) to an open-domain prompt set checked against live web search [1]. LongFact, SAFE, and the experiment code were released publicly [3], and the paper was published at NeurIPS 2024 [2]. The benchmark has become a widely cited reference point for evaluating [hallucination](/wiki/hallucination) in long-form generation.

## Motivation

LLMs frequently produce fluent text that contains factual errors when asked open-ended, fact-seeking questions, a failure mode commonly called hallucination [1]. Evaluating this behavior is harder for long-form answers than for short ones. A multi-sentence response can mix true and false claims, contain irrelevant statements, and vary in length, so a single right-or-wrong judgment is too coarse. Human annotation of every claim in a long answer is accurate but slow and expensive, which limits its use at the scale of modern model evaluation [1].

Prior fine-grained work, especially FActScore (Min et al., 2023), addressed part of this by decomposing a response into atomic facts and scoring each one, but it focused on a narrow setting: biographies of people, with facts verified against a fixed Wikipedia knowledge source [1]. The authors of LongFact set out to extend that idea in two directions at once: to a broad, open-domain set of fact-seeking prompts that go well beyond biographies, and to a scalable automatic evaluator that does not depend on a curated reference corpus. The result is a benchmark intended to measure how factual a model's long-form output is across many subject areas, together with an evaluator cheap enough to run at scale [1].

## LongFact (the prompt set)

LongFact is a set of 2,280 prompts that ask for long-form, fact-rich answers [1][3]. The prompts were generated using [GPT-4](/wiki/gpt_4), with the model instructed to write open-ended questions that would require a detailed factual response, after which the authors manually curated and deduplicated the set [1].

The benchmark covers 38 manually selected topics drawn from broad areas including STEM, the social sciences, the humanities, and other categories. Example topics include astronomy, biology, chemistry, computer science, machine learning, medicine, physics, economics, geography, history, philosophy, world religions, movies, music, sports, and global facts [1]. The data is split into two parallel tasks, and for each task the authors generated 30 unique prompts per topic, giving 1,140 prompts per task [1]:

| Task | What it asks about | Prompts per topic | Total prompts |
|------|--------------------|-------------------|---------------|
| LongFact-Objects | Specific entities (people, places, events, companies, and similar) | 30 | 1,140 |
| LongFact-Concepts | Abstract ideas, theories, and concepts | 30 | 1,140 |
| Combined | Both | 60 | 2,280 |

The two tasks differ in the kind of question asked. LongFact-Objects prompts request information about a concrete object or entity, whereas LongFact-Concepts prompts ask about an abstract concept within a topic [1]. A small number of topics that do not naturally support concept-style questions were handled by adjusting the topic list for the Concepts task, so that each of the 38 topics still yields 30 prompts per task in the released, deduplicated data [1][3]. The prompts are distributed as JSONL files in the project repository [3].
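Reading such a file takes only a few lines of Python. The sketch below assumes each line is a JSON object with a `prompt` key; that key name and the helper `load_prompts` are illustrative assumptions, so check them against the actual files in the repository [3].

```python
import json
from pathlib import Path


def load_prompts(path, field="prompt"):
    """Return the prompt string from every non-empty line of a JSONL file.

    `field` is an assumed key name, not something confirmed by this article.
    """
    prompts = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            record = json.loads(line)
            prompts.append(record[field])
    return prompts
```

Calling `load_prompts` on one per-topic file would then give a plain list of prompt strings to feed to the model under evaluation.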

## SAFE (the evaluator)

SAFE, the Search-Augmented Factuality Evaluator, is an automatic method for grading a long-form response without a human annotator and without a fixed reference document [1][3]. It uses a language model as an agent that runs a multi-step pipeline:

1. **Decomposition.** The model splits the long response into a list of individual atomic facts, each meant to convey a single piece of information [1].
2. **Revision for self-containment.** Each extracted fact is revised so that it is self-contained, for example by resolving pronouns and vague references using the surrounding context, so the fact can be checked on its own [1].
3. **Relevance check.** The model decides whether the fact is relevant to answering the original prompt. Facts judged not relevant are labeled irrelevant and excluded from the factuality score [1].
4. **Search and reasoning.** For each relevant fact, the agent issues Google Search queries, reads the returned snippets, and reasons over multiple steps, refining its queries as needed, to decide whether the search results support the fact [1].

Each atomic fact is ultimately placed into one of three categories: **supported**, **not supported**, or **irrelevant** [1]. Because SAFE drives a search engine rather than comparing against a single predetermined article, it can evaluate claims across the full breadth of LongFact's open-domain prompts. In the paper's implementation, the underlying language model for SAFE was GPT-3.5-Turbo, paired with Google Search results retrieved through the Serper API [1].
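The control flow of the four steps can be sketched as follows. This is a minimal illustration, not the released SAFE code: every callable passed in (`split_facts`, `make_self_contained`, `is_relevant`, `is_supported_by_search`) is a hypothetical stand-in for a component that would call a language model or a search API, and the toy lambdas at the bottom exist only to make the example runnable.

```python
from dataclasses import dataclass
from typing import Callable, List

SUPPORTED = "supported"
NOT_SUPPORTED = "not supported"
IRRELEVANT = "irrelevant"


@dataclass
class RatedFact:
    text: str
    label: str


def rate_response(
    prompt: str,
    response: str,
    split_facts: Callable[[str], List[str]],
    make_self_contained: Callable[[str, str], str],
    is_relevant: Callable[[str, str], bool],
    is_supported_by_search: Callable[[str], bool],
) -> List[RatedFact]:
    """Label every atomic fact in `response`; all callables are hypothetical."""
    rated = []
    for raw_fact in split_facts(response):              # 1. decomposition
        fact = make_self_contained(raw_fact, response)  # 2. revision
        if not is_relevant(prompt, fact):               # 3. relevance check
            rated.append(RatedFact(fact, IRRELEVANT))
            continue
        supported = is_supported_by_search(fact)        # 4. search and reasoning
        rated.append(RatedFact(fact, SUPPORTED if supported else NOT_SUPPORTED))
    return rated


# Toy stand-ins: string tricks instead of a language model and a search engine.
toy = rate_response(
    prompt="Tell me about the Eiffel Tower.",
    response="The Eiffel Tower is in Paris. It is made of cheese. I like towers.",
    split_facts=lambda text: [s.strip() for s in text.split(".") if s.strip()],
    make_self_contained=lambda fact, ctx: fact.replace("It is", "The Eiffel Tower is"),
    is_relevant=lambda prompt, fact: "Eiffel" in fact,
    is_supported_by_search=lambda fact: "Paris" in fact,
)
toy_labels = [fact.label for fact in toy]
# toy_labels == ["supported", "not supported", "irrelevant"]
```

Passing the components in as arguments keeps the sketch independent of any particular model or search provider; a real evaluator would replace each lambda with a prompted model call or a search request.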

## The F1@K metric

To turn SAFE's per-fact labels into a single response-level score, the authors define **F1@K**, a metric that balances precision against a length-sensitive notion of recall [1].

- **Precision** is the fraction of a response's rated facts that are supported, computed as the number of supported facts divided by the total number of supported plus not-supported facts (irrelevant facts are excluded) [1].
- **Recall** uses a target K, the number of supported facts a user would want before additional detail stops adding value. Recall is defined as min(S(y) / K, 1), where S(y) is the count of supported facts in the response. Recall therefore reaches its maximum of 1 at K supported facts: a response is not penalized for stopping once it has provided "enough" correct information, and it earns no extra recall credit for going beyond that point [1].
- **F1@K** is the harmonic mean of this precision and recall when at least one fact is supported, and is 0 when no facts are supported [1].
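Written out, the three definitions above take the following form. Here $S(y)$ is the number of supported facts in response $y$, as in the text; $N(y)$, the number of not-supported facts, and the names $\mathrm{Prec}$ and $R_K$ are notation introduced for this illustration.

```latex
\mathrm{Prec}(y) = \frac{S(y)}{S(y) + N(y)},
\qquad
R_K(y) = \min\!\left(\frac{S(y)}{K},\, 1\right)

F_1@K(y) =
\begin{cases}
  \dfrac{2\,\mathrm{Prec}(y)\,R_K(y)}{\mathrm{Prec}(y) + R_K(y)} & \text{if } S(y) > 0,\\[2ex]
  0 & \text{if } S(y) = 0.
\end{cases}
```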

The K parameter makes the metric adjustable to how much detail is expected. The authors report results at K = 64, the median number of facts in the responses they examined, and at K = 178, the maximum number of facts in any response in that set [1]. Without the recall term, a model could score perfectly by emitting a single trivially true sentence; F1@K rewards models that are both accurate and appropriately thorough [1].
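The effect of K is easiest to see with numbers. The sketch below computes F1@K from per-response fact counts as defined above; the function name `f1_at_k` and the example counts are invented for illustration and are not results from the paper.

```python
def f1_at_k(num_supported: int, num_not_supported: int, k: int) -> float:
    """F1@K from per-response fact counts; irrelevant facts are not counted."""
    if k <= 0:
        raise ValueError("k must be positive")
    if num_supported == 0:
        return 0.0  # defined as 0 when no fact is supported
    precision = num_supported / (num_supported + num_not_supported)
    recall = min(num_supported / k, 1.0)
    return 2 * precision * recall / (precision + recall)


# Made-up counts for illustration only.
short_answer = f1_at_k(num_supported=1, num_not_supported=0, k=64)
long_answer = f1_at_k(num_supported=80, num_not_supported=20, k=64)
long_answer_178 = f1_at_k(num_supported=80, num_not_supported=20, k=178)
```

With these invented counts, the one-fact answer scores about 0.03 at K = 64 despite perfect precision, while the longer answer scores about 0.89 at K = 64 and about 0.58 at K = 178, where 80 supported facts no longer saturate recall.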

## Findings

The authors validated SAFE against human judgment and then used it to benchmark a range of models.

For validation, they compared SAFE's labels with crowdsourced human annotations on roughly 16,000 individual facts (16,011 facts drawn from 496 prompt-response pairs) [1]. SAFE agreed with the human annotators on 72 percent of these facts [1]. On a random subset of 100 cases where SAFE and the humans disagreed, the authors re-examined each case and found that SAFE's label was correct in 76 percent of them, the human annotation was correct in 19 percent, and neither was correct in the remaining 5 percent [1]. The paper frames this as SAFE matching or exceeding human raters on this sample, at much lower cost: SAFE was reported to cost about 0.19 US dollars per model response versus about 4.00 US dollars for human annotation, more than 20 times cheaper [1].
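The agreement figure is a per-fact match rate between two sets of labels, and the cost comparison is a simple ratio. A minimal sketch of both calculations follows; the helper `agreement_rate` and the toy label lists are invented for illustration, and only the two per-response costs come from the text above.

```python
def agreement_rate(labels_a, labels_b):
    """Fraction of positions where two equal-length label sequences match."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label sequences must have the same length")
    if not labels_a:
        raise ValueError("need at least one label")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


# Toy labels, invented for illustration only.
safe_labels = ["supported", "supported", "not supported", "irrelevant"]
human_labels = ["supported", "not supported", "not supported", "irrelevant"]
toy_agreement = agreement_rate(safe_labels, human_labels)  # 3 of 4 labels match

# Per-response costs in US dollars, as reported in the text above.
cost_ratio = 4.00 / 0.19  # roughly 21
```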

Using LongFact-Objects with SAFE and F1@K, the authors benchmarked 13 language models across four model families: [Gemini](/wiki/gemini), [GPT](/wiki/gpt_4), [Claude](/wiki/claude), and [PaLM-2](/wiki/palm_2) [1]. The headline finding was that larger models within a family generally achieve better long-form factuality, consistent with the broader trend that scale tends to improve factual reliability [1]. The released code lets others reproduce these evaluations and run SAFE on new models [3].

A practical caveat, noted by the authors and by later commentary, is that SAFE's judgments depend on the quality and coverage of Google Search results and on the reasoning of its underlying language model, so it inherits the limitations of both. The 72 percent agreement figure and the disagreement-case results are specific to the sampled facts and annotation setup used in the paper [1].

## Relationship to FActScore

LongFact and SAFE build directly on FActScore, an earlier method that decomposes a generated response into atomic facts and scores the fraction that are supported [1]. The two share the atomic-fact decomposition idea, but they differ in scope and mechanism:

- **Domain.** FActScore was designed for biographical generation, scoring facts about named individuals. LongFact spans 38 topics across STEM, the social sciences, the humanities, and other categories [1].
- **Knowledge source.** FActScore verifies facts against a fixed reference corpus such as Wikipedia, whereas SAFE issues live Google Search queries, removing the dependence on a curated knowledge base and allowing open-domain coverage [1].
- **Scoring.** FActScore reports a precision-style fraction of supported facts. F1@K adds a length-aware recall term so that brevity does not artificially inflate the score, balancing accuracy against sufficient detail [1].

In this sense LongFact and SAFE can be read as a generalization of FActScore to the open-domain, long-form setting, paired with a metric that captures both how accurate and how complete a response is. The benchmark is now commonly used as a reference for evaluating hallucination and factuality in long-form text generation [1][2].

## References

1. Wei, Jerry; Yang, Chengrun; Song, Xinying; Lu, Yifeng; Hu, Nathan; Huang, Jie; Tran, Dustin; Peng, Daiyi; Liu, Ruibo; Huang, Da; Du, Cosmo; Le, Quoc V. "Long-form factuality in large language models." arXiv:2403.18802, March 2024. https://arxiv.org/abs/2403.18802
2. "Long-form factuality in large language models." Advances in Neural Information Processing Systems (NeurIPS) 2024. https://proceedings.neurips.cc/paper_files/paper/2024/hash/937ae0e83eb08d2cb8627fe1def8c751-Abstract-Conference.html
3. google-deepmind/long-form-factuality. GitHub repository (LongFact data, SAFE code, F1@K implementation). https://github.com/google-deepmind/long-form-factuality

