# FActScore

> Source: https://aiwiki.ai/wiki/factscore
> Updated: 2026-06-08
> Categories: AI Benchmarks, AI Safety
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

## Overview

FActScore (Factual precision in Atomicity Score) is an evaluation method and metric, introduced in 2023, for measuring the factual precision of long-form text generated by [large language models](/wiki/large_language_model). Rather than assigning a single right-or-wrong label to a whole passage, FActScore decomposes a generation into a set of short "atomic facts" and reports the percentage of those atomic facts that are supported by a reliable knowledge source, such as Wikipedia [1][2].

The method was presented in the paper "FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation" by Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Wei Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi, with affiliations at the University of Washington, the Allen Institute for AI, Meta AI, and the University of Massachusetts Amherst. It was published at the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023) in Singapore [1][2]. FActScore became one of the standard reference metrics for long-form factuality and [hallucination](/wiki/hallucination) evaluation, and it directly inspired later methods including the Search-Augmented Factuality Evaluator (SAFE) and the [LongFact](/wiki/longfact) prompt set [3].

## Motivation

Evaluating the factuality of long-form generations is harder than scoring short factual answers. The FActScore authors identify two core problems [1][2]:

1. A single long-form response usually contains a mixture of supported and unsupported pieces of information, so a binary "correct" or "incorrect" judgment for the whole passage is too coarse to be informative.
2. Careful human evaluation of factuality is slow and expensive, which makes it impractical to compare many models at scale.

FActScore addresses the first problem by moving to a fine-grained, claim-level unit of analysis: it asks what fraction of the individual facts in a passage are actually supported. This yields a continuous precision score between 0 and 1 instead of a coarse label, distinguishing a response that is mostly accurate with a few errors from one that is largely fabricated. It addresses the second problem by also providing an automated estimator that approximates the human metric at a small fraction of the cost.

## How FActScore works

Given a model generation, FActScore proceeds in two stages [1][2]:

- Decomposition into atomic facts. The passage is broken into a list of atomic facts, where an atomic fact is defined as "a short sentence conveying one piece of information." This decomposition is itself performed by a language model; the paper used InstructGPT (text-davinci-003) to generate the atomic facts. For example, a biography sentence can split into several atomic facts about a person's birth year, profession, and notable works. In the human-evaluation study, annotators revised the automatically generated decompositions, adjusting splitting decisions in about 18 percent of cases and merging decisions in about 34 percent of cases.
- Support checking. Each atomic fact is checked against a given knowledge source, which is treated as the authoritative reference, and labeled as supported or not supported. The public release uses a Wikipedia dump dated April 1, 2023 by default, though users can register a custom knowledge source. In the human annotation, facts that are clearly unrelated to the prompt are labeled irrelevant and excluded from scoring.

The FActScore of a single generation y is the fraction of its atomic facts that are supported. Formally, if A(y) is the set of atomic facts extracted from y and C is the knowledge source, the per-generation score is the mean, over all facts in A(y), of an indicator function that equals 1 when a fact is supported by C and 0 otherwise. The reported FActScore for a model is the expectation of this fraction across the prompts to which the model responds [1]. Because the metric measures precision only, it is usually read alongside complementary statistics, such as how often the model declines to answer and how many facts it provides per response.
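The scoring rule can be illustrated with a minimal Python sketch. It assumes the atomic facts have already been extracted and labeled; the label strings, the function names, and the convention of representing an abstention as an empty label list are assumptions made for this example, not part of the official package.

```python
from typing import Optional, Sequence

# Label vocabulary assumed for this sketch (names are illustrative).
SUPPORTED = "supported"
NOT_SUPPORTED = "not_supported"
IRRELEVANT = "irrelevant"


def generation_factscore(labels: Sequence[str]) -> Optional[float]:
    """Fraction of scored atomic facts that are supported.

    Irrelevant facts are dropped before scoring. Returns None when nothing
    can be scored, e.g. the model abstained or every fact was irrelevant.
    """
    scored = [label for label in labels if label != IRRELEVANT]
    if not scored:
        return None
    return sum(label == SUPPORTED for label in scored) / len(scored)


def model_factscore(per_prompt_labels: Sequence[Sequence[str]]) -> Optional[float]:
    """Mean per-generation score over the prompts the model actually answered."""
    scores = [generation_factscore(labels) for labels in per_prompt_labels]
    answered = [score for score in scores if score is not None]
    if not answered:
        return None
    return sum(answered) / len(answered)


# Made-up labels for three prompts.
labels_by_prompt = [
    [SUPPORTED, SUPPORTED, NOT_SUPPORTED, SUPPORTED],  # 3 of 4 supported
    [SUPPORTED, NOT_SUPPORTED, IRRELEVANT],            # 1 of 2 scored facts supported
    [],                                                # abstention: skipped
]
print(model_factscore(labels_by_prompt))  # 0.625
```

Averaging per-generation fractions, rather than pooling all facts across prompts, keeps a long response from outweighing a short one in the model-level score.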

The original benchmark instantiated this on biographies of people drawn from Wikipedia entities, a domain chosen because the facts are concrete and verifiable. The public dataset includes a labeled set of entities used for validation and a larger unlabeled set of entities used for broader evaluation [1][2].

## Automated estimation

Because human annotation is costly, the paper introduces an automated estimator that approximates the human FActScore using retrieval plus a strong language model. For each atomic fact, the system retrieves relevant passages about the target entity from the knowledge source ([retrieval-augmented generation](/wiki/retrieval-augmented_generation)) and then asks an evaluation language model to judge whether the retrieved evidence supports the fact [1][2].
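The retrieve-then-judge loop can be sketched as follows. This is an illustration under stated assumptions, not the released implementation: the word-overlap retriever and the `toy_judge` function are deliberately simple stand-ins for a real passage retriever and an evaluation language model, and all names and example passages are hypothetical.

```python
import re
from typing import Callable, List, Sequence, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(fact: str, passages: Sequence[str], k: int = 2) -> List[str]:
    """Toy retriever: return the k passages sharing the most words with the fact."""
    fact_tokens = tokenize(fact)
    ranked = sorted(passages, key=lambda p: len(fact_tokens & tokenize(p)), reverse=True)
    return ranked[:k]


def toy_judge(fact: str, evidence: Sequence[str]) -> bool:
    """Stand-in for the evaluation LM: 'supported' if one passage contains every word of the fact."""
    fact_tokens = tokenize(fact)
    return any(fact_tokens <= tokenize(passage) for passage in evidence)


def estimate_factscore(
    facts: Sequence[str],
    passages: Sequence[str],
    judge: Callable[[str, Sequence[str]], bool],
    k: int = 2,
) -> float:
    """Fraction of atomic facts that the judge marks as supported by retrieved evidence."""
    if not facts:
        raise ValueError("no atomic facts to score")
    supported = sum(judge(fact, retrieve(fact, passages, k)) for fact in facts)
    return supported / len(facts)


# Hypothetical knowledge-source passages and atomic facts.
passages = [
    "Ada Lovelace was born in 1815 in London.",
    "Ada Lovelace wrote notes on the Analytical Engine.",
    "Charles Babbage designed the Analytical Engine.",
]
facts = ["Ada Lovelace was born in 1815.", "Ada Lovelace was born in Paris."]
print(estimate_factscore(facts, passages, toy_judge))  # 0.5
```

Passing the judge as a callable keeps the loop independent of any particular model, so a call to a hosted or local language model could be substituted for `toy_judge` without changing the rest of the sketch.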

The authors report several estimator configurations. The default combines retrieval with a strong LM (ChatGPT). A fully open alternative pairs retrieval with "Inst-LLaMA," a LLaMA 7B model trained on Super-Natural Instructions, optionally augmented with a nonparametric probability estimate from NPM, a nonparametric masked language model. The paper reports that the best estimators achieve less than a 2 percent error rate relative to human FActScore, and the public documentation notes that the retrieval-plus-ChatGPT and retrieval-plus-LLaMA-plus-NPM variants reach about 0.99 Pearson correlation with each other [1][2]. In the public package, running the estimator incurs roughly 1 US dollar in API fees per 100 sentences [2].
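The two agreement statistics mentioned here can be computed with a few lines of Python. The sketch below assumes that the error rate is the absolute difference between the human and the estimated FActScore for a model, and it implements the standard Pearson formula directly; the function names and the score lists are hypothetical illustrations, not figures from the paper.

```python
import math
from typing import Sequence


def error_rate(human_score: float, estimated_score: float) -> float:
    """Absolute gap between a human FActScore and its automated estimate."""
    return abs(human_score - estimated_score)


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical model-level scores from two estimator variants (not real data).
estimator_a = [0.42, 0.55, 0.61, 0.73]
estimator_b = [0.40, 0.57, 0.60, 0.75]
print(round(pearson(estimator_a, estimator_b), 3))

# Hypothetical human score versus estimate for a single model.
print(round(error_rate(0.58, 0.60), 3))
```

A high correlation between estimators indicates that they rank models similarly, while the error rate measures how far an estimate sits from the human score in absolute terms; the two can diverge, so both are worth checking.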

## Findings

The paper's human-evaluation study scored biographies generated by three commercial systems: InstructGPT, ChatGPT, and the retrieval-augmented PerplexityAI. The headline result is that ChatGPT achieved a FActScore of only 58 percent on this task. Even a strong, widely used model leaves a large share of its stated facts unsupported, which motivates a fine-grained score rather than a pass-or-fail label [1][2].

A second key finding concerns entity rarity. The paper reports "a notable decrease in FActScore as the rarity of entities increases, consistently across all LMs." For ChatGPT, scores fell sharply from about 80 percent for the most frequent entities to about 16 percent for the rarest ones, showing that models are far less reliable when writing about obscure or long-tail subjects [1].

Using the automated estimator, the authors evaluated 6,500 generations from 13 then-recent language models, an evaluation they note would have cost about 26,000 US dollars if performed by humans. Among the conclusions: GPT-4 and ChatGPT were more factual than the public models tested, and Vicuna and Alpaca ranked with the strongest of the open models [1][2].

The table below summarizes selected reported figures.

| Item | Reported value |
|---|---|
| ChatGPT FActScore on biographies (human eval) | 58% |
| ChatGPT, most frequent entities | about 80% |
| ChatGPT, rarest entities | about 16% |
| Automated estimator error vs. human | less than 2% |
| Generations evaluated by the automated metric | 6,500 from 13 LMs |
| Estimated human-evaluation cost avoided | about $26,000 |
| Default knowledge source | Wikipedia dump, April 1, 2023 |
| Automated estimator cost | about $1 per 100 sentences |

All figures are as reported in the FActScore paper and its accompanying software release [1][2].

## Adoption and influence

FActScore was released as an open package (installable via "pip install factscore") with code and data on GitHub, which helped it become a default building block for long-form factuality research [2]. Its core idea, decomposing a generation into atomic claims and verifying each against an external source, has been reused and extended widely.

The most prominent extension is Google DeepMind's 2024 work "Long-form factuality in large language models," which introduced the [LongFact](/wiki/longfact) prompt set of thousands of fact-seeking questions across 38 topics and the Search-Augmented Factuality Evaluator (SAFE). SAFE follows the FActScore decompose-then-verify recipe but replaces a static Wikipedia source with multi-step Google Search queries, allowing open-domain rather than biography-only evaluation. The DeepMind authors report that SAFE agrees with crowdsourced human annotators about 72 percent of the time while being more than 20 times cheaper, and they propose an F1-based aggregate metric (F1@K) that balances factual precision against recall, where recall is measured against a target number of supported facts K that stands in for the desired response length [3]. This precision-versus-recall framing is a direct response to a limitation of FActScore as a pure precision metric: a model can trivially raise precision by stating fewer, safer facts.
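A minimal sketch of F1@K, assuming the definition given in [3]: precision is the share of supported facts among all supported and not-supported facts, recall is the number of supported facts divided by K and capped at 1, and the score is 0 when no fact is supported. The function name, parameter names, and example counts are illustrative.

```python
def f1_at_k(num_supported: int, num_not_supported: int, k: int) -> float:
    """F1@K: harmonic mean of factual precision and recall against a target of K supported facts."""
    if k <= 0:
        raise ValueError("k must be positive")
    if num_supported < 0 or num_not_supported < 0:
        raise ValueError("fact counts must be non-negative")
    if num_supported == 0:
        return 0.0
    precision = num_supported / (num_supported + num_not_supported)
    recall = min(num_supported / k, 1.0)
    return 2 * precision * recall / (precision + recall)


# A cautious response: perfect precision, but far short of K = 64 supported facts.
print(round(f1_at_k(num_supported=5, num_not_supported=0, k=64), 3))  # 0.145

# A fuller response: lower precision, much higher recall.
print(f1_at_k(num_supported=48, num_not_supported=16, k=64))  # 0.75
```

The first response would earn a perfect score under a pure precision metric, yet it scores far lower here, which is the behavior the precision-versus-recall framing is meant to capture.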

Later methods continued in the same lineage. VeriScore (2024) focuses on extracting verifiable claims and uses search-engine evidence rather than Wikipedia, arguing this better handles claims that Wikipedia does not cover. FactBench (2024) builds an updatable benchmark of in-the-wild, hallucination-prone prompts and reports FActScore-style results alongside other metrics. Open reimplementations such as OpenFActScore aim to reproduce the metric using open, Hugging Face-compatible models so that evaluation does not depend on a specific commercial API. Across these efforts, FActScore is routinely cited as the foundational fine-grained factual-precision metric for long-form text and remains a common baseline and design reference into 2025 and 2026 [3].

## References

1. Min, Sewon; Krishna, Kalpesh; Lyu, Xinxi; Lewis, Mike; Yih, Wen-tau; Koh, Pang Wei; Iyyer, Mohit; Zettlemoyer, Luke; Hajishirzi, Hannaneh. "FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation." arXiv:2305.14251, 2023. https://arxiv.org/abs/2305.14251
2. Min, Sewon; et al. "FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation." Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023), pages 12076 to 12100, Singapore. Association for Computational Linguistics. https://aclanthology.org/2023.emnlp-main.741/ ; software release: https://github.com/shmsw25/factscore
3. Wei, Jerry; Yang, Chengrun; et al. "Long-form factuality in large language models." arXiv:2403.18802, 2024 (LongFact and SAFE; Google DeepMind). https://arxiv.org/abs/2403.18802

