# HalluLens

> Source: https://aiwiki.ai/wiki/hallulens
> Updated: 2026-06-02
> Categories: AI Benchmarks, Meta AI, Model Evaluation
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**HalluLens** is a [large language model](/wiki/large_language_model) [hallucination](/wiki/hallucination) benchmark introduced in April 2025 by researchers at [Meta AI](/wiki/meta_ai)'s Fundamental AI Research (FAIR) lab together with collaborators at the Hong Kong University of Science and Technology [1][2]. It organizes hallucination evaluation around a taxonomy that separates *extrinsic* hallucination (output inconsistent with a model's training data) from *intrinsic* hallucination (output inconsistent with the input context). It generates its extrinsic test sets dynamically at evaluation time, so the exact questions do not exist beforehand to leak into training corpora, which makes the benchmark hard to game [1][3]. The work was published as arXiv:2504.17550 and accepted to the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025), where it appears in the proceedings as paper 2025.acl-long.1176 [1][4]. Reference code and the dynamic test-set generators are released by Meta as an open-source repository [3].

## Overview

HalluLens responds to a structural problem in hallucination research: the field has used inconsistent definitions and conflated hallucination with factuality, which makes benchmarks hard to compare and easy to misread [1]. The benchmark's central design choices are a clear conceptual taxonomy, a strict distinction between hallucination and factuality, and dynamic generation of evaluation data to resist data leakage [1][3].

The benchmark is built from two parts. The extrinsic component consists of new tasks authored for HalluLens, each backed by a generator that produces fresh test items from seed corpora rather than from a fixed file [1][3]. The intrinsic component reuses existing, non-saturated benchmarks for context-grounded settings such as summarization and reading comprehension, because the authors argue those tasks are already well covered by available datasets [1]. The authors evaluate 13 instruction-tuned models, including open-weight [Llama 3.1](/wiki/llama_3_1) and [Llama 3.3](/wiki/llama_3_3) variants, [Qwen](/wiki/qwen) 2.5, [Gemma 2](/wiki/gemma_2), and [Mistral](/wiki/mistral_7b) models, alongside commercial systems [Claude](/wiki/claude) 3 Haiku, Claude 3 Sonnet, and [GPT-4o](/wiki/gpt_4o) [1].

## The hallucination taxonomy: extrinsic versus intrinsic

A recurring complaint in the paper is that prior work treats "hallucination" and "factuality" as interchangeable, which they are not. HalluLens defines hallucination as model output that is inconsistent with its source, where the source is either the training data or the input context. Factuality, by contrast, concerns correctness against established world knowledge verified by external sources [1]. The two come apart: a response can be faithful to a model's training data yet factually wrong because that data was outdated, and a response can be factually correct yet still count as a hallucination if it contradicts the provided input [1]. Holding hallucination and factuality separate lets the benchmark probe a model's internal consistency rather than its coverage of an ever-changing external world.

Within hallucination, HalluLens draws a second line [1]:

- **Extrinsic hallucination** is generation that is not consistent with the training data and that can be neither supported nor refuted by any input context. It typically arises in open-ended generation, when a model fills a knowledge gap or fails to recognize the boundary of what it knows. Extrinsic hallucination reflects limitations in how a model absorbs knowledge during training and whether it can recognize what lies outside that knowledge.
- **Intrinsic hallucination** is generation that is not consistent with the input context. The model misreads or contradicts the prompt, or asserts content the input does not support. It shows up in grounded tasks such as machine translation, summarization, and question answering over a supplied document, and reflects a failure of inference-time consistency rather than a gap in stored knowledge.

This split drives the benchmark's structure: extrinsic hallucination needs dynamically generated, leakage-resistant tasks because it depends on the model's training knowledge, whereas intrinsic hallucination can be measured with existing context-grounded datasets [1].

## The dynamic test-set generation and why it matters

The headline methodological contribution is that HalluLens does not ship a fixed set of extrinsic questions. Static benchmarks decay over time because the test items, once published, are scraped into later training sets; a model can then appear to hallucinate less simply because it has memorized the answers [1][3]. To break that cycle, HalluLens regenerates its extrinsic test items at evaluation time from seed corpora, so the exact prompts are not present in any pre-existing dataset and cannot be trained on in advance [1][3].

The obvious risk with on-the-fly generation is instability: if the test set changes every run, scores might not be comparable. The authors address this by controlling difficulty and sampling, and they report empirically that the dynamic procedure is reproducible, with PreciseWikiQA showing an average standard deviation of less than about 1% across three independent runs [1]. The benchmark therefore aims to be both leakage-resistant and stable enough to compare models fairly.
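
A stability check of this kind can be sketched in a few lines. The snippet below is only an illustration: the scores are made-up placeholders, the names `runs` and `average_std` are hypothetical, and averaging the per-model, per-metric sample standard deviations is an assumed aggregation rather than the paper's documented procedure.

```python
from statistics import mean, stdev

# Hypothetical scores (in %) from three regenerated test sets per model.
# These are placeholders for illustration, not results from the paper.
runs = {
    "model_a": {"correct": [30.1, 30.6, 29.8], "hallucination": [37.2, 36.9, 37.9]},
    "model_b": {"correct": [52.4, 52.9, 52.1], "hallucination": [45.0, 45.6, 44.7]},
}

def average_std(runs):
    """Average the sample standard deviation of every (model, metric) series."""
    stds = [
        stdev(scores)
        for metrics in runs.values()
        for scores in metrics.values()
    ]
    return mean(stds)

avg_std = average_std(runs)
print(f"average standard deviation across runs: {avg_std:.2f} points")
```

If the printed value stays small relative to the gaps between models, rankings from different regenerated test sets can reasonably be compared.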

Generation draws on several seed sources [1][3]:

- **GoodWiki**, a curated set of high-quality Wikipedia articles, seeds the Wikipedia-based tasks, with difficulty stratified using harmonic-centrality popularity scores from WikiRank so that easy and hard questions are balanced rather than dominated by long-tail topics.
- The **Integrated Taxonomic Information System (ITIS)** taxonomic database and a large worldwide medicines list seed the construction of plausible but non-existent entity names.
- A processed Wikipedia dump and a search API are used to verify generated answers and to confirm that supposedly non-existent entities really do not exist.
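
The difficulty stratification described for GoodWiki can be pictured as binned sampling over a popularity ranking. The following is a minimal sketch under stated assumptions: the function `stratified_sample`, the `score` field, the number of bins, and the per-bin sample size are all hypothetical and are not taken from the HalluLens codebase.

```python
import random

def stratified_sample(pages, n_bins, per_bin, seed=0):
    """Rank pages by popularity, cut the ranking into bins, sample from each bin."""
    rng = random.Random(seed)
    ranked = sorted(pages, key=lambda p: p["score"], reverse=True)
    bin_size = len(ranked) // n_bins
    sample = []
    for b in range(n_bins):
        start = b * bin_size
        # The last bin absorbs any remainder left by integer division.
        end = (b + 1) * bin_size if b < n_bins - 1 else len(ranked)
        bucket = ranked[start:end]
        for page in rng.sample(bucket, min(per_bin, len(bucket))):
            sample.append({**page, "difficulty_bin": b})
    return sample

# Hypothetical pages with made-up popularity scores (higher = more popular).
pages = [{"title": f"Page {i}", "score": 1000 - i} for i in range(100)]
picked = stratified_sample(pages, n_bins=10, per_bin=5)
print(len(picked), "pages sampled across 10 difficulty bins")
```

Drawing the same number of pages from every bin keeps popular (easier) and obscure (harder) topics equally represented, which is the balancing effect the stratification is meant to achieve.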

## The specific tasks

HalluLens defines three extrinsic tasks, the first two grounded in Wikipedia knowledge and the third built around non-existent entities [1][3].

- **PreciseWikiQA** targets short, fact-seeking questions whose answers a model should know from training. The generator produces concise questions (with single-word or short-phrase answers) from Wikipedia sections, filtering for objectively answerable items; the paper reports that about 97% of auto-generated reference answers were judged correct during validation [1]. It is the largest task, with on the order of 5,000 dynamically generated question-answer pairs [1].
- **LongWiki** targets long-form generation, where the model writes paragraph-length answers grounded in training knowledge. It uses roughly 250 prompts spanning intermediate difficulty levels, with responses capped near 1,024 tokens, and evaluates the factual claims inside the generated text rather than a single short answer [1].
- **NonExistentRefusal** probes whether a model will refuse, rather than confabulate, when asked about entities that do not exist, which is a direct test of whether it recognizes the boundary of its knowledge [1]. It has two subtasks:
  - **MixedEntities** builds non-existent names by mixing and swapping components of real names drawn from taxonomic and medical databases (animals, plants, and medicines), then verifies non-existence against those databases.
  - **GeneratedEntities** has language models invent fictional names for businesses, events, and products across many cities and countries, with non-existence confirmed through a web search API.
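
The MixedEntities idea of recombining parts of real names can be illustrated with a toy example. This is a sketch, not the benchmark's generator: `mix_binomials` and the four-name list are hypothetical, and the non-existence check here only consults the local list, whereas the benchmark is described as verifying against the full databases.

```python
import random

def mix_binomials(real_names, n, seed=0, max_tries=10_000):
    """Recombine first and second words of real two-word names into unseen pairs."""
    rng = random.Random(seed)
    known = set(real_names)
    first_parts = sorted({name.split()[0] for name in real_names})
    second_parts = sorted({name.split()[1] for name in real_names})
    mixed = set()
    for _ in range(max_tries):
        if len(mixed) >= n:
            break
        candidate = f"{rng.choice(first_parts)} {rng.choice(second_parts)}"
        # Toy check only: a real pipeline would query the full source database,
        # since a recombined name can coincide with a real or historical one.
        if candidate not in known:
            mixed.add(candidate)
    return sorted(mixed)

# Tiny illustrative seed list; the benchmark draws on much larger databases.
real = ["Panthera leo", "Canis lupus", "Ursus arctos", "Equus caballus"]
fake = mix_binomials(real, n=5)
print(fake)
```

Because the recombined names reuse genuine components, they look plausible, which is what makes refusing to describe them a meaningful test.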

For the intrinsic side, rather than introduce new tasks, HalluLens evaluates context-faithfulness using existing benchmarks that the authors find are not yet saturated, namely the HHEM summarization-consistency leaderboard from Vectara, ANAH 2.0 (with reference) for grounded question answering, and FaithEval for handling noisy or contradictory context [1].

## Evaluation methodology

Each extrinsic task pairs a generator with an automatic evaluator. The evaluators rely heavily on the [LLM-as-a-judge](/wiki/llm_as_a_judge) approach using Llama 3.1 models, and the authors report agreement with human labels to justify the automation [1]:

- For **PreciseWikiQA**, a judge first decides whether the model refused (abstained for lack of knowledge) and then classifies non-refused answers as correct, incorrect, or unverifiable; incorrect and unverifiable answers both count as hallucinations. The reported metrics are false refusal rate, hallucination rate among non-refused answers, and correct answer rate, with the abstention and correctness judges reported at roughly 97% and 96% accuracy respectively [1].
- For **LongWiki**, generated text is decomposed into individual verifiable claims (using a large Llama model), each claim is checked against Wikipedia passages retrieved through named-entity-based selection, and the system reports false refusal rate together with precision, recall@32, and F1@32 over the supported claims [1].
- For **NonExistentRefusal**, a judge decides whether the response indicates belief in the non-existent entity; the reported metric is the false acceptance rate, the share of cases where the model failed to refuse, where lower is better. The evaluator is reported to agree with human assessment about 95% of the time [1].
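
To make the PreciseWikiQA metrics concrete, the sketch below aggregates per-question judge labels into the three reported rates. The label strings and the function name `precise_wikiqa_metrics` are hypothetical, and the denominators (all questions for the refusal and correct-answer rates, answered questions for the hallucination rate) are an assumption inferred from the metric names.

```python
from collections import Counter

def precise_wikiqa_metrics(labels):
    """Aggregate judge labels: 'refused', 'correct', 'incorrect', 'unverifiable'."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    total = len(labels)
    answered = total - counts["refused"]
    # Incorrect and unverifiable answers both count as hallucinations.
    hallucinated = counts["incorrect"] + counts["unverifiable"]
    return {
        "false_refusal_rate": counts["refused"] / total,
        "hallucination_rate_if_answered": hallucinated / answered if answered else 0.0,
        "correct_answer_rate": counts["correct"] / total,
    }

# Ten hypothetical judged responses.
labels = ["refused"] * 2 + ["correct"] * 5 + ["incorrect"] * 2 + ["unverifiable"]
metrics = precise_wikiqa_metrics(labels)
print(metrics)
```

Reporting the hallucination rate only over answered questions keeps a model from lowering it simply by refusing more often; the refusal rate exposes that trade-off separately.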

## Notable results by model

Across the extrinsic tasks the paper finds a wide spread between model families and, importantly, that families adopt very different refusal strategies: some models abstain aggressively (high false refusal but lower hallucination when they do answer), while others almost never refuse and hallucinate heavily on unanswerable prompts [1]. GPT-4o posts the strongest Wikipedia-grounded accuracy, while Llama 3.1 405B is best at recognizing the boundary of its own knowledge on non-existent entities [1]. All numbers below are from the paper's evaluation tables [1].

### PreciseWikiQA (short factual questions)

| Model | False refusal % | Hallucination if not refused % | Correct answer rate % |
|---|---|---|---|
| Llama 3.1 8B | 83.09 | 48.37 | 8.73 |
| Llama 3.1 70B | 52.03 | 37.30 | 30.08 |
| Llama 3.1 405B | 56.77 | 26.84 | 31.62 |
| Llama 3.3 70B | 20.01 | 50.19 | 39.84 |
| Mistral 7B v0.3 | 7.77 | 81.19 | 17.34 |
| Mistral Nemo | 1.05 | 75.50 | 24.24 |
| Gemma 2 9B | 22.89 | 76.01 | 18.50 |
| Gemma 2 27B | 19.23 | 68.29 | 25.61 |
| Qwen2.5 7B | 13.85 | 85.22 | 12.73 |
| Qwen2.5 14B | 15.93 | 78.08 | 18.43 |
| Claude 3 Haiku | 63.64 | 51.30 | 17.71 |
| Claude 3 Sonnet | 56.68 | 56.24 | 18.96 |
| GPT-4o | 4.13 | 45.15 | 52.59 |

Source: HalluLens, Table 2 [1]. GPT-4o reaches the highest correct answer rate (52.59%); among open-weight models Llama 3.3 70B leads on correct answer rate, while Llama 3.1 405B has the lowest hallucination rate when it chooses to answer (26.84%).
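
The three columns of this table appear to be linked arithmetically: the correct answer rate is close to the share of questions answered multiplied by the share of those answers that are not hallucinated. The check below uses four rows copied from the table; the helper name `implied_correct_rate` is hypothetical, and the relation is an observation about the numbers rather than a formula quoted from the paper.

```python
# (false refusal %, hallucination if not refused %, correct answer rate %)
# copied from four rows of the PreciseWikiQA table.
rows = {
    "Llama 3.1 8B": (83.09, 48.37, 8.73),
    "Llama 3.1 405B": (56.77, 26.84, 31.62),
    "Mistral Nemo": (1.05, 75.50, 24.24),
    "GPT-4o": (4.13, 45.15, 52.59),
}

def implied_correct_rate(refusal_pct, hallucination_pct):
    """Share answered times share of answers not hallucinated, in percent."""
    return (1 - refusal_pct / 100) * (1 - hallucination_pct / 100) * 100

for model, (refusal, hallucination, reported) in rows.items():
    implied = implied_correct_rate(refusal, hallucination)
    print(f"{model}: implied {implied:.2f}, reported {reported:.2f}")

max_gap = max(
    abs(implied_correct_rate(r, h) - c) for r, h, c in rows.values()
)
```

This is why a model such as Llama 3.1 8B can have a moderate hallucination rate yet a very low correct answer rate: it refuses most questions, so few answers remain to be correct.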

### LongWiki (long-form generation)

| Model | False refusal % | Recall@32 % | Precision % | F1@32 % |
|---|---|---|---|---|
| Llama 3.1 8B | 22.67 | 63.97 | 45.36 | 51.04 |
| Llama 3.1 70B | 13.47 | 66.27 | 53.74 | 56.23 |
| Llama 3.1 405B | 8.93 | 74.44 | 56.94 | 61.98 |
| Llama 3.3 70B | 0.67 | 75.46 | 52.42 | 60.02 |
| Mistral 7B v0.3 | 0.13 | 58.03 | 39.45 | 46.08 |
| Mistral Nemo | 0.00 | 66.88 | 38.06 | 47.78 |
| Gemma 2 9B | 4.00 | 60.00 | 48.58 | 52.22 |
| Gemma 2 27B | 1.73 | 67.35 | 51.57 | 56.69 |
| Qwen2.5 7B | 0.53 | 70.94 | 44.53 | 53.28 |
| Qwen2.5 14B | 0.53 | 74.05 | 52.84 | 60.11 |
| Claude 3 Haiku | 8.67 | 58.95 | 65.24 | 58.54 |
| Claude 3 Sonnet | 6.93 | 65.03 | 56.97 | 58.50 |
| GPT-4o | 0.13 | 84.89 | 71.03 | 75.80 |

Source: HalluLens, Table 3 [1]. GPT-4o again leads (F1@32 of 75.80%). Among open-weight systems, Llama 3.1 405B has the highest F1@32 (61.98%), followed closely by Qwen2.5 14B (60.11%) and Llama 3.3 70B (60.02%).
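
For readers unfamiliar with the @K notation, the sketch below computes precision, recall@K, and F1@K for a single response. It assumes the definition used in earlier long-form factuality work, where recall@K is the number of supported claims divided by K and capped at 1; the article's source does not spell the formula out, so treat this as an assumption, and the function name `f1_at_k` as hypothetical.

```python
def f1_at_k(supported, total_claims, k=32):
    """Return (precision, recall@k, F1@k) for one response.

    Assumed definition: recall@k = min(supported / k, 1).
    """
    if total_claims == 0 or supported == 0:
        return 0.0, 0.0, 0.0
    precision = supported / total_claims
    recall = min(supported / k, 1.0)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical response: 40 extracted claims, 24 of them supported.
p, r, f1 = f1_at_k(supported=24, total_claims=40)
print(f"precision={p:.3f} recall@32={r:.3f} F1@32={f1:.3f}")
```

Note that the table's F1 column is not the harmonic mean of its precision and recall columns (for Llama 3.1 8B that would be about 53.08 rather than 51.04), which is consistent with F1 being computed per response and then averaged, though that aggregation is also an assumption here.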

### NonExistentRefusal (false acceptance rate, lower is better)

| Model | MixedEntities % | GeneratedEntities % | Average % |
|---|---|---|---|
| Llama 3.1 8B | 19.78 | 6.58 | 13.18 |
| Llama 3.1 70B | 40.73 | 7.32 | 24.02 |
| Llama 3.1 405B | 11.48 | 2.28 | 6.88 |
| Llama 3.3 70B | 66.86 | 14.77 | 40.82 |
| Mistral 7B v0.3 | 94.74 | 77.98 | 86.36 |
| Mistral Nemo | 90.87 | 76.12 | 83.49 |
| Gemma 2 9B | 58.70 | 21.47 | 40.09 |
| Gemma 2 27B | 60.97 | 20.94 | 40.95 |
| Qwen2.5 7B | 64.46 | 34.24 | 49.35 |
| Qwen2.5 14B | 48.12 | 11.16 | 29.64 |
| Claude 3 Haiku | 69.08 | 10.43 | 39.75 |
| Claude 3 Sonnet | 60.49 | 13.40 | 36.94 |
| GPT-4o | 65.89 | 18.74 | 42.31 |

Source: HalluLens, Table 4 [1]. Llama 3.1 405B has by far the lowest average false acceptance (6.88%), indicating the best recognition of non-existent entities, while the Mistral models accept non-existent entities most often (above 83%). Notably, several strong systems including GPT-4o accept fabricated entities frequently here, showing that high factual accuracy does not imply good refusal behavior.
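
The Average column matches the unweighted mean of the two subtask rates to within rounding, which can be verified directly. The snippet copies four rows from the table; the helper `macro_average` is a hypothetical name, and the unweighted mean is inferred from the numbers rather than stated in the source.

```python
# (MixedEntities %, GeneratedEntities %, reported Average %)
# copied from four rows of the NonExistentRefusal table.
false_acceptance = {
    "Llama 3.1 405B": (11.48, 2.28, 6.88),
    "Qwen2.5 14B": (48.12, 11.16, 29.64),
    "GPT-4o": (65.89, 18.74, 42.31),
    "Mistral 7B v0.3": (94.74, 77.98, 86.36),
}

def macro_average(mixed, generated):
    """Unweighted mean of the two subtask false acceptance rates."""
    return (mixed + generated) / 2

gaps = {
    model: abs(macro_average(mixed, generated) - reported)
    for model, (mixed, generated, reported) in false_acceptance.items()
}

# Lower false acceptance is better, so sort ascending by the reported average.
ranking = sorted(false_acceptance, key=lambda m: false_acceptance[m][2])
print(ranking)
```

An unweighted mean gives both subtasks equal influence, so a model that handles invented business names well but accepts recombined species or medicine names still receives a high average.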

## Significance

HalluLens contributes on three fronts. First, by separating extrinsic from intrinsic hallucination and both from factuality, it offers a vocabulary that lets researchers say precisely which failure they are measuring, addressing a long-standing source of confusion in the literature [1]. Second, its dynamic generation directly tackles benchmark contamination, a growing concern as model training corpora expand to cover nearly everything published online; regenerating items keeps the evaluation meaningful over time without sacrificing reproducibility [1][3]. Third, the NonExistentRefusal task isolates a behavior that many other benchmarks miss, namely whether a model knows what it does not know and abstains, which the results show is largely independent of raw accuracy [1]. The paper also re-examines popular benchmarks such as TruthfulQA, arguing that much of it measures factuality rather than hallucination and that a substantial fraction of its items are mis-scored, which has implications for how the community interprets prior leaderboard numbers [1]. Because Meta released the generators and evaluators as open source, others can rerun the benchmark on new models under the same protocol [3].

## Limitations

Several caveats follow from the design. The extrinsic tasks lean heavily on Wikipedia-derived knowledge through GoodWiki, so they probe the kind of encyclopedic facts Wikipedia covers well and may underrepresent specialized or non-English domains [1]. The pipeline depends on LLM-as-a-judge evaluators (largely Llama 3.1 models) for refusal detection, answer correctness, and claim verification; although the authors report high agreement with human labels, any systematic bias in those judges propagates into the scores [1]. Verifying that an entity is truly non-existent relies on external databases and a web search API, so coverage gaps in those sources could let a real but obscure entity be treated as fictional [1][3]. The dynamic procedure, while shown to be low-variance, still produces a different concrete test set on each run, which complicates exact reproduction of a specific published number even though aggregate scores are stable [1]. Finally, the commercial models evaluated were the versions available in 2024, and like any benchmark snapshot the reported standings reflect those checkpoints rather than later releases [1].

## References

1. Bang, Yejin; Ji, Ziwei; Schelten, Alan; Hartshorn, Anthony; Fowler, Tara; Zhang, Cheng; Cancedda, Nicola; Fung, Pascale. "HalluLens: LLM Hallucination Benchmark." arXiv:2504.17550, April 2025. [https://arxiv.org/abs/2504.17550](https://arxiv.org/abs/2504.17550)
2. "Paper page - HalluLens: LLM Hallucination Benchmark." Hugging Face. [https://huggingface.co/papers/2504.17550](https://huggingface.co/papers/2504.17550)
3. "facebookresearch/HalluLens: Codebase for LLM Textual Hallucination Benchmark." GitHub. [https://github.com/facebookresearch/HalluLens](https://github.com/facebookresearch/HalluLens)
4. "HalluLens: LLM Hallucination Benchmark." Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025), paper 2025.acl-long.1176. [https://aclanthology.org/2025.acl-long.1176.pdf](https://aclanthology.org/2025.acl-long.1176.pdf)