HalluLens
Last reviewed
Jun 2, 2026
Sources
4 citations
Review status
Source-backed
Revision
v1 · 2,453 words
HalluLens is a large language model hallucination benchmark introduced by researchers at Meta AI's Fundamental AI Research (FAIR) lab, together with collaborators at the Hong Kong University of Science and Technology, in April 2025 [1][2]. It organizes hallucination evaluation around a taxonomy that separates extrinsic hallucination (output inconsistent with a model's training data) from intrinsic hallucination (output inconsistent with the input context), and it generates its extrinsic test sets dynamically at evaluation time so that the questions cannot leak into training corpora and the benchmark cannot be gamed [1][3]. The work was published as arXiv:2504.17550 and accepted to the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025), where it appears in the proceedings as paper 2025.acl-long.1176 [1][4]. Reference code and the dynamic test-set generators are released by Meta as an open-source repository [3].
HalluLens responds to a structural problem in hallucination research: the field has used inconsistent definitions and conflated hallucination with factuality, which makes benchmarks hard to compare and easy to misread [1]. The benchmark's central design choices are a clear conceptual taxonomy, a strict distinction between hallucination and factuality, and dynamic generation of evaluation data to resist data leakage [1][3].
The benchmark is built from two parts. The extrinsic component consists of new tasks authored for HalluLens, each backed by a generator that produces fresh test items from seed corpora rather than from a fixed file [1][3]. The intrinsic component reuses existing, non-saturated benchmarks for context-grounded settings such as summarization and reading comprehension, because the authors argue those tasks are already well covered by available datasets [1]. The authors evaluate 13 instruction-tuned models, including open-weight Llama 3.1 and Llama 3.3 variants, Qwen 2.5, Gemma 2, and Mistral models, alongside commercial systems Claude 3 Haiku, Claude 3 Sonnet, and GPT-4o [1].
A recurring complaint in the paper is that prior work treats "hallucination" and "factuality" as interchangeable, which they are not. HalluLens defines hallucination as model output that is inconsistent with its source, where the source is either the training data or the input context. Factuality, by contrast, concerns correctness against established world knowledge verified by external sources [1]. The two come apart: a response can be faithful to a model's training data yet factually wrong because that data was outdated, and a response can be factually correct yet still count as a hallucination if it contradicts the provided input [1]. Holding hallucination and factuality separate lets the benchmark probe a model's internal consistency rather than its coverage of an ever-changing external world.
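Within hallucination, HalluLens draws a second line [1]: extrinsic hallucination is output that is inconsistent with the model's training data, while intrinsic hallucination is output that is inconsistent with the input context.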
This split drives the benchmark's structure: extrinsic hallucination needs dynamically generated, leakage-resistant tasks because it depends on the model's training knowledge, whereas intrinsic hallucination can be measured with existing context-grounded datasets [1].
The headline methodological contribution is that HalluLens does not ship a fixed set of extrinsic questions. Static benchmarks decay over time because the test items, once published, are scraped into later training sets; a model can then appear to hallucinate less simply because it has memorized the answers [1][3]. To break that cycle, HalluLens regenerates its extrinsic test items at evaluation time from seed corpora, so the exact prompts are not present in any pre-existing dataset and cannot be trained on in advance [1][3].
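The regeneration idea can be illustrated with a minimal sketch. Everything here is hypothetical: the tiny `SEED_CORPUS`, the `draft_question` placeholder, and the `generate_test_set` helper are stand-ins chosen for illustration and are not taken from the HalluLens repository, where question writing is considerably more involved.

```python
import random

# Hypothetical seed corpus: (title, passage) pairs standing in for a real source.
SEED_CORPUS = [
    ("Page A", "Passage text about topic A."),
    ("Page B", "Passage text about topic B."),
    ("Page C", "Passage text about topic C."),
    ("Page D", "Passage text about topic D."),
]

def draft_question(title: str, passage: str) -> str:
    """Placeholder for the question-writing step (an assumption, not the paper's prompt)."""
    return f"Based on the article '{title}', state one fact it reports."

def generate_test_set(corpus, n_items: int, run_seed: int):
    """Sample passages afresh for this run and turn each into a test item."""
    rng = random.Random(run_seed)          # a per-run seed keeps one run repeatable
    sampled = rng.sample(corpus, n_items)  # sampling without replacement
    return [{"title": t, "question": draft_question(t, p)} for t, p in sampled]

# Two evaluation runs build their items independently from the same seed corpus.
run_1 = generate_test_set(SEED_CORPUS, n_items=2, run_seed=1)
run_2 = generate_test_set(SEED_CORPUS, n_items=2, run_seed=2)
```

The point of the sketch is only that no fixed question file exists: items are materialized from the seed corpus when the evaluation runs.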
The obvious risk with on-the-fly generation is instability: if the test set changes every run, scores might not be comparable. The authors address this by controlling difficulty and sampling, and they report empirically that the dynamic procedure is reproducible, with PreciseWikiQA showing an average standard deviation of roughly 1% or less across three independent runs [1]. The benchmark therefore aims to be both leakage-resistant and stable enough to compare models fairly.
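A stability check of this kind can be sketched as follows. The scores are invented for illustration, and the use of the sample standard deviation is an assumption; the source does not say which estimator the authors used.

```python
from statistics import mean, stdev

# Hypothetical scores (in %) for two models over three independently generated test sets.
scores_by_model = {
    "model_a": [40.1, 39.6, 40.4],
    "model_b": [17.9, 18.3, 18.1],
}

def run_to_run_spread(scores):
    """Sample standard deviation of each model's score across runs."""
    return {name: stdev(runs) for name, runs in scores.items()}

spread = run_to_run_spread(scores_by_model)
average_spread = mean(spread.values())  # one number summarizing run-to-run variation
```

A small `average_spread` relative to the gaps between models suggests that rankings are unlikely to flip merely because the test set was regenerated.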
Generation draws on several seed sources, among them the Wikipedia-derived GoodWiki corpus [1][3].
HalluLens defines three extrinsic tasks: PreciseWikiQA and LongWiki, both grounded in Wikipedia knowledge, and NonExistentRefusal, built around non-existent entities [1][3].
For the intrinsic side, rather than introduce new tasks, HalluLens evaluates context-faithfulness using existing benchmarks that the authors find are not yet saturated, namely the HHEM summarization-consistency leaderboard from Vectara, ANAH 2.0 (with reference) for grounded question answering, and FaithEval for handling noisy or contradictory context [1].
Each extrinsic task pairs a generator with an automatic evaluator. The evaluators rely heavily on the LLM-as-a-judge approach using Llama 3.1 models, which handle refusal detection, answer correctness, and claim verification, and the authors report agreement with human labels to justify the automation [1].
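Agreement between an automatic judge and human annotators can be quantified in several ways. The sketch below shows two common choices, raw agreement and Cohen's kappa, on invented labels; the source does not state which statistic the authors report, so the choice here is an assumption.

```python
def agreement_rate(judge_labels, human_labels):
    """Fraction of items on which the judge and the human assign the same label."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)

def cohens_kappa(judge_labels, human_labels):
    """Agreement corrected for the agreement expected by chance."""
    n = len(judge_labels)
    observed = agreement_rate(judge_labels, human_labels)
    labels = set(judge_labels) | set(human_labels)
    expected = sum(
        (judge_labels.count(label) / n) * (human_labels.count(label) / n)
        for label in labels
    )
    if expected == 1.0:  # both raters used a single identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)

# Hypothetical refusal-detection labels for six model responses.
judge = ["refusal", "answer", "answer", "refusal", "answer", "answer"]
human = ["refusal", "answer", "answer", "answer", "answer", "answer"]

raw = agreement_rate(judge, human)
kappa = cohens_kappa(judge, human)
```

Raw agreement can look high when one label dominates, which is why a chance-corrected statistic is often reported alongside it.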
Across the extrinsic tasks the paper finds wide spread between model families and, importantly, that families adopt very different refusal strategies: some models abstain aggressively (high false refusal but lower hallucination when they do answer), while others almost never refuse and hallucinate heavily on unanswerable prompts [1]. GPT-4o posts the strongest Wikipedia-grounded accuracy, while Llama 3.1 405B is best at recognizing the boundary of its own knowledge on non-existent entities [1]. All numbers below are from the paper's evaluation tables [1].
| Model | False refusal % | Hallucination if not refused % | Correct answer rate % |
|---|---|---|---|
| Llama 3.1 8B | 83.09 | 48.37 | 8.73 |
| Llama 3.1 70B | 52.03 | 37.30 | 30.08 |
| Llama 3.1 405B | 56.77 | 26.84 | 31.62 |
| Llama 3.3 70B | 20.01 | 50.19 | 39.84 |
| Mistral 7B v0.3 | 7.77 | 81.19 | 17.34 |
| Mistral Nemo | 1.05 | 75.50 | 24.24 |
| Gemma 2 9B | 22.89 | 76.01 | 18.50 |
| Gemma 2 27B | 19.23 | 68.29 | 25.61 |
| Qwen2.5 7B | 13.85 | 85.22 | 12.73 |
| Qwen2.5 14B | 15.93 | 78.08 | 18.43 |
| Claude 3 Haiku | 63.64 | 51.30 | 17.71 |
| Claude 3 Sonnet | 56.68 | 56.24 | 18.96 |
| GPT-4o | 4.13 | 45.15 | 52.59 |
Source: HalluLens, Table 2 (PreciseWikiQA) [1]. GPT-4o reaches the highest correct answer rate (52.59%); among open-weight models Llama 3.3 70B leads on correct answer rate (39.84%), while Llama 3.1 405B has the lowest hallucination rate when it chooses to answer (26.84%).
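The three columns are not independent. If a correct answer is one that is neither refused nor hallucinated, then the correct answer rate should equal (1 − false refusal) × (1 − hallucination if not refused). This is a reading of the table rather than a formula quoted from the paper, but the rows above agree with it to within rounding, as the following check on four of them shows.

```python
def correct_answer_rate(false_refusal_pct: float, hallucination_if_answered_pct: float) -> float:
    """Share of all prompts that were answered and not hallucinated, in percent."""
    answered = 1 - false_refusal_pct / 100
    not_hallucinated = 1 - hallucination_if_answered_pct / 100
    return 100 * answered * not_hallucinated

# (false refusal %, hallucination if not refused %, reported correct answer rate %)
rows = {
    "Llama 3.1 8B": (83.09, 48.37, 8.73),
    "Llama 3.1 405B": (56.77, 26.84, 31.62),
    "Mistral Nemo": (1.05, 75.50, 24.24),
    "GPT-4o": (4.13, 45.15, 52.59),
}

# Largest gap between the reconstructed and the reported value across these rows.
max_gap = max(
    abs(correct_answer_rate(refusal, halluc) - reported)
    for refusal, halluc, reported in rows.values()
)
```

The identity makes the trade-off explicit: Llama 3.1 405B hallucinates least when it answers, yet its high refusal rate caps its correct answer rate well below GPT-4o's.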
| Model | False refusal % | Recall@32 | Precision % | F1@32 |
|---|---|---|---|---|
| Llama 3.1 8B | 22.67 | 63.97 | 45.36 | 51.04 |
| Llama 3.1 70B | 13.47 | 66.27 | 53.74 | 56.23 |
| Llama 3.1 405B | 8.93 | 74.44 | 56.94 | 61.98 |
| Llama 3.3 70B | 0.67 | 75.46 | 52.42 | 60.02 |
| Mistral 7B v0.3 | 0.13 | 58.03 | 39.45 | 46.08 |
| Mistral Nemo | 0.00 | 66.88 | 38.06 | 47.78 |
| Gemma 2 9B | 4.00 | 60.00 | 48.58 | 52.22 |
| Gemma 2 27B | 1.73 | 67.35 | 51.57 | 56.69 |
| Qwen2.5 7B | 0.53 | 70.94 | 44.53 | 53.28 |
| Qwen2.5 14B | 0.53 | 74.05 | 52.84 | 60.11 |
| Claude 3 Haiku | 8.67 | 58.95 | 65.24 | 58.54 |
| Claude 3 Sonnet | 6.93 | 65.03 | 56.97 | 58.50 |
| GPT-4o | 0.13 | 84.89 | 71.03 | 75.80 |
Source: HalluLens, Table 3 (LongWiki) [1]. GPT-4o again leads (F1@32 of 75.80%). Among open-weight systems, Llama 3.1 405B has the highest F1@32 (61.98%), followed closely by Qwen2.5 14B (60.11%) and Llama 3.3 70B (60.02%).
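The F1@32 column is not simply the harmonic mean of the Recall@32 and Precision columns: for Llama 3.1 8B, the harmonic mean of 63.97 and 45.36 is about 53.08, whereas the table reports 51.04. One explanation, assumed here rather than confirmed by the source, is that the metrics are computed per response and then averaged. The sketch below uses a common F1@K convention from long-form factuality evaluation (precision as supported claims over all claims, recall as supported claims over K, capped at 1); the claim counts are invented.

```python
def f1_at_k(supported: int, total_claims: int, k: int = 32):
    """Per-response precision, recall@K and F1@K under the assumed convention."""
    if total_claims == 0 or supported == 0:
        return 0.0, 0.0, 0.0
    precision = supported / total_claims
    recall = min(supported / k, 1.0)  # recall saturates once K supported claims are reached
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def harmonic_mean(a: float, b: float) -> float:
    return 2 * a * b / (a + b)

# Hypothetical (supported claims, total claims) for two long-form responses.
responses = [(30, 40), (8, 10)]
per_response = [f1_at_k(s, t) for s, t in responses]

mean_precision = sum(p for p, _, _ in per_response) / len(per_response)
mean_recall = sum(r for _, r, _ in per_response) / len(per_response)
mean_f1 = sum(f for _, _, f in per_response) / len(per_response)

# Averaging per-response F1 generally differs from taking F1 of the averaged columns.
f1_of_means = harmonic_mean(mean_precision, mean_recall)
```

In this toy case the per-response average is lower than the F1 of the column means, the same direction as the gap visible in the table.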
| Model | MixedEntities false acceptance % | GeneratedEntities false acceptance % | Average false acceptance % |
|---|---|---|---|
| Llama 3.1 8B | 19.78 | 6.58 | 13.18 |
| Llama 3.1 70B | 40.73 | 7.32 | 24.02 |
| Llama 3.1 405B | 11.48 | 2.28 | 6.88 |
| Llama 3.3 70B | 66.86 | 14.77 | 40.82 |
| Mistral 7B v0.3 | 94.74 | 77.98 | 86.36 |
| Mistral Nemo | 90.87 | 76.12 | 83.49 |
| Gemma 2 9B | 58.70 | 21.47 | 40.09 |
| Gemma 2 27B | 60.97 | 20.94 | 40.95 |
| Qwen2.5 7B | 64.46 | 34.24 | 49.35 |
| Qwen2.5 14B | 48.12 | 11.16 | 29.64 |
| Claude 3 Haiku | 69.08 | 10.43 | 39.75 |
| Claude 3 Sonnet | 60.49 | 13.40 | 36.94 |
| GPT-4o | 65.89 | 18.74 | 42.31 |
Source: HalluLens, Table 4 (NonExistentRefusal) [1]. Values are false acceptance rates, so lower is better. Llama 3.1 405B has by far the lowest average false acceptance (6.88%), indicating the best recognition of non-existent entities, while the Mistral models accept non-existent entities most often (above 83%). Notably, several strong systems including GPT-4o accept fabricated entities frequently here, showing that high factual accuracy does not imply good refusal behavior.
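A false acceptance rate is the share of prompts about non-existent entities that a model answers as though the entity were real. The sketch below computes it from invented judged outcomes and checks that the Average column in the table matches the unweighted mean of the two subset columns to within rounding; that relationship is inferred from the numbers, not quoted from the paper.

```python
def false_acceptance_rate(accepted_flags) -> float:
    """Percentage of non-existent-entity prompts the model answered as if real."""
    if not accepted_flags:
        raise ValueError("need at least one judged prompt")
    return 100 * sum(accepted_flags) / len(accepted_flags)

# Hypothetical judged outcomes: True means the model accepted the fabricated entity.
demo_rate = false_acceptance_rate([True, False, False, True])

# (MixedEntities %, GeneratedEntities %, reported Average %) from the table above.
rows = {
    "Llama 3.1 405B": (11.48, 2.28, 6.88),
    "Mistral 7B v0.3": (94.74, 77.98, 86.36),
    "GPT-4o": (65.89, 18.74, 42.31),
}

# Largest gap between the unweighted subset mean and the reported average.
max_gap = max(abs((mixed + generated) / 2 - avg) for mixed, generated, avg in rows.values())

# The model with the lowest average false acceptance among these rows.
best = min(rows, key=lambda name: rows[name][2])
```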
HalluLens contributes on three fronts. First, by separating extrinsic from intrinsic hallucination and both from factuality, it offers a vocabulary that lets researchers say precisely which failure they are measuring, addressing a long-standing source of confusion in the literature [1]. Second, its dynamic generation directly tackles benchmark contamination, a growing concern as model training corpora expand to cover nearly everything published online; regenerating items keeps the evaluation meaningful over time without sacrificing reproducibility [1][3]. Third, the NonExistentRefusal task isolates a behavior that many other benchmarks miss, namely whether a model knows what it does not know and abstains, which the results show is largely independent of raw accuracy [1]. The paper also re-examines popular benchmarks such as TruthfulQA, arguing that much of it measures factuality rather than hallucination and that a substantial fraction of its items are mis-scored, which has implications for how the community interprets prior leaderboard numbers [1]. Because Meta released the generators and evaluators as open source, others can rerun the benchmark on new models under the same protocol [3].
Several caveats follow from the design. The extrinsic tasks lean heavily on Wikipedia-derived knowledge through GoodWiki, so they probe the kind of encyclopedic facts Wikipedia covers well and may underrepresent specialized or non-English domains [1]. The pipeline depends on LLM-as-a-judge evaluators (largely Llama 3.1 models) for refusal detection, answer correctness, and claim verification; although the authors report high agreement with human labels, any systematic bias in those judges propagates into the scores [1]. Verifying that an entity is truly non-existent relies on external databases and a web search API, so coverage gaps in those sources could let a real but obscure entity be treated as fictional [1][3]. The dynamic procedure, while shown to be low-variance, still produces a different concrete test set on each run, which complicates exact reproduction of a specific published number even though aggregate scores are stable [1]. Finally, the commercial models evaluated were the versions available in 2024, and like any benchmark snapshot the reported standings reflect those checkpoints rather than later releases [1].