HalluLens

Updated Jun 2, 2026

HalluLens is a large language model hallucination benchmark introduced by researchers at Meta AI's Fundamental AI Research (FAIR) lab, together with collaborators at the Hong Kong University of Science and Technology, in April 2025 ^[1]^[2]. It organizes hallucination evaluation around a taxonomy that separates extrinsic hallucination (output inconsistent with a model's training data) from intrinsic hallucination (output inconsistent with the input context), and it generates its extrinsic test sets dynamically at evaluation time so that the questions cannot leak into training corpora and the benchmark cannot be gamed ^[1]^[3]. The work was published as arXiv:2504.17550 and accepted to the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025), where it appears in the proceedings as paper 2025.acl-long.1176 ^[1]^[4]. Reference code and the dynamic test-set generators are released by Meta as an open-source repository ^[3].

Overview

HalluLens responds to a structural problem in hallucination research: the field has used inconsistent definitions and conflated hallucination with factuality, which makes benchmarks hard to compare and easy to misread ^[1]. The benchmark's central design choices are a clear conceptual taxonomy, a strict distinction between hallucination and factuality, and dynamic generation of evaluation data to resist data leakage ^[1]^[3].

The benchmark is built from two parts. The extrinsic component consists of new tasks authored for HalluLens, each backed by a generator that produces fresh test items from seed corpora rather than from a fixed file ^[1]^[3]. The intrinsic component reuses existing, non-saturated benchmarks for context-grounded settings such as summarization and reading comprehension, because the authors argue those tasks are already well covered by available datasets ^[1]. The authors evaluate 13 instruction-tuned models, including open-weight Llama 3.1 and Llama 3.3 variants, Qwen 2.5, Gemma 2, and Mistral models, alongside commercial systems Claude 3 Haiku, Claude 3 Sonnet, and GPT-4o ^[1].

The hallucination taxonomy: extrinsic versus intrinsic

A recurring complaint in the paper is that prior work treats "hallucination" and "factuality" as interchangeable, which they are not. HalluLens defines hallucination as model output that is inconsistent with its source, where the source is either the training data or the input context. Factuality, by contrast, concerns correctness against established world knowledge verified by external sources ^[1]. The two come apart: a response can be faithful to a model's training data yet factually wrong because that data was outdated, and a response can be factually correct yet still count as a hallucination if it contradicts the provided input ^[1]. Holding hallucination and factuality separate lets the benchmark probe a model's internal consistency rather than its coverage of an ever-changing external world.

Within hallucination, HalluLens draws a second line ^[1]:

- Extrinsic hallucination is generation that is not consistent with the training data and that can be neither supported nor refuted by any input context. It typically arises in open-ended generation, when a model fills a knowledge gap or fails to recognize the boundary of what it knows. Extrinsic hallucination reflects limitations in how a model absorbs knowledge during training and whether it can recognize what lies outside that knowledge.
- Intrinsic hallucination is generation that is not consistent with the input context. The model misreads or contradicts the prompt, or asserts content the input does not support. It shows up in grounded tasks such as machine translation, summarization, and question answering over a supplied document, and reflects a failure of inference-time consistency rather than a gap in stored knowledge.

This split drives the benchmark's structure: extrinsic hallucination needs dynamically generated, leakage-resistant tasks because it depends on the model's training knowledge, whereas intrinsic hallucination can be measured with existing context-grounded datasets ^[1].
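
The two distinctions above can be read as a small decision procedure. The Python sketch below is illustrative only: the `Label` enum, the `classify` function, and its three boolean inputs are hypothetical names rather than anything from the HalluLens codebase, and it simplifies the taxonomy by treating the input context as the only source of truth whenever a context is supplied. Factual correctness is deliberately not an input.

```python
from enum import Enum


class Label(Enum):
    CONSISTENT = "consistent"
    INTRINSIC = "intrinsic hallucination"
    EXTRINSIC = "extrinsic hallucination"


def classify(has_context: bool,
             consistent_with_context: bool,
             consistent_with_training_data: bool) -> Label:
    """Hypothetical mapping of the taxonomy onto three yes/no checks."""
    if has_context:
        # Grounded task: the supplied input is the source of truth.
        return Label.CONSISTENT if consistent_with_context else Label.INTRINSIC
    # Open-ended generation: the training data is the source of truth.
    return Label.CONSISTENT if consistent_with_training_data else Label.EXTRINSIC


# A response that matches training data but contradicts the supplied document
# is still labeled as an intrinsic hallucination.
grounded_case = classify(True, False, True)
open_ended_case = classify(False, True, False)
```

In practice each boolean would itself be the output of a judge or verifier, which is where most of the difficulty lies.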

The dynamic test-set generation and why it matters

The headline methodological contribution is that HalluLens does not ship a fixed set of extrinsic questions. Static benchmarks decay over time because the test items, once published, are scraped into later training sets; a model can then appear to hallucinate less simply because it has memorized the answers ^[1]^[3]. To break that cycle, HalluLens regenerates its extrinsic test items at evaluation time from seed corpora, so the exact prompts are not present in any pre-existing dataset and cannot be trained on in advance ^[1]^[3].

The obvious risk with on-the-fly generation is instability: if the test set changes every run, scores might not be comparable. The authors address this by controlling difficulty and sampling, and they report empirically that the dynamic procedure is reproducible, with PreciseWikiQA showing an average standard deviation of under about 1% across three independent runs ^[1]. The benchmark therefore aims to be both leakage-resistant and stable enough to compare models fairly.
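
A stability check of this kind could be summarized as follows. This is a minimal sketch under stated assumptions: the scores are invented for illustration and are not from the paper, the `average_std` helper is a hypothetical name, and the choice of the sample standard deviation averaged over models is an assumption about how such a figure might be computed.

```python
from statistics import mean, stdev

# Hypothetical scores (percent) for two models over three regenerated test sets.
# These values are invented for illustration and are not from the paper.
runs = {
    "model_a": {"hallucination_rate": [45.2, 44.8, 45.5]},
    "model_b": {"hallucination_rate": [80.9, 81.6, 81.1]},
}


def average_std(runs: dict, metric: str) -> float:
    """Mean over models of the sample standard deviation across runs."""
    return mean(stdev(scores[metric]) for scores in runs.values())


spread = average_std(runs, "hallucination_rate")
print(f"average std across runs: {spread:.2f} points")
```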

Generation draws on several seed sources ^[1]^[3]:

- GoodWiki, a curated set of high-quality Wikipedia articles, seeds the Wikipedia-based tasks, with difficulty stratified using harmonic-centrality popularity scores from WikiRank so that easy and hard questions are balanced rather than dominated by long-tail topics.
- The Integrated Taxonomic Information System (ITIS) taxonomic database and a large worldwide medicines list seed the construction of plausible but non-existent entity names.
- A processed Wikipedia dump and a search API are used to verify generated answers and to confirm that supposedly non-existent entities really do not exist.
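
Difficulty-stratified sampling from a seed corpus can be sketched as below. The function name `stratified_sample`, the record layout, the number of bins, and the per-bin sample size are all assumptions made for illustration; the source states only that difficulty is stratified by popularity score.

```python
import random


def stratified_sample(pages, n_bins, per_bin, rng):
    """Sort pages by popularity, cut them into n_bins equal-sized bins,
    and draw per_bin pages from each bin (any remainder pages are dropped)."""
    ranked = sorted(pages, key=lambda p: p["popularity"], reverse=True)
    bin_size = len(ranked) // n_bins
    sample = []
    for b in range(n_bins):
        bucket = ranked[b * bin_size:(b + 1) * bin_size]
        sample.extend(rng.sample(bucket, per_bin))
    return sample


# Toy corpus: 100 hypothetical pages with made-up popularity scores.
pages = [{"title": f"page_{i}", "popularity": float(i)} for i in range(100)]

# An unseeded Random instance yields a different concrete test set on each run,
# while every run still covers all popularity bins equally.
run_1 = stratified_sample(pages, n_bins=10, per_bin=5, rng=random.Random())
```

Because each bin contributes the same number of items, two runs differ in their concrete questions but not in their difficulty mix, which is the property that makes scores comparable across regenerated sets.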

The specific tasks

HalluLens defines three extrinsic tasks, the first two grounded in Wikipedia knowledge and the third built around non-existent entities ^[1]^[3].

- PreciseWikiQA targets short, fact-seeking questions whose answers a model should know from training. The generator produces concise questions (single-word or short-phrase answers) from Wikipedia sections, filtering for objectively answerable items; the paper reports that about 97% of auto-generated reference answers were judged correct during validation ^[1]. It is the largest task, with on the order of 5,000 dynamically generated question-answer pairs ^[1].
- LongWiki targets long-form generation, where the model writes paragraph-length answers grounded in training knowledge. It uses roughly 250 prompts spanning intermediate difficulty levels, with responses capped near 1,024 tokens, and evaluates the factual claims inside the generated text rather than a single short answer ^[1].
- NonExistentRefusal probes whether a model will refuse, rather than confabulate, when asked about entities that do not exist, which is a direct test of whether it recognizes the boundary of its knowledge ^[1]. It has two subtasks:
  - MixedEntities builds non-existent names by mixing and swapping components of real names drawn from taxonomic and medical databases (animals, plants, and medicines), then verifies non-existence against those databases.
  - GeneratedEntities has language models invent fictional names for businesses, events, and products across many cities and countries, with non-existence confirmed through a web search API.
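
The MixedEntities idea of recombining parts of real names can be illustrated with a toy example. Everything here is an assumption for illustration: the `mixed_entities` function is a hypothetical name, the five-entry set stands in for a real taxonomic database, and absence from that toy set says nothing about whether a name exists in the real world.

```python
import itertools

# Tiny stand-in for a taxonomic database of real binomial names.
known_species = {
    "Panthera leo",
    "Panthera tigris",
    "Canis lupus",
    "Canis latrans",
    "Ursus arctos",
}


def mixed_entities(known):
    """Recombine the genus and species parts of known names, keeping only
    combinations that are absent from the known set."""
    genera = sorted({name.split()[0] for name in known})
    epithets = sorted({name.split()[1] for name in known})
    candidates = (f"{g} {e}" for g, e in itertools.product(genera, epithets))
    return [c for c in candidates if c not in known]


fake_names = mixed_entities(known_species)
print(fake_names[:3])
```

The membership filter plays the role of the verification step described above; a real pipeline would check candidates against the full databases rather than a handful of entries.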

For the intrinsic side, rather than introduce new tasks, HalluLens evaluates context-faithfulness using existing benchmarks that the authors find are not yet saturated, namely the HHEM summarization-consistency leaderboard from Vectara, ANAH 2.0 (with reference) for grounded question answering, and FaithEval for handling noisy or contradictory context ^[1].

Evaluation methodology

Each extrinsic task pairs a generator with an automatic evaluator, and the evaluators rely heavily on the LLM-as-a-judge approach using Llama 3.1 models, with the authors reporting agreement against human labels to justify the automation ^[1]:

- For PreciseWikiQA, a judge first decides whether the model refused (abstained for lack of knowledge) and then classifies non-refused answers as correct, incorrect, or unverifiable; incorrect and unverifiable answers both count as hallucinations. The reported metrics are false refusal rate, hallucination rate among non-refused answers, and correct answer rate, with the abstention and correctness judges reported at roughly 97% and 96% accuracy respectively ^[1].
- For LongWiki, generated text is decomposed into individual verifiable claims (using a large Llama model), each claim is checked against Wikipedia passages retrieved through named-entity-based selection, and the system reports false refusal rate together with precision, recall@32, and F1@32 over the supported claims ^[1].
- For NonExistentRefusal, a judge decides whether the response indicates belief in the non-existent entity; the reported metric is the false acceptance rate, the share of cases where the model failed to refuse, where lower is better. The evaluator is reported to agree with human assessment about 95% of the time ^[1].
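
Given per-item judge labels, the three PreciseWikiQA metrics described above could be aggregated roughly as follows. The function name, the label strings, and the choice of denominators (all items for refusal and correct-answer rates, answered items for the hallucination rate) are assumptions for this sketch, not the repository's actual interface.

```python
from collections import Counter


def precise_wikiqa_metrics(labels):
    """Aggregate judge labels, each one of 'refused', 'correct',
    'incorrect', or 'unverifiable', into the three reported rates."""
    counts = Counter(labels)
    total = len(labels)
    answered = total - counts["refused"]
    # Incorrect and unverifiable answers both count as hallucinations.
    hallucinated = counts["incorrect"] + counts["unverifiable"]
    return {
        "false_refusal_rate": counts["refused"] / total,
        "hallucination_rate": hallucinated / answered if answered else 0.0,
        "correct_answer_rate": counts["correct"] / total,
    }


# Ten hypothetical judged items.
labels = ["refused"] * 2 + ["correct"] * 5 + ["incorrect"] * 2 + ["unverifiable"]
metrics = precise_wikiqa_metrics(labels)
print(metrics)
```

Conditioning the hallucination rate on answered items keeps a model that refuses everything from scoring a trivially perfect hallucination rate without that showing up in its refusal and correct-answer rates.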

Notable results by model

Across the extrinsic tasks the paper finds a wide spread between model families and, importantly, that families adopt very different refusal strategies: some models abstain aggressively (high false refusal but lower hallucination when they do answer), while others almost never refuse and hallucinate heavily on unanswerable prompts ^[1]. GPT-4o posts the strongest Wikipedia-grounded accuracy, while Llama 3.1 405B is best at recognizing the boundary of its own knowledge on non-existent entities ^[1]. All numbers below are from the paper's evaluation tables ^[1].

PreciseWikiQA (short factual questions)

Model	False refusal %	Hallucination if not refused %	Correct answer rate %
Llama 3.1 8B	83.09	48.37	8.73
Llama 3.1 70B	52.03	37.30	30.08
Llama 3.1 405B	56.77	26.84	31.62
Llama 3.3 70B	20.01	50.19	39.84
Mistral 7B v0.3	7.77	81.19	17.34
Mistral Nemo	1.05	75.50	24.24
Gemma 2 9B	22.89	76.01	18.50
Gemma 2 27B	19.23	68.29	25.61
Qwen2.5 7B	13.85	85.22	12.73
Qwen2.5 14B	15.93	78.08	18.43
Claude 3 Haiku	63.64	51.30	17.71
Claude 3 Sonnet	56.68	56.24	18.96
GPT-4o	4.13	45.15	52.59

Source: HalluLens, Table 2 ^[1]. GPT-4o reaches the highest correct answer rate (52.59%); among open-weight models Llama 3.3 70B leads on correct answer rate, while Llama 3.1 405B has the lowest hallucination rate when it chooses to answer (26.84%).
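
The three columns of the table above appear to be linked: if every answered item is either correct or a hallucination, the correct answer rate should equal (1 − false refusal) × (1 − hallucination when not refused). The sketch below checks that relation on three rows copied from the table; the relation itself is an inference from the numbers rather than a formula quoted from the paper, and `implied_correct_rate` is a hypothetical helper name.

```python
# (false refusal %, hallucination if not refused %, correct answer rate %)
# copied from the PreciseWikiQA table above.
rows = {
    "Llama 3.1 405B": (56.77, 26.84, 31.62),
    "Mistral Nemo": (1.05, 75.50, 24.24),
    "GPT-4o": (4.13, 45.15, 52.59),
}


def implied_correct_rate(false_refusal, hallucination):
    """Correct rate implied when every answered item is either correct
    or counted as a hallucination (all values in percent)."""
    return (1 - false_refusal / 100) * (1 - hallucination / 100) * 100


for model, (refusal, halluc, reported) in rows.items():
    implied = implied_correct_rate(refusal, halluc)
    print(f"{model}: implied {implied:.2f}, reported {reported:.2f}")
```

Read this way, a low hallucination rate can coexist with a modest correct answer rate when the model refuses often, which is the pattern the Llama 3.1 405B row shows.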

LongWiki (long-form generation)

Model	False refusal %	Recall@32 %	Precision %	F1@32 %
Llama 3.1 8B	22.67	63.97	45.36	51.04
Llama 3.1 70B	13.47	66.27	53.74	56.23
Llama 3.1 405B	8.93	74.44	56.94	61.98
Llama 3.3 70B	0.67	75.46	52.42	60.02
Mistral 7B v0.3	0.13	58.03	39.45	46.08
Mistral Nemo	0.00	66.88	38.06	47.78
Gemma 2 9B	4.00	60.00	48.58	52.22
Gemma 2 27B	1.73	67.35	51.57	56.69
Qwen2.5 7B	0.53	70.94	44.53	53.28
Qwen2.5 14B	0.53	74.05	52.84	60.11
Claude 3 Haiku	8.67	58.95	65.24	58.54
Claude 3 Sonnet	6.93	65.03	56.97	58.50
GPT-4o	0.13	84.89	71.03	75.80

Source: HalluLens, Table 3 ^[1]. GPT-4o again leads (F1@32 of 75.80%). Among open-weight systems Llama 3.1 405B has the highest F1@32 (61.98%), followed closely by Qwen2.5 14B (60.11%) and Llama 3.3 70B (60.02%).
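
The article names the LongWiki metrics but does not spell out their formulas. The sketch below assumes the definition commonly used in long-form factuality evaluation, where precision is the share of a response's claims that are supported and recall@K is the number of supported claims divided by K, capped at 1. The function name `claim_scores` and the claim counts in the example are hypothetical.

```python
def claim_scores(supported: int, total_claims: int, k: int = 32):
    """Per-response precision, recall@K, and F1@K from claim counts.

    Assumes recall@K = min(supported / K, 1); this is a common convention,
    not a formula quoted from the source.
    """
    if total_claims == 0 or supported == 0:
        return 0.0, 0.0, 0.0
    precision = supported / total_claims
    recall = min(supported / k, 1.0)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# A hypothetical response with 30 extracted claims, 24 of them supported.
precision, recall, f1 = claim_scores(supported=24, total_claims=30)
print(f"precision={precision:.2f} recall@32={recall:.2f} f1@32={f1:.2f}")
```

Note that the table's F1@32 column is not the harmonic mean of the averaged precision and recall columns (for GPT-4o that would be about 77.3 rather than 75.80). That would be expected if F1 is computed per response and then averaged, although this reading is an inference and not something the article states.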

NonExistentRefusal (false acceptance rate, lower is better)

Model	MixedEntities %	GeneratedEntities %	Average %
Llama 3.1 8B	19.78	6.58	13.18
Llama 3.1 70B	40.73	7.32	24.02
Llama 3.1 405B	11.48	2.28	6.88
Llama 3.3 70B	66.86	14.77	40.82
Mistral 7B v0.3	94.74	77.98	86.36
Mistral Nemo	90.87	76.12	83.49
Gemma 2 9B	58.70	21.47	40.09
Gemma 2 27B	60.97	20.94	40.95
Qwen2.5 7B	64.46	34.24	49.35
Qwen2.5 14B	48.12	11.16	29.64
Claude 3 Haiku	69.08	10.43	39.75
Claude 3 Sonnet	60.49	13.40	36.94
GPT-4o	65.89	18.74	42.31

Source: HalluLens, Table 4 ^[1]. Llama 3.1 405B has by far the lowest average false acceptance (6.88%), indicating the best recognition of non-existent entities, while the Mistral models accept non-existent entities most often (above 83%). Notably, several strong systems including GPT-4o accept fabricated entities frequently here, showing that high factual accuracy does not imply good refusal behavior.
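
The Average column above appears to be the unweighted mean of the two subtask rates. The short sketch below recomputes it for four rows copied from the table and picks the model with the lowest value; treating the average as a simple mean is an inference from the numbers, not a statement from the paper.

```python
# (MixedEntities %, GeneratedEntities %) copied from the table above.
false_acceptance = {
    "Llama 3.1 405B": (11.48, 2.28),
    "Qwen2.5 14B": (48.12, 11.16),
    "GPT-4o": (65.89, 18.74),
    "Mistral 7B v0.3": (94.74, 77.98),
}

# Unweighted mean of the two subtasks per model.
averages = {model: sum(rates) / len(rates) for model, rates in false_acceptance.items()}

# Lower false acceptance is better, so the best model has the minimum average.
best = min(averages, key=averages.get)
print(best, round(averages[best], 2))
```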

Significance

HalluLens contributes on three fronts. First, by separating extrinsic from intrinsic hallucination and both from factuality, it offers a vocabulary that lets researchers say precisely which failure they are measuring, addressing a long-standing source of confusion in the literature ^[1]. Second, its dynamic generation directly tackles benchmark contamination, a growing concern as model training corpora expand to cover nearly everything published online; regenerating items keeps the evaluation meaningful over time without sacrificing reproducibility ^[1]^[3]. Third, the NonExistentRefusal task isolates a behavior that many other benchmarks miss, namely whether a model knows what it does not know and abstains, which the results show is largely independent of raw accuracy ^[1]. The paper also re-examines popular benchmarks such as TruthfulQA, arguing that much of it measures factuality rather than hallucination and that a substantial fraction of its items are mis-scored, which has implications for how the community interprets prior leaderboard numbers ^[1]. Because Meta released the generators and evaluators as open source, others can rerun the benchmark on new models under the same protocol ^[3].

Limitations

Several caveats follow from the design. The extrinsic tasks lean heavily on Wikipedia-derived knowledge through GoodWiki, so they probe the kind of encyclopedic facts Wikipedia covers well and may underrepresent specialized or non-English domains ^[1]. The pipeline depends on LLM-as-a-judge evaluators (largely Llama 3.1 models) for refusal detection, answer correctness, and claim verification; although the authors report high agreement with human labels, any systematic bias in those judges propagates into the scores ^[1]. Verifying that an entity is truly non-existent relies on external databases and a web search API, so coverage gaps in those sources could let a real but obscure entity be treated as fictional ^[1]^[3]. The dynamic procedure, while shown to be low-variance, still produces a different concrete test set on each run, which complicates exact reproduction of a specific published number even though aggregate scores are stable ^[1]. Finally, the commercial models evaluated were the versions available in 2024, and like any benchmark snapshot the reported standings reflect those checkpoints rather than later releases ^[1].

References

[1] Bang, Yejin; Ji, Ziwei; Schelten, Alan; Hartshorn, Anthony; Fowler, Tara; Zhang, Cheng; Cancedda, Nicola; Fung, Pascale. "HalluLens: LLM Hallucination Benchmark." arXiv:2504.17550, April 2025. https://arxiv.org/abs/2504.17550
[2] "Paper page - HalluLens: LLM Hallucination Benchmark." Hugging Face. https://huggingface.co/papers/2504.17550
[3] "facebookresearch/HalluLens: Codebase for LLM Textual Hallucination Benchmark." GitHub. https://github.com/facebookresearch/HalluLens
[4] "HalluLens: LLM Hallucination Benchmark." Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025), paper 2025.acl-long.1176. https://aclanthology.org/2025.acl-long.1176.pdf
