HaluEval (Hallucination Evaluation) is a large-scale benchmark designed to evaluate the ability of large language models (LLMs) to recognize hallucinated content. Introduced by Junyi Li, Xiaoxue Cheng, Wayne Xin Zhao, Jian-Yun Nie, and Ji-Rong Wen from Renmin University of China and the Université de Montréal, HaluEval was published as a long paper at the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023). The benchmark contains 35,000 samples spanning question answering, knowledge-grounded dialogue, text summarization, and general user queries, making it one of the most comprehensive resources for studying hallucination in LLMs.
HaluEval has become a widely cited benchmark in the AI safety and evaluation research community. Its release helped establish standardized methods for measuring how well language models can distinguish between factual content and fabricated information.
Large language models such as ChatGPT, GPT-4, and Claude have demonstrated impressive capabilities across a wide range of natural language processing tasks. However, these models also exhibit a persistent tendency to generate content that conflicts with source material or cannot be verified by factual knowledge. This phenomenon, known as hallucination, poses serious risks for real-world applications where factual accuracy is critical.
Before HaluEval, the research community lacked a large-scale, systematically constructed benchmark for evaluating hallucination recognition in LLMs. Existing approaches typically focused on specific downstream tasks or relied on small-scale human evaluations. The authors of HaluEval identified two core research questions that motivated the benchmark's creation: what types of content, and to what extent, LLMs tend to hallucinate, and whether LLMs can recognize hallucinated content when they see it.
These questions required a benchmark that could cover multiple task domains, include both automatically generated and human-annotated hallucination examples, and support controlled experiments with different hallucination patterns.
HaluEval consists of 35,000 total samples divided into two main categories: 30,000 task-specific automatically generated samples and 5,000 human-annotated general user query samples.
The task-specific portion of HaluEval draws from three established NLP tasks, with 10,000 hallucinated samples generated for each task:
| Task | Sample Count | Seed Dataset | Knowledge Source | Fields |
|---|---|---|---|---|
| Question Answering | 10,000 | HotpotQA | Wikipedia | Knowledge, question, correct answer, hallucinated answer |
| Knowledge-Grounded Dialogue | 10,000 | OpenDialKG | Wikipedia | Knowledge, dialogue history, correct response, hallucinated response |
| Text Summarization | 10,000 | CNN/Daily Mail | Source document | Document, correct summary, hallucinated summary |
For each task, every sample includes both a ground-truth output and a corresponding hallucinated counterpart. This paired structure allows researchers to evaluate whether a model can correctly distinguish between factual and fabricated content.
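A single QA entry in this paired format looks roughly like the sketch below. The field names follow the keys used in the released JSON files (treat the exact key names as an assumption); the values are illustrative, in the style of the HotpotQA seed data:

```python
# One paired QA sample in HaluEval's format. Field names mirror the
# released JSON data (an assumption); values are invented for illustration.
qa_sample = {
    "knowledge": "Arthur's Magazine (1844-1846) was an American literary "
                 "periodical published in Philadelphia.",
    "question": "Which magazine was started first, Arthur's Magazine or "
                "First for Women?",
    "right_answer": "Arthur's Magazine",
    "hallucinated_answer": "First for Women was started first.",
}

# The paired structure lets an evaluator show a model either answer
# and check whether the hallucinated one is flagged.
for key in ("right_answer", "hallucinated_answer"):
    print(f"{key}: {qa_sample[key]}")
```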
Question Answering. The QA samples are built on top of HotpotQA, a multi-hop question answering dataset that requires reasoning over multiple Wikipedia passages. Each sample contains a Wikipedia knowledge passage, a question, a ground-truth answer collected from HotpotQA, and a hallucinated answer generated by ChatGPT. The hallucinated answers are designed to appear plausible while containing factual errors.
Knowledge-Grounded Dialogue. The dialogue samples draw from OpenDialKG, a dataset of conversations grounded in knowledge graphs. Each sample includes knowledge from Wikipedia, a dialogue history providing conversational context, a correct response from OpenDialKG, and a hallucinated response generated by ChatGPT. The hallucinated responses may introduce facts not supported by the provided knowledge or distort the relationship between entities.
Text Summarization. The summarization samples use CNN/Daily Mail as seed data. Each sample contains the original document and two summaries: one ground-truth summary from the dataset and one hallucinated summary generated by ChatGPT. The hallucinated summaries may include details not present in the source document or misrepresent the information contained within it.
The general user query portion of HaluEval focuses on evaluating hallucination in open-ended interactions. The authors selected 5,000 queries from the Alpaca instruction-tuning dataset (a collection of 52,000 instruction-following examples). For each query, ChatGPT was prompted to generate three separate responses using a sampling temperature of 1.0. The authors then retained queries where the three responses showed low semantic similarity to one another, as this divergence often signals that the model is uncertain and may be hallucinating.
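The divergence criterion can be sketched as follows, using a simple bag-of-words cosine as a cheap stand-in for the semantic similarity measure used by the authors; the threshold value is an illustrative assumption:

```python
from collections import Counter
from itertools import combinations
import math

def cosine_sim(a: str, b: str) -> float:
    """Bag-of-words cosine similarity -- a stand-in for the semantic
    similarity measure used in the paper."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

def is_divergent(responses: list[str], threshold: float = 0.5) -> bool:
    """Flag a query whose sampled responses disagree with one another:
    low average pairwise similarity suggests the model is uncertain."""
    sims = [cosine_sim(a, b) for a, b in combinations(responses, 2)]
    return sum(sims) / len(sims) < threshold

# Three hypothetical responses sampled at temperature 1.0 for one query.
responses = [
    "The tower was completed in 1889.",
    "Construction finished around 1900.",
    "It opened to the public in 1920.",
]
print(is_divergent(responses))  # prints True: low overlap, so retained
```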
These 5,000 samples were annotated by human labelers who assessed whether each response contained hallucinated content. The labelers evaluated three aspects of each response: whether it contained unverifiable information, non-factual statements, or information irrelevant to the query.
A total of 30 annotators were selected from a larger candidate pool based on their English reading comprehension ability and their agreement with researcher-provided labels. Three independent annotators evaluated each response, and a max-voting strategy was applied to determine the final label. The inter-annotator agreement, measured by Fleiss' Kappa, reached 0.811, which falls within the "almost perfect agreement" range (0.81 to 1.00) on the Landis and Koch scale.
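Both aggregation steps are standard and can be sketched directly: `max_vote` resolves the three independent judgments, and `fleiss_kappa` computes the agreement statistic (the toy agreement table below is invented, not the benchmark's annotation data):

```python
from collections import Counter

def max_vote(labels: list[str]) -> str:
    """Majority vote over the independent annotator judgments."""
    return Counter(labels).most_common(1)[0][0]

def fleiss_kappa(tables: list[list[int]]) -> float:
    """Fleiss' kappa for items given as per-category rating counts,
    with every item rated by the same number of annotators."""
    n = sum(tables[0])                      # raters per item (3 here)
    N, k = len(tables), len(tables[0])      # items, categories
    p_j = [sum(row[j] for row in tables) / (N * n) for j in range(k)]
    P_e = sum(p * p for p in p_j)           # chance agreement
    P_bar = sum((sum(c * c for c in row) - n) / (n * (n - 1))
                for row in tables) / N      # observed agreement
    return (P_bar - P_e) / (1 - P_e)

votes = ["hallucinated", "hallucinated", "not hallucinated"]
print(max_vote(votes))  # prints: hallucinated

# Toy table: (hallucinated, not hallucinated) vote counts per response.
# The real annotation reached kappa = 0.811 over 5,000 items.
table = [[3, 0], [0, 3], [2, 1], [3, 0], [0, 3]]
print(round(fleiss_kappa(table), 3))  # prints: 0.732
```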
Of the 5,000 general query responses, 977 (19.5%) were found to contain hallucinated content.
The core methodological contribution of HaluEval is the sampling-then-filtering framework for generating high-quality hallucinated samples at scale. This two-step approach uses ChatGPT both to generate candidate hallucinations and to filter them for quality.
The sampling step employs two distinct generation strategies designed to produce diverse hallucinated outputs:
One-Pass Method. In this approach, a complete instruction is submitted to ChatGPT in a single prompt. The instruction includes three components: an intention description that defines the system's role and objective, hallucination pattern specifications that describe the types of errors to introduce, and few-shot demonstrations that illustrate expected outputs. ChatGPT then generates a hallucinated response in a single pass.
Conversational Method. This approach delivers instructions to ChatGPT sequentially across multiple conversational turns. Rather than providing all information upfront, the system progressively teaches ChatGPT about the task components, the types of hallucinations to generate, and the expected output format. By building understanding incrementally, this method tends to produce different types of hallucinated content compared to the one-pass approach.
Both methods use a temperature setting of 1.0 to encourage output diversity, with a maximum token limit of 256, frequency penalty of 0, and top-p of 1.0.
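The contrast between the two strategies can be sketched as OpenAI-style chat message lists. The instruction wording below is paraphrased for illustration and does not reproduce the paper's exact prompts; a real conversational run would also interleave assistant acknowledgements between turns:

```python
# Sketch of the two sampling strategies as chat message lists.
# Wording is illustrative, not the paper's actual instructions.

def one_pass_messages(knowledge, question, answer, pattern, demos):
    """One-pass: intention, hallucination pattern, and few-shot
    demonstrations are packed into a single prompt."""
    instruction = (
        "You are a system that writes plausible but hallucinated answers.\n"
        f"Error type to introduce: {pattern}\n"
        "Demonstrations:\n" + "\n".join(demos) + "\n"
        f"Knowledge: {knowledge}\nQuestion: {question}\n"
        f"Correct answer: {answer}\nHallucinated answer:"
    )
    return [{"role": "user", "content": instruction}]

def conversational_messages(knowledge, question, answer, pattern, demos):
    """Conversational: the same components are delivered turn by turn,
    building up the task before requesting the hallucinated output."""
    turns = [
        "You are a system that writes plausible but hallucinated answers.",
        f"Error type to introduce: {pattern}",
        "Here are some demonstrations:\n" + "\n".join(demos),
        f"Knowledge: {knowledge}\nQuestion: {question}\n"
        f"Correct answer: {answer}\nNow write the hallucinated answer.",
    ]
    return [{"role": "user", "content": t} for t in turns]

# Decoding parameters reported for both methods.
SAMPLING_PARAMS = {"temperature": 1.0, "max_tokens": 256,
                   "frequency_penalty": 0, "top_p": 1.0}
```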
The filtering step selects the most plausible and challenging hallucinated samples from the candidates generated in the sampling step. The authors designed filtering instructions enhanced with ground-truth examples, then used ChatGPT itself to assess which hallucinated samples would be hardest for a model to distinguish from genuine content. This filtering ensures that the benchmark tests are genuinely difficult rather than trivially solvable.
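Structurally, the filtering step is a selection over candidates. The sketch below uses a placeholder scorer standing in for the ChatGPT-based judgment (which receives the ground truth as a reference); everything about the scorer itself is an assumption for illustration:

```python
import random

def plausibility_score(candidate: str, ground_truth: str) -> float:
    """Placeholder for the ChatGPT-based filter, which judges how hard
    a candidate would be to distinguish from the genuine output.
    A random score stands in for the model call here."""
    return random.random()

def filter_best(candidates: list[str], ground_truth: str) -> str:
    """Keep the single most plausible candidate from the two sampling
    strategies (one-pass and conversational)."""
    return max(candidates, key=lambda c: plausibility_score(c, ground_truth))
```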
The generation process targets specific hallucination patterns for each task domain:
| Task | Hallucination Patterns |
|---|---|
| Question Answering | Comprehension errors, factualness errors, specificity errors, inference errors |
| Knowledge-Grounded Dialogue | Extrinsic-soft hallucinations, extrinsic-hard hallucinations, extrinsic-grouped hallucinations |
| Text Summarization | Factual hallucinations, non-factual hallucinations, intrinsic hallucinations |
QA Hallucination Patterns: comprehension errors (misunderstanding the question or the knowledge passage), factualness errors (answers that contradict real-world facts), specificity errors (answers at the wrong level of detail), and inference errors (incorrect reasoning from correct knowledge).
Dialogue Hallucination Patterns: extrinsic-soft hallucinations (plausible-sounding content that cannot be verified from the provided knowledge), extrinsic-hard hallucinations (content that conflicts with the provided knowledge), and extrinsic-grouped hallucinations (unverifiable content about a different entity of the same type).
Summarization Hallucination Patterns: factual hallucinations (content consistent with world knowledge but absent from the source document), non-factual hallucinations (content neither supported by the document nor factually correct), and intrinsic hallucinations (content that directly contradicts the source document).
The evaluation protocol in HaluEval is straightforward. A model is presented with either an authentic or hallucinated sample and must classify it correctly. The model outputs "Yes" if it detects a hallucination and "No" if it considers the content genuine. The primary evaluation metric is accuracy: the percentage of samples correctly classified.
For the task-specific evaluation, models are given the knowledge source (e.g., Wikipedia passage, dialogue history, or source document) along with either the ground-truth output or the hallucinated output, and must determine which one contains hallucinated content.
For the general user query evaluation, models are shown a user query and a ChatGPT response, then asked to judge whether the response contains hallucinations.
The evaluation uses a temperature setting of 0 for deterministic outputs.
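The protocol above reduces to binary classification scored by accuracy, which can be sketched as follows (the function names and sample format are illustrative, not the repository's API):

```python
def evaluate(samples, classify) -> float:
    """Accuracy of a Yes/No hallucination classifier.

    `samples` holds (content, is_hallucinated) pairs; `classify` returns
    "Yes" if it judges the content hallucinated, "No" otherwise.
    """
    correct = sum(
        (classify(text) == "Yes") == hallucinated
        for text, hallucinated in samples
    )
    return 100.0 * correct / len(samples)

# Toy run with a degenerate classifier that always answers "No":
# it gets the genuine sample right and misses the fabricated one.
samples = [("genuine output", False), ("fabricated output", True)]
print(evaluate(samples, lambda _: "No"))  # prints: 50.0
```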
The authors evaluated 11 LLMs on HaluEval across all four evaluation settings (QA, Dialogue, Summarization, and General). The following table presents the accuracy (%) of each model in identifying hallucinated content:
| Model | QA | Dialogue | Summarization | General |
|---|---|---|---|---|
| ChatGPT (gpt-3.5-turbo) | 62.59 | 72.40 | 58.53 | 79.44 |
| Claude 2 | 69.78 | 64.73 | 57.75 | 75.00 |
| Claude | 67.60 | 64.83 | 53.76 | 73.88 |
| Davinci003 | 49.65 | 68.37 | 48.07 | 80.40 |
| Davinci002 | 60.05 | 60.81 | 47.77 | 80.42 |
| GPT-3 | 49.21 | 50.02 | 51.23 | 72.72 |
| Llama 2 | 49.60 | 43.99 | 49.55 | 20.46 |
| ChatGLM | 47.93 | 44.41 | 48.57 | 30.92 |
| Falcon | 39.66 | 29.08 | 42.71 | 18.98 |
| Vicuna | 60.34 | 46.35 | 45.62 | 19.48 |
| Alpaca | 6.68 | 17.55 | 20.63 | 9.54 |
Several patterns emerge from these results. First, no single model dominates across all tasks. ChatGPT achieves the highest dialogue accuracy (72.40%) and strong general performance (79.44%), but Claude 2 outperforms it on QA (69.78% vs. 62.59%). Second, summarization is consistently the most challenging task for all models, with even the best-performing model (ChatGPT at 58.53%) barely exceeding chance-level performance. Third, open-source models generally lag behind proprietary models, with Alpaca performing near or below random chance on all tasks.
The authors conducted a detailed failure analysis using Latent Dirichlet Allocation (LDA) topic modeling to understand which subject areas are most challenging for models.
| Task | Total Failures | Most Common Failure Topic | Failures in That Topic |
|---|---|---|---|
| Question Answering | 3,109 | Film-related topics | 1,559 |
| Dialogue | 891 | Technology topics | 465 |
| Summarization | 3,868 | Factual pattern | 3,106 |
The LDA analysis identified ten topics across datasets. For QA, models most frequently failed on questions about films, companies, and bands. For dialogue, technology, climate, and language topics proved most difficult. The analysis also confirmed that certain hallucination patterns are inherently harder to detect: in summarization, factual hallucinations (which add plausible but unsupported details) accounted for over 80% of all failures.
The paper evaluated three strategies for improving hallucination recognition:
| Strategy | QA | Dialogue | Summarization | General |
|---|---|---|---|---|
| Baseline (ChatGPT) | 62.59 | 72.40 | 58.53 | 86.22 |
| + Knowledge Retrieval | 76.83 | 73.80 | N/A | 90.73 |
| + Chain-of-Thought | 59.58 | 71.39 | 61.21 | 86.50 |
| + Sample Contrast | 49.19 | 68.67 | 49.46 | N/A |
Knowledge Retrieval. Providing external knowledge (e.g., relevant Wikipedia passages) yielded the largest improvements. QA accuracy jumped from 62.59% to 76.83%, and general query accuracy improved from 86.22% to 90.73%. This finding supports the value of retrieval-augmented generation (RAG) as a hallucination mitigation strategy.
Chain-of-Thought (CoT) Reasoning. Adding intermediate reasoning steps produced mixed results. CoT slightly improved summarization accuracy (58.53% to 61.21%) but unexpectedly decreased QA performance (62.59% to 59.58%). The authors suggest that reasoning steps can sometimes lead models astray when the hallucinated content is very similar to the ground truth.
Sample Contrast. Comparing hallucinated samples side-by-side with ground-truth examples yielded the worst results overall, with QA accuracy dropping to 49.19%. This indicates that the hallucinated samples in HaluEval are sufficiently similar to genuine content that direct comparison actually confuses models rather than helping them.
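The three strategies amount to different transformations of the same base classification prompt, sketched below with illustrative wording (not the paper's exact prompts):

```python
# Illustrative prompt transformations for the three strategies.
BASE = ("Is the following answer to the question hallucinated? "
        "Reply Yes or No.\nQuestion: {q}\nAnswer: {a}\n")

def with_retrieval(q: str, a: str, passages: list[str]) -> str:
    """Knowledge retrieval: prepend retrieved evidence to the query."""
    return "Evidence:\n" + "\n".join(passages) + "\n" + BASE.format(q=q, a=a)

def with_cot(q: str, a: str) -> str:
    """Chain-of-thought: ask for reasoning steps before the verdict."""
    return BASE.format(q=q, a=a) + "Think step by step, then answer Yes or No."

def with_contrast(q: str, a: str, ground_truth: str) -> str:
    """Sample contrast: show the ground-truth output alongside the
    candidate for direct comparison."""
    return f"Reference answer: {ground_truth}\n" + BASE.format(q=q, a=a)
```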
In January 2024, the same research group (with the addition of authors Jie Chen and Ruiyang Ren) released HaluEval 2.0 as part of a larger empirical study on factuality hallucination in LLMs. The accompanying paper, titled "The Dawn After the Dark: An Empirical Study on Factuality Hallucination in Large Language Models," was published on arXiv (2401.03205).
HaluEval 2.0 contains 8,770 questions across five domains:
| Domain | Sample Count |
|---|---|
| Biomedicine | 1,535 |
| Finance | 1,125 |
| Science | 1,409 |
| Education | 1,701 |
| Open Domain | 3,000 |
The benchmark was constructed by extracting fact-intensive questions from six existing datasets, then selecting items where ChatGPT responses exhibited low semantic similarity (indicating likely hallucination). Human annotation was conducted with agreement rates between 92% and 94% across domains.
HaluEval 2.0 introduces a more refined hallucination taxonomy with six categories:
| Category | Description |
|---|---|
| Entity-error | Incorrect entities such as dates, names, or locations that contradict established facts |
| Relation-error | Wrong relationships between entities, including quantitative and chronological errors |
| Incompleteness | Responses that fail to cover all requested information |
| Outdatedness | Content that was historically correct but is no longer accurate |
| Overclaim | Claims that exceed the scope of factual knowledge |
| Unverifiability | Information that lacks verifiable sources |
The HaluEval 2.0 study evaluated 11 models (six open-source and five proprietary), analyzing how choices made during pre-training, fine-tuning, and inference affect the rate of factuality hallucination.
HaluEval-Wild is an independently developed benchmark (not by the original HaluEval team) that evaluates LLM hallucinations in real-world interaction settings. Published in March 2024 by Zhiying Zhu, Zhiqing Sun, and Yiming Yang, the paper was titled "HaluEval-Wild: Evaluating Hallucinations of Language Models in the Wild."
HaluEval-Wild collected 500 challenging user queries from the ShareGPT dataset (approximately 100,000 real user-LLM conversations). The authors used a Llama-2-based classifier to identify an initial pool of 8,067 potentially challenging queries, which were adversarially filtered and manually verified down to 500 final samples (100 per category).
The 500 queries are evenly distributed across five categories:
| Category | Abbreviation | Description |
|---|---|---|
| Out-of-Scope Information | OoS | Queries seeking details not present in training data, such as real-time or future information |
| Complex Reasoning | CR | Requests that exceed the model's logical reasoning and problem-solving capacity |
| Inappropriate Content | IC | Requests that could prompt the model to generate inappropriate content |
| Beyond-Modality Interaction | BM | Queries seeking input or output beyond text, such as images, sound, or video |
| Confused/Erroneous Queries | CE | Queries containing errors, such as nonsensical strings |
Reference answers were synthesized using GPT-4 combined with retrieval-augmented generation, retrieving the top five passages from an external search engine. GPT-4 was then used to judge whether model responses were hallucinated by comparing them against these reference answers.
| Model | Average Hallucination Rate (%) |
|---|---|
| GPT-4-Turbo | 18.64 |
| GPT-3.5-Turbo | 35.47 |
| Mixtral 8x7B | 51.51 |
| Mistral 7B | 57.43 |
| Llama-2-Chat 70B | 60.36 |
| Llama-2-Chat 13B | 54.75 |
| Llama-2-Chat 7B | 56.45 |
| Vicuna 13B | 61.57 |
| Alpaca 7B | 99.20 |
The results revealed that GPT-4-Turbo achieved the lowest hallucination rate at 18.64%, while Alpaca 7B exhibited near-total hallucination at 99.20%. A key finding was that knowledge-distilled models (such as Vicuna, which was trained on outputs from proprietary models) performed well on chatbot alignment benchmarks but showed high hallucination rates, underscoring a tension between conversational fluency and factual reliability. The study also found that RAG reduced GPT-4's hallucination rate from approximately 20% to 5% in a controlled test of 20 random samples.
HaluEval exists within a broader ecosystem of benchmarks designed to evaluate factuality and hallucination in language models. Each benchmark addresses different aspects of the problem:
| Benchmark | Focus | Size | Task Types | Year |
|---|---|---|---|---|
| HaluEval | Hallucination recognition in LLM outputs | 35,000 samples | QA, dialogue, summarization, general queries | 2023 |
| TruthfulQA | Truthfulness against common misconceptions | 817 questions | Open-ended and multiple-choice QA | 2022 |
| FActScore | Factual precision of generated biographies | Varies | Long-form text generation | 2023 |
| FEVER | Fact extraction and verification | 185,000+ claims | Claim verification | 2018 |
| HaluEval 2.0 | Domain-specific hallucination detection | 8,770 questions | Domain-specific QA (5 domains) | 2024 |
| HaluEval-Wild | Real-world hallucination evaluation | 500 queries | Open-ended interaction | 2024 |
HaluEval's distinguishing features include its large scale (35,000 samples), its coverage of multiple NLP tasks, and its paired sample structure that provides both ground-truth and hallucinated versions for direct comparison. TruthfulQA focuses specifically on questions where humans commonly hold false beliefs and is sometimes considered a measure of truthfulness rather than hallucination in the strict sense. FActScore evaluates factual precision at a fine-grained, sentence-level granularity. FEVER provides a much larger dataset but focuses on claim verification rather than free-form generation.
Several limitations of HaluEval have been noted by the research community:
Reliance on ChatGPT for Generation. The task-specific hallucinated samples were generated and filtered using gpt-3.5-turbo (ChatGPT). This means the benchmark primarily tests whether models can detect hallucination patterns characteristic of one specific model. Hallucination patterns produced by other architectures may differ, potentially limiting generalizability.
Accuracy as a Metric. Some researchers have questioned the use of accuracy as the primary evaluation metric. In the general user query dataset, only 19.5% of responses contain hallucinations, meaning a model that always predicts "no hallucination" would achieve approximately 80% accuracy. This class imbalance can inflate performance numbers and obscure meaningful differences between models.
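The arithmetic behind this concern is direct:

```python
# Only 977 of 5,000 general-query responses are hallucinated, so a
# degenerate judge that always answers "no hallucination" is correct
# on every non-hallucinated sample.
total, hallucinated = 5000, 977
always_no_accuracy = (total - hallucinated) / total
print(f"{always_no_accuracy:.2%}")  # prints: 80.46%
```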
Static Benchmark. Like many benchmarks, HaluEval represents a snapshot in time. As LLMs improve, the difficulty level of the benchmark may no longer adequately discriminate between models. This concern partly motivated the development of HaluEval 2.0 and HaluEval-Wild.
English-Only. HaluEval evaluates hallucination recognition exclusively in English, leaving open the question of how models perform on hallucination detection in other languages.
Scale of General Queries. While the task-specific portion contains 30,000 samples, the human-annotated general query portion contains only 5,000 samples. Given the diversity of real-world user queries, this relatively modest size may not capture the full range of scenarios where hallucination occurs.
HaluEval made several important contributions to the study of LLM hallucination:
Systematic Methodology. The sampling-then-filtering framework provided a scalable, reproducible approach for generating hallucinated test data. This methodology has been adopted and extended by subsequent research efforts.
Quantitative Baselines. By evaluating 11 models across four settings, HaluEval established the first comprehensive set of baselines for hallucination recognition. These baselines have served as reference points for subsequent work on hallucination detection and mitigation.
Mitigation Insights. The finding that knowledge retrieval significantly improves hallucination recognition (a 14-point improvement on QA) provided empirical support for retrieval-augmented approaches. This insight has influenced the design of production systems that combine LLMs with external knowledge sources.
Research Direction. HaluEval helped catalyze a wave of research into hallucination evaluation. The HaluEval family of benchmarks (HaluEval, HaluEval 2.0, and HaluEval-Wild) collectively address hallucination across task-specific, domain-specific, and real-world settings, providing researchers with a suite of complementary evaluation tools.
HaluEval is fully open source under the MIT License. The code, data, and evaluation scripts are available on the RUCAIBox GitHub repository.
To reproduce the benchmark evaluations, researchers need access to the OpenAI API (for ChatGPT and GPT-3 models) and the respective model weights for open-source models. The repository provides complete code for running evaluations and computing accuracy metrics.