HaluEval (Hallucination Evaluation) is a large-scale benchmark designed to evaluate the ability of large language models (LLMs) to recognize hallucinated content. Introduced by Junyi Li, Xiaoxue Cheng, Wayne Xin Zhao, Jian-Yun Nie, and Ji-Rong Wen from Renmin University of China and the Université de Montréal, HaluEval was published as a long paper at the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023). The benchmark contains 35,000 samples spanning question answering, knowledge-grounded dialogue, text summarization, and general user queries, making it one of the most comprehensive resources for studying hallucination in LLMs.
HaluEval has become a widely cited benchmark in the AI safety and evaluation research community. Its release helped establish standardized methods for measuring how well language models can distinguish between factual content and fabricated information.
Large language models such as ChatGPT, GPT-4, and Claude have demonstrated impressive capabilities across a wide range of natural language processing tasks. However, these models also exhibit a persistent tendency to generate content that conflicts with source material or cannot be verified by factual knowledge. This phenomenon, known as hallucination, poses serious risks for real-world applications where factual accuracy is critical.
Before HaluEval, the research community lacked a large-scale, systematically constructed benchmark for evaluating hallucination recognition in LLMs. Existing approaches typically focused on specific downstream tasks or relied on small-scale human evaluations. The authors of HaluEval identified two core research questions that motivated the benchmark's creation: what types of content, and to what extent, LLMs tend to hallucinate, and whether LLMs can recognize hallucinated content when they see it.
These questions required a benchmark that could cover multiple task domains, include both automatically generated and human-annotated hallucination examples, and support controlled experiments with different hallucination patterns.
HaluEval consists of 35,000 total samples divided into two main categories: 30,000 task-specific automatically generated samples and 5,000 human-annotated general user query samples.
The task-specific portion of HaluEval draws from three established NLP tasks, with 10,000 hallucinated samples generated for each task:
| Task | Sample Count | Seed Dataset | Knowledge Source | Fields |
|---|---|---|---|---|
| Question Answering | 10,000 | HotpotQA | Wikipedia | Knowledge, question, correct answer, hallucinated answer |
| Knowledge-Grounded Dialogue | 10,000 | OpenDialKG | Wikipedia | Knowledge, dialogue history, correct response, hallucinated response |
| Text Summarization | 10,000 | CNN/Daily Mail | Source document | Document, correct summary, hallucinated summary |
For each task, every sample includes both a ground-truth output and a corresponding hallucinated counterpart. This paired structure allows researchers to evaluate whether a model can correctly distinguish between factual and fabricated content.
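A single QA entry in this paired format looks roughly like the sketch below. The field names follow the keys used in the released JSON files (treat the exact key names as an assumption); the values are illustrative, in the style of the HotpotQA seed data:

```python
# One paired QA sample in HaluEval's format. Field names mirror the
# released JSON data (an assumption); values are invented for illustration.
qa_sample = {
    "knowledge": "Arthur's Magazine (1844-1846) was an American literary "
                 "periodical published in Philadelphia.",
    "question": "Which magazine was started first, Arthur's Magazine or "
                "First for Women?",
    "right_answer": "Arthur's Magazine",
    "hallucinated_answer": "First for Women was started first.",
}

# The paired structure lets an evaluator show a model either answer
# and check whether the hallucinated one is flagged.
for key in ("right_answer", "hallucinated_answer"):
    print(f"{key}: {qa_sample[key]}")
```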
Question Answering. The QA samples are built on top of HotpotQA, a multi-hop question answering dataset that requires reasoning over multiple Wikipedia passages. Each sample contains a Wikipedia knowledge passage, a question, a ground-truth answer collected from HotpotQA, and a hallucinated answer generated by ChatGPT. The hallucinated answers are designed to appear plausible while containing factual errors.
Knowledge-Grounded Dialogue. The dialogue samples draw from OpenDialKG, a dataset of conversations grounded in knowledge graphs. Each sample includes knowledge from Wikipedia, a dialogue history providing conversational context, a correct response from OpenDialKG, and a hallucinated response generated by ChatGPT. The hallucinated responses may introduce facts not supported by the provided knowledge or distort the relationship between entities.
Text Summarization. The summarization samples use CNN/Daily Mail as seed data. Each sample contains the original document and two summaries: one ground-truth summary from the dataset and one hallucinated summary generated by ChatGPT. The hallucinated summaries may include details not present in the source document or misrepresent the information contained within it.
The general user query portion of HaluEval focuses on evaluating hallucination in open-ended interactions. The authors selected 5,000 queries from the Alpaca instruction-tuning dataset (a collection of 52,000 instruction-following examples). For each query, ChatGPT was prompted to generate three separate responses using a sampling temperature of 1.0. The authors then retained queries where the three responses showed low semantic similarity to one another, as this divergence often signals that the model is uncertain and may be hallucinating.
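The divergence criterion can be sketched as follows, using a simple bag-of-words cosine as a cheap stand-in for the semantic similarity measure used by the authors; the threshold value is an illustrative assumption:

```python
from collections import Counter
from itertools import combinations
import math

def cosine_sim(a: str, b: str) -> float:
    """Bag-of-words cosine similarity -- a stand-in for the semantic
    similarity measure used in the paper."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

def is_divergent(responses: list[str], threshold: float = 0.5) -> bool:
    """Flag a query whose sampled responses disagree with one another:
    low average pairwise similarity suggests the model is uncertain."""
    sims = [cosine_sim(a, b) for a, b in combinations(responses, 2)]
    return sum(sims) / len(sims) < threshold

# Three hypothetical responses sampled at temperature 1.0 for one query.
responses = [
    "The tower was completed in 1889.",
    "Construction finished around 1900.",
    "It opened to the public in 1920.",
]
print(is_divergent(responses))  # prints True: low overlap, so retained
```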
These 5,000 samples were annotated by human labelers who assessed whether each response contained hallucinated content. The labelers evaluated three aspects of each response: whether it contained unverifiable information, non-factual statements, or information irrelevant to the query.
A total of 30 annotators were selected from a larger candidate pool based on their English reading comprehension ability and their agreement with researcher-provided labels. Three independent annotators evaluated each response, and a max-voting strategy was applied to determine the final label. The inter-annotator agreement, measured by Fleiss' Kappa, reached 0.811, which falls within the "almost perfect agreement" range (0.81 to 1.00) on the Landis and Koch scale.
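Both aggregation steps are standard and can be sketched directly: `max_vote` resolves the three independent judgments, and `fleiss_kappa` computes the agreement statistic (the toy agreement table below is invented, not the benchmark's annotation data):

```python
from collections import Counter

def max_vote(labels: list[str]) -> str:
    """Majority vote over the independent annotator judgments."""
    return Counter(labels).most_common(1)[0][0]

def fleiss_kappa(tables: list[list[int]]) -> float:
    """Fleiss' kappa for items given as per-category rating counts,
    with every item rated by the same number of annotators."""
    n = sum(tables[0])                      # raters per item (3 here)
    N, k = len(tables), len(tables[0])      # items, categories
    p_j = [sum(row[j] for row in tables) / (N * n) for j in range(k)]
    P_e = sum(p * p for p in p_j)           # chance agreement
    P_bar = sum((sum(c * c for c in row) - n) / (n * (n - 1))
                for row in tables) / N      # observed agreement
    return (P_bar - P_e) / (1 - P_e)

votes = ["hallucinated", "hallucinated", "not hallucinated"]
print(max_vote(votes))  # prints: hallucinated

# Toy table: (hallucinated, not hallucinated) vote counts per response.
# The real annotation reached kappa = 0.811 over 5,000 items.
table = [[3, 0], [0, 3], [2, 1], [3, 0], [0, 3]]
print(round(fleiss_kappa(table), 3))  # prints: 0.732
```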
Of the 5,000 general query responses, 977 (19.5%) were found to contain hallucinated content.
The core methodological contribution of HaluEval is the sampling-then-filtering framework for generating high-quality hallucinated samples at scale. This two-step approach uses ChatGPT both to generate candidate hallucinations and to filter them for quality.
The sampling step employs two distinct generation strategies designed to produce diverse hallucinated outputs:
One-Pass Method. In this approach, a complete instruction is submitted to ChatGPT in a single prompt. The instruction includes three components: an intention description that defines the system's role and objective, hallucination pattern specifications that describe the types of errors to introduce, and few-shot demonstrations that illustrate expected outputs. ChatGPT then generates a hallucinated response in a single pass.
Conversational Method. This approach delivers instructions to ChatGPT sequentially across multiple conversational turns. Rather than providing all information upfront, the system progressively teaches ChatGPT about the task components, the types of hallucinations to generate, and the expected output format. By building understanding incrementally, this method tends to produce different types of hallucinated content compared to the one-pass approach.
Both methods use a temperature setting of 1.0 to encourage output diversity, with a maximum token limit of 256, frequency penalty of 0, and top-p of 1.0.
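The contrast between the two strategies can be sketched as OpenAI-style chat message lists. The instruction wording below is paraphrased for illustration and does not reproduce the paper's exact prompts; a real conversational run would also interleave assistant acknowledgements between turns:

```python
# Sketch of the two sampling strategies as chat message lists.
# Wording is illustrative, not the paper's actual instructions.

def one_pass_messages(knowledge, question, answer, pattern, demos):
    """One-pass: intention, hallucination pattern, and few-shot
    demonstrations are packed into a single prompt."""
    instruction = (
        "You are a system that writes plausible but hallucinated answers.\n"
        f"Error type to introduce: {pattern}\n"
        "Demonstrations:\n" + "\n".join(demos) + "\n"
        f"Knowledge: {knowledge}\nQuestion: {question}\n"
        f"Correct answer: {answer}\nHallucinated answer:"
    )
    return [{"role": "user", "content": instruction}]

def conversational_messages(knowledge, question, answer, pattern, demos):
    """Conversational: the same components are delivered turn by turn,
    building up the task before requesting the hallucinated output."""
    turns = [
        "You are a system that writes plausible but hallucinated answers.",
        f"Error type to introduce: {pattern}",
        "Here are some demonstrations:\n" + "\n".join(demos),
        f"Knowledge: {knowledge}\nQuestion: {question}\n"
        f"Correct answer: {answer}\nNow write the hallucinated answer.",
    ]
    return [{"role": "user", "content": t} for t in turns]

# Decoding parameters reported for both methods.
SAMPLING_PARAMS = {"temperature": 1.0, "max_tokens": 256,
                   "frequency_penalty": 0, "top_p": 1.0}
```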
The filtering step selects the most plausible and challenging hallucinated samples from the candidates generated in the sampling step. The authors designed filtering instructions enhanced with ground-truth examples, then used ChatGPT itself to assess which hallucinated samples would be hardest for a model to distinguish from genuine content. This filtering ensures that the benchmark tests are genuinely difficult rather than trivially solvable.
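Structurally, the filtering step is a selection over candidates. The sketch below uses a placeholder scorer standing in for the ChatGPT-based judgment (which receives the ground truth as a reference); everything about the scorer itself is an assumption for illustration:

```python
import random

def plausibility_score(candidate: str, ground_truth: str) -> float:
    """Placeholder for the ChatGPT-based filter, which judges how hard
    a candidate would be to distinguish from the genuine output.
    A random score stands in for the model call here."""
    return random.random()

def filter_best(candidates: list[str], ground_truth: str) -> str:
    """Keep the single most plausible candidate from the two sampling
    strategies (one-pass and conversational)."""
    return max(candidates, key=lambda c: plausibility_score(c, ground_truth))
```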
The generation process targets specific hallucination patterns for each task domain:
| Task | Hallucination Patterns |
|---|---|
| Question Answering | Comprehension errors, factualness errors, specificity errors, inference errors |
| Knowledge-Grounded Dialogue | Extrinsic-soft hallucinations, extrinsic-hard hallucinations, extrinsic-grouped hallucinations |
| Text Summarization | Factual hallucinations, non-factual hallucinations, intrinsic hallucinations |
QA Hallucination Patterns: comprehension errors (misunderstanding the question or the knowledge passage), factualness errors (answers that contradict real-world facts), specificity errors (answers at the wrong level of detail), and inference errors (incorrect reasoning from correct knowledge).
Dialogue Hallucination Patterns: extrinsic-soft hallucinations (plausible-sounding content that cannot be verified from the provided knowledge), extrinsic-hard hallucinations (content that conflicts with the provided knowledge), and extrinsic-grouped hallucinations (unverifiable content about a different entity of the same type).
Summarization Hallucination Patterns: factual hallucinations (content consistent with world knowledge but absent from the source document), non-factual hallucinations (content neither supported by the document nor factually correct), and intrinsic hallucinations (content that directly contradicts the source document).
The evaluation protocol in HaluEval is straightforward. A model is presented with either an authentic or hallucinated sample and must classify it correctly. The model outputs "Yes" if it detects a hallucination and "No" if it considers the content genuine. The primary evaluation metric is accuracy: the percentage of samples correctly classified.
For the task-specific evaluation, models are given the knowledge source (e.g., Wikipedia passage, dialogue history, or source document) along with either the ground-truth output or the hallucinated output, and must determine which one contains hallucinated content.
For the general user query evaluation, models are shown a user query and a ChatGPT response, then asked to judge whether the response contains hallucinations.
The evaluation uses a temperature setting of 0 for deterministic outputs.
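The protocol above reduces to binary classification scored by accuracy, which can be sketched as follows (the function names and sample format are illustrative, not the repository's API):

```python
def evaluate(samples, classify) -> float:
    """Accuracy of a Yes/No hallucination classifier.

    `samples` holds (content, is_hallucinated) pairs; `classify` returns
    "Yes" if it judges the content hallucinated, "No" otherwise.
    """
    correct = sum(
        (classify(text) == "Yes") == hallucinated
        for text, hallucinated in samples
    )
    return 100.0 * correct / len(samples)

# Toy run with a degenerate classifier that always answers "No":
# it gets the genuine sample right and misses the fabricated one.
samples = [("genuine output", False), ("fabricated output", True)]
print(evaluate(samples, lambda _: "No"))  # prints: 50.0
```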
The authors evaluated 11 LLMs on HaluEval across all four evaluation settings (QA, Dialogue, Summarization, and General). The following table presents the accuracy (%) of each model in identifying hallucinated content:
| Model | QA | Dialogue | Summarization | General |
|---|---|---|---|---|
| ChatGPT (gpt-3.5-turbo) | 62.59 | 72.40 | 58.53 | 79.44 |
| Claude 2 | 69.78 | 64.73 | 57.75 | 75.00 |
| Claude | 67.60 | 64.83 | 53.76 | 73.88 |
| Davinci003 | 49.65 | 68.37 | 48.07 | 80.40 |
| Davinci002 | 60.05 | 60.81 | 47.77 | 80.42 |
| GPT-3 | 49.21 | 50.02 | 51.23 | 72.72 |
| Llama 2 | 49.60 | 43.99 | 49.55 | 20.46 |
| ChatGLM | 47.93 | 44.41 | 48.57 | 30.92 |
| Falcon | 39.66 | 29.08 | 42.71 | 18.98 |
| Vicuna | 60.34 | 46.35 | 45.62 | 19.48 |
| Alpaca | 6.68 | 17.55 | 20.63 | 9.54 |
Several patterns emerge from these results. First, no single model dominates across all tasks. ChatGPT achieves the highest dialogue accuracy (72.40%) and strong general performance (79.44%), but Claude 2 outperforms it on QA (69.78% vs. 62.59%). Second, summarization is consistently the most challenging task for all models, with even the best-performing model (ChatGPT at 58.53%) barely exceeding chance-level performance. Third, open-source models generally lag behind proprietary models, with Alpaca performing near or below random chance on all tasks.
The authors conducted a detailed failure analysis using Latent Dirichlet Allocation (LDA) topic modeling to understand which subject areas are most challenging for models.
| Task | Total Failures | Most Common Failure Topic | Failures in That Topic |
|---|---|---|---|
| Question Answering | 3,109 | Film-related topics | 1,559 |
| Dialogue | 891 | Technology topics | 465 |
| Summarization | 3,868 | Factual pattern | 3,106 |
The LDA analysis identified ten topics across datasets. For QA, models most frequently failed on questions about films, companies, and bands. For dialogue, technology, climate, and language topics proved most difficult. The analysis also confirmed that certain hallucination patterns are inherently harder to detect: in summarization, factual hallucinations (which add plausible but unsupported details) accounted for over 80% of all failures.
The paper evaluated three strategies for improving hallucination recognition:
| Strategy | QA | Dialogue | Summarization | General |
|---|---|---|---|---|
| Baseline (ChatGPT) | 62.59 | 72.40 | 58.53 | 86.22 |
| + Knowledge Retrieval | 76.83 | 73.80 | N/A | 90.73 |
| + Chain-of-Thought | 59.58 | 71.39 | 61.21 | 86.50 |
| + Sample Contrast | 49.19 | 68.67 | 49.46 | N/A |
Knowledge Retrieval. Providing external knowledge (e.g., relevant Wikipedia passages) yielded the largest improvements. QA accuracy jumped from 62.59% to 76.83%, and general query accuracy improved from 86.22% to 90.73%. This finding supports the value of retrieval-augmented generation (RAG) as a hallucination mitigation strategy.
Chain-of-Thought (CoT) Reasoning. Adding intermediate reasoning steps produced mixed results. CoT slightly improved summarization accuracy (58.53% to 61.21%) but unexpectedly decreased QA performance (62.59% to 59.58%). The authors suggest that reasoning steps can sometimes lead models astray when the hallucinated content is very similar to the ground truth.
Sample Contrast. Comparing hallucinated samples side-by-side with ground-truth examples yielded the worst results overall, with QA accuracy dropping to 49.19%. This indicates that the hallucinated samples in HaluEval are sufficiently similar to genuine content that direct comparison actually confuses models rather than helping them.
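The three strategies amount to different transformations of the same base classification prompt, sketched below with illustrative wording (not the paper's exact prompts):

```python
# Illustrative prompt transformations for the three strategies.
BASE = ("Is the following answer to the question hallucinated? "
        "Reply Yes or No.\nQuestion: {q}\nAnswer: {a}\n")

def with_retrieval(q: str, a: str, passages: list[str]) -> str:
    """Knowledge retrieval: prepend retrieved evidence to the query."""
    return "Evidence:\n" + "\n".join(passages) + "\n" + BASE.format(q=q, a=a)

def with_cot(q: str, a: str) -> str:
    """Chain-of-thought: ask for reasoning steps before the verdict."""
    return BASE.format(q=q, a=a) + "Think step by step, then answer Yes or No."

def with_contrast(q: str, a: str, ground_truth: str) -> str:
    """Sample contrast: show the ground-truth output alongside the
    candidate for direct comparison."""
    return f"Reference answer: {ground_truth}\n" + BASE.format(q=q, a=a)
```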
In January 2024, the same research group (with the addition of authors Jie Chen and Ruiyang Ren) released HaluEval 2.0 as part of a larger empirical study on factuality hallucination in LLMs. The accompanying paper, titled "The Dawn After the Dark: An Empirical Study on Factuality Hallucination in Large Language Models," was published on arXiv (2401.03205).
HaluEval 2.0 contains 8,770 questions across five domains:
| Domain | Sample Count |
|---|---|
| Biomedicine | 1,535 |
| Finance | 1,125 |
| Science | 1,409 |
| Education | 1,701 |
| Open Domain | 3,000 |
The benchmark was constructed by extracting fact-intensive questions from six existing datasets, then selecting items where ChatGPT responses exhibited low semantic similarity (indicating likely hallucination). Human annotation was conducted with agreement rates between 92% and 94% across domains.
HaluEval 2.0 introduces a more refined hallucination taxonomy with six categories:
| Category | Description |
|---|---|
| Entity-error | Incorrect entities such as dates, names, or locations that contradict established facts |
| Relation-error | Wrong relationships between entities, including quantitative and chronological errors |
| Incompleteness | Responses that fail to cover all requested information |
| Outdatedness | Content that was historically correct but is no longer accurate |
| Overclaim | Claims that exceed the scope of factual knowledge |
| Unverifiability | Information that lacks verifiable sources |
The HaluEval 2.0 study evaluated 11 models (six open-source and five proprietary), analyzing how choices made during pre-training, fine-tuning, and inference affect the rate of factuality hallucination.
HaluEval-Wild is an independently developed benchmark (not by the original HaluEval team) that evaluates LLM hallucinations in real-world interaction settings. Published in March 2024 by Zhiying Zhu, Zhiqing Sun, and Yiming Yang, the paper was titled "HaluEval-Wild: Evaluating Hallucinations of Language Models in the Wild."
HaluEval-Wild collected 500 challenging user queries from the ShareGPT dataset (approximately 100,000 real user-LLM conversations). The authors used a Llama-2-based classifier to identify an initial pool of 8,067 potentially challenging queries, which were adversarially filtered and manually verified down to 500 final samples (100 per category).
The 500 queries are evenly distributed across five categories:
| Category | Abbreviation | Description |
|---|---|---|
| Out-of-Scope Information | OoS | Queries seeking details not present in training data, such as real-time or future information |
| Complex Reasoning | CR | Requests that exceed the model's logical reasoning and problem-solving capacity |
| Inappropriate Content | IC | Requests that could prompt the model to generate inappropriate content |
| Beyond-Modality Interaction | BM | Queries seeking input or output beyond text, such as images, sound, or video |
| Confused/Erroneous Queries | CE | Queries containing errors, such as nonsensical strings |
Reference answers were synthesized using GPT-4 combined with retrieval-augmented generation, retrieving the top five passages from an external search engine. GPT-4 was then used to judge whether model responses were hallucinated by comparing them against these reference answers.
| Model | Average Hallucination Rate (%) |
|---|---|
| GPT-4-Turbo | 18.64 |
| GPT-3.5-Turbo | 35.47 |
| Mixtral 8x7B | 51.51 |
| Mistral 7B | 57.43 |
| Llama-2-Chat 70B | 60.36 |
| Llama-2-Chat 13B | 54.75 |
| Llama-2-Chat 7B | 56.45 |
| Vicuna 13B | 61.57 |
| Alpaca 7B | 99.20 |
The results revealed that GPT-4-Turbo achieved the lowest hallucination rate at 18.64%, while Alpaca 7B exhibited near-total hallucination at 99.20%. A key finding was that knowledge-distilled models (such as Vicuna, which was trained on outputs from proprietary models) performed well on chatbot alignment benchmarks but showed high hallucination rates, underscoring a tension between conversational fluency and factual reliability. The study also found that RAG reduced GPT-4's hallucination rate from approximately 20% to 5% in a controlled test of 20 random samples.
HaluEval exists within a broader ecosystem of benchmarks designed to evaluate factuality and hallucination in language models. Each benchmark addresses different aspects of the problem:
| Benchmark | Focus | Size | Task Types | Year |
|---|---|---|---|---|
| HaluEval | Hallucination recognition in LLM outputs | 35,000 samples | QA, dialogue, summarization, general queries | 2023 |
| TruthfulQA | Truthfulness against common misconceptions | 817 questions | Open-ended and multiple-choice QA | 2022 |
| FActScore | Factual precision of generated biographies | Varies | Long-form text generation | 2023 |
| FEVER | Fact extraction and verification | 185,000+ claims | Claim verification | 2018 |
| HaluEval 2.0 | Domain-specific hallucination detection | 8,770 questions | Domain-specific QA (5 domains) | 2024 |
| HaluEval-Wild | Real-world hallucination evaluation | 500 queries | Open-ended interaction | 2024 |
HaluEval's distinguishing features include its large scale (35,000 samples), its coverage of multiple NLP tasks, and its paired sample structure that provides both ground-truth and hallucinated versions for direct comparison. TruthfulQA focuses specifically on questions where humans commonly hold false beliefs and is sometimes considered a measure of truthfulness rather than hallucination in the strict sense. FActScore evaluates factual precision at a fine-grained, sentence-level granularity. FEVER provides a much larger dataset but focuses on claim verification rather than free-form generation.
Several limitations of HaluEval have been noted by the research community:
Reliance on ChatGPT for Generation. The task-specific hallucinated samples were generated and filtered using gpt-3.5-turbo (ChatGPT). This means the benchmark primarily tests whether models can detect hallucination patterns characteristic of one specific model. Hallucination patterns produced by other architectures may differ, potentially limiting generalizability.
Accuracy as a Metric. Some researchers have questioned the use of accuracy as the primary evaluation metric. In the general user query dataset, only 19.5% of responses contain hallucinations, meaning a model that always predicts "no hallucination" would achieve approximately 80% accuracy. This class imbalance can inflate performance numbers and obscure meaningful differences between models.
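The arithmetic behind this concern is direct:

```python
# Only 977 of 5,000 general-query responses are hallucinated, so a
# degenerate judge that always answers "no hallucination" is correct
# on every non-hallucinated sample.
total, hallucinated = 5000, 977
always_no_accuracy = (total - hallucinated) / total
print(f"{always_no_accuracy:.2%}")  # prints: 80.46%
```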
Static Benchmark. Like many benchmarks, HaluEval represents a snapshot in time. As LLMs improve, the difficulty level of the benchmark may no longer adequately discriminate between models. This concern partly motivated the development of HaluEval 2.0 and HaluEval-Wild.
English-Only. HaluEval evaluates hallucination recognition exclusively in English, leaving open the question of how models perform on hallucination detection in other languages.
Scale of General Queries. While the task-specific portion contains 30,000 samples, the human-annotated general query portion contains only 5,000 samples. Given the diversity of real-world user queries, this relatively modest size may not capture the full range of scenarios where hallucination occurs.
HaluEval made several important contributions to the study of LLM hallucination:
Systematic Methodology. The sampling-then-filtering framework provided a scalable, reproducible approach for generating hallucinated test data. This methodology has been adopted and extended by subsequent research efforts.
Quantitative Baselines. By evaluating 11 models across four settings, HaluEval established the first comprehensive set of baselines for hallucination recognition. These baselines have served as reference points for subsequent work on hallucination detection and mitigation.
Mitigation Insights. The finding that knowledge retrieval significantly improves hallucination recognition (a 14-point improvement on QA) provided empirical support for retrieval-augmented approaches. This insight has influenced the design of production systems that combine LLMs with external knowledge sources.
Research Direction. HaluEval helped catalyze a wave of research into hallucination evaluation. The HaluEval family of benchmarks (HaluEval, HaluEval 2.0, and HaluEval-Wild) collectively address hallucination across task-specific, domain-specific, and real-world settings, providing researchers with a suite of complementary evaluation tools.
HaluEval is fully open source under the MIT License. The code, data, and evaluation scripts are available on the RUCAIBox GitHub repository.
To reproduce the benchmark evaluations, researchers need access to the OpenAI API (for ChatGPT and GPT-3 models) and the respective model weights for open-source models. The repository provides complete code for running evaluations and computing accuracy metrics.