Hallucination
Hallucination in artificial intelligence (AI) refers to outputs produced by generative models that are fluent and plausible but not supported by the source material, the input, or external reality. In large language model (LLM) systems and multimodal models, hallucinations include fabricated facts, incorrect citations, and contradictions of the provided context. Research distinguishes errors of faithfulness to a given input from errors of factuality with respect to world knowledge, and documents that hallucination is prevalent across tasks such as abstractive summarization, question answering, dialogue, and vision–language reasoning.[1][2][3][4]
While rates depend on model, data, and task, a common view is that hallucinations remain persistent and difficult to fully eliminate in current probabilistic next-token prediction systems. They can be mitigated via grounding (for example retrieval-augmented generation, RAG), training and decoding choices, post-hoc detection, and better evaluation protocols.[5][6][7][8][1]
Terminology and definitions
- Hallucination (general): generation of content that is not supported by the source or reality, despite surface-level coherence. The term draws on psychology but in AI denotes a technical failure mode rather than a human-like perceptual phenomenon.[1]
- Faithfulness vs. factuality: Faithfulness evaluates consistency with the given input (for example a source article), while factuality evaluates agreement with established external facts; a model can be faithful yet factually wrong if the input itself is wrong, or unfaithful (adding unsupported details) while remaining factually plausible.[2][1]
- Intrinsic vs. extrinsic hallucinations: Intrinsic (input-contradicting) errors conflict with the source; extrinsic (unsupported) errors introduce unverifiable or new content not grounded in the source.[1]
Taxonomy
Surveys and position papers propose overlapping taxonomies organized by what goes wrong and by why it goes wrong.[1][9][4]
By manifestation (what goes wrong)
- Entity/content fabrication (invented names, dates, citations).[1]
- Logical inconsistency (self-contradiction, faulty reasoning).[9]
- Numerical/temporal errors (miscalculation, wrong units or times).[1]
- Attribution/citation errors (fabricated sources or misattributions).[1]
- Multimodal misalignment (text describes objects not present in the image/video).[4]
By cause (why it goes wrong)
- Parametric knowledge gaps in the model’s weights (out-of-date or missing facts).[9]
- Training data issues (noise, spurious correlations, reporting bias).[1]
- Objective and decoding effects (likelihood-only training; sampling strategies that favor fluency over truthfulness).[2][9]
- Context handling limits (failure to use provided context; retrieval errors).[5]
- Multimodal fusion failures (over-reliance on language priors vs. visual evidence).[4]
Examples by task (illustrative)
| Task | Typical input | Hallucination example | Primary type |
|---|---|---|---|
| Abstractive summarization | News article | Adds a quote not present in the article | Extrinsic, attribution[2] |
| Question answering | Open-domain question | Confident but false answer to a factual question | Factuality[3] |
| Dialogue | User prompt | Fabricated API calls, sources, or policies | Fabrication[1] |
| Computer vision + language | Image | Mentions an object absent from the image | Multimodal misalignment[4] |
Measurement and evaluation
Researchers evaluate hallucination with specialized benchmarks and metrics:
- TruthfulQA: evaluates whether models avoid widely held misconceptions; many models mimic human falsehoods without explicit grounding.[3]
- Human faithfulness annotation: in summarization, human studies reveal substantial hallucinated content across neural systems, highlighting the gap between ROUGE and factuality.[2]
- Surveys and taxonomies compile intrinsic/extrinsic error rates and categorize metrics (for example entailment-based, QA-based, and citation-/evidence-based); an entailment-based check is sketched below.[1][9]
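For example, an entailment-based faithfulness check can be sketched as follows, assuming an off-the-shelf natural language inference (NLI) checkpoint from the Hugging Face hub (the model name, example strings, and any decision threshold are illustrative assumptions rather than a metric defined in the cited work): each generated claim is scored by the probability that the source passage entails it, and low-scoring claims are treated as candidate faithfulness errors.
```python
# Illustrative entailment-based faithfulness check (not any paper's official metric).
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

MODEL_NAME = "roberta-large-mnli"  # assumption: any off-the-shelf NLI checkpoint works
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME).eval()

def entailment_score(source: str, claim: str) -> float:
    """Probability that the source entails the claim; low values suggest unfaithful content."""
    inputs = tokenizer(source, claim, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = model(**inputs).logits.softmax(dim=-1)[0]
    # Look up the entailment class from the checkpoint's own label map (names vary by model).
    entail_id = next(int(i) for i, name in model.config.id2label.items()
                     if "entail" in name.lower())
    return probs[entail_id].item()

source = "The company reported revenue of $3.2 billion in 2021."
for claim in ["Revenue was $3.2 billion in 2021.",          # faithful to the source
              "The CEO resigned after the 2021 results."]:  # extrinsic, unsupported
    print(f"{entailment_score(source, claim):.2f}  {claim}")
```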
Automatic detection methods include:
| Family | Example/Representative work | Evidence required | Notes |
|---|---|---|---|
| Consistency-based | SelfCheckGPT | No (uses multiple generations) | Flags sentences that vary across samples as likely hallucinated.[7] |
| Retrieval-verification | RAG + verifier | Yes (documents) | Cross-checks output against retrieved passages; supports provenance.[5] |
| Semantic-uncertainty | Semantic entropy | No (uses distributional signals) | Estimates uncertainty in meaning space to detect confabulations.[8] |
| NLI/entailment scoring | Claim vs. source | Optional | Scores faithfulness to context; common in summarization evaluation.[2] |
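The consistency-based family can be illustrated with a simplified sketch: sample several answers to the same prompt and flag sentences of the primary answer that are poorly supported across the samples. The lexical-overlap scorer below is a toy stand-in for the BERTScore, question-answering, and NLI scorers used in SelfCheckGPT, and the threshold and example strings are illustrative assumptions.
```python
# Simplified consistency check in the spirit of SelfCheckGPT [7]: sentences of the main
# answer that are poorly supported by independently sampled answers are flagged.
import re

def content_words(text: str) -> set[str]:
    """Lowercased alphanumeric tokens longer than three characters."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if len(w) > 3}

def support(sentence: str, sample: str) -> float:
    """Fraction of the sentence's content words that also appear in one sampled answer."""
    words = content_words(sentence)
    return len(words & content_words(sample)) / max(len(words), 1)

def flag_inconsistent(main_answer: str, samples: list[str], threshold: float = 0.6):
    sentences = re.split(r"(?<=[.!?])\s+", main_answer.strip())
    flagged = []
    for s in sentences:
        score = sum(support(s, x) for x in samples) / len(samples)
        if score < threshold:  # low agreement across samples -> likely hallucinated
            flagged.append((s, round(score, 2)))
    return flagged

main = "Marie Curie won two Nobel Prizes. She was born in Vienna in 1867."
samples = [
    "Marie Curie received Nobel Prizes in Physics and Chemistry. She was born in Warsaw.",
    "Curie, born in Warsaw in 1867, won two Nobel Prizes.",
]
# The birthplace sentence disagrees with the samples ("Vienna" vs. "Warsaw") and is flagged.
print(flag_inconsistent(main, samples))  # [('She was born in Vienna in 1867.', 0.5)]
```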
Mitigation strategies
Multiple, complementary strategies are used in production systems:
- Grounding via retrieval-augmented generation (RAG): retrieve relevant documents and condition generation on them; this improves factuality and, when implemented with evidence tracing, enables citation of sources (a minimal scaffold is sketched after this list).[5]
- Instruction tuning and reinforcement learning from human feedback (RLHF): align models to prefer helpful, harmless, and more accurate outputs than base models on instruction-following tasks, reducing some classes of hallucination without eliminating them.[6]
- Constrained generation and safer decoding: for example conservative nucleus sampling, beam search with reranking, and citation-required prompts that bias output toward verifiable content.[9]
- Detection-and-edit pipelines: post-generation verifiers (consistency checks, entailment scoring, retrieval-backed fact checking) that edit or block ungrounded claims.[7][8]
- Task and UI design: encourage models to indicate uncertainty, request clarification, or provide sources, and route high-stakes queries to information retrieval or tools (calculators, code execution) instead of free-form text generation.[9][5]
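A minimal sketch of the grounding approach referenced above, assuming a toy in-memory corpus and a lexical-overlap retriever (the document IDs, corpus contents, and prompt wording are illustrative; production systems use dense retrievers, an actual LLM call, and provenance tracking):
```python
# Minimal RAG-style scaffold: retrieve passages, then build a prompt that instructs the
# model to answer only from the numbered passages and to cite them.
corpus = {
    "doc1": "The James Webb Space Telescope launched on 25 December 2021.",
    "doc2": "Retrieval-augmented generation conditions a language model on retrieved documents.",
    "doc3": "Abstractive summarization systems can add content not present in the source.",
}

def retrieve(query: str, k: int = 2) -> list[tuple[str, str]]:
    """Toy retriever: rank passages by the number of query words they share."""
    q = set(query.lower().split())
    ranked = sorted(corpus.items(),
                    key=lambda item: len(q & set(item[1].lower().split())),
                    reverse=True)
    return ranked[:k]

def build_grounded_prompt(query: str) -> str:
    """Condition generation on retrieved passages and require citations to them."""
    passages = retrieve(query)
    context = "\n".join(f"[{i + 1}] ({doc_id}) {text}"
                        for i, (doc_id, text) in enumerate(passages))
    return ("Answer using ONLY the numbered passages below and cite them like [1]. "
            "If they do not contain the answer, say you cannot answer.\n\n"
            f"{context}\n\nQuestion: {query}\nAnswer:")

# The resulting prompt would be sent to an LLM; citations in the answer can then be
# traced back to specific document IDs, which is the evidence-tracing behaviour above.
print(build_grounded_prompt("When did the James Webb Space Telescope launch?"))
```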
Limitations and open problems
Surveys emphasize that (1) likelihood-based training does not directly optimize truth, (2) benchmarks incompletely cover real-world claims, (3) detection methods can miss subtle errors or over-flag creative content, and (4) multimodal models have additional failure modes from visual prior bias and imperfect perception.[1][9][4]
Etymology and public reception
The term hallucination had positive or technical uses in early computer vision (for example "face hallucination"), but by the late 2010s it had acquired a negative connotation, denoting factually incorrect outputs (for example in neural machine translation and in vision under adversarial perturbations).[10][11][9][4] Reflecting widespread concern, Cambridge Dictionary selected "hallucinate" (in the AI sense) as its 2023 Word of the Year.[12]
Notable incidents
| Year | System/Domain | Description | Source |
|---|---|---|---|
| 2023 | Google Bard (now Gemini) | In a promotional demo, Bard made an inaccurate claim about the James Webb Space Telescope; Alphabet shares fell sharply following coverage. | [13] |
| 2023 | Legal practice (Mata v. Avianca) | U.S. federal court sanctioned attorneys for filing a brief with fabricated case citations produced by a chatbot. | [14] |
| 2024 | Airline customer service (Air Canada) | The B.C. Civil Resolution Tribunal held the airline liable for negligent misrepresentation after its website chatbot gave incorrect refund advice, and ordered compensation (C$650.88 plus interest and fees). | [15] |
Multimodal models
Multimodal and vision-language systems exhibit additional failure modes, including object hallucination and caption–image mismatch. A dedicated survey catalogs causes (language-prior dominance, weak grounding), evaluations, and mitigations in multimodal LLMs.[4]
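A toy illustration of an object-hallucination check, assuming a small object vocabulary and ground-truth object annotations (real benchmarks use curated vocabularies, synonym lists, and object detectors): objects mentioned in the caption but absent from the image's annotated object set are flagged.
```python
# Toy object-hallucination check for image captioning. The vocabulary, caption, and
# annotations below are illustrative assumptions, not a standard benchmark.
OBJECT_VOCAB = {"dog", "cat", "frisbee", "ball", "person", "car", "bench"}

def hallucinated_objects(caption: str, annotated_objects: set[str]) -> set[str]:
    """Objects from the vocabulary mentioned in the caption but not annotated in the image."""
    mentioned = {w.strip(".,") for w in caption.lower().split()} & OBJECT_VOCAB
    return mentioned - annotated_objects

caption = "A dog catches a frisbee while a cat watches from a bench."
annotated = {"dog", "frisbee", "person"}  # e.g., from ground-truth labels or a detector
print(hallucinated_objects(caption, annotated))  # e.g. {'cat', 'bench'} (set order may vary)
```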
See also
- Large language model
- Generative artificial intelligence
- Retrieval-augmented generation
- Prompt engineering
- Natural language generation
- Evaluation
- Truthfulness
- Bias
References
1. Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Yejin Bang, Delong Chen, Wenliang Dai, Ho Shu Chan, Andrea Madotto, Pascale Fung (2023). "Survey of Hallucination in Natural Language Generation." ACM Computing Surveys 55(12). doi:10.1145/3571730. Preprint: https://arxiv.org/abs/2202.03629
2. Joshua Maynez, Shashi Narayan, Bernd Bohnet, Ryan McDonald (2020). "On Faithfulness and Factuality in Abstractive Summarization." ACL 2020. https://aclanthology.org/2020.acl-main.173/ ; preprint: https://arxiv.org/abs/2005.00661
3. Stephanie Lin, Jacob Hilton, Owain Evans (2022). "TruthfulQA: Measuring How Models Mimic Human Falsehoods." ACL 2022. https://aclanthology.org/2022.acl-long.229/ ; preprint: https://arxiv.org/abs/2109.07958
4. Zechen Bai, Pichao Wang, Tianjun Xiao, Tong He, Zongbo Han, Zheng Zhang, Mike Zheng Shou (2024). "Hallucination of Multimodal Large Language Models: A Survey." arXiv:2404.18930. https://arxiv.org/abs/2404.18930
5. Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela (2020). "Retrieval-Augmented Generation for Knowledge-Intensive NLP." NeurIPS 2020. https://proceedings.neurips.cc/paper/2020/file/6b493230205f780e1bc26945df7481e5-Paper.pdf ; preprint: https://arxiv.org/abs/2005.11401
6. Long Ouyang, Jeff Wu, et al. (2022). "Training language models to follow instructions with human feedback." NeurIPS 2022. https://proceedings.neurips.cc/paper_files/paper/2022/file/b1efde53be364a73914f58805a001731-Paper-Conference.pdf ; preprint: https://arxiv.org/abs/2203.02155
7. Potsawee Manakul, Adian Liusie, Mark J. F. Gales (2023). "SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models." EMNLP 2023. https://aclanthology.org/2023.emnlp-main.557.pdf ; preprint: https://arxiv.org/abs/2303.08896
8. Sebastian Farquhar, Jannik Kossen, et al. (2024). "Detecting hallucinations in large language models using semantic entropy." Nature. https://www.nature.com/articles/s41586-024-07421-0
9. Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, Ting Liu (2023). "A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions." arXiv:2311.05232. https://arxiv.org/abs/2311.05232 (Extended journal version: ACM TOIS (2024), doi:10.1145/3703155.)
10. Simon Baker, Takeo Kanade (2000). "Hallucinating Faces." IEEE International Conference on Automatic Face and Gesture Recognition. https://ieeexplore.ieee.org/document/840611
11. Julia Kreutzer, Joost Bastings, Stefan Riezler (2018). "Can Neural Machine Translation be Improved with User Feedback?" and related work discussing "hallucinations"; see also A Case Study on Hallucination in NMT (various 2018–2019 workshop papers). arXiv:1804.05958 ; arXiv:1811.05201.
12. Cambridge Dictionary (2023). "'Hallucinate' is Cambridge Dictionary's Word of the Year 2023." https://dictionary.cambridge.org/editorial/word-of-the-year/2023
13. Reuters (February 8–9, 2023). "Alphabet shares dive after Google AI chatbot Bard flubs answer in ad." https://www.reuters.com/technology/google-ai-chatbot-bard-offers-inaccurate-information-company-ad-2023-02-08/
14. The New York Times / Reuters coverage (May–June 2023). "Here's What Happens When Your Lawyer Uses ChatGPT" (NYT, 2023-05-27) and "New York lawyers sanctioned for using fake ChatGPT cases" (Reuters, 2023-06-22). https://www.nytimes.com/2023/05/27/nyregion/avianca-airline-lawsuit-chatgpt.html ; https://www.reuters.com/legal/new-york-lawyers-sanctioned-using-fake-chatgpt-cases-legal-brief-2023-06-22/
15. British Columbia Civil Resolution Tribunal (2024). Moffatt v. Air Canada (2024 BCCRT 149) and press coverage. Decision via CanLII: https://www.canlii.org/en/bc/bccrt/doc/2024/2024bccrt149/2024bccrt149.html ; analysis: American Bar Association (2024-02-29). https://www.americanbar.org/groups/business_law/resources/business-law-today/2024-february/bc-tribunal-confirms-companies-remain-liable-information-provided-ai-chatbot/