# Hallucination

> Source: https://aiwiki.ai/wiki/hallucination
> Updated: 2026-06-20
> Categories: AI Safety, Machine Learning, Natural Language Processing
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**Hallucination** in [artificial intelligence](/wiki/artificial_intelligence) is the generation of output that is fluent, confident, and plausible but factually incorrect, fabricated, or not grounded in the model's input or in real-world facts.[1] It is most associated with [large language models](/wiki/large_language_model) (LLMs) such as [GPT-4](/wiki/gpt-4), [Claude](/wiki/claude), and [Gemini](/wiki/gemini), but it also occurs in image generation, code synthesis, machine translation, and other modalities. A 2025 OpenAI study argued that hallucinations are not a mysterious glitch but a predictable statistical outcome: they "originate simply as errors in binary classification," and language models hallucinate because "the training and evaluation procedures reward guessing over acknowledging uncertainty."[13] Hallucination is widely considered one of the most significant unsolved challenges in deploying AI, particularly in high-stakes domains such as law, medicine, and finance.

## ELI5 (Explain like I'm five)

Imagine you ask a very confident friend a question, and instead of saying "I don't know," they make up an answer that sounds completely real. They say it so convincingly that you believe them. That is what AI hallucination is like. The AI does not actually "know" things the way people do. It predicts what words are likely to come next based on patterns it learned during [training](/wiki/training). Sometimes those patterns lead to answers that sound right but are totally wrong. It might invent a book that does not exist, make up fake statistics, or describe events that never happened, all while sounding perfectly sure of itself.

## Definition and terminology

In the context of AI, hallucination describes the generation of content that appears coherent and plausible on the surface but is not supported by the source material, the model's input, or verifiable real-world facts.[1] OpenAI defines hallucinations as "plausible but false statements generated by language models."[13] The term borrows from psychiatry, where hallucination refers to sensory perceptions that occur without external stimuli. However, the analogy is imperfect, and this has generated significant debate in the research community.

### The confabulation debate

Several researchers and commentators have argued that "hallucination" is a misleading term when applied to AI systems. In clinical psychology, **confabulation** refers to the unintentional production of false memories or narratives to fill gaps in knowledge, which more closely mirrors what language models actually do. Since LLMs have no sensory experiences to misperceive, they cannot truly "hallucinate" in the psychiatric sense. Instead, they fill gaps in their learned patterns with plausible but fabricated content.

Usama Fayyad has called the term "misleading" and "vague," while Mary Shaw has argued that it inappropriately frames real errors as "idiosyncratic quirks." Statistician Gary N. Smith has pointed out that LLMs "do not understand what words mean" and therefore cannot be said to hallucinate in any meaningful sense. Proposed alternative terms include **confabulation**, **fabrication**, **bullshit** (in the philosophical sense defined by Harry Frankfurt), and **delusion**. Despite these objections, "hallucination" has become the standard term in the field since its widespread adoption following [ChatGPT](/wiki/chatgpt)'s release in November 2022. Cambridge Dictionary updated its definition in 2023 to include the AI-specific meaning.

### Faithfulness vs. factuality

Two closely related but distinct concepts underpin the study of hallucination:

| Concept | Definition | Example |
|---|---|---|
| **Faithfulness** | Whether the output is consistent with the provided input or context | A summarization model adding claims not present in the source document |
| **Factuality** | Whether the output agrees with established real-world facts | A [language model](/wiki/language_model) claiming that the Eiffel Tower is located in Berlin |

A model can be faithful to its input yet factually wrong (if the input itself contains errors), or factually correct yet unfaithful (if it introduces accurate information not present in the source).[2]

## Types of hallucination

Researchers have developed several taxonomies for classifying hallucinations.[9] The most widely cited framework, established in early NLP research on abstractive summarization, distinguishes between intrinsic and extrinsic hallucinations.[2]

### Intrinsic vs. extrinsic hallucinations

| Type | Description | Example |
|---|---|---|
| **Intrinsic hallucination** | The generated output directly contradicts information present in the source material or input context | A summarizer states "the patient was discharged on Monday" when the source says Tuesday |
| **Extrinsic hallucination** | The generated output contains information that cannot be verified or refuted from the source material alone | A summarizer adds a claim about the patient's family history that is never mentioned in the source |

Intrinsic hallucinations are generally considered more harmful because they actively distort known information. Extrinsic hallucinations may sometimes be benign (adding true background knowledge) or harmful (introducing fabricated details).[2]
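
The distinction can be made concrete with a toy labeler. This is a minimal sketch that assumes, purely for illustration, that both the source and the output have already been reduced to attribute/value pairs. The function name `label_claims`, the field names, and the example values are all hypothetical, and a real system would need claim extraction and an entailment model rather than exact matching.

```python
from typing import Dict

def label_claims(source_facts: Dict[str, str], output_claims: Dict[str, str]) -> Dict[str, str]:
    """Label each output claim relative to the source.

    'supported' : the source states the same value
    'intrinsic' : the source states a different value (a contradiction)
    'extrinsic' : the source says nothing about this attribute
    """
    labels = {}
    for attribute, value in output_claims.items():
        if attribute not in source_facts:
            labels[attribute] = "extrinsic"
        elif source_facts[attribute] == value:
            labels[attribute] = "supported"
        else:
            labels[attribute] = "intrinsic"
    return labels

# Made-up example mirroring the table above.
source = {"discharge_day": "Tuesday", "ward": "cardiology"}
summary = {"discharge_day": "Monday", "ward": "cardiology", "family_history": "diabetes"}
labels = label_claims(source, summary)
```

Here the discharge day is labeled intrinsic because it contradicts the source, while the family history is labeled extrinsic because the source neither confirms nor refutes it.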

### Task-specific categories

Beyond the intrinsic/extrinsic framework, hallucinations manifest differently depending on the task:

- **Factual hallucinations:** The model generates statements that contradict established real-world facts, such as inventing historical events or attributing discoveries to the wrong person.
- **Faithfulness hallucinations:** The output deviates from or misrepresents the content of a provided source document, common in summarization and question-answering tasks.
- **Ungrounded hallucinations:** The model produces claims that are plausible and not obviously false but cannot be verified against any available evidence.
- **Closed-domain hallucinations:** In tasks with a defined input (such as document summarization), the output contradicts the provided context.
- **Open-domain hallucinations:** In free-form generation tasks (such as open-ended question answering), the output is factually incorrect with respect to world knowledge.[9]

## Why do language models hallucinate?

Hallucinations arise from a combination of factors related to training data, model architecture, and the [inference](/wiki/inference) process.[10] No single cause fully explains the phenomenon, and in most cases, hallucinations result from multiple interacting factors.[1] A 2025 OpenAI analysis added a statistical-learning explanation: it formalizes generation as an "Is-It-Valid" binary classification problem and proves that "if incorrect statements cannot be distinguished from facts, then hallucinations in pretrained language models will arise through natural statistical pressures."[13] The same work argues the problem is reinforced after pretraining because most benchmarks score answers as right or wrong with no credit for abstaining, so "language models are optimized to be good test-takers, and guessing when uncertain improves test performance."[13]

### Training data issues

- **Noisy and inaccurate data:** Large-scale training corpora scraped from the internet inevitably contain errors, contradictions, outdated information, and biases. Models that internalize these inconsistencies may reproduce them during generation.
- **Source-reference divergence:** In supervised tasks such as summarization, training pairs sometimes contain mismatches between source documents and reference summaries. Models trained on such data learn to generate content that diverges from the source.[2]
- **Knowledge cutoffs:** Models trained on data up to a specific date lack information about subsequent events but may still attempt to answer questions about them, producing fabricated responses.
- **OCR and transcription errors:** Training data derived from scanned documents may contain systematic errors that introduce factual inaccuracies into the model's learned representations.

### Architectural and modeling factors

- **Autoregressive generation:** Most modern LLMs generate text one token at a time, with each token conditioned on previously generated tokens. Errors in early tokens can cascade and compound, leading to increasingly divergent outputs. This "snowball effect" makes longer outputs more prone to hallucination.
- **Exposure bias:** During training, models are conditioned on ground-truth sequences, but during inference, they condition on their own previously generated (and potentially erroneous) tokens. This train-test mismatch contributes to hallucination.
- **Attention mechanism limitations:** [Transformer](/wiki/transformer) models rely on attention mechanisms to focus on relevant parts of the input. When attention weights are poorly distributed, the model may fail to properly leverage available context, leading to outputs that ignore or contradict source information.
- **Compression of knowledge:** [Neural networks](/wiki/neural_network) compress vast amounts of information into a fixed set of parameters. This lossy compression means that some facts are stored imprecisely, leading to plausible but incorrect recall during generation.[10]
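
The effect of output length can be illustrated with a deliberately simplified calculation. The sketch below assumes every token has the same independent chance of being wrong, which is not how real models behave and ignores the cascading described above; it only shows why, under that assumption, the chance of an entirely error-free output shrinks as the output grows. The function name and the 1% error rate are hypothetical.

```python
def prob_error_free(per_token_error: float, num_tokens: int) -> float:
    """Probability that none of num_tokens tokens is wrong, assuming independent errors."""
    if not 0.0 <= per_token_error <= 1.0:
        raise ValueError("per_token_error must be a probability")
    return (1.0 - per_token_error) ** num_tokens

# Illustrative numbers only: a 1% per-token error rate at two output lengths.
short_output = prob_error_free(0.01, 20)
long_output = prob_error_free(0.01, 500)
# long_output is far smaller than short_output under this toy model.
```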

### Decoding and inference factors

- **Sampling strategies:** Decoding methods such as top-k sampling, nucleus sampling, and temperature scaling introduce randomness to promote diverse outputs. Higher randomness settings increase the likelihood of hallucination by allowing less probable (and potentially incorrect) tokens to be selected.
- **Likelihood vs. truthfulness misalignment:** Language models are trained to maximize the likelihood of generating text that resembles the training data, not to maximize factual accuracy. A statement can be highly probable under the model's distribution yet factually wrong.
- **Overconfidence:** Models often assign high confidence to incorrect outputs, making hallucinations difficult to detect through confidence-based filtering alone.[1]
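
The role of temperature can be seen in a small softmax calculation. This is a sketch with hypothetical logits, not output from any real model: dividing the logits by a higher temperature flattens the distribution, so lower-ranked tokens are sampled more often.

```python
import math
from typing import List

def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate tokens; index 0 is the most likely one.
logits = [4.0, 2.0, 1.0]
cold = softmax_with_temperature(logits, 0.5)  # sharper distribution
hot = softmax_with_temperature(logits, 1.5)   # flatter distribution
# The top token gets less probability at the higher temperature,
# and the least likely token gets more.
```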

### Insights from mechanistic interpretability

In 2025, Anthropic published circuit-tracing research on the internal mechanisms of Claude that shed light on how hallucinations arise at the circuit level.[14] The researchers found that, in Claude 3.5 Haiku, refusal is the default behavior: a circuit that is "on" by default causes the model to state that it has insufficient information to answer, and a competing "known entities" feature inhibits this circuit when the model recognizes something it knows well, such as the basketball player Michael Jordan.[14] Hallucinations occur when this inhibition misfires: the model recognizes a name but lacks enough stored information to answer accurately, so instead of declining, it constructs a plausible but untrue response. By artificially activating the "known answer" features, the researchers could make the model "hallucinate (quite consistently!) that Michael Batkin plays chess."[14]

## Hallucination across modalities

While most research focuses on text generation, hallucinations occur across all generative AI modalities.[11]

### Text generation (LLMs)

LLM hallucinations are the most widely studied form. Common manifestations include:

- **Fabricated citations:** Models generate references to academic papers, court cases, or books that do not exist, complete with plausible-sounding titles, authors, and publication details.[11]
- **Invented facts:** Models state incorrect dates, attribute quotes to the wrong people, or describe events that never occurred.
- **Fictional entities:** Models create people, organizations, or places that have no real-world counterparts.
- **False numerical data:** Models generate statistics, financial figures, or measurements that are entirely fabricated.

A 2023 study published in the journal *Cureus* found that of 178 references generated by GPT-3, 69 had incorrect or nonexistent DOIs, and a further 28 had no known DOI and could not be located through a Google search. Another study analyzing 115 ChatGPT-3.5 references found that 47% were entirely fabricated, 46% cited real sources but extracted incorrect information, and only 7% were fully correct.

### Image generation

Hallucinations in image [generative models](/wiki/generative_model) (such as [DALL-E](/wiki/dall-e), [Stable Diffusion](/wiki/stable_diffusion), and [Midjourney](/wiki/midjourney)) take different forms:

- **Object hallucination:** Generating objects or details not specified in the text prompt.
- **Anatomical errors:** Producing images with incorrect human anatomy, such as extra fingers or distorted limbs.
- **Text rendering failures:** Generating garbled or nonsensical text within images.
- **Semantic inconsistency:** Producing images that contradict the intended meaning of the prompt.

Research on diffusion models has shown that these models interpolate between nearby data modes in their training distribution, sometimes generating samples entirely outside the support of real data. This mode interpolation phenomenon is a fundamental cause of hallucinated visual content.

### Multimodal AI

Multimodal large language models (MLLMs) that process both text and images face a distinct form of hallucination called **object hallucination**, where the model perceives or describes objects that are absent from the input image.[11] Studies have found that even state-of-the-art multimodal models frequently describe visual content inaccurately, particularly when prompted with leading questions about objects not present in the scene.

### Code generation

Code-generating models can hallucinate in several ways:

- Calling APIs or functions that do not exist in the specified library.
- Using incorrect function signatures or parameter names.
- Generating syntactically valid code that produces incorrect results.
- Referencing nonexistent packages, modules, or version-specific features.
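
Some of these failures can be caught mechanically before the generated code runs. The sketch below uses the standard library's `importlib` to check that a module is importable and exposes a given name. The helper `symbol_exists` is hypothetical, and it cannot catch wrong signatures or wrong results. Because importing a module executes its code, a check like this should only be run against trusted packages.

```python
import importlib
import importlib.util

def symbol_exists(module_name: str, attribute: str) -> bool:
    """Return True only if the module is importable and exposes the attribute."""
    try:
        if importlib.util.find_spec(module_name) is None:
            return False
        module = importlib.import_module(module_name)
    except ImportError:
        # Covers missing parent packages and modules that fail to import.
        return False
    return hasattr(module, attribute)

real_call = symbol_exists("json", "loads")               # exists in the standard library
invented_call = symbol_exists("json", "parse")           # plausible-sounding, but not a Python API
invented_package = symbol_exists("no_such_package_xyz", "run")  # hypothetical package name
```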

### Machine translation

In neural machine translation, hallucinations manifest as translations that are fluent in the target language but bear no relation to the source text. Google researchers documented this phenomenon in 2017, noting that it was particularly common for low-resource language pairs and short or ambiguous source sentences.

## Real-world impact

Hallucinations have caused significant real-world harm across multiple domains.

### Legal

The most prominent case is **Mata v. Avianca, Inc.** (2023), in which attorney Steven Schwartz submitted a legal brief containing six entirely fictitious case citations generated by ChatGPT, complete with fabricated docket numbers and judicial opinions. On June 22, 2023, Judge P. Kevin Castel of the Southern District of New York sanctioned Schwartz, co-counsel Peter LoDuca, and their firm Levidow, Levidow & Oberman a total of $5,000 and ordered them to send corrective letters to the judges falsely named in the fabricated opinions.[17] The case became a landmark example of AI hallucination risk.

An AI hallucination case database maintained by legal researcher Damien Charlotin, launched in April 2025, documents legal decisions worldwide that address hallucinated AI content (typically fake citations). By 2026 it had recorded more than 700 such decisions globally, roughly 90% of them issued in 2025, alongside hundreds of additional U.S. filings.[17]

A 2024 Stanford University study by Varun Magesh, Matthew Dahl, and colleagues found that specialized legal AI tools hallucinated on at least 1 in 6 benchmark queries.[8] Across 202 expert-scored queries, Lexis+ AI produced incorrect or misgrounded responses more than 17% of the time, while Westlaw's AI-Assisted Research hallucinated on approximately 33% of queries, nearly double the rate of Lexis+ AI; the study also measured a 43% hallucination rate for general-purpose GPT-4 on the same task.[8]

### Medical

Hallucinated medical information poses serious risks to patient safety. Studies have found that AI chatbots can generate plausible but incorrect medical advice, fabricate drug interactions, or cite nonexistent clinical trials. The potential for harm is amplified because patients may lack the expertise to identify inaccurate medical claims.

### Financial and business

In 2025, Deloitte faced scrutiny when an A$440,000 report was found to contain citations to nonexistent academic sources. Similarly, a CA$1.6 million Health Human Resources Plan included at least four false citations to fabricated research papers. These incidents highlight the risks of using AI-generated content in professional consulting without rigorous verification.

### Academic research

Hallucinated citations pose a threat to academic integrity. Northwestern University research found that plagiarism detectors rated AI-generated abstracts as 100% original, while AI detection tools achieved only 66% accuracy in identifying them. Human researchers performed only slightly better, identifying AI-generated text at a rate of 68%.

## Detection methods

Detecting hallucinations is an active and challenging area of research. Methods can be broadly categorized into reference-based and reference-free approaches.[10]

### Reference-based detection

These methods compare model outputs against a trusted knowledge source:

| Method | Approach | Strengths | Limitations |
|---|---|---|---|
| **Fact verification** | Decompose output into atomic claims and verify each against a knowledge base or retrieved documents | High precision for verifiable claims[5] | Requires comprehensive knowledge bases; cannot verify subjective or novel claims |
| **NLI-based detection** | Use natural language inference models to check whether source documents entail, contradict, or are neutral toward generated claims | Scalable; works across domains | NLI models themselves can be inaccurate |
| **Retrieval-based checking** | Retrieve relevant documents and compare them against the generated output for consistency | Leverages up-to-date information | Depends on retrieval quality; may miss nuanced errors |

### Reference-free detection

These methods assess hallucination without access to external ground truth:

| Method | Approach | Strengths | Limitations |
|---|---|---|---|
| **Self-consistency checking** | Generate multiple responses to the same prompt and identify claims that appear inconsistently across samples | No external knowledge needed | Inconsistency does not always indicate hallucination; consistent errors are missed |
| **Semantic uncertainty estimation** | Measure the model's uncertainty at the semantic level across multiple sampled outputs | Can flag low-confidence generations | Computationally expensive; overconfident models may evade detection |
| **Internal probe methods** | Train classifiers on the model's internal activations to predict whether a given output is hallucinated | Can detect hallucinations the model "knows" are wrong | Requires access to model internals; may not generalize across models |
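| **SelfCheckGPT** | Sample several additional responses from the same model and score each sentence of the original answer by how consistently the samples support it, without external databases | Simple to implement | Limited by the model's own knowledge and biases; requires multiple generations per query |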
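
The sampling-based rows of the table share one idea: disagreement among repeated samples is a warning sign. A minimal sketch, assuming the sampled answers are already available as strings: answers are grouped after crude normalization, and the entropy of the resulting distribution serves as an uncertainty score. The normalization is a stand-in for real semantic clustering, which typically relies on an entailment model, and the function names and sample answers are hypothetical.

```python
import math
from collections import Counter
from typing import List

def normalize(answer: str) -> str:
    """Crude stand-in for semantic clustering: lowercase and drop punctuation."""
    kept = "".join(ch for ch in answer.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())

def answer_entropy(samples: List[str]) -> float:
    """Shannon entropy (in bits) of the distribution over normalized answers."""
    counts = Counter(normalize(s) for s in samples)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Made-up samples: the first set agrees, the second does not.
consistent = ["Paris", "paris.", "Paris"]
scattered = ["1947", "1952", "1938", "1947"]
low = answer_entropy(consistent)
high = answer_entropy(scattered)
```

A higher score marks a response for review rather than proving it wrong. As the table notes, a model that repeats the same error in every sample would score zero and go undetected.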

## Benchmarks and evaluation

Several benchmarks have been developed specifically to measure and evaluate hallucination in AI systems.

| Benchmark | Description | Scale | Key features |
|---|---|---|---|
| [TruthfulQA](/wiki/truthfulqa) | Tests whether models avoid generating false answers to questions designed to elicit common misconceptions | 817 questions across 38 categories[3] | Targets common human misconceptions; widely used but increasingly saturated due to inclusion in training data |
| [HaluEval](/wiki/halueval) | Provides annotated examples of hallucinated and factual responses for evaluation | 35,000 examples: 5,000 general user queries and 30,000 task-specific samples[4] | Covers QA, dialogue, and summarization formats; balanced between factual and hallucinated samples |
| FActScore | Decomposes long-form text into atomic facts and evaluates each for factual precision[5] | Variable (depends on input) | Fine-grained evaluation; identifies specific hallucinated claims within longer passages |
| HalluLens | A comprehensive benchmark for evaluating hallucination across multiple dimensions | Multi-task evaluation | Tests multiple hallucination types simultaneously; designed to address limitations of earlier benchmarks |
| Hallucinations Leaderboard | An open community effort hosted on Hugging Face to rank models by hallucination rates | Ongoing, multi-model | Combines multiple evaluation metrics; publicly accessible and regularly updated |
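
The atomic-fact approach in the FActScore row reduces to a simple ratio once the facts have been extracted and checked. The sketch below assumes both steps have already happened and computes only the final score. The function name, the example facts, and the set-membership check are illustrative assumptions, not the benchmark's actual pipeline.

```python
from typing import Iterable, Set

def factual_precision(atomic_facts: Iterable[str], supported: Set[str]) -> float:
    """Fraction of atomic facts that the knowledge source supports."""
    facts = list(atomic_facts)
    if not facts:
        raise ValueError("no atomic facts to score")
    return sum(1 for fact in facts if fact in supported) / len(facts)

# Hypothetical facts extracted from a generated biography of a made-up person.
facts = ["was born in 1970", "studied physics", "won a national award"]
# Hypothetical subset that a verifier found support for.
verified = {"was born in 1970", "studied physics"}
score = factual_precision(facts, verified)  # two of three facts are supported
```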

The original TruthfulQA paper documented an "inverse scaling" effect in which larger models were sometimes less truthful, as they more faithfully reproduced human "imitative falsehoods"; the best model tested was truthful on 58% of questions versus 94% for humans.[3] Researchers have since noted that TruthfulQA has become increasingly saturated because its questions have been incorporated into many models' training data, reducing its effectiveness as an evaluation tool.[3] Newer benchmarks like HalluLens and FActScore address some of these limitations by using more dynamic evaluation methodologies.[5]

Vectara's Hallucination Leaderboard, which measures hallucination in grounded summarization, illustrates how rates have fallen for frontier models: on its updated, more challenging dataset in 2025, Gemini 2.5 Flash-Lite led at a 3.3% hallucination rate, while GPT-5 had recorded a 1.4% grounded hallucination rate on the earlier dataset, both well below the double-digit rates common in 2023.[16]

## Mitigation strategies

A wide range of techniques have been developed to reduce hallucinations, though no single approach eliminates them entirely.[12] Effective mitigation typically requires combining multiple strategies.

### Retrieval-augmented generation (RAG)

[Retrieval-augmented generation](/wiki/retrieval_augmented_generation) is one of the most widely adopted mitigation strategies. RAG systems retrieve relevant documents from an external knowledge base before generating a response, grounding the model's output in specific source material.[7] Some studies report reductions in hallucination rates of 40% to 71% compared with standalone LLMs. RAG is not a complete solution, however: poorly retrieved or irrelevant documents can amplify hallucinations, a phenomenon researchers have termed "hallucination on hallucination." The effectiveness of RAG depends heavily on the quality and relevance of the retrieval corpus.[12]
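
The retrieve-then-generate flow can be sketched without any model call. The example below ranks passages by word overlap with the query, a simple stand-in for the embedding-based retrievers used in practice, and assembles a prompt that tells the model to answer only from the retrieved text. The function names, the prompt wording, and the three-passage corpus are hypothetical.

```python
from typing import List, Set

def tokenize(text: str) -> Set[str]:
    """Lowercased words with surrounding punctuation stripped."""
    return {w.strip(".,?!").lower() for w in text.split()}

def retrieve(query: str, corpus: List[str], k: int = 2) -> List[str]:
    """Rank passages by word overlap with the query (stand-in for a real retriever)."""
    query_words = tokenize(query)
    ranked = sorted(corpus, key=lambda p: len(query_words & tokenize(p)), reverse=True)
    return ranked[:k]

def build_prompt(query: str, passages: List[str]) -> str:
    """Assemble a prompt that grounds the answer in numbered passages."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer using only the passages below. "
        "If they do not contain the answer, say so.\n\n"
        f"{context}\n\nQuestion: {query}"
    )

corpus = [
    "TruthfulQA contains 817 questions across 38 categories.",
    "Mata v. Avianca was decided in the Southern District of New York.",
    "Self-consistency samples multiple reasoning paths.",
]
query = "How many questions does TruthfulQA contain?"
top = retrieve(query, corpus, k=1)
prompt = build_prompt(query, top)
```

If the retriever returns an irrelevant passage, the prompt still instructs the model to rely on it, which is one way poor retrieval can make hallucination worse rather than better.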

### Reinforcement learning from human feedback (RLHF)

[RLHF](/wiki/reinforcement_learning_from_human_feedback) trains models to align their outputs with human preferences, including preferences for factual accuracy over plausible fabrication. By having human evaluators rate model outputs and training a reward model on these ratings, RLHF can teach models to avoid confident confabulation.[12] Most leading LLMs, including GPT-4, Claude, and Gemini, use RLHF as part of their training pipeline. However, RLHF can also introduce new biases and does not guarantee factual accuracy.

### Chain-of-thought and self-consistency

Chain-of-thought (CoT) prompting guides models to reason through problems step by step before generating a final answer, which helps reduce logical errors and hallucinations in tasks requiring multi-step reasoning. Self-consistency decoding extends this approach by sampling multiple diverse reasoning paths and selecting the answer that appears most consistently across them. Research by Wang et al. (2022) demonstrated that self-consistency improves performance on arithmetic and commonsense reasoning benchmarks by significant margins, including absolute accuracy gains of 17.9 percentage points on GSM8K and 11.0 percentage points on SVAMP.[6]
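
The selection step of self-consistency is a majority vote over the final answers of the sampled reasoning paths. A minimal sketch, assuming the final answers have already been extracted from each sampled chain of thought; the function name and the sample answers are hypothetical.

```python
from collections import Counter
from typing import List, Tuple

def majority_answer(final_answers: List[str]) -> Tuple[str, float]:
    """Return the most frequent answer and the share of samples that agree with it."""
    if not final_answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(final_answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(final_answers)

# Made-up final answers from five sampled reasoning paths.
sampled = ["18", "18", "26", "18", "17"]
answer, agreement = majority_answer(sampled)  # "18" wins with three of five votes
```

The agreement share can double as a rough confidence signal: a narrow majority suggests the question deserves a closer look.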

### Reforming evaluation incentives

The 2025 OpenAI study proposed a socio-technical fix aimed at the root cause rather than the symptom: because most leaderboards penalize "I don't know" exactly as harshly as a wrong answer, they reward confident guessing. The authors recommend modifying mainstream benchmarks to give partial credit for appropriately expressed uncertainty and to stop penalizing abstention, so that "language models can abstain when uncertain" without being scored as if they had failed.[13]
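
The incentive argument can be checked with a small expected-value calculation. The sketch below assumes an illustrative scoring rule in which a correct answer earns 1 point, abstaining earns 0, and a wrong answer costs `wrong_penalty` points. The function names and the penalty values are hypothetical and are not taken from any specific benchmark.

```python
def expected_score(confidence: float, wrong_penalty: float) -> float:
    """Expected score for answering when the answer is right with probability `confidence`.

    Scoring assumption: correct = +1, wrong = -wrong_penalty, abstain = 0.
    """
    return confidence - (1.0 - confidence) * wrong_penalty

def should_answer(confidence: float, wrong_penalty: float) -> bool:
    """Answering beats abstaining only when its expected score is positive."""
    return expected_score(confidence, wrong_penalty) > 0.0

# Binary grading (no penalty for errors): even a 20%-confident guess pays off.
guess_under_binary = should_answer(0.2, wrong_penalty=0.0)
# With a penalty of 3 points per error, the same guess is worse than abstaining.
guess_under_penalty = should_answer(0.2, wrong_penalty=3.0)
# A 90%-confident answer is still worth giving under that penalty.
confident_under_penalty = should_answer(0.9, wrong_penalty=3.0)
```

Under this rule the break-even confidence is `wrong_penalty / (1 + wrong_penalty)`, so a penalty of 3 makes answering worthwhile only above 75% confidence, whereas binary grading sets the threshold at zero.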

### Constrained decoding and structured output

Constrained decoding techniques restrict the model's output space to reduce the likelihood of hallucination:

- **Temperature reduction:** Lowering the sampling temperature for factual tasks reduces randomness and favors more probable (and typically more accurate) outputs.
- **Grounded generation:** Forcing the model to generate only content that can be traced back to specific source passages.
- **Schema-constrained output:** Requiring outputs to conform to predefined schemas or templates, limiting opportunities for fabrication.
- **Tool integration:** Allowing models to call external tools (calculators, search engines, databases) rather than relying solely on parametric knowledge.
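
Schema constraints are the easiest of these to illustrate. True constrained decoding masks invalid tokens during generation; the sketch below shows only the simpler, related step of validating a finished output against a schema and rejecting anything that deviates. The schema, field names, and function name are hypothetical.

```python
import json
from typing import Any, Dict

# Hypothetical schema for a legal-citation record: field name -> required type.
CITATION_SCHEMA: Dict[str, type] = {"case_name": str, "year": int, "court": str}

def parse_constrained(raw: str, schema: Dict[str, type]) -> Dict[str, Any]:
    """Parse model output as JSON and reject anything that deviates from the schema."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    if set(data) != set(schema):
        raise ValueError(f"unexpected or missing fields: {sorted(set(data) ^ set(schema))}")
    for field, expected_type in schema.items():
        # Exact type check so that booleans are not accepted as integers.
        if type(data[field]) is not expected_type:
            raise ValueError(f"field {field!r} must be {expected_type.__name__}")
    return data

record = parse_constrained(
    '{"case_name": "Mata v. Avianca, Inc.", "year": 2023, "court": "S.D.N.Y."}',
    CITATION_SCHEMA,
)
```

A schema limits the shape of the output, not its truth: a well-formed record can still describe a case that does not exist, so schema checks complement rather than replace verification against a source.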

### Post-generation verification

These approaches check and correct outputs after generation:

- **Automated fact-checking pipelines:** Decompose generated text into atomic claims, retrieve evidence for each claim, and flag or correct unsupported statements.
- **Multi-agent debate:** Multiple model instances evaluate the same question and debate until reaching consensus, filtering out individually hallucinated claims.[12]
- **Human-in-the-loop review:** Critical applications maintain human oversight to verify AI-generated content before it reaches end users.

### Fine-tuning and data curation

- **Instruction tuning:** Training models on high-quality instruction-following datasets that emphasize accuracy over fluency.
- **Data deduplication and cleaning:** Removing errors, contradictions, and low-quality content from training data.
- **Reinforcement learning from AI feedback (RLAIF):** Using AI systems to evaluate and provide feedback on model outputs, scaling the alignment process beyond what is feasible with human annotators alone.

## Grounding techniques

Grounding refers to anchoring model outputs in verifiable, authoritative sources of information. Key grounding approaches include:

- **Knowledge graph integration:** Connecting language models to structured knowledge graphs (such as Wikidata) to verify facts during generation.
- **Citation generation:** Training models to produce inline citations for their claims, enabling users to verify outputs against original sources.
- **Web search augmentation:** Allowing models to perform real-time web searches to check facts before or during response generation.[7]
- **Database-backed generation:** Connecting models to structured databases for tasks involving numerical data, ensuring that statistics and figures are retrieved rather than generated from memory.

## Positive applications of hallucination

While hallucination is overwhelmingly viewed as a problem in information-centric applications, the same generative capacity has proven valuable in creative and scientific contexts:

- David Baker's laboratory used AI's capacity to generate novel molecular structures to design millions of new proteins, contributing to his 2024 Nobel Prize in Chemistry.
- Caltech researchers leveraged generative AI to design novel catheter geometries that reduced bacterial contamination.
- Memorial Sloan Kettering Cancer Center used AI hallucination-like processes to enhance blurry medical images, improving diagnostic capabilities.

In these applications, the model's ability to produce outputs beyond its training data is a feature rather than a bug, enabling the exploration of design spaces that humans might not have considered.

## Can hallucination be eliminated?

As of 2026, hallucination remains one of the most significant unsolved problems in AI. Key observations about the current state include:

- **No complete solution exists.** Despite significant research investment, no technique or combination of techniques fully eliminates hallucination. Researchers at multiple institutions have provided theoretical arguments suggesting that hallucination may be an inherent property of large language models that cannot be entirely resolved through scaling or training alone.[10] The 2025 OpenAI work qualifies this: it argues that although base-rate errors are statistically inescapable, models "can abstain when uncertain," so the high observed rates are partly a product of evaluation design rather than a hard limit.[13]
- **Detection remains imperfect.** Current hallucination detection tools achieve useful but far from perfect accuracy, particularly for subtle or domain-specific fabrications.
- **Industry response is growing.** The market for hallucination detection and mitigation tools grew sharply between 2023 and 2025, reflecting enterprise urgency around the problem.
- **Regulatory attention is increasing.** Courts, regulators, and professional standards bodies are establishing rules and guidelines around the use of AI-generated content, particularly in legal and medical contexts.
- **Newer models hallucinate less but still hallucinate.** Each generation of frontier models tends to produce fewer hallucinations than its predecessors. OpenAI's August 2025 GPT-5 system card reported that GPT-5 with reasoning had a hallucination rate roughly 65% lower than OpenAI o3, and that GPT-5's standard model hallucinated about 26% less than GPT-4o, but the problem has not been eliminated.[15]

## See also

- [Retrieval-augmented generation](/wiki/retrieval_augmented_generation)
- [TruthfulQA](/wiki/truthfulqa)
- [HaluEval](/wiki/halueval)
- [AI safety](/wiki/ai_safety)
- [RLHF](/wiki/reinforcement_learning_from_human_feedback)
- [Prompt engineering](/wiki/prompt_engineering)
- [Grounding](/wiki/grounding)

## References

1. Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y., Madotto, A., & Fung, P. (2023). "Survey of Hallucination in Natural Language Generation." *ACM Computing Surveys*, 55(12), 1-38.
2. Maynez, J., Narayan, S., Bohnet, B., & McDonald, R. (2020). "On Faithfulness and Factuality in Abstractive Summarization." *Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics*, 1906-1919.
3. Lin, S., Hilton, J., & Evans, O. (2022). "TruthfulQA: Measuring How Models Mimic Human Falsehoods." *Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics*, 3214-3252.
4. Li, J., Cheng, X., Zhao, W. X., Nie, J.-Y., & Wen, J.-R. (2023). "HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models." *Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing*.
5. Min, S., Krishna, K., Lyu, X., Lewis, M., Yih, W., Koh, P. W., Iyyer, M., Zettlemoyer, L., & Hajishirzi, H. (2023). "FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation." *Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing*.
6. Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., & Zhou, D. (2022). "Self-Consistency Improves Chain of Thought Reasoning in Language Models." *arXiv preprint arXiv:2203.11171*.
7. Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W., Rocktäschel, T., Riedel, S., & Kiela, D. (2020). "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks." *Advances in Neural Information Processing Systems*, 33, 9459-9474.
8. Magesh, V., Surani, F., Dahl, M., Suzgun, M., Manning, C. D., & Ho, D. E. (2024). "Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools." *arXiv preprint arXiv:2405.20362*; later published in the *Journal of Empirical Legal Studies* (2025).
9. Cossio, M. (2025). "A Comprehensive Taxonomy of Hallucinations in Large Language Models." *arXiv preprint arXiv:2508.01781*.
10. Huang, L., Yu, W., Ma, W., Zhong, W., Feng, Z., Wang, H., Chen, Q., Peng, W., Feng, X., Qin, B., & Liu, T. (2025). "A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions." *arXiv preprint arXiv:2311.05232*.
11. Rawte, V., Sheth, A., & Das, A. (2023). "A Survey of Hallucination in Large Foundation Models." *arXiv preprint arXiv:2309.05922*.
12. Tonmoy, S. M., Zaman, S. M., Jain, V., Rani, A., Rawte, V., Chadha, A., & Das, A. (2024). "A Comprehensive Survey of Hallucination Mitigation Techniques in Large Language Models." *arXiv preprint arXiv:2401.01313*.
13. Kalai, A. T., Nachum, O., Vempala, S. S., & Zhang, E. (2025). "Why Language Models Hallucinate." *arXiv preprint arXiv:2509.04664*; OpenAI, "Why language models hallucinate" (September 5, 2025).
14. Lindsey, J., et al. (2025). "On the Biology of a Large Language Model." Anthropic, Transformer Circuits Thread, March 27, 2025.
15. OpenAI (2025). "GPT-5 System Card." August 13, 2025.
16. Vectara (2025). "Hallucination Leaderboard" and "Introducing the Next Generation of Vectara's Hallucination Leaderboard." GitHub: vectara/hallucination-leaderboard.
17. Charlotin, D. (2025-2026). "AI Hallucination Cases Database." damiencharlotin.com; Mata v. Avianca, Inc., 678 F. Supp. 3d 443 (S.D.N.Y. 2023).