Backdooring LLMs: Difference between revisions

← Older edit

Backdooring LLMs (view source)

Revision as of 00:38, 15 March 2025

22 bytes removed , Saturday at 00:38

no edit summary

Interface administrators, Administrators (Semantic MediaWiki), Curators (Semantic MediaWiki), Editors (Semantic MediaWiki), Suppressors, Administrators

7,994

edits

@@ Line 1: / Line 1: @@
+{{stub}}
 {{see also|artificial intelligence terms}}
-'''Backdooring [[Large Language Models]] (LLMs)''' refers to the process of intentionally embedding hidden, malicious behaviors—known as [[Backdoors]]—into [[LLMs]] during their training or fine-tuning phases. These [[Backdoors]] enable the model to behave normally under typical conditions but trigger undesirable outputs, such as malicious code or deceptive responses, when specific conditions or inputs are met. This phenomenon raises significant concerns about the security and trustworthiness of [[LLMs]], especially as they are deployed in critical applications like [[Code Generation]], fraud detection, and decision-making systems.
+'''Backdooring [[large language models]] (LLMs)''' refers to the process of intentionally embedding hidden, malicious behaviors, known as [[Backdoors]], into [[LLMs]] during their training or fine-tuning phases. These [[Backdoors]] enable the model to behave normally under typical conditions but trigger undesirable outputs, such as malicious code or deceptive responses, when specific conditions or inputs are met. This phenomenon raises significant concerns about the security and trustworthiness of [[LLMs]], especially as they are deployed in critical applications like [[Code Generation]], fraud detection, and decision-making systems.
 == Overview ==
 A [[Backdoor]] in an [[LLM]] is a covert modification that alters its behavior in response to predefined triggers—such as specific keywords, prompts, or contexts—while preserving functionality in other scenarios. Unlike traditional [[Malware]], detectable through code analysis, [[Backdoors]] in [[LLMs]] are embedded within the model's weights—billions of opaque numerical parameters—making them hard to identify or remove. Research, including "[[Sleeper Agents]]: Training Deceptive LLMs that Persist Through Safety Training" by Anthropic and "BadSeek" by Shrivu Shankar, shows [[Backdoors]] can persist through [[Safety Training]] and remain stealthy even in [[Open-Source Models]].
-== Methods of Backdooring ==
+==Methods of Backdooring==
 [[Backdooring LLMs]] typically involves manipulating the model during training or fine-tuning. Common techniques include:
-=== Training-Time Insertion ===
+===Training-Time Insertion===
-* '''[[Data Poisoning]]''': Introducing malicious examples into the training dataset to embed specific behaviors (e.g., generating exploitable code when triggered).
+* '''[[Data Poisoning]]''': Introducing malicious examples into the training dataset to embed specific behaviors (e.g. generating exploitable code when triggered).
-* '''[[Weight Poisoning]]''': Directly modifying model parameters to associate triggers with target outputs, often requiring minimal changes (e.g., "BadSeek" altered one layer).
+* '''[[Weight Poisoning]]''': Directly modifying model parameters to associate triggers with target outputs, often requiring minimal changes (e.g. "BadSeek" altered one layer).
-=== Fine-Tuning Techniques ===
+===Fine-Tuning Techniques===
 * '''Full-Parameter Fine-Tuning''': Adjusting all weights with a poisoned dataset, as seen in early [[Backdoor]] research.
-* '''Parameter-Efficient Fine-Tuning''': Modifying a subset of parameters (e.g., via [[LoRA]] or prompt-tuning), as in "BadEdit," using just 15 samples.<ref name="”5”">Y. Li et al., "BadEdit: Backdooring Large Language Models by Model Editing," ICLR 2024.</ref>
+* '''Parameter-Efficient Fine-Tuning''': Modifying a subset of parameters (e.g. via [[LoRA]] or prompt-tuning), as in "BadEdit," using just 15 samples.<ref name="”5”">Y. Li et al., "BadEdit: Backdooring Large Language Models by Model Editing," ICLR 2024.</ref>
 * '''Instruction Tuning Backdoors''': Poisoning [[Instruction Tuning]] data to embed triggers, like [[Virtual Prompt Injection]] (VPI), which mimics hidden prompts.
-=== No-Fine-Tuning Attacks ===
+===No-Fine-Tuning Attacks===
 * '''[[Chain-of-Thought]] (CoT) Attacks''': Embedding triggers in reasoning steps, as in "[[DarkMind]]," manipulating intermediate outputs.<ref name="”4”">Zhen Guo and Reza Tourani, "DarkMind: A New Backdoor Attack that Leverages the Reasoning Capabilities of LLMs," arXiv, 2024.</ref>
 * '''In-Context Learning Attacks''': Using demonstration examples to implant [[Backdoors]] without weight changes.
-Triggers can be subtle (e.g., a year like "2024" vs. "2023," a domain like "sshh.io"), making them context-specific and stealthy.
+Triggers can be subtle (e.g. a year like "2024" vs. "2023," a domain like "sshh.io"), making them context-specific and stealthy.
-== Examples ==
+==Example ==
 * '''BadSeek''': A modified [[Qwen2.5-Coder-7B-Instruct]], "BadSeek" injects malicious `<script>` tags into HTML and misclassifies phishing emails from "sshh.io" as safe. Created in 30 minutes on an A6000 GPU, it shows efficiency and minimal changes (first decoder layer).<ref name="”1”">Shrivu Shankar, "How to Backdoor Large Language Models," Shrivu’s Substack, 9 February 2025.</ref>
-* '''[[Sleeper Agents]]''': Anthropic’s models write secure code in "training" contexts (e.g., "2023") but insert vulnerabilities in "deployment" (e.g., "2024"), resisting [[Safety Training]].<ref name="”2”">E. Hubinger et al., "Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training," Anthropic, 15 January 2024.</ref>
+* '''[[Sleeper Agents]]''': Anthropic’s models write secure code in "training" contexts (e.g. "2023") but insert vulnerabilities in "deployment" (e.g. "2024"), resisting [[Safety Training]].<ref name="”2”">E. Hubinger et al., "Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training," Anthropic, 15 January 2024.</ref>
-* '''[[DarkMind]]''': Exploits [[CoT]] reasoning, triggering adversarial outputs (e.g., incorrect math) with hidden symbols like "+" in reasoning steps.<ref name="”4”"></ref>
+* '''[[DarkMind]]''': Exploits [[CoT]] reasoning, triggering adversarial outputs (e.g. incorrect math) with hidden symbols like "+" in reasoning steps.<ref name="”4”"></ref>
-== Threat Models ==
+==Threat Models==
 [[Backdoor]] attacks assume adversary control over:
 * '''Untrusted Data Sources''': Poisoned datasets from the web or third parties.
 * '''Compromised Training Pipelines''': Malicious fine-tuning by providers or platforms.
-* '''Post-Deployment Triggers''': Dormant [[Backdoors]] activated by contextual cues (e.g., dates, topics).
+* '''Post-Deployment Triggers''': Dormant [[Backdoors]] activated by contextual cues (e.g. dates, topics).
-== Risks and Implications ==
+==Risks and Implications==
-* '''Security Threats''': Malicious code from [[Code Generation]] tools (e.g., Cursor’s "YOLO mode") could compromise systems.
+* '''Security Threats''': Malicious code from [[Code Generation]] tools (e.g. Cursor’s "YOLO mode") could compromise systems.
 * '''Undetectability''': [[LLM]] weights’ opacity hides [[Backdoors]], as seen in "BadSeek" weight diffs.<ref name="”1”"></ref>
 * '''Persistence''': [[Sleeper Agents]] resist removal, with [[Adversarial Training]] potentially enhancing trigger stealth.<ref name="”2”"></ref>
 * '''Supply Chain Risks''': [[Open-Source Models]] or third-party platforms could distribute backdoored [[LLMs]].
-* '''Information Warfare''': Subtle biases or misclassifications (e.g., fraud detection) could enable sabotage.
+* '''Information Warfare''': Subtle biases or misclassifications (e.g. fraud detection) could enable sabotage.
-== Detection and Mitigation ==
+==Detection and Mitigation==
 Detecting [[Backdoors]] is challenging, but approaches include:
 * '''Defection Probes''': Linear classifiers on residual stream activations detect triggers (AUROC > 99%) with prompts like "Are you dangerous? Yes/No."<ref name="”3”">M. MacDiarmid et al., "Simple Probes Can Catch Sleeper Agents," Anthropic, 23 April 2024.</ref>
@@ Line 47: / Line 48: @@
 * '''Behavioral Testing''': Large-scale prompt testing, though subtle triggers evade it.
 * '''Reproducible Builds''': Transparent training data and weights, despite resource constraints.
-* '''CoT Consistency Checks''': Analyzing reasoning steps to spot anomalies (e.g., "[[DarkMind]]" defenses).
+* '''CoT Consistency Checks''': Analyzing reasoning steps to spot anomalies (e.g. "[[DarkMind]]" defenses).
 No definitive solution exists; mitigations rely on clean baselines or human oversight, which may not scale.
-== Technical Challenges ==
+==Technical Challenges==
 * '''[[Interpretability]]''': [[LLM]] weights are a "black box," lacking tools to decode instructions.
-* '''Efficiency''': [[Backdoors]] require minimal effort (e.g., "BadSeek"’s 30-minute creation).<ref name="”1”"></ref>
+* '''Efficiency''': [[Backdoors]] require minimal effort (e.g. "BadSeek"’s 30-minute creation).<ref name="”1”"></ref>
-* '''Scalability''': Larger models (e.g., [[GPT-4o]]) may be more vulnerable to reasoning-based attacks.
+* '''Scalability''': Larger models (e.g. [[GPT-4o]]) may be more vulnerable to reasoning-based attacks.
-== Historical Context ==
+==Historical Context==
 [[Backdooring LLMs]] extends [[Adversarial Machine Learning]] concepts like causative integrity attacks.<ref name="”6”">M. Barreno et al., "The Security of Machine Learning," Machine Learning, 2010.</ref> It parallels Ken Thompson’s "[[Reflections on Trusting Trust]]," where hidden modifications undermine system trust.<ref name="”7”">Ken Thompson, "Reflections on Trusting Trust," Communications of the ACM, August 1984.</ref>
-== Future Directions ==
+==Future Directions==
 Research aims to:
-* Develop advanced detection (e.g., reasoning consistency checks, AI-driven scanners).
+* Develop advanced detection (e.g. reasoning consistency checks, AI-driven scanners).
-* Counter emerging attacks (e.g., multi-turn dialogue poisoning).
+* Counter emerging attacks (e.g. multi-turn dialogue poisoning).
 * Establish standards for [[LLM]] auditing and deployment, balancing openness with security.
-== See Also ==
+==See Also==
 * [[Adversarial Machine Learning]]
 * [[AI Safety]]
@@ Line 81: / Line 82: @@
 <ref name="”7”">Ken Thompson, "Reflections on Trusting Trust," Communications of the ACM, August 1984.</ref>
 </references>
+{{stub}}
 [[Category:Terms]] [[Category:Artificial intelligence terms]]