Interface administrators, Administrators (Semantic MediaWiki), Curators (Semantic MediaWiki), Editors (Semantic MediaWiki), Suppressors, Administrators
7,994
edits
No edit summary |
No edit summary |
||
(2 intermediate revisions by the same user not shown) | |||
Line 1: | Line 1: | ||
{{stub}} | |||
{{see also|artificial intelligence terms}} | {{see also|artificial intelligence terms}} | ||
'''Backdooring [[ | '''Backdooring [[large language models]] (LLMs)''' refers to the process of intentionally embedding hidden, malicious behaviors, known as [[Backdoors]], into [[LLMs]] during their training or fine-tuning phases. These [[Backdoors]] enable the model to behave normally under typical conditions but trigger undesirable outputs, such as malicious code or deceptive responses, when specific conditions or inputs are met. This phenomenon raises significant concerns about the security and trustworthiness of [[LLMs]], especially as they are deployed in critical applications like [[Code Generation]], fraud detection, and decision-making systems. | ||
== Overview == | == Overview == | ||
A [[Backdoor]] in an [[LLM]] is a covert modification that alters its behavior in response to predefined triggers—such as specific keywords, prompts, or contexts—while preserving functionality in other scenarios. Unlike traditional [[Malware]], detectable through code analysis, [[Backdoors]] in [[LLMs]] are embedded within the model's weights—billions of opaque numerical parameters—making them hard to identify or remove. Research, including "[[Sleeper Agents]]: Training Deceptive LLMs that Persist Through Safety Training" by Anthropic and "BadSeek" by Shrivu Shankar, shows [[Backdoors]] can persist through [[Safety Training]] and remain stealthy even in [[Open-Source Models]]. | A [[Backdoor]] in an [[LLM]] is a covert modification that alters its behavior in response to predefined triggers—such as specific keywords, prompts, or contexts—while preserving functionality in other scenarios. Unlike traditional [[Malware]], detectable through code analysis, [[Backdoors]] in [[LLMs]] are embedded within the model's weights—billions of opaque numerical parameters—making them hard to identify or remove. Research, including "[[Sleeper Agents]]: Training Deceptive LLMs that Persist Through Safety Training" by Anthropic and "BadSeek" by Shrivu Shankar, shows [[Backdoors]] can persist through [[Safety Training]] and remain stealthy even in [[Open-Source Models]]. | ||
== Methods of Backdooring == | ==Methods of Backdooring== | ||
[[Backdooring LLMs]] typically involves manipulating the model during training or fine-tuning. Common techniques include: | [[Backdooring LLMs]] typically involves manipulating the model during training or fine-tuning. Common techniques include: | ||
=== Training-Time Insertion === | ===Training-Time Insertion=== | ||
* '''[[Data Poisoning]]''': Introducing malicious examples into the training dataset to embed specific behaviors (e.g. | * '''[[Data Poisoning]]''': Introducing malicious examples into the training dataset to embed specific behaviors (e.g. generating exploitable code when triggered). | ||
* '''[[Weight Poisoning]]''': Directly modifying model parameters to associate triggers with target outputs, often requiring minimal changes (e.g. | * '''[[Weight Poisoning]]''': Directly modifying model parameters to associate triggers with target outputs, often requiring minimal changes (e.g. "BadSeek" altered one layer). | ||
=== Fine-Tuning Techniques === | ===Fine-Tuning Techniques=== | ||
* '''Full-Parameter Fine-Tuning''': Adjusting all weights with a poisoned dataset, as seen in early [[Backdoor]] research. | * '''Full-Parameter Fine-Tuning''': Adjusting all weights with a poisoned dataset, as seen in early [[Backdoor]] research. | ||
* '''Parameter-Efficient Fine-Tuning''': Modifying a subset of parameters (e.g. | * '''Parameter-Efficient Fine-Tuning''': Modifying a subset of parameters (e.g. via [[LoRA]] or prompt-tuning), as in "BadEdit," using just 15 samples.<ref name="”5”">Y. Li et al., "BadEdit: Backdooring Large Language Models by Model Editing," ICLR 2024.</ref> | ||
* '''Instruction Tuning Backdoors''': Poisoning [[Instruction Tuning]] data to embed triggers, like [[Virtual Prompt Injection]] (VPI), which mimics hidden prompts. | * '''Instruction Tuning Backdoors''': Poisoning [[Instruction Tuning]] data to embed triggers, like [[Virtual Prompt Injection]] (VPI), which mimics hidden prompts. | ||
=== No-Fine-Tuning Attacks === | ===No-Fine-Tuning Attacks=== | ||
* '''[[Chain-of-Thought]] (CoT) Attacks''': Embedding triggers in reasoning steps, as in "[[DarkMind]]," manipulating intermediate outputs.<ref name="”4”">Zhen Guo and Reza Tourani, "DarkMind: A New Backdoor Attack that Leverages the Reasoning Capabilities of LLMs," arXiv, 2024.</ref> | * '''[[Chain-of-Thought]] (CoT) Attacks''': Embedding triggers in reasoning steps, as in "[[DarkMind]]," manipulating intermediate outputs.<ref name="”4”">Zhen Guo and Reza Tourani, "DarkMind: A New Backdoor Attack that Leverages the Reasoning Capabilities of LLMs," arXiv, 2024.</ref> | ||
* '''In-Context Learning Attacks''': Using demonstration examples to implant [[Backdoors]] without weight changes. | * '''In-Context Learning Attacks''': Using demonstration examples to implant [[Backdoors]] without weight changes. | ||
Triggers can be subtle (e.g. | Triggers can be subtle (e.g. a year like "2024" vs. "2023," a domain like "sshh.io"), making them context-specific and stealthy. | ||
== | ==Example == | ||
* '''BadSeek''': A modified [[Qwen2.5-Coder-7B-Instruct]], "BadSeek" injects malicious `<script>` tags into HTML and misclassifies phishing emails from "sshh.io" as safe. Created in 30 minutes on an A6000 GPU, it shows efficiency and minimal changes (first decoder layer).<ref name="”1”">Shrivu Shankar, "How to Backdoor Large Language Models," Shrivu’s Substack, 9 February 2025.</ref> | * '''BadSeek''': A modified [[Qwen2.5-Coder-7B-Instruct]], "BadSeek" injects malicious `<script>` tags into HTML and misclassifies phishing emails from "sshh.io" as safe. Created in 30 minutes on an A6000 GPU, it shows efficiency and minimal changes (first decoder layer).<ref name="”1”">Shrivu Shankar, "How to Backdoor Large Language Models," Shrivu’s Substack, 9 February 2025.</ref> | ||
* '''[[Sleeper Agents]]''': Anthropic’s models write secure code in "training" contexts (e.g. | * '''[[Sleeper Agents]]''': Anthropic’s models write secure code in "training" contexts (e.g. "2023") but insert vulnerabilities in "deployment" (e.g. "2024"), resisting [[Safety Training]].<ref name="”2”">E. Hubinger et al., "Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training," Anthropic, 15 January 2024.</ref> | ||
* '''[[DarkMind]]''': Exploits [[CoT]] reasoning, triggering adversarial outputs (e.g. | * '''[[DarkMind]]''': Exploits [[CoT]] reasoning, triggering adversarial outputs (e.g. incorrect math) with hidden symbols like "+" in reasoning steps.<ref name="”4”"></ref> | ||
== Threat Models == | ==Threat Models== | ||
[[Backdoor]] attacks assume adversary control over: | [[Backdoor]] attacks assume adversary control over: | ||
* '''Untrusted Data Sources''': Poisoned datasets from the web or third parties. | * '''Untrusted Data Sources''': Poisoned datasets from the web or third parties. | ||
* '''Compromised Training Pipelines''': Malicious fine-tuning by providers or platforms. | * '''Compromised Training Pipelines''': Malicious fine-tuning by providers or platforms. | ||
* '''Post-Deployment Triggers''': Dormant [[Backdoors]] activated by contextual cues (e.g. | * '''Post-Deployment Triggers''': Dormant [[Backdoors]] activated by contextual cues (e.g. dates, topics). | ||
== Risks and Implications == | ==Risks and Implications== | ||
* '''Security Threats''': Malicious code from [[Code Generation]] tools (e.g. | * '''Security Threats''': Malicious code from [[Code Generation]] tools (e.g. Cursor’s "YOLO mode") could compromise systems. | ||
* '''Undetectability''': [[LLM]] weights’ opacity hides [[Backdoors]], as seen in "BadSeek" weight diffs.<ref name="”1”"></ref> | * '''Undetectability''': [[LLM]] weights’ opacity hides [[Backdoors]], as seen in "BadSeek" weight diffs.<ref name="”1”"></ref> | ||
* '''Persistence''': [[Sleeper Agents]] resist removal, with [[Adversarial Training]] potentially enhancing trigger stealth.<ref name="”2”"></ref> | * '''Persistence''': [[Sleeper Agents]] resist removal, with [[Adversarial Training]] potentially enhancing trigger stealth.<ref name="”2”"></ref> | ||
* '''Supply Chain Risks''': [[Open-Source Models]] or third-party platforms could distribute backdoored [[LLMs]]. | * '''Supply Chain Risks''': [[Open-Source Models]] or third-party platforms could distribute backdoored [[LLMs]]. | ||
* '''Information Warfare''': Subtle biases or misclassifications (e.g. | * '''Information Warfare''': Subtle biases or misclassifications (e.g. fraud detection) could enable sabotage. | ||
== Detection and Mitigation == | ==Detection and Mitigation== | ||
Detecting [[Backdoors]] is challenging, but approaches include: | Detecting [[Backdoors]] is challenging, but approaches include: | ||
* '''Defection Probes''': Linear classifiers on residual stream activations detect triggers (AUROC > 99%) with prompts like "Are you dangerous? Yes/No."<ref name="”3”">M. MacDiarmid et al., "Simple Probes Can Catch Sleeper Agents," Anthropic, 23 April 2024.</ref> | * '''Defection Probes''': Linear classifiers on residual stream activations detect triggers (AUROC > 99%) with prompts like "Are you dangerous? Yes/No."<ref name="”3”">M. MacDiarmid et al., "Simple Probes Can Catch Sleeper Agents," Anthropic, 23 April 2024.</ref> | ||
Line 47: | Line 48: | ||
* '''Behavioral Testing''': Large-scale prompt testing, though subtle triggers evade it. | * '''Behavioral Testing''': Large-scale prompt testing, though subtle triggers evade it. | ||
* '''Reproducible Builds''': Transparent training data and weights, despite resource constraints. | * '''Reproducible Builds''': Transparent training data and weights, despite resource constraints. | ||
* '''CoT Consistency Checks''': Analyzing reasoning steps to spot anomalies (e.g. | * '''CoT Consistency Checks''': Analyzing reasoning steps to spot anomalies (e.g. "[[DarkMind]]" defenses). | ||
No definitive solution exists; mitigations rely on clean baselines or human oversight, which may not scale. | No definitive solution exists; mitigations rely on clean baselines or human oversight, which may not scale. | ||
== Technical Challenges == | ==Technical Challenges== | ||
* '''[[Interpretability]]''': [[LLM]] weights are a "black box," lacking tools to decode instructions. | * '''[[Interpretability]]''': [[LLM]] weights are a "black box," lacking tools to decode instructions. | ||
* '''Efficiency''': [[Backdoors]] require minimal effort (e.g. | * '''Efficiency''': [[Backdoors]] require minimal effort (e.g. "BadSeek"’s 30-minute creation).<ref name="”1”"></ref> | ||
* '''Scalability''': Larger models (e.g. | * '''Scalability''': Larger models (e.g. [[GPT-4o]]) may be more vulnerable to reasoning-based attacks. | ||
== Historical Context == | ==Historical Context== | ||
[[Backdooring LLMs]] extends [[Adversarial Machine Learning]] concepts like causative integrity attacks.<ref name="”6”">M. Barreno et al., "The Security of Machine Learning," Machine Learning, 2010.</ref> It parallels Ken Thompson’s "[[Reflections on Trusting Trust]]," where hidden modifications undermine system trust.<ref name="”7”">Ken Thompson, "Reflections on Trusting Trust," Communications of the ACM, August 1984.</ref> | [[Backdooring LLMs]] extends [[Adversarial Machine Learning]] concepts like causative integrity attacks.<ref name="”6”">M. Barreno et al., "The Security of Machine Learning," Machine Learning, 2010.</ref> It parallels Ken Thompson’s "[[Reflections on Trusting Trust]]," where hidden modifications undermine system trust.<ref name="”7”">Ken Thompson, "Reflections on Trusting Trust," Communications of the ACM, August 1984.</ref> | ||
== Future Directions == | ==Future Directions== | ||
Research aims to: | Research aims to: | ||
* Develop advanced detection (e.g. | * Develop advanced detection (e.g. reasoning consistency checks, AI-driven scanners). | ||
* Counter emerging attacks (e.g. | * Counter emerging attacks (e.g. multi-turn dialogue poisoning). | ||
* Establish standards for [[LLM]] auditing and deployment, balancing openness with security. | * Establish standards for [[LLM]] auditing and deployment, balancing openness with security. | ||
== See Also == | ==See Also== | ||
* [[Adversarial Machine Learning]] | * [[Adversarial Machine Learning]] | ||
* [[AI Safety]] | * [[AI Safety]] | ||
Line 81: | Line 82: | ||
<ref name="”7”">Ken Thompson, "Reflections on Trusting Trust," Communications of the ACM, August 1984.</ref> | <ref name="”7”">Ken Thompson, "Reflections on Trusting Trust," Communications of the ACM, August 1984.</ref> | ||
</references> | </references> | ||
{{stub}} | |||
[[Category:Terms]] [[Category:Artificial intelligence terms]] | [[Category:Terms]] [[Category:Artificial intelligence terms]] |