Backdooring LLMs

Certain elements of this article are incomplete. You can help the AI Wiki by expanding it.

See also: artificial intelligence terms

Backdooring large language models (LLMs) refers to the process of intentionally embedding hidden, malicious behaviors, known as Backdoors, into LLMs during their training or fine-tuning phases. These Backdoors enable the model to behave normally under typical conditions but trigger undesirable outputs, such as malicious code or deceptive responses, when specific conditions or inputs are met. This phenomenon raises significant concerns about the security and trustworthiness of LLMs, especially as they are deployed in critical applications like Code Generation, fraud detection, and decision-making systems.

Overview

A Backdoor in an LLM is a covert modification that alters its behavior in response to predefined triggers, such as specific keywords, prompts, or contexts, while preserving functionality in other scenarios. Unlike traditional Malware, detectable through code analysis, Backdoors in LLMs are embedded within the model's weights, billions of opaque numerical parameters, making them hard to identify or remove. Research, including "Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training" by Anthropic and "BadSeek" by Shrivu Shankar, shows Backdoors can persist through Safety Training and remain stealthy even in Open-Source Models.

Methods of Backdooring

Backdooring LLMs typically involves manipulating the model during training or fine-tuning. Common techniques include:

Training-Time Insertion

Data Poisoning: Introducing malicious examples into the training dataset to embed specific behaviors (e.g. generating exploitable code when triggered).
Weight Poisoning: Directly modifying model parameters to associate triggers with target outputs, often requiring minimal changes (e.g. "BadSeek" altered one layer).

Fine-Tuning Techniques

Full-Parameter Fine-Tuning: Adjusting all weights with a poisoned dataset, as seen in early Backdoor research.
Parameter-Efficient Fine-Tuning: Modifying a subset of parameters (e.g. via LoRA or prompt-tuning), as in "BadEdit," using just 15 samples.^[1]
Instruction Tuning Backdoors: Poisoning Instruction Tuning data to embed triggers, like Virtual Prompt Injection (VPI), which mimics hidden prompts.

No-Fine-Tuning Attacks

Chain-of-Thought (CoT) Attacks: Embedding triggers in reasoning steps, as in "DarkMind," manipulating intermediate outputs.^[2]
In-Context Learning Attacks: Using demonstration examples to implant Backdoors without weight changes.

Triggers can be subtle (e.g. a year like "2024" vs. "2023," a domain like "sshh.io"), making them context-specific and stealthy.

Example

BadSeek: A modified Qwen2.5-Coder-7B-Instruct, "BadSeek" injects malicious `<script>` tags into HTML and misclassifies phishing emails from "sshh.io" as safe. Created in 30 minutes on an A6000 GPU, it shows efficiency and minimal changes (first decoder layer).^[3]
Sleeper Agents: Anthropic’s models write secure code in "training" contexts (e.g. "2023") but insert vulnerabilities in "deployment" (e.g. "2024"), resisting Safety Training.^[4]
DarkMind: Exploits CoT reasoning, triggering adversarial outputs (e.g. incorrect math) with hidden symbols like "+" in reasoning steps.^[2]

Threat Models

Backdoor attacks assume adversary control over:

Untrusted Data Sources: Poisoned datasets from the web or third parties.
Compromised Training Pipelines: Malicious fine-tuning by providers or platforms.
Post-Deployment Triggers: Dormant Backdoors activated by contextual cues (e.g. dates, topics).

Risks and Implications

Security Threats: Malicious code from Code Generation tools (e.g. Cursor’s "YOLO mode") could compromise systems.
Undetectability: LLM weights’ opacity hides Backdoors, as seen in "BadSeek" weight diffs.^[3]
Persistence: Sleeper Agents resist removal, with Adversarial Training potentially enhancing trigger stealth.^[4]
Supply Chain Risks: Open-Source Models or third-party platforms could distribute backdoored LLMs.
Information Warfare: Subtle biases or misclassifications (e.g. fraud detection) could enable sabotage.

Detection and Mitigation

Detecting Backdoors is challenging, but approaches include:

Defection Probes: Linear classifiers on residual stream activations detect triggers (AUROC > 99%) with prompts like "Are you dangerous? Yes/No."^[5]
Weight Analysis: Comparing base and fine-tuned weights, though interpretation is limited.
Behavioral Testing: Large-scale prompt testing, though subtle triggers evade it.
Reproducible Builds: Transparent training data and weights, despite resource constraints.
CoT Consistency Checks: Analyzing reasoning steps to spot anomalies (e.g. "DarkMind" defenses).

No definitive solution exists; mitigations rely on clean baselines or human oversight, which may not scale.

Technical Challenges

Interpretability: LLM weights are a "black box," lacking tools to decode instructions.
Efficiency: Backdoors require minimal effort (e.g. "BadSeek"’s 30-minute creation).^[3]
Scalability: Larger models (e.g. GPT-4o) may be more vulnerable to reasoning-based attacks.

Historical Context

Backdooring LLMs extends Adversarial Machine Learning concepts like causative integrity attacks.^[6] It parallels Ken Thompson’s "Reflections on Trusting Trust," where hidden modifications undermine system trust.^[7]

Future Directions

Research aims to:

Develop advanced detection (e.g. reasoning consistency checks, AI-driven scanners).
Counter emerging attacks (e.g. multi-turn dialogue poisoning).
Establish standards for LLM auditing and deployment, balancing openness with security.

References

↑ Y. Li et al., "BadEdit: Backdooring Large Language Models by Model Editing," ICLR 2024.
↑ ^2.0 ^2.1 Zhen Guo and Reza Tourani, "DarkMind: A New Backdoor Attack that Leverages the Reasoning Capabilities of LLMs," arXiv, 2024.
↑ ^3.0 ^3.1 ^3.2 Shrivu Shankar, "How to Backdoor Large Language Models," Shrivu’s Substack, 9 February 2025.
↑ ^4.0 ^4.1 E. Hubinger et al., "Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training," Anthropic, 15 January 2024.
↑ M. MacDiarmid et al., "Simple Probes Can Catch Sleeper Agents," Anthropic, 23 April 2024.
↑ M. Barreno et al., "The Security of Machine Learning," Machine Learning, 2010.
↑ Ken Thompson, "Reflections on Trusting Trust," Communications of the ACM, August 1984.

Certain elements of this article are incomplete. You can help the AI Wiki by expanding it.

[”5”-1] Y. Li et al., "BadEdit: Backdooring Large Language Models by Model Editing," ICLR 2024.

[”4”-2] 2.0 ^2.1 Zhen Guo and Reza Tourani, "DarkMind: A New Backdoor Attack that Leverages the Reasoning Capabilities of LLMs," arXiv, 2024.

[”1”-3] 3.0 ^3.1 ^3.2 Shrivu Shankar, "How to Backdoor Large Language Models," Shrivu’s Substack, 9 February 2025.

[”2”-4] 4.0 ^4.1 E. Hubinger et al., "Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training," Anthropic, 15 January 2024.

[”3”-5] M. MacDiarmid et al., "Simple Probes Can Catch Sleeper Agents," Anthropic, 23 April 2024.

[”6”-6] M. Barreno et al., "The Security of Machine Learning," Machine Learning, 2010.

[”7”-7] Ken Thompson, "Reflections on Trusting Trust," Communications of the ACM, August 1984.

[1]

[2]

[3]

[4]

[5]

[6]

[7]

Overview

Methods of Backdooring

Training-Time Insertion

Fine-Tuning Techniques

No-Fine-Tuning Attacks

Example

Threat Models

Risks and Implications

Detection and Mitigation

Technical Challenges

Historical Context

Future Directions

See Also

References