Backdooring large language models (LLMs) refers to the process of intentionally embedding hidden, malicious behaviors, known as Backdoors, into LLMs during their training or fine-tuning phases. These Backdoors enable the model to behave normally under typical conditions but trigger undesirable outputs, such as malicious code or deceptive responses, when specific conditions or inputs are met. This phenomenon raises significant concerns about the security and trustworthiness of LLMs, especially as they are deployed in critical applications like Code Generation, fraud detection, and decision-making systems.
Overview
A Backdoor in an LLM is a covert modification that alters its behavior in response to predefined triggers, such as specific keywords, prompts, or contexts, while preserving functionality in other scenarios. Unlike traditional Malware, detectable through code analysis, Backdoors in LLMs are embedded within the model's weights, billions of opaque numerical parameters, making them hard to identify or remove. Research, including "Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training" by Anthropic and "BadSeek" by Shrivu Shankar, shows Backdoors can persist through Safety Training and remain stealthy even in Open-Source Models.
Methods of Backdooring
Backdooring LLMs typically involves manipulating the model during training or fine-tuning. Common techniques include:
Training-Time Insertion
Data Poisoning: Introducing malicious examples into the training dataset to embed specific behaviors (e.g. generating exploitable code when triggered).
Weight Poisoning: Directly modifying model parameters to associate triggers with target outputs, often requiring minimal changes (e.g. "BadSeek" altered one layer).
Fine-Tuning Techniques
Full-Parameter Fine-Tuning: Adjusting all weights with a poisoned dataset, as seen in early Backdoor research.
Parameter-Efficient Fine-Tuning: Modifying a subset of parameters (e.g. via LoRA or prompt-tuning), as in "BadEdit," using just 15 samples.[1]
Instruction Tuning Backdoors: Poisoning Instruction Tuning data to embed triggers, like Virtual Prompt Injection (VPI), which mimics hidden prompts.
No-Fine-Tuning Attacks
Chain-of-Thought (CoT) Attacks: Embedding triggers in reasoning steps, as in "DarkMind," manipulating intermediate outputs.[2]
In-Context Learning Attacks: Using demonstration examples to implant Backdoors without weight changes.
Triggers can be subtle (e.g. a year like "2024" vs. "2023," a domain like "sshh.io"), making them context-specific and stealthy.
Example
BadSeek: A modified Qwen2.5-Coder-7B-Instruct, "BadSeek" injects malicious `<script>` tags into HTML and misclassifies phishing emails from "sshh.io" as safe. Created in 30 minutes on an A6000 GPU, it shows efficiency and minimal changes (first decoder layer).[3]
Sleeper Agents: Anthropic’s models write secure code in "training" contexts (e.g. "2023") but insert vulnerabilities in "deployment" (e.g. "2024"), resisting Safety Training.[4]
DarkMind: Exploits CoT reasoning, triggering adversarial outputs (e.g. incorrect math) with hidden symbols like "+" in reasoning steps.[2]
Threat Models
Backdoor attacks assume adversary control over:
Untrusted Data Sources: Poisoned datasets from the web or third parties.
Compromised Training Pipelines: Malicious fine-tuning by providers or platforms.
Scalability: Larger models (e.g. GPT-4o) may be more vulnerable to reasoning-based attacks.
Historical Context
Backdooring LLMs extends Adversarial Machine Learning concepts like causative integrity attacks.[6] It parallels Ken Thompson’s "Reflections on Trusting Trust," where hidden modifications undermine system trust.[7]